
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Data loading into HDFS - Part 2. Data movement from the Oracle Database to HDFS

Alexey Filanovskiy
Product Manager

Some time ago I started to explain how to move data into the Hadoop Distributed File System (HDFS) from different sources. In my first blog post on this topic I covered batch data loading from generic Linux (or even Unix) servers. Today I'm going to explain some best practices for batch data movement (offloading) from the Oracle Database to HDFS.
Generally speaking, there are two major ways: Sqoop and Copy2Hadoop. You may also think of Oracle Table Access for Hadoop (OTA4H), but that is not the right use case for this product: OTA4H was designed for querying relatively small Oracle tables from Hadoop (Hive or Spark), not for offloading large amounts of data.

Sqoop.

Sqoop is an open source tool designed to export and import data from databases; any JDBC-compatible database can be used with it. Under the hood it is a MapReduce job that creates a few sessions in the database (equal to the number of mappers), and each session runs SQL that queries its part of the table. If I run the MapReduce job with a single mapper, I get one SQL statement against the database and Sqoop doesn't need to distribute the work across the table. But if I want to accelerate the process, I can use multiple mappers (sessions in the database). That's fine, but now I need to decide how to distribute parts of the table between the mappers. To resolve this, Sqoop introduces the "split-by" column. When you run a Sqoop job you specify the number of mappers with the -m key (by default there are 4 mappers) as well as the split-by column. After submitting the job, Sqoop generates a boundary query, like:

 

 select min(split_by_col), max(split_by_col) from table

Then, based on its output, it generates the SQL statements. For example, if I specify 4 mappers and the boundary query returns 1 as the minimum and 10 000 as the maximum, the following queries will be generated:

 


select * from table where split_by_col>=1 and split_by_col<2500
select * from table where split_by_col>=2500 and split_by_col<5000
select * from table where split_by_col>=5000 and split_by_col<7500
select * from table where split_by_col>=7500 and split_by_col<=10000
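In other words, each mapper gets a range whose width is roughly (max - min) / number of mappers, here (10 000 - 1) / 4 ≈ 2 500 values of split_by_col.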

This is a good approach when you have a monotonically increasing column, such as a primary key, but it is a disaster when you don't. The good news is that you can specify a function, like mod (--split-by 'mod(subs_key,30)'), to smooth out the uneven distribution; see the sketch right after this paragraph. OK, that was the theory, now let's start testing and tuning performance.
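For illustration only, here is a hedged sketch of what such an invocation could look like; the connection string mirrors the ones used in the tests below, and subs_key stands for whatever numeric column your table actually has:

sqoop import \
  --connect jdbc:oracle:thin:@'(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' \
  --username bds --password bds \
  --table orcl_dpi \
  --split-by 'mod(subs_key,30)' \
  -m 30 \
  --as-textfile \
  --delete-target-dir --target-dir /tmp/sqoop_test

Here -m 30 matches the 30 buckets produced by mod(subs_key,30), so every mapper gets exactly one bucket.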

Test environment.

To perform this exercise I got a quarter rack of Exadata X5-2 (Oracle's engineered database machine) and a starter rack of Big Data Appliance (the Hadoop-optimized appliance). They were connected by InfiniBand as well as Ethernet.

Test environment

Into the database I loaded CDR-like data (it doesn't really matter, just consider it some random data) and compressed it with a few types of compression:

 

Table name         Type of compression               Table size, GB
ORCL_DPI           Not compressed                    1710.96
ORCL_DPI_COMPRESS  Oracle Basic compression          535.15
ORCL_DPI_QH        Exadata Query High compression    206.15
ORCL_DPI_AH        Exadata Archive High compression  189.21

The basic part of the DDL is the same for every table (only the compression options vary), like:

 

create table orcl_dpi(
  msisdn number,
  sgsn varchar2(4000),
  src_ip varchar2(4000),
  src_port number,
  dst_ip varchar2(4000),
  dst_port number,
  trans_ip varchar2(4000),
  trans_port number,
  hostname varchar2(4000),
  uri varchar2(4000),
  start_time varchar2(4000),
  stop_time varchar2(4000),
  bytes number);

In my first test I moved the uncompressed table out of the database, using the following Sqoop command:

 

sqoop import -D mapred.job.name='default general sqoop' --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test --columns "msisdn,sgsn,src_ip,src_port ,dst_ip ,dst_port ,trans_ip,trans_port,hostname,uri ,start_time,stop_time,bytes" --split-by msisdn

After 4 hours and 15 minutes I was completely frustrated by the performance. Then I remembered that I had heard about a special version of Sqoop for the Oracle Database named OraOop. It uses some features of the Oracle Database (this Sqoop edition can be used only against Oracle DB), for example rowid: instead of generating SQL with split-by based distribution, it divides the table into equal portions and queries each of them by rowid range, roughly like this:
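This is only an illustrative sketch (the object, file and block numbers are made up, and the connector's actual generated SQL may differ); the idea is a predicate over rowid ranges built with dbms_rowid.rowid_create:

select *
from orcl_dpi
where rowid >= dbms_rowid.rowid_create(1, 123456, 7, 0, 0)         -- start of this mapper's block range
  and rowid <= dbms_rowid.rowid_create(1, 123456, 7, 8192, 32767); -- end of this mapper's block range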

 

It seemed like this had to be much better. This version ships with the BDA, and to start using it you just have to specify the --direct key. Trust, but verify. To check it I ran the following command:

 

 

 

sqoop import -D mapred.job.name='default oraoop' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test

 

 

 

After 3 hours and 48 minutes I was again frustrated by the job performance. We have plenty of resources on the DB side as well as on the BDA side, but we don't use them. I went to the Hadoop MapReduce job details and understood why: the job uses only 4 mappers (Sqoop's default).

In the next test I increased the number of mappers to 50. I didn't have any particular reason for this specific number; I just stuck my finger in the air:

 

sqoop import -D mapred.job.name='increase number of mappers up to 50' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 50

 

This time all the data moved in 43 minutes. Much better, but can I do more? I thought it would be good to determine the number of mappers more accurately. For that I needed to know how many CPU cores MapReduce is allowed to use on my cluster. I'm not a Linux guru, so I wrote this slightly awkward script:

 

dcli -C 'cat `find /var/ -name yarn-site.xml|grep NODEMAN | sort -k1 -n|head -1`|grep -A 1 "yarn.nodemanager.resource.cpu-vcores"|grep -v name'|awk -F'</?value>' 'NF>1{print $2}' | awk '{a=a+$1} END {print a}'

 

It returned 391, so MapReduce on my cluster can use 391 cores. Let me specify this number and run the next test:

 

sqoop import -D mapred.job.name='increase number of mappers up to 391' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391

 

Now we are down to 19.7 minutes. Not bad at all compared with the starting results. Let's have a look at the hardware utilization for these two tests (with 50 and 391 mappers).

From the first graph we can see that in the second case IO utilization rose to about 90%:

 

Big Data Appliance (Hadoop) Disk utilization, %

The "Ethernet receive" graph shows the incoming traffic (which comes from the Exadata database) on the BDA nodes. In these examples I use Ethernet as the external network.

 

Big Data Appliance (Hadoop) Ethernet Receive, MByte/sec

The BDA InfiniBand network is used as the interconnect. When I write data into Hadoop, the first replica arrives over the external network and is then replicated over the interconnect network (InfiniBand).

 

Big Data Appliance (Hadoop) Infiniband Transmit, MByte/sec

This graph shows the Ethernet transmit traffic for the database. In the second case it does not exceed 350 MByte/sec (2800 Mbit/sec).

 

 

 

Exadata (Database), Ethernet networks transmit, MBytes/sec

The next graph helps us understand CPU utilization on the BDA and Exadata side. It looks fine: on the BDA, cumulative CPU consumption is below 50%, and on the database it is even less than 10%.

CPU consumption on Exadata(Database) and BDA (Hadoop) side, %

Let's have a look at what's going on on the Exadata storage cell side. All cells are utilized evenly, which is good, and the storage cells don't use much CPU. From each cell we read 600 MByte/sec and 9000 IOPS.

Are 600 MByte/sec and 9000 IOPS a lot or not? Let's check flash utilization on one of the cells - it's only 15%:

 

But let's try to optimize this further. We are close to being IO bound, and at the same time we have more than half of the CPU resources available. We could spend this CPU on compressing the data in Hadoop. So I ran a Sqoop command that uses gzip compression on the Hadoop side:

 

sqoop import -D mapred.job.name='add compression (Gzip default)' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z

It finished in 8.8 minutes. Then I tried Snappy compression:

 

 

sqoop import -D mapred.job.name='change compression codec to snappy' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec

It finished in 8.6 minutes. A nice leap! And of course everybody wants to know how this is possible. The graphs will help us answer.

Compression obviously has a significant impact on the IO profile. The disks are utilized at about 90% for uncompressed data, at 25% with Gzip and at 40% with Snappy compression.

 

 

Big Data Appliance (Hadoop) Disk utilization, %

The external network is utilized more heavily (but for a shorter period) in the Gzip and Snappy cases, because we pull the data much faster; there is almost no difference between Gzip and Snappy, since the throughput is almost the same. External network utilization for the Gzip and Snappy cases is about 250 MByte/sec => 2 Gbit/sec.

 

Big Data Appliance (Hadoop) external network utilization, MB/sec

The interconnect network, conversely, is utilized less when writing compressed data, because we write a smaller amount of data (thanks to the compression).

 

 

Big Data Appliance (Hadoop) internal network utilization, MB/sec

We pay for the shorter elapsed time with CPU. CPU consumption is much higher in the Snappy case (75% versus 50%) and even higher in the Gzip case (95%).

 

Big Data Appliance (Hadoop), CPU consumption, %

And since we read the data faster, we also use more database resources.

Exadata (Database) Storage Cell utilization

But even in this case the flash of our storage cells (the whole table lives on flash) is not utilized at more than 40%.

Exadata (Database) Storage Cell Flash utilization, %

OK, Snappy seems like a reasonable trade-off between compression rate, elapsed time and performance. Now, what if the data is compressed inside the database? Taking into account that you will most probably be moving historical data, this case is quite realistic.

I ran three tests that read compressed data from the Oracle Database (running on Exadata) with three possible compression options:

- Oracle Basic compression

- Compression for Query High (an Exadata-specific compression type)

- Compression for Archive High (an Exadata-specific compression type)

Table name         Size, GB
ORCL_DPI           1710.96
ORCL_DPI_AH        189.21
ORCL_DPI_COMPRESS  535.14
ORCL_DPI_QH        206.15

To perform this I used the following three statements.

Oracle Basic compression:

 

sqoop import -D mapred.job.name='Run test over TCP (basic compress ->snappy)' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi_compress --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec

Exadata Archive High Compression:

 

sqoop import -D mapred.job.name='Run test over TCP (compress AH ->snappy)' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi_ah --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec

 Exadata Query High Compression:

 

sqoop import -D mapred.job.name='Run test over TCP (compress QH ->snappy)' --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode02)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi_qh --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec

When they finished, the elapsed times were as follows: 8.1 minutes for the Query High compressed table, 11.1 minutes for the Archive High compressed table and 8.6 minutes for the table with Basic compression. Now let's have a look at the graphs (first test - Oracle Basic compression, second - Archive High compression, third - Query High compression):

 

Big Data Appliance (Hadoop) disk utilization, %

 

 

Big Data Appliance (Hadoop) External Networks Utilization, MBytes/sec

 

 

Exadata (database), External Network Utilization, Mbytes/sec

 

 

Big Data Appliance (Hadoop) Internal network utilization, Mbytes/sec

 

Exadata (database) Storage Cell Utilization

Exadata (database) Storage Cell  Flash Utilization,%

Now I understand that I am CPU bound on the Hadoop side (in the Basic compression and Query High compression cases). But then I remembered the SDP feature for getting data out of the database. Following this link, I configured an SDP listener on the Exadata and ran the same test over the InfiniBand network.
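On the database side this boils down to adding an SDP endpoint on the InfiniBand address to the listener configuration. The snippet below is a hedged sketch only (the listener name, IB hostname and port are placeholders); follow the linked documentation for the exact steps:

LISTENER_IB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = SDP)(HOST = exadatanode01-ib)(PORT = 1522)))

On the Hadoop (client) side the job then looks like this: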

export HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true";

sqoop import -D mapred.job.name='Run test over SDP (compress->snappy)' -D mapred.child.java.opts="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true" --direct --connect jdbc:oracle:thin:@'(DESCRIPTION=(LOAD_BALANCE=YES)(ADDRESS=(PROTOCOL=SDP)(HOST=exadatanode01)(PORT=1522))(ADDRESS=(PROTOCOL=SDP)(HOST=exadatanode02)(PORT=1522))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' --username bds --password bds --table orcl_dpi_qh --as-textfile --delete-target-dir --target-dir /tmp/sqoop_test -m 391 -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec;

Copying over InfiniBand is a bit faster than copying over Ethernet - 7.6 minutes.

The graphs below compare Big Data Appliance and Exadata utilization for the SDP and non-SDP cases (the first test case is SDP, the second is non-SDP):

 

Big Data Appliance (Hadoop), Disk utilization, %

 

 

Big Data Appliance (Hadoop) and Exadata CPU utilization, %

In the first case it may seem that the external network is not utilized at all. That is not true: SDP traffic is simply not captured by Cloudera Manager.

 

 

Big Data Appliance (Hadoop), external network utilization, Mbytes/sec

The same story with the Exadata external network.

 

Exadata (Database), external network utilization, Mbytes/sec

 

 

 

Big Data Appliance (Hadoop), internal network utilization, Mbytes/sec

This graph shows what we gained: we reduced the number of interrupts on the database and on the Hadoop side, but it buys us only about one percent of CPU time.

 

 

Big Data Appliance and Exadata Soft IRQ rate, %

To wrap up all these results I put them into a table.

 

It's hard to measure Sqoop efficiency in GB/sec, because it would not be fair to compare GB of compressed data with GB of uncompressed data (plus there are a few types of input and a few types of output compression). To make the comparison possible, I introduced a new column (Database uncompressed analog), which is the uncompressed data volume in the database (1710.96 GB) divided by the elapsed time. Millions of records per second would also be a pretty fair metric.
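For example, for the 391-mapper uncompressed run the uncompressed-analog throughput is 1710.96 GB / (19.7 min × 60 s) ≈ 1.45 GB/sec, and for the fastest SDP run it is 1710.96 GB / (7.6 min × 60 s) ≈ 3.75 GB/sec.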

Sqoop’s restrictions.

Now, let's talk about Sqoop's restrictions. For sure there are more than four, but I want to highlight the ones that are, in my opinion, the most important. Generally speaking, Sqoop is good for offloading plain tables and partitions (without joins and other transformations).

1) Generic Sqoop has a "--query" option where you can specify an ad-hoc SQL statement, but in that case it is not the OraOop version, which automatically means lower performance (it is not rowid access) and higher Oracle Database utilization; see the sketch after this list. I highly recommend using the OraOop version of Sqoop for offloading data from the Oracle Database.

Taking into account that customers very often denormalize data before offloading it, this restriction becomes a significant obstacle.

2) Not all Oracle datatypes are supported by Sqoop (you can find the list here, in the appropriate section of the documentation).

3) You cannot use views with OraOop Sqoop.

4) You cannot use PL/SQL with Sqoop.
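To illustrate restriction 1, here is a hedged sketch of a generic (non-OraOop) --query import; the regions table and its columns are hypothetical, and the mandatory $CONDITIONS token is where Sqoop injects each mapper's split predicate:

sqoop import \
  --connect jdbc:oracle:thin:@'(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=exadatanode01)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orcl.my.company)))' \
  --username bds --password bds \
  --query 'select a.msisdn, a.bytes, b.region from orcl_dpi a join regions b on a.sgsn = b.sgsn where $CONDITIONS' \
  --split-by a.msisdn \
  --target-dir /tmp/sqoop_query_test \
  -m 50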

Copy2Hadoop does not have these restrictions.

Copy2Hadoop

Use cases for Copy2Hadoop:

1) You have pretty complex queries for your offloading (window functions, PL/SQL, joins, sub-queries, etc.), not just a whole table or partition. For the Hadoop world this can be a good fit, because Big Data query tools very often work better over de-normalized data.

2) Copy2Hadoop creates Data Pump files that can hold any Oracle datatype supported by Oracle external tables, and it stores the data in Oracle's own datatypes. When you convert a datatype from the Oracle format to the Java format (as Sqoop does), there is always a risk that some information is converted incorrectly. Copy2Hadoop insures you against that.

3) If you use Big Data SQL, you get a read performance benefit, because no datatype conversion is needed while reading the dump files.

Let me show you an example of how to work with Copy2Hadoop. To start using this product with RAC (the Exadata quarter rack is a 2-node RAC) you have to provide a shared file system that is available to both database nodes. It could be:

- NAS storage

- ACFS

- DBFS

- another shared filesystem…

For my experiment I chose ACFS (NAS means extra hardware, and DBFS adds database logging overhead that is unnecessary for my particular case) and configured it.
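A hedged sketch of the ACFS setup, assuming an ASM disk group called DATAC1 (the disk group name, size and volume name are placeholders; the volume device path comes from the volinfo output):

$ asmcmd volcreate -G DATAC1 -s 500G copy2hadoop     # create an ADVM volume
$ asmcmd volinfo -G DATAC1 copy2hadoop               # note the "Volume Device" path
$ /sbin/mkfs -t acfs /dev/asm/copy2hadoop-<id>       # format the volume as ACFS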

After that I mounted it on the /u01/app/oracle/acfsdata/ directory (the ACFS volume) on both database nodes (exadatanode01 and exadatanode02). The steps to do this:

1) Create a directory in the database:

 

 SQL> create or replace directory COPY2HADOOP as '/u01/app/oracle/acfsdata/copy2hadoop';

2) Create the dump files with the following statement:

 

SQL> CREATE TABLE orcl_dpi_uncompressed
     ORGANIZATION EXTERNAL (
       TYPE oracle_datapump
       DEFAULT DIRECTORY COPY2HADOOP
       LOCATION ('filename1.dmp','filename2.dmp',
                 ……..
                 'filename128.dmp')
     ) PARALLEL 128
     AS SELECT /*+ PARALLEL(orcl_dpi, 128,2)*/ * FROM orcl_dpi;

 

3) Mount the database nodes' directories on the Hadoop (BDA) nodes.

This is a kind of cross-mounting that lets us use both database instances for copying the data. This trick was done for high availability and performance reasons.
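One hedged assumption here: the ACFS directory has to be exported over NFS from each database node before the BDA-side mounts will work, for example with something like this on each database node (the client specification and options are placeholders, adjust them to your environment):

$ echo '/u01/app/oracle/acfsdata/copy2hadoop *(ro)' >> /etc/exports
$ exportfs -ra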

On BDA nodes 1,3,5 I executed:

 

$ mount exadatanode01:/u01/app/oracle/acfsdata/copy2hadoop /mnt/copy2hadoop

and on nodes 2,4,6:

 

$ mount exadatanode02:/u01/app/oracle/acfsdata/copy2hadoop /mnt/copy2hadoop

4) Move the data from ACFS to HDFS with the distcp command:

 

$ hadoop distcp file:///mnt/copy2hadoop hdfs://cluster-ns/user/oracle/table

That was an example of using Copy2Hadoop to create the files and move them into HDFS.
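Once the Data Pump files are in HDFS they can be queried from Hive through the SerDe shipped for this purpose. The statement below is a hedged sketch: the class names are the ones documented for Copy to Hadoop (verify them against your installed version), and the table name and location are just the ones from this example:

hive> CREATE EXTERNAL TABLE orcl_dpi_dp
      ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
      STORED AS
        INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION '/user/oracle/table';

The column definitions come from the Data Pump metadata, so no column list is needed.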

Copy2Hadoop compression.

As noted before, data compression is usually a good thing in the Hadoop and data warehouse world. I ran a test similar to the previous one, but this time I created and copied compressed files, like this:

CREATE TABLE orcl_dpi_compress
ORGANIZATION EXTERNAL (
  TYPE oracle_datapump
  DEFAULT DIRECTORY COPY2HADOOP
  ACCESS PARAMETERS (COMPRESSION ENABLED HIGH)
  LOCATION ('filename1.dmp','filename2.dmp',…'filename128.dmp')
) PARALLEL 128
AS SELECT /*+ PARALLEL(orcl_dpi, 128,2)*/ * FROM orcl_dpi;

 

In the compressed case I spent 14.3 minutes creating the files and 4.1 minutes copying them (18.4 minutes in total). The database (because of the compression) was the bottleneck for the file creation step. On the BDA I expected to see CPU as the bottleneck, but in this particular case we were limited by the degree of copy parallelism: Hadoop could only copy with DOP=128 (almost the maximum DOP available in the database), i.e. the number of dmp files, while more resources (CPU, network, disks) were still available.

Sqoop and Copy2Hadoop conclusion.

1) Nowadays we generally have two tools for data movement from the Oracle Database to Hadoop - Sqoop and Copy2Hadoop.

2) For performance reasons it makes sense to use only the OraOop version of Sqoop (link).

3) In my configuration (Exadata X5-2 quarter rack and Big Data Appliance starter rack) Sqoop was faster for simple imports, where we moved a whole table (or partition) as-is without any transformation.

4) Copy2Hadoop has advantages over Sqoop when:

a. You offload the results of a complex query (window functions, PL/SQL, joins) or even views (a sub-case of complex queries)

b. You have non-trivial datatypes that are not supported by Sqoop, or there is a risk of incorrect datatype conversion

c. You use Big Data SQL for data processing and want to improve read performance by skipping the datatype conversion

d. You want something that is easier to configure

5) You may be tempted to use OTA4H for data offloading, but remember that this product was designed with a different goal in mind, and it is not a good idea to use it for moving large amounts of data.

Important: this is not an official benchmark, it is just one example of how to tune your data offloading with Sqoop. Your results could be different.

Join the discussion

Comments ( 6 )
  • guest Wednesday, March 23, 2016

    Great and informative blog post. The only part that confused me was the Copy2Hadoop one . So in my understanding You can create external files in data pump format which later can be queried from Hadoop via some tools even if stored in data pump format ?

    Regards

    GG


  • guest Monday, March 28, 2016

    That is correct. The Data Pump files can be queried with Hive, we provide a Hive SerDe for that purpose.


  • WingsOfTechnology Wednesday, March 30, 2016

    Thanks for your tutorial, I have introduced to Sqoop and Copy2Hadoop tools today by reading this post.


  • Harish Y Wednesday, December 13, 2017
    This blog is very useful for the learners, I have been seeing this phrase mapred.job.name in all the import queries. Could someone please explain me the purpose of this property: mapred.job.name..

    Thanks a lot
  • Rasike Gomes Wednesday, March 27, 2019
    Hi , We are currently on Oracle 12.2 EE, Can we access HDFS hive tables directly as external tables without licensed options as Oracle Big SQL ?

    We do have Cloudera data lake in place.
  • Alexey Wednesday, March 27, 2019
    no, in order to load data into Oracle RDBMS you need to have either Oracle Connectors or Big Data SQL