
Move data between Oracle Database and Apache Hadoop using high-speed connectors.

August 7, 2018

Uploading Your Data from Oracle Database On-Premise to HDFS in Oracle Cloud

This is a post by Federico Alejandro Garcia Calabria. Thanks, Federico!

There are many ways to get your data from your on-premise Oracle Database to Oracle Big Data Cloud Service. This blog will walk you through an easy way.

Note that this blog uses Oracle Big Data Cloud Service (BDCS), which is the Big Data Appliance in Oracle Cloud.  The Hadoop distribution is from Cloudera.

Below are the high-level steps to copy the data. We will walk through each step in detail in this blog post.

1. Create Oracle Data Pump files using external tables and the ORACLE_DATAPUMP access driver.

2. Copy the Oracle Data Pump files from your on-premise system to Oracle's Object Store Service (OSS).

3. Copy the Oracle Data Pump files from OSS to HDFS.

4. Create Hive tables to read the Oracle Data Pump files.

You will need the following:

For Step 2: File Transfer Manager Command Line Interface (FTM CLI) -> a Java-based utility that uploads files to and downloads files from OSS. You can download the binary from here.

For Step 3: ODCP CLI/Big Data Manager (BDM) -> ODCP is a utility that copies data from OSS and other supported storage providers to HDFS. ODCP is pre-installed on each node of your Big Data Cloud Service cluster. If you prefer a drag-and-drop UI to the command line, you can use Big Data Manager (also pre-installed on Big Data Cloud Service) to transfer your data from OSS to HDFS.

For Step 4: Hive SerDe to read Oracle Data Pump files. This is pre-installed on Oracle Big Data Cloud Service as part of the Copy to Hadoop feature.

Detailed Steps

Step 1. Create Oracle Data Pump files from the Oracle Database table data that you want to move. You can do this by exporting the data from the Oracle Database table using the ORACLE_DATAPUMP access driver. (Note: Oracle Data Pump files generated by other database utilities will not be readable by Hive tables; you must use the ORACLE_DATAPUMP access driver to export the data.)

CREATE TABLE table_name
ORGANIZATION EXTERNAL (
   TYPE oracle_datapump
   DEFAULT DIRECTORY database_directory
   LOCATION ('filename1.dmp','filename2.dmp'...)
) PARALLEL n
AS SELECT * FROM tablename;
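
For example, here is a minimal sketch that exports a hypothetical SALES table into two Data Pump files, assuming a database directory object named DATA_PUMP_DIR already exists (the table, directory, and file names are all assumptions):

-- Hypothetical export of the SALES table into two Data Pump files
-- written to the directory that DATA_PUMP_DIR points to
CREATE TABLE sales_dp_ext
ORGANIZATION EXTERNAL (
   TYPE oracle_datapump
   DEFAULT DIRECTORY DATA_PUMP_DIR
   LOCATION ('sales1.dmp','sales2.dmp')
) PARALLEL 2
AS SELECT * FROM sales;

Listing one dump file per degree of parallelism lets each parallel process write to its own file.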

 

Step 2.   Move data from your local machine (on-premise) to OSS.

You can run the following command:

java -jar "[LOCAL PATH TO ftmcli.jar]" upload --save-auth-key --properties-file "[LOCAL PATH TO ftmcli.properties]" [name of OSS container] [LOCAL PATH TO FILE TO UPLOAD]

Note that the OSS container must be created from the OSS GUI before you run this command. The --save-auth-key parameter saves your authentication credentials so that future runs will not prompt for them. The --properties-file parameter points to a configuration file that contains the information needed to perform the transfer. Take a look at the default file that ships with the client to see how it should be updated with your BDCS information. For more information, see:

https://docs.oracle.com/en/cloud/iaas/storage-cloud/csclr/general-syntax-ftm-cli-commands.html#GUID-6E51B750-1958-4FB3-98BB-AF24ECC882D7
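
For example, a sketch of this upload assuming ftmcli.jar and ftmcli.properties sit in the current directory, the OSS container is named datapump-staging, and the command is run from the operating system directory that holds the dump files from Step 1 (all of these names are assumptions):

java -jar ./ftmcli.jar upload --save-auth-key --properties-file ./ftmcli.properties datapump-staging sales1.dmp
java -jar ./ftmcli.jar upload --save-auth-key --properties-file ./ftmcli.properties datapump-staging sales2.dmp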

 

Step 3.  Move data from OSS to HDFS.

You can think of OSS as a staging area before moving data to HDFS. You can use the odcp tool to move data from OSS to HDFS. Here is an example odcp command:

odcp -V swift://[CONTAINER]/[FILE] hdfs:///[PATH]

Suppose your [CONTAINER] is named "container"; you must reference it as container.oss in the command. The -V option enables verbose output; you can see the complete list of options by typing odcp --help.
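
For example, a sketch that copies one of the dump files uploaded in Step 2 from a hypothetical container named datapump-staging into an HDFS directory (the container name, file name, and HDFS path are assumptions):

odcp -V swift://datapump-staging.oss/sales1.dmp hdfs:///user/oracle/datapump/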

Note that odcp uses Spark under the hood. This means that files are divided into chunks before transfer and concatenated at the destination. If your HDFS destination is an encryption zone, HDFS does not allow concatenation operations there, so you will not be able to transfer individual files larger than 1 GB.

Check odcp’s full documentation here: https://docs.oracle.com/en/cloud/paas/big-data-cloud/csbdi/copying-data-odcp.html#GUID-AE8587AF-6538-43A6-A2F3-52D63E287788

(Some additional resources on using Big Data Manager:
https://www.youtube.com/watch?v=YUNstcRR6ZE&list=PLKCk3OyNwIzvqw3niT3RM3I0v1f39FFqf
https://docs.oracle.com/en/cloud/paas/big-data-cloud/tutorials.html )

Step 4.  Create a Hive table to query the Oracle Data Pump files you have copied to HDFS.

Use the Beeline CLI to connect to the Hive server.    Select a Hive database to use.  Then create a Hive table to access the data that you copied in Oracle Data Pump format.

This table uses the Hive SerDe that is pre-installed on Oracle Big Data Cloud Service.

CREATE EXTERNAL TABLE table1
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///<HDFS location of your data>';
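
For example, a sketch that creates a table over the hypothetical HDFS directory used in Step 3 and runs a quick query to confirm the files are readable (the table name and HDFS path are assumptions):

-- Hypothetical table over the Data Pump files copied in Step 3
CREATE EXTERNAL TABLE sales_dp
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///user/oracle/datapump';

-- Quick check that the Data Pump data is visible to Hive
SELECT COUNT(*) FROM sales_dp;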

 

You are now ready to query your data in Hive! You can also use Hive to move the data into Parquet and other formats, as sketched below.
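
For example, a minimal sketch that copies the hypothetical sales_dp table from the previous step into a Parquet-backed Hive table (the table names are assumptions):

-- Hypothetical conversion of the Data Pump-backed table into Parquet
CREATE TABLE sales_parquet STORED AS PARQUET
AS SELECT * FROM sales_dp;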
