Guest Author: Ionut Bruma - Oracle, Senior Sales Consultant
This article presents an overview of how to use Oracle Data Integrator (ODI) for Big Data with Hive parquet storage. The scenario shows how we can ingest CSV files into Hive and store them directly in Parquet format using standard connectors and Knowledge Modules (KMs) offered by Oracle Data Integrator for Big Data. For those new to this extension of ODI, Oracle Data Integrator for Big Data brings advanced data integration capabilities to customers who are looking to implement a seamless and responsive Big Data Management platform.
For the practical scenario we have used Big Data Lite VM. This machine can be downloaded from OTN following the link attached: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
Overview of Hive
The Apache Hive software projects a structure over the large datasets residing in distributed storage thus facilitating querying and managing this data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Let me walk you through how to set up ODI for this use case.
Preparing the File Topology
Create a File Data Server:
Use the Topology tab of ODI Studio. Right click File technology and click on New to create a new Data Server
In the Definition panel, enter a Name: eg: CSV Source
In the JDBC panel, leave the default driver details:
JDBC driver: com.sunopsis.jdbc.driver.file.FileDriver
JDBC driver URL: jdbc:snps:dbfile
Press button. A pop-up should show that the connection has been successful as below.
Go back to the Topology and right click the newly created Data Server. Choose New Physical Schema.
The new configuration window will appear. In the Definition tab enter the connection details.
Directory (Schema): <fill_path_to_the_folder_containing_csv_files>.
Directory (Schema): repeat the same string here.
Press Save. An information screen will pop-up. Press ok.
Then expand the Logical Architecture tab and find the File technology. Right click on it and click New Logical Schema to associate it with the newly created Physical Schema.
In the Definition tab select the Context (Global in our case, but it can be Production, Development and so on). Associate the context with the Physical Schema previously created.
Creating the Model reflecting CSV structure
Go to Designer window and expand Models, right click on the folder and click New Model.
In the Definition tab fill in the Name, Technology and Logical Schema. For the Technology choose File, and then choose the Logical Schema you have just created for this technology.
Click button to retrieve all the files. You should see them under the model created. In order to have a preview of the data inside the CSV file, right-click on the datastore and choose View Data:
Similar to the creation of the File Data Server in the Topology tab, create a Hive Data Server. Fill in the Name as Hive for example. Under the Connection specify the user (the authentication will be internal to the Hive engine, we will specify password=default in the driver specifications). Fill in the metastore URI and go to the JDBC tab.
In the JDBC tab fill in the correct driver and URL for the Hive technology as shown in the following example. We have everything emulated inside Big Data Lite environment [JT1] so we will use localhost and the password default (for default authentication).
Click . The connection test should be successful.
Now out of the new Hive Data Server (that’s actually the connection to the server) we will create a new schema or in ODI language a new Physical Schema. Right click on the data server and choose new Physical Schema.
Fill in the details related to the schema (the schema should exist in ODI already) that you want to include in the Topology. You can use the same schema for both Work and Staging schema.
Next, expand the Logical Architecture window and locate the Hive technology. Right click on it and create a new Logical Schema that will be linked to the Hive schema.
Give it a name and associate the Logical Schema with the Physical Schema as shown below (a dropdown menu is available under Physical Schemas area):
Creating the Model reflecting Hive structure
If using reverse engineering then a table should exist in Hive. Here is a table creation DDL example:
CREATE TABLE Test
STORED as PARQUET;
Log into Hive and run this code. From the OS command line, run bee command like shown below.
Set the database where you want to deploy:
Return to ODI Studio, go to Designer window and expand Models, right click on the folder and click New Model.
In the Definition tab fill in the Name, Technology and Logical Schema. For the Technology choose Hive, then choose the logical schema you have just created for this technology.
Next, go to Reverse Engineer tab and choose the radio button for Customized. Choose the proper knowledge module as RKM Hive and click . Optionally you can choose a mask for the tables you want to reverse engineer or you will reverse the entire schema.
The reverse engineering process can be monitored inside Operator Navigator:
If you need to forward engineer your Hive Datastore, you will need to create it and fill it out manually. Right click the Hive data model and click New Datastore. Fill in the Name with any name. For the Datastore Type pick Table.
Navigate to Storage attributes and fill in the Table Type: Managed, Storage Type: Native, Row Format: Built-In, Storage Format: PARQUET as shown in the picture below.
Navigate to Attributes tab and fill in the attributes of the table:
Next, create the mapping that will convert automatically the CSV file into a Hive Parquet stored structure.
Click the project you need to create the mapping in, expand the Folder and right-click on the mappings. Click on New Mapping.
Give it a name,
and start building the mapping. Our mapping is a one to one mapping because we only load the data into Hive but you can also perform any transformation ODI is capable of before loading the data into Hive Parquet table.
Navigate to Physical tab and make the following adjustments. Select the access point (MCC_AP in my case)
and configure the Loading Knowledge Module as shown in the following picture:
It is important that LKM File to Hive LOAD DATA is used (and NOT LKM File to Hive LOAD DATA Direct). In addition, the FILE_IS_LOCAL (outside the cluster) option must be set to True, otherwise it will look for the file in HDFS instead of taking the CSV file from outside the cluster.
Configure the Integration part (IKM settings):
Select the target table (this should be the Hive Parquet one):
Open the Integration Knowledge Module tab and choose the IKM Hive Append.GLOBAL KM. Next, if the table does not exist (you use the forward engineering approach described above) then set the CREATE_TARG_TABLE option to TRUE otherwise keep it false. If you want to truncate the table after each run then set the TRUNCATE option to TRUE.
Run the mapping and review the results in Operator Navigator:
Examine the Hive Table:
Look inside the Parquet file to see the data:
Data_lake.db is the name of my database and mccmnc is the name of my table. You should replace them with your own names.
ODI offers out of the box integration with Big Data technologies such as Hive, Hadoop, Spark, Pig, etc. When building a data lake or a data warehouse many files come as flat files in different formats like CSV, TXT, JSON and have to be injected in HDFS/HIVE in formats like Parquet. ODI is able to build a reusable flow in order to automatically transfer the CSV files as they come from sources directly into the target HIVE tables. This post presents a step-by-step guide on how to setup the right topology that reflects bots CSV source and HIVE Parquet table target. Then we tried to show two ways of building the target model (reverse engineering an existing table or forward engineer a structure created in ODI) and furthermore how to choose the proper Knowledge Modules for the loading and integration part.