Tuesday Jun 16, 2015

ODI - Hive DDL Generation for Parquet, ORC, Cassandra and so on

Here you'll see how, with some small surgical extensions, we can use ODI to generate complex integration models in Hive for all kinds of challenges;

  • integrate data from Cassandra, or any arbitrary SerDe
  • use Hive Parquet or ORC storage formats
  • create partitioned, clustered tables
  • define arbitrarily complex attributes on tables

I'm very interested in hearing what you think of this, and what is needed or useful over and above it (RKMs and so on).

When you use this technique to generate your models you can benefit from one of ODI's lesser known but very powerful features - the Common Format Designer (see here for nice write up). With this capability you can build models from other models, generate the DDL and generate the mappings or interfaces to integrate the data from the source model to the target. This gets VERY interesting in the Big Data space since many customers want to get up and running with data they already know and love.

What do you need to get there? The following basic pieces;

  • a custom Hive create table action - get it here.
  • flex fields for the external table indicator, table data format and attribute type metadata - get the groovy script here

Pretty simple, right? If you take the simple example on the Confluence wiki;

  • CREATE TABLE parquet_test (
  •    id INT,
  •    str STRING,
  •    mp MAP<STRING,STRING>,
  •    lst ARRAY<STRING>,
  •    strct STRUCT<A:STRING,B:STRING>)
  • PARTITIONED BY (part STRING)
  • STORED AS PARQUET;

...and think about how to model this in ODI, you'll think - how do I define the Parquet storage? How can the complex types be defined? That's where the flex fields come in: you can define any of the custom storage properties you wish in the datastore's Data Format flex field, then define the complex type formats.

In ODI the Parquet table above can be defined with the Hive datatypes as below; we don't capture the complex fields within the attributes mp, lst or strct. The partitioned column 'part' is added into the regular list of datastore attributes (as in Oracle, unlike Hive DDL today);

Using the Data Format flex field we can specify the STORED AS PARQUET information; the table is not external to Hive, so we do not specify EXTERNAL in the external table indicator flex field;

This value can have lots of ODI goodies in it, which makes it very versatile - you can use variables, which will be resolved when the generated procedure containing the DDL is executed, plus you can use <% style code substitution.
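For illustration, the Data Format flex field value in this example could be as simple as;

  • STORED AS PARQUET

...or it could reference an ODI variable (the variable name here is hypothetical) that gets resolved when the generated procedure runs;

  • STORED AS #BIGDATA_PRJ.STORAGE_FORMAT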

The partitioning information is defined for the part attribute, the 'Used for Partitioning' field is checked below;

Let's take stock: we have our table defined as Parquet and we have the partitioning info defined. Now we have to define the complex types. This is done on attributes mp, lst and strct using the creatively named Metadata field - below I have fully defined the MAP, STRUCT and ARRAY for the respective attributes;
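As a sketch, the Metadata flex field values for the three attributes would be along these lines (note the escaped colons in the struct, explained next);

  • mp     MAP<STRING,STRING>
  • lst    ARRAY<STRING>
  • strct  STRUCT<A\:STRING,B\:STRING>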

Note that the struct example escapes the ':' because the code is executed via the JDBC driver when run through the agent, and ':' is the way of binding information to a statement execution (you can see the generated DDL further below with the ':' escaped).

With that we are complete, we can now generate DDL for our model;

This brings up a dialog with options for where to store the procedure with the generated DDL - this has actually done a base compare against the Hive system and allowed me to select which datastores to create DDL for, or which DDL to create; I have checked the parquet_test datastore;

Upon completion, we can see the procedure which was created. We can choose to further edit and customize it if desired, and we can schedule it for any time we want or execute it immediately.

Inspecting the Create Hive Table action, the DDL looks exactly as we were after when we first started - the datastore schema / name will be resolved upon execution using ODI's runtime context - which raises another interesting point! We could have used some ODI substitution code <% when defining the storage format, for example;

As I mentioned earlier, you can take this approach to capture all sorts of table definitions for integrating. If you want to look at how this is done on the inside, check the content of the Hive action in the image below (I linked the code of the action above); it should look very familiar if you have seen any KM code, and it uses the flex fields to inject the relevant information into the DDL;

This is a great example of what you can do with ODI and how to leverage it to work efficiently. 

There is a lot more that can be illustrated here, including what the RKM Hive can do or be modified to do. Perhaps creating the datastore and complex type information from JSON, JSON schema, Avro schema or Parquet would be very useful? I'd be very interested to hear your comments and thoughts. So let me know....

Thursday Apr 09, 2015

ODI, Big Data SQL and Oracle NoSQL

Back in January Anuj posted an article here on using Oracle NoSQL via the Oracle Database Big Data SQL feature. In this post - I guess you could call it part 2 of Anuj's - I will follow up with how the Oracle external table is configured and how it all hangs together, both with manual code and via ODI. For this I used the Big Data Lite VM and also the newly released Oracle Data Integrator Big Data option. The BDA Lite VM 4.1 release uses version 3.2.5 of Oracle NoSQL - from this release I used the new declarative DDL for Oracle NoSQL to project the shape from NoSQL, with some help from Anuj.

My goal for the integration design is to show a logical design in ODI and how KMs are used to realize the implementation and leverage Oracle Big Data SQL - this integration design supports predicate pushdown, so I actually minimize the data moved between my NoSQL store on Hadoop and the Oracle database - think speed and scalability! My NoSQL store contains user movie recommendations. I want to join this with reference data in Oracle, which includes the customer, movie and genre information, and store the result in a summary table.

Here is the code to create and load the recommendation data in NoSQL - this would normally be computed by another piece of application logic in a real world scenario;

  • export KVHOME=/u01/nosql/kv-3.2.5
  • cd /u01/nosql/scripts
  • ./admin.sh

  • connect store -name kvstore
  • EXEC "CREATE TABLE recommendation( \
  •          custid INTEGER, \
  •          sno INTEGER, \
  •          genreid INTEGER,\
  •          movieid INTEGER,\
  •          PRIMARY KEY (SHARD(custid), sno, genreid, movieid))"
  • PUT TABLE -name RECOMMENDATION  -file /home/oracle/movie/moviework/bigdatasql/nosqldb/user_movie.json

The Manual Approach

This example is using the new data definition language in NoSQL. To make this accessible via Hive, users can create Hive external tables that use the NoSQL Storage Handler provided by Oracle. If this were manually coded in Hive, we could define the table as follows;

  • CREATE EXTERNAL TABLE recommendation(
  •                  custid INT,
  •                  sno INT,
  •                  genreId INT,
  •                  movieId INT)
  •           STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
  •           TBLPROPERTIES  ( "oracle.kv.kvstore"="kvstore",
  •                            "oracle.kv.hosts"="localhost:5000",
  •                            "oracle.kv.hadoop.hosts"="localhost",
  •                            "oracle.kv.tableName"="recommendation");
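
Once created, the table can be queried from Hive like any other - for example, a quick sanity check along these lines;

  • SELECT custid, sno, genreid, movieid FROM recommendation LIMIT 10;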

At this point we have made NoSQL accessible to many components in the Hadoop stack - pretty much every component in the Hadoop ecosystem can leverage the HCatalog entries defined, be they Hive, Pig, Spark and so on. We are looking at Oracle Big Data SQL though, so let's see how that is achieved. We must define an Oracle external table that uses either the SerDe or a Hive table; below you can see how the table has been defined in Oracle;

  • CREATE TABLE recommendation(
  •                  custid NUMBER,
  •                  sno NUMBER,
  •                  genreid NUMBER,
  •                  movieid NUMBER
  •          )
  •                  ORGANIZATION EXTERNAL
  •          (
  •                  TYPE ORACLE_HIVE
  •                  ACCESS PARAMETERS  (
  •                      com.oracle.bigdata.tablename=default.recommendation
  •                  )
  •          ) ;

Now we are ready to write SQL! Really!? Well, let's see - below is the type of query we can run to join the NoSQL data with our Oracle reference data;

  • SELECT m.title, g.name, c.first_name
  • FROM recommendation r, movie m, genre g, customer c
  • WHERE r.movieid=m.movie_id and r.genreid=g.genre_id and r.custid=c.cust_id and r.custid=1255601 and r.sno=1 
  • ORDER BY r.sno, r.genreid;

Great, we can now access the data from Oracle - we benefit from the scalability of the solution and minimal data movement! Let's make it better: more maintainable, more flexible to future changes, and accessible to more people, by showing how it is done in ODI.

Oracle Data Integrator Approach

The data in NoSQL has a shape, we can capture that shape in ODI just as it is defined in NoSQL. We can then design mappings that manipulate the shape and load into whatever target we like. The SQL we saw above is represented in a logical mapping as below;

Users get the same design experience as with other data items and benefit from the mapping designer. They can join, map and transform just as normal. The ODI designer allows you to separate how you physically want this to happen from the logical semantics - this is all about giving you the flexibility to change and adapt to new integration technologies and patterns.

In the physical design we can assign Knowledge Modules that take the responsibility of building the integration objects that we previously manually coded above. These KMs are generic so support all shapes and sizes of data items. Below you can see how the LKM is assigned for accessing Hive from Oracle;

This KM takes the role of building the external table - you can take it, use it, customize it, and the logical design stays the same. Why is that important? Integration recipes CHANGE as we learn more and developers build newer and better mechanisms to integrate.

This KM takes care of creating the external table in Hive that accesses our NoSQL system. You could also have manually built the external table, imported it into ODI and used that as the source for the mapping; I want to show how the raw items can be integrated, because the more metadata we have and use in the design, the greater the flexibility in the future. The LKM Oracle NoSQL to Hive uses regular KM APIs to build the access object; here is a snippet from the KM;

  • create table <%=odiRef.getObjectName("L", odiRef.getTableName("COLL_SHORT_NAME"), "W")%>
  •  <%=odiRef.getColList("(", "[COL_NAME] [DEST_CRE_DT]", ", ", ")", "")%> 
  •           STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
  •           TBLPROPERTIES  ( "oracle.kv.kvstore"="<%=odiRef.getInfo("SRC_SCHEMA")%>",
  •                            "oracle.kv.hosts"="<%=odiRef.getInfo("SRC_DSERV_NAME")%>",
  •                            "oracle.kv.hadoop.hosts"="localhost",
  •                            "oracle.kv.tableName"="<%=odiRef.getSrcTablesList("", "[TABLE_NAME]", ", ", "")%>");

You can see the templatized code versus the literals; this still needs some work as you can see - can you spot some hard-wiring that needs to be fixed? ;-) This was built using the Big Data option of ODI, so integration with Hive is much improved and it leverages the DataDirect driver, which is also a big improvement. In this post I created a new technology for Oracle NoSQL in ODI; you can do this too for anything you want. I will post this technology on java.net and more, so that as a community we can learn and share.
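
For instance, one way to remove the hard-wired "localhost" for oracle.kv.hadoop.hosts would be to expose it as a KM option (the option name below is hypothetical) and substitute it in the template, roughly as follows;

  •           TBLPROPERTIES  ( "oracle.kv.kvstore"="<%=odiRef.getInfo("SRC_SCHEMA")%>",
  •                            "oracle.kv.hosts"="<%=odiRef.getInfo("SRC_DSERV_NAME")%>",
  •                            "oracle.kv.hadoop.hosts"="<%=odiRef.getOption("KV_HADOOP_HOSTS")%>",
  •                            "oracle.kv.tableName"="<%=odiRef.getSrcTablesList("", "[TABLE_NAME]", ", ", "")%>");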


Here we have seen how we can make seemingly complex integration tasks quite simple and leverage the best of data integration technologies today and importantly in the future!

Monday Apr 06, 2015

Announcing Oracle Data Integrator for Big Data

Proudly announcing the availability of Oracle Data Integrator for Big Data. This release is the latest in the series of advanced Big Data updates and features that Oracle Data Integration is rolling out for customers to help take their Hadoop projects to the next level.

Increasing Big Data Heterogeneity and Transparency

This release sees significant additions in heterogeneity and governance for customers. Highlights of this release include:

  • Support for Apache Spark,
  • Support for Apache Pig, and
  • Orchestration using Oozie.

Click here for a detailed list of what is new in Oracle Data Integrator (ODI).

Oracle Data Integrator for Big Data helps transform and enrich data within the big data reservoir/data lake without users having to learn the languages necessary to manipulate them. ODI for Big Data generates native code that is then run on the underlying Hadoop platform without requiring any additional agents. ODI separates the design interface to build logic and the physical implementation layer to run the code. This allows ODI users to build business and data mappings without having to learn HiveQL, Pig Latin and Map Reduce.

Oracle Data Integrator for Big Data Webcast

We invite you to join us on the 30th of April for our webcast to learn more about Oracle Data Integrator for Big Data and to get your questions answered about Big Data Integration. We will discuss how the newly announced Oracle Data Integrator for Big Data:

  • Provides advanced scale and expanded heterogeneity for big data projects 
  • Uniquely complements Hadoop’s strengths to accelerate decision making, and 
  • Ensures sub-second latency with Oracle GoldenGate for Big Data.

Thursday Mar 26, 2015

Oracle Big Data Lite 4.1.0 is available with more on Oracle GoldenGate and Oracle Data Integrator

Oracle's big data team has announced the newest Oracle Big Data Lite Virtual Machine 4.1.0.  This newest Big Data Lite Virtual Machine contains great improvements from a data integration perspective with the inclusion of the recently released Oracle GoldenGate for Big Data.  You will see this in an improved demonstration that highlights inserts, updates, and deletes into Hive using Oracle GoldenGate for Big Data, with Oracle Data Integrator performing a merge of the new operations into a consolidated table.

Big Data Lite is a pre-built environment which includes many of the key capabilities for Oracle's big data platform.   The components have been configured to work together in this Virtual Machine, providing a simple way to get started in a big data environment.  The components include Oracle Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator, Oracle GoldenGate amongst others. 

Big Data Lite also contains hands-on labs and demonstrations to help you get started using the system.  Tame Big Data with Oracle Data Integration is a hands-on lab that teaches you how to design Hadoop data integration using Oracle Data Integrator and Oracle GoldenGate. 

                Start here to learn more!  Enjoy!

Thursday Feb 19, 2015

Introducing Oracle GoldenGate for Big Data!

Big data systems and big data analytics solutions are becoming critical components of modern information management architectures.  Organizations realize that by combining structured transactional data with semi-structured and unstructured data they can realize the full potential value of their data assets and achieve enhanced business insight. Businesses also recognize that in today’s fast-paced, digital business environment, being agile and responding with immediacy requires access to data with low latency. Low-latency transactional data brings additional value especially for dynamically changing operations that day-old data, structured or unstructured, cannot deliver.

Today we announced the general availability of the Oracle GoldenGate for Big Data product, which offers a real-time transactional data streaming platform into big data systems. By providing easy-to-use, real-time data integration for big data systems, Oracle GoldenGate for Big Data facilitates improved business insight for better customer experience. It also allows IT organizations to quickly move ahead with their big data projects without extensive training and management resources. Oracle GoldenGate for Big Data's real-time data streaming platform also allows customers to keep their big data reservoirs up to date with their production systems. 

Oracle GoldenGate’s fault-tolerant, secure and flexible architecture shines in this new big data streaming offering as well. Customers can enjoy secure and reliable data streaming with subsecond latency. Oracle GoldenGate’s core log-based change data capture capabilities enable real-time streaming without degrading the performance of the source production systems.

The new offering, Oracle GoldenGate for Big Data, provides integration for Apache Flume, Apache HDFS, Apache Hive and Apache HBase. It also includes Oracle GoldenGate for Java, which enables customers to easily integrate with additional big data systems, such as Oracle NoSQL, Apache Kafka, Apache Storm, Apache Spark, and others.

You can learn more about our new offering via Oracle GoldenGate for Big Data data sheet and by registering for our upcoming webcast:

How to Future-Proof your Big Data Integration Solution

March 5th, 2015 10am PT/ 1pm ET

I invite you to join this webcast to learn from Oracle and Cloudera executives how to future-proof your big data infrastructure. The webcast will discuss:

  • Selection criteria that will drive business results with Big Data Integration 
  • Oracle's new big data integration and governance offerings, including Oracle GoldenGate for Big Data
  • Oracle’s comprehensive big data features in a unified platform 
  • How Cloudera Enterprise Data Hub and Oracle Data Integration combine to offer complementary features to store data in full fidelity, to transform and enrich the data for increased business efficiency and insights.

Hope you can join us and ask your questions to the experts.

Hive, Pig, Spark - Choose your Big Data Language with Oracle Data Integrator

The strength of Oracle Data Integrator (ODI) has always been the separation of logical design and physical implementation. Users can define a logical transformation flow that maps any sources to targets without being concerned what exact mechanisms would be used to realize such a job. In fact, ODI doesn’t have its own transformation engine but instead outsources all work to the native mechanisms of the underlying platforms, may it be relational databases, data warehouse appliances, or Hadoop clusters.

In the case of Big Data this philosophy of ODI gains even more importance. New Hadoop projects are incubated and released on a constant basis and introduce exciting new capabilities; the combined brain trust of the big data community conceives new technology that outdoes any proprietary ETL engine. ODI’s ability to separate your design from the implementation enables you to pick the ideal environment for your use case; and if the Hadoop landscape evolves, it is easy to retool an existing mapping with a new physical implementation. This way you don’t have to tie yourself to one language that is hyped this year, but might be legacy in the next.

ODI enables the generation from logical design into executed code through physical designs and Knowledge Modules. You can even define multiple physical designs for different languages based on the same logical design. For example, you could choose Hive as your transformation platform, and ODI would generate Hive SQL as the execution language. You could also pick Pig, and the generated code would be Pig Latin. If you choose Spark, ODI will generate PySpark code, which is Python with Spark APIs. Knowledge Modules orchestrate the generation of code for the different languages and can be further configured to optimize the execution of the different implementations, for example parallelism in Pig or in-memory caching for Spark.

The example below shows an ODI mapping that reads from a log file in HDFS, registered in HCatalog. It gets filtered, aggregated, and then joined with another table, before being written into another HCatalog-based table. ODI can generate code for Hive, Pig, or Spark based on the Knowledge Modules chosen. 
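
For the Hive physical design, for example, the generated code would be plain Hive SQL along these lines (the table and column names here are purely illustrative, not the demo's actual objects);

  • INSERT OVERWRITE TABLE movie_activity_summary
  • SELECT lg.movieid, COUNT(*) AS activity_count
  • FROM movieapp_log_hcat lg JOIN movie m ON (lg.movieid = m.movie_id)
  • WHERE lg.activity = 1
  • GROUP BY lg.movieid;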

ODI provides developer productivity and can future-proof your investment by overcoming the need to manually code Hadoop transformations in a particular language.  You can logically design your mapping and then choose the implementation that best suits your use case.

Thursday Jan 22, 2015

OTN Virtual Technology Summit Data Integration Subtrack Features Big Data Integration and Governance

I am sure many of you have heard about the quarterly Oracle Technology Network (OTN) Virtual Technology Summits. They provide a hands-on learning experience on the latest offerings from Oracle by bringing together experts from our community and product management teams. 

The next OTN Virtual Technology Summit is scheduled for February 11th (9am-12:30pm PT) and will feature Oracle's big data integration and metadata management capabilities with hands-on-lab content.

The Data Integration and Data Warehousing sub-track includes the following sessions and speakers:

Feb 11th 9:30am PT -- HOL: Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors

Oracle GoldenGate 12c is well known for its highly performant data replication between relational databases. With the GoldenGate Adaptors, the tool can now apply the source transactions to a Big Data target, such as HDFS. In this session, we'll explore the different options for utilizing Oracle GoldenGate 12c to perform real-time data replication from a relational source database into HDFS. The GoldenGate Adaptors will be used to load movie data from the source to HDFS for use by Hive. Next, we'll take the demo a step further and publish the source transactions to a Flume agent, allowing Flume to handle the final load into the targets.

Speaker: Michael Rainey, Oracle ACE, Principal Consultant, Rittman Mead

Feb 11th 10:30am PT -- HOL: Bringing Oracle Big Data SQL to Oracle Data Integration 12c Mappings

Oracle Big Data SQL extends Oracle SQL and Oracle Exadata SmartScan technology to Hadoop, giving developers the ability to execute Oracle SQL transformations against Apache Hive tables and extending the Oracle Database data dictionary to the Hive metastore. In this session we'll look at how Oracle Big Data SQL can be used to create ODI12c mappings against both Oracle Database and Hive tables, to combine customer data held in Oracle tables with incoming purchase activities stored on a Hadoop cluster. We'll look at the new transformation capabilities this gives you over Hadoop data, and how you can use ODI12c's Sqoop integration to copy the combined dataset back into the Hadoop environment.

Speaker: Mark Rittman, Oracle ACE Director, CTO and Co-Founder, Rittman Mead

Feb 11th 11:30am PT -- An Introduction to Oracle Enterprise Metadata Manager

This session takes a deep technical dive into the recently released Oracle Enterprise Metadata Manager. You’ll see the standard features of data lineage, impact analysis and version management applied across a myriad of Oracle and non-Oracle technologies into a consistent metadata whole, including Oracle Database, Oracle Data Integrator, Oracle Business Intelligence and Hadoop. This session will examine the Oracle Enterprise Metadata Manager "bridge" architecture and how it is similar to the ODI knowledge module. You will learn how to harvest individual sources of metadata, such as OBIEE, ODI, the Oracle Database and Hadoop, and you will learn how to create OEMM configurations that contain multiple metadata stores as a single coherent metadata strategy.

Speaker: Stewart Bryson, Oracle ACE Director, Owner and Co-founder, Red Pill Analytics

I invite you to register now for this free event and enjoy this feast for big data integration and governance enthusiasts.

Americas -- February 11th/ 9am to 12:30pm PT- Register Now

Please note the same OTN Virtual Technology Summit content will be presented again to EMEA and APAC. You can register for them via the links below.

EMEA – February 25th / 9am to 12:30pm GMT* - Register Now

APAC – March 4th / 9:30am-1:00pm IST* - Register Now

Join us and let us know how you like the data integration sessions in this quarter's OTN event.

Wednesday Oct 01, 2014

Streaming Relational Transactions to Flume using Oracle GoldenGate

In prior articles, we have introduced architectures for streaming transactions from Oracle GoldenGate to HDFS, Hive, and HBase. In this article we are adding to this list by presenting a solution for streaming transactions into the Flume service. 

Apache Flume is a distributed, fault-tolerant, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. It can be configured with a variety of distribution mechanisms, such as delivery to multiple clusters, or rolling of HDFS files based on time or size. 

As shown in the diagram below, streaming database transactions to Flume is accomplished by developing a custom handler using Oracle GoldenGate's Java API and Flume's Avro RPC APIs.

The custom handler is deployed as an integral part of the Oracle GoldenGate Pump process.   The Pump process and the custom adapter are configured through the Pump parameter file and custom adapter's properties file. The Pump process executes the adapter in its address space. The Pump reads the Trail File created by the Oracle GoldenGate Capture process and passes the transactions to the adapter. The adapter then writes the transactions to a Flume Avro RPC source at the given host/port defined in the configuration file. The Flume Agent streams the data to the final destination; in the supplied example Flume writes into an HDFS directory for subsequent consumption by Hive. 

A sample implementation of the Flume Adapter for Oracle GoldenGate is provided at the My Oracle Support site as Knowledge Base article 1926867.1

Friday Sep 26, 2014

Oracle GoldenGate and Oracle Data Integrator on the Oracle BigDataLite 4.0 Virtual Machine

Oracle's big data team has just announced the Oracle BigDataLite Virtual Machine 4.0, a pre-built environment to get you started on an environment reflecting the core software of Oracle's Big Data Appliance 4.0. BigDataLite is a VirtualBox VM that contains a fully configured Cloudera Hadoop distribution CDH 5.1.2, Oracle DB with Big Data SQL, Oracle's Big Data Connectors, Oracle Data Integrator 12.1.3, Oracle GoldenGate, and other software.

The demos for Oracle GoldenGate and Oracle Data Integrator show an end-to-end use case of the fictional Oracle MoviePlex on-line movie streaming company. It shows how to load data into a Data Reservoir in real-time, process and transform the data using the power of Hadoop, and utilize the data on an Oracle data warehouse, either by loading the data with Oracle Loader for Hadoop or by using Hive tables within Oracle SQL queries with Oracle Big Data SQL. 

Please follow the demo instructions to try out all these steps yourself. If you would like to build the ODI mappings from scratch, try our ODI and Big Data Hands-on Lab.  Enjoy! 

Sunday Jul 13, 2014

New Big Data Features in ODI 12.1.3

Oracle Data Integrator (ODI) 12.1.3 extends its Hadoop support with a number of exciting new capabilities. The new features include:

  • Loading of RDBMS data from and to Hadoop using Sqoop
  • Support for Apache HBase databases
  • Support for Hive append functionality

With these new additions ODI provides full connectivity to load, transform, and unload data in a Big Data environment.

The diagram below shows all ODI Hadoop knowledge modules with KMs added in ODI 12.1.3 in red. 

Sqoop support

Apache Sqoop is designed for efficiently transferring bulk amounts of data between Hadoop and relational databases such as Oracle, MySQL, Teradata, DB2, and others. Sqoop operates by creating multiple parallel map-reduce processes across a Hadoop cluster that connect to the external database and transfer data from or to Hadoop storage in a partitioned fashion. Data can be stored in Hadoop using HDFS, Hive, or HBase. ODI adds two knowledge modules: IKM SQL to Hive-HBase-File (SQOOP) and IKM File-Hive to SQL (SQOOP).

Loading from and to Sqoop in ODI is straightforward. Create a mapping with the database source and Hadoop target (or vice versa) and apply any necessary transformation expressions.

In the physical design of the mapping, make sure to set the LKM of the target to LKM SQL Multi-Connect.GLOBAL and choose a Sqoop IKM, such as IKM SQL to Hive-HBase-File (SQOOP). Change the MapReduce Output Directory IKM property MAPRED_OUTPUT_BASE_DIR to an appropriate HDFS directory. Review all other properties and tune as necessary. Using these simple steps you should be able to perform a quick Sqoop load. 

For more information please review the great ODI Sqoop article from Benjamin Perez-Goytia, or read the ODI 12.1.3 documentation about Sqoop.

HBase support

ODI adds support for HBase as a source and target. HBase metadata can be reverse-engineered using the RKM HBase knowledge module, and HBase can be used as a source and target of a Hive transformation using LKM HBase to Hive and IKM Hive to HBase. Sqoop KMs also support HBase as a target for loads from a database. 
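
At the Hive level, HBase data is typically exposed through the HBase storage handler; a minimal sketch of such a table definition (the table and column family names here are hypothetical) looks like this;

  • CREATE EXTERNAL TABLE movie_ratings (movie_key STRING, rating INT)
  • STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  • WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,stats:rating")
  • TBLPROPERTIES ("hbase.table.name" = "movie_ratings");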

For more information please read the ODI 12.1.3 documentation about HBase.

Hive Append support

Prior to Hive 0.8 there was no direct way to append data to an existing table. Prior Hive KMs emulated such logic by renaming the existing table and concatenating old and new data into a new table with the prior name. This emulated append operation caused major data movement, particularly when the target table was large.

Starting with version 0.8 Hive has been enhanced to support appending. All ODI 12.1.3 Hive KMs have been updated to support the append capability by default but provide backward compatibility to the old behavior through the KM property HIVE_COMPATIBLE=0.7. 
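
In Hive terms the difference is roughly the following (the table names are just for illustration);

  • -- Hive 0.8 and later: append new rows directly
  • INSERT INTO TABLE movie_stage SELECT * FROM movie_new;
  • -- pre-0.8 emulation: rename the target, then rebuild it from the old data plus the new data
  • ALTER TABLE movie_stage RENAME TO movie_stage_old;
  • CREATE TABLE movie_stage AS
  • SELECT * FROM (SELECT * FROM movie_stage_old UNION ALL SELECT * FROM movie_new) u;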


ODI 12.1.3 provides an optimal and easy-to-use way to perform data integration in a Big Data environment. ODI utilizes the processing power of the data storage and processing environment rather than relying on a proprietary transformation engine. This core "ELT" philosophy has its perfect match in a Hadoop environment, where ODI can add unique value by providing a native and easy-to-use data integration environment.

Wednesday Jul 09, 2014

ODI 11g - HDFS Files to Oracle, community post from ToadWorld

There is a new tutorial on using ODI to move HDFS files into an Oracle database using OLH-OSCH from Deepak Vohra on the ToadWorld blog. This article covers all the setup required in great detail and will be very helpful if you're planning on integrating with HDFS files.


Wednesday Jul 02, 2014

Learn more about ODI and Apache Sqoop

The ODI A-Team just published a new article about moving data from relational databases into Hadoop using ODI and Apache Sqoop. Check out the blog post here: Importing Data from SQL databases into Hadoop with Sqoop and Oracle Data Integrator (ODI)

Wednesday Jan 29, 2014

ODI 12.1.2 Demo on the Oracle BigDataLite Virtual Machine

Update 4/14/2015: This blog post is outdated; instructions for the current Big Data Lite 4.1 are located here

Oracle's big data team has just announced the Oracle BigDataLite Virtual Machine, a pre-built environment to get you started on an environment reflecting the core software of Oracle's Big Data Appliance 2.4. BigDataLite is a VirtualBox VM that contains a fully configured Cloudera Hadoop distribution CDH 4.5, an Oracle DB 12c, Oracle's Big Data Connectors, Oracle Data Integrator 12.1.2, and other software.

You can use this environment to see ODI 12c in action integrating big data with Oracle DB using ODI's declarative graphical design, efficient E-LT loads, and Knowledge Modules designed to optimize big data integration. 

The sample data contained in BigDataLite represents the fictional Oracle MoviePlex on-line movie streaming company. The ODI sample performs the following two steps:

  • Pre-process application logs within Hadoop: All user activity on the MoviePlex web site is gathered on HDFS in Avro format. ODI is reading these logs through Hive and processes activities by aggregating, filtering, joining and unioning the records in an ODI flow-based mapping. All processing is performed inside Hive map-reduce jobs controlled by ODI, and the resulting data is stored in a staging table within Hive.
  • Loading user activity data from Hadoop into Oracle: The previously pre-processed data is loaded from Hadoop into an Oracle 12c database, where this data can be used as basis for Business Intelligence reports. ODI is using the Oracle Loader for Hadoop (OLH) connector, which executes distributed Map-Reduce processes to load data in parallel from Hadoop into Oracle. ODI is transparently configuring and invoking this connector through the Hive-to-Oracle Knowledge Module.

Both steps are orchestrated and executed through an ODI Package workflow. 

Demo Instructions

Please follow these steps to execute the ODI demo in BigDataLite:

  1. Download and install BigDataLite. Please follow the instructions in the Deployment Guide at the download page
  2. Start the VM and log in as user oracle, password welcome1.
  3. Start the Oracle Database 12c by double-clicking the icon on the desktop.

  4. Start ODI 12.1.2 by clicking the icon on the toolbar.

  5. Press Connect To Repository... on the ODI Studio window. 

  6. Press OK in the ODI Login dialog.

  7. Switch to the Designer tab, open the Projects accordion and expand the projects tree to Movie > First Folder > Mappings. Double-click on the mapping Transform Hive Avro to Hive Staging.

  8. Review the mapping that transforms source Avro data by aggregating, joining, and unioning data within Hive. You can also review the mapping Load Hive Staging to Oracle the same way. 


  9. In the Projects accordion expand the projects tree to Movie > First Folder > Packages. Double-click on the package Process Movie Data.

  10. The Package workflow for Process Movie Data opens. You can review the package.

  11. Press the Run icon on the toolbar. Press OK for the Run and Information: Session started dialogs. 

  12. You can follow the progress of the load by switching to the Operator tab and expanding All Executions and the upmost Process Movie Data entry. You can refresh the display by pressing the refresh button or setting Auto-Refresh. 

  13. Depending on the environment, the load can take 5-15 minutes. When the load is complete, the execution will show all green checkboxes. You can traverse the operator log and double-click entries to explore statistics and executed commands. 

This demo shows only some of the ODI big data capabilities. You can find more information about ODI's big data capabilities at:

Monday Nov 04, 2013

Big Data Matters with ODI12c

contributed by Mike Eisterer

On October 17th, 2013, Oracle announced the release of Oracle Data Integrator 12c (ODI12c).  This release signifies improvements to Oracle’s Data Integration portfolio of solutions, particularly Big Data integration.

Why Big Data = Big Business

Organizations are gaining greater insights and actionability through the increased storage, processing and analytical benefits offered by Big Data solutions.  New technologies and frameworks like HDFS, NoSQL, Hive and MapReduce support these benefits now. As more data is collected, analytical requirements increase, the complexity of managing transformations and aggregations of data compounds, and organizations are in need of scalable Data Integration solutions.

ODI12c provides enterprise solutions for the movement, translation and transformation of information and data heterogeneously and in Big Data Environments through:

  • The ability for existing ODI and SQL developers to leverage new Big Data technologies.
  • A metadata focused approach for cataloging, defining and reusing Big Data technologies, mappings and process executions.
  • Integration between many heterogeneous environments and technologies such as HDFS and Hive.
  • Generation of Hive Query Language.

Working with Big Data using Knowledge Modules

 ODI12c provides developers with the ability to define sources and targets and visually develop mappings to effect the movement and transformation of data.  As the mappings are created, ODI12c leverages a rich library of prebuilt integrations, known as Knowledge Modules (KMs).  These KMs are contextual to the technologies and platforms to be integrated.  Steps and actions needed to manage the data integration are pre-built and configured within the KMs. 

The Oracle Data Integrator Application Adapter for Hadoop provides a series of KMs, specifically designed to integrate with Big Data Technologies.  The Big Data KMs include:

  • Check Knowledge Module
  • Reverse Engineer Knowledge Module
  • Hive Transform Knowledge Module
  • Hive Control Append Knowledge Module
  • File to Hive (LOAD DATA) Knowledge Module
  • File-Hive to Oracle (OLH-OSCH) Knowledge Module 

Nothing to beat an Example:

To demonstrate the use of the KMs which are part of the ODI Application Adapter for Hadoop, a mapping may be defined to move data between files and Hive targets. 

The mapping is defined by dragging the source and target into the mapping, performing the attribute (column) mapping (see Figure 1) and then selecting the KM which will govern the process. 

In this mapping example, movie data is being moved from an HDFS source into a Hive table.  Some of the attributes, such as “CUSTID to custid”, have been mapped over.

Figure 1  Defining the Mapping

Before the proper KM can be assigned to define the technology for the mapping, it needs to be added to the ODI project.  The Big Data KMs have been made available to the project through the KM import process.   Generally, this is done prior to defining the mapping.

Figure 2  Importing the Big Data Knowledge Modules

Following the import, the KMs are available in the Designer Navigator.

Figure 3  The Project View in Designer, Showing Installed IKMs

Once the KM is imported, it may be assigned to the mapping target.  This is done by selecting the Physical View of the mapping and examining the Properties of the Target.  In this case MOVIAPP_LOG_STAGE is the target of our mapping.

Figure 4  Physical View of the Mapping and Assigning the Big Data Knowledge Module to the Target

Alternative KMs may have been selected as well, providing flexibility and abstracting the logical mapping from the physical implementation.  Our mapping may be applied to other technologies as well.

The mapping is now complete and is ready to run.  We will see more in a future blog about running a mapping to load Hive.

To complete the quick ODI for Big Data overview, let us take a closer look at what the IKM File to Hive is doing for us.  ODI provides differentiated capabilities by pre-building into the KM the process and steps which would normally have to be manually developed, tested and implemented.  As shown in Figure 5, the KM is preparing the Hive session, managing the Hive tables, performing the initial load from HDFS and then performing the insert into Hive.  HDFS and Hive options are selected graphically, as shown in the properties in Figure 4.

Figure 5  Process and Steps Managed by the KM
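
A rough idea of the kind of HiveQL those steps boil down to is sketched below - the file path, staging table and column names are purely illustrative, not the KM's actual generated code;

  • CREATE TABLE IF NOT EXISTS moviapp_log_tmp (custid INT, movieid INT, activity INT)
  •   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  • LOAD DATA INPATH '/user/odi/demo/movieapp_log' OVERWRITE INTO TABLE moviapp_log_tmp;
  • INSERT OVERWRITE TABLE moviapp_log_stage
  • SELECT custid, movieid, activity FROM moviapp_log_tmp;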

What’s Next

Big Data, being the shape-shifting business challenge it is, is fast evolving into the deciding factor between market leaders and the rest.

Now that an introduction to ODI and Big Data has been provided, look for additional blogs coming soon using the Knowledge Modules which make up the Oracle Data Integrator Application Adapter for Hadoop:

  • Importing Big Data Metadata into ODI, Testing Data Stores and Loading Hive Targets
  • Generating Transformations using Hive Query language
  • Loading Oracle from Hadoop Sources

For more information now, please visit the Oracle Data Integrator Application Adapter for Hadoop web site, http://www.oracle.com/us/products/middleware/data-integration/hadoop/overview/index.html

Do not forget to tune in to the ODI12c Executive Launch webcast on the 12th to hear more about ODI12c and GG12c.

Thursday Sep 12, 2013

Stream Relational Transactions into Big Data Systems

Are you one of the organizations adopting ‘big data systems’ to manage and analyze a class of data typically referred to as big data? If so, you may know that big data includes data that could be structured, semi-structured or unstructured, each of which originates from a variety of different sources.  Another characterization of big data is described by the data's volume, velocity, and veracity. Due to its promise to help harness the data deluge we are faced with, the adoption of big data solutions is becoming quite pervasive. In this blog post I’d like to discuss how to leverage Oracle GoldenGate’s real-time replication for big data systems.

The term 'big data systems' is an umbrella term used in general to discuss a wide variety of technologies, each of which is used for a specific purpose. Broadly speaking, big data technologies address batch, transactional, and real-time processing requirements. Using the appropriate big data technology is highly dependent on the use case being addressed.

While gaining business intelligence from transactional data continues to be a dominant factor in the decision making process, businesses have realized that gaining intelligence from other forms of data they have been collecting will enable them to achieve a more complete view, address additional business objectives, and lead to better decision making. The following examples illustrate, across various industry verticals, the forms of data involved and the objective the business attempts to achieve using them;

  • Practitioner’s notes, machine statistics: best practices and reduced hospitalization.
  • Weblog, click streams: micro-segmentation recommendations.
  • Weblogs, fraud reports: fraud detection, risk analysis.
  • Smart meter reading, call center data: real-time and predictive utilization analysis.

Role of transactional data

When using other forms of data for analytics, better contextual intelligence is obtained when the analysis is combined with transactional data. Low-latency transactional data especially brings additional value to dynamically changing operations that day-old data cannot deliver. In organizations, a vast majority of applications' transactional data is captured in relational databases. In order to ensure an efficient supply of transactional data for big data analytics, there are several requirements that the data integration solution should address:

  • Reliable change data capture and delivery mechanism
  • Minimal resource consumption when extracting data from the relational data source
  • Secured data delivery
  • Ability to customize data delivery
  • Support heterogeneous database sources
  • Easy to install, configure and maintain

A solution which can reliably stream database transactions to a desired target ensures that the effort is spent on data analysis rather than data acquisition. Also, a solution that is non-intrusive and minimally impacts the source database reduces the need for additional resources and changes on the source side.

Oracle GoldenGate is a time-tested and proven product for real-time, heterogeneous relational database replication. Oracle GoldenGate addresses the challenges listed above and is widely used by organizations for mission-critical data replication among relational databases. Furthermore, GoldenGate moves transactional data in real time to support timely operational business intelligence needs.

Oracle GoldenGate Integration Options for Big Data Analytics

There is a variety of integration options available with the Oracle GoldenGate product that facilitate delivering transactions from relational databases into non-relational targets.

Oracle GoldenGate provides pre-built adapters which integrate with Flat Files and Messaging Systems. Please refer to Oracle GoldenGate for Java - Administration Guide and Oracle GoldenGate for Flat Files -Administration Guide for more information.

Oracle GoldenGate also provides Java APIs and a framework for developing custom integrations to Java-enabled targets. Using this capability, custom adapters or handlers can be developed to address specific requirements. In this blog post I’d like to focus on the Oracle GoldenGate Java APIs for developing custom integrations to big data systems.

As we mentioned earlier, 'big data systems' is an umbrella term used in general to describe a wide variety of technologies, each of which is used for a specific purpose. Among the various big data systems, Hadoop and its suite of technologies are widely adopted by various organizations for processing big data. The diagram below illustrates a general high-level architecture for integrating with Hadoop.

[Diagram: a custom adapter deployed within the Oracle GoldenGate Pump process, configured via the Pump parameter file and the adapter properties file.]

You can implement a custom adapter or handler for the big data system using Oracle GoldenGate's Java API. The custom adapter is deployed as an integral part of the Oracle GoldenGate Pump process. The Pump and the custom adapter are configured through the Pump parameter file and the custom adapter's properties file, respectively. Depending upon the requirements, the properties for the custom adapter will need to be determined and implemented.

The Pump process will execute the adapter in its address space. The Pump reads the Trail File created by the Oracle GoldenGate Capture process and passes the transactions to the adapter. Based on the configuration, the adapter will write the transactions into Hadoop.

Enabling the co-existence of big data systems with relational systems will help organizations better serve customers and improve decision-making capabilities. Oracle GoldenGate, which has an excellent record of empowering IT on the various aspects of data management requirements, provides the capability to integrate with big data systems. In upcoming blog posts, we will discuss in depth the implementation and the configuration of integrating Oracle GoldenGate with Hadoop technologies.

