Tuesday Jul 07, 2015

Chalk Talk Video: Kick-Start Big Data Integration with Oracle

Next in the series of Oracle Data Integration chalk talk videos, we turn to Oracle Data Integrator (ODI) for big data. ODI allows you to become a big data developer without learning to code Java and MapReduce! ODI generates the code and optimizes it, with support for Hive, Spark, Oozie, and Pig.

View this video to learn more: Chalk Talk: Kick-Start Big Data Integration with Oracle.

For additional information on Oracle Data Integrator, visit the ODI homepage and the ODI for Big Data page. This blog post is also very handy: Announcing Oracle Data Integrator for Big Data.

Tuesday Jun 16, 2015

ODI - Hive DDL Generation for Parquet, ORC, Cassandra and so on

Here you'll see how, with some small surgical extensions, we can use ODI to generate complex integration models in Hive for modelling all kinds of challenges;

  • integrate data from Cassandra, or any arbitrary SerDe
  • use Hive Parquet or ORC storage formats
  • create partitioned, clustered tables
  • define arbitrarily complex attributes on tables

I'm very interested in hearing what you think of this, and what would be needed or useful over and above it (RKMs etc.).

When you use this technique to generate your models you can benefit from one of ODI's lesser known but very powerful features - the Common Format Designer (see here for a nice write-up). With this capability you can build models from other models, generate the DDL, and generate the mappings or interfaces to integrate the data from the source model to the target. This gets VERY interesting in the Big Data space since many customers want to get up and running with data they already know and love.

What do you need to get there? The following basic pieces;

  • a custom Hive create table action - get it here.
  • flex fields for the external table indicator, table data format and attribute type metadata - get the Groovy script here.

Pretty simple, right? If you take the simple example on the Confluence wiki;

    CREATE TABLE parquet_test (
        id INT,
        str STRING,
        mp MAP<STRING,STRING>,
        lst ARRAY<STRING>,
        strct STRUCT<A:STRING,B:STRING>)
    PARTITIONED BY (part STRING)
    STORED AS PARQUET;

...and think about how to model this in ODI, you'll wonder: how do I define the Parquet storage? How can the complex types be defined? That's where the flex fields come in - you can define any of the custom storage properties you wish in the datastore's Data Format flex field, then define the complex type formats.

In ODI the Parquet table above can be defined with the Hive datatypes as below; we don't capture the complex fields within the attributes mp, lst or strct. The partitioned column 'part' is added into the regular list of datastore attributes (like Oracle, unlike Hive DDL today);

Using the Data Format flex field we can specify the STORED AS PARQUET info; the table is not external to Hive, so we do not specify EXTERNAL in the external data flex field;


This value can have lots of ODI goodies in it, which makes it very versatile - you can use variables (these will be resolved when the generated procedure containing the DDL is executed), plus you can use <% style code substitution.
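For this example the Data Format value is essentially just the storage clause, as sketched below;

    STORED AS PARQUET

or, using an ODI variable that is resolved when the generated procedure runs, something like STORED AS #MOVIE_DEMO.STORAGE_FORMAT (the project and variable names here are made up purely for illustration).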

The partitioning information is defined for the part attribute; the 'Used for Partitioning' field is checked below;

Let's take stock: we have our table defined as Parquet and we have the partitioning info defined. Now we have to define the complex types. This is done on attributes mp, lst and strct using the creatively named Metadata field - below I have fully defined the MAP, STRUCT and ARRAY for the respective attributes;

Note that the struct example escapes the ':' because the code is executed via the JDBC driver when run through the agent, and ':' is the way of binding information to a statement execution (you can see the generated DDL further below with the ':' escaped).

With that we are complete, we can now generate DDL for our model;

This brings up a dialog with options for where to store the procedure containing the generated DDL - it has actually done a base compare against the Hive system and lets me select which datastores to create DDL for, or which DDL to create; I have checked the parquet_test datastore;

Upon completion, we can see the procedure which was created. We can choose to further edit and customize it if so desired, and we can schedule it or execute it any time we want.

Inspecting the Create Hive Table action, the DDL looks exactly like what we were after when we first started - the datastore schema / name will be resolved upon execution using ODI's runtime context - which raises another interesting point: we could, for example, have used some ODI substitution code (<%) when defining the storage format;
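Either way, the generated statement is essentially the DDL we set out to produce, with the schema resolved from the context at execution time; a rough sketch is below, where MOVIEDEMO is just a placeholder schema name and, in the actual generated procedure, the ':' characters inside the STRUCT are escaped as described earlier;

    CREATE TABLE MOVIEDEMO.parquet_test (
        id INT,
        str STRING,
        mp MAP<STRING,STRING>,
        lst ARRAY<STRING>,
        strct STRUCT<A:STRING,B:STRING>)
    PARTITIONED BY (part STRING)
    STORED AS PARQUET;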

As I mentioned earlier, you can take this approach to capture all sorts of table definitions for integration. If you want to look at how this is done on the inside, check the content of the Hive action in the image below (I linked the code of the action above); it should look very familiar if you have seen any KM code, and it uses the flex fields to inject the relevant information into the DDL;

This is a great example of what you can do with ODI and how to leverage it to work efficiently. 

There is a lot more that could be illustrated here, including what the RKM Hive can do or be modified to do. Perhaps creating the datastore and complex type information from JSON, a JSON schema, an Avro schema or Parquet would be useful? I'd be very interested to hear your comments and thoughts. So let me know....

Thursday Apr 09, 2015

ODI, Big Data SQL and Oracle NoSQL

Back in January Anuj posted an article here on using Oracle NoSQL via the Oracle Database Big Data SQL feature. In this post - part 2 of Anuj's, I guess you could call it - I will follow up with how the Oracle external table is configured and how it all hangs together, both with manual code and via ODI. For this I used the Big Data Lite VM and also the newly released Oracle Data Integrator Big Data option. The BDA Lite VM 4.1 release uses version 3.2.5 of Oracle NoSQL; from this release I used the new declarative DDL for Oracle NoSQL to project the shape from NoSQL, with some help from Anuj.

My goal for the integration design is to show a logical design in ODI and how KMs are used to realize the implementation and leverage Oracle Big Data SQL - this integration design supports predicate pushdown, so I actually minimize the data moved between my NoSQL store on Hadoop and the Oracle database - think speed and scalability! My NoSQL store contains user movie recommendations. I want to join this with reference data in Oracle, which includes the customer, movie and genre information, and store the result in a summary table.

Here is the code to create and load the recommendation data in NoSQL - this would normally be computed by another piece of application logic in a real world scenario;

    export KVHOME=/u01/nosql/kv-3.2.5
    cd /u01/nosql/scripts
    ./admin.sh

    connect store -name kvstore
    EXEC "CREATE TABLE recommendation( \
             custid INTEGER, \
             sno INTEGER, \
             genreid INTEGER,\
             movieid INTEGER,\
             PRIMARY KEY (SHARD(custid), sno, genreid, movieid))"
    PUT TABLE -name RECOMMENDATION -file /home/oracle/movie/moviework/bigdatasql/nosqldb/user_movie.json

The Manual Approach

This example is using the new data definition language in NoSQL. To make this accessible via Hive, users can create Hive external tables that use the NoSQL Storage Handler provided by Oracle. If this were manually coded in Hive, we could define the table as follows;

    CREATE EXTERNAL TABLE IF NOT EXISTS recommendation(
        custid INT,
        sno INT,
        genreId INT,
        movieId INT)
    STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
    TBLPROPERTIES ( "oracle.kv.kvstore"="kvstore",
                    "oracle.kv.hosts"="localhost:5000",
                    "oracle.kv.hadoop.hosts"="localhost",
                    "oracle.kv.tableName"="recommendation");
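If you want to sanity-check the Hive definition before going any further, a quick query along these lines (run from the Hive shell; the predicate value is just borrowed from the Oracle query further down) should pull rows straight out of the NoSQL store;

    SELECT custid, sno, genreid, movieid
    FROM recommendation
    WHERE custid = 1255601;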

At this point we have made NoSQL accessible to many components in the Hadoop stack - pretty much every component in the Hadoop ecosystem can leverage the HCatalog entries defined, be they Hive, Pig, Spark and so on. We are looking at Oracle Big Data SQL though, so let's see how that is achieved. We must define an Oracle external table that uses either the SerDe or a Hive table; below you can see how the table has been defined in Oracle;

    CREATE TABLE recommendation(
        custid NUMBER,
        sno NUMBER,
        genreid NUMBER,
        movieid NUMBER
    )
    ORGANIZATION EXTERNAL
    (
        TYPE ORACLE_HIVE
        DEFAULT DIRECTORY DEFAULT_DIR
        ACCESS PARAMETERS (
            com.oracle.bigdata.tablename=default.recommendation
        )
    );

Now we are ready to write SQL! Really!? Well, let's see - below is the type of query we can run to join the NoSQL data with our Oracle reference data;

    SELECT m.title, g.name, c.first_name
    FROM recommendation r, movie m, genre g, customer c
    WHERE r.movieid=m.movie_id AND r.genreid=g.genre_id AND r.custid=c.cust_id AND r.custid=1255601 AND r.sno=1
    ORDER BY r.sno, r.genreid;

Great, we can now access the data from Oracle - we benefit from the scalability of the solution and minimal data movement! Let's make it better: more maintainable, more flexible to future change and accessible to more people, by showing how it is done in ODI.

Oracle Data Integrator Approach

The data in NoSQL has a shape, and we can capture that shape in ODI just as it is defined in NoSQL. We can then design mappings that manipulate the shape and load it into whatever target we like. The SQL we saw above is represented in a logical mapping as below;


Users get the same design experience as with other data items and benefit from the mapping designer. They can join, map and transform just as normal. The ODI designer allows you to separate how you physically want this to happen from the logical semantics - this is all about giving you the flexibility to change and adapt to new integration technologies and patterns.

In the physical design we can assign Knowledge Modules that take on the responsibility of building the integration objects we manually coded above. These KMs are generic, so they support all shapes and sizes of data items. Below you can see how the LKM is assigned for accessing Hive from Oracle;

This KM takes the role of building the external table - you can take it, use it, customize it, and the logical design stays the same. Why is that important? Integration recipes CHANGE as we learn more and developers build newer and better mechanisms to integrate.

This KM takes care of creating the external table in Hive that accesses our NoSQL system. You could also have manually built the external table, imported it into ODI and used that as the source for the mapping; I want to show how the raw items can be integrated, because the more metadata we have and use in the design, the greater the flexibility in the future. The LKM Oracle NoSQL to Hive uses regular KM APIs to build the access object; here is a snippet from the KM;

    create table <%=odiRef.getObjectName("L", odiRef.getTableName("COLL_SHORT_NAME"), "W")%>
    <%=odiRef.getColList("(", "[COL_NAME] [DEST_CRE_DT]", ", ", ")", "")%>
    STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
    TBLPROPERTIES ( "oracle.kv.kvstore"="<%=odiRef.getInfo("SRC_SCHEMA")%>",
                    "oracle.kv.hosts"="<%=odiRef.getInfo("SRC_DSERV_NAME")%>",
                    "oracle.kv.hadoop.hosts"="localhost",
                    "oracle.kv.tableName"="<%=odiRef.getSrcTablesList("", "[TABLE_NAME]", ", ", "")%>");

You can see the templatized code versus the literals; this still needs some work, as you can see - can you spot some hard-wiring that needs to be fixed? ;-) This was done using the 12.1.3.0.1 Big Data option of ODI, so integration with Hive is much improved and it leverages the DataDirect driver, which is also a big improvement. In this post I created a new technology for Oracle NoSQL in ODI - you can do this too for anything you want. I will post this technology on java.net and more, so that as a community we can learn and share.
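For instance, the hard-wired oracle.kv.hadoop.hosts value could be lifted into a KM option (KV_HADOOP_HOSTS is just an illustrative name, not a real option of this KM) and referenced via the standard substitution API, roughly like this;

    "oracle.kv.hadoop.hosts"="<%=odiRef.getOption("KV_HADOOP_HOSTS")%>",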

Summary 

Here we have seen how we can make seemingly complex integration tasks quite simple and leverage the best of data integration technologies today and importantly in the future!


Thursday Mar 26, 2015

Oracle Big Data Lite 4.1.0 is available with more on Oracle GoldenGate and Oracle Data Integrator

Oracle's big data team has announced the newest Oracle Big Data Lite Virtual Machine 4.1.0.  This newest Big Data Lite Virtual Machine contains great improvements from a data integration perspective with inclusion of the recently released Oracle GoldenGate for Big Data.  You will see this in an improved demonstration that highlights inserts, updates, and deletes into Hive using Oracle GoldenGate for Big Data with Oracle Data Integrator performing a merge of the new operations into a consolidated table.

Big Data Lite is a pre-built environment which includes many of the key capabilities for Oracle's big data platform.   The components have been configured to work together in this Virtual Machine, providing a simple way to get started in a big data environment.  The components include Oracle Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator, Oracle GoldenGate amongst others. 

Big Data Lite also contains hands-on labs and demonstrations to help you get started using the system.  Tame Big Data with Oracle Data Integration is a hands-on lab that teaches you how to design Hadoop data integration using Oracle Data Integrator and Oracle GoldenGate. 

Start here to learn more! Enjoy!

Thursday Feb 19, 2015

Introducing Oracle GoldenGate for Big Data!

Big data systems and big data analytics solutions are becoming critical components of modern information management architectures. Organizations realize that by combining structured transactional data with semi-structured and unstructured data they can realize the full potential value of their data assets and achieve enhanced business insight. Businesses also recognize that in today's fast-paced, digital business environment, low-latency access to data is essential to being agile and responding with immediacy. Low-latency transactional data brings additional value, especially for dynamically changing operations, that day-old data, structured or unstructured, cannot deliver.

Today we announced the general availability of the Oracle GoldenGate for Big Data product, which offers a platform for streaming real-time transactional data into big data systems. By providing easy-to-use, real-time data integration for big data systems, Oracle GoldenGate for Big Data facilitates improved business insight for a better customer experience. It also allows IT organizations to quickly move ahead with their big data projects without extensive training and management resources. Oracle GoldenGate for Big Data's real-time data streaming platform also allows customers to keep their big data reservoirs up to date with their production systems.

Oracle GoldenGate's fault-tolerant, secure and flexible architecture shines in this new big data streaming offering as well. Customers can enjoy secure and reliable data streaming with subsecond latency. Oracle GoldenGate's core log-based change data capture capabilities enable real-time streaming without degrading the performance of the source production systems.

The new offering, Oracle GoldenGate for Big Data, provides integration for Apache Flume, Apache HDFS, Apache Hive and Apache HBase. It also includes Oracle GoldenGate for Java, which enables customers to easily integrate with additional big data systems, such as Oracle NoSQL, Apache Kafka, Apache Storm, Apache Spark, and others.

You can learn more about our new offering via the Oracle GoldenGate for Big Data data sheet and by registering for our upcoming webcast:

How to Future-Proof your Big Data Integration Solution

March 5th, 2015 10am PT/ 1pm ET

I invite you to join this webcast to learn from Oracle and Cloudera executives how to future-proof your big data infrastructure. The webcast will discuss:

  • Selection criteria that will drive business results with Big Data Integration 
  • Oracle's new big data integration and governance offerings, including Oracle GoldenGate for Big Data
  • Oracle’s comprehensive big data features in a unified platform 
  • How Cloudera Enterprise Data Hub and Oracle Data Integration combine to offer complementary features to store data in full fidelity, and to transform and enrich the data for increased business efficiency and insight.

Hope you can join us and ask your questions to the experts.

Hive, Pig, Spark - Choose your Big Data Language with Oracle Data Integrator

The strength of Oracle Data Integrator (ODI) has always been the separation of logical design and physical implementation. Users can define a logical transformation flow that maps any sources to targets without being concerned with the exact mechanisms used to realize such a job. In fact, ODI doesn't have its own transformation engine but instead outsources all work to the native mechanisms of the underlying platforms, be it relational databases, data warehouse appliances, or Hadoop clusters.

In the case of Big Data this philosophy of ODI gains even more importance. New Hadoop projects are incubated and released on a constant basis and introduce exciting new capabilities; the combined brain trust of the big data community conceives new technology that outdoes any proprietary ETL engine. ODI’s ability to separate your design from the implementation enables you to pick the ideal environment for your use case; and if the Hadoop landscape evolves, it is easy to retool an existing mapping with a new physical implementation. This way you don’t have to tie yourself to one language that is hyped this year, but might be legacy in the next.

ODI turns the logical design into executable code through physical designs and Knowledge Modules. You can even define multiple physical designs for different languages based on the same logical design. For example, you could choose Hive as your transformation platform, and ODI would generate Hive SQL as the execution language. You could also pick Pig, and the generated code would be Pig Latin. If you choose Spark, ODI will generate PySpark code, which is Python with Spark APIs. Knowledge Modules orchestrate the generation of code for the different languages and can be further configured to optimize the execution of the different implementations, for example parallelism in Pig or in-memory caching in Spark.

The example below shows an ODI mapping that reads from a log file in HDFS, registered in HCatalog. It gets filtered, aggregated, and then joined with another table, before being written into another HCatalog-based table. ODI can generate code for Hive, Pig, or Spark based on the Knowledge Modules chosen. 
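As a rough sketch (the table and column names below are invented purely for illustration, not the actual demo schema), the Hive flavor of such a filter-aggregate-join mapping boils down to a statement of this shape, while the Pig and Spark physical designs express the same logic in Pig Latin and PySpark respectively;

    INSERT INTO TABLE session_summary
    SELECT l.user_id, u.country, COUNT(*) AS page_views
    FROM web_logs l
    JOIN user_dim u ON (l.user_id = u.user_id)
    WHERE l.status = 200
    GROUP BY l.user_id, u.country;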

ODI provides developer productivity and can future-proof your investment by removing the need to manually code Hadoop transformations in a particular language. You can logically design your mapping and then choose the implementation that best suits your use case.

Wednesday Nov 12, 2014

ODI 12c - Spark SQL and Hive?

In this post I'll cover some new capabilities in the Apache Spark 1.1 release and show what they mean to ODI today. There's a nice slide, shown below, from the Databricks training for Spark SQL that pitches some of the Spark SQL capabilities now available. As well as programmatic access via Python, Scala and Java, the Hive QL compatibility within Spark SQL is particularly interesting for ODI...... today. The Spark 1.1 release supports a subset of the Hive QL features, which in turn is a subset of ANSI SQL; there is already a lot there and it is only going to grow. The Hive engine today uses MapReduce, which is not fast; the Spark engine is fast and in-memory - you can read much more on that elsewhere.

Figure taken from the Databricks training for Spark SQL, July 2014.

In the examples below I used the Oracle Big Data Lite VM; I downloaded the Spark 1.1 release and built it using Maven (I was on CDH 5.2). To use Spark SQL in ODI, we need to create a Hive data server - the Hive data server masquerades as many things: it can be used for Hive, for HCatalog or for Spark SQL. Below you can see my data server; note the Hive port is 10001, whereas by default 10000 is the Hive server port - we aren't using the Hive server to execute the query, here we are using the Spark SQL server. I will show later how I started the Spark SQL server on this port (the Apache Spark doc for this is here).

I started the server against the Spark standalone cluster that I had configured, using the following command from my Spark 1.1 installation;

./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.bind.host bigdatalite --hiveconf hive.server2.thrift.port 10001 --master spark://192.168.122.1:7077

You can also specify local (for test), Yarn or other cluster information for the master. I could just as easily have started the server on Yarn by specifying the master URI as something like --master yarn://192.168.122.1:8032, where 8032 is my Yarn resource manager port. I ran using the 10001 port so that I can run both the Spark SQL and Hive engines in parallel whilst I do various tests. To reverse engineer, I actually used the Hive engine to reverse engineer the table definitions in ODI (I hit some problems using the Spark SQL reversing, so worked around it) and then changed the model to use my newly created Spark SQL data server above.

Then I built my mappings just like normal - and used the KMs in ODI for Hive just like normal. For example the mapping below aggregates movie ratings and then joins with movie reference data to load movie rating data - the mapping uses the datastores from a model obtained from the Hive metastore;

If you look at the physical design, the Hive KMs are assigned but we will execute this through the Spark SQL engine rather than through Hive. The switch from engine to engine was handled in the URL within our Hive data server.

When the mapping is executed you can use the Spark monitoring API to check the status of the running application and Spark master/workers.

You can also monitor from the regular ODI Operator and ODI Console. Spark SQL uses the Hive metastore for all the table definitions, be they internally or externally managed data.
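A quick way to convince yourself that both engines really do share the same metastore is to point a client at port 10000 and then at 10001 and run a couple of metadata statements; the ones below are illustrative (the movie table name comes from the demo model) and work in both Hive and Spark SQL;

    SHOW TABLES;
    DESCRIBE movie;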

There are other blog posts showing how to access and use Spark SQL from tools, such as this one from Antoine Amend using SQL Developer. Antoine also has another very cool blog worth checking out: Processing GDELT Data Using Hadoop. In this post he shows a custom InputFormat class that produces records/columns. It is a very useful post for anyone wanting to see the Spark newAPIHadoopFile API in action. It has a pretty funky name, but is a key piece (along with its related methods) of the framework.

    // Read file from HDFS - Use GdeltInputFormat
    val input = sc.newAPIHadoopFile(
       "hdfs://path/to/gdelt",
       classOf[GdeltInputFormat],
       classOf[Text],
       classOf[Text]
    )
Antoine also provides the source code to GdeltInputFormat so you can see the mechanics of his record reader; although the input data is delimited (so this could have been achieved in different ways), it's a useful resource to be aware of.

If you are looking at Spark SQL, this post was all about using Spark SQL via the JDBC route - there is a whole other topic on transformations using the Spark framework alongside Spark SQL that is for future discussion. You should be aware of, and check out, the Hive QL compatibility documentation here to see what you can and can't do within Spark SQL today. Download the BDA Lite VM and give it a try.

Wednesday Oct 01, 2014

Streaming Relational Transactions to Flume using Oracle GoldenGate

In prior articles, we have introduced architectures for streaming transactions from Oracle GoldenGate to HDFS, Hive, and HBase. In this article we are adding to this list by presenting a solution for streaming transactions into the Flume service. 

Apache Flume is a distributed, fault-tolerant, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. It can be configured with a variety of distribution mechanisms, such as delivery to multiple clusters, or rolling of HDFS files based on time or size.

As shown in the diagram below, streaming database transactions to Flume is accomplished by developing a custom handler using Oracle GoldenGate's Java API and Flume's Avro RPC APIs.

The custom handler is deployed as an integral part of the Oracle GoldenGate Pump process.   The Pump process and the custom adapter are configured through the Pump parameter file and custom adapter's properties file. The Pump process executes the adapter in its address space. The Pump reads the Trail File created by the Oracle GoldenGate Capture process and passes the transactions to the adapter. The adapter then writes the transactions to a Flume Avro RPC source at the given host/port defined in the configuration file. The Flume Agent streams the data to the final destination; in the supplied example Flume writes into an HDFS directory for subsequent consumption by Hive. 

A sample implementation of the Flume Adapter for Oracle GoldenGate is provided at the My Oracle Support site as Knowledge Base article 1926867.1.

Friday Sep 26, 2014

Oracle GoldenGate and Oracle Data Integrator on the Oracle BigDataLite 4.0 Virtual Machine

Oracle's big data team has just announced the Oracle BigDataLite Virtual Machine 4.0, a pre-built environment to get you started on an environment reflecting the core software of Oracle's Big Data Appliance 4.0. BigDataLite is a VirtualBox VM that contains a fully configured Cloudera Hadoop distribution CDH 5.1.2, Oracle DB 12.1.0.2.0 with Big Data SQL, Oracle's Big Data Connectors, Oracle Data Integrator 12.1.3, Oracle GoldenGate, and other software.

The demos for Oracle GoldenGate and Oracle Data Integrator show an end-to-end use case of the fictional Oracle MoviePlex on-line movie streaming company. They show how to load data into a Data Reservoir in real time, process and transform the data using the power of Hadoop, and utilize the data in an Oracle data warehouse, either by loading the data with Oracle Loader for Hadoop or by using Hive tables within Oracle SQL queries with Oracle Big Data SQL.

Please follow the demo instructions to try out all these steps yourself. If you would like to build the ODI mappings from scratch, try our ODI and Big Data Hands-on Lab.  Enjoy! 

Sunday Jul 13, 2014

New Big Data Features in ODI 12.1.3

Oracle Data Integrator (ODI) 12.1.3 extends its Hadoop capabilities through a number of exciting new capabilities. The new features include:

  • Loading of RDBMS data from and to Hadoop using Sqoop
  • Support for Apache HBase databases
  • Support for Hive append functionality

With these new additions ODI provides full connectivity to load, transform, and unload data in a Big Data environment.

The diagram below shows all ODI Hadoop knowledge modules with KMs added in ODI 12.1.3 in red. 

Sqoop support

Apache Sqoop is designed for efficiently transferring bulk amounts of data between Hadoop and relational databases such as Oracle, MySQL, Teradata, DB2, and others. Sqoop operates by creating multiple parallel map-reduce processes across a Hadoop cluster, connecting to the external database, and transferring data to or from Hadoop storage in a partitioned fashion. Data can be stored in Hadoop using HDFS, Hive, or HBase. ODI adds two knowledge modules: IKM SQL to Hive-HBase-File (SQOOP) and IKM File-Hive to SQL (SQOOP).

Loading from and to Sqoop in ODI is straightforward. Create a mapping with the database source and Hadoop target (or vice versa) and apply any necessary transformation expressions.

In the physical design of the mapping, make sure to set the LKM of the target to LKM SQL Multi-Connect.GLOBAL and choose a Sqoop IKM, such as IKM SQL to Hive-HBase-File (SQOOP). Change the MapReduce output directory IKM property MAPRED_OUTPUT_BASE_DIR to an appropriate HDFS directory. Review all other properties and tune as necessary. Using these simple steps you should be able to perform a quick Sqoop load.

For more information please review the great ODI Sqoop article from Benjamin Perez-Goytia, or read the ODI 12.1.3 documentation about Sqoop.

HBase support

ODI adds support for HBase as a source and target. HBase metadata can be reverse-engineered using the RKM HBase knowledge module, and HBase can be used as source and target of a Hive transformation using LKM HBase to Hive and IKM Hive to HBase. Sqoop KMs also support HBase as a target for loads from a database. 

For more information please read the ODI 12.1.3 documentation about HBase.

Hive Append support

Prior to Hive 0.8 there was no direct way to append data to an existing table. Earlier Hive KMs emulated such logic by renaming the existing table and concatenating old and new data into a new table with the prior name. This emulated append operation caused major data movement, particularly when the target table was large.

Starting with version 0.8 Hive has been enhanced to support appending. All ODI 12.1.3 Hive KMs have been updated to support the append capability by default but provide backward compatibility to the old behavior through the KM property HIVE_COMPATIBLE=0.7. 
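In Hive QL terms the difference is essentially INSERT INTO versus rebuilding the target with INSERT OVERWRITE; a minimal sketch with made-up table names is below (the KMs generate this from your mapping, so the statements here are purely illustrative);

    -- Hive 0.8+ behavior: new rows are simply appended to the target
    INSERT INTO TABLE movie_rating_fact
    SELECT * FROM movie_rating_stage;

    -- Simplified view of the pre-0.8 emulation: the target is rebuilt from old plus new data
    INSERT OVERWRITE TABLE movie_rating_fact
    SELECT * FROM (
      SELECT * FROM movie_rating_fact_old
      UNION ALL
      SELECT * FROM movie_rating_stage
    ) unioned;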

Conclusion

ODI 12.1.3 provides an optimal and easy-to-use way to perform data integration in a Big Data environment. ODI utilizes the processing power of the data storage and processing environment rather than relying on a proprietary transformation engine. This core "ELT" philosophy has its perfect match in a Hadoop environment, where ODI can provide unique value with a native and easy-to-use data integration environment.

Thursday Jul 03, 2014

ODI 11g - Hive to Oracle with OLH-OSCH, community post from ToadWorld

There is a new blog post on using ODI to move data from Hive to Oracle using OLH-OSCH, from Deepak Vohra on the ToadWorld blog. It covers everything from installation to the configuration of paths and configuration files. So if you are going down this route it's worth checking out; he goes into great detail on everything that needs to be done and set up.

http://www.toadworld.com/platforms/oracle/w/wiki/10955.integrating-apache-hive-table-data-with-oracle-database-11g-in-oracle-data-integrator-11g.aspx

Big thanks to Deepak for sharing his experiences and providing the blog to get folk up and running. 

Wednesday Jan 29, 2014

ODI 12.1.2 Demo on the Oracle BigDataLite Virtual Machine

Update 4/14/2015: This blog is outdated, instructions for the current Big Data Lite 4.1 are located here

Oracle's big data team has just announced the Oracle BigDataLite Virtual Machine, a pre-built environment to get you started on an environment reflecting the core software of Oracle's Big Data Appliance 2.4. BigDataLite is a VirtualBox VM that contains a fully configured Cloudera Hadoop distribution CDH 4.5, an Oracle DB 12c, Oracle's Big Data Connectors, Oracle Data Integrator 12.1.2, and other software.

You can use this environment to see ODI 12c in action integrating big data with Oracle Database using ODI's declarative graphical design, efficient E-LT loads, and Knowledge Modules designed to optimize big data integration.

The sample data contained in BigDataLite represents the fictional Oracle MoviePlex on-line movie streaming company. The ODI sample performs the following two steps:

  • Pre-process application logs within Hadoop: All user activity on the MoviePlex web site is gathered on HDFS in Avro format. ODI is reading these logs through Hive and processes activities by aggregating, filtering, joining and unioning the records in an ODI flow-based mapping. All processing is performed inside Hive map-reduce jobs controlled by ODI, and the resulting data is stored in a staging table within Hive.
  • Loading user activity data from Hadoop into Oracle: The previously pre-processed data is loaded from Hadoop into an Oracle 12c database, where this data can be used as basis for Business Intelligence reports. ODI is using the Oracle Loader for Hadoop (OLH) connector, which executes distributed Map-Reduce processes to load data in parallel from Hadoop into Oracle. ODI is transparently configuring and invoking this connector through the Hive-to-Oracle Knowledge Module.

Both steps are orchestrated and executed through an ODI Package workflow. 

Demo Instructions

Please follow these steps to execute the ODI demo in BigDataLite:

  1. Download and install BigDataLite. Please follow the instructions in the Deployment Guide at the download page.
  2. Start the VM and log in as user oracle, password welcome1.
  3. Start the Oracle Database 12c by double-clicking the icon on the desktop.


  4. Start ODI 12.1.2 by clicking the icon on the toolbar.


  5. Press Connect To Repository... on the ODI Studio window. 


  6. Press OK in the ODI Login dialog.


  7. Switch to the Designer tab, open the Projects accordion and expand the projects tree to Movie > First Folder > Mappings. Double-click on the mapping Transform Hive Avro to Hive Staging.


  8. Review the mapping that transforms source Avro data by aggregating, joining, and unioning data within Hive. You can also review the mapping Load Hive Staging to Oracle the same way. 


  9. In the Projects accordion expand the projects tree to Movie > First Folder > Packages. Double-click on the package Process Movie Data.


  10. The Package workflow for Process Movie Data opens. You can review the package.


  11. Press the Run icon on the toolbar. Press OK for the Run and Information: Session started dialogs. 




  12. You can follow the progress of the load by switching to the Operator tab and expanding All Executions and the upmost Process Movie Data entry. You can refresh the display by pressing the refresh button or setting Auto-Refresh. 


  13. Depending on the environment, the load can take 5-15 minutes. When the load is complete, the execution will show all green check marks. You can traverse the operator log and double-click entries to explore statistics and executed commands.

This demo shows only some of the ODI big data capabilities. You can find more information about ODI's big data capabilities at:


Wednesday Oct 09, 2013

Streaming relational transactions to Hive

Following the introductory blog post on the topic - 'Stream your transactions into Big Data Systems' - and the blog post on 'Streaming relational transactions to HDFS', in this post I will discuss the architecture for streaming relational transactions into Hive.

Referring to the architecture diagram below, integrating the database with Hive is accomplished by developing a custom handler using Oracle GoldenGate's Java API and the Hadoop HDFS APIs.

The custom handler is deployed as an integral part of the Oracle GoldenGate Pump process. The Pump process and the custom adapter are configured through the Pump parameter file and the custom adapter's properties file. The Pump process executes the adapter in its address space. The Pump reads the Trail File created by the Oracle GoldenGate Capture process and passes the transactions to the adapter. Based on the configuration, the adapter writes the transactions, in the desired format and with the appropriate content, to a file whose layout is defined by the Hive DDL for the table.

A sample implementation of the Hive adapter is provided on My Oracle Support (Knowledge ID - 1586188.1). This is provided to illustrate the capability and to assist in the adoption of the Oracle GoldenGate Java API in developing custom solutions. The sample implementation illustrates the configuration and the code required for replicating database transactions on an example table to a corresponding Hive table. The instructions for configuring Oracle GoldenGate, compiling and running the sample implementation are also provided.

The sample code and configuration may be extended to develop custom solutions, however, please note that Oracle will not provide support for the code and the configuration illustrated in the knowledge base paper.

It would be great if you could share your use case about leveraging Oracle GoldenGate in your Big Data strategy and your feedback on using the custom handler for integrating relational database with Hive. Please post your comments in this blog or in the Oracle GoldenGate public forum - https://forums.oracle.com/community/developer/english/business_intelligence/system_management_and_integration/goldengate

Tuesday Jan 15, 2013

ODI - Hive and MongoDB

I've been experimenting with another Hive storage handler, this time for MongoDB; there are a few out there, including this one from MongoDB. The one I have been using supports basic primitive types and also supports read and write, using the standard approach of a storage handler class and custom properties to describe the data mask. This lets you access MongoDB via a Hive external table very easily and abstracts away a lot of integration complexity, which also makes it ideal for use in ODI. I have been using it on my Linux VM, where I have Hive running, to access MongoDB running on another machine. The storage handler is found here; I used it to access the same example I blogged about here. Below is the external table definition;

    ADD JAR /home/oracle/mongo/hive-mongo.jar;

    create external table mongo_emps(EMPNO string, ENAME string, SAL int)
    stored by "org.yong3.hive.mongo.MongoStorageHandler"
    with serdeproperties ( "mongo.column.mapping" = "EMPNO,ENAME,SAL" )
    tblproperties ( "mongo.host" = "<my_mongo_ipaddress>", "mongo.port" = "27017",
    "mongo.db" = "test", "mongo.collection" = "myColl" );

Very simple. A nice aspect of the Hive external table is the SerDe properties that can be specified - simple, but it provides a nice, flexible approach. I can then reverse engineer this into ODI (see the reverse engineering post here) and use it in my Hive integration mappings to read from, and potentially write to, MongoDB.
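Once the table is reversed into ODI (or just from the Hive shell) it behaves like any other Hive datastore; an illustrative query against the sample data might look like this (the predicate value is made up);

    SELECT empno, ename, sal
    FROM mongo_emps
    WHERE sal > 2500;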

The supported primitive types can also project nested document types; for example, in the document below (taken from here), name, contribs and awards are strings in Hive but contain JSON structures;

    {
      "_id" : 1,
      "name" : {
        "first" : "John",
        "last" : "Backus"
      },
      "birth" : ISODate("1924-12-03T05:00:00Z"),
      "death" : ISODate("2007-03-17T04:00:00Z"),
      "contribs" : [ "Fortran", "ALGOL", "Backus-Naur Form", "FP" ],
      "awards" : [
        {
          "award" : "W.W. McDowell Award",
          "year" : 1967,
          "by" : "IEEE Computer Society"
        },
        {
          "award" : "National Medal of Science",
          "year" : 1975,
          "by" : "National Science Foundation"
        },
        {
          "award" : "Turing Award",
          "year" : 1977,
          "by" : "ACM"
        },
        {
          "award" : "Draper Prize",
          "year" : 1993,
          "by" : "National Academy of Engineering"
        }
      ]
    }

can be processed with the following external table definition, which then can be used in ODI;

    create external table mongo_bios(name string, birth string, death string, contribs string, awards string)
    stored by "org.yong3.hive.mongo.MongoStorageHandler"
    with serdeproperties ( "mongo.column.mapping" = "name,birth,death,contribs,awards" )
    tblproperties ( "mongo.host" = "<my_ip_address>", "mongo.port" = "27017",
    "mongo.db" = "test", "mongo.collection" = "bios" );
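Since the nested pieces surface as plain JSON strings, you can still pick fields out of them in Hive - and therefore in an ODI mapping expression - with get_json_object; a small illustrative query against the bios table;

    SELECT get_json_object(name, '$.first') AS first_name,
           get_json_object(name, '$.last')  AS last_name
    FROM mongo_bios;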

All very simple, and that's what makes it so appealing. Anyway, that's a quick follow-on, using external tables with MongoDB and Hive, to the SQL-oriented approach I described here that used Java table functions.

Wednesday Jan 02, 2013

ODI - Hive and NoSQL, the code

This post includes the Java client demonstration code used in the Hive and NoSQL post illustrated here. The BasicBigData.java code is a NoSQL client which populates a key-value store that is queryable using the Hive external table from that post. It didn't take long to code - just a few peeks at the NoSQL javadoc to get it going. You can take this Java code and compile and run it (instructions for compiling are similar to the verification demo here - it is very easy).

The Java code uses the NoSQL major/minor path constructor to describe the Key; below is a snippet to define the birthdate for Bob Smith;

    ArrayList<String> mjc1 = new ArrayList<String>();
    mjc1.add("Smith");
    mjc1.add("Bob");
    ...
    ArrayList<String> mnrb = new ArrayList<String>();
    mnrb.add("birthdate");
    ...
    store.put(Key.createKey(mjc1,mnrb),Value.createValue("05/02/1975".getBytes()));
    ...

In the referenced post, to actually aggregate the key values, we used the Hive collect_set aggregation function (see here for Hive aggregation functions). The collect_set aggregation function returns a set of objects with duplicates eliminated. To get the aggregation function behavior in ODI with the correct GROUP BY, we must tell ODI about the Hive aggregation function. We can define a new language element for collect_set in the Topology tree, define the element as a group function, and also define the expression for Hive under the Implementation tab;

We are then able to define expressions which reference this aggregation function and get the exact syntax defined in the earlier post. Below we see the Hive expressions using collect_set;

From this design and the definition of the aggregation function in ODI, when it is executed you can see the generated Hive QL with the correct columns in the grouping function;
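The generated statement ends up along these lines - the datastore and column names below are placeholders standing in for the external table and columns from the earlier post, purely for illustration;

    SELECT lastname, firstname, collect_set(key_value) AS key_values
    FROM nosql_keyvalue_ext
    GROUP BY lastname, firstname;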

The target Hive datastore in the interface I defined has been loaded with the key values from the NoSQL key store - cool!

Those are a few of the missing pieces which would let you query NoSQL through Hive external tables, hopefully some useful pointers. 

About

Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality
