Tuesday Jun 16, 2015

ODI - Hive DDL Generation for Parquet, ORC, Cassandra and so on

Here you'll see how with some small surgical extensions we can use ODI to generate complex integration models in Hive for modelling all kinds of challenges;

  • integrate data from Cassandra, or any arbitrary SerDe
  • use Hive Parquet or ORC storage formats
  • create partitioned, clustered tables
  • define arbitrarily complex attributes on tables

I'm very interested in hearing what you think of this, and what else would be needed or useful over and above it (RKMs and so on).

When you use this technique to generate your models you can benefit from one of ODI's lesser known but very powerful features - the Common Format Designer (see here for a nice write up). With this capability you can build models from other models, generate the DDL, and generate the mappings or interfaces to integrate the data from the source model to the target. This gets VERY interesting in the Big Data space since many customers want to get up and running with data they already know and love.

What do you need to get there? The following basic pieces;

  • a Hive custom create table action (get it here)
  • flex fields for the external table indicator, table data format and attribute type metadata (get the groovy script that creates them here)

Pretty simple, right? If you take the simple example from the Hive Confluence wiki;

  CREATE TABLE parquet_test (
     id INT,
     str STRING,
     mp MAP<STRING,STRING>,
     lst ARRAY<STRING>,
     strct STRUCT<A:STRING,B:STRING>)
  PARTITIONED BY (part STRING)
  STORED AS PARQUET;

...and think about how to model this in ODI, you'll wonder - how do I define the Parquet storage? How can the complex types be defined? That's where the flex fields come in: you can define any of the custom storage properties you wish in the datastore's Data Format flex field, and then define the complex type formats on the attributes.

In ODI the Parquet table above can be defined with the Hive datatypes as below; we don't capture the complex fields within the attributes mp, lst or strct. The partitioned column 'part' is added into the regular list of datastore attributes (like Oracle, unlike Hive DDL today);

Using the Data Format flex field we can specify the STORED AS PARQUET information; the table is not external to Hive, so we do not specify EXTERNAL in the external table indicator field;

This value can contain lots of ODI goodies, which makes it very versatile - you can use ODI variables, which will be resolved when the generated procedure containing the DDL is executed, and you can also use <% style code substitution.
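
For example (the project and variable names below are purely illustrative, not from the post), the Data Format flex field could hold something like;

  STORED AS PARQUET TBLPROPERTIES ('parquet.compression'='#BIGDATA_PRJ.COMPRESSION_CODEC')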

The partitioning information is defined for the part attribute; the 'Used for Partitioning' field is checked below;

Let's take stock: we have our table defined as Parquet and we have the partitioning info defined. Now we have to define the complex types. This is done on attributes mp, lst and strct using the creatively named Metadata field - below I have fully defined the MAP, STRUCT and ARRAY for the respective attributes;

Note the struct example escapes the ':' - the code is executed via the JDBC driver when run through the agent, and ':' is the way of binding information to a statement execution (you can see the generated DDL further below with the ':' escaped).
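
As a rough sketch (the MAP parameterization follows the wiki example), the Metadata flex field values for the three attributes are essentially the Hive type definitions themselves, with the ':' in the STRUCT escaped as described above;

  mp     MAP<STRING,STRING>
  lst    ARRAY<STRING>
  strct  STRUCT<A:STRING,B:STRING>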

With that we are complete, we can now generate DDL for our model;

This brings up a dialog with options for where to store the procedure with the generated DDL - it has actually done a base compare against the Hive system and allowed me to select which datastores to create DDL for, or which DDL to create; I have checked the parquet_test datastore;

Upon completion, we can see the procedure which was created. We can choose to further edit and customize it if desired, and we can execute it or schedule it for any time we want.

Inspecting the Create Hive Table action, the DDL looks exactly like what we were after when we first started - the datastore schema / name will be resolved upon execution using ODI's runtime context - which raises another interesting point! We could, for example, have used some <% style ODI substitution code when defining the storage format;
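
Here is a hedged sketch of what that could look like (the path is made up, and getSchemaName is just one substitution method you might use);

  STORED AS PARQUET LOCATION '/user/oracle/<%=odiRef.getSchemaName()%>/parquet_test'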

As I mentioned earlier, you can take this approach to capture all sorts of table definitions for integrating. If you want to look at how this is done on the inside, check the content of the Hive action in the image below (the code of the action is linked above); it should look very familiar if you have seen any KM code, and it uses the flex fields to inject the relevant information into the DDL;

This is a great example of what you can do with ODI and how to leverage it to work efficiently. 

There is a lot more that can be illustrated here, including what the RKM Hive can do or be modified to do. Perhaps creating the datastore and complex type information from JSON, a JSON schema, an Avro schema or Parquet would be very useful? I'd be very interested to hear your comments and thoughts. So let me know....

Wednesday Dec 17, 2014

Oracle Partition Exchange Blog from the ODI A-Team

More great information from the ODI A-Team!

Check out the A-Team's most recent blog about Oracle Partition Exchange - it comes in two parts:

Using Oracle Partition Exchange with ODI

Configuring ODI with Oracle Partition Exchange

The knowledge module is on Java.Net, and it is called “IKM Oracle Partition Exchange Load”.  To search for it, enter “PEL” or “exchange” in the Search option of Java.Net.

A sample ODI 12.1.3 Repository is available as well.  The ODI sample repository has great examples of how to perform both initial and incremental data upload operations with Oracle Partition Exchange.  This repository will help users to understand how to use Oracle Partition Exchange with ODI.
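
If partition exchange is new to you, the core idea is a metadata-only swap of a fully loaded staging table with a single partition of the target table - a minimal sketch of the underlying statement (the table and partition names here are made up) looks like this;

  -- illustrative only: table and partition names are assumptions
  ALTER TABLE sales
    EXCHANGE PARTITION p_2014_12
    WITH TABLE sales_stage
    INCLUDING INDEXES
    WITHOUT VALIDATION;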

Happy reading!

Friday Feb 14, 2014

ODI 12c - Table Functions, Parallel Unload to File and More

ODI 12c includes a new component for integrating and transforming data programmatically. There have been plenty of examples of such implementations through the years; recent ones include SQL access to R from Mark Hornick (see an example blog here). As well as being a great integration technique, table functions have fantastic performance and scalability options - hence you see posts and talks from Kuassi Mensah on in-database map-reduce, all about leveraging the Oracle database's parallel query engine and the skills you already have (SQL and PL/SQL/Java).

The table function component in ODI 12c lets you integrate an existing table function implementation into a flow - the parameters for the table function can be scalar or a ref cursor. You can see how the examples from the AMIS posting here are defined within the mapping designer below; there are multiple table functions chained together, used as both a data source and a transformation;

In the above image you can see the table function name defined in the database is specified in the component's general properties (property is Function Name). The signature for the function must be manually defined by adding input/output connector points and attributes. Check the AMIS blog and reflect on the design above.
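
To picture what such a chained flow resolves to, here is a hedged SQL sketch (the function names are illustrative, not taken from the AMIS post) of one table function acting as the data source and another as a transformation;

  -- illustrative only: generate_orders and transform_orders are hypothetical functions
  SELECT *
  FROM   TABLE(transform_orders(
           CURSOR(SELECT * FROM TABLE(generate_orders(1000)))));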

Regarding performance, one of the examples I blogged (OWB here and ODI here) was parallel unload to file. The table function examples from those previous blogs were fairly rudimentary; in this blog we will see what happens when we tweak the implementation of such functions - we can get much better performance. Here is the table function implementation I will use within the ODI examples (the type definitions used come from the OWB blog post above).

  create or replace function ParallelUnloadX (r SYS_REFCURSOR) return NumSet
    pipelined parallel_enable (partition r by any) is
    TYPE row_ntt IS TABLE OF VARCHAR2(32767);
    v_rows row_ntt;
    v_buffer VARCHAR2(32767);
    v_lines pls_integer := 0;
    c_eol CONSTANT VARCHAR2(1) := CHR(10);
    c_eollen CONSTANT PLS_INTEGER := LENGTH(c_eol);
    c_maxline CONSTANT PLS_INTEGER := 32767;
    out utl_file.file_type;                  -- output file handle
    filename varchar2(256) := 'dbunload';
    directoryname varchar2(256) := 'MY_DIR';
    vsid varchar2(120);
  begin
    -- name the file after the session id so each parallel slave writes its own file
    select sid into vsid from v$mystat where rownum = 1;
    filename := filename || vsid || '.dat';
    out := utl_file.fopen(directoryname, filename, 'w');

    loop
      fetch r BULK COLLECT INTO v_rows;
      for i in 1..v_rows.COUNT loop
        -- buffer rows in a PL/SQL variable and only write when the buffer is full,
        -- minimizing UTL_FILE IO calls
        if LENGTH(v_buffer) + c_eollen + LENGTH(v_rows(i)) <= c_maxline then
          v_buffer := v_buffer || c_eol || v_rows(i);
        else
          if v_buffer is not null then
            utl_file.put_line(out, v_buffer);
          end if;
          v_buffer := v_rows(i);
        end if;
      end loop;
      v_lines := v_lines + v_rows.COUNT;
      exit when r%notfound;
    end loop;
    close r;
    utl_file.put_line(out, v_buffer);

    utl_file.fclose(out);
    -- pipe back the number of lines written by this invocation
    PIPE ROW(v_lines);
    return;
  end;
  /

The function uses the PARALLEL_ENABLE and PARTITION BY keywords - these two are critical to performance and scalability. In addition, the function is further optimized: it uses the PL/SQL BULK COLLECT capability and also buffers data in PL/SQL variables before writing to file (this reduces the number of IO calls). This was not rocket science to tune (there are plenty of posts on PL/SQL IO tuning such as this), yet you can see the impact it has on performance further below.

My mapping using the table function as a target is shown below, 

In the physical design I define the parallel hints; this will then perform parallel unloads to file, and you can easily leverage the hardware and power of the Oracle database. Using the hints to tweak the physical design lets the designer very easily compare and tune performance - you do not have to design the parallelism in your own flows.
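
As a rough sketch of what the parallel design resolves to (the source table, columns and degree below are illustrative), the function is driven by a ref cursor carrying the PARALLEL hint, so each parallel slave writes its own dbunload<sid>.dat file;

  -- illustrative only: 'sales', its columns and the degree of 4 are assumptions
  SELECT *
  FROM   TABLE(ParallelUnloadX(
           CURSOR(SELECT /*+ PARALLEL(s, 4) */
                         TO_CHAR(s.sale_id) || ',' || s.product
                  FROM   sales s)));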

In the table below you can see the performance difference when I use the PARALLEL(4) hint on a 4 CPU machine;

                    Without hint   With PARALLEL(4) hint
  5 million rows    16s            6s
  32 million rows   200s           47s

If I execute the agent-based SQL to file LKM on the 32 million row example, the time taken out of the box is 398 seconds (versus the 47s above when a hint is used); the only divide and conquer option with that LKM is to build a custom workflow to do it yourself. With the table function approach, if your database is on a bigger, more powerful host you can easily take advantage of the system by tweaking the hints.

As you see, the ODI table function component provides another custom exit point in a flow which lets you not only provide some useful integration capabilities but also do so in a very efficient manner - leveraging and exploiting the database you are running on. Hopefully this gives you a little insight and an interesting use case or two.

