Thursday Aug 29, 2013

ODI - File transformation debatching

Carrying on from the blog post on performing file transformations within ODI using nXSD, in this post I will show how a large file can be debatched into smaller chunks/shards. The example uses a COBOL data file as input and splits it into parts using the debatching capability of the adapter.

COBOL is a great example of a complex data source for Hadoop and Big Data initiatives, since there are plenty of such systems that companies wish to unlock potential gems from. It's big data, and there is tonnes of it! The ODI file transformer tool on java.net has been extended with an optional parameter that defines the number of rows in each chunk/shard. This parameter lets you take control of the relative size of the files being processed - perhaps the platform has certain characteristics that work better than others. Hadoop has challenges both with millions of small files and with very, very large ones, so being able to prepare the data is useful.

The COBOL copybook was used as input to the Native Format Builder, which generated an nXSD for it. I used the following copybook;

        02  DTAR020.
        03  DTAR020-KCODE-STORE-KEY.
            05 DTAR020-KEYCODE-NO      PIC X(08).
            05 DTAR020-STORE-NO        PIC S9(03)   COMP-3.
        03  DTAR020-DATE               PIC S9(07)   COMP-3.
        03  DTAR020-DEPT-NO            PIC S9(03)   COMP-3.
        03  DTAR020-QTY-SOLD           PIC S9(9)    COMP-3.
        03  DTAR020-SALE-PRICE         PIC S9(9)V99 COMP-3.

The data file came from an example on the web. Below is an actual example of the ODIFileTransformer command executed; the SHARDROWS parameter defines the number of rows written to each data file. The tool simply appends an integer index to the value of the OUTPUT parameter - a little basic I know, the source is on java.net if you feel like being creative.

ODIFileTransformer "-INPUT=D:\input\cbl_data.bin" "-SCHEMA=D:\reference\cbl.xsd" "-OUTPUT=d:\output\out.xml" "-ROOT=ROOT" "-SHARDROWS=2" 
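
With SHARDROWS=2 and the integer suffixing described above, the generated files would be named along these lines (the exact starting index is my assumption - check the tool source on java.net if it matters);

    d:\output\out.xml0    rows 1-2
    d:\output\out.xml1    rows 3-4
    d:\output\out.xml2    ...and so on until the input is exhausted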

Executing this generates many output files, each with (in this example) 2 rows. This is a contrived example, just to illustrate the 2 rows in a generated file;

<ROOT xmlns="http://TargetNamespace.com/CBL">
   <DTAR020>
      <DTAR020-KCODE-STORE-KEY>
         <DTAR020-KEYCODE-NO>69684558</DTAR020-KEYCODE-NO>
         <DTAR020-STORE-NO>+020</DTAR020-STORE-NO>
      </DTAR020-KCODE-STORE-KEY>
      <DTAR020-DATE>+0040118</DTAR020-DATE>
      <DTAR020-DEPT-NO>+280</DTAR020-DEPT-NO>
      <DTAR020-QTY-SOLD>+000000001</DTAR020-QTY-SOLD>
      <DTAR020-SALE-PRICE>+00000019.00</DTAR020-SALE-PRICE>
   </DTAR020>
   <DTAR020>
      <DTAR020-KCODE-STORE-KEY>
         <DTAR020-KEYCODE-NO>69684558</DTAR020-KEYCODE-NO>
         <DTAR020-STORE-NO>+020</DTAR020-STORE-NO>
      </DTAR020-KCODE-STORE-KEY>
      <DTAR020-DATE>+0040118</DTAR020-DATE>
      <DTAR020-DEPT-NO>+280</DTAR020-DEPT-NO>
      <DTAR020-QTY-SOLD>-000000001</DTAR020-QTY-SOLD>
      <DTAR020-SALE-PRICE>-00000019.00</DTAR020-SALE-PRICE>
   </DTAR020>
</ROOT>

This is a small extension to the earlier version of the tool and provides a useful debatching capability. The ZIP containing the tool is on the java.net site; see the earlier post for some background details. ...and then what? Well, that would be telling.

Monday Aug 05, 2013

ODI - Complex file example, multi record row

Here we will see how a parser is defined to project a schema based on a data file with many records in a single line of the file. The complex file adapter provides a plethora of capabilities and is probably one of those underutilized and little-known components - and I think it's one where the more examples provided, the better! In the example I will show, a department is defined on every line and each department can have many employees (as many as you want) defined on the same line. Replace Department and Employee with your own concepts: buyers, sellers, or whatever. We define the parser in an NXSD file, which is based on standard XSD with additional information defining how the data is parsed and projected as XML. If you know XSD you are in business; if not, with a little knowledge you can become very dangerous - and you can always fall back on the examples in the Oracle documentation.

My source data file has the following content;

ODIDavid  ARCHITECTBob    DEVELOPERJohn   MANAGER  
OWFFred   SUPPORT  Jim    SUPPORT  

I want to define a parser that creates a department for each line (ODI is a department and OWF is a department; it's a 3-character string) and also has employees (David, Bob, etc., a 7-character string) with their job roles (ARCHITECT, DEVELOPER, etc., a 9-character string) and the department they are in. The positional layout is sketched below.
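
Laying a record out by position, derived from the field lengths just described;

    positions 1-3     department Name  (3 characters)
    positions 4-10    employee Name    (7 characters)
    positions 11-19   employee Role    (9 characters)
    positions 20-     further employee Name/Role pairs, repeating until the end of the line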

The following NXSD file does exactly that;

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:nxsd="http://xmlns.oracle.com/pcbpel/nxsd"
            targetNamespace="http://TargetNamespace.com/testsch"
            xmlns:tns="http://TargetNamespace.com/testsch"
            nxsd:encoding="ASCII" nxsd:stream="chars" nxsd:version="NXSD">

  <xsd:element name="Departments">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Department" type="tns:DepartmentT" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="DepartmentT">
    <xsd:sequence>
      <xsd:element name="Name" type="tns:string" nxsd:style="fixedLength" nxsd:length="3" />
      <xsd:element name="Employee" type="tns:EmployeeT" nxsd:style="array" maxOccurs="unbounded" nxsd:arrayTerminatedBy="${eol}" />
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="EmployeeT">
    <xsd:sequence>
      <xsd:element name="Name" type="xsd:string" nxsd:style="fixedLength" nxsd:length="7"/>
      <xsd:element name="Role" type="xsd:string" nxsd:style="fixedLength" nxsd:length="9"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>

My employee records above happen to be fixed length, based on the lengths of the name and the role; there can be many of them, so they are in an unbounded array, and the array is terminated by an EOL character. When I reverse-engineer this into ODI I get datastores for DEPARTMENTS, DEPARTMENT and EMPLOYEE; EMPLOYEE has columns Name and Role, DEPARTMENT has column Name.
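
For illustration, parsing the sample file with the NXSD above would project XML roughly like this (a sketch: the fixed-length fields keep their trailing spaces, and the adapter's exact namespace handling may differ);

<Departments xmlns="http://TargetNamespace.com/testsch">
   <Department>
      <Name>ODI</Name>
      <Employee>
         <Name>David  </Name>
         <Role>ARCHITECT</Role>
      </Employee>
      <Employee>
         <Name>Bob    </Name>
         <Role>DEVELOPER</Role>
      </Employee>
      <Employee>
         <Name>John   </Name>
         <Role>MANAGER  </Role>
      </Employee>
   </Department>
   <Department>
      <Name>OWF</Name>
      <Employee>
         <Name>Fred   </Name>
         <Role>SUPPORT  </Role>
      </Employee>
      <Employee>
         <Name>Jim    </Name>
         <Role>SUPPORT  </Role>
      </Employee>
   </Department>
</Departments>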

I can define this NXSD in my favorite editor and develop and test the parsing using ODI; this avoids installing and using JDeveloper and gives you a shortcut for quickly developing and building parsers with this capability. Previous blogs on the topic are here and here.

Saturday Dec 29, 2012

ODI - Hive and Complex JSON

Time for a nested JSON example using Hive external tables. ODI treats nested complex types in Hive the same way it treats types in other technologies such as Oracle: the type name is captured, not the definition - you can see XMLType or SDO_GEOMETRY as examples within the ODI Oracle technology. The Hive technology in ODI also has the ARRAY, STRUCT and MAP types seeded. For the nested JSON example I will use the example defined in this JSON SerDe page here.

The external table definition is below; I have defined this in Hive and reverse-engineered it into ODI just like in the previous post. Note it is using a different SerDe from the post here - when I tried that Google SerDe the data was not projected properly (all data was projected as null... so beware of the components used). Just like in the previous post we need to add the jar whenever it is used (plus don't forget ODI_HIVE_SESSION_JARS), either from Hive or ODI;

ADD JAR /home/oracle/json/hive-serde-1.0.jar;
CREATE EXTERNAL TABLE message (
  messageid string,
  messagesize int,
  sender string,
  recipients array<string>,
  messageparts array<struct<
    extension: string,
    size: int
  >>,
  headers map<string,string>
)
ROW FORMAT SERDE 'com.proofpoint.hive.serde.JsonSerde'
LOCATION '/user/oracle/json_complex';
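
A quick describe in Hive confirms how the complex columns are typed (the output below is a sketch; formatting varies by Hive version);

describe message;
-- messageid     string
-- messagesize   int
-- sender        string
-- recipients    array<string>
-- messageparts  array<struct<extension:string,size:int>>
-- headers       map<string,string>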

This external table has ARRAY fields, STRUCT fields and MAP fields, so we are going above and beyond simple types. The data I will use is the same as on the referenced web page;

{
  "messageId": "34dd0d3c-f53b-11e0-ac12-d3e782dff199",
  "messageSize": 12345,
  "sender": "alice@example.com",
  "recipients": ["joe@example.com", "bob@example.com"],
  "messageParts": [
    {
      "extension": "pdf",
      "size": 4567
    },
    {
      "extension": "jpg",
      "size": 9451
    }
  ],
  "headers": {
    "Received-SPF": "pass",
    "X-Broadcast-Id": "9876"
  }
}
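
Assuming the JSON above is saved in a local file (the file name here is hypothetical), it just needs to be copied into the HDFS directory that the table's LOCATION clause points at;

hadoop fs -mkdir /user/oracle/json_complex
hadoop fs -put message.json /user/oracle/json_complex/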

Again, I will use the Hive RKM that I mentioned in the post here in order to reverse engineer the external table defined in Hive into ODI. Below you can see the table and how it is represented in ODI. You can see the recipients and messageparts columns are defined as ARRAY, and headers is a MAP.

We can view the data in ODI just like other tables - remember the external table projects the JSON data from HDFS through Hive.
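
Querying from the Hive CLI gives a feel for how the complex columns are rendered; the row below is a sketch based on the sample document (actual formatting varies by Hive version and SerDe);

select * from message;
-- 34dd0d3c-f53b-11e0-ac12-d3e782dff199  12345  alice@example.com  ["joe@example.com","bob@example.com"]  [{"extension":"pdf","size":4567},{"extension":"jpg","size":9451}]  {"Received-SPF":"pass","X-Broadcast-Id":"9876"}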

When the datastore is used in an interface these are the columns available in the mapping; you can use any of the Hive functions or operators available on these complex types, plus constructors for building them. For example, to retrieve the first element in the array the following Hive QL can be used;

select sender, recipients[0] from message;
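
The STRUCT and MAP columns can be addressed in the same style. These queries are a sketch using standard Hive complex-type syntax - a dotted path into a struct element, a key lookup on the map, and the built-in size() function on an array;

select messageparts[0].extension, messageparts[0].size from message;
select headers['Received-SPF'] from message;
select sender, size(recipients) from message;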

You can build such expressions in ODI.

You just need to be aware of the datatypes you are using and the functions available. I haven't gone into complex SQL/Hive QL here, but you can see the basic mechanics are pretty straightforward. One of the points that comes home here is the functionality level of the non-core pieces of Hadoop: the first 2 JSON SerDe libraries I used support different capabilities of JSON - so be aware.

Tuesday Jun 14, 2011

Complex File Processing - 0 to 60

The Complex File technology adds a powerful new driver to the ODI capabilities. It leverages the Fusion Middleware support for complex files, and you can then build ODI interfaces just as you are used to!

Using an example from the Oracle® Fusion Middleware User's Guide for Technology Adapters 11g Release 1 (see here), the viewlet quickly builds up the model based on the XSD for the complex file (use Native Format Builder to construct it).

Have a look at the viewlet right here:

http://blogs.oracle.com/dataintegration/resource/viewlets/odi_11_complexfile_viewlet_swf.html

You can pause it to catch your breath! It's condensed into 60 seconds to keep your attention.
