By David Allan on Aug 29, 2013
Carrying on from the blog post on performing file transformations within ODI using nXSD, in this post I will show how a large file can be debatched into smaller chunks/shards. The example uses a COBOL data file as input and debatches this file into parts using the debatching capability of the adapter.
COBOL is a great example of a complex data source for Hadoop and Big Data initiatives, since there are plenty of systems that companies wish to unlock potential gems from. It's big data and there are tonnes of it! The ODI file transformer tool on java.net has been extended to include an optional parameter that defines the number of rows in each chunk/shard. This parameter lets you control the relative size of the files being processed, since some file sizes may suit a given platform better than others. Hadoop has challenges with millions of small files and with very, very large ones, so being able to prepare the data is useful.
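To make the idea of a rows-per-shard parameter concrete, here is a minimal Python sketch of debatching a stream of rows into fixed-size shards. It is not the tool's actual source: the function names `debatch` and `name_shards` are hypothetical, and while the tool does append an integer index to the output name, the start value of 1 and the exact name format used here are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def debatch(rows: Iterable[T], rows_per_shard: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most rows_per_shard rows each."""
    if rows_per_shard < 1:
        raise ValueError("rows_per_shard must be at least 1")
    it = iter(rows)
    while True:
        # islice pulls the next rows_per_shard rows without reading the whole input
        shard = list(islice(it, rows_per_shard))
        if not shard:
            return
        yield shard


def name_shards(output: str, rows: Iterable[T],
                rows_per_shard: int) -> Iterator[Tuple[str, List[T]]]:
    """Pair each shard with an output name built by appending an integer index.

    Assumption: the index starts at 1 and is appended directly to the name.
    """
    for index, shard in enumerate(debatch(rows, rows_per_shard), start=1):
        yield f"{output}{index}", shard
```

Because the shards are produced lazily, only one shard's worth of rows needs to be held in memory at a time, which is the property that matters when the input file is very large.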
The COBOL copybook was used as the input to the Native Format Builder, which generated an nXSD for it. I used the following copybook:
- 02 DTAR020.
- 03 DTAR020-KCODE-STORE-KEY.
- 05 DTAR020-KEYCODE-NO PIC X(08).
- 05 DTAR020-STORE-NO PIC S9(03) COMP-3.
- 03 DTAR020-DATE PIC S9(07) COMP-3.
- 03 DTAR020-DEPT-NO PIC S9(03) COMP-3.
- 03 DTAR020-QTY-SOLD PIC S9(9) COMP-3.
- 03 DTAR020-SALE-PRICE PIC S9(9)V99 COMP-3.
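As a rough illustration of why this copybook counts as a complex source, most of its fields are COMP-3 (packed decimal), which stores two digits per byte with the sign in the final half-byte. The Python sketch below works out the fixed record length implied by the copybook and decodes a packed field. It uses the standard COMP-3 layout and is not how the adapter itself is implemented; the helper names `comp3_size` and `unpack_comp3` are hypothetical.

```python
from decimal import Decimal


def comp3_size(digits: int) -> int:
    """Bytes used by a COMP-3 field with the given total number of digits."""
    return digits // 2 + 1


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field; scale is the number of implied decimals (V99 -> 2)."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    digits.append(raw[-1] >> 4)       # last byte: one digit ...
    sign_nibble = raw[-1] & 0x0F      # ... followed by the sign nibble
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):   # 0xD (and 0xB) mark a negative value
        value = -value
    return Decimal(value).scaleb(-scale)


# Field widths in bytes, derived from the PIC clauses in the copybook above.
FIELDS = [
    ("DTAR020-KEYCODE-NO", 8),               # PIC X(08)
    ("DTAR020-STORE-NO", comp3_size(3)),     # PIC S9(03) COMP-3
    ("DTAR020-DATE", comp3_size(7)),         # PIC S9(07) COMP-3
    ("DTAR020-DEPT-NO", comp3_size(3)),      # PIC S9(03) COMP-3
    ("DTAR020-QTY-SOLD", comp3_size(9)),     # PIC S9(9) COMP-3
    ("DTAR020-SALE-PRICE", comp3_size(11)),  # PIC S9(9)V99 COMP-3
]

RECORD_LENGTH = sum(width for _, width in FIELDS)
```

With these widths each record occupies 27 bytes, so a row in the binary file is a fixed-length slice rather than a delimited line.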
The data file came from an example on the web. Below is an actual example of the command executed for the ODIFileTransformer tool. The SHARDROWS parameter defines the number of rows to be written to each data file. The tool simply adds an integer index to the end of the output parameter. A little basic, I know; the source is on java.net if you feel like being creative.
ODIFileTransformer "-INPUT=D:\input\cbl_data.bin" "-SCHEMA=D:\reference\cbl.xsd" "-OUTPUT=d:\output\out.xml" "-ROOT=ROOT" "-SHARDROWS=2"
Executing this command generates many output files, each with (in this example) 2 rows. This is a contrived example, just to illustrate the 2 rows in a generated file:
- <ROOT xmlns="http://TargetNamespace.com/CBL">