Monday Aug 20, 2012

Hadoop - Invoke Map Reduce

Carrying on from the previous post on Hadoop and HDFS with ODI packages, this is another needed call out task - how to execute existing map-reduce code from ODI. I will show this by using ODI Packages and the Open Tool framework.

The Hadoop JobConf SDK is the class needed for initiating jobs whether local or remote – so the ODI agent could be hosted on a system other than the Hadoop cluster for example, and just fire jobs off to the Hadoop cluster. Also some useful posts such as this older one on executing map-reduce jobs from java (following the reply from Thomas Jungblut in this post) helped me get up to speed.

Where better to start than the WordCount example (see a version of it here, both mapper and reducer and inner classes), let’s see how this can be invoked from an ODI package. The HadoopRunJob below is a tool I added via the Open Tool framework, it basically wrappers the JobConf SDK, the parameters are defined in ODI.

You can see some of the parameters below, so I define the various class names I need below, plus various other parameters including the Hadoop job name, can also specify the job tracker to fire the job on (for a client-server style architecture). The input path and output path are also defined as parameters, you can see the first tool in the package is calling the copy file to HDFS – this is just to demonstrate that I will copy the files needed by the WordCount program into HDFS ready for it to run.

Nice and simple, and shields a lot of the complexity hidden behind some simple tools. The JAR file containing WordCount needed to be available to the ODI Agent (or Studio since I invoked it with the local agent), that was it. When the package is executed, just like normal the agent processes the code and executes the steps. I I run the package above it will successfuly copy the files to HDFS and perform the word count. On a second execution of the package an error will be reported because the output directory already exists as below.

I left the example like this to illustrate that we can then extend the package design to have conditional branching to handle errors after a flow, just like the following;

Here after executing the word count, the status is checked and you can conditionally branch on success or failure – just like any other ODI package. I used the beep just for demonstration.

The above HadoopRunJob tool used above was done using the v1 MapReduce SDK, with MR2/Yarn this again will change – these kinds of changes hammer home the need for better tooling to abstract common concepts for users to exploit.

You can see from these posts that we can provide useful tooling behind the basic mechanics of Hadoop and HDFS very easily, along with the power of generating map-reduce jobs from interface designs which you can see from the OBEs here.

Friday Nov 20, 2009

Parallel Processing in ODI

This post assumes that you have some level of familiarity with ODI. The concepts of Packages, Interfaces, Procedures and Scenarios are used here assuming that you understand them in the context of ODI. If you need more details on these elements, please refer to the ODI Tutorial for a quick introduction, or to the complete ODI documentation for detailed information.

ODI: Parallel Processing

A common question in ODI is how to run processes in parallel. When you look at a typical ODI package, all steps are described in a serial fashion and will be executed in sequence.

ParallelPackageSerial.PNG

However, this same package can parallelize and synchronize processes if needed.

PARALLEL PROCESSES

The first piece of the puzzle if you want to parallelize your executions is that a package can invoke other packages once they have been compiled into scenarios (the process of generation of scenarios is described later in this post). You can then have a master package that will orchestrate other scenarios. There is no limit as to how many levels of nesting you will have, as long as your processes are making sense: Your master package invokes a seconday package which, in turn invokes another package...

When you invoke these scenarios, you have two possible execution modes: synchronous and asynchronous.

ParallelScenario.PNG

A synchronous execution will serialize the scenario execution with other steps in the package: ODI executes the scenario, and only after its execution is completed, runs the next step.

An asynchronous execution will only invoke the scenario but will immediately execute the next step in the calling package: the scenario will then run in parallel with the next step. You can use this option to start multiple scenarios concurrently: they will all run in parallel, independently of one another.

SYNCHRONIZING PROCESSES

Once we have started multiple processes in parallel, a common requirement is to synchronize these processes: some steps may run in parallel, but at times we will need all separate threads to be completed before we proceed with a final series of steps. ODI provides a tool for this: OdiWaitForChildSession.

ParallelSynchronize.PNG

An interesting feature is that as you start your different processes in parallel, they can each be assigned a keyword (this is just one of the parameters you can set when you start a scenario). When you synchronize the processes, you can select which processes will be synchronized based on a selection of keywords.

ADDING SCENARIOS TO YOUR PACKAGE FOR PARALLEL PROCESSING

To add a scenario to your package, simply drag and drop the generated scenario in the package, and edit the execution parameters as needed. In particular, remember to set the execution mode to Asynchronous.

You can generate a scenario from a package, from an interface, or from a procedure. The last two will be more atomic (one interface or one procedure only per execution unit). The typical way to generate a scenario is to right-click on one of these objects and to select Generate Scenario.

The generation of scenarios can also be automated with ODI processes that would invoke the ODI tool OdiGenerateAllScen. The parameters of this tool will let you define which scenarios are being generated automatically.

In all cases, scenarios can be found in the object tree, under the object they were generated from - or in the Operator interface, in the Scenarios tab.

While you are developing your different objects, keep in mind that you can Regenerate existing scenarios. This is faster than deleting existing ones only to re-create them with the same version number. To re-generate a scenario, simply right-click on the existing version and select Regenerate ... .

From an execution perspective, you can specify that the scenario you will execute is version -1 (negative one) to ensure that the latest version number is always the one executed. This is a lot easier than editing the parameters with each new release.

DISPLAYING PARALLEL PROCESSING

You will notice that as of 10.1.3.4, ODI does not graphically differentiate between serialized and parallelized executions: all are represented in a serial manner. One way to make parallel executions more visible is stack up the objects vertically, versus the more natural horizontal execution for serialized objects. (If we have electricians reading this, the layout will be very familiar to them, but this is only a coincidence...)

ParallelPackageStackUp.PNG

OTHER OBJECTS THAN SCENARIOS

Scenarios are not the only objects that will allow for parallel (or Asynchronous) execution. If you look at the ODI tool OdiOSCommand, you will notice a Synchronous option that will allow you to define if the external component you are executing will run in parallel with the current process, or if it will be serialized in your process. The same is true for the Data Quality tool OdiDataQuality.

EXECUTION LOGS

As you will start running more processes in parallel, be ready to see more processes being executed concurrently in the Operator interface. If you are only interested in seing the master processes though, the Hierarchy tab will allow you to limit your view to parent processes. Children processes will be listed under the entry Childres Sessions under each session.

Likewise, when you access the logs from the web front end, you can view the Parent processes only.

Enjoy!

Screenshots were taken using version 10.1.3.5 of ODI. Actual icons and graphical representations may vary with other versions of ODI.

About

Learn the latest trends, use cases, product updates, and customer success examples for Oracle's data integration products-- including Oracle Data Integrator, Oracle GoldenGate and Oracle Enterprise Data Quality

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
2
3
5
6
7
8
9
10
12
13
14
17
18
19
20
21
23
24
25
26
27
28
29
30
   
       
Today