Hadoop - Invoke Map Reduce
By David Allan-Oracle on Aug 20, 2012
Carrying on from the previous post on Hadoop and HDFS with ODI packages, this post covers another commonly needed call-out task: how to execute existing map-reduce code from ODI. I will show this using ODI Packages and the Open Tool framework.
The JobConf class in the Hadoop SDK is what you need for initiating jobs, whether local or remote, so the ODI agent could be hosted on a system other than the Hadoop cluster, for example, and just fire jobs off to the cluster. Some existing posts, such as this older one on executing map-reduce jobs from Java (following the reply from Thomas Jungblut in this post), also helped me get up to speed.
Where better to start than the WordCount example (see a version of it here; both the mapper and reducer are inner classes)? Let's see how this can be invoked from an ODI package. The HadoopRunJob tool below is one I added via the Open Tool framework; it basically wraps the JobConf SDK, and its parameters are defined in ODI.
You can see some of the parameters below: I define the various class names I need, plus other parameters including the Hadoop job name, and I can also specify the job tracker to fire the job on (for a client-server style architecture). The input path and output path are also defined as parameters. The first tool in the package calls the copy file to HDFS tool; this is just to demonstrate copying the files needed by the WordCount program into HDFS, ready for it to run.
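To make the parameter list more concrete, here is a minimal sketch of how a wrapper like this might turn its parameters into Hadoop v1 configuration properties. This is not the actual HadoopRunJob source: the class name `HadoopJobParams`, the method names, the host, and the paths are all hypothetical. The property keys are the ones the v1 `JobConf` setters write to (for example, `setJobName` stores `mapred.job.name`). Note that inner classes such as the WordCount mapper and reducer are referenced by their binary name, `Outer$Inner`, when they are loaded by name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps tool-style parameters to Hadoop v1 (mapred) property keys.
class HadoopJobParams {

    // Builds the v1 property set; jobTracker may be null to keep the cluster default.
    static Map<String, String> toV1Properties(String jobName, String mapperClass,
            String reducerClass, String inputPath, String outputPath, String jobTracker) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.job.name", jobName);
        props.put("mapred.mapper.class", mapperClass);
        props.put("mapred.reducer.class", reducerClass);
        props.put("mapred.input.dir", inputPath);
        props.put("mapred.output.dir", outputPath);
        if (jobTracker != null && !jobTracker.isEmpty()) {
            // host:port of the JobTracker to submit to (client-server style)
            props.put("mapred.job.tracker", jobTracker);
        }
        return props;
    }

    // Loads a class by name and fails with a readable message if the JAR
    // holding it is not on the classpath of the process running the tool.
    static Class<?> resolve(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                    "Class not found on the classpath: " + className, e);
        }
    }

    public static void main(String[] args) {
        // All values below are illustrative assumptions, not taken from the post.
        Map<String, String> props = toV1Properties(
                "wordcount",
                "WordCount$Map",      // binary name of an inner mapper class
                "WordCount$Reduce",   // binary name of an inner reducer class
                "/user/odi/wordcount/input",
                "/user/odi/wordcount/output",
                "jobtracker-host:9001");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real tool these properties would be applied to a `JobConf` and the job submitted with `JobClient.runJob`; the sketch stops short of that so it runs without Hadoop on the classpath.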
Nice and simple, and it hides a lot of complexity behind some simple tools. The JAR file containing WordCount needed to be available to the ODI Agent (or Studio, since I invoked it with the local agent); that was it. When the package is executed, the agent processes the code and executes the steps just like normal. If I run the package above, it will successfully copy the files to HDFS and perform the word count. On a second execution of the package an error will be reported because the output directory already exists, as below.
I left the example like this to illustrate that we can then extend the package design with conditional branching to handle errors after a flow, just like the following:
Here after executing the word count, the status is checked and you can conditionally branch on success or failure – just like any other ODI package. I used the beep just for demonstration.
The HadoopRunJob tool above was built using the v1 MapReduce SDK; with MR2/YARN this will change again. These kinds of changes hammer home the need for better tooling to abstract common concepts for users to exploit.
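As a small illustration of the kind of change involved, the sketch below translates a handful of v1 property keys to their Hadoop 2 names and swaps the JobTracker address for a YARN ResourceManager address. It is only an assumption of what a wrapper might do, not part of the tool described here: the class name `MapredKeyMigration` and the host names are hypothetical, and only a few of the renamed keys are covered. Hadoop 2 also still accepts many of the old keys as deprecated aliases, so a translation like this is not strictly required.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: rewrites a few v1 (mapred.*) keys to their Hadoop 2 names.
class MapredKeyMigration {

    // A small subset of the keys that were renamed in Hadoop 2.
    private static final Map<String, String> RENAMED = new LinkedHashMap<>();
    static {
        RENAMED.put("mapred.job.name", "mapreduce.job.name");
        RENAMED.put("mapred.input.dir", "mapreduce.input.fileinputformat.inputdir");
        RENAMED.put("mapred.output.dir", "mapreduce.output.fileoutputformat.outputdir");
        RENAMED.put("fs.default.name", "fs.defaultFS");
    }

    // Returns a new map with renamed keys; the input map is left untouched.
    static Map<String, String> toMr2(Map<String, String> v1, String resourceManager) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : v1.entrySet()) {
            if ("mapred.job.tracker".equals(entry.getKey())) {
                continue; // there is no JobTracker on YARN; jobs go to the ResourceManager
            }
            out.put(RENAMED.getOrDefault(entry.getKey(), entry.getKey()), entry.getValue());
        }
        out.put("mapreduce.framework.name", "yarn");
        if (resourceManager != null && !resourceManager.isEmpty()) {
            out.put("yarn.resourcemanager.address", resourceManager);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> v1 = new LinkedHashMap<>();
        v1.put("mapred.job.name", "wordcount");
        v1.put("mapred.job.tracker", "jobtracker-host:9001");
        v1.put("mapred.input.dir", "/user/odi/wordcount/input");
        v1.put("mapred.output.dir", "/user/odi/wordcount/output");
        toMr2(v1, "rm-host:8032").forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the translation in one place means the ODI package and its parameters would not need to change when the cluster moves from v1 to YARN, which is the sort of abstraction argued for above.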
You can see from these posts that we can provide useful tooling behind the basic mechanics of Hadoop and HDFS very easily, along with the power of generating map-reduce jobs from interface designs which you can see from the OBEs here.