Thursday Aug 16, 2012

Hadoop and HDFS - file system tools

Underpinning the Oracle Big Data Appliance (and any other Hadoop cluster) is HDFS. Working with files in HDFS is much like working with a regular file system (although quite different under the hood); to manipulate it from the OS you simply have a different API to use. Hadoop prefixes its commands with 'hadoop fs', so you issue 'hadoop fs -mkdir' or 'hadoop fs -rm' where the local file system uses plain mkdir or rm. ODI has file management tools available in the package editor for preparing files and moving them around, and the HDFS commands can act just like any other tool in ODI. Let's see how!
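To make the 'different API' point concrete, here is a minimal Java sketch (my own illustration, not from the tools themselves) contrasting a local directory create with the HDFS equivalent:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkDirContrast {
    public static void main(String[] args) throws Exception {
        // Local file system: a plain OS-level call (shell equivalent: mkdir -p /tmp/logs).
        new File("/tmp/logs").mkdirs();

        // HDFS: the same operation through the Hadoop API
        // (shell equivalent: hadoop fs -mkdir /tmp/logs).
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.mkdirs(new Path("/tmp/logs"));
        hdfs.close();
    }
}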

Here I will show you how I added a set of tools to perform common HDFS file system commands and map-reduce job execution from within the ODI package editor. This is over and above the support ODI already has for building interfaces that exploit the Hadoop ecosystem.

You will see how users can easily perform HDFS actions and execute map-reduce jobs; I will save the latter for another post. Using ODI's tool extensibility in the package editor, I have wrapped the following:

HDFSCopyFromLocalFile: Copy a file (or group of files) from the local file system to HDFS.

HDFSCopyToLocalFile: Copy a file (or group of files) from HDFS to the local file system.

HDFSMkDirs: Create directories in HDFS, creating any missing intermediate directories in the path.

HDFSMoveFromLocalFile: Move a file (or group of files) from the local file system to HDFS.

HDFSRm: Delete directories in HDFS, with optional recursive delete.

HadoopRunJob: Run a map-reduce job by defining the job parameters: mapper class, reducer class, formats, input, output and so on.

These are common HDFS actions that every Hadoop training course walks the user through, and they are the equivalents of the local file system tools ODI has under the Files group in the package toolbox. You can see from the example below that the HDFSCopyFromLocalFile call takes a source URI to copy and a destination on HDFS; it is very simple to use. Under the covers it uses the Hadoop FileSystem SDK to manipulate the HDFS file system. The HDFSCopyFromLocalFile below copies the file /home/oracle/weblogs/apache08122012_1290.log from the local file system to /users/oracle/logs/input on HDFS. Very simple.
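For reference, here is a standalone sketch of roughly what that copy does via the Hadoop FileSystem API; this is not the tool's actual source, and the NameNode URI is a placeholder (normally it comes from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyLogToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:8020"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);
        // Copy the local Apache log file into the HDFS input directory from the example.
        fs.copyFromLocalFile(new Path("/home/oracle/weblogs/apache08122012_1290.log"),
                             new Path("/users/oracle/logs/input"));
        fs.close();
    }
}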

The above flow removes a directory structure, creates a directory, copies some files, runs a few map-reduce jobs and then tidies up by removing the working directories. I built these tools using ODI's Open Tool SDK; it's another great way of extending the product. In the above image you can see that the Toolbox has a number of tools in the Plugins folder. These tools use Hadoop's SDKs, including org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path and org.apache.hadoop.conf.Configuration. Another cool aspect of ODI Open Tools is that you can call them from commands within a KM task or from procedure commands.
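To give a feel for what these tools do internally, here is a minimal sketch of the Hadoop SDK calls behind the remove-then-recreate steps of the flow. The class and method names here are mine, and the ODI Open Tool wrapper with its parameter handling is deliberately omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHousekeeping {
    // Mirrors the "remove a directory structure, then create a directory" steps of the flow.
    public static void resetDirectory(String dir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path(dir);
        if (fs.exists(target)) {
            fs.delete(target, true); // true = recursive, like the HDFSRm recursive option
        }
        fs.mkdirs(target); // creates missing intermediate directories, like HDFSMkDirs
        fs.close();
    }
}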

The dependent Hadoop JARs must be available to ODI for these tools to execute; missing JARs are a common question in forums across the web from developers using these SDKs. The HadoopRunJob tool uses the JobConf SDK to define the map-reduce job (following the reply from Thomas Jungblut in this post); I will cover this in another blog post, but a rough sketch of a JobConf definition is shown below. For more on integrating Hadoop and ODI, see the self study material here.
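This sketch is a minimal, self-contained example of the old-style mapred API; the identity mapper/reducer and the /users/oracle/logs/output path are placeholders of mine, not values the tool itself uses:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class RunJobSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RunJobSketch.class);
        conf.setJobName("odi-hadooprunjob-sketch");

        // Placeholder mapper/reducer classes; HadoopRunJob lets you supply your own.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // Key/value types emitted by the identity map over TextInputFormat records.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path("/users/oracle/logs/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/users/oracle/logs/output"));

        JobClient.runJob(conf); // submits the job and waits for completion
    }
}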
