Hadoop and HDFS - file system tools

Underpinning the Oracle Big Data Appliance (and any other Hadoop cluster) is HDFS. Working with files in HDFS is much like working with a regular file system (even though it is quite different under the hood); from the operating system you simply use a different command, prefixing operations with 'hadoop fs' – for example 'hadoop fs -mkdir' or 'hadoop fs -rm' instead of plain mkdir or rm. ODI has file management tools available in the package editor for preparing files and moving them around, and HDFS commands can act just like any other tool in ODI. Let's see how!

Here I will show you how I added a bunch of tools to perform common HDFS file system commands and map-reduce job execution from within the ODI package editor – this is over and above the support ODI has for building interfaces that exploit the Hadoop ecosystem.

You will see how users can easily perform HDFS actions and execute map-reduce jobs – I will save the latter for another post. Using ODI's Open Tool extensibility in the package editor, I have wrapped the following:

HDFSCopyFromLocalFile – Copy a file (or group of files) from the local file system to HDFS.

HDFSCopyToLocalFile – Copy a file (or group of files) from HDFS to the local file system.

HDFSMkDirs – Create directories in HDFS, creating any intermediate directories in the path that are missing.

HDFSMoveFromLocalFile – Move a file (or group of files) from the local file system to HDFS.

HDFSRm – Delete directories in HDFS, optionally recursively.

HadoopRunJob – Run a map-reduce job by defining the job parameters: mapper class, reducer class, formats, input, output and so on.

These are common HDFS actions that every Hadoop training walks the user through, and they are the equivalents of the local file system tools ODI has under the Files group in the package toolbox. You can see from the example below that the HDFSCopyFromLocalFile call takes a source URI to copy and a destination on HDFS – it is very simple to use. It uses the FileSystem SDK to manipulate the HDFS file system. The HDFSCopyFromLocalFile below copies the file /home/oracle/weblogs/apache08122012_1290.log from the local file system to HDFS at /users/oracle/logs/input. Very simple.
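For illustration, here is a minimal sketch of the kind of Hadoop FileSystem call such a tool wraps (the paths are the ones from the example above; this uses the Hadoop client API directly and is an assumption about the wrapping, not the actual tool source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath, including fs.default.name.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Source on the local file system, destination directory in HDFS.
        Path src = new Path("/home/oracle/weblogs/apache08122012_1290.log");
        Path dst = new Path("/users/oracle/logs/input");

        // Copy the local file into HDFS (the source is left in place).
        fs.copyFromLocalFile(src, dst);
        fs.close();
    }
}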

The above flow basically removes a directory structure, creates a directory, copies some files, runs a few map-reduce jobs and then tidies up with some removal. I built these tools using ODI's Open Tool SDK; it's another great way of extending the product. In the above image you can see a bunch of tools in the Plugins folder of the Toolbox. These tools use Hadoop's SDKs, including org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path and org.apache.hadoop.conf.Configuration. Another cool aspect of ODI Open Tools is that you can call the tools from commands within a KM task, or from procedure commands.
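As a rough sketch of what tools like HDFSMkDirs and HDFSRm boil down to under the hood (again using the Hadoop classes named above directly; this is illustrative rather than the tool source itself):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirAndDeleteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory, including any missing intermediate directories (like mkdir -p).
        fs.mkdirs(new Path("/users/oracle/logs/input"));

        // Delete a directory tree; the second argument enables recursive deletion.
        fs.delete(new Path("/users/oracle/logs/output"), true);

        fs.close();
    }
}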

The dependent JARs must be available on the classpath for ODI to execute these tools – a point that comes up again and again in forum replies to developers using these SDKs. The HadoopRunJob tool uses the JobConf SDK to define the map-reduce job (following the reply from Thomas Jungblut in this post). I will cover this in another blog post. For more on integrating Hadoop and ODI, see the self study material here.
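To give a flavour of what HadoopRunJob's parameters map to, here is a minimal JobConf-based job definition using the classic org.apache.hadoop.mapred API; the identity mapper/reducer and the paths are stand-ins for whatever your job actually uses, and the exact parameters exposed by the tool are left for the follow-up post:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class RunJobExample {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(RunJobExample.class);
        job.setJobName("log-copy");

        // Mapper and reducer classes - identity classes here, your own in practice.
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);

        // Input/output formats and the key/value types the job emits.
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input and output locations in HDFS.
        FileInputFormat.setInputPaths(job, new Path("/users/oracle/logs/input"));
        FileOutputFormat.setOutputPath(job, new Path("/users/oracle/logs/output"));

        // Submit the job and wait for it to complete.
        JobClient.runJob(job);
    }
}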

Comments:

Hi David,

Nice Post. Keep going ..

-Suraj.

Posted by Suraj on August 20, 2012 at 05:27 AM PDT #

Hi there.

I'm a beginner in Hadoop concepts and tools. What do you recommend I study? My consultancy has the Greenplum Hadoop VM with a CentOS Linux distribution running the Hadoop ecosystem services, but Hadoop doesn't have graphical and intuitive tools to work with.

Please help us with some tool recommendations.

Thanks a lot.

Hilario

Posted by CARLOS HILARIO on April 12, 2013 at 01:41 PM PDT #

It's a big area, so it depends on what you will be doing at first. I'd start small with the VM and get through the basics, understanding everything from map-reduce to Hive. Products like Hive and Pig are much more accessible to a wider user base than raw map-reduce.

Cheers
David

Posted by David on April 16, 2013 at 02:06 PM PDT #
