Pat Shuff's Blog

  • Iaas
    September 26, 2016

Making Hadoop easier

Last week we looked at provisioning a Hadoop server and realized that the setup was a little complex and somewhat difficult. This is what people typically do the first time they want to provision a service: they download the binaries (or the source, if you are really crazy) and install everything from scratch. Our recommendation is to do everything this way the first time. It helps you get a better understanding of how the setup works and what the dependencies are. For example, Hadoop 2.7.3 required Java 1.8 or greater. If we go with Hadoop 2.7.2 we can get by with Java 1.7.

Rather than going through all of the relationships, requirements, and libraries needed to get something working, we are going to do what we would typically do if we suddenly needed a server up and running. We go to a service that provides pre-compiled and pre-configured public domain code sandboxes and get everything running that way. The service of choice for the Oracle Compute Cloud is Bitnami. We can search for a Hadoop configuration and provision it into our IaaS foundation. Note that we could do the same using Amazon EMR and get the same results. The key differences between the two are configurations, number of servers, and cost. We are going to go through the Bitnami deployment on the Oracle Cloud in this blog.

Step 1 Search for Hadoop on http://oracle.bitnami.com and launch the instance into your region of choice.

Step 2 Configure and launch the instance. We give the instance a name, increase the default disk size from 10 GB to 60 GB to have room for data, go with the Hadoop 2.7.2-1 version, select Oracle Linux 6.7 as the OS (Ubuntu is an alternative), and go with a small OC3 footprint for the compute size. Don't change the security rules. New rules will be generated for you, along with the ssh keys, when you provision through this service.

Step 3 Log into your instance. To do this you will need ssh and the keys that Bitnami generates for you. The instance creation takes 10-15 minutes, after which you should see a screen with the IP address and links to download the keys.
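To avoid retyping the key path and address on every login, one option is an entry in the ssh client config. The sketch below is only an illustration: the host alias, IP address, key filename, and the "bitnami" login user are all placeholders and assumptions, so substitute whatever the launch screen shows for your instance.

```shell
#!/usr/bin/env bash
# Append a host entry to an ssh client config file so that a short alias
# reaches the instance. Every value passed in is supplied by the caller;
# none of them come from the original post.
add_ssh_host() {
  local config_file="$1" host_alias="$2" ip="$3" user="$4" key_file="$5"
  mkdir -p "$(dirname "$config_file")"
  cat >> "$config_file" <<EOF
Host $host_alias
    HostName $ip
    User $user
    IdentityFile $key_file
EOF
  chmod 600 "$config_file"
}

# Example with hypothetical values (203.0.113.10 is a documentation address):
#   chmod 600 ~/Downloads/bitnami-hadoop.pem
#   add_ssh_host ~/.ssh/config hadoop-master 203.0.113.10 bitnami ~/Downloads/bitnami-hadoop.pem
#   ssh hadoop-master
```

The chmod on the key file matters because ssh refuses private keys that are readable by other users.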

Step 4 Once you have access to the master system you can execute the commands that we did last week. The only key difference with this implementation is that you will need to install java-1.8 with a yum install because by default the development kit is not installed and we need the jar functionality as part of configuration.
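A quick way to see whether the development kit is already there is to look for its command line tools on the PATH. This is a minimal sketch; the function name is made up for illustration, and the yum command in the usage comment is the same one used in the steps in this post.

```shell
#!/usr/bin/env bash
# Print the name of each requested tool that is not on the PATH.
# Returns 0 when every tool is present and 1 when at least one is missing.
missing_jdk_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: install the JDK packages only when javac or jar is missing.
#   missing_jdk_tools javac jar || sudo yum install java-1.8\*
```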
The steps needed to repeat our tests from the previous blog entry are:

--- setup hdfs file system
hdfs namenode -format
hdfs getconf -namenodes
hdfs dfs -mkdir input
mkdir -p input
cp /opt/bitnami/hadoop/etc/hadoop/*.xml input
hdfs dfs -put input/*.xml input
--- setup simple test with wordcount
hdfs dfs -mkdir wordcount
hdfs dfs -mkdir wordcount/input
mkdir ~/wordcount
mkdir ~/wordcount/input
vi file01
mv file01 ~/wordcount/input
vi ~/wordcount/input/file02
hdfs dfs -put ~/wordcount/input/* wordcount/input
vi WordCount.java
--- install java-1.8 to get all of the libraries
sudo yum install java-1.8\*
--- create wc.jar file
export HADOOP_CLASSPATH=/opt/bitnami/java/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount wordcount/input wordcount/output
hadoop fs -cat wordcount/output/part-r-00000
--- download data and test pig
mkdir data
cd data
wget http://stat-computing.org/dataexpo/2009/1987.csv.bz2
wget http://stat-computing.org/dataexpo/2009/1988.csv.bz2
bzip2 -d 1987.csv.bz2
bzip2 -d 1988.csv.bz2
hdfs dfs -mkdir airline
hdfs dfs -copyFromLocal 19*.csv airline
vi totalmiles.pig
pig totalmiles.pig
hdfs dfs -cat data/totalmiles/part-r-00000
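The steps above create file01 and file02 in vi without showing their contents. As a sketch, the snippet below writes two small files and counts the words locally with standard shell tools, which gives something to compare against the output of the Hadoop job. The file contents are an assumption (they are the sample lines from the Apache Hadoop MapReduce tutorial), and the local_wordcount helper is made up for illustration.

```shell
#!/usr/bin/env bash
# Write two small sample files. The contents are assumed; they are the
# sample lines used in the Apache Hadoop MapReduce tutorial.
workdir="${WORKDIR:-$(mktemp -d)}"
printf 'Hello World Bye World\n'       > "$workdir/file01"
printf 'Hello Hadoop Goodbye Hadoop\n' > "$workdir/file02"

# Count words locally: put one word per line, sort in byte order,
# count duplicates, and print "word<TAB>count".
local_wordcount() {
  cat "$@" \
    | tr -s '[:space:]' '\n' \
    | sed '/^$/d' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{print $2 "\t" $1}'
}

local_wordcount "$workdir/file01" "$workdir/file02"
# Prints:
#   Bye      1
#   Goodbye  1
#   Hadoop   2
#   Hello    2
#   World    2

# The same files can then be loaded into HDFS as in the steps above:
#   hdfs dfs -put "$workdir"/file0* wordcount/input
```

If the WordCount job ran correctly against these two files, the part-r-00000 file should list the same words and counts.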

Note that we can do the exact same thing using Amazon AWS. They have a MapReduce product called EMR. If you go to the main console and click on EMR at the bottom of the screen, you can create a Hadoop cluster. Once everything is created and you can ssh into the master, you can repeat the steps above.

On EMR I had a little trouble with the WordCount.java program because the library versions were a little different. The JVM_1.7 libraries had a problem linking, and adding the JVM_1.8 binaries did not work properly with the Hadoop binaries. You also need to change HADOOP_CLASSPATH to point to the proper tools.jar file, since it is in a different location than on the Bitnami install. I think with a little tweaking it would all work. The pig sample code works with no problem, so we were able to test that without changing anything.
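One way to find the right tools.jar without hard-coding a path is to start from wherever javac really lives and look in the lib directory next to it. This is a sketch under the assumption of a Java 8 or earlier JDK laid out as bin/javac and lib/tools.jar; the function name is made up, and I have not confirmed that this resolves the linking problem on EMR.

```shell
#!/usr/bin/env bash
# Given a path to javac (possibly a symlink), resolve it to the real file,
# treat the directory two levels up as the JDK home, and print the path of
# lib/tools.jar under it. Returns 1 if the file is not there.
tools_jar_for() {
  local javac_path jdk_home
  javac_path="$(readlink -f "$1")" || return 1
  jdk_home="$(dirname "$(dirname "$javac_path")")"
  if [ -f "$jdk_home/lib/tools.jar" ]; then
    echo "$jdk_home/lib/tools.jar"
  else
    echo "tools.jar not found under $jdk_home" >&2
    return 1
  fi
}

# Example: point Hadoop at the tools.jar that matches the javac on the PATH.
#   export HADOOP_CLASSPATH="$(tools_jar_for "$(command -v javac)")"
```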

In summary, provisioning a Hadoop server or cluster in the cloud is very easy if someone else has done the heavy lifting and pre-configured a server or group of servers for you. I was able to provision two clusters before lunch, run through the exercises, and still have time to go through it again to verify. Using a service like private Marketplaces, Bitnami, or the AWS Marketplace makes it much simpler to deploy sandbox images.
