
Pat Shuff's Blog

IaaS | September 23, 2016

Hadoop on IaaS - part 2

Today we are going to get our hands dirty and install a single-instance, standalone Hadoop cluster on the Oracle Compute Cloud. This is a continuing series on installing public domain software on Oracle Cloud IaaS. We are going to base our installation on three components: Oracle Linux 6.7, Java 1.8, and the Apache Hadoop 2.7.3 tar bundle.
We are using Oracle Linux 6.7 because it is the easiest to install on Oracle Compute Cloud Services. We could have used Ubuntu, SUSE, or Fedora and followed one of the tutorials from HortonWorks, Cloudera, or the Apache single node cluster guide. Instead we are going old school and installing from the Hadoop home page by downloading a tar ball and configuring the operating system to run a single node cluster.

Step 1:

Install Oracle Linux 6.7 on an Oracle Compute Cloud instance. Note that you can do the same thing by installing on your favorite virtualization engine like VirtualBox, VMWare, HyperV, or any other cloud vendor; the only true dependency beyond this point is the operating system. If you are installing on the Oracle Cloud, go with the OL_67_3GB..... image, go with the smallest shape, delete the boot disk, replace it with a 60 GB disk, rename it, and launch. The key reason that we need to delete the boot disk is that by default the 3 GB disk will not hold the Hadoop binaries. We need to grow it to at least 40 GB, so we pad a little bit with a 60 GB disk. If you check the new disk as a boot disk, it replaces the default root disk and allows you to create an instance with a 60 GB disk.



Step 2:

Run yum to update the OS and install wget and Java 1.8. You need to log in to the instance as opc so that you can run commands as root.
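A minimal sketch of those commands, assuming the OpenJDK 1.8 packages from the Oracle Linux yum repository (the exact package names may differ on your image):

sudo yum update -y
sudo yum install -y wget
# the -devel package pulls in the JDK tools (javac, tools.jar) needed later to compile WordCount
sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel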


Note that we are going to diverge from the Hadoop for Dummies book that we referenced yesterday. It suggests attaching to a yum repository and installing the bigtop package from that repository. We don't have that option for Oracle Linux and need to do the install from the binaries by downloading a tar or source image. The bigtop package basically takes the Apache Hadoop bundle and translates it into rpm files for an operating system. Oracle does not provide this as part of the yum repository, and Apache does not create one for Oracle Linux or RedHat. We are going to download the tar file from the links provided at the Apache Hadoop homepage and follow the install instructions for a single node cluster.

Step 3:

Get the tar.gz file by pulling it from http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
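With the wget that we installed in step 2 this is a one-liner (osuosl.org is just the mirror suggested by the Apache download page; any mirror works):

wget http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz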

Step 4:
We unpack the tar.gz file with the tar xvzf hadoop-2.7.3.tar.gz command.
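Run from the opc home directory where the file was downloaded, this creates /home/opc/hadoop-2.7.3, which is the directory the environment variables in the next step point to:

cd /home/opc
tar xvzf hadoop-2.7.3.tar.gz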

Step 5:

Next we add the following to the .bashrc file in the home directory to set up some environment variables. JAVA_HOME points at /usr because that is where the yum command installed Java. The location of the Hadoop code is based on downloading into the opc home directory.

export JAVA_HOME=/usr
export HADOOP_HOME=/home/opc/hadoop-2.7.3
export HADOOP_CONFIG_DIR=/home/opc/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=/home/opc/hadoop-2.7.3
export HADOOP_COMMON_HOME=/home/opc/hadoop-2.7.3
export HADOOP_HDFS_HOME=/home/opc/hadoop-2.7.3
export YARN_HOME=/home/opc/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin

Step 6

Source the .bashrc to pull in these environment variables
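A one-liner to do this without logging out and back in:

source ~/.bashrc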

Step 7
Edit the /etc/hosts file to add an entry for namenode.
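A minimal sketch of the entry, assuming we simply point namenode at the loopback address for this single node install (pointing it at the instance's private IP would also work):

127.0.0.1   namenode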

Step 8

Set up ssh so that we can loop back to localhost and launch agents. I had to edit authorized_keys to add a newline before the new entry; if you don't, the ssh won't work.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
vi ~/.ssh/authorized_keys
ssh localhost
exit

Step 9
Test the installation, then configure the Hadoop file system for a single node.

cd $HADOOP_HOME
mkdir input
cp etc/hadoop/*.xml input
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
vi etc/hadoop/core-site.xml

When we ran this there were a couple of warnings, which we can ignore. The test should finish without error and generate a long output list. We then edit the core-site.xml file, changing the configuration block at the end of the file to the following.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>

Step 10

Create the hadoop file system with the command hdfs namenode -format
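The command runs as opc now that $HADOOP_HOME/bin is on the PATH; near the end of its output you should see a message that the storage directory has been successfully formatted:

hdfs namenode -format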

Step 11

Verify the configuration with the command hdfs getconf -namenodes
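This reads the configuration and prints the namenode host names; with the fs.defaultFS value we set above it should print the single word namenode.

hdfs getconf -namenodes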

Step 12

Start the hadoop file system with the command sbin/start-dfs.sh
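The script lives under $HADOOP_HOME, so run it from there. As an optional sanity check (not part of the original write-up), jps from the JDK should then list NameNode, DataNode, and SecondaryNameNode processes:

cd $HADOOP_HOME
sbin/start-dfs.sh
jps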

At this point we have the hadoop filesystem up and running. We now need to configure MapReduce and test functionality.
Step 13

Make the HDFS directories required to execute MapReduce jobs with the commands

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/opc
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input

Step 14
Run a MapReduce example and look at the output

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
hdfs dfs -get output output
cat output/* output/output/*

Step 15

Create a test program to do a wordcount of two files. This example comes from the Apache MapReduce tutorial.

hdfs dfs -mkdir wordcount
hdfs dfs -mkdir wordcount/input
mkdir ~/wordcount
mkdir ~/wordcount/input
vi ~/wordcount/input/file01
- add
Hello World Bye World
vi ~/wordcount/input/file02
- add
Hello Hadoop Goodbye Hadoop
hdfs dfs -put ~/wordcount/input/* wordcount/input
vi ~/wordcount/WordCount.java

Create WordCount.java with the following code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: split each input line into tokens and emit (word, 1) for each token
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Step 16

Compile and run the WordCount.java code

cd ~/wordcount
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64
export HADOOP_CLASSPATH=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount wordcount/input wordcount/output
hadoop fs -cat wordcount/output/part-r-00000
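With the two input files above, the word counts that the final cat prints should look like this (part-r-00000 is the standard name for the output of the single reducer):

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2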

At this point we have a working system and can run more MapReduce jobs, look at results, and play around with Big Data foundations.


In summary, this is a relatively complex example. We have moved beyond a simple install of an Apache web server or Tomcat server and editing some files to get results. We have the foundations for a Big Data analytics solution running on the Oracle Compute Cloud Service. The steps to install are very similar to the other installation tutorials that we referenced earlier for Amazon and virtual machines. Oracle Compute is a good foundation for public domain code. Per core, the processors are cheaper than at other cloud vendors. Networking is non-blocking and higher performance. Storage throughput is faster, optimized for high I/O, and tied to the compute engine. Hopefully this tutorial has given you the foundation to start playing with Hadoop on Oracle IaaS.
