Step 1:
Install Oracle Linux 6.7 on an Oracle Compute Cloud instance. Note that you can do the same thing on your favorite virtualization engine like VirtualBox, VMware, or Hyper-V, or on any other cloud vendor; beyond this point the only true dependency is the operating system. If you are installing on the Oracle Cloud, go with the OL_67_3GB..... image, pick the smallest instance shape, delete the boot disk, replace it with a 60 GB disk, rename it, and launch. The key reason we need to delete the boot disk is that the default 3 GB disk will not hold the Hadoop binaries; we need to grow it to at least 40 GB, and a 60 GB disk gives us a little padding. If you check the new disk as a boot disk, it replaces the default root disk and lets you create an instance with a 60 GB disk.
Step 2:
Run yum to update the OS and install wget and Java 1.8. You need to log in to the instance as opc so that you can run commands as root with sudo.
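Something like the following works; the package names assume the OpenJDK 1.8 packages from the Oracle Linux 6 yum repository (the -devel package also provides the JDK tools we lean on later when compiling).
sudo yum update -y
sudo yum install -y wget
sudo yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel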
Note that we are going to diverge from the Hadoop for Dummies book that we referenced yesterday. It suggests attaching to a yum repository and installing the bigtop package from there. We don't have that option for Oracle Linux and need to do the install from the binaries by downloading a tar or source image. The bigtop package basically takes the Apache Hadoop bundle and translates it into rpm files for an operating system; Oracle does not provide this in its yum repository and Apache does not build one for Oracle Linux or Red Hat. Instead we are going to download the tar file from the links on the Apache Hadoop homepage and follow the install instructions for a single-node cluster.
Step 3:
Get the tar.gz file by pulling it from http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
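Using the wget we installed in Step 2, this is a one-liner:
wget http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz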
Step 4:
We unpack the tar.gz file with the tar xvzf hadoop-2.7.3.tar.gz command
Step 5:
Next we add the following to the .bashrc file in the home directory to set up some environment variables. The Java location is where the yum install put it; the Hadoop location assumes we downloaded and unpacked into the opc home directory.
export JAVA_HOME=/usr
export HADOOP_HOME=/home/opc/hadoop-2.7.3
export HADOOP_CONF_DIR=/home/opc/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=/home/opc/hadoop-2.7.3
export HADOOP_COMMON_HOME=/home/opc/hadoop-2.7.3
export HADOOP_HDFS_HOME=/home/opc/hadoop-2.7.3
export YARN_HOME=/home/opc/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin
Step 6
Source the .bashrc to pull in these environment variables
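Assuming the lines above went into ~/.bashrc, the following reloads them and confirms that the hadoop command is on the PATH:
source ~/.bashrc
hadoop version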
Step 7
Edit the /etc/hosts file to add an entry for namenode.
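A minimal sketch of the entry, assuming the instance's private IP is 10.0.0.5 (substitute the address your instance reports, for example from ifconfig):
10.0.0.5   namenode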
Step 8
Set up ssh so that we can loop back to localhost and launch an agent. I had to edit authorized_keys to add a newline before the new entry; if you don't, ssh won't work.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
vi ~/.ssh/authorized_keys
ssh localhost
exit
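If ssh localhost still prompts for a password, the usual culprit is permissions; sshd expects the .ssh directory and authorized_keys to be locked down:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys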
Step 9
Test the configuration, then configure the hadoop file system for a single node.
cd $HADOOP_HOME
mkdir input
cp etc/hadoop/*.xml input
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
vi etc/hadoop/core-site.xml
When we ran this there were a couple of warnings, which we can ignore. The test should finish without error and generate a long output list. We then edit the core-site.xml file so that the configuration block at the end reads as follows.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
Step 10
Create the hadoop file system with the command hdfs namenode -format
Step 11
Verify the configuration with the command hdfs getconf -namenodes
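This should echo back the host name we put in core-site.xml:
namenode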
Step 12
Start the hadoop file system with the command sbin/start-dfs.sh
At this point we have the hadoop filesystem up and running. We now need to configure MapReduce and test functionality.
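As a quick sanity check, jps (included with the JDK if the -devel package was installed) should list the three HDFS daemons, NameNode, DataNode, and SecondaryNameNode:
jps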
Step 13
Make the HDFS directories required to execute MapReduce jobs with the commands
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/opc
hdfs dfs -mkdir input
hdfs dfs -put etc/hadoop/*.xml input
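To confirm the files landed in HDFS:
hdfs dfs -ls input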
Step 14
Run a MapReduce example and look at the output
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
hdfs dfs -get output output
cat output/* output/output/*
Step 15
Create a test program to do a wordcount of two files. This example comes from an Apache MapReduce Tutorial
hdfs dfs -mkdir wordcount
hdfs dfs -mkdir wordcount/input
mkdir ~/wordcount
mkdir ~/wordcount/input
vi ~/wordcount/input/file01
- add
Hello World Bye World
vi ~/wordcount/input/file02
- add
Hello Hadoop Goodbye Hadoop
hdfs dfs -put ~/wordcount/input/* wordcount/input
vi ~/wordcount/WordCount.java
Create WordCount.java with the following code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Mapper: emit (word, 1) for every token in the input line
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
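  // The reducer and driver below complete the class, following the Apache
  // MapReduce tutorial that this example comes from.

  // Reducer (also used as the combiner): sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wire the mapper, combiner, and reducer into a job and run it
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}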
Step 16
Compile and run the WordCount.java code
cd ~/wordcount
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64
export HADOOP_CLASSPATH=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount wordcount/input wordcount/output
hadoop fs -cat wordcount/output/part-r-00000
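Given the two input files above, the counts should come back as:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2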
At this point we have a working system and can run more MapReduce jobs, look at results, and play around with Big Data foundations.
In summary, this is a relatively complex example. We have moved beyond a simple install of an Apache web server or Tomcat server and editing a few files to get results; we now have the foundation for a Big Data analytics solution running on the Oracle Compute Cloud Service. The install steps are very similar to the other installation tutorials that we referenced earlier for Amazon and virtual machines. Oracle Compute is a good foundation for open source code: per core, the compute is cheaper than other cloud vendors, the networking is non-blocking and higher performance, and the storage throughput is higher, optimized for high I/O, and tied to the compute engine. Hopefully this tutorial has given you the foundation to start playing with Hadoop on Oracle IaaS.