Friday Apr 03, 2015

Cloudera Hive client on Solaris

Goal: From an Oracle Solaris server, access Hive data that is stored in a Cloudera Hadoop cluster; for example, access Hive data in an Oracle Big Data Appliance from a Solaris SPARC RDBMS server.

[screenshot: SC_BDA.png]

This blog assumes that the Cloudera Hadoop cluster is already up and running. To test, I started with an Oracle Big Data Lite virtual machine that was installed using the process documented at "Big Data Lite with a static IP address on Oracle Linux 6 / EL6".

[screenshot: start.png]

Step 1: On the Solaris server, create an "oracle" user:

The user doesn't have to be "oracle", but that fits well with my planned usage.

# /usr/sbin/groupadd oinstall
# /usr/sbin/groupadd dba
# /usr/sbin/useradd -d localhost:/export/home/oracle -m -g oinstall -G dba oracle
# passwd oracle
# echo "oracle ALL=(ALL) ALL" > /etc/sudoers.d/oracle
# su - oracle


Set up a VNC server using the process at Installing VNC server on Solaris 11.

Step 2: Visit the Cloudera Manager to determine the version of Cloudera that is running:

  • Click on the "Support" pull-down menu, then "About":

[screenshot: ClientConfig_1.png]

  • In my case, I was using Cloudera Express 5.1.2
[screenshot: version2_1.png]

Step 3: On the Oracle Solaris server, install the Hive and Hadoop tarballs that match your cluster:
[screenshot: tarball_1_1.png]

  • Choose your version. In my case, CDH 5.1.2.

[screenshot: tarball_2_1.png]

[screenshot: tarball_3_1.png]

  • Unpack the tarballs on your Solaris server:
$ tar xzf Downloads/hadoop-2.3.0-cdh5.1.2.tar.gz
$ tar xzf Downloads/hive-0.12.0-cdh5.1.2.tar.gz


  • And verify:
$ ls hadoop-2.3.0-cdh5.1.2
bin                  examples             libexec
bin-mapreduce1       examples-mapreduce1  sbin
cloudera             include              share
etc                  lib                  src

$ ls hive-0.12.0-cdh5.1.2/

bin                examples           LICENSE            RELEASE_NOTES.txt
conf               hcatalog           NOTICE             scripts
docs               lib                README.txt


Step 4: Download the HDFS configuration files to the Solaris server:
  • In the Cloudera Manager, go to the HDFS page:
[screenshot: HdfsClientConfig1_1.png]

  • From the HDFS page, download the client configuration. Place the client configuration onto the Solaris server:

[screenshot: HdfsClientConfig2_1.png]

  • Unpack the HDFS client configuration on your Solaris server:

$ unzip Downloads/hdfs-clientconfig.zip
Archive:  Downloads/hdfs-clientconfig.zip
  inflating: hadoop-conf/hdfs-site.xml 
  inflating: hadoop-conf/log4j.properties 
  inflating: hadoop-conf/topology.py 
  inflating: hadoop-conf/topology.map 
  inflating: hadoop-conf/hadoop-env.sh 
  inflating: hadoop-conf/core-site.xml 

Step 6: Configure the HDFS client software on the Solaris server

  • Edit hadoop-conf/hadoop-env.sh and set JAVA_HOME correctly:

export JAVA_HOME=/usr/jdk/instances/jdk1.7.0

  • Move the configuration files into place:

$ cp hadoop-conf/* hadoop-2.3.0-cdh5.1.2/etc/hadoop/

  • Add the Hadoop bin directory to your PATH. You may want to do this in your current shell or in your .bashrc:

$ export PATH=~/hadoop-2.3.0-cdh5.1.2/bin:$PATH

Step 7: Test the HDFS client software on Solaris

  • Verify that the remote HDFS filesystem is visible from your Solaris server:

$ hdfs dfs -ls
15/04/02 14:40:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 6 items
drwx------   - oracle oracle          0 2014-08-25 01:55 .Trash
drwx------   - oracle oracle          0 2014-09-23 09:25 .staging
drwxr-xr-x   - oracle oracle          0 2014-01-12 15:15 moviedemo
drwxr-xr-x   - oracle oracle          0 2014-09-24 05:38 moviework
drwxr-xr-x   - oracle oracle          0 2014-09-08 11:50 oggdemo
drwxr-xr-x   - oracle oracle          0 2014-09-20 09:59 oozie-oozi


Step 8: Download the Hive configuration files to the Solaris server:

  • In the Cloudera Manager, go to the Hive page:
[screenshot: go_to_hive_1.png]

  • From the Hive page, download the client configuration. Place the client configuration on the Solaris server:

[screenshot: ClientConfig_1.png]

  • Unpack the hive client configuration on your Solaris server:

$ unzip Downloads/hive-clientconfig.zip
Archive:  Downloads/hive-clientconfig(1).zip
  inflating: hive-conf/hive-site.xml 
  inflating: hive-conf/hive-env.sh  
  inflating: hive-conf/log4j.properties 
  inflating: hive-conf/hadoop-env.sh 
  inflating: hive-conf/core-site.xml 
  inflating: hive-conf/mapred-site.xml 
  inflating: hive-conf/topology.py  
  inflating: hive-conf/yarn-site.xml 
  inflating: hive-conf/hdfs-site.xml 
  inflating: hive-conf/topology.map  


Step 9: Configure the Hive client software on the Solaris server

  • Move the configuration files into place:

$ cp hive-conf/* hive-0.12.0-cdh5.1.2/conf/

Step 10: YARN configuration option for Hadoop:

  • The HDFS configuration files don't include yarn-site.xml. Copy the YARN configuration file from the Hive tree to the Hadoop tree:

$ cp hive-0.12.0-cdh5.1.2/conf/yarn-site.xml hadoop-2.3.0-cdh5.1.2/etc/hadoop/

Step 11: Hide hive-env.sh:

$ mv hive-0.12.0-cdh5.1.2/conf/hive-env.sh hive-0.12.0-cdh5.1.2/conf/hive-env.sh.HIDDEN

  • Add the Hive bin directory to your PATH. You may want to do this in your current shell or in your .bashrc:

$ export PATH=~/hive-0.12.0-cdh5.1.2/bin:$PATH

Step 12: Test the Hive client software on Solaris

  • Verify that the remote Hive tables are visible from your Solaris server

$ hive
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/04/03 12:20:37 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/04/03 12:20:37 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.

Logging initialized using configuration in jar:file:/home/oracle/hive-0.12.0-cdh5.1.2/lib/hive-common-0.12.0-cdh5.1.2.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/oracle/hadoop-2.3.0-cdh5.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/oracle/hive-0.12.0-cdh5.1.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive> show tables;
OK
cust
movie
movie_rating
movieapp_log_avro
movieapp_log_json
movieapp_log_odistage
movieapp_log_stage
movielog
session_stats
Time taken: 1.799 seconds, Fetched: 9 row(s)
hive> 

Conclusion: The Oracle Solaris server was successfully configured to access Hive data stored in a Cloudera Hadoop cluster.

Tuesday Aug 13, 2013

Hive 0.11 (May 15, 2013) and Rank() within a category

This is a follow-up to the Stack Overflow question HiveQL and rank():

libjack recommended that I upgrade to Hive 0.11 (May 15, 2013) to take advantage of its Windowing and Analytics functions. His recommendation worked immediately, but it took me a while to find the right syntax for sorting within categories. This blog entry records the correct syntax.


1. Sales Rep data

Here is a CSV file with Sales Rep data:

$ more reps.csv
1,William,2
2,Nadia,1
3,Daniel,2
4,Jana,1


Create a Hive table for the Sales Rep data:

create table SalesRep (
  RepID INT,
  RepName STRING,
  Territory INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

... and load the CSV into the Hive Sales Rep table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/reps.csv'
 INTO TABLE SalesRep;



2. Purchase Order data

Here is a CSV file with PO data:

$ more purchases.csv
4,1,100
2,2,200
2,3,600
3,4,80
4,5,120
1,6,170
3,7,140


Create a Hive table for the PO's:

create table purchases (
  SalesRepId INT,
  PurchaseOrderId INT,
  Amount INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';


... and load CSV into the Hive PO table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
 INTO TABLE purchases;



3. Hive JOIN

So this is the underlying data that is being worked with:

SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;


PO ID  Rep      Amount  Territory
1      Jana     100     1
2      Nadia    200     1
3      Nadia    600     1
4      Daniel   80      2
5      Jana     120     1
6      William  170     2
7      Daniel   140     2


4. Hive Rank by Volume only

SELECT
  s.RepName, s.Territory, V.volume,
  rank() over (ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1          800     1
Daniel   2          220     2
Jana     1          220     2
William  2          170     4

The ranking is over the entire data set: Daniel is tied for second among all reps.
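RANK() gives tied rows the same rank and then skips ranks (both 220 volumes rank 2, and the next rank is 4, not 3). As a cross-check, the same aggregation and ranking can be sketched in plain Python over the CSV rows above (the tie order between Daniel and Jana is arbitrary, just as in Hive):

```python
from collections import defaultdict

# (SalesRepId, PurchaseOrderId, Amount) rows from purchases.csv
purchases = [(4, 1, 100), (2, 2, 200), (2, 3, 600), (3, 4, 80),
             (4, 5, 120), (1, 6, 170), (3, 7, 140)]
# RepID -> (RepName, Territory) from reps.csv
reps = {1: ("William", 2), 2: ("Nadia", 1), 3: ("Daniel", 2), 4: ("Jana", 1)}

# SUM(amount) GROUP BY SalesRepId
volume = defaultdict(int)
for rep_id, _po, amount in purchases:
    volume[rep_id] += amount

# rank() over (ORDER BY volume DESC): ties share a rank, the next rank skips
rows = sorted(volume.items(), key=lambda kv: kv[1], reverse=True)
ranked = []
for i, (rep_id, vol) in enumerate(rows):
    rank = i + 1 if i == 0 or vol != rows[i - 1][1] else ranked[-1][3]
    ranked.append((reps[rep_id][0], reps[rep_id][1], vol, rank))

for name, territory, vol, rank in ranked:
    print(name, territory, vol, rank)
```

This reproduces the volumes (800, 220, 220, 170) and ranks (1, 2, 2, 4) from the query output above.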


5. Hive Rank within Territory, by Volume

SELECT
  s.RepName, s.Territory, V.volume,
  rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1          800     1
Jana     1          220     2
Daniel   2          220     1
William  2          170     2

The ranking is within the territory: Daniel is the best in his territory.
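The PARTITION BY clause restarts the ranking inside each territory. The same partitioned ranking can be sketched in plain Python over the per-rep volumes computed from the data above (a cross-check of the semantics, not how Hive executes it):

```python
from collections import defaultdict

# SUM(amount) per SalesRepId, and (RepName, Territory), from the CSVs above
volume = {2: 800, 3: 220, 4: 220, 1: 170}
reps = {1: ("William", 2), 2: ("Nadia", 1), 3: ("Daniel", 2), 4: ("Jana", 1)}

# Group reps by territory (the PARTITION BY), then rank within each group
by_territory = defaultdict(list)
for rep_id, vol in volume.items():
    name, territory = reps[rep_id]
    by_territory[territory].append((name, vol))

rank_in_territory = {}
for territory, members in by_territory.items():
    members.sort(key=lambda nv: nv[1], reverse=True)  # ORDER BY volume DESC
    for i, (name, vol) in enumerate(members):
        if i > 0 and vol == members[i - 1][1]:
            rank = rank_in_territory[members[i - 1][0]]  # tie: share the rank
        else:
            rank = i + 1
        rank_in_territory[name] = rank

print(rank_in_territory)
```

Nadia and Jana are ranked 1 and 2 within territory 1; Daniel and William are ranked 1 and 2 within territory 2, matching the query output above.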


6. FYI: this example was developed on a SPARC T4 server running Oracle Solaris 11 and Apache Hadoop 1.0.4.

Wednesday May 22, 2013

Debugging Hadoop using Solaris Studio in a Solaris 11 Zone


I've found Orgad Kimchi's How to Set Up a Hadoop Cluster Using Oracle Solaris Zones to be very useful; however, for a development environment it is too complex. When map/reduce tasks are running in a clustered environment, it is challenging to isolate bugs. Debugging is easier when working within a standalone Hadoop installation. I've put together the following instructions for installing a standalone Hadoop configuration in a Solaris Zone, with Solaris Studio for application development.

A lovely feature of Solaris is that your global zone may host both a Hadoop cluster set up in a manner similar to Orgad's instructions and simultaneously host a zone for development that is running a Hadoop standalone configuration.

Create the Zone

These instructions assume that Solaris 11.1 is already running in the Global Zone.

Add the Hadoop Studio Zone

# dladm create-vnic -l net0 hadoop_studio

# zonecfg -z hadoop-studio
Use 'create' to begin configuring a new zone.
zonecfg:hadoop-studio> create
create: Using system default template 'SYSdefault'
zonecfg:hadoop-studio> set zonepath=/ZONES/hadoop-studio
zonecfg:hadoop-studio> add net
zonecfg:hadoop-studio:net> set physical=hadoop_studio
zonecfg:hadoop-studio:net> end
zonecfg:hadoop-studio> verify
zonecfg:hadoop-studio> commit
zonecfg:hadoop-studio> exit


Install and boot the zone

# zoneadm -z hadoop-studio install
# zoneadm -z hadoop-studio boot

Log in to the zone console to set the network, time, root password, and unprivileged user.

# zlogin -C hadoop-studio

After the zone's initial configuration steps, nothing else needs to be done from within the global zone. You should be able to log into the Hadoop Studio zone with ssh as the unprivileged user and gain privileges with "su" and "sudo".

All of the remaining instructions are from inside the Hadoop Studio Zone.


Install extra Solaris software and set up the development environment

I like to start with both JDKs installed and not rely on the "/usr/java" symbolic link:

# pkg install jdk-6
# pkg install --accept jdk-7


Verify the JDKs:

# /usr/jdk/instances/jdk1.6.0/bin/java -version
java version "1.6.0_35"
Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
Java HotSpot(TM) Server VM (build 20.10-b01, mixed mode)

# /usr/jdk/instances/jdk1.7.0/bin/java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) Server VM (build 23.3-b01, mixed mode)


Add VNC Remote Desktop software

# pkg install --accept solaris-desktop


Create a Hadoop user:

# groupadd hadoop
# useradd -d localhost:/export/home/hadoop -m -g hadoop hadoop
# passwd hadoop
# usermod -R root hadoop


Edit /home/hadoop/.bashrc:

export PATH=/usr/bin:/usr/sbin
export PAGER="/usr/bin/less -ins"
typeset +x PS1="\u@\h:\w\\$ "

# Hadoop
export HADOOP_PREFIX=/home/hadoop/hadoop
export PATH=$HADOOP_PREFIX/bin:$PATH

# Java
export JAVA_HOME=/usr/jdk/instances/jdk1.6.0
export PATH=$JAVA_HOME/bin:$PATH

# Studio
export PATH=$PATH:/opt/solarisstudio12.3/bin
alias solstudio='solstudio --jdkhome /usr/jdk/instances/jdk1.6.0'

Edit /home/hadoop/.bash_profile:

. ~/.bashrc

And make sure that the ownership and permissions make sense:

# ls -l /home/hadoop/.bash*      
-rw-r--r--   1 hadoop   hadoop        12 May 22 05:24 /home/hadoop/.bash_profile
-rw-r--r--   1 hadoop   hadoop       372 May 22 05:24 /home/hadoop/.bashrc


Now is a good time to start a remote VNC desktop for this zone:

# su - hadoop

$ vncserver


You will require a password to access your desktops.

Password:
Verify:
xauth:  file /home/hadoop/.Xauthority does not exist

New 'hadoop-studio:1 ()' desktop is hadoop-studio:1

Creating default startup script /home/hadoop/.vnc/xstartup
Starting applications specified in /home/hadoop/.vnc/xstartup
Log file is /home/hadoop/.vnc/hadoop-studio:1.log

Access the remote desktop with your favorite VNC client

The default 10-minute timeout on the VNC desktop is too fast for my preferences:

System -> Preferences -> Screensaver
  Display Modes:
  Blank after: 100
  Close the window (I always look for a "save" button, but no, just close the window without explicitly saving.)



Download and Install Hadoop

For this article, I used the "12 October, 2012 Release 1.0.4" release. Download the Hadoop tarball and copy it into the home directory of the hadoop user:

$ ls -l hadoop-1.0.4.tar.gz
-rw-r--r--   1 hadoop   hadoop   62793050 May 21 12:03 hadoop-1.0.4.tar.gz

Unpack the tarball into the home directory of the hadoop user:

$ gzip -dc hadoop-1.0.4.tar.gz  | tar -xvf -
$ mv hadoop-1.0.4 hadoop


Hadoop comes pre-configured in Standalone Mode

Edit /home/hadoop/hadoop/conf/hadoop-env.sh, and set JAVA_HOME:

export JAVA_HOME=/usr/jdk/instances/jdk1.6.0

That is all. Now, you can run a Hadoop example:

$ hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 2 10
Number of Maps  = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
...
Job Finished in 10.359 seconds
Estimated value of Pi is 3.80000000000000000000
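The pi example estimates π by dart-throwing: each map task counts sample points that land inside the unit circle, and the reduce sums the counts (Hadoop's version actually uses a Halton quasi-random sequence, which is why its answer is deterministic). A single-process Python sketch of the same idea — the function name and seed are my own:

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=42):
    """Each 'map' counts darts landing inside the unit quarter-circle;
    the 'reduce' sums the counts and scales the ratio to estimate pi."""
    rng = random.Random(seed)
    inside = 0
    total = num_maps * samples_per_map
    for _ in range(total):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

print(estimate_pi(2, 10))       # tiny sample, crude estimate
print(estimate_pi(100, 1000))   # more samples, much closer to 3.14159
```

With only 2 maps of 10 samples the estimate is crude (the run above got 3.8); accuracy improves with the total sample count.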



Install Solaris Studio:

Visit https://pkg-register.oracle.com/ to obtain Oracle_Solaris_Studio_Support.key.pem and Oracle_Solaris_Studio_Support.certificate.pem, then follow the instructions for "pkg set-publisher" and "pkg update" or "pkg install".

# sudo pkg set-publisher \
          -k /var/pkg/ssl/Oracle_Solaris_Studio_Support.key.pem \
          -c /var/pkg/ssl/Oracle_Solaris_Studio_Support.certificate.pem \
          -G '*' -g https://pkg.oracle.com/solarisstudio/support solarisstudio

# pkg install developer/solarisstudio-123/*


If your network requires a proxy, you will need to set the proxy before starting Solaris Studio:

[screenshot: proxy.jpg]


Start Solaris Studio:

$ solstudio

(Notice the alias in .bashrc that adds --jdkhome to the solstudio start up command.)

Go to "Tools -> Plugins".

Click on "Reload Catalog".

Load the Java SE plugins. I ran into a problem when the Maven plug-in was installed; something that I should diagnose at a future date.

[screenshot: plugins.jpg]



Create a New Project:

File -> New Project

Step 1:
- Category: Java
- Project: Java Application
- Next

[screenshot: NewProject.jpg]

Step 2: Fill in similar to this:

[screenshot: NewJavaApplication.jpg]

Copy the example source into the project:

$ cp -r \
    $HADOOP_PREFIX/src/examples/org/apache/hadoop/examples/* \
    ~/SolStudioProjects/examples/src/org/apache/hadoop/examples/


Starting to look like a development environment:

[screenshot: DevEnvironment.jpg]


Modify the Project to compile with Hadoop jars. Right-click on the project and select "Properties"

[screenshot: properties.jpg]


Add in the necessary Hadoop compile jars:

[screenshot: CompileJars.jpg]

I found that I needed these jars at run time:

[screenshot: RunJars.jpg]

Add Program Arguments (2 10):

[screenshot: Arguments.jpg]

Now, if you click on the "Run" button, PiEstimators will run inside the IDE:

[screenshot: PiRuns.jpg]

And the setup behaves as expected if you set a breakpoint and click on "Debug":

[screenshot: debug.jpg]

Thursday May 16, 2013

Hadoop Java Error logs

I was having trouble isolating a problem with "reduce" tasks running on Hadoop slave servers. 

After poking around on the Hadoop slave, I found an interesting lead in /var/log/hadoop/userlogs/job_201302111641_0057/attempt_201302111641_0057_r_000001_1/stdout:

$ cat /tmp/hadoop-hadoop/mapred/local/userlogs/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/stdout

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xfe67cb31, pid=25828, tid=2
#
# JRE version: 6.0_35-b10
# Java VM: Java HotSpot(TM) Server VM (20.10-b01 mixed mode solaris-x86 )
# Problematic frame:
# C  [libc.so.1+0xbcb31]  pthread_mutex_trylock+0x29
#
# An error report file with more information is saved as:
# /tmp/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/work/hs_err_pid25828.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

The HotSpot crash log (hs_err_pid25828.log in my case) will be very interesting because it contains information obtained at the time of the fatal error, including the following, where possible:

  • The operating exception or signal that provoked the fatal error
  • Version and configuration information
  • Details on the thread that provoked the fatal error and the thread's stack trace
  • The list of running threads and their state
  • Summary information about the heap
  • The list of native libraries loaded
  • Command line arguments
  • Environment variables
  • Details about the operating system and CPU

Great, but hs_err_pid25654.log had been cleaned up before I could get to it. In fact, I found that the hs_err_pid*.log files were available for less than a minute and were always gone before I could capture one.
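One workaround for that race is to poll the directory and snapshot any crash logs before cleanup removes them. A hypothetical helper in Python — the watch directory, save directory, and polling cadence are all assumptions, not part of Hadoop:

```python
import glob
import os
import shutil

def snapshot_crash_logs(watch_dir, save_dir, pattern="hs_err_pid*.log"):
    """Copy any new JVM crash logs in watch_dir to save_dir.

    Returns the list of file names copied on this pass; files already
    saved are skipped so the function is safe to call in a loop."""
    os.makedirs(save_dir, exist_ok=True)
    copied = []
    for path in glob.glob(os.path.join(watch_dir, pattern)):
        dest = os.path.join(save_dir, os.path.basename(path))
        if not os.path.exists(dest):
            shutil.copy2(path, dest)
            copied.append(os.path.basename(path))
    return copied

# Usage on the Hadoop slave (poll faster than the task-tracker cleanup), e.g.:
#   while True:
#       snapshot_crash_logs("/tmp", "/var/tmp/hs_err_saved")
#       time.sleep(5)
```

This is only a stopgap for catching a log in flight; the configuration change below is the durable fix.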

To try to retain the Java error log file, my first incorrect guess was:

 <property>
   <name>keep.failed.task.files</name>
   <value>true</value>
 </property>


My next approach was to add "-XX:ErrorFile=/tmp/hs_err_pid%p.log" to the Java command line for the reduce task.

When I tried adding the Java option to HADOOP_OPTS in /usr/local/hadoop/conf/hadoop-env.sh, I realized that this setting isn't applied to the Map and Reduce Task JVMs.

Finally, I found that adding the Java option to the mapred.child.java.opts property in mapred-site.xml WORKED!!

$ cat /usr/local/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

     <property>
         <name>mapred.job.tracker</name>
         <value>p3231-name-node:8021</value>
     </property>

     <property>
         <name>mapred.child.java.opts</name>
         <value>-XX:ErrorFile=/tmp/hs_err_pid%p.log</value>
     </property>


</configuration>


Now I can view the Java error logs on my Hadoop slaves:

$ ls -l /tmp/*err*
-rw-r--r--   1 hadoop   hadoop     15626 May 16 15:42 /tmp/hs_err_pid10028.log
-rw-r--r--   1 hadoop   hadoop     15795 May 16 15:43 /tmp/hs_err_pid10232.log

About

Jeff Taylor-Oracle
