Tuesday Aug 13, 2013

Hive 0.11 (May 15, 2013) and Rank() within a category

This is a follow-up to a Stack Overflow question, HiveQL and rank():

libjack recommended that I upgrade to Hive 0.11 (May 15, 2013) to take advantage of Windowing and Analytics functions. His recommendation worked immediately, but it took a while for me to find the right syntax to sort within categories. This blog entry records the correct syntax.


1. Sales Rep data

Here is a CSV file with Sales Rep data:

$ more reps.csv
1,William,2
2,Nadia,1
3,Daniel,2
4,Jana,1


Create a Hive table for the Sales Rep data:

create table SalesRep (
  RepID INT,
  RepName STRING,
  Territory INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

... and load the CSV into the Hive Sales Rep table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/reps.csv'
 INTO TABLE SalesRep;



2. Purchase Order data

Here is a CSV file with PO data:

$ more purchases.csv
4,1,100
2,2,200
2,3,600
3,4,80
4,5,120
1,6,170
3,7,140


Create a Hive table for the POs:

create table purchases (
  SalesRepId INT,
  PurchaseOrderId INT,
  Amount INT
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';


... and load CSV into the Hive PO table:

LOAD DATA
 LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
 INTO TABLE purchases;



3. Hive JOIN

So this is the underlying data that is being worked with:

SELECT p.PurchaseOrderId, s.RepName, p.amount, s.Territory
FROM purchases p JOIN SalesRep s
WHERE p.SalesRepId = s.RepID;


PO ID  Rep      Amount  Territory
1      Jana        100  1
2      Nadia       200  1
3      Nadia       600  1
4      Daniel       80  2
5      Jana        120  1
6      William     170  2
7      Daniel      140  2


4. Hive Rank by Volume only

SELECT
  s.RepName, s.Territory, V.volume,
rank() over (ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1             800  1
Daniel   2             220  2
Jana     1             220  2
William  2             170  4

The ranking is computed over the entire data set - Daniel is tied for second among all Reps.


5. Hive Rank within Territory, by Volume

SELECT
  s.RepName, s.Territory, V.volume,
  rank() over (PARTITION BY s.Territory ORDER BY V.volume DESC) as rank
FROM
  SalesRep s
  JOIN
    ( SELECT
      SalesRepId, SUM(amount) as Volume
      FROM purchases
      GROUP BY SalesRepId) V
  WHERE V.SalesRepId = s.RepID
  ORDER BY V.volume DESC;



Rep      Territory  Amount  Rank
Nadia    1             800  1
Jana     1             220  2
Daniel   2             220  1
William  2             170  2

The ranking is within the territory - Daniel is the best in his territory.


6. FYI: this example was developed on a SPARC T4 server running Oracle Solaris 11 and Apache Hadoop 1.0.4.

Tuesday Jul 23, 2013

Ganglia on Solaris 11.1

Here are some notes that I took while building Ganglia Core 3.6.0 and Ganglia Web 3.5.7 with Solaris Studio 12.3 and installing on Solaris 11.1. These notes are only intended to augment (not replace) other Ganglia install guides.

1) Add a ganglia user to build with (this step is optional: you may build as any user; gmond will run as root)

    # useradd -d localhost:/export/home/ganglia -m ganglia
    # passwd ganglia
    # usermod -R root ganglia
    # echo "ganglia ALL=(ALL) ALL" > /etc/sudoers.d/ganglia
    # chmod 440 /etc/sudoers.d/ganglia
    # su - ganglia



2) These packages are needed:

    $ sudo pkg install system/header gperf glibmm apache-22 php-53 apache-php53 apr-util-13 libconfuse rrdtool
    $ sudo svcadm enable apache22

3) Download and unpack:

$ gzip -dc ganglia-3.6.0.tar.gz | tar xvf -

$ gzip -dc ganglia-web-3.5.7.tar.gz | tar xvf -

$ cd ganglia-3.6.0


4) Compiler error when building libgmond.c: "default_conf.h", line 75: invalid directive. In lib/default_conf.h, remove the three lines that contain a '#' character:

udp_recv_channel {\n\
mcast_join = 239.2.11.71\n\
port = 8649\n\
  bind = 239.2.11.71\n\
retry_bind = true\n\
  # Size of the UDP buffer. If you are handling lots of metrics you really\n\
  # should bump it up to e.g. 10MB or even higher.\n\
# buffer = 10485760\n\
}\n\

5) Compiler error: "data_thread.c", line 143: undefined symbol: FIONREAD. In gmetad/data_thread.c, add:

#include <sys/filio.h>

6) Runtime error: "Cannot load /usr/local/lib/ganglia/modcpu.so metric module: ld.so.1: gmond: fatal: relocation error: file /usr/local/lib/ganglia/modcpu.so: symbol cpu_steal_func: referenced symbol not found". Add a stub to the bottom of ./libmetrics/solaris/metrics.c:

g_val_t
cpu_steal_func ( void )
{
  static g_val_t val=0;
  return val;
}


7) Build Ganglia Core 3.6.0 without gmetad for all of the machines in the cluster, except a primary node:

$ export PATH=/usr/local/bin:/usr/local/sbin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/lib:/usr/apr/1.3/lib
$ export PKG_CONFIG_PATH=/usr/apr/1.3/lib/pkgconfig

$ cd ganglia-3.6.0

$ ./configure

$ make

8) Install Ganglia Core 3.6.0 (on all machines in the cluster):

$ sudo make install
$ gmond --default_config > /tmp/gmond.conf
$ sudo cp /tmp/gmond.conf /usr/local/etc/gmond.conf

$ sudo vi /usr/local/etc/gmond.conf  # (Remove these lines)

metric {
  name = "cpu_steal"
  value_threshold = "1.0"
  title = "CPU steal"
}

9) Start gmond as root (on all machines in the cluster):

# export PATH=/usr/local/bin:/usr/local/sbin:$PATH
# export LD_LIBRARY_PATH=/usr/local/lib:/usr/apr/1.3/lib
# gmond

10) Build Ganglia Core 3.6.0 with gmetad for the primary node:

    $ export PATH=/usr/local/bin:/usr/local/sbin:$PATH
    $ export LD_LIBRARY_PATH=/usr/local/lib:/usr/apr/1.3/lib
    $ export PKG_CONFIG_PATH=/usr/apr/1.3/lib/pkgconfig

    $ cd ganglia-3.6.0

    $ ./configure --with-gmetad

    $ make


11) Install ganglia-web-3.5.7 (on the primary server)

    $ cd ganglia-web-3.5.7

    $ vi Makefile # Set these variables

    GDESTDIR = /var/apache2/2.2/htdocs/
    APACHE_USER =  webservd


    $ sudo make install


    $ sudo mkdir -p /var/lib/ganglia/rrds
    $ sudo chown -Rh nobody:nobody /var/lib/ganglia

12) Start gmond and gmetad on the primary node
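
The primary node runs both daemons. A minimal sketch, assuming the gmetad build from step 10 has also been installed with "sudo make install", and reusing the environment setup from step 9:

# export PATH=/usr/local/bin:/usr/local/sbin:$PATH
# export LD_LIBRARY_PATH=/usr/local/lib:/usr/apr/1.3/lib
# gmond
# gmetad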


13) Remove the "It works" page so that index.php is the default:

    $ sudo rm /var/apache2/2.2/htdocs/index.html

Now you can visit the primary node with a web browser.

Tuesday May 28, 2013

Adding users in Solaris 11 with power like the initial account

During Solaris 11.1 installation, the system administrator is prompted for a user name and password which will be used to create an unprivileged account. For security reasons, by default, root is a role, not a user, therefore the initial login can't be to a root account. The first login must use the initial unprivileged account. Later, the initial unprivileged user can acquire privileges through either "su" or "sudo".

For enterprise class deployments, the system administrator should be familiar with RBAC and create users with least privileges.

In contrast, I'm working in a lab environment and want to be able to simply and quickly create new users with power like the initial user. With Solaris 10, this was straightforward, but Solaris 11 adds a couple of twists.

Create a new user jenny in Solaris 11:

# useradd -d localhost:/export/home/jenny -m jenny
# passwd jenny

Jenny can't su to root:

jenny@app61:~$ su -
Password:
Roles can only be assumed by authorized users
su: Sorry

Because Jenny doesn't have that role:

jenny@app61:~$ roles
No roles

Give her the role:

root@app61:~# usermod -R root jenny


And then Jenny can su to root:

jenny@app61:~$ roles
root

jenny@app61:~$ su -
Password:
Oracle Corporation      SunOS 5.11      11.1    September 2012
You have new mail.
root@app61:~#


But even when jenny has the root role, she can't use sudo:

jenny@app61:~$ sudo -l
Password:
Sorry, user jenny may not run sudo on app61.

jenny@app61:~$ sudo touch /jenny
Password:
jenny is not in the sudoers file.  This incident will be reported.


Oh no, she is in big trouble, now.

User jeff was created as the initial account, and he can use sudo:

jeff@app61:~$ sudo -l
Password:
User jeff may run the following commands on this host:
    (ALL) ALL


But jeff isn't in the sudoers file:

root@app61:~# grep jeff /etc/sudoers

So how do you make jenny as powerful as jeff with respect to sudo?

Turns out that jeff, created during the Solaris installation, is in here:

root@app61:~# cat /etc/sudoers.d/svc-system-config-user
jeff ALL=(ALL) ALL


My coworker, Andrew, offers the following advice: "The last line of /etc/sudoers is a directive to read "drop-in" files from the /etc/sudoers.d directory. You can still edit /etc/sudoers. It may be better to leave svc-system-config-user alone and create another drop-in file for local edits. If you want to edit sudoers as part of an application install then you should create a drop-in for the application - this makes the edits easy to undo if you remove the application. If you have multiple drop-ins in /etc/sudoers.d they are processed in alphabetical (sorted) order. There are restrictions on file names and permissions for drop-ins. The permissions must be 440 (read only for owner and group) and the file name can't have a dot or ~ in it. These are in the very long man page."
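
Following that advice, giving jenny the same sudo power as jeff amounts to creating a drop-in of her own, for example (the file name here is my choice):

root@app61:~# echo "jenny ALL=(ALL) ALL" > /etc/sudoers.d/jenny
root@app61:~# chmod 440 /etc/sudoers.d/jenny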



Wednesday May 22, 2013

Debugging Hadoop using Solaris Studio in a Solaris 11 Zone


I've found Orgad Kimchi's How to Set Up a Hadoop Cluster Using Oracle Solaris Zones to be very useful; however, for a development environment, it is too complex. When map/reduce tasks are running in a clustered environment, it is challenging to isolate bugs. Debugging is easier when working within a standalone Hadoop installation. I've put the following instructions together for installation of a standalone Hadoop configuration in a Solaris Zone with Solaris Studio for application development.

A lovely feature of Solaris is that your global zone may host both a Hadoop cluster set up in a manner similar to Orgad's instructions and simultaneously host a zone for development that is running a Hadoop standalone configuration.

Create the Zone

These instructions assume that Solaris 11.1 is already running in the Global Zone.

Add the Hadoop Studio Zone

# dladm create-vnic -l net0 hadoop_studio

# zonecfg -z hadoop-studio
Use 'create' to begin configuring a new zone.
zonecfg:hadoop-studio> create
create: Using system default template 'SYSdefault'
zonecfg:hadoop-studio> set zonepath=/ZONES/hadoop-studio
zonecfg:hadoop-studio> add net
zonecfg:hadoop-studio:net> set physical=hadoop_studio
zonecfg:hadoop-studio:net> end
zonecfg:hadoop-studio> verify
zonecfg:hadoop-studio> commit
zonecfg:hadoop-studio> exit


Install and boot the zone

# zoneadm -z hadoop-studio install
# zoneadm -z hadoop-studio boot

Login to the zone console to set the network, time, root password, and unprivileged user.

# zlogin -C hadoop-studio

After the zone's initial configuration steps, nothing else needs to be done from within the global zone. You should be able to log into the Hadoop Studio zone with ssh as the unprivileged user and gain privileges with "su" and "sudo".
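
For example (the user name and hostname below are placeholders for whatever you chose during the zlogin -C configuration):

$ ssh myuser@hadoop-studio
$ sudo -l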

All of the remaining instructions are from inside the Hadoop Studio Zone.


Install extra Solaris software and set up the development environment

I like to start with both JDKs installed and not rely on the "/usr/java" symbolic link:

# pkg install  jdk-6
# pkg install --accept jdk-7


Verify the JDKs:

# /usr/jdk/instances/jdk1.6.0/bin/java -version
java version "1.6.0_35"
Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
Java HotSpot(TM) Server VM (build 20.10-b01, mixed mode)

# /usr/jdk/instances/jdk1.7.0/bin/java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) Server VM (build 23.3-b01, mixed mode)


Add VNC Remote Desktop software

# pkg install --accept solaris-desktop


Create a Hadoop user:

# groupadd hadoop
# useradd -d localhost:/export/home/hadoop -m -g hadoop hadoop
# passwd hadoop
# usermod -R root hadoop


Edit /home/hadoop/.bashrc:

export PATH=/usr/bin:/usr/sbin
export PAGER="/usr/bin/less -ins"
typeset +x PS1="\u@\h:\w\\$ "

# Hadoop
export HADOOP_PREFIX=/home/hadoop/hadoop
export PATH=$HADOOP_PREFIX/bin:$PATH

# Java
export JAVA_HOME=/usr/jdk/instances/jdk1.6.0
export PATH=$JAVA_HOME/bin:$PATH

# Studio
export PATH=$PATH:/opt/solarisstudio12.3/bin
alias solstudio='solstudio --jdkhome /usr/jdk/instances/jdk1.6.0'

Edit /home/hadoop/.bash_profile:

. ~/.bashrc

And make sure that the ownership and permission make sense:

# ls -l /home/hadoop/.bash*      
-rw-r--r--   1 hadoop   hadoop        12 May 22 05:24 /home/hadoop/.bash_profile
-rw-r--r--   1 hadoop   hadoop       372 May 22 05:24 /home/hadoop/.bashrc


Now is a good time to start a remote VNC desktop for this zone:

# su - hadoop

$ vncserver


You will require a password to access your desktops.

Password:
Verify:
xauth:  file /home/hadoop/.Xauthority does not exist

New 'hadoop-studio:1 ()' desktop is hadoop-studio:1

Creating default startup script /home/hadoop/.vnc/xstartup
Starting applications specified in /home/hadoop/.vnc/xstartup
Log file is /home/hadoop/.vnc/hadoop-studio:1.log

Access the remote desktop with your favorite VNC client

The default 10-minute screensaver timeout on the VNC desktop is too short for my preferences:

System -> Preferences -> Screensaver
  Display Modes:
  Blank after: 100
  Close the window (I always look for a "save" button, but no, just close the window without explicitly saving.)



Download and Install Hadoop

For this article, I used the "12 October, 2012 Release 1.0.4" release. Download the Hadoop tarball and copy it into the home directory of hadoop:

$ ls -l hadoop-1.0.4.tar.gz
-rw-r--r--   1 hadoop   hadoop   62793050 May 21 12:03 hadoop-1.0.4.tar.gz
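
If the zone has direct Internet access, something along these lines should fetch the tarball (the mirror URL is an assumption; adjust it for your location):

$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz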

Unpack the tarball into the home directory of the hadoop user:

$ gzip -dc hadoop-1.0.4.tar.gz  | tar -xvf -
$ mv hadoop-1.0.4 hadoop


Hadoop comes pre-configured in Standalone Mode

Edit /home/hadoop/hadoop/conf/hadoop-env.sh, and set JAVA_HOME:

export JAVA_HOME=/usr/jdk/instances/jdk1.6.0

That is all. Now, you can run a Hadoop example:

$ hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 2 10
Number of Maps  = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
...
Job Finished in 10.359 seconds
Estimated value of Pi is 3.80000000000000000000



Install Solaris Studio:

Visit https://pkg-register.oracle.com/ to obtain Oracle_Solaris_Studio_Support.key.pem and Oracle_Solaris_Studio_Support.certificate.pem, and follow the instructions for "pkg set-publisher" and "pkg update" or "pkg install".

# sudo pkg set-publisher \
          -k /var/pkg/ssl/Oracle_Solaris_Studio_Support.key.pem \
          -c /var/pkg/ssl/Oracle_Solaris_Studio_Support.certificate.pem \
          -G '*' -g https://pkg.oracle.com/solarisstudio/support solarisstudio

# pkg install developer/solarisstudio-123/*


If your network requires a proxy, you will need to set the proxy before starting Solaris Studio:

proxy.jpg


Start Solaris Studio:

$ solstudio

(Notice the alias in .bashrc that adds --jdkhome to the solstudio start up command.)

Go to "Tools -> Plugins".

Click on "Reload Catalog"

Load the Java SE plugins. I ran into a problem when the Maven plug-in was installed - something that I should diagnose at a future date.

plugins.jpg



Create a New Project:

File -> New Project

Step 1:
- Category: Java
- Project: Java Application
- Next

NewProject.jpg

Step 2: Fill it in similar to this:

NewJavaApplication.jpg

Copy the example source into the project:

$ cp -r \
    $HADOOP_PREFIX/src/examples/org/apache/hadoop/examples/* \
    ~/SolStudioProjects/examples/src/org/apache/hadoop/examples/


Starting to look like a development environment:

DevEnvironment.jpg


Modify the Project to compile with Hadoop jars. Right-click on the project and select "Properties"

properties.jpg


Add in the necessary Hadoop compile jars:

CompileJars.jpg

I found that I needed these jars at run time:

RunJars.jpg

Add Program Arguments (2 10):

Arguments.jpg

Now, if you click on the "Run" button, PiEstimator will run inside the IDE:

PiRuns.jpg

And the setup behaves as expected if you set a breakpoint and click on "Debug":

debug.jpg

Tuesday May 21, 2013

non-interactive zone configuration

When creating new Solaris zones, at initial boot up, the system administrator is prompted for the new zone's hostname, network settings, and so on. I get tired of the brittle process of manually entering the initial settings and I prefer to be able to automate the process. I had previously figured out the process for Solaris 10, but I've only recently figured out the process for Solaris 11.



As a review, with Solaris 10, use your favorite editor to create a sysidcfg file:

system_locale=C
terminal=dtterm
security_policy=NONE
network_interface=primary {
                hostname=app-41
}
name_service=DNS {
    domain_name=us.mycorp.com
    name_server=232.23.233.33,154.45.155.15,77.88.21.211
    search=us.mycorp.com,yourcorp.com,thecorp.com
}
nfs4_domain=dynamic
timezone=US/Pacific
root_password=xOV2PpE67YUzY


1) Solaris 10 Install: Using sysidcfg to avoid answering the configuration questions in a newly installed zone:

After the "zoneadm -z app-41 install" you can copy the sysidcfg file to "/ZONES/app-41/root/etc/sysidcfg"  (assuming your "zonepath" is "/ZONES/app-41") and the initial boot process will read the settings from the file and not prompt the system administrator to manually enter the settings.

2) Solaris 10 Clone: Using sysidcfg when cloning the zone 

I used a similar trick on Solaris 10 when cloning old zone "app-41" to new zone "app-44":

# zonecfg -z app-41 export | sed -e 's/app-41/app-44/g' | zonecfg -z app-44
# zoneadm -z app-44 clone app-41
# cat /ZONES/app-41/root/etc/sysidcfg | sed -e 's/app-41/app-44/g' > /ZONES/app-44/root/etc/sysidcfg
# zoneadm -z app-44 boot




With Solaris 11, instead of a small human-readable file containing the configuration information, the information is contained in an XML file that would be difficult to create using an editor. Instead, create the initial profile by executing "sysconfig":

# sysconfig create-profile -o sc_profile.xml
# mkdir /root/profiles/app-61
# mv sc_profile.xml /root/profiles/app-61/sc_profile.xml

The new XML format is longer, so I won't include it in this blog entry; it is left as an exercise for the reader to review the file that has been created.

1) Solaris 11 Install

# dladm create-vnic -l net0 app_61

# zonecfg -z app-61
Use 'create' to begin configuring a new zone.
zonecfg:app-61> create
create: Using system default template 'SYSdefault'
zonecfg:app-61> set zonepath=/ZONES/app-61
zonecfg:app-61> add net
zonecfg:app-61:net> set physical=app_61
zonecfg:app-61:net> end
zonecfg:app-61> verify
zonecfg:app-61> commit
zonecfg:app-61> exit

# zoneadm -z app-61 install -c /root/profiles/app-61
# zoneadm -z app-61 boot
# zlogin -C app-61


2) Solaris 11 Clone: If you want to clone app-61 to app-62 and have an existing sc_profile.xml, you can re-use most of the settings and only adjust what has changed:


# dladm create-vnic -l net0 app_62

# zoneadm -z app-61 halt

# mkdir /root/profiles/app-62

# sed \
-e 's/app-61/app-62/g' \
-e 's/app_61/app_62/g' \
-e 's/11.22.33.61/11.22.33.62/g' \
< /root/profiles/app-61/sc_profile.xml \
> /root/profiles/app-62/sc_profile.xml

# zonecfg -z app-61 export | sed -e 's/61/62/g' | zonecfg -z app-62

# zoneadm -z app-62 clone -c /root/profiles/app-62 app-61
# zoneadm -z app-62 boot
# zlogin -C app-62

I hope this trick saves you some time and makes your process less brittle.


Thursday May 16, 2013

Hadoop Java Error logs

I was having trouble isolating a problem with "reduce" tasks running on Hadoop slave servers. 

After poking around on the Hadoop slave, I found an interesting lead in /var/log/hadoop/userlogs/job_201302111641_0057/attempt_201302111641_0057_r_000001_1/stdout:

$ cat /tmp/hadoop-hadoop/mapred/local/userlogs/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/stdout

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xfe67cb31, pid=25828, tid=2
#
# JRE version: 6.0_35-b10
# Java VM: Java HotSpot(TM) Server VM (20.10-b01 mixed mode solaris-x86 )
# Problematic frame:
# C  [libc.so.1+0xbcb31]  pthread_mutex_trylock+0x29
#
# An error report file with more information is saved as:
# /tmp/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201302111641_0059/attempt_201302111641_0059_r_000001_1/work/hs_err_pid25828.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

The HotSpot crash log (hs_err_pid25828.log in my case) will be very interesting because it contains information obtained at the time of the fatal error, including the following information, where possible:

  • The operating exception or signal that provoked the fatal error
  • Version and configuration information
  • Details on the thread that provoked the fatal error and that thread's stack trace
  • The list of running threads and their state
  • Summary information about the heap
  • The list of native libraries loaded
  • Command line arguments
  • Environment variables
  • Details about the operating system and CPU

Great, but hs_err_pid25654.log had been cleaned up before I could get to it. In fact, I found that the hs_err_pid.log files were available for less than a minute and they were always gone before I could capture one.

To try to retain the Java error log file, my first incorrect guess was:

 <property>
   <name>keep.failed.task.files</name>
   <value>true</value>
 </property>


My next approach was to add "-XX:ErrorFile=/tmp/hs_err_pid%p.log" to the Java command line for the reduce task.

When I tried adding the Java option to HADOOP_OPTS in /usr/local/hadoop/conf/hadoop-env.sh, I realized that this setting isn't applied to the Map and Reduce Task JVMs.

Finally, I found that adding the Java option to the mapred.child.java.opts property in mapred-site.xml WORKED!!

$ cat /usr/local/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

     <property>
         <name>mapred.job.tracker</name>
         <value>p3231-name-node:8021</value>
     </property>

     <property>
         <name>mapred.child.java.opts</name>
         <value>-XX:ErrorFile=/tmp/hs_err_pid%p.log</value>
     </property>


</configuration>


Now I can view the Java error logs on my Hadoop slaves:

$ ls -l /tmp/*err*
-rw-r--r--   1 hadoop   hadoop     15626 May 16 15:42 /tmp/hs_err_pid10028.log
-rw-r--r--   1 hadoop   hadoop     15795 May 16 15:43 /tmp/hs_err_pid10232.log

Tuesday Dec 04, 2012

Solaris 11 VNC Server is "blurry" or "smeared"

I've been annoyed by the quality of the image that is displayed by my VNC viewer when I visit a Solaris 11 VNC server. How should I describe the image? Blurry? Grainy? Smeared? Low resolution? Compressed? Badly encoded?

This is what I have gotten used to seeing on Solaris 11:

Solaris 11 Blurry VNC image

This is not a problem for me when I view Solaris 10 VNC servers. I've finally taken the time to investigate, and the solution is simple. On the VNC client, don't allow "Tight" encoding.

My VNC Viewer will negotiate to Tight encoding if it is available. When negotiating with the Solaris 10 VNC server, Tight is not a supported option, so the Solaris 10 server and my client will agree on ZRLE. 

Now that I have disabled Tight encoding on my VNC client, the Solaris 11 VNC Servers looks much better:

Solaris 11 crisp VNC image

How should I describe the display when my VNC client is forced to negotiate to ZRLE encoding with the Solaris 11 VNC Server? Crisp? Clear? Higher resolution? Using a lossless compression algorithm?

When I'm on a low-bandwidth connection, I may re-enable Tight compression on my laptop. In the meantime, ZRLE compression is sufficient for a coast-to-coast desktop, through the corporate firewall, over VPN, through my ISP, and onto my laptop. YMMV.
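
Exactly how to pin the encoding depends on your VNC client. With TigerVNC's Unix viewer, for example, a parameter along these lines should request ZRLE (the hostname is a placeholder; check your client's documentation for the equivalent setting):

$ vncviewer -PreferredEncoding=ZRLE solaris11host:1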

Tuesday Nov 27, 2012

ZFS for Database Log Files


I've been troubled by drop outs in CPU usage in my application server, characterized by the CPUs suddenly going from close to 90% CPU busy to almost completely CPU idle for a few seconds. Here is an example of a drop out as shown by a snippet of vmstat data taken while the application server is under a heavy workload.

# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 1 0 0 130160176 116381952 0 16 0 0 0 0  0  0  0  0  0 207377 117715 203884 70 21 9
 12 0 0 130160160 116381936 0 25 0 0 0 0 0  0  0  0  0 200413 117162 197250 70 20 9
 11 0 0 130160176 116381920 0 16 0 0 0 0 0  0  1  0  0 203150 119365 200249 72 21 7
 8 0 0 130160176 116377808 0 19 0 0 0 0  0  0  0  0  0 169826 96144 165194 56 17 27
 0 0 0 130160176 116377800 0 16 0 0 0 0  0  0  0  0  1 10245 9376 9164 2  1 97
 0 0 0 130160176 116377792 0 16 0 0 0 0  0  0  0  0  2 15742 12401 14784 4 1 95
 0 0 0 130160176 116377776 2 16 0 0 0 0  0  0  1  0  0 19972 17703 19612 6 2 92

 14 0 0 130160176 116377696 0 16 0 0 0 0 0  0  0  0  0 202794 116793 199807 71 21 8
 9 0 0 130160160 116373584 0 30 0 0 0 0  0  0 18  0  0 203123 117857 198825 69 20 11


This behavior occurred consistently while the application server was processing synthetic transactions: HTTP requests from JMeter running on an external machine.

I explored many theories trying to explain the drop outs, including:
  • Unexpected JMeter behavior
  • Network contention
  • Java Garbage Collection
  • Application Server thread pool problems
  • Connection pool problems
  • Database transaction processing
  • Database I/O contention

Graphing the CPU %idle led to a breakthrough:

AppServerIdle.jpg

Several of the drop outs were 30 seconds apart. With that insight, I went digging through the data again, looking for other outliers that were 30 seconds apart. In the database server statistics, I found spikes in the iostat "asvc_t" (average response time of disk transactions, in milliseconds) for the disk drive that was being used for the database log files.

Here is an example:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

    0.0 2053.6    0.0 8234.3  0.0  0.2    0.0    0.1   0  24 c3t60080E5...F4F6d0s0
    0.0 2162.2    0.0 8652.8  0.0  0.3    0.0    0.1   0  28 c3t60080E5...F4F6d0s0
    0.0 1102.5    0.0 10012.8  0.0  4.5    0.0    4.1   0  69 c3t60080E5...F4F6d0s0
    0.0   74.0    0.0 7920.6  0.0 10.0    0.0  135.1   0 100 c3t60080E5...F4F6d0s0
    0.0  568.7    0.0 6674.0  0.0  6.4    0.0   11.2   0  90 c3t60080E5...F4F6d0s0
    0.0 1358.0    0.0 5456.0  0.0  0.6    0.0    0.4   0  55 c3t60080E5...F4F6d0s0
    0.0 1314.3    0.0 5285.2  0.0  0.7    0.0    0.5   0  70 c3t60080E5...F4F6d0s0

Here is a little more information about my database configuration:
  • The database and application server were running on two different SPARC servers.
  • Storage for the database was on a storage array connected via 8 gigabit Fibre Channel
  • Data storage and log file were on different physical disk drives
  • Reliable low latency I/O is provided by battery backed NVRAM
  • Highly available:
    • Two Fibre Channel links accessed via MPxIO
    • Two Mirrored cache controllers
    • The log file physical disks were mirrored in the storage device
  • Database log files on a ZFS Filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming

Why would I be getting service time spikes in my high-end storage? First, I wanted to verify that the database log disk service time spikes aligned with the application server CPU drop outs, and they did:

AppServerIdleLogService.jpg

At first, I guessed that the disk service time spikes might be related to flushing the write through cache on the storage device, but I was unable to validate that theory.

After searching the WWW for a while, I decided to try using a separate log device:

# zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0

The ZFS log device is configured in a similar manner as described above: two physical disks mirrored in the storage array.

This change to the database storage configuration eliminated the application server CPU drop outs:

AppServerIdleAfter.jpg

Here is the zpool configuration:

# zpool status ZFS-db-41
  pool: ZFS-db-41
 state: ONLINE
 scan: none requested
config:

        NAME                                     STATE
        ZFS-db-41                                ONLINE
          c3t60080E5...F4F6d0  ONLINE
        logs
          c3t60080E5...F8A7d0  ONLINE


Now, the I/O spikes look like this:

                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1053.5    0.0 4234.1  0.0  0.8    0.0    0.7   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1131.8    0.0 4555.3  0.0  0.8    0.0    0.7   0  76 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1167.6    0.0 4682.2  0.0  0.7    0.0    0.6   0  74 c3t60080E5...F8A7d0s0
    0.0  162.2    0.0 19153.9  0.0  0.7    0.0    4.2   0  12 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1247.2    0.0 4992.6  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0
    0.0   41.0    0.0   70.0  0.0  0.1    0.0    1.6   0   2 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1241.3    0.0 4989.3  0.0  0.8    0.0    0.6   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1193.2    0.0 4772.9  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0

We can see the steady flow of 4k writes to the ZIL device from O_SYNC database log file writes. The spikes are from flushing the transaction group.

Like almost all problems that I run into, once I thoroughly understand the problem, I find that other people have documented similar experiences. Thanks to all of you who have documented alternative approaches.

Saved for another day: now that the problem is obvious, I should try "zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning Guide.
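
For reference, that tunable would go in /etc/system in the usual "set module:variable=value" form; the value below is only a placeholder showing the syntax (32768 bytes is the commonly documented default), not a recommendation:

set zfs:zfs_immediate_write_sz=32768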


Thursday Jun 07, 2012

ndd on Solaris 10

This is mostly a repost of LaoTsao's Weblog, with some tweaks and additions. In 2008, his blog pointed out that with Solaris 9 and earlier, an rc3 script would be used to specify ndd parameters at boot up. With Solaris 10 and later, it is more elegant to use SMF.

The last time that I tried to cut & paste directly off of his page, some of the XML was messed up, so I am reposting working Solaris 10 XML in this blog entry.

Additionally, I am including scripts that I use to distribute the settings to multiple servers. I run the distribution scripts from my MacBook, but they should also work from a Windows laptop using Cygwin, or from an existing Solaris installation.

Why is it necessary to set ndd parameters at boot up?

The problem being addressed is how to set ndd parameters so that they survive a reboot. It is easy to specify ndd settings from a shell, but they only apply to the running OS and don't survive reboots.
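
For example, this takes effect immediately but is lost at the next reboot:

# ndd -set /dev/tcp tcp_smallest_anon_port 9000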

Examples of ndd settings being necessary include performance tuning, as described in NFS Tuning for HPC Streaming Applications, and meeting the Oracle Database 11gR2 installation prerequisites on Solaris 10, as shown here:

11gr2_ndd_check.jpg

On Solaris 10 Update 10, the default network settings don't match the Oracle 11gR2 prerequisites:


Parameter               Expected Value  Actual Value
tcp_smallest_anon_port            9000         32768
tcp_largest_anon_port            65500         65535
udp_smallest_anon_port            9000         32768
udp_largest_anon_port            65500         65535
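
You can confirm the values currently in effect with "ndd -get" before applying any tuning; for example (the output shown is the Solaris 10 Update 10 default from the table above):

# ndd -get /dev/tcp tcp_smallest_anon_port
32768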

To distribute the SMF files, and for future administration, it is helpful to enable passwordless ssh from your secure laptop:

================
If not already present, create an ssh key on your laptop
================

# ssh-keygen -t rsa

================
Enable passwordless ssh from my laptop.
Need to type in the root password for the remote machines.
Then, I no longer need to type in the password when I ssh or scp from my laptop to servers.
================

#!/usr/bin/env bash

for server in `cat servers.txt`
do
  echo root@$server
  cat ~/.ssh/id_rsa.pub | ssh root@$server "cat >> .ssh/authorized_keys"
done

Specify the servers to distribute to:

================
servers.txt
================

testhost1
testhost2

In addition to ndd values, I often use the following /etc/system settings:

================
etc_system_addins
================

set rpcmod:clnt_max_conns=8
set zfs:zfs_arc_max=0x1000000000
set nfs:nfs3_bsize=131072
set nfs:nfs4_bsize=131072

Modify ndd-nettune.txt with the ndd values that are appropriate for your deployment: 

================
ndd-nettune.txt
================

#!/sbin/sh
#
# ident   "@(#)ndd-nettune.xml    1.0     01/08/06 SMI"

. /lib/svc/share/smf_include.sh
. /lib/svc/share/net_include.sh

# Make sure that the libraries essential to this stage of booting  can be found.
LD_LIBRARY_PATH=/lib; export LD_LIBRARY_PATH
echo "Performing Directory Server Tuning..." >> /tmp/smf.out
#
# Performance Settings
#
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1048576
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1048576
#
# Oracle Database 11gR2 Settings
#
/usr/sbin/ndd -set /dev/tcp tcp_smallest_anon_port 9000
/usr/sbin/ndd -set /dev/tcp tcp_largest_anon_port 65500
/usr/sbin/ndd -set /dev/udp udp_smallest_anon_port 9000
/usr/sbin/ndd -set /dev/udp udp_largest_anon_port 65500

# Reset the library path now that we are past the critical stage
unset LD_LIBRARY_PATH


================
ndd-nettune.xml
================

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- ident "@(#)ndd-nettune.xml 1.0 04/09/21 SMI" -->
<service_bundle type='manifest' name='SUNWcsr:ndd'>
  <service name='network/ndd-nettune' type='service' version='1'>
    <create_default_instance enabled='true' />
    <single_instance />
    <dependency name='fs-minimal' type='service' grouping='require_all' restart_on='none'>
      <service_fmri value='svc:/system/filesystem/minimal' />
    </dependency>
    <dependency name='loopback-network' grouping='require_any' restart_on='none' type='service'>
      <service_fmri value='svc:/network/loopback' />
    </dependency>
    <dependency name='physical-network' grouping='optional_all' restart_on='none' type='service'>
      <service_fmri value='svc:/network/physical' />
    </dependency>
    <exec_method type='method' name='start' exec='/lib/svc/method/ndd-nettune' timeout_seconds='3' > </exec_method>
    <exec_method type='method' name='stop'  exec=':true'                       timeout_seconds='3' > </exec_method>
    <property_group name='startd' type='framework'>
      <propval name='duration' type='astring' value='transient' />
    </property_group>
    <stability value='Unstable' />
    <template>
      <common_name>
    <loctext xml:lang='C'> ndd network tuning </loctext>
      </common_name>
      <documentation>
    <manpage title='ndd' section='1M' manpath='/usr/share/man' />
      </documentation>
    </template>
  </service>
</service_bundle>

Execute this shell script to distribute the files. The ndd values will be immediately modified and will then survive reboot. The servers will need to be rebooted to pick up the /etc/system settings:

================
system_tuning.sh
================

#!/usr/bin/env bash

for server in `cat servers.txt`
do
  cat etc_system_addins | ssh root@$server "cat >> /etc/system"

  scp ndd-nettune.xml root@${server}:/var/svc/manifest/site/ndd-nettune.xml
  scp ndd-nettune.txt root@${server}:/lib/svc/method/ndd-nettune
  ssh root@$server chmod +x /lib/svc/method/ndd-nettune
  ssh root@$server svccfg validate /var/svc/manifest/site/ndd-nettune.xml
  ssh root@$server svccfg import /var/svc/manifest/site/ndd-nettune.xml
done
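
When a reboot is acceptable, a loop in the same style can be used to pick up the /etc/system changes (this reboots every server in servers.txt, so use it with care):

#!/usr/bin/env bash

for server in `cat servers.txt`
do
  ssh root@$server init 6
done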





Wednesday May 30, 2012

Java EE Application Servers, SPARC T4, Solaris Containers, and Resource Pools

I've obtained a substantial performance improvement on a SPARC T4-2 Server running a Java EE Application Server Cluster by deploying the cluster members into Oracle Solaris Containers and binding those containers to cores of the SPARC T4 Processor. This is not a surprising result; in fact, it is consistent with other results that are available on the Internet. See the "references", below, for some examples. Nonetheless, here is a summary of my configuration and results.

(1.0) Before deploying a Java EE Application Server Cluster into a virtualized environment, many decisions need to be made. I'm not claiming that all of the decisions that I have made will work well for every environment. In fact, I'm not even claiming that all of the decisions are the best possible for my environment. I'm only claiming that of the small sample of configurations that I've tested, this is the one that is working best for me. Here are some of the decisions that needed to be made:

(1.1) Which virtualization option? There are several virtualization options and isolation levels that are available. Options include:

  • Hard partitions:  Dynamic Domains on Sun SPARC Enterprise M-Series Servers
  • Hypervisor based virtualization such as Oracle VM Server for SPARC (LDOMs) on SPARC T-Series Servers
  • OS Virtualization using Oracle Solaris Containers
  • Resource management tools in the Oracle Solaris OS to control the amount of resources an application receives, such as CPU cycles, physical memory, and network bandwidth.

Oracle Solaris Containers provide the right level of isolation and flexibility for my environment. To borrow some words from my friends in marketing, "The SPARC T4 processor leverages the unique, no-cost virtualization capabilities of Oracle Solaris Zones" 

(1.2) How to associate Oracle Solaris Containers with resources? There are several options available to associate containers with resources, including (a) resource pool association, (b) dedicated-cpu resources, and (c) capped-cpu resources. I chose to create resource pools and associate them with the containers because I wanted explicit control over the cores and virtual processors.

(1.3) Cluster Topology? Is it best to deploy (a) multiple application servers on one node, (b) one application server on multiple nodes, or (c) multiple application servers on multiple nodes? After a few quick tests, it appears that one application server per Oracle Solaris Container is a good solution.

(1.4) Number of cluster members to deploy? I chose to deploy four big application servers. I would like to go back and test many 32-bit application servers, but that is left for another day.

(2.0) Configuration tested.

(2.1) I was using a SPARC T4-2 Server which has 2 CPUs and 128 virtual processors. To understand the physical layout of the hardware on Solaris 10, I used the OpenSolaris psrinfo perl script available at http://hub.opensolaris.org/bin/download/Community+Group+performance/files/psrinfo.pl:

test# ./psrinfo.pl -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
  The core has 8 virtual processors (16-23)
  The core has 8 virtual processors (24-31)
  The core has 8 virtual processors (32-39)
  The core has 8 virtual processors (40-47)
  The core has 8 virtual processors (48-55)
  The core has 8 virtual processors (56-63)
    SPARC-T4 (chipid 0, clock 2848 MHz)
The physical processor has 8 cores and 64 virtual processors (64-127)
  The core has 8 virtual processors (64-71)
  The core has 8 virtual processors (72-79)
  The core has 8 virtual processors (80-87)
  The core has 8 virtual processors (88-95)
  The core has 8 virtual processors (96-103)
  The core has 8 virtual processors (104-111)
  The core has 8 virtual processors (112-119)
  The core has 8 virtual processors (120-127)
    SPARC-T4 (chipid 1, clock 2848 MHz)

(2.2) The "before" test: without processor binding. I started with a 4-member cluster deployed into 4 Oracle Solaris Containers. Each container used a unique gigabit Ethernet port for HTTP traffic. The containers shared a 10 gigabit Ethernet port for JDBC traffic.

(2.3) The "after" test: with processor binding. I ran one application server in the Global Zone and another application server in each of the three non-global zones (NGZ): 

(3.0) Configuration steps. The following steps need to be repeated for all three Oracle Solaris Containers.

(3.1) Stop AppServers from the BUI.

(3.2) Stop the NGZ.

test# ssh test-z2 init 5

(3.3) Enable resource pools:

test# svcadm enable pools

(3.4) Create the resource pool:

test# poolcfg -dc 'create pool pool-test-z2'

(3.5) Create the processor set:

test# poolcfg -dc 'create pset pset-test-z2'

(3.6) Specify the maximum number of CPUs that may be added to the processor set:

test# poolcfg -dc 'modify pset pset-test-z2 (uint pset.max=32)'

(3.7) bash syntax to add Virtual CPUs to the processor set:

test# (( i = 64 )); while (( i < 96 )); do poolcfg -dc "transfer to pset pset-test-z2 (cpu $i)"; (( i = i + 1 )) ; done

(3.8) Associate the resource pool with the processor set:

test# poolcfg -dc 'associate pool pool-test-z2 (pset pset-test-z2)'

(3.9) Tell the zone to use the resource pool that has been created:

test# zonecfg -z test-z2 set pool=pool-test-z2

(3.10) Boot the Oracle Solaris Container

test# zoneadm -z test-z2 boot

(3.11) Save the configuration to /etc/pooladm.conf

test# pooladm -s

(4.0) Verification

(4.1) View the processors in each processor set 

test# psrset

user processor set
5: processors 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
user processor set
6: processors 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
user processor set
7: processors 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

(4.2) Verify that the Java processes are associated with the processor sets:

test# ps -e -o vsz,rss,pid,pset,comm | grep java | sort -n

  VSZ     RSS    PID PSET COMMAND
3715416 1543344 25143   5 <JAVA_HOME>/bin/sparcv9/java
3772120 1600088 25142   - <JAVA_HOME>/bin/sparcv9/java
3780960 1608832 25144   6 <JAVA_HOME>/bin/sparcv9/java
3792648 1620560 25145   7 <JAVA_HOME>/bin/sparcv9/java

(5.0) Results. Using the resource pools improves both throughput and response time:

(6.0) Run Time Changes

(6.1) I wanted to show an example which started from scratch, which is why I stopped the Oracle Solaris Containers, configured the pools, and booted up fresh; that way there is no room for confusion. However, the steps should also work for running containers. One exception is "zonecfg -z test-z2 set pool=pool-test-z2", which will only take effect when the zone is booted.

(6.2) I've shown poolcfg with the '-d' option which specifies that the command will work directly on the kernel state. For example, at runtime, you can move CPU core 12 (virtual processors 96-103) from test-z3 to test-z2 with the following command:

test# (( i = 96 )); while (( i < 104 )); do poolcfg -dc "transfer to pset pset-test-z2 (cpu $i)"; (( i = i + 1 )) ; done

(6.3) To specify a run-time change to a container's pool binding, use the following steps:

Identify the zone ID (first column)

test# zoneadm list -vi
  ID NAME        STATUS     PATH                      BRAND    IP
   0 global      running    /                         native   shared
  28 test-z3     running    /zones/test-z3            native   shared
  31 test-z1     running    /zones/test-z1            native   shared
  32 test-z2     running    /zones/test-z2            native   shared

Modify binding if necessary:

test# poolbind -p pool-test-z2 -i zoneid 32

(7.0) Processor sets are particularly relevant to multi-socket configurations:

Processor sets reduce cross calls (xcal) and migrations (migr) in multi-socket configurations:

Single Socket Test
1 x SPARC T4 Socket
2 x Oracle Solaris Containers
mpstat samples
The impact of processor sets was hardly measurable
 (about a 1% throughput difference)
Without Processor Binding
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  525   933   24 1793  124  363  153    1  2551   50   7   0  43
 41    2   0  486  1064   24 1873  137  388  159    2  2560   51   7   0  42
 42    1   0  472   973   23 1770  124  352  153    1  2329   49   7   0  44
 43    1   0  415   912   22 1697  115  320  153    1  2175   47   7   0  47
 44    1   0  369   884   22 1665  111  300  150    1  2008   45   6   0  49
 45    2   0  494   902   23 1730  116  324  152    1  2233   46   7   0  47
 46    3   0  918  1075   26 2087  163  470  172    1  2935   55   8   0  38
 47    2   0  672   999   25 1955  143  416  162    1  2777   53   7   0  40
 48    2   0  691   976   25 1904  136  396  159    1  2748   51   7   0  42
 49    3   0  849  1081   24 1933  145  411  163    1  2670   52   7   0  40
With each container bound to 4 cores.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  347  1164   20 1810  119  311  210    1  2079   42   6   0  51
 41    1   0  406  1219   21 1898  131  344  214    1  2266   45   7   0  48
 42    1   0  412  1214   21 1902  130  342  212    1  2289   45   7   0  49
 43    2   0  410  1208   21 1905  130  343  219    1  2304   45   7   0  48
 44    1   0  411  1208   21 1906  131  343  214    1  2313   45   7   0  48
 45    1   0  433  1209   21 1917  133  344  215    1  2337   45   7   0  48
 46    2   0  500  1244   24 1989  141  368  218    1  2482   46   7   0  47
 47    1   0  377  1183   21 1871  127  331  211    1  2289   45   7   0  49
 48    1   0  358   961   23 1699   77  202  183    1  2255   41   6   0  53
 49    1   0  339  1008   21 1739   84  216  188    1  2231   41   6   0  53




Two Socket Test
2 x T4 Sockets
4 x Oracle Solaris Container
mpstat sample
The impact of processor sets was substantial
(~25% better throughput)
Without Processor Binding
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0 1277  1553   32 2726  317  942   70    2  2620   66  11   0  24
 41    0   0 1201  1606   30 2618  309  895   71    2  2603   67  11   0  23
 42    1   0 1104  1517   30 2519  295  846   70    2  2499   65  10   0  24
 43    1   0  997  1447   28 2443  283  807   69    2  2374   64  10   0  26
 44    1   0  959  1402   28 2402  277  776   67    2  2336   64  10   0  26
 45    1   0 1057  1466   29 2538  294  841   68    2  2400   64  10   0  26
 46    3   0 2785  1776   35 3273  384 1178   74    2  2841   68  12   0  20
 47    1   0 1508  1610   33 2949  346 1039   72    2  2764   67  11   0  22
 48    2   0 1486  1594   33 2963  346 1036   72    2  2761   67  11   0  22
 49    1   0 1308  1589   32 2741  325  952   71    2  2694   67  11   0  22
With each container bound to 4 cores.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  423  1223   20 1841  157  377   60    1  2185   48   7   0  45
 41    1   0  505  1279   22 1942  168  405   65    1  2396   50   7   0  43
 42    1   0  500  1278   22 1941  170  405   65    1  2413   50   7   0  42
 43    2   0  492  1277   22 1955  171  408   64    1  2422   50   8   0  42
 44    1   0  504  1269   22 1941  167  407   64    1  2430   50   7   0  42
 45    1   0  513  1284   22 1977  173  412   64    1  2475   50   8   0  42
 46    2   0  582  1302   25 2021  177  431   67    1  2612   52   8   0  41
 47    1   0  462  1247   21 1918  168  400   62    1  2392   50   7   0  43
 48    1   0  466  1055   25 1777  120  282   56    1  2424   47   7   0  47
 49    1   0  412  1080   22 1789  122  285   56    1  2354   46   7   0  47

    (8.0) References:

    Thursday Apr 12, 2012

    What is bondib1 used for on SPARC SuperCluster with InfiniBand, Solaris 11 networking & Oracle RAC?

    A co-worker asked the following question about a SPARC SuperCluster InfiniBand network:

    > on the database nodes the RAC nodes communicate over the cluster_interconnect. This is the
    > 192.168.10.0 network on bondib0. (according to ./crs/install/crsconfig_params NETWORKS
    > setting) 
    > What is bondib1 used for? Is it a HA counterpart in case bondib0 dies?

    This is my response:

    Summary: In a SPARC SuperCluster installation, bondib0 and bondib1 are the InfiniBand links that are used for the private interconnect (usage includes global cache data blocks and heartbeat) and for communication to the Exadata storage cells. Currently, the database is idle, so bondib1 is currently only being used for outbound cluster interconnect traffic.

    Details:

    bondib0 is the cluster_interconnect

    $ oifcfg getif           
    bondeth0  10.129.184.0  global  public
    bondib0  192.168.10.0  global  cluster_interconnect
    ipmpapp0  192.168.30.0  global  public


    bondib0 and bondib1 are on 192.168.10.1 and 192.168.10.2 respectively.

    # ipadm show-addr | grep bondi
    bondib0/v4static  static   ok           192.168.10.1/24
    bondib1/v4static  static   ok           192.168.10.2/24


    This private network is also used to communicate with the Exadata Storage Cells. Notice that the network addresses of the Exadata Cell Disks are on the same subnet as the private interconnect:  

    SQL> column path format a40
    SQL> select path from v$asm_disk;

    PATH                                     

    ---------------------------------------- 
    o/192.168.10.9/DATA_SSC_CD_00_ssc9es01  
    o/192.168.10.9/DATA_SSC_CD_01_ssc9es01
    ...

    Hostnames tied to the IPs are node1-priv1 and node1-priv2 

    # grep 192.168.10 /etc/hosts
    192.168.10.1    node1-priv1.us.oracle.com   node1-priv1
    192.168.10.2    node1-priv2.us.oracle.com   node1-priv2

    For the four compute node RAC:

    • Each compute node has two IP address on the 192.168.10.0 private network.
    • Each IP address has an active InfiniBand link and a failover InfiniBand link.
    • Thus, the compute nodes are using a total of 8 IP addresses and 16 InfiniBand links for this private network.

    bondib1 isn't being used for the Virtual IP (VIP):

    $ srvctl config vip -n node1
    VIP exists: /node1-ib-vip/192.168.30.25/192.168.30.0/255.255.255.0/ipmpapp0, hosting node node1
    VIP exists: /node1-vip/10.55.184.15/10.55.184.0/255.255.255.0/bondeth0, hosting node node1


    bondib1 is on bondib1_0 and fails over to bondib1_1:

    # ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmpapp0    ipmpapp0    ok        --        ipmpapp_0 (ipmpapp_1)
    bondeth0    bondeth0    degraded  --        net2 [net5]
    bondib1     bondib1     ok        --        bondib1_0 (bondib1_1)
    bondib0     bondib0     ok        --        bondib0_0 (bondib0_1)


    bondib1_0 goes over net24

    # dladm show-link | grep bond
    LINK                CLASS     MTU    STATE    OVER
    bondib0_0           part      65520  up       net21
    bondib0_1           part      65520  up       net22
    bondib1_0           part      65520  up       net24
    bondib1_1           part      65520  up       net23


    net24 is IB Partition FFFF

    # dladm show-ib
    LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
    net24        21280001A1868A  21280001A1868C  2    up     FFFF
    net22        21280001CEBBDE  21280001CEBBE0  2    up     FFFF,8503
    net23        21280001A1868A  21280001A1868B  1    up     FFFF,8503
    net21        21280001CEBBDE  21280001CEBBDF  1    up     FFFF


    On Express Module 9 port 2:

    # dladm show-phys -L
    LINK              DEVICE       LOC
    net21             ibp4         PCI-EM1/PORT1
    net22             ibp5         PCI-EM1/PORT2
    net23             ibp6         PCI-EM9/PORT1
    net24             ibp7         PCI-EM9/PORT2


    Outbound traffic on the 192.168.10.0 network will be multiplexed between bondib0 & bondib1

    # netstat -rn

    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    192.168.10.0         192.168.10.2         U        16    6551834 bondib1  
    192.168.10.0         192.168.10.1         U         9    5708924 bondib0  


    The database is currently idle, so there is no traffic to the Exadata Storage Cells at this moment, nor is there currently any traffic being induced by the global cache. Thus, only the heartbeat is currently active. There is more traffic on bondib0 than on bondib1:

    # /bin/time snoop -I bondib0 -c 100 > /dev/null
    Using device ipnet/bondib0 (promiscuous mode)
    100 packets captured

    real        4.3
    user        0.0
    sys         0.0


    (100 packets in 4.3 seconds = 23.3 pkts/sec)

    # /bin/time snoop -I bondib1 -c 100 > /dev/null
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured

    real       13.3
    user        0.0
    sys         0.0


    (100 packets in 13.3 seconds = 7.5 pkts/sec)

    Half of the packets on bondib0 are outbound (from self). The remaining packets come from the other nodes in the cluster.

    # snoop -I bondib0 -c 100 | awk '{print $1}' | sort | uniq -c
    Using device ipnet/bondib0 (promiscuous mode)
    100 packets captured
      49 node1-priv1.us.oracle.com
      24 node2-priv1.us.oracle.com
      14 node3-priv1.us.oracle.com
      13 node4-priv1.us.oracle.com

    100% of the packets on bondib1 are outbound (from self), but the headers in the packets indicate that they are from the IP address associated with bondib0:

    # snoop -I bondib1 -c 100 | awk '{print $1}' | sort | uniq -c
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured
     100 node1-priv1.us.oracle.com

    The destinations of the bondib1 outbound packets are split evenly between node3 and node4.

    # snoop -I bondib1 -c 100 | awk '{print $3}' | sort | uniq -c
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured
      51 node3-priv1.us.oracle.com
      49 node4-priv1.us.oracle.com

    Conclusion: In a SPARC SuperCluster installation, bondib0 and bondib1 are the InfiniBand links that are used for the private interconnect (usage includes global cache data blocks and heartbeat) and for communication to the Exadata storage cells. Currently, the database is idle, so bondib1 is currently only being used for outbound cluster interconnect traffic.

    Thursday Mar 29, 2012

    Watch out for a trailing slash on $ORACLE_HOME

    oracle$ export ORACLE_HOME=/u01/app/11.2.0.3/grid/
    oracle$ ORACLE_SID=+ASM1
    oracle$ sqlplus / as sysasm

    SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 29 13:04:01 2012

    Copyright (c) 1982, 2011, Oracle.  All rights reserved.

    Connected to an idle instance.

    SQL>


    oracle$ export ORACLE_HOME=/u01/app/11.2.0.3/grid
    oracle$ ORACLE_SID=+ASM1
    oracle$ sqlplus / as sysasm

    SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 29 13:04:44 2012

    Copyright (c) 1982, 2011, Oracle.  All rights reserved.


    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    With the Real Application Clusters and Automatic Storage Management options

    SQL>
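
    A quick sanity check before launching sqlplus is simply to echo the variable and look for the trailing slash:

    oracle$ echo $ORACLE_HOME
    /u01/app/11.2.0.3/grid/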

    User "oracle" unable to start or stop listeners

    Recently ran into a problem where user "oracle" was unable to start or stop listeners:

    oracle$ srvctl stop listener
    PRCR-1065 : Failed to stop resource ora.LISTENER.lsnr
    CRS-0245:  User doesn't have enough privilege to perform the operation
    CRS-0245:  User doesn't have enough privilege to perform the operation
    PRCR-1065 : Failed to stop resource ora.LISTENER_IB.lsnr
    CRS-0245:  User doesn't have enough privilege to perform the operation
    CRS-0245:  User doesn't have enough privilege to perform the operation

    The system is currently "fixed":

    oracle$ srvctl start listener

    oracle$ srvctl status listener
    Listener LISTENER is enabled
    Listener LISTENER is running on node(s): etc9cn02,etc9cn01
    Listener LISTENER_IB is enabled
    Listener LISTENER_IB is running on node(s): etc9cn02,etc9cn01

    oracle$ srvctl stop listener

    oracle$ srvctl status listener
    Listener LISTENER is enabled
    Listener LISTENER is not running
    Listener LISTENER_IB is enabled
    Listener LISTENER_IB is not running

    oracle$ srvctl start listener


    How it was "fixed":

    Before:

    # crsctl status resource ora.LISTENER.lsnr -p | grep ACL=
    ACL=owner:root:rwx,pgrp:root:r-x,other::r--

    # crsctl status resource ora.LISTENER_IB.lsnr -p | grep ACL=
    ACL=owner:root:rwx,pgrp:root:r-x,other::r--


    "Fix":

    # crsctl setperm resource ora.LISTENER.lsnr -o oracle
    # crsctl setperm resource ora.LISTENER.lsnr -g oinstall
    # crsctl setperm resource ora.LISTENER_IB.lsnr -g oinstall
    # crsctl setperm resource ora.LISTENER_IB.lsnr -o oracle


    After:

    # crsctl status resource ora.LISTENER.lsnr -p | grep ACL=
    ACL=owner:oracle:rwx,pgrp:oinstall:r-x,other::r--

    # crsctl status resource ora.LISTENER_IB.lsnr -p | grep ACL=
    ACL=owner:oracle:rwx,pgrp:oinstall:r-x,other::r--
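
    To review both listener ACLs in one pass, the same crsctl query can be wrapped in a small loop (a hedged sketch using only the commands shown above):

    # for r in ora.LISTENER.lsnr ora.LISTENER_IB.lsnr; do
    >   crsctl status resource $r -p | grep ACL=
    > done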


    I may never know how the system got into this state.

    Tuesday Sep 20, 2011

    NFS root access for Oracle RAC on Sun ZFS Storage 7x20 Appliance

    When installing Oracle Real Application Clusters (Oracle RAC) 11g Release 2 for the Solaris Operating System, it is necessary to first install Oracle Grid Infrastructure and to configure shared storage. I installed the Grid Infrastructure and then the Database, largely following the on-line instructions.

    I ran into an interesting problem when installing Oracle RAC in a system that included SPARC Enterprise M5000 Servers and a Sun ZFS Storage 7x20 Appliance, illustrated in the following diagram:

    [Figure: Network Diagram]

    When configuring the shared storage for Oracle RAC, you may decide to use NFS for Data Files. In this case, you must set up the NFS shares on the storage appliance to allow root access from all of the RAC clients. This allows files created by root on the RAC nodes to be owned by root on the mounted NFS filesystems, rather than by an anonymous user, which is the default behavior.

    In a default configuration, a Solaris NFS server maps "root" access to "nobody". This can be overridden as stated on the share_nfs(1M) man page:

    Only root users from the hosts specified in access_list will have root access... By default, no host has root access, so root users are mapped to an anonymous user ID...

    Example: The following will give root read-write permissions to hostb:

     share -F nfs -o ro=hosta,rw=hostb,root=hostb /var 
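
    Applied to the scenario in this post, a comparable share line on a general-purpose Solaris NFS server might look like the following (a sketch only; rac1-stor and rac2-stor are placeholder hostnames, and the appliance itself is configured through the BUI instead):

    share -F nfs -o rw=rac1-stor:rac2-stor,root=rac1-stor:rac2-stor /export/OraData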

    The Sun ZFS Storage 7x20 Appliance features a browser user interface (BUI), "a graphical tool for administration of the appliance. The BUI provides an intuitive environment for administration tasks, visualizing concepts, and analyzing performance data." The following clicks allow root on the clients that mount the storage to be treated as root:

    [Figure: S7000 Allow Root screenshot]

    1. Go to the "Shares" page

      • (not shown) Select the "pencil" to edit the share that will be used for Oracle RAC shared storage.
    2. Go to the Protocols page.
    3. For the NFS Protocol, un-check "inherit from project".
    4. Click the "+" to add an NFS exception.
    5. Enter the hostname of the RAC node.
    6. Allow read/write access.
    7. Check the "Root Access" box.
    8. Click "Apply" to save the changes.

    Repeat steps 3-8 for each RAC node. Repeat steps 1-8 for every share that will be used for RAC shared storage.

    More intuitive readers, after reviewing the network diagram and the screenshot of the S7420 NFS exceptions screen, may immediately observe that it was a mistake to enter the hostnames of the RAC nodes associated with the gigabit WAN network. In hindsight, this was an obvious mistake, but at the time I was entering the data, I simply entered the names of the machines, which did not strike me as a "trick question".

    The next step is to configure the RAC nodes as NFS clients. After the shares have been set up on the Sun ZFS Storage Appliance, mount them on the RAC nodes by adding an entry similar to the following to the /etc/vfstab file on each node:

    nfs_server:/vol/DATA/oradata  -  /u02/oradata  nfs  -  yes  rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,forcedirectio,vers=3,suid
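
    After the vfstab entry is in place, the share can be mounted and verified on each node (a minimal sketch; it assumes the mount point does not exist yet):

    # mkdir -p /u02/oradata
    # mount /u02/oradata
    # nfsstat -m | grep oradata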
    

    Here's a tip of the hat to Pythian's "Installing 11gR2 Grid Infrastructure in 5 Easy Lessons":

    Lesson #3: Grid is very picky and somewhat uninformative about its NFS support

    Like an annoying girlfriend, the installer seems to say “Why should I tell you what’s the problem? If you really loved me, you’d know what you did wrong!”

    You need to trace the installer to find out what exactly it doesn’t like about your configuration.

    Running the installer normally, the error message is:

    [FATAL] [INS-41321] Invalid Oracle Cluster Registry (OCR) location.

    CAUSE: The installer detects that the storage type of the location (/cmsstgdb/crs/ocr/ocr1) is not supported for Oracle Cluster Registry.

    ACTION: Provide a supported storage location for the Oracle Cluster Registry.

    OK, so Oracle says the storage is not supported, but I know that ... NFS is supported just fine. This means I used the wrong parameters for the NFS mounts. But when I check my vfstab and /etc/mount, everything looks A-OK. Can Oracle tell me what exactly bothers it?

    It can, if you run the installer with tracing enabled by adding the following flags to the command line:

    -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2
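
    For example, the Grid Infrastructure installer might be launched with tracing enabled like this (a sketch; the working directory and any other installer options are omitted):

    ./runInstaller -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2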

    If you get past this stage, it is clear sailing up until you run "root.sh" near the end of the Grid Installation, which is the stage that will fail if the root user's files are mapped to anonymous.

    So now, I will finally get to the piece of the puzzle that I found perplexing. Remember that in my configuration (see diagram, above) each RAC node has two potential paths to the Sun ZFS storage appliance: one path via the router that is connected to the corporate WAN, and one path via the private 10 gigabit storage network. When I accessed the NAS storage via the storage network, root was always mapped to nobody, despite my best efforts. While trying to debug, I discovered that when I accessed the NAS storage via the corporate WAN network, root was mapped to root:

    # ping -s s7420-10g0 1 1
    PING s7420-10g0: 1 data bytes
    9 bytes from s7420-10g0 (192.168.42.15): icmp_seq=0.
    
    # ping -s s7420-wan 1 1
    PING s7420-wan: 1 data bytes
    9 bytes from s7420-wan (10.1.1.15): icmp_seq=0.
    
    # nfsstat -m
    /S7420/OraData_WAN from s7420-wan:/export/OraData
     Flags:         vers=3,proto=tcp,sec=sys,hard,nointr,noac,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
     Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
    
    /S7420/OraData_10gbe from s7420-10g0:/export/OraData
     Flags:         vers=3,proto=tcp,sec=sys,hard,nointr,noac,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
     Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
    
    # touch /S7420/OraData_10gbe/foo1
    # touch /S7420/OraData_WAN/foo2
    # touch /net/s7420-10g0/export/OraData/foo3
    # touch /net/s7420-wan/export/OraData/foo4
    # touch /net/192.168.42.15/export/OraData/foo5
    # touch /net/10.1.1.15/export/OraData/foo6
    
    # ls -l /S7420/OraData_10gbe/foo*
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:54 /S7420/OraData_10gbe/foo1
    -rw-r--r--   1 root     root           0 Sep 20 12:54 /S7420/OraData_10gbe/foo2
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:55 /S7420/OraData_10gbe/foo3
    -rw-r--r--   1 root     root           0 Sep 20 12:56 /S7420/OraData_10gbe/foo4
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:58 /S7420/OraData_10gbe/foo5
    -rw-r--r--   1 root     root           0 Sep 20 13:04 /S7420/OraData_10gbe/foo6

    Having confirmed that root was mapped to nobody over the storage network but mapped to root over the corporate WAN, I investigated the mounts on the S7420:

    # ssh osm04 -l root
    Password: 
    Last login: Thu Sep 15 22:46:46 2011 from 192.168.42.11
    s7420:> shell
    Executing shell commands may invalidate your service contract. Continue? (Y/N) 
    Executing raw shell; "exit" to return to appliance shell ...
    +-----------------------------------------------------------------------------+
    |  You are entering the operating system shell.  By confirming this action in |
    |  the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT  |
    ...
    +-----------------------------------------------------------------------------+
    s7420# showmount -a | grep OraData
    192.168.42.11:/export/OraData
    rac1.bigcorp.com:/export/OraData
    

    When I saw the "showmount" output, the lightbulb in my brain turned on and I understood the problem: I had entered the node names associated with the WAN, rather than node names associated with the private storage network. When NFS packets were arriving from the corporate WAN, the S7420 was using DNS to resolve the WAN IP addresses into the WAN hostnames, which matched the hostnames that I had entered into the S7420 NFS Exception form. In contrast, when NFS packets were arriving from the 10 gigabit private storage network, the system was not able to resolve the IP addresses into hostnames because the private storage network addresses did not exist in DNS. Even if the name resolution had succeeded, it would still have been necessary to enter the node names associated with the private storage network into the S7420 NFS Exceptions form.

    Several solutions spring to mind:

    1. On a typical Solaris NFS server, I would have enabled name resolution of the 10 gigabit private storage network addresses by adding entries to /etc/hosts, and then used those node names for the NFS root access. This was not possible here because /etc is mounted read-only on the appliance. (A sketch of this approach appears at the end of this section.)
    2. It occurred to me to enter the IP addresses into the S7420 NFS exceptions form, but the BUI would only accept hostnames.
    3. Another potential solution is to put the private 10 gigabit IP addresses into the corporate DNS server.
    4. Instead, I chose to give root read-write permissions to all clients on the 10 gigabit private storage network:

    [Figure: S7000 screenshot - granting root access to the 10 gigabit storage network]

    Now, the RAC installation will be able to complete successfully with RAC nodes accessing the Sun ZFS Storage 7x20 Appliance via the private 10 gigabit storage network.
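
    For reference, option 1 on a general-purpose Solaris NFS server might have looked something like this (a sketch only; the storage-network hostnames shown are placeholders, and this does not apply to the appliance itself, where /etc is read-only and sharing is managed through the BUI):

    # echo "192.168.42.11  rac1-stor" >> /etc/hosts
    # echo "192.168.42.12  rac2-stor" >> /etc/hosts
    # share -F nfs -o rw=rac1-stor:rac2-stor,root=rac1-stor:rac2-stor /export/OraData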

    Tuesday Aug 23, 2011

    Flash Archive with MPXIO

    It was necessary to exchange one SPARC Enterprise M4000 Server running Solaris 10 Update 9 for a replacement server. I thought, "no problem."

    Ran "flar create -n ben06 -S /S7420/ben06-`date '+%m-%d-%y'`.flar" on the original server to create a flash archive on NAS storage.

    Upon restoring the flar onto the replacement SPARC Enterprise M4000 Server, problems appeared:

    Rebooting with command: boot
    Boot device: disk  File and args:
    SunOS Release 5.10 Version Generic_144488-17 64-bit
    Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
    Hostname: ben06
    The / file system (/dev/rdsk/c0t0d0s0) is being checked.

    WARNING - Unable to repair the / filesystem. Run fsck

    manually (fsck -F ufs /dev/rdsk/c0t0d0s0).

    Aug 22 10:25:31 svc.startd[7]: svc:/system/filesystem/usr:default: Method "/lib/svc/method/fs-usr" failed with exit status 95.

    Aug 22 10:25:31 svc.startd[7]: system/filesystem/usr:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
    Requesting System Maintenance Mode
    (See /lib/svc/share/README for more information.)

     It got fairly ugly from there: 

    • When I tried to run the fsck command, above, it reported "No such device"
    • Although "mount" reported that / was mounted read/write, other commands, such as "vi", reported that everything was read-only
    • "format" reported that I didn't have any devices at all!!

    Eventually I realized that there seems to be a bug with Flash Archives + MPXIO.

    After I installed with the flar, I rebooted into single user mode from alternate media (bootp or cdrom), mounted the internal drive, and modified some files, changing "no" to "yes":

    /kernel/drv/fp.conf:mpxio-disable="yes";
    /kernel/drv/iscsi.conf:mpxio-disable="yes";
    /kernel/drv/mpt.conf:mpxio-disable="yes";
    /kernel/drv/mpt.conf:disable-sata-mpxio="yes";
    /kernel/drv/mpt_sas.conf:mpxio-disable="yes";

    Then, after a "reboot -- -r", everything was fine.
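
    Pulled together, the recovery procedure looked roughly like this (a sketch only; the boot device, disk slice, and mount point names are assumptions for illustration):

    ok boot cdrom -s                        # or boot net -s, into single-user mode
    # mount /dev/dsk/c0t0d0s0 /a            # mount the internal boot disk (device name assumed)
    # vi /a/kernel/drv/fp.conf              # set mpxio-disable="yes";
    # vi /a/kernel/drv/iscsi.conf           # set mpxio-disable="yes";
    # vi /a/kernel/drv/mpt.conf             # set mpxio-disable="yes"; and disable-sata-mpxio="yes";
    # vi /a/kernel/drv/mpt_sas.conf         # set mpxio-disable="yes";
    # umount /a
    # reboot -- -r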
