Tuesday Mar 02, 2010

OpenOffice Calc cut&paste to Thunderbird

This blog entry is about how to copy this OpenOffice 3.1 data:


Into this Thunderbird 2.0 e-mail:


 First I used "gimp" to save a jpg version of the graph:



Then inserted into the e-mail.

Thunderbird -> Insert -> Image


OK - not too hard, but a more direct method would be convenient.

Inserting the OpenOffice Calc cells as a Thunderbird Table is almost easy.  Just select the cells in Calc, <ctrl>C, and paste into Thunderbird with <ctrl>V:


The problem, IMHO, is that when the OpenOffice Calc data is pasted into the Thunderbird message, all of the existing text in the Thunderbird message is assigned a font that is "too small".  Changing the font of the existing text was not an intended consequence of inserting a table into the message.

If you use "Paste Without Formatting", Thunderbird puts text, not an HTML table, into your e-mail, which may or may not be OK depending on the task at hand.

After you have pasted a table into your e-mail, if you want to increase the font size of some or all of your data, you can select any area, and use "<ctrl>+" (or Format->Size->Larger) to manually set the font size to what you had in mind.

Instead, I prefer to use the following approach to return to the original font size:

1) After the paste, in Thunderbird, type "<ctrl>A" to select all of the text

2) Click "Insert" -> "HTML" from the Thunderbird pulldown menus.  This brings up a window with all of the HTML for the current window


 3) Delete:
    ; font-size:x-small

4) Click "Insert"



You may want to experiment with removing the entire "style" tag if you don't like the Liberation Sans font.

I hope that someone out there in the WWW finds that this blog entry is helpful.

Sunday Jan 31, 2010

NFS Streaming over 10 GbE

NFS Tuning for HPC Streaming Applications


I was recently working in a lab environment with the goal of setting up a Solaris 10 Update 8 (s10u8) NFS server application that would be able to stream data to a small number of s10u8 NFS clients with the highest possible throughput for a High Performance Computing (HPC) application.  The workload varied over time: at some points the workload was read-intensive, while at other times it was write-intensive.  Regardless of read or write, the application's I/O pattern was always "large block sequential I/O", which was easily modeled with a "dd" stream from one or several clients.

Due to business considerations, 10 gigabit ethernet (10GbE) was chosen for the network infrastructure.  It was necessary not only to install appropriate server, network and I/O hardware, but also to tune each subsystem.  I wish it were more obvious whether a gigabit implies 1024^3 or 1000^3 bits.  In either case, one might naively assume that the connection should be able to reach NFS speeds of 1.25 gigabytes per second; however, my goal was to achieve NFS end-to-end throughput close to 1.0 gigabytes per second.


A network with the following components worked well:

  • Sun Fire X4270 servers
    • Intel Xeon X5570 CPUs @ 2.93 GHz
    • Solaris 10 Update 8
  • a network, consisting of:
    • Force10 S2410 Data Center 10 GbE Switch
    • 10-Gigabit Ethernet PCI Express Ethernet Controller, either:
      • 375-3586 (aka Option X1106A-Z) with the Intel 82598EB chip (ixgbe driver), or
      • 501-7283 (aka Option X1027A-Z) with the "Neptune" chip (nxge driver)
  • an I/O system on the NFS server:
    • 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers)
    • Storage, either
      • Sun Storage J4400 Arrays, or
      • Sun Storage F5100 Flash Array

This configuration, based on recent hardware, was able to reach close to full line speed performance.  In contrast, a slightly older server with Intel Xeon E5440 CPUs @ 2.83 GHz was not able to reach full line speed.

The application's I/O pattern is large block sequential and known to be 4K aligned, so the Sun Storage F5100 Flash Array is a good fit.  Sun does not recommend this device for general purpose NFS storage.


When the hardware was initially installed, rather than immediately measuring NFS performance, the individual network and I/O subsystems were tested.  To measure the network performance, I used netperf.  I found that the "out of the box" s10u8 performance was below my expectations; it seems that the Solaris "out of the box" settings are better fitted to a web server with a large number of potentially slow (WAN) connections.  To get the network humming for a large-block LAN workload, I made several changes:

a) The TCP Sliding Window settings in /etc/default/inetinit

ndd -set /dev/tcp tcp_xmit_hiwat  1048576
ndd -set /dev/tcp tcp_recv_hiwat  1048576
ndd -set /dev/tcp tcp_max_buf    16777216
ndd -set /dev/tcp tcp_cwnd_max    1048576

b) The network interface card "NIC" settings, depending on the card:

accept_jumbo = 1;
soft-lso-enable = 1;

* From http://www.solarisinternals.com/wiki/index.php/Networks
* For ixgbe or nxge
set ddi_msix_alloc_limit=8
* For nxge
set nxge:nxge_bcopy_thresh=1024
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2

c) Some seasoning :-)

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For few TCP connections
set ip:tcp_squeue_wput=1
* Bursty
set hires_tick=1

d) Make sure that you are using jumbo frames. I used mtu 8150, which I know made both the NICs and the switch happy.  Maybe I should have tried a slightly more aggressive setting of 9000.  

/etc/hostname.nxge0 mtu 8150

/etc/hostname.ixgbe0 mtu 8150

e) Verifying the MTU with ping and snoop.  Some ping implementations include a flag to allow the user to set the "do not fragment" (DNF) flag, which is very useful for verifying that the MTU is properly set.  With the ping implementation that ships with s10u8, you can't set the DNF flag.  To verify the MTU, use snoop to see if large pings are fragmented:

server# snoop -r -d nxge0
Using device nxge0 (promiscuous mode)

Example 1: An 8000 byte ping is not fragmented
client% ping -s 8000 1
PING 8000 data bytes
8008 bytes from server10G-43 ( icmp_seq=0. time=0.370 ms
-> ICMP Echo request (ID: 14797 Sequence number: 0)
-> ICMP Echo reply (ID: 14797 Sequence number: 0)

Example 2: A 9000 byte ping is broken into 2 packets in both directions
client% ping -s 9000 1
PING 9000 data bytes
9008 bytes from server10G-43 ( icmp_seq=0. time=0.383 ms
-> ICMP IP fragment ID=32355 Offset=0    MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=32355 Offset=8128 MF=0 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49788 Offset=0    MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49788 Offset=8128 MF=0 TOS=0x0 TTL=255

Example 3: A 32000 byte ping is broken into 4 packets in both directions
client% ping -s 32000 1
PING 32000 data bytes
32008 bytes from server10G-43 ( icmp_seq=0. time=0.556 ms
-> ICMP IP fragment ID=32356 Offset=0    MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=32356 Offset=8128 MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=32356 Offset=16256 MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=32356 Offset=24384 MF=0 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49789 Offset=0    MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49789 Offset=8128 MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49789 Offset=16256 MF=1 TOS=0x0 TTL=255
-> ICMP IP fragment ID=49789 Offset=24384 MF=0 TOS=0x0 TTL=255
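
The fragment counts in the three examples follow from simple MTU arithmetic.  Here is a rough sketch, assuming a 20-byte IPv4 header with no options, an 8-byte ICMP header, and the 8150-byte MTU used above.  The function name "ping_fragments" is made up for this illustration and is not part of any Solaris tool.

```shell
#!/bin/sh
# Sketch: how many IP packets does a ping with <size> data bytes need?
# Assumptions: IPv4 header of 20 bytes (no options), ICMP header of 8 bytes.
ping_fragments() (
  payload=$1                          # ICMP data bytes requested from ping
  mtu=$2                              # interface MTU in bytes
  icmp_len=$((payload + 8))           # ICMP header + data carried by IP
  if [ $((icmp_len + 20)) -le "$mtu" ]; then
    echo 1                            # fits in a single unfragmented packet
  else
    # Every fragment except the last carries a multiple of 8 bytes.
    per_frag=$(( (mtu - 20) / 8 * 8 ))
    echo $(( (icmp_len + per_frag - 1) / per_frag ))
  fi
)

for size in 8000 9000 32000; do
  echo "$size data bytes at MTU 8150: $(ping_fragments "$size" 8150) packet(s)"
done
```

With an MTU of 8150 the per-fragment payload works out to 8128 bytes, which matches the Offset values (8128, 16256, 24384) in the snoop output.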

f) Verification: after network tuning was complete, the network had very impressive performance with either the nxge or the ixgbe driver.  The end-to-end measurement reported by netperf of 9.78 Gb/sec is very close to full line speed and indicates that the switch, network interface cards, drivers and Solaris system call overhead are minimally intrusive.

$ /usr/local/bin/netperf -fg -H -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff: ( port 0 AF_INET to ::ffff: ( port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^9bits/sec

1048576 1048576 1048576    60.56       9.78

$ /usr/local/bin/netperf -fG -H -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff: ( port 0 AF_INET to ::ffff: ( port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    GBytes/sec

1048576 1048576 1048576    60.00       1.15

Both tests show about the same throughput, but in different units.  With the network performance tuning above, the system can achieve slightly less than 10 gigabits per second, which is slightly more than 1 gigabyte per second.
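
To relate the two netperf unit settings, here is a small conversion sketch.  It assumes that the "GBytes/sec" column means 2^30 bytes per second (the "10^9bits/sec" column header is taken at face value); the function name is illustrative only.

```shell
#!/bin/sh
# Sketch: convert a rate in 10^6 bits/sec to GBytes/sec (two decimals, truncated).
# Assumption: one GByte is 2^30 bytes; pass 1000000000 as the second
# argument to use 10^9 bytes instead.
mbits_to_gbytes() (
  mbits=$1
  gbyte=${2:-1073741824}
  bytes_per_sec=$(( mbits * 1000000 / 8 ))
  hundredths=$(( bytes_per_sec * 100 / gbyte ))
  printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
)

mbits_to_gbytes 9780                # measured 9.78 x 10^9 bits/sec
mbits_to_gbytes 10000               # nominal 10 GbE line rate, 2^30-byte units
mbits_to_gbytes 10000 1000000000    # nominal 10 GbE line rate, 10^9-byte units
```

Under these assumptions 9.78 x 10^9 bits/sec comes out just under the 1.15 GBytes/sec reported by the second run, and the naive 1.25 gigabytes per second figure only appears when a gigabyte is taken as 10^9 bytes.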

g) Observability: I found "nicstat" (download link at http://www.solarisinternals.com/wiki/index.php/Nicstat) to be a very valuable tool for observing network performance.  To compare the network performance with the running application against synthetic tests, I found that it was useful to graph the "nicstat" output (see http://blogs.sun.com/taylor22/entry/graphing_solaris_performance_stats).  You can verify that the jumbo frame MTU is working as expected by checking that the average packet payload is large: divide bytes-per-second by packets-per-second to get the average bytes per packet.
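
The bytes-per-packet check can be sketched as follows.  The function name and the sample rates are hypothetical (they are not measurements from this test), and the sketch assumes the throughput is expressed in whole KB/sec with 1 KB = 1024 bytes.

```shell
#!/bin/sh
# Sketch: average bytes per packet from a throughput and a packet rate.
# Assumption: throughput is given in whole KB/sec (1 KB = 1024 bytes).
avg_packet_bytes() (
  kb_per_sec=$1
  pkts_per_sec=$2
  if [ "$pkts_per_sec" -eq 0 ]; then
    echo 0                           # idle interface: avoid dividing by zero
  else
    echo $(( kb_per_sec * 1024 / pkts_per_sec ))
  fi
)

# Hypothetical sample values, not measurements from the article.
avg_packet_bytes 800000 101000      # jumbo frames in use: roughly 8 KB per packet
avg_packet_bytes 800000 560000      # standard MTU: under 1500 bytes per packet
```

An average close to the configured MTU suggests that jumbo frames are in use end to end; an average near 1500 bytes or below suggests that some hop is still using the standard MTU.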

Network Link Aggregation for load spreading

I initially hoped to use link aggregation to get a 20GbE NFS stream using 2 X 10GbE ports on the NFS server and 2 X 10GbE ports on the NFS client.  I hoped that packets would be distributed to the link aggregation group in a "per packet round robin" fashion.  What I found is that, regardless of whether the negotiation is based on L2, L3 or L4, LACP will negotiate port pairs based on a source/destination mapping, so that each stream of packets will only use one specific port from a link aggregation group. 

Link aggregation can be useful in spreading multiple streams over ports, but the streams will not necessarily be evenly divided across ports.  The distribution of data over ports in a link aggregation group can be viewed with "nicstat" 
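
The effect of a source/destination mapping can be seen with a toy example.  The hash below is invented purely for illustration; it is not the policy used by Solaris or by the Force10 switch, and the "addresses" are just small integers.

```shell
#!/bin/sh
# Toy illustration only: NOT the hash used by Solaris or by the Force10 switch.
# A flow is identified here by one number for the source and one for the
# destination; the chosen port depends only on those two values.
flow_port() (
  src=$1
  dst=$2
  ports=$3
  echo $(( (src ^ dst) % ports ))
)

# Every packet of the same flow lands on the same port ...
flow_port 10 43 2
flow_port 10 43 2
# ... and different clients may or may not collide on one port.
flow_port 11 43 2
flow_port 12 43 2
```

In this toy case clients 10 and 12 share a port while client 11 gets the other one, so a single stream never exceeds one port's bandwidth and the load is not necessarily balanced.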

After reviewing the literature, I concluded that it is best to use IPMP for failover and link aggregation for load spreading.  Link aggregation gives finer control over load spreading than IPMP.

Comparing IPMP and Link Aggregation

  • IPMP can protect against either NIC or switch/router failure, because each NIC can be connected to a different switch.
  • With Link Aggregation, all of the ports in the group must be connected to a single switch/router, and that switch/router must support Link Aggregation Control Protocol (LACP), so there is no protection against switch failure.

With the Force10 switch that I tested, I was disappointed that the LACP algorithm was not doing a good job of spreading inbound packets to my hot server.  Again, once the switch mapped a client to one of the ports in the "hot server's link group", the mapping stuck, so it was not unusual for several clients to be banging hard on one port while another port was idle.

Multiple subnets for network load spreading

After trying several approaches for load spreading with S10u8, I chose to use multiple subnets.  (If I had been using Nevada 107 or newer, which includes Project Clearview, I might have come to a different conclusion.)  In the end, I decided that the best solution was an old-fashioned approach using a single common management subnet combined with multiple data subnets:

  • All of the machines were able to communicate with each other on a slower "management network"; specifically, Sun Grid Engine jobs were managed on the slower network.
  • The clients were partitioned into a small number of "data subnets".
  • The "hot" NFS data server had multiple NIC's, with each NIC on a separate "data subnet".
  • A limitation of this approach is that clients in one subnet partition only have a low bandwidth connection to clients in different subnet partitions. This was OK for my project.
  • The advantage of manually preallocating the port distribution was that my benchmark was more deterministic.  I did not get overloaded ports in a seemingly random pattern.

Disk IO: Storage tested

Configurations that had sufficient bandwidth for this environment included:

  • A Sun Storage F5100 Flash Array using 80 FMods in a ZFS RAID 0 stripe to create a 2TB volume. 
  • A "2 X Sun Storage J4400 Arrays" JBOD configuration with a ZFS RAID 0 stripe
  • A "4 X Sun Storage J4400 Arrays" configuration with a ZFS RAID 1+0 mirrored and striped

Disk I/O: SAS HBA's

The Sun Storage F5100 Flash Array was connected to the Sun Fire X4270 server using 4 PCIe 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers) so that each F5100 domain with 20 FMods used an independent HBA.  Using 4 SAS HBAs has a 20% better theoretical throughput than using 2 SAS HBAs:

  • A fully loaded Sun Storage F5100 Flash Array with 80 FMods has 4 domains with 20 FMods per domain
  • Each  375-3487 Sun StorageTek PCI Express SAS 8-Channel HBA (SG-XPCIE8SAS-E-Z, LSI-based disk controller) has
    • two 4x wide SAS ports
    • an 8x wide PCIe bus connection
  • Each F5100 domain to SAS HBA connection, using a single 4x wide SAS port, has a maximum half duplex speed of (3Gb/sec * 4) = 12Gb/Sec =~ 1.2 GB/Sec per F5100 domain (3Gb/sec SAS uses 8b/10b encoding, so each data byte takes 10 bits on the wire)
  • PCI Express x8 (half duplex) = 2 GB/sec
  • A full F5100 (4 domains) connected using 2 SAS HBA's would be limited by PCIe to 4.0 GB/Sec
  • A full F5100 (4 domains) connected using 4 SAS HBA's would be limited by SAS to 4.8 GB/Sec.
  • Therefore a full Sun Storage F5100 Flash Array has 20% theoretically better throughput when connected using 4 SAS HBA's rather than 2 SAS HBA's.
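
The arithmetic in the list above can be written out as a short sketch.  It uses only the figures quoted in the list (1.2 GB/sec per 4x SAS port, 2 GB/sec per x8 PCIe slot, 4 domains) and assumes the domains are spread evenly over the HBAs, so it is only meaningful for 1, 2 or 4 HBAs.  The function name is illustrative.

```shell
#!/bin/sh
# Sketch of the bandwidth arithmetic, in MB/sec.
# Figures from the text: 1200 MB/sec per 4x SAS port (one F5100 domain),
# 2000 MB/sec per x8 PCIe slot, 4 domains spread evenly over the HBAs.
f5100_limit() (
  hbas=$1
  domains=4
  sas_per_domain=1200
  pcie_per_hba=2000
  sas_per_hba=$(( domains / hbas * sas_per_domain ))
  # Each HBA is limited by the slower of its SAS ports and its PCIe slot.
  if [ "$sas_per_hba" -lt "$pcie_per_hba" ]; then
    per_hba=$sas_per_hba
  else
    per_hba=$pcie_per_hba
  fi
  echo $(( per_hba * hbas ))
)

two=$(f5100_limit 2)
four=$(f5100_limit 4)
echo "2 HBAs: $two MB/sec, 4 HBAs: $four MB/sec"
echo "improvement: $(( (four - two) * 100 / two ))%"
```

With 2 HBAs the PCIe slots are the bottleneck; with 4 HBAs the SAS ports are, which is where the 20% figure comes from.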

The "mirrored and striped" configuration using 4 X Sun Storage J4400 Arrays was connected using 3 PCIe 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers):

SAS Wiring for 4 JBOD trays

Multipathing (MPXIO)

MPXIO was used in the "mirrored and striped" 4 X Sun Storage J4400 Array so that for every disk in the JBOD configuration, I/O could be requested by either of the 2 SAS cards connected to the array.  To eliminate any "single point of failure" I chose to mirror all of the drives in one tray with drives in another tray, so that any tray could be removed without losing data.

The command for creating a ZFS RAID 1+0 "mirrored and striped" volume out of MPXIO devices looks like this:

zpool create -f jbod \
mirror c3t5000C5001586BE93d0 c3t5000C50015876A6Ed0 \
mirror c3t5000C5001586E279d0 c3t5000C5001586E35Ed0 \
mirror c3t5000C5001586DCF2d0 c3t5000C50015825863d0 \
mirror c3t5000C5001584C8F1d0 c3t5000C5001589CEB8d0

It was a bit tricky to figure out which disks (e.g. "c3t5000C5001586BE93d0") were in which trays.  I ended up writing a surprisingly complicated Ruby script to choose devices to mirror.  This script worked for me.  Your mileage may vary.  Use at your own risk.

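#!/usr/bin/env ruby -W0
# Loop headers, block arguments and "end" keywords are missing from the
# published listing; lines marked (assumed) are best guesses at the original.

all=`stmsboot -L`
Device = Struct.new( :path_count, :non_stms_path_array, :target_ports, :expander, :location)
Location = Struct.new( :device_count, :devices )

# Map each STMS (multipathed) device name to its underlying paths
my_map = Hash.new                                        # (assumed)
all.each_line { |s|                                      # (assumed)
  if s =~ /\/dev/ then
    s2 = s.split                                         # (assumed)
    if !my_map.has_key?(s2[1]) then
      ## puts "creating device for #{s2[1]}"
      my_map[s2[1]] = Device.new(0,[],[])
    end
    ## puts my_map[s2[1]]
    my_map[s2[1]].path_count += 1
    my_map[s2[1]].non_stms_path_array.push(s2[0])        # (assumed)
  else
    puts "no match on #{s}"
  end
}

# Collect the target port names that mpathadm reports for each device
my_map.each { |k,v|
  ##puts "key is #{k}"
  mpath_data=`mpathadm show lu #{k}`
  in_target_section = false                              # (assumed)
  mpath_data.each_line { |line|
    if !in_target_section then
      if line =~ /Target Ports:/ then
        in_target_section = true                         # (assumed)
      end
      next                                               # (assumed)
    end
    if line =~ /Name:/ then
      my_map[k].target_ports.push(line.split[1])         # (assumed)
    end
  }
  ##puts "key is #{k} value is #{v}"
  ##puts k v.non_stms_path_array[0],  v.non_stms_path_array[1]
}

# Group devices by the slot number encoded in the target port name
location_map = Hash.new                                  # (assumed)
location_array = []                                      # (assumed)
my_map.each { |k,v|
  my_map[k].expander = my_map[k].target_ports[0][0,14]
  my_map[k].location = my_map[k].target_ports[0][14,2].hex % 64
  if !location_map.has_key?(my_map[k].location) then
    puts "creating entry for #{my_map[k].location}"
    location_map[my_map[k].location] = Location.new(0,[])
    location_array.push(my_map[k].location)              # (assumed)
  end
  location_map[my_map[k].location].device_count += 1
  location_map[my_map[k].location].devices.push(k)       # (assumed)
}

# Print one "mirror" line per slot, pairing the devices found at that slot
location_array.sort.each { |location|
  puts "mirror #{location_map[location].devices[0].gsub('/dev/rdsk/','')} #{location_map[location].devices[1].gsub('/dev/rdsk/','')} \\"
}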

Separate ZFS Intent Logs?

Based on http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on , I ran some tests comparing separate intent logs on either disk or SSD (slogs) vs the default chained logs (clogs).  For the large block sequential workload, "tuning" the configuration by adding separate ZFS Intent Logs actually slowed the system down slightly. 

ZFS Tuning

/etc/system parameters for ZFS
* For ZFS
set zfs:zfetch_max_streams=64
set zfs:zfetch_block_cap=2048
set zfs:zfs_txg_synctime=1
set zfs:zfs_vdev_max_pending = 8

NFS Tuning

a) Kernel settings

Solaris /etc/system

* For NFS
set nfs:nfs3_nra=16
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576

* Added to /etc/system on S10U8 x64 systems based on
*  http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For NFS throughput
set rpcmod:clnt_max_conns = 8

b) Mounting the NFS filesystem

/etc/vfstab - /mnt/nfs nfs - no vers=3,rsize=1048576,wsize=1048576

c) Verifying the NFS mount parameters

# nfsstat -m
/mnt/ar from
 Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60


With tuning, the Sun Storage J4400 Arrays via NFS achieved write throughput of 532 MB/Sec and read throughput of 780 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/jbod/test-80g bs=2048k count=40960; umount /mnt/jbod; mount /mnt/jbod; /bin/time dd if=/mnt/jbod/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:33.7
user        0.1
sys      1:30.9
40960+0 records in
40960+0 records out

real     1:44.9
user        0.1
sys      1:04.0

With tuning, the Sun Storage F5100 Flash Array via NFS achieved write throughput of 496 MB/Sec and read throughput of 832 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/lf/test-80g bs=2048k count=40960; umount /mnt/lf; mount /mnt/lf; /bin/time dd if=/mnt/lf/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:45.0
user        0.2
sys      2:17.3
40960+0 records in
40960+0 records out

real     1:38.4
user        0.1
sys      1:19.6
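
The throughput figures quoted above appear to come from dividing the file size by the "real" time of each dd run, counting a megabyte as 2^20 bytes and dropping the fraction.  Here is a sketch of that calculation; the function name is illustrative, and elapsed times are passed in tenths of a second to stay within integer arithmetic.

```shell
#!/bin/sh
# Sketch: throughput of a dd run in MB/sec (1 MB = 2^20 bytes here).
# Arguments: record count, block size in KB, elapsed time in tenths of a second.
dd_mb_per_sec() (
  records=$1
  block_kb=$2
  tenths=$3
  total_mb=$(( records * block_kb / 1024 ))
  echo $(( total_mb * 10 / tenths ))   # integer division truncates
)

dd_mb_per_sec 40960 2048 1537   # J4400 write, real 2:33.7
dd_mb_per_sec 40960 2048 1049   # J4400 read,  real 1:44.9
dd_mb_per_sec 40960 2048 1650   # F5100 write, real 2:45.0
dd_mb_per_sec 40960 2048 984    # F5100 read,  real 1:38.4
```

Run as written, this reproduces the 532, 780, 496 and 832 MB/sec figures quoted for the two arrays.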
To reiterate, this testing was done in preparation for work with an HPC application that is known to have large block sequential I/O that is aligned on 4K boundaries.  The Sun Storage F5100 Flash Array would not be recommended for general purpose NFS storage that is not known to be 4K aligned.


  - http://blogs.sun.com/brendan/entry/1_gbyte_sec_nfs_streaming
  - http://blogs.sun.com/dlutz/entry/maximizing_nfs_client_performance_on

Thursday Jan 21, 2010

Graphing Solaris Performance Stats

Graphing Solaris Performance Stats with gnuplot

It is not unusual to see an engineer import text from "vmstat" or "iostat" to a spreadsheet application such as Microsoft Office Excel or OpenOffice Calc to visualize the data.  This is a fine approach when used periodically but impractical when used frequently.  The process of transferring the data to a laptop, manually massaging the data, launching the office application, importing the data and selecting the columns to chart is too cumbersome when used as a daily process or if there are a large number of machines that are being monitored.  In my case, I needed to visualize the performance from a few servers that were under test, and needed a few graphs from the servers, a few times a day.  I used some traditional Unix scripts and gnuplot (http://www.gnuplot.info) from the Companion CD (http://www.sun.com/software/solaris/freeware) to quickly graph the data.

The right tool for graphing Solaris data depends on your use case scenario:

  • One or two graphs, now and then: Import the data into your favorite spreadsheet application.
  • Historic data, more graphs, more frequently: use gnuplot
  • Many graphs, real-time or historic data, for more machines, such as a grid of servers being managed by Sun Grid Engine: a formal tool such as Ganglia (http://ganglia.info, http://www.sunfreeware.com/) is recommended. An advantage of Ganglia is that performance data is exposed via a web interface to a potentially large number of viewers in real time.

That being said, here are some scripts that I used to view Solaris Performance data with gnuplot.

1. Gathering data.  For each benchmark run, a script was used to start gathering performance data:


#!/bin/bash
# Assumption: the output directory for this benchmark run is the first argument
dir=$1

mkdir $dir
vmstat 1        > $dir/vmstat.out        2>&1 &
zpool iostat 1  > $dir/zpool_iostat.out  2>&1 &
nicstat 1       > $dir/nicstat.out       2>&1 &
iostat -nmzxc 1 > $dir/iostat.out        2>&1 &
/opt/DTraceToolkit-0.99/Bin/iopattern 1 > $dir/iopattern.out   2>&1 &

The statistics gathering processes were all killed at the end of the benchmark run. Hence, each test had a directory with a comprehensive set of statistics files.

Next it was necessary to write a set of scripts to operate on the directories.

2. Graphing CPU utilization from "vmstat".

This script was fairly short and straightforward.  The "User CPU Utilization" and "System CPU Utilization" are in the 20th and 21st columns.  I added an optional argument to truncate the graph after a specific amount of time to account for the cases where the vmstat process was not killed immediately after the benchmark.  A bash "here document" is used to enter gnuplot commands.



#!/bin/bash
# Arguments (assumed): dir [minutes]
dir=$1
file=$dir/vmstat.out

if [ $# -eq 2 ] ; then
  minutes=$2
  (( seconds = minutes * 60 ))
  cat $file | head -$seconds > /tmp/data
  file=/tmp/data
fi

gnuplot -persist <<EOF
set title "$dir"
plot "$file" using 20 title "%user" with lines, \
     "$file" using 21 title "%sys" with lines
EOF


Graph of CPU utilization based on vmstat output

3. Graphing IO throughput from "iostat -nmzxc 1" data

This script was a little bit more complicated for three reasons:

  • The data file contains statistics for several filesystems that are not interesting and will be filtered out.  The script needs to be launched with an argument that will be used to select one device.
  • I used the 'z' option to iostat, which does not print traces when the device is idle (zero I/O).  The 'z' option makes a smaller file that is more human readable, but it is not good for graphing.  Thus I needed to synthesize the zero traces before passing the data to gnuplot.
  • I wanted to include a smooth line for the iostat "%w" and "%b" columns with a scale of 0 to 100.

#!/bin/bash
# This script is used to parse "iostat -nmzxc" data which is formatted like this:
#                     extended device statistics
#     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
#     0.0    0.9    0.8    3.8  0.0  0.0    0.0    0.5   0   0 c0t1d0
#     0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0 sge_master:/opt/sge6-2/default/common
#     0.0    0.8    1.9  184.5  0.0  0.0    4.1   31.1   0   1

if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pattern dir [minutes]"
  exit 1
fi

pattern=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# -eq 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))

# iostat.out is written by the gathering script; the temporary
# plot file name is an assumption
all_data=$dir/iostat.out
plot_data=/tmp/plot_data

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

# For each time interval, either:
#   print the trace for the device that matches the pattern, or
#   print a "zero" trace if there is not one in the data file
# You can tell that there was no trace for the device during an
# interval if you reach the "extended device statistics" line
# without finding a trace
gawk -v pattern="$pattern" '
$0 ~ pattern {
  found = 1 ;
  print
}

/extended/ {
  if (found == 0)
    printf("    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 \n")
  found = 0;
} ' $all_data | head -$seconds > $plot_data

gnuplot -persist <<EOF
set title "$pattern - $dir"
set ytics nomirror
set y2range [0:100]
set y2tics 0, 20
plot "$plot_data" using  3 title "read (kb/sec)" axis x1y1 with lines, \
     "$plot_data" using  4 title "write (kb/sec)" axis x1y1 with lines, \
     "$plot_data" using  9 title "%w" axis x1y2 smooth bezier with lines, \
     "$plot_data" using 10 title "%b" axis x1y2 smooth bezier with lines
EOF


I created the following graph with the command "graph_iostat.bash jbod NFS_client_10GbE 5" to select data only from the "jbod" NFS mount, where the data is stored in the directory named "NFS_client_10GbE" and only graph the first 5 minutes worth of data.


The iostat data was collected on an NFS client connected with a 10 gigabit network.  There is some write activity (green) at the start of the 5 minute sample period, followed by several minutes of intense reading (red) where the client hits speeds of 600-700MB/sec. The purple "%b" line, with values on the right x1y2 axis, indicates that during the intense read phase, the mount point is busy about 90% of the time.  
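
The zero-trace synthesis used in this section can be tried on a small inline sample.  The sketch below is a slight variation on the idea: it uses plain awk, reads standard input, and does not emit a zero row before the very first interval.  The function name and the sample rows are made up for the illustration.

```shell
#!/bin/sh
# Sketch: fill in "zero" rows for intervals in which "iostat -z" printed
# nothing for the device of interest. Reads iostat text on standard input.
zero_fill() {
  awk -v pattern="$1" '
    $0 ~ pattern { found = 1; print }
    /extended/ {
      # A new interval starts: if the previous one had no row, emit zeros.
      if (seen_header && found == 0)
        print "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0"
      seen_header = 1
      found = 0
    }'
}

# Made-up sample: the "jbod" device is missing from the second interval.
zero_fill jbod <<'SAMPLE'
                    extended device statistics
    5.0    0.0  640.0    0.0  0.0  0.1    0.0    2.0   0   1 jbod
                    extended device statistics
                    extended device statistics
    9.0    0.0 1152.0    0.0  0.0  0.2    0.0    2.2   0   2 jbod
SAMPLE
```

The output has one row per interval, so each line of the plot file corresponds to one second of wall-clock time, which is what gnuplot needs for an evenly spaced x axis.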

4. Graphing I/O Service time from "iostat -nmzxc" data.

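I also find that the queue and service time columns from iostat (columns 6 through 8 in the "-n" layout) are very interesting and can be graphed using a simplification of the previous script.

  • actv: average number of transactions actively being serviced
  • wsvc_t: average service time in the wait queue, in milliseconds
  • asvc_t: average service time of active transactions, in milliseconds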


#!/bin/bash
# This script is used to parse "iostat -nmzxc" data which is formatted like this:
#                     extended device statistics
#     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
#     0.0    0.9    0.8    3.8  0.0  0.0    0.0    0.5   0   0 c0t1d0
#     0.0    0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0 sge_master:/opt/sge6-2/default/common
#     0.0    0.8    1.9  184.5  0.0  0.0    4.1   31.1   0   1

if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pattern dir [minutes]"
  exit 1
fi

pattern=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# -eq 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))

# iostat.out is written by the gathering script; the temporary
# plot file name is an assumption
all_data=$dir/iostat.out
plot_data=/tmp/plot_data

# For each time interval, either:
#   print the trace for the device that matches the pattern, or
#   print a "zero" trace if there is not one in the data file
# You can tell that there was no trace for the device during an
# interval if you reach the "extended device statistics" line
# without finding a trace
gawk -v pattern="$pattern" '
$0 ~ pattern {
  found = 1 ;
  print
}

/extended/ {
  if (found == 0)
    printf("    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 \n")
  found = 0;
} ' $all_data | head -$seconds > $plot_data

# In the "-n" layout, wsvc_t is column 7 and asvc_t is column 8
gnuplot -persist <<EOF
set title "$pattern - $dir"
set log y
plot "$plot_data" using  7 title "wsvc_t" with lines, \
     "$plot_data" using  8 title "asvc_t" with lines
EOF


Here is the graph produced by the command "graph_iostat_svc_t.bash jbod NFS_client_10GbE 5"


5. Graphing network throughput data from "nicstat"

Another very valuable Solaris performance statistics tool is "nicstat".  For the download link, see http://blogs.sun.com/timc/entry/nicstat_the_solaris_and_linux .  A script to graph the data from nicstat follows the same pattern.


#!/bin/bash
if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 interface dir [minutes]"
  exit 1
fi

interface=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# -eq 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))

# nicstat.out is written by the gathering script; the temporary
# plot file name is an assumption
all_data=$dir/nicstat.out
plot_data=/tmp/plot_data

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

grep $interface $all_data | head -$seconds > $plot_data

gnuplot -persist <<EOF
set title "$interface - $dir"
plot "$plot_data" using 3 title "read" with lines, \
     "$plot_data" using 4 title "write" with lines
EOF

Here is the graph produced by the command "graph_nicstat.bash ixgbe2 NFS_server_10GbE 5"


6. Graphing IO throughput from "zpool iostat" data

The challenge for plotting "zpool iostat" data is that the traces are not in constant units and therefore it is necessary to re-compute the data in constant units, in this example, MB/sec. 


#!/bin/bash
if [ $# -lt 2 -o $# -gt 3 ] ; then
  echo "Usage: $0 pattern dir [minutes]"
  exit 1
fi

pool=$1
dir=$2
(( minutes = 24 * 60 )) #default: graph 1 day

if [ $# -eq 3 ] ; then
  minutes=$3
fi

(( seconds = minutes * 60 ))

# zpool_iostat.out is written by the gathering script; the temporary
# plot file names are assumptions
all_data=$dir/zpool_iostat.out
plot_data1=/tmp/plot_data1
plot_data2=/tmp/plot_data2

if [ ! -r $all_data ] ; then
  echo "can not read $all_data"
  exit 1
fi

# Columns 6 and 7 are read and write bandwidth; expand the K/M/G suffixes
# into multiplications and let bc evaluate the result in MB/sec
grep $pool $all_data | awk '{printf("%s/1048576\n",$6)}' | sed -e 's/K/*1024/g' -e 's/M/*1048576/g' -e 's/G/*1073741824/g' | bc | head -$seconds > $plot_data1
grep $pool $all_data | awk '{printf("%s/1048576\n",$7)}' | sed -e 's/K/*1024/g' -e 's/M/*1048576/g' -e 's/G/*1073741824/g' | bc | head -$seconds > $plot_data2

gnuplot -persist <<EOF
set title "$pool - $dir"
set log y
plot "$plot_data1" using 1 title "read (MB/sec)" with lines, \
     "$plot_data2" using 1 title "write (MB/sec)" with lines
EOF


Graphing the IO throughput of the zpool named "jbod" using the command  "graph_iostat_svc_t.bash jbod NFS_client_10GbE 5" shows that zpool can deliver data at speeds of close to one gigabyte per second.
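
The unit conversion for "zpool iostat" values can also be done in a single awk call.  The sketch below is an alternative to the sed and bc pipeline, not the method used for the graph above; the function name is illustrative, and it assumes the K, M and G suffixes are powers of 1024.

```shell
#!/bin/sh
# Sketch: convert one "zpool iostat" bandwidth value (e.g. 512K, 1.5G, 0)
# to MB/sec with three decimals. Suffixes are taken as powers of 1024.
to_mb_per_sec() {
  LC_ALL=C awk -v v="$1" 'BEGIN {
    mult = 1
    suffix = substr(v, length(v), 1)
    if (suffix == "K") mult = 1024
    else if (suffix == "M") mult = 1048576
    else if (suffix == "G") mult = 1073741824
    if (mult > 1) v = substr(v, 1, length(v) - 1)
    printf("%.3f\n", v * mult / 1048576)
  }'
}

to_mb_per_sec 512K
to_mb_per_sec 100M
to_mb_per_sec 1.5G
```

One design difference: bc with its default scale of 0 truncates the division to whole MB/sec, so values below 1 MB/sec become 0, which a logarithmic y axis cannot show.  Keeping a few decimals avoids that.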


It is easy to modify the scripts above to graph the output of many tools that output a table of data in text format.

Tuesday Nov 10, 2009

Solaris/x64 VNC with Cut & Paste

Yesterday, I was trying to get Cut & Paste to work between various VNC clients and a VNC server that was running on a Solaris 10 Update 8 x64 server.  The VNC software that was first in my PATH was from the SFWvnc package that is shipped on the Solaris Companion CD.

I was quite confused:

 1) Various Google searches revealed that vncconfig must be running on the server for cut and paste to work, however, it would not start:

 $ vncconfig
 No VNC extension on display :1.0

2) The man page for vncconfig indicates that this may be caused by using version 3 Xvnc.

3) SFWvnc is version 3.3.7

4) There is no free version for Solaris x64 at www.realvnc.com (but there is a SPARC build)

So I was left trying to figure out the easiest way to get Solaris/x64 VNC with Cut & Paste to work.  I wondered if I needed to download RealVNC's 4.X source and build the server.  Did I need to purchase the Enterprise Edition of RealVNC, even though I was not intending to use enterprise features?

Solution:  As it turns out the solution is simple: "pkgrm SFWvnc".  This SFW package has VNC 3 files from the Solaris Companion CD that compete with the VNC 4 files that come with SUNWxvnc and SUNWvncviewer in S10U5 and newer.  I've asked the owner to have SFWvnc removed from the S10U9 Solaris Companion CD.

Monday Apr 27, 2009

Solaris Containers & 32-bit Java Heap allocation

A note from Steve Dertien at PTC:


We’ve solved the memory issue with zones.  The issue is impacted by the kernel version on the server and the zone type that they have created.

What we’ve discovered is that older kernel versions do not adequately support the larger heap sizes in a whole zone configuration.  The kernel version can be output using uname -a as follows.

# uname -a
SunOS mlxsv016 5.10 Generic_137111-04 sun4v sparc SUNW,T5240

With that particular version you can allocate a JVM with a 2Gb heap in the global zone and in a sparse zone.  In a whole zone you will not be able to allocate the full 2Gb to the JVM.  The output of a JVM failure will look like the following in this case:

# ./java -Xmx2048m -version
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

This issue is resolved when you upgrade the kernel to a newer version.  The Sun servers at PTC are using a newer version of the kernel and therefore we’re not experiencing this issue.  For Windchill support on Solaris zones (whole or sparse) we should indicate that the customer must be on this kernel version or newer (assuming this does not regress).

# uname -a
SunOS edc-sunt5120 5.10 Generic_137137-09 sun4v sparc SUNW,SPARC-Enterprise-T5120

I don’t know if there are newer kernel versions but we should probably put a customer document together that states when running Windchill in a Solaris zone of any kind that the kernel must be patched to this level or higher and how to test for this condition.  [Jeff adds: See http://blogs.sun.com/patch/entry/solaris_10_kernel_patchid_progression] Also, this issue does not exist with the 64bit JVM when the -d64 option is supplied to the command.

This is the definition of a whole zone versus a sparse zone (lifted from here: http://opensolaris.org/os/community/zones/faq/)

Q: What is a global zone? Sparse-root zone? Whole-root zone? Local zone?
A: After installing Solaris 10 on a system, but before creating any zones, all processes run in the global zone. After you create a zone, it has processes that are associated with that zone and no other zone. Any process created by a process in a non-global zone is also associated with that non-global zone.

Any zone which is not the global zone is called a non-global zone. Some people call non-global zones simply "zones." Others call them "local zones" but this is discouraged.

The default zone filesystem model is called "sparse-root." This model emphasizes efficiency at the cost of some configuration flexibility. Sparse-root zones optimize physical memory and disk space usage by sharing some directories, like /usr and /lib. Sparse-root zones have their own private file areas for directories like /etc and /var. Whole-root zones increase configuration flexibility but increase resource usage. They do not use shared filesystems for /usr, /lib, and a few others.

There is no supported way to convert an existing sparse-root zone to a whole-root zone. Creating a new zone is required.

Wikipedia also indicates that the disk space penalty for a whole zone is mitigated if the file system that the zone is installed on is a ZFS clone of the global image.  The system then requires additional file system space only for blocks that differ, so two copies of the same data occupy one block of space instead of the traditional two.

Instructions for creating a sparse zone are outlined rather well here: http://www.logiqwest.com/dataCenter/Demos/RunBooks/Zones/createBasicZone.html

Instructions for creating a whole zone are outlined rather well here: http://www.logiqwest.com/dataCenter/Demos/RunBooks/Zones/createSelfContainedZone.html

The major difference is in the second step.  A sparse zone uses the “create” command while a whole zone uses “create -b”.  Jeff Taylor also sent a link to a nice tool called Zonestat (http://opensolaris.org/os/project/zonestat/).  The output of the tool does a great job at showing you the distribution of resources across the zones.  The command below assumes that you placed the zonestat.pl script into /usr/bin as it does not exist by default.

# perl /usr/bin/zonestat.pl

Zonename| IT|Size|Used| RAM| Shm| Lkd|  VM|
  global  0D   64  0.0 318M  0.0  0.0 225M
 edc-sne  0D   64  0.0 273M  0.0  0.0 204M
edc-sne2  0D   64  0.0 258M  0.0  0.0 190M
==TOTAL= ===   64  0.0 1.1G  0.0  0.0 620M

One last detail.  To get the configuration of a zone, we should ask customers for the output of the zonecfg utility, run from the global zone.  The commands that will work best are “zonecfg -z <zone_name> info” and “zonecfg -z <zone_name> export”.  We need to carefully evaluate any capped-memory settings or other settings defined in the zone to determine whether they are also causing issues for Windchill.  The output of either command will also indicate whether the zone is a sparse or whole zone, as the inherit section is likely not there in a whole zone.


# zonecfg -z edc-sne info
zonename: edc-sne
zonepath: /home/edc-sne
brand: native
autoboot: true
ip-type: shared

        dir: /lib
        dir: /platform
        dir: /sbin
        dir: /usr

        physical: e1000g0

# zonecfg -z edc-sne2 info
zonename: edc-sne2
zonepath: /home/edc-sne2
brand: native
autoboot: true
ip-type: shared

        physical: e1000g0


The export command will allow us to recreate the zones internally for our own testing purposes.  Its output should contain the create command that was used, but it is misleading.  If you use the standard “create” command, it will automatically create the required inherited properties.  The export command, however, will always show “create -b”, and for a sparse zone you will see several of the following instead:

add inherit-pkg-dir

set dir=/sbin
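
[Editor's sketch] A rough way to automate the sparse-versus-whole check described above is to look for inherit-pkg-dir entries in the text returned by “zonecfg -z <zone_name> info” or “zonecfg -z <zone_name> export”.  The Python below is a minimal illustration, not a supported tool: the function name is hypothetical, the sample strings are made up for the example rather than captured from a real system, and it assumes that every sparse zone reports at least one inherit-pkg-dir entry.

```python
def zone_root_model(zonecfg_output):
    """Classify zonecfg `info` or `export` text as 'sparse' or 'whole'.

    Assumption: a sparse zone lists at least one inherit-pkg-dir
    resource, and a whole zone lists none.
    """
    for line in zonecfg_output.splitlines():
        if "inherit-pkg-dir" in line:
            return "sparse"
    return "whole"


# Illustrative inputs only, modeled on the snippets in this note.
sparse_export = """create -b
set zonepath=/home/edc-sne
add inherit-pkg-dir
set dir=/sbin
end
"""
whole_info = """zonename: edc-sne2
zonepath: /home/edc-sne2
brand: native
autoboot: true
ip-type: shared
"""
print(zone_root_model(sparse_export))  # sparse
print(zone_root_model(whole_info))     # whole
```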


I currently do not have an opinion as to whether we want to advocate a Whole vs. Sparse zone.  It seems a Whole zone creates more independence from the global zone.  This may be necessary if you want to have more absolute control over the configuration.

Best Regards,

Tuesday Apr 14, 2009

Algorithmics Financial Risk Management Software on the Sun Fire X4270

For the past several months, I've been slaving over a new server powered by Intel® Xeon® 5500 Series processors.  I have been using the system to investigate the characteristics of Algorithmics' Risk Analysis application, and I've found that the new server and the Solaris OS run great.  Solaris lets you take advantage of all the great new features of these new CPUs such as TurboBoost and HyperThreading. The big takeaway is that the new version of Algo's software really benefits from the new CPU: this platform runs faster than the other Solaris systems I've tested.

If you don't know Algorithmics, their software analyzes the risk of financial instruments, and financial institutions are using it to help keep the current financial meltdown from worsening. Algo is used by banks and financial institutions in 30 countries. I'm part of the team at Sun that has worked closely with Algorithmics to optimize performance of Algorithmics' risk management solutions on Sun Fire servers running Solaris.

Algo's software has a flexible simulation framework for a broad class of financial instruments.  SLIMs are their newest generation of very fast simulation engines. Different SLIMs are available to simulate interest rate products (swaps, FRAs, bonds, etc.), CDSs and some option products. The simulation performance with the new SLIMs is spectacular.

 Solaris provides everything that is required for Algo to take advantage of the new server line's performance and energy efficiency:
  • Sun Studio Express has been optimized to generate efficient SIMD instructions
  • ZFS provides the high performance I/O
  • Hyperthreading:  Solaris can easily take advantage of the heavily threaded architecture.
  • Turbo Boost.  Yes, it kicks in immediately when a SLIM is launched
  • QuickPath chip-to-chip interconnect.  Solaris is NUMA aware.

 Here are the answers to some questions that you may have about SLIMs on Solaris and the Xeon® Processor 5570:

 Q: Are SLIMs I/O bound or CPU bound?
 A: SLIMs require both the computational and the I/O subsystems to be excellent.  For the data sets that have been provided by Algo running on the X4270 with ZFS striped over 12 internal hard drives, results vary.  The majority of the SLIMs are CPU bound, but a few are I/O bound.

 Q: What are the computational characteristics of SLIMs running on the X4270?
 A: The number of threads to be used is specified by the user.  Most of the SLIMs scale well up to 8 threads and some scale all the way up to 16 threads.  SLIMs that don't scale to 16 threads on the X4270 with internal hard drives are primarily I/O bound and benefit from even faster storage. (Hint: check this Algo blog again in the future for more discussion regarding faster I/O devices.)

 Q: What are the I/O characteristics of SLIMs running on the X4270 with ZFS?
 A: The I/O pattern of SLIMs is large block sequential write.  ZFS caches the writes and flushes the cache approximately once every 10 seconds.  Each hard drive hits peaks of 80 MB/second.  With 12 striped internal hard drives, the total system throughput can reach close to 1.0 GB/second. 

 Q: Is external storage required for SLIMs?
 A: 14 internal drives (plus 2 for mirrored OS disks) at 146GB each can hold a data set upwards of 1.7TB.  If your data fits, internal drives will be a cost effective solution.
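
 The throughput and capacity figures in the two answers above can be checked with simple arithmetic.  The Python below is only a back-of-the-envelope sketch: the function names are made up for the example, and it assumes ideal striping with no filesystem overhead and decimal units (1 TB = 1000 GB).

```python
def aggregate_throughput_mb_s(drives, per_drive_mb_s):
    """Peak throughput of a stripe, assuming every drive peaks at once."""
    return drives * per_drive_mb_s


def raw_capacity_tb(drives, drive_gb):
    """Raw striped capacity in TB, ignoring filesystem overhead."""
    return drives * drive_gb / 1000.0


print(aggregate_throughput_mb_s(12, 80))  # 960, close to 1.0 GB/second
print(raw_capacity_tb(12, 146))           # 1.752 (12 data drives, as tested)
print(raw_capacity_tb(14, 146))           # 2.044 (14 data drives)
```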

 Q: Is hardware RAID required for the SLIMs data?  Background:  RAID can potentially fulfill needs including (a) striping to create larger storage units from smaller disks, (b) data redundancy so that you don't lose important data, and (c) increased I/O performance via caching and fast checksum block computation in hardware.
 A: No, hardware RAID is not required.  (a) ZFS has a full range of RAID capabilities.   (b) The cube of data produced by SLIMs is normally considered to be temporary data that can be recalculated if necessary,  and therefore redundancy is not required.  (c) If redundancy is required for your installation, RAID-Z has been shown to have a negligible impact on SLIMs' I/O performance. The SLIMs' write-intensive I/O pattern will blow through cache and be bound by disk write performance, so there is no advantage to adding an additional layer of cache.

 Q: Algo often recommends using Direct I/O filesystems instead of buffering.  Should we use Direct I/O with ZFS?
 A: No.  All ZFS I/O is buffered.  Based on comparisons against QFS with Direct I/O enabled, ZFS is recommended.

 Q: In some cases ZFS performance can be improved by disabling ZIL or by putting ZIL on a faster device in a ZFS hybrid storage pool.  Does it help SLIMs performance?
 A: No. SLIMs' I/O is not synchronous. SSD for ZIL will not improve SLIMs' performance when writing to direct attached storage.

 Q: Is the use of Power Management recommended?
 A: Yes.   PowerTOP was used to enable and monitor the CPU power management.  When the machine is idle, power is reduced.  When SLIMs are executing, the CPUs quickly jump into TurboBoost mode.  There was no significant performance difference between running SLIMs with and without power management enabled.

Q: The Sun Fire X4270 DDR3 memory can be configured to run at 800MHz, 1066MHz or 1333MHz.  Does the DDR3 speed affect SLIMs performance?
A: Yes, several of the SLIMs (but not all) run better on systems configured to run DDR3 at 1333MHz.

Q: Would it be better to deploy SLIMs on a blade server (like the X6275 Server Module) or a rack mounted server (like the X4270)?
A: Again, the answer revolves around storage.  If the SLIMs time series data fits onto the internal disk drives, the rack mounted server will be a cost effective solution.  If your time series data is greater than 1.5 TB, it will be necessary to use external storage, and the blade server will be a more cost effective solution.

 Platform tested:

  • Sun Fire X4270 2RU rack-mount server
  • 2 Intel® Xeon® Processor 5500 Series CPUs
  • 14 x 146GB 10K RPM disk drives
      o 2 for the operating system
      o 12 for data storage
      o 2 empty slots
  • Memory configurations ranged from 24GB to 72GB

Tuesday Feb 26, 2008

Jumpstart: ERROR: could not load the media (/cdrom)

During a "Solaris 10 8/07 s10s_u4wos_12b SPARC" jumpstart install, the jumpstart client complained:

cat: cannot open /cdrom/.cdtoc
cat: cannot open /cdrom/.cdtoc
expr: syntax error
expr: syntax error
cat: cannot open /cdrom/.cdtoc
expr: syntax error
ERROR: could not load the media (/cdrom)

It turned out that my /etc/dfs/dfstab entry was not pointing to the correct level of the hierarchy.

The correct entry is:

share -F nfs -o ro,anon=0 /export/home/sol_10_807_sparc

The incorrect entry is:

share -F nfs -o ro,anon=0 /export/home/sol_10_807_sparc/Solaris_10

The root of the problem was that, with the incorrect entry, "/export/home/sol_10_807_sparc/.cdtoc" was not visible via NFS.

# cat /etc/bootparams
scnode2.mydomain.com  root=scnode1:/export/home/sol_10_807_sparc/Solaris_10/Tools/Boot install= boottype=:in sysid_config= install_config= rootopts=:rsize=8192 ns=[]:dns[]
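
A quick sanity check before sharing an install image is to confirm that the directory named in dfstab has ".cdtoc" at its top level.  The Python below is a minimal sketch of that idea, not part of Jumpstart: the helper name is hypothetical, and the demo uses throwaway temporary directories rather than a real Solaris image.

```python
import os
import tempfile


def share_exposes_cdtoc(shared_dir):
    """True if .cdtoc sits at the top level of the directory to be shared."""
    return os.path.isfile(os.path.join(shared_dir, ".cdtoc"))


# Demo: mimic the image layout with a temporary directory tree.
with tempfile.TemporaryDirectory() as image_root:
    open(os.path.join(image_root, ".cdtoc"), "w").close()
    product_dir = os.path.join(image_root, "Solaris_10")
    os.mkdir(product_dir)
    print(share_exposes_cdtoc(image_root))   # True: correct level to share
    print(share_exposes_cdtoc(product_dir))  # False: one level too deep
```

On the install server, the function would be called with the path from the dfstab entry, for example /export/home/sol_10_807_sparc.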

Friday Jul 13, 2007

Fino's Additions for wt.properties

While running a Windchill 8.0 Sizing Study on Solaris, changes and additions to Windchill property files reduced the CPU load and improved the average response time.  


[Read More]

Sunday Jun 24, 2007

Configuring jumbo frames on the V490’s ce and the T2000's e1000g

Jumbo frames are Ethernet frames above 1518 bytes in size.


[Read More]

Wednesday May 30, 2007

FAQ for Windchill on Solaris

I work in Sun's "ISV Engineering" team.  Our responsibilities include working with Sun's key ISV partners to port, tune and optimize industry leading applications on Sun's hardware and software stack, and to ensure that the latest solutions from the ISVs are certified on the latest products from Sun. One of the applications that I focus on is Windchill from PTC.  As such, I am in frequent contact with PTC's R&D, Global Services, QA and customers, all of whom bring varying degrees of Solaris expertise.  This is a collection of questions and answers.


[Read More]

Friday May 25, 2007

JVM Tuning for Windchill

    The World Wide Web is full of articles that cover Java Tuning. With so much information available, it is hard for a Windchill Administrator to know where to start. Which approaches are useful? Which articles and options apply to Windchill? How to get started? What are the right settings for Windchill?

    Why is this so hard? The best Java options depend on the hardware that is used for the Windchill server and the usage patterns of the Windchill users. One of the fundamental questions that needs to be answered by every Windchill administrator is: “Is memory being used efficiently?” Every Windchill installation will need to customize -Xmx, which sets the maximum size of the Java heap. The default value is 64MB [1], much too small for Windchill. Unfortunately, there is not one correct answer for every installation. When a well written Java program performs poorly, there are two typical causes: either the Java heap is too small, causing an excessive amount of garbage collection, or the Java heap is so big that portions of it are paged out to disk. Either problem can be severe, so finding a balance is important. In conjunction with setting the right heap size, every administrator will need to set the size of the Eden (young), survivor, and tenured (old) generations [2]. Also, there are a large number of other Java options, some that help Windchill, some that have minimal impact, and some that only apply to JVM releases that are not supported by Windchill [3].

    This article will focus on Java 1.4.2. Windchill 8.0 M020, released in May 2006, was tested with the Sun Microsystems Java Software Development Kit version 1.4.2_09 for Solaris 9 and 10. The Support Matrix notes that higher versions in the 1.4.2_xx series are also expected to work.


[Read More]


