Tuesday Nov 27, 2012

ZFS for Database Log Files


I've been troubled by drop outs in CPU usage in my application server, characterized by the CPUs suddenly going from close to 90% busy to almost completely idle for a few seconds. Here is an example of a drop out, as shown by a snippet of vmstat data taken while the application server is under a heavy workload.

# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s3 s4 s5 s6   in   sy   cs us sy id
 1 0 0 130160176 116381952 0 16 0 0 0 0  0  0  0  0  0 207377 117715 203884 70 21 9
 12 0 0 130160160 116381936 0 25 0 0 0 0 0  0  0  0  0 200413 117162 197250 70 20 9
 11 0 0 130160176 116381920 0 16 0 0 0 0 0  0  1  0  0 203150 119365 200249 72 21 7
 8 0 0 130160176 116377808 0 19 0 0 0 0  0  0  0  0  0 169826 96144 165194 56 17 27
 0 0 0 130160176 116377800 0 16 0 0 0 0  0  0  0  0  1 10245 9376 9164 2  1 97
 0 0 0 130160176 116377792 0 16 0 0 0 0  0  0  0  0  2 15742 12401 14784 4 1 95
 0 0 0 130160176 116377776 2 16 0 0 0 0  0  0  1  0  0 19972 17703 19612 6 2 92

 14 0 0 130160176 116377696 0 16 0 0 0 0 0  0  0  0  0 202794 116793 199807 71 21 8
 9 0 0 130160160 116373584 0 30 0 0 0 0  0  0 18  0  0 203123 117857 198825 69 20 11


This behavior occurred consistently while the application server was processing synthetic transactions: HTTP requests from JMeter running on an external machine.

I explored many theories trying to explain the drop outs, including:
  • Unexpected JMeter behavior
  • Network contention
  • Java Garbage Collection
  • Application Server thread pool problems
  • Connection pool problems
  • Database transaction processing
  • Database I/O contention

Graphing the CPU %idle led to a breakthrough:

AppServerIdle.jpg

Several of the drop outs were 30 seconds apart. With that insight, I went digging through the data again, looking for other outliers that were 30 seconds apart. In the database server statistics, I found spikes in the iostat "asvc_t" (average response time of disk transactions, in milliseconds) for the disk drive that was being used for the database log files.

Here is an example:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

    0.0 2053.6    0.0 8234.3  0.0  0.2    0.0    0.1   0  24 c3t60080E5...F4F6d0s0
    0.0 2162.2    0.0 8652.8  0.0  0.3    0.0    0.1   0  28 c3t60080E5...F4F6d0s0
    0.0 1102.5    0.0 10012.8  0.0  4.5    0.0    4.1   0  69 c3t60080E5...F4F6d0s0
    0.0   74.0    0.0 7920.6  0.0 10.0    0.0  135.1   0 100 c3t60080E5...F4F6d0s0
    0.0  568.7    0.0 6674.0  0.0  6.4    0.0   11.2   0  90 c3t60080E5...F4F6d0s0
    0.0 1358.0    0.0 5456.0  0.0  0.6    0.0    0.4   0  55 c3t60080E5...F4F6d0s0
    0.0 1314.3    0.0 5285.2  0.0  0.7    0.0    0.5   0  70 c3t60080E5...F4F6d0s0

Here is a little more information about my database configuration:
  • The database and application server were running on two different SPARC servers.
  • Storage for the database was on a storage array connected via 8 gigabit Fibre Channel.
  • Data storage and log files were on different physical disk drives.
  • Reliable low-latency I/O is provided by battery-backed NVRAM.
  • Highly available:
    • Two Fibre Channel links accessed via MPxIO
    • Two mirrored cache controllers
    • The log file physical disks were mirrored in the storage device
  • Database log files were on a ZFS filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming.

Why would I be getting service time spikes in my high-end storage? First, I wanted to verify that the database log disk service time spikes aligned with the application server CPU drop outs, and they did:

AppServerIdleLogService.jpg

At first, I guessed that the disk service time spikes might be related to flushing the write cache on the storage device, but I was unable to validate that theory.

After searching the WWW for a while, I decided to try using a separate log device:

# zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0

The ZFS log device is configured in a similar manner as described above: two physical disks mirrored in the storage array.

This change to the database storage configuration eliminated the application server CPU drop outs:

AppServerIdleAfter.jpg

Here is the zpool configuration:

# zpool status ZFS-db-41
  pool: ZFS-db-41
 state: ONLINE
 scan: none requested
config:

        NAME                                     STATE
        ZFS-db-41                                ONLINE
          c3t60080E5...F4F6d0  ONLINE
        logs
          c3t60080E5...F8A7d0  ONLINE


Now, the I/O spikes look like this:

                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1053.5    0.0 4234.1  0.0  0.8    0.0    0.7   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1131.8    0.0 4555.3  0.0  0.8    0.0    0.7   0  76 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1167.6    0.0 4682.2  0.0  0.7    0.0    0.6   0  74 c3t60080E5...F8A7d0s0
    0.0  162.2    0.0 19153.9  0.0  0.7    0.0    4.2   0  12 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1247.2    0.0 4992.6  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0
    0.0   41.0    0.0   70.0  0.0  0.1    0.0    1.6   0   2 c3t60080E5...F4F6d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1241.3    0.0 4989.3  0.0  0.8    0.0    0.6   0  75 c3t60080E5...F8A7d0s0
                    extended device statistics             
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1193.2    0.0 4772.9  0.0  0.7    0.0    0.6   0  71 c3t60080E5...F8A7d0s0

We can see the steady flow of 4k writes to the ZIL device from O_SYNC database log file writes. The spikes are from flushing the transaction group.
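
To watch the split between the log device and the data device directly, zpool iostat can be run against the pool (a quick check using the pool name from above; the trailing number is the sample interval in seconds):

# zpool iostat -v ZFS-db-41 5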

Like almost all problems that I run into, once I thoroughly understand the problem, I find that other people have documented similar experiences. Thanks to all of you who have documented alternative approaches.

Saved for another day: now that the problem is obvious, I should try "zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning Guide.
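
If I do try it, the tunable goes into /etc/system like the other settings shown later in this blog; the value below is only a placeholder to illustrate the syntax, not a recommendation:

set zfs:zfs_immediate_write_sz=0x8000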

References:

Thursday Jun 07, 2012

ndd on Solaris 10

This is mostly a repost of LaoTsao's Weblog, with some tweaks and additions. In 2008, his blog pointed out that with Solaris 9 and earlier, an rc3 script would be used to specify ndd parameters at boot up. With Solaris 10 and later, it is more elegant to use SMF.

The last time that I tried to cut & paste directly off of his page, some of the XML was messed up, so I am reposting working Solaris 10 XML in this blog entry.

Additionally, I am including scripts that I use to distribute the settings to multiple servers. I run the distribution scripts from my MacBook, but they should also work from a Windows laptop using Cygwin, or from an existing Solaris installation.

Why is it necessary to set ndd parameters at boot up?

The problem being addressed is how to set ndd parameters so that they survive a reboot. It is easy to specify ndd settings from a shell, but they apply only to the running OS and are lost at reboot.

Examples of ndd settings being necessary include performance tuning, as described in NFS Tuning for HPC Streaming Applications, and meeting the prerequisites for installing Oracle Database 11gR2 on Solaris 10, as shown here:

11gr2_ndd_check.jpg

On Solaris 10 Update 10, the default network settings don't match the Oracle 11gR2 prerequisites:


Parameter               Expected Value  Actual Value
tcp_smallest_anon_port  9000            32768
tcp_largest_anon_port   65500           65535
udp_smallest_anon_port  9000            32768
udp_largest_anon_port   65500           65535
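
To check the values currently in effect on a host, ndd can read each parameter back (parameter names as in the table above):

# ndd /dev/tcp tcp_smallest_anon_port
# ndd /dev/tcp tcp_largest_anon_port
# ndd /dev/udp udp_smallest_anon_port
# ndd /dev/udp udp_largest_anon_port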

To distribute the SMF files, and for future administration, it is helpful to enable passwordless ssh from your secure laptop:

================
If not already present, create an ssh key on your laptop
================

# ssh-keygen -t rsa

================
Enable passwordless ssh from my laptop.
Need to type in the root password for the remote machines.
Then, I no longer need to type in the password when I ssh or scp from my laptop to servers.
================

#!/usr/bin/env bash

for server in `cat servers.txt`
do
  echo root@$server
  cat ~/.ssh/id_rsa.pub | ssh root@$server "cat >> .ssh/authorized_keys"
done

Specify the servers to distribute to:

================
servers.txt
================

testhost1
testhost2

In addition to ndd values, I often use the following /etc/system settings: 

================
etc_system_addins
================

set rpcmod:clnt_max_conns=8
set zfs:zfs_arc_max=0x1000000000
set nfs:nfs3_bsize=131072
set nfs:nfs4_bsize=131072

Modify ndd-nettune.txt with the ndd values that are appropriate for your deployment: 

================
ndd-nettune.txt
================

#!/sbin/sh
#
# ident   "@(#)ndd-nettune.xml    1.0     01/08/06 SMI"

. /lib/svc/share/smf_include.sh
. /lib/svc/share/net_include.sh

# Make sure that the libraries essential to this stage of booting  can be found.
LD_LIBRARY_PATH=/lib; export LD_LIBRARY_PATH
echo "Performing Directory Server Tuning..." >> /tmp/smf.out
#
# Performance Settings
#
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1048576
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 1048576
#
# Oracle Database 11gR2 Settings
#
/usr/sbin/ndd -set /dev/tcp tcp_smallest_anon_port 9000
/usr/sbin/ndd -set /dev/tcp tcp_largest_anon_port 65500
/usr/sbin/ndd -set /dev/udp udp_smallest_anon_port 9000
/usr/sbin/ndd -set /dev/udp udp_largest_anon_port 65500

# Reset the library path now that we are past the critical stage
unset LD_LIBRARY_PATH


================
ndd-nettune.xml
================

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- ident "@(#)ndd-nettune.xml 1.0 04/09/21 SMI" -->
<service_bundle type='manifest' name='SUNWcsr:ndd'>
  <service name='network/ndd-nettune' type='service' version='1'>
    <create_default_instance enabled='true' />
    <single_instance />
    <dependency name='fs-minimal' type='service' grouping='require_all' restart_on='none'>
      <service_fmri value='svc:/system/filesystem/minimal' />
    </dependency>
    <dependency name='loopback-network' grouping='require_any' restart_on='none' type='service'>
      <service_fmri value='svc:/network/loopback' />
    </dependency>
    <dependency name='physical-network' grouping='optional_all' restart_on='none' type='service'>
      <service_fmri value='svc:/network/physical' />
    </dependency>
    <exec_method type='method' name='start' exec='/lib/svc/method/ndd-nettune' timeout_seconds='3' > </exec_method>
    <exec_method type='method' name='stop'  exec=':true'                       timeout_seconds='3' > </exec_method>
    <property_group name='startd' type='framework'>
      <propval name='duration' type='astring' value='transient' />
    </property_group>
    <stability value='Unstable' />
    <template>
      <common_name>
    <loctext xml:lang='C'> ndd network tuning </loctext>
      </common_name>
      <documentation>
    <manpage title='ndd' section='1M' manpath='/usr/share/man' />
      </documentation>
    </template>
  </service>
</service_bundle>

Execute this shell script to distribute the files. The ndd values will be modified immediately and will then survive reboot. The servers will need to be rebooted to pick up the /etc/system settings:

================
system_tuning.sh
================

#!/usr/bin/env bash

for server in `cat servers.txt`
do
  cat etc_system_addins | ssh root@$server "cat >> /etc/system"

  scp ndd-nettune.xml root@${server}:/var/svc/manifest/site/ndd-nettune.xml
  scp ndd-nettune.txt root@${server}:/lib/svc/method/ndd-nettune
  ssh root@$server chmod +x /lib/svc/method/ndd-nettune
  ssh root@$server svccfg validate /var/svc/manifest/site/ndd-nettune.xml
  ssh root@$server svccfg import /var/svc/manifest/site/ndd-nettune.xml
done
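
After the import, one more loop confirms that the service came up cleanly on every server (svcs is the standard SMF status command):

for server in `cat servers.txt`
do
  ssh root@$server svcs network/ndd-nettune
done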





Wednesday May 30, 2012

Java EE Application Servers, SPARC T4, Solaris Containers, and Resource Pools

I've obtained a substantial performance improvement on a SPARC T4-2 Server running a Java EE Application Server Cluster by deploying the cluster members into Oracle Solaris Containers and binding those containers to cores of the SPARC T4 Processor. This is not a surprising result; in fact, it is consistent with other results that are available on the Internet. See the references, below, for some examples. Nonetheless, here is a summary of my configuration and results.

(1.0) Before deploying a Java EE Application Server Cluster into a virtualized environment, many decisions need to be made. I'm not claiming that all of the decisions that I have made will work well for every environment. In fact, I'm not even claiming that all of the decisions are the best possible for my environment. I'm only claiming that of the small sample of configurations that I've tested, this is the one that is working best for me. Here are some of the decisions that needed to be made:

(1.1) Which virtualization option? There are several virtualization options and isolation levels that are available. Options include:

  • Hard partitions:  Dynamic Domains on Sun SPARC Enterprise M-Series Servers
  • Hypervisor based virtualization such as Oracle VM Server for SPARC (LDOMs) on SPARC T-Series Servers
  • OS Virtualization using Oracle Solaris Containers
  • Resource management tools in the Oracle Solaris OS to control the amount of resources an application receives, such as CPU cycles, physical memory, and network bandwidth.

Oracle Solaris Containers provide the right level of isolation and flexibility for my environment. To borrow some words from my friends in marketing, "The SPARC T4 processor leverages the unique, no-cost virtualization capabilities of Oracle Solaris Zones."

(1.2) How to associate Oracle Solaris Containers with resources? There are several options available to associate containers with resources, including (a) resource pool association, (b) dedicated-cpu resources, and (c) capped-cpu resources. I chose to create resource pools and associate them with the containers because I wanted explicit control over the cores and virtual processors.

(1.3) Cluster Topology? Is it best to deploy (a) multiple application servers on one node, (b) one application server on multiple nodes, or (c) multiple application servers on multiple nodes? After a few quick tests, it appears that one application server per Oracle Solaris Container is a good solution.

(1.4) Number of cluster members to deploy? I chose to deploy four big application servers. I would like to go back and test many 32-bit application servers, but that is left for another day.

(2.0) Configuration tested.

(2.1) I was using a SPARC T4-2 Server which has 2 CPUs and 128 virtual processors. To understand the physical layout of the hardware on Solaris 10, I used the OpenSolaris psrinfo perl script available at http://hub.opensolaris.org/bin/download/Community+Group+performance/files/psrinfo.pl:

test# ./psrinfo.pl -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
  The core has 8 virtual processors (16-23)
  The core has 8 virtual processors (24-31)
  The core has 8 virtual processors (32-39)
  The core has 8 virtual processors (40-47)
  The core has 8 virtual processors (48-55)
  The core has 8 virtual processors (56-63)
    SPARC-T4 (chipid 0, clock 2848 MHz)
The physical processor has 8 cores and 64 virtual processors (64-127)
  The core has 8 virtual processors (64-71)
  The core has 8 virtual processors (72-79)
  The core has 8 virtual processors (80-87)
  The core has 8 virtual processors (88-95)
  The core has 8 virtual processors (96-103)
  The core has 8 virtual processors (104-111)
  The core has 8 virtual processors (112-119)
  The core has 8 virtual processors (120-127)
    SPARC-T4 (chipid 1, clock 2848 MHz)

(2.2) The "before" test: without processor binding. I started with a 4-member cluster deployed into 4 Oracle Solaris Containers. Each container used a unique gigabit Ethernet port for HTTP traffic. The containers shared a 10 gigabit Ethernet port for JDBC traffic.

(2.3) The "after" test: with processor binding. I ran one application server in the Global Zone and another application server in each of the three non-global zones (NGZ): 

(3.0) Configuration steps. The following steps need to be repeated for all three Oracle Solaris Containers.

(3.1) Stop AppServers from the BUI.

(3.2) Stop the NGZ.

test# ssh test-z2 init 5

(3.3) Enable resource pools:

test# svcadm enable pools

(3.4) Create the resource pool:

test# poolcfg -dc 'create pool pool-test-z2'

(3.5) Create the processor set:

test# poolcfg -dc 'create pset pset-test-z2'

(3.6) Specify the maximum number of CPUs that may be added to the processor set:

test# poolcfg -dc 'modify pset pset-test-z2 (uint pset.max=32)'

(3.7) Bash syntax to add virtual CPUs to the processor set:

test# (( i = 64 )); while (( i < 96 )); do poolcfg -dc "transfer to pset pset-test-z2 (cpu $i)"; (( i = i + 1 )) ; done

(3.8) Associate the resource pool with the processor set:

test# poolcfg -dc 'associate pool pool-test-z2 (pset pset-test-z2)'

(3.9) Tell the zone to use the resource pool that has been created:

test# zonecfg -z test-z2 set pool=pool-test-z2

(3.10) Boot the Oracle Solaris Container

test# zoneadm -z test-z2 boot

(3.11) Save the configuration to /etc/pooladm.conf

test# pooladm -s
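
As a quick sanity check, pooladm with no arguments prints the in-kernel configuration, and poolcfg can read back the processor set that was just built (standard commands; output omitted):

test# pooladm
test# poolcfg -dc 'info pset pset-test-z2'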

(4.0) Verification

(4.1) View the processors in each processor set 

test# psrset

user processor set
5: processors 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
user processor set
6: processors 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
user processor set
7: processors 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127

(4.2) Verify that the Java processes are associated with the processor sets:

test# ps -e -o vsz,rss,pid,pset,comm | grep java | sort -n

  VSZ     RSS    PID PSET COMMAND
3715416 1543344 25143   5 <JAVA_HOME>/bin/sparcv9/java
3772120 1600088 25142   - <JAVA_HOME>/bin/sparcv9/java
3780960 1608832 25144   6 <JAVA_HOME>/bin/sparcv9/java
3792648 1620560 25145   7 <JAVA_HOME>/bin/sparcv9/java

(5.0) Results. Using the resource pools improves both throughput and response time.

(6.0) Run Time Changes

(6.1) I wanted to show an example that started from scratch, which is why I stopped the Oracle Solaris Containers, configured the pools, and booted up fresh; that way there is no room for confusion. However, the steps should also work for running containers. One exception is "zonecfg -z test-z2 set pool=pool-test-z2", which will only take effect when the zone is next booted.

(6.2) I've shown poolcfg with the '-d' option which specifies that the command will work directly on the kernel state. For example, at runtime, you can move CPU core 12 (virtual processors 96-103) from test-z3 to test-z2 with the following command:

test# (( i = 96 )); while (( i < 104 )); do poolcfg -dc "transfer to pset pset-test-z2 (cpu $i)"; (( i = i + 1 )) ; done

(6.3) To specify a run-time change to a container's pool binding, use the following steps:

Identify the zone ID (first column)

test# zoneadm list -vi
  ID NAME        STATUS     PATH                      BRAND    IP
   0 global      running    /                         native   shared
  28 test-z3     running    /zones/test-z3            native   shared
  31 test-z1     running    /zones/test-z1            native   shared
  32 test-z2     running    /zones/test-z2            native   shared

Modify binding if necessary:

test# poolbind -p pool-test-z2 -i zoneid 32
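
To confirm that a process is running in the expected pool, poolbind can query its binding; for example, using one of the Java PIDs from the ps output in section (4.2):

test# poolbind -q 25144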

(7.0) Processor sets are particularly relevant to multi-socket configurations, where they reduce cross calls (xcal) and thread migrations (migr):

Single Socket Test: 1 x SPARC T4 socket, 2 x Oracle Solaris Containers, mpstat samples.
The impact of processor sets was hardly measurable (about a 1% throughput difference).

Without Processor Binding
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  525   933   24 1793  124  363  153    1  2551   50   7   0  43
 41    2   0  486  1064   24 1873  137  388  159    2  2560   51   7   0  42
 42    1   0  472   973   23 1770  124  352  153    1  2329   49   7   0  44
 43    1   0  415   912   22 1697  115  320  153    1  2175   47   7   0  47
 44    1   0  369   884   22 1665  111  300  150    1  2008   45   6   0  49
 45    2   0  494   902   23 1730  116  324  152    1  2233   46   7   0  47
 46    3   0  918  1075   26 2087  163  470  172    1  2935   55   8   0  38
 47    2   0  672   999   25 1955  143  416  162    1  2777   53   7   0  40
 48    2   0  691   976   25 1904  136  396  159    1  2748   51   7   0  42
 49    3   0  849  1081   24 1933  145  411  163    1  2670   52   7   0  40
With each container bound to 4 cores.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  347  1164   20 1810  119  311  210    1  2079   42   6   0  51
 41    1   0  406  1219   21 1898  131  344  214    1  2266   45   7   0  48
 42    1   0  412  1214   21 1902  130  342  212    1  2289   45   7   0  49
 43    2   0  410  1208   21 1905  130  343  219    1  2304   45   7   0  48
 44    1   0  411  1208   21 1906  131  343  214    1  2313   45   7   0  48
 45    1   0  433  1209   21 1917  133  344  215    1  2337   45   7   0  48
 46    2   0  500  1244   24 1989  141  368  218    1  2482   46   7   0  47
 47    1   0  377  1183   21 1871  127  331  211    1  2289   45   7   0  49
 48    1   0  358   961   23 1699   77  202  183    1  2255   41   6   0  53
 49    1   0  339  1008   21 1739   84  216  188    1  2231   41   6   0  53




Two Socket Test: 2 x SPARC T4 sockets, 4 x Oracle Solaris Containers, mpstat samples.
The impact of processor sets was substantial (~25% better throughput).

Without Processor Binding
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0 1277  1553   32 2726  317  942   70    2  2620   66  11   0  24
 41    0   0 1201  1606   30 2618  309  895   71    2  2603   67  11   0  23
 42    1   0 1104  1517   30 2519  295  846   70    2  2499   65  10   0  24
 43    1   0  997  1447   28 2443  283  807   69    2  2374   64  10   0  26
 44    1   0  959  1402   28 2402  277  776   67    2  2336   64  10   0  26
 45    1   0 1057  1466   29 2538  294  841   68    2  2400   64  10   0  26
 46    3   0 2785  1776   35 3273  384 1178   74    2  2841   68  12   0  20
 47    1   0 1508  1610   33 2949  346 1039   72    2  2764   67  11   0  22
 48    2   0 1486  1594   33 2963  346 1036   72    2  2761   67  11   0  22
 49    1   0 1308  1589   32 2741  325  952   71    2  2694   67  11   0  22
With each container bound to 4 cores.
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 40    1   0  423  1223   20 1841  157  377   60    1  2185   48   7   0  45
 41    1   0  505  1279   22 1942  168  405   65    1  2396   50   7   0  43
 42    1   0  500  1278   22 1941  170  405   65    1  2413   50   7   0  42
 43    2   0  492  1277   22 1955  171  408   64    1  2422   50   8   0  42
 44    1   0  504  1269   22 1941  167  407   64    1  2430   50   7   0  42
 45    1   0  513  1284   22 1977  173  412   64    1  2475   50   8   0  42
 46    2   0  582  1302   25 2021  177  431   67    1  2612   52   8   0  41
 47    1   0  462  1247   21 1918  168  400   62    1  2392   50   7   0  43
 48    1   0  466  1055   25 1777  120  282   56    1  2424   47   7   0  47
 49    1   0  412  1080   22 1789  122  285   56    1  2354   46   7   0  47

    (8.0) References:

    Thursday Apr 12, 2012

    What is bondib1 used for on SPARC SuperCluster with InfiniBand, Solaris 11 networking & Oracle RAC?

    A co-worker asked the following question about a SPARC SuperCluster InfiniBand network:

    > on the database nodes the RAC nodes communicate over the cluster_interconnect. This is the
    > 192.168.10.0 network on bondib0. (according to ./crs/install/crsconfig_params NETWORKS
    > setting) 
    > What is bondib1 used for? Is it a HA counterpart in case bondib0 dies?

    This is my response:

    Summary: In a SPARC SuperCluster installation, bondib0 and bondib1 are the InfiniBand links that are used for the private interconnect (usage includes global cache data blocks and heartbeat) and for communication to the Exadata storage cells. Currently, the database is idle, so bondib1 is currently only being used for outbound cluster interconnect traffic.

    Details:

    bondib0 is the cluster_interconnect

    $ oifcfg getif           
    bondeth0  10.129.184.0  global  public
    bondib0  192.168.10.0  global  cluster_interconnect
    ipmpapp0  192.168.30.0  global  public


    bondib0 and bondib1 are on 192.168.10.1 and 192.168.10.2 respectively.

    # ipadm show-addr | grep bondi
    bondib0/v4static  static   ok           192.168.10.1/24
    bondib1/v4static  static   ok           192.168.10.2/24


    This private network is also used to communicate with the Exadata Storage Cells. Notice that the network addresses of the Exadata Cell Disks are on the same subnet as the private interconnect:  

    SQL> column path format a40
    SQL> select path from v$asm_disk;

    PATH                                     

    ---------------------------------------- 
    o/192.168.10.9/DATA_SSC_CD_00_ssc9es01  
    o/192.168.10.9/DATA_SSC_CD_01_ssc9es01
    ...

    Hostnames tied to the IPs are node1-priv1 and node1-priv2 

    # grep 192.168.10 /etc/hosts
    192.168.10.1    node1-priv1.us.oracle.com   node1-priv1
    192.168.10.2    node1-priv2.us.oracle.com   node1-priv2

    For the four compute node RAC:

    • Each compute node has two IP addresses on the 192.168.10.0 private network.
    • Each IP address has an active InfiniBand link and a failover InfiniBand link.
    • Thus, the compute nodes are using a total of 8 IP addresses and 16 InfiniBand links for this private network.

    bondib1 isn't being used for the Virtual IP (VIP):

    $ srvctl config vip -n node1
    VIP exists: /node1-ib-vip/192.168.30.25/192.168.30.0/255.255.255.0/ipmpapp0, hosting node node1
    VIP exists: /node1-vip/10.55.184.15/10.55.184.0/255.255.255.0/bondeth0, hosting node node1


    bondib1 is on bondib1_0 and fails over to bondib1_1:

    # ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmpapp0    ipmpapp0    ok        --        ipmpapp_0 (ipmpapp_1)
    bondeth0    bondeth0    degraded  --        net2 [net5]
    bondib1     bondib1     ok        --        bondib1_0 (bondib1_1)
    bondib0     bondib0     ok        --        bondib0_0 (bondib0_1)


    bondib1_0 goes over net24

    # dladm show-link | grep bond
    LINK                CLASS     MTU    STATE    OVER
    bondib0_0           part      65520  up       net21
    bondib0_1           part      65520  up       net22
    bondib1_0           part      65520  up       net24
    bondib1_1           part      65520  up       net23


    net24 is IB Partition FFFF

    # dladm show-ib
    LINK         HCAGUID         PORTGUID        PORT STATE  PKEYS
    net24        21280001A1868A  21280001A1868C  2    up     FFFF
    net22        21280001CEBBDE  21280001CEBBE0  2    up     FFFF,8503
    net23        21280001A1868A  21280001A1868B  1    up     FFFF,8503
    net21        21280001CEBBDE  21280001CEBBDF  1    up     FFFF


    On Express Module 9 port 2:

    # dladm show-phys -L
    LINK              DEVICE       LOC
    net21             ibp4         PCI-EM1/PORT1
    net22             ibp5         PCI-EM1/PORT2
    net23             ibp6         PCI-EM9/PORT1
    net24             ibp7         PCI-EM9/PORT2


    Outbound traffic on the 192.168.10.0 network will be multiplexed between bondib0 & bondib1.

    # netstat -rn

    Routing Table: IPv4
      Destination           Gateway           Flags  Ref     Use     Interface
    -------------------- -------------------- ----- ----- ---------- ---------
    192.168.10.0         192.168.10.2         U        16    6551834 bondib1  
    192.168.10.0         192.168.10.1         U         9    5708924 bondib0  


    The database is currently idle, so there is no traffic to the Exadata Storage Cells at this moment, nor is there any traffic being induced by the global cache. Thus, only the heartbeat is active. There is more traffic on bondib0 than on bondib1:

    # /bin/time snoop -I bondib0 -c 100 > /dev/null
    Using device ipnet/bondib0 (promiscuous mode)
    100 packets captured

    real        4.3
    user        0.0
    sys         0.0


    (100 packets in 4.3 seconds = 23.3 pkts/sec)

    # /bin/time snoop -I bondib1 -c 100 > /dev/null
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured

    real       13.3
    user        0.0
    sys         0.0


    (100 packets in 13.3 seconds = 7.5 pkts/sec)

    Half of the packets on bondib0 are outbound (from self). The remaining packets come from the other nodes in the cluster.

    # snoop -I bondib0 -c 100 | awk '{print $1}' | sort | uniq -c
    Using device ipnet/bondib0 (promiscuous mode)
    100 packets captured
      49 node1-priv1.us.oracle.com
      24 node2-priv1.us.oracle.com
      14 node3-priv1.us.oracle.com
      13 node4-priv1.us.oracle.com

    100% of the packets on bondib1 are outbound (from self), but the headers in the packets indicate that they are from the IP address associated with bondib0:

    # snoop -I bondib1 -c 100 | awk '{print $1}' | sort | uniq -c
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured
     100 node1-priv1.us.oracle.com

    The destinations of the bondib1 outbound packets are split evenly between node3 and node4.

    # snoop -I bondib1 -c 100 | awk '{print $3}' | sort | uniq -c
    Using device ipnet/bondib1 (promiscuous mode)
    100 packets captured
      51 node3-priv1.us.oracle.com
      49 node4-priv1.us.oracle.com

    Conclusion: In a SPARC SuperCluster installation, bondib0 and bondib1 are the InfiniBand links that are used for the private interconnect (usage includes global cache data blocks and heartbeat) and for communication to the Exadata storage cells. Currently, the database is idle, so bondib1 is currently only being used for outbound cluster interconnect traffic.

    Thursday Mar 29, 2012

    Watch out for a trailing slash on $ORACLE_HOME

    With a trailing slash, sqlplus connects to an idle instance; without it, sqlplus connects to the running instance:
    oracle$ export ORACLE_HOME=/u01/app/11.2.0.3/grid/
    oracle$ ORACLE_SID=+ASM1
    oracle$ sqlplus / as sysasm

    SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 29 13:04:01 2012

    Copyright (c) 1982, 2011, Oracle.  All rights reserved.

    Connected to an idle instance.

    SQL>


    oracle$ export ORACLE_HOME=/u01/app/11.2.0.3/grid
    oracle$ ORACLE_SID=+ASM1
    oracle$ sqlplus / as sysasm

    SQL*Plus: Release 11.2.0.3.0 Production on Thu Mar 29 13:04:44 2012

    Copyright (c) 1982, 2011, Oracle.  All rights reserved.


    Connected to:
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    With the Real Application Clusters and Automatic Storage Management options

    SQL>
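
    A simple guard is to strip any trailing slash before using the variable; this is plain shell parameter expansion, nothing Oracle-specific:

    oracle$ export ORACLE_HOME=${ORACLE_HOME%/}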

    User "oracle" unable to start or stop listeners

    Recently ran into a problem where user "oracle" was unable to start or stop listeners:

    oracle$ srvctl stop listener
    PRCR-1065 : Failed to stop resource ora.LISTENER.lsnr
    CRS-0245:  User doesn't have enough privilege to perform the operation
    CRS-0245:  User doesn't have enough privilege to perform the operation
    PRCR-1065 : Failed to stop resource ora.LISTENER_IB.lsnr
    CRS-0245:  User doesn't have enough privilege to perform the operation
    CRS-0245:  User doesn't have enough privilege to perform the operation

    The system is currently "fixed":

    oracle$ srvctl start listener

    oracle$ srvctl status listener
    Listener LISTENER is enabled
    Listener LISTENER is running on node(s): etc9cn02,etc9cn01
    Listener LISTENER_IB is enabled
    Listener LISTENER_IB is running on node(s): etc9cn02,etc9cn01

    oracle$ srvctl stop listener

    oracle$ srvctl status listener
    Listener LISTENER is enabled
    Listener LISTENER is not running
    Listener LISTENER_IB is enabled
    Listener LISTENER_IB is not running

    oracle$ srvctl start listener


    How it was "fixed":

    Before:

    # crsctl status resource ora.LISTENER.lsnr -p | grep ACL=
    ACL=owner:root:rwx,pgrp:root:r-x,other::r--

    # crsctl status resource ora.LISTENER_IB.lsnr -p | grep ACL=
    ACL=owner:root:rwx,pgrp:root:r-x,other::r--


    "Fix":

    # crsctl setperm resource ora.LISTENER.lsnr -o oracle
    # crsctl setperm resource ora.LISTENER.lsnr -g oinstall
    # crsctl setperm resource ora.LISTENER_IB.lsnr -g oinstall
    # crsctl setperm resource ora.LISTENER_IB.lsnr -o oracle


    After:

    # crsctl status resource ora.LISTENER.lsnr -p | grep ACL=
    ACL=owner:oracle:rwx,pgrp:oinstall:r-x,other::r--

    # crsctl status resource ora.LISTENER_IB.lsnr -p | grep ACL=
    ACL=owner:oracle:rwx,pgrp:oinstall:r-x,other::r--


    I may never know how the system got into this state.

    Tuesday Sep 20, 2011

    NFS root access for Oracle RAC on Sun ZFS Storage 7x20 Appliance

    When installing Oracle Real Application Clusters (Oracle RAC) 11g Release 2 for the Solaris Operating System, it is necessary to first install Oracle Grid Infrastructure, and it is also necessary to configure shared storage. I install the Grid Infrastructure and then the Database largely following the on-line instructions.

    I ran into an interesting problem when installing Oracle RAC in a system that included SPARC Enterprise M5000 Servers and a Sun ZFS Storage 7x20 Appliance, illustrated in the following diagram:

    (Click to Expand)

    Network Diagram

    When configuring the shared storage for Oracle RAC, you may decide to use NFS for Data Files. In this case, you must set up the NFS mounts on the storage appliance to allow root access from all of the RAC clients. This allows files created from the RAC nodes to be owned by root on the mounted NFS filesystems, rather than by an anonymous user, which is the default behavior.

    In a default configuration, a Solaris NFS server maps "root" access to "nobody". This can be overridden as stated on the share_nfs(1M) man page:

    Only root users from the hosts specified in access_list will have root access... By default, no host has root access, so root users are mapped to an anonymous user ID...

    Example: The following will give root read-write permissions to hostb:

     share -F nfs -o ro=hosta,rw=hostb,root=hostb /var 

    The Sun ZFS Storage 7x20 Appliance features a browser user interface (BUI), "a graphical tool for administration of the appliance. The BUI provides an intuitive environment for administration tasks, visualizing concepts, and analyzing performance data." The following clicks can be used to allow root on the clients mounting to the storage to be considered root:

    (Click to Expand)

    S7000 Allow Root Screenshot

    1. Go to the "Shares" page

      • (not shown) Select the "pencil" to edit the share that will be used for Oracle RAC shared storage.
    2. Go to the Protocols page.
    3. For the NFS Protocol, un-check "inherit from project".
    4. Click the "+" to add an NFS exception.
    5. Enter the hostname of the RAC node.
    6. Allow read/write access.
    7. Check the "Root Access" box.
    8. Click "Apply" to save the changes.

    Repeat steps 3-8 for each RAC node. Repeat steps 1-8 for every share that will be used for RAC shared storage.

    More intuitive readers, after reviewing the network diagram and the screenshot of the S7420 NFS exceptions screen, may immediately observe that it was a mistake to enter the hostnames of the RAC nodes associated with the gigabit WAN network. In hindsight, this was an obvious mistake, but at the time I was entering the data, I simply entered the names of the machines, which did not strike me as a "trick question".

    The next step is to configure the RAC nodes as NFS clients: after the shares have been set up on the Sun ZFS Storage Appliance, mount the shares onto the RAC nodes by updating the /etc/vfstab file on each node with an entry similar to the following:

    nfs_server:/vol/DATA/oradata  -  /u02/oradata  nfs  -  yes  rw,bg,hard,nointr,rsize=32768,wsize=32768,proto=tcp,noac,forcedirectio,vers=3,suid
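
    With the vfstab entry in place, the share can be mounted and the negotiated NFS options inspected on each node (standard commands; the mount point matches the entry above):

    # mkdir -p /u02/oradata
    # mount /u02/oradata
    # nfsstat -m /u02/oradata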
    

    Here's a tip of the hat to Pythian's "Installing 11gR2 Grid Infrastructure in 5 Easy Lessons":

    Lesson #3: Grid is very picky and somewhat uninformative about its NFS support

    Like an annoying girlfriend, the installer seems to say “Why should I tell you what’s the problem? If you really loved me, you’d know what you did wrong!”

    You need to trace the installer to find out what exactly it doesn’t like about your configuration.

    Running the installer normally, the error message is:

    [FATAL] [INS-41321] Invalid Oracle Cluster Registry (OCR) location.

    CAUSE: The installer detects that the storage type of the location (/cmsstgdb/crs/ocr/ocr1) is not supported for Oracle Cluster Registry.

    ACTION: Provide a supported storage location for the Oracle Cluster Registry.

    OK, so Oracle says the storage is not supported, but I know that ... NFS is supported just fine. This means I used the wrong parameters for the NFS mounts. But when I check my vfstab and /etc/mount, everything looks A-OK. Can Oracle tell me what exactly bothers it?

    It can, if you run the silent install with the following flags added to the command line:

    -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2

    If you get past this stage, it is clear sailing up until you run "root.sh" near the end of the Grid Installation, which is the stage that will fail if the root user's files are mapped to anonymous.

    So now, I will finally get to the piece of the puzzle that I found to be perplexing. Remember that in my configuration (see diagram, above) each RAC node has two potential paths to the Sun ZFS storage appliance, one path via the router that is connected to the corporate WAN, and one path via the private 10 gigabit storage network. When I accessed the NAS storage via the storage network, root was always mapped to nobody, despite my best efforts. While trying to debug, I discovered that when I accessed the NAS storage via the corporate WAN network, root was mapped to root:

    # ping -s s7420-10g0 1 1
    PING s7420-10g0: 1 data bytes
    9 bytes from s7420-10g0 (192.168.42.15): icmp_seq=0.
    
    # ping -s s7420-wan 1 1
    PING s7420-wan: 1 data bytes
    9 bytes from s7420-wan (10.1.1.15): icmp_seq=0.
    
    # nfsstat -m
    /S7420/OraData_WAN from s7420-wan:/export/OraData
     Flags:         vers=3,proto=tcp,sec=sys,hard,nointr,noac,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
     Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
    
    /S7420/OraData_10gbe from s7420-10g0:/export/OraData
     Flags:         vers=3,proto=tcp,sec=sys,hard,nointr,noac,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
     Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60
    
    # touch /S7420/OraData_10gbe/foo1
    # touch /S7420/OraData_WAN/foo2
    # touch /net/s7420-10g0/export/OraData/foo3
    # touch /net/s7420-wan/export/OraData/foo4
    # touch /net/192.168.42.15/export/OraData/foo5
    # touch /net/10.1.1.15/export/OraData/foo6
    
    # ls -l /S7420/OraData_10gbe/foo*
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:54 /S7420/OraData_10gbe/foo1
    -rw-r--r--   1 root     root           0 Sep 20 12:54 /S7420/OraData_10gbe/foo2
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:55 /S7420/OraData_10gbe/foo3
    -rw-r--r--   1 root     root           0 Sep 20 12:56 /S7420/OraData_10gbe/foo4
    -rw-r--r--   1 nobody   nobody         0 Sep 20 12:58 /S7420/OraData_10gbe/foo5
    -rw-r--r--   1 root     root           0 Sep 20 13:04 /S7420/OraData_10gbe/foo6

    Having discovered that when I accessed the NAS storage via the storage network, root was always mapped to nobody, but that when I accessed the NAS storage via the corporate WAN network, root was mapped to root, I investigated the mounts on the S7420:

    # ssh osm04 -l root
    Password: 
    Last login: Thu Sep 15 22:46:46 2011 from 192.168.42.11
    s7420:> shell
    Executing shell commands may invalidate your service contract. Continue? (Y/N) 
    Executing raw shell; "exit" to return to appliance shell ...
    +-----------------------------------------------------------------------------+
    |  You are entering the operating system shell.  By confirming this action in |
    |  the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT  |
    ...
    +-----------------------------------------------------------------------------+
    s7420# showmount -a | grep OraData
    192.168.42.11:/export/OraData
    rac1.bigcorp.com:/export/OraData
    

    When I saw the "showmount" output, the lightbulb in my brain turned on and I understood the problem: I had entered the node names associated with the WAN, rather than node names associated with the private storage network. When NFS packets were arriving from the corporate WAN, the S7420 was using DNS to resolve WAN IP addresses into the WAN hostnames, which matched the hostnames that I had entered into the S7420 NFS Exception form. In contrast, when NFS packets were arriving from the 10 gigabit private storage network, the system was not able to resolve the IP address into a hostname because the private storage network data did not exist in DNS. Even if the name resolution had been successful, it would have been necessary to enter the node names associated with the private storage network into the S7420 NFS Exceptions form.

    Several solutions spring to mind: (1) On a typical Solaris NFS server, I would have enabled name resolution of the 10 gigabit private storage network addresses by adding entries to /etc/hosts, and used those node names for the NFS root access. This was not possible because on the appliance, /etc is mounted as read-only. (2) It occurred to me to enter the IP addresses into the S7420 NFS exceptions form, but the BUI would only accept hostnames. (3) One potential solution is to put the private 10 gigabit IP addresses into the corporate DNS server. (4) Instead, I chose to give root read-write permissions to all clients on the 10 gigabit private storage network:

    (Click to Expand)

    S7000_allow_network_thumb.jpg

    Now, the RAC installation will be able to complete successfully with RAC nodes accessing the Sun ZFS Storage 7x20 Appliance via the private 10 gigabit storage network.

    Tuesday Aug 23, 2011

    Flash Archive with MPXIO

    It was necessary to exchange one SPARC Enterprise M4000 Server running Solaris 10 Update 9 with a replacement server. I thought, "no problem."

    Ran "flar create -n ben06 -S /S7420/ben06-`date '+%m-%d-%y'`.flar" on the original server to create a flash archive on NAS storage.

    Upon restoring the flar onto the replacement SPARC Enterprise M4000 Server, problems appeared:

    Rebooting with command: boot
    Boot device: disk  File and args:
    SunOS Release 5.10 Version Generic_144488-17 64-bit
    Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
    Hostname: ben06
    The / file system (/dev/rdsk/c0t0d0s0) is being checked.

    WARNING - Unable to repair the / filesystem. Run fsck

    manually (fsck -F ufs /dev/rdsk/c0t0d0s0).

    Aug 22 10:25:31 svc.startd[7]: svc:/system/filesystem/usr:default: Method "/lib/svc/method/fs-usr" failed with exit status 95.

    Aug 22 10:25:31 svc.startd[7]: system/filesystem/usr:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
    Requesting System Maintenance Mode
    (See /lib/svc/share/README for more information.)

     It got fairly ugly from there: 

    • When I tried to run the fsck command, above, it reported "No such device"
    • Although "mount" reported that / was mounted read/write, other commands, such as "vi", reported that everything was read-only
    • "format" reported that I didn't have any devices at all!!

    Eventually I realized that there seems to be a bug with Flash Archives + MPXIO.

    After I installed with the flar, I rebooted into single user mode from alternate media (bootp or cdrom), mounted the internal drive, and modified some files, changing "no" to "yes":

    /kernel/drv/fp.conf:mpxio-disable="yes";
    /kernel/drv/iscsi.conf:mpxio-disable="yes";
    /kernel/drv/mpt.conf:mpxio-disable="yes";
    /kernel/drv/mpt.conf:disable-sata-mpxio="yes";
    /kernel/drv/mpt_sas.conf:mpxio-disable="yes";
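
    The edits can also be scripted from the alternate-media boot; here is a minimal sketch, assuming the internal drive is mounted at /a (the files are the ones listed above):

    # cd /a/kernel/drv
    # for f in fp.conf iscsi.conf mpt.conf mpt_sas.conf; do
    >   sed -e 's/mpxio-disable="no"/mpxio-disable="yes"/' \
    >       -e 's/disable-sata-mpxio="no"/disable-sata-mpxio="yes"/' $f > /tmp/$f
    >   cp /tmp/$f $f
    > done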

    Then, after a "reboot -- -r", everything was fine.

    Thursday Jun 02, 2011

    Solaris installation on a SPARC T3 from a remote CDROM ISO

    I wanted to install Solaris 11 Express on a SPARC T3-1 Server that was located in California. I was working in Massachusetts and had the ISO image of the Solaris 11 Express media on my laptop. Reading the SPARC T3-2 Server Documentation and the Integrated Lights Out Manager (ILOM) 3.0 Documentation, it seemed that it would be possible to remotely mount my ISO image and use it for the installation, but I found that the documentation was rather spotty. Specifically, I didn't stumble across any documentation that led me to use the "rcdrom" device in the OpenBoot Prom (OBP, aka "the OK prompt"). Here is the recipe:

    1. The SPARC T3-1 server was powered up, pre-installed with Solaris 10, and properly wired:
      • Ethernet for Solaris was available via the NET0 port.
      • Ethernet for the Service Processor was available via the SP NET MGT Ethernet port
      • The SP SER MGT port was connected to a terminal concentrator
    2. The Service Processor had a static IP Address, so it could be accessed with either SSH or HTTPS
    3. The ILOM remote CD-ROM image functionality isn't supported on Mac OS X, so I used Windows XP running in VirtualBox to access the ILOM via HTTPS.
    4. In the ILOM HTTPS interface, I clicked on "Redirection", "Use serial redirection", "Launch remote Console"
    5. In the ILOM Remote Console window, there is a pulldown for "Devices", "CD-ROM image..."
      • Selected the ISO image on my laptop 
    6. In Solaris 10, I was able to see that the remote ISO was mounted ("df -h")
    7. Shut down Solaris 10, taking me back to the OK prompt
    8. I was able to see the "rcdrom" device:
      •   ok devalias
    9. And boot:
      •   ok boot rcdrom
    10. Confirmed success. The T3 booted off of my ISO (slowly) and eventually prompted for the installation language (English, Spanish, etc.).

    Wednesday Jan 12, 2011

    11gR2 Grid Infrastructure Patch Set for Solaris x86-64

    "Oracle Recommended Patches" (Doc ID 756671.1) recommends 9655006 - 11.2.0.1.2 Grid Infrastructure Patch Set Update (GI PSU2) for 11.2.0.1 installations, but I couldn't find the patch for Solaris x86-64.  In fact, How to Upgrade to Oracle Grid Infrastructure 11g Release 2 states, "GI PSU2 is unavailable for Solaris x86-64 systems."

    How disappointing, right?  Not really.  Instead of going to Grid 11.2.0.1.2, go to Grid 11.2.0.2 with patch 10098816.

    Sounds easy enough, except that How to Upgrade to Oracle Grid Infrastructure 11g Release 2 also states:

  • To upgrade existing 11.2.0.1 Oracle Grid Infrastructure installations to Oracle Grid Infrastructure 11.2.0.2, you must first do at least one of the following:

    • Patch the release 11.2.0.1 Oracle Grid Infrastructure home with the 9413827 and 9706490 patches.

    • Install Oracle Grid Infrastructure Patch Set 1 (GI PSU1) or Oracle Grid Infrastructure Patch Set 2 (GI PSU2).


  • Since GI PSU2 is not available, and I had trouble finding the patch number of GI PSU1, I went with p6880880 (OPatch), followed by 9413827, followed by 10098816.  (FYI, once 9413827 is installed, it is not necessary to install 9706490 because patch 9706490 is a subset of 9413827.)

    Ta-da!! (According to yourdictionary.com, sound of a fanfare: an exclamation of triumph or pride accompanying an announcement, a bow, etc)

    Run level verification check failed when installing 11.2.0.2 Grid Upgrade (p10098816)

    When I tried to upgrade my 11.2.0.1 GRID/ASM installation which was running inside a Solaris 10 Update 9 container using the 11.2.0.2 Grid Upgrade (p10098816), the run level verification check failed:

    INFO: Run Level: This is a prerequisite condition to test whether the system is running with proper run level. INFO: Severity:CRITICAL

    The system was at Run level 3, as required, but the unprivileged user couldn't see it:

    $ who -r
    who: Cannot open /etc/inittab: Permission denied

    Strange: inittab's permissions were "rw-r-----"

    I fixed the problem with:

    # chmod 644 /etc/inittab

    ... and then the verification succeeded.
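
    A quick way to confirm the fix is to re-run the check as the unprivileged user (commands only; output omitted):

    $ ls -l /etc/inittab
    $ who -r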

    Wednesday Dec 15, 2010

    Useful GNU grep option

    I liked "How to get the serial number of a Sun machine through iLOM" at http://blog.philipp-michels.de/?p=102:

    $ ipmitool -v -H {IP} -U root -P {PW} fru | grep /SYS -A 5

    FRU Device Description : /SYS (ID 20)
     Product Manufacturer  : SUN MICROSYSTEMS
     Product Name          : SUN FIRE X4200 M2
     Product Part Number   : 602-3891-01
     Product Serial        : 0116SL46F2
     Product Extra         : 080020FFFFFFFFFFFFFF00144F7DB00A

    The problem is that the "grep -A" option is not supported by the Solaris default grep.

    The solution is to use GNU grep (ggrep at /usr/sfw/bin/ggrep) which is packaged on the Solaris Companion CD (also see http://blogs.sun.com/taylor22/entry/installing_the_solaris_10_os)

    While I was searching the web for grep -A, I found many more interesting usage examples, such as finding stack traces for exceptions in Java log files and finding the text surrounding any ORA- messages in the Oracle alert.log.
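
    For example (hypothetical file names; -B and -A print lines of context before and after each match):

    $ /usr/sfw/bin/ggrep -B2 -A10 Exception server.log
    $ /usr/sfw/bin/ggrep -B5 -A5 ORA- alert_ORCL.log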

    Update: Solaris 11 installation and usage:

    # pkg install gnu-grep

    # which ggrep
    /usr/bin/ggrep

    # ls -l /usr/bin/ggrep
    lrwxrwxrwx   1 root     root          15 Jun  2 13:08 /usr/bin/ggrep -> ../gnu/bin/grep

    For example, some message was repeated:

    # grep  "repeated 2 times"  /var/adm/messages.*
    /var/adm/messages.3:Feb 11 16:43:56 p3231-05 last message repeated 2 times


    Which one? Can't use Solaris grep to find the line before the pattern.

    # grep -B1 "repeated 2 times"  /var/adm/messages.*
    grep: illegal option -- B
    grep: illegal option -- 1
    Usage: grep [-c|-l|-q] -bhinsvw pattern file . . .


    But ggrep does the trick.

    # ggrep -B1 "repeated 2 times"  /var/adm/messages.*
    /var/adm/messages.3-Feb 11 16:43:54 p3231-05 dlmgmtd[49]: [ID 183745 daemon.warning] Duplicate links in the repository: net0
    /var/adm/messages.3:Feb 11 16:43:56 p3231-05 last message repeated 2 times


    Monday Jul 26, 2010

    Adding a hard drive for /export/home under ZFS

    Summary: I added a new hard drive to my "OpenSolaris 2009.06 snv_111b" system and attached it to /export/home.

    The first step was easy: my Sun Ultra 24 had a free drive bay, so I stuck the new drive into the running system, ran "devfsadm -C -v", and then from "format" I could see that the disk was available to OpenSolaris as "c9d0".

    Next, a decision needed to be made with respect to how to utilize the drive. I considered using several approaches, such as adding the disk to the existing rpool.  After considering my potential future projects, I decided to attach the disk as a new zpool that would only contain /export/home.  Isolating /export/home should yield some flexibility when I play nasty games with the root drive.

    The process was slightly complicated by the fact that logging into OpenSolaris as root is disabled by default. To allow root login, I removed "type=role;" from the root line in /etc/user_attr.  Then, the following commands were executed while logged in as root.

    Before:

    # zfs list
    NAME                         USED  AVAIL  REFER  MOUNTPOINT
    new                           79K   685G    19K  /new
    rpool                       26.2G   202G  82.5K  /rpool
    rpool/ROOT                  16.4G   202G    19K  legacy
    rpool/ROOT/opensolaris       262M   202G  5.25G  /
    rpool/ROOT/opensolaris-1    25.0M   202G  5.65G  /tmp/tmpOM0whB
    rpool/ROOT/opensolaris-2    33.4M   202G  6.59G  /
    rpool/ROOT/opensolaris-3     171M   202G  6.59G  /
    rpool/ROOT/opensolaris-4    17.4M   202G  6.72G  /
    rpool/ROOT/opensolaris-5     106M   202G  6.72G  /
    rpool/ROOT/opensolaris-6    15.8G   202G  14.6G  /
    rpool/dump                  3.97G   202G  3.97G  -
    rpool/export                1.82G   202G    21K  /export
    rpool/export/home           1.82G   202G    22K  /export/home
    rpool/export/home/jeff      1.82G   202G  1.82G  /export/home/jeff
    rpool/swap                  3.97G   206G   101M  -
    

    Commands:

    
    # 1) Create a zpool on the new disk
    
         zpool create new c9d0
    
    # 2) Create a snapshot of the existing data
    
         zfs snapshot -r rpool/export@July26 
    
    # 3) Copy the data to the new hard drive
    
         zfs send rpool/export@July26 | zfs receive new/export
         zfs send rpool/export/home@July26 | zfs receive new/export/home
         zfs send rpool/export/home/jeff@July26 | zfs receive new/export/home/jeff
    
    # 4) Rename the old dataset
    
         zfs rename rpool/export rpool/export_old
    
    # 5) Push the old dataset to a new mount point for archival 
    
         zfs unmount rpool/export_old/home/jeff
         zfs unmount rpool/export_old/home
         zfs set mountpoint=/export_old rpool/export_old
         zfs set mountpoint=/export_old/home rpool/export_old/home
         zfs set mountpoint=/export_old/home/jeff rpool/export_old/home/jeff
         zfs mount rpool/export_old/home
         zfs mount rpool/export_old/home/jeff
    
    # 6) Put the new hard drive on /export
    
         zfs set mountpoint=/export new/export
    

    When I was satisfied that I could log in as "jeff" and see the data on the new disk, I cleaned up the old disk:

    # 7) cleaned up the old disk
    
         zfs destroy rpool/export_old/home/jeff@July26
         zfs destroy rpool/export_old/home/jeff
         zfs destroy rpool/export_old/home@July26
         zfs destroy rpool/export_old/home
         zfs destroy rpool/export_old@July26
         zfs destroy rpool/export_old
    

    After:

    # zpool status
      pool: new
     state: ONLINE
     scrub: none requested
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	new         ONLINE       0     0     0
    	  c9d0      ONLINE       0     0     0
    
    errors: No known data errors
    
      pool: rpool
     state: ONLINE
     scrub: none requested
    config:
    
    	NAME        STATE     READ WRITE CKSUM
    	rpool       ONLINE       0     0     0
    	  c8d0s0    ONLINE       0     0     0
    
    errors: No known data errors
    
    # zfs list
    NAME                       USED  AVAIL  REFER  MOUNTPOINT
    new                       1.86G   683G    19K  /new
    new/export                1.86G   683G    21K  /export
    new/export/home           1.86G   683G    22K  /export/home
    new/export/home/jeff      1.86G   683G  1.82G  /export/home/jeff
    rpool                     24.4G   204G  82.5K  /rpool
    rpool/ROOT                16.4G   204G    19K  legacy
    rpool/ROOT/opensolaris     262M   204G  5.25G  /
    rpool/ROOT/opensolaris-1  25.0M   204G  5.65G  /tmp/tmpOM0whB
    rpool/ROOT/opensolaris-2  33.4M   204G  6.59G  /
    rpool/ROOT/opensolaris-3   171M   204G  6.59G  /
    rpool/ROOT/opensolaris-4  17.4M   204G  6.72G  /
    rpool/ROOT/opensolaris-5   106M   204G  6.72G  /
    rpool/ROOT/opensolaris-6  15.8G   204G  14.7G  /
    rpool/dump                3.97G   204G  3.97G  -
    rpool/swap                3.97G   208G   101M  -
    

    Mission accomplished without needing to reboot the system!

    Friday Apr 02, 2010

    Java Monitoring and Tuning Example

    Introduction:

    I've recently been working with several ISVs who were concerned about application server performance issues. Various application servers have been investigated, including WebLogic, WebSphere, Tomcat and JBoss; however, some of the tools and techniques necessary to monitor and resolve the application server performance were the same in every case. Since Java is the underlying technology, one of the first issues to investigate is Java garbage collection. VisualGC is a good tool for understanding the garbage collection patterns.

    This blog entry is a case study of a tuning exercise, shown in three stages. In the first stage, the system is running well under a medium load. In the second stage, the load has been increased, causing response times to spike. In the third stage, the Java parameters have been modified and the system is able to run well under a heavier load.

    This blog entry is not intended to be an exhaustive and free-standing description of Java Garbage Collection; rather, it is intended to supplement other descriptions such as Tuning Garbage Collection with the 5.0 Java Virtual Machine and Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning.

    Environment and tools:

    In this case study, a web container is deployed into its own Solaris Container, and it is the only busy process in that container.  This Solaris Container is using a pool of two quad-core SPARC64 VII CPUs, with 2 virtual CPUs per core, yielding a total of 16 virtual CPUs.

    Screen shots show the output from tools including VisualGC, prstat, jstat, and xcpustate; example invocations are sketched just after this list:

    • VisualGC is a tool that is available for download with the jvmstat 3.0 package and displays a Java Virtual Machine's memory usage.
    • xcpustate is a tool that is available on the Solaris Companion CD and shows the status of each virtual CPU: blue indicates idle, green is busy in application code and yellow is busy in Solaris kernel code. 
    • jstat is a text based tool that is included with the JDK and is used to display Java garbage collection statistics.  Not to be confused with jvmstat.
    • prstat is a text based tool that is included with Solaris and shows resource usage per process.
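
    For reference, here is roughly how these tools might be invoked against the web container's JVM. This is a sketch rather than the exact commands used in the study; 15515 is the process id that appears in the jstat output further down, and the refresh intervals are arbitrary. jps lists candidate JVM process ids, visualgc attaches to one of them, jstat prints garbage collection statistics on a fixed interval, prstat tracks per-process resource usage, and xcpustate displays per-CPU activity.

    # jps -l
    # visualgc 15515 500
    # jstat -gc 15515 10000
    # prstat 5
    # xcpustate &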

    Java Options:

    The following Java options were in place as the project began and were left intact:

    -server -XX:PermSize=512m -XX:MaxPermSize=720m -Xss256k -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+UseTLAB -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
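
    A convenient way to confirm which flags a running JVM actually picked up is jinfo, which also ships with the JDK. A sketch, using the same process id as in the examples above:

    # jinfo -flags 15515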

    The following options were modified during the course of the project:

    Initial Java Options       Modified Java Options
    -Xms1024m                  -Xms4096m
    -Xmx2048m                  -Xmx6144m
    -XX:NewSize=128m           -XX:NewSize=1536m
    -XX:MaxNewSize=256m        -XX:MaxNewSize=1536m
    N/A                        -XX:SurvivorRatio=4
    N/A                        -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
                               -XX:+PrintTenuringDistribution -Xloggc:/web_dir/log/gc.log

    Stage 1: "if it ain't broke, don't fix it"

    In the first stage, the system is running well under a medium load.

    Here is a screen shot of VisualGC running a scenario where the end-user response times are good. The window on the left shows an instantaneous view of the Java memory statistics.  The window on the right shows strip charts of the statistics over time, where new data is published on the right side and slides to the left over time.  Eventually the historic data falls off the left side of the strip chart. The right window has the following strip charts, from top to bottom:

    • Compile Time strip chart shows that the Java run-time compiler has been kicking in, but this is a fairly light load, which was subsequently seen to disappear as all of the classes in use were fully compiled and optimized.
    • Class Loader strip chart shows that the class loader has recently been idle.
    • GC Time strip chart shows that Java garbage collection has been active, but the narrow spikes indicate short bursts of garbage collection. The narrow spikes are a good omen. If wide peaks are seen, it could be indicative of a problem.
    • Eden Space strip chart shows that data quickly fills the 128 MB Eden space and is then collected. When the frequency of the garbage collections approaches 1 collection per second or greater, aliasing can be a problem and you should use jstat to ascertain the actual frequency (a rough way to do that is sketched after the screen shot below).
    • Survivor space strip chart shows that the survivor space is not being used.  Combined with a view of the Old Gen activity, I deduce that the tenuring process is not working and that data that survives one collection is promoted directly into the Old Generation.  I don't like this configuration, but the user response times are good, so I wouldn't change it, yet.  As the load increases, this will need to be resolved.
    • Old Gen strip chart shows that the memory is being filled as data is promoted but quickly cleaned up by the CMS collector. 
    • Perm Gen strip chart shows a steady state of used space and some available space, which is good. 

    033_visualgc.jpg
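
    Because of that aliasing, a more reliable way to measure the young generation collection rate is to difference the YGC column between two jstat samples. A rough sketch, using the process id from the jstat output below; YGC is the 11th column of "jstat -gc" output:

    # jstat -gc 15515 60000 2 | awk 'NR==2 {y0=$11} NR==3 {printf("%.2f young GCs per second\n", ($11-y0)/60)}'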

    I believe in the idiom "if it ain't broke, don't fix it", so although I'm not in love with the above screen shot, I wouldn't change anything unless there is a tangible performance problem.  With this load, the current settings are working.

    That being said, I don't like that Eden is so small and that the Survivor space is not being used.  Objects that survive an initial young generation GC are being promoted directly to the Old Generation, bypassing the tenuring process.

    The following screen shot was also taken while running the medium load scenario where the end-user response times are good. On the left, prstat shows that there is one busy Java process which has been using 7.5% of this Solaris Container's CPU resource. The xcpustate window shows that all of the CPUs are lightly loaded by the Java process.

     033_prstat.jpg

    The following jstat data was gathered while running the medium load scenario where the end-user response times are good. The following columns are displayed:

    • The first ten columns describe memory usage:
      • The first four columns, S0C, S1C, S0U, and S1U are about survivor space. Capacity (C) and Utilization (U) for spaces 0 and 1. The capacity is 64 KB but neither space is utilized. 
      • The next two columns are about Eden, showing Capacity (EC) and Utilization (EU).  The capacity is constant at approximately 128MB (130944KB). How much data is in Eden at the instantaneous time that the trace is taken depends on how much data has filled in since the most recent young generation collection and varies wildly.
      • The next two columns are about the Old Generation, showing Capacity (OC) and Utilization (OU).  The capacity is constant at 1.2 GB. How much data is in the Old Generation at the instantaneous time that the trace is taken depends on how much data has been promoted since the most recent Old Generation collection, and it varies slowly.
      • Then there are two columns about the Permanent Generation, showing Capacity (PC) and Utilization (PU). The capacity is constant at 720 MB (737280 KB) and the utilization is fairly flat near 440 MB (450560 KB).
    • The final 5 columns are about the number of collections and time spent in garbage collection:
      • YGC is the count of the total number of Young Generation Collections since the Java process was launched, and YGCT is the total time, in seconds, spent in those collections.
      • FGC and FGCT are the corresponding count and cumulative time for Full (Old Generation) collections.
      • GCT is the total time spent in all garbage collection.

    #jstat -gc 15515 10000
      S0C    S1C   S0U    S1U      EC       EU        OC         OU        PC      PU        YGC     YGCT  FGC    FGCT     GCT
     64.0   64.0   0.0    0.0   130944.0 128571.1 1226784.0   657397.4  737280.0 450352.0    497   15.956  158    53.966   69.922
     64.0   64.0   0.0    0.0   130944.0 32841.4  1226784.0   624092.3  737280.0 450367.8    503   16.154  160    54.918   71.072
     64.0   64.0   0.0    0.0   130944.0   0.0    1226784.0   631067.9  737280.0 450251.8    506   16.245  162    54.963   71.208
     64.0   64.0   0.0    0.0   130944.0 76982.1  1226784.0   603735.6  737280.0 450255.0    511   16.383  163    55.725   72.107
     64.0   64.0   0.0    0.0   130944.0 118467.4 1226784.0   606041.5  737280.0 450265.4    515   16.538  165    56.504   73.042
     64.0   64.0   0.0    0.0   130944.0 108089.0 1226784.0   644641.9  737280.0 450273.4    520   16.795  165    56.504   73.298
     64.0   64.0   0.0    0.0   130944.0 65716.7  1226784.0   627548.7  737280.0 449934.1    523   16.901  167    57.483   74.384
     64.0   64.0   0.0    0.0   130944.0 93575.9  1226784.0   602544.6  737280.0 449690.5    526   17.021  168    58.691   75.712
     64.0   64.0   0.0    0.0   130944.0 65022.2  1226784.0   635921.9  737280.0 449698.6    531   17.217  169    58.736   75.953
     64.0   64.0   0.0    0.0   130944.0 23657.6  1226784.0   626730.9  737280.0 449708.1    537   17.471  171    59.584   77.055
    ^C
    

    The Verbose GC log file confirms the jstat data.

    1398.985: [GC[YG occupancy: 32842 K (131008 K)]1398.985: [Rescan (parallel) , 0.0162640 secs]1399.002: [weak refs processing, 0.3515853 secs]1399.353: [class unloading, 0.3569879 secs]1399.711: [scrub symbol & string tables, 0.1207449 secs] [1 CMS-remark: 757680K(1226784K)] 790522K(1357792K), 0.9462555 secs]
    1399.932: [CMS-concurrent-sweep-start]
    1400.006: [GC 1400.006: [ParNew: 130904K->0K(131008K), 0.0295234 secs] 887457K->758257K(1357792K), 0.0304601 secs]
    1400.485: [GC 1400.486: [ParNew: 130944K->0K(131008K), 0.0344479 secs] 872839K->751757K(1357792K), 0.0351573 secs]
    1401.453: [GC 1401.453: [ParNew: 130944K->0K(131008K), 0.0309997 secs] 830358K->704885K(1357792K), 0.0318843 secs]
    1402.550: [CMS-concurrent-sweep: 1.110/2.618 secs]
    1402.550: [CMS-concurrent-reset-start]
    1402.563: [CMS-concurrent-reset: 0.013/0.013 secs]
    1403.901: [GC 1403.901: [ParNew: 130944K->0K(131008K), 0.0325028 secs] 835829K->711510K(1357792K), 0.0332487 secs]
    1404.156: [GC [1 CMS-initial-mark: 711510K(1226784K)] 744661K(1357792K), 0.0442969 secs]
    1404.201: [CMS-concurrent-mark-start]
    1404.834: [GC 1404.835: [ParNew: 130941K->0K(131008K), 0.0357494 secs] 842451K->717262K(1357792K), 0.0365797 secs]
    1404.953: [GC 1404.954: [ParNew: 130893K->0K(131008K), 0.0183660 secs] 848155K->717517K(1357792K), 0.0190584 secs]
    1405.055: [GC 1405.055: [ParNew: 130944K->0K(131008K), 0.0173852 secs] 848461K->717814K(1357792K), 0.0180515 secs]
    1405.342: [GC 1405.343: [ParNew: 130944K->0K(131008K), 0.0203568 secs] 848758K->720848K(1357792K), 0.0211302 secs]
    1405.763: [GC 1405.763: [ParNew: 130944K->0K(131008K), 0.0204437 secs] 851792K->722671K(1357792K), 0.0211113 secs]
    1406.192: [GC 1406.192: [ParNew: 130944K->0K(131008K), 0.0204619 secs] 853615K->724768K(1357792K), 0.0214120 secs]
    1407.042: [GC 1407.042: [ParNew: 130944K->0K(131008K), 0.0326238 secs] 855712K->734062K(1357792K), 0.0334577 secs]
    1409.453: [GC 1409.453: [ParNew: 130944K->0K(131008K), 0.0340238 secs] 865006K->741292K(1357792K), 0.0347661 secs]
    1410.355: [CMS-concurrent-mark: 3.050/6.154 secs]
    1410.355: [CMS-concurrent-preclean-start]
    1412.357: [GC 1412.358: [ParNew: 130944K->0K(131008K), 0.0334870 secs] 872236K->747988K(1357792K), 0.0344020 secs]
    1412.936: [CMS-concurrent-preclean: 0.366/2.581 secs]
    1412.936: [CMS-concurrent-abortable-preclean-start]
    1412.941: [CMS-concurrent-abortable-preclean: 0.005/0.006 secs]
    1412.948: [GC[YG occupancy: 66917 K (131008 K)]1412.949: [Rescan (parallel) , 0.0494186 secs]1412.998: [weak refs processing, 0.4237737 secs]1413.422: [class unloading, 0.2246387 secs]1413.647: [scrub symbol & string tables, 0.1256452 secs] [1 CMS-remark: 747988K(1226784K)] 814906K(1357792K), 0.9224302 secs]
    1413.872: [CMS-concurrent-sweep-start]
    1417.184: [GC 1417.185: [ParNew: 130944K->0K(131008K), 0.0439868 secs] 878065K->755909K(1357792K), 0.0450652 secs]
    

    Resident Set Size

    When memory is allocated to a Solaris process, virtual memory pages to fulfill the request are reserved in swap space, but this is a very lightweight event that does not require allocating RAM or writing data to the backing store (virtual memory disk).  The first time that any page is accessed, RAM is allocated and zeroed for the page. Finally, in a scenario where RAM is scarce, Solaris may need to page the data out to the backing store.

    It is not uncommon for Java users to notice a substantial gap between the swap space that has been reserved for the process and the amount of RAM that is being used.  Observe the prstat SIZE and RSS columns.  The "RSS" column (Resident Set Size) indicates how much RAM is in use by the process. The "SIZE" column displays the total virtual size of the process, including RAM that is allocated to the process, pages that have been reserved but never accessed, and data that has been paged out due to memory pressure.  In this case, there is no virtual memory pressure (seen via vmstat, not shown), so we know that there is more than 2 GB of memory space that has been reserved by the process but not yet accessed. Going back to the VisualGC screen shot above, in the top left window, notice that Old Gen has:

    • Active data, solid orange, 664MB.
    • Empty (but already allocated) space shown with a green hash.  The sum of the active data and the empty space is 1.17 GB.
    • Space that has been reserved but not yet allocated, shown with a gray hash. The virtual memory footprint of the Old Generation is 1.87 GB, including active, empty-but-allocated, and reserved space.

    The jstat Old Capacity ("OC") column (1226784 KB) matches the 1.17 GB in the VisualGC output and indicates the memory that has been allocated to the Old Generation, but not the pages that have been reserved and not yet allocated.
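
    A quick way to compare the two numbers from the command line is prstat for the SIZE and RSS summary and pmap -x for a per-segment breakdown of reserved versus resident memory. A sketch, again using the process id from the jstat output above; the final line of the pmap -x output is the per-process total:

    # prstat -p 15515 1 1
    # pmap -x 15515 | tail -1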

    More detail about Solaris memory can be found in sources including Solaris 10 Physical Memory Management and The Solaris Memory System - Sizing, Tools and Architecture.

    The following graph is produced from vmstat data gathered while running the medium load scenario where the end-user response times are good. (See Graphing Solaris Performance Stats with gnuplot.) You can see that there is a consistent light CPU load on the system throughout the half-hour test.


     033_vmstat.jpg
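
    A chart like this one can be produced with nothing more exotic than vmstat and gnuplot. The following is a rough sketch, not the script from the entry linked above: the file names are placeholders, the egrep simply drops vmstat's header lines, and on this system %idle is the 22nd field of the vmstat output.

    # vmstat 1 1800 | egrep -v '[a-z]' > vmstat.out
    # gnuplot
    gnuplot> set terminal png
    gnuplot> set output 'idle.png'
    gnuplot> plot 'vmstat.out' using 22 with lines title 'cpu idle'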

    Stage 2: It's broken

    In the second stage, the load has been increased, causing response times to spike. The processing alternates between two phases: garbage collection runs constantly while the test is running, but the Old Generation CMS collector can't clear out the data fast enough, so the application is repeatedly stalled while the Old Generation is collected.

    What I don't like about the following VisualGC screen shot:

    1) GC is constantly running
    2) Survivor space is not being used
    3) Two phases:

    (a) a phase that can be seen in the Eden Space strip chart with very frequent Eden GC's concurrent with application work, and

    (b) a phase that can be seen in the Old Gen strip chart where the application cannot execute due to exclusive Old Generation GC work.

    4) CMS is not able to hold Old Gen down

     100_bad_visualgc.jpg

    The following jstat data was gathered during a run where the load has been increased, causing response times to spike. There is one sample every 10 seconds (10000 ms) and twenty samples, covering a total of 200 seconds.  Observing the YGC column, notice that there are 685 (3675 - 2990) Young Generation Collections, for an average rate of 3.4 collections/second.  A tuning goal will be to slow this rate down.

    #jstat -gc $PID 10000 20
      S0C    S1C   S0U    S1U      EC       EU        OC         OU        PC      PU        YGC     YGCT  FGC    FGCT     GCT
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1553383.3  737280.0 449596.9   2990  119.803  123   283.914  403.717
     64.0   64.0   0.0    0.0   130944.0 93869.1  1966080.0  1886113.2  737280.0 449599.9   3046  121.996  123   283.914  405.910
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1942614.0  737280.0 449599.9   3055  122.564  124   283.914  406.478
     64.0   64.0   0.0    0.0   130944.0 57856.7  1966080.0  1364918.4  737280.0 449602.5   3093  124.077  125   294.691  418.768
     64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1714499.8  737280.0 449606.8   3149  126.198  125   294.691  420.889
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943757.0  737280.0 449608.7   3186  127.848  126   294.691  422.539
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943757.0  737280.0 449605.8   3186  127.848  126   294.691  422.539
     64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1389764.8  737280.0 449616.0   3231  129.533  127   307.723  437.256
     64.0   64.0   0.0    0.0   130944.0 101211.9 1966080.0  1741419.7  737280.0 449619.9   3286  131.759  127   307.723  439.483
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943870.4  737280.0 449629.2   3323  133.224  128   307.723  440.947
     64.0   64.0   0.0    0.0   130944.0 91907.9  1966080.0  1116053.3  737280.0 449621.3   3330  133.492  129   319.118  452.610
     64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1541494.5  737280.0 449631.2   3392  135.838  129   319.118  454.956
     64.0   64.0   0.0    0.0   130944.0 53983.1  1966080.0  1907711.7  737280.0 449651.6   3444  137.955  129   319.118  457.073
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1942148.9  737280.0 449659.3   3450  138.658  130   319.118  457.776
     64.0   64.0   0.0    0.0   130944.0 71632.8  1966080.0  1291300.9  737280.0 449672.5   3484  139.935  131   331.870  471.805
     64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1661931.9  737280.0 449687.8   3535  142.096  131   331.870  473.966
     64.0   64.0   0.0    0.0   130944.0 55150.8  1966080.0  1930792.5  737280.0 449693.2   3581  143.826  131   331.870  475.696
     64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1941935.7  737280.0 449693.2   3583  144.291  132   331.870  476.161
     64.0   64.0   0.0    0.0   130944.0 66316.6  1966080.0  1405189.1  737280.0 449701.0   3631  146.142  133   342.789  488.931
     64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1744177.7  737280.0 449704.4   3675  148.066  133   342.789  490.855
    

    The Verbose GC log file confirms the jstat data.

    1445.781: [GC 1445.781: [ParNew: 130944K->0K(131008K), 0.0528409 secs] 2048880K->1929346K(2097088K), 0.0733931 secs]
    1446.057: [GC 1446.077: [ParNew: 130944K->0K(131008K), 0.0507988 secs] 2060290K->1937743K(2097088K), 0.0519872 secs]
    1446.328: [GC 1446.330: [ParNew (promotion failed): 130944K->130944K(131008K), 0.3677102 secs]1446.698: [CMS1448.338: [CMS-concurrent-mark: 8.326/20.506 secs]
     (concurrent mode failure): 1943651K->1332161K(1966080K), 13.5947895 secs] 2068687K->1332161K(2097088K), 13.9636616 secs]
    1460.373: [GC [1 CMS-initial-mark: 1332161K(1966080K)] 1364958K(2097088K), 0.0570809 secs]
    1460.431: [CMS-concurrent-mark-start]
    1460.513: [GC 1460.513: [ParNew: 130944K->0K(131008K), 0.0629766 secs] 1463105K->1340611K(2097088K), 0.0638556 secs]
    1460.644: [GC 1460.645: [ParNew: 130944K->0K(131008K), 0.0287314 secs] 1471555K->1344385K(2097088K), 0.0295894 secs]
    1460.752: [GC 1460.752: [ParNew: 130944K->0K(131008K), 0.0313205 secs] 1475329K->1349195K(2097088K), 0.0325400 secs]
    1460.859: [GC 1460.860: [ParNew: 130944K->0K(131008K), 0.0354078 secs] 1480139K->1353073K(2097088K), 0.0363298 secs]
    1460.988: [GC 1460.988: [ParNew: 130944K->0K(131008K), 0.0363377 secs] 1484017K->1356257K(2097088K), 0.0374367 secs]
    1461.164: [GC 1461.164: [ParNew: 130944K->0K(131008K), 0.0479562 secs] 1487201K->1362016K(2097088K), 0.0488208 secs]
    1461.439: [GC 1461.439: [ParNew: 130944K->0K(131008K), 0.0507014 secs] 1492960K->1372934K(2097088K), 0.0515876 secs]
    1461.579: [GC 1461.580: [ParNew: 130944K->0K(131008K), 0.0319675 secs] 1503878K->1375415K(2097088K), 0.0332522 secs]
    1461.693: [GC 1461.693: [ParNew: 130944K->0K(131008K), 0.0340291 secs] 1506359K->1378926K(2097088K), 0.0348877 secs]
    1461.821: [GC 1461.821: [ParNew: 130944K->0K(131008K), 0.0314839 secs] 1509870K->1382392K(2097088K), 0.0323204 secs]
    1461.950: [GC 1461.951: [ParNew: 130943K->0K(131008K), 0.0323091 secs] 1513336K->1386662K(2097088K), 0.0331815 secs]
    1462.123: [GC 1462.124: [ParNew: 130944K->0K(131008K), 0.0385567 secs] 1517606K->1391853K(2097088K), 0.0398456 secs]
    1462.380: [GC 1462.380: [ParNew: 130944K->0K(131008K), 0.0698981 secs] 1522797K->1405319K(2097088K), 0.0708113 secs]
    1462.627: [GC 1462.628: [ParNew: 130944K->0K(131008K), 0.0482389 secs] 1536263K->1417126K(2097088K), 0.0493317 secs]
    1462.835: [GC 1462.835: [ParNew: 130944K->0K(131008K), 0.0438596 secs] 1548070K->1423124K(2097088K), 0.0447105 secs]
    1462.989: [GC 1462.989: [ParNew: 130944K->0K(131008K), 0.0356933 secs] 1554068K->1430934K(2097088K), 0.0365304 secs]
    1463.156: [GC 1463.156: [ParNew: 130944K->0K(131008K), 0.0426650 secs] 1561878K->1437914K(2097088K), 0.0438581 secs]
    1463.342: [GC 1463.342: [ParNew: 130944K->0K(131008K), 0.0369751 secs] 1568858K->1445730K(2097088K), 0.0378227 secs]
    1463.508: [GC 1463.508: [ParNew: 130944K->0K(131008K), 0.0393694 secs] 1576674K->1458926K(2097088K), 0.0402032 secs]
    1463.709: [GC 1463.709: [ParNew: 130944K->0K(131008K), 0.0351472 secs] 1589870K->1465137K(2097088K), 0.0364246 secs]
    1463.860: [GC 1463.860: [ParNew: 130944K->0K(131008K), 0.0399990 secs] 1596081K->1468331K(2097088K), 0.0408329 secs]
    1463.998: [GC 1463.998: [ParNew: 130944K->0K(131008K), 0.0338172 secs] 1599275K->1471847K(2097088K), 0.0346756 secs]
    1464.123: [GC 1464.123: [ParNew: 130944K->0K(131008K), 0.0281716 secs] 1602791K->1475013K(2097088K), 0.0292473 secs]
    1464.235: [GC 1464.235: [ParNew: 130944K->0K(131008K), 0.0282486 secs] 1605957K->1477824K(2097088K), 0.0293566 secs]
    1464.361: [GC 1464.361: [ParNew: 130944K->0K(131008K), 0.0319564 secs] 1608768K->1480943K(2097088K), 0.0328203 secs]
    1464.590: [GC 1464.590: [ParNew: 130944K->0K(131008K), 0.0430505 secs] 1611887K->1486579K(2097088K), 0.0439285 secs]
    1464.875: [GC 1464.875: [ParNew: 130944K->0K(131008K), 0.0404393 secs] 1617523K->1496962K(2097088K), 0.0415705 secs]
    1465.079: [GC 1465.079: [ParNew: 130944K->0K(131008K), 0.0389969 secs] 1627906K->1509873K(2097088K), 0.0398899 secs]
    1465.318: [GC 1465.318: [ParNew: 130944K->0K(131008K), 0.0377232 secs] 1640817K->1518156K(2097088K), 0.0386229 secs]
    1465.533: [GC 1465.534: [ParNew: 130939K->0K(131008K), 0.0433747 secs] 1649096K->1527800K(2097088K), 0.0446424 secs]
    1465.652: [GC 1465.653: [ParNew: 130944K->0K(131008K), 0.0267976 secs] 1658744K->1530747K(2097088K), 0.0276501 secs]
    1465.754: [GC 1465.755: [ParNew: 130944K->0K(131008K), 0.0382624 secs] 1661691K->1533772K(2097088K), 0.0390969 secs]
    1465.883: [GC 1465.884: [ParNew: 130944K->0K(131008K), 0.0346790 secs] 1664716K->1538809K(2097088K), 0.0355499 secs]
    1466.010: [GC 1466.010: [ParNew: 130944K->0K(131008K), 0.0360124 secs] 1669753K->1542202K(2097088K), 0.0368312 secs]
    1466.133: [GC 1466.134: [ParNew: 130896K->0K(131008K), 0.0264424 secs] 1673098K->1545037K(2097088K), 0.0276615 secs]
    1466.284: [GC 1466.284: [ParNew: 130944K->0K(131008K), 0.0334272 secs] 1675981K->1551352K(2097088K), 0.0342620 secs]
    1466.486: [GC 1466.487: [ParNew: 130944K->0K(131008K), 0.0468365 secs] 1682296K->1559392K(2097088K), 0.0476777 secs]
    1466.761: [GC 1466.762: [ParNew: 130944K->0K(131008K), 0.0668140 secs] 1690336K->1571022K(2097088K), 0.0679247 secs]
    1467.043: [GC 1467.043: [ParNew: 130944K->0K(131008K), 0.0595864 secs] 1701966K->1579608K(2097088K), 0.0604566 secs]
    1467.294: [GC 1467.294: [ParNew: 130944K->0K(131008K), 0.0422655 secs] 1710552K->1586731K(2097088K), 0.0433685 secs]
    1467.412: [GC 1467.413: [ParNew: 130915K->0K(131008K), 0.0357374 secs] 1717646K->1590261K(2097088K), 0.0365601 secs]
    1467.523: [GC 1467.523: [ParNew: 130944K->0K(131008K), 0.0368735 secs] 1721205K->1592832K(2097088K), 0.0376957 secs]
    1467.637: [Full GC 1467.637: [ParNew: 130901K->0K(131008K), 0.0387518 secs] 1723877K->1595360K(2097088K), 0.0396553 secs]
    1467.755: [GC 1467.755: [ParNew: 130944K->0K(131008K), 0.0328768 secs] 1726304K->1597889K(2097088K), 0.0337467 secs]
    1467.913: [GC 1467.913: [ParNew: 130943K->0K(131008K), 0.0321930 secs] 1728833K->1601670K(2097088K), 0.0332656 secs]
    1468.115: [GC 1468.115: [ParNew: 130944K->0K(131008K), 0.0378134 secs] 1732614K->1608985K(2097088K), 0.0386423 secs]
    1468.259: [GC 1468.260: [ParNew: 130944K->0K(131008K), 0.0389691 secs] 1739929K->1619450K(2097088K), 0.0398194 secs]
    1468.378: [GC 1468.379: [ParNew: 130939K->0K(131008K), 0.0346053 secs] 1750390K->1622375K(2097088K), 0.0354541 secs]
    1468.509: [GC 1468.509: [ParNew: 130931K->0K(131008K), 0.0329106 secs] 1753307K->1625703K(2097088K), 0.0339618 secs]
    1468.627: [GC 1468.636: [ParNew: 130874K->0K(131008K), 0.0315224 secs] 1756578K->1630866K(2097088K), 0.0323764 secs]
    1468.751: [GC 1468.752: [ParNew: 130944K->0K(131008K), 0.0316954 secs] 1761810K->1633494K(2097088K), 0.0325242 secs]
    1468.926: [GC 1468.933: [ParNew: 130944K->0K(131008K), 0.0361699 secs] 1764438K->1641317K(2097088K), 0.0371095 secs]
    1469.145: [GC 1469.146: [ParNew: 130944K->0K(131008K), 0.0367411 secs] 1772261K->1647218K(2097088K), 0.0378685 secs]
    1469.380: [GC 1469.380: [ParNew: 130944K->0K(131008K), 0.0446357 secs] 1778162K->1653998K(2097088K), 0.0454686 secs]
    1469.688: [GC 1469.689: [ParNew: 130944K->0K(131008K), 0.0390912 secs] 1784942K->1661565K(2097088K), 0.0402177 secs]
    1469.951: [GC 1469.951: [ParNew: 130944K->0K(131008K), 0.0486676 secs] 1792509K->1671504K(2097088K), 0.0495607 secs]
    1470.248: [GC 1470.249: [ParNew: 130944K->0K(131008K), 0.0414052 secs] 1802448K->1676829K(2097088K), 0.0425514 secs]
    1470.447: [GC 1470.447: [ParNew: 130944K->0K(131008K), 0.0342887 secs] 1807773K->1683630K(2097088K), 0.0351706 secs]
    1470.610: [GC 1470.611: [ParNew: 130940K->0K(131008K), 0.0319829 secs] 1814571K->1688471K(2097088K), 0.0332171 secs]
    1470.722: [GC 1470.722: [ParNew: 130944K->0K(131008K), 0.0311625 secs] 1819415K->1692388K(2097088K), 0.0319923 secs]
    1470.852: [GC 1470.852: [ParNew: 130944K->0K(131008K), 0.0314821 secs] 1823332K->1696163K(2097088K), 0.0389738 secs]
    1470.967: [GC 1470.979: [ParNew: 130944K->0K(131008K), 0.0318559 secs] 1827107K->1700627K(2097088K), 0.0326812 secs]
    1471.085: [GC 1471.086: [ParNew: 130937K->0K(131008K), 0.0328143 secs] 1831564K->1704737K(2097088K), 0.0336577 secs]
    1471.229: [GC 1471.229: [ParNew: 130944K->0K(131008K), 0.0335674 secs] 1835681K->1709344K(2097088K), 0.0343768 secs]
    1471.385: [GC 1471.386: [ParNew: 130944K->0K(131008K), 0.0406892 secs] 1840288K->1714787K(2097088K), 0.0419120 secs]
    1471.516: [GC 1471.516: [ParNew: 130944K->0K(131008K), 0.0295618 secs] 1845731K->1717584K(2097088K), 0.0303950 secs]
    1471.625: [GC 1471.625: [ParNew: 130944K->0K(131008K), 0.0376045 secs] 1848528K->1720845K(2097088K), 0.0384246 secs]
    1471.750: [GC 1471.750: [ParNew: 130870K->0K(131008K), 0.0260799 secs] 1851715K->1723365K(2097088K), 0.0269050 secs]
    1471.839: [GC 1471.840: [ParNew: 130944K->0K(131008K), 0.0255170 secs] 1854309K->1724726K(2097088K), 0.0263177 secs]
    1471.988: [GC 1471.988: [ParNew: 130944K->0K(131008K), 0.0285268 secs] 1855670K->1728103K(2097088K), 0.0295660 secs]
    1472.143: [GC 1472.143: [ParNew: 130944K->0K(131008K), 0.0392845 secs] 1859047K->1738951K(2097088K), 0.0401130 secs]
    1472.362: [GC 1472.362: [ParNew: 130944K->0K(131008K), 0.0379261 secs] 1869895K->1746709K(2097088K), 0.0387573 secs]
    1472.554: [GC 1472.554: [ParNew: 130944K->0K(131008K), 0.0431822 secs] 1877653K->1752618K(2097088K), 0.0442519 secs]
    1472.672: [GC 1472.673: [ParNew: 130944K->0K(131008K), 0.0434003 secs] 1883562K->1754549K(2097088K), 0.0442311 secs]
    1472.795: [GC 1472.796: [ParNew: 130944K->0K(131008K), 0.0295359 secs] 1885493K->1757812K(2097088K), 0.0304006 secs]
    1472.904: [GC 1472.904: [ParNew: 130944K->0K(131008K), 0.0386588 secs] 1888756K->1760377K(2097088K), 0.0395649 secs]
    1473.035: [GC 1473.036: [ParNew: 130861K->0K(131008K), 0.0297446 secs] 1891238K->1763237K(2097088K), 0.0308653 secs]
    1473.188: [GC 1473.201: [ParNew: 130944K->0K(131008K), 0.0319938 secs] 1894181K->1767389K(2097088K), 0.0333206 secs]
    1473.401: [GC 1473.401: [ParNew: 130944K->0K(131008K), 0.0387580 secs] 1898333K->1773804K(2097088K), 0.0396182 secs]
    1473.522: [GC 1473.538: [ParNew: 130944K->0K(131008K), 0.0312055 secs] 1904748K->1778637K(2097088K), 0.0320711 secs]
    1473.650: [GC 1473.650: [ParNew: 130944K->0K(131008K), 0.0346679 secs] 1909581K->1784622K(2097088K), 0.0355099 secs]
    1473.769: [GC 1473.780: [ParNew: 130944K->0K(131008K), 0.0325029 secs] 1915566K->1787148K(2097088K), 0.0336199 secs]
    1473.889: [GC 1473.890: [ParNew: 130944K->0K(131008K), 0.0306312 secs] 1918092K->1791684K(2097088K), 0.0314725 secs]
    1474.023: [GC 1474.025: [ParNew: 130944K->0K(131008K), 0.0346581 secs] 1922628K->1800742K(2097088K), 0.0354589 secs]
    1474.219: [GC 1474.219: [ParNew: 130944K->0K(131008K), 0.0431710 secs] 1931686K->1811004K(2097088K), 0.0440093 secs]
    1474.392: [GC 1474.393: [ParNew: 130944K->0K(131008K), 0.0384690 secs] 1941948K->1820039K(2097088K), 0.0396068 secs]
    1474.572: [GC 1474.584: [ParNew: 130944K->0K(131008K), 0.0413289 secs] 1950983K->1827458K(2097088K), 0.0426382 secs]
    1474.801: [GC 1474.813: [ParNew: 130944K->0K(131008K), 0.0385987 secs] 1958402K->1837161K(2097088K), 0.0627487 secs]
    1475.046: [GC 1475.047: [ParNew: 130944K->0K(131008K), 0.0488532 secs] 1968105K->1845650K(2097088K), 0.0499528 secs]
    1475.262: [GC 1475.262: [ParNew: 130944K->0K(131008K), 0.0512485 secs] 1976594K->1853035K(2097088K), 0.0521045 secs]
    1475.458: [GC 1475.458: [ParNew: 130944K->0K(131008K), 0.0461453 secs] 1983979K->1862019K(2097088K), 0.0469998 secs]
    1475.652: [GC 1475.653: [ParNew: 130944K->0K(131008K), 0.0447249 secs] 1992963K->1868886K(2097088K), 0.0458052 secs]
    1475.763: [GC 1475.780: [ParNew: 130928K->0K(131008K), 0.0324786 secs] 1999815K->1873299K(2097088K), 0.0333689 secs]
    1475.882: [GC 1475.882: [ParNew: 130944K->0K(131008K), 0.0340021 secs] 2004243K->1876399K(2097088K), 0.0348220 secs]
    1475.993: [GC 1475.994: [ParNew: 130944K->0K(131008K), 0.0299435 secs] 2007343K->1883168K(2097088K), 0.0308629 secs]
    1476.097: [GC 1476.097: [ParNew: 130944K->0K(131008K), 0.0343554 secs] 2014112K->1887994K(2097088K), 0.0351879 secs]
    1476.218: [GC 1476.218: [ParNew: 130904K->0K(131008K), 0.0331282 secs] 2018898K->1891141K(2097088K), 0.0341797 secs]
    1476.375: [GC 1476.376: [ParNew: 130944K->0K(131008K), 0.0460486 secs] 2022085K->1896937K(2097088K), 0.0472445 secs]
    1476.522: [GC 1476.522: [ParNew: 130937K->0K(131008K), 0.0305741 secs] 2027875K->1900685K(2097088K), 0.0313665 secs]
    1476.630: [GC 1476.630: [ParNew: 130944K->0K(131008K), 0.0354049 secs] 2031629K->1903695K(2097088K), 0.0362252 secs]
    1476.749: [GC 1476.750: [ParNew: 130944K->0K(131008K), 0.0332512 secs] 2034639K->1906008K(2097088K), 0.0343826 secs]
    1476.861: [GC 1476.882: [ParNew: 130944K->0K(131008K), 0.0302215 secs] 2036952K->1908090K(2097088K), 0.0310552 secs]
    1477.030: [GC 1477.030: [ParNew: 130944K->0K(131008K), 0.0308026 secs] 2039034K->1910954K(2097088K), 0.0316133 secs]
    1477.250: [GC 1477.251: [ParNew: 130944K->0K(131008K), 0.0385982 secs] 2041898K->1916426K(2097088K), 0.0397542 secs]
    1477.451: [GC 1477.452: [ParNew: 130944K->0K(131008K), 0.0355202 secs] 2047370K->1922843K(2097088K), 0.0363764 secs]
    1477.641: [GC 1477.641: [ParNew: 130944K->0K(131008K), 0.0374713 secs] 2053787K->1928347K(2097088K), 0.0383333 secs]
    1477.864: [GC 1477.865: [ParNew: 130944K->0K(131008K), 0.0420856 secs] 2059291K->1935857K(2097088K), 0.0432463 secs]
    1478.088: [GC 1478.089: [ParNew (promotion failed): 130944K->130944K(131008K), 0.2115119 secs]1478.300: [CMS1480.678: [CMS-concurrent-mark: 8.657/20.247 secs]
     (concurrent mode failure): 1943323K->1067360K(1966080K), 12.8686984 secs] 2066801K->1067360K(2097088K), 13.0813570 secs]
    1491.257: [GC [1 CMS-initial-mark: 1067360K(1966080K)] 1106126K(2097088K), 0.0660923 secs]
    1491.325: [CMS-concurrent-mark-start]
    1491.415: [GC 1491.416: [ParNew: 130944K->0K(131008K), 0.0628546 secs] 1198304K->1074342K(2097088K), 0.0642408 secs]
    1491.569: [GC 1491.569: [ParNew: 130944K->0K(131008K), 0.0283941 secs] 1205286K->1077410K(2097088K), 0.0292610 secs]  

    The following screen shot shows VisualGC, prstat and xcpustate.  There is a sustained period of single-threaded behavior. Of the 664 threads in the Java process, only one is able to run. 

     100_bad_visual_gc_single_threaded_gc.jpg
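
    The same single-threaded phase can be seen from the command line with per-thread (per-LWP) microstate accounting. A sketch, with a placeholder process id; the wait-related columns (SLP, LAT, LCK) show the many threads that are parked while a single garbage collection thread runs:

    # prstat -mL -p <pid> 5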

    The following screen shot shows VisualGC, prstat and xcpustate.  After the single-threaded phase, there is a phase where many of the threads that had been blocked by garbage collection become runnable.  Every virtual CPU is busy processing runnable threads. 

     100_bad_visual_gc_back_to_work.jpg

    The vmstat data shows the stages of single-threaded garbage collection (1/16 CPUs = 6.25% busy) alternating with stages where the CPUs are extremely busy processing application code.

     100_bad_vmstat.jpg

     The pattern in the vmstat data is more obvious when only 10 minutes of data is displayed.

     100_bad_vmstat_10_min.jpg

    Stage 3: Fix it

    In the third stage, the Java parameters have been modified and the system is able to run well under a heavier load.  

    Initial Java Options       Modified Java Options
    -Xms1024m                  -Xms4096m
    -Xmx2048m                  -Xmx6144m
    -XX:NewSize=128m           -XX:NewSize=1536m
    -XX:MaxNewSize=256m        -XX:MaxNewSize=1536m
    N/A                        -XX:SurvivorRatio=4
    N/A                        -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
                               -XX:+PrintTenuringDistribution -Xloggc:/web_dir/log/gc.log
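
    Pulling the unchanged baseline options together with the modified values, the resulting launch line would look roughly like the following. This is a sketch: the application server's actual start script and main class are not shown in this entry, so the trailing "-jar" target is only a placeholder.

    java -server -Xms4096m -Xmx6144m -XX:NewSize=1536m -XX:MaxNewSize=1536m -XX:SurvivorRatio=4 \
         -XX:PermSize=512m -XX:MaxPermSize=720m -Xss256k \
         -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSClassUnloadingEnabled \
         -XX:+CMSPermGenSweepingEnabled -XX:+UseTLAB -XX:+DisableExplicitGC \
         -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 \
         -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
         -Xloggc:/web_dir/log/gc.log \
         -jar /path/to/appserver.jar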

    What I like about the following VisualGC screen shot:

    1) GC is not constantly running
    2) Survivor space is being used (5 generations) but not full
    3) CMS is able to hold Old Gen down
    4) Eden GCs are about one per second.

     100_big_visualgc.jpg  100_big_prstat.jpg

    The following jstat data was gathered during a run where the Java parameters have been modified and the system is able to run well under a heavier load.  There is one sample every 10 seconds (10000 ms) and twenty samples, covering a total of 200 seconds.  Observing the YGC column, notice that there are 126 (943 - 817) Young Generation Collections, for an average rate of 0.63 collections/second (one every 1.6 seconds). 

    Also notice that the Survivor space is now active. In the VisualGC Survivor Age Histogram, above, observe that data survives 5 generations before being promoted to the Old Generation.  This implies that only Java objects that are still referenced after about 8 seconds (1.6 seconds x 5 collections) are promoted. Most Java objects are not referenced for 8 seconds and therefore are never promoted into the Old Generation.

    #jstat -gc $PID 10000 20
      S0C      S1C     S0U     S1U         EC       EU        OC         OU        PC      PU          YGC    YGCT   FGC    FGCT     GCT
    262144.0 262144.0  0.0   135491.4   1048576.0  740774.1 2621440.0  1518371.0  737280.0 454756.7    817  204.672  119   105.404  310.076
    262144.0 262144.0 143538.3  0.0     1048576.0  429976.2 2621440.0  1667952.6  737280.0 454763.5    826  206.506  119   105.404  311.911
    262144.0 262144.0 127788.8  0.0     1048576.0  648808.0 2621440.0  1772166.9  737280.0 454766.5    832  207.969  120   105.404  313.373
    262144.0 262144.0 149593.1  0.0     1048576.0  598492.1 2621440.0  1515637.6  737280.0 454766.5    840  209.517  121   107.803  317.320
    262144.0 262144.0 162023.5  0.0     1048576.0 1008524.0 2621440.0  1656240.4  737280.0 454768.1    846  210.922  121   107.803  318.725
    262144.0 262144.0  0.0   125334.4   1048576.0  953097.1 2621440.0  1836384.7  737280.0 454769.0    853  212.890  121   107.803  320.694
    262144.0 262144.0 151573.6 152908.7 1048576.0 1048576.0 2621440.0  1687193.8  737280.0 454769.8    859  214.153  122   109.747  323.900
    262144.0 262144.0  0.0   127784.9   1048576.0  170487.4 2621440.0  1597313.5  737280.0 454764.8    865  216.265  123   110.197  326.462
    262144.0 262144.0 125183.1  0.0     1048576.0  679868.2 2621440.0  1674807.1  737280.0 454765.9    870  217.484  123   110.197  327.682
    262144.0 262144.0  0.0   138781.8   1048576.0  941507.7 2621440.0  1500453.3  737280.0 454766.5    875  218.688  124   112.160  330.848
    262144.0 262144.0 154681.9 98652.8  1048576.0 1048576.0 2621440.0  1419845.8  737280.0 454770.0    881  220.185  125   112.576  332.761
    262144.0 262144.0 140635.1  0.0     1048576.0  184017.0 2621440.0  1541698.1  737280.0 454774.3    888  222.260  125   112.576  334.836
    262144.0 262144.0 126225.1  0.0     1048576.0  800352.6 2621440.0  1664977.6  737280.0 454777.0    894  224.013  125   112.576  336.589
    262144.0 262144.0 128930.2  0.0     1048576.0  709898.1 2621440.0  1423721.6  737280.0 454779.2    900  225.276  126   114.513  339.788
    262144.0 262144.0  0.0   135929.5   1048576.0  901493.1 2621440.0  1500049.7  737280.0 454783.5    907  227.254  127   114.962  342.216
    262144.0 262144.0 78006.9 156657.5  1048576.0 1048576.0 2621440.0  1682226.6  737280.0 454785.8    916  229.098  127   114.962  344.060
    262144.0 262144.0 112938.7  0.0     1048576.0  707242.9 2621440.0  1833853.7  737280.0 454790.2    922  230.749  128   114.962  345.712
    262144.0 262144.0  0.0   119538.9   1048576.0  466447.4 2621440.0  1587738.4  737280.0 454790.8    929  232.260  129   117.110  349.370
    262144.0 262144.0 141025.9 131586.9 1048576.0 1048576.0 2621440.0  1774825.3  737280.0 454791.9    937  233.762  129   117.110  350.872
    262144.0 262144.0  0.0   116953.7   1048576.0  562706.0 2621440.0  1886004.9  737280.0 454792.8    943  235.572  129   117.110  352.683
    

    The Verbose GC log file confirms the jstat data.

    1063.533: [GC[YG occupancy: 740130 K (1310720 K)]1063.534: [Rescan (parallel) , 0.2932129 secs]1063.827: [weak refs processing, 0.8300947 secs]1064.658: [class unloading, 0.2556826 secs]1064.914: [scrub symbol & string tables, 0.1225031 secs] [1 CMS-remark: 1362667K(2621440K)] 2102798K(3932160K), 1.6268772 secs]
    1065.161: [CMS-concurrent-sweep-start]
    1065.744: [GC 1065.748: [ParNew: 1199794K->136002K(1310720K), 0.3157680 secs] 2562442K->1538117K(3932160K), 0.3173270 secs]
    1067.540: [GC 1067.541: [ParNew: 1183955K->134897K(1310720K), 0.3643224 secs] 2516046K->1487739K(3932160K), 0.3657230 secs]
    1069.449: [GC 1069.452: [ParNew: 1183473K->132663K(1310720K), 0.4440330 secs] 2398115K->1372567K(3932160K), 0.4454642 secs]
    1071.165: [GC 1071.165: [ParNew: 1181239K->117685K(1310720K), 0.2458847 secs] 2366563K->1322400K(3932160K), 0.2472765 secs]
    1073.099: [GC 1073.100: [ParNew: 1166261K->143983K(1310720K), 0.2949013 secs] 2230754K->1208476K(3932160K), 0.2964760 secs]
    1074.020: [CMS-concurrent-sweep: 3.145/8.858 secs]
    1074.020: [CMS-concurrent-reset-start]
    1074.056: [CMS-concurrent-reset: 0.036/0.036 secs]
    1074.700: [GC 1074.701: [ParNew: 1192514K->130782K(1310720K), 0.2122237 secs] 2257007K->1216167K(3932160K), 0.2136974 secs]
    1075.304: [GC [1 CMS-initial-mark: 1085384K(2621440K)] 1482522K(3932160K), 0.3090166 secs]
    1075.614: [CMS-concurrent-mark-start]
    1076.299: [GC 1076.300: [ParNew: 1179358K->142731K(1310720K), 0.2360216 secs] 2264743K->1228116K(3932160K), 0.2373256 secs]
    1078.147: [GC 1078.148: [ParNew: 1191307K->146389K(1310720K), 0.3161631 secs] 2276692K->1249628K(3932160K), 0.3176056 secs]
    1080.368: [GC 1080.369: [ParNew: 1194965K->140667K(1310720K), 0.2221885 secs] 2298204K->1266323K(3932160K), 0.2235858 secs]
    1082.719: [GC 1082.720: [ParNew: 1189243K->156618K(1310720K), 0.4543953 secs] 2314899K->1292754K(3932160K), 0.4557551 secs]
    1083.887: [GC 1083.888: [ParNew: 1205108K->101152K(1310720K), 0.1484881 secs] 2341243K->1271768K(3932160K), 0.1498135 secs]
    1084.462: [CMS-concurrent-mark: 3.962/8.848 secs]
    1084.462: [CMS-concurrent-preclean-start]
    1085.434: [CMS-concurrent-preclean: 0.859/0.971 secs]
    1085.434: [CMS-concurrent-abortable-preclean-start]
    1086.081: [GC 1086.082: [ParNew: 1149728K->135612K(1310720K), 0.3098950 secs] 2320344K->1306228K(3932160K), 0.3113219 secs]
    1087.294: [CMS-concurrent-abortable-preclean: 0.459/1.860 secs]
    1087.342: [GC[YG occupancy: 667129 K (1310720 K)]1087.342: [Rescan (parallel) , 0.3074071 secs]1087.650: [weak refs processing, 0.9052933 secs]1088.555: [class unloading, 0.2571331 secs]1088.813: [scrub symbol & string tables, 0.1228506 secs] [1 CMS-remark: 1170615K(2621440K)] 1837745K(3932160K), 1.7140749 secs]
    1089.057: [CMS-concurrent-sweep-start]
    1089.396: [GC 1089.397: [ParNew: 1184134K->139422K(1310720K), 0.1621469 secs] 2354726K->1320441K(3932160K), 0.1636681 secs]
    1091.280: [GC 1091.281: [ParNew: 1187998K->154229K(1310720K), 0.2434008 secs] 2295339K->1273265K(3932160K), 0.2449105 secs]
    1093.547: [GC 1093.548: [ParNew: 1202805K->150707K(1310720K), 0.3279749 secs] 2156871K->1125761K(3932160K), 0.3295486 secs]
    

     The application is able to run most of the time, but even with the improved Java options, there are short periods of single-threaded computation, which are likely attributable to the "initial mark" or the "remark" phases of CMS.  See this description of CMS for more details.
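
     To confirm that these brief pauses line up with the CMS stop-the-world phases, the verbose GC log written by -Xloggc can be searched directly; for example:

    # egrep 'CMS-initial-mark|CMS-remark' /web_dir/log/gc.log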

     100_big_cms_vmstat.jpg

    When zooming in on the vmstat data, the short and acceptable phases of single-threaded activity are more obvious.  Notice how much better this is than the long phases of single-threaded behavior that were visible with the initial Java options.

     100_big_cms_vmstat._10_min.jpg

    Conclusion:

    "One-size fits all" Java options are not necessarily a good match for your application.  Monitoring garbage collection with a combination of several tools and using the feedback to adjust the Java options for your application can result is a substantial performance improvement.  


    Thursday Mar 25, 2010

    Installing the Solaris 10 OS Companion CD

    Note to self: stop forgetting the following choreography for installing the Solaris 10 OS Companion CD:

    Download zipped ISO from http://www.sun.com/software/solaris/freeware/

    # unzip sol-10-u8-companion-ga-iso.zip
    # lofiadm -a `pwd`/sol-10-u8-companion-ga.iso
    # mount -oro -Fhsfs /dev/lofi/1 /mnt
    # /bin/yes | pkgadd -d /mnt/Solaris_Software_Companion/Solaris_sparc/Packages all
    # pkgrm SFWvnc
    # umount /mnt
    # lofiadm -d `pwd`/sol-10-u8-companion-ga.iso
    # rm sol-10-u8-companion-ga.iso
    # rm sol-10-u8-companion-ga-iso.zip


