Wednesday Aug 25, 2010

Transparent Failover with Solaris MPxIO and Oracle ASM

Recently I had a nice experience configuring failover with Solaris MPxIO (Multipathing), Oracle ASM (Automatic Storage Management) and Oracle's Sun Storage 6180 array. In this configuration failover was completely transparent to the Oracle database.

MPxIO is part of the Solaris 10 distribution. Multipathing provides redundant paths and eliminates a single point of failure by failing over to an alternate path automatically when one of the paths fails. Oracle ASM was used as the volume manager for the Oracle database files, and it works very well with multipathing. The Sun Storage 6180 array had two controllers, each connected to a different adapter on the host for path redundancy. Multipathing generates a pseudo device, and supplying that pseudo device name to Oracle ASM when the ASM diskgroup is created is what makes failover transparent to the Oracle database.

HW/SW Configuration

Sun SPARC Enterprise M9000, 2x 8Gb FC HBA
1xSun Storage 6180 array, 16x 300G 15K RPM disks, two controllers, Firmware
Solaris 10, Oracle 11g with Oracle ASM
MPxIO enabled

When one of the paths fails (e.g. a cable, adapter, or controller), the system messages in the /var/adm/messages file show the following sequence of events.

    Link down      <- the path is down
    offlining lun= ...       <- associated devices are offline
    multipath status: degraded      <- multipath status is degraded
    Initiating failover for device       <- Initiate failover
    Failover operation completed successfully      <- failover was successful

However, during this period Oracle database transactions continue without any interruption, and the Oracle database does not even know this is happening. The failover is transparent to the Oracle database: there was no error or message in the Oracle log files during this period.
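
If you want to check a saved copy of the messages file for this sequence, a few lines of script are enough. The following is a minimal Python sketch and was not part of the original setup: the function name and the sample list are made up for illustration, and the marker strings are simply the message fragments listed above.

```python
# Message fragments taken from the event sequence listed above.
FAILOVER_EVENTS = [
    ("Link down", "the path is down"),
    ("offlining lun=", "associated devices are offline"),
    ("multipath status: degraded", "multipath status is degraded"),
    ("Initiating failover for device", "failover initiated"),
    ("Failover operation completed successfully", "failover was successful"),
]


def failover_events(lines):
    """Return (marker, meaning) pairs in the order they first appear."""
    seen = []
    for line in lines:
        for marker, meaning in FAILOVER_EVENTS:
            if marker in line and (marker, meaning) not in seen:
                seen.append((marker, meaning))
    return seen


# Sample lines copied from the failover messages shown later in this post.
SAMPLE_LINES = [
    "Aug 10 11:17:33 host1 emlxs: [ID 349649] [ 5.0314]emlxs0: NOTICE: 710: Link down.",
    "Aug 10 11:19:22 host1    offlining lun=3 (trace=0), target=ef (trace=2800004)",
    "Aug 10 11:19:22 host1 genunix: [ID 834635] "
    "/scsi_vhci/ssd@g60080e5000184350000003614c5ebe23 (ssd168) "
    "multipath status: degraded, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2)",
    "Aug 10 11:19:22 host1    Initiating failover for device ssd",
    "Aug 10 11:19:23 host1    Failover operation completed successfully for device ssd",
]

for marker, meaning in failover_events(SAMPLE_LINES):
    print(f"{marker:45s} <- {meaning}")
```

To scan a real log, pass an open file object instead of the sample list; the function only needs an iterable of lines.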

The following sections show the steps, with examples, to set up MPxIO and Oracle ASM. The system messages from a failover are shown at the bottom of the page.

Sun Storage 6180 array

First you need to configure two paths physically. In this case, one controller was connected to one of the HBAs on the host and the other controller was connected to the other HBA.

Use the Sun StorageTek Common Array Manager GUI or the command-line tool 'sscs' (StorEdge Systems Element Manager Command Line Interface) to manage the 6180.
The 6180 'os-type' needs to be set properly to support MPxIO.
To view the current 'os-type' setting, run 'sscs list array array_name'.
'Default Host Type' should be 'SOLARIS_MPXIO - Solaris (with Traffic Manager)'.
Run 'sscs list -a array_name os-type' to view the supported operating systems and 'sscs modify -o os-type_name array array_name' to modify 'os-type'.


The Solaris 'stmsboot -e' command enables MPxIO. It will prompt you to reboot the host.

After the reboot, 'format' shows a long pseudo device name such as c29t60080E500018142E000003574C5EBF38d0.
'stmsboot -L' lists the device names from both before and after MPxIO was enabled. This mapping is very useful for any debugging you might need.

# stmsboot -L

 non-STMS device name                    STMS device name
/dev/rdsk/c6t0d0        /dev/rdsk/c29t60080E5000184350000003614C5EBE23d0
/dev/rdsk/c6t0d2        /dev/rdsk/c29t60080E5000184350000003654C5EBE44d0
/dev/rdsk/c6t0d1        /dev/rdsk/c29t60080E50001845C6000003524C5EBE2Ed0
/dev/rdsk/c6t0d3        /dev/rdsk/c29t60080E50001845C6000003564C5EBE4Cd0
/dev/rdsk/c9t4d0        /dev/rdsk/c29t60080E5000184350000003614C5EBE23d0
/dev/rdsk/c9t4d2        /dev/rdsk/c29t60080E5000184350000003654C5EBE44d0
/dev/rdsk/c9t4d1        /dev/rdsk/c29t60080E50001845C6000003524C5EBE2Ed0
/dev/rdsk/c9t4d3        /dev/rdsk/c29t60080E50001845C6000003564C5EBE4Cd0
Notice c6t0 and c9t4 point to the same STMS device name (e.g. c6t0d0 and c9t4d0 point to c29t60080E5000184350000003614C5EBE23d0).
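
When there are many LUNs, it can be easier to group the mapping by STMS device name and confirm that every pseudo device really has two paths. Here is a minimal Python sketch of that check, using the output above as input; the function name is hypothetical and the parsing assumes the two-column layout shown.

```python
from collections import defaultdict

# Output of 'stmsboot -L' copied from above (header line included).
STMSBOOT_OUTPUT = """\
 non-STMS device name                    STMS device name
/dev/rdsk/c6t0d0        /dev/rdsk/c29t60080E5000184350000003614C5EBE23d0
/dev/rdsk/c6t0d2        /dev/rdsk/c29t60080E5000184350000003654C5EBE44d0
/dev/rdsk/c6t0d1        /dev/rdsk/c29t60080E50001845C6000003524C5EBE2Ed0
/dev/rdsk/c6t0d3        /dev/rdsk/c29t60080E50001845C6000003564C5EBE4Cd0
/dev/rdsk/c9t4d0        /dev/rdsk/c29t60080E5000184350000003614C5EBE23d0
/dev/rdsk/c9t4d2        /dev/rdsk/c29t60080E5000184350000003654C5EBE44d0
/dev/rdsk/c9t4d1        /dev/rdsk/c29t60080E50001845C6000003524C5EBE2Ed0
/dev/rdsk/c9t4d3        /dev/rdsk/c29t60080E50001845C6000003564C5EBE4Cd0
"""


def paths_by_stms_device(text):
    """Map each STMS (pseudo) device to the non-STMS paths that lead to it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Keep only the two-column device rows; this skips the header line.
        if len(fields) == 2 and all(f.startswith("/dev/rdsk/") for f in fields):
            non_stms, stms = fields
            groups[stms].append(non_stms)
    return dict(groups)


groups = paths_by_stms_device(STMSBOOT_OUTPUT)
for stms, paths in groups.items():
    print(f"{stms} <- {', '.join(paths)}")
```

Each of the four pseudo devices should list one c6t0 path and one c9t4 path.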

Oracle ASM

Create an ASM diskgroup by providing the MPxIO pseudo device names.

Initialize each disk with 'dd if=/dev/zero of=/dev/rdsk/c29t60080E5000184350000003614C5EBE23d0s0 bs=1024k count=10'.

Set the owner to oracle with 'chown oracle:dba /dev/rdsk/c29t60080E5000184350000003614C5EBE23d0s0', then create the diskgroup:
create diskgroup diskgroup_name external redundancy disk
   '/dev/rdsk/c29t60080E5000184350000003614C5EBE23d0s0' size 10g,
   '/dev/rdsk/c29t60080E5000184350000003654C5EBE44d0s0' size 10g,
   '/dev/rdsk/c29t60080E50001845C6000003524C5EBE2Ed0s0' size 10g,
   '/dev/rdsk/c29t60080E50001845C6000003564C5EBE4Cd0s0' size 10g;

System messages from failover

When one of the paths fails on the array side, 'sscs list -a array_name fcport' shows that one of the ports is down.

And /var/adm/messages file reports:

Aug 10 11:17:33 host1 emlxs: [ID 349649] [ 5.0314]emlxs0: NOTICE: 710: Link down.
Aug 10 11:19:03 host1 fctl: [ID 517869 kern.warning] WARNING: fp(2)::OFFLINE timeout
Aug 10 11:19:22 host1 scsi: [ID 243001] /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fcp2):
Aug 10 11:19:22 host1    offlining lun=3 (trace=0), target=ef (trace=2800004)
Aug 10 11:19:22 host1 scsi: [ID 243001] /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fcp2):
Aug 10 11:19:22 host1    offlining lun=2 (trace=0), target=ef (trace=2800004)
Aug 10 11:19:22 host1 scsi: [ID 243001] /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fcp2):
Aug 10 11:19:22 host1    offlining lun=1 (trace=0), target=ef (trace=2800004)
Aug 10 11:19:22 host1 scsi: [ID 243001] /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fcp2):
Aug 10 11:19:22 host1    offlining lun=0 (trace=0), target=ef (trace=2800004)
Aug 10 11:19:22 host1 genunix: [ID 408114] /pci@6,600000/SUNW,emlxs@0/fp@0,0/ssd@w20140080e51845c6,1f (ssd0) offline
Aug 10 11:19:22 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e50001845c6000003564c5ebe4c (ssd165) multipath status: degraded, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,3 is offline Load balancing: round-robin
Aug 10 11:19:22 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e5000184350000003654c5ebe44 (ssd166) multipath status: degraded, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,2 is offline Load balancing: round-robin
Aug 10 11:19:22 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e50001845c6000003524c5ebe2e (ssd167) multipath status: degraded, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,1 is offline Load balancing: round-robin
Aug 10 11:19:22 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e5000184350000003614c5ebe23 (ssd168) multipath status: degraded, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,0 is offline Load balancing: round-robin
Aug 10 11:19:22 host1 scsi: [ID 243001] /scsi_vhci (scsi_vhci0):
Aug 10 11:19:22 host1    Initiating failover for device ssd (GUID 60080e5000184350000003614c5ebe
Aug 10 11:19:23 host1 scsi: [ID 243001] /scsi_vhci (scsi_vhci0):
Aug 10 11:19:23 host1    Failover operation completed successfully for device ssd (GUID 60080e5000184350000003614c5ebe23): failed over from primary to secondary
Aug 10 11:23:10 host1 emlxs: [ID 349649] [ 5.0536]emlxs0: NOTICE: 720: Link up. (8Gb, loop, initiator)
Aug 10 11:23:10 host1 genunix: [ID 936769] ssd0 is /pci@6,600000/SUNW,emlxs@0/fp@0,0/s
Aug 10 11:23:11 host1 genunix: [ID 408114] /pci@6,600000/SUNW,emlxs@0/fp@0,0/ssd@w20140080e51845c6,1f (ssd0) online
Aug 10 11:23:11 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e50001845c6000003564c5ebe4c (ssd165) multipath status: optimal, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,3 is standby Load balancing: round-robin
Aug 10 11:23:11 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e5000184350000003654c5ebe44 (ssd166) multipath status: optimal, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,2 is online Load balancing: round-robin
Aug 10 11:23:11 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e50001845c6000003524c5ebe2e (ssd167) multipath status: optimal, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,1 is standby Load balancing: round-robin
Aug 10 11:23:11 host1 genunix: [ID 834635] /scsi_vhci/ssd@g60080e5000184350000003614c5ebe23 (ssd168) multipath status: optimal, path /pci@6,600000/SUNW,emlxs@0/fp@0,0 (fp2) to target address: w20140080e51845c6,0 is standby Load balancing: round-robin
Aug 10 11:23:11 host1 scsi: [ID 243001] /scsi_vhci (scsi_vhci0):
Aug 10 11:23:11 host1    Initiating failover for device ssd (GUID 60080e5000184350000003614c5ebe
Aug 10 11:23:13 host1    Failover operation completed successfully for device ssd (GUID 60080e5000184350000003614c5ebe23): failed over from secondary to primary
Aug 10 11:23:13 host1 scsi: [ID 243001] /scsi_vhci (scsi_vhci0):
Aug 10 11:23:13 host1    Auto failback operation succeeded for device ssd (GUID 60080e5000184350

That's it. Try this for yourself.

Friday Oct 23, 2009

Wiki on performance best practices

A fantastic source of technical Best Practices is at

This wiki hosts the combined wisdom of many performance engineers from across Sun. It has information about hardware, software, ZFS, Oracle, and various other performance topics. The wiki attempts to categorize and present information so that it is easy to find and use. It is just getting started, so please let us know if there are any topics that would be useful.

Thursday Oct 15, 2009

Oracle Flash Cache - SGA Caching on Sun Storage F5100

Overview and Significance of Results

Oracle and Sun's Flash Cache technology combines new features in Oracle with the Sun Storage F5100 to improve database performance. In Oracle databases, the System Global Area (SGA) is a group of shared memory areas that are dedicated to an Oracle “instance” (Oracle processes in execution sharing a database). All Oracle processes use the SGA to hold information. The SGA is used to store incoming data (data and index buffers) and internal control information that is needed by the database. The size of the SGA is limited by the size of the available physical memory.

This benchmark tested and measured the performance of a new Oracle Database 11g (Release 2) feature, which allows the SGA size and caching to be extended beyond physical memory to a large flash memory storage device such as the Sun Storage F5100 flash array.

One particular benchmark test demonstrated a dramatic performance improvement (almost 5x) using the Oracle Extended SGA feature on flash storage by reaching SGA sizes in the hundreds of GB range, at a more reasonable cost than equivalently sized RAM and with much faster access times than disk I/O.

The workload consisted of a high volume of SQL select transactions accessing a very large table in a typical business-oriented OLTP database. To obtain a baseline, throughput and response times were measured by applying the workload against a traditional storage configuration constrained by disk I/O demand (a DB working set of about 3x the size of the data cache in the SGA). The workload was then executed with an added Sun Storage F5100 Flash Array configured to contain an Extended SGA of incrementally increasing size.

The tests showed throughput scaling along with increasing Flash Cache size.

Table of Results

F5100 Extended SGA Size (GB)   Query Txns / Min   Avg Response Time (Secs)   Speedup Ratio
No                             76338              0.118                      N/A
25                             169396             0.053                      2.2
50                             224318             0.037                      2.9
75                             300568             0.031                      3.9
100                            357086             0.025                      4.6
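
The speedup column can be recomputed from the throughput column. The sketch below, in Python, divides each throughput by the baseline (no Extended SGA) figure. The published ratios appear to be truncated rather than rounded to one decimal (357086 / 76338 is about 4.68); that reading is an inference from the numbers, not something the results state, and the helper names are made up for illustration.

```python
import math

BASELINE_TXNS_PER_MIN = 76338  # run without the F5100 Extended SGA

# Extended SGA size (GB) -> query transactions per minute, from the table.
RESULTS = {25: 169396, 50: 224318, 75: 300568, 100: 357086}


def speedup(txns_per_min, baseline=BASELINE_TXNS_PER_MIN):
    """Throughput relative to the baseline run."""
    return txns_per_min / baseline


def truncate_1dp(value):
    """Drop everything after the first decimal (assumed table convention)."""
    return math.floor(value * 10) / 10


ratios = {size: truncate_1dp(speedup(txns)) for size, txns in RESULTS.items()}
for size, ratio in ratios.items():
    print(f"{size:>3} GB: {speedup(RESULTS[size]):.3f} -> {ratio}")
```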

Configuration Summary

Server Configuration:

    Sun SPARC Enterprise M5000 Server
    8 x SPARC64 VII 2.4GHz Quad Core
    96 GB memory

Storage Configuration:

    8 x Sun Storage J4200 Arrays, 12x 146 GB 15K RPM disks each (96 disks total)
    1 x Sun Storage F5100 Flash Array

Software Configuration:

    Oracle 11gR2
    Solaris 10

Benchmark Description

The workload consisted of a high volume of SQL select transactions accessing a very large table in a typical business-oriented OLTP database.

The database consisted of various tables: Products, Customers, Orders, Warehouse Inventory (Stock) data, etc. The Stock table alone was 3x the size of the db cache.

To obtain a baseline, throughput and response times were measured by applying the workload against a traditional storage configuration constrained by disk I/O demand. The workload was then executed with an added Sun Storage F5100 Flash Array configured to contain an Extended SGA of incrementally increasing size.

During all tests, the in-memory SGA data cache was limited to 25 GB.

The Extended SGA was allocated on a “raw” Solaris volume created with the Solaris Volume Manager (SVM) on a set of devices (Flash Modules) residing on the Sun Storage F5100 flash array.

Key Points and Best Practices

In order to verify the performance improvement brought by extended SGA, the feature had to be tested with a large enough database size and with a workload requiring significant disk I/O activity to access the data. For that purpose, the size of the database needed to be a multiple of the physical memory size, avoiding the case in which the accessed data could be entirely or almost entirely cached in physical memory.

The above represents a typical “use case” in which the Flash Cache Extension is able to show remarkable performance advantages.

If the DB dataset is already entirely cached, the DB I/O demand is not significant, the application is already saturating the CPU with non-database processing, or large data caching is not productive (DSS-type queries), the Extended SGA may not improve performance.

It is also relevant to know that additional memory structures needed to manage the Extended SGA are allocated in the “in memory” SGA, therefore reducing its data caching capacity.

Increasing the Extended Cache beyond a specific threshold, dependent on various factors, may reduce the benefit of widening the Flash SGA and actually reduce the overall throughput.

This new cache is somewhat similar architecturally to the L2ARC on ZFS. Once written, flash cache buffers are read-only, and updates are only done into main memory SGA buffers. This feature is expected to primarily benefit read-only and read-mostly workloads.

A typical sizing of database flash cache is 2x to 10x the size of SGA memory buffers. Note that header information is stored in the SGA for each flash cache buffer (100 bytes per buffer in exclusive mode, 200 bytes per buffer in RAC mode), so the number of available SGA buffers is reduced as the flash cache size increases, and the SGA size should be increased accordingly.
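
To get a feel for how much SGA the buffer headers consume, here is a rough Python estimate. It is only a sketch under stated assumptions: an 8 KB database block size (not given in this article), one header per block-sized flash cache buffer, and sizes counted in binary gigabytes. The function name is hypothetical.

```python
GIB = 1024 ** 3


def flash_cache_header_bytes(flash_cache_bytes, block_size=8192, rac=False):
    """Estimate the SGA memory used by flash cache buffer headers.

    Uses the per-buffer figures quoted above: 100 bytes in exclusive
    mode, 200 bytes in RAC mode. block_size=8192 is an assumption.
    """
    per_buffer = 200 if rac else 100
    buffers = flash_cache_bytes // block_size
    return buffers * per_buffer


# A 100 GB flash cache, the largest size tested above.
overhead = flash_cache_header_bytes(100 * GIB)
print(f"exclusive mode: {overhead / GIB:.2f} GiB of SGA for headers")
print(f"RAC mode:       {flash_cache_header_bytes(100 * GIB, rac=True) / GIB:.2f} GiB")
```

Under these assumptions a 100 GB flash cache costs a bit over 1 GB of SGA in exclusive mode, which is noticeable next to a 25 GB in-memory data cache.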

Two new init.ora parameters have been introduced, illustrated below:

    db_flash_cache_file = /lfdata/lffile_raw
    db_flash_cache_size = 100G
The db_flash_cache_file parameter takes a single file name, which can be a file system file, a raw device, or an ASM volume. The db_flash_cache_size parameter specifies the size of the flash cache. Note that for raw devices, the partition being used should start at cylinder 1 rather than cylinder 0 (to avoid the disk's volume label).

Disclosure Statement

Results as of October 10, 2009 from Sun Microsystems.

Tuesday Jul 14, 2009

Vdbench: Sun StorageTek Vdbench, a storage I/O workload generator.

Vdbench is written in Java (and a little C) and runs on Solaris SPARC and x86, Windows, AIX, Linux, zLinux, HP-UX, and OS X.

I wrote the SPC1 and SPC2 workload generator using the Vdbench base code for the Storage Performance Council:

Vdbench is a disk and tape I/O workload generator, allowing detailed control over numerous workload parameters like:


  • For raw disk (and tape) and large disk files:
      o Read vs. write
      o Random vs. sequential or skip-sequential
      o I/O rate
      o Data transfer size
      o Cache hit rates
      o I/O queue depth control
      o Unlimited number of concurrent devices and workloads
      o Compression (tape)
  • For file systems:
      o Number of directories and files
      o File sizes
      o Read vs. write
      o Data transfer size
      o Directory create/delete, file create/delete
      o Unlimited number of concurrent file systems and workloads

Single host or Multi-host:

All work is centrally controlled, running either on a single host or on multiple hosts concurrently.


Centralized reporting, reporting and reporting, based on the simple idea that you can't understand the performance of a workload unless you can see the detail. If you just look at run totals, you'll miss the fact that for some reason the storage configuration was idle for several seconds or even minutes!

  • Second-by-second detail of the performance statistics accumulated by Vdbench, for the total workload and for each individual logical device used by Vdbench.
  • For Solaris SPARC and x86: second-by-second detail of Kstat statistics for the total workload and for each physical LUN or NFS-mounted device used.
  • All Vdbench reports are HTML files. Just point your browser to the summary.html file in your Vdbench output directory and all the reports link together.
  • Swat (another of my tools) allows you to display performance charts of the data created by Vdbench: just start SPM, then 'File' 'Import Vdbench data'.
  • Vdbench will (optionally) automatically call Swat to create JPG files of your performance charts.
  • Vdbench has a GUI that will allow you to compare the results of two different Vdbench workload executions. It shows the differences between the two runs in different grades of green, yellow and red. Green is good, red is bad.

Data Validation:

Data Validation is a highly sophisticated methodology to assure data integrity by always writing unique data contents to each block and then doing a compare after the next read or before the next write. The history tables, containing information about what is written where, are maintained in memory and optionally in journal files. Journaling allows data to be written to disk in one execution of Vdbench with Data Validation and then continued in a future Vdbench execution to make sure that after a system shutdown all data is still there. Great for testing mirrors: write some data using journaling, break the mirror, and have Vdbench validate the contents of the mirror.
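
The idea can be illustrated with a toy model. The Python sketch below is not how Vdbench implements Data Validation; it only shows the principle of writing a unique, reproducible pattern per block and write generation, remembering the generation in a history table, and comparing on the next read or before the next write. The class, the block size and the in-memory list that stands in for a disk are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 512  # hypothetical block size for the toy model


class ValidatedStore:
    """Toy block store that validates contents against a history table."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks  # stands in for a disk
        self.history = {}  # block number -> generation last written

    @staticmethod
    def pattern(block_no, generation):
        """Unique, reproducible contents for (block, generation)."""
        digest = hashlib.sha256(f"{block_no}:{generation}".encode()).digest()
        return (digest * (BLOCK_SIZE // len(digest) + 1))[:BLOCK_SIZE]

    def verify(self, block_no):
        """Compare a block with what the history table says it should hold."""
        if block_no not in self.history:
            return True  # never written, nothing to compare
        expected = self.pattern(block_no, self.history[block_no])
        return self.blocks[block_no] == expected

    def write(self, block_no):
        # Compare before the next write, then store a new unique pattern.
        if not self.verify(block_no):
            raise IOError(f"data corruption detected in block {block_no}")
        generation = self.history.get(block_no, 0) + 1
        self.blocks[block_no] = self.pattern(block_no, generation)
        self.history[block_no] = generation

    def read(self, block_no):
        # Compare after the read.
        data = self.blocks[block_no]
        if not self.verify(block_no):
            raise IOError(f"data corruption detected in block {block_no}")
        return data


store = ValidatedStore(num_blocks=4)
store.write(2)
store.write(2)                       # second generation for block 2
print(store.verify(2))               # contents match the history table
store.blocks[2] = bytes(BLOCK_SIZE)  # simulate lost data
print(store.verify(2))               # mismatch is detected
```

Persisting the history dictionary to a file would play the role of the journal: a later run could reload it and validate blocks written by an earlier run.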

I/O Replay

A disk I/O workload traced using Swat (another of my tools) can be replayed using Vdbench on any test system to any type of storage. This allows you to trace a production I/O workload, bring the trace data to your lab, and then replay your I/O workload on whatever storage you want. Want to see how the storage performs when the I/O rate doubles? Vdbench Replay will show you. With this you can test your production workload without the hassle of having to get your database software and licenses, your application software, or even your production data on your test system.

For more detailed information about Vdbench go to where you can download the documentation or the latest GA version of Vdbench.

You can find continuing updates about Swat and Vdbench on my blog:

Henk Vandenbergh

PS: If you're wondering where the name Vdbench came from :  Henk Vandenbergh benchmarking.

Storage performance and workload analysis using Swat.

Swat (Sun StorageTek Workload Analysis Tool) is a host-based, storage-centric Java application that thoroughly captures, summarizes, and analyzes storage workloads for both Solaris and Windows environments.

This tool was written to help Sun’s engineering, sales and service organizations and Sun’s customers understand storage I/O workloads.

Swat can be used for, among many other purposes:

  • Problem analysis
  • Configuration sizing (simply buying x GB of storage won't do anymore)
  • Trend analysis: is my workload growing, and can I identify/resolve problems before they happen?

Swat is storage agnostic, so it does not matter what type or brand of storage you are trying to report on. Swat reports the host's view of the storage performance and workload, using the same Kstat (Solaris) data that iostat uses.

Swat consists of several different major functions:

  • Swat Performance Monitor (SPM)
  • Swat Trace Facility (STF)
  • Swat Trace Monitor (STM)
  • Swat Real Time Monitor
  • Swat Local Real Time Monitor
  • Swat Reporter

Swat Performance Monitor (SPM):

Works on Solaris and Windows. An attempt has been made in the current Swat 3.02 to also collect data on AIX and Linux. Swat 3.02 also reports Network Adapter statistics on Solaris, Windows, and Linux. A Swat Data Collector (agent) runs on some or all of your servers/hosts, collecting I/O performance statistics every 5, 10, or 15 minutes and writing the data to a disk file, with one new file every day, automatically switched at midnight.

The data then can be analyzed using the Swat Reporter.

Swat Trace Facility (STF):

For Solaris and Windows. STF collects detailed I/O trace information. This data then goes through a data Extraction and Analysis phase that generates hundreds or thousands of second-by-second statistics counters. That data can then be analyzed using the Swat Reporter. You create this trace for between 30 and 60 minutes, for instance at a time when you know you will have a performance problem.

A disk I/O workload traced using Swat can be replayed on any test system to any type of storage using Vdbench (another of my tools). This allows you to trace a production I/O workload, bring the trace data to your lab, and then replay that I/O workload on whatever storage you want. Want to see how the storage performs when the I/O rate doubles or triples? Vdbench Replay will show you. With this you can test your production workload without the hassle of having to get your database software and licenses, your application software and licenses, or even your production data.

Note: STF is currently limited to the collection of about 20,000 IOPS. Some development effort is required to handle the current increase in IOPS made possible by Solid State Devices (SSDs).

Note: STF, while collecting the trace data, is the only Swat function that requires root access. This functionality is all handled by a single KSH script, which can be run independently. (The script uses TNF and ADB.)

Swat Trace Monitor (STM):

With STF you need to know when the performance problem will occur so that you can schedule the trace data to be collected. Not every performance problem, however, is predictable. STM runs an in-memory trace and monitors the overall storage performance. Once a certain threshold is reached, for instance a response time greater than 100 milliseconds, the in-memory trace buffer is dumped to disk, and the trace then continues collecting data for a number of seconds before terminating.
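
The mechanism is essentially a ring buffer with a trigger. The following Python sketch illustrates that idea only and is not Swat code: the class and parameter names are made up, the trigger is checked per event rather than against overall storage performance, and the post-trigger phase counts events instead of seconds to keep the example deterministic.

```python
from collections import deque


class TraceMonitor:
    """Keep the most recent events in memory; dump them when a threshold trips."""

    def __init__(self, buffer_size, threshold_ms, post_trigger_count):
        self.buffer = deque(maxlen=buffer_size)  # in-memory trace (ring buffer)
        self.threshold_ms = threshold_ms
        self.post_trigger_count = post_trigger_count
        self.dumped = None      # becomes the saved trace once triggered
        self.remaining = None   # events still to collect after the trigger

    def record(self, event, response_ms):
        """Record one I/O event. Returns False once tracing should stop."""
        if self.dumped is None:
            self.buffer.append((event, response_ms))
            if response_ms > self.threshold_ms:
                # Threshold reached: dump the buffer, then keep collecting.
                self.dumped = list(self.buffer)
                self.remaining = self.post_trigger_count
            return True
        if self.remaining > 0:
            self.dumped.append((event, response_ms))
            self.remaining -= 1
        return self.remaining > 0


monitor = TraceMonitor(buffer_size=3, threshold_ms=100, post_trigger_count=2)
for event, response_ms in enumerate([5, 6, 7, 8, 150, 9, 10, 11]):
    if not monitor.record(event, response_ms):
        break
print(monitor.dumped)
```

The dump contains the last few events before the slow I/O, the slow I/O itself, and the events collected afterwards, which is what makes an unpredictable problem analyzable after the fact.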

Swat Real Time Monitor:

When a Data Collector is active on your current or any network-connected host, Swat Real Time Monitor will open a Java socket connection with that host, allowing you to actively monitor the current storage performance either from your local or any of your remote hosts.

Swat Local Real Time Monitor:

Local Real Time Monitor is the quickest way to start using Swat. Just enter './swat -l' and Swat will start a private Data Collector for your local system and then will show you exactly what is happening to your current storage workload. No more fiddling trying to get some useful data out of a pile of iostat output.

Swat Reporter:

The Swat Reporter ties everything together. All data collected by the above Swat functions can be displayed using this powerful GUI reporting and charting function. You can generate hundreds of different performance charts or tabulated reports giving you intimate understanding of your storage workload and performance. Swat will even create JPG files for you that then can be included in documents and/or presentations. There is even a batch utility (Swat Batch Reporter) that will automate the JPG generation for you. If you want, Swat will even create a script for this batch utility for you.

Some of the many available charts:

  • Response time per controller or device
  • I/O rate per controller or device
  • Read percentage
  • Data transfer size
  • Queue depth
  • Random vs. sequential (STF only)
  • CPU usage
  • Device skew
  • Etc. etc.

Swat has been written in Java. This means that once your data has been collected on its originating system, the data can be displayed and analyzed using the Swat Reporter on ANY Java-enabled system, including any type of laptop.

For more detailed information go to (long URL) where you can download the latest release, Swat 3.02.

You can find continuing updates about Swat and Vdbench on my blog:

Henk Vandenbergh

Wednesday Jun 24, 2009

I/O analysis using DTrace


In almost any commercial application, the role of storage is critical, in terms of capacity, performance and data availability. The future roadmap of the storage industry shows disk storage getting cheaper, capacity getting larger, and SAS technology acquiring wider adoption.

Disk performance is becoming more important as CPU power has increased and the CPU is more rarely the bottleneck. Newly available analysis tools provide a simpler way to understand the internal workings of the I/O subsystem within the Operating System, making it easier to solve I/O performance issues.

Solaris's powerful, open-source DTrace toolset allows any user interested in solving I/O performance issues to observe how I/Os are carried out by the Operating System and related drivers. The ever-increasing set of DTrace features and its adoption by various vendors are strong proof of its usefulness.

Many performance problems can be captured and solved using basic performance tools, such as vmstat, mpstat, prstat and iostat. However, some problems require deeper analysis to be resolved, or require live analysis to solve performance problems and analyze events out of the ordinary.


DTrace allows analysis of real or simulated workloads of arbitrary complexity. There are several applications that can be used to generate the workload. My choice was to use Oracle database software, in order to generate a more interesting and realistic I/O workload.

An Oracle table was created similar to a fact table in a data warehousing environment, then loaded with data from a flat-file directly into the fact table. This is a write-intensive workload that generates almost all writes to the underlying disk storage.

Just a few important points about the whole setup:

  • Solaris 10 or OpenSolaris is required to run DTrace commands.
  • No filesystem was used for this experiment. Since there are several filesystem options, and the architectures of these filesystems differ, I might address filesystem performance analysis in a future blog.
  • Only one disk was used, to keep it very simple. However, I don't see any problem using the same techniques to diagnose I/O issues in a huge disk farm environment.
  • Although I used a workload similar to a real one, the main goal is to monitor the I/O subsystem using DTrace. So, let's not dive deep into the tricks used to generate the I/O load. They are simple enough to generate the target workload.
  • The scripts to setup the environment are in the zip-file.
  • An account with admin privilege can run DTrace commands. In fact, an account with just bare minimum DTrace privilege can run DTrace commands as well. For more details, check out the document here.

Live Analysis:

While the workload was being generated, iostat was started first, followed by the DTrace commands to get more details. Since iostat without any arguments wouldn't produce the desired details, extended statistics were requested. The options used are:

    iostat -cnxz 2

The quick summary of those arguments:

   -c : display CPU statistics
   -n : display device names in descriptive format
   -x : display extended device statistics
   -z : do not display lines with all zeros
    2 : display I/O statistics every 2 seconds

For more details, check out the man page.

About 6 SQL*Loader sessions were started to load the data, which in turn generated heavy write traffic to the disk where the table was located. Oracle SQL*Loader is a utility that loads flat-file data into a table. The disk was saturated with about 4-6 SQL*Loader sessions. The following iostat output was captured during the run:

 us sy wt id
 58  8  0 34
                    extended device statistics              
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    23.9  0.0  0.0    0.0    4.2   0   1 c0t1d0
    0.0  520.4    0.0 32535.2  0.0  4.9    0.0    9.4   2  98 c2t8d0
    0.0    1.5    0.0   885.1  0.0  0.0    0.0   13.5   0   1 c2t13d0
 us sy wt id
 57  7  0 36
                    extended device statistics              
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  546.7    0.0 31004.3  0.0  9.3    0.0   16.9   1  98 c2t8d0
    0.0    1.0    0.0   832.1  0.0  0.0    0.0   11.7   0   1 c2t13d0

The tablespace, where the target table was created, is on the disk device identified by c2t8d0. We can see that we are writing about 30MB/sec. Let's now use DTrace to explore further...
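
Reading these numbers off by eye gets tedious over a long run. As a rough illustration, this Python sketch parses the extended device statistics from the first sample above, picks the busiest device by %b, and converts kw/s (kilobytes written per second) to MB/s. The function name is hypothetical, and the parser assumes the column layout shown here.

```python
# First iostat sample from above, including the CPU lines it prints.
IOSTAT_SAMPLE = """\
 us sy wt id
 58  8  0 34
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    23.9  0.0  0.0    0.0    4.2   0   1 c0t1d0
    0.0  520.4    0.0 32535.2  0.0  4.9    0.0    9.4   2  98 c2t8d0
    0.0    1.5    0.0   885.1  0.0  0.0    0.0   13.5   0   1 c2t13d0
"""


def parse_iostat(text):
    """Return one dict per device row, keyed by the iostat column names."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "device":
            header = fields  # the column header precedes the device rows
        elif header and len(fields) == len(header):
            try:
                values = [float(f) for f in fields[:-1]]
            except ValueError:
                continue  # not a device row
            row = dict(zip(header[:-1], values))
            row["device"] = fields[-1]
            rows.append(row)
    return rows


rows = parse_iostat(IOSTAT_SAMPLE)
busiest = max(rows, key=lambda row: row["%b"])
write_mb_per_s = busiest["kw/s"] / 1024
print(f"{busiest['device']}: {busiest['%b']:.0f}% busy, "
      f"{write_mb_per_s:.1f} MB/s written")
```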

Let's start with very basic info. The following DTrace command prints the number of reads, writes and asynchronous I/Os initiated on the system. [In this example, and in all other examples in this article, white space has been added for readability. In real life, you would either put the whole dtrace command on a single line, or would put it in a file.]

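# dtrace -qn 
   'BEGIN { rio = 0; wio = 0; aio = 0; }   /* clause bodies reconstructed from the description; a sketch */
   io:::start {
      args[0]->b_flags & B_READ  ? rio++ : 0;
      args[0]->b_flags & B_WRITE ? wio++ : 0;
      args[0]->b_flags & B_ASYNC ? aio++ : 0;
   }
   tick-1s {
      printf("%Y: reads :%8d, writes :%8d, async-ios :%8d\n", walltimestamp, rio, wio, aio); 
      rio = 0; wio = 0; aio = 0;
   }'
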
2009 Jun 15 13:11:39: reads :       0, writes :     578, async-ios :     465
2009 Jun 15 13:11:40: reads :       0, writes :     526, async-ios :     520
2009 Jun 15 13:11:41: reads :       0, writes :     529, async-ios :     528
2009 Jun 15 13:11:42: reads :       0, writes :     533, async-ios :     519
2009 Jun 15 13:11:43: reads :       0, writes :     779, async-ios :     622
2009 Jun 15 13:11:44: reads :       0, writes :     533, async-ios :     472
2009 Jun 15 13:11:45: reads :       0, writes :     511, async-ios :     513
2009 Jun 15 13:11:46: reads :       0, writes :     538, async-ios :     531
2009 Jun 15 13:11:47: reads :       0, writes :     620, async-ios :     503
2009 Jun 15 13:11:48: reads :       0, writes :     512, async-ios :     504
2009 Jun 15 13:11:49: reads :       0, writes :     533, async-ios :     526
2009 Jun 15 13:11:50: reads :       0, writes :     565, async-ios :     523
2009 Jun 15 13:11:51: reads :       0, writes :     662, async-ios :     539
2009 Jun 15 13:11:52: reads :       0, writes :     527, async-ios :     524
2009 Jun 15 13:11:53: reads :       0, writes :     501, async-ios :     503
2009 Jun 15 13:11:54: reads :       0, writes :     587, async-ios :     531
2009 Jun 15 13:11:55: reads :       0, writes :     585, async-ios :     532

The DTrace command used above has a few sections. The BEGIN clause is executed only once, when the script starts. The io:::start clause is executed every time an I/O is initiated. The tick-1s clause is executed every second.

The above output clearly shows that the system has been issuing a lot of asynchronous I/O to write data. It is easy to verify this behavior by running a truss command against one of those background Oracle processes. Since it's confirmed that only 'writes' are issued, not 'reads', the following commands will focus only on 'writes'.

We will now look at the histogram of write I/O sizes using DTrace's aggregation feature. The aggregation function quantize() is used to generate the histogram data. The quick summary of DTrace aggregation functions can be found here.

# dtrace -qn 
   'io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/ { 
       @[args[1]->dev_pathname] = quantize(args[0]->b_bcount); 
    }
    tick-10s {
       exit(0)   /* body assumed: stop after 10 seconds so the aggregation is printed */
    }'

           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@                               1324     
           16384 |                                         0        
           32768 |@@@@@@@@                                 1032     
           65536 |@@@@@@@@@@@@@@@@@                        2182     
          131072 |@@@@                                     518      
          262144 |                                         6        
          524288 |                                         0        

Before interpreting the output, let's examine the command used. The two new conditions in the predicate (the expression between forward slashes) restrict tracing to a) processes whose executable name is "oracle" and b) write requests. The quantize() aggregation function sorts the I/O sizes into power-of-two buckets and counts the requests that fall into each bucket.
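The bucketing that quantize() performs can be sketched in a few lines of shell. This is only an illustration of the power-of-two rule for positive values, not DTrace's implementation; the function name quantize_bucket is made up for this example.

```shell
# quantize_bucket VALUE
# Print the power-of-two bucket a positive VALUE falls into:
# the largest power of two that is less than or equal to VALUE.
# (Hypothetical helper; it mimics quantize() for positive values only.)
quantize_bucket() {
    value=$1
    bucket=1
    while [ $((bucket * 2)) -le "$value" ]; do
        bucket=$((bucket * 2))
    done
    echo "$bucket"
}

quantize_bucket 8192      # an 8KB write lands in the 8192 row
quantize_bucket 98304     # a 96KB write lands in the 65536 row
quantize_bucket 131071    # still the 65536 row; 131072 starts the next one
```

A row labeled 65536 in the histogram therefore counts every write of 65536 to 131071 bytes.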

The DTrace output has a physical device name. It is easy to map it to the device that is used at the application layer.

# ls -l /dev/rdsk/c2t8d0s0
lrwxrwxrwx   1 root     root          54 Feb  8  2008 
   /dev/rdsk/c2t8d0s0 -> ../../devices/pci@3,700000/pci@0/scsi@8,1/sd@8,0:a,raw

That's it. The device used by the database is in fact a symbolic link to the device that showed up in the DTrace output. The histogram shows that the largest share of the write requests, about 43%, are between 65536 bytes (64KB) and 131071 bytes (~128KB) in size, and very few requests are 256KB or larger. Since the database software and the OS are capable of writing up to about 1MB in a single I/O request, there is an opportunity to look closely at the database configuration and OS kernel parameters to optimize the I/O size. This is one of the useful DTrace commands that can be used to observe how much throughput a device can deliver. More importantly, it helps discover the I/O (write) pattern of an application.
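The share of write requests in each size bucket can be worked out directly from the counts in the histogram. The following is a small sketch that uses the counts printed above; the variable and function names are made up for this example.

```shell
# Bucket counts copied from the quantize() output above, as "bucket:count".
counts="8192:1324 32768:1032 65536:2182 131072:518 262144:6"

# bucket_share BUCKET
# Print the percentage of all writes that fall into BUCKET.
# (Hypothetical helper for this illustration.)
bucket_share() {
    for pair in $counts; do
        echo "$pair"
    done | awk -F: -v want="$1" '
        { total += $2; if ($1 == want) hit = $2 }
        END { printf "%.1f\n", 100 * hit / total }'
}

bucket_share 65536    # prints 43.1
bucket_share 8192     # prints 26.2
```

Out of 5062 writes in the 10-second sample, 2182 fall in the 64KB bucket and only 6 are 256KB or larger.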

We will now look at the latency histogram.

# dtrace -qn '
   io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/ { 
       io_start[args[0]->b_edev, args[0]->b_blkno] = timestamp; 
   }
   io:::done / (io_start[args[0]->b_edev, args[0]->b_blkno]) && (args[0]->b_flags & B_WRITE) / { 
       @[args[1]->dev_pathname, args[2]->fi_name] = 
           quantize((timestamp - io_start[args[0]->b_edev, args[0]->b_blkno]) / 1000000); 
       io_start[args[0]->b_edev, args[0]->b_blkno] = 0; 
   }
   tick-10s {
       /* assumed: stop after 10 seconds; the aggregation prints on exit */
       exit(0);
   }'

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  orasystem                                         
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_003                                       
           value  ------------- Distribution ------------- count    
               0 |                                         0        
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      7        
               2 |@@@@@                                    1        
               4 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_001                                       
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |@@@@@@@@@@                               2        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           6        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_002                                       
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  oraundo                                           
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |                                         1        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  271      
               8 |@                                        5        
              16 |                                         2        
              32 |                                         0        

The preceding DTrace command uses two probes, io:::start followed by io:::done. The io:::start clause is executed every time an I/O is initiated, and the io:::done clause is executed when an I/O completes. Since these two clauses fire for every I/O event, we want to make sure that the I/O tracked at completion is the same I/O that was tracked at initiation. This is why the command keys each I/O by its device number (b_edev) and its block number (b_blkno). In most cases, this gives us the result we want. The elapsed time is divided by 1000000 to convert nanoseconds to milliseconds.

However, the elapsed time histogram output shows data for all files except the device that we are interested in. Why? I believe it's due to the effect of asynchronous I/O. (Expect a future blog about monitoring asynchronous I/O.) The output would have been much different if synchronous I/Os had been issued to transfer data. It is still interesting to see that there were some writes to other database files. For example, the output shows that the response time of most writes to the various database files is under 8 milliseconds.
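The start/done pairing used by the latency command can be illustrated outside DTrace. The following is a minimal sketch, not the method used in this experiment. It assumes a hypothetical text log with one event per line in the form "start|done DEVICE BLOCK TIMESTAMP_NS" and matches each completion to its start by the (device, block) key, just as the io_start[] array does.

```shell
# pair_io_events
# Read hypothetical event lines "start|done DEVICE BLOCK TIMESTAMP_NS" on
# standard input and print "DEVICE BLOCK LATENCY_MS" for every completion
# that has a recorded start. Completions without a start are ignored,
# like the io_start[...] test in the io:::done predicate.
pair_io_events() {
    awk '
        $1 == "start" { started[$2, $3] = $4 }
        $1 == "done" && (($2, $3) in started) {
            print $2, $3, ($4 - started[$2, $3]) / 1000000
            delete started[$2, $3]
        }'
}

# Two overlapping writes that complete out of order.
printf '%s\n' \
    "start 32,0 1000 5000000" \
    "start 32,0 2000 6000000" \
    "done 32,0 2000 9000000" \
    "done 32,0 1000 12000000" | pair_io_events
```

For the sample input, the write to block 2000 is reported with a latency of 3 ms and the write to block 1000 with 7 ms, even though they complete in the opposite order from which they were issued.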

Finally, let's look at the disk blocks (LBAs) that are accessed most frequently.

# dtrace -qn '
   io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/ { 
       @["block access count", args[1]->dev_pathname] = quantize(args[0]->b_blkno); 
   }
   tick-10s {
       /* assumed: stop after 10 seconds; the aggregation prints on exit */
       exit(0);
   }'

  block access count                                  /devices/pci@3,700000/pci@0/scsi@8,1/sd@8,0:a     
           value  ------------- Distribution ------------- count    
         4194304 |                                         0        
         8388608 |@@@                                      324      
        16777216 |                                         8        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    4237     
        67108864 |                                         0        


We see that blocks in the range 33554432 to 67108863 are accessed much more frequently than any other block range listed above. This command is needed only occasionally. It is most useful when an application provides a mapping between its own blocks and the physical blocks, which makes it possible to identify a hot block at the disk level and where it is located.
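To get a feel for where the hot range sits on the device, the block numbers can be converted to byte offsets. The sketch below assumes that b_blkno counts 512-byte blocks; that block size is an assumption made for this example and should be checked for the device in question. The function name is also made up.

```shell
# block_to_gib BLOCK
# Convert a block number to an offset in GiB, rounded down.
# ASSUMPTION: blocks are 512 bytes each.
BLOCK_SIZE=512

block_to_gib() {
    echo $(( $1 * BLOCK_SIZE / 1073741824 ))
}

block_to_gib 33554432    # start of the hot bucket
block_to_gib 67108864    # start of the next bucket
```

Under that assumption, the busiest bucket covers the region from 16 GiB up to 32 GiB from the start of the device.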


We can observe how easy it is to take a close look at the OS/physical layer when an application is busy with its disk-based workload. The DTrace collection can be widened or narrowed by using various filter options. For example, read-intensive workloads can be observed by flipping a filter flag from B_WRITE to B_READ.

Additional references:

If you are interested in any specific topics similar to above experiment, feel free to drop a note in the comments section.

Friday Jun 19, 2009

Pointers to Java Performance Tuning resources

On Madhu Konda's Weblog he writes: For more info, please see Madhu Konda's Weblog.

SSDs in HPC: Reducing the I/O Bottleneck BluePrint Best Practices

The performance of High-Performance Computing (HPC) applications can be dramatically increased by simply using SSDs instead of traditional hard drives. To read about these findings, see the Sun BluePrint by Larry McIntosh and Michael Burke, called "Solid State Drives in HPC: Reducing the I/O Bottleneck".

There was a BestPerf blog posting on the NASTRAN/SSD results at:

Our BestPerf authors will blog about more of their recent benchmarks in the coming weeks.

Wednesday Jun 10, 2009

Using Solaris Resource Management Utilities to Improve Application Performance

The SPECjAppServer2004 benchmark is a very complex benchmark produced by the Standard Performance Evaluation Corporation (SPEC). Servers measured with this benchmark exercise all major Java 2 Enterprise Edition (J2EE) technologies, including transaction management, database connectivity, web containers, and Enterprise JavaBeans. The benchmark heavily exercises the hardware, software, and network, as hundreds or thousands of JOPS (jAppServerOperationsPerSecond) are loaded onto Systems Under Test (SUTs).

This article introduces some of the Solaris resource management utilities that are used for this benchmark. These utilities may be useful to system managers who are responsible for complex servers. The author has applied these features to improve performance when using multiple instances of J2EE application server software with the SPECjAppServer2004 benchmark.

In SPECjAppServer2004 benchmark results submitted by Sun Microsystems, you can find references to Solaris Resource Management features such as Containers, Zones, Processor sets, and Scheduling classes. The recently published results for the Sun Fire T5440 and the Sun Fire T5140 servers use many of these features.

Solaris Resource Management utilities are used to provide isolation of applications and better management of system resources. There are a number of publications which describe many of the features and benefits. The Sun Solaris Container Administration Guide and Sun Zones Blueprint are two of many sources of good information.

Solaris Containers

Looking at the first benchmark publication listed above, the Sun Fire T5440 server was configured with 8 Solaris Containers where each container or zone was setup to host a single application server instance. By hosting an application server instance in a container, the memory and network resources used by that instance are virtually isolated from the memory and network resources used by other instances running in separate containers.

While running the application software in a zone does not directly increase performance, using Containers with this benchmark workload makes it easier to manage multiple J2EE instances. When combined with the techniques below, using Solaris Containers can be an effective environment to help improve application performance.

Note that many Solaris performance utilities can be used to monitor and report process information for the configured zones, such as prstat with the -Z option.

Processor Sets

The System Administration Guide for Solaris Containers discusses the use of Resource Pools to partition machine resources. A resource pool is a configuration mechanism used to implement a processor set, optionally combined with a scheduling class, and to associate it with a zone. When configuring a resource pool, the administrator specifies the minimum and maximum CPU resources for the pool, and the system creates the processor set from this information. The Resource Pool can then be configured with a specific zone using the zonecfg(1M) utility. However, in some scenarios, the processor IDs selected for the resource pool may span multiple CPU chips, and thus may not make the most efficient use of caches or access to local memory.

For the configurations in the published results, each Solaris Container was bound to a unique processor set, where each processor set was composed of 4 UltraSPARC T2 Plus cores. Since each UltraSPARC T2 Plus core consists of 8 hardware strands, each CPU chip was partitioned into two processor sets of 32 processor IDs. The processor sets were created by specifying the 32 processor IDs as an argument to the psrset(1M) command, as shown in the following example:

% psrset -c 32-63

The command above instructs Solaris to create a processor set using virtual processor numbers 32 through 63 from 4 cores of an UltraSPARC T2 Plus CPU chip. With a total of four UltraSPARC T2 Plus CPU chips, the Sun Fire T5440 system was configured to use 7 processor sets of 4 cores each. The remaining 4 cores (virtual processor numbers 0-31) remained in the default processor set, as there must be at least 1 virtual processor ID in the default set.
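The processor ID ranges for the seven processor sets follow from simple arithmetic: 4 cores of 8 strands give 32 IDs per set. The sketch below only prints the psrset commands and does not run them. It assumes the virtual processor IDs are numbered contiguously from 0 to 255 with each group of 4 cores occupying 32 consecutive IDs; that should be verified with psrinfo on the actual system. The helper name pset_range is made up for this example.

```shell
# Hypothetical helper for illustration; assumes contiguous processor IDs.
STRANDS_PER_CORE=8
CORES_PER_SET=4

# pset_range GROUP
# Print the virtual processor ID range for the GROUPth block of 4 cores.
# Group 0 (IDs 0-31) is left in the default processor set.
pset_range() {
    ids_per_set=$((STRANDS_PER_CORE * CORES_PER_SET))
    first=$(( $1 * ids_per_set ))
    last=$(( first + ids_per_set - 1 ))
    echo "$first-$last"
}

# Print (not execute) the command for each of the 7 processor sets.
for group in 1 2 3 4 5 6 7; do
    echo "psrset -c $(pset_range "$group")"
done
```

The first line printed matches the example above (psrset -c 32-63) and the last covers IDs 224-255.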

Looking at the Sun Fire T5440 System Architecture, each UltraSPARC T2 Plus CPU chip has 4 MB of L2 cache shared by all 8 cores in the chip. Each UltraSPARC T2 Plus CPU also has direct links to 16 DIMM slots of local memory, with access to the remaining (remote) memory DIMMs through an External Coherency Hub. Data references to local memory are generally slightly faster, because any data access through an External Coherency Hub incurs a small added latency, as Denis indicates. This combination of CPU hardware and physically local memory is treated by Solaris as a Locality Group. Solaris attempts to allocate physical memory pages from the same locality group associated with the CPU executing the application process/thread. To help reduce latency for data accesses by an application, processor sets are a simple and effective means to co-locate data accesses within an L2 cache and a Locality Group boundary.

To use a Container with a specific processor set requires binding the processes running in the Container to the specified processor set. This can be done using the pgrep and psrset commands. Use pgrep -z ZONENAME to obtain the list of process IDs currently running in the specified zone. Then use psrset -b PSET PID to bind a process ID obtained earlier using pgrep to the specified processor set as shown in the following example:

% for PID in `pgrep -z ZONENAME`;  do psrset -b PSET_ID $PID;  done

Scheduling Class

Solaris offers a number of different process scheduling classes to execute user processes, which are administered using the utilities dispadmin(1M) and priocntl(1M). The default is the Time Sharing or TS scheduling class. However, many benchmark results have made use of the Fixed Priority or FX scheduling class. The dispadmin command can be used to list the classes supported on the system with their associated priority and time quantum parameters. Processes normally running in the TS class can be run in the FX class using the priocntl command with either of the following methods:

% priocntl -e -c FX <COMMAND> <ARGS>


% priocntl -s -c FX -i pid <PID>

The first case executes a command starting in the FX class and the second case changes the scheduling class of a running process using the process ID.

The article FX for Databases discusses this subject for the database application space in some detail. Similar considerations apply to J2EE application software. Running the application server instances in the FX scheduling class has been shown to reduce the number of context switches and help improve overall throughput.

Additional Sources:

Solaris™ Internals: Solaris 10 and OpenSolaris Kernel Architecture Second Edition by Richard McDougall and Jim Mauro

Solaris Best Practices


SPEC and SPECjAppServer are registered trademarks of Standard Performance Evaluation Corporation. Results from  as of 6/10/09.


BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
