Thursday Apr 22, 2010

Vdbench file system testing WITHOUT using 'format=yes'

Just found a bug in Vdbench 5.02 around file system testing.

When you do not have Vdbench create all the files using the 'format=yes' parameter, but instead code fileselect=sequential,fileio=sequential,operation=write in the File system Workload Definition (FWD) to simulate the format, Vdbench only creates the first 'threads=n' files and then overwrites them again and again. If you are in desperate need of a fix, let me know.


Vdbench and SSD alignment

In recent months I have heard a lot about solid state devices not being properly aligned to the expected data transfer sizes. Each OS has its own way of creating volumes and partitions, so figuring out whether everything is neatly aligned is not an easy job. Add to that the possibility that the OS thinks everything is in order while, with virtual volumes, the alignment is off in one of the many layers of software further down the line.

Without really being interested in the 'how to figure it all out and how to fix alignment issues' I created a small Vdbench parameter file that will allow you to at least figure out whether things are properly aligned or not. It revolves around the use of the Vdbench 'offset=' parameter that allows you to artificially change the alignment from Vdbench's point of view.

If your SSDs are on a storage subsystem that has a large cache, make sure that your volume is much larger than that cache. You really need to make sure you are getting your data from the SSD, not from cache.



These are the 'avg' lines:

offset=0000   avg_2-3   19223.00    75.09    4096 100.00    1.580    2.803    0.231     1.1   0.9
offset=0512   avg_2-3    3655.50    14.28    4096 100.00    8.772    9.473    0.067     0.3   0.2
offset=1024   avg_2-3    3634.00    14.20    4096 100.00    8.784    9.390    0.064     0.3   0.2
offset=1536   avg_2-3    3633.00    14.19    4096 100.00    8.799    9.472    0.062     0.3   0.2
offset=2048   avg_2-3    3614.50    14.12    4096 100.00    8.831    9.440    0.066     0.3   0.2
offset=2560   avg_2-3    3604.00    14.08    4096 100.00    8.852    9.477    0.067     0.2   0.2
offset=3072   avg_2-3    3602.50    14.07    4096 100.00    8.853    9.430    0.059     0.3   0.2
offset=3584   avg_2-3    3597.50    14.05    4096 100.00    8.888    9.468    0.069     0.2   0.2
offset=4096   avg_2-3   20050.50    78.32    4096 100.00    1.584    2.811    0.231     1.0   0.9

As you can see, the runs with offset=0 and offset=4096 deliver more than five times the throughput of the others. This tells me that this volume is properly aligned.
If, for instance, the results showed that offset=512 performs best, the volume is on a 512-byte offset.
To then run properly 4k-aligned tests with Vdbench, add the 'offset=' parameter with a value of 512 to all your runs, and Vdbench, after generating each lba, will always add 512.
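
Picking the best offset out of the 'avg' lines can also be done mechanically. The sketch below is not part of Vdbench; it simply parses the lines shown above, assuming the third field is the I/O rate (which matches the MB/sec column next to it at a 4096-byte transfer size). The function names are made up for this example.

```python
# The 'avg' lines from the runs above, copied verbatim.
AVG_LINES = """\
offset=0000   avg_2-3   19223.00    75.09    4096 100.00    1.580    2.803    0.231     1.1   0.9
offset=0512   avg_2-3    3655.50    14.28    4096 100.00    8.772    9.473    0.067     0.3   0.2
offset=1024   avg_2-3    3634.00    14.20    4096 100.00    8.784    9.390    0.064     0.3   0.2
offset=1536   avg_2-3    3633.00    14.19    4096 100.00    8.799    9.472    0.062     0.3   0.2
offset=2048   avg_2-3    3614.50    14.12    4096 100.00    8.831    9.440    0.066     0.3   0.2
offset=2560   avg_2-3    3604.00    14.08    4096 100.00    8.852    9.477    0.067     0.2   0.2
offset=3072   avg_2-3    3602.50    14.07    4096 100.00    8.853    9.430    0.059     0.3   0.2
offset=3584   avg_2-3    3597.50    14.05    4096 100.00    8.888    9.468    0.069     0.2   0.2
offset=4096   avg_2-3   20050.50    78.32    4096 100.00    1.584    2.811    0.231     1.0   0.9
"""


def parse_avg_lines(text):
    """Return {offset_in_bytes: io_rate} for every 'offset=' line."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("offset="):
            continue
        offset = int(fields[0].split("=", 1)[1])
        rates[offset] = float(fields[2])  # third field: assumed to be the I/O rate
    return rates


def best_offset(rates, block_size=4096):
    """Return the offset within one block that gave the highest I/O rate."""
    best_by_remainder = {}
    for offset, rate in rates.items():
        remainder = offset % block_size
        best_by_remainder[remainder] = max(rate, best_by_remainder.get(remainder, 0.0))
    return max(best_by_remainder, key=best_by_remainder.get)


rates = parse_avg_lines(AVG_LINES)
aligned = best_offset(rates)
slowest_aligned = min(r for off, r in rates.items() if off % 4096 == aligned)
fastest_other = max(r for off, r in rates.items() if off % 4096 != aligned)

print("best offset:", aligned)
print("aligned runs are at least %.1fx faster" % (slowest_aligned / fastest_other))
```

An answer of 0 means the volume is aligned as is; any other answer is the value to try with 'offset='.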

Monday Apr 19, 2010

Running high IOPS against a single lun/SSD

On Solaris, and I expect the same with other operating systems, whenever an I/O is requested some process-level lock is set. This means that if you try to run very high IOPS, this lock can become 100% busy, causing all threads that need this lock to start spinning. The end result is twofold: high CPU utilization and/or lower than expected IOPS.

This is not a new problem. It was discovered several years ago when storage subsystems became fast enough to handle 5000 IOPS and more. Since that time CPUs have become much faster, and Solaris code has been enhanced several times to lower the need for and the duration of these locks. I have seen Vdbench runs where we were able to do 100k IOPS without problems.

Vdbench is written in Java, and Java runs as a single process. Vdbench therefore introduced what is called multi-JVM mode, the ability of Vdbench to split the requested workload over multiple JVMs (Java Virtual Machines).

By default Vdbench starts one JVM for each 5000 IOPS requested, with a maximum of 8, and no more than one per Storage Definition (SD). The 5000-number probably should be changed some day; it is a leftover of the initial discovery of this problem.
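
The default rule described above can be written down as a small calculation. This is only a sketch of the rule as stated in this post, not the actual Vdbench code; in particular, rounding up and the minimum of one JVM are assumptions, and the function name is made up.

```python
import math

IOPS_PER_JVM = 5000  # from the text: one JVM for each 5000 IOPS requested
MAX_JVMS = 8         # from the text: a maximum of 8


def default_jvm_count(requested_iops, sd_count):
    """Sketch of the default JVM count; rounding up is an assumption."""
    by_rate = math.ceil(requested_iops / IOPS_PER_JVM)
    # No more than 8, no more than one per SD, and (assumed) at least one.
    return max(1, min(by_rate, MAX_JVMS, sd_count))


print(default_jvm_count(20000, 4))    # 4
print(default_jvm_count(100000, 16))  # 8
print(default_jvm_count(100000, 1))   # 1
```

The last call shows the case this post is about: however high the requested rate, a single SD gets a single JVM.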

So, when you ask for iorate=max with only a single SD and you’re lucky enough to be running against a Solid State Device (SSD) guess what: you may run into this locking problem.

To work around this you have to override the default JVM count:

  • Specify hd=localhost,jvms=nn. I suggest you request one JVM for each 50k IOPS that you expect.
  • Or add ‘-m nn’ as an execution parameter, for instance ‘-m4’.

There is one exception though: 100% sequential workloads using the seekpct=sequential or seekpct=eof Workload Definition (WD) parameter. A sequential workload will only run in one single JVM. This prevents, for instance, two JVMs from producing a workload that looks like this: read block 1,1,2,2,3,3,4,4,5,5, etc. The performance numbers of course will look great because the second read of a block is guaranteed to be a cache hit, but this is not really a valid sequential workload.
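
To make the 1,1,2,2,3,3 pattern concrete, here is a small simulation. It is an illustration only, assuming both JVMs run at exactly the same speed so that their reads interleave one for one; none of these function names exist in Vdbench.

```python
from itertools import chain


def sequential_blocks(count):
    """Block numbers one JVM reads for a 100% sequential workload."""
    return list(range(1, count + 1))


def interleave(streams):
    """Merge the streams round-robin, as if every JVM ran at the same speed."""
    return list(chain.from_iterable(zip(*streams)))


def repeat_fraction(blocks):
    """Fraction of reads that hit a block already read earlier in the run."""
    seen = set()
    repeats = 0
    for block in blocks:
        if block in seen:
            repeats += 1
        seen.add(block)
    return repeats / len(blocks)


one_jvm = sequential_blocks(5)
two_jvms = interleave([sequential_blocks(5), sequential_blocks(5)])

print(one_jvm)                    # [1, 2, 3, 4, 5]
print(two_jvms)                   # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(repeat_fraction(two_jvms))  # 0.5
```

Half of the reads in the two-JVM case go to a block that was read a moment earlier, which is exactly the cache-hit effect that makes the numbers look better than they should.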


Tuesday Apr 13, 2010

Swat Trace Facility (STF) memory problems during the Analyze phase.


During the Analyze phase STF keeps thirty seconds' worth of I/O detail in memory. That is done so that when a trace probe of an I/O completion is found, STF can still find the I/O start probe if that occurred less than thirty seconds earlier. Believe me, in some very difficult customer problem scenarios I have seen I/Os that have taken that long.

If you run 5000 iops, keeping thirty seconds worth of detailed information in memory is not a real problem. Even 10k or 20k will work fine.

And now we have solid state devices, running 100k, 200k iops and up. Just do the math, and you know you will run into memory problems.

There is an undocumented option in STF to lower this 30-second value. In STF, go to the Settings tab, click on ‘batch_prm’, enter ‘-a5’, click ‘Save’, and run the Analyze phase again. Of course, -a2 works fine too. Note that any I/O that takes longer than the new ‘age’ value you specified will not be recognized by STF, but usually two seconds should just be fine.
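
"Doing the math" looks roughly like this. The number of I/Os held in memory follows directly from the I/O rate and the age value; the memory size per I/O is not something STF documents, so BYTES_PER_IO below is a purely hypothetical figure used only to show how the total scales.

```python
BYTES_PER_IO = 200  # hypothetical size of one in-memory I/O record


def ios_kept_in_memory(iops, age_seconds=30):
    """Number of I/Os held in memory for a given rate and age value."""
    return iops * age_seconds


def estimated_megabytes(iops, age_seconds=30, bytes_per_io=BYTES_PER_IO):
    """Rough memory estimate; only as good as the assumed record size."""
    return ios_kept_in_memory(iops, age_seconds) * bytes_per_io / (1024 * 1024)


for iops in (5_000, 20_000, 200_000):
    for age in (30, 5, 2):  # the default, -a5 and -a2
        print(f"{iops:>7} iops, age {age:>2}s: "
              f"{ios_kept_in_memory(iops, age):>9} I/Os, "
              f"~{estimated_megabytes(iops, age):.0f} MB")
```

At 200k iops the default age keeps 6,000,000 I/Os in memory; with -a2 that drops to 400,000.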


Thursday Apr 01, 2010

Vdbench and multipath I/O

A question just asked: does Vdbench support multipath I/O?

Vdbench tries to be "just any other user application". To make sure that Vdbench is portable across multiple platforms I decided to depend on the OS for doing i/o. All I need from the OS is a working open/close/read/write() function and I am happy.
Just imagine writing a simple customer application, my favorite (and I am sure yours too): the payroll application. Would you as a programmer have to worry about HOW things get read or written? Those days are over when the application had to know and handle this level of detail.
That's now the role of the OS.

So, Vdbench is not multi pathing aware whatsoever. It's the OS's responsibility.


AIX shared libraries for Vdbench 5.02

Thank you IBM!

Please place both library files in your /vdbench502/aix/ directory.


Heartbeat timeouts during Vdbench startup

One of the most complex things in Vdbench (and Swat also) on Solaris is translating file names, volume names and device numbers to the proper Kstat instance names. To accomplish that I take the output of iostat and of the ls /dev/rdsk command, and match and merge those outputs. Never realizing how bad that could get, I did not bother with efficiency, so I do numerous sequential lookups in both the iostat and the ls output. And then of course I run into iostat output with several thousand lines, and ls output with at least that many. At times it can take over two minutes to do all this matching and merging, causing heartbeat timeouts during Vdbench startup. Add -d27 as an extra execution parameter to change the default two-minute heartbeat timeout to 15 minutes.

The next release of Vdbench will have a fix for this problem by creating hash tables for all the ls and iostat output.
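
The difference between the two approaches can be sketched as follows. This is not the Vdbench code (which is Java) and the input format is heavily simplified: the pairs of names below are hypothetical stand-ins for whatever is extracted from the real ls and iostat output.

```python
def match_sequential(ls_entries, iostat_entries):
    """Current approach: scan the whole iostat list for every ls entry."""
    matches = {}
    for device_path, key in ls_entries:
        for iostat_key, instance in iostat_entries:
            if iostat_key == key:
                matches[device_path] = instance
                break
    return matches


def match_hashed(ls_entries, iostat_entries):
    """Planned approach: build a hash table once, then do direct lookups."""
    by_key = {}
    for key, instance in iostat_entries:
        by_key.setdefault(key, instance)  # keep the first match, like the scan
    return {path: by_key[key] for path, key in ls_entries if key in by_key}


# Hypothetical, simplified inputs: (device path, lookup key) pairs from 'ls'
# and (lookup key, Kstat instance name) pairs from 'iostat'.
ls_entries = [(f"/dev/rdsk/disk{i}", f"key{i}") for i in range(2000)]
iostat_entries = [(f"key{i}", f"sd{i}") for i in reversed(range(2000))]

slow = match_sequential(ls_entries, iostat_entries)
fast = match_hashed(ls_entries, iostat_entries)
print(len(fast), slow == fast)
```

Both functions return the same mapping; the first does on the order of n * n comparisons, the second on the order of n.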


Tuesday Mar 16, 2010

Mac OS X shared library for Vdbench 5.02

Here is a copy of the Mac shared library for Vdbench 5.02. Download this file and place it in your /vdbench502/mac/ directory.


Monday Mar 08, 2010

Vdbench Replay run not terminating properly

The i/o captured in an i/o trace created by Sun StorageTek Workload Analysis Tool (Swat) can be replayed using Vdbench. I just noticed a problem: when you give Vdbench more Storage Definitions (SDs) than needed, Vdbench does not terminate after the last i/o has been replayed. Each SD in Vdbench keeps track of when it executes its last i/o and then checks with all other SDs to see if they are all done too. If one or more of the SDs is never used, this 'end-of-run' checking is not done, causing Vdbench to just wait until the elapsed= time is reached.

To correct this, remove the last 'n' SDs. A quick way to see if an SD is used is by looking at the SD reports. If all interval counters are zero then you know it has not been used.
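
If you have many SDs, the "all interval counters are zero" check is easy to automate once you have the counters. The sketch below assumes you have already pulled the per-interval i/o counts out of each SD report into a dictionary; that input format and the function name are made up for this example.

```python
def unused_sds(interval_counters):
    """Return the names of SDs whose interval counters are all zero."""
    return sorted(name for name, counts in interval_counters.items()
                  if not any(counts))


# Hypothetical per-interval i/o counts, one list per SD report.
reports = {
    "sd1": [120, 130, 90],
    "sd2": [80, 0, 75],   # idle for one interval, but still used
    "sd3": [0, 0, 0],
    "sd4": [0, 0, 0],
}

print(unused_sds(reports))  # ['sd3', 'sd4']
```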


Thursday Feb 11, 2010

Vdbench, problems with the patterns= parameter

I have written a blog entry about problems with the patterns= parameter before, even mentioning that I may no longer support it. I have since concluded that I need to continue supporting it, though in a different format from the one in older versions, where you could specify 127 different data patterns.

In Vdbench 5.01 and 5.02 (brand new), patterns= works as follows: patterns=/pattern/dir where file name '/pattern/dir/default' gets picked up, and its contents stored in the data buffer used for writing data. That works.

However, (and these things always happen when it is too late) a few hours after I did the last build of Vdbench 5.02 I realized that yes, I put the pattern in the buffer, but I use the same buffer for reading which means that if you have a mixed read/write workload your data pattern can be overlaid by whatever data is on your disk. Since the pattern is copied only once into your buffer all new writes will NOT contain this pattern. So, until I fix this, if you want a fixed pattern to be written, do not include reads in your test.

In normal operations I use only a single data buffer, both for reads and writes. This is done to save on the amount of memory needed during the run. Loads of luns * Loads of threads = Loads of memory. This now needs to change when using specific data patterns.
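
The effect of sharing one buffer is easy to reproduce in a few lines. This is a toy model, not Vdbench code: the two-block "disk", the pattern and the function name are all made up, and the second mode simply shows what a separate read buffer would change.

```python
PATTERN = b"PATTERN!"


def run(ops, separate_read_buffer):
    """Replay (operation, lba) pairs and return the data of every write."""
    disk = [bytearray(b"old-blk0"), bytearray(b"old-blk1")]
    write_buf = bytearray(PATTERN)  # the pattern is copied in only once
    read_buf = bytearray(len(PATTERN)) if separate_read_buffer else write_buf
    written = []
    for op, lba in ops:
        if op == "read":
            read_buf[:] = disk[lba]   # a read overlays the buffer contents
        else:
            disk[lba][:] = write_buf  # a write sends whatever the buffer holds
            written.append(bytes(write_buf))
    return written


ops = [("write", 0), ("read", 1), ("write", 0)]

print(run(ops, separate_read_buffer=False))  # [b'PATTERN!', b'old-blk1']
print(run(ops, separate_read_buffer=True))   # [b'PATTERN!', b'PATTERN!']
```

With the shared buffer the second write sends the data that the read just brought in, not the pattern.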



Vdbench 5.02 now available for download

Vdbench 5.02 contains numerous large and small enhancements.

- Data Validation for file system testing. For years Vdbench has had huge success running Data Validation against raw disks or files. Release 5.02 now introduces the same powerful Data Validation functionality for its file system testing.
- Numerous enhancements to deal with multi-client file system performance testing.
- A Data Validation post-processing GUI, giving you a quick understanding of what data is corrupted and (possibly) why.

For more detail, see

Questions or problems? Contact me at



Friday Jan 29, 2010

Vdbench running prtdiag, cfgadm, etc, slowing down vdbench startup

The objective of Vdbench is to measure storage performance. When you save the performance information generated by Vdbench for future use, it can happen that six months down the road you ask yourself: "what were the details of the system status at the time of this run?" Each time it runs on Solaris, Vdbench executes the '' script distributed in the /solaris/ or /solx86/ subdirectory. This script includes commands like prtdiag and cfgadm; the output is stored in file 'config.html'.

With systems getting larger and more complex every day, these commands can sometimes take 30-60 seconds to complete, delaying the actual start of Vdbench.

If you do not care about recording this data, create a file named 'noconfig' in your Vdbench install directory, and Vdbench from that point on will bypass running ''.


Tuesday Jan 26, 2010

Vdbench: dangerous use of stopafter=100, possibly inflating throughput results.

In short: doing random I/O against very small files can inflate your throughput numbers.

When doing random I/O against a file using File system Workload Definition (FWD) parameters Vdbench needs to know when to stop using the currently selected file.
The ‘stopafter=’ parameter (default 100) tells Vdbench to stop after that many blocks. In Vdbench 5.02 you can also specify ‘stopafter=nn%’, meaning ‘nn%’ of the size of the file.

This all works great, but here’s the catch: if your file size is very small, for instance just 8k, the default stopafter=100 value will cause the same block to be read 100 times.

The stopafter= parameter was really only meant for large files, and this side effect was not anticipated.

For Vdbench 5.01, change ‘stopafter=’ to a value that matches the file size. ‘stopafter=’ allows only one fixed value, so if you have multiple different file sizes this won’t work for you.
For Vdbench 5.02 (beta), use stopafter=100%. This makes sure that you never read or write more blocks than the file contains.
I will modify 5.02 as soon as possible to change the default value to be no more than the current file size.
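
The arithmetic behind this is simple. In the sketch below the 8k transfer size is an assumption (it is what makes an 8k file exactly one block), and the function names are made up; the last function only illustrates the idea of capping the value at the file size, not how Vdbench implements it.

```python
def blocks_in_file(file_size, xfersize):
    """Number of whole blocks in the file, at least one."""
    return max(1, file_size // xfersize)


def accesses_per_block(file_size, xfersize, stopafter=100):
    """Average number of times each block is touched before moving on."""
    return stopafter / blocks_in_file(file_size, xfersize)


def capped_stopafter(file_size, xfersize, stopafter=100):
    """The idea behind stopafter=100%: never more blocks than the file has."""
    return min(stopafter, blocks_in_file(file_size, xfersize))


KB = 1024
print(accesses_per_block(8 * KB, 8 * KB))     # 100.0
print(accesses_per_block(1024 * KB, 8 * KB))  # 0.78125
print(capped_stopafter(8 * KB, 8 * KB))       # 1
```

The 8k file has its single block accessed 100 times, while in the 1 MB file most blocks are touched at most once.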

Note: 5.02 is currently only available (in beta) internally at Sun/Oracle.


Monday Dec 21, 2009

Shared library available for AIX 32 and 64bit

For 32 bit java, download and place it in /vdbench501fix1/aix/

For 64 bit java, download  and place it in /vdbench501fix1/aix/


Monday Dec 14, 2009

Vdbench and concurrent sequential workloads.

A sequential workload for me is the sequential reading of blocks 1,2,3,4,5 etc.

Running concurrent sequential workloads against the same lun or file then will result in reading blocks 1,1,2,2,3,3,4,4,5,5 etc, something that I have considered incorrect since day one of Vdbench.

When spreading out random I/O workloads across multiple Vdbench slaves, I allow each slave to do some of the random work. For sequential workloads however, the above issue forces me to make sure that only one slave receives a sequential workload. This is all transparent to the user.

That all has worked fine, until last Friday I received an email about the following Vdbench abort message: “rd=rd1,wd=wd2 not used. Could it be that more hosts have been requested than there are threads?”

It took me a while to figure this one out, until it became clear that this was caused by making sure that a sequential workload does not run more than once across slaves. In this case there were two different sequential workloads however that were specifically requested to run against the same device, one to read and one to write. The result was that Vdbench ignored the second workload without notifying the user. This was not a case of not spreading out the same workload across slaves, but instead there were two different sequential workloads.

Somewhere in the bowels of the Vdbench code is a check to make sure that I did not lose track of one or more workloads (believe me, it can get complex allowing multiple concurrent different workloads to run across different slaves and/or hosts). This code noticed that the second workload was not included at all. Therefore the “wd2 not used” abort.

So how do you get around this if you still really want to do this? The code above only looks at a 100% sequential workload (seekpct=0, seekpct=seq, or seekpct=eof). By specifying for instance seekpct=1 you tell Vdbench to generate a new random lba for, on average, 1% (one in a hundred) of the I/Os generated. Then, again on average, 100 blocks will be read or written sequentially. Specify seekpct=0.01 and a new random lba will be generated only once every 10,000 I/Os. This should suffice without changing the Vdbench logic.
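
The relation between seekpct and the length of a sequential run can be checked with a quick calculation and a small simulation. This is a sketch under the assumption that each I/O independently has a seekpct percent chance of getting a new random lba; the real Vdbench logic may differ in detail, and the function names are made up.

```python
import random


def average_run_length(seekpct):
    """Expected number of I/Os per random seek: 100 / seekpct."""
    return 100.0 / seekpct


def simulated_run_length(seekpct, io_count, seed=1):
    """Count random seeks over io_count I/Os and return I/Os per seek."""
    rng = random.Random(seed)
    seeks = sum(1 for _ in range(io_count) if rng.random() * 100 < seekpct)
    return io_count / seeks if seeks else float("inf")


print(average_run_length(1))            # 100.0
print(round(average_run_length(0.01)))  # 10000
print(simulated_run_length(1, 200_000))
```

The simulated value for seekpct=1 should land close to 100, which is the "100 blocks on average" mentioned above.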



Blog for Henk Vandenbergh, author of Vdbench, and Sun StorageTek Workload Analysis Tool (Swat). This blog is used to keep you up to date about anything revolving around Swat and Vdbench.

