Friday Jun 18, 2010

Vdbench and Swat: how to identify what is what when using 'ps -ef'

For obvious reasons I frequently have multiple Swat or Vdbench processes running, and sometimes get confused as to what is what. 'ps' output is not very helpful. I can maybe compare the heap size values, but I don't always remember them:

hvxxxx 21027 21008   0 09:02:01 pts/12      0:06 java -Xmx512m -Xms128m -cp ./:./classes:./swat.jar:./javachart.jar:./swing-layo
hvxxxx 21060 21041   0 09:02:04 pts/6       0:03 java -Xmx1024m -Xms512m -cp ./:./classes:./swat.jar:./javachart.jar:./swing-lay

A primitive little trick using the -D java parameter now makes my life easier. It shows me that I am running one background data collector and one local real-time monitor (swat -c and swat -l):

hvxxxx 21102 21083   0 09:04:49 pts/12      0:07 java -Dreq=-c -Xmx512m -Xms128m -cp ./:./classes:./swat.jar:./javachart.jar:./s
hvxxxx 21240 21221   0 09:06:01 pts/6       0:34 java -Dreq=-l -Xmx1024m -Xms512m -cp ./:./classes:./swat.jar:./javachart.jar:./

Here is the update to the swat script; you can make a similar change to the Vdbench script:

if ("$1" == "-t" || "$1" == "-p" || "$1" == "-l") then
  $java -Dreq=$1 -Xmx1024m -Xms512m -cp $cp Swt.swat $\*
else
  $java -Dreq=$1 -Xmx512m  -Xms128m -cp $cp Swt.swat $\*
endif



 Henk.

Wednesday Jun 09, 2010

Swat trace and prtvtoc

For Swat to be able to create a proper Replay parameter file it needs to know how large the luns are. For that, the prtvtoc command is run during trace creation for each lun found in /dev/rdsk/. That can get annoyingly slow when there are a lot of luns. To make this much faster, replace the 'prtvtoc' lines in tnfe.sh with:

# Generate prtvtoc data (needed to create a Replay parameter file)
# (devfsadm -C will clean up old garbage in /dev/rdsk)
printf "Running prtvtoc command\n"
ls /dev/rdsk/* > /tmp/tnfe1
nawk '{disk = substr($1,1,length($1)-2); if (disk != last) print $1; last = disk}' \
      /tmp/tnfe1 > /tmp/tnfe2
rm /tmp/tnf_prtvtoc.txt 2> /dev/null
while read disk;do
   echo Running prtvtoc $disk
   echo $disk       >> /tmp/tnf_prtvtoc.txt
   prtvtoc -h $disk >> /tmp/tnf_prtvtoc.txt 2>/dev/null
done  < /tmp/tnfe2
cp /tmp/tnf_prtvtoc.txt tnf_prtvtoc.txt


 Henk

Thursday May 13, 2010

Updates to Vdbench503 beta code

In this blog entry I'll try to keep you informed about the problems that have been found. Updates can be found on http://vdbench.org under 'View all files'.

  • rc2: All platforms: an abort in JNI (C code) when trying to display an error or warning message BEFORE the first workload is started. Message in file localhost-0.html: "A fatal error has been detected by the Java Runtime Environment". Fixed in rc3.
  • rc2+rc3: All platforms: for raw I/O functionality (using SDs), all writes done using the new default random data pattern are reported as reads, even though the write was correctly done. Fixed in rc4.
  • rc2-rc4: Similar problems, not only with the random data pattern, but also with dedup and compression. Fixed in rc5.

Wednesday May 12, 2010

First beta version of Vdbench 5.03, including Dedup and Compression

I placed vdbench503rcxxx on vdbench.org to give my users the opportunity to verify that there are no hidden problems in the code. Once I feel confident enough to make this code ready for GA I'll also distribute the source code.

Would you like to measure the performance differences when running:
- without dedup and compression
- with only dedup and no compression
- with only compression and no dedup
- with dedup and compression
- for dedupratio=1,2,3,4,5,...
- for compratio=1,2,3,4,5,...?

Then pick up a copy of vdbench503rcxxx from vdbench.org.
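
To give an idea of what such a comparison could look like, here is a minimal parameter-file sketch; the lun, thread count, transfer size, and ratio values are placeholders, and the only lines that matter for this discussion are dedupratio= and compratio=:

* Minimal sketch: rerun this with dedupratio= and/or compratio= removed,
* or with different ratio values, to compare the results.
dedupratio=2
compratio=3
sd=sd1,lun=/dev/rdsk/c7t0d0s4,threads=32
wd=wd1,sd=sd1,xfersize=4k,rdpct=0,seekpct=100
rd=rd1,wd=wd1,iorate=max,elapsed=60,interval=1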

For release notes see vdbench503_notes.html

For a tar or zip file, go to http://vdbench.org, select 'View all files', and look for vdbench503beta.

Any problems? Contact me at vdbench@sun.com

Henk.

Thursday Apr 22, 2010

Vdbench and SSD alignment, continued.

Of course, it took only a few minutes before someone asked 'how can I run this against files, not volumes?' Here is the response:

Just change the lun to a file name (use the same file name each time) and add a size.
Vdbench will then first create the file for you.
The problem is that you need to make sure Vdbench does not read from file system or file server cache, so the file size must be at least five times the system's cache size.
Unless of course you mount with directio, but then you still have the file server cache to deal with.
Just take your time and create a large file (I am using 100g); Vdbench will automatically create it for you.

BTW: the elapsed time needs to be long enough to make sure you get away from cache. I set it here to 60 seconds, which should be a good start.

Henk.

hd=default,jvms=1
sd=default,th=32
sd=default,size=100g
sd=sd_0000,lun=/dir/filename,offset=0000
sd=sd_0512,lun=/dir/filename,offset=0512
sd=sd_1024,lun=/dir/filename,offset=1024
sd=sd_1536,lun=/dir/filename,offset=1536
sd=sd_2048,lun=/dir/filename,offset=2048
sd=sd_2560,lun=/dir/filename,offset=2560
sd=sd_3072,lun=/dir/filename,offset=3072
sd=sd_3584,lun=/dir/filename,offset=3584
sd=sd_4096,lun=/dir/filename,offset=4096
wd=wd1,sd=sd_*,xf=4k,rdpct=100
rd=default,iorate=max,elapsed=60,interval=1,dist=d,wd=wd1
rd=rd_0000,sd=sd_0000
rd=rd_0512,sd=sd_0512
rd=rd_1024,sd=sd_1024
rd=rd_1536,sd=sd_1536
rd=rd_2048,sd=sd_2048
rd=rd_2560,sd=sd_2560
rd=rd_3072,sd=sd_3072
rd=rd_3584,sd=sd_3584
rd=rd_4096,sd=sd_4096


Vdbench file system testing WITHOUT using 'format=yes'

Just found a bug in Vdbench 5.02 around file system testing.

When you do not have Vdbench create all the files using the 'format=yes' parameter, but instead code fileselect=sequential,fileio=sequential,operation=write in the File system Workload Definition (FWD) to simulate the format, Vdbench only creates the first 'threads=n' files and then overwrites them again and again. If you are in desperate need of a fix, let me know.
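
For reference, this is roughly the kind of parameter file that runs into the problem; the anchor directory, file counts, and sizes below are just placeholders:

fsd=fsd1,anchor=/dir,depth=1,width=1,files=1000,size=1m
* Simulating the format with a sequential write pass instead of format=yes:
fwd=fwd1,fsd=fsd1,operation=write,fileio=sequential,fileselect=sequential,xfersize=64k,threads=4
rd=rd1,fwd=fwd1,fwdrate=max,format=no,elapsed=600,interval=1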

 Henk.

Vdbench and SSD alignment

These last months I have heard a lot about issues related to solid state devices not being properly aligned to the expected data transfer sizes. Each OS has its own way of creating volumes and partitions, so trying to figure out if everything is neatly aligned is not an easy job. Add to that the possibility that the OS thinks everything is in order, but that with virtual volumes the alignment somewhere down the line, in one of the many possible layers of software, is not accurate.

Without really being interested in the 'how to figure it all out and how to fix alignment issues' I created a small Vdbench parameter file that will allow you to at least figure out whether things are properly aligned or not. It revolves around the use of the Vdbench 'offset=' parameter that allows you to artificially change the alignment from Vdbench's point of view.

If your SSDs are on a storage subsystem that has a large cache, make sure that your volume is much larger than that cache. You really need to make sure you are getting your data from the SSD, not from cache.

Henk.

hd=default,jvms=1
sd=default,th=32
sd=sd_0000,lun=/dev/rdsk/c7t0d0s4,offset=0000
sd=sd_0512,lun=/dev/rdsk/c7t0d0s4,offset=0512
sd=sd_1024,lun=/dev/rdsk/c7t0d0s4,offset=1024
sd=sd_1536,lun=/dev/rdsk/c7t0d0s4,offset=1536
sd=sd_2048,lun=/dev/rdsk/c7t0d0s4,offset=2048
sd=sd_2560,lun=/dev/rdsk/c7t0d0s4,offset=2560
sd=sd_3072,lun=/dev/rdsk/c7t0d0s4,offset=3072
sd=sd_3584,lun=/dev/rdsk/c7t0d0s4,offset=3584
sd=sd_4096,lun=/dev/rdsk/c7t0d0s4,offset=4096
wd=wd1,sd=sd_*,xf=4k,rdpct=100
rd=default,iorate=max,elapsed=60,interval=1,dist=d,wd=wd1
rd=rd_0000,sd=sd_0000
rd=rd_0512,sd=sd_0512
rd=rd_1024,sd=sd_1024
rd=rd_1536,sd=sd_1536
rd=rd_2048,sd=sd_2048
rd=rd_2560,sd=sd_2560
rd=rd_3072,sd=sd_3072
rd=rd_3584,sd=sd_3584
rd=rd_4096,sd=sd_4096


These are the 'avg' lines (the columns after the offset label are: interval, i/o rate, MB/sec, bytes per i/o, read%, resp time, resp max, resp stddev, cpu% sys+usr, cpu% sys):

offset=0000   avg_2-3   19223.00    75.09    4096 100.00    1.580    2.803    0.231     1.1   0.9
offset=0512   avg_2-3    3655.50    14.28    4096 100.00    8.772    9.473    0.067     0.3   0.2
offset=1024   avg_2-3    3634.00    14.20    4096 100.00    8.784    9.390    0.064     0.3   0.2
offset=1536   avg_2-3    3633.00    14.19    4096 100.00    8.799    9.472    0.062     0.3   0.2
offset=2048   avg_2-3    3614.50    14.12    4096 100.00    8.831    9.440    0.066     0.3   0.2
offset=2560   avg_2-3    3604.00    14.08    4096 100.00    8.852    9.477    0.067     0.2   0.2
offset=3072   avg_2-3    3602.50    14.07    4096 100.00    8.853    9.430    0.059     0.3   0.2
offset=3584   avg_2-3    3597.50    14.05    4096 100.00    8.888    9.468    0.069     0.2   0.2
offset=4096   avg_2-3   20050.50    78.32    4096 100.00    1.584    2.811    0.231     1.0   0.9

As you can see, the runs with offset=0 and offset=4096 deliver more than five times the throughput of the others. This tells me that this volume is properly aligned.
If, for instance, the run results showed that offset=512 gives the best results, the volume is on a 512-byte offset.
To then run properly 4k-aligned tests with Vdbench, add to all your runs:
sd=default,offset=512
and Vdbench, after generating each lba, will always add 512.




Monday Apr 19, 2010

Running high IOPS against a single lun/SSD

On Solaris, and I expect the same with other operating systems, whenever an I/O is requested some process-level lock is obtained. This means that if you try to run very high IOPS, this lock can become 100% busy, causing all threads that need this lock to start spinning. The end result is twofold: high CPU utilization and/or lower than expected IOPS.

This is not a new problem. The problem was discovered several years ago when storage subsystems became fast enough to handle 5000 IOPS and more. Since that time CPUs have become much faster and the Solaris code has been enhanced several times to lower the need for and duration of these locks. I have seen Vdbench runs where we were able to do 100k IOPS without problems.

Vdbench is written in Java, and Java runs as a single process. Vdbench therefore introduced what is called multi-JVM mode, the ability of Vdbench to split the requested workload over multiple JVMs (Java Virtual Machines).

By default Vdbench starts one JVM for each 5000 IOPS requested, with a maximum of 8, and no more than one per Storage Definition (SD). The 5000-number probably should be changed some day; it is a leftover of the initial discovery of this problem.

So, when you ask for iorate=max with only a single SD and you're lucky enough to be running against a Solid State Device (SSD), guess what: you may run into this locking problem.

To work around this you have to override the default JVM count:

  • Specify hd=localhost,jvms=nn. I suggest you request one JVM for each 50k IOPS that you expect (see the sketch below).
  • Or add ‘-m nn’ as an execution parameter, for instance ‘-m4’.
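
As a sketch, assuming a single SSD against which you expect roughly 200k IOPS (the lun, thread count, and parameter file name below are placeholders):

* Ask for 4 JVMs (one per ~50k expected IOPS) instead of the default:
hd=localhost,jvms=4
sd=sd1,lun=/dev/rdsk/c7t0d0s4,threads=32
wd=wd1,sd=sd1,xfersize=4k,rdpct=100,seekpct=100
rd=rd1,wd=wd1,iorate=max,elapsed=60,interval=1

Running './vdbench -f parmfile -m4' instead of coding jvms= gives the same result.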

There is one exception though, and that is for 100% sequential workloads using the seekpct=sequential or seekpct=eof Workload Definition (WD) parameter. A sequential workload will only run in one single JVM. This is done to prevent the workload, for instance with two JVMs, from looking like this: read block 1,1,2,2,3,3,4,4,5,5, etc. The performance numbers of course will look great, because the second read of each block is guaranteed to be a cache hit, but this is not really a valid sequential workload.

Henk.

Tuesday Apr 13, 2010

Swat Trace Facility (STF) memory problems during the Analyze phase.

java.lang.OutOfMemoryError 

During the Analyze phase STF keeps thirty seconds worth of I/O detail in memory. That is done so that when a trace probe of an I/O completion is found, STF can still find the I/O start probe if that occurred less than thirty seconds earlier. Believe me, in some very difficult customer problem scenarios I have seen I/Os that have taken that long.

If you run 5000 IOPS, keeping thirty seconds worth of detailed information in memory is not a real problem. Even 10k or 20k will work fine.

And now we have solid state devices, running 100k, 200k IOPS and up. Just do the math, and you know you will run into memory problems.

There is an undocumented option in STF to lower this 30-second value. In STF, go to the Settings tab, click on ‘batch_prm’, enter ‘-a5’, click ‘Save’, and run the Analyze phase again. Of course, -a2 works fine too. Note that any I/O that takes longer than the new ‘age’ value you specified will not be recognized by STF, but usually two seconds should just be fine.

Henk.


Thursday Apr 01, 2010

Vdbench and multipath I/O

A question just asked: does Vdbench support multipath I/O?

Vdbench tries to be "just any other user application". To make sure that Vdbench is portable across multiple platforms I decided to depend on the OS for doing I/O. All I need from the OS is a working open/close/read/write() function and I am happy.
Just imagine writing a simple customer application, my favorite (and I am sure yours too): the payroll application. Would you as a programmer have to worry about HOW things get read or written? The days when the application had to know and handle this level of detail are over.
That's now the role of the OS.

So, Vdbench is not multipathing aware whatsoever. It's the OS's responsibility.

Henk.

AIX shared libraries for Vdbench 5.02

Thank you IBM!

Please place https://blogs.oracle.com/henk/resource/502/aix-32.so  and http://blogs.oracle.com/henk/resource/502/aix-64.so in your /vdbench502/aix/ directory.


Henk

Heartbeat timeouts during Vdbench startup

One of the most complex things in Vdbench (and also Swat) on Solaris is trying to translate file names, volume names, and device numbers to the proper Kstat instance names. To accomplish that I use the output of iostat and of the 'ls /dev/rdsk' command, and match and merge those outputs. Never realizing how bad that could be, I did not bother with efficiency, so I do numerous sequential lookups in both the iostat and ls output. And then of course I am finding iostat outputs with several thousand lines, and ls output with at least that many lines. At times it can take over two minutes to do all this matching and merging, causing heartbeat timeouts during Vdbench startup. Add -d27 as an extra execution parameter to change the default two-minute heartbeat timeout to 15 minutes.
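
As a small example (the parameter file name is a placeholder), the flag just goes on the command line:

# Raise the heartbeat timeout from the default 2 minutes to 15 minutes:
./vdbench -f parmfile -d27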

The next release of Vdbench will have a fix for this problem by creating hash tables for all the ls and iostat output.

Henk.

Tuesday Mar 16, 2010

MAC OS X shared library for Vdbench5.02

Here is a copy of the MAC shared library for Vdbench 5.02. Download this file and place it in your /vdbench502/mac/ directory.

Henk.

Monday Mar 08, 2010

Vdbench Replay run not terminating properly

The I/O collected in an I/O trace created by the Sun StorageTek Workload Analysis Tool (Swat) can be replayed using Vdbench. I just noticed a problem: when you give Vdbench more Storage Definitions (SDs) than needed, Vdbench does not terminate after the last I/O has been replayed. Each SD in Vdbench keeps track of when it executes its last I/O and then checks with all other SDs to see if they are all done. If one or more of the SDs is never used, this 'end-of-run' check is not done, causing Vdbench to just wait until the elapsed= time is reached.

To correct this, remove the last 'n' SDs. A quick way to see whether an SD is used is by looking at the SD reports: if all interval counters are zero, you know it has not been used.

Henk.

Thursday Feb 11, 2010

Vdbench, problems with the patterns= parameter

I have written a blog entry before about problems with the patterns= parameter, even mentioning that I might no longer support it. I have since concluded that I need to continue supporting it, though in a different format than the current one, where you (in older versions) could specify 127 different data patterns.

In Vdbench 5.01 and 5.02 (brand new), patterns= works as follows: patterns=/pattern/dir, where the file '/pattern/dir/default' gets picked up and its contents are stored in the data buffer used for writing data. That works.
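
As a sketch, assuming patterns= is coded as a general parameter at the top of the parameter file (the directory name and lun are placeholders), a write-only run with a fixed pattern would look something like this:

* The contents of /pattern/dir/default are copied into the write buffer:
patterns=/pattern/dir
sd=sd1,lun=/dev/rdsk/c7t0d0s4,threads=8
wd=wd1,sd=sd1,xfersize=4k,rdpct=0,seekpct=100
rd=rd1,wd=wd1,iorate=max,elapsed=60,interval=1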

However (and these things always happen when it is too late), a few hours after I did the last build of Vdbench 5.02 I realized that yes, I put the pattern in the buffer, but I use the same buffer for reading. This means that if you have a mixed read/write workload, your data pattern can be overlaid by whatever data is on your disk. Since the pattern is copied only once into your buffer, all new writes will NOT contain this pattern. So, until I fix this, if you want a fixed pattern to be written, do not include reads in your test.

In normal operations I use only a single data buffer, both for reads and writes. This is done to save on the amount of memory needed during the run: loads of luns * loads of threads = loads of memory. This now needs to change when using specific data patterns.

 

Henk.

About

Blog for Henk Vandenbergh, author of Vdbench and the Sun StorageTek Workload Analysis Tool (Swat). This blog is used to keep you up to date about anything revolving around Swat and Vdbench.
