Vdbench: workload skew
By Henk Vandenbergh-Oracle on Oct 14, 2008
In the past few weeks some questions have come up about what workload we expect Vdbench to run, and what the output reports show it is actually doing.
I spent several days trying to figure out how to explain everything. At first I planned five different blog entries, but then I realized that they all tie together in such a way that writing just one entry is the right thing to do.
The main problem that I am trying to explain is described in ‘Vdbench and sequential read/write’ but you’ll need to read the rest first.
So here is one entry, but with five chapters:
JVM: Java Virtual Machine.
Java runs as a single process, with the result that if you request too many IOPS within that single process you can run into some serious process lock contention.
When Vdbench was initially written Solaris would get bogged down when going above 5000 IOPS. Since then systems have become faster and huge performance improvements have been made to Solaris, so by now that IOPS limit per process is much higher. I have seen successful tests at 100,000 IOPS per process.
Vdbench Multi JVM processing allows Vdbench to start multiple copies of itself, thereby spreading the many IOPS over more than one process. By default Vdbench will start up to 8 JVMs, limited by the number of processors and the number of SDs.
You can override the number of JVMs by using either ‘hd=localhost,jvms=nn’ (Vdbench 5.00) or the ‘-m’ execution parameter.
In multi JVM mode, Vdbench has one master and one or more slaves. The master does all the parameter parsing and reporting and also does the scheduling of work. Vdbench 4.07 runs in multi JVM mode only if the amount of work requested is relatively small. Vdbench 5.00 always runs in multi JVM mode. (There used to be three different modes of operation in Vdbench: single JVM, multi JVM, and multi-host. These are now all encapsulated in one: multi-host, where it does not matter that there is only one host, the current one.)
Vdbench 4.07: SDs are given to the next JVM in a round-robin fashion; an SD is used by only one JVM.
Vdbench 5.00: SDs are given to each JVM. However, 100% sequential workloads that have a skew specified all go to the first JVM. This is done to prevent each JVM from reading the same blocks, e.g. blocks 1/1, 2/2, 3/3, etc. That would NOT be a proper sequential workload.
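To make the difference between the two versions concrete, here is a minimal Python sketch of the two ways of handing SDs to JVMs. This is not Vdbench code: the function names and the SD names are made up for illustration, and the special case for sequential workloads with a skew is left out.

```python
from typing import Dict, List


def assign_round_robin(sds: List[str], jvm_count: int) -> Dict[int, List[str]]:
    """Vdbench 4.07 style: each SD goes to exactly one JVM, in turn."""
    assignment: Dict[int, List[str]] = {jvm: [] for jvm in range(jvm_count)}
    for index, sd in enumerate(sds):
        assignment[index % jvm_count].append(sd)
    return assignment


def assign_to_all(sds: List[str], jvm_count: int) -> Dict[int, List[str]]:
    """Vdbench 5.00 style: every JVM gets every SD."""
    return {jvm: list(sds) for jvm in range(jvm_count)}


sds = ["sd1", "sd2", "sd3"]  # hypothetical SD names
old_style = assign_round_robin(sds, 2)
new_style = assign_to_all(sds, 2)
print("4.07:", old_style)
print("5.00:", new_style)
```

In the round-robin case a single busy SD can never get more than one JVM's worth of IOPS; in the second case the load on one SD is spread over all JVMs.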
The big change here is that in Vdbench 5.00 random workloads will be using multiple JVMs instead of only one. This has been done to accommodate the high IOPS solid-state devices.
When creating a Vdbench workload in a parameter file you can specify more than one workload to run at the same time. Optionally you can specify a workload skew if you want the different workloads to run with different IOPS using the ‘skew=nn’ parameter.
Vdbench controls individual IOPS and workload skew by sending new I/O requests to each SD’s internal work queue. This work queue has a maximum queue depth of 2000 per SD. The I/O threads for each SD then pick up a new request from this queue to start the I/O.
When an SD cannot keep up with its requested workload and the SD’s internal work queue fills up, Vdbench will not generate new I/O requests for this and all other SDs until space in this queue becomes available again.
This means that if you send 1000 IOPS to an SD that can handle only 100 IOPS, and 50 IOPS to a similar device, the queue for the first device will fill up, and I/O request generation for the second device will be held up.
This has been done to enable Vdbench to preserve the requested workload skew while still allowing for a temporary 'backlog' of 2000 requested I/Os.
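The effect of a full work queue can be shown with a small simulation. The Python sketch below is a simplified model written for this explanation, not Vdbench's actual scheduler: time advances in fixed ticks, each SD completes requests at a fixed rate, and request generation for all SDs stops whenever any queue has reached its limit. The `simulate` function and its parameters are assumptions made for the example; only the 2000-entry queue and the 1000/100/50 IOPS figures come from the text above.

```python
def simulate(requested, capability, seconds, queue_limit=2000, ticks_per_second=1000):
    """Return (generated, completed) request counts per SD."""
    sds = list(requested)
    queue = {sd: 0 for sd in sds}
    generated = {sd: 0 for sd in sds}
    completed = {sd: 0 for sd in sds}
    # Credits are integers in units of 1/ticks_per_second requests (no float drift).
    gen_credit = {sd: 0 for sd in sds}
    io_credit = {sd: 0 for sd in sds}

    for _ in range(seconds * ticks_per_second):
        # 1. Each SD completes queued requests at the rate it can sustain.
        for sd in sds:
            io_credit[sd] += capability[sd]
            done = min(io_credit[sd] // ticks_per_second, queue[sd])
            queue[sd] -= done
            completed[sd] += done
            io_credit[sd] -= done * ticks_per_second
            if queue[sd] == 0:
                # An idle device cannot bank capacity for later.
                io_credit[sd] = min(io_credit[sd], ticks_per_second)

        # 2. One full queue stalls request generation for ALL SDs.
        if any(queue[sd] >= queue_limit for sd in sds):
            continue

        # 3. Otherwise new requests are generated at the requested rates.
        for sd in sds:
            gen_credit[sd] += requested[sd]
            new = gen_credit[sd] // ticks_per_second
            queue[sd] += new
            generated[sd] += new
            gen_credit[sd] -= new * ticks_per_second

    return generated, completed


requested = {"sd1": 1000, "sd2": 50}    # requested IOPS per SD
capability = {"sd1": 100, "sd2": 100}   # IOPS each device can really do
generated, completed = simulate(requested, capability, seconds=60)
for sd in requested:
    print(sd, "generated:", generated[sd], "completed:", completed[sd])
```

In this model, once the queue for sd1 is full, both SDs settle at about one tenth of their requested rate. The second SD runs far below the 50 IOPS it could easily handle, but the 20:1 ratio between the two is preserved.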
Vdbench only controls the skew within one JVM. The amount of CPU cycles needed to control the skew between different JVMs would be far too much.
When asking Vdbench to run a workload requesting as many IOPS as the storage/server combination can handle there are two options:
- Specify ‘iorate=max’ (Uncontrolled workload).
- Specify a high numerical value, for instance ‘iorate=9999999’ (Controlled workload).
There is an important difference between these two when using multiple SDs or multiple concurrent workloads, for instance:
- Two or more SDs that are so different that one SD’s response time is much better than the other’s. For instance a cached vs. an uncached device.
- Two or more workloads that are so different that one workload’s response time is much better than the other’s. For instance 512-byte reads vs. 1mb reads.
Uncontrolled workload: when using iorate=max, without any skew= parameters used, Vdbench allows each SD or workload to run as fast as it can.
Controlled workload: when using numeric ‘iorate=’ values, Vdbench will make sure that the total IOPS generated for each SD and/or workload honors the requested workload skew. If no skew has been specified, the IOPS will be evenly spread out over all workloads and SDs.
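As a worked example of what honoring the skew means for a controlled workload, the following Python sketch splits a total I/O rate over workloads. It is only an illustration: `split_iorate` is a made-up helper, and the rule that workloads without a skew share the remaining percentage evenly is an assumption of this sketch, not something stated above.

```python
from typing import Dict, Optional


def split_iorate(total_iorate: float, skews: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Split total_iorate over workloads; skews maps workload -> percent or None."""
    specified = {wd: pct for wd, pct in skews.items() if pct is not None}
    unspecified = [wd for wd, pct in skews.items() if pct is None]
    remaining = 100.0 - sum(specified.values())
    if remaining < 0 or (not unspecified and abs(remaining) > 1e-9):
        raise ValueError("skew percentages must add up to 100")

    shares = dict(specified)
    for wd in unspecified:
        # Assumption: workloads without a skew share the remainder evenly.
        shares[wd] = remaining / len(unspecified)
    return {wd: total_iorate * pct / 100.0 for wd, pct in shares.items()}


even = split_iorate(1000, {"wd1": None, "wd2": None})
skewed = split_iorate(1000, {"wd1": 80, "wd2": None})
print("no skew:", even)
print("skew=80 on wd1:", skewed)
```

With ‘iorate=max’ no such split is enforced and each workload simply runs as fast as it can.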
This past week there were two occasions where my users used the following parameters in a Vdbench run.
What the users were expecting is that Vdbench would sequentially read and write to the requested storage. And indeed, Vdbench does just that. Great!
But what does Vdbench really do? It generates a sequential workload with on average 50% reads and 50% writes. This means that (depending on the results of the randomizers used) this is what things really look like:
Read block1, write block2, read block3, write block4, read block5, write block6, etc.
Code works as designed. But this is not what the user wanted. He wanted a sequential read workload, and a sequential write workload, not a sequential read/write mix.
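The behavior described above can be mimicked in a few lines of Python. This is a sketch for illustration only, not Vdbench code: the `mixed_sequential` function is invented for this example and assumes that the read/write decision is made by one random draw per I/O while the block number simply keeps increasing.

```python
import random
from typing import List, Tuple


def mixed_sequential(blocks: int, read_pct: float, seed: int = 0) -> List[Tuple[str, int]]:
    """One sequential stream; each I/O is a read with probability read_pct/100."""
    rng = random.Random(seed)
    stream = []
    for block in range(1, blocks + 1):
        op = "read" if rng.random() * 100 < read_pct else "write"
        stream.append((op, block))
    return stream


stream = mixed_sequential(1000, 50)
reads = [block for op, block in stream if op == "read"]
writes = [block for op, block in stream if op == "write"]
print("first I/Os:", stream[:8])
print("reads:", len(reads), "writes:", len(writes))
```

The block numbers as a whole are sequential, but neither the reads nor the writes on their own cover consecutive blocks: each of them skips whatever blocks the other one got.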
What to do to get a sequential read workload and a sequential write workload? Ask Vdbench for two workloads, one for reads and one for writes:
This will create TWO workloads with the following results:
Workload wd1: read blocks 1,2,3,4,5,6,etc.
Workload wd2: write blocks 1,2,3,4,5,6,etc.
Note that both workloads are reading and writing the same blocks. If this is not what you want, then there are several options:
- Use two separate devices, one to read from and one to write to.
- Use the ‘range=’ parameter to send the reads and the writes to different portions of a device, e.g.
- For Vdbench 5.00 the range parameter is also available in the SD:
- Change ‘seekpct=seq’ (or ‘seekpct=0’) to ‘seekpct=1’ which will cause a new random LBA to be selected on average every 100 blocks, or even ‘seekpct=0.1’ to select a random LBA once every 1000 blocks.
Note: since each sequential workload always starts at block 1 it may take a few hundred I/Os before a new random LBA is selected.
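The "on average every 100 blocks" figure for ‘seekpct=1’ follows from simple arithmetic: if each I/O has a 1% chance of jumping to a new random LBA, the average sequential run is 100/1 = 100 blocks, and for ‘seekpct=0.1’ it is 100/0.1 = 1000 blocks. The Python sketch below checks this by simulation. It assumes one independent random draw per I/O, which is a simplification made for this example and not necessarily how Vdbench implements it; `average_run_length` is a made-up name.

```python
import random


def average_run_length(seekpct: float, ios: int, seed: int = 1) -> float:
    """Average number of I/Os between two random seeks."""
    rng = random.Random(seed)
    seeks = 0
    for _ in range(ios):
        # Assumption: each I/O jumps to a random LBA with probability seekpct/100.
        if rng.random() * 100 < seekpct:
            seeks += 1
    return ios / seeks if seeks else float("inf")


run_1pct = average_run_length(1.0, 1_000_000)
run_01pct = average_run_length(0.1, 1_000_000)
print("seekpct=1   average run:", round(run_1pct))
print("seekpct=0.1 average run:", round(run_01pct))
```

The same reasoning explains the note above: the first random jump is also expected only after about 100 (or 1000) I/Os, so until then both workloads are still working their way up from block 1.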
But now there is a problem!
When you request two (or more) controlled workloads and these workloads end up running on different JVMs, Vdbench can no longer control the IOPS since the JVMs do not communicate with each other.
Reads are usually so much faster than writes that the reads will monopolize the storage and you can end up with 90/10% read/write and not the expected 50/50%.
Note: when you specify iorate=max then of course we have an Uncontrolled workload where we accept the fact that the workloads will have different IOPS.
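A back-of-the-envelope model shows how the 90/10 split comes about. In the Python sketch below each workload has a requested rate and a rate the storage can actually deliver for it; all numbers and function names are hypothetical and chosen only to reproduce the 90/10 example. In separate JVMs each workload independently gets whatever it can. In one JVM the slowest workload holds back the others, so the requested ratio is kept.

```python
from typing import Dict


def achieved_separate_jvms(requested: Dict[str, float], capability: Dict[str, float]) -> Dict[str, float]:
    """No skew control between JVMs: each workload runs as fast as it can."""
    return {wd: min(requested[wd], capability[wd]) for wd in requested}


def achieved_same_jvm(requested: Dict[str, float], capability: Dict[str, float]) -> Dict[str, float]:
    """Skew control within one JVM: all workloads are throttled by the slowest."""
    factor = min(1.0, *(capability[wd] / requested[wd] for wd in requested))
    return {wd: requested[wd] * factor for wd in requested}


def read_share(achieved: Dict[str, float]) -> float:
    """Percentage of the achieved IOPS that are reads."""
    return 100.0 * achieved["read"] / sum(achieved.values())


requested = {"read": 9000, "write": 9000}    # 50/50 requested, hypothetical rates
capability = {"read": 9000, "write": 1000}   # hypothetical: reads are much faster

separate = achieved_separate_jvms(requested, capability)
same = achieved_same_jvm(requested, capability)
print("separate JVMs:", separate, "read share:", round(read_share(separate)))
print("same JVM:     ", same, "read share:", round(read_share(same)))
```

The price of keeping the skew is visible as well: in the single-JVM case the total IOPS is much lower, because the reads are made to wait for the writes.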
So how to avoid this problem? The only way to do this is to make sure that these workloads all run in the same JVM by overriding the JVM count:
- Specify ‘hd=localhost,jvms=1’ (Vdbench 5.00), or
- add the ‘-m1’ execution parameter.