Saturday Jun 05, 2010

Visualizing System Latency

I've just had an article published in ACMQ: Visualizing System Latency, which demonstrates latency analysis using heat maps in Analytics from Oracle's Sun Open Storage appliances. These have revealed details about system performance that were previously not visible, and show how effective a simple visualization can be. As many of these details are new, they are still being studied and are not yet understood.

One detail now can be understood, thanks to Pete Harllee who offered an explanation for the faint line at the top of Figure 4: these I/O are where the lba range spans two tracks in the drive (which I should have realized can happen sooner since these are 8 Kbyte writes on a drive with 512 byte sectors); the additional latency encountered is the expected track to track seek time during the I/O, as the lbas are written to one track and then complete writing on the next.

The resolution of the screenshots was reduced to fit the online article, which preserved the patterns that the article was describing but not ancillary text; the original resolution screenshots are linked in the article, and are also listed here:

  • Figure 1: NFS Latency When Enabling SSD-based Cache Devices
  • Figure 2: Synchronous Writes to a Striped Pool of Disks
  • Figure 3: Single Disk Latency from a Striped Pool of Disks
  • Figure 4: Synchronous Write Latency to a Single-disk Pool
  • Figure 5: Synchronous Write Latency to a Two-disk Pool
  • Figure 6: Synchronous Writes to a Mirrored Pool of Disks
  • Figure 7: Sequential Disk Reads, Stepping Disk Count
  • Figure 8: Repeated Disk Reads, Stepping Disk Count
  • Figure 9: High Latency I/O

The article has been picked up by sites including Slashdot and Vizworld.

Friday Jun 26, 2009

SLOG Screenshots

I previously posted screenshots of the L2ARC: the ZFS second level cache which uses read optimized SSDs ("Readzillas") to cache random read workloads. That's the read side of the Hybrid Storage Pool. On the write side, the ZFS separate intent log (SLOG) can use write optimized SSDs (which we call "Logzillas") to accelerate performance of synchronous write workloads.

In the screenshots that follow I'll show how Logzillas have delivered 12x more IOPS and over 20x reduced latency for a synchronous write workload over NFS. These screenshots are from Analytics on the Sun Storage 7000 series. In particular, the following heat map shows the dramatic reduction in NFS latency when turning on the Logzillas:

Before the 17:54 line, latency was at the mercy of spinning 7200 RPM disks, and reached as high as 9 ms. After 17:54, these NFS operations were served from Logzillas, which consistently delivered latency much lower than 1 ms. Click for a larger screenshot.

What is a Synchronous Write?

While the use of Logzilla devices can dramatically improve the performance of synchronous write workloads, what is a synchronous write?

When an application writes data, if it waits for that data to complete writing to stable storage (such as a hard disk), it is a synchronous write. If the application requests the write but continues on without waiting (the data may is buffered in the filesystem's DRAM cache, but not yet written to disk), it is an asynchronous write. Asynchronous writes are faster for the application, and is the default behavior.

There is a down side to asynchronous writes - the application doesn't know if the write completed successfully. If there is a power outage before the write could be flushed to disk, the write will be lost. For some applications such as database log writers, this risk is unacceptable - and so they perform synchronous writes instead.

There are two forms of synchronous writes: individual I/O which is written synchronously, and groups of previous writes which are synchronously committed.

Individual synchronous writes

Write I/O will become synchronous when:

  • A file is opened using a flag such as O_SYNC or O_DSYNC
  • NFS clients use the mount option (if available):
    • "sync": to force all write I/O to be synchronous
    • "forcedirectio": which avoids caching, and the client may decide to make write I/O synchronous as part of avoiding its cache

Synchronously committing previous writes

Rather than synchronously writing each individual I/O, an application may synchronously commit previous writes at logical checkpoints in the code. This can improve performance by grouping the synchronous writes. On ZFS these may also be handled by the SLOG and Logzilla devices. These are performed by:

  • An application calling fsync()
  • An NFS client calling commit. It can do this because:
    • It closed a file handle that it wrote to (Solaris clients can avoid this using the "nocto" mount option)
    • Directory operations (file creation and deletion)
    • Too many uncommitted buffers on a file (eg, FreeBSD/MacOS X)

Examples of Synchronous Writes

Applications that are expected to perform synchronous writes include:

  • Database log writers: eg, Oracle's LGWR.
  • Expanding archives over NFS: eg, tar creating thousands of small files.
  • iSCSI writes on the Sun Storage 7000 series (unless lun write cache is enabled.)

The second example is easy to demonstrate. Here I've taken firefox-, a tar file containing 43,000 small files, and unpacked it to an NFS share (tar xf.) This will create thousands of small files, which becomes a synchronous write workload as these files are created. Writing of the file contents will be asynchronous, it is just the act of creating the file entries in their parent directories which is synchronous. I've unpacked this tar file twice, the first time without Logzilla devices and the second time with. The difference is clear:

Without logzillas, it took around 20 minutes to unpack the tar file. With logzillas, it took around 2 minutes - 10x faster. This improvement is also visible as the higher number of NFS ops/sec, and the lower NFS latency in the heat map.

For this example I used a Sun Storage 7410 with 46 disks and 2 Logzillas, stored in 2 JBODs. The disks and Logzillas can serve many more IOPS than I've demonstrated here, as I'm only running a single threaded application from a single modest client as a workload.

Identifying Synchronous Writes

How do you know if your particular write workload is synchronous or asynchronous? There are many ways to check, here I'll describe a scenario where a client side application is performing writes to an NFS file server.

Client side

To determine if your application is performing synchronous writes, one way is to debug the open system calls. The following example uses truss on Solaris (try strace on Linux) on two different programs:

    client# truss -ftopen ./write
    4830:   open64("outfile", O_WRONLY|O_CREAT, 0644)            = 3
    client# truss -ftopen ./odsync-write
    4832:   open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644)    = 3   

The second program (odsync-write) opened its file with the O_DSYNC flag, so subsequent writes will be synchronous. In these examples, the program was executed by the debugger, truss. If the application is already running, on Solaris you can use pfiles to examine the file flags for processes (on Linux, try lsof.)

Apart from tracing the open() syscall, also look for frequent fsync() or fdsync() calls, which synchronously commit the previously written data.

Server side

If you'd like to determine if you have a synchronous write workload from the NFS server itself, you can't run debuggers like truss on the target process (since it's running on another host), so instead we'll need to examine the NFS protocol. You can do this either with packet sniffers (snoop, tcpdump), or with DTrace if it and its NFS provider are available:

    topknot# dtrace -n 'nfsv3:::op-write-start { @[args[2]->stable] = count(); }'
    dtrace: description 'nfsv3:::op-write-start ' matched 1 probe
            2             1398

Here I've frequency counted the "stable" member of the NFSv3 write protocol, which was "2" some 1,398 times. 2 is a synchronous write workload (FILE_SYNC), and 0 is for asynchronous:

    enum stable_how {
            UNSTABLE = 0,
            DATA_SYNC = 1,
            FILE_SYNC = 2

If you don't have DTrace available, you'll need to dig this out of the NFS protocol headers. For example, using snoop:

    # snoop -d nxge4 -v | grep NFS
    NFS:  ----- Sun NFS -----
    NFS:  Proc = 7 (Write to file)
    NFS:  File handle = [A2FC]
    NFS:   7A6344B308A689BA0A00060000000000073000000A000300000000001F000000
    NFS:  Offset = 96010240
    NFS:  Size   = 8192
    NFS:  Stable = FSYNC

That's an example of a synchronous write.

The NFS client may be performing a synchronous-write-like workload by frequently calling the NFS commit operation. To identify this, use whichever tool is available to show NFS operations broken down by type (Analytics on the Sun Storage 7000; nfsstat -s on Solaris)

SLOG and Synchronous Writes

The following diagram shows the difference when adding a seperate intent log (SLOG) to ZFS:

Major components:

  • ZPL: ZFS POSIX Layer. Primary interface to ZFS as a filesystem.
  • ZIL: ZFS Intent Log. Synchronous write data for replay in the event of a crash.
  • DMU: Data Management Unit. Transactional object management.
  • ARC: Adaptive Replacement Cache. Main memory filesystem cache.
  • ZIO: ZFS I/O Pipeline. Processing of disk I/O.

For the most detailed information of these and the other components of ZFS, you can browse their source code at the ZFS Source Tour.

When data is asynchronously written to ZFS, it is buffered in memory and gathered periodically by the DMU as transaction groups, and then written to disk. Transaction groups either succeed or fail as a whole, and are part of the ZFS design to always deliver on disk data consistency.

This periodical writing of transaction groups can improve asynchronous writes by aggregating data and streaming it to disk. However the interval of these writes is in the order of seconds, which makes it unsuitable to serve synchronous writes directly - as the application would stall until the next transaction group was synced. Enter the ZIL.

ZFS Intent Log

The ZIL handles synchronous writes by immediately writing their data and information to stable storage, an "intent log", so that ZFS can claim that the write completed. The written data hasn't reached its final destination on the ZFS filesystem yet, that will happen sometime later when the transaction group is written. In the meantime, if there is a system failure, the data will not be lost as ZFS can replay that intent log and write the data to its final destination. This is a similar principle as Oracle redo logs, for example.

Separate Intent Log

In the above diagram on the left, the ZIL stores the log data on the same disks as the ZFS pool. While this works, the performance of synchronous writes can suffer as they compete for disk access along with the regular pool requests (reads, and transaction group writes.) Having two different workloads compete for the same disk can negatively effect both workloads, as the heads seek between the hot data for one and the other. The solution is to give the ZIL its own dedicated "log devices" to write to, as pictured on the right. These dedicated log devices form the separate intent log: the SLOG.


By writing to dedicated log devices, we can improve performance further by choosing a device which is best suited for fast writes. Enter Logzilla. Logzilla was the name we gave to write-optimized flash-memory-based solid state disks (SSDs.) Flash memory is known for slow write performance, so to improve this the Logzilla device buffers the write in DRAM and uses a super-capacitor to power the device long enough to write the DRAM buffer to flash, should it lose power. These devices can write an I/O as fast as 0.1 ms (depends on I/O size), and do so consistently. By using them as our SLOG, we can serve synchronous write workloads consistently fast.

Speed + Data Integrity

It's worth mentioning that when ZFS synchronously writes to disk, it uses new ioctl()s to ensure that the data is flushed properly to the disk platter, and isn't just buffered on the disk's write cache (which is a small amount of DRAM inside the disk itself.) Which is exactly what we want to happen - that's why these writes are synchronous. Other filesystems didn't bother to do this (eg, UFS), and believed that data had reached stable storage when it may have just been cached. If there was a power outage, and the disk's write cache is not battery backed, then those writes would be lost - which means data corruption for the filesystem. Since ZFS waits for the disk to properly write out the data, synchronous writes on ZFS are slower than other filesystems - but they are also correct. There is a way to turn off this behaviour in ZFS (zfs_nocacheflush), and ZFS will perform as fast or faster than other filesystems for synchronous writes - but you've also sacrificed data integrity, so this is highly unrecommended. By using fast SLOG devices on ZFS, we get both speed and data integrity.

I won't go into too much more detail of the inner workings of the SLOG, which is best described by the author Neil Perrin.


To create a screenshot to best show the effect of Logzillas, I applied a synchronous write workload over NFS from a single threaded process on a single client. The target was the same 7410 used for the tar test, which has 46 disks and 2 Logzillas. At first I let the workload run with the Logzillas disabled, then enbled them:

This is the effect of adding 2 Logzillas to a pool of 46 disks, on both latency and IOPS. My last post discussed the odd latency pattern that these 7200 RPM disks causes for synchronous writes, which looks a bit like an icy lake.

The latency heat map shows the improvement very well, but it's also shown something unintended: These heat maps use a false color palette which draws the faintest details much darker than they (linearly) should be, so that they are visible. This has made visible a minor and unrelated performance issue: those faint vertical trails on the right side of the plot. These were every 30 seconds, and was when ZFS flushed a transaction group to disk - which stalled a fraction of NFS I/O while this happened. The fraction is so small it's almost lost in the rounding, but appears to be about 0.17% when examining the left panel numbers. While minor, we are working to fix it. This perf issue is known internally as the "picket fence". Logzilla is still delivering a fantastic improvement over 99% of the time.

To quantify the Logzilla improvement, I'll now zoom to the before and after periods to see the range averages from Analytics:


With just the 46 disks:

Reading the values from the left panels: with just the 46 disks, we've averaged 143 NFS synchronous writes/sec. Latency has reached as high as 8.93ms - most of the 8 ms would be worst case rotational latency for a 7200 RPM disk.


46 disks + 2 Logzillas:

Nice - a tight latency cloud from 229 to 305 microseconds, showing very fast and consistent responses from the Logzillas.

We've now averaged 1,708 NFS synchronous writes/sec - 12x faster than without the Logzillas. Beyond 381 us, latency has rounded down to zero ops/sec, and was mostly in the 267 to 304 microsecond range. Previously latency stretched from a few to over 8 ms, making the delivered NFS latency improvement reach over 20x.

This 7410 and these Logzillas can handle much more than 1,708 synchronous writes/sec, if I apply more clients and threads (I'll show a higher load demo in a moment.)

Disk I/O

Another way to see the effect of Logzillas is through disk I/O. The following shows before and after in seperate graphs:

I've used the Hierarchial breakdown feature to highlight the JBODs in green. In the top plot (before), the I/O to each JBOD was equal - about 160 IOPS. After enabling the Logzillas, the bottom plot shows one of the JBODs is now serving over 1800 IOPS - which is the JBOD with the Logzillas in.

There are spikes of IOPS every 30 seconds in both these graphs. That is when ZFS flushes a transaction group to disk - commiting the writes to their final location. Zooming into the before graph:

Instead of highlighting JBODs, I've now highlighted individual disks (yes, the pie chart is pretty now.) The non-highlighted slice is for the system disks, which are recording the numerous Analytics statistics I've enabled.

We can see that all the disks have data flushed to them every 30 seconds, but they are also being written to constantly. This constant writing is the synchronous writes being written to the ZFS intent log, which is being served from the pool of 46 disks (since it isn't a separate intent log, yet.)

After enabling the Logzillas:

Now those constant writes are being served from 2 disks: HDD 20 and HDD 16, which are both in the /2029...012 JBOD. Those are the Logzillas, and are our separate intent log (SLOG):

I've highlighted one in this screenshot, HDD 20.

Mirrored Logzillas

The 2 Logzillas I've been demonstrating have been acting as individual log devices. ZFS lists them like this:

topknot# zpool status
  pool: pool-0
 state: ONLINE
        NAME                                     STATE     READ WRITE CKSUM
        pool-0                                   ONLINE       0     0     0
[...many lines snipped...]
          mirror                                 ONLINE       0     0     0
            c2t5000CCA216CCB913d0                ONLINE       0     0     0
            c2t5000CCA216CE289Cd0                ONLINE       0     0     0
          c2tATASTECZEUSIOPS018GBYTES00000AD6d0  ONLINE       0     0     0
          c2tATASTECZEUSIOPS018GBYTES00000CD8d0  ONLINE       0     0     0
          c2t5000CCA216CCB958d0                  AVAIL
          c2t5000CCA216CCB969d0                  AVAIL

Log devices can be mirrored instead (see zpool(1M)), and zpool will list them as part of a "mirror" under "logs". On the Sun Storage 7000 series, it's configurable from the BUI (and CLI) when creating the pool:

With a single Logzilla, if there was a system failure when a client was performing synchronous writes, and if that data was written to the Logzilla but hadn't been flushed to disk, and if that Logzilla also fails, then data will be lost. So it is system failure + drive failure + unlucky timing. ZFS should still be intact.

It sounds a little paranoid to use mirroring, however from experience system failures and drive failures sometimes go hand in hand, so it is understandable that this is often chosen for high availability environments. The default for the Sun Storage 7000 config when you have multiple Logzillas is to mirror them.


While the Logzillas can greatly improve performance, it's important to understand which workloads this is for and how well they work, to help set realistic expectations. Here's a summary:

  • Logzillas do not help every write workload; they are for synchronous write workloads, as described earlier.
  • Our current Logzilla devices for the Sun Storage 7000 family deliver as high as 100 Mbytes/sec each (less with small I/O sizes), and as high as 10,000 IOPS (less with big I/O sizes).
  • A heavy multi threaded workload on a single Logzilla device can queue I/O, increasing latency. Use multiple Logzilla devices to serve I/O concurrently.

Stepping up the workload

So far I've only used a single threaded process to apply the workload. To push the 7410 and Logzillas to their limit, I'll multiply this workload by 200. To do this I'll use 10 clients, and on each client I'll add another workload thread every minute until it reaches 20 per client. Each workload thread performs 512 byte synchronous write ops continually. I'll also upgrade the 7410: I'll now use 6 JBODs containing 8 Logzillas and 132 data disks, to give the 7410 a fighting chance against this bombardment.

8 Logzillas

We've reached 114,000 synchronous write ops/sec over NFS, and the delivered NFS latency is still mostly less than 1 ms - awesome stuff! We can study the numbers in the left panel of the latency heat map to quantify this: there were 132,453 ops/sec total, and (adding up the numbers) 112,121 of them completed faster than 1.21 ms - which is 85%.

The bottom plot suggests we have driven this 7410 close to its CPU limit (for what is measured as CPU utilization, which includes bus wait cycles.) The Logzillas are only around 30% utilized. The CPUs are 4 sockets of 2.3 GHz quad core Opteron - and for them to be the limit means a lot of other system components are performing extremely well. There are now faster CPUs available, which we need to make available in the 7410 to push that ops/sec limit much higher than 114k.

No Logzillas

Ouch! We are barely reaching 7,000 sync write ops/sec, and our latency is pretty ordinary - often 15 ms and higher. These 7,200 RPM disks are not great to serve synchronous writes from alone, which is why Adam created the Hybrid Storage Pool model.

When comparing the above two screenshots, note the change in the vertical scale. With Logzillas, the NFS latency heat map is plotted to 10 ms; without, it's plotted to 100 ms to span the higher latency from the disks.


The full description of the server and clients used in the above demo is:

  • Sun Storage 7410
  • Latest software version: 2009.,1-1.15
  • 4 x quad core AMD Opteron 2.3 GHz CPUs
  • 128 Gbytes DRAM
  • 2 x SAS HBAs
  • 1 x 2x10 GbE network card (1 port connected)
  • 6 JBODs (136 disks total, each 1 Tbyte 7,200 RPM)
  • Storage configured with mirroring
  • 8 Logzillas (installed in the JBODs)
  • 10 x blade servers, each:
  • 2 x quad core Intel Xeon 1.6 GHz CPUs
  • 3 Gbytes DRAM
  • 2 x 1 GbE network ports (1 connected)
  • mounted using NFSv3, wsize=512

The clients are a little old. The server is new, and is a medium sized 7410.

What about 15K RPM disks?

I've been comparing the performance of write optimized SSDs vs cheap 7,200 RPM SATA disks as the primary storage. What if I used high quality 15K RPM disks instead?

Rotational latency will be half. What about seek latency? Perhaps if the disks are of higher quality the heads move faster, but they also have to move further to span the same data (15K RPM disks are usually lower density than 7,200 RPM disks), so that could cancel out. They could have a larger write cache, which would help. Lets be generous and say that 15K RPM disks are 3x faster than 7,200 RPM disks.

So instead of reaching 7,000 sync write ops/sec, perhaps we could reach 21,000 if we used 15K RPM disks. That's a long way short of 114,000 using Logzilla SSDs. Comparing the latency improvement also shows this is no contest. In fact, to comfortably match the speed of current SSDs, disks would need to spin at 100K RPM and faster. Which may never happen. SSDs, on the other hand, are getting faster and faster.

What about NVRAM cards?

For the SLOG devices, I've been using Logzillas - write optimized SSDs. It is possible to use NVRAM cards as SLOG devices, which should deliver much faster writes than Logzilla can. While these may be an interesting option, there are some down sides to consider:

  • Slot budget: A Logzilla device consumes one disk bay, whereas an NVRAM card would consume one PCI-E slot. The Sun Storage 7410 has 6 PCI-E slots, and a current maximum of 288 disk bays. Consuming disk bays for Logzillas may be a far easier trade-off than spending one or more PCI-E slots.
  • Scaling: With limited slots compared to disk bays, the scaling of PCI-E cards will be very coarse. With Logzillas in disk bays it is easier to fine tune their number to match the workload, and to keep scaling to a dozen devices and more.
  • Complexity: One advantage of the Logzillas is that they exist in the pool of disks, and in clustered environments can be accessed by the other head node during failover. NVRAM cards are in the head node, and would require a high speed cluster interconnect to keep them both in sync should one head node fail.
  • Capacity: NVRAM cards have limited capacity when compared to Logzillas to begin with, and in clustered environments the available capacity from NVRAM cards may be halved (each head node's NVRAM cards must support both pools.) Logzillas are currently 18 Gbytes, whereas NVRAM cards are currently around 2 Gbytes.
  • Thermal: Battery backed NVRAM cards may have large batteries attached to the card. Their temperature envelope needs to be considered when placed in head nodes, which with CPUs and 10 GbE cards can get awfully hot (batteries don't like getting awfully hot.)
  • Cost: These are expected to be more expensive per Gbyte than Logzillas.

Switching to NVRAM cards may also make little relative difference for NFS clients over the network, when other latencies are factored. It may be an interesting option, but it's not necessarily better than Logzillas.


For synchronous write workloads, the ZFS intent log can use dedicated SSD based log devices (Logzillas) to greatly improve performance. The screenshots above from Analytics on the Sun Storage 7000 series showed 12x more IOPS and reaching over 20x reduced latency.

Thanks to Neil Perrin for creating the ZFS SLOG, Adam Leventhal for designing it into the Hybrid Storage Pool, and Sun's Michael Cornwell and STEC for making the Logzillas a reality.

Logzillas as SLOG devices work great, making the demos for this blog entry easy to create. It's not every day you find a performance technology that can deliver 10x and higher.

Friday Jun 12, 2009

Latency Art: X marks the spot

Here is another example of latency art (interesting I/O latency heat maps.) This is a screenshot from Analytics on the Sun Storage 7000 series.

The configuration was 2 JBODs (48 disks) configured with mirroring, and a single NFS client performing a synchronous write workload (opened with O_DSYNC), from one thread writing 8 Kbyte I/O sequentially. This is a crippled config - if you really cared about synchronous writes (some applications perform them) you can use SSD based separate intent logs ("Logzillas") which are part of the Hybrid Storage Pool strategy. Since I didn't, these synchronous writes will be served at disk speed, which for these 7200 RPM disks will be between 0 and 10 ms. Here's the latency (click for a larger version):

Woah! That's Insane.

This is not what I was expecting. Remember that this workload is very simple - just one thread performing sequential synchronous writes. The magnitude of the latency is as expected for these disks (and no Logzilla), but the strange diagonal lines? Latency is creeping up or down second after second, sometimes at the same time (the 'X's.) It looks a bit like an icy lake, and this particular example has a distinct 'X' in it.

I'd guess this is related to how ZFS is placing the data sequentially on disks as they rotate - perhaps each diagonal represents one track. With time this can be answered using DTrace; the first step would be to see if individual disks contribute single lines to the heat map...

The point of this post is to show another interesting latency heat map, one that we don't yet fully understand (leave a comment if you have another theory about why this happens.)

Before Analytics, I analyzed disk latency using averages or coarse distributions, which don't have the resolution to reveal patterns like this. Indeed, for other filesystems (UFS, ...) there may be similar or even more bizarre patterns.

In my next post I'll demo Logzillas - the solution to this latency.

Tuesday May 26, 2009

My Sun Storage 7310 perf limits

As part of my role in Fishworks, I push systems to their limits to investigate and solve bottlenecks. Limits can be useful to consider as a possible upper bound of performance - as it shows what the target can do. I previously posted my results for the Sun Storage 7410, which is our current top performing product. Today the Sun Storage 7310 has been launched, which is an entry level offering that can be clustered for high availability environments like the 7410.

The following summarises the performance limits I found for a single 7310 head node, along with the current 7410 results for comparison:

As shown above, the 7310 has very reasonable performance in comparison to the high end 7410. (As of this blog post, I've yet to update my 7410 results for the 2009.Q2 software update, which gave about a 5% performance boost.)

Next I'll show the source of these results - screenshots taken from the 7310 using Analytics, to measure how the target NFS server actually performed. I'll finish by describing the 7310 and clients used to perform these tests.

NFSv3 streaming read from DRAM

To find the fastest read throughput possible, I used 10 Gbytes of files (working set) and had 20 clients run two threads per client to read 1 Mbyte I/O through them repeatedly:

Over this 10 minute interval, we've averaged 1.08 Gbytes/sec. This includes both inbound NFS requests and network protocol headers, so the actual data transferred will be a little less. Still, breaking 1 Gbyte/sec of delivered NFS for an entry level server is a great result.

NFSv3 streaming read from disk

This time 400 Gbytes of files were used as the working set, to minimize caching in the 7310's 16 Gbytes of DRAM. As before, 20 clients and 2 threads per client read through the working set repeatedly:

This screenshot from Analytics includes the disk bytes/sec, to confirm that this workload really did read from disk. It averaged 780 Mbytes/sec - a solid result. The 792 Mbytes/sec average on the network interfaces includes the NFS requests and network protocol headers.

NFSv3 streaming write to disk

To test streaming writes, 20 clients ran 5 threads per client performing writes with a 32 Kbyte I/O size:

While there were 477 Mbytes/sec on the network interfaces (which includes network protocol headers and ACKs), at the disk level the 7310 has averaged 1.09 Gbytes/sec. This is due to software mirroring, which has doubled the data sent to the storage devices (plus ZFS metadata). The actual data bytes written will be a little less than 477 Mbytes/sec.

NFSv3 max read IOPS

I tested this with the 7410, more as an experiment than of practical value. This is the most 1 byte reads that NFS can deliver to the clients, with the 10 Gbyte working set entirerly cached in DRAM. While 1 byte I/O isn't expected, that doesn't render this test useless - it does give the absolute upper bound for IOPS:

The 7310 reached 182,282 reads/sec, and averaged around 180K.

NFSv3 4 Kbyte read IOPS from DRAM

For a more realistic test of read IOPS, the following shows an I/O size of 4 Kbytes, and a working set of 10 Gbytes cached in the 7310's DRAM. Each client runs an additional thread every minute (stepping up the workload), from one to ten threads:

This screenshot shows %CPU utilization, to check that this ramping up of workload is pushing some limit on the 7310 (in this case, what is measured as CPU utilization.) With 10 threads running per client, the 7310 has served a whopping 110,724 x 4 Kbyte read ops/sec, and still has a little CPU headroom for more. Network bytes/sec was also included in this screenshot to double check the result, which should be at least be 432 Mbytes/sec (110,724 x 4 Kbytes), which it is.


As a filer, I used a single Sun Storage 7310 with the following config:

  • 16 Gbytes DRAM
  • 2 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
  • 1 socket of quad-core AMD Opteron 2300 MHz CPU
  • 1 x 2x10 GbE card, jumbo frames
  • 1 x HBA card
  • noatime on shares, and database size left at 128 Kbytes

It's not a max config system - the 7310 can currently scale to 4 JBODs and have 2 sockets of quad-core Opteron.

The clients were 20 blades, each:

  • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
  • 3 Gbytes of DRAM
  • 1 x 1 GbE network ports
  • Running Solaris, and NFS mounted forcedirectio for read testing to avoid client caching

These are great, apart from the CPU clock speed - which at 1600 MHz is a little low.

The network consists of multiple 10 GbE switches to connect the client 1 GbE ports to the filer 10 GbE ports.


We've spent years working on maximizing performance of our highest end offering, the Sun Storage 7410. New family members of the Sun Storage 7000 series inherit most of this work, and really hit the ground running. When Adam Leventhal designed the 7310 product, we weren't thinking of a system capable of 1 Gbyte/sec - but as the pieces came together we realized the performance could in fact be very good, and was. And it's worth emphasizing - the results above are all from a single socket 7310; the 7310 can be configured with two sockets of quad core Opteron!

Performance Testing the 7000 series, part 3 of 3

For performance testing storage products, in particular the Sun Storage 7000 series, I previously posted the top 10 suggestions for performance testing, and the little stuff to check. Here I'll post load generation scripts I've been using to measure the limits of the Sun Storage 7410 (people have been asking for these.)

Load Generation Scripts

Many storage benchmark tools exist, including FileBench which can apply sophisticated workloads and report various statistics. However the performance limit testing I've blogged about has not involved sophisticated workloads at all - I've tested sequential and random I/O to find the limits of throughput and IOPS. This is like finding out a car's top speed and its 0-60 mph time - simple and useful metrics, but aren't sophisticated such as finding out its best lap time at Laguna Seca Raceway. While a lap time would be a more comprehensive test of a car's performance, the simple metrics can be a better measure of specific abilities, with fewer external factors to interfere with the result. A lap time will be more affected by the driver's ability and weather conditions, for example, unlike top speed alone.

The scripts that follow are very simple, as all they need to do is generate load. They don't generate a specific level of load or measure the performance of that load, they just apply load as fast as they can. Measurements are taken on the target - which can be a more reliable way of measuring what actually happened (past the client caches.) For anything more complex, reach for a real benchmarking tool.

Random Reads

This is my script, which takes a filename as an argument:

    #!/usr/bin/perl -w 
    # - randomly read over specified file.
    use strict;
    my $IOSIZE = 8192;                      # size of I/O, bytes
    my $QUANTA = $IOSIZE;                   # seek granularity, bytes
    die "USAGE: filename\\n" if @ARGV != 1 or not -e $ARGV[0];
    my $file = $ARGV[0];
    my $span = -s $file;                    # span to randomly read, bytes
    my $junk;
    open FILE, "$file" or die "ERROR: reading $file: $!\\n";
    while (1) {
            seek(FILE, int(rand($span / $QUANTA)) \* $QUANTA, 0);
            sysread(FILE, $junk, $IOSIZE);
    close FILE;

Dead simple. Tune $IOSIZE to the I/O size desired. This is designed to run over NFS or CIFS, so the program will spend most of its time waiting for network responses - not chewing over its own code, and so Perl works just fine. Rewriting this in C isn't going to make it much faster, but it may be fun to try and see for yourself (be careful with the resolution of rand(), which may not have the granularity to span files bigger than 2\^32 bytes.)

To run, create a file for it to work on, eg:

    # dd if=/dev/zero of=10g-file1 bs=1024k count=10k

which is also how I create simple sequential write workloads. Then run it:

    # ./ 10g-file1 &

Sequential Reads

This is my script, which is similar to

    #!/usr/bin/perl -w
    # - sequentially read through a file, and repeat.
    use strict;
    my $IOSIZE = 1024 \* 1024;               # size of I/O, bytes
    die "USAGE: filename\\n" if @ARGV != 1 or not -e $ARGV[0];
    my $file = $ARGV[0];
    my $junk;
    open FILE, "$file" or die "ERROR: reading $file: $!\\n";
    while (1) {
            my $bytes = sysread(FILE, $junk, $IOSIZE);
            if (!(defined $bytes) or $bytes != $IOSIZE) {
                    seek(FILE, 0, 0);
    close FILE;

Once it reaches the end of a file, it loops back to the start.

Client Management Script

To test the limits of your storage target, you'll want to run these scripts on a bunch of clients - ten or more. This is possible with some simple shell scripting. Start by setting up ssh (or rsh) so that a master server (your desktop) can login to all the clients as root without prompting for a password (ssh-keygen, /.ssh/authorized_keys ...). My clients are named dace-0 through to dace-9, and after setting up the ssh keys the following succeeds without a password prompt:

    # ssh root@dace-0 uname -a
    SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc

Since I have 10 clients, I'll want an easy way to execute commands on them all at the same time, rather than one by one. There are lots of simple ways to do this, here I've created a text file called clientlist with the names of the clients:

    # cat clientlist

which is easy to maintain. Now a script to run commands on all the clients in the list:

    # clientrun - execute a command on every host in clientlist.
    if (( $# == 0 )); then
            print "USAGE: clientrun cmd [args]"
            exit 1
    for client in $(cat clientlist); do
            ssh root@$client "$@" &

Testing that this script works by running uname -a on every client:

    # ./clientrun uname -a
    SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-1 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-2 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-3 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-4 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-5 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-7 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-8 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-6 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-9 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc


Running the Workload

We've got some simple load generation scripts and a way to run them on our client farm. Now to execute a workload on our target server, turbot. To prepare for this:

  • Scripts are available to the clients on an NFS share, /net/fw/tools/perf. This makes it easy to adjust the scripts and have the clients run the latest version, rather than installing the scripts one by one on the clients.
  • One share is created on the target NFS server for every client (usually more interesting than just using one), and is named the same as the client's hostname (/export/dace-0 etc.)

Creating a directory on the clients to use as a mount point:

    # ./clientrun mkdir /test

Mounting the shares with default NFSv3 mount options:

    # ./clientrun 'mount -o vers=3 tarpon:/export/`uname -n` /test'

The advantage of using the client's hostname as the share name is that it becomes easy for our clientrun script to have each client mount their own share, by getting the client to call uname -n to construct the share name. (The single forward quotes in that command are necessary.)

Creating files for our workload. The clients have 3 Gbytes of DRAM each, so 10 Gbytes per file will avoid caching all (or most) of the file on the client, since we want our clients to apply load to the NFS server and not hit from their own client cache:

    # ./clientrun dd if=/dev/zero of=/test/10g-file1 bs=1024k count=10k

This applys a streaming write workload from 10 clients, one thread (process) per client. While that is happening, it may be interesting to login to the NFS server and see how fast the write is performing (eg, network bytes/sec.)

With the test files created, I can now apply a streaming read workload like so:

    # ./clientrun '/net/fw/tools/perf/ /test/10g-file1 &'

That will apply a streaming read workload from 10 clients, one thread per client.

Run it multiple times to add more threads, however the client cache is more likely to interfere when trying this (one thread reads what the other just cached); on Solaris you can try adding the mount option forcedirectio to avoid the client cache altogether.

To stop the workload:

    # ./clientrun pkill seqread

And to cleanup after testing is completed:

    # ./clientrun umount /test

Stepping the Workload

Running one thread per client on 10 clients may not be enough to stress powerful storage servers like the 7000 series. How many do we need to find the limits? An easy way to find out is to step up the workload over time, until the target stops serving any more.

The following runs on the clients, starting with one and running another every 60 seconds until ten are running on each client:

    # ./clientrun 'for i in 1 2 3 4 5 6 7 8 9 10; do
         /net/fw/tools/perf/ /test/10g-file1 & sleep 60; done &' &

This was executed with an I/O size of 4 Kbytes. Our NFS server turbot is a Sun Storage 7410. The results from Analytics on turbot:

Worked great, and we can see that after 10 threads per client we've pushed the target to 91% CPU utilization, so we are getting close to a limit (in this case, available CPU.) Which is the aim of these type of tests - to drive load until we reach some limit.

I included network bytes/sec in the screenshot as a sanity check; we've reached 138180 x 4 Kbyte NFS reads/sec, which would require at least 540 Mbytes/sec of network throughput; we pushed 602 Mbytes/sec (includes headers.) 138K IOPS is quite a lot - this server has 128 Gbytes of DRAM, so 10 clients with a 10 Gbyte file per client means a 100 Gbyte working set (active data) in total, which has entirerly cached in DRAM on the 7410. If I wanted to test disk performance (cache miss), I can increase the per client filesize to create a working set much larger than the target's DRAM.

This type of testing can be useful to determine how fast a storage server can go - it's top speed. But that's all. For a better test of application performance, reach for a real benchmarking tool, or setup a test environment and run the application with a simulated workload.

Thursday Apr 02, 2009

Performance Testing the 7000 series, part 2 of 3

With the release of the Sun Storage 7000 series there has been much interest in the products performance, which I've been demonstrating. In my previous post I listed 10 suggestions for performance testing - the big stuff you need to get right. The 10 suggestions are applicable to all perf testing on the 7000 series. Here I'll dive into the smaller details of max performance tuning, which may only be of interest if you'd like to see the system reach its limits.

The little stuff

The following is a tuning checklist for achieving maximum performance on the Sun Storage 7410, particularly for finding performance limits. Please excuse the brevity of some descriptions - this was originally written as an internal Sun document and has been released here in the interests of transparency. This kind of tuning is used during product development, to drive systems as fast as possible to identify and solve bottlenecks.

These can all be useful points to consider, but they do not apply to all workloads. I've seen a number of systems that were misconfigured with random tuning suggestions found on the Internet. Please understand what you are doing before making changes.

Sun Storage 7410 Hardware

  • Use the max CPU and max DRAM option (currently 4 sockets of quad core Opteron, and 128 Gbytes of DRAM.)

  • Use 10 GbE. When using 2 ports, ideally use 2 x 10 GbE cards and one port on each - which balances across two PCI-E slots, two I/O controllers, and both CPU to I/O HyperTransports. Two ports on a single 10 GbE card will be limited by the PCI-E slot throughput.

    Using LACP to trunk over 1 GbE interfaces is a different way to increase network bandwidth, but you'll have port hashing to balance in small test environments (eg, fewer than 10 clients), which can be a headache based on the client attributes (my 20 client test farm all have even-numbered IP address and mac-addresses!)

  • Balance load across I/O controllers. There are two I/O controllers in the 7410, the MCP55 and the IO55. The picture on the right shows which PCI-E slots are served by which controller, labeled A and B. When using two 10 GbE cards, they should (already) be installed in the top left and bottom right slots, so that their load is balanced across each controller. When three HBAs are used, they are put in the bottom left and both middle slots; if I'm running a test that only uses two of them, I'll use the middle ones.

  • Get as many JBODs as possible with the largest disks possible. This will improve random I/O for cache busting workloads - although considering the amount of cache possible (DRAM plus Readzilla), to be cache busting it may become too artificial to be interesting.

  • Use 3 x HBAs, configure dual paths to chains of 4 x JBODs (if you have the full 12.)

  • Consider Readzilla for random I/O tests - but plan for a long (hours) warmup. Make sure the share database size is small (8 Kbytes) before file creation. Also use multiple readzillas for concurrent I/O (you get about 3100 x 8 Kbyte IOPS from each of the current STECs.) Note that Readzilla (read cache) can be enabled on a per filesystem basis.

  • Use Logzilla for synchronous (O_DSYNC) write workloads, which may include database operations and small file workloads.

Sun Storage 7410 Settings

  • Configure mirroring on the pool. If there is interest in a different profile, test it in addition to mirroring. Mirroring will especially help random IOPS read and write workloads that are cache busting.

  • Create multiple shares for the clients to use, if this resembles the target environment. Depending on the workload, it may improve performance by a tiny amout for clients not to be hammering the same share.

  • Disable access time updates on the shares.

  • 128K recsize for streaming. For streaming I/O tests: 128 Kbyte "database size" on the shares before file creation.

  • 8K recsize for random. Random I/O tests: set the share "database size" to match the application record size. I generally wouldn't go smaller than 4 Kbytes - if you make this extremely small then the ARC metadata to reference tiny amounts of data can begin to consume Gbytes of DRAM. I usually use 8 Kbytes.

  • Consider testing with LZJB compression on the shares on to see if it improves performance (it may relieve back-end I/O throughput.) For compression, make sure your working set files actually contain data and aren't all zeros - ZFS has some tricks for compressing zero data that will artificially boost performance.

  • Consider suspending DTrace based Analytics during tests for an extra percent or so of performance (especially the by-file, by-client, by-latency, by-size and by-offset ones.) The default Analytics (listed in the HELP wiki) are negligible, and I leave them on during my limit testing (without them I couldn't take screenshots.)

  • Don't benchmark a cold server (after boot.) Run a throwaway benchmark first until it fills DRAM, then cancel it and run the real benchmark. This avoids a one-off kmem cache buffer creation penalty, that may cost 5% performance or so only in the minutes immediately after boot.


  • Use jumbo frames on clients and server ports (not needed for management). The 7410 has plenty of horsepower to drive 1 GbE with or without jumbo frames, so their use on 1 GbE is more to help the clients keep up. On 10 GbE the 7410s peak throughput will improve by about 20% with jumbo frames. This ratio would be higher, but we are using LSO (large send offload) with this driver (nxge) which keeps the non-jumbo frame throughput pretty high to start with.

  • Check your 10 Gbit switches. Don't assume a switch with multiple 10 GbE ports can drive them at the same time; some 10 Gbit switches we've tested cap at 11 Gbits/sec. The 7410 can have 4 x 10 GbE ports - so make sure the switches can handle the load, such as by using multiple switches. You don't want to test the switches by mistake.


  • Lots of clients. At least 10. 20+ is better. Everyone who tries to test the 7410 (including me) gets client bound at some point.

  • A single client can't drive 10 GbE, without a lot of custom tuning. And if the clients have 1 GbE interfaces, it should (obviously) take at least 10 clients to drive 10 GbE.

  • Connect the clients to dedicated switch(s) connected to the 7410.

  • You can reduce client caching by booting the clients with less DRAM: eg, eeprom physmem=786432 for 3 Gbytes on Solaris x86. Otherwise the test can hit from client cache and test the client instead of the target!

Client Workload

  • NFSv3 is generally the fastest protocol to test.

  • Consider umount/mount cycles on the clients between runs to flush the client cache.

  • NFS client mount options: Tune the "rsize" and "wsize" option to match the workload. eg, rsize=8192 for random 8 Kbyte IOPS tests - this reduces unnecessary read-ahead; and rsize=131072 for streaming (and also try setting nfs3_bsize to 131072 in /etc/system on the clients for the streaming tests.)

    For read tests, try "forcedirectio" if your NFS client supports this option (Solaris does) - this especially helps clients apply a heavier workload by not using their own cache. Don't leave this option enabled for write tests.

  • Whatever benchmark tool you use, you want to try multiple threads/processes per client. For max streaming tests I generally use 2 processes per client with 20 clients - so thats 40 client processes in total; for max random IOPS tests I use a lot more (100+ client processes in total.)

  • Don't trust benchmark tools - most are misleading in some way. Roch explains Bonnie++ here, and we have more posts planned for other tools. Always double check what the benchmark tool is doing using Analytics on the 7410.

  • Benchmark working set: very important! To properly test the 7410, you need to plan different total file sizes or working sets for the benchmarks. Ideally you'd run one benchmark that the server could entirely cache in DRAM, another that cached in Readzilla, and another that hit disk.

    If you are using a benchmark tool without tuning the file size, you are probably getting this wrong. For example, the defaults for iozone are very small (512 Mbytes and less) and can entirely cache on the client. If I want to test both DRAM and disk performance on a 7410 with 128 Gbytes of DRAM, I'll use a total file size of 100 Gbytes for the DRAM test, and at least 1 Terabyte (preferably 10 Terabytes) for the disk test.

  • Keep a close eye on the clients for any issues - eg, DTrace kernel activity.

  • Getting the best numbers involves tuning the clients as much as possible. For example, use a large value for autoup (300); tune tcp_recv_hiwat (for read tests, 400K should be good in general, 1 MB or more for long latency links.) The aim is to eliminate any effects from the available clients, and have results which are bounded by the targets performance.

  • Aim for my limits: That will help sanity check your numbers - to see if they are way off.

The Sun Storage 7410 doesn't need special tuning at all (no messing with /etc/system settings.) If it did, we'd consider that a bug we should fix. Indeed, this is part of what Fishworks is about - the expert tuning has already been done in our products. What there is left for the customer is simple and industry common: pick mirroring or double parity RAID, jumbo frames, no access time updates and tuning the filesystem record size. The clients require much, much more tuning and fussing when doing these performance tests.

In my next post, I'll show the simple tools I use to apply test workloads.

Monday Mar 23, 2009

Performance Testing the 7000 series, part 1 of 3

With the introduction of the Sun Storage 7000 series there has been much interest in its performance, which I've been demonstrating in this blog. Along with Bryan and Roch, I've also been helping other teams properly size and evaluate their configurations through performance testing. The advent of technologies like the hybrid storage pool makes performance testing more complicated, but no less important.

Over the course of performance testing and analysis, we assembled best practices internally to help Sun staff avoid common testing mistakes, tune their systems for maximum performance, and properly test the hybrid storage pool. In the interest of transparency and helping others understand the issues surrounding performance testing, I'll be posting this information over two posts, and my load generation scripts in a third.

Performance Testing - Top 10 Suggestions:

  1. Sanity Test
  2. Before accepting a test result, find ways to sanity test the numbers.

    When testing throughput over a gigabit network interface, the theoretical maximum is about 120 Mbytes/sec (converting 1 GbE to bytes.) I've been handed results of 300 Mbytes/sec and faster over a gigabit link, which is clearly wrong. Think of ways to sanity test results, such as checking against limits.

    IOPS can be checked in a similar way: 20,000 x 8 Kbyte read ops/sec would require about 156 Mbytes/sec of network throughput, plus protocol headers - too much for a 1 GbE link.

  3. Double Check
  4. When collecting performance data, use an alternate method to confirm your results.

    If a test measures network throughput, validate results from different points in the data path: switches, routers, and of course the origin and destination. A result can appear sane but still be wrong. I've discovered misconfigurations and software bugs this way, by checking if the numbers add up end-to-end.

  5. Beware of Client Caching
  6. File access protocols may cache data on the client, which is performance tested instead of the fileserver.

    This mistake should be caught by the above two steps, but it is so common it deserves a separate mention. If you test a fileserver with a file small enough to fit within the client's RAM, you may be testing client memory bandwidth, not fileserver performance. This is currently the most frequent mistake we see people make when testing NFS performance.

  7. Distribute Client Load
  8. Use multiple clients, at least 10.

    The Sun Storage 7410 has no problem saturating 10 GbE interfaces, but it's difficult for a client to do the same. A fileserver's optimized kernel can respond to requests much quicker than client-side software can generate them. In general, it takes twice the CPU horsepower to drive load than it does to accept it.

    Network bandwidth can also be a bottleneck: it takes at least ten 1 Gbit clients to max out a 10 Gbit interface. The 7410 has been shown to serve NFS at 1.9 Gbytes/sec, so at least sixteen 1 Gbit clients would be required to test max performance.

  9. Drive CPUs to Saturation
  10. If the CPUs are idle, the system is not operating at peak performance.

    The ultimate limiter in the 7000 series is measured as CPU utilization, and with the 7410's four quad-core Opterons, it takes a tremendous workload to reach its limits. To see the system at peak performance, add more clients, a faster network, or more drives to serve I/O. If the CPUs are not maxed out, they can handle more load.

    This is a simplification, but a useful one. Some workloads are CPU heavy due to the cycles to process instructions, others with CPU wait cycles for various I/O bus transfers. Either way, it's measured as percent CPU utilization - and when that reaches 100%, the system can generally go no faster (although it may go a little faster if polling threads and mutex contention backs off.)

  1. Disks Matter
  2. Don't ignore the impact of rotational storage.

    A full Sun Storage 7410 can have access to a ton of read cache: up to 128 Gbytes of DRAM and six 100 Gbyte SSDs. While these caches can greatly improve performance, disk performance can't be ignored as data must eventually be written to (and read from) disk. A bare minimum of two fully-populated JBODs are required to properly gauge 7410 performance.

  3. Check Your Storage Profile
  4. Evaluate the desired redundancy profile against mirroring.

    The default RAID-Z2 storage profile on the 7000 series provides double-parity, but can also deliver lower performance than mirroring, particularly with random reads. Test your workload with mirroring as well as RAID-Z2, then compare price/performance and price/Gbyte to best understand the tradeoff made.

  5. Use Readzillas for Random Reads
  6. Use multiple SSDs, tune your record size, and allow for warmup.

    Readzillas (read-biased SSDs), can greatly improve random read performance, if configured properly and given time to warm up. Each Readzilla currently delivers around 3,100 x 8 Kbyte read ops/sec, and has 100 Gbytes of capacity. For best performance, use as many Readzillas as possible for concurrent I/O. Also consider that, due to the low-throughput nature of random-read workloads, it can take several hours to warm up 600 Gbytes of read cache.

    On the 7000 series, when using Readzillas on random read workloads, adjust the database record size from its 128 Kbyte default down to 8 Kbytes before creating files, or size it to match your application record size. Data is retrieved from Readzillas by their record size, and smaller record sizes improve the available IOPS from the read cache. This must be set before file creation, as ZFS doesn't currently rewrite files after this change.

  7. Use Logzillas for Synchronous Writes
  8. Accelerate synchronous write workloads with SSD based intent logs.

    Some file system operations, like file and directory creation, and writes to database log files are considered "synchronous writes," requiring data be on disk before the client can continue. Flash-based intent log devices, or Logzillas, can only dramatically speed up workloads comprised of synchronous writes; otherwise, data is written directly to disk.

    Logzillas can provide roughly 10,000 write ops/sec, depending on write size, or about 100 Mbytes/sec of write throughput, and scale linearly to meet demand.

  9. Document Your Test System
  10. Either test a max config, or make it clear that you didn't.

    While it's not always possible or practicable to obtain a maximum configuration for testing purposes, the temptation to share and use results without a strong caveat to this effect should be resisted. Every performance result should be accompanied by details on the target system, and a comparison to a maximum configuration, to give an accurate representation of a product's true capabilities.

The main source of problems in performance testing that we've seen is the misuse of benchmark software and the misinterpretation of their results. The above suggestions should help the tester avoid the most common problems in this field. No matter how popular or widely-used benchmark software is, the tester is obliged to verify the results. And by paying sufficient attention to the test environment - i.e. system configuration, client load balance, network bandwidth - you can avoid common pitfalls (such as measuring 300 Mbytes/sec over a 1 Gbit/sec interface, which was courtesy of a popular benchmarking tool.)

In part two, I'll step through a more detailed checklist for max performance testing.

Tuesday Mar 17, 2009

Heat Map Analytics

I've been recently posting screenshots of heat maps from Analytics - the observability tool shipped with the Sun Storage 7000 series.

These heat maps are especially interesting, which I'll describe here in more detail.


To start with, when you first visit Analytics you have an empty worksheet and need to add statistics to plot. Clicking on the plus icon next to "Add statistic" brings up a menu of statistics, as shown on the right.

I've clicked on "NFSv3 operations" and a sublist of possible breakdowns are shown. The last three (not including "as a raw statistic") are represented as heat maps. Clicking on "by latency" would show "NFSv3 operations by latency" as a heat map. Great.

But it's actually much more powerful than it looks. It is possible to drill down on each breakdown to focus on behavior of interest. For example, latency may be more interesting for read or write operations, depending on the workload. If our workload was performing synchronous writes, we may like to see the NFS latency heat map for 'write' operations separately - which we can do with Analytics.

To see an example of this, I've selected "NFS operations by type of operation", then selected 'write', then right-clicked on the "write" text to see the next breakdowns that are possible:

This menu is also visible by clicking the drill icon (3rd from the right) to drill down further.

By clicking on latency, it will now graph "NFSv3 operations of type write broken down by latency". So these statistics can be shown in whatever context is most interesting - perhaps I want to see NFS operations from a particular client, or for a particular file. Here are NFSv3 writes from the client 'deimos', showing the filenames that are being written:

Awsome. Behind the scenes, DTrace is building up dynamic scripts to fetch this data. We just click the mouse.

This was important to mention - the heat maps I'm about to demonstrate can be customized very specifically, by type of operation, client, filename, etc.

Sequential reads

I'll demonstrate heat maps at the NFS level by running the /usr/bin/sum command on a large file a few times, and letting it run longer each time before hitting Ctrl-C to cancel it. The sum command calculates a file's checksum, and does so by sequentially reading through the file contents. Here is what the three heat maps from Analytics shows:

The top heat map of offset clearly shows the client's behavior - the stripes show sequential I/O. The blocks show the offsets of the read operations as the client creeps through the file. I mounted the client using forcedirectio, so that NFS would not cache on the file contents on the client - and would be forced to keep reading the file each time.

The middle heat map shows the size of the client I/O requests. This shows the NFS client is always requesting 8 Kbyte reads. The bottom heat map shows NFS I/O latency. Most of the I/O was between 0 and 35 microseconds - but there are some odd clouds of latency on the 2nd and 3rd runs.

These latency clouds would be almost invisible if a linear color scheme was used - these heat maps use false color to emphasize detail. The left panel showed that on average there were 1771 ops/sec faster than 104 us (adding up the numbers), and that the entire heat map averaged at 1777 ops/sec; this means that the latency clouds (at about 0.7 ms) represent 0.3% of the I/O. The false color scheme makes them clearly visible, which is important for latency heat maps - as these slow outliers can hurt performance - even though they are relatively infrequent.

For those interested in more detail, I've included a couple of extra screenshots to explain this further:

  • Screenshot 1: NFS operations and disk throughput. From the top graph, it's clear how long I left the sum command running each time. The bottom graph of disk I/O bytes shows that the file I was checksumming had to be pulled in from disk for the entire first run, but only later in the second and third runs. Correspond the times to the offset heat map above - the 2nd and 3rd runs are reading data that is now cached, and doesn't need to be read from disk.

  • Screenshot 2: ARC hits/misses. This shows what the ZFS DRAM cache is doing (which is the 'ARC' - the Adaptive Replacement Cache). I've shown the same statistic twice, so that I can highlight breakdowns separately. The top graph shows a data miss at 22:30:24, which triggers ZFS to prefetch the data (since ZFS detects that this is a sequential read.) The bottom graph shows data hits are kept high, thanks to ZFS prefetch, and ZFS prefetch in operation: the "prefetched data misses" shows requests triggered by prefetch that were not already in the ARC, and read from disk; and the "prefetched data hits" shows prefetch requests that were already satisfied by the ARC. The latency clouds correspond to the later prefetch data misses, where some client requests are arriving while prefetch is still reading from disk - and are waiting for that to complete.

Random reads

While the rising stripes of a sequential workload are clearly visible in the offset heat map, random workloads are also easily identifiable:

The NFS operations by offset shows a random and fairly uniform pattern, which matches the random I/O I now have my client requesting. These are all hitting the ZFS DRAM cache, and so the latency heat map shows most responses in the 0 to 32 microsecond range.

Checking how these known workloads look in Analytics is valuable, as when we are faced with the unknown we know what to look for.

Disk I/O

The heat maps above demonstrated Analytics at the NFS layer; Analytics can also trace these details at the back-end: what the disks are doing, as requested by ZFS. As an example, here is a sequential disk workload:

The heat maps aren't as clear as they are at the NFS layer, as now we are looking at what ZFS decides to do based on our NFS requests.

The sequential read is mostly reading from the first 25 Gbytes of the disks, as shown in the offset heat map. The size heat map shows ZFS is doing mostly 128 Kbyte I/Os, and the latency heat map shows the disk I/O time is often about 1.20 ms, and longer.

Latency at the disk I/O layer doesn't directly correspond to client latency - it depends on the type of I/O. Asynchronous writes and prefetch I/O won't necessarily slow the client, for example.

Vertical Zoom

There is a way to zoom these heat maps vertically. Zooming horizontally is obvious (the first 10 buttons above each heat map do that - by changing the time range), but the vertical zoom isn't so obvious. It is documented in the online help - I just wanted to say here that it does exist. In a nutshell: click the outliers icon (last on the right) to switch outlier elimination modes (5%, 1%, 0.1%, 0.01%, none), which often will do what you want (by zooming to exclude a percentage of outliers); otherwise left click a low value in the left panel, shift click a high value, then click the outliers icon.


As mentioned earlier, these heat maps use optimal resolutions at different ranges to conserve disk space, while maintaining visual resolution. They are also saved on the system disks, which have compression enabled. Still, when recording this data every second, 24 hours a day, the disk space can add up. Check their disk usage by going to Analytics->Datasets and clicking the "ON DISK" title to sort by size:

The size is listed before compression, so the actual consumed bytes is lower. These datasets can be suspended by clicking the power button, which is handy if you'd like to keep interesting data but not continue to collect new data.

Playing around...

While using these heat maps we noticed some unusual and detailed plots. Bryan and I starting wondering if it were possible to generate artificial workloads that plotted arbitrary patterns, such as spelling out words in 8 point text. This would be especially easy for the offset heat map at the NFS level - since the client requests the offsets, we just need to write a program to request reads or writes to the offsets we want. Moments after this idea, Bryan and I were furiously coding to see who could finish it first (and post comical messages to each other.) Bryan won, after about 10 minutes. Here is an example:

Awsome, dude! ... (although that wasn't the first message we printed ... when I realized Bryan was winning, I logged into his desktop, found the binary he was compiling, and posted the first message to his screen before he had finished writing the software. However my message appeared as: "BWC SnX" (Bryan's username is "bmc".) Bryan was looking at the message, puzzled, while I'm saying "it's upside down - your program prints upside down!")

I later modified the program to work for the size heat maps as well, which was easy as the client requests it. But what about the latency heat maps? Latency isn't requested - it depends on many factors: for reads, it depends on whether the data is cached, and if not, whether it is on a flash memory based read cache (if one is used), and if not, then it depends on how much disk head seek and rotation time it takes to pull it in - which varies depending on the previous disk I/O. Maybe this can't be done...

Actually, it can be done. Here is all three:

Ok, the latency heat map looks a bit fuzzy, but this does work. I could probably improve it if I spent more than 30 mins on the code - but I have plenty of actual work to do.

I got the latency program to work by requesting data which was cached in DRAM, of large increasing sizes. The latency from DRAM is consistent and relative to the size requested, so by calling reads with certain large I/O sizes I can manufacture a workload with the latency I want (close to.) The client was mounted forcedirectio, so that every read caused an NFS I/O (no client caching.)

If you are interested in the client programs that injected these workloads, they are provided here (completely unsupported) for your entertainment: offsetwriter.c, sizewriter.c and latencywriter.c. If you don't have a Sun Storage 7000 series product to try them on, you can try the fully functional VMware simulator (although they may need adjustments to compensate for the simulator's slower response times.)


Heat maps are an excellent visual tool for analyzing data, and identifying patterns that would go unnoticed via text based commands or plain graphs. Some may remember Richard McDougall's Taztool, which used heat maps for disk I/O by offset analysis, and was very useful at the time (I reinvented it later for Solaris 10 with DTraceTazTool.)

Analytics takes heat maps much further:

  • visibility of different layers of the software stack: disk I/O, NFS, CIFS, ...
  • drilldown capabilities: for a particular client or file only, ...
  • I/O by offset, I/O by size and I/O by latency
  • can archive data 24x7 in production environments
  • optimal disk storage

With this new visibility, heat maps are illuminating numerous performance behaviors that we previously didn't know about and some we still don't yet understand - like the Rainbow Pterodactyl. DTrace has made this data available for years, Analytics is making it easy to see.


Brendan Gregg, Fishworks engineer


« July 2016