Friday Feb 06, 2009

DRAM latency

In my previous post, I showed NFS random read latency at different points in the operating system stack. I made a few references to hits from DRAM - which were visible as a dark solid line at the bottom of the latency heat maps. This is worth exploring in a little more detail, as this is both interesting and another demonstration of Analytics.

Here are both the delivered NFS latency and the disk latency, for disk + DRAM alone:

The first clue is that the dark line at the bottom appears in the NFS latency map only. This suggests these operations return to the client without ever reaching the disk layer.

Zooming in on the vertical latency scale:

We can now see that these operations are mostly in the 0 us to 21 us range - which is very, very fast. DRAM fast. As an aside, there are other NFS operations apart from read that can return from DRAM - these include open, close and stat. We can confirm these are all reads by viewing the breakdown by NFS operation type:

The average is 2633 NFS reads/sec, which includes hits from DRAM and reads from disk.

Now we'll use the "ZFS ARC" accesses statistic to see our DRAM cache hit rate (the ARC is our DRAM based filesystem cache):

The average for this visible range shows 467 data hits/sec - consistent with the number of fast NFS operations/sec that the latency map suggested were DRAM based.
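As a quick cross-check, the DRAM hit ratio implied by these two averages can be worked out directly. This is just the arithmetic on the figures above, under the assumption that each NFS read maps to roughly one ARC access:

    # Rough DRAM hit ratio implied by the two averages above (illustrative only;
    # assumes each NFS read maps to roughly one ARC access).
    nfs_reads_per_sec = 2633       # total NFS reads/sec (DRAM hits + disk reads)
    arc_data_hits_per_sec = 467    # "ZFS ARC" data hits/sec

    hit_ratio = arc_data_hits_per_sec / nfs_reads_per_sec
    print("DRAM hit ratio ~= %.0f%%" % (hit_ratio * 100))    # ~18%
    print("reads going to disk ~= %d/sec" % (nfs_reads_per_sec - arc_data_hits_per_sec))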

This is amazing stuff - examining the latency throughout the software stack, and clearly seeing the difference between DRAM and disk hits. You can probably see why we picked heat maps to show latency instead of line graphs of average latency. Traditional performance tools provide average latency for these layers - but so much information is lost when averaging this data. Since using these heat maps for latency, we've noticed many issues that would otherwise have gone unnoticed with single values.

Friday Jan 30, 2009

L2ARC Screenshots

Back before the Fishworks project went public, I posted an entry to explain how the ZFS L2ARC (Level 2 ARC) worked - a flash memory based cache currently intended for random read workloads. I was itching to show screenshots from Analytics, which I'm now able to do. From these screenshots, I'll describe in detail how the L2ARC performs.

Summary

There are a couple of screenshots that really tell the story. This is on a Sun Storage 7410 with the following specs:

  • 128 Gbytes of DRAM
  • 6 x 100 Gbyte "Readzillas" (read optimized SSDs) as the L2ARC
  • 6 x JBODs (disk trays), for a total of 140 disks configured with mirroring

As a workload, I'm using 10 clients (described previously), 2 random read threads per client with an 8 Kbyte I/O size, and a 500 Gbyte total working set mounted over NFS. This 500 Gbyte working set represents your frequently accessed data ("hot" data) that you'd like to be cached; this doesn't represent the total file or database size - which may be dozens of Tbytes. From Analytics on the 7410:

The top graph shows the L2ARC population level, and the bottom shows NFS operations/sec. As the L2ARC warms up, delivered performance in terms of read ops/sec increases, as data is returned from the SSD based L2ARC rather than slower disks. The L2ARC has increased the IOPS by over 5x.

5x IOPS! That's the difference 6 of our current SSDs make when added to: 140 disks configured with mirroring plus 128 Gbytes of warm DRAM cache - meaning this system was already tuned and configured to serve this workload as fast as possible, yet the L2ARC has no problem magnifying performance further. If I had used fewer disks, or configured them with RAID-Z (RAID-5), or used less DRAM, this improvement ratio would be much higher (demonstrated later.) But I'm not showing this in the summary because this isn't about IOPS - this is about latency:

Here I've toggled a switch to enable and disable the L2ARC. The left half of these graphs shows the L2ARC disabled - which is the performance from disks plus the DRAM cache. The right half shows the L2ARC enabled - so that its effect can be compared. Heat maps have been used to graph latency - the time to service each I/O. Lower is faster, and darker colors mean more I/Os occurred at that time (x-axis) and at that latency (y-axis). Dark colors low down are good - they mean I/Os are completing quickly.
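If you're wondering what a latency heat map actually is underneath, here is a minimal sketch of how one can be built from raw I/O completion events - each event a (timestamp, latency) pair. The bucket sizes and the toy events are made up for illustration; this isn't the Analytics implementation, just the idea behind it:

    from collections import defaultdict

    def latency_heatmap(events, latency_bucket_us=100):
        """events: iterable of (completion_time_secs, latency_us) pairs.
        Returns {(second, latency_bucket): count} - each count is one pixel's darkness."""
        cells = defaultdict(int)
        for t, lat_us in events:
            x = int(t)                             # x-axis: one column per second
            y = int(lat_us // latency_bucket_us)   # y-axis: latency bucket
            cells[(x, y)] += 1                     # more I/Os -> darker pixel
        return cells

    # Toy example: two fast (DRAM-like) I/Os and one slow (disk-like) I/O in one second.
    print(latency_heatmap([(10.2, 15), (10.7, 20), (10.9, 8200)]))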

These maps show I/O latency plummet when the L2ARC is enabled, delivering I/O faster than disk was able to. Latency at both the NFS level and disk level can be seen, which is often helpful for locating where latency originates; here it simply shows that the faster SSD performance is being delivered to NFS. There are still some I/Os completing slowly when the L2ARC is enabled (lighter colors in the top right), as the L2ARC is only 96% warm at this point - so 4% of the requested I/Os are still being serviced from disk. If I let the L2ARC warm up further, the top right will continue to fade.

There is one subtle difference between the heat maps - can you spot it? There is a dark stripe of frequent and fast I/O at the bottom of the NFS latency map, which doesn't appear in the disk map. These are read requests that hit the DRAM cache, and return from there.

The bottom graph shows IOPS, which increased (over 5x) when the L2ARC was enabled, due to the faster I/O latency.

This is just one demonstration of the L2ARC - I've shown a good result, but this isn't the best latency or IOPS improvement possible.

Before: DRAM + disk

Let's take a closer look at the NFS latency before the L2ARC was enabled:

This shows the performance delivered by DRAM plus the 140 mirrored disks. The latency is mostly between 0 and 10 ms, which is to be expected for a random read workload on 7,200 RPM disks.

Zooming in:

The vertical scale has now been zoomed to 10 ms. The dark line at the bottom is for hits from the DRAM cache - which is averaging about 460 hits/sec. Then there is a void until about 2 ms - where these disks start to return random IOPS.

After: DRAM + L2ARC + disk

Now a closer look at the NFS latency with the L2ARC enabled, and warmed up:

Here I've already zoomed to the 10 ms range, which covers most of the I/O. In fact, the left panel shows that most I/O took less than 1 ms.

Zooming in further:

The L2ARC now begins returning data over NFS at around 300 us, and as the previous graph showed, most I/Os are returned within 1 ms, rather than 10 ms for disk.

The bottom line in the graph is DRAM cache hits, which is now about 2400 hits/sec - over 5x more than without the L2ARC. This may sound strange at first (how can the L2ARC affect DRAM cache performance?), but it makes sense - the client applications aren't stalled waiting for slower disks, and can send more IOPS. More IOPS means more chances to hit from the DRAM cache, and a higher hits/sec value. The hit/miss ratio is actually the same - we are just making better use of the DRAM cache as the clients can request from it more frequently.
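A back-of-the-envelope model shows why. The clients keep a roughly fixed number of synchronous reads in flight (10 clients x 2 threads = 20), so delivered IOPS is about that concurrency divided by average latency, and DRAM hits scale with IOPS while the hit ratio stays put. The latencies and hit ratio below are rough values read off the graphs in this post - a sketch, not a measurement:

    concurrency = 20        # 10 clients x 2 synchronous read threads in flight
    hit_ratio = 0.18        # fraction of reads served from DRAM (roughly constant)

    def model(avg_latency_sec):
        iops = concurrency / avg_latency_sec       # Little's law style estimate
        return iops, iops * hit_ratio              # total reads/sec, DRAM hits/sec

    print("before L2ARC: %6.0f reads/sec, %4.0f DRAM hits/sec" % model(0.0074))  # ~7.4 ms avg
    print("after  L2ARC: %6.0f reads/sec, %4.0f DRAM hits/sec" % model(0.0015))  # ~1.5 ms avg
    # Both reads/sec and DRAM hits/sec jump by the same ~5x factor.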

Hit Rate

We can see how the DRAM cache hits increase as the L2ARC warms up in the following screenshot, which shows hit statistics for the ARC (DRAM cache) and L2ARC (SSD cache):

As the L2ARC warms up, its hit rate improves. The ARC also serves more hits as the clients are able to send more IOPS.

We may have assumed that hits improved in this way, however it is still a good idea to check such assumptions whenever possible. Analytics makes it easy to check different areas of the software stack, from NFS ops down to disk ops.

Disk ops

For a different look at L2ARC warmup, we can examine disk ops/sec by disk:

Rather than highlighting individual disks, I've used the Hierarchical breakdown to highlight the system itself ("/turbot") in pale blue. The system is the head node of the 7410, and has 6 L2ARC SSDs - visible as the 6 wedges in the pie chart. The JBODs are not highlighted here, and their ops/sec is shown in the default dark blue. The graph shows the disk ops to the JBODs decreases over time, and those to the L2ARC SSDs increases - as expected.

Warmup Time

A characteristic can be seen in these screenshots that I haven't mentioned yet: the L2ARC is usually slow to warm up. Since it is caching a random read workload, it only warms up as fast as that data can be randomly read from disk - and these workloads have very low throughput.

Zooming in to the start of the L2ARC warmup:

The point I've selected (02:08:20) is when the ARC (DRAM cache) has warmed up, shown in the 3rd graph, which took over 92 minutes! This isn't the L2ARC - this is just to warm up main memory. The reason is shown in the 2nd graph - the read throughput from the disks, which is populating DRAM, is less than 20 Mbytes/sec. This is due to the workload - we are doing around 2,700 x 8 Kbyte random reads/sec - some of which return from the DRAM cache, which leaves a total throughput of less than 20 Mbytes/sec. The system has 128 Gbytes of DRAM, of which 112 Gbytes was used for the ARC. Warming up 112 Gbytes of DRAM at 20 Mbytes/sec should take 95 minutes - consistent with the real time it took. (The actual disk throughput is faster to begin with as it pulls in filesystem metadata, then slows down afterwards.)

If 112 Gbytes of DRAM takes 92 minutes to warm up, our 500 Gbytes of flash SSD based L2ARC should take at least 7 hours to warm up. In reality it takes longer - the top screenshot shows this took over a day to get warm. As the L2ARC warms up and serves requests, there are fewer requests to be served by disk - so that 20 Mbytes/sec of input decays.
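The warmup arithmetic is simple enough to sketch out, using the figures above. The 20 Mbytes/sec fill rate is the workload's disk read throughput, and since it only decays as the caches warm, these are best-case (lower bound) times:

    GIB = 1024 ** 3
    MIB = 1024 ** 2

    def warmup_minutes(cache_bytes, fill_bytes_per_sec):
        return cache_bytes / fill_bytes_per_sec / 60

    print(warmup_minutes(112 * GIB, 20 * MIB))        # ARC (DRAM): ~96 minutes
    print(warmup_minutes(500 * GIB, 20 * MIB) / 60)   # L2ARC: ~7 hours, at best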

The warmup isn't so much a problem because:

  • While it may take a while to warm up (depending on workload and L2ARC capacity), unless you are rebooting your production servers every couple of days you'll find you spend more time warm than cold. We are also working on a persistent L2ARC, which will be available in a future update, so that if a server does reboot it can begin warm.
  • The L2ARC does warm up to half its capacity rather quickly, to give you an early performance boost - it's getting to 100% that takes a while. This is visible in the top screenshot - the two steps at the start raise the L2ARC size quickly.

Warming up the L2ARC more aggressively could hurt overall system performance. The L2ARC has been designed to either help performance or do nothing - so you shouldn't have to worry about it causing a performance issue.

More IOPS

I mentioned earlier that the IOPS improvement would be higher with fewer disks or RAID-Z. To see what that looks like, I used the same system, clients and workload, but with 2 JBODs (48 disks) configured with RAID-Z2 (double parity) and wide stripes (46 disks wide.) The Sun Storage 7410 provides RAID-Z2 wide stripes as a configuration option to maximize capacity (and price/Gbyte) - but it does warn you not to pick this for performance:

If you had a random I/O workload in mind, you wouldn't want to pick RAID-Z2 wide stripes as each I/O must read from every disk in the stripe - and random IOPS will suffer badly. Ideally you'd pick mirroring (and my first screenshot in this post demonstrated that.) You could try RAID-Z narrow stripes if their performance was sufficient.
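To see why, a rough random read capacity estimate helps. The per-disk IOPS figure is an assumption (a 7,200 RPM disk manages on the order of 150 random reads/sec); the point is the ratio between the two configurations, not the absolute numbers, and these are back-end capacity estimates rather than the delivered NFS rates in my tests:

    per_disk_iops = 150     # assumed random read IOPS for one 7,200 RPM disk

    # Mirroring: a random read needs only one disk, and either side of a mirror
    # can service it - so every spindle contributes.
    mirror_iops = 140 * per_disk_iops        # ~21,000 back-end random read IOPS

    # RAID-Z2 wide stripes: each logical read touches every disk in the stripe,
    # so a 46-wide stripe behaves like roughly one disk for random reads.
    raidz2_wide_iops = 1 * per_disk_iops     # ~150 back-end random read IOPS

    print(mirror_iops, raidz2_wide_iops)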

Here is the result - 2 JBODs with RAID-Z2 wide stripes, warming up 6 L2ARC cache SSDs:

IOPS increased by 40x! ... While impressive, this is also unrealistic - no one would pick RAID-Z2 wide stripes for a random I/O workload in the first place.

But wait...

Didn't I just fix the problem? The random read ops/sec reached the same rate as with the 6 x JBOD mirrored system, and yet I was now using 2 x JBODs of RAID-Z2 wide stripes. The L2ARC, once warm, has compensated for the reduced disk performance - so we get great performance, and great price/Gbyte.

So while this setup appeared completely unrealistic, it turns out it could make some sense in certain situations - particularly if price/Gbyte was the most important factor to consider.

There are some things to note:

  • The filesystem reads began so low in this example (because of RAID-Z2 wide stripes) that disk input started at 2 Mbytes/sec and then decayed - so 500 Gbytes of L2ARC took 6 days to warm up.
  • Since disk IOPS were so painfully slow, any significant percentage of them slowed the clients to a crawl. The real boost only happened when the L2ARC was more than 90% warm, so that these slow disk IOPS were marginalized - hence the dramatic profile at the end of the NFS ops/sec graph. This means you really want your working set to fit into the available L2ARC; if it was only 10% bigger, the improvement may drop from 40x to 10x; and for 20% bigger - 5x. The penalty when using mirroring isn't so steep.
  • While the working set may fit entirely in the L2ARC, any outlier requests that go to disk will be very slow. For time sensitive applications, you'd still pick mirroring.

This tactic isn't really different for DRAM - if your working set fits into the DRAM cache (and this 7410 has 128 Gbytes of DRAM), then you could also use slower disk configurations - as long as warmup time and misses were acceptable. And the IOPS from DRAM get much higher.
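The sensitivity to working set coverage described above can be modeled with a simple weighted-latency calculation. The latencies are rough figures from the heat maps in this post (about 1 ms for L2ARC hits, ~7 ms for mirrored disk, tens of ms for RAID-Z2 wide stripes), so treat this as a sketch of the effect rather than a prediction:

    def speedup(coverage, fast_ms, slow_ms):
        """IOPS improvement from caching, for a fixed number of outstanding I/Os."""
        effective_ms = coverage * fast_ms + (1 - coverage) * slow_ms
        return slow_ms / effective_ms

    for coverage in (1.0, 0.9, 0.8):
        print("coverage %.0f%%: RAID-Z2 wide (50 ms disk) %4.1fx, mirror (7 ms disk) %4.1fx"
              % (coverage * 100, speedup(coverage, 1.0, 50.0), speedup(coverage, 1.0, 7.0)))
    # Dropping from 100% to 90% to 80% coverage collapses the wide-stripe gain
    # from ~50x to under 10x to under 5x, while the mirrored pool falls more gently.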

The before/after latency maps for this test were:

By zooming in to the before and after sections (as before), I could see that most of the I/Os were taking between 20 and 90 ms without the L2ARC, and then mostly less than 1 ms with the L2ARC enabled.

Adding more disks

You don't need the L2ARC to get more IOPS - you can just add more disks. Let's say you could choose between a system with L2ARC SSDs delivering 10,000 IOPS for your workload, or a system with many more disks - also delivering 10,000 IOPS. Which is better?

The L2ARC based system can reduce cost, power and space (part of Adam's HSP strategy with flash memory) - but just on IOPS alone the L2ARC solution should still be favorable - as this is 10,000 fast IOPS (flash SSD based) vs 10,000 slow IOPS (rotating disk based). Latency is more important than IOPS.

Flash disks as primary storage

You could use flash based SSD disks for primary storage (and I'm sure SSD vendors would love you to) - it's a matter of balancing price/performance and price/Gbyte. The L2ARC means you get the benefits of faster flash memory based I/O, plus inexpensive high density storage from disks - I'm currently using 1 Tbyte 7,200 RPM disks. The disks themselves provide the redundancy: you don't need to mirror the L2ARC SSDs (and hence buy more), as any failed L2ARC request is passed down to the primary storage.

Other uses for the L2ARC

The L2ARC is great at extending the reach of caching in terms of size, but it may have other uses too (in terms of time.) Consider the following example: you have a desktop or laptop with 2 Gbytes of DRAM, and an application goes haywire consuming all memory until it crashes. Now everything else you had running is slow - as their cached pages were kicked out of DRAM by the misbehaving app, and now must be read back in from disk. Sound familiar?

Now consider you had 2 Gbytes (or more) of L2ARC. Since the L2ARC copies what is in DRAM, it will copy the DRAM filesystem cache. When the misbehaving app kicks this out, the L2ARC still has a copy on fast media - and when you use your other apps again, they return quickly. Interesting! The L2ARC is serving as a backup of your DRAM cache.

This also applies to enterprise environments: what happens if you back up an entire filesystem on a production server? Not only can the additional I/O interfere with client performance, but the backup process can dump the hot DRAM cache as it streams through files - degrading performance much further. With the L2ARC, current and recent DRAM cache pages may be available on flash memory, reducing the performance loss during such perturbations. Here the limited L2ARC warmup rate is beneficial - hot data can be kicked out of DRAM quickly, but not the L2ARC.

Expectations

While the L2ARC can greatly improve performance, it's important to understand which workloads this is for, to help set realistic expectations. Here's a summary:

  • The L2ARC's benefits will be most visible for workloads with a high random read component. The L2ARC can help mixed random read/write workloads, however the higher the overall write ratio (specifically, write throughput) the more difficult it will be for the L2ARC to cache the working set - as it becomes a moving target.
  • The L2ARC is currently suited for 8 Kbyte I/Os. By default, ZFS picks a record size (also called "database size") of 128 Kbytes - so if you are using the L2ARC, you want to set that down to 8 Kbytes before creating your files. You may already be doing this to improve your random read performance from disk - 128 Kbytes is best for streaming workloads instead (or small files, where it shouldn't matter.) You could try 4 or 16 Kbytes if it matched the application I/O size, but I wouldn't go further without testing. A larger record size will reduce the IOPS; a smaller one will eat more DRAM for metadata.
  • The L2ARC can be slow to warm up (as is a massive amount of DRAM with a random read workload), as discussed earlier.
  • Use multiple L2ARC SSD devices ("Readzillas") to improve performance - not just for the capacity, but for the concurrent I/O. This is just like adding disk spindles to improve IOPS - but without the spindles. Each Readzilla the 7410 currently uses delivers around 3100 x 8 Kbyte read ops/sec. If you use 6 of them, that's over 18,000 x 8 Kbyte read ops/sec, plus what you get from the DRAM cache.
  • It costs some DRAM to reference the L2ARC, at a rate proportional to record size. For example, it currently takes about 15 Gbytes of DRAM to reference 600 Gbytes of L2ARC - at an 8 Kbyte ZFS record size. If you use a 16 Kbyte record size, that cost would be halved - 7.5 Gbytes. This means you shouldn't, for example, configure a system with only 8 Gbytes of DRAM, 600 Gbytes of L2ARC, and an 8 Kbyte record size - if you did, the L2ARC would never fully populate.

The L2ARC warmup in the first example reached 477 Gbytes of cached content. The following screenshot shows how much ARC (DRAM) metadata was needed to reference both the ARC and L2ARC data contents (ARC headers + L2ARC headers), at an 8 Kbyte record size:

It reached 11.28 Gbytes of metadata. Metadata has always been needed for the DRAM cache - this is the in memory information to reference the data, plus locks and counters (for ZFS coders: mostly arc_buf_hdr_t); the L2ARC uses similar in-memory information to refer to its in-SSD content, only this time we are referencing up to 600 Gbytes of content rather than 128 Gbytes for DRAM alone (current maximums for the 7410.)
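The rule of thumb from the list above can be turned into a small calculation. The ~200 bytes of header per cached record is an assumption backed out of the "15 Gbytes per 600 Gbytes at 8 Kbytes" figure, not an exact structure size:

    GIB = 1024 ** 3
    KIB = 1024

    def dram_headers_gbytes(l2arc_bytes, recordsize_bytes, header_bytes=200):
        records = l2arc_bytes / recordsize_bytes
        return records * header_bytes / GIB

    print(dram_headers_gbytes(600 * GIB, 8 * KIB))    # ~15 Gbytes at an 8 Kbyte record size
    print(dram_headers_gbytes(600 * GIB, 16 * KIB))   # ~7.5 Gbytes at 16 Kbytes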

Conclusion

The L2ARC can cache random read workloads on flash based SSDs, reducing I/O latency to sub-millisecond times. This fast response time from SSD is also consistent, unlike a mechanical disk with moving parts. By reducing I/O latency, IOPS may also improve - as the client applications can send more frequent requests. The examples here showed most I/O returning in sub-millisecond times with the L2ARC enabled, and 5x and 40x IOPS improvements over disk + DRAM alone.

The L2ARC does take a while to warm up, due to the nature of the workload it is intended to cache - random read I/O. It is preferable to set the filesystem record size to 8 Kbytes or so before using the L2ARC, and to also use multiple SSDs for concurrency - these examples all used 6 x 100 Gbyte SSDs, to entirely cache the working set.

While these screenshots are impressive, flash memory SSDs continue to get faster and have greater capacities. A year from now, I'd expect to see screenshots of even lower latency and even higher IOPS, for larger working sets. It's an exciting time to be working with flash memory.

Friday Jan 09, 2009

My Sun Storage 7410 perf limits

As part of my role in Fishworks, I push systems to their limits to investigate and solve bottlenecks. Limits can be useful to consider as a possible upper bound of performance - as they show what the target can do. I thought these results would make for some interesting blog posts if I could explain the setup, describe what the tests were, and include screenshots of these results in action. (Update: I've included results from colleagues who have tested in the same manner.)

To summarize the performance limits that I found for a single Sun Storage 7410 head node:

All tests are performed on Ethernet (usually 10 GbE) unless otherwise specified ("IB" == InfiniBand).

Like many products, the 7410 will undergo software and hardware updates over time. This page currently has results for:

  • 7410 Barcelona: The initial release, with four quad-core AMD Opteron CPUs (Barcelona), and the 2008.Q4 software.
  • 7410 Istanbul: The latest release, with four six-core AMD Opteron CPUs (Istanbul), and the 2009.Q3 software.

I should make clear that these are provided as possible upper bounds - these aren't what to expect for any given workload, unless your workload was similar to what I used for these tests. Click on the results to see details of the workloads used.

These are also the limits that were found with a given farm of clients and JBODs - it's possible the 7410 could go faster with more clients and more JBODs.

Updated 3-Mar-2009: added CIFS results.

Updated 22-Sep-2009: added column for 7410 Istanbul. Results will be added as they are collected.

Updated 12-Nov-2009: added Cindi's InfiniBand results.

1 Gbyte/sec NFS, streaming from disk

I've previously explored maximum throughput and IOPS that I could reach on a Sun Storage 7410 by caching the working set entirely in DRAM, which may be likely as the 7410 currently scales to 128 Gbytes of DRAM per head node. The more your working set exceeds 128 Gbytes or is shifting, the more I/O will be served from disk instead. Here I'll explore the streaming disk performance of the 7410, and show some of the highest throughputs I've seen.

Roch posted performance invariants for capacity planning, which included values of 1 Gbyte/sec for streaming read from disk, and 500 Mbytes/sec for streaming write to disk (50% of the read value due to the nature of software mirroring). Amitabha posted Sun Storage 7000 Filebench results on a 7410 with 2 x JBODs, which had reached 924 Mbytes/sec streaming read and 461 Mbytes/sec streaming write - consistent with Roch's values. What I'm about to post here will further reinforce these numbers, and include the screenshots taken from the Sun Storage 7410 browser interface to show this in action - specifically from Analytics.

Streaming read

The 7410 can currently scale to 12 JBODs (each with 24 disks), but when I performed this test I only had 6 available to use, which I configured with mirroring. While this isn't a max config, I don't think throughput will increase much further with more spindles - after about 3 JBODs there is plenty of raw disk throughput capacity. This is the same 7410, clients and network as I've described before, with more configuration notes below.

The following screenshot shows 10 clients reading 1 Tbyte in total over NFS, with 128 Kbyte I/Os and 2 threads per client. Each client is connected using 2 x 1 GbE interfaces, and the server is using 2 x 10 GbE interfaces:

Showing disk I/O bytes along with network bytes is important - it helps us confirm that this streaming read test did read from disk. If the intent is to go to disk, make sure it does - otherwise it could be hitting from the server or client cache.

I've highlighted the network peak in this screenshot, which was 1.16 Gbytes/sec network outbound - but this is just a peak. The average throughput can be seen by zooming in on Analytics (not shown here), which found the network outbound throughput to average 1.10 Gbytes/sec, and disk bytes at 1.07 Gbyte/sec. The difference is that the network result includes data payload plus network protocol headers, whereas the disk result is data payload plus ZFS metadata - which is adding less than the protocol headers. So 1.07 Gbytes/sec is closer to the average data payload read over NFS from disk.

It's always worth sanity checking results however possible, and we can do this here using the times. This run took 16:23 to complete, and 1 Tbyte was moved - that's an average of 1066 Mbytes/sec of data payload - 1.04 Gbytes/sec. This time includes the little step at the end of the run, and when I zoomed in to see the average it didn't include that. The step is from a slow client completing (I found out which client using the Analytics statistic "NFS operations broken down by client" - and I now need to check why one client is a little slower than the others!) Even including the slow client, 1.04 Gbytes/sec is a great result for delivered NFS reads from disk, on a single head node.
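That sanity check is worth spelling out, since it's a handy habit for any benchmark. The units below assume the 1 Tbyte moved is a binary terabyte, which is how the 1066 Mbytes/sec figure falls out:

    TIB = 1024 ** 4
    MIB = 1024 ** 2

    payload_bytes = 1 * TIB
    run_seconds = 16 * 60 + 23          # the run took 16:23 to complete

    mbytes_per_sec = payload_bytes / run_seconds / MIB
    print("%.0f Mbytes/sec = %.2f Gbytes/sec" % (mbytes_per_sec, mbytes_per_sec / 1024))
    # ~1066 Mbytes/sec, or ~1.04 Gbytes/sec of NFS data payload read from disk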

Update: 2-Mar-09

In a comment posted on this blog, Rex noticed a slight throughput decay over time in the above screenshots, and wondered if this continued if left longer. To test this out, I had the clients loop over their input files (files which were much too big to fit in the 7410's DRAM cache), and mounted the clients with forcedirectio (to avoid them caching); this kept the reads being served from disk, which I could leave running for hours. The result:

The throughput is rock steady over this 24 hour interval.

I hoped to post some screenshots showing the decay and drilling down with Analytics to explain the cause. Problem is, it doesn't happen anymore - throughput is always steady. I have upgraded our 10 GbE network since the original tests - our older switches were getting congested and slowing throughput a little, which may have been happening earlier.

Streaming write, mirroring

These write tests were with 5 JBODs (from a max of 12), but I don't think the full 12 would improve write throughput much further. This is the same 7410, clients and network as I've described before, with more configuration notes below.

The following screenshot shows 20 clients writing 1 Tbyte in total over NFS, using 2 threads per client and 32 Kbyte I/Os. The disk pool has been configured with mirroring:

The statistics shown tell the story - the network throughput has averaged about 577 Mbytes/sec inbound, shown in the bottom zoomed graph. This network throughput includes protocol overheads, so the actual data throughput is a little less. The time for the 1 Tbyte transfer to complete is about 31 minutes (top graphs), which indicates the data throughput was about 563 Mbytes/sec. The Disk I/O bytes graph confirms that this is being delivered to disk, at a rate of 1.38 Gbytes/sec (measured by zooming in and reading the range average, as with the bottom graph). The rate of 1.38 Gbytes/sec is due to software mirroring ( (563 + metadata) x 2 ).

Streaming write, RAID-Z2

For comparison, the following shows the same test as above, but with double parity raid instead of mirroring:

The network throughput has dropped to 535 Mbytes/sec inbound, as there is more work for the 7410 to calculate parity during writes. As this took 34 minutes to write 1 Tbyte, our data rate is about 514 Mbytes/sec. The disk I/O bytes is much lower (notice the vertical range), averaging about 720 Mbytes/sec - a big difference to the previous test, due to RAID-Z2 vs mirroring.
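Comparing the two write tests side by side shows the back-end cost of each redundancy scheme: the ratio of disk bytes written to NFS payload received. These are just the averages quoted above divided out, so treat the ratios as approximate:

    # Average rates from the two streaming write tests (Mbytes/sec).
    tests = {
        "mirroring": {"payload": 563, "disk_writes": 1380},   # 1.38 Gbytes/sec to disk
        "RAID-Z2":   {"payload": 514, "disk_writes": 720},
    }
    for name, t in tests.items():
        print("%-10s %.1fx disk bytes per payload byte"
              % (name, t["disk_writes"] / t["payload"]))
    # Mirroring writes every byte (plus metadata) twice: ~2.5x here.
    # RAID-Z2 adds parity and metadata: ~1.4x - less disk bandwidth per payload
    # byte, but lower delivered client throughput.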

Configuration Notes

To get the maximum streaming performance, jumbo frames were used and the ZFS record size ("database size") was left at 128 Kbytes. 3 x SAS HBA cards were used, with dual pathing configured to improve performance.

This isn't a maximum config for the 7410, since I'm testing a single head node (not a cluster), and I don't have the full 12 JBODs.

In these tests, ZFS compression wasn't enabled (the 7000 series provides 5 choices: off, LZJB, GZIP-2, GZIP, GZIP-9.) Enabling compression may improve performance as there is less disk I/O, or it may not due to the extra CPU cycles to compress/uncompress the data. This would need to be tested with the workload in mind.

Note that the read and write optimized SSD devices known as Readzilla and Logzilla weren't used in this test. They currently help with random read workloads and synchronous write workloads, neither of which I was testing here.

Conclusion

As shown in the screenshots above, the streaming performance to disk for the Sun Storage 7410 matches what Roch posted, which rounding down is about 1 Gbyte/sec for streaming disk read, and 500 Mbytes/sec for streaming disk write. This is the delivered throughput over NFS, the throughput to the disk I/O subsystem was measured to confirm that this really was performing disk I/O. The disk write throughput was also shown to be higher due to software mirroring or RAID-Z2 - averaging up to 1.38 Gbytes/sec. The real limits of the 7410 may be higher than these - this is just the highest I've been able to find with my farm of test clients.

Wednesday Dec 31, 2008

Unusual disk latency

The following screenshot shows two spikes of unusually high disk I/O latency during a streaming write test:

This screenshot is from Analytics on the 7410. The issue is not with the 7410, it's with disk drives in general. The disk latency here is also not suffered by the client applications, as this is ZFS asynchronously flushing write data to disk. Still, it was great to see how easily Analytics could identify this latency, and interesting to see what the cause was.

See this video for the bizarre explanation:

Don't try this yourself...

Tuesday Dec 30, 2008

JBOD Analytics example

The Analytics feature of the Sun Storage 7000 series provides a fantastic new insight into system performance. Some screenshots have appeared in various blog posts, but there is a lot more of Analytics to show. Here I'll show some JBOD analytics.

Bryan coded a neat way to break down data hierarchically, and made it available for by-filename and by-disk statistics, along with pie charts. While it looks pretty, it is incredibly useful in unexpected ways. I just used it to identify some issues with my JBOD configuration, and realized after the fact that this would probably make an interesting blog post. Since Analytics archived the data I was viewing, I was able to go back and take screenshots of the steps I took - which I've included here.

The problem

I was setting up a NFSv3 streaming write test after configuring a storage pool with 6 JBODs, and noticed that the disk I/O was a little low:

This is measuring disk I/O bytes - I/O from the Sun Storage 7410 head node to the JBODs. I was applying a heavy write workload which was peaking at around 1.6 Gbytes/sec - but I know the 7410 peaks higher than this.

Viewing the other statistics available, I noticed that the disk I/O latency was getting high:

Most of the ops/sec were around 10 ms, but there are plenty of outliers beyond 500 ms - shown in the screenshot above. These disk I/O operations are from ZFS flushing data asynchronously to disk in large chunks, where the I/Os can queue up for 100s of ms. The clients don't wait for this to complete. So while some of this is normal, the outliers are still suspiciously high and worth checking further. Fortunately Analytics lets us drill down on these outliers in other dimensions.

As I had just onlined 144 new disks (6 JBODs worth), it was possible that one might have a performance issue. It wouldn't be an outright disk error - the Sun Storage software would pick that up and generate an alert, which hadn't happened. I'm thinking of other issues, where the disk can take longer than usual to successfully read sectors (perhaps from a manufacturing defect, vibration issue, etc.) This can be identified in Analytics by looking at the I/O latency outliers by disk.

From the left panel in the disk I/O by latency statistic, I selected 220 ms, right clicked and selected to break this down by disk:

If there is a disk or two with high latency, they'll usually appear at the top of the list on the left. Instead I have a list of disks with roughly the same average number of 220+ ms disk I/Os. I think. There are 144 disks in that list on the left, so it can take a while to scroll down them all to check (and you need to click the "..." ellipsis to expand the list). This is where the "Show hierarchy" button on the left is helpful - it will display the items in the box as wedges in a pie chart - which makes it easy to check if they all look the same:

Instead of displaying every disk, the pie chart begins by displaying the JBODs (the top of the hierarchy.) You can click the little "+" sign in the hierarchy view to show individual disks in each JBOD, like so:

But I didn't need to drill down to the disks to notice something wrong. Why do the two JBODs I selected have bigger slices of the pie? These are identical JBODs, are equal members of a ZFS mirrored pool - and so my write workload should be sent evenly across all JBODs (which I used Analytics to confirm by viewing disk I/O operations by disk.) So given they are the same hardware and are sent the same workload - they should all perform the same. Instead, two JBODs (whose names start with /2029 and end with 003 and 012) have returned slower disk I/O.

I confirmed the issue using a different statistic - average I/O operations by disk. This statistic is a little odd but very useful - it's not showing the rate of I/Os to the disks, but instead how many on average are active plus waiting in a queue (length of the wait queue.) Higher usually means I/Os are queueing for longer:

Again, these JBODs should be performing equally, and have the same average number of active and queued I/Os, but instead the 003 and 012 JBODs are taking longer to dispatch their I/Os. Something may be wrong with my configuration.

The Answer

Checking the Maintenance area identified the issue straight away:

The JBODs ending in 003 and 012 have only one SAS path connected. I intended to cable 2 paths to each JBOD, for both redundancy and to improve performance. I cabled it wrong!

Fixing the JBOD cabling so that all JBODs have 2 paths has improved the write throughput for this test:

And our JBODs look more balanced:

But they aren't perfectly balanced. Why do the 002 and 00B JBODs perform faster than the rest? Their pie slices are slightly smaller (~197 compared to 260-270 average I/Os.) They should have equal disk I/O requests, and take the same time to dispatch them.

The answer again is cabling - the 002 and 00B JBODs are in a chain of 2, and the rest are in a chain of 4. There is less bus and HBA contention for the JBODs in the chain of 2, so they perform better. To balance this properly I should have cabled 2 chains of 3, or 3 chains of 2 (this 7410 has 3 x SAS HBAs, so I can cable 3 chains).

Thoughts

Imagine solving this with traditional tools such as iostat: you may dig through the numerous lines of output and notice some disks with higher service times, but not notice that they belong to the same JBODs. This would be more difficult with iostat if the problem was intermittent, whereas Analytics constantly records all this information at a one second granularity, to allow after-the-fact analysis.

The aim of Analytics is not just to plot data, but to use GUI features to add value to the data. The hierarchy tree view and pie chart are examples of this - and were the key to identifying my configuration problems.

While I already knew to dual-path JBODs and to balance them across available chains - it's wonderful to so easily see the performance issues that these sub-optimal configurations can cause, and know that we've made identifying problems like this easier for everyone who uses Analytics.

Monday Dec 15, 2008

Up to 2 Gbytes/sec NFS

In a previous post, I showed how many NFS read ops/sec I could drive from a Sun Storage 7410, as a way of investigating its IOPS limits. In this post I'll use a similar approach to investigate streaming throughput, and discuss how to understand throughput numbers. Like before, I'll show a peak value along with a more realistic value, to illustrate why understanding context is important.

As DRAM scalability is a key feature of the Sun Storage 7410 - currently reaching 128 Gbytes per head node - I'll demonstrate streaming throughput when the working set is entirely cached in DRAM. This will provide some idea of the upper bound - the most throughput that can be driven in ideal conditions.

The screenshots below are from Analytics on a single node Sun Storage 7410, which is serving a streaming read workload over NFSv3 to 20 clients:

Here I've highlighted a peak of 2.07 Gbytes/sec - see the text above and below the box on the left.

While it's great to see 2 Gbytes/sec reached on a single NAS node, the above screenshot should also show that this was a peak only - it is more useful to see a longer term average:

The average for this interval (over 20 mins) is 1.90 Gbytes/sec. This graph shows network bytes by device, which shows the traffic was balanced across two nxge ports (each 10 GbE, and each about 80% utilised).

1.90 Gbytes/sec is a good result, but Analytics suggests the 7410 can go higher. Resources which can bound (limit) this workload include the CPU cycles to process instructions, and the CPU cycles spent waiting on memory to push 1.90 Gbytes/sec - both of these are reported as "CPU utilization":

For the same time interval, the CPU utilization on the 7410 was only around 67%. Hmm. Be careful here - the temptation is to do some calculations to predict where the limit could be based on that 67% utilization - but there could be other resource limits that prevent us from going much faster. What we do know is that the 7410 has sustained 1.90 Gbytes/sec and peaked at 2.07 Gbytes/sec in my test. The 67% CPU utilization encourages me to do more testing, especially with faster clients (these have 1600 MHz CPUs).

Answers to Questions

To understand these results, I'll describe the test environment by following the questions I posted previously:

  • This is not testing a cluster - this is a single head node.
  • It is for sale.
  • Same target as before - a Sun Storage 7410 with 128 Gbytes of DRAM, 4 sockets of quad-core AMD Opteron 2300 MHz CPU, and 2 x 2x10 GigE cards. It's not a max config since it isn't in a cluster.
  • Same clients as before - 20 blades, each with 2 sockets of Intel Xeon quad-core 1600 MHz CPUs, 6 Gbytes of DRAM, and 2 x 1 Gig network ports.
  • Client and server ports are connected together using 2 switches, and jumbo frames are enabled. Only 2 ports of 10 GbE are connected from the 7410 - one from each of its 2 x 10 GbE cards, to load balance across the cards as well as the ports. Both ports on each client are used.
  • The workload is streaming reads over files, with a 1 Mbyte I/O size. Once the client reaches the end of the file, it loops to the start. 10 processes were run on each of the clients. The files are mounted over NFSv3/TCP.
  • The total working set size is 100 Gbytes, which is cached in the 7410's 128 Gbytes of DRAM.
  • The results are screenshots from Analytics on the 7410.
  • The target Sun Storage 7410 may not have been fully utilized - see the CPU utilization graph above and comments.
  • The clients aren't obviously saturated, although they are processing the workload as fast as they can with 1600 MHz CPUs.
  • I've been testing throughput as part of my role as a Fishworks performance engineer.

Traps to watch out for regarding throughput

For throughput results, there are some specific additional questions to consider:

  • How many clients were used?
    • While hardware manufacturers make 10 GbE cards, it doesn't mean that clients can drive them. A mistake to avoid is to try testing 10 GbE (and faster) with only one underpowered client - and end up benchmarking the client by mistake. Apart from CPU horsepower, if your clients only have 1 GbE NICs then you'll need at least 10 of them to test 10 GbE, connected to a dedicated switch with 10 GbE ports.

  • What was the payload throughput?
    • In the above screenshots I showed network throughput by device, but this isn't showing us how much data was sent to the clients - rather, it shows how busy the network interfaces were. The value of 1.90 Gbytes/sec includes the inbound NFS requests, not just the outbound NFS replies (which include the data payload); it also includes the overheads of the Ethernet, IP, TCP and NFS protocol headers. The actual payload bytes moved is going to be a little less than the total throughput - how much exactly wasn't measured above.
  • Did client caching occur?
    • I mentioned this in my previous post, but it's worth emphasising. Clients will usually cache NFS data in the client's DRAM. This can produce a number of problems for benchmarking. In particular, if you measure throughput from the client, you may see throughput rates much higher than the client has network bandwidth, as the client is reading from its own DRAM rather than testing the target over the network (eg, measuring 3 Gbit/sec on a client with a 1 Gbit/sec NIC). In my test results above, I've avoided this issue by measuring throughput from the target 7410.
  • Was the throughput peak or an average?
    • Again, this was mentioned in my previous post and worth repeating here. The first two screenshots in this post show the difference - average throughput over a long interval is more interesting, as this is more likely to be repeatable.

Being more realistic: 1 x 10 GbE, 10 clients

The above test showed the limits I could find, although to do so required running many processes on each of the 20 clients, using both of the 1 GbE ports on each client (40 x 1 GbE in total), balancing the load across 2 x 10 GbE cards on the target - not just 2 x 10 GbE ports, and using a 1 Mbyte I/O size. The workload the clients are applying is as extreme as they can handle.

I'll now test a lighter (and perhaps more realistic) workload - 1 x 10 GbE port on the 7410, 10 clients using one of their 1 GbE ports each, and a single process on each client performing the streaming reads. The client process is /usr/bin/sum (a file checksum tool, which sequentially reads through files), run against 5 x 1 Gbyte files on each client - a 50 Gbyte working set in total:

This time the network traffic is on the nxge1 interface only, peaking at 1.10 Gbytes/sec for both inbound and outbound. The average outbound throughput can be shown in Analytics by zooming in a little and breaking down by direction:

That's 1.07 Gbytes/sec outbound. This includes the network headers, so the NFS payload throughput will be a little less. As a sanity check, we can see from the first screenshot x-axis that the test ran from 03:47:40 to about 03:48:30. We know that 50 Gbytes of total payload was moved over NFS (the shares were mounted before the run, so no client caching), so if this took 50 seconds - our average payload throughput would be about 1 Gbyte/sec. This fits.

10 GbE should peak at about 1.164 Gbytes/sec (converting gigabits to gibibytes) per direction, so this test reaching 1.07 Gbytes/sec outbound is 92% utilization of the 7410's 10 GbE interface. Each of the 10 clients' 1 GbE interfaces would be equally busy. This is a great result for such a simple test - everything is doing what it is supposed to. (While this might seem obvious, it did take much engineering work during the year to make this work so smoothly; see posts here and here for some details.)
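The 92% figure comes from converting the 10 GbE line rate into the same units Analytics reports. A quick sketch of the arithmetic:

    line_rate_bits = 10 * 10 ** 9                      # 10 GbE: 10 gigabits/sec
    line_rate_gbytes = line_rate_bits / 8 / 1024 ** 3  # in gibibytes/sec

    measured_gbytes = 1.07                             # outbound Gbytes/sec from Analytics
    print("line rate   = %.3f Gbytes/sec per direction" % line_rate_gbytes)
    print("utilization = %.0f%%" % (100 * measured_gbytes / line_rate_gbytes))
    # ~1.164 Gbytes/sec line rate, so 1.07 Gbytes/sec outbound is ~92% utilization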

In summary, I've shown that the Sun Storage 7410 can drive NFS requests from DRAM at 1 Gbyte/sec, and even up to about 2 Gbytes/sec with extreme load - which pushed 2 x 10 GbE interfaces to high levels of utilization. With a current max of 128 Gbytes of DRAM in the Sun Storage 7410, entirely cached working sets are a real possibility. While these results are great, it is always important to understand the context of such numbers to avoid the common traps - which I've discussed in this post.

Tuesday Dec 02, 2008

A quarter million NFS IOPS

Following the launch of the Sun Storage 7000 series, various performance results have been published. It's important when reading these numbers to understand their context, and how that may apply to your workload. Here I'll introduce some numbers regarding NFS read ops/sec, and explain what they mean.

A key feature of the Sun Storage 7410 is DRAM scalability, which can currently reach 128 Gbytes per head node. This can span a significant working set size, and so serve most (or even all) requests from the DRAM filesystem cache. If you aren't familiar with the term working set size - this refers to the amount of data which is frequently accessed; for example, your website could be multiple Tbytes in size - but only tens of Gbytes are frequently accessed hour after hour, which is your working set.

Considering that serving most or all of your working set from DRAM may be a real possibility, it's worth exploring this space. I'll start by finding the upper bound - what's the most NFS read ops/sec I can drive. Here are screenshots from Analytics that show sustained NFS read ops/sec from DRAM. Starting with NFSv3:

And now NFSv4:

Both beating 250,000 NFS random read ops/sec from a single head node - great to see!

Questions when considering performance numbers

To understand these numbers, you must understand the context. These are the sort of questions you can ask yourself, along with the answers for those results above:

  • Is this clustered?
    • When you see incredible performance results, check whether this is the aggregated result of multiple servers acting as a cluster. For this result I'm using a single Sun Storage 7410 - no cluster.
  • Is the product for sale? What is its price/performance?
    • Yes (which should also have the price.)
  • What is the target, and is it a max config?
    • It's a Sun Storage 7410 with 128 Gbytes of DRAM, 4 sockets of quad-core AMD Opteron 2300 MHz CPU, and 2 x 2x10 GigE cards. It's not a max config since it isn't in a cluster, and it only has 5 JBODs (although that didn't make a difference with the DRAM test above.)
  • What are the clients?
    • 20 blades - each with 2 sockets of Intel Xeon quad-core 1600 MHz CPUs, 6 Gbytes of DRAM, and 2 x 1 Gig network ports. With more and faster clients it may be possible to beat these results.
  • What is the network?
    • The clients are connected to a switch using 1 GigE, and the 7410 connects to the same switch using 10 GigE. All ports are connected, so the 7410 is balanced across its 4 x 10 GigE ports.
  • What is the workload?
    • Each client is running over 20 processes which randomly read from files over NFS, with a 1 byte I/O size. Many performance tests these days will involve multiple threads and/or processes to check scaling; any test that only uses 1 thread on 1 client isn't showing the full potential of the target.
  • What is the working set size?
    • 100 Gbytes. This is important to check - it shouldn't be so tiny as to be served from CPU caches, if the goal is to test DRAM.
  • How is the result calculated?
    • It's measured on the 7410, and is the average seen in the Analytics window for the visible time period (5+ mins). Be very careful with results measured from the clients - as they can include client caching.
  • Was the target fully utilized?
    • In this case, yes. If you are reading numbers to determine maximum performance, check whether the benchmark is intended to measure that - some aren't!
  • Were the clients or network saturated?
    • Just a common benchmark problem to look out for, especially telltale when results cap at about 120 Mbytes/sec (hmm, you mean 1 GigE?). If the client or network becomes saturated, you've benchmarked them as well as the target server - probably not the intent. In the above test I maxed out neither.
  • Who gathered the data and why?
    • I gathered these results as part of Fishworks performance analysis to check what the IOPS limits may be. They aren't Sun official results. I thought of blogging about them a couple of weeks after running the tests (note the dates in the screenshot), and used Analytics to go back in time and take some screenshots.

The above list covers many subtle issues to help you avoid them (don't learn them the hard way).

Traps to watch out for regarding IOPS

For IOPS results, there are some specific additional questions to consider:

  • Is this from cache?
    • Yes, which is the point of this test, as this 7410 has 128 Gbytes of DRAM.
  • What is the I/O size?
    • 1 byte! This was about checking what the limit may be, as a possible upper bound. An average I/O size of 1 Kbyte or 8 Kbytes is going to drop this result - as there is more work by the clients and server to do. If you are matching this to your workload, find out what your average I/O size is and look for results at that size.
  • Is the value an average, and for how long?
    • These are both 5+ minute averages. Be wary of tiny intervals that may show unsustainable results.
  • What is the latency?
    • Just as a 1 byte I/O size may make this value unrealistic, so may the latency for heavy IOPS results. Disks can be pushed to some high IOPS values by piling on more and more client threads, but the average I/O latency becomes so bad it is impractical. The latency isn't shown in the above screenshot!

Being more realistic: 8 Kbyte I/O with latency

The aim of the above was to discuss context, and to show how to understand a great result - such as 250,000+ NFS IOPS - by knowing what questions to ask. The two key criticisms for this result would be that it was for 1 byte I/Os, and that latency wasn't shown at all. Here I'll redo this with 8 Kbyte I/Os, and show how Analytics can display the NFS I/O latency. I'll also wind back to 10 clients, only use 1 of the 10 GigE ports on the 7410, and I'll gradually add threads to the clients until each is running 20:

The steps in the NFSv3 ops/sec staircase are where I'm adding more client threads.
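For readers wondering what such a client workload looks like, here is a minimal sketch of a ramping random-read load generator. It is not the actual tool used for these tests - the mountpoint, file names and ramp interval are made up - but it produces the same kind of staircase: threads are added over time, each issuing synchronous 8 Kbyte random reads against files on an NFS mount (you'd also need a working set large enough to defeat client-side caching, as noted above):

    import os, random, threading, time

    MOUNT = "/mnt/share"                                   # hypothetical NFS mountpoint
    FILES = ["%s/file%d" % (MOUNT, i) for i in range(10)]  # hypothetical working set
    IO_SIZE = 8192                                         # 8 Kbyte reads
    MAX_THREADS = 20
    RAMP_SECONDS = 60                                      # one new thread per minute

    def reader():
        fds = [os.open(path, os.O_RDONLY) for path in FILES]
        sizes = [os.fstat(fd).st_size for fd in fds]
        while True:
            i = random.randrange(len(fds))
            offset = random.randrange(0, max(sizes[i] - IO_SIZE, 1))
            os.pread(fds[i], IO_SIZE, offset)              # synchronous random read

    for _ in range(MAX_THREADS):                           # each step adds client load
        threading.Thread(target=reader, daemon=True).start()
        time.sleep(RAMP_SECONDS)
    time.sleep(3600)                                       # keep the load running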

I've reached over 145,000 NFSv3 read ops/sec - and this is not the maximum the 7410 can do (I'll need to use a second 10 GigE port to take this further). The latency does increase as more threads queue up, here it is plotted as a heat map with latency on the y-axis (the darker the pixel - the more I/Os were at that latency for that second.) At our peak (which has been selected by the vertical line), most of the I/Os were faster than 55 us (0.055 milliseconds) - which can be seen in the numbers in the list on the left.

Note that this is the NFSv3 read ops/sec delivered to the 7410 after the client NFS driver has processed the 8 Kbyte I/Os, which decided to split some of the 8 Kbyte reads into 2 x 4 Kbyte NFS reads (pagesize). This means the workload became a mixed 4k and 8k read workload - for which 145,000 IOPS is still a good value. (I'm tempted to redo this for just 4 Kbyte I/Os to keep things simpler, but perhaps this is another useful lesson in the perils of benchmarking - the system doesn't always do what it is asked.)

Reaching 145,000 4+ Kbyte NFS cached read ops/sec without blowing out latency is a great result - and it's the latency that really matters (and from latency comes IOPS)... And on the topic of latency and IOPS - I do need to post a follow up for the next level after DRAM - no, not disks, it's the L2ARC using SSDs in the Hybrid Storage Pool.

About

Brendan Gregg, Fishworks engineer
