Tuesday Mar 17, 2009

Heat Map Analytics

I've been recently posting screenshots of heat maps from Analytics - the observability tool shipped with the Sun Storage 7000 series.

These heat maps are especially interesting, and I'll describe them here in more detail.

Introduction

To start with, when you first visit Analytics you have an empty worksheet and need to add statistics to plot. Clicking on the plus icon next to "Add statistic" brings up a menu of statistics, as shown on the right.

I've clicked on "NFSv3 operations" and a sublist of possible breakdowns is shown. The last three (not including "as a raw statistic") are represented as heat maps. Clicking on "by latency" would show "NFSv3 operations by latency" as a heat map. Great.

But it's actually much more powerful than it looks. It is possible to drill down on each breakdown to focus on behavior of interest. For example, latency may be more interesting for read or write operations, depending on the workload. If our workload was performing synchronous writes, we may like to see the NFS latency heat map for 'write' operations separately - which we can do with Analytics.

To see an example of this, I've selected "NFS operations by type of operation", then selected 'write', then right-clicked on the "write" text to see the next breakdowns that are possible:

This menu is also visible by clicking the drill icon (3rd from the right) to drill down further.

Clicking on latency will now graph "NFSv3 operations of type write broken down by latency". So these statistics can be shown in whatever context is most interesting - perhaps I want to see NFS operations from a particular client, or for a particular file. Here are NFSv3 writes from the client 'deimos', showing the filenames that are being written:

Awesome. Behind the scenes, DTrace is building up dynamic scripts to fetch this data. We just click the mouse.

This was important to mention - the heat maps I'm about to demonstrate can be customized very specifically, by type of operation, client, filename, etc.

Sequential reads

I'll demonstrate heat maps at the NFS level by running the /usr/bin/sum command on a large file a few times, and letting it run longer each time before hitting Ctrl-C to cancel it. The sum command calculates a file's checksum, and does so by sequentially reading through the file contents. Here is what the three heat maps from Analytics show:

The top heat map of offset clearly shows the client's behavior - the stripes show sequential I/O. The blocks show the offsets of the read operations as the client creeps through the file. I mounted the client using forcedirectio, so that NFS would not cache the file contents on the client - and would be forced to keep reading the file each time.

The middle heat map shows the size of the client I/O requests. This shows the NFS client is always requesting 8 Kbyte reads. The bottom heat map shows NFS I/O latency. Most of the I/O was between 0 and 35 microseconds - but there are some odd clouds of latency on the 2nd and 3rd runs.

These latency clouds would be almost invisible if a linear color scheme was used - these heat maps use false color to emphasize detail. The left panel showed that on average there were 1771 ops/sec faster than 104 us (adding up the numbers), and that the entire heat map averaged at 1777 ops/sec; this means that the latency clouds (at about 0.7 ms) represent 0.3% of the I/O. The false color scheme makes them clearly visible, which is important for latency heat maps - as these slow outliers can hurt performance - even though they are relatively infrequent.

For those interested in more detail, I've included a couple of extra screenshots to explain this further:

  • Screenshot 1: NFS operations and disk throughput. From the top graph, it's clear how long I left the sum command running each time. The bottom graph of disk I/O bytes shows that the file I was checksumming had to be pulled in from disk for the entire first run, but only later in the second and third runs. Compare these times to the offset heat map above - the 2nd and 3rd runs are reading data that is now cached, and doesn't need to be read from disk.

  • Screenshot 2: ARC hits/misses. This shows what the ZFS DRAM cache is doing (this is the 'ARC' - the Adaptive Replacement Cache). I've shown the same statistic twice, so that I can highlight breakdowns separately. The top graph shows a data miss at 22:30:24, which triggers ZFS to prefetch the data (since ZFS detects that this is a sequential read). The bottom graph shows data hits kept high thanks to ZFS prefetch, and shows prefetch in operation: "prefetched data misses" are requests triggered by prefetch that were not already in the ARC and were read from disk; "prefetched data hits" are prefetch requests that were already satisfied by the ARC. The latency clouds correspond to the later prefetched data misses, where some client requests arrive while prefetch is still reading from disk - and must wait for it to complete.

Random reads

While the rising stripes of a sequential workload are clearly visible in the offset heat map, random workloads are also easily identifiable:

The NFS operations by offset heat map shows a random and fairly uniform pattern, which matches the random I/O my client is now requesting. These are all hitting the ZFS DRAM cache, and so the latency heat map shows most responses in the 0 to 32 microsecond range.

Checking how these known workloads look in Analytics is valuable, so that when we are faced with the unknown, we know what to look for.

Disk I/O

The heat maps above demonstrated Analytics at the NFS layer; Analytics can also trace these details at the back-end: what the disks are doing, as requested by ZFS. As an example, here is a sequential disk workload:

The heat maps aren't as clear as they are at the NFS layer, as now we are looking at what ZFS decides to do based on our NFS requests.

The sequential read is mostly reading from the first 25 Gbytes of the disks, as shown in the offset heat map. The size heat map shows ZFS is doing mostly 128 Kbyte I/Os, and the latency heat map shows the disk I/O time is often about 1.20 ms, and sometimes longer.

Latency at the disk I/O layer doesn't directly correspond to client latency - it depends on the type of I/O. Asynchronous writes and prefetch I/O won't necessarily slow the client, for example.

Vertical Zoom

There is a way to zoom these heat maps vertically. Zooming horizontally is obvious (the first 10 buttons above each heat map do that - by changing the time range), but the vertical zoom isn't so obvious. It is documented in the online help - I just wanted to say here that it does exist. In a nutshell: click the outliers icon (last on the right) to switch outlier elimination modes (5%, 1%, 0.1%, 0.01%, none), which often will do what you want (by zooming to exclude a percentage of outliers); otherwise left click a low value in the left panel, shift click a high value, then click the outliers icon.
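
To make the outlier elimination behavior concrete, here is a minimal sketch (not the Analytics implementation) of how a y-axis ceiling could be chosen by discarding a given percentage of the slowest samples:

    /*
     * outlier_clip.c: a sketch (not the Analytics code) of percentile-based
     * vertical zoom - pick a y-axis ceiling that excludes the slowest
     * `percent` of latency samples.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    static int
    cmp_us(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return ((x > y) - (x < y));
    }

    /* return the latency (us) below which (100 - percent)% of samples fall */
    static uint64_t
    clip_ceiling(uint64_t *lat_us, size_t n, double percent)
    {
        size_t keep;

        qsort(lat_us, n, sizeof (uint64_t), cmp_us);
        keep = (size_t)((double)n * (100.0 - percent) / 100.0);
        if (keep == 0)
            keep = 1;
        return (lat_us[keep - 1]);
    }

    int
    main(void)
    {
        /* one slow 9 ms outlier among otherwise fast I/O */
        uint64_t samples[] = { 12, 18, 25, 30, 31, 33, 35, 700, 9000 };
        size_t n = sizeof (samples) / sizeof (samples[0]);

        printf("ceiling at 5%%: %llu us\n",
            (unsigned long long)clip_ceiling(samples, n, 5.0));
        return (0);
    }

With these samples, eliminating 5% of outliers clips the single 9 ms value and the y-axis rescales to 700 us, so the detail at the bottom of the map fills the display.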

Overheads

As mentioned earlier, these heat maps use optimal resolutions at different ranges to conserve disk space, while maintaining visual resolution. They are also saved on the system disks, which have compression enabled. Still, when recording this data every second, 24 hours a day, the disk space can add up. Check their disk usage by going to Analytics->Datasets and clicking the "ON DISK" title to sort by size:

The size is listed before compression, so the actual bytes consumed will be lower. These datasets can be suspended by clicking the power button, which is handy if you'd like to keep interesting data but not continue to collect new data.

Playing around...

While using these heat maps we noticed some unusual and detailed plots. Bryan and I started wondering whether it was possible to generate artificial workloads that plotted arbitrary patterns, such as spelling out words in 8 point text. This would be especially easy for the offset heat map at the NFS level - since the client requests the offsets, we just need to write a program to request reads or writes to the offsets we want. Moments after this idea, Bryan and I were furiously coding to see who could finish it first (and post comical messages to each other.) Bryan won, after about 10 minutes. Here is an example:

Awesome, dude! ... (although that wasn't the first message we printed ... when I realized Bryan was winning, I logged into his desktop, found the binary he was compiling, and posted the first message to his screen before he had finished writing the software. However my message appeared as: "BWC SnX" (Bryan's username is "bmc".) Bryan was looking at the message, puzzled, while I'm saying "it's upside down - your program prints upside down!")
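
The real offsetwriter.c is linked at the end of this post; as a hedged sketch of the idea (the file path, glyph and offset scale below are invented for illustration), each lit 'pixel' becomes a read at a chosen offset, issued during the second that forms that column of the heat map:

    /*
     * offsetpattern.c: sketch of drawing into the "by offset" heat map over
     * NFS - not the original offsetwriter.c. For each column of a small
     * bitmap, issue reads at the offsets of the lit rows, one column per
     * second, so the heat map reproduces the bitmap over time.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define ROWS      8                   /* glyph height and width */
    #define ROW_BYTES (1024 * 1024)       /* offset span per heat map row (assumed) */
    #define IO_SIZE   8192

    /* hypothetical 8x8 glyph: '1' = issue a read at that row's offset */
    static const char glyph[ROWS][ROWS + 1] = {
        "00111100", "01000010", "10000001", "10111101",
        "10100101", "10000001", "01000010", "00111100",
    };

    int
    main(int argc, char *argv[])
    {
        char buf[IO_SIZE];
        int col, row, fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s /net/server/share/bigfile\n", argv[0]);
            return (1);
        }
        if ((fd = open(argv[1], O_RDONLY)) == -1) {
            perror("open");
            return (1);
        }

        for (col = 0; col < ROWS; col++) {          /* one column per second */
            for (row = 0; row < ROWS; row++) {
                if (glyph[row][col] == '1')
                    (void) pread(fd, buf, IO_SIZE, (off_t)row * ROW_BYTES);
            }
            sleep(1);
        }
        (void) close(fd);
        return (0);
    }

Issuing more reads per lit pixel would darken it; and since the heat map puts larger offsets higher on the y-axis, a glyph stored top row first comes out flipped unless the rows are inverted - one way to end up with an upside-down message.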

I later modified the program to work for the size heat maps as well, which was easy since the client requests the I/O size too. But what about the latency heat maps? Latency isn't requested - it depends on many factors: for reads, it depends on whether the data is cached; if not, whether it is on a flash memory based read cache (if one is used); and if not, how much disk head seek and rotation time it takes to pull it in - which varies depending on the previous disk I/O. Maybe this can't be done...

Actually, it can be done. Here are all three:

Ok, the latency heat map looks a bit fuzzy, but this does work. I could probably improve it if I spent more than 30 mins on the code - but I have plenty of actual work to do.

I got the latency program to work by requesting data which was cached in DRAM, of large increasing sizes. The latency from DRAM is consistent and scales with the size requested, so by calling reads with certain large I/O sizes I can manufacture a workload with the latency I want (or close to it.) The client was mounted forcedirectio, so that every read caused an NFS I/O (no client caching.)
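
Here is a hedged sketch of that approach (the real latencywriter.c is linked below); the BYTES_PER_US calibration constant is invented and would need tuning against the heat map:

    /*
     * latencysketch.c: manufacture NFS read latency by varying the size of
     * reads that hit the server's DRAM cache. The client is mounted
     * forcedirectio, so every read becomes NFS I/O. Sketch only - the
     * BYTES_PER_US value is an assumption to be calibrated per system.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BYTES_PER_US 300    /* assumed bytes served from DRAM per microsecond */

    int
    main(int argc, char *argv[])
    {
        long target_us;
        size_t bytes;
        char *buf;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s file target_microseconds\n", argv[0]);
            return (1);
        }
        target_us = atol(argv[2]);
        bytes = (size_t)(target_us > 0 ? target_us : 1) * BYTES_PER_US;

        if ((buf = malloc(bytes)) == NULL) {
            perror("malloc");
            return (1);
        }
        if ((fd = open(argv[1], O_RDONLY)) == -1) {
            perror("open");
            return (1);
        }

        for (;;) {
            /* read size chosen so the request takes roughly target_us */
            (void) pread(fd, buf, bytes, 0);    /* offset 0 stays cached */
            usleep(10000);
        }
        /* NOTREACHED */
    }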

If you are interested in the client programs that injected these workloads, they are provided here (completely unsupported) for your entertainment: offsetwriter.c, sizewriter.c and latencywriter.c. If you don't have a Sun Storage 7000 series product to try them on, you can try the fully functional VMware simulator (although they may need adjustments to compensate for the simulator's slower response times.)

Summary

Heat maps are an excellent visual tool for analyzing data, and identifying patterns that would go unnoticed via text based commands or plain graphs. Some may remember Richard McDougall's Taztool, which used heat maps for disk I/O by offset analysis, and was very useful at the time (I reinvented it later for Solaris 10 with DTraceTazTool.)

Analytics takes heat maps much further:

  • visibility of different layers of the software stack: disk I/O, NFS, CIFS, ...
  • drilldown capabilities: for a particular client or file only, ...
  • I/O by offset, I/O by size and I/O by latency
  • can archive data 24x7 in production environments
  • optimal disk storage

With this new visibility, heat maps are illuminating numerous performance behaviors that we previously didn't know about, and some we still don't yet understand - like the Rainbow Pterodactyl. DTrace has made this data available for years; Analytics is making it easy to see.

Monday Mar 16, 2009

Dave tests compression

Dave from Fishworks has tested ZFS compression on the Sun Storage 7000 series. This is just a simple test, but it's interesting to see that performance improved slightly when using the LZJB compression algorithm. Compression relieves back-end throughput to the disks, and the LZJB algorithm doesn't consume much CPU to do so. I had suggested trying compression in my streaming disk blog post, but didn't have any results to show. It's good to see this tested and shown with Analytics.

Thursday Mar 12, 2009

Latency Art: Rainbow Pterodactyl

In previous posts I've demonstrated Analytics from the Sun Storage 7000 series, which is a DTrace based observability tool created by Bryan. One of the great features of Analytics is its use of heat maps, especially for I/O latency.

I/O latency is the time to service and respond to an I/O request, and since clients are often waiting for it to complete, it is usually the most interesting metric for performance analysis (more so than IOPS or throughput.) We thought of graphing average I/O latency; however, important details would be averaged out: occasional slow requests can destroy performance, but when averaged with many fast requests their existence may be difficult to see. So instead of averaging I/O latency, we provide I/O latency as a heat map. For example:

That is showing the effect of turning on our flash based SSD read cache. Latency drops!

The x-axis is time, the y-axis is the I/O latency, and color is used to represent a 3rd dimension - how many I/O requests occurred at that time, at that latency (darker means more.) Gathering this much detail from the operating system was not only made possible by DTrace, but also made optimal. DTrace already has the ability to group data into buckets, which is used to provide a sufficient resolution of data for plotting. Higher resolutions are kept for lower latencies: having 10 microsecond resolution for I/O times less than 1,000 microseconds is useful, but overkill for I/O times over 1,000,000 microseconds (1 second) - and would waste disk space when storing it. Analytics automatically picks a resolution suitable for the range displayed.
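
As a rough illustration of that resolution trade-off (this is not the DTrace or Analytics code), latencies can be grouped into buckets whose width grows with the magnitude of the value - fine-grained at low latencies, coarse at high ones:

    /*
     * llbucket.c: sketch of log-linear latency bucketing - resolution is
     * spent on the low latencies where it is useful. Not the Analytics
     * code; the STEPS value is an assumption.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define STEPS 100   /* buckets per power-of-ten range */

    /* round a latency (in us) down to the floor of its bucket */
    static uint64_t
    bucket_floor(uint64_t us)
    {
        uint64_t decade = 1;

        while (us >= decade * 10)       /* find the power of ten containing us */
            decade *= 10;
        if (decade < STEPS)
            return (us);                /* below 100 us: 1 us resolution */
        return (us - (us % (decade * 10 / STEPS)));
    }

    int
    main(void)
    {
        uint64_t samples[] = { 27, 457, 6283, 1234567 };
        int i;

        /* prints 27, 450, 6200 and 1200000: bucket width grows with latency */
        for (i = 0; i < 4; i++)
            printf("%8llu us -> bucket %llu us\n",
                (unsigned long long)samples[i],
                (unsigned long long)bucket_floor(samples[i]));
        return (0);
    }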

In this post (perhaps the start of a series), I'll show unusual latency plots that I and others have discovered. Here is what I call the "Rainbow Pterodactyl":

Yikes! Maybe it's just me, but that looks like the Pterodactyl from Joust.

I discovered this when testing back-end throughput on our storage systems, by 'lighting up' disks one by one with a 128 Kbyte streaming read workload to their raw device (visible as the rainbow.) I was interested in any knee points in I/O throughput, which is visible there at 17:55 where we reached 1.13 Gbytes/sec. To understand what was happening, I graphed the I/O latency as well - and found the alarming image above. The knee point corresponds to where the neck ends, and the wing begins.

Why is there a beak? (the two levels of latency) ... Why is there a head? Why is there a wing? This raised many more questions than answers - things for us to figure out. Which is great - had this been average latency, these behaviors may have gone unnoticed.

I can at least explain the beak to head threshold: the beak ends at 8 disks, and I had two "x4" ports of SAS connected - so when 9 disks are busy, there is contention for those ports. (I'll defer describing the rest of the features to our storage driver expert Greg if he gets the time. :-)

The above was tested with 1 SAS path to each of 2 JBODs. Testing 2 paths to each JBOD produces higher throughput, and latency which looks a bit more like a bird:

Here we reached 2.23 Gbytes/sec for the first knee point, which is actually the neck point...

Latency heat maps have shown us many behaviors that we don't yet understand, which we need to spend more time with Analytics to figure out. It can be exciting to discover strange phenomena for the first time, and it's very useful and practical as well - latency matters.

Tuesday Mar 03, 2009

CIFS at 1 Gbyte/sec

I've recently been testing the limits of NFS performance on the Sun Storage 7410. Here I'll test the CIFS (SMB) protocol - the file sharing protocol commonly used by Microsoft Windows, which can be served by the Sun Storage 7000 series products. I'll push the 7410 to the limits I can find, and show screenshots of the results. I'm using 20 clients to test a 7410 which has 6 JBODs and 4 x 10 GbE ports, described in more detail later on.

CIFS streaming read from DRAM

Since the 7410 has 128 Gbytes of DRAM, most of which is available as the filesystem cache, it is possible that some workloads can be served entirely or almost entirely from DRAM cache, which I've tested before for NFS. Understanding how fast CIFS can serve this data from DRAM is interesting, so to search for a limit I've run the following workload: 100 Gbytes of files (working set), 4 threads per client, each doing streaming reads with a 1 Mbyte I/O size, and looping through their files.
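
My load generator isn't shown here, but a minimal sketch of one streaming-read worker would look something like this (I ran the equivalent of four per client; the file list is whatever makes up the working set):

    /*
     * streamread.c: sketch of one client load thread - stream through a set
     * of files with 1 Mbyte reads, looping forever. Not the actual test
     * harness used for these results.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IO_SIZE (1024 * 1024)

    static char buf[IO_SIZE];

    int
    main(int argc, char *argv[])
    {
        ssize_t bytes;
        int i, fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file ...\n", argv[0]);
            return (1);
        }

        for (;;) {                              /* loop through the working set */
            for (i = 1; i < argc; i++) {
                if ((fd = open(argv[i], O_RDONLY)) == -1)
                    continue;
                while ((bytes = read(fd, buf, IO_SIZE)) > 0)
                    ;                           /* sequential 1 Mbyte reads */
                (void) close(fd);
            }
        }
        /* NOTREACHED */
    }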

I don't have to worry about client caching affecting the observed result - as is the case with other benchmarks - since I'm not measuring the throughput on the client. I'm measuring the actual throughput on the 7410, using Analytics:

I've zoomed out to show the average over 60 minutes - which was 1.04 Gbytes/sec outbound!

This result was measured as outbound network throughput, so it includes the overhead of Ethernet, IP, TCP and CIFS headers. Since jumbo frames were used, this overhead is going to be about 1%. So the actual data payload moved over CIFS will be closer to 1.03 Gbytes/sec.
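
For those wondering where the roughly 1% figure comes from, here's a back-of-envelope calculation using the usual minimum header sizes; the per-SMB-message CIFS header adds a little more on top:

    /*
     * hdr_overhead.c: rough estimate of network overhead per jumbo frame.
     * Header sizes are the standard minimums (no IP/TCP options); the CIFS
     * protocol header is extra, taking the total to about 1%.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double mtu = 9000.0;                    /* jumbo frame MTU */
        double ip = 20.0, tcp = 20.0;           /* IPv4 + TCP headers */
        double eth = 14.0 + 4.0 + 8.0 + 12.0;   /* header + FCS + preamble + gap */
        double payload = mtu - ip - tcp;
        double wire = mtu + eth;

        printf("TCP payload per frame: %.0f bytes\n", payload);
        printf("overhead: %.2f%%\n", 100.0 * (wire - payload) / wire);
        return (0);
    }

That prints about 0.86%, and the CIFS headers nudge it to roughly 1% - hence 1.04 Gbytes/sec outbound being closer to 1.03 Gbytes/sec of payload.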

As a screenshot it's clear that I'm not showing a peak value - rather a sustained average over a long interval, to show how well the 7410 can serve CIFS at this throughput.

CIFS streaming read from disk

While 128 Gbytes of DRAM can cache large working sets, it's still interesting to see what happens when we fall out of that, as I've previously shown for NFS. The 7410 I'm testing has 6 JBODs (from a max of 12), which I've configured with mirroring for max performance. To test out disk throughput, my workload is: 2 Tbytes of files (working set), 2 threads per client, each performing streaming reads with a 1 Mbyte I/O size, and looping through their files.

As before, I'm taking a screenshot from Analytics - to show what the 7410 is really doing:

Here I've shown read throughput from disk at 849 Mbytes/sec, and network outbound at 860 Mbytes/sec (includes the headers.) 849 Mbytes/sec from disk (which will be close to our data payload) is very solid performance.

CIFS streaming write to disk

Writes take a different code path from reads, and need to be tested separately. The workload I've used to test write throughput is: Writing 1+ Tbytes of files, 4 threads per client, each performing streaming writes with a 1 Mbyte I/O size. The result:

The network inbound throughput was 638 Mbytes/sec, which includes protocol overheads. Our data payload rate will be a little less than that due to the CIFS protocol overheads, but should still be at least 620 Mbytes/sec - which is very good indeed (even beat my NFSv3 write throughput result!)

Note that the disk I/O bytes were at 1.25 Gbytes/sec write: the 7410 is using 6 JBODs configured with software mirroring, so the back end write throughput is doubled. If I picked other storage profiles like RAID-Z2, the back end throughput would be less (as I showed in the NFS post.)

CIFS read IOPS from DRAM

Apart from throughput, it's also interesting to test the limits of IOPS, in particular read ops/sec. To do this, I'll use a workload which is: 100 Gbytes of files (working set) which caches in DRAM, 20 threads per client, each performing reads with a 1 byte I/O size. The results:

203,000+ is awesome; while a realistic workload is unlikely to call 1 byte I/Os, still, it's interesting to see what the 7410 can do (I'll test 8 Kbyte I/O next.)

Note the network in and out bytes are about the same - the 1 byte of data payload doesn't make much difference beyond the network headers.

Modest read IOPS

While the limits I can reach on the 7410 are great for heavy workloads, these don't demonstrate how well the 7410 responds under modest conditions. Here I'll test a lighter read ops/sec workload by cutting it to: 10 clients, 1 x 1 GbE port per client, 1 x 10 GbE port on the 7410, 100 Gbytes of files (working set), and 8 Kbyte random I/O. I'll step up the threads per client every minute (by 1), starting at 1 thread/client (so 10 in total to begin with):

We reached 71,428 read ops/sec - a good result for 8 Kbyte random I/O from cache and only 10 clients.

It's more difficult to generate client load (involves context switching to userland) than to serve it (kernel only), so you generally need more CPU grunt on the clients than on the target. At one thread per client on 10 clients, the clients are using 10 x 1600 MHz cores to test a 7410 with 16 x 2300 MHz cores - so the clients themselves will limit the throughput achieved. Even at 5 threads per client there is still headroom (%CPU) on this 7410.

The bottom graph is a heat map of CIFS read latency, as measured on the 7410 (from when the I/O request was received to when the response was sent). As load increases, so does I/O latency - but most I/Os still complete in less than 100 us (fast!). This may be the most interesting of all the results - as this modest load is increased, the latency remains low while the 7410 scales to meet the workload.

Configuration

As the filer, I used a single Sun Storage 7410 with the following config:

  • 128 Gbytes DRAM
  • 6 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
  • 4 sockets of quad-core AMD Opteron 2300 MHz CPUs
  • 2 x 2x10 GbE cards (4 x 10 GbE ports total), jumbo frames
  • 2 x HBA cards
  • noatime on shares, and database size left at 128 Kbytes

It's not a max config system - the 7410 can currently scale to 12 JBODs, 3 x HBA cards, and have flash based SSD as read cache and intent log - which I'm not using for these tests. The CPU and DRAM size is the current max: 4 sockets of quad-core driving 128 Gbytes of DRAM is a heavyweight for workloads that cache well, as shown earlier.

The clients were 20 blades, each:

  • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
  • 6 Gbytes of DRAM
  • 2 x 1 GbE network ports
  • Running Solaris, and mounting CIFS using the smbfs driver

These are great, apart from the CPU clock speed - which at 1600 MHz is a little low.

The network consists of multiple 10 GbE switches to connect the client 1 GbE ports to the filer 10 GbE ports.

Conclusion

A single head node 7410 has very solid CIFS performance, as shown in the screenshots. I should note that I've shown what the 7410 can do given the clients I have, but it may perform even faster given faster clients for testing. What I have been able to show is 1 Gbyte/sec of CIFS read throughput from cache, and up to 200,000 read ops/sec - tremendous performance.

Before the Sun Storage 7000 products were released, there was intensive performance work on this by the CIFS team and PAE (as Roch describes), and they have delivered great performance for CIFS, and continue to improve it further. I've updated the summary page with these CIFS results.

Monday Feb 23, 2009

Networking Analytics Example

I just set up a new Sun Storage 7410, and found a performance issue using Analytics. People have been asking for examples of Analytics in use, so I need to blog these as I find them. This one is regarding network throughput, and while simple - it demonstrates the value of high level data and some features of Analytics.

The Symptom

To test this new 7410, I ran a cached DRAM read workload from 10 clients and checked network throughput:

It's reached 736 Mbytes/sec, and I know the 7410 can go faster over 1 x 10 GbE port (I previously posted a test showing a single nxge port reaching 1.10 Gbytes/sec.) This is a different workload and client setup, so, is 736 Mbytes/sec the most I should expect from these clients and this workload? Or can I go closer to 1.10 Gbytes/sec?

With Analytics, I can examine intricate details of system performance throughout the software stack. Start simple - start with the higher level questions and drill down deeper on anything suspicious. Also, don't assume anything that can easily be measured.

I expect my 10 clients to be performing NFSv3 reads, but are they? This is easily checked:

They are.

Since these clients are 10 identical blades in a blade server, and the workload I ran successfully connected to and started running on all 10 clients, I could assume that they are all still running. Are they? This is also easily checked:

I see all 10 clients in the left panel, so they are still running. But there is something suspicious - the client dace-2-data-3 is performing far fewer NFSv3 ops than the other clients. For the selected time of 15:49:35, dace-2-data-3 performed 90 NFSv3 ops/sec while the rest of the clients performed over 600. In the graph, that client is plotted as a trickle on the top of the stacks - and at only a pixel or two high, it's difficult to see.

This is where it is handy to change the graph type from stacked to line, using the 5th icon from the right:

Rather than stacking the data to highlight its components, the line graphs plot those components separately - so that they can be compared with one another. This makes it pretty clear that dace-2-data-3 is an outlier - shown as a single line much lower than all others.

For some reason, dace-2-data-3 is performing fewer NFSv3 ops than its identical neighbours. Let's check if its network throughput is also lower:

It is, running at only 11.3 Mbytes/sec. Showing this as a line graph:

As before, the line graph highlights this client as an outlier.

Line Graphs

While I'm here, these line graphs are especially handy for comparing any two items (called breakdowns.) Click the first from the left panel, then shift-click a second (or more.) For example:

By just selecting one breakdown, the graph will rescale to fit it to the vertical height (note the units change on the top right):

Variations in this client's throughput are now more clearly visible. Handy stuff...

Datasets

Another side topic worth mentioning is the datasets - archived statistics used by Analytics. So far I've used:

  • Network bytes/sec by device
  • NFSv3 ops/sec by operation
  • NFSv3 ops/sec by-client
  • IP bytes/sec by-hostname

There is a screen in the Sun Storage 7000 series interface to manage datasets:

Here I've sorted by creation time, showing the newly created datasets at the top. The icons on the right include a power icon, which can suspend dataset collection. Suspended datasets show a gray light on the left, rather than green for enabled.

The by-client and by-hostname statistics require more CPU overhead to collect than the others, as these gather their data by tracing every network packet, aggregating that data, then resolving hostnames. These are some of the datasets that are DTrace based.

The overhead of DTrace based datasets scales with the number of traced events, and with how loaded the system is. The overhead itself is microscopic; however, if you multiply that by 100,000 events per second on a busy system, it can become measurable. This system was pushing over 700 Mbytes/sec, which is approaching 100,000 packets/sec. The overhead (performance cost) for those by-client and by-hostname datasets was about 1.4% each. Tiny as this is, I usually suspend these when performing benchmarks (if they have been enabled - they aren't enabled out of the box.) With lighter workloads (lighter than 700+ Mbytes/sec), this overhead becomes lower as there is more CPU capacity available for collecting such statistics. So, you generally don't need to worry about the CPU overhead - unless you want to perform benchmarks.
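
As a back-of-envelope check on those numbers (the core count is my assumption for this class of 7410), the implied cost per traced packet is only a few microseconds:

    /*
     * trace_cost.c: rough check on the DTrace dataset overhead above,
     * working backwards from ~700 Mbytes/sec of jumbo frame traffic and a
     * ~1.4% overhead. The 16 core count is an assumption.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double mbytes_sec = 700.0;
        double pkt_bytes = 8192.0;      /* approximate payload per jumbo frame */
        double cores = 16.0;            /* assumed: 4 sockets of quad-core */
        double overhead = 0.014;        /* 1.4% of total CPU */

        double packets = mbytes_sec * 1024 * 1024 / pkt_bytes;
        double cpu_us = overhead * cores * 1e6; /* CPU microseconds per second */

        printf("packets/sec:    %.0f\n", packets);
        printf("cost per event: ~%.1f us\n", cpu_us / packets);
        return (0);
    }

About 2.5 microseconds of CPU per traced packet - microscopic per event, and only measurable because there are roughly 90,000 of them every second.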

The Problem

Back to the issue: the 11.3 Mbytes/sec value rings a bell. Converting bytes to bits, that's about 90 Mbit/sec - within 100 Mbit/sec. Hmm.. These are supposed to be 1 Gbit/sec interfaces - is something wrong? On the client:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           100            --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    
Yes - the 1 Gbit/sec interface has negotiated to 100 Mbit. Taking a look at the physical port:

Confirmed! The port (in the center) is showing a yellow light on the right rather than a green light. There is a problem with the port, or the cable, or the port on the switch.

The Fix

Swapping the cable with one that is known-to-be-good fixed the issue - the port renegotiated to 1 Gbit/sec:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           1000           --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    

And the client dace-2-data-3 is now running much faster:

That's about a 9% performance win - caused by a bad Ethernet cable, found by Analytics.

Friday Feb 06, 2009

DRAM latency

In my previous post, I showed NFS random read latency at different points in the operating system stack. I made a few references to hits from DRAM - which were visible as a dark solid line at the bottom of the latency heat maps. This is worth exploring in a little more detail, as this is both interesting and another demonstration of Analytics.

Here is both delivered NFS latency and disk latency, for disk + DRAM alone:

The first clue is that the dark line at the bottom is in the NFS latency map only. This suggests the operation has returned to the client before reaching the disk layer.

Zooming in on the vertical latency scale:

We can now see that these operations are mostly in the 0 us to 21 us range - which is very, very fast. DRAM fast. As an aside, there are other NFS operations apart from read that can return from DRAM - these include open, close and stat. We know these are all reads from viewing the NFS operation type:

The average is 2633 NFS reads/sec, which includes hits from DRAM and reads from disk.

Now we'll use the "ZFS ARC" accesses statistic to see our DRAM cache hit rate (the ARC is our DRAM based filesystem cache):

The averages for this visible range shows 467 data hits/sec - consistent with the number of fast NFS operations/sec that the latency map suggested were DRAM based.

This is amazing stuff - examining the latency throughout the software stack, and clearly seeing the difference between DRAM and disk hits. You can probably see why we picked heat maps to show latency instead of line graphs showing average latency. Traditional performance tools provide average latency for these layers - but so much information has been lost when averaging this data. Since using these heat maps for latency, we've noticed many issues that might otherwise have gone unnoticed with single values.

Friday Jan 30, 2009

L2ARC Screenshots

Back before the Fishworks project went public, I posted an entry to explain how the ZFS L2ARC worked (Level 2 ARC) - which is a flash memory based cache currently intended for random read workloads. I was itching to show screenshots from Analytics, which I'm now able to do. From these screenshots, I'll be able to describe in detail how the L2ARC performs.

Summary

There are a couple of screenshots that really tell the story. This is on a Sun Storage 7410 with the following specs:

  • 128 Gbytes of DRAM
  • 6 x 100 Gbyte "Readzillas" (read optimized SSDs) as the L2ARC
  • 6 x JBODs (disk trays), for a total of 140 disks configured with mirroring

As a workload, I'm using 10 clients (described previously), 2 random read threads per client with an 8 Kbyte I/O size, and a 500 Gbyte total working set mounted over NFS. This 500 Gbyte working set represents your frequently accessed data ("hot" data) that you'd like to be cached; this doesn't represent the total file or database size - which may be dozens of Tbytes. From Analytics on the 7410:

The top graph shows the L2ARC population level, and the bottom shows NFS operations/sec. As the L2ARC warms up, delivered performance in terms of read ops/sec increases, as data is returned from the SSD based L2ARC rather than slower disks. The L2ARC has increased the IOPS by over 5x.

5x IOPS! That's the difference 6 of our current SSDs makes when added to: 140 disks configured with mirroring plus 128 Gbytes of warm DRAM cache - meaning this system was already tuned and configured to serve this workload as fast as possible, yet the L2ARC has no problem magnifying performance further. If I had used fewer disks, or configured them with RAID-Z (RAID-5), or used less DRAM, this improvement ratio would be much higher (demonstrated later.) But I'm not showing this in the summary because this isn't about IOPS - this is about latency:

Here I've toggled a switch to enable and disable the L2ARC. The left half of these graphs shows the L2ARC disabled - which is the performance from disks plus the DRAM cache. The right half shows the L2ARC enabled - so that its effect can be compared. Heat maps have been used to graph latency - which is the time to service that I/O. Lower is faster, and the darker colors represent more I/Os occurring at that time (x-axis) and at that latency (y-axis). Dark colors low in the map are better - they mean I/Os are completing quickly.

These maps show I/O latency plummet when the L2ARC is enabled, delivering I/O faster than disk was able to. Latency at both the NFS level and disk level can be seen, which is often helpful for locating where latency originates; here it simply shows that the faster SSD performance is being delivered to NFS. There are still some I/Os occurring slowly when the L2ARC is enabled (lighter colors in the top right), as the L2ARC is only 96% warm at this point - so 4% of the requested I/Os are still being serviced from disk. If I let the L2ARC warmup further, the top right will continue to fade.

There is one subtle difference between the heat maps - can you spot it? There is a dark stripe of frequent and fast I/O at the bottom of the NFS latency map, which doesn't appear in the disk map. These are read requests that hit the DRAM cache, and return from there.

The bottom graph shows IOPS, which increased (over 5x) when the L2ARC was enabled, due to the faster I/O latency.

This is just one demonstration of the L2ARC - I've shown a good result, but this isn't the best latency or IOPS improvement possible.

Before: DRAM + disk

Let's look closer at the NFS latency before the L2ARC was enabled:

This shows the performance delivered by DRAM plus the 140 mirrored disks. The latency is mostly between 0 and 10 ms, which is to be expected for a random read workload on 7,200 RPM disks.

Zooming in:

The vertical scale has now been zoomed to 10 ms. The dark line at the bottom is for hits from the DRAM cache - which is averaging about 460 hits/sec. Then there is a void until about 2 ms - where these disks start to return random IOPS.

After: DRAM + L2ARC + disk

Now a closer look at the NFS latency with the L2ARC enabled, and warmed up:

Here I've already zoomed to the 10 ms range, which covers most of the I/O. In fact, the left panel shows that most I/O took less than 1 ms.

Zooming in further:

The L2ARC now begins returning data over NFS at 300 us, and as the previous graph showed - most I/O are returned by 1 ms, rather than 10 ms for disk.

The bottom line in the graph is DRAM cache hits, which is now about 2400 hits/sec - over 5x the rate without the L2ARC. This may sound strange at first (how can the L2ARC affect DRAM cache performance?), but it makes sense - the client applications aren't stalled waiting for slower disks, and can send more IOPS. More IOPS means more chance of hitting from the DRAM cache, and a higher hits/sec value. The hits/misses rate is actually the same - we are just making better use of the DRAM cache as the clients can request from it more frequently.

Hit Rate

We can see how the DRAM cache hits increases as the L2ARC warms up with the following screenshot. This shows hit statistics for the ARC (DRAM cache) and L2ARC (SSD cache):

As the L2ARC warms up, its hit rate improves. The ARC also serves more hits as the clients are able to send more IOPS.

We may have assumed that hits improved in this way, however it is still a good idea to check such assumptions whenever possible. Analytics makes it easy to check different areas of the software stack, from NFS ops down to disk ops.

Disk ops

For a different look at L2ARC warmup, we can examine disk ops/sec by disk:

Rather than highlighting individual disks, I've used the Hierarchical breakdown to highlight the system itself ("/turbot") in pale blue. The system is the head node of the 7410, and has 6 L2ARC SSDs - visible as the 6 wedges in the pie chart. The JBODs are not highlighted here, and their ops/sec is shown in the default dark blue. The graph shows the disk ops to the JBODs decreases over time, and those to the L2ARC SSDs increases - as expected.

Warmup Time

A characteristic can be seen in these screenshots that I haven't mentioned yet: the L2ARC is usually slow to warmup. Since it is caching a random read workload, it only warms up as fast as that data can be randomly read from disk - and these workloads have very low throughput.

Zooming in to the start of the L2ARC warmup:

The point I've selected (02:08:20) is when the ARC (DRAM cache) has warmed up, shown in the 3rd graph, which took over 92 minutes! This isn't the L2ARC - this is just to warmup main memory. The reason is shown in the 2nd graph - the read throughput from the disks, which is populating DRAM, is less than 20 Mbytes/sec. This is due to the workload - we are doing around 2,700 x 8 Kbyte random reads/sec - some of which are returning from the DRAM cache, which leaves a total throughput of less than 20 Mbytes/sec. The system has 128 Gbytes of DRAM, of which 112 Gbytes was used for the ARC. Warming up 112 Gbytes of DRAM at 20 Mbytes/sec should take 95 minutes - consistent with the real time it took. (The actual disk throughput is faster to begin with as it pulls in filesystem metadata, then slows down afterwards.)

If 112 Gbytes of DRAM takes 92 minutes to warmup, our 500 Gbytes of flash SSD based L2ARC should take at least 7 hours to warmup. In reality it takes longer - the top screenshot shows this took over a day to get warm. As the L2ARC warms up and serves requests, there are fewer requests to be served by disk - so that 20 Mbytes/sec of input decays.

The warmup isn't so much a problem because:

  • While it may take a while to warmup (depending on workload and L2ARC capacity), unless you are rebooting your production servers every couple of days - you'll find you spend more time warm than cold. We are also working on a persistent L2ARC, so if a server does reboot it can begin warm, which will be available in a future update.
  • The L2ARC does warmup half its capacity rather quickly, to give you an early performance boost - it's getting to 100% that takes a while. This is visible in the top screenshot - the two steps at the start raise the L2ARC size quickly.

If we were to warm up the L2ARC more aggressively, it could hurt overall system performance. The L2ARC has been designed to either help performance or do nothing - so you shouldn't have to worry if it may be causing a performance issue.

More IOPS

I mentioned earlier that the IOPS improvement would be higher with fewer disks or RAID-Z. To see what that looks like, I used the same system, clients and workload, but with 2 JBODs (48 disks) configured with RAID-Z2 (double parity) and wide stripes (46 disks wide.) The Sun Storage 7410 provides RAID-Z2 wide stripes as a configuration option to maximize capacity (and price/Gbyte) - but it does warn you not to pick this for performance:

If you had a random I/O workload in mind, you wouldn't want to pick RAID-Z2 wide stripes as each I/O must read from every disk in the stripe - and random IOPS will suffer badly. Ideally you'd pick mirroring (and my first screenshot in this post demonstrated that.) You could try RAID-Z narrow stripes if their performance was sufficient.

Here is the result - 2 JBODs with RAID-Z2 wide stripes, warming up 6 L2ARC cache SSDs:

IOPS increased by 40x! ... While impressive, this is also unrealistic - no one would pick RAID-Z2 wide stripes for a random I/O workload in the first place.

But wait...

Didn't I just fix the problem? The random read ops/sec reached the same rate as with the 6 x JBOD mirrored system, and yet I was now using 2 x JBODs of RAID-Z2 wide stripes. The L2ARC, once warm, has compensated for the reduced disk performance - so we get great performance, and great price/Gbyte.

So while this setup appeared completely unrealistic, it turns out it could make some sense in certain situations - particularly if price/Gbyte was the most important factor to consider.

There are some things to note:

  • The filesystem reads began so low in this example (because of RAID-Z2 wide stripes), that disk input began at 2 Mbytes/sec then decayed - and so 500 Gbytes of L2ARC took 6 days to warmup.
  • Since disk IOPS were so painfully slow, any significant percentage of them stalled the clients to a crawl. The real boost only happened when the L2ARC was more than 90% warm, so that these slow disk IOPS were marginalized - the dramatic profile at the end of the NFS ops/sec graph. This means you really want your working set to fit into available L2ARC; if it was only 10% bigger, then the improvement may drop from 40x to 10x; and for 20% bigger - 5x. The penalty when using mirroring isn't so steep.
  • While the working set may fit entirely in the L2ARC, any outlier requests that go to disk will be very slow. For time sensitive applications, you'd still pick mirroring.

This tactic isn't really different for DRAM - if your working set fits into the DRAM cache (and this 7410 has 128 Gbytes of DRAM), then you could also use slower disk configurations - as long as warmup time and misses were acceptable. And the IOPS from DRAM gets much higher.

The before/after latency maps for this test were:

By zooming in to the before and after sections (as before), I could see that most of the I/O were taking between 20 and 90 ms without the L2ARC, and then mostly less than 1 ms with the L2ARC enabled.

Adding more disks

You don't need the L2ARC to get more IOPS - you can just add more disks. Let's say you could choose between a system with L2ARC SSDs delivering 10,000 IOPS for your workload, or a system with many more disks - also delivering 10,000 IOPS. Which is better?

The L2ARC based system can reduce cost, power and space (part of Adam's HSP strategy with flash memory) - but just on IOPS alone the L2ARC solution should still be favorable - as this is 10,000 fast IOPS (flash SSD based) vs 10,000 slow IOPS (rotating disk based). Latency is more important than IOPS.

Flash disks as primary storage

You could use flash based SSD disks for primary storage (and I'm sure SSD vendors would love you to) - it's a matter of balancing price/performance and price/Gbyte. The L2ARC means you get the benefits of faster flash memory based I/O, plus inexpensive high density storage from disks - I'm currently using 1 Tbyte 7,200 RPM disks. The disks themselves provide the redundancy: you don't need to mirror the L2ARC SSDs (and hence buy more), as any failed L2ARC request is passed down to the primary storage.

Other uses for the L2ARC

The L2ARC is great at extending the reach of caching in terms of size, but it may have other uses too (in terms of time.) Consider the following example: you have a desktop or laptop with 2 Gbytes of DRAM, and an application goes haywire consuming all memory until it crashes. Now everything else you had running is slow - as their cached pages were kicked out of DRAM by the misbehaving app, and now must be read back in from disk. Sound familiar?

Now consider you had 2 Gbytes (or more) of L2ARC. Since the L2ARC copies what is in DRAM, it will copy the DRAM filesystem cache. When the misbehaving app kicks this out, the L2ARC still has a copy on fast media - and when you use your other apps again, they return quickly. Interesting! The L2ARC is serving as a backup of your DRAM cache.

This also applies to enterprise environments: what happens if you backup an entire filesystem on a production server? Not only can the additional I/O interfere with client performance, but the backup process can dump the hot DRAM cache as it streams through files - degrading performance much further. With the L2ARC, current and recent DRAM cache pages may be available on flash memory, reducing the performance loss during such perturbations. Here the limited L2ARC warmup rate is beneficial - hot data can be kicked out of DRAM quickly, but not the L2ARC.

Expectations

While the L2ARC can greatly improve performance, it's important to understand which workloads this is for, to help set realistic expectations. Here's a summary:

  • The L2ARC benefits will be more visible to workloads with a high random read component. The L2ARC can help mixed random read/write workloads, however the higher the overall write ratio (specifically, write throughput) the more difficult it will be for the L2ARC to cache the working set - as it becomes a moving target.
  • The L2ARC is currently suited for 8 Kbyte I/Os. By default, ZFS picks a record size (also called "database size") of 128 Kbytes - so if you are using the L2ARC, you want to set that down to 8 Kbytes before creating your files. You may already be doing this to improve your random read performance from disk - 128 Kbytes is best for streaming workloads instead (or small files, where it shouldn't matter.) You could try 4 or 16 Kbytes, if it matched the application I/O size, but I wouldn't go further without testing. Higher will reduce the IOPS, smaller will eat more DRAM for metadata.
  • The L2ARC can be slow to warmup (as are massive amounts of DRAM under a random read workload), as discussed earlier.
  • Use multiple L2ARC SSD devices ("Readzillas") to improve performance - not just for the capacity, but for the concurrent I/O. This is just like adding disk spindles to improve IOPS - but without the spindles. Each Readzilla the 7410 currently uses delivers around 3100 x 8 Kbyte read ops/sec. If you use 6 of them, that's over 18,000 x 8 Kbyte read ops/sec, plus what you get from the DRAM cache.
  • It costs some DRAM to reference the L2ARC, at a rate proportional to the number of records cached (so inversely proportional to record size). For example, it currently takes about 15 Gbytes of DRAM to reference 600 Gbytes of L2ARC - at an 8 Kbyte ZFS record size. If you use a 16 Kbyte record size, that cost would be halved - 7.5 Gbytes. This means you shouldn't, for example, configure a system with only 8 Gbytes of DRAM, 600 Gbytes of L2ARC, and an 8 Kbyte record size - if you did, the L2ARC would never fully populate.

The L2ARC warmup in the first example reached 477 Gbytes of cached content. The following screenshot shows how much ARC (DRAM) metadata was needed to reference both the ARC and L2ARC data contents (ARC headers + L2ARC headers), at an 8 Kbyte record size:

It reached 11.28 Gbytes of metadata. Metadata has always been needed for the DRAM cache - this is the in memory information to reference the data, plus locks and counters (for ZFS coders: mostly arc_buf_hdr_t); the L2ARC uses similar in-memory information to refer to its in-SSD content, only this time we are referencing up to 600 Gbytes of content rather than 128 Gbytes for DRAM alone (current maximums for the 7410.)
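
To put that DRAM cost into numbers: the metadata scales with the count of cached records (L2ARC size divided by record size) times a per-record header cost. The ~205 bytes per record used below is simply derived from the 15 Gbytes per 600 Gbytes rule of thumb above, not an exact kernel structure size:

    /*
     * l2arc_hdr_cost.c: estimate the DRAM needed to reference an L2ARC of a
     * given size and ZFS record size. HDR_BYTES is a rough figure derived
     * from the 15 Gbytes / 600 Gbytes / 8 Kbytes example, not the exact
     * size of the in-kernel header structure.
     */
    #include <stdio.h>

    #define GBYTE     (1024.0 * 1024.0 * 1024.0)
    #define HDR_BYTES 205.0     /* approximate DRAM per cached record */

    static double
    dram_gbytes(double l2arc_gbytes, double recsize_bytes)
    {
        double records = l2arc_gbytes * GBYTE / recsize_bytes;

        return (records * HDR_BYTES / GBYTE);
    }

    int
    main(void)
    {
        printf("600 GB L2ARC,   8 KB records: ~%4.1f GB DRAM\n",
            dram_gbytes(600, 8192));
        printf("600 GB L2ARC,  16 KB records: ~%4.1f GB DRAM\n",
            dram_gbytes(600, 16384));
        printf("600 GB L2ARC, 128 KB records: ~%4.1f GB DRAM\n",
            dram_gbytes(600, 131072));
        return (0);
    }

This is also why the 8 Gbytes of DRAM example above could never fully populate 600 Gbytes of L2ARC at an 8 Kbyte record size - the headers alone would need more DRAM than exists.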

Conclusion

The L2ARC can cache random read workloads on flash based SSD, reducing the I/O latency to sub millisecond times. This fast response time from SSD is also consistent, unlike a mechanical disk with moving parts. By reducing I/O latency, IOPS may also improve - as the client applications can send more frequent requests. The examples here showed most I/O returned in sub millisecond times with the L2ARC enabled, and 5x and 40x IOPS over just disk + DRAM.

The L2ARC does take a while to warmup, due to the nature of the workload it is intended to cache - random read I/O. It is preferable to set the filesystem record size to 8 Kbytes or so before using the L2ARC, and to also use multiple SSDs for concurrency - these examples all used 6 x 100 Gbyte SSDs, to entirely cache the working set.

While these screenshots are impressive, flash memory SSDs continue to get faster and have greater capacities. A year from now, I'd expect to see screenshots of even lower latency and even higher IOPS, for larger working sets. It's an exciting time to be working with flash memory.

Friday Jan 09, 2009

My Sun Storage 7410 perf limits

As part of my role in Fishworks, I push systems to their limits to investigate and solve bottlenecks. Limits can be useful to consider as a possible upper bound of performance - as it shows what the target can do. I thought these results would make for some interesting blog posts if I could explain the setup, describe what the tests were, and include screenshots of these results in action. (update: I've included results from colleagues who have tested in the same manner.)

To summarize the performance limits that I found for a single Sun Storage 7410 head node:

All tests are performed on Ethernet (usually 10 GbE) unless otherwise specified ("IB" == InfiniBand).

Like many products, the 7410 will undergo software and hardware updates over time. This page currently has results for:

  • 7410 Barcelona: The initial release, with four quad-core AMD Opteron CPUs (Barcelona), and the 2008.Q4 software.
  • 7410 Istanbul: The latest release, with four six-core AMD Opteron CPUs (Istanbul), and the 2009.Q3 software.

I should make clear that these are provided as possible upper bounds - these aren't what to expect for any given workload, unless your workload was similar to what I used for these tests. Click on the results to see details of the workloads used.

These are also the limits that were found with a given farm of clients and JBODs - it's possible the 7410 could go faster with more clients and more JBODs.

Updated 3-Mar-2009: added CIFS results.

Updated 22-Sep-2009: added column for 7410 Istanbul. Results will be added as they are collected.

Updated 12-Nov-2009: added Cindi's InfiniBand results.
