Monday Mar 16, 2009

Dave tests compression

Dave from Fishworks has tested ZFS compression on the Sun Storage 7000 series. This is just a simple test, but it's interesting to see that performance improved slightly when using the LZJB compression algorithm. Compression relieves back-end throughput to the disks, and the LZJB algorithm doesn't consume much CPU to do so. I had suggested trying compression in my streaming disk blog post, but didn't have any results to show. It's good to see this tested and shown with Analytics.
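
For anyone wanting to try this on general ZFS: LZJB is enabled per dataset with the compression property - a sketch with a hypothetical pool/filesystem name (on the 7000 series this is the share's data compression setting, not a shell command):

    # zfs set compression=lzjb pool/fs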

Thursday Mar 12, 2009

Latency Art: Rainbow Pterodactyl

In previous posts I've demonstrated Analytics from the Sun Storage 7000 series, which is a DTrace based observability tool created by Bryan. One of the great features of Analytics is its use of heat maps, especially for I/O latency.

I/O latency is the time to service and respond to an I/O request, and since clients are often waiting for it to complete, it is often the most interesting metric for performance analysis (more so than IOPS or throughput.) We considered graphing average I/O latency, but important details would be averaged away: occasional slow requests can destroy performance, yet when averaged with many fast requests their existence may be difficult to see. So instead of averaging I/O latency, we provide I/O latency as a heat map. For example:

That is showing the effect of turning on our flash based SSD read cache. Latency drops!

The x-axis is time, the y-axis is the I/O latency, and color is used to represent a 3rd dimension - how many I/O requests occurred at that time, at that latency (darker means more.) Gathering this much detail from the operating system was not only made possible by DTrace, but also made optimal. DTrace already has the ability to group data into buckets, which is used to provide a sufficient resolution of data for plotting. Higher resolutions are kept for lower latencies: having 10 microsecond resolution for I/O times less than 1,000 microseconds is useful, but overkill for I/O times over 1,000,000 microseconds (1 second) - and would waste disk space when storing it. Analytics automatically picks a resolution suitable for the range displayed.
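
Analytics gathers and stores these distributions automatically; on a stock Solaris system, a rough hand-rolled equivalent is a DTrace one-liner that buckets disk I/O latency using the io provider and lquantize() - a sketch, here using 10 microsecond buckets up to 1,000 microseconds:

    # dtrace -n '
        io:::start { ts[arg0] = timestamp; }
        io:::done /ts[arg0]/ {
            @["disk I/O latency (us)"] =
                lquantize((timestamp - ts[arg0]) / 1000, 0, 1000, 10);
            ts[arg0] = 0;
        }'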

In this post (perhaps the start of a series), I'll show unusual latency plots that I and others have discovered. Here is what I call the "Rainbow Pterodactyl":

Yikes! Maybe it's just me, but that looks like the Pterodactyl from Joust.

I discovered this when testing back-end throughput on our storage systems, by 'lighting up' disks one by one with a 128 Kbyte streaming read workload to their raw device (visible as the rainbow.) I was interested in any knee points in I/O throughput, which is visible there at 17:55 where we reached 1.13 Gbytes/sec. To understand what was happening, I graphed the I/O latency as well - and found the alarming image above. The knee point corresponds to where the neck ends, and the wing begins.
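
For the curious, "lighting up" the disks can be done with something as simple as dd against the raw devices - a sketch, where the device paths and the interval between disks are my assumptions rather than the actual test script:

    # start another 128 Kbyte streaming reader every 10 seconds
    for disk in /dev/rdsk/c1t*d0s0; do
            dd if="$disk" of=/dev/null bs=128k &
            sleep 10
    done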

Why is there a beak? (the two levels of latency) ... Why is there a head? Why is there a wing? This raised many more questions than answers - things for us to figure out. Which is great - had this been average latency, these behaviors may have gone unnoticed.

I can at least explain the beak to head threshold: the beak ends at 8 disks, and I had two "x4" ports of SAS connected - so when 9 disks are busy, there is contention for those ports. (I'll defer describing the rest of the features to our storage driver expert Greg if he gets the time. :-)

The above was tested with 1 SAS path to each of 2 JBODs. Testing 2 paths to each JBOD produces higher throughput, and latency which looks a bit more like a bird:

Here we reached 2.23 Gbytes/sec for the first knee point, which is actually the neck point...

Latency heat maps have shown us many behaviors that we don't yet understand, and which we need to spend more time with Analytics to figure out. It can be exciting to discover strange phenomena for the first time, and it's very useful and practical as well - latency matters.

Tuesday Mar 03, 2009

CIFS at 1 Gbyte/sec

I've recently been testing the limits of NFS performance on the Sun Storage 7410. Here I'll test the CIFS (SMB) protocol - the file sharing protocol commonly used by Microsoft Windows, which can be served by the Sun Storage 7000 series products. I'll push the 7410 to the limits I can find, and show screenshots of the results. I'm using 20 clients to test a 7410 which has 6 JBODs and 4 x 10 GbE ports, described in more detail later on.

CIFS streaming read from DRAM

Since the 7410 has 128 Gbytes of DRAM, most of which is available as the filesystem cache, it is possible that some workloads can be served entirely or almost entirely from DRAM cache, which I've tested before for NFS. Understanding how fast CIFS can serve this data from DRAM is interesting, so to search for a limit I've run the following workload: 100 Gbytes of files (working set), 4 threads per client, each doing streaming reads with a 1 Mbyte I/O size, and looping through their files.
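
The clients are Solaris blades mounting the share (described later), so the load generation can be as simple as looping dd reads. Something along these lines would reproduce the workload - a sketch, where the mount point, directory layout and use of dd are my assumptions, not the actual test harness:

    # 4 threads of 1 Mbyte streaming reads, looping over this client's files
    for t in 1 2 3 4; do
            while :; do
                    for f in /mnt/store/dir$t/*; do
                            dd if="$f" of=/dev/null bs=1024k
                    done
            done &
    done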

I don't have to worry about client caching inflating the observed result - as can happen with other benchmarks - since I'm not measuring the throughput on the client. I'm measuring the actual throughput on the 7410, using Analytics:

I've zoomed out to show the average over 60 minutes - which was 1.04 Gbytes/sec outbound!

This result was measured as outbound network throughput, so it includes the overhead of Ethernet, IP, TCP and CIFS headers. Since jumbo frames were used, this overhead is going to be about 1%. So the actual data payload moved over CIFS will be closer to 1.03 Gbytes/sec.

As a screenshot it's clear that I'm not showing a peak value - rather a sustained average over a long interval, to show how well the 7410 can serve CIFS at this throughput.

CIFS streaming read from disk

While 128 Gbytes of DRAM can cache large working sets, it's still interesting to see what happens when we fall out of that, as I've previously shown for NFS. The 7410 I'm testing has 6 JBODs (from a max of 12), which I've configured with mirroring for max performance. To test out disk throughput, my workload is: 2 Tbytes of files (working set), 2 threads per client, each performing streaming reads with a 1 Mbyte I/O size, and looping through their files.

As before, I'm taking a screenshot from Analytics - to show what the 7410 is really doing:

Here I've shown read throughput from disk at 849 Mbytes/sec, and network outbound at 860 Mbytes/sec (includes the headers.) 849 Mbytes/sec from disk (which will be close to our data payload) is very solid performance.

CIFS streaming write to disk

Writes are a different code path from reads, and need to be tested separately. The workload I've used to test write throughput is: writing 1+ Tbytes of files, 4 threads per client, each performing streaming writes with a 1 Mbyte I/O size. The result:

The network inbound throughput was 638 Mbytes/sec, which includes protocol overheads. Our data payload rate will be a little less than that due to the CIFS protocol overheads, but should still be at least 620 Mbytes/sec - which is very good indeed (it even beat my NFSv3 write throughput result!)

Note that the disk I/O bytes were at 1.25 Gbytes/sec write: the 7410 is using 6 JBODs configured with software mirroring, so the back-end write throughput is doubled. If I picked other storage profiles like RAID-Z2, the back-end throughput would be less (as I showed in the NFS post.)
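
For reference, a client-side write load matching the above can be scripted in the same style as the read test - a sketch, with the mount point and count being my assumptions (count=16384 makes each of the 4 threads write 16 Gbytes, or roughly 1.3 Tbytes across 20 clients):

    # 4 threads of 1 Mbyte streaming writes
    for t in 1 2 3 4; do
            dd if=/dev/zero of=/mnt/store/out$t bs=1024k count=16384 &
    done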

CIFS read IOPS from DRAM

Apart from throughput, it's also interesting to test the limits of IOPS, in particular read ops/sec. To do this, I'll use a workload which is: 100 Gbytes of files (working set) which caches in DRAM, 20 threads per client, each performing reads with a 1 byte I/O size. The results:

203,000+ is awesome; while a realistic workload is unlikely to issue 1 byte I/Os, it's still interesting to see what the 7410 can do (I'll test 8 Kbyte I/O next.)

Note that the network in and out bytes are about the same - the 1 byte of data payload doesn't add much beyond the network headers.

Modest read IOPS

While the limits I can reach on the 7410 are great for heavy workloads, these don't demonstrate how well the 7410 responds under modest conditions. Here I'll test a lighter read ops/sec workload by cutting it to: 10 clients, 1 x 1 GbE port per client, 1 x 10 GbE port on the 7410, 100 Gbytes of files (working set), and 8 Kbyte random I/O. I'll step up the threads per client every minute (by 1), starting at 1 thread/client (so 10 in total to begin with):

We reached 71,428 read ops/sec - a good result for 8 Kbyte random I/O from cache and only 10 clients.

It's more difficult to generate client load (involves context switching to userland) than to serve it (kernel only), so you generally need more CPU grunt on the clients than on the target. At one thread per client on 10 clients, the clients are using 10 x 1600 MHz cores to test a 7410 with 16 x 2300 MHz cores - so the clients themselves will limit the throughput achieved. Even at 5 threads per client there is still headroom (%CPU) on this 7410.

The bottom graph is a heat map of CIFS read latency, as measured on the 7410 (from when the I/O request was received to when the response was sent). As load increases, so does I/O latency - but they are still mostly less than 100 us (fast!). This may be the most interesting of all the results - as this modest load is increased, the latency remains low while the 7410 scales to meet the workload.

Configuration

As the filer, I used a single Sun Storage 7410 with the following config:

  • 128 Gbytes DRAM
  • 6 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
  • 4 sockets of quad-core AMD Opteron 2300 MHz CPUs
  • 2 x 2x10 GbE cards (4 x 10 GbE ports total), jumbo frames
  • 2 x HBA cards
  • noatime on shares, and database size left at 128 Kbytes

It's not a max config system - the 7410 can currently scale to 12 JBODs and 3 x HBA cards, and can have flash based SSDs as read cache and intent log - which I'm not using for these tests. The CPU and DRAM size are the current max: 4 sockets of quad-core driving 128 Gbytes of DRAM is a heavyweight for workloads that cache well, as shown earlier.

The clients were 20 blades, each:

  • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
  • 6 Gbytes of DRAM
  • 2 x 1 GbE network ports
  • Running Solaris, and mounting CIFS using the smbfs driver

These are great, apart from the CPU clock speed - which at 1600 MHz is a little low.

The network consists of multiple 10 GbE switches to connect the client 1 GbE ports to the filer 10 GbE ports.

Conclusion

A single head node 7410 has very solid CIFS performance, as shown in the screenshots. I should note that I've shown what the 7410 can do given the clients I have, but it may perform even faster given faster clients for testing. What I have been able to show is 1 Gbyte/sec of CIFS read throughput from cache, and up to 200,000 read ops/sec - tremendous performance.

Before the Sun Storage 7000 products were released, there was intensive performance work on this by the CIFS team and PAE (as Roch describes), and they have delivered great performance for CIFS, and continue to improve it further. I've updated the summary page with these CIFS results.

Monday Feb 23, 2009

Networking Analytics Example

I just set up a new Sun Storage 7410, and found a performance issue using Analytics. People have been asking for examples of Analytics in use, so I need to blog these as I find them. This one is regarding network throughput, and while simple, it demonstrates the value of high level data and some features of Analytics.

The Symptom

To test this new 7410, I ran a cached DRAM read workload from 10 clients and checked network throughput:

It's reached 736 Mbytes/sec, and I know the 7410 can go faster over 1 x 10 GbE port (I previously posted a test showing a single nxge port reaching 1.10 Gbytes/sec.) This is a different workload and client setup, so, is 736 Mbytes/sec the most I should expect from these clients and this workload? Or can I go closer to 1.10 Gbytes/sec?

With Analytics, I can examine intricate details of system performance throughout the software stack. Start simple - start with the higher level questions and drill down deeper on anything suspicious. Also, don't assume anything that can easily be measured - check it.

I expect my 10 clients to be performing NFSv3 reads, but are they? This is easily checked:

They are.

Since these clients are 10 identical blades in a blade server, and the workload I ran successfully connected to and started running on all 10 clients, I could assume that they are all still running. Are they? This is also easily checked:

I see all 10 clients in the left panel, so they are still running. But there is something suspicious - the client dace-2-data-3 is performing far fewer NFSv3 ops than the other clients. For the selected time of 15:49:35, dace-2-data-3 performed 90 NFSv3 ops/sec while the rest of the clients performed over 600. In the graph, that client is plotted as a trickle on the top of the stacks - and at only a pixel or two high, it's difficult to see.

This is where it is handy to change the graph type from stacked to line, using the 5th icon from the right:

Rather than stacking the data to highlight its components, the line graphs plot those components separately - so that they can be compared with one another. This makes it pretty clear that dace-2-data-3 is an outlier - shown as a single line much lower than all the others.

For some reason, dace-2-data-3 is performing fewer NFSv3 ops than its identical neighbours. Let's check whether its network throughput is also lower:

It is, running at only 11.3 Mbytes/sec. Showing this as a line graph:

As before, the line graph highlights this client as an outlier.

Line Graphs

While I'm here, these line graphs are especially handy for comparing any two items (called breakdowns.) Click the first from the left panel, then shift-click a second (or more.) For example:

If just one breakdown is selected, the graph will rescale to fit it to the vertical height (note the units change on the top right):

Variations in this client's throughput are now more clearly visible. Handy stuff...

Datasets

Another side topic worth mentioning is the datasets - archived statistics used by Analytics. So far I've used:

  • Network bytes/sec by device
  • NFSv3 ops/sec by operation
  • NFSv3 ops/sec by-client
  • IP bytes/sec by-hostname

There is a screen in the Sun Storage 7000 series interface to manage datasets:

Here I've sorted by creation time, showing the newly created datasets at the top. The icons on the right include a power icon, which can suspend dataset collection. Suspended datasets show a gray light on the left, rather than green for enabled.

The by-client and by-hostname statistics require more CPU overhead to collect than the others, as these gather their data by tracing every network packet, aggregating that data, then resolving hostnames. These are some of the datasets that are DTrace based.

The overhead of DTrace based datasets is proportional to the number of traced events, and depends on how loaded the system is. The per-event overhead is microscopic; however, multiply it by 100,000 events on a busy system and it can become measurable. This system was pushing over 700 Mbytes/sec, which is approaching 100,000 packets/sec. The overhead (performance cost) for those by-client and by-hostname datasets was about 1.4% each. Tiny as this is, I usually suspend these when performing benchmarks (if they have been enabled - they aren't out of the box.) With lighter workloads (lighter than 700+ Mbytes/sec), this overhead becomes lower as there is more CPU capacity available for collecting such statistics. So, you generally don't need to worry about the CPU overhead - unless you want to perform benchmarks.

The Problem

Back to the issue: the 11.3 Mbytes/sec value rings a bell. Converting bytes to bits, that's about 90 Mbit/sec - just under 100 Mbit/sec. Hmm... These are supposed to be 1 Gbit/sec interfaces - is something wrong? On the client:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           100            --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    
Yes - the 1 Gbit/sec interface has negotiated to 100 Mbit. Taking a look at the physical port:

Confirmed! The port (in the center) is showing a yellow light on the right rather than a green light. There is a problem with the port, or the cable, or the port on the switch.

The Fix

Swapping the cable with one that is known-to-be-good fixed the issue - the port renegotiated to 1 Gbit/sec:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           1000           --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    

And the client dace-2-data-3 is now running much faster:

That's about a 9% performance win - caused by a bad Ethernet cable, found by Analytics.

Friday Feb 06, 2009

DRAM latency

In my previous post, I showed NFS random read latency at different points in the operating system stack. I made a few references to hits from DRAM - which were visible as a dark solid line at the bottom of the latency heat maps. This is worth exploring in a little more detail, as this is both interesting and another demonstration of Analytics.

Here is both delivered NFS latency and disk latency, for disk + DRAM alone:

The first clue is that the dark line at the bottom is in the NFS latency map only. This suggests these operations are returning to the client without ever reaching the disk layer.

Zooming in on the vertical latency scale:

We can now see that these operations are mostly in the 0 us to 21 us range - which is very, very fast. DRAM fast. As an aside, there are other NFS operations apart from read that can return from DRAM - these include open, close and stat. We know these are all reads from viewing the NFS operation type:

The average is 2633 NFS reads/sec, which includes hits from DRAM and reads from disk.

Now we'll use the "ZFS ARC" accesses statistic to see our DRAM cache hit rate (the ARC is our DRAM based filesystem cache):

The averages for this visible range show 467 data hits/sec - consistent with the number of fast NFS operations/sec that the latency map suggested were DRAM based.
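
As an aside, on a plain Solaris/ZFS host the same ARC counters can be read directly from kstats (the appliance presents them through Analytics instead):

    # kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses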

This is amazing stuff - examining the latency throughout the software stack, and clearly seeing the difference between DRAM and disk hits. You can probably see why we picked heat maps to show latency instead of line graphs showing average latency. Traditional performance tools provide average latency for these layers - but so much information is lost when the data is averaged. Since using these heat maps for latency, we've noticed many issues that might otherwise have gone unnoticed when using single values.

Friday Jan 30, 2009

L2ARC Screenshots

Back before the Fishworks project went public, I posted an entry to explain how the ZFS L2ARC (Level 2 ARC) works - a flash memory based cache currently intended for random read workloads. I was itching to show screenshots from Analytics, which I'm now able to do. From these screenshots, I'll be able to describe in detail how the L2ARC performs.
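
For general ZFS users: L2ARC devices are attached to a pool as cache vdevs - a sketch with hypothetical device names (on the 7000 series the Readzillas are configured through the appliance's storage configuration, not at a shell):

    # zpool add pool cache c2t0d0 c2t1d0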

Summary

There are a couple of screenshots that really tell the story. This is on a Sun Storage 7410 with the following specs:

  • 128 Gbytes of DRAM
  • 6 x 100 Gbyte "Readzillas" (read optimized SSDs) as the L2ARC
  • 6 x JBODs (disk trays), for a total of 140 disks configured with mirroring

As a workload, I'm using 10 clients (described previously), 2 random read threads per client with an 8 Kbyte I/O size, and a 500 Gbyte total working set mounted over NFS. This 500 Gbyte working set represents your frequently accessed data ("hot" data) that you'd like to be cached; this doesn't represent the total file or database size - which may be dozens of Tbytes. From Analytics on the 7410:

The top graph shows the L2ARC population level, and the bottom shows NFS operations/sec. As the L2ARC warms up, delivered performance in terms of read ops/sec increases, as data is returned from the SSD based L2ARC rather than slower disks. The L2ARC has increased the IOPS by over 5x.

5x IOPS! That's the difference that 6 of our current SSDs make when added to 140 disks configured with mirroring plus 128 Gbytes of warm DRAM cache - meaning this system was already tuned and configured to serve this workload as fast as possible, yet the L2ARC has no problem magnifying performance further. If I had used fewer disks, or configured them with RAID-Z (RAID-5), or used less DRAM, this improvement ratio would be much higher (demonstrated later.) But I'm not showing this in the summary because this isn't about IOPS - this is about latency:

Here I've toggled a switch to enable and disable the L2ARC. The left half of these graphs shows the L2ARC disabled - which is the performance from disks plus the DRAM cache. The right half shows the L2ARC enabled - so that its effect can be compared. Heat maps have been used to graph latency - which is the time to service that I/O. Lower is faster, and the darker colors represent more I/Os occurring at that time (x-axis) at that latency (y-axis). Dark colors lower down are better - it means I/Os are completing quickly.

These maps show I/O latency plummet when the L2ARC is enabled, delivering I/O faster than disk was able to. Latency at both the NFS level and disk level can be seen, which is often helpful for locating where latency originates; here it simply shows that the faster SSD performance is being delivered to NFS. There are still some I/Os occurring slowly when the L2ARC is enabled (lighter colors in the top right), as the L2ARC is only 96% warm at this point - so 4% of the requested I/Os are still being serviced from disk. If I let the L2ARC warm up further, the top right will continue to fade.

There is one subtle difference between the heat maps - can you spot it? There is a dark stripe of frequent and fast I/O at the bottom of the NFS latency map, which doesn't appear in the disk map. These are read requests that hit the DRAM cache, and return from there.

The bottom graph shows IOPS, which increased (over 5x) when the L2ARC was enabled, due to the faster I/O latency.

This is just one demonstration of the L2ARC - I've shown a good result, but this isn't the best latency or IOPS improvement possible.

Before: DRAM + disk

Let's look closer at the NFS latency before the L2ARC was enabled:

This shows the performance delivered by DRAM plus the 140 mirrored disks. The latency is mostly between 0 and 10 ms, which is to be expected for a random read workload on 7,200 RPM disks: a full rotation takes 8.3 ms, so rotational latency alone averages around 4 ms, plus seek time on top of that.

Zooming in:

The vertical scale has now been zoomed to 10 ms. The dark line at the bottom is for hits from the DRAM cache - which is averaging about 460 hits/sec. Then there is a void until about 2 ms - where these disks start to return random IOPS.

After: DRAM + L2ARC + disk

Now a closer look at the NFS latency with the L2ARC enabled, and warmed up:

Here I've already zoomed to the 10 ms range, which covers most of the I/O. In fact, the left panel shows that most I/O took less than 1 ms.

Zooming in further:

The L2ARC now begins returning data over NFS at 300 us, and as the previous graph showed - most I/O are returned by 1 ms, rather than 10 ms for disk.

The bottom line in the graph is DRAM cache hits, which is now about 2400 hits/sec - over 5x what it was without the L2ARC. This may sound strange at first (how can the L2ARC affect DRAM cache performance?), but it makes sense - the client applications aren't stalled waiting for slower disks, and can send more IOPS. More IOPS means more chance of hitting from the DRAM cache, and a higher hits/sec value. The hit/miss ratio is actually the same - we are just making better use of the DRAM cache as the clients can request from it more frequently.

Hit Rate

We can see how the DRAM cache hits increases as the L2ARC warms up with the following screenshot. This shows hit statistics for the ARC (DRAM cache) and L2ARC (SSD cache):

As the L2ARC warms up, its hit rate improves. The ARC also serves more hits as the clients are able to send more IOPS.

We might have assumed that hits would improve in this way; however, it is still a good idea to check such assumptions whenever possible. Analytics makes it easy to check different areas of the software stack, from NFS ops down to disk ops.

Disk ops

For a different look at L2ARC warmup, we can examine disk ops/sec by disk:

Rather than highlighting individual disks, I've used the Hierarchical breakdown to highlight the system itself ("/turbot") in pale blue. The system is the head node of the 7410, and has 6 L2ARC SSDs - visible as the 6 wedges in the pie chart. The JBODs are not highlighted here, and their ops/sec is shown in the default dark blue. The graph shows the disk ops to the JBODs decreases over time, and those to the L2ARC SSDs increases - as expected.

Warmup Time

A characteristic can be seen in these screenshots that I haven't mentioned yet: the L2ARC is usually slow to warm up. Since it is caching a random read workload, it only warms up as fast as that data can be randomly read from disk - and these workloads have very low throughput.

Zooming in to the start of the L2ARC warmup:

The point I've selected (02:08:20) is when the ARC (DRAM cache) has warmed up, shown in the 3rd graph, which took over 92 minutes! This isn't the L2ARC - this is just to warm up main memory. The reason is shown in the 2nd graph - the read throughput from the disks, which is populating DRAM, is less than 20 Mbytes/sec. This is due to the workload - we are doing around 2,700 x 8 Kbyte random reads/sec - some of which are returning from the DRAM cache, which leaves a total throughput of less than 20 Mbytes/sec. The system has 128 Gbytes of DRAM, of which 112 Gbytes was used for the ARC. Warming up 112 Gbytes of DRAM at 20 Mbytes/sec should take 95 minutes - consistent with the real time it took. (The actual disk throughput is faster to begin with as it pulls in filesystem metadata, then slows down afterwards.)

If 112 Gbytes of DRAM takes 92 minutes to warm up, our 500 Gbytes of flash SSD based L2ARC should take at least 7 hours to warm up. In reality it takes longer - the top screenshot shows this took over a day to get warm. As the L2ARC warms up and serves requests, there are fewer requests to be served by disk - so that 20 Mbytes/sec of input decays.

The warmup isn't so much a problem because:

  • While it may take a while to warm up (depending on workload and L2ARC capacity), unless you are rebooting your production servers every couple of days, you'll find you spend more time warm than cold. We are also working on a persistent L2ARC (to be available in a future update), so if a server does reboot it can begin warm.
  • The L2ARC does warm up half its capacity rather quickly, to give you an early performance boost - it's getting to 100% that takes a while. This is visible in the top screenshot - the two steps at the start raise the L2ARC size quickly.

If we were to warm up the L2ARC more aggressively, it could hurt overall system performance. The L2ARC has been designed to either help performance or do nothing - so you shouldn't have to worry that it may be causing a performance issue.

More IOPS

I mentioned earlier that the IOPS improvement would be higher with fewer disks or RAID-Z. To see what that looks like, I used the same system, clients and workload, but with 2 JBODs (48 disks) configured with RAID-Z2 (double parity) and wide stripes (46 disks wide.) The Sun Storage 7410 provides RAID-Z2 wide stripes as a configuration option to maximize capacity (and price/Gbyte) - but it does warn you not to pick this for performance:

If you had a random I/O workload in mind, you wouldn't want to pick RAID-Z2 wide stripes as each I/O must read from every disk in the stripe - and random IOPS will suffer badly. Ideally you'd pick mirroring (and my first screenshot in this post demonstrated that.) You could try RAID-Z narrow stripes if their performance was sufficient.

Here is the result - 2 JBODs with RAID-Z2 wide stripes, warming up 6 L2ARC cache SSDs:

IOPS increased by 40x! ... While impressive, this is also unrealistic - no one would pick RAID-Z2 wide stripes for a random I/O workload in the first place.

But wait...

Didn't I just fix the problem? The random read ops/sec reached the same rate as with the 6 x JBOD mirrored system, and yet I was now using 2 x JBODs of RAID-Z2 wide stripes. The L2ARC, once warm, has compensated for the reduced disk performance - so we get great performance, and great price/Gbyte.

So while this setup appeared completely unrealistic, it turns out it could make some sense in certain situations - particularly if price/Gbyte was the most important factor to consider.

There are some things to note:

  • The filesystem reads began so low in this example (because of RAID-Z2 wide stripes) that disk input began at 2 Mbytes/sec and then decayed - and so 500 Gbytes of L2ARC took 6 days to warm up.
  • Since disk IOPS were so painfully slow, any significant percentage of them stalled the clients to a crawl. The real boost only happened when the L2ARC was more than 90% warm, so that these slow disk IOPS were marginalized - the dramatic profile at the end of the NFS ops/sec graph. This means you really want your working set to fit into available L2ARC; if it was only 10% bigger, then the improvement may drop from 40x to 10x; and for 20% bigger - 5x. The penalty when using mirroring isn't so steep.
  • While the working set may fit entirely in the L2ARC, any outlier requests that go to disk will be very slow. For time sensitive applications, you'd still pick mirroring.

This tactic isn't really different for DRAM - if your working set fits into the DRAM cache (and this 7410 has 128 Gbytes of DRAM), then you could also use slower disk configurations - as long as warmup time and misses were acceptable. And IOPS from DRAM are much higher.

The before/after latency maps for this test were:

By zooming in to the before and after sections (as before), I could see that most of the I/O were taking between 20 and 90 ms without the L2ARC, and then mostly less than 1 ms with the L2ARC enabled.

Adding more disks

You don't need the L2ARC to get more IOPS; you can just add more disks. Let's say you could choose between a system with L2ARC SSDs delivering 10,000 IOPS for your workload, or a system with many more disks - also delivering 10,000 IOPS. Which is better?

The L2ARC based system can reduce cost, power and space (part of Adam's HSP strategy with flash memory) - but just on IOPS alone the L2ARC solution should still be favorable - as this is 10,000 fast IOPS (flash SSD based) vs 10,000 slow IOPS (rotating disk based). Latency is more important than IOPS.

Flash disks as primary storage

You could use flash based SSD disks for primary storage (and I'm sure SSD vendors would love you to) - it's a matter of balancing price/performance and price/Gbyte. The L2ARC means you get the benefits of faster flash memory based I/O, plus inexpensive high density storage from disks - I'm currently using 1 Tbyte 7,200 RPM disks. The disks themselves provide the redundancy: you don't need to mirror the L2ARC SSDs (and hence buy more), as any failed L2ARC request is passed down to the primary storage.

Other uses for the L2ARC

The L2ARC is great at extending the reach of caching in terms of size, but it may have other uses too (in terms of time.) Consider the following example: you have a desktop or laptop with 2 Gbytes of DRAM, and an application goes haywire consuming all memory until it crashes. Now everything else you had running is slow - as their cached pages were kicked out of DRAM by the misbehaving app, and now must be read back in from disk. Sound familiar?

Now consider you had 2 Gbytes (or more) of L2ARC. Since the L2ARC copies what is in DRAM, it will copy the DRAM filesystem cache. When the misbehaving app kicks this out, the L2ARC still has a copy on fast media - and when you use your other apps again, they return quickly. Interesting! The L2ARC is serving as a backup of your DRAM cache.

This also applies to enterprise environments: what happens if you back up an entire filesystem on a production server? Not only can the additional I/O interfere with client performance, but the backup process can dump the hot DRAM cache as it streams through files - degrading performance much further. With the L2ARC, current and recent DRAM cache pages may be available on flash memory, reducing the performance loss during such perturbations. Here the limited L2ARC warmup rate is beneficial - hot data can be kicked out of DRAM quickly, but not the L2ARC.

Expectations

While the L2ARC can greatly improve performance, it's important to understand which workloads this is for, to help set realistic expectations. Here's a summary:

  • The L2ARC benefits will be more visible to workloads with a high random read component. The L2ARC can help mixed random read/write workloads, however the higher the overall write ratio (specifically, write throughput) the more difficult it will be for the L2ARC to cache the working set - as it becomes a moving target.
  • The L2ARC is currently suited for 8 Kbyte I/Os. By default, ZFS picks a record size (also called "database size") of 128 Kbytes - so if you are using the L2ARC, you want to set that down to 8 Kbytes before creating your files (see the sketch after this list). You may already be doing this to improve your random read performance from disk - 128 Kbytes is best for streaming workloads instead (or small files, where it shouldn't matter.) You could try 4 or 16 Kbytes, if it matched the application I/O size, but I wouldn't go further without testing. Higher will reduce the IOPS, smaller will eat more DRAM for metadata.
  • The L2ARC can be slow to warm up (as is a large amount of DRAM with a random read workload), as discussed earlier.
  • Use multiple L2ARC SSD devices ("Readzillas") to improve performance - not just for the capacity, but for the concurrent I/O. This is just like adding disk spindles to improve IOPS - but without the spindles. Each Readzilla the 7410 currently uses delivers around 3100 x 8 Kbyte read ops/sec. If you use 6 of them, that's over 18,000 x 8 Kbyte read ops/sec, plus what you get from the DRAM cache.
  • It costs some DRAM to reference the L2ARC, at a rate proportional to record size. For example, it currently takes about 15 Gbytes of DRAM to reference 600 Gbytes of L2ARC - at an 8 Kbyte ZFS record size. If you use a 16 Kbyte record size, that cost would be halved - 7.5 Gbytes. This means you shouldn't, for example, configure a system with only 8 Gbytes of DRAM, 600 Gbytes of L2ARC, and an 8 Kbyte record size - if you did, the L2ARC would never fully populate.
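
On general ZFS, the record size mentioned above is a per-filesystem property, set before the files are created - a sketch with a hypothetical dataset name (on the 7000 series this is the share's "database size" setting):

    # zfs set recordsize=8k pool/fs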

The L2ARC warmup in the first example reached 477 Gbytes of cached content. The following screenshot shows how much ARC (DRAM) metadata was needed to reference both the ARC and L2ARC data contents (ARC headers + L2ARC headers), at an 8 Kbyte record size:

It reached 11.28 Gbytes of metadata. Metadata has always been needed for the DRAM cache - this is the in memory information to reference the data, plus locks and counters (for ZFS coders: mostly arc_buf_hdr_t); the L2ARC uses similar in-memory information to refer to its in-SSD content, only this time we are referencing up to 600 Gbytes of content rather than 128 Gbytes for DRAM alone (current maximums for the 7410.)

Conclusion

The L2ARC can cache random read workloads on flash based SSD, reducing the I/O latency to sub millisecond times. This fast response time from SSD is also consistent, unlike a mechanical disk with moving parts. By reducing I/O latency, IOPS may also improve - as the client applications can send more frequent requests. The examples here showed most I/O returned in sub millisecond times with the L2ARC enabled, and 5x and 40x IOPS over just disk + DRAM.

The L2ARC does take a while to warm up, due to the nature of the workload it is intended to cache - random read I/O. It is preferable to set the filesystem record size to 8 Kbytes or so before using the L2ARC, and to also use multiple SSDs for concurrency - these examples all used 6 x 100 Gbyte SSDs, to entirely cache the working set.

While these screenshots are impressive, flash memory SSDs continue to get faster and have greater capacities. A year from now, I'd expect to see screenshots of even lower latency and even higher IOPS, for larger working sets. It's an exciting time to be working with flash memory.
