Friday Jan 09, 2009

1 Gbyte/sec NFS, streaming from disk

I've previously explored maximum throughput and IOPS that I could reach on a Sun Storage 7410 by caching the working set entirely in DRAM, which may be likely as the 7410 currently scales to 128 Gbytes of DRAM per head node. The more your working set exceeds 128 Gbytes or is shifting, the more I/O will be served from disk instead. Here I'll explore the streaming disk performance of the 7410, and show some of the highest throughputs I've seen.

Roch posted performance invariants for capacity planning, which included values of 1 Gbyte/sec for streaming read from disk, and 500 Mbytes/sec for streaming write to disk (50% of the read value due to the nature of software mirroring). Amitabha posted Sun Storage 7000 Filebench results on a 7410 with 2 x JBODs, which had reached 924 Mbytes/sec streaming read and 461 Mbytes/sec streaming write - consistent with Roch's values. What I'm about to post here will further reinforce these numbers, and include the screenshots taken from the Sun Storage 7410 browser interface to show this in action - specifically from Analytics.

Streaming read

The 7410 can currently scale to 12 JBODs (each with 24 disks), but when I performed this test I only had 6 available to use, which I configured with mirroring. While this isn't a max config, I don't think throughput will increase much further with more spindles - after about 3 JBODs there is plenty of raw disk throughput capacity. This is the same 7410, clients and network as I've described before, with more configuration notes below.

The following screenshot shows 10 clients reading 1 Tbyte in total over NFS, with 128 Kbyte I/Os and 2 threads per client. Each client is connected using 2 x 1 GbE interfaces, and the server is using 2 x 10 GbE interfaces:

Showing disk I/O bytes along with network bytes is important - it helps us confirm that this streaming read test did read from disk. If the intent is to go to disk, make sure it does - otherwise it could be hitting from the server or client cache.

I've highlighted the network peak in this screenshot, which was 1.16 Gbytes/sec network outbound - but this is just a peak. The average throughput can be seen by zooming in with Analytics (not shown here), which put network outbound throughput at an average of 1.10 Gbytes/sec, and disk bytes at 1.07 Gbytes/sec. The difference is that the network result includes data payload plus network protocol headers, whereas the disk result is data payload plus ZFS metadata - which adds less than the protocol headers. So 1.07 Gbytes/sec is closer to the average data payload read over NFS from disk.

It's always worth sanity checking results however possible, and we can do that here using the times. This run took 16:23 to complete, and 1 Tbyte was moved - that's an average of 1066 Mbytes/sec of data payload, or 1.04 Gbytes/sec. This time includes the little step at the end of the run, which the zoomed-in average above did not include. The step is from a slow client completing (I found out which one using the Analytics statistic "NFS operations broken down by client" - and I now need to check why I have one client that is a little slower than the others!) Even including the slow client, 1.04 Gbytes/sec is a great result for delivered NFS reads from disk, on a single head node.
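
That arithmetic, as a quick sanity-check script (assuming the binary units used throughout this post, so 1 Tbyte = 2^40 bytes):

    # Sanity check: 1 Tbyte of payload moved in 16 minutes 23 seconds.
    payload_bytes = 1024 ** 4              # 1 Tbyte
    elapsed_secs = 16 * 60 + 23            # 16:23
    rate = payload_bytes / elapsed_secs
    print(int(rate / 1024 ** 2), "Mbytes/sec")       # -> 1066
    print(round(rate / 1024 ** 3, 2), "Gbytes/sec")  # -> 1.04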

Update: 2-Mar-09

In a comment posted on this blog, Rex noticed a slight throughput decay over time in the above screenshots, and wondered if this continued if left longer. To test this out, I had the clients loop over their input files (files much too big to fit in the 7410's DRAM cache), and mounted the clients with forcedirectio (to avoid client caching); this kept the reads served from disk, and let me leave the test running for hours. The result:

The throughput is rock steady over this 24 hour interval.

I hoped to post some screenshots showing the decay and drilling down with Analytics to explain the cause. Problem is, it doesn't happen anymore - throughput is always steady. I have upgraded our 10 GbE network since the original tests - our older switches were getting congested and slowing throughput a little, which may have been happening earlier.
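
For reference, the client side of that longer run was simple: each client just reads its input files sequentially, end to end, in a loop. A minimal sketch of that loop (the mount path is hypothetical, and the files are assumed to live on an NFS mount using forcedirectio, so the client cache is bypassed):

    import sys

    CHUNK = 128 * 1024   # 128 Kbyte reads, matching the I/O size of the original test

    def stream_forever(paths):
        """Sequentially read each file end to end, looping indefinitely."""
        while True:
            for path in paths:
                with open(path, "rb") as f:
                    while f.read(CHUNK):
                        pass

    if __name__ == "__main__":
        # eg: python stream_read.py /mnt/pool-0/fs1/file0 /mnt/pool-0/fs1/file1
        stream_forever(sys.argv[1:])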

Streaming write, mirroring

These write tests were with 5 JBODs (from a max of 12), but I don't think the full 12 would improve write throughput much further. This is the same 7410, clients and network as I've described before, with more configuration notes below.

The following screenshot shows 20 clients writing 1 Tbyte in total over NFS, using 2 threads per client and 32 Kbyte I/Os. The disk pool has been configured with mirroring:

The statistics shown tell the story - the network throughput has averaged about 577 Mbytes/sec inbound, shown in the bottom zoomed graph. This network throughput includes protocol overheads, so the actual data throughput is a little less. The time for the 1 Tbyte transfer to complete is about 31 minutes (top graphs), which indicates the data throughput was about 563 Mbytes/sec. The Disk I/O bytes graph confirms that this is being delivered to disk, at a rate of 1.38 Gbytes/sec (measured by zooming in and reading the range average, as with the bottom graph). The rate of 1.38 Gbytes/sec is due to software mirroring ( (563 + metadata) x 2 ).

Streaming write, RAID-Z2

For comparison, the following shows the same test as above, but with double-parity RAID (RAID-Z2) instead of mirroring:

The network throughput has dropped to 535 Mbytes/sec inbound, as there is more work for the 7410 to calculate parity during writes. As this took 34 minutes to write 1 Tbyte, our data rate is about 514 Mbytes/sec. The disk I/O byte rate is much lower (notice the vertical range), averaging about 720 Mbytes/sec - a big difference from the previous test, due to RAID-Z2 vs mirroring.
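
The payload rates for both write tests follow directly from the elapsed times; a quick script (binary units, metadata not modelled):

    # Payload write rates implied by the elapsed times above (1 Tbyte = 2**40 bytes).
    mirror_mb_s = 1024 ** 2 / (31 * 60)    # mirrored pool: ~31 minutes
    raidz2_mb_s = 1024 ** 2 / (34 * 60)    # RAID-Z2 pool: ~34 minutes
    print(int(mirror_mb_s), "Mbytes/sec")  # -> 563; mirroring sends ~2x (payload + metadata)
                                           #    to disk, hence the ~1.38 Gbytes/sec disk rate
    print(int(raidz2_mb_s), "Mbytes/sec")  # -> 514; RAID-Z2 writes parity rather than a full
                                           #    second copy, hence the lower ~720 Mbytes/sec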

Configuration Notes

To get the maximum streaming performance, jumbo frames were used and the ZFS record size ("database size") was left at 128 Kbytes. 3 x SAS HBA cards were used, with dual pathing configured to improve performance.

This isn't a maximum config for the 7410, since I'm testing a single head node (not a cluster), and I don't have the full 12 JBODs.

In these tests, ZFS compression wasn't enabled (the 7000 series provides 5 choices: off, LZJB, GZIP-2, GZIP, GZIP-9.) Enabling compression may improve performance as there is less disk I/O, or it may not due to the extra CPU cycles to compress/uncompress the data. This would need to be tested with the workload in mind.

Note that the read- and write-optimized SSD devices known as Readzilla and Logzilla weren't used in this test. They currently help with random read workloads and synchronous write workloads, neither of which I was testing here.

Conclusion

As shown in the screenshots above, the streaming performance to disk for the Sun Storage 7410 matches what Roch posted, which, rounding down, is about 1 Gbyte/sec for streaming disk read and 500 Mbytes/sec for streaming disk write. This is the delivered throughput over NFS; the throughput to the disk I/O subsystem was also measured, to confirm that this really was performing disk I/O. The disk write throughput was shown to be higher than the delivered NFS throughput due to software mirroring or RAID-Z2 - averaging up to 1.38 Gbytes/sec. The real limits of the 7410 may be higher than these - this is just the highest I've been able to find with my farm of test clients.

Wednesday Dec 31, 2008

Unusual disk latency

The following screenshot shows two spikes of unusually high disk I/O latency during a streaming write test:

This screenshot is from Analytics on the 7410. The issue is not with the 7410, it's with disk drives in general. The disk latency here is also not suffered by the client applications, as this is ZFS asynchronously flushing write data to disk. Still, it was great to see how easily Analytics could identify this latency, and interesting to see what the cause was.

See this video for the bizarre explanation:

Don't try this yourself...

Tuesday Dec 30, 2008

JBOD Analytics example

The Analytics feature of the Sun Storage 7000 series provides a fantastic new insight into system performance. Some screenshots have appeared in various blog posts, but there is a lot more of Analytics to show. Here I'll show some JBOD analytics.

Bryan coded a neat way to break down data hierarchically, and made it available for the by-filename and by-disk statistics, along with pie charts. While it looks pretty, it is incredibly useful in unexpected ways. I just used it to identify some issues with my JBOD configuration, and realized after the fact that this would probably make an interesting blog post. Since Analytics archived the data I was viewing, I was able to go back and take screenshots of the steps I took - which I've included here.

The problem

I was setting up a NFSv3 streaming write test after configuring a storage pool with 6 JBODs, and noticed that the disk I/O was a little low:

This is measuring disk I/O bytes - I/O from the Sun Storage 7410 head node to the JBODs. I was applying a heavy write workload which was peaking at around 1.6 Gbytes/sec - but I know the 7410 peaks higher than this.

Viewing the other statistics available, I noticed that the disk I/O latency was getting high:

Most of the ops/sec were around 10 ms, but there were plenty of outliers beyond 500 ms - shown in the screenshot above. These disk I/O operations are from ZFS flushing data asynchronously to disk in large chunks, where the I/Os can queue up for 100s of ms. The clients don't wait for this to complete. So while some of this is normal, the outliers are still suspiciously high and worth checking further. Fortunately Analytics lets us drill down on these outliers in other dimensions.

As I had just onlined 144 new disks (6 JBODs' worth), it was possible that one might have a performance issue. It wouldn't be an outright disk error - the Sun Storage software would pick that up and generate an alert, which hadn't happened. I'm thinking of other issues, where a disk can take longer than usual to successfully read sectors (perhaps from a manufacturing defect, vibration issue, etc.) This can be identified in Analytics by looking at the I/O latency outliers by disk.

From the left panel in the disk I/O by latency statistic, I selected 220 ms, right clicked and selected to break this down by disk:

If there is a disk or two with high latency, they'll usually appear at the top of the list on the left. Instead I have a list of disks with roughly the same average number of 220+ ms disk I/Os. I think. There are 144 disks in that list on the left, so it can take a while to scroll down them all to check (and you need to click the "..." ellipsis to expand the list). This is where the "Show hierarchy" button on the left is helpful - it will display the items in the box as wedges in a pie chart - which makes it easy to check if they all look the same:

Instead of displaying every disk, the pie chart begins by displaying the JBODs (the top of the hierarchy.) You can click the little "+" sign in the hierarchy view to show individual disks in each JBOD, like so:

But I didn't need to drill down to the disks to notice something wrong. Why do the two JBODs I selected have bigger slices of the pie? These are identical JBODs and equal members of a ZFS mirrored pool - so my write workload should be sent evenly across all JBODs (which I used Analytics to confirm by viewing disk I/O operations by disk.) So given they are the same hardware and are sent the same workload - they should all perform the same. Instead, two JBODs (whose names start with /2029 and end with 003 and 012) have returned slower disk I/O.

I confirmed the issue using a different statistic - average I/O operations by disk. This statistic is a little odd but very useful - it's not showing the rate of I/Os to the disks, but instead how many on average are active plus waiting in a queue (length of the wait queue.) Higher usually means I/Os are queueing for longer:

Again, these JBODs should be performing equally, and have the same average number of active and queued I/Os, but instead the 003 and 012 JBODs are taking longer to dispatch their I/Os. Something may be wrong with my configuration.
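
Since this statistic is effectively an average queue length, Little's law (average outstanding I/Os = I/O rate x average time in the system) gives a feel for what the numbers mean. A small illustration with made-up values - these are not taken from the screenshots:

    # Little's law: L = lambda * W
    iops = 2000            # hypothetical I/O completions per second for one JBOD
    avg_io_time = 0.010    # 10 ms average time spent active plus queued
    print(iops * avg_io_time)   # -> 20.0 average active + queued I/Os
    # The same I/O rate with a 50 ms average would give 100 outstanding I/Os -
    # which is why JBODs that are slower to dispatch I/Os show larger values.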

The Answer

Checking the Maintenance area identified the issue straight away:

The JBODs ending in 003 and 012 have only one SAS path connected. I intended to cable 2 paths to each JBOD, for both redundancy and to improve performance. I cabled it wrong!

Fixing the JBOD cabling so that all JBODs have 2 paths has improved the write throughput for this test:

And our JBODs look more balanced:

But they aren't perfectly balanced. Why do the 002 and 00B JBODs perform faster than the rest? Their pie slices are slightly smaller (~197 compared to 260-270 average I/Os.) They should have equal disk I/O requests, and take the same time to dispatch them.

The answer again is cabling - the 002 and 00B JBODs are in a chain of 2, and the rest are in a chain of 4. There is less bus and HBA contention for the JBODs in the chain of 2, so they perform better. To balance this properly I should have cabled 2 chains of 3, or 3 chains of 2 (this 7410 has 3 x SAS HBAs, so I can cable 3 chains).

Thoughts

Imagine solving this with traditional tools such as iostat: you may dig through the numerous lines of output and notice some disks with higher service times, but not notice that they belong to the same JBODs. This would be more difficult with iostat if the problem was intermittent, whereas Analytics constantly records all this information at one-second granularity, to allow after-the-fact analysis.

The aim of Analytics is not just to plot data, but to use GUI features to add value to the data. The hierarchy tree view and pie chart are examples of this - and were the key to identifying my configuration problems.

While I already knew to dual-path JBODs and to balance them across available chains - it's wonderful to so easily see the performance issues that these sub-optimal configurations can cause, and know that we've made identifying problems like this easier for everyone who uses Analytics.

Monday Dec 15, 2008

Up to 2 Gbytes/sec NFS

In a previous post, I showed how many NFS read ops/sec I could drive from a Sun Storage 7410, as a way of investigating its IOPS limits. In this post I'll use a similar approach to investigate streaming throughput, and discuss how to understand throughput numbers. Like before, I'll show a peak value along with a more realistic value, to illustrate why understanding context is important.

As DRAM scalability is a key feature of the Sun Storage 7410 (currently reaching 128 Gbytes per head node), I'll demonstrate streaming throughput when the working set is entirely cached in DRAM. This will provide some idea of the upper bound - the most throughput that can be driven in ideal conditions.

The screenshots below are from Analytics on a single node Sun Storage 7410, which is serving a streaming read workload over NFSv3 to 20 clients:

Here I've highlighted a peak of 2.07 Gbytes/sec - see the text above and below the box on the left.

While it's great to see 2 Gbytes/sec reached on a single NAS node, the above screenshot also shows that this was only a peak - it is more useful to see a longer term average:

The average for this interval (over 20 mins) is 1.90 Gbytes/sec. This graph shows network bytes by device, confirming that the traffic was balanced across two nxge ports (each 10 GbE, and each about 80% utilized).

1.90 Gbytes/sec is a good result, but Analytics suggests the 7410 can go higher. Resources which can bound (limit) this workload include the CPU cycles to process instructions, and the CPU cycles spent waiting on the memory accesses needed to push 1.90 Gbytes/sec - both of these are reported as "CPU utilization":

For the same time interval, the CPU utilization on the 7410 was only around 67%. Hmm. Be careful here - the temptation is to do some calculations to predict where the limit could be based on that 67% utilization - but there could be other resource limits that prevent us from going much faster. What we do know is that the 7410 has sustained 1.90 Gbytes/sec and peaked at 2.07 Gbytes/sec in my test. The 67% CPU utilization encourages me to do more testing, especially with faster clients (these have 1600 MHz CPUs).

Answers to Questions

To understand these results, I'll describe the test environment by following the questions I posted previously:

  • This is not testing a cluster - this is a single head node.
  • It is for sale.
  • Same target as before - a Sun Storage 7410 with 128 Gbytes of DRAM, 4 sockets of quad-core AMD Opteron 2300 MHz CPU, and 2 x 2x10 GigE cards. It's not a max config since it isn't in a cluster.
  • Same clients as before - 20 blades, each with 2 sockets of Intel Xeon quad-core 1600 MHz CPUs, 6 Gbytes of DRAM, and 2 x 1 Gig network ports.
  • Client and server ports are connected together using 2 switches, and jumbo frames are enabled. Only 2 ports of 10 GbE are connected from the 7410 - one from each of its 2 x 10 GbE cards, to load balance across the cards as well as the ports. Both ports on each client are used.
  • The workload is streaming reads over files, with a 1 Mbyte I/O size. Once the client reaches the end of the file, it loops to the start. 10 processes were run on each of the clients. The files are mounted over NFSv3/TCP.
  • The total working set size is 100 Gbytes, which is entirely cached in the 7410's 128 Gbytes of DRAM.
  • The results are screenshots from Analytics on the 7410.
  • The target Sun Storage 7410 may not have been fully utilized - see the CPU utilization graph above and comments.
  • The clients aren't obviously saturated, although they are processing the workload as fast as they can with 1600 MHz CPUs.
  • I've been testing throughput as part of my role as a Fishworks performance engineer.

Traps to watch out for regarding throughput

For throughput results, there are some specific additional questions to consider:

  • How many clients were used?
    • While hardware manufacturers make 10 GbE cards, it doesn't mean that clients can drive them. A mistake to avoid is trying to test 10 GbE (and faster) with only one underpowered client - and ending up benchmarking the client by mistake. Apart from CPU horsepower, if your clients only have 1 GbE NICs then you'll need at least 10 of them to test 10 GbE, connected to a dedicated switch with 10 GbE ports.

  • What was the payload throughput?
    • In the above screenshots I showed network throughput by device, but this isn't showing us how much data was sent to the clients - rather, it's how busy the network interfaces were. The value of 1.90 Gbytes/sec includes the inbound NFS requests, not just the outbound NFS replies (which include the data payload); it also includes the overheads of the Ethernet, IP, TCP and NFS protocol headers. The actual payload moved is going to be a little less than the total throughput - how much exactly wasn't measured above.
  • Did client caching occur?
    • I mentioned this in my previous post, but it's worth emphasising. Clients will usually cache NFS data in the client's DRAM. This can produce a number of problems for benchmarking. In particular, if you measure throughput from the client, you may see throughput rates much higher than the client has network bandwidth, as the client is reading from its own DRAM rather than testing the target over the network (eg, measuring 3 Gbit/sec on a client with a 1 Gbit/sec NIC). In my test results above, I've avoided this issue by measuring throughput from the target 7410.
  • Was the throughput peak or an average?
    • Again, this was mentioned in my previous post and worth repeating here. The first two screenshots in this post show the difference - average throughput over a long interval is more interesting, as this is more likely to be repeatable.

Being more realistic: 1 x 10 GbE, 10 clients

The above test showed the limits I could find, although to do so required running many processes on each of the 20 clients, using both of the 1 GbE ports on each client (40 x 1 GbE in total), balancing the load across 2 x 10 GbE cards on the target - not just 2 x 10 GbE ports, and using a 1 Mbyte I/O size. The workload the clients are applying is as extreme as they can handle.

I'll now test a lighter (and perhaps more realistic) workload - 1 x 10 GbE port on the 7410, 10 clients using one of their 1 GbE ports each, and running a single process on each client to perform the streaming reads. The client process is /usr/bin/sum (a file checksum tool, which sequentially reads through files), which is run on 5 x 1 Gbyte files for each client, so a 50 Gbyte working set in total:

This time the network traffic is on the nxge1 interface only, peaking at 1.10 Gbytes/sec for both inbound and outbound. The average outbound throughput can be shown in Analytics by zooming in a little and breaking down by direction:

That's 1.07 Gbytes/sec outbound. This includes the network headers, so the NFS payload throughput will be a little less. As a sanity check, we can see from the first screenshot x-axis that the test ran from 03:47:40 to about 03:48:30. We know that 50 Gbytes of total payload was moved over NFS (the shares were mounted before the run, so no client caching), so if this took 50 seconds - our average payload throughput would be about 1 Gbyte/sec. This fits.

10 GbE should peak at about 1.164 Gbytes/sec (converting gigabits to gibibytes) per direction, so this test reaching 1.07 Gbytes/sec outbound is 92% utilization of the 7410's 10 GbE interface. Each of the 10 clients' 1 GbE interfaces would be equally busy. This is a great result for such a simple test - everything is doing what it is supposed to. (While this might seem obvious, it did take much engineering work during the year to make this work so smoothly; see posts here and here for some details.)
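
The line-rate conversion and the utilization figure above, spelled out (assuming a 10 GbE line rate of exactly 10^10 bits/sec, and the binary Gbytes used elsewhere in this post):

    line_rate = 10e9 / 8                 # 10 Gbit/sec in bytes/sec
    line_rate_gib = line_rate / 1024**3  # -> ~1.164 Gbytes/sec per direction
    measured = 1.07                      # outbound Gbytes/sec from Analytics
    print(round(line_rate_gib, 3))                  # -> 1.164
    print(round(100 * measured / line_rate_gib))    # -> 92 (% utilization)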

In summary, I've shown that the Sun Storage 7410 can drive NFS requests from DRAM at 1 Gbyte/sec, and even up to about 2 Gbytes/sec with extreme load - which pushed 2 x 10 GbE interfaces to high levels of utilization. With a current max of 128 Gbytes of DRAM in the Sun Storage 7410, entirely cached working set workloads are a real possibility. While these results are great, it is always important to understand the context of such numbers to avoid the common traps - which I've discussed in this post.

Tuesday Dec 02, 2008

A quarter million NFS IOPS

Following the launch of the Sun Storage 7000 series, various performance results have been published. It's important when reading these numbers to understand their context, and how that may apply to your workload. Here I'll introduce some numbers regarding NFS read ops/sec, and explain what they mean.

A key feature of the Sun Storage 7410 is DRAM scalability, which can currently reach 128 Gbytes per head node. This can span a significant working set size, and so serve most (or even all) requests from the DRAM filesystem cache. If you aren't familiar with the term working set size - this refers to the amount of data which is frequently accessed; for example, your website could be multiple Tbytes in size - but only tens of Gbytes are frequently accessed hour after hour, which is your working set.

Considering that serving most or all of your working set from DRAM may be a real possibility, it's worth exploring this space. I'll start by finding the upper bound - what's the most NFS read ops/sec I can drive. Here are screenshots from Analytics that show sustained NFS read ops/sec from DRAM. Starting with NFSv3:

And now NFSv4:

Both beat 250,000 NFS random read ops/sec from a single head node - great to see!

Questions when considering performance numbers

To understand these numbers, you must understand the context. These are the sort of questions you can ask yourself, along with the answers for those results above:

  • Is this clustered?
    • When you see incredible performance results, check whether this is the aggregated result of multiple servers acting as a cluster. For this result I'm using a single Sun Storage 7410 - no cluster.
  • Is the product for sale? What is its price/performance?
    • Yes (which should also have the price.)
  • What is the target, and is it a max config?
    • It's a Sun Storage 7410 with 128 Gbytes of DRAM, 4 sockets of quad-core AMD Opteron 2300 MHz CPU, and 2 x 2x10 GigE cards. It's not a max config since it isn't in a cluster, and it only has 5 JBODs (although that didn't make a difference with the DRAM test above.)
  • What are the clients?
    • 20 blades - each with 2 sockets of Intel Xeon quad-core 1600 MHz CPUs, 6 Gbytes of DRAM, and 2 x 1 Gig network ports. With more and faster clients it may be possible to beat these results.
  • What is the network?
    • The clients are connected to a switch using 1 GigE, and the 7410 connects to the same switch using 10 GigE. All ports are connected, so the 7410 is balanced across its 4 x 10 GigE ports.
  • What is the workload?
    • Each client is running over 20 processes which randomly read from files over NFS, with a 1 byte I/O size (a minimal sketch of this kind of load generator follows this list). Many performance tests these days will involve multiple threads and/or processes to check scaling; any test that only uses 1 thread on 1 client isn't showing the full potential of the target.
  • What is the working set size?
    • 100 Gbytes. This is important to check - it shouldn't be so tiny as to be served from CPU caches, if the goal is to test DRAM.
  • How is the result calculated?
    • It's measured on the 7410, and is the average seen in the Analytics window for the visible time period (5+ mins). Be very careful with results measured from the clients - as they can include client caching.
  • Was the target fully utilized?
    • In this case, yes. If you are reading numbers to determine maximum performance, check whether the benchmark is intended to measure that - some aren't!
  • Were the clients or network saturated?
    • Just a common benchmark problem to look out for, especially telltale when results cap at about 120 Mbytes/sec (hmm, you mean 1 GigE?). If the client or network becomes saturated, you've benchmarked them as well as the target server - probably not the intent. In the above test I maxed out neither.
  • Who gathered the data and why?
    • I gathered these results as part of Fishworks performance analysis to check what the IOPS limits may be. They aren't Sun official results. I thought of blogging about them a couple of weeks after running the tests (note the dates in the screenshot), and used Analytics to go back in time and take some screenshots.
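
To make the workload answer above concrete, here is a minimal sketch of that kind of load generator - many processes per client issuing 1 byte reads at random offsets over NFS. It's an illustration of the workload shape, not the actual test harness; the mount point and file name are hypothetical:

    import os
    import random
    from multiprocessing import Process

    NPROCS = 20                  # processes per client
    IO_SIZE = 1                  # 1 byte reads - deliberately tiny
    PATH = "/mnt/7410/testfile"  # hypothetical file on an NFS mount from the 7410

    def reader(path, size):
        """Read `size` bytes from random offsets in `path`, forever."""
        fsize = os.path.getsize(path)
        with open(path, "rb") as f:
            while True:
                f.seek(random.randrange(fsize))
                f.read(size)

    if __name__ == "__main__":
        procs = [Process(target=reader, args=(PATH, IO_SIZE)) for _ in range(NPROCS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()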

The above list covers many subtle issues to help you avoid them (don't learn them the hard way).

Traps to watch out for regarding IOPS

For IOPS results, there are some specific additional questions to consider:

  • Is this from cache?
    • Yes, which is the point of this test, as this 7410 has 128 Gbytes of DRAM.
  • What is the I/O size?
    • 1 byte! This was about checking what the limit may be, as a possible upper bound. An average I/O size of 1 Kbyte or 8 Kbytes is going to drop this result - as there is more work for the clients and server to do. If you are matching this to your workload, find out what your average I/O size is and look for results at that size.
  • Is the value an average, and for how long?
    • These are both 5+ minute averages. Be wary of tiny intervals that may show unsustainable results.
  • What is the latency?
    • Just as a 1 byte I/O size may make this value unrealistic, so may the latency for heavy IOPS results. Disks can be pushed to some high IOPS values by piling on more and more client threads, but the average I/O latency becomes so bad it is impractical. The latency isn't shown in the above screenshot!

Being more realistic: 8 Kbyte I/O with latency

The aim of the above was to discuss context, and to show how to understand a great result - such as 250,000+ NFS IOPS - by knowing what questions to ask. The two key criticisms for this result would be that it was for 1 byte I/Os, and that latency wasn't shown at all. Here I'll redo this with 8 Kbyte I/Os, and show how Analytics can display the NFS I/O latency. I'll also wind back to 10 clients, only use 1 of the 10 GigE ports on the 7410, and I'll gradually add threads to the clients until each is running 20:

The steps in the NFSv3 ops/sec staircase are where I'm adding more client threads.

I've reached over 145,000 NFSv3 read ops/sec - and this is not the maximum the 7410 can do (I'll need to use a second 10 GigE port to take this further). The latency does increase as more threads queue up; here it is plotted as a heat map with latency on the y-axis (the darker the pixel, the more I/Os were at that latency for that second). At our peak (which has been selected by the vertical line), most of the I/Os were faster than 55 us (0.055 milliseconds) - which can be seen in the numbers in the list on the left.

Note that this is the NFSv3 read ops/sec delivered to the 7410 after the client NFS driver has processed the 8 Kbyte I/Os, which decided to split some of the 8 Kbyte reads into 2 x 4 Kbyte NFS reads (pagesize). This means the workload became a mixed 4k and 8k read workload - for which 145,000 IOPS is still a good value. (I'm tempted to redo this for just 4 Kbyte I/Os to keep things simpler, but perhaps this is another useful lesson in the perils of benchmarking - the system doesn't always do what it is asked.)

Reaching 145,000 4+ Kbyte NFS cached read ops/sec without blowing out latency is a great result - and it's the latency that really matters (and from latency comes IOPS)... And on the topic of latency and IOPS - I do need to post a follow up for the next level after DRAM - no, not disks, it's the L2ARC using SSDs in the Hybrid Storage Pool.

Sunday Nov 09, 2008

Status Dashboard

In this entry I'll write about the Status Dashboard, a summary of appliance status written by myself and the Fishworks engineering team for the Sun Storage 7xxx series of NAS appliances (released today). This is of interest beyond the scope of NAS, as a carefully designed solution to a common problem - summarizing the status of a complex system.


Key Features:

  • Five areas of status:
    • Usage - disk storage and main memory usage, broken down into components
    • Services - status of system services
    • Hardware - status of hardware
    • Activity - (top right) shows current and historic activity of eight different metrics
    • Recent Alerts - most recent system alerts
  • Everything is updated live
  • Everything is clickable, switching to screens with more details

Motivations

The dashboard is the union of technical and graphical expertise. I'll talk about the technical aspects in this blog entry, and Todd (the Fishworks graphics designer) can talk about the design and graphics used - which have been chosen to be consistent with the rest of the appliance.

I have a particular interest in observability tools (I discuss many in the book Solaris Performance and Tools), and my most recent project before joining Sun was the DTraceToolkit[1]. But the most valuable experience for developing the dashboard has come from the time that I (and the Fishworks team) have spent with customers, seeing what does and what doesn't work. What I've learnt most from customers is that having a performance tool, even a useful performance tool, is only part of the problem; the tool must also be understood by the end-users, and help them understand system performance without being confusing.

Another motivation was to use a GUI to add value to the data, and not just repeat what can be done in text at the CLI. This includes making everything clickable to take the user to screens for more information, using icons to summarize states, graphs to plot data, and locating these elements in the GUI to improve readability.

I'll discuss each area of the dashboard below, as a tour of the interface, noting some technical points along the way.

Usage

The Usage area shows usage of the storage pool and main memory. The values are listed and represented as pie charts.

Pie charts work well here, not only because we have known maximums for both storage and memory to divide into slices, but also for the visual effect. I've seen dashboards covered in the same style of indicator for all statistics (eg, a page of strip charts and nothing else) - it's uninteresting and makes it difficult to remember where certain information is displayed. Here it is easy to remember where the usage status is shown - look for the pie charts.

For the storage section, both used and available space are displayed as well as the current achieved compression ratio. Clicking the pie chart will take you to the storage section of the interface where consumed space can be examined by share and project.

The memory section shows how main memory (RAM) is consumed. Clicking this will take you to the Analytics section where this data is graphed over time.

Services

The Services area shows the status of the system services with traffic lights to show the status of each. The traffic light will either be:

    online
    disabled - not yet enabled by the administrator
    offline - may be waiting for a dependent service to start
    maintenance - service has a current fault and is offline

Traffic lights work well here as these services have distinct states as defined by SMF (Service Management Facility from Solaris 10). Clicking either the name or the icon will take you to the service screen where its properties and logs can be examined.

This is worth repeating - if you see a service in a maintenance state, one click takes you to its properties and logs for troubleshooting, as well as controls to restart the service. This makes it possible to administer and maintain these services without any specific SMF or Solaris knowledge - a common aim of the appliance interface.

Hardware

Like Services, the Hardware section uses traffic lights to summarize the status of key hardware components. If clustering is configured, an extra light is displayed for cluster status. The status is mostly retrieved from FMA (Fault Management Architecture from Solaris 10).

Clicking on these icons will take you to the Hardware View - a section of the appliance with interactive digital views of the hardware components.

The system uptime is displayed in the top right.

Activity

This is the largest section on the dashboard, and graphs activity across several performance statistics (disk I/O shown here as an example).

There are no traffic lights used - instead I have used weather icons at the top left. Traffic lights indicate what is good or bad (green or red), which can be determined for hardware or services where any fault is bad, but can be inaccurate when quantifying performance activity statistics. This is due in part to:

  • Different customer environments have different acceptable levels for performance (latency), and so there is no one-size-fits-all threshold that can be used.
  • The displayed statistics on the dashboard are based on operations/sec and bytes/sec, which do not reliably reflect performance issues (latency does).

The weather icon displayed is based on configurable thresholds, and the average value from the last 60 seconds; click the icon to go to the configuration screen, shown below.

The reason for this activity icon is to grab attention when something is unusually busy or idle, and needs further investigation. Weather icons do this fine. If an activity becomes stormy or beyond (when it normally isn't), I'll take a look.

I've seen performance dashboards that use traffic lights based on arbitrary values, that for some environments and workloads can display green (good) when performance is bad, and red (bad) when the performance is good. This is something we wanted to avoid!

In this model there is no good/bad threshold, rather a gradient of levels for each activity statistic. This avoids the problem of choosing an arbitrary threshold for good/bad, and allows a more realistic gradient to be chosen. Weather is suitable for an additional reason: as a metaphor it implies a non-exact science.
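
As an illustration of the gradient idea - not the appliance's actual implementation - mapping a statistic's recent average onto weather levels could look like the following sketch. The thresholds and level names here are hypothetical and, as described above, would be configurable per statistic:

    # Hypothetical gradient of thresholds for one statistic (eg, NFSv3 ops/sec).
    LEVELS = [
        (100, "Sunny"),
        (1000, "Partly cloudy"),
        (10000, "Rainy"),
        (50000, "Stormy"),
        (100000, "Hurricane"),
    ]

    def weather(avg_60s):
        """Return the weather level for the last-60-second average of a statistic."""
        name = "Sunny"                      # calm end of the gradient by default
        for threshold, level in LEVELS:
            if avg_60s >= threshold:
                name = level
        return name

    print(weather(32))       # -> Sunny
    print(weather(145000))   # -> Hurricane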

The statistics these are based on are an approximate view of performance, since they are based on ops/sec and bytes/sec and not measures of latency. The dashboard deliberately uses these statistics since:

1. They can work fine in many situations
2. The dashboard links to appliance Analytics to provide a better understanding of each resource (including latency)
3. They are commonly understood (disk ops/sec, network bytes/sec)

The default thresholds have been based on our own performance benchmarking, which under the heaviest load will drive each activity into the CAT-* hurricanes. (If you get to a CAT-5 hurricane under a normal production workload, let me know.)

Back to the activity panel: In the middle are four graphs of recent activity over different time intervals, all drawn to the same vertical scale. From left to right are: 7 days, 24 hours, 60 minutes, and the instantaneous 1 second average (average is blue, maximum gray). Having the historic data available helps indicate if the current activity is higher or lower than normal - normal being the graph for the previous 24 hours or 7 days.

The labels beneath the graphs ("7d", "24h", "60m") are clickable - this will change the vertical scale on all graphs to the maximum for the clicked range, also helping the user quickly compare the current activity to the past; it will also change the average value in the middle top (in this example, "475 ops/sec") to be the average for the range selected.

Clicking each graph will take you to Analytics for that statistic and time range, where each statistic can be understood in more detail. For example, if you had an unusual number of NFSv3 ops/sec, you could use Analytics to see which files were hot, and which clients.

Note that the 7d and 24h graphs use bars instead of a line, where each bar represents 1 day and 1 hour respectively.

The activity statistics displayed are configurable from the Status->Settings section, which allows each of the 8 activity sections to be configured. It is possible to set sections to empty; your server may never use iSCSI, NDMP or NFSv4, so setting these sections to empty will improve load time for the dashboard (since the dashboard is dozens of png images - all of which must be loaded and refreshed.)

Recent Alerts

This section shows the most recent alerts from the appliance alerts logging system. Clicking the alert box will take you to the alerts log viewer, where each can be examined in more detail.

What about the CLI?

The Fishworks appliance software has a CLI interface, which mirrors the BUI (Browser User Interface) as much as possible. The Dashboard is a little tricky as it has been designed to be a visual GUI, and contains elements such as graphs that don't translate directly to text (although we might try aalib at some point). Despite this, much of the data can be printed, and a CLI version of the dashboard has been provided. This is how it currently looks:

walu:> status dashboard
Storage:
   pool_0:
      Used     10.0G bytes
      Avail    6.52T bytes
      State          online
      Compression    1x

Memory:
   Cache        550M bytes
   Unused       121G bytes
   Mgmt         272M bytes
   Other       4.10G bytes
   Kernel      1.90G bytes

Services:
   ad                disabled               cifs              disabled
   dns               online                 ftp               disabled
   http              online                 identity          online
   idmap             online                 ipmp              online
   iscsi             online                 ldap              disabled
   ndmp              online                 nfs               online
   nis               online                 ntp               online
   routing           online                 scrk              maintenance
   snmp              online                 ssh               online
   tags              online                 vscan             online

Hardware:
   CPU               online                 Cards             online
   Disks             faulted                Fans              online
   Memory            online                 PSU               online

Activity:
   CPU             1 %util                  Sunny
   Disk           32 ops/sec                Sunny
   iSCSI           0 ops/sec                Sunny
   NDMP            0 bytes/sec              Sunny
   NFSv3           0 ops/sec                Sunny
   NFSv4           0 ops/sec                Sunny
   Network       13K bytes/sec              Sunny
   CIFS            0 ops/sec                Sunny

Recent Alerts:
   2008-10-13 07:46: A cluster interconnect link has been restored.

Conclusion

We set out to summarize the entire system on one screen, and provide links to the rest of the interface. We are happy with how it turned out - and not just having fit the information in, but also having used features that a GUI allows to make this a powerful and intuitive interface.

For more information about appliance activity analysis, see the presentation on Analytics.

[1] I should have more spare time to update the DTraceToolkit and my blog in the coming months; I've been rather quiet during the last 2 years while working as a developer on the Fishworks engineering team, on a product that Sun has been keeping a lid on until today.

