Tuesday May 26, 2009

Performance Testing the 7000 series, part 3 of 3

For performance testing storage products, in particular the Sun Storage 7000 series, I previously posted the top 10 suggestions for performance testing, and the little stuff to check. Here I'll post load generation scripts I've been using to measure the limits of the Sun Storage 7410 (people have been asking for these.)

Load Generation Scripts

Many storage benchmark tools exist, including FileBench, which can apply sophisticated workloads and report various statistics. However, the performance limit testing I've blogged about has not involved sophisticated workloads at all - I've tested sequential and random I/O to find the limits of throughput and IOPS. This is like finding out a car's top speed and its 0-60 mph time - simple and useful metrics, but not sophisticated ones, such as its best lap time at Laguna Seca Raceway. While a lap time would be a more comprehensive test of a car's performance, the simple metrics can be a better measure of specific abilities, with fewer external factors to interfere with the result. A lap time will be more affected by the driver's ability and the weather conditions, for example, than top speed alone.

The scripts that follow are very simple, as all they need to do is generate load. They don't generate a specific level of load or measure the performance of that load, they just apply load as fast as they can. Measurements are taken on the target - which can be a more reliable way of measuring what actually happened (past the client caches.) For anything more complex, reach for a real benchmarking tool.

Random Reads

This is my randread.pl script, which takes a filename as an argument:

    #!/usr/bin/perl -w 
    #
    # randread.pl - randomly read over specified file.
    
    use strict;
    
    my $IOSIZE = 8192;                      # size of I/O, bytes
    my $QUANTA = $IOSIZE;                   # seek granularity, bytes
    
    die "USAGE: randread.pl filename\\n" if @ARGV != 1 or not -e $ARGV[0];
    
    my $file = $ARGV[0];
    my $span = -s $file;                    # span to randomly read, bytes
    my $junk;
    
    open FILE, "$file" or die "ERROR: reading $file: $!\n";
    
    while (1) {
            seek(FILE, int(rand($span / $QUANTA)) * $QUANTA, 0);
            sysread(FILE, $junk, $IOSIZE);
    }
    
    close FILE;
    

Dead simple. Tune $IOSIZE to the I/O size desired. This is designed to run over NFS or CIFS, so the program will spend most of its time waiting for network responses - not chewing over its own code, and so Perl works just fine. Rewriting this in C isn't going to make it much faster, but it may be fun to try and see for yourself (be careful with the resolution of rand(), which may not have the granularity to span files bigger than 2^32 bytes.)

To run randread.pl, create a file for it to work on, eg:

    # dd if=/dev/zero of=10g-file1 bs=1024k count=10k
    

which is also how I create simple sequential write workloads. Then run it:

    # ./randread.pl 10g-file1 &
    

Sequential Reads

This is my seqread.pl script, which is similar to randread.pl:

    #!/usr/bin/perl -w
    #
    # seqread.pl - sequentially read through a file, and repeat.
    
    use strict;
    
    my $IOSIZE = 1024 * 1024;               # size of I/O, bytes
    
    die "USAGE: seqread.pl filename\\n" if @ARGV != 1 or not -e $ARGV[0];
    
    my $file = $ARGV[0];
    my $junk;
    
    open FILE, "$file" or die "ERROR: reading $file: $!\n";
    
    while (1) {
            my $bytes = sysread(FILE, $junk, $IOSIZE);
            if (!(defined $bytes) or $bytes != $IOSIZE) {
                    seek(FILE, 0, 0);
            }
    }
    
    close FILE;
    

Once it reaches the end of a file, it loops back to the start.

Client Management Script

To test the limits of your storage target, you'll want to run these scripts on a bunch of clients - ten or more. This is possible with some simple shell scripting. Start by setting up ssh (or rsh) so that a master server (your desktop) can log in to all the clients as root without prompting for a password (ssh-keygen, /.ssh/authorized_keys ...). My clients are named dace-0 through to dace-9, and after setting up the ssh keys the following succeeds without a password prompt:

    # ssh root@dace-0 uname -a
    SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    
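If the ssh keys aren't in place yet, the setup is along these lines - a rough sketch assuming OpenSSH defaults, that root logins are permitted, and that /.ssh already exists on the clients:

    # generate a key pair on the master (accept an empty passphrase)
    ssh-keygen -t rsa
    # append the public key to root's authorized_keys on each client;
    # this loop prompts for each client's password one last time
    for client in $(cat clientlist); do
            cat ~/.ssh/id_rsa.pub | ssh root@$client 'cat >> ~/.ssh/authorized_keys'
    done
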

Since I have 10 clients, I'll want an easy way to execute commands on them all at the same time, rather than one by one. There are lots of simple ways to do this; here I've created a text file called clientlist with the names of the clients:

    # cat clientlist
    dace-0
    dace-1
    dace-2
    dace-3
    dace-4
    dace-5
    dace-6
    dace-7
    dace-8
    dace-9
    

which is easy to maintain. Now a script to run commands on all the clients in the list:

    #!/usr/bin/ksh
    # 
    # clientrun - execute a command on every host in clientlist.
    
    if (( $# == 0 )); then
            print "USAGE: clientrun cmd [args]"
            exit 1
    fi
    
    for client in $(cat clientlist); do
            ssh root@$client "$@" &
    done
    

Testing that this script works by running uname -a on every client:

    # ./clientrun uname -a
    SunOS dace-0 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-1 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-2 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-3 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-4 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-5 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-7 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-8 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-6 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    SunOS dace-9 5.11 fishhooks-gate:05/01/08 i86pc i386 i86pc
    

Great.

Running the Workload

We've got some simple load generation scripts and a way to run them on our client farm. Now to execute a workload on our target server, turbot. To prepare for this:

  • Scripts are available to the clients on an NFS share, /net/fw/tools/perf. This makes it easy to adjust the scripts and have the clients run the latest version, rather than installing the scripts one by one on the clients.
  • One share is created on the target NFS server for every client (usually more interesting than just using one), and is named the same as the client's hostname (/export/dace-0 etc.)

Creating a directory on the clients to use as a mount point:

    # ./clientrun mkdir /test
    

Mounting the shares with default NFSv3 mount options:

    # ./clientrun 'mount -o vers=3 turbot:/export/`uname -n` /test'
    

The advantage of using the client's hostname as the share name is that it becomes easy for our clientrun script to have each client mount its own share, by getting the client to call uname -n to construct the share name. (The single quotes in that command are necessary, so that the backquoted uname -n runs on each client rather than on the master.)

Creating files for our workload. The clients have 3 Gbytes of DRAM each, so 10 Gbytes per file will avoid caching all (or most) of the file on the client, since we want our clients to apply load to the NFS server and not hit from their own client cache:

    # ./clientrun dd if=/dev/zero of=/test/10g-file1 bs=1024k count=10k
    

This applies a streaming write workload from 10 clients, one thread (process) per client. While that is happening, it may be interesting to log in to the NFS server and see how fast the write is performing (eg, network bytes/sec.)

With the test files created, I can now apply a streaming read workload like so:

    # ./clientrun '/net/fw/tools/perf/seqread.pl /test/10g-file1 &'
    

That will apply a streaming read workload from 10 clients, one thread per client.

Run it multiple times to add more threads; however, the client cache is more likely to interfere when trying this (one thread reads what another just cached). On Solaris you can try adding the mount option forcedirectio to avoid the client cache altogether.
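
On Solaris clients that can be done by remounting with the extra option; a sketch (check mount_nfs(1M) or your client's documentation for the equivalent):

    # ./clientrun umount /test
    # ./clientrun 'mount -o vers=3,forcedirectio turbot:/export/`uname -n` /test'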

To stop the workload:

    # ./clientrun pkill seqread
    

And to cleanup after testing is completed:

    # ./clientrun umount /test
    

Stepping the Workload

Running one thread per client on 10 clients may not be enough to stress powerful storage servers like the 7000 series. How many do we need to find the limits? An easy way to find out is to step up the workload over time, until the target stops serving any more.

The following runs randread.pl on the clients, starting with one and running another every 60 seconds until ten are running on each client:

    # ./clientrun 'for i in 1 2 3 4 5 6 7 8 9 10; do
         /net/fw/tools/perf/randread.pl /test/10g-file1 & sleep 60; done &' &
    

This was executed with an I/O size of 4 Kbytes. Our NFS server turbot is a Sun Storage 7410. The results from Analytics on turbot:

That worked great, and we can see that after 10 threads per client we've pushed the target to 91% CPU utilization, so we are getting close to a limit (in this case, available CPU) - which is the aim of this type of test: to drive load until we reach some limit.

I included network bytes/sec in the screenshot as a sanity check; we've reached 138180 x 4 Kbyte NFS reads/sec, which would require at least 540 Mbytes/sec of network throughput; we pushed 602 Mbytes/sec (which includes headers.) 138K IOPS is quite a lot - this server has 128 Gbytes of DRAM, so 10 clients with a 10 Gbyte file per client means a 100 Gbyte working set (active data) in total, which is entirely cached in DRAM on the 7410. If I wanted to test disk performance (cache misses), I could increase the per-client file size to create a working set much larger than the target's DRAM.
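
For example, increasing the per-client file size to 100 Gbytes would create a 1 Tbyte working set across the 10 clients, far bigger than the 128 Gbytes of DRAM on this 7410. A sketch (the file name is just illustrative, and creating the files will take a while):

    # ./clientrun dd if=/dev/zero of=/test/100g-file1 bs=1024k count=100k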

This type of testing can be useful to determine how fast a storage server can go - it's top speed. But that's all. For a better test of application performance, reach for a real benchmarking tool, or setup a test environment and run the application with a simulated workload.

Thursday Apr 02, 2009

Performance Testing the 7000 series, part 2 of 3

With the release of the Sun Storage 7000 series there has been much interest in the product's performance, which I've been demonstrating. In my previous post I listed 10 suggestions for performance testing - the big stuff you need to get right. The 10 suggestions are applicable to all perf testing on the 7000 series. Here I'll dive into the smaller details of max performance tuning, which may only be of interest if you'd like to see the system reach its limits.

The little stuff

The following is a tuning checklist for achieving maximum performance on the Sun Storage 7410, particularly for finding performance limits. Please excuse the brevity of some descriptions - this was originally written as an internal Sun document and has been released here in the interests of transparency. This kind of tuning is used during product development, to drive systems as fast as possible to identify and solve bottlenecks.

These can all be useful points to consider, but they do not apply to all workloads. I've seen a number of systems that were misconfigured with random tuning suggestions found on the Internet. Please understand what you are doing before making changes.

Sun Storage 7410 Hardware

  • Use the max CPU and max DRAM option (currently 4 sockets of quad core Opteron, and 128 Gbytes of DRAM.)

  • Use 10 GbE. When using 2 ports, ideally use 2 x 10 GbE cards and one port on each - which balances across two PCI-E slots, two I/O controllers, and both CPU to I/O HyperTransports. Two ports on a single 10 GbE card will be limited by the PCI-E slot throughput.

    Using LACP to trunk over 1 GbE interfaces is a different way to increase network bandwidth, but in small test environments (eg, fewer than 10 clients) you'll need the port hashing to balance evenly, which can be a headache depending on the client attributes (my 20 client test farm all have even-numbered IP addresses and MAC addresses!)

  • Balance load across I/O controllers. There are two I/O controllers in the 7410, the MCP55 and the IO55. The picture on the right shows which PCI-E slots are served by which controller, labeled A and B. When using two 10 GbE cards, they should (already) be installed in the top left and bottom right slots, so that their load is balanced across each controller. When three HBAs are used, they are put in the bottom left and both middle slots; if I'm running a test that only uses two of them, I'll use the middle ones.

  • Get as many JBODs as possible with the largest disks possible. This will improve random I/O for cache-busting workloads - although considering the amount of cache possible (DRAM plus Readzilla), a workload big enough to bust the cache may be too artificial to be interesting.

  • Use 3 x HBAs, configure dual paths to chains of 4 x JBODs (if you have the full 12.)

  • Consider Readzilla for random I/O tests - but plan for a long (hours) warmup. Make sure the share database size is small (8 Kbytes) before file creation. Also use multiple readzillas for concurrent I/O (you get about 3100 x 8 Kbyte IOPS from each of the current STECs.) Note that Readzilla (read cache) can be enabled on a per filesystem basis.

  • Use Logzilla for synchronous (O_DSYNC) write workloads, which may include database operations and small file workloads.

Sun Storage 7410 Settings

  • Configure mirroring on the pool. If there is interest in a different profile, test it in addition to mirroring. Mirroring will especially help random IOPS read and write workloads that are cache busting.

  • Create multiple shares for the clients to use, if this resembles the target environment. Depending on the workload, it may improve performance by a tiny amount for clients not to be hammering the same share.

  • Disable access time updates on the shares.

  • 128 Kbyte recsize for streaming. For streaming I/O tests, set the share "database size" to 128 Kbytes before file creation.

  • 8K recsize for random. Random I/O tests: set the share "database size" to match the application record size. I generally wouldn't go smaller than 4 Kbytes - if you make this extremely small then the ARC metadata to reference tiny amounts of data can begin to consume Gbytes of DRAM. I usually use 8 Kbytes.

  • Consider testing with LZJB compression on the shares to see if it improves performance (it may relieve back-end I/O throughput.) For compression, make sure your working set files actually contain data and aren't all zeros - ZFS has some tricks for compressing zero data that will artificially boost performance (see the sketch after this list.)

  • Consider suspending DTrace based Analytics during tests for an extra percent or so of performance (especially the by-file, by-client, by-latency, by-size and by-offset ones.) The default Analytics (listed in the HELP wiki) are negligible, and I leave them on during my limit testing (without them I couldn't take screenshots.)

  • Don't benchmark a cold server (after boot.) Run a throwaway benchmark first until it fills DRAM, then cancel it and run the real benchmark. This avoids a one-off kmem cache buffer creation penalty that may cost 5% or so of performance, but only in the minutes immediately after boot.
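
For the compression suggestion above, one way to create working set files that aren't all zeros is to read from /dev/urandom instead of /dev/zero. A sketch only - purely random data is also incompressible, so mixing in real data is more representative, and /dev/urandom is slow to read in bulk:

    # ./clientrun dd if=/dev/urandom of=/test/10g-file1 bs=1024k count=10k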

Network

  • Use jumbo frames on clients and server ports (not needed for management). The 7410 has plenty of horsepower to drive 1 GbE with or without jumbo frames, so their use on 1 GbE is more to help the clients keep up. On 10 GbE the 7410's peak throughput will improve by about 20% with jumbo frames. This ratio would be higher, but we are using LSO (large send offload) with this driver (nxge), which keeps the non-jumbo frame throughput pretty high to start with. (A client-side sketch follows this list.)

  • Check your 10 Gbit switches. Don't assume a switch with multiple 10 GbE ports can drive them at the same time; some 10 Gbit switches we've tested cap at 11 Gbits/sec. The 7410 can have 4 x 10 GbE ports - so make sure the switches can handle the load, such as by using multiple switches. You don't want to test the switches by mistake.
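
As a client-side sketch of the jumbo frames suggestion above (interface names will differ, some drivers also need a driver.conf change before they will accept a 9000 byte MTU, and the switch ports must be configured to match):

    # ifconfig e1000g1 mtu 9000
    # ifconfig e1000g1 | grep mtu          # confirm the new MTU took effect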

Clients

  • Lots of clients. At least 10. 20+ is better. Everyone who tries to test the 7410 (including me) gets client bound at some point.

  • A single client can't drive 10 GbE without a lot of custom tuning. And if the clients have 1 GbE interfaces, it will (obviously) take at least 10 clients to drive 10 GbE.

  • Connect the clients to dedicated switch(es) connected to the 7410.

  • You can reduce client caching by booting the clients with less DRAM: eg, eeprom physmem=786432 for 3 Gbytes on Solaris x86. Otherwise the test can hit from client cache and test the client instead of the target!

Client Workload

  • NFSv3 is generally the fastest protocol to test.

  • Consider umount/mount cycles on the clients between runs to flush the client cache.

  • NFS client mount options: tune the "rsize" and "wsize" options to match the workload. eg, rsize=8192 for random 8 Kbyte IOPS tests - this reduces unnecessary read-ahead; and rsize=131072 for streaming (also try setting nfs3_bsize to 131072 in /etc/system on the clients for the streaming tests.) An example mount invocation follows this list.

    For read tests, try "forcedirectio" if your NFS client supports this option (Solaris does) - this especially helps clients apply a heavier workload by not using their own cache. Don't leave this option enabled for write tests.

  • Whatever benchmark tool you use, you want to try multiple threads/processes per client. For max streaming tests I generally use 2 processes per client with 20 clients - so that's 40 client processes in total; for max random IOPS tests I use a lot more (100+ client processes in total.)

  • Don't trust benchmark tools - most are misleading in some way. Roch explains Bonnie++ here, and we have more posts planned for other tools. Always double check what the benchmark tool is doing using Analytics on the 7410.

  • Benchmark working set: very important! To properly test the 7410, you need to plan different total file sizes or working sets for the benchmarks. Ideally you'd run one benchmark that the server could entirely cache in DRAM, another that cached in Readzilla, and another that hit disk.

    If you are using a benchmark tool without tuning the file size, you are probably getting this wrong. For example, the defaults for iozone are very small (512 Mbytes and less) and can entirely cache on the client. If I want to test both DRAM and disk performance on a 7410 with 128 Gbytes of DRAM, I'll use a total file size of 100 Gbytes for the DRAM test, and at least 1 Terabyte (preferably 10 Terabytes) for the disk test.

  • Keep a close eye on the clients for any issues - eg, DTrace kernel activity.

  • Getting the best numbers involves tuning the clients as much as possible. For example, use a large value for autoup (300); tune tcp_recv_hiwat (for read tests, 400K should be good in general, 1 MB or more for long latency links.) The aim is to eliminate any effects from the available clients, and have results which are bounded by the target's performance.

  • Aim for my limits: That will help sanity check your numbers - to see if they are way off.
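
As an example of the client mount options above: a streaming read test from a Solaris client might be mounted like this (a sketch only - the server and share names are placeholders, and forcedirectio is for read tests, not write tests):

    # mount -o vers=3,rsize=131072,forcedirectio turbot:/export/`uname -n` /test

The nfs3_bsize tuning mentioned above goes in /etc/system on the client (set nfs:nfs3_bsize=131072) and takes effect after a reboot.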

The Sun Storage 7410 doesn't need special tuning at all (no messing with /etc/system settings.) If it did, we'd consider that a bug we should fix. Indeed, this is part of what Fishworks is about - the expert tuning has already been done in our products. What is left for the customer is simple and common across the industry: picking mirroring or double-parity RAID, jumbo frames, no access time updates, and tuning the filesystem record size. The clients require much, much more tuning and fussing when doing these performance tests.

In my next post, I'll show the simple tools I use to apply test workloads.

Monday Mar 23, 2009

Performance Testing the 7000 series, part 1 of 3

With the introduction of the Sun Storage 7000 series there has been much interest in its performance, which I've been demonstrating in this blog. Along with Bryan and Roch, I've also been helping other teams properly size and evaluate their configurations through performance testing. The advent of technologies like the hybrid storage pool makes performance testing more complicated, but no less important.

Over the course of performance testing and analysis, we assembled best practices internally to help Sun staff avoid common testing mistakes, tune their systems for maximum performance, and properly test the hybrid storage pool. In the interest of transparency and helping others understand the issues surrounding performance testing, I'll be posting this information over two posts, and my load generation scripts in a third.

Performance Testing - Top 10 Suggestions:

  1. Sanity Test
    Before accepting a test result, find ways to sanity test the numbers (a quick worked example follows this list.)

    When testing throughput over a gigabit network interface, the theoretical maximum is about 120 Mbytes/sec (converting 1 GbE to bytes.) I've been handed results of 300 Mbytes/sec and faster over a gigabit link, which is clearly wrong. Think of ways to sanity test results, such as checking against limits.

    IOPS can be checked in a similar way: 20,000 x 8 Kbyte read ops/sec would require about 156 Mbytes/sec of network throughput, plus protocol headers - too much for a 1 GbE link.

  2. Double Check
    When collecting performance data, use an alternate method to confirm your results.

    If a test measures network throughput, validate results from different points in the data path: switches, routers, and of course the origin and destination. A result can appear sane but still be wrong. I've discovered misconfigurations and software bugs this way, by checking if the numbers add up end-to-end.

  3. Beware of Client Caching
    File access protocols may cache data on the client, which is performance tested instead of the fileserver.

    This mistake should be caught by the above two steps, but it is so common it deserves a separate mention. If you test a fileserver with a file small enough to fit within the client's RAM, you may be testing client memory bandwidth, not fileserver performance. This is currently the most frequent mistake we see people make when testing NFS performance.

  4. Distribute Client Load
    Use multiple clients, at least 10.

    The Sun Storage 7410 has no problem saturating 10 GbE interfaces, but it's difficult for a client to do the same. A fileserver's optimized kernel can respond to requests much quicker than client-side software can generate them. In general, it takes twice the CPU horsepower to drive load as it does to accept it.

    Network bandwidth can also be a bottleneck: it takes at least ten 1 Gbit clients to max out a 10 Gbit interface. The 7410 has been shown to serve NFS at 1.9 Gbytes/sec, so at least sixteen 1 Gbit clients would be required to test max performance.

  5. Drive CPUs to Saturation
    If the CPUs are idle, the system is not operating at peak performance.

    The ultimate limiter in the 7000 series is measured as CPU utilization, and with the 7410's four quad-core Opterons, it takes a tremendous workload to reach its limits. To see the system at peak performance, add more clients, a faster network, or more drives to serve I/O. If the CPUs are not maxed out, they can handle more load.

    This is a simplification, but a useful one. Some workloads are CPU heavy due to the cycles spent processing instructions, others due to CPU wait cycles for various I/O bus transfers. Either way, it's measured as percent CPU utilization - and when that reaches 100%, the system can generally go no faster (although it may go a little faster if polling threads and mutex contention back off.)

  6. Disks Matter
    Don't ignore the impact of rotational storage.

    A full Sun Storage 7410 can have access to a ton of read cache: up to 128 Gbytes of DRAM and six 100 Gbyte SSDs. While these caches can greatly improve performance, disk performance can't be ignored, as data must eventually be written to (and read from) disk. A bare minimum of two fully-populated JBODs is required to properly gauge 7410 performance.

  7. Check Your Storage Profile
    Evaluate the desired redundancy profile against mirroring.

    The default RAID-Z2 storage profile on the 7000 series provides double-parity, but can also deliver lower performance than mirroring, particularly with random reads. Test your workload with mirroring as well as RAID-Z2, then compare price/performance and price/Gbyte to best understand the tradeoff made.

  8. Use Readzillas for Random Reads
    Use multiple SSDs, tune your record size, and allow for warmup.

    Readzillas (read-biased SSDs), can greatly improve random read performance, if configured properly and given time to warm up. Each Readzilla currently delivers around 3,100 x 8 Kbyte read ops/sec, and has 100 Gbytes of capacity. For best performance, use as many Readzillas as possible for concurrent I/O. Also consider that, due to the low-throughput nature of random-read workloads, it can take several hours to warm up 600 Gbytes of read cache.

    On the 7000 series, when using Readzillas on random read workloads, adjust the database record size from its 128 Kbyte default down to 8 Kbytes before creating files, or size it to match your application record size. Data is retrieved from Readzillas by their record size, and smaller record sizes improve the available IOPS from the read cache. This must be set before file creation, as ZFS doesn't currently rewrite files after this change.

  9. Use Logzillas for Synchronous Writes
    Accelerate synchronous write workloads with SSD based intent logs.

    Some file system operations, like file and directory creation, and writes to database log files, are considered "synchronous writes," requiring data to be on disk before the client can continue. Flash-based intent log devices, or Logzillas, dramatically speed up only those workloads comprised of synchronous writes; otherwise, data is written directly to disk.

    Logzillas can provide roughly 10,000 write ops/sec, depending on write size, or about 100 Mbytes/sec of write throughput, and scale linearly to meet demand.

  10. Document Your Test System
    Either test a max config, or make it clear that you didn't.

    While it's not always possible or practicable to obtain a maximum configuration for testing purposes, the temptation to share and use results without a strong caveat to this effect should be resisted. Every performance result should be accompanied by details on the target system, and a comparison to a maximum configuration, to give an accurate representation of a product's true capabilities.
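
As a quick worked example of the sanity arithmetic in suggestion 1: the network throughput needed for a given IOPS rate and I/O size can be checked on the spot, ignoring protocol headers. Here, 20,000 x 8 Kbyte reads/sec expressed in Mbytes/sec:

    # echo "scale=1; 20000 * 8 / 1024" | bc
    156.2

That's already more than a single 1 Gbit/sec link (roughly 120 Mbytes/sec) can carry.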

The main source of problems in performance testing that we've seen is the misuse of benchmark software and the misinterpretation of its results. The above suggestions should help the tester avoid the most common problems in this field. No matter how popular or widely-used benchmark software is, the tester is obliged to verify the results. And by paying sufficient attention to the test environment - i.e. system configuration, client load balance, network bandwidth - you can avoid common pitfalls (such as measuring 300 Mbytes/sec over a 1 Gbit/sec interface, which was courtesy of a popular benchmarking tool.)

In part two, I'll step through a more detailed checklist for max performance testing.

Tuesday Mar 17, 2009

Heat Map Analytics

I've been recently posting screenshots of heat maps from Analytics - the observability tool shipped with the Sun Storage 7000 series.

These heat maps are especially interesting, which I'll describe here in more detail.

Introduction

To start with, when you first visit Analytics you have an empty worksheet and need to add statistics to plot. Clicking on the plus icon next to "Add statistic" brings up a menu of statistics, as shown on the right.

I've clicked on "NFSv3 operations" and a sublist of possible breakdowns are shown. The last three (not including "as a raw statistic") are represented as heat maps. Clicking on "by latency" would show "NFSv3 operations by latency" as a heat map. Great.

But it's actually much more powerful than it looks. It is possible to drill down on each breakdown to focus on behavior of interest. For example, latency may be more interesting for read or write operations, depending on the workload. If our workload was performing synchronous writes, we may like to see the NFS latency heat map for 'write' operations separately - which we can do with Analytics.

To see an example of this, I've selected "NFS operations by type of operation", then selected 'write', then right-clicked on the "write" text to see the next breakdowns that are possible:

This menu is also visible by clicking the drill icon (3rd from the right) to drill down further.

By clicking on latency, it will now graph "NFSv3 operations of type write broken down by latency". So these statistics can be shown in whatever context is most interesting - perhaps I want to see NFS operations from a particular client, or for a particular file. Here are NFSv3 writes from the client 'deimos', showing the filenames that are being written:

Awesome. Behind the scenes, DTrace is building up dynamic scripts to fetch this data. We just click the mouse.

This was important to mention - the heat maps I'm about to demonstrate can be customized very specifically, by type of operation, client, filename, etc.

Sequential reads

I'll demonstrate heat maps at the NFS level by running the /usr/bin/sum command on a large file a few times, and letting it run longer each time before hitting Ctrl-C to cancel it. The sum command calculates a file's checksum, and does so by sequentially reading through the file contents. Here is what the three heat maps from Analytics show:

The top heat map of offset clearly shows the client's behavior - the stripes show sequential I/O. The blocks show the offsets of the read operations as the client creeps through the file. I mounted the client using forcedirectio, so that NFS would not cache the file contents on the client - and would be forced to keep reading the file each time.

The middle heat map shows the size of the client I/O requests. This shows the NFS client is always requesting 8 Kbyte reads. The bottom heat map shows NFS I/O latency. Most of the I/O was between 0 and 35 microseconds - but there are some odd clouds of latency on the 2nd and 3rd runs.

These latency clouds would be almost invisible if a linear color scheme was used - these heat maps use false color to emphasize detail. The left panel showed that on average there were 1771 ops/sec faster than 104 us (adding up the numbers), and that the entire heat map averaged at 1777 ops/sec; this means that the latency clouds (at about 0.7 ms) represent 0.3% of the I/O. The false color scheme makes them clearly visible, which is important for latency heat maps - as these slow outliers can hurt performance - even though they are relatively infrequent.

For those interested in more detail, I've included a couple of extra screenshots to explain this further:

  • Screenshot 1: NFS operations and disk throughput. From the top graph, it's clear how long I left the sum command running each time. The bottom graph of disk I/O bytes shows that the file I was checksumming had to be pulled in from disk for the entire first run, but only later in the second and third runs. Compare the times to the offset heat map above - the 2nd and 3rd runs are reading data that is now cached, and doesn't need to be read from disk.

  • Screenshot 2: ARC hits/misses. This shows what the ZFS DRAM cache is doing (which is the 'ARC' - the Adaptive Replacement Cache). I've shown the same statistic twice, so that I can highlight breakdowns separately. The top graph shows a data miss at 22:30:24, which triggers ZFS to prefetch the data (since ZFS detects that this is a sequential read.) The bottom graph shows data hits are kept high, thanks to ZFS prefetch, and ZFS prefetch in operation: the "prefetched data misses" shows requests triggered by prefetch that were not already in the ARC, and read from disk; and the "prefetched data hits" shows prefetch requests that were already satisfied by the ARC. The latency clouds correspond to the later prefetch data misses, where some client requests are arriving while prefetch is still reading from disk - and are waiting for that to complete.

Random reads

While the rising stripes of a sequential workload are clearly visible in the offset heat map, random workloads are also easily identifiable:

The NFS operations by offset shows a random and fairly uniform pattern, which matches the random I/O I now have my client requesting. These are all hitting the ZFS DRAM cache, and so the latency heat map shows most responses in the 0 to 32 microsecond range.

Checking how these known workloads look in Analytics is valuable, as when we are faced with the unknown we know what to look for.

Disk I/O

The heat maps above demonstrated Analytics at the NFS layer; Analytics can also trace these details at the back-end: what the disks are doing, as requested by ZFS. As an example, here is a sequential disk workload:

The heat maps aren't as clear as they are at the NFS layer, as now we are looking at what ZFS decides to do based on our NFS requests.

The sequential read is mostly reading from the first 25 Gbytes of the disks, as shown in the offset heat map. The size heat map shows ZFS is doing mostly 128 Kbyte I/Os, and the latency heat map shows the disk I/O time is often about 1.20 ms, and longer.

Latency at the disk I/O layer doesn't directly correspond to client latency - it depends on the type of I/O. Asynchronous writes and prefetch I/O won't necessarily slow the client, for example.

Vertical Zoom

There is a way to zoom these heat maps vertically. Zooming horizontally is obvious (the first 10 buttons above each heat map do that - by changing the time range), but the vertical zoom isn't so obvious. It is documented in the online help - I just wanted to say here that it does exist. In a nutshell: click the outliers icon (last on the right) to switch outlier elimination modes (5%, 1%, 0.1%, 0.01%, none), which often will do what you want (by zooming to exclude a percentage of outliers); otherwise left click a low value in the left panel, shift click a high value, then click the outliers icon.

Overheads

As mentioned earlier, these heat maps use optimal resolutions at different ranges to conserve disk space, while maintaining visual resolution. They are also saved on the system disks, which have compression enabled. Still, when recording this data every second, 24 hours a day, the disk space can add up. Check their disk usage by going to Analytics->Datasets and clicking the "ON DISK" title to sort by size:

The size is listed before compression, so the actual consumed bytes is lower. These datasets can be suspended by clicking the power button, which is handy if you'd like to keep interesting data but not continue to collect new data.

Playing around...

While using these heat maps we noticed some unusual and detailed plots. Bryan and I started wondering if it was possible to generate artificial workloads that plotted arbitrary patterns, such as spelling out words in 8 point text. This would be especially easy for the offset heat map at the NFS level - since the client requests the offsets, we just need to write a program to request reads or writes to the offsets we want. Moments after this idea, Bryan and I were furiously coding to see who could finish it first (and post comical messages to each other.) Bryan won, after about 10 minutes. Here is an example:

Awesome, dude! ... (although that wasn't the first message we printed ... when I realized Bryan was winning, I logged into his desktop, found the binary he was compiling, and posted the first message to his screen before he had finished writing the software. However my message appeared as: "BWC SnX" (Bryan's username is "bmc".) Bryan was looking at the message, puzzled, while I'm saying "it's upside down - your program prints upside down!")

I later modified the program to work for the size heat maps as well, which was easy as the client requests it. But what about the latency heat maps? Latency isn't requested - it depends on many factors: for reads, it depends on whether the data is cached, and if not, whether it is on a flash memory based read cache (if one is used), and if not, then it depends on how much disk head seek and rotation time it takes to pull it in - which varies depending on the previous disk I/O. Maybe this can't be done...

Actually, it can be done. Here are all three:

Ok, the latency heat map looks a bit fuzzy, but this does work. I could probably improve it if I spent more than 30 mins on the code - but I have plenty of actual work to do.

I got the latency program to work by requesting data which was cached in DRAM, of large increasing sizes. The latency from DRAM is consistent and relative to the size requested, so by calling reads with certain large I/O sizes I can manufacture a workload with the latency I want (or close to it.) The client was mounted forcedirectio, so that every read caused an NFS I/O (no client caching.)

If you are interested in the client programs that injected these workloads, they are provided here (completely unsupported) for your entertainment: offsetwriter.c, sizewriter.c and latencywriter.c. If you don't have a Sun Storage 7000 series product to try them on, you can try the fully functional VMware simulator (although they may need adjustments to compensate for the simulator's slower response times.)

Summary

Heat maps are an excellent visual tool for analyzing data, and identifying patterns that would go unnoticed via text based commands or plain graphs. Some may remember Richard McDougall's Taztool, which used heat maps for disk I/O by offset analysis, and was very useful at the time (I reinvented it later for Solaris 10 with DTraceTazTool.)

Analytics takes heat maps much further:

  • visibility of different layers of the software stack: disk I/O, NFS, CIFS, ...
  • drilldown capabilities: for a particular client or file only, ...
  • I/O by offset, I/O by size and I/O by latency
  • can archive data 24x7 in production environments
  • optimal disk storage

With this new visibility, heat maps are illuminating numerous performance behaviors that we previously didn't know about, and some we still don't yet understand - like the Rainbow Pterodactyl. DTrace has made this data available for years; Analytics is making it easy to see.

Monday Mar 16, 2009

Dave tests compression

Dave from Fishworks has tested ZFS compression on the Sun Storage 7000 series. This is just a simple test, but it is interesting to see that performance improved slightly when using the LZJB compression algorithm. Compression relieves back-end throughput to the disks, and the LZJB algorithm doesn't consume much CPU to do so. I had suggested trying compression in my streaming disk blog post, but didn't have any results to show. It's good to see this tested and shown with Analytics.

Thursday Mar 12, 2009

Latency Art: Rainbow Pterodactyl

In previous posts I've demonstrated Analytics from the Sun Storage 7000 series, which is a DTrace based observability tool created by Bryan. One of the great features of Analytics is its use of heat maps, especially for I/O latency.

I/O latency is the time to service and respond to an I/O request, and since clients are often waiting for this to complete, it is often the most interesting metric for performance analysis (more so than IOPS or throughput.) We thought of graphing average I/O latency; however, important details would be averaged out: occasional slow requests can destroy performance, but when averaged with many fast requests their existence may be difficult to see. So instead of averaging I/O latency, we provide I/O latency as a heat map. For example:

That is showing the effect of turning on our flash based SSD read cache. Latency drops!

The x-axis is time, the y-axis is the I/O latency, and color is used to represent a 3rd dimension - how many I/O requests occurred at that time, at that latency (darker means more.) Gathering this much detail from the operating system was not only made possible by DTrace, but also made optimal. DTrace already has the ability to group data into buckets, which is used to provide a sufficient resolution of data for plotting. Higher resolutions are kept for lower latencies: having 10 microsecond resolution for I/O times less than 1,000 microseconds is useful, but overkill for I/O times over 1,000,000 microseconds (1 second) - and would waste disk space when storing it. Analytics automatically picks a resolution suitable for the range displayed.

In this post (perhaps the start of a series), I'll show unusual latency plots that I and others have discovered. Here is what I call the "Rainbow Pterodactyl":

Yikes! Maybe it's just me, but that looks like the Pterodactyl from Joust.

I discovered this when testing back-end throughput on our storage systems, by 'lighting up' disks one by one with a 128 Kbyte streaming read workload to their raw device (visible as the rainbow.) I was interested in any knee points in I/O throughput, which is visible there at 17:55 where we reached 1.13 Gbytes/sec. To understand what was happening, I graphed the I/O latency as well - and found the alarming image above. The knee point corresponds to where the neck ends, and the wing begins.
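
For reference, the 'lighting up' was done with a loop along these lines - a sketch only, with illustrative device paths and an arbitrary interval (raw device testing like this is something we only do internally, on development systems):

    # start a 128 Kbyte sequential reader on another disk's raw device each minute
    for disk in /dev/rdsk/c2t*d0s0; do
            dd if=$disk of=/dev/null bs=128k &
            sleep 60
    done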

Why is there a beak? (the two levels of latency) ... Why is there a head? Why is there a wing? This raised many more questions than answers - things for us to figure out. Which is great - had this been average latency, these behaviors may have gone unnoticed.

I can at least explain the beak to head threshold: the beak ends at 8 disks, and I had two "x4" ports of SAS connected - so when 9 disks are busy, there is contention for those ports. (I'll defer describing the rest of the features to our storage driver expert Greg if he gets the time. :-)

The above was tested with 1 SAS path to each of 2 JBODs. Testing 2 paths to each JBOD produces higher throughput, and latency which looks a bit more like a bird:

Here we reached 2.23 Gbytes/sec for the first knee point, which is actually the neck point...

Latency heat maps have shown us many behaviors that we don't yet understand, which we need to spend more time with Analytics to figure out. It can be exciting to discover strange phenomena for the first time, and it's very useful and practical as well - latency matters.

Tuesday Mar 03, 2009

CIFS at 1 Gbyte/sec

I've recently been testing the limits of NFS performance on the Sun Storage 7410. Here I'll test the CIFS (SMB) protocol - the file sharing protocol commonly used by Microsoft Windows, which can be served by the Sun Storage 7000 series products. I'll push the 7410 to the limits I can find, and show screenshots of the results. I'm using 20 clients to test a 7410 which has 6 JBODs and 4 x 10 GbE ports, described in more detail later on.

CIFS streaming read from DRAM

Since the 7410 has 128 Gbytes of DRAM, most of which is available as the filesystem cache, it is possible that some workloads can be served entirely or almost entirely from DRAM cache, which I've tested before for NFS. Understanding how fast CIFS can serve this data from DRAM is interesting, so to search for a limit I've run the following workload: 100 Gbytes of files (working set), 4 threads per client, each doing streaming reads with a 1 Mbyte I/O size, and looping through their files.
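
The load itself came from simple looping-read scripts in the style of those in the load generation post above - roughly the following, where the file names and per-client file layout are assumptions, and seqread.pl is run with a 1 Mbyte $IOSIZE over an smbfs mount on /test:

    # ./clientrun 'for i in 1 2 3 4; do
         /net/fw/tools/perf/seqread.pl /test/file$i & done'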

I don't have to worry about client caching affecting the observed result - as is the case with other benchmarks - since I'm not measuring the throughput on the client. I'm measuring the actual throughput on the 7410, using Analytics:

I've zoomed out to show the average over 60 minutes - which was 1.04 Gbytes/sec outbound!

This result was measured as outbound network throughput, so it includes the overhead of Ethernet, IP, TCP and CIFS headers. Since jumbo frames were used, this overhead is going to be about 1%. So the actual data payload moved over CIFS will be closer to 1.03 Gbytes/sec.

The screenshot makes it clear that I'm not showing a brief peak value, but rather a sustained average over a long interval - to show how well the 7410 can serve CIFS at this throughput.

CIFS streaming read from disk

While 128 Gbytes of DRAM can cache large working sets, it's still interesting to see what happens when we fall out of that, as I've previously shown for NFS. The 7410 I'm testing has 6 JBODs (from a max of 12), which I've configured with mirroring for max performance. To test out disk throughput, my workload is: 2 Tbytes of files (working set), 2 threads per client, each performing streaming reads with a 1 Mbyte I/O size, and looping through their files.

As before, I'm taking a screenshot from Analytics - to show what the 7410 is really doing:

Here I've shown read throughput from disk at 849 Mbytes/sec, and network outbound at 860 Mbytes/sec (includes the headers.) 849 Mbytes/sec from disk (which will be close to our data payload) is very solid performance.

CIFS streaming write to disk

Writes take a different code path from reads, and need to be tested separately. The workload I've used to test write throughput is: writing 1+ Tbytes of files, 4 threads per client, each performing streaming writes with a 1 Mbyte I/O size. The result:

The network inbound throughput was 638 Mbytes/sec, which includes protocol overheads. Our data payload rate will be a little less than that due to the CIFS protocol overheads, but should still be at least 620 Mbytes/sec - which is very good indeed (even beat my NFSv3 write throughput result!)

Note that the disk I/O bytes were at 1.25 Gbytes/sec write: the 7410 is using 6 JBODs configured with software mirroring, so the back-end write throughput is doubled. If I picked another storage profile like RAID-Z2, the back-end throughput would be less (as I showed in the NFS post.)

CIFS read IOPS from DRAM

Apart from throughput, it's also interesting to test the limits of IOPS, in particular read ops/sec. To do this, I'll use a workload which is: 100 Gbytes of files (working set) which caches in DRAM, 20 threads per client, each performing reads with a 1 byte I/O size. The results:

203,000+ is awesome; while a realistic workload is unlikely to call 1 byte I/Os, still, it's interesting to see what the 7410 can do (I'll test 8 Kbyte I/O next.)

Note that the network in and out bytes are about the same - the 1 byte of data payload doesn't make much difference beyond the network headers.

Modest read IOPS

While the limits I can reach on the 7410 are great for heavy workloads, these don't demonstrate how well the 7410 responds under modest conditions. Here I'll test a lighter read ops/sec workload by cutting it to: 10 clients, 1 x 1 GbE port per client, 1 x 10 GbE port on the 7410, 100 Gbytes of files (working set), and 8 Kbyte random I/O. I'll step up the threads per client every minute (by 1), starting at 1 thread/client (so 10 in total to begin with):

We reached 71,428 read ops/sec - a good result for 8 Kbyte random I/O from cache and only 10 clients.

It's more difficult to generate client load (involves context switching to userland) than to serve it (kernel only), so you generally need more CPU grunt on the clients than on the target. At one thread per client on 10 clients, the clients are using 10 x 1600 MHz cores to test a 7410 with 16 x 2300 MHz cores - so the clients themselves will limit the throughput achieved. Even at 5 threads per client there is still headroom (%CPU) on this 7410.

The bottom graph is a heat map of CIFS read latency, as measured on the 7410 (from when the I/O request was received to when the response was sent.) As load increases, so does I/O latency - but most are still less than 100 us (fast!). This may be the most interesting of all the results - as this modest load is increased, the latency remains low while the 7410 scales to meet the workload.

Configuration

As the filer I was using a single Sun Storage 7410, with the following config:

  • 128 Gbytes DRAM
  • 6 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
  • 4 sockets of quad-core AMD Opteron 2300 MHz CPUs
  • 2 x 2x10 GbE cards (4 x 10 GbE ports total), jumbo frames
  • 2 x HBA cards
  • noatime on shares, and database size left at 128 Kbytes

It's not a max config system - the 7410 can currently scale to 12 JBODs, 3 x HBA cards, and have flash based SSD as read cache and intent log - which I'm not using for these tests. The CPU and DRAM size is the current max: 4 sockets of quad-core driving 128 Gbytes of DRAM is a heavyweight for workloads that cache well, as shown earlier.

The clients were 20 blades, each:

  • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
  • 6 Gbytes of DRAM
  • 2 x 1 GbE network ports
  • Running Solaris, and mounting CIFS using the smbfs driver

These are great, apart from the CPU clock speed - which at 1600 MHz is a little low.

The network consists of multiple 10 GbE switches to connect the client 1 GbE ports to the filer 10 GbE ports.

Conclusion

A single head node 7410 has very solid CIFS performance, as shown in the screenshots. I should note that I've shown what the 7410 can do given the clients I have, but it may perform even faster given faster clients for testing. What I have been able to show is 1 Gbyte/sec of CIFS read throughput from cache, and up to 200,000 read ops/sec - tremendous performance.

Before the Sun Storage 7000 products were released, there was intensive performance work on this by the CIFS team and PAE (as Roch describes), and they have delivered great performance for CIFS, and continue to improve it further. I've updated the summary page with these CIFS results.

Monday Feb 23, 2009

Networking Analytics Example

I just setup a new Sun Storage 7410, and found a performance issue using Analytics. People have been asking for examples of Analytics in use, so I need to blog these as I find them. This one is regarding network throughput, and while simple - it demonstrates the value of high level data and some features of Analytics.

The Symptom

To test this new 7410, I ran a cached DRAM read workload from 10 clients and checked network throughput:

It's reached 736 Mbytes/sec, and I know the 7410 can go faster over 1 x 10 GbE port (I previously posted a test showing a single nxge port reaching 1.10 Gbytes/sec.) This is a different workload and client setup, so, is 736 Mbytes/sec the most I should expect from these clients and this workload? Or can I go closer to 1.10 Gbytes/sec?

With Analytics, I can examine intricate details of system performance throughout the software stack. Start simple - start with the higher level questions and drill down deeper on anything suspicious. Also, don't assume anything that can easily be measured.

I expect my 10 clients to be performing NFSv3 reads, but are they? This is easily checked:

They are.

Since these clients are 10 identical blades in a blade server, and the workload I ran successfully connected to and started running on all 10 clients, I could assume that they are all still running. Are they? This is also easily checked:

I see all 10 clients in the left panel, so they are still running. But there is something suspicious - the client dace-2-data-3 is performing far fewer NFSv3 ops than the other clients. For the selected time of 15:49:35, dace-2-data-3 performed 90 NFSv3 ops/sec while the rest of the clients performed over 600. In the graph, that client is plotted as a trickle on the top of the stacks - and at only a pixel or two high, it's difficult to see.

This is where it is handy to change the graph type from stacked to line, using the 5th icon from the right:

Rather than stacking the data to highlight its components, the line graphs plot those components separately - so that they can be compared with one another. This makes it pretty clear that dace-2-data-3 is an outlier - shown as a single line much lower than all others.

For some reason, dace-2-data-3 is performing fewer NFSv3 ops than its identical neighbours. Let's check whether its network throughput is also lower:

It is, running at only 11.3 Mbytes/sec. Showing this as a line graph:

As before, the line graph highlights this client as an outlier.

Line Graphs

While I'm here, these line graphs are especially handy for comparing any two items (called breakdowns.) Click the first from the left panel, then shift-click a second (or more.) For example:

By just selecting one breakdown, the graph will rescale to fit it to the vertical height (note the units change on the top right):

Variations in this client's throughput are now more clearly visible. Handy stuff...

Datasets

Another side topic worth mentioning is the datasets - archived statistics used by Analytics. So far I've used:

  • Network bytes/sec by device
  • NFSv3 ops/sec by operation
  • NFSv3 ops/sec by-client
  • IP bytes/sec by-hostname

There is a screen in the Sun Storage 7000 series interface to manage datasets:

Here I've sorted by creation time, showing the newly created datasets at the top. The icons on the right include a power icon, which can suspend dataset collection. Suspended datasets show a gray light on the left, rather than green for enabled.

The by-client and by-hostname statistics require more CPU overhead to collect than the others, as these gather their data by tracing every network packet, aggregating that data, then resolving hostnames. These are some of the datasets that are DTrace based.

The overhead of DTrace based datasets is relative to the number of traced events, and how loaded the system is. The overhead itself is microscopic; however, if you multiply that by 100,000 on a busy system, it can become measurable. This system was pushing over 700 Mbytes/sec, which is approaching 100,000 packets/sec. The overhead (performance cost) for those by-client and by-hostname datasets was about 1.4% each. Tiny as this is, I usually suspend these when performing benchmarks (if they have been enabled - they aren't out of the box.) With lighter workloads (lighter than 700+ Mbytes/sec), this overhead becomes lower, as there is more CPU capacity available for collecting such statistics. So, you generally don't need to worry about the CPU overhead - unless you want to perform benchmarks.

The Problem

Back to the issue: the 11.3 Mbytes/sec value rings a bell. Converting bytes to bits, that's about 90 Mbit/sec - within 100 Mbit/sec. Hmm... These are supposed to be 1 Gbit/sec interfaces - is something wrong? On the client:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           100            --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    
Yes - the 1 Gbit/sec interface has negotiated to 100 Mbit. Taking a look at the physical port:

Confirmed! The port (in the center) is showing a yellow light on the right rather than a green light. There is a problem with the port, or the cable, or the port on the switch.

The Fix

Swapping the cable with one that is known to be good fixed the issue - the port renegotiated to 1 Gbit/sec:

    dace-2# dladm show-linkprop -p speed
    LINK         PROPERTY        VALUE          DEFAULT        POSSIBLE
    e1000g0      speed           1000           --             10,100,1000
    e1000g1      speed           1000           --             10,100,1000
    

And the client dace-2-data-3 is now running much faster:

That's about a 9% performance win - caused by a bad Ethernet cable, found by Analytics.

About

Brendan Gregg, Fishworks engineer
