Sunday Oct 24, 2010

G'Day and Goodbye

In 2006 I received a phone call from Bryan Cantrill of Sun Microsystems, inviting me to join a new team, Fishworks. In four years I've learned and accomplished a lot, and worked with many talented people. We achieved our goal, shipped an amazing product, and made an ambitious dream a reality. It's been a privilege to be part of it, and to be able to contribute and innovate in many different areas along the way.

And now, a different and exciting job awaits. My last day here is Oct 25, ending a brief and busy four years at Sun and then Oracle. Good luck, and thanks. My work blog will move to


New Videos

Some new videos have been posted of myself and others speaking about DTrace and performance, including: Enjoy!

Friday Oct 01, 2010

DTrace materials from Oracle OpenWorld 2010

At Oracle OpenWorld I gave a talk titled How to Build Better Applications With Oracle Solaris DTrace, written by myself and Jim Mauro. Click the link to see the slides, and video is here. Another DTrace talk is online in video form: a chat with Rick Ramsey about DTrace for system administrators. Enjoy!

Wednesday Sep 22, 2010

DTrace book coming soon

Jim Mauro and I recently handed the manuscript for the DTrace book to the publisher, which (unformatted) came in at over 900 pages. It covers using DTrace, complementary to the DTrace Guide, and should serve as a cookbook of DTrace scripts and one-liners for topics spanning the entire software stack. It also contains strategies and checklists for these topics, to help the reader develop scripts further.

The chapters are:

  1. Introduction
  2. The D Language
  3. System View
  4. Disk I/O
  5. Filesystems
  6. Network Lower Level Protocols
  7. Application Level Protocols
  8. Languages
  9. Applications
  10. Databases
  11. Security
  12. Kernel
  13. Tools
  14. Tips and Tricks

The book contains over 230 DTrace scripts and over 270 one-liners, many of which are new, and go much further than I have done previously in the DTraceToolkit.

Our intent was ambitious: this wasn't to be just a cookbook of scripts, but a cookbook of useful scripts and strategy that spanned the entire software stack - covering both easy and difficult topics alike. We wrote up the chapters and subtopics that such a book would ideally cover, and then created the content - much of which didn't exist before.

It's taken most of the past year to develop and write this material - consuming all of my spare time at the expense of many things: blog entries, DTraceToolkit updates, and other personal sacrifices. It's been exhausting at times, but it's also very satisfying to have finally completed. The publisher is now turning this into the finished product, a process which typically takes a few months.

Saturday Jun 05, 2010

Visualizing System Latency

I've just had an article published in ACMQ: Visualizing System Latency, which demonstrates latency analysis using heat maps in Analytics from Oracle's Sun Open Storage appliances. These have revealed details about system performance that were previously not visible, and show how effective a simple visualization can be. As many of these details are new, they are still being studied and are not yet understood.

One detail now can be understood, thanks to Pete Harllee who offered an explanation for the faint line at the top of Figure 4: these I/O are where the lba range spans two tracks in the drive (which I should have realized can happen sooner since these are 8 Kbyte writes on a drive with 512 byte sectors); the additional latency encountered is the expected track to track seek time during the I/O, as the lbas are written to one track and then complete writing on the next.

The resolution of the screenshots was reduced to fit the online article, which preserved the patterns that the article was describing but not ancillary text; the original resolution screenshots are linked in the article, and are also listed here:

  • Figure 1: NFS Latency When Enabling SSD-based Cache Devices
  • Figure 2: Synchronous Writes to a Striped Pool of Disks
  • Figure 3: Single Disk Latency from a Striped Pool of Disks
  • Figure 4: Synchronous Write Latency to a Single-disk Pool
  • Figure 5: Synchronous Write Latency to a Two-disk Pool
  • Figure 6: Synchronous Writes to a Mirrored Pool of Disks
  • Figure 7: Sequential Disk Reads, Stepping Disk Count
  • Figure 8: Repeated Disk Reads, Stepping Disk Count
  • Figure 9: High Latency I/O

The article has been picked up by sites including Slashdot and Vizworld.

Saturday May 15, 2010

Performance Instrumentation Counters: short talk

Performance Instrumentation Counters (PICs) allow CPU internals to be observed, and are especially useful for identifying why exactly CPUs are busy - not just that they are. I've blogged about them before, as part of analyzing HyperTransport utilization and CPI (Cycles-per-Instruction). There are a number of performance analysis needs for which can only be answered via PICs, either using the command line cpustat/cputrack tools, developer suites such as Oracle Sun Studio, or accessing them via DTrace. They include observing:

  • CPI: cycles per instruction
  • Memory bus utilization
  • I/O bus utilization (between the CPUs and I/O controllers)
  • CPU interconnect bus utilization
  • Level 1 cache (I$/D$) hit/miss rate
  • Level 2 cache (E$) hit/miss rate
  • Level 3 cache (if present) hit/miss rate
  • MMU events: TLB/TSB hit/miss rate
  • CPU stall cycles for other reasons (thermal?)
  • ... and more

This information is useful, not just for developers writing code (who are typically more familiar with their existence from using Oracle Sun Studio), but also for system administrators doing performance analysis and capacity planning.

I've recently been doing more performance analysis with PICs and taking advantage of PAPI (Performance Application Programming Interface), which provides generic counters that are both easy to identify and work across different platforms. Over the years I've maintained a collection of cpustat based scripts to answer questions from the above list. These scripts were written for specific architectures and became out of date when new processor types were introduced. PAPI solves this - I'm now writing a suite of cpustat based scripts based on PAPI (out of necessity - performance analysis is my job), that will work across different and future processor types. If I can, I'll post them here.

And for the reason of this post: Roch Bourbonnais, Jim Mauro and myself were recently in the same place at the same time, and used the opportunity to have a few informal talks about performance topics recorded on video. These talks wern't prepared beforehand, we just chatted about what we knew at the time, including advice and tips. This talk is on PICs:

download part 1 for iPod

download part 2 for iPod

I'm a fan of informal video talks, and I hope to do more - they are an efficient way to disseminate information. And for busy people like myself, it can be the difference between never documenting a topic or providing something - albeit informal - to help others out. Just based on my experience, the time it's taken to generate different formats of content has been:

  • Informal talk: 0.5 - 1 hour
  • Blog post: 1 - 10 hours
  • Formal presentation: 3 - 10 hours
  • Published article: 3 - 30+ hours
  • Whitepaper: 5 - 50+ hours
  • Book: (months)

In fact, it's taken twice the time to write this blog post about the videos than it took to plan and film them.

Documentation is another passion of mine, and we are doing some very cool things in the Fishworks product to create documentation deliverables in a smart and efficient way; which can be the topic of another blog post...

Wednesday Jan 27, 2010

iSCSI before and after

I've blogged many 7410 performance results for the NFSv3, NFSv4, and CIFS protocols. Now it's iSCSI's turn.

When the Sun Storage 7000 series first launched, iSCSI was provided by a user-land process: iscsitgtd. While it worked, performance wasn't as good as the kernel-based services such as NFS. iSCSI performance was to be improved by making it a kernel service as part of the COMSTAR project. This was delivered in the 7000 series in the 2009 Q3 software release, and iSCSI performance was indeed greatly improved.

Here I'll show iSCSI performance test results which highlight the before and after effects of both COMSTAR and the 7410 hardware update. It would of course be better to split the results showing the COMSTAR and hardware update effects separately - but I only have the time to blog this right now (and I wouldn't be blogging anything without Joel's help to collect results.) I would guess that the hardware update is responsible for up to a 2x improvement, as it was for NFS. The rest is COMSTAR.

OLD: Sun Storage 7410 (Barcelona, 2008 Q4 software)

To gather this data I powered up an original 7410 with the original software release from the product launch.

Testing max throughput

To do this I created 20 luns of 100 Gbytes each, which should entirely cache on the 7410. The point of this test is not to measure disk performance, but instead the performance of the iSCSI codepath. To optimize for max throughput, each lun used a 128 Kbyte block size. Since iSCSI is implemented as a user-land daemon in this release, performance is expected to suffer due to the extra work to copy-out and copy-in data to the iscsitgtd process from the kernel.

I stepped up clients, luns, and thread count to find a maximum where adding more stopped improving throughput. I reached this with only 8 clients, 2 sequential read threads per client (128 Kbyte reads), and 1 lun per client:

Network throughput was 311 Mbytes/sec. Actual iSCSI payload throughput will be a little lower due to network and iSCSI headers (jumbo frames were used, so the headers shouldn't add too much.)

This 311 Mbytes/sec result is the "bad" result - before COMSTAR. However, is it really that bad? How many people are still on 1 Gbit Ethernet? 311 Mbytes/sec is plenty to saturate a 1 GbE link, which may be all you have connected.

The CPU utilization for this test was 69%, suggesting that more headroom may be available. I wasn't able to consume it with my client test farm or workload.

Testing max IOPS

To test max IOPS, I repeated a similar test but used a 512 byte read size for the client threads intsead of 128 Kbyte. This time 10 threads per client were run, on 8 clients with 1 lun per client:

iSCSI IOPS were 37,056. For IOPS I only count served reads (and writes): the I/O part of IOPS.

NEW: Sun Storage 7410 (Istanbul, 2009 Q3+ software)

This isn't quite the 2009 Q3 release, it's our current development version. While it may be a little faster or slower than the actual 2009 Q3 release, it still reflects the magnitude of the improvement that COMSTAR and the Istanbul CPUs have made.

Testing max throughput:

This shows a 60 second average of 2.75 Gbytes/sec network throughput - impressive, and close to the 3.06 Gbytes/sec I measured for NFSv3 on the same software and hardware. For this software version, Analytics included iSCSI payload bytes, which showed it actually moved 2.70 Gbytes/sec (the extra 0.05 Gbytes/sec was iSCSI and TCP/IP headers.)

That's over 9x faster throughput. I would guess that up to 2x of this is due to the Istanbul CPUs, which still leaves over 5x due to COMSTAR.

Since this version of iSCSI could handle much more load, 47 clients were used with 5 luns and 5 threads per client. Four 10 GbE ports on the 7410 were configured to serve the data.

Testing max IOPS:

The average here was over 318,000 read IOPS. Over 8x the original iSCSI performance.


Both 7410s were max head node configurations with max DRAM (128 and 256 Gbytes), and max CPUs (4 sockets of Barcelona, and 4 sockets of Istanbul.) Tests were performed over 10 GbE ports from two cards. This is the performance from single head nodes - they were not part of a cluster.


The above tests showed iSCSI read performance. Writes are processed differently: they are made synchronous unless the write cache enabled property is set on the lun (the setting is in Shares->lun->Protocols.) The description for this setting is:

    This setting controls whether the LUN caches writes. With this setting off, all writes are synchronous and if no log device is available, write performance suffers significantly. Turning this setting on can therefore dramatically improve write performance, but can also result in data corruption on unexpected shutdown unless the client application understands the semantics of a volatile write cache and properly flushes the cache when necessary. Consult your client application documentation before turning this on.

For this reason, it's recommended to use Logzillas with iSCSI to improve write performance instead.


As a kernel service, iSCSI performance is similar to other kernel based protocols on the current 7410. For cached reads it is reaching 2.70 Gbytes/sec throughput, and over 300,000 IOPS.

Tuesday Nov 17, 2009

NFS Analytics example

I just tracked down a minor performance issue on my desktop in a couple of minutes, and thought it may make a good example to follow as it demonstrates using Analytics along with dtrace, truss, snoop and kstats - really the entire suite of perf tools I regularly use on Solaris.

I was downloading a file and happened to run the kstat based nicstat tool to watch network traffic. After the download finished, traffic continued:

    brendan@deimos:~> nicstat 1
        Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
    20:53:30    nge0   10.81   13.13   35.77   37.69   309.4   356.6    0.02   0.00
    20:53:30     lo0    0.00    0.00    0.00    0.00    0.00    0.00    0.00   0.00
        Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
    20:53:31    nge0   28.17   21.71   89.51   91.50   322.2   243.0    0.04   0.00
    20:53:31     lo0    0.00    0.00    0.00    0.00    0.00    0.00    0.00   0.00
        Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
    20:53:32    nge0   26.18   19.71   84.15   85.14   318.6   237.0    0.04   0.00
    20:53:32     lo0    0.00    0.00    0.00    0.00    0.00    0.00    0.00   0.00
        Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
    20:53:33    nge0   26.89   19.93   80.18   85.13   343.4   239.7    0.04   0.00
    20:53:33     lo0    0.00    0.00    0.00    0.00    0.00    0.00    0.00   0.00
        Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
    20:53:34    nge0   26.89   21.50   88.14   91.11   312.3   241.6    0.04   0.00
    20:53:34     lo0    0.00    0.00    0.00    0.00    0.00    0.00    0.00   0.00

It's not much, but my desktop should be idle, and yet there are regular network reads and writes. I'd like to know what that is.

Running snoop should give me a quick idea of the traffic over the wire:

    root@deimos:/> snoop -r | cut -c1-80
    Using device nge0 (promiscuous mode) -> NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA -> NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK -> NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA -> NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK -> NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA -> NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK -> NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA -> NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK -> NFS C 4 (lookup valid) PUTFH FH=700C NVERIFY GETA -> NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK

That's not what I was expecting. I'm doing NFS traffic? The only NFS share I have mounted is my home directory, and I'm not doing any sustained I/O right now.

Fortunately my home directory server is a Sun Storage system, so I can examine the NFS traffic with Analytics from the server side:

I'll begin by drilling down by client, as I know my desktop is "deimos":

This shows I'm doing more NFS IOPS to the home directory server than anyone else.

Now to drill down further, by type of operation and by file:

".elinks"? elinks is a text based browser that I use - a modern version of lynx.

Here's a screenshot of elinks viewing osnews:

So why is elinks causing so much NFS traffic?

Right now I could aim DTrace at elinks to get a detailed view of what it is up to, but as I'm not worried about slowing the target in this case, I'll take a quick look with truss to start with (runs non-root by default, fewer keystrokes and inbuilt syscall translation):

    brendan@deimos:~> truss -fp `pgrep elinks`
    295062: pollsys(0x08047AC0, 3, 0x08047B68, 0x00000000)  = 0 
    295062: alarm(0)                                        = 0
    295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: time()                                          = 1258176283
    295062: open64("/home/brendan/.elinks/formhist", O_RDONLY) Err#2 ENOENT

Ahh - elinks is repeatedly trying to read a non-existant file from my home directory, which is causing the NFS traffic.

To get an accurate read on the rate of these opens, I'll use dtrace to frequency count the filenames that were called with open() by elinks, printing a summary every second:

    root@deimos:/> dtrace -n 'syscall::open\*:entry /execname == "elinks"/ {
            @[copyinstr(arg0)] = count(); } tick-1sec { printa(@); clear(@); }'
    dtrace: description 'syscall::open\*:entry ' matched 3 probes
    CPU     ID                    FUNCTION:NAME
      1  87538                       :tick-1sec
      /home/brendan/.elinks/formhist                                   63
      1  87538                       :tick-1sec
      /home/brendan/.elinks/formhist                                   63
      1  87538                       :tick-1sec
      /home/brendan/.elinks/formhist                                   63

It's 63 opens per second, all of the ".elinks/formhist" file.

DTrace can dig deeper, shedding light as to why elinks is reading that file. I can just add "ustack()" as a second key to the DTrace aggregation used to frequency count filenames, to include the entire user stack trace:

    root@deimos:/> dtrace -n 'syscall::open\*:entry /execname == "elinks"/ {
            @[copyinstr(arg0), ustack()] = count(); } tick-1sec { printa(@); clear(@); }'
    dtrace: description 'syscall::open\*:entry ' matched 3 probes
    CPU     ID                    FUNCTION:NAME
      1  87538                       :tick-1sec

I'm not familiar with the elinks source, so I'll read from the bottom of the stack trace up, looking for high level functions. It looks generic until I get to redraw_leds(). leds? LEDs? There is a pseudo-LED text display in the bottom right of the screen. It's calling redraw_terminal(), which is calling refresh_view(), and then into load_formhist_from_file() - causing the file open.

Doing file opens with the frequent update of the status LEDs sounds like a bug, and perhaps a known one. A minute with google found that this was known and fixed in the next version of elinks.

While I was watching, the home directory server for fishworks was serving most of its NFS IOPS for a bug in elinks on my desktop. Which I can eliminate by upgrading my elinks version. This is an example of the best type of performance win: eliminating unnecessary work.


Brendan Gregg, Fishworks engineer


« April 2014