Saturday May 15, 2010

Performance Instrumentation Counters: short talk

Performance Instrumentation Counters (PICs) allow CPU internals to be observed, and are especially useful for identifying exactly why CPUs are busy - not just that they are. I've blogged about them before, as part of analyzing HyperTransport utilization and CPI (cycles per instruction). There are a number of performance analysis questions that can only be answered via PICs, whether using the command line cpustat/cputrack tools, developer suites such as Oracle Sun Studio, or accessing them via DTrace. They include observing:

  • CPI: cycles per instruction
  • Memory bus utilization
  • I/O bus utilization (between the CPUs and I/O controllers)
  • CPU interconnect bus utilization
  • Level 1 cache (I$/D$) hit/miss rate
  • Level 2 cache (E$) hit/miss rate
  • Level 3 cache (if present) hit/miss rate
  • MMU events: TLB/TSB hit/miss rate
  • CPU stall cycles for other reasons (thermal?)
  • ... and more

This information is useful not just for developers writing code (who typically know these counters from using Oracle Sun Studio), but also for system administrators doing performance analysis and capacity planning.

I've recently been doing more performance analysis with PICs and taking advantage of PAPI (the Performance Application Programming Interface), which provides generic counters that are both easy to identify and work across different platforms. Over the years I've maintained a collection of cpustat-based scripts to answer questions from the above list. These scripts were written for specific architectures and became out of date when new processor types were introduced. PAPI solves this - I'm now writing a suite of cpustat scripts on top of PAPI (out of necessity - performance analysis is my job) that will work across different and future processor types. If I can, I'll post them here.
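To sketch the idea: with generic cycle and instruction counters, deriving CPI is just a ratio, and the post-processing no longer cares which architecture produced the numbers. The counter roles and the embedded cpustat-style sample below are illustrative assumptions, not output from a real system:

```shell
# Hypothetical captured cpustat output: pic0 = cycles, pic1 = instructions
# (generic PAPI-style counter roles; the values are made up for illustration)
sample='   time cpu event      pic0      pic1
  1.008   0  tick  20000000  18000000
  1.008   1  tick  30000000  22000000'

# CPI = total cycles / total instructions, summed across CPUs
echo "$sample" | awk 'NR > 1 { cyc += $4; ins += $5 }
    END { printf "CPI: %.2f\n", cyc / ins }'
# prints: CPI: 1.25
```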

And now for the reason for this post: Roch Bourbonnais, Jim Mauro and I were recently in the same place at the same time, and used the opportunity to record a few informal talks about performance topics on video. These talks weren't prepared beforehand; we just chatted about what we knew at the time, including advice and tips. This talk is on PICs:

download part 1 for iPod

download part 2 for iPod

I'm a fan of informal video talks, and I hope to do more - they are an efficient way to disseminate information. For busy people like myself, they can be the difference between never documenting a topic and providing something - albeit informal - to help others out. Based on my experience, the time it takes to generate different formats of content has been:

  • Informal talk: 0.5 - 1 hour
  • Blog post: 1 - 10 hours
  • Formal presentation: 3 - 10 hours
  • Published article: 3 - 30+ hours
  • Whitepaper: 5 - 50+ hours
  • Book: (months)

In fact, it's taken twice as long to write this blog post about the videos as it took to plan and film them.

Documentation is another passion of mine, and we are doing some very cool things in the Fishworks product to create documentation deliverables in a smart and efficient way - which can be the topic of another blog post...

Wednesday Nov 11, 2009

FROSUG perf horrors

A couple of days before Halloween I gave a talk titled "Little Shop of Performance Horrors" at the Front Range OpenSolaris User Group (FROSUG), in Denver, Colorado. The title was suggested to me, which inspired me to talk about things going wrong in the field of system performance. We had a great turnout, despite the talk happening during one of the worst snow storms of the year.

For the talk, I listed different performance problems and gave an example or two of each, including many lessons that were learnt the hard way.

Horrific Topics:
  • The worst perf issues I've ever seen!
  • Common misconfigurations
  • The encyclopedia of poor assumptions
  • Unbelievably bad perf analysis
  • Death by complexity
  • Bad benchmarking
  • Misleading analysis tools
  • Insane performance tuning
  • The curse of the unexpected

The slides are here. I'll revisit the slides when I have a chance and add more content; as this was the first time I'd given this talk, several more topics sprang to mind during the talk itself that there aren't slides for.

The entire talk - about 2.5 hours - was videoed and has already been posted on Sun Video, which I've included below.

Part 1/3

Part 2/3

Part 3/3

Tuesday Jul 22, 2008


An exciting new ZFS feature has now become publicly known - the second level ARC, or L2ARC. I've been busy with its development for over a year, however this is my first chance to post about it. This post will show a quick example and answer some basic questions.

Background in a nutshell

The "ARC" is the ZFS main memory cache (in DRAM), which can be accessed with sub microsecond latency. An ARC read miss would normally read from disk, at millisecond latency (especially random reads). The L2ARC sits in-between, extending the main memory cache using fast storage devices - such as flash memory based SSDs (solid state disks).
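The benefit can be pictured as a weighted average latency calculation: misses that previously went to millisecond disks now mostly land on sub-millisecond flash. A minimal sketch, with assumed illustrative latencies and hit rates (not measurements from any system):

```shell
# Average read latency with and without an L2ARC layer (illustrative numbers)
awk 'BEGIN {
    dram = 0.0001; ssd = 0.3; disk = 10     # assumed latencies, ms
    h_arc = 0.20                            # assumed ARC (DRAM) hit rate
    h_l2  = 0.75                            # assumed L2ARC hit rate, of ARC misses
    without = h_arc * dram + (1 - h_arc) * disk
    with_l2 = h_arc * dram + (1 - h_arc) * (h_l2 * ssd + (1 - h_l2) * disk)
    printf "avg read latency, ARC + disks:         %.2f ms\n", without
    printf "avg read latency, ARC + L2ARC + disks: %.2f ms\n", with_l2
}'
# prints: 8.00 ms without, 2.18 ms with
```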

[Figures: old model / new model / with ZFS]

Some example sizes to put this into perspective, from a lab machine named "walu":

    Layer          Medium      Total Capacity
    ARC            DRAM        128 Gbytes
    L2ARC          6 x SSDs    550 Gbytes
    Storage Pool   44 Disks    17.44 Tbytes (mirrored)

For this server, the L2ARC allows around 650 Gbytes to be stored in the total ZFS cache (ARC + L2ARC), rather than just the 120 Gbytes or so available from DRAM alone.

A previous ZFS feature (the ZIL) allowed you to add SSD disks as log devices to improve write performance. This means ZFS provides two dimensions for adding flash memory to the file system stack: the L2ARC for random reads, and the ZIL for writes.

Adam has been the mastermind behind our flash memory efforts, and has written an excellent article in Communications of the ACM about flash memory based storage in ZFS; for more background, check it out.

L2ARC Example

To illustrate the L2ARC with an example, I'll use walu - a medium sized server in our test lab, which was briefly described above. Its ZFS pool of 44 x 7200 RPM disks is configured as a 2-way mirror, to provide both good reliability and performance. It also has 6 SSDs, which I'll add to the ZFS pool as L2ARC devices (or "cache devices").

I should note - this is an example of L2ARC operation, not a demonstration of the maximum performance that we can achieve (the SSDs I'm using here aren't the fastest I've ever used, nor the largest.)

20 clients access walu over NFSv3, and execute a random read workload with an 8 Kbyte record size across 500 Gbytes of files (which is also its working set).

1) disks only

Since the 500 Gbytes of working set is larger than walu's 128 Gbytes of DRAM, the disks must service many requests. One way to grasp how this workload is performing is to examine the IOPS that the ZFS pool delivers:

    walu# zpool iostat pool_0 30
                   capacity     operations    bandwidth
    pool         used  avail   read  write   read  write  
    ----------  -----  -----  -----  -----  -----  -----
    pool_0      8.38T  9.06T     95      4   762K  29.1K
    pool_0      8.38T  9.06T  1.87K     15  15.0M  30.3K
    pool_0      8.38T  9.06T  1.88K      3  15.1M  20.4K
    pool_0      8.38T  9.06T  1.89K     16  15.1M  39.3K
    pool_0      8.38T  9.06T  1.89K      4  15.1M  23.8K

The pool is pulling about 1.89K reads/sec, which works out to around 43 ops for each of this pool's 44 disks. To examine how this is delivered by the disks, we can use either zpool iostat or the original iostat:

    walu# iostat -xnz 10
    [...trimmed first output...]
                        extended device statistics              
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
       43.9    0.0  351.5    0.0  0.0  0.4    0.0   10.0   0  34 c0t5000CCA215C46459d0  
       47.6    0.0  381.1    0.0  0.0  0.5    0.0    9.8   0  36 c0t5000CCA215C4521Dd0
       42.7    0.0  349.9    0.0  0.0  0.4    0.0   10.1   0  35 c0t5000CCA215C45F89d0
       41.4    0.0  331.5    0.0  0.0  0.4    0.0    9.6   0  32 c0t5000CCA215C42A4Cd0
       45.6    0.0  365.1    0.0  0.0  0.4    0.0    9.2   0  34 c0t5000CCA215C45541d0
       45.0    0.0  360.3    0.0  0.0  0.4    0.0    9.4   0  34 c0t5000CCA215C458F1d0
       42.9    0.0  343.5    0.0  0.0  0.4    0.0    9.9   0  33 c0t5000CCA215C450E3d0
       44.9    0.0  359.5    0.0  0.0  0.4    0.0    9.3   0  35 c0t5000CCA215C45323d0
       45.9    0.0  367.5    0.0  0.0  0.5    0.0   10.1   0  37 c0t5000CCA215C4505Dd0

iostat is interesting as it lists the service times: wsvc_t + asvc_t. These I/Os are taking between 9 and 10 milliseconds on average to complete, which the client applications will usually suffer as latency. This time is due to the random read nature of the workload - each I/O must wait as the disk heads seek and the disk platter rotates.
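That 9-10 ms figure is about what simple mechanics predict for 7200 RPM disks. As a rough sketch (the average seek time is an assumed typical value for this disk class, not a measured one):

```shell
# Expected random read latency for a 7200 RPM disk (back-of-envelope)
awk 'BEGIN {
    rpm = 7200
    avg_rot  = (60000 / rpm) / 2    # wait half a rotation on average, ms
    avg_seek = 5.0                  # assumed typical average seek, ms
    printf "avg rotational latency: %.2f ms\n", avg_rot
    printf "expected random read:   %.2f ms\n", avg_rot + avg_seek
}'
# prints: 4.17 ms rotational, 9.17 ms total
```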

Another way to understand this performance is to examine the total NFSv3 ops delivered by this system (these days I use a GUI to monitor NFSv3 ops, but for this blog post I'll hammer nfsstat into printing something concise):

    walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d'
    Server NFSv3:
    calls     badcalls 
    2260      0
    Server NFSv3:
    calls     badcalls
    2306      0
    Server NFSv3:
    calls     badcalls
    2239      0

That's about 2.27K ops/sec for NFSv3; I'd expect 1.89K of that to be what our pool was delivering, with the rest being cache hits out of DRAM, which is warm at this point.
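Those two rates also give a rough estimate of the ARC hit rate for this workload (a sketch using the numbers above):

```shell
# ARC (DRAM) hit rate estimated from NFS ops vs pool disk ops
awk 'BEGIN {
    nfs  = 2270     # NFSv3 ops/sec, from nfsstat
    disk = 1890     # pool reads/sec, from zpool iostat
    printf "approx ARC hit rate: %.0f%%\n", 100 * (nfs - disk) / nfs
}'
# prints: approx ARC hit rate: 17%
```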

2) L2ARC devices

Now the 6 SSDs are added as L2ARC cache devices:
    walu# zpool add pool_0 cache c7t0d0 c7t1d0 c8t0d0 c8t1d0 c9t0d0 c9t1d0 

And we wait until the L2ARC is warm.

Time passes ...

Several hours later the cache devices have warmed up enough to satisfy most I/Os which miss main memory. The combined 'capacity/used' column for the cache devices shows that our 500 Gbytes of working set now exists on those 6 SSDs:

    walu# zpool iostat -v pool_0 30
                                  capacity     operations    bandwidth
    pool                        used  avail   read  write   read  write 
    -------------------------  -----  -----  -----  -----  -----  -----
    pool_0                     8.38T  9.06T     30     14   245K  31.9K
      mirror                    421G   507G      1      0  9.44K      0
        c0t5000CCA216CCB905d0      -      -      0      0  4.08K      0
        c0t5000CCA216CCB74Cd0      -      -      0      0  5.36K      0
      mirror                    416G   512G      0      0  7.66K      0
        c0t5000CCA216CCB919d0      -      -      0      0  4.34K      0
        c0t5000CCA216CCB763d0      -      -      0      0  3.32K      0
    [... 40 disks truncated ...]
    cache                          -      -      -      -      -      -
      c7t0d0                   84.5G  8.63G  2.63K      0  21.1M  11.4K
      c7t1d0                   84.7G  8.43G  2.62K      0  21.0M      0
      c8t0d0                   84.5G  8.68G  2.61K      0  20.9M      0
      c8t1d0                   84.8G  8.34G  2.64K      0  21.1M      0
      c9t0d0                   84.3G  8.81G  2.63K      0  21.0M      0
      c9t1d0                   84.2G  8.91G  2.63K      0  21.0M  1.53K
    -------------------------  -----  -----  -----  -----  -----  -----

The pool_0 disks are still serving some requests (in this output 30 ops/sec) but the bulk of the reads are being serviced by the L2ARC cache devices - each providing around 2.6K ops/sec. The total delivered by this ZFS pool is 15.8K ops/sec (pool disks + L2ARC devices), about 8.4x faster than with disks alone.

This is confirmed by the delivered NFSv3 ops:

    walu# nfsstat -v 3 1 | sed '/^Server NFSv3/,/^[0-9]/!d'
    Server NFSv3:
    calls      badcalls   
    18729      0          
    Server NFSv3:
    calls      badcalls   
    18762      0          
    Server NFSv3:
    calls      badcalls   
    19000      0          

walu is now delivering 18.7K ops/sec, which is 8.3x faster than without the L2ARC.

However, the real win for the client applications is read latency: the disk-only iostat output showed an average of between 9 and 10 milliseconds, while the L2ARC cache devices are delivering the following:

    walu# iostat -xnz 10
                        extended device statistics              
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
     2665.0    0.4 21317.2    0.0  0.7  0.7    0.2    0.2  39  67 c9t0d0 
     2668.1    0.5 21342.0    3.2  0.6  0.7    0.2    0.2  38  66 c9t1d0
     2665.4    0.4 21320.4    0.0  0.7  0.7    0.3    0.3  42  69 c8t0d0
     2683.6    0.4 21465.9    0.0  0.7  0.7    0.3    0.3  41  68 c8t1d0
     2660.7    0.6 21295.6    3.2  0.6  0.6    0.2    0.2  36  65 c7t1d0
     2650.7    0.4 21202.8    0.0  0.6  0.6    0.2    0.2  36  64 c7t0d0

Our average service time is between 0.4 and 0.6 ms (wsvc_t + asvc_t columns), which is about 20x faster than what the disks were delivering.

What this means ...

An 8.3x improvement for 8 Kbyte random IOPS across a 500 Gbyte working set is impressive, as is improving storage I/O latency by 20x.

But this isn't really about the numbers, which will become dated (these SSDs were manufactured in July 2008, by a supplier who is providing us with bigger and faster SSDs every month).

What's important is that ZFS can make intelligent use of fast storage technology, in different roles that maximize its benefit. When you hear of new SSDs with incredible read ops/sec, picture them as your L2ARC; when you hear of great write throughput, picture them as your ZIL.

The example above was to show that the L2ARC can deliver, over NFS, whatever these SSDs could do. And these SSDs are being used as a second level cache, in-between main memory and disk, to achieve the best price/performance.


I recently spoke to a customer about the L2ARC and they asked a few questions which may be useful to repeat here:

What is L2ARC?

    The L2ARC is best pictured as a cache layer in-between main memory and disk, using flash memory based SSDs or other fast devices as storage. It holds non-dirty ZFS data, and is currently intended to improve the performance of random read workloads.

Isn't flash memory unreliable? What have you done about that?

    It's getting much better, but we have designed the L2ARC to handle errors safely. The data stored on the L2ARC is checksummed, and if the checksum is wrong or the SSD reports an error, we defer that read to the original pool of disks. Enough errors and the L2ARC device will offline itself. I've even yanked out busy L2ARC devices on live systems as part of testing, and everything continues to run.

Aren't SSDs really expensive?

    They used to be, but their price/performance has now reached the point where it makes sense to start using them in the coming months. See Adam's ACM article for more details about price/performance.

What about writes - isn't flash memory slow to write to?

    The L2ARC is coded to write to the cache devices asynchronously, so write latency doesn't affect system performance. This allows us to use "read-bias" SSDs for the L2ARC, which have the best read latency (and slow write latency).

What's bad about the L2ARC?

    It was designed to either improve performance or do nothing, so there isn't anything that should be bad. To explain what I mean by do nothing - if you use the L2ARC for a streaming or sequential workload, then the L2ARC will mostly ignore it and not cache it. This is because the default L2ARC settings assume you are using current SSD devices, where caching random read workloads is most favourable; with future SSDs (or other storage technology), we can use the L2ARC for streaming workloads as well.


If anyone is interested, I wrote a summary of L2ARC internals as a block comment in usr/src/uts/common/fs/zfs/arc.c, alongside the actual implementation code. The block comment is below (see the source for the latest version), and is an excellent reference for how it really works:

     * Level 2 ARC
     *
     * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
     * It uses dedicated storage devices to hold cached data, which are populated
     * using large infrequent writes.  The main role of this cache is to boost
     * the performance of random read workloads.  The intended L2ARC devices
     * include short-stroked disks, solid state disks, and other media with
     * substantially faster read latency than disk.
     *
     *                 +-----------------------+
     *                 |         ARC           |
     *                 +-----------------------+
     *                    |         ^     ^
     *                    |         |     |
     *      l2arc_feed_thread()    arc_read()
     *                    |         |     |
     *                    |  l2arc read   |
     *                    V         |     |
     *               +---------------+    |
     *               |     L2ARC     |    |
     *               +---------------+    |
     *                   |    ^           |
     *          l2arc_write() |           |
     *                   |    |           |
     *                   V    |           |
     *                 +-------+      +-------+
     *                 | vdev  |      | vdev  |
     *                 | cache |      | cache |
     *                 +-------+      +-------+
     *                 +=========+     .-----.
     *                 :  L2ARC  :    |-_____-|
     *                 : devices :    | Disks |
     *                 +=========+    `-_____-'
     *
     * Read requests are satisfied from the following sources, in order:
     *
     *      1) ARC
     *      2) vdev cache of L2ARC devices
     *      3) L2ARC devices
     *      4) vdev cache of disks
     *      5) disks
     *
     * Some L2ARC device types exhibit extremely slow write performance.
     * To accommodate for this there are some significant differences between
     * the L2ARC and traditional cache design:
     *
     * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
     * the ARC behave as usual, freeing buffers and placing headers on ghost
     * lists.  The ARC does not send buffers to the L2ARC during eviction as
     * this would add inflated write latencies for all ARC memory pressure.
     *
     * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
     * It does this by periodically scanning buffers from the eviction-end of
     * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
     * not already there.  It scans until a headroom of buffers is satisfied,
     * which itself is a buffer for ARC eviction.  The thread that does this is
     * l2arc_feed_thread(), illustrated below; example sizes are included to
     * provide a better sense of ratio than this diagram:
     *
     *             head -->                        tail
     *              +---------------------+----------+
     *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
     *              +---------------------+----------+   |   o L2ARC eligible
     *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
     *              +---------------------+----------+   |
     *                   15.9 Gbytes      ^ 32 Mbytes    |
     *                                 headroom          |
     *                                            l2arc_feed_thread()
     *                                                   |
     *                       l2arc write hand <--[oooo]--'
     *                               |           8 Mbyte
     *                               |          write max
     *                               V
     *                +==============================+
     *      L2ARC dev |####|#|###|###|    |####| ... |
     *                +==============================+
     *                           32 Gbytes
     *
     * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
     * evicted, then the L2ARC has cached a buffer much sooner than it probably
     * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
     * safe to say that this is an uncommon case, since buffers at the end of
     * the ARC lists have moved there due to inactivity.
     *
     * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
     * then the L2ARC simply misses copying some buffers.  This serves as a
     * pressure valve to prevent heavy read workloads from both stalling the ARC
     * with waits and clogging the L2ARC with writes.  This also helps prevent
     * the potential for the L2ARC to churn if it attempts to cache content too
     * quickly, such as during backups of the entire pool.
     *
     * 5. After system boot and before the ARC has filled main memory, there are
     * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
     * lists can remain mostly static.  Instead of searching from tail of these
     * lists as pictured, the l2arc_feed_thread() will search from the list heads
     * for eligible buffers, greatly increasing its chance of finding them.
     * The L2ARC device write speed is also boosted during this time so that
     * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
     * there are no L2ARC reads, and no fear of degrading read performance
     * through increased writes.
     *
     * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
     * the vdev queue can aggregate them into larger and fewer writes.  Each
     * device is written to in a rotor fashion, sweeping writes through
     * available space then repeating.
     *
     * 7. The L2ARC does not store dirty content.  It never needs to flush
     * write buffers back to disk based storage.
     *
     * 8. If an ARC buffer is written (and dirtied) which also exists in the
     * L2ARC, the now stale L2ARC buffer is immediately dropped.
     *
     * The performance of the L2ARC can be tweaked by a number of tunables, which
     * may be necessary for different workloads:
     *
     *      l2arc_write_max         max write bytes per interval
     *      l2arc_write_boost       extra write bytes during device warmup
     *      l2arc_noprefetch        skip caching prefetched buffers
     *      l2arc_headroom          number of max device writes to precache
     *      l2arc_feed_secs         seconds between L2ARC writing
     *
     * Tunables may be removed or added as future performance improvements are
     * integrated, and also may become zpool properties.

Jonathan recently linked to this block comment in a blog entry about flash memory, to show that ZFS can incorporate flash into the storage hierarchy, and here is the actual implementation.

Tuesday Feb 27, 2007


I recently wanted to gather some numbers on CPU and memory system performance for AMD64 CPUs. I reached a point where I searched the Internet for other Solaris AMD64 PIC (Performance Instrumentation Counter) analysis, and found little. I hope to improve this with some blog entries. In this part I'll introduce PIC observability and demonstrate measuring CPI (cycles per instruction) for different workloads.

To see why PICs are important, the following are the sort of questions that PIC analysis can answer:
  1. What is the Level 2 cache hit rate?
  2. What is the Level 2 cache miss volume?
  3. What is the hit rate and miss volume for the TLB?
  4. What is my memory bus utilization?
Questions 1 and 2 relate to the CPU hardware caches, where Level 2 is the E$ (meaning either "external" cache or "embedded" cache, depending on the CPU architecture). For optimal performance we want to see a high hit rate and, more importantly, a low miss volume.
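For concreteness, both metrics are simple functions of the raw hit and miss counts; the counts and the 64-byte line size below are illustrative assumptions, not measurements:

```shell
# Hit rate and miss volume from raw counter values (illustrative numbers)
awk 'BEGIN {
    hits = 960000; misses = 40000   # assumed cache events/sec from the PICs
    line = 64                       # assumed cache line size, bytes
    printf "L2 hit rate:    %.1f%%\n", 100 * hits / (hits + misses)
    printf "L2 miss volume: %.2f Mbytes/s\n", misses * line / 1e6
}'
# prints: 96.0% hit rate, 2.56 Mbytes/s miss volume
```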

Question 3 concerns a component of the memory management unit - the translation lookaside buffer (TLB). This processes and caches virtual to physical memory page translations. It can consume a lot of CPU (the worst I've seen is 60%), and it can be tuned. A good document for understanding this further is Taming Your Emu by Richard McDougall.

Question 4 seems obvious - the memory bus can be a bottleneck for system performance, so, how utilized is it? Answering this isn't easy, but it is usually possible by examining CPU PICs.

There are many AMD64 CPU PICs available, which can be viewed using tools such as cpustat and cputrack. Running cpustat -h dumps the list:
# cpustat -h
        cpustat [-c events] [-p period] [-nstD] [interval [count]]

        -c events specify processor events to be monitored
        -n        suppress titles
        -p period cycle through event list periodically
        -s        run user soaker thread for system-only events
        -t        include tsc register
        -D        enable debug mode
        -h        print extended usage information

        Use cputrack(1) to monitor per-process statistics.

        CPU performance counter interface: AMD Opteron & Athlon64

        event specification syntax:

        event[0-3]: FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired 
                 FP_dispatched_fpu_ops_ff LS_seg_reg_load 
                 LS_uarch_resync_self_modify LS_uarch_resync_snoop 
                 LS_buffer_2_full LS_locked_operation LS_retired_cflush 
                 LS_retired_cpuid DC_access DC_miss DC_refill_from_L2 
                 DC_refill_from_system DC_copyback DC_dtlb_L1_miss_L2_hit 
                 DC_dtlb_L1_miss_L2_miss DC_misaligned_data_ref 
                 DC_uarch_late_cancel_access DC_uarch_early_cancel_access 
                 DC_1bit_ecc_error_found DC_dispatched_prefetch_instr 
                 DC_dcache_accesses_by_locks BU_memory_requests 
                 BU_data_prefetch BU_system_read_responses 
                 BU_quadwords_written_to_system BU_cpu_clk_unhalted 
                 BU_internal_L2_req BU_fill_req_missed_L2 BU_fill_into_L2 
                 IC_fetch IC_miss IC_refill_from_L2 IC_refill_from_system 
                 IC_itlb_L1_miss_L2_hit IC_itlb_L1_miss_L2_miss 
                 IC_uarch_resync_snoop IC_instr_fetch_stall 
                 IC_return_stack_hit IC_return_stack_overflow 
                 FR_retired_x86_instr_w_excp_intr FR_retired_uops 
                 FR_retired_branches_mispred FR_retired_taken_branches 
                 FR_retired_far_ctl_transfer FR_retired_resyncs 
                 FR_retired_near_rets FR_retired_near_rets_mispred 
                 FR_retired_fpu_instr FR_retired_fastpath_double_op_instr 
                 FR_intr_masked_cycles FR_intr_masked_while_pending_cycles 
                 FR_taken_hardware_intrs FR_nothing_to_dispatch 
                 FR_dispatch_stall_fpu_full FR_dispatch_stall_ls_full 
                 FR_fpu_exception FR_num_brkpts_dr0 FR_num_brkpts_dr1 
                 FR_num_brkpts_dr2 FR_num_brkpts_dr3 
                 NB_mem_ctrlr_page_access NB_mem_ctrlr_page_table_overflow 
                 NB_mem_ctrlr_bypass_counter_saturation NB_ECC_errors 
                 NB_sized_commands NB_probe_result NB_gart_events 
                 NB_ht_bus0_bandwidth NB_ht_bus1_bandwidth 
                 NB_ht_bus2_bandwidth NB_sized_blocks NB_cpu_io_to_mem_io 

        attributes: edge pc inv cmask umask nouser sys 

        See Chapter 10 of the "BIOS and Kernel Developer's Guide for the 
        AMD Athlon 64 and AMD Opteron Processors," AMD publication #26094 

There are around fifty names above, such as "FP_dispatched_fpu_ops", which describe the PICs available. On my AMD Opteron CPUs you can measure four of these at a time, provided as arguments to cpustat, e.g.,
# cpustat -c IC_fetch,DC_access,DC_dtlb_L1_miss_L2_hit,DC_dtlb_L1_miss_L2_miss 0.25
   time cpu event      pic0      pic1      pic2      pic3 
  0.257   0  tick   6406429   8333198     45826      5515 
  0.257   1  tick   3333442   3942694     24682      4409 
  0.507   1  tick   6450964   8229104     44046      5713 
  0.507   0  tick   2359697   2828683     14365      4415 
  0.757   0  tick   2490406   3060416     16458      4901 
  0.757   1  tick   7292986   9530806     68956      6490 
  1.007   0  tick   2514008   3063049     15037      3863 
  1.007   1  tick   6057048   7747580     42415      6083 
In the above example I printed four PICs every 0.25 seconds, for each CPU (this server has 2 virtual CPUs). The cpu column shows that the output is slightly shuffled - a harmless side effect of the way cpustat was coded (it pbinds a libcpc consumer onto each CPU in the available processor set, and all threads write to STDOUT in any order). These PICs are provided by programmable hardware registers, so there is no ideal way around the four-at-a-time limit; you can, however, cycle measurements through different sets of PICs, which cpustat supports.

Reference Documentation
Since different CPUs provide different PICs, the guide mentioned at the bottom of the cpustat -h output will list what PICs your CPU type provides. It is important to read these guides carefully - for example, PICs that track cache misses may have some exceptions to what is considered a "miss".

I spent a while with AMD's #26094 guide, but found that the PIC descriptions raise more questions than they answer (try finding basics such as an instruction count). If you find yourself in a similar situation, it can help to create known workloads and then examine which metrics move by a similar amount. I used this approach to confirm which PICs provided cycle counts and instruction counts.

I did eventually find two good resources on AMD PICs,

You may notice some really interesting PICs mentioned, such as memory locality observability in the newer revs of AMD CPUs.

If you are interested in PIC analysis for any CPU type, see chapter 8 "Performance Counters" in Solaris Performance and Tools, by Richard McDougall, Jim Mauro and myself. One of the metrics we made sure to include in the book was CPI (cycles per instruction), as it proves to be a useful starting point for understanding CPU behavior.

Example - CPI
The cycles per instruction metric (sometimes measured as its inverse, IPC - instructions per cycle) is a useful ratio and (depending on the CPU type) fairly easy to measure. If the measured CPI is low, more instructions can be dispatched in a given time, which usually means higher performance. A high CPI means instructions are stalling, usually on main memory bus activity.

The output of cpustat can be formatted with a little scripting; the following script "amd64cpiu" uses a little shell and Perl to aggregate and print the output:

# amd64cpiu - measure CPI and utilization on AMD64 processors.
#
# USAGE: amd64cpiu [interval]
#   eg,
#        amd64cpiu 0.1          # for 0.1 second intervals
#
# CPI is cycles per instruction, a metric that increases due to activity
# such as main memory bus lookups.
#
# ident "@(#)       1.1     07/02/17 SMI"

interval=${1:-1}        # default interval, 1 second

set -- `kstat -p unix:0:system_misc:ncpus`      # assuming no psets,
cpus=$2                                         # number of CPUs

pics='BU_cpu_clk_unhalted'                      # cycles
pics=$pics,'FR_retired_x86_instr_w_excp_intr'   # instructions

/usr/sbin/cpustat -tc $pics $interval | perl -e '
        printf "%16s %8s %8s\n", "Instructions", "CPI", "%CPU";
        while (<>) {
                next if ++$lines == 1;          # skip header line
                @_ = split;
                $total += $_[3];                # tsc: total cycles
                $cycles += $_[4];               # pic0: unhalted cycles
                $instructions += $_[5];         # pic1: instructions

                if ((($lines - 1) % '$cpus') == 0) {
                        printf "%16u %8.2f %8.2f\n", $instructions,
                            $cycles / $instructions, $total ?
                            100 * $cycles / $total : 0;
                        $total = 0;
                        $cycles = 0;
                        $instructions = 0;
                }
        }
'

This script prints a column for CPI and for percent CPU utilization. I've used the PICs that were suggested in the AMD article - and from testing they do appear to be the best ones for measuring CPI.

Here amd64cpiu is used to examine a CPU bound workload of fast register based instructions,

# ./ 
    Instructions      CPI     %CPU
     16509657954     0.34    97.56
     16550162001     0.33    98.54
     16523746049     0.34    98.41
     16510783100     0.34    98.32
     16497135723     0.34    98.29
The CPI is around 0.34. This is about the best (lowest) that can be expected from this AMD64 architecture, which can dispatch up to three instructions per clock cycle - a theoretical minimum CPI of 0.33.

Now for a memory bound workload of sequential 1 byte memory reads,

# ./ 
    Instructions      CPI     %CPU
      4883935299     1.12    97.60
      4852961204     1.12    97.03
      4884120645     1.13    97.69
      4898818096     1.12    97.92
      4895064839     1.12    97.80
Things are starting to slow down due to the extra overhead of memory requests. Many reads are satisfied from the level 1 cache, some from the slower level 2 cache, and occasionally a cache line is read from main memory. This additional overhead raises CPI to around 1.12, and we complete fewer instructions for the same %CPU.
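That jump from 0.34 can be modeled with a simple (and admittedly crude) cache-penalty equation: effective CPI = base CPI + miss rate x miss penalty. The miss rate and penalty below are assumptions chosen for illustration, not values measured on this system:

```shell
# Crude effective-CPI model; missrate and penalty are illustrative
# assumptions, not measurements - real values come from cache PICs.
awk 'BEGIN {
        base = 0.34      # CPI of the register-bound workload
        missrate = 0.02  # assumed fraction of loads missing level 1 cache
        penalty = 40     # assumed average miss penalty, in cycles
        printf "estimated CPI = %.2f\n", base + missrate * penalty
}'
```

Those guesses land at 1.14, close to the measured 1.12; measuring the actual cache PICs would replace the guesses with real rates.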

Watch what happens when our memory workload performs 1 byte scattered reads (100 Kbytes apart),

# ./ 
    Instructions      CPI     %CPU
       653300388     8.53    98.36
       648496314     8.53    98.37
       644163952     8.54    97.75
       648941939     8.53    98.35
       648507176     8.53    98.37
Many of the reads will not be in the CPU caches, so most now require a main memory access. Our CPI is around 8.53, some 25 times slower than the register-based workload. Our %CPU is still about the same, but it buys us far fewer instructions in total.
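Since %CPU is pinned near 100% in both runs, instruction throughput scales as 1/CPI, and the slowdown factor falls straight out of the two measured CPIs:

```shell
# Slowdown of the scattered-read workload relative to the
# register-bound one, using the CPIs measured above.
awk 'BEGIN {
        printf "slowdown = %.0fx\n", 8.53 / 0.34
}'
```

This prints a slowdown of 25x - the same work rate costs 25 times the cycles.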

As you can see, CPI is shedding light on memory bus activity - it's very cool, and from such a simple metric.

Now for a real application: Here I watch as Sun's C compiler chews through a source tree,

# ./ 
    Instructions      CPI     %CPU
      2624028943     1.26    58.52
      2992167837     1.19    63.17
      2327129316     1.26    52.08
      2046997158     1.27    46.14
      2414376864     1.23    52.80
      3305351199     1.23    70.72
That's not so bad - any memory access instructions must be hitting caches fairly often (something that we can confirm by measuring other PICs).

Beware of output such as the following:

# ./ 
    Instructions      CPI     %CPU
        22695257     1.82     0.73
        22197894     1.75     0.69
        49626271     2.16     1.90
       102731779     2.21     4.04
       104795796     1.49     2.78
The CPUs are fairly idle (less than 5% utilization), and so CPI is less likely to be useful to indicate performance issues.

Suggested Actions - CPI
While many PICs produce interesting measurements, it's much more useful if there is some action we can take based on the results. The following is a list of suggestions based on CPI.

Firstly, to even be considering this list you need a known and measured performance issue. If one or more CPUs are 100% busy, that may be a performance issue, and checking CPI can be useful; if your CPUs are idle, it probably won't be. As for measuring the issue: it is especially helpful to quantify it, eg, average latency is 150 ms; tools such as DTrace can take these measurements.

  • Measure other PICs. CPI is a high level metric, and there are many other PICs that will explain why CPI is high or low, such as cache hits and misses, TLB hits and misses, and memory locality.
  • If CPI is low,
    • Examine the application for unnecessary CPU work (eg, using DTrace)
    • Get faster CPUs
    • Get more CPUs
  • If CPI is high,
    • Examine the application for unnecessary memory work
    • Recompile your application with optimization, and with Sun's C compiler
    • Consider using processor sets to improve memory locality
    • Get CPUs with larger caches
    • Test different CPU architectures (multi-core/multi-thread may improve performance as bus distance is shorter)
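As an example of making these measurements portable: on systems where cpustat supports the PAPI generic event names (PAPI_tot_cyc and PAPI_tot_ins, as mentioned in the introduction), the amd64cpiu pipeline can drop its AMD-specific PIC names. This is an untested sketch along those lines, assuming the same cpustat -t column layout as before:

```shell
# cpicalc - portable CPI via PAPI generic counters (sketch; the cpustat
# invocation is untested here). Assumes cpustat -t output columns:
#   time cpu event tick pic0 pic1
# Usage sketch:
#   /usr/sbin/cpustat -tc PAPI_tot_cyc,PAPI_tot_ins 1 | sh cpicalc ncpus
cpus=${1:-1}

awk -v cpus="$cpus" '
        NR == 1 { printf "%16s %8s\n", "Instructions", "CPI"; next }
        {
                cycles += $5                    # pic0: PAPI_tot_cyc
                instructions += $6              # pic1: PAPI_tot_ins
                if ((NR - 1) % cpus == 0) {
                        printf "%16d %8.2f\n", instructions,
                            cycles / instructions
                        cycles = 0; instructions = 0
                }
        }'
```

The aggregation logic is the same as amd64cpiu's; only the counter names change, which is exactly the portability PAPI buys.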
I hope this has been helpful. And there are many more cool metrics to observe on AMD64 CPUs - CPI is just the beginning.

Brendan Gregg, Fishworks engineer

