Monday Oct 26, 2015

One SPARC Beats Two POWER8 on 3 out of 4 SPECrate metrics

Congratulations to the SPARC M7 team for delivery of a monster computer chip incorporating 32 cores, each of which simultaneously executes 8 independent threads, for a total of 256 CPUs on a single chip.

The chip provides throughput that leaves competitors far behind, as posted today at, including:

  • 67% more integer throughput than the nearest competitor

  • 76% more floating point throughput than the nearest competitor

The above comparisons are based on best single-chip SPECrate results, as posted at through today, 26-Oct-2015.

Comparison to IBM POWER8

What about POWER8? Well, there aren't any 1-chip results posted. Now what?

Yes, it is correct that IBM has not submitted results to the SPEC website for systems using only a single POWER8 chip. Interestingly, though, the newly announced SPARC T7-1 system, using one SPARC M7 chip, outperforms two IBM POWER8 chips on three out of four top-level metrics:

Compute Throughput

                         2-chip POWER8        1-chip SPARC M7
                         SPECrate results
MHz                      2926                 4133
Number of Chips          2                    1
Number of Cores          20                   32
Number of CPUs(*)        160                  256
SPECint_rate_base2006    853                  1120
SPECint_rate2006         1100                 1200
SPECfp_rate_base2006     745                  801
SPECfp_rate2006          888                  832

Best IBM POWER8 2-chip results as published at as of 26-Oct-2015
vs. newly announced results for SPARC M7.
For more detail, see the full disclosures at and

(The results posted to have been submitted to SPEC, but have not yet been reviewed.)

(*) Depending on context, different terms may be seen: "hardware threads", "threads", "virtual processors", "logical processors", "processors", "virtual CPUs", and so forth. From the point of view of the operating system, there are 256 independently scheduled CPUs available on a single SPARC M7 chip.
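The "three out of four" claim can be checked directly from the scores in the table above. The following is only a small sketch of that arithmetic; the dictionary layout and the function name m7_wins are illustrative choices, not part of any SPEC tool.

```python
# Top-level SPECrate scores copied from the table above,
# stored as (2-chip POWER8, 1-chip SPARC M7).
scores = {
    "SPECint_rate_base2006": (853, 1120),
    "SPECint_rate2006": (1100, 1200),
    "SPECfp_rate_base2006": (745, 801),
    "SPECfp_rate2006": (888, 832),
}


def m7_wins(table):
    """Return the metrics on which the 1-chip SPARC M7 score is higher."""
    return [name for name, (power8, m7) in table.items() if m7 > power8]


wins = m7_wins(scores)
print(f"SPARC M7 leads on {len(wins)} of {len(scores)} metrics")
for name, (power8, m7) in scores.items():
    # Ratio of the one-chip M7 score to the two-chip POWER8 score.
    print(f"{name:24s} M7/POWER8 = {m7 / power8:.2f}")
```

The one metric where the two POWER8 chips come out ahead is SPECfp_rate2006 (888 vs. 832).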


Congratulations are due to all who work to deliver stunning throughput, including but not limited to the chip group, Solaris, Solaris Studio, and in particular to the performance lab rat who puts it all together in a well-documented, reportable performance result.


More Info

The SPEC CPU benchmarks are derived from the compute intensive portions of real applications. More info is available at SPECint and SPECfp are SPEC trademarks.

Tuesday May 26, 2009

Losing My Fear of ZFS


The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing simplicity, power, safety, and speed. As is common with most new technologies, this ambitious vision was not completely fulfilled in the initial versions. Initial usage showed that although it did have useful and convenient features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one performance engineer, describes some of the improvements, and provides examples of use.

Rumors: Does This Sound Familiar?

Have you heard some of these about ZFS?

"ZFS? You can't use that - it will eat all your memory!"

"ZFS? That's a software disk striping/RAID solution. You don't want that. You want hardware RAID."

"ZFS? Be afraid."

Can I Please Just Forget About IO? (NO)

As a performance engineer, my primary concern is for the SPEC CPU benchmarks - which intentionally do relatively little IO. Usually.

To a first approximation, IO can be ignored in this context. Usually.

To a first approximation, it's fine if my ZFS "knowledge" is limited to rumors / innuendo as quoted above. Until....

Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.

Why a SPEC CPU Benchmarker Might Care About IO

Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An analysis of the IO in the benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file, ref.mps, which is read during the second invocation of the benchmark.

Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G), a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec. It takes about 5.5 seconds to read one copy of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about 3000 seconds.

But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.

On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle - which is not the desired behavior for a CPU benchmark.

For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches the desired 99% User time. During those 18 minutes, a single disk with ufs on it is, according to iostat, 100% busy. It reads about 16 MB/sec, doing about 725 reads/sec.
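A back-of-the-envelope estimate shows why 18 minutes is plausible. The sketch below assumes the input file is 267 MB (the uncompressed size quoted later in this article) and that the disk sustains the observed read rate for the whole period; the function name read_seconds is made up for illustration.

```python
def read_seconds(copies, file_mb, rate_mb_per_sec):
    """Time to read `copies` copies of a file at a steady read rate."""
    return copies * file_mb / rate_mb_per_sec


# One copy on the T5220 internal drive at about 50 MB/sec.
single_copy_sec = read_seconds(1, 267, 50)

# 63 copies on the M5000 ufs disk at about 16 MB/sec.
m5000_min = read_seconds(63, 267, 16) / 60

print(f"one copy:  {single_copy_sec:.1f} seconds")
print(f"63 copies: {m5000_min:.1f} minutes")
```

Under those assumptions one copy takes a bit over 5 seconds, while 63 copies take between 17 and 18 minutes, in line with the idle time seen on the M5000.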

Note that in this graph, and all other graphs in this article, the program being tested is only one of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should be taken as indicative of "compliant" (aka "reportable") runs of an entire suite. SPEC and the benchmark name SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and the CPU benchmarks, see

graph1: ufs takes 18 min on M5000

ZFS Makes its Dramatic Entrance

Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations, there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input file is highly compressible, going from 267 MB to 20 MB with gzip.

The best IO is the IO that you never have to do at all. By using the ZFS compression feature, we can make 90% of the IO go away:

      $ zpool create -f tank c0t1d0
      $ zfs create tank/spec-zfs-gzip
      $ zfs set compression=gzip tank/spec-zfs-gzip
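The "90%" figure follows from the gzip sizes quoted above. This is a rough sketch only: it assumes that ZFS, which compresses record by record, achieves about the same ratio as command-line gzip does on the whole file, and that may not hold exactly. The function name io_saved is hypothetical.

```python
def io_saved(original_mb, compressed_mb):
    """Fraction of read IO avoided if only compressed data is read."""
    return 1.0 - compressed_mb / original_mb


# Sizes for ref.mps quoted above: 267 MB uncompressed, 20 MB with gzip.
saved = io_saved(267, 20)
ratio = 267 / 20

print(f"compression ratio: {ratio:.1f}x")
print(f"IO avoided:        {saved:.1%}")
```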

graph2: zfs pegs the CPU almost immediately

The improvement from ZFS gzip compression is indeed dramatic.

The careful reader may note that there are actually two lines on the far left: one measured with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant variable for the tests reported in this paper, as can be seen by the fact that the two lines are right on top of each other.

What About Memory Consumption?

Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised that it is memory-hungry, and indeed the "Best Practices" Guide plainly says that it will use all the memory on the system if it thinks it can get away with it:

The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory.

ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical memory, and it is desirable to run (n - 1) copies on a system with (n) threads and (n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC: set zfs:zfs_arc_max = 0x(size) can be added to /etc/system.
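Since the knob takes a hexadecimal byte count, a small helper can take the guesswork out of writing the line. This is a sketch, not a recommendation: the 4 GiB cap is an arbitrary example value (the article does not say what limit was used), and arc_max_line is a made-up helper name.

```python
GIB = 1024 ** 3  # bytes in one GiB


def arc_max_line(size_bytes):
    """Format an /etc/system line that caps the ZFS ARC at size_bytes."""
    return f"set zfs:zfs_arc_max = 0x{size_bytes:x}"


# Hypothetical example: cap the ARC at 4 GiB.
print(arc_max_line(4 * GIB))
```

For the 4 GiB example this prints `set zfs:zfs_arc_max = 0x100000000`. A change to /etc/system takes effect at the next reboot.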

The tests reported on this page all use a limited ARC cache.

It should also be noted that all tests are done after a fresh reboot, so presumably the ARC cache is not contributing to the reported performance. More details about methods may be found at the end of the article.

ZFS on T5440: Good, But Not As Dramatic

Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it needs to quickly inhale on the order of 64 GB.
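The "on the order of 64 GB" figure is just the copy count times the file size. A minimal sketch, assuming one copy fewer than the number of hardware threads and about 1/4 GB of input per copy, as described above; the function name rate_input_gb is illustrative.

```python
def rate_input_gb(threads, gb_per_copy=0.25):
    """Return (copies, total GB read) for a SPECrate run of 450.soplex."""
    copies = threads - 1  # one thread is left free
    return copies, copies * gb_per_copy


for system, threads in [("T5220", 64), ("T5440", 256)]:
    copies, total_gb = rate_input_gb(threads)
    print(f"{system}: {copies} copies read about {total_gb:.2f} GB")
```

The T5440 therefore has to read roughly four times as much data as the 63-copy systems before every copy can start computing.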

A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right in the graph below:

Graph 3: ufs takes about 30 min, zfs about 15 min

In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO beginner here. This topic is revisited below.)

The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg the CPU, since it is reading compressed data.

The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID" drives were created, and made visible to the host as 12 separate units. Then, 10 of them were strung together as zfs RAID-Z using:

   # zpool create zf-raidz10 raidz \
     c3t40d0  c3t40d1  c3t40d2 c3t40d3  c3t40d4  \
     c3t40d5  c3t40d6  c3t40d7 c3t40d8  c3t40d9

   # zpool status       
      pool: zf-raidz10
     state: ONLINE
     scrub: none requested

           NAME         STATE     READ WRITE CKSUM
           zf-raidz10   ONLINE       0     0     0
             raidz1     ONLINE       0     0     0
               c3t40d0  ONLINE       0     0     0
               c3t40d1  ONLINE       0     0     0
               c3t40d2  ONLINE       0     0     0
               c3t40d3  ONLINE       0     0     0
               c3t40d4  ONLINE       0     0     0
               c3t40d5  ONLINE       0     0     0
               c3t40d6  ONLINE       0     0     0
               c3t40d7  ONLINE       0     0     0
               c3t40d8  ONLINE       0     0     0
               c3t40d9  ONLINE       0     0     0

Compression was added at a later time, but before the experiment shown above:

      $ zfs list -o compression zf-raidz10

Why Is the T5440 Improvement Not As Dramatic As the M5000?

The improvement from zfs is helpful to the T5440, but unlike the M5000, nearly 15 minutes of clock time is spent on IO. Let's look at some statistics from iostat:

$ iostat -xncz 30
 us sy wt id
  8  4  0 88
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0

A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular reference:
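The expert's point can be seen in the numbers: dividing kr/s by r/s gives the average amount of data each read returns. The sketch below does that for the ten RAID-Z member disks in the iostat report above; the parsing is deliberately simple and the function name summarize is made up for illustration.

```python
# Device rows copied from the iostat report above.
IOSTAT_ROWS = """\
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0
"""


def summarize(rows, prefix="c3t40"):
    """Sum r/s and kr/s over devices whose name starts with `prefix`."""
    total_reads = 0.0
    total_kb = 0.0
    for line in rows.splitlines():
        fields = line.split()
        # Columns: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
        if len(fields) != 11 or not fields[-1].startswith(prefix):
            continue
        total_reads += float(fields[0])
        total_kb += float(fields[2])
    return total_reads, total_kb, total_kb / total_reads


reads, kb, kb_per_read = summarize(IOSTAT_ROWS)
print(f"total reads/sec: {reads:.0f}")
print(f"total KB/sec:    {kb:.0f}")
print(f"average KB/read: {kb_per_read:.1f}")
```

Each read returns only about 4 KB, so even with the ten disks together servicing well over 4000 reads per second, the pool moves only about 18 MB/sec.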

A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system where single thread performance is much slower than the one in Graph #2. On the M5000, 'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be emphasized that this is only a secondary concern for the read statistics described in this article, although it can become more important for write workloads, since compression is harder than decompression. Doing 'gzip ref.mps' takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available, as of Solaris 10 s10s_u7, and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations. (This limitation may change in a future version of Solaris.)
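To get a feel for how much this could matter, here is a rough upper-level estimate. It rests on loud assumptions that the article does not confirm: that decompression inside ZFS costs about as much CPU as command-line gunzip, and that the work spreads perfectly over the 8 available threads. The names below are illustrative.

```python
# CPU seconds for ref.mps as quoted above: (M5000, T5440).
timings = {"gunzip": (2, 7), "gzip": (12, 51)}

# How much slower a single T5440 thread is than a single M5000 thread.
slowdown = {op: t5440 / m5000 for op, (m5000, t5440) in timings.items()}


def decompress_wall_seconds(copies, cpu_sec_per_copy, threads):
    """Wall-clock time if decompression parallelizes perfectly."""
    return copies * cpu_sec_per_copy / threads


t5440_sec = decompress_wall_seconds(255, 7, 8)

print(f"gunzip slowdown: {slowdown['gunzip']:.2f}x")
print(f"gzip slowdown:   {slowdown['gzip']:.2f}x")
print(f"255 copies on 8 threads: about {t5440_sec / 60:.1f} minutes")
```

Under those assumptions decompression accounts for somewhat under 4 of the roughly 15 minutes, which fits with calling it a secondary concern.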

Solution: Mirrors, No Gzip

The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time, which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower - the red line in the graph below:

graph4: uncompressed mirrors slower than gzip/raidz, until we have 24 drives

Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs for more than 120 minutes.

The file system for the leftmost line was created using:

      # zpool create dev8-m3-ngz \
      > mirror c2t40d0  c2t40d1     c3t40d0 \
      > mirror c2t40d2  c2t40d3     c3t40d1 \
      > mirror c2t40d4  c2t40d5     c3t40d2 \
      > mirror c2t40d6  c2t40d7     c3t40d3 \
      > mirror c2t40d8              c3t40d4  c3t40d5 \
      > mirror c2t40d9              c3t40d6  c3t40d7 \
      > mirror c2t40d10             c3t40d8  c3t40d9 \
      > mirror c2t40d11             c3t40d10 c3t40d11
      > mirror c2t40d11             c3t40d10 c3t40d11

The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3). There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.
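With 24 device names to type, it is easy to reuse a disk or to put a whole mirror on one controller. The sketch below re-checks the layout from the command above; it is only a sanity check written for this article, and check_layout is a hypothetical helper, not a ZFS tool.

```python
# The eight 3-way mirrors from the zpool create command above.
MIRRORS = [
    ("c2t40d0", "c2t40d1", "c3t40d0"),
    ("c2t40d2", "c2t40d3", "c3t40d1"),
    ("c2t40d4", "c2t40d5", "c3t40d2"),
    ("c2t40d6", "c2t40d7", "c3t40d3"),
    ("c2t40d8", "c3t40d4", "c3t40d5"),
    ("c2t40d9", "c3t40d6", "c3t40d7"),
    ("c2t40d10", "c3t40d8", "c3t40d9"),
    ("c2t40d11", "c3t40d10", "c3t40d11"),
]


def check_layout(mirrors):
    """Return (mirror count, disk count, no disk reused, all span both controllers)."""
    disks = [disk for mirror in mirrors for disk in mirror]
    no_reuse = len(disks) == len(set(disks))
    # The controller is the part of the device name before the first 't'.
    spans_both = all(
        {disk.split("t")[0] for disk in mirror} == {"c2", "c3"}
        for mirror in mirrors
    )
    return len(mirrors), len(disks), no_reuse, spans_both


print(check_layout(MIRRORS))
```

All 24 disks are used exactly once, and every mirror has at least one member on each controller.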

Were These Tests Fair?

The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use a RAID device which is now 5 years old, and compare it versus contemporary ZFS.

This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to 450.soplex, although with a very different system under test.

On the other hand, it should be emphasized that all the T5440 tests reported in this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the RAID-Z and mirroring work.

Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not improve.

Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert, and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide limited access to disk block size setting, and the tester did not at first realize that the disk block size depends on this other parameter .... located over here in the menus ...


For this particular controller, default block sizes are controlled indirectly by whether this setting is yes or no. Changing it to "No" makes the default block size larger (128 KB rather than 32 KB). Once this was discovered, various tests were repeated. The hw RAID-5 test was repeated with explicit selection of a larger size; however, it did not improve. On the other hand, the NRAID devices, controlled by zfs, did improve.

Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12 x drives in a RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.


As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.

Summary / Basic Lessons

Some basic lessons about ZFS emerge:

1) ZFS can be easily taught not to hog memory.

2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.

3) Setting up mirrored drives with dynamic striping is straightforward.

4) ZFS is not so scary, after all.

Notes on Methods

During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for 450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would not be present in a reportable run, which was accomplished as summarized below:

    runspec -a setup --rate 450.soplex
    cd 450.soplex/run/run*000
    specinvoke -nnr >
    convert 'sh dobmk' in to 'sh dobmk &'

The tests noted as Solaris 10 used:

      # head -1 /etc/release
                          Solaris 10 5/09 s10s_u7wos_08 SPARC

The tests noted as SNV used:

      # head -1 /etc/release
                       Solaris Express Community Edition snv_112 SPARC

The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72 (Fibre Channel).


My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom should be blamed for this beginner's ignorance. Karsten Guthridge was the first to point out the usefulness of ZFS gzip compression for 450.soplex.

Monday Feb 02, 2009

SPEC Benchmark Workshop 2009

Attached are the slides from my talk at the "SPEC Benchmark Workshop 2009", described at

The program drew about 100 people.  There were 9 papers accepted and published by Springer, out of a field of about twice as many entries.  The papers were written by 10 authors from industry and 16 from academia.

My criterion for success was: "How many people are sleeping during my early-Sunday-morning talk?"   During the talk, I managed to wake up all but 1 of the sleepers, so I guess it was successful. 

Attached are my slides: specrate-slides.pdf

The article was published by Springer in a book - preview here: Springerlink

Here is the Author's pre-submission copy; note that Springer has the copyright, but authors are allowed to self-archive a copy on their personal site: specrate-paper.pdf

