Thursday Jan 04, 2007

ZFS v VxFS - IOzone

IOzone is a I/O benchmarking utility that I have blogged about before. I also covered off the results of running Filebench on the two filesystems. Here, for the sake of completeness, are the results of some IOzone runs I did at the same time. The command line for IOzone used the following arguments and options:

iozone -R -a -z -b file.wks -g 4G -f testile


This test measures the performance of writing a new file. It is normal for the initial write performance to be lower than the performance of rewriting a file (see next test, below) due to metadata overhead.

Iozone Write Performance


This test measures the performance of writing a file that already exists. When a file is written that already exists the work required is less as the metadata already exists. It is normal for the rewrite performance to be higher than the performance of writing a new file.

Iozone Re-Write Performance


This test measures the performance of reading an existing file.

Iozone Read Performance


This test measures the performance of reading a file that was recently read. It is possible for the performance to be higher as the file system can maintain a cache of the data for files that were recently read. This cache can be used to satisfy reads and improves the throughput.

Iozone Re-read Performance

Random Read

This test measures the performance of reading a file with accesses being made to random locations within the file. The performance of a system under this type of activity can be impacted by several factors such as the size of operating system’s cache, number of disks, seek latencies, and others.

Iozone Random Read Performance

Random Write

This test measures the performance of writing a file with accesses being made to random locations within the file. Again the performance of a system under this type of activity can be impacted by the factors listed above for Random Read. Efficient random write is important to the operation of transaction processing systems.

Iozone Random Write Performance

Backward Read

This test measures the performance of reading a file backwards. This may seem like a strange way to read a file but in fact there are applications that do this. MSC Nastran is an example of an HPC application that reads its files backwards. Video editing is another example. Although many file systems have special features that enable them to read a file forward more rapidly, there are very few that detect and enhance the performance of reading a file backwards.

Iozone Backward Read Performance

Record Rewrite

This test measures the performance of writing and re-writing a particular spot within a file.

Iozone Record Rewrite Performance

Strided Read

This test measures the performance of reading a file with a strided access behavior. An example would be: “Read at offset zero for a length of 4 KB, then seek 200 KB, and then read for a length of 4 KB, then seek 200 KB and so on.” Here the pattern is to read 4 KB and then seek 200 KB and repeat the pattern. This again is a typical application behavior for applications that have data structures contained within a file and is accessing a particular region of the data structure. Most file systems do not detect this behavior or implement any techniques to enhance the performance under this type of access behavior.

Iozone Strided Read Performance


This test measures the performance of writing a file using the library function fwrite(). This is a library routine that performs buffered write operations. The buffer is within the user’s address space. If an application were to write in very small size transfers then the buffered & blocked I/O functionality of fwrite() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made. This test is writing a new file so again the overhead of the metadata is included in the measurement.

Iozone fwrite() Performance


This test performs repetitive re-writes to portions of an existing file using the fwrite() interface.

Iozone Re-fwrite() Performance


This test measures the performance of reading a file using the library function fread() - a library routine that performs buffered & blocked read operations. The buffer is within the user’s address space, as for fwrite() operations. If an application were to read in very small size transfers then the buffered & blocked I/O functionality of fread() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made.

Iozone fread Performance


This test is the same as fread() above except that in this test the file that is being read was read in the recent past. This can result in higher performance as the file system is likely to have the file data in cache.

Iozone Re-fread() Performance

End note

In the last couple of blogs, I've given the results of testing a number of typical file system workloads in an open and reproduceable manner using the publicly available Filebench and IOzone tools and shown that Solaris 10 ZFS can significantly outperform a combination of Veritas Volume Manager and Filesystem in many cases. However, the following points (the "usual caveats") should also be taken into consideration:

  • These tests were performed on a Sun Fire server with powerful processors, a large memory configuration, and a very wide interface to an array of high-speed disks to ensure that the fewest possible factors would inhibit file system performance. It is possible that the differences between file systems would be less pronounced on a less powerful system simply because all file systems would run into hardware bottlenecks in moving data to the disks.
  • A file system performs only as well as the hardware and operating system infrastructure surrounding it, such as the virtual memory subsystem, kernel threading implementation, and device drivers. As such, Sun’s overall enhancements in the Solaris OS, combined with high-powered Sun Fire servers, will provide customers with high levels of performance for applications. But proof-of-concept (POC) implementations are invaluable in supporting purchasing decisions for specific configurations and applications.
  • Benchmarks provide general guidance to performance. The conclusion that can be drawn from these tests is that in application areas such as databases, e-mail, web server and software development, Solaris 10 ZFS performs best in an “apples-to-apples” comparisons with the Veritas product suite. Again, POCs and real-world customer testing help evaluate performance for specific applications and services.

ZFS v VxFS - Ease

I've had people asking me to blog more of my stuff on ZFS, especially in relation to the Veritas suite (Microsoft NTFS and Linux afficionados, make yourselves known to my highly efficient Customer Services team using the comments form below).

I did a lot of poking into ZFS performance over the summer. The Veritas Filebench results are already posted here but apart from the numbers, what leapt out straight away was the simplicity of use of ZFS compared to the competition.

I'm not talking about GUIs because I grew up in environments (banks, IT vendors) where they simply weren't used either because the required precision in the configuration demanded command line and scripting or the work was remote (for which read "from home in middle of night") and the comms just didn't move the bits fast enough to support GUIs.

For a start the conceptual framework is a lot simpler. The following table lists the building blocks of both Veritas VxVM/VxFS and ZFS.



The pool: all the disks in the system.A physical disk is the basic storage device (media) where the data is ultimately stored.
File systems: as many as are required.When you place a physical disk under VxVM control, a VM disk is assigned to the physical disk. A VM disk is under VxVM control and is usually in a disk group.

A VM disk can be divided into one or more subdisks. Each subdisk (actually a set of contiguous disk blocks) represents a specific portion of a VM disk.

VxVM uses subdisks to build virtual objects called plexes. A plex consists of one or more subdisks located on one or more physical disks.

A disk group is a collection of disks that share a common configuration, and which are managed by VxVM.

A volume is a virtual disk device that appears like a physical disk device and consists of one or more plexes contained in a disk group, each holding a copy of the selected data in the volume.

A VxFS file system is constructed on a volume so that files can be stored.

So how does this translate into practice? The following table lists the activities and times taken to create useable storage using either Veritas or ZFS:



# zpool create -f tank [list of disks]
# zfs create tank/fs
# /usr/lib/vxvm/bin/vxdisksetup -i c2t16d0
# vxdg init dom-dg c2t16d0
# for i in [list of disks]
/usr/lib/vxvm/bin/vxdisksetup -i $i

# for i in [list of disks]
vxdg -g dom-dg adddisk $i

# vxassist -g dom-dg -p maxsize layout=stripe
6594385920 [ get size of volume, then feed back in ]
Time Taken: 30 minutes # vxassist -g dom-dg make dom-vol 6594385920 layout=stripe
# mkfs -F vxfs /dev/vx/rdsk/dom-dg/dom-vol
version 6 layout
6594385920 sectors, 412149120 blocks of size 8192, log size 32768 blocks
largefiles supported
# mount -F vxfs /dev/vx/dsk/dom-dg/dom-vol /mnt
Time Taken: 17.5 secondsTime Taken: 30 minutes

It's far simpler, is it not? The timings are for 48 72 Gb disks in 3500 jbods, by the way. If you want a bit more guidance on using ZFS, you should:

  • Read the manual on
  • Look at the wiki which covers hints, tips and preferred practice
  • Subscribe to zfs-discuss. This patches you through to (amongst others) people who wrote ZFS, so its a good source of authoratitive, if sometimes a little terse, guidance.

Friday Dec 01, 2006

Filebench: A ZFS v VxFS Shootout


Here is an example of Filebench in action to give you an idea of its capabilities "out of the box" - a run through a couple of the test suites provided with the tool on the popular filesystems ZFS and VxFS/VxVM; I've given sufficient detail so that you can easily reproduce the tests on your own hardware. I apologise for the graphs, which have struggled to survive the Openoffice .odt -> .html conversion. I hadn't the energy to recreate all 24 of them from the original data

They summarize the differences between ZFS and VxFS/VM in a number of tests which are covered in greater detail further on . It can be seen that in most cases ZFS performed better at its initial release (in Solaris 10 06/06) than Veritas 4.1; in some cases it does not perform as well; but in all cases it performs differently. The aim of such tests is to give a feel for the differences between products/technologies so intelligent decisions can be made as to which file system is more appropriate for a given purpose.

It remains true however that access to the data will be more effective in helping decision makers reach their performance goals if these can be stated clearly in numerical terms. Quite often this is not the case.

Figure 1: Filebench: Testing Summary for Veritas Foundation Suite (ZFS = 0)

Hardware and Operating System

Solaris 10 Update 2 06/06 (inc. ZFS) running on a Sun Fire E6900 with 24 x 1.5 Ghz Ultrasparc IV+, 98 Gb RAM with storage comprising 4 x StorEdge 3500 JBOD (48 x 72 Gb disks) , Fiber attach (4 x 2 Gb PCI-X 133 MHz)

The software used was VERITAS Volume Manager 4.1, VERITAS File System 4.1, FileBench 1.64.5. For each Filebench test a brief description is given followed by a table which shows how it was configured for that test run. This enables you to reproduce the test on your hardware. Of course if you want greater detail on the tests, you have to download Filebench (see blogs passim).

Create & Delete

Creation and deletion of files is a metadata intensive activity which is key to many applications, especially in web-based commerce and software development.

createfilescreatefilesnfiles 100000, dirwidth 20, filesize 16k, nthreads 16
deletefilesdeletefilesnfiles 100000, meandirwidth 20, filesize 16k, nthreads 16

Figure 2: Create/Delete - Operations per Second

Figure 3: Create/Delete - CPU uSec per Operation

Figure 4: Create/Delete - Latency (ms)


This test creates two large directory tree structures and then measures the rate at which files can be copied from one tree to the other.

copyfilescopyfilesnfiles 100000, dirwidth 20, filesize 16k, nthreads 16

Figure 5: Copy Files - Operations per Second

Figure 6: Copy Files - CPU uSec per Operation

Figure 7: Copy Files - Latency (ms)

File Creation

This test creates a directory tree and fills it with a population of files of specified sizes. File sizes are chosen according to a gamma distribution of 1.5, with a mean size of 16k. The different workloads are designed to test different types of I/O - see generally the Solaris manual pages for open(2), sync(2) and fsync(3C).

filemicro_createcreateandallocnfiles 100000, nthreads 1, iosize 1m, count 64

createallocsyncnthreads 1, iosize 1m, count 1k, sync 1
filemicro_writefsynccreateallocfsyncnthreads 1
filemicro_createrandcreateallocappendnthreads 1

Figure 8: File Creation - Operations per Second

Figure 9: File Creation - CPU uSec per Operation

Figure 10: File Creation - Latency (ms)

Random Reads

This test performs single-threaded random reads of 2 KB size from a file of 5 Gb.

filemicro_rreadrandread2kcached 0, iosize 2k

randread2kcachedcached 1, iosize 2k

Figure 11: Random Read - Operations per Second

Figure 12: Random Read - CPU uSec per Operation

Figure 13: Random Read - Latency (ms)

Random Writes

This test consists of multi-threaded writes to a single 5 Gb file.

filemicro_rwriterandwrite2ksynccached 1, iosize 2k

randwrite2ksync4threadiosize 2k, nthreads 4, sync 1

Figure 14: Random Writes -
Operations per Second

Figure 15: Random Writes -
CPU uSec per Operation

Figure 16: Random Writes - Latency (ms)

Sequential Read

These tests perform a single threaded read from a 5 Gb file.

filemicro_seqreadseqread32kiosize 32k, nthreads 1, cached 0, filesize 5g

seqread32kcachediosize 32k, nthreads 1, cached 1, filesize 5g

Figure 17: Sequential Read -
Operations per Second

Figure 18: Sequential Read -
CPU uSec per Operation

Figure 19: Sequential Read - Latency (ms)

Sequential Write

These tests perform single threaded writes to a 5 Gb file.

filemicro_seqwriteseqwrite32kiosize 32k, count 32k, nthreads 1, cached 0, sync 0

seqwrite32kdsynciosize 32k, count 32k, nthreads 1, cached 0, sync 1
filemicro_seqwriterandseqwriterand8kiosize 8k, count 128k, nthreads 1, cached 0, sync 0

Figure 20: Sequential Write -
Operations per Second

Figure 21: Sequential Write -
CPU uSec per Operation

Figure 22: Sequential Write - Latency (ms)

Application Simulations: Fileserver, Varmail, Web Proxy & Server

There are a number of scripts supplied with Filebench to emulate applications:


A file system workload, similar to SPECsfs. This workload performs a sequence of creates, deletes, appends, reads, writes and attribute operations on the file system. A configurable hierarchical directory structure is used for the file set.


A /var/mail NFS mail server emulation, following the workload of Postmark, but multi-threaded. The workload consists of a multi-threaded set of open/read/close, open/append/close and deletes in a single directory.

Web Proxy:

A mix of create/write/close, open/read/close, delete of multiple files in a directory tree, plus a file append (to simulate the proxy log). 100 threads are used. 16k is appended to the log for every 10 read/writes.

Web Server:

A mix of open/read/close of multiple files in a directory tree, plus a file append (to simulate the web log). 100 threads are used. 16k is appended to the weblog for every 10 reads.

fileserverfileservernfiles 100000, meandirwidth 20, filesize 2k, nthreads 100, meaniosize 16k
varmailvarmailnfiles 100000, meandirwidth 1000000, filesize 1k, nthreads 16, meaniosize 16k
webproxywebproxynfiles 100000, meandirwidth 1000000, filesize 1k, nthreads 100, meaniosize 16k
webserverwebservernfiles 100000, meandirwidth 20, filesize 1k, nthreads 100

Figure 23: Application Simulations - Operations per Second

Figure 24: Application Simulations - CPU uSec per Operation

Figure 25: Application Simulations - Latency (ms)

OLTP Database Simulation

This database emulation performs transactions on a file system using the I/O model from Oracle 9i. This workload tests for the performance of small random reads & writes, and is sensitive to the latency of moderate (128Kb+) synchronous writes as occur in the database log file. It launches 200 reader processes, 10 processes for asynchronous writing, and a log writer. The emulation uses intimate shared memory (ISM) in the same way as Oracle which is critical to I/O efficiency (as_lock optimizations).

oltplarge_db_oltp_2k_cachedcached 1, directio 0, iosize 2k,
nshadows 200, ndbwriters 10, usermode 20000,
filesize 5g, memperthread 1m, workingset 0

large_db_oltp_2k_uncachedAs above except cached 0, directio 1
large_db_oltp_8k_cachedAs for 2k except iosoze 8k

large_db_oltp_8k_uncachedAs for 2k except iosoze 8k

Figure 26: OLTP 2/8 Kb Blocksize - Operations per Second

Figure 27: OLTP 2/8 Kb Blocksize - CPU uSec per Operation

Figure 28: OLTP 2/8 Kb Blocksize - Latency (ms)

The summary figures above are the tip of a vast numerical iceberg of statistics provided by Filebench and wrappers around it which probe every system resource counter you can think of. It is a truism though that in using data like this, there is an enthusiasm to reduce it to single figures and simple graphs, leaving the engineers working on the performance bugs to the excruciating detail.

Remember also that these are the pre-packackaged scripts. The possibilities for custom benchmark workloads are as infinite as your imagination. Its also worth saying that technologies move on. The snapshot above will start to fade as improvements are made.

Thursday Nov 30, 2006

twm(1) - The Directors Cut

As a postscript to my blogs on squeezing Solaris 10 onto an antique PC (here and here), Dave Levy commented that I should provide a screenshot of Tom's Window Manager, as a means to point out the irony that doing a screen grab of TWM is a recursive problem - the tools ain't there to do it. Oh well. I shan't restate the case for it because the Wikipedia entry sums it up - "Although it is now generally regarded as the window manager of last resort, a small but dedicated minority of users favor twm for its simplicity, customizability, and light weight". You can leap from the Wiki entry to an interesting interview with Tom himself.

As you can see, twm(1) has the rich functionality that you need but without all that troublesome clutter.

Solaris Performance and Tools and Solaris Internals (2nd Ed) make a firm foundation for any tuning exercise such as this. Rich Teer's Solaris Systems Programming also provides useful supporting documentation.

The quid pro quo I negotiated with Dave was that he in turn would blog an analysis of the case for commercial enterprises such as Sun open-sourcing their software set in the context of classical economic theory. I know I don't understand this properly because listening to Dave explain it is like holding up the TV aerial - it seems clear while he holds it up and explains but the moment he finishes and puts the aerial down, my screen goes fuzzy again.

Monday Nov 20, 2006

Fat Software Part 2

I found a little more fat to trim from the midget system I described in my last post. Well actually I found quite a bit but it took me down a cul-de-sac of unbootability so here, after more experimentation are the modifications that are "safe" by which I mean, of course, totally unsupported and unsupportable by Sun but useful if you have to run Solaris on an old, memory-constrained, probably beige, system.

Such a system will never run such new-fangled protocols as Infiniband, FibreChannel and iSCSI, I don't need NFS or any of the enterprise management software. I really don't actually need SCSI or IPv6 - but you can't unbolt those.

So, for the record and so you can cherrypick your own modifications, here are the scores - memory is in 4 Kb pages: I have 63374 of these to my name:





Page Cache


Baseline (dtconfig -e) 10398 22143 3619 4587 22627
Rename webm and webconsole in rc2.d/ 9573 9829 2572 2582 38818
Rename snmpdx,dmi,initsma in rc3.d 9505 8888 2253 2626 40102
Remove services (previous blog) 9705 7587 1904 2709 41469
Exclude nfs, lofs from /etc/system,
rename volmgt in /etc/rc3.d
7975 7045 1738 1915 44647
Exclude iSCSI 7900 7005 1776 1896 44797
Remove Infiniband 7819 6862 1720 1880 45093
Remove Fibrechannel 7319 6969 1738 1871 45477

I'm calling it a day at this point. A more diligent man would put all this in a Jumpstart script but time is money and that would be a sign of obsession. The out of the box installation had 7% of its memory free and I've managed to take that to 71%. The kernel (which of course is not pageable) is down to 28 Mb. This old HP Brio has a new lease of life. But brio is the quality of being active or spirited or alive and vigorous - so I'm off to get a life.

Wednesday Nov 15, 2006

French Customs and Fat Software

Never ever ever tell French Customs that you have just driven from Amsterdam - you will lose an hour of your life. This happened to me at Calais on my way home to UK. I always get stopped by Customs when I'm on my own. Its because I drive a big shiny German car but wear scruffy clothes and don't shave often enough for their liking. If I cleaned up and put a suit on for these journeys, I'd get left alone but I always forget. If I was a customs officer, I'd stop me.

Anyway, due to my arrival from the intoxicant capital of Europe I got the full treatment this time. They even X-rayed my spare tyre (no, the one in the car, not 'round my waist) and it is in fine health. They were very suspicious when I told them I don't smoke - not just dope, but anything. They all smoked - continuously. Then when they found that the 6 bottles of wine I had (only 6, Monsieur?) were from Portugal and not their beloved France, that was it. Out came the back seat of the car, pockets were emptied. carpets were lifted. I just kept smiling - They have guns after all, and latex surgical gloves - far more scary. Anyway, if I'm ever stopped again; I've just come from Lourdes.

I was coming home because remote working isn't working any more: my laptop is broken. Since my daughter peeled all the keys from the keyboard, it has never been quite the same and now one of them is permanently detached - queue renditions of "U picked a fine time to leave me, Lucille", "I'll never find another U", etc. Try working without this vowel - its impossible. So off to the laptop garage it goes.

This leaves me with a 8 year old PC in my office which was given to me as an alternative to putting it in landfill. Bye bye $3500 Ferarri laptop, hello ageing HP Brio. It hasn't much memory (256 Mb), or a DVD drive, or sound and the screen has got pen marks all over it because it came out of a goods-despatch office in a warehouse. But hey, it was free.

I load up Solaris 10 6/06 (Developer install profile), log in to Gnome and That is to say the visual element of the great Solaris experience stops - the meat grinder noise continues under the desk. Since then, its taken me a fair amount of performance tuning to get useability out of it - a sort of "back to basics" exercise.

My first suspicion was that the delicious new GUI that Solaris sports was the culprit for all the thrashing below. I used to say that using Unix was fine so long as one remembered that the first ten years would be the worst part. Now? You don't have to know anything about anything to be a performance analyst. You just read Solaris Performance & Tools and you're away. Observability? Look at this:

# echo "::memstat" | mdb -k 
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      19522                76   31%
Anon                        20766                81   33%
Exec and libs                7685                30   12%
Page cache                   2095                 8    3%
Free (cachelist)            10547                41   17%
Free (freelist)              2759                10    4%

Total                       63374               247

...then turn to page 646 of the book, distill the content onto a Powerpoint slide and present your invoice Voila! (as the Customs men were dying to shout, but couldn't).

So is the windowing software fat? Well actually no. It is fatter than it was as the graph below shows (the Y axis indicates where my precious 247 megabytes have gone - for all the data points we have one terminal window open on the desktop) but not really podgy like some operating systems window mangers. The real problem though was that unknown to me (then), the web management software starts a honking great JVM - thats where all the memory goes.

So my choice was to go out and buy some hardware or execute a slash and burn strategy on memory consumption. Well the PC was free, the OS was free and the authors gave me the book for free (all as in "free beer"). Enough said. I did the numbers at each stage as you can see in the graph below:

Stage 1. This is my baseline - With Gnome and 1 terminal window open.

Stage 2. With CDE, 1 terminal window open.

Stage 3. Disable the windowing system altogether (dtconfig -d).

Stage 4. In /etc/rc2.d get rid of (i.e rename) S90wbem and S90webconsole: As I mentioned they start a very large (for the RAM I have to play with) JVM. This was the biggest single win.

Stage 5. In /etc/rc3.d, rid yourself of S76snmpdx, S77dmi and S82initsma

Stage 6. Disable "uneeded" (opinions differ here) service deamons using svcadm disable: sendmail, nfs/cbd, nfs/mapid, nfs/server, nfs/status, nfs/nlockmgr, nfs/rquota, nfs/client, network/ssh, system/power, system/picl, filesystem/autofs

Stage 7. Fire up twm(1) using xinit(1). For those who cannot remember Toms Window Manager, you can enjoy its rich functionality by choosing a Failsafe session from the options on your Gnome/CDE login screen and then logging in. In the (only) shell window you are presented type/usr/openwin/bin/twm and there you are. Now run away. Fast.

There is probably more fat I could trim (please email me your favourite culprits: dominic dot kay at gmail dot com) but the effort/mempages ratio has peaked and the day-job beckons.

And so here I am. I have an antique computer running a windowing system no-one can remember how to use but my ps(1) listing doesn't scroll off the screen anymore and it boots up in bounded time and goes like the wind. It runs Openoffice, Firefox, Sun Studio 11 (with special "compile during coffee break" option enabled). My X Windows Users and Administrators Guides (from O'Reilly circa 1992, from my attic yesterday) have new value and I'm high on life - thanks to losing "U" in Amsterdam.

I should make it clear by way of postscript that I was not in Amsterdam because it is the preferred haunt of those who like their tobacco to have a fuller flavour but because my brother has a fabulously comfortable house there with a spare room, a DSL connection and a fridge full of food. As far as intoxication went - I took everything from my cellar dated 1995 to enjoy with him. Most of the claret was starting to fade. I take this as a sign that I should giddy-up and attack the 1996's at Christmas. Mais ne pas les drogues et stupifients. No, I haven't got any. Really.

Friday Jan 27, 2006

Filesystem Benchmarks: iozone

In my last post I discussed the vxbench I/O load generator which may (or may not) be available from Symantec for the use of all. Recent work with Windows 2003 Server has given me the excuse to use Iozone which has many things in common with vxbench. In fact I feel a taxonomic table coming on:




Open sourceNo. Copyrighted but freely available.Yes: ANSI C
Async I/OYes: aread, awrite, arand_mixed workloadsYes -H, -k options
Memory mapped I/OYes. mmap_read, mmap_write workloadsYes. -B option
Multi-process workloadsYes. -P/-p optionsYes. Default
Multi-threaded workloadsYes -t optionYes. -T option
Single stream measurementYes.Yes.
Spreadsheet outputNo.Yes.
Large file compatableNo.Yes.
Random reads/writesYes. rand_[read|write|mixed]Yes. -i 2 option
Strided I/OYes stride=n suboptYes. -j option
Simulate compute delayYes. sleep workload, sleeptime=n secondsYes. -J milliseconds
Caching optionsO_SYNC, O_DSYNC, direct I/O, unbuffered I/OO_SYNC,
OS'sSolaris, AIX, HP, Linux. Not MS WinAs vxbench + MS Win. POSIX

There are challenges in using these tools; the first is that these are not benchmarks; they are load generators with no load (benchmark) defined. And there are two approaches to defining a load (a.) how many operations of a specfic type can be achieved in a set time. (b.) How long does it take to complete a specific number of operations. The difference, for a lot of people, is a matter of taste. The consequence is that each new analyst who approaches these tools starts to write a new cookbook.

Another challenge is that the principle dimensions of performance in a benchmark are

  1. Latency - how long until the first byte is returned to the user or committed to disk and the operation returned from.
  2. Throughput - how much data under different access patterns can be sent to or retrieved from permenant storage.
  3. Efficiency - How much of the system's resources were consumed in moving data to and from storage rather than doing computation upon it. (Resources can be memory, CPU cycles, hardware and software synchronisation mechanisms. In this Millenium we also bring in the consumption of electricity and the generation of heat.

Load generators including the ones we are discussing are pretty good on the first two counts but no good at the third. That distinction marks the difference between a load generator and benchmarking framework such as Filebench, SLAMD or the tools such as Loadrunner from Mercury. It is no minor matter to coordinate the gathering of system metrics with the execution of the workload. Its even more difficult to achieve this accross distributed systems sharing access to a filesystem such as NFS or Shared QFS. In this case a common and precise idea of the current time needs to be maintained accross the systems.

Tools such as Iozone and Vxbench need to be embedded in scripted frameworks to do performance metrics collection - In several Unixes it simply means running any and every tool whose name ends in "stat" in the background. In Microsofts world there are the CIM probes accessible through VBscript or Perl and in Solaris 10, dtrace provides access to arbitary counters.

Putting Iozone to Work

Using Iozone we can generate output similar to the graphs below.

I created a 32 Gb volume accross 12 (Seagate ST13640 disks on 2 JBODS connected via 2 Adaptec Ultra320 SCSI controllers to a Dual 2 Ghz AMD Opteron with 2 Gb RAM). For the care and feeding of this sandbox, I am grateful to Paul Humphreys and his band of lab engineers.

I then ran iozone and collected the results. As it dumps straight into spreadsheet format you can quickly do some interesting graphical analysis such as this example at the tools' website. However I was after something more mundane.

The first graph below is from OpenOffice. I don't like it much because it follows the data so you end with a powers-of-two x-axis which hides important detail. Also all Openoffice graphs tend to look the same without extensive fiddling.

The graph below it is done with R and although it is plainer, I think it gives a clearer picture.

Here is the data and R code - not a lot to it really. I continue to urge you to use this tool as I have in the past. James Holtman has made a compelling case for the use of R in performance analysis in his paper Visualisation of Performance Data (CMG2005) and The Use of "R" for System Performance Analysis (CMG2004). Sadly CMG do not make their papers available to the wider community.

size  FW      NW      FR      NR
4     426.78  327.59  627.4   560.92
8     544.29  467.6   733.93  672.68
16    594.61  362.57  878.66  725.4
32    628.45  587.71  883.71  754.75
64    662.49  606.35  886.44  748.12
128   664.56  619.54  846.31  815.49
256   700.33  666.51  933.96  769.15
512   12.55   13.6    664.16  660.76
1024  8.8     10.77   600.13  592.36
g_data <- read.table("C:\\\\home\\\\dominika\\\\FATvNTFS.csv", header=T)

plot(size, FR, type="l",
   main="NTFS and FAT32 I/O Performance",
   sub="Sequential Reads/Writes to 1 Gb File in 32 Gb Filesystem",
   xlab="I/O size (Kb)",
   ylab="I/O rate (Mb/s)",
   lty=5,col=5, lwd=2  )

lines(size,FW,lty=2,col=2, lwd=2)
lines(size,NW,lty=3,col=3, lwd=2)
lines(size,NR,lty=4,col=4, lwd=2)

text(150,700,"FAT Wr"); text(130,600,"NTFS Wr")
text(350,750,"NTFS Rd"); text(400,800,"FAT Rd")

The graph appears to show us a good deal but its what it doesn't show that has to be remembered - the qualitative side to all this.

The expectation of several people I showed it to had been that NTFS being the more modern filesystem should have better performance. Not so but for good reasons. Yes in the simple case FAT32 is faster than NTFS. Out and out performance is not the point of NTFS. It has many value-add features not found in FAT such as file and directory permissions, encryption, compression, quotas, content-addressability (indexing) and so forth. These come at a cost as do other features in NTFS that the OS relies on to provide such facilities as shadow copy and replication.

Longer code path - longer to wait for those I/Os to return!

Thursday Jul 21, 2005

Visualising Performance

Visualising Performance

There are several things that interest me. Filesystem and datapath software design is one. Computer performance is another; particularly datapath performance of course but also the whole stack. Open Source software for helping in improving performance; load generators, probes and monitors, mathematical and graphical software for doing such things as statistical manipulation, implementing queuing theory and simulation; that sort of thing. I'm not alone here. Richard Cockroft, author of perhaps the primary source on Solaris performance has blogged on this topic.

What do I mean by visualising performance? Well, look at the following table, extracted from the Lustre Wiki - data gleaned from a netperf benchmark of 10 gigabit ethernet interfaces, increasing the payload size and the size of the socket buffer:


Socket Buffer Size

Send Size

















































































This is a common enough scenario. There is one dependent variable; the throughput of the connection. There are two independent variables - the size of the socket buffer and the size of the request. I had to look at that table for quite a while before I could see the result - the relationship. This is very common in benchmarking. Often, only two causal factors would be considered to be on the light side; the mount parameters for a filesystem can run to a dozen or more.

OK, so this example is not one that is going to set the world alight but its in the public domain, which helps. I have to get drunk with people who, in terms of scientific visualisation, have bigger fish to fry. But these days we (Sun) have bigger fish on the chopping board - especially petabyte storage and grids; both of the compute and storage varieties. You canot build these things in the lab on a whim; you have to model and modelling means visualisation.

I found this graph more intuitive:


g_data <- read.table
   (fileName <- choose.files("\*.csv"), header=T)

print(wireframe(g_data$mbs ~ g_data$soc_buf 
   \* g_data$send_kb,
	zlab="Mb/s" ,
	ylab="Send size (Kb)" ,
	xlab="Socket buffer size (Kb)" ,
	drape = TRUE, 
	colorkey = TRUE
	) )

The code to the above is for the The R Package, a free software environment for statistical computing and graphics, more of which below. I think the key message is "This is not a lot of code" (to 'fess up, I did have to deprocess the pretty printed table back to CSV). So this more or less tells us that one of the variables has little effect. But we can do better than this:

g_data <- read.table(fileName <- choose.files("\*.csv"), header=T)
print(splom( ~ g_data)

This gives us a scatterplot matrix. In two lines of code we can compare the relationship between every variable in the test and the relationships leap from the page. In our case there are only three dimensions but trellis graphics (in S-Plus, the commercial version) or lattice graphics (in R) allow us several graphical methods to visually explore our data.

What does it tell us? That after a certain point, increasing the size of the buffer provides no further boost in throughput. This is important as kernel memory is a finite resource.

Then its just a matter of drilling down for the "management summary" (But 'fessing up again, I am daintily sidestepping the thorny topic of non-linear regression analysis. Another day.):

xyplot(mbs ~ soc_buf_kb , 
	aspect = "xy", 
	ylab = "Mb/s" , 
	xlab = "Socket buffer size (Kb)")

So then. The my elevator pitch for R.

  • Its free (as in speech, not beer, yadda yadda). There is good community around it.
  • It has vector maths and matrices built in so no more loops, nested loops, nested nested...[repeat 'till fade].
  • All the regression, correlation, smoothing, modelling mathmatical grind and all the presentation graphics have already been attended to.
  • It interfaces to Java (and C, and [your language shared library of choice here]).
  • It is object orientated which is handy for someone who wants to represent e.g a storage array or compute node both as a piece of graphics (icon, connectors, etc) and as a chunk of maths.
  • Its home from home for those that like a command line environment. Intractable (write-only) code to rival Perl can be written if one leans to the beard-stroking, sandal wearing edge of the technology community.
  • It incorporates the TCL/Tk libraries so you can write fully formed standalone GUI applications in it.

When all is said and done, its really good for performance & capacity planning "exploration"; later on I'll measure an elephant for you in pretty quick time in R.

So endeth my first blog; respect and gratitude to David Levy for requisite motivational arse kicking and Simon Dachtler for finding time to produce my banner graphic while still keeping the Far-East manufacturing economy ticking over.




« July 2016