Thursday Oct 11, 2012

Oracle Solaris 11 ZFS Lab for Openworld 2012

Preface

This is the content from the Oracle Openworld 2012 ZFS lab. It was well attended - the feedback was that it was a little short - that's probably because in writing it I became very time-conscious after the ASM/ACFS on Solaris extravaganza I ran last year, which was almost too long for mortal man to finish in the 1 hour session. Enjoy.

Introduction

This set of exercises is designed to briefly demonstrate new features in the Solaris 11 ZFS file system: Deduplication, Encryption and Shadow Migration. Also included is the creation of zpools and ZFS file systems - the basic building blocks of the technology - and Compression, which is the complement of Deduplication. The exercises are just introductions - you are referred to the ZFS Administration Manual for further information. From Solaris 11 onward the online manual pages consist of zpool(1M) and zfs(1M), with further feature-specific information in zfs_allow(1M), zfs_encrypt(1M) and zfs_share(1M). The lab is easily carried out in VirtualBox running Solaris 11 with 6 virtual 3 Gb disks to play with.

Exercise Z.1: ZFS Pools

Task: You have several disks to use for your new file system. Create a new zpool and a file system within it.

Lab: You will check the status of existing zpools, create your own pool and expand it.

Your Solaris 11 installation already has a root ZFS pool. It contains the root file system. Check this:

root@solaris:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  15.9G  6.62G  9.25G  41%  1.00x  ONLINE  -

root@solaris:~# zpool status 
pool: rpool
state: ONLINE
scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
rpool       ONLINE       0     0     0
  c3t0d0s0  ONLINE       0     0     0

errors: No known data errors

Note the disk device the root pool is on - c3t0d0s0

Now you will create your own ZFS pool. First you will check what disks are available:

root@solaris:~# echo | format 
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c3t0d0 <ATA-VBOX HARDDISK-1.0 cyl 2085 alt 2 hd 255 sec 63>
/pci@0,0/pci8086,2829@d/disk@0,0
1. c3t2d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@2,0
2. c3t3d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@3,0
3. c3t4d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@4,0
4. c3t5d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@5,0
5. c3t6d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@6,0
6. c3t7d0 <ATA-VBOX HARDDISK-1.0 cyl 1534 alt 2 hd 128 sec 32>
/pci@0,0/pci8086,2829@d/disk@7,0
Specify disk (enter its number): Specify disk (enter its number): 

The root disk is numbered 0. The others are free for use. Try creating a simple pool and observe the warning message:

root@solaris:~# zpool create mypool c3t2d0 c3t3d0 
'mypool' successfully created, but with no redundancy; failure of one
device will cause loss of the pool

So destroy that pool and create a mirrored pool instead:

root@solaris:~# zpool destroy mypool  
root@solaris:~# zpool create mypool mirror c3t2d0 c3t3d0 
root@solaris:~# zpool status mypool 
pool: mypool
state: ONLINE
scan: none requested
config:

NAME        STATE     READ WRITE CKSUM
mypool      ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    c3t2d0  ONLINE       0     0     0
    c3t3d0  ONLINE       0     0     0

errors: No known data errors

Exercise Z.2: ZFS File Systems

Task: You have to create file systems for later exercises.

You can see that when a pool is created, a file system of the same name is created:

root@solaris:~# zfs list 
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool                  86.5K  2.94G    31K  /mypool

Create your filesystems and mountpoints as follows:

root@solaris:~# zfs create -o mountpoint=/data1 mypool/mydata1 

The -o option sets the mount point and automatically creates the necessary directory.

root@solaris:~# zfs list mypool/mydata1 
NAME            USED  AVAIL  REFER  MOUNTPOINT
mypool/mydata1   31K  2.94G    31K  /data1

Exercise Z.3: ZFS Compression

Task: Try out the different forms of compression available in ZFS.

Lab: Create a second file system with compression, fill both file systems with the same data, and observe the results.

You can see from the zfs(1) manual page that there are several types of compression available to you, set with the property=value syntax:

compression=on | off | lzjb | gzip | gzip-N | zle

 Controls  the  compression  algorithm  used   for   this
 dataset. The lzjb compression algorithm is optimized for
 performance while  providing  decent  data  compression.
 Setting  compression  to  on  uses  the lzjb compression
 algorithm. The gzip compression algorithm uses the  same
 compression  as the gzip(1) command. You can specify the
 gzip level by using the  value  gzip-N  where  N  is  an
 integer  from 1 (fastest) to 9 (best compression ratio).
 Currently, gzip is equivalent to gzip-6 (which  is  also
 the default for gzip(1)).

Create a second filesystem with compression turned on. Note how you set and get your values separately:

root@solaris:~# zfs create -o mountpoint=/data2 mypool/mydata2 
root@solaris:~# zfs set compression=gzip-9 mypool/mydata2 
root@solaris:~# zfs get compression mypool/mydata1 
NAME            PROPERTY     VALUE     SOURCE
mypool/mydata1  compression  off       default

root@solaris:~# zfs get compression mypool/mydata2 
NAME            PROPERTY     VALUE     SOURCE
mypool/mydata2  compression  gzip-9    local

Now you can copy the contents of /usr/lib into both your normal and compressing filesystem and observe the results. Don't forget the dot or period (".") in the find(1) command below:

root@solaris:~# cd /usr/lib
root@solaris:/usr/lib# find . -print | cpio -pdv /data1 
root@solaris:/usr/lib# find . -print | cpio -pdv /data2 

The copy into the compressing file system takes longer, as it has to perform the compression, but the results show the effect:

root@solaris:/usr/lib# zfs list 
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool                  1.35G  1.59G    31K  /mypool
mypool/mydata1          1.01G  1.59G  1.01G  /data1
mypool/mydata2           341M  1.59G   341M  /data2

Note that the available space in the pool is shared amongst the file systems. This behavior can be modified using quotas and reservations, which are not covered in this lab but are covered extensively in the ZFS Administration Guide.
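If you want a quick taste, quotas and reservations are ordinary dataset properties set with zfs set - a quota caps what a dataset may consume, while a reservation guarantees it space. A sketch, with sizes chosen purely for illustration:

root@solaris:~# zfs set quota=1g mypool/mydata1
root@solaris:~# zfs set reservation=500m mypool/mydata1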

Exercise Z.4: ZFS Deduplication

The deduplication property is used to remove redundant data from a ZFS file system. With the property enabled, duplicate data blocks are removed synchronously. The result is that only unique data is stored and common components are shared.

Task: See how to implement deduplication and observe its effects.

Lab: You will create a ZFS file system with deduplication turned on and see if it reduces the amount of physical storage needed when we again fill it with a copy of /usr/lib.

root@solaris:/usr/lib# zfs destroy mypool/mydata2 
root@solaris:/usr/lib# zfs set dedup=on mypool/mydata1
root@solaris:/usr/lib# rm -rf /data1/*  
root@solaris:/usr/lib# mkdir /data1/2nd-copy
root@solaris:/usr/lib# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool                  1.02M  2.94G    31K  /mypool
mypool/mydata1            43K  2.94G    43K  /data1
root@solaris:/usr/lib# find . -print | cpio -pd /data1
2142768 blocks
root@solaris:/usr/lib# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool                  1.02G  1.99G    31K  /mypool
mypool/mydata1          1.01G  1.99G  1.01G  /data1
root@solaris:/usr/lib# find . -print | cpio -pd /data1/2nd-copy 
2142768 blocks
root@solaris:/usr/lib# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
mypool                  1.99G  1.96G    31K  /mypool
mypool/mydata1          1.98G  1.96G  1.98G  /data1

You could go on creating copies for quite a while... but you get the idea. Note that deduplication and compression can be combined: blocks are compressed first and deduplication then operates on the compressed blocks.
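For example, both properties can be set at dataset creation - a sketch, not one of this lab's steps:

root@solaris:~# zfs create -o dedup=on -o compression=gzip mypool/both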

Deduplication works across file systems in a pool and there is a zpool-wide property dedupratio:

root@solaris:/usr/lib# zpool get dedupratio mypool
NAME    PROPERTY    VALUE  SOURCE
mypool  dedupratio  4.30x  -

Deduplication can also be checked using "zpool list":

root@solaris:/usr/lib# zpool list
NAME     SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
mypool  2.98G  1001M  2.01G  32%  4.30x  ONLINE  -
rpool   15.9G  6.66G  9.21G  41%  1.00x  ONLINE  -

Before moving on to the next topic, destroy that dataset and free up some space:

root@solaris:~# zfs destroy  mypool/mydata1 

Exercise Z.5: ZFS Encryption

Task: Encrypt sensitive data.

Lab: Explore basic ZFS encryption.

This lab only covers the basics of ZFS encryption. In particular it does not cover the various aspects of key management. Please see the ZFS Administration Manual and the zfs_encrypt(1M) manual page for more detail on this functionality.

Create an encrypted file system. You will be prompted for a passphrase:

root@solaris:~# zfs create -o encryption=on mypool/data2 
Enter passphrase for 'mypool/data2': ********
Enter again: ********
root@solaris:~# 

Creation of a descendant dataset shows that encryption is inherited from the parent:

root@solaris:~# zfs create mypool/data2/data3 
root@solaris:~# zfs get -r  encryption,keysource,keystatus,checksum mypool/data2 
NAME                PROPERTY    VALUE              SOURCE
mypool/data2        encryption  on                 local
mypool/data2        keysource   passphrase,prompt  local
mypool/data2        keystatus   available          -
mypool/data2        checksum    sha256-mac         local
mypool/data2/data3  encryption  on                 inherited from mypool/data2
mypool/data2/data3  keysource   passphrase,prompt  inherited from mypool/data2
mypool/data2/data3  keystatus   available          -
mypool/data2/data3  checksum    sha256-mac         inherited from mypool/data2

You will find that the online manual page zfs_encrypt(1M) contains examples. In particular, if time permits during this lab session, you may wish to explore changing a key using "zfs key -c mypool/data2".
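As a sketch of what to expect - the prompts here are modelled on the passphrase prompts you saw at creation time, not captured output:

root@solaris:~# zfs key -c mypool/data2
Enter new passphrase for 'mypool/data2': ********
Enter again: ********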

Exercise Z.6: Shadow Migration

Shadow Migration allows you to migrate data from an old file system to a new file system while simultaneously allowing access and modification to the new file system during the process. You can use Shadow Migration to migrate a local or remote UFS or ZFS file system to a local file system.

Task: You wish to migrate data from one file system (UFS, ZFS, VxFS) to ZFS while maintaining access to it.

Lab: Create the infrastructure for shadow migration and transfer one file system into another.

First, create the file system you want to migrate:

root@solaris:~# zpool create oldstuff c3t4d0 
root@solaris:~# zfs create oldstuff/forgotten 

Then populate it with some files:

root@solaris:~# cd /var/adm 
root@solaris:/var/adm# find . -print | cpio -pdv /oldstuff/forgotten

You need the shadow-migration package installed:

root@solaris:~# pkg install shadow-migration
           Packages to install:  1
       Create boot environment: No
Create backup boot environment: No
            Services to change:  1

DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1       14/14      0.2/0.2

PHASE                                        ACTIONS
Install Phase                                  39/39

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2 

You then enable the shadowd service:

root@solaris:~# svcadm enable shadowd
root@solaris:~# svcs shadowd 
STATE          STIME    FMRI
online          7:16:09 svc:/system/filesystem/shadowd:default

Set the file system to be migrated to read-only:

root@solaris:~# zfs set readonly=on oldstuff/forgotten

Create a new zfs file system with the shadow property set to the file system to be migrated:

root@solaris:~# zfs create -o shadow=file:///oldstuff/forgotten   mypool/remembered 

Use the shadowstat(1M) command to see the progress of the migration:

root@solaris:~# shadowstat 
                                        EST
                                BYTES   BYTES           ELAPSED
DATASET                         XFRD    LEFT    ERRORS  TIME
mypool/remembered               92.5M   -       -       00:00:59
mypool/remembered               99.1M   302M    -       00:01:09
mypool/remembered               109M    260M    -       00:01:19
mypool/remembered               133M    304M    -       00:01:29
mypool/remembered               149M    339M    -       00:01:39
mypool/remembered               156M    86.4M   -       00:01:49
mypool/remembered               156M    8E      29      (completed)

Note that if you had created mypool/remembered as encrypted, this would be the preferred method of encrypting existing data. The same applies to compressing or deduplicating existing data.
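For example, the encryption and shadow properties could have been combined at creation time - a sketch, not verified in this lab:

root@solaris:~# zfs create -o encryption=on -o shadow=file:///oldstuff/forgotten mypool/remembered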

The procedure for migrating a file system over NFS is similar - see the ZFS Administration manual.
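In that case the shadow property takes an NFS URI - something along these lines, with a hypothetical server and path:

root@solaris:~# zfs create -o shadow=nfs://oldserver/export/forgotten mypool/remembered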

That concludes this lab session.

Wednesday Nov 18, 2009

ZFS HSP Demo

Here is a demo of the use of ZFS Hybrid Storage Pools that I put together. It's quite neat in that it is fairly easily reproducible. My thanks to Mo Beik for doing the heavy lifting.

Open Storage Demos

Here is a series of brief Open Storage demos that I put together on various aspects of Solaris storage software such as ZFS, NFS, CIFS and iSCSI - just the basics to get you started.

I originally created them for SuperComputer 2008 and have only just rediscovered them. Enjoy.

Saturday Apr 12, 2008

Solaris Application Programming

"Solaris Application Programming" by Darryl Gove is an important new book which has just hit the shelves. It is aimed (In my opinion which cost me $60 to formulate) at the programmer coming from another operating system such as Linux or Windows where (s)he had a level of familiarity with the compiler and observability tools that enabled the writing of extremely fast code, fast. If you are at home with all the GCC options but need a high speed introduction to Sun Studio or just want to program closer to the hardware, but don't relish the prospect of attacking the CPU vendor's manuals, then this is for you.

What's in it

Before you can understand the compiler and how to get the best out of it, you have to understand the CPU and memory and this area is dealt with both for SPARC and x64 straight away. This is followed by the obligatory chapter on the observability tools provided with Solaris - essential reading if you are new around here but not so if you have the POD Book to hand.

Then comes the meat of the book - using the Sun Studio compiler to make computationally intensive code run fast. The target language is C, with forays into C++ and Fortran where necessary. There is good coverage of floating point, fun with cache lines, and the SIMD (on x64) and VIS (on SPARC) instruction sets. The CPU performance counter metrics for both those chips and how to utilise them are thoroughly examined - they provide the data but you have to turn that into information, and the material here jumpstarts that process of understanding. There's a section on manually optimising code, caveated that much of what the books in this area teach you has also been taught to the compiler, and in many cases it can do that stuff better than you.

The final section of the book peels off from traditional sequential programming to the parallel model covering multiprocessor machines and multicore processors (CMT). In discussing this, the libraries to take advantage of the hardware capabilities are taken one by one - System V IPC, MPI, Pthreads and OpenMP.

What's not in it

As the title says, it's about Solaris application programming. It's about making code run fast. So it is not a lesson on the APIs available to the programmer - for that you have to go and read Rich Teer or Rich Stevens. It's not about the operating system (that's already covered off in McDougall and Mauro) or system performance, observability and debugging, which they (with Brendan Gregg) have also amply illuminated.

When you read the code samples in the book, you'll notice they perform almost no I/O - indeed use almost no system calls other than to get timings and output them to you the reader. By concentrating on a core subject in this way and keeping the code entirely focused, Darryl enables the reader to become amazingly proficient in an area traditionally labelled "guru" (how I hate that word) in 446 pages - that is a huge achievement. If I have nits, they are that there is no bibliography (there is a single footnote referencing a single manpage) and the code is not provided for all the examples that are run, but this is minor chaff. I'd also note that you have to be able to read, or be prepared to learn to read, assembly language for the x64 and SPARC. But you know that anyway.

Where to find out more

The Wiki page for the book is here and I would expect to see downloadable source there some day. Darryl's blog is here.

Conclusion

I can't commend this book highly enough. Every topic that Darryl addresses, he does so with enough detail to get you up, running and proficient, but not so much detail that the going gets bogged down. It's a fine line and the call has been made right every time. If you know what you are doing in a programming environment, just not this one - then this book really is for you. Similarly, if you have been playing the Solaris game for a while and are anxious to proceed to the next level, go buy!

Thursday Jan 04, 2007

ZFS v VxFS - IOzone

IOzone is an I/O benchmarking utility that I have blogged about before. I also covered off the results of running Filebench on the two filesystems. Here, for the sake of completeness, are the results of some IOzone runs I did at the same time. The command line for IOzone used the following arguments and options:

iozone -R -a -z -b file.wks -g 4G -f testfile

Write

This test measures the performance of writing a new file. It is normal for the initial write performance to be lower than the performance of rewriting a file (see next test, below) due to metadata overhead.
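If you want to reproduce an individual test rather than the full automatic sweep, iozone's -i option selects tests by number (0 = write/rewrite, 1 = read/re-read, 2 = random read/write); the record and file sizes below are arbitrary examples:

iozone -i 0 -r 64k -s 1g -f testfile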


Iozone Write Performance

Re-Write

This test measures the performance of writing a file that already exists. When a file is written that already exists the work required is less as the metadata already exists. It is normal for the rewrite performance to be higher than the performance of writing a new file.


Iozone Re-Write Performance

Read

This test measures the performance of reading an existing file.


Iozone Read Performance

Re-Read

This test measures the performance of reading a file that was recently read. It is possible for the performance to be higher as the file system can maintain a cache of the data for files that were recently read. This cache can be used to satisfy reads and improves the throughput.


Iozone Re-read Performance

Random Read

This test measures the performance of reading a file with accesses being made to random locations within the file. The performance of a system under this type of activity can be impacted by several factors such as the size of operating system’s cache, number of disks, seek latencies, and others.


Iozone Random Read Performance

Random Write

This test measures the performance of writing a file with accesses being made to random locations within the file. Again the performance of a system under this type of activity can be impacted by the factors listed above for Random Read. Efficient random write is important to the operation of transaction processing systems.


Iozone Random Write Performance

Backward Read

This test measures the performance of reading a file backwards. This may seem like a strange way to read a file but in fact there are applications that do this. MSC Nastran is an example of an HPC application that reads its files backwards. Video editing is another example. Although many file systems have special features that enable them to read a file forward more rapidly, there are very few that detect and enhance the performance of reading a file backwards.


Iozone Backward Read Performance

Record Rewrite

This test measures the performance of writing and re-writing a particular spot within a file.


Iozone Record Rewrite Performance

Strided Read

This test measures the performance of reading a file with a strided access behavior. An example would be: “Read at offset zero for a length of 4 KB, then seek 200 KB, and then read for a length of 4 KB, then seek 200 KB and so on.” Here the pattern is to read 4 KB and then seek 200 KB and repeat the pattern. This again is a typical application behavior for applications that have data structures contained within a file and is accessing a particular region of the data structure. Most file systems do not detect this behavior or implement any techniques to enhance the performance under this type of access behavior.


Iozone Strided Read Performance

fwrite()

This test measures the performance of writing a file using the library function fwrite(). This is a library routine that performs buffered write operations. The buffer is within the user’s address space. If an application were to write in very small size transfers then the buffered & blocked I/O functionality of fwrite() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made. This test is writing a new file so again the overhead of the metadata is included in the measurement.


Iozone fwrite() Performance

Re-fwrite()

This test performs repetitive re-writes to portions of an existing file using the fwrite() interface.


Iozone Re-fwrite() Performance

fread()

This test measures the performance of reading a file using the library function fread() - a library routine that performs buffered & blocked read operations. The buffer is within the user’s address space, as for fwrite() operations. If an application were to read in very small size transfers then the buffered & blocked I/O functionality of fread() can enhance the performance of the application by reducing the number of actual operating system calls and increasing the size of the transfers when operating system calls are made.


Iozone fread Performance

Re-fread()

This test is the same as fread() above except that in this test the file that is being read was read in the recent past. This can result in higher performance as the file system is likely to have the file data in cache.


Iozone Re-fread() Performance

End note

In the last couple of blogs, I've given the results of testing a number of typical file system workloads in an open and reproducible manner using the publicly available Filebench and IOzone tools, and shown that Solaris 10 ZFS can significantly outperform a combination of Veritas Volume Manager and Filesystem in many cases. However, the following points (the "usual caveats") should also be taken into consideration:

  • These tests were performed on a Sun Fire server with powerful processors, a large memory configuration, and a very wide interface to an array of high-speed disks to ensure that the fewest possible factors would inhibit file system performance. It is possible that the differences between file systems would be less pronounced on a less powerful system simply because all file systems would run into hardware bottlenecks in moving data to the disks.
  • A file system performs only as well as the hardware and operating system infrastructure surrounding it, such as the virtual memory subsystem, kernel threading implementation, and device drivers. As such, Sun’s overall enhancements in the Solaris OS, combined with high-powered Sun Fire servers, will provide customers with high levels of performance for applications. But proof-of-concept (POC) implementations are invaluable in supporting purchasing decisions for specific configurations and applications.
  • Benchmarks provide general guidance to performance. The conclusion that can be drawn from these tests is that in application areas such as databases, e-mail, web server and software development, Solaris 10 ZFS performs best in “apples-to-apples” comparisons with the Veritas product suite. Again, POCs and real-world customer testing help evaluate performance for specific applications and services.

ZFS v VxFS - Ease

I've had people asking me to blog more of my stuff on ZFS, especially in relation to the Veritas suite (Microsoft NTFS and Linux aficionados, make yourselves known to my highly efficient Customer Services team using the comments form below).

I did a lot of poking into ZFS performance over the summer. The Veritas Filebench results are already posted here but apart from the numbers, what leapt out straight away was the simplicity of use of ZFS compared to the competition.

I'm not talking about GUIs because I grew up in environments (banks, IT vendors) where they simply weren't used either because the required precision in the configuration demanded command line and scripting or the work was remote (for which read "from home in middle of night") and the comms just didn't move the bits fast enough to support GUIs.

For a start the conceptual framework is a lot simpler. The following table lists the building blocks of both Veritas VxVM/VxFS and ZFS.

ZFS

  • The pool: all the disks in the system.
  • File systems: as many as are required.

Veritas

  • A physical disk is the basic storage device (media) where the data is ultimately stored.
  • When you place a physical disk under VxVM control, a VM disk is assigned to the physical disk. A VM disk is under VxVM control and is usually in a disk group.
  • A VM disk can be divided into one or more subdisks. Each subdisk (actually a set of contiguous disk blocks) represents a specific portion of a VM disk.
  • VxVM uses subdisks to build virtual objects called plexes. A plex consists of one or more subdisks located on one or more physical disks.
  • A disk group is a collection of disks that share a common configuration, and which are managed by VxVM.
  • A volume is a virtual disk device that appears like a physical disk device and consists of one or more plexes contained in a disk group, each holding a copy of the selected data in the volume.
  • A VxFS file system is constructed on a volume so that files can be stored.

So how does this translate into practice? The following table lists the activities and times taken to create useable storage using either Veritas or ZFS:

ZFS

# zpool create -f tank [list of disks]
# zfs create tank/fs

Time Taken: 17.5 seconds

Veritas

# /usr/lib/vxvm/bin/vxdisksetup -i c2t16d0
# vxdg init dom-dg c2t16d0
# for i in [list of disks]
do
/usr/lib/vxvm/bin/vxdisksetup -i $i
done

# for i in [list of disks]
do
vxdg -g dom-dg adddisk $i
done

# vxassist -g dom-dg -p maxsize layout=stripe
6594385920 [ get size of volume, then feed back in ]
# vxassist -g dom-dg make dom-vol 6594385920 layout=stripe
# mkfs -F vxfs /dev/vx/rdsk/dom-dg/dom-vol
version 6 layout
6594385920 sectors, 412149120 blocks of size 8192, log size 32768 blocks
largefiles supported
# mount -F vxfs /dev/vx/dsk/dom-dg/dom-vol /mnt

Time Taken: 30 minutes

It's far simpler, is it not? The timings are for 48 x 72 Gb disks in StorEdge 3500 JBODs, by the way. If you want a bit more guidance on using ZFS, you should:

  • Read the manual on docs.sun.com
  • Look at the wiki which covers hints, tips and preferred practice
  • Subscribe to zfs-discuss. This patches you through to (amongst others) people who wrote ZFS, so it's a good source of authoritative, if sometimes a little terse, guidance.

Friday Dec 08, 2006

Welcome France 24

France 24, a 24 hour news channel in French, English and Arabic, broadcast over terrestrial digital TV, satellite and the net, has been launched. It has arrived at the behest of Jacques Chirac to provide a French view on the world - he was rather annoyed at the predominance of the Anglo-Saxon news coverage of the Gulf War, for instance. For me, the arrival of this station is really good news (excuse the pun) as I'm trying to learn French and have been limited to output such as Radio France International which, although very good, is a lot less fun than 24 hour rolling news and comment for francophones and francophiles. Also, to some extent I share M. Chirac's reservations about global news media - though I do have some minor observations about France 24:

The logo. I think it very amusing that this is so close to being the negative of the logo of that great American firm, AT&T (see below; all trademarks are copyright of... you know the rest):



France 24 Logo

AT&T Logo

The Presenters. One can detect a certain ageism/sexism in the choice of staff. There seem to be two sorts:



Britney Spears

Prototype France 24 Presenter

I hope that I am mistaken and the female staff will be allowed to age gracefully facing the camera with their surviving male colleagues.

The news loop. It's understandable - it's only their second day and they rely largely on feeds from other agencies and networks, but even with my short attention span I realised I was watching the same content twice. A slightly longer loop would be nice.

The big issue: The Technology. If you have Microsoft Windows Internet Explorer 7 and Windows Media Player 9 then you are in for a real treat. If you use any other browser (I tried Mozilla and a newer Firefox) you are in for an incoherent mess that looks like this.

And as for my Solaris and Linux platforms? One life is too short to blend this site into the viewing experience of those operating systems. I hope that there are other people out there who have given, or will give, a fragment of theirs to this cause and could offer me guidance. For those who only want the video feed via Realplayer,

mms://live.france24.com/france24_fr.wsx

is the key; it is not immediately obvious, being buried in the HTML source.

Optimisation for only one platform at this stage in the life of the channel seems an obvious decision: they are only 2 days old. On the other hand, there is no love lost between the French government and the evil empire of American IT corporations - France is developing a "Google rival" and has also taken a swipe at Apple. Microsoft has been mired in European Union anti-trust litigation for years.

Perhaps this enmity will bear fruit for those viewers amongst us who do not drink entirely from the fountain at Redmond. In the meantime I am profoundly grateful to the French taxpayers and the staff of France 24 for this Christmas gift.

Friday Dec 01, 2006

Filebench: A ZFS v VxFS Shootout

Overview

Here is an example of Filebench in action to give you an idea of its capabilities "out of the box" - a run through a couple of the test suites provided with the tool on the popular filesystems ZFS and VxFS/VxVM. I've given sufficient detail so that you can easily reproduce the tests on your own hardware. I apologise for the graphs, which have struggled to survive the OpenOffice .odt -> .html conversion; I hadn't the energy to recreate all 24 of them from the original data.

They summarize the differences between ZFS and VxFS/VM in a number of tests which are covered in greater detail further on. It can be seen that in most cases ZFS performed better at its initial release (in Solaris 10 06/06) than Veritas 4.1; in some cases it does not perform as well; but in all cases it performs differently. The aim of such tests is to give a feel for the differences between products/technologies so intelligent decisions can be made as to which file system is more appropriate for a given purpose.

It remains true however that access to the data will be more effective in helping decision makers reach their performance goals if these can be stated clearly in numerical terms. Quite often this is not the case.



Figure 1: Filebench: Testing Summary for Veritas Foundation Suite (ZFS = 0)

Hardware and Operating System

Solaris 10 Update 2 06/06 (inc. ZFS) running on a Sun Fire E6900 with 24 x 1.5 GHz UltraSPARC IV+ processors and 98 Gb RAM, with storage comprising 4 x StorEdge 3500 JBODs (48 x 72 Gb disks), fibre attached (4 x 2 Gb PCI-X 133 MHz).

The software used was VERITAS Volume Manager 4.1, VERITAS File System 4.1, FileBench 1.64.5. For each Filebench test a brief description is given followed by a table which shows how it was configured for that test run. This enables you to reproduce the test on your hardware. Of course if you want greater detail on the tests, you have to download Filebench (see blogs passim).
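Each test below can be reproduced by loading the named personality and overriding its variables before starting a timed run. A sketch of an interactive Filebench session of that vintage (the target directory is hypothetical):

filebench> load varmail
filebench> set $dir=/mnt
filebench> set $nfiles=100000
filebench> run 60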

Create & Delete

Creation and deletion of files is a metadata intensive activity which is key to many applications, especially in web-based commerce and software development.

Personality           Workload                Variables
createfiles           createfiles             nfiles 100000, dirwidth 20, filesize 16k, nthreads 16
deletefiles           deletefiles             nfiles 100000, meandirwidth 20, filesize 16k, nthreads 16



Figure 2: Create/Delete - Operations per Second

Figure 3: Create/Delete - CPU uSec per Operation



Figure 4: Create/Delete - Latency (ms)


Copyfiles

This test creates two large directory tree structures and then measures the rate at which files can be copied from one tree to the other.

Personality           Workload                Variables
copyfiles             copyfiles               nfiles 100000, dirwidth 20, filesize 16k, nthreads 16



Figure 5: Copy Files - Operations per Second

Figure 6: Copy Files - CPU uSec per Operation



Figure 7: Copy Files - Latency (ms)


File Creation

This test creates a directory tree and fills it with a population of files of specified sizes. File sizes are chosen according to a gamma distribution of 1.5, with a mean size of 16k. The different workloads are designed to test different types of I/O - see generally the Solaris manual pages for open(2), sync(2) and fsync(3C).

Personality           Workload                Variables
filemicro_create      createandalloc          nfiles 100000, nthreads 1, iosize 1m, count 64
                      createallocsync         nthreads 1, iosize 1m, count 1k, sync 1
filemicro_writefsync  createallocfsync        nthreads 1
filemicro_createrand  createallocappend       nthreads 1




Figure 8: File Creation - Operations per Second

Figure 9: File Creation - CPU uSec per Operation



Figure 10: File Creation - Latency (ms)


Random Reads

This test performs single-threaded random reads of 2 KB size from a file of 5 Gb.

Personality           Workload                Variables
filemicro_rread       randread2k              cached 0, iosize 2k
                      randread2kcached        cached 1, iosize 2k



Figure 11: Random Read - Operations per Second

Figure 12: Random Read - CPU uSec per Operation



Figure 13: Random Read - Latency (ms)


Random Writes

This test consists of multi-threaded writes to a single 5 Gb file.

Personality           Workload                Variables
filemicro_rwrite      randwrite2ksync         cached 1, iosize 2k
                      randwrite2ksync4thread  iosize 2k, nthreads 4, sync 1



Figure 14: Random Writes -
Operations per Second

Figure 15: Random Writes -
CPU uSec per Operation



Figure 16: Random Writes - Latency (ms)


Sequential Read

These tests perform a single threaded read from a 5 Gb file.

Personality           Workload                Variables
filemicro_seqread     seqread32k              iosize 32k, nthreads 1, cached 0, filesize 5g
                      seqread32kcached        iosize 32k, nthreads 1, cached 1, filesize 5g



Figure 17: Sequential Read -
Operations per Second

Figure 18: Sequential Read -
CPU uSec per Operation



Figure 19: Sequential Read - Latency (ms)


Sequential Write

These tests perform single threaded writes to a 5 Gb file.

Personality            Workload               Variables
filemicro_seqwrite     seqwrite32k            iosize 32k, count 32k, nthreads 1, cached 0, sync 0
                       seqwrite32kdsync       iosize 32k, count 32k, nthreads 1, cached 0, sync 1
filemicro_seqwriterand seqwriterand8k         iosize 8k, count 128k, nthreads 1, cached 0, sync 0



Figure 20: Sequential Write -
Operations per Second

Figure 21: Sequential Write -
CPU uSec per Operation



Figure 22: Sequential Write - Latency (ms)


Application Simulations: Fileserver, Varmail, Web Proxy & Server

There are a number of scripts supplied with Filebench to emulate applications:

Fileserver:

A file system workload, similar to SPECsfs. This workload performs a sequence of creates, deletes, appends, reads, writes and attribute operations on the file system. A configurable hierarchical directory structure is used for the file set.

Varmail:

A /var/mail NFS mail server emulation, following the workload of Postmark, but multi-threaded. The workload consists of a multi-threaded set of open/read/close, open/append/close and deletes in a single directory.

Web Proxy:

A mix of create/write/close, open/read/close, delete of multiple files in a directory tree, plus a file append (to simulate the proxy log). 100 threads are used. 16k is appended to the log for every 10 read/writes.

Web Server:

A mix of open/read/close of multiple files in a directory tree, plus a file append (to simulate the web log). 100 threads are used. 16k is appended to the weblog for every 10 reads.

Personality           Workload                Variables
fileserver            fileserver              nfiles 100000, meandirwidth 20, filesize 2k, nthreads 100, meaniosize 16k
varmail               varmail                 nfiles 100000, meandirwidth 1000000, filesize 1k, nthreads 16, meaniosize 16k
webproxy              webproxy                nfiles 100000, meandirwidth 1000000, filesize 1k, nthreads 100, meaniosize 16k
webserver             webserver               nfiles 100000, meandirwidth 20, filesize 1k, nthreads 100



Figure 23: Application Simulations - Operations per Second

Figure 24: Application Simulations - CPU uSec per Operation



Figure 25: Application Simulations - Latency (ms)


OLTP Database Simulation

This database emulation performs transactions on a file system using the I/O model from Oracle 9i. This workload tests for the performance of small random reads & writes, and is sensitive to the latency of moderate (128Kb+) synchronous writes as occur in the database log file. It launches 200 reader processes, 10 processes for asynchronous writing, and a log writer. The emulation uses intimate shared memory (ISM) in the same way as Oracle which is critical to I/O efficiency (as_lock optimizations).


Personality           Workload                   Variables
oltp                  large_db_oltp_2k_cached    cached 1, directio 0, iosize 2k,
                                                 nshadows 200, ndbwriters 10, usermode 20000,
                                                 filesize 5g, memperthread 1m, workingset 0
                      large_db_oltp_2k_uncached  As above except cached 0, directio 1
                      large_db_oltp_8k_cached    As for 2k cached except iosize 8k
                      large_db_oltp_8k_uncached  As for 2k uncached except iosize 8k


Figure 26: OLTP 2/8 Kb Blocksize - Operations per Second


Figure 27: OLTP 2/8 Kb Blocksize - CPU uSec per Operation


Figure 28: OLTP 2/8 Kb Blocksize - Latency (ms)

The summary figures above are the tip of a vast numerical iceberg of statistics provided by Filebench and wrappers around it which probe every system resource counter you can think of. It is a truism though that in using data like this, there is an enthusiasm to reduce it to single figures and simple graphs, leaving the engineers working on the performance bugs to the excruciating detail.

Remember also that these are the pre-packaged scripts. The possibilities for custom benchmark workloads are as infinite as your imagination. It's also worth saying that technologies move on. The snapshot above will start to fade as improvements are made.

Thursday Nov 30, 2006

twm(1) - The Director's Cut

As a postscript to my blogs on squeezing Solaris 10 onto an antique PC (here and here), Dave Levy commented that I should provide a screenshot of Tom's Window Manager, as a means to point out the irony that doing a screen grab of TWM is a recursive problem - the tools ain't there to do it. Oh well. I shan't restate the case for it because the Wikipedia entry sums it up - "Although it is now generally regarded as the window manager of last resort, a small but dedicated minority of users favor twm for its simplicity, customizability, and light weight". You can leap from the Wiki entry to an interesting interview with Tom himself.

As you can see, twm(1) has the rich functionality that you need but without all that troublesome clutter.

Solaris Performance and Tools and Solaris Internals (2nd Ed) make a firm foundation for any tuning exercise such as this. Rich Teer's Solaris Systems Programming also provides useful supporting documentation.

The quid pro quo I negotiated with Dave was that he in turn would blog an analysis of the case for commercial enterprises such as Sun open-sourcing their software set in the context of classical economic theory. I know I don't understand this properly because listening to Dave explain it is like holding up the TV aerial - it seems clear while he holds it up and explains but the moment he finishes and puts the aerial down, my screen goes fuzzy again.

Monday Nov 20, 2006

Filebench: Request For Enhancements

The time has come to get Filebench cleaned up (I can't use the word "Productised", I just can't) and this task has fallen to me. Filebench, for those of you not aware, is an I/O load-generation framework for simulating the effect of applications on file systems. It's an open source project hosted on Sourceforge and is used a lot within Sun for research, benchmarking and product development. I've been asked to tidy it up, document it and generally give it some love.

There are two real reasons why Filebench is a good thing. Firstly, it's cheap. Setting up and running some of the industry standard benchmarks (and all application stacks) consumes an enormous amount of engineering time that could be better spent thinking great thoughts, or snowboarding. Filebench gives you back this time. Secondly, it provides a language to express the conversation between an application and its persistence layer (what we used to call "storage" before those Java fiends got hold of it). The semantics of this language are very rich and have taken a lot of thought - they might not be right yet - but we can express workloads that mimic real applications without the expense of real applications. How we transfer that conversation from the "Record" button (i.e. a trace) to "Play" (i.e. a workload) is a topic which seems to have kept a lot of academics and others from their snowboards, but I'll come back to their endeavours in a future blog.

What's on The To Do List?

There are quite a few things on my "To Do" list for Filebench which have been put on it by people using it in Sun, our customers and the wider community. Here's what I have:

  • A User guide. There is currently no documentation for Filebench. This was cited as the greatest barrier to wider adoption by people I talked to when I presented at the UKCMG conference earlier in the year.
  • Multi-client functionality. It needs to work in environments such as clusters, grids, and with proxy filesystems such as CIFS, NFSv3 (and 4) and others such as QFS and SANergy
  • Work with block devices. We need this to play with iSCSI, apart from anything else.
  • Creation of a community site to support further development: It would be good to have a more fully formed website and wiki to support Filebench. Sourceforge offers support for this but it's all a bit bare-bones at present.
  • Tidying up of stable ports to Linux and Microsoft. There are currently ports to these operating systems but they are not entirely stable or properly tested and documented. It is not yet clear whether a Microsoft port should utilise the native C environment or a Microsoft supplied or third party Posix emulation layer.
  • General fixing of minor bugs, integrating a backlog of existing bug-fixes, attending to RFEs and so forth.

Anything Else?

If you want to add to this list (or indeed, take things from it; the joy of Open development) your first stop is either mailing me (dominic dot kay at gmail dot com) or the Open Solaris Performance Alias.

Fat Software Part 2

I found a little more fat to trim from the midget system I described in my last post. Well, actually I found quite a bit, but it took me down a cul-de-sac of unbootability, so here, after more experimentation, are the modifications that are "safe" - by which I mean, of course, totally unsupported and unsupportable by Sun, but useful if you have to run Solaris on an old, memory-constrained, probably beige, system.

Such a system will never run such new-fangled protocols as Infiniband, FibreChannel and iSCSI, I don't need NFS or any of the enterprise management software. I really don't actually need SCSI or IPv6 - but you can't unbolt those.

So, for the record and so you can cherrypick your own modifications, here are the scores - memory is in 4 Kb pages: I have 63374 of these to my name:
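(A quick sanity check: 63374 pages x 4 Kb = 253496 Kb, or roughly 247 Mb, which tallies with the memstat total further down.)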

Action                                        Kernel   Anon  Exec/Libs  Page Cache   Free
Baseline (dtconfig -e)                         10398  22143       3619        4587  22627
Rename webm and webconsole in rc2.d/            9573   9829       2572        2582  38818
Rename snmpdx, dmi, initsma in rc3.d            9505   8888       2253        2626  40102
Remove services (previous blog)                 9705   7587       1904        2709  41469
Exclude nfs, lofs from /etc/system,
  rename volmgt in /etc/rc3.d                   7975   7045       1738        1915  44647
Exclude iSCSI                                   7900   7005       1776        1896  44797
Remove Infiniband                               7819   6862       1720        1880  45093
Remove Fibrechannel                             7319   6969       1738        1871  45477

I'm calling it a day at this point. A more diligent man would put all this in a Jumpstart script but time is money and that would be a sign of obsession. The out of the box installation had 7% of its memory free and I've managed to take that to 71%. The kernel (which of course is not pageable) is down to 28 Mb. This old HP Brio has a new lease of life. But brio is the quality of being active or spirited or alive and vigorous - so I'm off to get a life.

Wednesday Nov 15, 2006

French Customs and Fat Software

Never ever ever tell French Customs that you have just driven from Amsterdam - you will lose an hour of your life. This happened to me at Calais on my way home to the UK. I always get stopped by Customs when I'm on my own. It's because I drive a big shiny German car but wear scruffy clothes and don't shave often enough for their liking. If I cleaned up and put a suit on for these journeys, I'd get left alone, but I always forget. If I was a customs officer, I'd stop me.

Anyway, due to my arrival from the intoxicant capital of Europe I got the full treatment this time. They even X-rayed my spare tyre (no, the one in the car, not 'round my waist) and it is in fine health. They were very suspicious when I told them I don't smoke - not just dope, but anything. They all smoked - continuously. Then when they found that the 6 bottles of wine I had (only 6, Monsieur?) were from Portugal and not their beloved France, that was it. Out came the back seat of the car, pockets were emptied, carpets were lifted. I just kept smiling - they have guns after all, and latex surgical gloves - far more scary. Anyway, if I'm ever stopped again: I've just come from Lourdes.

I was coming home because remote working isn't working any more: my laptop is broken. Since my daughter peeled all the keys from the keyboard, it has never been quite the same and now one of them is permanently detached - cue renditions of "U picked a fine time to leave me, Lucille", "I'll never find another U", etc. Try working without this vowel - it's impossible. So off to the laptop garage it goes.

This leaves me with an 8 year old PC in my office which was given to me as an alternative to putting it in landfill. Bye bye $3500 Ferrari laptop, hello ageing HP Brio. It hasn't much memory (256 Mb), or a DVD drive, or sound, and the screen has got pen marks all over it because it came out of a goods-despatch office in a warehouse. But hey, it was free.

I load up Solaris 10 6/06 (Developer install profile), log in to Gnome and it... eh... stops. That is to say the visual element of the great Solaris experience stops - the meat grinder noise continues under the desk. Since then, it's taken me a fair amount of performance tuning to get usability out of it - a sort of "back to basics" exercise.

My first suspicion was that the delicious new GUI that Solaris sports was the culprit for all the thrashing below. I used to say that using Unix was fine so long as one remembered that the first ten years would be the worst part. Now? You don't have to know anything about anything to be a performance analyst. You just read Solaris Performance & Tools and you're away. Observability? Look at this:

# echo "::memstat" | mdb -k 
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      19522                76   31%
Anon                        20766                81   33%
Exec and libs                7685                30   12%
Page cache                   2095                 8    3%
Free (cachelist)            10547                41   17%
Free (freelist)              2759                10    4%

Total                       63374               247

...then turn to page 646 of the book, distill the content onto a Powerpoint slide and present your invoice. Voila! (as the Customs men were dying to shout, but couldn't).

So is the windowing software fat? Well, actually, no. It is fatter than it was, as the graph below shows (the Y axis indicates where my precious 247 megabytes have gone - for all the data points we have one terminal window open on the desktop), but not really podgy like some operating systems' window managers. The real problem though was that, unknown to me (then), the web management software starts a honking great JVM - that's where all the memory goes.

So my choice was to go out and buy some hardware or execute a slash and burn strategy on memory consumption. Well the PC was free, the OS was free and the authors gave me the book for free (all as in "free beer"). Enough said. I did the numbers at each stage as you can see in the graph below:

Stage 1. This is my baseline - With Gnome and 1 terminal window open.

Stage 2. With CDE, 1 terminal window open.

Stage 3. Disable the windowing system altogether (dtconfig -d).

Stage 4. In /etc/rc2.d get rid of (i.e. rename) S90wbem and S90webconsole: as I mentioned, they start a very large (for the RAM I have to play with) JVM. This was the biggest single win.

Stage 5. In /etc/rc3.d, rid yourself of S76snmpdx, S77dmi and S82initsma.

Stage 6. Disable "unneeded" (opinions differ here) service daemons using svcadm disable: sendmail, nfs/cbd, nfs/mapid, nfs/server, nfs/status, nfs/nlockmgr, nfs/rquota, nfs/client, network/ssh, system/power, system/picl, filesystem/autofs. A consolidated sketch of stages 4 to 6 follows after stage 7.

Stage 7. Fire up twm(1) using xinit(1). For those who cannot remember Tom's Window Manager, you can enjoy its rich functionality by choosing a Failsafe session from the options on your Gnome/CDE login screen and then logging in. In the (only) shell window you are presented, type /usr/openwin/bin/twm and there you are. Now run away. Fast.
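For the record, here is a consolidated sketch of stages 4 to 6 - unsupported, as stated above; the renaming convention is my own, and the abbreviated FMRI names are assumed to resolve as svcadm usually allows:

#!/bin/sh
# Stage 4: stop the web management services (and their honking great JVM)
mv /etc/rc2.d/S90wbem /etc/rc2.d/xS90wbem
mv /etc/rc2.d/S90webconsole /etc/rc2.d/xS90webconsole
# Stage 5: likewise the SNMP/DMI agents
mv /etc/rc3.d/S76snmpdx /etc/rc3.d/xS76snmpdx
mv /etc/rc3.d/S77dmi /etc/rc3.d/xS77dmi
mv /etc/rc3.d/S82initsma /etc/rc3.d/xS82initsma
# Stage 6: disable the service daemons via SMF
for s in sendmail nfs/cbd nfs/mapid nfs/server nfs/status \
         nfs/nlockmgr nfs/rquota nfs/client network/ssh \
         system/power system/picl filesystem/autofs
do
        svcadm disable $s
done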

There is probably more fat I could trim (please email me your favourite culprits: dominic dot kay at gmail dot com) but the effort/mempages ratio has peaked and the day-job beckons.

And so here I am. I have an antique computer running a windowing system no-one can remember how to use but my ps(1) listing doesn't scroll off the screen anymore and it boots up in bounded time and goes like the wind. It runs Openoffice, Firefox, Sun Studio 11 (with special "compile during coffee break" option enabled). My X Windows Users and Administrators Guides (from O'Reilly circa 1992, from my attic yesterday) have new value and I'm high on life - thanks to losing "U" in Amsterdam.

I should make it clear by way of postscript that I was not in Amsterdam because it is the preferred haunt of those who like their tobacco to have a fuller flavour, but because my brother has a fabulously comfortable house there with a spare room, a DSL connection and a fridge full of food. As far as intoxication went - I took everything from my cellar dated 1995 to enjoy with him. Most of the claret was starting to fade. I take this as a sign that I should giddy-up and attack the 1996's at Christmas. Mais pas de drogues ni de stupéfiants. No, I haven't got any. Really.

Friday Jan 27, 2006

Filesystem Benchmarks: iozone

In my last post I discussed the vxbench I/O load generator, which may (or may not) be available from Symantec for the use of all. Recent work with Windows 2003 Server has given me the excuse to use Iozone, which has many things in common with vxbench. In fact I feel a taxonomic table coming on:

Feature                    VxBench                                        Iozone
Open source                No. Copyrighted but freely available.          Yes: ANSI C
Async I/O                  Yes: aread, awrite, arand_mixed workloads      Yes: -H, -k options
Memory mapped I/O          Yes: mmap_read, mmap_write workloads           Yes: -B option
Multi-process workloads    Yes: -P/-p options                             Yes: default
Multi-threaded workloads   Yes: -t option                                 Yes: -T option
Single stream measurement  Yes                                            Yes
Spreadsheet output         No                                             Yes
Large file compatible      No                                             Yes
Random reads/writes        Yes: rand_[read|write|mixed]                   Yes: -i 2 option
Strided I/O                Yes: stride=n subopt                           Yes: -j option
Simulate compute delay     Yes: sleep workload, sleeptime=n seconds       Yes: -J milliseconds
Caching options            O_SYNC, O_DSYNC, direct I/O, unbuffered I/O    O_SYNC
OS's                       Solaris, AIX, HP, Linux. Not MS Win            As vxbench + MS Win. POSIX

There are challenges in using these tools; the first is that these are not benchmarks: they are load generators with no load (benchmark) defined. And there are two approaches to defining a load: (a) how many operations of a specific type can be achieved in a set time; (b) how long it takes to complete a specific number of operations. The difference, for a lot of people, is a matter of taste. The consequence is that each new analyst who approaches these tools starts to write a new cookbook.

Another challenge is that the principal dimensions of performance in a benchmark are:

  1. Latency - how long until the first byte is returned to the user or committed to disk and the operation returned from.
  2. Throughput - how much data under different access patterns can be sent to or retrieved from permanent storage.
  3. Efficiency - how much of the system's resources were consumed in moving data to and from storage rather than doing computation upon it. (Resources can be memory, CPU cycles, and hardware and software synchronisation mechanisms. In this Millennium we also bring in the consumption of electricity and the generation of heat.)

Load generators, including the ones we are discussing, are pretty good on the first two counts but no good at the third. That distinction marks the difference between a load generator and a benchmarking framework such as Filebench, SLAMD or tools such as Loadrunner from Mercury. It is no minor matter to coordinate the gathering of system metrics with the execution of the workload. It's even more difficult to achieve this across distributed systems sharing access to a filesystem such as NFS or Shared QFS. In this case a common and precise idea of the current time needs to be maintained across the systems.

Tools such as Iozone and Vxbench need to be embedded in scripted frameworks to do performance metrics collection - in several Unixes it simply means running any and every tool whose name ends in "stat" in the background. In Microsoft's world there are the CIM probes accessible through VBScript or Perl, and in Solaris 10, dtrace provides access to arbitrary counters.
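A minimal sketch of such a harness on Solaris - the intervals and output filenames are arbitrary:

#!/bin/sh
# start the collectors in the background, remembering their pids
iostat -xn 5 > iostat.out &
IOSTAT_PID=$!
vmstat 5 > vmstat.out &
VMSTAT_PID=$!
# run the workload in the foreground
iozone -R -a -z -b file.wks -g 4G -f testfile
# stop the collectors once the run completes
kill $IOSTAT_PID $VMSTAT_PID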


Putting Iozone to Work

Using Iozone we can generate output similar to the graphs below.

I created a 32 Gb volume across 12 Seagate ST13640 disks (on 2 JBODs connected via 2 Adaptec Ultra320 SCSI controllers to a dual 2 GHz AMD Opteron with 2 Gb RAM). For the care and feeding of this sandbox, I am grateful to Paul Humphreys and his band of lab engineers.

I then ran iozone and collected the results. As it dumps straight into spreadsheet format you can quickly do some interesting graphical analysis, such as this example at the tool's website. However I was after something more mundane.

The first graph below is from OpenOffice. I don't like it much because it follows the data, so you end up with a powers-of-two x-axis which hides important detail. Also, all OpenOffice graphs tend to look the same without extensive fiddling.

The graph below it is done with R and although it is plainer, I think it gives a clearer picture.

Here is the data and R code - not a lot to it really. I continue to urge you to use this tool as I have in the past. James Holtman has made a compelling case for the use of R in performance analysis in his papers Visualisation of Performance Data (CMG2005) and The Use of "R" for System Performance Analysis (CMG2004). Sadly CMG do not make their papers available to the wider community.

size  FW      NW      FR      NR
4     426.78  327.59  627.4   560.92
8     544.29  467.6   733.93  672.68
16    594.61  362.57  878.66  725.4
32    628.45  587.71  883.71  754.75
64    662.49  606.35  886.44  748.12
128   664.56  619.54  846.31  815.49
256   700.33  666.51  933.96  769.15
512   12.55   13.6    664.16  660.76
1024  8.8     10.77   600.13  592.36
# read the Iozone summary data (whitespace-separated, with the header row above)
g_data <- read.table("C:\\home\\dominika\\FATvNTFS.csv", header=TRUE)
attach(g_data)

# plot FAT reads first, fixing the axes so all four series share them
plot(size, FR, type="l",
   main="NTFS and FAT32 I/O Performance",
   sub="Sequential Reads/Writes to 1 Gb File in 32 Gb Filesystem",
   xlim=c(0,1024),
   ylim=c(0,1000),
   xlab="I/O size (Kb)",
   ylab="I/O rate (Mb/s)",
   lty=5,col=5, lwd=2  )

# overlay the remaining three series
lines(size,FW,lty=2,col=2, lwd=2)
lines(size,NW,lty=3,col=3, lwd=2)
lines(size,NR,lty=4,col=4, lwd=2)

# label the curves directly rather than using a legend
text(150,700,"FAT Wr"); text(130,600,"NTFS Wr")
text(350,750,"NTFS Rd"); text(400,800,"FAT Rd")
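
If you'd rather not paste that into an interactive session, the script runs in batch mode too (the filename is mine); add an explicit device call such as png() at the top, and dev.off() at the end, if you want the plot in a particular file:

# run non-interactively; console output lands in fatvntfs.Rout
R CMD BATCH fatvntfs.R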

The graph appears to show us a good deal, but it's what it doesn't show that has to be remembered - the qualitative side to all this.

Several people I showed it to had expected that NTFS, being the more modern filesystem, would have the better performance. Not so, and for good reasons. Yes, in the simple case FAT32 is faster than NTFS, but out-and-out performance is not the point of NTFS. It has many value-add features not found in FAT, such as file and directory permissions, encryption, compression, quotas, content-addressability (indexing) and so forth. These come at a cost, as do other features in NTFS that the OS relies on to provide facilities such as shadow copy and replication.

Longer code path - longer to wait for those I/Os to return!

Monday Jan 23, 2006

Filesystem Benchmarks: vxbench

For a long time I've used a simple I/O load generator from Veritas called vxbench for doing just that - generating I/O loads against systems that have been configured up, either in the lab or on customer sites. vxbench is a tool available on AIX, HP-UX, Linux and Solaris for benchmarking I/O loads on raw disks or file systems. It can produce various I/O workloads such as sequential and random reads/writes, asynchronous I/O and memory mapped (mmap) operations. It has many options specific to the VERITAS File System (VxFS).

It also has the characteristics I need in a simple load generator: it can generate multithreaded workloads, which are essential, and it has a simple command-line interface which makes it easy to incorporate into a scripting harness. It can also do strided reads/writes and sleep between operations - important for database-like workloads.
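
For what it's worth - and I'm working purely from memory here, so treat every flag below as approximate and check the usage message in your own copy of VRTSspt - an invocation looks something like this:

# roughly: 8 threads of sequential 8k writes against a single test file
vxbench -w write -i iosize=8k,iocount=131072,nthreads=8 /mnt/vxfs/testfile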

vxbench arrives on the CD in the package VRTSspt - the Veritas Software Support Tools - and most sites have it to hand. However, I've always shied away from publishing any work done with it because I've never quite pinned down its status as a piece of software in terms of copyright or license. Recently, though, I've been driven to take a closer look. Two papers appeared recently which I'm afraid I can cite but not give you a URL for:

  • Study of Linux I/O Performance Characteristics for Volume Managers on an Intel Xeon Server (2004) Xianneng Shen, Jim Nagler, Randy Taylor, Clark McDonald. Proc CMG 2004
  • I/O Performance Characteristics for Volume Managers on Linux 2.6 Servers (2005) Dan Yee, Xianneng Shen. Proc CMG 2005

As corporate history has moved on, the first paper is copyrighted by the VERITAS Software Corporation and the second by the Symantec Corporation. I never realised that CMG does not own the content of its own proceedings, but there you are. The second paper is a continuation of the first and uses the same methodology and tools. Yes: vxbench.

At first sight this is a little annoying. As another recent paper pointed out (I won't point you at it just yet because I want to talk about it in more detail in a later post), if you can't reproduce a benchmark from its report, it's not really very scientific, and I'm sure that's not what the authors of these papers intended. This need for the rigour imposed by writing reproducibility into benchmarking papers is one reason why people working in the field often resort to the "usual suspects" when looking for load generators - iozone, postmark, bonnie++. They all have their weaknesses but are at least available on the net.

So I set about tracking down vxbench. The header in the source code was not encouraging: "This software contains confidential information and trade secrets of VERITAS Software. Use, disclosure or reproduction is prohibited without prior express written permission of VERITAS Software". Well, I won't be sharing any more of the contents of vxbench.c with you, that's for sure. Onward!

On, then, to a Veritas support document pointed out to me by the README that comes with the package. Apparently you can download the package from the VERITAS ftp site (without the need to purchase media and/or a license). The support document was no more encouraging than the source header: "These tools are designed to be used under the direction of a VERITAS Technical Support Engineer only." Does this mean you shouldn't use them in other circumstances? (For "other circumstances" read "benchmarking against competing vendors of storage software".) Well, it seems you can. Document 261451 leaves out the sentence that follows the one I've quoted, but the README.VRTSspt continues on: "Any other use of these tools is at your own risk." So you can amuse yourself with vxbench and publish the results, but if you fry your disks and panic your system you have only yourself to blame.

Vxbench is a useful tool. Its availability is important - the implementors of Linux LVM (and VxVM!) will no doubt want to study these papers and work to improve their products. I'm glad Symantec continue to make it available to the storage software community.

Thursday Jan 05, 2006

Goodbye Windows XP Professional x64 Edition

It's time to say "Farewell" to Windows XP Professional x64 Edition. When I bought my Acer Ferrari 4000 laptop, it seemed the obvious choice - have 64-bit AMD processor; buy 64-bit operating system. I also have Solaris and SUSE Linux on the laptop, so why do I need XP at all? Well, I'm doing benchmarking on Windows, so it helps to have a sandpit to develop stuff in - a mantra I have to keep repeating to protect myself from those anxious to promote Solaris on the desktop (i.e. the other 400 people in this building).

Many things work OK on XP 64-bit, as you would expect: Macromedia Flash, Mozilla, NetBeans, OpenOffice, QuickTime, even more arcane things such as PostgreSQL, R, Ptolemy and Vim. So in general, binary compatibility was fine at the application level.

However, there was a list of things that just didn't: some didn't attempt to install; some got to the end of the install and then showed their disdain for their new home; some installed but would not run.

Most spooky of all was Microsoft's own Windows Services for UNIX. This is developed for the server market, so you would think...but no. Also, there is no RealPlayer for x64, nor any sign of one in the near future; you have to make do with a similar but less functional alternative.

There are no drivers for my scanner - a very popular HP model. This was when doubt began to set in. If these are not available from HP, is there a real problem? Yes of course there is. Yes of course there isn't. ("You work for who?") Solaris got around this when it moved to 64-bit in Solaris 7: it was 64-bit but retained the 32-bit framework and loaded whichever was appropriate; and if you didn't like one flavour of addressing, you could simply boot into the other.

Microsoft could have gone down this road but haven't, and the reasons are not hard to guess. Many if not most of the drivers in that world are written by third parties, and there are also good business reasons for a firm distinction between the 32-bit and 64-bit products. It's also only fair to say that take-up of Solaris 7 within the installed base was not immediate by any means (Y2K forced the pace eventually), so looked at in those terms, Microsoft are simply where Sun were about 8 years ago.

Continuing on down the list, I thought the new XP was responsible for the failure of Apple's iTunes to make contact with the iTunes store site. In fact hundreds of new iPod owners thought the same and mused about firewalls, virus protection software and all manner of other possible barriers to their enjoyment on Apple's self-help forum (note the singular absence of input from Apple!). Amusingly, it never dawned on any of these people (and I'm sure Apple would never admit to the notion) that, since it was December 25th, the whole world had just unwrapped its new Christmas present, installed the software and was eager to make its first iTunes purchase - and that the iTunes site might just have gone completely and utterly tits-up and be refusing any further custom due to insufficient attention to IT capacity planning on the part of its owners (who are, er, an IT company). It's one possible explanation, but I'm sure Apple can provide a more rational one.

I had already predicted the final nail in the coffin when the HP scanner drivers failed to exist: if Cisco's VPN drivers were not available, I would not be able to work from home, and there's not much point in having a laptop if you have to leave it at the office - and VPN drivers are pretty low-level stuff that might take a while to write and test. The good news is that I will be able to work from home. The bad news is that it won't be until the courier van turns up with a shiny new copy of the 32-bit version of Windows XP.

It would be tempting to reiterate the advice given to those contemplating marriage ("Don't."), but the actual moral of the story is that drivers in this world don't ship with the dual 32/64 frameworks that Solaris had, and you can't just reboot your way from one world to the other. If you do wish to go down the XP x64 Edition route, take advantage of the evaluation program and make use of the repository of drivers at PlanetAMD64. What I really hope is that the vendors will co-package the 32-bit and 64-bit versions and detect appropriately at install time (or even runtime). Chances?
