Performance insights and tips from a CPU-oriented perspective

A tweak for arcstat

Introduction

Neelakanth Nadgir posted a useful utility that prints out various statistics about the ZFS Adaptive Replacement Cache (ARC). Here is the download link to his original arcstat.pl. (Note: the link goes to the year 2007 version from realneel, at the corrected oracle.com location instead of the 404'd sun.com location.)

Minor problems with the output of the original arcstat.pl

If you have run arcstat.pl, you may have noticed a couple of odd things about its output.

1. Sometimes, it prints very tiny numbers while giving them entirely too many columns:

       Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz      c
   21:43:08   355    17      4    15    4     1    13    15    7   530M    17G
   21:43:38   746     0      0     0    0     0  -1.4210854715202e-14  0  0  2G  17G
   21:44:08   394     0      0     0    0     0     0     0    0     2G    17G
   21:44:38   588     0      0     0    0     0     0     0    0     2G    17G
   21:45:08    15     0      0     0    0     0     0     0    0     2G    17G
   21:45:38     2     0      0     0    0     0     0     0    0     2G    17G
   21:46:08    15     0      1     0  -1.4210854715202e-14  0  7  0  4  2G  17G
   21:46:38     5     0  -1.4210854715202e-14  0  0  0  0  0  0  2G  17G
   21:47:08     2     0      0     0    0     0     0     0    0     2G    17G
   21:47:38     3     0      0     0    0     0     0     0    0     2G    17G

2. The script was posted in 2007. Since then, some kernel statistic names have evolved. In particular, if evict_skip and recycle_miss are not available, there are many warnings:

   $ ./arcstat.pl
       Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz      c
   Use of uninitialized value in division (/) at ./arcstat.pl line 233.
   Use of uninitialized value in division (/) at ./arcstat.pl line 234.
   06:35:19    1G   28M      2    3M    0   24M   11    1M    0    17G    17G
   Use of uninitialized value in division (/) at ./arcstat.pl line 233.
   Use of uninitialized value in division (/) at ./arcstat.pl line 234.
   06:35:20   72K   36K     49   214    0   35K   92   135    0    17G    17G
   Use of uninitialized value in division (/) at ./arcstat.pl line 233.
   Use of uninitialized value in division (/) at ./arcstat.pl line 234.
   06:35:21   16K   10K     68   162    3   10K   99    57    1    17G    17G
   Use of uninitialized value in division (/) at ./arcstat.pl line 233.
   Use of uninitialized value in division (/) at ./arcstat.pl line 234.
Two minor bugfixes

1. When a tiny number comes along, do not use a whole bunch of columns to say "I saw a tiny number". Just print a zero.

Add this near the top:

   use Scalar::Util qw(looks_like_number);

And change this line:

   - return sprintf("%s", $num) if not $num =~ /^[0-9\.]+$/;
   + return sprintf("%s", $num) if not looks_like_number($num);

Discussion: The original test $num =~ /^[0-9\.]+$/ does not recognize that -1.4e-14 is, indeed, a number: the 'e' and the leading '-' do not match the character class. See Tom Christiansen's sermon from 1996 about recognizing numbers, which teaches that perl is better at recognizing numbers than your custom code is likely to be. Since perl v5.8, perl's own recognition routine can be called directly, via Scalar::Util.

2. Silence the warnings when the fields evict_skip and recycle_miss are not available. A simple fix:

   - $v{"eskip"} = $d{"evict_skip"}/$int;
   - $v{"rmiss"} = $d{"recycle_miss"}/$int;
   + $v{"eskip"} = $d{"evict_skip"}/$int if defined $d{"evict_skip"};
   + $v{"rmiss"} = $d{"recycle_miss"}/$int if defined $d{"recycle_miss"};

Discussion: Simple enough: don't operate on non-existent things.

Lazy blogger

There are other versions of arcstat out there. I am too lazy to go hunt them down. Please feel free to comment with a pointer to your favorite version.
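To see why the original regex misfires, one can compare a digits-and-dots pattern against a genuine numeric parse. The following is a minimal sketch, not part of the original patch: it uses a POSIX shell pattern as a stand-in for the perl regex, and awk's strtod-based coercion as a stand-in for looks_like_number.

```shell
val="-1.4210854715202e-14"

# Stand-in for the perl test /^[0-9\.]+$/: only digits and dots
# are allowed, so the '-' and the 'e' cause a mismatch.
case $val in
  *[!0-9.]*|"") regex_verdict="rejected" ;;
  *)            regex_verdict="accepted" ;;
esac

# awk coerces fields with strtod(), which understands signs and
# exponents, much as Scalar::Util::looks_like_number does.
awk_verdict=$(echo "$val" | awk '{ v = ($1 + 0 == $1) ? "accepted" : "rejected"; print v }')

echo "regex: $regex_verdict, numeric parse: $awk_verdict"
```

The pattern rejects the value while the numeric parse accepts it, which is exactly the disagreement that made arcstat.pl print the raw string instead of a tidy zero.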



Shrink a ZFS Root Pool, Solaris 11.1, SPARC

Revision info

Update III - 5-Aug-2013 10:15am EDT - Further clarification on VARSHARE has been added.
Update II - 4-Aug-2013 6pm EDT -
- A clarification has been added under the goals section.
- A kind reviewer points out that I forgot about the new file system VARSHARE. An update was added to discuss it.
Update I - 4-Aug-2013 10am EDT - A kind reader points out an opportunity for jealousy, which has been addressed.

Summary

A root pool cannot be shrunk in a single operation, but it can be copied to a smaller partition. The Solaris 11 beadm command makes the task easier than it used to be. In this example, the pool is copied, and then mirrored.

- Use format to create a smaller partition on a new device, say c0tXs0
- # zpool create -f newpool c0tXs0
- # beadm create -a -d "smaller s11.1" -p newpool solaris-SRUnn
- Use {ok} probe-scsi-all and {ok} devalias to identify the new disk
- {ok} setenv boot-device diskNN
- Boot the new system, and clean up or copy (zfs send/receive) other file systems from the old device (e.g. /export, /export/home, perhaps also swap, dump, and VARSHARE)
- Use zpool export - or use zpool destroy - to hide or destroy the original
- Use format to create the mirror partition, say c0tYs0
- # zpool attach -f newpool c0tXs0 c0tYs0
- Allow the resilver to complete
- At OBP, hunt down c0tY and boot the mirror

A detailed example follows.

Contents

1. Goal: shrink a root pool on a SPARC system.
   a. Sidebar: Why?
   b. A long-discussed feature, and a unicorn
   c. Web Resources
      (i) Other bloggers
      (ii) Solaris 11.1 Documentation Library
2. Initial state: one very large rpool, mostly empty
3. Create newpool and new swap/dump
   a. Delete old swap, use the new
   b. Delete old dump, use the new
4. The actual copy
   a. Let beadm do the work!
   b. Thank you, beadm, for automatically taking care of:
      (i) activation, (ii) bootfs, (iii) bootloader, and (iv) menu.lst
      Update: beadm missed one item... VARSHARE
5. Boot the new system (after a little OBP hunting)
6. Cleanup
   a. Copy additional file systems
   b. Hide - or delete - the original
7. Mirror the newpool
8. Final verification
Thank you

1. Goal: shrink a root pool on a SPARC system.

A large root pool occupies most of a disk. I would like to make it much smaller.

a. Sidebar: Why?

Why do I want to shrink? Because the desired configuration is:
- Mirrored root pool
- Large swap partition, not mirrored

That is, the current configuration is:

   diskX
   c0tXs0  rpool
           rpool/ROOT/solaris
           rpool/swap

I do not want to just add a mirror, like this:

   diskX                          diskY
   c0tXs0  rpool                  c0tYs0  rpool
           rpool/ROOT/solaris             solaris (copy)
           rpool/swap                     swap (copy)

Instead, the goal is to have twice as much swap, like so:

   diskX                          diskY
   c0tXs0  rpool                  c0tYs0  rpool
           rpool/ROOT/solaris             solaris (copy)
   c0tXs1  swap                   c0tYs1  more swap

Clarification: bytes of disk vs. bytes of memory. At least one reader seemed to want a clarification of the point of the above. To be explicit:
- A 2-way mirrored rpool with a ZFS swap volume of size N spends 2 x N bytes of disk space to provide backing store for N bytes of memory.
- Two swap partitions, each of size N, spend 2 x N bytes of disk space and provide backing store for 2 x N bytes of memory.

As it happens, due to the planned workload, this particular system is going to need a lot of swap space. Therefore, I prefer to avoid mirrored swap.

b. A long-discussed feature, and a unicorn

The word "shrink" does not appear in the ZFS admin guide. Discussions at an archived ZFS discussion group assert that the feature was under active development in 2007, but by 2010 the short summary was "it's hiding behind the unicorn". Apparently, the feature is difficult, and demand simply has not been high enough.

c. Web Resources

Well, if there is no shrink feature, surely it can be done by other methods, right?
Well....

(i) Other bloggers

If one uses Google to search for "shrink rpool", the top two blog entries that are returned appear to be relevant:

   http://resilvered.blogspot.com/2011/07/how-to-shrink-zfs-root-pool.html
   https://blogs.oracle.com/mock/entry/how_to_shrink_a_mirrored

Both of the above are old, written well prior to the release of Solaris 11. Both also use x86 volumes and conventions, not SPARC. Nevertheless, they contain some useful clues.

(ii) Solaris 11.1 Documentation Library

Since the above blog entries are dated, I also tried to use contemporary documentation. These were consulted, along with the corresponding man pages:

   Solaris 11.1 library
   ZFS File Systems, especially Chapter 4
   Booting and Shutting Down Oracle Solaris 11.1 Systems, especially Chapter 4
   Creating and Administering Oracle Solaris 11.1 Boot Environments

2. Initial state: one very large rpool, mostly empty

Here are the initial pools, file systems, and boot environments. Note that there is a large 556 GB rpool, and it is not mirrored.

   # zpool status
     pool: rpool
    state: ONLINE
     scan: none requested
   config:
           NAME                       STATE     READ WRITE CKSUM
           rpool                      ONLINE       0     0     0
             c0t5000CCA0224D6354d0s0  ONLINE       0     0     0
   #
   # zpool list
   NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
   rpool  556G  77.0G  479G  13%  1.00x  ONLINE  -
   #
   # zfs list
   NAME                       USED  AVAIL  REFER  MOUNTPOINT
   rpool                     79.2G   468G  73.5K  /rpool
   rpool/ROOT                5.60G   468G    31K  legacy
   rpool/ROOT/solaris        42.6M   468G  3.78G  /
   rpool/ROOT/solaris-1      5.56G   468G  3.79G  /
   rpool/ROOT/solaris-1/var   657M   468G   521M  /var
   rpool/ROOT/solaris/var    38.9M   468G   221M  /var
   rpool/VARSHARE            83.5K   468G    58K  /var/share
   rpool/dump                66.0G   470G  64.0G  -
   rpool/export              3.43G   468G    32K  /export
   rpool/export/home         3.43G   468G  3.43G  /export/home
   rpool/swap                4.13G   468G  4.00G  -

   # beadm list
   BE        Active Mountpoint Space  Policy Created
   --        ------ ---------- -----  ------ -------
   solaris   -      -          81.58M static 2013-07-10 17:19
   solaris-1 NR     /          6.88G  static 2013-07-31 12:27
   #

The partitions on the original boot disk are:

   # format
   ...
   partition> p
   Volume:  solaris
   Current partition table (original):
   Total disk cylinders available: 64986 + 2 (reserved cylinders)

   Part      Tag    Flag     Cylinders        Size            Blocks
     0       root    wm       0 - 64985      558.89GB    (64986/0/0) 1172087496
     1 unassigned    wm       0                  0       (0/0/0)              0
     2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
     3 unassigned    wm       0                  0       (0/0/0)              0
     4 unassigned    wm       0                  0       (0/0/0)              0
     5 unassigned    wm       0                  0       (0/0/0)              0
     6 unassigned    wm       0                  0       (0/0/0)              0
     7 unassigned    wm       0                  0       (0/0/0)              0

3. Create newpool and new swap/dump

The format utility was used to create a new, smaller partition for the root pool. The first swap partition was also created.

   # format
   ...
   partition> p
   Volume:  smallsys
   Current partition table (unnamed):
   Total disk cylinders available: 64986 + 2 (reserved cylinders)

   Part      Tag    Flag     Cylinders        Size            Blocks
     0       root    wm       0 - 11627      100.00GB    (11628/0/0)  209722608
     1       swap    wu   11628 - 34883      200.01GB    (23256/0/0)  419445216
     2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
     3 unassigned    wm       0                  0       (0/0/0)              0   (*)
     4 unassigned    wm       0                  0       (0/0/0)              0   (*)
     5 unassigned    wm       0                  0       (0/0/0)              0   (*)
     6 unassigned    wm       0                  0       (0/0/0)              0   (*)
     7 unassigned    wm       0                  0       (0/0/0)              0   (*)

   partition> label
   Ready to label disk, continue? y

And the new pool was created with zpool create:

   # zpool create -f newpool c0t5000CCA0224D62A0d0s0
   #

(*) Note: one might ask, what about the other 250 GB available on the disk? Yes, this user has a plan in mind to use that space. It is not terribly relevant to the concerns covered in this particular blog entry, and so is left aside for now.

a. Delete old swap, use the new

As noted in the introduction, there will eventually be multiple swap partitions, and they will not be in the root pool. The first new swap partition was just created, above. Therefore, for my purposes, I might as well delete the originals now, if only because it would be useless to copy them.
(Your needs may differ!)

In a separate window, a new vfstab was created, which removes the zvol swap and adds the new swap partition:

   # cd /etc
   # diff vfstab.orig vfstab.withnewswap
   12c12
   < /dev/zvol/dsk/rpool/swap          -  -  swap  -  no  -
   ---
   > /dev/dsk/c0t5000CCA0224D62A0d0s1  -  -  swap  -  no  -
   #

The commands below display the current swap partition, add the new one, and display the result.

   # swap -lh
   swapfile                          dev     swaplo  blocks  free
   /dev/zvol/dsk/rpool/swap          285,2   8K      4.0G    4.0G
   #
   # /sbin/swapadd
   # swap -lh
   swapfile                          dev     swaplo  blocks  free
   /dev/zvol/dsk/rpool/swap          285,2   8K      4.0G    4.0G
   /dev/dsk/c0t5000CCA0224D62A0d0s1  203,49  8K      200G    200G
   #

Next, use swap -d to stop swapping on the old device, and then destroy it.

   # swap -d /dev/zvol/dsk/rpool/swap
   # swap -lh
   swapfile                          dev     swaplo  blocks  free
   /dev/dsk/c0t5000CCA0224D62A0d0s1  203,49  8K      200G    200G
   #
   # zfs destroy rpool/swap
   #

b. Delete old dump, use the new

The largest part of the original pool is the dump device. Since we now have a large swap file, we can use that instead:

   # dumpadm
         Dump content: kernel pages
          Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
   Savecore directory: /var/crash
     Savecore enabled: yes
      Save compressed: on
   #
   # dumpadm -d swap
         Dump content: kernel pages
          Dump device: /dev/dsk/c0t5000CCA0224D62A0d0s1 (swap)
   Savecore directory: /var/crash
     Savecore enabled: yes
      Save compressed: on
   #

And the volume can now be destroyed. Emphasis: your needs may differ. You may prefer to keep swap and dump volumes in the new pool.

   # zfs destroy rpool/dump

4. The actual copy

Let's use pkg info to figure out a good name for the new BE. From the material below, it appears that this is Support Repository Update 9.5:

   # pkg info entire (**)
             Name: entire
          Summary: entire incorporation including Support Repository Update (Oracle Solaris
        Publisher: solaris
   Packaging Date: Thu Jul 04 03:10:15 2013
   #
   (**) Output has been abbreviated for readability

At this point, I did something that I thought would be needed, based on previous blog entries; but as you will see in a moment, it was not needed yet.

   # zfs snapshot -r rpool@orig_before_shrink

a. Let beadm do the work!

The previous blog entries at this point made use of zfs send and zfs receive. In a first attempt at this copy, so did I; but a more careful reading of the manpage indicated that beadm create would probably be a better idea. For the sake of brevity, the send/receive side track is omitted. Here is the first attempt with beadm create:

   # beadm create -a -d "smaller s11.1" -e rpool@orig_before_shrink \
   > -p newpool s11.1-sru9.5
   be_copy: failed to find zpool for BE (rpool)
   Unable to create s11.1-sru9.5.

(oops) Hmmm, it claims that it cannot find the snapshot that was just created a minute ago. Reading the manpage, I realize that a "beadm snapshot" is not exactly the same concept as a "zfs snapshot". OK. The manpage also says that if -e is not provided, then it will clone the current environment. Sounds good to me.

   # beadm create -a -d "smaller s11.1" -p newpool s11.1-sru9.5
   #

The above command took only a few minutes, probably because rpool did not have a lot of content.
Here is the result:

   # zfs list
   NAME                            USED  AVAIL  REFER  MOUNTPOINT
   newpool                        4.30G  93.6G  73.5K  /newpool
   newpool/ROOT                   4.30G  93.6G    31K  legacy
   newpool/ROOT/s11.1-sru9.5      4.30G  93.6G  3.79G  /
   newpool/ROOT/s11.1-sru9.5/var   519M  93.6G   519M  /var
   rpool                          9.04G   538G  73.5K  /rpool
   rpool/ROOT                     5.60G   538G    31K  legacy
   rpool/ROOT/solaris             42.6M   538G  3.78G  /
   rpool/ROOT/solaris-1           5.56G   538G  3.79G  /
   rpool/ROOT/solaris-1/var        659M   538G   519M  /var
   rpool/ROOT/solaris/var         38.9M   538G   221M  /var
   rpool/VARSHARE                 83.5K   538G    58K  /var/share
   rpool/export                   3.43G   538G    32K  /export
   rpool/export/home              3.43G   538G  3.43G  /export/home
   #

Note that /export and /export/home were not copied. We will come back to these later.

b. Thank you, beadm, for automatically taking care of...

The older blogs mentioned several additional steps that had to be performed when copying root pools. As I checked into each of these topics, it turned out - repeatedly - that beadm create had already taken care of it.

(i) activation

The new Boot Environment will be active on reboot, as shown by the code "R" below, because the above beadm create command included the -a switch.

   # beadm list
   BE           Active Mountpoint Space  Policy Created
   --           ------ ---------- -----  ------ -------
   s11.1-sru9.5 R      -          4.80G  static 2013-08-02 10:22
   solaris      -      -          81.58M static 2013-07-10 17:19
   solaris-1    NR     /          6.88G  static 2013-07-31 12:27

(ii) bootfs

The older blogs (which used zfs send/recv) mentioned that the bootfs property needs to be set on the new pool. This is no longer needed: beadm create already set it automatically. Thank you, beadm.

   # zpool list -o name,bootfs
   NAME     BOOTFS
   newpool  newpool/ROOT/s11.1-sru9.5
   rpool    rpool/ROOT/solaris-1
   #

(iii) bootloader

The disk will need a bootloader. Here, some history may be of interest. A few years ago:
- Frequently, system administrators needed to add bootloaders, for example, any time a mirror was created.
- The method differed by platform: installboot on SPARC, or something grub-ish on x86.

Today:
- The bootloader is added automatically when root pools are mirrored.
- And if for some reason you do need to add one by hand, the command is now bootadm install-bootloader, which in turn calls installboot on your behalf, or messes with grub on your behalf.

The question for the moment: has a bootloader been placed on the new disk? Here is the original boot archive, on rpool - notice the -R / for the path:

   # bootadm list-archive -R /
   platform/SUNW,Netra-CP3060/kernel
   platform/SUNW,Netra-CP3260/kernel
   platform/SUNW,Netra-T2000/kernel
   platform/SUNW,Netra-T5220/kernel
   platform/SUNW,Netra-T5440/kernel
   platform/SUNW,SPARC-Enterprise-T1000/kernel
   platform/SUNW,SPARC-Enterprise-T2000/kernel
   platform/SUNW,SPARC-Enterprise-T5120/kernel
   platform/SUNW,SPARC-Enterprise-T5220/kernel
   platform/SUNW,SPARC-Enterprise/kernel
   platform/SUNW,Sun-Blade-T6300/kernel
   platform/SUNW,Sun-Blade-T6320/kernel
   platform/SUNW,Sun-Blade-T6340/kernel
   platform/SUNW,Sun-Fire-T1000/kernel
   platform/SUNW,Sun-Fire-T200/kernel
   platform/SUNW,T5140/kernel
   platform/SUNW,T5240/kernel
   platform/SUNW,T5440/kernel
   platform/SUNW,USBRDT-5240/kernel
   platform/sun4v/kernel
   etc/cluster/nodeid
   etc/dacf.conf
   etc/driver
   etc/mach
   kernel
   #

After mounting the newly created environment, it can be seen that it also has one. There is no need to use installboot nor bootadm install-bootloader, because the beadm create command already took care of it. Thank you, beadm.

   # beadm mount s11.1-sru9.5 /mnt
   #
   # bootadm list-archive -R /mnt
   platform/SUNW,Netra-CP3060/kernel
   platform/SUNW,Netra-CP3260/kernel
   platform/SUNW,Netra-T2000/kernel
   platform/SUNW,Netra-T5220/kernel
   platform/SUNW,Netra-T5440/kernel
   platform/SUNW,SPARC-Enterprise-T1000/kernel
   platform/SUNW,SPARC-Enterprise-T2000/kernel
   platform/SUNW,SPARC-Enterprise-T5120/kernel
   platform/SUNW,SPARC-Enterprise-T5220/kernel
   platform/SUNW,SPARC-Enterprise/kernel
   platform/SUNW,Sun-Blade-T6300/kernel
   platform/SUNW,Sun-Blade-T6320/kernel
   platform/SUNW,Sun-Blade-T6340/kernel
   platform/SUNW,Sun-Fire-T1000/kernel
   platform/SUNW,Sun-Fire-T200/kernel
   platform/SUNW,T5140/kernel
   platform/SUNW,T5240/kernel
   platform/SUNW,T5440/kernel
   platform/SUNW,USBRDT-5240/kernel
   platform/sun4u/kernel
   platform/sun4v/kernel
   etc/cluster/nodeid
   etc/dacf.conf
   etc/driver
   etc/mach
   kernel
   #

(iv) menu.lst

The first Google reference above includes this sentence: "Change all the references to [the new pool] in the menu.1st file." That sounds GRUBish, for x86, and not very much like SPARC. As it turns out, though, yes, there is a menu.lst file for SPARC:

   # cat /rpool/boot/menu.lst
   title Oracle Solaris 11.1 SPARC
   bootfs rpool/ROOT/solaris
   title solaris-1
   bootfs rpool/ROOT/solaris-1
   #

And, oh, look at this: beadm create also made a new menu.lst on the new pool. Thank you, beadm.

   # cat /newpool/boot/menu.lst
   title smaller s11.1
   bootfs newpool/ROOT/s11.1-sru9.5
   #

Update: beadm missed one item... VARSHARE

Update (III): WHAT'S MISSING? An earlier update to this blog entry pointed out that I forgot about VARSHARE. It has been further clarified that the right time to worry about it is actually BEFORE the reboot. OK. So, if you are following this blog while working on a system of your own, do that zfs list command now, before rebooting. If VARSHARE is present, migrate it now.

5. Boot the new system (after a little OBP hunting)

Attempt to boot the new pool.
First, remind myself of the disk ids, and then head off towards OBP.

   # zpool status (**)
           NAME                       STATE     READ WRITE CKSUM
           newpool                    ONLINE       0     0     0
   -->       c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
           rpool                      ONLINE       0     0     0
             c0t5000CCA0224D6354d0s0  ONLINE       0     0     0
   # shutdown -y -g0 -i0

   {0} ok probe-scsi-all (**)
   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0
   Target a
     Unit 0  Disk  HITACHI  H109060SESUN600G A31A  1172123568 Blocks, 600 GB  SAS
     SASDeviceName 5000cca0224d62a0  SASAddress 5000cca0224d62a1  PhyNum 1
                   ^^^^^^^^^^^^^^^^

It looks like the newly created pool is on /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0 at PhyNum 1 - note its SASDeviceName 5000cca0224d62a0, which matches newpool at solaris device c0t5000CCA0224D62A0d0s0. Is there a device alias that also matches?

   {0} ok devalias
   screen  /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@7/display@0
   disk7   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p3
   disk6   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p2
   disk5   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p1   <--
   disk4   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p0
   scsi1   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0
   net3    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1
   net2    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0
   disk3   /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p3
   disk2   /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p2
   disk1   /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p1
   disk    /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p0
   disk0   /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p0
   ...

The OBP alias disk5 matches the desired disk. Try the new disk, checking: does the boot -L switch include the desired new BE?

   {0} ok boot disk5 -L
   Boot device: /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p1  File and args: -L
   1 smaller s11.1
   Select environment to boot: [ 1 - 1 ]: 1
   To boot the selected entry, invoke:
   boot [] -Z newpool/ROOT/s11.1-sru9.5   <--

Yes, disk5 offers a choice that matches the new boot environment. Point the OBP boot-device at it, and off we go.

   {0} ok setenv boot-device disk5
   {0} ok boot

6. Cleanup

The system booted successfully. After the first boot, note that - as mentioned earlier - /export and /export/home are in the original pool:

   # zfs list -r rpool
   NAME                       USED  AVAIL  REFER  MOUNTPOINT
   rpool                     9.04G   538G  73.5K  /rpool
   rpool/ROOT                5.61G   538G    31K  legacy
   rpool/ROOT/solaris        42.6M   538G  3.78G  /
   rpool/ROOT/solaris-1      5.57G   538G  3.79G  /
   rpool/ROOT/solaris-1/var   660M   538G   519M  /var
   rpool/ROOT/solaris/var    38.9M   538G   221M  /var
   rpool/VARSHARE             108K   538G  58.5K  /var/share
   rpool/export              3.43G   538G    32K  /export
   rpool/export/home         3.43G   538G  3.43G  /export/home
   #
   # zfs list -r newpool
   NAME                            USED  AVAIL  REFER  MOUNTPOINT
   newpool                        4.33G  93.6G  73.5K  /newpool
   newpool/ROOT                   4.33G  93.6G    31K  legacy
   newpool/ROOT/s11.1-sru9.5      4.33G  93.6G  3.79G  /
   newpool/ROOT/s11.1-sru9.5/var   524M  93.6G   519M  /var
   newpool/VARSHARE                 43K  93.6G    43K  /var/share
   #

Update: VARSHARE. Mike Gerdts points out what I missed in previous readings of the above: notice that rpool/VARSHARE contained some data that has not been migrated to newpool/VARSHARE. The VARSHARE file system provides a convenient place to store crash dumps, audit records, and similar data that can be shared across boot environments, as described under What's New with ZFS? in the updated ZFS Admin Guide. Unfortunately, I missed my chance to migrate that data; it's gone. Fortunately, I didn't lose very much (about 108 KB, according to the above). If you are following this blog as you work on your own system, one hopes you noticed the note above about VARSHARE. If not, then now would be a good moment to review the status of VARSHARE on your system, potentially merging the previous content with whatever has accumulated since the reboot.

Swap/dump - as already discussed, swap and dump were intentionally not migrated, because they are handled elsewhere. Your needs may differ.
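Returning to VARSHARE for a moment: this post does not show the migration commands, but a sketch along the lines of the zfs send/receive used below for /export would presumably serve. This is a hypothetical sketch, not part of the original procedure: the snapshot name is invented, -F overwrites the nearly-empty VARSHARE that was created in the new pool, and -u avoids mounting over the live /var/share.

```
# zfs snapshot rpool/VARSHARE@migrate
# zfs send rpool/VARSHARE@migrate | zfs receive -Fu newpool/VARSHARE
```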
If you are following this blog as you work on your own system, now would be a good moment to ensure that you have figured out what you want to do for your swap / dump volumes.

a. Copy additional file systems

Earlier, a snapshot was created, using:

   # zfs snapshot -r rpool@orig_before_shrink

That snapshot is used in a zfs send/receive command, which goes quickly, but it ends with an error:

   # zfs send -vR rpool/export@orig_before_shrink | zfs receive -vFd newpool
   sending from @ to rpool/export@before_shrink
   receiving full stream of rpool/export@before_shrink into newpool/export@before_shrink
   sending from @before_shrink to rpool/export@orig_before_shrink
   sending from @ to rpool/export/home@before_shrink
   received 47.9KB stream in 1 seconds (47.9KB/sec)
   receiving incremental stream of rpool/export@orig_before_shrink into newpool/export@orig_before_shrink
   received 8.11KB stream in 3 seconds (2.70KB/sec)
   receiving full stream of rpool/export/home@before_shrink into newpool/export/home@before_shrink
   sending from @before_shrink to rpool/export/home@orig_before_shrink
   received 3.46GB stream in 78 seconds (45.4MB/sec)
   receiving incremental stream of rpool/export/home@orig_before_shrink into newpool/export/home@orig_before_shrink
   received 198KB stream in 2 seconds (99.1KB/sec)
   cannot mount 'newpool/export' on '/export': directory is not empty
   cannot mount 'newpool/export' on '/export': directory is not empty
   cannot mount 'newpool/export/home' on '/export/home': failure mounting parent dataset

The problem above is that more than one file system is eligible to be mounted. A better solution - pointed out by a kind reviewer - would have been to use the zfs receive -u switch:

   # zfs send -vR rpool/export@orig_before_shrink | zfs receive -vFdu newpool

The -u switch would have avoided the attempt to mount the newly created file systems.

b. Hide - or delete - the original

Warning: YMMV. Because I am nearly certain that I will soon be destroying the original rpool, my solution was to disqualify the old file systems from mounting. Your mileage may vary. For example, you might prefer to leave the old pool unchanged, in case it is needed later. In that case, you could skip directly to the export, described below. Anyway, the following was satisfactory for my needs. I changed the canmount property:

   # zfs list -o mounted,canmount,mountpoint,name -r rpool
   MOUNTED  CANMOUNT  MOUNTPOINT    NAME
   yes      on        /rpool        rpool
   no       off       legacy        rpool/ROOT
   no       noauto    /             rpool/ROOT/solaris
   no       noauto    /             rpool/ROOT/solaris-1
   no       noauto    /var          rpool/ROOT/solaris-1/var
   no       noauto    /var          rpool/ROOT/solaris/var
   no       noauto    /var/share    rpool/VARSHARE
   yes      on        /export       rpool/export
   yes      on        /export/home  rpool/export/home
   #
   # zfs list -o mounted,canmount,mountpoint,name -r newpool
   MOUNTED  CANMOUNT  MOUNTPOINT    NAME
   yes      on        /newpool      newpool
   no       off       legacy        newpool/ROOT
   yes      noauto    /             newpool/ROOT/s11.1-sru9.5
   yes      noauto    /var          newpool/ROOT/s11.1-sru9.5/var
   yes      noauto    /var/share    newpool/VARSHARE
   no       on        /export       newpool/export
   no       on        /export/home  newpool/export/home
   #
   # zfs set canmount=noauto rpool/export
   # zfs set canmount=noauto rpool/export/home
   # reboot

After the reboot, only one data set from the original pool is mounted:

   # zfs list -r -o name,mounted,canmount,mountpoint
   NAME                           MOUNTED  CANMOUNT  MOUNTPOINT
   newpool                        yes      on        /newpool
   newpool/ROOT                   no       off       legacy
   newpool/ROOT/s11.1-sru9.5      yes      noauto    /
   newpool/ROOT/s11.1-sru9.5/var  yes      noauto    /var
   newpool/VARSHARE               yes      noauto    /var/share
   newpool/export                 yes      on        /export
   newpool/export/home            yes      on        /export/home
   rpool                          yes      on        /rpool
   rpool/ROOT                     no       off       legacy
   rpool/ROOT/solaris             no       noauto    /
   rpool/ROOT/solaris-1           no       noauto    /
   rpool/ROOT/solaris-1/var       no       noauto    /var
   rpool/ROOT/solaris/var         no       noauto    /var
   rpool/VARSHARE                 no       noauto    /var/share
   rpool/export                   no       noauto    /export
   rpool/export/home              no       noauto    /export/home
   #

The canmount property could be set for that one too, but a better solution - as suggested by the kind reviewer - is to zpool export the pool. The export command ensures that none of it will be seen until/unless a later zpool import command is done (which will not be done in this case, because I want to re-use the space for other purposes).

   # zpool export rpool
   # reboot
   ...
   $ zpool status
     pool: newpool
    state: ONLINE
     scan: none requested
   config:
           NAME                       STATE     READ WRITE CKSUM
           newpool                    ONLINE       0     0     0
             c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
   errors: No known data errors
   $
   $ zfs list
   NAME                            USED  AVAIL  REFER  MOUNTPOINT
   newpool                        7.80G  90.1G  73.5K  /newpool
   newpool/ROOT                   4.37G  90.1G    31K  legacy
   newpool/ROOT/s11.1-sru9.5      4.37G  90.1G  3.79G  /
   newpool/ROOT/s11.1-sru9.5/var   530M  90.1G   521M  /var
   newpool/VARSHARE               45.5K  90.1G  45.5K  /var/share
   newpool/export                 3.43G  90.1G    32K  /export
   newpool/export/home            3.43G  90.1G  3.43G  /export/home
   $

7. Mirror the newpool

The new root pool has been created, it boots, it has the correct size, and it now has all the right data sets. Mirror it. Set up the partitions for the mirror volume to match the volume that holds newpool:

   partition> p
   Volume:  mirror
   Current partition table (unnamed):
   Total disk cylinders available: 64986 + 2 (reserved cylinders)

   Part      Tag    Flag     Cylinders        Size            Blocks
     0       root    wm       0 - 11627      100.00GB    (11628/0/0)  209722608
     1       swap    wu   11628 - 34883      200.01GB    (23256/0/0)  419445216
     2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
     3 unassigned    wm       0                  0       (0/0/0)              0
     4 unassigned    wm       0                  0       (0/0/0)              0
     5 unassigned    wm       0                  0       (0/0/0)              0
     6 unassigned    wm       0                  0       (0/0/0)              0
     7 unassigned    wm       0                  0       (0/0/0)              0

   partition> label
   Ready to label disk, continue? y

Start the mirror operation:

   # zpool status
     pool: newpool
    state: ONLINE
     scan: none requested
   config:
           NAME                       STATE     READ WRITE CKSUM
           newpool                    ONLINE       0     0     0
             c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
   errors: No known data errors
   #
   # zpool attach -f newpool c0t5000CCA0224D62A0d0s0 c0t5000CCA0224D6A30d0s0
   Make sure to wait until resilver is done before rebooting.
   #
   # zpool status
     pool: newpool
    state: DEGRADED
   status: One or more devices is currently being resilvered.  The pool will
           continue to function in a degraded state.
   action: Wait for the resilver to complete.
           Run 'zpool status -v' to see device specific details.
     scan: resilver in progress since Sat Aug  3 07:49:25 2013
           413M scanned out of 7.83G at 20.6M/s, 0h6m to go
           409M resilvered, 5.15% done
   config:
           NAME                         STATE     READ WRITE CKSUM
           newpool                      DEGRADED     0     0     0
             mirror-0                   DEGRADED     0     0     0
               c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
               c0t5000CCA0224D6A30d0s0  DEGRADED     0     0     0  (resilvering)
   errors: No known data errors
   #

It says it will complete in 6 minutes. OK, wait that long and check again:

   # sleep 360; zpool status
     pool: newpool
    state: ONLINE
     scan: resilvered 7.83G in 0h3m with 0 errors on Sat Aug  3 07:52:30 2013
   config:
           NAME                         STATE     READ WRITE CKSUM
           newpool                      ONLINE       0     0     0
             mirror-0                   ONLINE       0     0     0
               c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
               c0t5000CCA0224D6A30d0s0  ONLINE       0     0     0
   errors: No known data errors
   #

Recall that a part of the goal was to have 2x swap partitions, not mirrored. Add the second one now.

   # swap -lh
   swapfile                          dev     swaplo  blocks  free
   /dev/dsk/c0t5000CCA0224D62A0d0s1  203,49  8K      200G    200G
   #
   # echo "/dev/dsk/c0t5000CCA0224D6A30d0s1  -  -  swap  -  no  -" >> /etc/vfstab
   # /sbin/swapadd
   # swap -lh
   swapfile                          dev     swaplo  blocks  free
   /dev/dsk/c0t5000CCA0224D62A0d0s1  203,49  8K      200G    200G
   /dev/dsk/c0t5000CCA0224D6A30d0s1  203,41  8K      200G    200G
   #

8. Final verification

When the zpool attach command above was issued, the root pool was mirrored. As mentioned previously, in the days of our ancestors, one had to follow this up by adding the bootloader. Now, thanks to the updated zpool attach, it happens automatically. Verify the feature by booting the other side of the mirror:

   # shutdown -y -g0 -i0
   ...
   {0} ok printenv boot-device
   boot-device = disk5
   {0} ok boot disk4
   Boot device: /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p0  File and args:
   SunOS Release 5.11 Version 11.1 64-bit
   Copyright (c) 1983, 2012, Oracle and/or its affiliates.  All rights reserved.

Thank you

Thank you to bigal, and to Joe Mocker, for their starting points. Thank you to Cloyce Spradling for review of drafts of this post. Also, Michael Ramchand noted that the first post of this blog forgot to thank zpool attach in step 8, which has been fixed; that was important because, as Michael noted, "It might get jealous after all the thanks that beadm got."

(this space intentionally left blank)



IBM "per core" comparisons for SPECjEnterprise2010

I recently stumbled upon a blog entry from Roman Kharkovski (an IBM employee) comparing some SPECjEnterprise2010 results for IBM vs. Oracle. Mr. Kharkovski's blog claims that SPARC delivers half the transactions per core vs. POWER7.

Prior to any argument, I should say that my predisposition is to like Mr. Kharkovski, because he says that his blog is intended to be factual; that the intent is to try to avoid marketing hype and FUD tactics; and mostly because he features a picture of himself wearing a bike helmet (me too). Therefore, in a spirit of technical argument, rather than FUD fight, there are a few areas in his comparison that should be discussed.

Scaling is not free

For any benchmark, if a small system scores 13k using quantity R1 of some resource, and a big system scores 57k using quantity R2 of that resource, then, sure, it's tempting to divide: is 13k/R1 > 57k/R2? It is tempting, but not necessarily educational. The problem is that scaling is not free. Building big systems is harder than building small systems. Scoring 13k/R1 on a little system provides no guarantee whatsoever that one can sustain that ratio when attempting to handle more than 4 times as many users.

Choosing the denominator radically changes the picture

When ratios are used, one can vastly manipulate appearances by the choice of denominator. In this case, lots of choices are available for the resource to be compared (R1 and R2 above). IBM chooses to put cores in the denominator. Mr. Kharkovski provides some reasons for that choice in his blog entry. And yet, it should be noted that the very concept of a core is:
- arbitrary: not necessarily comparable across vendors;
- fluid: modern chips shift chip resources in response to load; and
- invisible: unless you have a microscope, you can't see it.

By contrast, one can actually see processor chips with the naked eye, and they are a bit easier to count.
If we put chips in the denominator instead of cores, we get:

13161.07 EjOPS / 4 chips = 3290 EjOPS per chip for IBM
57422.17 EjOPS / 16 chips = 3588 EjOPS per chip for Oracle

The choice of denominator makes all the difference in the appearance. Speaking for myself, dividing by chips just seems to make more sense, because:

- I can see chips and count them; and
- I can accurately compare the number of chips in my system to the count in some other vendor's system; and
- the probability of being able to continue to accurately count chips over the next 10 years of microprocessor development seems higher than the probability of being able to accurately and comparably count "cores".

SPEC Fair Use requirements

Speaking as an individual, not speaking for SPEC and not speaking for my employer, I wonder whether Mr. Kharkovski's blog article, taken as a whole, meets the requirements of the SPEC Fair Use rule, www.spec.org/fairuse.html section I.D.2. For example, Mr. Kharkovski's footnote (1) begins:

Results from http://www.spec.org as of 04/04/2013 Oracle SUN SPARC T5-8 449 EjOPS/core SPECjEnterprise2010 (Oracle's WLS best SPECjEnterprise2010 EjOPS/core result on SPARC). IBM Power730 823 EjOPS/core (World Record SPECjEnterprise2010 EJOPS/core result)

The questionable tactic, from a Fair Use point of view, is that there is no such metric at the designated location. At www.spec.org, you can find the SPEC metric 57422.17 SPECjEnterprise2010 EjOPS for Oracle, and you can also find the SPEC metric 13161.07 SPECjEnterprise2010 EjOPS for IBM. Despite the implication of the footnote, you will not find any mention of 449, nor anything that says 823. SPEC says that you can, under its fair use rule, derive your own values; but it emphasizes: "The context must not give the appearance that SPEC has created or endorsed the derived value."
Substantiation and transparency

Although SPEC disclaims responsibility for non-SPEC information (section I.E), it says that non-SPEC data and methods should be accurate, should be explained, and should be substantiated. Unfortunately, it is difficult or impossible for the reader to independently verify the pricing:

- Were like units compared to like (e.g. list price to list price)?
- Were all components (hw, sw, support) included?
- Were all fees included?

Note that when tpc.org shows IBM pricing, there are often items such as "PROCESSOR ACTIVATION" and "MEMORY ACTIVATION". Without the transparency of a detailed breakdown, the pricing claims are questionable.

T5 claim for "Fastest Processor"

Mr. Kharkovski several times questions Oracle's claim for fastest processor, writing: "You see, when you publish industry benchmarks, people may actually compare your results to other vendor's results." Well, as we performance people always say, "it depends". If you believe in performance-per-core as the primary way of looking at the world, then yes, the POWER7+ is impressive, spending its chip resources to support up to 32 threads (8 cores x 4 threads). Or, it just might be useful to consider performance-per-chip. Each SPARC T5 chip allows 128 hardware threads to be simultaneously executing (16 cores x 8 threads).

The industry standard benchmark that focuses specifically on processor chip performance is SPEC CPU2006. For this very well known and popular benchmark, SPARC T5 provides better performance than both POWER7 and POWER7+:

- for 1 chip vs. 1 chip,
- for 8 chips vs. 8 chips,
- for integer (SPECint_rate2006) and floating point (SPECfp_rate2006),
- for Peak tuning and for Base tuning.

For example, at the 8-chip level, integer throughput (SPECint_rate2006) is 3750 for SPARC vs. 2170 for POWER7+. You can find the details at the March 2013 BestPerf CPU2006 page.

SPEC is a trademark of the Standard Performance Evaluation Corporation, www.spec.org.
The two specific results quoted for SPECjEnterprise2010 are posted at the URLs linked from the discussion. Results for SPEC CPU2006 were verified at spec.org 1 July 2013, and can be rechecked here.



Losing My Fear of ZFS

Abstract

The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing simplicity, power, safety, and speed. As is common with most new technologies, this ambitious vision was not completely fulfilled in the initial versions. Initial usage showed that although it did have useful and convenient features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one performance engineer, describes some of the improvements, and provides examples of use.

Rumors: Does This Sound Familiar?

Have you heard some of these about ZFS?

- "ZFS? You can't use that - it will eat all your memory!"
- "ZFS? That's a software disk striping/RAID solution. You don't want that. You want hardware RAID."
- "ZFS? Be afraid."

Can I Please Just Forget About IO? (NO)

As a performance engineer, my primary concern is for the SPEC CPU benchmarks - which intentionally do relatively little IO. Usually. To a first approximation, IO can be ignored in this context. Usually. To a first approximation, it's fine if my ZFS "knowledge" is limited to the rumors and innuendo quoted above. Until... Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.

Why a SPEC CPU Benchmarker Might Care About IO

Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An analysis of the IO in the benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file, ref.mps, which is read during the second invocation of the benchmark. Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G), a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec.
It takes about 5.5 seconds to read one copy of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about 3000 seconds. But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.

On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle - which is not the desired behavior for a CPU benchmark. For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches the desired 99% User time. During that 18 minutes, a single disk with ufs on it is, according to iostat, 100% busy. It reads about 16 MB/sec, doing about 725 reads/sec.

Note that in this graph, and all other graphs in this article, the program being tested is only one of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should be taken as indicative of "compliant" (aka "reportable") runs of an entire suite. SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and the CPU benchmarks, see www.spec.org/cpu2006.

ZFS Makes its Dramatic Entrance

Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations, there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input file is highly compressible, going from 267 MB to 20 MB with gzip. The best IO is the IO that you never have to do at all.
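As a back-of-the-envelope check, the ~18 minutes of idle time observed on the M5000 is roughly what one would predict from 63 copies each reading the ~267 MB file at the observed 16 MB/sec:

```shell
#!/bin/sh
# Rough aggregate read-time estimate for the M5000 ufs run.
# 63 benchmark copies x ~267 MB each, at the observed ~16 MB/sec.
total_mb=$((63 * 267))                  # ~16.4 GB in aggregate
seconds=$(echo "$total_mb / 16" | bc)   # at 16 MB/sec from one disk
echo "$((seconds / 60)) minutes"        # ~17, close to the observed ~18
```

The estimate is crude (it ignores seeks and the other 33 MB of per-copy IO), but it confirms that the idle time is dominated by that one file.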
By using the ZFS compression feature, we can make 90% of the IO go away:

$ zpool create -f tank c0t1d0
$ zfs create tank/spec-zfs-gzip
$ zfs set compression=gzip tank/spec-zfs-gzip

The improvement from ZFS gzip compression is indeed dramatic. The careful reader may note that there are actually two lines on the far left: one measured with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant variable for the tests reported in this paper, as can be seen by the fact that the two lines are right on top of each other.

What About Memory Consumption?

Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised that it is memory-hungry, and indeed the Best Practices Guide (archived copy from 2009) plainly says that it will use all the memory on the system if it thinks it can get away with it: "The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory."

ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical memory, and it is desirable to run (n - 1) copies on a system with (n) threads and (n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC: set zfs:zfs_arc_max = 0x(size) can be added to /etc/system. The tests reported on this page all use a limited ARC cache. It should also be noted that all tests are done after a fresh reboot, so presumably the ARC cache is not contributing to the reported performance. More details about methods may be found at the end of the article.
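As a concrete illustration of that tuning knob (the 4 GB cap below is purely an example value; the article does not state the exact sizes used in these tests):

```shell
# Example /etc/system fragment: cap the ARC at 4 GB
# (0x100000000 bytes), leaving the rest of memory for benchmark copies.
set zfs:zfs_arc_max = 0x100000000

# After a reboot, the ARC's current size and ceiling can be inspected:
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max
```

The setting takes effect only at boot time, which fits the fresh-reboot methodology described above.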
ZFS on T5440: Good, But Not As Dramatic

Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it needs to quickly inhale on the order of 64 GB. A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right in the graph below.

In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO beginner here. This topic is re-visited below.)

The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg the CPU, since it is reading compressed data. The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID" drives were created, and made visible to the host as 12 separate units.
Then, 10 of them were strung together as zfs RAID-Z using:

# zpool create zf-raidz10 raidz \
    c3t40d0 c3t40d1 c3t40d2 c3t40d3 c3t40d4 \
    c3t40d5 c3t40d6 c3t40d7 c3t40d8 c3t40d9
# zpool status
  pool: zf-raidz10
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zf-raidz10   ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c3t40d0  ONLINE       0     0     0
            c3t40d1  ONLINE       0     0     0
            c3t40d2  ONLINE       0     0     0
            c3t40d3  ONLINE       0     0     0
            c3t40d4  ONLINE       0     0     0
            c3t40d5  ONLINE       0     0     0
            c3t40d6  ONLINE       0     0     0
            c3t40d7  ONLINE       0     0     0
            c3t40d8  ONLINE       0     0     0
            c3t40d9  ONLINE       0     0     0

Compression was added at a later time, but before the experiment shown above:

$ zfs list -o compression zf-raidz10
COMPRESS
    gzip

Why Is the T5440 Improvement Not As Dramatic As the M5000?

The improvement from zfs is helpful to the T5440, but unlike the M5000, nearly 15 minutes of clock time is spent on IO. Let's look at some statistics from iostat:

$ iostat -xncz 30
. . .
     cpu
 us sy wt id
  8  4  0 88
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0

A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular reference: https://blogs.oracle.com/roch/entry/when_to_and_not_to.
A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system where single thread performance is much slower than the one in Graph #2. On the M5000, 'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be emphasized that this is only a secondary concern for the read statistics described in this article, although it can become more important for write workloads, since compression is harder than decompression: doing 'gzip ref.mps' takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available, as of Solaris 10 s10s_u7 and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations. (This limitation may change in a future version of Solaris.)

Solution: Mirrors, No Gzip

The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time, which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower - the red line in the graph below.

Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs for more than 120 minutes. The file system for the leftmost line was created using:

# zpool create dev8-m3-ngz \
>    mirror c2t40d0 c2t40d1 c3t40d0 \
>    mirror c2t40d2 c2t40d3 c3t40d1 \
>    mirror c2t40d4 c2t40d5 c3t40d2 \
>    mirror c2t40d6 c2t40d7 c3t40d3 \
>    mirror c2t40d8 c3t40d4 c3t40d5 \
>    mirror c2t40d9 c3t40d6 c3t40d7 \
>    mirror c2t40d10 c3t40d8 c3t40d9 \
>    mirror c2t40d11 c3t40d10 c3t40d11
#

The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3).
There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.

Were These Tests Fair?

The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use a RAID device which is now 5 years old, and compare it versus contemporary ZFS. This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to 450.soplex, although with a very different system under test. On the other hand, it should be emphasized that all the T5440 tests reported in this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the RAID-Z and mirroring work.

Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not improve.

Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert, and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide limited access to the disk block size setting, and the tester did not at first realize that the disk block size depends on another parameter located elsewhere in the menus. For this particular controller, default block sizes are controlled indirectly by whether that setting is yes or no. Changing it to "No" makes the default block size larger (128 KB instead of 32 KB). Once this was discovered, various tests were repeated.
The hw RAID-5 test was repeated with explicit selection of a larger size; however, it did not improve. On the other hand, the NRAID devices, controlled by zfs, did improve. Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12x drives in a RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.

YMMV

As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.

Summary / Basic Lessons

Some basic lessons about ZFS emerge:

1) ZFS can be easily taught not to hog memory.
2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.
3) Setting up mirrored drives with dynamic striping is straightforward.
4) ZFS is not so scary, after all.

Notes on Methods

During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for 450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would not be present in a reportable run, which was accomplished as summarized below:

runspec -a setup --rate 450.soplex
reboot
cd 450.soplex/run/run*000
specinvoke -nnr > doit.sh
convert 'sh dobmk' in doit.sh to 'sh dobmk &'
doit.sh

The tests noted as Solaris 10 used:

# head -1 /etc/release
                       Solaris 10 5/09 s10s_u7wos_08 SPARC

The tests noted as SNV used:

# head -1 /etc/release
             Solaris Express Community Edition snv_112 SPARC

The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72 (Fibre Channel).

Acknowledgments: My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom should be blamed for this beginner's ignorance.
Karsten Guthridge was the first to point out the usefulness of ZFS gzip compression for 450.soplex.



Sun Studio Trounces Intel Compiler on Intel Chip

Today Sun announces a new world record for SPECfp2006: 50.4 on a 2-chip Nehalem (Intel Xeon X5570) Sun Blade X6270. Congratulations to my colleagues in the Sun Studio Compiler group - the fun thing about this result is that it beats Intel's own compiler on this Intel chip by 20%, due to the optimization technologies found in the Sun Studio 12 Update 1 compiler.

SPECfp2006

System                        Processor  GHz   Chips  Cores  Peak  Base  Comments
Sun Blade X6270               Xeon 5570  2.93    2      8    50.4  45.0  New
Hitachi BladeSymphony BS2000  Xeon 5570  2.93    2      8    42.0  39.3  Top result at www.spec.org as of 14 Apr 2009
IBM Power 595                 POWER6     5.00    1      1    24.9  20.1  Best POWER6 as of 14 Apr 2009

Note that even with the less aggressive "Base" tuning [SPECfp_base2006], the Sun Blade X6270 beats the best-posted "Peak" tuning from competitors [SPECfp2006]. Of course, the Intel compiler engineers are bright folks too, and they will no doubt quickly provide additional performance on Nehalem. Still, it's fun to see the multi-target Sun Studio optimization technology deliver top results on a variety of platforms, now including Nehalem.

As to integer performance - the Sun Blade also takes top honors there [for peak]:

SPECint2006

System                        Processor  GHz   Chips  Cores  Peak  Base  Comments
Sun Blade X6270               Xeon 5570  2.93    2      8    36.9  32.0  New
Fujitsu Celsius R570          Xeon 5570  2.93    2      8    36.3  32.2  Top SPECint2006 result as of 14 Apr 2009

The Sun Blade results have been submitted to SPEC for review, and should appear at SPEC's website in about 2 weeks.

On a personal note, this was my first time using OpenSolaris. The level of compatibility with other operating systems is substantially improved; utilities that this tester likes having handy are built in (e.g. the NUMA Observability Tools); and ZFS zips right along, needing less attention than ufs and delivering better performance.

SPEC, SPECint, and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 4/14/2009.
