Sunday Aug 04, 2013

Shrink a ZFS Root Pool, Solaris 11.1, SPARC

Revision info

  • Update III - 5-Aug-2013 1015am EDT - Further clarification on VARSHARE has been added.
  • Update II - 4-Aug-2013 6pm EDT -
    - A clarification has been added under the goals section.
    - A kind reviewer points out that I forgot about the new file system VARSHARE. An update was added to discuss it.
  • Update I - 4-Aug-2013 10am EDT - A kind reader points out an opportunity for jealousy, which has been addressed.

Summary

A root pool cannot be shrunk in a single operation, but it can be copied to a smaller partition. The Solaris 11 beadm command makes the task easier than it used to be. In this example, the pool is copied, and then mirrored.

  1. Use format to create a smaller partition on a new device, say c0tXs0
  2. # zpool create -f newpool c0tXs0
  3. # beadm create -a -d "smaller s11.1" -p newpool solaris-SRUnn
  4. Use {ok} probe-scsi-all and {ok} devalias to identify the new disk
  5. {ok} setenv boot-device diskNN
  6. Boot new system, and clean up or copy (zfs send/receive) other file systems from the old device (e.g. /export, /export/home, perhaps also swap, dump, and VARSHARE)
  7. Use zpool export - or use zpool destroy - to hide or destroy the original
  8. Use format to create the mirror partition, say c0tYs0
  9. zpool attach -f newpool c0tXs0 c0tYs0
  10. Allow the resilver to complete
  11. At OBP, hunt down c0tY and boot the mirror

A detailed example follows.

Contents

1. Goal: shrink a root pool on a SPARC system.

a. Sidebar: Why?

b. A long-discussed feature, and a unicorn

c. Web Resources

(i) Other bloggers

(ii) Solaris 11.1 Documentation Library

2. Initial state: one very large rpool, mostly empty

3. Create newpool and new swap/dump

a. Delete old swap, use the new

b. Delete old dump, use the new

4. The actual copy

a. Let beadm do the work!

b. Thank you, beadm, for automatically taking care of:

(i) activation, (ii) bootfs, (iii) bootloader, and (iv) menu.lst

Update: Beadm missed one item....VARSHARE

5. Boot the new system (after a little OBP hunting)

6. Cleanup

a. Copy additional file systems

b. Hide - or delete - the original

7. Mirror the newpool

8. Final verification

Thank you

1. Goal: shrink a root pool on a SPARC system.

A large root pool occupies most of a disk. I would like to make it much smaller.

a. Sidebar: Why?

Why do I want to shrink? Because the desired configuration is:

  • Mirrored root pool
  • Large swap partition, not mirrored

That is, the current configuration is:

diskX                 
  c0tXs0 rpool        
    rpool/ROOT/solaris
    rpool/swap        

I do not want to just add a mirror, like this:

diskX                           diskY
  c0tXs0 rpool                    c0tYs0 rpool
    rpool/ROOT/solaris              solaris (copy)
    rpool/swap                      swap (copy)

Instead, the goal is to have twice as much swap, like so:

diskX                           diskY
  c0tXs0 rpool                    c0tYs0 rpool
    rpool/ROOT/solaris              solaris (copy)
  c0tXs1                          c0tYs1 
    swap                            more swap

Clarification: Bytes of disk vs. bytes of memory. At least one reader asked about the point of the above. To be explicit:

  • A 2-way mirrored rpool with a ZFS swap volume of size N spends 2 x N bytes of disk space to provide backing store to N bytes of memory.
  • Two swap partitions, each of size N, spend 2 x N bytes of disk space and provide backing store to 2 x N bytes of memory.

As it happens, due to the planned workload, this particular system is going to need a lot of swap space. Therefore, I prefer to avoid mirrored swap.

b. A long-discussed feature, and a unicorn

The word "shrink" does not appear in the ZFS admin guide.

Discussions at an archived ZFS discussion group assert that the feature was under active development in 2007, but by 2010 the short summary was "it's hiding behind the unicorn".

Apparently, the feature is difficult, and demand simply has not been high enough.

c. Web Resources

Well, if there is no shrink feature, surely it can be done by other methods, right? Well....

(i) Other bloggers

If one uses Google to search for "shrink rpool", the top two blog entries that are returned appear to be relevant:

Both of the above are old, written well prior to the release of Solaris 11. Both also use x86 volumes and conventions, not SPARC.

Nevertheless, they contain some useful clues.

(ii) Solaris 11.1 Documentation Library

Since the above blog entries are dated, I also tried to use contemporary documentation. These were consulted, along with the corresponding man pages:

2. Initial state: one very large rpool, mostly empty

Here are the initial pools, file systems, and boot environments. Note that there is a large 556 GB rpool, and it is not mirrored.

# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

     NAME                       STATE     READ WRITE CKSUM
     rpool                      ONLINE       0     0     0
       c0t5000CCA0224D6354d0s0  ONLINE       0     0     0
#

# zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  556G  77.0G  479G  13%  1.00x  ONLINE  -
#

# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     79.2G   468G  73.5K  /rpool
rpool/ROOT                5.60G   468G    31K  legacy
rpool/ROOT/solaris        42.6M   468G  3.78G  /
rpool/ROOT/solaris-1      5.56G   468G  3.79G  /
rpool/ROOT/solaris-1/var   657M   468G   521M  /var
rpool/ROOT/solaris/var    38.9M   468G   221M  /var
rpool/VARSHARE            83.5K   468G    58K  /var/share
rpool/dump                66.0G   470G  64.0G  -
rpool/export              3.43G   468G    32K  /export
rpool/export/home         3.43G   468G  3.43G  /export/home
rpool/swap                4.13G   468G  4.00G  -

# beadm list
BE        Active Mountpoint Space  Policy Created          
--        ------ ---------- -----  ------ -------          
solaris   -      -          81.58M static 2013-07-10 17:19 
solaris-1 NR     /          6.88G  static 2013-07-31 12:27 
#

The partitions on the original boot disk are:

# format
...
partition> p
Volume:  solaris
Current partition table (original):
Total disk cylinders available: 64986 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 - 64985      558.89GB    (64986/0/0) 1172087496
  1 unassigned    wm       0                0         (0/0/0)              0
  2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
  3 unassigned    wm       0                0         (0/0/0)              0
  4 unassigned    wm       0                0         (0/0/0)              0
  5 unassigned    wm       0                0         (0/0/0)              0
  6 unassigned    wm       0                0         (0/0/0)              0
  7 unassigned    wm       0                0         (0/0/0)              0

3. Create newpool and new swap/dump

The format utility was used to create a new, smaller partition for the root pool. The first swap partition was also created.

# format
...
partition> p
Volume:  smallsys
Current partition table (unnamed):
Total disk cylinders available: 64986 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 - 11627      100.00GB    (11628/0/0)  209722608
  1       swap    wu   11628 - 34883      200.01GB    (23256/0/0)  419445216
  2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
  3 unassigned    wm       0                0         (0/0/0)              0 (*)
  4 unassigned    wm       0                0         (0/0/0)              0 (*)
  5 unassigned    wm       0                0         (0/0/0)              0 (*)
  6 unassigned    wm       0                0         (0/0/0)              0 (*)
  7 unassigned    wm       0                0         (0/0/0)              0 (*)

partition> label
Ready to label disk, continue? y

And the new pool was created with zpool create:

# zpool create -f newpool c0t5000CCA0224D62A0d0s0
# 

(*) Note: one might ask, what about the other 250 GB available on the disk? Yes, this user has a plan in mind to use that space. It is not terribly relevant to the concerns covered in this particular blog, and so is left aside for now.

a. Delete old swap, use the new

As noted in the introduction, there will eventually be multiple swap partitions, and they will not be in the root pool. The first new swap partition was just created, above. Therefore, for my purposes, I might as well delete the originals now, if only because it would be useless to copy them. (Your needs may differ!)

In a separate window, a new vfstab was created, which removes the zvol swap and adds the new swap partition:

# cd /etc
# diff vfstab.orig vfstab.withnewswap 
12c12
< /dev/zvol/dsk/rpool/swap   -   -      swap    -       no      -
---
> /dev/dsk/c0t5000CCA0224D62A0d0s1 - - swap - no - 
# 

The commands below display the current swap partition, add the new one, and display the result.

# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 285,2        8K     4.0G     4.0G
#

# /sbin/swapadd
# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 285,2        8K     4.0G     4.0G
/dev/dsk/c0t5000CCA0224D62A0d0s1 203,49       8K     200G     200G
#

Next, use swap -d to stop swapping on the old, and then destroy it.

# swap -d /dev/zvol/dsk/rpool/swap
# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/dsk/c0t5000CCA0224D62A0d0s1 203,49       8K     200G     200G
# 

# zfs destroy rpool/swap
#

b. Delete old dump, use the new

The largest part of the original pool is the dump device. Since we now have a large swap partition, we can use that instead:

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash
  Savecore enabled: yes
   Save compressed: on
#

# dumpadm -d swap
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t5000CCA0224D62A0d0s1 (swap)
Savecore directory: /var/crash
  Savecore enabled: yes
   Save compressed: on
# 

And the volume can now be destroyed. Emphasis: your needs may differ. You may prefer to keep swap and dump volumes in the new pool.

# zfs destroy rpool/dump
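If, unlike this example, you prefer to keep zvol-backed swap and dump inside the new pool, a minimal sketch follows. The sizes are purely illustrative; choose values appropriate for your memory size and workload.

      # Sketch only: recreate swap and dump as ZFS volumes in newpool (sizes are examples)
      zfs create -V 16G newpool/swap
      swap -a /dev/zvol/dsk/newpool/swap       # also add the matching line to /etc/vfstab
      zfs create -V 32G newpool/dump
      dumpadm -d /dev/zvol/dsk/newpool/dump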

4. The actual copy

Let's use pkg info to figure out a good name for the new BE. From the material below, it appears that this is Support Repository Update 9.5:

# pkg info entire (**)
          Name: entire
       Summary: entire incorporation including Support Repository 
                Update (Oracle Solaris 11.1.9.5.1).
     Publisher: solaris
Packaging Date: Thu Jul 04 03:10:15 2013
# 

(**) Output has been abbreviated for readability

At this point, I did something that I thought would be needed, based on previous blog entries, but as you will see in a moment, it was not needed yet.

# zfs snapshot -r rpool@orig_before_shrink

a. Let beadm do the work!

The previous blog entries at this point made use of zfs send and zfs receive. In a first attempt at this copy, so did I; but a more careful reading of the manpage indicated that beadm create would probably be a better idea. For the sake of brevity, the send/receive side track is omitted.

Here is the first attempt with beadm create:

# beadm create -a -d "smaller s11.1" -e rpool@orig_before_shrink \
> -p newpool s11.1-sru9.5
be_copy: failed to find zpool for BE (rpool)
Unable to create s11.1-sru9.5.   (oops)

Hmmm, it claims that it cannot find the snapshot that was just created a minute ago. ... Reading the manpage, I realize that a "beadm snapshot" is not exactly the same concept as a "zfs snapshot". OK.

The manpage also says that if -e is not provided, then it will clone the current environment. Sounds good to me.

# beadm create -a -d "smaller s11.1" -p newpool s11.1-sru9.5
#

The above command took only a few minutes to complete, probably because rpool did not have a lot of content. Here is the result:

# zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
newpool                        4.30G  93.6G  73.5K  /newpool
newpool/ROOT                   4.30G  93.6G    31K  legacy
newpool/ROOT/s11.1-sru9.5      4.30G  93.6G  3.79G  /
newpool/ROOT/s11.1-sru9.5/var   519M  93.6G   519M  /var

rpool                          9.04G   538G  73.5K  /rpool
rpool/ROOT                     5.60G   538G    31K  legacy
rpool/ROOT/solaris             42.6M   538G  3.78G  /
rpool/ROOT/solaris-1           5.56G   538G  3.79G  /
rpool/ROOT/solaris-1/var        659M   538G   519M  /var
rpool/ROOT/solaris/var         38.9M   538G   221M  /var
rpool/VARSHARE                 83.5K   538G    58K  /var/share
rpool/export                   3.43G   538G    32K  /export
rpool/export/home              3.43G   538G  3.43G  /export/home
#

Note that /export and /export/home were not copied. We will come back to these later.

b. Thank you, beadm, for automatically taking care of...

The older blogs mentioned several additional steps that had to be performed when copying root pools. As I checked into each of these topics, it turned out - repeatedly - that beadm create had already taken care of it.

(i) activation

The new Boot Environment will be active on reboot, as shown by code "R", below, because the above beadm create command included the -a switch.

# beadm list
BE           Active Mountpoint Space  Policy Created          
--           ------ ---------- -----  ------ -------          
s11.1-sru9.5 R      -          4.80G  static 2013-08-02 10:22 
solaris      -      -          81.58M static 2013-07-10 17:19 
solaris-1    NR     /          6.88G  static 2013-07-31 12:27 

(ii) bootfs

The older blogs (which used zfs send/recv) mentioned that the bootfs property needs to be set on the new pool. This is no longer needed: beadm create already set it automatically. Thank you, beadm. (A sketch of the manual command follows the listing below, for anyone who does the copy with zfs send/receive.)

# zpool list -o name,bootfs
NAME     BOOTFS
newpool  newpool/ROOT/s11.1-sru9.5
rpool    rpool/ROOT/solaris-1
# 
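For completeness, had the copy been done with zfs send/receive instead of beadm, the property could be set by hand. A sketch, using the pool and BE names from this example:

      # Sketch only: set bootfs manually (beadm already did this here)
      zpool set bootfs=newpool/ROOT/s11.1-sru9.5 newpool
      zpool list -o name,bootfs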

(iii) bootloader

The disk will need a bootloader. Here, some history may be of interest. A few years ago:

  • Frequently, system administrators needed to add bootloaders, for example, anytime a mirror was created.
  • The method differed by platform: installboot on SPARC, or something grub-ish on x86

Today,

  • The bootloader is added automatically when root pools are mirrored
  • And if for some reason you do need to add one by hand, the command is now bootadm install-bootloader (sketched below), which in turn calls installboot on your behalf, or messes with grub on your behalf.
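For reference, here is a minimal sketch of installing the boot loader by hand on the new pool; it is not needed in this example, as shown next. The pool name is the one used in this entry.

      # Sketch only: manual boot loader installation (normally unnecessary on Solaris 11.1)
      bootadm install-bootloader -P newpool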

The question for the moment: has a bootloader been placed on the new disk?

Here is the original bootloader, on rpool - notice the -R / for path

# bootadm list-archive -R /
platform/SUNW,Netra-CP3060/kernel
platform/SUNW,Netra-CP3260/kernel
platform/SUNW,Netra-T2000/kernel
platform/SUNW,Netra-T5220/kernel
platform/SUNW,Netra-T5440/kernel
platform/SUNW,SPARC-Enterprise-T1000/kernel
platform/SUNW,SPARC-Enterprise-T2000/kernel
platform/SUNW,SPARC-Enterprise-T5120/kernel
platform/SUNW,SPARC-Enterprise-T5220/kernel
platform/SUNW,SPARC-Enterprise/kernel
platform/SUNW,Sun-Blade-T6300/kernel
platform/SUNW,Sun-Blade-T6320/kernel
platform/SUNW,Sun-Blade-T6340/kernel
platform/SUNW,Sun-Fire-T1000/kernel
platform/SUNW,Sun-Fire-T200/kernel
platform/SUNW,T5140/kernel
platform/SUNW,T5240/kernel
platform/SUNW,T5440/kernel
platform/SUNW,USBRDT-5240/kernel
platform/sun4v/kernel
etc/cluster/nodeid
etc/dacf.conf
etc/driver
etc/mach
kernel
# 

After mounting the newly created environment, it can be seen that it also has a bootloader. There is no need to use installboot or bootadm install-bootloader, because the beadm create command already took care of it. Thank you, beadm.

# beadm mount s11.1-sru9.5 /mnt
#

# bootadm list-archive -R /mnt
platform/SUNW,Netra-CP3060/kernel
platform/SUNW,Netra-CP3260/kernel
platform/SUNW,Netra-T2000/kernel
platform/SUNW,Netra-T5220/kernel
platform/SUNW,Netra-T5440/kernel
platform/SUNW,SPARC-Enterprise-T1000/kernel
platform/SUNW,SPARC-Enterprise-T2000/kernel
platform/SUNW,SPARC-Enterprise-T5120/kernel
platform/SUNW,SPARC-Enterprise-T5220/kernel
platform/SUNW,SPARC-Enterprise/kernel
platform/SUNW,Sun-Blade-T6300/kernel
platform/SUNW,Sun-Blade-T6320/kernel
platform/SUNW,Sun-Blade-T6340/kernel
platform/SUNW,Sun-Fire-T1000/kernel
platform/SUNW,Sun-Fire-T200/kernel
platform/SUNW,T5140/kernel
platform/SUNW,T5240/kernel
platform/SUNW,T5440/kernel
platform/SUNW,USBRDT-5240/kernel
platform/sun4u/kernel
platform/sun4v/kernel
etc/cluster/nodeid
etc/dacf.conf
etc/driver
etc/mach
kernel
# 

(iv) menu.lst

The first google reference above includes this sentence:

Change all the references to [the new pool] in the menu.1st file.

That sounds GRUBish, for x86, and not very much like SPARC. As it turns out, though, yes, there is a menu.lst file for SPARC:

# cat /rpool/boot/menu.lst
title Oracle Solaris 11.1 SPARC
bootfs rpool/ROOT/solaris
title solaris-1
bootfs rpool/ROOT/solaris-1
# 

And, oh look at this, beadm create also made a new menu.lst on the new pool. Thank you, beadm.

# cat /newpool/boot/menu.lst 
title smaller s11.1
bootfs newpool/ROOT/s11.1-sru9.5
# 

Beadm missed one item....VARSHARE

Update (III): WHAT'S MISSING? An earlier update to this blog entry pointed out that I forgot about VARSHARE. It has been further clarified that the right time to worry about it is actually BEFORE the reboot. OK. So, if you are following this blog while working on a system of your own, do that zfs list command now, before rebooting. If VARSHARE is present, migrate it now; one way to do so is sketched below.
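A minimal sketch of one way to copy the VARSHARE data before the reboot follows. The snapshot name is made up for illustration, and the exact steps may need adjusting for your pools:

      # Sketch only: copy /var/share data to the new pool before rebooting
      zfs snapshot rpool/VARSHARE@migrate
      zfs send -p rpool/VARSHARE@migrate | zfs receive -u newpool/VARSHARE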

5. Boot the new system (after a little OBP hunting)

Attempt to boot the new pool. First, remind myself of the disk IDs, and then head off towards OBP:

# zpool status (**)

           NAME                       STATE     READ WRITE CKSUM
           newpool                    ONLINE       0     0     0
        -->  c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0

           rpool                      ONLINE       0     0     0
             c0t5000CCA0224D6354d0s0  ONLINE       0     0     0

# shutdown -y -g0 -i0
{0} ok probe-scsi-all (**)
/pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0   <--

Target a 
  Unit 0   Disk   HITACHI  H109060SESUN600G A31A    1172123568 Blocks, 600 GB
  SASDeviceName 5000cca0224d62a0  SASAddress 5000cca0224d62a1  PhyNum 1 
                ^^^^^^^^^^^^^^^^

It looks like the newly created pool is on

/pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0

at PhyNum 1. Note its SASDeviceName, 5000cca0224d62a0, which matches newpool's Solaris device c0t5000CCA0224D62A0d0s0.

Is there a device alias that also matches?

{0} ok devalias
screen   /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@7/display@0
disk7    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p3
disk6    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p2
disk5    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p1   <--
disk4    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p0
scsi1    /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0
net3     /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0,1
net2     /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@4/network@0
disk3    /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p3
disk2    /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p2
disk1    /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p1
disk     /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p0
disk0    /pci@300/pci@1/pci@0/pci@4/pci@0/pci@c/scsi@0/disk@p0
...

The OBP alias disk5 matches the desired disk.

Try the new disk, checking: does the boot -L listing include the desired new BE?

{0} ok boot disk5 -L
Boot device: /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p1  
File and args: -L

1 smaller s11.1
Select environment to boot: [ 1 - 1 ]: 1

To boot the selected entry, invoke:
boot [] -Z newpool/ROOT/s11.1-sru9.5   <--

Yes, disk5 offers a choice that matches the new boot environment.

Point the OBP boot-device at it, and off we go.

{0} ok setenv boot-device disk5
{0} ok boot

6. Cleanup

The system booted successfully. After the first boot, note that - as mentioned earlier - /export and /export/home are in the original pool:

# zfs list -r rpool
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     9.04G   538G  73.5K  /rpool
rpool/ROOT                5.61G   538G    31K  legacy
rpool/ROOT/solaris        42.6M   538G  3.78G  /
rpool/ROOT/solaris-1      5.57G   538G  3.79G  /
rpool/ROOT/solaris-1/var   660M   538G   519M  /var
rpool/ROOT/solaris/var    38.9M   538G   221M  /var
rpool/VARSHARE             108K   538G  58.5K  /var/share
rpool/export              3.43G   538G    32K  /export
rpool/export/home         3.43G   538G  3.43G  /export/home
# 

# zfs list -r newpool
NAME                            USED  AVAIL  REFER  MOUNTPOINT
newpool                        4.33G  93.6G  73.5K  /newpool
newpool/ROOT                   4.33G  93.6G    31K  legacy
newpool/ROOT/s11.1-sru9.5      4.33G  93.6G  3.79G  /
newpool/ROOT/s11.1-sru9.5/var   524M  93.6G   519M  /var
newpool/VARSHARE                 43K  93.6G    43K  /var/share
# 

Update: VARSHARE. Mike Gerdts points out what I missed in a previous reading of the above: notice that rpool/VARSHARE contained some data that has not been migrated to newpool/VARSHARE. The VARSHARE file system provides a convenient place to store crash dumps, audit records, and similar data that can be shared across boot environments, as described under What's New with ZFS? in the updated ZFS Admin Guide.

Unfortunately, I missed my chance to migrate that data; it's gone. Fortunately, I didn't lose very much (about 108 KB, according to the above). If you are following this blog as you work on your own system, one hopes you noticed the note above about VARSHARE. If not, now would be a good moment to review the status of VARSHARE on your system, potentially merging the previous content with whatever has accumulated since the reboot.

Swap/dump - as already discussed, swap and dump were intentionally not migrated, because they are handled elsewhere. Your needs may differ. If you are following this blog as you work on your own system, now would be a good moment to ensure that you have figured out what you want to do for your swap / dump volumes.
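A quick check of the current configuration, using the same commands as earlier in this entry:

      # Confirm which swap devices and dump device are currently in use
      swap -lh
      dumpadm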

a. Copy additional file systems

Earlier, a snapshot was created, using:

# zfs snapshot -r rpool@orig_before_shrink

That snapshot is used in a zfs send/receive command, which goes quickly, but it ends with an error:

# zfs send -vR rpool/export@orig_before_shrink | zfs receive -vFd newpool
sending from @ to rpool/export@before_shrink
receiving full stream of rpool/export@before_shrink 
   into newpool/export@before_shrink
sending from @before_shrink to rpool/export@orig_before_shrink
sending from @ to rpool/export/home@before_shrink
received 47.9KB stream in 1 seconds (47.9KB/sec)
receiving incremental stream of rpool/export@orig_before_shrink 
   into newpool/export@orig_before_shrink
received 8.11KB stream in 3 seconds (2.70KB/sec)
receiving full stream of rpool/export/home@before_shrink 
   into newpool/export/home@before_shrink
sending from @before_shrink to rpool/export/home@orig_before_shrink
received 3.46GB stream in 78 seconds (45.4MB/sec)
receiving incremental stream of rpool/export/home@orig_before_shrink 
   into newpool/export/home@orig_before_shrink
received 198KB stream in 2 seconds (99.1KB/sec)
cannot mount 'newpool/export' on '/export': directory is not empty
cannot mount 'newpool/export' on '/export': directory is not empty
cannot mount 'newpool/export/home' on '/export/home': 
    failure mounting parent dataset

The problem above is that more than one file system is eligible to be mounted at the same mountpoints. A better solution - pointed out by a kind reviewer - would have been to use the zfs receive -u switch:

# zfs send -vR rpool/export@orig_before_shrink | zfs receive -vFdu newpool

The -u switch would have avoided the attempt to mount the newly created file systems.
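With -u, the received file systems are left unmounted. They can be mounted later, once the original pool's datasets no longer claim /export and /export/home (for example, after the canmount changes or the zpool export described in the next section). A sketch:

      # Sketch only: mount the received datasets once the mountpoints are free
      zfs mount newpool/export
      zfs mount newpool/export/home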

b. Hide - or delete - the original

Warning: YMMV. Because I am nearly certain that I will soon be destroying the original rpool, my solution was to disqualify the old file systems from mounting. Your mileage may vary. For example, you might prefer to leave the old pool unchanged, in case it is needed later. In that case, you could skip directly to the export, described below.

Anyway, the following was satisfactory for my needs. I changed the canmount property:

# zfs list -o mounted,canmount,mountpoint,name -r rpool
MOUNTED  CANMOUNT  MOUNTPOINT    NAME
    yes        on  /rpool        rpool
     no       off  legacy        rpool/ROOT
     no    noauto  /             rpool/ROOT/solaris
     no    noauto  /             rpool/ROOT/solaris-1
     no    noauto  /var          rpool/ROOT/solaris-1/var
     no    noauto  /var          rpool/ROOT/solaris/var
     no    noauto  /var/share    rpool/VARSHARE
    yes        on  /export       rpool/export
    yes        on  /export/home  rpool/export/home
# 
# zfs list -o mounted,canmount,mountpoint,name -r newpool
MOUNTED  CANMOUNT  MOUNTPOINT    NAME
    yes        on  /newpool      newpool
     no       off  legacy        newpool/ROOT
    yes    noauto  /             newpool/ROOT/s11.1-sru9.5
    yes    noauto  /var          newpool/ROOT/s11.1-sru9.5/var
    yes    noauto  /var/share    newpool/VARSHARE
     no        on  /export       newpool/export
     no        on  /export/home  newpool/export/home
# 
# zfs set canmount=noauto rpool/export
# zfs set canmount=noauto rpool/export/home
# reboot

After the reboot, only one data set from the original pool is mounted:

# zfs list -r -o name,mounted,canmount,mountpoint 
NAME                           MOUNTED  CANMOUNT  MOUNTPOINT
newpool                            yes        on  /newpool
newpool/ROOT                        no       off  legacy
newpool/ROOT/s11.1-sru9.5          yes    noauto  /
newpool/ROOT/s11.1-sru9.5/var      yes    noauto  /var
newpool/VARSHARE                   yes    noauto  /var/share
newpool/export                     yes        on  /export
newpool/export/home                yes        on  /export/home

rpool                              yes        on  /rpool
rpool/ROOT                          no       off  legacy
rpool/ROOT/solaris                  no    noauto  /
rpool/ROOT/solaris-1                no    noauto  /
rpool/ROOT/solaris-1/var            no    noauto  /var
rpool/ROOT/solaris/var              no    noauto  /var
rpool/VARSHARE                      no    noauto  /var/share
rpool/export                        no    noauto  /export
rpool/export/home                   no    noauto  /export/home
# 

The canmount property could be set for that one too, but a better solution - as suggested by the kind reviewer - is to zpool export the pool. The export ensures that none of it will be seen until/unless a later zpool import command is done (which will not be done in this case, because I want to re-use the space for other purposes).

# zpool export rpool
# reboot
.
.
.
$ zpool status
  pool: newpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        newpool                    ONLINE       0     0     0
          c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0

errors: No known data errors
$

$ zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
newpool                        7.80G  90.1G  73.5K  /newpool
newpool/ROOT                   4.37G  90.1G    31K  legacy
newpool/ROOT/s11.1-sru9.5      4.37G  90.1G  3.79G  /
newpool/ROOT/s11.1-sru9.5/var   530M  90.1G   521M  /var
newpool/VARSHARE               45.5K  90.1G  45.5K  /var/share
newpool/export                 3.43G  90.1G    32K  /export
newpool/export/home            3.43G  90.1G  3.43G  /export/home
$ 

7. Mirror the newpool

The new root pool has been created, it boots, it is the desired size, and it now has all the right data sets. Time to mirror it.

Set up the partitions on the mirror disk to match the disk that holds newpool:

partition> p
Volume:  mirror
Current partition table (unnamed):
Total disk cylinders available: 64986 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm       0 - 11627      100.00GB    (11628/0/0)  209722608
  1       swap    wu   11628 - 34883      200.01GB    (23256/0/0)  419445216
  2     backup    wu       0 - 64985      558.89GB    (64986/0/0) 1172087496
  3 unassigned    wm       0                0         (0/0/0)              0
  4 unassigned    wm       0                0         (0/0/0)              0
  5 unassigned    wm       0                0         (0/0/0)              0
  6 unassigned    wm       0                0         (0/0/0)              0
  7 unassigned    wm       0                0         (0/0/0)              0

partition> label
Ready to label disk, continue? y

Start the mirror operation:

# zpool status
  pool: newpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        newpool                    ONLINE       0     0     0
          c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0

errors: No known data errors
#

# zpool attach -f newpool c0t5000CCA0224D62A0d0s0 c0t5000CCA0224D6A30d0s0
Make sure to wait until resilver is done before rebooting.
#
# zpool status
  pool: newpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Sat Aug  3 07:49:25 2013
    413M scanned out of 7.83G at 20.6M/s, 0h6m to go
    409M resilvered, 5.15% done
config:

        NAME                         STATE     READ WRITE CKSUM
        newpool                      DEGRADED     0     0     0
          mirror-0                   DEGRADED     0     0     0
            c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
            c0t5000CCA0224D6A30d0s0  DEGRADED     0     0     0  (resilvering)

errors: No known data errors
# 

It says it will complete in 6 minutes. OK, wait that long and check again:

# sleep 360; zpool status
  pool: newpool
 state: ONLINE
  scan: resilvered 7.83G in 0h3m with 0 errors on Sat Aug  3 07:52:30 2013
config:

        NAME                         STATE     READ WRITE CKSUM
        newpool                      ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            c0t5000CCA0224D62A0d0s0  ONLINE       0     0     0
            c0t5000CCA0224D6A30d0s0  ONLINE       0     0     0

errors: No known data errors
# 

Recall that a part of the goal was to have 2x swap partitions, not mirrored. Add the second one now.

# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/dsk/c0t5000CCA0224D62A0d0s1 203,49       8K     200G     200G
#
# echo "/dev/dsk/c0t5000CCA0224D6A30d0s1 - - swap - no - " >> /etc/vfstab
# /sbin/swapadd
# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/dsk/c0t5000CCA0224D62A0d0s1 203,49       8K     200G     200G
/dev/dsk/c0t5000CCA0224D6A30d0s1 203,41       8K     200G     200G
# 

8. Final verification

When the zpool attach command above was issued, the root pool was mirrored. As mentioned previously, in the days of our ancestors, one had to follow this up by adding the bootloader. Now, thanks to updated zpool attach, it happens automatically.

Verify the feature by booting the other side of the mirror:

# shutdown -y -g0 -i0
...
{0} ok printenv boot-device
boot-device =           disk5
{0} ok boot disk4
Boot device: /pci@4c0/pci@1/pci@0/pci@c/pci@0/pci@c/scsi@0/disk@p0  File and args: 
SunOS Release 5.11 Version 11.1 64-bit
Copyright (c) 1983, 2012, Oracle and/or its affiliates. All rights reserved.

Thank you

Thank you to bigal, and to Joe Mocker for their starting points. Thank you to Cloyce Spradling for review of drafts of this post.

Also, Michael Ramchand noted that the first post of this blog forgot to thank zpool attach in step 8, which has been fixed; that was important because, as Michael noted, "It might get jealous after all the thanks that beadm got."

(this space intentionally left blank)

 

Monday Jul 01, 2013

IBM "per core" comparisons for SPECjEnterprise2010

Responding to an IBM claim of double performance per core, questions and concerns are raised about:

  • Scaling is not free
  • Choosing the denominator radically changes the picture
  • SPEC Fair Use requirements
  • Substantiation and transparency
  • T5 claim for "fastest processor"

About this blog

Occasional discussion of computer performance, especially performance of CPU, Memory, and Compiler (the components measured by SPEC CPU)

I contributed to the development of SPEC CPU2006 and SPEC CPU2000, and serve as run rules editor for the CPU subcommittee.

In the picture, I am the one on the right. The file name for the picture says it all - I need to get out on that bike more often.

Old Fat Guy With One Foot In The Gravy

Tuesday Mar 22, 2011

Still here

There seems to be a rumor that the blogs here will disappear. Don't believe everything you read on the interwebs.

Thursday Jun 11, 2009

New BestPerf blog

I'll be putting most of my attention into the Group Blog, for the group that I am a member of: blogs.sun.com/BestPerf

Tuesday May 26, 2009

Losing My Fear of ZFS

Abstract

The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing simplicity, power, safety, and speed. As is common with new technologies, this ambitious vision was not completely fulfilled in the initial versions. Initial usage showed that although it did have useful and convenient features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one performance engineer, describes some of the improvements, and provides examples of use.

Rumors: Does This Sound Familiar?

Have you heard some of these about ZFS?

"ZFS? You can't use that - it will eat all your memory!"

"ZFS? That's a software disk striping/RAID solution. You don't want that. You want hardware RAID."

"ZFS? Be afraid."

Can I Please Just Forget About IO? (NO)

As a performance engineer, my primary concern is for the SPEC CPU benchmarks - which intentionally do relatively little IO. Usually.

To a first approximation, IO can be ignored in this context. Usually.

To a first approximation, it's fine if my ZFS "knowledge" is limited to rumors / innuendo as quoted above. Until....

Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.

Why a SPEC CPU Benchmarker Might Care About IO

Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An analysis of the IO in the benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file, ref.mps, which is read during the second invocation of the benchmark.

Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G), a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec. It takes about 5.5 seconds to read one copy of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about 3000 seconds.

But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.

On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle - which is not the desired behavior for a CPU benchmark.

For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches the desired 99% User time. During that 18 minutes, a single disk with ufs on it is, according to iostat, 100% busy. It reads about 16 MB/sec, doing about 725 reads/sec.

Note that in this graph, and all other graphs in this article, the program being tested is only one of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should be taken as indicative of "compliant" (aka "reportable") runs of an entire suite. SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and the CPU benchmarks, see www.spec.org/cpu2006.

graph1: ufs takes 18 min on M5000

ZFS Makes its Dramatic Entrance

Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations, there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input file is highly compressible, going from 267 MB to 20 MB with gzip.

The best IO is the IO that you never have to do at all. By using the ZFS compression feature, we can make 90% of the IO go away:

      $ zpool create -f tank c0t1d0
      $ zfs create tank/spec-zfs-gzip
      $ zfs set compression=gzip tank/spec-zfs-gzip

graph2: zfs pegs the CPU almost immediately

The improvement from ZFS gzip compression is indeed dramatic.

The careful reader may note that there are actually two lines on the far left: one measured with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant variable for the tests reported in this paper, as can be seen by the fact that the two lines are right on top of each other.

What About Memory Consumption?

Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised that it is memory-hungry, and indeed the "Best Practices" Guide plainly says that it will use all the memory on the system if it thinks it can get away with it:

The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC relinquishes memory.

ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical memory, and it is desirable to run (n - 1) copies on a system with (n) threads and (n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC: set zfs:zfs_arc_max = 0x(size) can be added to /etc/system.
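For example, a hypothetical /etc/system entry capping the ARC at 2 GB; the value is illustrative, and a reboot is required for it to take effect:

      * Cap the ZFS ARC at 2 GB (0x80000000 bytes) - illustrative value only
      set zfs:zfs_arc_max = 0x80000000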

The tests reported on this page all use a limited ARC cache.

It should also be noted that all tests are done after a fresh reboot, so presumably the ARC cache is not contributing to the reported performance. More details about methods may be found at the end of the article.

ZFS on T5440: Good, But Not As Dramatic

Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it needs to quickly inhale on the order of 64 GB.

A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right in the graph below:

Graph 3: ufs takes about 30 min, zfs about 15 min

In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO beginner here. This topic is re-visited below.)

The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg the CPU, since it is reading compressed data.

The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID" drives were created, and made visible to the host as 12 separate units. Then, 10 of them were strung together as zfs RAID-Z using:

   # zpool create zf-raidz10 raidz \
     c3t40d0  c3t40d1  c3t40d2 c3t40d3  c3t40d4  \
     c3t40d5  c3t40d6  c3t40d7 c3t40d8  c3t40d9

   # zpool status       
      pool: zf-raidz10
     state: ONLINE
     scrub: none requested
    config:

           NAME         STATE     READ WRITE CKSUM
           zf-raidz10   ONLINE       0     0     0
             raidz1     ONLINE       0     0     0
               c3t40d0  ONLINE       0     0     0
               c3t40d1  ONLINE       0     0     0
               c3t40d2  ONLINE       0     0     0
               c3t40d3  ONLINE       0     0     0
               c3t40d4  ONLINE       0     0     0
               c3t40d5  ONLINE       0     0     0
               c3t40d6  ONLINE       0     0     0
               c3t40d7  ONLINE       0     0     0
               c3t40d8  ONLINE       0     0     0
               c3t40d9  ONLINE       0     0     0

Compression was added at a later time, but before the experiment shown above:

      $ zfs list -o compression zf-raidz10
      COMPRESS
          gzip

Why Is the T5440 Improvement Not As Dramatic As the M5000?

The improvement from zfs is helpful to the T5440, but unlike the M5000, nearly 15 minutes of clock time is spent on IO. Let's look at some statistics from iostat:

$ iostat -xncz 30
.
.
.
     cpu
 us sy wt id
  8  4  0 88
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    2.1    0.0  0.0  0.0    0.1    6.7   0   0 c0t0d0
  469.3    0.0 1851.2    0.0  0.9 18.5    1.9   39.4  25  88 c3t40d9
  401.8    0.0 1893.2    0.0 19.7  9.8   48.9   24.5  88  96 c3t40d8
  471.1    0.0 1836.2    0.0  1.1 18.4    2.4   39.1  27  87 c3t40d7
  416.1    0.0 1858.5    0.0  2.0 16.1    4.9   38.7  33  88 c3t40d6
  452.1    0.0 1792.8    0.0 13.9 13.1   30.9   29.0  78  92 c3t40d5
  417.8    0.0 1868.9    0.0  0.9 16.3    2.1   39.1  18  87 c3t40d4
  461.0    0.0 1766.9    0.0  3.7 17.2    8.0   37.3  42  87 c3t40d3
  418.9    0.0 1854.9    0.0  2.9 16.2    7.0   38.6  40  88 c3t40d2
  433.6    0.0 1761.0    0.0 21.9  8.9   50.6   20.6  92  99 c3t40d1
  420.0    0.0 1852.6    0.0  1.5 16.1    3.5   38.4  29  86 c3t40d0

A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular reference: https://blogs.oracle.com/roch/entry/when_to_and_not_to.

A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system where single thread performance is much slower than the one in Graph #2. On the M5000, 'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be emphasized that this is only a secondary concern for the read statistics described in this article, although it can become more important for write workloads, since compression is harder than decompression. Doing 'gzip ref.mps' takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available, as of Solaris 10 s10s_u7, and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations. (This limitation may change in a future version of Solaris.)
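For reference, the single-thread compression timings quoted above can be reproduced with simple command-line measurements along these lines (a sketch; the file comes from the benchmark run directory):

      # Sketch only: time compression and decompression of the benchmark input file
      timex gzip -c ref.mps > ref.mps.gz
      timex gunzip -c ref.mps.gz > /dev/null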

Solution: Mirrors, No Gzip

The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time, which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower - the red line in the graph below:

graph4: uncompressed mirrors slower than gzip/raidz, until we have 24 drives

Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs for more than 120 minutes.

The file system for the leftmost line was created using:

      # zpool create dev8-m3-ngz \
      > mirror c2t40d0  c2t40d1     c3t40d0 \
      > mirror c2t40d2  c2t40d3     c3t40d1 \
      > mirror c2t40d4  c2t40d5     c3t40d2 \
      > mirror c2t40d6  c2t40d7     c3t40d3 \
      > mirror c2t40d8              c3t40d4  c3t40d5 \
      > mirror c2t40d9              c3t40d6  c3t40d7 \
      > mirror c2t40d10             c3t40d8  c3t40d9 \
      > mirror c2t40d11             c3t40d10 c3t40d11
      # 

The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3). There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.
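The resulting layout, and the per-device activity during a run, can be inspected with commands such as these (pool name as created above):

      # Show the mirror layout, then sample per-device I/O every 30 seconds
      zpool status dev8-m3-ngz
      zpool iostat -v dev8-m3-ngz 30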

Were These Tests Fair?

The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use a RAID device which is now 5 years old, and compare it versus contemporary ZFS.

This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to 450.soplex, although with a very different system under test.

On the other hand, it should be emphasized that all the T5440 tests reported in this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the RAID-Z and mirroring work.

Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not improve.

Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert, and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide limited access to disk block size setting, and the tester did not at first realize that the disk block size depends on this other parameter .... located over here in the menus ...

[screen shot: SE3510 configuration menu showing the relevant yes/no setting]

For this particular controller, default block sizes are controlled indirectly by whether this setting is yes or no. Changing it to "No" makes the default block size larger (32 KB vs. 128 KB). Once this was discovered, various tests were repeated. The hw RAID-5 test was repeated with explicit selection of a larger size; however, it did not improve. On the other hand, the NRAID devices, controlled by zfs, did improve.

Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12 x drives in a RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.

YMMV

As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.

Summary / Basic Lessons

Some basic lessons about ZFS emerge:

1) ZFS can be easily taught not to hog memory.

2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.

3) Setting up mirrored drives with dynamic striping is straightforward.

4) ZFS is not so scary, after all.

Notes on Methods

During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for 450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would not be present in a reportable run, which was accomplished as summarized below:

    runspec -a setup --rate 450.soplex
    reboot
    cd 450.soplex/run/run*000
    specinvoke -nnr > doit.sh
    convert 'sh dobmk' in doit.sh to 'sh dobmk &'
    doit.sh

The tests noted as Solaris 10 used:

      # head -1 /etc/release
                          Solaris 10 5/09 s10s_u7wos_08 SPARC

The tests noted as SNV used:

      # head -1 /etc/release
                       Solaris Express Community Edition snv_112 SPARC

The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72 (Fibre Channel).

Acknowledgments.

My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom should be blamed for this beginner's ignorance. Karsten Guthridge was the first to point out the usefulness of ZFS gzip compression for 450.soplex.

Monday Apr 13, 2009

Sun Studio Trounces Intel Compiler on Intel Chip

Today Sun announces a new world record for SPECfp2006: 50.4 on a 2-chip Nehalem (Intel Xeon X5570) Sun Blade X6270.

Congratulations to my colleagues in the Sun Studio Compiler group - the fun thing about this result is that it beats Intel's own compiler on this Intel chip by 20%, due to the optimization technologies found in the Sun Studio 12 Update 1 compiler.

SPECfp2006

System                        Processor   GHz   Chips  Cores  Peak  Base  Comments
Sun Blade X6270               Xeon 5570   2.93      2      8  50.4  45.0  New
Hitachi BladeSymphony BS2000  Xeon 5570   2.93      2      8  42.0  39.3  Top result at www.spec.org as of 14 Apr 2009
IBM Power 595                 POWER6      5.00      1      1  24.9  20.1  Best POWER6 as of 14 Apr 2009

Note that even with the less aggressive "Base" tuning [SPECfp_base2006] the Sun Blade X6270 beats the best-posted "Peak" tuning from competitors [SPECfp2006].

Of course, the Intel compiler engineers are bright folks too, and they will doubtless quickly provide additional performance on Nehalem. Still, it's fun to see the multi-target Sun Studio optimization technology deliver top results on a variety of platforms, now including Nehalem.

As to integer performance - the Sun Blade also takes top honors there [for peak]:

SPECint2006

System                Processor   GHz   Chips  Cores  Peak  Base  Comments
Sun Blade X6270       Xeon 5570   2.93      2      8  36.9  32.0  New
Fujitsu Celsius R570  Xeon 5570   2.93      2      8  36.3  32.2  Top SPECint2006 result as of 14 Apr 2009

The Sun Blade results have been submitted to SPEC for review, and should appear at SPEC's website in about 2 weeks.

On a personal note, this was my first time using OpenSolaris. The level of compatibility with other operating systems is substantially improved; utilities that this tester likes having handy are built in (e.g. the NUMA Observability Tools); and ZFS zips right along, needing less attention than ufs and delivering better performance.

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 4/14/2009.

Monday Feb 02, 2009

SPEC Benchmark Workshop 2009

Attached are the slides from my talk at the "SPEC Benchmark Workshop 2009", described at http://www.spec.org/workshops/2009/austin/program.html.

The program drew about 100 people.  There were 9 papers accepted and published by Springer, out of a field of about twice as many entries.  The papers were written by 10 authors from industry and 16 from academia.

My criterion for success was: "How many people are sleeping during my early-Sunday-morning talk?"   During the talk, I managed to wake up all but 1 of the sleepers, so I guess it was successful. 

Attached are my slides: specrate-slides.pdf

The article was published by Springer in a book - preview here: Springerlink

Here is the Author's pre-submission copy; note that Springer has the copyright, but authors are allowed to self-archive a copy on their personal site: specrate-paper.pdf
