Friday Nov 27, 2009

Sun Ray und ZFS - oder wie eine Desktop-Infrastruktur für Tausende von Benutzern sein sollte

(This blog entry is my German translation of my original blog entry on Sun Ray and ZFS at JavaOne 2009)

Was passieren kann, wenn man zwei beeindruckende Technologien auf intelligente Weise kombiniert, konnte man in diesem Jahr auf der JavaOne-Konferenz beobachten: Mehrere tausend Desktop-Umgebungen wurden mit einer Kombination aus Sun VDI Software 3, ZFS, Sun-Servern and Sun-Speichersystemen dargestellt. Hier einige bemerkenswerte Fakten: Es gab nur drei (statt mehrere tausend) verschiedene Betriebssystem-Installationen (Windows 7, Ubuntu 8.10 und OpenSolaris 2009.06)! Alle weitere Installationen waren nur Referenzen auf die originalen drei Installationen, deren Speicherplatz nur dann wuchs, wenn Benutzer ihren Desktop benutzten und Daten veränderten. Erreicht wurde das durch die spezifischen Copy-on-Write- und Snapshot-Eigenschaften von ZFS, wodurch das Klonen der Betriebssysteme nur einen Bruchteil des Speicherplatzes und auch nur einen Bruchteil der Zeit benötigte wie beim üblicherweise nötigen mehrfachen Kopieren der Betriebssysteme. Der gesamte Bedarf an Festplatten-Speicherplatz für die 6000 im Verlauf der JavaOne-Konferenz benutzten Betriebsysteme (jedes ca. 10 GB groß) sank dadurch von gewaltigen 60 TB auf gerade einmal 2 TB!

Nicht mehr als 200 Desktops auf etwas mehr als 150 Sun Ray-Terminals (400 bis 500 wären mit der dort eingesetzten Server-Infrastruktur möglich gewesen) wurden zu einer bestimmten Zeit gleichzeitig benutzt. Alle weiteren, inaktiven Desktops wurden in der Zwischenzeit zentral auf dem Festplattensystem gespeichert. Wie immer bei Sun Rays konnten die Benutzer ihre Java Card (die gleichzeitig auch ihre Konferenzkarte war) aus einem Terminal herausnehmen und zu einer beliebigen Zeit wieder in ein anderes Terminal hineinstecken, um kurz darauf ihre letzte Benutzerumgebung mit unverändertem Zustand am neuen Ort verwenden zu können. Eine Administration von PCs war natürlich durch die Verwendung von Sun Rays anstelle von PCs auch nicht erforderlich.

Wenn man an Schulen und Universitäten denkt, bei denen ebenfalls in vielen Fällen eine relativ geringe Zahl von Benutzern aus einer großen Anzahl von registrierten Benutzern gleichzeitig zu einer bestimmten Zeit ihre Desktops benutzt (z.B. in einem Informatik- oder PC-Lehrsaal), erscheint die auf der JavaOne 2009 eingesetzte SunRay-Lösung hervorragend für solche Umgebungen zu passen.

Auf dieser Internet-Seite (in englischer Sprache) kann man im zweiten Blog-Eintrag (bitte eine Seite nach unten blättern) die wesentlichen Eigenschaften dieser Lösung in einer Liste nachlesen.

Wenn Sie erfahren möchten, was Microsoft über die im eigenen Hause eingesetzte Sun Ray-Umgebung mitzuteilen hat, lesen Sie bitte in meinem letzten Blog-Eintrag nach.

Sun Ray thin clients and ZFS - or how a multi-thousand user, multi-desktop infrastructure should look like

This is what can happen if two great technologies are brought together: During this year's JavaOne back in June 2009, several thousand user desktops were handled by a combination of Sun VDI Software 3, ZFS, Sun Servers and Storage. Let me tell you about some remarkable aspects of that setup: There were just three (not thousands of) different operating system images (Windows 7, Ubuntu 8.10, OpenSolaris 2009.06). All other operating system desktops were just references to the original three images, growing only when a user changes data. See how ZFS comes into play? With its built-in copy-on-write and snapshot capabilities, cloning desktops requires just a fraction of the space and time required by traditional solutions. The total amount of storage for the 6000 desktops (each about 10 GB in size) used during the conference was just 2 TB instead of the otherwise needed 60 TB!

There were at most 200 desktops (400 to 500 were possible with that specific setup) in use at any given time, from about 150 Sun Ray terminals. All other (inactive) desktops were stored on disk in the meantime. As always with Sun Ray, users could take their Java Card (which was also their conference card) out of the current Sun Ray terminal and put it back to any other Sun Ray terminal later and would see the exact desktop they used last time. And, of course, no front end administration is necessary with Sun Rays.

Thinking of schools or universities, where often a limited number of users from a much higher number of total users access their desktops at any given time, the Sun Ray setup used at JavaOne 2009 looks like a perfect fit for those environments as well. In the second blog entry on this page (scroll down one page to get to the entry named "VDI 3 @ JavaOne - Summary"), you can see the the main features in one list.

Read what Microsoft says about their Sun Ray solution in my previous blog entry.

Tuesday Jan 27, 2009

Using ZFS as (an iSCSI) target for Mac OS X Time Machine

Inspired by this and then this blog entry, I thought it was now time for me to get my own experience with iSCSI.

Here's the result:

  1. On my eco-friendly server running OpenSolaris 2008.11, I created a new ZFS volume (not a ZFS file system!) with iSCSI sharing switched on:
    $ zfs create -o shareiscsi=on -V 180G pool2/mac-tm
    cannot share 'pool2/mac-tm': iscsitgtd failed request to share
    filesystem successfully created, but not shared
  2. Well, that did not work well. Better search and install the iSCSI packages first:
    $ pkg search -rl iscsi | nawk '{print $NF}' | \\
    nawk 'BEGIN{FS="@"}{print $1}' | sort -u
    PACKAGE
    pkg:/SUNWiscsi
    pkg:/SUNWiscsitgt
    $ pkg install SUNWiscsi SUNWiscsitgt
    DOWNLOAD                                    PKGS       FILES     XFER (MB)
    Completed                                    2/2       18/18     0.86/0.86
    
    PHASE                                        ACTIONS
    Install Phase                                  74/74
    PHASE                                          ITEMS
    Reading Existing Index                           9/9
    Indexing Packages                                2/2
    
  3. Then, I wanted to delete (destroy, in ZFS speak) and create the zvol again:
    $ zfs destroy pool2/mac-tm
    cannot destroy 'pool2/mac-tm': volume has children
    use '-r' to destroy the following datasets:
    pool2/mac-tm@zfs-auto-snap:frequent-2009-01-23-12:15#
  4. OK, I understand that an automated snapshot had already been created in the meantime. Destroy the zvol with its snapshots, and create the zvol again:
    $ zfs destroy -r pool2/mac-tm
    $ zfs create -o shareiscsi=on -V 180G pool2/mac-tm
    
  5. Check if the shareiscsi property is on for our volume:
    $ zfs get shareiscsi pool2/mac-tm
    NAME          PROPERTY    VALUE         SOURCE
    pool2/mac-tm  shareiscsi  on            local
  6. List all defined iSCSI targets:
    $ iscsitadm list target
    Target: pool2/mac-tm
        iSCSI Name: iqn.1986-03.com.sun:02:3f4f551a-41ab-4a3a-adf9-ea3ce5c2789c
        Connections: 0
  7. Looks great! On the MacBook Pro running Mac OS X 10.5.6, I installed the globalSAN iSCSI initiator software (version 3.3.0.43) from Studio Network Solutions, after downloading from this link.

  8. Then I rebooted the Mac (as required by the globalSAN iSCSI software).

  9. Next step was to mount the iSCSI drive:
    Mac OS X System Preferences
    a) Click on the globalSAN iSCSI icon to display its control panel:
    GlobalSAN iSCSI control panel #1
    b) Click on the + symbol in the lower left corner to get the following popup:
    GlobalSAN iSCSI control panel #2
    c) Enter the IP address or host name of the OpenSolaris server, leave the port number as it is, and enter the target name (the last column in the line starting with iSCSI Name: in the output of the iscsitadm list target command on your OpenSolaris server - in our case, it's iqn.1986-03.com.sun:02:3f4f551a-41ab-4a3a-adf9-ea3ce5c2789c ), and press the OK button. The iSCSI control panel will then look like:
    GlobalSAN iSCSI control panel #3
    d) Click the Connected switch at the end of the iSCSI target line (the line which starts with iqn) to get the following popup:
    GlobalSAN iSCSI control panel #4
    e) Press the Connect button to connect to that iSCSI target. As we did not specify CHAP or Kerberos authentication, the connect will work without user and password. For a walkthrough and more on CHAP authentication, click this link.
    After pressing the Connect button, the control panel will look like:
    GlobalSAN iSCSI control panel #5
    At this time, the newly created volume will show up in Disk Utility. Note that I clicked on the Persistent button to build the connection again after a reboot - I didn't try rebooting to check, but believe it will work.

  10. Then, I created a Mac OS X volume in Disk Utility.
    Disk Utility #1
    a) Click on the disk drive and then on the Erase tab, enter a new name for the volume (or leave it as it is), and press the Erase... button. The following screen will displayed to show the progesss:
    Disk Utility #2
    After the erase is completed, the new volume will show up in the left part of the Disk Utility (For this screen shot, I created the volume again after providing the name ZFS-180GB for the volume. Not sure if it's possible to rename a volume without formatting it):
    Disk Utility #3
  11. Now the volume is usable in Time Machine.
    a) Click on the Time Machine icon in System Preferences to start its control panel:
    TM control panel #1
    b) Click on Change Disk to change the destination volume for Time Machine (the lock in the lower left corner has to be unlocked first to allow for the change):
    TM control panel #2
    c) Select the new volume and press Use for Backup. Then, just start the backup (or wait 120 seconds until it starts automatically):
    TM control panel #3
    Mac OS X Time Machine has started its first backup on a ZFS volume!

However, as always in my blog entries, this is no guarantee that it will always work as described, or that the backup and restore will also work after your next Mac OS X upgrade, or that there will be no errors or problems with such a setup. What I can tell you is that a simple restore attempt worked for me just as if I had done it from a USB disk!

Up to now, I have always disconnected the USB disk drive before closing the Mac's lid so that a Time Machine backup would not be interrupted in the middle. Not sure what would happen if a Time Machine backup is running while you close the lid, so better read the docs and test it, or just always unmount Time Machine's active volume before letting your Mac sleep.

And I discovered that if an iSCSI volume is mounted before closing the lid, the Mac Book Pro cannot transition into deep sleep mode with a power consumption similar to the switched off state. It somehow sleeps, but with rotating fan and a steady front LED. And in order to wake it up, I had to open and close the lid several times. So the steps to do before closing the Mac's lid are:

  1. Eject (unmount) the volume (use the eject menu item after right-clicking on the volume's icon on the desktop).
  2. Disconnect the iSCSI target (and all others) in the globalSAN iSCSI control panel in the Mac OS X System Preferences, by unmarking the tick in column Connected for all targets. A confirmation popup will be shown when unmarking the Connected tick.

After waking up your Mac next time, just tick the Connected mark in the globalSAN iSCSI control panel again and confirm the popup that will be shown. If you did not choose another destination disk for Time Machine in the meantime, Time Machine will recognize the iSCSI drive as a valid destination volume automatically and use it for its next scheduled backup.

BTW For an interesting article on how to use ZFS iSCSI sharing with a Linux client, please click here.

Thursday Jan 15, 2009

Solaris booting from ZFS, explained by Lori Alt

Lori Alt, project lead for the ZFS boot project, explains in this video how booting Solaris from ZFS works.

The first 5 minutes covers the main feaures of ZFS. At about 05:45, Lori starts describing the Solaris boot process (SPARC and x86) and the special considerationg when using ZFS for booting, including some remarks on grub. A great primer if you want to know what happens when Solaris is booting. The video is also available in m4v format for viewing on an iPod, for example.

Friday Jan 09, 2009

ZFS and the uberblock, part 2

In my last blog entry, I described how to use DTrace to learn more about how and where the ZFS uberblock is written. I showed that an uberblock update causes 4 writes to 4 different areas, and only after 128 write cycles will an old uberblock be overwritten by a new one. Now, using zdb, I will show how often various zpools update their uberblocks so we can examine after which time a USB flash without wear leveling would probably be worn out.

Using zdb -uuuv, as in the following script:

#!/bin/ksh
TMPDIR=/var/tmp
LOGFILE=${TMPDIR}/ub-stats.out

echo "-----" >> ${LOGFILE}
date +"date+time: %Y-%m-%d %H:%M:%S" >> ${LOGFILE}
for ZPOOL in $(zpool list -o name -H); do
   echo ${ZPOOL} >> ${LOGFILE}
   zdb -uuuv ${ZPOOL} >> ${LOGFILE}                                        
done

, we can show important information about the uberblocks of each zpool and write it to a file (for example once per minute, per crontab). A timestamp is useful for further analysis.

After some some hours or days, we can extract the data of certain zpools, import it into a spreadsheet and create graphs to show how often new uberblocks are written to disk. Here's how we can print the data for rpool:

nawk '/date/{date=$2; time=$3}
/pool/{pool=$NF}
/txg/{printf ("%s %s\\t%s\\t%s\\n", date, time, pool, $NF)}' ub-stats.out
A sample output might look like:
2009-01-06 22:14:00     dpool   66273
2009-01-06 22:14:00     dpool-2 58707
2009-01-06 22:14:00     rpool   33165
2009-01-06 22:15:00     dpool   66273
2009-01-06 22:15:00     dpool-2 58707
2009-01-06 22:15:00     rpool   33167
2009-01-06 22:16:00     dpool   66273
2009-01-06 22:16:00     dpool-2 58707
2009-01-06 22:16:00     rpool   33169
2009-01-06 22:17:00     dpool   66273
2009-01-06 22:17:00     dpool-2 58707
2009-01-06 22:17:00     rpool   33171
We can easily extract just one pool, using nawk '$3=="dpool"{print}', for example, read the resulting file into OpenOffice, and create a graph of type x/y to show the txg numbers over time. For rpool, I got the following picture:

txg over time (rpool)

The data (you can even see it in the above table) shows that for rpool, txg is incremented at about two times per minute. Each additional manual "sync" command increments txg by one and also writes the uberblock to disk. Assuming that a USB flash drive without wear leveling wears out after data is written to the same location 10.000 times (which is the minimum guaranteed number of cycles for a standard USB flash drive mentioned in this document), knowing that the same block is overwritten 2\*60\*24/128 = less than 23 times after one day, we can calculate 10.000/22.5 to get the number of days until such a USB stick would be worn out. The result is about 444 days (=1 year 79 days). Assuming the USB stick is worn out only after 100.000 writes to the same location, the result would be 4444 days which is more than 12 years! At the end of this blog entry, I will explain why the USB stick might even live much longer than those 444 or 4444 days.

Interestingly, for another pool, name dpool, I got a completely different graph:

txg over time (dpool)

What I know is that on this pool, at about 06:00 am, a small number of files (about 10 to 20) is been created, again at 07:00 am, and during other times, there is no or not much write activity. Using another dtrace script, we can watch in real time and see that the uberblock's txg number is indeed increased every 30 seconds but only written to disk when there is a change to the pool to which it belongs or, for example, when a sync command is issued on the command line. Here's the script I used (a colleague recommended to also include spa_sync as this is the call for synching a transaction group, so I added two more probes for watching these events):

io:genunix:bdev_strategy:start
{
   printf ("%Y %s %d %d", walltimestamp, execname, args[0]->b_blkno, args[0]->b_bcount);
}

fbt:zfs:spa_sync:entry
{
   printf ("%Y %s %d", walltimestamp, execname, args[1]);
}

fbt:zfs:spa_sync:return,
fbt:zfs:vdev_uberblock_sync:entry
{
   printf ("%Y %s", walltimestamp, execname);
}

fbt:zfs:uberblock_update:entry
{
   printf ("%Y %d %d %d", walltimestamp,
     args[0]->ub_rootbp.blk_birth, args[0]->ub_rootbp.blk_fill,
     args[2]);
}

fbt:zfs:uberblock_update:return
{
   printf ("%Y %d %d", walltimestamp, args[0], args[1]);
}

And here's a sample output. I removed certain columns, including the hour (19 p.m.) from the original output. The times are minutes:seconds. For better readability, I have marked all entries and comments regarding writing the uberblock update for rpool to disk in green color, and all actions I performed in another terminal window while the dtrace script was running in red color.

           spa_sync:entry 11:09 sched 41533           <- spa_sync() rpool
      bdev_strategy:start 11:09 sched 25671451 512
      bdev_strategy:start 11:09 sched 4310260 512
      bdev_strategy:start 11:09 sched 25571476 3072
      bdev_strategy:start 11:09 sched 25671452 14848
	   spa_sync:entry 11:09 sched 69698           <- spa_sync() dpool-2
   uberblock_update:entry 11:09 67082 178 69698       <- updating dpool-2. txg != blk_birth!
  uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
	  spa_sync:return 11:09 sched
	   spa_sync:entry 11:09 sched 76452           <- spa_sync() dpool
   uberblock_update:entry 11:09 75007 122 76452       <- updating dpool. Note that txg is 76452
                                                         whereas blk_birth is 75007: means that
						         this file system has not been changed for
						         (76452-75007)/2+1 = 723m 30s = 12h 3m 30s
						         So the last change to dpool was
						         at about 7:08 am. This is exactly the time
                                                         stamp as shown in zdb -uuuv dpool. And the
                                                         txg number shown in that output is the one
                                                         that is read from disk, so it is 75007:

(zdb output follows; dtrace output will be continued below)

---
$ zdb -uuuv dpool
Uberblock

        magic = 0000000000bab10c
        version = 13
        txg = 75007
        guid_sum = 10502025224250020770
        timestamp = 1231481321 UTC = Fri Jan  9 07:08:41 2009
        rootbp = [L0 DMU objset] 400L/200P DVA[0]=<0:fd8c54a00:200> DVA[1]=<0:1f8b7f8800:200> DVA[2]=<0:c02e0600:200> fletcher4 lzjb LE contiguous birth=75007 fill=122 cksum=a0e45aac0:42a5af596ce:e0b8e86ece60:201ef17f0b98da
---

(dtrace output continued)

  uberblock_update:return 11:09 85 0                  <- args[1]=0: no sync!
	  spa_sync:return 11:09 sched
      bdev_strategy:start 11:09 sched 4310261 27136
      bdev_strategy:start 11:09 sched 25571482 15360
.
... some more IO of various blocks and sizes
.
      bdev_strategy:start 11:09 sched 25671511 512
      bdev_strategy:start 11:09 sched 25671517 5632
   uberblock_update:entry 11:09 41533 159 41533       <- updating rpool. txg == blk_birth!
  uberblock_update:return 11:09 85 1                  <- args[1]=1: sync!
vdev_uberblock_sync:entry 11:09 sched
      bdev_strategy:start 11:09 sched 378 1024        <- write to block #378!
bdev_strategy:start 11:09 sched 890 1024 bdev_strategy:start 11:09 sched 39005050 1024 bdev_strategy:start 11:09 sched 39005562 1024 spa_sync:return 11:09 sched spa_sync:entry 11:39 sched 41534 <- spa_sync() rpool uberblock_update:entry 11:39 41533 159 41534 <- updating rpool. txg != blk_birth! uberblock_update:return 11:39 85 0 <- args[1]=0: no sync! skip writing to block #380!
spa_sync:return 11:39 sched spa_sync:entry 11:39 sched 76453 <- spa_sync() dpool uberblock_update:entry 11:39 75007 122 76453 uberblock_update:return 11:39 85 0 spa_sync:return 11:39 sched spa_sync:entry 11:39 sched 69699 <- spa_sync() dpool-2 uberblock_update:entry 11:39 67082 178 69699 uberblock_update:return 11:39 85 0 spa_sync:return 11:39 sched spa_sync:entry 12:09 sched 69700 <- spa_sync() dpool-2 uberblock_update:entry 12:09 67082 178 69700 uberblock_update:return 12:09 85 0 spa_sync:return 12:09 sched spa_sync:entry 12:09 sched 76454 <- spa_sync() dpool uberblock_update:entry 12:09 75007 122 76454 uberblock_update:return 12:09 85 0 spa_sync:return 12:09 sched spa_sync:entry 12:09 sched 41535 <- spa_sync() rpool bdev_strategy:start 12:09 sched 25571530 1024 bdev_strategy:start 12:09 sched 4310376 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:09 sched 25671620 512 bdev_strategy:start 12:09 sched 25671626 5120 uberblock_update:entry 12:09 41535 159 41535 <- updating rpool. txg == blk_birth! uberblock_update:return 12:09 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:09 sched bdev_strategy:start 12:09 sched 382 1024 <- write to block #382!
bdev_strategy:start 12:09 sched 894 1024 bdev_strategy:start 12:09 sched 39005054 1024 bdev_strategy:start 12:09 sched 39005566 1024 spa_sync:return 12:09 sched
(sync command issued in another terminal window)

(end sync command)
spa_sync:entry 12:23 sched 76455 <- spa_sync() dpool uberblock_update:entry 12:23 75007 122 76455 uberblock_update:return 12:23 85 0 spa_sync:return 12:23 sched spa_sync:entry 12:23 sched 69701 <- spa_sync() dpool-2 uberblock_update:entry 12:23 67082 178 69701 uberblock_update:return 12:23 85 0 spa_sync:return 12:23 sched spa_sync:entry 12:23 sched 41536 <- spa_sync() rpool bdev_strategy:start 12:23 sched 25671636 1536 bdev_strategy:start 12:23 sched 25671639 2560 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:23 sched 25671718 3584 bdev_strategy:start 12:23 sched 25682884 2048 uberblock_update:entry 12:23 41536 159 41536 <- updating rpool. txg == blk_birth! uberblock_update:return 12:23 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:23 sched bdev_strategy:start 12:23 sched 384 1024 <- write to next block, #384!
bdev_strategy:start 12:23 sched 896 1024 bdev_strategy:start 12:23 sched 39005056 1024 bdev_strategy:start 12:23 sched 39005568 1024 spa_sync:return 12:23 sched bdev_strategy:start 12:53 sched 234913576 12288 bdev_strategy:start 12:53 sched 234910656 16384 spa_sync:entry 12:53 sched 41537 <- spa_sync() rpool bdev_strategy:start 12:53 sched 25671725 14848 bdev_strategy:start 12:53 sched 25571590 1024 bdev_strategy:start 12:53 sched 4310548 1024 bdev_strategy:start 12:53 sched 25571588 1024 spa_sync:entry 12:53 sched 69702 <- spa_sync() dpool-2 bdev_strategy:start 12:53 sched 234915243 512 bdev_strategy:start 12:53 sched 306224021 512 bdev_strategy:start 12:53 sched 234913640 12288 bdev_strategy:start 12:53 sched 234915250 1024 spa_sync:entry 12:53 sched 76456 <- spa_sync() dpool uberblock_update:entry 12:53 75007 122 76456 uberblock_update:return 12:53 85 0 spa_sync:return 12:53 sched bdev_strategy:start 12:53 sched 234915252 1536 bdev_strategy:start 12:53 sched 234915256 7168 . ... some more IO of various blocks and sizes . bdev_strategy:start 12:53 sched 379625154 1536 bdev_strategy:start 12:53 sched 379625160 4608 uberblock_update:entry 12:53 41537 159 41537 <- updating rpool. txg == blk_birth! uberblock_update:return 12:53 85 1 <- args[1]=1: sync! uberblock_update:entry 12:53 69702 178 69702 uberblock_update:return 12:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 12:53 sched bdev_strategy:start 12:53 sched 396 1024 bdev_strategy:start 12:53 sched 908 1024 bdev_strategy:start 12:53 sched 390704524 1024 bdev_strategy:start 12:53 sched 390705036 1024 spa_sync:return 12:53 sched vdev_uberblock_sync:entry 12:54 sched bdev_strategy:start 12:54 sched 386 1024 <- write to next block, #386
bdev_strategy:start 12:54 sched 898 1024 bdev_strategy:start 12:54 sched 39005058 1024 bdev_strategy:start 12:54 sched 39005570 1024 spa_sync:return 12:54 sched (wrote file in dpool-2, with :w in vi) 0 6668 bdev_strategy:start 2009 Jan 9 19:13:08 sched 234913600 20480 (done writing) spa_sync:entry 13:23 sched 41538 <- spa_sync() rpool uberblock_update:entry 13:23 41537 159 41538 <- updating rpool. txg != blk_birth! uberblock_update:return 13:23 85 0 <- args[1]=0: no sync! skip writing to block #388!
spa_sync:return 13:23 sched spa_sync:entry 13:23 sched 69703 <- spa_sync() dpool-2 bdev_strategy:start 13:23 sched 234915299 512 bdev_strategy:start 13:23 sched 306224077 512 bdev_strategy:start 13:23 sched 234915300 512 bdev_strategy:start 13:23 sched 234915302 1024 spa_sync:entry 13:23 sched 76457 <- spa_sync() dpool uberblock_update:entry 13:23 75007 122 76457 uberblock_update:return 13:23 85 0 spa_sync:return 13:23 sched bdev_strategy:start 13:23 sched 234915304 1536 bdev_strategy:start 13:23 sched 234915308 7168 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:23 sched 379625170 6144 bdev_strategy:start 13:23 sched 379625184 6144 uberblock_update:entry 13:23 69703 178 69703 <- updating dpool-2. txg == blk_birth! uberblock_update:return 13:23 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 13:23 sched bdev_strategy:start 13:23 sched 398 1024 bdev_strategy:start 13:23 sched 910 1024 bdev_strategy:start 13:23 sched 390704526 1024 bdev_strategy:start 13:23 sched 390705038 1024 spa_sync:return 13:23 sched (rm file in dpool-2)

(done rm) spa_sync:entry 13:53 sched 69704 <- spa_sync() dpool-2 bdev_strategy:start 13:53 sched 234915348 512 bdev_strategy:start 13:53 sched 306224124 512 bdev_strategy:start 13:53 sched 234915364 1024 bdev_strategy:start 13:53 sched 306224140 1024 spa_sync:entry 13:53 sched 76458 <- spa_sync() dpool uberblock_update:entry 13:53 75007 122 76458 uberblock_update:return 13:53 85 0 spa_sync:return 13:53 sched spa_sync:entry 13:53 sched 41539 <- spa_sync() rpool bdev_strategy:start 13:53 sched 25571606 15360 bdev_strategy:start 13:53 sched 25571642 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:53 sched 379625196 6144 bdev_strategy:start 13:53 sched 379625210 6144 uberblock_update:entry 13:53 69704 178 69704 <- updating dpool-2. txg == blk_birth! uberblock_update:return 13:53 85 1 <- args[1]=1: sync! bdev_strategy:start 13:53 sched 25671818 2560 bdev_strategy:start 13:53 sched 4310632 2560 . ... some more IO of various blocks and sizes . bdev_strategy:start 13:53 sched 25671823 3584 bdev_strategy:start 13:53 sched 25682892 2048 uberblock_update:entry 13:53 41539 159 41539 <- updating rpool. txg == blk_birth! uberblock_update:return 13:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 13:53 sched bdev_strategy:start 13:53 sched 400 1024 bdev_strategy:start 13:53 sched 912 1024 bdev_strategy:start 13:53 sched 390704528 1024 bdev_strategy:start 13:53 sched 390705040 1024 spa_sync:return 13:53 sched vdev_uberblock_sync:entry 13:54 sched bdev_strategy:start 13:54 sched 390 1024 <- write to block #390
bdev_strategy:start 13:54 sched 902 1024 bdev_strategy:start 13:54 sched 39005062 1024 bdev_strategy:start 13:54 sched 39005574 1024 spa_sync:return 13:54 sched spa_sync:entry 14:23 sched 41540 <- spa_sync() rpool uberblock_update:entry 14:23 41539 159 41540 <- updating rpool. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! skip writing to block #392!
spa_sync:return 14:23 sched spa_sync:entry 14:23 sched 69705 <- spa_sync() dpool-2 uberblock_update:entry 14:23 69704 178 69705 <- updating dpool-2. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! spa_sync:return 14:23 sched spa_sync:entry 14:23 sched 76459 <- spa_sync() dpool uberblock_update:entry 14:23 75007 122 76459 <- updating dpool. txg != blk_birth! uberblock_update:return 14:23 85 0 <- args[1]=0: no sync! spa_sync:return 14:23 sched spa_sync:entry 14:53 sched 76460 <- spa_sync() dpool uberblock_update:entry 14:53 75007 122 76460 <- updating dpool. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! spa_sync:return 14:53 sched spa_sync:entry 14:53 sched 69706 <- spa_sync() dpool-2 uberblock_update:entry 14:53 69704 178 69706 <- updating dpool-2. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! spa_sync:return 14:53 sched spa_sync:entry 14:53 sched 41541 <- spa_sync() rpool uberblock_update:entry 14:53 41539 159 41541 <- updating rpool. txg != blk_birth! uberblock_update:return 14:53 85 0 <- args[1]=0: no sync! skip writing to block #394!
spa_sync:return 14:53 sched spa_sync:entry 15:23 sched 41542 <- spa_sync() rpool uberblock_update:entry 15:23 41539 159 41542 <- updating rpool. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! skip writing to block #396!
spa_sync:return 15:23 sched spa_sync:entry 15:23 sched 69707 <- spa_sync() dpool-2 uberblock_update:entry 15:23 69704 178 69707 <- updating dpool-2. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! spa_sync:return 15:23 sched spa_sync:entry 15:23 sched 76461 <- spa_sync() dpool uberblock_update:entry 15:23 75007 122 76461 <- updating dpool. txg != blk_birth! uberblock_update:return 15:23 85 0 <- args[1]=0: no sync! spa_sync:return 15:23 sched (created new file in rpool, using the touch command)

(done creating file)
spa_sync:entry 15:53 sched 76462 <- spa_sync() dpool uberblock_update:entry 15:53 75007 122 76462 <- updating dpool. txg != blk_birth! uberblock_update:return 15:53 85 0 <- args[1]=0: no sync! spa_sync:return 15:53 sched spa_sync:entry 15:53 sched 69708 <- spa_sync() dpool-2 uberblock_update:entry 15:53 69704 178 69708 <- updating dpool-2. txg != blk_birth! uberblock_update:return 15:53 85 0 <- args[1]=0: no sync! spa_sync:return 15:53 sched spa_sync:entry 15:53 sched 41543 <- spa_sync() rpool bdev_strategy:start 15:53 sched 25571658 15360 bdev_strategy:start 15:53 sched 25571688 1024 . ... some more IO of various blocks and sizes . bdev_strategy:start 15:53 sched 25671855 4096 bdev_strategy:start 15:53 sched 25682908 2048 uberblock_update:entry 15:53 41543 159 41543 <- updating rpool. txg == blk_birth! uberblock_update:return 15:53 85 1 <- args[1]=1: sync! vdev_uberblock_sync:entry 15:53 sched bdev_strategy:start 15:53 sched 398 1024 <- write to block #398! bdev_strategy:start 15:53 sched 910 1024 bdev_strategy:start 15:53 sched 39005070 1024 bdev_strategy:start 15:53 sched 39005582 1024 spa_sync:return 15:54 sched
I think this is a very nice example of what you can find out with DTrace. You may ask, "Why did the rpool graph at the top of this blog entry show an update in rpool 2 times per minute whereas the DTrace output shows an update not that often (which, by the way, would further extend the life of the USB flash drive...)?" This is because the output of the zdb script was written to file in /var/tmp every minute, causing a change in rpool more often than during the run time of the DTrace script!


Tuesday Jan 06, 2009

ZFS and the uberblock

Inspired by Constantin's comment on USB sticks wearing out Matthias's blog entry about an eco-friendly home server, I tried to find out more about how and how often the ZFS uberblock is written.

Using DTrace, it's not that difficult:

We start by finding out which DTrace probes exist for the uberblock:

$ dtrace -l | grep -i uberblock
31726        fbt               zfs            vdev_uberblock_compare entry
31727        fbt               zfs            vdev_uberblock_compare return
31728        fbt               zfs          vdev_uberblock_load_done entry
31729        fbt               zfs          vdev_uberblock_load_done return
31730        fbt               zfs          vdev_uberblock_sync_done entry
31731        fbt               zfs          vdev_uberblock_sync_done return
31732        fbt               zfs               vdev_uberblock_sync entry
31733        fbt               zfs               vdev_uberblock_sync return
34304        fbt               zfs          vdev_uberblock_sync_list entry
34305        fbt               zfs          vdev_uberblock_sync_list return
34404        fbt               zfs                  uberblock_update entry
34405        fbt               zfs                  uberblock_update return
34408        fbt               zfs                  uberblock_verify entry
34409        fbt               zfs                  uberblock_verify return
34416        fbt               zfs               vdev_uberblock_load entry
34417        fbt               zfs               vdev_uberblock_load return

So there are two probes on uberblock_update: fbt:zfs:uberblock_update:entry and fbt:zfs:uberblock_update:return!

Now we can find out more about it by searching the OpenSolaris sources: When searching for definition of uberblock_update in project onnv, we find one hit for line 49 in file uberblock.c, and when clicking on it, we see:

source extract: line 49 of file uberblock.c

Now, when searching again for the definitions of the first two arguments (args[0| and args[1|) of uberblock_update (which is uberblock and vdev), we get:

For uberblock, the following hits are shown:

When clicking on the link on the definition of struct uberblock (around line 53 in file uberblock_impl.h), we get:

For the members of struct vdev, it's not that easy. First, we get a long hit list when searching for the definition of vdev in the source browser. But if we search for "struct vdev" in that list, using the browser's search function, we get:


When clicking on the definition of struct vdev (around line 108 in file vdev_impl.h), we can see all the members of this structure.

Here are all the links, plus one more for struct blkprt (a member of struct uberblock), again in one place:

Now we are prepared to access the data via DTrace, by printing the arguments and members as in the following example:

printf ("%d %d %d", args[0]->ub_timestamp, args[1]->vdev_id, args[2]);

So a sample final DTrace script to print out as much information in the event of an uberblock_update as we can, and also print out any relevant I/O (hoping that from showing both at the same time, we can see where and how often the uberblocks are written):

io:genunix:default_physio:start,
io:genunix:bdev_strategy:start,
io:genunix:biodone:done
{
   printf ("%d %s %d %d", timestamp, execname,
     args[0]->b_blkno, args[0]->b_bcount);
}

fbt:zfs:uberblock_update:entry
{
   printf ("%d %s, %d, %d, %d, %d", timestamp, execname,
     pid, args[0]->ub_rootbp.blk_prop, args[1]->vdev_asize, args[2]);
}

The lines for showing the I/O are derived from DTrace scripts for I/O analysis in the DTrace Toolkit

Although I was unable to print out members of struct vdev (the second argument to uberblock_update() ) with the fbt:zfs:uberblock_update:entry probe (I also tried fbt:zfs:uberblock_update:return but had other problems with that one), the results when running this script, using
$ dtrace -s zfs-uberblock-report-02.d
, are quite interesting. Here's an extract (long lines shortened):

  0  33280  uberblock_update:entry 102523281435514 sched, 0, 922..345, 0, 21005
  0   5510     bdev_strategy:start 102523490757174 sched 282 1024
  0   5510     bdev_strategy:start 102523490840779 sched 794 1024
  0   5510     bdev_strategy:start 102523490873844 sched 18493722 1024
  0   5510     bdev_strategy:start 102523490903928 sched 18494234 1024
  0   5498            biodone:done 102523491215729 sched 282 1024
  0   5498            biodone:done 102523491576878 sched 794 1024
  0   5498            biodone:done 102523491873015 sched 18493722 1024
  0   5498            biodone:done 102523492232464 sched 18494234 1024
...
  0  33280  uberblock_update:entry 102553280316974 sched, 0, 922..345, 0, 21006
  0   5510     bdev_strategy:start 102553910907205 sched 284 1024
  0   5510     bdev_strategy:start 102553910989248 sched 796 1024
  0   5510     bdev_strategy:start 102553911022603 sched 18493724 1024
  0   5510     bdev_strategy:start 102553911052733 sched 18494236 1024
  0   5498            biodone:done 102553911344640 sched 284 1024
  0   5498            biodone:done 102553911623733 sched 796 1024
  0   5498            biodone:done 102553911981236 sched 18493724 1024
  0   5498            biodone:done 102553912250614 sched 18494236 1024
...
  0  33280  uberblock_update:entry 102583279275573 sched, 0, 922..345, 0, 21007
  0   5510     bdev_strategy:start 102583540376459 sched 286 1024
  0   5510     bdev_strategy:start 102583540459265 sched 798 1024
  0   5510     bdev_strategy:start 102583540492968 sched 18493726 1024
  0   5510     bdev_strategy:start 102583540522840 sched 18494238 1024
  0   5498            biodone:done 102583540814677 sched 286 1024
  0   5498            biodone:done 102583541091636 sched 798 1024
  0   5498            biodone:done 102583541406962 sched 18493726 1024
  0   5498            biodone:done 102583541743494 sched 18494238 1024

Using the following (n)awk one-liners:

$ nawk '/uberblock/{print}}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a==2)){print}}' zfs-ub-report-02.d.out
$ nawk '/uberblock/{a=0}{a++;if ((a>=1)&&(a<=5)){print}}' zfs-ub-report-02.d.out
, we can print:
  • only the uberblock_update lines, or
  • just the next line after the line that matches the uberblock_update entry, or
  • all 4 lines after that entry, including the entry itself.

When running the script for a while and capturing its output, we can later analyze at which block number the first block after uberblock_update() is written, and we can see that the numbers are always even, the lowest number is 256 and the highest number is 510, with a block size of 1024. Those block numbers always go from 256, 258, 260, and so forth, until they reach 510. Then, they start with 256 again. So every (510-256)/2+1 = 128th iteration (yes, it's one more, as we have to include the first element after subtracting the first from the last element), the first block is overwritten again. The same is true for blocks 768...1022, 18493696...18493950 and 18494208...18494462 (the third and fourth block ranges should be different for different zpool sizes).

Now that we understand how and in which order the uberblocks are written, we are prepared to examine after how many days the uberblock area of a USB stick without wear leveling would probably be worn out. More on that and how we can use zdb for that, in my next blog entry.

Some more links on this topic:

Friday Jan 02, 2009

ZFS on external disks: Why free disk space might not increase when deleting files...

Interesting: After copying about 30,000 files from an internal disk (OpenSolaris 2008.11) to a new directory of an external USB disk and later removing a similar number of files, the free disk space as reported by zpool status or in the column "Available" of df -k did not increase! And the df -k output showed that the number of total blocks of the only file system on that disk had decreased! What had happened?

Here's the explanation:

For backup purposes, I wanted to use an external 2.5" USB disk with a capacity of about 186GB. On that USB disk, I had created a zpool, using:
$ zpool create -f dpool-2 c5t0d0

For a first backup, I copied the mentioned 30,000 files with a total size of about 75GB from one system to a USB disk connected to another server (running OpenSolaris 2008.11 as well) via the network. I used a simple scp -pr which preserves modes and times but not uids and gids. One possible procedude for preserving file ownership as well, as mentioned here, for example, would be:
cd source_dir; tar -cf - .|ssh user@targethost "cd target_dir; tar -xvf -".

Anyway, I thought it would be a good idea to export that USB disk and import it on another server (also on OpenSolaris 2008.11) to check if there were any problems. I also wanted to run a
$ zpool scrub dpool-2
to verify the data integrity. Everything went smoothly.

OK. So I exported the zpool again, connected the disk to the server on which the original data is located, imported the zpool and started copying the same data which I had copied before (via the network) locally, using
$ tar -cf - | ( cd /dpool-2; tar -xpf - )

As this finished without errors as well, and a comparison using /usr/bin/diff or the default /usr/gnu/bin/diff only showed a problem with one of the files that had an interesting character in its file name (the problem went away after changing LANG from en_US.UTF-8 to C), I decided to remove the old 30,000 files (the ones which I had copied via the network). I had removed files on ZFS before, using rm -rf, and it always was amazingly fast, compared to PCFS or UFS. But this time, removing those files took quite a while, with some disk activity.

After all the "old" files were removed, I discovered the following:

  • Free space as in df -k or zpool status dropped from 112GB to 36GB!
  • Used space as in zpool status increased from 75GB to 150GB!
  • df -k reported a total size of the only file system on that zpool of just 108GB, compared to 183GB as it was reported before!

The solution was to execute the following command:
$ zfs list -t snapshot
The output showed several snapshots, taken during the time the disk was connected to the second server (on which I had set up time slider for creating automated snapshots). When I connected the disk to the final server, the snapshots were still there, and when removing files, the snapshots were still valid so it would have been possible to create the deleted files again. So the easy solution for recreating the free space again was just to remove all those snapshots, using zfs destroy, as in the following example:
$ zfs destroy dpool-2@zfs-auto-snap:frequent-2009-01-02-18:45
After the final snapshot for dpool-2 was removed, things showed up in df -k and zpool status as expected. "Problem" solved!

Wednesday Apr 11, 2007

Solaris 10 How-To Guides!

See this for some very good how-to guides on various Solaris 10 administration topics.

Friday Apr 06, 2007

ZFS is part of FreeBSD

My colleagues just "told" [1], [2] me that ZFS, the upcoming standard filesystem (my personal opinion, but search for zfs Windows in the web and you'll see what I mean), has made its way into FreeBSD!
About

blogfinger

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today