Friday Feb 02, 2007

corrupted files and 'zpool status -v'

If ZFS detects a checksum error or a read I/O failure and cannot correct it (say, by successfully reading from the other side of a mirror), it will store a persistent log of the objects that are permanently damaged (perhaps due to silent corruption).

Previously (that is, before snv_57), the output we gave was only somewhat useful:

# zpool status -v
  pool: monkey
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        monkey      ONLINE      26     0     0
          c1t1d0s7  ONLINE      12     0     0
          c1t1d0s6  ONLINE      14     0     0

errors: The following persistent errors have been detected:

          DATASET  OBJECT  RANGE
          0x0      0x13    lvl=0 blkid=0
          0x5      0x4     lvl=0 blkid=0
          0x17     0x4     lvl=0 blkid=0
          0x1d     0x4     lvl=0 blkid=0
          0x24     0x5     lvl=0 blkid=0
          0x2a     0x4     lvl=0 blkid=0
          0x2a     0x6     lvl=0 blkid=0
          0x30     0x4     lvl=0 blkid=0
          0x36     0x0     lvl=0 blkid=2

If you were lucky, the DATASET object number would actually get converted into a dataset name. If it didn't, you had to use zdb(1M) to figure out what the dataset name/mountpoint was. After that, you had to use the '-inum' option to find(1) to figure out what the actual file was (see the opensolaris thread on it). While it's really powerful to even have this ability, it would be really nice to have ZFS do all the dirty work for you - we are, after all, shooting for easy administration.
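On those old bits, the manual dance can be sketched like this - the object number comes from the example output above, and the zdb step is illustrative only (exact flags vary by build):

```shell
# Manually translating a 'zpool status -v' DATASET/OBJECT pair on
# pre-snv_57 bits (sketch only - zdb usage varies by build).

# The OBJECT column is the object (inode) number, printed in hex:
obj_dec=$((0x13))
echo "object number in decimal: $obj_dec"

# Then: map the DATASET number to a dataset/mountpoint with zdb(1M),
# and search that filesystem for the file by inode number, e.g.:
#   find /monkey -xdev -inum "$obj_dec" -print
```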

With the putback of 6410433 ('zpool status -v' would be more useful with filenames), observability has been greatly increased:

# zpool status -v
  pool: monkey
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        monkey      ONLINE      24     0     0
          c1t1d0s6  ONLINE      10     0     0
          c1t1d0s7  ONLINE      14     0     0

errors: Permanent errors have been detected in the following files:

        /monkey/a.txt
        /monkey/bananas/b.txt
        /eric/c.txt
        /monkey/sub/dir/d.txt
        monkey/ghost:/e.txt
        monkey/ghost:/boo/f.txt
        monkey/dnode:<0x0>
        <metadata>:<0x13>

For the listings above, we attempt to print out the full path to the file. If we successfully find the full path and the dataset is mounted then we print out the full path with a preceding "/" (such as in the "/monkey/a.txt" example above). If we successfully find it, but the dataset is not mounted, then we print out the dataset name (no preceding "/"), followed by the path within the dataset to the file (see the "monkey/ghost:/e.txt" example above).

If we can't successfully translate the object number to a file path (either due to an error, or because the object doesn't have a real file path associated with it, as is the case for, say, a dnode_t), then we print out the dataset name followed by the object's number (as in the "monkey/dnode:<0x0>" case above). If an object in the MOS gets corrupted, then we print out the special tag <metadata>, followed by the object number.
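The three rules can be summarized in a little sketch - `format_error` is a hypothetical helper (not the actual libzfs code), and it assumes each mounted dataset's mountpoint matches its name:

```shell
# Hypothetical sketch of the display rules above (not the real
# libzfs code). Pick the output form based on what could be resolved.
format_error() {
    dataset=$1 relpath=$2 mounted=$3 objnum=$4
    if [ -z "$relpath" ]; then
        echo "$dataset:<$objnum>"       # no real file path: object number
    elif [ "$mounted" = yes ]; then
        echo "/$dataset/$relpath"       # mounted: full path, leading "/"
    else
        echo "$dataset:/$relpath"       # not mounted: dataset, then path
    fi
}
format_error monkey a.txt yes           # -> /monkey/a.txt
format_error monkey/ghost e.txt no      # -> monkey/ghost:/e.txt
format_error monkey/dnode "" no 0x0     # -> monkey/dnode:<0x0>
format_error "<metadata>" "" no 0x13    # -> <metadata>:<0x13>
```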

Couple this with background scrubbing and you have very impressive fault management and observability. What other filesystem/storage system can give you this ability?

Note: these changes are in snv_57, will hopefully make s10u4, and perhaps even Leopard :)

If you're stuck on old bits (without the above-mentioned changes) and are trying to figure out how to translate object numbers to filenames, then check out this thread.

Tuesday Nov 21, 2006

zil_disable

People are finding that setting 'zil_disable' seems to increase their performance - especially NFS/ZFS performance. But what does setting 'zil_disable' to 1 really do? It completely disables the ZIL. Ok fine, what does that mean?

Disabling the ZIL causes ZFS to not immediately write synchronous operations to disk/storage. With the ZIL disabled, synchronous operations (such as fsync(), O_DSYNC writes, OP_COMMIT for NFS, etc.) are still written to disk, just with the same guarantees as asynchronous operations. That means success can be returned to applications/NFS clients before the data has been committed to stable storage. In the event of a server crash, any data that hasn't been written out to storage is lost forever.

With the ZIL disabled, no ZIL log records are written.

Note: disabling the ZIL does NOT compromise filesystem integrity. Disabling the ZIL does NOT cause corruption in ZFS.

Disabling the ZIL is definitely frowned upon and can cause your applications much confusion. It can cause corruption for NFS clients in the case where a reply is sent to the client before the server crashes, and the server crashes before the data is committed to stable storage. If you can't live with this, then don't turn off the ZIL.

The 'zil_disable' tunable will go away once 6280630 (zil synchronicity) is putback.

Hmm, so all of this sounds shady - so why did we add 'zil_disable' to the code base? Not for people to use, but as an easy way to do performance measurements (to isolate areas outside the ZIL).
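For completeness, here's how the tunable was typically flipped for such measurements - a hedged sketch, since the exact mechanics vary by build (and again: not for production use):

```shell
# ZIL-isolation experiments ONLY - remember the NFS caveat above.

# Via /etc/system (takes effect at the next boot):
#   set zfs:zil_disable = 1

# Or on a live system via mdb(1); historically this only affected
# filesystems mounted afterwards:
#   echo 'zil_disable/W 1' | mdb -kw
```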

If you'd like more information on how the ZIL works, check out Neil's blog and Neelakanth's blog.

Monday Oct 16, 2006

zpool history

ZFS - now with history!

When disaster strikes and you'd really like to know what people have been doing to your pool, what do you do? Until now, there was nothing elegant. Enter 'zpool history'.

# zpool create hewitt c1d0
# zfs create hewitt/jen
# zfs create hewitt/jen/love
# zpool history
History for 'hewitt':
2006-10-16.17:54:02 zpool create hewitt c1d0
2006-10-16.17:54:11 zfs create hewitt/jen
2006-10-16.17:54:15 zfs create hewitt/jen/love

#

All subcommands of zfs(1M) and zpool(1M) that modify the state of the pool get logged persistently to disk. That means no matter where you take your pool or what machine is currently accessing it (such as in the SunCluster failover case), your history follows. Sorta like your permanent record.

Now you have a convenient way of finding out if someone did something bad to your pool...

bad_admin# zfs set checksum=off hewitt
bad_admin# zfs destroy hewitt/jen/love

good_admin# zpool history              
History for 'hewitt':
2006-10-16.17:54:02 zpool create hewitt c1d0
2006-10-16.17:54:11 zfs create hewitt/jen
2006-10-16.17:54:15 zfs create hewitt/jen/love
2006-10-16.17:54:35 zfs set checksum=off hewitt
2006-10-16.17:57:29 zfs destroy hewitt/jen/love

# 

The history log is implemented using a ring buffer of <packed record length, record nvlist> tuples. More details can be found in spa_history.c, which contains the main kernel code changes for 'zpool history'. The history log's size is 1% of your pool, with a maximum of 32MB and a minimum of 128KB. Note: the original creation of the pool via 'zpool create' is never overwritten.
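The sizing rule works out as follows - a sketch with a hypothetical helper (not the kernel code), sizes in bytes:

```shell
# Sketch of the history log sizing described above: 1% of the pool,
# clamped to [128KB, 32MB]. (Hypothetical helper, not the kernel code.)
histsize() {
    poolsize=$1                          # pool size in bytes
    size=$((poolsize / 100))
    min=$((128 * 1024))
    max=$((32 * 1024 * 1024))
    if [ "$size" -lt "$min" ]; then size=$min; fi
    if [ "$size" -gt "$max" ]; then size=$max; fi
    echo "$size"
}
histsize $((1024 * 1024 * 1024))         # 1GB pool: ~10MB log
histsize $((8 * 1024 * 1024))            # tiny pool: hits the 128KB floor
histsize $((1024 * 1024 * 1024 * 1024))  # 1TB pool: hits the 32MB cap
```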

If you add a new subcommand to zfs(1M) or zpool(1M), all you need to do is call zpool_log_history(). If you build a new consumer of 'zpool history' (such as a GUI), then you need to call zpool_get_history() and parse the nvlist. A good example of that is get_history_one().

In the future, we will add the ability to also log uid, hostname, and zonename. We're also looking at adding "internal events" to the log since some subcommands actually take more than one txg, and we'd like to log history every txg (this would be more for developers and debuggers than admins).

These changes are in snv_51, and i would expect s10_u4 (though that schedule hasn't been decided yet).

Enjoy making history.

Monday Aug 07, 2006

vq_max_pending

As part of the I/O scheduling, ZFS has a field called 'zfs_vdev_max_pending'. This limits the maximum number of I/Os we can send down per leaf vdev. This is NOT the maximum per filesystem or per pool. Currently the default is 35. This is a good number for today's disk drives; however, it is not a good number for storage arrays that are really comprised of many disks but exported to ZFS as a single device.
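A quick back-of-the-envelope shows why (the 10-disk array here is just an example figure):

```shell
# With the default of 35 outstanding I/Os per leaf vdev, a "device"
# that is really a 10-disk array leaves each underlying disk with
# only a few I/Os in flight:
awk 'BEGIN {
    per_vdev = 35    # zfs_vdev_max_pending default
    disks    = 10    # example: disks behind the array LUN
    printf "outstanding I/Os per underlying disk: %.1f\n", per_vdev / disks
}'
```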

This limit is a really good thing when you have a heavy I/O load as described in Bill's "ZFS vs. The Benchmark" blog.

But if you've created, say, a 2 device mirrored pool - where each device is really a 10 disk storage array - and you think that ZFS just isn't doing enough I/O for you, here's a script to see if that's true:

#!/usr/sbin/dtrace -s

vdev_queue_io_to_issue:return
/arg1 != NULL/
{
        @c["issued I/O"] = count();
}

vdev_queue_io_to_issue:return
/arg1 == NULL/
{
        @c["didn't issue I/O"] = count();
}

vdev_queue_io_to_issue:entry
{
        @avgers["avg pending I/Os"] = avg(args[0]->vq_pending_tree.avl_numnodes);
        @lquant["quant pending I/Os"] = quantize(args[0]->vq_pending_tree.avl_numnodes);
        @c["total times tried to issue I/O"] = count();
}

vdev_queue_io_to_issue:entry
/args[0]->vq_pending_tree.avl_numnodes > 349/
{
        @avgers["avg pending I/Os > 349"] = avg(args[0]->vq_pending_tree.avl_numnodes);
        @quant["quant pending I/Os > 349"] = lquantize(args[0]->vq_pending_tree.avl_numnodes, 33, 1000, 1);
        @c["total times tried to issue I/O where > 349"] = count();
}

/* bail after 5 minutes */
tick-300sec
{
        exit(0);
} 

If you see the "avg pending I/Os" hitting your vq_max_pending limit, then raising the limit would be a good thing. That used to be done per vdev, but we now have a single global way to change all vdevs.

heavy# mdb -kw
Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs ip hook neti sctp arp usba fctl nca lofs zfs random nfs cpc fcip logindmux ptm sppp ipc ]
> zfs_vdev_max_pending/E
zfs_vdev_max_pending:
zfs_vdev_max_pending:           35              
> zfs_vdev_max_pending/W 0t70
zfs_vdev_max_pending:           0x23            =       0x46
> zfs_vdev_max_pending/E
zfs_vdev_max_pending:
zfs_vdev_max_pending:           70              
>

The above changes the maximum number of pending requests per vdev from 35 to 70 (mdb's "0t" prefix means decimal; the old and new values are echoed back in hex).

Having people tune variables is never desirable, and we'd like 'vq_max_pending' (among others) to be set dynamically - see: 6457709 vdev_knob values should be determined dynamically.

Friday Aug 04, 2006

iSCSI setup

Now that both the iSCSI target and initiator are in Solaris, let's see how to set one up. On the target machine, we do the following...

First, use 'iscsitadm' to provide a directory where the iSCSI repository and configuration information is stored. This is also the directory where the logical units will be stored if you create luns that are disk emulated files (the default).

hodur# iscsitadm modify admin -d /var/tmp/iscsi

Now actually create the luns. I chose to create pass-through raw devices instead of file based disk emulating luns.

hodur# iscsitadm create target --type raw --backing-store /dev/rdsk/c4t2d0 c4t2d0
hodur# iscsitadm create target --type raw --backing-store /dev/rdsk/c4t4d0 c4t4d0
hodur# iscsitadm create target --type raw --backing-store /dev/rdsk/c4t8d0 c4t8d0
hodur# iscsitadm create target --type raw --backing-store /dev/rdsk/c4t12d0 c4t12d0

Let's get a listing of the targets on this machine:

hodur# iscsitadm list target 
Target: c4t2d0
    iSCSI Name: iqn.1986-03.com.sun:02:aaf6d680-6681-e36e-8497-855decbf8038.c4t2d0
    Connections: 0
Target: c4t4d0
    iSCSI Name: iqn.1986-03.com.sun:02:f935aa89-d195-6ed9-9a1a-febe8d0550d2.c4t4d0
    Connections: 0
Target: c4t8d0
    iSCSI Name: iqn.1986-03.com.sun:02:5af80810-aa7c-c8c2-e566-af70bd579219.c4t8d0
    Connections: 0
Target: c4t12d0
    iSCSI Name: iqn.1986-03.com.sun:02:b3120e6c-896c-e3e3-c908-8a43789997d9.c4t12d0
    Connections: 0
hodur#

Now onto the initiator:

fsh-mullet# iscsiadm add discovery-address <IP-of-target>
fsh-mullet# iscsiadm modify discovery -t enable

We wait a few seconds, and then verify the status on the target machine (notice the 4 luns are now online instead of offline):

hodur# iscsitadm list target -v
Target: c4t2d0
    iSCSI Name: iqn.1986-03.com.sun:02:aaf6d680-6681-e36e-8497-855decbf8038.c4t2d0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba73b35b.44d23863
            Alias: fsh-mullet
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 010000e0812a8f5300002a0044d3ef91
            VID: SUN
            PID: SOLARIS
            Type: raw
            Size:   34G
            Backing store: /dev/rdsk/c4t2d0
            Status: online
Target: c4t4d0
    iSCSI Name: iqn.1986-03.com.sun:02:f935aa89-d195-6ed9-9a1a-febe8d0550d2.c4t4d0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba73b35b.44d23863
            Alias: fsh-mullet
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 010000e0812a8f5300002a0044d3ef8e
            VID: SUN
            PID: SOLARIS
            Type: raw
            Size:   34G
            Backing store: /dev/rdsk/c4t4d0
            Status: online
Target: c4t8d0
    iSCSI Name: iqn.1986-03.com.sun:02:5af80810-aa7c-c8c2-e566-af70bd579219.c4t8d0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba73b35b.44d23863
            Alias: fsh-mullet
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 010000e0812a8f5300002a0044d3ef8a
            VID: SUN
            PID: SOLARIS
            Type: raw
            Size:   34G
            Backing store: /dev/rdsk/c4t8d0
            Status: online
Target: c4t12d0
    iSCSI Name: iqn.1986-03.com.sun:02:b3120e6c-896c-e3e3-c908-8a43789997d9.c4t12d0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba73b35b.44d23863
            Alias: fsh-mullet
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 010000e0812a8f5300002a0044d3ef87
            VID: SUN
            PID: SOLARIS
            Type: raw
            Size:   34G
            Backing store: /dev/rdsk/c4t12d0
            Status: online
hodur# 

Now back on the initiator, we check to see if we can see those devices:

fsh-mullet# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 
          /pci@1c,600000/scsi@2/sd@0,0
       1. c0t1d0 
          /pci@1c,600000/scsi@2/sd@1,0
       2. c2t1d0 
          /iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Ab3120e6c-896c-e3e3-c908-8a43789997d9.c4t12d00001,0
       3. c2t2d0 
          /iscsi/disk@0000iqn.1986-03.com.sun%3A02%3A5af80810-aa7c-c8c2-e566-af70bd579219.c4t8d00001,0
       4. c2t4d0 
          /iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Af935aa89-d195-6ed9-9a1a-febe8d0550d2.c4t4d00001,0
       5. c2t5d0 
          /iscsi/disk@0000iqn.1986-03.com.sun%3A02%3Aaaf6d680-6681-e36e-8497-855decbf8038.c4t2d00001,0
Specify disk (enter its number): ^C
fsh-mullet# 

Hmm, perhaps i could even create a RAID-Z out of those luns...

fsh-mullet# zpool create -f iscs-me raidz c2t1d0 c2t2d0 c2t4d0 c2t5d0
fsh-mullet# zpool status
  pool: iscs-me
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        iscs-me     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet# 

Do i hear Mr. Koolaid? ... "OH YEAH!"

Friday Jun 30, 2006

NFSv4 client for MacOSX 10.4

Just saw this pass by on nfsv4.org mailing list from Rick... excellent news!

If anyone is interested in trying an NFS Version 4 client (including Kerberos
support) on xnu-792.6.70 (Mac OSX 10.4.6 on PPC or Darwin8.0.1 on x86), you can
download the patch from (currently Beta test, I'd say):
ftp.cis.uoguelph.ca:/pub/nfsv4/darwin-port/xnu-client.tar.gz

There is also a "Readme" file and a primitive web page at:
http://www.cis.uoguelph.ca/~nfsv4

If you are interested in seeing further announcements related to this,
please join openbsd-nfsv4@sfobug.org. I'll try and refrain from spamming these
lists with announcements most of you aren't interested in.

Just in case you are interested in it, rick
ps: If you have already downloaded the patch, you should do so again,
    since I just updated it a few minutes ago.

Perhaps soon my laptop will have NFSv4 and ZFS....

Monday Apr 17, 2006

Linux support for mirror mounts


Subject:
RFC [PATCH 0/6] Client support for crossing NFS server mountpoints
From:
Trond Myklebust 
Date:
Tue, 11 Apr 2006 13:45:43 -0400
To:
linux-fsdevel@vger.kernel.org
CC:
nfsv4@linux-nfs.org, nfs@lists.sourceforge.net

The following series of patches implement NFS client support for crossing
server submounts (assuming that the server is exporting them using the
'nohide' option).  We wish to ensure that inode numbers remain unique
on either side of the mountpoint, so that programs like 'tar' and
'rsync' do not get confused when confronted with files that have the same
inode number, but are actually on different filesystems on the server.

This is achieved by having the client automatically create a submount
that mirrors the one on the server.

In order to avoid confusing users, we would like for this mountpoint to be
transparent to 'umount': IOW: when the user mounts the filesystem '/foo',
then an automatic submount by the NFS client for /foo/bar should not cause
'umount /foo' (particularly since the kernel cannot create entries for
/foo/bar in /etc/mtab). To get around this we mark automatically
created submounts using the new flag MNT_SHRINKABLE, and then allow
the NFS client to attempt to unmount them whenever the user calls umount on
the parent.

Note: This code also serves as the base for NFSv4 'referral' support, in
which one server may direct the client to a different server as it crosses
into a filesystem that has been migrated.

Cheers,
  Trond
_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

F'n sweet.

AIX support for referrals and replicas

Spencer pointed me to this documentation. Looks like AIX versions 5.3.0.30 or later have support for referrals and replicas! Note that it's only for NFSv4. The v4 train is rolling - good news!

Friday Mar 31, 2006

dscripts to time ZFS VOPs

So you're running some app, and you're curious where zfs is spending its time... here's some dscripts to figure out how much time each VOP is taking.

This one (zfs_get_vop_times.d) grabs the number of times each VOP was called, each VOP's average time, and the total time spent in each VOP, for all ZFS file systems on the system. It generates output like:

# ./zvop_times.d 
dtrace: script './zvop_times.d' matched 66 probes
^C
CPU     ID                    FUNCTION:NAME
 17      2                             :END 
ZFS COUNT


  zfs_fsync                                                        61
  zfs_write                                                       494
  zfs_read                                                        520

ZFS AVG TIME

  zfs_read                                                    2737251
  zfs_write                                                   6992704
  zfs_fsync                                                  73401109


ZFS SUM TIME

  zfs_read                                                 1423370640
  zfs_write                                                3454396080
  zfs_fsync                                                4477467680

This one (zvop_times_fsid.d) does the same as above, but for just one file system - namely, the one you specify via the FSID ints passed in.

Lastly, this one (zvop_times_fsid_large.d) does the same as above (tracking per FSID), but also spits out stack and quantize information when a ZFS VOP call goes over X time, where X is passed into the script. This makes it easy to see if there are some really slow calls. It generates output like this (skipping the output that's the same as in the above examples):

# ./zvop_times_fsid_large.d 0x7929404d 0xb3d52b08 50000000
dtrace: script './zvop_times_fsid_large.d' matched 123 probes
CPU     ID                    FUNCTION:NAME
 14  35984                  zfs_read:return 
              genunix`fop_read+0x20
              genunix`read+0x29c
              unix`syscall_trap32+0x1e8

 16  35994               zfs_putpage:return 
              genunix`fop_putpage+0x1c
              nfssrv`rfs3_commit+0x110
              nfssrv`common_dispatch+0x588
              rpcmod`svc_getreq+0x1ec
              rpcmod`svc_run+0x1e8
              nfs`nfssys+0x1b8
              unix`syscall_trap32+0x1e8

 18  35994               zfs_putpage:return 
              genunix`fop_putpage+0x1c
              nfssrv`rfs3_commit+0x110
              nfssrv`common_dispatch+0x588
              rpcmod`svc_getreq+0x1ec
              rpcmod`svc_run+0x1e8
              nfs`nfssys+0x1b8
              unix`syscall_trap32+0x1e8

 12  35972                 zfs_fsync:return 
              genunix`fop_fsync+0x14
              nfssrv`rfs4_createfile+0x500
              nfssrv`rfs4_do_opennull+0x44
              nfssrv`rfs4_op_open+0x380
              nfssrv`rfs4_compound+0x208
              nfssrv`rfs4_dispatch+0x11c
              nfssrv`common_dispatch+0x154
              rpcmod`svc_getreq+0x1ec
              rpcmod`svc_run+0x1e8
              nfs`nfssys+0x1b8
              unix`syscall_trap32+0x1e8

^C


  zfs_fsync                                         
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       134217728 |                                         0        

  zfs_putpage                                       
           value  ------------- Distribution ------------- count    
        33554432 |                                         0        
        67108864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
       134217728 |                                         0        

  zfs_read                                          
           value  ------------- Distribution ------------- count    
        67108864 |                                         0        
       134217728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
       268435456 |                                         0     

Feel free to play around with these scripts as well as add/subtract.

Tuesday Mar 14, 2006

NFSv3 support for .zfs

Rob just did the putback last night to enable NFSv3 access to .zfs/snapshot! So now both v3 and v4 have access.

fsh-weakfish# mount -o vers=3 fsh-mullet:/pool /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot
krispies
fsh-weakfish# ls /mnt/.zfs/snapshot/krispies
neat.txt
fsh-weakfish# cat /mnt/.zfs/snapshot/krispies/neat.txt
hi
fsh-weakfish# 

In my previous blog on how the v4 support was done, i introduced a new data structure, "fhandle_ext_t". That structure has now been renamed to fhandle4_t to ease future code changes in case we ever need to increase v4's potential filehandle size (it's currently set to v3's protocol limitation). NFSv3, similarly, uses the data structure fhandle3_t.

These changes will be in build 36 of nevada.

Thursday Feb 02, 2006

creating a stripe in ZFS vs. SVM/UFS

To create a 4 disk stripe in ZFS:

hodur# zpool create zfs_bonnie c4t2d0 c4t4d0 c4t8d0 c4t12d0
hodur# df -kh zfs_bonnie
Filesystem             size   used  avail capacity  Mounted on
zfs_bonnie             134G    26K   134G     1%    /zfs_bonnie
hodur# 

To create a 4 disk stripe in SVM:

hodur# metadb -a -f -c2 c4t2d0s0 c4t4d0s0

In the above command, the metadb is where SVM stores things like the stripe width for stripes or the dirty region for mirrors. You can technically get away with adding only one metadb, but having two adds redundancy... you could also go with four (one on each disk), but that just becomes overkill (as the master metadb periodically needs to sync with the slave(s)). And yes, it would be really nice if this was all just automated (hmm, like the above zfs command). Next...

hodur# metainit d1 1 4 c4t2d0s0 c4t4d0s0 c4t8d0s0 c4t12d0s0 -i 256k
d1: Concat/Stripe is setup
hodur#

In the above command, the "1" tells metainit to create one stripe, the "4" tells it how many slices to make that stripe out of, and the "-i 256k" sets the stripe width (interleave) to 256KB instead of the default 16KB. Continuing...
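The interleave can be pictured as a simple round-robin mapping from logical offset to slice (illustrative sketch only, not SVM code):

```shell
# Round-robin mapping for the 4-slice, 256KB-interleave stripe above:
# each successive 256KB chunk of the volume lands on the next slice.
interleave=$((256 * 1024))
ndisks=4
where() {
    off=$1
    chunk=$((off / interleave))
    echo "offset $off -> slice $((chunk % ndisks)), row $((chunk / ndisks))"
}
where 0                  # first chunk: slice 0
where $((256 * 1024))    # next chunk moves to slice 1
where $((1024 * 1024))   # after 4 chunks we wrap back to slice 0
```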

hodur# newfs /dev/md/rdsk/d1
newfs: construct a new file system /dev/md/rdsk/d1: (y/n)? y
Warning: 4096 sector(s) in last cylinder unallocated
/dev/md/rdsk/d1:        284327936 sectors in 46278 cylinders of 48 tracks, 128 sectors
        138832.0MB in 2893 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
.........................................................
super-block backups for last 10 cylinder groups at:
 283410848, 283509280, 283607712, 283706144, 283804576, 283903008, 284001440,
 284099872, 284198304, 284296736
hodur# mkdir /ufs_bonnie
hodur# mount -F ufs /dev/md/dsk/d1 /ufs_bonnie
hodur# df -kh ufs_bonnie        
Filesystem             size   used  avail capacity  Mounted on
/dev/md/dsk/d1         134G   6.9G   125G     6%    /ufs_bonnie
hodur# 

One method is straightforward; the other caused me to write a blog entry so i'd remember how to do it.

Wednesday Feb 01, 2006

enabling the write cache

To enable the write cache you can use the format(1M) command:

i_like_corruption# format -e
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 
          /pci@1c,600000/scsi@2/sd@0,0
       1. c0t1d0 
          /pci@1c,600000/scsi@2/sd@1,0
Specify disk (enter its number): 1
selecting c0t1d0
[disk formatted]
/dev/dsk/c0t1d0s0 is in use by zpool zfs_tar. Please see zpool(1M).


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        inquiry    - show vendor, product and revision
        scsi       - independent SCSI mode selects
        cache      - enable, disable or query SCSI disk cache
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> cache


CACHE MENU:
        write_cache - display or modify write cache settings
        read_cache  - display or modify read cache settings
        !<cmd>      - execute <cmd>, then return
        quit
cache> write      


WRITE_CACHE MENU:
        display     - display current setting of write cache
        enable      - enable write cache
        disable     - disable write cache
        !<cmd>      - execute <cmd>, then return
        quit
write_cache> display
Write Cache is disabled
write_cache> enable
write_cache> display
Write Cache is enabled
write_cache> q

Now, of course, be careful when you do this as data corruption can happen!

It was nice for me to do this as i was taking a simple v20z with a single SCSI disk to run some prelim specSFS numbers, and the write cache will mitigate (somewhat) the synchronous writes/commits. Of course you need much more than a single SCSI disk to get real specSFS numbers.

Or if you want to be snazzier, Neil wrote a script to do this:

i_like_corruption# write_cache display c0t1d0
Write Cache is disabled
i_like_corruption# write_cache enable c0t1d0 
Write Cache is enabled
i_like_corruption# write_cache display c0t1d0
Write Cache is enabled
i_like_corruption# 

Here's the script:

#!/bin/ksh
#
# issue write cache commands to format(1M).
# commands can be to individual disks or all disks.
#
# usage: write_cache [-v] display|enable|disable all|<disk>
# E.g.:
#    write_cache disable all
#    write_cache -v enable c2t1d0

id=`id | sed -e 's/uid=//' -e 's/(.*//'`
if [ $id != "0" ] ; then
        printf "No permissions"
        exit 1
fi

# tmp files
cmds=/tmp/write_cache_commands.txt.$$
disks=/tmp/all_disk_list.txt.$$

silent=-s
if [ "$1" = "-v" ] ; then
        # in verbose mode turn off silent format option
        silent=
        shift
fi

cat > $cmds << EOF
cache
write_cache
$1
EOF

if [ "$2" = "all" ]; then
        echo disk | format 2>/dev/null | fgrep ". c" \
            | nawk '{ print $2 }' > $disks
        for i in `cat $disks`
        do
                format -e $silent -f $cmds $i 2>/dev/null
                if [ "$silent" = "-s" ]; then
                        # print write cache state using recursion
                        printf "%s : " $i
                        write_cache -v display $i | fgrep "Write Cache is"
                fi
        done
else
        format -e $silent -f $cmds $2
        if [ "$silent" = "-s" ]; then
                # print write cache state using recursion
                write_cache -v display $2 | fgrep "Write Cache is"
        fi
fi

rm -f $cmds $disks

Oh yeah, make sure it's in your PATH. It can be run like this:

hodur# write_cache enable c4t2d0 
Write Cache is enabled
hodur# write_cache disable c4t2d0
Write Cache is disabled
hodur# 

Wednesday Nov 16, 2005

FS perf 201 : Postmark

Now let's run a simple but popular benchmark - Netapp's postmark. Let's see how long it takes to do 1,000,000 transactions.

First, let's try ZFS:

mcp# ./postmark                  
PostMark v1.5 : 3/27/01
pm>set location=/scsi_zfs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        220 seconds total
        214 seconds of transactions (4672 per second)

Files:
        519830 created (2362 per second)
                Creation alone: 20000 files (4000 per second)
                Mixed with transactions: 499830 files (2335 per second)
        500124 read (2337 per second)
        494776 appended (2312 per second)
        519830 deleted (2362 per second)
                Deletion alone: 19660 files (19660 per second)
                Mixed with transactions: 500170 files (2337 per second)

Data:
        3240.97 megabytes read (14.73 megabytes per second)
        3365.07 megabytes written (15.30 megabytes per second)
pm>

During the run, i used our good buddy zpool(1M) to see how much I/O we were doing:

mcp# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
scsi_zfs    32.5K  68.0G      0    207      0  6.25M
scsi_zfs    32.5K  68.0G      0    821      0  24.1M
scsi_zfs    32.5K  68.0G      0    978      0  28.6M
scsi_zfs    32.5K  68.0G      0  1.04K      0  30.3M
scsi_zfs    32.5K  68.0G      0  1.01K      0  27.6M
scsi_zfs     129M  67.9G      0    797      0  16.2M
scsi_zfs     129M  67.9G      0    832      0  27.4M

Ok, onto UFS:

mcp# ./postmark 
PostMark v1.5 : 3/27/01
pm>set location=/export/scsi_ufs
pm>set transactions=1000000
pm>run
Creating files...Done
Performing transactions..........Done
Deleting files...Done
Time:
        3450 seconds total
        3419 seconds of transactions (292 per second)

Files:
        519830 created (150 per second)
                Creation alone: 20000 files (909 per second)
                Mixed with transactions: 499830 files (146 per second)
        500124 read (146 per second)
        494776 appended (144 per second)
        519830 deleted (150 per second)
                Deletion alone: 19660 files (2184 per second)
                Mixed with transactions: 500170 files (146 per second)

Data:
        3240.97 megabytes read (961.96 kilobytes per second)
        3365.07 megabytes written (998.79 kilobytes per second)
pm>

Also, during the run i grabbed a little iostat to see how UFS's IO was doing:

mcp# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  820.9    0.0    3.4 142.5 256.0  173.5  311.9 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  797.0    0.0    3.1 129.2 256.0  162.1  321.2 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  777.0    0.0    3.1 128.0 256.0  164.7  329.5 100 100 c4t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  827.1    0.0    4.0 128.8 256.0  155.7  309.5 100 100 c4t1d0

Yikes! So looking at throughput (transactions per second), ZFS is ~16x better than UFS on this benchmark. Ok, ZFS is not this good on every benchmark when compared to UFS, but we rather like this one.

This was run on a 2 way opteron box, using the same SCSI disk for both ZFS and UFS.

bigger filehandles for NFSv4 - die NFSv2 die

So what does a filehandle created by the Solaris NFS server look like? If we take a gander at the fhandle_t struct, we see its layout:

struct svcfh {
	fsid_t		fh_fsid;			/* filesystem id */
	ushort_t	fh_len;				/* file number length */
	char		fh_data[NFS_FHMAXDATA];		/* and data */
	ushort_t	fh_xlen;			/* export file number length */
	char		fh_xdata[NFS_FHMAXDATA];	/* and data */
};
typedef struct svcfh fhandle_t;
typedef struct svcfh fhandle_t;

Where fh_len represents the length of valid bytes in fh_data, and likewise, fh_xlen is the length of fh_xdata. Note, NFS_FHMAXDATA used to be:

#define	NFS_FHMAXDATA	((NFS_FHSIZE - sizeof (struct fhsize) + 8) / 2)

To be less confusing, I removed fhsize and shortened that to:

#define NFS_FHMAXDATA    10

Ok, but where does fh_data come from? It's the FID (via VOP_FID) of the local file system. fh_data represents the actual file of the filehandle, and fh_xdata represents the exported file/directory. So for NFSv2 and NFSv3, the filehandle is basically:
fsid + file FID + exported FID

NFSv4 is pretty much the same thing, except at the end we add two fields, and you can see the layout in nfs_fh4_fmt_t:

struct nfs_fh4_fmt {
	fhandle_ext_t fh4_i;
	uint32_t      fh4_flag;
	uint32_t      fh4_volatile_id;
};

The fh4_flag is used to distinguish named attributes from "normal" files, and fh4_volatile_id is currently only used for testing purposes - for testing volatile filehandles, of course. Since Solaris doesn't have a local file system without persistent filehandles, we don't need fh4_volatile_id quite yet.

So back to the magical "10" for NFS_FHMAXDATA... what's going on there? Well, adding those fields up, you get: 8(fsid) + 2(len) + 10(data) + 2(xlen) + 10(xdata) = 32 bytes, which is the protocol limitation of NFSv2 - just look for "FHSIZE". So the Solaris server is currently limiting its filehandles to 10 byte FIDs just to make NFSv2 happy. Note, this limitation has purposely crept into the local file systems to make this all work, check out UFS's ufid:

/*
 * This overlays the fid structure (see vfs.h)
 *
 * LP64 note: we use int32_t instead of ino_t since UFS does not use
 * inode numbers larger than 32-bits and ufid's are passed to NFS
 * which expects them to not grow in size beyond 10 bytes (12 including
 * the length).
 */
struct ufid {
	ushort_t ufid_len;
	ushort_t ufid_flags;
	int32_t	ufid_ino;
	int32_t	ufid_gen;
};

Note that NFSv3's protocol limitation is 64 bytes and NFSv4's limitation is 128 bytes. So these two file systems could theoretically give out bigger filehandles, but there are two reasons why they don't for currently existing data: 1) there's really no need and, more importantly, 2) the filehandles MUST be the same on the wire before any change is done. If 2) isn't satisfied, then all clients with active mounts will get STALE errors when the longer filehandles are introduced. Imagine a server giving out 32 byte filehandles over NFSv3 for a file, then the server is upgraded and now gives out 64 byte filehandles - even if all the extra 32 bytes are zeroed out, that's a different filehandle and the client will think it has a STALE reference. Now a forced umount or client reboot will fix the problem, but it seems pretty harsh to force all active clients to perform some manual admin action for a simple (and supposedly harmless) server upgrade.

So yeah my blog title is how i changed filehandles to be bigger - which almost contradicts the above paragraph. The key point to note is that files that have never been served up via NFS have never had a filehandle generated for them (duh), so they can be whatever length the protocol allows and we don't have to worry about STALE filehandles.

If you're not familiar with ZFS's .zfs/snapshot, there will be a future blog on it soon. But basically it places a dot directory (.zfs) at the root of the "main" file system, and all snapshots created are then placed namespace-wise under .zfs/snapshot. Here's an example:

fsh-mullet# zfs snapshot bw_hog@monday
fsh-mullet# zfs snapshot bw_hog@tuesday
fsh-mullet# ls -a /bw_hog
.         ..        .zfs      aces.txt  is.txt    zfs.txt
fsh-mullet# ls -a /bw_hog/.zfs/snapshot
.        ..       monday   tuesday
fsh-mullet# ls -a /bw_hog/.zfs/snapshot/monday
.         ..        aces.txt  is.txt    zfs.txt
fsh-mullet#

With the introduction of .zfs/snapshot, we were faced with an interesting dilemma for NFS - either only have NFS clients that could do "mirror mounts" have access to the .zfs directory OR increase ZFS's fid for files under .zfs. "Mirror mounts" would allow us to do the technically correct solution of having a unique FSID for the "main" file system and each of its snapshots. This requires NFS clients to cross server mount points. The latter option has one FSID for the "main" file system and all of its snapshots. This means the same file under the "main" file system and any of its snapshots will appear to be the same - so things like "cp" over NFS won't like it.

"Mirror mounts" is our lingo for letting clients cross server file system boundaries - as dictated by the FSID (file system identifier). This is totally legit in NFSv4 (see section "7.7. Mount Point Crossing" and section "5.11.7. mounted_on_fileid" in rfc 3530). NFSv3 doesn't really allow this functionality (see "3.3.3 Procedure 3: LOOKUP - Lookup filename" here). Though, with some little trickery, i'm sure it could be achieved - perhaps via the automounter?

The problem with mirror mounts is that no one has actually implemented them. So if we went with the more technically correct solution of having a unique FSID for the "main" local file system and a unique FSID for all its snapshots, only Solaris Update 2(?) NFSv4 clients would be able to access .zfs upon initial delivery of ZFS. That seems silly.

If we instead bend a little on the unique FSID, then all NFS clients in existence today can access .zfs. That seems much more attractive. Oh wait... small problem. We would rather like at least the filehandles to be different for files in the "main" file system from the snapshots - this ensures NFS doesn't get completely confused. Slight problem is that the filehandles we give out today are maxed out at the 32 byte NFSv2 protocol limitation (as mentioned above). If we add any other bit of uniqueness to the filehandles (such as a snapshot identifier) then v2 just can't handle it.... hmmm...

Well you know what? Tough s*&t, v2. Seriously, you are antiquated and really need to go away. Since the snapshot identifier doesn't need to be added to the "main" file system, FIDs for non-snapshot files will remain the same size and fit within NFSv2's limitations. So we can still access ZFS over NFSv2 - we'll just be denied .zfs's goodness:

fsh-weakfish# mount -o vers=2 fsh-mullet:/bw_hog /mnt
fsh-weakfish# ls /mnt/.zfs/snapshot/      
monday   tuesday
fsh-weakfish# ls /mnt/.zfs/snapshot/monday
/mnt/.zfs/snapshot/monday: Object is remote
fsh-weakfish# 

So what about v3 and v4? Well since v4 is the default for Solaris and its code is simpler, i just changed v4 to handle bigger filehandles for now. NFSv3 is coming soooon. So we basically have the same structure as fhandle_t, except we extend it a bit for NFSv4 via fhandle4_t:

/*
 * This is the in-memory structure for an NFSv4 extended filehandle.
 */
typedef struct {
	fsid_t	fhx_fsid;			/* filesystem id */
	ushort_t fhx_len;			/* file number length */
	char	fhx_data[NFS_FH4MAXDATA];	/* and data */
	ushort_t fhx_xlen;			/* export file number length */
	char	fhx_xdata[NFS_FH4MAXDATA];	/* and data */
} fhandle4_t;

So the only difference is that FIDs can be up to 26 bytes instead of 10 bytes. Why 26? That's what falls out of NFSv3's protocol limitation of 64 bytes: 8(fsid) + 2(len) + 26(data) + 2(xlen) + 26(xdata) = 64 bytes. And if we ever need larger than 64 byte filehandles for NFSv4, it's easy to change - just create a new struct with the capacity for larger FIDs and use that for NFSv4. Why will it be easier in the future than it was for this change? Well, part of what i needed to do to make NFSv4 filehandles backwards compatible is that when filehandles are actually XDR'd, we need to parse them so that filehandles that used to be given out with 10 byte FIDs (based on the fhandle_t struct) continue to be given out based on 10 byte FIDs, but at the same time VOP_FID()s that return larger than 10 byte FIDs (such as .zfs) are allowed to do so. So NFSv4 will return different length filehandles based on the need of the local file system.

So checking out xdr_nfs_resop4, the old code (knowing that the filehandle was safe to be a contiguous set of bytes), simply did this:

case OP_GETFH:
	if (!xdr_int(xdrs,
		     (int32_t *)&objp->nfs_resop4_u.opgetfh.status))
		return (FALSE);
	if (objp->nfs_resop4_u.opgetfh.status != NFS4_OK)
		return (TRUE);
	return (xdr_bytes(xdrs,
	    (char **)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_val,
	    (uint_t *)&objp->nfs_resop4_u.opgetfh.object.nfs_fh4_len,
	    NFS4_FHSIZE));

Now, instead of simply doing an xdr_bytes, we use the template of fhandle_ext_t and internally always have the space for 26 byte FIDs, but over the wire we skip bytes depending on what fhx_len and fhx_xlen say - see xdr_encode_nfs_fh4.

whew, that's enough about filehandles for 2005.

FS perf 102 : Filesystem Bandwidth

Now that you can grab the disks' BW, the next question is "How do i see what BW my local file system can push?". First let's check writes for ZFS:

fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real       17.1
user        0.0
sys         1.1

So that's 1GB/17.1s = ~62MB/s for a 1 gig file. During the mkfile(1M), you can use iostat(1M) to see how much disk BW is going on:

fsh-mullet# iostat -Mxnz 1
              
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  541.0    0.0   67.6  0.0 35.0    0.0   64.7   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  567.0    0.0   70.3  0.0 33.9    0.0   59.9   1 100 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  254.9    0.0   29.0  0.0 15.7    0.0   61.6   0  64 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  528.1    0.0   66.0  0.0 35.0    0.0   66.2   1 100 c0t1d0

We can also use zpool(1M) to show just the IO for ZFS:

fsh-mullet# zpool iostat 1
bw_hog      32.5K  33.7G      0    538      0  67.4M
bw_hog       189M  33.6G      0     30      0   459K
bw_hog       189M  33.6G      0      0      0      0
bw_hog       189M  33.6G      0    509      0  63.7M
bw_hog       189M  33.6G      0    544      0  68.1M
bw_hog       189M  33.6G      0    544      0  68.1M
bw_hog       189M  33.6G      0    535      0  67.0M

Now let's look at UFS writes:

fsh-mullet# /bin/time sh -c 'lockfs -f .; mkfile 1g 1g.txt; lockfs -f .'
real       18.7
user        0.1
sys         6.3

So UFS is doing 1GB/18.7s = ~57MB/s. Let's see some of that iostat:

fsh-mullet# iostat -Mxnz 1
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0   70.0    0.0   58.9  0.0 10.8    0.0  145.6   0  99 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    3.0   70.0    0.0   57.8  0.0 10.6    0.0  144.5   0  99 c0t1d0
                    extended device statistics              
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    4.0   70.0    0.0   59.4  0.0 11.2    0.0  151.3   0  99 c0t1d0

This was done on a 2-way v210 sparc box, using a SCSI disk.

And why the 'lockfs' call you ask? This ensures that all data is flushed to disk - and measuring how long it takes to do something that doesn't necessarily get flushed is just not legit in this case. Persistent data is good.

About

erickustarz
