Sunday Dec 21, 2014

help! Non-aligned writes are killing my zpool!

I put this post up on my personal blog:

Tuesday Jul 08, 2008

Better late than never - a ZFS bringover-like util

So... it's a little late, but better late than never. Now that I've got my u40m2 re-configured and redone my local source code repositories (not hg repos... yet), I figured it was time to make the other part of what I've mentioned to customers a reality.

The first part is this: bring over the source from $GATE, run wx init, cd $SRC, and /usr/bin/make setup on both UltraSPARC and x64 buildboxen, then take a zfs snapshot followed by two zfs clone ops so that I can build on UltraSPARC and x64 buildboxen in the same workspace at the same time.

Yes, this is a really ugly workaround for "Should be able to build for sparc and x86 in a single workspace", and while I'm the RE for that bug, it's probably not going to be fixed for a while.

So here's the afore-mentioned "other part": a kinda-sorta replacement for bringover, using ZFS snapshots and clones. Both Bill and DarrenM have mentioned something like this in the past, and you know what - the script I just hacked together is about 3 lines of content, 1 line of #! magic and 16 lines of arg checking.

Herewith is the script. No warranties, guarantees or anything. Use at your own risk. It works for me, but your mileage may vary. Suggestions and improvements cheerfully accepted.
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
# Version number? if this needs a rev.... there's something 
# really, really wrong

# we use the following process to snapshot, clone
# and mount an up-to-date image of $GATE ::
# zfs snapshot sink/$GATE@$wsname  {$GATE is onnv-gate|on10-feature-patch|on10-patch}
# zfs clone sink/$GATE@$wsname sink/src/$wsname  ## ignore "failed to mount"
# zfs set mountpoint=/scratch/src/build/{rfes|bugs}/$wsname sink/src/$wsname

# arg1 is "b" or "r" - bug or rfe
# arg2 is $GATE - onnv-gate, on10-feature-patch, on10-patch
# arg3 is wsname

# first, some sanity checking of the args

if [ "$1" != "b" -a "$1" != "r" ]; then
    echo "Is this a bug (b) or an rfe (r) ?"
    exit 1
fi

if [ "$2" != "onnv-gate" -a "$2" != "on10-feature-patch" -a "$2" != "on10-patch" ]; then
    echo "unknown / invalid gate specified ($2). Please choose one of "
    echo "onnv-gate, on10-feature-patch or on10-patch."
    exit 2
fi

if [ "$1" = "b" ]; then
    BR=bugs
else
    BR=rfes
fi

GATE=$2
WSNAME=$3

# ASSUMPTION1: our $GATE is a dataset under pool "sink"
# ASSUMPTION2: we have another dataset called "sink/src"
# ASSUMPTION3: our user has delegated admin privileges, and can mount
#              a cloned snapshot under /scratch/src/.....

zfs snapshot sink/$GATE@$WSNAME
zfs clone sink/$GATE@$WSNAME sink/src/$WSNAME >> /dev/null 2>&1
zfs set mountpoint=/scratch/src/build/$BR/$WSNAME sink/src/$WSNAME
exit 0

Note the ASSUMPTIONx lines - they're specific to my workstation; you will almost certainly want to change them to suit your system.
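If you want to see what the script will do before letting it loose on your pool, here's a little dry-run sketch of mine (not from the script itself) that takes the same three arguments and just prints the zfs commands that would be run. The pool name "sink" and the mountpoint layout come from the ASSUMPTIONx comments; the workspace name below is made up.

```shell
# Dry-run sketch: print the zfs commands the script would run,
# without actually touching any datasets.
clone_ws() {
    case "$1" in
        b) br=bugs ;;
        r) br=rfes ;;
        *) echo "usage: clone_ws b|r gate wsname" >&2; return 1 ;;
    esac
    gate=$2
    wsname=$3
    echo "zfs snapshot sink/$gate@$wsname"
    echo "zfs clone sink/$gate@$wsname sink/src/$wsname"
    echo "zfs set mountpoint=/scratch/src/build/$br/$wsname sink/src/$wsname"
}

# "mybug" is just an illustrative workspace name
clone_ws b onnv-gate mybug
```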

Friday Jul 04, 2008

Oh, if only I'd had

Back when I got my first real break as a sysadmin, one of my first tasks was to upgrade the Uni's finance office server, a SparcServer 1000 running Solaris 2.5 with a gaggle of external unipacks and multipacks for Oracle 7.$mumble. I organised an outage with the DBAs and the Finance stakeholders, practiced installing Solaris 2.6 on a new system (we'd just got an E450), and at the appointed time on the Saturday morning I rocked up and got to work on my precisely specified upgrade plan.

That all went swimmingly (though sloooooooowly) until the time came to reboot after the final SDS 4.1 mirror had been created. The primary system board decided that it really didn't like me, and promptly died along with the boot PROM.


At that point I didn't know all that much about the innards of the SS1000 otherwise I probably would have just engaged in some swaptronics with the other three boards. However, I was green, nervous, and - by that point - very tired of sitting in a cold, loud machine room for 12 hours. Turned the box off, rang the local Sun support office and left a message (we didn't have weekend coverage on any of our systems then), rang my boss and the primary stakeholder in the Finance unit and went home.

Come Monday morning, all hell broke loose - the Accounts groups were unable to do any work, and the DBAs had to do a very quick enable of the DR system so I could get time to work on the problem with Sun. The "quick enable" took around 4 hours, if I'm remembering it correctly. Fortunately for me, not only were the DBAs quite sympathetic and very quick to help, but Miriam on the support phone number (who later hired me) was able to diagnose the problem and organise a service call to replace the faulty board. She also calmed me down, which I really, really appreciated. (Thankyou Miriam!)

So ... why am I dredging this up? Because I've just done a LiveUpgrade (LU) from Solaris Nevada build 91 to build 93, with ZFS root, and it took me a shade under 90 minutes. Total. Including the post-installation reboot. Not only would I have gone all gooey at the idea of being able to do something like LU back in that job, but if I could have done it with ZFS and not had to reconfigure all the uni- and multi-pack devices I probably could have had the whole upgrade done in around 4 or 5 hours rather than 12. (Remember, of course, that while the SS1000 could take quite a few cpus, they were still very very very very sloooooooooow).

Here's a transcript of this evening's upgrade:

# uname -a
SunOS gedanken 5.11 snv_91 i86pc i386 i86xpv

(remove the snv_91 LU packages)
pkgrm SUNWlu... packages from snv_91
(add the snv_93 LU packages)
pkgadd SUNWlu... packages from snv_93

(Create my LU config)
# lucreate -n snv_93 -p rpool
Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
INFORMATION: The current boot environment is not named - assigning name .
Current boot environment is named .
Creating initial configuration for primary boot environment .
The device  is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name  PBE Boot Device .
Comparing source boot environment  file systems with the file 
system(s) you specified for the new boot environment. Determining which 
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment .
Source boot environment is .
Creating boot environment .
Cloning file systems from boot environment  to create boot environment .
Creating snapshot for  on .
Creating clone for  on .
Setting canmount=noauto for  in zone  on .
Saving existing file  in top level dataset for BE  as //boot/grub/menu.lst.prev.
File  propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE  in GRUB menu
Population of boot environment  successful.
Creation of boot environment  successful.
-bash-3.2# zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
rpool                           50.0G   151G    35K  /rpool
rpool/ROOT                      7.06G   151G    18K  legacy
rpool/ROOT/snv_91               7.06G   151G  7.06G  /
rpool/ROOT/snv_91@snv_93        71.5K      -  7.06G  -
rpool/ROOT/snv_93                128K   151G  7.06G  /tmp/.alt.luupdall.2695
rpool/WinXP-Host0-Vol0          3.57G   151G  3.57G  -
rpool/WinXP-Host0-Vol0@install  4.74M      -  3.57G  -
rpool/dump                      4.00G   151G  4.00G  -
rpool/export                    7.47G   151G    19K  /export
rpool/export/home               7.47G   151G  7.47G  /export/home
rpool/gate                      5.86G   151G  5.86G  /opt/gate
rpool/hometools                 2.10G   151G  2.10G  /opt/hometools
rpool/optcsw                     225M   151G   225M  /opt/csw
rpool/optlocal                  1.20G   151G  1.20G  /opt/local
rpool/scratch                   14.4G   151G  14.4G  /scratch
rpool/swap                         4G   155G  64.6M  -

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
snv_91                     yes      yes    yes       no     -         
snv_93                     yes      no     no        yes    -         

Golly, that was so easy! Here I was rtfming for the LU with UFS syntax.... not needed at all.

# time luupgrade -u -s /media/SOL_11_X86 -n snv_93

No entry for BE  in GRUB menu
Copying failsafe kernel from media.
Uncompressing miniroot
Uncompressing miniroot archive (Part2)
13367 blocks
Creating miniroot device
miniroot filesystem is 
Mounting miniroot at 
Mounting miniroot Part 2 at 
Validating the contents of the media .
The media is a standard Solaris media.
The media contains an operating system upgrade image.
The media contains  version <11>.
Constructing upgrade profile to use.
Locating the operating system upgrade program.
Checking for existence of previously scheduled Live Upgrade requests.
Creating upgrade profile for BE .
Checking for GRUB menu on ABE .
Saving GRUB menu on ABE .
Checking for x86 boot partition on ABE.
Determining packages to install or upgrade for BE .
Performing the operating system upgrade of the BE .
CAUTION: Interrupting this process may leave the boot environment unstable 
or unbootable.
Upgrading Solaris: 100% completed
Installation of the packages from this media is complete.
Restoring GRUB menu on ABE .
Adding operating system patches to the BE .
The operating system patch installation is complete.
ABE boot partition backing deleted.
PBE GRUB has no capability information.
PBE GRUB has no versioning information.
ABE GRUB is newer than PBE GRUB. Updating GRUB.
GRUB update was successful.
Configuring failsafe for system.
Failsafe configuration is complete.
INFORMATION: The file  on boot 
environment  contains a log of the upgrade operation.
INFORMATION: The file  on boot 
environment  contains a log of cleanup operations required.
WARNING: <3> packages failed to install properly on boot environment .
INFORMATION: The file  on 
boot environment  contains a list of packages that failed to 
upgrade or install properly.
INFORMATION: Review the files listed above. Remember that all of the files 
are located on boot environment . Before you activate boot 
environment , determine if any additional system maintenance is 
required or if additional media of the software distribution must be installed.
The Solaris upgrade of the boot environment  is partially complete.
Installing failsafe
Failsafe install is complete.

real    83m24.299s
user    13m33.199s
sys     24m8.313s

# zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
rpool                           52.5G   148G  36.5K  /rpool
rpool/ROOT                      9.56G   148G    18K  legacy
rpool/ROOT/snv_91               7.07G   148G  7.06G  /
rpool/ROOT/snv_91@snv_93        18.9M      -  7.06G  -
rpool/ROOT/snv_93               2.49G   148G  5.53G  /tmp/.luupgrade.inf.2862
rpool/WinXP-Host0-Vol0          3.57G   148G  3.57G  -
rpool/WinXP-Host0-Vol0@install  4.74M      -  3.57G  -
rpool/dump                      4.00G   148G  4.00G  -
rpool/export                    7.47G   148G    19K  /export
rpool/export/home               7.47G   148G  7.47G  /export/home
rpool/gate                      5.86G   148G  5.86G  /opt/gate
rpool/hometools                 2.10G   148G  2.10G  /opt/hometools
rpool/optcsw                     225M   148G   225M  /opt/csw
rpool/optlocal                  1.20G   148G  1.20G  /opt/local
rpool/scratch                   14.4G   148G  14.4G  /scratch
rpool/swap                         4G   152G  64.9M  -
-bash-3.2# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
snv_91                     yes      yes    yes       no     -         
snv_93                     yes      no     no        yes    -         

# luactivate snv_93
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE 
Saving existing file  in top level dataset for BE  as //etc/bootsign.prev.
WARNING: <3> packages failed to install properly on boot environment .
INFORMATION:  on boot 
environment  contains a list of packages that failed to upgrade or 
install properly. Review the file before you reboot the system to 
determine if any additional system maintenance is required.

Generating boot-sign for ABE 
Saving existing file  in top level dataset for BE  as //etc/bootsign.prev.
Generating partition and slice information for ABE 
Copied boot menu from top level dataset.
Generating direct boot menu entries for PBE.
Generating xVM menu entries for PBE.
Generating direct boot menu entries for ABE.
Generating xVM menu entries for ABE.
Disabling splashimage
Re-enabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
Changing GRUB menu default setting to <0>
Done eliding bootadm entries.


The target boot environment has been activated. It will be used when you 
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You 
MUST USE either the init or the shutdown command when you reboot. If you 
do not use either init or shutdown, the system will not boot using the 
target BE.


In case of a failure while booting to the target BE, the following process 
needs to be followed to fallback to the currently working boot environment:

1. Boot from Solaris failsafe or boot in single user mode from the Solaris 
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like 
/mnt). You can use the following command to mount:

     mount -Fzfs /dev/dsk/c1t0d0s0 /mnt

3. Run  utility with out any arguments from the Parent boot 
environment root slice, as shown below:


4. luactivate, activates the previous working boot environment and 
indicates the result.

5. Exit Single User mode and reboot the machine.


Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File  propagation successful
File  propagation successful
File  propagation successful
File  propagation successful
Deleting stale GRUB loader from all BEs.
File  deletion successful
File  deletion successful
File  deletion successful
Activation of boot environment  successful.

# date
Friday,  4 July 2008  9:45:41 PM EST

# init 6
propagating updated GRUB menu
Saving existing file  in top level dataset for BE  as //boot/grub/menu.lst.prev.
File  propagation successful
File  propagation successful
File  propagation successful
File  propagation successful

Here I reboot and then login.

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
snv_91                     yes      no     no        yes    -         
snv_93                     yes      yes    yes       no     -         

# lufslist -n snv_91
               boot environment name: snv_91

Filesystem              fstype    device size Mounted on          Mount Options
----------------------- -------- ------------ ------------------- --------------
/dev/zvol/dsk/rpool/swap swap       4294967296 -                   -
rpool/ROOT/snv_91       zfs          20630528 /                   -

# lufslist -n snv_93
               boot environment name: snv_93
               This boot environment is currently active.
               This boot environment will be active on next system boot.

Filesystem              fstype    device size Mounted on          Mount Options
----------------------- -------- ------------ ------------------- --------------
/dev/zvol/dsk/rpool/swap swap       4294967296 -                   -
rpool/ROOT/snv_93       zfs       10342821376 /                   -

Cor! That was so easy I think I need to fall off my chair.

Thinking about this for a moment, I needed just 6 commands and around 90 minutes to upgrade my laptop. If only I'd had this technology available to me back then.
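For the record, here are the six commands collected from the transcript above, printed rather than executed (the truncated SUNWlu... package names are left exactly as they appear in the post):

```shell
# Recap of the six-command LU upgrade, stored and echoed rather than
# run -- this is a summary of the transcript, not a tool.
LU_STEPS='pkgrm SUNWlu...                              # remove the snv_91 LU packages
pkgadd SUNWlu...                              # add the snv_93 LU packages
lucreate -n snv_93 -p rpool                   # create the new boot environment
luupgrade -u -s /media/SOL_11_X86 -n snv_93   # upgrade it from media
luactivate snv_93                             # make it active on next boot
init 6                                        # reboot into snv_93'
echo "$LU_STEPS"
```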

Finally, let me send a massive, massive thankyou to the install team and the ZFS team for all their hard work to get these technologies integrated and working pretty darned smoothly together.

Friday Jul 07, 2006

A few tidbits re ZFS Root

I was lurking in #opensolaris when drdoug007 mentioned he'd been blogging about an intense Solaris minimisation project. I had a look... he's managed to get the entire image down to around 42MB uncompressed.

I then noticed he had a few entries re ZFS Root, and I remembered that there were three things I'd discovered over the past week, courtesy of my adventures with the ZFS delete queue, that I really should mention.

(1) When you boot single-user off media, you need to ensure that your boot archive contains a version of the zfs module which has the same on-disk filesystem format. In the process of repeated bfus and update_nonONs I'd also upgraded my ondisk format from ZFS Version 1 to ZFS Version 3. The older module doesn't like the newer format. I yelled a bit when I realised this (it was a Friday evening.....).

Lesson: always make sure that your media is up to date with your on-disk format!

(2) The next thing I realised was that if you zfs export your root pool before you reboot (so that you can boot from the allegedly-fixed ondisk boot archive) then you'll see a panic on the next boot because your pool isn't imported and therefore the boot archive won't know where to look for the rest of the OS. That's a bit of an annoyance, to say the least!


(3) I had to boot off media a few times because I stuffed up my boot archive. I found that in that scenario there was a gap in the bootadm logic which meant that effectively zero-length archives were created, since lofiadm and zfs wouldn't play nicely with each other.

Lesson: when booting to single-user, mount -o remount,rw / ;  cp {root_pool/root_filesystem}/usr/bin/mkisofs /usr/bin


Wednesday Jul 05, 2006

I've got my space back

When I got my Ultra 20 I decided to continue living on the bleeding edge and make a contribution to quality in Solaris Express.
That, of course, means ZFS Root.

So I followed the instructions on Tabriz' and Tim Foster's blogs, and quickly wound up with a zfs-rooted system.

All well and good until I noticed that a lot of my disk space had disappeared.

It turns out that I was suffering from two issues:

6420204 root filesystem's delete queue is not running and
6436526 delete_queue thread reporting drained when it may not be true

So the first workaround was to re-set the "readonly" flag to "off" and see how well it worked. Turns out that that wasn't quite enough; I only got about 2GB back. Then a little while later Mark Shellenbaum putback his fix for 6436526, so I bfu'd to that.
One of the things that --- at this point --- you have to remember to do with ZFS Root is to copy your new boot archive from the ZFS Root environment to your ufs boot partition. It took me two reboots to remember that :(

So once I'd got the correct module installed and booted, I had about 7 hours of downtime as a lot of my delete queue got flushed. I had to reboot several times because my disk decided it couldn't cope with the load, and also because the delete queue managed to trip over some assertions in the vm code. (Some pathetic reason like ASSERT(proc_pageout != NULL);.....)

And that was all cool.

BUT there was more to come. While chatting with Mark in the ZFS irc room he mentioned that Tabriz had a fix for 6420204 which involved adding a line to /lib/svc/method/fs-usr:

[ "$fstype" = zfs ] && mntopts="${mntopts},rw"

So I did that, and rebooted..... lo and behold, there was more stuff in the delete queue which needed to be taken care of. After another 10 hours and 8 panics my ZFS Root partition is finally back to using the expected 7% rather than 70+%.

Life is good and all seems good with the world.


Wednesday Nov 16, 2005

Back in the old days

In 1998 and 1999 I was a Solaris system administrator at one of Sydney's 5 universities. I was quite green --- this was my second sysadmin job --- and I'd been given the task of administering the University's Finance arm's server and the Uni's DR server. It was quite a challenge for me: each box was a SparcServer-1000 with about 30 attached disks. The DR host had those disks all nicely physically organised into an SSA-100 array whereas the Finance host had unipacks and multipacks crowded around it in a semi-neat fashion.

I had to learn VxVM (for the DR box) and SDS (for the Finance host) very quickly, and I realised that doing so from the command line perspective was very clearly the way to go.... in a disaster I wouldn't have a graphical head to let me look at the semi-pretty gui that VxVM 2.5 and SDS 4.1 required.

Working on the Finance box required close interaction with the Oracle DBAs that we had --- they'd frequently want to move Oracle datafiles around in order to maximise access speed .... whether that meant having the partition on the "fast" end of the disk or on a fast spindle or a faster scsi controller or on a lower scsi id. That was a pain. The DR box was another challenge because I had to somehow make the 30x4gb disks appear to be a single storage pool for whichever host had to be hosted there in a DR situation. Since we had three hosts which might get that experience it was a bit difficult. While all of them ran Oracle, they each ran different versions of Oracle, had different application filesystem requirements.... you get the idea I'm sure.

After a year working in that part of the Uni I moved to work for a smaller (30 people) group in the research division. It was great being top dog in the sysadmin group.... since I was their only sysadmin ;-) I migrated that group off a Novell NetWare server which quite seriously crashed every day. I got them onto an E250 running Solaris 2.6, Samba and PC-Netlink (for the Macs!) Once again I had to carefully carve up the internal disks and the luns from the attached A1000 (never, ever remove lun0 from an A1000 if you want it to work). I had to worry about quotas and how much space to allocate for the application I wrote for them and how much to allocate for sendmail spool files too. I recall that I ended up creating my filesystems (a) to not exceed the size I could ufsdump to a single DLT7000 tape (35GB), and (b) by grabbing the few megabytes at the end of what I figured were correctly sized filesystems and using an SDS concat to make something out of.

Hideous! Time-wasting! Ugly!

Oh, if only we'd had ZFS back then..... One of ZFS' main aims is to end the suffering of the humble (and not so humble!) system admin. With those hosts it would not have been difficult to add more storage or a new filesystem:

zpool add financepool mirror c9t0d0 c10t0d0

With the research group I wouldn't have had to worry about setting quotas for each filesystem by editing the quotatab and remembering to mount the filesystem with quotas turned on:

zfs set quota=1g research/home/louise

not only could I have done with some compression on those Oracle datafiles

zfs set compression=on finance

zfs set compression=on oracledata

but I could also have used zfs to send incremental backups of the relevant bits from the finance host to the DR box:

zfs backup -i finance/application@12:00 finance/application@12:01 | ssh DRbox zfs restore -d /finance/application

Do you get it? Do you understand why we've been desperately keen to get ZFS into your hands? Do you want to start making use of all this? It's quite fine with me (us?) if you want to keep using SVM and VxVM. Really, it is. When you're ready, please go and have a look at ZFS. Drop it onto a test machine and play with it. Look at the source code and the documentation and reach out to the possibilities of spending your time productively rather than in slicing up disks.

Tuesday Nov 01, 2005

ZFS is integrated into Solaris.Next

Just got back from a tiring afternoon at Uni, and found this update on ZFS from Jeff Bonwick. When I woke up this morning I was eagerly awaiting the email notification of the putback... had to wait about another hour (panting with anticipation) but then it happened. Kind of an anticlimax in the end, really, but there it was: Code Manager notification from Matt Ahrens that ZFS was putback. W00tW00tW00t!

I'm not sure which Solaris Express build it'll appear in, but I know that as soon as I've got through this week (uni commitments are yanking my chain right now) I'll be grabbing the latest build and nightlies for the fuller enjoyment of what really is the Last Word In FileSystems(tm). What I'm anticipating most is the reaction of everybody out there who has heard anything about ZFS and wanted to get their teeth into it. It'll take a little bit of effort to wrap your head around the paradigm shift, but believe me, it's truly worth it. Got Checksums?

Sunday Oct 16, 2005

Another cool ad

I was finalising a few slides for my presentation tonight on ZFS at SOSUG (as announced by Alan Hargreaves at the Google Group), and noticed that our company President and COO posted another ad which was rejected. There's a perception in some areas that "near enough is good enough"... but that's not the case with Sun. The teams and groups within Sun that I've come into contact with have the attitude that near enough is nowhere close to good enough.

There are also a few other guiding lights that we work with. One is Fix the problem at the source. Another is Make it secure by default (security is not an afterthought), along with if another OS is faster, then it's a bug in Solaris. And the biggy for me is based around the user experience: The Technology Must Just Work --- think about it for a moment. As an engineer, I reckon just about the greatest praise I can receive is "this is great, it just works" --- which means that whatever it is, it works as designed, the design is verifiably correct, and there are no surprises in the user experience.

Bringing this back to the ad: if near enough was good enough, we wouldn't be caring about the power requirements of our hardware, and we'd be contributing to greenhouse gas emissions wherever our kit is used through power and cooling demands. But we don't have that attitude. And quite some time ago our hardware designers got that message, so now you can reap the benefit of our "near enough is nowhere close to good enough" policy.

I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.

