Poor Man's Cluster - end the corruption

The putback of 6282725 "hostname/hostid should be stored in the label" introduces hostid checking when importing a pool.
If the pool was last accessed by another system, the import is denied (which of course can be overridden with the '-f' flag).

This is especially important to people rolling their own clusters - the so-called poor man's cluster. What people were finding was:

1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibly imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache

At this point, both clientA and clientB have the same pool imported and both can write to it. However, ZFS is not designed
to have multiple writers (yet), so both clients will quickly corrupt the pool, as each has a different view of the pool's state.

Now that we store the hostid in the label and verify the system importing the pool was the last one that accessed the pool, the
poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.

First, let's create the pool on clientA (assume both clients are already set up for iSCSI):

fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio 
fsh-weakfish# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          154K  9.78G    19K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-weakfish#

Note the enhanced information 'zpool import' reports on clientB:

fsh-mullet# zpool import
  pool: i
    id: 8574825092618243264
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        i                                        ONLINE
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#

Ok, we don't want to forcibly import the pool until clientA is down. So after clientA (fsh-weakfish) has rebooted,
forcibly import the pool on clientB (fsh-mullet):

fsh-weakfish# reboot
....

fsh-mullet# zpool import -f i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet#

After clientA comes back up, we'll see this message via syslog:

WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b).  See: http://www.sun.com/msg/ZFS-8000-EY

And just to double-check that pool 'i' is in fact not loaded:

fsh-weakfish# zpool list
no pools available
fsh-weakfish# 

And to verify the pool has not been corrupted from clientB's view of the world, we see:

fsh-mullet# zpool scrub i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          156K  9.78G    21K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-mullet# 

See you never again, poor man's cluster corruption.

One detail i'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because
the reboot(1M) command takes the machine down cleanly, which means it unmounts all filesystems, and unmounting a
filesystem writes a bit of data to the pool.
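
If both machines are up and you simply want to hand the pool from clientA to clientB, the cleaner approach is to
export it first - an exported pool can then be imported on the other system without '-f' at all. A quick sketch
(not part of the original run), using the same pool as above:

fsh-weakfish# zpool export i
fsh-weakfish#

fsh-mullet# zpool import i
fsh-mullet#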

To see the new information on the label, you can use zdb(1M):

fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=6
    name='i'
    state=0
    txg=665
    pool_guid=8574825092618243264
    hostid=2205397851
    hostname='fsh-mullet'
    top_guid=5676430250453749577
    guid=5676430250453749577
    vdev_tree
        type='disk'
        id=0
        guid=5676430250453749577
        path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
        devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
        whole_disk=1
        metaslab_array=14
        metaslab_shift=26
        ashift=9
        asize=10724048896
        DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...
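
That hostid/hostname pair is exactly what the import-time check compares against the importing system's own identity.
The check itself lives in the kernel's pool-load path, but you can eyeball the same comparison from userland. Here's a
rough sketch (not part of the original run) using hostid(1), which prints the system's hostid in hex - 0x8373b35b is
just 2205397851 in decimal - next to the value zdb(1M) reads out of each label:

fsh-mullet# hostid
8373b35b
fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0 | grep hostid=
    hostid=2205397851
    ...
fsh-mullet#
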
Comments:

Any way to enhance the following message to include the hostname that last imported the pool?

fsh-mullet# zpool import
pool: i
id: 8574825092618243264
state: ONLINE
status: The pool was last accessed by another system.
< .....>

Thanks for adding this cool feature!
- Ryan

Posted by Matty on April 10, 2007 at 06:05 AM PDT #

That's pretty f-ing cool. Nice work. Sometimes I have to forcibly do a pool import when I do a live upgrade... will that still be the case here, or does the hostid/hostname checking notice that case and eliminate the need for it?

Posted by guest on April 10, 2007 at 06:56 AM PDT #

eric, your blog is cool... but the blue on white theme is looking a little tired... how about sprucing things up a bit? also the font comes out really itty bitty teeny tiny for me. maybe try one of the newer themes built into roller?

Love, the blog fashionista.

Posted by What not to wear on your blog on April 10, 2007 at 07:00 AM PDT #

It should be possible to add the hostname to 'zpool import' (without specifying a specific pool). I decided against it as it seemed to clutter up the output and was inconsistent with how specific the other 'zpool import' errors are.

You can get that information by trying to import the specific pool. If this isn't sufficient, let me know and i can see about adding it.

Posted by eric kustarz on April 10, 2007 at 10:25 AM PDT #

With regards to having to or not having to use the '-f' flag, yep, we've made it easier on you. If you were the last one to access the pool, then '-f' is no longer needed. Try destroying a pool and importing it via 'zpool import -D <pool>' - no more '-f'!

Posted by eric kustarz on April 10, 2007 at 10:27 AM PDT #

Fashion police - you're too funny!

when i have spare time, i'll check out the new themes...

Posted by eric kustarz on April 10, 2007 at 10:30 AM PDT #

When does this hostid stuff make it into the Solaris GA release? David

Posted by David Smith on April 17, 2007 at 03:00 AM PDT #

Hey David,

It won't make s10u4 and the schedules for future updates haven't been settled yet. So i don't know yet.

Posted by eric kustarz on April 17, 2007 at 06:53 AM PDT #

ZFS is not designed to have multiple writers (yet)...

I'm intrigued by the "yet". Are there any concrete plans to support that?

Posted by David Hopwood on May 26, 2007 at 11:33 PM PDT #

Nothing concrete. pNFS is going to be one solution and that is being actively worked on (prototype works and the NFSv4 wg is going to settle on the spec this summer).

Posted by eric kustarz on May 29, 2007 at 02:18 AM PDT #

This is an excellent change, and hopefully the "zfs mount -a" in lib/svcs/fs-local will no longer fail the service when an unimportable pool is discovered. As an aside, this is an issue as much for a "rich man's SAN" as a "poor man's cluster"...it's expected that systems can coexist peacefully without extensive lun masking, and this will definitely help that.

Posted by Jeff on June 05, 2007 at 08:36 AM PDT #
