Monday Jun 08, 2009

"S.M.A.R.T. Capable but command failed" aka ZFS saves the day!

I get to join the ranks of smug people who have had ZFS save their bacon/data:

So my system threw a bit of a wobbly when I unplugged it to move it (it doesn't get turned off much these days). It seems one of the Western Digital WS5000YS RE (that's RAID Edition) drives failed to come back to life after being spun down. The first clue was the "S.M.A.R.T. Capable but command failed" error in the BIOS POST messages; the second was the zio_read_data_fail messages and getting dropped to the grub> prompt instead of the boot selection menu.

Ah, but surely ZFS will save the day? Well, yes, but the key to situations like this is preparation - and of course I had not prepared properly. Hey, this is my personal home workstation, and apart from being unable to stream music and movies to the PS3, my users are pretty laid back about outages.

The problem is that although I have a mirrored ZFS root pool (rpool), I never installed the grub bootloader onto the second disk, despite the output of attaching a second mirror to the rpool specifically telling you to do this.
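For the record, this is roughly the pair of commands that attaching the second disk should have involved, installgrub included. Just a sketch after the fact - I'm assuming the pool originally lived on c6d0s0 and that c9d0s0 was the disk being attached:

james@frank ~ $ pfexec zpool attach rpool c6d0s0 c9d0s0
james@frank ~ $ pfexec installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c9d0s0

The second command is the one I skipped, which is why the surviving disk had no boot blocks on it.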

I had some weirdness and had to reset the BIOS to defaults, and eventually remove the bad drive, to get the system to boot from the 2009.06 Live CD. I suspect this was down to the number of errors the disk was spewing onto the console as it booted.

So once I had booted off the 2009.06 Live CD I could import my pool and try to work around the damage.

First, import the pool using the -f flag to override the fact that it is technically still in use.

jack@opensolaris:~$ pfexec zpool import -f rpool
cannot share 'rpool/export/home': smb add share failed
cannot share 'rpool/export/home/james': smb add share failed
cannot share 'rpool/export/home/media': smb add share failed


I can safely ignore the messages about being unable to share out my filesystems - I don't think the Live CD has the CIFS service, so the SMB shares fail.
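If you want to double-check that it really is just the share step failing and nothing pool related, the sharesmb property shows which filesystems are trying to share. Purely a sanity check, not part of the recovery:

jack@opensolaris:~$ zfs get -r sharesmb rpool/export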

Make a place to mount the root filesystem. I'm mounting 111a as it is the latest boot environment on this system (it has been broken since before 2009.06 came out - I told you my users were laid back):

jack@opensolaris:~$ mkdir /tmp/mnt

jack@opensolaris:~$ pfexec mount -F zfs rpool/ROOT/111a /tmp/mnt


Now the bit that I should have done before the disk failed:

jack@opensolaris:/rpool/boot/grub$ pfexec installgrub -m /tmp/mnt/boot/grub/stage1 /tmp/mnt/boot/grub/stage2 /dev/rdsk/c9d0s0
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)?y
stage1 written to partition 0 sector 0 (abs 16065)
stage2 written to partition 0, 271 sectors starting at 50 (abs 16115)
stage1 written to master boot sector
jack@opensolaris:/rpool/boot/grub$

Unmount and reboot:

jack@opensolaris:~$ pfexec umount /tmp/mnt
jack@opensolaris:~$ pfexec reboot

I am back in a working boot environment, though only running on a single disk. It looks like Western Digital will accept this disk as an RMA, so it remains to be seen how long I have to wait. I'm getting itchy already with only a single platter between me and data loss - I wonder how much a temporary replacement would cost.

 james@frank ~ $ zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: scrub in progress for 0h23m, 17.09% done, 1h54m to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c6d0s0  UNAVAIL      0     0     0  cannot open
            c9d0s0  ONLINE       0     0     0

errors: No known data errors
james@frank
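For completeness, when the replacement drive does arrive the plan is roughly this, assuming it comes back at the same c6d0 location with a suitable label and slice 0 - a sketch only, since I obviously haven't been able to run it yet:

james@frank ~ $ pfexec zpool replace rpool c6d0s0
james@frank ~ $ pfexec installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c6d0s0

And this time installgrub gets run straight away, so both halves of the mirror are bootable.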


