Tuesday Apr 26, 2005

SVM and Solaris Express 4/05

The latest release of Solaris Express came out yesterday. As usual a summary is on Dan's blog.

One new Solaris Volume Manager (SVM) capability in this release is better integration with the Solaris Reconfiguration Coordination Manager (RCM) framework. This is the fix for bug:

4927518 SVM could be more intelligent when components are being unconfigured

SVM has had some RCM integration since Solaris 9 update 2. Starting in that release, if you attempted to Dynamically Reconfigure (DR) out a disk that was in use by SVM, you would get a nice message explaining how SVM was using the disk. However, this was really only the bare minimum of what we could do to integrate with the RCM framework. The problem that is identified by bug 4927518 is that if the disk has died you should just be able to DR it out so you can replace it. Up to now what would happen is you would get the message explaining that the disk was part of an SVM configuration. You had to manually unconfigure the SVM metadevice before you could DR out the disk. For example, if the disk was on one side of a mirror, you would have had to detach the submirror, delete it, DR out the disk, DR in the new one, recreate the submirror and reattach it to the mirror. Then you would have incurred a complete resync of the newly attached submirror. Obviously this is not very clean.

With the latest Solaris Express release SVM will clean up the system's internal state for the dead disk so that you can just DR it out without detaching and deleting the submirror. Once you DR in a new disk you can just enable it into the mirror and only that disk will be resynced, not the whole submirror. This is a big improvement for the manageability of SVM. There are more integration improvements like this coming for SVM. In another blog entry I'll also try to highlight some of the areas of integration we already have between SVM and other Solaris features.
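As a rough sketch of this new replacement flow (the controller, disk, and metadevice names below are placeholders, not values to copy), a dead submirror disk can now be handled something like this:
# cfgadm -al
# cfgadm -c unconfigure c1::dsk/c1t1d0
[physically swap in the new disk]
# cfgadm -c configure c1::dsk/c1t1d0
# prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
# metareplace -e d10 c1t1d0s0
Here cfgadm(1M) does the DR steps, fmthard copies the partition table over from the surviving disk, and metareplace -e enables the replacement component back into mirror d10 so that only it gets resynced.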

Tuesday Apr 19, 2005

Solaris Volume Manager odds and ends

I have had a few questions about some of my previous blog entries so I thought I would try to write up some general responses to the various questions.

First, here are the bugids for some of the bugs I mentioned in my previous posts:

6215065 Booting off single disk from mirrored root pair causes panic reset
This is the UFS logging bug that you can hit when you don't have metadb quorum.

6236382 boot should enter single-user when there is no mddb quorum
This one is pretty self explanatory. The system will boot all the way to multi-user, but root remains read-only. To recover from this, delete the metadb on the dead disk and reboot so that you have metadb quorum.
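For example, if the dead disk is c1t1d0 and its replicas are on slice 7 (placeholder names; use whatever your configuration shows), the recovery looks roughly like this:
# metadb -i
[note which replicas are flagged as errored]
# metadb -d c1t1d0s7
# reboot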

6250010 Cannot boot root mirror
This is the bug where the V20z and V40z don't use the altbootpath to failover when the primary boot disk is dead.

There was a question about what to do if you get into the infinite panic-reboot cycle. This would happen if you were hitting bug 6215065. To recover from this, you need to boot off of some other media so that you can clean up. You could boot off of a Solaris netinstall image or the Solaris install CD-ROM, for example. Once you do that you can mount the disk that is still ok and change the /etc/vfstab entry so that the root filesystem is mounted without logging. Since logging was originally in use, you want to make sure the log is rolled and that UFS stops using the log. Here are some example commands to do this:
# mount -o nologging /dev/dsk/c1t0d0s0 /a
# [edit /a/etc/vfstab; change the last field for / to "nologging"]
# umount /a
# reboot
By mounting this way UFS should roll the existing log and then mark the filesystem so that logging won't be used for the current mount on /a. This kind of recovery procedure where you boot off of the install image should only be used when one side of the root mirror is already dead. If you were to use this approach for other kinds of recovery you would leave the mirror in an inconsistent state since only one side was actually modified.

Here is a general procedure for accessing the root mirror when you boot off of the install image. An example where you might use this procedure is if you forgot the root password and wanted to mount the mirror so you could clear the field in /etc/shadow.

First, you need to boot off the install image and get a copy of the md.conf file from the root filesystem. Mount one of the underlying root mirror disks read-only to get a copy.
# mount -o ro /dev/dsk/c0t0d0s0 /a
# cp /a/kernel/drv/md.conf /kernel/drv/md.conf
# umount /a
Now update the SVM driver to load the configuration.
# update_drv -f md
[ignore any warning messages printed by update_drv]
# metainit -r
If you have mirrors you should run metasync to get them synced.
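For example, assuming a mirror named d10 (the same placeholder name used in the mount example below):
# metasync d10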

Your SVM metadevices should now be accessible and you should be able to mount them and perform whatever recovery you need.
# mount /dev/md/dsk/d10 /a

We are also getting this simple procedure into our docs so that it is easier to find.

Finally, there was another comment about using a USB memory disk to hold a copy of the metadb so that quorum would be maintained even if one of the disks in the root mirror died.

This is something that has come up internally in the past, but nobody on our team had actually tried it, so I went out this weekend and bought a USB memory disk to see if I could make it work. It turns out this worked fine for me, but there are a lot of variables so your mileage may vary.

Here is what worked for me. I got a 128MB memory disk since this was the cheapest one they sold and you don't need much space for a copy of the metadb (only about 5MB is required for one copy). First I used the "rmformat -l" command to figure out the name of the disk. The disk came formatted with a pcfs filesystem already on it and a single fdisk partition of type DOS-BIG. I used the fdisk(1M) command to delete that partition and create a single Solaris fdisk partition for the whole memory disk. After that I just put a metadb on the disk.
# metadb -a /dev/dsk/c6t0d0s2
Once I did all of this, I could reboot with one of the disks removed from my root mirror and I still had metadb quorum since I had a 3rd copy of the metadb available.
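You can check the replica status at any point with the metadb command; the third replica on the USB disk should show up alongside the two on the mirrored root disks (device names will differ on your system):
# metadb -i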

There are several caveats here. First, I was running this on a current nightly build of Solaris. I haven't tried it yet on the shipping Solaris 10 bits but I think this will probably work. Going back to the earlier S9 bits I would be less certain since a lot of work went into the USB code for S10. The main thing here is that the system has to see the USB disk at the time the SVM driver is reading the metadbs. This happens fairly early in the boot sequence. If we don't see the USB disk, then that metadb replica will be marked in error and it won't help maintain quorum.

The second thing to watch out for is that SVM keeps track of mirror resync regions in some of the copies of the metadbs. This is used for the optimized resync feature that SVM supports. Currently there is no good way to see which metadbs SVM is using for this and there is no way to control which metadbs will be used. You wouldn't want these writes going to the USB memory disk since it will probably wear out faster and might be slower too. We need to improve this in order for the USB memory disk to really be a production quality solution.

Another issue to watch out for is whether your hardware supports the memory disk. I tried this on an older x86 box and it couldn't see the USB disk. I am pretty sure the older system did not support USB 2.0, which is what the disk I bought uses. This worked fine when I tried it on a V65 and on a Dell 650, both of which are newer systems.

I need to do more work in this area before we could really recommend this approach, but it might be a useful solution today for people who need to get around the metadb quorum problem on a two disk configuration. You would really have to play around to make sure the configuration worked ok in your environment. We are also working internally on some other solutions to this same problem. Hopefully I can write more about that soon.

Monday Apr 18, 2005

SVM V20z root mirror panic

One of the problems I have been working on recently has to do with running Solaris Volume Manager on V20z and V40z servers. In general, SVM does not care what kinds of disks it is layered on top of. It just passes the I/O requests through to the drivers for the underlying storage. However, we were seeing problems on the V20z when it was configured for root mirroring.

With root mirroring, a common test is to pull the primary boot disk and reboot to verify that the system comes up properly on the other side of the mirror. What was happening was that Solaris would start to boot, then panic.

It turns out that we were hitting a limitation in the boot support for disks that are connected to the system with an mpt(7D) based HBA. The problematic code exists in bootconf.exe within the Device Configuration Assistant (DCA). The DCA is responsible for loading and starting the Solaris kernel. The problem is that the bootconf code was not failing over to the altbootpath device path, so Solaris would start to boot but then panic because the DCA was passing it the old bootpath device path. With the primary boot disk removed from the system, this was no longer the correct boot device path. You can see if this limitation might impact you by using the "prtconf -D" command and looking for the mpt driver.
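A quick way to do that check is to filter the prtconf output for the mpt driver (instance numbers and device names will vary from system to system):
# prtconf -D | grep mpt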

We have some solutions for this limitation in the pipeline, but in the meantime, there is an easy workaround for this. You need to edit the /boot/solaris/bootenv.rc file and remove the entries for bootpath and altbootpath. At this point, the DCA should automatically detect the correct boot device path and pass it into the kernel.
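The entries to remove look something like the following (these device paths are just placeholders from a hypothetical system, not values to copy):
# grep bootpath /boot/solaris/bootenv.rc
setprop bootpath /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@0,0:a
setprop altbootpath /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0:a
Delete both setprop lines and reboot.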

There are a couple of limitations to this workaround. First, it only works for S10 and later. In S9, it won't automatically boot. Instead it will enter the DCA and you will have to manually choose the boot device. Also, it only works for systems where both disks in the root mirror are attached via the mpt HBA. This is a typical configuration for the V20z and V40z. We are working on better solutions to this limitation, but hopefully this workaround is useful in the meantime.

Tuesday Apr 05, 2005

Solaris Volume Manager disksets

Solaris Volume Manager has had support for a feature called "disksets" for a long time, going back to when it was an unbundled product named SDS. Disksets are a way to group a collection of disks together for use by SVM. Originally this was designed for sharing disks, and the metadevices on top of them, between two or more hosts. For example, disksets are used by SunCluster. However, having a way to manage a set of disks for exclusive use by SVM simplifies administration, and with S10 we have made several enhancements to the diskset feature which make it more useful, even on a single host.

If you have never used SVM disksets before, I'll briefly summarize a couple of the key differences from the more common use of SVM metadevices outside of a diskset. First, with a diskset, the whole disk is given over to SVM control. When you do this SVM will repartition the disk and create a slice 7 for metadb placement. SVM automatically manages metadbs in disksets so you don't have to worry about creating those. As disks are added to the set, SVM will create new metadbs and automatically rebalance them across the disks as needed. Another difference with disksets is that hosts can also be assigned to the set. The local host must be in the set when you create it. Disksets implement the concept of a set owner so you can release ownership of the set and a different host can take ownership. Only the host that owns the set can access the metadevices within the set. An exception to this is the new multi-owner diskset which is a large topic on its own so I'll save that for another post.
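As a quick sketch (the set name, host name, and disk names are placeholders), creating a set and handing a couple of disks over to it looks something like this:
# metaset -s myset -a -h myhost
# metaset -s myset -a c2t1d0 c2t2d0
The first command creates the set with the local host in it; the second adds the whole disks, at which point SVM repartitions them and places metadbs on slice 7 as described above.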

With S10 SVM has added several new features based around the diskset. One of these is the metaimport(1M) command which can be used to load a complete diskset configuration onto a separate system. You might use this if your host died and you physically moved all of your storage to a different machine. SVM uses the disk device IDs to figure out how to put the configuration together on the new machine. This is required since the disk names themselves (e.g. c2t1d0, c3t5d0, ...) will probably be different on the new system. In S10 disk images in a diskset are self-identifying. What this means is that if you use remote block replication software like HDS TrueCopy or the SNDR feature of Sun's StorEdge Availability Suite to do remote replication, you can still use metaimport(1M) to import the remotely replicated disks, even though the device IDs of the remote disks will be different.
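For example (the set and disk names here are placeholders), on the new machine you can first ask metaimport to scan the attached disks and report any importable sets, and then import one by naming a disk that belongs to it:
# metaimport -r
# metaimport -s newset c3t5d0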

A second new feature is the metassist(1M) command. This command automates the creation of metadevices so that you don't have to run all of the separate meta\* commands individually. Metassist has quite a lot of features which I won't delve into in this post, but you can read more here. The one idea I wanted to discuss is that metassist uses disksets to implement the concept of a storage pool. Metassist relies on the automatic management of metadbs that disksets provide. Also, since the whole disk is now under control of SVM, metassist can repartition the disk as necessary to automatically create the appropriate metadevices within the pool. Metassist will use the disks in the pool to create new metadevices and only try to add additional disks to the pool if it can't configure the new metadevice using the available space on the existing disks in the pool.
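As a minimal sketch (the pool name, volume size, and redundancy level below are placeholders; metassist(1M) describes the full set of options), asking for a 10GB two-way mirrored volume looks something like this:
# metassist create -s mypool -S 10gb -r 2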

When disksets were first implemented back in SDS they were intended for use by multiple hosts. Since the metadevices in the set were only accessible by the host that owned the set, the assumption was that some other piece of software, outside of SDS, would manage the diskset ownership. Since SVM is now making greater use of disksets we added a new capability called "auto-take" which allows the local host to automatically take ownership of the diskset during boot and thus have access to the metadevices in the set. This means that you can use vfstab entries to mount filesystems built on metadevices within the set and those will "just work" during the system boot. The metassist command relies on this feature and the storage pools (i.e. disksets) it uses will all have auto-take enabled. Auto-take can only be enabled for disksets which have the single, local host in the set. If you have multiple hosts in the set then you are really using the set in the traditional manner and you'll need something outside of SVM to manage which host owns the set (again, this ignores the new multi-owner sets). You use the new "-A" option on the metaset command to enable or disable auto-take.
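Enabling or disabling auto-take on a set looks like this (the set name is a placeholder):
# metaset -s myset -A enable
# metaset -s myset -A disable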

Finally, since use of disksets is expected to be more common with these new features, we added the "-a" option to the metastat command so that you can see the status of metadevices in all disksets without having to run a separate command for each set. We also added a "-c" option to metastat which gives a compact, one-line output for each metadevice. This is particularly useful as the number of metadevices in the configuration increases. For example, on one of our local servers we have 93 metadevices. With the original, verbose metastat output this resulted in 842 lines of status, which makes it hard to see exactly what is going on. The "-c" option reduces this down to 93 lines.
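For example, combining the two options gives one compact line per metadevice across every diskset on the system:
# metastat -a -c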

This is just a brief overview of some of the new features we have implemented around SVM disksets. There is lots more detail in the docs here. The metaimport(1M) command was available starting in S9 9/04 and the metassist command was available starting in S9 6/04. However, the remote replication feature for metaimport requires S10.