Thursday Dec 08, 2005

Moving and cloning zones

It's been quite a long time since my last blog entry. I have moved over from the SVM team to the Zones team and I have been busy coming up to speed on Zones.

So far I have just fixed a few Zones bugs but now I am starting to work on some new features. One of the big things people want from Zones is the ability to move them and clone them.

I have a proposal for moving and cloning zones over on the OpenSolaris zones discussion. It has been approved by our internal architectural review committee and the code is basically done, so the feature should be available soon. Moving a zone is currently limited to a single system; migrating a zone from one machine to another is the next thing we're going to work on.

For cloning we currently copy the bits from one zone instance to another, and we're seeing significant performance wins compared to installing the zone from scratch (13x faster on one test machine). With ZFS now available it seems obvious that we could use ZFS clones to quickly clone zone instances. This is something we are actively looking at, but for now we don't recommend placing your zonepath on ZFS. The reason, tracked as bug 6356600, is the current limitation that you won't be able to upgrade your system if your zonepath is on ZFS. Once the upgrade issues have been resolved, we'll extend zone cloning to be better integrated with ZFS clones. In the meantime, while you can use ZFS to hold your zones, you need to be aware that the system won't be upgradeable.
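To give a flavor of why ZFS clones are so attractive here, this is roughly what the copy step could collapse to at the filesystem level (the dataset and zone names below are made up for illustration, and per the caveat above this layout isn't something we recommend for zonepaths yet):

# zfs snapshot tank/zones/template@gold
# zfs clone tank/zones/template@gold tank/zones/newzone

A ZFS clone like this is created almost instantly and only consumes additional space as the new zone diverges from the original.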

Friday Aug 12, 2005

FROSUG

I haven't written a blog entry for quite a while now. I'm not actually working on SVM right now; instead, I am busy on some zones-related work. It has been a busy summer. My wife and I were in Beijing for about 10 days talking to some customers about Solaris. Sarah has posted some pictures and written a funny story about our trip.

Last night we had the inaugural meeting of FROSUG (the Front Range Open Solaris Users Group). I gave an overview presentation introducing OpenSolaris. The meeting seemed to go well and it got blogged by Stephen O'Grady, which is pretty cool. Hopefully it will take off and we can get an active community of OpenSolaris people in the Denver area.

Thursday Jun 23, 2005

SVM resync cancel/resume

The latest release of Solaris Express came out the other day. Dan has his usual excellent summary. He mentions one cool new SVM feature, but it would be easy to overlook since there are so many other new things in this release. The new feature is the ability to cancel a mirror resync that is underway. The resync is checkpointed and you can restart it later; it will simply pick up where it left off. This is handy if the resync is affecting performance and you'd like to wait until later to let it run. Another use for this is if you need to reboot. With the old code, if a full resync was underway and you rebooted, the resync would start over from the beginning. Now, if you cancel it before rebooting, the checkpoint will allow the resync to pick up where it left off.
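The CLI side of this is small; it looks something like the example below, with d10 as a placeholder mirror name (I'm writing this from memory, so check the metasync(1M) man page for the exact option syntax):

# metasync -c d10
[cancels the in-progress resync of mirror d10 and checkpoints it]
# metasync d10
[run later to resume the resync from the checkpoint]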

This code is already in OpenSolaris. You can see the CLI changes in metasync.c and the library changes in meta_mirror_resync_kill. The changes are fairly small because most of the underlying support was already implemented for multi-node disksets. All we had to do was add the CLI option and hook in to the existing ioctl. You can see some of this resync functionality in the resync_unit function. There is a nice big comment there which explains some of this code.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Tuesday Jun 14, 2005

SVM and the B_FAILFAST flag

Now that OpenSolaris is here it is a lot easier to talk about some of the interesting implementation details in the code. In this post I wanted to discuss the first project I did after I started to work on the Solaris Volume Manager (SVM). This is on my mind right now because it also happens to be related to one of my most recent changes to the code. This change is not even in Solaris Express yet; it is only available in OpenSolaris. Early access to these kinds of changes is just one small reason why OpenSolaris is so cool.

My first SVM project was to add support for the B_FAILFAST flag. This flag is defined in /usr/include/sys/buf.h and it was implemented in some of the disk drivers so that I/O requests queued in the driver could be cleared out quickly when the driver saw that a disk was malfunctioning. For SVM the big requester for this feature was our clustering software. The problem they were seeing was that in a production environment there would be many concurrent I/O requests queued up down in the sd driver. When the disk was failing, the sd driver would need to process each of these requests, wait for the timeouts and retries, and slowly drain its queue. The cluster software could not fail over to another node until all of these pending requests had been cleared out of the system. The B_FAILFAST flag is the exact solution to this problem. It tells the driver to do two things. First, it reduces the number of retries that the driver does to a failing disk before it gives up and returns an error. Second, when the first I/O buf that is queued up in the driver gets an error, the driver will immediately error out all of the other, pending bufs in its queue. Furthermore, any new bufs sent down with the B_FAILFAST flag will immediately return with an error.

This seemed fairly straightforward to implement in SVM. The code had to be modified to detect whether the underlying devices supported the B_FAILFAST flag and, if so, set the flag in the buf being passed down from the md driver to the underlying drivers that made up the metadevice. For simplicity we decided we would only add this support to the mirror metadevice in SVM. However, the more I looked at this, the more complicated it seemed to be. We were worried about creating new failure modes with B_FAILFAST, and the big concern was the possibility of a "spurious" error. That is, getting back an error on the buf that we would not have seen if we had let the underlying driver perform its full set of timeouts and retries. This concern eventually drove the whole design of the initial B_FAILFAST implementation within the mirror code. To handle this spurious error case I implemented an algorithm within the driver so that when we got back an errored B_FAILFAST buf we would resubmit that buf without the B_FAILFAST flag set. During this retry, all of the other failed I/O bufs would also immediately come back up into the md driver. I queued those up so that I could either fail all of them after the retried buf finally failed, or resubmit them back down to the underlying driver if the retried I/O succeeded. Implementing this correctly took a lot longer than I originally expected when I took on this first project, and it was one of those things that worked but that I was never very happy with. The code was complex and I never felt completely confident that there wasn't some obscure error condition lurking there that would come back to bite us later. In addition, because of the retry, the failover of a component within a mirror actually took *longer* now if there was only a single I/O being processed.

This original algorithm was delivered in the S10 code and was also released as a patch for S9 and SDS 4.2.1. It has been in use for a couple of years, which gave me some direct experience with how well the B_FAILFAST option worked in real life. We actually have seen one or two of these so-called spurious errors, but in all cases there were real, underlying problems with the disks. The storage was marginal and SVM would have been better off just erroring out those components within the mirror and immediately failing over to the good side of the mirror. By this time I was comfortable with this idea, so I rewrote the B_FAILFAST code within the mirror driver. This new algorithm is what you'll see today in the OpenSolaris code base. I basically decided to just trust the error we get back when B_FAILFAST is set. The code follows the normal error path, putting the submirror component into the maintenance state and using the other, good side of the mirror from that point onward. I was able to remove the queue and simplify the logic almost back to what it was before we added support for B_FAILFAST.

However, there is still one special case we have to worry about when using B_FAILFAST. As I mentioned above, when B_FAILFAST is set, all of the pending I/O bufs that are queued down in the underlying driver will fail once the first buf gets an error. When we are down to the last side of a mirror, the SVM code will continue to try to do I/O to those last submirror components, even though they are taking errors. This is called the LAST_ERRED state within SVM and is an attempt to provide access to as much of your data as possible. When using B_FAILFAST it is probable that not all of the failed I/O bufs will have been seen by the disk and given a chance to succeed. With the new algorithm the code detects this state and reissues all of the I/O bufs without B_FAILFAST set. There is no longer any queueing; we just resubmit the I/O bufs without the flag, and all future I/O to the submirror is done without the flag. Once the LAST_ERRED state is cleared the code will return to using the B_FAILFAST flag.

All of this is really an implementation detail of mirroring in SVM. There is no user-visible component to it except for a change in how quickly the mirror will fail the errored drives in the submirror. All of the code is contained within the mirror portion of the SVM driver and you can see it in mirror.c. The mirror_check_failfast function is used to determine if all of the components in a submirror support the B_FAILFAST flag. The mirror_done function is called when the I/O to the underlying submirror is complete. In this function we check whether the I/O failed and B_FAILFAST was set. If so, we call the submirror_is_lasterred function to check for that condition, and the last_err_retry function is called only when we need to resubmit the I/O. That function is actually executed in a helper thread, since the I/O completes in a different thread from the one that initiated the I/O down into the md driver.

To wrap up, the SVM md driver code lives in the source tree at usr/src/uts/common/io/lvm. The main md driver is in the md subdirectory and each specific kind of metadevice also has its own subdirectory (mirror, stripe, etc.). The SVM command line utilities live in usr/src/cmd/lvm and the shared library code that SVM uses lives in usr/src/lib/lvm. Libmeta is the primary library. In another post I'll talk in more detail about some of these other components of SVM.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Friday May 20, 2005

SVM and SMF

In a previous blog I talked about integration of Solaris Volume Manager (SVM) and RCM. Proper integration of SVM with the other subsystems in Solaris is one of the things I am particularly interested in.

Today I'd like to talk about some of the work I did to integrate SVM with the new Service Management Facility (SMF) that was introduced in S10. Previously SVM had a couple of RC scripts that would run when the system booted, even if you had not configured SVM and were not using any metadevices. There were also several SVM-specific RPC daemons that were enabled. One of the ideas behind SMF is that only the services that are actually needed should be enabled; this speeds up boot and makes for a cleaner system. Also, not all of the RPC daemons need to be enabled when using SVM: different daemons are used depending on how SVM is configured. SMF allows us to clean this up and manage these services within the code so that the proper services are enabled and disabled as you reconfigure SVM.

The following is a list of the services used by SVM:

svc:/network/rpc/mdcomm
svc:/network/rpc/metamed
svc:/network/rpc/metamh
svc:/network/rpc/meta
svc:/system/metainit
svc:/system/mdmonitor

The system/mdmonitor, system/metainit and network/rpc/meta services are the core services. These will be enabled when you create the first metadb. Once you create your first diskset the network/rpc/metamed and network/rpc/metamh services will be enabled. When you create your first multi-node diskset the network/rpc/mdcomm service will also be enabled. As you delete these portions of your configuration the corresponding services will be disabled.
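You can watch this happen with the standard SMF tools; for example, to check the state of a few of these services on a given machine (nothing SVM-specific here, this is just svcs(1)):

# svcs mdmonitor metainit
# svcs -a | grep rpc/meta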

Integrating this coordination of SVM and SMF is easy since SMF offers a full API which allows programs to monitor and reconfigure the services they use. The primary functions used are smf_get_state, smf_enable_instance and smf_disable_instance, all of which are documented on the smf_enable_instance(3SCF) man page. This could all have been done previously using various hacks to rename scripts and edit configuration files, but it is trivially simple with SMF. Furthermore, the code can always tell when there is something wrong with the services it depends on. Recently I integrated some new code that will notify you whenever you check the status of SVM with one of the CLI commands (metastat, metaset or metadb) and there is a problem with the SVM services. We have barely scratched the surface here, but SMF lays a good foundation for enabling us to deliver a true self-healing system.

Tuesday May 17, 2005

Another SVMer starts blogging.

Sanjay Nadkarni, another member of the Solaris Volume Manager engineering team, has just started a blog. Sanjay was the technical lead for the project that added new clustering capabilities to SVM so that it can now support concurrent readers and writers. If you read my blog because you are interested in SVM you will want to take a look at his too. Welcome, Sanjay.

Friday May 06, 2005

SVM metadbs, USB disks and S2.7

In an earlier blog I talked about using a USB memory disk to store a Solaris Volume Manager (SVM) metadb on a two-disk configuration. This would reduce the likelihood of hitting the mddb quorum problem I have talked about. The biggest problem with this approach was that there was no way to control where SVM would place its optimized resync regions. I putback a fix for this limitation this week and it should show up in an upcoming Solaris Express release. With this fix the code will no longer place the optimized resync regions on a USB disk, or any other removable disk for that matter. The only I/O to these devices should be the configuration-change writes and the initialization reads of the metadbs, which are a lot less frequent than the optimized resync writes.

I had another interesting experience this week. I was working on a bug fix for the latest release of Solaris and I had to test an x86 upgrade from S2.7, with a Solstice DiskSuite (SDS) 4.2 root mirror, to the latest release of Solaris. This was interesting for a number of reasons. First, this code is over 6 years old, but because of the long support lifetimes for Solaris releases we still have to be sure things like this work. Second, it was truly painful to see this ancient release of Solaris and SDS running on x86; it was quite a reminder of how far Solaris 10 has come on the x86 platform. It will be exciting to see where the Solaris community takes OpenSolaris on the x86 platform, as well as other platforms, over the next few years.

Tuesday Apr 26, 2005

SVM and Solaris Express 4/05

The latest release of Solaris Express came out yesterday. As usual a summary is on Dan's blog.

One new Solaris Volume Manager (SVM) capability in this release is better integration with the Solaris Reconfiguration Coordination Manager (RCM) framework. This is the fix for bug:

4927518 SVM could be more intelligent when components are being unconfigured

SVM has had some RCM integration since Solaris 9 update 2. Starting in that release, if you attempted to Dynamically Reconfigure (DR) out a disk that was in use by SVM, you would get a nice message explaining how SVM was using the disk. However, this was really only the bare minimum of what we could do to integrate with the RCM framework. The problem identified by bug 4927518 is that if the disk has died you should just be able to DR it out so you can replace it. Up to now, what would happen is that you would get the message explaining that the disk was part of an SVM configuration, and you had to manually unconfigure the SVM metadevice before you could DR out the disk. For example, if the disk was on one side of a mirror, you would have had to detach the submirror, delete it, DR out the disk, DR in the new one, recreate the submirror and reattach it to the mirror. Then you would have incurred a complete resync of the newly attached submirror. Obviously this is not very clean.
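To make that concrete, the old sequence looked roughly like this for a hypothetical mirror d10 with submirror d12 built on a dead disk c1t1d0 (the names are made up; the -f is needed on metadetach because the submirror has errored components):

# metadetach -f d10 d12
# metaclear d12
[DR out the dead disk, DR in the replacement, partition it to match]
# metainit d12 1 1 c1t1d0s0
# metattach d10 d12
[a full resync of the newly attached submirror starts here]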

With the latest Solaris Express release SVM will clean up the system's internal state for the dead disk so that you can just DR it out without detaching and deleting the submirror. Once you DR in a new disk you can just enable it into the mirror and only that disk will be resynced, not the whole submirror. This is a big improvement for the manageability of SVM. There are more integration improvements like this coming for SVM. In another blog entry I'll try to highlight some of the areas of integration we already have between SVM and other Solaris features.
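With the new behavior, the same replacement is reduced to something like this (again with made-up names, where c1t1d0s0 is the slice on the newly DR'd-in disk):

# metareplace -e d10 c1t1d0s0
[only this component is resynced, not the whole submirror]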

Tuesday Apr 19, 2005

Solaris Volume Manager odds and ends

I have had a few questions about some of my previous blog entries so I thought I would try to write up some general responses to the various questions.

First, here are the bugids for some of the bugs I mentioned in my previous posts:

6215065 Booting off single disk from mirrored root pair causes panic reset
This is the UFS logging bug that you can hit when you don't have metadb quorum.

6236382 boot should enter single-user when there is no mddb quorum
This one is pretty self-explanatory. The system will boot all the way to multi-user, but root remains read-only. To recover from this, delete the metadbs on the dead disk and reboot so that you have metadb quorum.

6250010 Cannot boot root mirror
This is the bug where the V20z and V40z don't use the altbootpath to failover when the primary boot disk is dead.

There was a question about what to do if you get into the infinite panic-reboot cycle. This would happen if you were hitting bug 6215065. To recover from this, you need to boot off of some other media so that you can clean up. You could boot off of a Solaris netinstall image or the Solaris install CD-ROM, for example. Once you do that you can mount the disk that is still ok and change the /etc/vfstab entry so that the root filesystem is mounted without logging. Since logging was originally in use, you want to make sure the log is rolled and that UFS stops using the log. Here are some example commands to do this:
# mount -o nologging /dev/dsk/c1t0d0s0 /a
# [edit /a/etc/vfstab; change the last field for / to "nologging"]
# umount /a
# reboot
By mounting this way UFS should roll the existing log and then mark the filesystem so that logging won't be used for the current mount on /a. This kind of recovery procedure where you boot off of the install image should only be used when one side of the root mirror is already dead. If you were to use this approach for other kinds of recovery you would leave the mirror in an inconsistent state since only one side was actually modified.

Here is a general procedure for accessing the root mirror when you boot off of the install image. An example where you might use this procedure is if you forgot the root password and wanted to mount the mirror so you could clear the field in /etc/shadow.

First, you need to boot off the install image and get a copy of the md.conf file from the root filesystem. Mount one of the underlying root mirror disks read-only to get a copy.
# mount -o ro /dev/dsk/c0t0d0s0 /a
# cp /a/kernel/drv/md.conf /kernel/drv/md.conf
# umount /a
Now update the SVM driver to load the configuration.
# update_drv -f md
[ignore any warning messages printed by update_drv]
# metainit -r
If you have mirrors you should run metasync to get them synced.
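For example, to sync a mirror named d10 (a placeholder name):
# metasync d10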

Your SVM metadevices should now be accessible and you should be able to mount them and perform whatever recovery you need.
# mount /dev/md/dsk/d10 /a

We are also getting this simple procedure into our docs so that it is easier to find.

Finally, there was another comment about using a USB memory disk to hold a copy of the metadb so that quorum would be maintained even if one of the disks in the root mirror died.

This is something that has come up internally in the past but nobody on our team had actually tried this so I went out this weekend and bought a USB memory disk to see if I could make this work. It turns out this worked fine for me, but there are a lot of variables so your mileage may vary.

Here is what worked for me. I got a 128MB memory disk since this was the cheapest one they sold and you don't need much space for a copy of the metadb (only about 5MB is required for one copy). First I used the "rmformat -l" command to figure out the name of the disk. The disk came formatted with a pcfs filesystem already on it and a single fdisk partition of type DOS-BIG. I used the fdisk(1M) command to delete that partition and create a single Solaris fdisk partition for the whole memory disk. After that I just put a metadb on the disk.
# metadb -a /dev/dsk/c6t0d0s2
Once I did all of this, I could reboot with one of the disks removed from my root mirror and I still had metadb quorum since I had a 3rd copy of the metadb available.
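For reference, that repartitioning step can be done in one shot with the -B option to fdisk(1M), which sets up a single Solaris fdisk partition covering the whole disk (using the device name from my setup; yours will differ):
# fdisk -B /dev/rdsk/c6t0d0p0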

There are several caveats here. First, I was running this on a current nightly build of Solaris. I haven't tried it yet on the shipping Solaris 10 bits but I think this will probably work. Going back to the earlier S9 bits I would be less certain since a lot of work went into the USB code for S10. The main thing here is that the system has to see the USB disk at the time the SVM driver is reading the metadbs. This happens fairly early in the boot sequence. If we don't see the USB disk, then that metadb replica will be marked in error and it won't help maintain quorum.

The second thing to watch out for is that SVM keeps track of mirror resync regions in some of the copies of the metadbs. This is used for the optimized resync feature that SVM supports. Currently there is no good way to see which metadbs SVM is using for this and there is no way to control which metadbs will be used. You wouldn't want these writes going to the USB memory disk since it will probably wear out faster and might be slower too. We need to improve this in order for the USB memory disk to really be a production quality solution.

Another issue to watch out for is whether your hardware supports the memory disk. I tried this on an older x86 box and it couldn't see the USB disk. I am pretty sure the older system did not support USB 2.0, which is what the disk I bought supports. This worked fine when I tried it on a V65 and on a Dell 650, both of which are newer systems.

I need to do more work in this area before we could really recommend this approach, but it might be a useful solution today for people who need to get around the metadb quorum problem on a two-disk configuration. You would really have to play around to make sure the configuration worked ok in your environment. We are also working internally on some other solutions to this same problem. Hopefully I can write more about that soon.

Monday Apr 18, 2005

SVM V20z root mirror panic

One of the problems I have been working on recently has to do with running Solaris Volume Manager on V20z and V40z servers. In general, SVM does not care what kinds of disks it is layered on top of. It just passes the I/O requests through to the drivers for the underlying storage. However, we were seeing problems on the V20z when it was configured for root mirroring.

With root mirroring, a common test is to pull the primary boot disk and reboot to verify that the system comes up properly on the other side of the mirror. What was happening was that Solaris would start to boot, then panic.

It turns out that we were hitting a limitation in the boot support for disks that are connected to the system with an mpt(7D) based HBA. The problematic code exists in bootconf.exe within the Device Configuration Assistant (DCA). The DCA is responsible for loading and starting the Solaris kernel. The problem is that the bootconf code was not failing over to the altbootpath device path, so Solaris would start to boot but then panic because the DCA was passing it the old bootpath device path. With the primary boot disk removed from the system, this was no longer the correct boot device path. You can see if this limitation might impact you by using the "prtconf -D" command and looking for the mpt driver.

We have some solutions for this limitation in the pipeline, but in the meantime, there is an easy workaround for this. You need to edit the /boot/solaris/bootenv.rc file and remove the entries for bootpath and altbootpath. At this point, the DCA should automatically detect the correct boot device path and pass it into the kernel.

There are a couple of limitations to this workaround. First, it only works for S10 and later. In S9, it won't automatically boot. Instead it will enter the DCA and you will have to manually choose the boot device. Also, it only works for systems where both disks in the root mirror are attached via the mpt HBA. This is a typical configuration for the V20z and V40z. We are working on better solutions to this limitation, but hopefully this workaround is useful in the meantime.

Tuesday Apr 05, 2005

Solaris Volume Manager disksets

Solaris Volume Manager has had support for a feature called "disksets" for a long time, going back to when it was an unbundled product named SDS. Disksets are a way to group a collection of disks together for use by SVM. Originally this was designed for sharing disks, and the metadevices on top of them, between two or more hosts. For example, disksets are used by SunCluster. However, having a way to manage a set of disks for exclusive use by SVM simplifies administration and with S10 we have made several enhancements to the diskset feature which make it more useful, even on a single host.

If you have never used SVM disksets before, I'll briefly summarize a couple of the key differences from the more common use of SVM metadevices outside of a diskset. First, with a diskset, the whole disk is given over to SVM control. When you do this, SVM will repartition the disk and create a slice 7 for metadb placement. SVM automatically manages metadbs in disksets so you don't have to worry about creating those. As disks are added to the set, SVM will create new metadbs and automatically rebalance them across the disks as needed. Another difference with disksets is that hosts can also be assigned to the set; the local host must be in the set when you create it. Disksets implement the concept of a set owner, so you can release ownership of the set and a different host can take ownership. Only the host that owns the set can access the metadevices within the set. An exception to this is the new multi-owner diskset, which is a large topic on its own, so I'll save that for another post.
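Creating a set and handing disks over to it is just a couple of commands; for example, with made-up set, host and disk names:

# metaset -s testset -a -h myhost
[creates the set with the local host in it]
# metaset -s testset -a c2t1d0 c2t2d0
[adds the disks; SVM repartitions them and places metadbs on slice 7]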

With S10 SVM has added several new features based around the diskset. One of these is the metaimport(1M) command, which can be used to load a complete diskset configuration onto a separate system. You might use this if your host died and you physically moved all of your storage to a different machine. SVM uses the disk device IDs to figure out how to put the configuration together on the new machine. This is required since the disk names themselves (e.g. c2t1d0, c3t5d0, ...) will probably be different on the new system. In S10, disk images in a diskset are self-identifying. What this means is that if you use remote block replication software like HDS TrueCopy or the SNDR feature of Sun's StorEdge Availability Suite, you can still use metaimport(1M) to import the remotely replicated disks, even though the device IDs of the remote disks will be different.
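The command itself is simple; something like the following, with made-up names, where the first form just reports on what is importable:

# metaimport -r
# metaimport -s newset c3t1d0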

A second new feature is the metassist(1M) command. This command automates the creation of metadevices so that you don't have to run all of the separate meta* commands individually. Metassist has quite a lot of features which I won't delve into in this post, but you can read more here. The one idea I wanted to discuss is that metassist uses disksets to implement the concept of a storage pool. Metassist relies on the automatic management of metadbs that disksets provide. Also, since the whole disk is now under control of SVM, metassist can repartition the disk as necessary to automatically create the appropriate metadevices within the pool. Metassist will use the disks in the pool to create new metadevices and only try to add additional disks to the pool if it can't configure the new metadevice using the available space on the existing disks in the pool.
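As a rough illustration (the exact options are in metassist(1M); the pool name and size here are made up), creating a redundant volume in a pool can be as simple as:

# metassist create -s storagepool -S 10gb -r 2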

When disksets were first implemented back in SDS they were intended for use by multiple hosts. Since the metadevices in the set were only accessible by the host that owned the set, the assumption was that some other piece of software, outside of SDS, would manage the diskset ownership. Since SVM is now making greater use of disksets, we added a new capability called "auto-take" which allows the local host to automatically take ownership of the diskset during boot and thus have access to the metadevices in the set. This means that you can use vfstab entries to mount filesystems built on metadevices within the set and those will "just work" during the system boot. The metassist command relies on this feature and the storage pools (i.e. disksets) it uses will all have auto-take enabled. Auto-take can only be enabled for disksets which have the single, local host in the set. If you have multiple hosts in the set then you are really using the set in the traditional manner and you'll need something outside of SVM to manage which host owns the set (again, this ignores the new multi-owner sets). You use the new "-A" option on the metaset command to enable or disable auto-take.
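For example, to turn auto-take on for an existing single-host set (made-up set name):

# metaset -s storagepool -A enable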

Finally, since use of disksets is expected to be more common with these new features, we added the "-a" option to the metastat command so that you can see the status of metadevices in all disksets without having to run a separate command for each set. We also added a "-c" option to metastat which gives a compact, one-line output for each metadevice. This is particularly useful as the number of metadevices in the configuration increases. For example, on one of our local servers we have 93 metadevices. With the original, verbose metastat output this resulted in 842 lines of status, which makes it hard to see exactly what is going on. The "-c" option reduces this down to 93 lines.
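So now a single command gives a compact view of everything:

# metastat -ac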

This is just a brief overview of some of the new features we have implemented around SVM disksets. There is lots more detail in the docs here. The metaimport(1M) command was available starting in S9 9/04 and the metassist command was available starting in S9 6/04. However, the remote replication feature for metaimport requires S10.

Friday Apr 01, 2005

Solaris Volume Manager root mirror problems on S10

There are a couple of bugs that we found in S10 that make it look like Solaris Volume Manager root mirroring does not work at all. Unfortunately we found these bugs after the release went out. These bugs will be patched, but I wanted to describe the problems a bit and offer some workarounds.

On a lot of systems that use SVM to do root mirroring there are only two disks. When you set up the configuration you put one or more metadbs on each disk to hold the SVM configuration information.

SVM implements a metadb quorum rule: during the system boot, if half or more of the metadbs are not available, the system should boot into single-user mode so that you can fix things up. You can read more about this here.

On a two-disk system there is no way to distribute the metadbs so that more than 50% of them will still be available if either of the disks dies.
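For example, with the typical two metadbs on each of the two disks you have four replicas total; losing either disk leaves you with two out of four, which is exactly 50% and not a majority. Splitting the replicas unevenly doesn't help either, since losing the disk holding the larger share leaves you even worse off.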

When SVM does not have metadb quorum during the boot it is supposed to leave all of the metadevices read-only and boot into single-user. This gives you a chance to confirm that you are using the right SVM configuration, and ensures that you don't corrupt any of your data, before you clean up the dead metadbs.

What a lot of people do when they set up a root mirror is pull one of the disks to check that the system will still boot and run ok. If you do this experiment on a two-disk configuration running S10, the system will panic really early in the boot process and it will go into an infinite panic/reboot cycle.

What is happening here is that we found a bug related to UFS logging, which is on by default in S10. Since the root mirror stays read-only when there is no metadb quorum, we hit a bug in the UFS log-rolling code. This in turn leaves UFS in a bad state which causes the system to panic.

We're testing the fix for this bug right now, but in the meantime it is easy to work around by disabling logging on the root filesystem. You can do that by specifying the "nologging" option in the last field of the vfstab entry for root. You should reboot once before doing any SVM experiments (like pulling a disk) to ensure that UFS has rolled the log and is no longer using logging on root.
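For example, if your root metadevice is d0 (a placeholder name), the root line in /etc/vfstab would look like this with the workaround applied:

/dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no nologging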

Once a patch for this bug is out you will definitely want to remove this workaround from the vfstab entry since UFS logging offers so many performance and availability benefits.

By the way, UFS logging is also on by default in the S9 9/04 release, but that code does not suffer from this bug.

The second problem we found is not as serious as the UFS bug. This has to do with an interaction with the Service Management Facility (SMF), which is new in S10, and again this is related to not having metadb quorum during the boot. What should happen is that the system should enter single-user so you can clean up the dead metadbs. Instead it boots all the way to multi-user, but since the root device is still read-only things don't work very well. This turned out to be a missing dependency which we didn't catch when we integrated SVM and SMF. We'll have a patch for this too, but this problem is much less serious. You can still log in as root and clean up the dead metadbs so that you can then reboot with a good metadb quorum.

Both of these problems result from the lack of metadb quorum, which leaves the root metadevice read-only after a boot with a dead disk. If you have a third disk which you can use to hold another metadb, then you reduce the likelihood of hitting this problem since losing one disk won't cause you to lose quorum during boot.

Given these kinds of problems you might wonder why SVM bothers to implement the metadb quorum at all. Why not just trust the metadbs that are alive? SVM is conservative and always chooses the path that ensures you won't lose data or use stale data. There are various corner cases to worry about where SVM cannot be sure it is using the most current data. For example, in a two-disk mirror configuration, you might run for a while on the first disk with the second disk powered down. Later you might reboot off the second disk (because that disk was now powered up) while the first disk is powered down. At this point you would be using the stale data on the mirror, possibly without even realizing it. The metadb quorum rule gives you a chance to intervene and fix up the configuration when SVM cannot do it automatically.

Thursday Mar 31, 2005

Solaris Volume Manager x86 root mirroring

I am one of the engineers working on Solaris Volume Manager. There are some questions I get asked a lot and I thought it would be useful to get this information out to a broader audience, so I am starting this blog. We have also done a lot of work on the volume manager in the recent Solaris releases and I'd like to talk about some of that too.

One of the questions I get asked most frequently is how to do root mirroring on x86 systems. Unfortunately our docs don't explain this very well yet. This is a short description I wrote about root mirroring on x86 which hopefully explains how to set this up.

Root mirroring on x86 is more complex than root mirroring on SPARC. Specifically, there are issues with being able to boot from the secondary side of the mirror when the primary side fails. Compared to SPARC systems, on x86 machines you have to be sure to properly configure the system BIOS and set up your fdisk partitioning.

The x86 BIOS is analogous to the PROM interpreter on SPARC. The BIOS is responsible for finding the right device to boot from, then loading and executing the master boot record from that device.

All modern x86 BIOSes are configurable to some degree, but the discussion of how to configure them is beyond the scope of this post. In general you can usually select the order of devices that you want the BIOS to probe (e.g. floppy, IDE disk, SCSI disk, network) but you may be limited in configuring at a more granular level. For example, it may not be possible to configure the BIOS to try the first IDE disk and then fall back to the second IDE disk. These limitations may be a factor with some hardware configurations (e.g. a system with two IDE disks that are root mirrored). You will need to understand the capabilities of the BIOS on your hardware. If your primary boot disk fails and your BIOS is set up properly, then it will automatically boot from the second disk in the mirror. Otherwise, you may need to break into the BIOS while the machine is booting and reconfigure it to boot from the second disk, or you may even need to boot from a floppy with the Solaris Device Configuration Assistant (DCA) on it so that you can select the alternate disk to boot from.

On x86 machines fdisk partitions are used and it is common to have multiple operating systems installed. Also, there are different flavors of master boot programs (e.g. LILO or Grub), in addition to the standard Solaris master boot program. The boot(1M) man page is a good resource for a detailed discussion of the multiple components that are used during booting on Solaris x86.

Since SVM can only mirror Solaris slices within the Solaris fdisk partition this post focuses on a configuration that only has Solaris installed. If you have multiple fdisk partitions then you will need to use some other approach to protect the data outside of the Solaris fdisk partition. SVM can't mirror that data.

For an x86 system with Solaris installed there are two common fdisk partitioning schemes. One approach uses two fdisk partitions. There is a Solaris fdisk partition and another, small fdisk partition of about 10MB called the x86 boot partition. This partition has an Id value of 190. The Solaris system installation software will create a configuration with these two fdisk partitions as the default. The x86 boot partition is needed in some cases, such as when you want to use live-upgrade on a single disk configuration, but it is problematic when using root mirroring. The Solaris system installation software only allows one x86 boot partition for the entire system and it places important data on that fdisk partition. That partition is mounted in the vfstab with this entry:

/dev/dsk/c2t1d0p0:boot - /boot pcfs - no -

Because this fdisk partition is outside of the Solaris fdisk partition it cannot be mirrored by SVM. Furthermore, because there is only a single copy of this fdisk partition it represents a single point of failure.

Since the x86 boot partition is not required in most cases, it is recommended that you not use this default layout when setting up a system that will have root mirroring. Instead, just use a single Solaris fdisk partition and omit the x86 boot partition for your installation. It is easiest to do this at the time you install Solaris. If you already have Solaris installed and you created the x86 boot partition as part of that process, the easiest thing is to delete it with the fdisk(1M) command and reinstall, taking care not to create the x86 boot partition during the installation process.

Once your system is installed you create your metadbs and root mirror using the normal procedures.
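"Normal procedures" here means the usual metadb/metainit/metaroot/metattach sequence; roughly the following, where the disk names, slices and metadevice numbers are just placeholders (c2t0d0 as the primary and c2t1d0 as the secondary, to match the examples later in this post):

# metadb -a -f -c 2 c2t0d0s7 c2t1d0s7
# metainit -f d11 1 1 c2t0d0s0
# metainit d12 1 1 c2t1d0s0
# metainit d10 -m d11
# metaroot d10
[metaroot updates /etc/vfstab and /etc/system for the root metadevice]
# reboot
# metattach d10 d12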

When you use fdisk to partition your second disk you must make the disk bootable with a master boot program. The -b option of fdisk(1M) does this.

e.g. fdisk -b /usr/lib/fs/ufs/mboot /dev/rdsk/c2t1d0p0

On x86 machines the Solaris VTOC (the slices within the Solaris fdisk partition) is slightly different from what is seen on SPARC. On SPARC there are 8 VTOC slices (0-7) but on x86 there are more. In particular slice 8 is used as a "boot" slice. You will see that this slice is 1 cylinder in size and starts at the beginning of the disk (offset 0). The other slices will come after that, starting at cylinder 1.

Slice 8 is necessary for booting Solaris from this fdisk partition. It holds the partition boot record (pboot), the Solaris VTOC for the disk and the bootblk. This information is disk-specific so it is not mirrored with SVM. However, you must ensure that both disks are bootable so that you can boot from the secondary disk if the primary fails. You can use the installboot program to set up the second disk as a Solaris bootable disk (see installboot(1M)). An example command is:

installboot /usr/platform/i86pc/lib/fs/ufs/pboot \
    /usr/platform/i86pc/lib/fs/ufs/bootblk /dev/rdsk/c2t1d0s2

There is one further consideration for booting an x86 disk. You must ensure that the root slice has a slice tag of "root" and the root slice must be slice 0. See format(1M) for checking and setting the slice tag field.

Solaris x86 emulates some of the behavior of the SPARC eeprom. See eeprom(1M). The boot device is stored in the "bootpath" property that you can see with the eeprom command. The value is the device tree path of the primary boot disk. For example:

bootpath=/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@0,0:a

You should set up the alternate bootpath via the eeprom command so that the system will try to boot from the second side of the root mirror. First you must get the device tree path for the other disk in the mirror. You can use the ls command. For example:

# ls -l /dev/dsk/c2t1d0s0
lrwxrwxrwx 1 root root 78 Sep 28 23:41 /dev/dsk/c2t1d0s0 -> ../../devices/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a

The device tree path is the portion of the output following "../devices".

Use the eeprom command to set up the alternate boot path. For example:

eeprom altbootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

If your primary disk fails you will boot from the secondary disk. This may be automatic if your BIOS is configured properly; otherwise you may need to manually enter the BIOS and boot from the secondary disk, or you may even need to boot from a floppy with the DCA on it. Once the system starts to boot it will try to boot from the "bootpath" device. Assuming the primary boot disk is the dead disk in the root mirror, the system will then attempt to boot from the "altbootpath" device.

If the system fails to boot from the altbootpath device for some reason then you need to finish booting manually. The boot should drop into the Device Configuration Assistant (DCA). You must choose the secondary disk as the boot disk within the DCA. After the system has booted you should update the "bootpath" value with the device path that you used for the secondary disk (the "altbootpath") so that the machine will boot automatically.

For example, run the following to set the boot device to the second scsi disk (target 1):

eeprom bootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

Note that all of the usual considerations of mddb quorum apply to x86, just as they do for SPARC.