Friday Oct 23, 2009

solaris10 branded zones on OpenSolaris

For the past 9 or 10 months I've been pretty much heads down working with Jordan on the Solaris 10 branded zone project. Yesterday we integrated the first phase of this project into OpenSolaris. This brand allows you to run the Solaris 10 10/09 release, or later, inside of a zone running on OpenSolaris. We see this brand as one of the tools which will help people as they transition from running Solaris 10 to OpenSolaris.

We've divided this project into two phases. For this initial integration we have the following features:

  • basic brand emulation
    The brand emulation works for running the latest version of Solaris 10 (Solaris 10 10/09) on OpenSolaris. A zone running this brand is intended to be functionally equivalent to a native zone on Solaris 10 10/09 with the same configuration.
  • p2v
    A physical-to-virtual capability to install an archive of a system running Solaris 10 10/09 into the branded zone
  • v2v
    A virtual-to-virtual capability to install an archive of a native zone from a system running Solaris 10 10/09 into the branded zone
  • multiple architecture support
    This brand runs on all sun4u, sun4v and x86 architecture machines that OpenSolaris has defined as supported platforms
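
As a rough illustration of the p2v flow described above, the sketch below prints the commands one might use to configure and install a solaris10 branded zone from an archive. The zone name, zonepath and archive path are hypothetical placeholders, and the SYSsolaris10 template name and the install options are assumptions based on the solaris10 brand documentation, so check solaris10(5) on your build before relying on them. Nothing is executed; the script only echoes the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands for a p2v install instead of running them.
# Zone name, zonepath and archive path are hypothetical placeholders.
ZONE=s10zone
ZONEPATH=/zones/$ZONE
ARCHIVE=/net/server/export/s10-system.flar

p2v_commands() {
    # Configure the zone from the solaris10 brand template (assumed name).
    echo "zonecfg -z $ZONE 'create -t SYSsolaris10; set zonepath=$ZONEPATH'"
    # Install from the archive; -u is assumed to sys-unconfig the image.
    echo "zoneadm -z $ZONE install -a $ARCHIVE -u"
    echo "zoneadm -z $ZONE boot"
}

p2v_commands
```

The v2v case would follow the same shape, with the archive taken from a native zone rather than from a whole system.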

There are a few limitations with this initial version of the code which we'll work on in the second phase of the project. We'll be adding support for:

  • Exclusive IP stack zones
  • Delegated ZFS datasets
  • The ability to run these branded zones on a system running xVM
  • The ability to upgrade the version of Solaris 10 running inside the zone to a later release of Solaris 10

We've done extensive testing of the brand using our internal Solaris 10 test suites and a variety of 3rd party applications. Now that the code has integrated, we're looking forward to getting feedback from more people about their real-world experiences running their own Solaris 10 application stacks inside the zone. If you give this branded zone a try, let us hear about your experiences on the OpenSolaris zones-discuss alias.

Monday May 18, 2009

Free Community One Deep Dive

The Deploying OpenSolaris Deep Dive on Tuesday at Community One is free if you register using the promotional code OSDDT. The session doesn't start until 11:00 am so that people can still attend the JavaOne keynote.

Chris Armes will start with an overview of deploying OpenSolaris in the data center. After lunch Ben Rockwood will be delivering a two hour presentation on ZFS. This promises to be the highlight of the session. Nick, one of my co-authors on the OpenSolaris Bible, will then talk about high availability and I'll wrap up with a talk on how to use zones for consolidation.

Friday May 08, 2009

Running Solaris 10 on OpenSolaris

Jordan just posted a nice blog about the work we've been doing for Solaris 10 branded zones running on OpenSolaris. His post also has a link to a Flash demo we put together showing the process of migrating a standalone Solaris 10 system into a zone on OpenSolaris. Both of us will be at Community One West and we'll be running the branded zone in the virtualization pod. If you're there and interested, stop by to check it out. I'll also be talking about this project as part of my presentations.

Thursday May 07, 2009

I'll be presenting at Community One West

I'll be delivering two presentations at Community One West at the beginning of June. The first presentation is on Monday June 1st and I'll be covering "Built-in Virtualization for the OpenSolaris Operating System". It will be an overview of some basic virtualization concepts and the various solutions available in OpenSolaris. I'll also be discussing the trade-offs of one vs. the other. The second presentation is on Tuesday as part of the deep dives. I'll be discussing application consolidation using zones. I'll also be hanging around the virtualization demo pod when I'm not presenting.

In addition, I think there is going to be a book signing for the OpenSolaris Bible. My co-authors Nick and Dave are also going to be attending. This will be the first (and only?) time the three of us have actually been together at the same time.

At least some of the other zones engineers (Dan, Steve and Jordan) should be there too, so if you're attending, stop by the virtualization pod and say hi.

Saturday May 02, 2009

OpenSolaris books on google book search

I happened to be looking at Google Book Search today and I thought I'd see if the book I co-authored, the OpenSolaris Bible, was there. It is and you can see it here. Although the table of contents and some sample chapters are available elsewhere, this provides a nice way to browse more material in the book. I think Google will let you see up to 20% of the book.

I also noticed the other new OpenSolaris book, Pro OpenSolaris, is there, as is the venerable Solaris Internals.

Wednesday Dec 10, 2008

zones on OpenSolaris 2008.11

The OpenSolaris 2008.11 release just came out and we've made some significant changes in the way that zones are installed on this release. The motivation for these changes is that we eventually want software management operations using IPS to work in a non-global zone much the same way as they work in the global zone. Global zone software management uses the SNAP Upgrade project along with IPS and the idea is to create a new Boot Environment (BE) when you update the software in the global zone. A BE is based on a ZFS snapshot and clone, so that you can easily roll back if there are any problems with the newly installed software. Because the software in the non-global zones should be in sync with the global zone, when a new BE is created each of the non-global zones must also have a new ZFS snapshot and clone that matches up to the new BE.

We'd also eventually like to have the same software management capabilities within a non-global zone. That is, we'd like the non-global zone system administrator to be able to use IPS to install software in the zone, and as part of this process, a new BE inside the zone would be created based on a ZFS snapshot and clone. This way the non-global zone can take advantage of the same safety features for rolling back that are available in the global zone.

In order to provide these capabilities, we needed to make some important changes in how zones are laid out in the file system. To support all of this we need the actual zone root file system to be its own delegated ZFS dataset. In this way the non-global zone sysadmin can make their own ZFS snapshots and clones of the zone root and the IPS software can automatically create a new BE within the zone when a software management operation takes place in the zone.

The gory details of this are discussed in the spec.

Not all of the capabilities described above work yet, but we have laid a foundation to enable them in the future. In particular, when you create a new global zone BE, all of the non-global zones are cloned as well. However, running image-update in the global zone still doesn't update each individual zone. You still need to do that manually, as Dan described in his blog about zones on the 2008.05 release. In a future post I'll talk about some other ways to update each zone. Another feature that isn't done yet is the full SNAP Upgrade support from within the zone itself. That is, zone roots are now delegated ZFS datasets, but when you run IPS inside the zone itself, a new clone is not automatically created. Adding this feature should be fairly straightforward though, now that the basic support is in the release.

With all of these changes to how zone roots use ZFS in 2008.11, here is a summary of the important differences and limitations with using zones on 2008.11.

1) Existing zones can't be used. If you have zones installed on an earlier release of OpenSolaris and image-update to 2008.11 or later, those zones won't be usable.

2) Your global zone BE needs a UUID. If you are running 2008.11 or later then your global zone BE will have a UUID.

3) Zones are only supported in ZFS. This means that the zonepath must be a dataset. For example, if the zonepath for your zone is /export/zones/foo, then /export/zones must be a dataset. The zones code will then create the foo dataset and all the underlying datasets when you install the zone.

4) As I mentioned above, image-updating the global BE doesn't update the zones yet. After you image-update the global zone, don't forget to update the new BE for each zone so that it is in sync with the global zone.
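
To make point 3 concrete, here is a minimal sketch of setting up a zone whose zonepath parent is a ZFS dataset. The pool and dataset name (rpool/export/zones) is an assumption about your layout, and the zone name foo is taken from the example above. The script only prints the commands so that it is safe to run anywhere.

```shell
#!/bin/sh
# Sketch only: prints commands rather than running them.
# The dataset name is an assumption; adjust it for your pool layout.
PARENT_DS=rpool/export/zones     # dataset assumed to be mounted at /export/zones
ZONE=foo
ZONEPATH=/export/zones/$ZONE

# Derive the directory that has to be a dataset from a zonepath.
zonepath_parent() {
    dirname "$1"
}

zone_setup_commands() {
    # The parent of the zonepath must itself be a ZFS dataset.
    echo "zfs create $PARENT_DS"
    echo "zonecfg -z $ZONE 'create; set zonepath=$ZONEPATH'"
    # Installing the zone creates the zone's dataset and its children.
    echo "zoneadm -z $ZONE install"
}

echo "parent that must be a dataset: $(zonepath_parent "$ZONEPATH")"
zone_setup_commands
```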

Thursday Sep 06, 2007

A busy week for zones

This is turning out to be a busy week for zones related news. First, the newest version of Solaris 10, the 8/07 release, is now available. This release includes the improved resource management integration with zones that has been available for a while now in the OpenSolaris nevada code base and which I blogged about here. It also includes other zones enhancements such as brandz and IP instances. Jeff Victor has a nice description of all of these new zone features on his blog.

If that wasn't enough, we have started to talk about our latest project, code named Etude. This is a new brand for zones, building on the brandz framework, and allows you to run a Solaris 8 environment within a zone. We have been working on this project for a good part of the year and it is exciting to finally be able to talk more about it. With Etude you can quickly consolidate those dusty old Solaris 8 SPARC systems, running on obsolete hardware, onto current generation, energy efficient, systems. Marc Hamilton, VP of Solaris Marketing, describes this project at a high level on his blog but for more details, Dan Price, our project lead, wrote up a really nice overview on his blog. If you have old systems still running Solaris 8 and would like an easy path to Solaris 10 and to newer hardware, then this project might be what you need.

Thursday Feb 01, 2007

Containers in SX build 56

The many Resource Management (RM) features in Solaris have been developed and evolved over the course of years and several releases. We have resource controls, resource pools, resource capping and the Fair Share Scheduler (FSS). We have rctls, projects, tasks, cpu-shares, processor sets and rcapd(1M). Each of these features has its own commands and syntax for configuration. In some cases, particularly with resource pools, the syntax is quite complex and long sequences of commands are needed to configure a pool. When you first look at RM it is not immediately clear when to use one feature vs. another or if some combination of these features is needed to achieve the RM objectives.

In Solaris 10 we introduced Zones, a lightweight system virtualization capability. Marketing coined the term 'containers' to refer to a combination of Zones and RM within Solaris. However, the integration between the two was fairly weak. Within Zones we had the 'rctl' configuration option, which you could use to set a couple of zone specific resource controls, and we had the 'pool' property which could be used to bind the zone to an existing resource pool, but that was it. Just setting the 'zone.cpu-shares' rctl wouldn't actually give you the right cpu shares unless you also configured the system to use FSS. But, that was a separate step and easily overlooked. Without the correct configuration of these various, disparate components even a simple test, such as a fork bomb within a zone, could disrupt the entire system.

As users started experimenting with Zones we found that many of them were not leveraging the RM capabilities provided by the system. We would get dinged in evaluations because Zones, without a correct RM configuration, didn't provide all of the containment users needed. We always expected Zones and RM to be used together, but due to the complexity of the RM features and the loose integration between the two, we were seeing that few Zones users actually had a proper RM configuration. In addition, our RM for memory control was limited to rcapd running within a zone and capping RSS on projects. This wasn't really adequate.

About 9 months ago the Zones engineering team started a project to try to improve this situation. We didn't want to just paper over the complexity with things like a GUI or wizards, so it took us quite a bit of design before we felt like we hit upon some key abstractions that we could use to truly simplify the interaction between the two components. Eventually we settled upon the idea of organizing the RM features into 'dedicated' and 'capped' configurations for the zone. We enhanced resource pools to add the idea of a 'temporary pool' which we could dynamically instantiate when a zone boots. We enhanced rcapd(1M) so that we could do physical memory capping from the global zone. Steve Lawrence did a lot of work to improve resident set size (RSS) accounting as well as adding new rctls for maximum swap and locked memory. These new features significantly improve RM of memory for Zones. We then enhanced the Zones infrastructure to automatically do the work to set up the various RM features that were configured for the zone. Although the project made many smaller improvements, the key ideas are the two new configuration options in zonecfg(1M). When configuring a zone you can now configure 'dedicated-cpu' and 'capped-memory'. Going forward, as additional RM features are added, we anticipate this idea will evolve gracefully to add 'dedicated-memory' and 'capped-cpu' configuration. We also think this concept can be easily extended to support RM features for other key parts of the system such as the network or storage subsystem.

Here is our simple diagram of how we eventually unified the RM view within Zones.
       | dedicated  | capped
cpu    | temporary  | cpu-cap
       | processor  | rctl*
       | set        |
memory | temporary  | rcapd, swap
       | memory     | and locked
       | set*       | rctl

* memory sets and cpu caps are under development but are not yet part of Solaris.

With these enhancements, it is now almost trivial to configure RM for a zone. For example, to configure a resource pool with a set of up to four CPUs, all you do in zonecfg is:
zonecfg:my-zone> add dedicated-cpu
zonecfg:my-zone:dedicated-cpu> set ncpus=1-4
zonecfg:my-zone:dedicated-cpu> set importance=10
zonecfg:my-zone:dedicated-cpu> end
To configure memory caps, you would do:
zonecfg:my-zone> add capped-memory
zonecfg:my-zone:capped-memory> set physical=50m
zonecfg:my-zone:capped-memory> set swap=128m
zonecfg:my-zone:capped-memory> set locked=10m
zonecfg:my-zone:capped-memory> end
All of the complexity of configuring the associated RM capabilities is then handled behind the scenes when the zone boots. Likewise, when you migrate a zone to a new host, these RM settings migrate too.
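
The same settings can also be applied non-interactively. The sketch below writes the two resource blocks from the examples above to a command file and prints the zonecfg invocation that would read it with the -f option. The temporary file is arbitrary, and the zonecfg command is printed rather than run.

```shell
#!/bin/sh
# Sketch: put the settings from the interactive examples into a command
# file for zonecfg -f. The temp file location is arbitrary.
CMDFILE=$(mktemp)

cat > "$CMDFILE" <<'EOF'
add dedicated-cpu
set ncpus=1-4
set importance=10
end
add capped-memory
set physical=50m
set swap=128m
set locked=10m
end
EOF

# Printed rather than executed, so this is safe to run on any system.
echo "zonecfg -z my-zone -f $CMDFILE"
```

Keeping the settings in a file makes it easy to apply the same RM configuration to several zones.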

Over the course of the project we discussed these ideas within the OpenSolaris Zones community, where we received a lot of good input that we used in the final design and implementation. The full details of the project are available here and here.

This work is available in Solaris Express build 56 which was just posted. Hopefully folks using Zones will get a chance to try out the new features and let us know what they think. All of the core engineering team actively participates in the zones discuss list and we're happy to try to answer any questions or just hear your thoughts.

Monday Feb 20, 2006

SVM root mirroring and GRUB

Although I haven't been working on SVM for over 6 months (I am working on Zones now), I still get questions about SVM and x86 root mirroring from time to time. Some of these procedures are different when using the new x86 boot loader (GRUB) that is now part of Nevada and S10u1. I have some old notes that I wrote up about 9 months ago that describe the updated procedures and I think these are still valid.

Root mirroring on x86 is more complex than is root mirroring on SPARC. Specifically, there are issues with being able to boot from the secondary side of the mirror when the primary side fails. On x86 machines the system BIOS and fdisk partitioning are the complicating factors.

The x86 BIOS is analogous to the PROM interpreter on SPARC. The BIOS is responsible for finding the right device to boot from, then loading and executing GRUB from that device.

All modern x86 BIOSes are configurable to some degree but the discussion of how to configure them is beyond the scope of this document. In general you can usually select the order of devices that you want the BIOS to probe (e.g. floppy, IDE disk, SCSI disk, network) but you may be limited in configuring at a more granular level. For example, it may not be possible to configure the BIOS to probe the first and second IDE disks. These limitations may be a factor with some hardware configurations (e.g. a system with two IDE disks that are root mirrored). You will need to understand the capabilities of the BIOS that is on your hardware. If your primary boot disk fails you may need to break into the BIOS while the machine is booting and reconfigure to boot from the second disk.

On x86 machines fdisk partitions are used and it is common to have multiple operating systems installed. Also, there are different flavors of master boot programs (e.g. LILO) in addition to GRUB which is the standard Solaris master boot program. The boot(1M) man page is a good resource for a detailed discussion of the multiple components that are used during booting on Solaris x86.

Since SVM can only mirror Solaris slices within the Solaris fdisk partition this discussion will focus on a configuration that only has Solaris installed. If you have multiple fdisk partitions then you will need to use some other approach to protect the data outside of the Solaris fdisk partition.

Once your system is installed you create your metadbs and root mirror using the normal procedures.
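
For readers who have not done this before, the following sketch prints one typical command sequence for the "normal procedures". The disk names (c0t0d0, c0t1d0), the use of slice 7 for the metadbs, and the metadevice names d10, d11 and d12 are all assumptions for illustration. The first two lines give the second disk a Solaris fdisk partition and copy the slice layout to it, and the metattach step must wait until after the reboot. Nothing is executed.

```shell
#!/bin/sh
# Sketch only: prints a typical SVM root-mirror sequence; nothing is run.
# Disk, slice and metadevice names are assumptions for illustration.
PRIMARY=c0t0d0
SECONDARY=c0t1d0

root_mirror_commands() {
    cat <<EOF
fdisk -B /dev/rdsk/${SECONDARY}p0
prtvtoc /dev/rdsk/${PRIMARY}s2 | fmthard -s - /dev/rdsk/${SECONDARY}s2
metadb -a -f -c 2 ${PRIMARY}s7 ${SECONDARY}s7
metainit -f d11 1 1 ${PRIMARY}s0
metainit d12 1 1 ${SECONDARY}s0
metainit d10 -m d11
metaroot d10
init 6
metattach d10 d12
EOF
}

root_mirror_commands
```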

You must ensure that both disks are bootable so that you can boot from the secondary disk if the primary fails. You use the installgrub program to set up the second disk as a Solaris bootable disk (see installgrub(1M)). An example command is:

/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Solaris x86 emulates some of the behavior of the SPARC eeprom. See eeprom(1M). The boot device is stored in the "bootpath" property that you can see with the eeprom command. The value should be assigned to the device tree path of the root mirror. For example:


Next you need to modify the GRUB boot menu so that you can manually boot from the second side of the mirror, should this ever be necessary. Here is a quick overview of the GRUB disk naming convention.

(hd0),(hd1) -- first & second BIOS disk (entire disk)
(hd0,0),(hd0,1) -- first & second fdisk partition of first BIOS disk
(hd0,0,a),(hd0,0,b) -- Solaris/BSD slice 0 and 1 on first fdisk partition on the first BIOS disk

Hard disk names start with hd and a number, where 0 maps to BIOS disk 0x80 (first disk enumerated by the BIOS), 1 maps to 0x81, and so on. One annoying aspect of BIOS disk numbering is that the order may change depending on the BIOS configuration. Hence, the GRUB menu may become invalid if you change the BIOS boot disk order or modify the disk configuration. Knowing the disk naming convention is essential to handling boot issues related to disk renumbering in the BIOS. This will be a factor if the primary disk in the mirror is not seen by the BIOS so that it renumbers and boots from the secondary disk in the mirror. Normally this renumbering will mean that the system can still automatically boot from the second disk, since you configured it to boot in the previous steps, but it becomes a factor when the first disk becomes available again, as described below.
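
The naming convention can be captured in a couple of small helper functions. This is only an illustrative sketch; the function names grub_name and bios_number are made up for this example.

```shell
#!/bin/sh
# Build a GRUB device name from a BIOS disk index, an fdisk partition
# index and a Solaris slice number (0 -> a, 1 -> b, and so on).
grub_name() {
    disk=$1 part=$2 slice=$3
    letter=$(echo "abcdefghijklmnop" | cut -c "$((slice + 1))")
    echo "(hd${disk},${part},${letter})"
}

# Map a GRUB disk index to the BIOS disk number (0 -> 0x80, 1 -> 0x81).
bios_number() {
    printf '0x%x\n' $((0x80 + $1))
}

grub_name 0 0 0   # slice 0, first fdisk partition, first BIOS disk
grub_name 1 0 0   # the same slice on the second BIOS disk
bios_number 1     # BIOS number of the second disk
```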

You should edit the GRUB boot menu in /boot/grub/menu.lst and add an entry for the second disk in the mirror. It is important that you be able to manually boot from the second side of the mirror due to the BIOS renumbering described above. If the primary disk is unavailable, the boot archive on that disk may become stale. Later, if you boot and that disk is available again, the BIOS renumbering would cause GRUB to load that stale boot archive which could cause problems or may even leave the system unbootable.
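
A minimal sketch of such an entry is shown below. The title text is arbitrary, and the kernel and module paths are assumed to match the default Solaris entry of this era, so copy them from the existing entry in your own menu.lst rather than trusting these. The script only prints the entry so that you can review it before adding it to the file by hand.

```shell
#!/bin/sh
# Print a menu.lst entry that boots from the second BIOS disk.
# Title is arbitrary; kernel/module paths are assumptions and should be
# copied from the existing Solaris entry in your menu.lst.
alt_entry() {
    cat <<'EOF'
title Solaris (second side of mirror)
root (hd1,0,a)
kernel /platform/i86pc/multiboot
module /platform/i86pc/boot_archive
EOF
}

alt_entry   # review the output, then add it to /boot/grub/menu.lst
```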

If the primary disk is once again made available and then you reboot without first resyncing the mirror back onto the primary drive, then you should use the GRUB menu entry for the second disk to manually boot from the correct boot archive (the one on the secondary side of the mirror). Once the system is booted, perform normal metadevice maintenance to resync the primary disk. This will restore the current boot archive to the primary so that subsequent boots from that disk will work correctly.

The previous procedure is not normally necessary since you would replace the failed primary disk using cfgadm(1M) and resync but it will be required if the primary is simply not powered on, causing the BIOS to miss the disk and renumber. Subsequently powering up this disk and rebooting would cause the BIOS to renumber again and by default you would boot from the stale disk.

Note that all of the usual considerations of mddb quorum apply to x86 root mirroring, just as they do for SPARC.

Thursday Dec 08, 2005

Moving and cloning zones

It's been quite a long time since my last blog entry. I have moved over from the SVM team onto the Zones team and I have been busy coming up to speed on Zones.

So far I have just fixed a few Zones bugs but now I am starting to work on some new features. One of the big things people want from Zones is the ability to move them and clone them.

I have a proposal for moving and cloning zones over on the OpenSolaris zones discussion. This has been approved by our internal architectural review committee and the code is basically done so it should be available soon. Moving a zone is currently limited to a single system but the next step is migrating the zone from one machine to another. That's the next thing we're going to work on.

For cloning we currently copy the bits from one zone instance to another and we're seeing significant performance wins compared to installing the zone from scratch (13x faster on one test machine). With ZFS now available it seems obvious that we could use ZFS clones to quickly clone Zone instances. This is something that we are actively looking at but for now we don't recommend that you place your zonepath on ZFS. This is bug 6356600 and is due to the current limitation that you won't be able to upgrade your system if your zonepath is on ZFS. Once the upgrade issues have been resolved, we'll be extending Zone cloning to be better integrated with ZFS clones. In the meantime, while you can use ZFS to hold your zones, you need to be aware that the system won't be upgradeable.

Thursday Jun 23, 2005

SVM resync cancel/resume

The latest release of Solaris Express came out the other day. Dan has his usual excellent summary. He mentions one cool new SVM feature but it might be easy to overlook it since there are so many other new things in this release. The new feature is the ability to cancel a mirror resync that is underway. The resync is checkpointed and you can restart it later. It will simply pick up where it left off. This is handy if the resync is affecting performance and you'd like to wait until later to let it run. Another use for this is if you need to reboot. With the old code, if a full resync was underway and you rebooted, the resync would start over from the beginning. Now, if you cancel it before rebooting, the checkpoint will allow the resync to pick up where it left off.

This code is already in OpenSolaris. You can see the CLI changes in metasync.c and the library changes in meta_mirror_resync_kill. The changes are fairly small because most of the underlying support was already implemented for multi-node disksets. All we had to do was add the CLI option and hook in to the existing ioctl. You can see some of this resync functionality in the resync_unit function. There is a nice big comment there which explains some of this code.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Tuesday Jun 14, 2005

SVM and the B_FAILFAST flag

Now that OpenSolaris is here it is a lot easier to talk about some of the interesting implementation details in the code. In this post I wanted to discuss the first project I did after I started to work on the Solaris Volume Manager (SVM). This is on my mind right now because it also happens to be related to one of my most recent changes to the code. This change is not even in Solaris Express yet; it is only available in OpenSolaris. Early access to this kind of change is just one small reason why OpenSolaris is so cool.

My first SVM project was to add support for the B_FAILFAST flag. This flag is defined in /usr/include/sys/buf.h and it was implemented in some of the disk drivers so that I/O requests that were queued in the driver could be cleared out quickly when the driver saw that a disk was malfunctioning. For SVM the big requester for this feature was our clustering software. The problem they were seeing was that in a production environment there would be many concurrent I/O requests queued up down in the sd driver. When the disk was failing the sd driver would need to process each of these requests, wait for the timeouts and retries and slowly drain its queue. The cluster software could not failover to another node until all of these pending requests had been cleared out of the system. The B_FAILFAST flag is the exact solution to this problem. It tells the driver to do two things. First, it reduces the number of retries that the driver does to a failing disk before it gives up and returns an error. Second, when the first I/O buf that is queued up in the driver gets an error, the driver will immediately error out all of the other, pending bufs in its queue. Furthermore, any new bufs sent down with the B_FAILFAST flag will immediately return with an error.

This seemed fairly straightforward to implement in SVM. The code had to be modified to detect if the underlying devices supported the B_FAILFAST flag and if so, the flag should be set in the buf that was being passed down from the md driver to the underlying drivers that made up the metadevice. For simplicity we decided we would only add this support to the mirror metadevice in SVM. However, the more I looked at this, the more complicated it seemed to be. We were worried about creating new failure modes with B_FAILFAST and the big concern was the possibility of a "spurious" error. That is, getting back an error on the buf that we would not have seen if we had let the underlying driver perform its full set of timeouts and retries. This concern eventually drove the whole design of the initial B_FAILFAST implementation within the mirror code. To handle this spurious error case I implemented an algorithm within the driver so that when we got back an errored B_FAILFAST buf we would resubmit that buf without the B_FAILFAST flag set. During this retry, all of the other failed I/O bufs would also immediately come back up into the md driver. I queued those up so that I could either fail all of them after the retried buf finally failed or I could resubmit them back down to the underlying driver if the retried I/O succeeded. Implementing this correctly took a lot longer than I originally expected when I took this first project and it was one of those things that worked but I was never very happy with. The code was complex and I never felt completely confident that there wasn't some obscure error condition lurking here that would come back to bite us later. In addition, because of the retry, the failover of a component within a mirror actually took *longer* now if there was only a single I/O being processed.

This original algorithm was delivered in the S10 code and was also released as a patch for S9 and SDS 4.2.1. It has been in use for a couple of years which gave me some direct experience with how well the B_FAILFAST option worked in real life. We actually have seen one or two of these so-called spurious errors but in all cases there were real, underlying problems with the disks. The storage was marginal and SVM would have been better off just erroring out those components within the mirror and immediately failing over to the good side of the mirror. By this time I was comfortable with this idea so I rewrote the B_FAILFAST code within the mirror driver. This new algorithm is what you'll see today in the OpenSolaris code base. I basically decided to just trust the error we get back when B_FAILFAST is set. The code will follow the normal error path so that it puts the submirror component into the maintenance state and just uses the other, good side of the mirror from that point onward. I was able to remove the queue and simplify the logic almost back to what it was before we added support for B_FAILFAST.

However, there is still one special case we have to worry about when using B_FAILFAST. As I mentioned above, when B_FAILFAST is set, all of the pending I/O bufs that are queued down in the underlying driver will fail once the first buf gets an error. When we are down to the last side of a mirror the SVM code will continue to try to do I/O to those last submirror components, even though they are taking errors. This is called the LAST_ERRED state within SVM and is an attempt to try to provide access to as much of your data as possible. When using B_FAILFAST it is probable that not all of the failed I/O bufs will have been seen by the disk and given a chance to succeed. With the new algorithm the code detects this state and reissues all of the I/O bufs without B_FAILFAST set. There is no longer any queueing; we just resubmit the I/O bufs without the flag and all future I/O to the submirror is done without the flag. Once the LAST_ERRED state is cleared the code will return to using the B_FAILFAST flag.

All of this is really an implementation detail of mirroring in SVM. There is no user-visible component of this except for a change in the behavior of how quickly the mirror will fail the errored drives in the submirror. All of the code is contained within the mirror portion of the SVM driver and you can see it in mirror.c. The function mirror_check_failfast is used to determine if all of the components in a submirror support using the B_FAILFAST flag. The mirror_done function is called when the I/O to the underlying submirror is complete. In this function we check if the I/O failed and if B_FAILFAST was set. If so we call the submirror_is_lasterred function to check for that condition and the last_err_retry function is called only when we need to resubmit the I/O. This function is actually executed in a helper thread since the I/O completes in a thread separately from the thread that initiated the I/O down into the md driver.
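
The decision made on I/O completion can be modeled in a few lines. This is a toy model of the behavior described above, written in shell only to show the decision table. It is not the driver code, and the function name and result strings are invented for this sketch.

```shell
#!/bin/sh
# Toy model of the completion-path decision described above; not driver code.
# Arguments are 0/1 flags: io_failed, failfast_set, last_erred.
failfast_action() {
    io_failed=$1 failfast_set=$2 last_erred=$3
    if [ "$io_failed" -eq 0 ]; then
        echo "complete"                   # I/O succeeded, nothing more to do
    elif [ "$failfast_set" -eq 1 ] && [ "$last_erred" -eq 1 ]; then
        echo "retry-without-failfast"     # last side of the mirror: resubmit
    else
        echo "normal-error-path"          # trust the error, use the good side
    fi
}

failfast_action 1 1 0   # failed B_FAILFAST I/O, another good side exists
failfast_action 1 1 1   # failed B_FAILFAST I/O in the LAST_ERRED state
```

The point of the simplification is visible here: there is no queue of pending bufs in the model, only a check for the one special state.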

To wrap up, the SVM md driver code lives in the source tree at usr/src/uts/common/io/lvm. The main md driver is in the md subdirectory and each specific kind of metadevice also has its own subdirectory (mirror, stripe, etc.). The SVM command line utilities live in usr/src/cmd/lvm and the shared library code that SVM uses lives in usr/src/lib/lvm. Libmeta is the primary library. In another post I'll talk in more detail about some of these other components of SVM.


Friday May 20, 2005


In a previous blog I talked about integration of Solaris Volume Manager (SVM) and RCM. Proper integration of SVM with the other subsystems in Solaris is one of the things I am particularly interested in.

Today I'd like to talk about some of the work I did to integrate SVM with the new Service Management Facility (SMF) that was introduced in S10. Previously SVM had a couple of RC scripts that would run when the system booted, even if you had not configured SVM and were not using any metadevices. There were also several SVM specific RPC daemons that were enabled. One of the ideas behind SMF is that only the services that are actually needed should be enabled. This speeds up boot and makes for a cleaner system. Another thing is that not all of the RPC daemons need to be enabled when using SVM. Different daemons will be used based upon the way SVM is configured. SMF allows us to clean this up and manage these services within the code so that the proper services are enabled and disabled as you reconfigure SVM.

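The following is a list of the services used by SVM:

  • system/mdmonitor
  • system/metainit
  • network/rpc/meta
  • network/rpc/metamed
  • network/rpc/metamh
  • network/rpc/mdcomm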


The system/mdmonitor, system/metainit and network/rpc/meta services are the core services. These will be enabled when you create the first metadb. Once you create your first diskset the network/rpc/metamed and network/rpc/metamh services will be enabled. When you create your first multi-node diskset the network/rpc/mdcomm service will also be enabled. As you delete these portions of your configuration the corresponding services will be disabled.

Integrating SVM with SMF in this way is easy since SMF offers a full API that allows programs to monitor and reconfigure the services they use. The primary functions used are smf_get_state, smf_enable_instance and smf_disable_instance, all of which are documented on the smf_enable_instance(3SCF) man page. This could all have been done previously using various hacks to rename scripts and edit configuration files, but it is trivially simple with SMF. Furthermore, the code can always tell when there is something wrong with the services it depends on. Recently I integrated some new code that will notify you whenever you check the status of SVM with one of the CLI commands (metastat, metaset or metadb) and there is a problem with the SVM services. We have barely scratched the surface here but SMF lays a good foundation for enabling us to deliver a true self-healing system.

Tuesday May 17, 2005

Another SVMer starts blogging.

Sanjay Nadkarni, another member of the Solaris Volume Manager engineering team has just started a blog. Sanjay was the technical lead for the project that added new clustering capabilities to SVM so that it can now support concurrent readers and writers. If you read my blog because you are interested in SVM you will want to take a look at his too. Welcome Sanjay.

Friday May 06, 2005

SVM metadbs, USB disks and S2.7

In an earlier blog I talked about using a USB memory disk to store a Solaris Volume Manager (SVM) metadb on a two-disk configuration. This would reduce the likelihood of hitting the mddb quorum problem I have talked about. The biggest problem with this approach was that there was no way to control where SVM would place its optimized resync regions. I just putback a fix for this limitation this week. It should show up in an upcoming Solaris Express release. With this fix the code will no longer place the optimized resync regions on a USB disk, or on any other removable disk for that matter. The only I/O to these devices should be the SVM configuration change writes or the initialization reads of the metadbs, which are a lot less frequent than the optimized resync writes.

I had another interesting experience this week. I was working on a bug fix for the latest release of Solaris and I had to run a test of an x86 upgrade from S2.7 with a Solstice DiskSuite (SDS) 4.2 root mirror to the latest release of Solaris. This was interesting to me for a number of reasons. First, this code is over 6 years old, but because of the long support lifetimes for Solaris releases we still have to be sure things like this will work. Second, it was truly painful to see this ancient release of Solaris and SDS running on x86. It was quite a reminder of how far Solaris 10 has come on the x86 platform. It will be exciting to see where the Solaris community takes OpenSolaris on the x86 platform, as well as other platforms, over the next few years.


