These Boots are Made for Walkin'

One of the most gratifying and exciting aspects of the OpenSolaris project is a return (for me, at least) to working on operating system design and research with the larger, open community. In another era while I was an undergraduate at Berkeley, I was fortunate enough to see the 2.x and 4.x BSD development effort up close and to see the larger community formed between the University and external organizations that had UNIX source licenses. It was not an open source community, of course, but it was a community none the less, and one that shared fixes, ideas and other software projects built on top of the operating system. Our hopes for OpenSolaris are that in addition to releasing operating system source code that can be used for many different purposes, Sun and the community will innovate together while maintaining the core values that Solaris provides today.

One of the many pieces of OpenSolaris which is of personal interest is the Zones virtualization technology introduced in Solaris 10. Zones provide a lightweight but very efficient and flexible way of consolidating and securing potentially complex workloads. There is a wealth of technical information about Zones in Solaris available at the OpenSolaris Zones Community and the BigAdmin System Administration Portal.

One of the things about Zones that people notice right away is how quickly they boot. Of course, booting a zone does not cause a system to run its power-on self-test (POST) or require the same amount of initialization that takes place when the hardware itself is booting. However, I thought it might be useful to do a tour of the dance that takes place when a non-global zone is booted. I call it a dance since there is a certain amount of interplay between the primary players - zoneadm, zoneadmd and the kernel itself - that warrants an explanation.

Although the virtualization that Zones provides is spread throughout the source code, the primary implementation in the kernel can be found in zone.c. As with many OpenSolaris frameworks, there is a big block comment at the start of the file which is very useful for understanding the lay of the land with respect to the code. Besides describing the data structures and locking strategy used for Zones, it describes the states a zone can be in from the kernel's perspective and the points at which a zone may transition from one state to another. For brevity, only the states covered during a zone boot are listed here:

 *   Zone States:
 *   The states in which a zone may be in and the transitions are as
 *   follows:
 *   ZONE_IS_UNINITIALIZED: primordial state for a zone. The partially
 *   initialized zone is added to the list of active zones on the system but
 *   isn't accessible.
 *   ZONE_IS_READY: zsched (the kernel dummy process for a zone) is
 *   ready.  The zone is made visible after the ZSD constructor callbacks are
 *   executed.  A zone remains in this state until it transitions into
 *   the ZONE_IS_BOOTING state as a result of a call to zone_boot().
 *   ZONE_IS_BOOTING: in this shortlived-state, zsched attempts to start
 *   init.  Should that fail, the zone proceeds to the ZONE_IS_SHUTTING_DOWN
 *   state.
 *   ZONE_IS_RUNNING: The zone is open for business: zsched has
 *   successfully started init.   A zone remains in this state until
 *   zone_shutdown() is called.

It is important to note that a number of zone states are not represented here: those are for zones which do not (yet) have a kernel context. An example of such a state is the one for a zone that is in the process of being installed. These states are defined in libzonecfg.h.
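As a rough illustration, the boot-time transitions among the kernel states quoted above can be written as a small check function. This is a minimal sketch and not code from zone.c: the enum, its ZS_ names and boot_transition_ok are hypothetical, and it assumes a zone moves directly from ZONE_IS_UNINITIALIZED to ZONE_IS_READY, which is how the excerpt reads but may skip states that were left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical model of the kernel-side states visited during a boot. */
typedef enum {
	ZS_UNINITIALIZED,
	ZS_READY,
	ZS_BOOTING,
	ZS_RUNNING,
	ZS_SHUTTING_DOWN
} boot_state_t;

/*
 * Return true if the quoted comment describes a direct move from
 * one state to the other while a zone is being booted.
 */
bool
boot_transition_ok(boot_state_t from, boot_state_t to)
{
	switch (from) {
	case ZS_UNINITIALIZED:
		/* assumption: zsched marks the zone ready directly */
		return (to == ZS_READY);
	case ZS_READY:
		/* a call to zone_boot() */
		return (to == ZS_BOOTING);
	case ZS_BOOTING:
		/* init was started, or init failed to start */
		return (to == ZS_RUNNING || to == ZS_SHUTTING_DOWN);
	default:
		/* transitions out of the other states are not modeled */
		return (false);
	}
}
```

For example, boot_transition_ok(ZS_READY, ZS_RUNNING) is false in this model, since a zone has to pass through the booting state first.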

One of the players in the zone boot dance is the zoneadmd process, which runs in the global zone and performs a number of critical tasks. Although much of the virtualization for a zone is implemented in the kernel, zoneadmd manages a great deal of a zone's infrastructure, as outlined in zoneadmd.c:

 * zoneadmd manages zones; one zoneadmd process is launched for each
 * non-global zone on the system.  This daemon juggles four jobs:
 * - Implement setup and teardown of the zone "virtual platform": mount and
 *   unmount filesystems; create and destroy network interfaces; communicate
 *   with devfsadmd to lay out devices for the zone; instantiate the zone
 *   console device; configure process runtime attributes such as resource
 *   controls, pool bindings, fine-grained privileges.
 * - Launch the zone's init(1M) process.
 * - Implement a door server; clients (like zoneadm) connect to the door
 *   server and request zone state changes.  The kernel is also a client of
 *   this door server.  A request to halt or reboot the zone which originates
 *   *inside* the zone results in a door upcall from the kernel into zoneadmd.
 *   One minor problem is that messages emitted by zoneadmd need to be passed
 *   back to the zoneadm process making the request.  These messages need to
 *   be rendered in the client's locale; so, this is passed in as part of the
 *   request.  The exception is the kernel upcall to zoneadmd, in which case
 *   messages are syslog'd.
 *   To make all of this work, the Makefile adds -a to xgettext to extract *all*
 *   strings, and an exclusion file (zoneadmd.xcl) is used to exclude those
 *   strings which do not need to be translated.
 * - Act as a console server for zlogin -C processes; see comments in zcons.c
 *   for more information about the zone console architecture.
 * Restart:
 *   A chief design constraint of zoneadmd is that it should be restartable in
 *   the case that the administrator kills it off, or it suffers a fatal error,
 *   without the running zone being impacted; this is akin to being able to
 *   reboot the service processor of a server without affecting the OS instance.

When a user wishes to boot a zone, zoneadm will attempt to contact zoneadmd via a door that is used by all three components for a number of things including coordinating zone state changes. If for some reason zoneadmd is not running, an attempt will be made to start it. Once that has completed, zoneadm tells zoneadmd to boot the zone by supplying the appropriate zone_cmd_arg_t request via a door call. It is worth noting that the same door is used by zoneadmd to return messages back to the user executing zoneadm and also as a way for zoneadm to indicate to zoneadmd the locale of the user executing the boot command so that messages are localized appropriately.

Looking at the door server that zoneadmd implements, there is some straightforward sanity checking that takes place on the argument passed via the door call as well as the use of some of the technology that came in with the introduction of discrete privileges in Solaris 10.

	if (door_ucred(&uc) != 0) {
		zerror(&logsys, B_TRUE, "door_ucred");
		goto out;
	}
	eset = ucred_getprivset(uc, PRIV_EFFECTIVE);
	if (ucred_getzoneid(uc) != GLOBAL_ZONEID ||
	    (eset != NULL ? !priv_ismember(eset, PRIV_SYS_CONFIG) :
	    ucred_geteuid(uc) != 0)) {
		zerror(&logsys, B_FALSE, "insufficient privileges");
		goto out;
	}

	kernelcall = ucred_getpid(uc) == 0;

	/*
	 * This is safe because we only use a zlog_t throughout the
	 * duration of a door call; i.e., by the time the pointer
	 * might become invalid, the door call would be over.
	 */
	zlog.locale = kernelcall ? DEFAULT_LOCALE : zargp->locale;

Using door_ucred, the user credential can be checked to determine whether the request originated in the global zone,1 whether the user making the request had sufficient privilege to do so2 and whether the request was a result of an upcall from the kernel. That last piece of information is used, among other things, to determine whether or not messages should be localized by localize_msg.
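To make the decision logic easier to follow outside of Solaris, here is a minimal, portable sketch of the same checks. It does not call the real ucred or door interfaces: caller_t, caller_may_request and caller_is_kernel are hypothetical names, and the struct simply holds the answers that the ucred_*() calls in the excerpt would have returned.

```c
#include <stdbool.h>

#define	MODEL_GLOBAL_ZONEID	0	/* stand-in for GLOBAL_ZONEID */

/* Hypothetical flattened view of what the ucred_*() calls report. */
typedef struct {
	int	zoneid;		/* result of ucred_getzoneid() */
	bool	has_privset;	/* ucred_getprivset() returned non-NULL */
	bool	has_sys_config;	/* PRIV_SYS_CONFIG is in the effective set */
	int	euid;		/* result of ucred_geteuid() */
	int	pid;		/* result of ucred_getpid() */
} caller_t;

/* Mirror of the "insufficient privileges" test in the excerpt. */
bool
caller_may_request(const caller_t *c)
{
	if (c->zoneid != MODEL_GLOBAL_ZONEID)
		return (false);		/* request came from a non-global zone */
	if (c->has_privset)
		return (c->has_sys_config);
	return (c->euid == 0);		/* no privilege set: fall back to euid */
}

/* A PID of 0 identifies an upcall from the kernel. */
bool
caller_is_kernel(const caller_t *c)
{
	return (c->pid == 0);
}
```

In this model a caller in the global zone that has an effective privilege set without PRIV_SYS_CONFIG is rejected even when its effective UID is 0, which matches the way the ternary in the excerpt is written.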

It is within the door server implemented by zoneadmd that transitions from one state to another take place. There are two states from which a zone boot is permissible, installed and ready. From the installed state, zone_ready is used to create and bring up the zone's virtual platform that consists of the zone's kernel context (created using zone_create) as well as the zone's specific file systems (including the root file system) and logical networking interfaces. If a zone is supposed to be bound to a non-default resource pool, then that also takes place as part of this state transition.
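The choice of work based on the starting state can be sketched as follows. This is only an illustration of the ordering described in this post, not the dispatch code in zoneadmd: admin_state_t, its AS_ values and boot_plan are hypothetical, and the step names are just the routine names mentioned in the text.

```c
/* Hypothetical stand-ins for the states zoneadmd sees. */
typedef enum { AS_INSTALLED, AS_READY, AS_OTHER } admin_state_t;

/*
 * Fill steps[] with the names of the routines that run for a boot
 * from the given state.  Returns the number of steps, or -1 if a
 * boot is not permitted from that state.
 */
int
boot_plan(admin_state_t from, const char *steps[2])
{
	switch (from) {
	case AS_INSTALLED:
		steps[0] = "zone_ready";	/* build the virtual platform */
		steps[1] = "zone_bootup";	/* then boot the zone */
		return (2);
	case AS_READY:
		steps[0] = "zone_bootup";	/* platform already exists */
		return (1);
	default:
		return (-1);
	}
}
```

A zone that is already in the ready state skips the virtual platform setup, which is one reason booting it is quick.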

When a zone's kernel context is created using zone_create, a zone_t structure is allocated and initialized. At this time, the status of the zone is set to ZONE_IS_UNINITIALIZED. Some of the initialization that takes place is in order to set up the security boundary which isolates processes running inside a zone. For example, the vnode_t of the zone's root file system, the zone's kernel credentials and the privilege sets of the zone's future processes are all initialized here.

Before returning to zoneadmd, zone_create adds the primordial zone to a doubly-linked list and two hash tables,3 one hashed by zone name and the other by zone ID. These data structures are protected by the zonehash_lock mutex, which is dropped after the zone has been added. Finally, a new kernel process, zsched, is created; this is where kernel threads for this zone are parented. After calling newproc to create this kernel process, zone_create waits using zone_status_wait until zsched has finished initializing the zone and has set its status to ZONE_IS_READY.
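The bookkeeping described here, one list plus two lookup tables guarded by a single lock, can be sketched in userland C. This is a simplified stand-in and not the kernel's list or hash table implementations: mzone_t, the registry_ functions, the bucket count and the use of a pthread mutex in place of zonehash_lock are all assumptions made for illustration.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define	NBUCKETS	8	/* arbitrary size for the sketch */

typedef struct mzone {
	int		id;
	const char	*name;
	struct mzone	*next, *prev;	/* doubly-linked list of all zones */
	struct mzone	*id_next;	/* chain in the by-ID table */
	struct mzone	*name_next;	/* chain in the by-name table */
} mzone_t;

static pthread_mutex_t	reg_lock = PTHREAD_MUTEX_INITIALIZER;
static mzone_t		*zone_list;
static mzone_t		*by_id[NBUCKETS];
static mzone_t		*by_name[NBUCKETS];

static unsigned
hash_name(const char *s)
{
	unsigned h = 0;

	while (*s != '\0')
		h = h * 31 + (unsigned char)*s++;
	return (h % NBUCKETS);
}

/* Add a zone to the list and to both tables under one lock. */
void
registry_add(mzone_t *z)
{
	unsigned ib = (unsigned)z->id % NBUCKETS;
	unsigned nb = hash_name(z->name);

	pthread_mutex_lock(&reg_lock);
	z->prev = NULL;
	z->next = zone_list;
	if (zone_list != NULL)
		zone_list->prev = z;
	zone_list = z;
	z->id_next = by_id[ib];
	by_id[ib] = z;
	z->name_next = by_name[nb];
	by_name[nb] = z;
	pthread_mutex_unlock(&reg_lock);
}

mzone_t *
registry_find_by_id(int id)
{
	mzone_t *z;

	pthread_mutex_lock(&reg_lock);
	for (z = by_id[(unsigned)id % NBUCKETS]; z != NULL; z = z->id_next)
		if (z->id == id)
			break;
	pthread_mutex_unlock(&reg_lock);
	return (z);
}

mzone_t *
registry_find_by_name(const char *name)
{
	mzone_t *z;

	pthread_mutex_lock(&reg_lock);
	for (z = by_name[hash_name(name)]; z != NULL; z = z->name_next)
		if (strcmp(z->name, name) == 0)
			break;
	pthread_mutex_unlock(&reg_lock);
	return (z);
}
```

Keeping all three structures under the same lock means a lookup by name or by ID can never observe a zone that is present in one table but missing from the other.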

Since initialization of the new process's user structure has not yet been completed, the first thing the new zsched process does is finish that initialization and reparent itself to PID 1 (the global zone's init process). And since the future processes to be run within the new zone may be subject to resource controls, that initialization also takes place here in the context of zsched.

After grabbing the zone_status_lock mutex in order to set the status to ZONE_IS_READY, zsched will then suspend itself, waiting for the zone's status to be changed to ZONE_IS_BOOTING.

Once the zone is in the ready state, zone_create returns control back to zoneadmd and the door server continues the boot process by calling zone_bootup. This initializes the zone's console device, mounts some of the standard OpenSolaris file systems like /proc and /etc/mnttab and then uses the zone_boot system call to attempt to boot the zone.

As the comment that introduces zone_boot points out, most of the heavy lifting has already been done either by zoneadmd or by the work the kernel has done through zone_create. At this point, zone_boot saves the requested boot arguments after grabbing the zonehash_lock mutex and then further grabs the zone_status_lock mutex in order to set the zone status to ZONE_IS_BOOTING. After dropping both locks, it is zone_boot that suspends itself, waiting for the zone status to be set to ZONE_IS_RUNNING.

Since the zone's status has now been set to ZONE_IS_BOOTING, zsched continues where it left off after it had suspended itself with its call to zone_status_wait_cpr. After checking that the current zone status is indeed ZONE_IS_BOOTING, a new kernel process is created in order to run init in the zone. This process calls zone_icode, which is analogous to the traditional icode function that is used to start init in the global zone and in traditional UNIX environments. After doing some zone-specific initialization, each of the icode functions ends up calling exec_init to actually exec the init process after copying out the path to the executable, /sbin/init, and the boot arguments. If the exec is successful, zone_icode will set the zone's status to ZONE_IS_RUNNING and, in the process, zone_boot will pick up where it had been suspended. At this point, the value of zone_boot_err indicates whether the zone boot was successful or not and is used to set the global errno value for zoneadmd.
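The back and forth between zone_create, zsched and zone_boot is essentially a handshake on a shared status variable. The sketch below models it in userland with POSIX threads; it is not kernel code. The hs_ names, status_set, status_wait and boot_handshake are hypothetical, a pthread condition variable stands in for the kernel's wait primitives, and the separate process that runs zone_icode is collapsed into the zsched thread for brevity.

```c
#include <pthread.h>
#include <stddef.h>

typedef enum {
	HS_UNINITIALIZED,
	HS_READY,
	HS_BOOTING,
	HS_RUNNING
} hs_status_t;

static pthread_mutex_t	hs_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	hs_cv = PTHREAD_COND_INITIALIZER;
static hs_status_t	hs_status = HS_UNINITIALIZED;

/* Change the status and wake every waiter. */
static void
status_set(hs_status_t s)
{
	pthread_mutex_lock(&hs_lock);
	hs_status = s;
	pthread_cond_broadcast(&hs_cv);
	pthread_mutex_unlock(&hs_lock);
}

/* Sleep until the status has reached at least s. */
static void
status_wait(hs_status_t s)
{
	pthread_mutex_lock(&hs_lock);
	while (hs_status < s)
		pthread_cond_wait(&hs_cv, &hs_lock);
	pthread_mutex_unlock(&hs_lock);
}

/* Plays the part of zsched. */
static void *
zsched_model(void *arg)
{
	(void) arg;
	status_set(HS_READY);		/* initialization finished */
	status_wait(HS_BOOTING);	/* sleep until a boot is requested */
	status_set(HS_RUNNING);		/* stands in for init being exec'd */
	return (NULL);
}

/* Plays the parts of zone_create and zone_boot; returns the final status. */
hs_status_t
boot_handshake(void)
{
	pthread_t t;
	hs_status_t final;

	status_set(HS_UNINITIALIZED);
	if (pthread_create(&t, NULL, zsched_model, NULL) != 0)
		return (HS_UNINITIALIZED);
	status_wait(HS_READY);		/* zone_create blocks here */
	status_set(HS_BOOTING);		/* zone_boot requests the boot */
	status_wait(HS_RUNNING);	/* zone_boot blocks here */
	(void) pthread_join(t, NULL);

	pthread_mutex_lock(&hs_lock);
	final = hs_status;
	pthread_mutex_unlock(&hs_lock);
	return (final);
}
```

Each side only ever waits for a status that the other side is about to set, so neither can miss a wakeup as long as the status is always changed and tested under the same lock.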

There are two additional things to note with the zone's transition to the running state. First of all, audit_put_record is called to generate an event for the Solaris auditing system so that it's known which user executed which command to boot a zone. In addition, there is an internal zoneadmd event generated to indicate on the zone's console device that the zone is booting. This internal stream of events is sent by the door server to the zone console subsystem for all state transitions, so that the console user can see which state the zone is transitioning to.

1 This is a bit of defensive programming since, unless the global zone administrator were to make the door in question available through the non-global zone's own file system, there would be no way for a privileged user in a non-global zone to actually access the door used by zoneadmd.

2 zoneadm itself checks that the user attempting to boot a zone has the necessary privilege but it's possible some other privileged process in the global zone might have access to the door but lack the necessary PRIV_SYS_CONFIG privilege.

3 The doubly-linked list implementation was integrated by Dave while Dan was responsible for the hash table implementation. Both of these are worth examining in the OpenSolaris source base.



David, do the Solaris Zones ideas/implementations have any patents filed or approved?

Posted by johnh on February 01, 2007 at 07:08 AM PST #

Nice description, David. I'm looking forward to the blog explaining halt/reboot :-).

Posted by Andy Tucker on February 01, 2007 at 07:08 AM PST #
