Thursday Feb 01, 2007

Well, the Good Wine's A-Flowin'

Jerry has posted an excellent summary of the work he and Steve Lawrence architected as part of Project Duckhorn. The project encompassed a number of new features which are detailed here (see the quoted proposal in Jeff Victor's message), here and here. As Jerry discusses, zones and OpenSolaris resource management could previously be combined, but it required quite a bit of knowledge to integrate the two technologies. The team was able to define some sensible yet powerful abstractions that really bring to fruition the notion of a container.

Tuesday Jan 09, 2007

Privilege (Set Me Free)

One of the perhaps lesser-known features of Solaris Containers, or Zones, is that applications running inside these virtualized environments execute with fewer privileges than applications executing outside the container. This is enforced through the Solaris Privileges framework, which was also introduced in the Solaris 10 release.

When comparing virtualization solutions, OS-level virtualization mechanisms like Zones or FreeBSD Jails are typically thought to provide less security than mechanisms in which a machine architecture is virtualized, such as the family of products from VMware, or paravirtualization mechanisms such as Xen, in which the guest OS is ported to the virtualized machine architecture. One reason is that there is usually weaker separation between virtualized OS environments, since at several levels in the kernel there is some sharing of data structures and code paths.

However, in some cases OS-level virtualization provides an advantage for certain aspects of security. For example, with Solaris Containers the privilege mechanism in the kernel enforces limitations on the types of operations an application can perform. Consider the ability to create or "plumb" a software networking interface using ifconfig(1M), or to set an IP address on that interface. In some situations, one wants to allow such operations inside a virtualized environment because a particular application requires the ability to change an existing IP address or to toggle an interface up or down. The ramification, however, is that a malicious or naive user inside the environment might change their IP address to something unexpected, with results ranging from disruption of the network topology to the potential of spoofing another machine on the network. In addition, most applications do not actually require the ability to change their environment's IP address or create new network interfaces, or even to know the name of the interface in their environment. Rather, they typically want one or more IPv4 or IPv6 addresses which they can bind(3SOCKET) to.

In the case of Solaris Containers, the privilege to set the IP address of an interface is not given to any applications running inside a container and there is no way for an application to escalate or grow the set of privileges from those they started out with. The end result, in this example, is that the root password or super-user privileges can be given to a user inside a container but they will be unable to manipulate or affect the topology of the network or impersonate another machine and potentially gain access to its network traffic.1

Until recently, the set of privileges a container's applications were limited to was fixed. However starting with both Solaris Express 5/06 and Solaris 10 11/06, the global zone administrator can change this set of privileges. What this means from a practical point of view is that containers can become more capable by adding some of the privileges that are not usually present. An example here might be the ability to run DTrace from within the container2. Dan provided an excellent writeup on the details for doing so here.

As another example, by adding some additional privileges to the container's default privilege set, a Network Time Protocol (NTP) server can be deployed in the container, which is preferable from a security point of view, especially for a server that might be facing a hostile Internet. In order to configure the container appropriately, the list of privileges that it requires needs to be known. Solaris 10 currently ships with the 3-5.93e version of xntpd(1M), which is the daemon that implements the NTP server capability. This particular daemon can take advantage of three privileges that are not normally present within a container. The first, perhaps obviously, is the privilege to change the system clock - sys_time. With the addition of this privilege, xntpd will be able to successfully set the system clock when it needs to.

However, it also turns out that the daemon tries both to lock down its memory pages and to run in the real-time scheduling class. It does this so that it can maintain accurate time, particularly in the face of other system activity. These two operations are also covered by distinct privileges - proc_lock_memory and proc_priocntl.

Tying these privileges3 together, we can take an existing container and configure it to be an NTP server. In this example, Sun's internal network routes IP multicast, and so I will leverage that to connect to the network's NTP servers listening on the standard NTP multicast address. For example, consider this update to the configuration of the zone myzone:
        global# zonecfg -z myzone
        zonecfg:myzone> set limitpriv=default,proc_lock_memory,proc_priocntl,sys_time
        zonecfg:myzone> commit
        zonecfg:myzone> exit
        global# zoneadm -z myzone boot

Then from within the newly booted container, I will set up the configuration of the server itself and start the service:

        myzone# cp -p /etc/inet/ntp.client /etc/inet/ntp.conf
        myzone# svcadm enable network/ntp

The property that was set in the container's configuration, limitpriv, consists of a list of privileges similar to the form expected by user_attr(4) and priv_str_to_set(3C). In this particular example, the container's privilege set is limited to the standard default set of privileges plus the three additional privileges required by the NTP server.

It is worthwhile to note that privileges can also be taken away by preceding them with an exclamation mark (!) or a minus sign (-). This can allow a container to be booted in which applications have even fewer privileges than usual. For example, to take away the ability to generate ICMP datagrams from the zone named twilight, the global zone administrator would configure the container as follows:

        global# zonecfg -z twilight set limitpriv=default,!net_icmpaccess

There are a few restrictions on which privileges can be added to a container as well as some concerning which ones can be removed. For more details, please see the original proposal and the ensuing discussion on the zones-discuss mailing list. This proposal and many others concerning containers and other parts of OpenSolaris have benefited greatly from the participation of the OpenSolaris Zones Community. Information about each of these proposals can be found here.

1 The actual privilege check in the kernel for this particular case occurs here.

2 The ability to use DTrace inside a non-global zone is at the present time restricted to Solaris Express as some additional changes to DTrace were required. However, these changes should be appearing in an upcoming Solaris 10 release.

3 Starting with Solaris Express 11/06, the privilege to lock memory has actually been added to the container's default set. This is because additional resource controls have been added that can limit the amount of memory applications within a container can lock so it is no longer necessary to make this privilege an optional one.

Wednesday Jun 14, 2006

In My Reflecting Pool

A year ago today, the realization of something that many of us at Sun had pushed and wished for finally came true - the open sourcing of the Solaris source code and the creation of the OpenSolaris Project. On that date, I wrote about one aspect of OpenSolaris that I had been working on for a number of years, but what really was even more exciting than the technology being released was the possibilities for the future.

Reflecting after a year, I see a tremendous amount of progress, including accomplishments in areas, such as the selection of a source code management (SCM) system, which I dared not hope would be complete after one year's time. Many of the changes that have taken place this past year represent fundamental changes in the way Sun does Solaris development, and though the OpenSolaris community has a long way to grow, everyone should feel good about how much has already taken place. And the fact that there are already four distributions based on OpenSolaris - Schillix, BeleniX, NexentaOS and marTux - as well as Sun's own Solaris Express is a reason to celebrate.

One of the areas of OpenSolaris that I was fortunate to have worked on this past year was a team effort to propose what the OpenSolaris development process should look like. The team was led by Teresa, and I was asked by Andy if I wanted to contribute to this effort. The team consisted of a number of people both within and outside Sun, including John Beck, Rich Teer, Al Hopper, Stephen Hahn, Ed Hunter, Joe Kowalski, Keith Wesolowski, Casper Dik, and Bill Sommerfeld. Although the development process draft that we eventually published does look in many ways like the Software Development Framework used within Sun for its product development, the process by which the proposal itself was developed was entirely organic. The team initially discussed what the scope of the proposal should be and examined the high-level requirements of an operating system such as OpenSolaris. These design principles and other fundamentals were always kept at the forefront when we then examined other open-source projects including Apache, FreeBSD, Linux, NetBSD and OpenOffice.

After reviewing other open-source projects and their development processes, we brainstormed over the steps necessary to take an idea from conception to realization, again taking into account the guiding requirements discussed earlier. One very important notion that weighed heavily on our discussions was that of "shrink to fit", where steps in the process can be reduced or even eliminated when it makes sense. The result is a fairly streamlined process that is meant to handle both the introduction of large pieces of framework into OpenSolaris as well as the simple bug fix.

The draft was released last November and we received many insightful comments from the community. I would definitely encourage others who have not read the draft to do so and provide comments to the above thread or on the OpenSolaris cab-discuss forum.

As exciting as the first year of OpenSolaris has been, it seems obvious that the coming year is going to be even more so. And as impressive as it is to have a hundred community integrations into OpenSolaris this first year, I suspect that we will be seeing a far higher number in the coming year, along with the introduction of some large-scale projects where the community will be playing a larger part in the design, implementation and integration phases.

Monday Sep 12, 2005

Three Conductors and Twenty-five Sacks of Mail

I would like to thank everyone that came out to the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting on August 31st and for bearing with me as we struggled with getting the laptop and projector to play nice with one another. The slides for the presentation are now available.

The questions and feedback on Zones were excellent and it was great to see the level of interest in OpenSolaris. Special thanks to Dan and Allan of the Zones project team for taking notes and lending support and of course, to the other Alan for all his work in organizing the meetings.

One frequently asked question which came up at the meeting is how a zone can support a writable directory under /usr, such as /usr/local, when /usr is usually mounted read-only from the global zone.

The easiest way to support such a directory is to add a lofs(7FS) file system for the zone using zonecfg(1M). One simply needs to specify a directory in the global zone to serve as backing store for the zone's /usr/local directory and then edit the zone's configuration as follows:

        global# mkdir -p /usr/local
        global# mkdir -p /path/to/some/storage/local/twilight
        global# zonecfg -z twilight
        zonecfg:twilight> add fs
        zonecfg:twilight:fs> set dir=/usr/local
        zonecfg:twilight:fs> set special=/path/to/some/storage/local/twilight
        zonecfg:twilight:fs> set type=lofs
        zonecfg:twilight:fs> end
        zonecfg:twilight> commit
        zonecfg:twilight> exit

The next time the zone boots, it will have its own writable /usr/local directory.

Speaking of frequently asked questions, Jeff has compiled a Zones and Containers FAQ which provides a list of the common questions that have been asked since Zones were introduced along with their answers. The FAQ along with a great deal of other information can also be found on the redesigned OpenSolaris Zones Community page, which was recently given a well needed makeover by Dan.

Monday Aug 29, 2005

Lastly Through a Hogshead of Real Fire!

Tomorrow evening, August 30th at 7:30, at the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting, I will have the pleasure of talking about Zones, the virtualization software available in OpenSolaris. This month's meeting will be held upstairs from the auditorium at Sun's Santa Clara campus - directions are available.

In addition to the talk (for which I will post slides soon), there will be a panel discussion covering anything related to OpenSolaris, and hopefully there will be a status update on where things stand, which build has been released and other related news.

Thanks again to Alan for organizing these meetings and to the community for attending and bringing their questions, concerns and enthusiasm. We look forward to seeing you at the user group meeting.

This is a stereo recording.
A splendid time is guaranteed for all.

Tuesday Jun 14, 2005

These Boots are Made for Walkin'

One of the most gratifying and exciting aspects of the OpenSolaris project is a return (for me, at least) to working on operating system design and research with the larger, open community. In another era while I was an undergraduate at Berkeley, I was fortunate enough to see the 2.x and 4.x BSD development effort up close and to see the larger community formed between the University and external organizations that had UNIX source licenses. It was not an open source community, of course, but it was a community none the less, and one that shared fixes, ideas and other software projects built on top of the operating system. Our hopes for OpenSolaris are that in addition to releasing operating system source code that can be used for many different purposes, Sun and the community will innovate together while maintaining the core values that Solaris provides today.

One of the many pieces of OpenSolaris which is of personal interest is the Zones virtualization technology introduced in Solaris 10. Zones provide a lightweight but very efficient and flexible way of consolidating and securing potentially complex workloads. There is a wealth of technical information about Zones in Solaris available at the OpenSolaris Zones Community and the BigAdmin System Administration Portal.

One of the things about Zones that people notice right away is how quickly they boot. Of course, booting a zone does not cause a system to run its power-on self-test (POST) or require the same amount of initialization that takes place when the hardware itself is booting. However, I thought it might be useful to do a tour of the dance that takes place when a non-global zone is booted. I call it a dance since there is a certain amount of interplay between the primary players - zoneadm, zoneadmd and the kernel itself - that warrants an explanation.

Although the virtualization that Zones provides is spread throughout the source code, the primary implementation in the kernel can be found in zone.c. As with many OpenSolaris frameworks, there is a big block comment at the start of the file which is very useful for understanding the lay of the land with respect to the code. Besides describing the data structures and locking strategy used for Zones, there is a description of the states a zone can be in from the kernel's perspective and the points at which a zone may transition from one state to another. For brevity, only the states covered during a zone boot are listed here:

 *   Zone States:
 *   The states in which a zone may be in and the transitions are as
 *   follows:
 *   ZONE_IS_UNINITIALIZED: primordial state for a zone. The partially
 *   initialized zone is added to the list of active zones on the system but
 *   isn't accessible.
 *   ZONE_IS_READY: zsched (the kernel dummy process for a zone) is
 *   ready.  The zone is made visible after the ZSD constructor callbacks are
 *   executed.  A zone remains in this state until it transitions into
 *   the ZONE_IS_BOOTING state as a result of a call to zone_boot().
 *   ZONE_IS_BOOTING: in this shortlived-state, zsched attempts to start
 *   init.  Should that fail, the zone proceeds to the ZONE_IS_SHUTTING_DOWN
 *   state.
 *   ZONE_IS_RUNNING: The zone is open for business: zsched has
 *   successfully started init.   A zone remains in this state until
 *   zone_shutdown() is called.

It is important to note here that there are a number of zone states not represented here - those are for zones which do not (yet) have a kernel context. An example of such a state is for a zone that is in the process of being installed. These states are defined in libzonecfg.h.

One of the players in the zone boot dance is the zoneadmd process, which runs in the global zone and performs a number of critical tasks. Although much of the virtualization for a zone is implemented in the kernel, zoneadmd manages a great deal of a zone's infrastructure, as outlined in zoneadmd.c:

 * zoneadmd manages zones; one zoneadmd process is launched for each
 * non-global zone on the system.  This daemon juggles four jobs:
 * - Implement setup and teardown of the zone "virtual platform": mount and
 *   unmount filesystems; create and destroy network interfaces; communicate
 *   with devfsadmd to lay out devices for the zone; instantiate the zone
 *   console device; configure process runtime attributes such as resource
 *   controls, pool bindings, fine-grained privileges.
 * - Launch the zone's init(1M) process.
 * - Implement a door server; clients (like zoneadm) connect to the door
 *   server and request zone state changes.  The kernel is also a client of
 *   this door server.  A request to halt or reboot the zone which originates
 *   *inside* the zone results in a door upcall from the kernel into zoneadmd.
 *   One minor problem is that messages emitted by zoneadmd need to be passed
 *   back to the zoneadm process making the request.  These messages need to
 *   be rendered in the client's locale; so, this is passed in as part of the
 *   request.  The exception is the kernel upcall to zoneadmd, in which case
 *   messages are syslog'd.
 *   To make all of this work, the Makefile adds -a to xgettext to extract *all*
 *   strings, and an exclusion file (zoneadmd.xcl) is used to exclude those
 *   strings which do not need to be translated.
 * - Act as a console server for zlogin -C processes; see comments in zcons.c
 *   for more information about the zone console architecture.
 * Restart:
 *   A chief design constraint of zoneadmd is that it should be restartable in
 *   the case that the administrator kills it off, or it suffers a fatal error,
 *   without the running zone being impacted; this is akin to being able to
 *   reboot the service processor of a server without affecting the OS instance.

When a user wishes to boot a zone, zoneadm will attempt to contact zoneadmd via a door that is used by all three components for a number of things including coordinating zone state changes. If for some reason zoneadmd is not running, an attempt will be made to start it. Once that has completed, zoneadm tells zoneadmd to boot the zone by supplying the appropriate zone_cmd_arg_t request via a door call. It is worth noting that the same door is used by zoneadmd to return messages back to the user executing zoneadm and also as a way for zoneadm to indicate to zoneadmd the locale of the user executing the boot command so that messages are localized appropriately.

Looking at the door server that zoneadmd implements, there is some straightforward sanity checking that takes place on the argument passed via the door call as well as the use of some of the technology that came in with the introduction of discrete privileges in Solaris 10.

	if (door_ucred(&uc) != 0) {
		zerror(&logsys, B_TRUE, "door_ucred");
		goto out;
	}
	eset = ucred_getprivset(uc, PRIV_EFFECTIVE);
	if (ucred_getzoneid(uc) != GLOBAL_ZONEID ||
	    (eset != NULL ? !priv_ismember(eset, PRIV_SYS_CONFIG) :
	    ucred_geteuid(uc) != 0)) {
		zerror(&logsys, B_FALSE, "insufficient privileges");
		goto out;
	}

	kernelcall = ucred_getpid(uc) == 0;

	/*
	 * This is safe because we only use a zlog_t throughout the
	 * duration of a door call; i.e., by the time the pointer
	 * might become invalid, the door call would be over.
	 */
	zlog.locale = kernelcall ? DEFAULT_LOCALE : zargp->locale;

Using door_ucred, the user credential can be checked to determine whether the request originated in the global zone,1 whether the user making the request had sufficient privilege to do so2 and whether the request was a result of an upcall from the kernel. That last piece of information is used, among other things, to determine whether or not messages should be localized by localize_msg.

It is within the door server implemented by zoneadmd that transitions from one state to another take place. There are two states from which a zone boot is permissible, installed and ready. From the installed state, zone_ready is used to create and bring up the zone's virtual platform that consists of the zone's kernel context (created using zone_create) as well as the zone's specific file systems (including the root file system) and logical networking interfaces. If a zone is supposed to be bound to a non-default resource pool, then that also takes place as part of this state transition.

When a zone's kernel context is created using zone_create, a zone_t structure is allocated and initialized. At this time, the status of the zone is set to ZONE_IS_UNINITIALIZED. Some of this initialization sets up the security boundary which isolates processes running inside a zone. For example, the vnode_t of the zone's root file system, the zone's kernel credentials and the privilege sets of the zone's future processes are all initialized here.

Before returning control back to zoneadmd, zone_create adds the primordial zone to a doubly-linked list and two hash tables,3 one hashed by zone name and the other by zone ID. These data structures are protected by the zonehash_lock mutex, which is dropped after the zone has been added. Finally, a new kernel process, zsched, is created, which is where kernel threads for this zone are parented. After calling newproc to create this kernel process, zone_create will wait using zone_status_wait until the zsched kernel process has completed initializing the zone and has set its status to ZONE_IS_READY.

Since initialization of the new process's user structure has not yet been completed, the first thing the new zsched process does is finish that initialization, along with reparenting itself to PID 1 (the global zone's init process). And since the future processes to be run within the new zone may be subject to resource controls, that initialization also takes place here, in the context of zsched.

After grabbing the zone_status_lock mutex in order to set the status to ZONE_IS_READY, zsched will then suspend itself, waiting for the zone's status to be changed to ZONE_IS_BOOTING.

Once the zone is in the ready state, zone_create returns control back to zoneadmd, and the door server continues the boot process by calling zone_bootup. This initializes the zone's console device, mounts some of the standard OpenSolaris file systems like /proc and /etc/mnttab, and then uses the zone_boot system call to attempt to boot the zone.

As the comment that introduces zone_boot points out, most of the heavy lifting has already been done either by zoneadmd or by the work the kernel has done through zone_create. At this point, zone_boot saves the requested boot arguments after grabbing the zonehash_lock mutex and then further grabs the zone_status_lock mutex in order to set the zone status to ZONE_IS_BOOTING. After dropping both locks, zone_boot suspends itself, waiting for the zone status to be set to ZONE_IS_RUNNING.

Since the zone's status has now been set to ZONE_IS_BOOTING, zsched continues where it left off after suspending itself with its call to zone_status_wait_cpr. After checking that the current zone status is indeed ZONE_IS_BOOTING, a new kernel process is created in order to run init in the zone. This process calls zone_icode, which is analogous to the traditional icode function that is used to start init in the global zone and in traditional UNIX environments. After doing some zone-specific initialization, each of the icode functions ends up calling exec_init to actually exec the init process after copying out the path to the executable, /sbin/init, and the boot arguments. If the exec is successful, zone_icode will set the zone's status to ZONE_IS_RUNNING and, in the process, zone_boot will pick up where it had been suspended. At this point, the value of zone_boot_err indicates whether the zone boot was successful and is used to set the global errno value for zoneadmd.

There are two additional things to note with the zone's transition to the running state. First of all, audit_put_record is called to generate an event for the Solaris auditing system so that it's known which user executed which command to boot a zone. In addition, there is an internal zoneadmd event generated to indicate on the zone's console device that the zone is booting. This internal stream of events is sent by the door server to the zone console subsystem for all state transitions, so that the console user can see which state the zone is transitioning to.

1 This is a bit of defensive programming since, unless the global zone administrator were to make the door in question available through the non-global zone's own file system, there would be no way for a privileged user in a non-global zone to actually access the door used by zoneadmd.

2 zoneadm itself checks that the user attempting to boot a zone has the necessary privilege but it's possible some other privileged process in the global zone might have access to the door but lack the necessary PRIV_SYS_CONFIG privilege.

3 The doubly-linked list implementation was integrated by Dave while Dan was responsible for the hash table implementation. Both of these are worth examining in the OpenSolaris source base.
