Friday Oct 23, 2009

Solaris 10 Containers Released on OpenSolaris

After roughly nine months of nonstop development, Jerry Jelinek integrated the first phase of solaris10-branded zones (a.k.a. Solaris 10 Containers) into OpenSolaris build 127 yesterday. Such zones enable users to host environments from Solaris 10 10/09 and later inside OpenSolaris zones. As mentioned in one of my earlier posts, we're developing solaris10-branded zones so that users can consolidate their Solaris 10 production environments onto machines running OpenSolaris and take advantage of many innovative OpenSolaris technologies (such as Crossbow) within such environments.

As Jerry mentioned in his blog, this first phase delivers emulation for Solaris 10 10/09, physical-to-virtual (p2v) and virtual-to-virtual (v2v) capabilities to help users deploy Solaris 10 environments in solaris10-branded zones, and support for all three OpenSolaris-supported platforms (sun4u, sun4v, and x86). He also explained that there are some limitations that will be addressed in the second phase of the project. One limitation he didn't mention: users can't run dtrace(1M) and mdb(1) in the global zone to examine processes running in solaris10-branded zones. This limitation results from incompatible changes made to some of the debugging libraries between Solaris 10 and OpenSolaris, and it will be addressed during the second development phase. In the meantime, users can run dtrace(1M) and mdb(1) inside solaris10-branded zones to examine processes running inside the zones.

If you are an OpenSolaris or Solaris 10 kernel developer, then I urge you to read the Solaris10-Branded Zone Developer Guide, which explains the purpose and implementation of solaris10-branded zones as well as what you'll need to do to avoid breaking such zones. It's every kernel developer's responsibility to ensure that solaris10-branded zones will continue to work with his/her changes to the Solaris 10 and OpenSolaris user-kernel interfaces (syscalls, ioctls, kstats, etc.).

This project was full of surprises and challenges. One of my favorite bugs involved Solaris 10 libc's use of the x86 %fs segment register. Libc has always used %fs to locate the current thread's ulwp_t structure on 64-bit x86 machines, and Solaris 10's libc expected %fs to contain a nonzero selector value in 64-bit processes (Solaris 10's __curthread() returns NULL if %fs is zero). That expectation was problematic because OpenSolaris' kernel cleared %fs. Consequently, 64-bit x86 processes running inside solaris10-branded zones were unable to use thr_main(3C) and other critical libc functions, as well as several common libraries, such as libdoor.
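To make the failure concrete, here's a minimal sketch of the behavior in question. my_curthread() is a hypothetical stand-in for libc's __curthread(), not the actual source; the self-pointer-at-offset-0 detail reflects my recollection of the ulwp_t layout.

#include <stddef.h>

typedef struct ulwp ulwp_t;	/* opaque here; a libc-internal type */

/*
 * Hypothetical stand-in for Solaris 10's 64-bit __curthread().  Libc
 * keeps the FS segment's base pointing at the current thread's ulwp_t,
 * whose first member (as I recall) is a pointer to the structure itself.
 */
static ulwp_t *
my_curthread(void)
{
	unsigned short fs_sel;
	ulwp_t *self;

	/* Solaris 10's libc treats a zero %fs selector as "no thread". */
	__asm__ __volatile__("movw %%fs, %0" : "=r" (fs_sel));
	if (fs_sel == 0)
		return (NULL);

	/* Load the self-pointer stored at offset 0 of the ulwp_t. */
	__asm__ __volatile__("movq %%fs:0, %0" : "=r" (self));
	return (self);
}

On OpenSolaris' kernel, which cleared %fs, the selector check above fails in a Solaris 10 environment and every caller of __curthread() sees NULL.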

The fix was somewhat complicated because it had to guarantee that all threads in all 64-bit x86 processes running in solaris10-branded zones would start with nonzero %fs registers. Fortunately, only two system calls modify %fs in Solaris 10 and OpenSolaris: SYS_lwp_private and SYS_lwp_create. SYS_lwp_private is a libc-private system call that's invoked once when libc initializes after a process execs (see OpenSolaris' implementation of libc_init()) in order to configure the FS segment so that its base lies at the start of the single thread's ulwp_t structure. SYS_lwp_create takes a ucontext_t structure and the address of a ulwp_t structure and creates a new thread for the calling process with the given thread context and an FS segment beginning at the start of the specified ulwp_t structure.
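For reference, here is roughly what the FS-base setup looks like from libc's side. This is a hedged sketch from memory: syslwp_private() and the _LWP_SETPRIVATE/_LWP_FSBASE names are private OpenSolaris interfaces, and the shapes below are my paraphrase, not a stable API.

#include <sys/types.h>
#include <sys/lwp.h>	/* _LWP_SETPRIVATE and _LWP_FSBASE, if memory serves */

/* Private libc wrapper around SYS_lwp_private; prototype paraphrased. */
extern int syslwp_private(int cmd, int which, uintptr_t base);

/*
 * Roughly what libc_init() does once per process on amd64: point the
 * FS segment's base at the initial (and only) thread's ulwp_t so that
 * %fs-relative loads, as in __curthread(), can find it.
 */
static void
sketch_set_fsbase(void *self)	/* self = the initial thread's ulwp_t */
{
	(void) syslwp_private(_LWP_SETPRIVATE, _LWP_FSBASE, (uintptr_t)self);
}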

My initial fix did the following (a sketch of the redirection trick appears after the list):

  1. The solaris10 brand's emulation library interposed on SYS_lwp_private in s10_lwp_private(). It handed the system call to the OpenSolaris kernel untouched and afterwards invoked thr_main(3C) to determine whether the Solaris 10 environment's libc worked after the kernel configured %fs. If thr_main(3C) returned -1, then the library invoked a special SYS_brand system call to set %fs to the old nonzero Solaris 10 selector value.
  2. The brand's emulation library also interposed on SYS_lwp_create in s10_lwp_create() and tweaked the supplied ucontext_t structure so that the new thread started in s10_lwp_create_entry_point() rather than _thrp_setup(). Of course, new threads had to execute _thrp_setup() eventually, so s10_lwp_create() stored _thrp_setup()'s address in a predetermined location in the new thread's stack. s10_lwp_create_entry_point() invoked thr_main(3C) to determine whether the Solaris 10 environment's libc worked when %fs was zero. If thr_main(3C) returned -1, then the new thread invoked the same SYS_brand system call invoked by s10_lwp_private() in order to correct %fs. Afterwards, the new thread read its true entry point's address (i.e., _thrp_setup()'s address) from the predetermined location in its stack and jumped to the true entry point.
  3. The solaris10 brand's kernel module ensured that forked threads in solaris10-branded zones inherited their parents' %fs selector values. This ensured that forked threads whose parents needed %fs register adjustments started with correct %fs selector values.
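To make item 2 more concrete, here's a hedged sketch of the redirection trick. All names and the 2KB offset are illustrative, and I'm assuming Solaris amd64 register indices such as REG_RIP from <sys/regset.h>.

#include <stdint.h>
#include <string.h>
#include <ucontext.h>	/* ucontext_t, gregs, REG_RIP on Solaris amd64 */

#define	ENTRY_SLOT_OFFSET	2048	/* illustrative slot, 2KB from the stack top */

extern void sketch_entry_trampoline(void);	/* stand-in for s10_lwp_create_entry_point() */

/*
 * Sketch of the interposition: make the new thread start in our
 * trampoline and stash its real entry point (_thrp_setup()'s address
 * in the real code) at a predetermined spot in its stack.
 */
static void
redirect_new_thread(ucontext_t *ucp)
{
	uintptr_t stack_top = (uintptr_t)ucp->uc_stack.ss_sp + ucp->uc_stack.ss_size;
	void *slot = (void *)(stack_top - ENTRY_SLOT_OFFSET);
	void *real_entry = (void *)ucp->uc_mcontext.gregs[REG_RIP];

	/* Stash the real entry point where the trampoline knows to look... */
	memcpy(slot, &real_entry, sizeof (real_entry));

	/* ...and start the thread in the trampoline instead. */
	ucp->uc_mcontext.gregs[REG_RIP] = (greg_t)sketch_entry_trampoline;
}

As the next section explains, stashing the entry point in the new thread's stack turned out to be the weak link.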

I committed the fix and was content until a test engineer working on solaris10-branded zones, Mengwei Jiao, reported a segfault in a 64-bit x86 test running in a solaris10-branded zone. I immediately suspected my fix because the test was multithreaded, yet I was surprised because I had tested my fix thoroughly and never encountered segfaults. Mengwei's test created and immediately canceled a thread using pthread_create(3C) and pthread_cancel(3C). After spending hours debugging core dumps, I discovered that I had forgotten to consider signals while testing my fix.

The test segfaulted because its new thread read a junk address from its stack in s10_lwp_create_entry_point() and jumped to it. Something had clobbered the thread's stack and overwritten its true entry point's address. I noticed that the thread didn't start until its parent finished executing pthread_cancel(3C), so I suspected that the delivery of the SIGCANCEL signal clobbered the child's stack.

It turned out that the child started in s10_lwp_create_entry_point() as expected but immediately jumped to sigacthandler() in libc to process the SIGCANCEL signal. That alone might have been harmless: the thread's true entry point's address was stored deep within the thread's stack (2KB from the top of the stack), and neither sigacthandler() nor any of the functions it invoked consumed much stack space. However, sigacthandler() invoked memcpy(3C) to copy a siginfo_t structure, and the dynamic linker hadn't yet resolved the library's lazily bound PLT entry for memcpy(3C). Consequently, the thread executed ld.so.1 routines in order to look up memcpy(3C) and fill its associated PLT entry. Eventually the thread's stack grew deep enough for ld.so.1 to clobber the thread's true entry point's address, which produced the junk address that later led to the segfault.

My final solution eliminated the use of new threads' stacks and instead stored entry points in new threads' %r14 registers. Libc doesn't store any special initial value in a new thread's %r14 register, so I was free to use it. Additionally, %r14 is a callee-saved register in the System V ABI, so any ABI-conforming function invoked before s10_lwp_create_entry_point() read the register (including sigacthandler() and everything it called) had to preserve %r14, making it impossible for such functions to clobber the entry point.
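Here's a hedged illustration of the idea. In reality this logic belongs in hand-written assembly so that nothing touches %r14 before it's read; the C below, with invented names, is only meant to show why a callee-saved register survives where the stack slot didn't.

/*
 * Illustrative only: real code would be hand-written assembly so the
 * compiler can't touch %r14 before we read it.  fix_fs_if_needed() is
 * a made-up stand-in for the SYS_brand call that restores %fs.
 */
static void
fix_fs_if_needed(void)
{
	/* In the real brand library: invoke the special SYS_brand system
	 * call to set %fs if thr_main(3C) returns -1. */
}

void
sketch_lwp_create_entry_point(void)
{
	void (*true_entry)(void);

	/*
	 * Read the true entry point out of %r14.  Because %r14 is
	 * callee-saved in the System V AMD64 ABI, sigacthandler() and
	 * anything ld.so.1 ran on our behalf had to preserve it.
	 */
	__asm__ __volatile__("movq %%r14, %0" : "=r" (true_entry));

	fix_fs_if_needed();
	true_entry();	/* jump to the _thrp_setup() equivalent */
}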

I also renamed s10_lwp_create() to s10_lwp_create_correct_fs() and used a trick that I call sysent table patching: the brand library forces new threads to start at s10_lwp_create_entry_point() only after s10_lwp_private() determines that the Solaris 10 environment's libc can't function properly when %fs is zero. The brand's emulation library accesses a global array called s10_sysent_table to fetch system call handlers, and an emulation function can change a system call's entry in the array in order to change the system call's handler. By default the emulation library invokes s10_lwp_create() to emulate SYS_lwp_create, which simply hands the system call to the OpenSolaris kernel untouched. If s10_lwp_private() determines that new threads require nonzero %fs selector values, then it modifies s10_sysent_table so that s10_lwp_create_correct_fs() handles subsequent SYS_lwp_create system calls. SYS_lwp_private is only invoked while a process is single-threaded, so races between s10_lwp_private() and SYS_lwp_create are impossible.
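A minimal sketch of the sysent-table-patching idea, with invented names and an illustrative syscall number; the real s10_sysent_table and its handler signatures live in the brand's emulation library.

#include <stdint.h>

typedef long (*sysent_handler_t)(uintptr_t, uintptr_t, uintptr_t);

#define	SKETCH_SYS_LWP_CREATE	159	/* illustrative number only */
#define	SKETCH_NSYSCALL		256

static long
pass_through(uintptr_t a0, uintptr_t a1, uintptr_t a2)
{
	/* Default handler: hand the syscall to the OpenSolaris kernel as-is. */
	(void) a0; (void) a1; (void) a2;
	return (0);
}

static long
lwp_create_correct_fs(uintptr_t a0, uintptr_t a1, uintptr_t a2)
{
	/* %fs-correcting handler: redirect the new thread's entry point
	 * (as sketched earlier) before passing the syscall through. */
	return (pass_through(a0, a1, a2));
}

/* One handler per emulated syscall; most entries simply pass through. */
static sysent_handler_t sketch_sysent_table[SKETCH_NSYSCALL] = {
	[SKETCH_SYS_LWP_CREATE] = pass_through,
};

/*
 * Called from the SYS_lwp_private path when the Solaris 10 libc turns
 * out to need a nonzero %fs.  The process is still single-threaded at
 * that point, so the table can be patched without locking.
 */
static void
patch_lwp_create_handler(void)
{
	sketch_sysent_table[SKETCH_SYS_LWP_CREATE] = lwp_create_correct_fs;
}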

I encourage you to download and install the latest version of OpenSolaris, update it to build 127 or later (once the builds become available), and try solaris10-branded zones. Jerry and I would appreciate any feedback you might have, which you can send to us via the zones-discuss discussion forum on opensolaris.org. Remember that solaris10-branded zones are capable of hosting production environments even though they are still being developed.

Enjoy!

Thursday May 07, 2009

Solaris 10 Containers for OpenSolaris

Branded Zones/Containers is a technology that allows Solaris system administrators to virtualize non-native operating system environments within Solaris zones, a lightweight OS-level (i.e., no hypervisor) virtualization technology that creates isolated application environments. (Look here for more details.) Brands exist for Linux on OpenSolaris and Solaris 8 and 9 on Solaris 10, but not Solaris 10 on OpenSolaris...until now.

On April 23, Jerry Jelinek announced the development of Solaris 10 containers on OpenSolaris.org and requested that the project be open-sourced as a part of ON (i.e., the OpenSolaris kernel). Solaris 10 Containers will allow administrators to adopt technologies found in the OpenSolaris kernel (e.g., Crossbow networking and ZFS enhancements) while maintaining their Solaris 10 operating system environments. In other words, you will be able to run your Solaris 10 environments on top of the OpenSolaris kernel (provided that your Solaris 10 environments meet the standard Solaris zone requirements).

Both Jerry and I have been working on Solaris 10 containers for at least a month. We are currently able to archive and install Solaris 10 environments into Solaris 10 containers (i.e., p2v Solaris 10 systems) and boot the containers as shared-stack zones. Automounting NFS filesystems, examining processes with the proc tools, tracing process and thread behavior with truss, and listing installed Solaris 10 patches are a few of the many features that appear to work without problems within Solaris 10 containers in their current state. I even managed to forward X connections over SSH and establish VNC sessions with my Solaris 10 containers on all three Solaris-supported architectures (x86, x64, and SPARC).

Jerry and I prepared screencast demos of archiving, installing, booting, and working within a Solaris 10 container for the upcoming Community One West developer conference. We couldn't decide whose narration was better suited for the demo, so we submitted two versions: one featuring my voice and the other featuring Jerry's. Take a look at Jerry's demo if you want to see the results (though you might have to download the Flash video file because it might not fit within the preview window). We are considering producing more videos or blog posts (or both) as the technology evolves.

For more information on Solaris 10 containers and zones/containers in general and how you can contribute to both, visit the OpenSolaris.org zones community page and the Solaris 10 Brand/Containers project page at OpenSolaris.org.

About

I am a kernel developer at Sun Microsystems, Inc., working on zones and resource pools. This blog logs some of my thoughts regarding my work and the [mis]adventures that I have while working on Solaris.
