Friday Nov 13, 2015

Important zones fixes in Solaris 11.3 SRU 2.4

Solaris 11.3 SRU 2.4 has some fixes that are likely of interest to you if you are reading this blog.  These fixes include:

  • Free memory is no longer copied during live migration and suspend/resume.  If the zone has a lot of free memory - especially memory that was in use at some point and then freed - this can significantly reduce migration time.
  • During zoneadm migrate, progress reports keep you up to date on how the migration is progressing.
  • vmstat and iostat are now much faster to start in zones.
  • Operations involving access to device nodes in /dev/zvol are much faster.  That is, ls, find, and open() may be faster.  There are no zone-specific changes that would impact the performance of reads and writes to ZFS volumes.
Of course there are a bunch of other fixes in zones and elsewhere that may be important to you too.  Check out My Oracle Support Doc ID 2077717.1 for details.

Thursday Oct 22, 2015

Meet the experts at #oow15

Coming to Oracle Open World 2015?  Great!  Stop by and see me and a bunch of other Solaris engineers at the Meet the Experts networking events.

  • Monday, Oct 26 @ 3:30 PM - 4:00 PM | 5th floor, Intercontinental Hotel
  • Tuesday, Oct 27 @ 3:15 PM - 4:00 PM | 5th floor, Intercontinental Hotel
  • Wednesday, Oct 28 @ 2:30 PM - 3:00 PM | 5th floor, Intercontinental Hotel

There are a bunch of Solaris sessions just before and after these events, right in the same area.  Be sure to check them out too!

Follow @zoneszone for reminders of these and perhaps some other interesting happenings at #oow15.

Thursday Jul 30, 2015

Live storage migration for kernel zones

From time to time every sysadmin realizes that something that is consuming a bunch of storage is sitting in the wrong place.  This could be because of a surprise conversion of proof of concept into proof of production or something more positive like ripping out old crusty storage for a nice new Oracle ZFS Storage Appliance. When you use kernel zones with Solaris 11.3, storage migration gets a lot easier.

As our fine manual says:

The Oracle Solaris 11.3 release introduces the Live Zone Reconfiguration feature for Oracle Solaris Kernel Zones. With this feature, you can reconfigure the network and the attached devices of a running kernel zone. Because the configuration changes are applied immediately without requiring a reboot, there is zero downtime service availability within the zone. You can use the standard zone utilities such as zonecfg and zoneadm to administer the Live Zone Reconfiguration. 

Well, we can combine this with other excellent features of Solaris to have no-outage storage migrations, even of the root zpool.

In this example, I have a kernel zone that was created with something like:

root@global:~# zonecfg -z kz1 create -t SYSsolaris-kz
root@global:~# zoneadm -z kz1 install -c <scprofile.xml>

That happened several weeks ago and now I really wish that I had installed it using an iSCSI LUN from my ZFS Storage Appliance. We can fix that with no outage.

First, I'll update the zone's configuration to add a bootable iscsi disk.

root@global:~# zonecfg -z kz1
zonecfg:kz1> add device
zonecfg:kz1:device> set storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009 
zonecfg:kz1:device> set bootpri=0
zonecfg:kz1:device> end
zonecfg:kz1> exit

Next, I tell the system to add that disk to the running kernel zone.

root@global:~# zoneadm -z kz1 apply
zone 'kz1': Checking: Adding device storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
zone 'kz1': Applying the changes

Let's be sure we can see it and look at the current rpool layout.  Notice that this kernel zone is running Solaris 11.2 - I only need to have Solaris 11.3 in the global zone.

root@global:~# zlogin kz1
[Connected to zone 'kz1' pts/2]
Oracle Corporation      SunOS 5.11      11.2    May 2015
You have new mail.

root@kz1:~# format
Searching for disks...done

       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): ^D

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
        rpool   ONLINE       0     0     0
          c1d0  ONLINE       0     0     0
errors: No known data errors

Now, zpool replace can be used to migrate the root pool over to the new storage.

root@kz1:~# zpool replace rpool c1d0 c1d1
Make sure to wait until resilver is done before rebooting.

root@kz1:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Jul 30 05:47:50 2015
    4.39G scanned
    143M resilvered at 24.7M/s, 3.22% done, 0h2m to go
        NAME           STATE     READ WRITE CKSUM
        rpool          DEGRADED     0     0     0
          replacing-0  DEGRADED     0     0     0
            c1d0       ONLINE       0     0     0
            c1d1       DEGRADED     0     0     0  (resilvering)
errors: No known data errors

After a couple minutes, that completes.

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 4.39G in 0h2m with 0 errors on Thu Jul 30 05:49:57 2015
        rpool   ONLINE       0     0     0
          c1d1  ONLINE       0     0     0
errors: No known data errors

root@kz1:~# zpool list
rpool  15.9G  4.39G  11.5G  27%  1.00x  ONLINE  -

You may have noticed in the format output that I'm replacing a 16 GB zvol with a 120 GB disk.  However, the size of the zpool reported above doesn't reflect that it's on a bigger disk.  Let's fix that by setting the autoexpand property. 

root@kz1:~# zpool get autoexpand rpool
rpool  autoexpand  off    default

root@kz1:~# zpool set autoexpand=on rpool

root@kz1:~# zpool list
rpool  120G  4.39G  115G   3%  1.00x  ONLINE  -

To finish this off, all we need to do is remove the old disk from the kernel zone's configuration.  This happens back in the global zone.

root@global:~# zonecfg -z kz1
zonecfg:kz1> info device
device 0:
	match not specified
	storage: iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
	id: 1
	bootpri: 0
device 1:
	match not specified
	storage.template: dev:/dev/zvol/dsk/%{global-rootzpool}/VARSHARE/zones/%{zonename}/disk%{id}
	storage: dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
	id: 0
	bootpri: 0
zonecfg:kz1> remove device id=0
zonecfg:kz1> exit

Now, let's apply that configuration. To show what it does, I run format in kz1 before and after applying the configuration.

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): 

root@global:~# zoneadm -z kz1 apply 
zone 'kz1': Checking: Removing device storage=dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
zone 'kz1': Applying the changes

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

       0. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
Specify disk (enter its number): 


At this point the live (no outage) storage migration is complete and it is safe to destroy the old disk (rpool/VARSHARE/zones/kz1/disk0).

root@global:~# zfs destroy rpool/VARSHARE/zones/kz1/disk0

Tuesday Jul 28, 2015

A trip down memory lane

In Scott Lynn's announcement of Oracle's membership in the Open Container Initiative, he gives a great summary of how Solaris virtualization got to the point it's at.  Quite an interesting read!

Thursday Jul 16, 2015

Solaris 11.3 zones blog entries

When I was interviewing with the zones team a number of years ago, I was told that Zones were the peanut butter that was spread throughout the operating system.  I'm not so sure peanut butter is exactly the analogy that I'd go for... perhaps something a bit more viscous and hip like Sriracha sauce.  Whatever the analogy, there's a lot of innovation related to zones throughout Solaris by people that don't work on the zones team.  Here's a sampling of zones-related hotness in Solaris 11.3 blogged about elsewhere.


Thursday Jul 09, 2015

Multi-CPU bindings for Solaris Project

Traditionally, assigning specific processes to a certain set of CPUs has been done by using processor sets (and resource pools). This is quite useful, but it requires the hard partitioning of processors in the system. That means, we can't restrict process A to run on CPUs 1,2,3 and process B to run on CPUs 3,4,5, because these partitions overlap.

There is another way to assign CPUs to processes, called processor affinity, or Multi-CPU binding (MCB for short). Oracle Solaris 11.2 introduced MCB, as described in pbind(1M) and processor_affinity(2). With the release of Oracle Solaris 11.3, we have a new interface to assign, modify, and remove MCBs: Solaris projects.

Briefly, a Solaris project is a collection of processes with predefined attributes. These attributes include various resource controls one can apply to processes that belong to the project. For more details, see projects(4) and resource_controls(5). What's new is that MCB becomes simply another resource control we can manage through Solaris projects.
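
Under the hood, a project is just an entry in the /etc/project database; per project(4), the fields are name:id:comment:user-list:group-list:attributes, with attributes separated by semicolons. So the entry that projadd creates later in this post would look roughly like this (illustrative, reconstructed from the projects -l output rather than copied from a real file):

```
test-project:100::::project.mcb.cpus=0,3-5,9-11;project.mcb.flags=weak;project.pool=pool_default
```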

We start by making a new project with the MCB property. We assume that we have enough privilege to create a project, that no project called test-project exists on the system, and that all CPUs listed in the project.mcb.cpus entry exist in the system and are online. We also assume that the listed CPUs are in the resource pool to which the current zone is bound. For manipulating projects, we use the standard command line tools projadd(1M)/projdel(1M)/projmod(1M).

root@sqkx4450-1:~# projects -l test-project
projects: project "test-project" does not exist
root@sqkx4450-1:~# projadd -K project.mcb.cpus=0,3-5,9-11 -K project.mcb.flags=weak -K project.pool=pool_default test-project
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11

This means that processes in test-project will be weakly bound to CPUs 0,3,4,5, 9,10,11. (Note: For the concept of strong/weak binding, see processor_affinity(2). In short, strong binding guarantees that processes will run ONLY on designated CPUs, while weak binding does not have such a guarantee.)
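
The CPU list syntax above mixes single IDs and ranges. Purely as an illustration (this is not Solaris code), a Python sketch that expands project.mcb.cpus=0,3-5,9-11 the same way might look like:

```python
def expand_cpu_list(spec):
    """Expand a CPU list such as '0,3-5,9-11' into a sorted list of IDs."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

print(expand_cpu_list("0,3-5,9-11"))  # [0, 3, 4, 5, 9, 10, 11]
```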

The next thing is to assign some processes to test-project. If we know the PIDs of the target processes, this can be done with newtask(1).

root@sqkx4450-1:~# newtask -c 4156 -p test-project
root@sqkx4450-1:~# newtask -c 4170 -p test-project
root@sqkx4450-1:~# newtask -c 4184 -p test-project

Let's check the result by using the following command.

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 weakly bound to processor(s) 0 3 4 5 9 10 11.

Good. Now suppose we want to change the binding type to strong binding. In that case, all we need to do is change the value of project.mcb.flags to "strong", or even delete the project.mcb.flags key, because the default value is "strong".

root@sqkx4450-1:~# projmod -s -K project.mcb.flags=strong test-project
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11

Things look good, but...

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 weakly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 weakly bound to processor(s) 0 3 4 5 9 10 11.

Nothing actually changed! WARNING: By default, projmod(1M) only modifies the project configuration file; it does not attempt to apply the change to the project's processes. To do that, use the "-A" option.

root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4170 strongly bound to processor(s) 0 3 4 5 9 10 11.
pbind(1M): pid 4184 strongly bound to processor(s) 0 3 4 5 9 10 11.

Now, suppose we want to change the list of CPUs, but oops, we made some typos.

root@sqkx4450-1:~# projmod -s -K project.mcb.cpus=0,3-5,13-17 -A test-project
projmod: Updating project test-project succeeded with following warning message.
WARNING: Following ids of cpus are not found in the system:16-17
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,13-17

Our system has CPUs 0 to 15, not up to 17. In that case we get a warning, but the command succeeds anyway: it simply ignores the missing CPUs.

root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 0 3 4 5 13 14 15.
pbind(1M): pid 4170 strongly bound to processor(s) 0 3 4 5 13 14 15.
pbind(1M): pid 4184 strongly bound to processor(s) 0 3 4 5 13 14 15.

And one more thing: If you want to check the validity of project file only, use projmod(1M) without any options.

root@sqkx4450-1:~# projmod
projmod: Validation warning on line 6, WARNING: Following ids of cpus are not found in the system:16-17

But projmod is not so tolerant if it can't find any of the given CPUs at all.

root@sqkx4450-1:~# projmod -s -K project.mcb.cpus=17-20 -A test-project
projmod: WARNING: Following ids of cpus are not found in the system:17-20
projmod: ERROR: All of given multi-CPU binding (MCB) ids are not found in the system: project.mcb.cpus=17-20
root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,13-17

Now we see an ERROR: something that actually fails the command. Please read the error message carefully when you see it. Note that the project configuration file is not updated either.
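
To summarize projmod's behavior with missing CPUs - warn when some of the requested CPUs are absent, fail only when all of them are - here is a hypothetical sketch of the logic (validate_mcb_cpus is my own name, not a Solaris interface):

```python
def validate_mcb_cpus(requested, online):
    """Mimic projmod's handling of MCB CPU ids: ignore missing ids with
    a warning, but fail when none of the requested ids exist."""
    present = [c for c in requested if c in online]
    missing = [c for c in requested if c not in online]
    if not present:
        raise ValueError(
            "All of given multi-CPU binding (MCB) ids are not found")
    if missing:
        print("WARNING: ids not found in the system:", missing)
    return present

# A system with CPUs 0-15, as in the post:
online = set(range(16))
print(validate_mcb_cpus([0, 3, 4, 5, 13, 14, 15, 16, 17], online))
```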

Before moving to the next topic, one small but important tip: how do we clear MCB from a project? Set the value of project.mcb.cpus to "none" and remove project.mcb.flags if it is present.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=none
root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100

Let's move on to a little bit of advanced usage. In Oracle Solaris systems, as in others, CPUs are grouped in certain units. Currently there are 'cores', 'sockets', 'processor-groups' and 'lgroups'. Binding to these units can improve performance by taking advantage of the hardware design. (I'm less familiar with those topics, so have a look at the following post about lgroups: Locality Group Observability on Solaris.) MCB for projects supports all of these CPU structures. The usage is simple: just change "project.mcb.cpus" to "project.mcb.cores", "project.mcb.sockets", "project.mcb.pgs", or "project.mcb.lgroups".

Note: To get information about the CPU structures on a given system, use the following commands: "psrinfo -t" for the cpu/core/socket structure, "pginfo" for processor groups, and "lgrpinfo -c" for lgroups.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.sockets=1
root@sqkx4450-1:~# projmod -A test-project
root@sqkx4450-1:~# pbind -q -i projid 100
pbind(1M): pid 4156 strongly bound to processor(s) 1 5 9 13.
pbind(1M): pid 4170 strongly bound to processor(s) 1 5 9 13.
pbind(1M): pid 4184 strongly bound to processor(s) 1 5 9 13.

These examples cover the basics of MCB for projects. For more details, refer to the appropriate man pages. But let me briefly summarize some features we didn't explain here. And a final warning: many features used in this post are not supported on Oracle Solaris 11.2, even those not directly related to MCB.

1. newtask(1) also utilizes projects. When we set MCB for a project in the project configuration file, an unprivileged user who is a member of the project can use newtask(1) to put new or existing processes into it.

2. For the Solaris projects APIs, look at libproject(3LIB). Warning: some features currently work only with the 64-bit version of the library.

3. There are many other project attributes. Combining them with MCB usually causes no problems, but there is one exception: project.pool. Ignoring all the details, there's only one important guideline when using both project.pool and project.mcb.(cpus|cores|sockets|pgs|lgroups): all the CPUs in project.mcb.(cpus|cores|sockets|pgs|lgroups) should reside in the pool named by project.pool.

When we don't specify project.pool but do use project.mcb.(cpus|cores|sockets|pgs|lgroups), the system ASSUMES that project.pool is the default pool of the current zone. In this case, when we try to apply the project's attributes to processes, we'll see the following warning message.

root@sqkx4450-1:~# projects -l test-project
        projid : 100
        comment: ""
        users  : (none)
        groups : (none)
        attribs: project.mcb.cpus=0,3-5,9-11
root@sqkx4450-1:~# projmod -A test-project
projmod: Updating project test-project succeeded with following warning message.
WARNING: We bind the target project to the default pool of the zone because an multi-CPU binding (MCB) entry exists.

Man page references.
    General information:
        Project file configuration: project(4)
        How to manage resource control by project: resource_controls(5)
    Project utilization:
        Get information of projects: projects(1)
        Manage projects: projadd(1M) / projdel(1M) / projmod(1M)
        Assign a process to project: newtask(1)
        project control APIs: libproject(3LIB)
    Existing interfaces dealing with MCB:
        command line interface: pbind(1M)
        system call interface: processor_affinity(2)
    Processor information:
        psrinfo(1M) / pginfo(1M) / lgrpinfo(1M)

Managing Orphan Zone BEs

Zone boot environments that do not have any global zone BE associated with them - called orphan ZBEs - are generally a byproduct of a zone migrating from one host to another. Managing them is a tough nut to crack, as it requires mucky manual steps to get rid of them, or retain them, during migration or otherwise. Solaris 11.3 introduces changes to zoneadm(1M) and beadm(1M) to manage them better.

To find out more about these enhancements, click here

rcapd enhancements in Solaris 11.3

The resource capping daemon, rcapd, has been a key VM resource manager for solaris(5) zones and projects, limiting their RSS usage to an admin-set cap. There was a need to reduce the complexity of its configuration, and also to give the admin a handle to manage out-of-control zones/projects that were slowing down the system due to cap enforcement. In Solaris 11.3, we introduce these changes, among other optimizations to rcapd, to improve cap enforcement effectiveness and application performance.

To know more about these enhancements and how to use them to your advantage, click here.

Secure multi-threaded live migration for kernel zones

As mentioned in the What's New document, Solaris 11.3 now supports live migration for kernel zones!  Let's try it out.

As mentioned in the fine manual, live migration requires the use of zones on shared storage (ZOSS) and a few other things. In Solaris 11.2, we could use logical units (i.e. fibre channel) or iSCSI.  Always living on the edge, I decided to try out the new ZOSS NFS feature.  Since the previous post did such a great job of explaining how to set it up, I won't go into the details.  Here's what my zone configuration looks like:

zonecfg:mig1> info
zonename: mig1
brand: solaris-kz
anet 0:
device 0:
	match not specified
	storage.template: nfs://zoss:zoss@kzx-05/zones/zoss/%{zonename}.disk%{id}
	storage: nfs://zoss:zoss@kzx-05/zones/zoss/mig1.disk0
	id: 0
	bootpri: 0
	ncpus: 4
	physical: 4G
	raw redacted

And the zone is running.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                                        

In order for live migration to work, the kz-migr and rad:remote services need to be online.  They are disabled by default.

# svcadm enable -s svc:/system/rad:remote svc:/network/kz-migr:stream
# svcs svc:/system/rad:remote svc:/network/kz-migr:stream
STATE          STIME    FMRI
online          6:40:12 svc:/network/kz-migr:stream
online          6:40:12 svc:/system/rad:remote

While these services are only needed on the remote end, I enable them on both sides because there's a pretty good chance that I will migrate kernel zones in both directions.  Now we are ready to perform the migration.  I'm migrating mig1 from vzl-216 to vzl-212.  Both vzl-216 and vzl-212 are logical domains on T5's.

root@vzl-216:~# zoneadm -z mig1 migrate vzl-212
zoneadm: zone 'mig1': Importing zone configuration.
zoneadm: zone 'mig1': Attaching zone.
zoneadm: zone 'mig1': Booting zone in 'migrating-in' mode.
zoneadm: zone 'mig1': Checking migration compatibility.
zoneadm: zone 'mig1': Starting migration.
zoneadm: zone 'mig1': Suspending zone on source host.
zoneadm: zone 'mig1': Waiting for migration to complete.
zoneadm: zone 'mig1': Migration successful.
zoneadm: zone 'mig1': Halting and detaching zone on source host.

Afterwards, we see that the zone is now configured on vzl-216 and running on vzl-212.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             configured                                    
root@vzl-212:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                

Ok, cool.  But what really happened?  During the migration, I was also running tcptop, one of our demo dtrace scripts.  Unfortunately, it doesn't print the pretty colors: I added those so we can see what's going on.

root@vzl-216:~# dtrace -s /usr/demo/dtrace/tcptop.d
Sampling... Please wait.

2015 Jul  9 06:50:30,  load: 0.10,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   48168       112
     0   2640   60773   12302       137
     0    613      22   60194       336

2015 Jul  9 06:50:35,  load: 0.10,  TCPin:      0 Kb,  TCPout: 832420 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   48168       208
     0   2640   60773   12302       246
     0    613      22   60194       480
     0   2640   45661    8102      8253
     0   2640   41441    8102 418467721
     0   2640   59051    8102 459765481


2015 Jul  9 06:50:50,  load: 0.41,  TCPin:      1 Kb,  TCPout: 758608 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0   2640   60773   12302       388
     0    613      22   60194       544
     0    613      22   48168       592
     0   2640   45661    8102    119032
     0   2640   59051    8102 151883984
     0   2640   41441    8102 620449680

2015 Jul  9 06:50:55,  load: 0.48,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613      22   60194       736

In the first sample, we see that vzl-216 has established a RAD connection to vzl-212.  We know it is RAD because it is over port 12302.  RAD is used to connect the relevant zone migration processes on the two machines.  One connection between the zone migration processes is used for orchestrating various aspects of the migration.  There are two others that are used for synchronizing the memory between the machines.  In each of the samples, there is also some traffic from each of a couple ssh sessions I have between vzl-216 and another machine.

As the amount of kernel zone memory increases, the number of connections will also increase.  Currently that scaling factor is one connection per 2 GB of kernel zone memory, with an upper limit based on the number of CPUs in the machine.  The scaling is limited by the number of CPUs because each connection corresponds to a sending and a receiving thread. Those threads are responsible for encrypting and decrypting the traffic.  The multiple connections can work nicely with IPMP's outbound load sharing and/or link aggregations to spread the load across multiple physical network links. The algorithm for selecting the number of connections may change from time to time, so don't be surprised if your observations don't match what is shown above.
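
As a back-of-the-envelope model only (the post gives the scaling factor but not the exact algorithm, which may change), the connection count works out roughly like:

```python
def migration_connections(mem_gb, ncpus):
    """Rough model of the memory-sync connection count: one connection
    per 2 GB of kernel zone memory, at least one, capped by the number
    of CPUs (each connection has a sending/encrypting thread and a
    receiving/decrypting thread). Illustrative only - not the actual
    Solaris algorithm."""
    return max(1, min(mem_gb // 2, ncpus))

print(migration_connections(4, 32))    # a 4 GB zone gets 2 connections
print(migration_connections(256, 16))  # capped at 16 by the CPU count
```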

All of the communication between the two machines is encrypted.  The RAD connection (in this case) is encrypted with TLS, as described in rad(1M).  This RAD connection supports a series of calls that are used to negotiate various things, including encryption parameters for connections to kz-migr (port 8102).  You have control over the encryption algorithm used with the -c <cipher> option to zoneadm migrate.  You can see the list of available ciphers with:

root@vzl-216:~# zoneadm -z mig1 migrate -c list vzl-216
source ciphers: aes-128-ccm aes-128-gcm none
destination ciphers: aes-128-ccm aes-128-gcm none

If for some reason you don't want to use encryption, you can use migrate -c none.  There's not much reason to do that, though.  The default encryption, aes-128-ccm, makes use of hardware crypto instructions found in all of the SPARC and x86 processors that are supported with kernel zones.  In tests, I regularly saturated a 10 gigabit link while migrating a single kernel zone.

One final note.... If you don't like typing the root password every time you migrate, you can also set up key-based authentication between the two machines.  In that case, you will use a command like:

# zoneadm -z <zone> migrate ssh://<remotehost>

Happy secure live migrating! 

Wednesday Jul 08, 2015

Kernel zone suspend now goes zoom!

Solaris 11.2 had the rather nice feature that you can have kernel zones automatically suspend and resume across global zone reboots.  We've made some improvements in this area in Solaris 11.3 to help in the cases where more CPU cycles could make suspend and resume go faster.

As a recap, automatic suspend/resume of kernel zones across global zone reboots can be accomplished by having a suspend resource, setting autoboot=true and autoshutdown=suspend.

# zonecfg -z kz1
zonecfg:kz1> set autoboot=true
zonecfg:kz1> set autoshutdown=suspend
zonecfg:kz1> select suspend
zonecfg:kz1:suspend> info
	path.template: /export/%{zonename}.suspend
	path: /export/kz1.suspend
	storage not specified
zonecfg:kz1:suspend> end
zonecfg:kz1> exit

When a graceful reboot is performed (that is, shutdown -r or init 6), svc:/system/zones:default will suspend the zone as it shuts down and resume it as the system boots.  Obviously, reading from memory and writing to disk tends to saturate the disk bandwidth.  To create a more balanced system, the suspend image is compressed.  While this greatly slows down the write rate, several kernel zones suspending concurrently would still saturate the available bandwidth in typical configurations.  More balanced and faster - good, right?

Well, this more balanced system came at a cost.  When suspending one zone the performance was not so great.  For example, a basic kernel zone with 2 GB of RAM on a T5 ldom shows:

# tail /var/log/zones/kz1.messages
2015-07-08 12:33:15 notice: NOTICE: Zone suspending
2015-07-08 12:33:39 notice: NOTICE: Zone halted
root@vzl-212:~# ls -lh /export/kz1.suspend
-rw-------   1 root     root        289M Jul  8 12:33 /export/kz1.suspend
# bc -l
289 / 24
12.04166666666666666666

Yikes - 12 MB/s to disk.  During this time, I used prstat -mLc -n 5 1 and iostat -xzn and could see that the compression thread in zoneadmd was using 100% of a CPU and the disk had idle times then spurts of being busy as zfs flushed out each transaction group.  Note that this rate of 12 MB/s is artificially low because some other things are going on before and after writing the suspend file that may take up to a couple of seconds.

I then updated my system to the Solaris 11.3 beta release and tried again.  This time things look better.

# zoneadm -z kz1 suspend
# tail /var/log/zones/kz1.messages
2015-07-08 12:59:49 info: Processing command suspend flags 0x0 from ruid/euid/suid 0/0/0 pid 3141
2015-07-08 12:59:49 notice: NOTICE: Zone suspending
2015-07-08 12:59:58 info: Processing command halt flags 0x0 from ruid/euid/suid 0/0/0 pid 0
2015-07-08 12:59:58 notice: NOTICE: Zone halted
# ls -lh /export/kz1.suspend 
-rw-------   1 root     root        290M Jul  8 12:59 /export/kz1.suspend
# echo 290 / 9 | bc -l
32.22222222222222222222

That's better, but not great.  Remember what I said about the rate being artificially low above?  While writing the multi-threaded suspend/resume support, I also created some super secret debug code that gives more visibility into the rate.  That shows:

Suspend raw: 1043 MB in 5.9 sec 177.5 MB/s
Suspend compressed: 289 MB in 5.9 sec 49.1 MB/s
Suspend raw-fast-fsync: 1043 MB in 3.5 sec 299.1 MB/s
Suspend compressed-fast-fsync: 289 MB in 3.5 sec 82.8 MB/s

What this is telling me is that my kernel zone with 2 GB of RAM had 1043 MB that actually needed to be suspended - the rest was blocks of zeroes.  The total suspend time was 5.9 seconds, giving a read-from-memory rate of 177.5 MB/s and a write-to-disk rate of 49.1 MB/s.  The -fsync lines say that if suspend didn't fsync(3C) the suspend file before returning, it would have completed in 3.5 seconds, giving a suspend rate of 82.8 MB/s.  That's looking better.
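
The arithmetic here is just size over time. For example, using the numbers from this post (note that the printed rates come from unrounded internal timings, so hand-computed values differ slightly):

```python
def rate_mb_per_s(mb, seconds):
    """Throughput in MB/s from a size and a duration."""
    return mb / seconds

# The Solaris 11.2 single-threaded baseline: 289 MB written in ~24 s.
print(round(rate_mb_per_s(289, 24), 1))  # ~12.0 MB/s

# Compression ratio of this suspend image: 1043 MB raw -> 289 MB on disk.
print(round(1043 / 289, 1))  # ~3.6x
```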

In another experiment, we aim to make the storage not be the limiting factor. This time, let's do 16 GB of RAM and write the suspend image to /tmp.

# zonecfg -z kz1 info
zonename: kz1
brand: solaris-kz
autoboot: true
autoshutdown: suspend
	ncpus: 12
	physical: 16G
	path: /tmp/kz1.suspend
	storage not specified

To ensure that most of the RAM wasn't just blocks of zeroes (and as such wouldn't be in the suspend file), I created a tar file of /usr in kz1's /tmp and made copies of it until the kernel zone's RAM was rather full.

This time around, we are able to write the 15 GB of active memory in 52.5 seconds.  Notice that this is roughly 15x the amount of memory in only about double the time of our Solaris 11.2 baseline.

Suspend raw: 15007 MB in 52.5 sec 286.1 MB/s
Suspend compressed: 5416 MB in 52.5 sec 103.3 MB/s

While the focus of this entry has been multi-threaded compression during suspend, it's also worth noting that:

  • The suspend image is also encrypted. If someone gets a copy of the suspend image, it doesn't mean that they can read the guest memory.  Oh, and the encryption is multi-threaded as well.
  • Decryption is also multi-threaded.
  • And so is uncompression.  The parallel compression and uncompression code is freely available, even. :)

The performance numbers here should be taken with a grain of salt.  Many factors influence the actual rate you will see.  In particular:

  • Different CPUs have very different performance characteristics.
  • If the zone has a dedicated-cpu resource, only the CPUs that are dedicated to the zone will be used for compression and encryption.
  • More CPUs tend to go faster, but only to a certain point.
  • Various types of storage will perform vastly differently.
  • When many zones are suspending or resuming at the same time, they will compete for resources.
And one last thing... for those of you that are too impatient to wait until Solaris 11.3 to try this out, it is actually in Solaris 11.2 SRU 8 and later.

Shared Storage on NFS for Kernel Zones

In Solaris 11.2, zones could be installed on shared storage (ZOSS) using iSCSI devices.  With Solaris 11.3 Beta, shared storage for kernel zones can also be placed on NFS files.

To set up an NFS SURI (storage URI), you'll need to identify the NFS host, share, and path where the file will be placed, plus the user and group allowed to access the file.  The file does not need to exist, but its parent directory must.  The user and group are specified so a user can control access to their zone storage via NFS.

Then in the zone configuration, you can setup a device (including a boot device) using the NFS SURI that looks like:
    - nfs://user:group@host/NFS_share/path_to_file

If the file does not yet exist, you'll need to specify a size.
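On the NFS server side, the parent directory has to exist and be owned by the user and group named in the SURI. Preparation might look like this sketch (the /test share and user1:staff match the example that follows):

```shell
# Run on the NFS server: create the parent directory for the zone's
# backing files and give it to the user:group named in the SURI.
if [ "$(uname -s)" = SunOS ] && [ -d /test ]; then
    mkdir -p /test/z1kz
    chown user1:staff /test/z1kz 2>/dev/null
    chmod 700 /test/z1kz
else
    echo "skipping: no /test share on this system"
fi
```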

Here's my setup of a 16g file for the zone root on an NFS share "/test" on system "sys1" owned by user "user1". My NFS server has this mode/owner for the directory /test/z1kz:

# ls -ld /test/z1kz
drwx------   2 user1  staff          4 Jun 12 12:36 /test/z1kz 

In zonecfg for the kernel zone "z1kz", select device 0 (the boot device) and set storage and create-size:

zonecfg:z1kz> select device 0
zonecfg:z1kz:device> set storage=nfs://user1:staff@sys1/test/z1kz/z1kz_root
zonecfg:z1kz:device> set create-size=16g
zonecfg:z1kz:device> end
zonecfg:z1kz> info device
device 0:
    match not specified
    storage: nfs://user1:staff@sys1/test/z1kz/z1kz_root 
    create-size: 16g
    id: 0 
    bootpri: 0
zonecfg:z1kz> commit 

To add another device to this kernel zone, do:

zonecfg:z1kz> add device
zonecfg:z1kz:device> set storage=nfs://user1:staff@sys1/test/z1kz/z1kz_disk1 
zonecfg:z1kz:device> set create-size=8g
zonecfg:z1kz:device> end 
zonecfg:z1kz> commit
When installing the kernel zone, use the "-x storage-create-missing" option to create the NFS files owned by user1:staff.
# zoneadm -z z1kz install -x storage-create-missing
<output deleted> 
On my NFS server:
# ls -l /test/z1kz
total 407628 
-rw-------   1 user1  staff    8589934592 Jun 12 12:36 z1kz_disk1
-rw-------   1 user1  staff    17179869184 Jun 12 12:43 z1kz_root 

When the zone is uninstalled, pass the "-x force-storage-destroy-all" option to destroy the NFS files z1kz_root and z1kz_disk1.  Without it, the NFS files will still exist on the NFS server after the zone uninstall.
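Putting the teardown together (destructive, so double-check the zone name before running):

```shell
# Tear down the kernel zone and its NFS-backed storage in one go.
# Destructive: the z1kz_root and z1kz_disk1 files are destroyed too.
if [ "$(uname -s)" = SunOS ] && command -v zoneadm >/dev/null 2>&1; then
    zoneadm -z z1kz uninstall -x force-storage-destroy-all
    zonecfg -z z1kz delete -F    # optionally drop the configuration as well
else
    echo "skipping: not a Solaris system with zones tooling"
fi
```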


Different time in different zones

Ever since zones were introduced way back in Solaris 10, there has been demand for a zone to be able to keep its own time. In Solaris 11.3, that is finally possible, and we deliver it by default for solaris(5) and solaris10(5) branded zones.

For more information on how to enable and use this new feature, click here.
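As a rough sketch of enabling the feature for a native zone: the global-time property and sys_time privilege are how I remember the 11.3 mechanism, so treat the names (and the zone name myzone) as assumptions and verify them against zonecfg(1M) on your system.

```shell
# Assumed 11.3 mechanism: global-time=false detaches the zone's clock,
# and sys_time lets processes in the zone set it. Verify names locally.
if [ "$(uname -s)" = SunOS ] && command -v zonecfg >/dev/null 2>&1; then
    zonecfg -z myzone 'set global-time=false'
    zonecfg -z myzone 'set limitpriv=default,sys_time'
    zoneadm -z myzone reboot
    zlogin myzone date    # date(1) in the zone can now be set independently
else
    echo "skipping: not a Solaris system with zones tooling"
fi
```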

Saturday Feb 28, 2015

One image for native zones, kernel zones, ldoms, metal, ...

In my previous post, I described how to convert a global zone to a non-global zone using a unified archive.  Since then, I've fielded a few questions about whether this same approach can be used to create a master image that is used to install Solaris regardless of virtualization type (including no virtualization).  The answer is: of course!  That was one of the key goals of the project that invented unified archives.

In my earlier example, I was focused on preserving the identity and other aspects of the global zone and knew I had only one place that I planned to deploy it.  Hence, I chose to skip media creation (--exclude-media) and used a recovery archive (-r).  To generate a unified archive of a global zone that is ideal for use as an image for installing to another global zone or native zone, just use a simpler command line.

root@global# archiveadm create /path/to/golden-image.uar

Notice that by using fewer options we get something that is more usable.

What's different about this image compared to the one created in the previous post?

  • This archive has an embedded AI ISO that is used when you install a kernel zone from it.  That is, zoneadm -z kzname install -a /path/to/golden-image.uar will boot from that embedded AI image and perform an automated install from the archive.
  • This archive contains only the active boot environment; other ZFS snapshots are not archived.
  • This archive has been stripped of its identity.  When installing, you either need to provide a sysconfig profile or interactively configure the system or zone on the console or zone console on the first post-installation boot.
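Because the archive is identity-stripped, pairing it with a sysconfig profile makes installs fully hands-off. A sketch, with the paths and zone name purely illustrative:

```shell
# Create a reusable sysconfig profile once, then supply it at install
# time so first boot needs no console interaction.
if [ "$(uname -s)" = SunOS ] && command -v sysconfig >/dev/null 2>&1; then
    sysconfig create-profile -o /path/to/sc.xml
    zoneadm -z kz1 install -a /path/to/golden-image.uar -c /path/to/sc.xml
else
    echo "skipping: not a Solaris system with zones tooling"
fi
```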

Friday Feb 20, 2015

global to non-global conversion with multiple zpools

Suppose you have a global zone with multiple zpools that you would like to convert into a native zone.  You can do that, thanks to unified archives (introduced in Solaris 11.2) and dataset aliasing (introduced in Solaris 11.0).  The source system looks like this:
root@buzz:~# zoneadm list -cv
  ID NAME             STATUS      PATH                         BRAND      IP
   0 global           running     /                            solaris    shared
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  15.9G  4.38G  11.5G  27%  1.00x  ONLINE  -
tank   1008M    93K  1008M   0%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   976M    31K       976M     1%    /tank
root@buzz:~# cat /tank/README
this is tank
Since we are converting a system rather than cloning it, we want a recovery archive, created with the -r option.  Also, since the target is a native zone, there's no need for the unified archive to include media.
root@buzz:~# archiveadm create --exclude-media -r /net/kzx-02/export/uar/p2v.uar
Initializing Unified Archive creation resources...
Unified Archive initialized: /net/kzx-02/export/uar/p2v.uar
Logging to: /system/volatile/archive_log.1014
Executing dataset discovery...
Dataset discovery complete
Preparing archive system image...
Beginning archive stream creation...
Archive stream creation complete
Beginning final archive assembly...
Archive creation complete
Now we will go to the global zone that will have the zone installed.  First, we must configure the zone.  The archive contains a zone configuration that is almost correct, but needs a little help because archiveadm(1M) doesn't know the particulars of where you will deploy it.

Most examples that show configuration of a zone from an archive show the non-interactive mode.  Here we use the interactive mode.
root@vzl-212:~# zonecfg -z p2v
Use 'create' to begin configuring a new zone.
zonecfg:p2v> create -a /net/kzx-02/export/uar/p2v.uar
After the create command completes (in a fraction of a second) we can see the configuration that was embedded in the archive.  I've trimmed out a bunch of uninteresting stuff from the anet interface.
zonecfg:p2v> info
zonename: p2v
zonepath.template: /system/zones/%{zonename}
zonepath: /system/zones/p2v
brand: solaris
autoboot: false
autoshutdown: shutdown
ip-type: exclusive
[max-lwps: 40000]
[max-processes: 20000]
anet:
        linkname: net0
        lower-link: auto
attr:
        name: zonep2vchk-num-cpus
        type: string
        value: "original system had 4 cpus: consider capped-cpu (ncpus=4.0) or dedicated-cpu (ncpus=4)"
attr:
        name: zonep2vchk-memory
        type: string
        value: "original system had 2048 MB RAM and 2047 MB swap: consider capped-memory (physical=2048M swap=4095M)"
attr:
        name: zonep2vchk-net-net0
        type: string
        value: "interface net0 has lower-link set to 'auto'.  Consider changing to match the name of a global zone link."
dataset:
        name: __change_me__/tank
        alias: tank
rctl:
        name: zone.max-processes
        value: (priv=privileged,limit=20000,action=deny)
rctl:
        name: zone.max-lwps
        value: (priv=privileged,limit=40000,action=deny)
In this case, I want to be sure that the zone's network uses a particular global zone interface, so I need to muck with that a bit.
zonecfg:p2v> select anet linkname=net0
zonecfg:p2v:anet> set lower-link=stub0
zonecfg:p2v:anet> end
The zpool list output at the beginning of this post showed that the system had two ZFS pools: rpool and tank.  We need to tweak the configuration to point the tank virtual ZFS pool at the right ZFS file system.  The name in the dataset resource refers to the location in the global zone.  This particular system has a zpool named export - a more basic Solaris installation would probably need to use rpool/export/....  The alias in the dataset resource needs to match the name of the secondary ZFS pool in the archive.
zonecfg:p2v> select dataset alias=tank
zonecfg:p2v:dataset> set name=export/tank/%{zonename}
zonecfg:p2v:dataset> info
        name.template: export/tank/%{zonename}
        name: export/tank/p2v
        alias: tank
zonecfg:p2v:dataset> end
zonecfg:p2v> exit
I did something tricky above - I used a template property to make it easier to clone this zone configuration and have the dataset name point at a different dataset.
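The payoff shows up when cloning: under a new zone name, %{zonename} in name.template resolves to a different dataset with no edits. A sketch follows; the second zone name is illustrative, and whether zonecfg export preserves the template or the resolved value is worth checking on your system.

```shell
# Clone p2v's configuration as p2v2; the dataset name re-resolves from
# the template, so only the new backing dataset needs to be created.
if [ "$(uname -s)" = SunOS ] && command -v zonecfg >/dev/null 2>&1; then
    zonecfg -z p2v export -f /tmp/p2v.cfg
    zonecfg -z p2v2 -f /tmp/p2v.cfg
    zfs create -p -o zoned=on export/tank/p2v2
else
    echo "skipping: not a Solaris system with zones tooling"
fi
```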

Let's try an installation.  NOTE: Before you boot the new zone, be sure the old system is offline, or else you will have IP address conflicts.
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
could not verify zfs dataset export/tank/p2v: filesystem does not exist
zoneadm: zone p2v failed to verify
Oops.  I forgot to create the dataset.  Let's do that.  I use -o zoned=on to prevent the dataset from being mounted in the global zone.  If you forget that, it's no biggy - the system will fix it for you soon enough.
root@vzl-212:~# zfs create -p -o zoned=on export/tank/p2v
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
The following ZFS file system(s) have been created:
Progress being logged to /var/log/zones/zoneadm.20150220T060031Z.p2v.install
    Installing: This may take several minutes...
 Install Log: /system/volatile/install.5892/install_log
 AI Manifest: /tmp/manifest.p2v.YmaOEl.xml
    Zonename: p2v
Installation: Starting ...
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs to rpool/VARSHARE/zones/p2v/rpool
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs to export/tank/p2v
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Archive transfer completed
        Changing target pkg variant. This operation may take a while
Installation: Succeeded
      Zone BE root dataset: rpool/VARSHARE/zones/p2v/rpool/ROOT/solaris-recovery
                     Cache: Using /var/pkg/publisher.
Updating image format
Image format already current.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Syncing packages.
No updates necessary for this image. (zone:p2v)
  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
        Done: Installation completed in 165.355 seconds.
  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
Log saved in non-global zone as /system/zones/p2v/root/var/log/zones/zoneadm.20150220T060031Z.p2v.install
root@vzl-212:~# zoneadm -z p2v boot
After booting we see that everything in the zone is in order.
root@vzl-212:~# zlogin p2v
[Connected to zone 'p2v' pts/3]
Oracle Corporation      SunOS 5.11      11.2    September 2014
root@buzz:~# svcs -x
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  99.8G  66.3G  33.5G  66%  1.00x  ONLINE  -
tank    199G  49.6G   149G  24%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   103G    31K       103G     1%    /tank
root@buzz:~# cat /tank/README
this is tank
root@buzz:~# zonename
p2v
Happy p2v-ing!  Or rather, g2ng-ing.

Thursday Jan 15, 2015

fronting isolated zones

This is a continuation of a series of posts.  While this one may be interesting all on its own, you may want to start from the top to get the context.

In this post, we'll create teeter - the load balancer.  This zone will be a native (solaris brand) zone.  The intent of this arrangement is to make it so that paying customers get served by the zone named premium and the freeloaders have to scrape by with free.  Since that logic is clearly highly dependent on each webapp, I'll take the shortcut of having a more simplistic load balancer.

Once again, we'll configure the zone's networking from the global zone.  This time around both networks get a static configuration - one attached to the red network and the other attached to the global zone's first network interface.

root@global:~# zonecfg -z teeter
Use 'create' to begin configuring a new zone.
zonecfg:teeter> create
zonecfg:teeter> set zonepath=/zones/%{zonename}
zonecfg:teeter> select anet linkname=net0
zonecfg:teeter:anet> set lower-link=balstub0
zonecfg:teeter:anet> set allowed-address=
zonecfg:teeter:anet> set configure-allowed-address=true
zonecfg:teeter:anet> end
zonecfg:teeter> add anet
zonecfg:teeter:anet> set lower-link=net0
zonecfg:teeter:anet> set allowed-address=
zonecfg:teeter:anet> set defrouter=
zonecfg:teeter:anet> set configure-allowed-address=true
zonecfg:teeter:anet> end
zonecfg:teeter> exit

root@global:~# zoneadm -z teeter install
The following ZFS file system(s) have been created:
Progress being logged to /var/log/zones/zoneadm.20150114T222949Z.teeter.install
       Image: Preparing at /zones/teeter/root.
Log saved in non-global zone as /zones/teeter/root/var/log/zones/zoneadm.20150114T222949Z.teeter.install
root@global:~# zoneadm -z teeter boot
root@global:~# zlogin -C teeter
   sysconfig, again.  Gee, I really should have created a sysconfig.xml...

In a solaris brand zone, there are no dependencies that pull in the Apache web server, so it needs to be installed.

root@vzl-212:~# zlogin teeter
[Connected to zone 'teeter' pts/3]
Oracle Corporation    SunOS 5.11    11.2    December 2014
root@teeter:~# pkg install apache-22

Once the web server is installed, we'll configure a simple load balancer using mod_proxy_balancer.

root@teeter:~# cd /etc/apache2/2.2/conf.d/
root@teeter:/etc/apache2/2.2/conf.d# cat > mod_proxy_balancer.conf <<EOF
<Proxy balancer://mycluster>
    BalancerMember http://free
    BalancerMember http://premium
</Proxy>
ProxyPass /test balancer://mycluster
EOF
root@teeter:/etc/apache2/2.2/conf.d# svcadm enable apache22

To see if this is working, we will use a symbolic link on the NFS server to point to a unique file on each of the web servers.  Unless you are trying to paste your output into Oracle's blogging software, you won't need to define $download as I did.

root@global:~# ln -s /tmp/hostname /export/web/hostname
root@global:~# zlogin free 'hostname > /tmp/hostname'
root@global:~# zlogin premium 'hostname > /tmp/hostname'
root@global:~# download=cu; download=${download}rl
root@global:~# $download -s
root@global:~# $download
root@global:~# for i in {1..100}; do \
    $download -s; done| sort | uniq -c
  50 free
  50 premium

This concludes this series.  Surely there are things that I've glossed over and many more interesting things I could have done.  Please leave comments with any questions and I'll try to fill in the details.



  • Mike Gerdts - Principal Software Engineer
  • Lawrence Chung - Software Engineer

