Thursday Jul 30, 2015

Live storage migration for kernel zones

From time to time every sysadmin realizes that something that is consuming a bunch of storage is sitting in the wrong place.  This could be because of a surprise conversion of proof of concept into proof of production or something more positive like ripping out old crusty storage for a nice new Oracle ZFS Storage Appliance. When you use kernel zones with Solaris 11.3, storage migration gets a lot easier.

As our fine manual says:

The Oracle Solaris 11.3 release introduces the Live Zone Reconfiguration feature for Oracle Solaris Kernel Zones. With this feature, you can reconfigure the network and the attached devices of a running kernel zone. Because the configuration changes are applied immediately without requiring a reboot, there is zero downtime service availability within the zone. You can use the standard zone utilities such as zonecfg and zoneadm to administer the Live Zone Reconfiguration. 

Well, we can combine this with other excellent features of Solaris to have no-outage storage migrations, even of the root zpool.

In this example, I have a kernel zone that was created with something like:

root@global:~# zonecfg -z kz1 create -t SYSsolaris-kz
root@global:~# zoneadm -z kz1 install -c <scprofile.xml>

That happened several weeks ago and now I really wish that I had installed it using an iSCSI LUN from my ZFS Storage Appliance. We can fix that with no outage.

First, I'll update the zone's configuration to add a bootable iSCSI disk.

root@global:~# zonecfg -z kz1
zonecfg:kz1> add device
zonecfg:kz1:device> set storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009 
zonecfg:kz1:device> set bootpri=0
zonecfg:kz1:device> end
zonecfg:kz1> exit

Next, I tell the system to add that disk to the running kernel zone.

root@global:~# zoneadm -z kz1 apply
zone 'kz1': Checking: Adding device storage=iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
zone 'kz1': Applying the changes

Let's be sure we can see it and look at the current rpool layout.  Notice that this kernel zone is running Solaris 11.2 - I only need to have Solaris 11.3 in the global zone.

root@global:~# zlogin kz1
[Connected to zone 'kz1' pts/2]
Oracle Corporation      SunOS 5.11      11.2    May 2015
You have new mail.

root@kz1:~# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
          /kz-devices@ff/disk@0
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
          /kz-devices@ff/zvblk@1
Specify disk (enter its number): ^D

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME    STATE     READ WRITE CKSUM
        rpool   ONLINE       0     0     0
          c1d0  ONLINE       0     0     0
errors: No known data errors

Now, zpool replace can be used to migrate the root pool over to the new storage.

root@kz1:~# zpool replace rpool c1d0 c1d1
Make sure to wait until resilver is done before rebooting.

root@kz1:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Thu Jul 30 05:47:50 2015
    4.39G scanned
    143M resilvered at 24.7M/s, 3.22% done, 0h2m to go
config:
        NAME           STATE     READ WRITE CKSUM
        rpool          DEGRADED     0     0     0
          replacing-0  DEGRADED     0     0     0
            c1d0       ONLINE       0     0     0
            c1d1       DEGRADED     0     0     0  (resilvering)
errors: No known data errors

After a couple minutes, that completes.

root@kz1:~# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 4.39G in 0h2m with 0 errors on Thu Jul 30 05:49:57 2015
config:
        NAME    STATE     READ WRITE CKSUM
        rpool   ONLINE       0     0     0
          c1d1  ONLINE       0     0     0
errors: No known data errors

root@kz1:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  15.9G  4.39G  11.5G  27%  1.00x  ONLINE  -

You may have noticed in the format output that I'm replacing a 16 GB zvol with a 120 GB disk.  However, the size of the zpool reported above doesn't reflect that it's on a bigger disk.  Let's fix that by setting the autoexpand property. 

root@kz1:~# zpool get autoexpand rpool
NAME   PROPERTY    VALUE  SOURCE
rpool  autoexpand  off    default

root@kz1:~# zpool set autoexpand=on rpool

root@kz1:~# zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  120G  4.39G  115G   3%  1.00x  ONLINE  -

To finish this off, all we need to do is remove the old disk from the kernel zone's configuration.  This happens back in the global zone.

root@global:~# zonecfg -z kz1
zonecfg:kz1> info device
device 0:
	match not specified
	storage: iscsi://zfssa/luname.naa.600144F0DBF8AF19000053879E9C0009
	id: 1
	bootpri: 0
device 1:
	match not specified
	storage.template: dev:/dev/zvol/dsk/%{global-rootzpool}/VARSHARE/zones/%{zonename}/disk%{id}
	storage: dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
	id: 0
	bootpri: 0
zonecfg:kz1> remove device id=0
zonecfg:kz1> exit

Now, let's apply that configuration. To show what it does, I run format in kz1 before and after applying the configuration.

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
          /kz-devices@ff/disk@0
       1. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
          /kz-devices@ff/zvblk@1
Specify disk (enter its number): 

root@global:~# zoneadm -z kz1 apply 
zone 'kz1': Checking: Removing device storage=dev:/dev/zvol/dsk/rpool/VARSHARE/zones/kz1/disk0
zone 'kz1': Applying the changes

root@global:~# zlogin kz1 format </dev/null
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1d1 <SUN-ZFS Storage 7120-1.0-120.00GB>
          /kz-devices@ff/zvblk@1
Specify disk (enter its number): 

root@global:~# 

At this point the live (no outage) storage migration is complete and it is safe to destroy the old disk (rpool/VARSHARE/zones/kz1/disk0).

root@global:~# zfs destroy rpool/VARSHARE/zones/kz1/disk0

Tuesday Jul 28, 2015

A trip down memory lane

In Scott Lynn's announcement of Oracle's membership in the Open Container Initiative, he gives a great summary of how Solaris virtualization got to the point it's at.  Quite an interesting read!

Thursday Jul 09, 2015

Secure multi-threaded live migration for kernel zones

As mentioned in the What's New document, Solaris 11.3 now supports live migration for kernel zones!  Let's try it out.

As mentioned in the fine manual, live migration requires the use of zones on shared storage (ZOSS) and a few other things. In Solaris 11.2, we could use logical units (i.e., fibre channel) or iSCSI.  Always living on the edge, I decided to try out the new ZOSS NFS feature.  Since the previous post did such a great job of explaining how to set it up, I won't go into the details.  Here's what my zone configuration looks like:

zonecfg:mig1> info
zonename: mig1
brand: solaris-kz
...
anet 0:
        ...
device 0:
	match not specified
	storage.template: nfs://zoss:zoss@kzx-05/zones/zoss/%{zonename}.disk%{id}
	storage: nfs://zoss:zoss@kzx-05/zones/zoss/mig1.disk0
	id: 0
	bootpri: 0
virtual-cpu:
	ncpus: 4
capped-memory:
	physical: 4G
keysource:
	raw redacted

And the zone is running.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                                        

In order for live migration to work, the kz-migr and rad:remote services need to be online.  They are disabled by default.

# svcadm enable -s svc:/system/rad:remote svc:/network/kz-migr:stream
# svcs svc:/system/rad:remote svc:/network/kz-migr:stream
STATE          STIME    FMRI
online          6:40:12 svc:/network/kz-migr:stream
online          6:40:12 svc:/system/rad:remote

While these services are only needed on the remote end, I enable them on both sides because there's a pretty good chance that I will migrate kernel zones in both directions.  Now we are ready to perform the migration.  I'm migrating mig1 from vzl-216 to vzl-212.  Both vzl-216 and vzl-212 are logical domains on T5s.

root@vzl-216:~# zoneadm -z mig1 migrate vzl-212
Password: 
zoneadm: zone 'mig1': Importing zone configuration.
zoneadm: zone 'mig1': Attaching zone.
zoneadm: zone 'mig1': Booting zone in 'migrating-in' mode.
zoneadm: zone 'mig1': Checking migration compatibility.
zoneadm: zone 'mig1': Starting migration.
zoneadm: zone 'mig1': Suspending zone on source host.
zoneadm: zone 'mig1': Waiting for migration to complete.
zoneadm: zone 'mig1': Migration successful.
zoneadm: zone 'mig1': Halting and detaching zone on source host.

Afterwards, we see that the zone is now configured on vzl-216 and running on vzl-212.

root@vzl-216:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             configured                                    
root@vzl-212:~# zoneadm -z mig1 list -s
NAME             STATUS           AUXILIARY STATE                               
mig1             running                

Ok, cool.  But what really happened?  During the migration, I was also running tcptop, one of our demo dtrace scripts.  Unfortunately, it doesn't print the pretty colors: I added those so we can see what's going on.

root@vzl-216:~# dtrace -s /usr/demo/dtrace/tcptop.d
Sampling... Please wait.
...

2015 Jul  9 06:50:30,  load: 0.10,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613 10.134.18.216      22 10.134.18.202   48168       112
     0   2640 10.134.18.216   60773 10.134.18.212   12302       137
     0    613 10.134.18.216      22 10.134.18.202   60194       336

2015 Jul  9 06:50:35,  load: 0.10,  TCPin:      0 Kb,  TCPout: 832420 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613 10.134.18.216      22 10.134.18.202   48168       208
     0   2640 10.134.18.216   60773 10.134.18.212   12302       246
     0    613 10.134.18.216      22 10.134.18.202   60194       480
     0   2640 10.134.18.216   45661 10.134.18.212    8102      8253
     0   2640 10.134.18.216   41441 10.134.18.212    8102 418467721
     0   2640 10.134.18.216   59051 10.134.18.212    8102 459765481

...

2015 Jul  9 06:50:50,  load: 0.41,  TCPin:      1 Kb,  TCPout: 758608 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0   2640 10.134.18.216   60773 10.134.18.212   12302       388
     0    613 10.134.18.216      22 10.134.18.202   60194       544
     0    613 10.134.18.216      22 10.134.18.202   48168       592
     0   2640 10.134.18.216   45661 10.134.18.212    8102    119032
     0   2640 10.134.18.216   59051 10.134.18.212    8102 151883984
     0   2640 10.134.18.216   41441 10.134.18.212    8102 620449680

2015 Jul  9 06:50:55,  load: 0.48,  TCPin:      0 Kb,  TCPout:      0 Kb
  ZONE    PID LADDR           LPORT RADDR           RPORT      SIZE
     0    613 10.134.18.216      22 10.134.18.202   60194       736
^C

In the first sample, we see that vzl-216 (10.134.18.216) has established a RAD connection to vzl-212.  We know it is RAD because it is over port 12302.  RAD is used to connect the relevant zone migration processes on the two machines.  One connection between the zone migration processes is used for orchestrating various aspects of the migration.  There are two others that are used for synchronizing the memory between the machines.  In each of the samples, there is also some traffic from a couple of ssh sessions I have between vzl-216 and another machine.

As the amount of kernel zone memory increases, the number of connections will also increase.  Currently that scaling factor is one connection per 2 GB of kernel zone memory, with an upper limit based on the number of CPUs in the machine.  The scaling is limited by the number of CPUs because each connection corresponds to a sending and a receiving thread. Those threads are responsible for encrypting and decrypting the traffic.  The multiple connections can work nicely with IPMP's outbound load sharing and/or link aggregations to spread the load across multiple physical network links. The algorithm for selecting the number of connections may change from time to time, so don't be surprised if your observations don't match what is shown above.
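
As a rough back-of-the-envelope illustration of that rule (the numbers are made up, and as noted the algorithm may differ in your release), a 16 GB kernel zone on a host with plenty of CPUs would get roughly eight memory-synchronization connections:

root@vzl-216:~# echo "16 / 2" | bc        # one connection per 2 GB of zone memory
8
root@vzl-216:~# psrinfo | wc -l           # the CPU count caps the connection count
      32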

All of the communication between the two machines is encrypted.  The RAD connection (in this case) is encrypted with TLS, as described in rad(1M).  This RAD connection supports a series of calls that are used to negotiate various things, including encryption parameters for connections to kz-migr (port 8102).  You have control over the encryption algorithm used with the -c <cipher> option to zoneadm migrate.  You can see the list of available ciphers with:

root@vzl-216:~# zoneadm -z mig1 migrate -c list vzl-216
Password: 
source ciphers: aes-128-ccm aes-128-gcm none
destination ciphers: aes-128-ccm aes-128-gcm none

If for some reason you don't want to use encryption, you can use migrate -c none.  There's not much reason to do that, though.  The default encryption, aes-128-ccm, makes use of hardware crypto instructions found in all of the SPARC and x86 processors that are supported with kernel zones.  In tests, I regularly saturated a 10 gigabit link while migrating a single kernel zone.

One final note.... If you don't like typing the root password every time you migrate, you can also set up key-based authentication between the two machines.  In that case, you will use a command like:

# zoneadm -z <zone> migrate ssh://<remotehost>
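
If you go that route, the setup is ordinary ssh key distribution.  A minimal sketch (assuming root logins are allowed on the destination and reusing the host names from above) might look like:

root@vzl-216:~# ssh-keygen -t rsa
root@vzl-216:~# cat ~/.ssh/id_rsa.pub | ssh root@vzl-212 'cat >> ~/.ssh/authorized_keys'
root@vzl-216:~# zoneadm -z mig1 migrate ssh://vzl-212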

Happy secure live migrating! 

Wednesday Jul 08, 2015

Kernel zone suspend now goes zoom!

Solaris 11.2 had the rather nice feature that you can have kernel zones automatically suspend and resume across global zone reboots.  We've made some improvements in this area in Solaris 11.3 to help in the cases where more CPU cycles could make suspend and resume go faster.

As a recap, automatic suspend/resume of kernel zones across global zone reboots can be accomplished by having a suspend resource, setting autoboot=true and autoshutdown=suspend.

# zonecfg -z kz1
zonecfg:kz1> set autoboot=true
zonecfg:kz1> set autoshutdown=suspend
zonecfg:kz1> select suspend
zonecfg:kz1:suspend> info
suspend:
	path.template: /export/%{zonename}.suspend
	path: /export/kz1.suspend
	storage not specified
zonecfg:kz1:suspend> end
zonecfg:kz1> exit

When a graceful reboot is performed (that is, shutdown -r or init 6), svc:/system/zones:default will suspend the zone as it shuts down and resume it as the system boots.  Obviously, reading from memory and writing to disk tends to saturate the disk bandwidth.  To create a more balanced system, the suspend image is compressed.  While this greatly slows down the write rate, several kernel zones that were concurrently suspending would still saturate available bandwidth in typical configurations.  More balanced and faster - good, right?

Well, this more balanced system came at a cost.  When suspending one zone the performance was not so great.  For example, a basic kernel zone with 2 GB of RAM on a T5 ldom shows:

# tail /var/log/zones/kz1.messages
...
2015-07-08 12:33:15 notice: NOTICE: Zone suspending
2015-07-08 12:33:39 notice: NOTICE: Zone halted
root@vzl-212:~# ls -lh /export/kz1.suspend
-rw-------   1 root     root        289M Jul  8 12:33 /export/kz1.suspend
# bc -l
289 / 24
12.04166666666666666666

Yikes - 12 MB/s to disk.  During this time, I used prstat -mLc -n 5 1 and iostat -xzn and could see that the compression thread in zoneadmd was using 100% of a CPU and the disk had idle times then spurts of being busy as zfs flushed out each transaction group.  Note that this rate of 12 MB/s is artificially low because some other things are going on before and after writing the suspend file that may take up to a couple of seconds.

I then updated my system to the Solaris 11.3 beta release and tried again.  This time things look better.

# zoneadm -z kz1 suspend
# tail /var/log/zones/kz1.messages
...
2015-07-08 12:59:49 info: Processing command suspend flags 0x0 from ruid/euid/suid 0/0/0 pid 3141
2015-07-08 12:59:49 notice: NOTICE: Zone suspending
2015-07-08 12:59:58 info: Processing command halt flags 0x0 from ruid/euid/suid 0/0/0 pid 0
2015-07-08 12:59:58 notice: NOTICE: Zone halted
# ls -lh /export/kz1.suspend 
-rw-------   1 root     root        290M Jul  8 12:59 /export/kz1.suspend
# echo 290 / 9 | bc -l
32.22222222222222222222

That's better, but not great.  Remember what I said about the rate being artificially low above?  While writing the multi-threaded suspend/resume support, I also created some super secret debug code that gives more visibility into the rate.  That shows:

Suspend raw: 1043 MB in 5.9 sec 177.5 MB/s
Suspend compressed: 289 MB in 5.9 sec 49.1 MB/s
Suspend raw-fast-fsync: 1043 MB in 3.5 sec 299.1 MB/s
Suspend compressed-fast-fsync: 289 MB in 3.5 sec 82.8 MB/s

What this is telling me is that my kernel zone with 2 GB of RAM had 1043 MB that actually needed to be suspended - the rest was blocks of zeroes.  The total suspend time was 5.9 seconds, giving a read-from-memory rate of 177.5 MB/s and a write-to-disk rate of 49.1 MB/s.  The -fsync lines are saying that if suspend didn't fsync(3C) the suspend file before returning, it would have completed in 3.5 seconds, giving a suspend rate of 82.8 MB/s.  That's looking better.

In another experiment, the aim is to keep the storage from being the limiting factor.  This time, let's use 16 GB of RAM and write the suspend image to /tmp.

# zonecfg -z kz1 info
zonename: kz1
brand: solaris-kz
autoboot: true
autoshutdown: suspend
...
virtual-cpu:
	ncpus: 12
capped-memory:
	physical: 16G
suspend:
	path: /tmp/kz1.suspend
	storage not specified

To ensure that most of the RAM wasn't just blocks of zeroes (and as such wouldn't be in the suspend file), I created a tar file of /usr in kz1's /tmp and made copies of it until the kernel zone's RAM was rather full.
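
Something along these lines does the trick; the number of copies is arbitrary, and since /tmp in the zone is swap-backed, each copy chews up more of the kernel zone's memory:

root@kz1:~# tar cf /tmp/usr.tar /usr
root@kz1:~# for i in 1 2 3 4 5 6 7 8 9; do cp /tmp/usr.tar /tmp/usr.tar.$i; done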

This time around, we are seeing that we are able to write the 15 GB of active memory in 52.5 seconds.  Notice that this is roughly 15x the amount of memory in only double the time from our Solaris 11.2 baseline.

Suspend raw: 15007 MB in 52.5 sec 286.1 MB/s
Suspend compressed: 5416 MB in 52.5 sec 103.3 MB/s

While the focus of this entry has been multi-threaded compression during suspend, it's also worth noting that:

  • The suspend image is also encrypted. If someone gets a copy of the suspend image, it doesn't mean that they can read the guest memory.  Oh, and the encryption is multi-threaded as well.
  • Decryption is also multi-threaded.
  • And so is uncompression.  The parallel compression and uncompression code is freely available, even. :)

The performance numbers here should be taken with a grain of salt.  Many factors influence the actual rate you will see.  In particular:

  • Different CPUs have very different performance characteristics.
  • If the zone has a dedicated-cpu resource, only the CPUs that are dedicated to the zone will be used for compression and encryption.
  • More CPUs tend to go faster, but only to a certain point.
  • Various types of storage will perform vastly differently.
  • When many zones are suspending or resuming at the same time, they will compete for resources.

And one last thing... for those of you that are too impatient to wait until Solaris 11.3 to try this out, it is actually in Solaris 11.2 SRU 8 and later.

Saturday Feb 28, 2015

One image for native zones, kernel zones, ldoms, metal, ...

In my previous post, I described how to convert a global zone to a non-global zone using a unified archive.  Since then, I've fielded a few questions about whether this same approach can be used to create a master image that is used to install Solaris regardless of virtualization type (including no virtualization).  The answer is: of course!  That was one of the key goals of the project that invented unified archives.

In my earlier example, I was focused on preserving the identity and other aspects of the global zone and knew I had only one place that I planned to deploy it.  Hence, I chose to skip media creation (--exclude-media) and used a recovery archive (-r).  To generate a unified archive of a global zone that is ideal for use as an image for installing to another global zone or native zone, just use a simpler command line.

root@global# archiveadm create /path/to/golden-image.uar

Notice that by using fewer options we get something that is more usable.

What's different about this image compared to the one created in the previous post?

  • This archive has an embedded AI ISO that will be used if you install a kernel zone from it.  That is, zoneadm -z kzname install -a /path/to/golden-image.uar will boot from that embedded AI image and perform an automated install from that archive.
  • This archive only contains the active boot environment and other ZFS snapshots are not archived.
  • This archive has been stripped of its identity.  When installing, you either need to provide a sysconfig profile (as sketched after this list) or interactively configure the system or zone on its console on the first post-installation boot.
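
For example, installing a new kernel zone from the golden image with a sysconfig profile (the zone name and paths here are hypothetical) keeps the first boot completely hands-off:

root@global# zonecfg -z newkz create -t SYSsolaris-kz
root@global# zoneadm -z newkz install -a /path/to/golden-image.uar -c /path/to/sc_profile.xml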

Friday Feb 20, 2015

global to non-global conversion with multiple zpools

Suppose you have a global zone with multiple zpools that you would like to convert into a native zone.  You can do that, thanks to unified archives (introduced in Solaris 11.2) and dataset aliasing (introduced in Solaris 11.0).  The source system looks like this:

root@buzz:~# zoneadm list -cv
  ID NAME             STATUS      PATH                         BRAND      IP
   0 global           running     /                            solaris    shared
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  15.9G  4.38G  11.5G  27%  1.00x  ONLINE  -
tank   1008M    93K  1008M   0%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   976M    31K       976M     1%    /tank
root@buzz:~# cat /tank/README
this is tank

Since we are converting a system rather than cloning it, we want a recovery archive, so we use the -r option.  Also, since the target is a native zone, there's no need for the unified archive to include media.

root@buzz:~# archiveadm create --exclude-media -r /net/kzx-02/export/uar/p2v.uar
Initializing Unified Archive creation resources...
Unified Archive initialized: /net/kzx-02/export/uar/p2v.uar
Logging to: /system/volatile/archive_log.1014
Executing dataset discovery...
Dataset discovery complete
Preparing archive system image...
Beginning archive stream creation...
Archive stream creation complete
Beginning final archive assembly...
Archive creation complete

Now we will go to the global zone that will have the zone installed.  First, we must configure the zone.  The archive contains a zone configuration that is almost correct, but needs a little help because archiveadm(1M) doesn't know the particulars of where you will deploy it.

Most examples of configuring a zone from an archive show the non-interactive mode.  Here we use the interactive mode.

root@vzl-212:~# zonecfg -z p2v
Use 'create' to begin configuring a new zone.
zonecfg:p2v> create -a /net/kzx-02/export/uar/p2v.uar

After the create command completes (in a fraction of a second) we can see the configuration that was embedded in the archive.  I've trimmed out a bunch of uninteresting stuff from the anet interface.

zonecfg:p2v> info
zonename: p2v
zonepath.template: /system/zones/%{zonename}
zonepath: /system/zones/p2v
brand: solaris
autoboot: false
autoshutdown: shutdown
bootargs:
file-mac-profile:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
tenant:
fs-allowed:
[max-lwps: 40000]
[max-processes: 20000]
anet:
        linkname: net0
        lower-link: auto
    [snip]
attr:
        name: zonep2vchk-num-cpus
        type: string
        value: "original system had 4 cpus: consider capped-cpu (ncpus=4.0) or dedicated-cpu (ncpus=4)"
attr:
        name: zonep2vchk-memory
        type: string
        value: "original system had 2048 MB RAM and 2047 MB swap: consider capped-memory (physical=2048M swap=4095M)"
attr:
        name: zonep2vchk-net-net0
        type: string
        value: "interface net0 has lower-link set to 'auto'.  Consider changing to match the name of a global zone link."
dataset:
        name: __change_me__/tank
        alias: tank
rctl:
        name: zone.max-processes
        value: (priv=privileged,limit=20000,action=deny)
rctl:
        name: zone.max-lwps
        value: (priv=privileged,limit=40000,action=deny)

In this case, I want to be sure that the zone's network uses a particular global zone interface, so I need to muck with that a bit.

zonecfg:p2v> select anet linkname=net0
zonecfg:p2v:anet> set lower-link=stub0
zonecfg:p2v:anet> end

The zpool list output in the beginning of this post showed that the system had two ZFS pools: rpool and tank.  We need to tweak the configuration to point the tank virtual ZFS pool to the right ZFS file system.  The name in the dataset resource refers to the location in the global zone.  This particular system has a zpool named export - a more basic Solaris installation would probably need to use rpool/export/....  The alias in the dataset resource needs to match the name of the secondary ZFS pool in the archive.

zonecfg:p2v> select dataset alias=tank
zonecfg:p2v:dataset> set name=export/tank/%{zonename}
zonecfg:p2v:dataset> info
dataset:
        name.template: export/tank/%{zonename}
        name: export/tank/p2v
        alias: tank
zonecfg:p2v:dataset> end
zonecfg:p2v> exit

I did something tricky above - I used a template property to make it easier to clone this zone configuration and have the dataset name point at a different dataset.
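
To illustrate, the configuration can be replayed for a second zone (the name p2v2 is made up) and the dataset's name.template should resolve against the new zone name:

root@vzl-212:~# zonecfg -z p2v export | zonecfg -z p2v2 -f -
root@vzl-212:~# zonecfg -z p2v2 info dataset     # name should now resolve to export/tank/p2v2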

Let's try an installation.  NOTE: Before you get around to booting the new zone, be sure the old system is offline, or else you will have IP address conflicts.

root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
could not verify zfs dataset export/tank/p2v: filesystem does not exist
zoneadm: zone p2v failed to verify

Oops.  I forgot to create the dataset.  Let's do that.  I use -o zoned=on to prevent the dataset from being mounted in the global zone.  If you forget that, it's no biggy - the system will fix it for you soon enough.

root@vzl-212:~# zfs create -p -o zoned=on export/tank/p2v
root@vzl-212:~# zoneadm -z p2v install -a /net/kzx-02/export/uar/p2v.uar
The following ZFS file system(s) have been created:
    rpool/VARSHARE/zones/p2v
Progress being logged to /var/log/zones/zoneadm.20150220T060031Z.p2v.install
    Installing: This may take several minutes...
 Install Log: /system/volatile/install.5892/install_log
 AI Manifest: /tmp/manifest.p2v.YmaOEl.xml
    Zonename: p2v
Installation: Starting ...
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs to rpool/VARSHARE/zones/p2v/rpool
        Commencing transfer of stream: 0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs to export/tank/p2v
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-1.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Completed transfer of stream: '0f048163-2943-cde5-cb27-d46914ec6ed3-0.zfs' from file:///net/kzx-02/export/uar/p2v.uar
        Archive transfer completed
        Changing target pkg variant. This operation may take a while
Installation: Succeeded
      Zone BE root dataset: rpool/VARSHARE/zones/p2v/rpool/ROOT/solaris-recovery
                     Cache: Using /var/pkg/publisher.
Updating image format
Image format already current.
  Updating non-global zone: Linking to image /.
Processing linked: 1/1 done
  Updating non-global zone: Syncing packages.
No updates necessary for this image. (zone:p2v)
  Updating non-global zone: Zone updated.
                    Result: Attach Succeeded.
        Done: Installation completed in 165.355 seconds.
  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.
Log saved in non-global zone as /system/zones/p2v/root/var/log/zones/zoneadm.20150220T060031Z.p2v.install
root@vzl-212:~# zoneadm -z p2v boot

After booting we see that everything in the zone is in order.

root@vzl-212:~# zlogin p2v
[Connected to zone 'p2v' pts/3]
Oracle Corporation      SunOS 5.11      11.2    September 2014
root@buzz:~# svcs -x
root@buzz:~# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  99.8G  66.3G  33.5G  66%  1.00x  ONLINE  -
tank    199G  49.6G   149G  24%  1.00x  ONLINE  -
root@buzz:~# df -h /tank
Filesystem             Size   Used  Available Capacity  Mounted on
tank                   103G    31K       103G     1%    /tank
root@buzz:~# cat /tank/README
this is tank
root@buzz:~# zonename
p2v
root@buzz:~#

Happy p2v-ing!  Or rather, g2ng-ing.

Wednesday Jan 14, 2015

stamping out web servers

This is a continuation of a series of posts.  While this one may be interesting all on its own, you may want to start from the top to get the context.

The diagram above shows one global zone with a few zones in it.  That's not very exciting in a world where we need to rapidly provision new instances that are preconfigured and as hack-proof as we can make them.  This post will show how to create a unified archive that includes the kernel zone configuration and content that makes for a hard-to-compromise web server.  I'd like to say impossible, but history has shown us that software has bugs that affect everyone across the industry.

We'll start off by configuring and installing a kernel zone called web.  It will have two automatic networks, each attached to the appropriate etherstub.  Notice the use of template properties - using %{zonename} and %{id} makes it so that we don't have to futz with so much of the configuration when we configure the next zone based on this one.

root@global:~# zonecfg -z web
Use 'create' to begin configuring a new zone.
zonecfg:web> create -t SYSsolaris-kz
zonecfg:web> select device id=0
zonecfg:web:device> set storage=dev:zvol/dsk/zones/%{zonename}/disk%{id}
zonecfg:web:device> end
zonecfg:web> select anet id=0
zonecfg:web:anet> set lower-link=balstub0
zonecfg:web:anet> set allowed-address=192.168.1.2/24
zonecfg:web:anet> set configure-allowed-address=true
zonecfg:web:anet> end
zonecfg:web> add anet
zonecfg:web:anet> set lower-link=internalstub0
zonecfg:web:anet> set allowed-dhcp-cids=%{zonename}
zonecfg:web:anet> end
zonecfg:web> info
zonename: web
brand: solaris-kz
autoboot: false
autoshutdown: shutdown
bootargs:
pool:
scheduling-class:
hostid: 0xdf87388
tenant:
anet:
        lower-link: balstub0
        allowed-address: 192.168.1.2/24
        configure-allowed-address: true
        defrouter not specified
        allowed-dhcp-cids not specified
        link-protection: "mac-nospoof, ip-nospoof"
        ...
        id: 0
anet:
        lower-link: internalstub0
        allowed-address not specified
        configure-allowed-address: true
        defrouter not specified
        allowed-dhcp-cids.template: %{zonename}
        allowed-dhcp-cids: web
        link-protection: mac-nospoof
        ...
        id: 1
device:
        match not specified
        storage.template: dev:zvol/dsk/zones/%{zonename}/disk%{id}
        storage: dev:zvol/dsk/zones/web/disk0
        id: 0
        bootpri: 0
capped-memory:
        physical: 2G
zonecfg:web> exit
root@global:~# zoneadm -z web install
Progress being logged to /var/log/zones/zoneadm.20150114T193808Z.web.install
pkg cache: Using /var/pkg/publisher.
 Install Log: /system/volatile/install.4391/install_log
 AI Manifest: /tmp/zoneadm3808.vTayai/devel-ai-manifest.xml
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
Installation: Starting ...

        Creating IPS image
        Installing packages from:
            solaris
                origin:  file:///export/repo/11.2/repo/
        The following licenses have been accepted and not displayed.
        Please review the licenses for the following packages post-install:
          consolidation/osnet/osnet-incorporation
        Package licenses may be viewed using the command:
          pkg info --license <pkg_fmri>

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            451/451   63686/63686  579.9/579.9    0B/s

PHASE                                          ITEMS
Installing new actions                   86968/86968
Updating package state database                 Done
Updating package cache                           0/0
Updating image state                            Done
Creating fast lookup database                   Done
Installation: Succeeded
        Done: Installation completed in 431.445 seconds.

root@global:~# zoneadm -z web boot
root@global:~# zlogin -C web
        Perform sysconfig.  Allow networking to be configured automatically.
        ~~.   (one ~ for ssh, one for zlogin -C)
root@global:~# zlogin web

At this point, networking inside the zone should look like this:

root@web:~# ipadm
NAME              CLASS/TYPE STATE        UNDER      ADDR
lo0               loopback   ok           --         --
   lo0/v4         static     ok           --         127.0.0.1/8
   lo0/v6         static     ok           --         ::1/128
net0              ip         ok           --         --
   net0/v4        inherited  ok           --         192.168.1.2/24
net1              ip         ok           --         --
   net1/v4        dhcp       ok           --         192.168.0.2/24
   net1/v6        addrconf   ok           --         <IPv6addr>

Configure the NFS mounts for web content (/web) and IPS repos (/repo).

root@web:~# cat >> /etc/vfstab
192.168.0.1:/export/repo/11.2/repo - /repo      nfs     -       yes     -
192.168.0.1:/export/web -       /web    nfs     -       yes     -
^D
root@web:~# svcadm enable -r nfs/client

Now, update the pkg image configuration so that it uses the repository from the correct path. 

root@web:~# pkg set-publisher -O file:///repo/ solaris

Update the apache configuration so that it looks to /web for the document root. 

root@web:~# vi /etc/apache2/2.2/httpd.conf

This involves (at a minimum) changing DocumentRoot to "/web" and changing the <Directory "/var/apache/2.2/htdocs"> line to <Directory "/web">.  Your needs will be different and probably more complicated.  This is not an Apache tutorial and I'm not qualified to give it.  After modifying the configuration file, start the web server.
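
For reference, the relevant bits of httpd.conf end up looking roughly like this (the directives inside the Directory block are just the stock 2.2 defaults and are not the point here):

DocumentRoot "/web"

<Directory "/web">
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>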

root@web:~# svcadm enable -r svc:/network/http:apache22

This is a good time to do any other configuration (users, other software, etc.) that you need.  If you did the changes above really quickly, you may also want to wait for first boot tasks like man-index to complete.  Allowing it to complete now will mean that it doesn't need to be redone for every instance of this zone that you create.
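
If you would rather check on those first-boot tasks than guess, the stock SMF services can be queried directly; for example:

root@web:~# svcs application/man-index
root@web:~# svcs -x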

Since this is a type of a zone that shouldn't need to have its configuration changed a whole lot, let's use the immutable global zone (IMGZ) feature to lock down the web zone.  Note that we use IMGZ inside a kernel zone because a kernel zone is another global zone.

root@web:~# zonecfg -z global set file-mac-profile=fixed-configuration
updating /platform/sun4v/boot_archive
root@web:~# init 6

Back in the global zone, we are ready to create a clone archive once the zone reboots.

root@global:~# archiveadm create -z web /export/web.uar
Initializing Unified Archive creation resources...
Unified Archive initialized: /export/web.uar
Logging to: /system/volatile/archive_log.6835
Executing dataset discovery...
Dataset discovery complete
Creating install media for zone(s)...
Media creation complete
Preparing archive system image...
Beginning archive stream creation...
Archive stream creation complete
Beginning final archive assembly...
Archive creation complete

Now that the web clone unified archive has been created, it can be used on this machine or any other with a similar global zone configuration (etherstubs of same names, dhcp server, same nfs exports, etc.) to quickly create new web servers that fit the model described in the diagram at the top of this post.  To create the free kernel zone:

root@global:~# zonecfg -z free
Use 'create' to begin configuring a new zone.
zonecfg:free> create -a /export/web.uar
zonecfg:free> select anet id=0
zonecfg:free:anet> set allowed-address=192.168.1.3/24
zonecfg:free:anet> end
zonecfg:free> select capped-memory
zonecfg:free:capped-memory> set physical=4g
zonecfg:free:capped-memory> end
zonecfg:free> add virtual-cpu
zonecfg:free:virtual-cpu> set ncpus=2
zonecfg:free:virtual-cpu> end
zonecfg:free> exit

If I were doing this for a purpose other than this blog post, I would have also created a sysconfig profile and passed it to zoneadm install.  This would have made the first boot completely hands-off.
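
Had I done that, it would have looked something like this (the profile directory is made up):

root@global:~# sysconfig create-profile -o /export/web-sc
root@global:~# zoneadm -z free install -a /export/web.uar -c /export/web-sc/sc_profile.xml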

root@global:~# zoneadm -z free install -a /export/web.uar
...
root@global:~# zoneadm -z free boot
root@global:~# zlogin -C free
[Connected to zone 'free' console]
   run sysconfig because I didn't do zoneadm install -c sc_profile.xml
SC profile successfully generated as:
/etc/svc/profile/sysconfig/sysconfig-20150114-210014/sc_profile.xml
...

Once we log into free, we see that there's no more setup to do.

root@free:~# df -h -F nfs
Filesystem             Size   Used  Available Capacity  Mounted on
192.168.0.1:/export/repo/11.2/repo
                       194G    36G       158G    19%    /repo
192.168.0.1:/export/web
                       158G    31K       158G     1%    /web
root@free:~# ipadm
NAME              CLASS/TYPE STATE        UNDER      ADDR
lo0               loopback   ok           --         --
   lo0/v4         static     ok           --         127.0.0.1/8
   lo0/v6         static     ok           --         ::1/128
net0              ip         ok           --         --
   net0/v4        inherited  ok           --         192.168.1.3/24
net1              ip         ok           --         --
   net1/v4        dhcp       ok           --         192.168.0.3/24
   net1/v6        addrconf   ok           --         fe80::8:20ff:fed0:5eb/10

Nearly identical steps can be taken with the deployment of premium.  The key difference there is that we are dedicating two cores (add dedicated-cpu; set cores=...) rather than allocating virtual CPUs (add virtual-cpu; set ncpus=...).  That is, no one else can use any of the CPUs on premium's cores, but free has to compete with the rest of the system for CPU time.

root@global:~# psrinfo -t
socket: 0
  core: 201457665
    cpus: 0-7
  core: 201654273
    cpus: 8-15
  core: 201850881
    cpus: 16-23
  core: 202047489
    cpus: 24-31
root@global:~# zonecfg -z premium
zonecfg:premium> create -a /export/web.uar
zonecfg:premium> select anet id=0
zonecfg:premium:anet> set allowed-address=192.168.1.4/24
zonecfg:premium:anet> end
zonecfg:premium> select capped-memory
zonecfg:premium:capped-memory> set physical=8g
zonecfg:premium:capped-memory> end
zonecfg:premium> add dedicated-cpu
zonecfg:premium:dedicated-cpu> set cores=201850881,202047489
zonecfg:premium:dedicated-cpu> end
zonecfg:premium> exit

The install and boot of premium will then be the same as that of free.  After both zones are up, we can see that psrinfo reports the number of cores for premium but not for free.

root@global:~# zlogin free psrinfo -pv
The physical processor has 2 virtual processors (0-1)
  SPARC-T5 (chipid 0, clock 3600 MHz)
root@global:~# zlogin free prtconf | grep Memory
Memory size: 4096 Megabytes
root@global:~# zlogin premium psrinfo -pv
The physical processor has 2 cores and 16 virtual processors (0-15)
  The core has 8 virtual processors (0-7)
  The core has 8 virtual processors (8-15)
    SPARC-T5 (chipid 0, clock 3600 MHz)
root@global:~# zlogin premium prtconf | grep Memory
Memory size: 8192 Megabytes

That's enough for this post.  Next time, we'll get teeter going.

in-the-box NFS and pkg repository

This is the third in a series of short blog entries.  If you are new to the series, I suggest you start from the top.

As shown in our system diagram, the kernel zones have no direct connection to the outside world.  This will make it quite hard for them to apply updates.  To get past that, we will set up a pkg repository in the global zone and export it via NFS to the zones.  I won't belabor the topic of local IPS repositories, because our fine doc writers have already covered that.

As a quick summary, I first created a zfs file system for the repo.  On this system, export is a separate pool with its topmost dataset mounted at /export.  By default /export is the rpool/export dataset - you may need to adjust commands to match your system.

root@global:~# zfs create -p export/repo/11.2

I then followed the procedure in MOS Doc ID 1928542.1 for creating a Solaris 11.2 repository, including all of the SRUs.  That resulted in having a repo with the solaris publisher at /export/repo/11.2/repo.

Since I have a local repo for the kernel zones to use, I figured the global zone may as well use it too.

root@global:~# pkg set-publisher -O  file:///export/repo/11.2/repo/ solaris

To make this publisher accessible (read-only) to the zones on the 192.168.0.0/24 network, it needs to be NFS exported.

root@global:~# share -F nfs -o ro=@192.168.0.0/24 /export/repo/11.2/repo

Now I'll get ahead of myself a bit - I've not actually covered the installation of the free or premium zones yet.  Let's pretend we have a kernel zone called web and we want the repository to be accessible at /repo in web.

root@global:~# zlogin web
root@web:~# vi /etc/vfstab
   (add an entry)
root@web:~# grep /repo /etc/vfstab
192.168.0.1:/export/repo/11.2/repo - /repo      nfs     -       yes     -
root@web:~# svcadm enable -r nfs/client

If svc:/network/nfs/client was already enabled, use mount /repo instead of svcadm enable.  Once /repo is mounted, update the solaris publisher.

root@web:~# pkg set-publisher -O  file:///repo/ solaris

In this example, we also want to have some content shared from the global zone into each of the web zones.  To make that possible:

root@global:~# zfs create export/web
root@global:~# share -F nfs -o ro=@192.168.0.0/24 /export/web

Again, this is exported read-only to the zones.  Adjust for your own needs.

That's it for this post.  Next time we'll create a unified archive that can be used for quickly stamping out lots of web zones.

in-the-box networking

In my previous post, I described a scenario where a couple networks are needed to shuffle NFS and web traffic between a few zones.  In this post, I'll describe the configuration of the networking.  As a reminder, here's the configuration we are after.


The green (192.168.0.0/24) network is used for the two web server zones that need to connect to services in the global zone.  The red (192.168.1.0/24) network is used for communication between the load balancer and the web servers.  The basis for this simplistic in-the-box network is an etherstub.

The red network is a bit simpler than the green network, so we'll start with that.

root@global:~# dladm create-etherstub balstub0

That's it!  The (empty) network has been created by simply creating an etherstub.  As zones are configured to use balstub0 as the lower-link in their anet resources, they will attach to this network.

The green network is just a little bit more involved because there will be services (DHCP and NFS) in the global zone that will use this network.

root@global:~# dladm create-etherstub internalstub0
root@global:~# dladm create-vnic -l internalstub0 internal0
root@global:~# ipadm create-ip internal0
root@global:~# ipadm create-addr -T static -a 192.168.0.1/24 internal0
internal0/v4

That wasn't a lot harder.  What we did here was create an etherstub named internalstub0.  On top of it, we created a vnic called internal0, attached an IP interface to it, then set a static IP address.

As was mentioned in the introductory post, we'll have DHCP manage the IP address allocation for zones that use internalstub0.  Setup of that is pretty straight-forward too.

root@global:~# cat > /etc/inet/dhcpd4.conf
default-lease-time 86400;
log-facility local7;
subnet 192.168.0.0 netmask 255.255.255.0 {
    range 192.168.0.2 192.168.0.254;
    option broadcast-address 192.168.0.255;
}
^D
root@global:~# svcadm enable svc:/network/dhcp/server:ipv4

The real Solaris blogs junkie will recognize this as a simplified version of something from the Maine office.

kernel zones vs. fs resources

Traditionally, zones have had a way to perform loopback mounts of global zone file systems using zonecfg fs resources with fstype=lofs.  Since kernel zones run a separate kernel, lofs is not really an option.  Other file systems can be safely presented to kernel zones by delegating the devices, so the omission of fs resources is not as bad as it may initially sound.
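
For example, a zvol (or a real disk) can be handed to a kernel zone with a device resource and used for whatever file system the zone wants to put on it.  A minimal sketch, with a made-up zvol path:

root@global:~# zfs create -V 10g rpool/export/freedata
root@global:~# zonecfg -z free
zonecfg:free> add device
zonecfg:free:device> set storage=dev:/dev/zvol/dsk/rpool/export/freedata
zonecfg:free:device> end
zonecfg:free> exit

Once the zone is booted with that configuration, the disk shows up inside free and can hold its own zpool or other file system.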

For those that really need something that works like a lofs fs resource, there's a way to simulate the functionality with NFS.  Consider the following system.


It has a native zone, teeter, that has load balancing software in it and two kernel zones, free and premium. The idea behind this contrived example is that freeloaders get directed to the small kernel zone and the paying customers use the resources of the bigger zone.  Neither free nor premium are directly connected to the outside world.  They use NFS served from the global zone for web content and package repositories.  When it comes time to update the content seen by all of the web zones, the admin only needs to update it in one place.  It is quite acceptable for this one place to be on the NFS server - that is, in the global zone.

To simplify the configuration of each zone, all network configuration is performed in the global zone with zonecfg.  For example, this is the configuration of free.

root@global:~# zonecfg -z free export
create -b
set brand=solaris-kz
set autoboot=false
set autoshutdown=shutdown
set hostid=0x19df23f0
add anet
set lower-link=internalstub0
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=auto
set id=1
end
add anet
set lower-link=balstub0
set allowed-address=192.168.1.3/24
set configure-allowed-address=true
set link-protection=mac-nospoof
set mac-address=auto
set id=0
end
...

This example shows two ways to handle the network configuration from the global zone. The green network is configured using internalstub0 as the lower link. A DHCP server is configured in the global zone to dynamically assign addresses to all the zones that have a vnic on that network. The red network uses balstub0 as the lower link.  Because the load balancer will need to be configured to use specific IP addresses for the free and premium zones, allocating those addresses from a dynamic address range seems prone to trouble.

In the next few blog entries, I'll cover the in-the-box networking, the in-the-box NFS and pkg repository, and stamping out web servers from a unified archive.

Monday Jun 02, 2014

Oops, I left my kernel zone configuration behind!

Most people use boot environments to move in one direction.  A system starts with an initial installation and from time to time new boot environments are created - typically as a result of pkg update - and then the new BE is booted.  This post is of little interest to those people as no hackery is needed.  This post is about some mild hackery.

During development, I commonly test different scenarios across multiple boot environments.  Many times, those tests aren't related to the act of configuring or installing zones, so it's kinda handy to avoid the effort involved in zone configuration and installation.  A somewhat common order of operations is like the following:

# beadm create -e golden -a test1
# reboot

Once the system is running in the test1 BE, I install a kernel zone.

# zonecfg -z a178 create -t SYSsolaris-kz
# zoneadm -z a178 install

Time passes, and I do all kinds of stuff to the test1 boot environment and want to test other scenarios in a clean boot environment.  So then I create a new one from my golden BE and reboot into it.

# beadm create -e golden -a test2
# reboot

Since the test2 BE was created from the golden BE, it doesn't have the configuration for the kernel zone that I configured and installed.  Getting that zone over to the test2 BE is pretty easy.  My test1 BE is really known as s11fixes-2.

root@vzl-212:~# beadm mount s11fixes-2 /mnt
root@vzl-212:~# zonecfg -R /mnt -z a178 export | zonecfg -z a178 -f -
root@vzl-212:~# beadm unmount s11fixes-2
root@vzl-212:~# zoneadm -z a178 attach
root@vzl-212:~# zoneadm -z a178 boot

On the face of it, it would seem as though it would have been easier to just use zonecfg -z a178 create -t SYSsolaris-kz within the test2 BE to get the new configuration over.  That would almost work, but it would have left behind the encryption key required for access to host data and any suspend image.  See solaris-kz(5) for more info on host data.  I very commonly have more complex configurations that contain many storage URIs and non-default resource controls.  Retyping them would be rather tedious.

Thursday May 08, 2014

No time like the future

Zones have forever allowed different time zones.  Kernel zones kick that up to 11 (or is that 11.2?) with the ability to have an entirely different time in the zone.  To be clear, this only works with kernel zones.  You can see from the output below that the zone in use has brand solaris-kz.

root@kzx-05:~# zoneadm -z junk list -v
  ID NAME             STATUS      PATH                         BRAND      IP    
  12 junk             running     -                            solaris-kz excl  

By default, the clocks between the global zone and a kernel zone are in sync.  We'll use console logging to show that...

root@kzx-05:~# zlogin -C junk
[Connected to zone 'junk' console]

vzl-178 console login: root
Password: 
Last login: Fri Apr 18 13:58:01 on console
Oracle Corporation      SunOS 5.11      11.2    April 2014
root@vzl-178:~# date "+%Y-%m-%d %T"
2014-04-18 13:58:57
root@vzl-178:~# exit
logout

vzl-178 console login: ~.
[Connection to zone 'junk' console closed]
root@kzx-05:~# tail /var/log/zones/junk.console 
2014-04-18 13:58:45 vzl-178 console login: root
2014-04-18 13:58:46 Password: 
2014-04-18 13:58:48 Last login: Fri Apr 18 13:58:01 on console
2014-04-18 13:58:48 Oracle Corporation      SunOS 5.11      11.2    April 2014
2014-04-18 13:58:48 root@vzl-178:~# date "+%Y-%m-%d %T"
2014-04-18 13:58:57 2014-04-18 13:58:57
2014-04-18 13:58:57 root@vzl-178:~# exit
2014-04-18 13:59:04 logout
2014-04-18 13:59:04 
2014-04-18 13:59:04 vzl-178 console login: root@kzx-05:~# 

Notice that the time stamp on the log matches what we see in the output of date.  Now let's pretend that it is next year.

root@kzx-05:~# zlogin junk
root@vzl-178:~# date  0101002015
Thursday, January  1, 2015 12:20:00 AM PST

And let's be sure that it still thinks it's 2015 in the zone:

root@kzx-05:~# date; zlogin junk date; date
Friday, April 18, 2014 02:05:53 PM PDT
Thursday, January  1, 2015 12:20:18 AM PST
Friday, April 18, 2014 02:05:54 PM PDT

And the time offset survives a reboot.

root@kzx-05:~# zoneadm -z junk reboot
root@kzx-05:~# date; zlogin junk date; date
Friday, April 18, 2014 02:09:18 PM PDT
Thursday, January  1, 2015 12:23:43 AM PST
Friday, April 18, 2014 02:09:18 PM PDT

So, what's happening under the covers?  When the date is set in the kernel zone, the offset between the kernel zone's clock and the global zone's clock is stored in the kernel zone's host data.  See solaris-kz(5) for a description of host data.  Whenever a kernel zone boots, the kernel zone's clock is initialized based on this offset.

Wednesday May 07, 2014

Unified Archives

I had previously mentioned that I've spent a bit of time working on Unified Archives - Solaris 11's answer to Flash Archives.  I was going to write up a blog entry or two on it, but it turns out the staff at the Maine office has already done a bang-up job of that.

And, of course, you should check out the related video.

Zones Console Logs

You know how there's that thing that you've been meaning to do for a long time but never quite get to it?  And then one day something just pushes you over the edge?  This is a short story of being pushed over the edge.

As I was working on kernel zones, I rarely used zlogin -C to get to the console.  And then I'd get a panic.  If the panic happened early enough in boot that the dump device wasn't enabled, I'd lose all traces of what went wrong.  That pushed me over the edge - time to implement console logging!

With Solaris 11.2, there's now a zone console log for all zone brands: you can find it at /var/log/zones/zonename.console.

root@kzx-05:~# zoneadm -z junk boot
root@kzx-05:~# tail /var/log/zones/junk.console 
2014-04-18 12:59:08 syncing file systems... done
2014-04-18 12:59:11 
2014-04-18 12:59:11 [NOTICE: Zone halted]
2014-04-18 13:00:56 
2014-04-18 13:00:56 [NOTICE: Zone booting up]
2014-04-18 13:00:59 Boot device: disk0  File and args: 
2014-04-18 13:00:59 reading module /platform/i86pc/amd64/boot_archive...done.
2014-04-18 13:00:59 reading kernel file /platform/i86pc/kernel/amd64/unix...done.
2014-04-18 13:01:00 SunOS Release 5.11 Version 11.2 64-bit
2014-04-18 13:01:00 Copyright (c) 1983, 2014, Oracle and/or its affiliates. All rights reserved.

The output above will likely generate a few questions.  Let me try to answer those.

Will this log passwords typed on the console?

Generally, passwords are entered in a way that they aren't echoed to the terminal.  The console log only contains the characters written from the zone to the terminal - that is it logs the echoes.  Characters you type that are never printed are never logged.

Can just anyone read the console log file?

No.  You need to be root in the global zone to read the console log file.

How is log rotation handled?

Rules have been added to logadm.conf(4) to handle weekly log rotation.
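
I won't reproduce the exact rule here, but a weekly-rotation entry in logadm.conf(4) generally looks something like this (illustrative only, not necessarily what ships):

/var/log/zones/*.console -p 1w -C 8 -c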

I see a time stamp, but what time zone is that?

All time stamps are in the same time zone as svc:/system/zones:default.  That should be the same as is reported by:

root@kzx-05:~# svcprop -p timezone/localtime timezone
US/Pacific

Really, what does the time stamp mean?

It was the time that the first character of the line was written to the terminal.  This means that if a line contains a shell prompt, the time is the time that the prompt was printed, not the time that the person finished entering a command.

I see other files in /var/log/zones.  What are they?

zonename.messages contains various diagnostic information from zoneadmd.  If all goes well, you will never need to look at that.

zoneadm.* contains log files from attach, install, clone, and uninstall operations.  These files have existed since Solaris 11 first launched.

Tuesday May 06, 2014

A tour of a kernel zone

In my earlier post, I showed how to configure and install a kernel zone.  In this post, we'll take a look at this kernel zone.

The kernel zone was installed within an LDom on a T5-4.

root@vzl-212:~# prtdiag -v | head -2
System Configuration:  Oracle Corporation  sun4v SPARC T5-4
Memory size: 65536 Megabytes
root@vzl-212:~# psrinfo | wc -l
      32

The kernel zone was configured with:

root@vzl-212:~# zonecfg -z myfirstkz create -t SYSsolaris-kz

Let's take a look at the resulting configuration.

root@vzl-212:~# zonecfg -z myfirstkz info | cat -n
     1    zonename: myfirstkz
     2    brand: solaris-kz
     3    autoboot: false
     4    autoshutdown: shutdown
     5    bootargs: 
     6    pool: 
     7    scheduling-class: 
     8    hostid: 0x2b2044c5
     9    tenant: 
    10    anet:
    11        lower-link: auto
    12        allowed-address not specified
    13        configure-allowed-address: true
    14        defrouter not specified
    15        allowed-dhcp-cids not specified
    16        link-protection: mac-nospoof
    17        mac-address: auto
    18        mac-prefix not specified
    19        mac-slot not specified
    20        vlan-id not specified
    21        priority not specified
    22        rxrings not specified
    23        txrings not specified
    24        mtu not specified
    25        maxbw not specified
    26        rxfanout not specified
    27        vsi-typeid not specified
    28        vsi-vers not specified
    29        vsi-mgrid not specified
    30        etsbw-lcl not specified
    31        cos not specified
    32        evs not specified
    33        vport not specified
    34        id: 0
    35    device:
    36        match not specified
    37        storage: dev:/dev/zvol/dsk/rpool/VARSHARE/zones/myfirstkz/disk0
    38        id: 0
    39        bootpri: 0
    40    capped-memory:
    41        physical: 2G
    42    suspend:
    43        path: /system/zones/myfirstkz/suspend
    44        storage not specified
    45    keysource:
    46        raw redacted

There are a number of things to notice in this configuration.

  • No zonepath.  Kernel zones install into real or virtual disks - much like the way that logical domains install into real or virtual disks.  The virtual disk(s) that contain the root zfs pool are specified by one or more device resources that contain a bootpri property (line 39).  By default, a kernel zone's root disk is a 16 GB zfs volume in the global zone's root zfs pool (a quick look at that volume follows this list).  There's more about this in the solaris-kz(5) man page.  It's never been a good idea to directly copy things into a zone's zonepath.  With kernel zones, that just doesn't work.
  • The device resource accepts storage URIs (line 37).  See suri(5).  Storage URIs were introduced in Solaris 11.1 in support of Zones on Shared Storage (rootzpool and zpool resources).  This comes in really handy when a kernel zone is installed on external storage and may be migrated between hosts from time to time.
  • The device resource has an id property (line 38).  This means that this disk will be instance 0 of zvblk - which will translate into it being c1d0.  We'll see more of that in a bit.
  • The anet resource has an id property (line 34).  This means that this anet will be instance 0 of zvnet - which will normally be seen as net0.  Again, more of that in a bit.
  • A memory resource control, capped-memory, is set by default (lines 40 - 41).  In the solaris or solaris10 brand, this would mean that rcapd is used to place a soft limit on the amount of physical memory a zone can use.  Kernel zones are different: not only is this a hard limit on the amount of physical memory the kernel zone can use, the memory is also allocated and reserved immediately as the zone boots.
  • A suspend resource is present, which defines the location where a suspend file is written when zoneadm -z zonename suspend is invoked.
  • The keysource resource is used for an encryption key that is used to encrypt suspend images and host data.  solaris-kz(5) has more info on this.
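
Since the backing store for the root disk came up in the first bullet, here's a quick look at it from the global zone.  The dataset name is just the device path from line 37 with dev:/dev/zvol/dsk/ stripped off, and the output below is illustrative of what you should expect to see:

root@vzl-212:~# zfs list -o name,volsize rpool/VARSHARE/zones/myfirstkz/disk0
NAME                                  VOLSIZE
rpool/VARSHARE/zones/myfirstkz/disk0      16G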

There are several things not shown here that may also be of interest:

  • Previously, autoshutdown (line 4) allowed halt and shutdown as values.  It now also supports suspend, for kernel zones only.  As you may recall, autoshutdown is used by svc:/system/zones:default when it is transitioning from online to offline.  If set to halt, the zone (kernel or otherwise) is brought down abruptly.  If set to shutdown, a graceful shutdown is performed.  Now, if a kernel zone has it set to suspend, the kernel zone will be suspended as svc:/system/zones:default goes offline.  When zoneadm boot is issued for a suspended zone, the zone is resumed.  (A quick example of setting this follows the list.)
  • If there are multiple device resources that have bootpri set (i.e. bootable devices), zoneadm install will add all of the boot devices to a mirrored root zpool.
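
Here's that autoshutdown example.  Setting the value is a zonecfg one-liner; with it in place, the kernel zone is suspended rather than shut down when svc:/system/zones:default goes offline.

root@vzl-212:~# zonecfg -z myfirstkz set autoshutdown=suspend
root@vzl-212:~# zonecfg -z myfirstkz info autoshutdown
autoshutdown: suspend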

As described in the earlier blog entry, this kernel zone was booted and sysconfig was performed.  Let's look inside.

To get into the zone, you can use zlogin just like you do with any other zone.

root@vzl-212:~# zlogin myfirstkz
[Connected to zone 'myfirstkz' pts/3]
Oracle Corporation      SunOS 5.11      11.2    April 2014
root@myfirstkz:~# 

As I alluded to above, a kernel zone gets a fixed amount of memory.  The value reported by prtconf below matches the value shown in the capped-memory resource in the zone configuration.

root@myfirstkz:~# prtconf | grep ^Memory
Memory size: 2048 Megabytes
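
If 2G turns out to be too small, the cap can be raised by selecting the existing capped-memory resource in zonecfg.  This is just a sketch - I'd expect the new value to take effect the next time the zone boots:

root@vzl-212:~# zonecfg -z myfirstkz
zonecfg:myfirstkz> select capped-memory
zonecfg:myfirstkz:capped-memory> set physical=4g
zonecfg:myfirstkz:capped-memory> end
zonecfg:myfirstkz> exit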

By default, a kernel zone gets one virtual cpu.  You can adjust this with the virtual-cpu or dedicated-cpu zonecfg resources.  See solaris-kz(5).

root@myfirstkz:~# psrinfo
0       on-line   since 04/18/2014 22:39:22
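
For example, to give the zone four virtual cpus instead of one, add a virtual-cpu resource with the ncpus property described in solaris-kz(5).  As with the memory cap, this sketch assumes the change takes effect at the zone's next boot:

root@vzl-212:~# zonecfg -z myfirstkz
zonecfg:myfirstkz> add virtual-cpu
zonecfg:myfirstkz:virtual-cpu> set ncpus=4
zonecfg:myfirstkz:virtual-cpu> end
zonecfg:myfirstkz> exit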

Because a kernel zone runs its own kernel, it does not require that packages be kept in sync between the global zone and the kernel zone.  Notice that the pkg publisher output does not say (syspub) - the kernel zone and the global zone can even use different publishers for the solaris repository.  As SRUs and updates start to roll out, you will see that you can update the global zone and the kernel zones on it independently.

root@myfirstkz:~# pkg publisher
PUBLISHER                   TYPE     STATUS P LOCATION
solaris                     origin   online F http://internal-ips-repo.example.com/
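
For instance, the kernel zone could be pointed at a different origin and updated on its own schedule without touching the global zone.  The repository URL below is purely illustrative:

root@myfirstkz:~# pkg set-publisher -G '*' -g http://other-repo.example.com/ solaris
root@myfirstkz:~# pkg update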

Because a kernel zone runs its own kernel, it considers itself to be a global zone.

root@myfirstkz:~# zonename
global

The root disk that I mentioned above shows up as c1d0.

root@myfirstkz:~# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1d0 <kz-vDisk-ZVOL-16.00GB>
          /kz-devices@ff/disk@0
Specify disk (enter its number): ^D

And the anet shows up as net0 using physical device zvnet0.

root@myfirstkz:~# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         1000   full      zvnet0

Let's jump on the console and see what happens when bad things happen...

root@myfirstkz:~# logout

[Connection to zone 'myfirstkz' pts/3 closed]

root@vzl-212:~# zlogin -C myfirstkz
[Connected to zone 'myfirstkz' console]

myfirstkz console login: root
Password: 
Apr 18 23:47:06 myfirstkz login: ROOT LOGIN /dev/console
Last login: Fri Apr 18 23:32:28 on kz/term
Oracle Corporation      SunOS 5.11      11.2    April 2014
root@myfirstkz:~# dtrace -wn 'BEGIN { panic() }'
dtrace: description 'BEGIN ' matched 1 probe

panic[cpu0]/thread=c4001afbd720: dtrace: panic action at probe dtrace:::BEGIN (ecb c400123381e0)

000002a10282acd0 dtrace:dtrace_probe+c54 (252acb8f029b3, 0, 0, 33fe, c4001b75e000, 103215b2)
  %l0-3: 0000c400123381e0 0000c40019b82340 00000000000013fc 0000c40016889740
  %l4-7: 0000c4001bc00000 0000c40019b82370 0000000000000003 000000000000ff00
000002a10282af10 dtrace:dtrace_state_go+4ac (c40019b82340, 100, 0, c40019b82370, 16, 702a7040)
  %l0-3: 0000000000030000 0000000010351580 0000c4001b75e000 00000000702a7000
  %l4-7: 0000000000000000 0000000df8475800 0000000000030d40 00000000702a6c00
000002a10282aff0 dtrace:dtrace_ioctl+ad8 (2c, 612164be40, 2a10282bacc, 202003, c400162fcdc0, 64747201)
  %l0-3: 000000006474720c 0000c40019b82340 000002a10282b1a4 00000000ffffffff
  %l4-7: 00000000702a6ee8 00000000702a7100 0000000000000b18 0000000000000180
000002a10282b8a0 genunix:fop_ioctl+d0 (c40019647a40, 0, 612164be40, 202003, c400162fcdc0, 2a10282bacc)
  %l0-3: 000000006474720c 0000000000202003 0000000001374f2c 0000c40010d84180
  %l4-7: 0000000000000000 0000000000000000 00000000000000c0 0000000000000000
000002a10282b970 genunix:ioctl+16c (3, 6474720c, 612164be40, 3, 1fa5ac, 0)
  %l0-3: 0000c4001a5ea958 0000000010010000 0000000000002003 0000000000000000
  %l4-7: 0000000000000003 0000000000000004 0000000000000000 0000000000000000

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel sections: zfs
 0:04  90% done (kernel)
 0:05 100% done (zfs)
100% done: 127783 (kernel) + 12950 (zfs) pages dumped, dump succeeded
rebooting...
Resetting...

[NOTICE: Zone rebooting]
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.


SPARC T5-4, No Keyboard
Copyright (c) 1998, 2014, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.36.0, 2.0000 GB memory available, Serial #723535045.
Ethernet address 0:0:0:0:0:0, Host ID: 2b2044c5.



Boot device: disk0  File and args: 
SunOS Release 5.11 Version 11.2 64-bit
Copyright (c) 1983, 2014, Oracle and/or its affiliates. All rights reserved.
Hostname: myfirstkz
Apr 18 23:48:44 myfirstkz savecore: System dump time: Fri Apr 18 23:47:42 2014
Apr 18 23:48:44 myfirstkz savecore: Saving compressed system crash dump files in directory /var/crash

myfirstkz console login: Apr 18 23:49:02 myfirstkz savecore: Decompress all crash dump files with '(cd /var/crash && savecore -v 0)' or individual files with 'savecore -vf /var/crash/vmdump{,-<secname>}.0'

SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Apr 18 23:49:07 CDT 2014
PLATFORM: SPARC-T5-4, CSN: unknown, HOSTNAME: myfirstkz
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: f4c0d684-da80-425f-e45c-97bd0239b154
DESC: The system has rebooted after a kernel panic.

After disconnecting from the console (~.) I was back at the global zone root prompt.  The global zone didn't panic - the kernel zone did.

root@vzl-212:~# uptime; zlogin myfirstkz uptime
  9:53pm  up  8:03,  2 users,  load average: 0.03, 0.12, 0.08
 11:52pm  up 5 min(s),  0 users,  load average: 0.04, 0.26, 0.15

That's the end of this tour.  Thanks for coming, and please come again!

About

Contributors:

  • Mike Gerdts - Principal Software Engineer
  • Lawrence Chung - Software Engineer
  • More coming soon!
