Tuesday Jul 07, 2015

Virtual Address Reservation in Solaris 11.3

For applications that need to place memory at fixed locations in their address space (like the Oracle SGA), Solaris 11.3 adds a new feature called Virtual Address Reservation that supports such fixed address mappings. Today, a fixed address mapping can fail if the system has already assigned a mapping to the desired location. Because the system is free to choose any unused region in a process' address space for mapping things such as libraries, such conflicts can arise.  Worse yet, if the application uses mmap(2) with MAP_FIXED, the call succeeds but any existing mapping in that range can be destroyed.

Virtual Address Reservation in Solaris 11.3 provides the means to 'reserve' a portion of a process' address space, which prevents the system from using the reserved space for mapping operations that don't specify a fixed address. VA Reservations thus guarantee that fixed address mappings within the reserved range succeed.

Creating a VA Reservation requires that the application be recompiled with a Mapfile (version 2) containing the RESERVE_SEGMENT directive, which specifies the virtual address range to reserve. Multiple RESERVE_SEGMENT directives can be specified in the Mapfile to create multiple VA Reservations. The Mapfile below reserves the VA range from 0x300000000 to 0x300400000.

# cat Mapfile

$mapfile_version 2

RESERVE_SEGMENT myReservedVaName {
        VADDR = 0x300000000;
        SIZE = 0x400000;
};

# cc file.c -MMapfile -m64

On execution of the resulting a.out binary, the specified virtual address range is reserved early in process startup, before libraries are mapped. Running pmap(1) on the process shows its VA reservation(s), which appear in the pmap output as "[ reserved ]".


0000000100000000        32K r-x----  /a.out
0000000100106000         8K rwx----  /a.out
0000000300000000      4096K -------  [ reserved ]
FFFFFFFF7F200000      2112K r-x----  /lib/sparcv9/libc.so.1
...


To use the reserved space, the application simply needs to specify a fixed address that corresponds to the Reserved VA range on calls to either mmap(2) or shmat(2).

Please note that VA Reservation only addresses possible conflicts related to fixed address mappings.  Applications that use fixed address mappings should be well aware of other potential problems. For instance, my example above (on SPARC) reserves address space starting at 0x300000000.  This could cause malloc failures if the process is memory intensive and the heap needs to grow larger than about 8G (the heap starts at around 0x100106000 and cannot grow past 0x300000000).
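The 8G figure is simple subtraction, sketched below with the two addresses quoted above. The function name heap_headroom is hypothetical.

```c
#include <stdint.h>

/*
 * Bytes available to a heap that grows upward from heap_start before it
 * would run into a reservation beginning at reserved_start.
 * Returns 0 if the reservation starts at or below the heap.
 */
uint64_t
heap_headroom(uint64_t heap_start, uint64_t reserved_start)
{
	return (reserved_start > heap_start ? reserved_start - heap_start : 0);
}
```

For the example, heap_headroom(0x100106000, 0x300000000) is 0x1FFEFA000 bytes, which is just under 8 GB.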

APIs for handling per-thread signals in Solaris

Introduction
-------------
Solaris 11.3 introduces the following APIs to allow one process to interact directly
with a specific thread in a different process.

int proc_thr_kill(pid_t pid, pthread_t thread, int sig);
int proc_thr_sigqueue(pid_t pid, pthread_t thread, int sig, const union sigval value);
int proc_thr_sigqueue_wait(pid_t pid, pthread_t thread, int sig, const union sigval value,
    const struct timespec *timeout);

These APIs are patterned after the process-directed signal APIs kill(2),
sigqueue(3C) and sigqueue_wait(3C). They do not change the basics of signal
generation and reception: there is still no guarantee that the target has
received the signal, because that depends on whether the target has blocked
or ignored it.

Use Case
---------
These APIs can be used in any multi-process, multi-threaded application in which
threads of cooperating processes handle specific tasks and need to receive
signals. An example is a network I/O application in which each thread in a
process handles one connection. In such a scenario, a thread-directed signal can
force a specific thread to clean up and abort after an error, or ask it to dump
status or debug information. The process or threads that receive the signals must
implement a signal handler that performs the desired action (abort, dump) in
response to the specific signal.

What does this mean for you?
Threads of two independent, cooperating Solaris processes can now send and receive
signals on a per-thread basis.

Document Reference
-------------------
See man pages for
 - proc_thr_kill(3C)
 - proc_thr_sigqueue(3C)
 - proc_thr_sigqueue_wait(3C)

PV IPoIB in Kernel Zones in Solaris 11.3

Paravirtualization of IP over InfiniBand (IPoIB) in kernel zones is a
new feature in S11.3 that enhances the network virtualization offering in Solaris.
It allows existing IP applications in the guest to run over InfiniBand
fabrics. Features such as kernel zone Live Migration and IPMP are supported
with paravirtualized IPoIB datalinks, which makes it an appealing option.

Moreover, device management of these guest datalinks is similar to that of their
Ethernet counterparts, making them straightforward to configure and manage. In the
host, zonecfg is used to configure the kernel zone's automatic network interface
(anet): the link of the IB HCA port to paravirtualize and assign as the
lower-link, the Partition Key (P_Key) within the IB fabric, and the link mode,
which can be either IPoIB-CM or IPoIB-UD.

The PV IPoIB datalink is a front-end guest driver emulating an IPoIB VNIC
in the host, created over a physical IB partition datalink per P_Key and port.

Creating a PV IPoIB datalink in a kernel zone is fairly simple. The
following example walks through the steps.

1. Find the IB datalink in the host to paravirtualize. 

I am selecting net7 for this example.

# ibadm
HCA             TYPE      STATE     IOV    ZONE
hermon0         physical  online    off    global

# dladm show-ib
LINK      HCAGUID        PORTGUID       PORT STATE   GWNAME       GWPORT   PKEYS
net5      21280001A0D220 21280001A0D222 2    up      --           --       8001,FFFF
net7      21280001A0D220 21280001A0D221 1    up      --           --       8001,FFFF
                                                  
# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         1000   full      igb0
net2              Ethernet             unknown    0      unknown   igb2
net3              Ethernet             unknown    0      unknown   igb3
net1              Ethernet             unknown    0      unknown   igb1
net4              Ethernet             up         10     full      usbecm0
net5              Infiniband           up         32000  full      ibp1
net7              Infiniband           up         32000  full      ibp0

2. Create an IPoIB PV datalink for a kernel zone.
To add an IPoIB PV interface to a kernel zone, say tzone1, use zonecfg to add
an anet and set its lower-link and pkey, which are mandatory properties.
If no link mode is specified, IPoIB-CM is the default.

# zonecfg -z tzone1
    zonecfg:tzone1> add anet
    zonecfg:tzone1:anet> set lower-link=net7
    zonecfg:tzone1:anet> set pkey=0xffff
    zonecfg:tzone1:anet> info
    anet 1:
        lower-link: net7
        ...
        pkey: 0xffff
        linkmode not specified
        evs not specified
        vport not specified
        iov: off
        lro: auto
        id: 1
    ...
    zonecfg:tzone1>exit
#

3. Add further IPoIB PV datalinks to the kernel zone.
Additional IPoIB PV interfaces, each with a lower-link and pkey, can be added
to the kernel zone as shown above. These datalinks can be used exclusively
to host native zones within the kernel zone.

4. The PV IPoIB datalinks appear within the kernel zone on boot.

root@tzone1:~# dladm 
LINK                CLASS     MTU    STATE    OVER
net1                phys      65520  up       --
net0                phys      65520  up       --

root@tzone1:~# ipadm
NAME              CLASS/TYPE STATE        UNDER      ADDR
lo0               loopback   ok           --         --
   lo0/v4         static     ok           --         127.0.0.1/8
   lo0/v6         static     ok           --         ::1/128
net0              ip         ok           --         --
   net0/v4        static     ok           --         1.1.1.190/24
net1              ip         ok           --         --
   net1/v4        static     ok           --         2.2.2.190/24

The Virtual NICs (VNICs) tzone1/net0 and tzone1/net1, which are the backend
of the PV interfaces, are created in the host kernel.

# dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
tzone1/net1     net7           32000  80:0:0:4d:fe:..   fixed       PKEY:0xffff
tzone1/net0     net7           32000  80:0:0:4e:fe:..   fixed       PKEY:0xffff

Named threads in Oracle Solaris 11.3

We've added a new feature in Solaris 11.3, the ability to name threads.  With this feature, you can now give a semantically meaningful name to a thread.  This can make life easier when trying to figure out which threads in your application are doing what.

These are the new functions that have been added for this feature:

int pthread_setname_np(pthread_t t, const char *name);
int pthread_getname_np(pthread_t t, char *buf, size_t len);
int pthread_attr_setname_np(pthread_attr_t *attr, const char *name);
int pthread_attr_getname_np(pthread_attr_t *attr, char *buf, size_t len);


pthread_setname_np(3C) allows an existing thread to be named.  pthread_attr_setname_np(3C) allows a thread to be named before it is  created.  Both pthread_getname_np(3C) and pthread_attr_getname_np(3C)  let you retrieve the name of a thread.
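As a minimal sketch, the function below names the calling thread and reads the name back, using the two prototypes listed above. The helper name name_current_thread is hypothetical, and the _GNU_SOURCE definition is an assumption that only matters if you try the code on Linux/glibc, where the same two calls exist.

```c
#define	_GNU_SOURCE	/* assumption: only needed when trying this on Linux/glibc */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical helper: name the calling thread, then read the name back
 * into buf. Returns 0 on success or an error number from the pthread calls.
 */
int
name_current_thread(const char *name, char *buf, size_t len)
{
	int rc;

	assert(name != NULL && buf != NULL && len > 0);

	rc = pthread_setname_np(pthread_self(), name);
	if (rc != 0)
		return (rc);
	return (pthread_getname_np(pthread_self(), buf, len));
}
```

A worker thread would typically call something like this at the top of its start routine, for example name_current_thread("moe", buf, sizeof (buf)) with a 32-byte buffer.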

Thread names are exposed by prstat(1M) and ps(1).  For example, 'ps -L' output has been modified to include the thread name (LNAME):

$ ps -L
  PID   LWP LNAME     TTY         LTIME CMD
 2644     1 -         pts/32       0:00 bash
14320     1 moe       pts/32       0:00 a.out
14320     2 curly     pts/32       0:00 a.out
14320     3 larry     pts/32       0:00 a.out
14320     4 shemp     pts/32       0:00 a.out
14321     1 -         pts/32       0:00 ps
$

Similarly, a format specifier has been added for the '-o' option to ps(1):

$ ps -L -o pid,lwp,lname,fname
  PID   LWP LNAME     COMMAND
 2644     1 -         bash
13421     1 moe       a.out
13421     2 curly     a.out
13421     3 larry     a.out
13421     4 shemp     a.out
13422     1 -         ps

prstat(1M) now displays the thread name instead of the thread ID, if it has been set.  (If the thread hasn't been named, the thread ID is displayed.) For example:

# prstat -Lmp `pgrep nscd` 5
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG LWPID PROCESS/LWPNAME
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  37   0  1762 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0  13   0  5661 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0  14   0   200 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   7   0  21   0     2 nscd/set_smf_state
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5662 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5660 nscd/reaper
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5659 nscd/revalidate
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5657 nscd/reaper
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5656 nscd/revalidate
[ ... ]

We've also added a new variable to DTrace, uthreadname.  (This is for the userspace thread name. We've also added kthreadname, as we also allow for naming kernel threads.)  The following DTrace script would tell you which threads in your application are most active:

#!/usr/sbin/dtrace -s

profile-397
/ pid == $target /
{
        @[uthreadname] = count();
}

The output from this script might appear as follows:

$ ./uthr.d -p `pgrep a.out`
dtrace: script './uthr.d' matched 1 probe
^C
shemp             29
larry             35
curly             36
moe             4423

Better performing pthread reader-writer locks for NUMA architectures

Introduction
-------------
Solaris 11.3 introduces a version of process-private reader-writer locks
(rwlocks) that is aware of the underlying NUMA architecture and uses that
awareness to extract better performance from the locks. This is accomplished
with a technique called lock cohorting. This blog post describes how to use the
new rwlocks.

Interface Changes
------------------
To ensure compatibility and ease of adoption, the only change to the existing
pthread_rwlock interfaces (see pthread_rwlock_*(3C) and pthread_rwlockattr_*(3C))
is a new attribute value, PTHREAD_RWSCALE_PRIVATE_NP. Once the attribute object
has been set to PTHREAD_RWSCALE_PRIVATE_NP, the rest of the application
remains unchanged.
	
	pthread_rwlockattr_t	lock_attr;
	pthread_rwlockattr_init(&lock_attr);
	pthread_rwlockattr_setpshared(&lock_attr, PTHREAD_RWSCALE_PRIVATE_NP);

The above code sets the lock_attr object to the NUMA-aware type. An rwlock
created using lock_attr will provide better performance than a traditional
process-private rwlock.

Use Case
---------
These locks are best used in the following scenario.
Consider a multi-threaded process whose threads share a process-private rwlock.
If such a process runs on a NUMA machine and its threads are
*not* confined to a single NUMA node, then considerable performance improvement
can be seen when using the new rwlocks. If the threads of such a process are confined
to a single NUMA node, then the lock performs as it would for a process running on
a UMA machine. No performance degradation will be seen.

Example
--------
Here is a simple example to demonstrate the creation of an rwlock of
PTHREAD_RWSCALE_PRIVATE_NP type.

#include <pthread.h>

int
main()
{
	/* Declare a variable of type pthread_rwlockattr_t */
	pthread_rwlockattr_t rwlattr;
	pthread_rwlock_t rwlock;
	int rc;

	/*
	 * Initialize the attribute object, then call
	 * pthread_rwlockattr_setpshared to set it to the appropriate value.
	 * The default value set by pthread_rwlockattr_init is
	 * PTHREAD_PROCESS_PRIVATE.
	 */
	rc = pthread_rwlockattr_init(&rwlattr);
	if (rc != 0)
		return (rc);
	rc = pthread_rwlockattr_setpshared(&rwlattr, PTHREAD_RWSCALE_PRIVATE_NP);
	if (rc != 0)
		return (rc);

	/*
	 * Call pthread_rwlock_init with the initialized rwlockattr object to
	 * initialize the rwlock to the desired type. The attribute object is
	 * not needed after that.
	 */
	rc = pthread_rwlock_init(&rwlock, &rwlattr);
	(void) pthread_rwlockattr_destroy(&rwlattr);
	if (rc != 0)
		return (rc);

	/*
	 * Use the lock and destroy it by using pthread_rwlock_destroy.
	 * It is important to destroy the lock to ensure that the memory
	 * allocated by the lock internally is freed.
	 */
	rc = pthread_rwlock_destroy(&rwlock);
	return (rc);
}

New Security Extensions in Oracle Solaris 11.3

In Solaris 11.3, we've expanded the security extensions framework to give you more tools to defend your installations. In addition to Address Space Layout Randomization (ASLR), we now offer tools to set a non-executable stack (NXSTACK) and a non-executable heap (NXHEAP). We've also improved the sxadm(1M) utility to make it easier to manage security extension configurations.

NXSTACK

When NXSTACK is enabled, the process stack memory segment is marked non-executable. This extension defends against attacks that rely on injecting malicious code and executing it on the stack. You can also configure NXSTACK to log each time a program tries to execute code on the stack. Log entries are output to /var/adm/messages.

Very few  non-malicious programs need to execute code on the stack, so NXSTACK is enabled by default in Solaris 11.3. If you have a program that needs to execute on the stack and you are able to recompile it, you can pass the "-z nxstack=disable" flag to Solaris Studio. Otherwise, you can use sxadm either to disable NXSTACK or set it to work only on tagged binaries. Most core Solaris utilities are tagged for NXSTACK.

Note that NXSTACK takes the place of the "noexec_user_stack" and "noexec_user_stack_log" entries in /etc/system. You can still use those entries to configure non-executable stack, and they will take precedence over any configuration of NXSTACK. However, they are considered deprecated and you are encouraged to switch to using NXSTACK through sxadm.

NXHEAP

When NXHEAP is enabled, the brk(2)-based heap memory segment is marked non-executable. This extension defends against attacks that rely on injecting code and executing it from the heap. You can also configure NXHEAP to log each time a program tries to execute code on the heap. NXHEAP log entries are also written to /var/adm/messages.

Some programs (such as interpreters) do have legitimate reasons to execute code from the heap, so NXHEAP is enabled by default only for tagged binaries. Most core Solaris utilities are already tagged for NXHEAP, and you can tag your own binaries by passing the linker flag "-z nxheap=enable" when compiling with Solaris Studio. Of course, NXHEAP can also be enabled or disabled globally with sxadm.

sxadm

We've made all sorts of improvements to sxadm in Solaris 11.3, so I'm only going to focus on three new subcommands that will help you configure the new security extensions.

sxadm get

"sxadm get" allows you to observe the properties of security extensions. For example, NXSTACK and NXHEAP have log properties that show whether or not logging is enabled for those extensions. You can query the log property with:

$ sxadm get log nxstack nxheap
EXTENSION           PROPERTY                      VALUE
nxstack             log                           enable
nxheap              log                           enable  

And you can get an easily parsable format by passing the "-p" flag:

$ sxadm get -p log nxstack nxheap
nxstack:log:enable
nxheap:log:enable

You can also query all properties (equivalent to "sxadm status") with:

$ sxadm get all
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files
nxstack             model                         all
--                  log                           enable
nxheap              model                         tagged-files
--                  log                           enable  

sxadm set

"sxadm set" allows you to set individual properties of extensions without needing to use "sxadm enable". For example, you can disable NXSTACK logging with:

$ sxadm get log nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             log                           enable
$ sxadm set log=disable nxstack
$ sxadm get log nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             log                           disable

sxadm delcust

"sxadm delcust" allows you to restore the default configuration for one or more security extensions. For example:

$ sxadm get all nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             model                         tagged-files
--                  log                           disable
$ sxadm delcust nxstack
$ sxadm get all nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             model                         all
--                  log                           enable

Of course, all of these new subcommands also work with ASLR, even though it only has one "model" property. For example:

$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files
$ sxadm set model=all aslr
$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         all
$ sxadm delcust aslr
$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files 

Conclusion

I hope you've enjoyed this quick introduction to all the work we've put into the Security Extensions Framework for Solaris 11.3, and I hope you're able to use some or all of it to meet your organization's security needs. For a more detailed explanation of sxadm and the individual security extensions, please see the sxadm(1M) man page.

OpenSSL on Oracle Solaris 11.3

As with Solaris 11.2, Solaris 11.3 delivers two versions of OpenSSL: the non-FIPS 140 version (default) and the FIPS 140 version.  They are both based on OpenSSL 1.0.1o (as of July 7th, 2015).

There are no major features added to Solaris 11.3 OpenSSL; however, there are a couple of things that I would like to note.

EOL SSLv2 Support


The SSLv2 protocol has been known to have issues for a while. Therefore, we have decided it's about time to remove SSLv2 support from Solaris OpenSSL. This should not be an issue for most applications out there, as nobody should be using the SSLv2 protocol these days.  If your application still does, please consider moving on to the more secure TLS protocols.

With Solaris 11.3, SSLv2 entry points are replaced with stub functions, and they are declared 'deprecated'.  Thus, if you are building an application which has references to the SSLv2 entry points, be prepared to see some compiler warnings like:

        warning:  "SSLv2_client_method" is deprecated, declared in : "/usr/include/openssl/ssl.h", line 2035

Now, some of you may wonder: why are we not removing SSLv3 from Solaris OpenSSL as well?
Unfortunately, some third-party applications still support only the SSLv3 protocol, so we feel that it's not yet time to remove SSLv3 support from the OpenSSL library. That's not to say SSLv3 is an acceptable protocol.  RFC 7568, "Deprecating Secure Sockets Layer Version 3.0", was just published, stating that "SSLv3 MUST NOT be used. Negotiation of SSLv3 from any version of TLS MUST NOT be permitted."  Fortunately, Oracle has already been implementing compliance with this RFC for a while now, and most applications supported by Oracle Solaris 11.3 disable SSLv2 and SSLv3 by default.  If you own an application which only supports SSLv3, it is time to move on to newer and more secure protocols such as TLS 1.2.  We won't be supporting the SSLv3 protocol for too much longer.


OpenSSL Thread and Fork Safety (Part 2)


With S11.2, we attempted to make OpenSSL thread and fork safe by default.  (See "OpenSSL Thread and Fork Safety" under "OpenSSL on Solaris 11.2".)
However, the fix apparently wasn't complete, and we needed to extend it.

With Solaris 11.3 OpenSSL, the functions listed below are replaced with stub functions.  Instead of allowing applications and libraries to specify their own locking and thread identification callback functions, Solaris OpenSSL now has an internal implementation of locking and thread identification that is not visible to the API caller.  Applications may still call those functions, but the supplied callback functions will not be used by Solaris OpenSSL.

      CRYPTO_set_locking_callback
      CRYPTO_set_dynlock_create_callback
      CRYPTO_set_dynlock_lock_callback
      CRYPTO_set_dynlock_destroy_callback
      CRYPTO_set_add_lock_callback
      CRYPTO_THREADID_set_callback
      CRYPTO_set_id_callback

What does that mean for you?
OpenSSL is now thread and fork safe by default, finally.  You don't need to make any modification to
your application or your library.  You can relax and have a beer or two.


That's all I have for now.

Changes to ZFS ARC Memory Allocation in 11.3

New in Solaris 11.3 is a kernel memory allocation subsystem called the
kernel object manager, or KOM. The first consumer of this subsystem is the
ZFS ARC.

Prior to Solaris 11.3, the ZFS ARC allocated its memory from the kernel heap
space using kmem caches. This has several drawbacks: first, internal
fragmentation can result in memory used by the ARC not being reclaimed by the
system. This problem is particularly acute if large pages are being used, since
the buffer size is considerably smaller than the large page size -- even one
buffer still allocated will prevent the system from freeing the large page.
Another drawback of ZFS ARC using the kernel heap is that all of the kernel
heap is non-relocatable in memory, and thus must reside in the kernel cage.
This can lead to issues allocating large pages or performing DR memory remove
operations once the ARC has grown large, even if it shrinks successfully. As a
workaround for the cage growth issue, many sysadmins have limited the size of
the ZFS ARC cache in /etc/system. Finally, scalability of ARC shrinking prior
to Solaris 11.3 is limited by heap page unmapping speed on large SPARC systems.

In Solaris 11.3, the ZFS ARC allocates its memory through KOM. The metadata
which is frequently accessed by ZFS (such as directory files) remains in
the kernel cage, but the vast majority of the cache which is not frequently
accessed by ZFS now resides outside of the kernel cage, where it can be
relocated by DR and page coalescing. KOM uses a slab size of 2M on x86 or 4M on
SPARC, so internal fragmentation is much less of an issue than it was with 256M
heap pages on SPARC. Scalability is vastly improved, as KOM takes advantage of
64-bit systems by using the seg_kpm framework for its address translations.

With this change, many systems which required limiting the ARC size will no
longer require a hard limit, since the system is able to manage its memory much
better. Metadata heavy workloads, and systems hosting kernel zones, will still
need to limit the ARC size through /etc/system tuning in Solaris 11.3, however.

Saturday Jan 31, 2015

Multi-CPU Binding (MCB)

I want to tell everyone about the cool, new Multi-CPU Binding API introduced in Solaris 11.2.  Bo Li and I wrote up something that explains what it does, its benefits, and how it is used in Solaris along with examples of how to use it:

INTRODUCTION

Multi-CPU Binding (MCB) is new functionality that was added to Solaris 11.2 and is available through a new API called "processor_affinity(2)" and through the pbind(1M) command line tool.  MCB provides similar functionality to processor_bind(2), but can do much more than processor_bind(2):

  1. Bind specified threads to one or more CPUs, leaf locality groups (lgroups)*, or Processor Groups (PGs)**.

  2. Specify strong or weak affinity to CPUs where:

    • Strong affinity means that the threads must only run on the specified CPUs

    • Weak affinity means that the threads should always prefer to run on the specified CPUs, but will run on the closest available CPU where they have sufficient priority to run soonest when the desired CPUs are running higher priority threads

  3. Specify positive or negative affinity for CPUs (i.e., want to run on or avoid running on the specified CPUs)

  4. Enable or disable inheritance across fork(2), exec(2), and/or thr_create(3C).

  5. Query affinities of specified threads to CPUs, PGs, or lgroups.

* lgroups are the Solaris abstraction for telling which CPUs, memory, and I/O devices are within some latency of each other in a Non Uniform Memory Access (NUMA) machine

** PGs are the Solaris abstraction for performance relevant processor sharing relationships in CMT processors (eg. shared execution pipeline, FPU, cache, etc.)

BENEFITS

Overall, MCB is more powerful and flexible than what was available in Solaris for affining threads to CPUs before MCB.

Before MCB, you could only do one or more of the following to affine a thread to one or more CPUs:

  • Bind one or more threads to one CPU and have this binding always be inherited across fork(2) and exec(2)
  • Set the affinity of one or more threads for a locality group (lgroup), which is the Solaris abstraction for the CPUs, memory, and I/O devices within some latency of each other in a Non Uniform Memory Access (NUMA) machine
  • Create an exclusive set of CPUs that can only run threads assigned to it, bind one or more threads to this processor set, and always have this processor set binding inherited across fork(2) and exec(2).

In contrast to the old functionality above, MCB has the following new functionality and benefits:

  1. Can bind to more than one CPU
    • The biggest benefit of MCB is that you can affine one or more threads to any set of CPUs that you want.  With this ability, you can bind threads to a NUMA node, processor chip, core, the CPUs sharing some performance relevant hardware component (eg. execution pipeline, FPU, cache, etc.), or an arbitrary set of CPUs.
    • Using a processor set is a way to affine a thread to a set of CPUs like MCB.  However, processor sets are exclusive so only threads assigned to the processor set can run on the CPUs in the processor set.  In contrast, MCB does not set aside CPUs for exclusive use by threads affined to those CPUs by MCB.  Hence, a thread having an MCB affinity for some CPUs does not prevent any other threads from running on those CPUs.
  2. More affinities
    • Having a positive and negative affinity to specify whether to run on or avoid the specified CPUs is a new feature that wasn't offered in the previous APIs for binding threads to CPUs
    • Being able to specify a strong or weak affinity is new for binding threads to CPUs, but isn't a completely new idea in Solaris.  The lgroup affinities already have the notion of strong and weak affinity.  The semantics are pretty different though.  The lgroup affinities mostly affect the order of preference for a thread's home lgroup.  In contrast, MCB strong and weak affinity affect where a thread must run or should prefer to run.  MCB affinities can cause the home lgroup of the thread to change to an lgroup that at least contains some of the specified CPUs, but it does not change the order of preference of home lgroups for the thread.
  3. More flexibility with inheritance
    • MCB has more flexibility with setting the inheritance of the MCB CPU affinities across fork(2), exec(2), or thr_create(3C).  It allows you to enable or disable inheritance of its CPU affinities separately for each of these operations.

In contrast, the pre-existing APIs for binding threads to a CPU or a processor set make the bindings always be inherited across fork(2), exec(2), and thr_create(3C) so you can never disable any of the inheritance.  With lgroup affinities, you can enable or disable inheritance for fork(2), exec(2), and thr_create(3C), but you must enable or disable inheritance across all or none of these operations.

How is MCB used in Solaris?

Solaris optimizes performance for I/O on Non Uniform Memory Access (NUMA) machines where some I/O devices are closer to some CPUs and memory than others.  Part of what Solaris does for its NUMA I/O optimizations is place kernel I/O helper threads that help usher I/O from the application to the I/O device and vice versa near the I/O device.

Before Solaris 11.2, Solaris would bind each I/O helper thread to one CPU near its corresponding I/O device.  Unfortunately, this can cause performance issues when the CPU where the I/O helper thread is bound becomes very busy running higher priority threads or handling interrupts.  Since the I/O helper thread is bound to just one CPU, it isn't allowed to run on any other CPU and may have to wait a long time to run.  This can cause I/O performance to go down because the I/O will take longer to process.

In S11.2, MCB is used to overcome this problem by affining each I/O helper thread to one or more processor cores.  This gives the I/O helper threads more places to run and reduces the chance that they get stuck on a very busy CPU.  Also, MCB weak affinity can be used to specify that the I/O helper threads prefer to run on the specified CPUs but it is ok to run them on the closest available CPUs if the specified CPUs are too busy.

Tool

pbind(1M)

pbind(1M) is an existing tool to control and query the bindings of processes or LWPs to a CPU and has been modified to support affining threads to more than one CPU.

When specifying target CPUs, the user can either use their processor IDs directly or refer to them indirectly through a Processor Group (PG) or Locality Group (lgroup) ID.

Bind processes/LWPs

Below are equivalent ways of binding process 101048 to CPU 1.  By default, the binding target type is CPU, the idtype is pid, and the binding affinity is strong:

    # pbind -b 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -s -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

Bind processes/LWPs to CPUs specified by Processor Group or Locality Group

    Binding process 101048 to the CPUs in Processor Group 1:

    # pbind -b -g 1 101048

    pbind(1M): pid 101048 strongly bound to Processor Group(s) 1

    Binding process 101048 to the CPUs in Locality Group 2:

    # pbind -b -l 2 101048

    pbind(1M): pid 101048 strongly bound to Locality Group(s) 0 2.

Weak binding

    # pbind -b 2 -w 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2.

Negative binding targets

    Weakly binding process 101048 to all CPUs but the ones in Processor Group 1:

    # pbind -b -g 1 -n -w 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

Binding LWPs

When the user binds a process to the specified CPUs, all the LWPs belonging to that process are automatically bound to those CPUs.  The user may also bind LWPs in the same process individually: LWP IDs or ranges are given after a ‘/’ and separated by commas.

    Strongly binding LWPs 2, 3, and 4 of process 116936 to CPU 2:

    # pbind -b -c 2 -i pid 116936/2-3,4

    pbind(1M): LWP 116936/2 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/3 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/4 strongly bound to processor(s) 2.

Query processes/LWPs binding

When querying for bindings of specific LWPs, the user may request that the resulting set of CPUs be identified through their IDs, the Processor Groups or the Locality Groups that contain them:

    # pbind -q 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2 3.

    # pbind -q -g 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

    # pbind -q -l 101048

    pbind(1M): pid 101048 weakly bound to Locality Group(s) 0 2.

The user may also query all bindings for a specified CPU:

    # pbind -Q 2

    pbind(1M): LWP 101048/1 weakly bound to processor(s) 2 3.

    pbind(1M): LWP 102122/1 weakly bound to processor(s) 2 3.

Binding Inheritance

By default, bindings are inherited across exec(2), fork(2) and thr_create(3C), but inheritance across any of these can be disabled.  For example, the user could bind a shell process to a set of CPUs and specify that the binding is not inherited across fork(2).  In this way, none of the processes created by this shell will be bound to any CPUs.

    Bind processes/LWPs but request binding not inherited across fork(2):

    # pbind -b -c 2 -f 101048                      

    pbind(1M): pid 101048 strongly bound to processor(s) 2.

The return values are explained in the pbind(1M) manpage; please refer to it for more details.

APIs

processor_affinity(2)

MCB introduces a new processor_affinity(2) system call to control and query the affinity to CPUs for processes or LWPs.

    int processor_affinity(procset_t *ps, uint_t *nids, id_t *ids, uint32_t *flags);

Each option and flag used in pbind(1M) maps directly to processor_affinity(2).  As with pbind(1M), the user may request that the binding be either strong or weak by specifying the flag PA_AFF_STRONG or PA_AFF_WEAK.  The target CPUs can be specified by their processor IDs, Processor Group (PG) IDs, or Locality Group (lgroup) IDs when used with the corresponding flag PA_TYPE_CPU, PA_TYPE_PG, or PA_TYPE_LGRP.

The ps argument identifies the LWP(s) to which the call should be applied, through a procset structure (see procset.h(3HEAD) for details).  The flags argument must contain a valid combination of the options given in the manpage.

When setting affinities, the nids argument points to a memory location holding the number of CPU, PG or LGRP identifiers to which affinity is being set, and ids points to an array with the identifiers.  Exactly one type of affinity must be specified, along with one affinity strength.  Negative affinity is a type modifier that indicates that the given IDs should be avoided and that affinity of the specified type should be set to all of the other processors in the system.

When specifying multiple LWPs, the threads should all be bound to the same processor set, since they can only be affined to CPUs in their processor set.  Additionally, setting affinities will succeed if processor_affinity(2) is able to set an LWP's affinity for any of the specified CPUs, even if a subset of the specified CPUs are invalid, offline, or faulted.

Setting strong affinity for CPUs [0-3] to the current LWP:

    #include <stdio.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids = 4;                    /* number of entries in ids[] */
    id_t ids[4] = { 0, 1, 2, 3 };       /* target CPU IDs */
    uint32_t flags = PA_TYPE_CPU | PA_AFF_STRONG;

    /* Select only the calling LWP of the current process. */
    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Setting weak affinity for CPUs in Processor Groups 3 and 7 to process 300's LWP 2:

    #include <stdio.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids = 2;                    /* two Processor Group IDs follow */
    id_t ids[2] = { 3, 7 };
    uint32_t flags = PA_TYPE_PG | PA_AFF_WEAK;

    /* Select LWP 2 of process 300. */
    setprocset(&ps, POP_AND, P_PID, 300, P_LWPID, 2);

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Upon a successful query, nids will contain the number of CPUs, PGs or LGRPs for which the specified LWP(s) has affinity.  If ids is not NULL, processor_affinity(2) will store the IDs of the indicated type up to the initial nids value.  Additionally, flags will return the affinity strength and whether any type of inheritance is excluded.

When querying affinities, PA_TYPE_CPU, PA_TYPE_PG or PA_TYPE_LGRP may be specified to indicate that the returned identifiers must be the CPUs, Processor Groups, or Locality Groups that contain the processors for which the specified LWPs have affinity.  If no type is specified, the interface defaults to CPUs.

Querying and printing affinities for the current LWP:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids;
    id_t *ids;
    uint32_t flags = PA_QUERY;
    uint_t i;

    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    /* First call: ids is NULL, so only the number of IDs is returned. */
    if (processor_affinity(&ps, &nids, NULL, &flags) != 0) {
        fprintf(stderr, "Error querying number of ids.\n");
        perror(NULL);
        exit(1);
    }
    fprintf(stderr, "LWP %u has affinity for %u CPUs.\n",
        (uint_t)thr_self(), nids);

    if (nids == 0) {
        printf("Current LWP has no affinity set.\n");
        exit(0);
    }

    /* Second call: pass a buffer with room for nids identifiers. */
    flags = PA_QUERY;
    ids = calloc(nids, sizeof (id_t));
    if (ids == NULL) {
        perror("calloc");
        exit(1);
    }
    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error querying ids.\n");
        perror(NULL);
        free(ids);
        exit(1);
    }

    printf("Current LWP has affinity for the following CPU(s):\n");
    for (i = 0; i < nids; i++)
        printf(" %d", (int)ids[i]);
    printf("\n");
    free(ids);

When clearing affinities, the caller can either specify a set of LWPs whose affinities should be revoked (through the ps argument), or pass no LWPs and instead specify a list of CPU, PG or LGRP identifiers for which all affinities must be cleared.  The example below shows the second form.

Clearing all affinities for CPUs 5 and 7:

    #include <stdio.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>

    uint_t nids = 2;                    /* two CPU IDs follow */
    id_t ids[2] = { 5, 7 };
    uint32_t flags = PA_CLEAR | PA_TYPE_CPU;

    /* ps is NULL: clear every affinity that refers to these CPUs. */
    if (processor_affinity(NULL, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error clearing affinity.\n");
        perror(NULL);
    }

The return values are explained in the processor_affinity(2) manpage; please refer to it for more details.

processor_bind(2)

processor_bind(2) binds processes/LWPs to a single CPU.  Its interface remains the same as in earlier Solaris versions, but its implementation has changed significantly to use MCB.  processor_bind(2) and processor_affinity(2) are implemented the same way, differing only in the limitations imposed by the number and types of arguments each accepts.  Calls to processor_bind(2) are essentially calls to processor_affinity(2) that only allow setting and querying a binding to a single CPU at a time.

    int processor_bind(idtype_t idtype, id_t id, processorid_t new_binding, processorid_t *old_binding);

This function binds the LWP (lightweight process) or set of LWPs specified by idtype and id to the processor specified by new_binding. If old_binding is not NULL, it will contain the previous binding of one of the specified LWPs, or PBIND_NONE if none were previously bound.

For more details, please refer to the manpage of processor_bind(2).


Wednesday Dec 10, 2014

Which Oracle Solaris Virtualization?

From time to time as the product manager for Oracle Solaris Virtualization I get asked by customers which virtualization technology they should choose. This is probably because of two main reasons.

  1. Choice: Oracle Solaris provides a choice of virtualization technologies so you can tailor your virtual infrastructure to best fit your application, rather than having to force (and hence compromise) your application to fit a single option
  2. No way back: There is a perception that once you have made your choice, there is no way back (or only a very difficult one) if you got it wrong, so it is really important to make the right choice

Understandably, there is occasionally a lot of angst around this decision but, as always with Oracle Solaris, there is good news. First, the choice isn't as complex as it first seems, and below is a diagram that can help you get a feel for that choice. We now have many customers who are discovering that the combination of Oracle Solaris Zones inside OVM Server for SPARC instances (Logical Domains) gives them the best of both worlds.

Second with Unified Archives in Oracle Solaris 11.2 you always have a way back. With a Unified Archive you can move from a Native Zone to a Kernel Zone to a Logical Domain to Bare Metal and any and all combinations in-between. You can test which is the best type of virtualization for your applications and infrastructure and if you don't like it change to another type in a few minutes. 

BTW if you want a more in-depth discussion of virtualization and how to best utilize it for consolidation, check out the Consolidation Using Oracle's SPARC Virtualization Technologies white paper.  

Thursday Aug 14, 2014

VXLAN in Solaris 11.2

What is a VXLAN?

VXLAN, or Virtual eXtensible LAN, is essentially a tunneling mechanism used to provide isolated virtual Layer 2 (L2) segments that can span multiple physical L2 segments. Since it is a tunneling mechanism, it uses IP (IPv4 or IPv6) as its underlying network, which means we can have isolated virtual L2 segments over networks connected by IP. This allows Virtual Machines (VMs) to be in the same L2 segment even if they are located on systems that are in different physical networks. Some of the benefits of VXLAN include:

  • Better use of resources, i.e. VMs can be provisioned on systems, that span different geographies, based on system load.
  • VMs can be moved across systems without having to reconfigure the underlying physical network.
  • Fewer MAC address collision issues, i.e. MAC addresses may be duplicated as long as they are in different VXLAN segments.

Isolated L2 segments can be supported by existing mechanisms such as VLANs, but VLANs don't scale: the number of VLANs is limited to 4094 (of the 4096 possible 12-bit IDs, 0 and 4095 are reserved), whereas VXLAN can provide up to 16 million isolated L2 networks.

Additional details, including protocol working, can be found in the VXLAN draft IETF RFC. Note that Solaris uses the IANA specified UDP port number of 4789 for VXLAN. 

The following is a quick primer on administering VXLAN in Solaris 11.2 using the Solaris administrative utility dladm(1M). Solaris Elastic Virtual Switch (EVS) can be used to manage VXLAN deployment automatically in a cloud environment - this will be the subject of a future discussion.

The following illustrates how VXLANs are created on Solaris:

where IPx is an IP address (IPv4 or IPv6) and VNIs y and z are different VXLAN segments. VM1, VM2 and VM3 are guests with interfaces configured on VXLAN segments y and z. vxlan1 and vxlan2 are VXLAN links, represented by a new class called VXLAN.

Creating VXLANs

To begin with, we need to create VXLAN links in the segments that we want to use for guests - let's assume we want to create segments 100 and 101. Because VXLANs are overlays on IP networks, we also need the IP address over which we want to create the VXLAN links - let's assume our endpoint on this system is 10.10.10.1 (in the following example this IP address resides on net4).

# ipadm show-addr net4
ADDROBJ           TYPE     STATE        ADDR
net4/v4           static   ok           10.10.10.1/24

Create VXLAN segments 100 and 101 on this IP address.

# dladm create-vxlan -p addr=10.10.10.1,vni=100 vxlan1 
# dladm create-vxlan -p addr=10.10.10.1,vni=101 vxlan2    

Notes:

  • In the above example we explicitly provide the IP address; however, you could also:
    • provide a prefix and prefixlen to use an IP address that matches it, e.g.:
      # dladm create-vxlan -p addr=10.10.10.0/24,vni=100 vxlan1
    • provide an interface (say net4 in our case) to pick an active address on that interface, e.g.:
      # dladm create-vxlan -p interface=net4,vni=100 vxlan1
      (you can't provide interface and addr together)

  • VXLAN links can be created on an IP address over any interface, including IPoIB link, except IPMP, loopback or VNI (Virtual Network Interface).
  • The IP address may belong to a VLAN segment.

Displaying VXLANs

Check if we have our VXLAN links:

# dladm show-vxlan                                          
LINK                ADDR                     VNI   MGROUP
vxlan1              10.10.10.1               100   224.0.0.1
vxlan2              10.10.10.1               101   224.0.0.1

One thing we haven't talked about so far is the MGROUP. Recall from the RFC that VXLAN links use IP multicast for broadcast, so we can assign a multicast address to each VXLAN segment that we create. If we don't specify a multicast address, the all-hosts multicast address (or all-nodes for IPv6) is assigned to the VXLAN segment. In the above case, since we didn't specify a multicast address, both vxlan1 and vxlan2 use the all-hosts multicast address.

The VXLAN links created, vxlan1 and vxlan2, are just like other datalinks (physical, VNIC, VLAN, etc.) and can be displayed using dladm show-link:

# dladm show-link
LINK                CLASS     MTU    STATE    OVER
...
vxlan1              vxlan     1440   up       --
vxlan2              vxlan     1440   up       --

The STATE reflects the state of the VXLAN links, which is based on the status of the IP address (10.10.10.1 in this case). Note that the MTU on these VXLAN links is reduced to make room for the VXLAN encapsulation added to each packet.

Now that we have our VXLAN links, we can create Virtual Links (VNICs) over these VXLAN links. Note that the VXLAN links themselves are not active links, i.e. you can't plumb IP addresses or create flows on them, but they can be snooped.

# dladm create-vnic  -l vxlan1 vnic1                    
# dladm create-vnic  -l vxlan1 vnic2    
# dladm create-vnic  -l vxlan2 vnic3            
# dladm create-vnic  -l vxlan2 vnic4  

# dladm show-vnic                                           
LINK                OVER              SPEED  MACADDRESS        MACADDRTYPE VIDS
vnic1               vxlan1            10000  2:8:20:d9:df:5f   random      0
vnic2               vxlan1            10000  2:8:20:72:9a:70   random      0
vnic3               vxlan2            10000  2:8:20:19:c7:14   random      0
vnic4               vxlan2            10000  2:8:20:88:98:6d   random      0

You can see from the above that the process of creating a VNIC on a VXLAN link is no different from creating one on any other link, such as a physical link, aggregation, or etherstub.  This means that the VNICs created may belong to a VLAN, and properties (such as maxbw and priority) can be set on them.

Once created, these VNICs can be assigned explicitly to Solaris zones. Alternatively, the VXLAN links can be set as the lower-link for configuring anet (automatic VNIC) links in Solaris Zones.

For Logical Domains on SPARC, the virtual switch (add-vsw) can be created on the VXLAN device which means the vnets created on the virtual switch will be part of the VXLAN segment.

Deleting VXLANs

A VXLAN can be deleted once all the VNICs over the VXLAN links have been deleted. Thus in our case:


# dladm delete-vnic vnic1   
# dladm delete-vnic vnic2 
# dladm delete-vnic vnic3     
# dladm delete-vnic vnic4  

# dladm delete-vxlan vxlan1
# dladm delete-vxlan vxlan2  

Additional Notes:
  • VXLAN for Solaris Kernel Zone and LDom guests is not supported with direct I/O.
  • Hardware capabilities such as checksum and LSO are not available for the encapsulated (inner) packet.
  • Some earlier implementations (e.g. Linux) might use a pre-IANA assigned port number. If so, such implementations might have to be configured to use the IANA port number to interoperate with Solaris VXLAN. 
  • IP multicast must be available in the underlying network and if communicating  across different IP subnets, multicast routing should be available as well.
  • Modifying properties (IP address, multicast address or VNI) on a VXLAN link is currently not supported; you'd have to delete the VXLAN and re-create it.

Tuesday Jul 29, 2014

DTrace improvements in Oracle Solaris 11.2

There have been a few improvements to DTrace in Solaris 11.2.

llquantize()

DTrace has quantize() and lquantize() aggregating actions to give you, respectively, a power-of-two distribution and a linear distribution of data points that you're interested in.  While these are both useful, there may be instances in which you want to examine events whose latencies span multiple orders of magnitude, but for which you want relatively fine-grained information about the data within each order of magnitude.  With quantize(), there's likely to be insufficient detail. You could use lquantize(), but you'd need multiple aggregations to cover the multiple orders of magnitude.

In 11.2, we have added a log-linear quantize aggregation, llquantize().  This aggregating action allows you to specify a base and a range of exponents for the data, but it also allows you to specify a number of steps (or buckets, if you will) per order of magnitude.  For example, the following line will create an aggregation covering the values from 10^3 through 10^6 - 1 with 10 steps per order of magnitude.  (The buckets for 10^5 will include the values from 10^5 through 10^6 - 1.):

@ = llquantize(foo, 10, 3, 5, 10);

We can use this in a script to examine system call latencies:

        syscall:::entry
        {
                self->ts = timestamp;
        }

        syscall:::return
        / self->ts /
        {
                @ = llquantize(timestamp - self->ts, 10, 3, 5, 10);
                self->ts = 0;
        }

Because timestamp is measured in nanoseconds, this aggregation covers system call latencies from 1 microsecond (10^3 ns) up to 1 millisecond (10^6 ns).  Here's sample output from this script:

           value  ------------- Distribution ------------- count
          < 1000 |@@@@@@                                   12899
            1000 |@@@@@@@@@@@@                             26357
            2000 |@                                        3202
            3000 |@                                        1869
            4000 |@                                        2110
            5000 |@@                                       4716
            6000 |@@                                       3998
            7000 |@                                        1617
            8000 |@@                                       4924
            9000 |@                                        2515
           10000 |@@@@@@@                                  15307
           20000 |@                                        2240
           30000 |@                                        1327
           40000 |@                                        1369
           50000 |                                         990
           60000 |                                         1057
           70000 |                                         631
           80000 |                                         453
           90000 |                                         434
          100000 |@                                        1570
          200000 |                                         228
          300000 |                                         45
          400000 |                                         59
          500000 |                                         60
          600000 |                                         52
          700000 |                                         30
          800000 |                                         22
          900000 |                                         17
      >= 1000000 |                                         513



Scalability

When DTrace was first conceived, 100 CPUs was a large machine.  Now, the largest machines contain over 1,000 CPUs, and the original DTrace architecture is starting to show its age.  dtrace(1M) (or specifically, libdtrace(3LIB)) was originally written to process data from the CPUs on a server in a single thread.  Unfortunately, with just a single thread, dtrace(1M) was unable to keep up on newer, larger servers.

In Solaris 11.2, we've modified libdtrace to perform this task with multiple threads.  On x86 servers, dtrace(1M) will use one thread per 8 CPUs.  On SPARC servers, it will use one thread per 16 CPUs.  We've also included an option, nworkers, to allow you to request a specific number of threads.

What does this mean for you?  The main benefit of doing this is that dtrace(1M) will actually be able to keep up on larger servers.  During testing on a 256-CPU system, generating 6,000 records per second per CPU, we were seeing hundreds of aggregation drops per second per CPU without this multi-threading framework.  Running the same test with a multi-threaded dtrace(1M), we saw no aggregation drops.

errexit option

Another minor enhancement in Solaris 11.2 is the errexit option.  This option causes dtrace(1M) to exit when it first hits an error.

As a trivial example, consider the following script:

tick-1s
{
        this->i = 0;
        this->j = 5 / this->i;
}

If run normally, this script would keep running until interrupted, reporting an error once per second:

# dtrace -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
^C

#

When run with the errexit option, the script terminates after the first error:

# dtrace -x errexit -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2

#

tracemem() enhancements

We've added an optional third argument to tracemem().  Where the second argument specifies how many bytes of memory to trace, the optional third argument specifies how many bytes of memory to display.  This can be useful in cases where the size of the data you care about is variable.  (The DTrace architecture requires that you trace a constant amount of data, but in some cases, the amount of data you're interested in is variable.  Previously, you would trace enough to capture what you're interested in and ignore the rest yourself.  This optional argument lets dtrace(1M) ignore the garbage for you.)

As an example, consider tracing the beginning of an SSH connection, as seen from the server side.  (This is a simplified example, so we'll just look at what the server writes.):

syscall::write:entry
/execname == "sshd"/
{
        tracemem(copyin(arg1, arg2), 1024, arg2);
}

While running this script and opening a connection from a remote server, we see this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a                                      2.2.

   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 25 54 00 00                                      %T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d7 82 01 00 25 54 00 00              .9.S....%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 49 84 01 00 25 54 00 00              .9.SI...%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d2 84 01 00 25 54 00 00              .9.S....%T..

[ ... ]

This is certainly much better than seeing this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a 00 00 00 00 18 00 00 00 00 00 00 00  2.2.............
        20: 98 f6 d3 08 25 1f d3 08 01 21 d3 08 00 00 00 00  ....%....!......
        30: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        40: c0 23 d2 08 0d 21 d3 08 10 00 00 00 01 00 00 00  .#...!..........
        50: 9d 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        60: c0 24 d2 08 25 1f d3 08 0d 21 d3 08 00 00 00 00  .$..%....!......
        70: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        80: 80 25 d2 08 01 00 00 00 02 00 00 00 03 00 00 00  .%..............
        90: ff ff ff ff 04 00 00 00 18 00 00 00 00 00 00 00  ................
        a0: a0 22 d2 08 2e 21 d3 08 0e 00 00 00 01 00 00 00  ."...!..........
        b0: ab 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        c0: 00 23 d2 08 2e 21 d3 08 00 00 00 00 25 1f d3 08  .#...!......%...
        d0: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        e0: b8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       100: d8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ ... ]
       3d0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       3e0: b8 f9 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
       3f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................

tracemem() output consistency

Another enhancement we've made to tracemem() is to modify the behavior of tracemem() for certain traced memory sizes.  Previously, when tracing 1, 2, 4 or 8 bytes, the traced memory would be treated as a signed decimal integer.  When tracing any other amount below 32 bytes, the traced memory was treated as a string if the buffer contained only printable ASCII characters.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'
47
# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'
1920169263
# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'
/usr/sbin/init
# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 84 2b ab fb ff ff ff ff  .........+......
        30: ac 2e ab fb ff ff ff ff ac 31 ab fb ff ff ff ff  .........1......

#

We've modified this behavior to be consistent across traced memory sizes.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f                                               /

# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72                                      /usr

# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 88 32 b6 fb ff ff ff ff 00 36 b6 fb ff ff ff ff  .2.......6......

#

Structure alignment

DTrace has had a longstanding issue with how it calculates alignment for structures, unions, and bit-fields.  At best, this provided an annoyance requiring users to modify their scripts to manually add padding to reflect correct behavior, or at worst, would provide wrong information when using things like the sizeof() action or produce alignment errors where there shouldn't be any issue.  We have made several modifications such that structures, unions, and bit-fields are now more ABI compliant.

In the case of bit fields, depending on where the break is between the bits, some would get lost or placed where they shouldn't.  For example, having a DTrace script written as:

union u1tag
{
        unsigned char c;
        struct
        {
                unsigned int b1:1;
                unsigned int b2:7;
        } s1;
} u1;

BEGIN
{
        u1.c = 255;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        u1.c = 0;
        u1.s1.b1 = 1;
        u1.s1.b2 = 127;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        exit(0);
}

Would produce a result of:

255 1 0
128 1 127

If you were to write the same thing in C, you would see:

255 1 127
255 1 127

For structs and unions, the problem is more pronounced.  Suppose we have a DTrace script:

typedef struct _my_data {
        uint32_t a;
        uint64_t c;
} my_data;

typedef struct _more_data {
        uint32_t x;
        my_data y;
} more_data;

more_data a;

BEGIN {
        a.y.c = 30;
        printf("%lu\n", a.y.c);
}

When we run it, DTrace will report that this otherwise valid structure will result in an invalid alignment.

# dtrace -64 -s /tmp/test.d
dtrace: script '/tmp/test.d' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid alignment (0xffffff08bf603a0c) in action #1 at DIF offset 24

In the past, the workaround was to add a padding member so that struct _more_data looked like:

typedef struct _more_data {
        uint32_t x;
        uint32_t padding;
        my_data y;
} more_data;

This means that without the padding, we not only ran into alignment issues that produced runtime errors, but also risked reporting wrong data for the members of the struct.  It also affected the size reported by sizeof().

For instance, suppose we have two structs:

struct s1 {
        int x;          /* 4 bytes */
        short a;        /* 2 bytes */
};

struct s2 {
        struct s1 b;    /* 6 bytes */
        short c;        /* 2 bytes */
        int d;          /* 4 bytes */
};

In this case, DTrace would report sizeof(s1) as 6 and sizeof(s2) as 12, instead of the correct 8 and 16 respectively.  The correct values are larger because each struct is padded to a multiple of its 4-byte alignment.

One last alignment issue is being addressed with this release, and it concerns the different memory alignment restrictions of the SPARC and x86 architectures.  In the past, DTrace would force a 4-byte alignment regardless of the architecture the script happened to be running on.  With 11.2, alignment is based on the platform: SPARC still requires 4-byte alignment for memory accesses, whereas the x86 architecture is more lenient and imposes no such restriction.

Summing up, by ensuring ABI compliance, each structure will match its C counterpart.  When you copyin a struct from your program and rely on the structs defined in a header file or in your own D scripts, you can rest assured the data is reliable.

Wednesday May 07, 2014

Solaris-specific Providers for Puppet

As I mentioned in my previous post about Puppet, there are some new Solaris-specific Resource Types for Puppet 3.4.1 in Oracle Solaris 11.2.  All of these new Resource Types and Providers have been available on java.net since integration into the FOSS projects gate.  I am actively working with Puppet Labs to get this code pushed back upstream so that it's available for anybody to work with.

Here's a brief description of a few of the 23 new Resource Types:

  • boot_environment
    • name - The boot_environment name (#namevar)
    • description - Description for the new boot environment
    • clone_be - Create a new boot environment from an existing inactive boot environment
    • options - Create the datasets for a new boot environment with specific ZFS properties.  Specified as a hash
    • zpool - Create the new boot environment in the specified zpool
    • activate - Activate the specified boot environment
  • pkg_publisher
    • name - The publisher name (#namevar)
    • origin - Which origin URI(s) to set.  For multiple origins, specify them as a list
    • enable - Enable the publisher
    • sticky - Set the publisher 'sticky'
    • searchfirst - Set the publisher first in the search order
    • searchafter - Set the publisher after the specified publisher in the search order
    • searchbefore - Set the publisher before the specified publisher in the search order
    • proxy - Use the specified web proxy URI to retrieve content for the specified origin or mirror
    • sslkey - The client SSL key
    • sslcert - The client SSL certificate
  • vnic
    • name - The name of the VNIC (#namevar)
    • temporary - Optional parameter that specifies that the VNIC is temporary
    • lower_link - The name of the physical datalink over which the VNIC is operating
    • mac_address - Sets the VNIC's MAC address based on the specified value
  • dns
    • name - A symbolic name for the DNS client settings to use.  This name is used for human reference only
    • nameserver - The IP address(es) the resolver is to query.  A maximum of 3 IP addresses may be specified.  Specify multiple addresses as a list
    • domain - The local domain name
    • search - The search list for host name lookup.  A maximum of 6 search entries may be specified.  Specify multiple search entries as a list
    • sortlist - Addresses returned by gethostbyname() to be sorted.  Entries must be specified in IP 'slash notation'.  A maximum of 10 sortlist entries may be specified.  Specify multiple entries as an array.
    • options - Set internal resolver variables.  Valid values are debug, ndots:n, timeout:n, retrans:n, attempts:n, retry:n, rotate, no-check-names, inet6.  For values with 'n', specify 'n' as an integer.  Specify multiple options as an array.

Other Resource Types are:

  • Datalink Management:   etherstub, ip_tunnel, link_aggregation, solaris_vlan
  • IP Network Interfaces:  address_object, address_property, interface_properties, ip_interface, ipmp_interface, link_properties, protocol_properties, vni_interface
  • pkg(5) Management:  pkg_facet, pkg_mediator, pkg_variant
  • Naming Services:  nis, nsswitch, ldap

The zones Resource Type has been updated to provide Kernel Zone and archive support as well.

Tuesday May 06, 2014

OpenSSL on Oracle Solaris 11.2



I'm sure you are all wondering which version of OpenSSL is delivered with Oracle Solaris 11.2.
The answer is the latest and greatest OpenSSL 1.0.1h!

Now that I have answered 80% of the questions you may have about OpenSSL, I would like to announce three major features added to Oracle Solaris 11.2, which I'm sure you'll all be excited to hear about :-)

Inlined T4/T4+ instruction support and Engines


Background: S11.1 and earlier

Years and years ago, I worked on the SPARC T2/T3 crypto drivers.  On the SPARC T2/T3 processors, the crypto instructions are privileged; and therefore, the drivers are needed to access those instructions.  Thus, to make use of T2/T3 crypto hardware, OpenSSL had to use the pkcs11 engine, which adds lots of cycles going through the thick PKCS#11 session/object management layer, the Solaris kernel layer, and the hypervisor layer to the hardware, and all the way back.  However, on SPARC T4/T4+ processors, crypto instructions are no longer privileged; and therefore, you can access them directly without drivers.  Valerie Fenwick has a nice article explaining the lower level specifics of the T4 hardware.

What does that mean to you?  Much improved performance!  No more PKCS#11 layer, no more copy-in/copy-out of the data between userland and kernel space, no more scheduling, no more hypervisor, NADA!  As much as I enjoyed working on the crypto drivers, I'm happy to see this driver-less transition! ;-)

Dan Anderson has a great blog entry describing the difference between the T3 and T4 based hardware.  As he described, on Solaris 11 and 11.1, we made the T4 instructions available to OpenSSL via the OpenSSL engine mechanism.  It was great for the time being, but to make T4 instruction support available directly from the OpenSSL website and to bypass the engine layer altogether, I was assigned to assassinate the t4 engine (Sorry, Dan) and embed the T4 instructions in OpenSSL's internal crypto module (a.k.a. adding inlined T4 instruction support).

S11.2 and beyond

As I was learning how OpenSSL development worked, I learned that the OpenSSL upstream engineers had already committed the inlined T4 instruction support to the OpenSSL 1.0.2 branch.  (Thanks for making my life easier, OpenSSL team!)  I was job-less for a second, but since OpenSSL 1.0.2 won't be available in time for Solaris 11.2 delivery, we decided to patch the inlined T4 instruction support into our OpenSSL 1.0.1g delivery bundled with Solaris 11.2.

With this change, you'll get the T4/T4+ instruction support without engines; and therefore, by default you get performance as good as the t4 engine's, and even better performance for some algorithms (e.g. SHA-1, MD5).

Other Engines

Oracle Solaris 11.2 killed not only the t4 engine, but also the aesni engine and the devcrypto engine.  The story for the aesni engine is much the same as the one for the t4 engine.  It was introduced in Solaris 11, as Dan Anderson described in his article, and killed in Solaris 11.2.  AES-NI instruction support is now embedded in the OpenSSL upstream implementation (OpenSSL 1.0.1); and therefore, the separate engine is no longer needed.  The devcrypto engine was removed simply due to the lack of use.

With all these changes, Oracle Solaris 11.2 OpenSSL is left with the one and only pkcs11 engine.  The pkcs11 engine is still necessary on the T2/T3 platforms and on any platform with a hardware keystore (e.g. SCA 6000).  However, be sure to leave the pkcs11 engine disabled on T4/T4+ if you want max performance.  Again, I would like to emphasize that OpenSSL performance on T4/T4+ platforms is looking MUCH better compared to that on T2/T3 platforms!  It's time to move on to the T4/T4+ platform, Y'all!!


OpenSSL FIPS-140 version support


It is important for many federal and financial service customers to have their cryptographic products FIPS-140 validated.  Oracle Solaris Cryptographic Framework recently achieved a FIPS 140-2 validation (yay!!), and it was very important to deliver the FIPS-140 validated OpenSSL with Solaris 11.2.

At the time Solaris 11 was released, OpenSSL 1.0.0 was the latest OpenSSL version available, and since OpenSSL 1.0.0 was not FIPS-140 validated, we delivered only the non-FIPS-140 version of OpenSSL with Solaris 11.

Thanks to the OpenSSL upstream team (again), the best and greatest OpenSSL 1.0.1 can be compiled with a FIPS-140 validated module, and we are now delivering the FIPS-140 version of OpenSSL in addition to the non-FIPS-140 version of OpenSSL with Solaris 11.2.

When do you want to use FIPS-140 version of OpenSSL?


It's probably important to mention that the FIPS-140 version of OpenSSL is not for everybody.  FIPS-140 validated versions of cryptographic products come with a price tag.  Enabling FIPS-140 mode adds a lot of cycles at run time to satisfy the FIPS-140 verification requirements (e.g. POST, pair-wise consistency test, continuous RNG test, etc.).  In addition, inlined T4/T4+ instruction support is not available in the FIPS-140 version of OpenSSL, so you won't get the best performance when FIPS-140 mode is enabled.

That said, I would recommend enabling FIPS-140 mode *only if* you need to.  The good news is that you will get the FIPS-140 compatible implementation even when the FIPS-140 mode is disabled.  It's just that it runs much faster!
That's one of the reasons why the non-FIPS-140 version of OpenSSL is activated by default.

How to enable FIPS-140 version of OpenSSL


If you decide to enable FIPS-140 mode, here is how you can switch to the FIPS-140 version of OpenSSL.

Make sure you have the FIPS-140 version of OpenSSL installed on the system.

# pkg mediator -a openssl
MEDIATOR VER. SRC. VERSION IMPL. SRC. IMPLEMENTATION
openssl  vendor            vendor     default
openssl  system            system     fips-140


To activate the fips-140 implementation
# pkg set-mediator -I fips-140 openssl

To check the currently activated OpenSSL implementation
# pkg mediator openssl

To change back to the default (non-FIPS-140) implementation
# pkg set-mediator -I default openssl


OpenSSL Thread and Fork Safety


OpenSSL provides an interface CRYPTO_set_locking_callback() for you (any application or library) to set your own locking callback function with the mutexes of your choice.
That sounds reasonable if the OpenSSL library is used only by applications.  However, when the OpenSSL library is used by another library, such design is asking for trouble.

We've seen a case where an OpenSSL application used a library which set a locking callback function, and the library got unloaded while the application continued using the OpenSSL library.  The application got a segfault because OpenSSL tried to reference the invalid locking callback function set by the unloaded library.  Whose fault is this?

You can argue that the library should have set the locking callback to NULL when it was unloaded.
Well, not quite.  Once the locking callback is set to NULL, the application is no longer thread-safe.

OpenSSL needed some changes to make applications and libraries thread and fork safe.

To fix this issue, the OpenSSL library (libcrypto.so) delivered with Solaris 11.2 sets up mutexes and a locking callback internally, and it ignores an attempt to set/change the locking callback.

What does that mean to you?
OpenSSL is now thread and fork safe by default.  You don't need to make any modification to your application or library.  You can relax and have a margarita or two.

That's all I have for now.

Note:  The version number delivered with Solaris 11.2 was updated from 1.0.1g to 1.0.1h on Jun 05, 2014. OpenSSL version 1.0.1g was delivered with Solaris 11.2 Beta.

Puppet Configuration in Solaris

What is Puppet?

Puppet is IT automation software that helps system administrators manage IT infrastructure. It automates tasks such as provisioning, configuration, patch management and compliance. Repetitive tasks are easily automated, deployment of critical applications occurs rapidly, and required system changes are proactively managed. Puppet scales to meet the needs of the environment, whether it is a simple deployment or a complex infrastructure, and works on-premise or in the cloud.

Puppet is now available as part of Oracle Solaris 11.2!

Use ntpdate or ntpd -q to set the date

Puppet can error out with some very strange messages if the clocks on the master and agent aren't synchronized.  You can use ntpdate or ntpd -q to set the date just once if you'd like to manage the NTP service with Puppet, or you can configure NTP.

Install the required packages on both systems 

# pkg install puppet

This will install the puppet, facter and ruby-19 packages.

Configure the Puppet SMF instances

master # svccfg -s puppet:master setprop config/server = master.fqdn.company.com
master # svccfg -s puppet:master refresh
master # svcadm enable puppet:master

agent # svccfg -s puppet:agent setprop config/server = master.fqdn.company.com
agent # svccfg -s puppet:agent refresh

Test the connection to the master and configure authentication

Before enabling the puppet:agent service, you'll want to test the connection first in order to set up authentication.

agent # puppet agent --test --server master.fqdn.company.com

Info: Creating a new SSL key for agent.fqdn.company.com
Info: Caching certificate for ca
Info: Creating a new SSL certificate request for agent.fqdn.company.com
Info: Certificate Request fingerprint (SHA256):
C9:63:22:6A:9F:88:D6:18:7F:F3:F4:FA:89:E4:86:A1:C7:BE:94:CF:F1:D5:59:B9:DD:21:8D:C1:C9:B0:F4:18
**Exiting; no certificate found and waitforcert is disabled**

Now that the agent has created a new SSL key and certificate request, the request needs to be approved on the master.

Sign the SSL certificate on the master

master # puppet cert list
  "agent.fqdn.company.com" (SHA256)
  C9:63:22:6A:9F:88:D6:18:7F:F3:F4:FA:89:E4:86:A1:C7:BE:94:CF:F1:D5:59:B9:DD:21:8D:C1:C9:B0:F4:18

master # puppet cert sign agent.fqdn.company.com
Notice: Signed certificate request for agent.fqdn.company.com
Notice: Removing file Puppet::SSL::CertificateRequest agent.fqdn.company.com at
'/etc/puppet/ssl/ca/requests/agent.fqdn.company.com.pem'

Retest the agent to ensure it can connect

agent # puppet agent --test --server master.fqdn.company.com
Info: Caching certificate for agent.fqdn.company.com
Info: Caching certificate_revocation_list for ca
Info: Retrieving plugin
Info: Caching catalog for agent.fqdn.company.com
Info: Applying configuration version '1371232699'
Notice: Finished catalog run in 0.65 seconds

Enable the agent service

agent # svcadm enable puppet:agent

Additional configuration of /etc/puppet/puppet.conf on both master and agent (optional) 

Further customizations can be made in /etc/puppet/puppet.conf.  See Puppet's Configurables page for more details.

NOTE:  Puppet's configuration is done entirely via SMF stencils.  /etc/puppet/puppet.conf should not be edited directly, as any edits will be lost when the Puppet SMF service (re)starts.  Set new values via svccfg(1M):

# svccfg -s puppet:agent setprop config/<option> = <value>

# svccfg -s puppet:agent refresh

(substitute :master as needed)
About

The Observatory is a blog for users of Oracle Solaris. Tune in here for tips, tricks and more as we explore the Solaris operating system from Oracle.
