Tuesday Nov 10, 2015

Last login tracking in pam_unix_session

When you first log in to a desktop session on Solaris 11.3, you may notice a new notification dialog box informing you of your last login time and location, which can help you spot an unauthorized login. Displaying this information is a good security practice and is commonly required by various security policies.

Previously different applications handled this display in different ways (or didn't show the info at all) - in Solaris 11.3, it's been centralized into a common implementation in PAM so that all login methods should display it uniformly.

The time and location of a user’s login has long been recorded in the /var/log/lastlog file by pam_unix_session on Solaris. Other parts of the PAM stack reference this file for inactive account tracking. In prior Solaris releases, /bin/login and ssh would read the file and then print a message such as:

      Last login: Wed Sep 17 15:24:05 2014 from gojira

Instead of copying that code into every application that processes logins, the PAM team decided to remove the existing calls and have pam_unix_session print the message instead, via the PAM conversation routines that all conforming PAM applications should be using.  Applications that don't want to show this can pass the PAM_SILENT flag to pam_open_session(3PAM), as in the sketch below.
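
Here is a minimal sketch of the application side (the service name "myapp", the user name, and the no-op conversation function are illustrative placeholders, not part of the Solaris implementation); it suppresses the last-login message by passing PAM_SILENT to pam_open_session(3PAM). Link with -lpam.

#include <stdio.h>
#include <security/pam_appl.h>

/* Placeholder conversation function; a real application would display msg[]. */
static int
noop_conv(int num_msg, struct pam_message **msg,
    struct pam_response **resp, void *appdata)
{
	return (PAM_CONV_ERR);
}

int
main(void)
{
	struct pam_conv conv = { noop_conv, NULL };
	pam_handle_t *pamh;

	if (pam_start("myapp", "tester", &conv, &pamh) != PAM_SUCCESS)
		return (1);

	/* ... pam_authenticate()/pam_acct_mgmt() would normally come first ... */

	/* PAM_SILENT suppresses the last-login message from pam_unix_session. */
	(void) pam_open_session(pamh, PAM_SILENT);

	(void) pam_close_session(pamh, 0);
	(void) pam_end(pamh, PAM_SUCCESS);
	return (0);
}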

If sysadmins want to silence the notices, they can do so via the PAM configuration files for the application in question. For instance, if you don’t want that popup when you login to the desktop via the GDM display manager, you simply need to create /etc/pam.d/gdm (or update it if you already have one) to include the line:

      session required        pam_unix_session.so.1    nowarn

The pam_unix_session(5) man page has been updated to describe this change as well.

Thursday Oct 29, 2015

Oracle Solaris 11.3 progress on LP64 conversion

Last year, I posted Moving Oracle Solaris to LP64 bit by bit. With this week’s release of Oracle Solaris 11.3 at Oracle OpenWorld 2015, we can provide a bit of a progress update on that effort.

While most of the conversion work is going into our main development train, some of this work is already visible in the Solaris 11 update releases, where you can see the number of LP64 programs in /usr/bin and /usr/sbin in the full Oracle Solaris package repositories has climbed each release, and climbed again in 11.3:

Release        32-bit        64-bit       total
Solaris 11.0   1707 (92%)    144 (8%)     1851
Solaris 11.1   1723 (92%)    150 (8%)     1873
Solaris 11.2   1652 (86%)    271 (14%)    1923
Solaris 11.3   1603 (81%)    379 (19%)    1982

(These numbers count only ELF binaries, not Python programs, shell scripts, etc.) The numbers go even higher when you include the Selected FOSS evaluation packages that have been backported from our development branch to run on Solaris 11.

Solaris 11.3 also brings a freshly updated version of the Oracle® Solaris 64-bit Developer's Guide to the doc set. The developers driving the LP64 conversion work, together with colleagues from the Oracle Solaris Studio developer tools teams, went through the existing document to bring it up to date and add the lessons we've learned so far, and worked closely with the Solaris documentation writers to make this guide more useful for other developers making their software 64-bit clean and ready. Hopefully you're all doing that now, having read about why 64-bit software is increasingly important, or having seen that it's required for using the new M7 ADI features to check your memory access (ADI uses bits available in 64-bit pointers, whereas there's no room to spare in 32-bit pointers).

There are also a few new additions on the Oracle Solaris 11 End of Feature Notices lists where we've found more programs that weren't worth the effort to convert or update (along with software added to the lists for unrelated reasons).

Of course, this is not the end state, just another milestone on the journey, and we’ll be back with another progress update for our next release.

Saturday Oct 10, 2015

Valgrind: Easy and powerful detection of memory and threading problems

Valgrind in Solaris

Valgrind is a framework and a set of tools for dynamic binary analysis of userspace programs. There are Valgrind tools that can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, making your programs more stable. You can also perform detailed profiling to help speed up your programs. Valgrind also supports (incremental) memory leak checking and many other advanced features. See its manual. In short: Valgrind is a debugging and profiling system for large, complex programs.

Recently, support for the Oracle Solaris OS was included in the latest Valgrind release, 3.11.0. Let's see it in action on a simple memory corruption problem.

Valgrind in Action

No system is bug free. There are always corner cases and edge conditions which developers have not thought of and testing did not cover. Let me demonstrate such a corner-case problem which exists in Oracle Solaris 11.2.

Create a local user, for example:

# useradd -d /export/home/tester -m -c "test user" -s /bin/bash tester

Setup the environment for problem reproduction:

# mkdir /foo
# cd /foo
# touch test_{1..32000}.txt

And simply watch sudo program crash:

# sudo chown tester test_*
Segmentation Fault

Now what's going on? Before diving into time-consuming core dump analysis, let's try Valgrind to see if it readily gives us an answer:

# valgrind -q sudo chown tester test_*
valgrind: sudo: Permission denied

Hmm, sudo is a special setuid binary. Let's copy it somewhere else and remove the setuid bit (it won't be needed):

# cp /usr/bin/sudo /var/tmp/sudo
# chmod a-s /var/tmp/sudo
# valgrind -q /var/tmp/sudo chown tester test_*

and voilà! The last few reported errors point in the right direction:

==6925== Invalid write of size 1
==6925==    at 0x4F09FC9A: adr_short (in /lib/libbsm.so.1)
==6925==    by 0x4F0A6ED0: au_to_cmd (in /lib/libbsm.so.1)
==6925==    by 0x4F0A351D: adt_to_cmd (in /lib/libbsm.so.1)
==6925==    by 0x4F0A336C: adt_generate_token (in /lib/libbsm.so.1)
==6925==    by 0x4F0A2D5D: adt_generate_event (in /lib/libbsm.so.1)
==6925==    by 0x4F0A2E6B: adt_put_event (in /lib/libbsm.so.1)
==6925==    by 0x4F21FFA4: solaris_audit_success (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F229AA2: audit_success (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F21D80E: sudoers_policy_main (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F21A30F: sudoers_policy_check (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x8060A2C: main (in /var/tmp/sudo)
==6925==  Address 0x4fbf1de4 is 16 bytes after a block of size 8,620 alloc'd
==6925==    at 0x4FF8208A: malloc (vg_replace_malloc.c:310)
==6925==    by 0x4F0A5F12: get_token (in /lib/libbsm.so.1)
==6925==    by 0x4F0A6E4B: au_to_cmd (in /lib/libbsm.so.1)
==6925==    by 0x4F0A351D: adt_to_cmd (in /lib/libbsm.so.1)
==6925==    by 0x4F0A336C: adt_generate_token (in /lib/libbsm.so.1)
==6925==    by 0x4F0A2D5D: adt_generate_event (in /lib/libbsm.so.1)
==6925==    by 0x4F0A2E6B: adt_put_event (in /lib/libbsm.so.1)
==6925==    by 0x4F21FFA4: solaris_audit_success (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F229AA2: audit_success (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F21D80E: sudoers_policy_main (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x4F21A30F: sudoers_policy_check (in /usr/lib/sudo/sudoers.so)
==6925==    by 0x8060A2C: main (in /var/tmp/sudo)

Valgrind is telling us there is a buffer overrun: the program writes past the end of an allocated buffer. Our focus now turns to the au_to_cmd() function in libbsm.so.1, where the problem is immediately apparent.
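
If you haven't used Valgrind's Memcheck before, here is a deliberately broken little program (purely illustrative; this is not the sudo/libbsm code) that produces the same class of "Invalid write of size 1" report:

#include <stdlib.h>
#include <string.h>

int
main(void)
{
	char *buf = malloc(8);

	/* "8 chars!" is 8 characters plus a terminating NUL: 9 bytes into an 8-byte block. */
	strcpy(buf, "8 chars!");
	free(buf);
	return (0);
}

Compile it and run "valgrind ./a.out": Memcheck reports an invalid write one byte past the 8-byte block, with a stack trace analogous to the one above.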

Limitations

Support for the Oracle Solaris OS in Valgrind is currently limited to the x86 (32-bit) and amd64 (64-bit) architectures. We are working on sparcv9 architecture support, but there is still much work ahead.

Valgrind is a dynamic binary analysis tool: it does not need access to the program's source code. However, it cannot spot problems in code paths that are never executed.

About the Author

Ivo Raisr works as a software engineer at Oracle in the Systems organization. Porting Valgrind to the Solaris OS is one of his favorite hobbies. He believes that if it is useful for Oracle engineering, it must also be useful for partners, ISVs, and customers.

Thursday Sep 10, 2015

Minimizing the Size of Your Oracle Solaris IPS Package Repository

If you have configured your Oracle Solaris 11 systems to use a local IPS package repository to install and update software, you might have wondered whether you can limit the size of the repository and still well serve the needs of all your systems and their users.

A new article, How to Minimize the Size of Your Oracle Solaris IPS Package Repository, reviews the tradeoffs of using a repository that contains all packages since the first Solaris 11 release or using a repository that contains all packages associated with one release.

If you decide that a repository that contains only packages for one release will fulfill your requirements, follow the instructions in the article for using the repository files that are provided for each release. These instructions show how to replace the repository with the least disruption to users when repository files for the next release are available.

Remember that you cannot create a functional repository by including only packages with a particular version string, such as @0.5.11-0.175.2. Such a strategy omits required packages from a previous release that are still current in this release and required packages that use a different version numbering scheme.

The following is an example of a minimum repository life cycle:

  1. Download repository files for Solaris 11.n and follow the instructions included with those files to create a repository.
  2. Install systems using this Solaris 11.n package repository.
  3. Download repository files for the most current Solaris 11.n SRU. SRU content is cumulative, so if the most current SRU is SRU 3, you can add just SRU 3 to get all fixes provided by SRU 1, SRU 2, and SRU 3. You only need to download and add SRU 1, SRU 2, and SRU 3 if some of your systems need to be able to boot to Solaris 11.n SRU 1 or Solaris 11.n SRU 2.
  4. Add this SRU content to your existing Solaris 11.n repository according to the instructions included with those files.
  5. Use this updated repository to update your systems.
  6. As additional SRUs are released, download those repository files, add that content to your Solaris 11.n repository, and update your systems.
  7. When Solaris 11.n+1 is available, download repository files for that release and follow the instructions in the article cited above to replace your local Solaris 11.n package repository with a new Solaris 11.n+1 repository.
  8. Use this new repository to update your systems to Solaris 11.n+1.

Wednesday Aug 19, 2015

AI Manifest Editor CLI in Solaris 11.3

Introducing the AI Manifest Editor CLI in Solaris 11.3

Does anyone like editing XML files?

In Solaris 11.3, we have added an interactive editing capability to installadm(1M) that allows you to create and edit a customized XML manifest for an AI install service, without having to view or understand an XML document.

The interactive interface presents the AI manifest content as a set of objects and properties that can be manipulated using subcommands entered at the interactive interface prompt. The interface can be accessed from the create-manifest or update-manifest subcommands.

Creating a Manifest

To give you an idea of how it works, let's walk through a sample session. Assume you want to create a new manifest for install service sol_11_3 and change the publisher to point to a local repository.

# installadm create-manifest -n sol_11_3 -m mymanifest (no need to provide -f option)

Type help to see list of subcommands.

installadm:mymanifest> help

The following subcommands are supported:
Operations: set     add     delete  move
Navigation: select  cancel  end
Additional: help    info    walk
            commit  exit    validate
            shell

For more information, type: help <subcommand>

installadm:mymanifest> help info

Usage:
info [-v|--verbose]
Display information about object and property settings starting at the
current level. For objects more than one level down, a summary line is
displayed, followed by '...'. When multiples of a given object exist,
the order is designated by <object>[<position#>], e.g., disk[3]. Use
'-v' option for verbose output.

installadm:mymanifest> info (a reasonable set of defaults is pre-set for you)

http-proxy: <not specified>
auto-reboot: false
  create-swap: true
  create-dump: true
  software:
     type: IPS
     name: <not specified>
     facet[1]: facet.locale.*=false ...
     ...
     publisher: name=solaris ...
     pkg-list: action=install ...
  disk: Section not specified
  pool:
     action: create
     name: rpool
     is-root: true
     is-boot: false
     mount point: <not specified>
     pool-option: Section not specified
     dataset-option: Section not specified
     be: name=solaris ...
        be-option: Section not specified
     vdev: Section not specified
     filesystem[1]:  name=export ...
        option: Section not specified
     filesystem[2]: name=export/home ...
        option: Section not specified
     volume: Section not specified
  boot-mods: Section not specified
  configuration: Section not specified

installadm:mymanifest> select software (the publisher object is located under software)

installadm:mymanifest:software> select publisher (the prompt changes to reflect the selected level)

installadm:mymanifest:software:publisher> info

name: solaris
  key: <not specified>
  cert: <not specified>
  ca-cert: <not specified>
  origin: http://pkg.oracle.com/solaris/release  (you want to set this - get help for 'set') 
  mirror: <not specified> 
  cmd-options: <not specified>

installadm:mymanifest:software:publisher> help set

Usage:
set <property>=<value>
Valid properties to set are:
origin, cmd-options, name, ca-cert, cert, key, mirror

installadm:mymanifest:software:publisher> set origin=http://myrepo.example.com/solaris

installadm:mymanifest:software:publisher> info

name: solaris
  key: <not specified>
  cert: <not specified>
  ca-cert: <not specified>
  origin: http://myrepo.example.com/solaris  (reflects newly set origin)
  mirror: <not specified>
  cmd-options: <not specified>

installadm:mymanifest:software:publisher> end (goes up one level)

installadm:mymanifest:software> end

installadm:mymanifest> exit

1. Save manifest and exit
2. Exit without saving uncommitted changes
3. Continue editing
Please select choice: 1	(choose to save the  manifest)

Created Manifest: 'mymanifest'   (and the manifest is created - it's that simple!)

For those doubters among us, if you now export the manifest, you will see the XML snippet:

     <publisher name="solaris">
         <origin name="http://myrepo.example.com/solaris"/>
     </publisher>

...which reflects the change that was just made.


Updating a manifest

Similarly, you can use the interactive interface to edit a manifest using update-manifest. Let's add a new publisher to the manifest we just created above, but place it first in the search order:

# installadm update-manifest -n sol_11_3 -m mymanifest

Type help to see list of subcommands.

installadm:mymanifest> select software

installadm:mymanifest:software> add publisher (add new publisher object)

  name: <not specified>
  key: <not specified>
  cert: <not specified>
  ca-cert: <not specified>
  origin: <not specified>
  mirror: <not specified>
  cmd-options: <not specified>

installadm:mymanifest:software:publisher> set name=publisher2

installadm:mymanifest:software:publisher> set origin=http://pub2origin.com/repo

installadm:mymanifest:software:publisher> end

installadm:mymanifest:software> info

type: IPS
  name: <not specified>
  facet[1]:
     name: facet.locale.*
     value: false
  ...
  publisher[1]:
     name: solaris
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     origin: http://myrepo.example.com/solaris
     mirror: <not specified>
     cmd-options: <not specified>
  publisher[2]:
     name: publisher2
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     origin: http://pub2origin.com/repo
     mirror: <not specified>
     cmd-options: <not specified>
  pkg-list:
     action: install
     name: pkg:/entire@0.5.11-0.175.3
     name: pkg:/group/system/solaris-large-server
     reject: <not specified>

installadm:mymanifest:software> move publisher 2 1 (moves new publisher to position 1)

installadm:mymanifest:software> info

  type: IPS
  name: <not specified>
  facet[1]:
     name: facet.locale.*
     value: false
  ...
  publisher[1]:
     name: publisher2
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     origin: http://pub2origin.com/repo
     mirror: <not specified>
     cmd-options: <not specified>
  publisher[2]:
     name: solaris
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     origin: http://myrepo.example.com/solaris
     mirror: <not specified>
     cmd-options: <not specified>
  pkg-list:
     action: install
     name: pkg:/entire@0.5.11-0.175.3
     name: pkg:/group/system/solaris-large-server
     reject: <not specified>

installadm:mymanifest:software> exit

1. Save manifest and exit
2. Exit without saving uncommitted changes
3. Continue editing
Please select choice: 1
Changed Manifest: 'mymanifest'

Updating Derived Manifest Scripts

While it isn't possible to edit derived manifest scripts with the interactive interface, we've made it easier for you to modify them by invoking an editor from update-manifest.

Let's say you want to update the manifest, "orig_default", of the sol_11_3 service.

# installadm list -m -n sol_11_3

Service Name Manifest Name Type    Status   Criteria
------------ ------------- ----    ------   --------
sol_11_3     orig_default  derived default  none

# installadm update-manifest -n sol_11_3 -m orig_default

  < Places you into an editor specified by the environment variable, VISUAL.  If VISUAL is not defined, EDITOR is used instead. If neither are defined, then the default editor vi(1) is used. Make changes, save, and exit. >

Changed Manifest: 'orig_default'

This one-step process replaces what was previously a four-step process (running "installadm export" to copy the orig_default manifest to a file, editing the file, running "installadm update-manifest -f" to update orig_default, and then removing the edited file).

Creating and/or Updating a Manifest from a File and Editing the Contents

If you want to create or update a manifest from a file, but want to edit the manifest contents before the manifest is saved to the install service, you can use the -e option.

Here we create a manifest based on the default_archive.xml file located under:

 <imagepath>/auto_install/manifest/default_archive.xml

# installadm create-manifest -n sol_11_3 -f /export/auto_install/sol_11_3/auto_install/manifest/default_archive.xml -m myarchive -e

Type help to see list of subcommands.

installadm:myarchive> info

  http-proxy: <not specified>
  auto-reboot: false
  create-swap: true
  create-dump: true
  software:
     type: ARCHIVE
     name: <not specified>
     uri: file:///.cdrom/archive.uar
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     http-auth-token: <not specified>
     archive-name: *
  disk: Section not specified
  pool:
     ...
  boot-mods: Section not specified
  configuration: Section not specified

installadm:myarchive> select software

installadm:myarchive:software> set uri=http://someserver/dir/myarchive.uar

installadm:myarchive:software> info

type: ARCHIVE
  name: <not specified>
  uri: http://someserver/dir/myarchive.uar
  key: <not specified>
  cert: <not specified>
  ca-cert: <not specified>
  http-auth-token: <not specified>
  archive-name: *

installadm:myarchive:software> end

installadm:myarchive> commit

installadm:myarchive> validate

installadm:myarchive> exit

1. Save manifest and exit
2. Exit without saving uncommitted changes
3. Continue editing
Please select choice: 1
Created Manifest: 'myarchive'

You can do the same thing if -f points to a derived manifest script, but you will be placed into an editor such as vi(1) as described earlier. Let's say you want to create a manifest based on the script, myscript.ksh, but you want to make a couple of changes (without modifying myscript.ksh).

# installadm create-manifest -n sol_11_3 -m myderived -f ./myscript.ksh -e

	<Make changes, save, and exit.>
	Created Manifest: 'myderived' 

You can similarly use the -f and -e options for update-manifest.

Creating a Manifest from an Existing Manifest

Another handy addition to create-manifest added in Solaris 11.3 is the ability to create a manifest based on the contents of an existing manifest (-M).

You can optionally append the -e option if you want to edit the manifest with the interactive interface (for an XML manifest) or editor (for a derived manifest script) before it is saved to the install service.

For instance, the following command copies the content of the myarchive manifest and allows you to modify it before creating the newarchive manifest (the myarchive manifest is unchanged).

# installadm create-manifest -n sol_11_3 -M myarchive -m newarchive -e

Type help to see list of subcommands.

installadm:newarchive> info

  http-proxy: <not specified>
  auto-reboot: false
  create-swap: true
  create-dump: true
  software:
     type: ARCHIVE
     name: <not specified>
     uri: http://someserver/dir/myarchive.uar
     key: <not specified>
     cert: <not specified>
     ca-cert: <not specified>
     http-auth-token: <not specified>
     archive-name: *
  disk: Section not specified
  ...

installadm:newarchive> select software

installadm:newarchive:software> set uri=http://newserver/newdir/newarchive.uar

installadm:newarchive:software> info

  type: ARCHIVE
  name: <not specified>
  uri: http://newserver/newdir/newarchive.uar
  key: <not specified>
  cert: <not specified>
  ca-cert: <not specified>
  http-auth-token: <not specified>
  archive-name: *

installadm:newarchive:software> exit

...

Please select choice: 1
Created Manifest: 'newarchive'

Tuesday Jul 07, 2015

Virtual Address Reservation in Solaris 11.3

For applications that need to place memory at fixed locations in their address space (like the Oracle SGA), there is a new feature in Solaris 11.3 called Virtual Address Reservation that provides support for such fixed address mappings. Today, a fixed address mapping can fail if the system has already assigned a mapping to the desired location. As the system is free to choose any unused region in a process' address space for mapping things such as libraries, such conflicts can arise.  Worse yet, if the application used mmap(2) with MAP_FIXED, the call would succeed but could destroy an existing mapping.

Virtual Address Reservation in Solaris 11.3 provides the means to 'reserve' a portion of a process' address space, which prevents the system from using the reserved space for mapping operations that don't specify a fixed address. VA reservations guarantee that fixed address mappings will succeed.

To create a VA Reservation requires that the application be recompiled with a Mapfile (version 2) containing the RESERVE_SEGMENT directive that specifies the virtual address range to reserve. Multiple RESERVE_SEGMENT directives can be specified in the Mapfile to create multiple VA Reservations. The Mapfile below would reserve the VA range from 0x300000000 to 0x300400000. 

# cat Mapfile

$mapfile_version 2

RESERVE_SEGMENT myReservedVaName {
        VADDR = 0x300000000;
        SIZE = 0x400000;
};

# cc file.c -Mmapfile -m64

On execution of the resulting a.out binary, the specified virtual address range is reserved early during process startup, before libraries are mapped. pmap(1) can be run on the running process to see its VA reservation(s); they appear in the pmap output as "[ reserved ]".


0000000100000000        32K r-x----  /a.out
0000000100106000         8K rwx----  /a.out
0000000300000000      4096K -------  [ reserved ]
FFFFFFFF7F200000      2112K r-x----  /lib/sparcv9/libc.so.1
...


To use the reserved space, the application simply needs to specify a fixed address that corresponds to the Reserved VA range on calls to either mmap(2) or shmat(2).
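
Here is a minimal sketch of such a mapping (assuming the binary was linked with the Mapfile above and built with -m64, so the reserved range actually exists in the process):

#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	/* Address and size match the RESERVE_SEGMENT in the Mapfile above. */
	void *want = (void *)0x300000000UL;
	size_t len = 0x400000;
	void *p;

	/* MAP_FIXED is safe here only because this range was reserved at link time. */
	p = mmap(want, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	(void) printf("mapped %zu bytes at %p\n", len, p);
	return (0);
}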

Please note that VA Reservation only addresses possible conflicts related to fixed address mappings.  Applications that use fixed address mappings should be well aware of other potential problems. For instance, my example above (on SPARC) reserves the VA space starting at 0x300000000.  This could cause malloc failures if the process is memory intensive and the heap needs to grow larger than 8G (the heap starts at around 0x100106000 and cannot grow past 0x300000000).

APIs for handling per-thread signals in Solaris

Introduction
-------------
Solaris 11.3 introduces the following APIs to allow one process to interact directly
with a specific thread in a different process.

int proc_thr_kill(pid_t pid, pthread_t thread, int sig);
int proc_thr_sigqueue(pid_t pid, pthread_t thread, int sig, const union sigval value);
int proc_thr_sigqueue_wait(pid_t pid, pthread_t thread, int sig, const union sigval value,
    const struct timespec *timeout);

These APIs are patterned after the process-directed signal APIs kill(2),
sigqueue(3C), and sigqueue_wait(3C). The introduction of these APIs does not change
the basics of signal generation and reception; there is still no guarantee that a
signal has been received by the target, since that depends on whether the target
has blocked or ignored the signal.

Use Case
---------
These APIs can be used in any multi-process, multi-threaded application between
threads of cooperating processes, where threads that are handling specific tasks
need to receive signals. An example would be an application that deals with network
I/O, wherein each thread in a process handles one connection. In such a scenario,
using the thread-directed signal APIs, a specific thread could be forced to clean up
and abort due to errors, or asked to dump status/debug info. A signal handler that
performs the desired action (abort, dump) in response to a specific signal has to be
implemented by the processes/threads that receive the signals.
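
Here is a minimal sketch of the sending side. The pid and thread ID are made-up
placeholders; a real application would learn them through some application-specific
channel. The declaring header is given in proc_thr_kill(3C); the explicit prototype
below simply mirrors the signature shown above.

#include <sys/types.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

extern int proc_thr_kill(pid_t pid, pthread_t thread, int sig);

int
main(void)
{
	pid_t target_pid = 12345;	/* hypothetical target process */
	pthread_t worker = 7;		/* hypothetical thread in that process */

	/* Ask that one specific thread to dump its status via SIGUSR1. */
	if (proc_thr_kill(target_pid, worker, SIGUSR1) != 0) {
		perror("proc_thr_kill");
		return (1);
	}
	return (0);
}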

What does this mean for you?
Solaris threads of two independent, cooperating processes can now send and receive
signals on a per-thread basis.

Document Reference
-------------------
See man pages for
 - proc_thr_kill(3C)
 - proc_thr_sigqueue(3C)
 - proc_thr_sigqueue_wait(3C)

PV IPoIB in Kernel Zones in Solaris 11.3

Paravirtualization of IP over InfiniBand (IPoIB) in kernel zones is a new
feature in Solaris 11.3 that enhances the network virtualization offering in
Solaris. It allows existing IP applications in the guest to run over InfiniBand
fabrics. Features such as kernel zone live migration and IPMP are supported
with the paravirtualized IPoIB datalinks, making them an appealing option.

Moreover, the device management of these guest datalinks is similar to that of
their Ethernet counterparts, making them straightforward to configure and
manage. zonecfg is used in the host to configure the kernel zone's automatic
network interface (anet): you select the link of the IB HCA port to
paravirtualize and assign as the lower-link, the partition key (P_Key) within
the IB fabric, and the link mode, which can be either IPoIB-CM or IPoIB-UD.

The PV IPoIB datalink is a front-end guest driver emulating an IPoIB VNIC in
the host, created over a physical IB partition datalink per P_Key and port.

Configuring a PV IPoIB datalink in a kernel zone is fairly simple. Here is an
example showing how to create one.

1. Find the IB datalink in the host to paravirtualize. 

I am selecting net7 for this example.

# ibadm
HCA             TYPE      STATE     IOV    ZONE
hermon0         physical  online    off    global

# dladm show-ib
LINK      HCAGUID        PORTGUID       PORT STATE   GWNAME       GWPORT   PKEYS
net5      21280001A0D220 21280001A0D222 2    up      --           --       8001,FFFF
net7      21280001A0D220 21280001A0D221 1    up      --           --       8001,FFFF
                                                  
# dladm show-phys
LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
net0              Ethernet             up         1000   full      igb0
net2              Ethernet             unknown    0      unknown   igb2
net3              Ethernet             unknown    0      unknown   igb3
net1              Ethernet             unknown    0      unknown   igb1
net4              Ethernet             up         10     full      usbecm0
net5              Infiniband           up         32000  full      ibp1
net7              Infiniband           up         32000  full      ibp0

2. Create an IPoIB PV datalink in the kernel zone.
To add an IPoIB PV interface to a kernel zone, say tzone1, use zonecfg to add
an anet and specify a lower-link and pkey, which are mandatory properties. If
the link mode is not specified, IPoIB-CM is the default.

# zonecfg -z tzone1
    zonecfg:tzone1> add anet
    zonecfg:tzone1:anet> set lower-link=net7
    zonecfg:tzone1:anet> set pkey=0xffff
    zonecfg:tzone1:anet> info
    anet 1:
        lower-link: net7
        ...
        pkey: 0xffff
        linkmode not specified
        evs not specified
        vport not specified
        iov: off
        lro: auto
        id: 1
    ...
    zonecfg:tzone1>exit
#

3. Add additional IPoIB PV datalinks to the kernel zone.
Additional IPoIB PV interfaces, each with a lower-link and pkey, can be added
to the kernel zone as shown above. These datalinks can be used exclusively
to host native zones within the kernel zone.

4. The PV IPoIB datalinks appear within the kernel zone on boot.

root@tzone1:~# dladm 
LINK                CLASS     MTU    STATE    OVER
net1                phys      65520  up       --
net0                phys      65520  up       --

root@tzone1:~# ipadm
NAME              CLASS/TYPE STATE        UNDER      ADDR
lo0               loopback   ok           --         --
   lo0/v4         static     ok           --         127.0.0.1/8
   lo0/v6         static     ok           --         ::1/128
net0              ip         ok           --         --
   net0/v4        static     ok           --         1.1.1.190/24
net1              ip         ok           --         --
   net1/v4        static     ok           --         2.2.2.190/24

Virtual NICs (VNICs) tzone1/net0 and tzone1/net1 are created in the host
kernel; they are the back end of the PV interfaces.

# dladm show-vnic
LINK            OVER           SPEED  MACADDRESS        MACADDRTYPE IDS
tzone1/net1     net7           32000  80:0:0:4d:fe:..   fixed       PKEY:0xffff
tzone1/net0     net7           32000  80:0:0:4e:fe:..   fixed       PKEY:0xffff

Named threads in Oracle Solaris 11.3

We've added a new feature in Solaris 11.3, the ability to name threads.  With this feature, you can now give a semantically meaningful name to a thread.  This can make life easier when trying to figure out which threads in your application are doing what.

These are the new functions that have been added for this feature:

int pthread_setname_np(pthread_t t, const char *name );

int pthread_getname_np(pthread_t t, char *buf, size_t len);

int pthread_attr_setname_np(pthread_attr_t *attr, const char *name);

int pthread_attr_getname_np(pthread_attr_t *attr, char *buf, size_t len);


pthread_setname_np(3C) allows an existing thread to be named.  pthread_attr_setname_np(3C) allows a thread to be named before it is  created.  Both pthread_getname_np(3C) and pthread_attr_getname_np(3C)  let you retrieve the name of a thread.
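
Here is a minimal sketch that names a couple of threads so they show up in the ps(1) and prstat(1M) output below (the thread names and the idle worker are purely illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Worker that just parks so it stays visible in ps -L / prstat. */
static void *
worker(void *arg)
{
	for (;;)
		(void) pause();
	return (NULL);
}

int
main(void)
{
	pthread_t t;
	char name[32];

	/* Name the main thread and a newly created worker thread. */
	(void) pthread_setname_np(pthread_self(), "moe");
	(void) pthread_create(&t, NULL, worker, NULL);
	(void) pthread_setname_np(t, "curly");

	/* Read the worker's name back. */
	(void) pthread_getname_np(t, name, sizeof (name));
	(void) printf("worker thread is named %s\n", name);

	(void) pause();
	return (0);
}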

Thread names are exposed by prstat(1M) and ps(1).  For example, the 'ps -L' output has been modified to include the thread name (LNAME):

$ ps -L
  PID   LWP LNAME     TTY         LTIME CMD
 2644     1 -         pts/32       0:00 bash
14320     1 moe       pts/32       0:00 a.out
14320     2 curly     pts/32       0:00 a.out
14320     3 larry     pts/32       0:00 a.out
14320     4 shemp     pts/32       0:00 a.out
14321     1 -         pts/32       0:00 ps
$

Similarly, a format specifier has been added for the '-o' option to ps(1):

$ ps -L -o pid,lwp,lname,fname
  PID   LWP LNAME     COMMAND
 2644     1 -         bash
13421     1 moe       a.out
13421     2 curly     a.out
13421     3 larry     a.out
13421     4 shemp     a.out
13422     1 -         ps

prstat(1M) now displays the thread name instead of the thread ID, if it has been set.  (If the thread hasn't been named, the thread ID is displayed.) For example:

# prstat -Lmp `pgrep nscd` 5
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG LWPID PROCESS/LWPNAME
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   4   0  37   0  1762 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0  13   0  5661 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0  14   0   200 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   7   0  21   0     2 nscd/set_smf_state
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5662 nscd/server_tsd_bind
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5660 nscd/reaper
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5659 nscd/revalidate
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5657 nscd/reaper
100357 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0  5656 nscd/revalidate
[ ... ]

We've also added a new variable to DTrace, uthreadname.  (This is the userspace thread name. We've also added kthreadname, as we also allow naming kernel threads.)  The following DTrace script would tell you which threads in your application are most active:

#!/usr/sbin/dtrace -s

profile-397
/ pid == $target /
{
        @[uthreadname] = count();
}

The output from this script might appear as follows:

$ ./uthr.d -p `pgrep a.out`
dtrace: script './uthr.d' matched 1 probe
^C

shemp             29
larry             35
curly             36
moe             4423

Better performing pthread reader-writer locks for NUMA architectures

Introduction
-------------
Solaris 11.3 introduces an intelligent version of process private reader-writer locks
(rwlocks) that is aware of the underlying NUMA architecture. This awareness can be
used to extract better performance from the reader-writer locks. This is accomplished
by using a technique called lock cohorting. This blog post describes how to use the
new rwlocks.

Interface Changes
------------------
To ensure compatibility and ease of adoption, the only change made to the current
pthread_rwlock interfaces (see the pthread_rwlock_*(3C) and pthread_rwlockattr_*(3C)
interfaces) is the introduction of a new attribute value, PTHREAD_RWSCALE_PRIVATE_NP.
Once the attribute object is set to PTHREAD_RWSCALE_PRIVATE_NP, the rest of the
application remains unchanged.

	pthread_rwlockattr_t	lock_attr;
	pthread_rwlockattr_init(&lock_attr);
	pthread_rwlockattr_setpshared(&lock_attr, PTHREAD_RWSCALE_PRIVATE_NP);

The above code sets the lock_attr object to the NUMA-aware type. When an rwlock is
created using lock_attr, it will provide better performance than the traditional
process-private rwlocks.

Use Case
---------
These locks are best used in the following scenario.
Consider a multi-threaded process in which the threads share a process-private rwlock.
If such a process is running on a NUMA machine and the threads of the process are
*not* confined to a single NUMA node, then considerable performance improvement
can be seen when using the new rwlocks. If the threads of such a process are confined
to a single NUMA node, then it performs as a process running on a UMA machine would;
no performance degradation will be seen.

Example
--------
Here is a simple example to demonstrate the creation of an rwlock of
PTHREAD_RWSCALE_PRIVATE_NP type.

#include <pthread.h>

int
main()
{
	/* Declare a variable of type pthread_rwlockattr_t */
	pthread_rwlockattr_t rwlattr;
	pthread_rwlock_t rwlock;
	int rc;

	/*
	 * Initialize the attribute object and call pthread_rwlockattr_setpshared
	 * to set it to the appropriate value. The default value set by
	 * pthread_rwlockattr_init is PTHREAD_PROCESS_PRIVATE.
	 */
	rc = pthread_rwlockattr_init(&rwlattr);
	rc = pthread_rwlockattr_setpshared(&rwlattr, PTHREAD_RWSCALE_PRIVATE_NP);

	/*
	 * Call pthread_rwlock_init with the initialized rwlockattr object to
	 * initialize the rwlock to the desired type
	 */
	rc = pthread_rwlock_init(&rwlock, &rwlattr);

	/*
	 * Use the lock and destroy it by using pthread_rwlock_destroy.
	 * It is important to destroy the lock to ensure that the memory
	 * allocated by the lock internally is freed.
	 */
	rc = pthread_rwlock_destroy(&rwlock);
}

New Security Extensions in Oracle Solaris 11.3

In Solaris 11.3, we've expanded the security extensions framework to give you more tools to defend your installations. In addition to Address Space Layout Randomization (ASLR), we now offer tools to set a non-executable stack (NXSTACK) and a non-executable heap (NXHEAP). We've also improved the sxadm(1M) utility to make it easier to manage security extension configurations.

NXSTACK

When NXSTACK is enabled, the process stack memory segment is marked non-executable. This extension defends against attacks that rely on injecting malicious code and executing it on the stack. You can also configure NXSTACK to log each time a program tries to execute code on the stack. Log entries are output to /var/adm/messages.

Very few non-malicious programs need to execute code on the stack, so NXSTACK is enabled by default in Solaris 11.3. If you have a program that needs to execute code on the stack and you are able to recompile it, you can pass the "-z nxstack=disable" flag to Solaris Studio. Otherwise, you can use sxadm either to disable NXSTACK or to set it to apply only to tagged binaries. Most core Solaris utilities are tagged for NXSTACK.

Note that NXSTACK takes the place of the "noexec_user_stack" and "noexec_user_stack_log" entries in /etc/system. You can still use those entries to configure non-executable stack, and they will take precedence over any configuration of NXSTACK. However, they are considered deprecated and you are encouraged to switch to using NXSTACK through sxadm.

NXHEAP

When NXHEAP is enabled, the brk(2)-based heap memory segment is marked non-executable. This extension defends against attacks that rely on injecting code and executing it from the heap. You can also configure NXHEAP to log each time a program tries to execute code on the heap. NXHEAP log entries are also written to /var/adm/messages.

Some programs (such as interpreters) do have legitimate reasons to execute code from the heap, so NXHEAP is enabled by default only for tagged binaries. Most core Solaris utilities are already tagged for NXHEAP, and you can tag your own binaries by passing the linker flag "-z nxheap=enable" when compiling with Solaris Studio. Of course, NXHEAP can also be enabled or disabled globally with sxadm.

sxadm

We've made all sorts of improvements to sxadm in Solaris 11.3, so I'm only going to focus on three new subcommands that will help you configure the new security extensions.

sxadm get

"sxadm get" allows you to observe the properties of security extensions. For example, NXSTACK and NXHEAP have log properties that show whether or not logging is enabled for those extensions. You can query the log property with:

$ sxadm get log nxstack nxheap
EXTENSION           PROPERTY                      VALUE
nxstack             log                           enable
nxheap              log                           enable  

And you can get an easily parsable format by passing the "-p" flag:

$ sxadm get -p log nxstack nxheap
nxstack:log:enable
nxheap:log:enable

You can also query all properties (equivalent to "sxadm status") with:

$ sxadm get all
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files
nxstack             model                         all
--                  log                           enable
nxheap              model                         tagged-files
--                  log                           enable  

sxadm set

"sxadm set" allows you to set individual properties of extensions without needing to use "sxadm enable". For example, you can disable NXSTACK logging with:

$ sxadm get log nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             log                           enable
$ sxadm set log=disable nxstack
$ sxadm get log nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             log                           disable

sxadm delcust

"sxadm delcust" allows you to restore the default configuration for one or more security extensions. For example:

$ sxadm get all nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             model                         tagged-files
--                  log                           disable
$ sxadm delcust nxstack
$ sxadm get all nxstack
EXTENSION           PROPERTY                      VALUE
nxstack             model                         all
--                  log                           enable

Of course, all of these new subcommands also work with ASLR, even though it only has one "model" property. For example:

$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files
$ sxadm set model=all aslr
$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         all
$ sxadm delcust aslr
$ sxadm get all aslr
EXTENSION           PROPERTY                      VALUE
aslr                model                         tagged-files 

Conclusion

I hope you've enjoyed this quick introduction to all the work we've put into the Security Extensions Framework for Solaris 11.3, and I hope you're able to use some or all of it to meet your organization's security needs. For a more detailed explanation of sxadm and the individual security extensions, please see the sxadm(1M) man page.

OpenSSL on Oracle Solaris 11.3

As with Solaris 11.2, Solaris 11.3 delivers two versions of OpenSSL: the non-FIPS 140 version (default) and the FIPS 140 version.  They are both based on OpenSSL 1.0.1o (as of July 7th, 2015).

There are no major features added to Solaris 11.3 OpenSSL; however, there are a couple of things that I would like to note.

EOL SSLv2 Support


The SSLv2 protocol has been known to have issues for a while. Therefore, we decided it's about time to remove SSLv2 support from Solaris OpenSSL. This should not be an issue for most applications out there, as nobody should be using the SSLv2 protocol these days.  If your application still does, please consider moving to the more secure TLS protocols.

With Solaris 11.3, SSLv2 entry points are replaced with stub functions, and they are declared 'deprecated'.  Thus, if you are building an application which has references to the SSLv2 entry points, be prepared to see some compiler warnings like:

        warning:  "SSLv2_client_method" is deprecated, declared in : "/usr/include/openssl/ssl.h", line 2035

Now, some of you may wonder: why are we not removing SSLv3 from Solaris OpenSSL as well?
Unfortunately, some third-party applications still support only the SSLv3 protocol, so we feel it's not time to remove SSLv3 support from the OpenSSL library just yet. That's not to say SSLv3 is an acceptable protocol.  RFC 7568, Deprecating Secure Sockets Layer Version 3.0, was just published, stating that "SSLv3 MUST NOT be used. Negotiation of SSLv3 from any version of TLS MUST NOT be permitted."  Fortunately, Oracle has already been implementing compliance with this RFC for a while now, and most applications supported by Oracle Solaris 11.3 disable SSLv2 and SSLv3 by default.  If you own an application that supports only SSLv3, it is time to move on to newer and more secure protocols such as TLS 1.2.  We won't be supporting the SSLv3 protocol for too much longer.


OpenSSL Thread and Fork Safety (Part 2)


With Solaris 11.2, we attempted to make OpenSSL thread and fork safe by default (see "OpenSSL Thread and Fork Safety" under "OpenSSL on Solaris 11.2").
However, the fix apparently wasn't complete, and we needed to extend it.

With Solaris 11.3 OpenSSL, the following functions are now replaced with stub functions.  Instead of allowing other applications/libraries to specify their own locking and thread identification callback functions, Solaris OpenSSL now has an internal implementation of locking and thread identification that is not visible to the API caller.  Applications may still call these functions, but the supplied callback functions will not be used by Solaris OpenSSL.

      CRYPTO_set_locking_callback
      CRYPTO_set_dynlock_create_callback
      CRYPTO_set_dynlock_lock_callback
      CRYPTO_set_dynlock_destroy_callback
      CRYPTO_set_add_lock_callback
      CRYPTO_THREADID_set_callback
      CRYPTO_set_id_callback

What does that mean for you?
OpenSSL is now thread and fork safe by default, finally.  You don't need to make any modifications to
your application or to your library.  You can relax and have a beer or two.

That's all I have for now.

Changes to ZFS ARC Memory Allocation in 11.3

New in Solaris 11.3 is a kernel memory allocation subsystem called the
kernel object manager, or KOM. The first consumer of this subsystem is the
ZFS ARC.

Prior to Solaris 11.3, the ZFS ARC allocated its memory from the kernel heap
space using kmem caches. This has several drawbacks: first, internal
fragmentation can result in memory used by the ARC not being reclaimed by the
system. This problem is particularly acute if large pages are being used, since
the buffer size is considerably smaller than the large page size -- even one
buffer still allocated will prevent the system from freeing the large page.
Another drawback of ZFS ARC using the kernel heap is that all of the kernel
heap is non-relocatable in memory, and thus must reside in the kernel cage.
This can lead to issues allocating large pages or performing DR memory remove
operations once the ARC has grown large, even if it shrinks successfully. As a
workaround for the cage growth issue, many sysadmins have limited the size of
the ZFS ARC cache in /etc/system. Finally, scalability of ARC shrinking prior
to Solaris 11.3 is limited by heap page unmapping speed on large SPARC systems.

In Solaris 11.3, the ZFS ARC allocates its memory through KOM. The metadata
which is frequently accessed by ZFS (such as directory files) remains in
the kernel cage, but the vast majority of the cache which is not frequently
accessed by ZFS now resides outside of the kernel cage, where it can be
relocated by DR and page coalescing. KOM uses a slab size of 2M on x86 or 4M on
SPARC, so internal fragmentation is much less of an issue than it was with 256M
heap pages on SPARC. Scalability is vastly improved, as KOM takes advantage of
64-bit systems by using the seg_kpm framework for its address translations.

With this change, many systems which required limiting the ARC size will no
longer require a hard limit, since the system is able to manage its memory much
better. Metadata heavy workloads, and systems hosting kernel zones, will still
need to limit the ARC size through /etc/system tuning in Solaris 11.3, however.

Saturday Jan 31, 2015

Multi-CPU Binding (MCB)

I want to tell everyone about the cool, new Multi-CPU Binding API introduced in Solaris 11.2.  Bo Li and I wrote up something that explains what it does, its benefits, and how it is used in Solaris along with examples of how to use it:

INTRODUCTION

Multi-CPU Binding (MCB) is new functionality that was added to Solaris 11.2 and is available through a new API called "processor_affinity(2)" and through the pbind(1M) command line tool.  MCB provides similar functionality to processor_bind(2), but can do much more than processor_bind(2):

  1. Bind specified threads to one or more CPUs, leaf locality groups (lgroups)*, or Processor Groups (PGs)**.

  2. Specify strong or weak affinity to CPUs where:

    • Strong affinity means that the threads must only run on the specified CPUs

    •  Weak affinity means that the threads should always prefer to run on the specified CPUs but will run on the closest available CPU where they have sufficient priority to run soonest when the desired CPUs are running higher priority threads

  3. Specify positive or negative affinity for CPUs (i.e., whether to run on or avoid running on the specified CPUs)

  4. Enable or disable inheritance across fork(2), exec(2), and/or thr_create(3C).

  5. Query affinities of specified threads to CPUs, PGs, or lgroups.

* lgroups are the Solaris abstraction for telling which CPUs, memory, and I/O devices are within some latency of each other in a Non Uniform Memory Access (NUMA) machine

** PGs are the Solaris abstraction for performance relevant processor sharing relationships in CMT processors (eg. shared execution pipeline, FPU, cache, etc.)

BENEFITS

Overall, MCB is more powerful and flexible than what was available in Solaris for affining threads to CPUs before MCB.

Before MCB, you could only do one or more of the following to affine a thread to one or more CPUs:

  • Bind one or more threads to one CPU and have this binding always be inherited across fork(2) and exec(2)
  • Set one or more threads' affinity for a locality group (lgroup), which is the Solaris abstraction for the CPUs, memory, and I/O devices within some latency of each other in a Non Uniform Memory Access (NUMA) machine
  • Create an exclusive set of CPUs that can only run threads assigned to it, bind one or more threads to this processor set, and always have this processor set binding inherited across fork(2) and exec(2).

In contrast to the old functionality above, MCB has the following new functionality and benefits:

  1. Can bind to more than one CPU
    • The biggest benefit of MCB is that you can affine one or more threads to any set of CPUs that you want.  With this ability, you can bind threads to a NUMA node, processor chip, core, the CPUs sharing some performance relevant hardware component (eg. execution pipeline, FPU, cache, etc.), or an arbitrary set of CPUs.
    • Using a processor set is a way to affine a thread to a set of CPUs like MCB.  However, processor sets are exclusive so only threads assigned to the processor set can run on the CPUs in the processor set.  In contrast, MCB does not set aside CPUs for exclusive use by threads affined to those CPUs by MCB.  Hence, a thread having an MCB affinity for some CPUs does not prevent any other threads from running on those CPUs.
  2. More affinities
    • Having a positive and negative affinity to specify whether to run on or avoid the specified CPUs is a new feature that wasn't offered in the previous APIs for binding threads to CPUs
    • Being able to specify a strong or weak affinity is new for binding threads to CPUs, but isn't a completely new idea in Solaris.  The lgroup affinities already have the notion of strong and weak affinity.  The semantics are pretty different though.  The lgroup affinities mostly affect the order of preference for a thread's home lgroup.  In contrast, MCB strong and weak affinity affect where a thread must run or should prefer to run.  MCB affinities can cause the home lgroup of the thread to change to an lgroup that at least contains some of the specified CPUs, but it does not change the order of preference of home lgroups for the thread.
  3. More flexibility with inheritance
    • MCB has more flexibility with setting the inheritance of the MCB CPU affinities across fork(2),exec(2), or thr_create(3C).  It allows you to enable or disable inheritance of its CPU affinities separately across fork(2), exec(2), or thr_create(3C).

In contrast, the pre-existing APIs for binding threads to a CPU or a processor set make the bindings always be inherited across fork(2), exec(2), and thr_create(3C) so you can never disable any of the inheritance.  With lgroup affinities, you can enable or disable inheritance for fork(2), exec(2), and thr_create(3C), but you must enable or disable inheritance across all or none of these operations.

How is MCB used in Solaris?

Solaris optimizes performance for I/O on Non Uniform Memory Access (NUMA) machines where some I/O devices are closer to some CPUs and memory than others.  Part of what Solaris does for its NUMA I/O optimizations is place kernel I/O helper threads that help usher I/O from the application to the I/O device and vice versa near the I/O device.

Before Solaris 11.2, Solaris would bind each I/O helper thread to one CPU near its corresponding I/O device.  Unfortunately, this can cause some performance issues when the CPU where the I/O helper thread is bound becomes very busy running higher priority threads or handling interrupts.  Since the I/O helper thread is bound to just one CPU, it can only run on that one CPU, isn't allowed to run on any other CPU, and can have to wait a long time to run.  This can cause I/O performance to go down because the I/O will take longer to process.

In S11.2, MCB is used to overcome this problem by affining each I/O helper thread to one or more processor cores.  This gives the I/O helper threads more places to run and reduces the chance that they get stuck on a very busy CPU.  Also, MCB weak affinity can be used to specify that the I/O helper threads prefer to run on the specified CPUs but it is ok to run them on the closest available CPUs if the specified CPUs are too busy.

Tool

pbind(1M)

pbind(1M) is an existing tool to control and query the bindings of processes or LWPs to a CPU and has been modified to support affining threads to more than one CPU.

When specifying target CPUs, the user could directly use their processor IDs or indirectly use their Processor Group (PG) or Locality Group (lgroup) ID.

Bind processes/LWPs

Below are the equivalent ways of binding process 101048 to CPU 1. By default, the binding target type is CPU, the idtype is pid, and the binding affinity is strong:

    # pbind -b 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -s -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

Bind processes/LWPs to CPUs specified by Processor Group or Locality Group

    Binding process 101048 to the CPUs in Processor Group 1:

    # pbind -b -g 1 101048

    pbind(1M): pid 101048 strongly bound to Processor Group(s) 1

    Binding process 101048 to the CPUs in Locality Group 2:

    # pbind -b -l 2 101048

    pbind(1M): pid 101048 strongly bound to Locality Group(s) 0 2.

Weak binding

    # pbind -b 2 -w 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2.

Negative binding targets

    Weakly binding process 101048 to all CPUs but the ones in Processor Group 1:

    # pbind -b -g 1 -n -w 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

Binding LWPs

When the user binds a process to the specified CPUs, all the LWPs belonging to that process are automatically bound to those CPUs. The user may also bind LWPs in the same process individually. An LWP range can be specified after ‘/’, with individual LWPs or ranges separated by commas.

    Strongly binding LWPs 2, 3, and 4 of process 116936 to CPU 2:

    # pbind -b -c 2 -i pid 116936/2-3,4

    pbind(1M): LWP 116936/2 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/3 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/4 strongly bound to processor(s) 2.

Query processes/LWPs binding

When querying for bindings of specific LWPs, the user may request that the resulting set of CPUs be identified through their IDs, the Processor Groups or the Locality Groups that contain them:

    # pbind -q 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2 3.

    # pbind -q -g 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

    # pbind -q -l 101048

    pbind(1M): pid 101048 weakly bound to Locality Group(s) 0 2.

The user may also query all bindings for a specified CPU:

    # pbind -Q 2

    pbind(1M): LWP 101048/1 weakly bound to processor(s) 2 3.

    pbind(1M): LWP 102122/1 weakly bound to processor(s) 2 3.

Binding Inheritance

By default, bindings are inherited across exec(2), fork(2), and thr_create(3C), but inheritance across any of these can be disabled.  For example, the user could bind a shell process to a set of CPUs and specify that the binding is not inherited across fork(2); processes created by that shell are then not bound to any CPUs.

    Bind processes/LWPs but request binding not inherited across fork(2):

    # pbind -b -c 2 -f 101048                      

    pbind(1M): pid 101048 strongly bound to processor(s) 2.

The return values are documented in the pbind(1M) man page; refer to it for further details.

APIs

processor_affinity(2)

MCB introduces a new processor_affinity(2) system call to control and query the affinity to CPUs for processes or LWPs.

    int processor_affinity(procset_t *ps, uint_t *nids, id_t *ids, uint32_t *flags);

Each option and flag used in pbind(1M) maps directly to processor_affinity(2).  The user may request either strong or weak binding by specifying the flag PA_AFF_STRONG or PA_AFF_WEAK, and the target CPUs can be specified by processor ID, Processor Group (PG) ID, or Locality Group (lgroup) ID using the corresponding flag PA_TYPE_CPU, PA_TYPE_PG, or PA_TYPE_LGRP.

The ps argument identifies the LWP(s) to which the call should be applied, through a procset structure (see procset.h(3HEAD) for details).  The flags argument must contain a valid combination of the options described in the man page.

When setting affinities, the nids argument points to a memory location holding the number of CPU, PG, or LGRP identifiers to which affinity is being set, and ids points to an array of those identifiers.  Exactly one affinity type must be specified, along with one affinity strength.  Negative affinity is a type modifier indicating that the given IDs should be avoided and that affinity of the specified type should instead be set to all other processors in the system.

When specifying multiple LWPs, the threads must all belong to the same processor set, since they can only be affined to CPUs within their processor set.  Additionally, setting affinities succeeds if processor_affinity(2) is able to set an LWP's affinity for any of the specified CPUs, even if a subset of the specified CPUs is invalid, offline, or faulted.

Setting strong affinity to CPUs 0-3 for the current LWP:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>
    #include <stdio.h>

    procset_t ps;
    uint_t nids = 4;
    id_t ids[4] = { 0, 1, 2, 3 };
    uint32_t flags = PA_TYPE_CPU | PA_AFF_STRONG;

    /* Select the calling LWP, then request strong affinity to CPUs 0-3. */
    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Setting weak affinity to the CPUs in Processor Groups 3 and 7 for process 300's LWP 2:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>
    #include <stdio.h>

    procset_t ps;
    uint_t nids = 2;
    id_t ids[2] = { 3, 7 };
    uint32_t flags = PA_TYPE_PG | PA_AFF_WEAK;

    /* Select LWP 2 of process 300, then request weak affinity to
     * Processor Groups 3 and 7. */
    setprocset(&ps, POP_AND, P_PID, 300, P_LWPID, 2);

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Upon a successful query, nids will contain the number of CPUs, PGs or LGRPs for which the specified LWP(s) has affinity.  If ids is not NULL, processor_affinity(2) will store the IDs of the indicated type up to the initial nids value.  Additionally, flags will return the affinity strength and whether any type of inheritance is excluded.

When querying affinities, PA_TYPE_CPU, PA_TYPE_PG, or PA_TYPE_LGRP may be specified to indicate whether the returned identifiers should be the CPUs, Processor Groups, or Locality Groups that contain the processors for which the specified LWPs have affinity.  If no type is specified, the interface defaults to CPUs.

Querying and printing affinities for the current LWP:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>
    #include <stdio.h>
    #include <stdlib.h>

    procset_t ps;
    uint_t nids;
    id_t *ids;
    uint32_t flags = PA_QUERY;
    int i;

    /* First call with a NULL ids array only returns the count in nids. */
    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, NULL, &flags) != 0) {
        fprintf(stderr, "Error querying number of ids.\n");
        perror(NULL);
    } else {
        fprintf(stderr, "LWP %d has affinity for %d CPUs.\n",
            thr_self(), nids);
    }

    /* Second call fills in the IDs, up to the nids value just returned. */
    flags = PA_QUERY;
    ids = calloc(nids, sizeof (id_t));

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error querying ids.\n");
        perror(NULL);
    }

    if (nids == 0)
        printf("Current LWP has no affinity set.\n");
    else
        printf("Current LWP has affinity for the following CPU(s):\n");

    for (i = 0; i < nids; i++)
        printf(" %d", ids[i]);
    printf("\n");
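Since the section above notes that a type flag can be supplied when querying, the same query can, in principle, be expressed in terms of Processor Groups. The following continuation of the example above (reusing ps) is only a sketch; combining PA_TYPE_PG with PA_QUERY in this way is an assumption, so check processor_affinity(2) for the exact convention:

    /* Variation on the query above: ask for the result as Processor
     * Groups instead of CPUs (assumed flag combination, see text). */
    uint32_t pg_flags = PA_QUERY | PA_TYPE_PG;
    uint_t npgs = 0;

    if (processor_affinity(&ps, &npgs, NULL, &pg_flags) != 0) {
        fprintf(stderr, "Error querying number of Processor Groups.\n");
        perror(NULL);
    } else {
        printf("Current LWP has affinity spanning %u Processor Group(s).\n",
            npgs);
    }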

When clearing affinities, the caller can either specify a set of LWPs whose affinities should be revoked (through the ps argument), or pass NULL for ps and specify a list of CPU, PG, or LGRP identifiers for which all affinities must be cleared, as shown in the examples below.

Clearing all affinities for CPUs 5 and 7:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>
    #include <stdio.h>

    uint_t nids = 2;
    id_t ids[2] = { 5, 7 };
    uint32_t flags = PA_CLEAR | PA_TYPE_CPU;

    /* No procset is given (ps is NULL), so all affinities for CPUs 5
     * and 7 are cleared, regardless of which LWPs hold them. */
    if (processor_affinity(NULL, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error clearing affinity.\n");
        perror(NULL);
    }
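For the other form described above, clearing the affinities of a specific set of LWPs rather than of specific CPUs, a minimal sketch might look like the following. Passing a NULL ids array with nids set to 0 alongside PA_CLEAR is an assumption here; check processor_affinity(2) for the exact calling convention:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <thread.h>
    #include <stdio.h>

    procset_t ps;
    uint_t nids = 0;
    uint32_t flags = PA_CLEAR;

    /* Select the calling LWP; with PA_CLEAR and no ids list, its
     * affinities are revoked (assumed calling convention, see text). */
    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, NULL, &flags) != 0) {
        fprintf(stderr, "Error clearing affinity for the current LWP.\n");
        perror(NULL);
    }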

The return values are documented in the processor_affinity(2) man page; refer to it for further details.

processor_bind(2)

processor_bind(2) binds processes or LWPs to a single CPU.  The interface remains the same as in earlier Solaris releases, but its implementation has changed significantly to use MCB.  processor_bind(2) and processor_affinity(2) share the same implementation, differing only in the limitations imposed by the number and types of arguments each accepts; calls to processor_bind(2) are essentially calls to processor_affinity(2) that only allow setting and querying a binding to a single CPU at a time.

    int processor_bind(idtype_t idtype, id_t id, processorid_t new_binding, processorid_t *old_binding);

This function binds the LWP (lightweight process) or set of LWPs specified by idtype and id to the processor specified by new_binding. If old_binding is not NULL, it will contain the previous binding of one of the specified LWPs, or PBIND_NONE if none were previously bound.
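As a minimal sketch of this traditional interface, the following binds the calling LWP to CPU 2 (an arbitrary CPU chosen for illustration), saving the previous binding, and then reads the current binding back with PBIND_QUERY:

    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>

    processorid_t obind;

    /* Bind the calling LWP to CPU 2; the previous binding (or
     * PBIND_NONE) is returned through obind. */
    if (processor_bind(P_LWPID, P_MYID, 2, &obind) != 0)
        perror("processor_bind");

    /* PBIND_QUERY leaves the binding unchanged and reports it in obind. */
    if (processor_bind(P_LWPID, P_MYID, PBIND_QUERY, &obind) != 0)
        perror("processor_bind query");
    else
        printf("Current LWP is bound to CPU %d.\n", (int)obind);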

For more details, please refer to the manpage of processor_bind(2).


Wednesday Dec 10, 2014

Which Oracle Solaris Virtualization?

From time to time, as the product manager for Oracle Solaris Virtualization, I get asked by customers which virtualization technology they should choose. This usually comes down to two main reasons:

  1. Choice: Oracle Solaris provides a choice of virtualization technologies, so you can tailor your virtual infrastructure to best fit your application rather than forcing (and hence compromising) your application to fit a single option.
  2. No way back: There is a perception that once you make your choice, if you get it wrong there is no way back (or a very difficult way back), so it seems crucial to make the right choice up front.

Understandably there is occasionally a lot of angst around this decision, but, as always with Oracle Solaris, there is good news. First, the choice isn't as complex as it first seems, and the diagram below can help you get a feel for it. We now have many customers discovering that the combination of Oracle Solaris Zones inside OVM Server for SPARC instances (Logical Domains) gives them the best of both worlds.

Second, with Unified Archives in Oracle Solaris 11.2, you always have a way back. With a Unified Archive you can move from a Native Zone to a Kernel Zone to a Logical Domain to bare metal, and any and all combinations in between. You can test which type of virtualization best suits your applications and infrastructure, and if you don't like it, change to another type in a few minutes.

BTW if you want a more in-depth discussion of virtualization and how to best utilize it for consolidation, check out the Consolidation Using Oracle's SPARC Virtualization Technologies white paper.  

