Saturday Jan 31, 2015

Multi-CPU Binding (MCB)

I want to tell everyone about the cool, new Multi-CPU Binding API introduced in Solaris 11.2.  Bo Li and I wrote up something that explains what it does, its benefits, and how it is used in Solaris along with examples of how to use it:

INTRODUCTION

Multi-CPU Binding (MCB) is new functionality that was added to Solaris 11.2 and is available through a new API called "processor_affinity(2)" and through the pbind(1M) command line tool.  MCB provides similar functionality to processor_bind(2), but can do much more than processor_bind(2):

  1. Bind specified threads to one or more CPUs, leaf locality groups (lgroups)*, or Processor Groups (PGs)**.

  2. Specify strong or weak affinity to CPUs where:

    • Strong affinity means that the threads must only run on the specified CPUs

    • Weak affinity means that the threads should always prefer to run on the specified CPUs, but may run on the closest available CPUs where they have sufficient priority to run soonest when the desired CPUs are busy running higher priority threads

  3. Specify positive or negative affinity for CPUs (i.e., whether to run on or avoid the specified CPUs)

  4. Enable or disable inheritance across fork(2), exec(2), and/or thr_create(3C).

  5. Query affinities of specified threads to CPUs, PGs, or lgroups.

* lgroups are the Solaris abstraction for describing which CPUs, memory, and I/O devices are within some latency of each other in a Non Uniform Memory Access (NUMA) machine

** PGs are the Solaris abstraction for performance-relevant processor sharing relationships in CMT processors (e.g., shared execution pipeline, FPU, cache, etc.)

BENEFITS

Overall, MCB is more powerful and flexible than what was available in Solaris for affining threads to CPUs before MCB.

Before MCB, you could only do one or more of the following to affine a thread to one or more CPUs:

  • Bind one or more threads to one CPU and have this binding always be inherited across fork(2) and exec(2)
  • Set one or more threads' affinity for a locality group (lgroup), which is the Solaris abstraction for the CPUs, memory, and I/O devices within some latency of each other in a Non Uniform Memory Access (NUMA) machine
  • Create an exclusive set of CPUs that can only run threads assigned to it, bind one or more threads to this processor set, and always have this processor set binding inherited across fork(2) and exec(2).

In contrast to the old functionality above, MCB has the following new functionality and benefits:

  1. Can bind to more than one CPU
    • The biggest benefit of MCB is that you can affine one or more threads to any set of CPUs that you want.  With this ability, you can bind threads to a NUMA node, processor chip, core, the CPUs sharing some performance-relevant hardware component (e.g., execution pipeline, FPU, cache, etc.), or an arbitrary set of CPUs.
    • Using a processor set is a way to affine a thread to a set of CPUs like MCB.  However, processor sets are exclusive so only threads assigned to the processor set can run on the CPUs in the processor set.  In contrast, MCB does not set aside CPUs for exclusive use by threads affined to those CPUs by MCB.  Hence, a thread having an MCB affinity for some CPUs does not prevent any other threads from running on those CPUs.
  2. More affinities
    • Having a positive and negative affinity to specify whether to run on or avoid the specified CPUs is a new feature that wasn't offered in the previous APIs for binding threads to CPUs
    • Being able to specify a strong or weak affinity is new for binding threads to CPUs, but isn't a completely new idea in Solaris.  The lgroup affinities already have the notion of strong and weak affinity.  The semantics are pretty different though.  The lgroup affinities mostly affect the order of preference for a thread's home lgroup.  In contrast, MCB strong and weak affinity affect where a thread must run or should prefer to run.  MCB affinities can cause the home lgroup of the thread to change to an lgroup that at least contains some of the specified CPUs, but it does not change the order of preference of home lgroups for the thread.
  3. More flexibility with inheritance
    • MCB has more flexibility with setting the inheritance of the MCB CPU affinities across fork(2), exec(2), or thr_create(3C).  It allows you to enable or disable inheritance of its CPU affinities independently for each of these operations.

In contrast, the pre-existing APIs for binding threads to a CPU or a processor set make the bindings always be inherited across fork(2), exec(2), and thr_create(3C) so you can never disable any of the inheritance.  With lgroup affinities, you can enable or disable inheritance for fork(2), exec(2), and thr_create(3C), but you must enable or disable inheritance across all or none of these operations.

How is MCB used in Solaris?

Solaris optimizes performance for I/O on Non Uniform Memory Access (NUMA) machines where some I/O devices are closer to some CPUs and memory than others.  Part of what Solaris does for its NUMA I/O optimizations is to place kernel I/O helper threads, which usher I/O from the application to the I/O device and back, near the I/O device.

Before Solaris 11.2, Solaris would bind each I/O helper thread to one CPU near its corresponding I/O device.  Unfortunately, this can cause performance issues when the CPU where the I/O helper thread is bound becomes very busy running higher priority threads or handling interrupts.  Since the I/O helper thread is bound to just one CPU, it cannot run anywhere else and may have to wait a long time to run.  This can hurt I/O performance because each I/O takes longer to process.

In S11.2, MCB is used to overcome this problem by affining each I/O helper thread to one or more processor cores.  This gives the I/O helper threads more places to run and reduces the chance that they get stuck on a very busy CPU.  Also, MCB weak affinity can be used to specify that the I/O helper threads prefer to run on the specified CPUs but it is ok to run them on the closest available CPUs if the specified CPUs are too busy.

Tool

pbind(1M)

pbind(1M) is an existing tool for controlling and querying the bindings of processes or LWPs to a CPU; it has been modified to support affining threads to more than one CPU.

When specifying target CPUs, you can identify them directly by processor ID or indirectly by their Processor Group (PG) or Locality Group (lgroup) ID.

Bind processes/LWPs

Below are equivalent ways of binding process 101048 to CPU 1.  By default, the binding target type is CPU, the idtype is pid, and the binding affinity is strong:

    # pbind -b 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

    # pbind -b -c 1 -s -i pid 101048

    pbind(1M): pid 101048 strongly bound to processor(s) 1.

Bind processes/LWPs to CPUs specified by Processor Group or Locality Group

    Binding process 101048 to the CPUs in Processor Group 1:

    # pbind -b -g 1 101048

    pbind(1M): pid 101048 strongly bound to Processor Group(s) 1

    Binding process 101048 to the CPUs in Locality Group 2:

    # pbind -b -l 2 101048

    pbind(1M): pid 101048 strongly bound to Locality Group(s) 0 2.

Weak binding

    # pbind -b 2 -w 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2.

Negative binding targets

    Weakly binding process 101048 to all CPUs but the ones in Processor Group 1:

    # pbind -b -g 1 -n -w 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

Binding LWPs

When the user binds a process to the specified CPUs, all the LWPs belonging to that process are automatically bound to those CPUs.  The user may also bind LWPs in the same process individually.  An LWP range can be specified after '/', with entries separated by commas.

    Strongly binding LWPs 2, 3, and 4 of process 116936 to CPU 2:

    # pbind -b -c 2 -i pid 116936/2-3,4

    pbind(1M): LWP 116936/2 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/3 strongly bound to processor(s) 2.

    pbind(1M): LWP 116936/4 strongly bound to processor(s) 2.

Query processes/LWPs binding

When querying the bindings of specific LWPs, the user may request that the resulting set of CPUs be identified by their IDs or by the Processor Groups or Locality Groups that contain them:

    # pbind -q 101048

    pbind(1M): pid 101048 weakly bound to processor(s) 2 3.

    # pbind -q -g 101048

    pbind(1M): pid 101048 weakly bound to Processor Group(s) 2.

    # pbind -q -l 101048

    pbind(1M): pid 101048 weakly bound to Locality Group(s) 0 2.

The user may also query all bindings for a specified CPU:

    # pbind -Q 2

    pbind(1M): LWP 101048/1 weakly bound to processor(s) 2 3.

    pbind(1M): LWP 102122/1 weakly bound to processor(s) 2 3.

Binding Inheritance

By default, bindings are inherited across exec(2), fork(2), and thr_create(3C), but inheritance across any of these can be disabled.  For example, the user could bind a shell process to a set of CPUs and specify that the binding not be inherited across fork(2).  That way, no process created by this shell will be bound to any CPUs.

    Bind processes/LWPs but request binding not inherited across fork(2):

    # pbind -b -c 2 -f 101048                      

    pbind(1M): pid 101048 strongly bound to processor(s) 2.

Return values are documented in the pbind(1M) man page; refer to it for more details.

APIs

processor_affinity(2)

MCB introduces a new processor_affinity(2) system call to control and query the affinity to CPUs for processes or LWPs.

    int processor_affinity(procset_t *ps, uint_t *nids, id_t *ids, uint32_t *flags);

Each option and flag used in pbind(1M) maps directly to processor_affinity(2).  Similarly, the user may request that the binding be either strong or weak by specifying the flag PA_AFF_STRONG or PA_AFF_WEAK.  The target CPUs can be specified by their processor IDs, or by Processor Group (PG) or Locality Group (lgroup) ID with the corresponding flag PA_TYPE_CPU, PA_TYPE_PG, or PA_TYPE_LGRP.

The ps argument identifies the LWP(s) to which the call applies through a procset structure (see procset.h(3HEAD) for details).  The flags argument must contain a valid combination of the options given in the man page.

When setting affinities, the nids argument points to a memory location holding the number of CPU, PG, or LGRP identifiers to which affinity is being set, and ids points to an array with the identifiers.  Exactly one affinity type must be specified, along with one affinity strength.  Negative affinity is a type modifier indicating that the given IDs should be avoided and that affinity of the specified type should be set to all of the other processors in the system.

When specifying multiple LWPs, the threads should all be bound to the same processor set, since they can only be affined to CPUs in their processor set.  Additionally, setting affinities will succeed if processor_affinity(2) is able to set an LWP's affinity for any of the specified CPUs, even if a subset of the specified CPUs is invalid, offline, or faulted.

Setting strong affinity for CPUs [0-3] to the current LWP:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids = 4;
    id_t ids[4] = { 0, 1, 2, 3 };
    uint32_t flags = PA_TYPE_CPU | PA_AFF_STRONG;

    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Setting weak affinity for the CPUs in Processor Groups 3 and 7 for process 300's LWP 2:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids = 2;
    id_t ids[2] = { 3, 7 };
    uint32_t flags = PA_TYPE_PG | PA_AFF_WEAK;

    setprocset(&ps, POP_AND, P_PID, 300, P_LWPID, 2);

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error setting affinity.\n");
        perror(NULL);
    }

Upon a successful query, nids will contain the number of CPUs, PGs or LGRPs for which the specified LWP(s) has affinity.  If ids is not NULL, processor_affinity(2) will store the IDs of the indicated type up to the initial nids value.  Additionally, flags will return the affinity strength and whether any type of inheritance is excluded.

When querying affinities, PA_TYPE_CPU, PA_TYPE_PG, or PA_TYPE_LGRP may be specified to indicate that the returned identifiers must be the CPUs, Processor Groups, or Locality Groups that contain the processors for which the specified LWPs have affinity.  If no type is specified, the interface defaults to CPUs.

Querying and printing affinities for the current LWP:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids;
    id_t *ids;
    uint32_t flags = PA_QUERY;
    int i;

    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, NULL, &flags) != 0) {
        fprintf(stderr, "Error querying number of ids.\n");
        perror(NULL);
    } else {
        fprintf(stderr, "LWP %d has affinity for %d CPUs.\n",
            thr_self(), nids);
    }

    flags = PA_QUERY;
    ids = calloc(nids, sizeof (id_t));

    if (processor_affinity(&ps, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error querying ids.\n");
        perror(NULL);
    }

    if (nids == 0)
        printf("Current LWP has no affinity set.\n");
    else
        printf("Current LWP has affinity for the following CPU(s):\n");

    for (i = 0; i < nids; i++)
        printf(" %u", ids[i]);
    printf("\n");

When clearing affinities, the caller can either specify a set of LWPs whose affinities should be revoked (through the ps argument), or pass a NULL ps and specify a list of CPU, PG, or LGRP identifiers for which all affinities must be cleared.  See the examples below for details.
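For the first mode, here is a minimal sketch that clears every affinity held by the current LWP.  This assumes that passing PA_CLEAR with a NULL ids list clears all affinities for the LWPs selected by ps; check the processor_affinity(2) man page for the exact contract:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <thread.h>

    procset_t ps;
    uint_t nids = 0;
    uint32_t flags = PA_CLEAR;

    /* Target only the calling LWP. */
    setprocset(&ps, POP_AND, P_PID, P_MYID, P_LWPID, thr_self());

    if (processor_affinity(&ps, &nids, NULL, &flags) != 0) {
        fprintf(stderr, "Error clearing affinity.\n");
        perror(NULL);
    }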

Clearing all affinities for CPUs 5 and 7:

    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>
    #include <thread.h>

    uint_t nids = 2;
    id_t ids[2] = { 5, 7 };
    uint32_t flags = PA_CLEAR | PA_TYPE_CPU;

    if (processor_affinity(NULL, &nids, ids, &flags) != 0) {
        fprintf(stderr, "Error clearing affinity.\n");
        perror(NULL);
    }

Return values are documented in the processor_affinity(2) man page; refer to it for more details.

processor_bind(2)

processor_bind(2) binds processes or LWPs to a single CPU.  The interface remains the same as in earlier Solaris versions, but its implementation has changed significantly to use MCB.  processor_bind(2) and processor_affinity(2) are implemented the same way, differing only in the limitations imposed by the number and types of arguments each accepts.  Calls to processor_bind(2) are essentially calls to processor_affinity(2) that only allow setting and querying a binding to a single CPU at a time.

    int processor_bind(idtype_t idtype, id_t id, processorid_t new_binding, processorid_t *old_binding);

This function binds the LWP (lightweight process) or set of LWPs specified by idtype and id to the processor specified by new_binding. If old_binding is not NULL, it will contain the previous binding of one of the specified LWPs, or PBIND_NONE if none were previously bound.
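For example, here is a minimal sketch that binds the calling LWP to CPU 2, queries the binding, and then removes it (PBIND_QUERY and PBIND_NONE are the documented special values for new_binding):

    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>
    #include <stdio.h>

    processorid_t old;

    /* Bind the calling LWP to CPU 2, saving its previous binding. */
    if (processor_bind(P_LWPID, P_MYID, 2, &old) != 0)
        perror("processor_bind");

    /* Query the current binding without changing it. */
    if (processor_bind(P_LWPID, P_MYID, PBIND_QUERY, &old) == 0)
        printf("bound to CPU %d\n", (int)old);

    /* Remove the binding. */
    (void) processor_bind(P_LWPID, P_MYID, PBIND_NONE, NULL);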

For more details, please refer to the manpage of processor_bind(2).


Wednesday Dec 10, 2014

Which Oracle Solaris Virtualization?

From time to time as the product manager for Oracle Solaris Virtualization I get asked by customers which virtualization technology they should choose. This is probably because of two main reasons.

  1. Choice: Oracle Solaris provides a choice of virtualization technologies so you can tailor your virtual infrastructure to best fit your application, rather than forcing (and hence compromising) your application to fit a single option
  2. No way back: There is a perception that once you make your choice, if you get it wrong there is no way back (or a very difficult way back), so it seems really important to make the right choice

Understandably there is occasionally a lot of angst around this decision but, as always with Oracle Solaris, there is good news. First, the choice isn't as complex as it first seems, and below is a diagram that can help you get a feel for it. We now have many, many customers discovering that the combination of Oracle Solaris Zones inside OVM Server for SPARC instances (Logical Domains) gives them the best of both worlds.

Second, with Unified Archives in Oracle Solaris 11.2 you always have a way back. With a Unified Archive you can move from a Native Zone to a Kernel Zone to a Logical Domain to Bare Metal, and any and all combinations in between. You can test which type of virtualization is best for your applications and infrastructure, and if you don't like it, change to another type in a few minutes.

BTW if you want a more in-depth discussion of virtualization and how to best utilize it for consolidation, check out the Consolidation Using Oracle's SPARC Virtualization Technologies white paper.  

Tuesday Jul 29, 2014

DTrace improvements in Oracle Solaris 11.2

There have been a few improvements to DTrace in Solaris 11.2.

llquantize()

DTrace has quantize() and lquantize() aggregating actions to give you, respectively, a power-of-two distribution and a linear distribution of data points that you're interested in.  While these are both useful, there may be instances in which you want to examine events whose latencies span multiple orders of magnitude, but for which you want relatively fine-grained information about the data within each order of magnitude.  With quantize(), there's likely to be insufficient detail. You could use lquantize(), but you'd need multiple aggregations to cover the multiple orders of magnitude.

In 11.2, we have added a log-linear quantize aggregation, llquantize().  This aggregating action allows you to specify a base and a range of exponents for the data, but it also allows you to specify a number of steps (or buckets, if you will) per order of magnitude.  For example, the following line will create an aggregation covering the values from 10^3 through 10^6 - 1 with 20 steps per order of magnitude.  (The buckets for 10^5 will include the values from 10^5 through 10^6 - 1.):

@ = llquantize(foo, 10, 3, 5, 20);

We can use this in a script to examine system call latencies:

        syscall:::entry
        {
                self->ts = timestamp;
        }

        syscall:::return
        / self->ts /
        {
                @ = llquantize(timestamp - self->ts, 10, 3, 5, 10);
                self->ts = 0;
        }

Because the timestamp is measured in nanoseconds, this script reports system call latencies in the microseconds range.  Here's sample output from this script:

           value  ------------- Distribution ------------- count
          < 1000 |@@@@@@                                   12899
            1000 |@@@@@@@@@@@@                             26357
            2000 |@                                        3202
            3000 |@                                        1869
            4000 |@                                        2110
            5000 |@@                                       4716
            6000 |@@                                       3998
            7000 |@                                        1617
            8000 |@@                                       4924
            9000 |@                                        2515
           10000 |@@@@@@@                                  15307
           20000 |@                                        2240
           30000 |@                                        1327
           40000 |@                                        1369
           50000 |                                         990
           60000 |                                         1057
           70000 |                                         631
           80000 |                                         453
           90000 |                                         434
          100000 |@                                        1570
          200000 |                                         228
          300000 |                                         45
          400000 |                                         59
          500000 |                                         60
          600000 |                                         52
          700000 |                                         30
          800000 |                                         22
          900000 |                                         17
      >= 1000000 |                                         513



Scalability

When DTrace was first conceived, 100 CPUs was a large machine.  Now, the largest machines contain over 1,000 CPUs, and the original DTrace architecture is starting to show its age.  dtrace(1M) (or specifically, libdtrace(3LIB)) was originally written to process data from the CPUs on a server in a single thread.  Unfortunately, a single thread simply cannot keep up on newer, larger servers.

In Solaris 11.2, we've modified libdtrace to perform this task with multiple threads.  On x86 servers, dtrace(1M) will use one thread per 8 CPUs.  On SPARC servers, it will use one thread per 16 CPUs.  We've also included an option, nworkers, to allow you to request a specific number of threads.
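For example, to request 16 worker threads, you can set nworkers the same way as other dtrace(1M) options (the value here is purely illustrative):

# dtrace -x nworkers=16 -s script.d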

What does this mean for you?  The main benefit of doing this is that dtrace(1M) will actually be able to keep up on larger servers.  During testing on a 256-CPU system, generating 6,000 records per second per CPU, we were seeing hundreds of aggregation drops per second per CPU without this multi-threading framework.  Running the same test with a multi-threaded dtrace(1M), we saw no aggregation drops.

errexit option

Another minor enhancement in Solaris 11.2 is the errexit option.  This option causes dtrace(1M) to exit when it first hits an error.

As a trivial example, consider the following script:

tick-1s
{
        this->i = 0;
        this->j = 5 / this->i;
}

If run normally, this script would run until termination reporting an error once per second:

# dtrace -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
^C

#

When run with the errexit option, the script terminates after the first error:

# dtrace -x errexit -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2

#

tracemem() enhancements

We've added an optional third argument to tracemem().  Where the second argument specifies how many bytes of memory to trace, the optional third argument specifies how many bytes of memory to display.  This can be useful in cases where the size of the data you care about is variable.  (The DTrace architecture requires that you trace a constant amount of data, but in some cases, the amount of data you're interested in is variable.  Previously, you would trace enough to capture what you're interested in and ignore the rest.  This optional argument lets dtrace(1M) ignore the garbage for you.)

As an example, consider tracing the beginning of an SSH connection, as seen from the server side.  (This is a simplified example, so we'll just look at what the server writes.):

syscall::write:entry
/execname == "sshd"/
{
        tracemem(copyin(arg1, arg2), 1024, arg2);
}

While running this script and opening a connection from a remote server, we see this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a                                      2.2.

   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 25 54 00 00                                      %T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d7 82 01 00 25 54 00 00              .9.S....%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 49 84 01 00 25 54 00 00              .9.SI...%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d2 84 01 00 25 54 00 00              .9.S....%T..

[ ... ]

This is certainly much better than seeing this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a 00 00 00 00 18 00 00 00 00 00 00 00  2.2.............
        20: 98 f6 d3 08 25 1f d3 08 01 21 d3 08 00 00 00 00  ....%....!......
        30: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        40: c0 23 d2 08 0d 21 d3 08 10 00 00 00 01 00 00 00  .#...!..........
        50: 9d 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        60: c0 24 d2 08 25 1f d3 08 0d 21 d3 08 00 00 00 00  .$..%....!......
        70: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        80: 80 25 d2 08 01 00 00 00 02 00 00 00 03 00 00 00  .%..............
        90: ff ff ff ff 04 00 00 00 18 00 00 00 00 00 00 00  ................
        a0: a0 22 d2 08 2e 21 d3 08 0e 00 00 00 01 00 00 00  ."...!..........
        b0: ab 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        c0: 00 23 d2 08 2e 21 d3 08 00 00 00 00 25 1f d3 08  .#...!......%...
        d0: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        e0: b8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       100: d8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ ... ]
       3d0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       3e0: b8 f9 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
       3f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................

tracemem() output consistency

Another enhancement we've made to tracemem() is to modify the behavior of tracemem() for certain traced memory sizes.  Previously, when tracing 1, 2, 4 or 8 bytes, the traced memory would be treated as a signed decimal integer.  When tracing any other amount below 32 bytes, the traced memory was treated as a string if the buffer contained only printable ASCII characters.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'
47
# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'
1920169263
# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'
/usr/sbin/init
# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 84 2b ab fb ff ff ff ff  .........+......
        30: ac 2e ab fb ff ff ff ff ac 31 ab fb ff ff ff ff  .........1......

#

We've modified this behavior to be consistent across traced memory sizes.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f                                               /

# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72                                      /usr

# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 88 32 b6 fb ff ff ff ff 00 36 b6 fb ff ff ff ff  .2.......6......

#

Structure alignment

DTrace has had a longstanding issue with how it calculates alignment for structures, unions, and bit-fields.  At best, this was an annoyance that required users to manually add padding to their scripts to get correct behavior; at worst, it would produce wrong information when using things like the sizeof() action, or raise alignment errors where there shouldn't have been any.  We have made several modifications so that structures, unions, and bit-fields are now more ABI compliant.

In the case of bit-fields, depending on where the break between the bits falls, some bits would get lost or placed where they shouldn't be.  For example, a DTrace script written as:

union u1tag
{
        unsigned char c;
        struct
        {
                unsigned int b1:1;
                unsigned int b2:7;
        } s1;
} u1;

BEGIN
{
        u1.c = 255;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        u1.c = 0;
        u1.s1.b1 = 1;
        u1.s1.b2 = 127;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        exit(0);
}

Would produce a result of:

255 1 0
128 1 127

If you were to write the same thing in C, you would see:

255 1 127
255 1 127

For structs and unions, the problem is more pronounced.  Suppose we have a DTrace script:

typedef struct _my_data {
        uint32_t a;
        uint64_t c;
} my_data;

typedef struct _more_data {
        uint32_t x;
        my_data y;
} more_data;

more_data a;

BEGIN {
        a.y.c = 30;
        printf("%lu\n", a.y.c);
}

When we run it, DTrace reports that this otherwise valid structure access results in an invalid alignment:

 # dtrace -64 -s /tmp/test.d
dtrace: script '/tmp/test.d' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid alignment
(0xffffff08bf603a0c) in action #1 at DIF offset 24

The workaround in the past was to add a padding variable so that struct _more_data would look like:

typedef struct _more_data {
        uint32_t x;
        uint32_t padding;
        my_data y;
} more_data;

This means that without the padding, not only did we run into alignment issues that produced runtime errors, but we also ran the risk of reporting wrong data within the members of the struct.  This also impacted the size reported by actions such as sizeof().

For instance, suppose we have two structs:

struct s1 {
        int x;          /* 4 bytes */
        short a;        /* 2 bytes */
};

struct s2 {
        struct s1 b;    /* 6 bytes */
        short c;        /* 2 bytes */
        int d;          /* 4 bytes */
};

In this case, sizeof(s1) would report a size of 6, and sizeof(s2) would report a size of 12 instead of 8 and 16 respectively.
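A quick way to check what your DTrace build reports is a one-off D script reusing the definitions above; with the 11.2 alignment fixes, the output should be 8 and 16:

struct s1 {
        int x;          /* 4 bytes */
        short a;        /* 2 bytes */
};

struct s2 {
        struct s1 b;
        short c;
        int d;
};

BEGIN
{
        /* Print the sizes DTrace computes for both structs. */
        printf("%d %d\n", sizeof (struct s1), sizeof (struct s2));
        exit(0);
}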

One last alignment issue addressed in this release concerns the memory alignment restrictions of the SPARC and x86 architectures.  In the past, DTrace would force a 4-byte alignment regardless of the architecture the script happened to be running on.  With 11.2, alignment is based on the platform: SPARC still requires 4-byte alignment for memory accesses, but the x86 architecture is more lenient and doesn't impose such restrictions.

Summing up: by ensuring ABI compliance, each structure now matches its C counterpart.  When you copyin a struct from your program and rely on struct definitions from a header file or your own D scripts, you can rest assured the data is reliable.

Wednesday May 07, 2014

Solaris-specific Providers for Puppet

As I mentioned in my previous post about Puppet, there are some new Solaris-specific Resource Types for Puppet 3.4.1 in Oracle Solaris 11.2.  All of these new Resource Types and Providers have been available on java.net since integration into the FOSS projects gate.  I am actively working with Puppet Labs to get this code pushed back upstream so that it's available for anybody to work with.

Here's a small description of a few of the 23 new Resource Types (a sample manifest sketch follows the list):

  • boot_environment
    • name - The boot_environment name (#namevar)
    • description - Description for the new boot environment
    • clone_be - Create a new boot environment from an existing inactive boot environment
    • options - Create the datasets for a new boot environment with specific ZFS properties.  Specified as a hash
    • zpool - Create the new boot environment in the specified zpool
    • activate - Activate the specified boot environment
  • pkg_publisher
    • name - The publisher name (#namevar)
    • origin - Which origin URI(s) to set.  For multiple origins, specify them as a list
    • enable - Enable the publisher
    • sticky - Set the publisher 'sticky'
    • searchfirst - Set the publisher first in the search order
    • searchafter - Set the publisher after the specified publisher in the search order
    • searchbefore - Set the publisher before the specified publisher in the search order
    • proxy - Use the specified web proxy URI to retrieve content for the specified origin or mirror
    • sslkey - The client SSL key
    • sslcert - The client SSL certificate
  • vnic
    • name - The name of the VNIC (#namevar)
    • temporary - Optional parameter that specifies that the VNIC is temporary
    • lower_link - The name of the physical datalink over which the VNIC is operating
    • mac_address - Sets the VNIC's MAC address based on the specified value
  • dns
    • name - A symbolic name for the DNS client settings to use.  This name is used for human reference only
    • nameserver - The IP address(es) the resolver is to query.  A maximum of 3 IP addresses may be specified.  Specify multiple addresses as a list
    • domain - The local domain name
    • search - The search list for host name lookup.  A maximum of 6 search entries may be specified.  Specify multiple search entries as a list
    • sortlist - Addresses returned by gethostbyname() to be sorted.  Entries must be specified in IP 'slash notation'.  A maximum of 10 sortlist entries may be specified.  Specify multiple entries as an array.
    • options - Set internal resolver variables.  Valid values are debug, ndots:n, timeout:n, retrans:n, attempts:n, retry:n, rotate, no-check-names, inet6.  For values with 'n', specify 'n' as an integer.  Specify multiple options as an array.
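Here is a minimal manifest sketch using two of these types.  The resource and attribute names come from the descriptions above; the link name, addresses, and domain are illustrative placeholders:

vnic { 'vnic0':
  lower_link => 'net0',
}

dns { 'current':
  nameserver => ['192.0.2.1', '192.0.2.2'],
  search     => ['example.com'],
}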

Other Resource Types are:

  • Datalink Management:   etherstub, ip_tunnel, link_aggregation, solaris_vlan
  • IP Network Interfaces:  address_object, address_property, interface_properties, ip_interface, ipmp_interface, link_properties, protocol_properties, vni_interface
  • pkg(5) Management:  pkg_facet, pkg_mediator, pkg_variant
  • Naming Services:  nis, nsswitch, ldap

The zones Resource Type has been updated to provide Kernel Zone and archive support as well.

Tuesday May 06, 2014

OpenSSL on Oracle Solaris 11.2



I'm sure you're all wondering which version of OpenSSL is delivered with Oracle Solaris 11.2.
The answer is the latest and greatest OpenSSL 1.0.1h!

Now that I've answered 80% of the questions you may have about OpenSSL, I would like to announce three major features added to Oracle Solaris 11.2 which I'm sure you'll all be excited to hear about :-)

Inlined T4/T4+ instructions support and Engines


Background: S11.1 and earlier

Years and years ago, I worked on the SPARC T2/T3 crypto drivers.  On the SPARC T2/T3 processors, the crypto instructions are privileged; therefore, drivers are needed to access those instructions.  Thus, to make use of the T2/T3 crypto hardware, OpenSSL had to use the pkcs11 engine, which adds lots of cycles going through the thick PKCS#11 session/object management layer, the Solaris kernel layer, and the hypervisor layer to the hardware, and all the way back.  However, on SPARC T4/T4+ processors, the crypto instructions are no longer privileged; therefore, you can access them directly without drivers.  Valerie Fenwick has a nice article explaining the lower level specifics of the T4 hardware.

What does that mean to you?  Much improved performance!  No more PKCS#11 layer, no more copy-in/copy-out of the data from userland to kernel space, no more scheduling, no more hypervisor, NADA!   As much as I enjoyed working on the crypto drivers, I'm happy to see this driver-less transition! ;-)

Dan Anderson has a great blog entry describing the difference between the T3 and T4 based hardware.  As he described, on Solaris 11 and 11.1 we made the T4 instructions available to OpenSSL via the OpenSSL engine mechanism.  It was great for the time being, but to make T4 instruction support available directly from the OpenSSL website and to bypass the engine layer altogether, I was assigned to assassinate the t4 engine (sorry, Dan) and embed the T4 instructions in OpenSSL's internal crypto module (a.k.a. adding inlined T4 instruction support).

S11.2 and beyond

As I was learning how OpenSSL development worked, I discovered that the OpenSSL upstream engineers had already committed the inlined T4 instruction support to the OpenSSL 1.0.2 branch.  (Thanks for making my life easier, OpenSSL team!)  I was job-less for a second, but since OpenSSL 1.0.2 won't be available in time for the Solaris 11.2 delivery, we decided to patch the inlined T4 instruction support into our OpenSSL 1.0.1g delivery bundled with Solaris 11.2.

With this change, you get the T4/T4+ instruction support without engines; therefore, by default, you get performance as good as with the t4 engine, and even better for some algorithms (e.g., SHA-1, MD5).
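An easy way to see the effect on your own machine is OpenSSL's built-in benchmark (a standard openssl(1) subcommand; pick whichever ciphers and digests you care about):

# openssl speed -evp aes-128-cbc
# openssl speed sha1 md5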

Other Engines

Oracle Solaris 11.2 killed not only the t4 engine, but also the aesni engine and the devcrypto engine.  The story of the aesni engine is much like that of the t4 engine.  It was introduced in Solaris 11, as Dan Anderson described in his article, and killed in Solaris 11.2.  AES-NI instruction support is now embedded in the OpenSSL upstream implementation (OpenSSL 1.0.1); therefore, the separate engine is no longer needed.  The devcrypto engine was removed simply due to lack of use.

With all these changes, the Oracle Solaris 11.2 OpenSSL is left with one and only one engine: the pkcs11 engine.  The pkcs11 engine is still necessary on the T2/T3 platforms and on any platform with a hardware keystore (e.g., SCA 6000).  However, be sure to leave the pkcs11 engine disabled on T4/T4+ if you want maximum performance.  Again, I would like to emphasize that OpenSSL performance on the T4/T4+ platforms looks MUCH better than on the T2/T3 platforms!  It's time to move onto the T4/T4+ platform, y'all!!
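To check which engines your OpenSSL build exposes, you can list them with the standard engine subcommand (the output shown here is illustrative):

# openssl engine
(dynamic) Dynamic engine loading support
(pkcs11) PKCS #11 engine support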


OpenSSL FIPS-140 version support


It is important for many federal and financial services customers that their cryptographic products be FIPS-140 validated.  The Oracle Solaris Cryptographic Framework recently achieved a FIPS 140-2 validation (yay!!), and it was very important to deliver the FIPS-140 validated OpenSSL with Solaris 11.2.

At the time Solaris 11 was released, OpenSSL 1.0.0 was the latest version available, and since OpenSSL 1.0.0 was not FIPS-140 validated, we delivered only the non-FIPS-140 version of OpenSSL with Solaris 11.

Thanks to the OpenSSL upstream team (again), the latest and greatest OpenSSL 1.0.1 can be compiled with a FIPS-140 validated module, and we are now delivering the FIPS-140 version of OpenSSL in addition to the non-FIPS-140 version with Solaris 11.2.

When do you want to use FIPS-140 version of OpenSSL?


It's probably important to mention that the FIPS-140 version of OpenSSL is not for everybody.  FIPS-140 validated cryptographic products come with a price tag: enabling FIPS-140 mode adds a lot of cycles to satisfy the FIPS-140 verification requirements (e.g., POST, pair-wise consistency tests, continuous RNG test) at run time.  In addition, inlined T4/T4+ instruction support is not available in the FIPS-140 version of OpenSSL, so you won't get the best performance when FIPS-140 mode is enabled.

That said, I recommend enabling FIPS-140 mode *only if* you need to.  The good news is that you get a FIPS-140 compatible implementation even when FIPS-140 mode is disabled; it just runs much faster.  That's one of the reasons why the non-FIPS-140 version of OpenSSL is activated by default.

How to enable FIPS-140 version of OpenSSL


If you decide to enable FIPS-140 mode, here is how to switch to the FIPS-140 version of OpenSSL.

First, make sure you have the FIPS-140 version of OpenSSL installed on the system:

# pkg mediator -a openssl
MEDIATOR VER. SRC. VERSION IMPL. SRC. IMPLEMENTATION
openssl  vendor            vendor     default
openssl  system            system     fips-140


To activate the fips-140 implementation
# pkg set-mediator -I fips-140 openssl

To check the currently activated OpenSSL implementation
# pkg mediator openssl

To change back to the default (non-FIPS-140) implementation
# pkg set-mediator -I default openssl


OpenSSL Thread and Fork Safety


OpenSSL provides an interface, CRYPTO_set_locking_callback(), that lets you (any application or library) set your own locking callback function using the mutexes of your choice.  That sounds reasonable if the OpenSSL library is used only by applications.  However, when the OpenSSL library is used by another library, such a design is asking for trouble.

We've seen a case where an OpenSSL application used a library which set a locking callback function, and the library got unloaded while the application continued using the OpenSSL library.  The application got a segfault because OpenSSL tried to reference the invalid locking callback function set by the unloaded library.  Whose fault is this?

You could argue that the library should have set the locking callback to NULL when it was unloaded.  Well, not quite: once the locking callback is set to NULL, the application is no longer thread-safe.

OpenSSL needed some changes to make applications and libraries thread and fork safe.

To fix this issue, the OpenSSL library (libcrypto.so) delivered with Solaris 11.2 sets up mutexes and a locking callback internally, and it ignores an attempt to set/change the locking callback.
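For reference, here is the classic locking-callback pattern that applications and libraries used to register their own locks (standard OpenSSL 1.0.x API; on Solaris 11.2 the registration call below is simply ignored, since libcrypto now manages its own locks):

#include <openssl/crypto.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t *locks;

/* Called by OpenSSL to lock or unlock lock number n. */
static void
locking_cb(int mode, int n, const char *file, int line)
{
        if (mode & CRYPTO_LOCK)
                (void) pthread_mutex_lock(&locks[n]);
        else
                (void) pthread_mutex_unlock(&locks[n]);
}

void
init_openssl_locking(void)
{
        int i;

        locks = malloc(CRYPTO_num_locks() * sizeof (pthread_mutex_t));
        for (i = 0; i < CRYPTO_num_locks(); i++)
                (void) pthread_mutex_init(&locks[i], NULL);
        /* On Solaris 11.2's libcrypto, this registration is a no-op. */
        CRYPTO_set_locking_callback(locking_cb);
}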

What does that mean to you?  OpenSSL is now thread and fork safe by default.  You don't need to make any modifications to your application or library.  You can relax and have a margarita or two.

That's all I have for now.

Note:  The version number delivered with Solaris 11.2 was updated from 1.0.1g to 1.0.1h on Jun 05, 2014. OpenSSL version 1.0.1g was delivered with Solaris 11.2 Beta.

Puppet Configuration in Solaris

What is Puppet?

Puppet is IT automation software that helps system administrators manage IT infrastructure. It automates tasks such as provisioning, configuration, patch management and compliance. Repetitive tasks are easily automated, deployment of critical applications occurs rapidly, and required system changes are proactively managed. Puppet scales to meet the needs of the environment, whether it is a simple deployment or a complex infrastructure, and works on-premise or in the cloud.

Puppet is now available as part of Oracle Solaris 11.2!

Use ntpdate or ntpd -q to set the date

Puppet can error out with some very strange messages if the clocks on the master and agent aren't synchronized.  You can use ntpdate or ntpd -q to set the date just once (handy if you'd rather manage the NTP service with Puppet itself), or you can configure NTP now.
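For example, a one-time sync against your site's NTP server (the server name is a placeholder):

# ntpdate ntp.company.com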

Install the required packages on both systems 

# pkg install puppet

This will install the puppet, facter and ruby-19 packages.

Configure the Puppet SMF instances

master # svccfg -s puppet:master setprop config/server = master.fqdn.company.com
master # svccfg -s puppet:master refresh
master # svcadm enable puppet:master

agent # svccfg -s puppet:agent setprop config/server = master.fqdn.company.com
agent # svccfg -s puppet:agent refresh

Test the connection to the master and configure authentication

Before enabling the puppet:agent service, you'll want to test the connection first in order to set up authentication:

agent # puppet agent --test --server master.fqdn.company.com

Info: Creating a new SSL key for agent.fqdn.company.com
Info: Caching certificate for ca
Info: Creating a new SSL certificate request for agent.fqdn.company.com
Info: Certificate Request fingerprint (SHA256):
C9:63:22:6A:9F:88:D6:18:7F:F3:F4:FA:89:E4:86:A1:C7:BE:94:CF:F1:D5:59:B9:DD:21:8D:C1:C9:B0:F4:18
**Exiting; no certificate found and waitforcert is disabled**

Now that the agent has created a new SSL key, the certificate request needs to be approved on the master.

Sign the SSL certificate on the master

master # puppet cert list
  "agent.fqdn.company.com" (SHA256)
  C9:63:22:6A:9F:88:D6:18:7F:F3:F4:FA:89:E4:86:A1:C7:BE:94:CF:F1:D5:59:B9:DD:21:8D:C1:C9:B0:F4:18

master # puppet cert sign agent.fqdn.company.com
Notice: Signed certificate request for agent.fqdn.company.com
Notice: Removing file Puppet::SSL::CertificateRequest agent.fqdn.company.com at
'/etc/puppet/ssl/ca/requests/agent.fqdn.company.com.pem'

Retest the agent to ensure it can connect

agent # puppet agent --test --server master.fqdn.company.com
Info: Caching certificate for agent.fqdn.company.com
Info: Caching certificate_revocation_list for ca
Info: Retrieving plugin
Info: Caching catalog for agent.fqdn.company.com
Info: Applying configuration version '1371232699'
Notice: Finished catalog run in 0.65 seconds

Enable the agent service

agent # svcadm enable puppet:agent

Additional configuration of /etc/puppet/puppet.conf on both master and agent (optional) 

Further customizations can be made in /etc/puppet/puppet.conf.  See Puppet's Configurables page for more details.

NOTE:  Puppet's configuration is completely done via SMF stencils.  /etc/puppet/puppet.conf should not be edited directly, as any edits will be lost when the Puppet SMF service (re)starts.  New values should be set via svccfg(1M):

# svccfg -s puppet:agent setprop config/<option> = <value>

# svccfg -s puppet:agent refresh

(substitute :master as needed)

Tuesday Apr 29, 2014

New in IPS Documentation for Oracle Solaris 11.2

Documentation of the Image Packaging System on docs.oracle.com is in three books. All three books contain new information for the Oracle Solaris 11.2 release.

See also Tim Foster's Web Log

Highlights:

  • New pkg/mirror service
  • New pkg/depot service
  • New chapter about web server configuration, including a new section about configuring https access
  • New pkgrecv --clone option
  • New pkg install and pkg update troubleshooting section
  • New chapter about updating an image
  • New options for pkg subcommands (sample invocations follow this list):
    • -r: perform operation recursively on specified non-global zones
    • --sync-actuators: do not return until all actuators have finished
    • --ignore-missing: when updating or uninstalling, ignore packages that are not installed
  • New pkg exact-install command
  • New file attribute for setting system attributes
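For instance, sample invocations of the new subcommands and options (illustrative sketches; see the books below for exact usage):

    # pkg update -r entire
    # pkg update --sync-actuators entire
    # pkg uninstall --ignore-missing mypkg
    # pkg exact-install solaris-minimal-server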
Details:

Copying and Creating Package Repositories in Oracle Solaris 11.2

Chapter 1, "Image Packaging System Package Repositories"
- New section about best practices

Chapter 2, "Copying IPS Package Repositories"
- Copying from a zip file (see also Release Engineering's blog) or iso file
- Using the pkgrecv command
- Using the new pkg/mirror service to automatically periodically update a repository

Chapter 3, "Providing Access To Your Repository"
- Using a ZFS share
- Using the pkg/server service

Chapter 4, "Maintaining Your Local IPS Package Repository"
- New repository update procedure
- Using the pkgrecv --clone option to clone a repository
- Using the new pkg/depot service to serve multiple repositories from a single location

Chapter 5, "Running the Depot Server Behind a Web Server"
- Caching, load balancing
- New section about configuring HTTPS repository access

Adding and Updating Software in Oracle Solaris 11.2

Chapter 1, "Introduction to the Image Packaging System"
- Incorporations and group packages, FMRIs, images

Chapter 2, "Getting Information About Software Packages"
- Packages that can be installed
- Package descriptions, licenses, dependencies, dependents
- Searching for packages

Chapter 3, "Installing and Updating Software Packages"
- New options for pkg subcommands regarding non-global zones, SMF actuators, and ignoring missing packages in a pkg update or uninstall
- New pkg exact-install command (see also Bart's blog)
- Updated information about non-global zones

Chapter 4, "Updating or Upgrading an Oracle Solaris Image"
- Ways to control the version to which to upgrade, including creating a custom incorporation package

Chapter 5, "Configuring Installed Images"
- Configuring publishers
- Variants and facets
- Freezing
- Incorporation constraints
- Mediations
- Groups

Appendix A, "Troubleshooting Package Installation and Update"
- All new - Begins with steps you should always do and then is organized by error message

Appendix B, "IPS Graphical User Interfaces"
- Package Manager and Package Update

Packaging and Delivering Software With the Image Packaging System in Oracle Solaris 11.2

Chapter 1, "IPS Design Goals, Concepts, and Terminology"
- General information about software self-assembly and package lifecycle
- Definitions, package components
- New file attribute, sysattr, for setting system attributes

Chapter 2, "Packaging Software With IPS"
- Updated procedures for publishing and delivering your package

Chapter 3, "Installing, Removing, and Updating Software Packages"
- How this works in the Image Packaging System

Chapter 4, "Specifying Package Dependencies"
- New firmware value for the fmri attribute of the origin dependency for specifying driver firmware compatibility

Chapter 5, "Allowing Variations"
- Variants and facets

Chapter 6, "Modifying Package Manifests Programmatically"
- Using pkgmogrify

Chapter 7, "Automating System Change as Part of Package Installation"
- Specifying actuators on package actions
- Delivering SMF services in IPS packages
- New or updated examples of a run-once service and a self-assembly service

Chapter 8, "Advanced Topics For Package Updating"
- Renaming, merging, splitting, obsoleting packages
- New or updated examples of preserving editable packaged content, preserving unpackaged content, sharing content across boot environments, overlaying files, and delivering a mediation

Chapter 9, "Signing IPS Packages"

Chapter 10, "Handling Non-Global Zones"

Chapter 11, "Modifying Published Packages"

Appendix A, "Classifying Packages"

Appendix B, "How IPS Is Used To Package the Oracle Solaris OS"

 

Wednesday Dec 12, 2012

Oracle Solaris 11 pkg fix

Bob Netherton explains why Solaris 11 pkg fix is his new friend.

"So far so good. Then comes an oops... This is where you generally say a few things to yourself, and then promise to quit deleting configuration files and directories when you don't know what you are doing. Then you recall that the new Solaris 11 packaging system has some ability to correct common mistakes (like the one I just made)."

[Read More]

Thursday Nov 17, 2011

Critical Threads Optimization

Background

One of the more common issues we've been seeing in the field is the growing difficulty in optimizing performance of multi-threaded applications. A good portion of this difficulty is due to the increasing complexity of modern processors that present various degrees of sharing relationships between hardware components. Take any current CMT processor and you'll find any number of CPUs sharing execution pipelines, floating point units, caches, etc. Consequently, applying the traditional recipe of one software thread for each CPU will have varying degrees of success, according to the layout of the underlying hardware.

On top of this increasing complexity we've also seen processors with features that aim at dynamically resourcing software threads according to their utilization. Intel's Turbo Boost allows processors to increase their operating frequency if there is enough thermal headroom available and the processor isn't fully utilized. More recently, the SPARC T4 processor introduced dynamic threading, allowing each core to dynamically allocate more resources to its active CPUs. Both cases are in essence recognizing that current processors will be running a wide mix of workloads, some will be designed for throughput, others for low latency. The hardware is providing mechanisms to dynamically resource threads according to their runtime behavior.

We're very aware of these challenges in Solaris, and have been working to provide the best out of box performance while providing mechanisms to further optimize applications when necessary. The Critical Threads Optimization was introduced in Solaris 10 8/11 and Solaris 11 as one such mechanism that allows customers to both address issues caused by contention over shared hardware resources and explicitly take advantage of features such as T4's dynamic threading.

What it is

The basic idea is to allow performance critical threads to execute with more exclusive access to hardware resources. For example, when deploying an application that implements a producer/consumer model, it'll likely be advantageous to give the producer more exclusive access to the hardware instead of having it competing for resources with all the consumers. In the case of a T4 based system, we may want to have a producer running by itself on a single core and create one consumer for each of the remaining CPUs.

With the Critical Threads Optimization we're extending the semantics of scheduling priorities (which thread should run first) to include priority over shared resources (which thread should have more "space"). Now the scheduler will not only run higher priority threads first: it will also provide them with more exclusive access to hardware resources if they are available.

How does it work ?

Using the previous example, in Solaris 11 all you'd have to do is place the producer in the Fixed Priority (FX) scheduling class at priority 60, or in the Real Time (RT) class at any priority, and Solaris will try to give it more "hardware space". On both Solaris 10 8/11 and Solaris 11 this can be achieved through the existing priocntl(1,2) and priocntlset(2) interfaces. If your application already assigns these priorities to performance critical threads, there's no additional step you need to take.
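For example, to move an already-running producer process into the FX class at priority 60 from the command line (12345 is a placeholder pid):

    # priocntl -s -c FX -m 60 -p 60 -i pid 12345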

One important aspect of this optimization is that it requires some level of idleness in the system, either as a result of sizing the application before hand or through periods of transient idleness during runtime. If the system is fully committed, the scheduler will put all the available CPUs to work.

Best practices

If you're an application developer, we encourage you to look into assigning the right priorities for the different threads in your application. Solaris provides different scheduling classes (Time Share, Interactive, Fair Share, Fixed Priority and Real Time) that offer different policies and behaviors. It is not always simple to figure out which set of threads are critical to the performance of a workload, and it may not always be feasible to take advantage of this optimization, but we believe that this can be correctly (and safely) done during development.

Overall, the out of box performance in Solaris should meet your workload's requirements. If you are looking into that extra bit of performance, then the Critical Threads Optimization may be what you're looking for.

Monday Aug 29, 2011

Running GNOME Terminal From a Zone

As I've mentioned before, I VPN into the Oracle Intranet from within a zone. Once I establish the VPN connection, I'm no longer able to SSH into the zone, which is a slight drag if I'd like to open a new terminal window. The solution is to launch a new GNOME terminal window from within the zone. However, this wasn't without some minor hurdles to clear, so I'm documenting the process for future reference.

I'm assuming your zone already has a user account and the X authority file utility installed so you can launch X applications. If not, follow Steps 2 and 3 from the entry Running Firefox From a Zone.

Of course, GNOME Terminal needs to be installed:

bleonard@myzone:~$ sudo pkg install gnome-terminal
               Packages to install:     1
           Create boot environment:    No
               Services to restart:     2
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1       80/80      2.1/2.1

PHASE                                        ACTIONS
Install Phase                                160/160 

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2  

At this point, you'd like to think you could just launch gnome-terminal, but alas:

bleonard@myzone:~$ gnome-terminal
**
ERROR:terminal-app.c:1450:terminal_app_init: assertion failed: (app->default_profile_id != NULL)
Abort (core dumped)

It turns out you also need to install the SMF services responsible for updating the GNOME desktop caches (I've already filed an issue for this):

bleonard@myzone:~$ sudo pkg install desktop-cache
               Packages to install:     8
           Create boot environment:    No
               Services to restart:     5
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  8/8   3125/3125    13.5/13.5

PHASE                                        ACTIONS
Install Phase                              3566/3566 

PHASE                                          ITEMS
Package State Update Phase                       8/8 
Image State Update Phase                         2/2 

After installing the package, wait a few seconds while the cache is built. You can verify it's complete when the GNOME Gconf Cache Builder service state changes to online:

bleonard@myzone:~$ svcs -l gconf-cache
fmri         svc:/application/desktop-cache/gconf-cache:default
name         GNOME Gconf Cache Builder
enabled      true
state        online
next_state   none
state_time   August 29, 2011 04:50:45 PM EDT
logfile      /var/svc/log/application-desktop-cache-gconf-cache:default.log
restarter    svc:/system/svc/restarter:default
dependency   require_all/none svc:/system/filesystem/local (online)

After which, gnome-terminal should start successfully:

bleonard@myzone:~$ gnome-terminal &


If for some reason you still run into a problem, try refreshing the GNOME Gconf Cache Service:

bleonard@myzone:~$ sudo svcadm refresh gconf-cache

Friday Aug 19, 2011

Replacing the system HDD by a larger one on Solaris 11 X86

Feedback on replacing the internal disk drive of a Solaris 11 Express laptop with a larger one, using ZFS mirroring and ZFS split.
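
The gist, as a minimal sketch with hypothetical device names (the full entry also covers details like boot blocks): attach the new disk to the root pool as a mirror, wait for the resilver to complete (zpool status shows its progress), then split the new disk off into its own pool:

bleonard@solaris:~$ sudo zpool attach rpool c8t0d0s0 c8t1d0s0
bleonard@solaris:~$ zpool status rpool
bleonard@solaris:~$ sudo zpool split rpool rpool2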

Thursday Jul 28, 2011

Installing WebLogic in a Zone

Sadly, the download page for WebLogic 10.3.5 doesn't yet include an installer for Solaris on x86, so here I outline the steps I took to successfully install WebLogic on Solaris - in a zone, of course.

Step 1: Create the Zone

The WebLogic installer requires 1.2 GB of swap space. Check that you have enough before starting (the Zone Swap Space entry below covers adding more).
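
You can check what's currently available with swap -sh; for example:

bleonard@solaris:~$ swap -sh
total: 604M allocated + 122M reserved = 724M used, 836M available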

Create a VNIC for the zone:
bleonard@solaris:~$ sudo dladm create-vnic -l e1000g0 wls_zone0

Create the zone:

bleonard@solaris:~$ sudo zonecfg -z wls_zone
wls_zone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:wls_zone> create
zonecfg:wls_zone> set zonepath=/zones/wls_zone
zonecfg:wls_zone> set ip-type=exclusive
zonecfg:wls_zone> add net
zonecfg:wls_zone:net> set physical=wls_zone0
zonecfg:wls_zone:net> end
zonecfg:wls_zone> verify
zonecfg:wls_zone> exit 

Install the zone:

bleonard@solaris:~$ sudo zoneadm -z wls_zone install
...

Here's the sysidcfg file I used for the zone:

bleonard@solaris:~$ sudo cat /zones/wls_zone/root/etc/sysidcfg
system_locale=C
terminal=xterms
network_interface=PRIMARY {
    hostname=wls_zone
    ip_address=10.0.1.70
    default_route=10.0.1.1
    netmask=255.255.255.0
    protocol_ipv6=no}
security_policy=none
name_service=NONE
nfs4_domain=dynamic
timezone=US/Eastern
root_password=fto/dU8MKwQRI

Boot and configure the zone:

bleonard@solaris:~$ sudo zoneadm -z wls_zone boot
bleonard@solaris:~$ sudo zlogin -C wls_zone
[Connected to zone 'wls_zone' console]
100/100
Hostname: wls_zone
Loading smf(5) service descriptions: 3/3
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses:.

wls_zone console login: root
Password: abc123 

Configure DNS name resolution:

bleonard@solaris:~$ sudo cp /etc/nsswitch.conf /zones/wls_zone/root/etc/.
bleonard@solaris:~$ sudo cp /etc/resolv.conf /zones/wls_zone/root/etc/.
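
To verify that name resolution now works inside the zone, a quick test like this should succeed:

root@wls_zone:~# ping www.oracle.com
www.oracle.com is alive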

Step 2: Install the Supporting Software

The JDK:

root@wls_zone:~# pkg install jdk
               Packages to install:     3
           Create boot environment:    No
               Services to restart:     1
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  3/3   1252/1252    71.7/71.7

PHASE                                        ACTIONS
Install Phase                              1633/1633 

PHASE                                          ITEMS
Package State Update Phase                       3/3 
Image State Update Phase                         2/2 

Include 64-bit support:

root@wls_zone:~# pkg install jdk64
               Packages to install:     1
           Create boot environment:    No
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1       32/32      0.7/0.7

PHASE                                        ACTIONS
Install Phase                                  59/59

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2 

The X authority file utility. This will allow us to forward the display to the zone so we can run the graphical installer:

root@wls_zone:~# pkg install xauth
               Packages to install:     1
           Create boot environment:    No
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1         6/6      0.0/0.0

PHASE                                        ACTIONS
Install Phase                                  37/37 

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2 

Install the X Test and Record extensions client library. This library is also required to start the graphical installer:

root@wls_zone:~# pkg install libxtst
               Packages to install:     2
           Create boot environment:    No
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  2/2       96/96      0.2/0.2

PHASE                                        ACTIONS
Install Phase                                176/176 

PHASE                                          ITEMS
Package State Update Phase                       2/2 
Image State Update Phase                         2/2 

Step 3:  Create a User Account

For this exercise I'm going to create a user 'weblogic'.

root@wls_zone:~# useradd -m -d /weblogic -s /usr/bin/bash weblogic
root@wls_zone:~# passwd weblogic 
New Password: weblogic
Re-enter new Password: weblogic 
passwd: password successfully changed for weblogic

Step 4: Download WebLogic and Copy to the Zone

Download the Oracle WebLogic Server 11gR1 (10.3.5) + Coherence Package Installer (File 1) for Additional Platforms. You may also want to download the Supplemental ZIP distribution (File 1), which contains sample applications.

Copy those files into the zone:

bleonard@solaris:~$ sudo cp Download/wls1035_generic.jar /zones/wls_zone/root/weblogic/.
bleonard@solaris:~$ sudo cp Download/wls1035_dev_supplemental.zip /zones/wls_zone/root/weblogic/.

Step 5: Start the Installer

SSH into the zone. Be sure to forward the X11 display:

bleonard@solaris:~$ ssh -X weblogic@10.0.1.70
The authenticity of host '10.0.1.70 (10.0.1.70)' can't be established.
RSA key fingerprint is c4:73:8b:ea:db:c5:1e:fd:76:35:61:26:92:8e:4e:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.70' (RSA) to the list of known hosts.
Password: 
Last login: Thu Jul 28 07:55:41 2011 from 10.0.1.3
Oracle Corporation      SunOS 5.11      snv_151a        June 2011

Here's the official documentation to Starting the Installation Program on UNIX Using .jar Installers. Note that I'm also adding the option -Dspace.detection=false, because I couldn't figure out how to get past the installer's disk space check - I certainly have more than 1 MB of space in my home directory. So basically:

weblogic@wls_zone:~$ java -d64 -Dspace.detection=false -jar wls1035_generic.jar 
Extracting 0%....................................................................................................100%

Working through the installer, I select most of the defaults except where noted:

Enter your My Oracle Support credentials if you have them. You're allowed to continue if you leave them blank, as I did.

Select the Custom install type, then select the Server Examples. I like to include the examples, but this is optional.

I recommend running Quickstart to create an initial domain. To run Quickstart later, you'll find it at Oracle/Middleware/wlserver_10.3/common/quickstart/quickstart.sh.


Step 6:  Start the Server

weblogic@wls_zone:~$ ./Oracle/Middleware/user_projects/domains/base_domain/startWebLogic.sh &

..

<Jul 28, 2011 8:54:10 AM PDT> <Notice> <WebLogicServer> <BEA-000360> <Server started in RUNNING mode>

Step 7:  Browse to the Console

Point your browser to http://10.0.1.70:7001/console and log in.


Monday Jul 25, 2011

Zone Swap Space

A non-global zone inherits its swap space setting from the global zone. For example, in my global zone:

bleonard@solaris:~$ swap -sh
total: 604M allocated + 122M reserved = 724M used, 836M available

And in my local zone:

bleonard@myzone:~$ swap -sh
total: 604M allocated + 122M reserved = 724M used, 836M available 

If I need to increase swap space in a particular zone, I need to add swap to the entire system. As covered in Adjusting the Sizes of Your ZFS Swap and Dump Devices, first add another swap volume:

bleonard@solaris:~$ sudo zfs create -V 1G rpool/swap2
Password: 

Then add the new volume to the swap:

bleonard@solaris:~$ sudo swap -a /dev/zvol/dsk/rpool/swap2

bleonard@solaris:~$ swap -sh
total: 612M allocated + 133M reserved = 748M used, 1.8G available

The new swap is also immediately recognized by the zone:

bleonard@myzone:~$ swap -sh
total: 612M allocated + 133M reserved = 748M used, 1.8G available

To permanently add the swap to the system, you need to add the device to the /etc/vfstab file:

bleonard@solaris:~$ cat /etc/vfstab 
#device		device		mount		FS	fsck	mount	mount
#to mount	to fsck		point		type	pass	at boot	options
#
/devices	-		/devices	devfs	-	no	-
/proc		-		/proc		proc	-	no	-
ctfs		-		/system/contract ctfs	-	no	-
objfs		-		/system/object	objfs	-	no	-
sharefs		-		/etc/dfs/sharetab	sharefs	-	no	-
fd		-		/dev/fd		fd	-	no	-
swap		-		/tmp		tmpfs	-	yes	-

/dev/zvol/dsk/rpool/swap	-		-		swap	-	no	-
/dev/zvol/dsk/rpool/swap2	-		-		swap	-	no	-
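
You can double-check which swap devices are active with swap -lh; both rpool/swap and rpool/swap2 should be listed:

bleonard@solaris:~$ swap -lh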

You can also control the amount of swap space used by zones with resource caps, for example:

bleonard@solaris:~$ sudo zonecfg -z myzone
zonecfg:myzone> add capped-memory
zonecfg:myzone:capped-memory> set swap=1G
zonecfg:myzone:capped-memory> end
zonecfg:myzone> verify
zonecfg:myzone> exit

This change will require a zone reboot:

bleonard@solaris:~$ sudo zoneadm -z myzone reboot

After which the swap cap will be in place:

bleonard@myzone:~$ swap -sh
total: 33M allocated + 0K reserved = 33M used, 988M available


Tuesday Jul 19, 2011

Integrated Load Balancer

I'm not sure how well known it is that Solaris 11 contains a load balancer. The official documentation, starting with the Integrated Load Balancer Overview, does a great job of explaining this feature. In this blog entry my goal is to provide an implementation example.

For starters, I will be using the HALF-NAT operation mode. Basically, HALF-NAT means that the client's IP address is not rewritten, so the servers see the real client address. This is usually preferred for server logging (see ILB Operation Modes for more).

I will load balance traffic across 2 zones, each running the Apache Tomcat server. The load balancer itself will be configured as a multi-homed zone: its external interface (ilb0, 10.0.2.21) faces the clients and hosts the virtual IP (10.0.2.20), while its internal interface (ilb1, 192.168.1.21) faces the server zones at 192.168.1.50 and 192.168.1.60.

Step 1: Create the VNICs

The first step is to create VNICs for all of these interfaces:

bleonard@solaris:~$ sudo dladm create-vnic -l e1000g0 ilb0
bleonard@solaris:~$ sudo dladm create-vnic -l e1000g0 ilb1
bleonard@solaris:~$ sudo dladm create-vnic -l e1000g0 server1
bleonard@solaris:~$ sudo dladm create-vnic -l e1000g0 server2
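
You can confirm that all four VNICs were created over e1000g0 with:

bleonard@solaris:~$ dladm show-vnic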

Step 2: Create the Zones

If you don't already have a file system for your zones:

bleonard@solaris:~$ sudo zfs create -o mountpoint=/zones rpool/zones

Then create the ILB zones:

bleonard@solaris:~$ sudo zonecfg -z ilb-zone
ilb-zone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:ilb-zone> create
zonecfg:ilb-zone> set zonepath=/zones/ilb-zone
zonecfg:ilb-zone> set ip-type=exclusive
zonecfg:ilb-zone> add net
zonecfg:ilb-zone:net> set physical=ilb0
zonecfg:ilb-zone:net> end
zonecfg:ilb-zone> add net
zonecfg:ilb-zone:net> set physical=ilb1
zonecfg:ilb-zone:net> end
zonecfg:ilb-zone> verify
zonecfg:ilb-zone> exit 

And the server zones (repeat this step for server 2 - changing values where appropriate):

bleonard@solaris:~$ sudo zonecfg -z server1-zone
server1-zone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:server1-zone> create
zonecfg:server1-zone> set zonepath=/zones/server1-zone
zonecfg:server1-zone> set ip-type=exclusive
zonecfg:server1-zone> add net
zonecfg:server1-zone:net> set physical=server1
zonecfg:server1-zone:net> end
zonecfg:server1-zone> verify
zonecfg:server1-zone> exit

Step 3: Install the ILB Zone

Then install the ilb-zone (wait to install the server zones as we will just clone this zone):

bleonard@solaris:~$ sudo zoneadm -z ilb-zone install
A ZFS file system has been created for this zone.
   Publisher: Using solaris (http://pkg.oracle.com/solaris/release/ ).
       Image: Preparing at /zones/ilb-zone/root.
       Cache: Using /var/pkg/download.
Sanity Check: Looking for 'entire' incorporation.
  Installing: Core System (output follows)
------------------------------------------------------------
Package: pkg://solaris/consolidation/osnet/osnet-incorporation@0.5.11,5.11-0.151.0.1:20101104T230646Z
License: usr/src/pkg/license_files/lic_OTN

Oracle Technology Network Developer License Agreement

...

               Packages to install:     1
           Create boot environment:    No
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1         1/1      0.0/0.0

PHASE                                        ACTIONS
Install Phase                                  11/11

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2 
               Packages to install:    45
           Create boot environment:    No
               Services to restart:     3
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                45/45 12511/12511    89.1/89.1

PHASE                                        ACTIONS
Install Phase                            17953/17953 

PHASE                                          ITEMS
Package State Update Phase                     45/45 
Image State Update Phase                         2/2 
  Installing: Additional Packages (output follows)
               Packages to install:    46
           Create boot environment:    No
               Services to restart:     2
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                46/46   4498/4498    26.5/26.5

PHASE                                        ACTIONS
Install Phase                              6139/6139 

PHASE                                          ITEMS
Package State Update Phase                     46/46 
Image State Update Phase                         2/2 

        Note: Man pages can be obtained by installing SUNWman
 Postinstall: Copying SMF seed repository ... done.
 Postinstall: Applying workarounds.
        Done: Installation completed in 499.617 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
              to complete the configuration process.

I will be using the following sysidcfg file to automate the zone's system configuration. Adjust your values accordingly. The root password is "abc123":

bleonard@solaris:~$ sudo cat /zones/ilb-zone/root/etc/sysidcfg 
system_locale=C
terminal=xterms
network_interface=ilb0 {
    primary
    hostname=ilb-ext
    ip_address=10.0.2.21
    netmask=255.255.255.0
    default_route=10.0.2.2
    protocol_ipv6=no}
network_interface=ilb1 {
    hostname=ilb-int
    ip_address=192.168.1.21
    default_route=NONE
    netmask=255.255.255.0
    protocol_ipv6=no}
security_policy=none
name_service=NONE
nfs4_domain=dynamic
timezone=US/Eastern
root_password=fto/dU8MKwQRI

Boot and log into the zone:

bleonard@solaris:~$ sudo zoneadm -z ilb-zone boot
bleonard@solaris:~$ sudo zlogin -C ilb-zone
[Connected to zone 'ilb-zone' console]
100/100
Hostname: ilb-zone
Loading smf(5) service descriptions: 3/3
 network_interface=ilb0 {
ilb0 is not a valid network interface  line 3 position 19
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses: ilb0 ilb1.

ilb-ext console login: root
Password: abc123
Jul  1 10:54:37 ilb-ext login: ROOT LOGIN /dev/console
Oracle Corporation      SunOS 5.11      snv_151a        November 2010
root@ilb-ext:~# 

Since our ilb-zone has 2 network interfaces, we also want to make sure a packet arriving on one network interface and addressed to a host on a different network is forwarded to the appropriate interface.

root@ilb-ext:~# svcadm enable ipv4-forwarding
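
To confirm forwarding is active, check that the service reports online (output will look something like this):

root@ilb-ext:~# svcs ipv4-forwarding
STATE          STIME    FMRI
online         10:55:03 svc:/network/ipv4-forwarding:default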

Step 4: Install the Server 1 Zone

We'll create the first server zone as a clone of the ilb-zone. We'll then configure the server 1 zone and clone it to server 2.

Shut down ilb-zone so that it can be cloned:

bleonard@solaris:~$ sudo zoneadm -z ilb-zone halt

Then clone ilb-zone:

bleonard@solaris:~$ sudo zoneadm -z server1-zone clone ilb-zone

Here's a sysidcfg file to use with server1-zone:

bleonard@solaris:~$ sudo cat /zones/server1-zone/root/etc/sysidcfg
system_locale=C
terminal=xterms
network_interface=PRIMARY {
	hostname=server1-zone
	ip_address=192.168.1.50
	netmask=255.255.255.0
	default_route=none
	protocol_ipv6=no}
security_policy=none
name_service=NONE
nfs4_domain=dynamic
timezone=US/Eastern
root_password=fto/dU8MKwQRI

Then boot and log in to server1-zone:

bleonard@solaris:~$ sudo zoneadm -z server1-zone boot
Password: 
bleonard@solaris:~$ sudo zlogin -C server1-zone
[Connected to zone 'server1-zone' console]
Hostname: server1-zone
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses: server1.

server1-zone console login: root
Password: abc123
Jul  1 14:53:20 server1-zone login: ROOT LOGIN /dev/console
Last login: Fri Jul  1 13:54:37 on console
Oracle Corporation      SunOS 5.11      snv_151a        November 2010

Also boot back up the ilb zone:

bleonard@solaris:~$ sudo zoneadm -z ilb-zone boot
Password: 

Step 5: Configure Internet Access

Test if you can ping the outside world from within the ilb zone:

root@ilb-ext:~# ping www.oracle.com
ping: unknown host www.oracle.com

Open another terminal window; it should put you in the global zone. Copy the /etc/resolv.conf and /etc/nsswitch.conf files from the global zone to ilb-zone and server1-zone:

bleonard@solaris:~$ sudo cp /etc/resolv.conf /zones/ilb-zone/root/etc/.
Password: 
bleonard@solaris:~$ sudo cp /etc/nsswitch.conf /zones/ilb-zone/root/etc/.
bleonard@solaris:~$ sudo cp /etc/resolv.conf /zones/server1-zone/root/etc/.
bleonard@solaris:~$ sudo cp /etc/nsswitch.conf /zones/server1-zone/root/etc/. 

Return to the ilb-zone. You should now be able to reach the outside world:

root@ilb-ext:~# ping www.oracle.com
www.oracle.com is alive 

However, server1-zone needs some routing set up before it can reach out as it will route its traffic through the ilb-zone:

root@server1-zone:~# route -p add  default 192.168.1.21
add net default: gateway 192.168.1.21
add persistent net default: gateway 192.168.1.21

root@server1-zone:~# ping www.oracle.com
www.oracle.com is alive 

Step 6: Install Tomcat

Apache Tomcat will be the service we load balance to:

root@server1-zone:~# pkg install tomcat tomcat-examples runtime/java
               Packages to install:     3
           Create boot environment:    No
               Services to restart:     2
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  3/3   1166/1166    38.9/38.9

PHASE                                        ACTIONS
Install Phase                              1504/1504 

PHASE                                          ITEMS
Package State Update Phase                       3/3 
Image State Update Phase                         2/2 
Loading smf(5) service descriptions: 1/1

root@server1-zone:~# svcadm enable http:tomcat6
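
Tomcat listens on port 8080 by default. Before moving on you can confirm the service came online:

root@server1-zone:~# svcs http:tomcat6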

Step 7: Configure Routing to the Server Zone

From the global zone we need to be able to reach the server. Add the following route (the -p option makes the changes persistent across network restarts):

bleonard@solaris:~$ sudo route -p add 192.168.1.0 10.0.2.21
Password: 
add net 192.168.1.0: gateway 10.0.2.21
add persistent net 192.168.1.0: gateway 10.0.2.21

And now you should be able to reach Tomcat from the global zone (or any client on that subnet):
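
If you'd rather check from the command line than a browser (this assumes curl is installed in the global zone), fetching the headers of the Tomcat welcome page should return a 200:

bleonard@solaris:~$ curl -sI http://192.168.1.50:8080/ | head -1
HTTP/1.1 200 OK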


Step 8: Cloning the Tomcat Server Zone

Now that we have the Tomcat server running in a zone, we can quickly create another instance. First, we need to shut down the server1-zone:

bleonard@solaris:~$ sudo zoneadm -z server1-zone halt

Then clone it:

bleonard@solaris:~$ sudo zoneadm -z server2-zone clone server1-zone

Copy the sysidcfg file you created for server 1 to server 2:

bleonard@solaris:~$ sudo cp /zones/server1-zone/root/etc/sysidcfg /zones/server2-zone/root/etc/sysidcfg

Then change the hostname and ip_address. This time around we'll also set the default router. Once editing is complete, the file should look as follows:

bleonard@solaris:~$ sudo cat /zones/server2-zone/root/etc/sysidcfg
system_locale=C
terminal=xterms
network_interface=PRIMARY {
	hostname=server2-zone
	ip_address=192.168.1.60
	netmask=255.255.255.0
	default_route=192.168.1.21
	protocol_ipv6=no}
security_policy=none
name_service=NONE
nfs4_domain=dynamic
timezone=US/Eastern
root_password=fto/dU8MKwQRI

Then boot and log into the server 2 zone:

bleonard@solaris:~$ sudo zoneadm -z server2-zone boot
bleonard@solaris:~$ sudo zlogin -C server2-zone
[Connected to zone 'server2-zone' console]
Hostname: server2-zone
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses: server2.

server2-zone console login: root
Password: abc123
Jul  6 10:39:54 server2-zone login: ROOT LOGIN /dev/console
Last login: Fri Jul  1 16:19:45 on console
Oracle Corporation      SunOS 5.11      snv_151a        November 2010
root@server2-zone:~# 

Not only does this 2nd zone install much, much quicker, but Tomcat is already up and running. It really gives you a feel for how easy it can be to scale using Solaris in a cloud-type environment.

Don't forget to boot server1-zone:

bleonard@solaris:~$ sudo zoneadm -z server1-zone boot

Step 9: Configure Load Balancing

OK, that was a lot of setup just to get to the point of this blog. But now that we have two servers running at two IP addresses, let's set up a load balancer to balance traffic across them.

In the ILB zone, install the ILB:

root@ilb-ext:~# pkg install ilb
               Packages to install:     1
           Create boot environment:    No
               Services to restart:     1
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1       11/11      0.2/0.2

PHASE                                        ACTIONS
Install Phase                                  38/38 

PHASE                                          ITEMS
Package State Update Phase                       1/1 
Image State Update Phase                         2/2 
Loading smf(5) service descriptions: 1/1

Then enable the ILB service:

root@ilb-ext:~# svcadm enable ilb

Then define a server group:

root@ilb-ext:~# ilbadm create-servergroup -s servers=192.168.1.50:8080,192.168.1.60:8080 tomcatgroup
root@ilb-ext:~# ilbadm show-servergroup 
SGNAME         SERVERID            MINPORT MAXPORT IP_ADDRESS
tomcatgroup    _tomcatgroup.0      8080    8080    192.168.1.50
tomcatgroup    _tomcatgroup.1      8080    8080    192.168.1.60

Then define a load balancing rule. This is the most complicated part of the process. For starters, I'll try to keep the rule as simple as possible. The rule is enabled (-e), will persist (-p), incoming packets (-i) are matched against destination virtual IP address (vip) and port 10.0.2.20:80. The packet is handled (-m) using round robin (rr). The destination for the packets (-o) is server group tomcatgroup. The rule is called tomcatrule_rr.

root@ilb-ext:~# ilbadm create-rule -e -p -i vip=10.0.2.20,port=80 -m lbalg=rr,type=HALF-NAT -o servergroup=tomcatgroup tomcatrule_rr

You can view the rule as follows:

root@ilb-ext:~# ilbadm show-rule
RULENAME            STATUS LBALG       TYPE    PROTOCOL VIP         PORT
tomcatrule_rr       E      roundrobin  HALF-NAT TCP 10.0.2.20       80
root@ilb-ext:~# ilbadm show-rule -f
       RULENAME: tomcatrule_rr
         STATUS: E
           PORT: 80
       PROTOCOL: TCP
          LBALG: roundrobin
           TYPE: HALF-NAT
      PROXY-SRC: --
          PMASK: /32
        HC-NAME: --
        HC-PORT: --
     CONN-DRAIN: 0
    NAT-TIMEOUT: 120
PERSIST-TIMEOUT: 60
    SERVERGROUP: tomcatgroup
            VIP: 10.0.2.20
        SERVERS: _tomcatgroup.0,_tomcatgroup.1

Finally, we need to tell the outside world that packets destined for our VIP, 10.0.2.20, should be sent to ilb0. First, find the MAC address of ilb0:

root@ilb-ext:~# dladm show-vnic ilb0
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
ilb0         ?            1000   2:8:20:bf:2a:d9   random              0

Then publish a permanent ARP entry mapping the VIP to ilb0's MAC address:

root@ilb-ext:~# arp -s 10.0.2.20 2:8:20:bf:2a:d9 pub permanent

You should then be able to ping the VIP:

root@ilb-ext:~# ping 10.0.2.20
10.0.2.20 is alive

Step 10: Load Balance!

You can now point your browser to the virtual IP address and get a result back from one of the Tomcat servers:

Very cool! But from which server was I served? I modified the example snoop.jsp to return the server's hostname and IP Address. Save the snoop.jsp to the /var/tomcat6/webapps/examples/jsp/snp directory in each of your zones.

bleonard@solaris:~$ sudo cp Downloads/snoop.jsp /zones/server1-zone/root/var/tomcat6/webapps/examples/jsp/snp/.
bleonard@solaris:~$ sudo cp Downloads/snoop.jsp /zones/server2-zone/root/var/tomcat6/webapps/examples/jsp/snp/.

I've appended the Server Side IP Address section to the bottom of the page, http://10.0.2.20/examples/jsp/snp/snoop.jsp:


Step 11: Health Checks

To keep things simple on the first go-around, I avoided health checks. However, it's pointless to have a load balancer that continues to feed requests to a dead server.

Health check options include ping probes, TCP probes, UDP probes and a user-defined script. Since I'm concerned about the health of Tomcat, I've created a simple script:

root@ilb-ext:~# cat /var/hc-tomcat 
#!/bin/bash
# $2 is the server's IP address, supplied by the ilbd daemon.
result=`curl -s http://$2:8080`
# Tomcat's welcome page begins with "<meta"; print 0 for healthy, -1 for dead.
# Quoting the substring avoids a test syntax error if curl returns nothing.
if [ "${result:0:5}" = "<meta" ]; then
        echo 0
else
        echo -1
fi

The load balancer provides the following variables to use with your script, of which I'm only using $2:

$1 - VIP (literal IPv4 or IPv6 address)
$2 - Server IP (literal IPv4 or IPv6 address)
$3 - Protocol (UDP, TCP as a string)
$4 - Numeric port range (the user-specified value for hc-port)
$5 - maximum time (in seconds) that the test should wait before returning a failure. If the test runs beyond the specified time, it might be stopped, and the test would be considered failed. This value is user-defined and specified in hc-timeout.

Ensure the script has execute permissions (the ilbd daemon, which runs the health check, does not run as root):

root@ilb-ext:~# chmod +x /var/hc-tomcat

Giving the script a quick test:

root@ilb-ext:~# /var/hc-tomcat n/a 192.168.1.50
0

You then create a health check rule as follows:

root@ilb-ext:~# ilbadm create-healthcheck -h hc-test=/var/hc-tomcat,hc-timeout=2,hc-count=1,hc-interval=10 hc-tomcat

The hc-timeout is how many seconds the health check will wait for a response before giving up. The hc-count is how many times the script will attempt to succeed before claiming the server to be dead. The hc-interval is how often the health-check is performed.

Once created you can view the configured health-checks as follows:

root@ilb-ext:~# ilbadm show-hc
HCNAME        TIMEOUT COUNT   INTERVAL DEF_PING TEST
hc-tomcat     2       1       10       Y        /var/hc-tomcat

Now that we have a health check, we need to add it to our load balancing rule. Unfortunately, ilbadm doesn't have a command to modify an existing load balancing rule, so we have to delete it and create it again:

root@ilb-ext:~# ilbadm delete-rule tomcatrule_rr

We'll create the same rule as before, this time including the health check:

root@ilb-ext:~# ilbadm create-rule -e -p -i vip=10.0.2.20,port=80 -m lbalg=rr,type=HALF-NAT -h hc-name=hc-tomcat -o servergroup=tomcatgroup tomcatrule_rr

Once the new rule is created, the health check goes into effect. You can see the status as follows:

root@ilb-ext:~# ilbadm show-hc-result
RULENAME      HCNAME        SERVERID      STATUS   FAIL LAST     NEXT     RTT
tomcatrule_rr hc-tomcat     _tomcatgroup.0 alive   0    10:39:51 10:40:02 2509
tomcatrule_rr hc-tomcat     _tomcatgroup.1 alive   0    10:39:57 10:40:09 1869

So now, if snoop.jsp is showing that you're hitting Server 1 and we then disable Tomcat on Server 1:

root@server1-zone:~# svcadm disable tomcat6

When you refresh your browser, you will be directed to Server 2. Of course, any state you may have been maintaining on Server 1 will be lost. You can also see the status as dead using ilbadm show-hc-result:

root@ilb-ext:~# ilbadm show-hc-result
RULENAME      HCNAME        SERVERID      STATUS   FAIL LAST     NEXT     RTT
tomcatrule_rr hc-tomcat     _tomcatgroup.0 dead    4    10:43:36 10:43:45 1102
tomcatrule_rr hc-tomcat     _tomcatgroup.1 alive   0    10:43:42 10:43:53 5919
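
If you then re-enable Tomcat on Server 1, the health check should report it alive again within an interval or two:

root@server1-zone:~# svcadm enable tomcat6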

See Administering Health Checks in ILB for the official documentation.

That was a fair amount of work to configure this environment. Would it be worth providing a VM pre-configured with load balancing for download?

Wednesday Jul 06, 2011

Two Part Series on Live Upgrade

What prompted me to write my previous entry, recommended reading, was Bob Netherton's recently published 2-part series on Live Upgrade. Part 1 covers common problems, while part 2 introduces survival tips.

Bob plans to update these entries as the topic evolves. However, he may also convert them over to a wiki if that makes more sense. Let him know.
