DTrace improvements in Oracle Solaris 11.2

There have been a few improvements to DTrace in Solaris 11.2.

llquantize()

DTrace has quantize() and lquantize() aggregating actions to give you, respectively, a power-of-two distribution and a linear distribution of data points that you're interested in.  While these are both useful, there may be instances in which you want to examine events whose latencies span multiple orders of magnitude, but for which you want relatively fine-grained information about the data within each order of magnitude.  With quantize(), there's likely to be insufficient detail. You could use lquantize(), but you'd need multiple aggregations to cover the multiple orders of magnitude.

In 11.2, we have added a log-linear quantize aggregation, llquantize().  This aggregating action allows you to specify a base and a range of exponents for the data, but it also allows you to specify a number of steps (or buckets, if you will) per order of magnitude.  For example, the following line will create an aggregation covering the values from 10^3 through 10^6 - 1 with 20 steps per order of magnitude.  (The buckets for 10^5 will include the values from 10^5 through 10^6 - 1.):

@ = llquantize(foo, 10, 3, 5, 20);

We can use this in a script to examine system call latencies:

        syscall:::entry
        {
                self->ts = timestamp;
        }

        syscall:::return
        / self->ts /
        {
                @ = llquantize(timestamp - self->ts, 10, 3, 5, 10);
                self->ts = 0;
        }

Because the timestamp is measured in nanoseconds, this script reports system call latencies in the microseconds range.  Here's sample output from this script:

           value  ------------- Distribution ------------- count
          < 1000 |@@@@@@                                   12899
            1000 |@@@@@@@@@@@@                             26357
            2000 |@                                        3202
            3000 |@                                        1869
            4000 |@                                        2110
            5000 |@@                                       4716
            6000 |@@                                       3998
            7000 |@                                        1617
            8000 |@@                                       4924
            9000 |@                                        2515
           10000 |@@@@@@@                                  15307
           20000 |@                                        2240
           30000 |@                                        1327
           40000 |@                                        1369
           50000 |                                         990
           60000 |                                         1057
           70000 |                                         631
           80000 |                                         453
           90000 |                                         434
          100000 |@                                        1570
          200000 |                                         228
          300000 |                                         45
          400000 |                                         59
          500000 |                                         60
          600000 |                                         52
          700000 |                                         30
          800000 |                                         22
          900000 |                                         17
      >= 1000000 |                                         513



Scalability

When DTrace was first conceived, a machine with 100 CPUs was large.  Now, the largest machines contain over 1,000 CPUs, and the original DTrace architecture is starting to show its age.  dtrace(1M) (or specifically, libdtrace(3LIB)) was originally written to process data from all of a server's CPUs in a single thread.  Unfortunately, a single thread can no longer keep up on newer, larger servers.

In Solaris 11.2, we've modified libdtrace to perform this task with multiple threads.  On x86 servers, dtrace(1M) will use one thread per 8 CPUs.  On SPARC servers, it will use one thread per 16 CPUs.  We've also included an option, nworkers, to allow you to request a specific number of threads.

What does this mean for you?  The main benefit of doing this is that dtrace(1M) will actually be able to keep up on larger servers.  During testing on a 256-CPU system, generating 6,000 records per second per CPU, we were seeing hundreds of aggregation drops per second per CPU without this multi-threading framework.  Running the same test with a multi-threaded dtrace(1M), we saw no aggregation drops.

errexit option

Another minor enhancement in Solaris 11.2 is the errexit option.  This option causes dtrace(1M) to exit when it first hits an error.

As a trivial example, consider the following script:

tick-1s
{
        this->i = 0;
        this->j = 5 / this->i;
}

If run normally, this script would run until terminated, reporting an error once per second:

# dtrace -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2
^C

#

When run with the errexit option, the script terminates after the first error:

# dtrace -x errexit -q -s divide-by-zero.d
dtrace: error on enabled probe ID 1 (ID 5148: profile:::tick-1s): divide-by-zero in action #2 at DIF offset 2

#

tracemem() enhancements

We've added an optional third argument to tracemem().  Where the second argument specifies how many bytes of memory to trace, the optional third argument specifies how many bytes to display.  This is useful when the size of the data you care about is variable: the DTrace architecture requires that you trace a constant amount of data, so previously you would trace enough to capture everything you're interested in and ignore the rest.  The optional third argument lets dtrace(1M) ignore the garbage for you.

As an example, consider tracing the beginning of an SSH connection, as seen from the server side.  (This is a simplified example, so we'll just look at what the server writes.):

syscall::write:entry
/execname == "sshd"/
{
        tracemem(copyin(arg1, arg2), 1024, arg2);
}

While running this script and opening a connection from a remote server, we see this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a                                      2.2.

   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 25 54 00 00                                      %T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d7 82 01 00 25 54 00 00              .9.S....%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 49 84 01 00 25 54 00 00              .9.SI...%T..

   1   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 90 39 c4 53 d2 84 01 00 25 54 00 00              .9.S....%T..

[ ... ]

This is certainly much better than seeing this:

 CPU     ID                    FUNCTION:NAME
   0   5900                      write:entry
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 53 53 48 2d 32 2e 30 2d 53 75 6e 5f 53 53 48 5f  SSH-2.0-Sun_SSH_
        10: 32 2e 32 0a 00 00 00 00 18 00 00 00 00 00 00 00  2.2.............
        20: 98 f6 d3 08 25 1f d3 08 01 21 d3 08 00 00 00 00  ....%....!......
        30: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        40: c0 23 d2 08 0d 21 d3 08 10 00 00 00 01 00 00 00  .#...!..........
        50: 9d 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        60: c0 24 d2 08 25 1f d3 08 0d 21 d3 08 00 00 00 00  .$..%....!......
        70: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        80: 80 25 d2 08 01 00 00 00 02 00 00 00 03 00 00 00  .%..............
        90: ff ff ff ff 04 00 00 00 18 00 00 00 00 00 00 00  ................
        a0: a0 22 d2 08 2e 21 d3 08 0e 00 00 00 01 00 00 00  ."...!..........
        b0: ab 11 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
        c0: 00 23 d2 08 2e 21 d3 08 00 00 00 00 25 1f d3 08  .#...!......%...
        d0: 00 00 00 00 01 00 00 00 18 00 00 00 00 00 00 00  ................
        e0: b8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       100: d8 f6 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ ... ]
       3d0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................
       3e0: b8 f9 d3 08 00 00 00 00 00 00 00 00 00 00 00 00  ................
       3f0: 00 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00  ................

tracemem() output consistency

Another enhancement we've made to tracemem() changes its behavior for certain traced memory sizes.  Previously, when tracing 1, 2, 4, or 8 bytes, the traced memory was treated as a signed decimal integer.  When tracing any other amount up to 32 bytes, the traced memory was treated as a string if the buffer contained only printable ASCII characters.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'
47
# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'
1920169263
# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'
/usr/sbin/init
# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 84 2b ab fb ff ff ff ff  .........+......
        30: ac 2e ab fb ff ff ff ff ac 31 ab fb ff ff ff ff  .........1......

#

We've modified this behavior to be consistent across traced memory sizes.  For example:

# dtrace -qn 'BEGIN {tracemem(&`initname, 1); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f                                               /

# dtrace -qn 'BEGIN {tracemem(&`initname, 4); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72                                      /usr

# dtrace -qn 'BEGIN {tracemem(&`initname, 32); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

# dtrace -qn 'BEGIN {tracemem(&`initname, 64); exit(0)}'

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 2f 75 73 72 2f 73 62 69 6e 2f 69 6e 69 74 00 00  /usr/sbin/init..
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 80 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 88 32 b6 fb ff ff ff ff 00 36 b6 fb ff ff ff ff  .2.......6......

#

Structure alignment

DTrace has had a longstanding issue with how it calculates alignment for structures, unions, and bit-fields.  At best, this was an annoyance, requiring users to manually add padding to their scripts to get correct behavior; at worst, it reported wrong information from operators like sizeof() or produced alignment errors where there should have been none.  We have made several modifications so that structures, unions, and bit-fields are now ABI compliant.

In the case of bit-fields, depending on where the boundaries between fields fell, some bits would be lost or placed where they shouldn't be.  For example, a DTrace script written as:

union u1tag
{
        unsigned char c;
        struct
        {
                unsigned int b1:1;
                unsigned int b2:7;
        } s1;
} u1;

BEGIN
{
        u1.c = 255;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        u1.c = 0;
        u1.s1.b1 = 1;
        u1.s1.b2 = 127;
        printf ("%d %d %d\n", u1.c, u1.s1.b1, u1.s1.b2);
        exit(0);
}

Would produce a result of:

255 1 0
128 1 127

If you were to write the same thing in C, you would see:

255 1 127
255 1 127

For structs and unions, the problem is more pronounced.  Suppose we have a DTrace script:

typedef struct _my_data {
        uint32_t a;
        uint64_t c;
} my_data;

typedef struct _more_data {
        uint32_t x;
        my_data y;
} more_data;

more_data a;

BEGIN {
        a.y.c = 30;
        printf("%lu\n", a.y.c);
}

When we run it, DTrace reports an invalid alignment for this otherwise valid structure:

# dtrace -64 -s /tmp/test.d
dtrace: script '/tmp/test.d' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid alignment
(0xffffff08bf603a0c) in action #1 at DIF offset 24

In the past, the workaround was to add a padding member so that struct _more_data looked like this:

typedef struct _more_data {
        uint32_t x;
        uint32_t padding;
        my_data y;
} more_data;

Without the padding, not only did alignment issues produce runtime errors, but we also risked reporting wrong data for the members of the struct.  It also affected the size reported by sizeof().

For instance, suppose we have two structs:

struct s1 {
        int x;          /* 4 bytes */
        short a;        /* 2 bytes */
};

struct s2 {
        struct s1 b;    /* 6 bytes */
        short c;        /* 2 bytes */
        int d;          /* 4 bytes */
};

In this case, sizeof(s1) would report a size of 6, and sizeof(s2) would report a size of 12 instead of 8 and 16 respectively.

One last alignment issue addressed in this release concerns the differing memory alignment restrictions of the SPARC and x86 architectures.  In the past, DTrace forced 4-byte alignment regardless of the architecture a script ran on.  With 11.2, alignment is based on the platform: SPARC still requires 4-byte alignment for memory accesses, but x86 is more lenient and imposes no such restriction.

Summing up: by ensuring ABI compliance, each structure now matches its C counterpart, so when you copyin() a struct from your program and rely on the structs defined in a header file or your own D scripts, you can rest assured the data is reliable.
