Wednesday Nov 16, 2005

Some thoughts on ZFS's impact on Solaris

So ZFS is now available and we've put together lots of blogs and demos to show everyone the neat kinds of things ZFS supports - snapshots, writable snapshots (clones), simple disk management, protection against hardware and firmware errors, etc. Rather than discuss some other neat feature of ZFS or do some extreme performance demos, I thought it might be interesting to mull over some of the possible implications of this new technology for the rest of Solaris and other applications. We don't support booting from ZFS quite yet, so some of the ideas below will have to wait a bit for implementation - but it's certainly time to start thinking about them.

First of all, in years past we've been moving more and more to the "one giant filesystem" model for installing systems; it's just been too much of a PITA to anticipate how much space we might need for root, var, opt, usr, etc.... but now that ZFS divorces space allocation from filesystem boundaries, it's easy to use separate filesystems to make administration easier. It's OK to use lots of filesystems - they're essentially free, and since snapshots are taken at the filesystem level, filesystems are also the "undo" boundary. Clearly, we don't want the delivery of mail or the growth of logs to keep us from rolling back an ill-considered change to other parts of the system, so separating directories that contain files that are modified "automatically" from those containing binaries or configuration files seems like a good idea.
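For instance, carving a system up into many filesystems is just a handful of commands, and they all draw from the same pool of space - there's nothing to size in advance. A rough sketch (the pool, device, and dataset names here are just placeholders):

# zpool create tank c0t1d0
# zfs create tank/var
# zfs create tank/var/log
# zfs create tank/var/mail
# zfs create tank/opt

Each of those can then be snapshotted and rolled back independently of the others.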

Once this separation occurs, using zfs rollback to undo the effects of changes such as an ill-considered patch or administrative action becomes simple. It also appears that Live Upgrade could be a lot easier with ZFS - just snapshot and clone the filesystems being upgraded. Perhaps we should always take snapshots before adding a patch... hmmm, given that we have ZFS, how would we redesign patching if we could always use ZFS for root filesystems?
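A sketch of what that might look like (the dataset name and patch ID are placeholders):

# zfs snapshot tank/usr@prepatch
# patchadd 123456-01
... testing shows the patch was a mistake ...
# zfs rollback tank/usr@prepatch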

Another area of possible change is initial installation. Right now we use bzip2 compression on each package on the install media to squeeze our packages into the smallest possible footprint; this has been needed in order to fit localization information onto the first CD. Since we access the CD/DVD as a filesystem and then uncompress each package as it's being installed, we often have trouble reading data fast enough from the install media to keep the device spinning at anywhere near full speed, esp. for small packages. With ZFS compression, we could store a ZFS filesystem image on the CD/DVD and stream it onto a ZFS filesystem on the hard disk in one (fast) shot, and then install packages from the hard disk as needed. Afterwards the extra packages could be deleted, or left there to facilitate later installation of additional components; disk space is amazingly cheap these days.

Another area of interest is providing backups. ZFS makes it very easy to determine the differences between snapshots; this makes incremental updates of even very large, slightly modified filesystems inexpensive in terms of disk and CPU utilization. It sure would be nice if my Ferrari laptop could just upload its changes since the last backup automatically when I plugged it into the building nets; yes, I could do that today, but without the smarts of ZFS, traversing a 50GB data set to see what has changed just isn't very practical on the slow laptop drives.
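Something like the following is what I have in mind - a sketch only, with placeholder dataset, snapshot, and host names (and the exact send/receive spelling differs in early ZFS builds):

# zfs snapshot tank/home@today
# zfs send -i tank/home@lastbackup tank/home@today | \
        ssh backuphost zfs receive tank/home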

There are more ideas to consider, but it's clear that today is just the beginning of the ZFS revolution; ZFS will change the way Solaris works.

Tuesday Jun 14, 2005

Doing the Jitter Bug


One of the nice things about finally having OpenSolaris properly launched is that I can share Solaris source code with everyone when talking about Solaris. Since Solaris is the code, not being able to show people the code has been like having a discussion about various aspects of the flavors of wine without actually tasting any - possible perhaps, but a rather dry (heh) and uninteresting narrative....


I was happily (as I don't get to write nearly enough code) whacking away on a revised version of malloc to fix some longstanding performance and scalability issues when one of our marketing folks darted in asking about telcos, real time, and latency bubbles [I'll write about our malloc findings later on]. After getting the relevant email trail and a test program from the customer contact, it turned out that the customer was indeed seeing a real problem - a 20 millisecond timer was occasionally firing up to 60 microseconds late on his Opteron box, but no problem was seen on SPARC machines. What's weirder, this problem appeared sometimes, but not always - any given instance of the test program might display the problem, or it might not.  Hmmm.  The customer's test program would carefully record intervals between timer firings that exceeded 20 msecs + 40 usecs and report them every 20 seconds or so... and when the problem appeared, it was happening once a second...

Like most performance problems I run into, this looked like a job for DTrace.  As Bryan has pointed out many times, DTrace is best at quickly evaluating various hypotheses...so I needed some testable ideas as to what might be going on. Well, what could make the program's timers fire late?

  • Other threads or interrupts - perhaps these are interfering with our program somehow?
  • Other timers firing on this CPU.  Maybe some other system timers are causing problems on CPU 1?

Well, hypothesis testing time.  First of all, the test program clearly created a processor set containing CPU 1 and placed itself as the only thread in that set.  It even disabled interrupts on that CPU.  A quick check with the following script:

sched:::on-cpu
/curcpu->cpu_id == 1/
{
        trace(curthread);
}
verified that there was only a single thread (the expected one) running on CPU 1.  Hmmm.  intrstat verified that CPU 1 wasn't getting any device interrupts, either.  So what's happening here? I had some experience with timekeeping on Solaris, but not enough to know offhand where to look for this problem.  Confused?  DTrace will help!  A quick change to the above script lets us use the enqueue probe in the sched provider to discover exactly how our process is made runnable when the timer expires:

sched:::enqueue
/curcpu->cpu_id == 1/
{
        stack(10);
}
The output is pretty straightforward - lots of entries that look like this:
 1   2561                setbackdq:enqueue
              TS`ts_setrun+0xf1
              genunix`cv_unsleep+0x6e
              genunix`setrun_locked+0x6d
              genunix`eat_signal+0x89
              genunix`sigtoproc+0x420
              genunix`sigaddqa+0x4a
              genunix`timer_fire+0xb9
              genunix`clock_highres_fire+0x3f
              genunix`cyclic_softint+0xb2
              unix`cbe_low_level+0x14
This gave me a place to get started.... timer_fire is clearly important, so let's instrument that and see when it fires:

#pragma D option quiet
BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::timer_fire:entry
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start)/1000,
            (int) (((timestamp - last) - 20000000))/1000);
        last = timestamp;
}

This script just prints a line with the time in microseconds since tracing started, along with the number of microseconds by which the interval exceeded the desired 20 msecs.

Interestingly enough, the output looks like this:
13822657: interval is 0 usecs late
13842656: interval is 0 usecs late
13862657: interval is 0 usecs late
13882657: interval is 0 usecs late
13902657: interval is 0 usecs late
13922599: interval is -57 usecs late
13942657: interval is 57 usecs late
13962657: interval is 0 usecs late
13982657: interval is 0 usecs late
14002657: interval is 0 usecs late

Whoops!  Turns out our test program was leading us down the garden path; the problem is that one timer firing appears to be early, and the next one makes up the difference; the net error is 0 over time. Reading the comments in cyclic.c, this isn't that surprising, but it does give us a place to start looking. At this point, I noticed that I could suspend the test program and still get the same intermittent late behavior. How often are we late?
#pragma D option quiet
BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::timer_fire:entry
/(int) (((timestamp - last) - 20000000))/1000 != 0/
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start)/1000,
            (int) (((timestamp - last) - 20000000))/1000);
}

fbt::timer_fire:entry
{
        last = timestamp;
}

5974796: interval is -57 usecs late
5994854: interval is 56 usecs late
6974796: interval is -57 usecs late
6994854: interval is 56 usecs late

Interesting - once a second.

Reading the cyclic.c source code, we see that cyclic_fire handles all the interesting timer stuff.  Let's trace cyclic_fire in addition to timer_fire:
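The script is essentially the previous one with a clause for cyclic_fire added - something like this (a minimal sketch; the exact wording of the cyclic_fire output line is approximate):

#pragma D option quiet
BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::cyclic_fire:entry
/curcpu->cpu_id == 1/
{
        /* note each time the cyclic subsystem's fire handler runs on CPU 1 */
        printf("%d: cyclic_fire\n", (timestamp - start)/1000);
}

fbt::timer_fire:entry
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start)/1000,
            (int) (((timestamp - last) - 20000000))/1000);
        last = timestamp;
}

The combined output looks like this: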
2248600: interval is 0 usecs late
2263098: cyclic_fire
2268539: cyclic_fire
2268543: interval is -57 usecs late
2288597: cyclic_fire
2288600: interval is 57 usecs late

So did we misprogram the clock, or is something else going on? On line 929 of cyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain. Dropping into mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba
s1394 nca fctl lofs nfs audiosup random sppp ipc ptm logindmux ]
> ::cpuinfo
ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH   THREAD           PROC
 0 fffffffffbc1ede0  1b    3    0  59   no    no t-0      ffffffff8584e220 mdb
 1 ffffffff825cb800   b    0    0  -1   no    no t-516001 fffffe8000275c80 (idle)
> ffffffff825cb800::print cpu_t cpu_cyclic
cpu_cyclic = 0xffffffff80f45600
> 0xffffffff80f45600::print cyc_cpu_t cyp_backend
cyp_backend = 0xffffffff803d65a8
> 0xffffffff803d65a8::print cyc_backend_t
{
cyb_configure = cbe_configure
cyb_unconfigure = 0
cyb_enable = cbe_enable
cyb_disable = cbe_disable
cyb_reprogram = cbe_reprogram
cyb_softint = cbe_softint
cyb_set_level = cbe_set_level
cyb_restore_level = cbe_restore_level
cyb_xcall = cbe_xcall
cyb_suspend = 0
cyb_resume = 0
cyb_arg = 0xffffffff825cb800
}
>

So we need to trace calls to cbe_reprogram:
#pragma D option quiet
BEGIN
{
        start = timestamp;
        last = timestamp;
}

cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
        firetime = (arg1 - start)/1000;
        printf("%d: cbe_reprogram: set timer to fire at %d\n",
            (timestamp - start)/1000, firetime);
}

cyclic_fire:entry
/curcpu->cpu_id == 1/
{
        now = (timestamp - start)/1000;
        printf("%d: cyclic_fire: %d usecs late\n",
            now, now - firetime);
}

fbt::timer_fire:entry
/curcpu->cpu_id == 1/
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start)/1000,
            (int) (((timestamp - last) - 20000000))/1000);
        last = timestamp;
}


3839612: cyclic_fire: 80 usecs late
3839614: cbe_reprogram: set timer to fire at 3859532
3839615: interval is 0 usecs late
3859613: cyclic_fire: 81 usecs late
3859614: cbe_reprogram: set timer to fire at 3879532
3859616: interval is 0 usecs late
3879612: cyclic_fire: 80 usecs late
3879614: cbe_reprogram: set timer to fire at 3894055
3879615: interval is 0 usecs late
3894114: cyclic_fire: 59 usecs late
3894115: cbe_reprogram: set timer to fire at 3899532
3899555: cyclic_fire: 23 usecs late
3899556: cbe_reprogram: set timer to fire at 3919532
3899558: interval is -57 usecs late
3919613: cyclic_fire: 81 usecs late
3919614: cbe_reprogram: set timer to fire at 3939532
3919616: interval is 57 usecs late
3939612: cyclic_fire: 80 usecs late
3939614: cbe_reprogram: set timer to fire at 3959532
3939616: interval is 0 usecs late

Ahaha!  Turns out we're late all of the time; normally 80 usecs when there's a full 20 msecs between cyclic firings, but less when the deadman cyclic has fired recently.  Computing the error as a percentage of the sleep time, we see that it's a relatively constant 0.4% of the interval slept (80 usecs over 20 msecs, 59 usecs over the ~14.5 msecs leading up to the deadman firing, 23 usecs over the remaining ~5.5 msecs); it's the varying time between firings (due to multiple cyclics with different periods) that's responsible for the jitter.  Ok, so what is cbe_reprogram doing? Rats, calling through another function pointer.  Ok, this time we'll use DTrace to figure out where we're going rather than mdb:


#pragma D option flowindent

cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
        self->trace = 1;
}

fbt:::entry,
fbt:::return
/self->trace/
{
}

cbe_reprogram:return
/curcpu->cpu_id == 1/
{
        self->trace = 0;
}

1 -> cbe_reprogram
1 -> apic_timer_reprogram
1 -> gethrtime
1 -> tsc_gethrtime
1 <- tsc_gethrtime
1 <- gethrtime
1 <- apic_timer_reprogram
1 <- cbe_reprogram

Ok, now let's look at the source for apic_timer_reprogram:

ticks = (uint_t)((time - now) / apic_nsec_per_tick);

This is introducing a 0.4% error... I wonder what the value of apic_nsec_per_tick really is?

# echo 'apic_nsec_per_tick/D' | mdb -k 
apic_nsec_per_tick:
apic_nsec_per_tick: 10

That's it. We're suffering from an imprecise conversion between nsecs and apic ticks; the true tick period isn't an integral number of nanoseconds, so rounding it to 10 introduces a systematic error - here about 0.4% - into every interval we program. Since the deadman cyclic runs every second, if we want to keep the worst-case jitter due to quantization errors down to 1 usec, we'll need 1 ppm resolution in the conversion factor (1 usec out of 1 second is one part per million).

This also explained why sometimes the bug isn't visible - if the 20 msec cyclic always happens to fire just before the deadman cyclic, there is almost no change in the amount of time between cyclic firings, so the nearly constant error never shows up as jitter.

At this point I filed a bug, 6266961.  The fix involves rewriting the conversion to use a different factor and redoing the calibration of the APIC timer to use the kernel's idea of hi-res time rather than the PIT timer:
/*
 * deal with resolution issues by converting between ticks and
 * nanoseconds using 64 bit math.
 * use multiplication cause it's faster; divide by power of 2
 * for speed.
 */

static int64_t apic_ticks_per_128Mnsecs; /* ticks/128*1024*1024 nsecs */
#define APIC_TICKS_TO_NSECS(ticks) (((long long)(ticks) * 134217728LL) / \
	apic_ticks_per_128Mnsecs)
#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
	apic_ticks_per_128Mnsecs / 134217728LL)

After I get this finished up, it's back to malloc... more about that later on.



Wednesday Apr 27, 2005

Putting developer-defined DTrace probe points in an application

Well, it's been a while since I've blogged - time to post another example. As part of the Solaris 10 developer BOF at Usenix a couple of weeks ago, Liane asked me to put together a quick DTrace demo, so I cons'ed up a quick example of how to put static probes into an application. This technique is simple and can avoid the need for debug flags, conditional logging, etc. in the application.

In order to have something to work with, I wrote a quick version of wc that I instrumented with two dtrace probe points.  These two probe points are defined in a file called simple_probes.d:

provider simple {
probe saw__word(int);
probe saw__line(int);
};

Note that dtrace treats double underscores specially in such definitions; they're converted to dash "-" characters in the finished probe names. Here's the source for the wc program, which I called simple.c:


#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <sys/sdt.h>

/*
 * simple example of defining sdt probes in a trivial program
 * Sdt probes can often completely replace debug levels, optional
 * log files, etc, in daemons... you can leverage the power of dtrace
 * to make your server/application more readily debuggable.
 */

int
main(int argc, char *argv[])
{
        int i;
        int characters, lines, words;
        characters = lines = words = 0;

        while (1) {

                if ((i = getchar()) == EOF) {
                        /*
                         * here we specify the name of the provider,
                         * the name of the probe (modulo mapping
                         * '__' to '-') and pass in the parameter to be
                         * traced, which in this case is the number of
                         * lines seen so far.
                         */
                        DTRACE_PROBE1(simple, saw__line, lines);
                        break;
                }

                characters++;

                if (i == '\n') {
                        lines++;
                        DTRACE_PROBE1(simple, saw__line, lines);
                        continue;
                }

                if (isblank(i)) /* eating white space */
                        continue;

                words++; /* in a word now */

                while (1) {

                        if ((i = getchar()) == EOF) {
                                DTRACE_PROBE1(simple, saw__word, words);
                                break;
                        }

                        characters++;

                        if (i == '\n') { /* EOL? ends word implicitly */
                                DTRACE_PROBE1(simple, saw__word, words);
                                lines++;
                                DTRACE_PROBE1(simple, saw__line, lines);
                                break;
                        }

                        if (isblank(i)) { /* white space ends words too */
                                DTRACE_PROBE1(simple, saw__word, words);
                                break;
                        }
                }
        }

        printf("%8d %8d %8d\n", lines, words, characters);

        exit(0);
}

Now we need a makefile to process the code and build the executable:

CC=gcc
CFLAGS=-m64
DTRACE=dtrace

simple: simple_probes.o simple.o
	$(CC) -o simple -m64 simple_probes.o simple.o

simple_probes.o: simple.o simple_probes.d
	$(DTRACE) -G -64 -s simple_probes.d simple.o

Note that we use dtrace to build a .o file that defines our new probes, and we link this into our (64-bit AMD64) application:

barts@cyber:/home/barts/demos 144% /usr/ccs/bin/make
gcc -m64  -c  simple.c
dtrace -G -64 -s simple_probes.d simple.o
gcc -o simple -m64 simple_probes.o simple.o
barts@cyber:/home/barts/demos 145%

We can now take a look at the probes we've added to our program with dtrace:

# dtrace -P simple'$target' -c ./simple -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
48630 simple19946            simple                              main saw-line
48631 simple19946            simple                              main saw-word
#

And we can now demonstrate the probes firing as follows.  First, running normally:

# echo "Hello World" | ./simple
       1        2       12
#

and now running with probes enabled from dtrace:

# echo "Hello World" | dtrace -P simple'$target' -c ./simple

dtrace: description 'simple$target' matched 2 probes
       1        2       12
CPU     ID                    FUNCTION:NAME
  0  48631                    main:saw-word
  0  48631                    main:saw-word
  0  48630                    main:saw-line
  0  48630                    main:saw-line
dtrace: pid 19917 has exited

Simple enough really, and this works with Forte, gcc, and Studio compilers for all Solaris platforms...
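
And since these are ordinary probes, you aren't limited to tracing each firing; for example, a quick sketch of counting words with an aggregation instead (using the probe names defined above) might look like this:

# echo "Hello World" | dtrace -q -n 'simple$target:::saw-word { @words = count(); }' -c ./simple

When simple exits, dtrace prints the final count of saw-word firings along with the program's own output.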

Hmmm... perhaps I should change the nscd to use this rather than generating log
files....


