Doing the Jitter Bug


One of the nice things about finally having OpenSolaris properly launched is that I can share Solaris source code with everyone when talking about Solaris. Since Solaris is the code, not being able to show people the code has been like having a discussion about various aspects of the flavors of wine without actually tasting any - possible perhaps, but a rather dry (heh) and uninteresting narrative....


I was happily (as I don't get to write nearly enough code) whacking away on a revised version of malloc to fix some longstanding performance and scalability issues when one of our marketing folks darted in asking about telcos, real time, and latency bubbles [I'll write about our malloc findings later on]. After getting the relevant email trail and a test program from the customer contact, it turned out that the customer was indeed seeing a real problem - a 20 millisecond timer was occasionally firing up to 60 microseconds late on his Opteron box, but no problem was seen on SPARC machines. Weirder still, the problem appeared sometimes, but not always - any given instance of the test program might display the problem, or it might not.  Hmmm.   The customer's test program would carefully record any intervals between timer firings that exceeded 20 msecs + 40 usecs and report them every 20 seconds or so... and when the problem appeared, it was happening once a second...

Like most performance problems I run into, this looked like a job for DTrace.  As Bryan has pointed out many times, DTrace is best at quickly evaluating various hypotheses...so I needed some testable ideas as to what might be going on. Well, what could make the program's timers fire late?

  • Other threads or interrupts - perhaps these are interfering with our program somehow?
  • Other timers firing on this CPU - maybe some other system timers are causing problems on CPU 1?

Well, hypothesis testing time.  First of all, the test program created a processor set containing CPU 1, placed itself as the only thread in that set, and even disabled interrupts on that CPU.
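For the curious, here's a sketch of how a program fences itself off this way (my reconstruction, not the customer's actual code; CPU 1 is hardcoded for brevity, and the calls need root privileges):

#include <sys/pset.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	psetid_t pset;

	/* Create an empty processor set. */
	if (pset_create(&pset) != 0) {
		perror("pset_create");
		exit(1);
	}
	/* Move CPU 1 into it; now only threads bound to the set run there. */
	if (pset_assign(pset, 1, NULL) != 0) {
		perror("pset_assign");
		exit(1);
	}
	/* Bind this LWP to the set... */
	if (pset_bind(pset, P_LWPID, P_MYID, NULL) != 0) {
		perror("pset_bind");
		exit(1);
	}
	/* ...and keep device interrupts away from CPU 1. */
	if (p_online(1, P_NOINTR) == -1)
		perror("p_online");

	/* ... the 20 msec timer loop would go here ... */
	return (0);
}

With the program fenced off like that, nothing else should be able to run on CPU 1.  A quick check with the following script: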

sched:::on-cpu
/curcpu->cpu_id == 1/
{
        trace(curthread);
}
verified that there was only a single thread (the expected one) running on CPU 1.  Hmmm.  intrstat verified that CPU 1 wasn't getting any device interrupts, either.  So what's happening here? I had some experience with timekeeping on Solaris, but not enough to know offhand where to look for this problem.  Confused?  DTrace will help!  A quick change to the above script lets us use the enqueue probe in the sched provider to discover exactly how our process is made runnable when the timer expires:

sched:::enqueue
/curcpu->cpu_id == 1/
{
        stack(10);
}
The output is pretty straightforward - lots of entries that look like this:
 1   2561                setbackdq:enqueue
              TS`ts_setrun+0xf1
              genunix`cv_unsleep+0x6e
              genunix`setrun_locked+0x6d
              genunix`eat_signal+0x89
              genunix`sigtoproc+0x420
              genunix`sigaddqa+0x4a
              genunix`timer_fire+0xb9
              genunix`clock_highres_fire+0x3f
              genunix`cyclic_softint+0xb2
              unix`cbe_low_level+0x14
This gave me a place to get started.... timer_fire is clearly important, so let's instrument that and see when it fires:

#pragma D option quiet

BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::timer_fire:entry
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start) / 1000,
            (int)((timestamp - last) - 20000000) / 1000);
        last = timestamp;
}

This script prints a line with a timestamp in microseconds since we started, along with the number of microseconds by which each interval exceeded the desired 20 msecs.

Interestingly enough, the output looks like this:
13822657: interval is 0 usecs late
13842656: interval is 0 usecs late
13862657: interval is 0 usecs late
13882657: interval is 0 usecs late
13902657: interval is 0 usecs late
13922599: interval is -57 usecs late
13942657: interval is 57 usecs late
13962657: interval is 0 usecs late
13982657: interval is 0 usecs late
14002657: interval is 0 usecs late

Whoops!  Turns out our test program was leading us down the garden path; one timer firing appears to be 57 usecs early, and the next one makes up the difference by arriving 57 usecs late, so the net error over time is 0. Reading the comments in cyclic.c, this isn't that surprising, but it does give us a place to start looking. At this point, I noticed that I could suspend the test program and still get the same intermittent late behavior. How often are we late?
#pragma D option quiet

BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::timer_fire:entry
/(int)((timestamp - last) - 20000000) / 1000 != 0/
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start) / 1000,
            (int)((timestamp - last) - 20000000) / 1000);
}

fbt::timer_fire:entry
{
        last = timestamp;
}

The output:

5974796: interval is -57 usecs late
5994854: interval is 56 usecs late
6974796: interval is -57 usecs late
6994854: interval is 56 usecs late

Interesting - once a second.

Reading the cyclic.c source code, we see that cyclic_fire handles all the interesting timer stuff.  Let's trace cyclic_fire in addition to timer_fire.
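The added clause is a one-liner along these lines (a sketch; it reuses the start variable from the script above):

fbt::cyclic_fire:entry
/curcpu->cpu_id == 1/
{
        printf("%d: cyclic_fire\n", (timestamp - start) / 1000);
}

The combined output: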
2248600: interval is 0 usecs late
2263098: cyclic_fire
2268539: cyclic_fire
2268543: interval is -57 usecs late
2288597: cyclic_fire
2288600: interval is 57 usecs late
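As a quick sanity check (my addition, using the same probes), counting cyclic_fire calls per second on CPU 1 should show the extra once-a-second firing on top of the 50 from our 20 msec timer:

#pragma D option quiet

fbt::cyclic_fire:entry
/curcpu->cpu_id == 1/
{
        @fires = count();
}

tick-1s
{
        printa("cyclic_fire calls on CPU 1 in the last second: %@d\n",
            @fires);
        clear(@fires);
}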

So did we misprogram the clock, or is something else going on? On line 929 of cyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain to chase statically. Dropping into mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba
s1394 nca fctl lofs nfs audiosup random sppp ipc ptm logindmux ]
> ::cpuinfo
ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH    THREAD           PROC
 0 fffffffffbc1ede0  1b    3    0  59   no    no t-0       ffffffff8584e220 mdb
 1 ffffffff825cb800   b    0    0  -1   no    no t-516001  fffffe8000275c80 (idle)
> ffffffff825cb800::print cpu_t cpu_cyclic
cpu_cyclic = 0xffffffff80f45600
> 0xffffffff80f45600::print cyc_cpu_t cyp_backend
cyp_backend = 0xffffffff803d65a8
> 0xffffffff803d65a8::print cyc_backend_t
{
cyb_configure = cbe_configure
cyb_unconfigure = 0
cyb_enable = cbe_enable
cyb_disable = cbe_disable
cyb_reprogram = cbe_reprogram
cyb_softint = cbe_softint
cyb_set_level = cbe_set_level
cyb_restore_level = cbe_restore_level
cyb_xcall = cbe_xcall
cyb_suspend = 0
cyb_resume = 0
cyb_arg = 0xffffffff825cb800
}
>

So we need to trace calls to cbe_reprogram:
#pragma D option quiet

BEGIN
{
        start = timestamp;
        last = timestamp;
}

fbt::cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
        firetime = (arg1 - start) / 1000;
        printf("%d: cbe_reprogram: set timer to fire at %d\n",
            (timestamp - start) / 1000, firetime);
}

fbt::cyclic_fire:entry
/curcpu->cpu_id == 1/
{
        now = (timestamp - start) / 1000;
        printf("%d: cyclic_fire: %d usecs late\n",
            now, now - firetime);
}

fbt::timer_fire:entry
/curcpu->cpu_id == 1/
{
        printf("%d: interval is %d usecs late\n",
            (timestamp - start) / 1000,
            (int)((timestamp - last) - 20000000) / 1000);
        last = timestamp;
}


3839612: cyclic_fire: 80 usecs late
3839614: cbe_reprogram: set timer to fire at 3859532
3839615: interval is 0 usecs late
3859613: cyclic_fire: 81 usecs late
3859614: cbe_reprogram: set timer to fire at 3879532
3859616: interval is 0 usecs late
3879612: cyclic_fire: 80 usecs late
3879614: cbe_reprogram: set timer to fire at 3894055
3879615: interval is 0 usecs late
3894114: cyclic_fire: 59 usecs late
3894115: cbe_reprogram: set timer to fire at 3899532
3899555: cyclic_fire: 23 usecs late
3899556: cbe_reprogram: set timer to fire at 3919532
3899558: interval is -57 usecs late
3919613: cyclic_fire: 81 usecs late
3919614: cbe_reprogram: set timer to fire at 3939532
3919616: interval is 57 usecs late
3939612: cyclic_fire: 80 usecs late
3939614: cbe_reprogram: set timer to fire at 3959532
3939616: interval is 0 usecs late

Ahaha!  Turns out we're late all of the time; normally 80 usecs when there's a full 20 msecs between cyclic firings, but less when the deadman cyclic has fired recently.  Computing error as a percentage of sleep time, we see that our error is a relatively constant 0.4% of the interval slept - 80 usecs late on a 20 msec sleep and 59 usecs late on a ~14.4 msec sleep are both about 0.4% - so it's the varying time between firings (due to multiple cyclics with different periods) that's responsible for the jitter.  Ok, so what is cbe_reprogram doing? Rats, calling through another function pointer.  Ok, this time we'll use DTrace to figure out where we're going rather than mdb:


#pragma D option flowindent

fbt::cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
        self->trace = 1;
}

fbt:::entry,
fbt:::return
/self->trace/
{
}

fbt::cbe_reprogram:return
/curcpu->cpu_id == 1/
{
        self->trace = 0;
}

1 -> cbe_reprogram
1 -> apic_timer_reprogram
1 -> gethrtime
1 -> tsc_gethrtime
1 <- tsc_gethrtime
1 <- gethrtime
1 <- apic_timer_reprogram
1 <- cbe_reprogram

Ok, now let's look at the source for apic_timer_reprogram:

ticks = (uint_t)((time - now) / apic_nsec_per_tick);

This division is introducing the 0.4% error... I wonder what the value of apic_nsec_per_tick really is?

# echo 'apic_nsec_per_tick/D' | mdb -k 
apic_nsec_per_tick:
apic_nsec_per_tick: 10

That's it. We're suffering from an imprecise conversion between nsecs and apic ticks; the timer's true tick period isn't an integral number of nanoseconds, so truncating the conversion factor to an integer builds a constant fractional error into every conversion - more than enough to cause excessive jitter. Since the deadman cyclic runs every second, if we want to keep worst case jitter due to quantization errors down to 1 usec we'll need 1 ppm resolution on the conversion factor.
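A quick back-of-the-envelope sketch shows the size of the effect (the 10.04 ns true tick period here is my assumption, chosen to match the observed 0.4% error; it isn't from the bug report):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/*
	 * Assume the APIC timer's true period is ~10.04 ns/tick
	 * (hypothetical), which the kernel truncated to the integer 10.
	 */
	double true_nsec_per_tick = 10.04;
	int64_t apic_nsec_per_tick = 10;	/* truncated factor */
	int64_t interval = 20000000LL;		/* 20 msecs in nsecs */
	int64_t programmed;
	double needed, late_ns;

	/* Ticks the kernel programs vs. ticks actually needed. */
	programmed = interval / apic_nsec_per_tick;
	needed = interval / true_nsec_per_tick;
	late_ns = (programmed - needed) * true_nsec_per_tick;

	/* Prints: programmed 2000000 ticks; fires 80000 ns (80 usecs) late */
	printf("programmed %lld ticks; fires %.0f ns (%.0f usecs) late\n",
	    (long long)programmed, late_ns, late_ns / 1000);
	return (0);
}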

This also explained why the bug sometimes isn't visible - if a 20 msec cyclic always occurs just before the deadman cyclic fires, there is almost no change in the amount of time between cyclic firings.

At this point I filed a bug, 6266961.  The fix involves rewriting the conversion to use a different factor and redoing the calibration of the apic timer to use the kernel's idea of high-resolution time rather than the PIT timer:
/*
 * deal with resolution issues by converting between ticks and
 * nanoseconds using 64 bit math.
 * use multiplication cause it's faster; divide by power of 2
 * for speed.
 */

static int64_t apic_ticks_per_128Mnsecs; /* ticks/128*1024*1024 nsecs */

#define APIC_TICKS_TO_NSECS(ticks) (((long long)(ticks) * 134217728LL) / \
	apic_ticks_per_128Mnsecs)
#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
	apic_ticks_per_128Mnsecs / 134217728LL)
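
Why 128M nsecs is enough resolution: the calibrated factor is around 13.4 million ticks per 128M nsecs, so truncating it to an integer costs less than one part in 13 million - well under the 1 ppm target. A small sketch (again using the hypothetical 10.04 ns tick period from above):

#include <stdio.h>
#include <stdint.h>

static int64_t apic_ticks_per_128Mnsecs;

#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
	apic_ticks_per_128Mnsecs / 134217728LL)

int
main(void)
{
	double true_nsec_per_tick = 10.04;	/* hypothetical, as above */
	int64_t ticks;
	double err_ns;

	/* Calibrate: ticks per 128*1024*1024 nsecs, stored as an integer. */
	apic_ticks_per_128Mnsecs =
	    (int64_t)(134217728.0 / true_nsec_per_tick);

	ticks = APIC_NSECS_TO_TICKS(20000000LL);	/* a 20 msec sleep */
	err_ns = ticks * true_nsec_per_tick - 20000000.0;

	/* Prints: programmed 1992032 ticks; error 1 ns */
	printf("programmed %lld ticks; error %.0f ns\n",
	    (long long)ticks, err_ns);
	return (0);
}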

After I get this finished up, it's back to malloc... more about that later on.

