### How many kernel engineers does it take to solve one differential equation?

Now, before you can answer this fundamental question, let's take care of some basics first. Do you really know what "load average" (as reported by getloadavg() or by various utilities such as uptime and w) actually tells you? Or even what it really means? Out of 10 random folks informally surveyed in the hallways of Sun Quentin:
1. 2 said "Duh?"
2. 7 said "Load average measures an average number of processes over 1 minute, 5 minutes and 15 minutes respectively"
3. 1 said "Load average numbers represent a simple moving average of the number of processes in a scheduler's run queue with a window of 1 minute, 5 minutes and 15 minutes respectively"
Now, unless your name is Neil Gunther, something tells me you'd be one of these folks yourself. I certainly used to be. That is, before I came across a series of articles by Dr. Gunther. The rest chronicles my personal journey of realizing just how deep the rabbit hole goes.

First things first: the last dude from the survey got pretty close to giving a correct answer. Indeed, load average numbers are moving averages, but they are not simple ones. In fact, they are exponential moving averages (or EMAs, for all the financially minded folks out there).

A bit of a digression: one of the primary jobs of the operating system is to make sure that all of the processes can finish up as quickly as they can. The CPU is a precious toy that needs to be given to each process fairly, as though the processes were children in a large family and the kernel were their mother. The good news is: most of the time most of the children are sleeping and don't need any toys to begin with. The reasons they are asleep are plentiful: they could be waiting for I/O to complete, or waiting for somebody to wake them up when a particular condition gets met, etc. Processes, not children, that is. The important thing is: they do not require the CPU and do not fight for it. The few that are ready to go are called "runnable" and are placed on a runqueue. If your system has fewer CPUs than there are processes in that queue, the kernel has to make sure that CPU time is fairly allocated between all the needy kids\^H\^H\^H\^Hprocesses (you have more kids than toys).

So, in a nutshell, what the kernel does to gather load average (LA) statistics is rather simple: once every h seconds it samples the size of the runqueue (which gives it the number of runnable processes at that point in time) and updates the current values of the LA numbers using this simple EMA formula:
current_average = α \* previous_average + (1-α) \* runqueue_length
In reality you have 3 variables like current_average to take care of, each corresponding to a 1, 5 and 15 minute window. But they all do the same thing: they tell you how many processes, on average (an exponential moving one, but still), were waiting in the runqueue during the last 1, 5 and 15 minutes. To put it in simple terms: if your LA numbers are constantly smaller than the total number of cores you have -- you got ripped off when you bought your quad-core AMD CPU (there's nothing for some of the cores to do at all; you've got more toys than kids). If the numbers are constantly bigger -- go ahead and get an extra quad-core CPU (get some shiny toys for those kids of yours). Simple, huh? Almost; there's one little question still to be answered: what's the ideal value for the α constant? Now, before you run to your favorite statistics handbook, here's a shocker: have you realized that every time you compute a formula like this:
Y[i+1] = α \* Y[i] + (1-α) \* X[i]
what you're really doing is numerically solving an ordinary differential equation using the backward Euler method? Yep, that's right. There's no escape. You can call it an EMA, you can call it a low-pass filter, you can call it whatever you like, but the truth of the matter is: you're reconstructing function Y, such that:
dY/dt = -T\*Y(t) + T\*X(t), where T = (1-α)/(α\*h)
and each consecutive sample Yn is h units away from the previous one.

A fundamental digression: if you think about what a typical ordinary differential equation is, you can quickly realize that, in simple terms, it is exactly a recipe for constructing the resulting function one step at a time. Recall the high school definition of the derivative: f'(t) ~= (f(t+h) - f(t))/h, where the weird-looking ~= designates a limit. On the other hand, ~= could just as well designate an approximation. And if you do read it that way -- congratulations! You are on the brink of discovering the Euler method of numerically solving ordinary differential equations all by yourself. But what's more important is that you can now see that a derivative is a very good tool for describing how neighboring values of a function are supposed to relate to each other: f[next] = f[prev] + h\*f'[prev]. If you want extra precision you can write it like this (and struggle with a bit more complex computation, which is what we have to do for computing LAs): f[next] = f[prev] + h\*f'[next]. But the basic fundamental principle remains the same: derivatives describe how neighboring values of the function relate to each other. Hence, every time you see something like x[i] = x[i-1] + EXPRESSION, don't be fooled -- there's a differential equation being solved here.

Now that we know what the equation looks like, it would be a good time to dust off that old calculus book you have sitting in your basement. Once you open it, you'll get the bad news: there's no compact solution for dY/dt = -T\*Y(t) + T\*X(t). You have to integrate X(t), and that's a downer. On the positive side, though: as long as our X(t)==0 we have a much simpler and well-known equation to solve: dY/dt = -T\*Y(t). Which, of course, makes Y(t) == Y(0)\*exp(-T\*t) and also hints at what the good values for T might be. Indeed, since in our case the values of X(t) are the measurements of the runqueue's length, X(t)==0 means that the runqueue is empty.
The next logical question to ask is: how do we want our load average numbers to behave when the size of the runqueue drops to 0? We don't have much freedom, though, since for every segment where X(t)==0 we actually \*know\* how our numbers will behave: Y(t) == Y(0)\*exp(-T\*t). The only knob we can turn is the value of 'T'. Control engineering, which deals with systems governed by the same differential equation, has long-standing conventions for how a controlling device should react to changes in the process. These conventions are based on the notion of a time constant. And the way it is defined makes the following two statements true:
1. After a period of one time constant the function decays to ~37% of its initial value
2. After five time constants the function decays to less than 1% of its initial value
Which seems especially fitting in our case, since the second window (5 minutes) is exactly 5 times bigger than the first one. So without further ado, let me say that I would like our time constant to be 1 minute. Given that our sampling is done in seconds, that would be 60. Since T==1/time_constant, the value of T is 1/60, which means that (1-α)/(α\*h)==1/60 and α==60/(60+h). Pretty neat, huh? In fact, that's how I would like \*my\* load average numbers to be calculated!

But obviously, as pointed out by Dr. Gunther, that's not what the Linux community wants. Nor is it what the OpenSolaris community wants. In fact, unless your operating system was done by Fernando Corbato and co., you're out of luck. For some unknown reason, all of the open-source implementations of load average stick with defining α as an exponent of some sort:
            Linux         Solaris       What makes sense   Difference
 h          5 sec         1 sec         1 sec
 1 min α    exp(-5/60)    exp(-1/60)    60/61              0.0001
 5 min α    exp(-5/300)   exp(-1/300)   300/301            0.000005
15 min α    exp(-5/900)   exp(-1/900)   900/901            0.0000006
This makes very little sense (although it doesn't do much harm either, since numerically these exponent values are pretty close to the real deal) and leads to all sorts of complications in calculating the actual numbers. Of course, if instead of a simple α==60/65==12/13 (the Linux kernel samples the actual length of the runqueue only once per 5 seconds, hence h==5) you stick with exp(-5sec/1min), no wonder you get in trouble! And for what? Well, at least I don't know. Now, don't get me wrong, it's not like different αs are a crime. It is definitely not more of a problem than using logarithms of base 7 or something. On the other hand, when the Linux α effectively gives me a time constant of 59.204812 seconds and is numerically slightly different from what Solaris uses, one has to wonder what kind of esoteric knowledge needs to be possessed in order to appreciate something like that. Again, I don't really know.

But after reading the excerpt from the Multics manual reproduced by Dr. Gunther, I do know this: it takes at least one kernel engineer from the '60s to actually be able to solve a differential equation; the new guys seem to have flunked calculus.

Fascinating. I love this old-skool Unix cruft. :-)

Note that in the 3BSD distribution (which is the oldest source code I could find here) in usr/src/sys/sys/vmsched.c it says:

/\*
 \* Constants for averages over 1, 5, and 15 minutes
 \* when sampling at 5 second intervals.
 \*/
double cexp[3] = {
    0.9200444146293232, /\* exp(-1/12) \*/
    0.9834714538216174, /\* exp(-1/60) \*/
    0.9944598480048967, /\* exp(-1/180) \*/
};

The file vmsched.c is dated Feb 14, 1980 in the 3BSD distribution. So it is very likely that Linux just copied from BSD...

Posted by Volker A. Brandt on May 11, 2008 at 08:39 PM PDT #

I would guess the exp() is moved into the constant in order to avoid recalculating it. So the formula looks like this:

in the Linux kernel.

Gruss
Bernd

Posted by Bernd Eckenfels on May 12, 2008 at 05:53 AM PDT #

To: Volker A. Brandt

Thanks for quoting 3BSD (do you have a URL for that,
by the way?). It seems that a singularity between how
LAs used to be calculated and how they are calculated
right now lies somewhere in AT&T. Now, if only I worked

Thanks,
Roman.

Posted by Roman V Shaposhnik on May 12, 2008 at 07:01 AM PDT #

To: Bernd Eckenfels

My argument is that exp() has no place being part of the numerical solution of the above differential equation to begin with. Of course, the fact that Load Average numbers have not been standardized (yep, LAs are NOT part of POSIX) contributes to clouding the matter. Without prior knowledge of what they are supposed to measure, an alpha constant of 60/61 is as good as exp(-1/60). And if for some reason we really have to stick with exp(-1/60), then your argument makes sense. But! I maintain that the most logical way of approaching LAs is to make the first one be
aligned at one time constant (IOW, our system starts to enter a stable state), the second one be aligned at 5\*T (IOW, our system has now completely transitioned to the steady state), and the third one just being N\*5\*T, just to be able to check that the second number is indeed representative of a complete transition into a steady state. Now, why for the scheduler the steady state is supposed to start after 1 minute is debatable, but 1 minute is at least a nice round number ;-)

Thanks,
Roman.

Posted by Roman V Shaposhnik on May 12, 2008 at 07:12 AM PDT #

Hi Roman:

The Unix Heritage Society in Australia have some nice pages under

The 3BSD source is in their archive; since their first mirror in the US does not work and the second one has "tux" in its name :-), here's a URL into the Canadian mirror:

ftp://ftp.darwinsys.com/pub/UnixArchive/4BSD/Distributions/3bsd.tar.gz

There should also be some AT&T stuff floating around the net but I can't find the links, offhand. I also don't remember exactly what then-Caldera released...

Regards -- Volker

Posted by Volker A. Brandt on May 12, 2008 at 11:16 PM PDT #


