The limits of parallelism
By Darryl Gove on Oct 31, 2008
We all know Amdahl's law. The way I tend to think of it is if you reduce the time spent in the hot region of code, the most benefit you can get is the total time that you initially spent there. However, the original setting for the 'law' was parallelisation - the runtime improvement depends on the proportion of the code that can be made to run in parallel.
Aside: When I'm looking at profiles of applications, and I see a hot region of code, I typically consider what the improvement in runtime would be if I entirely eliminated the time spent there, or if I halved it. Then use this as a guide as to whether it's worth the effort of changing the code.
The issue with Amdahl is that it's completely unrealistic to consider parallelisation without considering issues of the synchronisation overhead introduced when you use multiple threads. So let's do that and see what happens. Assume that:
P = parallel runtime S = Serial runtime N = Number of threads Z = Synchronisation cost
Amdahl would give you:
P = S / N
The flaw is that I can keep adding processors, and the parallel runtime keeps getting smaller and smaller - why would I ever stop? A more accurate equation would be something like:
P = S / N + Z\*log(N)
This is probably a fair approximation to the cost of synchronisation, some kind of binary tree object that synchronises all the threads. So we can differentiate that:
dP/DN = -S / N\^2 + Z / N
And then solve:
0 = -S / N\^2 + Z / N S / N = Z N = S / Z
Ok, it's a bit disappointing, you start off with some nasty looking equation, and end up with a ratio. But let's take a look at what the ratio actually means. Let's suppose I reduce the synchronisation cost (Z). If I keep the work constant, then I can scale to a greater number of threads on the system with the lower synchronisation cost. Or, if I keep the number of threads constant, I can make a smaller chunk of work run in parallel.
Let's take a practical example. If I do synchronisation between threads on traditional SMP system, then communication between cores occurs at memory latency. Let's say that's ~200ns. Now compare that with a CMT system, where the synchronisation between threads can occur through the second level cache, with a latency of ~20ns. That's a 10x reduction in latency, which means that I can either use 10x the threads on the same chunk of work, or I can run in parallel a chunk of work that is 10x smaller.
The logical conclusion is that CMT is a perfect enabler of Microparallelism. You have both a system with huge numbers of threads, and the synchronisation costs between threads are potentially very low.
Now, that's exciting!