Dude, where'd those CPU cycles go?

Recently I worked on the (trivial) fix for 6276048 which is one of those bugs that is at the same time slightly amusing and slightly depressing. ;)

At my urging, the SSD performance team out east was looking into the costs of doing TLB shootdowns on large systems. In modern systems, the two biggests costs of virtual memory are TLB misses, and keeping the global view of each process' address space in sync across all of the processors (though it isn't particuarly relevant, the machine they were using is a fairly good sized domain of a SunFire 15K, which is a big machine, so there are a lot of processors to keep in sync). The latter operation, better known to VM geeks as TLB shootdowns, was the subject of the investigation, and was a rathole being visited as part of a bigger ongoing investigation into the overhead of CPU cross calls in big SPARC boxes.

While pouring over the kernel profile data from one of the runs, I noticed that the probe rate for the TLB shootdown code in the HAT layer (over in hat_sfmmu.c) was \*significantly\* higher than the probe rate for the cross call routines (over in x_call.c). This immediately rang some alarm bells -- the cost of the TLB shootdown should be constrained by the overhead in doing the cross calls themselves, especially considering the size of the system involved (around 32 CPUs, as I recall)!

So, being the diligent bees that they are, Dave Valin and his team dug in deeper with DTrace, coupled with a few home-grown analysis tools on the back end, and came up with an answer... after just a few hours, my phone rang, and I was staring at this code in disbelief:

#define SFMMU_XCALL_STATS(cpuset, ctxnum)                               \\ 
{                                                                       \\ 
                int cpui;                                               \\ 
                                                                        \\ 
                for (cpui = 0; cpui < NCPU; cpui++) {                   \\ 
                        if (CPU_IN_SET(cpuset, cpui)) {                 \\ 
                                if (ctxnum == KCONTEXT) {               \\ 
                                        SFMMU_STAT(sf_kernel_xcalls);   \\ 
                                } else {                                \\ 
                                        SFMMU_STAT(sf_user_xcalls);     \\ 
                                }                                       \\ 
                        }                                               \\ 
                }                                                       \\ 
} 

That's right, we were spending half our time updating some statistics! For reasons I won't go into here, NCPU on the SunFire high-end machines is quite large -- more than 512 -- so this loop was taking quite a long time to execute, and was actually dominating the time spent doing TLB shootdowns. The fix was unbelievably simple -- delete the for() loop construct in the macro. :)

As Bryan reminds us, there's a lesson to be learned in every bug, and usually more involved in the bug than is immediately apparent, so one should keep digging even after finding the root cause until finding the real underlying systemic problem that led to the bug. So, wondering how this one got by us for so long, I started asking around, and discovered that, not surprisingly, this macro was introduced a long, LONG time ago, when NCPU was really really small and insignificant. Add to this, the fact that we used to have little observability into the system at all, until DTrace came along, which is still a relatively recent innovation, and you have a hidden performance problem. In any big, complex system like the Solaris kernel there are bound to be dusty corners which no one has visited for years, and whose design constraints are long overdue for an overhaul. The lesson I took away from this one is that, when examining the source for dusty corners, one must examine the macros and not just the source -- for what is hidden in the macros may just be the most important part of the function!


Technorati Tag:
Technorati Tag:
Comments:

Post a Comment:
Comments are closed for this entry.
About

elowe

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today