Monday Feb 16, 2009

DTrace meets the CPU Performance Counters

It's been a while coming but it's here at last in build 109 of Solaris Nevada: the DTrace CPU performance counter provider (or 'cpc' provider for short). The cpc provider gives you the ability to profile your system by many different types of processor-related events; the list of events is processor specific and usually quite large, but typically includes events such as cycles executed, instructions executed, cache misses, TLB misses and many more. From a high-level view, the cpc provider is essentially the same as the profile provider except that, where the profile provider allows you to profile a system using a fixed, time-based source, the cpc provider lets you profile by events related to processor activity.

There's a bunch of things around the implementation that I'd like to talk about but that can probably wait for another day, as we'll ease ourselves gently into the world of the cpc provider with some simple examples. Additional sources of information that you may find useful are the DTrace documentation and the cpc provider ARC case one pager. Note that most of the examples I give here are focused on two of the more common event types, instructions or cycles executed. However, we could just as easily have profiled behaviour around the cache hierarchy (l1/l2/l3 hits/misses), for example - we'd merely need to specify different event types to look at.
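To give a flavour of that, here's a sketch of what a cache-oriented profile might look like. This is a hypothetical example: PAPI_l2_tcm is the generic PAPI alias for total L2 cache misses, and whether it (or any given event) exists depends on your processor, so treat the probename as an assumption rather than something guaranteed to be there.

```d
#!/usr/sbin/dtrace -qs

/*
 * Hypothetical sketch: profile user-mode L2 cache misses by function.
 * PAPI_l2_tcm is the generic PAPI alias for total L2 cache misses; if
 * your processor doesn't offer it, substitute an event name from the
 * processor-specific list.
 */
cpc:::PAPI_l2_tcm-user-10000
{
	@[ufunc(arg1)] = count();
}
```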

Anyway, let's start with a simple example to see it in action:

Example 1: User Mode Instruction Profile by Executable

# cat userexecbyinst.d
#!/usr/sbin/dtrace -qs

/* PAPI_tot_ins is the generic alias for retired instructions */
cpc:::PAPI_tot_ins-user-10000
{
	@[execname] = count();
}

END
{
	trunc(@, 10);
	printa("%-30s %@d\n", @);
}

# dtrace -qs ./userexecbyinst.d
gnome-panel                    3760
soffice.bin                    5952
nautilus                       15255
gnome-terminal                 23363
metacity                       29627
wnck-applet                    36635
isapython2.4                   88264
thunderbird-bin                119269
firefox-bin                    319431
Xorg                           399781

Here we have a profile of our system which looks at executed user-mode instructions. The specification of what we wanted to profile is contained in the probename of the cpc probe. The probename is made up of the following, hyphen delimited, fields: the event to count (a processor-specific event name or, where available, a generic PAPI alias such as PAPI_tot_ins); the privilege mode to count in (user, kern or all); and the overflow count - the number of events that must occur on a CPU before the probe fires there.
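As a quick illustration of how those hyphen-delimited fields combine (again assuming the generic PAPI_tot_ins alias is available on your processor - the event names here are illustrative), the same event can be counted in different modes and at different sampling granularities:

```d
cpc:::PAPI_tot_ins-user-10000	/* user-mode instructions, fire every 10000 */
cpc:::PAPI_tot_ins-kern-10000	/* the same event, kernel mode only */
cpc:::PAPI_tot_ins-all-50000	/* both modes, with a coarser overflow count */
```

A larger overflow count means less probe-effect but a coarser-grained profile, so the last field is effectively your sampling-rate knob.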

So, let's get a bit fruitier and take a different view of things. Here's a system wide view of user mode TLB misses by library:

Example 2: User Mode TLB miss Profile by Library

# cat usertlbmiss.d
#!/usr/sbin/dtrace -qs

/* PAPI_tlb_dm is the generic alias for data TLB misses */
cpc:::PAPI_tlb_dm-user-1000
{
	@[umod(arg1)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50A %@10d\n", @);
}

# dtrace -qs ./usertlbmiss.d
^C
                                                          274
                                                          454
                                                          739
                                                          791
                                                          964
firefox-bin                                              1026
                                                         1443
                                                         1740
                                                         5651
Xorg                                                     6766

Note that probes from the cpc provider export the same arguments as the profile provider: arg0 is the kernel program counter and will be non-zero if the event occurred while operating in kernel mode; arg1 is the user program counter and will be non-zero if the event occurred while operating in user mode.
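That means a single probe counting in 'all' mode can drive both views at once, splitting each sample on the kernel/user boundary with a predicate. A sketch, assuming the PAPI_tot_cyc alias exists on your processor:

```d
/*
 * Sketch: one '-all-' probe, with each firing attributed to either the
 * kernel or the user side depending on which program counter is set.
 */
cpc:::PAPI_tot_cyc-all-10000
/arg0 != 0/
{
	@kern[mod(arg0)] = count();	/* event fired in kernel mode */
}

cpc:::PAPI_tot_cyc-all-10000
/arg1 != 0/
{
	@user[umod(arg1)] = count();	/* event fired in user mode */
}
```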

Example 3: Selective User Mode Cycle Count Profile by Function

As this is DTrace, we can use predicates to prune the data we capture. Let's profile a couple of applications from a particular user, but this time profile them by the number of cycles used:

# cat ./userbyappbycycles.d
#!/usr/sbin/dtrace -qs

/* PAPI_tot_cyc is the generic alias for cycles executed */
cpc:::PAPI_tot_cyc-user-10000
/(execname == "firefox-bin" || execname == "thunderbird-bin") && uid == 101/
{
	@[ufunc(arg1)] = count();
}

END
{
	trunc(@, 10);
	printa("%-60A %@10d\n", @);
}

# dtrace -qs ./userbyappbycycles.d
^C
`mutex_unlock                                             1544
`FcFontMatch                                              1671
`clear_lockbyte                                           1754
`__1cKnsImageGTKODrawComposited6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiiiii_v_       1987
`mutex_unlock_queue                                       2324
`mutex_lock_impl                                          2684
`__1cPSelectorMatches6FrnRRuleProcessorData_pnNnsCSSSelector_ipnHnsIAtom_W_i_       3755
`__1cKnsImageGTKRDrawCompositeTile6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiii_v_       4597
`mutex_lock_impl                                          5280
`clear_lockbyte                                           5313
Just bear in mind that all the other user-mode applications on the system are contributing to the event count here. If you wanted to profile events from only a particular set of applications then you would need to put them in their own processor set to make sure nothing else contributes to the event count.

Example 4: Kernel mode instruction profile by module

Now we'll look at some examples of kernel profiling. I'm using filebench to generate some workload on top of a ZFS filesystem so these examples tend to focus on zfs. To start with, let's look at a kernel profile of which modules are executing the most instructions:

# cat kernbymod.d
#!/usr/sbin/dtrace -qs

cpc:::PAPI_tot_ins-kern-10000
/arg0 != 0/
{
	@[mod(arg0)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50a %@10d\n", @);
}

# dtrace -qs ./kernbymod.d
dtrace                                                    348
ata                                                       625
specfs                                                    753
rootnex                                                   880
kcf                                                      1551
sha1                                                     5704
procfs                                                   6097
unix                                                   158137
genunix                                                161461
zfs                                                    575552

Not surprisingly, as we're running an intensive workload on top of a ZFS filesystem, zfs is at the top of the list. It may be interesting to break down the cycle usage within the zfs module so we'll do that next.

Example 5: Kernel mode cycle profile: functions from specified module

# cat kernmodbyfunc.d
#!/usr/sbin/dtrace -s

/*
 * Simple script to profile all functions within a given module ('zfs' is
 * the lucky chap here). The module address range is just extracted from
 * modinfo(1M) output.
 */
inline uint64_t ZFS_MOD_START = 0xfffffffff77fb000;
inline uint64_t ZFS_MOD_END = 0xfffffffff78c7378;

cpc:::PAPI_tot_cyc-kern-10000
/arg0 >= ZFS_MOD_START && arg0 < ZFS_MOD_END/
{
	@[func(arg0)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50a %@d\n", @);
}

# dtrace -qs ./kernmodbyfunc.d
zfs`arc_write_ready                                1459
zfs`arc_write_done                                 1472
zfs`zio_done                                       1512
zfs`refcount_is_zero                               1836
zfs`zio_execute                                    1918
zfs`arc_cksum_verify                               2013
zfs`dbuf_verify                                    2167
zfs`lzjb_compress                                  3356
zfs`zil_commit_writer                              3706
zfs`fletcher_2_native                              497284

Example 6: Kernel mode cycle profile of a given function

Taking the example above, we may now want to drill down on the fletcher_2_native() function to see the distribution of samples within the function and try to gain some insight into its hotspots. The following script may look a bit hacky but it does the job of ensuring that we only store the PC for events which fired within this function.

# cat kernfunc.d
#!/usr/sbin/dtrace -qs

cpc:::PAPI_tot_cyc-kern-10000
/arg0 >= (uint64_t)&(`fletcher_2_native)
    && arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))/
{
	@[arg0] = count();
}

END
{
	printa("%a %@d\n", @);
}

# dtrace -qs ./kernfunc.d
zfs`fletcher_2_native+0xc 1
zfs`fletcher_2_native+0x10 1
zfs`fletcher_2_native+0x17 1
zfs`fletcher_2_native+0x1b 1
zfs`fletcher_2_native+0x42 1
zfs`fletcher_2_native+0x46 1
zfs`fletcher_2_native+0x49 1
zfs`fletcher_2_native+0x5d 1
zfs`fletcher_2_native+0xa6 1
zfs`fletcher_2_native+0xce 1
zfs`fletcher_2_native+0xd2 1
zfs`fletcher_2_native+0x53 2
zfs`fletcher_2_native+0x5b 2
zfs`fletcher_2_native+0x83 2
zfs`fletcher_2_native+0xa3 2
zfs`fletcher_2_native+0xa8 2
zfs`fletcher_2_native+0x37 3
zfs`fletcher_2_native+0x3b 3
zfs`fletcher_2_native+0xca 4
zfs`fletcher_2_native+0x9c 9
zfs`fletcher_2_native+0xc6 27
zfs`fletcher_2_native+0x2b 35
zfs`fletcher_2_native+0xab 40
zfs`fletcher_2_native+0x1 56
zfs`fletcher_2_native+0x72 79
zfs`fletcher_2_native+0x91 709
zfs`fletcher_2_native+0x80 1331
zfs`fletcher_2_native+0x75 1346
zfs`fletcher_2_native+0x61 1382
zfs`fletcher_2_native+0x79 1409
zfs`fletcher_2_native+0x87 1437
zfs`fletcher_2_native+0x9f 1438
zfs`fletcher_2_native+0x68 1522
zfs`fletcher_2_native+0x99 1754
zfs`fletcher_2_native+0x7d 1818
zfs`fletcher_2_native+0x6f 1839
zfs`fletcher_2_native+0x95 2341
zfs`fletcher_2_native+0x8b 2517
zfs`fletcher_2_native+0x6b 38448

fletcher_2_native() is a simple piece of code that basically consists of a single for() loop. If we map out the PCs in the above example then they fall on the instructions of that loop, as you'd expect.

Example 7: Selective kernel mode cycle profile of a given function

The fletcher_2_native() code can be called from three separate sites. It may be the case that we are only interested in profiles of the code when it has been called from one of these sites. Here's an example of how we can profile calls into fletcher_2_native() only when the caller is arc_cksum_verify():

# cat kerncaller-callee.d
#!/usr/sbin/dtrace -qs

/*
 * This is an example of selectively profiling a kernel function depending upon
 * who its caller is. The example we have here is only profiling the zfs
 * function, fletcher_2_native(), when it is called from arc_cksum_verify().
 * We can't use thread-local variables to predicate on in the cpc probe as it
 * is executing in a high-level interrupt context. To work around this we use a
 * global associative array and hash on the lwp's pr_addr as this is guaranteed
 * to be consistent.
 */
fbt:zfs:arc_cksum_verify:entry
{
	vrfy[curlwpsinfo->pr_addr] = 1;
}

/*
 * This is a bit icky to say the least. We're basically checking that the PC
 * that generated the event (presented in arg0 for kernel events) is in the
 * fletcher_2_native() function's address range and also that the underlying
 * thread has called fletcher_2_native() from within arc_cksum_verify().
 */
cpc:::PAPI_tot_cyc-kern-10000
/arg0 >= (uint64_t)&(`fletcher_2_native)
    && arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))
    && vrfy[curlwpsinfo->pr_addr] == 1/
{
	@[arg0] = count();
}

fbt:zfs:arc_cksum_verify:return
/vrfy[curlwpsinfo->pr_addr] == 1/
{
	vrfy[curlwpsinfo->pr_addr] = 0;
}

END
{
	printa("%a %@d\n", @);
}

I've elided the output of the above script as it looks remarkably similar to the previous one! However, it's the technique that was of interest here and not so much the result itself.

So, that's a very quick introduction to the cpc provider. There's a bunch of stuff I could have mentioned but didn't, so I'll try to do some more entries fleshing out a few other areas and points. I'm just the guy with the spanners bolting this stuff together; it's you guys out there who can do the clever stuff of applying this to real code to gain new insights. Let me know if there is anything that doesn't work as you thought, or if you have any ideas for other pieces that could be bolted on here. Have at it!



