Monday Feb 16, 2009

DTrace meets the CPU Performance Counters

It's been a while coming but it's here at last in build 109 of Solaris Nevada; the DTrace CPU Performance Counter provider (or 'cpc' provider for short). The cpc provider gives you the ability to profile your system by many different types of processor related events; the list of events is processor specific and usually quite large but typically includes events such as cycles executed, instructions executed, cache misses, TLB misses and many more. From a high level view, the cpc provider is essentially the same as the profile provider except that, where the profile provider allows you to profile a system using a fixed, time based source, the cpc provider lets you profile by events related to processor activity.

There's a bunch of things around the implementation that I'd like to talk about but that can probably wait for another day as we'll ease ourselves gently into the world of the cpc provider with some simple examples. Additional sources of information that you may find useful are the DTrace documentation and the cpc provider ARC case one pager. Note that most of the examples I give here are focused on two of the more common event types, instructions or cycles executed. However, we could have just as easily profiled behaviour around the cache hierarchy (l1/l2/l3 hits/misses) for example - we'd merely need to specify different event types to look at.
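
For instance, assuming your platform defines the PAPI_l2_dcm generic event, a hedged one-liner such as this (the overflow count of 10000 is arbitrary) would profile user mode L2 data cache misses by executable instead:

# dtrace -n 'cpc:::PAPI_l2_dcm-user-10000 { @[execname] = count(); }'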

Anyway, let's start with a simple example to see it in action:

Example 1: User Mode Instruction Profile by Executable

#cat userexecbyinst.d
cpc:::PAPI_tot_ins-user-5000
{
	@[execname] = count();
}

END
{
	trunc(@, 10);
	printa("%-30s %@d\\n", @);
}

# dtrace -qs ./userexecbyinst.d
^C
gnome-panel                    3760
soffice.bin                    5952
nautilus                       15255
gnome-terminal                 23363
metacity                       29627
wnck-applet                    36635
isapython2.4                   88264
thunderbird-bin                119269
firefox-bin                    319431
Xorg                           399781

Here we have a profile of our system which looks at executed user mode instructions. The specification of what we wanted to profile is contained in the probename of the cpc probe. The probename is made up of the following, hyphen delimited, fields:

  • the event to count - either a platform specific event name or a generic (PAPI) event name such as PAPI_tot_ins;
  • the mode to count in - 'user' for user mode or 'kernel' for kernel mode;
  • the overflow count - the number of events that must occur on a CPU before the probe fires there (5000 in the example above).
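
As an illustration of how the fields change the profile, here's a hedged variation on Example 1 that counts kernel mode instructions instead and fires every 20000 events (the count is arbitrary):

cpc:::PAPI_tot_ins-kernel-20000
{
	@[execname] = count();
}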

So, let's get a bit fruitier and take a different view of things. Here's a system wide view of user mode TLB misses by library:

Example 2: User Mode TLB miss Profile by Library

#cat usertlbmiss.d
#!/usr/sbin/dtrace -qs

cpc:::PAPI_tlb_dm-user-5000
{
	@[umod(arg1)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50A %10@d\\n", @);
}

# dtrace -qs ./usertlbmiss.d 
^C
libproc.so.1                                             274
libjemalloc.so                                           454
libcairo.so.2.20.0                                       739
libglib-2.0.so.0.1800.0                                  791
libmozjs.so                                              964
firefox-bin                                              1026
libjemalloc.so                                           1443
libc.so.1                                                1740
libxul.so                                                5651
Xorg                                                     6766

Note that probes from the cpc provider export the same arguments as the profile provider: arg0 will be non-zero (the kernel program counter) if the event occurred while operating in kernel mode, and arg1 will be non-zero (the user program counter) if the event occurred while we were operating in user mode.
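
This means a single enabling can attribute samples to the right address space just by checking which argument is non-zero. Here's a minimal sketch, assuming your platform accepts the 'all' mode in the probename:

/* 'all' mode (count in both user and kernel) is assumed to be available here */
cpc:::PAPI_tot_ins-all-10000
/arg0 != 0/
{
	@kernel[func(arg0)] = count();
}

cpc:::PAPI_tot_ins-all-10000
/arg1 != 0/
{
	@user[ufunc(arg1)] = count();
}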

Example 3: Selective User Mode Cycle Count Profile by Function

As this is DTrace, we can use predicates to prune the data we capture. Let's profile a couple of applications from a particular user but this time profile them by the number of cycles used:

# cat ./userbyappbycycles.d
cpc:::PAPI_tot_cyc-user-5000
/(execname == "firefox-bin" || execname == "thunderbird-bin") && uid == 101/
{
	@[ufunc(arg1)] = count();
}

END
{
	trunc(@, 10);
	printa("%-60A %@10d\\n", @);
}

# dtrace -qs ./userbyappbycycles.d
^C
libc.so.1`mutex_unlock                                             1544
libfontconfig.so.1`FcFontMatch                                     1671
libc.so.1`clear_lockbyte                                           1754
libgfx_gtk.so`__1cKnsImageGTKODrawComposited6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiiiii_v_       1987
libc.so.1`mutex_unlock_queue                                       2324
libc.so.1`mutex_lock_impl                                          2684
libgklayout.so`__1cPSelectorMatches6FrnRRuleProcessorData_pnNnsCSSSelector_ipnHnsIAtom_W_i_       3755
libgfx_gtk.so`__1cKnsImageGTKRDrawCompositeTile6MrnTnsIRenderingContext_pnRnsIDrawingSurface_iiiiiiii_v_       4597
libc.so.1`mutex_lock_impl                                          5280
libc.so.1`clear_lockbyte                                           5313

Just bear in mind that all the other user mode applications on the system are contributing to the event count here. If you wanted to only profile events from a particular set of applications then you would need to put them in their own processor set to make sure nothing else contributes to the event count.
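
A hedged sketch of how you might do that with psrset(1M) - the processor IDs, set ID and pid below are purely illustrative:

# psrset -c 2 3          # create a processor set from CPUs 2 and 3
# psrset -b 1 12345      # bind the process of interest (pid 12345 here) to set 1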

Example 4: Kernel mode instruction profile by module

Now we'll look at some examples of kernel profiling. I'm using filebench to generate some workload on top of a ZFS filesystem so these examples tend to focus on zfs. To start with, let's look at a kernel profile of which modules are executing the most instructions:

# cat kernbymod.d
cpc:::PAPI_tot_ins-kernel-5000
/arg0 != 0/
{
	@[mod(arg0)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50a %@10d\\n", @);
}

# dtrace -qs ./kernbymod.d
^C
dtrace                                                    348
ata                                                       625
specfs                                                    753
rootnex                                                   880
kcf                                                      1551
sha1                                                     5704
procfs                                                   6097
unix                                                   158137
genunix                                                161461
zfs                                                    575552

Not surprisingly, as we're running an intensive workload on top of a ZFS filesystem, zfs is at the top of the list. It may be interesting to break down the cycle usage within the zfs module so we'll do that next.

Example 5: Kernel mode cycle profile: functions from specified module

# cat kernmodbyfunc.d
#!/usr/sbin/dtrace -s


/*
 * Simple script to profile all functions within a given module ('zfs' is
 * the lucky chap here). The module range is just extracted from modinfo(1M)
 * output.
 */

inline uint64_t ZFS_MOD_START = 0xfffffffff77fb000;
inline uint64_t ZFS_MOD_END = 0xfffffffff78c7378;

cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= ZFS_MOD_START && arg0 < ZFS_MOD_END/
{
	@[func(arg0)] = count();
}

END
{
	trunc(@, 10);
	printa("%-50a %@d\\n", @);
}

# dtrace -qs ./kernmodbyfunc.d
^C
zfs`arc_write_ready                                1459
zfs`arc_write_done                                 1472
zfs`zio_done                                       1512
zfs`refcount_is_zero                               1836
zfs`zio_execute                                    1918
zfs`arc_cksum_verify                               2013
zfs`dbuf_verify                                    2167
zfs`lzjb_compress                                  3356
zfs`zil_commit_writer                              3706
zfs`fletcher_2_native                              497284
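
As the comment in the script says, the ZFS_MOD_START and ZFS_MOD_END values came straight from modinfo(1M): the second and third columns give the module's load address and size, so the end address is just load address plus size. The relevant line would look something like this (illustrative output, consistent with the constants used above):

# modinfo | grep zfs
 226 fffffffff77fb000  cc378 179   1  zfs (ZFS storage pool)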

Example 6: Kernel mode cycle profile of a given function

Taking the example above, we may now want to drill down on the fletcher_2_native() function to see the distribution of samples within the function to try and gain any insights into hotspots. The following script may look a bit hacky but it does the job of ensuring that we only store the PC for any events which were in this function.

# cat kernfunc.d
cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= (uint64_t)&(`fletcher_2_native)
    && arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))/
{
	@[arg0] = count();
}

END
{
	printa("%a %@d\\n", @);
}

# dtrace -qs ./kernfunc.d
^C
zfs`fletcher_2_native+0xc 1
zfs`fletcher_2_native+0x10 1
zfs`fletcher_2_native+0x17 1
zfs`fletcher_2_native+0x1b 1
zfs`fletcher_2_native+0x42 1
zfs`fletcher_2_native+0x46 1
zfs`fletcher_2_native+0x49 1
zfs`fletcher_2_native+0x5d 1
zfs`fletcher_2_native+0xa6 1
zfs`fletcher_2_native+0xce 1
zfs`fletcher_2_native+0xd2 1
zfs`fletcher_2_native+0x53 2
zfs`fletcher_2_native+0x5b 2
zfs`fletcher_2_native+0x83 2
zfs`fletcher_2_native+0xa3 2
zfs`fletcher_2_native+0xa8 2
zfs`fletcher_2_native+0x37 3
zfs`fletcher_2_native+0x3b 3
zfs`fletcher_2_native+0xca 4
zfs`fletcher_2_native+0x9c 9
zfs`fletcher_2_native+0xc6 27
zfs`fletcher_2_native+0x2b 35
zfs`fletcher_2_native+0xab 40
zfs`fletcher_2_native+0x1 56
zfs`fletcher_2_native+0x72 79
zfs`fletcher_2_native+0x91 709
zfs`fletcher_2_native+0x80 1331
zfs`fletcher_2_native+0x75 1346
zfs`fletcher_2_native+0x61 1382
zfs`fletcher_2_native+0x79 1409
zfs`fletcher_2_native+0x87 1437
zfs`fletcher_2_native+0x9f 1438
zfs`fletcher_2_native+0x68 1522
zfs`fletcher_2_native+0x99 1754
zfs`fletcher_2_native+0x7d 1818
zfs`fletcher_2_native+0x6f 1839
zfs`fletcher_2_native+0x95 2341
zfs`fletcher_2_native+0x8b 2517
zfs`fletcher_2_native+0x6b 38448

fletcher_2_native() is a simple piece of code that basically contains a single for() loop. If we map out the PCs in the above example then they map onto the instructions in the for loop as you'd expect.
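
If you want to see exactly which instructions those offsets correspond to, mdb(1) will disassemble the function for you; a quick sketch (output elided) is:

# echo "fletcher_2_native::dis" | mdb -k

You can then match the hot PCs above against the instructions in the loop body.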

Example 7: Selective kernel mode cycle profile of a given function

The fletcher_2_native code can be called from 3 separate sites. It may be the case that we are only interested in profiles of the code when it has been called from one of these sites. Here's an example of how we can just profile calls into fletcher_2_native when the caller is arc_cksum_verify:

# cat kerncaller-callee.d
/*
 * This is an example of selectively profiling a kernel function depending upon
 * who its caller is. The example we have here is only profiling the zfs
 * function, fletcher_2_native(), when it is called from arc_cksum_verify().
 * We can't use thread local variables to predicate on in the cpc probe as it
 * is executing in a high level interrupt context. To work around this we use a
 * global associative array and hash on the lwp's pr_addr as this is guaranteed
 * to be consistent.
 */

arc_cksum_verify:entry
{
	vrfy[curlwpsinfo->pr_addr] = 1;
}

/*
 * This is a bit icky to say the least. We're basically checking that the PC
 * that generated the event (presented in arg0 for kernel events) is in the
 * fletcher_2_native function's address range and also that the underlying
 * thread has called fletcher_2_native() from within arc_cksum_verify().
 */
cpc:::PAPI_tot_cyc-kernel-5000
/arg0 >= (uint64_t)&(`fletcher_2_native)
    && arg0 < ((uint64_t)&(`fletcher_2_native) + sizeof(`fletcher_2_native))
    && vrfy[curlwpsinfo->pr_addr] == 1/
{
	@[arg0] = count();
}

arc_cksum_verify:return
/vrfy[curlwpsinfo->pr_addr] == 1/
{
	vrfy[curlwpsinfo->pr_addr] = 0; 
}

END
{
	printa("%a %@d\\n", @);
}

I've elided the output of the above script as it looks remarkably similar to the previous one! However, it's the technique that was of interest here and not so much the result itself.

So, there's a very quick introduction to the cpc provider. There's a bunch of stuff that I could have mentioned that I didn't so I'll try and do some more entries fleshing out a few other areas and points. I'm just the guy with the spanners bolting this stuff together and it's you guys out there who can do the clever stuff of applying this to real code to gain new insights. Let me know if there is anything that doesn't work as you thought or if you have any ideas around other pieces that could be bolted on here. Have at it!

Friday Jan 16, 2009

Performance Counter Generic Events

In Opensolaris 2008.11 (Nevada build 100 if you want to be more precise) I introduced a sprinkling of sugar into the CPU performance counter subsystem known as Generic Events. Generic events should hopefully make this subsystem a bit easier to use for those of us that have bad memories or just don't like carting reference manuals for processors around when using the processor performance counters. First, we'll take a quick look at the problem and then look at what has been added.

Solaris has an extremely capable CPU performance counter subsystem that, not surprisingly, provides the ability to measure a whole host of interesting and useful events that occur when the processor is executing. The number of events available varies by processor but we're talking about things such as:

  • Instructions executed.
  • Cycles the processor was busy.
  • Data Cache misses at different levels in the cache hierarchy.
  • Instruction cache misses at different levels in the cache hierarchy.
  • TLB misses
  • Floating point operations of various types

So, each processor allows us to keep a count of events we're interested in; just say which event(s) you want to measure and go. Sounds simple? You would think so but it's not quite as easy to use as you might think. Every platform (Niagara2, AMD, Intel Core etc.) implements its own set of events that can be measured and these events are usually named according to the names used in the hardware vendor's programmer's reference manual for the processor in question. We use these event names with tools such as cpustat(1M) and cputrack(1M). As an example, let's measure a common event such as "instructions executed" on a T5140 which has 2 Niagara2 processors:

# cpustat -c Instr_cnt 1
   time cpu event      pic0 
  1.003   0  tick     25458 
  1.003   1  tick     13967 
  1.013   2  tick       956 
  1.023   3  tick      9240 
<chop>
  1.123 127  tick       627

On SPARC this event is pretty easy to identify as the name, Instr_cnt, is intuitive. However, let's say we want to measure the same thing on other Solaris systems we have around the place with different processor types. Our event names for the commonly used "instruction count" on a sample of supported processors look like:


Processor                 Event Name                          Optional Mask
Niagara2                  Instr_cnt
UltraSPARC I/II/III/IV    Instr_cnt
AMD Family 0xf/0x10       FR_retired_x86_instr_w_excp_intr
Intel P4                  instr_retired                       0x4
Intel P1/2/3              inst_retired
SPARC64 VI/VII            instruction_counts

The third column in the above table refers to an optional mask field that some events on some processors use. The mask is usually used to subdivide an event type into more specific events related to that type. For example, on an AMD system we can measure floating point instructions using the FP_dispatched_fpu_ops event which can be subdivided using a unit mask of 0x1 to measure floating point add instructions, or a unit mask of 0x2 to measure floating point multiply instructions.
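
For instance, you could count adds and multiplies in separate counters with something like the line below. This is a sketch only: I'm assuming the unit mask attribute is spelt 'umask' here, so check cpustat -h on your platform for the exact attribute name.

# cpustat -c FP_dispatched_fpu_ops,umask=0x1 -c FP_dispatched_fpu_ops,umask=0x2 1   # 'umask' attribute name is an assumption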

Another commonly used event type is that of "Total Cycles executed". Again, for the processors cited above:

Processor                 Event Name                          Optional Mask
Niagara2                  N/A
UltraSPARC I/II/III/IV+   Cycle_cnt
AMD Family 0xf/0x10       BU_cpu_clk_unhalted
Intel P4                  global_power_events                 0x1
Intel P1/2/3              cpu_clk_unhalted
SPARC64 VI/VII            cycle_counts

So, on a Pentium 4 based system, to measure user mode instructions executed along with user mode cycles you'd have to do:
#cpustat -c instr_retired,emask=0x4 -c global_power_events,emask=0x1 1

So, how are you supposed to know what the magic values are? Well, you really need to reference the programmer's reference manual for the processor you're working on. Obviously, this can become extremely tedious after a while, especially if you're not working with these things every day or if you have to work with several processor types.

Generic events provide the ability to refer to a particular event type with a single name across all platforms. So, for example, cycle count can now be referred to with the generic event name of PAPI_tot_cyc and instruction count with the PAPI_tot_ins event. By providing abstracted names for the commonly used event types you now no longer have to have a crib sheet with all the different names and masks for a good number of events. The above Pentium 4 example now looks like:

#cpustat -c PAPI_tot_ins -c PAPI_tot_cyc 1

In fact, that is now the invocation across all our platforms to measure instructions executed and cycle count (assuming the platform in question defines both events and can also accommodate both at once!).

So, you may ask, why the PAPI prefix? In order for us not to implement yet another naming scheme for processor events I decided to use the naming scheme defined by the PAPI project from the University of Tennessee. This project has put a good amount of thought and effort into producing a well defined naming scheme for event abstraction. If you know the PAPI work then you'll see that the Solaris generic event names equate to the PAPI single counter preset events.

Solaris now defines over 100 generic events and each platform defines whichever generic events it can map into a native platform event. Depending upon the hardware capabilities of the platform, some platforms can define a more extensive list of generic events than others. To discover the generic events that are implemented by a platform, look at the output of cpustat -h. Here's a sample from a Niagara2 based system:

#cpustat -h
Usage:
	cpustat [-c events] [-p period] [-nstD] [interval [count]]

	-c events specify processor events to be monitored
	-n	  suppress titles
	-p period cycle through event list periodically
	-s	  run user soaker thread for system-only events
	-t	  include %tick register
	-D	  enable debug mode
	-h	  print extended usage information

	Use cputrack(1) to monitor per-process statistics.

	CPU performance counter interface: UltraSPARC T2

	event specification syntax:
	[picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

	Generic Events:

	event[0-1]: PAPI_tot_ins PAPI_l1_dcm PAPI_l1_icm PAPI_l2_icm 
                 PAPI_l2_ldm PAPI_tlb_dm PAPI_tlb_im PAPI_tlb_tm 
                 PAPI_br_tkn PAPI_br_ins PAPI_ld_ins PAPI_sr_ins 

	See generic_events(3CPC) for descriptions of these events
<chop>

You'll note from this output that a new man page, generic_events(3CPC), has been added. It describes the full set of generic events and, for each platform, which generic events that platform defines along with the underlying platform event and mask that each one maps onto.

Hopefully, generic events will have made the Solaris performance counter subsystem a bit more approachable, especially for the more common use cases. The performance counters open up a whole world of opportunities in understanding application and system behaviour but sometimes they can be slightly daunting to use and map into real world behaviour. My next project to integrate, the DTrace CPU Performance Counter provider (which really should integrate soon, honest...) will hopefully open up a whole new world of exploration in the murky depths of the CPU performance counters. Watch this space.

Thursday Oct 02, 2008

Fast Reboot: tightening the loop on anonymous tracing

Back in the day, Bryan, Mike and Adam used to give a DTrace presentation that had a final slide entitled "The DTrace Revolution". I really liked this slide because it distilled into a single bullet point what, essentially, made DTrace so exciting for me. The bullet point was:

  • DTrace tightens the diagnosis loop:

    hypothesis -> instrumentation -> data gathering -> analysis -> hypothesis

This simple statement succinctly puts down the steps we go through when analysing system behaviour and the realisation that DTrace could make this loop a whole lot shorter made me excessively happy (illegally so at times I'd say).

I mention the above because Sherry Moore and her team recently putback (I should say pushed now I guess) a project into Solaris which makes the diagnosis loop a whole lot tighter for anonymous tracing; that project is the Fast Reboot project. This project, not surprisingly, allows us to reboot x86 based systems a whole lot quicker than we used to. For example, my little v20z test system now reboots to the console prompt in around 40 seconds instead of 3 minutes!! Hopefully you won't need to reboot too often but, if you do, that's the kind of juice you need! It really is amazing stuff.

For those that may have forgotten, anonymous tracing gives us the ability to trace from really early on in the Solaris boot process. Some of you may never need to do this but, for those of us that do, the data gathering phase in the diagnosis loop can become a real pain, especially on highly configured systems. Reducing the time to reboot systems by such a large degree is a positive boon for the anonymous tracer!

Here's a quick demo (yes, I know you really need to see Fast Reboot in action but text will have to do) of the up and coming CPU Performance Counter provider in use with anonymous tracing to look at kernel L2 Data Cache misses generated during the boot process. I think this is pretty much a first so you saw it here first:

# cat l2miss-bycpu.d 
#!/usr/sbin/dtrace -s

#pragma D option quiet

cpc:::PAPI_l2_dcm-kernel-100
{
        @[func(arg0)] = count();
}

END
{
        trunc(@, 30);
        printf("%-60s %16s\\n\\n", "FUNCTION", "FREQ");
        printa("%-60a %16@d\\n", @);
}
# dtrace -A -s ./l2miss-bycpu.d 
dtrace: saved anonymous enabling in /kernel/drv/dtrace.conf
dtrace: added forceload directives to /etc/system
dtrace: run update_drv(1M) or reboot to enable changes
# reboot -f
Oct  2 14:12:51 va64-v20zn-gmp03 reboot: initiated by LOGIN on /dev/console
Oct  2 14:13:37 va64-v20zn-gmp03 syslogd: going down on signal 15
updating /platform/i86pc/amd64/boot_archive
updating /platform/i86pc/boot_archive
Fast reboot.
syncing file systems... done
SunOS Release 5.11 Version dtrace-cpc-hg 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
NOTICE: enabling probe 0 (cpc:::PAPI_l2_dcm-kernel-100)
NOTICE: enabling probe 1 (:::END)
NOTICE: enabling probe 2 (dtrace:::ERROR)

Hostname: va64-v20zn-gmp03
NIS+ domain name is eu.cte.sun.com
Reading ZFS config: done.

va64-v20zn-gmp03 console login: root
Password: 
Oct  2 14:14:27 va64-v20zn-gmp03 login: ROOT LOGIN /dev/console
Last login: Thu Oct  2 14:12:17 on console
Sun Microsystems Inc.   SunOS 5.11      dtrace-cpc-hg   Sep. 24, 2008
SunOS Internal Development:  jhaslam 2008-09-24 [dtrace-cpc-hg]
bfu'ed from /cpc on 2008-09-24
Sun Microsystems Inc.   SunOS 5.11      snv_96  November 2008
# dtrace -ae
FUNCTION                                                                 FREQ

unix`page_try_reclaim_lock                                               2862
unix`x86_hm_exit                                                         2883
genunix`copy_pattern                                                     2920
genunix`checkpage                                                        2975
unix`page_create_va                                                      2984
unix`disp_getwork                                                        3002
unix`splr                                                                3038
unix`atomic_and_64                                                       3076
genunix`kmem_log_enter                                                   3202
unix`bcopy                                                               3237
unix`sys_syscall32                                                       3615
unix`page_get_mnode_freelist                                             3858
genunix`kmem_cache_free_debug                                            4423
unix`mutex_exit                                                          4498
unix`tsc_read                                                            4659
genunix`kmem_hash_rescale                                                4756
unix`atomic_add_32                                                       4786
unix`bzero                                                               4809
unix`atomic_add_64                                                       4924
unix`do_splx                                                             4951
unix`strcmp                                                              5156
unix`kcopy                                                               6318
genunix`verify_and_copy_pattern                                          6833
unix`page_lookup_create                                                  7916
genunix`kmem_cache_alloc_debug                                           8007
genunix`getpcstack                                                       8370
dtrace`dtrace_unregister                                                11502
unix`hwblkpagecopy                                                      28359
unix`mutex_enter                                                        55687

And there you go! No waiting ages for the various BIOS checks and device probing - just straight back to us so we can get on with some useful work. Now, why exactly does dtrace_unregister() take so many L2 cache misses...

Thursday Jul 17, 2008

This one goes to eleven...

I may play a Strat and a Marshall but I doubt I'll ever be a Rock Star (though I do play a mean Whole Lotta Rosie :-) ). I'll have to take fame where I can get it then and that would be as a JavaOne 2008 Rock Star. Shameless self promotion I know but you'll have to excuse me just this once.

For those that attended our session, thanks for the votes - we had a great time doing it! I and my fellow Rock Star salute you!

Monday May 12, 2008

DTrace @ JavaOne

Well, I can truly say that I had a great time at this year's JavaOne. There were several highlights for me but not all of them were technical (Smashmouth were absolutely excellent on Thursday night). I guess what I enjoyed the most was seeing what's going off in technology areas that are so different to the one I work in for my day to day activities and I picked up some ideas for pet projects that may (or may not) see the light of day this year...

Down to DTrace and Java - it was exciting to see how much interest there is in the marriage of these two technologies. The demo booths in the pavilion seemed to be permanently rammed with people wanting to see what's going off in this area and the hands on labs were very well attended (hats off to Angelo et al.). Simon Ritter and I got the chance to take up the baton from Adam and Jarod this year to present two sessions on DTrace and Java (well, we did the same session twice anyway!). We had a fantastic time doing this and many thanks to everyone who came to see the show and also to those that asked so many interesting and well thought out questions afterwards.

DTrace tends to be a hands-on technology and demos and talks are best when code is cranked live so that is what we (mostly) did. One of the issues though when demo'ing DTrace is having a workload to observe that isn't too contrived and simplistic but one that also isn't overly complicated. For our JavaOne talk I wanted some Java code that everyone could feel at home with and, standing on the shoulders of giants, I opted for a game from my youth, Space Invaders. I used Kevin Glass's excellent Space Invader game as it was designed as a tutorial and is therefore simple and perfect for a demo like this. As a bunch of people asked for the scripts for this session I thought I'd put down here what I did. What follows are three examples of what we showed on the day in our session. Apologies if what I write is over prescriptive at times for the DTrace long timers but hopefully a bit of detail may help any newcomers to DTrace.

The main aim of this session was to show how you can use the dvm and hotspot providers to obtain new views on your Java code. I then wanted to show how the great work that is being done by Keith McGuigan and team on Java Statically Defined Tracing can be used to enhance that view. All the Java focussed demos of the session (there were some plain old DTrace ones as well) were done using the hotspot provider, which comes with Java SE6, and the SE7 JVM for the JSDT demos. I wanted to show the approach that I'd taken when first analysing the behaviour of this game so that the attendees could get a view of the kind of approaches you can take when looking at such an application (which may not be the most optimal way BTW!). Anyway, onto the code and analysis.

Example 1 - Looking at the methods

I had the code in one workspace in a NetBeans project so that I could easily modify and rebuild it quickly. First off I just wanted to look at the methods that were being called to see if anything of interest turned up:

$ cat meth.d
#!/usr/sbin/dtrace -s

hotspot$target:::method-entry
{
	this->str = strjoin(".", copyinstr(arg3, arg4));
	@[strjoin(copyinstr(arg1, arg2), this->str)] = count();
}

Note that the strjoin()s that I use are just there to make the output pretty. We grab the pid of the relevant JVM:

$ jps
806 Main
813 Game
814 Jps

...run the script and play the game for a few seconds:

$ dtrace -q -s ./meth.d -p 813
^C
<chop>
  java/util/concurrent/locks/AbstractQueuedSynchronizer.release             5153
  java/util/concurrent/locks/ReentrantLock$NonfairSync.lock             5153
  java/util/concurrent/locks/ReentrantLock$Sync.tryRelease             5153
  sun/java2d/SunGraphics2D.drawImage                             7466
  sun/java2d/pipe/DrawImage.copyImage                            7466
  java/awt/Rectangle.intersects                                108993
  org/newdawn/spaceinvaders/Entity.collidesWith                108994
  java/awt/image/BufferedImage.getHeight                       221725
  java/awt/image/BufferedImage.getWidth                        221725
  java/util/ArrayList.RangeCheck                               227932

So, from the look of the method names this is collision detection code which may or may not be interesting. Let's just take a look at where these methods are being executed by getting stack traces of the RangeCheck method:

$ cat range.d
#!/usr/sbin/dtrace -s

hotspot$target:::method-entry
/copyinstr(arg3, arg4) == "RangeCheck"/
{
	@[jstack(30, 16384)] = count();
}
$ dtrace -q -s ./range.d -p 813
^C
<chop>
              libjvm.so`__1cNSharedRuntimeTdtrace_method_entry6FpnKJavaThread_pnNmethodOopDesc__i_+0x7b
              java/util/ArrayList.RangeCheck(I)V*
              org/newdawn/spaceinvaders/Game.gameLoop()V*
              org/newdawn/spaceinvaders/Game.main([Ljava/lang/String;)V
              StubRoutines (1)
              libjvm.so`__1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x1a3
              libjvm.so`__1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v2468_v_+0x27
              libjvm.so`__1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_pnGThread__v_+0x2f
              libjvm.so`__1cRjni_invoke_static6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_+0x1df
              libjvm.so`jni_CallStaticVoidMethod+0x154
              java`JavaMain+0xd30
              libc.so.1`_thr_setup+0x52
              libc.so.1`_lwp_start
             5153

We see that most of them are coming out of the Game.gameLoop() method which probably makes sense if it's collision detection. Looking at the code for this we see:

        for (int p=0;p<entities.size();p++) {
                for (int s=p+1;s<entities.size();s++) {
			Entity me = (Entity) entities.get(p);
			Entity him = (Entity) entities.get(s);
					
			if (me.collidesWith(him)) {
					me.collidedWith(him);
					him.collidedWith(me);
			}
		}
        }

The entities object is an ArrayList so the RangeCheck calls are coming from there (looks like the get() is being optimised out of the stacktrace). Now that we have the area of code in view we can easily see that there is an inefficiency here. This loop goes along the list of entities (aliens, missiles) and sees if the current entity (me) intersects with any other entities. However, note that the current entity, me, is being retrieved on every iteration of the inner loop when it only needs to be fetched on the outer loop. We can hoist that out as shown below, rebuild and see what the effect is:

        for (int p=0;p<entities.size();p++) {
		Entity me = (Entity) entities.get(p);
                for (int s=p+1;s<entities.size();s++) {
			Entity him = (Entity) entities.get(s);
					
			if (me.collidesWith(him)) {
					me.collidedWith(him);
					him.collidedWith(me);
			}
		}
        }

$ dtrace -q -s ./meth.d -p 833
<chop>
  java/util/concurrent/locks/AbstractQueuedSynchronizer.release             4747
  java/util/concurrent/locks/ReentrantLock$NonfairSync.lock             4747
  java/util/concurrent/locks/ReentrantLock$Sync.tryRelease             4747
  sun/java2d/SunGraphics2D.drawImage                             7144
  sun/java2d/pipe/DrawImage.copyImage                            7144
  java/awt/Rectangle.intersects                                104874
  org/newdawn/spaceinvaders/Entity.collidesWith                104874
  java/util/ArrayList.RangeCheck                               117608
  java/awt/image/BufferedImage.getHeight                       213326
  java/awt/image/BufferedImage.getWidth                        213326

And we see that the RangeCheck method has dropped down the list accordingly.

Example 2 - Shots gone Wild!

I wanted to provide an example where we could write some simple D to drill down on particular aspects of the game's behaviour that may be of interest. In a game like this there are really only a few things that are of great interest and any interaction a user has is one of these areas. The example I took next was to look at detecting when a missile has gone off the screen (as opposed to hitting an alien) to demonstrate the kind of things we do to differentiate events in D scripts. The code for moving a missile is the ShotEntity.move method:

	public void move(long delta) {
		// proceed with normal move
		super.move(delta);
		
		// if we shot off the screen, remove ourselfs
                
		if (y < -100) {
			game.removeEntity(this);
		}
	}

So, we just need to look for the case where we're calling game.removeEntity when we are in the ShotEntity.move method. The script that I created to show this is:

$ cat shotmiss.d
#!/usr/sbin/dtrace -s

hotspot$target:::method-entry
/basename(copyinstr(arg1,arg2)) == "ShotEntity" &&
	copyinstr(arg3, arg4) == "move"/
{
	self->in = 1;
}

hotspot$target:::method-entry
/self->in && copyinstr(arg3, arg4) == "removeEntity"/
{
	printf("Missile Missed!\\n");
}

hotspot$target:::method-return
/basename(copyinstr(arg1,arg2)) == "ShotEntity" &&
	copyinstr(arg3, arg4) == "move" && self->in/
{
	self->in = 0;
}

So, we probe method-entry and build up the predicate to show our interest in ShotEntity.move. Note that I use the D basename() subroutine to pull the class name out of the fully qualified package name. By setting the thread local variable, in, when we enter the move method we can then predicate upon that in the following clause to ensure that we only show interest in removeEntity calls that originate from ShotEntity.move. The third clause makes sure that we unset the in thread local variable when we return from the ShotEntity.move method as we're done.
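
Running it is the same as the earlier hotspot scripts: grab the game's pid with jps and attach (the pid here is just the one from the earlier example):

$ dtrace -q -s ./shotmiss.d -p 813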

Java Statically Defined Tracing

Java Statically Defined Tracing is the Java equivalent of USDT and should be arriving in Java SE7. Augmenting the SpaceInvader game to use JSDT was simple. I first defined my provider by extending the Provider interface:

@ProviderName("Invader")
interface MyProvider extends Provider {
    @ProbeName("boom") void death(String type);
    void miss(int nothing);
}

Note that although I have named the provider, MyProvider, I have used an annotation to override this name so that it will appear as Invader. You'll see I also did the same with the probe named death. This is for no other reason than to demonstrate the annotations. As you see above, the provider has two probes defined: death, which takes a String as its argument, and miss, which takes an integer as its argument.

In the Game class's constructor I create an instance of the ProviderFactory from which I then create my provider (read Keith's blog referenced above for why it's done this way):

        provider = ProviderFactory.getDefaultFactory().createProvider(MyProvider.class);

Note that provider is defined as public MyProvider provider in the Game class's public data. Adding probes is then trivial as we just insert calls in our code where suitable. I had two probe placements, one of which is taken from the last hotspot example to show missiles going off screen. The ShotEntity.move code now becomes:

 public void move(long delta) {
    // proceed with normal move
    super.move(delta);
    
    // if we shot off the screen, remove ourselfs
    if (y < -100) {
      game.provider.miss(1);
      game.removeEntity(this);
    }
  }

I also added the death probe into where Entities (aliens and missiles) are destroyed in Game.removeEntity. I also added in here some code to demonstrate the JSDT mechanism for checking whether a probe is enabled or not. We can use this technique when the code needed to construct probe arguments is costly and we therefore only want to execute it when the probe is enabled. The code for this looks like:

	public void removeEntity(Entity entity) {
                Method m = null;
                String ent = null;
                
                try {
                        m = MyProvider.class.getMethod("death", String.class);
                }
                catch (NoSuchMethodException e) {
                        System.out.println("Can't find probe");
                }
                Probe p = provider.getProbe(m);
                
                if (p.isEnabled() ) {
                    // Do argument prep in here
                    if (entity instanceof AlienEntity)
                        ent = "Alien";
                    else if (entity instanceof ShotEntity)
                        ent = "Missile";
                    else
                        ent = "Unknown Object";
                }
 
                provider.death(ent);
		removeList.add(entity);
	}

We obtain a reference to the probe from the provider and can then verify whether or not it is currently enabled. Smart! If it is enabled I then check what type of object is being removed and prepare a suitable string to pass as an argument to the death probe. Building the code and running it we then see the following probes available:

$ dtrace -l -P Invader\*
   ID   PROVIDER            MODULE                          FUNCTION NAME
70483 Invader899      java_tracing                       unspecified miss
70484 Invader899      java_tracing                       unspecified boom

I then cranked the following script which shows us how many aliens were destroyed, how many missiles were fired and how many of those went off the screen (i.e. missed their targets):

$ cat alien.d
#!/usr/sbin/dtrace -s

Invader*:::miss
{
	@[probename] = count();
}

Invader*:::boom
{
	@[copyinstr(arg0)] = count();
}

Running this against the game and playing a level we saw:

$ dtrace -q -s ./alien.d    
^C

  miss                                                             18
  Alien                                                            60
  Missile                                                          78
$ 

18 misses out of 78 shots! Poor indeed! One thing that was immediately obvious to those watching this was the much, much reduced probe effect of the JSDT version. With the hotspot techniques we have visible probe effect owing to the mechanisms being used by that provider but the JSDT version is slick and fast.

The last thing I demo'd for fun was the integration of these newly created JSDT based probes with a visualisation technique. I took the Chime Visualisation tool and created a display so we could graph the killing in real time! I did cheat and create the display before the session but it only took a few minutes to do. Here's a screenshot of some realtime destruction:

As you can see from the screenshot, I was a real killing machine until near the end (missiles fired in blue, aliens killed in green) when I couldn't hit anything at all and my miss count (in red) rises rapidly! The point of this demo was not to prove my lack of prowess at SpaceInvaders but rather to show that we have a set of Java APIs that you can hook into easily so it's not much effort to plug DTrace into a Java management framework or visualisation tools (just as two examples).

Anyway, that's about it for now. I did do some more hotspot demos which were interesting but they'll have to wait for another time. If you came along to our sessions at JavaOne then I hope you enjoyed it as much as we did.

Monday Mar 17, 2008

dtrace.conf(08)

I had the fortune of attending the first dtrace.conf last Friday in San Francisco and I have to say that it was a fantastic event. There are many aspects that I enjoyed but what really made it for me was the chance to have great technical and social exchanges with such a diverse bunch of engineers. It didn't matter who we all worked for; DTrace was the great leveller. I really had some of the most thought provoking technical conversations I've had in a long time. By the end of the day my head was spinning with the amount and quality of ideas that I'd heard (well, it was either the ideas or Richard McDougall's driving...).

Many thanks to the DTrace Patriarchs for organising the event, Forsythe for funding it and for the many attendees who willingly shared and listened. Absolutely top day.

Wednesday Sep 26, 2007

DTrace @ LOSUG

The London OpenSolaris User Group meets once a month at the Sun Customer Briefing Center on King William Street (the North end of London Bridge). If you ever get the chance to go along to one of the meetings then I can heartily recommend it as there is always a nice amount of pretty good food and drink supplied by a generous, anonymous benefactor. In addition to this there is a regular group of between 30-50 people who make for good chat during the evening.

Ages ago I had agreed to present something on DTrace at the September event. However, everyone who attends has no doubt been exposed to the Haslam DTrace tour so I couldn't force that upon those poor souls. What I came up with was a quick walk through of some of the new features that have been introduced into DTrace since Solaris 10 first shipped and the slides are given below. The slides are in no way meant to be a complete set of features but just a selection of ones that I thought may be of interest. On the night I demo'd various pieces but I have included some example code in the slides so you can (hopefully) see what is going off. If a feature doesn't have an example it means I ran out of time by that point!

Thursday Jul 12, 2007

Grab your Irn-Bru: Accessing a programs local variables using DTrace

A question that has raised its head several times on the dtrace-discuss@opensolaris.org forum has been, "how do I access the local variables in my program"? Unfortunately, the answer to this isn't particularly pretty so I thought I'd provide a simple (and contrived...) example of how to do this. What we're doing here relies on getting just a teeny bit low down and dirty so you may want to grab your Irn-Bru laddy! The example used here was done on an x86 based system but the principle is exactly the same for any other architecture. Take the extremely simple piece of code below:

void
foo(void)
{
        int i;
        const char *string = "Foo bar test string";

        for (i = 0; i < strlen(string); i++) {
                printf("%c", string[i]);
        }

        printf("%\\n");
}

We're going to use DTrace to tell us the value of the loop counter i when we encounter a space in the string that is being iterated over.

Here's where it starts to become slightly unpleasant. The storage for a program's local variables will either be in registers or somewhere in memory, depending upon where in the program's execution you want to view them. In the above code we want to see the value of the i variable when a space character has been found (i.e. when string[i] contains a space). To do this we need to first disassemble the code and we can do that using mdb(1):
> foo::dis
foo:                            pushl  %ebp
foo+1:                          movl   %esp,%ebp
foo+3:                          subl   $0x8,%esp
foo+6:                          leal   0x8050ac8,%eax
foo+0xc:                        movl   %eax,-0x8(%ebp)
foo+0xf:                        movl   $0x0,-0x4(%ebp)
foo+0x16:                       movl   -0x8(%ebp),%eax
foo+0x19:                       pushl  %eax
foo+0x1a:                       call   -0x163   < PLT:strlen >
foo+0x1f:                       addl   $0x4,%esp
foo+0x22:                       movl   -0x4(%ebp),%edx
foo+0x25:                       cmpl   %eax,%edx
foo+0x27:                       jae    +0x32     < foo+0x5b >
foo+0x29:                       movl   -0x8(%ebp),%eax
foo+0x2c:                       addl   -0x4(%ebp),%eax
foo+0x2f:                       movsbl 0x0(%eax),%eax
foo+0x33:                       pushl  %eax
foo+0x34:                       pushl  $0x8050adc
foo+0x39:                       call   -0x172   < PLT:printf >
foo+0x3e:                       addl   $0x8,%esp
foo+0x41:                       movl   -0x4(%ebp),%eax
foo+0x44:                       incl   %eax
foo+0x45:                       movl   %eax,-0x4(%ebp)
foo+0x48:                       movl   -0x8(%ebp),%eax
foo+0x4b:                       pushl  %eax
foo+0x4c:                       call   -0x195   < PLT:strlen >
foo+0x51:                       addl   $0x4,%esp
foo+0x54:                       movl   -0x4(%ebp),%edx
foo+0x57:                       cmpl   %eax,%edx
foo+0x59:                       jb     -0x32     < foo+0x29 >
foo+0x5b:                       pushl  $0x8050ae0
foo+0x60:                       call   -0x199   < PLT:printf >
foo+0x65:                       addl   $0x4,%esp
foo+0x68:                       leave  
foo+0x69:                       ret    

So, from the above instructions we need to figure out where we can insert a probe with DTrace so we can then get our hands on the relevant data! Well, it's not as bad as it looks really. Our two variables i and string have space created for them on the stack early on:

foo+6:                          leal   0x8050ac8,%eax
foo+0xc:                        movl   %eax,-0x8(%ebp)
foo+0xf:                        movl   $0x0,-0x4(%ebp)

So, i is at %ebp-0x4 and string is at %ebp-0x8. There are a couple of places in the code where we could insert a probe to inspect whether or not the current index into the string is a space character. The place I'll choose here is where we load up the current character being indexed for the printf() call. You'll notice that for the call to printf() we place on the stack the character to print and the printf format modifier. Here are the instructions which do that:
foo+0x33:                       pushl  %eax
foo+0x34:                       pushl  $0x8050adc
foo+0x39:                       call   -0x172    < PLT:printf >

And we'll just verify that foo+0x34 is loading up the printf modifier string:
> 0x8050adc/s
0x8050adc:      %c

So, we're in the right place. The character to be printed out (i.e. the current value of string[i]) will be found in the %eax register at foo+0x33 so this is where we'll place a probe with the pid provider. What we can do is to use the DTrace uregs[] array to access the %eax register and predicate on whether or not we are currently looking at a space character (the space character is hex value 0x20). If the register contains the value 0x20 we can then extract the current value of the loop counter i which is found on the stack at %ebp-0x4. So our script looks like this:
#cat local1.d
#pragma D option quiet

inline int SPACE = 0x20;

pid$target:a.out:foo:33
/uregs[R_EAX] == SPACE/
{
        this->pos = *(int *)(copyin(uregs[R_EBP]-0x4, sizeof(int)));
        printf("Space found at char %d\n", this->pos + 1);
}

You'll note that to access the memory location for the loop counter i (the program's stack in this case), we need to use the copyin() DTrace subroutine. Executing the above script on this code we get the following output:
# dtrace -s ./local1.d -c ./local
Foo bar test string
Space found at char 4
Space found at char 8
Space found at char 13

Voila! There you have it. It's a bit grim I know and forces you to rely on knowing a bit about the assembly language of the architecture you're executing on. However, it can prove to be an invaluable technique for debugging those hard to get at problems.

Good luck!

Monday May 21, 2007

The DTrace deadman mechanism


Tuesday May 15, 2007

DTrace, JavaU and the BSOD

Last week, my old mucker Phil Harman and I were fortunate enough to present a 3 hour session at JavaU on DTrace. We had a great time doing this session and I hope that those of you who attended got something out of it. We try and do our sessions with as few slides as possible and as much live demo as possible - this makes for a pretty hectic 3 hours for whoever is doing the typing (as the younger of the dynamic duo I don't mind!). I promised to post the scripts that were produced from this session and I will do that shortly (or at least pointers to where you can get better versions of, essentially, the same scripts :-) ).

One of the things I like to cover in my talks are what I call "fun things" to do with DTrace. By "fun" I don't mean side splitting laughter fun but more geek orientated antics that can be got up to with DTrace. One of the examples I give is a bit of a cheat but amusing nonetheless. One day, no doubt tired of hearing me proclaim that DTrace can solve virtually any problem, a Sun colleague challenged me with this one: he bet me that I couldn't make Solaris generate a BSOD. At first this may seem a strange challenge but it's one that DTrace can (kind of) address...

I set about it like this; when a binary is executed the kernel reads the first few bytes of the executable to determine what type of file it is - check out the gexec() code for how we do this. I'm going to use the findexec_by_hdr() function as the buffer that contains the executable's magic is passed in as the first argument. By enabling a probe on this function I can look at the first two bytes of the file that have been read and determine whether or not a Windows executable has been executed (the first two bytes of a Windows 9x executable are 'MZ'). If the predicate evaluates to true we can then use the system() action to change my GNOME desktop background to that of a BSOD jpeg! See, I said it was a dubious cheat but it usually gets a laugh. The script is:

#!/usr/sbin/dtrace -ws

#pragma D option quiet

fbt::findexec_by_hdr:entry
/ *args[0] == 'M' && *(args[0]+1) == 'Z'/
{
        raise(SIGKILL);
        system("/usr/bin/gconftool-2 --type=string --set /desktop/gnome/background/picture_options stretched");
        system("/usr/bin/gconftool-2 --type=string --set /desktop/gnome/background/picture_filename /local/JavaU/bsod/BSOD.jpg");
}
Before the script is run my desktop looks like:

If I then run the above D script and try to execute a random Windows executable my desktop changes to look like:

Who says you can't do everything with DTrace!

Tuesday Jul 18, 2006

DTrace at LUGRadio Live 2006

They don't let me out much nowadays but this weekend is a little different. I'll be off to sunny Wolverhampton on Sunday to speak about DTrace at LUGRadio Live 2006. Looks like I've got the death slot just before lunch so I'll try and make it as interesting as I can for those that can come along. Who knows, I may even bring some BrandZ along with me (if time allows).

Hopefully see you there!

Friday Jun 23, 2006

Determining Opteron processor revisions on Solaris



For a while now I've been doing quite a bit of work on the Opteron processor. For some of this work it's important to know which revision of processor you currently have in the machine you've cobbled together. This should be a doddle, right? Well, not quite.

So, how do you know if you have a rev B or a rev E processor then? The processor revision is determined by a bunch of values returned by the CPUID instruction - these include the family, model and step of the processor. These values can then be used in a lookup table which is given in the Opteron revision guide which can be found here. On a Solaris system the family, model and step can be found using psrinfo(1M):
# psrinfo -pv
The physical processor has 2 virtual processors (0 1)
  x86 (AuthenticAMD family 15 model 33 step 2 clock 2190 MHz)
        Dual Core AMD Opteron(tm) Processor 275
The physical processor has 2 virtual processors (2 3)
  x86 (AuthenticAMD family 15 model 33 step 2 clock 2190 MHz)
        Dual Core AMD Opteron(tm) Processor 275

So, armed with these numbers I can go and look up the processor revision. However, if you take a look at the revision guide, you'll see it's not quite as straightforward as you may hope (to say the least).
However, you can cheat a bit. In the AMD memory controller driver in Solaris the salient information contained in the previously mentioned table is represented in the mc_rev_map[] array:
static const mc_rev_map_t mc_rev_map[] = {
        { 0xf, 0x00, 0x0f, MC_REV_PRE_D, "B/C/CG" },
        { 0xf, 0x10, 0x1f, MC_REV_D_E, "D" },
        { 0xf, 0x20, 0x3f, MC_REV_D_E, "E" },
        { 0xf, 0x40, 0x5f, MC_REV_F, "F" },
        { 0, 0, 0, MC_REV_UNKNOWN, NULL }
};

Where a mc_rev_map_t looks like:
typedef struct mc_rev_map {
        uint_t rm_family;
        uint_t rm_modello;
        uint_t rm_modelhi;
        uint_t rm_rev;
        const char *rm_name;
} mc_rev_map_t;

Using the contents of this array you can see which revision you have by simply seeing which range (using the rm_modello and rm_modelhi members) your processor's model falls within. We can see from the psrinfo output given earlier that this machine has Rev E processors.
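
To spell the lookup out: psrinfo reported family 15 and model 33, i.e. family 0xf and model 0x21, and 0x21 falls between the rm_modello of 0x20 and the rm_modelhi of 0x3f in the third row of the table, which gives MC_REV_D_E and the name "E".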

However, instead of doing this we can use mdb to do our work for us. Every memory controller on the system is represented in a struct mc and these, conveniently, contain a member, mc_revname, which identifies the processor revision of its associated processor. We can look at all the memory controllers on the system using the mc_list global variable which is a pointer to the front of the list of memory controller structures:
> *mc_list::list mc_t mc_next | ::print mc_t mc_revname
mc_revname = 0xfffffffff01eb550 "E"
mc_revname = 0xfffffffff01eb550 "E"

It's always more fun with mdb!!

Tuesday Jan 17, 2006

Small sampling interval with cpustat(1M)

It never ceases to amaze me that, even with tools that I use quite frequently, I can still learn new things about those tools. I recently had one of those experiences with cpustat(1M), a tool used to read performance counters from CPUs. Like most other observational tools of this ilk, cpustat allows you to specify a sampling interval. However, unlike most other tools, it allows you to specify intervals with a sub-second granularity - this is something I never knew.

The following example is on an Opteron based system. I'm using 2 of its 4 pic counters (a pic counter is a hardware counter used to record events of interest) to measure user mode level 1 icache misses in pic0 and system mode level 1 icache misses in pic1. The sampling interval is every half a second:

# cpustat -c pic0=IC_miss,pic1=IC_miss,sys1,nouser1 0.5
   time cpu event      pic0      pic1 
  0.507   1  tick       796      3126 
  0.507   0  tick       574      3785 
  1.007   0  tick       263      3564 
  1.007   1  tick     21584      4851 
  1.507   0  tick      1416      5542 
  1.507   1  tick      6116     12318 
  2.007   0  tick       753      3817 
  2.007   1  tick       869      2507 
  2.507   1  tick      1345      3150 
  2.507   0  tick       185      2791 

Simple stuff but something which had evaded me until now. The most excellent Sriram brought this to my attention - thanks Sriram.

Monday Jan 09, 2006

DTrace and mdb(1)

Just before Christmas, I was unfortunate enough to do a slot at the London Opensolaris User Group. I say unfortunate because my session was supposed to be about DTrace Internals. Actually, the word Internals is far too grand a word for what I aimed to do. Really I should replace Internals with "a bit more under the hood than you'd normally go...". Anyway, I digress.

I was also unfortunate because Adam Leventhal was in London at the time and came along to lend a hand! My presentation itself (the material) was pretty bad really owing to the lack of time I'd had to put into what is, essentially, a very large topic. However, I hope I wasn't too dull and I hope that I wasn't too inaccurate in my description of how little bits of DTrace work.

The vast topic of how DTrace actually works is hugely interesting and I'll do my best to put some of my pitted understanding up here in the not-too-distant future. One thing that I put as an addendum to my presentation was some mdb(1) incantations that allow you to extract a bit of information about what is going on around DTrace itself.

DTrace has several mdb(1) dcmds of interest that allow us to shine some light on the workings of DTrace. Firstly, let's take a couple of the simple ones starting with the ::dtrace_state dcmd:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba uhci s1394 fctl nca lofs audiosup zfs random nfs sppp ptm ipc crypto logindmux ]
> ::help dtrace_state

NAME
  dtrace_state - print active DTrace consumers

The ::dtrace_state dcmd actually has a very good description which says it all really. You can use it to display all or individual consumers that are running on the system. Used without any arguments we see:

> ::dtrace_state
            ADDR MINOR             PROC NAME                         FILE
ffffffff8155c800     2 ffffffff856be048 dtrace           ffffffff857a84d0
ffffffff8155c000     3 ffffffff856af190 dtrace           ffffffff857a8498

So, we have two consumers running. The interesting bit is the address given in the ADDR column. This is a pointer to a dtrace_state structure. The dtrace_state structure is of huge interest to us when figuring out what a consumer is doing as it holds kernel state for a consumer.

We can then use the ::dtrace dcmd to extract any data that is being held in the kernel for that consumer (data that hasn't yet been read by the consumer). Let's take the second consumer's dtrace_state address and feed it into the dtrace dcmd:

> ffffffff8155c000::dtrace
  fsflush                                                           1
  Xorg                                                              2
  battstat-applet-                                                  3
  gnome-netstatus-                                                  3
  nautilus                                                          3
  dtrace                                                            4
  mixer_applet2                                                     4
  soffice.bin                                                       4
  mdb                                                               5
  powernowd                                                         6
  java                                                             34
  sched                                                           100

It would be very interesting to know what probes were enabled and what the script looked like that generated that output. Well, that's where things get a bit tougher and I'll do that one in another entry.

I'll leave you with a small piece of mdb that shows you what providers are loaded on your system at the moment. Now, you could do this with dtrace(1M) and a simple bit of awk but that would be dull :-) . If I remember, next time I'll try to explain what it does...

> 1::id2probe | ::print -t dtrace_probe_t dtpr_provider | ::list dtrace_provider_t dtpv_next | ::print -t dtrace_provider_t dtpv_name
char \*dtpv_name = 0xffffffff83959078 "dtrace"
char \*dtpv_name = 0xffffffff8527ed00 "pid689"
char \*dtpv_name = 0xffffffff83db07c0 "syscall"
char \*dtpv_name = 0xffffffff83db0790 "sdt"
char \*dtpv_name = 0xffffffff83db07d0 "mib"
char \*dtpv_name = 0xffffffff83db07b0 "io"
char \*dtpv_name = 0xffffffff83db0660 "proc"
char \*dtpv_name = 0xffffffff83db0690 "sched"
char \*dtpv_name = 0xffffffff83db0650 "fpuinfo"
char \*dtpv_name = 0xffffffff83db08f8 "vminfo"
char \*dtpv_name = 0xffffffff83db0920 "sysinfo"
char \*dtpv_name = 0xffffffff83db0928 "vtrace"
char \*dtpv_name = 0xffffffff8316cb78 "profile"
char \*dtpv_name = 0xffffffff83da8a28 "lockstat"
char \*dtpv_name = 0xffffffff83db0900 "fbt"
char \*dtpv_name = 0xffffffff831a5658 "fasttrap"
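
For the record, the dull dtrace(1M)-and-awk equivalent mentioned above would be something along these lines (a sketch; it just prints the unique values of the PROVIDER column from dtrace -l):

# dtrace -l | awk 'NR > 1 { print $2 }' | sort -u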

Friday Apr 15, 2005

DTrace'ing at SUPerG

With a spring in my step I'm off to SUPerG on monday. The "spring" here really refers to the fact that I'm really looking forward to this conference. Last year was my first SUPerG and I must say that it is probably the best conference that I have ever been to. A whole lot of really excellent Sun and external engineers come and share their insights with delegates who are genuinely interested in hearing what is being said. The people attending are not only interested but they are also interesting. I met many excellent people from all over the globe who were willing to not only listen but also share their experiences around Sun and the surrounding technologies. I learnt a whole lot and I'm looking forward to much of the same this year.

Last year Phil Harman (author of the infamous Multithreading in the Solaris Operating Environment paper) and I did a TechBrief on DTrace and we had a great time. The format[1] was one of Phil waxing lyrical about real scenarios/problems that you will encounter when dealing with systems in Data Centres whilst I create the D scripts on the fly and execute them, demonstrating how DTrace makes easy meat of all these problems. To really appreciate DTrace you've got to see it and feel it and this is what we aimed to do; slides are off the menu and smokin' keyboards and caffeine are on!

This year we aim to do the same but maybe with a few different laughs along the way. If you're coming along then I look forward to seeing you there. If you're not then try and get there next year!

A whole raft of talent will be presenting and a sample includes Richard McDougall, Jim Mauro and Richard Elling.


[1] This format was extremely well used by Messrs Cantrill and Shapiro in their "/proc-tologist" show. If they do this in a venue near you, sell your kidney if you have to but make sure you go. Great fun.