Process(or) Forensics

Optimizing application performance is a multi-facet undertaking. Among all the tools and methods available to us, getting a glimpse at performance metrics from the processor point of view allows us to highlight yet another facet of the application. Most modern microprocessors include so-called performance counters that provide event based drill-down of the processor utilisation and efficiency. On Solaris, the events and performance counters are made available to a user via two commands: cpustat(1M) and cputrack(1M), the former looks at system wide statistics, where as the latter one looks at process-wide statistics. You can see what events and performance counters are available to you by issuing the command: cputrack -h. On my system, the (shortened) output looks like:

blog@tbf3400> cputrack -h


cputrack [-T secs] [-N count] [-Defhnv] [-o file]

-c events [command [args] | -p pid]


CPU performance counter interface: AMD Opteron & Athlon64

event specification syntax:


event[0-3]: FP_dispatched_fpu_ops FP_cycles_no_fpu_ops_retired

FP_dispatched_fpu_ops_ff LS_seg_reg_load

LS_uarch_resync_self_modify LS_uarch_resync_snoop

LS_buffer_2_full LS_retired_cflush LS_retired_cpuid

DC_access DC_miss DC_refill_from_L2 DC_refill_from_system

DC_misaligned_data_ref DC_uarch_late_cancel_access

DC_uarch_early_cancel_access DC_dispatched_prefetch_instr

DC_dcache_accesses_by_locks BU_memory_requests


attributes: edge pc inv cmask umask nouser sys

See Chapter 10 of the "BIOS and Kernel Developer's Guide for the

AMD Athlon 64 and AMD Opteron Processors," AMD publication #26094

cpustat(1M) and cputrack(1M) also provide a reference to processor documentation where one can find more pertinent information about each individual event. For another, excellent tool to collect performance data that can leverage hardware performance counters and much more, please refer to collect(1) that comes with the free Sun Studio 12 suite. Using cputrack(1M) is really straightforward. To demonstrate its usage, we will use a simple C program that makes lots of misaligned data accesses. On an AMD Opteron based system, we can use the DC_misaligned_data_ref event to count misaligned data references.

blog@tbf3400> cputrack -t -N 5 -c DC_misaligned_data_ref ./misaligned_access 1

time lwp event tsc pic0

1.041 1 tick 791649113 75695259

2.051 1 tick 790899011 75756743

3.041 1 tick 779480069 74674120

4.041 1 tick 791630075 75857721

5.051 1 tick 798199772 76464257

Our little application generates a fair number of misaligned data references [pic0 column] in relation to the number of clock cycles [tsc column] elapsed (~ 9.5 %).

As a follow up, one will probably be interested in finding out where these misaligned data references happen? Unfortunately, this question is more tricky to answer on x86 platforms. On the SPARC platform, the Sun Studio performance analyzer includes capabilities and tools that will help out (have a look also at SPOT).

A simple way to answer the above question is to make use of the CPU performance counter library [ libcpc(3)] directly by writing our own code. We will use this technique to write a standalone shared library object that can be dynamically loaded into existing applications (using the LD_PRELOAD environment variable) to perform simple profiling based on processor performance counters.

First, we use the shared object initialization hook to prepare our runtime environment:

#pragma init(dprof_init)

void dprof_init()


Next, we set-up a SIGEMT signal handler (more later):

/\* SIGEMT \*/

bzero(&emt_action, sizeof(emt_action));

emt_action.sa_sigaction = dprof_emt_handler;

emt_action.sa_flags = SA_RESTART | SA_SIGINFO;

if (sigaction(SIGEMT, &emt_action, NULL) == -1) {

    fprintf(stderr, "dprof: sigaction: %s\\n", strerror(errno));



Then, we initialise the CPU performance counter library:

if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL) {

    fprintf(stderr, "dprof: perf. counters unavailable: %s\\n", strerror(errno));



if ((set = cpc_set_create(cpc)) == NULL) {

    fprintf(stderr, "dprof: could not create set: %s", strerror(errno));



So we are pretty much done with initialization. Now let's set-up the event of interest:

/\* AMD Opteron DC_misaligned_data_ref \*/

event_name = “ DC_misaligned_data_ref”;

preset = 0xffffffffffffffff – 100;

nattrs = 0;

And add it to the list of events to watch. We also would like to be notified by a signal when the counter overflows (CPC_OVF_NOTIFY_EMT):

if ((ckey = cpc_set_add_request(cpc, set, event_name, preset,

                                CPC_COUNT_USER | CPC_OVF_NOTIFY_EMT,

                                nattrs, cpc_attrs)) != 0) {

    fprintf(stderr, "dprof: cannot add request to set: %s\\n", strerror(errno));



And finally bind the performance counters to the current light weight process (LWP):

if (cpc_bind_curlwp(cpc, set, 0) == -1) {

    fprintf(stderr, "dprof: cannot bind lwp%d: %s\\n",

                    _lwp_self(), strerror(errno));



We're done. We should actually be getting SIGEMT signals every time the underlying performance counter overflows (that is with our current settings after 100 events). So let's see what we need to do inside the signal handler:

void dprof_emt_handler(

    int sig,

    siginfo_t \*sip,

    void \*arg

) {

    /\* sig should be SIGEMT and sip->si_code should be EMT_CPCOVF \*/

    if (dladdr(sip->si_addr, &dlip)) {



                dlip.dli_fname, dlip.dli_fbase, dlip.dli_sname, dlip.dli_saddr);

    } else {

        printf("|%p|||||\\n", sip->si_addr);


    /\* Restart counter \*/

    if (cpc_set_restart(cpc, set) != 0) {


                "dprof: cannot restart lwp%d: %s\\n",

                _lwp_self(), strerror(errno ));




So our very basic signal handler tries to resolve the sip->si_addr given to us by the system and print the information out (really, one would probably want to aggregate events into a buffer and process them later). Finally, before exiting the signal handler, the performance counter is reset to its preset value. With the little C program that generates misaligned data references, we see following output (shortened):


Using the modular debugger (mdb), we disassemble the code at the given address:

blog@tbf3400> mdb misaligned_access

> 400da4::dis

fa: movlpd (%rdi),%xmm0

fa+4: mulsd (%rsi),%xmm0

fa+8: movlpd %xmm0,(%rdi)

fa+0xc: movlpd (%rdi),%xmm0

fa+0x10: movlpd %xmm0,(%rsi)

fa+0x14: ret

0x400dad: nop

0x400db0: nop

0x400db4: nop

fb: movlpd (%rdi),%xmm0

fb+4: mulsd (%rsi),%xmm0

fb+8: movlpd %xmm0,(%rdi)

fb+0xc: movlpd (%rdi),%xmm0

fb+0x10: movlpd %xmm0,(%rsi)


and find out that the function named fa() is likely to be the source of misaligned data references.

A couple of words of caution with this technique: it is based on statistical sampling. The processor gets interrupted whenever the underlying performance counter overflows. Expect delays between the time the processor fires the interrupt and the time the operating system services it. That is, the program counter made available to the signal handler may be slightly off.


Post a Comment:
  • HTML Syntax: NOT allowed

Application tuning, sizing, monitoring, porting on Solaris 11


« April 2014