Tuesday Nov 06, 2007

UltraSPARC IV+ Update

Darryl Gove identified a bug in the UltraSPARC IV+ definitions a week ago. Many thanks to Darryl for bringing it to my attention.

These are the current definitions to use in .er.rc files:

en_desc on
mobj_define Vaddr VADDR
mobj_define Paddr PADDR
indxobj_define VIRTPC "VIRTPC"
indxobj_define PHYSPC "PHYSPC"
indxobj_define Process PID
indxobj_define Thread (PID*1000)+THRID
indxobj_define ThreadID THRID
indxobj_define Seconds (TSTAMP/1000000000) 
indxobj_define Minutes (TSTAMP/60000000000) 
mobj_define US4p_L1DataCacheLine (VADDR&0x3fe0)>>5 
mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6 
mobj_define US4p_L3CacheLine (PADDR&0x7fffc0)>>6 
mobj_define VA_L2 VADDR>>6 
mobj_define VA_L1 VADDR>>5 
mobj_define PA_L2 PADDR>>6 
mobj_define PA_L1 PADDR>>5 
mobj_define US4p_T512_8k (VADDR&0x1fe000)>>13 
mobj_define US4p_T512_64k (VADDR&0xff0000)>>16 
mobj_define US4p_T512_512k (VADDR&0x7f80000)>>19 
mobj_define US4p_T512_4M (VADDR&0x3fc00000)>>22 
mobj_define US4p_T512_32M (VADDR&0x1fe000000)>>25 
mobj_define US4p_T512_256M (VADDR&0xff0000000)>>28 
mobj_define Vpage_32M VADDR>>25 
mobj_define Vpage_256M VADDR>>28 
mobj_define Ppage_32M PADDR>>25 
mobj_define Ppage_256M PADDR>>28

These include the general options for Sun Studio 12 Performance Analyzer but do not include the Processor and Memory Board definitions.

Thursday Oct 11, 2007

UltraSPARC T2 Features for DProfile

This week Sun released new UltraSPARC T2 based systems. UltraSPARC T2 enhances support for DProfile by adding cache miss reporting, as the next phase of constant improvement to DProfile. Performance Analyzer reports cache miss metrics as additional bar graphs for UltraSPARC T2 as in the turquoise bars (3rd one down) shown here:

UltraSPARC T2 added precise trapping on data and instruction cache miss performance counters for DProfile. The specific counters are described on page 81 of the OpenSPARC T2™ Supplement. The cache miss selector is sl=3.

The new hardware counters include: Icache misses, Dcache misses, L2 cache instruction misses, and L2 cache load misses. These are available in the Sun Studio 12 Performance Analyzer collect command with these options:

usage:  collect <args> target <target-args>
       Sun Analyzer 7.7 SunOS_sparc 2007/10/09
Specifying HW counters on `UltraSPARC T2':
    <ctr_def> == [+]<ctr>[~<attr>=<val>]...[~<attr>=<val>][/<reg#>][,<interval>]
         for memory-related counters, attempt to backtrack to find
         the triggering instruction and the virtual and physical
         addresses of the memory reference
Well-known HW counters available for profiling:
   icm[/{0|1}],100003 (`I$ Misses', alias for IC_miss; load-store events)
   itlbm[/{0|1}],100003 (`ITLB Misses', alias for ITLB_miss; load-store events)
   ecim[/{0|1}],10007 (`E$ Instr. Misses', alias for L2_imiss; load-store events)
   dcm[/{0|1}],100003 (`D$ Misses', alias for DC_miss; load-store events)
   dtlbm[/{0|1}],100003 (`DTLB Misses', alias for DTLB_miss; load-store events)
   ecdm[/{0|1}],10007 (`E$ Data Misses', alias for L2_dmiss_ld; load-store events)
Raw HW counters available for profiling:
   IC_miss[/{0|1}],1000003 (load-store events)
   DC_miss[/{0|1}],1000003 (load-store events)
   L2_imiss[/{0|1}],1000003 (load-store events)
   L2_dmiss_ld[/{0|1}],1000003 (load-store events)
   ITLB_miss[/{0|1}],1000003 (load-store events)
   DTLB_miss[/{0|1}],1000003 (load-store events) 

Note that the sl=2 counters do not reliably support DProfile. I do not recommend enabling them.

With these performance metrics, miss rates can be measured within objects. In the graph above, we're displaying L2 read misses per second. No recompilation is necessary!

Tuesday Aug 07, 2007

DProfile - UltraSPARC T2 Support In Studio 12 Performance Analyzer

Today, Sun announced the first true system on a chip with 64 available threads of computation.

Sun Studio 12 adds DProfile support for UltraSPARC T2.

Activating UltraSPARC T2 objects in Performance Analyzer requires you to create the file .er.rc with the following contents:

en_desc on
mobj_define Vaddr VADDR
mobj_define Paddr PADDR
indxobj_define VIRTPC "VIRTPC"
indxobj_define PHYSPC "PHYSPC"
indxobj_define Process PID
indxobj_define Thread (PID*1000)+THRID
indxobj_define ThreadID THRID
indxobj_define Minutes (TSTAMP/60000000000)
mobj_define UST2_Bank (PADDR&0x1c0)>>6
mobj_define UST2_L2DCacheSet (((((PADDR>>15)^PADDR)>>9)&0x1f0) | ((((PADDR>>7)^PADDR)>>9)&0xc) | ((PADDR>>9)&3))
indxobj_define UST2_L2ICacheSet (((((PHYSPC>>15)^PHYSPC)>>9)&0x1f0) | ((((PHYSPC>>7)^PHYSPC)>>9)&0xc) | ((PHYSPC>>9)&3))
mobj_define UST2_L1DataCacheLine (PADDR&0x7f0)>>4
indxobj_define UST2_L1InstrCacheLine (PHYSPC&0x7e0)>>5
indxobj_define UST2_Strand (CPUID)
indxobj_define UST2_Core (CPUID&0x38)>>3
mobj_define VA_L2 VADDR>>6
mobj_define VA_L1 VADDR>>4
mobj_define PA_L2 PADDR>>6
mobj_define PA_L1 PADDR>>4
mobj_define Vpage_256M VADDR>>28
mobj_define Ppage_256M PADDR>>28
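To sanity-check these expressions outside Performance Analyzer, they can be mirrored in C. This is only a sketch: the function names are made up for illustration, and it assumes PADDR and CPUID are plain unsigned integers.

```c
#include <stdint.h>

/* Mirrors: mobj_define UST2_Bank (PADDR&0x1c0)>>6 */
uint64_t ust2_bank(uint64_t paddr)
{
    return (paddr & 0x1c0) >> 6;
}

/* Mirrors the UST2_L2DCacheSet expression term by term. */
uint64_t ust2_l2_set(uint64_t paddr)
{
    uint64_t hi  = (((paddr >> 15) ^ paddr) >> 9) & 0x1f0;
    uint64_t mid = (((paddr >> 7) ^ paddr) >> 9) & 0xc;
    uint64_t lo  = (paddr >> 9) & 3;
    return hi | mid | lo;
}

/* Mirrors: indxobj_define UST2_Core (CPUID&0x38)>>3 */
unsigned ust2_core(unsigned cpuid)
{
    return (cpuid & 0x38) >> 3;
}
```

With these definitions, CPUID values 8 through 15 all map to core 1, and consecutive 64-byte physical lines step through the eight bank numbers.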

On UltraSPARC T2 and Sun Studio 12, collect additional information with DProfile. For binaries with large text and large data, use this command (with or without the -y 17 option) to collect L2 cache instruction and data miss information:

collect -p +on -h +L2_imiss,on,+L2_dmiss_ld,on -A copy -F on command

For binaries with large data usage patterns, but more modest instruction footprint, use these options to collect data-side L1 and L2 miss rates:

collect -p +on -h +DC_miss,on,+L2_dmiss_ld,on -A copy -F on command

If trapstat reports large TLB miss costs, use this command:

collect -p +on -h +ITLB_miss,on,+DTLB_miss,on -A copy -F on command

Tuesday Mar 27, 2007

DProfile - Filtering Sets

Russell Brown and I optimized an HPC genetics code with the set token IN and the tokens STACK and LEAF.

The IN operator tests events for the existence of scalars in a list.

The lval IN rval operator tests whether lval is present in rval. For example, in my previous blog entry this expression was in the filter area:

(US4p_L2CacheLine IN (5483)) && (VA_L2 IN (2155))

The expression checks that event memory object US4p_L2CacheLine is IN set 5483, and event memory object VA_L2 is IN set 2155. The set can be a list such as:

VA_L2 IN (2155,2156,2157)
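For readers who think in C, the IN test behaves like a membership check over a small list. The sketch below is an illustration, not Analyzer code: the struct, the function names, and the combined filter (the cache-line clause joined with the three-element VA_L2 list) are assumptions made for the example.

```c
#include <stddef.h>

/* Returns 1 when value appears in set[0..n-1], else 0 (the IN test). */
int in_set(long value, const long *set, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == value)
            return 1;
    return 0;
}

/* One profiled event, reduced to the two objects used in the filter. */
struct event { long us4p_l2_cache_line; long va_l2; };

/* (US4p_L2CacheLine IN (5483)) && (VA_L2 IN (2155,2156,2157)) */
int keep_event(const struct event *e)
{
    static const long lines[]  = { 5483 };
    static const long groups[] = { 2155, 2156, 2157 };
    return in_set(e->us4p_l2_cache_line, lines, 1) &&
           in_set(e->va_l2, groups, 3);
}
```

An event survives the filter only when both memory objects match, which is what the && in the filter expression asks for.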

Two tokens are available to test instruction constructs. The LEAF token returns the identifier of the leaf routine, and the STACK token returns the list of routine identifiers on the stack. When you're in the Functions tab, click the filter button and the identifier for the highlighted function name will appear. Two expressions are very useful:

LEAF IN (123)
(123) IN STACK

The first returns metrics when routine identifier 123 (the function identifier, not the name) is executing, i.e. exclusive metrics.

The second returns metrics when routine identifier 123 is on the stack, i.e. inclusive metrics.
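The difference between the two expressions is easy to see on a single sampled call stack. This is a minimal sketch with an assumed stack layout (outermost caller first, leaf last) and made-up function names; Analyzer's internal representation may differ.

```c
#include <stddef.h>

/* A sampled call stack: frames[0] is the outermost caller and
   frames[depth-1] is the leaf. The layout is assumed for this sketch. */
struct sample { const int *frames; size_t depth; };

/* LEAF IN (id): the routine itself is executing (exclusive). */
int leaf_is(const struct sample *s, int id)
{
    return s->depth > 0 && s->frames[s->depth - 1] == id;
}

/* (id) IN STACK: the routine is anywhere on the stack (inclusive). */
int on_stack(const struct sample *s, int id)
{
    for (size_t i = 0; i < s->depth; i++)
        if (s->frames[i] == id)
            return 1;
    return 0;
}
```

For a stack 7 -> 123 -> 45, routine 123 passes the STACK test but not the LEAF test, so the sample counts toward its inclusive metrics only.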

Russell A. Brown and I used these two expressions to diagnose various phases of Russ's program. He sent me a note with some findings using DProfile within Studio Performance Analyzer:

The following is a report of the progress to date in using DProfile to debug and improve the performance of a parallel application executing on a Niagara 1 system.

The application is a benchmark that performs genome sequence matching. This application is a new algorithm that uses binary trees to represent a sparse matrix. Older algorithms represent a dense matrix and execute several decimal orders of magnitude more slowly.

DProfile has identified several performance bugs in the new algorithm, all but the last of which have been corrected based on DProfile analysis:

1. The multi-threaded application was building a binary tree for each thread, and calling malloc() each time that a new tree node was needed. The different threads were required to synchronize on malloc(), a fact which required many mutex locks, as demonstrated by DProfile. The solution to this problem was for each thread to allocate an array that was large enough to contain all of the nodes required for a given binary tree. In this manner, each thread called malloc() once instead of many times. Thus the number of mutex locks was greatly reduced and the performance of the algorithm improved.

2. Each thread used some small matrices that were allocated as 2D arrays, i.e., a row vector of pointers was allocated, and each pointer referenced a column vector that was allocated as well. Because each thread needed to allocate a row vector as well as multiple column vectors, and further because the threads executed asynchronously, the column vectors for a particular thread were scattered throughout memory. This situation created many L2 cache misses, as demonstrated by DProfile. The solution to this problem was to allocate the matrices as 1D arrays that were large enough to contain all of the column vectors. Thus the column vectors for a particular thread were no longer scattered throughout memory, which resulted in fewer L2 cache misses and improved performance.

3. The 1D arrays were allocated in row-major order but were accessed sequentially both in row-major and in column-major order. Column-major access created L2 cache misses, as demonstrated by DProfile. The solution to this problem was to restructure the algorithm so as to use both a row vector and a column vector instead of a matrix. This change resulted in fewer L2 cache misses and improved performance.

4. Despite all of the above improvements, which were motivated by DProfile analysis, DProfile now indicates that malloc() incurs substantial [user lock stall penalties], independent of whether -lmtmalloc or -lumem is specified in the link command. To investigate this problem more deeply, it is necessary to add functionality to DProfile.

Monday Mar 13, 2006

DProfile - "Data events in a multi-dimensional space for Profiling" From The Top

If you haven't already, check out the new CoolTools Page containing productivity enhancements for developers and performance engineers.

I started researching the profiling of data motion because most application time was spent waiting for the memory hierarchy to respond. Most alarmingly, memory latency was growing relative to processor cycles over successive processor generations.

CMT addresses memory latency by adding threads to overlap memory hierarchy latency. From Sun's introduction of the T1000 and T2000, it's evident that Sun is pursuing CMT more aggressively than any large system vendor.

Scalability is critical to extract thread-level parallelism (TLP)

Let's look at some sample code that counts a variable var up to a value limit:

for (;;) {
  if (var++ > limit) {

When first multi-threading the application, the brute force approach is to add a lock around the shared variable var:

for (;;) {
   if (var++ > limit) {

This will offer poor scalability, as each thread will overutilize the lock and the variable.  The next attempt creates a private copy of the variable for each thread, in an array indexed by index, and removes the lock:

for (;(array[index]++ <= limit);) ;

This too will offer poor scaling on Opteron and SPARC processors.  This time, the underlying hardware resource (cache line) is over utilized.
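Why the per-thread array still collides comes down to layout arithmetic. The sketch below assumes a 64-byte cache line and a line-aligned array; the struct and function names are made up for illustration.

```c
#include <stddef.h>

#define LINE_SIZE 64   /* assumed cache-line size in bytes */

/* Packed layout: neighbouring counters share a cache line. */
struct packed_counter { long count; };

/* Padded layout: each counter fills a whole cache line. */
struct padded_counter {
    long count;
    char pad[LINE_SIZE - sizeof(long)];
};

/* Cache-line number that a byte offset falls into. */
size_t line_of(size_t byte_offset)
{
    return byte_offset / LINE_SIZE;
}

/* Returns 1 when elements i and j of an array with the given element
   size land on the same cache line (array assumed line-aligned). */
int share_line(size_t elem_size, size_t i, size_t j)
{
    return line_of(i * elem_size) == line_of(j * elem_size);
}
```

With the packed layout, the counters of neighbouring threads sit on one line, so every increment pulls that line away from the other cores. With the padded layout each thread owns its line.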

Then how is scalability achieved?

Uniform Utilization of hardware resources ensures scalability:

  • Uniform Hardware Utilization Through Time (All program Phases use the hardware uniformly)

  • Uniform Utilization of Hardware Threads (All Hardware Threads are Busy)

  • Uniform Use of Software Threads (There are enough threads)

  • Uniform Cache Set Use in the Cache Hierarchy (No cache sets are over subscribed)

  • Uniform Physical Memory Use (All memory boards used evenly)

  • Uniform Virtual Memory Use (No Hot Locks)

Asymmetric Utilization is called Skew. Bart touches on Asymmetric Hardware Utilization in Performance Anti-Patterns in this Queue article.

So why data motion?

We need a common abstraction that affects all hardware. The only entity that propagates through all active system components is Data Motion.

Every state transition in the system is Data Motion:

  • Load Instructions: Are Data Motion toward the Processor Core

  • Store Instructions: Are Data Motion toward the Cache Hierarchy

  • Arithmetic Operations: Are Data Motion within Processor Core (the processor happens to also change your data a bit :-))

  • Data Propagates Through All Hardware and Software Components

Data Motion allows us to identify all asymmetric utilization, Skew, in every component of the system.

DProfile identifies scalability opportunities by tracking data motion in the entire application.

DProfile helps Performance Analysts and Deployments with dramatic results. Commercial applications have improved substantially when deployed using DProfile:

  • 530% on a cryptographic benchmark

  • 600% in a decision support workload

  • 800% with a design application

  • 1000% on Dhrystone

DProfile helps ISV Developers find code issues with ease:

  • 2000% (20x) performance gain in a high performance computing code

  • 1900% (19x) performance gain in my sample code.

All of these gains were achieved by using Dataspace Profiling or DProfile: Data events in a multi-dimensional space for Profiling.

Now that you've slogged through all my blog entries on how to set up Performance Analyzer, let me show you how DProfile works.

DProfile uses event collection agents based on statistical profiling using instruction-level profiling and value profiling that captures timestamp, program counter address, and for memory instructions: effective address and the associated physical address, among others. Unlike statistical tools such as VTune that focus on instruction profiling, DProfile targets data motion and associates the Cost for data motion to all aspects of computation, software and hardware. Unlike memory profiling tools based on DynInst such as Stride and pSIGMA, DProfile instrumentation samples the entire program execution (with about 5% slowdown using default settings).

All analytical processing can be done on a remote system, and response times for queries of Performance Analyzer are within a few minutes in large experiments approaching the data size limitations of Sun Studio Performance Analyzer. The data mining analysis can be parallelized for faster response times.

Unlike hardware based memory profilers, such as Sun Fire Memory Controller, DProfile uses relational agents to map profiled values to hardware and software structures. The first relational agents were introduced by Performance Analyzer a few years ago. The most flexible of the relational agents is the user-specified "Custom Object" introduced in Sun Studio 11. This agent lets a user specify complex expressions that interpret relational agents.  This flexibility permits prototyping of relational agents that can be implemented natively at a later time.

Relational agents deliver productivity gains by correlating the Cost of data motion to all source, system allocations, and hardware structures.

Analysis entails a simple, iterative two-step process, Look and Filter:

  • Mapping the events to objects using relational agents. (Look)

  • Filtering an anomalous object and then mapping the remaining events to other objects. (and Filter)
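The two steps can be pictured over a plain array of sampled events. This is a toy sketch: the event struct, the fixed object bound, and the function names are assumptions for illustration, and the real work is done by Performance Analyzer.

```c
#include <stddef.h>

#define MAX_OBJECTS 16   /* small fixed bound, for the sketch only */

/* One sampled event: the object ids it maps to, plus its cost. */
struct event { int l2_line; int core; double cost; };

/* Look: sum cost per L2 line and return the most costly line.
   Line ids are assumed to be in [0, MAX_OBJECTS). */
int hottest_l2_line(const struct event *ev, size_t n)
{
    double total[MAX_OBJECTS] = { 0 };
    int best = 0;
    for (size_t i = 0; i < n; i++)
        total[ev[i].l2_line] += ev[i].cost;
    for (int k = 1; k < MAX_OBJECTS; k++)
        if (total[k] > total[best])
            best = k;
    return best;
}

/* Filter, then look again: count distinct cores touching one line.
   Core ids are assumed to be in [0, MAX_OBJECTS). */
int cores_using_line(const struct event *ev, size_t n, int line)
{
    int seen[MAX_OBJECTS] = { 0 };
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (ev[i].l2_line != line)
            continue;               /* event filtered out */
        if (!seen[ev[i].core]) {
            seen[ev[i].core] = 1;
            count++;
        }
    }
    return count;
}
```

The first function is a Look in one dimension (L2 lines). The second applies a Filter on the hot line and Looks in another dimension (cores).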

Some of these relational agents are built into Performance Analyzer and are provided by the Sun Studio Compiler Suite. Other relational agents are the formulas you learned in previous blog entries.

If you look and do not see asymmetric utilization, look in another dimension or objects. When you find asymmetric utilization, filter and look elsewhere to identify the cause of the bottleneck.

You can think of DProfile as statistical value profiling with multi-dimensional presentation and filtering.

In our example above, we collected an experiment and then ran Performance Analyzer all using an UltraSPARC IV+ system:

The first panel shows the functional view of Performance Analyzer with the Functions tab selected. We observe that the thread_fcn function is taking the most execution time, and the most time in the memory system.  In this example, the time spent in the memory subsystem is the Cost.  We can examine the source of the function causing the Cost and verify that the loop is taking all the time in the memory subsystem:

In the second screenshot, the Source tab is selected.  It shows us the source line causing all the Cost.

But why?

Let's select the Vaddr tab and see in Graphical mode what virtual addresses are the most costly:

You notice a rather even distribution.

From the virtual memory perspective, everything is what we expected. Now let's look at the L2 Cache lines. Looking at L2 cache lines shows the breakdown of Cost by the hardware unit of coherency:

Ah! There's just one very costly line. Let's switch to text mode:

We select the hottest line (click on Object 1901 and it will turn blue), and then click the Filter button pointed to by the red arrow. This button lets us drill down into why this L2 cache line is involved with all the Cost.

The advanced filter dialog appears. The expression for the highlighted L2 cache line is in the Filter Clause field. To add this filter clause to the editable Specify Filter text field, press the Set button pointed to by the red arrow.

This Specify Filter field, shown with the blue ellipse, accepts any expression. This expression can specify any condition for examination: for example, remote accesses (Processor_Board != Memory_Board) and others. Predefined objects and tokens are available for use here. The C Language logical operators are available to help build complex expressions. These complex expressions drill down to the condition where you need more information.

Now that we are looking at the top cache line, we can select the Core tab to identify which processor cores utilized this one cache line:

Note that we quickly see all 8 cores used this single cache line.

We can look further and we can select the physical address groupings that fit into the L2 cache line, PA_L2. This grouping has the same size and alignment as L2 cache lines. This group is used to see whether this cache line is heavily shared (same physical address) or falsely shared with a hardware conflict (different physical address):

Instantly, you see that we have a heavily shared hardware resource.
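The PA_L2 check amounts to asking whether every filtered event falls into the same physical-address group. A minimal sketch, assuming the events' physical addresses are available as an array; the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* PA_L2 as defined in .er.rc: mobj_define PA_L2 PADDR>>6 */
uint64_t pa_l2(uint64_t paddr)
{
    return paddr >> 6;
}

/* Returns 1 if every event address falls in one PA_L2 group (the same
   physical line is shared), 0 if several groups collide on the line. */
int single_physical_line(const uint64_t *paddr, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (pa_l2(paddr[i]) != pa_l2(paddr[0]))
            return 0;
    return 1;
}
```

A result of 1 corresponds to the heavily shared case described above. A result of 0 would point to different physical lines conflicting in the cache.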

Let's again select the Vaddr tab and see in text mode what virtual addresses are the most costly within only this heavily shared L2 cache line:

Very quickly, you see the problem:

  • One source line

  • Using one L2 cache line

  • Shared among 8 Cores

  • Using one Physical Address Group

  • Using 8 different Virtual Addresses

No more guesswork nor hypotheses.  Our questions are answered with DProfile.

The solution is to modify the virtual addresses so we use one L2 cache line per address, one address per core.

This solution improves performance 1900%!

Thursday Dec 15, 2005

DProfile - Loading Range Associations into Analyzer

Not all object identifiers are expression based. For all objects that are range based or state based, you can use the triadic operator (?:) to describe these objects.

The triadic operator is defined as:

( expression1 ? expression2 : expression3 )

This operator evaluates expression1; if it is true, it returns expression2, otherwise it returns expression3. This is identical to the C language version.

A handy tool called genexp was written that accepts tab-separated range information in the form:

start_value end_value identifier

and converts the range into an expression that evaluates in log(N) time in Sun Studio Performance Analyzer, where N is the number of elements in the range.

You can download the genexp binary or genexp source.
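To give a feel for how ranges can become a log(N) expression, here is a rough C sketch that emits nested ?: tests by splitting the sorted range list in half at each level. It is not the genexp source. It assumes the expression grammar accepts the < and <= comparisons, and the function names and the miss_id value returned for addresses outside every range are invented for the example.

```c
#include <stdio.h>
#include <string.h>

/* One input row: values in [start, end] map to id. */
struct range { unsigned long long start, end; int id; };

/* Append text to buf without overflowing cap bytes. */
void append(char *buf, size_t cap, const char *text)
{
    size_t len = strlen(buf);
    if (len + 1 < cap)
        strncat(buf, text, cap - len - 1);
}

/* Append a balanced nested ?: expression over r[lo..hi] (sorted,
   non-overlapping). Values outside every range yield miss_id. */
void emit(char *buf, size_t cap, const struct range *r,
          int lo, int hi, const char *var, int miss_id)
{
    char piece[96];
    if (lo > hi) {
        snprintf(piece, sizeof piece, "%d", miss_id);
        append(buf, cap, piece);
        return;
    }
    int mid = lo + (hi - lo) / 2;   /* halving keeps depth near log2(N) */
    snprintf(piece, sizeof piece, "(%s<0x%llx?", var, r[mid].start);
    append(buf, cap, piece);
    emit(buf, cap, r, lo, mid - 1, var, miss_id);
    snprintf(piece, sizeof piece, ":(%s<=0x%llx?%d:",
             var, r[mid].end, r[mid].id);
    append(buf, cap, piece);
    emit(buf, cap, r, mid + 1, hi, var, miss_id);
    append(buf, cap, "))");
}
```

Called with an empty buffer, a sorted range array, and "PADDR" as the variable, emit() fills the buffer with one expression string. Because each level discards half of the remaining ranges, evaluating that expression takes roughly log2(N) comparisons.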

This is very handy when you want to create object identifiers that mirror hardware objects. For example, the cfgadm command returns the physical addresses for every memory board in the Sun Fire E4900, E6900, E20000, and E25000 family of servers.

# cfgadm -a -s "select=type(memory),cols=ap_id:o_state:info"
Ap_Id                          Occupant     Information
SB0::memory                    configured   base address 0x0, 16777216 KBytes total
SB1::memory                    configured   base address 0x2000000000, 16777216 KBytes total
SB2::memory                    configured   base address 0x4000000000, 16777216 KBytes total
SB6::memory                    configured   base address 0xc000000000, 16777216 KBytes total
SB13::memory                   configured   base address 0x1a000000000, 16777216  KBytes total
SB14::memory                   configured   base address 0x1c000000000, 16777216  KBytes total
SB15::memory                   configured   base address 0x1e000000000, 16777216  KBytes total
SB16::memory                   configured   base address 0x20000000000, 16777216  KBytes total
SB17::memory                   configured   base address 0x22000000000, 16777216  KBytes total, 4211360 KBytes permanent

I have created a script called membrd.sh that takes this output and converts it to a tab-separated list for use in genexp.

You can download the membrd.sh script here.
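The conversion itself is simple: pull the board number, base address, and size out of each cfgadm line and turn them into a start/end/identifier row. The C sketch below illustrates that step under my own assumptions (an inclusive end address, a made-up function name). It is not the membrd.sh script.

```c
#include <stdio.h>

/* Parse one cfgadm memory line into a board id and an inclusive
   physical address range. Returns 1 on success, 0 if no match. */
int parse_board(const char *line, int *id,
                unsigned long long *start, unsigned long long *end)
{
    unsigned long long base, kbytes;
    if (sscanf(line, "SB%d::memory %*s base address %llx, %llu KBytes",
               id, &base, &kbytes) != 3)
        return 0;
    *start = base;
    *end = base + kbytes * 1024ULL - 1;   /* size is reported in KBytes */
    return 1;
}
```

For the SB1 line above, this yields board 1 covering 0x2000000000 through 0x23ffffffff. The header line does not match and is skipped.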

% membrd.sh
membrd.sh: Usage: membrd.sh -{ve}
membrd.sh: -v generate viewable file
membrd.sh: -e generate expression intermediate file
membrd.sh: Requires super-user executing these commands:
# cfgadm -a -s "select=type(memory),cols=ap_id:o_state:info" > /tmp/,.sb_info
# chmod 444 /tmp/,.sb_info

Adding the Memory_Board definition into Analyzer is then just a matter of running this command:

% membrd.sh -e | genexp Memory_Board PADDR >> .er.rc 

Generating the human readable version for reference is then:

% membrd.sh -v 

Note that these Memory_Board object identifiers will match the Processor_Board object identifiers. You can use these associations to correlate behavior in your application.

For Sun Fire E4500, E4800, E4900, E6500, E6800, E6900 Servers use these expressions:

mobj_define Processor CPUID&0x1ff
mobj_define Core CPUID&0x3ff
mobj_define Processor_Board (CPUID&0x1fc)>>2
mobj_define CoreID CPUID>>9

For the larger Sun Fire E12000, E15000, E20000, and E25000 Servers use these expressions:

mobj_define Processor CPUID&0x3e3
mobj_define Core CPUID&0x3ff
mobj_define Processor_Board CPUID>>5
mobj_define CoreID (CPUID&0x4)>>2

Using the triadic operator (?:) in Sun Studio Performance Analyzer, any series of ranges can be defined and then associated with object identifiers. These identifiers can then be used to make powerful correlations.

Tuesday Dec 06, 2005

DProfile - Piloting Sun Studio 11 Performance Analyzer

Fasten your seat belt, we're about to go on a multi-dimensional ride into your program - look inside the machine!

I have a sample scalability problem that I will now debug with Sun Studio 11 Performance Analyzer with DProfile.

I've recompiled my program with these flags:

cc -xhwcprof -g -c -xO4 th.c

and linked with these flags:

cc -xO4 -o fn3.out fn3.o th.o -lthread -xhwcprof -g 

Collect the experiment with 6 software threads for analysis using the collect command:

% collect -p +on -A copy -F on fn3.out -N 6

I used an UltraSPARC-IV+ processor in this test, so this is the .er.rc file that I created to define the perspectives for Analyzer. Note that the latter portion of the file contains the processor-specific objects for the Sun Fire E6900 Server running the UltraSPARC-IV+ processor.

en_desc on
mobj_define Vaddr VADDR
mobj_define Paddr PADDR
mobj_define Process PID
mobj_define Thread (PID*1000)+THRID
mobj_define ThreadID THRID
mobj_define Seconds (TSTAMP/1000000000)
mobj_define Minutes (TSTAMP/60000000000)
mobj_define US4p_L1DataCacheLine (VADDR&0x1fe0)>>5
mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6
mobj_define US4p_L3CacheLine (PADDR&0x7fffc0)>>6
mobj_define VA_L2 VADDR>>6
mobj_define VA_L1 VADDR>>5
mobj_define PA_L2 PADDR>>6
mobj_define PA_L1 PADDR>>5
mobj_define US4p_T512_8k (VADDR&0x1fe000)>>13
mobj_define US4p_T512_64k (VADDR&0xff0000)>>16
mobj_define US4p_T512_512k (VADDR&0x7f80000)>>19
mobj_define US4p_T512_4M (VADDR&0x3fc00000)>>22
mobj_define US4p_T512_32M (VADDR&0x1fe000000)>>25
mobj_define US4p_T512_256M (VADDR&0xff0000000)>>28
mobj_define Vpage_32M VADDR>>25
mobj_define Vpage_256M VADDR>>28
mobj_define Ppage_32M PADDR>>25
mobj_define Ppage_256M PADDR>>28
mobj_define Processor CPUID&0x1ff
mobj_define Core CPUID&0x3ff
mobj_define Processor_Board (CPUID&0x1fc)>>2
mobj_define CoreID CPUID>>9
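The six US4p_T512_* definitions all follow one pattern: shift away the page offset, then keep the next eight address bits. A small C sketch makes that explicit. The function name is made up and the page shifts (13, 16, 19, 22, 25, 28) are read off the definitions above.

```c
#include <stdint.h>

/* Index for a page of 2^page_shift bytes: the eight address bits
   directly above the page offset. */
unsigned us4p_t512_index(uint64_t vaddr, unsigned page_shift)
{
    return (unsigned)((vaddr >> page_shift) & 0xff);
}
```

For example, us4p_t512_index(vaddr, 13) gives the same value as (VADDR&0x1fe000)>>13, and us4p_t512_index(vaddr, 28) the same as (VADDR&0xff0000000)>>28.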

Fire up analyzer, and let's roll:

% analyzer

This screen appears:

Note in the upper left corner two columns: User CPU and Max. Mem. Stall.

User CPU refers to the total execution time of the application, while Max. Mem. Stall refers to the time spent in the memory subsystem. I will only focus on the memory subsystem column in my blog. (In our scaling problem, we're spending almost all of the time in the memory subsystem)

You will need to find the Data Presentation button: (near the middle of the row of buttons) and the Compose Filter Clause button: (the rightmost button on the button row)

Selecting the Data Presentation button brings up this dialog:

I'll select the exclusive metrics and percentage reporting:

Now I'll select the Tabs option in the panel, and see all of the available perspectives in the analyzer tabs:

The Performance Analyzer built-in objects are on the left, the right column has the built-in virtual and physical page objects and all of the objects defined via the .er.rc file.

These are the default settings in Sun Studio 11 Performance Analyzer. Since we are looking at a scalability problem, I will enable the cache hierarchy object US4p_L2CacheLine, its associated virtual address object VA_L2, and its associated physical address object PA_L2.

Since we will likely want to look at virtual addresses, I will enable Vaddr object, and the Seconds, Core and Thread objects to give you a sense of what DProfile can do for your understanding of my application.

Press OK and let's look inside the machine!

We can look at the Source of the function and see the one C language statement is taking all of the time in the program:

One statement is causing the problem. But why?

Processors share data via L2 cache lines. Let's see what the L2 cache line profile looks like. Just press the tab associated with L2 cache lines:

We see one cache line is taking 94% of all memory system time.

We can drill down and find out why!

Click on the first L2 cache line, and press the Compose Filter Clause button.

The dialog below appears. The filter clause for the L2 cache line is in the dialog box. The current filter is also displayed, currently empty (all data is viewed).

You can AND or OR the filter clause with the current filter, or SET (assign) the current filter to the selected filter clause.

Press the SET button to assign the filter, and then OK.

Now you can press the VA_L2 tab. This tab returns the virtual address groupings that mapped into this L2 cache line. The question we are asking here is: how many copies of my virtual address space were mapped into the L2 cache line?

One virtual address range was mapped into this one hot line!

What virtual addresses were in this range? Press the Compose Filter Clause button.

Now AND the clauses together:

We look at Vaddr and see the 6 addresses that are taking all the time in the L2 cache line:

One L2 Cache Line; six addresses; six threads! Coincidence?

We can view the Seconds tab and view when in time these filtered costs occurred.
We can view the Thread tab and view which software threads incurred these filtered costs.
We can view the Core tab and view which hardware cores incurred these filtered costs. We can examine how Solaris scheduled the software threads on the hardware cores!

We confirm that each address was used by just one thread, and that these threads falsely shared the L2 cache line.

We can view every object in hardware and software... and understand their relationships! FAST!

False sharing of all hardware structures is detected:

  • Cache Lines

  • TLB Entries

  • Memory Boards

  • Memory Banks

Scalability at your fingertips:

DProfile - Look Inside the Machine!

DProfile - The Scalability Infrastructure

We have talked about all the Perspectives, Insight and Knowledge available with DProfile. The purpose of all this technology is scalability.

As outlined in the Getting Ready for CMT section, scalability is key. And keeping all hardware structures equally utilized is the cornerstone to scalability: keeping all Virtual Processors busy, keeping all Memory Boards equally busy, and using your caches and banks uniformly.

My previous entry showed you how to profile all of these hardware components by teaching Sun Studio 11 Performance Analyzer about Perspectives.

Here, I'd like to show you some pseudo-code fragments that, when run in multiple Software Threads within a Process, cause a scalability problem:

for (;;) {
  if (var++ > limit) {

There is only one central lock that guards the shared variable var. I chose this obvious case because its Data Movement Profile is dominated by one cache line taking >95% of time between all Threads.

Here is another pseudo-code fragment, running in multiple Software Threads, that exhibits the same scalability problem:

   for (;(array[index]++ <= limit);) ;

The array is global, with a dedicated index for every Virtual Thread. This is not as obvious, but this case also has the same Data Movement Profile as the previous obvious example: one cache line holding the entire array is being passed among all Threads, and is taking >95% of time. This is false sharing; the next inhibitor to scalability after lock contention.

While the first example is easy to detect with lockstat and DProfile, only DProfile can identify the second example.

Any time there are a few hardware elements within structures being utilized (detected by DProfile), a scalability problem is available for resolution with DProfile.

Here are some examples:

One Memory Board used more frequently than others.
One Bank used more than others.
A group of Cache Lines used more than others.
One thread being used more frequently than others. (This is termed skew in HPC and DSS circles)

With DProfile you can select each of these objects, Filter, and then identify what Software View components are responsible for the underutilization.

DProfile - Teaching Analyzer Perspectives

I've changed Dataspace Profiling to DProfile.

The application costs are broken down between execution time and memory-subsystem time in the Functions tab. You will be able to view and operate on the memory-subsystem time through all of the perspectives reviewed earlier.

Many of the perspectives in the perspectives table are built-in, such as the load object, function, PC, data object, and virtual and physical pages (8k, 64k, 512k and 4M).

All other perspectives are programmed into Performance Analyzer through the use of the .er.rc file using the expression grammar.

All of the profile perspectives of the machine are created by these expressions. This blog will cover the machine-independent and machine-dependent human readable expressions. Future entries will include tools that create expressions as well.


Performance Analyzer has the Timeline view. A simpler view is provided by the Seconds and Minutes perspectives. These provide a breakdown of memory-subsystem time in seconds or minutes. By selecting the column heading, you change the sort order of the objects. By selecting "Max Mem Time", you will order from most costly to least costly time intervals. By selecting Name, you will order in time series. Selecting the Graphical radio button gives you an insightful graphical view of your application through time.

en_desc on
mobj_define Seconds (TSTAMP/1000000000)
mobj_define Minutes (TSTAMP/60000000000)

The first line in the .er.rc file instructs the engine behind Performance Analyzer to analyze the entire process tree (all descendants) created by the collect command.

The second line defines Seconds from the collected time stamps.

The third line defines Minutes from the collected time stamps.
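The divisors imply that TSTAMP counts nanoseconds, and that integer division places each event into a one-second or one-minute bucket. A tiny C sketch of the same arithmetic, with made-up function names:

```c
#include <stdint.h>

/* Mirrors: mobj_define Seconds (TSTAMP/1000000000) */
uint64_t seconds_bucket(uint64_t tstamp_ns)
{
    return tstamp_ns / 1000000000ULL;
}

/* Mirrors: mobj_define Minutes (TSTAMP/60000000000) */
uint64_t minutes_bucket(uint64_t tstamp_ns)
{
    return tstamp_ns / 60000000000ULL;
}
```

An event stamped 61.5 seconds into the run lands in Seconds object 61 and Minutes object 1.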

Software Execution


An application may span multiple processes and contain many threads. Two perspectives (objects) are useful in analysis: the Thread and the ThreadID.

Either represents the Software Execution object within our application. The Thread is a unique identifier across the application, while the ThreadID is a unique identifier only within each Process of the application.

mobj_define Thread (PID*1000)+THRID
mobj_define ThreadID THRID
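To see why Thread is unique across the application while ThreadID is not, here is the same arithmetic as a small shell sketch. The function name is hypothetical, and the key stays unique only under the assumption that thread IDs within a process stay below 1000.

```shell
#!/bin/sh
# thread_key PID THRID: same arithmetic as (PID*1000)+THRID.
# Assumption: THRID < 1000; otherwise keys from neighboring PIDs can collide.
thread_key() {
    echo $(( $1 * 1000 + $2 ))
}
```

Thread 7 of process 1234 and thread 7 of process 1235 share ThreadID 7, but their Thread keys differ (1234007 and 1235007).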

The application will allocate memory in a Process, through the use of virtual memory. Solaris will allocate and map physical memory for this virtual memory.

mobj_define Vaddr VADDR
mobj_define Paddr PADDR
mobj_define Process PID

Since we're announcing the Sun Fire CoolThreads Servers using the UltraSPARC T1 processor, here are the hardware-specific Sun Fire CoolThreads Server formulas to add to your .er.rc file to identify the cache hierarchy and hardware objects in the system:

mobj_define UST1_Bank (PADDR&0xc0)>>6
mobj_define UST1_L2CacheLine (PADDR&0x3ffc0)>>6
mobj_define UST1_L1DataCacheLine (PADDR&0x7f0)>>4
mobj_define UST1_Strand (CPUID)
mobj_define UST1_Core (CPUID&0x1c)>>2
mobj_define VA_L2 VADDR>>6
mobj_define VA_L1 VADDR>>4
mobj_define PA_L2 PADDR>>6
mobj_define PA_L1 PADDR>>4
mobj_define Vpage_256M VADDR>>28
mobj_define Ppage_256M PADDR>>28

Niagara has L2 Cache and physical memory grouped by UST1_Bank. Based on Paddr, a Bank is selected; then accesses will be serviced by the portion of the L2 cache in that bank. If a reference misses the L2 Cache, the memory controller associated with that Bank will service the miss.
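A short shell sketch of the bank selection described above, using the same masks and shifts as UST1_Bank and UST1_L2CacheLine. The function names and the sample addresses in the note are made up for illustration.

```shell
#!/bin/sh
# ust1_bank PADDR: bits 6-7 of the physical address,
# same arithmetic as (PADDR&0xc0)>>6, giving 0..3.
ust1_bank() {
    echo $(( ($1 & 0xc0) >> 6 ))
}

# ust1_l2_line PADDR: bits 6-17 of the physical address,
# same arithmetic as (PADDR&0x3ffc0)>>6, giving 0..4095.
ust1_l2_line() {
    echo $(( ($1 & 0x3ffc0) >> 6 ))
}
```

With these masks, consecutive 64-byte blocks rotate through the four banks: 0x1000 maps to bank 0, 0x1040 to bank 1, 0x1080 to bank 2, and 0x10c0 to bank 3.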

UST1_L2CacheLine returns the unique identifier for any L2 Cache Line in an UltraSPARC_T1 processor.

UST1_L1DataCacheLine returns the unique identifier for an L1 Cache Line within one UST1_Core. This does not return the unique identifier for an L1 Cache Line within a UST1_Processor.

UST1_Strand returns the identifier for each UltraSPARC_T1 Virtual Processor.

UST1_Core returns the identifier for the UltraSPARC_T1 Execution Core.
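The UST1_Core mask can be tried out in the shell. This sketch assumes only what the definition itself implies: the core number sits in bits 2-4 of CPUID, so four consecutive CPU ids share a core. The function name is hypothetical.

```shell
#!/bin/sh
# ust1_core CPUID: bits 2-4 of the CPU id,
# same arithmetic as (CPUID&0x1c)>>2, giving 0..7.
ust1_core() {
    echo $(( ($1 & 0x1c) >> 2 ))
}
```

CPU ids 0 through 3 map to core 0, ids 4 through 7 to core 1, and id 31 to core 7.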

VA_L2 returns the identifier for Virtual Memory grouped in UltraSPARC_T1 L2CacheLine-sized chunks.

VA_L1 returns the identifier for Virtual Memory grouped in UltraSPARC_T1 L1DataCacheLine-sized chunks.

The previous two formulas are useful for relating Hardware View cache line costs back to Program Address Space. You filter on a hardware object and relate it back to the virtual memory allocations for that hardware object.

PA_L2 and PA_L1 provide similar grouping of Physical Memory by Paddr. These formulas are useful for relating Hardware View cache line costs back to Solaris physical address allocations.
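As a concrete sketch of that grouping, the functions below apply the same shifts as VA_L2 and VA_L1 (and equally PA_L2 and PA_L1) to raw addresses. The helper names and the sample addresses in the note are hypothetical.

```shell
#!/bin/sh
# l2_group ADDR: 64-byte granule, same shift as VA_L2 / PA_L2 (ADDR>>6)
l2_group() {
    echo $(( $1 >> 6 ))
}

# l1_group ADDR: 16-byte granule, same shift as VA_L1 / PA_L1 (ADDR>>4)
l1_group() {
    echo $(( $1 >> 4 ))
}

# same_l2_line ADDR1 ADDR2: succeeds when both addresses fall in the
# same 64-byte granule, i.e. they would be charged to the same object.
same_l2_line() {
    [ "$(l2_group "$1")" -eq "$(l2_group "$2")" ]
}
```

For example, 0x10000 and 0x1003f fall in the same L2-sized chunk, while 0x10040 starts the next one. Two data items that land in one chunk are charged to the same VA_L2 object, which is what lets you relate a hot cache line back to the allocations sharing it.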

I'll show you how to use these formulas in my next entry.


Sunday Nov 20, 2005

Dataspace Profiling Lives - Sun Studio 11 is out!

Download the development environment with Dataspace Profiling: Sun Studio 11. It's FREE!

Get your SPARC box out. Get your app built with the new Sun Studio 11 C compiler. C++ also works, but its Dataspace Profiling support is somewhat buggier.

I'll describe how to use Dataspace Profiling on this blog.

When building your application, add these two options to your compile line in the C compiler:

-g -xhwcprof 

This tells the compiler to insert additional information in the binary.


You need to understand what signals your command catches and which it ignores.

Use psig pid on any of the processes and check the signal layout.

Booting with Collector Framework

Next, start your application under the collector framework with data collection initially disabled. This keeps your addresses the same with and without measurement.

If possible, use /tmp or any local filesystem when you start your command, as an experiment file is created in the local directory. This yields better results.

collect -p +on -A copy -F on -y 17 command 

Where 17 is SIGUSR2 from the Signals chapter. Use startstop.sh (provided below) to control when data is collected.

To collect data for a short-lived command use:

collect -p +on -A copy -F on command 

This does not require startstop.sh.

If possible, use signal 16 (SIGUSR1) or 17 (SIGUSR2).

Generating an Experiment

To start and stop collection of the command, use startstop.sh (change the kill signal to the signal you are using.)

% startstop.sh command 

starts and stops the data collection. Use this command at the start of your run and at the end of your run.

Here is startstop.sh:

# Send signal 17 (SIGUSR2 on Solaris) to every process whose ps line matches $1.
for i in `ps -le | grep $1 | cut -c12-17`
do
	kill -17 $i
	echo Sent USR2 to pid $i
done
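Because grep $1 matches any line of the ps listing that contains the name (possibly including the grep process itself), a stricter variant can match on the command name alone. This is a sketch, not the script used for the measurements here: pick_pids is a hypothetical helper, it assumes a ps that accepts the POSIX -o pid,comm format, and the usage shown in the comments sends the signal by name (USR2) because signal numbers differ between systems.

```shell
#!/bin/sh
# pick_pids NAME: read "PID COMMAND" lines on stdin and print the PIDs
# whose command basename equals NAME exactly.
pick_pids() {
    awk -v name="$1" '{
        n = split($2, parts, "/")      # drop any leading path from the command
        if (parts[n] == name) print $1
    }'
}

# Intended use (not run here):
#   ps -e -o pid,comm | pick_pids command | while read pid; do
#       kill -USR2 "$pid" && echo "Sent USR2 to pid $pid"
#   done
```

Matching on the whole command name avoids signalling a process that merely contains the name as a substring.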

This should get you started.

I'll have a section setting up Performance Analyzer in the next entry.

Thanks for your patience.

Thursday Sep 15, 2005

Dataspace Profiling Knowledge – Look Inside the Machine!

With Insight and Experience, Knowledge drives Change. For peak productivity, you need Action through Knowledge.

Let’s Look Inside the Machine!

For Dataspace Profiling, this table expresses the Machine, application and hardware:

Software View                                                 Hardware View
Software Execution  Program Source  Virtual Memory            Physical Memory            Cache Hierarchy       Execution Units
                                                              Memory Board                                     Processor Board
                    Load Object                               Memory Bank*               Cache Bank
                                    Virtual Page              Physical Page              TLB Page Entry
                                    Virtual ECache Line       Physical ECache Line       External Cache Line
                                    Virtual ECache Sub-Block  Physical ECache Sub-Block  ECache Sub-Block
                    Data Object     Virtual L1 Cache Line     Physical L1 Cache Line     Level One Cache Line

The first row shows the Software View and the Hardware View groupings of various Categories of Machine Resources: Software Execution, Program Source, Virtual Memory allocated by the Software, Physical Memory of the Hardware caching portions of the Virtual Memory, the Cache Hierarchy caching portions of the Physical Memory, the Execution Units, and Time.

Stacked Cells in Columns represent nested Collections of Structures. For example, Load Objects comprise Functions that comprise Instructions. Similarly, Processor Boards comprise Processors that comprise Cores that comprise Strands.

The Cache Hierarchy contains the MMU that contains TLB Translation Entries that map a Virtual Page to a Physical Page. The External Cache comprises External Cache Lines that may contain External Cache Sub-Blocks. The Level One Cache comprises Level One Cache Lines. To simplify this explanation, assume that the machine has only one size for every one of the Cache Hierarchy Components; however, Dataspace Profiling handles an arbitrary number of sizes.

Each Color shows an association among Categories of Machine Resources. For example, a TLB Entry maps Virtual Memory Pages to Physical Memory Pages. Dataspace Profiling manages this association for you with the Object Definitions. When a cost occurs in filling the TLB Entry, Dataspace Profiling automatically tracks which Virtual Memory Page and which Physical Memory Page were affected.

The Colors let you gain Insight from one Machine Category to another. Filter on one Category, and change Perspective to the corresponding Category of the same Color. Filtering among Categories enables you to narrow down a bottleneck.

In my example, one Category is the Program View of Memory (Virtual Page), another is the Hardware View of Memory (Physical Page), and the third is the Hardware View for the Cache Hierarchy, the TLB Entry. Which Virtual Pages use this TLB Entry? (Filter by TLB Entry, view by Virtual Pages.) Which Physical Page is mapped by this Virtual Page? (AND Filter by the Virtual Page, view by Physical Page.)
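The filter-then-change-Perspective step can be pictured as a query over per-event records. The sketch below is only a mental model, not how Performance Analyzer is implemented: the record layout (TLB entry, virtual page, physical page, cost), the function name, and any sample values are invented for illustration.

```shell
#!/bin/sh
# view_by_vpage TLB_ENTRY: read "tlb_entry vpage ppage cost" records on
# stdin, keep only those for TLB_ENTRY (the filter), then total the cost
# per virtual page (the new perspective). Output: "vpage total" lines,
# sorted by vpage.
view_by_vpage() {
    awk -v tlb="$1" '$1 == tlb { cost[$2] += $4 }
        END { for (v in cost) print v, cost[v] }' | sort
}
```

Feeding it a handful of made-up records and asking for TLB entry 12 would list only the virtual pages that used that entry, each with its summed cost. Adding a second condition to the awk pattern corresponds to the AND Filter described above.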

For another example, observe your Application from the Perspective of Processor Boards. For the Processor Board using the most time, which Memory Boards does this Processor Board use? Following the Colors above, we can find out by filtering on the Processor Board and changing Perspective to the Memory Boards. Dataspace Profiling provides you the answer!

You can continue to gain Knowledge about your Application by repeating this process, drilling deeper. Filtering by a Memory Board AND a Processor Board tells you which other objects in the machine are using both.

For example, repeated AND Filtering allows powerful observations: which Threads inside which Processes, running on which Processors, used which Type Definitions and which Virtual Memory Addresses, placed on which remote Memory Board, and at what Time during the Application execution.

In my previous blog entry, I walked you through a series of Insights gained through Dataspace Profiling Technology. Now you gain the Knowledge of whether the Virtual Addresses are heavily shared (many Processes with one Virtual Memory per Physical Memory or Cache Structure), or falsely shared so that we have a conflict (many Virtual Memories to one Physical Memory; or many Physical Memories to one Cache Structure). We know which Virtual Addresses, Virtual Pages, Processes, Threads, User-Defined Structures, and Functions are affected. You now have the Knowledge. You can Act through Knowledge.

Dataspace Profiling – Look Inside the Machine!

Tuesday Sep 13, 2005

Dataspace Profiling Insight – Look Inside the Machine!

Gain Insight from Multiple Perspectives. Dataspace Profiling Insight dissects performance bottlenecks within the Perspective in which they are visible. Then, the bottleneck in its entirety can be observed from any other Perspective. The correlation between one Perspective and another offers Insight.

Let’s walk through an example of an Application with a scaling issue.

From the Perspective of L2 Cache, Dataspace Profiling observes:

Select the offending L2 Cache Line entry, and Filter to “Look Inside the Machine” just from the Perspective of this one L2 Cache Line.

To gain Insight, we observe the L2 Cache Line in its entirety from the Perspective of the Virtual Addresses used by the Application:

We gain the Insight that just four neighboring Virtual Addresses incur the most cost to that one L2 Cache Line.

To gain more Insight, we observe these four Virtual Addresses when they occupy the L2 Cache Line in their entirety:

Now we observe them from the Perspective of the Processes used by the Application:

We have more Insight. One L2 Cache Line is used by many Processes of the Application and by just four neighboring Virtual Addresses.

We observe the Segment Profile and see the Virtual Memory Segment using 64k Sized Pages. More Insight!

One L2 Cache Line is used by:

  • Many Processes of the Application;
  • Four neighboring Virtual Addresses;
  • One 64k-sized Page.

As I’ve shown you in my previous blog entry, you can have any Perspective you want.

With Dataspace Profiling, you gain Insight into your Application from the Program View and Hardware View. In the Program View, you gain Insight into the Functions, Threads, Type Definitions and Virtual Memory Allocations. In the Hardware View, you gain Insight into the Physical Memory Placement, the Cache Hierarchy Utilization, Memory Management Unit Utilization, Execution Unit Utilization, and even Time.

With Dataspace Profiling Insight, scaling issues are self-evident.

Dataspace Profiling – Look Inside the Machine!

Thursday Sep 08, 2005

Dataspace Profiling Perspectives – Look Inside the Machine!

Dataspace Profiling enhances observability by providing additional perspectives into the costs associated with your application. Traditional profiling tools usually provide a function view of your program cost:

Excl. User CPU  Excl. Max.       Name  
                Mem. Stall             
   sec.      %     sec.      %    
849.494 100.00  799.249 100.00   
813.539  95.77  794.826  99.45   test
 31.762   3.74    3.643   0.46   find_free_slot
  3.342   0.39    0.280   0.04   go_test
  0.460   0.05    0.460   0.06   foo_fval
  0.180   0.02    0.040   0.01   main
  0.120   0.01    0.      0.     linkcnt
  0.030   0.00    0.      0.     memset
  0.030   0.00    0.      0.     rand
  0.020   0.00    0.      0.     use_free_slot
  0.010   0.00    0.      0.     ok

and also an instruction view of costs associated with your application:

Excl. User CPU  Excl. Max. 
                Mem. Stall             
   sec.      %     sec.      %
   3.522   0.4   0.      0.             [ 43] 100003b4c:  sllx        %o3, 2, %g2
   0.390   0.0   0.      0.             [ 43] 100003b50:  ld          [%g3 + %g2], %o3
## 3.623   0.4   3.623   0.5            [ 43] 100003b54:  add         %g2, %g3, %o4
## 6.545   0.8   0.      0.             [ 43] 100003b58:  xnorcc      %o3, 0, %g0
   0.      0.    0.      0.             [ 43] 100003b5c:  be,pn       %icc,0x100003bf0
   0.010   0.0   0.      0.             [ 43] 100003b60:  cmp         %o1, 32

As computer systems are more dependent on the Memory Subsystem, additional perspectives of costs enhance the diagnostic ability of Dataspace Profiling. I group these perspectives into two large categories: Hardware View and Program View.

Hardware View

Hardware View Perspectives include the costs from the Memory Subsystem and the Execution Units.

Costs from the components of the Memory Subsystem itself are costs from the Cache Hierarchy and costs from the Memory Topology of your computer system. For example, the Memory Subsystem costs from the L2 Cache Line perspective:

In this view, we observe the distribution of costs within the lines of the L2 Cache.

Additional Hardware View Perspectives include the Memory Subsystem costs from the Execution Unit Perspective. For example, costs from each Physical Processor:

In these cases, we observe how the operating system scheduled our application on the underlying hardware.

Dataspace Profiling also observes your application from the perspective of time, showing how Memory Subsystem costs changed over time:

Program View

Dataspace Profiling observes an application Program View from the perspective of both Program Source and Program Address Space.

Program Source includes the function source profiled in current tools, and Dataspace Profiling includes the costs of program type definitions and their constituents:

and the profile of a user-defined type by definition order:

Program Address Space profiling includes all the allocations in your application. Dataspace Profiling reports costs from the perspective of Address Segments, Virtual and Physical Pages, etc.

Dataspace Profiling provides you detailed observation of your application from every perspective.

Look Inside the Machine!

Tuesday Sep 06, 2005

Dataspace Profiling - Look Inside the Machine!

Computer systems originally contained a central processing unit encompassing many boards (and sometimes cabinets), and random access memory that responded in the same cycle time as the central processing unit. This central processing unit (or CPU, as we know it today) was very costly.

Initially, bulbs attached to wires within the CPU aided programmer deduction in the identification of program behavior, in order to save precious processor time. These were the early profiling tools.

Computer languages, such as FORTRAN and COBOL, improved programmer productivity. Profiling libraries followed to break down the cost of the most precious resource on the system: the processor. Profiling associated processor costs with processor instructions and the source representation of those instructions: functions and line numbers. Programmer productivity climbed, as critical central processor unit bottlenecks were uncovered and resolved in program source.

Computers continued to evolve: the central processor unit shrank down to a single board, the mainframe led to the minicomputer, and multiple processor systems appeared. A disruptive technology, the microprocessor, appeared around this time. These cheap microprocessors were mass-produced with large-scale integration (LSI) and later VLSI.

Initially, microprocessors were inept compared to central processor units; but mass production sounded the death knell for the discrete-logic central processor unit.

The "killer micro" debate raged at this time. Large numbers of cheap commodity microprocessors were grouped to solve large problems previously possible only with mainframes. Sun offered a wide array of microprocessor-based systems, some more powerful than the largest mainframes of the day.

We are now in the mid-1990s. The acquisition costs of microprocessors comprised a small fraction of overall system cost. The bulk of system cost was the memory subsystem (the cabinet, the interconnect, the controllers, the DRAM chips), and the peripherals.

In software, Solaris engineers were solving complex operating system scaling issues through deduction. In one case, through inference, an engineer found that one hot cache line bottlenecked all the microprocessors in the system. This brilliance in deduction was not missed; I realized we needed a tool to identify the now-new critical resource: the Memory Subsystem. This inflection point in technology drove me to invent Dataspace Profiling.

UltraSPARC-III was the first Sun processor that added support to monitor Memory Subsystem behavior. I worked with every processor team since then to include adequate support for Dataspace Profiling: the now-defunct Millennium processor; Niagara processors; UltraSPARC-III, IIIi and IV-based processors; and ROCK processors.

Computers evolved further: chip-multithreaded (CMT) processors have many Cores driving even more virtual processor Strands of instruction execution. These CMT processors offer fewer Memory Subsystem components than Strands of instruction execution. The performance-critical component in these systems is often the Memory Subsystem and not the Strands of execution.

Traditional profiling tools fail to detect these bottlenecks. Traditional profiling tools persist in monitoring the Processor Core, when the bottleneck is in the Memory Subsystem.

Dataspace Profiling monitors both the Processor Core and the Memory Subsystem to identify machine bottlenecks and relates the solution back to Program Source and Program Address Space. All machine components are profiled with low intrusion, on the fly, and related back visually to any program source and any program memory object.

Today, thanks to Performance Teams, Processor Teams, Compiler Teams, Solaris Teams, and the Sun Studio Analyzer Team, we are ready. Sun has incorporated Dataspace Profiling in Sun Studio.

Welcome aboard the experience of Dataspace Profiling: Look Inside the Machine!

Tuesday Aug 30, 2005

Getting Ready for CMT

I have been working on performance of computing systems for about 20 years. Earlier this year, I documented operating system and development system requirements for the transition to CMT processors.

CMT systems are the next step in computing systems that started with multi-board processors, followed by microprocessors, SMP, and NUMA SMP. Just as with each generation of systems before it, CMT presents new challenges for computer professionals.

As my blog introduction, I am posting Operating Systems and Development Systems for Chip-Multi-Threading Processors, which provides requirements for designers, developers, and deployment specialists working with CMT processor systems.



