DProfile - "Data events in a multi-dimensional space for Profiling" From The Top

If you haven't already, check out the new CoolTools Page containing productivity enhancements for developers and performance engineers.

I started researching data motion profiling because most application time was spent waiting for the memory hierarchy to respond. More alarmingly, memory latency was growing relative to processor cycle time with each successive processor generation.

Chip multithreading (CMT) addresses memory latency by adding hardware threads, so that useful work on one thread overlaps the memory hierarchy latency of another. From Sun's introduction of the T1000 and T2000, it's evident that Sun is pursuing CMT more aggressively than any other large system vendor.

Scalability is critical to extracting thread-level parallelism (TLP)

Let's look at some sample code that counts a variable var up to a value limit:

for (;;) {
  if (var++ > limit) break;
}

When first multi-threading the application, the brute force approach is to add a lock around the shared variable var:

for (;;) {
   /* var is read and incremented while the lock is held */
   if (var++ > limit) break;
}
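The lock calls themselves are not spelled out in the loop above. Here is a minimal sketch of what the brute-force version might look like, assuming POSIX threads. The names var_lock and count_locked are mine, not from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

static pthread_mutex_t var_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical name */
static long var = 0;                                         /* the shared counter */

/* Thread body: every thread increments the same counter under the same lock. */
static void *count_locked(void *arg)
{
    const long limit = *(const long *)arg;

    for (;;) {
        int rc = pthread_mutex_lock(&var_lock);
        assert(rc == 0);
        int done = (var++ > limit);   /* shared read-modify-write */
        pthread_mutex_unlock(&var_lock);
        if (done)
            break;
    }
    return NULL;
}
```

Each worker would be started with pthread_create(&tid, NULL, count_locked, &limit). Every iteration on every thread then goes through the same lock and the same variable.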

This scales poorly, because every thread contends for the same lock and the same variable. The next attempt removes the lock by giving each thread a private copy of the variable, stored in an array and selected by the thread's index:

for (;(array[index]++ <= limit);) ;

This too scales poorly on Opteron and SPARC processors. This time it is an underlying hardware resource, the cache line, that is over-utilized.
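To see why private array elements can still collide in hardware, here is a small sketch that maps addresses to cache lines. The 64-byte line size is an assumption, and line_of and lines_touched are illustrative helpers, not part of any tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed cache line size in bytes */

/* Index of the cache line that holds the given address. */
static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p / LINE_SIZE;
}

/* Number of distinct cache lines spanned by n adjacent per-thread counters. */
static size_t lines_touched(const long *array, size_t n)
{
    assert(n > 0);
    return (size_t)(line_of(&array[n - 1]) - line_of(&array[0])) + 1;
}
```

With 8-byte counters, eight adjacent elements of a line-aligned array span a single 64-byte line, so eight threads that each own "their" element still fight over one line.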

Then how is scalability achieved?

Uniform Utilization of hardware resources ensures scalability:

  • Uniform Hardware Utilization Through Time (All program Phases use the hardware uniformly)

  • Uniform Utilization of Hardware Threads (All Hardware Threads are Busy)

  • Uniform Use of Software Threads (There are enough threads)

  • Uniform Cache Set Use in the Cache Hierarchy (No cache sets are oversubscribed)

  • Uniform Physical Memory Use (All memory boards used evenly)

  • Uniform Virtual Memory Use (No Hot Locks)

Asymmetric Utilization is called Skew. Bart touches on Asymmetric Hardware Utilization in Performance Anti-Patterns in this Queue article.

So why data motion?

We need a common abstraction that affects all hardware. The only entity that propagates through all active system components is Data Motion.

Every state transition in the system is Data Motion:

  • Load Instructions: Are Data Motion toward the Processor Core

  • Store Instructions: Are Data Motion toward the Cache Hierarchy

  • Arithmetic Operations: Are Data Motion within Processor Core (the processor happens to also change your data a bit :-))

  • Data Propagates Through All Hardware and Software Components

Data Motion lets us identify asymmetric utilization, or Skew, in every component of the system.

DProfile identifies scalability opportunities by tracking data motion in the entire application.

DProfile helps Performance Analysts and Deployments with dramatic results. Commercial applications have improved substantially when deployed using DProfile:

  • 530% on a cryptographic benchmark

  • 600% in a decision support workload

  • 800% with a design application

  • 1000% on Dhrystone

DProfile helps ISV Developers find code issues with ease:

  • 2000% (20x) performance gain in a high performance computing code

  • 1900% (19x) performance gain in my sample code.

All of these gains were achieved by using Dataspace Profiling or DProfile: Data events in a multi-dimensional space for Profiling.

Now that you've slogged through all my blog entries on how to set up Performance Analyzer, let me show you how DProfile works.

DProfile uses event collection agents built on statistical profiling. The agents combine instruction-level profiling with value profiling and capture the timestamp, the program counter address and, for memory instructions, the effective address and its associated physical address, among other values. Unlike statistical tools such as VTune that focus on instruction profiling, DProfile targets data motion and associates the Cost of data motion with all aspects of computation, software and hardware. Unlike memory profiling tools based on DynInst, such as Stride and pSIGMA, DProfile instrumentation samples the entire program execution (with about 5% slowdown using default settings).
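To make the captured values concrete, here is one way a sampled event could be pictured as a C record. This is only a sketch: the struct layout, the field names and the memop_cost helper are my own illustration, not the experiment format that Performance Analyzer actually stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one sampled event; not the tool's actual format. */
struct dp_event {
    uint64_t timestamp;  /* when the sample was taken */
    uint64_t pc;         /* program counter of the sampled instruction */
    int      is_memop;   /* nonzero for load/store samples */
    uint64_t vaddr;      /* effective address (valid if is_memop) */
    uint64_t paddr;      /* associated physical address (valid if is_memop) */
    uint64_t cost;       /* cost charged to this sample, e.g. memory stall time */
};

/* Total cost attributed to memory instructions in a batch of samples. */
static uint64_t memop_cost(const struct dp_event *ev, size_t n)
{
    assert(ev != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        if (ev[i].is_memop)
            total += ev[i].cost;
    return total;
}
```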

All analytical processing can be done on a remote system. Performance Analyzer queries respond within a few minutes, even for large experiments that approach the data size limits of Sun Studio Performance Analyzer. The data mining analysis can be parallelized for faster response times.

Unlike hardware-based memory profilers, such as Sun Fire Memory Controller, DProfile uses relational agents to map profiled values to hardware and software structures. The first relational agents were introduced by Performance Analyzer a few years ago. The most flexible of the relational agents is the user-specified "Custom Object" introduced in Sun Studio 11. This agent lets a user specify complex expressions that interpret relational agents. This flexibility permits prototyping of relational agents that can be implemented natively at a later time.

Relational agents deliver productivity gains by correlating the Cost of data motion with source code, system allocations and hardware structures.

Analysis entails a simple, iterative two-step process, Look and Filter:

  • Mapping the events to objects using relational agents. (Look)

  • Filtering an anomalous object and then mapping the remaining events to other objects. (and Filter)

Some of these relational agents are built in to Performance Analyzer and are provided by the Sun Studio Compiler Suite. Other relational agents are the formulas you learned in previous blog entries.

If you look and do not see asymmetric utilization, look along another dimension or at other objects. When you find asymmetric utilization, filter on it and look elsewhere to identify the cause of the bottleneck.
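The Look and Filter loop can be sketched in a few lines of C. This is only an illustration of the idea, under my own assumptions: the event record, the two dimensions (core and L2 line) and all function names are hypothetical, and Performance Analyzer does this work for you through its tabs and filter dialog.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 16  /* illustrative bound on object ids per dimension */

struct event { int core; int l2_line; long cost; };  /* hypothetical sample */

typedef int (*key_fn)(const struct event *);                /* event -> object id */
typedef int (*keep_fn)(const struct event *, const void *); /* nonzero keeps event */

static int by_core(const struct event *e)    { return e->core; }
static int by_l2_line(const struct event *e) { return e->l2_line; }

/* Filter clause: keep only events that touched one particular L2 line. */
static int on_l2_line(const struct event *e, const void *ctx)
{
    return e->l2_line == *(const int *)ctx;
}

/* Look: total the cost per object along one dimension, after filtering. */
static void look(const struct event *ev, size_t n, key_fn key,
                 keep_fn keep, const void *ctx, long totals[MAX_OBJECTS])
{
    for (size_t i = 0; i < MAX_OBJECTS; i++)
        totals[i] = 0;
    for (size_t i = 0; i < n; i++) {
        if (keep != NULL && !keep(&ev[i], ctx))
            continue;
        int k = key(&ev[i]);
        assert(k >= 0 && k < MAX_OBJECTS);
        totals[k] += ev[i].cost;
    }
}

/* The costliest object is the anomaly to filter on next. */
static int hottest(const long totals[MAX_OBJECTS])
{
    int best = 0;
    for (int i = 1; i < MAX_OBJECTS; i++)
        if (totals[i] > totals[best])
            best = i;
    return best;
}
```

A first call such as look(ev, n, by_l2_line, NULL, NULL, totals) shows the Cost per L2 line. Passing the hottest line to on_l2_line and looking again with by_core shows which cores paid for that one line.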

You can think of DProfile as statistical value profiling with multi-dimensional presentation and filtering.

For the example above, we collected an experiment and then ran Performance Analyzer, all on an UltraSPARC IV+ system:

The first panel shows the functional view of Performance Analyzer with the Functions tab selected. We observe that the thread_fcn function is taking the most execution time, and the most time in the memory system.  In this example, the time spent in the memory subsystem is the Cost.  We can examine the source of the function causing the Cost and verify that the loop is taking all the time in the memory subsystem:

In the second screenshot, the Source tab is selected.  It shows us the source line causing all the Cost.

But why?

Let's select the Vaddr tab and see in Graphical mode what virtual addresses are the most costly:

You notice a rather even distribution.

From the virtual memory perspective, everything is what we expected. Now let's look at the L2 Cache lines. Looking at L2 cache lines shows the breakdown of Cost by the hardware unit of coherency:

Ah! There's just one very costly line. Let's switch to text mode:

We select the hottest line (click on Object 1901 and it will turn blue), and then click the Filter button pointed to by the red arrow. This button lets us drill down into why this L2 cache line is involved with all the Cost.

The advanced filter dialog appears. The expression for the highlighted L2 cache line is in the Filter Clause field. To add this filter clause to the editable Specify Filter text field, press the Set button pointed to by the red arrow.

This Specify Filter field, shown with the blue ellipse, accepts any expression. This expression can specify any condition for examination: for example, remote accesses (Processor_Board != Memory_Board) and others. Predefined objects and tokens are available for use here. The C Language logical operators are available to help build complex expressions. These complex expressions drill down to the condition where you need more information.
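Since the filter uses C logical operators, a compound condition reads much like a C predicate. The sketch below is plain C, not Performance Analyzer filter syntax. Only the two board names come from the example above, and the l2_line field and the remote_on_line function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-event fields; only the two board names come from the text. */
struct event {
    int Processor_Board;  /* board of the core that issued the access */
    int Memory_Board;     /* board that holds the physical memory */
    int l2_line;          /* L2 cache line object, e.g. 1901 */
};

/* Remote accesses that also touched one chosen L2 cache line. */
static int remote_on_line(const struct event *e, int line)
{
    assert(e != NULL);
    return (e->Processor_Board != e->Memory_Board) && (e->l2_line == line);
}
```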

Now that we are looking at the top cache line, we can select the Core tab to identify which processor cores utilized this one cache line:

Note that we quickly see all 8 cores used this single cache line.

We can look further by selecting PA_L2, the physical address grouping that matches the L2 cache line. This grouping has the same size and alignment as an L2 cache line. It shows whether the cache line is heavily shared (same physical address) or falsely shared through a hardware conflict (different physical addresses):

Instantly, you see that we have a heavily shared hardware resource.
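The distinction between heavy sharing and a hardware conflict can be sketched as a count of distinct line-sized physical address groups behind one cache line. The line size, the number of line indexes and the simple modulo indexing are all assumptions for illustration, and the function names are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64      /* assumed L2 line size in bytes */
#define L2_LINES  32768   /* assumed number of line indexes in the L2 */

/* PA_L2-style group: physical addresses that fit in one line-sized block. */
static uint64_t pa_group(uint64_t paddr) { return paddr / LINE_SIZE; }

/* Cache line index a physical address maps to (simple modulo indexing assumed). */
static uint64_t l2_index(uint64_t paddr) { return pa_group(paddr) % L2_LINES; }

/* Count distinct PA groups among the samples that hit one cache line. */
static size_t distinct_pa_groups(const uint64_t *paddr, size_t n)
{
    assert(paddr != NULL || n == 0);
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < i; j++)
            if (pa_group(paddr[j]) == pa_group(paddr[i]))
                break;
        if (j == i)
            distinct++;
    }
    return distinct;
}
```

One group behind the hot line means every sample touched the same physical block, which is heavy sharing. Several groups that map to the same index would instead point to a conflict.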

Let's again select the Vaddr tab and see in text mode what virtual addresses are the most costly within only this heavily shared L2 cache line:

Very quickly, you see the problem:

  • One source line

  • Using one L2 cache line

  • Shared among 8 Cores

  • Using one Physical Address Group

  • Using 8 different Virtual Addresses

No more guesswork or hypotheses.  Our questions are answered with DProfile.

The solution is to modify the virtual addresses so we use one L2 cache line per address, one address per core.

This solution improves performance by 1900%!
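One common way to get one cache line per counter is to pad and align each element. Here is a minimal sketch, assuming a 64-byte L2 line and a C11 compiler. The struct and function names are illustrative, and this is not necessarily the exact change behind the measurement above.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64   /* assumed L2 cache line size in bytes */
#define NTHREADS  8    /* one slot per core in the example */

/* One counter per cache line: the padding keeps neighbours apart. */
struct padded_counter {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
};

static _Alignas(LINE_SIZE) struct padded_counter counters[NTHREADS];

/* Same loop as before, but each thread now owns a whole line. */
static void count_private(int index, long limit)
{
    assert(index >= 0 && index < NTHREADS);
    for (; counters[index].value++ <= limit; )
        ;
}
```

Each thread calls count_private with its own index, so its increments stay within a line that no other core writes.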

