DProfile - Piloting Sun Studio 11 Performance Analyzer
By nk on Dec 06, 2005
Fasten your seat belt, we're about to go on a multi-dimensional ride into your program - look inside the machine!
I have a sample scalability problem that I will now debug with Sun Studio 11 Performance Analyzer with DProfile.
I've recompiled my program with these flags:
cc -xhwcprof -g -c -xO4 th.c
and linked with these flags:
cc -xO4 -o fn3.out fn3.o th.o -lthread -xhwcprof -g
Collect the experiment with 6 software threads for analysis using the collect command:
% collect -p +on -A copy -F on fn3.out -N 6
I have used an UltraSPARC-IV+ processor in this test, so this is the .er.rc file that I created to define the perspectives for Analyzer. Note the latter portion of the file contains the processor-specific objects for the Sun Fire F6900 Server running the UltraSPARC-IV+ processor.
en_desc on
mobj_define Vaddr VADDR
mobj_define Paddr PADDR
mobj_define Process PID
mobj_define Thread (PID*1000)+THRID
mobj_define ThreadID THRID
mobj_define Seconds (TSTAMP/1000000000)
mobj_define Minutes (TSTAMP/60000000000)
mobj_define US4p_L1DataCacheLine (VADDR&0x1fe0)>>5
mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6
mobj_define US4p_L3CacheLine (PADDR&0x7fffc0)>>6
mobj_define VA_L2 VADDR>>6
mobj_define VA_L1 VADDR>>5
mobj_define PA_L2 PADDR>>6
mobj_define PA_L1 PADDR>>5
mobj_define US4p_T512_8k (VADDR&0x1fe000)>>13
mobj_define US4p_T512_64k (VADDR&0xff0000)>>16
mobj_define US4p_T512_512k (VADDR&0x7f80000)>>19
mobj_define US4p_T512_4M (VADDR&0x3fc00000)>>22
mobj_define US4p_T512_32M (VADDR&0x1fe000000)>>25
mobj_define US4p_T512_256M (VADDR&0xff0000000)>>28
mobj_define Vpage_32M VADDR>>25
mobj_define Vpage_256M VADDR>>28
mobj_define Ppage_32M PADDR>>25
mobj_define Ppage_256M PADDR>>28
mobj_define Processor CPUID&0x1ff
mobj_define Core CPUID&0x3ff
mobj_define Processor_Board (CPUID&0x1fc)>>2
mobj_define CoreID CPUID>>9
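To get a feel for what these definitions compute, here is a minimal C sketch that mirrors three of the expressions from the .er.rc file. The function names are mine, made up for illustration; they are not part of Analyzer. The masks and shifts are copied from the file above. Note how US4p_L2CacheLine keeps only bits 6 through 18 of the physical address, so two different physical lines can land on the same index, while PA_L2 keeps every bit above the line offset and tells them apart.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors: mobj_define US4p_L2CacheLine (PADDR&0x7ffc0)>>6
 * Keeps address bits 6..18 (13 bits, so 8192 possible index values). */
uint64_t us4p_l2_cache_line(uint64_t paddr)
{
    return (paddr & 0x7ffc0ULL) >> 6;
}

/* Mirrors: mobj_define PA_L2 PADDR>>6
 * Drops only the low 6 bits (the offset inside a 64-byte range). */
uint64_t pa_l2(uint64_t paddr)
{
    return paddr >> 6;
}

/* Mirrors: mobj_define US4p_T512_8k (VADDR&0x1fe000)>>13
 * Keeps virtual address bits 13..20 (8 bits, so 256 possible values). */
uint64_t us4p_t512_8k(uint64_t vaddr)
{
    return (vaddr & 0x1fe000ULL) >> 13;
}
```

For example, the made-up physical addresses 0x1000 and 0x1038 give the same US4p_L2CacheLine value (64) because they differ only in the low 6 bits. The addresses 0x1000 and 0x81000 also give 64, but their PA_L2 values differ (64 versus 8256), which is why it is useful to enable both objects.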
Fire up Analyzer, and let's roll:
This screen appears:
Note in the upper left corner two columns: User CPU and Max. Mem. Stall.
User CPU refers to the total execution time of the application, while Max. Mem. Stall refers to the time spent in the memory subsystem. I will only focus on the memory subsystem column in my blog. (In our scaling problem, we're spending almost all of the time in the memory subsystem.)
You will need to find the Data Presentation button (near the middle of the row of buttons) and the Compose Filter Clause button (the rightmost button on the button row).
Selecting the Data Presentation button brings up this dialog:
I'll select the exclusive metrics and percentage reporting:
Now I'll select the Tabs option in the panel, and see all of the available perspectives in the analyzer tabs:
The Performance Analyzer built-in objects are on the left, the right column has the built-in virtual and physical page objects and all of the objects defined via the .er.rc file.
These are the default settings in Sun Studio 11 Performance Analyzer. Since we are looking at a scalability problem, I will enable the cache hierarchy object US4p_L2CacheLine, its associated virtual address object VA_L2, and its associated physical address object PA_L2.
Since we will likely want to look at virtual addresses, I will enable the Vaddr object, plus the Seconds, Core and Thread objects, to give you a sense of what DProfile can do for your understanding of my application.
Press OK and let's look inside the machine!
We can look at the Source of the function and see that one C language statement is taking all of the time in the program:
One statement is causing the problem. But why?
Processors share data via L2 cache lines. Let's see what the L2 cache line profile looks like. Just press the tab associated with L2 cache lines:
We see one cache line is taking 94% of all memory system time.
We can drill down and find out why!
Click on the first L2 cache line, and press the Compose Filter Clause button.
The dialog below appears. The filter clause for the L2 cache line is in the dialog box. The current filter is also displayed, currently empty (all data is viewed).
You can AND the filter clause with the current filter, OR it with the current filter, or SET (assign) the current filter to the selected clause.
Press the SET button to assign the filter, and then OK.
Now you can press the VA_L2 tab. This tab returns the virtual address groupings that mapped into this L2 cache line. The question we are asking here is: how many copies of my virtual address space were mapped into the L2 cache line?
One virtual address range was mapped into this one hot line!
What virtual addresses were in this range? Press the Compose Filter Clause button.
Now AND the clauses together:
We look at Vaddr and see the 6 addresses that are taking all the time in the L2 cache line:
One L2 Cache Line; six addresses; six threads! Coincidence?
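The same check can be done by hand. A small sketch, assuming the VA_L2 definition from the .er.rc file (VADDR>>6): any set of virtual addresses whose VA_L2 values are equal sits in the same 64-byte range. The helper name and the sample addresses in the note below are hypothetical; they are not the addresses from my experiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors: mobj_define VA_L2 VADDR>>6 */
uint64_t va_l2(uint64_t vaddr)
{
    return vaddr >> 6;
}

/* Hypothetical helper: returns 1 if all n addresses fall into the
 * same VA_L2 group (the same 64-byte range), otherwise 0. */
int all_in_one_va_l2_group(const uint64_t *vaddrs, size_t n)
{
    assert(vaddrs != NULL && n > 0);
    for (size_t i = 1; i < n; i++) {
        if (va_l2(vaddrs[i]) != va_l2(vaddrs[0]))
            return 0;
    }
    return 1;
}
```

Six 8-byte items starting at a made-up address such as 0x20c00 (so 0x20c00, 0x20c08, ... 0x20c28) all shift down to the same VA_L2 value, 0x830, so the helper returns 1 for them. Move the last one to 0x20c40 and it returns 0.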
We can view the Seconds tab and view when in time these filtered costs occurred.
We can view the Thread tab and view which software threads incurred these filtered costs.
We can view the Core tab and view which hardware cores incurred these filtered costs. We can examine how Solaris scheduled the software threads on the hardware cores!
We confirm that each address was used by just one thread, so the six threads were falsely sharing the one L2 cache line.
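A common way to remove this kind of false sharing is to give each thread's data its own cache line. The sketch below is not the source of th.c (which I have not shown here); the struct names, the 8-byte counter type, and the 64-byte line size are assumptions for illustration, with the line size taken from the >>6 in the L2 definitions. It compares a packed layout with a padded one by computing which line each thread's slot lands in, assuming the structure itself starts on a line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed line size, implied by the >>6 above */
#define NTHREADS  6    /* six software threads, as in the experiment */

/* Packed layout: six 8-byte counters in 48 contiguous bytes. */
struct packed_counters {
    uint64_t count[NTHREADS];
};

/* Padded layout: each counter is followed by padding so that one
 * slot fills exactly one line. */
struct padded_slot {
    uint64_t count;
    char pad[LINE_SIZE - sizeof(uint64_t)];
};

struct padded_counters {
    struct padded_slot slot[NTHREADS];
};

/* Line index (relative to the start of the struct) that holds
 * thread i's counter in the packed layout. */
size_t packed_line_of(int i)
{
    assert(i >= 0 && i < NTHREADS);
    size_t off = offsetof(struct packed_counters, count)
               + (size_t)i * sizeof(uint64_t);
    return off / LINE_SIZE;
}

/* Same computation for the padded layout. */
size_t padded_line_of(int i)
{
    assert(i >= 0 && i < NTHREADS);
    size_t off = offsetof(struct padded_counters, slot)
               + (size_t)i * sizeof(struct padded_slot);
    return off / LINE_SIZE;
}
```

In the packed layout every thread's counter lands in line 0, which matches the one-hot-line picture above. In the padded layout thread i lands in line i, so no two threads write to the same line. The padding only helps if the structure really does start on a line boundary, so the allocation has to be aligned as well.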
We can view every object in hardware and software... and understand their relationships! FAST!
False sharing of all hardware structures is detected:
Scalability at your fingertips:
DProfile - Look Inside the Machine!