Monday Apr 20, 2009

Observability and Diagnosis Techniques are the way forward

I've just closed down another escalation involving a performance comparison of an application (Finacle in this case) running on a 6-board Sun Fire E6900 and a 6-board Sun Fire E25K. Obviously we're not comparing apples and pears here, but nevertheless the customer noticed huge performance differences during core business hours with the *same* workload (and I use that term loosely, as nothing can entirely occupy the same physical space in time or be made up of identical particles... you know what I mean... although perhaps some clever physicists can argue against that?!). Anyhow, the customer argued that they were running the same application binaries on both platforms and delivering the same user load to both (and they still couldn't confirm that!), yet were seeing a huge performance degradation on the E25K domain: slow interactive response from the users' perspective and high load. That didn't actually surprise me, as the 25K is a larger system and will have higher memory latencies due to its larger interconnect compared with the E6900.

What we observed, via the prstat microstate accounting output, was multiple application processes consuming userland CPU and no userland lock contention. Had there been a scalability issue within the application, we would more than likely have seen userland lock contention; one of the frequent flyers we see in the performance team is applications using malloc in multi-threaded SMP environments, see this interesting read for further details. Unfortunately that wasn't the case here, so we needed to take another direction in the observability game to prove to the customer the fundamental differences between the two platforms. The answer was cpustat and its ability to show the difference between the two platforms using cycles-per-instruction calculations.
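As an aside, the lock-contention check is just a matter of watching the microstate columns in prstat. A minimal sketch (the process name is a stand-in for the actual Finacle binaries):

# Per-LWP microstate accounting, five 10-second samples; the LCK column shows
# the percentage of time each thread spent waiting on userland locks.
prstat -mL -p "$(pgrep -d, finacle)" 10 5

In this case LCK stayed low across the application's LWPs, so lock contention (and the classic multi-threaded malloc problem) was ruled out, which is what pushed us towards cpustat and CPI.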

The following excerpt from Solaris™ Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris explains the methodology (read it on Safari Books if you have an account, or buy it from Amazon; the entire book is well worth a read).

8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an UltraSPARC IIi to calculate the average number of cycles per instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.034 0 tick 3554903403 3279712368 1.08 cpi
10.034 1 total 3554903403 3279712368 1.08 cpi


This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions are run over and over, the instructions and data are found in the Level-1 cache, resulting in fast instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.036 0 tick 205607856 34023849 6.04 cpi
10.036 1 total 205607856 34023849 6.04 cpi


Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

--
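
The per-domain sample files examined below (cpu1.out, cpu2.out and so on) each hold one such sample, taken over roughly 30 seconds across every CPU in the domain. A hedged sketch of the sort of collection command involved (the interval, count and file name are illustrative; the counters are the same ones used above):

# One 30-second sample of cycle and instruction counts for every CPU in the
# domain, with a cycles-per-instruction column appended, saved for analysis.
cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 30 1 | \
    awk '{ printf "%s %.2f cpi\n", $0, $4/$5 }' > cpu1.out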

So, looking at the customer's data:

[andharr@node25k]$ grep total cpu*
cpu2.out: 30.429 48 total 1453222115333 358247394908 4.06 cpi
cpu22.out: 30.523 48 total 1463632215285 347056691816 4.22 cpi
cpu222.out: 30.367 48 total 1395799585592 423393952271 3.30 cpi

[andharr@node6900]$ grep total cpu*
cpu1.out: 31.038 48 total 1209418147610 522125013039 2.32 cpi
cpu11.out: 30.311 48 total 1194302525311 573624473405 2.08 cpi
cpu111.out: 30.408 48 total 1105516225829 552190193006 2.00 cpi

So the 25K is showing a higher cycles-per-instruction average than the 6900, which does show a difference in performance between the two systems, more than likely due to the higher memory latency on the 25K. If we look at the raw per-CPU data for the busy periods of the sample, sorted by size, you can see some very big differences in the largest CPI values between the systems:

[andharr@node6900]$ cat cpu1.out | awk '{print $6}' |sort -n | tail
6.30
6.37
7.14
7.16
7.26
7.35
8.17
8.27
8.36
8.91
[andharr@node6900]$ cat cpu11.out | awk '{print $6}' |sort -n | tail
6.67
6.71
6.77
6.80
7.21
7.70
7.72
8.40
9.21
11.92
[andharr@node6900]$ cat cpu111.out | awk '{print $6}' |sort -n | tail
6.26
6.39
6.65
6.93
6.99
7.25
7.81
8.65
8.81
9.32

[andharr@node25k]$ cat cpu2.out | awk '{print $6}' |sort -n |tail
26.65
26.86
26.99
28.71
29.48
30.06
30.87
32.93
34.05
34.36
[andharr@node25k]$ cat cpu22.out | awk '{print $6}' |sort -n |tail
31.35
31.82
32.32
33.16
34.03
35.00
38.51
47.69
50.19
51.04
[andharr@node25k]$ cat cpu222.out | awk '{print $6}' |sort -n |tail
26.03
26.71
26.90
27.31
27.45
28.29
28.42
29.28
32.30
35.40
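
Rather than eyeballing the sorted tails, the spread of per-CPU CPI in each sample can be summarised in one pass. A quick sketch, assuming the same six-column output format as in the files above:

# Report the minimum, mean and maximum per-CPU cpi in each sample file.
for f in cpu*.out; do
    nawk '$3 == "tick" {
            n++; sum += $6
            if (min == "" || $6 < min) min = $6
            if ($6 > max) max = $6
        }
        END { if (n > 0) printf("%s: min %.2f mean %.2f max %.2f cpi over %d cpus\n",
                                 FILENAME, min, sum / n, max, n) }' "$f"
done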

Conclusion

So the consistently higher CPI figures on the E25K didn't really indicate a good match for the Finacle application running on it. They suggest that Finacle generates a memory-latency-sensitive workload which might not be entirely suited to the 25K platform. Binding the Finacle application into specific processor sets (sketched below) might reduce thread migration and some non-local memory accesses, which may improve performance somewhat but won't entirely eradicate the memory latency issue. The reason it's more noticeable on the E25K than the E6900 is that there is a greater distance to travel across the E25K interconnect, i.e. more copper for the electrical signals to traverse. MPO (Memory Placement Optimization) was introduced in Solaris 9 to help alleviate latency in large-scale NUMA configurations by attempting to keep LWPs local to their "home" cpu/memory board, but it cannot eliminate that latency for all workloads, as was the case here.
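
As a rough illustration of the processor-set idea (the CPU ids and process name are made up; the right grouping depends on which CPUs share a system board on the domain):

# Create a processor set from the CPUs on one board and bind the application
# processes to it, limiting cross-board thread migration.
psrset -c 0 1 2 3              # prints the id of the new set, e.g. 1
psrset -b 1 $(pgrep finacle)   # bind the matching processes to set 1
psrset -i 1                    # list the set and its CPUs to confirm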

See the following documents for background information on MPO:

Solution 214954 : Sun Fire[TM] Servers: Memory Placement Optimization (MPO) and Solution 216813 : Sun Fire[TM] Servers: Memory Placement Optimization (MPO) Frequently Asked Questions (FAQ)

As my esteemed colleague Clive said, "Applications drive system behaviour", so we need to look at the application and how it interacts with the system rather than the other way round, which always points me back to one of my first blog entries on the importance of understanding the application architecture and starting with a top-down approach. :)