Tuesday Aug 11, 2009

Multiple Oracle instances performance issue

Finally closed a VOS case that had been open for over a year, related to high system CPU consumption caused by running multiple Oracle RDBMS instances on a single system. The observation was 80-90% system CPU from mpstat 1 and the following from lockstat profiling:

Profiling interrupt: 67240 events in 2.168 seconds (31017 events/sec)

Count genr cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
40920 61% ---- 0.00 987 cpu[7] fop_ioctl
40920 61% ---- 0.00 987 cpu[7] ioctl
40880 61% ---- 0.00 986 cpu[7] read_kstat_data
40248 60% ---- 0.00 1077 cpu[7] syscall_trap
38780 58% ---- 0.00 947 cpu[2] mutex_vector_enter
32478 48% ---- 0.00 947 cpu[5] kstat_hold_bykid
32477 48% ---- 0.00 947 cpu[5] kstat_hold
13516 20% ---- 0.00 1845 cpu[102] (usermode)
6466 10% ---- 0.00 1904 cpu[423] syscall_trap32
6169 9% ---- 0.00 926 cpu[3] kstat_rele
4738 7% ---- 0.00 1626 cpu[96] thread_start
2420 4% ---- 0.00 1359 cpu[135]+11 idle
2317 3% ---- 0.00 1464 cpu[135]+11 disp_getwork
2122 3% ---- 0.00 2764 cpu[101] fop_read
1388 2% ---- 0.00 2510 cpu[101] vx_read
1379 2% ---- 0.00 2509 cpu[101] vx_read1
1352 2% ---- 0.00 2503 cpu[101] vx_cache_read
1267 2% ---- 0.00 2459 cpu[418] trap
1215 2% ---- 0.00 3059 cpu[128] fop_write
1082 2% ---- 0.00 2339 cpu[418] utl0
-------------------------------------------------------------------------------
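The profile points straight at the kstat lookup path (kstat_hold/kstat_hold_bykid under mutex_vector_enter). On Solaris 10 a DTrace one-liner along these lines can confirm which processes are driving those lookups; this is a sketch and assumes the fbt provider exposes read_kstat_data on the kernel in question:

# count kstat reads by process name over a few seconds, then Ctrl-C
dtrace -n 'fbt::read_kstat_data:entry { @[execname] = count(); }'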

Originally I raised CR 6734910 - "kstat_hold doesn't scale well on large systems" to track this, but it became clear that Oracle could make better use of the kstat interface, and Oracle subsequently addressed the issue in:

Oracle RDBMS versions >= 10.2.0.1 but < 11.2 - Bug/Patch 8531434 - "Solaris: Excessive CPU by MMNL/CJQ0 when running multiple instances and cpus" (Doc ID 8531434.8)

Oracle RDBMS versions >= 11.1.0.6 but < 11.1.0.7 - Bug/Patch 8777336 - "multiple kstat calls while getting socket count and core count for v$osstat" (Doc ID 8777336.8)

11gR2 doesn't appear to be affected by this issue. So to avoid a performance hit whilst running multiple Oracle instances on a single host, I'd recommend patching an affected database or, if that's not possible, using the workaround of tuning the Oracle parameter _job_queue_interval (e.g. from 5 to 30), at the cost of potentially losing some granularity in the performance statistics.
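For reference, this is roughly how that workaround would be applied; the exact statement is a sketch, underscore parameters should only be changed under Oracle Support guidance, and the change may need an instance restart to take effect:

# set the hidden parameter in the spfile from the shell via sqlplus
echo 'ALTER SYSTEM SET "_job_queue_interval"=30 SCOPE=SPFILE;' | sqlplus -s "/ as sysdba"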

Monday Apr 20, 2009

Observability and Diagnosis Techniques are the way forward

I've just closed down another escalation involving a performance comparison of an application (Finacle in this case) running on a 6-board Sun Fire E6900 and a 6-board Sun Fire E25K. Obviously we're not comparing like with like here, but nevertheless the customer noticed huge performance differences during core business hours with the *same* (and I use that term loosely, as nothing can entirely occupy the same physical space in time or be made up of identical particles... you know what I mean, although perhaps some clever physicists can argue against that?!) workload. Anyhow, the customer argued that they were using the same application binaries on both platforms and were delivering the same user load (which they still couldn't confirm!) to both, yet were seeing a huge performance degradation (slow interactive response from the users' perspective and high load) on the E25K domain. That didn't actually surprise me, as the E25K is a larger system and will have higher memory latencies due to its larger interconnect compared with the E6900.

So, what we observed was multiple application processes consuming userland CPU and no userland lock contention in the prstat microstate accounting output. If we'd suspected a scalability issue with the application then we'd more than likely have observed userland lock contention. One of the frequent flyers we see in the performance team is applications using malloc in multi-threaded SMP environments; see this interesting read for further details. Unfortunately that wasn't the case here, so we needed to take another direction in the observability game to prove to the customer the fundamental differences between the two platforms. The answer was to use cpustat and show the difference between the two platforms using cycles-per-instruction (CPI) calculations.

The following chapter from Solaris™ Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris explains the methodology (or read it on Safari Books if you have an account, or buy it from Amazon; the entire book is well worth a read).

8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an Ultra-SPARC IIi to calculate the average number of cycles per instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.034 0 tick 3554903403 3279712368 1.08 cpi
10.034 1 total 3554903403 3279712368 1.08 cpi


This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions are run over and over, the instructions and data are found in the Level-1 cache, resulting in fast instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.036 0 tick 205607856 34023849 6.04 cpi
10.036 1 total 205607856 34023849 6.04 cpi


Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

--

So, looking at the customer's data:

[andharr@node25k]$ grep total cpu*
cpu2.out: 30.429 48 total 1453222115333 358247394908 4.06 cpi
cpu22.out: 30.523 48 total 1463632215285 347056691816 4.22 cpi
cpu222.out: 30.367 48 total 1395799585592 423393952271 3.30 cpi

[andharr@node6900]$ grep total cpu*
cpu1.out: 31.038 48 total 1209418147610 522125013039 2.32 cpi
cpu11.out: 30.311 48 total 1194302525311 573624473405 2.08 cpi
cpu111.out: 30.408 48 total 1105516225829 552190193006 2.00 cpi

So the E25K is showing a higher cycles-per-instruction average than the E6900, which does show a difference in performance between the two systems, more than likely due to the higher memory latency on the E25K. If we look at the raw per-CPU data for the busy periods, sorted by value, we can see some very big differences in the largest CPI values between the systems:

[andharr@node6900]$ cat cpu1.out | awk '{print $6}' |sort -n | tail
6.30
6.37
7.14
7.16
7.26
7.35
8.17
8.27
8.36
8.91
[andharr@node6900]$ cat cpu11.out | awk '{print $6}' |sort -n | tail
6.67
6.71
6.77
6.80
7.21
7.70
7.72
8.40
9.21
11.92
[andharr@node6900]$ cat cpu111.out | awk '{print $6}' |sort -n | tail
6.26
6.39
6.65
6.93
6.99
7.25
7.81
8.65
8.81
9.32

[andharr@node25k]$ cat cpu2.out | awk '{print $6}' |sort -n |tail
26.65
26.86
26.99
28.71
29.48
30.06
30.87
32.93
34.05
34.36
[andharr@node25k]$ cat cpu22.out | awk '{print $6}' |sort -n |tail
31.35
31.82
32.32
33.16
34.03
35.00
38.51
47.69
50.19
51.04
[andharr@node25k]$ cat cpu222.out | awk '{print $6}' |sort -n |tail
26.03
26.71
26.90
27.31
27.45
28.29
28.42
29.28
32.30
35.40

Conclusion

So the fact that the E25K showed higher CPI numbers suggests it isn't a great match for the Finacle application: Finacle appears to generate a memory-latency-sensitive workload which might not be entirely suited to the E25K platform. Binding the Finacle processes into specific processor sets might reduce thread migration and some non-local memory accesses, which may improve performance somewhat but won't entirely eradicate the memory latency issue. The reason it's more noticeable on the E25K than the E6900 is that there is a greater distance to travel across the E25K interconnect, i.e. more copper for the electrical signals to traverse. MPO (Memory Placement Optimization) was introduced in Solaris 9 to help alleviate latency in large-scale NUMA configurations by attempting to keep LWPs local to their "home" CPU/memory board, but it cannot eliminate it for all workloads (as in this case). A sketch of the processor set approach follows.
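For illustration, the processor set approach looks something like this (the CPU IDs and the PID are placeholders, and the CPUs would need to be picked from a single board for the memory locality to help):

# create a processor set from CPUs on one board and bind the process to it
psrset -c 0 1 2 3        # prints the new processor set ID, e.g. 1
psrset -b 1 <pid>        # bind the Finacle process to processor set 1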

See the following documents for background information on MPO:

Solution 214954: Sun Fire[TM] Servers: Memory Placement Optimization (MPO) and Solution 216813: Sun Fire[TM] Servers: Memory Placement Optimization (MPO) Frequently Asked Questions (FAQ)

As my esteemed colleague Clive said, "Applications drive system behaviour", so we need to look at the application and how it interacts with the system rather than the other way round. That always points me back to one of my first blog entries on the importance of understanding the application architecture and taking a top-down approach. :)

Monday Jan 19, 2009

Performance considerations when upgrading Solaris

The biggest piece of advice I can give to those of you about to upgrade with lots of custom tunables in /etc/system... read the manual (RTFM if you're feeling particularly vocal). No seriously, I mean it! :) You only have to read the Solaris Tunable Parameters Reference Manual, as it actually discusses moving to newer releases with older /etc/system tunables:

"We recommend that you start with an empty /etc/system file when moving to a new Solaris
release. As a first step, add only those tunables that are required by in-house or third-party
applications. Any tunables that involve System V IPC (semaphores, shared memory, and
message queues) have been modified in the Solaris 10 release and should be changed in your
environment. For more information, see “System V IPC Configuration” on page 21. After
baseline testing has been established, evaluate system performance to determine if additional
tunable settings are required."


So, the advice is: move it out of the way and start from scratch. :) Obviously speak to your application vendors about anything that is required to run the application, but other than that, see how things go and only change tunables when and where necessary, otherwise you could run into other problems.

The only application I'll make specific points about is Oracle: with Solaris 10 we've introduced resource controls, so the shared memory / semaphore settings no longer need to be defined in /etc/system. See the Oracle installation guide or Solution 208623: Solaris[TM] 10 Operating System: System V Inter-Process Communication (IPC) resource controls for further details; a brief sketch follows.
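As a minimal sketch of the resource control approach (the project name and the 8G cap are examples only, not recommendations):

# create a project for the oracle user with a shared memory cap
projadd -U oracle -K "project.max-shm-memory=(priv,8G,deny)" user.oracle
# verify what processes running in the project will see
prctl -n project.max-shm-memory -i project user.oracle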

Monday Dec 22, 2008

Diagnosing a VxFS filesystem performance issue

I recently got involved in an interesting performance problem in which the customer was seeing the time to copy a 500MB file to two different VxFS filesystems on the same box differ by up to a factor of 4 in some cases, i.e. "timex cp <500mb file> /vxfs_filesystem". The only things which had changed (I'm actually smiling at the word *only*) between the two dates were:

* Various OS Patches
* Symantec Veritas Volume Manager Patch
* Upgraded LPFC Driver Servers
* Upgraded firmware
* Upgraded Power Path Servers
* Upgraded SYMCLI

which doesn't help immediately in narrowing down probable causes, as far too many things changed at once! So how does one actually track down what's happening in the lifecycle of the "cp" on a Solaris 9 box? TNF to the rescue again. Using the TNF wrapper script from one of my previous posts, we could enable and disable TNF whilst the cp was happening (see the capture sketch in the conclusion below). Once we'd got that data back from the customer and extracted and processed the TNF buffers, I first looked at the TNF truss outputs:

[andharr@exdev:~/tnf/tnf/notworking]$ more tnf.truss
looking for pid  12045
Now(ms)          duration(ms)   pid     thread             call           return       errno
48335.67488000   15.83081400    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48351.53089800   254.21403800   12045    0x30101ab09a0    write          0x800000       0
48605.75773800   16.39489300    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48622.17487400   244.79986500   12045    0x30101ab09a0    write          0x800000       0
48866.98866100   16.51242700    12045    0x30101ab09a0    mmap64         0x7fffffff     0
48883.52309100   263.08493800   12045    0x30101ab09a0    write          0x800000       0
49146.62003100   16.96265000    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49163.60188400   262.34315500   12045    0x30101ab09a0    write          0x800000       0
49425.95560000   13.35143400    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49439.32559600   256.47619000   12045    0x30101ab09a0    write          0x800000       0
49695.81298800   13.16556800    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49708.99975900   246.71828900   12045    0x30101ab09a0    write          0x800000       0
49955.73060900   15.93346900    12045    0x30101ab09a0    mmap64         0x7fffffff     0
49971.68440100   254.45911200   12045    0x30101ab09a0    write          0x800000       0
50226.15543500   16.82535100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50242.98686700   262.94091700   12045    0x30101ab09a0    write          0x800000       0
50505.93946600   15.95571100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50521.91406000   254.57880900   12045    0x30101ab09a0    write          0x800000       0
50776.50887100   16.54139300    12045    0x30101ab09a0    mmap64         0x7fffffff     0
50793.06922600   263.14118400   12045    0x30101ab09a0    write          0x800000       0
51056.22369200   13.62531200    12045    0x30101ab09a0    mmap64         0x7fffffff     0
51069.86692700   265.74658200   12045    0x30101ab09a0    write          0x800000       0
51335.62423100   14.17114700    12045    0x30101ab09a0    mmap64         0x7fffffff     0
51349.81458000   266.32114200   12045    0x30101ab09a0    write          0x800000       0
51616.14980400   14.28020100    12045    0x30101ab09a0    mmap64         0x7fffffff     0
<snip>

[andharr@exdev:~/tnf/tnf/working]$ more tnf.truss
looking for pid  16545
<snip>
Now(ms)          duration(ms)   pid     thread             call           return       errno
74.48367000   0.09929400     16545    0x30129365ce0    mmap64         0x7fffffff     0
74.60224600   145.95268900   16545    0x30129365ce0    write          0x800000       0
220.56909700   15.92858800    16545    0x30129365ce0    mmap64         0x7fffffff     0
236.51624700   145.57631700   16545    0x30129365ce0    write          0x800000       0
382.11048600   17.13331300    16545    0x30129365ce0    mmap64         0x7fffffff     0
399.25084000   148.40430500   16545    0x30129365ce0    write          0x800000       0
547.67306700   15.89682400    16545    0x30129365ce0    mmap64         0x7fffffff     0
563.59125400   143.89640600   16545    0x30129365ce0    write          0x800000       0
707.50126200   16.40793300    16545    0x30129365ce0    mmap64         0x7fffffff     0
723.91583600   144.97527400   16545    0x30129365ce0    write          0x800000       0
868.90455200   15.84185600    16545    0x30129365ce0    mmap64         0x7fffffff     0
884.76377000   145.10353100   16545    0x30129365ce0    write          0x800000       0
1029.88114300   15.60142300    16545    0x30129365ce0    mmap64         0x7fffffff     0
1045.50632900   145.67129000   16545    0x30129365ce0    write          0x800000       0
1191.19362100   16.41825600    16545    0x30129365ce0    mmap64         0x7fffffff     0
1207.63660000   144.74260200   16545    0x30129365ce0    write          0x800000       0
1352.39696400   17.08674700    16545    0x30129365ce0    mmap64         0x7fffffff     0
1369.49027200   143.29896300   16545    0x30129365ce0    write          0x800000       0
1512.80107700   14.10729700    16545    0x30129365ce0    mmap64         0x7fffffff     0
1526.91309500   144.05898900   16545    0x30129365ce0    write          0x800000       0
1670.98672600   16.63788500    16545    0x30129365ce0    mmap64         0x7fffffff     0
1687.63237200   141.60897100   16545    0x30129365ce0    write          0x800000       0
1829.25798500   15.96867300    16545    0x30129365ce0    mmap64         0x7fffffff     0
1845.23249900   144.46080400   16545    0x30129365ce0    write          0x800000       0
1989.70738500   15.94683000    16545    0x30129365ce0    mmap64         0x7fffffff     0
<snip>

Now, that's interesting, in that we see a difference in the time taken for write(), so what's happening in the write() lifecycle?

<not working>
    48545.230384         0.006560 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690624  size: 0x10000  buf: 0x3004185e058  flags: 558353 0,0
    48545.277271         0.000080 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39692928  size: 0x10000  buf: 0x312eec85688  flags: 1289 0,0
    48545.416170         0.000800 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690752  size: 0x10000  buf: 0x327286721b0  flags: 558353 0,0
    48545.454655         0.002640 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693056  size: 0x10000  buf: 0x506b36947c8  flags: 1289 0,0
    48545.544348         0.008562 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39690880  size: 0x10000  buf: 0x3004185f458  flags: 558353 0,0
    48545.580753         0.012962 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693184  size: 0x10000  buf: 0x312ead7be08  flags: 1289 0,0
    48545.642601         0.004721 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39691008  size: 0x10000  buf: 0x30070b4af30  flags: 558353 0,0
    48545.676366         0.003361 12045     1 0x30101ab09a0 224 strategy                   device: 0x1240000007a  block: 39693312  size: 0x10000  buf: 0x3006ca9bb20  flags: 1289 0,0
    48545.735814         0.003681 12045     1 0x30101ab09a0 224 strategy                   device: 0x11100000faf  block: 39691136  size: 0x10000  buf: 0x327cd4835c8  flags: 558353 0,0
<snip>

So on the non-working filesystem our I/O appears to be issued in 64KB (0x10000) chunks, but how long did one I/O actually take?

48545.230384  -  12045 1 0x30101ab09a0 224 strategy device: 273,4015  block: 39690624 size: 0x10000 buf: 0x3004185e058 flags: 0x88511
48549.243496  -  0 0 0x2a100477d20 64 biodone device: 273,4015  block: 39690624 buf: 0x3004185e058

4.013112ms

<working>
      236.516247         0.000720 16545     1 0x30129365ce0  65 syscall_start              sysnum: 4 write
      236.972870         0.004481 16545     1 0x30129365ce0  65 address_fault              address: 0xfe800000  fault_type: 2  access: 1
      238.215601         0.068810 16545     1 0x30129365ce0  65 strategy                   device: 0x11100000fa9  block: 24838144  size: 0x100000  buf: 0x300359fa3a8  flags: 289 0,0
      238.247765         0.002240 16545     1 0x30129365ce0  65 strategy                   device: 0x12400000022  block: 24840448  size: 0x100000  buf: 0x30038cd24a0  flags: 297 0,0
      238.461314         0.002240 16545     1 0x30129365ce0  65 thread_block               reason: 0x300359fa468  stack:  4 0x114190801048bb0 0x10f28ec01080b30 0x78a4d834789aa770 0x78a7dd6078a798b0
<snip>

238.215601  -  16545 1 0x30129365ce0 65 strategy device: 273,4009  block: 24838144 size: 0x100000 buf: 0x300359fa3a8 flags: 0x121
254.402304  -  0 0 0x2a100471d20 64 biodone device: 273,4009  block: 24838144 buf: 0x300359fa3a8

16.186703ms
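(The latencies above come from pairing each strategy record with its matching biodone on the same buf address. A small nawk sketch along the following lines does that against the tnfdump output; the exact field layout is an assumption and would need adjusting to the real records.)

# print I/O service times (ms) by pairing strategy/biodone on buf address
nawk '{ for (i = 1; i <= NF; i++) if ($i == "buf:") buf = $(i + 1) }
      / strategy / { start[buf] = $1 }
      / biodone /  { if (buf in start) printf "%.6f ms\n", $1 - start[buf] }' tnf.txt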

So the time for an 8MB write issued in 1MB chunks:

16.186703 * 8 = 129.49ms (which tallies with the ~145ms write() times seen on the working filesystem)

and for 8MB in 64KB chunks:

4.013112 * 128 = 513.68ms (the ~255ms write() times actually observed on the non-working filesystem are lower than this because the I/Os can be queued and completed in parallel)

So it would seem that the difference in time can be explained by the way the I/O is broken up on each filesystem. Perhaps the way the filesystems have been configured can explain the difference? Unfortunately, I didn't see any differences between the filesystems' tunables themselves, but some of the configuration details did match some of our physical write sizes:

Filesystem i/o parameters for /filesystem
read_pref_io = 65536
read_nstream = 1
read_unit_io = 65536
write_pref_io = 65536
write_nstream = 1
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 1048576
discovered_direct_iosz = 262144
max_direct_iosz = 1048576
default_indir_size = 8192
qio_cache_enable = 0
write_throttle = 0
max_diskq = 1048576
initial_extent_size = 1
max_seqio_extent_size = 2048
max_buf_data_size = 8192
hsm_write_prealloc = 0
read_ahead = 1
inode_aging_size = 0
inode_aging_count = 0
fcl_maxalloc = 8130396160
fcl_keeptime = 0
fcl_winterval = 3600
oltp_load = 0

Now the write pattern for our non-working cp ties in, in that the preferred write size is 65536 (write_pref_io above), but for the working one we're using 1048576, which could be down to discovered direct I/O (note buf_breakup_size and max_direct_iosz are both 1048576). Each write() is trying to complete 8388608 bytes (8MB), which we can see from the truss:

29986:   0.0146 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x07000000) = 0xFE800000
29986:   0.2394 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8388608) = 8388608
29986:   0.0107 mmap64(0xFE800000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x07800000) = 0xFE800000
29986:   0.2599 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0".., 8388608) = 8388608

so why is our I/O being broken into smaller chunks on one filesystem and not on the other? Ah, what if one of the filesystems is more fragmented than the other? Yes, that could explain the difference if VxFS internally chooses one breakup size or the other depending on the state of the filesystem. It's interesting reading the VxFS administration guide, as it clearly documents the possible performance implications of fragmentation (yes, RTFM is applicable here):

Fragmentation reduces performance and availability. Regular use of fsadm's fragmentation reporting and reorganization facilities is therefore advisable. The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs using the cron command. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the df -o s commands. There are three factors which can be used to determine the degree of fragmentation:

    * Percentage of free space in extents of less than eight blocks in length
    * Percentage of free space in extents of less than 64 blocks in length
    * Percentage of free space in extents of length 64 blocks or greater

An unfragmented file system will have the following characteristics:

    * Less than 1 percent of free space in extents of less than eight blocks in length
    * Less than 5 percent of free space in extents of less than 64 blocks in length
    * More than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

A badly fragmented file system will have one or more of the following characteristics:

    * Greater than 5 percent of free space in extents of less than 8 blocks in length
    * More than 50 percent of free space in extents of less than 64 blocks in length
    * Less than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

The optimal period for scheduling of extent reorganization runs can be determined by choosing a reasonable interval, scheduling fsadm runs at the initial interval, and running the extent fragmentation report feature of fsadm before and after the reorganization. The “before” result is the degree of fragmentation prior to the reorganization. If the degree of fragmentation is approaching the figures for bad fragmentation, reduce the interval between fsadm runs. If the degree of fragmentation is low, increase the interval between fsadm runs. The “after” result is an indication of how well the reorganizer has performed. The degree of fragmentation should be close to the characteristics of an unfragmented file system. If not, it may be a good idea to resize the file system; full file systems tend to fragment and are difficult to defragment. It is also possible that the reorganization is not being performed at a time during which the file system in question is relatively idle.

---

Conclusion


As in all performance related issues, you really need to understand the lifecycle of what you're interested in so you can observe what's happening, or not happening! In this case I used mainly truss and TNF to drill down and see what was happening. However, what should happen way before that is to understand what the impact actually is, because who cares if there is a difference if it's not causing a problem? In this case it was the DBAs who noticed the difference, but in all honesty it wasn't actually causing a business impact! :) Typical! Still, it's been another learning experience in using TNF for the odd case where DTrace (and in this case scsi.d) would also have helped, had the system been running Solaris 10.
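For reference, this is roughly the capture sequence the TNF wrapper script mentioned earlier automates (a sketch; the buffer size and the choice to enable all kernel probes are assumptions rather than what the script actually does):

# allocate a kernel trace buffer, enable the probes and start tracing
# (the indented commands are typed at the interactive prex prompt)
prex -k
  buffer alloc 8m
  enable $all
  trace $all
  ktrace on
# ... run the cp test under timex, then back in prex: ktrace off ...
# finally extract the kernel buffer and convert it to text
tnfxtract /tmp/tnf.raw
tnfdump /tmp/tnf.raw > tnf.txt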

If you are using VxFS then you might want to look at this script, which automatically checks whether a filesystem is fragmented and, if it is, kicks off a defrag... aren't automatic scripts great! :) The underlying commands are sketched below.
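A minimal sketch of what such a check boils down to (the mount point is a placeholder; the thresholds to act on are the ones from the VxFS guide quoted above):

# report extent fragmentation, then reorganize (defragment) if it looks bad
fsadm -F vxfs -E /vxfs_filesystem     # -E: extent fragmentation report
fsadm -F vxfs -e /vxfs_filesystem     # -e: extent reorganization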

Tuesday Apr 29, 2008

What's the answer to life the universe and everything?

"42"

For those of you who have read or listened to The Hitchhiker's Guide to the Galaxy, the above question and answer will have more meaning than for those of you who haven't. Essentially, how can you have a literal answer to such an undefined question? On an allegorical level it suggests that it is more important to ask the right questions than to seek definite answers.

I sometimes think of just saying "42" to the question "What's the answer to our performance problem?", which usually arrives with some kind of data, either in the form of GUDS (a script which collects a whole bunch of Solaris OS output) or some other spreadsheet or application output. This data usually has no context and comes with nothing more than "the customer has a performance problem", which of course makes things slightly difficult for us to answer, unless the customer will accept "42".

Investigating performance related issues is usually very time consuming because of the difficulty in defining the problem, so it stands to reason that it's a good idea to approach these types of problem in a structured way. Sun has been using an effective troubleshooting process from Kepner-Tregoe for a number of years, which defines a problem as follows:

"Something has deviated from the normal (what you should expect) for which you don't know the reason and would like to know the reason"

Still don't get it? Well, suppose you're driving, walking, running, hopping (you get my point) from point A to point B and have somehow ended up at X21, and you don't know why. You'd probably want to know why, and thus you'd have a problem: you were expecting to end up at point B but have ended up at point X21.

Ok, so how does this relate to resolving performance issues then? Well, in order for Sun engineers to progress performance related issues within the services organization, we need to understand the problem, the concerns around it and how it fits into the bigger picture. By this I mean looking at the entire application infrastructure (a top-down approach) rather than examining specific system or application statistics (a bottom-up approach). This can then help us identify a possible bottleneck or specific area of interest, on which we can use any number of OS or application tools to focus in and identify root cause.

So perhaps we should start by informing people what performance engineers CAN do:

1/ We can make "observations" from statically collected data or via an interactive window into the customer's system (Shared Shell). That doesn't mean we can provide root cause from this, but we can comment on what we see. Observations mean NOTHING without context.

2/ We can make suggestions based on the above information, which might lead to further data collection, but again these mean NOTHING without context.

Wow, that's not much is it....so what CAN'T we do?

1/ We can't mind read - Sorry, we can't possibly understand your concerns, application, business or users without being given USEFUL information. So what is useful information? Well, answers to these might help get the ball rolling:

* What tells you that you have a performance issue on your system? e.g. users complaining that the "XYZ" application is taking longer than expected to return data or reports, a batch job taking longer to complete, etc.

* When did this issue start happening? This should be the exact date and time the problem started or was first noticed.

* When have you noticed the issue since? Again, the exact date(s) and time(s).

* How long should you expect the job/application to take to run or complete? This needs to be based on previous runs or on how the system was sized when it was specified.

* What other systems also run the job/application but aren't affected?

* Supply an architecture diagram if applicable, describing how the application interfaces with the system, e.g.

user -> application X on client -webquery-> application server -sqlquery-> Oracle database backend server

2/ We can't rub a bottle and get the answer from a genie, nor wave a magic wand - again, it's not as simple as supplying a couple of OS outputs and getting an answer from us. We'll need to understand the "bigger" picture and make observations before suggestions can be offered.

3/ We can't fix the problem in a split second, nor can applying pressure speed up the process - again, we need to UNDERSTAND the bigger picture before suggestions and action plans can be offered.

So what kind of data can we collect to observe?

Probably one of the quickest ways of allowing us to observe is via Shared Shell. This gives us a direct view onto a system and lets us see what the customer actually sees. Again, we'll need to discuss with the customer what we're looking at and UNDERSTAND the "bigger" picture to make suggestions or action plans moving forward. If Shared Shell isn't available then we'll need to collect GUDS data, usually in extended mode. This collects various Solaris outputs over a number of time snapshots which we can view offline; however, we do need baseline data alongside the bad data to make any useful observations. One snapshot isn't much help, as high values could be normal! Just because you see high userland utilization it doesn't necessarily mean it's bad or indicates a performance problem; it could just be the system being utilized well, processing those "funny" accounting beans for the business. Again, and I've said this a few times: data is USELESS without CONTEXT.

If Oracle is involved then you could get the Oracle DBA to provide statspack data or AWR reports for when you see the problem and for when you don't, as that might indicate whether Oracle is a bottleneck in the application environment.

Other application vendors might have similar statistics-generating reports which show what the application is waiting on, which might help identify a potential bottleneck.

The "Grey" area

The grey area is a term many use for an issue which breaks the mould of conventional break-fix and starts entering the performance tuning arena. Break-fix usually means something is clearly broken, such as a customer hitting a bug in Solaris, or helping a customer bring up a system which has crashed or needs to be rebuilt and requires Sun's assistance and expertise to resolve. Performance tuning usually happens because, for example, a customer's business has expanded and their application architecture can't cope with the growth. It's a little difficult to gauge when a situation starts to go down that path, as most application architectures are very complex and involve lots of vendors. I also happen to work in the VOSJEC (Veritas Oracle Sun Joint Escalation Centre) and deal with quite a few interoperability issues, so I know things can get pretty complex when trying to find the problematic area of interest. For some reason some people term this the blame game or finger pointing, terms I personally hate to use. In fact I'd rather it be a Sun issue from my perspective, as we can then take the necessary action in raising bugs and getting engineering involved to provide a fix and ultimately resolve the customer's issue. Thankfully my Symantec and Oracle counterparts take the same approach, which makes problem resolution a little easier.

Conclusion

I think the real point of this is that you should really grasp a problem before asking for assistance: if you understand the problem, then your colleagues understand the problem and, more importantly, we (Sun) or I understand the problem, and that's half the battle. The rest is so much easier... :)

About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.
