Monday Sep 28, 2009

Problems with Solaris and a DISM enabled Oracle Database

Resolving problems becomes ever more challenging in today's complex application environments, and error messages can sometimes mislead or confuse the system administrator or DBA trying to resolve the issue. Through my involvement with the VOSJEC (Veritas Oracle Sun Joint Escalation Center), we've seen a steady increase in the number of cases raised by customers running Oracle with DISM enabled on Solaris systems. In the majority of cases disabling DISM appears to resolve the issue, but with DISM use becoming more popular and making DBAs' lives easier, I personally don't feel that's the right course of action. DISM essentially gives the DBA the ability to resize the SGA on the fly without taking down the database, maximizing availability and avoiding downtime. See the following page from the Administering Oracle Database on Solaris documentation for further details.
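For instance, on Oracle 10g and later a DBA can grow or shrink the SGA online with something like the following (a hedged example; the size is purely illustrative and must stay within sga_max_size):

SQL> alter system set sga_target=6g scope=both;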

For background on other DISM-related issues, have a read of Solution 230653: DISM Troubleshooting For Oracle9i and Later Releases, as it covers how DISM works and what can go wrong.

How do I know if the customer is using DISM or not...

* There should be an ora_dism_<SID> process running on the system, so you can check the ps listing (explorer / guds); a quick check is sketched after this list.

* Check the Oracle alert log, as it should also mention the DISM process starting. It should also log the fact that the DISM process has died, which means your Oracle shared memory segments aren't locked and can cause other problems; see Solution 230653: DISM Troubleshooting For Oracle9i and Later Releases.
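A quick check might look like this (a minimal sketch; the SID and alert log path are illustrative, so adjust for your environment):

# the DISM lockdown process is named ora_dism_<SID>
ps -ef | grep '[o]ra_dism'

# look for DISM start/death messages in the alert log
grep -i dism /u01/app/oracle/admin/ORCL/bdump/alert_ORCL.log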

So, what's the issue, I hear you asking? Well, several customers complained about Oracle batch jobs crashing, or the entire database falling over, with something like:

Sun Mar 15 14:53:28 2009
Errors in file /u01/app/oracle/admin/JSPSOL0/udump/jspsol03_ora_23220.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
SVR4 Error: 12: Not enough space
Additional information: 4
Additional information: 938866512
Additional information: -1

OR

Mon Aug 24 09:01:17 2009
Errors in file /opt/oracle/admin/MW/bdump/mw1_lgwr_18124.trc:
ORA-00340: IO error processing online log 1 of thread 1
ORA-00345: redo log write error block 167445 count 11
ORA-00312: online log 1 thread 1: '/dev/vx/rdsk/mwdbdg/redo11'
ORA-27070: async read/write failed
SVR4 Error: 12: Not enough space
Additional information: 1
Additional information: 167445
LGWR: terminating instance due to error 340

The first jump to a cause would be that this is some kind of I/O issue and that perhaps we've run out of space on the backing storage layer. Unfortunately, that wasn't the reason in this case, as all filesystems had plenty of free space available. The next candidate cause is a shortage of swap space: when using DISM you must have at least as much swap space configured as the DISM segments in use; see Solution 214947: Solaris[TM] Operating System: DISM double allocation of memory for details. Again, the customer appeared to have the correct swap space configuration for the DISM segment currently in use.
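Checking the swap configuration is quick (a hedged example):

# summary of virtual swap reserved, allocated and available
swap -s

# physical swap devices and their free blocks
swap -l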
Now, what I actually suspected was that CR 6559612 "multiple softlocks on a DISM segment should decrement availrmem just once" was coming into play here, which, if you read engineering's comments on it, can only happen if DISM isn't locked properly and we end up going down a different code path due to seg_pdisable being set to 1. There is another reason for seg_pdisable being set, and that's when we're trying to handle page retirement. Coincidentally, both of the situations mentioned above had pending page retirements due to CR 6587140 "page_retire()/page_trycapture should not try to retire non relocatable kernel pages" (fixed in KJP 141414-06, the latest Solaris 10 kernel patch), so there was a window of opportunity for CR 6559612 to happen every once in a while.
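As an aside, you can check whether a system already carries that fix from the installed patch list (a hedged example):

# look for KJP 141414-06 or later
showrev -p | grep 141414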

From what I can currently work out, CR 6559612 "multiple softlocks on a DISM segment should decrement availrmem just once" is a duplicate of CR 6603296 "Multiple writes into dism segment reduces available swap", which also mentions duplicate CR 6644570 "Application using large page sizes with DISM on Huron got pread to return error ENOMEM".

There seems to be a family of related issues, all fixed as part of CR 6423097 "segvn_pagelock() may perform very poorly", which includes the above:

6526804 DR delete_memory_thread, AIO, and segvn deadlock
6557794 segspt_dismpagelock() and segspt_shmadvise(MADV_FREE) may deadlock
6557813 seg_ppurge_seg() shouldn't flush all unrelated ISM/DISM segments
6557891 softlocks/pagelocks of anon pages should not decrement availrmem for memory swapped pages
6559612 multiple softlocks on a DISM segment should decrement availrmem just once
6562291 page_mem_avail() is stuck due to availrmem overaccounting and lack of seg_preap() calls
6596555 locked anonymous pages should not have assigned disk swap slots
6639424 hat_sfmmu.c:hat_pagesync() doesn't handle well HAT_SYNC_STOPON_REF and HAT_SYNC_STOPON_MOD flags
6639425 optimize checkpage() optimizations
6662927 page_llock contention during I/O

These are all fixed in the development release of Solaris (Nevada build 91) but haven't been backported yet. From looking at the bug reports, it seems likely that CR 6423097 and the associated CRs will be fixed in Solaris 10 U8 (scheduled for release 9th October 2009).

So, that would lead us to the following suggestions:

1/ CR 6423097 mentions disabling large page support completely to work around the issue, which you could do and still continue to use DISM:

The workaround for Nevada and S10U4 is to add the following to /etc/system:

* 0x2000 (8K) is the SPARC base page size, so these cap all user
* mappings at 8K, effectively disabling large pages
set max_uheap_lpsize = 0x2000
set max_ustack_lpsize = 0x2000
set max_privmap_lpsize = 0x2000
set max_shm_lpsize = 0x2000
* don't use large pages for heap or stack allocations
set use_brk_lpg = 0
set use_stk_lpg = 0

and reboot.

For S10U3 and earlier S10 releases, the workaround is to add the following to /etc/system:

set exec_lpg_disable = 1
set use_brk_lpg = 0
set use_stk_lpg = 0
set use_zmap_lpg = 0

and reboot.

NOTE: this is *not* the same as disabling large page coalescing, as this completely disables large page support. Disabling large page coalescing is sometimes required, depending on the application workload; see Solution 215536: Oracle(R) 10g on Solaris[TM] 10 may run slow for further details.
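Once rebooted, you can sanity-check that user mappings really are limited to the 8K base page size using pmap (a hedged example; the PID is illustrative):

# -s adds a page size column (Pgsz) per mapping
pmap -s 12345 | head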

2/ Disable DISM entirely in the short term and use ISM with Oracle whilst waiting for the release of Solaris 10 Update 8.
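If you take that route, the usual lever is to make the SGA non-resizable, as Oracle only picks DISM when the SGA can grow; a hedged init.ora sketch (sizes purely illustrative):

# with sga_max_size equal to sga_target the instance locks the whole
# SGA with ISM rather than DISM
sga_max_size=8g
sga_target=8g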

Conclusion / Actions
-------------------------------------

The above is essentially a hit list of things to be aware of when seeing errors of that nature. If the customer isn't using DISM at all, then that might take you down another avenue. If they are, then remember that "Applications Drive System Behavior": you need to understand the architecture, and when and where things run, to tie the loose ends of any problem together and be able to prove root cause or not. See my previous blog entry about dealing with performance issues, as it essentially talks about the same troubleshooting process I use all the time in diagnosing complex problems.

Monday Apr 20, 2009

Observability and Diagnosis Techniques are the way forward

I've just closed down another escalation involving a performance comparison of an application (Finacle in this case) running on a 6-board SunFire E6900 and a 6-board SunFire E25K. Obviously we're not comparing apples with apples here, but nevertheless the customer found huge performance differences during core business hours with the *same* workload (and I use that term loosely, as nothing can entirely occupy the same physical space in time or be made up of identical particles... you know what I mean... although perhaps some clever physicists can argue against that?!). Anyhow, the customer made the argument that they were using the same application binaries on both platforms and were delivering the same user load to both (and they still couldn't confirm that!), yet were seeing a huge performance degradation on the E25K domain: slow interactive response from the users' perspective and high load. That didn't actually surprise me, as the E25K is a larger system and will have larger memory latencies due to its larger interconnect compared with the E6900.

So, what we observed was multiple application processes consuming userland CPU and no userland lock contention in the prstat microstate accounting output. If we'd suspected a scalability issue with the application, then we'd more than likely have observed userland lock contention; one of the frequent flyers we see in the performance team is applications using malloc in multi-threaded SMP environments, see this interesting read for further details. Unfortunately that wasn't the case here, so we needed to take another direction in the observability game to prove to the customer the fundamental differences between the two platforms. The answer was to use cpustat and show the difference between the two platforms using cycles-per-instruction calculations.
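As an aside, the userland lock check mentioned above is quick to do with prstat's microstate accounting columns (a hedged example; the interval and count are arbitrary):

# -m enables microstate accounting, -L shows one line per LWP;
# a high LCK column (% time waiting on user locks) would have pointed
# at something like malloc contention
prstat -mL 5 2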

The following chapter from Solaris™ Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris explains the methodology (or read it on Safari Books if you have an account, or buy it from Amazon; the entire book is well worth a read).

8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an Ultra-SPARC IIi to calculate the average number of cycles per instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.034 0 tick 3554903403 3279712368 1.08 cpi
10.034 1 total 3554903403 3279712368 1.08 cpi


This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions are run over and over, the instructions and data are found in the Level-1 cache, resulting in fast instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
10.036 0 tick 205607856 34023849 6.04 cpi
10.036 1 total 205607856 34023849 6.04 cpi


Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

--

So, looking at the customer's data:

[andharr@node25k]$ grep total cpu*
cpu2.out: 30.429 48 total 1453222115333 358247394908 4.06 cpi
cpu22.out: 30.523 48 total 1463632215285 347056691816 4.22 cpi
cpu222.out: 30.367 48 total 1395799585592 423393952271 3.30 cpi

[andharr@node6900]$ grep total cpu*
cpu1.out: 31.038 48 total 1209418147610 522125013039 2.32 cpi
cpu11.out: 30.311 48 total 1194302525311 573624473405 2.08 cpi
cpu111.out: 30.408 48 total 1105516225829 552190193006 2.00 cpi

So the E25K is showing a higher cycles-per-instruction average than the E6900, which does show a difference in performance between the two systems, more than likely due to the memory latency difference on the E25K. If we look at the raw data for the busy periods, sorted by CPI value for the sample period, you can see some very big differences in the largest CPI values between the systems:

[andharr@node6900]$ cat cpu1.out | awk '{print $6}' |sort -n | tail
6.30
6.37
7.14
7.16
7.26
7.35
8.17
8.27
8.36
8.91
[andharr@node6900]$ cat cpu11.out | awk '{print $6}' |sort -n | tail
6.67
6.71
6.77
6.80
7.21
7.70
7.72
8.40
9.21
11.92
[andharr@node6900]$ cat cpu111.out | awk '{print $6}' |sort -n | tail
6.26
6.39
6.65
6.93
6.99
7.25
7.81
8.65
8.81
9.32

[andharr@node25k]$ cat cpu2.out | awk '{print $6}' |sort -n |tail
26.65
26.86
26.99
28.71
29.48
30.06
30.87
32.93
34.05
34.36
[andharr@node25k]$ cat cpu22.out | awk '{print $6}' |sort -n |tail
31.35
31.82
32.32
33.16
34.03
35.00
38.51
47.69
50.19
51.04
[andharr@node25k]$ cat cpu222.out | awk '{print $6}' |sort -n |tail
26.03
26.71
26.90
27.31
27.45
28.29
28.42
29.28
32.30
35.40

Conclusion

So the fact that the E25K showed higher numbers suggests it isn't a good match for the Finacle application running on it. It would seem that Finacle generates a memory-latency-sensitive workload which might not be entirely suited to the E25K platform. Binding the Finacle application into specific processor sets might reduce thread migration and some non-local memory accesses, which may improve performance somewhat, but it won't entirely eradicate the memory latency issue. The reason it's more noticeable on the E25K than the E6900 is that there is a greater distance to travel across the E25K interconnect, i.e. a greater distance in terms of copper for the electrical signals to cover. MPO (Memory Placement Optimization) was introduced in Solaris 9 to help alleviate latency in large-scale NUMA configurations by attempting to keep LWPs local to their "home" CPU/memory board, but it cannot eliminate the latency for all workloads (as in this case).
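For completeness, processor sets and binding on Solaris look something like this (a hedged sketch; the CPU IDs and the PID are purely illustrative):

# create a processor set from four CPUs; the command prints the new pset id
psrset -c 0 1 2 3

# bind an existing process (and all its LWPs) to pset 1
psrset -b 1 12345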

See the following documents for background information on MPO:

Solution 214954 : Sun Fire[TM] Servers: Memory Placement Optimization (MPO) and Solution 216813 : Sun Fire[TM] Servers: Memory Placement Optimization (MPO) Frequently Asked Questions (FAQ)

As my esteemed colleague Clive said, "Applications drive system behaviour", so we need to look at the application and how it interacts with the system rather than the other way round, which always points me back to one of my first blog entries on the importance of understanding the application architecture and starting with a top-down approach. :)

Monday Jan 19, 2009

Performance considerations when upgrading Solaris

The biggest piece of advice I can give those of you about to upgrade with lots of custom tunables in /etc/system is: read the manual (RTFM if you're feeling particularly vocal). No, seriously, I mean it! :) You only have to read the Solaris Tunable Parameters Reference Manual, as it actually discusses upgrading to newer releases with older /etc/system tunables:

"We recommend that you start with an empty /etc/system file when moving to a new Solaris
release. As a first step, add only those tunables that are required by in-house or third-party
applications. Any tunables that involve System V IPC (semaphores, shared memory, and
message queues) have been modified in the Solaris 10 release and should be changed in your
environment. For more information, see “System V IPC Configuration” on page 21. After
baseline testing has been established, evaluate system performance to determine if additional
tunable settings are required."


So, that's a "move it out of the way and start from scratch". :) Obviously speak to your application vendors about anything that is required to run the application, but other than that, see how things go and only change tunables when and where necessary, otherwise you could run into other problems.

The only application I'll make specific points about is Oracle, as Solaris 10 introduced resource controls, so the shared memory / semaphore settings no longer need to be defined in /etc/system. See the Oracle installation guide or Solution 208623: Solaris[TM] 10 Operating System: System V Inter-Process Communication (IPC) resource controls for further details.
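For example, rather than the old shmsys/semsys lines in /etc/system, the limits hang off a project for the oracle user (a hedged sketch; the project name and the 8G value are illustrative):

# default project for the oracle user with a shared memory cap
projadd -U oracle -K "project.max-shm-memory=(privileged,8G,deny)" user.oracle

# verify the attribute
projects -l user.oracle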

Saturday Oct 25, 2008

Oracle parallel query performance on a T5140

Being an engineer, it's always a good feeling getting to the bottom of a problem, and none more so than this one. Take a T5140, create a 3-way RAID 0 LUN using the internal disks and stick Oracle on top so you can do something useful with it for your application... and what do you get... a problem. I suspect after that opening some of you are thinking "where's he going with this?"... the answer: nowhere, I'm not picking holes in either the T5140 or Oracle... good, I'm glad we got that one clear! :)

Anyhow, a customer came to us complaining that their application wasn't running as expected on this platform, and really wanted to know if there was a hardware fault or bug with the platform or the operating system running on it. From the tests the customer had been doing themselves, they believed the bottleneck was the underlying I/O subsystem, in this case the LSI H/W RAID. Essentially, the customer had configured a 3-disk RAID 0 stripe using the default 64k stripe width, like this:

bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
       Sub                     Size                    Level
               Disk
----------------------------------------------------------------
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
               0.1.0   136.6G          GOOD
               0.2.0   136.6G          GOOD
               0.3.0   136.6G          GOOD

They had then created a single slice on which Oracle was installed and configured for Direct I/O (a good thing anyway if you have a UFS filesystem), so we were avoiding the filesystem buffer cache and double buffering. A 64k stripe unit per disk across three disks gives us a total stripe width of 192k. The throughput of each of these disks is between 50-60MB per second, which means we have a theoretical throughput across all stripes of 150-180MB per second for reads. We can forget writes, as Oracle is pwrite()'ing in 8k synchronous chunks to a volume with the write cache disabled, and each write only hits one disk (because 8k is less than the 64k stripe unit); hence why we saw a 1GB tablespace creation take 18 seconds at an average throughput of 56MB per second, which is what we would have expected from a single disk.

SQL> set timing on
SQL> create tablespace andy3
 2  datafile '/u01/oracle/oradata/SUN/andy03.dbf'
 3  size 1g;

Tablespace created.

Elapsed: 00:00:18.12

and iostat -xnz 1 shows us

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0  128.0  104.0 56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   14.0   78.0  112.0 56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0   93.0  112.0 53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0   95.0  104.0 58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0

This result was the same as the customer's, but things get interesting when we start looking at full table scan parallel queries. The customer ended up with these results:

Parallelism   Time (s)   Throughput (MB/s)   db_file_multiblock_read_count
0             719        13.8                16 (16 x 8k = 128k)
2             195        50.9                16
4             208        47.7                16
0             420        23.6                32 (32 x 8k = 256k)
2             163        60.9                32
4             208        53.9                32

Now, those results look bad, especially when you consider that theoretically we should be able to achieve 150-180MB per second based on a three-disk stripe (3 x 60MB/s).

Using the same parallel test plan as the customer:

oracle@v4v-t5140a-gmp03~$more para.sql

set timing on;

select /*+ FULL(t) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,2) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,4) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,64) */ count(*) from contact_methods t;

oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r-----   1 oracle   dba         9.7G Oct 24 08:25 test01.dbf

I got these:

SQL> @para

 COUNT(*)
----------
 15700000

Elapsed: 00:00:47.85

 COUNT(*)
----------
 15700000

Elapsed: 00:00:32.53

 COUNT(*)
----------
 15700000

Elapsed: 00:00:34.68

 COUNT(*)
----------
 15700000

Elapsed: 00:00:42.17

Whilst the first full table scan is running, I see the following in iostat -xnz 1:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  108.0    0.0 93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  151.0    3.0 95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  115.0    0.0 103367.6    0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  116.0    0.0 102232.7    0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  122.0    3.0 105326.4   48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  116.0    0.0 96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0

and then

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  193.0    0.0 159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  195.5    3.0 163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  220.1    0.0 188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  192.0    0.0 168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  191.0    3.0 162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  190.0    0.0 162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  192.0    0.0 162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  186.0    3.0 153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  191.0    1.0 160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0

when the parallel jobs are running.

This is because I changed db_file_multiblock_read_count to 128 (128 * 8k = 1M), and in fact I also saw improvements using 192k (64k * 3) to match the stripe width. I also went with some recommendations from here, which helped, along with running the latest T5140 firmware and the latest KJP 137111-08 to avoid some known performance issues.
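For reference, that parameter can be changed on the fly (a hedged example; scope=both assumes the instance uses an spfile):

SQL> alter system set db_file_multiblock_read_count=128 scope=both;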

It's amazing that tuning one little option can have such a dramatic effect on the results. It also shows that just because you don't get the results you're expecting, you shouldn't assume there's a fault with the hardware or software. For me personally, it's always good to understand the results and how you got there, although, as with all performance-related issues, you can sometimes get sidetracked by what's gone on before and end up going down the wrong path. To avoid that, make sure you chat with your colleagues when you feel like you're not getting anywhere, as a fresh set of eyes can bring you back on the path and closer to resolution.




Tuesday Apr 29, 2008

What's the answer to life the universe and everything?

"42"

For those of you that have read or listened to The Hitchhiker's Guide to the Galaxy, the above question and answer will have more meaning than for those of you that haven't. Essentially, how can you have a literal answer to such an undefined question? It suggests, on an allegorical level, that it is more important to ask the right questions than to seek definite answers.

I sometimes think of just saying "42" to the question "What's the answer to our performance problem?", which usually arrives with some kind of data, either in the form of GUDS output (a script which collects a whole bunch of Solaris OS data) or some other spreadsheet or application output. This data usually has no context, and is supplied with nothing more than "the customer has a performance problem", which of course makes things slightly difficult for us to answer, unless the customer will accept "42".

Investigating performance-related issues is usually very time consuming due to the difficulty of defining the problem, so it stands to reason that it's a good idea to approach these types of problem in a structured way. Sun has been using an effective troubleshooting process from Kepner-Tregoe for a number of years, which defines a problem as follows:

"Something has deviated from the normal (what you should expect) for which you don't know the reason and would like to know the reason"

Still don't get it? Well, suppose you're driving, walking, running, hopping (you get my point) from point A to point B and have somehow ended up at X21, and you don't know why. You'd probably want to know why, and thus you'd have a problem: you'd be expecting to end up at point B but have ended up at point X21.

Ok, so how does this relate to resolving performance issues then? Well, in order for Sun engineers to progress performance-related issues within the services organization, we need to understand the problem, the concerns around it and how it fits into the bigger picture. By this I mean looking at the entire application infrastructure (a top-down approach) rather than examining specific system or application statistics (a bottom-up approach). This can then help us identify a possible bottleneck or specific area of interest, on which we can use any number of OS or application tools to focus in and identify root cause.

So perhaps we should start by informing people what performance engineers CAN do:

1/ We can make "observations" from statically collected data or via an interactive window into a customer's system (Shared Shell). That doesn't mean we can provide root cause from them, but we can comment on what we see. Observations mean NOTHING without context.

2/ We can make suggestions based on the above information, which might lead to further data collection, but again these mean NOTHING without context.

Wow, that's not much, is it... so what CAN'T we do?

1/ We can't read minds. Sorry, we can't possibly understand your concerns, application, business or users without being given USEFUL information. So what is useful information? Well, answers to these might help get the ball rolling:

* What tells you that you have a performance issue on your system? e.g. users complaining that the "XYZ" application is taking longer than expected to return data or a report, a batch job taking longer to complete, etc.

* When did this issue start happening? This should be the exact date & time the problem started or was first noticed.

* When have you noticed the issue since? Again, the exact date(s) and time(s).

* How long should you expect the job/application to take to run/complete? This needs to be based on previous data runs or on the system's original sizing.

* What other systems also run the job/application but aren't affected?

* Supply an architecture diagram if applicable, describing how the application interfaces with the system, e.g.

user -> application X on client -webquery-> application server -sqlquery-> Oracle database backend server

2/ We can't rub a bottle and get the answer from a genie, nor wave a magic wand. Again, it's not as simple as supplying a couple of OS outputs and getting an answer from us. We'll need to understand the "bigger" picture before observations or suggestions can be made.

3/ We can't fix the problem in a split second, nor does applying pressure help speed up the process. Again, we need to UNDERSTAND the bigger picture before suggestions and action plans can be advised.

So what kind of data can we collect to observe?

Probably one of the quickest ways of letting us observe is via Shared Shell. This gives us a direct view onto a system and lets us see what the customer actually sees. Again, we'll need to discuss with the customer what we're looking at and UNDERSTAND the "bigger" picture before making suggestions or action plans. If Shared Shell isn't available, then we'll need to collect GUDS data, usually in the extended mode, which gathers various Solaris outputs over a number of time snapshots that we can view offline. However, we do need baseline data along with bad data to make any useful observations: one snapshot isn't much help, as high values could be normal! Just because you see high userland utilization, it doesn't necessarily mean it's bad or shows a performance problem; it could just be the system being utilized well, processing those "funny" accounting beans for the business. Again, and I've said this a few times... data is USELESS without CONTEXT.

If Oracle is involved, then you could get the Oracle DBA to provide Statspack data or AWR reports for when you see the problem and when you don't, as that might give an indication of whether Oracle is a bottleneck in the application environment.

Other application vendors might have similar statistics reports showing what the application is waiting for, which might help identify a potential bottleneck.

The "Grey" area

The grey area is a term used by many for an issue which breaks the mould of conventional break-fix issues and starts entering the performance tuning arena. Break-fix usually means that something is clearly broken, such as a customer hitting a bug in Solaris, or helping a customer bring up a system which has crashed or needs to be rebuilt and requires Sun's assistance and expertise. Performance tuning usually happens because, for example, a customer's business has expanded and their application architecture can't cope with the growth. It's a little difficult to gauge when a situation starts to go down that path, as most application architectures are very complex and involve lots of vendors. I also happen to work in the VOSJEC (Veritas Oracle Sun Joint Escalation Centre) and deal with quite a few interoperability issues, so I know things can get pretty complex when trying to find the problematic area of interest. For some reason some people term this the blame game or finger pointing, terms which I personally hate to use. In fact, I'd rather it be a Sun issue from my perspective, as we can then take the necessary action in raising bugs and getting engineering involved to provide a fix and ultimately resolve the customer's issue. Thankfully my Symantec and Oracle counterparts take the same approach, which makes problem resolution a little easier.

Conclusion

I think the real point of this is that you should really grasp a problem before asking for assistance. If you understand the problem, then your colleagues understand the problem, and more importantly we (Sun) understand the problem, and that's half the battle. The rest is so much easier... :)

About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.
