What's the answer to life the universe and everything?

"42"

For those of you who have read or listened to The Hitchhiker's Guide to the Galaxy, the above question and answer will have more meaning than for those of you who haven't. Essentially, how can you have a literal answer to such an undefined question? On an allegorical level it suggests that it is more important to ask the right questions than to seek definite answers.

I sometimes think of just saying "42" to the question of "What's the answer to our performance problem?", which is usually supplied with some kind of data, either in the form of GUDS (a script which collects a whole bunch of Solaris OS output) or some other spreadsheet or application output. This data usually has no context and comes with nothing other than "the customer has a performance problem", which of course makes things slightly difficult for us to answer, unless the customer will accept "42".

Investigating performance-related issues is usually very time consuming due to the difficulty in defining the problem, so it stands to reason that it's a good idea to approach these types of problems with a structured method. Sun has been using an effective troubleshooting process from Kepner-Tregoe for a number of years, which defines a problem as follows:

"Something has deviated from the normal (what you should expect) for which you don't know the reason and would like to know the reason"

Still don't get it? Well, what if you're driving, walking, running, hopping (you get my point) from point A to point B and have somehow ended up at X21, and you don't know why? You'd probably want to know why, and thus you'd have a problem, because you'd be expecting to end up at point B but have ended up at point X21.

Ok, so how does this relate to resolving performance issues then? Well, in order for Sun engineers to progress performance-related issues within the services organization we need to understand the problem, the concerns around it and how it fits into the bigger picture. By this I mean looking at the entire application infrastructure (a top-down approach) rather than examining specific system or application statistics (a bottom-up approach). This can then help us identify a possible bottleneck or specific area of interest, which we can then focus in on with any number of OS or application tools to identify the root cause.

So perhaps we should start by informing people what performance engineers CAN do:

1/ We can make "observations" from statically collected data or via an interactive window into a customer's system (Shared Shell). Yes, that doesn't mean we can provide root cause from this, but we can comment on what we see. Observations mean NOTHING without context.

2/ We can make suggestions based on the above information, which might lead to further data collection, but again these mean NOTHING without context.

Wow, that's not much is it....so what CAN'T we do?

1/ We can't mind read - Sorry, we can't possibly understand your concerns, application, business or users without you providing USEFUL information. So what is useful information? Well, answers to these might help get the ball rolling:

\* What tells you that you have a performance issue on your system? e.g. users complaining that the "XYZ" application is taking longer than expected to return data/reports, a batch job taking longer to complete, etc.

\* When did this issue start happening? This should be the exact date & time the problem started or was first noticed.

\* When have you noticed the issue since? Again, the exact date(s) and time(s).

\* How long should you expect the job/application to take to run/complete? This needs to be based on previous runs or on when the system was specified.

\* What other systems also run the job/application but aren't affected?

\* Supply an architecture diagram if applicable, describing how the application interfaces into the system, e.g.

user -> application X on client -webquery-> application server -sqlquery-> Oracle database backend server

2/ We can't rub a bottle and get the answer from a genie, nor wave a magic wand for the answer - Yes, again it's not as simple as supplying a couple of OS outputs and getting an answer from us. We'll need to understand the "bigger" picture and make observations before suggestions can be offered.

3/ We can't fix the problem in a split second, nor does applying pressure help speed up the process - Again, we need to UNDERSTAND the bigger picture before suggestions and action plans can be offered.

So what kind of data can we collect to observe?

Probably one of the quickest ways of allowing us to observe is via Shared Shell. This gives us a direct view onto a system and allows us to see what the customer actually sees. Again, we'll need to discuss with the customer what we're looking at and UNDERSTAND the "bigger" picture before making suggestions or action plans moving forward.

If Shared Shell isn't available then we'll need to collect GUDS data, usually in the extended mode. This collects various Solaris outputs over a number of time snapshots which we can view offline; however, we do need baseline data along with bad data to make any useful observations. Yes, one snapshot isn't much help, as high values could be normal! And just because you see high userland utilization doesn't necessarily mean it's bad or a sign of a performance problem. It could just be the system being utilized well, processing those "funny" accounting beans for the business. Again, and I've said this a few times... data is USELESS without CONTEXT.
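To give a flavour of the kind of thing GUDS gathers, here's a minimal sketch of collecting a labelled set of standard Solaris statistics for later offline comparison. This is purely my own illustration, not the GUDS script itself; the script name, intervals, sample counts and output directory are just example values:

    #!/bin/ksh
    # Illustration only -- GUDS collects far more than this.
    # Run once while the system behaves normally (./collect.sh baseline)
    # and again while the problem is visible (./collect.sh bad), then
    # compare the two sets of output side by side.
    LABEL=${1:-baseline}
    DIR=/var/tmp/perfdata/$LABEL.`date +%Y%m%d%H%M`
    mkdir -p $DIR

    vmstat 10 60       > $DIR/vmstat.out &    # run queue, memory, system-wide CPU
    mpstat 10 60       > $DIR/mpstat.out &    # per-CPU utilisation
    iostat -xnz 10 60  > $DIR/iostat.out &    # per-device I/O service times
    prstat -mLc 10 60  > $DIR/prstat.out &    # per-thread microstate accounting
    wait
    echo "Samples written to $DIR"

The point isn't the particular tools; it's that the baseline set gives the "bad" set its context, which is what turns raw numbers into observations.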

If Oracle is involved then you could get the Oracle DBA to provide statspack data or AWR reports for when you see the problem and for when you don't, as that might give an indication of whether Oracle is the bottleneck in the application environment.
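For reference, those reports usually come from the standard Oracle-supplied scripts. A rough sketch of what the DBA might run from SQL*Plus (the script paths below are the usual $ORACLE_HOME defaults; adjust for the local installation, and note that AWR requires the appropriate licence):

    -- Run from SQL*Plus as a suitably privileged user; "?" expands to $ORACLE_HOME.
    -- awrrpt.sql prompts for a snapshot range, so generate one report covering
    -- a period when the problem was seen and another covering a period when it wasn't.
    @?/rdbms/admin/awrrpt.sql

    -- Or, on databases using statspack rather than AWR:
    @?/rdbms/admin/spreport.sql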

Other application vendors might have similar statistics-generating reports which show what they are waiting on, and which might help identify a potential bottleneck.

The "Grey" area

The grey area is a term used by many for an issue which breaks the mold of conventional break-fix work and starts entering the performance tuning arena. Break-fix is usually an indication that something is clearly broken, such as a customer hitting a bug in Solaris, or helping a customer bring up a system which has crashed or needs to be rebuilt and requires Sun's assistance and expertise to resolve. Performance tuning usually happens because, for example, a customer's business has expanded and their application architecture can't cope with the growth. It's a little difficult to gauge when a situation starts to go down that path, as most application architectures are very complex and involve lots of vendors.

I also happen to work in the VOSJEC (Veritas Oracle Sun Joint Escalation Centre) and deal with quite a few interoperability issues, so I know things can get pretty complex when trying to find the problematic area of interest. For some reason some people term this the blame game, or finger pointing, terms which I personally hate to use. In fact I'd rather it be a Sun issue from my perspective, as we can then take the necessary action of raising bugs and getting engineering involved to provide a fix and ultimately resolve the customer's issue. Thankfully my Symantec and Oracle counterparts also take this approach, which makes things a little easier in problem resolution.

Conclusion

I think the real point of this is that you should really grasp a problem before asking for assistance: if you understand the problem, then your colleagues understand the problem, and more importantly we (Sun), or I, understand the problem, and that's half the battle. The rest is so much easier... :)

Comments:

Andy - this is an excellent blog entry and ought to be MANDATORY reading for all of Services (Ian White - please note!!).

SGR/ATS is really the right way to go in the early stages of any call, if the engineer does not understand the problem and cannot find a solution. We "few" have to keep banging the SGR drum. Keep up the good work.

:)

Posted by Peter Brentnall on April 30, 2008 at 08:54 AM BST #

Awesome! Thanks for putting this out. I rarely see problems that are well defined... and spend most of the time getting a good definition before I can begin to help.

I would also add that people need to avoid stating problems using system statistics. I don't know how many times I have had problems defined as user cpu% is too high... or load avg is too high :)

Posted by Glenn Fawcett on April 30, 2008 at 10:49 AM BST #
