Things I have learnt in the last few weeks.


1) A customer's system suffered from I/O slowdowns on the order of seconds. The cause was a load of unused targets in the SCSI-related driver .conf files (ses.conf, st.conf, sgen.conf, sd.conf). A home-made monitoring application was triggering driver loading, and as part of that process the framework tried to probe all the targets in these .conf files. On a copper SCSI bus each probe involves sending inquiry commands; with no target to respond to the selection, you get the 0.25 second selection timeout, which holds the SCSI bus. Get enough drivers and unused targets and you can get bus freezes of many seconds. This was resolved by using scsi.d through a sharedshell session that allowed me to log in and root cause it. Luckily the customer was running Solaris 10.
2) When using Oracle RAC you need to partition your tables to avoid both nodes wanting to access the same block. If that happens, the block is read from disk into one node and then, when it is free, it is passed across the network interconnect to the other node.
3) Don't ignore SCSI messages in /var/adm/messages about targets reducing sync rate or, even worse, going async. A copper SCSI drive in async mode can only transfer about 5 MB/second. If you send it 5 MB in big transfers, it will occupy the bus transferring data slowly, holding off other targets. Currently you have to reboot to clear this reduced speed.
4) If you have a StorEdge 3310 JBOD you must set the SCSI option in mpt.conf to reduce the speed to Ultra160, even though the drives can do Ultra320. If you don't, you get odd SCSI warnings and end up changing lots of hardware that did not need changing.
5) If you get a problem on a PCI bus, e.g. a device ends a transaction with a target abort, then you can get two distinct behaviors on a SPARC system. If that transaction is a write, then the host bridge sets a suitable bit in a status register and generates an interrupt that is serviced by pbm_error_interrupt(). This can go to a different CPU than the one issuing the PCI write and can take a while to process. This delay explains why crash dumps of "panic : pci fatal error" rarely capture the stack trace of the offending PCI write. If the transaction is a read, you get a Berr (bus error) trap on the faulting load instruction, and the resultant crash dump shows you the offending stack trace.
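
The arithmetic behind points 1 and 3 can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate, not a measurement: it assumes each unused target entry costs exactly one 0.25 second selection timeout and that probes run back to back. The per-driver target counts are made up for illustration, and the 160 MB/second figure is the nominal Ultra160 bus rate with protocol overhead ignored.

```python
# Rough estimates only; the assumptions are listed in the comments.

SELECTION_TIMEOUT_S = 0.25  # selection timeout per probe, from point 1


def probe_stall_seconds(unused_targets_per_driver, timeout_s=SELECTION_TIMEOUT_S):
    """Worst-case bus hold time, assuming one selection timeout per
    unused target entry and all probes issued back to back."""
    return sum(unused_targets_per_driver.values()) * timeout_s


def transfer_seconds(size_mb, rate_mb_per_s):
    """Time the bus is occupied moving size_mb at rate_mb_per_s."""
    return size_mb / rate_mb_per_s


# Hypothetical counts of unused target entries in each driver's .conf file.
unused = {"sd": 14, "st": 7, "ses": 14, "sgen": 14}

stall = probe_stall_seconds(unused)
print(f"probe stall: {stall:.2f} s for {sum(unused.values())} unused targets")

# Point 3: 5 MB to a drive that has dropped to async (about 5 MB/second),
# compared with the nominal Ultra160 rate.
print(f"5 MB async:    {transfer_seconds(5, 5):.3f} s")
print(f"5 MB Ultra160: {transfer_seconds(5, 160):.3f} s")
```

With these made-up counts, 49 unused entries come to just over 12 seconds of bus hold time, which is the kind of multi-second freeze described in point 1.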

I have a new boss and job. I am part of a small team that is hoping to form a diagnostic community within System-TSC (the part of services I work in). The goal is to speed up the resolution of difficult and complex issues by engaging with other engineers and offering advice, mentoring and training to grow their skill sets. The other folks in this team are all exceptionally talented, so this may be a bit of a challenge.

Administering my little web site and non-technical blog at samoyed.org.uk/blog isn't taking too much time.
Comments:

How about putting some SunSolve reference IDs and some patch numbers against these, so we can find out a bit more about them?

Posted by dork on November 15, 2007 at 03:33 PM GMT+00:00 #

Good point, I'll write some SunSolve docs about those things. The 3310 JBOD setting is referenced in document 80089; the others are fairly recent discoveries. Oracle do say that applications can be used in a RAC config with "cache fusion" without any changes, and that is true of most applications, but I had come across one that definitely benefited from some data partitioning to stop the network interconnect load due to one table that was being updated by all queries.
thanks
tim

Posted by tim uglow on November 15, 2007 at 03:50 PM GMT+00:00 #

Cool! It's really useful when you guys put those references to "official" Sun resources in your blogs because it means that we can follow it through better, and it's much nicer to be able to say to a customer "You need patch X" than "I read on some guy's blog that scsi can be slow".

Posted by dork on November 16, 2007 at 04:02 AM GMT+00:00 #
