By timatworkhomeandinbetween on Nov 15, 2007
1) A customer's system suffered from I/O slowdowns of the order of seconds. The cause was a load of unused targets in the SCSI-related driver .conf files (ses.conf, st.conf, sgen.conf, sd.conf). A home-made monitoring application was triggering driver loading, and as part of that process the framework tried to probe all the targets in these .conf files. On a copper SCSI bus each probe involves sending inquiry commands; with no target to respond to the selection, you get the 0.25 second selection timeout, which holds the SCSI bus. Get enough drivers and unused targets and you can get bus freezes of many seconds. This was resolved by using scsi.d through a sharedshell session that allowed me to log in and root cause it. Luckily the customer was running Solaris 10.
2) When using Oracle RAC you need to partition your tables to avoid both nodes wanting to access the same block. If that happens, the block is read from disk into one node and then, when that node has finished with it, it is passed across the network interconnect to the other node.
3) Don't ignore SCSI messages in /var/adm/messages about targets reducing sync rate or, even worse, going async. A copper SCSI drive in async mode can only transfer about 5 MB/second. If you do send it 5 MB in big transfers then it will occupy the bus, transferring data slowly and holding off other targets. Currently you have to reboot to clear this reduced speed.
4) If you have a StorEdge 3310 JBOD you must set the scsi option in mpt.conf to reduce the speed to Ultra160, even though the drives can do Ultra320. If you don't, you get odd SCSI warnings and end up changing lots of hardware that did not need changing.
5) If you get a problem on a PCI bus, e.g. a device ends a transaction with a target abort, then you can get two distinct behaviors on a SPARC system. If that transaction is a write then the host bridge sets a suitable bit in a status register and generates an interrupt that is serviced by pbm_error_interrupt(). This can go to a different CPU than the one issuing the PCI write and can take a while to process. This delay explains why crash dumps of "panic : pci fatal error" rarely capture the stack trace of the offending PCI write. If the transaction is a read you get a Berr (bus error) trap on the faulting load instruction, and the resultant crash dump shows you the offending stack trace.
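The arithmetic behind items 1 and 3 can be sketched in a few lines of Python. This is only a rough model, not a measurement: it assumes every unused target listed for every driver costs exactly one 0.25 second selection timeout and that the probes happen back to back, and it treats the async rate as a flat 5 MB/second. The function names and the example counts (4 drivers, 6 unused targets each) are made up for illustration.

```python
# Rough model of the SCSI bus hold times described in items 1 and 3.
# Assumptions (not from any driver source): one selection timeout per
# unused target per driver, probes serialized, flat transfer rates.

SELECTION_TIMEOUT_S = 0.25   # selection timeout quoted in item 1
ASYNC_RATE_MB_S = 5.0        # approximate async transfer rate from item 3


def probe_hold_seconds(drivers, unused_targets_per_driver,
                       timeout_s=SELECTION_TIMEOUT_S):
    """Worst-case time the bus is held while probing absent targets."""
    return drivers * unused_targets_per_driver * timeout_s


def transfer_seconds(size_mb, rate_mb_s=ASYNC_RATE_MB_S):
    """Time the bus is occupied moving size_mb at a given rate."""
    return size_mb / rate_mb_s


if __name__ == "__main__":
    # Hypothetical example: 4 drivers, each listing 6 unused targets.
    print("probe hold (s):", probe_hold_seconds(4, 6))
    # 5 MB sent to a drive that has dropped to async mode.
    print("async transfer (s):", transfer_seconds(5))
```

With those made-up numbers the probing alone holds the bus for several seconds, which matches the scale of the freezes seen, and a single 5 MB transfer to an async drive ties up the bus for about a second.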
I have a new boss and job. I am part of a small team that is hoping to form a diagnostic community within System-TSC (the part of services I work in). The goal is to speed up the resolution of difficult and complex issues by engaging with and offering advice, mentoring and training to other engineers to grow their skill sets. The other folks in this team are all exceptionally talented, so this may be a bit of a challenge.
Administering my little web site and non-technical blog at samoyed.org.uk/blog isn't taking too much time.