Wednesday Mar 04, 2009

I see we quietly announced HERD 2.0 for Sun AMD based machines.

"Herd 2.0" what is that ?

"Hardware Error Report and Decode (HERD) utility" . Its a monitoring daemon for all Sun AMD processor based systems running fairly recent versions of suse and redhat linux that is designed to monitor the machine check log and decode any errors there.  Earlier version were available but had knowledge of the hardware topology of only certain Sun AMD processor based systems but with version 2.0 its ALL Sun AMD based systems.

With the complete topology of the systems it can decode the information that the CPU provides for a error into physical components eg to a specific DIMM

From the Sun download centre ...

Low frequencies of correctable errors are an expected part of modern computing and you need to consult your hardware service  provider to determine when a DIMM is generating too many Correctable Errors and should be swapped.

Thursday Sep 04, 2008

Its a brave new world for the mpt scsi driver.

I remember that when you wanted to add a driver to control a scsi device you added the driver to the right directory, created a driver.conf file, edited that file to select what  target/lun combination you wanted and maybe set some properties  or address filters, ran devfsadm or did a boot -r and it all just worked... 

Now it appears I am old fashioned and that the mpt scsi driver became one of these new fangled "self identifying"  HBA devices with the solaris 10 mpt patch 125081-09 and later revisions that are now in the Kernel patch and hence in the latest updates to solaris 10.

 We did document this in the scsi(4) manual page its just I don't think we really documented that mpt had changed type.

So  if you add a SL500 tape changer device to a mpt scsi bus a /devices node will be created by mpt  now rather than the target driver. As mpt has no idea what driver will attach to that it uses the name "sgen" for scsiclass 08, st for scsiclass 01, sd for scsi classes 00,03,05. This is the big difference previously the /devices node would only have appeared when your target driver attached it and it would have used the name of your driver so you could search for it with find during your solutions install script and put a nice simple entry in /etc/ to get a /dev symbolic link.

Now you will always get a  /devices node called sgen@XXX for a changer device attached to mpt as soon as mpt attaches and probes the bus,  without any modification to sgen.conf  sgen will attach to it. so how do I get my "foo" driver to attach to it?

"man -s4 scsi" says that you will get a list of compatible properties attached to the /devices node, ranging from some very specific strings though to very generic strings. The sgen driver picks up on the very generic entry, but you can use  update_drv -a -i "specific compat property" foo to insert the association ahead of sgen. At this point sgen should let go of the device and the foo driver will attach to it. You can undo this association by using update_drv -d -i "specific compat property" foo to delete the association and allow sgen to re-attach to the changer device.

So for my sl500 how do i see these compatible properties.. 

prtconf -v | grep -i stk

                                            value='scsiclass,08R.vSTK.pSL500.r1126' + 'scsiclass,08.vSTK.pSL500.r1126' + 'scsiclass,08R.vSTK.pSL500' + 'scsiclass,08.vSTK.pSL500' + 'scsa,08.bmpt' + 'scsiclass,08R' + 'scsiclass,08' + 'scsiclass'

so to add my foo driver ahead of sgen and take control of all scsi class 08 changers attached to STK SL500's I used

update_drv  -a -i '"scsiclass,08.vSTK.pSL500"'  foo

and to get symbolic links under /dev/foo I added this line ( tab not spaces) to /etc/

type=ddi_pseudo;name=sgen;      foo/\\M0

instead of the old fashioned form ...

 type=ddi_pseudo;name=foo;        foo/\\M0

all without editing a single  driver.conf file...

Of course the old /dev links for sgen will still be there under /dev/scsi/changer/  but they can be removed with a devfsadm -C

The same would have to be done for the SL500's tape devices  but they would show up under /devices as a node with a name like st@XXXX and you would have to add an alias for them using update_drv -a and a line in /etc/ using name=st; my_driver

certainly some installation programs and solutions will come unstuck when faced with this new mpt behaviour. Hopefully this will help explain what you are seeing.

Tuesday Jul 29, 2008

rcapd stopped my jvm's garbage collector.

The customers brand new machines were having problems with a jvm stopping in garbage collection. The jvm was set to have a 3.1 Gbyte heap but after a few hours of running the application would grind to halt, the garbage collector stats showed that garbage collection was taking an increasing amount of time up into the 10s of seconds..

prstat -m 1 on the box with the application showed the jvm using a lot of user and system time, the system time was odd, but at the top of the active processes was rcapd. A quick look in /etc/project showed that someone had set a resource control to limit  RSS for the user.root project of 1.7 Gbytes.

The line looked something like..

So that was bad, everytime the processes that belonged to root got an RSS of over 1.7Gbytes  rcapd would pick on them and page out their pages, and then set them running again. In the case of the jvm as soon as the garbage collector had to manage a heap  bigger than 1.7 Gbytes it would scan the objects paging all those heap pages back in, then rcapd would page them all back out. The garbage collector was now being limited by the speed of the swap device and was fighting a losing battle.

iostat -xnz 1 showed 100s of Mbytes/sec going to swap and then 100s of Mbytes/sec coming back from swap.

vmstat -p 1  just showed a massive amount of page ins.

rcapd was disabled using rcapadm -D and the new machines flew along. I did some investigations here in the office using a small test case that malloc'ed a large array and then looped around touched its pages and that was crippled as soon as the array was much bigger than the cap, but interestingly as rcapd is procfs based sometimes a procfs command like pfiles or truss would either fail with "the process is traced" or it would stop the rcapd getting to the process's proc files so it would escape capping for a short while...

Friday May 30, 2008

Using makedevice() in a drivers open routine and choosing minor numbers.

A customer reported that the close() function in his driver wasn't called as often as the open() routine. We had the normal discussion about what last close means but it became clear that something odd was going on. Based on an email discussion I wrote a driver with simple open and close entry points. The attach entry point made a device node in the filesystem thus ..

int minor = 0;
ddi_create_minor_node(dip, "ctrl", S_IFCHR, minor, DDI_PSEUDO, 0);

As in all the examples I used 0 for the minor number, so an ls -alL showed 238,0 as the major and minor for my device.
Then in the open routine I then changed the minor number ( under a mutex) using makedevice() . I used a range of 4 minor numbers starting from 0.

static int
cd_open(dev_t \*devp, int flag, int otyp, cred_t \*credp)
/\* select minor \*/
\*devp = makedevice(getmajor(\*devp),minor);


after a sequence of open and closes minor 0 was stuck open.

The answer is do not hand out the same minor number in open as you used in the attach. In my example I should have used minor number 4 in the attach and handed out minors 0 to 3 in the open. That would also allow me to differentiate between a close on the filesystem device node and a close on one of the dynamic minor numbers.

Monday May 12, 2008

first step: using dtrace to look at i/o system calls.

Assuming you believe that your application is having i/o related problems dtrace provides a very powerful way to slice and dice the software stack to find the cause of your delays. I always start by observing the i/o related system calls that a process is making. A few lines of dtrace allow you to get a graphical display of the system call response times - which is really all the application cares about. If you see no signs of delay here then there is no need to probe further down into the i/o stack.

I have chosen probe points in the context of the process being observed in order to try and eliminate any scheduling type issues.. The attached dtrace program io.d uses aggregations and the quantize() function to produce neat graphs of system call response times for aioread/aiowrite, read/write and pread/pwrite system calls as these are the main ways application do i/o.

Read/write and pread/pwrite are quite easy as the i/o happens inside the system call ( well the disk i/o might happen later if doing i/o to a filesystem without O_DSYNC set) as far as the application is concerned, so we just use the time delta between entry and return probes.

For aioread/aiowrite to raw devices there are two system calls involved, one call to kaio() to issue the read of the write and another call to kaio() to reap the return notification. I have therefore used probe points that allow me to observe the sending of the aio_request structure and the reaping of it in the second system call and I have used the time delta between these two events to provide a kernel timing as close to what the userland application would see if it used gethrtime() to time this sequence of systems calls.

Here in this output I am using the program to watch a program generating aiowrites() to two slices on the same disk, ( you also see writes to the tty as the program emits some timing information of its own..).. The vertical index (value) is in millisecs.

ksh: /usr/sbin/dtrace -qs /var/tmp/aio.d $(pgrep aio)
------- 1648930 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 82       
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     299      
               8 |@@@@@@@@@@@@@@@@@@@@                     298      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     295      
               8 |@@@@@@@@@@@@@@@@@@@@                     295      
              16 |@                                        9        
              32 |                                         0        

------- 1648940 value axis is in millisecs -------

  write against /devices/pseudo/pts@0:1             
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 114      
               1 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:a,raw
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |                                         2        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     334      
              16 |                                         1        
              32 |                                         0        

  kaio writes to /devices/pci@0/pci@0/pci@2/scsi@0/sd@1,0:b,raw
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@                     328      
               8 |@@@@@@@@@@@@@@@@@@@@                     329      
              16 |                                         7        
              32 |                                         0        


I am surprised that companies have not started producing dtrace based i/o analysis tools allowing health checks and event monitoring with an accuracy and ease we have never had before. Even this script would allow you to look for abnormal i/o responses or produce reports on throughtput and response times with small changes to the script.

Thursday May 08, 2008

dtrace looking for slow i/o.

I have spent the last 6 months helping support folks in Sun deal with performance escalations. A recent one has changed the way I think about using dtrace.

Its very easy to just turn on all the fbt probes and then log start and end times of every function, and if you add in a few lines of code you can get deltas timings for function execution, and if you predicate the functions on the i/o generator process you can get all the function timings that are called in the context of the user process you are predicating on. But does this help with the task of finding the cause of the slow i/o ?

In the case at hand the answer is NO.

In this case the graph of number of i/os versus completion time in millisecs had a bonus extra peak at about the 400 millisec mark and of a magnitude of about 10% of all i/o.

Previous engineers had done the massive tracing of all function timings in the context of the oracle process and once they had waded through the 100s of megabytes of numbers had not been able to find delays that matched what the oracle internal stats were showing so they had started looking for very esoteric scheduling type problems..

By applying SGR methodology and using the problem statement "where in the lifecycle of the i/os can we observe the 40 millisec delay", you start to treat the i/o as an object not a code flow, and that is the important paradigm. You then ask yourself "where in the software stack that the object ( i/o) has to traverse on its way to the disk drive and back to the application can I observe the object", dtrace allows you to perform the measurement using aggregations.

The first thing to do as the Oracle process used async i/o against raw device was to observe the distribution of the time taken for the i/o to pass into the kernel via aioread()/aiowrite() and for the corresponding aiowait() to reap the returning i/o completion. Here the i/o is represented by the aio_req_t object, created in the aioread/aiowrite system calls and destroyed in the corresponding aiowait() system call. This is exactly what the application will be using to do its timing in userland. the quantize() aggragating function is ideal for producing graphs and in this case it showed an identical graph to the oracle application own internal statistics.

the lifecyle of the i/o was..

application ( aioread/aiowrite)
kaio (aioread/aiowrite)
a volume manager
ssd  target driver
mpxio - multipathing
leadville - san stack
qlc - hba
qlc interrupt handler
leadville completion routines
ssd interrupt handler
the volume managers completion routines
kaio (aiowait)
application ( aiowait)

So our first test that involved enabling only two fbt probes had allowed us to say that the problem lay in the top kaio and the bottom kaio and that the application statistics were accurate. What we needed was some other probe points to allow us to chop up this stack and find our delay.

The problem with chasing i/o through a volume manager is that the entering i/o ( represented by a buf structure) is often broken up into smaller buf structures or otherwise changed in a manner that does not allow easy following of the objects, so we jump down to the ssd routines, again using just two probes we can graph the time distribution of buf objects entering the ssd driver and the interrupt returning the same object. This showed no i/o taking more than a millisec.

That ruled out issues with the storage itself. So now we have a couple of places in the lifecycle where our delay could be introduced. The next test was to time the pci interrupt thread from its creation to its death again with two probes to see if the interrupt routines was blocking anywhere above sd - it wasn't. This could have been a mistake if the interrupt thread had handed our returning buf off to a taskq thread and that had delayed, so we aggregated the stack of the callers of the kaio completion routine and 100% of the callers were interrupt threads, so now we know where the problem is and where it isn't.

Attention was focused on to the volume manager and it quickly became obvious that the delay was caused by the i/o object being passed off to a worker thread which was delaying the i/o. This would explain why the previous engineers approach had not worked - when an i/o was delayed the thread returned much quicker as it did less work - they should have looked for a function taking less time not one taking more time!

So the way to work these issues is - to understand the software stack, to understand what software object represents your i/o at each of the layers, and to have access to the source code ( to be able to find that object as it passes down through each layer and when it passes back up again.

my next few posts will include the dtrace to chop up the i/o stack at different levels, up near the application will alow you to verify if you have an i/o delay issue and some scripts lower down to try and chop up the stack to speed diagnosis.

Sunday Dec 09, 2007

modload system call name

This weekend I've had to write a modloadable system call to read a cpu register that I can't read from userland on an amd64 box. Everything went well until I tried to modload the syscall module I'd built..
# modload /home/timu/src/syscalls/test1
can't load module
Dec  9 21:27:22 aaaaa test1: REQUEST INFO 
: Out of memory or no room in system tables
Dec  9 21:27:22 aaaaa test1: INITIALIZED
Dec  9 21:27:22 aaaaa genunix: WARNING: system call missing from bind file
Obvious I must have missed adding an entry to /etc/name_to_sysnum !
# grep timu /etc/name_to_sysnum
timuSyscall               66
oh! Its not that as I had used an unused number. ah! How does the system match the module name to the syscall number, there is nothing in the driver I have written to do that...It must be the name..
# cp test1 timuSyscall
# modload /home/timu/src/syscalls/timuSyscall
Dec  9 21:30:04 aaaaa timuSyscall: REQUEST INFO
Dec  9 21:30:04 aaaaa timuSyscall: INITIALIZED
So the module name that you are modloading must match the name you put in /etc/name_to_sysnum.

Thursday Nov 29, 2007

things I didn't know but should have..

Can't get the terminal size right when using the console of a machine?

eval `/usr/openwin/bin/resize`
TERM=xterm ; export TERM

Your Solaris 10 machine crashes with a short panic string of "pci fatal error", afterwards you have a crash dump and no sign of any FMA ereports that look relevant from fmdump.. So how do you find out what the hardware reported that upset the system so much it panic'ed....

 ksh: mdb unix.0 vmcore.0
Loading modules: [ krtld genunix specfs dtrace ufs scsi_vhci sd mpt px ip sctp arp usba fcp fctl nca lofs audiosup md cpc random crypto zfs fcip logindmux ptm sppp nfs ipc ]
> ::walk ereportq_dump | ::ereport -v 


I always used a much longer way of doing it...

Thursday Nov 15, 2007

Things I have learnt in the last few weeks.

1) Customer's system suffered from i/o slowdowns of the order of seconds. The cause was a load of unused targets in the scsi related driver .conf files like ( ses.conf, st.conf, sgen.conf, sd.conf). A home made monitoring application was triggering driver loading, as part of that process the framework tried to probe all the targets in these .conf files. On a copper scsi bus each probe involves sending inquiry commands, obviously with no target to respond to the selection you get the 0.25 second selection timeout which holds the scsi bus. Get enough drivers and unused targets and you can get bus freezes of many seconds. This was resolved by using scsi.d through a sharedshell session that allowed me to login and root cause it- luckily the customer was running Solaris 10.
2) When using Oracle RAC you need to partition your tables to avoid both nodes wanting to access the same block. If that happens the block is read from disk into one node and then when it is free it is passed across the network interconnect to the other node..
3) don't ignore scsi messages in /var/adm/messages about targets reducing sync rate or even worse going async.. A copper scsi drive in async mode canonly transfer about 5MB/second. If you do send it 5 Mbytes in big transfers then it will occupy the bus transferring data slowly holding off other targets. Currently you have to reboot to clear this reduced speed.
4) if you have a storedge 3310 JBOD you must set the scsi option in mpt.conf to reduce the speed to ultra 160 even though the drives can do ultra 320. If you don't you get odd scsi warnings and change lots of hardware that did not need changing.
5) if you get a problem on a pci bus eg a device ends a transaction with a target abort then you can get two distinct behaviors on a sparc system. If that transaction is a write then the host bridge sets a suitable bit in a status register and generates an interrupt that is serviced by pbm_error_interrupt(). This can go to a different cpu than the one issuing the pci write and can take a while to process. This delay explain why crash dumps of "panic : pci fatal error" rarely capture the stack trace of the offending pci write. If the transaction is a read you get a Berr ( bus error) trap on the faulting store instruction and the resultant crash dump shows you the offending stack trace.

I have a new boss and job, I am part of a small team that is hoping to form a diagnostic community within System-TSC ( the part of services I work in). The goal is to speed up the problem resolution of difficult and complex issues by engaging with and offering advice/mentoring/training to other engineers to grow their skill sets. The other folks in this team are all exceptionally talented so this maybe a bit of a challenge.

Administering my little web site and non-technical blog at isn't taking too much time.

Sunday Oct 21, 2007

splitting my blog

So I have decided to split my blog, personal stuff goes to and Sun related technical stuff goes here, that way all the stuff about samoyeds and cycling will be separate and I might just post technical things more often. Its been maniacally busy since we came back from our summer holidays in Scotland or at least that is my excuse for not posting more often..

Thursday Jul 12, 2007

driving tests: what do they teach them...

A small rant.

On the way home last night a car comes very close to my handlebars, a car I recognise as belonging to a neighbour's 19 year old son.

As I get home at the same time as the youth I ask him why he was so close and I was horrified at his reply - "| had to be close to you because there was a car coming the other way". I suggested he ought to have waited as the highway code suggests until it was safe, he suggested I get off the road.

He claimed to be happy to give that reason to a traffic policeman - so he really did not understand at all. His idea of driving was about getting from A to B as fast as possible, much like mine was at that age.

Lets hope with the help of his dad who overheard this conversation and agreed to deal with his son he can be educated before he hurts someone..

oh and its raining again...

Monday Jul 02, 2007

a quick catch up.

a quick catch up.[Read More]

Friday Jun 15, 2007

Farnham Park is so nice.

Farnham Park is so nice.[Read More]

Thursday Jun 14, 2007

sunray is so good.

sunray is so good.[Read More]

Monday Jun 11, 2007

using mapfiles to lay program segments out.

using cc/ld mapfiles to lay program segments out.[Read More]



« April 2014

No bookmarks in folder


No bookmarks in folder