Wednesday Aug 30, 2006

tracking why an application got a SIGFPE divide by zero.

Back from a long holiday, a colleague asked me to look at why a small C++ application was dying with SIGFPE on x86 boxes running Solaris 10. They had run dbx and truss and had worked out that it was taking a SIGFPE divide by zero trap on an idivl instruction deep in the flush of an i/o stream. The truss output showed the fault as

    Incurred fault #8, FLTIZDIV  %pc = 0x0805065E
      siginfo: SIGFPE FPE_INTDIV addr=0x0805065E
    Received signal #8, SIGFPE [default]
      siginfo: SIGFPE FPE_INTDIV addr=0x0805065E

So that looks like a divide by zero, but dbx showed that although the instruction was an idivl, the divisor register was not zero!

After a bit of looking at the AMD instruction documents we see that the idiv instruction can generate a "divide error" exception for two reasons - a divide by zero and an integer overflow. The Solaris kernel maps the "divide error" exception onto the FPE_INTDIV code that truss reports, but either cause can produce it. In this case we had an integer overflow: the result exceeded the capacity of a signed int. Now the folks who maintain the library that made the stream know where to go and look in their code.
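
As a minimal sketch (not the original application), this shows the second cause of the "divide error" exception: INT_MIN divided by -1 overflows a signed int, so the idivl traps even though the divisor is not zero.

--------------------------------------------
#include <limits.h>
#include <stdio.h>

int
main(void)
{
	/* volatile stops the compiler folding the division away */
	volatile int dividend = INT_MIN;
	volatile int divisor = -1;

	/*
	 * On x86 this idivl overflows and the process dies with SIGFPE,
	 * reported as FPE_INTDIV, just like the truss output above.
	 */
	printf("%d\n", dividend / divisor);
	return (0);
}
--------------------------------------------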

Monday Jun 05, 2006

never use a tilde character as the first letter in your password.

I thought I was being clever with my algorithm for choosing new passwords; for reasons I can't remember this had "~" as the first character - BIG mistake. "~" is the escape character for lots of things like ssh and our service-processor-to-host protocols. Good job I did not have "~#" or "~." as the first two characters! Something best avoided.

Monday Apr 24, 2006

dma_attr_sgllen can make your i/o look slow

I was asked to look at a slow i/o performance problem using Solaris 10 on our fabulously fast
AMD64 boxes. The iostat command was reporting a very slow active service time (asvc_t) when the memory supplying the data to the large(ish) i/o was not allocated from large pages.

DTrace showed the large page based i/o going out in one chunk, with one call to sdintr() at the end of the
i/o before the buf was returned, but the 4k page based i/o was going out in a number of chunks. Each chunk of 128k terminated in a call to sdintr(), and only after the last chunk returned was the buf returned. The important part of the stack trace that DTrace or lockstat profiling will show is calls to ddi_dma_nextcookie() as each chunk is initialised.

For the i/o kstat the service time runs from when the buf is sent to the HBA for transport to the disk and ends when the buf is returned. For the 4k page based i/o each chunk extends the service time by a multiple of the real service time.

So what was causing the i/o to be broken up? The sd target driver relies on the underlying HBA driver to
do the scsi packet and DMA initialisation via the scsi_init_pkt() call. For this particular HBA the
ddi_dma_attr structure (man ddi_dma_attr) had 32 in the dma_attr_sgllen field. This field
describes the number of scatter/gather segments that the DMA engine built into the HBA card can
deal with per i/o request. If an i/o requires more than 32 scatter/gather list elements then it will be broken
up into multiple i/o requests.

The large page buffer is allocated out of 2MB pages of contiguous virtual memory addresses but,
more importantly, each large page is made from contiguous physical memory, which is what the DMA engines use, so a buffer allocated from these large pages occupies just one DMA scatter/gather list element.

The small page buffer is allocated out of 4K pages of contiguous virtual memory addresses, but
each 4K page can be mapped into its virtual address from any physical address, so in the worst case each 4K page takes one DMA scatter/gather list element. A large i/o can therefore take more than 32 elements and so be chunked into N i/os of 128k, making the iostat active service time look N times worse than it really is.
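
Here is a back-of-the-envelope sketch of that arithmetic. The 32-element sgllen and 4K page size match the example above; the 1MB i/o size is just an illustration.

--------------------------------------------
#include <stdio.h>

int
main(void)
{
	const long pagesize = 4096;		/* small page size */
	const long sgllen = 32;			/* dma_attr_sgllen for this HBA */
	const long iosize = 1024 * 1024;	/* a 1MB write built from 4K pages */

	long worst_case_elements = iosize / pagesize;	/* 256 sg elements */
	long chunk = sgllen * pagesize;			/* 128K per transport */
	long chunks = (iosize + chunk - 1) / chunk;	/* 8 separate i/os */

	printf("%ld sg elements worst case, %ld byte chunks, %ld i/os\n",
	    worst_case_elements, chunk, chunks);

	/*
	 * iostat's asvc_t covers all 8 chunks, so the i/o looks roughly 8x
	 * slower than the same buffer allocated from one contiguous 2MB page.
	 */
	return (0);
}
--------------------------------------------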

So using large pages can have hidden benefits.  (man ppgsz)

Wednesday Mar 22, 2006

something new and very useful from kernel tnf

Turning on all of the kernel's TNF probes gathers a big blob of data about what is going on inside the kernel. Prior to Solaris 10 this is the only way to get accurate timing information for system calls. Recently, as in last night, I was trying to work out from one of these blobs why a write into a ufs filesystem might take a long time. I had the pid of the writing process so I could find all its threads, and I could see one of them issue the write() system call, then I kept seeing the thread block and almost wake up before blocking again. It did this a number of times; obviously it was competing for a resource like a semaphore or a condvar or a mutex and not getting it. All tnfdump gives you is the address of the resource.

But if you use tnfdump -rx you see a bit more.

Here is what tnfdump gives you..

1995.449700  0.010800 10768     3 0x3002ec894a0   4 thread_block  reason: 0x3002ec89632 stack: 

but here is what tnfdump -rx gives on a Solaris 8 system

0x63f8e8  : {
                 tnf_tag 0x22f8     thread_block
             tnf_tag_arg 0x63f840   
              time_delta 0x4fd198
                  reason 0x3002ec89632
                   stack 0x63f900   
        }
0x63f900  : {
                 tnf_tag 0x2358     tnf_symbols
           tnf_self_size 0x38
                       0 0x1011c800
                       1 0x100448cc
                       2 0x1007d5f8
                       3 0x1007d9e4
                       4 0x100adaa4
                       5 0x100adc68
        }

On Solaris 9 the stack is compressed to two addresses per line. That looks like a kernel stack trace to me, but the challenge is to turn it into a symbolic stack trace. On the machine where the prex was run you could do

echo "0x1011c800/i" | mdb -k

but to be portable I think you should also gather, from the machine where the prex was run, the output of
/usr/ccs/bin/nm -fx /dev/ksyms > file, or a live crash dump.

So time for a modified tnfdump or another awk script to glue these things together so we can see why threads might block in the kernel.
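
As a starting point, here is a rough sketch in C (rather than awk) of the lookup such a tool needs. It assumes the nm output has already been boiled down to "hex-value symbol-name" pairs, one per line - that preprocessing step and the 64-character name limit are assumptions, not part of any existing tool. For each stack address given on the command line it prints the nearest symbol at or below that address plus an offset.

--------------------------------------------
#include <stdio.h>
#include <stdlib.h>

#define	MAXSYMS	200000

static struct sym { unsigned long addr; char name[64]; } syms[MAXSYMS];
static int nsyms;

static int
cmp(const void *a, const void *b)
{
	const struct sym *x = a, *y = b;
	return (x->addr > y->addr) - (x->addr < y->addr);
}

int
main(int argc, char *argv[])
{
	char line[256];

	/* symbol table on stdin: "<hex value> <name>" */
	while (nsyms < MAXSYMS && fgets(line, sizeof (line), stdin) != NULL) {
		if (sscanf(line, "%lx %63s", &syms[nsyms].addr,
		    syms[nsyms].name) == 2)
			nsyms++;
	}
	qsort(syms, nsyms, sizeof (struct sym), cmp);

	/* each argument is a stack address from the tnfdump -rx output */
	for (int i = 1; i < argc; i++) {
		unsigned long pc = strtoul(argv[i], NULL, 0);
		struct sym *best = NULL;

		for (int j = 0; j < nsyms && syms[j].addr <= pc; j++)
			best = &syms[j];
		if (best != NULL)
			printf("%s+0x%lx\n", best->name, pc - best->addr);
		else
			printf("0x%lx\n", pc);
	}
	return (0);
}
--------------------------------------------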

so more on that when I have it working, but until then if you have to gather kernel prex data, send in the raw output file from tnfxtract, a live dump or the nm of /dev/ksyms so we don't lose any information from the data.

Monday Mar 20, 2006

lrand48() is an excellent function for generating testcases.

Spending a lot of time writing test cases to try to reproduce system panics has led me to use an interesting(ish) methodology. You stare at the data structures in the dump, you stare at the code and see if you can work out how to get things into a similar state. Then comes the difficult bit: I've taken to working out what operations are possible from the userland code and then randomising them using lrand48().

So this week's exercise has been to reproduce a panic in the poll() code that, from code inspection, is impossible. The file table (indexed by file descriptor) is per process, but the poll structures are per lwp, so there is a linked list of interested threads attached to a file entry if any threads are polling on that file entry. In the dump there are 3 threads chained off one file entry, so we know that we have a multi-threaded process performing poll() on a single file descriptor from several threads at once. We panicked in close() as we traversed that list, because one of the threads had been reused by a process that does no polling so its per-thread poll structures were null. So now we know that it has threads exiting and threads closing the file that we are polling on.

So I wrote a threaded program that opened a net connection and then went into a loop: it randomly started a new thread, and those threads then possibly polled on that connection, or possibly exited, or possibly closed the connection. The main thread dealt with all of this, re-opening the connection if it got closed and starting new threads as ones exited - all under the choice of lrand48().
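
A cut-down sketch of that kind of harness is below; it is not the original test case. /dev/null stands in for the network connection, and the thread counts and timeouts are arbitrary - the point is that lrand48() picks every operation.

--------------------------------------------
#include <pthread.h>
#include <poll.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int fd = -1;			/* the shared descriptor threads poll on */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
	struct pollfd pfd[4];

	switch (lrand48() % 3) {
	case 0: {	/* poll on the shared fd with a random pollfd array */
		int n = (int)(lrand48() % 4) + 1;
		for (int i = 0; i < n; i++) {
			pfd[i].fd = fd;
			pfd[i].events = (short)(lrand48() & (POLLIN | POLLOUT));
		}
		(void) poll(pfd, n, 10);
		break;
	}
	case 1:		/* close the fd other threads may be polling on */
		pthread_mutex_lock(&lock);
		if (fd != -1) {
			(void) close(fd);
			fd = -1;
		}
		pthread_mutex_unlock(&lock);
		break;
	default:	/* just exit, so the lwp can be reused */
		break;
	}
	return (NULL);
}

int
main(void)
{
	pthread_attr_t attr;

	srand48((long)getpid());
	pthread_attr_init(&attr);
	pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

	for (;;) {
		pthread_t t;

		/* the main thread re-opens the connection if a worker closed it */
		pthread_mutex_lock(&lock);
		if (fd == -1)
			fd = open("/dev/null", O_RDWR);
		pthread_mutex_unlock(&lock);

		(void) pthread_create(&t, &attr, worker, NULL);
		usleep((useconds_t)(lrand48() % 1000));
	}
	/* NOTREACHED */
}
--------------------------------------------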

Did this reproduce it? No. So then I randomised the number and contents of the pollfd array passed to poll() and suddenly the machine panicked with an identical stack trace to the customer's machine - the power of lrand48().

The good news is that it is fixed in Solaris 10 already...

Thursday Mar 16, 2006

Where can I/O queue up in sd/ssd

Using the OpenSolaris source at opensolaris.org you can see the source code to sd.c, which builds the sd and ssd scsi/fibre target drivers.

The driver keeps a count of the i/os it has accepted in the un->un_ncmds_in_driver element of the un structure, which is the softstate for that disk lun. It keeps the number of i/os it has passed down to the HBA to transport in the un->un_ncmds_in_transport variable.

New i/os passed into sd/ssd are passed straight down to the HBA for transport until un_ncmds_in_transport exceeds the un_throttle for this lun, or during error processing when we retry commands sequentially until one succeeds.

When we have to queue the bufs we sort them onto a linked list anchored in un->un_waitq_headp and un->un_waitq_tailp.

But there is another place that i/o can be queued before it gets to the waitq and throttles - in the xbuf layer. un->un_xbuf_attr points to the xbuf control block, which is of type __ddi_xbuf_attr. This has an unsorted waitq anchored in xa_headp and xa_tailp; by default the xbuf layer only allows 256 commands through to the normal sd waitq and throttles, the rest are queued in the xbuf while un_ncmds_in_driver is still incremented. So it is possible, when a large number of commands are dropped on an sd lun, for un_ncmds_in_driver to be very large while the waitq has fewer than 256 commands on it - the rest are lurking on the xbuf's queue.
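
To make the order of those checks concrete, here is a self-contained sketch in C. The field names echo the sd softstate described above, but the structure, helper counter and the un_throttle value of 8 are stand-ins for this illustration, not the real sd.c code paths.

--------------------------------------------
#include <stdio.h>

#define	XBUF_LIMIT	256	/* default commands the xbuf layer lets through */

struct un {			/* stand-in for the sd softstate */
	int un_ncmds_in_driver;
	int un_ncmds_in_transport;
	int un_throttle;
	int cmds_past_xbuf;	/* hypothetical counter for this sketch */
};

static void
accept_buf(struct un *un)
{
	un->un_ncmds_in_driver++;

	if (un->cmds_past_xbuf >= XBUF_LIMIT) {
		printf("queued on the xbuf layer's unsorted xa_headp list\n");
		return;
	}
	un->cmds_past_xbuf++;

	if (un->un_ncmds_in_transport >= un->un_throttle) {
		printf("sorted onto the un_waitq_headp wait queue\n");
		return;
	}
	un->un_ncmds_in_transport++;
	printf("passed straight down to the HBA for transport\n");
}

int
main(void)
{
	struct un un = { 0, 0, 8, 0 };	/* un_throttle of 8, purely illustrative */

	for (int i = 0; i < 300; i++)
		accept_buf(&un);
	printf("un_ncmds_in_driver = %d, un_ncmds_in_transport = %d\n",
	    un.un_ncmds_in_driver, un.un_ncmds_in_transport);
	return (0);
}
--------------------------------------------

Run with 300 bufs, un_ncmds_in_driver ends up at 300 while the waitq and transport between them hold far fewer - the remainder sit in the xbuf queue, which is the situation described above.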

These xbuf queued i/o are invisible to iostat's wait column - maybe they shouldn't be?

Friday Sep 09, 2005

measure and monitor before you change anything

Here is something to think about...
If you are going to make a major change to a machine or application then it is worthwhile investing some time in getting some baseline measurements of the system's configuration and performance before you change anything. Exactly what is worth gathering depends on the machine, but I see many escalations where we are forced to explain why, after a major change, the system appears to be slow, with no possibility of gathering any data from before the changes.

The explorer scripts do a good job of capturing the base solaris/machine configuration but you also need to capture the configuration of the primary application. You then need to gather performance data from Solaris and the application.

An example of this is a customer moving to 64-bit Oracle who suffered much slower i/o performance; luckily there were some snippets of sar data from before the change that allowed us to spot that the amount of i/o had greatly increased due to changes in the Oracle configuration.

prepare for the worst and it won't happen.

Wednesday Jul 13, 2005

two little things.

I use this line in my .kshrc to debug ksh scripts

export PS4="ksh debug \$LINENO | "

once you run a script with ksh -x or put set -x in it you see output like..
enoexec ksh:  ksh -x utds
ksh debug 33 | ksh debug 33 | uname -s
OS=SunOS
ksh debug 34 | [ SunOS = SunOS ]
ksh debug 36 | PSOPTS=-o ppid,pid,comm
ksh debug 49 | trap echo "$0: Interrupted "; exit 1 INT QUIT
ksh debug 51 | ksh debug 51 | basename utds
USAGE=Usage: utds { start | stop | restart }
ksh debug 54 | UTPRODINFO=/etc/opt/SUNWut/basedir/lib/utprodinfo
ksh debug 55 | [ -x /etc/opt/SUNWut/basedir/lib/utprodinfo ]
ksh debug 57 | ksh debug 57 | /etc/opt/SUNWut/basedir/lib/utprodinfo -p SUNWutdso BASEDIR
ksh debug 57 | 2> /dev/null
BASEDIR=/opt
ksh debug 60 | [ -z /opt ]
ksh debug 64 | PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/SUNWut/srds/lib:/opt/SUNWut/srds/sbin
ksh debug 65 | LD_LIBRARY_PATH=/opt/SUNWut/srds/lib:/home/timu/src/malloclib/sparc/libs
ksh debug 66 | export PATH LD_LIBRARY_PATH
ksh debug 67 | UTPUSHD=/opt/SUNWut/srds/lib/utpushd
ksh debug 68 | UTDSD=/opt/SUNWut/srds/lib/utdsd
ksh debug 77 | ExitCode=0
ksh debug 82 | [ ! -f /etc/opt/SUNWut/srds/current/utdsd.run ]
ksh debug 90 | [ -f /etc/opt/SUNWut/srds/current/utdsd.ini ]
ksh debug 92 | . /etc/opt/SUNWut/srds/current/utdsd.ini
utds[92]: /etc/opt/SUNWut/srds/current/utdsd.ini:  not found

with the number after "ksh debug" being the line in the script that was being executed.

Thursday Jul 07, 2005

poking at the kernel dispatcher.

I have been drawn into an escalation involving the kernel dispatcher; this is the code that picks and schedules threads onto the available CPUs according to rules in the dispatch tables. It is by far the most complex code in the whole machine and IMHO it is the code that allows Solaris to scale so well.

In this escalation a load of threads, including the clock, are spinning waiting for another thread to release a dispatcher lock; sadly no one releases it and the machine hangs until the deadman or a cluster heartbeat stops the machine. Dispatcher locks are simple spin locks designed for easy use to protect modifications to the dispatching object when a thread transitions from sleepq to runq to oncpu. As a thread moves from object to object the thread's t_lockp pointer is moved to point to the lock in the new object.

Now Solaris is quite complex internally, especially when you consider doors, which are zero context switch function calls from one thread into another. Somewhere in all of the swtch/resume/door code we know that something is switching a thread t's t_lockp from one cpu lock to another cpu lock whilst at the same time ts_update_list() is doing thread_lock(t)/thread_unlock(t), so it unlocks a different lock to the one it locked.

My job was made easier by some clever tracing code that a colleague of mine has developed, which showed the lock and unlock of different locks; sadly it couldn't spot who manipulated t's t_lockp without having done its own thread_lock(t) first. Code inspection has us looking at the door code and maybe the interrupt pin/unpin code.

It would be good to extend the mechanism we have for statically checking locking sequences in storage drivers to deal with dispatcher locks, maybe that is something the opensolaris community could look at?


Friday Jul 01, 2005

procfs watchpoints

Procfs watchpoints are a little used but very powerful debugging tool. They are documented in man -s4 proc. They allow a program to be interrupted when it performs a memory read/write/execute within a specified range of addresses. A simple bit of code that detects writes to an integer could be based on this...

--------------------------------------------
#include <sys/types.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/errno.h>
#include <ucontext.h>
#include <procfs.h>
#include <sys/uio.h>
#include <unistd.h>

struct thing {
	char buffer[64];
	int x;
	char buffer1[512];
} thingy;

void
sub(int *x)
{
	*x = getpid();
}

int
main(int argc, char *argv[])
{
	/* a PCWATCH control message: the command word followed by a prwatch_t */
	struct {
		long cmd;
		union {
			long flags;
			prwatch_t prwatch;
		} arg;
	} ctl;

	int pfd;

	pfd = open("/proc/self/ctl", O_WRONLY);
	if (pfd == -1)
		abort();

	/* watch reads, writes and executes of thingy.x */
	ctl.cmd = PCWATCH;
	ctl.arg.prwatch.pr_wflags = WA_READ|WA_WRITE|WA_EXEC;
	ctl.arg.prwatch.pr_vaddr = (caddr_t)&(thingy.x);
	ctl.arg.prwatch.pr_size = sizeof(int);
	printf("setting watchpoint at address 0x%p size %d\n",
	    ctl.arg.prwatch.pr_vaddr, (int)ctl.arg.prwatch.pr_size);
	(void) write(pfd, (char *)&ctl, sizeof (long) + sizeof (prwatch_t));

	/* this write to thingy.x trips the watchpoint */
	sub(&(thingy.x));
	return (0);
}
-------------------------------------------
so compiling that into bar and running that under mdb produces..

hoofhearted ksh: mdb bar
> :r
setting watchpoint at address 0x8060de0 size 4
mdb: target stopped at:
sub+0x10:       movl   %edx,(%eax)
>
> $r
%cs = 0x003b            %eax = 0x08060de0      bar`thingy+0x40
%ds = 0x0043            %ebx = 0xbfffb824
%ss = 0x0043            %ecx = 0xbff9da37        libc.so.1`getpid+7
%es = 0x0043            %edx = 0x00000603
%fs = 0x0000            %esi = 0x08047264
%gs = 0x01c3            %edi = 0x080473a0
%eip = 0x08050ab0 sub+0x10
%ebp = 0x08047208
%kesp = 0x00000000
%eflags = 0x00010202
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,zf,af,pf,cf>
%esp = 0x08047200
%trapno = 0xe
%err = 0x6
>

So we took a SIGTRAP as soon as we tried to modify thingy.x. If the program had set the watchpoint on its own data then it would probably capture the signal, process it and return. If the watchpoint was set by a debugging process then the debugger would have registered an interest in that signal and could regain control at that point.

So I am using this in my debugging malloc library to watch the areas of the mmap that fall outside the exact area requested in the call to malloc(). So far I have discovered that strlen has intimate knowledge of malloc: if the base address is aligned it uses a very clever algorithm to find the null terminating byte, but that causes it to use 32-bit load instructions to scan the string. So I have had to change the watchpoint at the end of the returned area to be a write watchpoint until we become 4 byte aligned, and then continue to the end of the page with a read/write watchpoint.

To avoid being accused of endian bias I have finally got i86 and amd64 versions working - and found an interesting little bug in the kernel whilst I was at it.

The procfs watchpoints work by clearing all permissions on any page that has any part of a watchpoint in it. Then when any access to the page happens we examine the AVL tree of per-process watchpoints to see if the accessed address is inside a watched area; if it is we send the SIGTRAP, set up to single step the process on return from the signal and set the page permissions back to normal. After the single instruction has accessed the memory we clear the page permissions and allow the process to run normally again. You can see why it is quite slow!


In the case of sparc the instruction size is 4 bytes and the data access size can be easily calculated from the instruction. On an x86 machine we have to decode the instruction thoroughly to find the instruction and data access size. The small bug I found caused that code to calculate the data access size for a pushl instruction in a 32-bit application running on an amd64 kernel as 8 bytes instead of the correct 4 bytes. So if the data being pushed onto the stack was the last 4 bytes before a watchpoint, the kernel thought that you were accessing 8 bytes and so triggered the watchpoint. Took me all day to find it; I was sure it was my code.


Thursday May 26, 2005

Success with the hunt for the segv.

Yes! Finally, after three whole days, I have worked out the answer for the segv in the malloc code. This program is a graphics application and it only segvs when run on a jfb framebuffer <big hint>. I sat down and read the malloc code - very clever, I should write it up to save others the learning curve. We segv when carving up a big buffer area into two smaller ones, and importantly we were carving up a buffer called Bottom. This buffer contains all the unallocated space; when it runs out we call _morecore() which uses sbrk() to get more heap pages.

In this case early on we have a 19Mbyte malloc, so the next call to _morecore() allocates a wee bit more than 0x01000000 bytes; this is important as it has a bit set in the top byte of the size. The size is kept in the 32 bits 8 bytes below the malloc/free pointer. Using dbx's watchi function I noticed that just as we segv the size of Bottom drops from 0x01001238 to 0x00001238 - maybe something is overwriting the top byte of the size.

Luckily I have the source to the standard malloc, so I turned it into a library and preloaded it - it still failed. So I used procfs's watchpoint functionality to put a write watchpoint across the size when the code recognised the address. Yes, it took a SIGTRAP, and it was whilst writing the null byte at the end of a strcpy. The classic sign of someone doing..

foo = malloc(strlen(string));
strcpy(foo, string);

This of course is wrong as it has not allocated room for the trailing null (and it should have some error checking). This is a bit better...

foo = malloc(1 + strlen(string));
strcpy(foo, string);

So who was the culprit? /usr/openwin/lib/GL/devhandlers/oglSUNWjfb.so.6 - time for a bug.

More worryingly, why did no debugging malloc library catch it? A combination of things: the space was not a multiple of 16 bytes so there was some padding at the end. I will get a simple testcase and log an RFE against libumem to get it to catch these issues.

Back to core dump running code

Wednesday May 25, 2005

weird malloc problem.

A colleague asked me to help debug a userland core file a couple of days ago. The stack showed the normal segv in a frame under realfree() that is so typical of a malloc arena corruption. The standard malloc keeps some housekeeping in the 8 bytes below the pointer that malloc returns; once you call free() those 8 bytes and the first few of the allocation are cast into an internal data structure and it is added to a b-tree sorted by size. It is not until another malloc/free causes a change in the tree that the corruption is detected. As this can happen a long time after the free, the program can segv long after the damage - another reason to use libumem or libwatchmalloc or ...

I thought it would be easy as there was a testcase: just load up libumem or my own debugging malloc library, run the testcase and point out the incorrect code. But it wouldn't fail with those. I interposed on malloc/free/calloc/valloc/memalign but I could see no obvious error like a double free. My current theory is that something is trampling over the structure after it has gone on the free list. Time to become more of a dbx power user, or maybe write an mdb dcmd to verify the malloc arena at regular intervals.

Once that is done I can get back to putting the sparc emulator into my crash dump running program.

Friday May 20, 2005

running code from kernel dumps.

Resurrection.

One off-beat program that comes in useful once or twice a year is a program I call Lazerus. Lazerus is fed a kernel crash dump and a configuration file and then starts executing the kernel code in the crash dump from a point defined in the configuration file. The registers are set up according to instructions in the configuration file. There is a SIGSEGV and a SIGBUS handler which use libkvm to page the correct pages from the crash dump into Lazerus's 64-bit address space and then retry the execution. There are many things it can't do as it is running non-privileged, but it runs well enough to be able to watch code flow. I mainly use it to execute complex functions that have well defined actions but are too complex to follow by hand, for example functions that calculate and follow a hash chain, where the hash function is based on some tunable and the hash chains are in the dump. It would take a long time to calculate exactly which hash chain was chosen and where it followed the chain to. With Lazerus you set up the registers - %g7 is the thread pointer, %i6 is the stack pointer and %o1..6 are the input arguments - set the pc and npc and then jump into the dump using setcontext().

So the first thing that happens is that a SIGSEGV is taken against the address in %pc. The handler mmap()s an 8k page of memory at an 8k-aligned address that spans %pc, then uses kvm_read() to suck the contents of that page from the crash dump into the mmap()'ed space. When we return from the signal handler the kernel continues the program at %pc, which now has the instructions copied from the kernel. Soon it will either drop off the end of the page and cause another page to be paged in, or it will branch/jump/call to a new address causing that page to come in. If it accesses data then we will fault against that address and pull it in; straight after pulling a page in we check to see if any of the memory patching directives in the configuration file should be applied to the page.
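
A stripped-down sketch of that fault handler is below. read_from_dump() is a stub standing in for the libkvm read from the crash dump (the real thing would use kvm_open()/kvm_read()), install_handler() is a name invented for this sketch, and the 8K page size follows the text.

--------------------------------------------
#include <signal.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>

#define	PAGESZ	8192UL

/* stub standing in for the kvm_read() of the crash dump */
static size_t
read_from_dump(unsigned long addr, void *buf, size_t len)
{
	(void) addr;
	memset(buf, 0, len);	/* the real version copies kernel pages here */
	return (len);
}

static void
fault_handler(int sig, siginfo_t *sip, void *ucp)
{
	/* round the faulting address down to an 8K boundary */
	unsigned long base = (unsigned long)sip->si_addr & ~(PAGESZ - 1);

	/* map an anonymous, executable page at exactly that address ... */
	void *p = mmap((void *)base, PAGESZ,
	    PROT_READ | PROT_WRITE | PROT_EXEC,
	    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		abort();

	/* ... fill it from the crash dump; patch directives would go here */
	if (read_from_dump(base, p, PAGESZ) != PAGESZ)
		abort();

	/*
	 * Returning from the handler retries the faulting instruction,
	 * which now finds the kernel's text or data at that address.
	 */
}

void
install_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof (sa));
	sa.sa_sigaction = fault_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	(void) sigaction(SIGSEGV, &sa, NULL);
	(void) sigaction(SIGBUS, &sa, NULL);
}
--------------------------------------------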

Version one of Lazerus used a complex method to detect the target of branches/jumps/calls and then used illegal instructions to jump to a handler to get access to the context after every instruction to allow single stepping.

Version two of Lazerus used a much simpler mechanism to perform single stepping. As soon as the 8k page of text was loaded, procfs was used to put an execute watchpoint across the whole 8k page. The option to get the watchpoint signal after the instruction had executed was used, so SIGTRAP was raised after every instruction, allowing the context to be printed out. This worked much better than version one, but some functions in the kernel caused an unexpected SIGSEGV; some investigation showed that the new libthread in Solaris 9 and Solaris 10 does a good job of ensuring that %g7 is set to the userland thread address, causing any kernel function that uses curthread to get the wrong data. Version two is much better but doesn't work.

Buried deep  in my src/C directory is a program that emulates the more common sparc instructions. It has a software register file with 8 windows. It was used to investigate a problem with ld.so.1 which was almost impossible to debug until  dtrace gave us the proc provider.

So the plan is to ditch the model of using a real sparc processor in userland to execute code from a crash dump, and instead to use the software emulator between each step and have it chew through the kernel code. (There are several full-function emulators available that are used to bring up new processors before silicon is ready, but writing your own is always a good way to ensure you understand the instruction set.)

Today's goal, between working on escalations, was to get the code that processes the configuration file into decent shape. Over the years I have refined my option processing code; I tend not to use command lines as these can be forgotten, but to put everything in a configuration file, so tidy code to process this file is essential. The current incarnation of option processing uses a structure to define the options, the type of data, the bounds and if necessary a function to call to act on the option. The actual code is only about 40 lines; the structure does all the work, and hopefully it can be re-used.

Early next week I will get the kvm_read() code and context dumping code running, more then...


Saturday Apr 30, 2005

userland core files.

During the week I collected several escalations that required analysis of userland core files. In each case I was supplied the core file and a showrev -p and asked for a quick analysis. So my choices were:

  • Point our jumpstart configuration scripts at the showrev -p output and install a machine so that I have all the right libraries in the right place. Machines are an expensive resource and it seems a waste to book and occupy a whole machine just for this.
  • Use the shared library checksums in the core file to extract the correct libraries from our patch database and get dbx to use those. I haven't written the scripts to do this yet so it would take a while.
  • Ask the customer for some more files from his machine. The binary that generated the core and then all the shared libraries listed in the output of pldd core. With those I can build a dbxrc that maps the loads of the libraries to use the customer's ones. Once I have these I can debug the core on any machine where dbx is available.

In Solaris 10 coreadm gives you much better control over which parts of the faulting process's address space end up in the core file. The default will include the shared library segments, so core files will be bigger but debugging them won't need all the other files.



Monday Apr 25, 2005

decoding sparc pci addresses

All modern SPARC Sun machines use the PCI bus to connect peripherals to the system. A PCI bus is a physically addressed bus that can best be represented as an inverted tree. Each level of bus is connected to the next using a PCI bridge device. The bridge knows the physical addresses of the devices below itself, so it accepts transactions from the higher bus and passes them down. The root of the tree is the host-to-pci bridge.

This means that physical PCI bus addresses must be unique within a tree of buses but can be duplicated between separate trees, so there must be a unique way of identifying which tree a PCI command is being sent to. Solaris does this by assigning a segment of the SPARC chip's physical address space to each host-to-pci bridge.

If you use prtconf -vp you can find the host-to-pci bus devices and all their properties. On modern SPARC machines look for schizo devices, on slightly older ones look for psycho nodes. You will find the ranges property; this is in 4 segments, one for configuration space, one for i/o space, one for 32-bit memory and one for 64-bit memory. Encoded in each is the base address and the size.

ranges:                              base address                   size
config   00000000.00000000.   00000000.00000402.00000000.   00000000.01000000.
i/o      01000000.00000000.   00000000.00000402.01000000.   00000000.01000000.
mem32    02000000.00000000.   00000000.0000041c.00000000.   00000000.80000000.
mem64    03000000.00000000.   00000000.0000041c.00000000.   00000000.80000000.


So when the PCI bus goes wrong the machine may panic with a PCI Bus Error/BERR, which causes the system to print out a fault address, the AFAR. This is the SPARC physical address that caused the error, so to decide on which bus the fault happened, match the AFAR to a segment supplied by the ranges property of a host-to-pci bridge.

The next problem is to work out which device on that tree of buses was being accessed. First subtract the base of the host-to-pci bus segment we found above to get the PCI bus physical address that is unique on this tree of PCI buses. Then, for each device, add the matching assigned-addresses property and the reg property; using the size from the assigned-addresses property you get a base and length to try and match.


                         type       address              size
assigned-addresses:  82010010.  00000000.08000000.  00000000.01000000...
               reg:  02010010.  00000000.00002000.  00000000.00800000...

sum for register 0x10 = mem32  00000000.08002000.  00000000.01000000
so this mem32 segment runs from PCI physical address 0x08002000 for length 0x01000000

It sounds easy, but I have been writing code to automate this for a couple of days now!
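
Here is a sketch of the first step, matching a panic AFAR against the ranges of each host-to-pci bridge. The base addresses are taken from the example ranges above; the bridge names and the AFAR itself are made up for illustration, and a real tool would read the ranges from prtconf -vp rather than hard-coding them.

--------------------------------------------
#include <stdio.h>

struct range {
	const char *name;		/* which bridge / which space */
	unsigned long long base;	/* SPARC physical base address */
	unsigned long long size;
};

int
main(void)
{
	struct range ranges[] = {
		{ "pci@8,700000 config", 0x40200000000ULL, 0x01000000ULL },
		{ "pci@8,700000 mem32",  0x41c00000000ULL, 0x80000000ULL },
	};
	unsigned long long afar = 0x41c08002345ULL;	/* made-up example AFAR */

	for (int i = 0; i < 2; i++) {
		if (afar >= ranges[i].base &&
		    afar < ranges[i].base + ranges[i].size) {
			/*
			 * The offset within the range is the PCI bus physical
			 * address, to be matched against assigned-addresses.
			 */
			printf("%s: bus address 0x%llx\n", ranges[i].name,
			    afar - ranges[i].base);
		}
	}
	return (0);
}
--------------------------------------------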


