The ISP card and driver work by having the driver poke a 32 bit
quantity, the token, into a mailbox register on the card. The token
uniquely identifies the particular IO that is being done. When the
card has completed the IO it loads the same 32 bit token into a
mailbox register that the driver can then read and map back to the
IO that was being done so that it can be completed.
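To make that round trip concrete, here is a minimal sketch in C. It is not the ISP driver's code: the table size and the names io_start and io_complete are made up for illustration, and the mailbox registers themselves are left out. It only shows the driver remembering a token when an IO starts and mapping the token back when the card returns it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTSTANDING 8       /* hypothetical limit for this sketch */

struct io_request {
    uint32_t token;             /* value poked into the card's mailbox */
    int      in_flight;         /* nonzero while the card owns the IO */
};

static struct io_request outstanding[MAX_OUTSTANDING];

/* Record an IO under the given token; returns the token, or 0 if full. */
uint32_t io_start(uint32_t token)
{
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (!outstanding[i].in_flight) {
            outstanding[i].token = token;
            outstanding[i].in_flight = 1;
            return token;
        }
    }
    return 0;                   /* 0 is never a valid token in this sketch */
}

/*
 * Map a token read back from the card to its IO and mark it done.
 * Returns NULL if the token is 0 or matches no outstanding IO.
 */
struct io_request *io_complete(uint32_t token)
{
    if (token == 0)
        return NULL;
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        if (outstanding[i].in_flight && outstanding[i].token == token) {
            outstanding[i].in_flight = 0;
            return &outstanding[i];
        }
    }
    return NULL;
}
```

A NULL return from io_complete corresponds to the two bad cases a driver can see here: a token of 0, or a token that maps to no outstanding IO.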

For some years I have been taking escalations
where the system would crash and the crash dump showed various
problems with the value that the driver had read back from that
mailbox register. It would very occasionally be 0 or sometimes it
did not map to a known IO that was outstanding. Invariably the
problems were one-offs. The systems would fail just once. Sometimes
the card would be replaced, but since the failures were never seen
again even where the card was not replaced, I began to doubt that
replacing it was a good course of action.

The initial attempt to fix this used the fact that in the 64 bit
port of Solaris routines had been used to map the 32 bit values put
into the mailbox registers into 64 bit values which then pointed to
the real data structures. These routines were hardened to include
parity bits so that any corruption could be detected, including the
return of a 0. Then for my sins I got to port these changes back to
Solaris 7 and 8.
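As an illustration of how parity can catch a corrupted or zero token, here is a small sketch in C. The real routines are not shown in this post, so everything here is an assumption: the names are hypothetical, and a single odd-parity bit in the top bit of the token stands in for whatever parity layout was actually used.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if v has an odd number of set bits, else 0. */
int odd_parity(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (int)(v & 1u);
}

/*
 * Build a token from a 31 bit id. The top bit is chosen so that the
 * whole token has an odd number of set bits, which means an all-zero
 * token can never be valid.
 */
uint32_t token_encode(uint32_t id)
{
    id &= 0x7fffffffu;
    if (!odd_parity(id))
        id |= 0x80000000u;
    return id;
}

/* Return 1 and store the id if the parity is good, otherwise return 0. */
int token_decode(uint32_t token, uint32_t *id)
{
    if (!odd_parity(token))
        return 0;
    *id = token & 0x7fffffffu;
    return 1;
}
```

One parity bit only catches an odd number of flipped bits, so a scheme with several parity bits would detect more. The point of the sketch is just that a returned 0 fails the check, as does any single bit flip.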

However, the problems persisted: a very low volume of systems were
still crashing, but now we had a much better diagnostic view. We now
knew that the token that was being returned was "good" in that
it had good parity, but the systems still crashed because the driver
thought that the IO had already been completed. Sometimes good
programming can make diagnosis harder, and this was another case where
it did. Looking back through the internal structures of the ISP
driver you can see all the tokens that it has used in the past. So
perhaps, if it had used the token before, either the driver or the ISP
card could have been confused, and there might be some smoking gun
which would lead us to the solution.

Alas, when I looked I saw the same tokens being used over and over
again; sometimes there would be just one token! I had been knocked
back by the excellent kmem_cache_alloc: the tokens were generated in
the constructor, after the structure was allocated, and so when the
structure was freed it was returned to the cache and the token was
not touched.
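The effect is easy to reproduce with a toy object cache in user space. This sketch does not use the Solaris kmem interfaces. The free list, the names and the token counter are all invented for illustration. It only mimics the one property that matters here: the constructor runs when an object is first created, not each time it is handed out again.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct cmd {
    uint32_t    token;          /* assigned once, in the constructor */
    struct cmd *next_free;      /* link used while sitting in the cache */
};

static struct cmd *free_list;   /* constructed objects waiting for reuse */
static uint32_t next_token = 1; /* hypothetical token source */

/* Runs once per object, when it is first created. */
static void cmd_constructor(struct cmd *c)
{
    c->token = next_token++;
}

/* Hand out a cached object if there is one, else construct a new one. */
struct cmd *cmd_cache_alloc(void)
{
    struct cmd *c = free_list;

    if (c != NULL) {
        free_list = c->next_free;
        return c;               /* reused as-is, token untouched */
    }
    c = malloc(sizeof(*c));
    if (c != NULL)
        cmd_constructor(c);
    return c;
}

/* Return an object to the cache in its constructed state. */
void cmd_cache_free(struct cmd *c)
{
    c->next_free = free_list;
    free_list = c;
}
```

With a light load the same object is allocated and freed repeatedly, so the history of tokens is the same value again and again, which is why it gave no clue about a replayed token.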

I discussed this with colleagues and we mulled over various
solutions before realising there was a much better way to map the
tokens to the structures. As with so many solutions, once we had it it
was obvious.

The driver has to keep a list of the IOs that are in the chip so that
if the chip is reset it can complete them with an error. Since that
list is a fixed size, it is held in an array. If we use the array
offset as part of the token, we then have some spare bits in the
token where we can write a 16 bit, per array offset, sequence number,
making each token close to unique. Now we can spot a token that gets
replayed or corrupted (we keep an XOR of the token as well, as there
happens to be another 8 bit register available on the ISP card).
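Here is a sketch in C of a token built that way. The layout is an assumption made for the example, not the driver's actual format: the array offset sits in the low 16 bits, the per-slot sequence number in the high 16 bits, and the 8 bit check value is the XOR of the token's four bytes. The array size and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS 256              /* hypothetical size of the fixed array */

struct slot {
    uint16_t seq;               /* bumped each time the slot is used */
    int      busy;              /* nonzero while the IO is in the chip */
};

static struct slot slots[NSLOTS];

/* Claim slot idx and build its token: sequence number high, offset low. */
uint32_t token_make(uint16_t idx)
{
    struct slot *s;

    if (idx >= NSLOTS)
        return 0;
    s = &slots[idx];
    if (++s->seq == 0)          /* skip 0 so a token is never all zero */
        s->seq = 1;
    s->busy = 1;
    return ((uint32_t)s->seq << 16) | idx;
}

/* Fold the four bytes of the token into an 8 bit XOR check value. */
uint8_t token_check(uint32_t token)
{
    return (uint8_t)(token ^ (token >> 8) ^ (token >> 16) ^ (token >> 24));
}

/* Return the slot index for a good token, or -1 if corrupt or replayed. */
int token_complete(uint32_t token, uint8_t check)
{
    uint16_t idx = (uint16_t)(token & 0xffffu);
    uint16_t seq = (uint16_t)(token >> 16);

    if (token_check(token) != check)
        return -1;              /* token and check value disagree */
    if (idx >= NSLOTS)
        return -1;              /* offset outside the array */
    if (!slots[idx].busy || slots[idx].seq != seq)
        return -1;              /* already completed, or an old token */
    slots[idx].busy = 0;
    return idx;
}
```

Because the sequence number changes every time a slot is reused, a token returned a second time no longer matches the slot, even though the offset is the same and the token itself is well formed.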

All of that made it into Solaris 10 and was backported into the latest
patches for 8 and 9.

For some reason the qus driver is almost identical to the ISP
driver, so the same changes had to be made to that driver as well.
Last night I finally finished the putback, so with luck patches for
10, 9 and 8 will be available soon.