Hardening the ISP driver
By user12625760 on Apr 22, 2005
The ISP card and driver work by having the driver poke a 32 bit quantity, the token, which uniquely identifies the particular IO being done, into a mailbox register on the card. When the card has completed the IO it loads the same 32 bit token into a mailbox register, which the driver reads and maps back to the IO so that it can be completed.
For some years I have been taking escalations where the system would crash and the crash dump showed various problems with the value that the driver had read back from that mailbox register. Very occasionally it would be 0; sometimes it did not map to any IO that was outstanding. Invariably the problems were one-offs. The systems would fail just once. Sometimes the card would be replaced, but since the failures were never seen again even where the card was not replaced, I began to doubt that replacing it was a good course of action.
The initial attempt to fix this used the fact that the 64 bit port of Solaris had introduced routines to map the 32 bit values put into the mailbox registers to 64 bit values, which then pointed to the real data structures. These routines were hardened to include parity bits so that any corruption could be detected, including the return of a 0. Then for my sins I got to port these changes back to Solaris 7 and 8.
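To make the parity idea concrete, here is a minimal user-space sketch. It is not the real ISP driver code: the table size, the names (token_encode, token_valid, token_lookup) and the choice of a single odd-parity bit in bit 31 are all assumptions made for illustration. The point it shows is that with odd parity an all-zero token can never be valid, and any single flipped bit is caught before the table lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256u            /* hypothetical number of outstanding IOs */

static void *token_table[TABLE_SIZE];  /* 32 bit token -> 64 bit pointer */

/* Count the set bits in a 32 bit word. */
static uint32_t popcount32(uint32_t v)
{
    uint32_t n = 0;
    while (v != 0) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Build a token from a table slot: bit 31 is chosen so that the
 * total number of set bits is odd, so 0 is never a valid token. */
uint32_t token_encode(uint32_t slot)
{
    assert(slot < TABLE_SIZE);
    uint32_t parity = (popcount32(slot) & 1u) ^ 1u;
    return slot | (parity << 31);
}

/* A token is valid only with odd parity and an in-range slot. */
int token_valid(uint32_t token)
{
    return (popcount32(token) & 1u) == 1u &&
        (token & 0x7fffffffu) < TABLE_SIZE;
}

/* Map a token read back from the card to its structure, or NULL. */
void *token_lookup(uint32_t token)
{
    if (!token_valid(token))
        return NULL;
    return token_table[token & 0x7fffffffu];
}
```

A single parity bit only catches an odd number of flipped bits, so a real implementation would presumably want something stronger.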
However the problems persisted: a very low volume of systems crashing, but now we had a much better diagnostic view. We now knew that the token that was being returned was “good” in that it had good parity, but the systems still crashed because the driver thought that the IO had already been completed. Sometimes good programming can make diagnosis harder and this was another case where it did. Looking back through the internal structures of the ISP driver you can see all the tokens that it has used in the past, so if a token had been used before, either the driver or the ISP card could have been confused, and there might be some smoking gun which would lead us to the solution.
Alas, when I looked I saw the same tokens being used over and over again; sometimes there would be just one token! I had been knocked back by the excellent kmem_cache_alloc: the tokens were generated in the constructor, when the structure was first allocated, so when the structure was freed it was returned to the cache with its token untouched, and the next allocation handed the same token out again.
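The effect is easy to reproduce with a toy object cache. The sketch below is a user-space stand-in, not the Solaris kmem interface: toy_cache_alloc, toy_cache_free, struct cmd and the token counter are all hypothetical names. It only mimics the one property that matters here, namely that the constructor runs when an object is first created and not on every allocation.

```c
#include <stdint.h>
#include <stdlib.h>

struct cmd {
    uint32_t token;          /* assigned once, in the constructor */
    struct cmd *next_free;   /* free list linkage */
};

static struct cmd *free_list;
static uint32_t next_token = 1;

/* Runs only when a brand new object is created. */
static void cmd_constructor(struct cmd *c)
{
    c->token = next_token++;
}

/* Hand out a cached object if there is one, else construct a new one. */
struct cmd *toy_cache_alloc(void)
{
    struct cmd *c = free_list;
    if (c != NULL) {
        free_list = c->next_free;   /* reuse: token is left as it was */
        return c;
    }
    c = malloc(sizeof (*c));
    if (c != NULL)
        cmd_constructor(c);
    return c;
}

/* Return the object to the cache in its constructed state. */
void toy_cache_free(struct cmd *c)
{
    c->next_free = free_list;
    free_list = c;
}
```

On a lightly loaded system that allocates and frees one command at a time, every IO ends up with the same token, which is why the history in the crash dump told us nothing.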
I discussed this with colleagues and we mulled over various solutions before realising there was a much better way to map the tokens to the structures. As with so many solutions, once we had it, it was obvious.
The driver has to keep a list of the IOs that are in the chip so that, if the chip is reset, it can complete them with an error. Since that list is a fixed size, it is an array. If we use the array offset as part of the token, we have some spare bits in the token where we can write a 16 bit, per array offset, sequence number, making each token close to unique. Now we can spot a token that gets replayed or corrupted (we also keep an XOR of the token, as there happens to be another 8 bit register available on the ISP card).
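As a rough sketch of that scheme (again user-space C, not the driver source): the bit layout, the array size and the names slot_issue, slot_complete and token_check are assumptions. The token carries the array offset in the low 16 bits and the per-slot sequence number in the high 16 bits, and the 8 bit check value is taken here to be the XOR of the token's four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 1024u               /* hypothetical size of the fixed array */

struct slot {
    void *cmd;                     /* IO currently in the chip, or NULL */
    uint16_t seq;                  /* per-slot sequence number */
};

static struct slot slots[NSLOTS];

/* Fold the token down to 8 bits for the extra register. */
uint8_t token_check(uint32_t t)
{
    return (uint8_t)(t ^ (t >> 8) ^ (t >> 16) ^ (t >> 24));
}

/* Start an IO in slot idx and return the token to give the card. */
uint32_t slot_issue(uint16_t idx, void *cmd)
{
    assert(idx < NSLOTS);
    struct slot *s = &slots[idx];
    if (++s->seq == 0)             /* skip 0 so a token is never all zero */
        s->seq = 1;
    s->cmd = cmd;
    return ((uint32_t)s->seq << 16) | idx;
}

/* Map a returned token back to its IO; NULL means corrupt or replayed. */
void *slot_complete(uint32_t token, uint8_t check)
{
    uint32_t idx = token & 0xffffu;
    uint16_t seq = (uint16_t)(token >> 16);

    if (token_check(token) != check || idx >= NSLOTS)
        return NULL;               /* corrupted in flight */
    struct slot *s = &slots[idx];
    if (s->cmd == NULL || s->seq != seq)
        return NULL;               /* already completed, or a stale token */
    void *cmd = s->cmd;
    s->cmd = NULL;
    return cmd;
}
```

The sequence number wraps after 65535 uses of a slot, which is why the tokens are only close to unique. A replayed token would have to come back exactly one wrap later to be mistaken for a live one.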
All of that made it into Solaris 10 and was backported into the latest patches for 8 and 9.
For some reason the qus driver is almost identical to the ISP driver, so the same changes had to be made to that driver as well. Last night I finally finished the putback, so with luck patches for 10, 9 and 8 will be available soon.