lrand48() is an excellent function for generating test cases.

Spending a lot of time writing test cases to reproduce system panics has led me to an interesting(ish) methodology. You stare at the data structures in the dump, then you stare at the code to see if you can work out how to get things into a similar state. Then comes the difficult bit: I've taken to working out which operations are possible from userland code and then randomising them using lrand48().
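As a rough illustration of that last step, here is a minimal sketch in C of choosing the next operation with lrand48(). The operation names and the helpers seed_ops(), pick() and next_op() are made up for this example and are not from the original test program. The one thing worth copying is seeding with a known value, so the same sequence of operations can be replayed later.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>

/* Hypothetical operation set; the names are illustrative only. */
enum op { OP_POLL, OP_CLOSE, OP_EXIT, OP_SPAWN, OP_COUNT };

/* Seed once with a known value so a run can be replayed exactly. */
static void seed_ops(long seed)
{
    srand48(seed);
}

/* lrand48() returns a non-negative long; reduce it to [0, nchoices). */
static int pick(int nchoices)
{
    assert(nchoices > 0);
    return (int)(lrand48() % nchoices);
}

/* Choose the next operation for the test to perform. */
static enum op next_op(void)
{
    return (enum op)pick(OP_COUNT);
}
```

Logging the seed before the run starts matters here, because a machine that has panicked is in no position to tell you afterwards which seed it was using.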

So this week's exercise has been to reproduce a panic in the poll() code that, from code inspection, is impossible. The file table (indexed by file descriptor) is per process, but the poll structures are per lwp, so a file entry has a linked list of interested threads attached to it whenever any threads are polling on that entry. In the dump there are 3 threads chained off one file entry, so we know that we have a multi-threaded process performing poll() on a single file descriptor from several threads at once. We panicked in close as we traversed that list, because one of the threads had been reused by a process that does no polling, so its per-thread poll structures were null. So now we know that the process has threads exiting and threads closing the file that we are polling on.

So I wrote a threaded program that opened a net connection and then went into a loop in which it randomly started new threads. Those threads then possibly polled on that connection, or possibly exited, or possibly closed the connection. The main thread dealt with all of this, re-opening the connection if it got closed and starting new threads as old ones exited, all under the choice of lrand48().
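A cut-down sketch of that kind of program might look like the following. It is not the original test: it uses a pipe instead of a net connection so that it is self-contained, it starts threads in small batches and joins each batch rather than replacing threads as they exit, and every name in it (worker(), run_stress(), shared_fd, the ACT_ values) is invented for the example. All the random choices are made in the main thread, so lrand48() is never called concurrently.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

enum action { ACT_POLL, ACT_CLOSE, ACT_EXIT, ACT_COUNT };
enum { BATCH = 8 };              /* most threads started per round */

static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_fd = -1;       /* descriptor the workers poll on; -1 once closed */

/* Each worker carries out the one action the main thread chose for it. */
static void *worker(void *arg)
{
    enum action act = (enum action)(intptr_t)arg;
    int fd;

    pthread_mutex_lock(&fd_lock);
    fd = shared_fd;
    if (act == ACT_CLOSE)
        shared_fd = -1;          /* claim the close so it happens only once */
    pthread_mutex_unlock(&fd_lock);

    if (fd < 0 || act == ACT_EXIT)
        return NULL;
    if (act == ACT_CLOSE) {
        close(fd);               /* outside the lock, so it can race with pollers */
        return NULL;
    }
    struct pollfd p = { .fd = fd, .events = POLLIN };
    (void)poll(&p, 1, 10);       /* short timeout keeps the run bounded */
    return NULL;
}

/* Returns the number of threads started, or -1 if pipe() fails. */
static int run_stress(long seed, int rounds)
{
    pthread_t tids[BATCH];
    int wr = -1, started = 0;

    assert(rounds > 0);
    srand48(seed);
    for (int r = 0; r < rounds; r++) {
        pthread_mutex_lock(&fd_lock);
        if (shared_fd < 0) {     /* a worker closed it: open a fresh one */
            int p[2];
            if (wr >= 0)
                close(wr);
            wr = -1;
            if (pipe(p) != 0) {
                pthread_mutex_unlock(&fd_lock);
                return -1;
            }
            shared_fd = p[0];
            wr = p[1];
        }
        pthread_mutex_unlock(&fd_lock);

        int want = 1 + (int)(lrand48() % BATCH), made = 0;
        for (int i = 0; i < want; i++) {
            intptr_t act = (intptr_t)(lrand48() % ACT_COUNT);
            if (pthread_create(&tids[made], NULL, worker, (void *)act) == 0)
                made++;
        }
        for (int i = 0; i < made; i++)
            pthread_join(tids[i], NULL);
        started += made;
    }
    if (shared_fd >= 0)
        close(shared_fd);        /* all workers joined, no lock needed */
    shared_fd = -1;
    if (wr >= 0)
        close(wr);
    return started;
}
```

Calling run_stress(seed, rounds) from main() with a logged seed gives a repeatable sequence of choices, although the thread interleaving itself will still vary from run to run.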

Did this reproduce it? No. So then I randomised the number and contents of the pollfd array passed to poll(), and suddenly the machine panicked with a stack trace identical to the one from the customer's machine: the power of lrand48().
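The post does not say exactly which fields were randomised, so the following is only a guess at what that step could look like. The helper random_pollfds(), its parameters, the limit of 64 on random descriptor numbers and the three-way split between ignored, target and arbitrary descriptors are all assumptions made for the sketch.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <poll.h>
#include <stdlib.h>

/* Fill fds[0..n-1] with random entries and return n, where 1 <= n <= max. */
static nfds_t random_pollfds(struct pollfd *fds, nfds_t max, int target_fd)
{
    const long mask = POLLIN | POLLOUT | POLLPRI;
    nfds_t n;

    assert(fds != NULL && max > 0);
    n = 1 + (nfds_t)(lrand48() % (long)max);
    for (nfds_t i = 0; i < n; i++) {
        switch (lrand48() % 3) {
        case 0:                  /* poll() ignores entries with a negative fd */
            fds[i].fd = -1;
            break;
        case 1:                  /* the descriptor under test */
            fds[i].fd = target_fd;
            break;
        default:                 /* an arbitrary small fd, which may not be open */
            fds[i].fd = (int)(lrand48() % 64);
            break;
        }
        fds[i].events = (short)(lrand48() & mask);
        fds[i].revents = 0;
    }
    return n;
}
```

A polling thread would then call something like n = random_pollfds(fds, 16, fd) followed by poll(fds, n, 10), so that the descriptor of interest can turn up several times, once, or not at all in any given call.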

The good news is that it is fixed in Solaris 10 already...

Comments:

When I wrote the Perl wrapper for libexacct I did exactly the same thing - I wrote a Perl script that used the new exacct Perl module to create random-length exacct-format files of randomly nested exacct records containing random numbers of random-length records of random data types, then read them back in and compared them to what they should be. Found a number of bugs as a result. I even fixed some of them too ;-)

Posted by Alan Burlison on March 20, 2006 at 04:14 PM GMT+00:00 #
