By tdh on Apr 04, 2007
Well, I finally putback the In-Kernel Sharetab (iks) bits. They will show up in build 62.
I was really impressed by the QA effort undertaken by Helen Chao and John Cui. They added a couple of weeks' worth of work by shaking out some really nasty corner cases. And I liked that when they asked how something should behave, they understood my answer of, "I don't know, you tell me."
What I meant, of course, was go get Solaris 10 bits and tell me how the sharetab implementation currently worked. And then tell me how my new code differed from the old code. And finally, create a test case such that we never have a regression creep in. I've been involved with other QA engineers who couldn't do this part of the job - they didn't understand that QA wasn't just about testing what a developer had changed, but also about understanding how customers use our product and being their advocate to development.
It is real easy to chug away at code and find panics. It is much harder to find behavioral changes. That requires you to understand the baseline cases.
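To make the baseline idea concrete, here is a minimal sketch of a behavioral comparison. It is not the test suite Helen and John wrote. It assumes you have captured the sharetab contents from a baseline system and from a system running the new code, and that each line is tab-separated with the shared path in the first field. The function names parse_sharetab and behavioral_diff are hypothetical, made up for this illustration.

```python
def parse_sharetab(text):
    """Parse sharetab-style text into a dict keyed by the shared path.

    Assumption: one entry per line, tab-separated, path in the first field.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        entries[fields[0]] = tuple(fields[1:])
    return entries


def behavioral_diff(baseline_text, candidate_text):
    """Report how the candidate's share table differs from the baseline's."""
    base = parse_sharetab(baseline_text)
    cand = parse_sharetab(candidate_text)
    common = set(base) & set(cand)
    return {
        "missing": sorted(set(base) - set(cand)),   # in baseline only
        "extra": sorted(set(cand) - set(base)),     # in candidate only
        "changed": sorted(p for p in common if base[p] != cand[p]),
    }
```

An empty result in all three lists means the two systems agree for that scenario. Anything else is a behavioral change that someone has to explain, even if nothing panicked.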
I also had to battle my way past my CRT advocate - Spencer Shepler. There were several panics on test systems in the two weeks before I could formally ask to putback. In one case, I was able to show that the panic occurred in code that did not have any iks changes. That one cost me a week of time. But more frustrating was the second panic, a memory exhaustion. We had two cores - one after the iks code had been run and one before the iks code had started to run. Both cores had the same signatures. I argued that since one occurred before the iks code had been loaded, the problem wasn't with that code. Oh, and this only happened on one machine in the whole company.
I would say I know it isn't the iks code, but I can't formally prove it. I lost another week with this issue. I learned a lot about looking at core files. One of the other key ways Helen was able to help me out was that she realized a testing statement she had made was incorrect. That had sent us looking in the wrong area of code. She also thought of a really sweet way to help isolate whether the problem was in the iks code or already present in the kernel. At all times, I felt she was fully engaged in understanding the problem and finding the root cause. I've had many QA engineers who felt that once they filed a bug report, they could go on to the next problem.
And then Bill Baker had to ruin it all by identifying a matching ZFS bug and finding the signature in the cores we had our hands on. Bill is another developer (and much more), who took his own time to comb through the core to help identify whether it was iks or something else. He knew how frustrated I was and was able to correlate another bug report which was in his inbox to what I was seeing. A quick deployment of that fix and we were able to show that it fixed the issue we were hitting.
The point of all of this is that Sun is deadly serious about the quality of its kernel. I don't think that everyone gets that when they look at OpenSolaris and compare development models with the way Linux proceeds. Sun kernel engineers (developers and QA) do not want to release control of the quality of the code. They take pride not only in their work, but in the processes which protect their investments.
This serious commitment to quality was what brought me to Sun. That and the chance to help develop OpenSolaris.