Tuesday Jun 26, 2007

OpenSolaris Project Models and pNFS

I believe that the majority of OpenSolaris development occurs within Sun Microsystems Engineering. As much as we would like for it to snowball in the wild, that has not happened. I'm saying this from my biased view; I know some projects have been proposed from outside Sun, e.g., the i18n port of the closed library. I also acknowledge the work that Dennis Clark is leading for the PPC port. There are more and I am not trying to take away from them. I am relating my experience with trying to get projects off the ground on OpenSolaris - see for example OpenSolaris Project: NFS Server in non-Global Zones.

So what does happen is that a new project gets started and there is no external indication of forward progress. People might start asking for code drops and the reality is that because of the huge internal pressure towards quality in Sun Engineering, that is not going to happen until the code has baked a bit. It gets to the point that a prime question on new project proposals is whether code will be released. Again, there isn't some hidden agenda within Sun to withhold the code - we are just new to this model and we want things to be perfect, not just good enough.

Look back at the discussion that went on for Project Proposal -- Honeycomb Information and dev tools and the lack of a code drop. The OpenSolaris Project: HoneyComb Fixed Content Storage already shows a binary drop and plans for a code drop in the Fall of 2007. Some valid reasons for a group to not drop code right away are that they do not understand the process (they need someone to help them) and they need to clear a legal hurdle to make sure that they are not violating the rights of either an individual or a company. I've seen both occur internally. The good news is that we have internal people ready and willing to help development groups.

What I find really exciting are projects that have a significant external presence. And sometimes that external pressure doesn't contribute directly to the code work. In NFSv4 and NFSv4.1, the external collaboration takes place through the IETF and Connectathon. Both companies and open source developers come together to design and implement future NFS protocol extensions. Interoperability across multiple OS platforms is ensured via the yearly meetings at Connectathon. And with the UMICH CITI developers working on Projects: NFS Version 4 Open Source Reference Implementation, which is mainly distributed to Linux, but forms a reference for both BSD directly and OSX indirectly, and Sun working on OpenSolaris, it is possible for vendors to do compatibility testing all year long.

Take for example NetApp, which provides only an NFS server. They are able to test new NFSv4.1 features against Linux and OpenSolaris clients. Admittedly this isn't new; NetApp was able to use the Solaris 10 beta code to test NFSv4. And the companies in question all sign NDAs and exchange hardware and engineering drops of binaries for testing.

So there is almost no work being driven from OpenSolaris into this open design project. There is an OpenSolaris Project: NFS version 4.1 pNFS, but it is mainly a portal to the Sun NFS team's work. A question that they asked themselves was whether they were going to do binary drops, code drops, or any drop at all. It wasn't a legal issue: the design is done in the open and all of the coding is new development. It wasn't a fear of the unknown: they had already shared binaries in the past. No, rather it was a concern about the impact that providing a drop would have on the development schedule. Would the overhead of publishing code and/or binaries kill the final deliverable?

Another OpenSolaris reality is that Sun expects to make money. I know that is an evil concept to some open source developers, but we bet the company on being able to deliver quality and sell service along with the source. So making the deadline for the pNFS deliverable is a major concern for the group.

I'm happy that the group decided that they could both deliver on time and make code and binary drops. Lisa just announced for the group the latest drop in FYI: pNFS Code and BFU Archives posted. You can check out the b66 implementation by downloading it. The code is rough in the sense that you wouldn't want to put it in production, but it gives other developers a chance to see what is going on and allows them to test their own implementations. Remember, this code has not been putback into Nevada - it lives in a group workspace. Before OpenSolaris, it would have only been shared under NDA and the expectation that the person installing the code assumed responsibility for any problems.

Project development in OpenSolaris is different than that occurring in other open source communities. There are different hurdles to jump, but there are different expectations as well. Internal developers are proud of the quality that they demand of the code and want to keep that bar high. That in turn makes early code drops hard for them to deliver. It is something they are learning to do. And the pNFS team is leading the way.

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Wednesday Apr 04, 2007

Putback of In-Kernel Sharetab

Well, I finally putback the In-Kernel Sharetab (iks) bits. They will show up in build 62.

I was really impressed by the QA effort undertaken by Helen Chao and John Cui. They added a couple of weeks' worth of work by shaking out some really nasty corner cases. And I liked that when they asked how something should behave, they understood my answer of, "I don't know, you tell me."

What I meant of course was go get Solaris 10 bits and tell me how the sharetab implementation currently worked. And then tell me how my new code differed from the old code. And finally, create a test case such that we never have a regression creep in. I've been involved with other QA engineers who couldn't do this part of the job - they didn't understand that QA wasn't just testing what a developer had changed, but also understanding how customers use our product and being their advocate to development.

It is real easy to chug away at code and find panics. It is much harder to find behavioral changes. That requires you to understand the baseline cases.

I also had to battle my way past my CRT advocate - Spencer Shepler. There were several panics on test systems the two weeks before I could formally ask to putback. In one case, I was able to show that the panic occurred in code that did not have any iks changes. That one cost me a week of time. But more frustrating was the second panic, a memory exhaustion. We had two cores - one after the iks code had been run and one before the iks code had started to run. They had the same signatures in the cores. I argued that since one occurred before the iks code had been loaded, the problem wasn't with that code. Oh, and this only happened on one machine in the whole company.

I would say I know it isn't the iks code, but I can't formally prove it. I lost another week with this issue. I learned a lot about looking at core files. One of the other key ways Helen was able to help me out was that she realized a testing statement she had made was incorrect. That had us looking in the wrong area of code. She also thought of a really sweet way to help isolate whether the problem was in the iks code or already present in the kernel. At all times, I felt she was fully engaged in understanding the problem and finding the root cause. I've had many QA engineers who felt once they filed a bug report, they could go to the next problem.

And then Bill Baker had to ruin it all by identifying a matching ZFS bug and finding the signature in the cores we had our hands on. Bill is another developer (and much more), who took his own time to comb through the core to help identify whether it was iks or something else. He knew how frustrated I was and was able to correlate another bug report which was in his inbox to what I was seeing. A quick deployment of that fix and we were able to show that it fixed the issue we were hitting.

The point of all of this is that Sun is deadly serious about the quality of its kernel. I don't think that everyone gets that when they look at OpenSolaris and compare development models with the way Linux proceeds. Sun kernel engineers (developers and QA) do not want to release control of the quality of the code. They take pride not only in their work, but in the processes which protect their investments.

This serious commitment to quality was what brought me to Sun. That and the chance to help develop OpenSolaris.

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily


