Wednesday Jun 01, 2005

FISL Day 1

So the first day of FISL has come to a close. I have to say it went better than expected, based on the quality of questions posed by the audience and visitors to the Sun booth. If today is any indication, my voice is going to completely gone by the end of the conference. I started off the day with a technical overview of Solaris 10/OpenSolaris. You can find the slides for this presentation here. Before taking too much credit myself, the content of these slides are largely based off of Dan's USENIX presentation (thanks Dan!). This is a whirlwind tour of Solaris features - three slides per topic is nowhere near enough. Each of the major topics has been presented many times as a standalone 2-hour presentation, so you can imagine the corners I have to cut to cover them all.

My presention was followed by a great OpenSolaris overview from Tom Goguen. His summary of the CDDL was one of the best I've ever seen - it was the first time I've seen an OpenSolaris presentation without a dozen questions about GPL, CDDL, and everybody's favorite pet license. Dave followed up with a detailed description of how Solaris is developed today and where we see OpenSolaris development heading in the future. All in all, we managed to cram 10+ hours of presentations into a measley 3 1/2 hours. For those of you who still have lingering questions, please stop by the Sun booth and chat with us about anything and everything. We'll be here all week

After retiring to the booth, we had several great discussions with some of the attendees. The highlight of the day was when Dave was talking to an attendee about SMF (and the cool GUI he's working on) and I was feeling particularly bored. Since my laptop was hooked up to the monitor in the "community theater", I decided to play around with some DTrace scripts to come up with a cool demo. Within three minutes I had 4 or 5 people watching what I was doing, so I decided to start talking about all the wonders of DTrace. The 4 or 5 people quickly turned into 10 or 12, and pretty soon I found myself in the middle of a 3 hour mammoth DTrace demo, from which my voice is still recovering. This brings us to the major thing I learned today:

"If you DTrace it, they will come"

Technorati tags:

Thursday Apr 14, 2005

Designing for Failure

In the last few weeks, I've been completely re-designing the ZFS commands from the ground up1. When I stood back and looked at the current state of the utilities, several glaring deficiencies jumped out at me2. I thought I'd use this blog entry to focus on one that near and dear to me. Having spent a great deal of time with the debugging and observability tools, I've invariably focused on answering the question "How do I diagnose and fix a problem when something goes wrong?". When it comes to command line utilities, the core this problem is in well-designed error messages. To wit, running the following (former) ZFS command demonstrates the number one mistake when reporting error messages:

    # zfs create -c pool/foo pool/bar
    zfs: Can't create pool/bar: Invalid argument

The words "Invalid argument" should never appear as an error message. This means that at some point in the software stack, you were able to determine there was a specific problem with an argument. But in the course of passing that error up the stack, any semantic information about the exact nature of the problem has been reduced to simply EINVAL. In the above case, all we know is that one of the two arguments was invalid for some unknown reason, and we have no way of knowing how to fix it. When choosing to display an error message, you should always take the following into account:

An error message must clearly identify the source of the problem in a way that that the user can understand.

An error message must suggest what the user can do to fix the problem.

If you print an error message that the administrator can't understand or doesn't suggest what to do, then you have failed and your design is fundamentally broken. All too often, error semantics are given a back seat during the design process. When approaching the ZFS user interface, I made sure that error semantics were a fundamental part of the design document. Every command has complete usage documentation, examples, and every possible error message that can be emitted. By making this part of the design process, I was forced to examine every possible error scenario from the perspective of an administrator.

A grand vision of proper failure analysis can be seen in the Fault Management Architecture in Solaris 10, part of Predictive Self Healing. A complete explanation of FMA and its ramifications is beyond the scope of a single blog entry, but the basic premise is to move from a series of unrelated error messages to a unified framework of fault diagnosis. Historically, when hardware errors would occur, an arbitrary error message may or may not have been sent to the system log. The error may have been transient (such as an isolated memory CE), or the result of some other fault. Administrators were forced to make costly decisions based on a vague understanding of our hardware failure semantics. When error messages did succeed in describing the problem sufficiently, they invariably failed in suggesting how to fix the problem. With FMA, the sequence of errors is instead fed to a diagnosis engine that is intimately familiar with the characteristics of the hardware, and is able to produce a fault message that both adequately describes the real problem, as well as how to fix it (when it cannot be automatically repaired by FMA).

Such a wide-ranging problem doesn't necessarily compare to a simple set of command line utilities. A smaller scale example can be seen with the Solaris Management Facility. When SMF first integrated, it was incredibly difficult to diagnose problems when they occurred3. The result, after a few weeks of struggle, was one of the best tools to come out of SMF, svcs -x. If you haven't tried this command on your Solaris 10 box, you should give it a shot. It does automated gathering of error information and combines it into output that is specific, intelligible, and repair-focused. During development of the final ZFS command line interface, I've taken a great deal of inspiration from both svcs -x and FMA. I hope that this is reflected in the final product.

So what does this mean for you? First of all, if there's any Solaris error message that is unclear or uninformative that is a bug. There are some rare cases when we have no other choice (because we're relying on an arbitrary subsystem that can only communicate via errno values), but 90% of the time its because the system hasn't been sufficiently designed with failure in mind.

I'll also leave you with a few cardinal4 rules of proper error design beyond the two principles above:

  1. Never distill multiple faults into a single error code. Any error that gets passed between functions or subsystems must be traceable back to a single specific failure.
  2. Stay away from strerror(3c) at all costs. Unless you are truly interfacing with an arbitrary UNIX system, the errno values are rarely sufficient.
  3. Design your error reporting at the same time you design the interface. Put all possible error messages in a single document and make sure they are both consistent and effective.
  4. When possible, perform automated diagnosis to reduce the amount of unimportant data or give the user more specific data to work with.
  5. Distance yourself from the implementation and make sure that any error message makes sense to the average user.

1No, I cannot tell you when ZFS will integrate, or when it will be available. Sorry.

2This is not intended as a jab at the ZFS team. They have been working full steam on the (significantly more complicated) implementation. The commands have grown organically over time, and are beginning to show their age.

3Again, this is not meant to disparage the SMF team. There were many more factors here, and all the problems have since been fixed.

4 "cardinal" might be a stretch here. A better phrase is probably "random list of rules I came up with on the spot".

Thursday Mar 31, 2005

How not to code

This little gem came up in conversation last night, and it was suggested that it would make a rather amusing blog entry. A Solaris project had a command line utility with the following, unspeakably horrid, piece of code:

         \* Use the dynamic linker to look up the function we should
         \* call.
        (void) snprintf(func_name, sizeof (func_name), "do_%s", cmd);
        func_ptr = (int (\*)(int, char \*\*))
                dlsym(RTLD_DEFAULT, func_name);                                                                 

        if (func_ptr == NULL) {
                fprintf(stderr, "Unrecognized command %s", cmd);

        return ((\*func_ptr)(argc, argv));

So when you type "a.out foo", the command would sprintf into a buffer to make "do_foo", and rely on the dynamic linker to find the appropriate function to call. Before I get a stream of comments decrying the idiocy of Solaris programmers: the code will never reach the Solaris codebase, and the responsible party no longer works at Sun. The participants at the dinner table were equally disgusted that this piece of code came out of our organization. Suffice to say that this is much better served by a table:

        for (i = 0; i < sizeof (func_table) / sizeof (func_table[0]); i++) {
                if (strcmp(func_table[i].name, cmd) == 0)
                          return (func_table[i].func(argc, argv));

        fprintf(stderr, "Unrecognized command %s", cmd);

I still can't imagine the original motivation for this code. It is more code, harder to understand, and likely slower (depending on the number of commands and how much you trust the dynamic linker's hash table). We continually preach software observability and transparency - but I never thought I'd see obfuscation of this magnitude within a 500 line command line utility. This prevents us from even searching for callers of do_foo() using cscope.

This serves as a good reminder that the most clever way of doing something is usually not the right answer. Unless you have a really good reason (such as performance), being overly clever will only make your code more difficult to maintain and more prone to error.

Update - Since some people seem a little confused, I thoght I'd elaborate two points. First off, there is no loadable library. This is a library linked directly to the application. There is no need to asynchronously update the commands. Second, the proposed function table does not have to live separately from the code. It would be quite simple to put the function table in the same file with the function definitions, which would improve maintainability and understability by an order of magnitude.

Monday Sep 27, 2004


So my last few posts have sparked quite a bit of discussion out there, appearing on slashdot as well as OSNews. It's been quite an interesting experience, though it's had a significant effect on my work productivity today :-) While I'm not responding to every post, I promise that I'm reading them (and thanks to those of you sending private mail, I promise to respond soon).

I have to say that I've been reasonably impressed with the discussion so far. Slashdot, as usual, leaves something to be desired (even reading at +5), but the comments in my blog and in my email have been for the most part very reasonable. There is a certain amount of typical fanboy drivel (more so on the pro-Linux side, but only because Solaris doesn't have many fanboys). But there's also a reasonable contingent on Slashdot fighting down the baseless arguments of the zealots. In the past, the debate has been rather one-sided. Solaris is usually dismissed as an OS for big computers for people with lots of money. Sun has traditionally let our marketing department do all the talking, which works really well for CEOs and CTOs (our paying customers), but not as well for spreading detailed technical knowledge to the developer community. We're changing our business model - encouraging blogs, releasing Solaris Express, hosting discussions with actual kernel engineers, and eventually open sourcing Solaris - to encourage direct connections with the community at large.

We've been listening to the (often one-sided) discussion for a long time now, and it shows in Solaris. Solaris 10 has killer performance, even on single- and dual-processor x86 machines. Hardware support has been greatly improved (S10 installed on my Toshiba laptop without a hitch). We're focusing on the desktop again, with X.Org integration, Gnome 2.6, Mozilla 1.7, and better open source packages all around. Sure, we're still playing catchup in a lot of ways, but we're learning. I only hope the Linux community can learn from Solaris's strengths, and dismiss many of the Solaris stereotypes that have been implanted (not always without merit) over the course of history. Healthy competition is good, and can only benefit the customer.

As much as I would like to continue this debate forever, I think it's time I get back to doing what I really love - making Solaris into the best OS it can be. I'll probably be focusing on more technical posts for a while, but I'll revive the discussion again at a future point. Until then, feel free to continue posting comments or sending me mail. I do read them, even if I don't respond publicly.

Friday Sep 24, 2004

GPL thoughts and clarifications

So it seems my previous entry has finally started to stir up some controversy. I'll address some of the technical issues raised here shortly. But first I thought I'd clarify my view of the GPL, with the help of an analogy:

Let's say that I manufacture wooden two by fours, and that I want to make them freely available under an "open source" license. There are several options out there:

  1. You have the right to use and modify my 2x4s to your hearts content.

    This is the basis for open source software. It protects the rights of the consumer, but imparts few rights to the developer.

  2. You have the right to use my 2x4s however you please, but if you modify one, then you have to make that modification freely available to the public in the same fashion as the original.

    This gives the developer a few more guarantees about what can and cannot be done with his or her contributions. It protects the developer's rights without infringing on the rights of the consumer.

  3. You have the right to use my 2x4 as-is, but if you decide to build a house with it, then your house must be as freely available as my 2x4.

    This is the provision of the GPL that I don't agree with, and neither do customers that we've talked to. It protects my rights as a developer, but severely limits the rights of the consumer in what can and cannot be done with my public donation.

This analogy has some obvious flaws. Open source software is neither excludable nor rival, unlike the house I just built. There is also a tenuous line between derived works and fair use. In my example, I wouldn't have the right to the furniture put into your house. But I feel like its a reasonable simplification of my earlier point.

As an open source advocate, I would argue that #1 is the "most free". This is why, in many ways, the BSD license is the "most open" of all the main licenses. As a developer, I would argue that #2 is the best solution. My contribution is protected - no one can make changes without giving it back to me(and the community at large). But my code is essentially a service, and I feel everyone should have a right to that service, even if they go off and make money from it.

The problems arise when we get to #3, which is the essential controversy of the GPL. To me, this is a personal choice, which is why GPL advocacy often turns into pseudo-religious fanaticism. In many ways, arguing with a GPL zealot is like an atheist arguing with a religious fundamentalist. In the end, they agree on nothing. The atheist leaves understanding the fundamentalist's beliefs and respects his or her right to have them. The fundamentalist leaves beliving that the atheist will burn in hell for not accepting the one true religion.

This would be fine, except that GPL advocates often blur the line between #2 and #3, and make it seem like the protections of #2 can only be had if you fully embrace the GPL in all its glory. I support the rights provided by #2. You can scream and shout about the benefits of #3 and how it's an inalienable right of all people, but in the end I just don't agree. Don't equate the GPL with open source - if you do want to argue the GPL, make it very clear which points you are arguing for.

One final comment about GPL advocacy. Time and again I see people talk about easing migration, avoiding vendor lockin, and the necessity of consumer choice. But in the same breath they turn around and scream that you must accept the GPL, and any other license would be pure evil (at best, a slow and painful death). Why is it that we have the right to choose everything except our choice of license? I like Linux. I like the GPL. The GPL is not evil. There are a lot of great projects that benefit from the GPL. But it isn't everything to all people, and in my opinion it's not what's best for OpenSolaris.


As has been enumerated in the comments on this post, the original intent of the analogy is to show the the definition of derived works. As mentioned in the comments:

Say I post an example of a function foo() to my website. Oracle goes and uses that function in their software. They make no changes to it whatsover, and are willing to distribute that function in source code form with their product. If it was GPL, they would have to now release all of Oracle under the GPL, even though my code has not been altered. The consumer's rights are preserved - they still have the same rights to my code as before it was put into Oracle. I just don't see why they have a right to code that's not mine.

Though I didn't explain it well enough, the analogy was never intended to encompass right to use, ownership, distribution, or any of the other qualities of the GPL. It is a specific issue with one part of the GPL, and the analogy is intentionally simplistic in order to demonstrate this fact.

Tuesday Sep 07, 2004

Linux Kernel Debugging with KDB

So it's been a while since my KMDB post, but I promised I would do some investigation into kernel debugging on the Linux side. Keep in mind that I have no Linux kernel experience. While I will try to be thorough in my research, there may be things I miss simply from lack of experience or a good test system. Feel free to comment on any errors or omissions.

We'll try to solve the same problem that I approached with KMDB in the last post: a deadlock involving reader-writer locks. Linux has a choice of two debuggers, kdb and kgdb (though User Mode Linux presents interesting possibilities). In this post I'll be taking a look at KDB.

Fire up KDB

Chances are you're not running a Linux kernel with KDB installed. Some distros (like Debian) make it easier to download and apply the patch, but none seems to include it by default (admittedly, I didn't do a very thorough search). This means you'll have to go download the patch, apply it, tweak some kernel config variables (CONFIG_KDB and CONFIG_FRAME_POINTER), recompile/reinstall your kernel, and reboot. Hopefully you've done all this beforehand, because as soon you reboot you've lost your bug (possibly forever - race conditions are fickle creatures). Assuming you were running a kdb-enabled kernel when you hit this bug, you then run:

        # echo "1" > /proc/sys/kernel/kdb

And then press the 'pause' key on your keyboard. Alternatively, you can hook up a serial console, but I'll opt for the easy way out.

Find our troubled thread

First, we need to find the pid our offending process. The only way to do this is to use the 'ps' command to display all processes on the system, and then pick out (visually) which pid belongs to our 'ps' process. Once we have this information, we can then use 'btp <pid>' to get a stack trace.

Get the address of the rwlock

This step is very similar to the one we took when using kmdb. The stack trace produced by 'btp' includes frame pointers like kmdb's $C. Looking back over my kmdb post, it wasn't immediately clear where I got that magic starting number - it came from the frame pointer in the (verbose) stack trace. In any case, we use 'id <addr>' to disassemble the code around our call site. We then use 'mdr <addr+offset>' to examine the memory where the original value is saved. This gets much more interesting (painful) on amd64, where arguments are passed in registers and may not get pushed on the stack until several frames later.

Without a paddle?

At this point, the next step should be "Find who owns the reader lock." But I can't find any commands in the kdb manpages that would help us determine this. Without kmdb's ::kgrep, we're stuck searching for a needle in a haystack. Somewhere on this system, one or more threads have referenced this rwlock in the past. Our only course of action is to try 'bta', which will give us a stack trace of every single process on the system. With a deep understanding of the code, a great deal of persistence, and a little bit of luck, we may be able to pick out the offending stack just by sight. This quickly becomes impractical on large systems, not to mention difficult to verify and prone to error.

With KDB we can do some basic debugging tasks, but it still relies on giant "leaps of faith" to correlate two pieces of seemingly disjoint data (two thread involved in a deadlock, for example). As a point of comparison, KDB provides 40 different commands, while KMDB provides 771 (356 dcmds and 415 walkers on my current desktop). Next week I'll look at kgdb and see if it fills in any of these gaps.

Wednesday Aug 18, 2004

So now Linux is innovative?

I just ran across this interview with Linus. If you've read my blog, you know I've talked before about OS innovation, particularly with regards to Linux and Solaris. So I found this particular Linus quote very interesting:

There's innovation in Linux. There are some really good technical features that I'm proud of. There are capabilities in Linux that aren't in other operating systems. A lot of them are about performance. They're internal ways of doing things in a very efficient manner. In a kernel, you're trying to hide the hard work from the application, rather than exposing the complexity.

This seems to fly in the face of his previous statements, where he claimed that there's no innovation left at the OS level. Some people have commented on my previous blog entry that Linux cannot innovate simply because of its size: with so many hardware platforms to support, it becomes impossible to integrate large projects. What you're left with is "technical features" to improve performance, which (in general) are nothing more than algorithm refinement and clever tricks1.

I also don't buy the "supporting lots of platforms means no chance for innovation" stance. Most of the Solaris 10 innovations (Zones, Greenline, Least Privileges) are completely hardware independent. Those projects which are hardware dependent develop a platform-neutral infrastructure, and provide modules only for those platforms which they suppot (FMA being a good example of this). Testing is hard. It's clear that Linux testing often amounts to little more than "it compiles". Large projects will always break things; even with our stringent testing requirements in Solaris, there are always plenty of "follow on" bugfixes to any project. If Linux (or Linus) is afraid of radical change, then there will be no chance for innnovation2.

As we look towards OpenSolaris, I'm intrigued by this comment by Linus:

I am a dictator, but it's the right kind of dictatorship. I can't really do anything that screws people over. The benevolence is built in. I can't be nasty. If my baser instincts took hold, they wouldn't trust me, and they wouldn't work with me anymore. I'm not so much a leader, I'm more of a shepherd. Now all the kernel developers will read that and say, "He's comparing us to sheep." It's more like herding cats.

Stephen has previously questioned the importance of a single leader for an open source project. In the course of development for OpenSolaris, we've looked at many open source models, ranging from the "benevolent dictatorship" to the community model (and perhaps briefly, the pornocracy). I, for one, question Linus' scalability. Too many projects like crash dumps, kernel debugging, and dynamic tracing have been rejected by Linus largely because Linux debugging is "as good as it needs to be." Crash dumps and kernel tracing are "vendor features," because real kernel developers should understand the code well enough to debug a problem from a stack trace and register dump. I love the Solaris debugging tools, but I know for a fact that there are huge gaps in our capabilities, some of which I hope to fill in upcoming releases. I would never think of Solaris debugging as perfect, or beyond reproach.

So it's time for Space Ghosts's "something.... to think about....":

Is it beneficial to have one man, or one corporation, have the final say in the development of an open source project?

1ZFS is an example of truly innovative performance improvements. Once you chuck the volume manager, all sorts of avenues open up that nobody's ever though about before.

2To be clear, I'm not an ardently anti-Linux. But I do believe that technology should speak for itself, irregardless of philosophical or "religious" issues. Obviously, I'm part of the Solaris kernel group, and I believe that Solaris 10 has significant innovations that no other OS has. But I also have faith in open source and the open development model. If you know of Linux features that are truly innovative (and part of the core operating system), please leave a comment and I'll gladly acknowledge them.

Wednesday Aug 04, 2004

More on OS innovation

The other day on vacation, I ran across a Slashdot article on UNIX branding and GNU/Linux. Tne original article was mildy interesting, to the point where I actually bothered to read the comments. Now, long ago I learned that 99% of Slashdot comments are worthless. Very rarely to you find thoughtful and objective comments; even browsing at +5 can be hazardous to your health. Even so, I managed to find this comment, which contained in part some perspective relating to my previous post on Linux innovation:

I have been saying that for several years now. UNIX is all but dead. The only commercial UNIX likely to still be arround in ten years time as an ongoing product is OS/X. Solaris will have long since joined IRIX, Digital UNIX and VMS as O/S you can still buy and occasionaly see a minor upgrade for it.

There is a basic set of core functions that O/S do and this has not changed in principle for over a decade. Log based file systems, threads that work etc are now standard, but none of this was new ten years ago.

The interesting stuff all takes place either above or below the O/S layer. .NET, J2EE etc are where interesting stuff is happening.

Clearly, this person is not the sharpest tool in the shed when in comes to Operating Systems. But it begs the question: How widespread is this point of view? We love innovation, and it shows in Solaris 10. We have yet to demo Solaris 10 to a customer without them being completely blown away by at least one of the many features. DTrace, Zones, FMA, SMF, and ZFS are but a few reasons why Solaris won't have "joined IRIX, Digital UNIX, and VMS" in a few years.

Part of this is simply that people have yet to experience real OS innovation such as that found in Solaris 10. But more likely this is just a fundamental disconnection between OS developers and end users. If I managed to get my mom and dad running Java Desktop System on Solaris 10, they wouldn't never know what DTrace, Zones, or ZFS is, simply because it's not visible to the end user. But this doesn't mean that it isn't worthwhile innovation: as with all layered software, our innovations directly influence our immediate consumers. Solaris 10 is wildly popular among our customers, who are mostly admins and developers, with some "power users". Even though these people are a quantitatively small portion of our user base, they are arguably the most important. OS innovation directly influences the quality of life and efficiency of developers and admins, which has a cascading effect on the rest of the software stack.

This cascading influence tends to be ignores in arguments over the commoditization of the OS. If you stand at any given layer, you can make a reasonable argument that the software two layers beneath you has become a commodity. JVM developers will argue that hardware is a commodity, while J2EE developers will argue that the OS is a commodity. Even if you're out surfing the web and use a web service developed on J2EE, you're implicitly relying on innovation that has its roots at the OS. Of course, the further you go from the OS the less prominent the influence is, but its still there.

So think twice before declaring the OS irrelevant. Even if you don't use features directly provided by the OS, your quality of life has been improved by having them available to those that do use them.

Tuesday Jul 20, 2004

Is Linux innovative?

In a departure from recent musings on the inner workings of Solaris, I thought I'd examine one of the issues that Bryan has touched on in his blog. Bryan has been looking at some of the larger issues regarding OS innovation, commoditization, and academic research. I thought I'd take a direct approach by examining our nearest competitor, Linux.

Bryan probably said it best: We believe that the operating system is a nexus of innovation.

I don't have a lot of experience with the Linux community, but my impression is that the OS is perceived as a commodity. As a result, Linux is just another OS; albeit one with open source and a large community to back it up. I see a lot of comments like "Linux basically does everything Solaris does" and "Solaris has a lot more features, but Linux is catching up." Very rarely do I see mention of features that blow Solaris (or other operating systems) out of the water. Linus himself has said:

A lot of the exciting work ends up being all user space crap. I mean, exciting in the sense that I wouldn't car [sic], but if you look at the big picture, that's actually where most of the effort and most of the innovation goes.

So Linus seems to agree with my intuition, but I'm in unfamiliar territory here. So, I pose the question:

Is the Linux operating system a source of innovation?

This is a specific question: I'm interested only in software innovation relating to the OS. Issues such as open source, ISV suport, and hardware compatibility are irrelevant, as well as software which is not part of the kernel or doesn't depend on its facilities. I consider software such as the Solaris ptools as falling under the purview of the operating system, because they work hand-in-hand with the /proc filesystem, a kernel facility. Software such as GNOME, KDE, X, GNU tools, etc, are all independent of the OS and not germane to this discussion. I'm also less interested in purely academic work; one of the pitfalls of academic projects is that they rarely see the light of day in a real-world commercial setting. Of course, most innovative work must begin as research before it can be viable in the industry, but certainly proven technologies make better examples.

I can name dozens of Solaris innovations, but only a handful of Linux ones. This could simply be because I know so much about Solaris and so little about Linux; I freely acknowledge that I'm no Linux expert. So are there great Linux OS innovations out there that I'm just not aware of?

Thursday Jul 01, 2004

Real life obfuscated code

In a departure from my usual Solaris propaganda, I thought I'd try a little bit of history. This entry is aimed at all of you C programmers out there that enjoy the novelty of Obfuscated C. If you think you're a real C hacker, and haven't heard of the obfuscated C contenst, then you need to spend a few hours browsing their archives of past winners1.

If you've been reading manpages on your UNIX system, you've probably been using some form of troff2. This is an early typesetting language processor, dating back to pre-UNIX days. You can find some history here. The nroff and troff commands are essentially the same; they are built largely from the same source and differ only in their options and output formats.

The original troff was written by Joe F. Ossanna in assembly language for the PDP-11 in the early 70s. Along came this whizzy portable language known as C, so Ossana rewrote his formatting program. However, it was less of a rewrite and more of a direct translation of the assembly code. The result is a truly incomprehensible tangle of C code, almost completely uncommented. To top it off, Ossana was tragically killed in a car accident in 1977. Rumour has it that attempts were made to enhance troff, before Brian Kernighan caved in and rewrote it from scratch as ditroff.

If you're curious just how incomprehensible 7000 lines of uncommented C code can be, you can find a later version of it from The Unix Tree, an invaluable resource for the nostalgic among us. To begin with, the files are named n1.c, n2.c, etc. To quote from 'n6.c':

	register i,\*j,k;
	extern int chtab[];

	if((i = getrq()) == 0)return(0);
	for(j=chtab;\*j != i;j++)if(\*(j++) == 0)return(0);
	k = \*(++j) | chbits;
int i,j[];
	register k;

	if(((k = i-'0') >= 1) && (k <= 4) && (k != smnt))return(--k);
	for(k=0; j[k] != i; k++)if(j[k] == 0)return(-1);

If this doesn't convince you to write well-structured, well-commented code, I don't know what will. The scary thing is that there are at least 18 bugs in our database open against nroff or troff; one of the side-effects of promising full backwards compatibility. Anyone who has the courage to putback nroff changes earns a badge of honor here - it is a dark place that has claimed the free time of a few brave programmers3. Whenever an open bug report includes such choice phrases as this, you know you're in trouble:

I've seen this problem on non-Sun Unix as well, like Ultrix 3.1 so the problem likely came from Berkeley. The System V version of \*roff (ditroff ?) doesn't have this problem.

1One of my personal favorites is this little gem, a 2000 winner 'natori'. It should be a full moon tomorrow night...

#include <stdio.h>
#include <math.h>
double l;main(_,o,O){return putchar((_--+22&&_+44&&main(_,-43,_),_&&o)?(main(-43,
405859.-4.7+acos(l/2))<1.57))[" #"])):10);}

2On Solaris, most manpages are written in SGML, and can be found in /usr/share/man/sman\*.

3I'd like to think that the x86 disassembler is a close second, but maybe that's just because I'm a survivor.

Thursday Jun 17, 2004

Software Abstraction and the Jaws of Life

Before I start looking at some of problems we're addressing in Solaris, I want to step back and examine one of the more fundamental problems I've been seeing in the industry. In order to develop more powerful software quickly, we insert layers of abstraction to distance ourselves from the actual implmentation (let someone else worry about it). There is nothing inherently wrong with this; no one is going to argue that you should write your business critical web service in assembly language instead of Java using J2EE. The problem comes from the disturbing trend that programmers are increasingly less knowledgeable about the layers upon which they build.

Most people can sit down and learn how to program Java or C, given enough time. The difference between an average programmer and a gifted programmer is the ability to truly understand the levels above and below where one works. For the majority of us1, this means understanding two things: our development platform and our customers. While understanding customer needs is a difficult task, a more tragic problem is the failure of programmers to understand their immediate development environment.

If you are a C programmer, it is crucial that you understand how virtual memory works, what exactly happens when you write to a file descriptor, and how threads are implemented. You should understand what a compiler does, how the dynamic linker works, and how the assembly code the compiler generates really works. If you are a Java programmer, you need to understand how garbage collection works, how java bytecodes are interpreted, and what JIT compiling really does. Just because you don't need to know the OS works doesn't mean you shouldn't. Knowing your environment encourages good software practice, as well as making you more effective at solving problems when they do occur.

Unfortunately, everything in the world is not somebody else's fault. As these new layers are added, things tend to become less and less observable. Not to mention that poor documentation can ruin an otherwise great tool. You may understand how a large number of cross calls can be the product of a misbehaving application, but if you can't determine where they're coming from, what's the point? We in the Solaris group develop tools to provide useful layers of abstraction, as well as tools that rip the hood off2 so you can see what's really happening inside.

Before I start examining some of these tools, I just wanted to point out that there is a large human factor involved that is out of our hands. The most powerful tool in the world can be useless in the hands of someone without a basic understanding of their system. Hopefully by providing these tools we can simultaneously expose the inner workings while sparking desire to learn about these inner workings. So take some some time to read a good book once in a while.

1For us kernel developers, the landscape is a little different. We have to work with a multitude of difference hardware (development platforms) in order to provide an effectively limitless number of solutions for our customers. I like to think that kernel engineers are pound-for-pound the best group of programmers around because of this, but maybe that's just my ego talking.

2Scott is in the habit of talking about how we sell cars, not auto parts. Does that mean we also provide the Jaws of Life to save you after you crash your new Enzo?


Musings about Fishworks, Operating Systems, and the software that runs on them.


« June 2016