Wednesday Jun 14, 2006

Livin' on the edge

It's been a year now since OpenSolaris went live, and so far it's been a very positive experience for me.

Since we "went open" I've found it easier to communicate with folks working on Solaris inside of Sun. As is the case in many large organizations, Sun folks have a tendency to communicate primarily within their local groups both geographically and organizationally. OpenSolaris has built a community which crosses organizational boundaries, and as such it has torn down the barriers to open communication.

It has also been fun to see the number of folks who are living on the edge. There are many users in the community (and developers too) who are eager to try out new features, or to see OpenSolaris running on their kewl new (or in some cases, old) hardware. For someone used to working on a feature for a couple of years, and then watching it take another year to make it into the mainstream, this is a nice shift.

The part I really enjoy about the community, however, is the ability to have frank discussions with our end-users. There's no substitute for hearing about the pain (or pleasure) being experienced by end-users straight from the source. If only all software were done this way!


Tuesday Mar 14, 2006

The role of documentation in software development

By now I'm sure it's painfully apparent to anyone who has read my blog that I am a hard-core geek. In fact, I'm not just your run-of-the-mill hard-core geek, but a kernel geek.

Aside from our being born socially inept, a common affliction amongst hard-core geeks is an inability to communicate our thoughts and ideas (and emotions -- but I ain't goin' there!) effectively. While there are a few rare folks I've met out there who seem to have stumbled upon the magic cure to this nasty affliction, the hard-core geek who can communicate effectively is a rare breed.

As a professional software engineer, one of my fundamental beliefs is that the dominant factor between a successful project and an unsuccessful project reduces to the effective dissemination of key information. As a battle-worn veteran of several software projects, I've seen other projects flounder because the members of the team didn't understand or buy into the mission, major deliverables or objectives weren't clearly defined, team members didn't agree on the requirements, or team members developed differing perceptions of what the final product would look like!

The impacts of ineffective written communications within a project often are magnified by conspiring factors such as geographically dispersed teams, rushed schedules (i.e. we don't have the TIME to define WHAT we're going to deliver!), and in extreme cases, recalcitrant team members!

In this post I examine the role of effective project documentation in executing successful development projects, and share why I believe this is even more important in an Open software development model than in traditional, closed-source projects.

The Traditional Software Project Model

The traditional project development model is usually the classic waterfall model of software engineering. The project is conceived, funded, and staffed. At that point the conception phase moves into requirements analysis, functional specifications are developed, and the architecture is determined. Next, the design is specified, an implementation created, and the implementation is checked against the specifications of the architecture. If everything checks out, the project is accepted by the sponsors and deployed.

At each stage of the traditional software project, new stakeholders are introduced, and these stakeholders have progressively lower levels of focus. If the communication of the concept is not clear, the architects may come up with a proposal which doesn't match the original concept. If the architecture is over- or under-constrained, the implementation may be untenable or incomplete. If the validation specification does not match the architecture, the acceptance may not match the implemented product. Finally, if the end-user documentation does not match the architecture and implementation, the end-user will find using the product to be time consuming, frustrating, and costly. Anyone who has played the telephone game and observed how garbled even a simple message can become can appreciate how even a small defect in the communication chain of a complex project can lead to catastrophe in the final product!

The Open Software Development Project Model

The Open software development model differs from the traditional model in several important ways.

First, the concept and requirements typically come directly from the end-user(s), and are usually accompanied by a rough implementation in the form of a prototype. At this point, the community critiques the proposal and implementation. Community members attempt to reverse-engineer any unspecified requirements, and suggest new concepts for inclusion. In the case of large software projects, the development may shift back into the traditional model after this point. With smaller projects, the process may be more iterative, with the traditional validation phase leading to revised requirements, and the cycle repeating itself.

Second, in the Open model the stakeholders at each stage are different. In the case of the traditional model, the needs of the business (the business case) ultimately drive the scope and requirements of a project; the concept phase is nearly set in stone by the time the implementors start to engage. In the Open development model, the focus is usually more on the desired outcome of the end-users of the product. The requirements, and even the high level concepts are more free to evolve over time as the project progresses.

The Role of Communication in Software Development

In my experience, most traditional software projects make it through the first few stages (concept and requirements) with relatively few problems. By the time the implementors are brought on-board, the functional specifications are baked, and they map back to the original concepts and requirements fairly accurately. The big issues such as projected development and scope have been settled by the project bosses. Due to the senior level of involvement up to this point, the accuracy and quality of specification documents is usually very good.

The really successful software projects become successful because they give the right level of attention to clearly communicating the key concepts and requirements. Software engineering experts have differing opinions on how much of the total effort should be given to a project's early phases, but my gut feel is that it is well over 25%, and may be as much as 50%.

To make an Open project which is proposed in the form of a prototype successful, it is even more important to keep in mind that the key concepts and requirements need to be thoroughly documented for the community to agree on them. There are several reasons for this:

  • Engineers on traditional development projects are usually geographically co-located, and utilize high bandwidth forms of communication such as meetings and whiteboard sessions to reach decisions and agreements. Open development projects are almost always geographically distributed. E-mail and project web pages, which are lower-bandwidth forms of communication, are typically the dominant forms of reaching decisions and agreements; this makes increased accuracy and completeness a necessity.
  • Unlike in the traditional model where the stakeholders are the product bosses, Open projects have the entire community (and the end users) as the stakeholders! If the community can't agree on the high level objectives and functional requirements of a software project, the project is doomed from the start.

Documenting Concepts, Requirements, and Architecture

The concepts and requirements comprise the big picture view of the software project. These define aspects such as the major capabilities of the product and the intended target market. The architecture defines the input/output methods, interoperability features, and user interface.

The first nail in the road at the early phases of a project is that seemingly unimportant details such as how the requirements were gathered are often not recorded. High-level tradeoffs made in mapping the requirements into architecture are often not well documented (if at all). As we'll see, this is a disaster waiting to happen down the road when the implementors hit a roadblock in the design.

A less obvious trap at this point is that the requirements and architecture are usually baked by experienced software engineers with a good, albeit incomplete, idea of the requirements of the implementors. On the other hand, implementors are usually less-experienced, and have a more localized focus -- meaning that there is an inherent opportunity for information loss when translating the requirements into software constructs. If an implementor doesn't understand how flexible a requirement is, or why the requirement is there at all, the implementation may be inadequate or over-constrained.

The latter effect is magnified in the Open development model, where the implementors often do not work with the software as their day job; even if they do, they often don't have the benefit of years of experience networking with engineers inside a major software company. If an Open development project is going to be taken on by less-experienced implementors, then clearly communicating the concepts, requirements, and architecture of the software is of paramount importance.

Documenting the Design

To help avoid pitfalls during the implementation phase, I believe it is essential that the design team thoroughly document the aspects of their implementation blueprints. The aspects which should be thoroughly documented include any and all of:

  • Assumptions
  • Constraints
  • Design Trade-offs
  • Design Decisions
Let's break down each of these and examine them independently.


Assumptions

We have all heard the cliché about how assumptions are bad. However, reality dictates that no matter how well the requirements are written and the architecture is specified, there will always be room for interpretation. The idea behind documenting all assumptions regarding the concepts, requirements, and architecture is that the folks who created them can review these assumptions and provide clarifications when necessary. In some circumstances there may be holes in the original specifications which result in assumptions about major requirements! If in doubt, and it's open to interpretation -- document your assumptions about it! The time spent up front will save you a lot of frustration down the road by preventing an implementation going to the validation phase, only for the end-user to come back and say "this doesn't work the way I expected it to!"

There's also an advantage to catching holes in our assumptions early which is particularly helpful in the context of an Open development community. We're all human, and as such we have tendencies to blame others rather than ourselves when things go wrong. Having members of a community come back after an implementor has worked long and hard on a problem only to tell him/her that the hard work that person has done is incorrect is a big blow to the implementor's pride. In the worst case, flame wars may ensue, and the contributor ends up leaving the community. A lot of frustration and wasted effort can be eliminated if the community has the opportunity to identify misguided assumptions and correct them early on.


Constraints

One major difference I've noticed between experienced, high-caliber software engineers and average software engineers is that experienced software engineers know how to leverage constraints in order to optimize solutions to difficult problems.

Let me give a concrete example from a project I worked on a few years ago: a previous team worked hard at solving a difficult kernel problem in the area of virtual memory management. The team came up with an approach, worked for about a year, and had a working solution -- only to have their approach nailed to the wall because of a design flaw which could result in data corruption.

When I took over the project it didn't take me long to decide that the systemic problem with their approach revolved around the constraints. Specifically, the project team's approach failed because they had designed around a constraint that a specific DDI call had to remain supported. Because the DDI call in question was part of an obsolete framework, removing support for the call was the best way to solve the problem, and it removed the potential hole where an obsolete or incompatible device driver might cause kernel data corruption when the new feature was enabled.

In a way constraints are assumptions -- but they are assumptions that aren't related to the specifications. Which functions a library supports, which platforms will best run an application, and which programming languages to use may all be constraints of an implementation on which the requirements and architecture are entirely silent.

Another example of leveraging constraints to your advantage in an implementation is creating a dependency on a particular system library, rather than developing your own equivalent functionality.

Design Trade-offs

Some folks look negatively on design trade-offs -- they refer to the practice of making trade-offs in a design as cutting corners. However, software is like any other engineering practice -- we need to balance effort and complexity against functionality. Finding a happy middle ground is a constant battle.

Most high quality projects I've seen do a good job of articulating their design tradeoffs. This is because there are two paths to detecting an incorrect design tradeoff: in the first case, the tradeoff is documented, and someone detects the problem before it poses a major problem; in the second case, the product blows up in the hands of the end-user! I don't know about you, but I prefer constructive criticism over catastrophic failure any day!

Design Decisions

I prefer to think of the design decisions as the politics of software design. This position may not be popular, but I think it gets the point across well. If others don't understand the reasoning behind your decisions, they will be more inclined to disagree with the outcome. If you do not have a strong case, and others hold enough clout over you, they may convince others that their approach is right, and yours is wrong.

The best defense against showdowns based on emotion and past experience (versus the facts!) is to clearly articulate the reasons you decided something is best implemented a certain way. In many cases, the decision is unimportant and ends up being arbitrary (for instance, how often to poll a descriptor). In other cases, the decision is the result of a thorough analysis. In yet other cases, others may find mistakes you made which show that your decision is flawed. Regardless of the scenario, clear and concise documentation of the reasoning behind a design, and not just the documentation of the final design, will lead to a better end product.

Putting Project Communication into Practice

Of course there is reading about project communication, and there is doing it. To ensure your development project is more likely to be successful, I recommend doing the following.

Start out with a communication plan

The communication plan should be your first deliverable. This plan should lay out which e-mail aliases will be used to communicate regarding various aspects of the project, where the web page is, and what content it will host. The communication plan should also list the documentation deliverables of each phase of the project, and who the consumers are. Get the members of the community to buy into your plan before continuing!

Document each project phase

As you progress through your project, consider documentation to be as important as any other deliverable, and do not consider moving on to the next phase until all of the deliverables are complete. Without full agreement by everyone that the project documentation is accurate and complete, you have no assurances that any of what you are delivering is!

Refer to your documentation often

Beyond helping everyone reach agreement before moving forward, the best aspect of project documentation is that it will still be around in a year when you have forgotten the important details (I think we did this because ... Why did we do that again?). Use the project documentation you produce to guide you along.


The minimalist approach

Suppose that I am proposing a bugfix. As long as there is no impact on the design of the system, and no change in the user's experience, only a small design document is necessary. My design may be as little as a few paragraphs to accompany the diffs in an e-mail, explaining the root cause of the bug, how I arrived at the fix, and what the fix does differently to correct the abnormal behavior.

Use common sense! If a bugfix is a one-line change to correct a misspelling in a code comment, it's obvious to everyone that no design documentation is required -- the diffs speak for themselves. In anything more complex than the most trivial, mechanical change, there was some thought process involved. The idea is to get that thought process out into the open, where others can review not just the fix, but what you were thinking when you wrote it.

The micro-project

Suppose I'm implementing a new function call in a shared library. The end-users will be software developers. The stakeholders will be whoever requested the interface. I'd start by collecting the requirements, which are probably coming from one or more persons who proposed the interface. Once the requirements were fleshed out and agreed upon, I'd draw up an interface specification. Since all library calls require a man page, I would do this in the form of a mock-up man page, and send it out for wider review, making sure there are several experienced software developers among the reviewers. Once the interface was baked, I would sit down and draft up an implementation proposal, listing all of the tradeoffs I could think of. Once I had that in place I would request a peer review of my proposed implementation, and after iterating on it I would continue on to the code.

The macro-project

As usual a one-size-fits-all approach isn't going to work, so you'll need to consider many factors -- size of the team, importance of the functionality, dependencies, schedule, etc. A macro project may range from an RFE (request for enhancement) to a full blown software package.

At a minimum, I would suggest breaking up the project into the following stages, each paired with a set of project documentation deliverables:

  • Concept:
    • List of key requirements and features
    • Component block diagram
    • Interface specifications
  • Design:
    • Dependency and data flow diagrams
    • Detailed specifications for each component
    • Constraints, risks, assumptions, and trade-offs
  • Implementation:
    • Code comments
    • Big theory statements (to describe algorithms)
  • Validation:
    • Test plan
    • End-user documentation
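
To make the "Implementation" deliverables above a little more concrete, here is a minimal sketch of what a big theory statement might look like when it also records assumptions, constraints, trade-offs, and decisions. Everything in it is hypothetical: the function next_poll_interval(), the POLL_MIN_MS and POLL_MAX_MS values, and the back-off scheme were invented for illustration, borrowing the "how often to poll a descriptor" example from earlier.

```c
#include <assert.h>

/*
 * Poll interval back-off (big theory statement)
 *
 * Assumptions:   the descriptor is idle most of the time, so polling
 *                it at a fixed fast rate would mostly waste cycles.
 * Constraints:   the interval must never exceed POLL_MAX_MS, so that
 *                new data is noticed within a bounded delay.
 * Trade-offs:    doubling is cheap and simple, but it reacts more
 *                slowly after a long idle period than a fixed fast
 *                poll would.
 * Decisions:     POLL_MIN_MS and POLL_MAX_MS are arbitrary; nothing
 *                in this sketch depends on their exact values.
 *
 * Algorithm: after an idle poll, double the interval up to the cap;
 * after a poll that found data, drop back to the minimum.
 */
#define POLL_MIN_MS     10
#define POLL_MAX_MS     640

int
next_poll_interval(int cur_ms, int had_data)
{
        assert(cur_ms >= POLL_MIN_MS && cur_ms <= POLL_MAX_MS);
        if (had_data)
                return (POLL_MIN_MS);
        return (cur_ms * 2 > POLL_MAX_MS ? POLL_MAX_MS : cur_ms * 2);
}
```

The point is less the code than the comment: a reviewer who disagrees with the cap or the doubling can say so before anything ships.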

If you are putting together a good-sized project and you lack experience, I would suggest seeking outside help. The OpenSolaris community (general discussion list) is a good place to start.


Wednesday Mar 08, 2006

Examining the Anatomy of a Process

This blog entry is a continuation of my previous blog entry, Observing the Solaris Kernel.

For my second demo I'll take a look at the anatomy of a simple process. As before, let's start out with a really simple program:

% cat simple2.c
        return (0);
% gcc -o simple2 simple2.c
% ./simple2
^Z[1] + Stopped (SIGTSTP)        ./simple2
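
For anyone who wants to follow along, any program that blocks in a sleep should do. The following is only a sketch and not the original listing: it assumes the program does nothing but sleep, which is at least consistent with the nanosleep() frame in the kernel stack shown further down. The helper name nap_for() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * Sleep for the given number of whole seconds using nanosleep(2).
 * Returns 0 once the full interval has elapsed, or -1 on error.
 * (Hypothetical helper; the name and signature are illustrative.)
 */
int
nap_for(long seconds)
{
        struct timespec req, rem;

        assert(seconds >= 0);
        req.tv_sec = seconds;
        req.tv_nsec = 0;

        /* If a handled signal interrupts the sleep, resume with the time left. */
        while (nanosleep(&req, &rem) != 0) {
                if (errno != EINTR)
                        return (-1);
                req = rem;
        }
        return (0);
}
```

Calling nap_for() with a generous interval from main() and pressing ctrl+z while it runs leaves a stopped process to poke at.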

In another window, as root, I fire up mdb in kernel target mode, and look up the proc structure of the process:

root@rutamaya > mdb -k
Loading modules: [ unix krtld genunix specfs dtrace pcplusmp ufs ip sctp usba fctl nca lofs random nfs logindmux ptm cpc fcip sppp ]
mdb> ::ps !head -1
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
mdb> ::ps !grep simple2
R 127247 127218 127247 127193 125946 0x42004000 fffffead763de3c0 simple2
mdb> fffffead763de3c0::print proc_t
    p_exec = 0xfffffe819091a840
    p_as = 0xffffffff899c1010
    p_lockp = 0xffffffff84a6db40
    p_crlock = {
        _opaque = [ 0 ]
    p_cred = 0xffffffff923e14e8

Let's take a look at the context of the process in the kernel, to see what it looks like after I sent it a SIGTSTP using ctrl+z from the terminal:

mdb> fffffead763de3c0::walk thread
mdb> fffffe814473abc0::findstack -v
stack pointer for thread fffffe814473abc0: fffffe80007abb50
[ fffffe80007abb50 _resume_from_idle+0xde() ]
  fffffe80007abb90 swtch+0x241()
  fffffe80007abc00 stop+0xa68(5, 18)
  fffffe80007abc40 isjobstop+0xd7(18)
  fffffe80007abcf0 issig_forreal+0x48c()
  fffffe80007abd20 issig+0x28(0)
  fffffe80007abda0 cv_timedwait_sig+0x266(fffffe814473ad96, fffffe814473ad98,
  fffffe80007abe20 cv_waituntil_sig+0xab(fffffe814473ad96, fffffe814473ad98,
  fffffe80007abe70, 1)
  fffffe80007abeb0 nanosleep+0x141(8047cb0, 8047cb8)
  fffffe80007abf00 sys_syscall32+0x1ff()

::walk thread takes a process structure pointer and "walks" all of the threads in the process (in Solaris there is a 1:1 mapping of user thread to kernel thread these days). Since our simple program is single-threaded this yields a single thread pointer, which we could have also gotten at by hand by looking at the p_tlist member of the proc structure:

mdb> fffffead763de3c0::print proc_t p_tlist
p_tlist = 0xfffffe814473abc0
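
Conceptually, walking the threads of a process is just a linked-list traversal that starts at p_tlist. The sketch below uses toy stand-in structures (mini_thread and mini_proc are hypothetical, not the real kthread_t and proc_t) and assumes the threads form a circular list linked through t_forw.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real kthread_t and proc_t have many more fields. */
struct mini_thread {
        int                     t_id;
        struct mini_thread      *t_forw;        /* next thread in the process */
};

struct mini_proc {
        struct mini_thread      *p_tlist;       /* first thread, NULL if none */
};

/*
 * Visit every thread of a process once, starting at p_tlist and
 * following t_forw until the list wraps back around to the start.
 * Calls visit(t, arg) for each thread (if visit is not NULL) and
 * returns the number of threads seen.
 */
size_t
mini_walk_threads(const struct mini_proc *p,
    void (*visit)(const struct mini_thread *, void *), void *arg)
{
        const struct mini_thread *t;
        size_t n = 0;

        assert(p != NULL);
        t = p->p_tlist;
        if (t == NULL)
                return (0);
        do {
                if (visit != NULL)
                        visit(t, arg);
                n++;
                t = t->t_forw;
        } while (t != NULL && t != p->p_tlist);
        return (n);
}
```

For a single-threaded process like simple2 the loop body runs exactly once, which matches the single pointer the walker printed.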

::findstack takes a thread pointer and walks the thread stack in the kernel. We can see that cv_waituntil_sig() was called from nanosleep(), and it was interrupted by the SIGTSTP signal, causing the process to call stop().

Using the pipe facility in mdb and a new dcmd I haven't yet used, ::pid2proc, I can also shorten everything I've done so far into an equivalent two-liner:

mdb> ::ps !grep simple2
R 127247 127218 127247 127193 125946 0x42004000 fffffead763de3c0 simple2
mdb> 0t127247::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread fffffe814473abc0: fffffe80007abb50
[ fffffe80007abb50 _resume_from_idle+0xde() ]
  fffffe80007abb90 swtch+0x241()
  fffffe80007abc00 stop+0xa68(5, 18)
  fffffe80007abc40 isjobstop+0xd7(18)
  fffffe80007abcf0 issig_forreal+0x48c()
  fffffe80007abd20 issig+0x28(0)
  fffffe80007abda0 cv_timedwait_sig+0x266(fffffe814473ad96, fffffe814473ad98,
  fffffe80007abe20 cv_waituntil_sig+0xab(fffffe814473ad96, fffffe814473ad98,
  fffffe80007abe70, 1)
  fffffe80007abeb0 nanosleep+0x141(8047cb0, 8047cb8)
  fffffe80007abf00 sys_syscall32+0x1ff()

Note the 0t prefix is needed because the PID is displayed by ::ps in decimal, just as if I had run the ps command in a terminal. In fact I could have done this from inside mdb using the shell pipe facility I've already used in this demo:

mdb> !ps -ef | grep simple2
   elowe 127247 127218   0 10:04:24 pts/1       0:00 ./simple2
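
The radix rule behind the 0t prefix is easy to trip over, so here is a small sketch of it: numbers are read as hexadecimal unless told otherwise, and 0t forces decimal. The function parse_dbg_number() is hypothetical and handles only the 0t and 0x prefixes; it is not mdb's parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a number the way it is typed in this demo: hexadecimal by
 * default, decimal when prefixed with "0t", and hexadecimal again
 * when prefixed with "0x". Illustrative only.
 */
unsigned long
parse_dbg_number(const char *s)
{
        assert(s != NULL);
        if (strncmp(s, "0t", 2) == 0)
                return (strtoul(s + 2, NULL, 10));
        if (strncmp(s, "0x", 2) == 0)
                return (strtoul(s + 2, NULL, 16));
        return (strtoul(s, NULL, 16));
}
```

With this rule, "0t127247" is PID 127247, while a bare "127247" would be read as hex 0x127247, a very different number.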

In my first demo, I examined the user and group ID of a file. As we saw from running ::print on the proc structure, a process also has a p_cred member, a pointer to a cred structure which holds the identity of the user the process is running as. Since I started the process from another terminal window under my normal UNIX account, the cred structure should contain the credentials associated with my UNIX login.

mdb> fffffead7ceb43c0::print proc_t p_cred|::print cred_t
    cr_ref = 0x32
    cr_uid = 0x1ebfa
    cr_gid = 0xa
mdb> fffffead7ceb43c0::print proc_t p_cred|::print cred_t cr_uid|::map =D

That's me. Since ::print likes hex more than I do, I used the ::map dcmd to pipe the dot from ::print cred_t cr_uid into =D, which is the mdb syntax to display a number in integer decimal format.

There are plenty of interesting things to explore on your own just by starting from the process structure, so this is a great start for folks who are just starting out learning about UNIX internals.


Monday Mar 06, 2006

Observing the Solaris Kernel

This past weekend I had the pleasure of presenting at the SIGCSE conference in Houston. For the unacquainted the audience is computer science educators, so my focus was on using OpenSolaris as a vehicle for teaching operating system internals.

For the conference I prepared some slides and a demo that attempt to peel off the first layer of the observability onion into the Solaris kernel. Aside from DTrace (which by now the whole world has heard about) I also make extensive use of the excellent mdb kernel debugging facilities. The demos in particular are worth sharing so I am posting them here for all to see and use as you see fit.

First, the slides (also in StarOffice; if you steal them for your own use all I ask is that you provide a pointer back to this blog entry). The main theme of my presentation is that OpenSolaris brings to the table free access to the source code, while the powerful observability tools allow quick insight into the dynamics of the kernel without mastering the source.

The first demo starts from a simple C program, follows the open() syscall flow down to the process file table with DTrace, and then drills all the way down (using the kernel debugger) into the inode at the file system level. This is accomplished by leveraging access to the source code, and for the demo I deliberately chose a part of the system I had never been in before. This demo took about twenty minutes to create and about twenty minutes to walk through.

The second demo shows how mdb can be used to examine the anatomy of a stopped process. I ran out of time before getting to the second demo; I'll post the second demo in another blog entry in a few days, so keep an eye out for it.

First Demo

Let's begin with the following program:
/*
 * A simple program which creates a file, writes one byte to it,
 * closes it, and exits.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

int
main(void)
{
        int fd;
        const char buf[] = "\0";

        if ((fd = open("simple1.out", O_CREAT|O_WRONLY, 0600)) < 0) {
                perror("Unable to create file");
                exit (1);
        }

        if (write(fd, buf, 1) < 0) {
                perror("Unable to write file");
                exit (2);
        }

        (void) close(fd);       /* close the file before exiting */
        return (0);
}

Let's take a closer look at the open(2) system call. Running truss(1) shows that the system call we're interested in is actually the third, since there are two implicit open()s performed automatically (one by the runtime linker looking for a config file, and the other to map in the C library).
% truss -t open ./simple1
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/", O_RDONLY)                = 3
open("simple1.out", O_WRONLY|O_CREAT, 0600)     = 3
Armed with this much knowledge I use a really simple DTrace script to see what functions open() calls.
#!/usr/sbin/dtrace -Fs
/ execname == "simple1" /
/ self->tracing > 0 /
/ self->tracing > 0 /
/ self->tracing > 0 /
Running this script yields quite a bit of output, of which the first few calls are interesting because we can see that the system call is setting up the new file descriptor which is eventually returned.
  0  => open                                          0
  0    -> open32                                      0
  0      -> copen                                     0
  0        -> falloc                                  0
  0          -> ufalloc                               0
  0            -> ufalloc_file                        0
  0              -> fd_find                           0
Pulling up the source for fd_find(), we can see that the file descriptor is an index into an array, fi_list[], which is embedded in the process u structure. Armed with that new info, I set up a breakpoint into the debugger from the end of open() using DTrace.
#!/usr/sbin/dtrace -wFs
/ execname == "simple1" /
{ breakpoint(); }
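
Before running the script, it may help to picture what "the file descriptor is an index into an array" means. The sketch below is a toy model, not the kernel's fd_find(): the table size, the struct toy_finfo type, and both function names are hypothetical, and the lowest-free-slot policy is assumed because that is what POSIX requires of open().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NFILES      8       /* arbitrary table size for the sketch */

/* Toy per-process file table: slot N belongs to file descriptor N. */
struct toy_finfo {
        const void      *fi_list[TOY_NFILES];   /* NULL means "free" */
};

/*
 * Return the lowest free descriptor at or above minfd, or -1 if the
 * table is full. Illustrative stand-in, not the kernel's fd_find().
 */
int
toy_fd_find(const struct toy_finfo *fip, int minfd)
{
        int fd;

        assert(fip != NULL && minfd >= 0);
        for (fd = minfd; fd < TOY_NFILES; fd++) {
                if (fip->fi_list[fd] == NULL)
                        return (fd);
        }
        return (-1);
}

/* Claim the lowest free slot for "file" and return its descriptor. */
int
toy_fd_alloc(struct toy_finfo *fip, const void *file)
{
        int fd = toy_fd_find(fip, 0);

        if (fd >= 0)
                fip->fi_list[fd] = file;
        return (fd);
}
```

With stdin, stdout, and stderr occupying slots 0 through 2, the next allocation lands in slot 3, which is the value truss reported for the open of simple1.out.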
Starting the script as root and then running the simple1 program as my user ID in another window results in the following session (recall we're looking for the third open call, so we ignore the first two times we hit the breakpoint):
[console] root@rutamaya > ./simple1_files.d
dtrace: script './simple1_files.d' matched 4 probes
dtrace: allowing destructive actions
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> <gsbase::print cpu_t cpu_thread->t_procp->p_user.u_finfo.fi_list
cpu_thread->t_procp->p_user.u_finfo.fi_list = 0xfffffea7aa34b800
[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xffffffffb282ce00

%gsbase holds a pointer to the current CPU structure; from there I can get the running thread, access the process pointer, follow the u structure to the u_finfo which contains the array of open files (again indexed by file descriptor, as we discovered from the source). Once I have the pointer of this array I can follow the struct file to get the vnode for each open file.
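
The same pointer chase can be written out in C. The structures below are toy stand-ins that copy only the member names used in the session (the real cpu_t, kthread_t, proc_t and friends are far larger), and toy_fd_to_vnode() is a hypothetical helper written for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins that mirror only the member names used in the session. */
struct toy_vnode { int v_type; const char *v_path; };
struct toy_file { struct toy_vnode *f_vnode; };
struct toy_uf_entry { struct toy_file *uf_file; };
struct toy_uf_info { struct toy_uf_entry *fi_list; int fi_nfiles; };
struct toy_user { struct toy_uf_info u_finfo; };
struct toy_proc { struct toy_user p_user; };
struct toy_thread { struct toy_proc *t_procp; };
struct toy_cpu { struct toy_thread *cpu_thread; };

/*
 * Follow cpu -> thread -> proc -> u structure -> file table -> file
 * to reach the vnode behind descriptor fd. Returns NULL if fd is out
 * of range or the slot is empty.
 */
struct toy_vnode *
toy_fd_to_vnode(const struct toy_cpu *cp, int fd)
{
        const struct toy_uf_info *fip;

        assert(cp != NULL && cp->cpu_thread != NULL);
        assert(cp->cpu_thread->t_procp != NULL);
        fip = &cp->cpu_thread->t_procp->p_user.u_finfo;
        if (fd < 0 || fd >= fip->fi_nfiles ||
            fip->fi_list[fd].uf_file == NULL)
                return (NULL);
        return (fip->fi_list[fd].uf_file->f_vnode);
}
```

Each arrow in the ::print expression corresponds to one pointer dereference here; the debugger is simply doing this walk against live kernel memory.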

Going back to the source and looking at the struct vnode I can see that the v_type and v_path fields are interesting so let's take a look at them.

[0]> 0xfffffea7aa34b800,6::print uf_entry_t uf_file|::grep >0|::print "struct file" f_vnode->v_type f_vnode->v_path
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 1 (VREG)
f_vnode->v_path = 0xfffffe80ea212e40 "/home/elowe/proj/SIGCSE/simple1.out"

This prints each element of the array of uf_entry_t structures in the u structure if the pointer is not NULL, and pipes the dot (which is a struct file) to the ::print dcmd to examine the fields we're interested in. The first three file descriptors are stdin/stdout/stderr and all point back to my virtual terminal; the fourth is the file being written to by the simple1 program.

Since I'm using NFS (and I can see from my earlier DTrace output that nfs3_open() is in the call path) I decided it might be interesting to poke at the underlying inode. The source shows that nfs3_open() gets an rnode_t by casting the vnode's v_data directly into the file system's internal inode representation. A little bit ago we had the vnode pointer in our mitts:

[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
uf_file->f_vnode = 0xffffffffb282ce00
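
The v_data cast, and the per-file-system operations vector that shows up in the next few commands, can be pictured with a toy model. None of the names below are the real kernel definitions: struct toy_fsvnode, struct toy_rnode, the TOY_VTOR() macro, and the "toynfs" operations are hypothetical and exist only to show the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>

struct toy_fsvnode;

/* Per-file-system operations table, shared by all of its vnodes. */
struct toy_vnodeops {
        const char      *vnop_name;
        int             (*vop_open)(struct toy_fsvnode *);
};

/* Generic, file-system-independent view of a file. */
struct toy_fsvnode {
        const struct toy_vnodeops *v_op;
        void            *v_data;        /* file-system-private node */
};

/* File-system-private node for a pretend NFS-like file system. */
struct toy_rnode {
        unsigned long   r_size;
        int             r_opens;
};

/* Recover the private node by casting v_data, as described above. */
#define TOY_VTOR(vp)    ((struct toy_rnode *)(vp)->v_data)

static int
toy_nfs_open(struct toy_fsvnode *vp)
{
        struct toy_rnode *rp = TOY_VTOR(vp);

        rp->r_opens++;
        return (0);
}

const struct toy_vnodeops toy_nfs_vnodeops = { "toynfs", toy_nfs_open };

/* Generic open: dispatch through whatever ops table the vnode carries. */
int
toy_vop_open(struct toy_fsvnode *vp)
{
        assert(vp != NULL && vp->v_op != NULL && vp->v_op->vop_open != NULL);
        return (vp->v_op->vop_open(vp));
}
```

The generic layer never needs to know what v_data points at; only the file system that installed the ops table does, which is why the debugger has to be told to print it as an rnode_t.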

So now we can keep digging:

[0]> 0xffffffffb282ce00::print vnode_t
    v_data = 0xfffffe83625e9d10
    v_vfsp = 0xffffffff885eaf00
    v_op = 0xffffffff86d24c40
[0]> 0xffffffff86d24c40::print struct vnodeops
    vnop_name = 0xffffffffc01c5a20 "nfs3"
    vop_open = nfs`nfs3_open
[0]> 0xfffffe83625e9d10::print rnode_t
    r_size = 0x1
    r_attr = {
        va_mask = 0xbfff
        va_type = 1 (VREG)
        va_mode = 0x180
        va_uid = 0x1ebfa
        va_gid = 0xa
[0]> 180=O
[0]> 0x1ebfa=D
[0]> a=D
[0]> :c
We can see that the file mask is 0600, the UID is 0x1ebfa hex (which is 125946 in decimal), and the group ID is 10 decimal.
% ls -l simple1.out
-rw-------   1 elowe    staff          1 Feb 27 11:34 simple1.out
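
As a cross-check between the debugger and ls, the permission string can be derived from va_mode by hand. The helper below is a hypothetical sketch (mode_to_perms() is not a library function) that handles only the nine rwx bits and ignores setuid, setgid, and sticky.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Render the nine permission bits of a mode the way ls -l prints
 * them. out must have room for at least 10 bytes. Only the rwx bits
 * are handled in this sketch.
 */
void
mode_to_perms(unsigned int mode, char out[10])
{
        static const char flags[] = "rwxrwxrwx";
        int i;

        assert(out != NULL);
        for (i = 0; i < 9; i++) {
                /* Bit 8 is owner-read, bit 0 is other-execute. */
                out[i] = (mode & (1u << (8 - i))) ? flags[i] : '-';
        }
        out[9] = '\0';
}
```

Feeding it 0x180 (octal 0600) gives "rw-------", which lines up with the ls -l output once the leading file-type character is set aside.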


Wednesday Nov 16, 2005

ZFS saves the day(-ta)!

I've been using ZFS internally for a while now. For someone who used to administer several machines with Solaris Volume Manager (SVM), UFS, and a pile of aging JBOD disks, my experience so far is easily summed up: "Dude this is so @#%& simple, so reliable, and so much more powerful, how did I ever live without it??"

So, you can imagine my excitement when ZFS finally hit the gate. The very next day I BFU'ed my workstation, created a ZFS pool, set up a few filesystems and (four commands later, I might add) started beating on it.

Imagine my surprise when my machine stayed up less than two hours!!

No, this wasn't a bug in ZFS... it was a fatal checksum error. One of those "you might want to know that your data just went away" sort of errors. Of course, I had been running UFS on this disk for about a year, and apparently never noticed the silent data corruption. But then I reached into the far recesses of my brain, and I recalled a few strange moments -- like the one time when I did a bringover into a workspace on the disk, and I got a conflict on a file I hadn't changed. Or the other time after a reboot I got a strange panic in UFS while it was unrolling the log. At the time I didn't think much of these things -- I just deleted the file and got another copy from the golden source -- or rebooted and didn't see the problem recur -- but it makes sense to me now. ZFS, with its end-to-end checksums, had discovered in less than two hours what I hadn't known for almost a year -- that I had bad hardware, and it was slowly eating away at my data.

Figuring that I had a bad disk on my hands, I popped a few extra SATA drives in, clobbered the disk and this time set myself up a three-disk vdev using raidz. I copied my data back over, started banging on it again, and after a few minutes, lo and behold, the checksum errors began to pour in:

elowe@oceana% zpool status
  pool: junk
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
 scrub: none requested

        NAME        STATE     READ WRITE CKSUM
        junk        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     1
            c3d1    ONLINE       0     0     0

A checksum error on a different disk! The drive wasn't at fault after all.

I emailed the internal ZFS interest list with my saga, and quickly got a response. Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? A faulty power supply.

Since my data is still intact, and the performance isn't hampered at all, I haven't bothered to fix the problem yet. I've been running over a week now with a faulty setup which is still corrupting data on its way to the disk, and have yet to see a problem with my data, since ZFS handily detects and corrects these errors on the fly.
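For the curious, the detect-and-correct idea can be sketched in a few lines of C. This is a toy under loud assumptions: it uses a two-way mirror rather than the raidz parity I'm actually running, a simple Fletcher-style sum rather than ZFS's real checksum code, and every name in it (mirrored_block, block_read() and friends) is invented for illustration. The one point it tries to capture is that the checksum is kept apart from the data it covers, so a copy that comes back wrong can be spotted and rewritten from a copy that verifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/*
 * Two copies of one block plus the checksum recorded at write time.
 * The checksum is stored apart from the data it describes.
 */
struct mirrored_block {
	uint8_t copy[2][BLOCK_SIZE];
	uint64_t expected;
};

/* Fletcher-style running sums; a stand-in, not ZFS's actual algorithm. */
uint64_t
block_checksum(const uint8_t *buf, size_t len)
{
	uint32_t a = 0, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a += buf[i];
		b += a;
	}
	return (((uint64_t)b << 32) | a);
}

void
block_write(struct mirrored_block *mb, const uint8_t *data)
{
	memcpy(mb->copy[0], data, BLOCK_SIZE);
	memcpy(mb->copy[1], data, BLOCK_SIZE);
	mb->expected = block_checksum(data, BLOCK_SIZE);
}

/*
 * Return 0 and fill 'out' from the first copy that verifies, rewriting
 * the other copy from it.  Return -1 if neither copy verifies.
 */
int
block_read(struct mirrored_block *mb, uint8_t *out)
{
	int i;

	assert(mb != NULL && out != NULL);
	for (i = 0; i < 2; i++) {
		if (block_checksum(mb->copy[i], BLOCK_SIZE) == mb->expected) {
			memcpy(out, mb->copy[i], BLOCK_SIZE);
			memcpy(mb->copy[1 - i], mb->copy[i], BLOCK_SIZE);
			return (0);
		}
	}
	return (-1);
}
```

Flip a byte in one copy after block_write() and block_read() still returns the original data and quietly repairs the damaged copy, which is roughly the behavior I've been enjoying.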

Eventually I suppose I'll get around to replacing that faulty power supply...


Monday Oct 17, 2005

Translation Storage Buffers

The translation storage buffer (TSB) is a SPARC-specific data structure used to speed up TLB miss handling in the UltraSPARC family of CPUs. Since these CPUs implement TLB miss handling via a trap mechanism, performance of the low-level trap handling code in the operating system is crucial to overall system performance.

The TSB is implemented in software as a direct-mapped, virtually-indexed, virtually-tagged (VIVT) cache. Its minimum size is the processor's base page size (8K), while its upper bound is practically unlimited. The UltraSPARC CPUs have a translation assist mechanism which, given the base address of the TSB in a register, provides a pre-computed TSB entry pointer corresponding to the virtual address of the translation miss. This mechanism only supports TSBs up to 1M in size, which is where the current software support is capped[1].

On a TLB miss trap (%tt = {0x64, 0x68, 0x6C}) the trap handler corresponding to the particular trap code is invoked from the trap table by the CPU. For an ITLB or DTLB miss, the TSB is searched for a valid entry whose tag matches the virtual address of the translation miss. If the entry is found, it is loaded from the TSB into the TLB, and the trapped instruction is retried. If the entry is not found, the translation hash tables are searched using the current process address space pointer as well as the virtual address as the hash key.
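The direct-mapped lookup described above can be sketched in C. This is a simplified model, not the real trap handler (which is hand-written assembly): the entry layout, the separate valid flag, and all of the names here are assumptions made for illustration, and the tag ignores the context number that a real entry would also have to match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 13	/* 8K base page */

/* Simplified TSB entry: real entries pack this differently. */
struct tsb_entry {
	uint64_t tag;	/* virtual page number of the cached mapping */
	uint64_t tte;	/* translation data to load into the TLB */
	int valid;
};

struct tsb {
	struct tsb_entry *entries;
	size_t nentries;	/* must be a power of two */
};

/* Direct-mapped: the low bits of the virtual page number pick the slot. */
size_t
tsb_index(const struct tsb *tsb, uint64_t va)
{
	assert(tsb->nentries != 0 &&
	    (tsb->nentries & (tsb->nentries - 1)) == 0);
	return ((size_t)((va >> PAGE_SHIFT) & (tsb->nentries - 1)));
}

/* Return the matching entry, or NULL on a TSB miss. */
const struct tsb_entry *
tsb_lookup(const struct tsb *tsb, uint64_t va)
{
	const struct tsb_entry *e = &tsb->entries[tsb_index(tsb, va)];

	if (e->valid && e->tag == (va >> PAGE_SHIFT))
		return (e);
	return (NULL);	/* caller falls back to the hash tables */
}

/* Cache a translation, evicting whatever occupied the slot. */
void
tsb_insert(struct tsb *tsb, uint64_t va, uint64_t tte)
{
	struct tsb_entry *e = &tsb->entries[tsb_index(tsb, va)];

	e->tag = va >> PAGE_SHIFT;
	e->tte = tte;
	e->valid = 1;
}
```

Because each virtual page maps to exactly one slot, two pages whose page numbers differ by a multiple of nentries evict each other, which is one reason a bigger TSB helps processes with large working sets.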

Prior to Solaris 10, user process TSBs came from a global pool of fixed size which was allocated at boot. In Solaris 10, the Dynamic TSB project made significant changes throughout the SPARC HAT layer and, in particular, changed the implementation so that TSBs are allocated dynamically.

[1] This is only true for user processes. The kernel itself is a special case, and usually has a larger TSB -- up to 16MB in size. The trap handlers ignore the precomputed pointer on kernel misses, and compute the TSB entry index manually.


Thursday Jun 23, 2005

Debugging early in boot

When things go wrong in the startup code, there are a few options to determine what's going on. Dan Mick has a discussion about firing up with kmdb, which you can use to set breakpoints (using -d) in the debugger. This is often enough to get you going if you have some idea where the problem lies -- but if you don't know where the problem is, it can be slow to isolate the failing code section, so another hidden feature of the kernel may help determine where the system is tipping over.

There is a "PROM debug" facility which provides simple printf-style debugging for the startup code. This facility is available on SPARC and x86/x64 (on SPARC you can #include <prom_debug.h> in any machine dependent file to use this facility). The PRM_DEBUG() macro can be invoked for early printf-style debugging before cmn_err() is available, as cmn_err() relies on STREAMS and the VM subsystem to function. The kernel has a whole wad of these PRM_DEBUG() statements already defined at various points in the startup code, which can help to isolate where a problem system is tipping over. It can also be a nice aid for learning the flow of the startup code. The PRM_DEBUG() macro calls are activated by setting prom_debug to 1 using the kernel debugger. These statements are compiled into both DEBUG and non-DEBUG kernels (so they work with stock Solaris, too!).
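To give a feel for how such a facility works, here is a user-space imitation in C. It is a sketch, not the kernel's definition: the real macro prints through the PROM rather than stdio, prm_format() is a helper invented here, and the output format is simply copied from the boot traces in this entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Off by default; in the kernel this is flipped from the debugger. */
int prom_debug = 0;

/* Build one trace line; returns what snprintf() returns. */
int
prm_format(char *buf, size_t len, const char *file, int line,
    const char *name, unsigned long long value)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "%s:%d: '%s' is 0x%llx\n",
	    file, line, name, value));
}

/* Print "file:line: 'expr' is 0x..." only when prom_debug is set. */
#define PRM_DEBUG(q)							\
	do {								\
		if (prom_debug) {					\
			char prm_line[256];				\
			prm_format(prm_line, sizeof (prm_line),		\
			    __FILE__, __LINE__, #q,			\
			    (unsigned long long)(q));			\
			fputs(prm_line, stderr);			\
		}							\
	} while (0)
```

The stringizing operator (#q) is what lets a single PRM_DEBUG(physmax) print both the variable's name and its value, and the runtime check on prom_debug is why the statements can stay compiled into non-DEBUG kernels at almost no cost.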

Here is a trimmed example from my AMD64 box booting the 64-bit kernel:

<< selected Solaris entry in GRUB menu and press "e" to edit >>
grub edit> kernel /platform/i86pc/multiboot -kd
<< enter, b >>
root (hd0,1,a)
 Filesystem type is ufs, partition type 0x000000bf
kernel /platform/i86pc/multiboot -kd
   [Multiboot-elf, <0x1000000:0x141cb:0x3941d>, shtab=0x104e258, entry=0x100000
Welcome to kmdb
Loaded modules: [ unix krtld genunix ]
[0]> prom_debug/W 1
prom_debug:     0               =       0x1
[0]> :c
startup.c:600: startup_init() starting...
startup.c:637: startup_init() done
startup.c:828: startup_memlist() starting...
startup.c:871: 'modtext' is 0xfffffffffbbb2000
startup.c:872: 'e_modtext' is 0xfffffffffbc00000
startup.c:873: 'moddata' is 0xfffffffffbcf7000
startup.c:874: 'e_moddata' is 0xfffffffffbd42000
startup.c:875: 'econtig' is 0xfffffffffbd42000
startup.c:883: 'cr4_value' is 0xb0
MEMLIST: boot physinstalled:
        Address 0x0, size 0x9b000
        Address 0x100000, size 0xfbe70000
startup.c:896: 'physmax' is 0xfbf6f
startup.c:897: 'physinstalled' is 0xfbf0b
startup.c:898: 'memblocks' is 0x2
MEMLIST: boot physavail:
        Address 0x0, size 0x9b000
        Address 0x100000, size 0xf00000
        Address 0x1050000, size 0xb000
        Address 0x17621000, size 0x1df000
        Address 0x18000000, size 0xe3f70000
startup.c:944: 'mmu.pt_nx' is 0x8000000000000000
startup.c:961: 'npages' is 0xfb30a
startup.c:962: 'obp_pages' is 0x600
startup.c:973: 'physmem' is 0xfb30a
startup.c:1212: kphysm_init() done
startup.c:1213: 'boot_npages' is 0xfb309
SunOS Release 5.11 Version onnv_clone 64-bit
Copyright 1983-2005 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
startup.c:1250: startup_memlist() done
startup.c:1260: startup_modules() starting...
startup.c:1342: startup_modules() done
startup.c:1348: startup_bop_gone() starting...
startup.c:1354: Calling hat_kern_alloc()...
startup.c:1356: hat_kern_alloc() done
startup.c:1362: startup_bop_gone() done
startup.c:1413: startup_vm() starting...
startup.c:1769: startup_vm() done
startup.c:1777: startup_end() starting...
startup.c:1804: Calling configure()...
startup.c:1806: configure() done
startup.c:1820: zeroing out bootops
startup.c:1824: Enabling interrupts
startup.c:1831: startup_end() done
startup.c:1920: Unmapping lower boot pages
startup.c:1923: Unmapping upper boot pages
startup.c:1941: Releasing boot pages
startup.c:1978: Returning boot's VA space to kernel heap
Hostname: rutamaya
NIS domain name is austincampus.Central.Sun.COM

rutamaya console login:

And another example, from a SPARC box:

ok boot -kd
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix sparc ]
[0]> prom_debug/W 1
prom_debug:     0               =       0x1
[0]> :c
../../sun4/os/mlsetup.c:210: mlsetup: now ok to call prom_printf
../../sun4/os/mlsetup.c:228: 'panicbuf' is 0x70002000
../../sun4/os/mlsetup.c:229: 'pa' is 0x3e000000
../../sun4/os/startup.c:805: 'moddata' is 0x190e000
../../sun4/os/startup.c:806: 'nalloc_base' is 0x194e000
../../sun4/os/startup.c:807: 'nalloc_end' is 0x1c00000
../../sun4/os/startup.c:808: 'sdata' is 0x1800000
../../sun4/os/startup.c:813: 'e_text' is 0x1396ef8
../../sun4/os/startup.c:819: 'modtext' is 0x1398000
../../sun4/os/startup.c:820: 'modtext_sz' is 0x68000
../../sun4/os/startup.c:832: 'physinstalled' is 0x20000
../../sun4/os/startup.c:833: 'physmax' is 0x1ffff
../../sun4/os/startup.c:864: 'extra_etpg' is 0x0
../../sun4/os/startup.c:865: 'modtext_sz' is 0x68000
../../sun4/os/startup.c:867: 'extra_etva' is 0x1400000
../../sun4/os/startup.c:874: 'npages' is 0x1f2ae
../../sun4/vm/sfmmu.c:692: 'npages' is 0x1f2ae
../../sun4/vm/sfmmu.c:751: 'ktsb_base' is 0x1a00000
../../sun4/os/startup.c:1489: 'memlist_sz' is 0x2000
../../sun4/os/startup.c:1490: 'memspace' is 0x3000002a000
../../sun4/os/startup.c:1495: 'pp_base' is 0x70002000000
../../sun4/os/startup.c:1496: 'memseg_base' is 0x19ebc00
../../sun4/os/startup.c:1497: 'npages' is 0x1e611
../../sun4/os/startup.c:1505: 'availrmem' is 0x1ded0
SunOS Release 5.11 Version onnv-gate:2005-06-22 64-bit
Copyright 1983-2005 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
DEBUG enabled
misc/forthdebug (494488 bytes) loaded
../../sun4/vm/sfmmu.c:170: 'ktsb_pbase' is 0x3f600000
../../sun4/vm/sfmmu.c:171: 'ktsb4m_pbase' is 0x3f54e000
../../sun4/vm/sfmmu.c:370: 'translen' is 0x4000
Hostname: pacha
SUNW,eri0 : 100 Mbps full duplex link up
NIS domain name is austincampus.Central.Sun.COM
TSI: gfxp0 is GFX8P @ 1152x900

pacha console login:

Happy hacking.


Saturday Jun 18, 2005

Nice job, guys!

Congrats to the SchilliX team on getting the first OpenSolaris distro up and running! This is very exciting news.

Wednesday Jun 15, 2005

Dude, where'd those CPU cycles go?

Recently I worked on the (trivial) fix for 6276048 which is one of those bugs that is at the same time slightly amusing and slightly depressing. ;)

At my urging, the SSD performance team out east was looking into the costs of doing TLB shootdowns on large systems. In modern systems, the two biggest costs of virtual memory are TLB misses, and keeping the global view of each process' address space in sync across all of the processors (though it isn't particularly relevant, the machine they were using is a fairly good sized domain of a SunFire 15K, which is a big machine, so there are a lot of processors to keep in sync). The latter operation, better known to VM geeks as TLB shootdowns, was the subject of the investigation, and was a rathole being visited as part of a bigger ongoing investigation into the overhead of CPU cross calls in big SPARC boxes.

While poring over the kernel profile data from one of the runs, I noticed that the probe rate for the TLB shootdown code in the HAT layer (over in hat_sfmmu.c) was *significantly* higher than the probe rate for the cross call routines (over in x_call.c). This immediately rang some alarm bells -- the cost of the TLB shootdown should be constrained by the overhead in doing the cross calls themselves, especially considering the size of the system involved (around 32 CPUs, as I recall)!

So, being the diligent bees that they are, Dave Valin and his team dug in deeper with DTrace, coupled with a few home-grown analysis tools on the back end, and came up with an answer... after just a few hours, my phone rang, and I was staring at this code in disbelief:

#define SFMMU_XCALL_STATS(cpuset, ctxnum)                               \
{                                                                       \
                int cpui;                                               \
                for (cpui = 0; cpui < NCPU; cpui++) {                   \
                        if (CPU_IN_SET(cpuset, cpui)) {                 \
                                if (ctxnum == KCONTEXT) {               \
                                        SFMMU_STAT(sf_kernel_xcalls);   \
                                } else {                                \
                                        SFMMU_STAT(sf_user_xcalls);     \
                                }                                       \
                        }                                               \
                }                                                       \
}

That's right, we were spending half our time updating some statistics! For reasons I won't go into here, NCPU on the SunFire high-end machines is quite large -- more than 512 -- so this loop was taking quite a long time to execute, and was actually dominating the time spent doing TLB shootdowns. The fix was unbelievably simple -- delete the for() loop construct in the macro. :)
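To make the cost concrete, here is a user-space sketch in C of the before and after. Everything in it is a stand-in: the bitmap cpuset, the plain counters in place of SFMMU_STAT(), the loop_iterations counter (which exists only to expose the work done), and NCPU set to 512 purely for illustration. The "new" version is my paraphrase of "delete the for() loop", not a copy of the actual putback, and note that it counts one event per shootdown rather than one per target CPU.

```c
#include <assert.h>
#include <limits.h>

#define NCPU		512
#define KCONTEXT	0
#define WORD_BITS	(sizeof (unsigned long) * CHAR_BIT)

/* Plain bitmap standing in for the kernel's cpuset_t. */
typedef struct {
	unsigned long bits[NCPU / WORD_BITS];
} cpuset_sketch_t;

unsigned long sf_kernel_xcalls;
unsigned long sf_user_xcalls;
unsigned long loop_iterations;	/* instrumentation for this sketch only */

void
cpuset_add(cpuset_sketch_t *set, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	set->bits[cpu / WORD_BITS] |= 1UL << (cpu % WORD_BITS);
}

int
cpu_in_set(const cpuset_sketch_t *set, int cpu)
{
	return ((int)((set->bits[cpu / WORD_BITS] >> (cpu % WORD_BITS)) & 1UL));
}

/* Old behavior: walk every possible CPU id on every shootdown. */
void
xcall_stats_old(const cpuset_sketch_t *set, int ctxnum)
{
	int cpui;

	for (cpui = 0; cpui < NCPU; cpui++) {
		loop_iterations++;
		if (cpu_in_set(set, cpui)) {
			if (ctxnum == KCONTEXT)
				sf_kernel_xcalls++;
			else
				sf_user_xcalls++;
		}
	}
}

/* Loop removed: constant work no matter how large NCPU is. */
void
xcall_stats_new(int ctxnum)
{
	if (ctxnum == KCONTEXT)
		sf_kernel_xcalls++;
	else
		sf_user_xcalls++;
}
```

With only two CPUs in the set, the old version still makes NCPU trips around the loop to bump a counter twice, and it does so on every single shootdown.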

As Bryan reminds us, there's a lesson to be learned in every bug, and usually more to the bug than is immediately apparent, so one should keep digging even after finding the root cause, until the real underlying systemic problem that led to the bug is found. So, wondering how this one got by us for so long, I started asking around, and discovered that, not surprisingly, this macro was introduced a long, LONG time ago, when NCPU was really, really small and insignificant. Add to this the fact that we had little observability into the system at all until DTrace came along, which is still a relatively recent innovation, and you have a hidden performance problem. In any big, complex system like the Solaris kernel there are bound to be dusty corners which no one has visited for years, and whose design constraints are long overdue for an overhaul. The lesson I took away from this one is that, when examining the source for dusty corners, one must examine the macros and not just the source -- for what is hidden in the macros may just be the most important part of the function!


Tuesday Jun 14, 2005

The secret sauce myth -- exposed

Dangers of having the source (and the "secret sauce" myth exposed)

The other day, there was a good internal thread going on one of the bigger internal Sun mailing lists, and it reminded me of an important point that I think is worth sharing with the outside world -- particularly now that OpenSolaris has landed. Having access to the source can be dangerous to developers if you're trying to develop stable, forward-compatible software for your users. The particular thread was centered around the use of private members of the vnode in a third party filesystem -- but it really could apply to development of any software on any system where folks have unrestricted access to the source code.

There is a myth that so-called "private" interfaces in Solaris are private simply because Solaris is a proprietary operating system, and Sun does not want some external software developer having access to our mythical secret sauce. It goes on -- furthermore, Sun intentionally presents these interfaces cryptically in the header files, and does not document them, simply because they provide a better way of doing something than the documented ones. Sun does this so that our software will somehow be better than anything anyone else can write, giving us a competitive edge, so we can keep selling our proprietary system running on our proprietary hardware.


Spin again!

Solaris provides an interface stability taxonomy which defines certain levels of stability for the programming interfaces which are presented to developers (this taxonomy is documented in the attributes(5) man page). The interface stability taxonomy provides classifications which make it clear what the commitment level of Sun is to maintain backward compatibility with any programs that use that interface in the future.

  • Standard means the interface is defined by one of the various standards Solaris supports (e.g. POSIX) and hence cannot change as long as Solaris claims to support that standard
  • Stable interfaces are guaranteed not to change incompatibly until the next major release. To put things into perspective, the last major release of SunOS was from 4.x to 5.0, which as you all know was a LONG time ago!
  • Evolving and Unstable interfaces may change incompatibly only in a minor release. Solaris 10 was a minor release, as was Solaris 9; Solaris 10 quarterly updates and 2.5.1 are examples of micro releases, which are not allowed to change these interfaces incompatibly.
A good rule to know is that if the interface has a man page, and there is no release taxonomy info at the bottom of the man page, the interface stability is Stable. If there isn't a man page, then the interface may be a private interface, or it may be an implementation artifact (I don't believe there is any way to tell which from outside of Sun yet, but I wouldn't be surprised if there is one on the OpenSolaris website soon). Private interfaces and implementation details of the system may change in micro releases and patches -- wherein lies the hidden danger!

When developing software, if you want your software to run on future releases of Solaris, you must be careful to use only interfaces which have a suitable commitment level. Interfaces you run across while looking through the source code or headers, if not documented, aren't guaranteed to work in the future -- and are very likely NOT to work in the future. This means your program will stop working if Sun or the OpenSolaris community decides to change that interface for any reason, at any time, even in a patch. Which, needless to say, will cause headaches for the users of your software!

Getting back to the thread, Jim Carlson summed things up quite well, when he stated thus:

...we[1] reserve the right to change these interfaces in whatever way we                      
choose, at any time at all (including in patches), without notice of                 
any kind, and without any attempt to preserve any sort of                            
Thus, if you depend on private interfaces by writing your own software               
that uses them, then you'll end up hurting yourself.                                 
This is roughly the equivalent of a "no user serviceable parts inside"               
warning on an appliance.  It doesn't mean that you can't open it up                  
and poke around inside if you really know what you're doing.  But if                 
you do, and you end up electrocuting yourself or starting a fire, you                
have nobody else to blame.                                                           
[1] "We" in the above sentence means "everyone working on the code."                 
    If that turns out to be a community-based effort, then it's the                  
    members of that community who own and direct it.                                 

Don't believe the myth, and don't electrocute your customers -- it's not worth it. Now that Solaris is opened up, if there isn't a way to do what you want to do, you can create one (or ask the community to help)!


Page Fault Handling in Solaris

Welcome to OpenSolaris! In this entry, I'll walk through the page fault handling code, which is ground zero of the Solaris virtual memory subsystem. Due to the nature of this level in the system, part of the code (the lowest level that interfaces to hardware registers) is machine dependent, while the rest is common code written in C. Hence, I will present this topic in three parts: x64 machine dependent code, which has the most hardware handling for TLB misses, followed by the more complex SPARC machine dependent code, which relies on assembly code to handle TLB misses from trap context; I'll wrap up by covering the common code which is executed from kernel context.

Part 1: x64 Machine Dependent Layer

Since all x86-class machines handle TLB misses using a hardware page table walk mechanism, the Hardware Address Translation, or HAT, layer for x64 systems is the less complex of the two system architectures Solaris currently supports. Both the x86 and AMD64 systems use a page directory scheme to map per-address-space virtual memory addresses to physical memory addresses. When a TLB miss occurs, the MMU (memory management unit) hardware searches the page table for the page table entry (PTE) associated with the virtual address of the memory access, if one exists. In the page directory model, the virtual address is divided up into several parts; each successive part of the virtual address forms an index into each successive level in the directory, while each higher-level directory entry points to the address in memory of the next lower directory. Each directory table is 4K in size, which corresponds to the base page size of the processor. The pointer to the top-level page directory is programmed into the cr3 hardware register on context switch.

  [Figure: Directory-based page tables]
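The way a virtual address is carved into per-level indices can be shown with a short C sketch. It assumes the 64-bit case with 4K pages: four directory levels, each 4K table holding 512 eight-byte entries (hence 9 index bits per level), plus a 12-bit offset into the page. The 32-bit layouts differ, and the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PT_LEVELS		4	/* directory levels walked by the MMU */
#define PT_INDEX_BITS		9	/* 512 entries per 4K table */
#define PAGE_OFFSET_BITS	12	/* 4K base page */

struct va_parts {
	unsigned int index[PT_LEVELS];	/* index[0] is the top level */
	unsigned int offset;		/* byte offset within the page */
};

/* Split a virtual address into its per-level table indices. */
struct va_parts
va_split(uint64_t va)
{
	struct va_parts p;
	int level;

	/* 4 levels x 9 bits + 12 offset bits = 48 translated bits. */
	assert(PT_LEVELS * PT_INDEX_BITS + PAGE_OFFSET_BITS == 48);

	for (level = 0; level < PT_LEVELS; level++) {
		int shift = PAGE_OFFSET_BITS +
		    PT_INDEX_BITS * (PT_LEVELS - 1 - level);

		p.index[level] = (unsigned int)
		    ((va >> shift) & ((1u << PT_INDEX_BITS) - 1));
	}
	p.offset = (unsigned int)(va & ((1u << PAGE_OFFSET_BITS) - 1));
	return (p);
}
```

The hardware walk starts at the table named by cr3, uses index[0] to find the next table, and repeats until index[3] selects the PTE itself; a missing entry at any level ends the walk with a page fault.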

Since we're discussing the page fault path in this blog entry, we are interested in the case where the processor fails to find a valid PTE in the lowest level of the directory. This results in a page fault exception (#pf), which passes control synchronously to a page fault handler in trap context. This low-level handler is pftrap(), located in exception.s. The handler jumps to cmntrap() over in locore.s which pushes the machine state onto the stack, switches to kernel context, and invokes the C kernel-side trap handler trap() in trap.c with a trap type of T_PGFLT. The trap() routine figures out that this is a user fault since the faulting address lies below KERNELBASE, and calls pagefault() in vm_machdep.c. The pagefault() routine collects the necessary arguments for the common as_fault() routine, and passes control to it.

For more information regarding the x64 HAT layer, refer to Joe Bonasera's blog where he has started blogging about this subsystem which he and Nils Nieuwejaar redesigned from the ground up for the AMD64 port in Solaris 10.

Part 2: SPARC Machine Dependent Layer

The UltraSPARC architecture -- the only SPARC architecture currently supported by Solaris -- relies entirely on software to handle TLB misses[1]. Hence, the HAT layer for SPARC is a bit more complex than the x64 one. To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism[2] called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after retrieval from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss combined with the contents of a TSB base address register (which is pre-programmed on context switch) to calculate the pointer into the TSB of the entry corresponding to the virtual address. If the TSB entry tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler, and the trapped instruction is retried. See DTLB_MISS() in trap_table.s and sfmmu_udtlb_slowpath in sfmmu_asm.s. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler[3].

The SPARC HAT layer (named sfmmu after the codename spitfire MMU, the first UltraSPARC MMU supported) uses an open hashing technique to implement the page tables in software. The hash lookup is performed using the struct hat pointer for the currently running process and the virtual address of the TLB miss. On a TSB miss, the function sfmmu_tsb_miss_tt in sfmmu_asm.s searches the hash for successive page sizes using the GET_TTE() assembly macro. If a match is found, the TTE is inserted into the TSB, loaded into the TLB, and the trapped instruction is re-issued. If a match is not found, or the access type does not match the permitted access for this mapping (e.g. a write is attempted to a read-only mapping) control is transferred to the sys_trap() routine in mach_locore.s after setting up the appropriate fault type. The sys_trap() routine (which is very involved due to SPARC's register windows) saves the machine state to the stack, switches from trap context to kernel context, and invokes the kernel-side trap handler in C, trap() over in trap.c. The trap() routine recognizes the T_DATA_MMU_MISS trap code and branches to pagefault() in vm_dep.c. As its x64 counterpart does, pagefault() collects the appropriate arguments and invokes the common handler as_fault().

For more information about the sfmmu HAT layer, keep coming back -- this subsystem warrants a more in-depth tour in future blogs.

Part 3: Common Code Layer

The Solaris virtual memory (VM) subsystem uses a segmented model to map each process' address space, as well as the kernel itself. Each segment object maps a contiguous range of virtual memory with common attributes. The backing store for each segment may be device memory, a file, physical memory, etc. Each backing store type is handled by a different segment driver. The most commonly used segment driver is seg_vn, so named because it maps vnodes associated with files. Perhaps more interestingly, the seg_vn segment driver is also responsible for implementing anonymous memory, so called because it is private to a process and is backed by swap space rather than by a file object. Since seg_vn maps the majority of a process' address space, including all text, heap, and stack, I'll use it to illustrate the most common page fault path encountered by a process[4].

Returning to the page fault path, assume that the page fault being examined has occurred in a virtual address range that corresponds to a process heap -- for instance, the first touch of new memory allocated by a brk() system call performed by the C library's malloc() routine. Such a fault will allocate process private, anonymous memory which is pre-filled with zeros, known to VM geeks as a ZFOD fault -- short for zero fill on demand. In such a situation, the as_fault() routine (vm_as.c) will search the process' segment tree looking for the segment that maps the virtual address range corresponding to the fault. If as_fault() discovers that no such segment exists, a fatal segmentation violation is signalled to the process causing it to terminate. In our example, a segment is found whose seg_ops corresponds to segvn_ops (seg_vn.c). The SEGOP_FAULT() macro is called, which invokes the segvn_fault() routine in seg_vn.c. In our example, the backing store is swap, so segvn_faultpages() will find there is no vnode backing this range, but rather an anon object and will allocate a page to back this virtual address through anon_zero() in vm_anon.c.
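The dispatch just described (find the segment, then call through its ops vector) can be modeled in a few lines of C. This is a deliberately stripped-down sketch: the real as_fault() searches a segment tree and deals with locking, fault types and sizes, whereas this one scans a small array, and every name carrying a _sketch suffix is invented here rather than taken from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fault_result { FAULT_OK = 0, FAULT_NOMAP = -1 };

struct seg;

/* One-entry ops vector standing in for the segment driver interface. */
struct seg_ops_sketch {
	enum fault_result (*fault)(struct seg *seg, uintptr_t addr);
};

struct seg {
	uintptr_t base;			/* first address mapped */
	size_t size;			/* length of the mapping in bytes */
	const struct seg_ops_sketch *ops;
	unsigned long zfod_faults;	/* bookkeeping for this sketch */
};

struct as_sketch {
	struct seg *segs;
	size_t nsegs;
};

/* Stand-in for the seg_vn fault routine: pretend to zero-fill a page. */
enum fault_result
segvn_fault_sketch(struct seg *seg, uintptr_t addr)
{
	(void) addr;
	seg->zfod_faults++;
	return (FAULT_OK);
}

const struct seg_ops_sketch segvn_ops_sketch = { segvn_fault_sketch };

/* Find the segment mapping addr, or NULL if nothing maps it. */
struct seg *
as_segat_sketch(struct as_sketch *as, uintptr_t addr)
{
	size_t i;

	for (i = 0; i < as->nsegs; i++) {
		struct seg *s = &as->segs[i];

		if (addr >= s->base && addr - s->base < s->size)
			return (s);
	}
	return (NULL);
}

/* Route the fault to whichever driver owns the faulting address. */
enum fault_result
as_fault_sketch(struct as_sketch *as, uintptr_t addr)
{
	struct seg *seg = as_segat_sketch(as, addr);

	if (seg == NULL)
		return (FAULT_NOMAP);	/* the process would get SIGSEGV */
	assert(seg->ops != NULL && seg->ops->fault != NULL);
	return (seg->ops->fault(seg, addr));
}
```

The indirection through the ops vector is the whole point: the common code never needs to know whether the address is backed by a file, a device, or swap, because the segment driver answers that question.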

Here is a sample callstack into anon_zero() as viewed from DTrace on my workstation (which is a dual-CPU Opteron running the 64-bit Solaris kernel):
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

Here is another sample callstack into anon_zero(), this time from an Ultra-Enterprise 10000:
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

Note that, in both cases, we can only see back as far as where we switched to kernel context, since trap context uses only registers or scratch space for its work and does not save traceable stack frames for us.

For those of you who wish to trace the path of a process' zero fill page faults from beginning to end, you may do so quite easily by running this DTrace script, as root. The script takes one argument, which is the exec name of the binary to trace. I recommend a simple one like "ls" since it is relatively small and short lived.

[1] The topic of the details of SPARC TLB handling is one that will take many blog entries to cover from beginning to end, so I'm skipping over many of the details here for now. For the impatient, pick up a copy of Solaris Internals by Jim Mauro and Richard McDougall (ISBN 0-13-022496-0); though much of the material is dated now, many of the details are still accurate.

[2] This TSB mechanism could be employed by the hardware for a little extra effort. No current sun4u systems do so, but some future systems may support the TSB lookup in hardware.

[3] I'm skipping a step here for the sake of brevity -- there are actually two TSB searches in the case of a process which is using large pages, since a separate 4M-indexed TSB is kept for large pages. If the process is using 4M or larger pages, the second TSB must be searched also prior to a TSB miss. This second search is performed using a software generated TSB index, since the hardware assist only generates an 8K-indexed TSB pointer into the first TSB. See sfmmu_udtlb_slowpath() in the source if you care to see what really happens... Go on, you really have the source now, so no excuses :)

[4] In some ways, this is unfortunate because the seg_vn segment driver is the most complicated of all the segment drivers in the Solaris VM subsystem, and as such has a very steep learning curve. Within Sun, we often joke that nobody understands how it all works, as it has evolved over a period of many, many years, and all of the original implementors have since moved on or are now part of Sun's upper management. While the spirit of the code hasn't changed significantly from the original SVR4 code, much of the complexity added over the years has evolved to support modern features like superpages that were not anticipated in the original design. This can make for a few twists and turns in the source even for following the path of a simple example like our ZFOD fault.

Tuesday Jun 07, 2005

A Linux vs (Open)Solaris review

A friend sent this link to what I think is a very well written, (mostly) impartial comparison of OpenSolaris vs RedHat and SuSE Linux. I personally agree with many of the author's conclusions, including some of his criticisms -- in particular, Sun has a major challenge ahead to build a strong open community, and gain buy-in from the major IHVs and ISVs on OpenSolaris as a platform.

It *is* all about building a community! From what I've seen within Sun to date, I can say that many, many folks are working very passionately to make sure that OpenSolaris will be a success, and are just as impatient as everyone else when it comes to waiting for the full source to go live... As a company, we are taking the need to build the community very seriously, and are quite excited about the opportunities that lie ahead. I can't say a lot more, but I will say when OpenSolaris does fully launch (it's coming.. I promise ;)) this will become quite evident.
