Monday Nov 29, 2004

Jonathan on OpenSolaris

As Alan points out, Jonathan had a few comments about OpenSolaris in an interview in ComputerWorld. One interesting bit is an "official" timeline:

We will have the license announced by the end of this calendar year and the code fully available [by the] first quarter of next year.

I can tell you the OpenSolaris team is working like gangbusters to make this a reality. As soon as there is an exact date, you can bet that we'll be making some noise. Jonathan also made a statement that seems to directly conflict with my blog post yesterday:

Is there anything preventing you from making all of Solaris open-source? Nothing at all. And let me repeat that. Nothing at all.

That's a little different than my statement that "there are pieces of Solaris that cannot be open sourced due to encumbered licenses." The point here is that there are no wide-ranging problems (patents, System V code, SCO, Novell) preventing us from opening all of Solaris. Nor is there any internal pressure to keep some parts of Solaris secret. The "pieces" that I mentioned are few and far between - a single file, a command, or maybe a driver. We're also securing rights to these pieces on a monthly basis, so the number keeps dropping (and may reach zero by the time OpenSolaris goes live).

The other point is that we won't be releasing a crippled system. Even if some pieces are not immediately available in source form, we will make sure that anyone can build the same version of Solaris that we do. We hope that these pieces can be re-written and released under the OpenSolaris license as quickly as possible.

Sunday Nov 28, 2004

More on OpenSolaris

So Novell has made a vague threat of litigation over OpenSolaris, which was prompty spun by the media into a declaration of war by Novell. But it has generated quite a bit of discussion over at OSNews. The claim itself is largely FUD (the "article" is little more than gossip), and the discussion covers a wide range of (often unrelated) topics. But I thought I'd pick out a few of the points that keep coming up and address them here, as the article over at LWN seems to make some of the same mistakes and assumptions.

Sun does not own sysV, and therefore cannot legally opensource it

No one can really say what Sun owns rights to. Unless you have had the privilege of reading the many contracts Sun has (which most Sun employees haven't, myself included), it's presumptuous to state definitively what we can or cannot do legally. We have spent a lot of money acquiring rights to the code in Solaris, and we have a large legal department that has been researching and extending those rights for a very long time. Novell thinks they own rights to some code that we may or may not be using, but I trust that our legal department (and the OpenSolaris team) has done due diligence in ensuring that we have the necessary rights to open source Solaris.

Sun has been "considering" open sourcing solaris for about five years now. It's all just a PR stunt.

I can't emphasize enough that this is not a PR stunt. We have a dozen engineers working full-time getting OpenSolaris out the door. We have fifty external customers participating in the OpenSolaris pilot. We have had discussions with dozens of customers and ISVs, as well as presenting at numerous BOFs across the country. This will happen. Yes, it has taken five years - there's a lot of ground to cover when open sourcing 20 years of OS development.

Even if it is open source it still is proprietary, because no one can modify its code and can't make changes, all one can do is watch and suggest to Sun.

We have already publicly stated that our goal is to build a community. There is zero benefit to us throwing source code over the wall as a half-hearted guesture towards the open source community. While it may not happen overnight, there will be community contributions to OpenSolaris. We want the responsibility to rest outside of Sun's walls, at which point we become just another (rather large) contributor to an open source project.

However, the company has not yet announced a license, whether the license will be OSI-compliant or exactly how much of Solaris 10 will be under this open source license.

We have not announced a license, but we have also stated numerous times that it will be OSI-compliant. We know that using a non-OSI license will kill OpenSolaris before it leaves the gate. As to how much of Solaris will be released, the answer is "everything we possibly can." There are pieces of Solaris that cannot be open sourced due to encumbered licenses. But time and again people suggest that we will open source "everything but the crown jewels" - as if we could open source everything but DTrace, or everything but x86 support. Every decision is made based on existing legal agreements, not some misguided attempt to create a crippled open source port.




OpenSolaris is still under development - some of the specifics (licensing, governance model, etc) are still being worked out. All of us are involved one way or another in the future of OpenSolaris. Our words may not carry the "official" tag associated with a press release or news conference, but we're the ones working on OpenSolaris every single day. All of this will be settled when OpenSolaris goes live (as soon as we have a date we'll let you know). Until then, we'll keep trying to get the message out there. I encourage you to ignore your preconceived notions of Sun, of what has and has not been said in the media, and instead focus on the real message - straight from the engineers driving OpenSolaris.

Saturday Nov 27, 2004

Inside the cyclic subsystem

A little while ago I mentioned the cyclic subsystem. This is an interesting little area of Solaris, written by Bryan back in Solaris 8. It is the heart of all timer activity in the Solaris kernel.

The majority of this blog post comes straight from the source. Those of you inside Sun or part of the OpenSolaris pilot should check out usr/src/uts/common/os/cyclic.c. A precursor to the infamous sys/dtrace.h, this source file has 1258 lines of source code, and 1638 lines of comments. I'm going to briefly touch on the high-level aspects of the system; but as you can imagine, it's quite complicated.

Traditional methods

Historically, operating systems have relied a regular clock interrupt. This is different from the clock frequency of the chip - the clock interrupt typically fires every 10ms. All regular kernel activity was scheduled around this omnipresent clock. One of these activities would be to check if there any expired timeouts that need to be triggered.

This granular frequency is usually enough for average activities, but can kill realtime applications that require high-precision timing becaus it forces timers to align on these artificial boundaries. For example, imagine we need a timer to fire every 13 milliseconds. Rather than having the timers fire at 13, 26, 39, and 52 ms, we would instead see it fire at 20, 30, 40, and 60 ms. This is clealy not what we wanted. The result is known as "jitter" - timing deviations from the ideal. Timing granuality, scheduling artifacts, system load, and interrupts all introduce arbitrary latency into the system. By using existing Solaris mechanisms (processor binding, realime scheduling class) we could eliminate much of the latency, but we were still stuck with the granularity of the system clock. The frequency could be tuned up, but this would also increase the time spent doing other clock activity (such as process accounting), and induce significant load on the system.

Cyclic subsystem basics

Enter the cyclic subsystem. It provides for highly accurate interval timers. The key feature is that it is based on programmable timestamp counters, which have been available for many years. In particular, these counters can be programmed (quickly) to generate an interrupt at arbitrary and accurate intervals. Originally available only for SPARC, x86 support (based on programmable APICs) is now available in Solaris 10.

The majority of the kernel sees a very simple interface - you can add, remove, or bind cyclics. Internally, we keep around a heap of cyclics, organized by expiration time. This internal interface connects to a hardware-specific backend. We pick off the next cyclic to process, and then program the hardware to notify us after the next interval. This basic layout can be seen on any system with the ::cycinfo -v dcmd:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp uhci usba nca crypto random lofs nfs ptm ipc ]
> ::cycinfo -v
CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
  0 d2da0140  online      4 d2da00c0   50d7c1308c180 apic_redistribute_compute

                                       1                                      
                                       |                                      
                    +------------------+------------------+                   
                    0                                     3                   
                    |                                     |                   
          +---------+--------+                  +---------+---------+         
          2                                                                   
          |                                                                   
     +----+----+                                                              
  
      ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
  d2da00c0   0    1 high     0   50d7c1308c180   10000 cbe_hres_tick
  d2da00e0   1    0  low     0   50d7c1308c180   10000 apic_redistribute_compute
  d2da0100   2    3 lock     0   50d7c1308c180   10000 clock
  d2da0120   3    2 high     0   50d7c35024400 1000000 deadman
>

On this system there are no realtime timers active, so the intervals (USECINT) are pretty boring. You may notice one elegant feature of this implementation - the clock() function is now just a cyclic consumer. If you're wondering what 'deadman' is, and why it has such a high interval, it's a debugging feature that saves the system from hanging indefinitely (most of the of the time). Turn it on by adding 'set snooping = 1' in /etc/system. If the clock cannot make forward progress in 50 seconds, a high level cyclic will fire and we'll panic.

To register your own cyclic, use the timer_create(3RT) function with the CLOCK_HIGHRES type (assuming you have the PROC_CLOCK_HIGHRES privilege). This will create a low level cyclic with the appropriate timeout. The average latency is extremely small when done properly (bound to a CPU with interrupts disabled) - on the order of a few microseconds on modern hardware. Much better than the 10 millisecond artifacts possible with clock-based callouts.

More details

At a high level, this seems pretty straightforward. Once you figure out how to program the hardware, just toss some function pointers into an AVL tree and be done with it, right? Here are some of the significant wrinkles in this plan:

  • Fast operation - Because we're dispatching real-time timers, we need to be able to trigger and re-schedule cyclics extremely quickly. In order to do this, we make use of per-CPU data structures, heap management that touches a minimal number of cache lines, and lock-free operation. The latter point is particularly difficult, considering the presence of low-level cyclics.

  • Low level cyclics - The cylic subsystem operates at a high interrupt level. But not all cyclics should run at such a high level (and very few do). In order to support low level cyclics, the subsystem will post a software interrupt to deal with the cyclic at a lower level interrupt. This opens up a whole can of worms, because we have to guarantee a 1-to-1 mapping, as well as maintain timing constraints.

  • Cyclic removal - While rare, it is occasionally necessary to remove pending cyclics (the most common occurence is when unloading modules with registered cyclics). This has to be done without disturbing the other running cyclics.

  • Resource resizing - The heap, as well as internal buffers used for pending lowlevel cyclics, must be able to handle any number of active cyclics. This means that they have to be resizable, while maintaining lock-free operation in the common path.

  • Cyclic jugging - In order to offline CPUs, we must be able to re-schedule cyclics on other active CPUs, without missing a timeout in the process.

As you can see, the cyclic subsystem is a complicated but well-contained subsystem. It uses a well-organized layout to expose a simple interface to the rest of the kernel, and provides great benefit to both in-kernel consumers and timing-sensitive realtime applications.

Monday Nov 22, 2004

Sun and Linux

If you haven't already, go check out Tom Adelstein's article at LXer, as well as Jim's blog entry. Quite a departure from the usual Sun analysis. It's an interesting take: Sun is (and has been) doing great things, but corporate messaging (both our competitors and our own) has muddied the waters to the point where most people don't know what we're about anymore. I also like the view that Linux and Solaris can and should live in harmony - I'm tired of the "all for one and one for all" attitude when it comes to Linux. It'll be interesting to see if OpenSolaris will sway public opinion, or whether we're too firmly cast into the role of evil empire.

Sunday Nov 21, 2004

Dual Core Opterons

So it's no secret that AMD and Intel are in a mad sprint to the finish for dual-core x86 chips. The offical AMD roadmap, as well as public demos have all shown AMD well on track. The latest tidbits of information indicate Linux is up and running on these dual-core systems. Very cool.

Given our close relationship with AMD and the sensitive nature of hardware plans, I'll refrain from saying what we may or may not have running in our labs. But Solaris has some great features that make it well-suited for these dual core chips. First of all, Solaris 10 has had support for both Chip Multi Threading (hyperthreading) and Chip Multi Processing (multi core) for about a year and half now. Solaris has also been NUMA-aware for much longer (with the current lgroups coming in mid-2001, or Solaris 9). I'm sure AMD has made these cores appear as two processesors for legacy purposes, but with a little cpuid tweaks, we'll see them as sibling cores and get all the benefits inherent in Solaris 10 CMP.

Despite this, the NUMA system in Solaris is undergoing drastic change due to the Opteron memory architecture. While Solaris is NUMA-aware, it uses a simplistic memory heirarchy based on the physical architecture of Sun's high end SPARC systems. We have the notion of a "locality group", which represents the logical relationship of CPUs and memory. Currently, there are only two notions of locality - "near" and "far". Solaris tries its best to keep logically connected memory and processes in the same locality group. On Opteron, things get a bit more complicated due to the integrated memory controller and HyperTransport layout. On 4-way machines the processors are laid out in a square, and on 8-way machines we have a ladder formation. Memory transfers must pass through neighboring memory controllers, so now memory could be "near", "far", or "farther". We're revamping the current lgroup system to support arbitrary memory heirachies, which should produce some nice performance gains on 4- and 8-way Opteron machines. Hopefully one of the NUMA folks will blog some more detailed information once this project integrates.

In conclusion: Opterons are cool, but dual-core Opterons are cooler. And Solaris will rip on both of them.

Debugging on AMD64 - Part Three

Given that the amd64 ABI is nearly set in stone, and (as pointed out in comments on my last entry) future OpenSolaris ports could run into similar problems on other architectures (like PowerPC), you may wonder how we can make life easier in Solaris. In this entry I'll elaborate on two possibilities. Note that these are little more than fantasies at the moment - no real engineering work has been done, nor is there any guarantee that they will appear in a future Solaris release.

DWARF Support for MDB

Even though DWARF is a complex beast, it's not impossible to write an interpreter. It's just a matter of doing the work. The more subtle problem is designing it correctly, and making the data accessible in the kernel. Since MDB and KMDB are primarily kernel or post-mortem userland tools, this has not been a high priority. CTF gives us most of what we need, and including all the DWARF information in the kernel (or corefiles) is prohibitively expensive. That being said, there are those among us that would like to see MDB take a more prominent userland role (where it would compete with dbx and gdb), at which point proper DWARF support would be a very nice thing to have.

If this is done properly, we'll end up with a debugging library that's format-independent. Whether the target has CTF, STABS, or DWARF data, MDB (and KMDB) will just "do the right thing". No one argues that this isn't a cool idea - it's just a matter of engineering resources and business justification.

Programmatic Disassembler

The alternative solution is to create a disassembler library that understands code at a semantic level. Once you have a disassembler that understands the logical breakdown of a program, you can determine (via simulation) the original argument values to functions. Of course, it's not always guaranteed to work, but you'll always know when you're guessing (even DWARF can't be correct 100% of the time). This requires no debugging information, only the machine text. It will also help out the DTrace pid provider, which has to wrestle with jump tables and other werid compiler-isms. Of course, this is monumentally more difficult than a DWARF parser - especially on x86.

This idea (along with a prototype) has been around for many years. The converted have prophesized that libdis will bring peace to the world and an end to world hunger. As with many great ideas, there just hasn't been justification for devoting the necessary engineering resources. But if it can get the arguments to functions on amd64 correct in 98% of the situations, it would be incredibly valuable.

OpenSolaris Debugging Futures

There are a host of other ideas that we have kicking around here in the Solaris group. They range from pretty mundance to completely insane. As OpenSolaris finishes getting in gear, I'm looking forward to getting these ideas out in the public and finding support for all the cool possibilities that just aren't high enough priority for us right now. The existence of a larger development community will also make good debugging tools a much better business proposition.

Saturday Nov 20, 2004

RSS-friendly code samples?

So in the past two days, my posts have contained a lot of code/text samples, that have to be formatted in a fixed with font and have their spacing preserved. Previously, I've just been using <pre></pre> tags around these samples. This work on blogs.sun.com, but I've found that this wreaks havoc with some RSS readers because the whitespace is not preserved. The newlines go missing, or whitespace disappears from the beginning of lines.

In an effort to be more RSS-friendly, I wrote a script that goes through and replaces spaces with &nbsp;, puts <br/> at the end of each line, and encloses the whole thing in <tt></tt> tags. The result is extremely ugly, but it seems to get the job done. I'm wondering, is this the best way to accomplish this? I'm not too familiar with RSS, so if anyone out there knows a better way that works for all varieties of RSS readers, please let me know. My googling abilities have yet to turn up anything...

Debugging on AMD64 - Part Two

Last post I talked about one of the annoying features of the amd64 ABI - the optional frame pointer. Today, I'll examine the much more painful problem of argument passing on amd64. For sake of discussion, I'll avoid structure passing and floating point - nasty little kinks in the problem.

Argument Passing on i386

On i386, all arguments are passed on the stack. Before establishing a frame, the caller pushes each argument to the function in reverse order. This gives you this stack layout:

...
arg1
arg0
return PC
previous frame
%ebp

current frame

%esp

If you want to access the third argument, you simply reference 16(%ebp) (8 for the frame + 8 to skip first two args). This makes debugging a breeze. For any given frame pointer (easy to find thanks to the i386 ABI), we can always find the initial arguments to the function. Another trick we use is that nearly every function call is followed by a addl x, %esp instruction. Using this information, we can figure out how many arguments were passed to the function, without relying on CTF or STABS data. Putting this all together, it's easy to get a meaningful stack trace:

        > a76de800::findstack -v
        stack pointer for thread a76de800: a77c5dd4
        [ a77c5dd4 0xfe81994d() ]
          a77c5dec swtch+0x1cb()
          a77c5e10 cv_wait_sig+0x12c(a78a79b0, a6c57028)
          a77c5e70 cte_get_event+0x4d()
          a77c5ea4 ctfs_endpoint_ioctl+0xc2()
          a77c5ec4 ctfs_bu_ioctl+0x2f()
          a77c5ee4 fop_ioctl+0x1e(a79a7980, 63746502, 80d3f48, 102001, a69daf08, a77c5f74)
          a77c5f80 ioctl+0x19b()
          a77c5fac sys_call+0x16e()

Arguments Passing on AMD64

Enter amd64. As previously mentioned, the amd64 ABI was designed primarily for for performance, not debugging. The architects decided that pushing arguments on the stack was expensive, and that with 16 general purpose registers, we might as use some of them to pass arguments. Specifically, we have:

arg0%rdi
arg1%rsi
arg2%rdx
arg3%rcx
arg4%r8
arg5%r9
argN8\*(N-4)(%ebp)

This is an disaster for debugging. Debugging tools that operate in-place (DTrace and truss) can get meaningful arguments, but cannot know how many there are. Tools which examine a stack trace (pstack, mdb) cannot get arguments for any frame. The arguments may or may not be pushed on the stack, or they could be lost completely. If we try to get a stack with arguments, we find:

        > ffffffff8af1c720::findstack -v
        stack pointer for thread ffffffff8af1c720: ffffffffb2a51af0
          ffffffffb2a51d00 vpanic()
          ffffffffb2a51d30 0xfffffffffe972ae3()
          ffffffffb2a51d60 exitlwps+0x1f1()
          ffffffffb2a51dd0 proc_exit+0x40()
          ffffffffb2a51de0 exit+9()
          ffffffffb2a51e40 psig+0x2bc()
          ffffffffb2a51ee0 post_syscall+0x7d5()
          ffffffffb2a51f00 syscall_exit+0x5d()
          ffffffffb2a51f10 sys_syscall32+0x1d8()

The solution

The solution, as envisioned by the amd64 ABI designers, is to rely on DWARF to get the necessary information. If you have ever read the DWARF spec, you know that it a gigantic, ugly beast - an interpreted language that can be used to mine virtually any debugging data in an abstract manner. The problem here is that it requires significantly more work than on i386, and it requires debugging information to be present in the target object.

Implementing a DWARF interpreter is technically quite doable. We even had one brave soul go so far as to implement a limited DWARF disassembler capable of grabbing arguments for functions. But it turns out that the sheer amount of data we would have to add to the kernel to enable this was prohibitive. The bloat would have pushed us past the limit of the miniroot, not to mention the increased memory footprint and necessary changes to krtld and KMDB. That's not to say we won't support it in userland some day.

The lack of an argument count is a less serious. DTrace doesn't need to know how many arguments there are. For the moment, truss simply shows the first 6 arguments always. But truss could be enhanced to use CTF and/or DWARF data to determine the number of arguments to a given function. But it probably won't happen any time soon.

Workaround

Given that there will be no solution to this problem any time soon, you may ask how one can do any kind of debugging at all. The answer is "painfully". I'll walk through an example of finding the arguments to a function, using the following stack:

        > ffffffff8356c100::findstack -v
        stack pointer for thread ffffffff8356c100: ffffffffb2bbdb10
        [ ffffffffb2bbdb10 _resume_from_idle+0xe4() ]
          ffffffffb2bbdb40 swtch+0xc9()
          ffffffffb2bbdb90 cv_wait_sig+0x170()
          ffffffffb2bbdc50 cte_get_event+0xb0()
          ffffffffb2bbdc70 ctfs_endpoint_ioctl+0x7e()
          ffffffffb2bbdc80 ctfs_bu_ioctl+0x32()
          ffffffffb2bbdc90 fop_ioctl+0xb()
          ffffffffb2bbdd70 ioctl+0xac()
          ffffffffb2bbde00 dosyscall+0x12b()
          ffffffffb2bbdf00 trap+0x1308()
        >

Let's say that we want to know the first argument to fop_ioctl(), which is a vnode. The first step is to look at the caller and see where the argument came from:

        > ioctl+0xac::dis -n 6
------> ioctl+0x8e:                     movq   0x10(%r12),%rdi
        ioctl+0x93:                     movq   0x1a0(%rax),%r8
        ioctl+0x9a:                     leaq   -0xcc(%rbp),%r9
        ioctl+0xa1:                     movq   %r15,%rdx
        ioctl+0xa4:                     movl   %r13d,%esi
------> ioctl+0xa7:                     call   +0xeed99 <fop_ioctl>
        ioctl+0xac:                     testl  %eax,%eax
        ioctl+0xae:                     movl   %eax,%ebx
        ioctl+0xb0:                     jne    +0x74    <ioctl+0x124>
        ioctl+0xb2:                     cmpl   $0x8004667e,%r13d
        ioctl+0xb9:                     je     +0x27    <ioctl+0xe0>
        ioctl+0xbb:                     movl   %r14d,%edi
        ioctl+0xbe:                     call   -0x1408e <releasef>

We can see that %rdi (the first argument) came from %r12. Looks like we lucked out - %r12 must be preserved by the function being called. So we look at fop_ioctl():

        > fop_ioctl::dis
        fop_ioctl:                      movq   0x40(%rdi),%rax
        fop_ioctl+4:                    pushq  %rbp
        fop_ioctl+5:                    movq   %rsp,%rbp
        fop_ioctl+8:                    call   \*0x28(%rax)
        fop_ioctl+0xb:                  leave
        fop_ioctl+0xc:                  ret

No dice. We can see that %r12 (as well as %rdi) is still active at this point. Let's keep looking:

        > ctfs_bu_ioctl::dis ! grep r12
        > ctfs_endpoint_ioctl::dis ! grep r12
        > cte_get_event::dis ! grep r12
        cte_get_event+0x13:             pushq  %r12
        cte_get_event+0x32:             movq   0x20(%rdi),%r12
        ...

Finally, we found a function that preserves %r12. Taking a closer look at cte_get_event():

        > cte_get_event::dis -n 8
        cte_get_event:                  pushq  %rbp
        cte_get_event+1:                movq   %rsp,%rbp
        cte_get_event+4:                pushq  %r15
        cte_get_event+6:                movl   %esi,%r15d
        cte_get_event+9:                pushq  %r14
        cte_get_event+0xb:              movq   %rcx,%r14
        cte_get_event+0xe:              pushq  %r13
        cte_get_event+0x10:             movl   %r9d,%r13d
        cte_get_event+0x13:             pushq  %r12

We can see that %r12 was pushed fourth after establishing the frame pointer. This would put it 32 bytes below %rbp for this frame. Remembering that what was really passed was 0x10(%r12), we can finally find our original argument:

        > ffffffffb2bbdc50-20/K
        0xffffffffb2bbdc30:             ffffffff8330ec88
        > ffffffff8330ec88+10/K
        0xffffffff8330ec98:             ffffffff83a5f600
        > ffffffff83a5f600::print vnode_t v_path
        v_path = 0xffffffff83978c40 "/system/contract/process/pbundle"

Whew. We can see that we have the proper vnode, since the path references a /system/contract file. And all it took was about 12 steps! You can see how this has become such a pain for us kernel developers. From the above example, you can see the approximate method is:

  1. Determine where the argument came from in the caller. Hopefully, you will find something that came from the stack, or one of the callee-saved registers (%r12-%r15). If not, look at the function and see if the argument was pushed on the stack or moved somewhere more permanent. This doesn't happen often, so it may be that your argument is lost forever.

  2. If the argument came from a callee-saved register, examine every function in the stack until you find one that saves the value.

  3. By this point, you've hopefully found a place where the value is stored relative to %ebp. Using the frame pointers displayed in the stack trace, fetch the value from the stack.

This is not always guaranteed to work, and is obviously a royal pain. In my next post, I'll go into some future ideas we have to make this (and other debugging) better.

Friday Nov 19, 2004

Debugging on AMD64 - Part One

The amd64 port of Solaris has been available (internally) for about a month and a half, and the rest of the group is starting to realize what those of us on the project team have known for a while: debugging on amd64 is a royal pain. The difficulty comes not from processor changes, but from design choices made in the AMD64 ABI. The ABI was designed primarily with performance in mind - debuggability and observability was largely an afterthought. There are two features of the ABI that really hurt debuggability. In this post I'll cover the less annoying of the two - look for another followup soon.

Frame Pointers

In the i386 ABI, you almost always have to establish a frame pointer for the current function (leaf routines being the exception). This gives you the familiar opening function sequence:

        pushl   %ebp
        movl    %esp, %ebp

And your frame ends up looking like this:

...
arg1
arg0
return PC
previous frame
%ebp

current frame

%esp

This is a restriction of the ABI, not the processor. You can cheat by using the -fomit-frame-pointer flag to gcc, but this is not ABI compliant (although some people still think it's a great idea).

The problem

With amd64, you would think that they would just keep this convention. At first glance it seems that way, until you find this little footnote in section 3.3.2:

The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.

On amd64, the frame pointer is explicitly optional. To make debugging somewhat easier, they provide a .eh_frame ELF section that gives enough information (in the form of a binary search table) to traverse a stack from any point. This is slightly better than DWARF, but still requires a lot of processing. The problem with this is that it unnecessarily restricts the context from which you can gather a backtrace. On i386, your stack walking function is something like:

      frame = %ebp
      while (not at top of stack)
             process frame
             frame = \*frame

Simple and straightforward. This omits a few nasty details like signal frames and #gp faults, but it's largely correct. On amd64, you now have to load the .eh_frame section, process it, and keep it someplace where you have easy access to it. While this doesn't sound so bad for gdb, it becomes a huge nightmare for something like DTrace. If you read a little bit of the technical details behind DTrace, you'll understand that probes execute in arbitrary context. You may be in the middle of handling an interrupt, in dispatcher or VM code, or processing a trap (although on SPARC, DTracing code that executes at TL > 0 is strictly verboten). This means that the set of possible actions is severely limited, not to mention performance-critical. In order to process a stack() directive on amd64, we would now have to do something like:

        frame = %ebp
        while (not at top of stack)
                process frame
                for (each module in the system)
                        next = binary search in .eh_frame
                        if (next)
                                frame = next
                if (frame not found)
                        frame = \*frame

Of course, you could maintain a merged lookup table for all modules on the system, but this is considerably more difficult and a maintenance nightmare. The real show stopper comes with the ustack() action. It is impossible, from arbitrary context within the kernel, to process the objects in userland and find the necessary debugging information. And unless we're using only the pid provider, there's no way to know a priori what processes we will need to examine via ustack(), so we can't even cache the information ahead of time.

The solution

What do we do in Solaris? We punt. Our linkers will happily process .eh_frame sections correctly, but our debugging tools (DTrace, mdb, pstack, etc) will only understand executables that use a frame pointer. All of our code (kernel, libraries, binaries) is compiled with frame pointers, and hopefully our users will do so as well.

The amd64 ABI is still a work in progress, and the Solaris supplement is not yet finished. More language may be added to clarify the Solaris position on this "feature". It will probably be a non-issue as long as GCC defaults to having frame pointers on amd64 Solaris. I'm not completely sure how the latest GCC behaves - I believe that it defaults to using frame pointers, which is good. I just hope -fomit-frame-pointer never becomes common practice as we move to OpenSolaris and a larger development community.

Motivation

Why was this written into the amd64 ABI? It's a dubious optimization that severely hinders debuggability. Some research claims a substantial improvement, though their own data shows questionable gains. On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn't going to open up a whole new world of compiler optimizations. You're just saving a pushl, movl, an series of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit - circumstances which usually resort to hand-coded assembly anyway. Given the benefit and the relative cost of losing debuggability, this hardly seems worth it.

It may seem a moot point, since you've been able to use -fomit-frame-pointer on i386 for years. The difference here is that on i386, you were knowingly breaking ABI compatibility by using that option. Your application was no longer guaranteed to work properly, especially when it came to debugging. On amd64, this behavior has received official blessing, so that your application can be ABI compliant but completely opaque to DTrace and mdb. I'm not looking forward to "DTrace can't ustack() my gcc-compiled app" bugs (DTrace already has enough trouble dealing with curious gcc-isms as it is).

It's conceivable that we could add support for this functionality in our userland tools, but don't expect it any time soon. And it will never happen for DTrace. If you think saving a pushl, movl here or there is worth it, then you're obviously so performance-oriented that debuggability is the last thing on your mind. I can understand some of our HPC customers needing this; it's when people start compiling /usr/bin/\* without frame pointers that it gets out of control. Just don't be suprised when you try to DTrace your highly tuned app and find out you can't get a proper stack trace...

Next post, I'll discuss register passing conventions, which is a much more visible (and annoying) problem.

Thursday Nov 18, 2004

Back from the dead

Despite what it may seem, I have not fallen off the face of the earth. Those of us in the solaris kernel group have been a little busy lately, as we're in the final stretch of Solaris 10. Hopefully, this post will be the start of a return to my old blogging form...

A few things I've been up to recently:

  • Fixing bugs. Lots of bugs. SMF, procfs, amd64, you name it.

    The only user-visible changes is one of the SMF features: abbreviated FMRIs. Those of you out there never had to endure the dark ages where svcadm disable sendmail was not a valid command - you had to make sure to remember the entire FMRI (network/smtp:sendmail). This has since propagated to all the SMF commands, including svccfg(1M) and svcprop(1).

    On the not-so-visible side, one of the more interesting bugs I tracked down was a nasty hang in the kernel when running the GDB test suite. If a process with watchpoints enabled received a SIGKILL at an inopportune moment, it would descend into the dark pits of the kernel, never to return. Once OpenSolaris goes live, I'd love to blog about some of the crazy dances procfs has to do, but it's incomprehensible without some source code to go along with it.

  • Tinkering with ZFS. I'm working part-time on a cool ZFS enhancement that (we hope) will do wonders for remote administration and zones virtualization. I'll post some more as the details get flushed out.

  • Solaris 10 launch. I was at the launch, talking about DTrace and sitting in on the Experts Exchange. It was a great event - full of great announcements and lots of customer enthusiasm. Once you see S10 in action, and experience it for yourself, it's hard not to be enthusiastic.

It's hard to surf tech news these days without hitting a Solaris 10 article, but one particularly interesting one is this Forrester analysis. You can also catch the HP backlash, with a mother-approved response, as well an I-don't-have-to-answer-to-my-boss response.

More tech-heavy posts are in the works...

About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Archives
« November 2004 »
SunMonTueWedThuFriSat
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
23
24
25
26
30
    
       
Today