Monday Jan 31, 2005

Solaris 10 Released

Finally, Solaris 10 has been released! Go download it at:

http://www.sun.com/software/solaris/get.js

For a litte trivia, the first Solaris 10 putback was on January 22, 2002. So you can imagine how good it feels to see this out the door after 3 years of hard work. I've only been involved for the last year and a half, and I'm overwhelmed. Kudos to the entire Solaris team for putting together such an amazing OS.

P.S. Now that things have settled down a bit, and OpenSolaris is revving up, I promise (once again) to increase my blog output.

Tuesday Jan 25, 2005

First external OpenSolaris builds

To go along with today's announcement (as well as associated press), we also recently provided buildable source to the OpenSolaris pilot participants. The ensuing "build race" has produced a number of very happy people, both inside and outside of Sun.

Check out Ben's blog for a screenshot, as well as Dennis's Blog. Dennis also put the screenshot front and center at blastwave. Very cool stuff!

More info will certainly be forthcoming, so check out the full list of OpenSolaris blogs over at OpenSolaris.org. It's going to be a fun year.

Tuesday Jan 11, 2005

Under the bootchart hood

Last week I announced our bootchart results. Dan continued with a sample of zone boot, as well as some interesting bugs that have been found already thanks to this tool. While we're working on getting the software released, I thought I'd go into some of the DTrace implementation.

To begin with, we were faced with the annoying task of creating a parsable log file. After looking at the existing implementation (which parses top output) and a bunch of groaning, Dan suggested that we should output XML data and leverage the existing Java APIs to make our life easier. Faced with the marriage between something as "lowlevel" as DTrace and something as "abstract" as XML, my first reaction was one of revulsion and guilt. But quickly we realized this was by far the best solution. Our resulting parser was 230 lines of code, compared with 670 for the set of parsers that make up the open source version.

Once we settled on an output format, we had to determine exactly what we would be tracing, and exactly how to do it. First off, we had to trace process lifetime events (fork, exec, exit, etc). With the top implementation, we cannot catch exact event times, nor can we catch short-lived processes which begin and end within a sample period. So we have the following D probes:

  • proc:::create - Fires when a new process is created. We log the parent PID, as well as the new child PID.
  • proc:::exec-success - Fires when a process calls exec(2) successfully. We log the new process name, so that we can convert between PIDs and process names at any future point.
  • proc:::exit - Logs an event when a process exits. We log the current PID.
  • exec_init:entry - This one is a little subtle. Due to the way in which init(1M) is started, we don't get a traditional proc:::create probe. So we have to use FBT and catch calls to exec_init(), which is responsible for spawning init.

This was the easy part. The harder part was to gather usage statistics on a regular basis. The approach we used leveraged the following probes:

  • sched:::on-cpu, sched:::off-cpu - Fires when a thread goes on or off CPU. We keep track of the time spent on CPU, and increment an aggregation using the sum() aggregation.
  • profile:::tick-200ms - Fires on a single CPU every 200 milliseconds. We use printa() to dump the contents of the CPU aggregation on every interval

There were several wrinkles in this plan. First of all, printa() is processed entirely in userland. Given the following script:

        #!/usr/sbin/dtrace -s

        profile:::tick-10ms
        {
                @count["count"] = count();
        }

        profile:::tick-200ms
        {
                printa(@count);
                clear(@count);
        }

One would expect that you would see 5 consecutive outputs of "20". Instead, you see one output of "100", and four more of "0". Because the default switchrate for DTrace is one second, and aggregations are processed by the dtrace(1M) process, we only see the aggregations once a second. This can be fixed by decreasing the switchrate tunable. This also means we can't make use of printa() during anonymous tracing, so we had to have two separate scripts (one for early boot, one for later).

The results are reasonable, but Ziga (the original author of bootchart) suggested a much more clever way of keeping track of samples. Instead of relying on printa(), we key the aggregation based on "sample number" (time divided by a large constant), and then dump the entire aggregation at the end of boot. Provided the amount of data isn't too large, the entire thing can be run anonymously, and we don't have the overhead of DTrace waking up every 10 milliseconds (in the realtime class, no less) to spit out data. We'll likely try this approach in the future.

There's more to be said, but I'll leave this post to be continued later by myself or Dan. In the meantime, you can check out a sample logfile produced by the D script.

Thursday Jan 06, 2005

Boot chart results

I've been on a vacation for a while, but last time I mentioned that Dan and myself had been working on a Solaris port of the open source bootchart program. After about a week of late night hacking, we had a new version (we kept only the image renderer). You can see the results (running on my 2x2 GHz opteron desktop) by clicking on the thumbnail below:

In the next few posts, I'll go over some of the implementation details. We are working on open sourcing the code, but in the meantime I can talk about the instrumentation methodology and some of the hurdles that had to be overcome. A few comparisons with the existing open source implementation:

  • Our graphs show every process used during boot. Unlike the open implementation, which relies on top, we can leverage DTrace to catch every process.

  • We don't have any I/O statistics. Part of this is due to our instrumentation methodology, and part of it is because the system-wide I/O wait statistic has been eliminated from S10 (it was never very useful or accurate). Since we can do basically anything with DTrace, we hope to include per-process I/O statistics at a future date, as well as duplicating the iostat graph with a DTrace equivalent.

  • We include absolute boot time, and show the beginning of init(1M) and those processes that run before our instrumentation can start. So if you wish to compare the above chart with the open implementation, you will have to subtract approximately 9 seconds to get a comparable time.

  • We chose an "up" system as being one where all SMF services were running. It's quite possible to log into a system well before this point, however. In the above graph, for example, one could log in locally on the console after about 20 seconds, and log in graphically after about 30 seconds.

  • We cleaned up the graph a little, altering colors and removing horizontal guidelines. We think it's more readable (given the large number of processes), but your opinion may differ.

Stay tuned for more details.

Saturday Dec 04, 2004

Boot time performance

Dan circulated the following internally, but I thought it was cool enough to share with everyone out there. It seems there's a very slick project to chart boot time performance across a variety of GNU/Linux systems. This makes it much easier to spot performance bottlenecks and improve boot time performance.

It would be really interesting to see this on a Solaris system for two reasons. First, we have the Service Management Facility which significantly parallelizes boot. Having a chart like this would let us see just how well it accomplishes this task. Second, DTrace provides a far superior mechanism for gathering data. The current project uses a combination of top and iostat sampling to gather data. With DTrace, we can get much more accurate data, and do cool stuff like associate I/O with individual processes, track interrupt activity, or gather performance data well before init(1M) begins executing. Our performance team has had a variation on this for a while now, using DTrace to track "events of interest" in order to analyze boot time regressions, but it's presentation is nowhere near that of this tool.

Some of us have downloaded the software and are beginning to poke around with some D scripts. Stay tuned...

Monday Nov 29, 2004

Jonathan on OpenSolaris

As Alan points out, Jonathan had a few comments about OpenSolaris in an interview in ComputerWorld. One interesting bit is an "official" timeline:

We will have the license announced by the end of this calendar year and the code fully available [by the] first quarter of next year.

I can tell you the OpenSolaris team is working like gangbusters to make this a reality. As soon as there is an exact date, you can bet that we'll be making some noise. Jonathan also made a statement that seems to directly conflict with my blog post yesterday:

Is there anything preventing you from making all of Solaris open-source? Nothing at all. And let me repeat that. Nothing at all.

That's a little different than my statement that "there are pieces of Solaris that cannot be open sourced due to encumbered licenses." The point here is that there are no wide-ranging problems (patents, System V code, SCO, Novell) preventing us from opening all of Solaris. Nor is there any internal pressure to keep some parts of Solaris secret. The "pieces" that I mentioned are few and far between - a single file, a command, or maybe a driver. We're also securing rights to these pieces on a monthly basis, so the number keeps dropping (and may reach zero by the time OpenSolaris goes live).

The other point is that we won't be releasing a crippled system. Even if some pieces are not immediately available in source form, we will make sure that anyone can build the same version of Solaris that we do. We hope that these pieces can be re-written and released under the OpenSolaris license as quickly as possible.

Sunday Nov 28, 2004

More on OpenSolaris

So Novell has made a vague threat of litigation over OpenSolaris, which was prompty spun by the media into a declaration of war by Novell. But it has generated quite a bit of discussion over at OSNews. The claim itself is largely FUD (the "article" is little more than gossip), and the discussion covers a wide range of (often unrelated) topics. But I thought I'd pick out a few of the points that keep coming up and address them here, as the article over at LWN seems to make some of the same mistakes and assumptions.

Sun does not own sysV, and therefore cannot legally opensource it

No one can really say what Sun owns rights to. Unless you have had the privilege of reading the many contracts Sun has (which most Sun employees haven't, myself included), it's presumptuous to state definitively what we can or cannot do legally. We have spent a lot of money acquiring rights to the code in Solaris, and we have a large legal department that has been researching and extending those rights for a very long time. Novell thinks they own rights to some code that we may or may not be using, but I trust that our legal department (and the OpenSolaris team) has done due diligence in ensuring that we have the necessary rights to open source Solaris.

Sun has been "considering" open sourcing solaris for about five years now. It's all just a PR stunt.

I can't emphasize enough that this is not a PR stunt. We have a dozen engineers working full-time getting OpenSolaris out the door. We have fifty external customers participating in the OpenSolaris pilot. We have had discussions with dozens of customers and ISVs, as well as presenting at numerous BOFs across the country. This will happen. Yes, it has taken five years - there's a lot of ground to cover when open sourcing 20 years of OS development.

Even if it is open source it still is proprietary, because no one can modify its code and can't make changes, all one can do is watch and suggest to Sun.

We have already publicly stated that our goal is to build a community. There is zero benefit to us throwing source code over the wall as a half-hearted guesture towards the open source community. While it may not happen overnight, there will be community contributions to OpenSolaris. We want the responsibility to rest outside of Sun's walls, at which point we become just another (rather large) contributor to an open source project.

However, the company has not yet announced a license, whether the license will be OSI-compliant or exactly how much of Solaris 10 will be under this open source license.

We have not announced a license, but we have also stated numerous times that it will be OSI-compliant. We know that using a non-OSI license will kill OpenSolaris before it leaves the gate. As to how much of Solaris will be released, the answer is "everything we possibly can." There are pieces of Solaris that cannot be open sourced due to encumbered licenses. But time and again people suggest that we will open source "everything but the crown jewels" - as if we could open source everything but DTrace, or everything but x86 support. Every decision is made based on existing legal agreements, not some misguided attempt to create a crippled open source port.




OpenSolaris is still under development - some of the specifics (licensing, governance model, etc) are still being worked out. All of us are involved one way or another in the future of OpenSolaris. Our words may not carry the "official" tag associated with a press release or news conference, but we're the ones working on OpenSolaris every single day. All of this will be settled when OpenSolaris goes live (as soon as we have a date we'll let you know). Until then, we'll keep trying to get the message out there. I encourage you to ignore your preconceived notions of Sun, of what has and has not been said in the media, and instead focus on the real message - straight from the engineers driving OpenSolaris.

Saturday Nov 27, 2004

Inside the cyclic subsystem

A little while ago I mentioned the cyclic subsystem. This is an interesting little area of Solaris, written by Bryan back in Solaris 8. It is the heart of all timer activity in the Solaris kernel.

The majority of this blog post comes straight from the source. Those of you inside Sun or part of the OpenSolaris pilot should check out usr/src/uts/common/os/cyclic.c. A precursor to the infamous sys/dtrace.h, this source file has 1258 lines of source code, and 1638 lines of comments. I'm going to briefly touch on the high-level aspects of the system; but as you can imagine, it's quite complicated.

Traditional methods

Historically, operating systems have relied a regular clock interrupt. This is different from the clock frequency of the chip - the clock interrupt typically fires every 10ms. All regular kernel activity was scheduled around this omnipresent clock. One of these activities would be to check if there any expired timeouts that need to be triggered.

This granular frequency is usually enough for average activities, but can kill realtime applications that require high-precision timing becaus it forces timers to align on these artificial boundaries. For example, imagine we need a timer to fire every 13 milliseconds. Rather than having the timers fire at 13, 26, 39, and 52 ms, we would instead see it fire at 20, 30, 40, and 60 ms. This is clealy not what we wanted. The result is known as "jitter" - timing deviations from the ideal. Timing granuality, scheduling artifacts, system load, and interrupts all introduce arbitrary latency into the system. By using existing Solaris mechanisms (processor binding, realime scheduling class) we could eliminate much of the latency, but we were still stuck with the granularity of the system clock. The frequency could be tuned up, but this would also increase the time spent doing other clock activity (such as process accounting), and induce significant load on the system.

Cyclic subsystem basics

Enter the cyclic subsystem. It provides for highly accurate interval timers. The key feature is that it is based on programmable timestamp counters, which have been available for many years. In particular, these counters can be programmed (quickly) to generate an interrupt at arbitrary and accurate intervals. Originally available only for SPARC, x86 support (based on programmable APICs) is now available in Solaris 10.

The majority of the kernel sees a very simple interface - you can add, remove, or bind cyclics. Internally, we keep around a heap of cyclics, organized by expiration time. This internal interface connects to a hardware-specific backend. We pick off the next cyclic to process, and then program the hardware to notify us after the next interval. This basic layout can be seen on any system with the ::cycinfo -v dcmd:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp uhci usba nca crypto random lofs nfs ptm ipc ]
> ::cycinfo -v
CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
  0 d2da0140  online      4 d2da00c0   50d7c1308c180 apic_redistribute_compute

                                       1                                      
                                       |                                      
                    +------------------+------------------+                   
                    0                                     3                   
                    |                                     |                   
          +---------+--------+                  +---------+---------+         
          2                                                                   
          |                                                                   
     +----+----+                                                              
  
      ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
  d2da00c0   0    1 high     0   50d7c1308c180   10000 cbe_hres_tick
  d2da00e0   1    0  low     0   50d7c1308c180   10000 apic_redistribute_compute
  d2da0100   2    3 lock     0   50d7c1308c180   10000 clock
  d2da0120   3    2 high     0   50d7c35024400 1000000 deadman
>

On this system there are no realtime timers active, so the intervals (USECINT) are pretty boring. You may notice one elegant feature of this implementation - the clock() function is now just a cyclic consumer. If you're wondering what 'deadman' is, and why it has such a high interval, it's a debugging feature that saves the system from hanging indefinitely (most of the of the time). Turn it on by adding 'set snooping = 1' in /etc/system. If the clock cannot make forward progress in 50 seconds, a high level cyclic will fire and we'll panic.

To register your own cyclic, use the timer_create(3RT) function with the CLOCK_HIGHRES type (assuming you have the PROC_CLOCK_HIGHRES privilege). This will create a low level cyclic with the appropriate timeout. The average latency is extremely small when done properly (bound to a CPU with interrupts disabled) - on the order of a few microseconds on modern hardware. Much better than the 10 millisecond artifacts possible with clock-based callouts.

More details

At a high level, this seems pretty straightforward. Once you figure out how to program the hardware, just toss some function pointers into an AVL tree and be done with it, right? Here are some of the significant wrinkles in this plan:

  • Fast operation - Because we're dispatching real-time timers, we need to be able to trigger and re-schedule cyclics extremely quickly. In order to do this, we make use of per-CPU data structures, heap management that touches a minimal number of cache lines, and lock-free operation. The latter point is particularly difficult, considering the presence of low-level cyclics.

  • Low level cyclics - The cylic subsystem operates at a high interrupt level. But not all cyclics should run at such a high level (and very few do). In order to support low level cyclics, the subsystem will post a software interrupt to deal with the cyclic at a lower level interrupt. This opens up a whole can of worms, because we have to guarantee a 1-to-1 mapping, as well as maintain timing constraints.

  • Cyclic removal - While rare, it is occasionally necessary to remove pending cyclics (the most common occurence is when unloading modules with registered cyclics). This has to be done without disturbing the other running cyclics.

  • Resource resizing - The heap, as well as internal buffers used for pending lowlevel cyclics, must be able to handle any number of active cyclics. This means that they have to be resizable, while maintaining lock-free operation in the common path.

  • Cyclic jugging - In order to offline CPUs, we must be able to re-schedule cyclics on other active CPUs, without missing a timeout in the process.

As you can see, the cyclic subsystem is a complicated but well-contained subsystem. It uses a well-organized layout to expose a simple interface to the rest of the kernel, and provides great benefit to both in-kernel consumers and timing-sensitive realtime applications.

Sunday Nov 21, 2004

Dual Core Opterons

So it's no secret that AMD and Intel are in a mad sprint to the finish for dual-core x86 chips. The offical AMD roadmap, as well as public demos have all shown AMD well on track. The latest tidbits of information indicate Linux is up and running on these dual-core systems. Very cool.

Given our close relationship with AMD and the sensitive nature of hardware plans, I'll refrain from saying what we may or may not have running in our labs. But Solaris has some great features that make it well-suited for these dual core chips. First of all, Solaris 10 has had support for both Chip Multi Threading (hyperthreading) and Chip Multi Processing (multi core) for about a year and half now. Solaris has also been NUMA-aware for much longer (with the current lgroups coming in mid-2001, or Solaris 9). I'm sure AMD has made these cores appear as two processesors for legacy purposes, but with a little cpuid tweaks, we'll see them as sibling cores and get all the benefits inherent in Solaris 10 CMP.

Despite this, the NUMA system in Solaris is undergoing drastic change due to the Opteron memory architecture. While Solaris is NUMA-aware, it uses a simplistic memory heirarchy based on the physical architecture of Sun's high end SPARC systems. We have the notion of a "locality group", which represents the logical relationship of CPUs and memory. Currently, there are only two notions of locality - "near" and "far". Solaris tries its best to keep logically connected memory and processes in the same locality group. On Opteron, things get a bit more complicated due to the integrated memory controller and HyperTransport layout. On 4-way machines the processors are laid out in a square, and on 8-way machines we have a ladder formation. Memory transfers must pass through neighboring memory controllers, so now memory could be "near", "far", or "farther". We're revamping the current lgroup system to support arbitrary memory heirachies, which should produce some nice performance gains on 4- and 8-way Opteron machines. Hopefully one of the NUMA folks will blog some more detailed information once this project integrates.

In conclusion: Opterons are cool, but dual-core Opterons are cooler. And Solaris will rip on both of them.

Debugging on AMD64 - Part Three

Given that the amd64 ABI is nearly set in stone, and (as pointed out in comments on my last entry) future OpenSolaris ports could run into similar problems on other architectures (like PowerPC), you may wonder how we can make life easier in Solaris. In this entry I'll elaborate on two possibilities. Note that these are little more than fantasies at the moment - no real engineering work has been done, nor is there any guarantee that they will appear in a future Solaris release.

DWARF Support for MDB

Even though DWARF is a complex beast, it's not impossible to write an interpreter. It's just a matter of doing the work. The more subtle problem is designing it correctly, and making the data accessible in the kernel. Since MDB and KMDB are primarily kernel or post-mortem userland tools, this has not been a high priority. CTF gives us most of what we need, and including all the DWARF information in the kernel (or corefiles) is prohibitively expensive. That being said, there are those among us that would like to see MDB take a more prominent userland role (where it would compete with dbx and gdb), at which point proper DWARF support would be a very nice thing to have.

If this is done properly, we'll end up with a debugging library that's format-independent. Whether the target has CTF, STABS, or DWARF data, MDB (and KMDB) will just "do the right thing". No one argues that this isn't a cool idea - it's just a matter of engineering resources and business justification.

Programmatic Disassembler

The alternative solution is to create a disassembler library that understands code at a semantic level. Once you have a disassembler that understands the logical breakdown of a program, you can determine (via simulation) the original argument values to functions. Of course, it's not always guaranteed to work, but you'll always know when you're guessing (even DWARF can't be correct 100% of the time). This requires no debugging information, only the machine text. It will also help out the DTrace pid provider, which has to wrestle with jump tables and other werid compiler-isms. Of course, this is monumentally more difficult than a DWARF parser - especially on x86.

This idea (along with a prototype) has been around for many years. The converted have prophesized that libdis will bring peace to the world and an end to world hunger. As with many great ideas, there just hasn't been justification for devoting the necessary engineering resources. But if it can get the arguments to functions on amd64 correct in 98% of the situations, it would be incredibly valuable.

OpenSolaris Debugging Futures

There are a host of other ideas that we have kicking around here in the Solaris group. They range from pretty mundance to completely insane. As OpenSolaris finishes getting in gear, I'm looking forward to getting these ideas out in the public and finding support for all the cool possibilities that just aren't high enough priority for us right now. The existence of a larger development community will also make good debugging tools a much better business proposition.

Saturday Nov 20, 2004

Debugging on AMD64 - Part Two

Last post I talked about one of the annoying features of the amd64 ABI - the optional frame pointer. Today, I'll examine the much more painful problem of argument passing on amd64. For sake of discussion, I'll avoid structure passing and floating point - nasty little kinks in the problem.

Argument Passing on i386

On i386, all arguments are passed on the stack. Before establishing a frame, the caller pushes each argument to the function in reverse order. This gives you this stack layout:

...
arg1
arg0
return PC
previous frame
%ebp

current frame

%esp

If you want to access the third argument, you simply reference 16(%ebp) (8 for the frame + 8 to skip first two args). This makes debugging a breeze. For any given frame pointer (easy to find thanks to the i386 ABI), we can always find the initial arguments to the function. Another trick we use is that nearly every function call is followed by a addl x, %esp instruction. Using this information, we can figure out how many arguments were passed to the function, without relying on CTF or STABS data. Putting this all together, it's easy to get a meaningful stack trace:

        > a76de800::findstack -v
        stack pointer for thread a76de800: a77c5dd4
        [ a77c5dd4 0xfe81994d() ]
          a77c5dec swtch+0x1cb()
          a77c5e10 cv_wait_sig+0x12c(a78a79b0, a6c57028)
          a77c5e70 cte_get_event+0x4d()
          a77c5ea4 ctfs_endpoint_ioctl+0xc2()
          a77c5ec4 ctfs_bu_ioctl+0x2f()
          a77c5ee4 fop_ioctl+0x1e(a79a7980, 63746502, 80d3f48, 102001, a69daf08, a77c5f74)
          a77c5f80 ioctl+0x19b()
          a77c5fac sys_call+0x16e()

Arguments Passing on AMD64

Enter amd64. As previously mentioned, the amd64 ABI was designed primarily for for performance, not debugging. The architects decided that pushing arguments on the stack was expensive, and that with 16 general purpose registers, we might as use some of them to pass arguments. Specifically, we have:

arg0%rdi
arg1%rsi
arg2%rdx
arg3%rcx
arg4%r8
arg5%r9
argN8\*(N-4)(%ebp)

This is an disaster for debugging. Debugging tools that operate in-place (DTrace and truss) can get meaningful arguments, but cannot know how many there are. Tools which examine a stack trace (pstack, mdb) cannot get arguments for any frame. The arguments may or may not be pushed on the stack, or they could be lost completely. If we try to get a stack with arguments, we find:

        > ffffffff8af1c720::findstack -v
        stack pointer for thread ffffffff8af1c720: ffffffffb2a51af0
          ffffffffb2a51d00 vpanic()
          ffffffffb2a51d30 0xfffffffffe972ae3()
          ffffffffb2a51d60 exitlwps+0x1f1()
          ffffffffb2a51dd0 proc_exit+0x40()
          ffffffffb2a51de0 exit+9()
          ffffffffb2a51e40 psig+0x2bc()
          ffffffffb2a51ee0 post_syscall+0x7d5()
          ffffffffb2a51f00 syscall_exit+0x5d()
          ffffffffb2a51f10 sys_syscall32+0x1d8()

The solution

The solution, as envisioned by the amd64 ABI designers, is to rely on DWARF to get the necessary information. If you have ever read the DWARF spec, you know that it a gigantic, ugly beast - an interpreted language that can be used to mine virtually any debugging data in an abstract manner. The problem here is that it requires significantly more work than on i386, and it requires debugging information to be present in the target object.

Implementing a DWARF interpreter is technically quite doable. We even had one brave soul go so far as to implement a limited DWARF disassembler capable of grabbing arguments for functions. But it turns out that the sheer amount of data we would have to add to the kernel to enable this was prohibitive. The bloat would have pushed us past the limit of the miniroot, not to mention the increased memory footprint and necessary changes to krtld and KMDB. That's not to say we won't support it in userland some day.

The lack of an argument count is a less serious. DTrace doesn't need to know how many arguments there are. For the moment, truss simply shows the first 6 arguments always. But truss could be enhanced to use CTF and/or DWARF data to determine the number of arguments to a given function. But it probably won't happen any time soon.

Workaround

Given that there will be no solution to this problem any time soon, you may ask how one can do any kind of debugging at all. The answer is "painfully". I'll walk through an example of finding the arguments to a function, using the following stack:

        > ffffffff8356c100::findstack -v
        stack pointer for thread ffffffff8356c100: ffffffffb2bbdb10
        [ ffffffffb2bbdb10 _resume_from_idle+0xe4() ]
          ffffffffb2bbdb40 swtch+0xc9()
          ffffffffb2bbdb90 cv_wait_sig+0x170()
          ffffffffb2bbdc50 cte_get_event+0xb0()
          ffffffffb2bbdc70 ctfs_endpoint_ioctl+0x7e()
          ffffffffb2bbdc80 ctfs_bu_ioctl+0x32()
          ffffffffb2bbdc90 fop_ioctl+0xb()
          ffffffffb2bbdd70 ioctl+0xac()
          ffffffffb2bbde00 dosyscall+0x12b()
          ffffffffb2bbdf00 trap+0x1308()
        >

Let's say that we want to know the first argument to fop_ioctl(), which is a vnode. The first step is to look at the caller and see where the argument came from:

        > ioctl+0xac::dis -n 6
------> ioctl+0x8e:                     movq   0x10(%r12),%rdi
        ioctl+0x93:                     movq   0x1a0(%rax),%r8
        ioctl+0x9a:                     leaq   -0xcc(%rbp),%r9
        ioctl+0xa1:                     movq   %r15,%rdx
        ioctl+0xa4:                     movl   %r13d,%esi
------> ioctl+0xa7:                     call   +0xeed99 <fop_ioctl>
        ioctl+0xac:                     testl  %eax,%eax
        ioctl+0xae:                     movl   %eax,%ebx
        ioctl+0xb0:                     jne    +0x74    <ioctl+0x124>
        ioctl+0xb2:                     cmpl   $0x8004667e,%r13d
        ioctl+0xb9:                     je     +0x27    <ioctl+0xe0>
        ioctl+0xbb:                     movl   %r14d,%edi
        ioctl+0xbe:                     call   -0x1408e <releasef>

We can see that %rdi (the first argument) came from %r12. Looks like we lucked out - %r12 must be preserved by the function being called. So we look at fop_ioctl():

        > fop_ioctl::dis
        fop_ioctl:                      movq   0x40(%rdi),%rax
        fop_ioctl+4:                    pushq  %rbp
        fop_ioctl+5:                    movq   %rsp,%rbp
        fop_ioctl+8:                    call   \*0x28(%rax)
        fop_ioctl+0xb:                  leave
        fop_ioctl+0xc:                  ret

No dice. We can see that %r12 (as well as %rdi) is still active at this point. Let's keep looking:

        > ctfs_bu_ioctl::dis ! grep r12
        > ctfs_endpoint_ioctl::dis ! grep r12
        > cte_get_event::dis ! grep r12
        cte_get_event+0x13:             pushq  %r12
        cte_get_event+0x32:             movq   0x20(%rdi),%r12
        ...

Finally, we found a function that preserves %r12. Taking a closer look at cte_get_event():

        > cte_get_event::dis -n 8
        cte_get_event:                  pushq  %rbp
        cte_get_event+1:                movq   %rsp,%rbp
        cte_get_event+4:                pushq  %r15
        cte_get_event+6:                movl   %esi,%r15d
        cte_get_event+9:                pushq  %r14
        cte_get_event+0xb:              movq   %rcx,%r14
        cte_get_event+0xe:              pushq  %r13
        cte_get_event+0x10:             movl   %r9d,%r13d
        cte_get_event+0x13:             pushq  %r12

We can see that %r12 was pushed fourth after establishing the frame pointer. This would put it 32 bytes below %rbp for this frame. Remembering that what was really passed was 0x10(%r12), we can finally find our original argument:

        > ffffffffb2bbdc50-20/K
        0xffffffffb2bbdc30:             ffffffff8330ec88
        > ffffffff8330ec88+10/K
        0xffffffff8330ec98:             ffffffff83a5f600
        > ffffffff83a5f600::print vnode_t v_path
        v_path = 0xffffffff83978c40 "/system/contract/process/pbundle"

Whew. We can see that we have the proper vnode, since the path references a /system/contract file. And all it took was about 12 steps! You can see how this has become such a pain for us kernel developers. From the above example, you can see the approximate method is:

  1. Determine where the argument came from in the caller. Hopefully, you will find something that came from the stack, or one of the callee-saved registers (%r12-%r15). If not, look at the function and see if the argument was pushed on the stack or moved somewhere more permanent. This doesn't happen often, so it may be that your argument is lost forever.

  2. If the argument came from a callee-saved register, examine every function in the stack until you find one that saves the value.

  3. By this point, you've hopefully found a place where the value is stored relative to %ebp. Using the frame pointers displayed in the stack trace, fetch the value from the stack.

This is not always guaranteed to work, and is obviously a royal pain. In my next post, I'll go into some future ideas we have to make this (and other debugging) better.

Friday Nov 19, 2004

Debugging on AMD64 - Part One

The amd64 port of Solaris has been available (internally) for about a month and a half, and the rest of the group is starting to realize what those of us on the project team have known for a while: debugging on amd64 is a royal pain. The difficulty comes not from processor changes, but from design choices made in the AMD64 ABI. The ABI was designed primarily with performance in mind - debuggability and observability was largely an afterthought. There are two features of the ABI that really hurt debuggability. In this post I'll cover the less annoying of the two - look for another followup soon.

Frame Pointers

In the i386 ABI, you almost always have to establish a frame pointer for the current function (leaf routines being the exception). This gives you the familiar opening function sequence:

        pushl   %ebp
        movl    %esp, %ebp

And your frame ends up looking like this:

...
arg1
arg0
return PC
previous frame
%ebp

current frame

%esp

This is a restriction of the ABI, not the processor. You can cheat by using the -fomit-frame-pointer flag to gcc, but this is not ABI compliant (although some people still think it's a great idea).

The problem

With amd64, you would think that they would just keep this convention. At first glance it seems that way, until you find this little footnote in section 3.3.2:

The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.

On amd64, the frame pointer is explicitly optional. To make debugging somewhat easier, they provide a .eh_frame ELF section that gives enough information (in the form of a binary search table) to traverse a stack from any point. This is slightly better than DWARF, but still requires a lot of processing. The problem with this is that it unnecessarily restricts the context from which you can gather a backtrace. On i386, your stack walking function is something like:

      frame = %ebp
      while (not at top of stack)
             process frame
             frame = \*frame

Simple and straightforward. This omits a few nasty details like signal frames and #gp faults, but it's largely correct. On amd64, you now have to load the .eh_frame section, process it, and keep it someplace where you have easy access to it. While this doesn't sound so bad for gdb, it becomes a huge nightmare for something like DTrace. If you read a little bit of the technical details behind DTrace, you'll understand that probes execute in arbitrary context. You may be in the middle of handling an interrupt, in dispatcher or VM code, or processing a trap (although on SPARC, DTracing code that executes at TL > 0 is strictly verboten). This means that the set of possible actions is severely limited, not to mention performance-critical. In order to process a stack() directive on amd64, we would now have to do something like:

        frame = %ebp
        while (not at top of stack)
                process frame
                for (each module in the system)
                        next = binary search in .eh_frame
                        if (next)
                                frame = next
                if (frame not found)
                        frame = \*frame

Of course, you could maintain a merged lookup table for all modules on the system, but this is considerably more difficult and a maintenance nightmare. The real show stopper comes with the ustack() action. It is impossible, from arbitrary context within the kernel, to process the objects in userland and find the necessary debugging information. And unless we're using only the pid provider, there's no way to know a priori what processes we will need to examine via ustack(), so we can't even cache the information ahead of time.

The solution

What do we do in Solaris? We punt. Our linkers will happily process .eh_frame sections correctly, but our debugging tools (DTrace, mdb, pstack, etc) will only understand executables that use a frame pointer. All of our code (kernel, libraries, binaries) is compiled with frame pointers, and hopefully our users will do so as well.

The amd64 ABI is still a work in progress, and the Solaris supplement is not yet finished. More language may be added to clarify the Solaris position on this "feature". It will probably be a non-issue as long as GCC defaults to having frame pointers on amd64 Solaris. I'm not completely sure how the latest GCC behaves - I believe that it defaults to using frame pointers, which is good. I just hope -fomit-frame-pointer never becomes common practice as we move to OpenSolaris and a larger development community.

Motivation

Why was this written into the amd64 ABI? It's a dubious optimization that severely hinders debuggability. Some research claims a substantial improvement, though their own data shows questionable gains. On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn't going to open up a whole new world of compiler optimizations. You're just saving a pushl, movl, an series of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit - circumstances which usually resort to hand-coded assembly anyway. Given the benefit and the relative cost of losing debuggability, this hardly seems worth it.

It may seem a moot point, since you've been able to use -fomit-frame-pointer on i386 for years. The difference here is that on i386, you were knowingly breaking ABI compatibility by using that option. Your application was no longer guaranteed to work properly, especially when it came to debugging. On amd64, this behavior has received official blessing, so that your application can be ABI compliant but completely opaque to DTrace and mdb. I'm not looking forward to "DTrace can't ustack() my gcc-compiled app" bugs (DTrace already has enough trouble dealing with curious gcc-isms as it is).

It's conceivable that we could add support for this functionality in our userland tools, but don't expect it any time soon. And it will never happen for DTrace. If you think saving a pushl, movl here or there is worth it, then you're obviously so performance-oriented that debuggability is the last thing on your mind. I can understand some of our HPC customers needing this; it's when people start compiling /usr/bin/\* without frame pointers that it gets out of control. Just don't be suprised when you try to DTrace your highly tuned app and find out you can't get a proper stack trace...

Next post, I'll discuss register passing conventions, which is a much more visible (and annoying) problem.

Thursday Nov 18, 2004

Back from the dead

Despite what it may seem, I have not fallen off the face of the earth. Those of us in the solaris kernel group have been a little busy lately, as we're in the final stretch of Solaris 10. Hopefully, this post will be the start of a return to my old blogging form...

A few things I've been up to recently:

  • Fixing bugs. Lots of bugs. SMF, procfs, amd64, you name it.

    The only user-visible changes is one of the SMF features: abbreviated FMRIs. Those of you out there never had to endure the dark ages where svcadm disable sendmail was not a valid command - you had to make sure to remember the entire FMRI (network/smtp:sendmail). This has since propagated to all the SMF commands, including svccfg(1M) and svcprop(1).

    On the not-so-visible side, one of the more interesting bugs I tracked down was a nasty hang in the kernel when running the GDB test suite. If a process with watchpoints enabled received a SIGKILL at an inopportune moment, it would descend into the dark pits of the kernel, never to return. Once OpenSolaris goes live, I'd love to blog about some of the crazy dances procfs has to do, but it's incomprehensible without some source code to go along with it.

  • Tinkering with ZFS. I'm working part-time on a cool ZFS enhancement that (we hope) will do wonders for remote administration and zones virtualization. I'll post some more as the details get flushed out.

  • Solaris 10 launch. I was at the launch, talking about DTrace and sitting in on the Experts Exchange. It was a great event - full of great announcements and lots of customer enthusiasm. Once you see S10 in action, and experience it for yourself, it's hard not to be enthusiastic.

It's hard to surf tech news these days without hitting a Solaris 10 article, but one particularly interesting one is this Forrester analysis. You can also catch the HP backlash, with a mother-approved response, as well an I-don't-have-to-answer-to-my-boss response.

More tech-heavy posts are in the works...

Wednesday Oct 13, 2004

Microstate accounting in Solaris 10

With the tidal wave of features that is Solaris 10, it's all too easy to miss the less visible enhancements. Once of these "lost features" is microstate accounting turned on by default for all processes. Process accounting keeps track of time spent by each thread in various states - idle, in system mode, in user mode, etc. This is used for calculating the usage percentages you see in prstat(1) as well as time(1) and ptime(1). Historically there have been two ways of doing process accounting:

  1. Clock-based accounting

    Virtually every operating system in existence has a clock function - a periodic interrupt used to perform regular system activity1. On Solaris, the clock function does a number of simple tasks, mostly having to do with process and CPU accounting. Clock based accounting examines each CPU, and increments per-process statistics depending on whether the current thread is in user mode, system mode, or idle. This uses statistical sampling (taking a snapshot every 10 milliseconds) to come up with a picture of system resource usage.

  2. Microstate accounting

    With microstate accouning, the counters are updated with each state transition as they occur in real time. This results in much more accurate, if more expensive, accounting information. Previously, you could turn this on through the /proc interfaces (proc(4)) using the PR_MSACCT control. This has since become a no-op; you cann't disable microstate accounting in Solaris 10.

The clock-based sampling method has several drawbacks. One is that it is only a guess for rapidly transitioning threads. If you're going between system and user mode faster than once every 10 milliseconds, the clock will only account for a single state during this time period. Even worse is the fact that it can miss threads entirely if it catches a CPU in an idle state. If you have a thread that wakes up every 10 milliseconds and does 9 milliseconds of work before going to sleep, you will never account for the fact that it is using 90% of the CPU2.

These 'phantom' threads can hide from traditional process tools. The now-infamous gtik2_applet2 outlined in the DTrace USENIX paper is a good example of this. These inaccuracies also affect the fair share scheduler (FSS(7)), which must make resource allocation decisions based on the accuracy of these accounting statistics.

With microstate accounting, all of these artifacts disappear. Since we do the calculations in situ, it's impossible to "miss" a transition. The reason we haven't always relied on microstate accounting is that it used to be a fairly expensive process. Threads transition between states with high frequency, and with each transition we need to get a timestamp to determine how long we spent in the last state.

The key obversation that made this practical is that the expensive part is not reading the hardware timestamp. The expensive part is the conversion to nanoseconds - clock ticks are useless when doing any sort of real accounting. The solution was to use clock ticks for all the internal accounting, and only do the conversion to nanoseconds when actually requested. With a few other tricks, we were able to reduce the impact to virtually nil. You can see the difference in micro-benchmark performance of short system calls (such as getpid(2)), but it's unnoticeable in any normal system call (like read(2)), and nonexistent in any macro benchmark.

All the great benefits of microstate acocunting at a fraction of the price.


1 The clock function is actually a consumer of the cyclic subsystem - a nice callout mechanism written by Bryan back in Solaris 8. Certainly worth a post of its own in the future.

2This is especially true on x86, because historically the hardware clock has been the source of all timer activity. This means every thread ran lockstep with the clock with respect to timers and wakeups. This has been addressed in Solaris 10 with the introduction of arbitrary precision timer interrupts, which makes the cyclic subsystem far more interesting on x86 machines (Solaris SPARC has had this functionality for a while).

Thursday Oct 07, 2004

More is better

Just wanted to welcome fellow kernel engineers Jonathan Adams (smf guru and debugger extraordinaire) and Joe Bonasera (amd64 guru and x86 VM master) to the madness that is blogs.sun.com. Blog Away!

About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today