Monday Mar 19, 2007

Linux Defection

Ian Murdock has left the Linux Foundation to lead the operating systems strategy here at Sun. The last few years have seen some exciting changes at Sun: releasing Solaris 10 (which includes several truly revolutionary technologies), embracing x86, leading on x64, and taking Solaris open source. That a luminary of the Linux world was enticed by the changes we've made and the technologies we're creating is a huge vote of confidence. From my (admittedly biased) view, OpenSolaris has been breaking away from the pack with technologies like DTrace, ZFS, Zones, SMF, FMA and others. I'm looking forward to the contributions Ian will bring from his experience with Debian, and to the wake-up call this may be to Linux devotees: OpenSolaris is breaking away.

Sunday Nov 27, 2005

OpenSolaris on LugRadio

The good folks over at LugRadio (that LUG; not the other LUG of course) invited me on their show to answer some of their questions about OpenSolaris. You can find it in the latest episode. Check your volume before you start playing the episode: the first reference to masturbation is approximately 6 seconds in.

Understandably, the LugRadio guys didn't have much exposure to OpenSolaris, but they were certainly interested and recognized that it represents a pretty exciting development in the world of open source. Mostly I addressed some of the differences between Solaris and Linux. They seemed quite keen on Nexenta and I recommend that Solaris and Debian users alike try it out. There's one small correction I'd like to make regarding my comments on the show: I erroneously claimed that Solaris 7 -- the first version of Solaris to support 64-bit processors -- came out in 1996. In fact, Solaris 7 came out in November of 1998 (64-bit support having been first integrated in October of 1997). That cuts our 64-bit experience to 7 or 8 years from the 10 that I claimed.

On a vaguely related note, I'll be in London speaking at the OpenSolaris user group meeting on 12/19/2005. That's after a tromp across Europe which includes a DTrace cum Java talk and an OpenSolaris BoF at JavaPolis in Belgium.


Wednesday Oct 19, 2005

Fall of Code?

Actually, Fall, Winter and Spring of code. Sun just announced the Solaris 10 University Challenge Contest. That's right, it's a challenge and a contest, with three times the staying power of a single season of code. Apparently in a modern day treaty of Brest-Litovsk, Google ceded the rest of the year to Sun. Perhaps this was the real beef of the recent Sun/Google agreement.

This is actually pretty cool: be a college student, do something cool on OpenSolaris, and take a shot at winning the grand prize of $5k and a sweet box (I imagine there might be prizes for other interesting entries). There are some project ideas on the contest/challenge/seasonal coding page ranging from good (MythTV), to mundane (support for inkjet printers from Epson), to confusing (internet gaming -- I thought online gaming was its own reward), to inane ("A joystick driver - for gaming", oh, for gaming? I've been using it for system administration). Here's my list off the top of my head -- if you want more ideas, feel free to drop me a line.

Work on an existing open source project

  • pearpc runs ppc applications on x86. I started working on porting it over and was able to boot Darwin.
  • valgrind is very cool (I've only just seen it). It would be great to port it or to use pieces like KCacheGrind and plug in DTrace as a backend.
  • Port over your favorite application, or get it running in some emulation environment of your own devising.
  • Make something go faster. MySQL, Gnome, mozilla, some random system call, whatever; there's a lot of inefficiency out there.

Write something new (using cool stuff in Solaris)

  • I'd love to see more dynamic languages with native DTrace support. We've already got support for Java, php, Ruby, and Perl in some form; make it better or add support for some other language you know and love (TCL, python, scheme, LISP, ML, etc.).
  • Build another kind of analysis tool on top of DTrace. We're working on a Java binding which is going to make this easier.
  • Write a device driver for your favorite crazy device (which I assume is your new iPod nano or something; you're such a hipster Apple fanboy).
  • Build a tool to simulate a distributed environment on Zones and use DTrace to monitor the communication. WARNING: your distributed systems professor will be your new best friend.

That's what I'd do, but if you have a better idea, go do that instead.


OpenSolaris and svk

Today at EuroOSCON, I attended an introductory talk on svk by Chia-liang Kao. I was hopeful that svk might address some of the issues that I thought would prevent us from adopting Subversion for OpenSolaris. In particular, Subversion requires a centralized repository whereas svk, which is built on top of Subversion, provides the distributed revision control system that we'd need. After the talk, my overall impression was that svk seemed to lack a certain polish, but after struggling to phrase that in less subjective terms, I'm coming around a bit.

I got a little nervous when the first example of svk's use was for keeping the contents of /etc under revision control. The big problem that svk solved was having random directories (.svn, SCCS, whatever) in, for example, /etc/skel. Talk about trivia (almost as relevant as a demo implementing a weblog in Ruby on Rails). I guess it's nice that svk solves a problem for that particularly esoteric scenario, but wasn't there some mention that this might be used to, you know, hold onto source code? Was that actually the design center for svk?

Fortunately we did get to the details of using svk for a source code repository. I think this is just my bias coming from teamware, but some of the mechanisms seem a bit strange. In particular, you do svk mirror to make one kind of copy of the main repository (the kind that's a "local repository"), and svk checkout to make a different kind of copy (the kind that's the "working area"). In other words, you have a tree structure, but the branches and leaves are different entities entirely and editing can only be done on the leaves. I guess that's not such a huge pain, but I think this reflects the origins: taking a local revision control system and making it distributed. Consequently, there's a bunch of stuff left over from Subversion (like branches) that seems rather redundant in a distributed revision control system (don't take a branch; make another local repository, right?); it's not that these actually hurt anything, it just means that there's a bunch of complexity for what's essentially ancillary functionality.

Another not-a-big-deal-that-rubs-me-the-wrong-way is that svk is a pile of perl modules (of course, there's probably a specific perlism for that; "epocalyptus" or something I'm sure). I suppose we'll only have to wrestle that bear to the ground once, and stuff it in a tarball (yes, Alan, or a package). To assuage my nervousness, I'd at least like to be confident that we could maintain this ourselves, but I don't think we have the collective perl expertise (except for the aforementioned Alan) to be able to fix a bug or add a feature we desperately need.

I want this thing to work, because svk seems to be the best option currently available, but I admit that I was a bit disappointed. If we were going to use this for OpenSolaris, we'd certainly need to distribute it in a self-contained form, and really put it through its paces to make sure it could do all the things we need in the strangest edge cases. As I mentioned, we currently use teamware, an old Sun product that's been in constant use despite being end-of-lifed several years ago. While I like its overall design, there's a bunch of work that would need to happen for it to be suitable for OpenSolaris. In particular, it currently requires a shared NFS filesystem, and it requires you to be on the host machine when committing a change. Adding networked access capabilities to it would probably be a nightmare. It also relies on SCCS (an ancient version control system) for managing individual files; not a problem per se, but a little crufty. Teamware is great and has withstood the test of time, but svk is probably closer to satisfying our requirements.

Update: But there are quite a few other options I hadn't looked into. Svk no longer looks like a front runner. If there are other systems you think are worth considering, let me know.

I'll play with svk some more and try to psych myself up for this brave new world. I'd appreciate any testimonials or feedback, and, of course, please correct all the factual errors I'm sure I committed.


Monday Oct 17, 2005

OpenSolaris and Subversion

I just attended Brian W. Fitzpatrick's talk on Subversion at EuroOSCON. Brian did a great job and Subversion looks like a really complete replacement for cvs -- the stated goal of the project. What I was particularly interested in was the feasibility of using Subversion as the revision control system for OpenSolaris; according to the road map we still have a few months to figure it out, but, as my grandmother always said while working away at her mechanical Turing machine, time flies when you're debating the merits of various revision control systems.

While Subversion seems like some polished software, I don't think it's a solution -- or at least a complete solution -- to the problem we have with OpenSolaris. In particular, it's not a distributed revision control system, meaning that there's one master repository that manages everything including branches and sub-branches. This means that if you have a development team at the distal point on the globe from the main repository (we pretty much do), all that team's work has to traverse the globe. Now the Subversion folks have ensured that the over-the-wire protocol is lean, but that doesn't really address the core of the problem -- the concern isn't potentially slow communication, it's that it happens at all. Let's say a bunch of folks -- inside or outside of Sun -- start working on a project; under Subversion there's a single point of failure -- if the one server in Menlo Park goes down (or the connection to it goes down), the project can't accept any more integrations. I'm also not clear if branches can have their own policies for integrations. There are a couple other issues we'd need to solve (e.g. comments are made per-integration rather than per-file), but this is by far the biggest.

Brian recommended a talk on svk later this week; svk is a distributed revision control and source management system that's built on Subversion. I hope svk solves the problems OpenSolaris would have with Subversion, but it would be great if Subversion could eventually migrate to a distributed model. I'd also like to attend this BoF on version control systems, but I'll be busy at the OpenSolaris User Group meeting -- where I'm sure you'll be as well.


Sunday Oct 16, 2005


This week I'll be in Amsterdam attending and speaking at Euro OSCON. I'll be running a session on DTrace on Thursday at 1:30. For the uninitiated, DTrace is the systemic observability framework in OpenSolaris that allows for tracing of the kernel and applications on production systems. At OSCON in Portland, Bryan and Wez extended DTrace into php (it already worked on C/C++ and Java); since then, the DTrace community has extended DTrace into Ruby and Perl -- we'll see if we can come up with something new this week.

I'll also be tagging in on the OpenSolaris Users' Group Meeting and the OpenSolaris Technologies BoF along with Casper, Alan, Patrick, Simon, and others.

I've had a chance to explore Amsterdam a bit and it's spectacular. My California-thinned blood won't let me forgo a comment on the rather cold weather, but my Northeastern upbringing won't let me complain too loudly.


Thursday Sep 15, 2005

The mysteries of _init

I hadn't been fully aware that I felt this way, but I recently had a realization: I love the linker. It's a technology that's amazing in both its simplicity and its complexity. I'm sure my feelings are influenced in no small way by the caliber of the engineers working on it -- Rod and Mike are always eager to explain how some facet of the linker works or to add something new and whizzy if it can't quite do what I need.

Over the course of developing user-level statically defined tracing (USDT), I've worked (and continue to work) with the linker guys to figure out the best way to slot the two technologies together. Recently, some users of USDT have run into a problem where binaries compiled with USDT probes weren't actually making them available to the system. We eventually tracked it down to incorrect use of the linker. I thought it would be helpful to describe the problem and the solution in case other people bump into something similar.

First, a little bit on initialization. With the Sun C compiler, you can specify an initialization function like this: #pragma init(my_init). The intention is to have the specified function (e.g. my_init) called when the binary is loaded into a process. This is a good place to do initialization like memory allocation or other setup used by the rest of the binary. What the compiler actually does when you specify this is create a ".init" section which contains a call to the specified function.

As a concrete example (and the example relevant to this specific manifestation of the problem), take a look at this code in usr/src/lib/libdtrace/common/drti.c:

     88 #pragma init(dtrace_dof_init)
     89 static void
     90 dtrace_dof_init(void)
     91 {

When we compile this into an object file (which we then deliver in /usr/lib/dtrace/drti.o), the compiler generates a .init ELF section that contains a call to dtrace_dof_init() (actually it contains a call with a relocation that gets filled in to be the address of dtrace_dof_init(), but that's a detail for another blog entry).

The linker doesn't really do anything special with .init ELF sections -- it just concatenates them like it does all other sections with the same name. So when you compile a bunch of object files with .init sections, they just get crammed together -- there's still nothing special that causes them to get executed when the binary is loaded.

Here's the clever part: when a compiler invokes the linker, it provides two special object files: crti.o at the beginning, and crtn.o at the end. You can find those binaries on your system in /usr/lib/ or in /usr/sfw/lib/gcc/... for the gcc version. Those object files are where the clever part happens; crti.o's .init section effectively contains an open brace and crtn.o's contains the close brace (the function prologue and epilogue respectively):

$ dis -t .init /usr/lib/crti.o
section .init
    _init:                  55                 pushl  %ebp
    1:                      8b ec              movl   %esp,%ebp
    3:                      53                 pushl  %ebx
    4:                      e8 00 00 00 00     call   +0x5      <0x9>
    9:                      5b                 popl   %ebx
    a:                      81 c3 03 00 00 00  addl   $0x3,%ebx

$ dis -t .init /usr/lib/crtn.o
section .init
    0:                      5b                 popl   %ebx
    1:                      c9                 leave
    2:                      c3                 ret

By now you may see the punch-line: by bracketing the user-generated object files with crti.o and crtn.o, the resulting .init section is the concatenation of the function prologue, all the calls in the user's object files, and finally the function epilogue. All of this is contained in the symbol called _init.

The linker then has some magic that identifies the _init function as special and includes a dynamic entry (DT_INIT) that causes _init to be called by the run-time linker when the binary is loaded. In the binary that was built with USDT but wasn't working properly, there was a .init section with the call to dtrace_dof_init(), but no _init symbol. The problem was, of course, that crti.o and crtn.o weren't being specified in the linker invocation, resulting in a .init section but no _init symbol, so no DT_INIT entry, so no initialization, and no USDT.

Monday Jul 25, 2005

Believe it or not...

I really never thought I'd be sticking a picture of a penguin on my weblog, but here it is...
Joined by Bryan and Eric, I'll be speaking (and listening) at the OpenSolaris BoF at LinuxWorld (August 9, 5:30-7:00). We'll be talking about using and developing on OpenSolaris, how you can participate, and most likely comparing Solaris and Linux as well as giving the mandatory DTrace pitch. I expect it will be a little like the times I've been to Yankee Stadium in full Red Sox regalia.


Tuesday Jun 14, 2005

Debugging cross calls on OpenSolaris

I think the thing I love most about debugging software is that each tough bug can seem like an insurmountable challenge -- until you figure it out. But until you do, each tough bug is the hardest problem you've ever had to solve. There are weeks when every morning presents me with a seemingly impossible challenge, and each afternoon I get to spike my keyboard and do a little victory dance before running for the train.

For my first OpenSolaris blog post, I thought I'd talk about one of my favorite such bugs. This particularly nasty bug had to do with a tricky interaction between VMware and DTrace (pre-production versions of each, to be clear). My buddy Keith -- a fellow Brown CS grad -- gave me a call and told me about some strange behavior he was seeing running DTrace inside of a VMware VM. Keith is a big fan of DTrace, but an intermittent yet reproducible problem was putting a damper on his DTrace enthusiasm. Every once in a while, running DTrace would cause the system to freeze. Because Solaris was running in the virtual environment, he could see that both virtual CPUs were spinning away, but they weren't making any forward progress. After a couple of back-and-forths over email, I made the trip down to Palo Alto so we could work on the problem together.

Using some custom debugging tools, we were able to figure out where the two virtual CPUs were spinning. One CPU was in xc_common() and the other was in xc_serv() -- code to handle cross calls. So what was going on?

cross calls

Before I can really delve into the problem, I want to give just a brief overview of cross calls. In general terms, a cross call (xcall) is used in a multi-processor (MP) system when one CPU needs to get another CPU to do some work. It works by sending a special type of interrupt which the remote CPU handles. You may have heard the term interprocessor interrupt (IPI) -- same thing. One example of when xcalls are used is when unmapping a memory region. To unmap a region, a process will typically call the munmap(2) system call. Remember that in an MP system, any processor may have run threads in this process, so the relevant mappings may be present in any CPU's TLB. The unmap operation executes on one CPU, but the other CPUs need to remove those mappings from their own TLBs. To accomplish this communication, the kernel uses an xcall.

DTrace uses xcalls to synchronize data used by all CPUs by ensuring that all CPUs have reached a certain point in the code. DTrace executes actions with interrupts disabled (an explanation of why this must be so is well beyond the bounds of this discussion), so we can tell that a CPU isn't in probe context if it's able to handle our xcall. When DTrace is stopping tracing activity, for example, it will update some data that affects all CPUs and then use an xcall to make sure that every CPU has seen its effects before proceeding:


10739    	/*
10740    	 * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync
10741    	 * to be sure that every CPU has seen it.  See below for the details
10742    	 * on why this is done.
10743    	 */
10744    	state->dts_activity = DTRACE_ACTIVITY_DRAINING;
10745    	dtrace_sync();

dtrace_sync() sends an xcall to all other CPUs and has them spin in a holding pattern until all CPUs have reached that point, at which time the CPU that sent the xcall releases them all (and they go back to whatever they had been doing when they received the interrupt). That's the high-level overview; let's go into a little more detail on how xcalls work (well, actually a lot more detail).

xcall implementation

If you follow the sequence of functions called by dtrace_sync() (and I encourage you to do so), you'll find that it eventually calls xc_common() to do the heavy lifting. It's important to note that this call to xc_common() will have the sync argument set to 1. What does that mean? In a textbook example of good software engineering, someone did a good job documenting what this value means:


 411    /*
 412     * Common code to call a specified function on a set of processors.
 413     * sync specifies what kind of waiting is done.
 414     *	-1 - no waiting, don't release remotes
 415     *	0 - no waiting, release remotes immediately
 416     *	1 - run service locally w/o waiting for remotes.
 417     *	2 - wait for remotes before running locally
 418     */
 419    static void
 420    xc_common(
 421    	xc_func_t func,
 422    	xc_arg_t arg1,
 423    	xc_arg_t arg2,
 424    	xc_arg_t arg3,
 425    	int pri,
 426    	cpuset_t set,
 427    	int sync)

Before you start beating your brain out trying to figure out what you're missing here, in this particular case the distinction between sync having the value 1 or 2 is moot: the service (the function pointer specified by the func argument) that we're running is dtrace_sync_func(), which does literally nothing.

Let's start picking apart xc_common():


 446    	/*
 447    	 * Request service on all remote processors.
 448    	 */
 449    	for (cix = 0; cix < NCPU; cix++) {
 450    		if ((cpup = cpu[cix]) == NULL ||
 451    		    (cpup->cpu_flags & CPU_READY) == 0) {
 452    			/*
 453    			 * In case CPU wasn't ready, but becomes ready later,
 454    			 * take the CPU out of the set now.
 455    			 */
 456    			CPUSET_DEL(set, cix);
 457    		} else if (cix != lcx && CPU_IN_SET(set, cix)) {
 458    			CPU_STATS_ADDQ(CPU, sys, xcalls, 1);
 459    			cpup->cpu_m.xc_ack[pri] = 0;
 460    			cpup->cpu_m.xc_wait[pri] = sync;
 461    			if (sync > 0)
 462    				cpup->cpu_m.xc_state[pri] = XC_SYNC_OP;
 463    			else
 464    				cpup->cpu_m.xc_state[pri] = XC_CALL_OP;
 465    			cpup->cpu_m.xc_pend[pri] = 1;
 466    			send_dirint(cix, xc_xlat_xcptoipl[pri]);
 467    		}
 468    	}

We take a first pass through all the processors; if a processor is ready to go and is in the set of processors we care about (they all are in the case of dtrace_sync()), we set the remote CPU's ack flag to 0, its wait flag to sync (remember, 1 in this case), and its state flag to XC_SYNC_OP, and then actually send the interrupt with the call to send_dirint().

Next we wait for the remote CPUs to acknowledge that they've executed the requested service which they do by setting the ack flag to 1:


 479    	/*
 480    	 * Wait here until all remote calls complete.
 481    	 */
 482    	for (cix = 0; cix < NCPU; cix++) {
 483    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 484    			cpup = cpu[cix];
 485    			while (cpup->cpu_m.xc_ack[pri] == 0) {
 486    				ht_pause();
 487    				return_instr();
 488    			}
 489    			cpup->cpu_m.xc_ack[pri] = 0;
 490    		}
 491    	}

That while loop spins waiting for ack to become 1. If you look at the definition of return_instr(), its name is actually more descriptive than you might imagine: it's just a return instruction -- the most trivial function possible. I'm not absolutely certain, but I think it was put there so the compiler wouldn't "optimize" the loop away. The call to the inline function ht_pause() is there so that the thread spins in a way that's considerate on a hyper-threaded CPU. The call to ht_pause() is probably sufficient to prevent the compiler from being overly clever, but the legacy call to return_instr() remains.

Now let's look at the other side of this conversation: what happens on a remote CPU as a result of this interrupt? This code is in xc_serv():


 138    	/*
 139    	 * Acknowledge that we have completed the x-call operation.
 140    	 */
 141    	cpup->cpu_m.xc_ack[pri] = 1;

I'm sure it comes as no surprise that after executing the given function, it just sets the ack flag.

Since in this case we're dealing with a synchronous xcall, the remote CPU then needs to just chill out until the CPU that initiated the xcall discovers that all remote CPUs have executed the function and are ready to be released:


 146    	/*
 147    	 * for (op == XC_SYNC_OP)
 148    	 * Wait for the initiator of the x-call to indicate
 149    	 * that all CPUs involved can proceed.
 150    	 */
 151    	while (cpup->cpu_m.xc_wait[pri]) {
 152    		ht_pause();
 153    		return_instr();
 154    	}
 156    	while (cpup->cpu_m.xc_state[pri] != XC_DONE) {
 157    		ht_pause();
 158    		return_instr();
 159    	}

And here's the code on the initiating side that releases all the remote CPUs by setting the wait and state flags to the values that the remote CPUs are waiting to see:


 502    	/*
 503    	 * Release any waiting CPUs
 504    	 */
 505    	for (cix = 0; cix < NCPU; cix++) {
 506    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 507    			cpup = cpu[cix];
 508    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 509    				cpup->cpu_m.xc_wait[pri] = 0;
 510    				cpup->cpu_m.xc_state[pri] = XC_DONE;
 511    			}
 512    		}
 513    	}

there's a problem

Wait! Without reading ahead in the code, does anyone see the problem?

Back at VMware, Keith hacked up a version of the virtual machine monitor which allowed us to trace certain points in the code and figure out the precise sequence in which they occurred. We traced the entries to and returns from xc_common() and xc_serv(). Almost every time we'd see something like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_serv() on CPU 1
  • exit xc_common() on CPU 0

or sometimes like this:
  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • exit xc_serv() on CPU 1

But the problem happened when we saw a sequence like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • enter xc_common() on CPU 0

And nothing further. What was happening was that after releasing the remote CPUs, CPU 0 was exiting from the call to xc_common() and calling it again before the remote invocation of xc_serv() on the other CPU had a chance to exit.

Recall that one of the first things xc_common() does is set the state flag. The first call to xc_common() set the state flag to XC_DONE to release the remote CPU from xc_serv(); when things went wrong, the second call to xc_common() overwrote that flag before the remote CPU got a chance to see it.

the problem

We were seeing this repeatably under VMware, but no one had seen this at all on real hardware (yet). The machine Keith and I were using was a 2-way box running Linux. On VMware, each virtual CPU is represented by a thread on the native OS so rather than having absolute control of the CPU, the execution was more or less at the whim of the Linux scheduler.

When this code is running unadulterated on physical CPUs, we won't see this problem. It's just a matter of timing -- the remote CPU has many, many fewer instructions to execute before the state flag gets overwritten by a second xcall, so there's no problem. On VMware, the Linux scheduler might decide that your second virtual CPU's right to the physical CPU is trumped by moving the hands on your xclock (why not?), so there are no guarantees about how long these operations can take.

the fix

There are actually quite a few ways to fix this problem -- I'm sure you can think of at least one or two off the top of your head. We just need to make sure that subsequent xcalls can't interfere with each other. When we found this, Solaris 10 was wrapping up -- we were still making changes, but only those deemed of the absolute highest importance. Making changes to the xcall code (which is rather delicate and risky to change) for a bug that only manifested itself on virtual hardware (and which VMware could work around using some clever trickery[1]) didn't seem worthy of being designated a show-stopper.

Keith predicted a few possible situations where this same bug could manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence of service management interrupts. And that prediction turned out to be spot on: a few weeks after root causing the bug under VMware, we hit the same problem on a system with four hyper-threaded chips (8 logical CPUs).

Since at that time we were even closer to shipping Solaris 10, I chose the fix I thought was the safest and least likely to have nasty side effects. After releasing remote CPUs, the code in xc_common() would now wait for remote CPUs to check in -- wait for them to acknowledge receipt of the directive to proceed.


 515    	/*
 516    	 * Wait for all CPUs to acknowledge completion before we continue.
 517    	 * Without this check it's possible (on a VM or hyper-threaded CPUs
 518    	 * or in the presence of Service Management Interrupts which can all
 519    	 * cause delays) for the remote processor to still be waiting by
 520    	 * the time xc_common() is next invoked with the sync flag set
 521    	 * resulting in a deadlock.
 522    	 */
 523    	for (cix = 0; cix < NCPU; cix++) {
 524    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 525    			cpup = cpu[cix];
 526    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 527    				while (cpup->cpu_m.xc_ack[pri] == 0) {
 528    					ht_pause();
 529    					return_instr();
 530    				}
 531    				cpup->cpu_m.xc_ack[pri] = 0;
 532    			}
 533    		}
 534    	}

In that comment, I tried to summarize in 6 lines what has just taken me several pages to describe. And maybe I should have said "livelock" -- oh well. Here's the complementary code in xc_serv():


 170    	/*
 171    	 * Acknowledge that we have received the directive to continue.
 172    	 */
 173    	ASSERT(cpup->cpu_m.xc_ack[pri] == 0);
 174    	cpup->cpu_m.xc_ack[pri] = 1;


That was one of my favorite bugs to work on, and it's actually fairly typical of a lot of the bugs I investigate: something's going wrong; figure out why. I think the folks who work on Solaris tend to love that kind of stuff as a rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb, CTF, fancy core files, and libdis so that the hard part of investigating mysterious problems isn't gathering data or testing hypotheses, it's thinking of the questions to answer and inventing new hypotheses. It's my hope that OpenSolaris will attract those types of inquisitive minds that thrive on the (seemingly) unsolvable problem.

[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently (and not surprisingly) there are problems like this in nearly every operating system where the code implicitly or explicitly relies on the relative timing of certain operations. In these cases, VMware has hacks to do things like execute the shifty code in lock step.


Wednesday Mar 02, 2005

Beautiful S10/RHEL drag race

Network World Fusion has an article comparing Solaris 10 to Red Hat Enterprise Linux 4.0 (the self-proclaimed Solaris killer). The sweetest part of the article is the graph showing relative performance numbers. Solaris gets beat in 32-bit mode, but edges out Linux in 64-bit mode on identical hardware. And we've got a bunch more x86-64 performance in the pipeline. This benchmark covers just one very specific area of the system, but it's exciting nonetheless.

Thursday Feb 17, 2005

Sanity and FUD

With all the FUD coming out of Red Hat, I thought the online rags would fall all over themselves to call shotgun on the anti-Solaris bandwagon. But LinuxInsider seems to have planted their flag on rational reporting:

Stacey Quandt, Open Source Practice Leader for the Robert Frances Group, agreed. "Although many organizations continue to migrate Solaris workloads from UltraSPARC to Linux on x86 and other platforms, the latest version of Red Hat does not eliminate the need for Solaris in the enterprise," she said.

"The performance characteristics of Solaris on AMD Opteron and the open-source licensing of Solaris widens the choice of open-source operating systems on commodity hardware," Quandt told LinuxInsider.

Did you catch that? Not only is Solaris a performance leader, but it's an open source operating system.

But not so fast! In another FUD barrage, it turns out there are too many OSI-approved licenses -- guess what newly open-sourced operating system is using one of the apparently extraneous licenses. Kudos to slashdot for transmogrifying the title "OSI Urged To Reform Open-Source Licensing" to "OSI Hopes To Decrease Number of Licenses" -- turns out a change in voice can have a radical change in meaning...

Thursday Nov 25, 2004

Solaris 10 in the news: week 1

It's been just over a week since we officially launched Solaris 10 and the reactions from the press have been all over the map. Consider these two declarations:

Not many open source aficionados will realize the impact, but by making Solaris 10 free and capable of operating on any kind of hardware, Sun is making a coup in the server market.
As a result Linux will probably not grow much beyond its current market share of about 10% leaving Red Hat and especially Novell with a big problem.


Sun's announcement of the launch of free open source Solaris 10 has garnered a mild positive response from the investment community. However it also has raised a lot of skepticism from the technical community. The current plans are not good enough.


In case you didn't notice the URLs: those are from the same guy, posted about 20ns apart. Why the sudden reversal? slashdot! Yes, the Linus-loyal quickly rose to Linux's defense, citing the multitude of reasons Open Solaris will fail.

It will be a while before anyone knows the relative success or failure of Open Solaris -- we're not even sure of the terms of the license -- so this is the last challenge I'm going to rise to. But look: the Linux that exists today is a direct result, literally and philosophically, of its community. And while Solaris has so far been developed by a closed community, it also reflects the timbre of that community. For example: Linux adheres to a spartan philosophy of debugging help -- ergo no crash dumps; Solaris takes a different tack -- we require projects to come with debugger commands to examine crash dumps. It's the community that defines the project and, as is evident, it's the community that defends it, but don't discount the very real Solaris community.

The dumbest slight is against Solaris's hardware compatibility: do you think Linux always ran on as much stuff as it does now? Of course not. A month ago, did Solaris run on the (sick) new laptop I'm typing this on? Nope. The hardware these two operating systems support doesn't define their communities, but rather the reverse. When people can hack up Solaris to work on their gear, they will just as they've been doing on Linux for years.

I can't wait for the license to be announced, for the site to open for bidness, and to have people start contributing code from outside of Sun. Does anyone really doubt that these things will happen? The only question is whether Open Solaris will take off -- wait and see.

Wednesday Nov 17, 2004

Solaris 10 Launch

I was at the Solaris 10 launch for most of Monday, and it was a pretty fantastic day for everyone working on Solaris 10. I spent about two hours helping to answer questions in an online chat -- here's the transcript -- about Solaris 10 in what was dubbed a webchat sweatshop. There were a bunch of us from the Solaris group as well as Scott and some other execs all huddled around laptops while the HR folks beat the drum at a slow, but steady pace: These answers have to be on the streets of Hong Kong by morning.

I spent the rest of the afternoon talking to customers and the press -- mostly about DTrace -- and the exciting thing was that even more than before, they're getting it and they're excited about Solaris. One of the most indelible moments was when a group of us from the kernel group were talking to Jem Matzan of The Jem Report and he challenged the claim that Solaris 10 was the most innovative operating system ever. As we all painted the picture of Solaris 10, I realized that this was true. Solaris 10 isn't just a random collection of neat crap, but it comes together as a cohesive whole that innovates in each and every place an operating system can innovate. Maybe that's a stretch, but it's not far off.

Thursday Nov 11, 2004

LinuxInsider on Solaris

Check out LinuxInsider's article on Solaris 10. In particular, the last paragraph has high praise for Solaris 10 and Sun's forthcoming hardware ventures:

Bring them together in production systems and what do you get? The ability to organize your business around a single physical computer redundantly implemented in processors spread across your network -- meaning that a lot of business processes now limited by technology costs and software complexity can be simplified right down to affordability. That's what Solaris 10 is really about, and the 10-year impact is likely to be like nothing we've seen before.

Come Monday, I hope to see similar thoughtful analysis coming from other publications.

Wednesday Oct 13, 2004

more on gcore

Trawling through b.s.c I noticed Fintan Ryan talking about gcore(1), and I realized that I hadn't sufficiently promoted this cool utility. As part of my work adding variable core file content, I rewrote gcore from scratch (it used to be a real pile) to add a few new features and to make it use libproc (i.e. make it slightly less of a pile).

You use gcore to take a core dump of a live running process without actually causing the process to crash. It's not completely uninvasive because gcore stops the process you're taking the core of to ensure a consistent snapshot, but unless the process is huge or it's really cranky about timing, the perturbation isn't noticeable. There are a lot of places where taking a snapshot with gcore is plenty useful. Let's say a process is behaving strangely, but you can't attach a debugger because you don't want to take down the service, or you want to have a core file to send to someone who can debug it when you yourself can't -- gcore is perfect. I use it to take cores of mozilla when it's chugging away on the processor, but not making any visible progress.

I mentioned that big processes can take a while to gcore -- not surprising because we have to dump that whole image out to disk. One of the cool uses of variable core file content is the ability to take faster core dumps by only dumping the sections you care about. Let's say there's some big ISM segment or a big shared memory segment: exclude it and gcore will go faster:

hedge /home/ahl -> gcore -c default-ism 256755
gcore: core.256755 dumped

Pretty handy, but the coolest use I've been making of gcore lately is by mixing it with DTrace and the new(ish) system() action. This script snapshots my process once every ten seconds and names the files according to the time they were produced:

# cat gcore.d
#pragma D option destructive
#pragma D option quiet

tick-10s
{
        doit = 1;
}

syscall:::entry
/doit && pid == $1/
{
        system("gcore -o core.%%t %d", pid);
        system("prun %d", pid);
        doit = 0;
}

# dtrace -s gcore.d  256755
gcore: core.1097724567.256755 dumped
gcore: core.1097724577.256755 dumped
gcore: core.1097724600.256755 dumped

WARNING! When you specify destructive in DTrace, it means destructive. The system() and stop() actions can be absolutely brutal (I've rendered at least one machine unusable by my indelicate use of that Ramirez-Ortiz-ian one-two combo). That said, if you screw something up, you can break into the debugger and set dtrace_destructive_disallow to 1.
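For reference, flipping that variable from kmdb looks something like this -- a sketch only; the prompt and the old/new values shown depend on your machine and what you've done to it:

    [0]> dtrace_destructive_disallow/W 1

Writing it back to 0 re-enables destructive actions once you've cleaned up whatever you broke.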

OK, so be careful, but that script can give you some pretty neat results. Maybe you have some application that seems to be taking a turn for the worse around 2 a.m. -- put together a DTrace script that detects the problem and use gcore to take a snapshot so you can figure out what was going on when you get to the office in the morning. Take a couple of snapshots to see how things are changing. You do like debugging from core dumps, right?
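As a sketch of that idea -- the probe and the trigger condition here are hypothetical; substitute whatever "taking a turn for the worse" means for your app -- a script along these lines grabs a single core the first time a write(2) from the target process starts failing:

#pragma D option destructive
#pragma D option quiet

syscall::write:return
/pid == $1 && errno != 0 && !dumped/
{
        dumped = 1;
        system("gcore -o core.%%t %d", pid);
}

Run it overnight with dtrace -s against the suspect pid, and the core file will be waiting for you in the morning.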


Adam Leventhal, Fishworks engineer

