Monday Aug 08, 2005

Where have I been?

It's been almost a month since my last blog post, so I thought I'd post an update. I spent the month of July in Massachusetts, alternately on vacation, working remotely, and attending my brother's wedding. The rest of the LAE (Linux Application Environment) team joined me (and Nils) for a week out there, and we made some huge progress on the project. For the curious, we're working out how best to leverage OpenSolaris to help the project and the community; once that's settled, we can go into more detail about what the final product will look like. Until then, suffice it to say "we're working on it". All this time on LAE did prevent me from spending time with my other girlfriend, ZFS. Since getting back, I've caught up with most of the ZFS work in my queue, and the team has made huge progress on ZFS in my absence. As much as I'd like to talk about details (or a schedule), I can't :-( But trust me, you'll know when ZFS integrates into Nevada; there are many bloggers who will not be so quiet when that putback notice comes by. Not to mention that the source code will hit OpenSolaris shortly thereafter.

Tomorrow I'll be up at LinuxWorld, hanging out at the booth with Ben and hosting the OpenSolaris BOF along with Adam and Bryan (Dan will be there as well, though he didn't make the "official" billing). Whether you know nothing about OpenSolaris or are one of our dedicated community members, come check it out.

Tuesday Jul 12, 2005

Operating system tunables

There's an interesting discussion over at opensolaris-code, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here's an executive summary:

  • When possible, the system should be auto-tuning - If you are creating a tunable to control fine grained behavior of your program or operating system, you should first ask yourself: "Why does this tunable exist? Why can't I just pick the best value?" More often than not, you'll find the answer is "Because I'm lazy" or "The problem is too hard." Only in rare circumstances is there ever a definite need for a tunable, and such tunables almost always control coarse on-off behavior.

  • If a tunable is necessary, it should be as specific as possible - The days of dumping every tunable under the sun into /etc/system are over. Very rarely do tunables need to be system wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.

  • Tunables should be controlled by a well defined interface - /etc/system and /proc are not your personal landfills. /etc/system is by nature undocumented, and designing it as your primary interface is fundamentally wrong. /proc is well documented, but it's also well defined to be a process filesystem. Besides the enormous breakage you'd introduce by adding /proc/tunables, it's philosophically wrong. The /system directory is a slightly better choice, but it's intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don't view filesystems as a primary administrative interface, but as a programmatic API upon which more sophisticated tools can be built.

One of the best examples of these principles can be seen in the updated System V IPC tunables. Dave Powell rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn't be were converted into resource controls administered on a per process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.
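To make the "per process, through a well defined interface" idea concrete, here is a minimal sketch using the POSIX getrlimit(2) and setrlimit(2) calls. This is only an analogy: it is not the Solaris resource control interface that the IPC tunables were converted to, and the helper names get_fd_limit and set_fd_limit are made up for this example. What it shows is a limit that is read and changed for one process, through a documented API, without touching any system-wide file.

```c
#include <sys/resource.h>

/*
 * Fetch the calling process's soft limit on open file descriptors.
 * Returns 0 on success, -1 on failure.
 */
int
get_fd_limit(rlim_t *softp)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return (-1);

	*softp = rl.rlim_cur;
	return (0);
}

/*
 * Change the soft limit for this process only.  The hard limit is
 * read back and left untouched, so the new value must not exceed it.
 */
int
set_fd_limit(rlim_t soft)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return (-1);

	rl.rlim_cur = soft;
	return (setrlimit(RLIMIT_NOFILE, &rl));
}
```

The change affects only the calling process (and anything it later forks), which is exactly the scoping that a line in /etc/system cannot give you.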

There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:

Indeed, one of the more amusing stories was a Platinum Beta customer showing us some slideware from a certain company comparing their OS against Solaris. The slides were discussing available tunables, and the basic gist was something like:

"We used to have way fewer tunables than Solaris, but now we've caught up and have many more than they do. Our OS rules!"

Needless to say, we thought the company was missing the point.

Friday Jul 01, 2005

A parting MDB challenge

Like most of Sun's US employees, I'll be taking the next week off for vacation. On top of that, I'll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother's wedding. I'll leave you with an MDB challenge, this time much more involved than past "puzzles". I don't have any prizes lying around, but this one would certainly be worth one if I had anything to give.

So what's the task? To implement munges as a dcmd. Here's the complete description:

Implement a new dcmd, ::stacklist, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to 'munges':

> ::stacklist
73      ##################################  tp: fffffe800000bc80
        swtch+0xdf()
        cv_wait+0x6a()
        taskq_thread+0x1ef()
        thread_start+8()

38      ##################################  tp: ffffffff82b21880
        swtch+0xdf()
        cv_wait_sig_swap_core+0x177()
        cv_wait_sig_swap+0xb()
        cv_waituntil_sig+0xd7()
        lwp_park+0x1b1()
        syslwp_park+0x4e()
        sys_syscall32+0x1ff()

...

The first number is the frequency of the given stack, and the 'tp' pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the '-v' option, the dcmd should print out all threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (addr::walk samestack) would be nice.

This is not an easy dcmd to write, at least when doing it correctly. The first key is to use as little memory as possible. This dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from ::findstack or ::stack into your dcmd. Ideally, you should be able to invoke ::findstack without worrying about its inner workings. Alternatively, restructuring the code to share a common routine would also be acceptable.
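To make the grouping step concrete, here is a minimal sketch in plain C of the bookkeeping such a dcmd would need: hash each stack, keep one counter and one representative thread per distinct stack, then sort by frequency. It deliberately does not use the MDB module API. The names (stack_summary_t, stack_summary_add, stack_summary_sort), the fixed table size, and the choice of an FNV-1a hash are all assumptions made for illustration, and a real dcmd would get its frames from ::findstack rather than being handed arrays of PCs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define	MAX_GROUPS	64	/* assumed cap on distinct stacks */

typedef struct stack_group {
	uint64_t	sg_hash;	/* hash of the frame addresses */
	uintptr_t	sg_rep;		/* first thread seen with this stack */
	unsigned	sg_count;	/* number of threads sharing it */
} stack_group_t;

typedef struct stack_summary {
	stack_group_t	ss_groups[MAX_GROUPS];
	size_t		ss_ngroups;
} stack_summary_t;

/* FNV-1a hash over the bytes of each frame address. */
static uint64_t
stack_hash(const uintptr_t *pcs, size_t npcs)
{
	uint64_t h = 14695981039346656037ULL;
	size_t i, b;

	for (i = 0; i < npcs; i++) {
		for (b = 0; b < sizeof (uintptr_t); b++) {
			h ^= (pcs[i] >> (8 * b)) & 0xff;
			h *= 1099511628211ULL;
		}
	}
	return (h);
}

/* Record one thread's stack.  Returns 0, or -1 if the table is full. */
int
stack_summary_add(stack_summary_t *ss, uintptr_t thread,
    const uintptr_t *pcs, size_t npcs)
{
	uint64_t h = stack_hash(pcs, npcs);
	size_t i;

	for (i = 0; i < ss->ss_ngroups; i++) {
		if (ss->ss_groups[i].sg_hash == h) {
			ss->ss_groups[i].sg_count++;
			return (0);
		}
	}

	if (ss->ss_ngroups == MAX_GROUPS)
		return (-1);

	ss->ss_groups[ss->ss_ngroups].sg_hash = h;
	ss->ss_groups[ss->ss_ngroups].sg_rep = thread;
	ss->ss_groups[ss->ss_ngroups].sg_count = 1;
	ss->ss_ngroups++;
	return (0);
}

/* qsort comparator: larger counts sort first. */
static int
group_cmp(const void *a, const void *b)
{
	const stack_group_t *ga = a;
	const stack_group_t *gb = b;

	if (ga->sg_count == gb->sg_count)
		return (0);
	return (ga->sg_count < gb->sg_count ? 1 : -1);
}

/* Order the groups so the most frequent stack comes first. */
void
stack_summary_sort(stack_summary_t *ss)
{
	qsort(ss->ss_groups, ss->ss_ngroups, sizeof (stack_group_t),
	    group_cmp);
}
```

Storing only a hash per group keeps the memory footprint small, at the cost of treating a hash collision as an identical stack. Whether that trade is acceptable, and how to support '-v' without remembering every thread, is part of the challenge.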

This command would be hugely beneficial when examining system hangs or other "soft failures," where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke 'munges') would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self contained, low risk, but non-trivial, and gets you familiar with MDB at the same time. Personally, I have always found the observability tools a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.

If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere...

Sunday Jun 26, 2005

Virtualization and OpenSolaris

There's actually a decent piece over at eWeek discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we're trying to accomplish, I thought I'd follow up with a little technical background on virtualization and why we're investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental ;-)

Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens' Zones comparison chart.

Benefits of Virtualization

First off, virtualization is here to stay. Our customers need virtualization - it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as VMware is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they're trying to accomplish, and companies try to pitch a single solution to virtualization problems without recognizing that more appropriate solutions may exist. The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they become a system in themselves - and often they don't play nicely with other applications on the box. So "one app per machine" has become the common paradigm. The second most common need is security, either for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc), but these are the two primary ones.

So what does virtualization buy you? It's all about reducing costs, but there are really two types of cost associated with running a system:

  1. Hardware costs - This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).
  2. Software management costs - This includes the cost of deploying new machines, upgrading and patching software, and observing software behavior.

As we'll see, different virtualization strategies provide different qualities of the above savings.

Hardware virtualization

This is one of the most well-established forms of virtualization. The most common examples today are Sun Domains and IBM Logical Partitions. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantages of requiring no software layer, having no performance impact, and providing hardware fault isolation. The downside is that it requires specialized hardware that is extremely expensive, and provides zero benefit for reducing software management costs.

Software machine virtualization

This approach is probably the one most commonly associated with the term "virtualization". In this scheme, a software layer is created which allows multiple OS instances to run on the same hardware. The most commercialized versions are VMware and Virtual PC, but other projects exist (such as qemu and PearPC). Typically, they require a "host" operating system as well as multiple "guests" (although VMware ESX server runs a custom kernel as the host). While Xen uses a paravirtualization technique that requires changes to the guest OS, it is still fundamentally a machine virtualization technique. And Usermode Linux takes a radically different approach, but accomplishes basically the same task.

In the end, this approach has strengths and weaknesses similar to those of hardware assisted virtualization. You don't have to buy expensive special-purpose hardware, but you give up the hardware fault isolation and often sacrifice performance (Xen's approach lessens this impact, but it's still visible). But most importantly, you still don't save any costs associated with software management - administering software on 10 virtual machines is just as expensive as administering 10 separate machines. And you have no visibility into what's happening within the virtual machine - you may be able to tell that Xen is consuming 50% of your CPU, but you can't tell why unless you log into the virtual system itself.

Software application virtualization

On the grand scale of virtualization, this ranks as the "least virtualized". With this approach, the operating system uses various tricks and techniques to present an alternate view of the machine. This can range from simple chroot(1), to BSD Jails, to Solaris Zones. Each of these provides a more complete OS view with varying degrees of isolation. While Zones is the most complete and the most secure, they all use the same fundamental idea of a single operating system presenting an "alternate reality" that appears to be a complete system at the application level. The upcoming Linux Application Environment on OpenSolaris will take this approach by leveraging Zones and emulating Linux at the system call layer.

The most significant downside to this approach is the fact there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the "guest" environments have limited access to hardware facilities. On the other hand, this approach results in huge savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest, using standard operating system tools, and you can manage them as you would any other processes, using standard resource management tools. You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will be able to benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).

OpenSolaris virtualization in the future

So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is because "our customers want us to." And hopefully, from what's been said above, it's obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones is probably your best bet. If you require the ability to survive hardware faults between virtual machines, then domains is the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from RedHat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you're just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port to Solaris first.

With respect to Linux virtualization in particular, we are always going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris. But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won't run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.

I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we're making the best choices for our customers.

You may note that I failed to mention cross-architecture virtualization. This is most common at the system level (like PearPC), but application-level solutions do exist (including Apple's upcoming Rosetta). This type of virtualization simply doesn't factor into our plans yet, and it still falls under the umbrella of one of the broad virtualization types.

I also apologize for any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.

Saturday Jun 25, 2005

Fun source code facts

A while ago, for my own amusement, I went through the Solaris source base and searched for the source files with the most lines. For some unknown reason this popped in my head yesterday so I decided to try it again. Here are the top 10 longest files in OpenSolaris:

Length  Source File
29944   usr/src/uts/common/io/scsi/targets/sd.c
25920   [closed]
25429   usr/src/uts/common/inet/tcp/tcp.c
22789   [closed]
16954   [closed]
16339   [closed]
15667   usr/src/uts/common/fs/nfs4_vnops.c
14550   usr/src/uts/sfmmu/vm/hat_sfmmu.c
13931   usr/src/uts/common/dtrace/dtrace.c
13027   usr/src/uts/sun4u/starfire/io/idn_proto.c

You can see some of the largest files are still closed source. Note that the length of the file doesn't necessarily indicate anything about the quality of the code; it's more just idle curiosity. Knowing the quality of online journalism these days, I'm sure this will get turned into "Solaris source reveals completely unmaintainable code" ...

After looking at this, I decided a much more interesting question was "which source files are the most commented?" To answer this question, I ran every source file through a script I found that counts the number of commented lines in each file. I filtered out those files that were less than 500 lines long, and ran the results through another script to calculate the percentage of lines that were commented. Lines which have a comment along with source are considered a commented line, so some of the ratios were quite high. I filtered out those files which were mostly tables (like uwidth.c), as these comments didn't really count. I also ignored header files, because they tend to be far more commented than the implementation itself. In the end I had the following list:

Percentage  File
62.9%       usr/src/cmd/cmd-inet/usr.lib/mipagent/snmp_stub.c
58.7%       usr/src/cmd/sgs/libld/amd64/amd64unwind.c
58.4%       usr/src/lib/libtecla/common/expand.c
56.7%       usr/src/cmd/lvm/metassist/common/volume_nvpair.c
56.6%       usr/src/lib/libtecla/common/cplfile.c
55.6%       usr/src/lib/libc/port/gen/mon.c
55.4%       usr/src/lib/libadm/common/devreserv.c
55.1%       usr/src/lib/libtecla/common/getline.c
54.5%       [closed]
54.3%       usr/src/uts/common/io/ib/ibtl/ibtl_mem.c
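For anyone who wants to repeat the experiment, the counting rule described above (a line counts as commented if any part of a comment appears on it, even alongside code) can be sketched in a few lines of C. This is not the script that produced the numbers in the table; the names count_commented_lines and comment_percentage are invented here, and the sketch makes the simplifying assumption that comment markers never appear inside string or character literals.

```c
#include <stddef.h>

typedef struct line_counts {
	size_t	lc_total;	/* lines seen */
	size_t	lc_commented;	/* lines touching any comment */
} line_counts_t;

/* Scan NUL-terminated C source text and count commented lines. */
line_counts_t
count_commented_lines(const char *src)
{
	line_counts_t lc = { 0, 0 };
	int in_block = 0;		/* inside a block comment */
	int line_has_comment = 0;
	int line_has_text = 0;		/* any character on this line */

	for (const char *p = src; *p != '\0'; p++) {
		if (*p == '\n') {
			lc.lc_total++;
			if (line_has_comment)
				lc.lc_commented++;
			/* An open block comment carries onto the next line. */
			line_has_comment = in_block;
			line_has_text = 0;
			continue;
		}

		line_has_text = 1;
		if (in_block) {
			line_has_comment = 1;
			if (p[0] == '*' && p[1] == '/') {
				in_block = 0;
				p++;
			}
		} else if (p[0] == '/' && p[1] == '*') {
			in_block = 1;
			line_has_comment = 1;
			p++;
		} else if (p[0] == '/' && p[1] == '/') {
			line_has_comment = 1;
			/* Skip the rest of a line comment. */
			while (p[1] != '\0' && p[1] != '\n')
				p++;
		}
	}

	/* Count a final line that lacks a trailing newline. */
	if (line_has_text) {
		lc.lc_total++;
		if (line_has_comment)
			lc.lc_commented++;
	}
	return (lc);
}

/* Percentage of lines that are commented; 0 for an empty file. */
double
comment_percentage(line_counts_t lc)
{
	if (lc.lc_total == 0)
		return (0.0);
	return (100.0 * (double)lc.lc_commented / (double)lc.lc_total);
}
```

Because a trailing comment marks the whole line as commented, a file full of one-line annotations scores very high, which is one reason the ratios in the table look so generous.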

Now, when I write code I tend to hover in the 20-30% comments range (my best of those in the gate is gfs.c, which with Dave's help is 44% comments). Some of the above are rather over-commented (especially snmp_stub.c, which likes to repeat comments above and within functions).

I found this little experiment interesting, but please don't base any conclusions on these results. They are for entertainment purposes only.

Thursday Jun 23, 2005

MDB puzzle, take two

Since Bryan solved my last puzzle a little too quickly, this post will serve as a followup puzzle that may or may not be easier. All I know is that Bryan is ineligible this time around ;-)

Once again, the rules are simple. The solution must be a single line dcmd that produces precise output without any additional steps or post processing. For this puzzle, you're actually allowed two commands: one for your dcmd, and another for '::run'. For this puzzle, we'll be using the following test program:

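#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int i;

        srand(time(NULL));

        /* Issue 100 empty writes to random descriptors in the range 0-9. */
        for (i = 0; i < 100; i++)
                write(rand() % 10, NULL, 0);

        return (0);
}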

The puzzle itself demonstrates how conditional breakpoints can be implemented on top of existing functionality:

Stop the test program on entry to the write() system call only when the file descriptor number is 7

I thought this one would be harder than the last, but now I'm not so sure, especially once you absorb some of the finer points from the last post.

MDB puzzle

On a lighter note, I thought I'd post an "MDB puzzle" for the truly masochistic out there. I was going to post two, but the second one was just way too hard, and I was having a hard time finding a good test case in userland. You can check out how we hope to make this better over at the MDB community. Unfortunately I don't have anything cool to give away, other than my blessing as a truly elite MDB hacker. Of course, if you get this one right I might just have to post the second one I had in mind...

The rules are simple. You can only use a single line command in 'mdb -k'. You cannot use shell escapes (!). Your answer must be precise, without requiring post-processing through some other utility. Leaders of the MDB community and their relatives are ineligible, though other Sun employees are welcome to try. And now, the puzzle:

Print out the current working directory of every process with an effective user id of 0.

Should be simple, right? Well, make sure you go home and study your MDB pipelines, because you'll need some clever tricks to get this one just right...

Sunday Jun 19, 2005

Adding a kernel module to OpenSolaris

On opening day, I chose to post an entry on adding a system call to OpenSolaris. Considering the feedback, I thought I'd continue with brief "How-To add to OpenSolaris" documents for a while. There's a lot to choose from here, so I'll just pick them off as quickly as I can. Today's topic is adding a new kernel module to OpenSolaris.

For the sake of discussion, we will be adding a new module that does nothing apart from print a message on load and unload. It will be architecture-neutral, and be distributed as part of a separate package (to give you a taste of our packaging system). We'll continue my narcissistic tradition and name this the "schrock" module.

1. Adding source

To begin, you must put your source somewhere in the tree. It must be put somewhere under usr/src/uts/common, but exactly where depends on the type of module. Just about the only real rule is that filesystems go in the "fs" directory. The bulk of the modules live in the "io" directory, since the majority of modules are drivers of some kind. For now, we'll put 'schrock.c' in the "io" directory:

#include <sys/modctl.h>
#include <sys/cmn_err.h>

static struct modldrv modldrv = {
	&mod_miscops,
	"schrock module %I%",
	NULL
};

static struct modlinkage modlinkage = {
	MODREV_1, (void *)&modldrv, NULL
};

int
_init(void)
{
	cmn_err(CE_WARN, "OpenSolaris has arrived");
	return (mod_install(&modlinkage));
}

int
_fini(void)
{
	cmn_err(CE_WARN, "OpenSolaris has left the building");
	return (mod_remove(&modlinkage));
}

int
_info(struct modinfo *modinfop)
{
	return (mod_info(&modlinkage, modinfop));
}

The code is pretty simple, and is basically the minimum needed to add a module to the system. You'll notice we use 'mod_miscops' in our modldrv. If we were adding a device driver or filesystem, we would be using a different set of linkage structures.

2. Creating Makefiles

We must add two Makefiles to get this building:

usr/src/uts/intel/schrock/Makefile
usr/src/uts/sparc/schrock/Makefile

With contents similar to the following:

UTSBASE = ../..

MODULE          = schrock
OBJECTS         = $(SCHROCK_OBJS:%=$(OBJS_DIR)/%)
LINTS           = $(SCHROCK_OBJS:%.o=$(LINTS_DIR)/%.ln)
ROOTMODULE      = $(ROOT_MISC_DIR)/$(MODULE)

include $(UTSBASE)/intel/Makefile.intel

ALL_TARGET      = $(BINARY)
LINT_TARGET     = $(MODULE).lint
INSTALL_TARGET  = $(BINARY) $(ROOTMODULE)

CFLAGS          += $(CCVERBOSE)

.KEEP_STATE:

def:            $(DEF_DEPS)

all:            $(ALL_DEPS)

clean:          $(CLEAN_DEPS)

clobber:        $(CLOBBER_DEPS)

lint:           $(LINT_DEPS)

modlintlib:     $(MODLINTLIB_DEPS)

clean.lint:     $(CLEAN_LINT_DEPS)

install:        $(INSTALL_DEPS)

include $(UTSBASE)/intel/Makefile.targ

3. Modifying existing Makefiles

There are two remaining Makefile chores before we can continue. First, we have to add the set of files to usr/src/uts/common/Makefile.files:

KMDB_OBJS += kdrv.o

SCHROCK_OBJS += schrock.o

BGE_OBJS += bge_main.o bge_chip.o bge_kstats.o bge_log.o bge_ndd.o \
                bge_atomic.o bge_mii.o bge_send.o bge_recv.o

If you had created a subdirectory for your module instead of placing it in "io", you would also have to add a set of rules to usr/src/uts/common/Makefile.rules. If you need to do this, make sure you get both the object targets and the lint targets, or you'll get build failures if you try to run lint.

You'll also need to modify the usr/src/uts/intel/Makefile.intel file, as well as the corresponding SPARC version:

MISC_KMODS      += usba usba10
MISC_KMODS      += zmod
MISC_KMODS      += schrock

#
#       Software Cryptographic Providers (/kernel/crypto):
#

4. Creating a package

As mentioned previously, we want this module to live in its own package. We start by creating usr/src/pkgdefs/SUNWschrock and adding it to the list of COMMON_SUBDIRS in usr/src/pkgdefs/Makefile:

        SUNWsasnm \
        SUNWsbp2 \
        SUNWschrock \
        SUNWscpr  \
        SUNWscpu  \

Next, we have to add a skeleton package system. Since we're only adding a miscellaneous module and not a full blown driver, we only need a simple skeleton. First, there's the Makefile:

include ../Makefile.com

.KEEP_STATE:

all: $(FILES)
install: all pkg

include ../Makefile.targ

A 'pkginfo.tmpl' file:

PKG=SUNWschrock
NAME="Sample kernel module"
ARCH="ISA"
VERSION="ONVERS,REV=0.0.0"
SUNW_PRODNAME="SunOS"
SUNW_PRODVERS="RELEASE/VERSION"
SUNW_PKGVERS="1.0"
SUNW_PKGTYPE="root"
MAXINST="1000"
CATEGORY="system"
VENDOR="Sun Microsystems, Inc."
DESC="Sample kernel module"
CLASSES="none"
HOTLINE="Please contact your local service provider"
EMAIL=""
BASEDIR=/
SUNW_PKG_ALLZONES="true"
SUNW_PKG_HOLLOW="true"

And 'prototype_com', 'prototype_i386', and 'prototype_sparc' (elided) files:

# prototype_i386
!include prototype_com

d none kernel/misc/amd64 755 root sys
f none kernel/misc/amd64/schrock 755 root sys

# prototype_com
i pkginfo

d none kernel 755 root sys
d none kernel/misc 755 root sys
f none kernel/misc/schrock 755 root sys

5. Putting it all together

If we pkgadd our package, or BFU to the resulting archives, we can see our module in action:

halcyon# modload /kernel/misc/schrock
Jun 19 12:43:35 halcyon schrock: WARNING: OpenSolaris has arrived
halcyon# modunload -i 197
Jun 19 12:43:50 halcyon schrock: WARNING: OpenSolaris has left the building

This process is common to all kernel modules (though packaging is simpler for those combined in SUNWckr, for example). Things get a little more complicated and a little more specific when you begin to talk about drivers or filesystems in particular. I'll try to create some simple howtos for those as well.

Friday Jun 17, 2005

Observability in OpenSolaris

Just a heads up that we've formed a new OpenSolaris Observability community. There's not much there right now, but I encourage you to head over and check out what OpenSolaris has to offer. Or come to the discussion forum and gripe about what features we're still missing. Topics covered include process, system, hardware, and post mortem observability. We'll be adding much more content as soon as we can.

GDB to MDB Migration, Part Two

So talking to Ben last night convinced me I needed to finish up the GDB to MDB reference that I started last month. So here's part two.

GDB               MDB                     Description

Program Stack

backtrace n       ::stack, $C             Display stack backtrace for the current thread.
-                 thread::findstack -v    Display a stack for a given thread. In the kernel, thread is the address of the kthread_t. In userland, it's the thread identifier.
info ...          -                       Display information about the current frame. MDB doesn't support the debugging data necessary to maintain the frame abstraction.

Execution Control

continue, c       :c                      Continue target.
stepi, si         ::step, ]               Step to the next machine instruction. MDB does not support stepping by source lines.
nexti, ni         ::step over, [          Step over the next machine instruction, skipping any function calls.
finish            ::step out              Continue until returning from the current frame.
jump *address     address>reg             Jump to the given location. In MDB, reg depends on your platform. For SPARC it's 'pc', for i386 it's 'eip', and for amd64 it's 'rip'.

Display

print expr        addr::print expr        Print the given expression. In GDB you can specify variable names as well as addresses. For MDB, you give a particular address and then specify the type to display (which can include dereferencing of members, etc).
print /f          addr/f                  Print data in a precise format. See ::formats for a list of MDB formats.
disassem addr     addr::dis               Disassemble text at the given address, or the current PC if no address is specified.

This is just a primer. Both programs support a wide variety of additional options. Running 'mdb -k', you can quickly see just how many commands are out there:

> ::dcmds ! wc -l
     385
> ::walkers ! wc -l
     436

One helpful trick is ::dcmds ! grep thing, which searches the description of each command. Good luck, and join the discussion over at the OpenSolaris MDB community if you have any questions or tips of your own.

Tuesday Jun 14, 2005

How to add a system call to OpenSolaris

When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.

As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog would be describing exactly how to add a new system call to the kernel.

For the purposes of this post, we will assume that it's a simple system call that lives in the generic kernel code, and we'll put the code into an existing file to avoid having to deal with Makefiles. The goal is to print an arbitrary message to the console whenever the system call is issued.

1. Picking a syscall number

Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'. So first we add the following line to syscall.h:

#define SYS_uadmin      55
#define SYS_schrock     56
#define SYS_utssys      57
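The 16*32 arithmetic is easier to see with a bitmask in hand. The following is a minimal sketch of a 16-word set, where each system call number owns one bit. The type and function names (sysmask_t, sysmask_add, sysmask_contains) are hypothetical and zero-based for simplicity; this is not the real sysset_t definition or the macros that operate on it.

```c
#include <stdint.h>

#define	SET_WORDS	16
#define	BITS_PER_WORD	32
#define	SET_MAX		(SET_WORDS * BITS_PER_WORD)	/* 16 * 32 = 512 */

typedef struct sysmask {
	uint32_t	sm_word[SET_WORDS];
} sysmask_t;

/* Set the bit for 'num'.  Returns -1 if the number cannot be represented. */
int
sysmask_add(sysmask_t *sm, unsigned num)
{
	if (num >= SET_MAX)
		return (-1);

	sm->sm_word[num / BITS_PER_WORD] |=
	    (uint32_t)1 << (num % BITS_PER_WORD);
	return (0);
}

/* Returns 1 if the bit for 'num' is set, 0 otherwise. */
int
sysmask_contains(const sysmask_t *sm, unsigned num)
{
	if (num >= SET_MAX)
		return (0);

	return ((sm->sm_word[num / BITS_PER_WORD] >>
	    (num % BITS_PER_WORD)) & 1);
}
```

With this layout, number 56 lands in word 1 at bit 24, and anything at or above 512 simply has no bit to live in, which is where the hard limit comes from.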

2. Writing the syscall handler

Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:

#include <sys/cmn_err.h>

int
schrock(void *arg)
{
	char	buf[1024];
	size_t	len;

	if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
		return (set_errno(EFAULT));

	cmn_err(CE_WARN, "%s", buf);

	return (0);
}

Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.
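
The two caveats above (a large buffer on the stack, and no check on the string length) can be illustrated outside the kernel. What follows is a userland sketch only: copy_bounded() is a hypothetical helper, and malloc() stands in for whatever allocator kernel code would actually use. It is not a replacement for the handler, just one way the pattern might look.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MSG_MAX	1024	/* same scratch size as the example handler */

/*
 * Hypothetical helper: copy a NUL-terminated string from 'src' into a
 * heap buffer, accepting at most 'max' bytes including the terminator.
 * On success, stores the new buffer in *out and returns 0; the caller
 * frees it. Returns ENAMETOOLONG if no terminator is found within
 * 'max' bytes, ENOMEM if allocation fails, EFAULT for NULL pointers.
 */
static int
copy_bounded(const char *src, size_t max, char **out)
{
	size_t len;
	char *buf;

	if (src == NULL || out == NULL)
		return (EFAULT);

	/* Look for the terminator without reading past 'max' bytes. */
	for (len = 0; len < max; len++) {
		if (src[len] == '\0')
			break;
	}
	if (len == max)
		return (ENAMETOOLONG);

	/* Allocate only what is needed, and keep it off the stack. */
	buf = malloc(len + 1);
	if (buf == NULL)
		return (ENOMEM);

	memcpy(buf, src, len + 1);
	*out = buf;
	return (0);
}
```

The design choice is simply to make the length failure an explicit, distinct error instead of silently relying on the scratch buffer being large enough.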

3. Adding an entry to the syscall table

We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:

int     rename();
void    rexit();
int     schrock();
int     semsys();
int     setgid();

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",           schrock,        1),
        /* 57 */ IF_LP64(
                        SYSENT_2CI("utssys",    utssys64,       4),
                        SYSENT_2CI("utssys",    utssys32,       4)),

/* ... */

        /* 54 */ SYSENT_CI("ioctl",             ioctl,          3),
        /* 55 */ SYSENT_CI("uadmin",            uadmin,         3),
        /* 56 */ SYSENT_CI("schrock",           schrock,        1),
        /* 57 */ SYSENT_2CI("utssys",           utssys32,       4),
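
To make the role of the table concrete, here is a much-simplified userland model of a number-indexed dispatch table. Everything in it is an assumption made for illustration: sysent_model_t, dispatch(), and the demo handlers are hypothetical, and the real sysent entries carry more information than a name, an argument count, and a function pointer.

```c
#include <stddef.h>

/* Hypothetical handler signature for this model only. */
typedef long (*handler_t)(const long *args);

typedef struct sysent_model {
	const char	*name;		/* syscall name */
	int		nargs;		/* number of arguments expected */
	handler_t	handler;	/* NULL marks an unused slot */
} sysent_model_t;

/* Demo handlers: stand-ins with trivial, easily checked behavior. */
static long
demo_sum3(const long *args)
{
	return (args[0] + args[1] + args[2]);
}

static long
demo_echo(const long *args)
{
	return (args[0]);
}

#define NSLOTS	58

/* Slots not listed here are zero-filled, so their handler is NULL. */
static const sysent_model_t table[NSLOTS] = {
	[55] = { "uadmin",	3, demo_sum3 },
	[56] = { "schrock",	1, demo_echo },
};

/* Look up 'num' and call its handler; -1 models "no such syscall". */
static long
dispatch(int num, const long *args, int nargs)
{
	if (num < 0 || num >= NSLOTS || table[num].handler == NULL)
		return (-1);
	if (nargs != table[num].nargs)
		return (-1);
	return (table[num].handler(args));
}
```

The point of the model is that the number chosen in step 1 is nothing more than an index: if the slot is empty or the entry is wrong, the call never reaches the handler.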

4. /etc/name_to_sysnum

At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following lines to both the Intel and SPARC versions:

ioctl                   54
uadmin                  55
schrock                 56
utssys                  57
fdsync                  58
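
As a rough illustration of how a consumer might use a file in this format, the sketch below looks up a name in text made of "name number" lines. The function lookup_sysnum() is hypothetical and assumes well-formed input like the excerpt above; it is not how dtrace(1M) or truss(1) actually read the file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: search 'text' (lines of "name number") for
 * 'name'. Returns the number, or -1 if the name is not present.
 * Assumes every line is well formed, as in the excerpt above.
 */
static int
lookup_sysnum(const char *text, const char *name)
{
	const char *line = text;
	char entry[64];
	int num;

	while (line != NULL && *line != '\0') {
		/* Parse one "name number" pair from the current line. */
		if (sscanf(line, "%63s %d", entry, &num) == 2 &&
		    strcmp(entry, name) == 0)
			return (num);

		/* Advance to the character after the next newline. */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return (-1);
}
```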

5. truss(1)

Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:

{"ioctl",       3, DEC, NOV, DEC, IOC, IOA},                    /*  54 */
{"uadmin",      3, DEC, NOV, DEC, DEC, DEC},                    /*  55 */
{"schrock",     1, DEC, NOV, STG},                              /*  56 */
{"utssys",      4, DEC, NOV, HEX, DEC, UTS, HEX},               /*  57 */
{"fdsync",      2, DEC, NOV, DEC, FFG},                         /*  58 */

Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.
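
For a feel of what a per-argument type table buys you, here is a toy userland model of table-driven argument formatting. The names ARG_DEC, ARG_HEX, ARG_STR, sysdesc_t and format_call() are hypothetical stand-ins chosen for this sketch; they are not truss's own constants or functions, and the output format is invented rather than copied from truss.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical argument kinds, loosely analogous to DEC, HEX, STG. */
typedef enum { ARG_DEC, ARG_HEX, ARG_STR } argtype_t;

typedef struct sysdesc {
	const char	*name;		/* syscall name */
	int		nargs;		/* number of arguments */
	argtype_t	type[4];	/* how to print each argument */
} sysdesc_t;

/*
 * Format "name(arg, ...)" into buf according to the descriptor.
 * Returns the number of characters written, or -1 on truncation.
 */
static int
format_call(const sysdesc_t *d, const uintptr_t *args, char *buf, size_t len)
{
	size_t off;
	int i, n;

	n = snprintf(buf, len, "%s(", d->name);
	if (n < 0 || (size_t)n >= len)
		return (-1);
	off = (size_t)n;

	for (i = 0; i < d->nargs; i++) {
		const char *sep = (i == 0) ? "" : ", ";

		switch (d->type[i]) {
		case ARG_HEX:
			n = snprintf(buf + off, len - off, "%s0x%lX", sep,
			    (unsigned long)args[i]);
			break;
		case ARG_STR:
			/* The argument is a pointer to a C string. */
			n = snprintf(buf + off, len - off, "%s\"%s\"", sep,
			    (const char *)args[i]);
			break;
		default:
			n = snprintf(buf + off, len - off, "%s%ld", sep,
			    (long)args[i]);
			break;
		}
		if (n < 0 || (size_t)n >= len - off)
			return (-1);
		off += (size_t)n;
	}

	n = snprintf(buf + off, len - off, ")");
	if (n < 0 || (size_t)n >= len - off)
		return (-1);
	return ((int)(off + (size_t)n));
}
```

Without the string entry in the table, a decoder of this kind could only print the raw pointer value, which is why the table has to be kept in sync with the syscall's real signature.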

6. proc_names.c

This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:

        "ioctl",                /* 54 */
        "uadmin",               /* 55 */
        "schrock",              /* 56 */
        "utssys",               /* 57 */
        "fdsync",               /* 58 */
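
The number-to-name direction is just an array lookup, which is why a missing entry is so easy to overlook. Below is a minimal sketch of that idea; sysnum_to_name() and the names array are hypothetical, only a few slots are filled in, and the "SYS#n" fallback for unknown numbers is an assumption of this sketch, not a statement about what libproc does.

```c
#include <stdio.h>

/* Sparse demo table; slots not listed are NULL. */
static const char *const names[] = {
	[54] = "ioctl",
	[55] = "uadmin",
	[56] = "schrock",
	[57] = "utssys",
	[58] = "fdsync",
};

#define NNAMES	(sizeof (names) / sizeof (names[0]))

/*
 * Hypothetical helper: write the name for syscall 'num' into buf.
 * Unknown numbers get a placeholder of the form "SYS#n" (an
 * assumption of this sketch). Returns buf, or NULL if bufsz is 0.
 */
static char *
sysnum_to_name(int num, char *buf, size_t bufsz)
{
	if (bufsz == 0)
		return (NULL);

	if (num >= 0 && (size_t)num < NNAMES && names[num] != NULL)
		(void) snprintf(buf, bufsz, "%s", names[num]);
	else
		(void) snprintf(buf, bufsz, "SYS#%d", num);

	return (buf);
}
```

If the "schrock" line were left out of such a table, the syscall itself would still work; only tools that print names would fall back to the placeholder.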

7. Putting it all together

Finally, everything is in place. We can test our system call with a simple program:

#include <sys/syscall.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	syscall(SYS_schrock, "OpenSolaris Rules!");
	return (0);
}

If we run this on our system, we'll see the following output on the console:

June 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!

Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.


Saturday Jun 04, 2005

FISL Final Day

The last day of FISL has come and gone, thankfully. I'm completely drained, both physically and mentally. As you can probably tell from the comments on yesterday's blog entry, we had quite a night out last night in Porto Alegre. I didn't stay out quite as late as some of the Brazil guys, but Ken and I made it back in time to catch about 4 hours of sleep before heading off to the conference. Thankfully I remembered to set my alarm, otherwise I probably would have ended up in bed until the early afternoon. The full details of the night are better told in person...

This last day was significantly quieter than previous days. With the conference winding down, I assume that many people took off early. Most of our presentations today were to an audience of 2 or 3 people, and we even had to cancel some of the early ones as no one was there. I managed to give presentations for Performance, Zones, and DTrace, despite my complete lack of sleep. The DTrace presentation was particularly rough because it's primarily demo-driven, with no set plan. This turns out to be rather difficult after a night of no sleep and a few too many caipirinhas.

The highlight of the day was when a woman (stunningly beautiful, of course) came up to me while I was sitting in one of the chairs and asked to take a picture with me. We didn't talk at all, and I didn't know who she was, but she seemed psyched to be getting her picture taken with someone from Sun. I just keep telling myself that it was my stunning good looks that resulted in the picture, not my badge saying "Sun Microsystems". I can dream, can't I?

Tomorrow begins the 24 hours of travelling to get me back home. I can't wait to get back to my own apartment and a normal lifestyle.

Friday Jun 03, 2005

FISL Day 3

The exhaustion continues to increase. Today I did 3 presentations: DTrace, Zones, and FMA (which turned into OpenSolaris). Every one took up the full hour allotted. And tomorrow I'm going to add a Solaris performance presentation, to bring the grand total to 4 hours of presentations. Given how bad the acoustics are on the exposition floor, my goal is to lose my voice by the end of the night. So far, I've settled into a schedule: wake up around 7:00, check email, work on slides, eat breakfast, then get to the conference around 8:45. After a full day of talking and giving presentations, I get back to the hotel around 7:45 and do about an hour of work/email before going out to dinner. We get back from dinner around 11:30, at which point I get to blogging and finishing up some work. Eventually I get to sleep around 1:00, at which point I have to do the whole thing the next day. Thank god tomorrow is the end; I don't know how much more I can take.

Today's highlight was when Dimas (from Sun Brazil) began an impromptu Looking Glass demo towards the end of the day. He ended up overflowing our booth with at least 40 people for a solid hour before the commotion started to die down. Those of us sitting in the corner were worried we'd have to leave to make room. Our Solaris presentations hit 25 or so people, but never so many for so long. The combination of cool eye candy and a native Portuguese speaker really helped out (though most people probably couldn't hear him anyway).

Other highlights included hanging out with the folks at CodeBreakers, who really seem to dig Solaris (Thiago had S10 installed on his laptop within half a day). We took some pictures with them (which Dave should post soon), and are going out for barbeque and drinks tonight with them and 100+ other open source Brazil folks. I also helped a few other people get Solaris 10 installed on their laptops (mostly just the "disable USB legacy support" problem). It's unbelievably cool to see the results of handing out Solaris 10 DVDs before even leaving the conference. The top Solaris presentations were understandably DTrace and Zones, though the booth was pretty well packed all day.

Let's hope the last day is as good as the rest. Here's to Software Livre!

Thursday Jun 02, 2005

FISL Day 2

Another day at FISL, another day full of presentations. Today we did mini-presentations every hour on the hour, most of which were very well attended. When we overlapped with the major keynote sessions, turnout tended to be low, but other than that it was very successful. We covered OpenSolaris, DTrace, FMA, SMF, Security, as well as a Java presentation (by Charlie, not Dave or myself). As usual, lots of great questions from the highly technical audience.

The highlight today was a great conversation with a group of folks very interested in starting an OpenSolaris users group in Brazil. Extremely nice group of guys, very interested in technology and helping OpenSolaris build a greater presence in Brazil (both through user groups and Solaris attendance at conferences). After experiencing this conference and seeing the enthusiasm that everyone has for exciting technology and open source, I have to agree that Brazil is a great place to focus our OpenSolaris presence. Hopefully we'll see user groups pop up here as well as the rest of the world. We'll be doing everything we can to help from within Sun.

The other, more amusing, highlight of the day was during my DTrace demonstration. I needed an interesting Java application to demonstrate the jstack() DTrace action, so I started up the only Java application (apart from some internal Sun tools) that I use on a regular basis: Yahoo! Sports Fantasy Baseball StatTracker (the classic version, not the new Flash one). I tried to explain that maybe I was trying to debug why the app was lying to me about Tejada going 0-2 so far in the Sox/Orioles game; really he should have hit two homers and I should be dominating this week's scores [1]. I was rather amused, but I think the cultural divide was a little too wide. Not only baseball, but fantasy baseball: I don't blame the audience at all.

[1] This is clearly a lie. Despite any dreams of fantasy baseball domination, I would never root for my players in a game over the Red Sox. In the end, Ryan's 40.5 ERA was worth the bottom of the ninth comeback capped by Ortiz's 3-run shot.

Wednesday Jun 01, 2005

FISL Day 1

So the first day of FISL has come to a close. I have to say it went better than expected, based on the quality of questions posed by the audience and visitors to the Sun booth. If today is any indication, my voice is going to be completely gone by the end of the conference. I started off the day with a technical overview of Solaris 10/OpenSolaris. You can find the slides for this presentation here. Before taking too much credit myself, the content of these slides is largely based on Dan's USENIX presentation (thanks Dan!). This is a whirlwind tour of Solaris features - three slides per topic is nowhere near enough. Each of the major topics has been presented many times as a standalone 2-hour presentation, so you can imagine the corners I have to cut to cover them all.

My presentation was followed by a great OpenSolaris overview from Tom Goguen. His summary of the CDDL was one of the best I've ever seen - it was the first time I've seen an OpenSolaris presentation without a dozen questions about GPL, CDDL, and everybody's favorite pet license. Dave followed up with a detailed description of how Solaris is developed today and where we see OpenSolaris development heading in the future. All in all, we managed to cram 10+ hours of presentations into a measly 3 1/2 hours. For those of you who still have lingering questions, please stop by the Sun booth and chat with us about anything and everything. We'll be here all week.

After retiring to the booth, we had several great discussions with some of the attendees. The highlight of the day was when Dave was talking to an attendee about SMF (and the cool GUI he's working on) and I was feeling particularly bored. Since my laptop was hooked up to the monitor in the "community theater", I decided to play around with some DTrace scripts to come up with a cool demo. Within three minutes I had 4 or 5 people watching what I was doing, so I decided to start talking about all the wonders of DTrace. The 4 or 5 people quickly turned into 10 or 12, and pretty soon I found myself in the middle of a 3 hour mammoth DTrace demo, from which my voice is still recovering. This brings us to the major thing I learned today:

"If you DTrace it, they will come"
