Thursday Feb 03, 2011

After eight years, time for a change

I joined Sun more than eight years ago. Coming to Sun was one of the best things I have ever done. I have met a large number of incredible colleagues, and interacted via e-mail, IM and telephone with many more. I have loved working with each and every one of you, and wish I'd gotten to work with more of you. You have made Sun a wonderful place to work at. Sun's engineering culture produces results -- and happiness.

Besides the people, the technologies I've worked with at Sun, and at Oracle too, have yielded much intellectual pleasure. I respect our competition in operating systems, and I believe they have their strengths, but I will always have a soft spot in my heart for Solaris.

The sheer number of areas of Solaris I've either worked on directly or delved into is hard to believe. It is to Solaris' many engineers' credit that it is so easy to read its source code. There is too much good to say about Solaris and other Sun tech to attempt to do so in what really ought to be a short blog post.

The time has come for me to explore different opportunities. This is through no fault of Sun's nor Oracle's. There are other great people I want to work with elsewhere whom I've known since before coming here, and other projects I've wanted to work on that are not particularly relevant to my role here at Oracle. I could not say no indefinitely, and it seems quite worthwhile to explore, and hopefully create, new opportunities.

I've also gotten to interact with a great many people outside Sun/Oracle, via the IETF and customer contacts. It has been a privilege and a pleasure. Of course, I'll still be around, just elsewhere :)

My new blog is at . My e-mail address will be nico at For IM you can use either nico at or, both Jabber (XMPP). I'm on LinkedIn, Facebook, you know the drill. I'll be sure to look for and join off-OWAN IM rooms and mailing lists for Sun/Oracle diaspora (if you know of any, do please let me know).

Good luck and best wishes to one and all!

Tuesday Apr 13, 2010

On Unicode Normalization -- or why normalization insensitivity should be the rule

Do you know what Unicode normalization is? If you have to deal with Unicode, then you should know. Otherwise this blog post is not for you. Target audience: a) Internet protocol authors, reviewers, IETF WG chairs, the IESG, b) developers in general, particularly any working on filesystems, networked applications or text processing applications.

Short-story: Unicode allows various characters to be written as a single "pre-composed" codepoint or a sequence of one character codepoint plus one or more combining codepoints. Think of 'á': it could be written as a single codepoint that corresponds to the ISO-8859 'á' or as two codepoints, one being plain old ASCII 'a', and the other being the combining codepoint that says "add acute accent mark". There are characters that can have five and more different representations in Unicode.
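
To make the 'á' example concrete, here is a minimal sketch of the two encodings as UTF-8 byte strings. The constant and function names are made up for illustration; the byte values are the standard UTF-8 encodings of U+00E1 and of 'a' followed by U+0301.

```c
#include <string.h>

/* U+00E1 (pre-composed a-acute) encoded as UTF-8: two bytes. */
const char A_ACUTE_PRECOMPOSED[] = "\xC3\xA1";

/* 'a' followed by U+0301 (combining acute accent) as UTF-8: three bytes. */
const char A_ACUTE_DECOMPOSED[] = "a\xCC\x81";

/* Nonzero when a plain byte-wise comparison considers x and y equal. */
int bytewise_equal(const char *x, const char *y)
{
        return strcmp(x, y) == 0;
}
```

Both strings render as the same character, yet bytewise_equal() reports them as different, which is exactly the problem that normalization has to solve.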

The long version of the story is too long to go into here. If you find yourself thinking that Unicode is insane, then you need to acquaint yourself with that long story. There are good reasons why Unicode has multiple ways to represent certain characters; wishing it weren't so won't do.

Summary: Unicode normalization creates problems.

So far the approach most often taken in Internet standards to deal with Unicode normalization issues has been to pick a normalization form and then say you "MUST" normalize text to that form. This rarely gets implemented because the burden is too high. Let's call this the "normalize-always" ('n-a', for short) model of normalization. Specifically, in the context of Internet / application protocols, the normalize-always model requires normalizing when: a) preparing query strings (typically on clients), b) creating storage strings (typically on servers). The normalize-always model typically results in all implementors having to implement Unicode normalization, regardless of whether they implement clients or servers.

Examples of protocols/specifications using n-a: stringprep, IMAP/LDAP/XMPP/... via SASL via SASLprep, nameprep/IDNA (internationalized domain names), Net Unicode, e-mail headers, and many others.

I want to promote a better alternative to the normalize-always model: the normalization-insensitive / normalization-preserving (or 'n-i/n-p', for short) model.

In the n-i/n-p model you normalize only when you absolutely have to for interoperability:

  • when comparing Unicode strings (e.g., query strings to storage strings);
  • when creating hash/b-tree/other-index keys from Unicode strings (hash/index lookups are like string comparisons);
  • when you need canonical inputs to cryptographic signature/MAC generation/validation functions.

That's a much smaller number of times and places that one needs to normalize strings than the n-a model. Moreover, in the context of many/most protocols normalization can be left entirely to servers rather than clients -- simpler clients lead to better adoption rates. Easier adoption alone should be a sufficient advantage for the n-i/n-p model.

But it gets better too: the n-i/n-p model also provides better compatibility with and upgrade paths from legacy content. This is because in this model storage strings are not normalized on CREATE operations, which means that you can have Unicode and non-Unicode content co-existing side-by-side (though one should only do that as part of a migration to Unicode, as otherwise users can get confused).

The key to n-i/n-p is: fast n-i string comparison functions, as well as fast byte- or codepoint-at-a-time string normalization functions. By "fast" I mean that any time that two ASCII codepoints appear in sequence you have a fast path and can proceed to the next pair of codepoints starting with the second ASCII codepoint of the first pair. For all- or mostly-ASCII Unicode strings this fast path is not much slower than a typical for-every-byte loop. (Note that strcmp() optimizations such as loading and comparing 32 or 64 bits at a time apply to both, ASCII-only/8-bit-clean and n-i algorithms: you just need to check for any bytes with the high bit set, and whenever you see one you should trigger the slow path.) And, crucially, there's no need for memory allocation when normalization is required in these functions: why build normalized copies of the inputs when all you're doing is comparing or hashing them?
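
The following is a toy sketch of such a comparison function, not the ZFS code. It assumes valid UTF-8 input and uses a hypothetical three-entry decomposition table (DECOMP); a real implementation would need the full Unicode data and canonical reordering of combining marks. All names here are invented for illustration. It does show the two properties described above: an ASCII fast path taken whenever two ASCII bytes appear in sequence on both sides, and a slow path that decomposes one codepoint at a time without allocating memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy table (an assumption): pre-composed codepoint -> base + combining mark. */
struct decomp { uint32_t composed, base, mark; };
static const struct decomp DECOMP[] = {
        { 0x00E1, 'a', 0x0301 },        /* a-acute */
        { 0x00E9, 'e', 0x0301 },        /* e-acute */
        { 0x00F1, 'n', 0x0303 },        /* n-tilde */
};

/* Cursor yielding decomposed codepoints one at a time. */
struct cursor {
        const unsigned char *p;
        uint32_t pending;       /* combining mark still to emit, or 0 */
};

/* Decode one codepoint of (assumed valid) UTF-8 and advance *pp. */
static uint32_t decode(const unsigned char **pp)
{
        const unsigned char *p = *pp;
        uint32_t cp;
        int extra;

        if (p[0] < 0x80) { cp = p[0]; extra = 0; }
        else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }
        else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }
        else { cp = p[0] & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
                cp = (cp << 6) | (*p++ & 0x3F);
        *pp = p;
        return cp;
}

/* Next decomposed codepoint, or 0 at end of string. */
static uint32_t next(struct cursor *c)
{
        uint32_t cp;
        size_t i;

        if (c->pending != 0) {
                cp = c->pending;
                c->pending = 0;
                return cp;
        }
        if (*c->p == '\0')
                return 0;
        cp = decode(&c->p);
        for (i = 0; i < sizeof (DECOMP) / sizeof (DECOMP[0]); i++) {
                if (DECOMP[i].composed == cp) {
                        c->pending = DECOMP[i].mark;
                        return DECOMP[i].base;
                }
        }
        return cp;
}

/* True when p points at an ASCII byte followed by ASCII (or the NUL). */
static int ascii_pair(const unsigned char *p)
{
        return p[0] != '\0' && p[0] < 0x80 && p[1] < 0x80;
}

/* Returns 1 if a and b are equal under the toy normalization, else 0. */
int ni_equal(const char *a, const char *b)
{
        struct cursor ca = { (const unsigned char *)a, 0 };
        struct cursor cb = { (const unsigned char *)b, 0 };
        uint32_t x, y;

        for (;;) {
                /* Fast path: plain byte compare, advancing one byte. */
                while (ca.pending == 0 && cb.pending == 0 &&
                    ascii_pair(ca.p) && ascii_pair(cb.p)) {
                        if (*ca.p != *cb.p)
                                return 0;
                        ca.p++;
                        cb.p++;
                }
                /* Slow path: one decomposed codepoint from each side. */
                x = next(&ca);
                y = next(&cb);
                if (x != y)
                        return 0;
                if (x == 0)
                        return 1;
        }
}
```

With this sketch, ni_equal("caf\xC3\xA9", "cafe\xCC\x81") is true even though the two byte strings differ, and neither input is copied or modified.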

We've implemented normalization-insensitive/preserving behavior in ZFS, controlled by a dataset property. This means that NFS clients on Solaris, Linux, MacOS X, *BSD and Windows will interop with each other through ZFS-backed NFS servers regardless of what Unicode normalization forms they use, if any, and without having to modify the clients to normalize.

My proposal: a) update stringprep to allow for profiles that specify n-i/n-p behavior, b) update SASLprep and various other stringprep profiles (but NOT Nameprep, nor IDNA) to specify n-i/n-p behavior, c) update Net Unicode to specify n-i/n-p behavior while still allowing normalization on CREATE as an option, d) update any other protocols that use n-a and which would benefit from using n-i/n-p to use n-i/n-p.

Your reactions? I expect skepticism, but think carefully, and consider ZFS's solution (n-i/n-p) in the face of competitors that either normalize on CREATE or don't normalize at all, plus the fact that some operating systems tend to prefer NFC (e.g., Solaris, Windows, Linux, *BSD) while others prefer NFD (e.g., MacOS X). If you'd keep n-a, please explain why.

NOTE to Linus Torvalds (and various Linux developers) w.r.t. this old post on the git list: ZFS does not alter filenames on CREATE nor READDIR operations, ever [UPDATE: Apparently the port of ZFS to MacOS X used to normalize on CREATE to match HFS+ behavior]. ZFS supports case- and normalization-insensitive LOOKUPs -- that's all (compare to HFS+, which normalizes to NFD on CREATE).

NOTE ALSO that mixing Unicode and non-Unicode strings can cause strange codeset aliasing effects, even in the n-i/n-p model (if there are valid byte sequences in non-Unicode codesets that can be confused with valid UTF-8 byte sequences involving pre-composed and combining codepoints). I've not studied this codeset aliasing issue, but I suspect that the chance of such collisions with meaningful filenames is remote, and if the filesystem is set up to reject non-UTF-8 filenames then the chance that users will be able to create non-UTF-8 filenames without realizing that most such names will be rejected is infinitesimally small. This problem is best avoided by disallowing the creation of invalid UTF-8 filenames; ZFS has an option for that.

UPDATE: Note also that in some protocols you have to normalize early for cryptographic reasons, such as in Kerberos V5 AS-REQs when not using client name canonicalization, or in TGS-REQs when not using referrals. However, it's best to design protocols to avoid this.

Monday Jan 11, 2010

Ever wanted to be able to write C function calls with arguments named in the call?

Have you ever wished you could write

        result = foo(.arg3 = xyz, .arg2 = abc);
as you can in some programming languages? If not then you've probably not met an API with functions that have a dozen arguments most of which take default values. There exist such APIs, and despite any revulsion you might feel about them, there are often good reasons for the need for so many parameters that can be defaulted.

Well, you can get pretty close to such syntax, actually. I don't know why, but I thought of the following hack at 1AM last Thursday, trying to sleep:

struct foo_args {
        type1 arg1;
        type2 arg2;
        type3 arg3;
};

#define CALL_W_NAMED_ARGS3(result, fname, ...) \
        do { \
                struct fname ## _args _a = { __VA_ARGS__ }; \
                (result) = fname(_a.arg1, _a.arg2, _a.arg3); \
        } while (0)

        CALL_W_NAMED_ARGS3(res, foo, .arg2 = xyz, .arg3 = abc);

This relies on C99 struct initializer syntax and variadic macros, but it works. Arguments with non-zero default values can still be initialized since C99 struct initializer syntax allows fields to be assigned more than once.
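
As a sketch of the non-zero default trick, the macro can list the defaults first and let the caller's designated initializers override them. The function scale_add(), its argument struct and the macro name below are hypothetical, chosen only to make the example self-contained.

```c
/* Hypothetical three-parameter function used only for illustration. */
int scale_add(int base, int factor, int offset)
{
        return base * factor + offset;
}

struct scale_add_args {
        int base;
        int factor;
        int offset;
};

/*
 * factor defaults to 1; base and offset default to 0.  A later
 * ".factor = N" in the caller's list overrides the default, because
 * C99 lets a designated initializer name the same field twice and
 * the last one wins.
 */
#define CALL_SCALE_ADD(result, ...) \
        do { \
                struct scale_add_args _a = { .factor = 1, __VA_ARGS__ }; \
                (result) = scale_add(_a.base, _a.factor, _a.offset); \
        } while (0)
```

Calling CALL_SCALE_ADD(r, .base = 5) leaves factor at its default, while CALL_SCALE_ADD(r, .base = 3, .factor = 4, .offset = 2) overrides it. Some compilers warn about the overridden initializer even though it is valid C99.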

If you use GCC's statement expressions extension (which, incidentally, is supported by Sun Studio), you can even do this:

#define CALL_W_NAMED_ARGS3(fname, ...) \
        ({ \
                struct fname ## _args _a = { __VA_ARGS__ }; \
                fname(_a.arg1, _a.arg2, _a.arg3); \
        })

        res = CALL_W_NAMED_ARGS3(foo, .arg2 = xyz, .arg3 = abc);

You can even define a macro that allows you to call functions by pointer, provided you have suitable typedefs:

struct foo_t_args {
        type1 arg1;
        type2 arg2;
        type3 arg3;
};

#define CALL_PTR_W_NAMED_ARGS3(ftype, func, ...) \
        ({ \
                struct ftype ## _args _a = { __VA_ARGS__ }; \
                func(_a.arg1, _a.arg2, _a.arg3); \
        })

        foo_t f = ...;
        res = CALL_PTR_W_NAMED_ARGS3(foo_t, f, .arg2 = xyz, .arg3 = abc);

Useful? I'm not sure that I'd use it. I did a brief search and couldn't find anything on this bit of black magic, so I thought I should at least blog it.

What's interesting though is that C99 introduced initializer syntax that allows out of order, named field references (with missing fields getting initialized to zero) for structs (and arrays) but not function calls. The reason surely must be that while structs may not have unnamed fields, function prototypes not only may, but generally do have unnamed fields. (That and maybe no one thought of function calls with arguments named in the call parameter list).

Monday Aug 24, 2009

Using DTrace to debug encrypted protocols

UPDATED: I hadn't fully swapped in the context when I wrote this blog entry, and Jordan, the engineer working this bug, tells me that the primary problem is an incorrect interpretation of the security layers bitmask on the AD side. I describe that in detail at the end of the original post, and I've added links to the relevant RFCs.

A few months ago there was a bug report that the OpenSolaris CIFS server stack did not interop with Active Directory when "LDAP signing" was enabled. But packet captures, and truss/DTrace clearly showed that smbd/idmapd were properly encrypting and signing all LDAP traffic (when LDAP signing was disabled anyways), and with AES too. So, what gives?

Well, in the process of debugging the problem I realized that I needed to look at the cleartext of otherwise encrypted LDAP protocol data. Normally the way one would do this is to build a special version of the relevant library (the libsasl "gssapi" plugin, in this case) that prints the relevant cleartext. But that's really obnoxious. There's got to be a better way!

Well, there is. I'd already done this sort of thing in the past when debugging other interop bugs related to the Solaris Kerberos stack, and I'd done it with DTrace.

Let's drill down the protocol stack. The LDAP clients in our case were using SASL/GSSAPI/Kerberos V5, with confidentiality protection "SASL security layers", for network security. After looking at some AD docs I quickly concluded that "LDAP signing" clearly meant just that. So the next step was to look at the SASL/GSSAPI part of that stack. The RFC (originally RFC2222, now RFC4752) says that after exchanging the GSS-API Kerberos V5 messages [RFC4121] that set up a shared security context (session keys, ...), the server sends a message to the client consisting of: a one-byte bitmask indicating what "security layers" the server supports (none, integrity protection, or confidentiality+integrity protection), and a 24-bit, network byte order maximum message size. But these four bytes are encrypted, so I couldn't just capture packets and dissect them. The first order of business, then, was to extract these four bytes somehow.

I resorted to DTrace. Since the data in question is in user-land, I had to resort to using copyin() and hand-coding pointer traversal. The relevant function, gss_unwrap(), takes a pointer to a gss_buffer_desc struct that points to the ciphertext, and a pointer to another gss_buffer_desc where the pointer to the cleartext will be stored. The script:

#!/usr/sbin/dtrace -Fs

/*
 * If we establish a sec context, then the next unwrap
 * is of interest.
 */
        self->trace_unwrap = 1;

        self->trace_wrap = 1;

        /* Trace the ciphertext */
        this->gss_wrapped_bufp = arg2;
        this->buflen = *(unsigned int *)copyin(this->gss_wrapped_bufp, 4);
        this->bufp = *(unsigned int *)copyin(this->gss_wrapped_bufp + 4, 4);
        this->buf = copyin(this->bufp, 32);
        tracemem(this->buf, 32);

        /* Remember where the cleartext will go */
        self->gss_bufp = arg3;
        printf("unwrapped token will be in a gss_buffer_desc at %p\n", arg3);
        this->gss_buf = copyin(self->gss_bufp, 8);
        tracemem(this->gss_buf, 8);

/*
 * Now grab the cleartext and print it.
 */
/self->trace_unwrap && self->gss_bufp/
        this->gss_buf = copyin(self->gss_bufp, 8);
        tracemem(this->gss_buf, 8);
        this->buflen = *(unsigned int *)copyin(self->gss_bufp, 4);
        self->bufp = *(unsigned int *)copyin(self->gss_bufp + 4, 4);
        printf("\nServer wrap token was %d bytes long; data at %p (%p)\n",
                this->buflen, self->bufp, self->gss_bufp);
        this->buf = copyin(self->bufp, 4);
        self->trace_unwrap = 0;
        printf("Server wrap token data: %d\n", *(int *)this->buf);
        tracemem(this->buf, 4);

/*
 * Do the same for the client's reply to the
 * server's security layers and max message
 * size negotiation offer.
 */
        self->trace_wrap = 0;
        self->trace_unwrap = 0;
        this->gss_bufp = arg4;
        this->buflen = *(unsigned int *)copyin(this->gss_bufp, 4);
        this->bufp = *(unsigned int *)copyin(this->gss_bufp + 4, 4);
        this->buf = copyin(this->bufp, 4);
        printf("Client reply is %d bytes long: %d\n", this->buflen,
                *(int *)this->buf);
        tracemem(this->buf, 4);
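
The 4-byte reads at offsets 0 and 4 in the script make sense if the traced process is 32-bit, where a gss_buffer_desc is a 4-byte length followed by a 4-byte pointer. That is an inference from the script, not something stated above. A minimal C sketch of the same decoding, with a hypothetical mirror struct and function name:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of gss_buffer_desc as laid out in a 32-bit
 * process (an assumption): a 4-byte length, then a 4-byte pointer.
 */
struct gss_buffer_desc32 {
        uint32_t length;        /* size_t length */
        uint32_t value;         /* void *value, as a user-land address */
};

/* Decode the 8 bytes that copyin() would return for the descriptor. */
struct gss_buffer_desc32 decode_gss_buffer32(const unsigned char raw[8])
{
        struct gss_buffer_desc32 d;

        memcpy(&d.length, raw, 4);      /* offset 0: length */
        memcpy(&d.value, raw + 4, 4);   /* offset 4: pointer to the data */
        return d;
}
```

For a 64-bit process both fields would be 8 bytes wide, and the offsets and copyin() sizes would have to change accordingly.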

Armed with this script I could see that AD was offering all three security layer options, or only confidentiality protection, depending on whether LDAP signing was enabled. So far so good. The max message size offered was 10MB. 10MB! That's enormous, and fishy. I immediately suspected an endianness bug. 10MB flipped around would be... 40KB, which makes much more sense -- our client's default is 64KB. And what is 64KB interpreted as? All possible interpretations will surely be nonsensical to AD: 16MB, 256, or 1 byte.
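
The arithmetic behind that hunch can be checked with a short sketch. It assumes the sizes are held as 32-bit integers when their bytes are reversed, which is the reading under which 10MB becomes 40KB; the function names are hypothetical.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
        return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
            ((v << 8) & 0x00FF0000u) | (v << 24);
}

/*
 * Decode the maximum message size from the 4-byte negotiation token:
 * octet 1 is the security layer bitmask, octets 2 through 4 are the
 * size in network byte order.
 */
uint32_t max_size_from_token(const unsigned char token[4])
{
        return ((uint32_t)token[1] << 16) | ((uint32_t)token[2] << 8) |
            (uint32_t)token[3];
}
```

10MB is 0x00A00000, and swap32() turns it into 0x0000A000, which is 40KB. The client's 64KB is 0x00010000, which swaps to 0x00000100, or 256. A token of 0x07 0xA0 0x00 0x00 decodes to a maximum size of 10485760 bytes (the 0x07 mask is an illustrative value for "all three layers offered").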

Armed with a hypothesis, I needed more evidence. DTrace helped yet again. This time I used copyout to change the client's response to the server's security layer and max message size negotiation message. And lo and behold, it worked. The script:

#!/usr/sbin/dtrace -wFs

        self->trace_unwrap = 0;
        printf("This script is an attempted workaround for a possible interop bug in Windows Active Directory: if LDAP signing and sealing is enabled and idmapd fails to connect normally but succeeds when this script is used, then AD has an endianness interop bug in its SASL/GSSAPI implementation\n");

/*
 * We're looking to modify the SASL/GSSAPI client security layer and max
 * buffer selection.  That happens in the first wrap token sent after
 * establishing a sec context.
 */
        self->trace_unwrap = 1;

/* This is that call to gss_wrap() */
        self->trace_wrap = 0;
        self->trace_wrap = 0;
        self->trace_unwrap = 0;
        this->gss_bufp = arg4;
        this->buflen = *(unsigned int *)copyin(this->gss_bufp, 4);
        this->bufp = *(unsigned int *)copyin(this->gss_bufp + 4, 4);
        this->sec_layer = *(char *)copyin(this->bufp, 1);
        this->maxbuf_msb = (char *)copyin(this->bufp + 1, 1);
        this->maxbuf_mid = (char *)copyin(this->bufp + 2, 1);
        this->maxbuf_lsb = (char *)copyin(this->bufp + 3, 1);

        printf("The client wants to select: sec_layer = %d, max buffer = %d\n",
                this->sec_layer,
                (*this->maxbuf_msb << 16) +
                (*this->maxbuf_mid << 8) +
                *this->maxbuf_lsb);

        /* Now fix it so it matches what we've seen AD advertise */
        *this->maxbuf_msb = 0xa0;
        *this->maxbuf_mid = 0;
        *this->maxbuf_lsb = 0;
        copyout(this->maxbuf_msb, this->bufp + 1, 1);
        copyout(this->maxbuf_mid, this->bufp + 2, 1);
        copyout(this->maxbuf_lsb, this->bufp + 3, 1);
        printf("Modified the client's SASL/GSSAPI max buffer selection\n");

/*
 * These wrap tokens will be for the security layer -- if we see these
 * then idmapd and AD are happy together
 */
        printf("It worked!  AD has an endianness interop bug in its SASL/GSSAPI implementation -- tell them to read RFC4752\n");

Yes, DTrace is unwieldy when dealing with user-land C data (and no doubt it's even more so for high level language data). But it does the job!

Separately from the endianness issue, AD also misinterprets the security layers bitmask. The RFC is clear, in my opinion, though it takes careful reading (so maybe it's "clear"), that this bitmask is a mask of one, two or three bits set when sent by the server, but a single bit when sent by the client. It's also clear, if one follows the chain of documents, that "confidentiality protection" means "confidentiality _and_ integrity protection" in this context (again, perhaps I should say "clear"). The real problem is that the RFC is written in English, not in English-technicalese, saying this about the bitmask sent by the server:

              The client passes this token to GSS_Unwrap and interprets
   the first octet of resulting cleartext as a bit-mask specifying the
   security layers supported by the server and the second through fourth
   octets as the maximum size output_message to send to the server.

and this about the bitmask sent by the client:

   client then constructs data, with the first octet containing the
   bit-mask specifying the selected security layer, the second through
   fourth octets containing in network byte order the maximum size
   output_message the client is able to receive, and the remaining
   octets containing the authorization identity.

Note that "security layers" is plural in the first case, singular in the second.

Note too that for GSS-API mechanisms GSS_Wrap/Unwrap() always do integrity protection -- only confidentiality protection is optional. But RFCs 2222/4752 say nothing of this, so that only an expert in the GSS-API would have known this. AD expects the client to send 0x06 as the bitmask when the server is configured to require LDAP signing and sealing. Makes sense: 0x04 is "confidentiality protection" ("sealing") and 0x02 is "integrity protection" ("signing"). But other implementations would be free to consider that an error, which means that we have an interesting interop problem... And, given the weak language of RFCs 2222/4752, this mistake seems entirely reasonable, even if it is very unfortunate.
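
To illustrate the interop problem, here is a sketch of what a strict reading of the RFC might look like in a server-side check: the client must pick exactly one layer, and it must be one the server offered. The function name is hypothetical, and this is not claimed to be what any particular implementation does. The layer values are the ones given in RFC4752.

```c
/* Security layer bits as defined in RFC4752. */
enum {
        SASL_LAYER_NONE            = 0x01,
        SASL_LAYER_INTEGRITY       = 0x02,
        SASL_LAYER_CONFIDENTIALITY = 0x04
};

/*
 * Strict reading: the client's octet must have exactly one bit set,
 * that bit must be a defined layer, and the server must have offered it.
 * Returns 1 if the choice is acceptable, 0 otherwise.
 */
int client_choice_ok_strict(unsigned char offered, unsigned char chosen)
{
        int one_bit = chosen != 0 && (chosen & (chosen - 1)) == 0;
        int defined = (chosen & ~0x07) == 0;

        return one_bit && defined && (chosen & offered) == chosen;
}
```

Under this check a client reply of 0x04 is accepted when the server offered 0x07, but the 0x06 that AD expects is rejected, because it has two bits set.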

Tuesday Jun 09, 2009

DLL hell

A definitive treatise on coping with DLL hell (in general, not just in the Windows world whence the name came) would be nice.

DLL hell nowadays, and in the Unix world, is what you get when a single process loads and runs (or tries to) two or more versions of the same shared object, at the same time, or when multiple versions of the same shared object exist on the system and the wrong one (from the point of view of a caller in that process) gets loaded. This can happen for several reasons, and when it does the results tend to be spectacular.

Typically DLL hell can result when:

  • multiple versions of the same shared object are shipped by the same product/OS vendor as an accident of development in a very large organization or of political issues;
  • multiple versions of the same shared object are shipped by the same product/OS vendor as a result of incompatible changes made in various versions of that shared object without corresponding updates to all consumers of that shared object shipped by the vendor (this is really just a variant of the previous case);
  • a third party ships a plug-in that uses a version of the shared object also shipped by the third party, and which conflicts with a copy shipped by the vendor of the product into which the plug-in plugs in, or where such a conflict arises later when the vendor begins to ship that shared object (this is not uncommon in the world of open source, where some project becomes very popular and eventually every OS must include it);

At first glance the obvious answer is to get all developers, at the vendor and third parties, to ship updates that remove the conflict by ensuring that a single version, shipped by the vendor, will be used. But in practice this can be really difficult to do because: a) there are too many parties to coordinate with, none of whom budgeted for DLL hell surprises and none of whom appreciate the surprise or want to do anything about it when another party could do something instead, b) agreeing on a single version of said object may involve doing lots of development to ensure that all consumers can use the chosen version, c) there's always the risk that future consumers of this shared object will want a new, backwards-incompatible version of that object, which means that DLL hell is never ending.

Ideally libraries should be designed so that DLL hell is reasonably survivable. But this too is not necessarily easy, and requires much help from the language run-time or run-time linker/loader. I wonder how far such an approach could take us.

Consider a library like SQLite3. As long as each consumer's symbol references to SQLite3 APIs are bound to the correct version of SQLite3, then there should be no problem, right? I think that's almost correct, just not quite. Specifically, SQLite3 relies on POSIX advisory file locking, and if you read the comments on that in the src/os_unix.c file in SQLite3 sources, you'll quickly realize that yes, you can have multiple versions of SQLite3 in one process, provided that they are not accessing the same database files!

In other words, multiple versions of some library, in one process, can co-exist provided that there's no implied, and unexpected shared state between them that could cause corruption.

What sorts of such implied, unexpected shared state might there be? Objects named after the process' PID come to mind, for example (pidfiles, ...). And POSIX advisory file locking (see above). What else? Imagine a utility function that looks through the process' open file descriptors looking for ones that the library owns -- oops, but at least that's not very likely. Any process-local namespace that is accessible by all objects in that process will provide a source of conflicts. Fortunately thread-specific keys are safe.
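
The POSIX advisory locking case can be demonstrated with a small sketch. POSIX specifies that a process's fcntl() locks on a file are released when the process closes any descriptor for that file, so one copy of a library can silently drop a lock taken by another copy. The function names and the temporary file path below are made up for the example; a child process is used to observe the lock because F_GETLK never reports the caller's own locks.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe a write lock covering the whole file. */
static struct flock whole_file_lock(void)
{
        struct flock fl;

        memset(&fl, 0, sizeof (fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;         /* l_start = l_len = 0: whole file */
        return fl;
}

/* From a child process: 1 if someone else holds a lock, 0 if not, -1 on error. */
static int locked_as_seen_by_child(const char *path)
{
        int status;
        pid_t pid = fork();

        if (pid < 0)
                return -1;
        if (pid == 0) {
                struct flock fl = whole_file_lock();
                int fd = open(path, O_RDWR);

                if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
                        _exit(2);
                _exit(fl.l_type == F_UNLCK ? 0 : 1);
        }
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) > 1)
                return -1;
        return WEXITSTATUS(status);
}

/*
 * "Library copy 1" locks the file through fd1; "library copy 2" then
 * opens and closes the same file.  *before and *after receive the lock
 * state seen by another process.  Returns 0 on success, -1 on error.
 */
int lock_survives_second_close(int *before, int *after)
{
        char path[] = "/tmp/dllhell-XXXXXX";
        struct flock fl = whole_file_lock();
        int fd1 = mkstemp(path);
        int fd2, rc = -1;

        if (fd1 < 0)
                return -1;
        if (fcntl(fd1, F_SETLK, &fl) == 0) {
                *before = locked_as_seen_by_child(path);
                fd2 = open(path, O_RDONLY);
                if (fd2 >= 0) {
                        close(fd2);     /* drops the lock taken via fd1 */
                        *after = locked_as_seen_by_child(path);
                        rc = (*before < 0 || *after < 0) ? -1 : 0;
                }
        }
        close(fd1);
        unlink(path);
        return rc;
}
```

On a system with standard POSIX record locks, the lock is visible before the second descriptor is closed and gone afterwards, even though fd1 is still open.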

DLL hell is painful, and it can't be prevented altogether. Perhaps we could produce a set of library design guidelines that developers could follow to produce DLL hell-safe libraries. The first step would be to make sure that the run-time can deal. Fortunately the Solaris linker provides "direct binding" (-B direct) and "groups" (-B group and RTLD_GROUP), so that between the two (and run-path and the like) it should be possible to ensure that each consumer of some library always gets the right one (provided one does not use LD_PRELOAD). Perhaps between linker features, careful coding and careful use, DLL hell can be made survivable in most cases. Thoughts? Comments?

Friday Dec 12, 2008

Automated Porting Difficulties: Run-time failures in roboported FOSS

As I explained in my previous blog entry, I'm working on a project whose goal is to automate the process of finding, building and integrating FOSS into OpenSolaris so as to populate our /pending and /contrib (and eventually /dev) IPS package repositories with as much useful FOSS as possible.

We've not done a good job of tracking build failures due to missing interfaces in OpenSolaris, though in the next round of porting we intend to track and investigate build failures. But when we tested candidate packages for /contrib we did run into run-time failures that were due to differences between Linux and Solaris. These were mostly due to:

  1. FOSS expected a Linux-style /proc
  2. CLI conflicts

The first of those was shocking at first, but I quickly remembered: the Linux /proc interfaces are text-based, thus no headers are needed in order to build programs that use /proc. Applications targeting the Solaris /proc could not possibly build on Linux (aside from cross-compilation targeting Solaris, of course): the necessary header, <procfs.h>, would not exist, therefore compilation would break.

Dealing with Linux /proc applications is going to be interesting. Even detecting them is going to be interesting, since they could be simple shell/Python/whatever scripts: simply grepping for "/proc" && !"procfs.h" will surely result in many false positives requiring manual investigation.
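
The grep-style heuristic just described can be written as a one-line predicate; the sketch below uses a hypothetical function name and operates on the text of a single source file or script.

```c
#include <string.h>

/*
 * Naive heuristic: flag text that mentions "/proc" but never mentions
 * "procfs.h".  Returns 1 when the text looks like a Linux /proc user.
 */
int looks_like_linux_proc_user(const char *text)
{
        return strstr(text, "/proc") != NULL &&
            strstr(text, "procfs.h") == NULL;
}
```

It flags "cat /proc/cpuinfo" as intended, but it also flags an unrelated string such as "see /process-notes.txt", which is the kind of false positive that would need manual investigation.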

The second common run-time failure mode is also difficult to detect a priori, but I think we can at least deal with it automatically. The incompatible CLIs problem results in errors like:

Usage: grep -hblcnsviw pattern file . . .

when running FOSS that expected GNU grep, for example. Other common cases include ls(1), ifconfig(1M), etcetera.

Fortunately OpenSolaris already has a way to get Linux-compatible command-line environments: just put /usr/gnu/bin before /usr/bin in your PATH. Unfortunately that's also not an option here because some programs will expect a Solaris CLI and others will expect a Linux CLI.

But fortunately, once again, I think there's an obvious way to select which CLI environment to use (Solaris vs. Linux) on a per-executable basis (at least for ELF executables): link in an interposer on the exec(2) family of functions, and have the interposer ensure that the correct preference of /usr/gnu/bin or /bin is chosen. Of course, this will be a simple solution only in the case of programs that compile into ELF, and not remotely as simple, perhaps not even feasible for scripts of any kind.
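
As a sketch of the piece such an interposer would need, the helper below rebuilds a PATH value so that the preferred directory comes first and appears only once. The function name is hypothetical, and the interposition itself (wrapping the exec(2) family and passing the adjusted environment to the real function) is not shown.

```c
#include <stddef.h>
#include <string.h>

/*
 * Write into out a PATH with preferred first, followed by the components
 * of path in their original order, minus any other copy of preferred.
 * Returns 0 on success, -1 if out is too small.
 */
int path_with_preference(const char *preferred, const char *path,
    char *out, size_t outlen)
{
        size_t plen = strlen(preferred);
        size_t used;
        const char *p = path;

        if (plen + 1 > outlen)
                return -1;
        memcpy(out, preferred, plen + 1);
        used = plen;

        while (*p != '\0') {
                size_t len = strcspn(p, ":");

                /* Copy every component except duplicates of preferred. */
                if (!(len == plen && strncmp(p, preferred, len) == 0)) {
                        if (used + 1 + len + 1 > outlen)
                                return -1;
                        out[used++] = ':';
                        memcpy(out + used, p, len);
                        used += len;
                        out[used] = '\0';
                }
                p += len;
                if (*p == ':')
                        p++;
        }
        return 0;
}
```

Given "/usr/gnu/bin" as the preference and "/usr/bin:/usr/gnu/bin:/usr/sbin" as the current PATH, the result is "/usr/gnu/bin:/usr/bin:/usr/sbin"; swapping the preference to "/usr/bin" gives the Solaris-first ordering.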

I haven't yet tried the interposer approach for the CLI preference problem, but I will, and I'm reasonably certain that it will work. I'm not as optimistic about the /proc problem; right now I've no good ideas about how to handle the /proc problem, short of manually porting the applications in question or choosing to not package them for OpenSolaris at all until the upstream communities add support for the Solaris /proc. I.e., the /proc problem is very interesting.

Wednesday Dec 10, 2008

Massively porting FOSS for OpenSolaris 2008.11 /pending and /contrib repositories

Today is the official release of OpenSolaris 2008.11, including commercial support.

Along with OpenSolaris 2008.11 we're also publishing new repositories full of various open source software built and packaged for OpenSolaris:

  • A pending repository with 1,708 FOSS pkgs today, and many more coming. This is "pending" in that we want to promote the packages in it to the contrib repository.
  • A contrib repository with 154 FOSS pkgs today, and many more coming soon.

These packages came from two related OpenSolaris projects in the OpenSolaris software porters community:

The two projects focus on different goals. Here I describe the work that we did on the PkgFactory/Roboporter project. Our primary goal is to port and package FOSS to OpenSolaris as quickly as possible. We do not yet focus very much on proper integration with OpenSolaris, such as making sure that the FOSS we package is properly integrated with RBAC, SMF, Solaris audit facilities, with manpages placed in the correct sections, etcetera, though we do intend to get to the point where we do get close enough to proper integration that the most valuable packages can then be polished off manually, put through the ARC and c-team processes, and pushed to the /dev repository.

Note, by the way, that the /pending and /contrib repositories are open to all contributors. The processes involved for contributing packages to these repositories are described in the SW Porters community pages, so if you'd like to make sure that your favorite FOSS is included you can always do it yourself!

The 154 packages in /contrib are a representative subset of the 1,708 packages in /pending, which in turn are a representative subset of some 10,000 FOSS pkgs that we had in a project-private repository. That's right, 10,000, which we built in a matter of just a few weeks. [NOTE: Most, but not all of the 1,708 packages in /pending and 154 in /contrib came from the pkgfactory project.]

The project began with Doug Leavitt doing incredible automation of: a) searching for and downloading spec files from SFE and similar from Ubuntu and other Linux packaging repositories, b) building them on Solaris. (b) is particularly interesting, but I'll let Doug blog about that. With Doug's efforts we had over 12,000 packages in a project-private IPS repository, and the next step was to clean things up, cut the list down to something that we could reasonably test and push to /pending and /contrib. That's where Baban Kenkre and I jumped in.

To come up with that 1,708 package list we first removed all the Perl5 CPAN stuff from the list of 12,000, then we wrote a utility to look for conflicts between our repository, the Solaris WOS and OpenSolaris. It turned out we had many conflicts even within our own repository (some 2,000 pkgs were removed as a result, if I remember correctly, after removing the Perl5 packages). Then we got down and dirty and did as much [very light-weight] testing as we could.

What's really interesting here is that the tool we wrote to look for conflicts turned out to be really useful in general. That's because it loads package information from our project's repo, the SVR4 Solaris WOS and OpenSolaris into a SQLite3 database, and analyzes the data to some degree. What's really useful about this is that with little knowledge of SQL we did many ad-hoc queries that helped a lot when it came to whittling down our package list and testing. For example: getting a list of all executables in /bin and /usr/sbin that are delivered by our package factory and which have manpages, was trivial, and quite useful (because then I could read the manpages in one terminal and try the executables in another, which made the process of light-weight testing much faster than it would have otherwise been). We did lots of ad-hoc queries against this little database, the kinds of queries that without a database would have required significantly more scripting; SQL is a very powerful language!

That's it for now. We'll blog more later. In the meantime, check out the /pending and /contrib repositories. We hope you're pleased. And keep in mind that what you see there is mostly the result of just a few weeks of the PkgFactory project work, so you can expect: a) higher quality as we improve our integration techniques and tools, and b) more, many, many more packages as we move forward. Our two projects' ultimate goal is to package for OpenSolaris all of the useful, redistributable FOSS that you can find on Sourceforge and other places.

Monday Nov 10, 2008

Technology Underlying the Sun Storage 7000 Series

I'm late to the party. And I don't have much to blog about my team's part in the story of the 7000 Series that I haven't already blogged, most of it about ID mapping, and some about filesystem internationalization. Except of course, to tell you that today's product launch is very exciting for me. Not only is this good for Sun's customers (current and, especially, future!) and for Sun, but it's also incredibly gratifying to see something that one has worked hard on be part of a major product and be depended on by others.

Above all: Congratulations to the Fishworks team and to the many teams that contributed to making this happen. The list of such teams is long. Between systems engineering, Solaris engineering and the business teams that made all this possible, plus the integration provided by the Fishworks team, this is a truly enormous undertaking. Just look at the implausible list of storage protocols spoken by the storage appliance: CIFS, NFS, iSCSI, FTP, WebDAV, NDMP and VSCAN, all backed by ZFS. I'm barely scratching the surface here. It's not just the storage protocols; for example, DTrace has an enormous role to play here as well, and there are many other examples.

The best part is the integration, the spectacular BUI (browser user interface). No, wait, the best part is the underlying technologies. No, wait!, the best part is the futures. It's hard to decide what the best aspect of the Sun Storage 7000 series is: the story, the people, the technologies, the future, or even what it says about Sun: that Sun Microsystems can innovate and reinvent itself even when the financials don't look great, even while doing much of the development in the open!

The new storage appliance was a project of major proportions, much of it undertaken in the open. I wonder how many thought that this was typical of Sun, to develop cool technologies without knowing how to tie them together. I hope we've shocked you. Now you know: Sun can complete acquisitions successfully and obtain product synergies (usually a four-letter word, that), Sun can do modular development and bring it all together, Sun can detect new trends in the industry (e.g., read-biased SSDs, write-biased SSDs, ...) and capitalize on them, Sun can think outside the box and pull rabbits out of its hat. And you better bet: we can and will keep that up.

Friday Sep 19, 2008

If it's just $1 trillion...

...then it's kinda comparable to the S&L crisis of the 1980s. That one was easier to deal with because the feds (through the FDIC) were already the guarantor of that crisis' bad debts (deposits at small local savings & loans banks). This one is more like the Japanese crisis of two decades ago: bad debt with no government guarantor of last resort. So the feds are doing the only reasonable thing: step in and take over that bad debt.

Unfortunately the feds are also doing something stupid: banning short selling (or is it that margin rates on short interest are being raised to 100%? whatever). As long as the short interest ban is short-lived then the harm will be relatively small. But wouldn't it have been preferable to wait to see if the bad debt takeover alone was enough to squeeze the shorts? Besides, squeezing the shorts does little to restore liquidity, whereas taking over the bad debts sure does (or should anyways, one hopes). Squeezing the shorts through legal means is bad policy: it says the other measures aren't enough, it says "we believe short-sellers are speculators and they should be taken to the shed," it's childish, and it may hurt the normal functioning of markets even once short sales are allowed once more (because now there's a new risk that short-sellers must face, particularly if regulators get used to establishing short interest bans on a per-corporation or market segment basis). Bah.

Observing ID mapping with DTrace

Want to see how idmapd maps some Windows SID to a Unix UID/GID? The idmap(1M) command does provide some degree of observability via the -v option to the show sub-command, but not nearly enough. Try this DTrace script.

The script is not complete, and, most importantly, is not remotely stable, as it uses pid provider probes on internal functions and encodes knowledge of private structures, all of which can change without notice. But it does help a lot! Not only does it help understand operational aspects of ID mapping, but also idmapd's internals. And, happily, it points the way towards a proper, stable USDT provider for idmapd.

Folks who've seen the RPE TOI for ID mapping will probably wish that I'd written this months ago, and used it in the TOI presentation :)

Running the stress tests on idmapd with this script running produces an enormous amount of output, clearly showing how the asynchronous Active Directory LDAP searches and search results are handled.

Thursday Sep 04, 2008

The compromise on abortion that the Republican mavericks should offer

I don't like blogging about politics. My previous blog entry was the only one I've written on politics, and that was about international geopolitics.

But Sarah Palin inspires me.

The culture wars in the U.S. have two major flash points: abortion and gay civil unions/gay marriage.

Abortion is the intractable one. But I believe there is a way, a novel way.

Begin by accepting that, given the structure of the American republic and the Supreme Court's precedents on abortion, there is no chance that abortion can be made illegal any time soon. Even if Roe vs. Wade, and Casey, and the Court's other abortion precedents were overturned the issue would merely become a local issue (though it would also stay a national issue), and most States would likely keep the existing regime more or less as is. It will take decades for the pro-life camp to get its way, if it ever does.

That leaves former President Bill Clinton's formulation of "safe, legal and rare" as the only real option for the pro-life camp. The pro-choice camp's goal, on the other hand, is pretty safe.

Of course, Bill Clinton never did much, if anything, to make abortion rare. And whatever one might do needs to be sold to the pro-life camp with more than "it's all you can hope for."

The solution, then, is to think about the problem from an economics (and demographic) point of view.

Consider: making abortion illegal will not mean zero abortions, for back-alley abortions will return, with the consequent injuries and loss of baby and maternal life. So we can only really hope to minimize abortion. Looked at this way, Bill Clinton's formula looks really good. This is the argument with which to sell an economics-based solution.

And the solution? Simple: provide financial incentives...

  • to families to adopt children (though there is already a shortage of children to adopt),
  • to women with unwanted pregnancies to proceed with the pregnancy and put their babies up for adoption,
  • and to abortion clinics to participate in the process of matching women with unwanted pregnancies to families who wish to adopt (effectively becoming market makers -- it sounds awful, to market children, but isn't the alternative worse?).

It sounds like a government program that no fiscal conservative should want taxpayers to pay for. But consider that in the long-term it pays for itself by increasing the future tax base (more babies now -> more adults in the labor pool later). And consider the opportunity cost of not having these children! For Japan- and Russia-style population implosion would have disastrous consequences for the American economy (consider Social Security...). Avoiding population implosion alone should be reason enough to go for such a program. How much to offer as incentives? I don't know, but even if such a program came to cost $50,000 per-baby that would still be cheap, considering the demographics angle.

So, allow choice, but seek to influence it, with naked bribes, yes, but not coercion (which wouldn't be "choice").

This brings us to gay civil unions and/or gay marriage. It's certainly past the time when any politician of consequence could seriously propose the criminalization of homosexuality in the U.S.; sexual autonomy, at least in the serial monogamy sense, has been a de facto reality for a long time, and now it is de jure. Now, if gay civil unions or marriage could mean more adoptive parents of otherwise-to-be-aborted children, then what can someone who is pro-life do but support at least gay civil unions? If life is the imperative, then surely we can encourage gay couples to help, and let God judge whether homosexuals are in sin or not.

Alright, now that that's out of the way I hope to go back to my non-politics blogging ways.

Tuesday Aug 12, 2008

Conclusions from the Georgia war

Georgia was simply not a defensible route for Europe to energy independence from Russia. Nor could it have become one for years to come: because of its remoteness, and unless Turkey wished to take a very active role in NATO (which seems unlikely), it was bound to stay indefensible for as long as Russia manages to keep up its military (i.e., for the foreseeable future).

Therefore Europe has two choices: become a satellite of Russia, or pursue alternatives to natural gas and oil from Russia.

To save Europe from subservience to Russia will require the development of new energy sources. Geopolitical plays can only work if backed by willingness to use superior military firepower. Europe clearly lacks the necessary military superiority and will-power, therefore only new nuclear power plants, and new non-Russian/non-OPEC oil and gas sources qualify in the short- to medium-term.

So, ramp up nuclear power production (as that's the only alternative fuel with a realistic chance of producing enough additional power in the short- to medium-term). And, of course, building more terminals to receive oil and LNG tankers would help.

But any oil/gas to be received by tanker terminals has got to come from somewhere (and Russia's has got to have an outlet other than Europe). It would help enormously if new oil sources outside OPEC and Russia could be developed, as new friendly supplies would reduce the leverage that Russia has on Europe. That can only be Brazilian, American and Canadian oil.

Does Europe have the fortitude to try? Does the U.S. have the leverage to get Europe to try?

The big loser here is Europe. Europe now has to choose whether to surrender or struggle for independence. The U.S. probably can't force them. A European surrender to Russia will be slow, and subtle, but real. If Europe surrenders then NATO is over. Funny, that Russia is poised to achieve what the Soviet Union could not. But it isn't funny. And I suspect few citizens of Europe understand, and few that do object; anti-Americanism may have won.

The only thing Europe has going for it is that there is much less NIMBYist resistance to nuclear power there than in the U.S. Also, awareness that a power crunch is at hand, and a much more severe one probably coming, is starting to sink in around the world (drilling for oil everywhere is now very popular in the U.S., for example, with very large majorities in favor; support for new nuclear power plants is bound to follow as well).

As for the environment, I don't for a second believe in anthropogenic global warming, but ocean acidification is much easier to prove, and appears to be real, and is much, much more of an immediate and dire threat to humans than global warming. Regardless of which threat is real, and regardless of how dire, there's only one way to fight global warming/ocean acidification: increase the wealth of Earth's nations, which in the short-term means producing more energy. American rivers were an environmental mess four decades ago, but today the U.S. is one of the cleanest places on Earth. The U.S. cleaned up when its citizens were rich enough that they could manage to care and to set aside wealth for cleaning things up. It follows that the same is true for the rest of the world, and if that's not enough, consider what would happen if the reverse approach is followed instead: miserable human populations that will burn what they have to to survive, the environment be damned.

Let us set out on a crash course to develop new energy sources, realistic and practical ones, and let us set out on a course to promote and develop international commerce like never before.

Friday Jun 13, 2008

Can we map IDs between Unix domains? (e.g., for NFSv4)

Today (onnv build 92), no.

But there's no reason we couldn't add support for it.

Here's how I would do it:

  • First, map all UIDs and GIDs in foreign Unix domains to S-1-22-3-<domain-RIDs>-<UID> and S-1-22-4-<domain-RIDs>-<GID>, respectively. Whence the domain RIDs? Preferably we'd provide a way for each domain to advertise a domain SID. Otherwise we could allow each domain's SID to be configured locally. Or else derive it from the domain's name, e.g., octet_string_to_RIDs(SHA_256(domain_name)).
  • Second, map all user and group names in foreign Unix domains to <name>@<domain-name>.
  • Third, use libldap to talk to foreign Unix domains with RFC2307+ schemas. Possibly also add support for using NIS. (Yes, the NIS client allows binding to multiple domains, though, of course, the NIS name service backend uses only one; the yp_match(3NSL) and related functions take an optional NIS domain name argument.)
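The last fallback in the first step, deriving domain RIDs from a hash of the domain name, could look something like the sketch below. This is only an illustration under stated assumptions: nothing like it exists in idmapd today, and the choices of four RIDs, big-endian byte order, and lowercasing the domain name before hashing are all mine, not part of any specification.

```python
import hashlib
import struct


def octet_string_to_rids(octets, count=4):
    """Interpret the first 4*count octets as `count` unsigned 32-bit RIDs.

    The RID count and the big-endian byte order are assumptions.
    """
    return list(struct.unpack(">%dI" % count, octets[:4 * count]))


def foreign_unix_sid(domain_name, posix_id, is_group=False):
    """Build S-1-22-3-<domain-RIDs>-<UID> or S-1-22-4-<domain-RIDs>-<GID>."""
    # Assumption: domain names are compared case-insensitively.
    digest = hashlib.sha256(domain_name.lower().encode("utf-8")).digest()
    rids = octet_string_to_rids(digest)
    sub_authority = 4 if is_group else 3
    rid_text = "-".join(str(rid) for rid in rids)
    return "S-1-22-%d-%s-%d" % (sub_authority, rid_text, posix_id)


print(foreign_unix_sid("example.com", 1000))
print(foreign_unix_sid("example.com", 10, is_group=True))
```

The appeal of a hash-derived SID is that every host computes the same SID for a given domain with no configuration at all; the cost is that two domains could, in principle, collide.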

This would require changes to idmapd(1M). I think the code to talk to foreign Unix domains and cast their IDs into our local form should be easy to compartmentalize. idmapd would have to learn how to determine the type of any given domain, and how to find how to talk to it -- this is going to be what most of the surgery on idmapd would be about.

I don't know when we might get to this. Maybe an enterprising member of the community could look into implementing this if they are in a hurry.

Friday Jan 11, 2008

(destructuring-bind) for XML


Plus ça change...

Tuesday Nov 13, 2007

More on the design and implementation of Solaris' ID mapping facility, part 1: kernel-land

UPDATE: The ZFS FUID code was written by Mark Shellenbaum. Also, something someone said recently confused me as to who came up with the idea of ephemeral IDs; it was Mike Shapiro.

Now that you know all about ephemeral IDs and ID mapping, let's look at Solaris ID mapping more closely. Afshin has a great analogy to describe what was done to make Solaris deal with SMB and CIFS natively; you should not miss it.

Let's begin with how the kernel treats Windows SIDs and ID mapping.

[Note: the OpenSolaris source code web browser interface cannot find the definitions of certain C types and functions, so in some places I'll link to files and line numbers. Such links will grow stale over time. If and when the OpenSolaris source browser interface is fixed I may come back to fix these links.]

SIDs in the kernel

First we have $SRC/uts/common/os/sid.c. Here you can see that the kernel does not use the traditional SID structure or wire encoding. Instead Solaris treats SIDs as ksid_t objects consisting of an interned domain SID (represented by ksiddomain_t) and a uint32_t RID. The prefix is just the stringified form of the SID (S-1-<authority>-<RID0>-<RID1>-...<RIDn>) up to, but excluding the last RID.
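To make the prefix/RID split concrete, here is a minimal sketch in Python (not the kernel's C code). The function name and the validation rules are hypothetical; the only thing taken from the text is that the prefix is the stringified SID up to, but excluding, the last RID, and that the RID fits in a uint32_t.

```python
def split_sid(sid):
    """Split a stringified SID into (domain prefix, last RID)."""
    prefix, sep, last = sid.rpartition("-")
    # Require the S-1-<authority> stem plus at least one RID after it.
    if not sep or not prefix.startswith("S-1-") or not last.isdigit():
        raise ValueError("not a usable SID string: %r" % (sid,))
    rid = int(last)
    if rid > 0xFFFFFFFF:
        raise ValueError("RID does not fit in 32 bits: %r" % (sid,))
    return prefix, rid


print(split_sid("S-1-5-21-1-2-3-1104"))
```

Every user and group in the same Windows domain shares one prefix, which is what makes interning the prefix worthwhile.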

Treating SIDs as a string prefix and an integer RID is a common thread running through all the Solaris components that deal with SIDs, except, of course, where SIDs must be encoded for use in network protocols. Interning is used where space or layout considerations make small, fixed-sized objects preferable to variable-length SID structures, namely: in the kernel and on-disk in ZFS.

The ksidlookupdomain() function takes care of interning SID prefixes for use in-kernel. The interned SID prefix table is just an AVL tree, naturally.
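Interning here simply means that each distinct prefix string is stored once and everyone else holds a reference to that one object. A rough sketch of the idea follows; the class names are hypothetical, and a Python dict stands in for the kernel's AVL tree (the real code also has to deal with locking and reference counts, which this ignores).

```python
class SidDomain:
    """One interned SID prefix; stands in for the kernel's ksiddomain_t."""

    def __init__(self, prefix):
        self.prefix = prefix


class DomainTable:
    """Toy intern table keyed by prefix string."""

    def __init__(self):
        self._by_prefix = {}

    def lookup(self, prefix):
        """Return the single shared entry for `prefix`, creating it if needed."""
        entry = self._by_prefix.get(prefix)
        if entry is None:
            entry = SidDomain(prefix)
            self._by_prefix[prefix] = entry
        return entry

    def __len__(self):
        return len(self._by_prefix)


table = DomainTable()
a = table.lookup("S-1-5-21-1-2-3")
b = table.lookup("S-1-5-21-1-2-3")
print(a is b, len(table))
```

With the prefix interned, a SID becomes a small fixed-size pair (pointer to the prefix entry, 32-bit RID), and comparing two domains is a pointer comparison.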

The SIDs of a user are represented by credsid_t, which contains three SIDs plus a list of SIDs that is akin to the supplementary group list. credsid_t objects are reference counted and referenced from cred_t. This is done because the Solaris kernel copies cred_t objects quite often, but a cred_t's SID list is simply not expected to change very often, or even ever; avoiding unnecessary copies of potentially huge SID lists (users with hundreds of group memberships are common in Windows environments) is highly desirable. The crdup() function and friends take care of this.

Back to sid.c for a moment, lookupbyuid() and friends are where the kernel calls the idmap module to map SID<->UIDs. But we'll look at the kernel idmap module later.

Note that not all ephemeral IDs are valid. Specifically, only ephemeral IDs in ranges allocated to the running idmapd daemon are considered valid. See the VALID_UID() and VALID_GID() macros. Kernel code needs to be careful to allow only non-ephemeral UIDs/GIDs in any context where they might be persisted across reboots (e.g., UFS!), or to map them back to SIDs (e.g., ZFS!); in all other cases kernel code should be checking that any UIDs/GIDs are valid using those macros. The reason that the VALID_UID/GID() checks are macros should be instantly clear to the reader: we're optimizing for the expected common/fast case where the given ID is non-ephemeral, in which case we can save a function call. Wherever neither SIDs nor ephemeral IDs can be used the kernel must substitute suitable non-ephemeral IDs, namely, the 'nobody' IDs -- see crgetmapped(), for example.
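The shape of that check can be sketched as follows. This is an illustration in Python, not the actual macro: the function name and the list of allocated ranges are hypothetical, and I am assuming that non-ephemeral IDs are those in 0..2^31-1 with ephemeral IDs living above that.

```python
MAXUID = 2**31 - 1  # largest non-ephemeral ID (assumed boundary)


def valid_id(id_, allocated_ranges):
    """Return True if `id_` is non-ephemeral or a currently allocated ephemeral ID.

    `allocated_ranges` is a list of inclusive (low, high) pairs standing in for
    the ephemeral ranges handed to the running idmapd.
    """
    # Fast path: the common case is an ordinary POSIX ID, decided with one
    # comparison and no function call in the real macro.
    if 0 <= id_ <= MAXUID:
        return True
    # Slow path: consult the ranges allocated to the running daemon.
    return any(low <= id_ <= high for low, high in allocated_ranges)


ranges = [(2**31, 2**31 + 1023)]
print(valid_id(100, ranges), valid_id(2**31 + 5, ranges), valid_id(2**31 + 5000, ranges))
```

An ephemeral ID left over from a previous idmapd instance falls outside the current ranges and is rejected, which is exactly why such IDs must never be persisted.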

Can you spot the zones trouble with all of this? All this code was built for global-zone only purposes due to time pressures, though we knew that eventually we'd need to properly virtualize ephemeral IDs and ID mapping. Now that we have a zoned consumer (the NFSv4 client, via nfsmapid(1M)), however, we must virtualize ID mapping so that each zone can continue to have its own UID/GID namespace as usual. The fix is in progress; more details below.

BTW, the sid.c, cred.c code and related headers were designed and written by Casper Dik.


Next we look at how ZFS handles SIDs.

Take a look at $SRC/uts/common/fs/zfs/zfs_fuid.c. This is where FUIDs are implemented. A FUID is ZFS's way of dealing with the fact that SIDs are variable length. Where ZFS used to store 32-bit UIDs and GIDs it now stores 64-bit "FUIDs," and those are simply a {<interned SID prefix>, <RID>} tuple. Traditional POSIX UIDs and GIDs in the 0..2^31-1 range are stored with zero as the interned SID prefix. The interned SID prefix table, in turn, is stored in each dataset.
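A minimal sketch of the packing, in Python rather than the kernel's C. The bit layout (prefix index in the upper 32 bits, RID in the lower 32 bits) is my assumption for illustration, as are the function names; see zfs_fuid.c for the real thing.

```python
def fuid_encode(prefix_index, rid):
    """Pack an interned-prefix index and a RID into one 64-bit value."""
    if not (0 <= prefix_index <= 0xFFFFFFFF and 0 <= rid <= 0xFFFFFFFF):
        raise ValueError("index and RID must each fit in 32 bits")
    return (prefix_index << 32) | rid


def fuid_decode(fuid):
    """Unpack a 64-bit FUID into (prefix index, RID)."""
    return fuid >> 32, fuid & 0xFFFFFFFF


# A traditional POSIX UID has prefix index zero, so its FUID equals the UID.
print(fuid_encode(0, 1000))
# A SID-based identity: index 1 in this dataset's prefix table, RID 1104.
print(fuid_decode(fuid_encode(1, 1104)))
```

Because index zero is reserved for plain POSIX IDs, existing UIDs and GIDs keep their numeric values when widened to 64 bits.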

Here too we see calls to the idmap kernel module, but again, more about that below. And you can see that ZFS keeps a copy of the FUID table in-kernel as an AVL tree (boy, AVL trees are popular for caches!).

If I understand correctly, the ZFS FUID code was written by Mark Shellenbaum. The idea for FUIDs came from Afshin Salek. I'm not sure who thought of using the erstwhile negative UID/GID namespace for dynamic, ephemeral ID mapping.

And you can also see that we have some zone issues here also; these will be addressed, as with all the zone issues mentioned here, in a bug fix that is currently in progress.

I'll leave VFS/fop and ZFS ACL details for another entry, or perhaps for another blogger. The enterprising reader can find the relevant ARC cases and OpenSolaris source code.

The idmap kernel module

Finally we look at the idmap kernel module. This module has several major components: a lookup cache, the basic idmap API, with door upcalls to idmapd, and idmapd registration/unregistration.

The idmap kernel module is fairly straightforward. It uses ONC RPC over doors to talk to idmapd.

Unfortunately there is no RPC-over-doors support in the kernel RPC module. Fortunately implementing RPC-over-doors was quite simple, as you can see in kidmap_rpc_call(). The bulk of the XDR code is generated by rpcgen(1) from the idmap protocol .x file. The code in $SRC/uts/common/idmap/idmap_kapi.c is mostly about implementing the basic ID mapping API.

The module's cache is, again, implemented using an AVL tree. Currently the only way to clear the cache is to unload the module, but as we add zone support this will no longer work, and we'll switch, instead, to unloading the cache whenever idmapd exits cleanly (i.e., unregisters), which will make it possible to clear the cache by stopping or restarting the svc:/system/idmap service. Also, we'll be splitting the cache into several to better support diagonal mapping.

Finally, I'll briefly describe the API.

The ID mapping APIs are designed to batch up many mapping requests into a single door RPC call, and idmapd is designed to batch up as much database and network work as possible too. This is to reduce latency in dealing with users with very large Windows access tokens, or ACLs with many distinct ACE subjects -- one door call for mapping 500 SIDs to POSIX IDs is better than 500 door calls for mapping one SID to a POSIX ID. The caller first calls kidmap_get_create() to get a handle for a single batched request, then the caller repeatedly calls any of the kidmap_batch_get*by*() functions to add a request to the batch, followed by a call to kidmap_get_mappings() to make the upcall with all the batched requests, or the caller can abort a request by calling kidmap_get_destroy(). APIs for non-batched, one-off requests are also provided. The user-land version of this API can also deal with user/group names.
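The create / add / get-mappings / destroy pattern can be sketched like this. Everything here is a toy model in Python: the class and method names are hypothetical and only mirror the roles of the kidmap functions, and FakeDaemon stands in for idmapd on the other side of the door.

```python
class FakeDaemon:
    """Stand-in for idmapd: counts upcalls and hands out ephemeral IDs."""

    def __init__(self):
        self.calls = 0
        self._next_id = 2**31
        self._known = {}

    def map_sids(self, requests):
        self.calls += 1  # one "door call", however many requests it carries
        results = []
        for sid in requests:
            if sid not in self._known:
                self._known[sid] = self._next_id
                self._next_id += 1
            results.append(self._known[sid])
        return results


class MappingBatch:
    """Plays the role of the handle returned by kidmap_get_create()."""

    def __init__(self, upcall):
        self._upcall = upcall
        self._requests = []

    def add_uid_by_sid(self, prefix, rid):
        """Queue one request (the role of the kidmap_batch_get*by*() calls)."""
        self._requests.append((prefix, rid))

    def get_mappings(self):
        """Send all queued requests in a single upcall and return the results."""
        if not self._requests:
            return []
        results = self._upcall(list(self._requests))
        self._requests.clear()
        return results

    def destroy(self):
        """Abort: drop queued requests without making any upcall."""
        self._requests.clear()


daemon = FakeDaemon()
batch = MappingBatch(daemon.map_sids)
for rid in range(500):
    batch.add_uid_by_sid("S-1-5-21-1-2-3", rid)
uids = batch.get_mappings()
print(len(uids), daemon.calls)
```

The point of the design is visible in the last line: 500 mappings cost one round trip to the daemon instead of 500.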

The idmap kernel module was written mostly by Julian Pullen (no blog yet).

A word about zones

As I mentioned above, we need to virtualize the VALID_*ID() macros, underlying functions, and some of the ksid_*() and zfs_fuid_*() functions. We're likely going to add a zone_t * argument to the non-batch kidmap API functions and to kidmap_get_create(), as well as to the VALID_*ID() macros, related functions, and affected ksid_*() functions. The affected zfs_fuid_*() and cr*() functions already have a cred_t * argument (or their callers do), from which we can get a zone_t * via crgetzone(). The biggest problem is that it appears that there exists kernel code that calls VOPs from interrupt context(!), with a NULL cr, so that we'll need a way to indicate that the current zone is not known (or, if in SMB server context, that this is the global zone); the idmap kernel module will have to know to map IDs to the various nobody IDs (including Nobody SID) when no zone is identified by the caller.

In the next blog entry I'll talk about the user-land aspects of Solaris ID mapping.


I'm an engineer at Oracle (erstwhile Sun), where I've been since 2002, working on Sun_SSH, Solaris Kerberos, Active Directory interoperability, Lustre, and misc. other things.

