Tuesday Apr 13, 2010

On Unicode Normalization -- or why normalization insensitivity should be the rule

Do you know what Unicode normalization is? If you have to deal with Unicode, then you should know. Otherwise this blog post is not for you. Target audience: a) Internet protocol authors, reviewers, IETF WG chairs, the IESG, b) developers in general, particularly any working on filesystems, networked applications or text processing applications.

Short story: Unicode allows various characters to be written either as a single "pre-composed" codepoint or as a sequence of one base character codepoint plus one or more combining codepoints. Think of 'á': it could be written as a single codepoint that corresponds to the ISO-8859-1 'á', or as two codepoints, one being plain old ASCII 'a' and the other being the combining codepoint that says "add acute accent mark". There are characters that can have five or more different representations in Unicode.
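To make the 'á' example concrete, here is a small Python sketch using the standard unicodedata module. The variable names are mine, chosen only for illustration; the third example (Å, which also has the ANGSTROM SIGN codepoint) is an additional illustration of a character with more than two representations.

```python
import unicodedata

precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # ASCII 'a' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as 'á', but they are different codepoint sequences,
# so a naive comparison says they differ.
same_codepoints = precomposed == decomposed

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# Some characters have more than two spellings, e.g. 'Å':
a_ring_forms = ["\u00c5", "\u212b", "A\u030a"]
a_ring_nfc = {unicodedata.normalize("NFC", s) for s in a_ring_forms}
```

Here same_codepoints is False while nfc_equal and nfd_equal are both True, and all three spellings of 'Å' collapse to a single NFC string.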

The long version of the story is too long to go into here. If you find yourself thinking that Unicode is insane, then you need to acquaint yourself with that long story. There are good reasons why Unicode has multiple ways to represent certain characters; wishing it weren't so won't do.

Summary: Unicode normalization creates problems.

So far the approach most often taken in Internet standards to deal with Unicode normalization issues has been to pick a normalization form and then say you "MUST" normalize text to that form. This rarely gets implemented because the burden is too high. Let's call this the "normalize-always" ('n-a', for short) model of normalization. Specifically, in the context of Internet / application protocols, the normalize-always model requires normalizing when: a) preparing query strings (typically on clients), b) creating storage strings (typically on servers). The normalize-always model typically results in all implementors having to implement Unicode normalization, regardless of whether they implement clients or servers.

Examples of protocols/specifications using n-a: stringprep, IMAP/LDAP/XMPP/... via SASL via SASLprep, Nameprep/IDNA (internationalized domain names), Net Unicode, e-mail headers, and many others.

I want to promote a better alternative to the normalize-always model: the normalization-insensitive / normalization-preserving (or 'n-i/n-p', for short) model.

In the n-i/n-p model you normalize only when you absolutely have to for interoperability:

  • when comparing Unicode strings (e.g., query strings to storage strings);
  • when creating hash/b-tree/other-index keys from Unicode strings (hash/index lookups are like string comparisons);
  • when you need canonical inputs to cryptographic signature/MAC generation/validation functions.
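The list above can be sketched as a toy directory in Python. This is only an illustration of the model, not how ZFS or any real server does it: the class name, method names and the choice of NFC for the index key are all my assumptions. The point is that the normalized form is used only as the lookup key, while the name is stored and returned exactly as the creator supplied it.

```python
import unicodedata

def ni_key(name: str) -> str:
    # Hypothetical helper: the normalized form is used ONLY as an index key.
    return unicodedata.normalize("NFC", name)

class NormalizationInsensitiveDir:
    """Toy directory: normalization-insensitive lookups,
    normalization-preserving storage."""

    def __init__(self):
        self._entries = {}  # normalized key -> (name as created, value)

    def create(self, name, value):
        key = ni_key(name)
        if key in self._entries:
            raise FileExistsError(name)
        # Store the name untouched; no normalization on CREATE.
        self._entries[key] = (name, value)

    def lookup(self, name):
        # Query strings are normalized only to compute the key.
        return self._entries[ni_key(name)][1]

    def readdir(self):
        # Names come back exactly as they were created.
        return [stored for stored, _ in self._entries.values()]

d = NormalizationInsensitiveDir()
d.create("cafe\u0301", "menu")    # created in decomposed form
found = d.lookup("caf\u00e9")     # looked up in pre-composed form
listing = d.readdir()
```

A client that sends the pre-composed form finds the entry created in decomposed form, and neither client had to normalize anything itself.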

That's far fewer times and places where one needs to normalize strings than in the n-a model. Moreover, in the context of many/most protocols normalization can be left entirely to servers rather than clients -- simpler clients lead to better adoption rates. Easier adoption alone should be a sufficient advantage for the n-i/n-p model.

But it gets better too: the n-i/n-p model also provides better compatibility with and upgrade paths from legacy content. This is because in this model storage strings are not normalized on CREATE operations, which means that you can have Unicode and non-Unicode content co-existing side-by-side (though one should only do that as part of a migration to Unicode, as otherwise users can get confused).

The key to n-i/n-p is: fast n-i string comparison functions, as well as fast byte- or codepoint-at-a-time string normalization functions. By "fast" I mean that any time that two ASCII codepoints appear in sequence you have a fast path and can proceed to the next pair of codepoints starting with the second ASCII codepoint of the first pair. For all- or mostly-ASCII Unicode strings this fast path is not much slower than a typical for-every-byte loop. (Note that strcmp() optimizations such as loading and comparing 32 or 64 bits at a time apply to both ASCII-only/8-bit-clean and n-i algorithms: you just need to check for any bytes with the high bit set, and whenever you see one you should trigger the slow path.) And, crucially, there's no need for memory allocation when normalization is required in these functions: why build normalized copies of the inputs when all you're doing is comparing or hashing them?
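A rough Python sketch of the fast-path idea follows. It is deliberately simplified and is not the ZFS code: the function names are hypothetical, and where a real implementation would normalize a codepoint at a time without allocating and then return to the fast path, this sketch just normalizes the remainder of both strings once it leaves the fast path. What it does show is the rule described above: an ASCII codepoint can be compared directly only when the codepoint after it is also ASCII (or the string ends), because otherwise a combining mark might attach to it.

```python
import unicodedata

def _standalone_ascii(s: str, i: int) -> bool:
    # s[i] is ASCII and is followed by ASCII (or end of string),
    # so no combining codepoint can attach to it.
    return ord(s[i]) < 0x80 and (i + 1 == len(s) or ord(s[i + 1]) < 0x80)

def ni_equal(a: str, b: str) -> bool:
    """Normalization-insensitive equality with an ASCII fast path."""
    i = 0
    n = min(len(a), len(b))
    while i < n:
        if not (_standalone_ascii(a, i) and _standalone_ascii(b, i)):
            break               # leave the fast path
        if a[i] != b[i]:
            return False        # plain ASCII mismatch
        i += 1
    # Slow path (simplified): position i is a safe boundary in both
    # strings, so comparing the normalized remainders is sufficient.
    return (unicodedata.normalize("NFD", a[i:])
            == unicodedata.normalize("NFD", b[i:]))

r_mixed = ni_equal("caf\u00e9.txt", "cafe\u0301.txt")
r_ascii_same = ni_equal("README", "README")
r_ascii_diff = ni_equal("README", "READMF")
r_lookahead = ni_equal("a\u00e1", "aa\u0301")
```

For all-ASCII inputs the loop never touches the normalizer; the two slow-path calls operate on empty strings when the fast path consumed everything.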

We've implemented normalization-insensitive/preserving behavior in ZFS, controlled by a dataset property. This means that NFS clients on Solaris, Linux, MacOS X, \*BSD, Windows will interop with each other through ZFS-backed NFS servers regardless of what Unicode normalization forms they use, if any, and without having to have modified the clients to normalize.

My proposal: a) update stringprep to allow for profiles that specify n-i/n-p behavior, b) update SASLprep and various other stringprep profiles (but NOT Nameprep, nor IDNA) to specify n-i/n-p behavior, c) update Net Unicode to specify n-i/n-p behavior while still allowing normalization on CREATE as an option, d) update any other protocols that use n-a and which would benefit from using n-i/n-p to use n-i/n-p.

Your reactions? I expect skepticism, but think carefully, and consider ZFS's solution (n-i/n-p) in the face of competitors that either normalize on CREATE or don't normalize at all, plus the fact that some operating systems tend to prefer NFC (e.g., Solaris, Windows, Linux, \*BSD) while others prefer NFD (e.g., MacOS X). If you'd keep n-a, please explain why.

NOTE to Linus Torvalds (and various Linux developers) w.r.t. this old post on the git list: ZFS does not alter filenames on CREATE nor READDIR operations, ever [UPDATE: Apparently the port of ZFS to MacOS X used to normalize on CREATE to match HFS+ behavior]. ZFS supports case- and normalization-insensitive LOOKUPs -- that's all (compare to HFS+, which normalizes to NFD on CREATE).

NOTE ALSO that mixing Unicode and non-Unicode strings can cause strange codeset aliasing effects, even in the n-i/n-p model (if there are valid byte sequences in non-Unicode codesets that can be confused with valid UTF-8 byte sequences involving pre-composed and combining codepoints). I've not studied this codeset aliasing issue, but I suspect that the chance of such collisions with meaningful filenames is remote, and if the filesystem is set up to reject non-UTF-8 filenames then the chance that users will be able to create non-UTF-8 filenames without realizing that most such names will be rejected is infinitesimally small. This problem is best avoided by disallowing the creation of invalid UTF-8 filenames; ZFS has an option for that.
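As a small illustration of both points, the Python sketch below shows a byte sequence that is meaningful in ISO-8859-1 and is also valid UTF-8 (the aliasing case), and a hypothetical is_valid_utf8() check of the kind a filesystem could apply on CREATE. The function name is mine and this is not ZFS's implementation; it relies only on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(name: bytes) -> bool:
    # Hypothetical CREATE-time check: accept only well-formed UTF-8.
    try:
        name.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False

# 'café' encoded in ISO-8859-1 is NOT valid UTF-8, so it would be rejected.
latin1_name = "caf\u00e9".encode("latin-1")     # b'caf\xe9'
latin1_ok = is_valid_utf8(latin1_name)

# The same name in UTF-8 is accepted.
utf8_name = "caf\u00e9".encode("utf-8")         # b'caf\xc3\xa9'
utf8_ok = is_valid_utf8(utf8_name)

# Aliasing: these two bytes are 'Ã©' in ISO-8859-1 and 'é' in UTF-8,
# so a validity check alone cannot tell which one the user meant.
ambiguous = b"\xc3\xa9"
as_latin1 = ambiguous.decode("latin-1")
as_utf8 = ambiguous.decode("utf-8")
```

Most legacy 8-bit names fail the check like latin1_name does, which is why rejecting invalid UTF-8 on CREATE makes accidental aliasing unlikely in practice.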

UPDATE: Note also that in some protocols you have to normalize early for cryptographic reasons, such as in Kerberos V5 AS-REQs when not using client name canonicalization, or in TGS-REQs when not using referrals. However, it's best to design protocols to avoid this.


I'm an engineer at Oracle (erstwhile Sun), where I've been since 2002, working on Sun_SSH, Solaris Kerberos, Active Directory interoperability, Lustre, and misc. other things.

