64-bit Archives Needed
By Ali Bahrami-Oracle on Nov 11, 2011
Even in 2011, 2GB is a very large amount of code, so we had not heard this one before. Most of our toolchain was extended to handle 64-bit sized data back in the 1990's, but archives were not changed, presumably because there was no perceived need for it. Since then of course, programs have continued to get larger, and in 2010, the time had finally come to investigate the issue and find a way to provide for larger archives.
As part of that process, I had to do a deep dive into the archive format, and also do some Unix archeology. I'm going to record what I learned here, to document what Solaris does, and in the hope that it might help someone else trying to solve the same problem for their platform.
Archive Format DetailsArchives are hardly cutting edge technology. They are still used of course, but their basic form hasn't changed in decades. Other than to fix a bug, which is rare, we don't tend to touch that code much. The archive file format is described in /usr/include/ar.h, and I won't repeat the details here. Instead, here is a rough overview of the archive file format, implemented by System V Release 4 (SVR4) Unix systems such as Solaris:
- Every archive starts with a "magic number". This is a sequence of 8 characters: "!<arch>\n".
- The magic number is followed by 1 or more members. A
member starts with a fixed header, defined by the
ar_hdr structure in/usr/include/ar.h.
Immediately following the header comes the
data for the member. Members must be padded at the end with newline
characters so that they have even length.
The requirement to pad members to an even length is a dead giveaway as to the age of the archive format. It tells you that this format dates from the 1970's, and more specifically from the era of 16-bit systems such as the PDP-11 that Unix was originally developed on. A 32-bit system would have required 4 bytes, and 64-bit systems such as we use today would probably have required 8 bytes. 2 byte alignment is a poor choice for ELF object archive members. 32-bit objects require 4 byte alignment, and 64-bit objects require 64-bit alignment. The link-editor uses mmap() to process archives, and if the members have the wrong alignment, we have to slide (copy) them to the correct alignment before we can access the ELF data structures inside. The archive format requires 2 byte padding, but it doesn't prohibit more. The Solaris ar command takes advantage of this, and pads ELF object members to 8 byte boundaries. Anything else is padded to 2 as required by the format.
- The archive header (ar_hdr) represents all numeric values using an ASCII text representation rather than as binary integers. This means that an archive that contains only text members can be viewed using tools such as cat, more, or a text editor. The original designers of this format clearly thought that archives would be used for many file types, and not just for objects. Things didn't turn out that way of course nearly all archives contain relocatable objects for a single operating system and machine, and are used primarily as input to the link-editor (ld).
- Archives can have special members that are created by
the ar command rather than being
supplied by the user. These special members
are all distinguished by having a name that starts with the slash (/)
character. This is an unambiguous marker that says that the user could
not have supplied it. The reason for this is that regular archive members
are given the plain name of the file that was inserted to create them,
and any path components are stripped off. Slash is the delimiter
character used by Unix to separate path components, and as such
cannot occur within a plain file name.
The ar command hides the special members from you when you list the contents of an archive, so most users don't know that they exist. There are only two possible special members: A symbol table that maps ELF symbols to the object archive member that provides it, and a string table used to hold member names that exceed 15 characters. The '/' convention for tagging special members provides room for adding more such members should the need arise. As I will discuss below, we took advantage of this fact to add an alternate 64-bit symbol table special member which is used in archives that are larger than 4GB.
- When an archive contains ELF object members, the ar command builds a special archive member known as the symbol table that maps all ELF symbols in the object to the archive member that provides it. The link-editor uses this symbol table to determine which symbols are provided by the objects in that archive. If an archive has a symbol table, it will always be the first member in the archive, immediately following the magic number. Unlike member headers, symbol tables do use binary integers to represent offsets. These integers are always stored in big-endian format, even on a little endian host such as x86.
- The archive header (ar_hdr) provides 15 characters for representing the member name. If any member has a name that is longer than this, then the real name is written into a special archive member called the string table, and the member's name field instead contains a slash (/) character followed by a decimal representation of the offset of the real name within the string table. The string table is required to precede all normal archive members, so it will be the second member if the archive contains a symbol table, and the first member otherwise.
- The archive format is not designed to make finding a given member easy. Such operations move through the archive from front to back examining each member in turn, and run in O(n) time. This would be bad if archives were commonly used in that manner, but in general, they are not. Typically, the ar command is used to build an new archive from scratch, inserting all the objects in one operation, and then the link-editor accesses the members in the archive in constant time by using the offsets provided by the symbol table. Both of these operations are reasonably efficient. However, listing the contents of a large archive with the ar command can be rather slow.
Factors That Limit Solaris Archive SizeAs is often the case, there was more than one limiting factor preventing Solaris archives from growing beyond the 32-bit limits of 2GB (32-bit signed) and 4GB (32-bit unsigned). These limits are listed in the order they are hit as archive size grows, so the earlier ones mask those that follow.
- The original Solaris archive file format can handle sizes up to 4GB without issue. However, the ar command was delivered as a 32-bit executable that did not use the largefile APIs. As such, the ar command itself could not create a file larger than 2GB. One can solve this by building ar with the largefile APIs which would allow it to reach 4GB, but a simpler and better answer is to deliver a 64-bit ar, which has the ability to scale well past 4GB.
- Symbol table offsets are stored as 32-bit big-endian binary integers, which limits the maximum archive size to 4GB. To get around this limit requires a different symbol table format, or an extension mechanism to the current one, similar in nature to the way member names longer than 15 characters are handled in member headers.
- The size field in the archive member header (ar_hdr) is an ASCII string capable of representing a 32-bit unsigned value. This places a 4GB size limit on the size of any individual member in an archive.
Archive Format Differences Among Unix VariantsWhile considering how to extend Solaris archives to scale to 64-bits, I wanted to know how similar archives from other Unix systems are to those produced by Solaris, and whether they had already solved the 64-bit issue. I've successfully moved archives between different Unix systems before with good luck, so I knew that there was some commonality. If it turned out that there was already a viable defacto standard for 64-bit archives, it would obviously be better to adopt that rather than invent something new.
The archive file format is not formally standardized. However, the ar command and archive format were part of the original Unix from Bell Labs. Other systems started with that format, extending it in various often incompatible ways, but usually with the same common shared core. Most of these systems use the same magic number to identify their archives, despite the fact that their archives are not always fully compatible with each other. It is often true that archives can be copied between different Unix variants, and if the member names are short enough, the ar command from one system can often read archives produced on another.
In practice, it is rare to find an archive containing anything other than objects for a single operating system and machine type. Such an archive is only of use on the type of system that created it, and is only used on that system. This is probably why cross platform compatibility of archives between Unix variants has never been an issue. Otherwise, the use of the same magic number in archives with incompatible formats would be a problem.
I was able to find information for a number of Unix variants, described below. These can be divided roughly into three tribes, SVR4 Unix, BSD Unix, and IBM AIX. Solaris is a SVR4 Unix, and its archives are completely compatible with those from the other members of that group (GNU/Linux, HP-UX, and SGI IRIX).
- AIX is an exception to rule that Unix archive formats are all
based on the original Bell labs Unix format. It appears that AIX
supports 2 formats (small and big), both of which differ in
fundamental ways from other Unix systems:
- These formats use a different magic number than the standard one used by Solaris and other Unix variants.
- They include support for removing archive members from a file without reallocating the file, marking dead areas as unused, and reusing them when new archive items are inserted.
- They have a special table of contents member (File Member Header) which lets you find out everything that's in the archive without having to actually traverse the entire file. Their symbol table members are quite similar to those from other systems though.
- Their member headers are doubly linked, containing offsets to both the previous and next members.
Of the Unix systems described here, AIX has the only format I saw that will have reasonable insert/delete performance for really large archives. Everyone else has O(n) performance, and are going to be slow to use with large archives.
- BSD has gone through 4 versions of archive format, which are described in their manpage. They use the same member header as SVR4, but their symbol table format is different, and their scheme for long member names puts the name directly after the member header rather than into a string table.
- The GNU toolchain uses the SVR4 format, and is compatible with Solaris.
- HP-UX seems to follow the SVR4 model, and is compatible with Solaris.
- IRIX has 32 and 64-bit archives. The 32-bit format is the
standard SVR4 format, and is compatible with Solaris. The 64-bit format
is the same, except that the symbol table uses
IRIX assumes that an archive contains objects of a single ELFCLASS/MACHINE, and any archive containing ELFCLASS64 objects receives a 64-bit symbol table. Although they only use it for 64-bit objects, nothing in the archive format limits it to ELFCLASS64. It would be perfectly valid to produce a 64-bit symbol table in an archive containing 32-bit objects, text files, or anything else.
- Tru64 Unix (Digital/Compaq/HP)
- Tru64 Unix uses a format much like ours, but their symbol table is a hash table, making specific symbol lookup much faster. The Solaris link-editor uses archives by examining the entire symbol table looking for unsatisfied symbols for the link, and not by looking up individual symbols, so there would be no benefit to Solaris from such a hash table. The Tru64 ld must use a different approach in which the hash table pays off for them.
Widening the existing SVR4 archive symbol tables rather than inventing something new is the simplest path forward. There is ample precedent for this approach in the ELF world. When ELF was extended to support 64-bit objects, the approach was largely to take the existing data structures, and define 64-bit versions of them. We called the old set ELF32, and the new set ELF64. My guess is that there was no need to widen the archive format at that time, but had there been, it seems obvious that this is how it would have been done.
The Implementation of 64-bit Solaris ArchivesAs mentioned earlier, there was no desire to improve the fundamental nature of archives. They have always had O(n) insert/delete behavior, and for the most part it hasn't mattered. AIX made efforts to improve this, but those efforts did not find widespread adoption. For the purposes of link-editing, which is essentially the only thing that archives are used for, the existing format is adequate, and issues of backward compatibility trump the desire to do something technically better.
Widening the existing symbol table format to 64-bits is therefore the obvious way to proceed. For Solaris 11, I implemented that, and I also updated the ar command so that a 64-bit version is run by default. This eliminates the 2 most significant limits to archive size, leaving only the limit on an individual archive member.
We only generate a 64-bit symbol table if the archive exceeds 4GB, or when the new -S option to the ar command is used. This maximizes backward compatibility, as an archive produced by Solaris 11 is highly likely to be less than 4GB in size, and will therefore employ the same format understood by older versions of the system. The main reason for the existence of the -S option is to allow us to test the 64-bit format without having to construct huge archives to do so. I don't believe it will find much use outside of that.
Other than the new ability to create and use extremely large archives, this change is largely invisible to the end user. When reading an archive, the ar command will transparently accept either form of symbol table. Similarly, the ELF library (libelf) has been updated to understand either format. Users of libelf (such as the link-editor ld) do not need to be modified to use the new format, because these changes are encapsulated behind the existing functions provided by libelf.
As mentioned above, this work did not lift the limit on the maximum size of an individual archive member. That limit remains fixed at 4GB for now. This is not because we think objects will never get that large, for the history of computing says otherwise. Rather, this is based on an estimation that single relocatable objects of that size will not appear for a decade or two. A lot can change in that time, and it is better not to overengineer things by writing code that will sit and rot for years without being used.
It is not too soon however to have a plan for that eventuality. When the time comes when this limit needs to be lifted, I believe that there is a simple solution that is consistent with the existing format. The archive member header size field is an ASCII string, like the name, and as such, the overflow scheme used for long names can also be used to handle the size. The size string would be placed into the archive string table, and its offset in the string table would then be written into the archive header size field using the same format "/ddd" used for overflowed names.