Monday Nov 26, 2012

Parent Objects

Support for Parent Objects was added in Solaris 11 Update 1. The following material is adapted from the PSARC case, and the Solaris Linker and Libraries Manual.

A "plugin" is a shared object, usually loaded via dlopen(), that is used by a program in order to allow the end user to add functionality to the program. Examples of plugins include those used by web browsers (flash, acrobat, etc), as well as mdb and elfedit modules. The object that loads the plugin at runtime is called the "parent object". Unlike most object dependencies, the parent is not identified by name, but by its status as the object doing the load.

Historically, building a good plugin has been more complicated than it should be:

  • A parent and its plugin usually share a 2-way dependency: The plugin provides one or more routines for the parent to call, and the parent supplies support routines for use by the plugin for things like memory allocation and error reporting.

  • It is a best practice to build all objects, including plugins, with the -z defs option, in order to ensure that the object specifies all of its dependencies, and is self contained. However:

    • The parent is usually an executable, which cannot be linked to via the usual library mechanisms provided by the link editor.

    • Even if the parent is a shared object, which could be a normal library dependency to the plugin, it may be desirable to build plugins that can be used by more than one parent, in which case embedding a NEEDED entry for one of the parents is undesirable.
The usual way to build a high quality plugin with -z defs uses a special mapfile provided by the parent. This mapfile defines the parent routines, specifying the PARENT attribute (see example below). This works, but is inconvenient, and error prone. The symbol table in the parent already describes what it makes available to plugins — ideally the plugin would obtain that information directly rather than from a separate mapfile.

The new -z parent option to ld allows a plugin to link to the parent and access the parent symbol table. This differs from a typical dependency:

  • No NEEDED record is created.

  • The relationship is recorded as a logical connection to the parent, rather than as an explicit object name.
However, it operates in the same manner as any other dependency in terms of making symbols available to the plugin.

When the -z parent option is used, the link-editor records the basename of the parent object in the dynamic section, using the new tag DT_SUNW_PARENT. This is an informational tag, which is not used by the runtime linker to locate the parent, but which is available for diagnostic purposes.
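
Although the runtime linker ignores DT_SUNW_PARENT, diagnostic code can read it directly. The following sketch is my illustration rather than shipping code: it uses libelf to walk an object's dynamic section and print the recorded parent name, assuming a <link.h> that defines DT_SUNW_PARENT (as on Solaris 11 Update 1 and newer). Error handling is minimal for brevity.

#include <fcntl.h>
#include <gelf.h>
#include <libelf.h>
#include <link.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int     fd;
        Elf     *elf;
        Elf_Scn *scn = NULL;

        if ((argc != 2) || (elf_version(EV_CURRENT) == EV_NONE) ||
            ((fd = open(argv[1], O_RDONLY)) < 0) ||
            ((elf = elf_begin(fd, ELF_C_READ, NULL)) == NULL))
                return (1);

        while ((scn = elf_nextscn(elf, scn)) != NULL) {
                GElf_Shdr       shdr;
                Elf_Data        *data;
                GElf_Dyn        dyn;
                int             i;

                if ((gelf_getshdr(scn, &shdr) == NULL) ||
                    (shdr.sh_type != SHT_DYNAMIC))
                        continue;

                /* Walk the dynamic entries, stopping at DT_NULL */
                data = elf_getdata(scn, NULL);
                for (i = 0; (data != NULL) &&
                    (gelf_getdyn(data, i, &dyn) != NULL) &&
                    (dyn.d_tag != DT_NULL); i++) {
                        if (dyn.d_tag == DT_SUNW_PARENT)
                                (void) printf("parent: %s\n",
                                    elf_strptr(elf, shdr.sh_link,
                                    dyn.d_un.d_val));
                }
        }
        (void) elf_end(elf);
        (void) close(fd);
        return (0);
}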

The ld(1) manpage documentation for the -z parent option is:

-z parent=object
Specifies a "parent object", which can be an executable or shared object, against which to link the output object. This option is typically used when creating "plugin" shared objects intended to be loaded by an executable at runtime via the dlopen() function. The symbol table from the parent object is used to satisfy references from the plugin object. The use of the -z parent option makes symbols from the object calling dlopen() available to the plugin.

Example

For this example, we use a main program and a plugin. The parent provides a function named parent_callback() for the plugin to call. The plugin provides a function named plugin_func() to the parent:
% cat main.c
#include <stdio.h>
#include <dlfcn.h>
#include <link.h>

void
parent_callback(void)
{
        printf("plugin_func() has called parent_callback()\n");
}

int
main(int argc, char **argv)
{
        typedef void plugin_func_t(void);

        void            *hdl;
        plugin_func_t   *plugin_func;

        if (argc != 2) {
                fprintf(stderr, "usage: main plugin\n");
                return (1);
        }

        if ((hdl = dlopen(argv[1], RTLD_LAZY)) == NULL) {
                fprintf(stderr, "unable to load plugin: %s\n", dlerror());
                return (1);
        }

        plugin_func = (plugin_func_t *) dlsym(hdl, "plugin_func");
        if (plugin_func == NULL) {
                fprintf(stderr, "unable to find plugin_func: %s\n",
	             dlerror());
                return (1);
        }

        (*plugin_func)();

        return (0);
}

% cat plugin.c
#include <stdio.h>
extern  void    parent_callback(void);

void
plugin_func(void)
{
        printf("parent has called plugin_func() from plugin.so\n");
        parent_callback();
}
Building this in the traditional manner, without -zdefs:
% cc -o main main.c
% cc -G -o plugin.so plugin.c
% ./main ./plugin.so
parent has called plugin_func() from plugin.so
plugin_func() has called parent_callback()
As noted above, when building any shared object, the -z defs option is recommended, in order to ensure that the object is self contained and specifies all of its dependencies. However, the use of -z defs prevents the plugin object from linking due to the unsatisfied symbol from the parent object:
% cc -zdefs -G -o plugin.so plugin.c
Undefined                       first referenced
 symbol                             in file
parent_callback                     plugin.o
ld: fatal: symbol referencing errors. No output written to plugin.so
A mapfile can be used to specify to ld that the parent_callback symbol is supplied by the parent object.
% cat plugin.mapfile
$mapfile_version 2

SYMBOL_SCOPE {
    global:
        parent_callback         { FLAGS = PARENT };
};
% cc -zdefs -Mplugin.mapfile -G -o plugin.so plugin.c
However, the -z parent option to ld is the most direct solution to this problem, allowing the plugin to actually link against the parent object and obtain the available symbols from it. An added benefit of using -z parent instead of a mapfile is that the name of the parent object is recorded in the dynamic section of the plugin, and can be displayed by the file utility:
    % cc -zdefs -zparent=main -G -o plugin.so plugin.c
    % elfdump -d plugin.so | grep PARENT
           [0]  SUNW_PARENT       0xcc                main
    % file plugin.so
    plugin.so: ELF 32-bit LSB dynamic lib 80386 Version 1,
        parent main, dynamically linked, not stripped
    % ./main ./plugin.so
    parent has called plugin_func() from plugin.so
    plugin_func() has called parent_callback()

We can also observe this in elfedit plugins on Solaris systems running Solaris 11 Update 1 or newer:

    % file /usr/lib/elfedit/dyn.so 
    /usr/lib/elfedit/dyn.so: ELF 32-bit LSB dynamic lib 80386 Version 1,
        parent elfedit, dynamically linked, not stripped,
        no debugging information available

Related Other Work

The GNU ld has an option named --just-symbols that can be used in a similar manner:
--just-symbols=filename
Read symbol names and their addresses from filename, but do not relocate it or include it in the output. This allows your output file to refer symbolically to absolute locations of memory defined in other programs. You may use this option more than once.
-z parent is a higher level operation aimed specifically at simplifying the construction of high quality plugins. Although it employs the same underlying operation, it differs from --just-symbols in 2 significant ways:
  1. There can only be one parent.

  2. The parent is recorded in the created object, and can be displayed by 'file', or other similar tools.

Ancillary Objects: Separate Debug ELF Files For Solaris

We introduced a new ELF object type in Solaris 11 Update 1 called the Ancillary Object. This posting describes ancillary objects, using material originally written during their development, the PSARC case, and the Solaris Linker and Libraries Manual.

ELF objects contain allocable sections, which are mapped into memory at runtime, and non-allocable sections, which are present in the file for use by debuggers and observability tools, but which are not mapped or used at runtime. Typically, all of these sections exist within a single object file. Ancillary objects allow them to instead go into a separate file.

There are different reasons given for wanting such a feature. One can debate whether the added complexity is worth the benefit, and in most cases it is not. However, one important case stands out — customers with very large 32-bit objects who are not ready or able to make the transition to 64-bits.

We have customers who build extremely large 32-bit objects. Historically, the debug sections in these objects have used the stabs format, which is limited, but relatively compact. In recent years, the industry has transitioned to the powerful but verbose DWARF standard. In some cases, the size of these debug sections is large enough to push the total object file size past the fundamental 4GB limit for 32-bit ELF object files.

The best, and ultimately only, solution to overly large objects is to transition to 64-bits. However, consider environments where:

  • Hundreds of users may be executing the code on large shared systems. (32-bit code uses less memory and bus bandwidth, and on sparc it runs just as fast as 64-bit code.)

  • Complex finely tuned code, where the original authors may no longer be available.

  • Critical production code, that was expensive to qualify and bring online, and which is otherwise serving its intended purpose without issue.
Users in these risk averse and/or high scale categories have good reasons to push 32-bit objects to the limit before moving on. Ancillary objects offer these users a longer runway.

Design

The design of ancillary objects is intended to be simple, both to help human understanding when examining elfdump output, and to lower the bar for debuggers such as dbx to support them.
  • The primary and ancillary objects have the same set of section headers, with the same names, in the same order (i.e. each section has the same index in both files).

  • A single section of type SHT_SUNW_ANCILLARY is added to both objects, containing information that allows a debugger to identify and validate both files relative to each other. Given one of these files, the ancillary section allows you to identify the other.

  • Allocable sections go in the primary object, and non-allocable ones go into the ancillary object. A small set of non-allocable sections, notably the symbol table, are copied into both objects.

  • As noted above, most sections are only written to one of the two objects, but both objects have the same section header array. The section header in the file that does not contain the section data is tagged with the SHF_SUNW_ABSENT section header flag to indicate its placeholder status.

  • Compiler writers and others who produce objects can set the SHF_SUNW_PRIMARY section header flag to mark non-allocable sections that should go to the primary object rather than the ancillary.

  • If you don't request an ancillary object, the Solaris ELF format is unchanged. Users who don't use ancillary objects do not pay for the feature. This is important, because they exist to serve a small subset of our users, and must not complicate the common case.

  • If you do request an ancillary object, the runtime behavior of the primary object will be the same as that of a normal object. There is no added runtime cost.

The primary and ancillary object together represent a logical single object. This is facilitated by the use of a single set of section headers. One can easily imagine a tool that can merge a primary and ancillary object into a single file, or the reverse. (Note that although this is an interesting intellectual exercise, we don't actually supply such a tool because there's little practical benefit above and beyond using ld to create the files).

Among the benefits of this approach are:

  • There is no need for per-file symbol tables to reflect the contents of each file. The same symbol table that would be produced for a standard object can be used.

  • The section contents are identical in either case — there is no need to alter data to accommodate multiple files.

  • It is very easy for a debugger to adapt to these new files, and the processing involved can be encapsulated in input/output routines. Most of the existing debugger implementation applies without modification.

  • The limit of a 4GB 32-bit output object is now raised to 4GB of code, and 4GB of debug data. There is also the future possibility (not currently supported) to support multiple ancillary objects, each of which could contain up to 4GB of additional debug data. It must be noted however that the 32-bit DWARF debug format is itself inherently 32-bit limited, as it uses 32-bit offsets between debug sections, so the ability to employ multiple ancillary object files may not turn out to be useful.

Using Ancillary Objects (From the Solaris Linker and Libraries Guide)

By default, objects contain both allocable and non-allocable sections. Allocable sections are the sections that contain executable code and the data needed by that code at runtime. Non-allocable sections contain supplemental information that is not required to execute an object at runtime. These sections support the operation of debuggers and other observability tools. The non-allocable sections in an object are not loaded into memory at runtime by the operating system, and so they have no impact on memory use or other aspects of runtime performance, no matter their size.

For convenience, both allocable and non-allocable sections are normally maintained in the same file. However, there are situations in which it can be useful to separate these sections.

  • To reduce the size of objects in order to improve the speed at which they can be copied across wide area networks.

  • To support fine grained debugging of highly optimized code. Such debugging requires considerable debug data. In modern systems, the debugging data can easily be larger than the code it describes. The size of a 32-bit object is limited to 4 Gbytes. In very large 32-bit objects, the debug data can cause this limit to be exceeded and prevent the creation of the object.

  • To limit the exposure of internal implementation details.

Traditionally, objects have been stripped of non-allocable sections in order to address these issues. Stripping is effective, but destroys data that might be needed later. The Solaris link-editor can instead write non-allocable sections to an ancillary object. This feature is enabled with the -z ancillary command line option.

$ ld ... -z ancillary[=outfile] ...

By default, the ancillary file is given the same name as the primary output object, with a .anc file extension. However, a different name can be specified by providing an outfile value to the -z ancillary option.

When -z ancillary is specified, the link-editor performs the following actions.

  • All allocable sections are written to the primary object. In addition, all non-allocable sections containing one or more input sections that have the SHF_SUNW_PRIMARY section header flag set are written to the primary object.

  • All remaining non-allocable sections are written to the ancillary object.

  • The following non-allocable sections are written to both the primary object and ancillary object.

    .shstrtab

    The section name string table.

    .symtab

    The full non-dynamic symbol table.

    .symtab_shndx

    The symbol table extended index section associated with .symtab.

    .strtab

    The non-dynamic string table associated with .symtab.

    .SUNW_ancillary

    Contains the information required to identify the primary and ancillary objects, and to identify the object being examined.

  • The primary object and all ancillary objects contain the same array of section headers. Each section has the same section index in every file.

  • Although the primary and ancillary objects all define the same section headers, the data for most sections will be written to a single file as described above. If the data for a section is not present in a given file, the SHF_SUNW_ABSENT section header flag is set, and the sh_size field is 0.

This organization makes it possible to acquire a full list of section headers, a complete symbol table, and a complete list of the primary and ancillary objects from either of the primary or ancillary objects.

The following example illustrates the underlying implementation of ancillary objects. An ancillary object is created by adding the -z ancillary command line option to an otherwise normal compilation. The file utility shows that the result is an executable named a.out, and an associated ancillary object named a.out.anc.

$ cat hello.c
#include <stdio.h>

int
main(int argc, char **argv) 
{ 
        (void) printf("hello, world\n");
        return (0);
}
$ cc -g -zancillary hello.c
$ file a.out a.out.anc
a.out: ELF 32-bit LSB executable 80386 Version 1 [FPU], dynamically
       linked, not stripped, ancillary object a.out.anc
a.out.anc: ELF 32-bit LSB ancillary 80386 Version 1, primary object a.out
$ ./a.out
hello, world

The resulting primary object is an ordinary executable that can be executed in the usual manner. It is no different at runtime than an executable built without the use of ancillary objects, and then stripped of non-allocable content using the strip or mcs commands.

As previously described, the primary object and ancillary objects contain the same section headers. To see how this works, it is helpful to use the elfdump utility to display these section headers and compare them. The following table shows the section header information for a selection of headers from the previous link-edit example.

Index  Section Name     Type            Primary Flags    Ancillary Flags              Primary Size  Ancillary Size

13     .text            PROGBITS        ALLOC EXECINSTR  ALLOC EXECINSTR SUNW_ABSENT  0x131         0
20     .data            PROGBITS        WRITE ALLOC      WRITE ALLOC SUNW_ABSENT      0x4c          0
21     .symtab          SYMTAB          0                0                            0x450         0x450
22     .strtab          STRTAB          STRINGS          STRINGS                      0x1ad         0x1ad
24     .debug_info      PROGBITS        SUNW_ABSENT      0                            0             0x1a7
28     .shstrtab        STRTAB          STRINGS          STRINGS                      0x118         0x118
29     .SUNW_ancillary  SUNW_ancillary  0                0                            0x30          0x30

The data for most sections is only present in one of the two files, and absent from the other file. The SHF_SUNW_ABSENT section header flag is set when the data is absent. The data for allocable sections needed at runtime is found in the primary object. The data for non-allocable sections used for debugging but not needed at runtime is placed in the ancillary file. A small set of non-allocable sections are fully present in both files. These are the .SUNW_ancillary section used to relate the primary and ancillary objects together, the section name string table .shstrtab, as well as the symbol table .symtab, and its associated string table .strtab.

It is possible to strip the symbol table from the primary object. A debugger that encounters an object without a symbol table can use the .SUNW_ancillary section to locate the ancillary object, and access the symbol table contained within.

The primary object, and all associated ancillary objects, contain a .SUNW_ancillary section that allows all the objects to be identified and related together.

$ elfdump -T SUNW_ancillary a.out a.out.anc
a.out:
Ancillary Section:  .SUNW_ancillary
     index  tag                    value
       [0]  ANC_SUNW_CHECKSUM     0x8724              
       [1]  ANC_SUNW_MEMBER       0x1         a.out
       [2]  ANC_SUNW_CHECKSUM     0x8724         
       [3]  ANC_SUNW_MEMBER       0x1a3       a.out.anc
       [4]  ANC_SUNW_CHECKSUM     0xfbe2              
       [5]  ANC_SUNW_NULL         0                   

a.out.anc:
Ancillary Section:  .SUNW_ancillary
     index  tag                    value
       [0]  ANC_SUNW_CHECKSUM     0xfbe2              
       [1]  ANC_SUNW_MEMBER       0x1         a.out
       [2]  ANC_SUNW_CHECKSUM     0x8724              
       [3]  ANC_SUNW_MEMBER       0x1a3       a.out.anc
       [4]  ANC_SUNW_CHECKSUM     0xfbe2              
       [5]  ANC_SUNW_NULL         0          

The ancillary sections for both objects contain the same number of elements, and are identical except for the first element. Each object, starting with the primary object, is introduced with a MEMBER element that gives the file name, followed by a CHECKSUM that identifies the object. In this example, the primary object is a.out, and has a checksum of 0x8724. The ancillary object is a.out.anc, and has a checksum of 0xfbe2. The first element in a .SUNW_ancillary section, preceding the MEMBER element for the primary object, is always a CHECKSUM element, containing the checksum for the file being examined.

  • The presence of a .SUNW_ancillary section in an object indicates that the object has associated ancillary objects.

  • The names of the primary and all associated ancillary objects can be obtained from the ancillary section from any one of the files.

  • It is possible to determine which file is being examined from the larger set of files by comparing the first checksum value to the checksum of each member that follows.

Debugger Access and Use of Ancillary Objects

Debuggers and other observability tools must merge the information found in the primary and ancillary object files in order to build a complete view of the object. This is equivalent to processing the information from a single file. This merging is simplified by the primary object and ancillary objects containing the same section headers, and a single symbol table.

The following steps can be used by a debugger to assemble the information contained in these files.

  1. Starting with the primary object, or any of the ancillary objects, locate the .SUNW_ancillary section. The presence of this section identifies the object as part of an ancillary group, and contains information that can be used to obtain a complete list of the files and determine which of those files is the one currently being examined.

  2. Create a section header array in memory, using the section header array from the object being examined as an initial template.

  3. Open and read each file identified by the .SUNW_ancillary section in turn. For each file, fill in the in-memory section header array with the information for each section that does not have the SHF_SUNW_ABSENT flag set.

The result will be a complete in-memory copy of the section headers with pointers to the data for all sections. Once this information has been acquired, the debugger can proceed as it would in the single file case, to access and control the running program.
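
To make these steps concrete, the following fragment sketches steps 2 and 3 using libelf. It is illustrative only: step 1 is assumed to have produced an array of open Elf handles for the group, error handling is omitted, and the SHF_SUNW_ABSENT definition is assumed to come from sys/elf.h.

#include <sys/types.h>
#include <gelf.h>
#include <libelf.h>

typedef struct {
        GElf_Shdr       ms_shdr;        /* merged section header */
        Elf_Data        *ms_data;       /* data, from whichever file has it */
} merged_scn_t;

/*
 * Fill the in-memory section array from every file in the ancillary
 * group. Sections flagged SHF_SUNW_ABSENT are placeholders; the data
 * is picked up from the one file that actually contains the section.
 */
static void
merge_group(Elf **elfs, int nfiles, merged_scn_t *scns, size_t nscns)
{
        int     f;

        for (f = 0; f < nfiles; f++) {
                size_t  ndx;

                for (ndx = 1; ndx < nscns; ndx++) {
                        Elf_Scn         *scn = elf_getscn(elfs[f], ndx);
                        GElf_Shdr       shdr;

                        if ((scn == NULL) ||
                            (gelf_getshdr(scn, &shdr) == NULL) ||
                            (shdr.sh_flags & SHF_SUNW_ABSENT))
                                continue;

                        scns[ndx].ms_shdr = shdr;
                        scns[ndx].ms_data = elf_getdata(scn, NULL);
                }
        }
}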


Note - The ELF definition of ancillary objects provides for a single primary object, and an arbitrary number of ancillary objects. At this time, the Oracle Solaris link-editor only produces a single ancillary object containing all non-allocable sections. This may change in the future. Debuggers and other observability tools should be written to handle the general case of multiple ancillary objects.


ELF Implementation Details (From the Solaris Linker and Libraries Guide)

To implement ancillary objects, it was necessary to extend the ELF format to add a new object type (ET_SUNW_ANCILLARY), a new section type (SHT_SUNW_ANCILLARY), and 2 new section header flags (SHF_SUNW_ABSENT, SHF_SUNW_PRIMARY). In this section, I will detail these changes, in the form of diffs to the Solaris Linker and Libraries manual.

Part IV ELF Application Binary Interface

Chapter 13: Object File Format
Object File Format

Edit Note: This existing section at the beginning of the chapter describes the ELF header. There's a table of object file types, which now includes the new ET_SUNW_ANCILLARY type.
e_type
Identifies the object file type, as listed in the following table.
Name                 Value   Meaning

ET_NONE              0       No file type
ET_REL               1       Relocatable file
ET_EXEC              2       Executable file
ET_DYN               3       Shared object file
ET_CORE              4       Core file
ET_LOSUNW            0xfefe  Start operating system specific range
ET_SUNW_ANCILLARY    0xfefe  Ancillary object file
ET_HISUNW            0xfefd  End operating system specific range
ET_LOPROC            0xff00  Start processor-specific range
ET_HIPROC            0xffff  End processor-specific range
Sections

Edit Note: This overview section defines the section header structure, and provides a high level description of known sections. It was updated to define the new SHF_SUNW_ABSENT and SHF_SUNW_PRIMARY flags and the new SHT_SUNW_ANCILLARY section.

...

sh_type

Categorizes the section's contents and semantics. Section types and their descriptions are listed in Table 13-5.
sh_flags
Sections support 1-bit flags that describe miscellaneous attributes. Flag definitions are listed in Table 13-8.
...
Table 13-5 ELF Section Types, sh_type

Name                  Value
.
.
.
SHT_LOSUNW            0x6fffffee
SHT_SUNW_ancillary    0x6fffffee
.
.
.

...

SHT_LOSUNW - SHT_HISUNW

Values in this inclusive range are reserved for Oracle Solaris OS semantics.
SHT_SUNW_ANCILLARY
Present when a given object is part of a group of ancillary objects. Contains information required to identify all the files that make up the group. See Ancillary Section.

...

Table 13-8 ELF Section Attribute Flags

Name                  Value
.
.
.
SHF_MASKOS            0x0ff00000
SHF_SUNW_NODISCARD    0x00100000
SHF_SUNW_ABSENT       0x00200000
SHF_SUNW_PRIMARY      0x00400000
SHF_MASKPROC          0xf0000000
.
.
.

...

SHF_SUNW_ABSENT

Indicates that the data for this section is not present in this file. When ancillary objects are created, the primary object and any ancillary objects will all have the same section header array, to facilitate merging them to form a complete view of the object, and to allow them to use the same symbol tables. Each file contains a subset of the section data. The data for allocable sections is written to the primary object, while the data for non-allocable sections is written to an ancillary file. The SHF_SUNW_ABSENT flag is used to indicate that the data for the section is not present in the object being examined. When the SHF_SUNW_ABSENT flag is set, the sh_size field of the section header must be 0. An application encountering an SHF_SUNW_ABSENT section can choose to ignore the section, or to search for the section data within one of the related ancillary files.

SHF_SUNW_PRIMARY

The default behavior when ancillary objects are created is to write all allocable sections to the primary object and all non-allocable sections to the ancillary objects. The SHF_SUNW_PRIMARY flag overrides this behavior. Any output section containing one or more input sections with the SHF_SUNW_PRIMARY flag set is written to the primary object without regard for its allocable status.

...

Two members of the section header, sh_link and sh_info, hold special information, depending on section type.

Table 13-9 ELF sh_link and sh_info Interpretation

sh_type               sh_link                                  sh_info
.
.
.
SHT_SUNW_ANCILLARY    The section header index of the          0
                      associated string table.
.
.
.

Special Sections

Edit Note: This section describes the sections used in Solaris ELF objects, using the types defined in the previous description of section types. It was updated to define the new .SUNW_ancillary (SHT_SUNW_ANCILLARY) section.

Various sections hold program and control information. Sections in the following table are used by the system and have the indicated types and attributes.

Table 13-10 ELF Special Sections

Name               Type                  Attribute
.
.
.
.SUNW_ancillary    SHT_SUNW_ancillary    None
.
.
.

...

.SUNW_ancillary

Present when a given object is part of a group of ancillary objects. Contains information required to identify all the files that make up the group. See Ancillary Section for details.

...

Ancillary Section

Edit Note: This new section provides the format reference describing the layout of a .SUNW_ancillary section and the meaning of the various tags. Note that these sections use the same tag/value concept used for dynamic and capabilities sections, and will be familiar to anyone used to working with ELF.
In addition to the primary output object, the Solaris link-editor can produce one or more ancillary objects. Ancillary objects contain non-allocable sections that would normally be written to the primary object. When ancillary objects are produced, the primary object and all of the associated ancillary objects contain a SHT_SUNW_ancillary section, containing information that identifies these related objects. Given any one object from such a group, the ancillary section provides the information needed to identify and interpret the others.

This section contains an array of the following structures. See sys/elf.h.

typedef struct {
        Elf32_Word      a_tag;
        union {
                Elf32_Word      a_val;
                Elf32_Addr      a_ptr;
        } a_un;
} Elf32_Ancillary;

typedef struct {
        Elf64_Xword     a_tag;
        union {
                Elf64_Xword     a_val;
                Elf64_Addr      a_ptr;
        } a_un;
} Elf64_Ancillary;
For each object with this type, a_tag controls the interpretation of a_un.
a_val
These objects represent integer values with various interpretations.

a_ptr
These objects represent file offsets or addresses.
The following ancillary tags exist.
Table 13-NEW1 ELF Ancillary Array Tags

Name                 Value  a_un

ANC_SUNW_NULL        0      Ignored
ANC_SUNW_CHECKSUM    1      a_val
ANC_SUNW_MEMBER      2      a_ptr

ANC_SUNW_NULL
Marks the end of the ancillary section.

ANC_SUNW_CHECKSUM
Provides the checksum for a file in the a_val element. When ANC_SUNW_CHECKSUM precedes the first instance of ANC_SUNW_MEMBER, it provides the checksum for the object from which the ancillary section is being read. When it follows an ANC_SUNW_MEMBER tag, it provides the checksum for that member.

ANC_SUNW_MEMBER
Specifies an object name. The a_ptr element contains the string table offset of a null-terminated string that provides the file name.
An ancillary section must always contain an ANC_SUNW_CHECKSUM before the first instance of ANC_SUNW_MEMBER, identifying the current object. Following that, there should be an ANC_SUNW_MEMBER for each object that makes up the complete set of objects. Each ANC_SUNW_MEMBER should be followed by an ANC_SUNW_CHECKSUM for that object. A typical ancillary section will therefore be structured as:

Tag                  Meaning

ANC_SUNW_CHECKSUM    Checksum of this object
ANC_SUNW_MEMBER      Name of object #1
ANC_SUNW_CHECKSUM    Checksum for object #1
.
.
.
ANC_SUNW_MEMBER      Name of object N
ANC_SUNW_CHECKSUM    Checksum for object N
ANC_SUNW_NULL        Marks the end of the array

An object can therefore identify itself by comparing the initial ANC_SUNW_CHECKSUM to each of the ones that follow, until it finds a match.
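
As an illustration (not shipping code), the following fragment identifies and lists the members of a group by walking the ancillary array in exactly this way. It assumes the ANC_SUNW_ tags and Elf64_Ancillary definition from sys/elf.h shown above, and that the section data and its associated string table (located via sh_link) have already been read into memory.

#include <sys/elf.h>
#include <stdio.h>

static void
print_group(Elf64_Ancillary *anc, const char *strs)
{
        Elf64_Xword     self;
        const char      *member = NULL;

        /* The first element is the checksum of the file being examined */
        self = anc->a_un.a_val;

        for (anc++; anc->a_tag != ANC_SUNW_NULL; anc++) {
                switch (anc->a_tag) {
                case ANC_SUNW_MEMBER:
                        member = strs + anc->a_un.a_ptr;
                        break;
                case ANC_SUNW_CHECKSUM:
                        /* Mark the member whose checksum matches ours */
                        (void) printf("%s %s (checksum 0x%llx)\n",
                            (anc->a_un.a_val == self) ? "-->" : "   ",
                            member, (unsigned long long)anc->a_un.a_val);
                        break;
                }
        }
}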

Related Other Work

The GNU developers have also encountered the need/desire to support separate debug information files, and use the solution detailed at http://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html.

At the current time, the separate debug file is constructed by building the standard object first, and then copying the debug data out of it in a separate post processing step. Hence, it is limited to a total of 4GB of code and debug data, just as a single object file would be. They are aware of this, and I have seen online comments indicating that they may add direct support for generating these separate files to their link-editor.

It is worth noting that the GNU objcopy utility is available on Solaris, and that the Studio dbx debugger is able to use these GNU style separate debug files even on Solaris. Although this is interesting in terms of giving Linux users a familiar environment on Solaris, the 4GB limit means it is not an answer to the problem of very large 32-bit objects. We have also encountered issues with objcopy not understanding Solaris-specific ELF sections, when using this approach.

The GNU community also has a current effort to adapt their DWARF debug sections in order to move them to separate files before passing the relocatable objects to the linker. The details of Project Fission can be found at http://gcc.gnu.org/wiki/DebugFission. The goal of this project appears to be to reduce the amount of data seen by the link-editor. The primary effort revolves around moving DWARF data to separate .dwo files so that the link-editor never encounters them. The details of modifying the DWARF data to be usable in this form are involved — please see the above URL for details.

Friday Nov 11, 2011

elffile: ELF Specific File Identification Utility

Solaris 11 has a new standard user level command, /usr/bin/elffile. elffile is a variant of the file utility that is focused exclusively on linker related files: ELF objects, archives, and runtime linker configuration files. All other files are simply identified as "non-ELF". The primary advantage of elffile over the existing file utility is in the area of archives — elffile examines the archive members and can produce a summary of the contents, or per-member details.

The impetus to add elffile to Solaris came from the effort to extend the format of Solaris archives so that they could grow beyond their previous 32-bit file limits. That work introduced a new archive symbol table format. Now that there was more than one possible format, I thought it would be useful if the file utility could identify which format a given archive uses, and so I extended it:

% cc -c ~/hello.c
% ar r foo.a hello.o 
% file foo.a
foo.a:          current ar archive, 32-bit symbol table
% ar r -S foo.a hello.o 
% file foo.a
foo.a:          current ar archive, 64-bit symbol table
In turn, this caused me to think about all the things that I would like the file utility to be able to tell me about an archive. In particular, I'd like to be able to know what's inside without having to unpack it. The end result of that train of thought was elffile.

Much of the discussion in this article is adapted from the PSARC case I filed for elffile in December 2010:

PSARC 2010/432 elffile

Why file Is No Good For Archives And Yet Should Not Be Fixed

The standard /usr/bin/file utility is not very useful when applied to archives. When identifying an archive, a user typically wants to know 2 things:

  1. Is this an archive?

  2. Presupposing that the archive contains objects, which is by far the most common use for archives, what platform are the objects for? Are they for sparc or x86? 32 or 64-bit? Some confusing combination from varying platforms?

The file utility provides a quick answer to question (1), as it identifies all archives as "current ar archive". It does nothing to answer the more interesting question (2). Answering that question requires a multi-step process:

  1. Extract all archive members

  2. Use the file utility on the extracted files, examine the output for each file in turn, and compare the results to generate a suitable summary description.

  3. Remove the extracted files

It should be easier and more efficient to answer such an obvious question.

It would be reasonable to extend the file utility to examine archive contents in place and produce a description. However, there are several reasons why I decided not to do so:

  • The correct design for this feature within the file utility would have file examine each archive member in turn, applying its full abilities to each member. This would be elegant, but also represents a rather dramatic redesign and re-implementation of file. Archives nearly always contain nothing but ELF objects for a single platform, so such generality in the file utility would be of little practical benefit.

  • It is best to avoid adding new options to standard utilities for which other implementations of interest exist. In the case of the file utility, one concern is that we might add an option which later appears in the GNU version of file with a different and incompatible meaning. Indeed, there have been discussions about replacing the Solaris file with the GNU version in the past. This may or may not be desirable, and may or may not ever happen. Either way, I don't want to preclude it.

  • Examining archive members is an O(n) operation, and can be relatively slow with large archives. The file utility is supposed to be a very fast operation.

I decided that extending file in this way is overkill, and that an investment in the file utility for better archive support would not be worth the cost. A solution that is more narrowly focused on ELF and other linker related files is really all that we need. The necessary code for doing this already exists within libelf. All that is missing is a small user-level wrapper to make that functionality available at the command line.
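
To give a sense of how little code is involved, the following sketch (illustrative only, and far simpler than the real elffile) uses libelf to walk an archive and classify each member. Note that special archive members, such as the symbol table, also show up in this traversal, and error handling is omitted for brevity.

#include <fcntl.h>
#include <gelf.h>
#include <libelf.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        int     fd;
        Elf     *ar, *elf;
        Elf_Cmd cmd = ELF_C_READ;

        if ((argc != 2) || (elf_version(EV_CURRENT) == EV_NONE) ||
            ((fd = open(argv[1], O_RDONLY)) < 0) ||
            ((ar = elf_begin(fd, ELF_C_READ, NULL)) == NULL))
                return (1);

        if (elf_kind(ar) != ELF_K_AR) {
                (void) printf("%s: %s\n", argv[1],
                    (elf_kind(ar) == ELF_K_ELF) ? "ELF" : "non-ELF");
                return (0);
        }

        /* Iterate over the archive members in place */
        while ((elf = elf_begin(fd, cmd, ar)) != NULL) {
                Elf_Arhdr       *arh = elf_getarhdr(elf);
                GElf_Ehdr       ehdr;

                if (gelf_getehdr(elf, &ehdr) != NULL)
                        (void) printf("%s(%s): ELF %d-bit, machine %d\n",
                            argv[1], arh->ar_name,
                            (gelf_getclass(elf) == ELFCLASS64) ? 64 : 32,
                            (int)ehdr.e_machine);
                else
                        (void) printf("%s(%s): non-ELF content\n",
                            argv[1], arh->ar_name);

                cmd = elf_next(elf);
                (void) elf_end(elf);
        }
        (void) elf_end(ar);
        return (0);
}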

In that vein, I considered adding an option for this to the elfdump utility. I examined elfdump carefully, and even wrote a prototype implementation. The added code is small and simple, but the conceptual fit with the rest of elfdump is poor. The result complicates elfdump syntax and documentation, definite signs that this functionality does not belong there.

And so, I added this functionality as a new user level command.

The elffile Command

The syntax for this new command is
    elffile [-s basic | detail | summary] filename...

Please see the elffile(1) manpage for additional details.

To demonstrate how output from elffile looks, I will use the following files:

File           Description

config         A runtime linker configuration file produced with crle
dwarf.o        An ELF object
/etc/passwd    A text file
mixed.a        Archive containing a mixture of ELF and non-ELF members
mixed_elf.a    Archive containing ELF objects for different machines
not_elf.a      Archive containing no ELF objects
same_elf.a     Archive containing a collection of ELF objects for the
               same machine. This is the most common type of archive.
The file utility identifies these files as follows:
% file config dwarf.o /etc/passwd mixed.a mixed_elf.a not_elf.a same_elf.a
config:         Runtime Linking Configuration 64-bit MSB SPARCV9
dwarf.o:        ELF 64-bit LSB relocatable AMD64 Version 1
/etc/passwd:    ascii text
mixed.a:        current ar archive, 32-bit symbol table
mixed_elf.a:    current ar archive, 32-bit symbol table
not_elf.a:      current ar archive
same_elf.a:     current ar archive, 32-bit symbol table
By default, elffile uses its "summary" output style. This output differs from the output from the file utility in 2 significant ways:

  1. Files that are not an ELF object, archive, or runtime linker configuration file are identified as "non-ELF", whereas the file utility attempts further identification for such files.

  2. When applied to an archive, the elffile output includes a description of the archive's contents, without requiring member extraction or other additional steps.
Applying elffile to the above files:
% elffile config dwarf.o /etc/passwd mixed.a mixed_elf.a not_elf.a same_elf.a
config: Runtime Linking Configuration 64-bit MSB SPARCV9
dwarf.o: ELF 64-bit LSB relocatable AMD64 Version 1
/etc/passwd: non-ELF
mixed.a: current ar archive, 32-bit symbol table, mixed ELF and non-ELF content
mixed_elf.a: current ar archive, 32-bit symbol table, mixed ELF content
not_elf.a: current ar archive, non-ELF content
same_elf.a: current ar archive, 32-bit symbol table, ELF 64-bit LSB relocatable AMD64 Version 1
The output for same_elf.a is of particular interest: The vast majority of archives contain only ELF objects for a single platform, and in this case, the default output from elffile answers both of the questions about archives posed at the beginning of this discussion, in a single efficient step. This makes elffile considerably more useful than file, within the realm of linker-related files.

elffile can produce output in two other styles, "basic" and "detail". The basic style produces output that is the same as that from 'file', for linker-related files. The detail style produces per-member identification of archive contents. This can be useful when the archive contents are not homogeneous ELF objects, and more information is desired than the summary output provides:

% elffile -s detail mixed.a     
mixed.a: current ar archive, 32-bit symbol table
mixed.a(dwarf.o): ELF 32-bit LSB relocatable 80386 Version 1
mixed.a(main.c): non-ELF content
mixed.a(main.o): ELF 64-bit LSB relocatable AMD64 Version 1 [SSE]

The Stub Proto: Not Just For Stub Objects Anymore

One of the great pleasures of programming is to invent something for a narrow purpose, and then to realize that it is a general solution to a broader problem. In hindsight, these things seem perfectly natural and obvious. The stub proto area used to build the core Solaris consolidation has turned out to be one of those things.

As discussed in an earlier article, the stub proto area was invented as part of the effort to use stub objects to build the core ON consolidation. Its purpose was merely as a place to hold stub objects. However, we keep finding other uses for it. It turns out that the stub proto should more properly be thought of as an auxiliary place to put things that we would like to put into the proto to help us build the product, but which we do not wish to package or deliver to the end user. Stub objects are one example, but private lint libraries, header files, archives, and relocatable objects are all examples of things that might profitably go into the stub proto.

Without a stub proto, these items were handled in a variety of ad hoc ways:

  • If one part of the workspace needed private header files, libraries, or other such items, it might modify its Makefile to reach up and over to the place in the workspace where those things live and use them from there. There are several problems with this:

    • Each component invents its own approach, meaning that programmers maintaining the system have to invest extra effort to understand what things mean. In the past, this has created makefile ghettos in which only the person who wrote the makefiles feels confident to modify them, while everyone else ignores them. This causes many difficulties and benefits no one.

    • These interdependencies are not obvious to the make utility, and can lead to races.

    • They are not obvious to the human reader, who may therefore not realize that they exist, and break them.

  • Our policy in ON is not to deliver files into the proto unless those files are intended to be packaged and delivered to the end user. However, sometimes non-shipping files were copied into the proto anyway, causing a different set of problems:

    • It requires a long list of exceptions to silence our normal unused proto item error checking.

    • In the past, we have accidentally shipped files that we did not intend to deliver to the end user.

    • Mixing cruft with valuable items makes it hard to discern which is which.

The stub proto area offers a convenient and robust solution. Files needed to build the workspace that are not delivered to the end user can instead be installed into the stub proto. No special exceptions or custom make rules are needed, and the intent is always clear.

We are already accessing some private lint libraries and compilation symlinks in this manner. Ultimately, I'd like to see all of the files in the proto that have a packaging exception delivered to the stub proto instead, and for the elimination of all existing special case makefile rules. This would include shared objects, header files, and lint libraries. I don't expect this to happen overnight — it will be a long term case by case project, but the overall trend is clear.

The Stub Proto, -z assert_deflib, And The End Of Accidental System Object Linking

We recently used the stub proto to solve an annoying build issue that goes back to the earliest days of Solaris: How to ensure that we're linking to the OS bits we're building instead of to those from the running system.

The Solaris product is made up of objects and files from a number of different consolidations, each of which is built separately from the others from an independent code base called a gate. The core Solaris OS consolidation is ON, which stands for "Operating System and Networking". You will frequently also see ON called the OSnet. There are consolidations for X11 graphics, the desktop environment, open source utilities, compilers and development tools, and many others. The collection of consolidations that make up Solaris is known as the "Wad Of Stuff", usually referred to simply as the WOS. None of these consolidations is self contained. Even the core ON consolidation has some dependencies on libraries that come from other consolidations.

The build server used to build the OSnet must be running a relatively recent version of Solaris, which means that its objects will be very similar to the new ones being built. However, it is necessarily true that the build system objects will always be a little behind, and that incompatible differences may exist.

The objects built by the OSnet link to other objects. Some of these dependencies come from the OSnet, while others come from other consolidations. The objects from other consolidations are provided by the standard library directories on the build system (/lib, /usr/lib). The objects from the OSnet itself are supposed to come from the proto areas in the workspace, and not from the build server. In order to achieve this, we make use of the -L command line option to the link-editor.

The link-editor finds dependencies by looking in the directories specified by the caller using the -L command line option. If the desired dependency is not found in one of these locations, ld will then fall back to looking at the default locations (/lib, /usr/lib). In order to use OSnet objects from the workspace instead of the system, while still accessing non-OSnet objects from the system, our Makefiles set -L link-editor options that point at the workspace proto areas. In general, this works well and dependencies are found in the right places.

However, there have always been failures:

  1. Building objects in the wrong order might mean that an OSnet dependency hasn't been built before an object that needs it. If so, the dependency will not be seen in the proto, and the link-editor will silently fall back to the one on the build server.

  2. Errors in the makefiles can wipe out the -L options that our top level makefiles establish to cause ld to look at the workspace proto first. In this case, all objects will be found on the build server.
These failures were rarely if ever caught. As I mentioned earlier, the objects on the build server are generally quite close to the objects built in the workspace. If they offer compatible linking interfaces, then the objects that link to them will behave properly, and no issue will ever be seen. However, if they do not offer compatible linking interfaces, the failure modes can be puzzling and hard to pin down. Either way, there won't be a compile-time warning or error.

The advent of the stub proto eliminated the first type of failure. With stub objects, there is no dependency ordering, and the necessary stub object dependency will always be in place for any OSnet object that needs it. However, makefile errors do still occur, and so, the second form of error was still possible.

While working on the stub object project, we realized that the stub proto was also the key to solving the second form of failure caused by makefile errors:

  1. Due to the way we set the -L options to point at our workspace proto areas, any valid object from the OSnet should be found via a path specified by -L, and not from the default locations (/lib, /usr/lib). Any OSnet object found via the default locations means that we've linked to the build server, which is an error we'd like to catch.

  2. Non-OSnet objects don't exist in the proto areas, and so are found via the default paths. However, if we were to create a symlink in the stub proto pointing at each non-OSnet dependency that we require, then the non-OSnet objects would also be found via the paths specified by -L, and not from the link-editor defaults.

  3. Given the above, we should not find any dependency objects from the link-editor defaults. Any dependency found via the link-editor defaults means that we have a Makefile error, and that we are linking to the build server inappropriately. All we need to make use of this fact is a linker option to produce a warning when it happens.

  4. Although warnings are nice, we in the OSnet have a zero tolerance policy for build noise. The -z fatal-warnings option that was recently introduced with -z guidance can be used to turn the warnings into fatal build errors, forcing the programmer to fix them.

This was too easy to resist. I integrated

7021198 ld option to warn when link accesses a library via default path
PSARC/2011/068 ld -z assert-deflib option
into snv_161 (February 2011), shortly after the stub proto was introduced into ON. This putback introduced the -z assert-deflib option to the link-editor:
-z assert-deflib=[libname]
Enables warning messages for libraries specified with the -l command line option that are found by examining the default search paths provided by the link-editor. If a libname value is provided, the default library warning feature is enabled, and the specified library is added to a list of libraries for which no warnings will be issued. Multiple -z assert-deflib options can be specified in order to specify multiple libraries for which warnings should not be issued.

The libname value should be the name of the library file, as found by the link-editor, without any path components. For example, the following enables default library warnings, and excludes the standard C library.

           ld ... -z assert-deflib=libc.so ...
-z assert-deflib is a specialized option, primarily of interest in build environments where multiple objects with the same name exist and tight control over the library used is required. It is not intended for general use.
Note that the definition of -z assert-deflib allows for exceptions to be specified as arguments to the option. In general, the idea of using a symlink from the stub proto is superior because it does not clutter up the link command with a long list of objects. When building the OSnet, we usually use the plain form of -z assert-deflib, and make symlinks for the non-OSnet dependencies. The exceptions to this are dependencies supplied by the compiler itself, which are usually found at whatever arbitrary location the compiler happens to be installed at. To handle these special cases, the command line version works better.

Following the integration of the link-editor change, I made use of -z assert-deflib in OSnet builds with

7021896 Prevent OSnet from accidentally linking to build system
which integrated into snv_162 (March 2011). Turning on -z assert-deflib exposed between 10 and 20 existing errors in our Makefiles, which were all fixed in the same putback. The errors we found in our Makefiles underscore how difficult such errors are to prevent without an automatic system in place to catch them.

Conclusions

The stub proto is proving to be a generally useful construct for ON builds that goes beyond serving as a place to hold stub objects. Although invented to hold stub objects, it has already allowed us to simplify a number of previously difficult situations in our makefiles and builds. I expect that we'll find uses for it beyond those described here as we go forward.

Using Stub Objects

Having told the long and winding tale of where stub objects came from and how we use them to build Solaris, I'd like to focus now on the nuts and bolts of building and using them. The following new features were added to the Solaris link-editor (ld) to support the production and use of stub objects:

-z stub
This new command line option informs ld that it is to build a stub object rather than a normal object. In this mode, it accepts the same command line arguments as usual, but will quietly ignore any objects and sharable object dependencies.

STUB_OBJECT Mapfile Directive
In order to build a stub version of an object, its mapfile must specify the STUB_OBJECT directive. When producing a non-stub object, the presence of STUB_OBJECT causes the link-editor to perform extra validation to ensure that the stub and non-stub objects will be compatible.

ASSERT Mapfile Directive
All data symbols exported from the object must have an ASSERT symbol directive in the mapfile that declares them as data and supplies the size, binding, bss attributes, and symbol aliasing details. When building the stub objects, the information in these ASSERT directives is used to create the data symbols. When building the real object, these ASSERT directives will ensure that the real object matches the linking interface presented by the stub.

Although ASSERT was added to the link-editor in order to support stub objects, it is a general purpose feature that can be used independently of stub objects. For instance, you might choose to use an ASSERT directive if you have a symbol that must have a specific address in order for the object to operate properly, and you want to automatically ensure that this is always the case.

The material presented here is derived from a document I originally wrote during the development effort, which had the dual goals of providing supplemental materials for the stub object PSARC case, and as a set of edits that were eventually applied to the Oracle Solaris Linker and Libraries Manual (LLM). The Solaris 11 LLM contains this information in a more polished form.

Stub Objects

A stub object is a shared object, built entirely from mapfiles, that supplies the same linking interface as the real object, while containing no code or data. Stub objects cannot be used at runtime. However, an application can be built against a stub object, which supplies the name of the real object to be used at runtime, and then run using that real object.

When building a stub object, the link-editor ignores any object or library files specified on the command line, and these files need not exist in order to build a stub. Since the compilation step can be omitted, and because the link-editor has relatively little work to do, stub objects can be built very quickly.

Stub objects can be used to solve a variety of build problems:

Speed
Modern machines, using a version of make with the ability to parallelize operations, are capable of compiling and linking many objects simultaneously, and doing so offers significant speedups. However, it is typical that a given object will depend on other objects, and that there will be a core set of objects that nearly everything else depends on. It is necessary to impose an ordering that builds each object before any other object that requires it. This ordering creates bottlenecks that reduce the amount of parallelization that is possible and limits the overall speed at which the code can be built.

Complexity/Correctness
In a large body of code, there can be a large number of dependencies between the various objects. The makefiles or other build descriptions for these objects can become very complex and difficult to understand or maintain. The dependencies can change as the system evolves. This can cause a given set of makefiles to become slightly incorrect over time, leading to race conditions and mysterious rare build failures.

Dependency Cycles
It might be desirable to organize code as cooperating shared objects, each of which draws on the resources provided by the others. Such cycles cannot be supported in an environment where objects must be built before the objects that use them, even though the runtime linker is fully capable of loading and using such objects if they could be built.
Stub shared objects offer an alternative method for building code that sidesteps the above issues. Stub objects can be quickly built for all the shared objects produced by the build. Then, all the real shared objects and executables can be built in parallel, in any order, using the stub objects to stand in for the real objects at link-time. Afterwards, the executables and real shared objects are kept, and the stub shared objects are discarded.

Stub objects are built from a mapfile, which must satisfy the following requirements.

  • The mapfile must specify the STUB_OBJECT directive. This directive informs the link-editor that the object can be built as a stub object, and as such causes the link-editor to perform validation and sanity checking intended to guarantee that an object and its stub will always provide identical linking interfaces.

  • All function and data symbols that make up the external interface to the object must be explicitly listed in the mapfile.

  • The mapfile must use symbol scope reduction ('*'), to remove any symbols not explicitly listed from the external interface.

  • All global data exported from the object must have an ASSERT symbol attribute in the mapfile to specify the symbol type, size, and bss attributes. In the case where there are multiple symbols that reference the same data, the ASSERT for one of these symbols must specify the TYPE and SIZE attributes, while the others must use the ALIAS attribute to reference this primary symbol.

Given such a mapfile, the stub and real versions of the shared object can be built using the same command line for each, adding the '-z stub' option to the link for the stub object, and omitting the option from the link for the real object.

To demonstrate these ideas, the following code implements a shared object named idx5, which exports data from a 5 element array of integers, with each element initialized to contain its zero-based array index. This data is available as a global array, via an alias symbol with weak binding, and via a functional interface.

% cat idx5.c
int _idx5[5] = { 0, 1, 2, 3, 4 };
#pragma weak idx5 = _idx5

int
idx5_func(int index)
{
        if ((index < 0) || (index > 4))
                return (-1);
        return (_idx5[index]);
}
A mapfile is required to describe the interface provided by this shared object.
% cat mapfile
$mapfile_version 2
STUB_OBJECT;
SYMBOL_SCOPE {
        _idx5   {
                        ASSERT { TYPE=data; SIZE=4[5] };
                };
        idx5    {
                        ASSERT { BINDING=weak; ALIAS=_idx5 };
                };
        idx5_func;
    local:
        *;
};
The following main program is used to print all the index values available from the idx5 shared object.
% cat main.c
#include <stdio.h>

extern int      _idx5[5], idx5[5], idx5_func(int);

int
main(int argc, char **argv)
{
        int     i;
        for (i = 0; i < 5; i++)
                (void) printf("[%d] %d %d %d\n",
                    i, _idx5[i], idx5[i], idx5_func(i));
        return (0);
}
The following commands create a stub version of this shared object in a subdirectory named stublib. elfdump is used to verify that the resulting object is a stub. The command used to build the stub differs from that of the real object only in the addition of the -z stub option, and the use of a different output file name. This demonstrates the ease with which stub generation can be added to an existing makefile.
% cc -Kpic -G -M mapfile -h libidx5.so.1 idx5.c -o stublib/libidx5.so.1 -zstub
% ln -s libidx5.so.1 stublib/libidx5.so
% elfdump -d stublib/libidx5.so | grep STUB
      [11]  FLAGS_1           0x4000000           [ STUB ]
The main program can now be built, using the stub object to stand in for the real shared object, and setting a runpath that will find the real object at runtime. However, as we have not yet built the real object, this program cannot yet be run. Attempts to cause the system to load the stub object are rejected, as the runtime linker knows that stub objects lack the actual code and data found in the real object, and cannot execute.
% cc main.c -L stublib -R '$ORIGIN/lib' -lidx5 -lc
% ./a.out
ld.so.1: a.out: fatal: libidx5.so.1: open failed: No such file or directory
Killed
% LD_PRELOAD=stublib/libidx5.so.1 ./a.out
ld.so.1: a.out: fatal: stublib/libidx5.so.1: stub shared object cannot be used at runtime
Killed
We build the real object using the same command as we used to build the stub, omitting the -z stub option, and writing the results to a different file.
% cc -Kpic -G -M mapfile -h libidx5.so.1 idx5.c -o lib/libidx5.so.1

Once the real object has been built in the lib subdirectory, the program can be run.

% ./a.out
[0] 0 0 0
[1] 1 1 1
[2] 2 2 2
[3] 3 3 3
[4] 4 4 4

Mapfile Changes

The version 2 mapfile syntax was extended in a number of places to accommodate stub objects.
Conditional Input
The version 2 mapfile syntax has the ability to conditionalize mapfile input using the $if control directive. As you might imagine, these directives are used frequently with ASSERT directives for data, because a given data symbol will frequently have a different size in 32-bit and 64-bit code, or on differing hardware such as x86 versus sparc.

The link-editor maintains an internal table of names that can be used in the logical expressions evaluated by $if and $elif. At startup, this table is initialized with items that describe the class of object (_ELF32 or _ELF64) and the type of the target machine (_sparc or _x86). We found that there were a small number of cases in the Solaris code base in which we needed to know what kind of object we were producing, so we added the following new predefined items in order to address that need:

Name        Meaning
...
_ET_DYN     shared object
_ET_EXEC    executable object
_ET_REL     relocatable object
...
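
For instance, a sketch of how one of these items might be used (a hypothetical fragment; the expression and comment are illustrative):

$if _ELF64 && _ET_DYN
        # mapfile input that applies only to 64-bit shared objects
$endif
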
STUB_OBJECT Directive
The new STUB_OBJECT directive informs the link-editor that the object described by the mapfile can be built as a stub object.
STUB_OBJECT;
A stub shared object is built entirely from the information in the mapfiles supplied on the command line. When the -z stub option is specified to build a stub object, the presence of the STUB_OBJECT directive in a mapfile is required, and the link-editor uses the information in symbol ASSERT attributes to create global symbols that match those of the real object.

When the real object is built, the presence of STUB_OBJECT causes the link-editor to verify that the mapfiles accurately describe the real object interface, and that a stub object built from them will provide the same linking interface as the real object it represents.

  • All function and data symbols that make up the external interface to the object must be explicitly listed in the mapfile.

  • The mapfile must use symbol scope reduction ('*'), to remove any symbols not explicitly listed from the external interface.

  • All global data in the object is required to have an ASSERT attribute that specifies the symbol type and size.

  • If the ASSERT BINDING attribute is not present, the link-editor provides a default assertion that the symbol must be GLOBAL.

  • If the ASSERT SH_ATTR attribute is not present, or does not specify that the section is one of BITS or NOBITS, the link-editor provides a default assertion that the associated section is BITS.

  • All data symbols that describe the same address and size are required to have ASSERT ALIAS attributes specified in the mapfile. If aliased symbols are discovered that do not have an ASSERT ALIAS specified, the link fails and no object is produced.

These rules ensure that the mapfiles contain a description of the real shared object's linking interface that is sufficient to produce a stub object with a completely compatible linking interface.

SYMBOL_SCOPE/SYMBOL_VERSION ASSERT Attribute
The SYMBOL_SCOPE and SYMBOL_VERSION mapfile directives were extended with a symbol attribute named ASSERT. The syntax for the ASSERT attribute is as follows:
ASSERT {
	ALIAS = symbol_name;
	BINDING = symbol_binding;
	TYPE = symbol_type;
	SH_ATTR = section_attributes;

	SIZE = size_value;
	SIZE = size_value[count];
};

The ASSERT attribute is used to specify the expected characteristics of the symbol. The link-editor compares the symbol characteristics that result from the link to those given by ASSERT attributes. If the real and asserted attributes do not agree, a fatal error is issued and the output object is not created.

In normal use, the link-editor evaluates ASSERT attributes when they are present, but neither requires them nor provides default values for them. The presence of the STUB_OBJECT directive in a mapfile alters the interpretation of ASSERT, requiring assertions under some circumstances, and supplying default assertions if explicit ones are not present. See the definition of the STUB_OBJECT Directive above for the details.

When the -z stub command line option is specified to build a stub object, the information provided by ASSERT attributes is used to define the attributes of the global symbols provided by the object.

ASSERT accepts the following:

ALIAS
Name of a previously defined symbol that this symbol is an alias for. An alias symbol has the same type, value, and size as the main symbol. The ALIAS attribute is mutually exclusive with the TYPE, SIZE, and SH_ATTR attributes; when ALIAS is specified, the type, size, and section attributes are obtained from the aliased symbol.

BINDING
Specifies an ELF symbol binding, which can be any of the STB_ constants defined in <sys/elf.h>, with the STB_ prefix removed (e.g. GLOBAL, WEAK).

TYPE
Specifies an ELF symbol type, which can be any of the STT_ constants defined in <sys/elf.h>, with the STT_ prefix removed (e.g. OBJECT, COMMON, FUNC). In addition, for compatibility with other mapfile usage, FUNCTION and DATA can be specified, for STT_FUNC and STT_OBJECT, respectively. TYPE is mutually exclusive with ALIAS.

SH_ATTR
Specifies attributes of the section associated with the symbol. The section_attributes that can be specified are given in the following table:
Section Attribute    Meaning
BITS                 Section is not of type SHT_NOBITS
NOBITS               Section is of type SHT_NOBITS
SH_ATTR is mutually exclusive with ALIAS.
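
For example, a sketch of how a zero-initialized (bss) variable might be asserted (the symbol name and size are hypothetical):

        buf     { ASSERT { TYPE=data; SH_ATTR=NOBITS; SIZE=1024 } };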

SIZE
Specifies the expected symbol size. SIZE is mutually exclusive with ALIAS. The syntax for the size_value argument is as described in the discussion of the SIZE attribute below.
SIZE
The SIZE symbol attribute existed before support for stub objects was introduced. It is used to set the size attribute of a given symbol. This attribute results in the creation of a symbol definition.

Prior to the introduction of the ASSERT SIZE attribute, the value of a SIZE attribute was always numeric. While attempting to apply ASSERT SIZE to the objects in the Solaris ON consolidation, I found that many data symbols have a size based on the natural machine wordsize for the class of object being produced. Variables declared as long, or as a pointer, will be 4 bytes in size in a 32-bit object, and 8 bytes in a 64-bit object. Initially, I employed the conditional $if directive to handle these cases as follows:

$if _ELF32
        foo     { ASSERT { TYPE=data; SIZE=4 } };
        bar     { ASSERT { TYPE=data; SIZE=20 } };
$elif _ELF64
        foo     { ASSERT { TYPE=data; SIZE=8 } };
        bar     { ASSERT { TYPE=data; SIZE=40 } };
$else
$error UNKNOWN ELFCLASS
$endif

This situation occurs frequently enough to make the conditional approach cumbersome. To simplify this case, I introduced the idea of the addrsize symbolic name, and of a repeat count, which together make it simple to specify machine word scalar or array symbols. Both the SIZE and ASSERT SIZE attributes support this syntax:

The size_value argument can be a numeric value, or it can be the symbolic name addrsize. addrsize represents the size of a machine word capable of holding a memory address. The link-editor substitutes the value 4 for addrsize when building 32-bit objects, and the value 8 when building 64-bit objects. addrsize is useful for representing the size of pointer variables and C variables of type long, as it automatically adjusts for 32 and 64-bit objects without requiring the use of conditional input.

The size_value argument can be optionally suffixed with a count value, enclosed in square brackets. If count is present, size_value and count are multiplied together to obtain the final size value.

Using this feature, the example above can be written more naturally as:
        foo     { ASSERT { TYPE=data; SIZE=addrsize } };
        bar     { ASSERT { TYPE=data; SIZE=addrsize[5] } };

Exported Global Data Is Still A Bad Idea

As you can see, the additional plumbing added to the Solaris link-editor to support stub objects is minimal. Furthermore, about 90% of that plumbing is dedicated to handling global data.

We have long advised against global data exported from shared objects. There are many ways in which global data does not fit well with dynamic linking. Stub objects simply provide one more reason to avoid this practice. It is always better to hide your data, and to export it via a functional interface: a function that your users can call to acquire the address of the data item. However, if you do have to support global data for a stub, perhaps because you are working with an already existing object, it is still easily done, as shown above.
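
For instance, a minimal sketch of this practice (hypothetical names; contrast with the exported _idx5 array above):

% cat counters.c
static int counters[5];         /* private: never appears in the dynamic symbol table */

int *
get_counters(void)              /* exported accessor, listed in the mapfile */
{
        return (counters);
}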

Oracle does not like us to discuss hypothetical new features that don't exist in shipping product, so I'll end this section with a speculation. It might be possible to do more in this area to ease the difficulty of dealing with objects that have global data that the users of the library don't need. Perhaps someday...

Conclusions

It is easy to create stub objects for most objects. If your library only exports function symbols, all you have to do to build a faithful stub object is to add
STUB_OBJECT;
to your mapfile, and then use the same link command you're currently using, with the addition of the -z stub option.
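
A minimal sketch of such a mapfile, assuming a library that exports two hypothetical functions:

% cat mapfile
$mapfile_version 2
STUB_OBJECT;
SYMBOL_SCOPE {
        do_this;
        do_that;
    local:
        *;
};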

Happy Stubbing!

Much Ado About Nothing: Stub Objects

The Solaris 11 link-editor (ld) contains support for a new type of object that we call a stub object. A stub object is a shared object, built entirely from mapfiles, that supplies the same linking interface as the real object, while containing no code or data. Stub objects cannot be executed — the runtime linker will kill any process that attempts to load one. However, you can link to a stub object as a dependency, allowing the stub to act as a proxy for the real version of the object.

You may well wonder if there is a point to producing an object that contains nothing but linking interface. As it turns out, stub objects are very useful for building large bodies of code such as Solaris. In the last year, we've had considerable success in applying them to one of our oldest and thorniest build problems. In this discussion, I will describe how we came to invent these objects, and how we apply them to building Solaris.

This posting explains where the idea for stub objects came from, and details our long and twisty journey from hallway idea to standard link-editor feature. I expect that these details are mainly of interest to those who work on Solaris and its makefiles, those who have done so in the past, and those who work with other similar bodies of code. A subsequent posting will omit the history and background details, and instead discuss how to build and use stub objects. If you are mainly interested in what stub objects are, and don't care about the underlying software war stories, I encourage you to skip ahead.

The Long Road To Stubs

This all started for me with an email discussion in May of 2008, regarding a change request that was filed in 2002, entitled:
4631488 lib/Makefile is too patient: .WAITs should be reduced
This CR encapsulates a number of chronic issues with Solaris builds:

  • We build Solaris with a parallel make (dmake) that tries to build as much of the code base in parallel as possible. There is a lot of code to build, and we've long made use of parallelized builds to get the job done quicker. This is even more important in today's world of massively multicore hardware.

  • Solaris contains a large number of executables and shared objects. Executables depend on shared objects, and shared objects can depend on each other. Before you can build an object, you need to ensure that the objects it needs have been built. This implies a need for serialization, which is in direct opposition to the desire to build everything in parallel.

  • To accurately build objects in the right order requires an accurate set of make rules defining the things that depend on each other. This sounds simple, but the reality is quite complex. In practice, having programmers explicitly specify these dependencies is a losing strategy:

    • It's really hard to get right.

    • It's really easy to get it wrong and never know it because things build anyway.

    • Even if you get it right, it won't stay that way, because dependencies between objects can change over time, and make cannot help you detect such drifting.

    • You won't know that you got it wrong until the builds break. That can be a long time after the change that triggered the breakage happened, making it hard to connect the cause and the effect. Usually this happens just before a release, when the pressure is on, it is hard to think calmly, and there is no time for deep fixes.

  • As a poor compromise, the libraries in core Solaris were built using a set of grossly incomplete hand written rules, supplemented with a number of dmake .WAIT directives used to partition the libraries into non-interacting groups that can be built in parallel because we think they don't depend on each other.

  • From time to time, someone will suggest that we could analyze the built objects themselves to determine their dependencies and then generate make rules based on those relationships. This is possible, but there are complications that limit the usefulness of that approach:

    • To analyze an object, you have to build it first. This is a classic chicken and egg scenario.

    • You could analyze the results of a previous build, but then you're not necessarily going to get accurate rules for the current code.

    • It should be possible to build the code without having a built workspace available.

    • The analysis will take time, and remember that we're constantly trying to make builds faster, not slower.

    By definition, such an approach will always be approximate, and therefore only incrementally more accurate than the hand written rules described above. The hand written rules are fast and cheap, while this idea is slow and complex, so we stayed with the hand written approach.

Solaris was built that way, essentially forever, because these are genuinely difficult problems that had no easy answer. The makefiles were full of build races in which the right outcomes happened reliably for years until a new machine or a change in build server workload upset the accidental balance of things. After figuring out what had happened, you'd mutter "How did that ever work?", add another incomplete and soon to be inaccurate make dependency rule to the system, and move on. This was not a satisfying solution, as we tend to be perfectionists in the Solaris group, but we didn't have a better answer. It worked well enough, approximately.

And so it went for years. We needed a different approach — a new idea to cut the Gordian Knot.

In that discussion from May 2008, my fellow linker-alien Rod Evans had the initial spark that led us to a game changing series of realizations:

  • The link-editor is used to link objects together, but it only uses the ELF metadata in the object, consisting of symbol tables, ELF versioning sections, and similar data. Notably, it does not look at, or understand, the machine code that makes an object useful at runtime.

  • If you had an object that only contained the ELF metadata for a dependency, but not the code or data, the link-editor would find it equally useful for linking, and would never know the difference. Call it a stub object.

  • In the core Solaris OS, we require all objects to be built with a link-editor mapfile that describes all of its publicly available functions and data. Could we build a stub object using the mapfile for the real object?

  • It ought to be very fast to build stub objects, as there are no input objects to process.

  • Unlike the real object, stub objects would not actually require any dependencies, and so, all of the stubs for the entire system could be built in parallel.

  • When building the real objects, one could link against the stub objects instead of the real dependencies. This means that all the real objects can be built in parallel too, without any serialization. We could replace a system that requires perfect makefile rules with a system that requires no ordering rules whatsoever. The results would be considerably more robust.
We immediately realized that this idea had potential, but also that there were many details to sort out, lots of work to do, and that perhaps it wouldn't really pan out. As is often the case, it would be necessary to do the work and see how it turned out.

Following that conversation, I set about trying to build a stub object. We determined that a faithful stub has to do the following:

  • Present the same set of global symbols, with the same ELF versioning, as the real object.

  • Functions are simple — it suffices to have a symbol of the right type, possibly, but not necessarily, referencing a null function in its text segment.

  • Copy relocations make data more complicated to stub. The possibility of a copy relocation means that when you create a stub, the data symbols must have the actual size of the real data. Any error in this will go uncaught at link time, and will cause tragic failures at runtime that are very hard to diagnose.

  • For reasons too obscure to go into here, involving tentative symbols, it is also important that the data reside in bss, or not, matching its placement in the real object.

  • If the real object has more than one symbol pointing at the same data item, we call these aliased symbols. All data symbols in the stub object must exhibit the same aliasing as the real object.

We imagined the stub library feature working as follows:

  • A command line option to ld tells it to produce a stub rather than a real object. In this mode, only mapfiles are examined, and any object files or shared libraries on the command line are ignored.

  • The extra information needed (function or data, size, and bss details) would be added to the mapfile.

  • When building the real object instead of the stub, the extra information for building stubs would be validated against the resulting object to ensure that they match.
In exploring these ideas, I immediately ran headfirst into the reality of the original mapfile syntax, a subject that I would later write about as The Problem(s) With Solaris SVR4 Link-Editor Mapfiles. The idea of extending that poor language was a non-starter. Until a better mapfile syntax became available, which seemed unlikely in 2008, the solution could not involve extensions to the mapfile syntax.

Instead, we cooked up the idea (hack) of augmenting mapfiles with stylized comments that would carry the necessary information. A typical definition might look like:

# DATA(i386)              __iob                   0x3c0
# DATA(amd64,sparcv9)     __iob                   0xa00
# DATA(sparc)             __iob                   0x140
__iob;

A further problem then became clear: If we can't extend the mapfile syntax, then there's no good way to extend ld with an option to produce stub objects, and to validate them against the real objects. The idea of having ld read comments in a mapfile and parse them for content is an unacceptable hack. The entire point of comments is that they are strictly for the human reader, and explicitly ignored by the tool.

Taking all of these speed bumps into account, I made a new plan:

  • A perl script reads the mapfiles, generates some small C glue code to produce empty functions and data definitions, compiles and links the stub object from the generated glue code, and then deletes the generated glue code.

  • Another perl script, used after both objects have been built, compares the real and stub objects using data from elfdump, and validates that they present the same linking interface.
By June 2008, I had written the above, and generated a stub object for libc. It was a useful prototype process to go through, and it allowed me to explore the ideas at a deep level. Ultimately though, the result was unsatisfactory as a basis for real product. There were so many issues:

  • The use of stylized comments was fine for a prototype, but not close to professional enough for shipping product. The idea of having to document and support it was a large concern.

  • The ideal solution for stub objects really does involve having the link-editor accept the same arguments used to build the real object, augmented with a single extra command line option. Any other solution, such as our prototype script, will require makefiles to be modified in deeper ways to support building stubs, and so, will raise barriers to converting existing code.

  • A validation script that rederives what the linker knew when it built an object will always be at a disadvantage relative to the actual linker that did the work.

  • A stub object should be identifiable as such. In the prototype, there was no tag or other metadata that would let you know that they weren't real objects. Being able to identify a stub object in this way means that the file command can tell you what it is, and that the runtime linker can refuse to try and run a program that loads one.
At that point, we needed to apply this prototype to building Solaris. As you might imagine, modifying all the makefiles in the core Solaris code base to do this is a massive task, and not something you'd enter into lightly. The quality of the prototype just wasn't good enough to justify that sort of time commitment, so I tabled the project, putting it on my list of long term things to think about, and moved on to other work. It would sit there for a couple of years.

Semi-coincidentally, one of the projects I tackled after that was to create a new mapfile syntax for the Solaris link-editor. We had wanted to do something about the old mapfile syntax for many years. Others before me had done some paper designs, and a great deal of thought had already gone into the features it should, and should not have, but for various reasons things had never moved beyond the idea stage. When I joined Sun in late 2005, I got involved in reviewing those things and thinking about the problem. Now in 2008, fresh from relearning for the Nth time why the old mapfile syntax was a huge impediment to linker progress, it seemed like the right time to tackle the mapfile issue. Paving the way for proper stub object support was not the driving force behind that effort, but I certainly had stub objects in mind as I moved forward.

The new mapfile syntax, which we call version 2, integrated into Nevada build snv_135 in February 2010:

6916788 ld version 2 mapfile syntax
PSARC/2009/688 Human readable and extensible ld mapfile syntax

In order to prove that the new mapfile syntax was adequate for general purpose use, I had also done an overhaul of the ON consolidation to convert all mapfiles to use the new syntax, and put checks in place that would ensure that no use of the old syntax would creep back in. That work went back into snv_144 in June 2010:

6916796 OSnet mapfiles should use version 2 link-editor syntax
That was a big putback, modifying 517 files, adding 18 new files, and removing 110 old ones.

I would have done this putback anyway, as the work was already done, and the benefits of human readable syntax are obvious. However, among the justifications listed in CR 6916796 was this:

We anticipate adding additional features to the new mapfile language that will be applicable to ON, and which will require all sharable object mapfiles to use the new syntax.
I never explained what those additional features were, and no one asked. It was premature to say so, but this was a reference to stub objects. By that point, I had already put together a working prototype link-editor with the necessary support for stub objects. I was pleased to find that building stubs was indeed very fast. On my desktop system (Ultra 24), an amd64 stub for libc can be built in a fraction of a second:
    % ptime ld -64 -z stub -o stubs/libc.so.1 -G -hlibc.so.1 \
      -ztext -zdefs -Bdirect ...

    real        0.019708910
    user        0.010101680
    sys         0.008528431

In order to go from prototype to integrated link-editor feature, I knew that I would need to prove that stub objects were valuable. And to do that, I knew that I'd have to switch the Solaris ON consolidation to use stub objects and evaluate the outcome. And in order to do that experiment, ON would first need to be converted to version 2 mapfiles. Sub-mission accomplished.

Normally when you design a new feature, you can devise reasonably small tests to show it works, and then deploy it incrementally, letting it prove its value as it goes. The entire point of stub objects however was to demonstrate that they could be successfully applied to an extremely large and complex code base, and specifically to solve the Solaris build issues detailed above. There was no way to finesse the matter — in order to move ahead, I would have to successfully use stub objects to build the entire ON consolidation and demonstrate their value. In software, the need to boil the ocean can often be a warning sign that things are trending in the wrong direction. Conversely, sometimes progress demands that you build something large and new all at once. A big win, or a big loss — sometimes all you can do is try it and see what happens.

And so, I spent some time staring at ON makefiles trying to get a handle on how things work, and how they'd have to change. It's a big and messy world, full of complex interactions, unspecified dependencies, special cases, and knowledge of arcane makefile features...

...and so, I backed away, put it down for a few months and did other work...

...until the fall, when I felt like it was time to stop thinking and pondering (some would say stalling) and get on with it. Without stubs, the following gives a simplified high level view of how Solaris is built:

  • An initially empty directory, known as the proto and referenced via the ROOT makefile macro, is established to receive the files that make up the Solaris distribution.

  • A top level setup rule creates the proto area, and performs operations needed to initialize the workspace so that the main build operations can be launched, such as copying needed header files into the proto area.

  • Parallel builds are launched to build the kernel (usr/src/uts), libraries (usr/src/lib), and commands. The install makefile target builds each item and delivers a copy to the proto area. All libraries and executables link against the objects previously installed in the proto, implying the need to synchronize the order in which things are built.

  • Subsequent passes run lint, and do packaging.
Given this structure, the additions to use stub objects are:

  • A new second proto area is established, known as the stub proto and referenced via the STUBROOT makefile macro. The stub proto has the same structure as the real proto, but is used to hold stub objects. All files in the real proto are delivered as part of the Solaris product. In contrast, the stub proto is used to build the product, and then thrown away.

  • A new target is added to library Makefiles called stub. This rule builds the stub objects. The ld command is designed so that you can build a stub object using the same ld command line you'd use to build the real object, with the addition of a single -z stub option. This means that the makefile rules for building the stub objects are very similar to those used to build the real objects, and many existing makefile definitions can be shared between them.

  • A new target is added to the Makefiles called stubinstall which delivers the stub objects built by the stub rule into the stub proto. These rules reuse much of the plumbing used by the existing install rule.

  • The setup rule runs stubinstall over the entire lib subtree as part of its initialization.

  • All libraries and executables link against the objects in the stub proto rather than the main proto, and can therefore be built in parallel without any synchronization.
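
In concrete terms, the end result looks something like this (a hypothetical, simplified transcript; the library name and paths are invented for illustration):

% dmake setup                      # create both protos, stubinstall all of usr/src/lib
% ls $STUBROOT/lib/libfoo.so.1     # stub objects now stand in for the real ones
% dmake install                    # real objects link against $STUBROOT, in any order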
There was no small way to try this that would yield meaningful results. I would have to take a leap of faith and edit approximately 1850 makefiles and 300 mapfiles first, trusting that it would all work out. Once the editing was done, I'd type make and see what happened. This took about 6 weeks to do, and there were many dark days when I'd question the entire project, or struggle to understand some of the many twisted and complex situations I'd uncover in the makefiles. I even found a couple of new issues that required changes to the new stub object related code I'd added to ld. With a substantial amount of encouragement and help from some key people in the Solaris group, I eventually got the editing done and stub objects for the entire workspace built. I found that my desktop system could build all the stub objects in the workspace in roughly a minute. This was great news, as it meant that use of the feature is effectively free — no one was likely to notice or care about the cost of building them.

After another week of typing make, fixing whatever failed, and doing it again, I succeeded in getting a complete build! The next step was to remove all of the make rules and .WAIT statements dedicated to controlling the order in which libraries under usr/src/lib are built. This came together pretty quickly, and after a few more speed bumps, I had a workspace that built cleanly and looked like something you might actually be able to integrate someday. This was a significant milestone, but there was still much left to do.

I turned to doing full nightly builds. Every type of build (open, closed, OpenSolaris, export, domestic) had to be tried. Each type failed in a new and unique way, requiring some thinking and rework. As things came together, I became aware of things that could have been done better, simpler, or cleaner, and those things also required some rethinking, the seeking of wisdom from others, and some rework. After another couple of weeks, it was in close to final form. My focus turned towards the end game and integration. This was a huge workspace, and needed to go back soon, before changes in the gate would make merging increasingly difficult.

At this point, I knew that the stub objects had greatly simplified the makefile logic and uncovered a number of race conditions, some of which had been there for years. I assumed that the builds were faster too, so I did some builds intended to quantify the speedup in build time that resulted from this approach. It had never occurred to me that there might not be one. And so, I was very surprised to find that the wall clock build times for a stock ON workspace were essentially identical to the times for my stub library enabled version! This is why it is important to always measure, and not just to assume.

One can tell from first principles, based on all those removed dependency rules in the library makefiles, that the stub object version of ON gives dmake considerably more opportunities to overlap library construction. Some hypotheses were proposed, and shot down:

  • Could we have disabled dmake's parallel feature? No, a quick check showed things being built in parallel.

  • It was suggested that we might be I/O bound, and so, the threads would be mostly idle. That's a plausible explanation, but system stats didn't really support it. Plus, the timings for the stub and non-stub cases were just too suspiciously identical.

  • Are our machines already handling as much parallelism as they are capable of, and unable to exploit these additional opportunities? Once again, we didn't see the evidence to back this up.
Eventually, a more plausible and obvious reason emerged: We build the libraries and commands (usr/src/lib, usr/src/cmd) in parallel with the kernel (usr/src/uts). The kernel is the long leg in that race, and so, wall clock measurements of build time are essentially showing how long it takes to build uts. Although it would have been nice to post a huge speedup immediately, we can take solace in knowing that stub objects simplify the makefiles and reduce the possibility of race conditions. The next step in reducing build time should be to find ways to reduce or overlap the uts part of the builds. When that leg of the build becomes shorter, then the increased parallelism in the libs and commands will pay additional dividends. Until then, we'll just have to settle for simpler and more robust.

And so, I integrated the link-editor support for creating stub objects into snv_153 (November 2010) with

6993877 ld should produce stub objects
PSARC/2010/397 ELF Stub Objects
followed by the work to convert the ON consolidation in snv_161 (February 2011) with
7009826 OSnet should use stub objects
4631488 lib/Makefile is too patient: .WAITs should be reduced
This was a huge putback, with 2108 modified files, 8 new files, and 2 removed files. Due to the size, I was allowed a window after snv_160 closed in which to do the putback. It went pretty smoothly for something this big; a few more preexisting race conditions were discovered and addressed over the next few weeks, and things have been quiet since then.

Conclusions and Looking Forward

Solaris has been built with stub objects since February. The fact that developers no longer specify the order in which libraries are built has been a big success, and we've eliminated an entire class of build error. That's not to say that there are no build races left in the ON makefiles, but we've taken a substantial bite out of the problem while generally simplifying and improving things.

The introduction of a stub proto area has also opened some interesting new possibilities for other build improvements. As this article has become quite long, and as those uses do not involve stub objects, I will defer that discussion to a future article.

Nagging As A Strategy For Better Linking: -z guidance

The link-editor (ld) in Solaris 11 has a new feature that we call guidance that is intended to help you build better objects. The basic idea behind guidance is that if (and only if) you request it, the link-editor will issue messages suggesting better options and other changes you might make to your ld command to get better results. You can choose to take the advice, or you can disable specific types of guidance while acting on others. In some ways, this works like an experienced friend leaning over your shoulder and giving you advice — you're free to take it or leave it as you see fit, but you get nudged to do a better job than you might have otherwise.

We use guidance to build the core Solaris OS, and it has proven to be useful, both in improving our objects, and in making sure that regressions don't creep back in later. In this article, I'm going to describe the evolution in thinking and design that led to the implementation of the -z guidance option, as well as give a brief description of how it works.

The guidance feature issues non-fatal warnings. However, experience shows that once developers get used to ignoring warnings, it is inevitable that real problems will be lost in the noise and ignored or missed. This is why we have a zero tolerance policy against build noise in the core Solaris OS. In order to get maximum benefit from -z guidance while maintaining this policy, I added the -z fatal-warnings option at the same time.

Much of the material presented here is adapted from the arc case:

PSARC 2010/312 Link-editor guidance

The History Of Unfortunate Link-Editor Defaults

The Solaris link-editor is one of the oldest Unix commands. It stands to reason that this would be true — in order to write an operating system, you need the ability to compile and link code. The original link-editor (ld) had defaults that made sense at the time. As new features were needed, command line option switches were added to let the user use them, while maintaining backward compatibility for those who didn't. Backward compatibility is always a concern in system design, but is particularly important in the case of the tool chain (compilers, linker, and related tools), since it is a basic building block for the entire system.

Over the years, applications have grown in size and complexity. Important concepts like dynamic linking that didn't exist in the original Unix system were invented. Object file formats changed. In the case of System V Release 4 Unix derivatives like Solaris, the ELF (Executable and Linking Format) was adopted. Since then, the ELF system has evolved to provide tools needed to manage today's larger and more complex environments. Features such as lazy loading and direct bindings have been added. In an ideal world, many of these options would be defaults, with rarely used options that allow the user to turn them off. However, the reality is exactly the reverse: For backward compatibility, these features are all options that must be explicitly turned on by the user. This has led to a situation in which most applications do not take advantage of the many improvements that have been made in linking over the last 20 years. If their code seems to link and run without issue, what motivation does a developer have to read a complex manpage, absorb the information provided, choose the features that matter for their application, and apply them? Experience shows that only the most motivated and diligent programmers will make that effort.

We know that most programs would be improved if we could just get you to use the various whizzy features that we provide, but the defaults conspire against us. We have long wanted to do something to make it easier for our users to use the linkers more effectively. There have been many conversations over the years regarding this issue, and how to address it. They always break down along the following lines:

Change ld Defaults
Since the world would be a better place if the newer ld features were the defaults, why not change things to make it so?

This idea is simple, elegant, and impossible. Doing so would break a large number of existing applications, including those of ISVs, big customers, and a plethora of existing open source packages. In each case, the owner of that code may choose to follow our lead and fix their code, or they may view it as an invitation to reconsider their commitment to our platform. Backward compatibility, and our installed base of working software, is one of our greatest assets, and not something to be lightly put at risk. Breaking backward compatibility at this level of the system is likely to do more harm than good.

But, it sure is tempting.

New Link-Editor
One might create a new linker command, not called 'ld', leaving the old command as it is. The new one could use the same code as ld, but would offer only modern options, with the proper defaults for features such as direct binding.

The resulting link-editor would be a pleasure to use. However, the approach is doomed to niche status. There is a vast pile of existing code in the world built around the existing ld command, reaching back to the 1970's. ld use is embedded in large and unknown numbers of makefiles, and ld is used by name by the compilers that execute it. A Unix link-editor that is not named ld will not find a majority audience no matter how good it might be.

Finally, a new linker command will eventually cease to be new, and will accumulate its own burden of backward compatibility issues.

An Option To Make ld Do The Right Things Automatically
This line of reasoning is best summarized by a CR filed in 2005, entitled
6239804 make it easier for ld(1) to do what's best
The idea is to have a '-z best' option that unchains ld from its backward compatibility commitment, and allows it to turn on the "best" set of features, as determined by the authors of ld. The specific set of features enabled by -z best would be subject to change over time, as requirements change.

This idea is more realistic than the other two, but was never implemented because it has some important issues that we could never answer to our satisfaction:

  • The -z best proposal assumes that the user can turn it on, and trust it to select good options without the user needing to be aware of the options being applied. This is a fallacy. Features such as direct bindings require the user to do some analysis to ensure that the resulting program will still operate properly.

  • A user who is willing to do the work to verify that what -z best does will be OK for their application is capable of turning on those features directly, and therefore gains little added benefit from -z best.

  • The intent is that when a user opts into -z best, they understand that -z best is subject to sometimes incompatible evolution. Experience teaches us that this won't work. People will use this feature, the meaning of -z best will change, code that used to build will fail, and then there will be complaints and demands to retract the change. When (not if) this occurs, we will of course defend our actions, and point at the disclaimer. We'll win some of those debates, and lose others. Ultimately, we'll end up with -z best2 (-z better), or other compromises, and our goal of simplifying the world will have failed.

  • The -z best idea rolls up a set of features that may or may not be related to each other into a unit that must be taken wholesale, or not at all. It could be that only a subset of what it does is compatible with a given application, in which case the user is expected to abandon -z best and instead set the options that apply to their application directly. In doing so, they lose one of the benefits of -z best, that if you use it, future versions of ld may choose a different set of options, and automatically improve the object through the act of rebuilding it.
I drew two conclusions from the above history:

  1. For a link-editor, backward compatibility is vital. If a given command line linked your application 10 years ago, you have every reason to expect that it will link today, assuming that the libraries you're linking against are still available and compatible with their previous interfaces.

  2. For an application of any size or complexity, there is no substitute for the work involved in examining the code and determining which linker options apply and which do not. These options are largely orthogonal to each other, and it can be reasonable not to use any or all of them, depending on the situation, even in modern applications. It is a mistake to tie them together.

The idea for -z guidance came from consideration of these points. By decoupling the advice from the act of taking the advice, we can retain the good aspects of -z best while avoiding its pitfalls:

  • -z guidance gives advice, but the decision to take that advice remains with the user who must evaluate its merit and make a decision to take it or not. As such, we are free to change the specific guidance given in future releases of ld, without breaking existing applications. The only fallout from this will be some new warnings in the build output, which can be ignored or dealt with at the user's convenience.

  • It does not couple the various features given into a single "take it or leave it" option, meaning that there will never be a need to offer "-zguidance2", or other such variants as things change over time. Guidance has the potential to be our final word on this subject.

  • The user is given the flexibility to disable specific categories of guidance without losing the benefit of others, including those that might be added to future versions of the system.
Although -z fatal-warnings stands on its own as a useful feature, it is of particular interest in combination with -z guidance. Used together, the guidance turns from advice to hard requirement: The user must either make the suggested change, or explicitly reject the advice by specifying a guidance exception token, in order to get a build. This is valuable in environments with high coding standards.

ld Command Line Options

The guidance effort resulted in new link-editor options for guidance and for turning warnings into fatal errors. Before I reproduce that text here, I'd like to highlight the strategic decisions embedded in the guidance feature:

  • In order to get guidance, you have to opt in. We hope you will opt in, and believe you'll get better objects if you do, but our default mode of operation will continue as it always has, with full backward compatibility, and without judgement.

  • Guidance suggestions always offer specific advice, and not vague generalizations.

  • You can disable some guidance without turning off the entire feature. When you get guidance warnings, you can choose to take the advice, or you can specify a keyword to disable guidance for just that category. This allows you to get guidance for things that are useful to you, without being bothered about things that you've already considered and dismissed.

  • As the world changes, we will add new guidance to steer you in the right direction. All such new guidance will come with a keyword that lets you turn it off.

  • In order to facilitate building your code on different versions of Solaris, we quietly ignore any guidance keywords we don't recognize, assuming that they are intended for newer versions of the link-editor. If you want to see what guidance tokens ld does and does not recognize on your system, you can use the ld debugging feature as follows:
    % ld -Dargs -z guidance=foo,nodefs
    debug: 
    debug: Solaris Linkers: 5.11-1.2275
    debug: 
    debug: arg[1]   option=-D:  option-argument: args
    debug: arg[2]   option=-z:  option-argument: guidance=foo,nodefs
    debug: warning: unrecognized -z guidance item: foo
    
The -z fatal-warnings option is straightforward, and generally useful in environments with strict coding standards. Note that the GNU ld already had this feature, and we accept their option names as synonyms:
-z fatal-warnings | nofatal-warnings
--fatal-warnings | --no-fatal-warnings
The -z fatal-warnings and --fatal-warnings options cause the link-editor to treat warnings as fatal errors.

The -z nofatal-warnings and --no-fatal-warnings options cause the link-editor to treat warnings as non-fatal. This is the default behavior.

The -z guidance option is defined as follows:

-z guidance[=item1,item2,...]
Provide guidance messages to suggest ld options that can improve the quality of the resulting object, or which are otherwise considered to be beneficial. The specific guidance offered is subject to change over time as the system evolves. Obsolete guidance offered by older versions of ld may be dropped in new versions. Similarly, new guidance may be added to new versions of ld. Guidance therefore always represents current best practices.

It is possible to enable guidance, while preventing specific guidance messages, by providing a list of item tokens, representing the class of guidance to be suppressed. In this way, unwanted advice can be suppressed without losing the benefit of other guidance. Unrecognized item tokens are quietly ignored by ld, allowing a given ld command line to be executed on a variety of older or newer versions of Solaris.

The guidance offered by the current version of ld, and the item tokens used to disable these messages, are as follows.

Specify Required Dependencies

Dynamic executables and shared objects should explicitly define all of the dependencies they require. Guidance recommends the use of the -z defs option, should any symbol references remain unsatisfied when building dynamic objects. This guidance can be disabled with -z guidance=nodefs.
Do Not Specify Non-Required Dependencies
Dynamic executables and shared objects should not define any dependencies that do not satisfy the symbol references made by the dynamic object. Guidance recommends that unused dependencies be removed. This guidance can be disabled with -z guidance=nounused.
Lazy Loading
Dependencies should be identified for lazy loading. Guidance recommends the use of the -z lazyload option should any dependency be processed before either a -z lazyload or -z nolazyload option is encountered. This guidance can be disabled with -z guidance=nolazyload.
Direct Bindings
Dependencies should be referenced with direct bindings. Guidance recommends the use of the -B direct or -z direct options should any dependency be processed before either of these options, or the -z nodirect option, is encountered. This guidance can be disabled with -z guidance=nodirect.
Pure Text Segment
Dynamic objects should not contain relocations to non-writable, allocable sections. Guidance recommends compiling objects with Position Independent Code (PIC) should any relocations against the text segment remain, and neither the -z textwarn nor -z textoff options are encountered. This guidance can be disabled with -z guidance=notext.
Mapfile Syntax
All mapfiles should use the version 2 mapfile syntax. Guidance recommends the use of the version 2 syntax should any mapfiles be encountered that use the version 1 syntax. This guidance can be disabled with -z guidance=nomapfile.
Library Search Path
Inappropriate dependencies that are encountered by ld are quietly ignored. For example, a 32-bit dependency that is encountered when generating a 64-bit object is ignored. These dependencies can result from incorrect search path settings, such as supplying an incorrect -L option. Although benign, this dependency processing is wasteful, and might hide a build problem that should be solved. Guidance recommends the removal of any inappropriate dependencies. This guidance can be disabled with -z guidance=nolibpath.
In addition, -z guidance=noall can be used to entirely disable the guidance feature. See Chapter 7, Link-Editor Quick Reference, in the Linker and Libraries Guide for more information on guidance and advice for building better objects.

Example

The following example demonstrates how the guidance feature is intended to work. We will build a shared object that has a variety of shortcomings:
  • Does not specify all its dependencies
  • Specifies dependencies it does not use
  • Does not use direct bindings
  • Uses a version 1 mapfile
  • Contains relocations to the readonly allocable text (not PIC)

This scenario is sadly very common — many shared objects have one or more of these issues.

% cat hello.c           
#include <stdio.h>
#include <unistd.h>

void
hello(void)
{
        printf("hello user %d\n", getpid());
}

% cat mapfile.v1
# This version 1 mapfile will trigger a guidance message

% cc hello.c -o hello.so -G -M mapfile.v1 -lelf

As you can see, the operation completes without error, resulting in a usable object. However, turning on guidance reveals a number of things that could be better:

% cc hello.c -o hello.so -G -M mapfile.v1 -lelf -zguidance
ld: guidance: version 2 mapfile syntax recommended: mapfile.v1
ld: guidance: -z lazyload option recommended before first dependency
ld: guidance: -B direct or -z direct option recommended before first dependency
Undefined                       first referenced
 symbol                             in file
getpid                              hello.o  (symbol belongs to implicit
                                              dependency /lib/libc.so.1)
printf                              hello.o  (symbol belongs to implicit
                                              dependency /lib/libc.so.1)
ld: warning: symbol referencing errors
ld: guidance: -z defs option recommended for shared objects
ld: guidance: removal of unused dependency recommended: libelf.so.1
warning: Text relocation remains                referenced
    against symbol                  offset      in file
.rodata1 (section)                  0xa         hello.o
getpid                              0x4         hello.o
printf                              0xf         hello.o
ld: guidance: position independent (PIC) code recommended for shared objects
ld: guidance: see ld(1) -z guidance for more information

Given the explicit advice in the above guidance messages, it is relatively easy to modify the example to do the right things:

% cat mapfile.v2
# This version 2 mapfile will not trigger a guidance message
$mapfile_version 2

% cc hello.c -o hello.so -Kpic -G -Bdirect -M mapfile.v2 -lc -zguidance    

There are situations in which the guidance does not fit the object being built. For instance, suppose you want to build an object without direct bindings:

% cc -Kpic hello.c -o hello.so -G -M mapfile.v2 -lc -zguidance
ld: guidance: -B direct or -z direct option recommended before first dependency
ld: guidance: see ld(1) -z guidance for more information

It is easy to disable that specific guidance warning without losing the overall benefit from allowing the remainder of the guidance feature to operate:

% cc -Kpic hello.c -o hello.so -G -M mapfile.v2 -lc -zguidance=nodirect

Conclusions

The linking guidelines enforced by the ld guidance feature correspond rather directly to our standards for building the core Solaris OS. I'm sure that comes as no surprise. It only makes sense that we would want to build our own product as well as we know how. Solaris is usually the first significant test for any new linker feature. We now enable guidance by default for all builds, and the effect has been very positive.

Guidance helps us find suboptimal objects more quickly. Programmers get concrete advice for what to change instead of vague generalities. Even in the cases where we override the guidance, the makefile rules to do so serve as documentation of the fact.

Deciding to use guidance is likely to cause some up-front work for most code, as it forces you to consider using new features such as direct bindings. Such investigation is worthwhile, but does not come for free. However, the guidance suggestions offer a structured and straightforward way to tackle modernizing your objects and, once that work is done, to keep them that way. The investment is often worth it, and will repay you in terms of better performance and fewer problems. I hope that you find guidance to be as useful as we have.

64-bit Archives Needed

A little over a year ago, we received a question from someone who was trying to build software on Solaris. He was getting errors from the ar command when creating an archive. At that time, the ar command on Solaris was a 32-bit command. There was more than 2GB of data, and the ar command was hitting the file size limit for a 32-bit process that doesn't use the largefile APIs.

Even in 2011, 2GB is a very large amount of code, so we had not heard this one before. Most of our toolchain was extended to handle 64-bit sized data back in the 1990's, but archives were not changed, presumably because there was no perceived need for it. Since then of course, programs have continued to get larger, and in 2010, the time had finally come to investigate the issue and find a way to provide for larger archives.

As part of that process, I had to do a deep dive into the archive format, and also do some Unix archeology. I'm going to record what I learned here, to document what Solaris does, and in the hope that it might help someone else trying to solve the same problem for their platform.

Archive Format Details

Archives are hardly cutting edge technology. They are still used of course, but their basic form hasn't changed in decades. Other than to fix a bug, which is rare, we don't tend to touch that code much. The archive file format is described in /usr/include/ar.h, and I won't repeat the details here. Instead, here is a rough overview of the archive file format, implemented by System V Release 4 (SVR4) Unix systems such as Solaris:
  1. Every archive starts with a "magic number". This is a sequence of 8 characters: "!<arch>\n".

  2. The magic number is followed by 1 or more members. A member starts with a fixed header, defined by the ar_hdr structure in /usr/include/ar.h. Immediately following the header comes the data for the member. Members must be padded at the end with newline characters so that they have even length.

    The requirement to pad members to an even length is a dead giveaway as to the age of the archive format. It tells you that this format dates from the 1970's, and more specifically from the era of 16-bit systems such as the PDP-11 that Unix was originally developed on. A 32-bit system would have required 4 bytes, and 64-bit systems such as we use today would probably have required 8 bytes. 2 byte alignment is a poor choice for ELF object archive members: 32-bit objects require 4 byte alignment, and 64-bit objects require 8 byte alignment. The link-editor uses mmap() to process archives, and if the members have the wrong alignment, we have to slide (copy) them to the correct alignment before we can access the ELF data structures inside. The archive format requires 2 byte padding, but it doesn't prohibit more. The Solaris ar command takes advantage of this, and pads ELF object members to 8 byte boundaries. Anything else is padded to 2 as required by the format.

  3. The archive header (ar_hdr) represents all numeric values using an ASCII text representation rather than as binary integers. This means that an archive that contains only text members can be viewed using tools such as cat, more, or a text editor. The original designers of this format clearly thought that archives would be used for many file types, and not just for objects. Things didn't turn out that way of course — nearly all archives contain relocatable objects for a single operating system and machine, and are used primarily as input to the link-editor (ld).

  4. Archives can have special members that are created by the ar command rather than being supplied by the user. These special members are all distinguished by having a name that starts with the slash (/) character. This is an unambiguous marker that says that the user could not have supplied it. The reason for this is that regular archive members are given the plain name of the file that was inserted to create them, and any path components are stripped off. Slash is the delimiter character used by Unix to separate path components, and as such cannot occur within a plain file name.

    The ar command hides the special members from you when you list the contents of an archive, so most users don't know that they exist. There are only two possible special members: a symbol table that maps each ELF symbol to the archive member that provides it, and a string table used to hold member names that exceed 15 characters. The '/' convention for tagging special members provides room for adding more such members should the need arise. As I will discuss below, we took advantage of this fact to add an alternate 64-bit symbol table special member, which is used in archives that are larger than 4GB.

  5. When an archive contains ELF object members, the ar command builds a special archive member known as the symbol table, which maps every ELF symbol in those objects to the archive member that provides it. The link-editor uses this symbol table to determine which symbols are provided by the objects in that archive. If an archive has a symbol table, it will always be the first member in the archive, immediately following the magic number. Unlike member headers, symbol tables do use binary integers to represent offsets. These integers are always stored in big-endian format, even on a little endian host such as x86. (A small decoding sketch follows this list.)

  6. The archive header (ar_hdr) provides 15 characters for representing the member name. If any member has a name that is longer than this, then the real name is written into a special archive member called the string table, and the member's name field instead contains a slash (/) character followed by a decimal representation of the offset of the real name within the string table. The string table is required to precede all normal archive members, so it will be the second member if the archive contains a symbol table, and the first member otherwise.

  7. The archive format is not designed to make finding a given member easy. Such operations move through the archive from front to back examining each member in turn, and run in O(n) time. This would be bad if archives were commonly used in that manner, but in general, they are not. Typically, the ar command is used to build a new archive from scratch, inserting all the objects in one operation, and then the link-editor accesses the members in the archive in constant time by using the offsets provided by the symbol table. Both of these operations are reasonably efficient. However, listing the contents of a large archive with the ar command can be rather slow.
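
To make the preceding description concrete, here is a minimal C sketch of walking the members of an SVR4 archive and dumping a 32-bit symbol table. This is only a sketch under the assumptions described in the list above, not the actual ar or ld implementation: the ar_hdr declaration is condensed from /usr/include/ar.h, and error handling is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define ARMAG   "!<arch>\n"     /* archive magic number */
#define SARMAG  8               /* length of the magic number */

struct ar_hdr {                 /* condensed from /usr/include/ar.h */
        char    ar_name[16];    /* member name */
        char    ar_date[12];    /* ASCII decimal timestamp */
        char    ar_uid[6];      /* ASCII decimal user id */
        char    ar_gid[6];      /* ASCII decimal group id */
        char    ar_mode[8];     /* ASCII octal file mode */
        char    ar_size[10];    /* ASCII decimal member size */
        char    ar_fmag[2];     /* header trailer */
};

/* Symbol table integers are big-endian, even on little-endian hosts. */
static uint32_t
get_be32(const unsigned char *p)
{
        return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
            ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

/*
 * Walk the members of an archive mapped at 'base'. If the first member
 * is a symbol table (special name "/"), dump it: a 4-byte symbol count,
 * then one 4-byte member offset per symbol, then the NUL-terminated
 * symbol names.
 */
static void
walk_archive(const unsigned char *base, size_t len)
{
        size_t  off = SARMAG;   /* first member follows the magic number */

        if (len < SARMAG || memcmp(base, ARMAG, SARMAG) != 0)
                return;         /* not an archive */

        while (off + sizeof (struct ar_hdr) <= len) {
                const struct ar_hdr *hdr =
                    (const struct ar_hdr *)(base + off);
                const unsigned char *data =
                    base + off + sizeof (struct ar_hdr);
                char    sizebuf[11];
                size_t  size;

                /* ar_size is ASCII text, and is not NUL-terminated */
                (void) memcpy(sizebuf, hdr->ar_size, 10);
                sizebuf[10] = '\0';
                size = (size_t)strtoul(sizebuf, NULL, 10);

                if (off == SARMAG && hdr->ar_name[0] == '/' &&
                    hdr->ar_name[1] == ' ') {
                        uint32_t        nsyms = get_be32(data);
                        const char      *name =
                            (const char *)(data + 4 + (4 * (size_t)nsyms));

                        for (uint32_t i = 0; i < nsyms; i++) {
                                (void) printf("%-25s member at offset %u\n",
                                    name, get_be32(data + 4 + (4 * (size_t)i)));
                                name += strlen(name) + 1;
                        }
                }

                off += sizeof (struct ar_hdr) + size;
                off += off & 1; /* members are padded to even length */
        }
}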

Factors That Limit Solaris Archive Size

As is often the case, there was more than one limiting factor preventing Solaris archives from growing beyond the 32-bit limits of 2GB (32-bit signed) and 4GB (32-bit unsigned). These limits are listed in the order they are hit as archive size grows, so the earlier ones mask those that follow.

  1. The original Solaris archive file format can handle sizes up to 4GB without issue. However, the ar command was delivered as a 32-bit executable that did not use the largefile APIs. As such, the ar command itself could not create a file larger than 2GB. One can solve this by building ar with the largefile APIs which would allow it to reach 4GB, but a simpler and better answer is to deliver a 64-bit ar, which has the ability to scale well past 4GB.

  2. Symbol table offsets are stored as 32-bit big-endian binary integers, which limits the maximum archive size to 4GB. To get around this limit requires a different symbol table format, or an extension mechanism to the current one, similar in nature to the way member names longer than 15 characters are handled in member headers.

  3. The size field in the archive member header (ar_hdr) is an ASCII string capable of representing a 32-bit unsigned value. This places a 4GB size limit on the size of any individual member in an archive.

In considering format extensions to get past these limits, it is important to remember that very few archives will require the ability to scale past 4GB for many years. The old format, while no beauty, continues to be sufficient for its purpose. This argues for a backward compatible fix that allows newer versions of Solaris to produce archives that are compatible with older versions of the system unless the size of the archive exceeds 4GB.

Archive Format Differences Among Unix Variants

While considering how to extend Solaris archives to scale to 64-bits, I wanted to know how similar archives from other Unix systems are to those produced by Solaris, and whether they had already solved the 64-bit issue. I've moved archives between different Unix systems before with good luck, so I knew that there was some commonality. If it turned out that there was already a viable de facto standard for 64-bit archives, it would obviously be better to adopt that rather than invent something new.

The archive file format is not formally standardized. However, the ar command and archive format were part of the original Unix from Bell Labs. Other systems started with that format, extending it in various often incompatible ways, but usually with the same common shared core. Most of these systems use the same magic number to identify their archives, despite the fact that their archives are not always fully compatible with each other. It is often true that archives can be copied between different Unix variants, and if the member names are short enough, the ar command from one system can often read archives produced on another.

In practice, it is rare to find an archive containing anything other than objects for a single operating system and machine type. Such an archive is only of use on the type of system that created it, and is only used on that system. This is probably why cross platform compatibility of archives between Unix variants has never been an issue. Otherwise, the use of the same magic number in archives with incompatible formats would be a problem.

I was able to find information for a number of Unix variants, described below. These can be divided roughly into three tribes: SVR4 Unix, BSD Unix, and IBM AIX. Solaris is an SVR4 Unix, and its archives are completely compatible with those from the other members of that group (GNU/Linux, HP-UX, and SGI IRIX).

    AIX
    AIX is an exception to the rule that Unix archive formats are all based on the original Bell Labs Unix format. It appears that AIX supports 2 formats (small and big), both of which differ in fundamental ways from other Unix systems:

    • These formats use a different magic number than the standard one used by Solaris and other Unix variants.

    • They include support for removing archive members from a file without reallocating the file, marking dead areas as unused, and reusing them when new archive items are inserted.

    • They have a special table of contents member (File Member Header) which lets you find out everything that's in the archive without having to actually traverse the entire file. Their symbol table members are quite similar to those from other systems though.

    • Their member headers are doubly linked, containing offsets to both the previous and next members.

    Of the Unix systems described here, AIX has the only format I saw that will have reasonable insert/delete performance for really large archives. All the others have O(n) performance, and are going to be slow to use with large archives.

    BSD
    BSD has gone through 4 versions of its archive format, which are described in its manpage. They use the same member header as SVR4, but their symbol table format is different, and their scheme for long member names puts the name directly after the member header rather than into a string table.

    GNU/Linux
    The GNU toolchain uses the SVR4 format, and is compatible with Solaris.

    HP-UX
    HP-UX seems to follow the SVR4 model, and is compatible with Solaris.

    IRIX
    IRIX has 32 and 64-bit archives. The 32-bit format is the standard SVR4 format, and is compatible with Solaris. The 64-bit format is the same, except that the symbol table uses 64-bit integers.

    IRIX assumes that an archive contains objects of a single ELFCLASS/MACHINE, and any archive containing ELFCLASS64 objects receives a 64-bit symbol table. Although they only use it for 64-bit objects, nothing in the archive format limits it to ELFCLASS64. It would be perfectly valid to produce a 64-bit symbol table in an archive containing 32-bit objects, text files, or anything else.

    Tru64 Unix (Digital/Compaq/HP)
    Tru64 Unix uses a format much like ours, but their symbol table is a hash table, making specific symbol lookup much faster. The Solaris link-editor uses archives by examining the entire symbol table looking for unsatisfied symbols for the link, and not by looking up individual symbols, so there would be no benefit to Solaris from such a hash table. The Tru64 ld must use a different approach in which the hash table pays off for them.

Widening the existing SVR4 archive symbol tables rather than inventing something new is the simplest path forward. There is ample precedent for this approach in the ELF world. When ELF was extended to support 64-bit objects, the approach was largely to take the existing data structures, and define 64-bit versions of them. We called the old set ELF32, and the new set ELF64. My guess is that there was no need to widen the archive format at that time, but had there been, it seems obvious that this is how it would have been done.

The Implementation of 64-bit Solaris Archives

As mentioned earlier, there was no desire to improve the fundamental nature of archives. They have always had O(n) insert/delete behavior, and for the most part it hasn't mattered. AIX made efforts to improve this, but those efforts did not find widespread adoption. For the purposes of link-editing, which is essentially the only thing that archives are used for, the existing format is adequate, and issues of backward compatibility trump the desire to do something technically better.

Widening the existing symbol table format to 64-bits is therefore the obvious way to proceed. For Solaris 11, I implemented that, and I also updated the ar command so that a 64-bit version is run by default. This eliminates the 2 most significant limits to archive size, leaving only the limit on an individual archive member.

We only generate a 64-bit symbol table if the archive exceeds 4GB, or when the new -S option to the ar command is used. This maximizes backward compatibility, as an archive produced by Solaris 11 is highly likely to be less than 4GB in size, and will therefore employ the same format understood by older versions of the system. The main reason for the existence of the -S option is to allow us to test the 64-bit format without having to construct huge archives to do so. I don't believe it will find much use outside of that.

Other than the new ability to create and use extremely large archives, this change is largely invisible to the end user. When reading an archive, the ar command will transparently accept either form of symbol table. Similarly, the ELF library (libelf) has been updated to understand either format. Users of libelf (such as the link-editor ld) do not need to be modified to use the new format, because these changes are encapsulated behind the existing functions provided by libelf.
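
From the reading side, accepting either form mostly comes down to selecting the integer width based on which special symbol table member is present. Here is a minimal sketch, building on the get_be32() helper from the archive sketch earlier; the special member name "/SYM64/" for the 64-bit symbol table, and the widened layout (an 8-byte count followed by 8-byte offsets, by analogy with the 32-bit form), are assumptions of this sketch.

/*
 * Sketch only: fetch symbol entry i's member offset from either
 * symbol table format. Builds on get_be32() from the earlier sketch.
 */
static uint64_t
get_be64(const unsigned char *p)
{
        return (((uint64_t)get_be32(p) << 32) | get_be32(p + 4));
}

static uint64_t
symtab_offset(const char *member_name, const unsigned char *tab, uint64_t i)
{
        if (strncmp(member_name, "/SYM64/", 7) == 0)    /* assumed marker */
                return (get_be64(tab + 8 + (8 * i)));   /* 64-bit form */

        return (get_be32(tab + 4 + (4 * i)));           /* 32-bit form */
}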

As mentioned above, this work did not lift the limit on the maximum size of an individual archive member. That limit remains fixed at 4GB for now. This is not because we think objects will never get that large, for the history of computing says otherwise. Rather, this is based on an estimation that single relocatable objects of that size will not appear for a decade or two. A lot can change in that time, and it is better not to overengineer things by writing code that will sit and rot for years without being used.

It is not too soon however to have a plan for that eventuality. When the time comes when this limit needs to be lifted, I believe that there is a simple solution that is consistent with the existing format. The archive member header size field is an ASCII string, like the name, and as such, the overflow scheme used for long names can also be used to handle the size. The size string would be placed into the archive string table, and its offset in the string table would then be written into the archive header size field using the same format "/ddd" used for overflowed names.
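
To illustrate the proposal, here is a hypothetical sketch of what the reading side might look like. Nothing in it is implemented today; it simply mirrors the existing long member name overflow scheme described earlier.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical: return a member's size, allowing the proposed
 * overflow form. A size field beginning with '/' would hold a
 * decimal offset into the archive string table, where the real
 * (arbitrarily long) ASCII size string would reside, mirroring
 * the existing "/ddd" scheme used for long member names.
 */
static uint64_t
member_size(const char size_field[10], const char *strtab)
{
        char buf[11];

        /* the size field is ASCII text, and is not NUL-terminated */
        (void) memcpy(buf, size_field, 10);
        buf[10] = '\0';

        if (buf[0] == '/')      /* proposed: size overflowed to strtab */
                return (strtoull(strtab + atoll(buf + 1), NULL, 10));

        return (strtoull(buf, NULL, 10));       /* existing in-header form */
}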

Solaris 11

Oracle has a strict policy about not discussing product features until they appear in shipping product. Now that Solaris 11 is publicly available, it is time to catch up. I will shortly be posting articles on a variety of new developments in the Solaris linkers and related bits:
64-bit Archives
After 40+ years of Unix, the archive file format has run out of room. The ar and link-editor (ld) commands have been enhanced to allow archives to grow past their previous 32-bit limits.
Guidance
The link-editor is now willing and able to tell you how to alter your link lines in order to build better objects.
Stub Objects
This is one of the bigger projects I've undertaken since joining the Solaris group. Stub objects are shared objects, built entirely from mapfiles, that supply the same linking interface as the real object, while containing no code or data. You can link to them, but cannot use them at runtime. It was pretty simple to add this ability to the link-editor, but the changes to the OSnet in order to apply them to building Solaris were massive. I discuss how we came to invent stub objects, how we apply them to build the OSnet in a more parallel and scalable manner, and about the follow on opportunities that have emerged from the new stub proto area we created to hold them.
The elffile Utility
A new standard Solaris utility, elffile, is a variant of the file utility, focused exclusively on linker related files. elffile is of particular value for examining archives, as it allows you to find out what is inside them without having to first extract the archive members into temporary files.
This release has been a long time coming. I joined the Solaris group in late 2005, and this will be my first FCS. From a user perspective, Solaris 11 is probably the biggest change to Solaris since Solaris 2.0. Solaris 11 polishes the groundbreaking features from Solaris 10 (DTrace, FMA, ZFS, Zones), and uses them to add a powerful new packaging system, numerous other enhancements and features, along with a huge modernization effort. I'm excited to see it go out into the world. I hope you enjoy using it as much as we enjoyed creating it.

Software is never done. On to the next one...

Friday Oct 15, 2010

How To Name A Solaris Shared Object

In order to add a new shared object to Solaris, you need to know how to name it. As obvious as this sounds, there is a lot of confusion surrounding this subject.

Solaris follows a standard set of rules for shared object naming, and largely serves as an example of how we intend things to work. Unfortunately, some poor examples have also crept into the system over the years, no doubt adding to the confusion.

The Linker and Libraries Guide contains the basic information, but we seem to be missing a concise description of how Solaris shared objects are supposed to be named. Without that, people will end up trying to intuit what they should be doing by looking about and guessing. The occasional misstep is almost inevitable.

I hope this discussion will fill that gap. I will describe the rules we follow under Solaris, and explain the reasoning behind them.

Naming Of Native Solaris Objects To Be Linked Against With ld

The largest category of shared objects consists of those intended to be linked against executables and other shared objects via the link-editor (ld). This is typically done using the ld -l command line option. Native shared objects that are intended to be linked against with ld on Solaris are expected to follow these conventions:
  1. The object should have the fully versioned name.
  2. The object should have an SONAME, set via the ld -h option, that includes the version number.
  3. If this is a public object intended for general use, a symbolic link with the non-versioned name should point at the object.
The C runtime library demonstrates this:
% ls -alF /lib/libc.*
lrwxrwxrwx   1 root     root           9 Mar 22  2010 /lib/libc.so -> libc.so.1*
-rwxr-xr-x   1 root     bin      1721888 Oct  4 10:08 /lib/libc.so.1*
% elfdump -d /lib/libc.so.1 | grep SONAME
       [4]  SONAME            0xb8bc              libc.so.1
Each shared object is therefore accessible by its fully versioned name, or via a symbolic link that makes it available via a generic non-versioned name. Although libc does not show this, there can be more than one version of a given object. This happens when a shared object is changed in a backward incompatible manner, something that we try very hard to prevent under Solaris. The generic symbolic link always points at the most current version of the object. This is the version that newly built code should use.

The non-versioned and versioned names serve the differing needs of the link-editor (ld), and runtime linker (ld.so.1):

  • At link time, it is appropriate to refer to objects generically. When we build a program, we wish to link against the current version of the libraries we intend to use, and to always get the best/latest versions without having to explicitly specify those versions.

  • At runtime, a program must use the specific version of the shared objects that it was linked against. If a program was linked against libXXX.so.1, it must continue to find libXXX.so.1 even if the system has added libXXX.so.2 since the last time the program was built. This means that at runtime, the fully versioned name must be used.
The non-versioned symbolic link is called a compilation symlink, because it is the name seen by the link-editor (ld) at compile/link time. When ld sees an argument of the form '-lXXX', it searches the library path for files with the name 'libXXX.so'. For example, to link a program against the C runtime library, you specify the -lc option. This mechanism allows ld to find the desired library via its generic compilation symlink.

Having arranged for ld to find the object via its generic name, it is now necessary to ensure that the runtime linker will look for it via its fully versioned name. This does not happen by default:

  • When ld builds an object, it puts a NEEDED entry in the dynamic section for each object it links to.

  • If the linked-to object contains an SONAME entry in its dynamic section, the value of the SONAME entry is placed in the NEEDED entry of the linking object.

  • If the linked-to object does not contain an SONAME entry in its dynamic section, ld uses the name that it found the object under. As we've seen above, this will be the generic compilation symlink name.
This is why you must explicitly use ld -h to specify an SONAME when you build your object. It ensures that the runtime linker will search for the object via its fully versioned name at runtime.

For example, we saw above that the SONAME for the C runtime library is "libc.so.1". /bin/ls is linked against libc using the ld -lc command line option. Without the SONAME in libc, we'd expect the NEEDED entry to contain "libc.so", but instead we see the desired result:

% elfdump -d /bin/ls | grep libc.so
       [8]  NEEDED            0x60f               libc.so.1
Despite being linked via the generic name, the runtime linker searches for libc.so.1, and not libc.so, when the program is actually run, as shown by ldd:
% ldd /bin/ls | grep libc.so
        libc.so.1 =>     /lib/libc.so.1
To summarize:
  1. The link-editor uses the generic object name to locate the object at link-time.

  2. The runtime linker uses the versioned object name to locate the object at runtime.

  3. This happens not by default, but through the systematic application of the three rules listed at the beginning of this section.
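
Putting the three rules together: to build and install a hypothetical public object libfoo (an invented name, used here purely for illustration), the steps would look something like this:

% cc foo.c -o libfoo.so.1 -G -Kpic -h libfoo.so.1
% ln -s libfoo.so.1 libfoo.so

The -h option supplies the versioned SONAME required by rule 2, and the symbolic link satisfies rule 3 by giving the link-editor a generic name under which to find the object.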

Compilation Symlinks and Private Objects

The compilation symlink exists solely for the benefit of the link-editor (ld), and plays no role in finding the object at runtime. If the compilation symlink is not present, the object is effectively rendered invisible to ld. Solaris takes advantage of this fact to protect users from accidentally linking against private objects.

There are objects in the Solaris system that exist as implementation details, to provide support for the parts of the system that are publicly documented and committed. Such objects are subject to unannounced change, or even removal, so the lack of a compilation symlink saves a great deal of trouble. For example, Solaris ships with the following object:

% ls -alF /lib/libavl.so*
-rwxr-xr-x   1 root     bin          14K Oct  4 10:07 /lib/libavl.so.1*
Without a compilation symlink named libavl.so, this object will be ignored by the link-editor when it builds new objects. Your programs will not accidentally find it even if you specify -lavl to ld, because ld will not be able to locate a file named libavl.so. If libavl becomes public and committed someday, a compilation symlink will be added for it.

I mentioned before that Solaris contains some shared objects that do not faithfully follow the rules we are discussing. By far, the most common error is to deliver a compilation symlink with a private object. The mere presence of a compilation symlink should not be taken as evidence that the object is public and safe to use. A public object will have manpages documenting the library and the functions it contains, and those manpages will include an ATTRIBUTES section that details the commitment level of the interfaces it provides.

Objects That Have More Than A Major Version

Solaris shared objects use a single version number, referred to as the major version. In the case of libc, as shown above, this version is 1. Non-native shared objects often use a versioning scheme that includes additional sub-version numbers. To handle such objects, we need to generalize our rules.

Although Solaris uses a single version number, our history includes a time when we used more. Those of you who remember SunOS 4.x may recall that those systems had major and minor numbers, as evidenced by the BCP objects still delivered with sparc systems:

% ls -alF /usr/4lib/libc.so*
-rwxr-xr-x   1 root     bin       411820 Jan 22  2005 /usr/4lib/libc.so.1.9*
-rwxr-xr-x   1 root     bin       411080 Jan 22  2005 /usr/4lib/libc.so.2.9*
Note that there is no compilation symlink — we supply these old objects so that customers can continue to run their ancient (now approaching 20 years) SunOS 4.x executables, but we don't want anyone linking new code to them!

Shared objects were first added to SunOS in version 4.0. As with today's system, a change in the major number reflected an incompatible interface change, and when that happened, the older objects would continue to be supplied for the benefit of old executables, and a separate new object would be delivered with the new code. A change in minor number reflected a compatible change, but the runtime linker would print warning messages when you ran an old executable against a newer minor version. This quickly proved to be a bad idea, as it needlessly annoyed users.

We learned two lessons from SunOS 4.x shared objects:

  • Never, ever, issue a warning if the end user is not supposed to care about the issue being warned about and no harm can come from not knowing about it.

  • Since all libraries with the same major number are compatible, the minor number conveys nothing of interest and need not exist.
As a result, the minor number concept was dropped in Solaris 2.x (SunOS 5.x), and has not been missed.

Other bodies of code do utilize additional (minor, micro, etc) version numbers, and you will see them under Solaris in software that originates elsewhere. This is done in order to match the name used by the community that produces the software in question, and not necessarily for object versioning. When building such software for Solaris, you must follow a slightly more general version of our rules:

  1. The object should have the fully versioned name.
  2. The object should have an SONAME, set via the ld -h option, that includes only the major version number, and not the minor or smaller version numbers.
  3. A symbolic link matching the SONAME should point at the object.
  4. If this is a public object intended for general use, a symbolic link with the non-versioned name should point at the object.

For example:

% ls -alF /usr/gnu/lib/libncurses.so*
lrwxrwxrwx   1 root     root          15 Mar 22  2010 /usr/gnu/lib/libncurses.so -> libncurses.so.5*
lrwxrwxrwx   1 root     root          17 Mar 22  2010 /usr/gnu/lib/libncurses.so.5 -> libncurses.so.5.7*
-rwxr-xr-x   1 root     bin         351K Jun 10 11:53 /usr/gnu/lib/libncurses.so.5.7*
% elfdump -d /usr/gnu/lib/libncurses.so.5.7 | grep SONAME
       [2]  SONAME            0x27a1              libncurses.so.5
This GNU object, delivered with Solaris, retains its original version number (5.7). However, it still follows our basic rules. The object has the fully versioned name, the SONAME contains only the major number, and a compilation symlink is supplied.

The use of an SONAME that contains only the major version number is done in order to preserve the principle that you should be able to replace an object with a newer version of the same object as long as the major number does not change, and that it will not be necessary to recompile objects linked against such an object in order to run them. To put this in more concrete terms, we expect that libncurses.so.5.7 can be replaced by libncurses.so.5.8, and that any program linked against libncurses.so.5.7 can use libncurses.so.5.8, without the need to rebuild.

Related to this point, it is worth noting that we now have two symbolic links rather than the single link we use for native Solaris objects, and that the purposes of these two links are different and unrelated:

  • libncurses.so is the compilation symlink, present for the benefit of the link-editor (ld), as discussed above. The compilation symlink operates as previously discussed, and should be omitted when delivering a private object.

  • libncurses.so.5 exists to bridge the gap between the SONAME for the object (libncurses.so.5) and the name of the object file (libncurses.so.5.7). This link is not optional. Without it, the runtime linker will not be able to locate the object.
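
In concrete terms, had these two links not been delivered with the object, they could be recreated as follows:

% ln -s libncurses.so.5.7 libncurses.so.5   # matches the SONAME; used at runtime
% ln -s libncurses.so.5 libncurses.so       # compilation symlink; used by ld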

Rules for Objects Used Via dlopen()

The discussion to this point has been limited to objects that are linked to other objects via the link-editor (ld). Now, let's consider objects that are loaded under program control, via the dlopen() function.

Many objects are used both via ld and via dlopen(). For such objects, you must follow the rules described above that allow the object to work properly with ld. The advice given in this section is only for objects that will never be linked to via ld.

An object that is not linked to via ld does not have to follow any of the rules we've discussed so far:

  • It does not require or benefit from a compilation symlink.

  • It does not require an SONAME, since the runtime linker will not look at it in this context.

  • It can have any arbitrary name, rather than following the libXXX.so.x naming scheme. In particular, it is common to omit the 'lib' prefix from such objects.
As you can see, dlopen() imposes no naming requirements on an object. However, Solaris employs the following conventions for the benefit of human observers:
  • Give all sharable objects a '.so' extension.

  • Only use a version number (e.g. XXX.so.1) if multiple versions may be delivered and co-exist, or if a version number would be meaningful to the end user.

  • Use a common application specific prefix or install objects together in a directory hierarchy that makes their purpose clear.

For example, the elfedit utility is delivered with a set of runtime loadable support modules:

% ls -alFR /usr/lib/elfedit/
/usr/lib/elfedit/:
total 1185
drwxr-xr-x   3 root     bin           13 Sep  2 15:09 ./
drwxr-xr-x 169 root     bin         2023 Oct  4 10:09 ../
lrwxrwxrwx   1 root     root           1 Jun 19 13:52 32 -> ./
lrwxrwxrwx   1 root     root           5 Jun 19 13:52 64 -> amd64/
drwxr-xr-x   2 root     bin           10 Sep  2 15:09 amd64/
-rwxr-xr-x   1 root     bin        62868 Sep  2 15:09 cap.so*
-rwxr-xr-x   1 root     bin        97948 Sep  2 15:09 dyn.so*
-rwxr-xr-x   1 root     bin        92412 Sep  2 15:09 ehdr.so*
-rwxr-xr-x   1 root     bin        56248 Sep  2 15:09 phdr.so*
-rwxr-xr-x   1 root     bin        65428 Sep  2 15:09 shdr.so*
-rwxr-xr-x   1 root     bin        41760 Sep  2 15:09 str.so*
-rwxr-xr-x   1 root     bin        71268 Sep  2 15:09 sym.so*
-rwxr-xr-x   1 root     bin        47688 Sep  2 15:09 syminfo.so*

/usr/lib/elfedit/amd64:
total 1447
drwxr-xr-x   2 root     bin           10 Sep  2 15:09 ./
drwxr-xr-x   3 root     bin           13 Sep  2 15:09 ../
-rwxr-xr-x   1 root     bin        90928 Sep  2 15:09 cap.so*
-rwxr-xr-x   1 root     bin       130992 Sep  2 15:09 dyn.so*
-rwxr-xr-x   1 root     bin       125064 Sep  2 15:09 ehdr.so*
-rwxr-xr-x   1 root     bin        78368 Sep  2 15:09 phdr.so*
-rwxr-xr-x   1 root     bin        83968 Sep  2 15:09 shdr.so*
-rwxr-xr-x   1 root     bin        54240 Sep  2 15:09 str.so*
-rwxr-xr-x   1 root     bin        98592 Sep  2 15:09 sym.so*
-rwxr-xr-x   1 root     bin        70056 Sep  2 15:09 syminfo.so*

Installing them under /usr/lib/elfedit makes it obvious what application they support, while the use of the .so extension shows that they are sharable objects.

These elfedit objects are private modules delivered together with the elfedit utility, and not used by anything else. In addition, elfedit includes a module version in the handshake it completes with each module as the module is loaded. Therefore, adding version numbers to the file names would add no value, and is not done.

Thursday Jan 07, 2010

A New Mapfile Syntax for Solaris

In the previous entry, I discussed at length the problems and misfeatures of the original Solaris mapfile language that we inherited with System V Release 4 Unix. The original language was not designed to be extended, yet we've built on top of it for 20+ years. Although we could continue to do so, we have come to a point where a new language that retains the good features of the old, while addressing its shortcomings, would pay dividends.

My project to create a replacement mapfile language is in its final stages. I believe that the resulting syntax is simple, highly readable, and easily extended. Yet, the result is also highly evolutionary. I think anyone who knows the old language will have little difficulty understanding and quickly putting the new one to use. The implementation is complete, and I've used it to build a copy of the Solaris OSnet workspace with all of its mapfiles rewritten using the new syntax. Yesterday, the PSARC case for this work was approved, a significant milestone:

PSARC/2009/688 Human readable and extensible ld mapfile syntax

We're currently in a restricted build period leading up to the release of the next OpenSolaris, and this work will have to wait to integrate until after that, probably in the second half of February. However, the work is essentially done, and this seems like a good time to get some information about it into circulation.

The case materials for PSARC/2009/688 include a replacement mapfile chapter for the Solaris Linker and Libraries Guide. The old chapter will be preserved as an appendix for the benefit of those needing to decrypt existing mapfiles. Until this new material appears in the published manual, I hope you will find this HTML version helpful.

There is little reason to repeat the information in that document here. Instead, I would like to describe the underlying principles we used to design this new language, and to provide a series of examples in which a single item is expressed in both the old and new syntaxes. I think that these examples probably offer the fastest way for someone who already knows the old syntax to start using the new one. I will refer to the Linker and Libraries Manual frequently in this discussion, often using the abbreviation LLM.

Design, Testing, and Base Principles

The new syntax was developed in an iterative manner, starting with a paper design, written in the form of a replacement for the current LLM mapfile chapter, and progressing to implementation and testing with real mapfiles. With each iteration, I would take the lessons learned, debate and discuss the options with my fellow linker alien Rod Evans, and alter the design to address the issues and move forward. As might be expected, there were false starts, and surprises along the way, but eventually things solidified around the final design.

Once we had a final design and a working implementation of it, I modified our linker tests so that each test that uses a mapfile now does so twice, once with the old syntax, and once with the new. This has two important benefits:

  1. I can ensure that the new syntax can do anything the old one can (modulo a few obscure features not taken forward), by comparing the two resulting objects to make sure they are identical.

  2. The old syntax will continue to be used, so it will not fail due to bit rot.

As I iterated through the design process, I developed and refined the following list of requirements and observations that in turn guided subsequent iterations. Listed in no particular order:

  • We must offer full support for mapfiles in the original syntax. There can be no abrupt translation and cutover, as there are too many of these files in existence. We hope people will convert in time because the new language is better and easier, and because new features will only appear in the new, but there will be no forced conversion. This implies that we must have the concept of mapfile syntax version, with the old syntax being version 1, and the new version 2. Unfortunately, the default must be version 1 for backward compatibility. The link-editor must be able to cheaply and unambiguously determine which version it is reading from a given file before it has to actually interpret a statement from the file.

  • A given mapfile must contain only version 1 or version 2 syntax (no mixing within a single file). However, a given link-editor invocation can have more than one mapfile, and each mapfile is free to use either syntax without regard to the syntax used by the others.

  • The mapfile version must be a characteristic of the file itself (i.e. determined by the file contents), and not require a different ld command line option. Hence, the -M option is used for mapfiles of either version.

  • My study of other mapfile languages, previous efforts within the linker group, our code, and the mapfiles in the OSnet, all convince me that the current mapfile language is semantically at the right level. It need not be higher or lower level than it is, and the basic concepts are fine. The problem we need to solve with this project is primarily one of syntax.

  • A user familiar with the old mapfiles, upon encountering a mapfile written in the new syntax, should immediately be able to recognize it as a linker mapfile and understand its contents.

  • The scope/version part of the old mapfile syntax is pretty good, well liked, and widely used. We believe that the vast majority of existing mapfiles only use this part of the old syntax. A new syntax built using this as a starting point would help with the familiarity requirement above.

  • We don't see terseness as an inherently bad thing. However, the old syntax is too terse, and we're willing to be a bit more verbose in order to be a lot more readable. It should be possible to read almost any mapfile and understand its meaning without resorting to a reference manual to decode things.

  • The syntax for all directives should follow a single standard generalized form, rather than being invented ad-hoc for each directive. There should be none of the "a $ prefix means this here, but something else over there" that characterizes the old syntax.

  • The magic character nature of the original language must not be carried forward. All mapfile directives should be identified via a unique and mnemonic keyword as their first token (e.g. LOAD_SEGMENT, SYMBOL_VERSION, CAPABILITY, etc).

  • Special characters (e.g. =, *, ;, {}, etc) can be used in the style of a programming language like C, to define the core syntax of the language. For example, ';' can terminate statements, {} can group items, and = can be used to assign a value to something. However, they must not be used to identify directives as was done in the old version. Special characters are part of the underlying language, not of any particular directive, and must express the same concept wherever they are used.

  • It should be simple and easy to add an absolutely huge number of new directives, and/or to add a vast number of new options to existing directives, in a backward compatible manner. This is not because we want a huge language (we don't), but because failing to plan for expansion was a key failing of the old language, and we're not going to let that happen again.

  • A linker mapfile language should be something that a programmer can comfortably edit with a standard text editor, just like code, and the other things that a programmer edits. I am not anti-IDE, but I am anti-required-IDE.

    A couple of years ago, I did an XML based mapfile prototype, to determine if that would be a good direction for mapfiles. My conclusion is that it is not. XML is too verbose to be comfortably hand edited, and the XML boilerplate gets in the way. I don't think it is a good fit for a linker mapfile. However, XML does have some useful lessons to teach us, particularly that the syntax should be simple and regular. Although we will not use XML, it will be a good thing if the syntax is easily translatable to/from XML using nothing more than simple perl or python. This will serve to make sure we end up with a simple flexible language, and leave the door open to a future XML variant, should that prove interesting.

  • The new syntax should be able to produce an object identical to that produced by the old syntax, without going to extreme or confusing lengths to achieve it. However, it is acceptable to drop support for a small number of marginal features from the original (i.e. reserved segments), as long as it is possible to add them later should we miss them. The original syntax remains available for the few cases requiring dropped obscure features.

  • The translation from the old to the new syntax must be straightforward so that a programmer can convert their mapfiles to the new syntax without too much effort.

  • The internal concepts of segment, and entrance criteria list are good, and should be retained. However, unlike the version 1 syntax, this should all be done within the context of segment definition, rather than having separate segment definition, and section to segment assignment statements. Furthermore, there should be a separate segment directive for each type of supported segment, that only accepts attributes that make sense for segments of that type. This will eliminate a class of error possible in the old syntax, where you attempt to set attributes that are nonsensical for the segment type.

  • The link-editor contains a built in set of default segments with known names (text, data, bss, ...), and of entrance criteria that direct sections from input files to these segments in the output object. The version 1 syntax is not powerful enough to describe these built in items. As a matter of principle, the version 2 syntax should be able to do this. I view this as a matter of language completeness. (Note: The new Linker and Libraries manual referenced above contains an example of using the new syntax to define the built in segments and entrance criteria).

  • It would be nice to have a simple mechanism with the ability to conditionalize mapfile lines based on the target platform. In the old syntax, we've observed that frequently, there are multiple, largely identical, per-platform mapfiles that differ in minor ways. A common example is that of setting a different virtual address for a segment in 32 and 64-bit objects. Another is a symbol that only exists on one platform, for historic, or ABI related reasons. These multiple mapfiles represent needless clutter, and are an opportunity to introduce accidental inconsistencies into the varying objects.

    Something along the lines of what the C preprocessor allows with #if/#endif would fit the bill. However, we have no desire to have a macro facility, or for most of what CPP does. Just conditional input. If you want more, you can use a real preprocessor (like m4, or even cpp), but in the mapfile language, we want something extremely simple that just solves this one little problem.

New Syntax Overview
The full definition of the version 2 mapfile language can be found online. As mentioned earlier, I won't be repeating that information here. Instead, I'll provide a high level overview, with an eye towards showing how the wish list from the previous section was fulfilled.

A version 2 mapfile can contain two types of directive:

  • Control directives, which all start with the '$' character, and which control how the mapfile itself is interpreted.

  • Regular directives, which specify information regarding the output object being linked. Regular directives all start with a mnemonic name that identifies them, such as LOAD_SEGMENT, or SYMBOL_VERSION, and they use a uniform syntax.

As with the version 1 syntax, '#' is the comment character. A # on a line, and everything following it, is ignored by the link-editor, as are empty lines.

The first non-comment, non-empty, line in a version 2 mapfile must be the control directive:

$mapfile_version 2
Any mapfile that does not start with this line is interpreted as a version 1 mapfile, in which case the full original syntax is supported.
Control Directives
Aside from $mapfile_version, there are control directives that provide a conditional input facility that can be used to restrict specific mapfile lines to specific platforms:

$if expr
...
[$elif expr]
       ...
[$else]
...
$endif

The sole purpose of this facility is to allow you to write something like

$if _sparc && _ELF32
    32-bit sparc thing
$elif _x86 && _ELF64
    64-bit x86 thing
$else
    others
$endif

as a way to handle minor per-platform variations in an otherwise identical mapfile.

Users of C and related languages will instantly recognize this as being very similar to the C preprocessor, substituting '$' for '#'. That is true, but the similarity is very superficial:

  1. Mapfiles have no macro concept.

  2. The expressions evaluated by $if are purely logical (boolean true/false), with no concept of numeric values, and significantly simpler than those of CPP.

I had a few reasons for making '$' the character for control directives:

  1. To give C programmers a strong visual hint that they're not using CPP, and should have different expectations. As I mentioned earlier, if you need a macro pre-processor, Unix has many available that you can use outside the link-editor.

  2. To preserve '#' as the mapfile comment character.

  3. '$' has no previous meaning at the start of a mapfile statement in the original version 1 syntax.
Reasons 2 and 3 both relate to the fact that the link-editor reads the mapfile to determine which version of syntax is being used. By keeping the same comment character, and using a character for control directives not already used at the start of a statement by the old syntax, the link-editor can safely read and discard opening header comments, locate the first statement in the file, and unambiguously determine if the mapfile is using version 1 or version 2 syntax.

There are a small number of predefined values available for use in $if/$elif expressions:

_ELF32   _ELF64
_sparc   _x86
true

I expect these to be sufficient for nearly any mapfile. However, the $add control directive exists to define new values, and $clear to remove them. $add might be used to define convenient shorthand for longer expressions. For example, if you were writing a mapfile that had a large number of special cases involving the 64-bit x86 architecture, a definition like the following might be convenient:

$if _ELF64 && _x86
$add amd64
$endif

Lastly, the $error directive allows you to make your mapfiles safe against attempts to use them in an unexpected context. The text following the directive is issued as a fatal error by the link-editor, which then exits. I expect it to be used as follows:

$if _sparc
sparc thing
$elif _x86
x86 thing
$else
$error unknown platform
$endif

The error message includes the mapfile name, and the line number where the $error directive was encountered.

Regular Directives
The regular directives all specify object-related information.

They all share a common underlying abstract syntax, based on the idea of name-value pairs, with {} brackets used for grouping and for expressing sub-attributes.

All directives are terminated by the ';' character, as are attributes of directives.

Described informally, the simplest form is a directive name without a value:

directive;
The next form is a directive name with a value, or a whitespace separated list of values.
directive = value...;
The '=' operator is shown, which sets the given directive to the given value, or value list. The '+=' operator can be used to specify that the value is to be added to the current value, and similarly, the '-=' operator is used to remove values.

More complex directives manipulate items that take multiple attributes enclosed within {...} brackets to group the attributes together as a unit:

directive [name] {
        attribute [= value];
        ...
} [name...];

Such a directive can have a name before the opening '{', which is used to name the result of the given statement. As an example, this may be a segment, or version name. One or more optional names may also be allowed following the closing '}', before the terminating ';'. These names are used to express that the named item being defined has a relationship with other named items. For example, the SYMBOL_VERSION directive uses this for inherited version names.

Note that the format for attributes within this form follows the same pattern as that of the simple directive form.

Some directives may have attributes that in turn have sub-attributes. In such cases, the sub-attributes are also grouped within nested { ... } brackets to reflect this hierarchy:

directive [name] {
        attribute {
                subattribute [= value];
                ...
        };
        ...
} [name...];

Such nesting can be carried out to arbitrary depth, as required to express the meaning of a given directive. In practice, 1-2 levels of nesting are sufficient for the directives currently defined. I don't anticipate very deep nesting being necessary, but the flexibility to do so gives me confidence that the new syntax is sufficiently flexible, and that we will be able to expand it as necessary going forward.

Old and New Syntax Compared

I think that the best way to evaluate the new mapfile syntax is to show how one might express the same concepts using both. In the subsections that follow, I will show examples in the old syntax and then re-write them using the new. This won't be a comprehensive demonstration of every possible option, but will touch on all of the main features.
Segments/Sections (Elephant, Monkey, and Donkey Ride Again)
The Linker and Libraries Manual contains the following example, which comes from the original AT&T documentation. This example shows how segments are created and sections assigned to them using the old syntax:
elephant : .data : peanuts.o *popcorn.o; 
monkey : $PROGBITS ?AX; 
monkey : .data; 
monkey = LOAD V0x80000000 L0x4000; 
donkey : .data; 
donkey = ?RX A0x1000; 
text = V0x80008000;
I have re-written this example for the new replacement mapfile chapter, as it provides a direct comparison between the old and new syntaxes. The old chapter, and my replacement, both contain a description of what each line means. I'll reproduce the new version here, omitting the explanations:
$mapfile_version 2
LOAD_SEGMENT elephant {
        ASSIGN_SECTION {
                IS_NAME=.data;
                FILE_PATH=peanuts.o;
        };
        ASSIGN_SECTION {
                IS_NAME=.data;
                FILE_OBJNAME=popcorn.o;
        };
};
LOAD_SEGMENT monkey {
        VADDR=0x80000000;
        MAX_SIZE=0x4000;
        ASSIGN_SECTION {
                TYPE=progbits;
                FLAGS=ALLOC EXECUTE;
        };
        ASSIGN_SECTION {
                IS_NAME=.data;
        };
};
LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE;
        ALIGN=0x1000;
        ASSIGN_SECTION {
                IS_NAME=.data;
        };
};
LOAD_SEGMENT text {
        VADDR=0x80008000;
};
The original is extremely compact, but also very cryptic. The new version is considerably longer, as it uses our recommended style of one item per line, with consistent indentation to show structure. The improvement in readability is substantial. I believe that most programmers can read this and follow its meaning without having to look up the syntax. I'm quite sure the same cannot be said of the old one.

Also note that the new version can be significantly compacted without losing much readability, though there's not much value in doing so:

$mapfile_version 2
LOAD_SEGMENT elephant {
        ASSIGN_SECTION { IS_NAME=.data; FILE_PATH=peanuts.o };
        ASSIGN_SECTION { IS_NAME=.data; FILE_OBJNAME=popcorn.o };
};
LOAD_SEGMENT monkey {
        VADDR=0x80000000; MAX_SIZE=0x4000;
        ASSIGN_SECTION { TYPE=progbits; FLAGS=ALLOC EXECUTE };
        ASSIGN_SECTION { IS_NAME=.data };
};
LOAD_SEGMENT donkey {
        FLAGS=READ EXECUTE; ALIGN=0x1000;
        ASSIGN_SECTION { IS_NAME=.data; };
};
LOAD_SEGMENT text { VADDR=0x80008000 };
Output Section Ordering
The version 1 syntax uses the '|' character to specify output section ordering. The LLM gives this example:

segment_name | section_name1;
segment_name | section_name2;
segment_name | section_name3;

In the version 2 syntax, this mapfile would be written as

$mapfile_version 2
LOAD_SEGMENT segment_name {
        OS_ORDER = section_name1 section_name2 section_name3;
};		
Size Symbol Declarations
The version 1 syntax for creating a size symbol is:

segment_name @ symbol_name;

In the version 2 syntax, this is:

$mapfile_version 2
LOAD_SEGMENT segment_name { SIZE_SYMBOL = symbol_name };
File Control Directives

In the version 1 syntax, File Control Directives, indicated by the '-' character, are used to establish the versions that are available from shared objects linked to the object being created. In the new syntax, this is done using the DEPEND_VERSIONS directive.

For example, the following specifies that the version SUNW_1.20, as well as any version inherited by SUNW_1.20, is available for use by the object being created. It also forces SUNW_1.19 to be listed as a dependency, whether or not a symbol from SUNW_1.19 is actually used:

libc.so - SUNW_1.20 $ADDVERS=SUNW_1.19;

The same requirement can be expressed in the new syntax as:

$mapfile_version 2
DEPEND_VERSIONS {
        ALLOW =   SUNW_1.20;
	REQUIRE = SUNW_1.19;
};
Capabilities

Hardware and software capability directives are used to augment or replace the capabilities found in the input objects. For example, consider the following statements in the version 1 syntax:

hwcap_1 = mmx;		    # Add MMX to existing hardware capabilities
hwcap_1 = mmx $OVERRIDE;    # Replace existing hardware capabilities with MMX

sfcap_1 = addr32;	    # Add ADDR32 to existing software capabilities
sfcap_1 = addr32 $OVERRIDE; # Replace existing software capabilities with ADDR32

Rewritten using the version 2 syntax:

$mapfile_version 2
CAPABILITY {
	HW += mmx;          # Add MMX to existing hardware capabilities
	HW = mmx;           # Replace existing hardware capabilities with MMX

	SF += addr32;       # Add ADDR32 to existing software capabilities
	SF = addr32;        # Replace existing software capabilities with ADDR32
};
Symbol Versions

The syntax for symbol scope/versioning symbols is the least changed:

  • {} brackets are still used to group the symbols.

  • The version name precedes the opening '{'.

  • Inherited version names follow the closing '}'.

  • The syntax for current scope is unchanged.

  • The syntax for scope auto-reduction is unchanged.

  • The syntax for symbol names without attributes is unchanged.

The following things are different:

  • The 'SYMBOL_SCOPE' or 'SYMBOL_VERSION' keyword is added to the beginning, before the name and opening '{'.

  • The syntax for symbol attributes is changed.

For a large number of mapfiles, the only change necessary will be to add the $mapfile_version control directive to the file, and to put the keyword SYMBOL_SCOPE or SYMBOL_VERSION in front of each scope/version.
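For instance, a simple version 1 mapfile of my own invention:

VER_1.1 {
        global:
                foo;
        local:
                *;
};

requires only those mechanical changes to become a version 2 mapfile:

$mapfile_version 2
SYMBOL_VERSION VER_1.1 {
        global:
                foo;
        local:
                *;
};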

To show the difference in how symbol attributes are specified, consider the following directive in the old syntax that uses every possible symbol attribute. This is not a realistic example, as many of these options are not mutually compatible. However, it serves to highlight the full set of syntax differences:

VER_1.2 {
        foo = V0x12345678 S0x23
                FUNCTION DATA COMMON
                FILTER libfoo.so
                AUX libfoo.so
                PARENT EXTERN DIRECT NODIRECT INTERPOSE DYNSORT NODYNSORT;

        protected:
               *;
} VER_1.1;

Rewriting this in the version 2 syntax gives:

$mapfile_version 2
SYMBOL_VERSION VER_1.2 {
        foo {
                VALUE = 0x12345678; SIZE = 0x23;
                TYPE = FUNCTION;    TYPE = DATA;    TYPE = COMMON;
                FILTER = libfoo.so;
                AUX = libfoo.so;
                FLAGS = PARENT EXTERN DIRECT NODIRECT INTERPOSE DYNSORT NODYNSORT;
        };

        protected:
               *;
} VER_1.1;

Although the attribute syntax has changed, it remains quite similar to the original.

Ordered Input Sections
The compiler usually places functions within a single source file together in a single text section in the resulting object. Such an object is an all or nothing proposition — to use any one of these functions, the link-editor must take the entire section as a unit. The contents of such a section are fixed in place, and cannot be altered by the linker.

The Sun compilers support a command line flag, -xF, that causes each function to instead be placed in its own separate section. This gives the link-editor finer grained control, as it can omit unused functions while still pulling in the ones needed to complete the link. The link-editor also has the opportunity to arrange these functions in arbitrary order relative to each other, under user control, specified via the mapfile.

The documentation for the original version 1 syntax in the Linker and Libraries Manual gives this example:

text = LOAD ?RXO;
text : .text%foo
text : .text%bar
text : .text%main

The result of using this mapfile will be for foo(), bar(), and main() to be placed adjacent to each other at the head of the segment, in that order. The ordering is implicit in the order in which the three section to segment statements (':' lines) are given in the mapfile.

The version 2 syntax accomplishes this reordering as follows:

$mapfile_version 2
LOAD_SEGMENT text {
        ASSIGN_SECTION bar  { IS_NAME = .text%bar };
        ASSIGN_SECTION main { IS_NAME = .text%main };
        ASSIGN_SECTION foo  { IS_NAME = .text%foo };
        IS_ORDER = foo bar main;
};
Conditional Input
This example comes from the linker tests. We have a test that sets an address for the text segment, using a different address for each of 32-bit sparc, 64-bit sparc, 32-bit x86, and 64-bit x86. As a result, we have four mapfiles:

mapfile-sparc
text = V0x40000;

mapfile-sparcv9
text = V0x100400000;

mapfile-i386
text = V0x8080000;

mapfile-amd64
text = V0x480000;

The version 2 syntax can employ conditional input to represent all of these differing values within a single mapfile, simplifying the test makefile. The $error control directive is used to catch cases where this test is run on a new, previously unknown platform, and provide a meaningful error to the developer:

$mapfile_version 2

$if _sparc

$if _ELF64
LOAD_SEGMENT text { VADDR = 0x100400000 };
$else
LOAD_SEGMENT text { VADDR = 0x40000 };
$endif

$elif _x86

$if _ELF32
LOAD_SEGMENT text { VADDR = 0x8080000 };
$else
LOAD_SEGMENT text { VADDR = 0x480000 };
$endif

$else
$error unknown platform
$endif

Wednesday Jan 06, 2010

The Problem(s) With Solaris SVR4 Link-Editor Mapfiles

Until recently, I've never really felt that I fully understood the mapfile language used by the Solaris link-editor (ld), despite having used it for years. It's a terse and arbitrary language that does not encourage intuition, full of special cases and odd twists. No matter how many times you read the mapfile chapter of the Linker and Libraries Guide, you're left with a sneaking suspicion that some things just don't fit, or that you've missed something.

Lately, I've been working on a new mapfile syntax to replace this original language, which Solaris inherited as part of its System V Release 4 origins. In the process, I've examined every line of the manual, and of the code, many times. I believe I understand it all the way down now, and I'd like to record some of what I've learned here. My main reason for doing this is as justification for undertaking a replacement language. Oddly enough though, I believe that this information will make it easier to decode, use, and write these older mapfiles. Once you understand the quirks, you can work around them.

This discussion will not cover the new syntax — that will come in a subsequent installment. However, I do want to reassure you that full support for the original mapfile language will remain in place. We're not about to force anyone to rewrite 20+ years worth of mapfiles. The goal is to freeze the old support in its current form, provide a better alternative, and gradually move the world to it over a period of years.

Terse To A Fault / Not Extensible

The core of the old syntax is simple enough: You can create segments, set attributes for them, and assign sections to them. One can easily believe that it seemed adequate and reasonable to its creators. Their primary design decision was to make SVR4 Mapfiles a magic character language. The purpose of a given statement is specified using special characters (=, :, |, @). Options to these statements are further distinguished from each other using other special characters (?, $, ...), or single letter prefixes.

Languages face continuous pressure to expand and provide new features. The initial language may have seemed spare and elegant, but it failed to provide a scalable mechanism for expansion, and this has proven to be a terrible weakness:

  • There are only a limited number of magic characters available, mainly on the top row of the keyboard.

  • Only a few of these characters have mnemonic meanings that make intuitive sense in the context of object linking. And those that do have such a meaning can easily imply more than one thing. For example, in the SVR4 syntax, '=' means segment creation, and ':' means section to segment assignment. The reverse would make just as much sense: ':' could have meant segment creation, and '=' could have meant assign sections to segments. No one would have found this less intuitive. I used to constantly get these backwards, and would have to look at the manual or another mapfile to remember which character has which meaning. That's pretty sad, considering that '=' is probably the most mnemonic character in the language.

  • After the first few good characters (=, :) are taken, the remaining assignments become rather arbitrary. For instance, the | character is used to specify section order, while @ specifies the creation of a "segment size symbol". Neither of these evokes meaning. To the extent that they do, it's a negative effect, such as the fact that '|' evokes shell pipelines, but means nothing like that in mapfiles.

  • Some characters are overloaded, having different meanings in different contexts.

Most of these characters have no mnemonic value. The human mind struggles to remember what they stand for, resulting in frequent trips to the reference manual to decode them. The problem gets worse as the number of supported features grows, and is exacerbated by the fact that most people only read mapfiles on an occasional basis. The syntax is not reinforced by constant use the way some other terse languages are.

SVR4 Mapfile Syntax As Evolution

For me, the best way to understand the SVR4 mapfile syntax has been to start with its original form, and then consider how and where each subsequent feature has been added.

In the late 1980's, starting around 1986 or so, Unix System V Release 4 (SVR4) was being developed at AT&T. They created a new linking format (ELF), to resolve the inadequacies of the previous format (COFF) used in SVR3. SVR3 had a rather elaborate mapfile syntax. Rather than stay with this syntax, the SVR4 people designed a new, smaller, and simpler replacement. We don't know their reasons for this decision, and can only guess that they didn't think the SVR3 language was necessary, or a good fit with their new ELF based link-editor. As an aside, while researching different mapfile languages during the design of the replacement syntax for Solaris, I discovered that there is a notable similarity between the SVR3 mapfile language and GNU ld linker scripts. SVR3 lives on, as does much of Unix, in its influence on later systems.

The original SVR4 language was very small, consisting of four different possible statements. All of these have the form:

name magic ... ;
where name is a segment name, and magic is a character that determines what the directive does:

  • Segment Definition (=): Create segments and/or modify their attributes

  • Section to Segment Assignment (:): Specify how sections are assigned to segments

  • Section-Within-Segment Ordering (|): Specify the order in which output sections are ordered within a segment

  • Segment size symbols (@): Create absolute symbols containing the final segment size in the output object.

Solaris started with the original SVR4 code base. Since then, Sun has added three more top level statements:

  • Symbol scope/version definition ({}): Assign symbols to versions

  • File Control Directives (-): Specify which versions can be used from sharable object dependencies

  • Hardware/Software capabilities (=): Augment/override the capabilities from objects.

File Control Directives and Capabilities use the same form of syntax as the original four directives. Symbol scope/version blazed a new path, using {} to group symbol names within:
[version-name] {
    scope:
	symbol [= ...];
	*;
} [inherited-version-name...];
In the following subsections, I will present a brief description of each of these top level mapfile statements, and discuss the various odd or unfortunate aspects of each. If you're not familiar with the mapfile language, it may be helpful to have the Linker and Libraries Guide available as well.
Segment Definition (=)
Segment definition statements can be used to create a new segment, or to modify existing ones:
segment_name = segment-attribute-value... ;
If a segment-attribute-value is one of (LOAD, NOTE, NULL, STACK), then it defines the type of segment being created:
  • LOAD segments are regions of the object to be mapped into the process address space at runtime, represented by PT_LOAD program header entries. LOAD segments were part of the original AT&T language.

  • NOTE segments are regions of the object to contain note sections, represented by PT_NOTE program header entries. NOTE segments were part of the original AT&T language.

  • NULL segments are a concept added by Sun. As far as we can determine, they were created in order to have a type of segment that won't be eliminated from the output object if no sections are assigned to them. Their actual meaning depends on whether the ?E flag is set or not, as described below.

  • STACK "segments" were added by Sun relatively recently in order to allow modifying the attributes of the process stack. This is not really a segment at all: It does not specify a memory mapping, and sections cannot be assigned to it. Rather, it allows access to the PT_SUNWSTACK program header, setting the header flags to specify stack permissions. Only the flags can be set on a stack "segment". The representation of this concept as a segment was done to simplify the underlying implementation of the feature.

If a segment-attribute-value starts with the '?' character, then it is a segment flag:

?R, ?W, ?X
Set the Read (PF_R), Write (PF_W), and eXecute (PF_X) program header flags, respectively. This is a feature of the original SVR4 syntax, and is self explanatory.
?E
The Empty flag can be used with either a LOAD, or NULL segment:

  1. Applied to a LOAD segment, the ?E flag creates a "reservation". This is an obscure and little used feature by which a program header is written to the output object, "reserving" a region of the address space for use by the program, which presumably knows how to locate it and do something useful. Sections cannot be assigned to such a segment.

  2. Applied to a NULL segment, the ?E flag adds extra PT_NULL program headers to the end of the program header array. This feature is useful for post optimizers which rewrite objects to add segments, and need a place to create corresponding PT_LOAD program headers for them.

  3. The ?E flag is meaningless when applied to NOTE or STACK segments.

The Empty flag was added by Sun. It should be noted that ?E does not correspond to an actual program header flag. Its treatment as a flag in the mapfile syntax, rather than as a different sort of option (using a magic character other than '?' as a prefix), was primarily a matter of implementation convenience.

?N
Normally, the link-editor makes the ELF and program headers part of the first loadable segment in the object. The ?N flag, if set on the first loadable segment, prevents this from occurring. The headers are still placed in the output object, but are not part of a segment, and therefore not available at runtime. It is meaningless to apply ?N to a non-LOAD segment.

This flag was added by Sun. As with ?E, it does not correspond to a real program header flag. Its representation as a flag is a matter of implementation convenience.

?O
This is another flag, added by Sun, that does not correspond to a real program header. It is used to control the order in which sections from input files are placed within the output sections of the segment. Sections are assigned to segments via the ':' mapfile directive. Normally, sections are added in the order seen by the link-editor. When ?O is set, the order of the input sections matches the order in which these assignment directives are found in the mapfile.

This feature was added to support the use of the -xF option to the compiler. That option causes each function to be placed in its own section, rather than all of the functions from a given source file going into a single generic text section. Then, their order can be specified using a mapfile, as with this example taken from the Linker and Libraries Manual:

text = LOAD ?RXO;
text : .text%foo
text : .text%bar
text : .text%main
	
The result of using this mapfile will be for foo(), bar(), and main() to be placed adjacent to each other at the head of the segment, in that order. This feature can be used to put routines that call each other close together, to enhance cache performance. It is worth noting that it was also necessary to set the R and X flags, even though they already are RX on a text segment. This is a quirk of the SVR4 syntax: Any change to the flags replaces the previous value, so we have to specify the flags we want to keep (RX) as well as the one we want to set (O).

A segment-attribute-value can also be a numeric value, prefixed with one of the letters (A, L, P, R, V), to set the Alignment, Maximum length, Physical address, Rounding, or Virtual address of a LOAD segment, respectively.
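For instance, a contrived segment definition (not taken from any real build) that combines a type, numeric attributes, and flags:

text = LOAD V0x10000 A0x1000 ?RX;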

The syntax for segment definition suffers from a variety of issues:

  • Most of the options can only be applied to specific segment types, primarily to type LOAD. However, the syntax does nothing to prevent you from trying to apply attributes that are invalid for the type of segment in question. For instance, you might try to assign an address to a STACK segment. The link-editor contains a fair amount of code dedicated to detecting such uses and issuing errors. A better syntax would not allow you to specify nonsensical options in the first place.

  • STACK "segments" are not really segments at all, but simply a convenient way to manipulate a specific program header. This is confusing, and can lead the user to believe they can control aspects of the stack (such as it's address) that are not user settable.

  • An ELF object can only have one PT_SUNWSTACK program header. The segment notation used by mapfiles requires the user to give their stack "segment" a name, perhaps causing a user to think they might be able to create more than one stack by specifying more than one mapfile directive using different names. The link-editor contains code dedicated to catching this and turning it into an error.

  • The ?E, ?N, and ?O flags are confusing: they do not correspond directly to ELF program header flags, they are terse, and their semantics are obscure.

  • There is a syntactic ambiguity with capability directives, which use the same magic character (=) as segment definitions, but which are otherwise unrelated. See the discussion of capability directives below for details.
Section to Segment Assignment (:)
The link-editor contains an internal list of entrance criteria each of which contains section attributes. To place a section in an output segment, it compares the section to each item in this list. If a section matches all of the items in a given entrance criteria, then the section is assigned to the corresponding segment, and the search ends.

Sections can be assigned to a specific segment via the following syntax, which uses the ':' magic character. The result of such a statement is to place a new entrance criteria on the internal list:

segment_name : section-attribute-value... [: file-name...];

If a section-attribute-value starts with a '$' prefix, then it specifies a section type. This can be one of ($PROGBITS, $SYMTAB, $STRTAB, $REL, $RELA, $NOTE, $NOBITS).

If a section-attribute-value starts with a '?' prefix, then it specifies one or more section header flags: A (SHF_ALLOC), W (SHF_WRITE), or X (SHF_EXECINSTR). To specify that a given flag must not be present, you can prepend it with the '!' character.

A section-attribute-value that does not start with a '$' or '?' prefix is a section name.

If there is a second colon (':') character on the line, then all items following it are file paths, and if any of these match the path for the input file containing the section to be assigned, it is considered to be a match. If the path name is prefixed with a '*', then the basename of the path is compared to the given name rather than the entire path.
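As a contrived example combining these elements, the following assigns allocable, non-writable progbits sections named .data, contributed by any file whose basename is peanuts.o, to the segment elephant:

elephant : $PROGBITS ?A!W .data : *peanuts.o;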

Odd aspects of section to segment assignment:

  • The list of section types is incomplete. ELF defines many more section types than the 7 listed above. Apparently this feature isn't used, at least with types outside of $PROGBITS or $NOBITS, because I've never heard of a complaint about it. In any event, all ELF section types should be supported.

  • The use of a '*' prefix to mean 'basename' in file paths is odd. Conditioned by common shell idioms, any Unix user would expect a '*' within a filepath to be a standard glob wildcard, expanded as the shell would. You'd expect it to match an arbitrary number of characters, and to be usable in the middle of the name, not just at the beginning. Another reasonable assumption would be that it is a regular expression, with the implication that other regular expression features are also possible. None of that applies: A '*' prefix means 'basename', and only if it is the first character.

  • The fact that segment definition, and the assignment rules, are two separate statements creates the potential for a class of error where the user attempts to assign sections to a segment that cannot accept them (e.g. STACK). Having lured you in with a syntax that suggests something that isn't possible, the link-editor contains code to detect and refuse such assignments. Better syntax could prevent this.
Section-Within-Segment Ordering (|)
Section within segment ordering can be used to cause the link-editor to order output sections within a segment in a specified order. The specification is done by section name:
segment_name | section_name1;
segment_name | section_name2;
segment_name | section_name3;

The named sections are placed at the head of the output section in the order listed.

One might expect to be able to put more than one section on a line (you can't), and the use of '|' may cause a Unix user to make some invalid assumptions about shell pipes, or the C bitwise OR operator. However, there's nothing really terrible about this directive.

It's also not terribly useful — I'm not sure I've ever seen it used outside of our link-editor tests.

Segment size symbols (@)
The '@' magic character is used to create an absolute symbol containing the length of the associated segment, in bytes:

segment_name @ symbol_name;

There is no corresponding mechanism to create a symbol containing the starting address of a segment, so it is debatable how useful the length is. Perhaps the user is expected to know the name of the first item (possibly a function in a text segment) and use that. In any case, we've never seen this feature used outside of our own tests.
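For completeness, here is how such a symbol would be consumed from C, assuming a hypothetical mapfile line 'text @ _text_size;'. An absolute symbol has no storage; its value is recovered by taking its address:

#include <stdio.h>

extern char _text_size;         /* hypothetical size symbol created by '@' */

int
main(void)
{
        /* an absolute symbol's value is obtained by taking its address */
        printf("text segment is %lu bytes\n", (unsigned long)&_text_size);
        return (0);
}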

The use of '@' carries no useful mnemonic information, but that's not unique to this particular directive.

Symbol Scope/Version Definition ({})
Symbol scope/versioning directives allow you to build objects that group symbols into named versions. When objects are built, they record the versions they require from dependencies, and at runtime, the runtime linker ld.so.1 validates that the necessary versions are present. Versioning was introduced in Solaris 2.5, and was later adopted (with extensions) by the GNU/Linux developers in a manner compatible with Solaris. This is easily the most successful part of the mapfile language, and has proven to be a very useful feature. Today, most mapfiles we encounter contain only symbol versioning.

Scope/versioning definitions have the form:

[version-name] {
    scope:
	symbol [= ...];
	*;
} [inherited-version-name...];
If no version-name is specified, it's a simple scope operation, where global names are assigned to the unnamed "global" version. If a version name is given, the symbols within are assigned to that version, and the version can specify other versions that it inherits from.

Within the {} braces, one can encounter three different types of item:

  1. A symbol scope name (default/global, eliminate, exported, hidden/local, protected, singleton, symbolic), followed by a colon. These statements change the current scope, which starts as global, to the one specified. Any symbols listed after a scope declaration receive that scope, until changed by a following scope definition.

  2. A '*', which is called the scope auto-reduction operator. All global symbols in the final object not explicitly listed in a scope/version directive are given the current scope, which must be hidden/local, or eliminate. Auto-reduction is a powerful tool for preventing implementation details of an object from becoming visible to other objects.

  3. A symbol name, optionally followed by a '=' operator and attributes, finally terminated with a ';'.

The attributes that are allowed for a symbol are:

  • A numeric value, prefixed with a 'V', giving the symbol value.

  • A numeric value, prefixed with an 'S', giving the symbol size.

  • One of 'FUNCTION', 'DATA', or 'COMMON', specifying the type of the symbol.

  • 'FILTER', or 'AUX', specifying that the symbol is a standard or auxiliary filter, followed by the name of the object supplying the filtee. The two tokens are separated by whitespace.

  • A large number of flags that specify various attributes: 'PARENT', 'EXTERN', 'DIRECT', 'NODIRECT', 'INTERPOSE', 'DYNSORT', 'NODYNSORT'.
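To make these concrete, here is a small made-up example: a version exporting foo as a function-typed filter on libfoo.so, and bar as a data symbol with an explicit size:

VER_1.1 {
        global:
                foo = FUNCTION FILTER libfoo.so;
                bar = DATA S0x40;
};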
The scope/symbol directives are by far the most successful part of the SVR4 mapfile language, and there is relatively little to complain about. However, there are aspects of the way the symbol attributes work that could certainly be improved, caused in my opinion by an evident attempt to fit things stylistically with the rest of the language:
  • The use of 'V' and 'S' prefixes for value and size, contrasted with the full keywords 'FILTER' and 'AUX' is odd.

  • The lack of some sort of connecting syntax between FILTER/AUX and the associated object is confusing, and leads to certain confusing errors. For example, a statement like 'filter function' is probably intended to say that the symbol is a function, and also a filter, but will be interpreted as being a filter to a library named 'function', drawing no error from the link-editor. A syntax such as 'filter=object' might have been better.

  • The syntax does not distinguish between the type and flag values. This is generally not a problem, but a syntax that did would be more precise, and possibly helpful.
File Control Directives (-)
File control directives allow you to tell the link-editor to restrict the symbol versions available from a sharable object dependency being linked to the output object. The most common use for this feature is to limit your object to a set of functionality associated with a specific release of the operating system:

shared_object_name - version_name [version_name ...];

where version_name is the name of versions found within the shared object.

When a given shared object is specified with one of these directives, the link-editor will only consider using symbols from the object that come from the listed versions, or the versions they inherit. The link-editor will then make the versions actually used dependencies for the output object.

Alternatively, a version_name can be specified using the form:

$ADDVERS=version_name

In this case, the specified name is made a dependency for the output object whether or not it was actually needed by the link.
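For example, using a hypothetical library and versions, the following restricts the link to the symbols in version SUNW_1.1 of libfoo.so, and additionally forces SUNW_1.2 to be recorded as a required version whether or not it is used:

libfoo.so - SUNW_1.1 $ADDVERS=SUNW_1.2;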

There are some odd aspects to file control directives:

  • The '-' magic character has no mnemonic value (as usual).

  • The use of the '$' character in $ADDVERS, to create a type of optional attribute, is unusual, and represents an overloading of '$' relative to other directives.

  • The use of '$' aside, the $ADDVERS= notation is unusual relative to the rest of the language, which might have used another magic character instead.
Hardware/Software Capabilities (=)
The hardware and software capabilities of an object can be augmented, or replaced, using mapfile capability directives:
hwcap_1 = capitem...;
sfcap_1 = capitem...;
where the values on the right hand side of the '=' operator can be one of the following:

  • The name of a capability

  • A numeric value, prefixed with a 'V' to indicate that it is a number rather than a name.

  • $OVERRIDE, instructing the link-editor that the capabilities specified in the mapfile should completely replace those provided by the input objects, rather than add to them.

Perhaps the most unfortunate fact about the capability directives is that they use the '=' magic character, which normally indicates a segment definition. This has some odd ramifications:

  • The names 'hwcap_1' and 'sfcap_1' have been stolen from the segment namespace, and cannot be used to name segments.

  • As new capabilities are added to the system, it may become necessary to introduce new capability directives. For example, it is clear that 'hwcap_2' will soon be needed on X86 platforms. When this happens, the new name will also be taken from the segment namespace. Existing mapfiles using that name for a segment will break. One might reasonably expect that there are no such mapfiles, but that is a poor justification.

  • These names are case sensitive. Although segments cannot be named 'hwcap_1', or 'sfcap_1', they can have these names using any other case. For instance, 'HWCAP_1' will be interpreted as a segment, not as a capability.

One can understand the temptation to reuse '=' for capabilities, instead of picking some other unused magic character. Which one would you pick to convey the idea of 'capability'? I don't find any of the available characters (%, ^, &, ~) compelling in the least. Still, this overloading of '=' is a problem.

As a demonstration of how very similar mapfile lines can have wildly different meanings, consider the following example, which uses the debug feature of the link-editor to show us how mapfile lines are interpreted:

% cat hello.c
#include <stdio.h>

int
main(int argc, char **argv)
{
        printf("hello\\n");
        return (0);
}
% cat mapfile-cap
HwCaP_1 = LOAD ?RWX;		# A segment
hwcap_1 = V0x12;                # A capability
% LD_OPTIONS=-Dmap cc hello.c -Mmapfile-cap
debug: 
debug: map file=mapfile-cap
debug: segment declaration (=), segment added:
debug: 
debug: segment[3] sg_name:  HwCaP_1
debug:     p_vaddr:      0           p_flags:    [ PF_X PF_W PF_R ]
debug:     p_paddr:      0           p_type:     [ PT_LOAD ]
debug:     p_filesz:     0           p_memsz:    0
debug:     p_offset:     0           p_align:    0x10000
debug:     sg_length:    0
debug:     sg_flags:     [ FLG_SG_ALIGN FLG_SG_FLAGS FLG_SG_TYPE ]
debug: 
debug: hardware/software declaration (=), capabilities added:
debug: 

Other misfeatures of the capability syntax are the overloading of the '$' prefix to indicate an instruction to the link-editor ($OVERRIDE), and the use of the 'V' prefix in front of numeric values. These prefixes have different, though similar, meanings elsewhere, which makes the language hard to understand.

Mapfile Magic Character Decoder Ring

Another strategy for understanding SVR4 mapfiles is to organize things by magic character.

Most mapfile directives have the form:

name magic ... ;
where name is generally (but not always) a segment name, and magic is a character that determines what the directive does.

The following is a comprehensive list, in no particular order, of the magic characters and related syntactic elements used in the current SVR4 mapfile language:

Character / Meaning
=
  1. Create a new segment, or modify the attributes of an existing one, as long as the segment is not named 'hwcap_1', or 'sfcap_1'.

  2. If '=' is used to reference a "segment" named 'hwcap_1', or 'sfcap_1', then this is a hardware or software capabilities directive, and not a segment directive at all. This means that you cannot create a segment named 'hwcap_1', or 'sfcap_1'. However, these names are case sensitive, so you can create segments of those names using any other case. For example, HWcap_1 would name a segment rather than refer to hardware capabilities.

  3. Within a symbol scope/version, associate a symbol name to one or more following attributes.

  4. Within a "File Control Directive", associate the $ADDVERS option (a use of the '$' magic character) with a version name, causing the given version to be added to the output object even if it is not directly used.
:

  1. Assign sections to segments.

  2. If used twice in a section to segment assignment directive, the second one indicates that the items following it are not section names, as they have been to that point, but are file paths from which the previous sections can come.
| Specify output section ordering within a segment. It does not mean "pipe" as it would in the shell, nor does it mean 'OR' as it would in a C-style programming language.
@ Create a "size symbol" for the specified segment, containing the length of the segment. It is not clear how useful these are, since there is no corresponding "address symbol" that might be used to locate the start of the segment for which we have a size. We've never seen it used.
- A "File Control Directive", used to specify the version definitions to be used from the sharable objects linked to the output object.
{ } Grouping, used to contain the symbols within a scope/version directive.
; Terminates all directives, similar to its purpose in the C programming language.
*

  1. Following the second ':' character in a section to segment assignment directive (:), as a prefix to the file names specified following the ':', specifies that the link-editor should compare the basename of the file providing the input section to the prefixed string, rather than comparing the full file path. The use of '*' in a file path is easily confused with the Unix shell "glob" wildcard character. However, this use in the mapfile is not a glob, and only has its special basename meaning if seen as the first character in the name.

  2. Within a symbol scope/version directive, the scope auto-reduction operator, which causes all symbols not otherwise assigned to a symbol version to be reduced to the current scope, which must be local/hidden, eliminate, or protected.
?
  1. Within a segment directive (=), indicates segment flags: 'E' (Empty), 'N' (Nohdr), 'O' (Order), 'R' (Read), 'W' (Write), and 'X' (eXecute). Note that only RWX represent real program header flags. The others (ENO) are not really segment flags but communicate segment related information to the link-editor. This is an example of overloading — they are "flag like", so it was convenient to treat them as flags rather than use some other magic character to represent them.

  2. Within a section to segment assignment directive (:), indicates section flags: 'A' (Allocable), 'W' (Writable), and 'X' (eXecinstr). Within these flags, the '!' character can be used to specify that the following flag must not be set in the candidate section.
$

  1. Within a section to segment mapping directive (:), a prefix used to indicate that the name following is a section type (PROGBITS, SYMTAB, etc) rather than a section name.

  2. Within a "File Control Directive", a prefix used to indicate that the following name is a special option to be applied to a version. Currently, the only such option is $ADDVERS.
! Within a section to segment mapping directive (:), and within the specification of section flags (?), negates the meaning of a given flag, indicating that the flag must not be set.
A
  1. When used in a segment (=) directive, as a prefix to a numeric value, indicates that the number is a segment alignment.

  2. When used in a section to segment assignment (:) flag value (?), specifies the SHF_ALLOC section header flag.
E When used within a segment definition (=) for a flag (?) value, alters the meaning of LOAD or NULL segments. When applied to a LOAD segment, ?E specifies that this segment is to be reserved (Empty). No sections are assigned to it, but a program header is generated, and at runtime the region is available to the running program to use. This is an obscure and little used feature. When applied to a NULL segment, reserves an additional PT_NULL program header, for the use of post optimizers that will add segments to the object. Note that this "flag" does not correspond to an actual program header flag.
L When used in a segment (=) directive, as a prefix to a numeric value, indicates that the number is a maximum segment size.
N When used within a segment definition (=) flag value (?): By default, the first segment in an object, which is usually the text segment, contains the ELF header found at the start of the file, making the ELF header available to the runtime linker. The ?N flag specifies that if this segment is the first in the file, it should omit the ELF header. Note that this "flag" does not correspond to an actual program header flag, and that it has no meaning if the segment does not end up being first.
O When used within a segment definition (=) flag value (?): Input sections assigned to the segment should be ordered within their output sections in the order that section assignment directives (:) for the segment are encountered within the mapfile. Note that this "flag" does not correspond to an actual program header flag.
P When used in a segment (=) directive, as a prefix to a numeric value, indicates that the number is a physical address.
R

  1. When used in a segment (=) directive, as a prefix to a numeric value, indicates that the number is a segment rounding value.

  2. When used within a segment definition (=) flag value (?), specifies the READ (PF_R) program header flag value.
S When used in a symbol scope/version directive, as a prefix to a numeric value in a symbol's attributes, specifies that the number provides the symbol size (st_size).
V
  1. When used in a segment (=) directive, as a prefix to a numeric value, indicates that the number is a virtual address.

  2. When used in a hwcap_1 or sfcap_1 capabilities definition (=), as a prefix to a value that has not been recognized as a hardware or software capability name, indicates that the item is a number.

  3. When used in a symbol scope/version directive, as a prefix to a numeric value in a symbol's attributes, specifies that the number provides the symbol value (st_value).
W

  1. When used within a segment definition (=) flag value (?), specifies the WRITE (PF_W) program header flag value.

  2. When used in a section to segment assignment (:) flag value (?), specifies the SHF_WRITE section header flag.
X

  1. When used within a segment definition (=) flag value (?), specifies the EXECUTE (PF_X) program header flag value.

  2. When used in a section to segment assignment (:) flag value (?), specifies the SHF_EXECINSTR section header flag.

Time For A Fresh Start

The original mapfile language inherited from AT&T was no beauty, but it was good enough to go forward with. We've continued to build on it for 2 decades for a variety of good reasons, primarily that it was getting the job done, that it wasn't preventing progress, and there has been plenty of other work to do. The sort of users who write mapfiles are up to dealing with a little ugliness, and perhaps have been a bit more tolerant than the situation deserves. The "mapfile situation" has been a concern for years. Put simply, it is not asking too much that a programmer with a reasonable (not necessarily deep) grasp of linker concepts be able to read and understand the intent of a mapfile without resorting to a reference manual or linker source code. Nor should it be a difficult chore to fit a new feature into the language cleanly.

The mapfile syntax issue usually comes up in the context of wanting to add a new feature, and despairing at the ugliness of what that implies. One is usually in the middle of solving a considerably more focused and urgent problem, and not willing or able to take an extensive detour to replace underlying infrastructure. And so we've moved forward, adding one thing, and then another, with the situation slowly, but not catastrophically, getting worse each time. The current state of our mapfile language is such that we shy away from adding new features, and we are aware of other projects that may need some link-editor support in the near future. The right infrastructure simplifies everything it touches, and as we know all too well, the reverse is also true.

We've known for quite a while that eventually it would be necessary to tackle this issue systematically and produce a new mapfile language for Solaris. That time has finally arrived.

Tuesday Oct 21, 2008

GNU Hash ELF Sections

The ELF object format is used by several different operating systems, all sharing a common basic design, but each sporting their own extensions. One of the nice aspects of ELF's design is that it facilitates this, defining a common core, as well as reserving space for each implementation to define its own additions. Those of us in the ELF community try to stay current with each other, as a source of new ideas and inspiration, in order to avoid reinventing the wheel, and out of curiosity.

Recently, the GNU linker developers added a new style of hash section to their objects, and I've been learning about it in fine detail. Having done the work, it only makes sense to write it down and share it.

This posting describes the layout and interpretation of the GNU hash section, a new style of hash section for ELF objects that offers better performance than the original SYSV hash.

The information here was gathered from several public sources.

I did not look at the GNU binutils source code to gather this information.

If you spot an error, send me email (First.Last@Oracle.COM, where First and Last are replaced with my name) and I'll fix it.

Hash Function

The GNU hash function uses the DJB (Daniel J Bernstein) hash, which Professor Bernstein originally posted to the comp.lang.c usenet newsgroup years ago:
uint32_t
dl_new_hash (const char *s)
{
        uint32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h;
}
If you search for this algorithm online, you will find that the hash expression
h = h * 33 + c
is frequently coded as
h = ((h << 5) + h) + c
These are equivalent statements, replacing integer multiplication with a presumably cheaper shift and add operation. Whether this is actually cheaper depends on the CPU used. There used to be a significant difference with older machines, but integer multiplication on modern machines is very fast.

Another variation of this algorithm clips the returned value to 31-bits:

return h & 0x7fffffff;
However, GNU hash uses the full unsigned 32-bit result.

The GNU binutils implementation utilizes the uint_fast32_t type for computing the hash. This type is defined to be the fastest available integer machine type capable of representing at least 32-bits on the current system. As it might be implemented using a wider type, the result is explicitly clipped to a 32-bit unsigned value before being returned.

static uint_fast32_t
dl_new_hash (const char *s)
{
        uint_fast32_t h = 5381;

        for (unsigned char c = *s; c != '\0'; c = *++s)
                h = h * 33 + c;

        return h & 0xffffffff;
}

Dynamic Section Requirements

GNU hash sections place some additional sorting requirements on the contents of the dynamic symbol table. This is in contrast to standard SVR4 hash sections, which allow the symbols to be placed in any order allowed by the ELF standard.

A standard SVR4 hash table includes all of the symbols in the dynamic symbol table. However, some of these symbols will never be looked up via the hash table:

  • LOCAL symbols, unless referenced by a relocation (on some architectures)

  • FILE symbols

  • For sharable objects: All UNDEF symbols

  • For executables: Any UNDEF symbols that are not referenced by a PLT.

  • The special index 0 symbol (a special case of UNDEF)

Omitting these symbols from the hash table section has no impact on correctness, and will result in less hash table congestion, shorter hash chains, and correspondingly better hash performance.

With GNU hash, the dynamic symbol table is divided into two parts. The first part receives the symbols that can be omitted from the hash table. GNU hash does not impose any specific order for the symbols in this part of the dynamic symbol table.

The second part of the dynamic symbol table receives the symbols that are accessible from the hash table. These symbols are required to be sorted by increasing (hash % nbuckets) value, using the GNU hash function described above. The number of hash buckets (nbuckets) is recorded in the GNU hash section, described below. As a result, symbols which will be found in a single hash chain are adjacent in memory, leading to better cache performance.
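As a minimal sketch of how a link-editor might produce this ordering (the types and names here are mine, not from any real linker):

#include <stdint.h>
#include <stdlib.h>

typedef struct {
        const char      *name;          /* symbol name */
        uint32_t        hash;           /* dl_new_hash(name), precomputed */
} sym_t;

static uint32_t nbuckets;               /* bucket count chosen by the link-editor */

/* Order symbols by increasing (hash % nbuckets) */
static int
cmp_bucket(const void *a, const void *b)
{
        uint32_t ba = ((const sym_t *)a)->hash % nbuckets;
        uint32_t bb = ((const sym_t *)b)->hash % nbuckets;

        return ((ba > bb) - (ba < bb));
}

/* The hashed portion of the dynsym would then be sorted with:
   qsort(syms, nhashed, sizeof (sym_t), cmp_bucket); */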

GNU_HASH section

A GNU_HASH section consists of four separate parts, in order:
Header
An array of (4) 32-bit words providing section parameters:

nbuckets
The number of hash buckets

symndx
The dynamic symbol table has dynsymcount symbols. symndx is the index of the first symbol in the dynamic symbol table that is to be accessible via the hash table. This implies that there are (dynsymcount - symndx) symbols accessible via the hash table.

maskwords
The number of ELFCLASS sized words in the Bloom filter portion of the hash table section. This value must be non-zero, and must be a power of 2 as explained below.

Note that a value of 0 could be interpreted to mean that no Bloom filter is present in the hash section. However, the GNU linkers do not do this — the GNU hash section always includes at least 1 mask word.

shift2
A shift count used by the Bloom filter.

Bloom Filter
GNU_HASH sections contain a Bloom filter. This filter is used to rapidly reject attempts to look up symbols that do not exist in the object. The Bloom filter words are 32-bit for ELFCLASS32 objects, and 64-bit for ELFCLASS64.

Hash Buckets
An array of nbuckets 32-bit hash buckets

Hash Values
An array of (dynsymcount - symndx) 32-bit hash chain values, one per symbol from the second part of the dynamic symbol table.
The header, hash buckets, and hash chains are always 32-bit words, while the Bloom filter words can be 32 or 64-bit depending on the class of object. This means that ELFCLASS32 GNU_HASH sections consist of only 32-bit words, and therefore have their section header sh_entsize field set to 4. ELFCLASS64 GNU_HASH sections have mixed element size, and therefore set sh_entsize to 0.

Assuming that the hash section is aligned properly for accessing ELFCLASS sized words, the (4) 32-bit words directly before the Bloom filter ensure that the filter mask words are always aligned properly and can be accessed directly in memory.
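Putting the four parts together, the overall layout can be sketched as the following pseudo-declaration. This is an illustration rather than compilable C, as the array dimensions come from the header fields, and BloomWord is 32 or 64-bits depending on the ELFCLASS:

uint32_t  nbuckets;                       /* # of hash buckets */
uint32_t  symndx;                         /* index of first hashed dynsym symbol */
uint32_t  maskwords;                      /* # of Bloom filter words (power of 2) */
uint32_t  shift2;                         /* Bloom filter shift count */
BloomWord bloom[maskwords];               /* Bloom filter */
uint32_t  buckets[nbuckets];              /* hash buckets */
uint32_t  hashval[dynsymcount - symndx];  /* hash values */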

Bloom Filter
GNU hash sections include a Bloom filter. Bloom filters are probabilistic, meaning that false positives are possible, but false negatives are not. This filter is used to rapidly reject symbol names that will not be found in the object, avoiding the more expensive hash lookup operation. Normally, only one object in a process has the given symbol. Skipping the hash operation for all the other objects can greatly speed symbol lookup.

The filter consists of maskwords words, each of which is 32-bits (ELFCLASS32) or 64-bits (ELFCLASS64) depending on the class of object. In the following discussion, C will be used to stand for the size of one mask word in bits. Collectively, the mask words make up a logical bitmask of (C * maskwords) bits.

GNU hash uses a k=2 Bloom filter, which means that two independent hash functions are used for each symbol. The Bloom filter reference contains the following statement:

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields.
The hash function used by the GNU hash has this property. This fact is leveraged to produce both hash functions required by the Bloom filter from the single hash function described above:
H1 = dl_new_hash(name);
H2 = H1 >> shift2;
As discussed above, the link-editor determines how many mask words to use (maskwords) and the amount by which the first hash result is right shifted to produce the second (shift2). The more mask words used, the larger the hash section, but the lower the rate of false positives. I was told in private email that the GNU linker primarily derives shift2 from the base 2 log of the number of symbols entered into the hash table (dynsymcount - symndx), with a minimum value of 5 for ELFCLASS32, and 6 for ELFCLASS64. These values are explicitly recorded in the hash section in order to give the link-editor the flexibility to change them in the future should better heuristics emerge.

The Bloom filter mask sets one bit for each of the two hash values. Based on the Bloom filter reference, the word containing each bit, and the bit to set would be calculated as:

N1 = ((H1 / C) % maskwords);
N2 = ((H2 / C) % maskwords);

B1 = H1 % C;
B2 = H2 % C;
To populate the bits when building the filter:
bloom[N1] |= (1 << B1);
bloom[N2] |= (1 << B2);
and to later test the filter:
(bloom[N1] & (1 << B1)) && (bloom[N2] & (1 << B2))
The GNU hash deviates from the above in a significant way. Rather than calculate N1 and N2 separately, a single mask word is used, corresponding to N1 above. This is a conscious decision by the GNU hash developers to optimize cache behavior:
This makes the 2 hash functions for the Bloom filter more dependent than when two different Ns were used, but in our measurements still has very good ratio of rejecting lookups that should be rejected, and is much more cache friendly. It is very important that we touch as few cache lines during lookup as possible.

Therefore, in the GNU hash, the single mask word is actually calculated as:

N = ((H1 / C) % maskwords);
The two bits set in the Bloom filter mask word N are:
BITMASK = (1 << (H1 % C)) | (1 << (H2 % C));
The link-editor sets these bits as
bloom[N] |= BITMASK;
And the test used by the runtime linker is:
(bloom[N] & BITMASK) == BITMASK;
Bit Fiddling: Why maskwords Is Required To Be A Power Of Two
In general, a Bloom filter can be constructed using an arbitrary number of words. However, as noted above, the GNU hash calls for maskwords to be a power of 2. This requirement allows the modulo operation
N = ((H1 / C) % maskwords);
to instead be written as a simple mask operation:
N = ((H1 / C) & (maskwords - 1));
Note that (maskwords - 1) can be precomputed once
MASKWORDS_BITMASK = maskwords - 1;
and then used for every hash:
N = ((H1 / C) & MASKWORDS_BITMASK);
Bloom Filter Special Cases
Bloom filters have a pair of interesting special cases:
  • When a Bloom filter has all of its bits set, all tests result in a True (accept) value. The GNU linker takes advantage of this by issuing a single word Bloom filter with all bits set when it wants to "disable" the Bloom filter. The filter is still there, and is still used, at minimal overhead, but it lets everything through.

  • A Bloom filter with no bits set will return False in all cases. This case is relatively rare in ELF files, as an object that exports no symbols has limited application. However, sometimes objects are built this way, relying on init/fini sections to cause code from the object to run.
Hash Buckets
Following the Bloom filter are nbuckets 32-bit words. Each word N in the array contains the lowest index into the dynamic symbol table for which:
(dl_new_hash(symname) % nbuckets) == N
Since the dynamic symbol table is sorted by the same key (hash % nbuckets), dynsym[buckets[N]] is the first symbol in the hash chain that will contain the desired symbol if it exists.

A bucket element will contain the index 0 if there is no symbol in the hash table for the given value of N. As index 0 of the dynsym is a reserved value, this index cannot occur for a valid symbol, and is therefore unambiguous.

Hash Values
The final part of a GNU hash section contains (dynsymcount - symndx) 32-bit words, one entry for each symbol in the second part of the dynamic symbol table. The top 31 bits of each word contain the top 31 bits of the corresponding symbol's hash value. The least significant bit is used as a stopper bit. It is set to 1 when a symbol is the last symbol in a given hash chain:
lsb = (N == dynsymcount - 1) ||
  ((dl_new_hash (name[N]) % nbuckets)
   != (dl_new_hash (name[N + 1]) % nbuckets))

hashval = (dl_new_hash(name) & ~1) | lsb;

Symbol Lookup Using GNU Hash

The following shows how a symbol might be looked up in an object using the GNU hash section. We will assume the existence of an in memory record containing the information needed:
typedef struct {
        const char      *os_dynstr;      /* Dynamic string table */
        Sym             *os_dynsym;      /* Dynamic symbol table */
        Word            os_nbuckets;     /* # hash buckets */
        Word            os_symndx;       /* Index of 1st dynsym in hash */
        Word            os_maskwords_bm; /* # Bloom filter words, minus 1 */
        Word            os_shift2;       /* Bloom filter hash shift */
        const BloomWord *os_bloom;       /* Bloom filter words */
        const Word      *os_buckets;     /* Hash buckets */
        const Word      *os_hashval;     /* Hash value array */
} obj_state_t;
To simplify matters, we elide the details of handling different ELF classes. In the above, Word is a 32-bit unsigned value, BloomWord is either 32 or 64-bit depending on the ELFCLASS, and Sym is either Elf32_Sym or Elf64_Sym.

Given a variable containing the above information for an object, the following pseudo code returns a pointer to the desired symbol if it exists in the object, and NULL otherwise.

Sym *
symhash(obj_state_t *os, const char *symname)
{
        Word            c;
        Word            h1, h2;
        Word            n;
        Word            bitmask; 
        const Sym       *sym;
        const Word      *hashval;

        /*
         * Hash the name, generate the "second" hash
         * from it for the Bloom filter.
         */
        h1 = dl_new_hash(symname);
        h2 = h1 >> os->os_shift2;

        /* Test against the Bloom filter */
        c = sizeof (BloomWord) * 8;
        n = (h1 / c) & os->os_maskwords_bm;
        bitmask = (1 << (h1 % c)) | (1 << (h2 % c));
        if ((os->os_bloom[n] & bitmask) != bitmask)
                return (NULL);

        /* Locate the hash chain, and corresponding hash value element */
        n = os->os_buckets[h1 % os->os_nbuckets];
        if (n == 0)    /* Empty hash chain, symbol not present */
                return (NULL);
        sym = &os->os_dynsym[n];
        hashval = &os->os_hashval[n - os->os_symndx];

        /*
         * Walk the chain until the symbol is found or
         * the chain is exhausted.
         */
        for (h1 &= ~1; 1; sym++) {
                h2 = *hashval++;

                /*
                 * Compare the strings to verify match. Note that
                 * a given hash chain can contain different hash
                 * values. We'd get the right result by comparing every
                 * string, but comparing the hash values first lets us
                 * screen obvious mismatches at very low cost and avoid
                 * the relatively expensive string compare.
                 *
                 * We are intentionally glossing over some things here:
                 *
                 *    - We could test sym->st_name for 0, which indicates
                 *      a NULL string, and avoid a strcmp() in that case.
                 *
                 *    - The real runtime linker must also take symbol
                 *      versioning into account. This is an orthogonal
                 *      issue to hashing, and is left out of this
                 *      example for simplicity.
                 *
                 * A real implementation might test (h1 == (h2 & ~1)), and
                 * then call a (possibly inline) function to validate the
                 * rest.
                 */
                if ((h1 == (h2 & ~1)) &&
                    !strcmp(symname, os->os_dynstr + sym->st_name))
                        return (sym);

                /* Done if at end of chain */
                if (h2 & 1)
                        break;
        }

        /* This object does not have the desired symbol */
        return (NULL);
}

Updates

26 August 2010
Per Lidén pointed out some errors in the above example. There were 3 uses of "sizeof (BloomWord)" that should have been "sizeof (BloomWord) * 8", as we are dealing with bits, rather than bytes. And, there was a typo in the for() loop. Thank you for the corrections.

19 September 2011
Subin Gangadharan wrote to tell me about another typo. In the code example that finishes the article, I wrote
      h1 = dl_new_hash(symname);
      h2 = h2 >> os->os_shift2;
      
This second line should have been
      h2 = h1 >> os->os_shift2;
      
In looking this old article over, I found that a number of '*' characters had been transformed into '\*'. I can only speculate that this happened when Oracle converted our Sun blogs. I have restored the original characters. I also took the opportunity to update my email address to its current Oracle.COM form.

15 August 2013
Mikael Vidstedt points out another minor typo. In the discussion of how the GNU hash calculates a single N, rather than N1 and N2, I concluded by saying the following:
And the test used by the runtime linker is:
      (bloom[N1] & BITMASK) == BITMASK;
      
The N1 should have been simply N:
      (bloom[N] & BITMASK) == BITMASK;
      

Thank you for writing, and for the kind words.

The Cost Of ELF Symbol Hashing

Linux and Solaris are ELF cousins. Both systems use the same base ELF standard, though we've both made our own OS specific extensions over the years. Recently, the GNU linker folks made some changes to how Linux does symbol hashing in their ELF objects. I've been learning about what they've done, and that in turn caused me to consider the bigger picture of ELF hashing overhead.

History and Trends

In an ELF based system, the runtime linker looks up symbols within executables and sharable objects. The available symbols are found in the dynamic symbol table. The lookup is done using a pre-computed hash table section associated with that symbol table. The SVR4 ELF ABI defines the layout of these hash sections, and the hash function. These are all original ELF features, dating back to the late 1980's when ELF was designed. This aspect of ELF has been static since that time.
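
For context, the hash function defined by the ABI is the following small routine, shown here essentially as the gABI presents it:

unsigned long
elf_hash(const unsigned char *name)
{
        unsigned long   h = 0, g;

        while (*name) {
                h = (h << 4) + *name++;
                if ((g = h & 0xf0000000) != 0)
                        h ^= g >> 24;
                h &= ~g;
        }
        return (h);
}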

The runtime linker maintains a list of objects currently mapped into a program's address space. To find a desired symbol, it starts at the head of this list and searches each one in turn using a hash lookup operation, until the symbol is found (success) or the end of the list is hit (failure).
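
In rough pseudo code, using a per-object lookup like the symhash() example shown earlier (the obj_next linkage is a hypothetical name for the list pointer), the search is:

const Sym *
find_symbol(obj_state_t *list, const char *name)
{
        const Sym       *sym;

        /* One hash lookup per object; in modern programs, most fail */
        for (; list != NULL; list = list->obj_next)
                if ((sym = symhash(list, name)) != NULL)
                        return (sym);   /* success */

        return (NULL);  /* end of list reached: lookup failed */
}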

The per-symbol cost of symbol lookup hashing grows with:

  • The number of objects in a process.

  • The number of symbols in those objects.

  • The length of the symbol names. In particular, the C++ language encodes class names and argument types into symbol names, which makes the strings extremely long. Even worse from a string comparison point of view is that these strings tend to have long shared prefixes, differing only in the final characters.
These items all grow over time. In the days when ELF was originally defined, a process had one or two sharable objects at most, and a few hundred symbols, all of which had short names. More recently, it has become common for a program to have tens of sharable objects, and thousands of symbols, and names have grown. This trend shows no sign of abating. It is easy to imagine a near future with hundreds of objects and hundreds of thousands of symbols. In fact, we have already seen a program with almost 1000 objects.

In the past, when the list of objects in a program was very short, it was not necessary to search many objects before a given symbol was found. Most symbols were in the first, often only, object. Hence, most hash operations were successful, and hash overhead was not a significant concern. In modern programs however, failed hash operations dominate. It is usually necessary to perform one or more failing hash operations before getting to the object that has a desired symbol. The more objects, the larger the percentage of failing hashes.

Unless somehow mitigated, the per-symbol cost of hashing will continue to grow as programs grow larger, possibly to a level where the user can feel the effect. There are several ways in which this overhead can be reduced:

  1. Eliminate unnecessary symbols

  2. Eliminate the O(n) search of the object list to reduce the amount of hashing required.

  3. Make each hashing operation cost less.

Eliminate Unnecessary Symbols

Most objects contain global symbols that are for the use of the object, but not intended to be accessed by outside code. One common example would be that of a helper function called within multiple files that are compiled into a sharable object. Such a function needs to be global within the object so that it can be called from multiple files. However, it is not intended to be something the users of the library call directly.

ELF versioning allows symbols to be assigned to versions, thereby creating interfaces that can be maintained for backward compatibility as the object evolves. In a version definition, the scope reduction operator can be used to tell the link-editor that any global symbols not explicitly assigned to a version should not be visible outside the object. For example, the following exports the symbol foo from a version named VERSION_1.0, while reducing the scope of all other global symbols to local:

VERSION_1.0
{
	 global:
		 foo;
	 local:
		 *;
};
Some language compilers offer symbol visibility keywords that have similar effect.
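
For example, with gcc the same effect can be had at the source level (a sketch; the attribute syntax is gcc-specific, and the function names are hypothetical):

/* Global within the object, but hidden from its external interface */
__attribute__((visibility("hidden")))
int
helper(int arg)
{
        return (arg * 2);
}

/* Part of the object's public interface */
int
foo(int arg)
{
        return (helper(arg) + 1);
}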

Eliminating unnecessary symbols from the hash table reduces the average length of the hash chains, and speeds the lookup. In addition, hiding unnecessary symbols from an object's external interface prevents accidental interposition, in which a library exports a function intended only for its own use, and that symbol ends up interposing on a symbol of the same name in a different object.

Eliminate O(n) Object Searching

There are different strategies employed in modern operating systems to minimize the need for symbol hashing:
Prelink
The Linux operating system emphasizes its prelink system. Prelink analyzes all the executables on the system, and the libraries they depend on. Non-conflicting addresses are assigned for all the libraries, and then the executables and libraries are pre-bound to be loaded at those addresses. In effect, the work normally done by the runtime linker for each object at process startup is instead done once. The runtime linker, recognizing a prelinked object at process startup, will map it to its assigned location, and immediately return rather than do the usual relocation processing.

Prelinking is a complex per-system operation, though Linux does a good job of hiding and managing this complexity. Changes to the system can require the prelink operation to be redone.

Prelinking pares the cost of ELF runtime processing to the absolute minimum. As part of that, it completely eliminates symbol hashing at startup. A further advantage is that it does not require any changes to the objects in question. All of the complexity is kept in the prelink system itself.

Prelinking will not prevent hashing from occurring if:

  • The object is loaded via dlopen().
  • The object is not prelinked.
  • The object is prelinked for one system, but is then used by another, either as a copy or via a network filesystem. Prelinking is a per-system concept. The system can use an object prelinked for a different system, but the benefits of prelinking may be lost.
  • The objects on the system have changed, altering the prelinking needed. In this case, prelinking can be recomputed.

Direct Binding
The Solaris operating system employs a combination of direct binding, preferably in conjunction with lazy loading and lazy binding. In non-direct binding, an object records the symbols it requires from other objects. In direct binding, each such symbol also records the object expected to provide that symbol.

Direct bindings were developed with several goals in mind:

  1. They harden large programs against the effects of unintended symbol interposition, when the wrong library provides a given symbol due to the order in which libraries get loaded. This makes it easier to reliably build larger programs.

  2. They allow multiple instances of a named interface to be referenced by different objects, further expanding the ability of complex programs to interface with multiple, perhaps incompatible, external interfaces. For instance, when a program depends on many objects that are developed independently by others, sometimes those objects have dependencies on different incompatible versions of some other object.

  3. Performance: At runtime, the runtime linker uses the direct binding information to skip the O(n) search of all the objects in the process, and instead to go directly to the right library for each symbol, carrying out a single successful hash operation to locate it within the object.

The first two items were the driving issues that led to the invention of direct bindings, reflecting real problems we were encountering in the field.

Lazy binding, which is the default, reduces the amount of hashing done even further, by delaying the binding operation until the first use of a given symbol by the program. If a given symbol is not needed during the lifetime of a process, it is never looked up, and the hashing that would otherwise occur is eliminated. Most programs do not use all of the symbols available to them in a given execution, so lazy binding amplifies the benefits of direct binding.

Direct bindings do not eliminate hashing, but they eliminate the unproductive failure cases, leaving a single successful hash lookup per symbol. In a sense, they bring us back to the performance of early ELF systems where everything was found in a single library and most hash operations were successful. Direct bindings have a relatively simple implementation. They work with dlopen(), and across network filesystems. However, many objects do not use direct bindings. Unlike prelink, direct bindings require explicit action to be taken when the object is built. Converting an existing non-direct object to use them can require some analysis to be carried out by the programmer, to enable desired interposition that direct bindings would otherwise prevent.
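
As a rough sketch of what this looks like at build time (object and library names hypothetical), direct bindings and lazy loading are requested when the object is linked:

% ld -G -o libfoo.so.1 foo.o -B direct -z lazyload -lbar -lc

Here, -B direct records the providing dependency for each symbol reference, and -z lazyload defers loading of the dependencies that follow it until they are first needed.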

Prelinking and direct bindings are very different solutions that attack the problem of hashing overhead along different axes. Systems will see greatly reduced symbol hash overhead with either strategy, but symbol hashing will still occur. As such, the cost of the hash lookup operation is still of interest.

Making Symbol Hashing Cheaper

The existing SVR4 ELF hash function and hash table sections are fixed. Improving their performance requires introducing a new hash function and hash section. Recently, the developers of the GNU linkers have done this. These new hash sections can coexist in an object with the older SVR4 hash sections, allowing for backward compatibility with older systems. The GNU hash reduces hash overhead in the following ways:
  • An improved hash function is used, to better spread the hash keys and reduce hash chain length.

  • The dynamic symbol table is sorted into hash order, such that memory access tends to be adjacent and monotonically increasing, which can help cache behavior. (Note that the Solaris link-editor does a similar sort, although the specific details differ.)

  • The dynamic symbol table contains some symbols that are never looked up via the hash table. These symbols are left out of the hash table, reducing its size and hash chain lengths.

  • Perhaps most significantly, the GNU hash section includes a Bloom filter. This filter is used prior to hash lookup to determine if the symbol is found in the object or not.
Bloom filters are used to test whether a given item is part of a given set. They use a compact bitmask representation, and are fast to query. Bloom filters are probabilistic: False positives are possible, but false negatives are not. The size of the bitmask used to represent the filter, and the quality and number of hash functions used, determine the rate of false positives.

A Bloom filter is never wrong when it says an item does not exist in a set. Applied to ELF hash tables, the runtime linker can test a symbol name against a Bloom filter, and if rejected, immediately move on to the next object. Since most symbol lookups end in failure as discussed above, this has the potential to eliminate a large number of unnecessary hash operations. It is possible for a Bloom filter to incorrectly indicate that an item exists in the set when it doesn't. When this happens, it is caught by the following hash lookup, so correct linker operation is not affected. Since false positives are rare, this does not significantly affect performance.

It is interesting to note that the use of a Bloom filter makes a successful symbol lookup slightly more expensive than it would be otherwise. The hash table alone can be used to determine if a symbol exists or not, so the Bloom filter is pure overhead in the success case. Despite that, the Bloom filter is a winning optimization, because it is very cheap to compute compared to a hash table lookup, and because most hash operations are against libraries that end up not containing the desired symbol.

It is also worth noting that the runtime linker is free to skip the use of the Bloom filter and proceed directly to the hash lookup operation. This may be worthwhile in situations where the runtime linker has other reasons to believe the lookup will be successful. In particular, if the runtime linker is directed to a given object via a direct binding, the odds of a failed symbol lookup should be zero, so there is no need to screen before the hash lookup.

Conclusions

Tweaking the performance of an existing algorithm has its place, particularly within the inner loops of a busy program. However, the big wins are usually the result of using a better algorithm. In Solaris, direct bindings have been our algorithmic approach to reducing hash overhead. We've made a conscious effort to develop and deploy direct bindings in preference to making improvements to the traditional hash process. We've been pleased with the results — direct bindings are faster and, combined with their other attributes, allow programs to scale larger.

Nonetheless, the GNU hash embodies several worthwhile ideas:

  1. Better hash function
  2. Doesn't put symbols that the runtime linker doesn't care about into the hash table.
  3. Bloom filter cheaply eliminates most hash operations.
In particular, the use of a Bloom filter to cheaply filter away unproductive hash operations stands out as a very interesting idea.

It is clear that hash overhead can be measured and reduced. By and large however, we have not found ELF hash overhead to be a hot spot for real programs. It seems that the other things that go on in a program generally dominate the hit that comes from ELF symbol hashing. The core Solaris OS is now built using direct bindings, which has allowed us to harden and simplify aspects of its design. Interestingly enough, measurements do not reflect a significant resulting change in system performance. This does not prove that hash overhead should be ignored, but it does tend to suggest that one has to look towards extremely large non-direct bound programs in order to demonstrate the issue.

It's all food for thought, and perhaps time for some experimentation and measurement.

Wednesday Mar 19, 2008

ld Is Now A Cross Link-Editor

Until yesterday, ld, the Solaris link-editor, was a native linker. This means that it was only able to link objects for the same machine that the linker was running on. Linking sparc objects required the use of a sparc machine, and x86 objects required an x86 system.

With the integration of

PSARC 2008/179 cross link-editor
6671255 link-editor should support cross linking
into Solaris Nevada build 87, the Solaris ld became a cross link-editor. Now, ld running on any type of hardware can link objects for any of the systems Solaris supports. This is currently sparc and x86, but there are people playing with OpenSolaris on other hardware too, so who knows what we might end up with?

The user interface to this new capability is a simple extension of ld's traditional behavior. Traditionally, ld establishes the class (whether the object is 32 or 64-bit) of the object being generated by examining the first ELF object processed from the command line. In the case where an object is being built solely from an archive library or a mapfile, the -64 command line option is available to explicitly override this default. We have extended this to also determine the machine type of the object to produce:

  • The class and machine type of the resulting object default to those of the first ELF object processed from the command line.

  • The new '-z target=' command line option can be used to explicitly specify the machine type.
It's simple: To link objects for a given machine, you supply the link-editor with objects of that type. The link-editor examines the first object, and then configures itself to process objects of that type.
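
For example, to produce a 64-bit sparc shared object on an x86 host from nothing but an archive (names hypothetical), the target must be given explicitly:

% ld -64 -z target=sparc -G -o libfoo.so.1 libfoo.a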

Of course, it's a rare program that doesn't link against at least one system library. You're going to need libc, if nothing else. To do a successful cross-link, you'll need to have an image of the root filesystem for a system of the target type. There are many ways to do this. For testing purposes, I used a sparc and x86 system, using NFS to allow each system to see the root filesystem of the other.

Even though we now have a cross link-editor, we expect that the vast majority of links will be native, for the machine running the linker. We decided to pursue cross linking anyway, for two reasons:

  1. To lower the bar for OpenSolaris ports: People have had some success using the GNU ld to port OpenSolaris, but it is difficult to get very far that way. The code in Solaris depends on link-editor features that are specific to the Solaris ld, so using GNU ld involves a fair amount of hacking around these things to make progress, and the farther up the stack you go from the kernel to userland, the harder it gets. We hope that providing a better framework for adding targets to the Solaris ld will help such efforts. It should now be possible (though still not trivial!) to add support for a new target to the Solaris linker running on sparc or x86, and then use the resulting system to cross-build for the new platform.

  2. To allow the use of fast/cheap commodity desktop systems to build objects for other systems, be they large expensive systems, or small embedded devices.

A cross link-editor is a significant step, but is of little use unless you also have a cross-compiler and assembler. The GNU gcc compiler can be built as a cross compiler, and that should be very helpful for OpenSolaris ports. However, the Sun Studio compilers are native compilers, so it will still be a while before I can use my amd64 desktop to build a sparc version of Solaris as part of work at Sun. We've taken a first step. It will be interesting to see what follows.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Friday Nov 02, 2007

Avoiding LD_LIBRARY_PATH: The Options

With the introduction of the elfedit utility into Solaris, we have a new answer to the age-old question of how to avoid everyone's favorite way to get into trouble, the LD_LIBRARY_PATH environment variable. This seems like an appropriate time to revisit this topic.

LD_LIBRARY_PATH Seems Useful. What's the Problem?

The problem is that LD_LIBRARY_PATH is a crude tool, and cannot be easily targeted at a problem program without also hitting other innocent programs. Sometimes this overspray is harmless (it costs some time, but doesn't break anything). Other times, it causes a program to link to the wrong version of something, and that program dies in mysterious ways.

Historically, inappropriate use of LD_LIBRARY_PATH might be the #1 way to get yourself into trouble in an ELF environment. In particular, people who redistribute binaries with instructions for their users to set LD_LIBRARY_PATH in their shell startup scripts are unleashing forces beyond their control. Experience tells us that such use is destined to end badly.

This subject has been written about many times by many people. My colleague Rod Evans wrote about this ( LD_LIBRARY_PATH - just say no) for one of his first blog entries.

If you need additional convincing on this point, here are some suggested Google searches you might want to try:

LD_LIBRARY_PATH problem
LD_LIBRARY_PATH bad
LD_LIBRARY_PATH evil
LD_LIBRARY_PATH darkest hell

If LD_LIBRARY_PATH is so bad, why does its use persist? Simply because it is the option of last resort, used when everything else has failed. We probably can't eliminate it, but we should strive to reduce its use to the bare minimum.

How to Use, and How To Avoid Using LD_LIBRARY_PATH

The best way to use LD_LIBRARY_PATH is interactively, as a short term aid for testing or development. A developer might use it to point his test program at an alternative version of a library. Beyond that, the less you use it, the better off you'll be. With that in mind, here is a list of ways to avoid LD_LIBRARY_PATH. The items are ordered from best to worst, with the best option right at the top:
  • Explicitly set the correct runpath for the objects you build. If you have the ability to relink the object, you can always do this, and no other workaround is needed. To set a runpath in an object, use the -R compiler/linker option.

    One common problem that people run into with a built-in runpath is the use of an absolute path (e.g. /usr/local/lib). Absolute paths are no problem for the well known system libraries, because their location is fixed by convention as well as by standards. However, they can be trouble for libraries supplied by third parties and installed onto the system. Usually the user has a choice of where such applications are installed, with their home directory and /usr/local being two of the more popular places. An application that hard-wires the location of user installed libraries cannot handle this. The solution in this case is to use the $ORIGIN token in those runpaths. The $ORIGIN token, which refers to the directory in which the using object resides, can be used to set a non-absolute runpath that will work in any location, as long as the desired libraries reside at a known location relative to the using program. Fortunately, this is often the case.

    For example, consider the case of a 32-bit application named myapp, which relies on a sharable library named mylib.so, as well as on the standard system libraries found in /lib and /usr/lib. The -R option to put the runpath into myapp that will look in these places would be:

    -R '$ORIGIN/../lib:/lib:/usr/lib'
    
    This allows myapp and mylib.so to be installed anywhere, as long as they are kept in the same positions relative to each other.

    Even for system libraries, the use of $ORIGIN can be useful. We use it for all of the linker components in the system. For instance:

    % elfdump -d /usr/bin/ld | grep RUNPATH
           [7]  RUNPATH           0x2e6               $ORIGIN/../../lib
    
    By setting the runpath using $ORIGIN instead of simply hardwiring the well known location /lib, we make it easier to test a tree of alternative linker components, such as results when we do a full build of the Solaris ON consolidation. We know that when we run a test copy of ld, it will use the related libraries that were built with it, instead of binding to the installed system libraries.

    There is one exception to the advice to make heavy use of $ORIGIN. The runtime linker will not expand tokens like $ORIGIN for secure (setuid) applications. This should not be a problem in the vast majority of cases.

  • Many times, the problem comes in the form of open source software that explicitly sets the runpath to an incorrect value for Solaris. Can you fix the configuration script and contribute the change back to the package maintainer? You'll be doing lots of people a favor if you do.

  • If you have an object with a bad runpath (or no runpath) and the object cannot be rebuilt, it may be possible to alter its runpath using the elfedit command. Using the myapp example from the previous item:
    elfedit -e 'dyn:runpath $ORIGIN/../lib:/lib:/usr/lib' myapp
    
    For this option to be possible, you need to be running a recent version of Solaris that has elfedit, and your object has to have been linked by a version of Solaris that has the necessary extra room. Quoting from the elfedit manpage:
    • The desired string must already exist in the dynamic string table, or there must be enough reserved space within this section for the new string to be added. If your object has a string table reservation area, the value of the .dynamic DT_SUNW_STRPAD element indicates the size of the area. The following elfedit command can be used to check this:

      % elfedit -r -e 'dyn:tag DT_SUNW_STRPAD' file
    • The dynamic section must already have a runpath element, or there must be an unused dynamic slot available where one can be inserted. To test for the presence of an existing runpath:

      % elfedit -r -e 'dyn:runpath' file

      A dynamic section uses an element of type DT_NULL to terminate the array found in that section. The final DT_NULL cannot be changed, but if there is more than one, elfedit can convert one of them into a runpath element. To test for extra dynamic slots:

      % elfedit -r -e 'dyn:tag DT_NULL' file

  • If your application was linked with the -c option to the linker, then you can use the crle command to alter the configuration file associated with the application and change the settings for LD_LIBRARY_PATH that are applied for that application. This is a pretty good solution, but is limited by its complexity, and by the fact that the person who linked the object needs to have thought ahead far enough to provide for this option. Odds are that they didn't. If they had, they might just as well have set the runpath correctly in the first place, eliminating the need for anything else.

    You can use crle with an application that was not linked with -c, either by setting the LD_CONFIG environment variable, or by modifying the global system configuration file. However, both of these options suffer from the same issues as the LD_LIBRARY_PATH environment variable: They are too coarse grained to be applied to a single application in a targeted way.
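
    For illustration (paths hypothetical), an alternate configuration file naming a different default search path can be created with crle, and then applied to a single run via LD_CONFIG, though as noted, the environment variable form shares LD_LIBRARY_PATH's weaknesses if it leaks to other programs:

    % crle -c /opt/myapp/ld.config -l /opt/myapp/lib:/lib:/usr/lib
    % LD_CONFIG=/opt/myapp/ld.config /opt/myapp/bin/myapp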

  • If none of the above are possible, then you are indeed stuck with LD_LIBRARY_PATH. In this case, the goal should be to minimize the number of applications that see this environment variable. You should never set it in your interactive shell environment (via whatever dot file your shell supports: .profile, .login, .cshrc, .bashrc, etc...). Instead, put it in a wrapper shell script that you use to run the specific program.

    The use of a wrapper script is a pretty safe way to use LD_LIBRARY_PATH, but you should be aware of one limitation of this approach: If the program being wrapped starts any other programs, then those programs will see the LD_LIBRARY_PATH environment variable. Since programs starting other programs is a common Unix technique, this form of leakage can be more common than you might realize.
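
    A minimal wrapper (paths hypothetical) might look like the following; using exec avoids leaving an extra shell process behind, but does nothing to stop the leakage just described:

    #!/bin/sh
    #
    # Run myapp with the library path it requires, without
    # setting LD_LIBRARY_PATH in the interactive environment.
    LD_LIBRARY_PATH=/opt/myapp/lib
    export LD_LIBRARY_PATH
    exec /opt/myapp/bin/myapp "$@"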


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Introducing elfedit: A Tool For Modifying Existing ELF Objects

Back in June, I wrote about changes we've recently made to Solaris ELF objects that allow their runpaths to be modified without having to rebuild the object. In that posting, I alluded to work that I was then doing when I said "Eventually, Solaris will ship with a standard utility for modifying runpaths". I am happy to say that this has come to pass. I recently integrated /usr/bin/elfedit into build 75 of Solaris Nevada with:
PSARC 2007/509 elfedit
6234471 need a way to edit ELF objects
elfedit can indeed modify the runpath in an object, but it is considerably more general than that. elfedit is a tool for examining and modifying the ELF metadata that resides within ELF objects. It can be used as a batch mode tool from shell scripts, makefiles, etc, or as an interactive tool, for examining and exploring objects. elfedit has a modular design, and ships with a set of standard modules for performing common edits. This design makes it easy to add new functionality by adding additional modules.

Prior to elfedit, making these sorts of modifications required the user to write a program, usually in C using libelf. elfedit significantly raises the level of abstraction at which this work can be done. Many operations can be done using existing elfedit commands. For those that cannot, it is far easier to write an elfedit module to add the ability than it is to write a standalone program.

We envision elfedit being used to solve the following sorts of problems:

[Small Fixups]
To correct minor issues in a built file that cannot be easily rebuilt, or for which sources are not available.

Probably the most notable such item is the ability to alter the runpath of objects built following the integration of

PSARC 2007/127 Reserved space for editing ELF dynamic sections
6516118 Reserved space needed in ELF dynamic section and string table
The ability to do this is a "Frequently Asked Question" for which there has previously been no good answer. This feature is expected to be used nearly as soon as it is available, to fix the runpaths of FOSS (free open source software) built for Solaris, which often has the wrong runpaths set.

Another common situation is when programmers forget to explicitly add the libraries they depend on to the link line, relying on indirect dependencies to make things work. elfedit can be used to add NEEDED dependencies to an existing object's dynamic section, making the dependencies explicit.

[Better Way To Support Specialized Rarely Used Features]
As an avenue for delivering small features to change some object attributes without the need to add additional complex and specialized features to ld and ld.so.1.

For example, we have had requests for a mechanism in ld that would allow the user to override the hardware capability bits placed in the object by the compiler. Such a feature would be complex to document, and would burden already complex commands with features that are rarely used. Such features are a natural fit for elfedit. (See the elfedit(1) manpage for an example of modifying the hardware capabilities.)

[Linker Development]
We sometimes work on linker features that require objects with new values or flag bits that the compilers do not yet generate. elfedit allows us to set arbitrary values for such items quickly, and without having to write a program.
[Linker Testing]
Many bugs involve an object that is broken in some way. Once the bug is fixed, we need an object broken in that particular way for our test suite. There are several problems that arise:

  • Cataloging and archiving broken objects is time consuming and error prone.

  • Producing similarly broken objects for different platforms is not always possible.

  • As new platforms appear, we end up with coverage gaps where some platforms can do a given test and others cannot.

elfedit gives us the ability to build a simple object, and then break it intentionally in a specific and controlled manner. Tests can then be self contained, requiring no external data, and applicable to all relevant platforms.

elfedit's ability to extract specific bits of data from an object is very useful for object and linker testing.

Every elfedit module contains documentation for the commands it provides. This information is displayed using the built-in help command, in a format based on that of Solaris manpages. The help strings in the standard elfedit modules supplied with Solaris are internationalized using the same i18n mechanisms employed by the rest of the linker software found under usr/src/cmd/sgs. Hence, all elfedit modules supplied by Sun will have complete documentation, and will support the necessary language locales.

As with any program that changes the contents of an ELF file, changes to an object by elfedit will invalidate any pre-existing elfsign signature. Assuming the changes are understood and acceptable to the signing authority, such objects will need to be signed after the edits are done.

Modular Design And Extensibility

elfedit has a modular design, reflecting our own experience with dynamic linking, and influenced heavily by the design of mdb, the modular debugger.

The elfedit program contains the code that handles the details of reading objects, executing commands to modify them, and saving the results. Very little of the code that performs the actual edits is found in elfedit itself. Rather, the commands exist in modules, which are sharable objects with a well defined elfedit-specific interface. elfedit loads needed modules on demand when a command from the module is executed. These modules are self contained, and include their own documentation in a standard format that elfedit can display using its help command.

The module forms a namespace for the commands that it supplies. Each module delivers a set of commands, focused on related functionality. A command is specified by combining the module and command names with a colon (:) delimiter, with no intervening whitespace. For example, dyn:runpath refers to the runpath command provided by the dyn module.

Module names must be unique. The command names within a given module are unique within that module, but the same command names can be used in more than one module. For example, most modules contain a command named 'dump', which is used to provide an elfdump-style view of the data.
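
For example, a read-only session (the -r option) can display an object's dynamic section with the dyn module's dump command:

% elfedit -r -e 'dyn:dump' /usr/bin/ls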

We have adopted the following general rules of thumb for naming modules and commands:

  • The module name reflects the part of the ELF format the module addresses (ehdr, phdr, shdr, ...)

  • Commands that directly access a field in an ELF structure are given the name of the field (e.g. ehdr:e_flags).

  • Commands that are higher level have a simple descriptive name that reflects their purpose (e.g. dyn:runpath).

Give 'Em Enough Rope

elfedit is a tool for linker development and testing. As such, it follows the Unix tradition of doing what it's told, without a lot of noise. This is great if you are doing linker research & development, or testing. We commonly need to intentionally set ELF metadata to undefined or even "wrong" values. However, it follows that elfedit won't prevent you from making nonsensical or otherwise incorrect changes to your ELF objects.

For example, X86 objects have little endian byteorder (ELFDATA2LSB):

% file /usr/bin/ls
/usr/bin/ls:    ELF 32-bit LSB executable 80386 Version 1 [FPU],
    dynamically linked, not stripped, no debugging information available
We can change the e_ident[EI_DATA] field in the ELF header from its proper value to ELFDATA2MSB, which reverses the byte order advertised by the program and makes it appear to be big endian:
% elfedit -e 'ehdr:ei_data elfdata2MSB' /usr/bin/ls /tmp/badls
% file /tmp/badls
/tmp/badls:     ELF 32-bit MSB executable 80386 Version 1 [FPU],
    dynamically linked, not stripped, no debugging information available
The file command sees the change that we made. However, we haven't really created a big endian X86 binary by changing what it advertises. We now have a little endian binary that is lying about what it contains. And of course, there is no such thing as big endian X86 hardware, so if we had created such a binary, it wouldn't be runnable anywhere. It should come as no surprise that the system doesn't know what to do with our modified ls binary:
% /tmp/badls
/tmp/badls: cannot execute

This is really nothing to be worried about. If you are using elfedit's low level operations that allow arbitrary changes to individual ELF fields, then you need to know enough about the ELF format to make these changes properly. Most people will use elfedit for the high level operations such as changing runpaths. The high level operations are safe, and do not require expert knowledge to use.

If you are making those low level changes, the Solaris Linker and Libraries Guide can be very helpful.

Learning More

elfedit is a standard part of the Solaris development branch, the code that will eventually ship from Sun as the next version of Solaris. It is also available as part of OpenSolaris. It is not part of Solaris 10 or earlier Solaris releases. If you are using a recent Solaris distribution, such as Solaris Express Developer Edition, then elfedit should already be present on your system.

The elfedit(1) manpage describes the utility in more detail, and gives three examples that should be of general interest:

  1. Changing runpaths
  2. Changing hardware/software capability bits
  3. Reading specific data, without having to grep the output of elfdump.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Thursday Nov 01, 2007

What Are Fake ELF Section Headers?

I'd like to take a moment to explain an unusual feature we added to elfdump last summer. The -P option tells elfdump to ignore the section headers in the file (if any) and to instead generate a "fake" set from the program headers. So, what are fake section headers, and why would you want them?

Earlier this year, there was an "incident" in which a previously unknown hole in the Solaris telnet daemon was used by a worm. As soon as we got a copy of this worm, we tried to examine it with elfdump to see what we might learn:

% elfdump zoneadmd

ELF Header
  ei_magic:   { 0x7f, E, L, F }
  ei_class:   ELFCLASS32          ei_data:      ELFDATA2LSB
  e_machine:  EM_386              e_version:    EV_CURRENT
  e_type:     ET_EXEC
  e_flags:                     0
  e_entry:             0x80512d4  e_ehsize:     52  e_shstrndx:  0
  e_shoff:                     0  e_shentsize:   0  e_shnum:     0
  e_phoff:                  0x34  e_phentsize:  32  e_phnum:     5

Program Header[0]:
    p_vaddr:      0x8050034   p_flags:    [ PF_X  PF_R ]
    p_paddr:      0           p_type:     [ PT_PHDR ]
    p_filesz:     0xa0        p_memsz:    0xa0
    p_offset:     0x34        p_align:    0

Program Header[1]:
    p_vaddr:      0           p_flags:    [ PF_R ]
    p_paddr:      0           p_type:     [ PT_INTERP ]
    p_filesz:     0x11        p_memsz:    0
    p_offset:     0xd4        p_align:    0

Program Header[2]:
    p_vaddr:      0x8050000   p_flags:    [ PF_X  PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x6491      p_memsz:    0x6491
    p_offset:     0           p_align:    0x10000

Program Header[3]:
    p_vaddr:      0x8066494   p_flags:    [ PF_X  PF_W  PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x3e0       p_memsz:    0xc10
    p_offset:     0x6494      p_align:    0x10000

Program Header[4]:
    p_vaddr:      0x80665c4   p_flags:    [ PF_X  PF_W  PF_R ]
    p_paddr:      0           p_type:     [ PT_DYNAMIC ]
    p_filesz:     0xd8        p_memsz:    0
    p_offset:     0x65c4      p_align:    0
That's it — everything that elfdump could tell us about this object. We sure didn't learn much from that!

If you look at the ELF header, you'll see that our bad guy has set the e_shnum, e_shoff, and e_shentsize fields to zero. These fields are used to locate the section headers for an ELF object. The section headers in turn contain the information needed to look deeper into an object. Section headers are not used to run a program, only to examine it. Zeroing them is a crude but effective way to obscure what's inside. ELF objects are just files after all, and anyone with write access can modify them. It's not unheard of to modify an ELF object using a binary-capable editor like emacs.
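
In fact, using the field-level elfedit commands described above, this sort of vandalism takes only a moment to reproduce on a scratch copy (the ehdr commands shown follow the field naming convention from the elfedit article; multiple -e options apply in sequence):

% cp /usr/bin/ls /tmp/noshdr
% elfedit -e 'ehdr:e_shoff 0' -e 'ehdr:e_shnum 0' \
      -e 'ehdr:e_shentsize 0' -e 'ehdr:e_shstrndx 0' /tmp/noshdr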

Fortunately, the design of ELF makes it difficult to actually hide what an object calls from other sharable objects. And since the system call stubs are all located in libc, you can't hide the system calls your code makes. Here is one way to look inside:

% ldd -r -e LD_DEBUG=bindings zoneadmd 2>&1 | fgrep "binding file=zoneadmd"
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_environ'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__iob'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_cleanup'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `atexit'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `__fpstart'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `exit'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__deregister_frame_info_bases'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `_Jv_RegisterClasses'
04992: binding file=zoneadmd to 0x0 (undefined weak): symbol `__register_frame_info_bases'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `getenv'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `setsid'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `printf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fflush'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `signal'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_create'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_join'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `malloc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pthread_cancel'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `free'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `close'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `snprintf'
04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `inet_addr'
04992: binding file=zoneadmd to file=/lib/libnsl.so.1: symbol `gethostbyname'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `bcopy'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `ntohl'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `socket'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htons'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `htonl'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `connect'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `___errno'
04992: binding file=zoneadmd to file=/lib/libsocket.so.1: symbol `getsockopt'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fcntl'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `select'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `write'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `read'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strcpy'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `gettimeofday'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strstr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mkstemp'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fdopen'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fopen'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `unlink'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputs'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fputc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lseek'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fclose'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fprintf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fread'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `putc'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_lxstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_fxstat'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `_xmknod'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `open'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `mmap'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strrchr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `nanosleep'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `fork'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `dup2'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `pipe'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execve'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `kill'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `waitpid'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `localtime_r'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `utimes'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strchr'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `sscanf'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `strtoul'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `rename'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `chmod'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `execl'
04992: binding file=zoneadmd to file=/lib/libc.so.1: symbol `lockf' 
Unlike some other object systems, ELF inter-object references are always looked up by name at runtime. The runtime linker hashes the name and looks it up on the first reference. So if you want to actually call something outside of your own object, you have to call it by its real name. This information is located via the object's program headers, and unlike the section headers, they need to be reasonably accurate for the object to work.

The above experience led us to consider a new feature for elfdump. What if we started with the program headers, and generated a set of "fake" section headers based on the information they contain? Obviously the information available would be reduced in comparison to the real section headers, because the program headers only contain the information needed to run the object. Nonetheless, it would certainly be better than nothing in the case where the section headers are gone. And what about the case where they are present, but we fear that they have been maliciously modified? The information from the "fake" section headers could be compared to that from the actual section headers.

As a result of this worm episode and the aftermath, I added the -P option to elfdump last July:

PSARC 2007/395 Add -P option to elfdump
6530249 elfdump should handle ELF files with no section header table
With an object that has section headers, fake section headers will not be used unless you explicitly use the -P option. If an object doesn't have any section headers, then elfdump automatically turns on the -P option for you.

Let's use the new elfdump with fake section headers to examine the telnet worm. I apologize for the length of this output, but the length underscores the point — there is a lot of information that we can recover from this damaged object:

% elfdump zoneadmd

ELF Header
  ei_magic:   { 0x7f, E, L, F }
  ei_class:   ELFCLASS32          ei_data:      ELFDATA2LSB
  e_machine:  EM_386              e_version:    EV_CURRENT
  e_type:     ET_EXEC
  e_flags:                     0
  e_entry:             0x80512d4  e_ehsize:     52  e_shstrndx:  0
  e_shoff:                     0  e_shentsize:   0  e_shnum:     0  (see shdr[0].sh_size)
  e_phoff:                  0x34  e_phentsize:  32  e_phnum:     5

Program Header[0]:
    p_vaddr:      0x8050034   p_flags:    [ PF_X PF_R ]
    p_paddr:      0           p_type:     [ PT_PHDR ]
    p_filesz:     0xa0        p_memsz:    0xa0
    p_offset:     0x34        p_align:    0

Program Header[1]:
    p_vaddr:      0           p_flags:    [ PF_R ]
    p_paddr:      0           p_type:     [ PT_INTERP ]
    p_filesz:     0x11        p_memsz:    0
    p_offset:     0xd4        p_align:    0

Program Header[2]:
    p_vaddr:      0x8050000   p_flags:    [ PF_X PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x6491      p_memsz:    0x6491
    p_offset:     0           p_align:    0x10000

Program Header[3]:
    p_vaddr:      0x8066494   p_flags:    [ PF_X PF_W PF_R ]
    p_paddr:      0           p_type:     [ PT_LOAD ]
    p_filesz:     0x3e0       p_memsz:    0xc10
    p_offset:     0x6494      p_align:    0x10000

Program Header[4]:
    p_vaddr:      0x80665c4   p_flags:    [ PF_X PF_W PF_R ]
    p_paddr:      0           p_type:     [ PT_DYNAMIC ]
    p_filesz:     0xd8        p_memsz:    0
    p_offset:     0x65c4      p_align:    0

Section Header[1]:  sh_name: .dynamic(phdr)
    sh_addr:      0x80665c4       sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0xd8            sh_type:    [ SHT_DYNAMIC ]
    sh_offset:    0x65c4          sh_entsize: 0x8 (27 entries)
    sh_link:      2               sh_info:    0
    sh_addralign: 0x4       

Section Header[2]:  sh_name: .dynstr(phdr)
    sh_addr:      0x8050890       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x2da           sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x890           sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x1       

Section Header[3]:  sh_name: .dynsym(phdr)
    sh_addr:      0x8050380       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x510           sh_type:    [ SHT_DYNSYM ]
    sh_offset:    0x380           sh_entsize: 0x10 (81 entries)
    sh_link:      2               sh_info:    1
    sh_addralign: 0x4       

Section Header[4]:  sh_name: .hash(phdr)
    sh_addr:      0x80500e8       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x298           sh_type:    [ SHT_HASH ]
    sh_offset:    0xe8            sh_entsize: 0x4 (166 entries)
    sh_link:      3               sh_info:    0
    sh_addralign: 0x4       

Section Header[5]:  sh_name: .SUNW_version(phdr)
    sh_addr:      0x8050b6c       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0xa0            sh_type:    [ SHT_SUNW_verneed ]
    sh_offset:    0xb6c           sh_entsize: 0x1 (160 entries)
    sh_link:      2               sh_info:    5
    sh_addralign: 0x4       

Section Header[6]:  sh_name: .interp(phdr)
    sh_addr:      0x80500d4       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x11            sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0xd4            sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x1       

Section Header[7]:  sh_name: .rel(phdr)
    sh_addr:      0x8050c0c       sh_flags:   [ SHF_ALLOC ]
    sh_size:      0x258           sh_type:    [ SHT_REL ]
    sh_offset:    0xc0c           sh_entsize: 0x8 (75 entries)
    sh_link:      3               sh_info:    0
    sh_addralign: 0x4       

Interpreter Section:  .interp(phdr)
	/usr/lib/ld.so.1

Version Needed Section:  .SUNW_version(phdr)
            file                        version
            libnsl.so.1                 SUNW_0.7             
            libsocket.so.1              SUNW_0.7             
            librt.so.1                  SUNW_1.2             
            libpthread.so.1             SUNW_1.2             
            libc.so.1                   SUNW_1.1             

Symbol Table Section:  .dynsym(phdr)
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x08067088 0x00000004  OBJT WEAK  D    0 22             environ
       [2]  0x080511f4 0x00000000  FUNC GLOB  D    0 UNDEF          dup2
       [3]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          _Jv_RegisterClasses
       [4]  0x080510a4 0x00000000  FUNC GLOB  D    0 UNDEF          strstr
       [5]  0x08050f64 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_cancel
       [6]  0x080511a4 0x00000000  FUNC GLOB  D    0 UNDEF          open
       [7]  0x080511d4 0x00000000  FUNC GLOB  D    0 UNDEF          nanosleep
       [8]  0x08051244 0x00000000  FUNC GLOB  D    0 UNDEF          localtime_r
       [9]  0x080665c4 0x00000000  OBJT GLOB  D    0 15             _DYNAMIC
      [10]  0x08050ea4 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [11]  0x08051174 0x00000000  FUNC GLOB  D    0 UNDEF          _lxstat
      [12]  0x080670a4 0x00000000  OBJT GLOB  D    0 22             _end
      [13]  0x08050fd4 0x00000000  FUNC GLOB  D    0 UNDEF          ntohl
      [14]  0x080510e4 0x00000000  FUNC GLOB  D    0 UNDEF          unlink
      [15]  0x08051114 0x00000000  FUNC GLOB  D    0 UNDEF          lseek
      [16]  0x08050ee4 0x00000000  FUNC GLOB  D    0 UNDEF          getenv
      [17]  0x08051234 0x00000000  FUNC GLOB  D    0 UNDEF          waitpid
      [18]  0x080510d4 0x00000000  FUNC GLOB  D    0 UNDEF          fopen
      [19]  0x08051164 0x00000000  FUNC GLOB  D    0 UNDEF          _xstat
      [20]  0x08051074 0x00000000  FUNC GLOB  D    0 UNDEF          read
      [21]  0x080511e4 0x00000000  FUNC GLOB  D    0 UNDEF          fork
      [22]  0x08051094 0x00000000  FUNC GLOB  D    0 UNDEF          gettimeofday
      [23]  0x08050fe4 0x00000000  FUNC GLOB  D    0 UNDEF          socket
      [24]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __deregister_frame_info_bases
      [25]  0x08051294 0x00000000  FUNC GLOB  D    0 UNDEF          rename
      [26]  0x08050f54 0x00000000  FUNC GLOB  D    0 UNDEF          malloc
      [27]  0x080511b4 0x00000000  FUNC GLOB  D    0 UNDEF          mmap
      [28]  0x08050f94 0x00000000  FUNC GLOB  D    0 UNDEF          snprintf
      [29]  0x08051284 0x00000000  FUNC GLOB  D    0 UNDEF          strtoul
      [30]  0x08051264 0x00000000  FUNC GLOB  D    0 UNDEF          strchr
      [31]  0x08051274 0x00000000  FUNC GLOB  D    0 UNDEF          sscanf
      [32]  0x08051224 0x00000000  FUNC GLOB  D    0 UNDEF          kill
      [33]  0x08051254 0x00000000  FUNC GLOB  D    0 UNDEF          utimes
      [34]  0x08051184 0x00000000  FUNC GLOB  D    0 UNDEF          _fxstat
      [35]  0x08050e94 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [36]  0x08051124 0x00000000  FUNC GLOB  D    0 UNDEF          fclose
      [37]  0x080510b4 0x00000000  FUNC GLOB  D    0 UNDEF          mkstemp
      [38]  0x08066494 0x00000000  OBJT GLOB  D    0 14             _GLOBAL_OFFSET_TABLE_
      [39]  0x080512b4 0x00000000  FUNC GLOB  D    0 UNDEF          execl
      [40]  0x08051214 0x00000000  FUNC GLOB  D    0 UNDEF          execve
      [41]  0x08051144 0x00000000  FUNC GLOB  D    0 UNDEF          fread
      [42]  0x08050e74 0x00000000  FUNC WEAK  D    0 UNDEF          _cleanup
      [43]  0x08050f24 0x00000000  FUNC GLOB  D    0 UNDEF          signal
      [44]  0x08051064 0x00000000  FUNC GLOB  D    0 UNDEF          write
      [45]  0x080510c4 0x00000000  FUNC GLOB  D    0 UNDEF          fdopen
      [46]  0x08050e64 0x00000000  OBJT GLOB  D    0 9              _PROCEDURE_LINKAGE_TABLE_
      [47]  0x08051154 0x00000000  FUNC GLOB  D    0 UNDEF          putc
      [48]  0x08056491 0x00000000  OBJT GLOB  D    0 13             _etext
      [49]  0x08050ef4 0x00000000  FUNC GLOB  D    0 UNDEF          setsid
      [50]  0x080512a4 0x00000000  FUNC GLOB  D    0 UNDEF          chmod
      [51]  0x08051194 0x00000000  FUNC GLOB  D    0 UNDEF          _xmknod
      [52]  0x08066874 0x00000000  OBJT GLOB  D    0 21             _edata
      [53]  0x08051054 0x00000000  FUNC GLOB  D    0 UNDEF          select
      [54]  0x080668a0 0x000003c0  OBJT WEAK  D    0 22             _iob
      [55]  0x08051014 0x00000000  FUNC GLOB  D    0 UNDEF          connect
      [56]  0x08050fb4 0x00000000  FUNC GLOB  D    0 UNDEF          gethostbyname
      [57]  0x080511c4 0x00000000  FUNC GLOB  D    0 UNDEF          strrchr
      [58]  0x080668a0 0x000003c0  OBJT GLOB  D    0 22             __iob
      [59]  0x08050f14 0x00000000  FUNC GLOB  D    0 UNDEF          fflush
      [60]  0x08051034 0x00000000  FUNC GLOB  D    0 UNDEF          getsockopt
      [61]  0x08051044 0x00000000  FUNC GLOB  D    0 UNDEF          fcntl
      [62]  0x080512c4 0x00000000  FUNC GLOB  D    0 UNDEF          lockf
      [63]  0x08050fa4 0x00000000  FUNC GLOB  D    0 UNDEF          inet_addr
      [64]  0x08050f44 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_join
      [65]  0x08051024 0x00000000  FUNC GLOB  D    0 UNDEF          ___errno
      [66]  0x08051104 0x00000000  FUNC GLOB  D    0 UNDEF          fputc
      [67]  0x08050e84 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [68]  0x08050f04 0x00000000  FUNC GLOB  D    0 UNDEF          printf
      [69]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __register_frame_info_bases
      [70]  0x08051204 0x00000000  FUNC GLOB  D    0 UNDEF          pipe
      [71]  0x08051004 0x00000000  FUNC GLOB  D    0 UNDEF          htonl
      [72]  0x08050fc4 0x00000000  FUNC GLOB  D    0 UNDEF          bcopy
      [73]  0x08050ff4 0x00000000  FUNC GLOB  D    0 UNDEF          htons
      [74]  0x08050f34 0x00000000  FUNC GLOB  D    0 UNDEF          pthread_create
      [75]  0x08067088 0x00000004  OBJT GLOB  D    0 22             _environ
      [76]  0x08050f84 0x00000000  FUNC GLOB  D    0 UNDEF          close
      [77]  0x08050f74 0x00000000  FUNC GLOB  D    0 UNDEF          free
      [78]  0x080510f4 0x00000000  FUNC GLOB  D    0 UNDEF          fputs
      [79]  0x08051134 0x00000000  FUNC GLOB  D    0 UNDEF          fprintf
      [80]  0x08051084 0x00000000  FUNC GLOB  D    0 UNDEF          strcpy

Hash Section:  .hash(phdr)
    bucket  symndx      name
         0  [1]         environ
            [2]         dup2
         1  [3]         _Jv_RegisterClasses
            [4]         strstr
         2  [5]         pthread_cancel
            [6]         open
            [7]         nanosleep
            [8]         localtime_r
         3  [9]         _DYNAMIC
         4  [10]        exit
         6  [11]        _lxstat
        10  [12]        _end
        15  [13]        ntohl
            [14]        unlink
        18  [15]        lseek
            [16]        getenv
        19  [17]        waitpid
        20  [18]        fopen
        21  [19]        _xstat
            [20]        read
        22  [21]        fork
        23  [22]        gettimeofday
            [23]        socket
            [24]        __deregister_frame_info_bases
        24  [25]        rename
            [26]        malloc
        27  [27]        mmap
        28  [28]        snprintf
        29  [29]        strtoul
            [30]        strchr
        30  [31]        sscanf
            [32]        kill
        31  [33]        utimes
        33  [34]        _fxstat
            [35]        __fpstart
            [36]        fclose
        34  [37]        mkstemp
        35  [38]        _GLOBAL_OFFSET_TABLE_
        38  [39]        execl
        39  [40]        execve
            [41]        fread
        41  [42]        _cleanup
        43  [43]        signal
            [44]        write
        44  [45]        fdopen
        46  [46]        _PROCEDURE_LINKAGE_TABLE_
            [47]        putc
            [48]        _etext
        47  [49]        setsid
        48  [50]        chmod
        49  [51]        _xmknod
        52  [52]        _edata
            [53]        select
            [54]        _iob
        53  [55]        connect
        55  [56]        gethostbyname
            [57]        strrchr
        59  [58]        __iob
            [59]        fflush
        62  [60]        getsockopt
            [61]        fcntl
            [62]        lockf
        63  [63]        inet_addr
            [64]        pthread_join
            [65]        ___errno
        64  [66]        fputc
            [67]        atexit
        65  [68]        printf
        66  [69]        __register_frame_info_bases
            [70]        pipe
            [71]        htonl
        71  [72]        bcopy
        73  [73]        htons
        76  [74]        pthread_create
        77  [75]        _environ
            [76]        close
        78  [77]        free
        80  [78]        fputs
            [79]        fprintf
        81  [80]        strcpy

        35  buckets contain        0 symbols
        25  buckets contain        1 symbols
        15  buckets contain        2 symbols
         7  buckets contain        3 symbols
         1  buckets contain        4 symbols
        83  buckets               80 symbols (globals)

Relocation Section:  .rel(phdr)
    type                       offset             section        symbol
  R_386_GLOB_DAT            0x80664b0             .rel(phdr)     __deregister_frame_info_bases
  R_386_GLOB_DAT            0x80664b8             .rel(phdr)     __register_frame_info_bases
  R_386_GLOB_DAT            0x80664bc             .rel(phdr)     _Jv_RegisterClasses
  R_386_COPY                0x8067088             .rel(phdr)     _environ
  R_386_COPY                0x80668a0             .rel(phdr)     __iob
  R_386_JMP_SLOT            0x80664a0             .rel(phdr)     _cleanup
  R_386_JMP_SLOT            0x80664a4             .rel(phdr)     atexit
  R_386_JMP_SLOT            0x80664a8             .rel(phdr)     __fpstart
  R_386_JMP_SLOT            0x80664ac             .rel(phdr)     exit
  R_386_JMP_SLOT            0x80664b4             .rel(phdr)     __deregister_frame_info_bases
  R_386_JMP_SLOT            0x80664c0             .rel(phdr)     _Jv_RegisterClasses
  R_386_JMP_SLOT            0x80664c4             .rel(phdr)     __register_frame_info_bases
  R_386_JMP_SLOT            0x80664c8             .rel(phdr)     getenv
  R_386_JMP_SLOT            0x80664cc             .rel(phdr)     setsid
  R_386_JMP_SLOT            0x80664d0             .rel(phdr)     printf
  R_386_JMP_SLOT            0x80664d4             .rel(phdr)     fflush
  R_386_JMP_SLOT            0x80664d8             .rel(phdr)     signal
  R_386_JMP_SLOT            0x80664dc             .rel(phdr)     pthread_create
  R_386_JMP_SLOT            0x80664e0             .rel(phdr)     pthread_join
  R_386_JMP_SLOT            0x80664e4             .rel(phdr)     malloc
  R_386_JMP_SLOT            0x80664e8             .rel(phdr)     pthread_cancel
  R_386_JMP_SLOT            0x80664ec             .rel(phdr)     free
  R_386_JMP_SLOT            0x80664f0             .rel(phdr)     close
  R_386_JMP_SLOT            0x80664f4             .rel(phdr)     snprintf
  R_386_JMP_SLOT            0x80664f8             .rel(phdr)     inet_addr
  R_386_JMP_SLOT            0x80664fc             .rel(phdr)     gethostbyname
  R_386_JMP_SLOT            0x8066500             .rel(phdr)     bcopy
  R_386_JMP_SLOT            0x8066504             .rel(phdr)     ntohl
  R_386_JMP_SLOT            0x8066508             .rel(phdr)     socket
  R_386_JMP_SLOT            0x806650c             .rel(phdr)     htons
  R_386_JMP_SLOT            0x8066510             .rel(phdr)     htonl
  R_386_JMP_SLOT            0x8066514             .rel(phdr)     connect
  R_386_JMP_SLOT            0x8066518             .rel(phdr)     ___errno
  R_386_JMP_SLOT            0x806651c             .rel(phdr)     getsockopt
  R_386_JMP_SLOT            0x8066520             .rel(phdr)     fcntl
  R_386_JMP_SLOT            0x8066524             .rel(phdr)     select
  R_386_JMP_SLOT            0x8066528             .rel(phdr)     write
  R_386_JMP_SLOT            0x806652c             .rel(phdr)     read
  R_386_JMP_SLOT            0x8066530             .rel(phdr)     strcpy
  R_386_JMP_SLOT            0x8066534             .rel(phdr)     gettimeofday
  R_386_JMP_SLOT            0x8066538             .rel(phdr)     strstr
  R_386_JMP_SLOT            0x806653c             .rel(phdr)     mkstemp
  R_386_JMP_SLOT            0x8066540             .rel(phdr)     fdopen
  R_386_JMP_SLOT            0x8066544             .rel(phdr)     fopen
  R_386_JMP_SLOT            0x8066548             .rel(phdr)     unlink
  R_386_JMP_SLOT            0x806654c             .rel(phdr)     fputs
  R_386_JMP_SLOT            0x8066550             .rel(phdr)     fputc
  R_386_JMP_SLOT            0x8066554             .rel(phdr)     lseek
  R_386_JMP_SLOT            0x8066558             .rel(phdr)     fclose
  R_386_JMP_SLOT            0x806655c             .rel(phdr)     fprintf
  R_386_JMP_SLOT            0x8066560             .rel(phdr)     fread
  R_386_JMP_SLOT            0x8066564             .rel(phdr)     putc
  R_386_JMP_SLOT            0x8066568             .rel(phdr)     _xstat
  R_386_JMP_SLOT            0x806656c             .rel(phdr)     _lxstat
  R_386_JMP_SLOT            0x8066570             .rel(phdr)     _fxstat
  R_386_JMP_SLOT            0x8066574             .rel(phdr)     _xmknod
  R_386_JMP_SLOT            0x8066578             .rel(phdr)     open
  R_386_JMP_SLOT            0x806657c             .rel(phdr)     mmap
  R_386_JMP_SLOT            0x8066580             .rel(phdr)     strrchr
  R_386_JMP_SLOT            0x8066584             .rel(phdr)     nanosleep
  R_386_JMP_SLOT            0x8066588             .rel(phdr)     fork
  R_386_JMP_SLOT            0x806658c             .rel(phdr)     dup2
  R_386_JMP_SLOT            0x8066590             .rel(phdr)     pipe
  R_386_JMP_SLOT            0x8066594             .rel(phdr)     execve
  R_386_JMP_SLOT            0x8066598             .rel(phdr)     kill
  R_386_JMP_SLOT            0x806659c             .rel(phdr)     waitpid
  R_386_JMP_SLOT            0x80665a0             .rel(phdr)     localtime_r
  R_386_JMP_SLOT            0x80665a4             .rel(phdr)     utimes
  R_386_JMP_SLOT            0x80665a8             .rel(phdr)     strchr
  R_386_JMP_SLOT            0x80665ac             .rel(phdr)     sscanf
  R_386_JMP_SLOT            0x80665b0             .rel(phdr)     strtoul
  R_386_JMP_SLOT            0x80665b4             .rel(phdr)     rename
  R_386_JMP_SLOT            0x80665b8             .rel(phdr)     chmod
  R_386_JMP_SLOT            0x80665bc             .rel(phdr)     execl
  R_386_JMP_SLOT            0x80665c0             .rel(phdr)     lockf

Dynamic Section:  .dynamic(phdr)
     index  tag                value
       [0]  NEEDED            0x27f               libnsl.so.1
       [1]  NEEDED            0x294               libsocket.so.1
       [2]  NEEDED            0x2a3               librt.so.1
       [3]  NEEDED            0x2b7               libpthread.so.1
       [4]  NEEDED            0x2c7               libc.so.1
       [5]  INIT              0x80538fc           
       [6]  FINI              0x8053909           
       [7]  HASH              0x80500e8           
       [8]  STRTAB            0x8050890           
       [9]  STRSZ             0x2da               
      [10]  SYMTAB            0x8050380           
      [11]  SYMENT            0x10                
      [12]  CHECKSUM          0x3126              
      [13]  VERNEED           0x8050b6c           
      [14]  VERNEEDNUM        0x5                 
      [15]  PLTRELSZ          0x230               
      [16]  PLTREL            0x11                
      [17]  JMPREL            0x8050c34           
      [18]  REL               0x8050c0c           
      [19]  RELSZ             0x258               
      [20]  RELENT            0x8                 
      [21]  DEBUG             0                   
      [22]  FEATURE_1         0x1                 [ PARINIT ]
      [23]  FLAGS             0                   0
      [24]  FLAGS_1           0                   0
      [25]  PLTGOT            0x8066494           
      [26]  NULL              0                   
I don't expect this feature to get much daily use, but it will be handy to have it in the forensic toolchest next time we're scrambling to understand a damaged or malicious object.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Tuesday Jun 12, 2007

Changing ELF Runpaths (Code Included)

A recent change to Solaris ELF files makes it possible to change the runpath of a dynamic executable or sharable object, something that has not been safely possible up until now. This change is currently found in Solaris Nevada (the current development version of Solaris) and in OpenSolaris. It is not yet available in Solaris 10, but in time will appear in the standard shipping Solaris as well.

This seems like a good time to talk about runpaths and the business of how the runtime linker finds dependencies. I also provide a small program named rpath that you can use to modify the runpaths in your files (assuming they were linked under Nevada or OpenSolaris).

The Runpath Problem

The runtime linker looks in the following places, in the order listed, to find the sharable objects it loads into a process at startup time:

  • If the LD_LIBRARY_PATH environment variable (or the related LD_LIBRARY_PATH_32 and LD_LIBRARY_PATH_64 variables) is defined, the directories it specifies are searched first.

  • If the executable, or any sharable object that has been loaded, contains a runpath, the directories it specifies are searched to resolve dependencies for that object.

  • Finally, it searches two default locations for any remaining dependencies: /lib and /usr/lib (or /lib/64 and /usr/lib/64 for 64-bit code).
As if this weren't complicated enough, it should be noted that the crle command can be used to set values for LD_LIBRARY_PATH and to change the default directories.

The above scheme offers a great deal of flexibility, and it usually works well. There is however one notable exception — the "Runpath Problem". The problem is that many objects are not built with a correct runpath, and once an object has been built, it has not been possible to change it. It is common to find objects where the runpath is correct on the system the object was built on, but not on the system where it is installed. Usually, we deal with this all too common situation by setting LD_LIBRARY_PATH, or by creating a linker configuration file with crle. Such solutions have serious downsides, as detailed in an earlier blog entry by Rod Evans entitled "LD_LIBRARY_PATH - just say no".

Both approaches will cause unrelated programs to look in unnecessary additional directories for their dependencies. At best, this imposes unnecessary overhead on their operation. At worst, they may end up binding to the wrong version of a given library, leading to mysterious and hard to debug failures. The environment variable approach is simply too broad.

One important technique that people sometimes use is to set the environment variables in a wrapper shell script, which may look something like:

#!/bin/sh
#
# Run myapp, setting LD_LIBRARY_PATH so it will run

LD_LIBRARY_PATH="/this/that/theother:/someplace/else"
export LD_LIBRARY_PATH

exec /usr/local/myapp
This is a huge improvement over simply setting LD_CONFIG or LD_LIBRARY_PATH in your shell login config script (.profile, .cshrc, .bashrc, etc), for many reasons:
  • Reduces the scope of influence to only cover the application (and its children — see below)
  • Doesn't require each user to modify their login script(s)
  • Can be managed in a central location
It isn't perfect though. If the program in question should happen to run any child processes (and this is more common than many realize), those child processes will inherit the LD_CONFIG and LD_LIBRARY_PATH settings you've established for this one program. This leak may or may not cause problems, depending on what programs are run.

It would be far better to modify the object in question and set a runpath that accurately reflects the actual location of its dependencies. The effect of a runpath is limited to the file that contains it, so this solution does not "bleed through" to unrelated files, and it imposes no unnecessary overhead on the general operation of the system. This would be a superior solution if it were possible; however, it hasn't been an option until recently.

How Runpaths Are Implemented

Every dynamic executable contains a dynamic section. This is an array of items which convey the information required by the dynamic linker (ld.so.1) to do its work. If an object has a runpath, there will be a DT_RUNPATH and/or DT_RPATH item in the dynamic section (there is more than one of these for historical reasons). As an example, let's examine crle:
% elfdump -d /usr/bin/crle | grep 'R*PATH'
       [4]  RUNPATH           0x612               $ORIGIN/../lib
       [5]  RPATH             0x612               $ORIGIN/../lib
The string (in this case, "$ORIGIN/../lib") is not actually stored in the dynamic section. Rather, it is contained in the dynamic string table (.dynstr). The value 0x612 is the offset within the string table at which the desired string starts.

A string table is a section that contains NULL terminated strings, one immediately following the other. To access a given string, you add the offset of the string within the section to the base of the section data area. Consider a string table that contains the names of two variables "var1", and "var2" and a runpath "$ORIGIN/../lib". By ELF convention, string tables always have a 0-length NULL terminated string in the first position. In C language notation, we might declare the contents of the resulting string table section containing these 4 strings as

"\\0var1\\0var2\\0$ORIGIN/../lib"
The indexes of the 4 strings in our table are [0], [1], [6], and [11], and any item in the dynamic section or the dynamic symbol table that needs one of these strings will specify it using the appropriate index. An interesting result of the way that string tables are designed is that every single offset into a string table represents a usable string. Although our intent with the C string above was to represent 4 strings, it actually contains 23 potential strings (26 if you count the duplicate NULL strings), and not just the 4 we intentionally inserted. Listing them by offset, they are:
[0]  ""

[1]  "var1"
[2]  "ar1"
[3]  "r1"
[4]  "1"
[5]  ""

[6]  "var2"
[7]  "ar2"
[8]  "r2"
[9]  "2"
[10] ""

[11] "$ORIGIN/../lib"
[12] "ORIGIN/../lib"
[13] "RIGIN/../lib"
[14] "IGIN/../lib"
[15] "GIN/../lib"
[16] "IN/../lib"
[17] "N/../lib"
[18] "/../lib"
[19] "../lib"
[20] "./lib"
[21] "/lib"
[22] "lib"
[23] "ib"
[24] "b"
[25] ""
This is a very efficient scheme, since each string need appear only once in the string table, and multiple ELF items can refer to it. It also allows fixed-size items, like ELF symbols or dynamic section entries, to efficiently reference variable-length strings. There are two things to note, however:
  1. For a given string, there is no way to tell if it is referenced, where it is referenced from, or how many references there are.
  2. There is no room to add new strings to a string table.
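
As an aside, the "every offset is a usable string" behavior above is easy to demonstrate in C. The following toy program (mine, not part of any tool) embeds the example table as a character array and prints a few of the strings hiding inside it:

#include <stdio.h>

int
main(void)
{
	/* The example string table from the text, as one C array */
	static const char strtab[] = "\0var1\0var2\0$ORIGIN/../lib";

	/* Any valid offset yields a usable NULL terminated string */
	(void) printf("[11] \"%s\"\n", &strtab[11]);	/* $ORIGIN/../lib */
	(void) printf("[18] \"%s\"\n", &strtab[18]);	/* /../lib */
	(void) printf("[21] \"%s\"\n", &strtab[21]);	/* /lib */
	return (0);
}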

The options for modifying a runpath in this situation are limited:

  • Any string already in the string table, such as the 23 options listed in our example above, can be safely set as a runpath by simply changing the offset in the runpath DT entries. Note, however, that most string table strings are variable and file (not directory) names that are unlikely to make useful path strings, so this option rarely helps.

  • You might overwrite the existing path string with a new string of equal or shorter length, in the (usually true) belief that nothing else is accessing that particular string. Using a binary-aware editor, it is a simple matter to locate and overwrite the existing string. This usually works, but if another part of the file accesses that string, the change will break it. We cannot recommend or stand behind this, even though we have done it ourselves for one-off experiments (never in a shipping product).

  • A better approach would be to add a new string to the end of the section, and then change the offset in the dynamic section to use it. Traditionally, ELF files have not had any extra room in the string table section to allow this.

As a result, it has not been possible to support the modification of the runpath in an existing object up until recently.

Making Room

I recently integrated a change to Solaris Nevada (and OpenSolaris) to add a little unused space to our ELF files, in order to facilitate a limited amount of post-link modification:
PSARC 2007/127 Reserved space for editing ELF dynamic sections
6516118 Reserved space needed in ELF dynamic section and
        string table
This change does three things:
  1. Adds some extra NULL bytes to the end of every dynamic string table. (The current value is 512 bytes, but this can change in the future).

  2. Adds a new dynamic section entry named DT_SUNW_STRPAD to keep track of the size of the unused space at the end of the dynamic string table.

  3. Adds some extra (currently 10) unused DT_NULL entries at the end of the dynamic section.
This additional space is small enough that it doesn't increase the size of real world objects by a significant amount. Though small, it gives us a lot of new flexibility. The room in the string table allows for the safe addition of a moderate number of new strings. The additional null DT entries allow us to add a DT_RUNPATH item if the file doesn't already have one to modify. Looking at crle again:
% elfdump -d /usr/bin/crle | egrep 'R*PATH|STRPAD'
       [4]  RUNPATH           0x612               $ORIGIN/../lib
       [5]  RPATH             0x612               $ORIGIN/../lib
      [32]  SUNW_STRPAD       0x200               
The SUNW_STRPAD entry tells us that the dynamic string table has 512 (0x200) bytes of unused space available at the end of its data area.

The way this works is very simple: If a file lacks a DT_SUNW_STRPAD dynamic entry, then we know that it is an older file, and that the dynamic string table does not have any extra space. If it does have a DT_SUNW_STRPAD, then its value tells us how much room is available. In this case, we can add the string, modify the DT_RUNPATH items, and reduce the DT_SUNW_STRPAD value by the number of bytes we used.
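
To make the bookkeeping concrete, here is a small sketch (not the actual rpath source; the function and variable names are hypothetical). Here 'used' is the offset of the first unused byte in the string table, and the numbers in main() are borrowed from the rpath session shown later in this post:

#include <stdio.h>
#include <string.h>

/*
 * Decide whether a new string fits in the reserved pad area.
 * On success, a real tool would write the string at the returned
 * offset, point DT_RUNPATH/DT_RPATH at it, and reduce the
 * DT_SUNW_STRPAD value by the number of bytes consumed, as done
 * here.
 */
static long
strtab_add(size_t used, size_t *pad, const char *str)
{
	size_t need = strlen(str) + 1;	/* include NULL termination */

	if (need > *pad)
		return (-1);		/* out of room */
	*pad -= need;
	return ((long)used);
}

int
main(void)
{
	size_t pad = 0x200;	/* a fresh object: 512 unused bytes */
	long off = strtab_add(0x33f, &pad, "pointless:runpath");

	(void) printf("offset 0x%lx, 0x%lx pad bytes remain\n",
	    (unsigned long)off, (unsigned long)pad);
	return (0);
}

This prints "offset 0x33f, 0x1ee pad bytes remain", the same values that appear in the SUNW_STRPAD output later in this post.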

If the value in DT_SUNW_STRPAD is too small for our new string, then we are out of luck and cannot add it. This extra room should help in the vast majority of cases, but as with any such approach, there are limits. We recommend the use of the special $ORIGIN token, both because it is a great way to organize objects, and because it is short.
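
For example, an executable whose libraries are installed in a lib directory alongside its own bin directory might be linked like this (the program and library names are hypothetical; the quoting keeps the shell from expanding $ORIGIN):

% cc -o myprog myprog.c -L../lib -lmyutil -R'$ORIGIN/../lib'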

The rpath Utility

Eventually, Solaris will ship with a standard utility for modifying runpaths. However, there is no need to wait. I have written an unofficial test program I call 'rpath' that you can download and build. To build rpath, you will need a version of Solaris Nevada newer than build 61, or a recent version of OpenSolaris. To check your system, try:
% grep DT_SUNW_STRPAD /usr/include/sys/link.h
#define DT_SUNW_STRPAD  0x60000019 /* # of unused bytes at the */
If your grep doesn't find DT_SUNW_STRPAD, your system lacks the necessary support.

To build rpath, unpack the compressed tar file and type 'make'. If you are using gcc, first edit the Makefile and uncomment the CC line:

% gunzip < rpath.tgz | tar xvpf -
% cd rpath
% make
rpath is used as follows:
NAME
     rpath - set/get runpath of ELF dynamic objects

SYNOPSIS
     rpath [-dr] file [runpath]

DESCRIPTION
     rpath can display,  modify,  or  delete  the  runpath  of  a
     dynamic ELF object.

     If called without a runpath  argument  and  without  the  -r
     option,  the  current runpath, if any, is written to stdout.
     If -r is specified, the existing runpath is removed. If run-
     path  is  supplied,  the runpath of the object is set to the
     new value.


OPTIONS
     The following options are supported:

     -d  Cause detailed ELF information about the  ELF  file  and
         the changes being made to it to be written to stderr.

     -r  Instead of adding or modifying the file  runpath,  rpath
         removes  any  DT_RPATH  or  DT_RUNPATH  entries from the
         dynamic section of  the  file.  This  action  completely
          removes  any existing runpath from the file. When this
          option is used, rpath does not allow the runpath argument.

Using rpath

Let's use rpath to look at its own runpath. We will see that it doesn't have one, something that can be verified using elfdump:
% rpath rpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x200               
Now, let's add a runpath to it:
% rpath rpath pointless:runpath
% rpath rpath
pointless:runpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
      [30]  RUNPATH           0x33f      pointless:runpath
Notice that the amount of unused space reported by SUNW_STRPAD has gone down from 512 (0x200) to 494 (0x1ee) bytes, a reduction of 18 bytes. This makes sense, since we added a 17-character string plus its NULL termination.

We can observe the runtime linker looking in 'pointless' and 'runpath' as it loads rpath (note: output is edited for width):

% LD_DEBUG=libs ./rpath 
13707: 
13707: hardware capabilities - 0x25ff7  [ AHF SSE3 SSE2 
       SSE FXSR AMD_3DNowx AMD_3DNow AMD_MMX MMX CMOV
       AMD_SYSC CX8 TSC FPU ]
13707: 
13707: 
13707: configuration file=/var/ld/ld.config: unable to
       process file
13707: 
13707: 
13707: find object=libelf.so.1; searching
13707:  search path=pointless:runpath  (RUNPATH/RPATH
                                        from file rpath)
13707:  trying path=pointless/libelf.so.1
13707:  trying path=runpath/libelf.so.1
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libelf.so.1
13707: 
13707: find object=libc.so.1; searching
13707:  search path=pointless:runpath  (RUNPATH/RPATH from
                                        file rpath)
13707:  trying path=pointless/libc.so.1
13707:  trying path=runpath/libc.so.1
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libc.so.1
13707: 
13707: find object=libc.so.1; searching
13707:  search path=/lib  (default)
13707:  search path=/usr/lib  (default)
13707:  trying path=/lib/libc.so.1
13707: 
13707: 1: 
13707: 1: transferring control: rpath
13707: 1: 
usage: rpath [-dr] file [runpath]
13707: 1: 
Finally, we'll remove the runpath we just added:
% rpath -r rpath
% rpath rpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
Note that even though the runpath is gone, the amount of available extra space in the dynamic string section did not go back up from 494 (0x1ee) to 512 (0x200). Adding strings is a one-way operation. Once they are added, they are permanent. So even though you now have the ability to add strings of moderate length, you won't want to do it indiscriminately.

On the plus side, you can always re-add the same runpath back without using any more space:

% rpath rpath pointless:runpath
% rpath rpath
pointless:runpath
% elfdump -d rpath | egrep 'R*PATH|STRPAD'
      [28]  SUNW_STRPAD       0x1ee               
      [30]  RUNPATH           0x33f      pointless:runpath
rpath found that the string 'pointless:runpath' was already in the string table, so it used it without inserting another copy.

Conclusions

Our best advice has always been that the LD_LIBRARY_PATH environment variable should not be used to work around objects with bad or missing runpaths. It is best to rebuild such objects and set the runpath correctly. This hasn't changed, and you should always do so if you can.

The problem with that advice is that there are times when all you have is the object, and no option to rebuild. In that case, LD_LIBRARY_PATH has been a necessary evil (and one that we've been glad to have). With the advent of objects that can have their runpaths modified, we now have a better answer, and the use of LD_LIBRARY_PATH for this purpose should be allowed to slowly fade away.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Friday Feb 09, 2007

Which Solaris Files Are Stripped?

In my previous blog entry about the new .SUNW_ldynsym sections, I made the following statement:
It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups.

It turns out that this is only partially true...

Brian Utterback posted a comment and pointed out that 490 of the 719 ELF binaries in /usr/bin on his Solaris 10 system are stripped. This shows that Solaris binaries have not been unstripped "for many years". I looked at /usr/bin on my desktop system, which is running a fairly recent Nevada build, and found that only 51 of the 815 files there are stripped. It appears that binaries are (mostly) unstripped now. What changed between Solaris 10 and today? And, why "mostly"?
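
If you want to run the same census on your own system, the file(1) command reports the stripped status of ELF objects, so a pipeline along these lines gives a quick count (the exact numbers will of course vary from build to build):

% file /usr/bin/* | grep ELF | grep -c ', stripped'
% file /usr/bin/* | grep ELF | grep -c 'not stripped'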

As I usually do in such situations, I sent mail to my fellow Linker Alien Rod Evans. I asked him for his recollection of what policies were used for stripping Solaris files in the past. Here is a summary of what he told me:

  • For a very long time, the rule was that executables are stripped, and sharable libraries are not. The underlying idea was that people would not care to debug our executables, but certainly would debug their own programs that are linked to our libraries. We're not sure when this started, but are pretty sure that it covers most, if not all, of the Solaris 2.x era (we've no idea about the SunOS 4.x days).

  • In the early years, enforcement of this rule was rather incomplete, and exceptions occurred. Starting with Solaris 9, automated checks in the nightly builds tightened things up significantly.

  • The policy was changed in September 2005 (2 months before I joined Sun) to not strip any files. The change took effect with Nevada build 24, with
    5072038 binaries shouldn't be stripped
    I imagine that the introduction of DTrace made complete symbol information in binaries more important than before.

  • Solaris is built by combining various "consolidations". The above comments only apply to the core ON consolidation, which consists of the OS and Networking parts of Solaris. The other consolidations are built according to their own rules, which can and do differ. So, you should not be surprised to find some stripped files, even on a current development build of Solaris, like the 51 files I found in /usr/bin on my system.

The new .SUNW_ldynsym sections reduce the need for everything to be unstripped, so we may end up relaxing our ON rule if there is a reason to do so. And if the other consolidations continue to strip their files, .SUNW_ldynsym will provide better observability for them.

Brian is absolutely right — we have not been shipping "Solaris objects and executables" in unstripped form for many years, only sharable libraries! I knew that libraries were unstripped, and that current builds don't strip binaries either, and those two facts misled me.

On the plus side, I have a better understanding of the issue now... :-)


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Feb 07, 2007

What Is .SUNW_ldynsym?

Solaris ELF files have a new ELF symbol table. The section type is SHT_SUNW_LDYNSYM, and the section is named .SUNW_ldynsym. In the 20+ years in which the ELF standard has been in use, we have only needed two symbol tables (.symtab, and .dynsym) to support linking, so the addition of a third symbol table is a notable event for ELF cognoscenti. Even if you aren't one of those, you may encounter these sections, and wonder what they are for. I hope to explain that here.

Solaris has many tools that examine running processes or core files and generate stack traces. For example, consider the following call to pstack(1), made on an Xterm process currently running on my system:

% pstack 3094
3094:   xterm -ls -geometry 80x51+0+175
 fef4bea7 pollsys  (8046600, 2, 0, 0)
 fef0767e pselect  (5, 8400168, 84001e8, fef95260, 0, 0) + 19e
 fef0798e select   (5, 8400168, 84001e8, 0, 0) + 7e
 0805b250 in_put   (10, 8416720, 0, fedd561e, 8416720, 0) + 1b0
 08059b20 VTparse  (84166a8, 8057acc, fed387c5, 8416720, 84166a8, 804688c) + 90
 0805d1f1 VTRun    (8046a28, 8046870, feffa7c0, 8046808, 8046858, 804685c) + 205
 08057add main     (0, 80468b4, 80468c8) + 945
 08056eee _start   (4, 8046a90, 0, 8046a9a, 8046aa4, 0) + 7a
In order to show you those function names, pstack (really the libproc library used by pstack) needs to map the addresses of functions on the stack to the ELF symbols that correspond to them. Usually, these symbols come from the symbol table (.symtab). If this symbol table has been removed with the strip(1) program, then the dynamic symbol table (.dynsym) will be used instead. As described in a previous blog entry, the .dynsym contains the subset of global symbols from .symtab that are needed by the runtime linker ld.so.1(1). This fallback allows us to map global functions to their names, but local function symbols are not available. Observability tools like pstack(1) will display the hexadecimal address of such local functions when a name is not available. This is better than nothing, but is not particularly helpful.
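
The same address-to-name service is available to a process about itself via dladdr(3C). Here is a minimal sketch of such a lookup (my own example, not pstack source); for a static function in a stripped object without .SUNW_ldynsym, dli_sname may come back as a null pointer:

#include <stdio.h>
#include <dlfcn.h>

static void
where_am_i(void)
{
	Dl_info info;

	/* Map our own address back to the nearest symbol */
	if (dladdr((void *)where_am_i, &info) != 0)
		(void) printf("%s in %s\n",
		    (info.dli_sname != NULL) ? info.dli_sname : "<unknown>",
		    info.dli_fname);
}

int
main(void)
{
	where_am_i();
	return (0);
}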

It used to be common practice for system binaries to be stripped in order to save space. However, observability is a central tenet of the Solaris philosophy. Solaris objects and executables are therefore shipped in unstripped form, and have been for many years, in order to support such symbol lookups. For the most part, this has been a winning strategy, but there are still issues that come up from time to time:

  • strip(1) removes much more than the symbol table. Usually the size of this extra data is not a significant concern, but there are certain very large programs where the space savings might be worthwhile. It would be great to strip those particular things, but losing the local function symbols and the ability to make accurate stack traces is a bitter pill to swallow. This has led to a number of proposed features to "strip everything except local function symbols". These ideas are reasonable, but complicated. We like the fact that "strip" is a simple straightforward operation, and want to avoid complicating the concept.

  • We don't strip our files, but many Solaris users do. This becomes a problem when those applications misbehave, and they (or we, if you have a high end support contract) are trying to figure out why. Often, it is not possible to rebuild such applications in order to debug them. The ability to observe unmodified applications running in a production environment is another key Solaris virtue, as exemplified by DTrace.
Over the years, we have observed that these problems would be largely solved if we could add local function symbols to the .dynsym, and that in most programs, the additional space used would be minimal. Last fall, I embarked on a project to do this.

I tried hard to avoid adding a new symbol table type, and instead tried several experiments in which the additional local function symbols were placed in the dynsym. The reason for wanting this was to avoid having to modify ELF utilities and debuggers to know about a new symbol table. If the added symbols are in the existing .dynsym, those tools will automatically see them, without needing modification. As detailed in the ARC case that I filed for this work (PSARC/2006/526), I tried many different permutations. In every case, I discovered undesirable backward compatibility issues that kept me from using that solution. It turns out that the layout of .dynsym, and the other ELF sections that interact with it, are completely constrained, and there is no 100% backward compatible way to add local symbols to it.

ELF was designed from the very beginning to make it possible to introduce new section types with full backward/forward compatibility. You can always safely add a new section, with a moderate amount of care, and it will work. More than anything, this ability to extend ELF accounts for its long life. Given that the .dynsym cannot be extended with local symbols, I made the obvious (in hindsight) decision to introduce a new section type (SHT_SUNW_LDYNSYM), and add a new symbol table section named .SUNW_ldynsym to every Solaris file that has a .dynsym section. Once that decision was made, the implementation was straightforward, giving me confidence that it was the right way to go.

The .SUNW_ldynsym section can be thought of as the local initial part of the .dynsym that we wish to build, but can't. The Solaris linker (ld(1)) takes care to actually place them side by side, so that the end of the .SUNW_ldynsym section leads directly into the start of the .dynsym section. The runtime linker (ld.so.1(1)) takes advantage of this to treat them as a single table within the implementation of dladdr(3C). Note that this trick works for applications that mmap(2) the file and access it directly. If you are accessing an ELF file via libelf, as many utilities do, you can't make any assumptions about the relative positions of different sections.
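
As an illustration of the libelf approach, the following sketch (my own, assuming a system new enough to define SHT_SUNW_LDYNSYM in <sys/elf.h>) locates .SUNW_ldynsym by section type and prints the function symbols it contains; compile with -lelf:

#include <fcntl.h>
#include <gelf.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	int		fd, i;
	Elf		*elf;
	Elf_Scn		*scn = NULL;
	Elf_Data	*data;
	GElf_Shdr	shdr;
	GElf_Sym	sym;

	if ((argc != 2) || (elf_version(EV_CURRENT) == EV_NONE) ||
	    ((fd = open(argv[1], O_RDONLY)) < 0) ||
	    ((elf = elf_begin(fd, ELF_C_READ, NULL)) == NULL))
		exit(1);

	/* Locate the section by type, not by assuming adjacency */
	while ((scn = elf_nextscn(elf, scn)) != NULL)
		if ((gelf_getshdr(scn, &shdr) != NULL) &&
		    (shdr.sh_type == SHT_SUNW_LDYNSYM))
			break;

	if ((scn != NULL) && ((data = elf_getdata(scn, NULL)) != NULL))
		for (i = 0; gelf_getsym(data, i, &sym) != NULL; i++)
			if (GELF_ST_TYPE(sym.st_info) == STT_FUNC) {
				const char *nm = elf_strptr(elf,
				    shdr.sh_link, sym.st_name);

				(void) printf("%s\n",
				    (nm != NULL) ? nm : "?");
			}

	(void) elf_end(elf);
	return (0);
}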

As with .dynsym, .SUNW_ldynsym sections are allocable, meaning that they are part of the process text segment. This means that they are available at runtime for dladdr(3C). It also means that they cannot be stripped. Although you cannot strip .SUNW_ldynsym sections, you can prevent them from being generated by ld(1), by using the -znoldynsym linker option.

.SUNW_ldynsym sections consume a small amount of additional space. We found that for all of core Solaris (OS and Networking), the increase in size was on the order of 1.4%. This small increase pays off by letting our observability tools do a better job. Furthermore, the presence of .SUNW_ldynsym means that in many cases, you can strip programs that you might not have been willing to strip before.

Example

Let's use the following program to see how .SUNW_ldynsym sections improve Solaris observability of local functions:
/*
 * Program to demonstrate SHT_SUNW_LDYNSYM sections. The
 * global main program calls a local function named
 * static_func(). static_func() uses printstack() to exercise
 * the dladdr(3C) function provided by the runtime linker,
 * and then deliberately causes a segfault. The resulting core
 * file can be examined by pstack(1) or mdb(1).
 *
 * In all these cases, if a stripped binary of this program
 * contains a .SUNW_ldynsym section, the static_func() function
 * will be observable by name, and otherwise simply as an
 * address.
 */


#include <ucontext.h>

static void
static_func(void)
{
	/* Use dladdr(3C) to print a call stack */
	printstack(1);

	/*
	 * Write to address 0, killing the process and
	 * producing a core file.
	 */
	*((char *) 0) = 1;
}


int main(int argc, char *argv[])
{
	static_func();
	return (0);
}

Let's build two versions of this program, one containing the .SUNW_ldynsym section, and one without:
% cc -Wl,-znoldynsym test.c -o test_noldynsym
% cc test.c -o test_ldynsym
The elfdump(1) command can be used to let us examine the three symbol tables contained in test_ldynsym. There is no need to examine this (large) output too carefully, but there are some interesting facts worth noticing:
  • Every symbol in .SUNW_ldynsym or .dynsym is also found in .symtab, because .symtab is a superset of the other two tables. This is why it is always preferred to the other two, when available.

  • .symtab is much larger than the other two tables combined, which leads to the temptation to strip it, along with the other things strip(1) removes.

  • The symbols in .dynsym are strictly limited to those needed by the runtime linker.

  • If you consider the .SUNW_ldynsym and .dynsym symbol tables as a single logical entity, you can see that the result follows the rules for ELF symbol table layout.
% elfdump -s test_ldynsym

Symbol Table Section:  .SUNW_ldynsym
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test_ldynsym
       [2]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crti.s
       [3]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.o
       [4]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.s
       [5]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            fsr.s
       [6]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            values-Xa.c
       [7]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test.c
       [8]  0x080507f0 0x00000019  FUNC LOCL  D    0 .text          static_func
       [9]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crtn.s

Symbol Table Section:  .dynsym
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x08050668 0x00000000  OBJT GLOB  D    0 .plt           _PROCEDURE_LINKAGE_TABLE_
       [2]  0x08060974 0x00000004  OBJT WEAK  D    0 .data          environ
       [3]  0x0806088c 0x00000000  OBJT GLOB  D    0 .dynamic       _DYNAMIC
       [4]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bssf          _edata
       [5]  0x08060990 0x00000004  OBJT GLOB  D    0 .data          ___Argv
       [6]  0x08050868 0x00000000  OBJT GLOB  D    0 .rodata        _etext
       [7]  0x0805082c 0x0000001b  FUNC GLOB  D    0 .init          _init
       [8]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS            __fsr_init_value
       [9]  0x08050810 0x00000019  FUNC GLOB  D    0 .text          main
      [10]  0x08060974 0x00000004  OBJT GLOB  D    0 .data          _environ
      [11]  0x08060868 0x00000000  OBJT GLOB  P    0 .got           _GLOBAL_OFFSET_TABLE_
      [12]  0x080506b8 0x00000000  FUNC GLOB  D    0 UNDEF          printstack
      [13]  0x080506a8 0x00000000  FUNC GLOB  D    0 UNDEF          _exit
      [14]  0x08050864 0x00000004  OBJT GLOB  D    0 .rodata        _lib_version
      [15]  0x08050698 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [16]  0x08050678 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [17]  0x0805076c 0x0000007b  FUNC GLOB  D    0 .text          __fsr
      [18]  0x08050688 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [19]  0x080506c8 0x00000000  FUNC WEAK  D    0 UNDEF          _get_exit_frame_monitor
      [20]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bss           _end
      [21]  0x080506e0 0x0000008b  FUNC GLOB  D    0 .text          _start
      [22]  0x08050848 0x0000001b  FUNC GLOB  D    0 .fini          _fini
      [23]  0x08060978 0x00000018  OBJT GLOB  D    0 .data          __environ_lock
      [24]  0x0806099c 0x00000004  OBJT GLOB  D    0 .data          __longdouble_used
      [25]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __1cG__CrunMdo_exit_code6F_v_

Symbol Table Section:  .symtab
     index    value      size      type bind oth ver shndx          name
       [0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF          
       [1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test_ldynsym
       [2]  0x080500f4 0x00000000  SECT LOCL  D    0 .interp        
       [3]  0x08050108 0x00000000  SECT LOCL  D    0 .SUNW_cap      
       [4]  0x08050118 0x00000000  SECT LOCL  D    0 .hash          
       [5]  0x080501fc 0x00000000  SECT LOCL  D    0 .SUNW_ldynsym  
       [6]  0x0805029c 0x00000000  SECT LOCL  D    0 .dynsym        
       [7]  0x0805043c 0x00000000  SECT LOCL  D    0 .dynstr        
       [8]  0x080505c4 0x00000000  SECT LOCL  D    0 .SUNW_version  
       [9]  0x080505f4 0x00000000  SECT LOCL  D    0 .SUNW_dynsymso 
      [10]  0x08050630 0x00000000  SECT LOCL  D    0 .rel.data      
      [11]  0x08050638 0x00000000  SECT LOCL  D    0 .rel.plt       
      [12]  0x08050668 0x00000000  SECT LOCL  D    0 .plt           
      [13]  0x080506e0 0x00000000  SECT LOCL  D    0 .text          
      [14]  0x0805082c 0x00000000  SECT LOCL  D    0 .init          
      [15]  0x08050848 0x00000000  SECT LOCL  D    0 .fini          
      [16]  0x08050864 0x00000000  SECT LOCL  D    0 .rodata        
      [17]  0x08060868 0x00000000  SECT LOCL  D    0 .got           
      [18]  0x0806088c 0x00000000  SECT LOCL  D    0 .dynamic       
      [19]  0x08060974 0x00000000  SECT LOCL  D    0 .data          
      [20]  0x080609c0 0x00000000  SECT LOCL  D    0 .bssf          
      [21]  0x080609c0 0x00000000  SECT LOCL  D    0 .bss           
      [22]  0x00000000 0x00000000  SECT LOCL  D    0 .symtab        
      [23]  0x00000000 0x00000000  SECT LOCL  D    0 .strtab        
      [24]  0x00000000 0x00000000  SECT LOCL  D    0 .comment       
      [25]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_info    
      [26]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_line    
      [27]  0x00000000 0x00000000  SECT LOCL  D    0 .debug_abbrev  
      [28]  0x00000000 0x00000000  SECT LOCL  D    0 .shstrtab      
      [29]  0x080609c0 0x00000000  OBJT LOCL  D    0 .bss           _END_
      [30]  0x08050000 0x00000000  OBJT LOCL  D    0 .interp        _START_
      [31]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crti.s
      [32]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.o
      [33]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crt1.s
      [34]  0x08060994 0x00000004  OBJT LOCL  D    0 .data          __get_exit_frame_monitor_ptr
      [35]  0x08060998 0x00000004  OBJT LOCL  D    0 .data          __do_exit_code_ptr
      [36]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            fsr.s
      [37]  0x080609a0 0x00000020  OBJT LOCL  D    0 .data          trap_table
      [38]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            values-Xa.c
      [39]  0x08060974 0x00000000  NOTY LOCL  D    0 .data          Ddata.data
      [40]  0x080609c0 0x00000000  NOTY LOCL  D    0 .bss           Bbss.bss
      [41]  0x08050868 0x00000000  NOTY LOCL  D    0 .rodata        Drodata.rodata
      [42]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            test.c
      [43]  0x080507f0 0x00000019  FUNC LOCL  D    0 .text          static_func
      [44]  0x080609c0 0x00000000  OBJT LOCL  D    0 .bss           Bbss.bss
      [45]  0x08060974 0x00000000  OBJT LOCL  D    0 .data          Ddata.data
      [46]  0x08050864 0x00000000  OBJT LOCL  D    0 .rodata        Drodata.rodata
      [47]  0x00000000 0x00000000  FILE LOCL  D    0 ABS            crtn.s
      [48]  0x08050668 0x00000000  OBJT GLOB  D    0 .plt           _PROCEDURE_LINKAGE_TABLE_
      [49]  0x08060974 0x00000004  OBJT WEAK  D    0 .data          environ
      [50]  0x0806088c 0x00000000  OBJT GLOB  D    0 .dynamic       _DYNAMIC
      [51]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bssf          _edata
      [52]  0x08060990 0x00000004  OBJT GLOB  D    0 .data          ___Argv
      [53]  0x08050868 0x00000000  OBJT GLOB  D    0 .rodata        _etext
      [54]  0x0805082c 0x0000001b  FUNC GLOB  D    0 .init          _init
      [55]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS            __fsr_init_value
      [56]  0x08050810 0x00000019  FUNC GLOB  D    0 .text          main
      [57]  0x08060974 0x00000004  OBJT GLOB  D    0 .data          _environ
      [58]  0x08060868 0x00000000  OBJT GLOB  P    0 .got           _GLOBAL_OFFSET_TABLE_
      [59]  0x080506b8 0x00000000  FUNC GLOB  D    0 UNDEF          printstack
      [60]  0x080506a8 0x00000000  FUNC GLOB  D    0 UNDEF          _exit
      [61]  0x08050864 0x00000004  OBJT GLOB  D    0 .rodata        _lib_version
      [62]  0x08050698 0x00000000  FUNC GLOB  D    0 UNDEF          atexit
      [63]  0x08050678 0x00000000  FUNC GLOB  D    0 UNDEF          __fpstart
      [64]  0x0805076c 0x0000007b  FUNC GLOB  D    0 .text          __fsr
      [65]  0x08050688 0x00000000  FUNC GLOB  D    0 UNDEF          exit
      [66]  0x080506c8 0x00000000  FUNC WEAK  D    0 UNDEF          _get_exit_frame_monitor
      [67]  0x080609c0 0x00000000  OBJT GLOB  D    0 .bss           _end
      [68]  0x080506e0 0x0000008b  FUNC GLOB  D    0 .text          _start
      [69]  0x08050848 0x0000001b  FUNC GLOB  D    0 .fini          _fini
      [70]  0x08060978 0x00000018  OBJT GLOB  D    0 .data          __environ_lock
      [71]  0x0806099c 0x00000004  OBJT GLOB  D    0 .data          __longdouble_used
      [72]  0x00000000 0x00000000  NOTY WEAK  D    0 UNDEF          __1cG__CrunMdo_exit_code6F_v_
Now, we strip the two versions of our program to remove the .symtab symbol table, and force the system to use the dynamic tables instead:
% strip test_ldynsym test_noldynsym 
% file test_ldynsym test_noldynsym 
test_ldynsym:   ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped
test_noldynsym: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, stripped
Running the version without a .SUNW_ldynsym section:
% ./test_noldynsym 
/home/ali/test/test_noldynsym:0x6ca
/home/ali/test/test_noldynsym:main+0xb
/home/ali/test/test_noldynsym:_start+0x7a
Segmentation Fault (core dumped)
% pstack core
core 'core' of 5041:    ./test_noldynsym
 080506d2 ???????? (804692c, 80467a4, 805062a, 1, 80467b0, 80467b8)
 080506eb main     (1, 80467b0, 80467b8) + b
 0805062a _start   (1, 8046994, 0, 80469a5, 80469bf, 8046a03) + 7a
Our program used the printstack(3C) function to display its own stack. Afterwards, we use the pstack command to view the same data from the core file. In both cases, the top line represents the call to the local function static_func(), a fact that we know only from examining the source code, since the raw address and/or the '????????' used to represent it are less than obvious to an external observer.

Running the version with a .SUNW_ldynsym section, the system is able to put a name to the local function:

% ./test_ldynsym 
/home/ali/test/test_ldynsym:static_func+0xa
/home/ali/test/test_ldynsym:main+0xb
/home/ali/test/test_ldynsym:_start+0x7a
Segmentation Fault (core dumped)
% pstack core
core 'core' of 5044:    ./test_ldynsym
 08050802 static_func (8046930, 80467a8, 805075a, 1, 80467b4, 80467bc) + 12
 0805081b main     (1, 80467b4, 80467bc) + b
 0805075a _start   (1, 8046998, 0, 80469a7, 80469c1, 8046a05) + 7a

Conclusions

Sometimes it is the little things that make a difference. I expect that the local dynamic symbol table will provide valuable information in difficult debugging situations where one is examining large stripped programs running in a production environment. The rest of the time, the additional data is small, and will have little or no impact on performance.

.SUNW_ldynsym sections have been part of the Solaris development (Nevada) builds since last fall, and are also available in OpenSolaris.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Saturday Sep 23, 2006

Inside ELF Symbol Tables

ELF files are full of things we need to keep track of for later access: Names, addresses, sizes, and intended purpose. Without this information, an ELF file would not be very useful. We would have no way to make sense of the impenetrable mass of octal or hexadecimal numbers.

Consider: When you write a program in any language above direct machine code, you give symbolic names to functions and data. The compiler turns these things into code. At the machine level, they are known only by their address (offset within the file) and their size. There are no names in this machine code. How then, can a linker combine multiple object files, or a symbolic debugger know what name to use for a given address? How do we make sense of these files?

Symbols are the way we manage this information. Compilers generate symbol information along with code. Linkers manipulate symbols, reading them in, matching them up, and writing them out. Almost everything a linker does is driven by symbols. Finally, debuggers use them to figure out what they are looking at and to provide you with a human readable view of that information.

It is therefore a rare ELF file that doesn't have a symbol table. However, most programmers have only an abstract knowledge that symbol tables exist, and that they loosely correspond to their functions and data, and some "other stuff". Protected by the abstractions of compiler, linker, and debugger, we don't usually need to know too much about the details of how a symbol table is organized. I've recently completed a project that required me to learn about symbol tables in great detail. Today, I'm going to write about the symbol tables used by the linker.

.symtab and .dynsym

Sharable objects and dynamic executables usually have two distinct symbol tables, one named ".symtab", and the other ".dynsym". (To make this easier to read, I am going to refer to these without the quotes or leading dot from here on.)

The dynsym is a smaller version of the symtab that only contains global symbols. The information found in the dynsym is therefore also found in the symtab, while the reverse is not necessarily true. You are almost certainly wondering why we complicate the world with two symbol tables. Won't one table do? Yes, it would, but at the cost of using more memory than necessary in the running process.

To understand how this works, we need to understand the difference between allocable and non-allocable ELF sections. ELF files contain some sections (e.g. code and data) needed at runtime by the process that uses them. These sections are marked as being allocable. There are many other sections that are needed by linkers, debuggers, and other such tools, but which are not needed by the running program. These are said to be non-allocable. When a linker builds an ELF file, it gathers all of the allocable sections together in one part of the file, and all of the non-allocable sections are placed elsewhere. When the operating system loads the resulting file, only the allocable part is mapped into memory. The non-allocable part remains in the file, but is not visible in memory. strip(1) can be used to remove certain non-allocable sections from a file. This reduces file size by throwing away information. The program is still runnable, but debuggers may be hampered in their ability to tell you what the program is doing.
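
You can see the allocable/non-allocable split for yourself by walking the section headers with libelf. Here is a minimal sketch of mine (link with -lelf); in a typical dynamic executable, .dynsym shows up as allocable while .symtab and .strtab do not:

#include <fcntl.h>
#include <gelf.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	int		fd;
	Elf		*elf;
	Elf_Scn		*scn = NULL;
	GElf_Ehdr	ehdr;
	GElf_Shdr	shdr;

	if ((argc != 2) || (elf_version(EV_CURRENT) == EV_NONE) ||
	    ((fd = open(argv[1], O_RDONLY)) < 0) ||
	    ((elf = elf_begin(fd, ELF_C_READ, NULL)) == NULL) ||
	    (gelf_getehdr(elf, &ehdr) == NULL))
		exit(1);

	/* For each section, report whether it is allocable */
	while ((scn = elf_nextscn(elf, scn)) != NULL) {
		const char *name;

		if (gelf_getshdr(scn, &shdr) == NULL)
			continue;
		name = elf_strptr(elf, ehdr.e_shstrndx, shdr.sh_name);
		(void) printf("%-16s %s\n", (name != NULL) ? name : "?",
		    (shdr.sh_flags & SHF_ALLOC) ?
		    "allocable" : "non-allocable");
	}
	(void) elf_end(elf);
	return (0);
}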

The full symbol table contains a large amount of data needed to link or debug our files, but not needed at runtime. In fact, in the days before sharable libraries and dynamic linking, none of it was needed at runtime. There was a single, non-allocable symbol table (reasonably named "symtab"). When dynamic linking was added to the system, the original designers faced a choice: Make the symtab allocable, or provide a second smaller allocable copy. The symbols needed at runtime are a small subset of the total, so a second symbol table saves virtual memory in the running process. This is an important consideration. Hence, a second symbol table was invented for dynamic linking, and consequently named "dynsym".

And so, we have two symbol tables. The symtab contains everything, but it is non-allocable, can be stripped, and has no runtime cost. The dynsym is allocable, and contains the symbols needed to support runtime operation. This division has served us well over the years.

Types Of Symbols

Given how long symbols have been around, there are surprisingly few types:
STT_NOTYPE
Used when we don't know what a symbol is, or to indicate the absence of a symbol.

STT_OBJECT / STT_COMMON
These are both used to represent data. (The word OBJECT in this context should not be interpreted as having anything to do with object orientation. STT_DATA might have been a better name.)

STT_OBJECT is used for normal variable definitions, while STT_COMMON is used for tentative definitions. See my earlier blog entry about tentative symbols for more information on the differences between them.

STT_FUNC
A function, or other executable code.

STT_SECTION
When I first started learning about ELF, and someone would say something about "section symbols", I thought they meant a symbol from some given section. That's not it though: A section symbol is a symbol that is used to refer to the section itself. They are used mainly when performing relocations, which are often specified in the form of "modify the value at offset XXX relative to the start of section YYY".

STT_FILE
The name of a file, either of an input file used to construct the ELF file, or of the ELF file itself.

STT_TLS
A third type of data symbol, used for thread local data. A thread local variable is a variable that is unique to each thread. For instance, if I declare the variable "foo" to be thread local, then every thread has a separate foo variable of its own, and threads do not see or share each other's values. Thread local variables are created for each thread when the thread is created. As such, their number (one per thread) and addresses (which depend on when each thread is created, and how many threads there are) are unknown until runtime. An ELF file cannot contain an address for them. Instead, a STT_TLS symbol is used. The value of a STT_TLS symbol is an offset, which is used to calculate a TLS offset relative to the thread pointer. You can read more about TLS in the Linker And Libraries Guide. (A small example appears after this list.)

STT_REGISTER
The Sparc architecture has a concept known as a "register symbol". These symbols are used to validate symbol/register usage, and can also be used to initialize global registers. Other architectures don't use these.
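
Returning to STT_TLS: in C, a thread local variable is declared with the __thread keyword. In the sketch below (mine; compile with something like cc -mt or gcc -pthread), each thread that runs worker() gets a private copy of counter, so both threads print 1000:

#include <stdio.h>
#include <pthread.h>

/* One copy per thread; the symbol is emitted with type STT_TLS */
static __thread int counter = 0;

static void *
worker(void *arg)
{
	int i;

	for (i = 0; i < 1000; i++)
		counter++;

	/* Always 1000: no other thread touched our copy */
	(void) printf("counter = %d\n", counter);
	return (arg);
}

int
main(void)
{
	pthread_t t1, t2;

	(void) pthread_create(&t1, NULL, worker, NULL);
	(void) pthread_create(&t2, NULL, worker, NULL);
	(void) pthread_join(t1, NULL);
	(void) pthread_join(t2, NULL);
	return (0);
}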

In addition to symbol type, each symbol has other attributes:

  • Name (Optional: Not all symbols need a name, though most do)
  • Value
  • Size
  • Binding and Visibility
  • ELF Section it references
The exact meaning for some of these attributes depends on the type of symbol involved. For more details, consult the Solaris Linker and Libraries Guide, which is available in PDF form online.
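
For reference, these attributes correspond directly to the fields of the standard ELF symbol structure (the 32-bit form is shown below; type and binding share the st_info byte, and the ELF32_ST_BIND() and ELF32_ST_TYPE() macros pull them apart):

typedef struct {
	Elf32_Word	st_name;	/* string table offset of name */
	Elf32_Addr	st_value;	/* value (often an address) */
	Elf32_Word	st_size;	/* size in bytes */
	unsigned char	st_info;	/* binding and type, packed */
	unsigned char	st_other;	/* visibility */
	Elf32_Half	st_shndx;	/* index of related section */
} Elf32_Sym;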

Symbol Table Layout And Conventions

The symbols in a symbol table are written in the following order:
  1. Index 0 in any symbol table is used to represent undefined symbols. As such, the first entry in a symbol table (index 0) is always completely zeroed (type STT_NOTYPE), and is not used.

  2. If the file contains any local symbols, the second entry (index 1) in the symbol table will be a STT_FILE symbol giving the name of the file.

  3. Section symbols.

  4. Register symbols.

  5. Global symbols that have been reduced to local scope via a mapfile.

  6. For each input file that supplies local symbols, an STT_FILE symbol giving the name of the input file is put in the symbol table, followed by the symbols in question.

  7. The global symbols immediately follow the local symbols in the symbol table. Local and global symbols are always kept separate in this manner, and cannot be mixed together.
What would happen if we ignored these rules and reordered things in some other way (e.g. sorted by address)? There is no way to answer this question with 100% certainty. It would probably confuse existing tools that manipulate ELF files. In particular, it seems clear that the local and global symbols must remain separate. For years and years, arbitrary software has been free to assume the above layout. We can't possibly know how much software has been written, or how dependent on layout it is. The only safe move is to maintain the well known layout described above.
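The local/global boundary is even recorded in the file: for a symbol table section, the ELF section header's sh_info field holds the index of the first global symbol. As an illustration, here is a minimal sketch of my own (compile with "cc walk.c -lelf") that uses libelf to walk the symtab and label each entry using that boundary:

        #include <fcntl.h>
        #include <gelf.h>
        #include <stdio.h>
        #include <unistd.h>

        int
        main(int argc, char *argv[])
        {
                Elf             *elf;
                Elf_Scn         *scn = NULL;
                Elf_Data        *data;
                GElf_Shdr       shdr;
                GElf_Sym        sym;
                size_t          cnt, i;
                int             fd;

                if ((argc != 2) || ((fd = open(argv[1], O_RDONLY)) < 0))
                        return (1);
                (void) elf_version(EV_CURRENT);
                elf = elf_begin(fd, ELF_C_READ, NULL);

                while ((scn = elf_nextscn(elf, scn)) != NULL) {
                        if ((gelf_getshdr(scn, &shdr) == NULL) ||
                            (shdr.sh_type != SHT_SYMTAB))
                                continue;
                        data = elf_getdata(scn, NULL);
                        cnt = shdr.sh_size / shdr.sh_entsize;

                        /* locals occupy [0, sh_info), globals the rest */
                        for (i = 0; i < cnt; i++) {
                                (void) gelf_getsym(data, i, &sym);
                                (void) printf("[%d]\t%s\t%s\n", (int)i,
                                    (i < shdr.sh_info) ? "LOCL" : "GLOB",
                                    elf_strptr(elf, shdr.sh_link,
                                    sym.st_name));
                        }
                }
                (void) elf_end(elf);
                (void) close(fd);
                return (0);
        }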

Next Time: Augmenting The Dynsym

One of the big advantages of Solaris relative to other operating systems is the extensive support for observability: The ability to easily look inside a running program and see what it is doing, in detail. To do that well requires symbols. The symbols in the dynsym may not be enough to do a really good job. For example, to produce a stack trace, we need to take each function address and match it up to its name. If we are looking at a stripped file, or referencing the file from within the process using it via dladdr(3C), we won't have any way to find names for the non-global functions, and will have to resort to displaying hex addresses. This is better than nothing, but not by much. The standard files in a Solaris distribution are not stripped for exactly this reason. However, many files found in production are stripped, and in-process inspection is still limited to the dynsym.
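To see why, consider this minimal sketch of my own using dladdr(3C). The runtime linker consults only the dynsym, so the static function below is invisible to it, and the name reported will typically be a nearby global symbol (or nothing at all) rather than local_func — the very problem just described:

        #include <dlfcn.h>
        #include <stdio.h>

        static void
        local_func(void)        /* local: absent from the dynsym */
        {
        }

        int
        main(void)
        {
                Dl_info info;

                if ((dladdr((void *)&local_func, &info) != 0) &&
                    (info.dli_sname != NULL))
                        (void) printf("%p resolves to %s in %s\n",
                            (void *)&local_func, info.dli_sname,
                            info.dli_fname);
                else
                        (void) printf("no symbol info for %p\n",
                            (void *)&local_func);
                return (0);
        }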

Machines are much larger than they used to be. The memory saved by the symtab/dynsym division is still a good thing, but there are times when we wish that the dynsym contained a bit more data. This is harder than it sounds. The layout of dynsym interacts with the rest of an ELF file in ways that are set in stone by years of existing practice. Backward compatibility is a critical feature of Solaris. We try extremely hard to keep those old programs running. And yet, the needs of observability, spearheaded by important new features like DTrace, put pressure on us in the other direction.

This discussion is prelude to work I recently did to augment the dynsym to contain local symbols, while preserving full backward compatibility with older versions of Solaris. I plan to cover that in a future blog entry. ELF is old, and much of how it works cannot be changed. Its original designers (our "Founding Fathers", as Rod calls them) anticipated that this would be the case, based no doubt on hard experience with earlier systems. The ELF design is therefore uniquely flexible, which explains why it has survived as long as it has. There is always a way to add something new. Sometimes, it takes several tries to find the best way.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Friday Sep 22, 2006

What Are "Tentative" Symbols?

In the Linker and Libraries Guide, you will encounter discussion of tentative symbols. Based on the name, we might expect that such a symbol is missing something, but what? And why does the linker have to treat it as a special case?

A tentative symbol is a symbol used to track a global variable when we don't know its size or initial value. In other words, a symbol for which we have not yet assigned a storage address. They are also known as "common block" symbols, because they have their origins in the implementation of Fortran COMMON blocks. They are historical baggage — something that needs to work for compatibility with the past, but also something to avoid in new code.

Consider the following two C declarations, made at outer file scope:

        int foo;
        int foo = 0;
Superficially, these both appear to declare a global variable named foo with an initial value of 0. However, the first definition is tentative — it will have a value of 0 only if some other file doesn't explicitly give it a different value. The outcome depends on what else we link this file against.

To get a better handle on this, let's create two separate C files (t1.c, and t2.c) and experiment:

t1.c
        #include <stdio.h>

        #ifdef TENTATIVE_FOO
        int foo;
        #else
        int foo = 0;
        #endif

        int
        main(int argc, char *argv[])
        {
                printf("FOO: %d\n", foo);
                return (0);
        }
t2.c
        int foo = 12;

First, we compile and link t1.c by itself, using both forms of declaration for variable foo:

        % cc -DTENTATIVE_FOO t1.c; ./a.out
        FOO: 0
        % cc t1.c; ./a.out
        FOO: 0

As expected, they give identical results. Now, let's add t2.c to the mix and see what happens:

        % cc -DTENTATIVE_FOO t1.c t2.c; ./a.out
        FOO: 12
        % cc t1.c t2.c; ./a.out
        ld: fatal: symbol `foo' is multiply-defined:
                (file t1.o type=OBJT; file t2.o type=OBJT);
        ld: fatal: File processing errors. No output written to a.out
        ./a.out: No such file or directory
As you can see, the two different ways of declaring foo are not 100% equivalent. The tentative declaration of foo in t1.c took on the value provided by the declaration in t2.c. In contrast, the linker was unwilling to merge the two non-tentative definitions of foo that had different values, and instead issued a fatal link error.

Normal C rules say that a variable at file scope without an explicit value is assigned an initial value of 0. However, the existence of other global variables with the same name can change this. The C compiler can only see the code in the single file it is compiling, and cannot know how to handle this case. So it marks the symbol as tentative, by giving it a type of STT_COMMON, and leaves it for the linker to figure out. The linker is in a position to match up all of these symbols and merge them into a single instance. The linker has no insight into programmer intent though, and it cannot protect you from doing this by accident. The result usually works, but is fragile.

The other declaration form (with a value) causes a non-tentative symbol to be created (STT_OBJECT). In this case, the linker ensures that all the declarations agree. This is the right behavior if you care about robust and scalable code.

It is worth noting that you will never see a tentative symbol with local scope. It can only happen to global symbols, because global symbols in different files are the only way you can get this form of aliasing to occur.

History

Tentative symbols are bad software engineering. A declaration in one file should not be able to alter one in another file. The need for them dates from the early days of the Fortran language. In Fortran, you can declare a common block in more than one file, with each file independently specifying the number, types, and sizes of the variables. The linker then takes all of these blocks, allocates enough space to satisfy the largest one, and makes all of them point at that space. This is a very crude form of a union (variant), and is therefore a very useful (and dangerous) Fortran technique.

Sadly, it didn't stop there. We still sometimes find this practice in C code. Two files will both declare:

        int foo;
and then expect that they are both referring to a single global variable, with an initial value of 0. This is not necessary. The proper solution has existed for decades. The safe way to do the above is to have exactly one declaration for the global variable in a single file. The other files that need access to it use the "extern" keyword to let the compiler know what is going on. The statement
        extern int foo;
is a reference, not a declaration, and it has a single unambiguous interpretation.

Moral: Don't Do That!

Don't use common block binding in your code. It was a bad idea 40 years ago, and it hasn't improved with age. The necessity of backward compatibility is such that compilers and linkers must support common block binding. We are stuck with it, but we don't have to use it.

You should always try to minimize or eliminate global variables. However, when you do use them:

  • There should be exactly one declaration for each global variable, contained in a single file. When declaring that single instance, always give it an explicit value, even if that value is 0. The C language says that the value is 0 if you don't, but doing it explicitly ensures that you can't accidentally fall into the "tentative trap" if some other module should come along later and define it. Note that this only applies to global variables. Static variables declared at file scope can be safely assumed to have an initial value of 0.

  • The module that declares the variable should supply a header file containing an extern statement for the variable. Furthermore, the module must #include its own header file. The compiler allows you to have a declaration and an extern statement for a variable in the same compilation scope, and it will check to make sure they agree. This ensures that your module can't export a bad extern definition to other code.

  • Other modules that access the global variable must always #include the header file from the defining module, and must never supply their own explicit extern statement for the variable. This protects them from being stuck with an obsolete and incorrect definition if the variable should change later.
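Putting these rules together, here is a minimal sketch of the pattern, using hypothetical files foo.h, foo.c, and user.c:

        /* foo.h: supplied by the module that owns the variable */
        extern int foo;

        /* foo.c: the single defining file; it includes its own header,
           so the compiler verifies the extern and definition agree */
        #include "foo.h"
        int foo = 0;            /* explicit value, even though it is 0 */

        /* user.c: any other module; never writes its own extern */
        #include "foo.h"
        int
        use_foo(void)
        {
                return (foo);
        }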


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Jun 14, 2006

Settling An Old Score (Linker Division)

For years, I worked on an interactive language used by scientists to do data analysis and visualization. That program makes heavy use of sharable libraries. Solaris was my primary development platform, due to its superior facilities for observing and debugging software, so my usual strategy was to write and debug my code under Solaris, and then move the results to the many other Unix (and VMS, Windows, and Macintosh) platforms that we supported.

I became very familiar with one quirk of the Solaris linker that bit me many times over the years. The issue has to do with how ld(1) handles the situation where it needs to replace an existing output file. Historically, this was handled by truncating the existing file and rewriting it in place. This preserves the existing inode, and any hard links that may happen to be pointing to it. However, it has a very bad effect on any running process that happens to be using that file. For example, if the output file is a sharable library, creating a new version while a program that uses it is running will inevitably cause that program to die in an unplanned and unexpected way.

If you're not a developer of software that runs for unbounded amounts of time, then you probably have not seen this behavior. If you develop code that does, however, then you've almost certainly hit it at some point. In my case, I'd hit it about once a year, usually while multitasking, flipping back and forth between several xterms. Usually, it was obvious what had happened, and I'd quickly recover. Sometimes though, if I was debugging the program for some other reason, the unexpected SIGSEGV or SIGBUS would send me off into the weeds, debugging a mysterious problem in a part of the program unrelated to where I expected the problem to lie. After a minute or so, I'd realize what had happened. Gack! Bitten again...

I cheerfully admit that this is not a big deal. However, I always wondered what reasons those people at Sun had for not changing it. I assumed that there was some subtlety that I was missing. Other platforms handle it in a different way, by having ld unlink the existing file first, and then create a new file under the same name. Any existing processes continue to see the old file, while the new file becomes available to new processes. When the last program with an open file descriptor exits, the Unix kernel removes the old file from the disk, in the standard Unix way.
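In outline, the unlink-first approach looks something like the following sketch. This is an illustration of the idea, not the actual ld source:

        #include <fcntl.h>
        #include <unistd.h>

        /*
         * Remove the old name, then create a fresh file under it.
         * Processes that already have the old file open or mapped keep
         * using the old inode, which the kernel reclaims once the last
         * reference goes away.
         */
        int
        open_output(const char *path)
        {
                (void) unlink(path);    /* ignore failure: path may not exist */
                return (open(path, O_WRONLY | O_CREAT | O_TRUNC, 0777));
        }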

I now work at Sun, on the Solaris linker and various related parts. One day the subject of this ld behavior came up, reminding me of my old questions. So, I started asking around. It turns out that no one at Sun is particularly fond of it either. The reasons for not having changed it boil down to the fact that it rarely causes problems, a desire to maximize compatibility with the past, and the fact that there are bigger things to worry about (almost everything).

The compatibility issue boils down to what happens if the output file has a link count that is greater than 1, as described earlier. It's rare for an output file to have multiple links, and even more rare for the makefile to not remove and relink all of the other names. It is especially rare given that other operating systems (like Linux and the Macintosh) do break those links, and most Unix software is targeted at multiple operating systems (almost always including Linux).

As described in an earlier blog entry, I've been using Solaris Zones to improve our linker testing. The basic approach I want to use for this is to replace the linker files in the test zone with symbolic links that point at the corresponding files in my development workspace. This means that every time I use make(1) to rebuild the linker components, the results will immediately become available in the zone. There is one problem with this idea: The way ld handles existing output files means that any processes using the old linker components will suddenly crash when I type make. Depending on what is running in the zone, this could destabilize and/or cripple the zone.

Of course, we could modify our makefiles to unlink the existing files before running ld, but this is tedious and error prone. I realized that it would be better to change the linker. At last, time to settle an old score.

The first step in making a change like this is to write and submit an ARC (Architectural Review Committee) case describing the change. We take backward compatibility very seriously here, so a change like this requires formal consideration and approval. I submitted PSARC 2006/353 a couple of weeks ago. It generated some discussion pro and con (the change causes existing hard links to be broken), but in the end, the undesirability of causing programs to die in uncontrolled ways, and the bonus of Linux compatibility won the day and the proposal was approved. I put back the code change earlier today, and it will appear in build 43, and soon after in OpenSolaris. Appropriately enough, more typing was involved in proposing the ARC case than in changing the code. The actual code change is very small.

It was a good learning experience for me to go through the ARC process, and gratifying to finally take my revenge on "my old friend". And of course, it will allow us to do better testing of the linker subsystem.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Happy Birthday To OpenSolaris

OpenSolaris 1 Year Anniversary

OpenSolaris is 1 year old today! Congratulations, and thanks to everyone that made it possible!

I've used SunOS/Solaris for ~20 years now, for all of the old school reasons: Solaris has long set the gold standard for things like reliability, scalability, observability, debuggability, backward compatibility, and standards compliance. Over those years, I've written code that had to run on most versions of Unix as well as VMS, Windows, and the Macintosh. Sun was always the place where I did initial work, and any possible debugging as well. The reason is that SunOS has always had superior tools for observing and diagnosing what your code is doing. However, the user visible functionality of the OS (as with all Unix OS's) had not changed much in quite some time. Solaris still had the above advantages, but the various Unix variants were beginning to look and feel increasingly similar.

In contrast, the current Solaris is a real jump forward. While other operating systems have been working away at duplicating things that Solaris has had forever, Solaris itself has moved ahead with next generation features that no one else will have for quite a while (DTrace, ZFS, Zones, FMA, SMF, etc). Others have some subset of these abilities, but no one has the complete package, and no one else has it in such a simple and fully integrated fashion. Once you experience these things, you'll find running systems without them to be limiting (evoking a quieter, simpler era). At the same time, Solaris has moved aggressively to the 64-bit X86 PC platform, making it possible for many people to run it without first having to buy new hardware, and has moved to adopt a modern desktop and make other needed improvements (many of which involve open source code from other projects).

The fact that Solaris has joined the community of open source Unix operating systems makes the above even more compelling. The energy level surrounding Solaris today is high. The quality of Solaris as a desktop OS is rapidly improving, thanks to its adoption of other open source software. At the same time, the OpenSolaris code is there for others to examine, modify, and even to port to other systems. This would be a great thing for Unix in general, and is something that we would love to see happen (and it is: see DTrace and ZFS). And, Solaris still leads the pack with those boring old school virtues.

So on this 1st anniversary of OpenSolaris: If you are a Unix fan who is not familiar with the modern Solaris OS, you should give it a try. You will find it interesting and eye opening. Rest easy: It has a real open source license, and what has been given cannot be taken away. You probably already have a PC lying around that you can use. And the price (free) is exceedingly reasonable.


Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Apr 19, 2006

Testing Critical System Components Without Turning Your System Into A Brick


I work on the Solaris runtime linker. One thing you quickly learn in this work is that a small mistake can bring down your system. The runtime linker is an extreme example, but the same thing is true of other core system components. Modifying core parts of a running machine can be a risky game.

There is a time honored strategy for dealing with this:

  1. Be careful

  2. Minimize your exposure

  3. Deal with it

That's not much of a safety net. "Deal with it" can sometimes be a slow, painful process. There has been little improvement in this area for years. Now however, the advent of Solaris zones and ZFS gives us some powerful new options that can make recovery easy and instantaneous.

I'm going to talk about how to do that here. Much of this discussion is linker-specific detail and background motivation, followed by some general comments about zones and ZFS, and then an actual example of how I built a test environment. Please feel free to skip right to the example.

The Linker Testing Problem

The runtime linker lies at the heart of nearly every process on a Solaris (or any modern) operating system. This makes modifying and testing it problematic: If you install a runtime linker that has an error, your entire running system will instantly break. Since everything on the system is dynamically linked, this isn't a casual breakage. Rather, your system is unable to execute anything. Recovery may require booting a memory resident copy of the OS from the installation CD, restoring working files, and rebooting. One moment, you were focusing on solving a problem. Now, your attention is yanked away and focused on system recovery. Once you get your system back, you have to go back and try to remember what you were thinking when it broke. Your productivity is shot.

If you work on something as central as the runtime linker, the odds of never breaking it are stacked against you. That it is going to happen is a simple mathematical fact. If you are careful and methodical, it will happen less. Unless you shy away from doing valuable work though, it is an ever present possibility. Since we can't eliminate the possibility, we have to accommodate it.

Our main strategy in this game is avoidance. To avoid this problem, most linker testing is done against a local copy of the linker components, without installing them in the standard locations (/bin, /usr, etc). We do this by manipulating the command PATH, and setting linker environment variables. We may install them later, if testing seems to show that they are OK, and if we believe that there may be system interactions we need to guard against. The good news is that this approach usually works, and can be managed with a reasonable amount of effort. It has some limits though:

  • It is complicated to get 100% right. Sometimes we end up using linker components from the standard places instead of the ones we think we're using.

  • It isn't a 100% accurate representation of how anyone else uses the linker. It is a close approximation, but not perfect.

  • It isn't efficient: Since it isn't a perfect test, we often have to do a real install and test again before we know for sure that things are OK.

An ideal approach would not require so much human judgement. It would reflect the user experience exactly.

Doing Better

How would the ideal testing environment for the runtime linker subsystem look? Here's my wish list:

  • Keeps your system in a completely stock and vanilla configuration, without altering system files.

  • Lets you modify any system file, from the point of view of the software you're testing, without violating the previous point. I want to use my test linker subsystem, installed in the standard places, without having it affect anything except my tests.

  • Quick and easy to set up.

  • Upgrades to the operating system should be quick and easy to do.

  • Lets you access and use your development environment from within the test environment exactly as you can outside, and with the same filesystem paths.

  • Testing mistakes can't take down the system.

  • Self healing: After I mess it up, I want to be able to reset to a working vanilla state with a simple command, and without having to remember what I changed.

  • Run on a standard system with only modest extra resources. More disk space is OK. Using another computer isn't.

In years past, you might have tried to build something like this by constructing an image of the system in a test area, and then applying the chroot(2) system call (probably in the form of the /usr/sbin/chroot command) to make it appear like the real system. This can work, but it has some big drawbacks:

  • Requires a lot of work to set up.

  • Requires a lot of ongoing work to track system changes from release to release (which in the Solaris group, come every 2 weeks).

  • Requires a lot of work to keep stable and correct.

If you've ever set up an anonymous FTP server, you know how much manual work is involved. Imagine doing it for an entire OS and then having to keep up with daily changes. People have tried this, but it ends up being too much ongoing effort to manage and maintain. No one minds doing work up front, but afterwards, we really want a system that can take care of itself. The goal is to save time and effort, not to simply redirect it.

What we really need is a sort of super chroot: One that sets itself up and doesn't demand so much from us. Something that creates a virtual instance of the machine we're using, that is created automatically by the system, so we don't have to construct a Solaris root filesystem manually. Something easy to create, lightweight in operation, that is essentially identical to our installed system, and something that we can play with, wreck, and reset with little or no overhead.

Before Solaris 10, this would have been a tall order. As of Solaris 10, it is standard stuff: We can build it using Solaris Zones in conjunction with ZFS. Not only can we do it, but it's easy.

Zones

You can read more about Solaris Zones at the OpenSolaris website. Quoting from that page:

Zones are an operating system abstraction for partitioning systems, allowing multiple applications to run in isolation from each other on the same physical hardware. This isolation prevents processes running within a zone from monitoring or affecting processes running in other zones, seeing each other's data, or manipulating the underlying hardware. Zones also provide an abstraction layer that separates applications from physical attributes of the machine on which they are deployed, such as physical device paths and network interface names.

The main instance of Solaris running on your system is known as the global zone. A given system is allowed to have one or more non-global zones: These are virtualized copies of the main system that present the programs running within them with the illusion that they are running on separate and distinct systems. Zones come in two flavors: Sparse, and Whole Root. The difference is that a sparse zone uses loopback mounts to re-use key filesystems (/, /usr, /platform) from the main system in a readonly mode, whereas a whole root zone makes a complete copy of these filesystems. A whole root zone allows you to install different Solaris packages into its root filesystem — this is what we need for linker testing.

Zones are extremely easy to set up. They provide us with the ability to create an environment in which we can install and test the runtime linker without running the risk of taking down the machine. The worst that can happen is that we wreck the zone, but the damage will always be contained. A non-global zone cannot damage the global zone. If we do damage the non-global zone, it is easy to halt, destroy, and recreate it, all without any need to halt or reboot the main system.

This is a big leap forward, and by itself, would be worth using. However, setting up a whole root zone can take half an hour. To really make this approach win, we need to be able to reset a zone much faster than that. We can do this using ZFS.

ZFS

ZFS is a powerful new filesystem that is making its debut with Solaris 10, Update 2. ZFS makes it cheap and easy to create an arbitrary number of filesystems on any Solaris system, from small desktop machines to large servers.

ZFS has a snapshot facility that allows you to capture a readonly copy of a filesystem (even really large ones) in a matter of seconds. A snapshot requires almost no disk space initially, as all the file data blocks are shared. As the main filesystem is modified, the snapshot continues to reference the old data blocks. Once a snapshot has been made, ZFS allows you to roll back the main filesystem to the state captured by the snapshot. This operation is trivial to do, and essentially instantaneous.

Each Solaris whole root zone has a copy of the main system filesystems, kept at a location you specify when you create the zone. ZFS therefore presents us with a solution to the problem of how to rapidly and easily reset a linker test environment:

  • Create a ZFS filesystem to hold the zone data.

  • Create a zone in the ZFS filesystem, and do the initial login that finishes the Solaris "install" process for the zone.

  • Halt the zone.

  • Capture a ZFS snapshot of the filesystem.

  • Restart the zone.

Once this is done, you can use the zone for testing, as if it were an especially convenient second system that can see the same files your real system can see. When you need to reset it:

  • Halt the zone.

  • Revert to the zone snapshot.

  • Restart the zone.

I created such a zone using my Ultra 20 desktop system. Here are the commands to do the above:

% zoneadm -z test halt
% zfs rollback -r tank/test@baseline
% zoneadm -z test boot

These commands take 7 seconds from start to finish! Speed is not going to be a problem.

Building It

Let's walk through the construction of the linker test zone I have on my desktop system. The first step is to get a ZFS filesystem set up. My system has an extra disk (/dev/rdsk/c2d0) that I will use for this purpose. It doesn't have any pre-existing data on it that I care about saving, so I will dedicate the entire thing for ZFS to use.

I need to create a ZFS pool, and then create a filesystem within it. Following the ZFS examples I've seen, I'm going to name the pool "tank". I will mount the filesystem on /zone/test.

root# zpool create -f tank c2d0
root# zfs create tank/test
root# zfs set mountpoint=/zone/test tank/test 
root# df -k /zone/test
Filesystem            kbytes    used   avail capacity  Mounted on
tank/test            241369088      98 241368561     1%    /zone/test

That took 4 seconds.

The next step is to create the zone within the ZFS filesystem now mounted at /zone/test. In order to allow installing linker components into the root and usr filesystems, this needs to be a whole root zone. At Sun, all of our home directories are automounted via NFS, with NIS used to manage user authentication. So, I'll need to give my zone a network interface. This interface needs a unique IP address, different from the main system address. I do most of my development work in a local filesystem (/export/home), so I'll arrange for it to appear within my test zone as well. My host is named rtld, so I will name my test zone rtld-test. Summarizing these decisions:

Hostname: rtld
Zone Hostname: rtld-test
Zone IP: 172.20.25.173
Zone Type: Whole Root
Zone Path: /zone/test
Loopback Mounts: /export/home

Let's create a test zone:

root# chmod 700 /zone/test
root# zonecfg -z test
test: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:test> create -b
zonecfg:test> set autoboot=true
zonecfg:test> set zonepath=/zone/test
zonecfg:test> add net
zonecfg:test:net> set address=172.20.25.173
zonecfg:test:net> set physical=nge0
zonecfg:test:net> end
zonecfg:test> add fs
zonecfg:test:fs> set dir=/export/home
zonecfg:test:fs> set special=/export/home
zonecfg:test:fs> set type=lofs
zonecfg:test:fs> end
zonecfg:test> info
zonename: test
zonepath: /zone/test
autoboot: true
pool: 
fs:
        dir: /export/home
        special: /export/home
        raw not specified
        type: lofs
        options: []
net:
        address: 172.20.25.173
        physical: nge0
zonecfg:test> verify
zonecfg:test> commit
zonecfg:test> exit
root# zoneadm -z test verify
root# zoneadm -z test install
Preparing to install zone <test>.
Creating list of files to copy from the global zone.
Copying <120628> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <974> packages on the zone.
Initialized <974> packages on zone.
Zone <test> is initialized.
Installation of these packages generated errors: 
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.
root# zoneadm list -cv
  ID NAME             STATUS         PATH
   0 global           running        /
   - test             installed      /zone/test

This part of the process takes about 12 minutes on this system.

The output from "zoneadm list" shows us that the zone is installed, but not running. To get it running for the first time, we must boot it, and then log in to the console and finish the installation process. This is the same process a standard Solaris system goes through after the initial reboot — SMF initializes, you are asked some questions about hostname, root password, and name service, and then the system is ready for use. Before using it though, we halt it and capture a snapshot for later use.

root# zoneadm -z test boot
root# zlogin -C test
[Install Output omitted]
~.
[Connection to zone 'test' console closed]
root# zoneadm list -cv
  ID NAME             STATUS         PATH
   0 global           running        /
  12 test             running        /zone/test
root# zoneadm -z test halt
root# zfs snapshot tank/test@baseline
root# zoneadm -z test boot

This last part takes about 5 minutes. In total, we can go from no ZFS and no zone, to having a usable linker test zone in well under half an hour. This story is going to get even better: "zone cloning" features are coming soon, and will greatly lower the time it takes to create new zones.

Using It

Now that we have a test zone, let's experiment with it. In this section, I will be using two separate terminal windows, one logged into the global zone, and one logged into the test zone. The shell prompt in each transcript shows which zone a command runs in: rtld is the global zone, and rtld-test is the test zone. In this example, I remove the runtime linker (/lib/ld.so.1) and demonstrate that (1) this does not take down the system, and (2) it is easily and quickly repaired.

The first step is to log into the test zone. The uname command is used as a trivial way to show that both zones are operating normally.

ali@rtld% uname
SunOS
ali@rtld% ssh rtld-test
Password: passwd
ali@rtld-test% uname
SunOS

Now, let's simulate the situation in which a bad runtime linker is installed, by simply removing it.

ali@rtld-test% su -
Password: passwd
root@rtld-test# rm /lib/ld.so.1

That is normally all it takes to wreck a working system. However, the global zone is unharmed, and my system continues to run.

ali@rtld% uname
SunOS
root@rtld-test# uname
uname: Cannot find /lib/ld.so.1
Killed
root@rtld-test# ls
ls: Cannot find /usr/lib/ld.so.1
Killed

Since my system is still running, I can quickly repair the broken test environment. In this simple case, I can repair the damage by copying /lib/ld.so.1 from my global zone into the test zone.

ali@rtld% su -
Password: passwd
root@rtld# cp /lib/ld.so.1 \
              /zone/test/root/lib
root@rtld-test# uname
SunOS

That's fine if the damage is simple, but what if the situation is more complex? The ld.so.1 from the global zone may be incompatible with other changes made to the linker components in the test zone, in which case, the above fix will not work. In that case, we will want to exercise the ability to quickly reset the test zone to a known good state. First, let's break it again:

root@rtld-test# rm /lib/ld.so.1
root@rtld-test# uname
uname: Cannot find /lib/ld.so.1
Killed

This time, we'll reset the test zone, from the global one:

root@rtld# zoneadm -z test halt
root@rtld# zfs rollback \
               -r tank/test@baseline
root@rtld# zoneadm -z test boot
# Connection to rtld-test closed by remote host.
Connection to rtld-test closed.

The test zone is back, good as new and ready for use:

ali@rtld% ssh rtld-test 
Password: passwd
ali@rtld-test(501)% uname
SunOS

Conclusions: A Rising Tide Floats All Boats

I've started to regard the test zone the same way I view an Etch-A-Sketch®: I can play with it, mess it up, learn from the results, and then I give it a quick shake and it is ready to go again. This is cool stuff!

Before doing this experiment, I had never used zones or ZFS. I had heard about them, but nothing more. I sat down on Friday morning to see what I could do with them, and I had the solution described here working within 8 hours of effort. It's hard to beat that return on investment. The result is a real leap forward in terms of how easily and completely we can test our work.

Zones and ZFS provide new and powerful abilities not available elsewhere. They're included in the standard system for free, not as expensive add-ons. They're simple and easy to use. Once you play with them, I am confident that you'll start seeing uses for them in your daily work. Happy hunting!

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

About

I work in the core Solaris OS group on the Solaris linkers. These blogs discuss various aspects of linking and the ELF object file format.
