Hello from a libc-free world! (Part 2)

In the previous post we conquered compilation by constructing a small program that can be compiled without using libc. Understanding object code and the details of an ELF executable are the next step in our adventure.

We left off with the following program pieces:

jesstess@kid-charlemagne:~$ cat stubstart.S
.globl _start

_start:
    call main
    movl $1, %eax
    xorl %ebx, %ebx
    int $0x80
jesstess@kid-charlemagne:~$ cat hello.c
int main()
{
  char *str = "Hello World";
  return 0;
}
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S hello.c -o hello
What did all that work get us?
jesstess@kid-charlemagne:~/c$ wc -c hello
1373 hello
jesstess@kid-charlemagne:~$ objdump -D hello | wc -l
93
We're down to a little over 1300 bytes of executable and, at under 100 lines, what seems like a very reasonable amount of assembly to dissect. Since no little bit of assembly is going to scare us at this point, let's look at it now, using objdump -D so we see the assembly for all sections (output here). If it looks intimidating at first, just give it a quick once-over; I promise it won't by the end of this post.

Alright, we have 5 sections: .text, which contains the familiar _start and main symbols, .rodata, .eh_frame_hdr, .eh_frame, and .comment.

Step 1: Back up - what the heck is a "section"?

If we dust off our favorite copy of the Tool Interface Standard ELF Specification and have a look inside, it tells us this:

An ELF executable like the result of our compilation has two views: it has a program header describing the segments, which contain information used at run-time, and a section header describing the sections, which contain information for linking and relocation. We can look at the program header's segment information or the section header's section information with readelf -l or readelf -S, respectively (output here). The output from these commands on our program is summarized in Figure 1. We won't worry about segments again during this post.
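
To make the two views concrete, here is a minimal Python sketch that reads the fixed-size ELF header at the start of a file and reports where the program header table (segments) and the section header table (sections) live. It assumes a 64-bit little-endian file like our hello; the function name parse_elf64_header and the dictionary keys are made up for illustration, and readelf remains the real tool for this job.

```python
import struct

# ELF64 file header layout (little-endian), per the ELF specification:
# e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
# e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def parse_elf64_header(data):
    """Return where the program header and section header tables live.

    Hypothetical helper for illustration; only handles ELF64 little-endian.
    """
    if len(data) < ELF64_HEADER.size:
        raise ValueError("too short to hold an ELF64 header")
    (ident, _type, _machine, _version, entry, phoff, shoff,
     _flags, _ehsize, phentsize, phnum, shentsize, shnum,
     _shstrndx) = ELF64_HEADER.unpack_from(data)
    if ident[:4] != b"\x7fELF":
        raise ValueError("missing ELF magic number")
    # ident[4] is the file class (2 = 64-bit), ident[5] the byte order (1 = little-endian)
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("sketch only handles 64-bit little-endian files")
    return {
        "entry": entry,
        "program_headers": {"offset": phoff, "count": phnum, "entry_size": phentsize},
        "section_headers": {"offset": shoff, "count": shnum, "entry_size": shentsize},
    }
```

Calling parse_elf64_header(open("hello", "rb").read()) on our executable should report the same table offsets and entry counts that readelf -h prints.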

Figure 1: our ELF segments and sections

Step 2: What goes in our sections?

The specification also tells us what goes where in our executable:

.text: The executable instructions for a program.

.rodata: Constant data. This is the "read-only data" section.

.eh_frame: Information necessary for frame-unwinding during exception handling.

.eh_frame_hdr: To quote the Linux Standard Base Specification: "This section contains a pointer to the .eh_frame section which is accessible to the runtime support code of a C++ application. This section may also contain a binary search table which may be used by the runtime support code to more efficiently access records in the .eh_frame section."

We don't have to worry about exceptions with this example, so .eh_frame and .eh_frame_hdr aren't doing much that we care about, and on this machine, compiling with -fno-asynchronous-unwind-tables will suppress creation of these two sections.

.comment: Compiler version information.

Speaking of getting rid of sections: for those of us with a minimalist aesthetic, strip(1) is our friend. We can --remove-section on non-essential sections like .comment to get rid of them entirely; file(1) will tell us if an ELF executable has been stripped.

Other common sections we don't see with our example because they'd be empty:

.data: Initialized global variables and initialized static local variables.

.bss: Uninitialized global variables and uninitialized static local variables; filled with zeroes. A popular section to bring up during CS interviews!

That's the story on sections. Now, we know that symbols, like _start and main, live in these sections, but are there any more symbols in this program?

Step 3: Understand the symbols and why they live where they live.

We can get symbol information for our executable with objdump -t:
jesstess@kid-charlemagne:~/c$ objdump -t hello
hello:     file format elf64-x86-64

SYMBOL TABLE:
00000000004000e8 l    d  .text                   0000000000000000 .text
0000000000400107 l    d  .rodata                 0000000000000000 .rodata
0000000000400114 l    d  .eh_frame_hdr           0000000000000000 .eh_frame_hdr
0000000000400128 l    d  .eh_frame               0000000000000000 .eh_frame
0000000000000000 l    d  .comment                0000000000000000 .comment
0000000000000000 l    df *ABS*                   0000000000000000 hello.c
00000000004000e8 g       .text                   0000000000000000 _start
0000000000600fe8 g       *ABS*                   0000000000000000 __bss_start
00000000004000f4 g     F .text                   0000000000000013 main
0000000000600fe8 g       *ABS*                   0000000000000000 _edata
0000000000600fe8 g       *ABS*                   0000000000000000 _end
The symbol table for our executable has 11 entries. Weirdly, only rare versions of the objdump man page, like this one, will actually explain the symbol table column by column. It breaks the table down as follows:

Column 1: the symbol's value/address.
Column 2: a set of characters and spaces representing the flag bits set for the symbol. There are 7 groupings, three of which are represented in this symbol table. The first can be l, g, <space>, or !, if the symbol is local, global, neither, or both, respectively. The sixth can be d, D, or <space>, for debugging, dynamic, or normal, respectively. The seventh can be F, f, O, or <space>, for function, file, object, or normal symbol, respectively. Descriptions of the 4 remaining groupings can be found in that unusually comprehensive objdump manpage.
Column 3: which section the symbol lives in. *ABS*, or absolute, means the symbol is not associated with a certain section.
Column 4: the symbol's size/alignment.
Column 5: the symbol's name.
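
The column breakdown above can be sketched as a small Python parser for one line of objdump -t output. This is an illustration, not part of objdump: the helper name parse_symbol_line and the dictionary keys are invented here, and the sketch assumes the 64-bit layout shown in this post (an address, one space, exactly seven flag characters, then section, size, and name) and only decodes the three flag groupings discussed above.

```python
# Meanings of the three flag groupings described in the column breakdown.
BINDING = {"l": "local", "g": "global", " ": "neither", "!": "both"}
DEBUG = {"d": "debugging", "D": "dynamic", " ": "normal"}
KIND = {"F": "function", "f": "file", "O": "object", " ": "normal"}


def parse_symbol_line(line):
    """Split one `objdump -t` symbol line into its five columns."""
    address, remainder = line.split(" ", 1)
    flags = remainder[:7]                      # column 2: seven flag characters
    section, size, name = remainder[7:].split(None, 2)
    return {
        "address": int(address, 16),           # column 1
        "binding": BINDING[flags[0]],          # first flag grouping
        "debug": DEBUG[flags[5]],              # sixth flag grouping
        "kind": KIND[flags[6]],                # seventh flag grouping
        "section": section,                    # column 3
        "size": int(size, 16),                 # column 4
        "name": name,                          # column 5
    }


main_symbol = parse_symbol_line(
    "00000000004000f4 g     F .text                   0000000000000013 main"
)
print(main_symbol["binding"], main_symbol["kind"], main_symbol["section"])
```

Fed the main line from our table, the parser reports a global function symbol in .text with a size of 0x13 (19) bytes.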

Our 5 sections all have associated local (l) debugging (d) symbols. main is indeed a function (F), and hello.c is in fact a file (f), and it isn't associated with any particular section (*ABS*). _start and main are part of the executable instructions for our program and thus live in the .text section as we'd expect. The only oddities here are __bss_start, _edata, and _end, all *ABS*olute, global symbols that we certainly didn't write into our program. Where did they come from?

The culprit this time is the linker script. gcc implicitly called ld to do the linking on this machine as part of the compilation process. ld --verbose will spit out the linker script that was used, and looking at this script (output here) we see that _edata is defined as the end of the .data section, and __bss_start and _end mark the beginning and end of the .bss section. These symbols could be used by memory management schemes (for example if sbrk wants to know where the heap could start) and garbage collectors.

Note that str, our initialized local variable, doesn't show up in the symbol table. Why? Because it gets allocated on the stack (or possibly in a register) at runtime. However, something related to str is in the .rodata section, even though we don't see it in the symbol table...

With char *str = "Hello World"; we're actually creating two different objects. The first is the string literal "Hello World", which is just that array of characters, and has some address but no explicit way of naming it. That array is read-only and lives in .rodata. The second is the local variable str, which is of type "pointer to char". That is what lives on the stack. Its initial value is the address of the string literal that was created.

We can prove this, and see some other useful information, by looking at the contents of our sections with the strings decoded:
jesstess@kid-charlemagne:~$ objdump -s hello

hello:     file format elf64-x86-64

Contents of section .text:
 4000e8 e80b0000 00b80100 000031db cd809090  ..........1.....
 4000f8 554889e5 48c745f8 0b014000 b8000000  UH..H.E...@.....
 400108 00c9c3                               ...             
Contents of section .rodata:
 40010b 48656c6c 6f20576f 726c6400           Hello World.    
Contents of section .eh_frame_hdr:
 400118 011b033b 14000000 01000000 e0ffffff  ...;............
 400128 30000000                             0...            
Contents of section .eh_frame:
 400130 14000000 00000000 017a5200 01781001  .........zR..x..
 400140 030c0708 90010000 1c000000 1c000000  ................
 400150 f8004000 13000000 00410e10 8602430d  ..@......A....C.
 400160 06000000 00000000                    ........        
Contents of section .comment:
 0000 00474343 3a202855 62756e74 7520342e  .GCC: (Ubuntu 4.
 0010 332e332d 35756275 6e747534 2920342e  3.3-5ubuntu4) 4.
 0020 332e3300                             3.3. 
Voila! Our "Hello World" string is in .rodata, and our .comment section is now explained: it just holds a string with the gcc version used to compile the program.

Step 4: Trim the fat and put it all together

This executable has 5 sections: .text, .rodata, .eh_frame_hdr, .eh_frame, and .comment. Really, only one of them, .text, has assembly that's germane to what this little program does. This can be confirmed by doing an objdump -d (only disassemble those sections which are expected to contain instructions) instead of the objdump -D (disassemble the contents of all sections, not just those expected to contain instructions) done at the beginning of the post and noting that only the content of .text is displayed.

.rodata really only contains the string "Hello World", and .comment really only contains a gcc version string. The "instructions" for those sections seen in the objdump -D output come from objdump treating the hexadecimal representations of the ASCII characters in those strings as instructions and trying to disassemble them. We can convert the first couple of numbers in the .comment section to ASCII characters to prove this. In Python:
>>> "".join(chr(int(x, 16)) for x in "47 43 43 3a 20 28 55 62 75 6e 74 75".split())
'GCC: (Ubuntu'

In .text, _start calls main, and in main a pointer to the memory location where "Hello World" is stored, 0x40010b (where .rodata starts, as seen in the objdump -D output), is stored on the stack. We then return from main to _start, which takes care of returning from the program, as described in Part 1.
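
We can check that pointer by hand with a short Python sketch working on the bytes copied from the objdump -s dump earlier in this post. It is a naive byte search rather than a real disassembler: it assumes that the byte sequence 48 c7 45 f8 is the start of the movq instruction that writes a 32-bit immediate to -0x8(%rbp), and the helper names stored_pointer and c_string_at are invented for illustration.

```python
# Bytes of main and of .rodata, copied from the `objdump -s hello` dump.
MAIN_BYTES = bytes.fromhex("554889e5" "48c745f8" "0b014000" "b8000000" "00c9c3")
RODATA_ADDR = 0x40010b
RODATA_BYTES = bytes.fromhex("48656c6c6f20576f726c6400")

# 48 c7 45 f8 encodes `movq $imm32, -0x8(%rbp)`; the immediate follows it.
MOVQ_TO_RBP_MINUS_8 = bytes.fromhex("48c745f8")


def stored_pointer(code):
    """Decode the little-endian immediate that main writes to the stack."""
    start = code.index(MOVQ_TO_RBP_MINUS_8) + len(MOVQ_TO_RBP_MINUS_8)
    return int.from_bytes(code[start:start + 4], "little")


def c_string_at(addr, base, data):
    """Read the NUL-terminated string at addr from a section loaded at base."""
    offset = addr - base
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("ascii")


pointer = stored_pointer(MAIN_BYTES)
print(hex(pointer))                                     # 0x40010b
print(c_string_at(pointer, RODATA_ADDR, RODATA_BYTES))  # Hello World
```

The immediate reads 0b 01 40 00 in the dump because x86-64 stores it least significant byte first; reversed, it is the .rodata address 0x40010b.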

And that's everything! All sections and symbols are accounted for. Nothing is magic (and I mean magic in a good I-would-ace-this-test way, not a sorry-Jimmy-Santa-isn't-real way). Whew.

Looking at and really understanding the core parts of an ELF executable means that we can add complexity now without cheating our way around parts we don't understand. To that end, stay tuned for Part 3, where we'll stuff this program with a veritable variable smörgåsbord and see where everything ends up in the program's memory.

~jesstess
