In the previous post we conquered compilation by constructing a small program that can be compiled without using libc. Understanding object code and the details of an ELF executable is the next step in our adventure.
We're down to a little over 1300 bytes of executable and, at under
100 lines, what seems like a very reasonable amount of assembly to dissect.
Since no little bit of assembly is going to scare us at this point, let's look at the assembly now, with objdump -D so we see the assembly for all sections (output here). If it looks intimidating at first, just give it a quick once-over and I promise it won't be by the end of this post.
Alright, we have 5 sections: .text, which contains the familiar _start and main symbols, .rodata, .eh_frame_hdr, .eh_frame, and .comment.
An ELF executable like the result of our compilation has two views: it has a program header describing the segments, which contain information used at run-time, and a section header describing the sections, which contain information for linking and relocation. We can look at the program header's segment information or the section header's section information with readelf -l or readelf -S, respectively (output here).
The output from these commands on our program is summarized in Figure
1. We won't worry about segments again during this post.
Figure 1: our ELF segments and sections
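To make the "two views" idea concrete, here is a minimal sketch that reads the ELF file header and reports where the program header table (segments) and the section header table (sections) live. It assumes a 64-bit little-endian ELF like our hello executable; the function name read_elf64_views and the dictionary keys are made up for illustration, and readelf -h remains the authoritative way to see these fields.

```python
import struct

# ELF64 file header layout (little-endian), per the ELF specification:
# 16-byte e_ident, then e_type, e_machine, e_version, e_entry, e_phoff,
# e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx. Total size: 64 bytes.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def read_elf64_views(header_bytes):
    """Return the location and size of both header tables.

    Hypothetical helper for illustration; it only handles 64-bit
    little-endian ELF files.
    """
    fields = ELF64_HEADER.unpack_from(header_bytes)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    return {
        "entry": fields[4],
        # The run-time view: segments, described by the program header.
        "program_header_offset": fields[5],
        "program_header_count": fields[10],
        # The link-time view: sections, described by the section header.
        "section_header_offset": fields[6],
        "section_header_count": fields[12],
    }
```

To try it on a real file, pass in the first 64 bytes, for example read_elf64_views(open("hello", "rb").read(64)), and compare the counts with what readelf -l and readelf -S report.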
Step 2: What goes in our sections?
The specification also tells us what goes where in our executable:
.text: The executable instructions for a program.
.rodata: Constant data. This is the "read-only data" section.
.eh_frame: Information necessary for frame-unwinding during exception handling.
.eh_frame_hdr: To quote the Linux Standard Base Specification: "This section contains a pointer to the .eh_frame
section which is accessible to the runtime support code of a C++
application. This section may also contain a binary search table which
may be used by the runtime support code to more efficiently access
records in the .eh_frame section."
We don't have to worry about exceptions with this example, so .eh_frame and .eh_frame_hdr aren't doing much that we care about, and on this machine, compiling with -fno-asynchronous-unwind-tables will suppress creation of these two sections.
.comment: Compiler version information.
Speaking of getting rid of sections: for those of us with a minimalist aesthetic, strip(1) is our friend. We can --remove-section on non-essential sections like .comment to get rid of them entirely; file(1) will tell us if an ELF executable has been stripped.
Other common sections we don't see with our example because they'd be empty:
.data: Initialized global variables and initialized static local variables.
.bss: Uninitialized global variables and uninitialized static local variables; filled with zeroes. A popular section to bring up during CS interviews!
That's the story on sections. Now, we know that symbols, like _start and main, live in these sections, but are there any more symbols in this program?
Step 3: Understand the symbols and why they live where they live.
We can get symbol information for our executable with objdump -t:
jesstess@kid-charlemagne:~/c$ objdump -t hello
hello:     file format elf64-x86-64

SYMBOL TABLE:
00000000004000e8 l    d  .text          0000000000000000 .text
0000000000400107 l    d  .rodata        0000000000000000 .rodata
0000000000400114 l    d  .eh_frame_hdr  0000000000000000 .eh_frame_hdr
0000000000400128 l    d  .eh_frame      0000000000000000 .eh_frame
0000000000000000 l    d  .comment       0000000000000000 .comment
0000000000000000 l    df *ABS*          0000000000000000 hello.c
00000000004000e8 g       .text          0000000000000000 _start
0000000000600fe8 g       *ABS*          0000000000000000 __bss_start
00000000004000f4 g     F .text          0000000000000013 main
0000000000600fe8 g       *ABS*          0000000000000000 _edata
0000000000600fe8 g       *ABS*          0000000000000000 _end
The symbol table for our executable has 11 entries. Weirdly, only rare versions of the objdump man page, like this one, will actually explain the symbol table column by column. It breaks the table down as follows:
Column 1: the symbol's value/address.
Column 2: a set of characters and spaces representing the flag bits set for the symbol. There are 7 groupings, three of which are represented in this symbol table. The first can be l, g, <space>, or !, if the symbol is local, global, neither, or both, respectively. The sixth can be d, D, or <space>, for debugging, dynamic, or normal, respectively. The seventh can be F, f, O, or <space>, for function, file, object, or normal symbol, respectively. Descriptions of the 4 remaining groupings can be found in that unusually comprehensive objdump manpage.
Column 3: which section the symbol lives in. *ABS*, or absolute, means the symbol is not associated with a certain section.
Column 4: the symbol's size/alignment.
Column 5: the symbol's name.
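The column breakdown can be sketched as a tiny parser for one line of objdump -t output. This assumes the fixed-width layout seen for our 64-bit executable (a 16-digit address, a space, then exactly 7 flag characters) and only decodes the three flag groupings described here; parse_symbol_line and its lookup tables are hypothetical names, not anything objdump provides.

```python
# Meanings of the three flag groupings discussed above. Any character
# not listed here is passed through unchanged.
SCOPE = {"l": "local", "g": "global", " ": "neither", "!": "both"}
DEBUG = {"d": "debugging", "D": "dynamic", " ": "normal"}
KIND = {"F": "function", "f": "file", "O": "object", " ": "normal"}


def parse_symbol_line(line):
    """Split one `objdump -t` symbol line into its five columns."""
    address = int(line[:16], 16)       # column 1: value/address
    flags = line[17:24]                # column 2: seven flag characters
    # Columns 3-5 are separated by whitespace (a tab or spaces).
    section, size, name = line[25:].split()
    return {
        "address": address,
        "scope": SCOPE.get(flags[0], flags[0]),   # first grouping
        "debug": DEBUG.get(flags[5], flags[5]),   # sixth grouping
        "kind": KIND.get(flags[6], flags[6]),     # seventh grouping
        "section": section,
        "size": int(size, 16),
        "name": name,
    }
```

Feeding it the main line from our symbol table should report a global function in .text with a size of 0x13 bytes, and the hello.c line should come back as a local, debugging, file symbol in *ABS*.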
Our 5 sections all have associated local (l) debugging (d) symbols. main is indeed a function (F), and hello.c is in fact a file (f), and it isn't associated with any particular section (*ABS*). _start and main are part of the executable instructions for our program and thus live in the .text section as we'd expect. The only oddities here are __bss_start, _edata, and _end, all *ABS*olute, global symbols that we certainly didn't write into our program. Where did they come from?
The culprit this time is the linker script. gcc implicitly called ld to do the linking on this machine as part of the compilation process. ld --verbose will spit out the linker script that was used, and looking at this script (output here) we see that _edata is defined as the end of the .data section, and __bss_start and _end mark the beginning and end of the .bss section. These symbols could be used by memory management schemes (for example if sbrk wants to know where the heap could start) and garbage collectors.
Note that str, our initialized local variable, doesn't show
up in the symbol table. Why? Because it gets allocated on the stack (or
possibly in a register) at runtime. However, something related to str is in the .rodata section, even though we don't see it in the symbol table...
With char *str = "Hello World"; we're actually creating
two different objects. The first is the string literal "Hello World",
which is just that array of characters, and has some address but no
explicit way of naming it. That array is read-only and lives in .rodata. The second is the local variable str,
which is of type "pointer to char". That is what lives on the stack.
Its initial value is the address of the string literal that was created.
We can prove this, and see some other useful information, by looking at the contents of our sections with the strings decoded:
jesstess@kid-charlemagne:~$ objdump -s hello
hello:     file format elf64-x86-64

Contents of section .text:
 4000e8 e80b0000 00b80100 000031db cd809090  ..........1.....
 4000f8 554889e5 48c745f8 0b014000 b8000000  UH..H.E...@.....
 400108 00c9c3                               ...
Contents of section .rodata:
 40010b 48656c6c 6f20576f 726c6400           Hello World.
Contents of section .eh_frame_hdr:
 400118 011b033b 14000000 01000000 e0ffffff  ...;............
 400128 30000000                             0...
Contents of section .eh_frame:
 400130 14000000 00000000 017a5200 01781001  .........zR..x..
 400140 030c0708 90010000 1c000000 1c000000  ................
 400150 f8004000 13000000 00410e10 8602430d  ..@......A....C.
 400160 06000000 00000000                    ........
Contents of section .comment:
 0000 00474343 3a202855 62756e74 7520342e  .GCC: (Ubuntu 4.
 0010 332e332d 35756275 6e747534 2920342e  3.3-5ubuntu4) 4.
 0020 332e3300                             3.3.
Voila! Our "Hello World" string is in .rodata, and our .comment section is now explained: it just holds a string with the gcc version used to compile the program.
Step 4: Trim the fat and put it all together
This executable has 5 sections: .text, .rodata, .eh_frame_hdr, .eh_frame, and .comment. Really, only one of them, .text, has assembly that's germane to what this little program does. This can be confirmed by doing an objdump -d (only disassemble those sections which are expected to contain instructions) instead of the objdump -D (disassemble the contents of all sections, not just those expected to contain instructions) done at the beginning of the post and noting that only the content of .text is displayed.
.rodata really only contains the string "Hello World", and .comment really only contains a gcc version string. The "instructions" for those sections seen in the objdump -D output come from objdump treating the hexadecimal representations of the ASCII characters in those strings as instructions and trying to disassemble them. We can convert the first couple of numbers in the .comment section to ASCII characters to prove this. In Python:
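A minimal sketch of that conversion, taking the first twelve bytes after the leading NUL in the .comment dump above (the variable names are just for illustration):

```python
# Bytes copied by hand from the start of the .comment section,
# skipping the leading 00 byte.
comment_hex = "47 43 43 3a 20 28 55 62 75 6e 74 75"

# bytes.fromhex ignores the spaces; each pair of hex digits is one byte.
decoded = bytes.fromhex(comment_hex).decode("ascii")
print(decoded)  # prints: GCC: (Ubuntu
```

So the "instructions" objdump -D shows for .comment are nothing more than the compiler's version string read as if it were machine code.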
In .text, _start calls main, and in main a pointer to the memory location where "Hello World" is stored, 0x40010b (where .rodata starts, as seen in the objdump -D output), is stored on the stack. We then return from main to _start, which takes care of returning from the program, as described in Part I.
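We can check that pointer by hand, too. The sketch below copies the .text and .rodata bytes from the objdump -s output, finds the instruction that stores the pointer into main's stack frame, and follows the address into .rodata. It assumes the store is encoded as 48 c7 45 f8 followed by a 4-byte little-endian immediate (a movq of an immediate to -0x8(%rbp)), which is what the bytes in this particular dump show; the helper names are made up for illustration.

```python
import struct

# Section contents and load addresses, copied from the objdump -s output.
TEXT = bytes.fromhex(
    "e80b000000b80100000031dbcd809090"
    "554889e548c745f80b014000b8000000"
    "00c9c3"
)
RODATA_BASE = 0x40010B
RODATA = bytes.fromhex("48656c6c6f20576f726c6400")

# Assumed encoding of "movq $imm32, -0x8(%rbp)": opcode bytes first,
# then the 32-bit immediate in little-endian order.
MOVQ_TO_LOCAL = bytes.fromhex("48c745f8")


def find_string_pointer(text):
    """Return the immediate stored into the local variable."""
    start = text.index(MOVQ_TO_LOCAL) + len(MOVQ_TO_LOCAL)
    (address,) = struct.unpack_from("<I", text, start)
    return address


def read_c_string(section, base, address):
    """Read a NUL-terminated ASCII string at `address` within a section."""
    offset = address - base
    end = section.index(b"\x00", offset)
    return section[offset:end].decode("ascii")


pointer = find_string_pointer(TEXT)
message = read_c_string(RODATA, RODATA_BASE, pointer)
print(hex(pointer), message)
```

The bytes 0b 01 40 00 read backwards give 0x40010b, the first address in the .rodata dump, and the string found there is the one we put in str.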
And that's everything! All sections and symbols are accounted for.
Nothing is magic (and I mean magic in a good I-would-ace-this-test way,
not a sorry-Jimmy-Santa-isn't-real way). Whew.
Looking at and really understanding the core parts of an ELF executable
means that we can add complexity now without cheating our way around
parts we don't understand. To that end, stay tuned for Part 3, where
we'll stuff this program with a veritable variable smörgåsbord and see where everything ends up in the program's memory.