Moving code around more easily

Yesterday I wrote about relocations and I promised a second part. So here it is. And thanks for the comments everyone.

Yesterday we explored what it takes to load an executable into memory, and I said it was "position-dependent code". One of the commenters recommended reading Ulrich Drepper's "How to Write a Shared Library" paper, which focuses on the concepts for shared libraries. Unlike an executable, a library is loaded by many programs at the same time, often at different addresses in memory. By necessity, a library must be "position-independent".

Different ways of achieving position independence

But first, how was the code yesterday position-dependent in the first place? Well, if you look at the code the linker generated:

08048198 <malloc@plt>:
8048198: ff 25 68 92 04 08 jmp *0x8049268
[<qMalloc>:]
80481b4: e8 df ff ff ff call 8048198 <malloc@plt>

We see two addresses there: 0x8048198 and 0x8049268. One of these two addresses is absolute, the other is relative to the Program Counter (figuring out which one is which is left as an exercise to the reader). The one that concerns us is the absolute one: if the code had been loaded at a different address in memory, that value would be just plain wrong.

Relocating the code

There are two ways of achieving position independence. The first and simplest way is to mark every place where an absolute address is used and leave behind information so the program loader can correct the addresses. As I discussed yesterday, this information is the relocations, and you really don't want to have too many of them.

Interestingly, this is the solution that Windows chooses for its Portable Executable (PE) file format. But it uses an extra trick: each shared library, or in Windows speak, each DLL, contains a "preferred load address" to which the code is already relocated. If the program loader loads the DLL at the address it prefers to be loaded at, then it doesn't need to do any fixups. The job is done.

If it can't load at the preferred address, often because something else is already loaded there (like another DLL), then the program loader needs to go over the relocation table and fix everything up again. Needless to say, relocating costs more than simply loading, so finding a proper load address for your library is crucial.
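The preferred-address trick can be sketched as a toy loader. This is only a model of the idea, not the real PE format: the function name load_image, the flat list of relocation offsets and the addresses used are all made up for illustration.

```python
import struct

def load_image(image, reloc_offsets, preferred_base, actual_base):
    """Toy model: return the image contents as they would look at actual_base.

    reloc_offsets lists the positions of 32-bit little-endian absolute
    addresses that were pre-computed for preferred_base.
    """
    loaded = bytearray(image)
    delta = actual_base - preferred_base
    if delta == 0:
        # Loaded where it wanted to be: no fixups, nothing is modified.
        return loaded
    for off in reloc_offsets:
        (value,) = struct.unpack_from("<I", loaded, off)
        struct.pack_into("<I", loaded, off, (value + delta) & 0xFFFFFFFF)
    return loaded

# Hypothetical 16-byte image holding one absolute pointer at offset 4.
PREFERRED = 0x10000000
image = bytearray(16)
struct.pack_into("<I", image, 4, PREFERRED + 0x20)

same = load_image(image, [4], PREFERRED, PREFERRED)    # preferred address
moved = load_image(image, [4], PREFERRED, 0x20000000)  # address was taken
```

In the first case the loaded bytes are identical to the file contents. In the second, every listed word has to be read, adjusted and written back, which is the extra work described above.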

That is not the solution that Linux chooses by default, though it can be done with prelinking (on Mac OS X it's called prebinding, which is a more apt term). I don't have a Windows build environment to test this one, but I can approximate the result by compiling the code on Linux without the usual -fPIC option while still passing -shared. The result isn't exactly like Windows, due to the absence of the preferred load address. Disassembling the shared library (with objdump -CRd), I get:

000001d0 <qMalloc>:
1d0: 55 push %ebp
1d1: 89 e5 mov %esp,%ebp
1d3: 83 ec 18 sub $0x18,%esp
1d6: 8b 45 08 mov 0x8(%ebp),%eax
1d9: 89 04 24 mov %eax,(%esp)
1dc: e8 fc ff ff ff call 1dd <qMalloc+0xd>
1dd: R_386_PC32 malloc
1e1: c9 leave
1e2: c3 ret

As you can see, it left behind a dynamic relocation record and it did not save a useful value in the "call" instruction.
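To see why the value in that "call" is not useful yet, here is a small sketch that decodes it and then applies the relocation the way the loader would. R_386_PC32 is computed as S + A - P (symbol address, plus the addend stored in the instruction, minus the address being patched). The load address and the address of malloc below are invented for illustration.

```python
import struct

def rel32_target(insn_addr, insn_bytes):
    """Target of a 5-byte x86 'call rel32' instruction (opcode 0xe8)."""
    assert len(insn_bytes) == 5 and insn_bytes[0] == 0xE8
    (disp,) = struct.unpack("<i", insn_bytes[1:])
    # The displacement is relative to the address of the next instruction.
    return (insn_addr + 5 + disp) & 0xFFFFFFFF

def apply_pc32(symbol_addr, addend, place):
    """R_386_PC32: S + A - P, truncated to 32 bits."""
    return (symbol_addr + addend - place) & 0xFFFFFFFF

# As stored in the file: the placeholder -4 makes the call point at 0x1dd,
# which is just the relocated field itself.
placeholder_target = rel32_target(0x1DC, bytes.fromhex("e8fcffffff"))

# Hypothetical run-time addresses, for illustration only.
load_base = 0x40000000
malloc_addr = 0x40100000

field = apply_pc32(malloc_addr, -4, load_base + 0x1DD)
patched = b"\xe8" + struct.pack("<I", field)
patched_target = rel32_target(load_base + 0x1DC, patched)
```

After the fixup the call reaches malloc, but the four bytes inside the code had to be rewritten to get there.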

Position-independent code

The solution that Linux chooses by default is called "Position Independent Code" or PIC for short, which you enable with the -fPIC compiler option. If we do that and recompile our code, here's what we see in the disassembly (the differences that matter are in the malloc@plt stub and in the call/add pair at the start of qMalloc):

Disassembly of section .plt:

000001c4 <malloc@plt>:
1c4: ff a3 0c 00 00 00 jmp *0xc(%ebx)
1ca: 68 00 00 00 00 push $0x0
1cf: e9 e0 ff ff ff jmp 1b4

Disassembly of section .text:

000001d4 <qMalloc>:
1d4: 55 push %ebp
1d5: 89 e5 mov %esp,%ebp
1d7: 53 push %ebx
1d8: 83 ec 14 sub $0x14,%esp
1db: e8 17 00 00 00 call 1f7 <__i686.get_pc_thunk.bx>
1e0: 81 c3 b4 10 00 00 add $0x10b4,%ebx

1e6: 8b 45 08 mov 0x8(%ebp),%eax
1e9: 89 04 24 mov %eax,(%esp)
1ec: e8 d3 ff ff ff call 1c4 <malloc@plt>
1f1: 83 c4 14 add $0x14,%esp
1f4: 5b pop %ebx
1f5: 5d pop %ebp
1f6: c3 ret

000001f7 <__i686.get_pc_thunk.bx>:
1f7: 8b 1c 24 mov (%esp),%ebx
1fa: c3 ret

DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
000012a0 R_386_JUMP_SLOT malloc

If you compare the malloc@plt stub and the start of qMalloc with what we saw in the last post, when the malloc@plt symbol first appeared, you'll notice that the "jmp" instruction no longer has an absolute address hardcoded. It actually jumps through the pointer stored at %ebx+12. The second modification, at the beginning of the qMalloc function, loads into this register the value 0x1e0+0x10b4 = 0x1294, where 0x1e0 is the return address of the call to the thunk and so moves along with the library wherever it is loaded. If you add the offset of 12 (0xc) from the "jmp" instruction, we get 0x12a0, which is the slot listed in the relocation records above and which will contain the address of the malloc function.
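The arithmetic can be checked with a few lines of Python. The constants come straight from the disassembly above; the load addresses passed to the helper function are arbitrary examples.

```python
RETURN_ADDR = 0x1E0   # address after "call __i686.get_pc_thunk.bx"
ADD_CONST = 0x10B4    # from "add $0x10b4,%ebx"
JMP_OFFSET = 0xC      # from "jmp *0xc(%ebx)"

ebx = RETURN_ADDR + ADD_CONST
malloc_slot = ebx + JMP_OFFSET

def slot_address(load_base):
    """Run-time address of the slot when the library is loaded at load_base."""
    # The thunk yields the real return address, i.e. load_base + 0x1e0.
    return (load_base + RETURN_ADDR) + ADD_CONST + JMP_OFFSET

print(hex(ebx), hex(malloc_slot))  # 0x1294 0x12a0
```

Whatever the load address, the code ends up at the same slot relative to the start of the library, so the instructions themselves never need to be patched. Only the slot at 0x12a0 does.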

Impact on memory - virtual memory principles

What did we accomplish here? Well, to understand the value, we need to go back to the concepts I mentioned at the end of the last blog: clean and dirty memory. Memory is clean when it is unmodified from its backing store, and dirty when modifications have been made to it in memory. For a library, a page is clean if its contents are exactly like they are on disk. Therefore, if the program loader needs to fix up anything, the page is dirty.

If the operating system's virtual memory manager needs to reclaim some memory, it may do so by discarding pages out of the main memory. But the running tasks must not notice the difference. If the memory is clean, there's nothing to lose by just throwing the data away. After all, the pages are exactly like they are on disk, so they can simply be reloaded when they are needed again. If they are dirty, however, the VMM cannot discard the page without first saving it somewhere. That's where the swap comes in, and at a cost (due to disk I/O).

Another benefit of having a clean page is when we have two processes using the same library. The clean pages can be shared between the two or more processes, whereas the dirty ones usually cannot. That's a considerable gain -- for example, a second process using QtWebKit shares 16.2 MB of memory with the first one (4.7.1, x86).

A notable counter-example is NVidia's proprietary GL libraries. They have relocations in the .text segment, which means that each process, just by linking to libGL, will use 16 MB of private, non-sharable memory.

Comparison

Ulrich Drepper's paper teaches a bunch of techniques for avoiding relocations in shared libraries. Why should people worry about them? The two main reasons are the ones I've explained before: the impact on load-time performance, due to the work the loader has to perform, and the sharing of memory.

Position-independent code can be loaded at any address without relocations and shared between processes. In PIC code, the places where absolute addresses are actually necessary and their relocations are concentrated in specific sections of the image, instead of spread throughout.
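A rough way to picture the difference is to count how many pages the loader has to touch. The sketch below assumes 4096-byte pages and uses invented relocation offsets. It also ignores the case of a relocated word straddling a page boundary.

```python
PAGE_SIZE = 4096  # assumed page size

def dirty_pages(reloc_offsets):
    """Set of page numbers that the loader must modify to apply the fixups."""
    return {off // PAGE_SIZE for off in reloc_offsets}

# Hypothetical non-PIC library: 100 fixups spread through the code,
# one roughly every 10 kB.
scattered = [i * 10_000 for i in range(100)]

# Hypothetical PIC library: the same 100 four-byte slots packed into one
# table that starts on a page boundary.
concentrated = [0x100000 + 4 * i for i in range(100)]

print(len(dirty_pages(scattered)), len(dirty_pages(concentrated)))  # 100 1
```

With these made-up numbers, the scattered fixups cost 100 private pages (400 kB) per process, while the concentrated ones cost a single page and leave all of the code clean and sharable.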

On the other hand, there are many constructs that require absolute addresses. Very relevant to us are the C++ virtual tables -- QtWebKit has over 500 kB of them alone, and that memory is not sharable. Another drawback is that the compiler must dedicate one register to be the "PIC register", as my disassemblies above have shown.

So a library that doesn't need relocations or the PIC register is especially interesting. When DLLs on Windows can be loaded at their preferred load addresses, the process is very fast. When they can't, however, then it's far more costly.

The middle ground is to combine the -fPIC option with prelinking. In that case, the code itself will not have relocations -- they will all be contained in those specific sections -- thus freeing the code to be shared. In turn, the sections with relocations will be pre-filled with values that are already correct if the program loader succeeds at loading at the preferred load address.

What I have explained so far is only about code relocations for function calls. There are also code relocations for data access, and data relocations too. They operate on the same principle as I have explained, including the fact that code relocations for data access change with the -fPIC flag. Maybe I'll write a blog on that subject another day. Meanwhile, it's an exercise for the reader.

