By Darryl Gove on May 20, 2009
The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory, as shown in this pmap output:
% pmap 60500
60500: a.out
00010000   8K r-x-- /libraries/a.out
00020000   8K rwx-- /libraries/a.out
FEEC0000  24K rwx-- [ anon ]
FEED0000   8K r-x-- /libraries/lib1_26.so
FEEE0000   8K rwx-- /libraries/lib1_26.so
FEEF0000   8K r-x-- /libraries/lib1_25.so
FEF00000   8K rwx-- /libraries/lib1_25.so
FEF10000   8K r-x-- /libraries/lib1_24.so
FEF20000   8K rwx-- /libraries/lib1_24.so
FEF30000   8K r-x-- /libraries/lib1_23.so
FEF40000   8K rwx-- /libraries/lib1_23.so
FEF50000   8K rwx-- [ anon ]
FEF60000   8K r-x-- /libraries/lib1_22.so
FEF70000   8K rwx-- /libraries/lib1_22.so
FEF80000   8K r-x-- /libraries/lib1_21.so
FEF90000   8K rwx-- /libraries/lib1_21.so
FEFA0000   8K r-x-- /libraries/lib1_20.so
FEFB0000   8K rwx-- /libraries/lib1_20.so
FEFC0000   8K r-x-- /libraries/lib1_19.so
....
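To make the point about separate pages concrete, here is a minimal Python sketch that parses a few lines copied from the listing above and counts the read-only executable segments. The helper names (`parse_pmap`, `executable_mappings`) are made up for this illustration, and the sketch assumes each line has the simple "address size permissions name" layout shown here.

```python
import re

# A few mappings copied from the pmap listing above (addresses are hexadecimal).
PMAP_SAMPLE = """\
00010000 8K r-x-- /libraries/a.out
00020000 8K rwx-- /libraries/a.out
FEEC0000 24K rwx-- [ anon ]
FEED0000 8K r-x-- /libraries/lib1_26.so
FEEE0000 8K rwx-- /libraries/lib1_26.so
FEEF0000 8K r-x-- /libraries/lib1_25.so
FEF00000 8K rwx-- /libraries/lib1_25.so
"""

# Assumed line layout: <hex address> <size>K <permissions> <name>
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)\s+(\d+)K\s+(\S+)\s+(.+)$")


def parse_pmap(text):
    """Return a list of (address, size_kb, permissions, name) tuples."""
    mappings = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            addr, size_kb, perms, name = match.groups()
            mappings.append((int(addr, 16), int(size_kb), perms, name.strip()))
    return mappings


def executable_mappings(mappings):
    """Keep mappings that are executable but not writable (the code segments)."""
    return [m for m in mappings if "x" in m[2] and "w" not in m[2]]


mappings = parse_pmap(PMAP_SAMPLE)
text_segments = executable_mappings(mappings)
print(len(mappings), "mappings,", len(text_segments), "read-only executable segments")
```

On this sample the script prints `7 mappings, 3 read-only executable segments`: one code segment for a.out and one for each library, each sitting at its own address.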
There is a finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and the pressure on the TLB becomes considerable.
One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders, or by any other mechanism that scrambles their locations in memory.
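The "same library at the same address" condition can be illustrated with a small Python sketch. This is only a simplified model of the condition described above, not how Solaris decides what to share. The function name `shareable_libraries` is hypothetical, process A uses addresses from the pmap listing, and process B is an invented example in which two libraries were loaded in the opposite order.

```python
def shareable_libraries(map_a, map_b):
    """Return the libraries that both processes map at the same base address."""
    return sorted(
        name for name, addr in map_a.items()
        if map_b.get(name) == addr
    )


# Process A: addresses taken from the pmap listing.
process_a = {
    "/libraries/lib1_26.so": 0xFEED0000,
    "/libraries/lib1_25.so": 0xFEEF0000,
    "/libraries/lib1_24.so": 0xFEF10000,
}

# Process B: hypothetical. lib1_25 and lib1_24 were loaded in the opposite
# order, so they ended up at swapped addresses.
process_b = {
    "/libraries/lib1_26.so": 0xFEED0000,
    "/libraries/lib1_24.so": 0xFEEF0000,
    "/libraries/lib1_25.so": 0xFEF10000,
}

candidates = shareable_libraries(process_a, process_b)
print(candidates)
```

In this made-up case only lib1_26.so sits at the same address in both processes, so it is the only candidate for sharing; the two reordered libraries are not.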
If each library is mapped into a different TLB entry, then every call into a new library costs a new ITLB entry, a jump through the PLT, and the normal register spill/fill overhead. Together these can become quite a significant chunk of overhead.
To round this off, let's look at some figures from an artificial test code run on an UltraSPARC T1 system that was hanging around here.
|Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable.||3s|
|Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT.||6s|
|Application that jumps between 26 different routines a->...z. The routines are
provided as a library, but all are declared
|Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used.||60s|
Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime (3s to 6s). However, this is insignificant when compared with the cost of calling into separate libraries, which at 60s is about 10x slower again than the single-library case.
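The ratios quoted above follow directly from the measured runtimes. A short Python sketch of the arithmetic, using only the 3s, 6s and 60s figures from the table (the dictionary keys are just labels chosen for this example):

```python
# Runtimes in seconds, taken from the table above.
runtimes_s = {
    "same executable": 3,
    "one library via PLT": 6,
    "one library per routine": 60,
}

baseline = runtimes_s["same executable"]

# Slowdown of each configuration relative to the all-in-one executable.
slowdown = {name: seconds / baseline for name, seconds in runtimes_s.items()}

# Extra cost of separate libraries on top of the PLT cost alone.
separate_vs_plt = runtimes_s["one library per routine"] / runtimes_s["one library via PLT"]

for name, factor in slowdown.items():
    print(f"{name}: {factor:.0f}x")
print(f"separate libraries vs single library: {separate_vs_plt:.0f}x")
```

So relative to the all-in-one executable, the PLT costs a factor of 2 and one library per routine costs a factor of 20, of which a factor of 10 is attributable to spreading the routines across libraries.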
Moving the experiment to look at the impact on CMT systems:
|One copy of this executable per core of an UltraSPARC T1 processor||1 minute|
|Two copies of this executable per core||5 minutes|
|Four copies of this executable per core (fully loaded system)||8 minutes|
Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. These results are from a system without the shared context changes, so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).
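One way to read the scaling figures is as throughput per core. The Python sketch below assumes that every copy does the same fixed amount of work and that the reported time is the elapsed time for all copies to finish; neither assumption is stated in the table, so treat the result as a rough estimate.

```python
# Copies per core -> elapsed minutes, taken from the table above.
elapsed_min = {1: 1, 2: 5, 4: 8}

# Assumption: each copy performs one unit of work, so throughput per core
# is (copies completed) / (elapsed minutes).
throughput = {copies: copies / minutes for copies, minutes in elapsed_min.items()}

# Normalise against the one-copy-per-core case.
relative = {copies: rate / throughput[1] for copies, rate in throughput.items()}

for copies in sorted(relative):
    print(f"{copies} copies per core: {relative[copies]:.1f}x the single-copy throughput")
```

Under those assumptions, two copies per core deliver about 0.4x the throughput of one copy, and four copies about 0.5x, so adding threads makes the core do less total work rather than more.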
The conclusion is that care needs to be taken when deciding to split application code into libraries.