Recently, a customer's use of C++ objects within a dlopen(3c) environment revealed a problem that took some time to evaluate and understand. Sadly, this seems to be a recurring issue where the expectations of the C++ implementation are compromised by dynamic linking capabilities. Of course, dynamic linking is the norm for Solaris, and C++ is commonly employed in dynamic linking environments. But there are subtleties regarding symbol visibility that can cause problems.
This customer was using a Java application to System.loadLibrary a C++ shared object, built to use standard iostreams. The underlying dlopen() failed as part of calling _init, and the result was a core dump. By preloading libumem(3lib), the customer discovered the problem was a bad free().
> ::umem_status
Status:         ready and active
Concurrency:    4
Logs:           (inactive)
Message buffer:
free(d352a040): invalid or corrupted buffer
There seemed to be an inconsistency in memory allocation underlying this failure. And, I felt I'd been here before. A similar (but slightly different, as it turns out) problem had been uncovered a few months ago. So, I started poking through the symbol bindings for this process. I do this for a living, but even I find analyzing the symbol bindings of a process to be a little daunting. There are just so many bindings to wade through. In Solaris 10 we invented lari(1) to help uncover interesting symbol bindings. I gave a quick introduction to this tool in a previous posting.
First it was necessary to obtain a trace of all process bindings, including those produced by the dlopen(). The following environment variables result in this trace being saved in the file dbg.pid.
% LD_DEBUG=files,detail LD_DEBUG_OUTPUT=dbg java-app
The interesting information that lari(1) unravels focuses on the existence of multiple instances of the same symbol name. But even this can be a lot of information to digest (although I still don't understand why so many objects export the same interfaces). For this application, I wanted to narrow things down to just those symbols that were involved in a runtime binding. And, as we're dealing with C++, a little bit of demangling might be useful too.
% lari -bC -D dbg.pid
[3:1EP]: __1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_[0x30] \
    [std::basic_string <char,std::char_traits <char>,std::allocator <char>>::__nullref]: \
    /local/ISV/libdlopened.so
[3:1SF]: __1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_[0x30] \
    [std::basic_string <char,std::char_traits <char>,std::allocator <char>>::__nullref]: \
    /usr/lib/cpu/sparcv8plus/libCstd_isa.so.1
.....
Now that's interesting. Here we have three occurrences of the same __nullref_ symbol, and two different instances have been bound to. The libdlopened.so version is also defined as protected, which means that there may be internal references to this symbol from within the same object. A quick inspection of the original process bindings for this symbol also uncovers their addresses.
09268: 1: binding file=/usr/lib/libCstd.so.1 (0xd1677b00:0x177b00) to \
    file=/local/ISV/libdlopened.so (0xd352a040:0x192a040): \
    symbol `__1cDstdMbasic_string4Ccn0ALchar_traits4Cc__n0AJallocator4Cc___J__nullref_'
There's that bad free() address, 0xd352a040.
Now I'm not sure why the C++ implementation is trying to free a data item that exists within an object, but the core of the problem (I'm told) is that there are two instances of __nullref_ being used, and this has led to confusion. But why have we bound to two different instances?
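A plausible mechanism, offered here as an assumption and not as a description of the actual libCstd code, is that a string implementation recognizes its shared "empty" representation by address and skips the free() for it. If two objects are bound to different instances of that sentinel, the address comparison fails and static data gets handed to free(). The C sketch below models only that idea; struct rep, would_free(), and the two nullref_in_* variables are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared string representation. */
struct rep {
    size_t length;
};

/*
 * Two copies of what is meant to be a single "empty string" sentinel.
 * They stand in for the __nullref_ instances that live in two
 * different shared objects.
 */
static struct rep nullref_in_object_a = { 0 };
static struct rep nullref_in_object_b = { 0 };

/*
 * Returns true if releasing `r` would pass it to free().
 * `my_nullref` is the sentinel instance that the calling object
 * happens to be bound to.
 */
static bool would_free(const struct rep *r, const struct rep *my_nullref)
{
    if (r == my_nullref)
        return false;   /* recognized as the static sentinel, never freed */
    return true;        /* assumed to be heap memory */
}
```

With a single sentinel, would_free(&nullref_in_object_a, &nullref_in_object_a) is false. Once the caller is bound to the other instance, would_free(&nullref_in_object_a, &nullref_in_object_b) is true, which is the shape of the bad free() reported by libumem.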
The problem seems to stem from the search scope and visibility of the objects loaded with dlopen(). Refer to the section "Symbol Lookup" under the "Runtime Linking Programming Interface" in the Linker and Libraries Guide for a detailed explanation.
By default, a dlopen() family is loaded with the RTLD_LOCAL attribute. In this customer's application, libdlopened.so is loaded by the dlopen(), and libCstd.so.1 is loaded as one of its dependencies. libCstd.so.1 is not a dependency of the Java application itself. Therefore, libCstd.so.1 is maintained within the local scope of the family of dlopen objects. All objects within this family are able to bind to this dependency. Objects outside of this family cannot. But libCstd.so.1 also acts as a filter, and brings in the filtee libCstd_isa.so.1. This filtee is effectively brought in using another dlopen(), and thus libCstd_isa.so.1 exists within its own local scope. Hence, the __nullref_ reference from libCstd_isa.so.1 cannot be satisfied by the definition in libdlopened.so, because the referring object and the defining object live in different local scopes. The result is two different symbol bindings.
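The effect of local scope can be observed from C. The sketch below is only an illustration under stated assumptions: probe_local_scope() and struct scope_probe are hypothetical names, and the path and symbol you pass in are your own. It relies on the POSIX behavior that dlopen(NULL, ...) returns a handle covering the program, its startup dependencies, and objects loaded with RTLD_GLOBAL.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Where a symbol could be found after a dlopen(). */
struct scope_probe {
    int loaded;      /* 1 if dlopen() of the object succeeded */
    int via_handle;  /* 1 if dlsym() on the family handle found the symbol */
    int via_global;  /* 1 if the process-wide global scope found the symbol */
};

/*
 * Load `path` with RTLD_LOCAL and look up `symbol` two ways: through
 * the handle of the new dlopen() family, and through the global handle
 * returned by dlopen(NULL, ...).
 *
 * Simplification: a NULL return from dlsym() is treated as "not found",
 * although a symbol whose value is 0 would also return NULL.
 */
static struct scope_probe probe_local_scope(const char *path,
                                            const char *symbol)
{
    struct scope_probe result = { 0, 0, 0 };
    void *global = dlopen(NULL, RTLD_LAZY);
    void *family = dlopen(path, RTLD_LAZY | RTLD_LOCAL);

    if (family != NULL) {
        result.loaded = 1;
        result.via_handle = (dlsym(family, symbol) != NULL);
    }
    if (global != NULL)
        result.via_global = (dlsym(global, symbol) != NULL);

    /* Handles are left open on purpose so the probe unloads nothing. */
    return result;
}
```

For an object that is not otherwise a dependency of the program, a symbol it defines would typically show up with via_handle set and via_global clear, which mirrors the way libCstd.so.1 is reachable only from within its own family.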
Sadly, this seems to be a common failure point. The C++ implementation can deposit the same data item in multiple objects. However, the design expects all such objects to be of global scope, such that interposition occurs, and only one definition from the multiple symbols is bound to. This requirement can be undermined by a number of dynamic linking techniques.
The first is the local scope families produced by dlopen() and filters, as shown by this customer's scenario, although both of these techniques have been around since the early days of Solaris. It is possible that scenarios like this are typically avoided because the application maintains its own dependency on the C++ libraries, or because dlopen() is employed with the RTLD_GLOBAL flag. The scenario can also be avoided by preloading the C++ library. All these mechanisms force the C++ library to be of global scope, and hence allow interposition to bind to one instance of the problematic symbol. (Another hack for this scenario is to set LD_NOAUXFLTR=yes, which suppresses auxiliary filtering, so libCstd_isa.so.1 wouldn't get loaded.)
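For a native application that controls its own dlopen() calls, the RTLD_GLOBAL variant might look like the following minimal sketch. The helper open_in_global_scope() is a hypothetical name, not an existing API, and this does not help when the dlopen() is issued on your behalf, as with System.loadLibrary.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <string.h>

/*
 * Load `path` so that it and its dependencies join the global scope,
 * making their symbols candidates for interposition process-wide.
 * On failure, returns NULL and copies the dlerror() text into `errbuf`
 * when a buffer is supplied.
 */
static void *open_in_global_scope(const char *path, char *errbuf,
                                  size_t errlen)
{
    void *handle = dlopen(path, RTLD_LAZY | RTLD_GLOBAL);

    if (errbuf != NULL && errlen > 0) {
        errbuf[0] = '\0';
        if (handle == NULL) {
            const char *msg = dlerror();
            if (msg != NULL) {
                strncpy(errbuf, msg, errlen - 1);
                errbuf[errlen - 1] = '\0';
            }
        }
    }
    return handle;
}
```

The trade-off is the one this post is about: promoting an object to global scope gives up the namespace isolation that RTLD_LOCAL families provide.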
However, similar issues can result from using linker options such as -Bsymbolic and direct bindings, or from scoping dynamic object interfaces using mapfiles. The problem is that these dynamic linking technologies exist to carve out local namespaces within a process, and to protect multiple dlopen() families from adversely interacting with one another. This requirement is becoming more and more relevant in today's large dynamic applications.
C++ implementation requirements and user dynamic linking requirements seem to be at odds.
Perhaps it is time to invent a new symbol attribute. Attributes that allow symbols to be demoted to protected or local scope already exist. A previous posting introduced some compiler techniques in this area. But we have no attribute that states that a symbol must remain global, that it should have no internal or direct bindings established to it, and that it should be elevated above any local scope families created within a dynamically linked process. Perhaps with such a symbol attribute, assigned by the compilers for the symbols they know must be completely interposable, we'd establish a more robust environment.
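To make the existing demotion attributes concrete, here is a minimal sketch that assumes a compiler accepting GCC-style visibility attributes. The function names are hypothetical, and the comments describe the intended effect when the code is built into a shared object. Note that there is no fourth spelling that would express the "must stay interposable everywhere" attribute proposed above.

```c
/* Default (global) visibility: exported, and open to interposition. */
__attribute__((visibility("default")))
int plugin_api_version(void)
{
    return 1;
}

/*
 * Protected: still exported, but references from inside the defining
 * object bind to this definition and cannot be interposed upon.
 */
__attribute__((visibility("protected")))
int plugin_checksum(int x)
{
    return x * 31 + 7;
}

/* Hidden: demoted to local scope when the object is linked. */
__attribute__((visibility("hidden")))
int plugin_internal_helper(int x)
{
    return plugin_checksum(x) + 1;
}
```

The protected case is the one seen in the lari(1) output earlier: the definition remains visible to other objects, yet the defining object keeps using its own copy.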
Now, I wonder what name we'd give this new super-global attribute?