One way dynamic_cast across library boundaries can fail and how to fix it

The Squish IDE includes a debugger for test scripts that lets users inspect and expand test script variables, including variables that refer to objects in the application under test. Recently we found that expanding these variables no longer worked. The problem occurred only on macOS, and investigating it taught us some new details about how dynamic_cast works across different C++ implementations and platforms. In this article, I present the problem, its cause and the solution we found so that others can learn from it.

The Problem

The lack of expandability for those variables was traced to this bit of code in the Squish sources:

bool VariableWatcher::objectHasComponents( Object* obj ) const
{
    const RemoteObjectBase *ro = dynamic_cast<const RemoteObjectBase*>( obj );
    if ( !ro ) {
        // error handling
        return false;
    }
    return true;
}

On macOS the dynamic_cast returned a null pointer, even though the object passed in does inherit from RemoteObjectBase and its actual type is the same on all platforms. However, the function shown here and RemoteObjectBase live in different shared libraries, and the Squish build system ensures that the libraries hide all symbols that are not explicitly tagged as exported.

The Qt documentation states that dynamic_cast across shared library boundaries can be problematic. However, in our case we had no QObject base class and did not want to introduce one either. The issue also does not appear on Linux or Windows, so we had to investigate further to determine the reason and find a fix or workaround for macOS.

The cause: Symbol visibility and C++ runtime implementation

The cause of this issue is a combination of symbol visibility and the different C++ runtime libraries used on macOS and Linux.

The dynamic_cast operator needs to compare types to determine whether the object inherits from the given type. How this comparison is implemented differs between libc++ (used on macOS) and libstdc++ (usually used on Linux); it is discussed, for example, in an older mail exchange in the LLVM project. The libstdc++ implementation compares the name attribute of the type information structures obtained from the object and from the target type, whereas libc++ compares the pointers to the type information directly.
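To make the difference concrete, here is a minimal sketch of the two comparison strategies. It is a simplified model based only on the description above, not the actual runtime code: FakeTypeInfo, both comparison functions and the two info objects are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for std::type_info; it only carries the mangled name.
struct FakeTypeInfo {
    const char* name;
};

// Strategy described above for libstdc++: two type information objects
// denote the same type if their name strings are equal.
inline bool sameTypeByName(const FakeTypeInfo& a, const FakeTypeInfo& b) {
    assert(a.name != nullptr && b.name != nullptr);
    return std::strcmp(a.name, b.name) == 0;
}

// Strategy described above for libc++: only the addresses are compared,
// so two copies of the same information count as different types.
inline bool sameTypeByAddress(const FakeTypeInfo& a, const FakeTypeInfo& b) {
    return &a == &b;
}

// Two separate copies of the "same" type information, modelling one copy in
// the library that defines the class and one in the library that uses it.
static const FakeTypeInfo infoInDefiningLib{"16RemoteObjectBase"};
static const FakeTypeInfo infoInUsingLib{"16RemoteObjectBase"};
```

In this model, sameTypeByName(infoInDefiningLib, infoInUsingLib) is true while sameTypeByAddress(infoInDefiningLib, infoInUsingLib) is false, which mirrors a cast that succeeds on Linux and fails on macOS.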

The pointer value of the type information can be accessed via the typeid operator, and this demonstrates the problem too. The value returned differed depending on whether typeid was evaluated in objectHasComponents or in the shared library defining RemoteObjectBase.
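A diagnostic along these lines could be used to compare the addresses. This is only a sketch: the class definitions are simplified stand-ins for the real Squish classes, and the helper names staticTypeInfoAddress, dynamicTypeInfoAddress and reportTypeInfo are made up for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <typeinfo>

// Simplified, hypothetical stand-ins for the classes from the article.
struct Object {
    virtual ~Object() = default;
};
struct RemoteObjectBase : Object {};

// Address of the type information as referenced by the code that
// instantiates this template (i.e. the calling library).
template <typename T>
const void* staticTypeInfoAddress() {
    return &typeid(T);
}

// Address of the type information belonging to the object's dynamic type.
inline const void* dynamicTypeInfoAddress(const Object* obj) {
    assert(obj != nullptr);
    return &typeid(*obj);
}

// Print both addresses; calling this from each library and comparing the
// output shows whether the libraries agree on the type information.
inline void reportTypeInfo(const char* where, const Object* obj) {
    std::printf("%s: static=%p dynamic=%p name=%s\n", where,
                staticTypeInfoAddress<RemoteObjectBase>(),
                dynamicTypeInfoAddress(obj), typeid(*obj).name());
}
```

Within a single binary both addresses are expected to be equal for a RemoteObjectBase instance. In the failing setup, the static address printed from the library containing objectHasComponents would differ from the one printed by the defining library.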

This type information is also part of the symbols of a shared library on Linux and macOS. Looking at the output of objdump for the library that contains the objectHasComponents function shows this:

objdump -Ct libb.dylib | grep RemoteObjectBase | grep typeinfo
0000000000003f2a lw O __TEXT,__const typeinfo name for RemoteObjectBase
0000000000004048 lw O __DATA_CONST,__const typeinfo for RemoteObjectBase

(The libb.dylib library here is from a small example project that reproduces the issue without all the rest of the Squish codebase.)

The lw flags towards the beginning of the lines indicate that the type information for RemoteObjectBase is a local symbol in this library. This local symbol apparently shadows the one from the shared library that defines the class when dynamic_cast is invoked within libb.dylib. The object being cast, however, still refers to the type information behind the global symbol, so the two no longer match.

The solution: Adjust the import macro

The sample project mentioned above helped to identify that the compiler sets up the type information symbol as a local one based on the visibility flags specified. Disabling the hiding of symbols in shared libraries makes the issue go away - the type information symbol is then a global one and dynamic_cast works as expected on macOS too.

I'm going to make a quick detour here to provide some background on how symbol visibility is usually handled in case that is not known to some readers:

Reducing the set of exported symbols is handled in a platform-specific way and usually involves placing a compiler-specific annotation in front of the class name in its declaration. On some platforms, different annotations are needed depending on whether the class itself is being compiled (i.e. exporting the symbol) or used from another module (i.e. importing the symbol). This is usually set up through preprocessor macros: one macro is defined for exporting the class and one for importing it, and an additional define switches between the two.

On Linux and macOS only exporting needs an annotation (__attribute__((visibility("default")))), combined with a compiler command-line option that hides all symbols of a shared library by default (-fvisibility=hidden). Hence, the import macro is often defined to be empty.

While looking for a solution that avoids disabling symbol hiding on macOS, we found some projects, for example the appleseed project, where the import macro is defined in the same way as the export macro on Linux and macOS. Instead of being empty, it also adds __attribute__((visibility("default"))) in front of the class name when the class is used outside of its defining library. Modifying our import macro in this way makes the compiler generate a global symbol within the libb.dylib shared library (and its counterpart in Squish):

objdump -tC libb.dylib | grep RemoteObj
0000000000004050 gw    O __DATA_CONST,__const typeinfo for RemoteObjectBase
0000000000003f2a gw    O __TEXT,__const typeinfo name for RemoteObjectBase
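For readers who want to try this, a macro setup of the kind described might look roughly like the following. The names REMOTEOBJECT_EXPORT and REMOTEOBJECT_BUILD_LIB are hypothetical (the actual Squish macros are not shown in this article), and the class body is a placeholder.

```cpp
// Hypothetical macro and define names, chosen for illustration only.
#if defined(_WIN32)
// Windows needs distinct annotations for exporting and importing.
#  if defined(REMOTEOBJECT_BUILD_LIB)
#    define REMOTEOBJECT_EXPORT __declspec(dllexport)
#  else
#    define REMOTEOBJECT_EXPORT __declspec(dllimport)
#  endif
#else
// GCC/Clang on Linux and macOS, libraries built with -fvisibility=hidden.
#  if defined(REMOTEOBJECT_BUILD_LIB)
#    define REMOTEOBJECT_EXPORT __attribute__((visibility("default")))
#  else
// Often left empty; defining it like the export macro keeps the type
// information symbol global in the libraries that only use the class.
#    define REMOTEOBJECT_EXPORT __attribute__((visibility("default")))
#  endif
#endif

// Placeholder class; the annotation goes between "class" and the name.
class REMOTEOBJECT_EXPORT RemoteObjectBase {
public:
    virtual ~RemoteObjectBase() = default;
};
```

The defining library would be compiled with -DREMOTEOBJECT_BUILD_LIB (again, an assumed name), while every library that merely uses the class leaves it undefined and picks up the import branch.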

This proves to be a viable middle ground between finding a replacement for dynamic_cast, such as qobject_cast, and exposing all internal classes and functions from the libraries. So for this particular case, where dynamic_cast fails unexpectedly across shared libraries, there is a solution that does not require introducing QObject into the class hierarchy just to be able to use qobject_cast.

At this point we stopped investigating why the compiler generates the local symbol. If anyone has more background information on this, do not hesitate to post a comment.
