ELF

Executable and Linkable Format

System V Release 4.0 was announced on October 18, 1988

Performance of shared objects

https://docs.fedoraproject.org/en-US/packaging-guidelines/#_statically_linking_executables "Executables and libraries SHOULD NOT be linked statically against libraries which come from other packages."

Linus Torvalds: "I really wish Fedora stopped doing that. Shared libraries are not a good thing in general. They add a lot of overhead in this case, but more importantly they also add lots of unnecessary dependencies and complexity, and almost no shared libraries are actually version-safe, so it adds absolutely zero upside." "but unless it's some very core library used by a lot of things"

  • size (disk, network bandwidth, memory)
  • development (relink)
  • convenience for upgrade (security fixes, backport: one place to patch)

libllvm11, 50 packages

Myth: "shared objects are slow"

Why?

  • Less internalization (conservative, all exported symbols are needed)
  • Cross-library optimizations
  • Technical decision made in 1990+ (interposition by default)

Formulae:

  • -fpic -fno-semantic-interposition ~= -fpie
  • -shared -Bsymbolic ~= -pie. The remaining differences: -pie allows copy relocations and canonical PLT entries, allows relaxing the general dynamic/local dynamic TLS models and TLS descriptors to initial exec/local exec, and resolves undefined weak symbols differently.

ELF dynamic linking model

Philosophy: dynamic linking is similar to static linking (ld ... a.so b.so is similar to ld ... a.a b.a)

  • a.a <-> a.so
  • No source-level annotation (trivial to split/join libraries)

First case: f is defined in the same translation unit as g. One notable point: GCC -fpic suppresses interprocedural optimizations, including inlining, for such non-inline external linkage functions.

// f and g are STB_GLOBAL STV_DEFAULT.
void f() { ... }
void g() { f(); }

Second case: f is defined in a different object file which will be linked into the same shared object.

// a.c -> a.o
void f() { ... }

// b.c -> b.o
void f();
void g() { f(); }

// a.o b.o -> a.so

Third case: f is defined in a different shared object or the executable. The symbol search on f cannot be prevented.

// a.c -> a.o -> a.so
void f() { ... }

// b.c -> b.o
void f();
void g() { f(); }

// b.o a.so -> b.so

.a <-> .so

ld ... a.a b.a

  • a.so a.a: X.o Y.o
  • b.so b.a: X1.o Y1.o Z.o

ld ... -( a.a(X.o) a.a(Y.o) -) -( b.a(X1.o) b.a(Y1.o) b.a(Z.o) -)

X.o shadows X1.o. Y.o shadows Y1.o. In a link, ld may extract a.a(X.o) a.a(Y.o) b.a(Z.o).

Suppose Z.o references some symbols defined in X1.o or Y1.o. The references will bind to the definitions in X.o or Y.o instead.

ld ... a.so b.so

Interposition: simple, elegant, but inefficient and error-prone.

However, this is a lame interpretation: by reorganizing the archive a bit, you can easily cause a multiple definition error.

Dynamic linking model in ELF

The ELF wording on dynamic linking hasn't changed since 2000-07-17, i.e. the evolution of dynamic linking has not contributed back to the specification.

The dynamic loader does one critical job: resolving dynamic relocations and binding symbol references from one component to another.

There is a flat namespace for symbol search.

The dynamic loader computes a breadth-first search list (executable, needed0, needed1, needed2, needed0_of_needed0, needed1_of_needed0, ...).

For each symbol reference, the dynamic loader iterates over the list and finds the first component which provides a definition. (For dlsym with an explicit handle, the symbol search uses the dependency order, a breadth-first search rooted at the handle.)

The implication is that STB_GLOBAL and STB_WEAK definitions are equivalent in terms of symbol search. An STB_WEAK definition can preempt an STB_GLOBAL definition.
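The search described above can be modeled in a few lines of C. This is only a sketch under simplifying assumptions: the Component struct, the fixed-size arrays, and the names build_search_list and lookup are hypothetical, and a real dynamic loader uses hash tables rather than string comparison. The model stores no binding at all, which mirrors the point that STB_GLOBAL and STB_WEAK definitions are treated alike during the search.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_DEPS = 4, MAX_SYMS = 4, MAX_COMPONENTS = 16 };

/* Hypothetical model of a loaded component (executable or shared object). */
typedef struct Component {
  const char *name;
  const struct Component *needed[MAX_DEPS]; /* DT_NEEDED entries, NULL-terminated */
  const char *defs[MAX_SYMS];               /* defined dynamic symbols, NULL-terminated */
} Component;

/* Fill `list` with the breadth-first order rooted at `exe`; return its length. */
size_t build_search_list(const Component *exe, const Component **list) {
  size_t n = 0;
  list[n++] = exe;
  for (size_t i = 0; i < n; i++) {
    for (size_t d = 0; d < MAX_DEPS && list[i]->needed[d]; d++) {
      const Component *dep = list[i]->needed[d];
      int seen = 0;
      for (size_t k = 0; k < n; k++)
        if (list[k] == dep) seen = 1;
      /* Each component appears once, at its first breadth-first position. */
      if (!seen && n < MAX_COMPONENTS) list[n++] = dep;
    }
  }
  return n;
}

/* Return the first component in the list that defines `sym`, or NULL. */
const Component *lookup(const Component **list, size_t n, const char *sym) {
  for (size_t i = 0; i < n; i++)
    for (size_t s = 0; s < MAX_SYMS && list[i]->defs[s]; s++)
      if (strcmp(list[i]->defs[s], sym) == 0) return list[i];
  return NULL;
}
```

With an executable that needs needed0 and needed1, where both needed1 and needed0_of_needed0 define f, lookup returns needed1 because it comes earlier in the breadth-first list.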

While not mentioned in the ELF specification, many dynamic loader implementations allow the environment variable LD_PRELOAD to inject shared objects. The search list may look like executable, preload0, preload1, needed0, needed1, needed2, needed0_of_preload0, ..., needed0_of_needed0, needed1_of_needed0, ...

Note that the executable is always the first element of the search list, so a defined symbol of any binding in the executable cannot be preempted (interposed). In a shared object, a default visibility STB_GLOBAL or STB_WEAK symbol can be preempted (interposed) because an earlier component may define a symbol of the same name.

Alternative symbol search models

Solaris named the above the default search model and introduced an alternative model: direct bindings. With -z defs, one can ensure the dependencies are provided as part of the link and all symbol references are satisfied. The linker can record the bound component for each symbol reference.

Here is an example from Solaris's Linkers and Libraries Guide:

$ elfdump -y W.so.2
[6] [ DEPEND DIRECT ] <self> a
[7] [ DEPEND LAZY DIRECT ] [1] w.so.1 b

With the information about the component name, the dynamic loader can speed up its symbol search by just looking at one component. In particular, frequently the bound component is the component itself.
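To make the saving concrete, here is a minimal sketch of a lookup that honors a recorded binding. The Component and Reference structs, the resolve function, and the probes counter are hypothetical names for illustration. The sketch also assumes that a direct binding is final and does not model any fallback a real implementation may perform.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_SYMS = 4, NO_BINDING = -1 };

typedef struct {
  const char *name;
  const char *defs[MAX_SYMS]; /* defined symbols, NULL-terminated */
} Component;

typedef struct {
  const char *sym;
  int bound; /* index of the component recorded at link time, or NO_BINDING */
} Reference;

static int defines(const Component *c, const char *sym) {
  for (size_t i = 0; i < MAX_SYMS && c->defs[i]; i++)
    if (strcmp(c->defs[i], sym) == 0) return 1;
  return 0;
}

/* Return the index of the component satisfying `ref`, or -1.
   `*probes` receives the number of components that were examined. */
int resolve(const Component *list, int n, Reference ref, int *probes) {
  *probes = 0;
  if (ref.bound >= 0 && ref.bound < n) {
    /* Direct binding: look at exactly one component. */
    ++*probes;
    return defines(&list[ref.bound], ref.sym) ? ref.bound : -1;
  }
  /* Default model: walk the whole search list in order. */
  for (int i = 0; i < n; i++) {
    ++*probes;
    if (defines(&list[i], ref.sym)) return i;
  }
  return -1;
}
```

For a three-component list where only the last component defines a, the directly bound reference is resolved after one probe, while the default search needs three.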

In Mac OS X, the two-level namespace introduced in 10.1 (default unless you use ld -flat_namespace) is a similar model.

Prelink can be conceived as a direct binding model without great ergonomics.

The standard ELF specification defines DF_SYMBOLIC which can be conceived as a special case of direct bindings. When a shared object is marked as DF_SYMBOLIC (set by ld -Bsymbolic), the symbol search checks the shared object itself before starting the linear search from the executable. It is quite common for a shared object to call STV_DEFAULT definitions in itself. DF_SYMBOLIC can improve the performance greatly.
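The DF_SYMBOLIC rule can be sketched as one extra check in front of the linear search. The names below (Component, df_symbolic, resolve_from) are hypothetical, and the list is assumed to be the already computed search list with the executable at index 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { MAX_SYMS = 4 };

typedef struct {
  const char *name;
  bool df_symbolic;           /* models the DF_SYMBOLIC flag */
  const char *defs[MAX_SYMS]; /* defined symbols, NULL-terminated */
} Component;

static bool defines(const Component *c, const char *sym) {
  for (size_t i = 0; i < MAX_SYMS && c->defs[i]; i++)
    if (strcmp(c->defs[i], sym) == 0) return true;
  return false;
}

/* Resolve a reference to `sym` made from list[from]; return the index of the
   defining component or -1. */
int resolve_from(const Component *list, int n, int from, const char *sym) {
  /* DF_SYMBOLIC: the referencing shared object is checked first. */
  if (list[from].df_symbolic && defines(&list[from], sym)) return from;
  /* Otherwise (or if not found), the usual search from the executable. */
  for (int i = 0; i < n; i++)
    if (defines(&list[i], sym)) return i;
  return -1;
}
```

In this model a reference to f from a DF_SYMBOLIC shared object binds to its own definition even when the executable also defines f, while the same reference from an unmarked shared object binds to the executable.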

-Bsymbolic

The linker option -Bsymbolic can be used together with -shared. ld -shared -Bsymbolic is very similar to -pie.

-Bsymbolic follows ELF DF_SYMBOLIC semantics: all defined symbols are non-preemptible. This can optimize relocation processing:

  • function calls: a branch instruction (e.g. call foo@PLT) will not create a PLT entry. The associated R_*_JUMP_SLOT dynamic relocation will be suppressed.
  • variable access and function addresses: the GOT entry will not cause a R_*_GLOB_DAT dynamic relocation. On x86-64, with R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX, the GOT indirection code sequence can be rewritten. However, the code sequence is still longer than that without GOT. On PowerPC64, there is a similar TOC optimization. On other architectures, there is no difference in code sequences.

# a.o
.globl f
f:
...

# b.o
.globl g
g:
call f@PLT

-fno-semantic-interposition can address the pessimization when the definition and the call site are in the same translation unit.

Working at the shared object level, -Bsymbolic can address the cross-translation-unit pessimization which cannot be optimized with -fno-semantic-interposition. Personally I think this captures most of the benefits of direct bindings.

As a data point, when building the Linux kernel's x86_64 defconfig with a clang -fPIC built clang, my build is 15% faster if I add -Bsymbolic-functions to libLLVM.so and libclang-cpp.so. I cannot measure a performance difference between that build and a mostly statically linked PIE clang.

However, in practice, deployment of -Bsymbolic may run into pointer equality problems. Many objects in C++ are not clearly part of a single object file, but are required by the ODR to have a single definition. For example, C++ [dcl.inline]: "An inline function or variable with external or module linkage can be defined in multiple translation units ([basic.def.odr]), but is one entity with one address. A type or static variable defined in the body of such a function is therefore a single entity."

We will discuss variables and functions separately.

Pointer equality for variables

An inline variable with external linkage and a local static variable defined in an inline function with external linkage are required to be unique. The address of such a variable seen by a -Bsymbolic linked shared object may be different from the address seen from outside the shared object. Fortunately it is uncommon to export such a vague linkage variable to both the executable and a shared object.

// a.h
inline int *addr() {
static int data;
return &data;
}

// a.cc -> a.o -> a.so (-Bsymbolic)
#include "a.h"
int *addr0 = addr();

// b.cc -> b.o -> exe
#include "a.h"
int *addr1 = addr();

(ELF specific) In addition, a regular non-inline variable with external linkage can cause incompatibility problems due to copy relocations. GCC/Clang -fno-pic emit direct access relocations referencing a global variable. If the global variable turns out to be defined in a shared object, there will be a copy relocation in the executable. The object the shared object sees and the executable sees will be different.

// a.h
extern int var;

// a.cc -> a.o -> a.so (-Bsymbolic)
int var;
// clang -fpic (indirect): movq var@GOTPCREL(%rip), %rax
// ld -Bsymbolic: (non-GOTPCRELX) R_X86_64_RELATIVE or (GOTPCRELX) PC-relative
int *addr0() { return &var; }

// b.cc -> b.o -> exe
int main() {
// clang -fno-pic (direct): movl $var, %esi
// clang -fpie/-fpic (indirect): movq var@GOTPCREL(%rip), %rsi
int *addr1 = &var;
}

For Clang -fno-pic, the direct access relocation can be avoided with -fno-direct-access-external-data. GCC feature request: PR98112.

Since GCC 5, on x86-64, -fpie can cause copy relocations as well due to HAVE_LD_PIE_COPYRELOC. We should fix it. Pending GCC patch.

(In C++, typeid() on an incomplete class can define a typeinfo name object. A -Bsymbolic linked shared object may see a different copy, but the address can hardly cause a problem.)

Pointer equality for functions

The address of an inline function seen by a -Bsymbolic linked shared object may be different from the address seen from outside the shared object. Fortunately such cases are rare. Windows link.exe enables identical COMDAT folding (/OPT:ICF) by default. ELF/Mach-O programs may use -fvisibility-inlines-hidden. Assuming pointer equality will break Identical COMDAT Folding and -fvisibility-inlines-hidden anyway.

On Mach-O, such symbols are placed into __LINKEDIT,__weak_binding so that dyld can coalesce the definitions across dylibs.

On Windows, you need to compile the DLL and the executable differently: the defining DLL needs __declspec(dllexport) on the inline function and the executable needs __declspec(dllimport). This is tricky.

// a.cc -> a.obj -> a.dll
__declspec(dllexport) inline void f() {}
__declspec(dllexport) void *g() { return (void *)&f; }

// b.cc -> b.obj -> b.exe
// The address will be different if dllimport is omitted or dllexport is used.
__declspec(dllimport) inline void f() {}

(ELF specific) In addition, a regular non-inline function with external linkage can cause incompatibility problems due to canonical PLT entries. GCC/Clang -fno-pic emit direct access relocations when taking the address of an external function. If the function turns out to be defined in a shared object, there will be a canonical PLT entry in the executable. The function address the shared object sees and the executable sees will be different.

// a.h
void fun();

// a.cc -> a.o -> a.so (-Bsymbolic-functions)
void fun() {}
void *addr0() { return (void *)&fun; }

// b.cc -> b.o -> exe
// addr1() != addr0()
void *addr1() { return (void *)&fun; }

-fno-pic should use GOT when taking the address of an external default visibility function. GCC feature request: PR100593.

-Bsymbolic-functions

The function incompatibility problems are uncommon. It is often benign when the function address seen by a shared object is different from outside the shared object. However, the variable case is usually severe: the executable and a shared object may act on different copies of a variable supposed to be the same entity.

In practice, we can usually use the linker option -Bsymbolic-functions. The option applies to STT_FUNC symbols in ld.lld and non-STT_OBJECT symbols in GNU ld and gold, avoiding variable incompatibility problems. Though rarely needed, it may make sense to add a linker option (say, -Bsymbolic-global-functions) which applies to STT_FUNC STB_GLOBAL symbols to bypass vague linkage STB_WEAK symbols. GNU ld feature request: PR27871.

Visibility

-fvisibility=protected

A non-default visibility symbol cannot be preempted, even if the binding is STB_WEAK. -fvisibility=protected can make all definitions protected and thus non-preemptible, so -fno-semantic-interposition and -Bsymbolic have no additional performance benefit to offer. Note: if you want a definition to be preemptible, you will need a default visibility attribute, even if it is weak (e.g. __attribute__((weak,visibility("default")))).
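A small sketch of the annotations involved, assuming GCC or Clang on an ELF target. The function names fast_path, hook and entry are hypothetical. The attributes make the intent explicit regardless of which -fvisibility= value the file is compiled with.

```c
/* Protected: still exported, but never preempted. References from within
   the component always bind to this definition. */
__attribute__((visibility("protected"))) int fast_path(int x) { return x + 1; }

/* Weak alone is not enough under -fvisibility=protected. The explicit
   default visibility keeps this definition preemptible. */
__attribute__((weak, visibility("default"))) int hook(int x) { return x; }

/* The call to fast_path may be optimized freely; the call to hook must
   stay an interposable call. */
int entry(int x) { return fast_path(hook(x)); }
```

When nothing interposes hook, entry(1) evaluates to 2. An earlier component that defines hook would change the result without any rebuild, whereas fast_path stays fixed.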

However, -fvisibility=protected shares the same problem with -Bsymbolic: too coarse-grained. It can cause the same sets of problems as discussed above in Pointer equality for variables and functions. Notably, a shared object built with -fvisibility=protected is incompatible with -fno-pic global variable access.

In GCC/binutils's x86 port, there were some changes in the wrong direction. As a result, there are some STT_OBJECT issues resulting in poor interoperability with Clang (and also gold).

% cat a.s
leaq foo(%rip), %rax

.data
.global foo
.protected foo
foo:
% gcc -fuse-ld=bfd -shared a.s
/usr/bin/ld.bfd: /tmp/ccchu3Xo.o: relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object
/usr/bin/ld.bfd: final link failed: bad value
collect2: error: ld returned 1 exit status

See Copy relocations, canonical PLT entries and protected visibility for details. There is no problem when you only use Clang and LLD.

-fvisibility=hidden

-fvisibility=hidden can make all definitions hidden and thus non-preemptible, so -fno-semantic-interposition has no additional performance benefit to offer.

-fvisibility=hidden requires annotation of exported symbols (__attribute__((visibility("default")))). The explicit annotation sometimes makes it inconvenient to split and join libraries.

However, projects with Windows portability in mind will define macros to dispatch to either the visibility attribute or __declspec(dllexport).
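Such a macro usually looks something like the following sketch. MYLIB_EXPORT, MYLIB_BUILDING and the two functions are hypothetical names. The assumption is that the library itself is compiled with -DMYLIB_BUILDING (and with -fvisibility=hidden on ELF and Mach-O), while clients include the same header without that definition.

```c
/* Dispatch to the platform's export annotation. */
#if defined(_WIN32)
#  ifdef MYLIB_BUILDING
#    define MYLIB_EXPORT __declspec(dllexport)
#  else
#    define MYLIB_EXPORT __declspec(dllimport)
#  endif
#elif defined(__GNUC__)
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#else
#  define MYLIB_EXPORT
#endif

/* No annotation: with -fvisibility=hidden this stays internal to the
   library and is non-preemptible. */
int mylib_helper(int x) { return x * 2; }

/* Annotated: part of the library's exported interface on every platform. */
MYLIB_EXPORT int mylib_double_plus_one(int x) { return mylib_helper(x) + 1; }
```

Moving a function from one library to another then means moving its annotation macro as well, which is the inconvenience for splitting and joining libraries mentioned above.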

-fvisibility-inlines-hidden

The C++ specific -fvisibility-inlines-hidden is a safer subset of -fvisibility=hidden. The option just violates pointer equality for inline function definitions. As discussed above, this is usually safe.

Interaction with LD_PRELOAD

There are several types of LD_PRELOAD usage.

First, use LD_PRELOAD=same_soname.so to replace a DT_NEEDED entry with the same SONAME. Both -fno-semantic-interposition and -Bsymbolic are compatible with such usage.

Second, use LD_PRELOAD=malloc.so to intercept some functions not defined in the application or any of its shared object dependencies. Both -fno-semantic-interposition and -Bsymbolic are compatible. Common examples include malloc replacement and fakeroot, both interposing some libc.so functions.

void *f() { return malloc(0xb612); }

Third, use LD_PRELOAD=different_soname.so to replace a function defined in a shared object dependency with a different SONAME. (This usage is unlikely to be compatible with C++'s one definition rule.) Such usage is incompatible with -Bsymbolic and -fno-semantic-interposition.

Source level implication

Now let's discuss how the compiler models C/C++ in terms of the binary format semantics. An external linkage function/variable has STB_GLOBAL binding and STV_DEFAULT visibility by default, e.g. f and g in the following code. (Note: we exclude vague linkage definitions from this discussion.)

// f and g are STB_GLOBAL STV_DEFAULT.
void f() { ... }
void g() { f(); }

A -fpic compiled object file can be linked as a shared object. GCC's -fpic interpretation is: since f is preemptible when linked as a shared object, let's be pessimistic: consider the definition inexact and suppress interprocedural optimizations including inlining.

The emitted assembly looks like:

.globl f
f:
...

.globl g
g:
call f@PLT # or call f; R_X86_64_PLT32

f cannot be inlined into g. g cannot make use of f's characteristics for optimizations. (This turns out to be the biggest difference between -fpie and -fpic. A -fpie object file cannot be linked into a shared object, so a definition is known to be non-preemptible.) In CPython's case, the reported slowdown was up to 30%, attributed to suppressed interprocedural optimizations.

The pessimization does not stop here. See my next article -Bsymbolic and its friends for the cost.

This is a feature used by extremely few libraries that penalizes most other libraries. I'll say fewer than 0.1%. A portable project needs ld -interposable, DYLD_FORCE_FLAT_NAMESPACE or __attribute__((section("__DATA,__interpose"))) to work on Mach-O, but there is no counterpart on Windows. Read on.

GCC -fno-semantic-interposition

GCC 5 introduced -fno-semantic-interposition to optimize -fpic. First, GCC can apply interprocedural optimizations including inlining like -fno-pic and -fpie. Second, in the emitted assembly, a function call will go through a local alias to avoid PLT if linked with -shared.

.globl f
f: # STB_GLOBAL
...
.set f.localalias, f # STB_LOCAL

g:
call f.localalias

If f is a non-definition declaration, -fno-semantic-interposition makes no difference in behavior.
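The local alias can be approximated by hand at the source level, which may help when the option is not available. This is a sketch assuming GCC or Clang on an ELF target. The name f_local is hypothetical, and the alias here has hidden visibility rather than the STB_LOCAL binding that GCC generates, which is likewise non-preemptible.

```c
/* f is STB_GLOBAL STV_DEFAULT: a direct call to f from -fpic code may be
   preempted by another component. */
int f(void) { return 42; }

/* A second symbol for the same code. Hidden visibility makes it
   non-preemptible, so calls through it need no PLT entry. */
extern __typeof__(f) f_local __attribute__((alias("f"), visibility("hidden")));

/* g always reaches the definition above, even if f is interposed. */
int g(void) { return f_local() + 1; }
```

Other components can still call or interpose f by name. Only the internal call from g is pinned to the local definition.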

The Last Alliance of ELF and Men

You may want to read my -fno-semantic-interposition first. This section formulates an ambitious plan "the Last Alliance of ELF and Men" and also serves as a summary. For easy navigation, you can click the {}-style links to jump back to the anchors defined in previous paragraphs.

I wish that distributions default to a function-only variant of -fno-semantic-interposition and (in the long term) a STB_GLOBAL variant of -Wl,-Bsymbolic-functions, bringing back the lost performance for decades. macOS (Mach-O), Windows (PE-COFF), and Solaris (ELF) direct bindings have set up precedent so there is a good chance that most pieces of portable software are already in a good state.

However, there is still some amount of work needed to annotate software which cannot be built with -fsemantic-interposition=variable or -Wl,-Bsymbolic-global-functions. Distributions have to put in resources. In return, I estimate that many pieces of software may be 5% to 20% faster (CPython is 1.3x faster) and a few percent smaller in size.

There is a trade-off and the downside is that LD_PRELOAD replacing a function definition in a shared object will be a non-default choice. The users can build the software by themselves.

We need an option (say, -fsemantic-interposition=variable) to disable interposition for functions but enable interposition for variables, because we want to be compatible with copy relocations, which will require years to fix. GCC feature request: PR100618.

We may need a configure-time option for default -fsemantic-interposition=variable, like GCC's --enable-default-pie.

We need a -Bsymbolic-functions variant which only applies to STB_GLOBAL symbols (i.e. STB_WEAK symbols are excluded). {-Bsymbolic-global-functions}

We need a linker option to cancel default -Bsymbolic-global-functions. I have added -Bno-symbolic to GNU ld and gold (binutils 2.37; PR27834) and ld.lld 13.

(From Peter Smith) The linker can introduce a debugging option for executables to catch accidental interposition, say, --warn-interposition: "Warning symbol S of type STT_FUNC is defined in executable A and shared objects B and C, using definition in A."

GCC -fno-pic should be fixed to use GOT to take the address of an external default visibility function. PR100593. {-fno-pic_got}