ELF
Executable and Linkable Format
System V Release 4.0 was announced on October 18, 1988
Performance of shared objects
https://docs.fedoraproject.org/en-US/packaging-guidelines/#_statically_linking_executables "Executables and libraries SHOULD NOT be linked statically against libraries which come from other packages."
Linus Torvalds: "I really wish Fedora stopped doing that. Shared libraries are not a good thing in general. They add a lot of overhead in this case, but more importantly they also add lots of unnecessary dependencies and complexity, and almost no shared libraries are actually version-safe, so it adds absolutely zero upside." "but unless it's some very core library used by a lot of things"
- size (disk, network bandwidth, memory)
- development (relink)
- convenience for upgrade (security fixes, backport: one place to patch)
libllvm11, 50 packages
Myth: "shared objects are slow"
Why?
- Less internalization (conservative, all exported symbols are needed)
- Cross-library optimizations
- Technical decision made in 1990+ (interposition by default)
Formulae:
-fpic -fno-semantic-interposition ~= -fpie
-shared -Bsymbolic ~= -pie (Remaining differences for -pie: copy relocations and canonical PLT entries are allowed. General dynamic/local dynamic TLS models and TLS descriptors can be relaxed to initial exec/local exec. Undefined weak symbols are resolved differently.)
ELF dynamic linking model
Philosophy: dynamic linking is similar to static linking
(ld ... a.so b.so is similar to
ld ... a.a b.a)
- a.a <-> a.so
- No source-level annotation (trivial to split/join libraries)
First case: f is defined in the same translation unit as g. One
notable point: GCC -fpic suppresses interprocedural
optimizations, including inlining, for such non-inline external linkage
functions.
// f and g are STB_GLOBAL STV_DEFAULT.
void f() { ... }
void g() { f(); }
Second case: f is defined in a different object file which will be
linked into the same shared object.
// a.c -> a.o
void f() { ... }
// b.c -> b.o
void f();
void g() { f(); }
// a.o b.o -> a.so
Third case: f is defined in a different shared object or the
executable. The symbol search on f cannot be prevented.
// a.c -> a.o -> a.so
void f() { ... }
// b.c -> b.o
void f();
void g() { f(); }
// b.o a.so -> b.so
.a <-> .so
ld ... a.a b.a
a.so a.a: X.o Y.o
b.so b.a: X1.o Y1.o Z.o
ld ... -( a.a(X.o) a.a(Y.o) -) -( b.a(X1.o) b.a(Y1.o) b.a(Z.o) -)
X.o shadows X1.o. Y.o shadows Y1.o. In a link, ld may extract a.a(X.o) a.a(Y.o) b.a(Z.o).
If Z.o references symbols defined in X1.o or Y1.o, the references will bind to the definitions in X.o and Y.o instead.
ld ... a.so b.so
Interposition: simple, elegant, but inefficient and error-prone.
However, this is a lame interpretation: by reorganizing the archive a bit, you can easily cause a multiple definition error.
Dynamic linking model in ELF
The ELF wording on dynamic linking hasn't changed since 2000-07-17, i.e. the evolution of dynamic linking has not contributed back to the specification.
The dynamic loader does one critical job: resolving dynamic relocations and binding symbol references from one component to another.
There is a flat namespace for symbol search.
The dynamic loader computes a breadth-first search list
(executable, needed0, needed1, needed2, needed0_of_needed0, needed1_of_needed0, ...).
For each symbol reference, the dynamic loader iterates over the list
and finds the first component which provides a definition. (For
dlsym with an explicit handle, the symbol search uses the
dependency order, a breadth-first search rooted at the handle.)
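The breadth-first ordering can be sketched in C. This is a toy model, not a real loader: struct component, build_search_list and the fixed table sizes are names and limits invented here, and each component is reduced to a list of DT_NEEDED edges expressed as table indices.

```c
#include <assert.h>

#define MAX_COMPONENTS 16
#define MAX_NEEDED 4

/* Hypothetical model of a loaded component: a name plus its DT_NEEDED edges. */
struct component {
    const char *name;
    int needed[MAX_NEEDED]; /* indices into the component table */
    int num_needed;
};

/*
 * Fill `order` with component indices in breadth-first order starting at
 * `root` (the executable, or the handle passed to dlsym). A component that
 * is needed more than once keeps the position of its first visit.
 * Returns the number of entries written.
 */
int build_search_list(const struct component *table, int num_components,
                      int root, int order[MAX_COMPONENTS]) {
    int visited[MAX_COMPONENTS] = {0};
    int head = 0, tail = 0;

    assert(num_components <= MAX_COMPONENTS);
    assert(root >= 0 && root < num_components);

    order[tail++] = root;
    visited[root] = 1;
    while (head < tail) {
        const struct component *c = &table[order[head++]];
        for (int i = 0; i < c->num_needed; i++) {
            int dep = c->needed[i];
            assert(dep >= 0 && dep < num_components);
            if (!visited[dep]) {
                visited[dep] = 1;
                order[tail++] = dep;
            }
        }
    }
    return tail;
}
```

With an executable that needs a.so and b.so, both of which need c.so, the resulting order is executable, a.so, b.so, c.so: c.so is appended once, after all direct dependencies.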
The implication is that STB_GLOBAL and
STB_WEAK definitions are equivalent in terms of symbol
search. An STB_WEAK definition can preempt an
STB_GLOBAL definition.
While not mentioned in the ELF specification, many dynamic loader
implementations allow the environment variable LD_PRELOAD
to inject shared objects. The search list may look like
executable, preload0, preload1, needed0, needed1, needed2, needed0_of_preload0, ..., needed0_of_needed0, needed1_of_needed0, ...
Note that the executable is always the first element of the search
list, so a defined symbol of any binding in the executable cannot be
preempted (interposed). In a shared object, a default visibility
STB_GLOBAL or STB_WEAK symbol can be preempted
(interposed) because an earlier component may define a symbol of the
same name.
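A minimal C sketch of the flat-namespace lookup makes the preemption rule concrete. It is a simulation under simplifying assumptions: the types and the function resolve are invented for illustration, the search list is given as a ready-made array, and symbol versioning and visibility are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for STB_GLOBAL and STB_WEAK. */
enum binding { BIND_GLOBAL, BIND_WEAK };

struct definition {
    const char *name;
    enum binding binding;
};

struct component {
    const char *name;
    const struct definition *defs;
    size_t num_defs;
};

/*
 * Walk the search list in order and return the index of the first component
 * that defines `symbol`, or -1 if no component does. The binding field is
 * deliberately never consulted: weak and global definitions are equivalent
 * for the search, so position in the list is the only thing that matters.
 */
int resolve(const struct component *search_list, size_t num_components,
            const char *symbol) {
    assert(symbol != NULL);
    for (size_t i = 0; i < num_components; i++) {
        for (size_t j = 0; j < search_list[i].num_defs; j++) {
            if (strcmp(search_list[i].defs[j].name, symbol) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

If the list is (executable, preload.so, lib.so), a weak f in preload.so wins over a global f in lib.so, while anything defined in the executable resolves to index 0 and cannot be preempted.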
Alternative symbol search models
Solaris named the above the default search model and introduced an
alternative model: direct bindings. With -z defs, one can
ensure the dependencies are provided as part of the link and all symbol
references are satisfied. The linker can record the bound component for
each symbol reference.
Here is an example from Solaris's Linkers and Libraries Guide:
$ elfdump -y W.so.2
[6] [ DEPEND DIRECT ] <self> a
[7] [ DEPEND LAZY DIRECT ] [1] w.so.1 b
With the information about the component name, the dynamic loader can speed up its symbol search by just looking at one component. In particular, frequently the bound component is the component itself.
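The saving can be illustrated with a small C model that counts how many components are probed. Everything here is an assumption made for the sketch: the struct layouts, the NO_BINDING marker and the function resolve are invented, and the model simply fails when a recorded binding is wrong, whereas a real loader may behave differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NO_BINDING (-1)

struct component {
    const char *name;
    const char *const *defs; /* names of the symbols this component defines */
    size_t num_defs;
};

/* A symbol reference, optionally carrying the component recorded at link time. */
struct reference {
    const char *symbol;
    int bound; /* index of the bound component, or NO_BINDING */
};

static int defines(const struct component *c, const char *symbol) {
    for (size_t i = 0; i < c->num_defs; i++)
        if (strcmp(c->defs[i], symbol) == 0)
            return 1;
    return 0;
}

/*
 * Resolve `ref` and report in `*probes` how many components were examined.
 * With a direct binding only the recorded component is probed; without one
 * the whole search list is walked from the front.
 */
int resolve(const struct component *list, size_t n,
            const struct reference *ref, size_t *probes) {
    assert(probes != NULL);
    *probes = 0;
    if (ref->bound != NO_BINDING) {
        assert((size_t)ref->bound < n);
        *probes = 1;
        return defines(&list[ref->bound], ref->symbol) ? ref->bound : -1;
    }
    for (size_t i = 0; i < n; i++) {
        ++*probes;
        if (defines(&list[i], ref->symbol))
            return (int)i;
    }
    return -1;
}
```

For a symbol defined in the third component of the list, the default model probes three components and the direct binding probes one.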
In Mac OS X, the two-level namespace introduced in 10.1 (default
unless you use ld -flat_namespace) is a similar model.
Prelink can be conceived as a direct binding model without great ergonomics.
The standard ELF specification defines DF_SYMBOLIC which
can be conceived as a special case of direct bindings. When a shared
object is marked as DF_SYMBOLIC (set by
ld -Bsymbolic), the symbol search checks the shared object
itself before starting the linear search from the executable. It is
quite common for a shared object to call STV_DEFAULT
definitions in itself. DF_SYMBOLIC can improve the
performance greatly.
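The DF_SYMBOLIC rule is a small change to the lookup loop, sketched here in C. The model is hypothetical: struct component, its symbolic field and resolve_from are names invented for illustration, and only default-visibility definitions are represented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct component {
    const char *name;
    int symbolic;            /* nonzero models DF_SYMBOLIC */
    const char *const *defs; /* names of default-visibility definitions */
    size_t num_defs;
};

static int defines(const struct component *c, const char *symbol) {
    for (size_t i = 0; i < c->num_defs; i++)
        if (strcmp(c->defs[i], symbol) == 0)
            return 1;
    return 0;
}

/*
 * Resolve a reference made by list[referrer]. A DF_SYMBOLIC component checks
 * its own definitions first; otherwise, or if it has no definition, the
 * usual linear search from the executable (index 0) applies.
 */
int resolve_from(const struct component *list, size_t n, size_t referrer,
                 const char *symbol) {
    assert(referrer < n);
    if (list[referrer].symbolic && defines(&list[referrer], symbol))
        return (int)referrer;
    for (size_t i = 0; i < n; i++)
        if (defines(&list[i], symbol))
            return (int)i;
    return -1;
}
```

When both the executable and lib.so define f, a reference from lib.so binds to the executable in the default model and to lib.so itself once the symbolic flag is set. References from the executable are unaffected.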
-Bsymbolic
The linker option -Bsymbolic can be used together with
-shared. ld -shared -Bsymbolic is very similar
to -pie.
-Bsymbolic follows ELF DF_SYMBOLIC
semantics: all defined symbols are non-preemptible. This can optimize
relocation processing:
- function calls: a branch instruction (e.g. call foo@PLT) will not create a PLT entry. The associated R_*_JUMP_SLOT dynamic relocation will be suppressed.
- variable access and function addresses: the GOT entry will not cause an R_*_GLOB_DAT dynamic relocation. On x86-64, with R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX, the GOT indirection code sequence can be rewritten. However, the code sequence is still longer than that without GOT. On PowerPC64, there is a similar TOC optimization. On other architectures, there is no difference in code sequences.
-fno-semantic-interposition can address the pessimization
when the definition and the call site are in the same translation
unit.
Working at the shared object level, -Bsymbolic can
address the cross-translation-unit pessimization, which cannot be optimized
with -fno-semantic-interposition. Personally I think this
captures most of the benefits of direct bindings.
As a data point, when building the Linux kernel's x86_64 defconfig
with a clang that was itself built with clang -fPIC, my build is 15% faster if I
link libLLVM.so and libclang-cpp.so with -Bsymbolic-functions.
I cannot tell a performance difference from
a mostly statically linked PIE clang.
However, in practice, deployment of -Bsymbolic may run
into pointer equality problems. Many objects in C++ are not clearly part
of a single object file, but are required by the ODR to have a single
definition. For example, C++ [dcl.inline]: "An inline function or
variable with external or module linkage can be defined in multiple
translation units ([basic.def.odr]), but is one entity with one address.
A type or static variable defined in the body of such a function is
therefore a single entity."
We will discuss variables and functions separately.
Pointer equality for variables
An inline variable with external linkage and a local static variable
defined in an inline function with external linkage are required to be
unique. The address of such a variable seen by a -Bsymbolic
linked shared object may be different from the address seen from outside
the shared object. Fortunately it is uncommon to export such a vague
linkage variable to both the executable and a shared object.
(ELF specific) In addition, a regular non-inline variable with
external linkage can cause incompatibility problems due to copy
relocations. GCC/Clang -fno-pic emit direct access
relocations referencing a global variable. If the global variable turns
out to be defined in a shared object, there will be a copy relocation in
the executable. The object the shared object sees and the executable
sees will be different.
For Clang -fno-pic, the direct access relocation can be
avoided with -fno-direct-access-external-data. GCC
feature request: PR98112.
Since GCC 5, on x86-64,
-fpie can cause copy relocations as well due to
HAVE_LD_PIE_COPYRELOC. We should fix it. Pending
GCC patch.
(In C++, typeid() on an incomplete class can define a
typeinfo name object. A -Bsymbolic linked shared object may
see a different copy, but the address can hardly cause a problem.)
Pointer equality for functions
The address of an inline function seen by a -Bsymbolic
linked shared object may be different from the address seen from outside
the shared object. Fortunately such cases are rare. Windows link.exe
enables identical COMDAT folding (/OPT:ICF) by default.
ELF/Mach-O programs may use -fvisibility-inlines-hidden.
Assuming pointer equality will break Identical COMDAT Folding and
-fvisibility-inlines-hidden anyway.
On Mach-O, such symbols are placed into
__LINKEDIT,__weak_binding so that dyld can coalesce the
definitions across dylibs.
On Windows, you need to compile the DLL and the executable
differently: the defining DLL needs __declspec(dllexport)
on the inline function and the executable needs
__declspec(dllimport). This is tricky.
(ELF specific) In addition, a regular non-inline function with
external linkage can cause incompatibility problems due to canonical PLT
entries. GCC/Clang -fno-pic emit direct access relocations
when taking the address of an external function. If the function
turns out to be defined in a shared object, there will be a canonical
PLT entry in the executable. The function address the shared object sees
and the executable sees will be different.
-fno-pic should use GOT when
taking the address of an external default visibility function. GCC
feature request: PR100593.
-Bsymbolic-functions
The function incompatibility problems are uncommon. It is often benign when the function address seen by a shared object is different from outside the shared object. However, the variable case is usually severe: the executable and a shared object may act on different copies of a variable supposed to be the same entity.
In practice, we can usually use the linker option
-Bsymbolic-functions. The option applies to
STT_FUNC symbols in ld.lld and non-STT_OBJECT
symbols in GNU ld and gold, avoiding variable incompatibility problems.
Though the need is rare, it may make sense
to add a linker option (say, -Bsymbolic-global-functions)
which applies only to STT_FUNC STB_GLOBAL symbols,
so that vague linkage STB_WEAK symbols are bypassed. GNU ld feature
request: PR27871.
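The rules for which definitions in a shared object stay preemptible can be summarized as a predicate. The C sketch below only encodes the rules as described in this article: the enums and is_preemptible are invented names, the -Bsymbolic-functions case follows the ld.lld rule (STT_FUNC only), and MODE_BSYMBOLIC_GLOBAL_FUNCTIONS models the proposed option, which is not an existing flag.

```c
#include <assert.h>
#include <stddef.h>

enum sym_type { TYPE_FUNC, TYPE_OBJECT };          /* STT_FUNC, STT_OBJECT */
enum sym_binding { BINDING_GLOBAL, BINDING_WEAK }; /* STB_GLOBAL, STB_WEAK */
enum sym_visibility { VIS_DEFAULT, VIS_PROTECTED, VIS_HIDDEN };

enum link_mode {
    MODE_DEFAULT,                   /* plain -shared */
    MODE_BSYMBOLIC,                 /* -Bsymbolic */
    MODE_BSYMBOLIC_FUNCTIONS,       /* -Bsymbolic-functions, ld.lld rule */
    MODE_BSYMBOLIC_GLOBAL_FUNCTIONS /* proposed option, hypothetical */
};

struct symbol {
    enum sym_type type;
    enum sym_binding binding;
    enum sym_visibility visibility;
};

/*
 * Return 1 if a symbol defined in a shared object linked with `mode` can
 * still be preempted by an earlier component, 0 if it binds locally.
 */
int is_preemptible(const struct symbol *sym, enum link_mode mode) {
    assert(sym != NULL);
    /* Non-default visibility is never preemptible, even for weak symbols. */
    if (sym->visibility != VIS_DEFAULT)
        return 0;
    switch (mode) {
    case MODE_BSYMBOLIC:
        return 0;
    case MODE_BSYMBOLIC_FUNCTIONS:
        return sym->type != TYPE_FUNC;
    case MODE_BSYMBOLIC_GLOBAL_FUNCTIONS:
        return !(sym->type == TYPE_FUNC && sym->binding == BINDING_GLOBAL);
    case MODE_DEFAULT:
    default:
        return 1;
    }
}
```

Under this model a weak function (typical of vague linkage) binds locally with -Bsymbolic-functions but stays preemptible with the proposed global-only variant, and variables stay preemptible in both.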
Visibility
-fvisibility=protected
A non-default visibility symbol cannot be preempted, even if the
binding is STB_WEAK. -fvisibility=protected
can make all definitions protected and thus non-preemptible, nullifying
the performance benefit of -fno-semantic-interposition and
-Bsymbolic. Note: if you want a definition to be
preemptible, you will need a default visibility attribute, even if it is
weak (e.g.
__attribute__((weak,visibility("default")))).
However, -fvisibility=protected shares the same problem
with -Bsymbolic: too coarse-grained. It can cause the same
sets of problems as discussed above in Pointer equality for variables
and functions. Notably, a shared object built with
-fvisibility=protected is incompatible with
-fno-pic global variable access.
In GCC/binutils's x86 port, there were some changes in the wrong
direction. As a result, there are some STT_OBJECT issues
resulting in poor Clang interoperability (and also gold).
See Copy relocations, canonical PLT entries and protected visibility for details. There is no problem when you only use Clang and LLD.
-fvisibility=hidden
-fvisibility=hidden can make all definitions hidden and
thus non-preemptible, nullifying the performance benefit of
-fno-semantic-interposition.
-fvisibility=hidden requires annotation of exported
symbols (__attribute__((visibility("default")))). The
explicit annotation sometimes makes it inconvenient to split and join
libraries.
However, projects with Windows portability in mind will define macros
to dispatch to either the visibility attribute or
__declspec(dllexport).
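Such a dispatch macro usually looks like the following sketch. MYLIB_API, MYLIB_BUILDING and the mylib_* functions are hypothetical names chosen for this example; real projects pick their own and normally define the "building" macro from the build system rather than in the source.

```c
#include <assert.h>

/* Hypothetical: normally passed by the build system when compiling the library. */
#define MYLIB_BUILDING 1

#if defined(_WIN32)
#  if defined(MYLIB_BUILDING)
#    define MYLIB_API __declspec(dllexport) /* compiling the DLL itself */
#  else
#    define MYLIB_API __declspec(dllimport) /* compiling a user of the DLL */
#  endif
#else
#  define MYLIB_API __attribute__((visibility("default")))
#endif

/* Annotated: stays exported even when compiled with -fvisibility=hidden. */
MYLIB_API int mylib_add(int a, int b) { return a + b; }

/* Not annotated: becomes hidden under -fvisibility=hidden. */
int mylib_internal_double(int x) { return mylib_add(x, x); }

/* Small self-check showing the intended use of the two functions. */
void mylib_self_check(void) { assert(mylib_internal_double(4) == 8); }
```

Because the annotation lives on each exported declaration, moving a function between libraries means revisiting its macro, which is the inconvenience mentioned above.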
-fvisibility-inlines-hidden
The C++ specific -fvisibility-inlines-hidden is a safer
subset of -fvisibility=hidden. The option just violates
pointer equality for inline function definitions. As discussed above,
this is usually safe.
Interaction with LD_PRELOAD
There are several types of LD_PRELOAD usage.
First, use LD_PRELOAD=same_soname.so to replace a
DT_NEEDED entry with the same SONAME. Both
-fno-semantic-interposition and -Bsymbolic are
compatible with such usage.
Second, use LD_PRELOAD=malloc.so to intercept some
functions not defined in the application or any of its shared object
dependencies. Both -fno-semantic-interposition and
-Bsymbolic are compatible. Common examples include malloc
replacement and fakeroot, both interposing some libc.so functions.
void *f() { return malloc(0xb612); }
Third, use LD_PRELOAD=different_soname.so to replace a
function defined in a shared object dependency that has a different
SONAME. (This usage is unlikely to be compatible with C++'s one definition
rule.) Such usage is incompatible with -Bsymbolic and
-fno-semantic-interposition.
Source level implication
Now let's discuss how the compiler models C/C++ in terms of the
binary format semantics. An external linkage function/variable has
STB_GLOBAL binding and STV_DEFAULT visibility
by default, e.g. f and g in the following code (Note: we exclude vague
linkage definitions for our discussion.)
// f and g are STB_GLOBAL STV_DEFAULT.
void f() { ... }
void g() { f(); }
A -fpic compiled object file can be linked as a shared
object. GCC's -fpic interpretation is: since f
is preemptible when linked as a shared object, let's be pessimistic:
consider the definition inexact and suppress interprocedural
optimizations including inlining.
The emitted assembly looks like:
.globl f
f:
  ...
.globl g
g:
  call f@PLT # or call f; R_X86_64_PLT32
f cannot be inlined into g. g
cannot make use of f's characteristics for optimizations.
(This turns out to be the biggest difference between -fpie
and -fpic. A -fpie object file cannot make a
shared object, so a definition is known to be non-preemptible.) In
CPython's case, the reported slowdown was up to 30% due to suppressed
interprocedural optimizations.
The pessimization does not stop here. See my next article -Bsymbolic and its friends for the cost.
Symbol interposition is a feature used by extremely few libraries, yet it
penalizes most other libraries. I would say fewer than 0.1% use it. A portable
project that relies on it needs
ld -interposable, DYLD_FORCE_FLAT_NAMESPACE or
__attribute__((section("__DATA,__interpose"))) to work on
Mach-O, and there is no counterpart on Windows. Read on.
GCC -fno-semantic-interposition
GCC 5 introduced -fno-semantic-interposition to optimize
-fpic. First, GCC can apply interprocedural optimizations
including inlining like -fno-pic and -fpie.
Second, in the emitted assembly, a function call will go through a local
alias to avoid PLT if linked with -shared.
If f is a non-definition declaration,
-fno-semantic-interposition has no behavior difference.
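The local alias has a hand-written counterpart on ELF targets: the hidden-alias idiom. The sketch below is not what GCC emits for -fno-semantic-interposition; it is a source-level analogue, and f_hidden and check_alias are names made up for this example. It relies on the GNU alias and visibility attributes, so it assumes GCC or Clang on an ELF platform.

```c
#include <assert.h>

/* f is exported with default visibility, so the name f stays preemptible. */
int f(int x) { return 2 * x; }

/*
 * f_hidden is a second name for the same code. Hidden visibility makes it
 * non-preemptible, so a call through it can bind within the shared object
 * instead of going through the PLT.
 */
extern __typeof__(f) f_hidden __attribute__((alias("f"), visibility("hidden")));

/* g calls the hidden alias, so it always reaches this library's own f. */
int g(int x) { return f_hidden(x) + 1; }

/* Sanity check for the sketch: within this file both names run the same code. */
void check_alias(void) { assert(g(3) == f(3) + 1); }
```

The trade-off is the same as with the compiler option: if another component interposes f, callers that go through f_hidden keep using the local definition.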
The Last Alliance of ELF and Men
You may want to read my -fno-semantic-interposition article
first. This section formulates an ambitious plan, "the Last Alliance of
ELF and Men", and also serves as a summary. For easy navigation, you can
click the {}-style links to jump back to the anchors
defined in previous paragraphs.
I wish that distributions would default to a function-only variant of
-fno-semantic-interposition and (in the long term) an
STB_GLOBAL variant of
-Wl,-Bsymbolic-functions, bringing back performance that has
been lost for decades. macOS (Mach-O), Windows (PE-COFF), and Solaris
(ELF) direct bindings have set a precedent, so there is a good chance
that most pieces of portable software are already in a good state.
However, some work is still needed to annotate
software which cannot be built with
-fsemantic-interposition=variable or
-Wl,-Bsymbolic-global-functions. Distributions have to put
in resources. In return, I estimate that many pieces of software may
be 5% to 20% faster (CPython is 1.3x faster) and a few percent
smaller in size.
There is a trade-off and the downside is that LD_PRELOAD
replacing a function definition in a shared object will be a non-default
choice. The users can build the software by themselves.
We need an option (say,
-fsemantic-interposition=variable) to disable interposition
for functions but enable interposition for variables, because we want to
be compatible with copy relocations, which will require years to fix.
GCC feature request: PR100618.
We may need a configure-time option for default
-fsemantic-interposition=variable, like GCC's
--enable-default-pie.
We need a -Bsymbolic-functions variant which only
applies to STB_GLOBAL symbols (i.e. STB_WEAK
symbols are excluded). {-Bsymbolic-global-functions}
We need a linker option to cancel default
-Bsymbolic-global-functions. I have added
-Bno-symbolic to GNU ld and gold (binutils
2.37; PR27834) and ld.lld 13.
(From Peter Smith) The linker can introduce a debugging option for
executables to catch accidental interposition, say,
--warn-interposition: "Warning symbol S of type STT_FUNC is
defined in executable A and shared objects B and C, using definition in
A."
GCC -fno-pic should be fixed to use GOT to take the
address of an external default visibility function. PR100593.
{-fno-pic_got}