All about thread-local storage

Thread-local storage (TLS) provides a mechanism for allocating distinct objects for different threads. It is the usual implementation for the GCC extension __thread, C11 _Thread_local, and C++11 thread_local, which allow the declared name to refer to the entity associated with the current thread. This article describes thread-local storage on ELF platforms in detail, and touches on related topics such as thread-specific data keys and Windows/macOS TLS.

An example usage of thread-local storage is POSIX errno:

Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.

Different threads have different errno copies. errno is typically defined as a macro which invokes a function returning the address of the thread-local errno object.

For each architecture, the authoritative ELF ABI document is the processor supplement (psABI) to the System V ABI (generic ABI). These documents usually reference The ELF Handling for Thread-Local Storage by Ulrich Drepper. The document, however, mixes general specifications and glibc internals.

Representation

Assembler behavior

The compiler usually defines thread-local variables in .tdata and .tbss sections (which carry the section flag SHF_TLS). The symbols representing thread-local variables have type STT_TLS (thread-local storage entities). In GNU as syntax, you can give a symbol the type STT_TLS with .type a, @tls_object. The st_value of a TLS symbol is its offset relative to the defining section.

.section .tbss,"awT",@nobits
.globl a, b
.type a, @tls_object
.type b, @tls_object
a:
.zero 4
.size a, .-a
b:
.zero 4
.size b, .-b

In this example, st_value(a)=0 while st_value(b)=4.

In assembly produced by Clang and GCC, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees such symbols defined in SHF_TLS sections or referenced by TLS relocations, it upgrades STT_NOTYPE/STT_OBJECT to STT_TLS.

GNU as supports a directive .tls_common which defines STT_TLS SHN_COMMON symbols. This is an obscure feature. It is unclear whether GCC still has a code path which emits .tls_common directives. The LLVM integrated assembler does not support .tls_common.

Linker behavior

The linker combines .tdata input sections into a .tdata output section. .tbss input sections are combined into a .tbss output section. The two SHF_TLS output sections are placed into a PT_TLS program header.

  • p_offset: the file offset of the TLS initialization image
  • p_vaddr: the virtual address of the TLS initialization image
  • p_filesz: the size of the TLS initialization image
  • p_memsz: the total size of the thread-local storage. The last p_memsz-p_filesz bytes will be zeroed by the dynamic loader.
  • p_align: alignment

The PT_TLS program header is contained in a PT_LOAD program header. Conceptually, PT_TLS and STT_TLS symbols live in an address space of their own. The dynamic loader copies [p_vaddr,p_vaddr+p_filesz), the TLS initialization image, to the corresponding static TLS block.

In executable and shared object files, st_value normally holds a virtual address. For a STT_TLS symbol, st_value holds an offset relative to the virtual address of the PT_TLS program header. The first byte of PT_TLS is referenced by the TLS symbol with st_value==0.

GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. Its internal linker script places such sections into the output section .tdata. LLD does not support STT_TLS SHN_COMMON symbols.

Dynamic loader behavior

The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS. For each PT_TLS, the dynamic loader copies p_filesz bytes from the TLS initialization image to the TLS block and sets the trailing p_memsz-p_filesz bytes to zeroes.

For the static TLS block of the main executable, the module ID is one and the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula.

For a shared object loaded at program start, the offset from the thread pointer to its static TLS block is a fixed value at program start, albeit not a link-time constant. The offset can be referenced by a GOT dynamic relocation used by the initial-exec TLS model.

The ELF Handling for Thread-Local Storage describes two TLS variants and specifies their data structures. However, only the TP offset of the static TLS block of the main executable is a hard requirement. Nevertheless, libc implementations usually place static TLS blocks together, and allocate space for both the thread control block and the static TLS blocks.

For a new thread created by pthread_create, the static TLS blocks are usually allocated as part of the thread stack. Without a guard page between the top of the stack and the thread control block, this can be considered a vulnerability: a stack overflow can overwrite the thread control block.

Models

Local exec TLS model (executable & non-preemptible)

This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.

The compiler picks this model in -fno-pic/-fpie modes if the variable is

  • a definition
  • or a declaration with a non-default visibility.

The first condition is obvious. The second condition holds because a non-default visibility means the variable must be defined in another translation unit linked into the executable.

_Thread_local int def;
__attribute__((visibility("hidden"))) extern thread_local int ref;
int foo() { return def + ref; }
# x86-64
movl %fs:def@TPOFF, %eax

For the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. Here is a list of common relocation types:

  • arm: R_ARM_TLS_LE32
  • aarch64:
    • -mtls-size=12: R_AARCH64_TLSLE_ADD_TPREL_LO12
    • -mtls-size=24 (default): R_AARCH64_TLSLE_ADD_TPREL_HI12, R_AARCH64_TLSLE_ADD_TPREL_LO12_NC
    • -mtls-size=32: R_AARCH64_TLSLE_MOVW_TPREL_G1, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
    • -mtls-size=48: R_AARCH64_TLSLE_MOVW_TPREL_G2, R_AARCH64_TLSLE_MOVW_TPREL_G1_NC, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
  • i386: R_386_TLS_LE
  • x86-64: R_X86_64_TPOFF32
  • mips: R_MIPS_TPREL_HI16, R_MIPS_TPREL_LO16
  • ppc32: R_PPC_TPREL_HA, R_PPC_TPREL_LO
  • ppc64: R_PPC64_TPREL_HA, R_PPC64_TPREL_LO
  • riscv: R_RISCV_TPREL_HI20, R_RISCV_TPREL_LO12_I, R_RISCV_TPREL_LO12_S

For RISC architectures, because an instruction typically has 4 bytes and cannot encode a 32-bit offset, it usually takes two instructions to materialize a TP offset.

In https://reviews.llvm.org/D93331, I patched LLD to reject local-exec TLS relocations in -shared mode. In GNU ld, at least the arm, riscv and x86 ports have similar diagnostics, but aarch64 and ppc64 do not error.

Initial exec TLS model (executable & preemptible)

This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.

The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.

extern thread_local int ref;
int foo() { return ref; }
# x86-64
movq ref@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax

Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names. Here is a list of common relocation types:

  • arm: R_ARM_TLS_IE32
  • aarch64: R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21, R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
  • i386: R_386_TLS_IE
  • x86-64: R_X86_64_GOTTPOFF
  • ppc32: R_PPC_GOT_TPREL16
  • ppc64: R_PPC64_GOT_TPREL16_HA, R_PPC64_GOT_TPREL16_LO_DS
  • riscv: R_RISCV_TLS_GOT_HI20, R_RISCV_PCREL_LO12_I

If the TLS symbol does not satisfy initial-exec to local-exec relaxation requirements, the linker will allocate a GOT entry and emit a dynamic relocation. Here is a list of dynamic relocation types:

  • arm: R_ARM_TLS_TPOFF32
  • aarch64: R_AARCH64_TLS_TPREL64
  • mips32: R_MIPS_TLS_TPREL32
  • mips64: R_MIPS_TLS_TPREL64
  • i386: R_386_TPOFF
  • x86-64: R_X86_64_TPOFF64
  • ppc32: R_PPC_TPREL32
  • ppc64: R_PPC64_TPREL64
  • riscv: R_RISCV_TLS_TPREL64

While they have TPREL or TPOFF in their names, these dynamic relocations have the same bitwidth as the word size. This is a good way to distinguish them from the local-exec relocation types used in object files.

If you add the __attribute__((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.

glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.

General dynamic and local dynamic TLS models (DSO)

The two models are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.

Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.

In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:

// v is a pointer to the first element of the pair.
void *__tls_get_addr(size_t *v) {
  pthread_t self = __pthread_self();
  return (void *)(self->dtv[v[0]] + v[1]);
}

General dynamic TLS model (DSO & non-preemptible)

The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call to be 16, suitable for link-time relaxation.

data16 leaq def@tlsgd(%rip), %rdi
# GNU as does not allow duplicate data16 prefixes, so .value is used here.
.value 0x6666
rex64 call __tls_get_addr@PLT
movl (%rax), %eax

(There is an open issue that LLVM disassembler does not display data16 and rex64 prefixes.)

At the linker stage, if the TLS symbol does not satisfy relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSGD relocation, relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:

  • arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
  • aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
  • i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
  • x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
  • mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
  • mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
  • ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
  • ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
  • riscv32: R_RISCV_TLS_DTPMOD32 and R_RISCV_TLS_DTPREL32
  • riscv64: R_RISCV_TLS_DTPMOD64 and R_RISCV_TLS_DTPREL64

Local dynamic TLS model (DSO & preemptible)

The local-dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address.

leaq def@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def@dtpoff(%rax), %edx

I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry. Technically we can just use general dynamic relocation types to represent the local dynamic TLS model. For example, GCC riscv does this:

la.tls.gd a0, .LANCHOR0
call __tls_get_addr@plt

.section .tbss,"awT",@nobits
.align 2
.set .LANCHOR0, .+0
.type a, @object
.size a, 4
a:
.zero 4

This is clever. However, I would prefer dedicated local-dynamic relocation types. If we perform a relocatable link merging this object file with another (with its own local symbol .LANCHOR0), the two local symbols .LANCHOR0 remain separate and their GOT entries cannot be shared. Architectures with dedicated local-dynamic relocation types can share the GOT entries.

Note that the code sequence is not shorter than the general-dynamic TLS model. Actually, on RISC architectures the code sequence is usually longer due to the extra DTPREL addition. Local-dynamic is beneficial when a function needs to access two or more non-preemptible TLS symbols, because the __tls_get_addr call can be shared.

leaq def0@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def0@dtpoff(%rax), %edx
movl def1@dtpoff(%rax), %eax

At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word.

If the architecture does not define TLS relaxations, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.

TLS descriptors

Some architectures (arm, aarch64, i386, x86-64) have TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and to a function similar to __tls_get_addr in the dynamic TLS case. The caller makes an indirect function call instead of calling __tls_get_addr. There are two main points:

  • The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
  • In glibc (which does lazy TLS allocation), __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.

The first point is the prominent reason that TLS descriptors are generally more efficient. Arguably, the traditional general dynamic and local dynamic TLS models could have used a custom calling convention for __tls_get_addr as well.

In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.

.globl __tlsdesc_static
.hidden __tlsdesc_static
__tlsdesc_static:
# The second word stores the TP offset of the TLS symbol.
movq 8(%rax), %rax
ret

The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.

aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2.

(I implemented TLS descriptors and relaxations in LLD's x86-64 port.)

Which model does the compiler pick?

if (executable) { // -fno-pic or -fpie
  if (preemptible)
    initial-exec;
  else
    local-exec;
} else { // -fpic
  if (preemptible || local-dynamic is not profitable)
    general-dynamic;
  else
    local-dynamic;
}

The linker uses a similar criterion to check whether TLS relaxations apply.

Some psABIs define TLS relaxations. The idea is that the code sequences have fixed forms and are annotated with appropriate relocations, so the linker understands the compiler's intention and can perform code sequence modifications as optimizations. There are 4 relaxation schemes. I have annotated them with their respective conditions.

  • general-dynamic/TLSDESC to local-exec relaxation: -no-pie/-pie && non-preemptible
  • general-dynamic/TLSDESC to initial-exec relaxation: -no-pie/-pie && preemptible
  • local-dynamic to local-exec relaxation: -no-pie/-pie (the symbol must be non-preemptible, otherwise it is an error to use local-dynamic)
  • initial-exec to local-exec relaxation: -no-pie/-pie && non-preemptible

I sometimes call the relaxation schemes poor man's link-time optimization with nice ergonomics.

To make TLS relaxations available, the compiler needs to communicate sufficient information to the linker. So you may find marker relocations which don't relocate values. Here is a general-dynamic code sequence for ppc64:

addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
addi r3, r3, x@got@tlsgd@l # R_PPC64_GOT_TLSGD16_LO
bl __tls_get_addr(x@tlsgd) # R_PPC64_TLSGD followed by R_PPC64_REL24

R_PPC64_TLSGD does not relocate the location. It is there to mark the __tls_get_addr function call belonging to the code sequence.

According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS." The bl __tls_get_addr instruction was not relocated by R_PPC64_TLSGD. Blindly converting the addis/addi instructions can make the code sequence malformed. Therefore GNU ld has detected the missing R_PPC64_TLSGD/R_PPC64_TLSLD and disabled relaxation since 2009-03-03.

I was not fond of the fact that we still needed such a hack in 2020 but I implemented a scheme in LLD anyway because the request was so strong. https://reviews.llvm.org/D92959

TLS variants

In Variant II, the static TLS blocks are placed below the thread pointer. The thread pointer points to the start of the thread control block. The thread control block is a per-thread data structure describing various attributes of the thread. It is defined by the libc implementation. i386, x86-64, s390 and sparc use this variant.

TP % p_align == 0
tlsblock3 tlsblock2 tlsblock1 TP TCB
The TP offset of tlsblock1 (for the main executable) is -p_memsz - ((-p_vaddr-p_memsz)&(p_align-1)).

If you find the formula above confusing, it is ;-) In normal cases, you can forget the alignment requirement: the TP offset of tlsblock1 is just -p_memsz. glibc has a Variant II bug when p_vaddr%p_align!=0: BZ24606. I reported the problem to FreeBSD rtld, but it looks like its formula is still incorrect as of 13.0: https://reviews.freebsd.org/D24366.

In Variant I, the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block. arm, aarch64, alpha, ia64, m68k, mips, ppc, riscv use schemes similar to this variant. I say similar because some architectures (including m68k, mips, ppc32, ppc64) place the thread pointer at the end of the thread control block plus a displacement.

TP_WITHOUT_DISPLACEMENT % p_align == 0
TCB TP_WITHOUT_DISPLACEMENT tlsblock1 tlsblock2 tlsblock3
If displacement is 0, the TP offset of tlsblock1 is p_vaddr&(p_align-1).

As an example, on powerpc64, the end of the thread control block is at r13-0x7000. The space allocated for the TLS symbol with st_value==0 is at r13-0x7000+p_vaddr%p_align (p_vaddr%p_align is normally 0). The idea is that the add instruction has a range of [-0x8000, 0x8000). By having the 0x7000 displacement, we can leverage the negative part of the range.

Since p_vaddr%p_align is normally 0, the code sequence accessing st_value==0 may look like:

addis 3, 13, 0
lwz 3, -0x7000(3)

arm and aarch64 have a zero displacement but they reserve two words at TP. The TP offset of tlsblock1 is sizeof(void*)*2 + ((p_vaddr-sizeof(void*)*2)&(p_align-1)).

Async-signal-safe TLS

C11 7.14.1 Specify signal handling says:

If the signal occurs other than as the result of calling the abort or raise function, the behavior is undefined if the signal handler refers to any object with static or thread storage duration that is not a lock-free atomic object other than by assigning a value to an object declared as volatile sig_atomic_t, or the signal handler calls any function in the standard library other than the abort function, the _Exit function, the quick_exit function, or the signal function with the first argument equal to the signal number corresponding to the signal that caused the invocation of the handler. Furthermore, if such a call to the signal function results in a SIG_ERR return, the value of errno is indeterminate.

C++11 [support.signal] says:

An evaluation is signal-safe unless it includes one of the following:

an access to an object with thread storage duration;

A signal handler invocation has undefined behavior if it includes an evaluation that is not signal-safe.

Despite that, accessing TLS from signal handlers can be useful (think of CPU and memory profilers), hence the accesses need to be async-signal safe. Google reported the issue due to its usage of JVM and dlopen'ed JNI libraries (Async-signal-safe access to __thread variables from dlopen()ed libraries?). They eventually resorted to a non-upstream patch which used a custom allocator.

Let's discuss this topic in detail.

Local-exec and initial-exec TLS models trivially satisfy the requirement since the size of static TLS blocks is fixed at program start and every thread has a pre-allocated copy.

For a dlopen'ed shared object which uses general-dynamic or local-dynamic TLS model, there are two cases.

  • The dynamic loader allocates sufficient storage for all currently running threads at dlopen time, and allocates sufficient storage at pthread_create time. This is musl's choice. At dlopen time, the dynamic loader needs to block signal delivery, take a thread list lock, and install a new dynamic thread vector for each thread.
  • Lazy TLS allocation: TLS allocation is done the first time __tls_get_addr is called. This is the choice of glibc and many other libc implementations. The allocation is typically done by malloc, which is not async-signal-safe.

Lazy TLS allocation has the nice property that it does not penalize threads which never access the TLS of the new shared object. However, it is difficult to make __tls_get_addr async-signal-safe. It is impossible to both allocate lazily and have dynamic TLS access that cannot fail (TLS redux). If __tls_get_addr cannot allocate memory, the ideal behavior is to "fail safe" (e.g. abort), as opposed to the full range of undefined behaviors or deadlock.

One workaround is to let the shared object use the initial-exec TLS model. This will consume the static TLS space - a global resource.

If a dlopen implementing eager TLS allocation is developed, conceivably it may need a new symbol version because there can be programs expecting lazy TLS allocation.

Large code model

Many 64-bit architectures have a small code model. Some have defined a large code model.

A small code model usually restricts the addresses and sizes of sections to 4GiB or 2GiB, while a large code model generally makes no such assumption. The TLS size is usually small, so toolchains may impose some limitations even with a large code model.

For the local-exec TLS model, a symbol is usually referenced via an offset added to a register (the thread pointer), so it needs no special treatment under a large code model.

For the initial-exec TLS model, a GOT load is needed, and the GOT is part of the data sections; a large code model technically should provide a code sequence which is not restricted by the distance between code and data. GCC has not implemented such code sequences.

For the general-dynamic and local-dynamic TLS models, there is usually a GOT load and a __tls_get_addr call. As discussed previously, the GOT load needs to be free of 32-bit limitation. For the __tls_get_addr call, on architectures which have implemented range extension thunks, since the linker can redirect the call to a thunk which arranges for the call, no special treatment is needed.

x86-64 has not implemented thunks. Compile a program with x86-64 gcc -S -fpic -mcmodel=large and you can see that the __tls_get_addr call is indirect. This avoids the ±2GiB range limitation imposed by the direct CALL instruction.

movabsq	$_GLOBAL_OFFSET_TABLE_-.L2, %r11
pushq %rbx
leaq .L2(%rip), %rbx
addq %r11, %rbx
leaq a@tlsgd(%rip), %rdi
movabsq $__tls_get_addr@PLTOFF, %rax
addq %rbx, %rax
call *%rax
popq %rbx
movl (%rax), %eax
ret

The support for large code model TLS is fairly limited as of today. Most configurations don't lift the GOT load limitation. On aarch64, -fpic -mcmodel=large has not been implemented on GCC and Clang.

Thread-specific data keys

An alternative to ELF TLS is thread-specific data keys: pthread_key_create, pthread_setspecific, pthread_getspecific and pthread_key_delete. This scheme can be seen as a simpler implementation of __tls_get_addr with a key-reuse feature. There are C11 equivalents (tss_create, tss_set, tss_get, tss_delete) which are rarely used. Windows provides a similar API: TlsAlloc, TlsSetValue, TlsGetValue, TlsFree.

The maximum number of keys is usually limited. On glibc it is usually 1024. On musl it is 128. So applications which potentially need many data keys typically create a wrapper on top of thread-specific data keys, e.g. chromium base/threading/thread_local_storage.h.

POSIX.1-2017 does not require pthread_setspecific/pthread_getspecific to be async-signal-safe. Nevertheless, most implementations make pthread_getspecific async-signal-safe. pthread_setspecific is not necessarily async-signal-safe.

-femulated-tls

-femulated-tls uses thread-specific data keys to implement emulated TLS. The runtime implementation is quite similar to a __tls_get_addr implementation in a lazy TLS allocation scheme.

Its inefficiency comes from these aspects:

  • There is no linker relaxation.
  • Instead of getting the dynamic thread vector from the thread pointer (usually available in a register), the runtime needs to call pthread_getspecific to get the vector.
  • The dynamic loader knows nothing about emulated TLS, so the storage allocation is typically done in the access function via pthread_once.

libgcc has a mature runtime. In compiler-rt, the runtime was contributed by Android folks in 2015.

C++ thread_local

C++ thread_local adds additional features to __thread: dynamic initialization on first-use and destruction on thread exit. If a thread_local variable needs dynamic initialization or has a non-trivial destructor, the compiler calls the TLS wrapper function (_ZTW*, in a COMDAT group) instead of referencing the variable directly. The TLS wrapper calls the TLS init function (_ZTH*, weak), which is an alias for __tls_init. __tls_init calls the constructors and registers the destructors with __cxa_thread_atexit.

The __cxa_thread_atexit complexity exists because a thread_local variable defined in a dlopen'ed shared object needs to be destructed at dlclose time if dlclose happens before the thread exits. libsupc++ and libc++abi define __cxa_thread_atexit. They call __cxa_thread_atexit_impl if the libc implementation provides it, or fall back to a generic implementation based on thread-specific data keys.

In the example below, x needs a TLS wrapper function. The compiler may inline the TLS wrapper function and __tls_init.

extern thread_local int x;
int foo() { return x; }

The assembly looks like the following. It uses the undefined weak symbol _ZTH1x to check whether the TLS init function is defined; if yes, it calls the TLS init function. Then it references the variable via the usual initial-exec or general-dynamic TLS model or TLSDESC.

_Z3foov:
pushq %rax
cmpq $0, _ZTH1x@GOTPCREL(%rip)
je .LBB0_2
callq _ZTH1x@PLT
.LBB0_2:
movq x@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax
popq %rcx
retq

.weak _ZTH1x

If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old __thread. If you cannot enable C++20 mode, [[clang::require_constant_initialization]] can be used in older language modes.

extern thread_local constinit int x;

Here is an example that __tls_init needs to call __cxa_thread_atexit.

struct S { S(); ~S(); };
thread_local S s;
S &foo() { return s; }

libc API for TLS ranges

Sanitizers have a desire to find TLS ranges (https://sourceware.org/bugzilla/show_bug.cgi?id=16291).

The leak sanitizer (usually invoked by a callback registered by atexit (default: LSAN_OPTIONS=detect_leaks=1:leak_check_at_exit=1), but it can also be invoked via __lsan_do_leak_check) needs TLS ranges as GC roots.

The memory sanitizer intercepts __tls_get_addr and needs to unpoison/reset TLS blocks to avoid false positives when the storage gets reused by a new thread (https://github.com/google/sanitizers/issues/547).

The thread sanitizer intercepts __tls_get_addr and needs to unpoison/reset TLS blocks to avoid false positives.

In LLVM, OrcJIT has a desire to register TLS blocks. Lang Hames told me that he has got native TLS working by implementing dyld’s TLS support APIs in the Orc runtime. Such APIs don't exist on ELF libc implementations AFAIK.

macOS TLS

The support was added very late. The scheme is similar to ELF's TLS descriptors, but without the promise of a register-preserving calling convention. In other words, the performance is likely worse than ELF's general dynamic TLS model. To my surprise, thread-local variables of internal linkage need an indirect function call, too.

thread_local int tls;
int f() { return tls; }
movq _tls@TLVP(%rip), %rdi
callq *(%rdi)
movl (%rax), %eax

Windows TLS

The code sequence fetches ThreadLocalStoragePointer (offset 88) from the Thread Environment Block and indexes it by _tls_index. The result is then indexed with the offset of the variable from the start of the .tls section. The scheme is similar to ELF's local-dynamic TLS model, replacing a __tls_get_addr call with an array indexing operation.

1
2
3
4
movl _tls_index(%rip), %eax
movq %gs:88, %rdx
movq (%rdx,%rax,8), %rax
movl %ecx, tls@SECREL32(%rax)

Referencing a TLS variable from another DLL is not supported.

1
2
__declspec(dllimport) extern thread_local int tls;
// error C2492: 'tls': data with thread storage duration may not have dll interface

There are a lot of details but my personal understanding of Windows does not allow me to say more ;-) Interested readers can go to Thread Local Storage, part 3: Compiler and linker support for implicit TLS.

Metadata sections, COMDAT and SHF_LINK_ORDER

Metadata sections

Many compiler options instrument or annotate text sections, and need to create a metadata section for (almost) every text section. Such metadata sections have the following property:

  • All relocations from the metadata section reference the associated text section or (if present) the associated auxiliary metadata sections.

In many applications there is no auxiliary metadata section.

Without inlining (discussed in detail later), many metadata sections additionally have the following property:

  • The metadata section is only referenced by the associated text section or not referenced at all.

Below is an example:

1
2
3
4
.section .text.foo,"ax",@progbits

.section .meta.foo,"a",@progbits
.quad .text.foo-. # PC-relative relocation

Real world examples include:

  • non-SHF_ALLOC: .debug_* (DWARF debugging information), .stack_sizes (stack sizes)
  • SHF_ALLOC, not referenced via relocation by code: .eh_frame (unwind table), .gcc_except_table (language-specific data area for exception handling), __patchable_function_entries (-fpatchable-function-entry=)
  • SHF_ALLOC, referenced via relocation by code: __llvm_prf_cnts (clang -fprofile-generate/-fprofile-instr-generate), __sancov_bools (clang -fsanitize-coverage=inline-bool-flags), __sancov_cntrs (clang -fsanitize-coverage=inline-8bit-counters), __sancov_guards (clang -fsanitize-coverage=trace-pc-guard)

Non-SHF_ALLOC metadata sections need to use absolute relocation types. There is no program counter concept for a section not loaded into memory, so PC-relative relocations cannot be used.

1
2
3
.section .meta.foo,"",@progbits
.quad .text.foo # link-time constant
# Absolute relocation types have different treatment in SHF_ALLOC and non-SHF_ALLOC sections.

For SHF_ALLOC sections, PC-relative relocations are recommended. If absolute relocations (with the width equaling the word size) are used, R_*_RELATIVE dynamic relocations will be produced and the section needs to be writable.

1
2
3
4
5
6
.section .meta.foo,"a",@progbits
.quad .text.foo-. # link-time constant

# Without 'w', text relocation.
.section .meta.foo,"aw",@progbits
.quad .text.foo # R_*_RELATIVE dynamic relocation if -pie or -shared

C identifier name sections

The runtime usually needs to access all the metadata sections. Metadata section names typically consist of pure C-like identifier characters (isalnum characters in the C locale plus _) to leverage linker magic. Let's use the section name foo as an example.

  • If __start_foo is not defined, the linker defines it to the start of the output section foo.
  • If __stop_foo is not defined, the linker defines it to the end of the output section foo.

Garbage collection on metadata sections

Users want GC for metadata sections: if .text.foo is retained, meta (for .text.foo) is retained; if .text.foo is discarded, meta is discarded. There are three use cases:

  • If meta does not have the SHF_ALLOC flag, it is usually retained under --gc-sections. {nonalloc}
  • If meta has the SHF_ALLOC flag and .text.foo does not reference meta, meta will be discarded, because meta is not referenced by other sections (prerequisite). {alloc-noreloc}
  • If meta has the SHF_ALLOC flag and .text.foo references meta, traditional GC semantics work as intended. {alloc-reloc}

The first case is undesired, because the metadata section is unnecessarily retained. The second case has a more serious correctness issue.

To make the first two cases work, we can place .text.foo and meta in a section group. If .text.foo is already in a COMDAT group, we can place meta into the same group; otherwise we can create a non-COMDAT section group (LLVM>=13.0.0, comdat noduplicates support for ELF).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Zero flag section group
.section .text.foo,"aG",@progbits,foo
.globl foo
foo:

.section .meta.foo,"a?",@progbits
.quad .text.foo-.


# GRP_COMDAT section group, common with C++ inline functions and template instantiations
.section .text.foo,"aG",@progbits,foo,comdat
.globl foo
foo:

.section .meta.foo,"a?",@progbits
.quad .text.foo-.

A section group requires an extra section header (usually named .group), which requires 40 bytes on ELFCLASS32 platforms and 64 bytes on ELFCLASS64 platforms. The size overhead is concerning in many applications, so people were looking for better representations. (AArch64 and x86-64 define ILP32 ABIs and use ELFCLASS32, but technically they can use ELFCLASS32 for small code model with regular ABIs, if the kernel allows.)

Another approach is SHF_LINK_ORDER. There are separate chapters introducing section groups (COMDAT) and SHF_LINK_ORDER in this article.

Metadata sections referenced by text sections

Let's discuss the third case in detail. We have these conditions:

  • The metadata sections have the SHF_ALLOC flag.
  • The metadata sections have a C identifier name, so that the runtime can collect them via __start_/__stop_ symbols.
  • Each text section references a metadata section.

Since the runtime uses __start_/__stop_, __start_/__stop_ references are present in a live section.

Now let's introduce the unfortunate special rule about __start_/__stop_:

  • If a live section has a __start_foo or __stop_foo reference, all foo input sections will be retained by ld.bfd --gc-sections. Yes, all, even if an input section is in a different object file.
1
2
3
4
5
6
7
8
9
10
11
12
13
# a.s
.global _start
.text
_start:
leaq __start_meta(%rip), %rdi
leaq __stop_meta(%rip), %rsi

.section meta,"a"
.byte 0

# b.s
.section meta,"a"
.byte 1

a.o:(meta) and b.o:(meta) are not referenced via regular relocations. Nevertheless, they are retained by the __start_meta reference. (The __stop_meta reference can retain the sections as well.)

Now, it is natural to ask: how can we make GC for meta?

In LLD<=12, the user can set the SHF_LINK_ORDER flag, because the rule is refined:

__start_/__stop_ references from a live input section retain all non-SHF_LINK_ORDER C identifier name sections.

(Example SHF_LINK_ORDER C identifier name sections: __patchable_function_entries (-fpatchable-function-entry), __sancov_guards (clang -fsanitize-coverage=trace-pc-guard, before clang 13))

In LLD>=13, the user can also use a section group, because the rule is further refined:

__start_/__stop_ references from a live input section retain all non-SHF_LINK_ORDER non-SHF_GROUP C identifier name sections.

GNU ld does not implement the refinement yet (PR27259). (binutils discussion: https://sourceware.org/pipermail/binutils/2021-February/115463.html)

A section group has size overhead, so SHF_LINK_ORDER may be tempting. However, it ceases to be a solution when inlining happens. Let's walk through an example demonstrating the problem.

Our first design uses a plain meta for each text section. We use ,unique to keep separate sections, otherwise the assembler would combine meta into a monolithic section.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# Monolithic meta.
.globl _start
_start:
leaq __start_meta(%rip), %rdi
leaq __stop_meta(%rip), %rsi
call bar

.section .text.foo,"ax",@progbits
.globl foo
foo:
leaq .Lmeta.foo(%rip), %rax
ret

.section .text.bar,"ax",@progbits
.globl bar
bar:
call foo
leaq .Lmeta.bar(%rip), %rax
ret

.section meta,"a",@progbits,unique,0
.Lmeta.foo:
.byte 0

.section meta,"a",@progbits,unique,1
.Lmeta.bar:
.byte 1

The __start_meta/__stop_meta references retain meta sections, so we add the SHF_LINK_ORDER flag to defeat the rule. Note: we can omit ,unique because sections with different linked-to sections are not combined by the assembler.

1
2
3
4
5
6
7
.section meta,"ao",@progbits,foo
.Lmeta.foo:
.byte 0

.section meta,"ao",@progbits,bar
.Lmeta.bar:
.byte 1

This works as long as inlining is not involved.

However, in many instrumentations, the metadata references are created before inlining. With LTO, if the instrumentation is performed before LTO, inlining can naturally happen after instrumentation. If foo is inlined into bar, the meta for .text.foo may get a reference from another text section .text.bar, breaking an implicit assumption of SHF_LINK_ORDER: a SHF_LINK_ORDER section can only be referenced by its linked-to section.

1
2
3
4
5
6
7
8
9
10
11
12
13
# Both .text.foo and .text.bar reference meta.
.section .text.foo,"ax",@progbits
.globl foo
foo:
leaq .Lmeta.foo(%rip), %rax
ret

.section .text.bar,"ax",@progbits
.globl bar
bar:
leaq .Lmeta.foo(%rip), %rax
leaq .Lmeta.bar(%rip), %rax
ret

Since _start calls bar but not foo, .text.bar (the caller) will be retained while .text.foo (the callee) will be discarded. The meta for foo will link to the discarded .text.foo. This will be rejected by linkers. LLD will report: {{.*}}:(meta): sh_link points to discarded section {{.*}}:(.text.foo).

Reflection

The unfortunate GNU ld rule was to work around a glibc static linking problem PR11133. I am with Alan Modra:

I think this is a glibc bug. There isn't any good reason why a reference to a __start_section/__stop_section symbol in an output section should affect garbage collection of input sections, except of course that it works around this glibc --gc-sections problem. I can imagine other situations where a user has a reference to __start_section but wants the current linker behaviour.

Anyhow, GNU ld installed a workaround and made it apply to all C identifier name sections, not just the glibc sections.

In 2010-01, gold got the rule (https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=f1ec9ded5c740c22735843025e5d3a8ff4c4079e). In 2015, it was ported to GNU ld. LLD had dropped the behavior for a while until r294592 restored it. LLD refined the rule by excluding SHF_LINK_ORDER.

Making each meta part of a zero flag section group can address this problem, but why do we need a section group to work around a problem which should not exist? I added -z start-stop-gc to LLD so that we can drop the rule entirely (D96914 PR27451).

What if all metadata sections are discarded?

You may see this: error: undefined symbol: __start_meta (LLD) or undefined reference to `__start_xx' (GNU ld).

One approach is to use undefined weak symbols:

1
__attribute__((weak)) extern const char __start_meta[], __stop_meta[];

Another is to ensure there is at least one live metadata section, by creating an empty section in the runtime. In binutils 2.36, GNU as introduced the flag R to represent SHF_GNU_RETAIN on FreeBSD and Linux emulations. I have added support to the LLVM integrated assembler and allowed the syntax on all ELF platforms.

1
.section meta,"aR",@progbits

With GCC>=11 or Clang>=13 (https://reviews.llvm.org/D97447), you can write:

1
2
__attribute__((retain,used,section("meta")))
static const char dummy[0];

The used attribute, when attached to a function or variable definition, indicates that there may be references to the entity which are not apparent in the source code. On COFF and Mach-O targets (Windows and Apple platforms), the used attribute prevents symbols from being removed by linker section GC. On ELF targets, GNU ld/gold/LLD may remove the definition if it is not otherwise referenced.

The retain attribute was introduced in GCC 11 to set the SHF_GNU_RETAIN flag on ELF targets.

The typical solution before SHF_GNU_RETAIN is:

1
2
3
asm(".pushsection .init_array,\"aw\",@init_array\n" \
".reloc ., R_AARCH64_NONE, meta\n" \
".popsection\n")

The idea is that SHT_INIT_ARRAY sections are GC roots. An empty SHT_INIT_ARRAY section does not change the output. The artificial reference keeps meta live.

I added .reloc support for R_ARM_NONE/R_AARCH64_NONE/R_386_NONE/R_X86_64_NONE/R_PPC_NONE/R_PPC64_NONE in LLVM 9.0.0.

COMDAT

In C++, inline functions, template instantiations and a few other things can be defined in multiple object files but need deduplication at link time. In the dark ages the functionality was implemented by weak definitions: the linker does not report duplicate definition errors and resolves the references to the first definition. The downside is that unneeded copies remained in the linked image.

In Microsoft PE file format, the section flag (IMAGE_SCN_LNK_COMDAT) marks a section COMDAT and enables deduplication on a per-section basis (IMAGE_COMDAT_SELECT_NODUPLICATES can drop the deduplication requirement). The PE format interestingly does not need additional space to represent COMDAT sections. Every section has an associated symbol. This symbol has a section definition auxiliary record which has reserved Number/Selection fields.

If a text section needs a data section and deduplication is needed for both sections, you have two choices:

  • Use two COMDAT symbols. There is the drawback that deduplication happens independently for the interconnected sections.
  • Make the data section link to the text section via IMAGE_COMDAT_SELECT_ASSOCIATIVE. Whether an IMAGE_COMDAT_SELECT_ASSOCIATIVE section is retained is dependent on its referenced section.

In the GNU world, .gnu.linkonce. was invented to deduplicate groups with just one member. .gnu.linkonce. has long been obsoleted in favor of section groups, but its usage lingered on until 2020, when Adhemerval Zanella removed the last live glibc use case for .gnu.linkonce. (BZ #20543).

ELF section groups

The ELF specification generalized PE COMDAT to allow an arbitrary number of groups to be interrelated.

Some sections occur in interrelated groups. For example, an out-of-line definition of an inline function might require, in addition to the section containing its executable instructions, a read-only data section containing literals referenced, one or more debugging information sections and other informational sections. Furthermore, there may be internal references among these sections that would not make sense if one of the sections were removed or replaced by a duplicate from another object. Therefore, such groups must be included or omitted from the linked object as a unit. A section cannot be a member of more than one group.

According to "such groups must be included or omitted from the linked object as a unit", a linker's garbage collection feature must retain or discard the sections as a unit.

The most common section group flag is GRP_COMDAT, which makes the member sections similar to COMDAT in Microsoft PE file format, but can apply to multiple sections. (The committee borrowed the name "COMDAT" from PE.)

This is a COMDAT group. It may duplicate another COMDAT group in another object file, where duplication is defined as having the same group signature. In such cases, only one of the duplicate groups may be retained by the linker, and the members of the remaining groups must be discarded.

I want to highlight one thing GCC does (and Clang inherits) for backward compatibility: the definitions in a COMDAT group are kept STB_WEAK instead of STB_GLOBAL. The idea is that an old toolchain which does not recognize COMDAT groups can still operate correctly, just in a degraded manner.

The section group flag can be 0: no signature based deduplication should happen.

In a generic-abi thread, Cary Coutant initially suggested a new section flag SHF_ASSOCIATED. HP-UX and Solaris folks objected to a new generic flag. Cary Coutant then discussed with Jim Dehnert and noticed that the existing (rare) flag SHF_LINK_ORDER has semantics closer to the metadata GC semantics, so he intended to reuse the existing flag SHF_LINK_ORDER. Solaris had used its own SHF_ORDERED extension before it migrated to the ELF simplification SHF_LINK_ORDER. Solaris is still using SHF_LINK_ORDER, so the flag could not simply be repurposed. People discussed whether SHF_OS_NONCONFORMING could be repurposed but did not take that route: the platform already knows whether a flag is unknown, and knowing that a flag is non-conforming does not help produce better output. In the end the agreement was that SHF_LINK_ORDER gained the additional metadata GC semantics.

The new semantics:

This flag adds special ordering requirements for link editors. The requirements apply to the referenced section identified by the sh_link field of this section's header. If this section is combined with other sections in the output file, the section must appear in the same relative order with respect to those sections, as the referenced section appears with respect to sections the referenced section is combined with.

A typical use of this flag is to build a table that references text or data sections in address order.

In addition to adding ordering requirements, SHF_LINK_ORDER indicates that the section contains metadata describing the referenced section. When performing unused section elimination, the link editor should ensure that both the section and the referenced section are retained or discarded together. Furthermore, relocations from this section into the referenced section should not be taken as evidence that the referenced section should be retained.

Actually, ARM EHABI has been using SHF_LINK_ORDER for index table sections .ARM.exidx*. A .ARM.exidx section contains a sequence of 2-word pairs. The first word is a 31-bit PC-relative offset to the start of the region. The idea is that if the entries are ordered by the start address, the end address of an entry is implicitly the start address of the next entry and does not need to be explicitly encoded. For this reason the section uses SHF_LINK_ORDER for the ordering requirement. The GC semantics are very similar to the metadata sections'.

So the updated SHF_LINK_ORDER wording can be seen as recognition for the current practice (even though the original discussion did not actually notice ARM EHABI).

In GNU as, before version 2.35, SHF_LINK_ORDER could be produced by ARM assembly directives, but not specified by user-customized sections.

Implementation pitfalls

Mixed unordered and ordered sections

If an output section consists of only non-SHF_LINK_ORDER sections, the rule is clear: input sections are ordered in their input order. If an output section consists of only SHF_LINK_ORDER sections, the rule is also clear: input sections are ordered with respect to their linked-to sections.

What is unclear is how to handle an output section with mixed unordered and ordered sections.

GNU ld had a diagnostic for this case. LLD rejected the case as well: error: incompatible section flags for .rodata.

When I implemented -fpatchable-function-entry= for Clang, I observed some GC related issues with the GCC implementation. I reported them and carefully chose SHF_LINK_ORDER in the Clang implementation if the integrated assembler is used.

This was a problem if the user wanted to place such input sections along with unordered sections, e.g. .init.data : { ... KEEP(*(__patchable_function_entries)) ... } (https://github.com/ClangBuiltLinux/linux/issues/953).

As a response, I submitted D77007 to allow ordered input section descriptions within an output section.

This worked well for the Linux kernel. Mixed unordered and ordered sections within an input section description was still a problem. This made it infeasible to add SHF_LINK_ORDER to an existing metadata section and expect new object files linkable with old object files which do not have the flag. I asked how to resolve this upgrade issue and Ali Bahrami responded:

The Solaris linker puts sections without SHF_LINK_ORDER at the end of the output section, in first-in-first-out order, and I don't believe that's considered to be an error.

So I went ahead and implemented a similar rule for LLD: D84001 allows arbitrary mix and places SHF_LINK_ORDER sections before non-SHF_LINK_ORDER sections.

If the linked-to section is discarded due to compiler optimizations

We decided that the integrated assembler allows SHF_LINK_ORDER with sh_link=0 and LLD can handle such sections as regular unordered sections (https://reviews.llvm.org/D72904).

If the linked-to section is discarded due to --gc-sections

You will see error: ... sh_link points to discarded section ....

A SHF_LINK_ORDER section has an assumption: it can only be referenced by its linked-to section. Inlining and the discussed __start_ rule can break this assumption.

Others

  • During --icf={safe,all}, SHF_LINK_ORDER sections are not eligible (conservative but working).
  • In relocatable output, SHF_LINK_ORDER sections cannot be combined by name.
  • When comparing two input sections with different linked-to output sections, use vaddr of output sections instead of section indexes. Peter Smith fixed this in https://reviews.llvm.org/D79286.

Miscellaneous

Arm Compiler 5 splits up DWARF Version 3 debug information and puts these sections into comdat groups. On "monolithic input section handling", Peter Smith commented that:

We found that splitting up the debug into fragments works well as it permits the linker to ensure that all the references to local symbols are to sections within the same group, this makes it easy for the linker to remove all the debug when the group isn't selected.

This approach did produce significantly more debug information than gcc did. For small microcontroller projects this wasn't a problem. For larger feature phone projects we had to put a lot of work into keeping the linker's memory usage down as many of our customers at the time were using 32-bit Windows machines with a default maximum virtual memory of 2Gb.

COMDAT sections have size overhead on extra section headers. Developers may be tempted to decrease the overhead with SHF_LINK_ORDER. However, the approach does not work due to the ordering requirement. Consider the following fragments:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
header [a.o common]
- DW_TAG_compile_unit [a.o common]
-- DW_TAG_variable [a.o .data.foo]
-- DW_TAG_namespace [common]
--- DW_TAG_subprogram [a.o .text.bar]
--- DW_TAG_variable [a.o .data.baz]
footer [a.o common]
header [b.o common]
- DW_TAG_compile_unit [b.o common]
-- DW_TAG_variable [b.o .data.foo]
-- DW_TAG_namespace [common]
--- DW_TAG_subprogram [b.o .text.bar]
--- DW_TAG_variable [b.o .data.baz]
footer [b.o common]

DW_TAG_* tags associated with concrete sections can be represented with SHF_LINK_ORDER sections. After linking the sections will be ordered before the common parts.

Everything I know about GNU toolchain

As mainly an LLVM person, I occasionally contribute to GNU toolchain projects. This is sometimes for fun, sometimes for investigating why a (usually ancient) feature works in a particular way, sometimes for pushing forward a toolchain feature with both communities in mind, and sometimes just for getting a sense of how things work with mailing lists+GNU make.


Copy relocations, canonical PLT entries and protected visibility

Background:

  • -fno-pic can only be used by executables. On most platforms and architectures, direct access relocations are used to reference external data symbols.
  • -fpic can be used by both executables and shared objects. Windows has __declspec(dllimport), but most other binary formats allow a default-visibility external data symbol to be resolved to a shared object, so direct access relocations are generally disallowed.
  • -fpie was introduced as a mode similar to -fpic for ELF: the compiler can make the assumption that the produced object file can only be used by executables, thus all definitions are non-preemptible and thus interprocedural optimizations can apply on them.

For

1
2
extern int a;
int *foo() { return &a; }

-fno-pic typically produces an absolute relocation (a PC-relative relocation can be used as well). On ELF x86-64 it is usually R_X86_64_32 in the position-dependent small code model. If a is defined in the executable (by another translation unit), everything works fine. If a turns out to be defined in a shared object, its real address will be unknown at link time. One of the following actions needs to be taken:

  • Emit a dynamic relocation in every use site. Text sections are usually non-writable. A dynamic relocation applied on a non-writable section is called a text relocation.
  • Emit a single copy relocation. Copy relocations only work for executables. The linker obtains the size of the symbol, allocates that many bytes in .bss (this may make an otherwise read-only object writable; LLD may pick a read-only area), and emits an R_*_COPY relocation. All references resolve to the new location.

Multiple text relocations are even less acceptable, so on ELF a copy relocation is generally used. Here is a nice description from Rich Felker: "Copy relocations are not a case of overriding the definition in the abstract machine, but an implementation detail used to support data objects in shared libraries when the main program is non-PIC."

Copy relocations have drawbacks:

  • Break page sharing.
  • Make the symbol properties (e.g. size) part of ABI.
  • If the shared object is linked with -Bsymbolic or --dynamic-list and defines a data symbol copy relocated by the executable, the address of the symbol may be different in the shared object and in the executable.

What went poorly was that -fno-pic code had no way to avoid copy relocations on ELF. Traditionally copy relocations could only occur in -fno-pic code. A GCC 5 change made them possible in -fpie code for x86-64. Please read on.

x86-64: copy relocations and -fpie

-fpic's GOT indirection for external data symbols has costs. Making -fpie similar to -fpic in this regard incurs costs if the data symbol turns out to be defined in the executable. Having the data symbol defined in another translation unit linked into the executable is very common, especially if the vendor uses a fully/mostly static linking mode.

In GCC 5, "x86-64: Optimize access to globals in PIE with copy reloc" started to use direct access relocations for external data symbols on x86-64 in -fpie mode.

1
2
extern int a;
int foo() { return a; }
  • GCC<5: movq a@GOTPCREL(%rip), %rax; movl (%rax), %eax (8 bytes)
  • GCC>=5: movl a(%rip), %eax (6 bytes)

This change is actually useful for architectures other than x86-64, but it was never implemented for other architectures. What went wrong: the change was implemented as an inflexible configure-time choice (HAVE_LD_PIE_COPYRELOC), defaulting to this behavior if ld supports PIE copy relocations (most binutils installations). Keep in mind that such a -fpie default breaks -Bsymbolic and --dynamic-list in shared objects.

Clang addressed the inflexible configure-time choice via an opt-in option -mpie-copy-relocations (D19996).

I noticed that:

  • The option can be used for -fno-pic code as well to prevent copy relocations on ELF. This is occasionally what users want (if their shared objects use -Bsymbolic and export data symbols (usually undesired from an API perspective, but it can avoid costs at times)), and they would switch from -fno-pic to -fpic just for this purpose.
  • The option name should describe the code generation behavior, instead of the inferred behavior at the linking stage on a particular binary format.
  • The option does not need to tie to ELF.
    • On COFF, the behavior is like always -fdirect-access-external-data. __declspec(dllimport) is needed to enable indirect access.
    • On Mach-O, the behavior is like -fdirect-access-external-data for -fno-pic (only available on arm) and the opposite for -fpic.
  • H.J. Lu introduced R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX as GOT optimizations to the x86-64 psABI. This is great! With the optimization, GOT indirection can be optimized away, so the incurred cost is very low now.

So I proposed an alternative option -f[no-]direct-access-external-data: https://reviews.llvm.org/D92633 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98112. My wish on the GCC side is to drop HAVE_LD_PIE_COPYRELOC and (x86-64) default to GOT indirection for external data symbols in -fpie mode.

Please keep in mind that -f[no-]semantic-interposition is for definitions while -f[no-]direct-access-external-data is for undefined data symbols. GCC 5 introduced -fno-semantic-interposition to use local aliases for references to definitions in the same translation unit.

STV_PROTECTED

Now let's consider how STV_PROTECTED comes into play. Here is the generic ABI definition:

A symbol defined in the current component is protected if it is visible in other components but not preemptable, meaning that any reference to such a symbol from within the defining component must be resolved to the definition in that component, even if there is a definition in another component that would preempt by the default rules. A symbol with STB_LOCAL binding may not have STV_PROTECTED visibility. If a symbol definition with STV_PROTECTED visibility from a shared object is taken as resolving a reference from an executable or another shared object, the SHN_UNDEF symbol table entry created has STV_DEFAULT visibility.

A non-local STV_DEFAULT defined symbol is by default preemptible in a shared object on ELF. STV_PROTECTED can make the symbol non-preemptible. You may have noticed that I use "preemptible" while the generic ABI uses "preemptable" and LLVM IR uses "dso_preemptable". Both forms work. "preemptible" is my choice because it is more common.

Protected data symbols and copy relocations

Many folks consider that copy relocations are best-effort support provided by the toolchain. STV_PROTECTED is intended as an optimization and the optimization can error out if it can't be done for whatever reason. Since copy relocations are already oftentimes unacceptable, it is natural to think that we should just disallow copy relocations on protected data symbols.

However, GNU ld 2.26 made a change which enabled copy relocations on protected data symbols for i386 and x86-64.

A glibc change "Add ELF_RTYPE_CLASS_EXTERN_PROTECTED_DATA to x86" is needed to make copy relocations on protected data symbols work. "[AArch64][BZ #17711] Fix extern protected data handling" and "[ARM][BZ #17711] Fix extern protected data handling" ported the thing to arm and aarch64.

Despite the glibc support, GNU ld's aarch64 port errors: relocation R_AARCH64_ADR_PREL_PG_HI21 against symbol `foo' which may bind externally can not be used when making a shared object; recompile with -fPIC.

powerpc64 ELFv2 is interesting: TOC indirection (TOC is a variant of GOT) is used everywhere, data symbols normally have no direct access relocations, so this is not a problem.

1
2
3
4
5
// b.c
__attribute__((visibility("protected"))) int foo;
// a.c
extern int foo;
int main() { return foo; }
1
2
gcc -fuse-ld=bfd -fpic -shared b.c -o b.so
gcc -fuse-ld=bfd -pie -fno-pic a.c ./b.so

gold does not allow copy relocations on protected data symbols, but it misses some cases: https://sourceware.org/bugzilla/show_bug.cgi?id=19823.

Protected data symbols and direct accesses

If a protected data symbol in a shared object is copy relocated, allowing direct accesses will cause the shared object to operate on a different copy from the executable. Therefore, direct accesses to protected data symbols have to be disallowed in -fpic code, just in case the symbols may be copy relocated. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65248 changed GCC 5 to use GOT indirection for protected external data.

__attribute__((visibility("protected"))) int foo;
int val() { return foo; }
// -fPIC: GOT on at least aarch64, arm, i386, x86-64

This caused unneeded pessimization for protected external data. Clang always treats protected similarly to hidden/internal.

For older GCC (and all versions of Clang), direct accesses are produced in -fpic code. Mixing such object files can silently break copy relocations on protected data symbols. Therefore, GNU ld made the change https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=ca3fe95e469b9daec153caa2c90665f5daaec2b5 to error in -shared mode.

% cat a.s
leaq foo(%rip), %rax

.data
.global foo
.protected foo
foo:
% gcc -fuse-ld=bfd -shared a.s
/usr/bin/ld.bfd: /tmp/ccchu3Xo.o: relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object
/usr/bin/ld.bfd: final link failed: bad value
collect2: error: ld returned 1 exit status

This led to a heated discussion https://sourceware.org/legacy-ml/binutils/2016-03/msg00312.html. Swift folks noticed this https://bugs.swift.org/browse/SR-1023 and their reaction was to switch from GNU ld to gold.

GNU ld's aarch64 port does not have the diagnostic.

binutils commit "x86: Clear extern_protected_data for GNU_PROPERTY_NO_COPY_ON_PROTECTED" introduced GNU_PROPERTY_NO_COPY_ON_PROTECTED. With this property, ld -shared will not error with "relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object".

The two issues above are the costs of enabling copy relocations on protected data symbols. Personally I don't think copy relocations on protected data symbols are actually leveraged. GNU ld's x86 port could just (1) reject such copy relocations and (2) allow direct accesses referencing protected data symbols in -shared mode. But I am not really clear about the glibc case. I wish GNU_PROPERTY_NO_COPY_ON_PROTECTED could become the default or be phased out in the future.

Protected function symbols and canonical PLT entries

// b.c
__attribute__((visibility("protected"))) void *foo () {
return (void *)foo;
}

GNU ld's aarch64 and x86 ports reject the above code. On many other architectures, including powerpc, the code is supported.

% gcc -fpic -shared -fuse-ld=bfd b.c -o b.so
/usr/bin/ld.bfd: /tmp/cc3Ay0Gh.o: relocation R_X86_64_PC32 against protected symbol `foo' can not be used when making a shared object
/usr/bin/ld.bfd: final link failed: bad value
collect2: error: ld returned 1 exit status
% gcc -shared -fuse-ld=bfd -fpic b.c -o b.so
/usr/bin/ld.bfd: /tmp/ccXdBqMf.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against symbol `foo' which may bind externally can not be used when making a shared object; recompile with -fPIC
/tmp/ccXdBqMf.o: in function `foo':
a.c:(.text+0x0): dangerous relocation: unsupported relocation
collect2: error: ld returned 1 exit status

The rejection is mainly historical: it exists to make pointer equality work with -fno-pic code. The GNU ld idea is that:

  • The compiler emits GOT-generating relocations for -fpic code (in reality it does it for declarations but not for definitions).
  • -fno-pic main executable uses direct access relocation types and gets a canonical PLT entry.
  • glibc ld.so resolves the GOT in the shared object to the canonical PLT entry.

Actually we can take the interpretation that a canonical PLT entry is incompatible with a shared STV_PROTECTED definition, and reject the attempt to create a canonical PLT entry (as gold/LLD do). And we can keep producing direct access relocations referencing protected symbols for -fpic code. In this respect, STV_PROTECTED is no different from STV_HIDDEN.

On many architectures, a branch instruction uses a branch specific relocation type (e.g. R_AARCH64_CALL26, R_PPC64_REL24, R_RISCV_CALL_PLT). This is great because the address is insignificant and the linker can arrange for a regular PLT if the symbol turns out to be external.

On i386, a branch in -fno-pic code emits an R_386_PC32 relocation, which is indistinguishable from an address-taken operation. If the symbol turns out to be external, the linker has to employ a trick called the "canonical PLT entry" (st_shndx=0, st_value!=0). The term is parlance among a few LLD developers, but not broadly adopted.

// a.c
extern void foo(void);
int main() { foo(); }
% gcc -m32 -shared -fuse-ld=bfd -fpic b.c -o b.so
% gcc -m32 -fno-pic -no-pie -fuse-ld=lld a.c ./b.so

% gcc -m32 -fno-pic a.c ./b.so -fuse-ld=lld
ld.lld: error: cannot preempt symbol: foo
>>> defined in ./b.so
>>> referenced by a.c
>>> /tmp/ccDGhzEy.o:(main)
collect2: error: ld returned 1 exit status

% gcc -m32 -fno-pic -no-pie a.c ./b.so -fuse-ld=bfd
# canonical PLT entry; foo has different addresses in a.out and b.so.
% gcc -m32 -fno-pic -pie a.c ./b.so -fuse-ld=bfd
/usr/bin/ld.bfd: /tmp/ccZ3Rl8Y.o: warning: relocation against `foo' in read-only section `.text'
/usr/bin/ld.bfd: warning: creating DT_TEXTREL in a PIE
% gcc -m32 -fno-pic -pie a.c ./b.so -fuse-ld=bfd -z text
/usr/bin/ld.bfd: /tmp/ccUv8wXc.o: warning: relocation against `foo' in read-only section `.text'
/usr/bin/ld.bfd: read-only segment has dynamic relocations
collect2: error: ld returned 1 exit status

This used to be a problem for x86-64 as well, until "x86-64: Generate branch with PLT32 relocation" changed call/jmp foo to emit R_X86_64_PLT32 instead of R_X86_64_PC32. Note: (-fpie/-fpic) call/jmp foo@PLT always emits R_X86_64_PLT32.

The relocation type name is a bit misleading: _PLT32 does not mean that a PLT will always be created. Rather, the PLT is optional: the linker can resolve _PLT32 to any place where the function will be called. If the symbol is preemptible, the place is usually the PLT entry. If the symbol is non-preemptible, the linker can convert _PLT32 into _PC32. A function symbol can be either branched to or have its address taken. For an address-taken operation, the function symbol is used in a manner similar to a data symbol, and R_386_PLT32 cannot be used; LLD and gold will just reject the link if text relocations are disabled.

On i386, my proposal is that branches to a default visibility function declaration should use R_386_PLT32 instead of R_386_PC32, in a manner similar to x86-64. Originally I thought an assembler change sufficed: https://sourceware.org/bugzilla/show_bug.cgi?id=27169. Please read the next section for why this should be changed on the compiler side.

Non-default visibility ifunc and R_386_PC32

For a call to a hidden function declaration, the compiler produces an R_386_PC32 relocation. The relocation is an indicator that EBX may not be set up.

If the declaration refers to an ifunc definition, the linker will resolve the R_386_PC32 to an IPLT entry. For -pie and -shared links, the IPLT entry references EBX. If the call site does not set up EBX to be _GLOBAL_OFFSET_TABLE_, the IPLT call will be incorrect.

GNU ld has implemented a diagnostic ("i686 ifunc and non-default symbol visibility") to catch the problem. If we change call/jmp foo to always use R_386_PLT32, such a diagnostic will be lost.

Can we change the compiler to emit call/jmp foo@PLT for default visibility function declarations? If the compiler emits such a modifier but does not set up EBX, the ifunc can still be non-preemptible (e.g. hidden in another translation unit or -Bsymbolic) and we will still have a dilemma.

Personally, I think avoiding a canonical PLT entry is more useful than an ld ifunc diagnostic. The i386 ABI is legacy, though, and the x86 maintainer will not make the change.

Summary

I hope the above gives an overview to interested readers. Symbol interposition is subtle. One has to think about all the factors related to symbol interposition, and the relevant toolchain fixes are like a whack-a-mole game. I appreciate all the prior discussions and I believe many unsatisfactory things can be fixed in a quite backward-compatible way.

Some features are inherently incompatible, and we make the trade-off in favor of more important features. Here are two things that should not work. However, if -fpie or -fno-direct-access-external-data is specified, both limitations are circumvented.

  • Copy relocations on protected data symbols.
  • Canonical PLT entries on protected function symbols. With the R_386_PLT32 change, this issue will only affect function pointers.

People sometimes simply say: "protected visibility does not work." I'd argue that Clang+gold/LLD works quite well.

The things on the GCC+GNU ld side are inconsistent, though. Here is a list of changes I wish would happen:

  • GCC: add -f[no-]direct-access-external-data.
  • GCC: drop HAVE_LD_PIE_COPYRELOC in favor of -f[no-]direct-access-external-data.
  • GCC x86-64: default to GOT indirection for external data symbols in -fpie mode.
  • GCC or GNU as i386: emit R_386_PLT32 for branches to undefined function symbols.
  • GNU ld x86: disallow copy relocations on protected data symbols. (I think canonical PLT entries on protected symbols have been disallowed.)
  • GCC aarch64/arm/x86/...: allow direct access relocations on protected symbols in -fpic mode.
  • GNU ld aarch64/x86: allow direct access relocations on protected data symbols in -shared mode.

The breaking changes for GCC+GNU ld:

  • The "copy relocations on protected data symbols" scheme has been supported in the past few years with GNU ld on x86, but it did not work before circa 2015, and should not work in the future. Fortunately the breaking surface may be narrow: this scheme does not work with gold or LLD. Many architectures don't work.
  • ld is not the only consumer of R_386_PLT32. The Linux kernel has code resolving relocations and it needs to be fixed (patch uploaded: https://github.com/ClangBuiltLinux/linux/issues/1210).

I'll conclude this article with random notes on other binary formats:

Windows/COFF __declspec(dllimport) gives us a different perspective on how external references can be designed. The annotation is verbose but differentiates the two cases: (1) the symbol has to be defined in the same linkage unit; (2) the symbol can be defined in another linkage unit. If we lift the "the symbol visibility is decided by the most constrained visibility" requirement for protected->default, a COFF undefined/defined symbol is quite like a protected undefined/defined symbol in ELF. __declspec(dllimport) gives the undefined symbol default visibility (i.e. the LLVM IR dllimport is redundant). __declspec(dllexport) is something which cannot be modeled with the existing ELF visibilities.

For an undefined variable, Mach-O uses __attribute__((visibility("hidden"))) to say "a definition must be available in another translation unit in the same linkage unit" but does not actually mark the undefined symbol in any way. COFF uses __declspec(dllimport) to convey this. In ELF, __attribute__((visibility("hidden"))) additionally makes the undefined symbol unexportable. The Mach-O notion actually resembles COFF: the symbol can be exported by the definition in another translation unit. Judging from its behavior, I think it would be more appropriately mapped to LLVM IR protected instead of hidden.

Appendix

For a STB_GLOBAL/STB_WEAK symbol,

STV_DEFAULT: both compiler & linker need to assume such symbols can be preempted in -fpic mode. The compiler emits GOT indirection by default. GCC -fno-semantic-interposition uses local aliases on defined non-weak function symbols for x86 (unimplemented in other architectures). Clang -fno-semantic-interposition uses local aliases on defined non-weak symbols (both function and data) for x86.

STV_PROTECTED: GCC -fpic uses GOT indirection for data symbols, regardless of defined or undefined. This pessimization is to make a misfeature "copy relocation on protected data symbol" work (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#protected-data-symbols-and-direct-accesses). Clang code generation treats STV_PROTECTED the same way as STV_HIDDEN.

STV_HIDDEN: non-preemptible, regardless of defined or undefined. The compiler suppresses GOT indirection, unless undefined STB_WEAK.

For defined symbols, -fno-pic/-fpie can avoid GOT indirection for STV_DEFAULT (and GCC STV_PROTECTED). -fvisibility=hidden can change visibility.

For undefined symbols, -fpie/-fpic use GOT indirection by default. Clang's -fno-direct-access-external-data (discussed in my article) can avoid GOT indirection. If you compile with -fpic -fno-direct-access-external-data and link with ld -shared, you'll need additional linker options to let the linker know that defined non-STB_LOCAL STV_DEFAULT symbols are non-preemptible.

LLD and GNU linker incompatibilities

Subtitle: Is LLD a drop-in replacement for GNU ld?

The motivation for this article was someone challenging the "drop-in replacement" claim on LLD's website (the discussion was about a Linux-like ELF toolchain):

LLD is a linker from the LLVM project that is a drop-in replacement for system linkers and runs much faster than them. It also provides features that are useful for toolchain developers.

99.9% of software works with LLD without a change. Some linker script applications may need an adaptation (such adaptation is oftentimes due to brittle assumptions: asking too much of GNU ld's behavior, which should be fixed anyway). So I defended the claim.
