All about thread-local storage

Thread-local storage (TLS) provides a mechanism for allocating distinct objects for different threads. It is the usual implementation for the GCC extension __thread, C11 _Thread_local, and C++11 thread_local, which allow the use of the declared name to refer to the entity associated with the current thread. This article describes thread-local storage on ELF platforms in detail, and touches on related topics such as thread-specific data keys and Windows/macOS TLS.

An example usage of thread-local storage is POSIX errno:

Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.

Different threads have different errno copies. errno is typically defined as a function which returns a thread-local variable.
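
The "function which returns a thread-local variable" pattern can be sketched as follows. This is only an illustration: my_errno_storage, my_errno_location, my_errno and my_fail are hypothetical names, not the symbols a real libc uses (glibc, for instance, exposes __errno_location).

```c
/* Hypothetical stand-in for the libc's per-thread errno storage:
   every thread gets its own copy of this variable. */
static _Thread_local int my_errno_storage;

/* Return the address of the calling thread's copy. */
int *my_errno_location(void) { return &my_errno_storage; }

/* The macro makes my_errno usable as an lvalue, like errno. */
#define my_errno (*my_errno_location())

/* Example use: record a failure code the way a libc function might. */
int my_fail(int code) {
  my_errno = code;
  return -1;
}
```

Hiding the variable behind a function keeps the storage an implementation detail of the library, so user code does not need TLS relocations against it.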

For each architecture, the authoritative ELF ABI document is the processor supplement (psABI) to the System V ABI (generic ABI). These documents usually reference The ELF Handling for Thread-Local Storage by Ulrich Drepper. The document, however, mixes general specifications and glibc internals.

Representation

Assembler behavior

The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The symbols representing thread-local variables have type STT_TLS (representing thread-local storage entities). In GNU as syntax, you can give a symbol a the type STT_TLS with .type a, @tls_object. The st_value of a TLS symbol is the offset relative to the start of the defining section.

.section .tbss,"awT",@nobits
.globl a, b
.type a, @tls_object
.type b, @tls_object
a:
.zero 4
.size a, .-a
b:
.zero 4
.size b, .-b

In this example, st_value(a)=0 while st_value(b)=4.

In Clang and GCC produced assembly, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS relocations, STT_NOTYPE/STT_OBJECT will be upgraded to STT_TLS.

GNU as supports a directive .tls_common which defines STT_TLS SHN_COMMON symbols. This is an obscure feature. It is not clear whether GCC still has a code path which emits .tls_common directives. The LLVM integrated assembler does not support .tls_common.

Linker behavior

The linker combines .tdata input sections into a .tdata output section. .tbss input sections are combined into a .tbss output section. The two SHF_TLS output sections are placed into a PT_TLS program header.

  • p_offset: the file offset of the TLS initialization image
  • p_vaddr: the virtual address of the TLS initialization image
  • p_filesz: the size of the TLS initialization image
  • p_memsz: the total size of the thread-local storage. The last p_memsz-p_filesz bytes will be zeroed by the dynamic loader.
  • p_align: alignment

The PT_TLS program header is contained in a PT_LOAD program header. If PT_GNU_RELRO is used, PT_TLS is contained in a PT_GNU_RELRO and the PT_GNU_RELRO is contained in a PT_LOAD. Conceptually, PT_TLS and STT_TLS symbols live in a separate address space. The dynamic loader should copy the [p_vaddr,p_vaddr+p_filesz) range of the TLS initialization image to the corresponding static TLS block.

In executable and shared object files, st_value normally holds a virtual address. For a STT_TLS symbol, st_value holds an offset relative to the virtual address of the PT_TLS program header. The first byte of PT_TLS is referenced by the TLS symbol with st_value==0.

GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. Its internal linker script places such sections into the output section .tdata. LLD does not support STT_TLS SHN_COMMON symbols.

Dynamic loader behavior

The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS. For each PT_TLS, the dynamic loader copies p_filesz bytes from the TLS initialization image to the TLS block and sets the trailing p_memsz-p_filesz bytes to zeroes.
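
The per-block initialization described above can be sketched like this. The struct tls_segment type and init_tls_block are hypothetical names for illustration; a real dynamic loader also handles alignment and block placement, which this sketch leaves out.

```c
#include <stddef.h>
#include <string.h>

/* The PT_TLS fields that matter for initialization. */
struct tls_segment {
  const void *image; /* TLS initialization image (p_vaddr after loading) */
  size_t filesz;     /* p_filesz: bytes of initialized data (.tdata) */
  size_t memsz;      /* p_memsz: total size, including .tbss */
};

/* Initialize one thread's TLS block for a module.
   block must have room for seg->memsz bytes. */
void init_tls_block(void *block, const struct tls_segment *seg) {
  /* Copy the initialized part from the image. */
  memcpy(block, seg->image, seg->filesz);
  /* Zero the trailing p_memsz-p_filesz bytes. */
  memset((char *)block + seg->filesz, 0, seg->memsz - seg->filesz);
}
```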

For the static TLS block of the main executable, the module ID is one and the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula.

For a shared object loaded at program start, the offset from the thread pointer to its static TLS block is a fixed value at program start, albeit not a link-time constant. The offset can be referenced by a GOT dynamic relocation used by the initial-exec TLS model.

The ELF Handling for Thread-Local Storage describes two TLS variants and specifies their data structures. However, only the TP offset of the static TLS block of the main executable is a hard requirement. Nevertheless, libc implementations usually place static TLS blocks together, and allocate a space for both the thread control block and the static TLS blocks.

For a new thread created by pthread_create, the static TLS blocks are usually allocated as part of the thread stack. Without a guard page between the largest address of the stack and the thread control block, this could be considered as vulnerable as stack overflow can overwrite the thread control block.

Models

Local exec TLS model (executable & non-preemptible)

This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.

The compiler picks this model in -fno-pic/-fpie modes if the variable is

  • a definition
  • or a declaration with a non-default visibility.

The first condition is obvious. The second condition holds because a non-default visibility means the variable must be defined by another translation unit in the executable.

_Thread_local int def;
__attribute__((visibility("hidden"))) extern thread_local int ref;
int foo() { return def + ref; }

# x86-64
movl %fs:def@TPOFF, %eax

For the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. Here is a list of common relocation types:

  • arm: R_ARM_TLS_LE32
  • aarch64:
    • -mtls-size=12: R_AARCH64_TLSLE_ADD_TPREL_LO12
    • -mtls-size=24 (default): R_AARCH64_TLSLE_ADD_TPREL_HI12, R_AARCH64_TLSLE_ADD_TPREL_LO12_NC
    • -mtls-size=32: R_AARCH64_TLSLE_MOVW_TPREL_G1, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
    • -mtls-size=48: R_AARCH64_TLSLE_MOVW_TPREL_G2, R_AARCH64_TLSLE_MOVW_TPREL_G1_NC, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
  • i386: R_386_TLS_LE
  • x86-64: R_X86_64_TPOFF32
  • mips: R_MIPS_TPREL_HI16, R_MIPS_TPREL_LO16
  • ppc32: R_PPC_TPREL_HA, R_PPC_TPREL_LO
  • ppc64: R_PPC64_TPREL_HA, R_PPC64_TPREL_LO
  • riscv: R_RISCV_TPREL_HI20, R_RISCV_TPREL_LO12_I, R_RISCV_TPREL_LO12_S

For RISC architectures, because an instruction typically has 4 bytes and cannot encode a 32-bit offset, it usually takes two instructions to materialize a TP offset.

In https://reviews.llvm.org/D93331, I patched LLD to reject local-exec TLS relocations in -shared mode. In GNU ld, at least the arm, riscv and x86 ports have similar diagnostics, but aarch64 and ppc64 do not error.

Initial exec TLS model (executable & preemptible)

This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.

The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.

extern thread_local int ref;
int foo() { return ref; }

# x86-64
movq ref@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax

Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names. Here is a list of common relocation types:

  • arm: R_ARM_TLS_IE32
  • aarch64: R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21, R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
  • i386: R_386_TLS_IE
  • x86-64: R_X86_64_GOTTPOFF
  • ppc32: R_PPC_GOT_TPREL16
  • ppc64: R_PPC64_GOT_TPREL16_HA, R_PPC64_GOT_TPREL16_LO_DS
  • riscv: R_RISCV_TLS_GOT_HI20, R_RISCV_PCREL_LO12_I

If the TLS symbol does not satisfy initial-exec to local-exec optimization requirements, the linker will allocate a GOT entry and emit a dynamic relocation. Here is a list of dynamic relocation types:

  • arm: R_ARM_TLS_TPOFF32
  • aarch64: R_AARCH64_TLS_TPREL64
  • mips32: R_MIPS_TLS_TPREL32
  • mips64: R_MIPS_TLS_TPREL64
  • i386: R_386_TPOFF
  • x86-64: R_X86_64_TPOFF64
  • ppc32: R_PPC_TPREL32
  • ppc64: R_PPC64_TPREL64
  • riscv: R_RISCV_TLS_TPREL64

While they have TPREL or TPOFF in their names, these dynamic relocations have the same bitwidth as the word size. This is a good way to distinguish them from the local-exec relocation types used in object files.

If you add the __attribute__((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.

glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.

General dynamic and local dynamic TLS models (DSO)

The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.

Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.

In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:

// v is a pointer to the first element of the pair.
void *__tls_get_addr(size_t *v) {
  pthread_t self = __pthread_self();
  return (void *)(self->dtv[v[0]] + v[1]);
}
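
The snippet above relies on libc internals (__pthread_self and the dtv member). A self-contained toy model of the same lookup, with the thread passed explicitly, might look like this. The names struct toy_thread and toy_tls_get_addr are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* The (module ID, offset) pair stored in two consecutive GOT words. */
struct tls_index {
  size_t module;
  size_t offset;
};

/* Hypothetical per-thread state: dtv[m] points to module m's TLS block. */
struct toy_thread {
  char **dtv;
};

/* Toy counterpart of __tls_get_addr: index the dynamic thread vector
   by module ID, then add the offset of the symbol. */
void *toy_tls_get_addr(const struct toy_thread *self,
                       const struct tls_index *ti) {
  return self->dtv[ti->module] + ti->offset;
}
```

Because each thread has its own dtv, the same tls_index pair yields a different address in each thread.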

General dynamic TLS model (DSO & preemptible)

The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call 16 bytes, so that the linker has room to rewrite the sequence in place when it performs TLS optimization.

data16 leaq def@tlsgd(%rip), %rdi  # R_X86_64_TLSGD
# GNU as does not allow duplicate data16 prefixes, so .value is used here.
.value 0x6666
rex64 call __tls_get_addr@PLT
movl (%rax), %eax

(There is an open issue that LLVM disassembler does not display data16 and rex64 prefixes.)

Here is a list of common relocation types. They are called "initial relocations" in The ELF Handling for Thread-Local Storage.

  • arm: R_ARM_TLS_GD32
  • aarch64: R_AARCH64_TLSGD_ADR_PREL21, R_AARCH64_TLSGD_ADR_PAGE21, R_AARCH64_TLSGD_ADD_LO12_NC, R_AARCH64_TLSGD_MOVW_G1, R_AARCH64_TLSGD_MOVW_G0_NC (rarely used because TLS descriptors are the default)
  • i386: R_386_TLS_GD
  • x86-64: R_X86_64_TLSGD
  • mips: R_MIPS_TLS_GD, R_MICROMIPS_TLS_GD
  • ppc32: R_PPC_GOT_TLSGD16
  • ppc64: R_PPC64_GOT_TLSGD16_HA, R_PPC64_GOT_TLSGD16_LO
  • riscv: R_RISCV_TLS_GD_HI20

When the linker scans such a relocation, it checks whether the referenced TLS symbol satisfies optimization requirements. If not, the linker allocates two consecutive words in the .got section if not allocated yet. The two entries are relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:

  • arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
  • aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
  • i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
  • x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
  • mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
  • mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
  • ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
  • ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
  • riscv32: R_RISCV_TLS_DTPMOD32 and R_RISCV_TLS_DTPREL32
  • riscv64: R_RISCV_TLS_DTPMOD64 and R_RISCV_TLS_DTPREL64

They are called "outstanding relocations" in The ELF Handling for Thread-Local Storage.

Local dynamic TLS model (DSO & non-preemptible)

The local-dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address.

leaq def@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def@dtpoff(%rax), %edx

I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry. Technically we can just use general dynamic relocation types to represent the local dynamic TLS model. For example, GCC riscv does this:

la.tls.gd a0, .LANCHOR0
call __tls_get_addr@plt

.section .tbss,"awT",@nobits
.align 2
.set .LANCHOR0, .+0
.type a, @object
.size a, 4
a:
.zero 4

This is clever. However, I would prefer dedicated local-dynamic relocation types. If we perform a relocatable link merging this object file with another (with its own local symbol .LANCHOR0), the local symbols .LANCHOR0 are separate and their GOT entries cannot be shared. Architectures with dedicated local-dynamic relocation types can share the GOT entries.

Note that the code sequence is not shorter than the general-dynamic TLS model. Actually on RISC architectures the code sequence is usually longer due to the addition of DTPREL. Local-dynamic is beneficial if a function needs to access two or more non-preemptible TLS symbols, because the __tls_get_addr can be shared.

leaq def0@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def0@dtpoff(%rax), %edx
movl def1@dtpoff(%rax), %eax

Here is a list of common relocation types.

  • arm: R_ARM_TLS_LDM32
  • i386: R_386_TLS_LDM
  • x86-64: R_X86_64_TLSLD
  • mips: R_MIPS_TLS_LDM, R_MICROMIPS_TLS_LDM
  • ppc32: R_PPC_GOT_TLSLD16
  • ppc64: R_PPC64_GOT_TLSLD16_HA, R_PPC64_GOT_TLSLD16_LO, R_PPC64_GOT_TLSLD_PCREL34

At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec optimization requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word. The second word is zero, because the offset from dtv[m] to each symbol is supplied by the DTPOFF/DTPREL relocation in the code sequence.

If the architecture does not define TLS optimization, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.

TLS descriptors

Some architectures (arm, aarch64, i386, x86-64) have TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirect function call instead of calling __tls_get_addr. There are two main points:

  • The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
  • In glibc (which does lazy TLS allocation), __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.

The first point is the prominent reason that TLS descriptors are generally more efficient. Arguably traditional general dynamic and local dynamic TLS models could have a mechanism to use custom calling convention for __tls_get_addr as well.

In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.

.globl __tlsdesc_static
.hidden __tlsdesc_static
__tlsdesc_static:
# The second word stores the TP offset of the TLS symbol.
movq 8(%rax), %rax
ret

The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.
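
The two cases can be modeled in C as follows. This is a toy sketch, not any libc's implementation: real resolvers are written in assembly with a custom calling convention and read the thread pointer from a register, whereas here the thread is passed as an explicit argument and all toy_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread state for the model. */
struct toy_thread {
  char *tp;   /* thread pointer */
  char **dtv; /* dtv[m] points to module m's TLS block */
};

struct tls_index {
  size_t module;
  size_t offset;
};

/* The two GOT words of a TLS descriptor. */
struct tlsdesc {
  ptrdiff_t (*entry)(const struct tlsdesc *, const struct toy_thread *);
  uintptr_t arg;
};

/* Static TLS: the second word already holds the TP offset. */
ptrdiff_t toy_tlsdesc_static(const struct tlsdesc *d,
                             const struct toy_thread *self) {
  (void)self;
  return (ptrdiff_t)d->arg;
}

/* Dynamic TLS: the second word points to a (module ID, offset) pair,
   so the resolver needs an extra indirection and a dtv lookup. */
ptrdiff_t toy_tlsdesc_dynamic(const struct tlsdesc *d,
                              const struct toy_thread *self) {
  const struct tls_index *ti = (const struct tls_index *)d->arg;
  uintptr_t addr = (uintptr_t)(self->dtv[ti->module] + ti->offset);
  return (ptrdiff_t)(addr - (uintptr_t)self->tp);
}

/* What the caller does: indirect call, then add the thread pointer. */
void *toy_tlsdesc_access(const struct tlsdesc *d,
                         const struct toy_thread *self) {
  return (void *)((uintptr_t)self->tp + (uintptr_t)d->entry(d, self));
}
```

In both cases the resolver returns an offset from the thread pointer, which is why the caller's code sequence is identical for static and dynamic TLS.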

aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2.

(I implemented TLS descriptors and optimization in LLD's x86-64 port.)

Which model does the compiler pick?

if (executable) { // -fno-pic or -fpie
  if (preemptible)
    initial-exec;
  else
    local-exec;
} else { // -fpic
  if (preemptible || local-dynamic is not profitable)
    general-dynamic;
  else
    local-dynamic;
}

The linker uses a similar criterion to check whether TLS optimization applies.

Some psABIs define TLS optimization. The idea is that the code sequences have fixed forms and are annotated with appropriate relocations, so the linker understands the compiler's intention and can modify the code sequences as an optimization. There are 4 optimization schemes. I have annotated each with its condition.

  • general-dynamic/TLSDESC to local-exec optimization: -no-pie/-pie && non-preemptible
  • general-dynamic/TLSDESC to initial-exec optimization: -no-pie/-pie && preemptible
  • local-dynamic to local-exec optimization: -no-pie/-pie (the symbol must be non-preemptible, otherwise it is an error to use local-dynamic)
  • initial-exec to local-exec optimization: -no-pie/-pie && non-preemptible
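
The four conditions above can be condensed into one decision function. This is a sketch of the rules as listed, not code from any linker. The enum and the name optimize_tls_model are hypothetical, and the function assumes the psABI in question defines all four optimizations.

```c
#include <stdbool.h>

enum tls_model {
  LOCAL_EXEC,
  INITIAL_EXEC,
  LOCAL_DYNAMIC,
  GENERAL_DYNAMIC_OR_TLSDESC
};

/* Given the model the compiler used, return the model the code
   sequence ends up using after the linker's TLS optimization. */
enum tls_model optimize_tls_model(enum tls_model from, bool executable_link,
                                  bool preemptible) {
  if (!executable_link) /* -shared: none of the four schemes applies */
    return from;
  switch (from) {
  case GENERAL_DYNAMIC_OR_TLSDESC:
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_DYNAMIC:
    /* The symbol must be non-preemptible; otherwise using
       local-dynamic was already an error. */
    return LOCAL_EXEC;
  case INITIAL_EXEC:
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_EXEC:
    break; /* already the most efficient model */
  }
  return from;
}
```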

I sometimes call the optimization schemes poor man's link-time optimization with nice ergonomics.

To make TLS optimization available, the compiler needs to communicate sufficient information to the linker. So you may find marker relocations which don't relocate values. Here is a general-dynamic code sequence for ppc64:

addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA
addi r3, r3, x@got@tlsgd@l # R_PPC64_GOT_TLSGD16_LO
bl __tls_get_addr(x@tlsgd) # R_PPC64_TLSGD followed by R_PPC64_REL24

R_PPC64_TLSGD does not relocate the location. It is there to indicate that it is the __tls_get_addr function call in the code sequence.

According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS." The bl __tls_get_addr instruction was not relocated by R_PPC64_TLSGD. Blindly converting the addis/addi instructions can make the code sequence malformed. Therefore GNU ld detected the missing R_PPC64_TLSGD/R_PPC64_TLSLD and disabled optimization on 2009-03-03.

I was not fond of the fact that we still needed such a hack in 2020 but I implemented a scheme in LLD anyway because the request was so strong. https://reviews.llvm.org/D92959

TLS variants

In Variant II, the static TLS blocks are placed below the thread pointer. The thread pointer points to the start of the thread control block. The thread control block is a per-thread data structure describing various attributes of the thread. It is defined by the libc implementation. i386, x86-64, s390 and sparc use this variant.

TP % p_align == 0
tlsblock3 tlsblock2 tlsblock1 TP TCB
The TP offset of tlsblock1 (for the main executable) is -p_memsz - ((-p_vaddr-p_memsz)&(p_align-1)).

If you find the formula above confusing, it is ;-) In normal cases, you can forget the alignment requirement and the TP offset of tlsblock1 is just -p_memsz. glibc has a Variant II bug when p_vaddr%p_align!=0: BZ24606. I reported the problem to FreeBSD rtld, but it looks like as of 13.0 its formula is still incorrect: https://reviews.freebsd.org/D24366.
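
To see what the Variant II formula does, here is a small sketch that evaluates it. The function name variant2_tp_offset is hypothetical, and p_align is assumed to be a power of two.

```c
#include <stdint.h>

/* TP offset of the main executable's TLS block under Variant II:
   -p_memsz - ((-p_vaddr-p_memsz)&(p_align-1)).
   p_align must be a power of two. */
int64_t variant2_tp_offset(uint64_t p_vaddr, uint64_t p_memsz,
                           uint64_t p_align) {
  /* Extra bytes needed so that the block start is congruent to
     p_vaddr modulo p_align (unsigned arithmetic wraps as intended). */
  uint64_t padding = (0 - p_vaddr - p_memsz) & (p_align - 1);
  return -(int64_t)(p_memsz + padding);
}
```

For example, with p_vaddr=0x1000, p_memsz=5 and p_align=8 the padding is 3 and the offset is -8, so the block starts at an 8-byte aligned address. With p_vaddr=0x1004, p_memsz=4 and p_align=8 the offset is -4, which keeps the block start congruent to p_vaddr modulo p_align.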

In Variant I, the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block. arm, aarch64, alpha, ia64, m68k, mips, ppc, riscv use schemes similar to this variant. I say similar because some architectures (including m68k, mips, powerpc32, powerpc64) place the thread pointer at the end of the thread control block plus a displacement.

TP_WITHOUT_DISPLACEMENT % p_align == 0
TCB TP_WITHOUT_DISPLACEMENT tlsblock1 tlsblock2 tlsblock3
If displacement is 0, the TP offset of tlsblock1 is p_vaddr&(p_align-1).

As an example, on powerpc64, the end of the thread control block is at r13-0x7000. The space allocated for the TLS symbol with st_value==0 is at r13-0x7000+p_vaddr%p_align (p_vaddr%p_align is normally 0). The idea is that the add instruction has a range of [-0x8000, 0x8000). By having the 0x7000 displacement, we can leverage the negative part of the range.

Since p_vaddr%p_align is normally 0, the code sequence accessing st_value==0 may look like:

addis 3, 13, 0
lwz 3, -0x7000(3)
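
The arithmetic behind this sequence can be sketched as follows. The helper names are hypothetical. ppc_ha and ppc_lo model the usual @ha/@l split of a value into a high-adjusted part for addis and a signed 16-bit part for the load displacement.

```c
#include <stdint.h>

#define PPC64_TP_DISPLACEMENT 0x7000

/* TP offset of a TLS symbol in the main executable on powerpc64:
   st_value + p_vaddr%p_align - 0x7000. */
int64_t ppc64_tp_offset(uint64_t st_value, uint64_t p_vaddr,
                        uint64_t p_align) {
  return (int64_t)(st_value + p_vaddr % p_align) - PPC64_TP_DISPLACEMENT;
}

/* Low 16 bits, sign-extended: the displacement of the load/store. */
int64_t ppc_lo(int64_t off) { return ((off & 0xffff) ^ 0x8000) - 0x8000; }

/* High-adjusted part for addis, chosen so that ha*65536 + lo == off. */
int64_t ppc_ha(int64_t off) { return (off - ppc_lo(off)) / 65536; }
```

For the symbol with st_value==0 and p_vaddr%p_align==0, the offset is -0x7000, which splits into ha=0 and lo=-0x7000, matching the addis/lwz operands above.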

arm and aarch64 have a zero displacement but they reserve two words at TP. The TP offset of tlsblock1 is sizeof(void*)*2 + ((p_vaddr-sizeof(void*)*2)&(p_align-1)).
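
Both Variant I formulas (the plain one and the arm/aarch64 one) are the same computation with a different number of bytes reserved at TP. A sketch, with the hypothetical name variant1_tp_offset, assuming p_align is a power of two and ignoring any TP displacement:

```c
#include <stdint.h>

/* TP offset of the main executable's TLS block under Variant I.
   tcb_reserved is the number of bytes reserved at TP: 0 for the plain
   formula, 2*sizeof(void*) for arm and aarch64.
   p_align must be a power of two. */
uint64_t variant1_tp_offset(uint64_t p_vaddr, uint64_t p_align,
                            uint64_t tcb_reserved) {
  return tcb_reserved + ((p_vaddr - tcb_reserved) & (p_align - 1));
}
```

On aarch64 (16 reserved bytes), p_vaddr=0x1000 with p_align=16 gives an offset of 16, while p_align=64 pushes the block to offset 64 so that it stays 64-byte aligned.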

Async-signal-safe TLS

C11 7.14.1 Specify signal handling says:

If the signal occurs other than as the result of calling the abort or raise function, the behavior is undefined if the signal handler refers to any object with static or thread storage duration that is not a lock-free atomic object other than by assigning a value to an object declared as volatile sig_atomic_t, or the signal handler calls any function in the standard library other than the abort function, the _Exit function, the quick_exit function, or the signal function with the first argument equal to the signal number corresponding to the signal that caused the invocation of the handler. Furthermore, if such a call to the signal function results in a SIG_ERR return, the value of errno is indeterminate.

C++11 [support.signal] says:

An evaluation is signal-safe unless it includes one of the following:

an access to an object with thread storage duration;

A signal handler invocation has undefined behavior if it includes an evaluation that is not signal-safe.

Despite that, accessing TLS from signal handlers can be useful (think of CPU and memory profilers), hence the accesses need to be async-signal safe. Google reported the issue due to its usage of JVM and dlopen'ed JNI libraries (Async-signal-safe access to __thread variables from dlopen()ed libraries?). They eventually resorted to a non-upstream patch which used a custom allocator.

Let's discuss this topic in detail.

Local-exec and initial-exec TLS models trivially satisfy the requirement since the size of static TLS blocks is fixed at program start and every thread has a pre-allocated copy.

For a dlopen'ed shared object which uses general-dynamic or local-dynamic TLS model, there are two cases.

  • The dynamic loader allocates sufficient storage for all currently running threads at dlopen time, and allocates sufficient storage at pthread_create time. This is musl's choice. At dlopen time, the dynamic loader needs to block signal delivery, take a thread list lock and install a new dynamic thread vector for each thread.
  • Lazy TLS allocation. TLS allocation is done the first time __tls_get_addr is called. This is the choice of glibc and many other libc implementations. The allocation is typically done by malloc, which is not async-signal-safe.

Lazy TLS allocation has the nice property that it does not penalize the threads which do not need to access TLS of the new shared object. However, it is difficult to make __tls_get_addr async-signal-safe. It is impossible to both allocate lazily and have dynamic TLS access that cannot fail (TLS redux). If __tls_get_addr cannot allocate memory, the ideal behavior is "fail safe" (e.g. abort), as opposed to the full range of undefined behaviors or deadlock.
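
A toy model of lazy allocation shows where the problem comes from. The names below are hypothetical, and the sketch returns NULL on allocation failure purely for testability; it is not how glibc is written.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Per-module TLS description (memsz is assumed to be nonzero). */
struct toy_module {
  const void *image; /* TLS initialization image */
  size_t filesz;     /* initialized bytes */
  size_t memsz;      /* total bytes */
};

/* Hypothetical per-thread state. */
struct toy_thread {
  void **dtv; /* dtv[m] stays NULL until module m is first accessed */
};

/* Lazy variant of the lookup: the block is allocated on first access. */
void *toy_lazy_tls_get_addr(struct toy_thread *self,
                            const struct toy_module *modules, size_t module,
                            size_t offset) {
  if (!self->dtv[module]) {
    const struct toy_module *m = &modules[module];
    /* malloc is not async-signal-safe: if a signal handler reaches this
       point while the interrupted code holds the allocator lock, the
       thread can deadlock. */
    char *block = malloc(m->memsz);
    if (!block)
      return NULL;
    memcpy(block, m->image, m->filesz);
    memset(block + m->filesz, 0, m->memsz - m->filesz);
    self->dtv[module] = block;
  }
  return (char *)self->dtv[module] + offset;
}
```

Only threads that actually touch the module pay for its storage, which is the benefit described above; the cost is that the first access can allocate and can fail.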

One workaround is to let the shared object use the initial-exec TLS model. This will consume the static TLS space - a global resource.

If a dlopen implementing eager TLS allocation is developed, conceivably it may need a new symbol version because there can be programs expecting lazy TLS allocation.

Large code model

Many 64-bit architectures have a small code model. Some have defined a large code model.

A small code model usually restricts the addresses and sizes of sections to 4GiB or 2GiB, while a large code model generally makes no such assumption. The TLS size is usually small, and code models may impose some limitation on it even with a large code model.

For the local-exec TLS model, because a symbol is usually referenced via an offset adding to a register (thread pointer), it needs no distinction with a large code model.

For the initial-exec TLS model, because loading a GOT entry is needed, and the GOT is part of the data sections, a large code model technically should implement a code sequence which is not restricted by the distance between code and data. GCC has not implemented such code sequences.

For the general-dynamic and local-dynamic TLS models, there is usually a GOT load and a __tls_get_addr call. As discussed previously, the GOT load needs to be free of 32-bit limitation. For the __tls_get_addr call, on architectures which have implemented range extension thunks, since the linker can redirect the call to a thunk which arranges for the call, no special treatment is needed.

x86-64 has not implemented thunks. Compile a program with x86-64 gcc -S -fpic -mcmodel=large and you can see that the __tls_get_addr call is indirect. This avoids the +-2GiB range limitation imposed by the direct CALL instruction.

movabsq	$_GLOBAL_OFFSET_TABLE_-.L2, %r11
pushq %rbx
leaq .L2(%rip), %rbx
addq %r11, %rbx
leaq a@tlsgd(%rip), %rdi
movabsq $__tls_get_addr@PLTOFF, %rax
addq %rbx, %rax
call *%rax
popq %rbx
movl (%rax), %eax
ret

The support for large code model TLS is fairly limited as of today. Most configurations don't lift the GOT load limitation. On aarch64, -fpic -mcmodel=large has not been implemented on GCC and Clang.

Thread-specific data keys

An alternative to ELF TLS is thread-specific data keys: pthread_key_create, pthread_setspecific, pthread_getspecific and pthread_key_delete. This scheme can be seen as a simpler implementation of __tls_get_addr with a key reuse feature. There are C11 equivalents (tss_create, tss_set, tss_get, tss_delete) which are rarely used. Windows provides a similar API: TlsAlloc, TlsSetValue, TlsGetValue, TlsFree.

The maximum number of keys is usually limited. On glibc it is usually 1024. On musl it is 128. So applications which potentially need many data keys typically create a wrapper on top of thread-specific data keys, e.g. chromium base/threading/thread_local_storage.h.

POSIX.1-2017 does not require pthread_setspecific/pthread_getspecific to be async-signal-safe. Nevertheless, most implementations make pthread_getspecific async-signal-safe. pthread_setspecific is not necessarily async-signal-safe.
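
For readers who have not used the API, here is a minimal sketch of a per-thread counter built on thread-specific data keys. The names counter_key and thread_counter are hypothetical; the pthread functions are the POSIX ones named above.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;
static int counter_key_status;

/* Create the key once; free() runs as the destructor on thread exit. */
static void make_counter_key(void) {
  counter_key_status = pthread_key_create(&counter_key, free);
}

/* Return the calling thread's counter, allocating it on first use.
   Returns NULL if the key or the storage cannot be set up. */
int *thread_counter(void) {
  pthread_once(&counter_once, make_counter_key);
  if (counter_key_status != 0)
    return NULL;
  int *p = pthread_getspecific(counter_key);
  if (!p) {
    p = calloc(1, sizeof *p);
    if (p && pthread_setspecific(counter_key, p) != 0) {
      free(p);
      p = NULL;
    }
  }
  return p;
}
```

Note the similarity to lazy TLS allocation: storage is created on first access, in the accessor, rather than by the dynamic loader.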

-femulated-tls

-femulated-tls uses thread-specific data keys to implement emulated TLS. The runtime implementation is quite similar to a __tls_get_addr implementation in a lazy TLS allocation scheme.

Its inefficiency comes from these aspects:

  • There is no linker optimization.
  • Instead of getting the dynamic thread vector from the thread pointer (usually available in a register), the runtime needs to call pthread_getspecific to get the vector.
  • The dynamic loader does not know emulated TLS, so the storage allocation is typically done in the access function via pthread_once.

libgcc has a mature runtime. In compiler-rt, the runtime was contributed by Android folks in 2015.

C++ thread_local

C++ thread_local adds additional features to __thread: dynamic initialization on first-use and destruction on thread exit. If a thread_local variable needs dynamic initialization or has a non-trivial destructor, the compiler calls the TLS wrapper function (_ZTW*, in a COMDAT group) instead of referencing the variable directly. The TLS wrapper calls the TLS init function (_ZTH*, weak), which is an alias for __tls_init. __tls_init calls the constructors and registers the destructors with __cxa_thread_atexit.

The __cxa_thread_atexit complexity exists because a thread_local variable defined in a dlopen'ed shared object needs to be destructed at dlclose time, before thread exit. libsupc++ and libc++abi define __cxa_thread_atexit. They call __cxa_thread_atexit_impl if the libc implementation provides it or use a generic implementation based on thread-specific data keys.

As an example, x needs a TLS wrapper function. The compiler may inline the TLS wrapper function and __tls_init.

extern thread_local int x;
int foo() { return x; }

The assembly looks like the following. It uses undefined weak _ZTH1x to check whether the TLS init function is defined. If yes, call the TLS init function. Then reference the variable via usual initial-exec or general dynamic TLS model or TLSDESC.

_Z3foov:
pushq %rax
cmpq $0, _ZTH1x@GOTPCREL(%rip)
je .LBB0_2
callq _ZTH1x@PLT
.LBB0_2:
movq x@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax
popq %rcx
retq

.weak _ZTH1x

If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old __thread. If you cannot enable C++20 mode, [[clang::require_constant_initialization]] can be used in older language modes.

extern thread_local constinit int x;

Here is an example that __tls_init needs to call __cxa_thread_atexit.

struct S { S(); ~S(); };
thread_local S s;
S &foo() { return s; }

macOS TLS

The support was added very late. The scheme is similar to ELF's TLS descriptors, without the good custom calling convention promise. In other words, the performance is likely worse than ELF's general dynamic TLS model. To my surprise, thread-local variables of internal linkage need an indirect function call, too.

thread_local int tls;
int f() { return tls; }

On x86-64, this compiles to:

movq _tls@TLVP(%rip), %rdi
callq *(%rdi)
movl (%rax), %eax

Windows TLS

The code sequence fetches ThreadLocalStoragePointer (offset 88) out of the Thread Environment Block and indexes it by _tls_index. The result is then offset by the variable's offset from the start of the .tls section. The scheme is similar to ELF's local-dynamic TLS model, replacing a __tls_get_addr call with an array index operation.

movl _tls_index(%rip), %eax
movq %gs:88, %rdx
movq (%rdx,%rax,8), %rax
movl %ecx, tls@SECREL32(%rax)
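The same lookup can be written as a toy model in portable C++. ModelTeb and TlsAddress are hypothetical names and the struct is not the real Thread Environment Block layout; the sketch only mirrors the two steps (index the per-thread pointer array by the module's _tls_index, then add the section-relative offset).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not the real Windows data structure.
struct ModelTeb {
  // Stands in for ThreadLocalStoragePointer (TEB offset 88): one pointer per
  // module, indexed by that module's _tls_index.
  void **thread_local_storage_pointer;
};

// Address of an int located `offset` bytes from the start of the module's
// .tls block, as seen by the thread owning `teb`.
int *TlsAddress(const ModelTeb &teb, uint32_t tls_index, uint32_t offset) {
  char *block = static_cast<char *>(teb.thread_local_storage_pointer[tls_index]);
  return reinterpret_cast<int *>(block + offset);
}
```

Two threads have different pointer arrays, so the same (tls_index, offset) pair yields different addresses, and no function call is needed on the access path.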

Referencing a TLS variable from another DLL is not supported.

__declspec(dllimport) extern thread_local int tls;
// error C2492: 'tls': data with thread storage duration may not have dll interface

There are a lot of details, but my personal understanding of Windows does not allow me to say more ;-) Interested readers can go to Thread Local Storage, part 3: Compiler and linker support for implicit TLS.

libc API for TLS blocks

Sanitizers' runtime needs TLS blocks for a variety of use cases. See https://sourceware.org/bugzilla/show_bug.cgi?id=16291 for a glibc feature request. Read on for a detailed description.

In LLVM, OrcJIT has a desire to register TLS blocks. Lang Hames told me that he has got native TLS working by implementing dyld’s TLS support APIs in the Orc runtime.

Florian Weimer posted Thread properties API in 2021-05.

Why does compiler-rt need to know TLS blocks?

AddressSanitizer "asan" (-fsanitize=address)

The main task of AddressSanitizer is to detect addressability problems. If a regular memory byte is not addressable (i.e. accesses should be UB), it is said to be poisoned and the associated shadow encodes the addressability information (all unpoisoned/all poisoned/partly poisoned).
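As a rough illustration of the three shadow states, here is a toy model. It follows asan's documented encoding (one shadow byte per 8 application bytes; 0 means all addressable, 1 to 7 means only the first k bytes are addressable, a negative value means all poisoned), but ToyShadow and its methods are hypothetical and the real runtime maps addresses to shadow with a shift and an offset rather than an array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kGranule = 8;  // application bytes described by one shadow byte

struct ToyShadow {
  int8_t shadow[16] = {};  // covers application offsets [0, 128), all unpoisoned

  void Set(size_t granule, int8_t value) { shadow[granule] = value; }

  // True if an access of `size` bytes at `addr` touches a poisoned byte.
  // Assumes size <= 8 and that the access does not cross a granule boundary.
  bool IsPoisoned(size_t addr, size_t size) const {
    int8_t k = shadow[addr / kGranule];
    if (k == 0) return false;  // all 8 bytes addressable
    if (k < 0) return true;    // whole granule poisoned
    return (addr % kGranule) + size > static_cast<size_t>(k);  // partly poisoned
  }
};
```

Unpoisoning a thread stack or a TLS block amounts to writing zeros over the corresponding shadow range in this model.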

On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (test/asan/TestCases/Linux/unpoison_tls.cpp; introduced in https://github.com/llvm/llvm-project/commit/09886cd17ab8e5e601fda0e2aa21ff28c1a8fa63 "[asan] Make ASan report the correct thread address ranges to LSan.") The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from later TSD destructors.

Note: if the allocation is rtld/libc internal and not intercepted, there is no need to unpoison the range. The associated shadow is supposed to be zeros. However, if the allocation is intercepted, the runtime should unpoison the range in case the range reuses a previous allocation which happens to contain poisoned bytes.

In glibc, _dl_allocate_tls and _dl_deallocate_tls call malloc/free functions which are internal and not intercepted, so the allocations are opaque to the runtime and the shadow bytes are all zeroes.

Hardware-assisted AddressSanitizer "hwasan" (-fsanitize=hwaddress)

Its ClearShadowForThreadStackAndTLS is similar to asan's.

LeakSanitizer "lsan" (-fsanitize=leak)

LeakSanitizer detects memory leaks. On many targets, it is integrated (and enabled by default) in AddressSanitizer, but it can be used standalone. The checker is triggered by an atexit hook (the default options are LSAN_OPTIONS=detect_leaks=1:leak_check_at_exit=1), but it can also be invoked via __lsan_do_leak_check.

Each supported platform provides an entry point, StopTheWorld (e.g. on Linux), which does the following:

  • Invoke the clone syscall to create a new process which shares the address space with the calling process.
  • In the new process, list threads by iterating over /proc/$pid/task/.
  • In the new process, call SuspendThread (ptrace PTRACE_ATTACH) to suspend a thread.

StopTheWorld returns. The runtime performs mark-and-sweep, reports leaks, and then calls ResumeAllThreads (ptrace PTRACE_DETACH).
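The thread-listing step can be sketched as follows. This is Linux-only and deliberately simplified: ListThreads is a hypothetical helper, and it uses opendir/readdir from libc for brevity, whereas the real tracer cannot call libc functions and has to work with raw syscalls.

```cpp
#include <dirent.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Returns the thread IDs listed under /proc/<pid>/task (empty on failure).
std::vector<pid_t> ListThreads(pid_t pid) {
  std::vector<pid_t> tids;
  std::string path = "/proc/" + std::to_string(pid) + "/task";
  DIR *dir = opendir(path.c_str());
  if (!dir) return tids;
  while (dirent *ent = readdir(dir)) {
    char *end;
    long tid = std::strtol(ent->d_name, &end, 10);
    // Keep purely numeric entries; this skips "." and "..".
    if (end != ent->d_name && *end == '\0')
      tids.push_back(static_cast<pid_t>(tid));
  }
  closedir(dir);
  return tids;
}
```

ListThreads(getpid()) always contains at least the main thread, whose thread ID equals the process ID. A tracer would then attach to each listed ID and re-scan the directory until no new threads show up, since threads can be created while the scan is in progress.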

Note: the implementation cannot call libc functions. It does not perform code injection. The root set includes static/dynamic TLS blocks for each thread.

(The pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize. https://code.woboq.org/llvm/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp.html#411 I am not sure musl needs this.)

Intercepting __tls_get_addr is useful to lsan but is not necessary. First, the Linux InitializePlatformSpecificModules implementation ignores leaks from the dynamic loader. Second, allocations made by __tls_get_addr are suppressed by a built-in rule leak:*tls_get_addr in kStdSuppressions.

The current lsan implementation places additional requirements on GetTls: it does not intercept pthread_setspecific. Instead, it expects the range returned by GetTls to include pointers to pthread_setspecific regions; otherwise there would be false positive leak reports.

In addition, lsan gets the static TLS boundaries at pthread_create time and expects the boundaries to include TLS blocks of dynamically loaded modules. This means that the range returned by GetTls needs to include the static TLS surplus.

(You might ask: the thread control block has the dtv pointer, so why can't lsan track the referenced allocations? Well, for threads, rtld/libc implementations typically allocate the static TLS blocks as part of the thread stack, which is not seen by the runtime, so the runtime does not know about the allocations.)

On glibc, GetTls returned range includes pthread::{specific_1stblock,specific} for thread-specific data keys. There is currently a hack to ignore allocations from ld.so allocated dynamic TLS blocks. Note: if the pthread::{specific_1stblock,specific} pointers are encrypted, lsan cannot track the allocation.

MemorySanitizer "msan" (-fsanitize=memory)

MemorySanitizer detects uses of uninitialized memory. If a regular memory byte has uninitialized (poisoned) bits, the corresponding bits in its associated shadow byte are set to one.

Similar to asan. On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (test/msan/tls_reuse.cpp) The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from TSD destructors.

msan needs to do more than asan: the __tls_get_addr interceptor (DTLS_on_tls_get_addr) detects new dynamic TLS blocks and unpoisons the shadow. This is needed because ld.so calls a non-interposable memset to clear the blocks, so the runtime does not see the initialization. Otherwise, if a dynamic TLS block reuses a previous allocation with poison, there may be false positives. One way to semi-reliably trigger this is (test/msan/dtls_test.cpp https://github.com/google/sanitizers/issues/547):

  • in a thread, write an uninitialized (poisoned) value to a dynamic TLS block
  • destroy the thread
  • create a new thread
  • try making the new thread reuse the poisoned dynamic TLS block.

Note: aarch64 uses TLSDESC by default and there is no interposable symbol.

During the development of glibc 2.19, commit 1f33d36a8a9e78c81bed59b47f260723f56bb7e6 ("Patch 2/4 of the effort to make TLS access async-signal-safe.") was checked in. DTLS_on_tls_get_addr detects the __signal_safe_memalign header and considers it a dynamic TLS block if the block is not within the static TLS boundaries. Commit dd654bf9ba1848bf9ed250f8ebaa5097c383dcf8 ("Revert "Patch 2/4 of the effort to make TLS access async-signal-safe."") later reverted __signal_safe_memalign, but the implementation remains in grte branches.

See also Re: glibc 2.19 - asyn-signal safe TLS and ASan.

Similar to lsan: the pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize.

ThreadSanitizer "tsan" (-fsanitize=thread)

Similar to lsan: the pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize.

Similar to msan, the runtime unpoisons TLS blocks to avoid false positives. Tested by test/tsan/dtls.c (D20927). tsan also needs to intercept __tls_get_addr. The problem that aarch64 TLSDESC does not have an interposable symbol also applies.

I wrongly thought https://reviews.llvm.org/D93866 was a workaround. https://sourceware.org/pipermail/libc-alpha/2021-January/121352.html explained that the code has not materially changed since 2012.

For dynamic TLS blocks, older glibc (e.g. 2.23) calls __libc_memalign, which is intercepted (tsan/rtl/tsan_interceptors_posix.cpp); since BZ #17730, newer glibc (e.g. 2.32) calls malloc.

glibc TLS allocation

For dynamic TLS blocks, allocate_and_init allocates the block.

Android bionic

Android bionic (API level 31) introduced some TLS APIs in libc/include/sys/thread_properties.h. __libc_get_static_tls_bounds and __libc_iterate_dynamic_tls are used in compiler-rt.

/**
 * Gets the bounds of static TLS for the current thread.
 *
 * Available since API level 31.
 */
void __libc_get_static_tls_bounds(void** __static_tls_begin,
                                  void** __static_tls_end) __INTRODUCED_IN(31);

/**
 * Iterates over all dynamic TLS chunks for the given thread.
 * The thread should have been suspended. It is undefined-behaviour if there is concurrent
 * modification of the target thread's dynamic TLS.
 *
 * Available since API level 31.
 */
void __libc_iterate_dynamic_tls(pid_t __tid,
                                void (*__cb)(void* __dynamic_tls_begin,
                                             void* __dynamic_tls_end,
                                             size_t __dso_id,
                                             void* __arg),
                                void* __arg) __INTRODUCED_IN(31);
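A sketch of how a runtime might consume these two functions. The struct TlsRange, the helpers CollectRange, StaticTlsOfCurrentThread and DynamicTlsOf, and the HAVE_BIONIC_TLS_API macro are hypothetical. I assume that __ANDROID_API__ is visible after including a libc header; on other platforms the bionic calls are compiled out and the helpers return empty results.

```cpp
#include <sys/types.h>

#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: __ANDROID_API__ is defined by the toolchain or the libc headers.
#if defined(__ANDROID__) && defined(__ANDROID_API__) && __ANDROID_API__ >= 31
#include <sys/thread_properties.h>
#define HAVE_BIONIC_TLS_API 1
#else
#define HAVE_BIONIC_TLS_API 0
#endif

struct TlsRange {
  void *begin;
  void *end;
};

// Matches the __cb parameter of __libc_iterate_dynamic_tls. `arg` is assumed
// to point to a std::vector<TlsRange>.
void CollectRange(void *begin, void *end, size_t /*dso_id*/, void *arg) {
  static_cast<std::vector<TlsRange> *>(arg)->push_back({begin, end});
}

// Static TLS range of the calling thread ({nullptr, nullptr} off Android).
TlsRange StaticTlsOfCurrentThread() {
  TlsRange r = {nullptr, nullptr};
#if HAVE_BIONIC_TLS_API
  __libc_get_static_tls_bounds(&r.begin, &r.end);
#endif
  return r;
}

// Dynamic TLS ranges of a thread that the caller has already suspended.
std::vector<TlsRange> DynamicTlsOf(pid_t suspended_tid) {
  std::vector<TlsRange> ranges;
#if HAVE_BIONIC_TLS_API
  __libc_iterate_dynamic_tls(suspended_tid, CollectRange, &ranges);
#else
  (void)suspended_tid;  // no equivalent API assumed elsewhere
#endif
  return ranges;
}
```

The callback-with-context shape is the interesting part: the runtime passes its own container through __arg and receives one call per dynamic TLS chunk, which fits a leak checker that wants to add each chunk to its root set.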

dalias's notes

<@dalias> i think the api proposed there looks wrong
<@dalias> e.g. "static tls bounds" supposes a particular implementation where static is a single block range and static and dynamic are distinct
<@dalias> the interfaces proposed for dynamic are even worse
<@dalias> allowing interposition of individual dynamic tls area creation
<@dalias> supposing that they're created individually and ignoring that any interposition here would be extremely unsafe
<@dalias> the alternative prposed __libc_iterate_dynamic_tls is just a renamed dl_iterate_phdr without the glibc bug
<@dalias> and is pointless -- just fix the glibc bug
<@dalias> "When a thread (or dynamic TLS) is destroyed, the shadow for the stack (or dynamic TLS) should be unpoisoned"
<@dalias> this is backwards -- it should be poisoned because it's no longer valid. the stated desired behavior is based on bad glibc implementation internals (reuse of the stack/tls memory) and ignores that something should be done to unpoison it at the moment it's reused, not when it's freed for reuse