Updated in 2024-11.
Thread-local storage (TLS) provides a mechanism for allocating distinct objects for different threads. It is the usual implementation for the GCC extension __thread, C11 _Thread_local, and C++11 thread_local, which allow the use of the declared name to refer to the entity associated with the current thread. This article will describe thread-local storage on ELF platforms in detail, and touch on other related topics, such as thread-specific data keys and Windows/macOS TLS.
An example usage of thread-local storage is POSIX errno:
Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.
Different threads have different errno copies. errno is typically defined as a macro that dereferences the result of a function returning the address of a thread-local variable (e.g. __errno_location).
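The pattern can be sketched in a few lines of C. This is only an illustration, not glibc's code: my_errno_location, my_errno and set_and_get are hypothetical names standing in for __errno_location, errno and a caller.

```c
#include <assert.h>

/* One copy of this variable exists per thread. */
static _Thread_local int my_errno_storage;

/* Hypothetical analogue of __errno_location: returns the address of the
   calling thread's copy. */
int *my_errno_location(void) {
  return &my_errno_storage;
}

/* The macro makes my_errno usable as an lvalue, like errno. */
#define my_errno (*my_errno_location())

/* Stores a value through the macro and reads it back in the same thread. */
int set_and_get(int value) {
  my_errno = value;
  assert(my_errno_location() == &my_errno_storage);
  return my_errno;
}
```

Because the macro expands to a function call, the library is free to change where the per-thread copy lives without affecting callers.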
For each architecture, the authoritative ELF ABI document is the processor supplement (psABI) to the System V ABI (generic ABI). These documents usually reference The ELF Handling for Thread-Local Storage by Ulrich Drepper. The document, however, mixes general specifications and glibc internals.
Representation
Assembler behavior
The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The symbols representing thread-local variables have type STT_TLS (representing thread-local storage entities). In GNU as syntax, you can give the symbol a the type STT_TLS with .type a, @tls_object. The st_value of a TLS symbol is the offset relative to the defining section.
1 | .section .tbss,"awT",@nobits |
In this example, st_value(a)=0 while st_value(b)=4.
In Clang and GCC produced assembly, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS relocations, STT_NOTYPE/STT_OBJECT will be upgraded to STT_TLS.
GNU as supports a directive .tls_common which defines STT_TLS SHN_COMMON symbols. This is an obscure feature. It is not clear whether GCC still has a code path which emits .tls_common directives. The LLVM integrated assembler does not support .tls_common.
Linker behavior
The linker combines .tdata input sections into a .tdata output section. .tbss input sections are combined into a .tbss output section. The two SHF_TLS output sections are placed into a PT_TLS program header.
- p_offset: the file offset of the TLS initialization image
- p_vaddr: the virtual address of the TLS initialization image
- p_filesz: the size of the TLS initialization image
- p_memsz: the total size of the thread-local storage. The last p_memsz-p_filesz bytes will be zeroed by the dynamic loader.
- p_align: alignment
The PT_TLS program header is contained in a PT_LOAD program header. If PT_GNU_RELRO is used, PT_TLS is contained in a PT_GNU_RELRO and the PT_GNU_RELRO is contained in a PT_LOAD. Conceptually, PT_TLS and STT_TLS symbols live in a separate address space. The dynamic loader should copy the range [p_vaddr,p_vaddr+p_filesz) of the TLS initialization image to the corresponding static TLS block.
In executable and shared object files, st_value normally holds a virtual address. For a STT_TLS symbol, st_value holds an offset relative to the virtual address of the PT_TLS program header. The first byte of PT_TLS is referenced by the TLS symbol with st_value==0.
GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. Its internal linker script places such sections into the output section .tdata. ld.lld does not support STT_TLS SHN_COMMON symbols.
Dynamic loader behavior
The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS. For each PT_TLS, the dynamic loader copies p_filesz bytes from the TLS initialization image to the TLS block and sets the trailing p_memsz-p_filesz bytes to zeroes.
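The per-thread initialization described above amounts to a copy followed by a zero fill. A minimal sketch, assuming a hypothetical struct tls_segment that carries only the PT_TLS fields needed here (a real loader reads them from the program header):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a PT_TLS program header. */
struct tls_segment {
  const void *image; /* start of the TLS initialization image */
  size_t filesz;     /* p_filesz */
  size_t memsz;      /* p_memsz */
};

/* Initialize one thread's TLS block, which must hold seg->memsz bytes. */
void init_tls_block(void *block, const struct tls_segment *seg) {
  assert(seg->filesz <= seg->memsz);
  /* The .tdata part is copied from the initialization image. */
  memcpy(block, seg->image, seg->filesz);
  /* The .tbss part (the trailing p_memsz-p_filesz bytes) is zeroed. */
  memset((char *)block + seg->filesz, 0, seg->memsz - seg->filesz);
}
```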
For the static TLS block of the main executable, the module ID is one and the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula.
For a shared object loaded at program start, the offset from the thread pointer to its static TLS block is a fixed value at program start, albeit not a link-time constant. The offset can be referenced by a GOT dynamic relocation used by the initial-exec TLS model.
The ELF Handling for Thread-Local Storage describes two TLS variants and specifies their data structures. However, only the TP offset of the static TLS block of the main executable is a hard requirement. Nevertheless, libc implementations usually place static TLS blocks together, and allocate a space for both the thread control block and the static TLS blocks.
For a new thread created by pthread_create, the static TLS blocks are usually allocated as part of the thread stack. Without a guard page between the largest address of the stack and the thread control block, a stack buffer overflow can overwrite the thread control block, which can be considered a vulnerability.
Models
Local exec TLS model (executable & non-preemptible)
This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.
The compiler picks this model in -fno-pic/-fpie modes if the variable is
- a definition
- or a declaration with a non-default visibility.
The first condition is obvious. The second condition is because a non-default visibility means the variable must be defined by another translation unit in the executable.
1 | _Thread_local int def; |
1 | # x86-64 |
For the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. Here is a list of common relocation types:
- arm: R_ARM_TLS_LE32
- aarch64:
  - -mtls-size=12: R_AARCH64_TLSLE_ADD_TPREL_LO12
  - [-65536,65536) (unavailable in GCC/Clang): R_AARCH64_TLSLE_MOVW_TPREL_G0
  - -mtls-size=24 (default): R_AARCH64_TLSLE_ADD_TPREL_HI12, R_AARCH64_TLSLE_ADD_TPREL_LO12_NC
  - -mtls-size=32: R_AARCH64_TLSLE_MOVW_TPREL_G1, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
  - -mtls-size=48: R_AARCH64_TLSLE_MOVW_TPREL_G2, R_AARCH64_TLSLE_MOVW_TPREL_G1_NC, R_AARCH64_TLSLE_MOVW_TPREL_G0_NC
- i386: R_386_TLS_LE
- x86-64: R_X86_64_TPOFF32
- mips: R_MIPS_TPREL_HI16, R_MIPS_TPREL_LO16
- ppc32: R_PPC_TPREL_HA, R_PPC_TPREL_LO
- ppc64: R_PPC64_TPREL_HA, R_PPC64_TPREL_LO
- riscv: R_RISCV_TPREL_HI20, R_RISCV_TPREL_LO12_I, R_RISCV_TPREL_LO12_S
For RISC architectures, because an instruction typically has 4 bytes and cannot encode a 32-bit offset, it usually takes two instructions to materialize a TP offset.
In https://reviews.llvm.org/D93331, I patched ld.lld to reject local-exec TLS relocations in -shared mode. In GNU ld, at least the arm, riscv and x86 ports have similar diagnostics, but the aarch64 and ppc64 ports do not report an error.
Initial exec TLS model (executable & preemptible)
This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.
The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.
1 | extern thread_local int ref; |
1 | # x86-64 |
Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names. Here is a list of common relocation types:
- arm: R_ARM_TLS_IE32
- aarch64: R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21, R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC (adrp x0, :gottprel:tls; ldr x0, [x0, #:gottprel_lo12:tls])
- i386: R_386_TLS_IE
- x86-64: R_X86_64_GOTTPOFF
- ppc32: R_PPC_GOT_TPREL16
- ppc64: R_PPC64_GOT_TPREL16_HA, R_PPC64_GOT_TPREL16_LO_DS
- riscv: R_RISCV_TLS_GOT_HI20, R_RISCV_PCREL_LO12_I
If the TLS symbol does not satisfy initial-exec to local-exec optimization requirements, the linker will allocate a GOT entry and emit a dynamic relocation. Here is a list of dynamic relocation types:
- arm: R_ARM_TLS_TPOFF32
- aarch64: R_AARCH64_TLS_TPREL64
- mips32: R_MIPS_TLS_TPREL32
- mips64: R_MIPS_TLS_TPREL64
- i386: R_386_TPOFF
- x86-64: R_X86_64_TPOFF64
- ppc32: R_PPC_TPREL32
- ppc64: R_PPC64_TPREL64
- riscv: R_RISCV_TLS_TPREL64
While they have TPREL or TPOFF in their names, these dynamic relocations have the same bitwidth as the word size. This is a good way to distinguish them from the local-exec relocation types used in object files.
If you add the __attribute((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.
glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.
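As a small illustration of the attribute, the sketch below forces the initial-exec model on a variable that would otherwise use a dynamic model when compiled with -fpic. The names call_count and bump_call_count are hypothetical.

```c
#include <assert.h>

/* GCC/Clang attribute: use the initial-exec model even under -fpic. */
static __thread int call_count __attribute__((tls_model("initial-exec")));

/* Returns how many times the calling thread has called this function. */
int bump_call_count(void) {
  assert(call_count >= 0);
  return ++call_count;
}
```

If this translation unit ends up in a shared object that is later dlopen'ed, whether the load succeeds depends on the libc, as described above.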
General dynamic and local dynamic TLS models (DSO)
The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.
Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.
In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:
1 | // v is a pointer to the first element of the pair. |
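For readers who want something compilable, here is a toy version of the lookup under simplifying assumptions: the dtv is passed in explicitly as an array indexed by module ID, every block is already allocated, and the names toy_tls_index and toy_tls_get_addr are hypothetical. A real __tls_get_addr finds the dtv through the thread pointer and may have to allocate.

```c
#include <assert.h>
#include <stddef.h>

/* A (module ID, offset) pair, as stored in two consecutive GOT words. */
struct toy_tls_index {
  size_t module; /* module ID m */
  size_t offset; /* offset from dtv[m] to the symbol */
};

/* dtv maps a module ID to the current thread's TLS block for that module. */
void *toy_tls_get_addr(void *const *dtv, size_t dtv_len,
                       const struct toy_tls_index *ti) {
  assert(ti->module < dtv_len && dtv[ti->module] != NULL);
  return (char *)dtv[ti->module] + ti->offset;
}
```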
General dynamic TLS model (DSO)
The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call 16 bytes, suitable for link-time optimization.
1 | data16 leaq def@tlsgd(%rip), %rdi # R_X86_64_TLSGD |
(There is an open issue that LLVM disassembler does not display data16 and rex64 prefixes.)
Here is a list of common relocation types. They are called "initial relocations" in The ELF Handling for Thread-Local Storage.
- arm: R_ARM_TLS_GD32
- aarch64: R_AARCH64_TLSGD_ADR_PREL21, R_AARCH64_TLSGD_ADR_PAGE21, R_AARCH64_TLSGD_ADD_LO12_NC, R_AARCH64_TLSGD_MOVW_G1, R_AARCH64_TLSGD_MOVW_G0_NC (rarely used because TLS descriptors are the default)
- i386: R_386_TLS_GD
- x86-64: R_X86_64_TLSGD
- mips: R_MIPS_TLS_GD, R_MICROMIPS_TLS_GD
- ppc32: R_PPC_GOT_TLSGD16
- ppc64: R_PPC64_GOT_TLSGD16_HA, R_PPC64_GOT_TLSGD16_LO
- riscv: R_RISCV_TLS_GD_HI20
When the linker scans such a relocation, it checks whether the referenced TLS symbol satisfies optimization requirements. If not, the linker allocates two consecutive words in the .got section if not allocated yet. The two entries are relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The dynamic relocation types are:
- arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
- aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
- x86-32: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
- x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
- mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
- mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
- ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
- ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
- riscv32: R_RISCV_TLS_DTPMOD32 and R_RISCV_TLS_DTPREL32
- riscv64: R_RISCV_TLS_DTPMOD64 and R_RISCV_TLS_DTPREL64
- s390/s390x: R_390_TLS_DTPMOD and R_390_TLS_DTPOFF
They are called "outstanding relocations" in The ELF Handling for Thread-Local Storage.
On x86-64, -fno-plt uses call *__tls_get_addr@GOTPCREL(%rip) instead of call __tls_get_addr. If linked with GNU ld, a 2016-06 commit is needed. For relocatable object files compiled with -fpic -fno-plt -Wa,-mrelax-relocations=no, GNU ld cannot be used (https://sourceware.org/bugzilla/show_bug.cgi?id=24784 is wontfix).
Local dynamic TLS model (DSO & non-preemptible)
The local-dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address.
1 | static _Thread_local int def, def1; |
The access of def, if it uses the local-dynamic TLS model, may look like:
leaq def@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def@dtpoff(%rax), %edx
I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry. Technically we can just use general dynamic relocation types to represent the local dynamic TLS model. For example, GCC riscv does this:
1 | la.tls.gd a0, .LANCHOR0 |
This is clever. However, I would prefer dedicated local-dynamic relocation types. If we link multiple relocatable files together, the local symbols .LANCHOR0 are separate and their GOT entries cannot be shared. Architectures with dedicated local-dynamic relocation types can share the GOT entries.
Note that the code sequence is not shorter than the general-dynamic TLS model. Actually on RISC architectures the code sequence is usually longer due to the addition of DTPREL. Local-dynamic is beneficial if a function needs to access two or more non-preemptible TLS symbols, because the __tls_get_addr call can be shared.
1 | leaq def0@tlsld(%rip), %rdi |
Here is a list of common relocation types.
- arm: R_ARM_TLS_LDM32
- i386: R_386_TLS_LDM
- x86-64: R_X86_64_TLSLD
- mips: R_MIPS_TLS_LDM, R_MICROMIPS_TLS_LDM
- ppc32: R_PPC_GOT_TLSLD16
- ppc64: R_PPC64_GOT_TLSLD16_HA, R_PPC64_GOT_TLSLD16_LO, R_PPC64_GOT_TLSLD_PCREL34
At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec optimization requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word. The second word is zero, so that __tls_get_addr returns the start of the module's TLS block and the code adds the DTPOFF of the symbol itself.
If the architecture does not define TLS optimization, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.
TLS descriptors
Some architectures (arm, aarch64, i386, x86-64) have TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirect function call instead of calling __tls_get_addr. There are two main points:
- The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
- In glibc (which does lazy TLS allocation), __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.
The first point is the prominent reason that TLS descriptors are generally more efficient. Arguably traditional general dynamic and local dynamic TLS models could have a mechanism to use a custom calling convention for __tls_get_addr as well.
For the following program, when TLS descriptors are used (-fpic -mtls-dialect=desc; gnu2 for arm and x86), we can see that the registers holding arguments are not spilled.
__thread int x;
void ext(int a, int b, int c, int d, int e, int f);
int foo(int a, int b, int c, int d, int e, int f) {
  int ret = ++x;
  ext(a, b, c, d, e, f);
  return ret;
}
GCC's x86-64 port assumes that FLAGS_REG and RAX are changed while all other registers are preserved.
Currently, glibc's x86-64 port does not preserve vector registers in _dl_tlsdesc_dynamic (the slow code path).
In musl, in the static TLS case, the two words of the TLS descriptor will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.
1 | .globl __tlsdesc_static |
The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.
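The two cases can be modeled in plain C. This is a toy under stated assumptions: real resolvers use a custom calling convention and read the thread pointer and dtv themselves, whereas here tp and dtv are passed explicitly, and all names (toy_tlsdesc, toy_tlsdesc_dyn_arg, toy_tlsdesc_static, toy_tlsdesc_dynamic) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy two-word descriptor: a resolver function pointer and one argument. */
struct toy_tlsdesc {
  ptrdiff_t (*resolve)(const struct toy_tlsdesc *desc, const char *tp,
                       void *const *dtv);
  uintptr_t arg;
};

/* Out-of-line record needed by the dynamic case, because the first word of
   the descriptor no longer holds the module ID. */
struct toy_tlsdesc_dyn_arg {
  size_t module;
  size_t offset;
};

/* Static TLS: the second word already is the TP offset. */
ptrdiff_t toy_tlsdesc_static(const struct toy_tlsdesc *desc, const char *tp,
                             void *const *dtv) {
  (void)tp;
  (void)dtv;
  return (ptrdiff_t)desc->arg;
}

/* Dynamic TLS: the second word points to a (module ID, offset) record. */
ptrdiff_t toy_tlsdesc_dynamic(const struct toy_tlsdesc *desc, const char *tp,
                              void *const *dtv) {
  const struct toy_tlsdesc_dyn_arg *a =
      (const struct toy_tlsdesc_dyn_arg *)desc->arg;
  assert(dtv[a->module] != NULL);
  uintptr_t addr = (uintptr_t)dtv[a->module] + a->offset;
  /* Like the static case, return an offset relative to the thread pointer. */
  return (ptrdiff_t)(addr - (uintptr_t)tp);
}
```

Both resolvers return a TP-relative offset, so the caller's code after the indirect call is the same regardless of which one the dynamic loader installed.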
aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2. The RISC-V psABI added TLS descriptors in 2023-09: https://github.com/riscv-non-isa/riscv-elf-psabi-doc/issues/94
(I implemented TLS descriptors and optimization in ld.lld's IA-32, x86-64, and RISC-V ports.)
Let's see an example. Loading an int32_t using x86-64 TLSDESC has a code sequence like the following:
leaq x@TLSDESC(%rip), %rax
call *x@TLSCALL(%rax)
movl %fs:(%rax), %eax
Let's say the linker does not optimize this sequence into the initial-exec or local-exec model, because we are building a shared object. At run time, musl rtld resolves the TLSDESC dynamic relocation by writing the address of __tlsdesc_static to the first word of the GOT entry. When the code executes, the call instruction will call the __tlsdesc_static function, which loads the TP offset from the second word. Then the third instruction movl %fs:(%rax), %eax performs a %fs relative memory load.
In GNU ld aarch64/arm/x86-64, R_*_TLSDESC relocations are placed in .rela.plt. glibc ld.so has to consider such lazy binding relocations. However, because of data races, lazy binding is a bad idea, so glibc now resolves R_*_TLSDESC eagerly. I filed "ld: Move R_*_TLSDESC to .rela.dyn".
For the dynamic case, rtld allocates an object with malloc which holds the module ID and the offset from dtv[m] to the symbol. The second GOT entry is rewritten to reference the object. RISC-V TLS descriptors explored using static entries in place of a runtime malloc, but the idea was turned down.
PowerPC __tls_get_addr_opt
Alan Modra implemented a poor man's TLSDESC scheme for PowerPC in 2015. https://sourceware.org/legacy-ml/libc-alpha/2015-03/msg00626.html
If glibc ld.so sees DT_PPC64_OPT, it sets the module ID in the tls_index object to zero and sets the TP offset (instead of the offset from dtv[m]). glibc exports the symbol __tls_get_addr_opt.
If GNU ld sees a defined __tls_get_addr_opt, it converts a __tls_get_addr call to call __tls_get_addr_opt instead. The PLT code sequence checks whether the module ID is zero, and if true, just adds the TP offset to TP and returns; otherwise it calls the __tls_get_addr_opt definition in ld.so.
In the common cases for immediately loaded shared objects, this can save a function call. However, this scheme does not have the benefit of TLSDESC's custom calling convention.
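The fast-path check can be pictured in C as follows. This is only a model of the logic described above, not the actual PLT stub: tp and dtv are passed explicitly, the slow path is reduced to a dtv lookup, and ppc_tls_index and toy_tls_get_addr_opt are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct ppc_tls_index {
  size_t module; /* zero means "offset is a TP offset" */
  size_t offset;
};

void *toy_tls_get_addr_opt(char *tp, void *const *dtv,
                           const struct ppc_tls_index *ti) {
  /* Fast path: static TLS, no call into ld.so is needed. */
  if (ti->module == 0)
    return tp + (ptrdiff_t)ti->offset;
  /* Slow path: stands in for the real call into ld.so. */
  assert(dtv[ti->module] != NULL);
  return (char *)dtv[ti->module] + ti->offset;
}
```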
s390x __tls_get_offset
s390 and s390x use __tls_get_offset instead of __tls_get_addr. See Toolchain notes on z/Architecture for details.
Which model does the compiler pick?
1 | if (executable) { // -fno-pic or -fpie |
The linker uses a similar criterion to check whether a TLS optimization applies.
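Reduced to two inputs, the decision described in the model sections is a two-by-two table. The sketch below assumes the compiler has already folded -fno-pic/-fpie versus -fpic into executable, and definition/visibility into preemptible; it ignores -ftls-model, attributes and TLS descriptors. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum tls_model {
  TLS_GENERAL_DYNAMIC,
  TLS_LOCAL_DYNAMIC,
  TLS_INITIAL_EXEC,
  TLS_LOCAL_EXEC
};

/* The enumerators are ordered from most general to most efficient. */
static_assert(TLS_LOCAL_EXEC == 3, "unexpected enumerator order");

enum tls_model pick_tls_model(bool executable, bool preemptible) {
  if (executable) /* -fno-pic or -fpie */
    return preemptible ? TLS_INITIAL_EXEC : TLS_LOCAL_EXEC;
  /* -fpic: the code may end up in a shared object */
  return preemptible ? TLS_GENERAL_DYNAMIC : TLS_LOCAL_DYNAMIC;
}
```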
Link-time TLS optimization
Some psABIs define TLS optimization. The idea is that the code sequences have fixed forms and are annotated with appropriate relocations, so the linker understands the compiler's intention and can rewrite the code sequences as optimizations. There are 4 optimization schemes. I have annotated each with its condition.
- general-dynamic/TLSDESC to local-exec optimization: -no-pie/-pie && non-preemptible
- general-dynamic/TLSDESC to initial-exec optimization: -no-pie/-pie && preemptible
- local-dynamic to local-exec optimization: -no-pie/-pie (the symbol must be non-preemptible, otherwise it is an error to use local-dynamic)
- initial-exec to local-exec optimization: -no-pie/-pie && non-preemptible
I sometimes call the optimization schemes poor man's link-time optimization with nice ergonomics. Intuitively general-dynamic/TLSDESC to initial-exec optimizations are rare, since it is uncommon to reference a TLS symbol defined in another module.
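The four conditions can be written down as one function. This sketch follows only the conditions listed above (in particular it assumes no rewriting happens for a -shared link) and uses hypothetical names; a real linker also has to check that the instruction sequence has the expected form.

```c
#include <assert.h>
#include <stdbool.h>

enum tls_model { GD_OR_TLSDESC, LOCAL_DYNAMIC, INITIAL_EXEC, LOCAL_EXEC };

/* Returns the model the code sequence uses after link-time optimization. */
enum tls_model optimize_tls_model(enum tls_model m, bool linking_executable,
                                  bool preemptible) {
  /* Using local-dynamic for a preemptible symbol is an error. */
  assert(!(m == LOCAL_DYNAMIC && preemptible));
  if (!linking_executable) /* not -no-pie/-pie: leave the code as is */
    return m;
  switch (m) {
  case GD_OR_TLSDESC:
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_DYNAMIC:
    return LOCAL_EXEC;
  case INITIAL_EXEC:
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_EXEC:
    break;
  }
  return m;
}
```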
To make TLS optimization available, the compiler needs to communicate sufficient information to the linker. So you may find marker relocations which don't relocate values. Here is a general-dynamic code sequence for ppc64:
1 | addis r3, r2, x@got@tlsgd@ha # R_PPC64_GOT_TLSGD16_HA |
R_PPC64_TLSGD does not relocate the location. It is there to indicate that it is the __tls_get_addr function call in the code sequence.
According to Stefan Pintilie, "In the early days of the transition from the ELFv1 ABI that is used for big endian PowerPC Linux distributions to the ELFv2 ABI that is used for little endian PowerPC Linux distributions, there was some ambiguity in the specification of the relocations for TLS." The bl __tls_get_addr instruction was not relocated by R_PPC64_TLSGD. Blindly converting the addis/addi instructions can make the code sequence malformed. Therefore GNU ld detected the missing R_PPC64_TLSGD/R_PPC64_TLSLD and disabled optimization in 2009-03-03.
I was not fond of the fact that we still needed such a hack in 2020 but I implemented a scheme in ld.lld anyway because the request was so strong. https://reviews.llvm.org/D92959
The following example tests several optimization schemes.
a.c tests general-dynamic to initial-exec optimization. b.c:f1 tests local-dynamic to local-exec optimization. If f0 uses the general-dynamic TLS model, general-dynamic to local-exec optimization is tested as well.
1 | cat > b.c <<e |
We can test general-dynamic to local-exec optimization with clang -fpic -O1 a.c b.c -o a.
TLS variants
There are two variants.
In Variant I, the static TLS blocks are placed above (after) the thread pointer. The thread pointer points to the end of the thread control block. The thread control block (TCB) is a per-thread data structure describing various attributes of the thread, as defined by the libc implementation. arm, aarch64, alpha, ia64, m68k, mips, ppc, and riscv use schemes similar to this variant. I say similar because some architectures (including m68k, mips, powerpc32, powerpc64) place the thread pointer at the end of the thread control block plus a displacement.
Let's say the main executable and two immediately loaded shared objects contain PT_TLS segments; then the placement of TCB and static TLS blocks is like the following.
TCB TP_WITHOUT_DISPLACEMENT [GAP] tlsblock0 tlsblock1 tlsblock2
TP_WITHOUT_DISPLACEMENT % tls_align == 0
set TP to TP_WITHOUT_DISPLACEMENT + displacement
If displacement is 0 and GAP is absent, the TP offset of tlsblock0 is exe.tls.p_vaddr&(exe.tls.p_align-1).
TP_WITHOUT_DISPLACEMENT is aligned according to the maximum p_align of all PT_TLS segments.
The TP offset of tlsblock0 is a value that is shared between the linker and the dynamic loader. We can compute the offset relative to TP_WITHOUT_DISPLACEMENT as follows:
exe.tls_id = ++tls_cnt; // set tls_cnt to 1
// GAP_ABOVE_TP is 8 for aarch32 and 16 for aarch64
exe.tls.offset = GAP_ABOVE_TP + ((-GAP_ABOVE_TP+exe.tls.p_vaddr) & (exe.tls.p_align-1));
tls_offset = exe.tls.offset + exe.tls.p_memsz;
tls_align = max(exe.tls.p_align, MIN_TLS_ALIGN);
As an example, on powerpc64, the displacement is 0x7000, which means TP (the r13 register) is set to the end of the thread control block (TP_WITHOUT_DISPLACEMENT) plus 0x7000. The space allocated for the TLS symbol with st_value==0 in the executable is at r13-0x7000 + exe.tls.p_vaddr%exe.tls.p_align. Since exe.tls.p_vaddr%exe.tls.p_align is normally 0, the code sequence accessing st_value==0 may look like:
addis 3, 13, 0
lwz 3, -0x7000(3)
A single add instruction can access [r13-0x8000, r13+0x8000), i.e. [TP_WITHOUT_DISPLACEMENT-0x1000, TP_WITHOUT_DISPLACEMENT+0xf000). If the thread control block is no larger than 0x1000, its members can be accessed with a single add instruction.
arm and aarch64 have a zero displacement, but they reserve two words at TP (a gap before tlsblock0). The TP offset of tlsblock0 is sizeof(void*)*2 + ((p_vaddr-sizeof(void*)*2)&(p_align-1)).
Dynamic loaders place tlsblock1 and tlsblock2 with the minimum alignment padding. Their offsets are not known by the linker, so theoretically dynamic loaders can add an arbitrary amount of padding. If we treat the TP without displacement as the anchor point, the offsets of PT_TLS segments of immediately loaded shared objects can be determined as follows:
for (int i = 0; i < n_dso_with_tls; i++) {
  p = dso_with_tls[i];
  p->tls_id = ++tls_cnt;
  p->tls.offset = tls_offset + ((-tls_offset+p->tls.p_vaddr) & (p->tls.p_align-1)); // p->tls.offset = p_vaddr (mod p_align)
  tls_offset = p->tls.offset + p->tls.p_memsz;
  tls_align = max(tls_align, p->tls.p_align);
}
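To check the Variant I arithmetic on concrete numbers, the placement rule can be packaged as a small function. The sketch assumes a hypothetical struct pt_tls with just the three fields used, treats segs[0] as the executable, and measures offsets from the TP without displacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pt_tls {
  uintptr_t p_vaddr;
  size_t p_memsz;
  size_t p_align;
};

/* Smallest offset >= start that is congruent to p_vaddr modulo p_align. */
size_t variant1_place(size_t start, const struct pt_tls *t) {
  assert(t->p_align != 0 && (t->p_align & (t->p_align - 1)) == 0);
  return start + ((t->p_vaddr - start) & (t->p_align - 1));
}

/* Fills offsets[i] for each segment and returns the end of the last block.
   gap_above_tp is the reserved space after TP (e.g. 16 on aarch64). */
size_t variant1_layout(size_t gap_above_tp, const struct pt_tls *segs,
                       size_t n, size_t *offsets) {
  size_t tls_offset = gap_above_tp;
  for (size_t i = 0; i < n; i++) {
    offsets[i] = variant1_place(tls_offset, &segs[i]);
    tls_offset = offsets[i] + segs[i].p_memsz;
  }
  return tls_offset;
}
```

For example, with a 16-byte gap, a second segment with p_vaddr=8 and p_align=16 that follows a block ending at offset 36 is placed at offset 40, the first value at or after 36 that is 8 modulo 16.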
i386, x86-64, s390, and sparc use Variant II. In Variant II, the static TLS blocks are placed below (before) the thread pointer. The thread pointer points to the start of the thread control block.
Let's say the main executable and two immediately loaded shared objects contain PT_TLS segments; then the placement of TCB and static TLS blocks is like the following.
tlsblock2 tlsblock1 tlsblock0 TP TCB
TP % tls_align == 0
The TP offset of tlsblock0 is -exe.tls.p_memsz - ((-exe.tls.p_memsz-exe.tls.p_vaddr)&(exe.tls.p_align-1)).
TP is aligned by the maximum p_align of all PT_TLS segments.
The TP offset of tlsblock0 is a value that is shared between the linker and the dynamic loader. It can be computed as:
exe.tls_id = ++tls_cnt; // set tls_cnt to 1
tls_offset = exe.tls.p_memsz + ((-(uintptr_t)exe.tls.p_vaddr-exe.tls.p_memsz) & (exe.tls.p_align-1));
exe.tls.offset = -tls_offset;
tls_align = max(exe.tls.p_align, MIN_TLS_ALIGN);
If you find the formula above confusing, it is;-) In normal cases, you can forget the alignment requirement and the TP offset of tlsblock0 is just -p_memsz. glibc has a Variant II bug when p_vaddr%p_align!=0: BZ24606. I reported the problem to FreeBSD rtld and fixed the formula for i386/amd64 in https://github.com/freebsd/freebsd-src/commit/e6c76962031625d51fe4225ecfa15c85155eb13a.
Dynamic loaders place tlsblock1 and tlsblock2 with the minimum alignment padding, though their placement can be more flexible as their offsets are not known by the linker.
for (int i = 0; i < n_dso_with_tls; i++) {
  p = dso_with_tls[i];
  p->tls_id = ++tls_cnt;
  tls_offset += p->tls.p_memsz + ((-tls_offset-p->tls.p_memsz-p->tls.p_vaddr) & (p->tls.p_align-1));
  p->tls.offset = -tls_offset; // -tls_offset = p_vaddr (mod p_align)
  tls_align = max(tls_align, p->tls.p_align);
}
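The Variant II placement can be checked the same way. In this sketch the running offset takes part in the alignment term, so that the negated offset stays congruent to p_vaddr modulo p_align even when the preceding blocks do not end on a multiple of p_align. struct pt_tls and the function names are hypothetical, and segs[0] is taken to be the executable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pt_tls {
  uintptr_t p_vaddr;
  size_t p_memsz;
  size_t p_align;
};

/* used is the number of bytes already occupied below TP. Returns the
   distance from TP down to the start of the new block; the TP offset of
   the block is the negation of the result. */
size_t variant2_place(size_t used, const struct pt_tls *t) {
  assert(t->p_align != 0 && (t->p_align & (t->p_align - 1)) == 0);
  size_t end = used + t->p_memsz;
  /* Unsigned wraparound is intended: only the low bits are kept. */
  return end + ((0 - end - t->p_vaddr) & (t->p_align - 1));
}

/* Fills tp_offsets[i] (negative values) and returns the bytes used below TP. */
size_t variant2_layout(const struct pt_tls *segs, size_t n,
                       ptrdiff_t *tp_offsets) {
  size_t tls_offset = 0;
  for (size_t i = 0; i < n; i++) {
    tls_offset = variant2_place(tls_offset, &segs[i]);
    tp_offsets[i] = -(ptrdiff_t)tls_offset;
  }
  return tls_offset;
}
```

With an executable of p_memsz=4, p_align=4 followed by a shared object of p_memsz=8, p_align=8, the second block lands at TP-16 rather than TP-12, which keeps it 8-byte aligned relative to TP.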
Alignment
For a TLS variable, its alignment describes how its position in the TLS initialization image is aligned. If the PT_TLS program header satisfies p_vaddr%p_align==0, then st_value is aligned by the variable alignment as well.
glibc's TLS implementation
1 | // nptl/descr.h |
On x86-64, TLS_DTV_AT_TP is 0. struct pthread is at fs:0. dtv is at fs:8.
Async-signal-safe TLS
C11 7.14.1 Specify signal handling says:
If the signal occurs other than as the result of calling the abort or raise function, the behavior is undefined if the signal handler refers to any object with static or thread storage duration that is not a lock-free atomic object other than by assigning a value to an object declared as volatile sig_atomic_t, or the signal handler calls any function in the standard library other than the abort function, the _Exit function, the quick_exit function, or the signal function with the first argument equal to the signal number corresponding to the signal that caused the invocation of the handler. Furthermore, if such a call to the signal function results in a SIG_ERR return, the value of errno is indeterminate.
C++11 [support.signal] says:
An evaluation is signal-safe unless it includes one of the following:
an access to an object with thread storage duration;
A signal handler invocation has undefined behavior if it includes an evaluation that is not signal-safe.
Despite that, accessing TLS from signal handlers can be useful (think of CPU and memory profilers), hence the accesses need to be async-signal safe. Google reported the issue due to its usage of JVM and dlopen'ed JNI libraries (Async-signal-safe access to __thread variables from dlopen()ed libraries?). They eventually resorted to a non-upstream patch which used a custom allocator.
Let's discuss this topic in detail.
Local-exec and initial-exec TLS models trivially satisfy the requirement since the size of static TLS blocks is fixed at program start and every thread has a pre-allocated copy.
For a dlopen'ed shared object which uses general-dynamic or local-dynamic TLS model, there are two cases.
- The dynamic loader allocates sufficient storage for all currently running threads at dlopen time, and allocates sufficient storage at pthread_create time. This is musl's choice. At dlopen time, the dynamic loader needs to block signal delivery, take a thread list lock and install a new dynamic thread vector for each thread.
- Lazy TLS allocation. TLS allocation is done the first time __tls_get_addr is called. This is the choice of glibc and many other libc implementations. The allocation is typically done by malloc, which is not async-signal-safe.
Lazy TLS allocation has the nice property that it does not penalize the threads which do not need to access TLS of the new shared object. However, it is difficult to make __tls_get_addr async-signal-safe. It is impossible to both allocate lazily and have dynamic TLS access that cannot fail (TLS redux). If __tls_get_addr cannot allocate memory, the ideal behavior is "fail safe" (e.g. abort), as opposed to the full range of undefined behaviors or deadlock.
One workaround is to let the shared object use the initial-exec TLS model. This will consume the static TLS space - a global resource.
If a dlopen implementing eager TLS allocation is developed, conceivably it may need a new symbol version because there can be programs expecting lazy TLS allocation.
Large code model
Many 64-bit architectures have a small code model. Some have defined a large code model. See Relocation overflow and code models for details.
A small code model usually restricts the addresses and sizes of sections to 4GiB or 2GiB, while a large code model generally makes no such assumption. The TLS size is usually small, and TLS code sequences may still impose some limitation on it even with a large code model.
For the local-exec TLS model, because a symbol is usually referenced via an offset added to a register (the thread pointer), no special treatment is needed for a large code model.
For the initial-exec TLS model, because a GOT load is needed and the GOT is part of the data sections, a large code model technically should implement a code sequence which is not restricted by the distance between code and data. GCC has not implemented such code sequences.
For the general-dynamic and local-dynamic TLS models, there is usually a GOT load and a __tls_get_addr call. As discussed previously, the GOT load needs to be free of the 32-bit limitation. For the __tls_get_addr call, on architectures which have implemented range extension thunks, no special treatment is needed, since the linker can redirect the call to a thunk which arranges for the call.
x86-64 has not implemented thunks. Compile a program with x86-64 gcc -S -fpic -mcmodel=large and you can see that the __tls_get_addr call is indirect. This is to avoid the +-2GiB range limitation imposed by the direct CALL instruction.
1 | movabsq $_GLOBAL_OFFSET_TABLE_-.L2, %r11 |
The support for large code model TLS is fairly limited as of today. Most configurations don't lift the GOT load limitation. On aarch64, -fpic -mcmodel=large has not been implemented in GCC and Clang.
Thread-specific data keys
An alternative to ELF TLS is thread-specific data keys: pthread_key_create, pthread_setspecific, pthread_getspecific and pthread_key_delete. This scheme can be seen as a simpler implementation of __tls_get_addr with a key reuse feature. There are C11 equivalents (tss_create, tss_set, tss_get, tss_delete) which are rarely used.
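For readers who have not used the API, here is a minimal sketch of a per-thread counter built on a data key. The helper names (`thread_counter`, `run_worker`) are invented for this example; only the pthread calls are real API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
  /* free() runs at thread exit for each thread with a non-NULL value. */
  int rc = pthread_key_create(&counter_key, free);
  assert(rc == 0);
}

/* Returns the calling thread's counter, allocating it on first use. */
int *thread_counter(void) {
  pthread_once(&counter_once, make_key);
  int *p = pthread_getspecific(counter_key);
  if (p == NULL) {
    p = calloc(1, sizeof *p);
    assert(p != NULL);
    pthread_setspecific(counter_key, p);
  }
  return p;
}

static void *worker(void *arg) {
  (void)arg;
  *thread_counter() += 10; /* touches only the worker's copy */
  return (void *)(intptr_t)*thread_counter();
}

/* Runs one worker thread and returns the final value of its counter. */
int run_worker(void) {
  pthread_t t;
  void *ret = NULL;
  int rc = pthread_create(&t, NULL, worker, NULL);
  assert(rc == 0);
  pthread_join(t, &ret);
  return (int)(intptr_t)ret;
}
```

Note the lazy allocation on first access: this is the same shape as a lazy __tls_get_addr, including the malloc call that makes it unsuitable for signal handlers.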
Windows provides a similar API: TlsAlloc, TlsSetValue, TlsGetValue, TlsFree.
The maximum number of keys is usually limited. On glibc it is usually 1024. On musl it is 128. So applications which potentially need many data keys typically create a wrapper on top of thread-specific data keys, e.g. Chromium's base/threading/thread_local_storage.h.
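The idea behind such a wrapper can be sketched as follows: spend a single pthread key on a per-thread array and hand out array indices ("slots") instead of keys. This is not Chromium's implementation; `MAX_SLOTS`, `slot_alloc`, `slot_set` and `slot_get` are hypothetical names, and slot reuse and per-slot destructors are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum { MAX_SLOTS = 4096 }; /* hypothetical limit, larger than typical key limits */

static pthread_key_t slots_key;
static pthread_once_t slots_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_slot;

static void make_slots_key(void) {
  int rc = pthread_key_create(&slots_key, free);
  assert(rc == 0);
}

/* Reserves a slot index shared by all threads; returns -1 when exhausted. */
int slot_alloc(void) {
  pthread_mutex_lock(&slot_lock);
  int slot = next_slot < MAX_SLOTS ? next_slot++ : -1;
  pthread_mutex_unlock(&slot_lock);
  return slot;
}

/* Returns the calling thread's slot array, created on first use. */
static void **thread_slots(void) {
  pthread_once(&slots_once, make_slots_key);
  void **slots = pthread_getspecific(slots_key);
  if (slots == NULL) {
    slots = calloc(MAX_SLOTS, sizeof *slots);
    assert(slots != NULL);
    pthread_setspecific(slots_key, slots);
  }
  return slots;
}

void slot_set(int slot, void *value) { thread_slots()[slot] = value; }
void *slot_get(int slot) { return thread_slots()[slot]; }
```

Only one real key is consumed no matter how many slots the application allocates.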
POSIX.1-2017 does not require pthread_setspecific/pthread_getspecific to be async-signal-safe. Nevertheless, most implementations make pthread_getspecific async-signal-safe. pthread_setspecific is not necessarily async-signal-safe.
-femulated-tls
-femulated-tls uses thread-specific data keys to implement emulated TLS. It is like using a general-dynamic TLS model for all modes.
1 | __thread int tls0; |
1 | adrp x0, :got:__emutls_v.tls0 |
Each thread-local variable definition is associated with a __emutls_control instance which records size/alignment/index/initializer.
The runtime implementation of __emutls_get_address is similar to a __tls_get_addr implementation in a lazy TLS allocation scheme. Each thread has an object array (emutls_address_array), with each element being a pointer to the variable value. Each thread-local variable is assigned an index into the array.
The inefficiency comes from these aspects:
- There is no linker optimization.
- Instead of getting the dynamic thread vector from the thread pointer (usually available in a register), the runtime needs to call pthread_getspecific to get the vector.
- The dynamic loader does not know emulated TLS, so the storage allocation is typically done in the access function via pthread_once.
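To make the cost concrete, here is a simplified sketch of what an emulated TLS access function may look like. It is not the libgcc or compiler-rt code: `struct emu_control`, `struct emu_array` and `emu_get_address` are hypothetical stand-ins, alignment handling is reduced to what malloc guarantees, and nothing is freed at thread exit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical control object; the real __emutls_control layout may differ. */
struct emu_control {
  size_t size, align;
  size_t index;            /* 0 until assigned, then a 1-based array index */
  const void *initializer; /* NULL means zero-initialized */
};

struct emu_array { size_t len; void **slot; }; /* per-thread object array */

static pthread_key_t emu_key;
static pthread_once_t emu_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t emu_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t emu_next_index;

static void emu_make_key(void) {
  int rc = pthread_key_create(&emu_key, NULL);
  assert(rc == 0);
}

void *emu_get_address(struct emu_control *c) {
  pthread_once(&emu_once, emu_make_key);

  /* Assign an index the first time any thread touches the variable. */
  pthread_mutex_lock(&emu_lock);
  if (c->index == 0)
    c->index = ++emu_next_index;
  size_t idx = c->index;
  pthread_mutex_unlock(&emu_lock);

  /* The per-thread array comes from a data key, not the thread pointer. */
  struct emu_array *arr = pthread_getspecific(emu_key);
  if (arr == NULL) {
    arr = calloc(1, sizeof *arr);
    assert(arr != NULL);
    pthread_setspecific(emu_key, arr);
  }
  if (arr->len < idx) { /* grow the array and clear the new elements */
    void **grown = realloc(arr->slot, idx * sizeof *grown);
    assert(grown != NULL);
    memset(grown + arr->len, 0, (idx - arr->len) * sizeof *grown);
    arr->slot = grown;
    arr->len = idx;
  }

  void **p = &arr->slot[idx - 1];
  if (*p == NULL) { /* first access by this thread: allocate and initialize */
    assert(c->align <= _Alignof(max_align_t)); /* sketch: malloc alignment only */
    *p = malloc(c->size);
    assert(*p != NULL);
    if (c->initializer)
      memcpy(*p, c->initializer, c->size);
    else
      memset(*p, 0, c->size);
  }
  return *p;
}
```

Every access pays for a pthread_once check, a pthread_getspecific call and (in this sketch) a mutex, which is the overhead the list above describes.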
libgcc has a mature runtime. In compiler-rt, the runtime was contributed by Android folks in 2015.
Currently Android and OpenBSD targets default to -femulated-tls in Clang. See hasDefaultEmulatedTLS.
C++ thread_local
C++ thread_local adds additional features to __thread: dynamic initialization on first-use and destruction on thread exit. If a thread_local variable needs dynamic initialization or has a non-trivial destructor, the compiler calls the TLS wrapper function (_ZTW*, in a COMDAT group) instead of referencing the variable directly. The TLS wrapper calls the TLS init function (_ZTH*, weak), which is an alias for __tls_init. __tls_init calls the constructors and registers the destructors with __cxa_thread_atexit.
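The wrapper/init split can be imitated by hand in C, which has no dynamic TLS initialization of its own. The sketch below is only an analogue of the compiler-generated code: `x_tls_wrapper` plays the role of _ZTW*, `x_tls_init` the role of _ZTH*/__tls_init, and all names are hypothetical. Destructor registration is omitted.

```c
#include <assert.h>
#include <stdbool.h>

static int init_calls; /* counts initializer runs (demo only, not thread-safe) */

static int compute_initial_value(void) {
  init_calls++;
  return 42;
}

static __thread int x;
static __thread bool x_guard; /* per-thread "already initialized" flag */

/* Analogue of the TLS init function: runs the initializer once per thread. */
static void x_tls_init(void) {
  if (!x_guard) {
    x_guard = true;
    x = compute_initial_value();
  }
}

/* Analogue of the TLS wrapper: every use of x goes through this function. */
int *x_tls_wrapper(void) {
  x_tls_init();
  assert(x_guard); /* the variable is initialized before it is handed out */
  return &x;
}

int init_call_count(void) { return init_calls; }
```

The guard check on every access is the overhead that a constant-initialized variable avoids.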
The __cxa_thread_atexit complexity is because a thread_local variable defined in a dlopen'ed shared object needs to be destroyed at dlclose time before thread exit. libsupc++ and libc++abi define __cxa_thread_atexit. They call __cxa_thread_atexit_impl if the libc implementation provides it, or otherwise use a generic implementation based on thread-specific data keys.
As an example, x needs a TLS wrapper function. The compiler may inline the TLS wrapper function and __tls_init.
1 | extern thread_local int x; |
The assembly looks like the following. It uses undefined weak _ZTH1x to check whether the TLS init function is defined. If yes, call the TLS init function. Then reference the variable via the usual initial-exec or general-dynamic TLS model or TLSDESC.
1 | _Z3foov: |
If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old __thread. [[clang::require_constant_initialization]] can be used with older language standards.
1 | extern thread_local constinit int x; |
Here is an example where __tls_init needs to call __cxa_thread_atexit.
1 | struct S { S(); ~S(); }; |
Undefined weak TLS symbols
Regular unresolved weak symbols have a zero value. TLS symbols effectively live in a separate address space, so the rule doesn't apply.
macOS TLS
The support was added very late. The scheme is similar to ELF's TLS descriptors, without the good custom calling convention promise. In other words, the performance is likely worse than ELF's general dynamic TLS model. To my surprise, thread-local variables of internal linkage need an indirect function call, too.
1 | thread_local int tls; |
1 | // x86-64 |
1 | adrp x0, _tls@TLVPPAGE // ARM64_RELOC_TLVP_LOAD_PAGE21(_tls) |
_tls@TLVP refers to a TLV descriptor in __DATA,__thread_vars. Each descriptor consists of three words:
- void* (*thunk)(struct TLVDescriptor*)
- unsigned long key: thread-specific data key (unique per library) denoting the TLS block
- unsigned long offset: the offset in the block
dyld iterates over the descriptors and sets the first word (thunk) to getAddrFunc.
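The descriptor scheme can be simulated with pthread keys in portable C. This is a sketch under stated assumptions, not dyld's code: `get_addr` stands in for the installed thunk, `bind_descriptors` for dyld's pass over the descriptors (here it also fills in the key), and `BLOCK_SIZE` is a made-up size for the image's TLS block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct TLVDescriptor {
  void *(*thunk)(struct TLVDescriptor *);
  unsigned long key;    /* thread-specific data key denoting the TLS block */
  unsigned long offset; /* offset of the variable in the block */
};

enum { BLOCK_SIZE = 64 }; /* hypothetical size of this image's TLS block */

/* Stand-in for the function installed as the thunk. */
static void *get_addr(struct TLVDescriptor *d) {
  char *block = pthread_getspecific((pthread_key_t)d->key);
  if (block == NULL) { /* first access in this thread: allocate lazily */
    block = calloc(1, BLOCK_SIZE);
    assert(block != NULL);
    pthread_setspecific((pthread_key_t)d->key, block);
  }
  return block + d->offset;
}

/* Stand-in for the loader's pass over one image's descriptors.
   Assumption of this sketch: the key is assigned in the same pass. */
void bind_descriptors(struct TLVDescriptor *descs, int n) {
  pthread_key_t key;
  int rc = pthread_key_create(&key, free);
  assert(rc == 0);
  for (int i = 0; i < n; i++) {
    descs[i].thunk = get_addr;
    descs[i].key = key;
  }
}

/* What compiled code does: load the descriptor and call through the thunk. */
void *tlv_access(struct TLVDescriptor *d) { return d->thunk(d); }
```

Every access is an indirect call through the descriptor, which matches the observation that even internal-linkage variables pay for a function call.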
__attribute__((tls_model(...))) attributes are ignored in Clang.
Windows TLS
The code sequence fetches ThreadLocalStoragePointer (offset 88) out of the Thread Environment Block and indexes it by _tls_index. The return value is indexed with the offset of the variable from the start of the .tls section. The scheme is similar to ELF's local-dynamic TLS model, replacing a __tls_get_addr call with an array index operation.
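The two-step lookup can be modeled in portable C. Everything below is simulated: the arrays stand in for the per-thread pointer array and the module TLS blocks that the Windows loader would set up, and `implicit_tls_address` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated pieces; on Windows these come from the TEB and the loader. */
static char module0_block[16], module1_block[16];
static void *thread_local_storage_pointer[] = { module0_block, module1_block };
static unsigned tls_index = 1; /* this module's slot, like _tls_index */

/* Address of a variable located offset bytes into this module's .tls data. */
void *implicit_tls_address(size_t offset) {
  assert(offset < sizeof module1_block);
  char *block = thread_local_storage_pointer[tls_index]; /* array index */
  return block + offset;                                 /* add the offset */
}
```

No function call is needed on the real access path; the compiler emits the two loads and the add inline.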
1 | movl _tls_index(%rip), %eax |
Referencing a TLS variable from another DLL is not supported.
1 | __declspec(dllimport) extern thread_local int tls; |
There are a lot of details but my personal understanding of Windows does not allow me to say more ;-) Interested readers can go to Thread Local Storage, part 3: Compiler and linker support for implicit TLS.
libc API for TLS blocks
Sanitizers' runtime needs TLS blocks for a variety of use cases. See https://sourceware.org/bugzilla/show_bug.cgi?id=16291 for a glibc feature request. Read on for a detailed description.
In LLVM, OrcJIT has a desire to register TLS blocks. Lang Hames told me that he has got native TLS working by implementing dyld’s TLS support APIs in the Orc runtime.
Florian Weimer posted Thread properties API in 2021-05.
Why does compiler-rt need to know TLS blocks?
AddressSanitizer "asan" (-fsanitize=address)
The main task of AddressSanitizer is to detect addressability problems. If a regular memory byte is not addressable (i.e. accesses should be UB), it is said to be poisoned and the associated shadow encodes the addressability information (all unpoisoned/all poisoned/partly poisoned).
On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (test/asan/TestCases/Linux/unpoison_tls.cpp; introduced in [asan] Make ASan report the correct thread address ranges to LSan.) The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from later TSD destructors.
Note: if the allocation is rtld/libc internal and not intercepted, there is no need to unpoison the range. The associated shadow is supposed to be zeros. However, if the allocation is intercepted, the runtime should unpoison the range in case the range reuses a previous allocation which happens to contain poisoned bytes.
In glibc, _dl_allocate_tls and _dl_deallocate_tls call malloc/free functions which are internal and not intercepted, so the allocations are opaque to the runtime and the shadow bytes are all zeroes.
Hardware-assisted AddressSanitizer "hwasan" (-fsanitize=hwaddress)
Its ClearShadowForThreadStackAndTLS is similar to asan's.
LeakSanitizer "lsan" (-fsanitize=leak)
LeakSanitizer detects memory leaks. On many targets, it is integrated (and enabled by default) in AddressSanitizer, but it can be used standalone. The checker is triggered by an atexit hook (the default options are LSAN_OPTIONS=detect_leaks=1:leak_check_at_exit=1), but it can also be invoked via __lsan_do_leak_check.
Each supported platform provides an entry point: StopTheWorld (e.g. Linux 1), which does the following:
- Invoke the clone syscall to create a new process which shares the address space with the calling process.
- In the new process, list threads by iterating over /proc/$pid/task/.
- In the new process, call SuspendThread (ptrace PTRACE_ATTACH) to suspend a thread.
StopTheWorld returns. The runtime performs mark-and-sweep, reports leaks, and then calls ResumeAllThreads (ptrace PTRACE_DETACH).
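The thread-listing step can be tried out from ordinary user code. The sketch below counts the entries of /proc/self/task for the calling process. Unlike the real tracer, it freely calls libc, and `count_tasks` is a hypothetical helper name.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Counts the threads of the calling process by listing /proc/self/task.
   Returns -1 if the directory cannot be opened (e.g. not on Linux). */
int count_tasks(void) {
  DIR *dir = opendir("/proc/self/task");
  if (dir == NULL)
    return -1;
  int n = 0;
  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    if (isdigit((unsigned char)ent->d_name[0]))
      n++; /* each numeric entry is a thread ID */
  }
  closedir(dir);
  assert(n >= 1); /* the calling thread is always listed */
  return n;
}
```

A tracer would pass each listed thread ID to ptrace with PTRACE_ATTACH and re-scan the directory until no new threads appear.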
Note: the implementation cannot call libc functions. It does not perform code injection. The root set includes static/dynamic TLS blocks for each thread.
(The pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize. https://code.woboq.org/llvm/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp.html#411 I am not sure musl needs this.)
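Such an adjustment might look like the following sketch. It is not the compiler-rt code: `adjust_stack_size` is a hypothetical function that takes the TLS size as a parameter, and the `STACK_HEADROOM` value is an assumption chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical headroom on top of the TLS size; the value is an assumption. */
enum { STACK_HEADROOM = 128 * 1024 };

/* Raises the stack size in attr so that a thread whose static TLS block is
   carved out of its stack mapping still has room to run. */
void adjust_stack_size(pthread_attr_t *attr, size_t tls_size) {
  size_t stack_size = 0;
  int rc = pthread_attr_getstacksize(attr, &stack_size);
  assert(rc == 0);
  size_t min_size = tls_size + STACK_HEADROOM;
  if (stack_size < min_size) {
    rc = pthread_attr_setstacksize(attr, min_size);
    assert(rc == 0);
  }
}
```

This only matters on libc implementations that place the static TLS block inside the thread stack mapping, which is why it may be unnecessary elsewhere.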
Intercepting __tls_get_addr is useful to lsan but is not necessary. First, the Linux InitializePlatformSpecificModules implementation ignores leaks from the dynamic loader. Second, allocations called by __tls_get_addr are suppressed by a built-in rule leak:*tls_get_addr in kStdSuppressions.
The current lsan implementation places more requirements on GetTls: it does not intercept pthread_setspecific. Instead, it expects the range returned by GetTls to include pointers to pthread_setspecific regions; otherwise there would be false positive leak reports.
In addition, lsan gets the static TLS boundaries at pthread_create time and expects the boundaries to include TLS blocks of dynamically loaded modules. This means that the range returned by GetTls needs to include the static TLS surplus.
(You might ask: since the thread control block has the dtv pointer, why can't lsan track the referenced allocations? Well, for threads, rtld/libc implementations typically allocate the static TLS blocks as part of the thread stack, which is not seen by the runtime, so the runtime does not know about the allocations.)
On glibc, the range returned by GetTls includes pthread::{specific_1stblock,specific} for thread-specific data keys. There is currently a hack to ignore allocations from ld.so allocated dynamic TLS blocks. Note: if the pthread::{specific_1stblock,specific} pointers are encrypted, lsan cannot track the allocation.
MemorySanitizer "msan" (-fsanitize=memory)
MemorySanitizer detects uses of uninitialized memory. If a regular memory byte has uninitialized (poisoned) bits, the corresponding bits of its associated shadow byte are set to one.
Similar to asan. On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (test/msan/tls_reuse.cpp) The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from TSD destructors.
msan needs to do more than asan: the __tls_get_addr interceptor (DTLS_on_tls_get_addr) detects new dynamic TLS blocks and unpoisons the shadow. This is needed because ld.so calls a non-interposable memset to clear the blocks, so the runtime does not observe the initialization. Without the unpoisoning, if a dynamic TLS block reuses a previous allocation with poison, there may be false positives. One way to semi-reliably trigger this is (test/msan/dtls_test.cpp https://github.com/google/sanitizers/issues/547):
- in a thread, write an uninitialized (poisoned) value to a dynamic TLS block
- destroy the thread
- create a new thread
- try making the new thread reuse the poisoned dynamic TLS block.
Note: aarch64 uses TLSDESC by default and there is no interposable symbol.
During the development of glibc 2.19, commit 1f33d36a8a9e78c81bed59b47f260723f56bb7e6 ("Patch 2/4 of the effort to make TLS access async-signal-safe.") was checked in. DTLS_on_tls_get_addr detects the __signal_safe_memalign header and considers it a dynamic TLS block if the block is not within the static TLS boundaries. Commit dd654bf9ba1848bf9ed250f8ebaa5097c383dcf8 ("Revert "Patch 2/4 of the effort to make TLS access async-signal-safe."") reverted __signal_safe_memalign, but the implementation remains in grte branches.
See also Re: glibc 2.19 - asyn-signal safe TLS and ASan.
Similar to lsan: the pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize.
ThreadSanitizer "tsan" (-fsanitize=thread)
Similar to lsan: the pthread_create interceptor calls AdjustStackSize which computes a minimum stack size with GetTlsSize.
Similar to msan, the runtime unpoisons TLS blocks to avoid false positives. Tested by test/tsan/dtls.c (D20927). tsan also needs to intercept __tls_get_addr. The problem that aarch64 TLSDESC does not have an interposable symbol also applies.
I wrongly thought https://reviews.llvm.org/D93866 was a workaround. https://sourceware.org/pipermail/libc-alpha/2021-January/121352.html explained that the code has not materially changed since 2012.
For dynamic TLS blocks, older glibc (e.g. 2.23) calls __libc_memalign, which is intercepted (tsan/rtl/tsan_interceptors_posix.cpp); since BZ #17730, newer glibc (e.g. 2.32) calls malloc.
NumericalSanitizer "nsan" (-fsanitize=numerical)
Similar to msan and dfsan, the runtime unpoisons TLS blocks to avoid false positives (#102718).
glibc TLS allocation
For dynamic TLS blocks, allocate_and_init allocates the block.
For a new thread, glibc for Variant II ports allocates a memory region, places the static TLS block (which contains the pthread structure) at the end (nptl/allocatestack.c), and uses the remaining space as the thread stack. The stack pointer may be just a few hundred bytes below the canary address (%fs:0x28 on x86-64) and a large out-of-bounds write can potentially overwrite it.
Android bionic
Android bionic (API level 31) introduced some TLS APIs in libc/include/sys/thread_properties.h. __libc_get_static_tls_bounds and __libc_iterate_dynamic_tls are used in compiler-rt.
1 | /** |
dalias's notes
1 | <@dalias> i think the api proposed there looks wrong |
Test TLS:
cat > ./a.c <<eof
#include <assert.h>
int foo();
int bar();
int main() {
assert(foo() == 2);
assert(foo() == 4);
assert(bar() == 2);
assert(bar() == 4);
}
eof
cat > ./b.c <<eof
__thread int tls0;
extern __thread int tls1;
int foo() { return ++tls0 + ++tls1; }
static __thread int tls2, tls3;
int bar() { return ++tls2 + ++tls3; }
eof
echo '__thread int tls1;' > ./c.c
sed 's/^  /\t/' > ./Makefile <<'eof'
.MAKE.MODE = meta curDirOk=true
CC := gcc -m32 -g -fpic -mtls-dialect=gnu2
LDFLAGS := -m32 -Wl,-rpath=.
all: a0 a1 a2
run: all
  ./a0 && ./a1 && ./a2
c.so: c.o; ${LINK.c} -shared $> -o $@
bc.so: b.o c.o; ${LINK.c} -shared $> -o $@
b.so: b.o c.so; ${LINK.c} -shared $> -o $@
a0: a.o b.o c.o; ${LINK.c} $> -o $@
a1: a.o b.so; ${LINK.c} $> -o $@
a2: a.o bc.so; ${LINK.c} $> -o $@
eof
1 | bmake run && bmake CFLAGS=-O1 run |