In recent ISO C++ standards, [depr.c.headers] describes how a C header name.h is transformed to the corresponding C++ cname header. There is a helpful example:
[ Example: The header <cstdlib> assuredly provides its declarations and definitions within the namespace std. It may also provide these names within the global namespace. The header <stdlib.h> assuredly provides the same declarations and definitions within the global namespace, much as in the C Standard. It may also provide these names within the namespace std. — end example ]
"may also" in the wording allows implementations to provide mix-and-match, e.g. std::exit can be used with #include <stdlib.h> and ::exit can be used with #include <cstdlib>.
libstdc++ chooses to enable global namespace declarations with the C++ cname headers. For example, #include <cstdlib> also includes the corresponding C header stdlib.h, so we get declarations in both the global namespace and the namespace std.
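Here is a minimal sketch of the mix-and-match. It assumes an implementation such as libstdc++ where <cstdlib> also declares the names in the global namespace; the standard permits this but does not require it, so the ::strtol call is not guaranteed to compile everywhere. The function names parse_with_std and parse_with_global are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>  // guaranteed: std::strtol; permitted but not required: ::strtol

// Uses the name in namespace std, which <cstdlib> must provide.
long parse_with_std(const char *s) {
  assert(s != nullptr);
  return std::strtol(s, nullptr, 10);
}

// Uses the global-namespace name. This relies on the "may also" permission;
// it works with libstdc++ because <cstdlib> includes stdlib.h.
long parse_with_global(const char *s) {
  assert(s != nullptr);
  return ::strtol(s, nullptr, 10);
}
```

Both functions end up calling the same C library function; only the name lookup differs.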
The compiler knows that the declarations in std are identical to the ones in the global namespace. The compiler recognizes some library functions and can optimize them. Because the std names refer to the same global declarations (libstdc++ brings them into std with using-declarations), the compiler can also optimize some C library functions in namespace std (e.g. many std::mem* and std::str* functions).
For some C standard library headers, libstdc++ provides wrappers (libstdc++-v3/include/c_compatibility/) which take precedence over the glibc headers. libstdc++ is configured with --enable-cheaders=c_global by default. The if GLIBCXX_C_HEADERS_C_GLOBAL conditional in libstdc++-v3/include/Makefile.am lists the 6 wrappers (complex.h, fenv.h, tgmath.h, math.h, stdatomic.h, stdlib.h) that shadow the C library headers of the same name. For example, #include <stdlib.h> includes the wrapper stdlib.h, which includes cstdlib, therefore bringing exit into the namespace std.
Recently I have fixed two glibc rtld bugs related to early GOT relocation for retro-computing architectures: m68k and powerpc32. They are related to the obscure PI_STATIC_AND_HIDDEN macro which I am going to demystify.
In 2002, PI_STATIC_AND_HIDDEN was introduced into glibc rtld (runtime loader). This macro indicates whether accesses to the following types of variables need dynamic relocations.
static specifier: static int a; (STB_LOCAL)
hidden visibility attribute: __attribute__((visibility("hidden"))) int a; (STB_GLOBAL STV_HIDDEN), __attribute__((weak, visibility("hidden"))) int a; (STB_WEAK STV_HIDDEN)
PI in the macro name is an abbreviation for "position independent". This is a misnomer: a code sequence using GOT is typically position-independent as well.
In -fPIC mode, the compiler assumes that all non-local STV_DEFAULT symbols may be preemptible at run time. A GOT-generating relocation is used and the GOT is typically unavoidable at link time (on some architectures the linker can optimize out the GOT). This case is not interesting to rtld as rtld does not need to export such variables.
Excluding these cases (non-local STV_DEFAULT), all other variables are known to be non-preemptible at compile time. The compiler can generate code which is guaranteed to avoid dynamic relocations at link time.
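To make the three categories concrete, here is a small sketch. The variable and function names are hypothetical, and the visibility attribute is a GCC/Clang extension. Compiling it with something like g++ -fPIC -O2 -S and comparing the three bump_* functions should show which accesses go through the GOT on a given target.

```cpp
#include <cassert>

// Internal linkage: lowers to an STB_LOCAL symbol.
static int local_counter;

// External linkage with hidden visibility: STB_GLOBAL STV_HIDDEN.
__attribute__((visibility("hidden"))) int hidden_counter;

// Default visibility: in -fPIC mode the compiler has to assume the symbol
// may be preempted at run time, so the access typically uses the GOT.
int default_counter;

// Each helper returns the value after the increment.
int bump_local() { return ++local_counter; }
int bump_hidden() { return ++hidden_counter; }
int bump_default() { return ++default_counter; }

// Tiny usage check: two consecutive bumps differ by one.
void check_bumps() {
  int first = bump_hidden();
  assert(bump_hidden() == first + 1);
}
```

The first two variables are the kinds that PI_STATIC_AND_HIDDEN is about; the third is the non-local STV_DEFAULT case that rtld does not care about.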
Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC architectures with PC-relative instructions
To avoid dynamic relocations, the most common approach is to generate PC-relative instructions, as most modern architectures (e.g. aarch64, riscv, and x86-64) provide. Using PC-relative instructions to reference variables assumes that the distance from code to data is a link-time constant. Nowadays this condition is satisfied everywhere except the rare FDPIC ABI.
Here are some assembly fragments from architectures using PC-relative instructions. The instructions may not be familiar to you, but that is fine. We can see that there is no GOT-related marker. I have added some comments indicating the relocation type and the referenced symbol. var in the C code has internal linkage, which lowers to the STB_LOCAL binding. References to such local symbols are often redirected to the section symbol (.bss): the link-time behaviors are identical.
Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC architectures without PC-relative instructions
Many older architectures do not have PC-relative instructions.
x86-32 does not have PC-relative instructions, but it provides a way to avoid a load from a GOT entry. It achieves this with a detour: compute the address of _GLOBAL_OFFSET_TABLE_ (GOT base symbol), then add an offset (S-_GLOBAL_OFFSET_TABLE_) to get the symbol address. _GLOBAL_OFFSET_TABLE_ is computed this way: compute the address of a location in code, then add an offset (_GLOBAL_OFFSET_TABLE_ - PC).
You probably see now how the x86-32 ABI was misdesigned: the involvement of _GLOBAL_OFFSET_TABLE_ is unnecessary. A relocation with the calculation of S-_GLOBAL_OFFSET_TABLE_ would achieve the same net effect.
The relocations with GOT in their names just use the GOT as an anchor. They don't indicate a load from a GOT entry.
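The arithmetic can be checked with a toy model. All addresses below are invented, and the struct and function names are hypothetical. The two offsets correspond to what R_386_GOTPC (GOT - P) and R_386_GOTOFF (S - GOT) compute, with addends left out for simplicity.

```cpp
#include <cassert>
#include <cstdint>

// Link-time addresses of a code location, the GOT base symbol, and a
// non-preemptible symbol S. The values are made up for illustration.
struct LinkTimeLayout {
  std::uint32_t code;
  std::uint32_t got;
  std::uint32_t sym;
};

// The x86-32 detour: first recover the run-time GOT base from the run-time
// code address, then add S - _GLOBAL_OFFSET_TABLE_. No GOT entry is loaded.
std::uint32_t resolve_via_got_anchor(const LinkTimeLayout &lt, std::uint32_t runtime_code) {
  std::uint32_t got_minus_pc = lt.got - lt.code;  // link-time constant
  std::uint32_t s_minus_got = lt.sym - lt.got;    // link-time constant
  std::uint32_t got_base = runtime_code + got_minus_pc;
  return got_base + s_minus_got;
}

// What a single S - P style relocation would compute directly.
std::uint32_t resolve_direct(const LinkTimeLayout &lt, std::uint32_t runtime_code) {
  return runtime_code + (lt.sym - lt.code);
}

// The two routes agree for any layout and load address.
void check_equivalence(const LinkTimeLayout &lt, std::uint32_t runtime_code) {
  assert(resolve_via_got_anchor(lt, runtime_code) == resolve_direct(lt, runtime_code));
}
```

The GOT base cancels out of the sum, which is why the anchor adds nothing except an extra relocation type.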
powerpc64 does not have PC-relative instructions before POWER10. Earlier microarchitectures use TOC-relative relocations to compute the symbol address.
A few older architectures tend to use a load from a GOT entry. The GOT entry needs a relative relocation (instead of R_*_GLOB_DAT: the symbol is non-preemptible, so no symbol search is needed). See All about Global Offset Table. In glibc, these architectures define HIDDEN_VAR_NEEDS_DYNAMIC_RELOC.
Some architectures even assume the distance from code to data may not be a link-time constant (see All about Procedure Linkage Table). They do not provide a relocation with a calculation of S-_GLOBAL_OFFSET_TABLE_ or S-P.
    # nios2: r22 is a callee-saved register which requires a spill and expensive setup
    ldw r3, %got(var)(r22)  # R_NIOS2_GOT16 var
    ldw r2, 0(r3)
    addi r2, r2, 1
    stw r2, 0(r3)

    # powerpc32: r30 is a callee-saved register which requires a spill and expensive setup
    lwz 9,.LC0-.LCTOC1(30)
    lwz 3,0(9)
    addi 3,3,1
    stw 3,0(9)

    .section ".got2","aw"  # Like a manual GOT section
    .align 2
    .LCTOC1 = .+32768
    .LC0:
    .long .LANCHOR0  # R_PPC_ADDR32 .bss; may become R_PPC_RELATIVE at link time

    .section ".bss"
    .set .LANCHOR0,. + 0
    var:
    .zero 4
The first task of rtld is to relocate itself and bind all symbols to itself. Afterward, non-preemptible functions and data can be freely accessed.
On architectures where a GOT entry is used to access a non-preemptible variable, rtld needs to be careful not to reference such variables before relative relocations are applied. In rtld.c, _dl_start has the following code:
    if (bootstrap_map.l_addr)
      {
        // Apply R_*_RELATIVE, R_*_GLOB_DAT, and R_*_JUMP_SLOT.
        ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0);
      }
_rtld_local_ro is a hidden global variable. The compiler may reorder taking its address before ELF_DYNAMIC_RELOCATE. On an architecture that loads the address from a GOT entry, the reordering makes the subsequent memory store (_rtld_local_ro.dl_find_object) crash, since the address read from the GOT is incorrect: it is zero or the link-time address instead of the run-time address.
I was pretty sure there was a relocation bug, but it was not immediately clear which piece of code was at fault.
Nowadays there aren't many choices for powerpc32 images. Void Linux ppc still provides powerpc32 glibc and musl images. I downloaded one and fed it into qemu, booted it with qemu-system-ppc -machine mac99 -m 2047M -cdrom void-live-ppc-20210825.iso -net nic -net user,smb=$HOME/Dev -boot d. I booted into the 4.4.261 kernel because gdb aborts immediately with 5.13.12 kernel. Daniel Kolesa mentioned this 5.x kernel incompatibility to me and nobody has looked into it yet.
The live CD provides free space of about 1GiB and I can install cifs-utils and gdb. Then run ld.so under gdb.
    xbps-install -S
    xbps-install cifs-utils gdb cgdb
    mkdir ~/Dev
    mount -t cifs -o vers=3.0 //10.0.2.4/qemu ~/Dev
    cd ~/Dev/glibc/out/ppc
    gdb -ex 'directory ../../elf' -ex r elf/ld.so
ld.so has 671 R_68K_RELATIVE relocations and one R_68K_GLOB_DAT for __stack_chk_guard@@GLIBC_2.4. The following function is used to apply a relocation. It is shared by self-relocation and relocation for other modules. The self-relocation code defines RTLD_BOOTSTRAP and needs just R_68K_RELATIVE, R_68K_GLOB_DAT, and R_68K_JMP_SLOT.
    if (__builtin_expect (r_type == R_68K_RELATIVE, 0))
      *reloc_addr = map->l_addr + reloc->r_addend;
    else
      {
        ...
        switch (r_type)
          {
          case R_68K_COPY:
            ...
          case R_68K_GLOB_DAT:
          case R_68K_JMP_SLOT:
            *reloc_addr = value;
            break;
However, somehow many case labels were available for self-relocation. GCC compiles the switch statement into a jump table which requires loading an address from GOT. With some clean-up to generic relocation code, GCC decides to perform loop-invariant code motion and hoists the load of the jump table address. The hoisted load is before relative relocations are applied, so the jump table address is incorrect.
The foolproof approach is to add an optimization barrier (e.g. calling a non-inlinable function after relative relocations are resolved). That is non-trivial given the code structure. So Andreas Schwab suggested a simpler approach that avoids the jump table: handle just the essential relocations.
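The idea of handling only the essential relocations can be sketched as follows. This is not the glibc code: the relocation kinds, the Reloc layout, and the function names are all hypothetical, and the image is just an array of 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical relocation kinds; real ports use R_68K_RELATIVE and friends.
enum RelocType : std::uint32_t { kRelative = 1, kGlobDat = 2, kJmpSlot = 3 };

// Hypothetical record, loosely modeled on an Elf32_Rela entry.
struct Reloc {
  std::size_t index;     // which word of the image to patch (stands in for r_offset)
  RelocType type;
  std::uint32_t addend;  // stands in for r_addend
  std::uint32_t value;   // resolved symbol value for the non-relative kinds
};

// Applies one relocation with plain comparisons. With only three kinds
// handled by an if-chain, the compiler has little reason to emit a jump table.
inline void apply_bootstrap_reloc(std::uint32_t *image, std::uint32_t load_bias, const Reloc &r) {
  if (r.type == kRelative)
    image[r.index] = load_bias + r.addend;
  else if (r.type == kGlobDat || r.type == kJmpSlot)
    image[r.index] = r.value;
  // Other kinds are not needed for self-relocation and are ignored here.
}

void apply_all(std::uint32_t *image, std::uint32_t load_bias, const Reloc *relocs, std::size_t n) {
  assert(image != nullptr && relocs != nullptr);
  for (std::size_t i = 0; i < n; ++i)
    apply_bootstrap_reloc(image, load_bias, relocs[i]);
}
```

Whether a compiler actually avoids a jump table is a code generation detail, so a change like this still needs to be verified by inspecting the generated assembly.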
The faulty code was well concealed and I could not have found it without a debugger. It took me a while to set up an m68k image using q800. The memory is limited to 1000MiB and the emulation is very slow. Linux 5.19 is expected to gain support for a virtual Motorola 68000 machine. With qemu-system-m68k -M virt things will become better.
    # Installation
    7z x debian-11.0.0-m68k-NETINST-1.iso install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz
    mv install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz .
stage 1 (ldso/dlstart.c): only relative relocations are applied. This allows static variables to be accessed.
stage 2 __dls2: This applies non-relative relocations.
stage 2b __dls2b: Set up thread pointer with a TLS stub.
stage 3 __dls3: Load the executable and immediately loaded shared objects. Apply relocations and possibly relocate rtld/libc itself again for possible symbol interposition (e.g. R_*_COPY, interposed malloc implementation).
Each stage uses a PC-relative code sequence to load the address of the next stage entry point, and then jump to it. This serves as a strong compiler barrier preventing code reordering.
(In glibc, elf/rtld.c's ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0); is kind of like musl's stage 1 plus stage 2.)
Stage 1 computes the entry of stage 2 with GETFUNCSYM(&dls2, __dls2, base+dyn[DT_PLTGOT]); where GETFUNCSYM is defined for every port:
This approach is elegant. It even allows a static or hidden function call with a dynamic relocation, though I haven't found such an architecture in my testing.
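The barrier effect can be imitated in portable GCC/Clang C++ as below. This is not musl's GETFUNCSYM: instead of a PC-relative sequence it routes the next stage's address through a static pointer and an empty asm statement, and the names stage1, stage2, and next_stage are hypothetical.

```cpp
#include <cassert>

static int stage_reached = 0;

// Hypothetical second stage. In a real loader this would run only after the
// relocations it depends on have been applied.
static void stage2() {
  assert(stage_reached == 1);  // stage1 must have run first
  stage_reached = 2;
}

// Hypothetical first stage. The empty asm statement claims to modify
// next_stage and clobber memory, so the compiler cannot turn the indirect
// call into a direct one, inline stage2, or move memory accesses across it.
void stage1() {
  stage_reached = 1;
  static void (*next_stage)() = stage2;
  __asm__ __volatile__("" : "+m"(next_stage) : : "memory");
  next_stage();
}
```

In position-independent code the static pointer itself needs a relative relocation, so a scheme like this is only usable once relative relocations have been applied.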
glibc's sigaction sets the sa_restorer field of sigaction to __restore_rt, and sets the SA_RESTORER flag. The kernel sets up the __restore_rt frame with saved process context information (ucontext_t structure) before jumping to the signal handler. See kernel arch/x86/kernel/signal.c:setup_rt_frame. Upon returning from the signal handler, control passes to __restore_rt. See man 2 sigreturn.
__restore_rt is implemented in assembly. It comes with DWARF call frame information in .eh_frame.
With the information, libunwind can unwind through the trampoline without knowing the ucontext_t structure. Note that all general purpose registers are encoded. libunwind/docs/unw_get_reg.man says
However, for signal frames (see unw_is_signal_frame(3)), it is usually possible to access all registers.
Volatile registers are also saved in the saved process context information. This is different from other frames where volatile registers' information is typically lost.
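For reference, here is a minimal POSIX program fragment whose handler returns through the signal trampoline. The names handler, got_signal, and raise_and_catch are made up. Note that user code does not touch sa_restorer; per the description above, the libc wrapper fills it in.

```cpp
#include <cassert>
#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

// Runs on top of the frame the kernel set up; returning from here transfers
// control to the trampoline, which invokes sigreturn.
static void handler(int signo) { got_signal = signo; }

// Installs a SIGUSR1 handler and raises the signal once.
// Returns the signal number seen by the handler, or -1 on failure.
int raise_and_catch() {
  struct sigaction sa = {};
  sa.sa_handler = handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  if (sigaction(SIGUSR1, &sa, nullptr) != 0)
    return -1;
  got_signal = 0;
  if (raise(SIGUSR1) != 0)
    return -1;
  // raise() returns only after the handler has returned.
  return got_signal;
}

void demo() { assert(raise_and_catch() == SIGUSR1); }
```

Calling an unwinder from inside handler is what exercises the trampoline frame.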
As a relatively new port, Linux AArch64 defines the signal trampoline __kernel_rt_sigreturn in the VDSO (see arch/arm64/kernel/vdso/sigreturn.S). This is unlike x86-64 which defines the function in libc. We can use gdb to dump the VDSO.
    (gdb) i proc m
    process 430749
    ...
    0xfffff7ffc000 0xfffff7ffd000 0x1000 0x0 [vdso]
    (gdb) dump binary memory vdso.so 0xfffff7ffc000 0xfffff7ffd000
As of Linux 5.8 (https://git.kernel.org/linus/87676cfca14171fc4c99d96ae2f3e87780488ac4), vdso.so does not have PT_GNU_EH_FRAME. Therefore unwinders (llvm-project libunwind, nongnu libunwind, libgcc_s.so.1) ignore its unwind tables. In gdb, gdb/aarch64-linux-tdep.c recognizes the two instructions and encodes how the kernel sets up the ucontext_t structure.
Previously, vdso.so generated a small set of CFI instructions to encode X29 (FP) and X30 (LR).
However, there was a serious problem: CFI cannot describe a signal trampoline frame. AArch64 does not define a register number for PC and provides no direct way to encode the PC of the previous frame. Instead, it sets return_address_register to X30 and the unwinder updates the PC to whatever value the saved X30 is. Actually, with unw_get_reg(&cursor, UNW_REG_IP, &pc); unw_get_reg(&cursor, UNW_AARCH64_X30, &x30);, we know pc == x30. This approach works fine when LR forms a chain since we know between two adjacent frames, the sets {PC, X30} differ by one element. However, when unwinding through the signal trampoline, the CFI can describe the previous PC but not the previous X30.
musl x86-64
src/signal/x86_64/restore.s implements a signal trampoline __restore_rt. There is no .eh_frame information.
nongnu libunwind does not know that __restore_rt is a signal trampoline (unw_is_signal_frame always returns 0). On ELF targets, -O1 and above typically imply -fomit-frame-pointer and many functions do not save RBP. Note: some functions may save RBP even with -fomit-frame-pointer.
In the absence of a valid frame chain, combined with the fact that nongnu libunwind does not recognize Linux x86-64's signal trampoline, libunwind cannot unwind through the __restore_rt frame. gdb recognizes the signal trampoline frame and with its FP-based unwinding it can retrieve several frames, but not the ones above raise.
    % ld.lld @response.release.txt && ./nongnu
    pc=0x0000000000206add sp=0x00007ffc018618a0 0 ./nongnu:
    pc=0x00007f9fedcd602f sp=0x00007ffc018620c0 0 /home/ray/musl/out/release/lib/libc.so:
    pc=0x0000000000000000 sp=0x00007ffc01862db0 0
    % gdb ./nongnu -x =(printf 'b handler\nhandle SIGUSR1 nostop\nr\nbt')
    ...
    #0 handler (signo=10) at a.c:9
    #1 <signal handler called>
    #2 0x00007ffff7fae78a in __restore_sigs () from /home/ray/musl/out/release/lib/libc.so
    #3 0x00007ffff7fae8f1 in raise () from /home/ray/musl/out/release/lib/libc.so
    #4 0x0000000000000000 in ?? ()
If musl is built with -fno-omit-frame-pointer, nongnu libunwind will use its FP-based fallback (see src/x86_64/Gstep.c). The output looks like:
unw_step uses the saved RBP to infer RSP/RBP/RIP in the previous frame. If the signal handler saves RBP and calls unw_step, the saved RBP is essentially the RBP value in the signal trampoline frame.
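One frame-pointer step can be modeled on a toy address space. This is not libunwind's code; it only assumes the conventional x86-64 prologue (push %rbp; mov %rsp, %rbp), and the Memory, Cursor, fp_step, and fp_backtrace names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// A toy address space: 8-byte words keyed by address.
using Memory = std::map<std::uint64_t, std::uint64_t>;

struct Cursor {
  std::uint64_t rip, rsp, rbp;
};

// With the conventional prologue, [rbp] holds the caller's RBP and [rbp+8]
// holds the return address. After "leave; ret" the caller's RSP is rbp+16.
bool fp_step(const Memory &mem, Cursor &c) {
  auto saved_rbp = mem.find(c.rbp);
  auto ret_addr = mem.find(c.rbp + 8);
  if (saved_rbp == mem.end() || ret_addr == mem.end() || ret_addr->second == 0)
    return false;
  c.rip = ret_addr->second;
  c.rsp = c.rbp + 16;
  c.rbp = saved_rbp->second;
  return true;
}

// Collects program counters until the chain ends (capped to avoid cycles).
std::vector<std::uint64_t> fp_backtrace(const Memory &mem, Cursor c) {
  std::vector<std::uint64_t> pcs{c.rip};
  while (pcs.size() < 64 && fp_step(mem, c))
    pcs.push_back(c.rip);
  assert(!pcs.empty());
  return pcs;
}
```

If any function in the chain does not maintain RBP as a frame pointer, the value at [rbp] is arbitrary and the walk stops or goes astray, which is why the frames that call into user code matter most.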
Actually, not every source file needs to be built with -fno-omit-frame-pointer. We just need to build the source files that transfer control to the user program, and their callers. For this example, building src/signal/raise.c with -fno-omit-frame-pointer allows us to unwind to main. Additionally rebuilding src/env/__libc_start_main.c allows us to unwind to _start.
musl's Makefile specifies -fno-asynchronous-unwind-tables (see option to enable eh_frame for a 2011 discussion). If CFLAGS -g is specified, libc.so will have .debug_frame. gdb can retrieve the caller of raise:
    #0 handler (signo=10) at a.c:9
    #1 <signal handler called>
    #2 __restore_sigs (set=set@entry=0x7fffffffe240) at ../../arch/x86_64/syscall_arch.h:40
    #3 0x00007ffff7fa36e0 in raise (sig=sig@entry=10) at ../../src/signal/raise.c:11
    #4 0x00000000002071ff in main () at a.c:33
nongnu libunwind can be built with --enable-debug-frame to support .debug_frame. Unfortunately, since it does not recognize the signal trampoline, it cannot retrieve the main frame for this example.
Unwinders' compatibility with libc implementations
The values represent how the unwinder unwinds through the signal trampoline frame.
Linux kernel arch/x86/kernel/signal.c:setup_rt_frame
Core dump
The kernel core dumper coredump.c is simple. The glibc __restore_rt page or the VDSO is not prioritized in the presence of a core file limit. If the page is missing in the core file, gdb prog core -ex bt -batch will not be able to unwind past the signal trampoline. A userspace core dumper may be handy.
FORTRAN 77 COMMON blocks compiled to COMMON symbols. You could declare a COMMON block in more than one file, with each specifying the number, type, and size of the variable. The linker allocated enough space to satisfy the largest size.
In a GCC/Clang -g1 or -g2 build, the debug information is often much larger than text sections. Some assemblers and linkers offer an optional feature which compresses debug sections.
History
In 2007-11, Craig Silverstein added --compress-debug-sections=zlib to gold. When the option was specified, gold compressed the content of a .debug* section with zlib and changed the section name to .debug*.zlib.$uncompressed_size.
Unix-like systems represent static libraries as .a archives. A .a archive consists of a header and a collection of files with metadata. Its usage is tightly coupled with the linker. An archive almost always contains only relocatable object files and the linker has built-in support for reading it.
    % as /dev/null -o a.o
    % rm -f b.a && ar rc b.a a.o
    % ar t b.a
    a.o
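The "header plus members with metadata" layout is simple enough to parse by hand. The sketch below lists member names from an in-memory archive in the common System V/GNU layout: an 8-byte magic followed by members, each with a 60-byte header. It is a rough analogue of ar t, not a replacement: long names are not resolved, BSD-style names are not handled, and list_members and make_member are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Builds one member: a 60-byte header followed by data padded to even length.
// Header fields: name[16] date[12] uid[6] gid[6] mode[8] size[10] fmag[2].
std::string make_member(const std::string &name, const std::string &data) {
  assert(name.size() <= 16);
  char hdr[61];
  std::snprintf(hdr, sizeof hdr, "%-16s%-12s%-6s%-6s%-8s%-10zu`\n", name.c_str(),
                "0", "0", "0", "644", data.size());
  std::string out(hdr, 60);
  out += data;
  if (data.size() & 1)
    out += '\n';
  return out;
}

// Lists member names. Returns an empty list on malformed input.
std::vector<std::string> list_members(const std::string &ar) {
  static const std::string kMagic = "!<arch>\n";
  std::vector<std::string> names;
  if (ar.compare(0, kMagic.size(), kMagic) != 0)
    return names;
  std::size_t pos = kMagic.size();
  while (pos + 60 <= ar.size()) {
    if (ar.compare(pos + 58, 2, "`\n") != 0)
      return {};
    std::string name = ar.substr(pos, 16);
    std::size_t size = std::strtoul(ar.substr(pos + 48, 10).c_str(), nullptr, 10);
    name.erase(name.find_last_not_of(' ') + 1);  // trim the space padding
    // Skip the special members "/" (symbol table) and "//" (long name table).
    if (name != "/" && name != "//") {
      if (name.size() > 1 && name.back() == '/')
        name.pop_back();  // GNU ar terminates short names with '/'
      names.push_back(name);
    }
    pos += 60 + size + (size & 1);  // member data is padded to an even size
  }
  return names;
}
```

Nothing in the format restricts what a member may contain, which is why ar happily accepts arbitrary files.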
One may add other types of files to .a but that is almost assuredly a bad thing.
    % rm -f a.a && ar rc a.a a.o b.a  # archive in archive, bad
    % ar t a.a
    a.o
    b.a
    % echo hello > a.txt
    % rm -f a.a && ar rc a.a a.o a.txt  # text file in archive, bad
The original linker designers noticed that for many programs not every member was needed, so they tried to allow the linker to skip unused members. Therefore, they invented the interesting but confusing archive member extraction rule. See Symbol processing#Archive processing for details.
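A heavily simplified model of the extraction rule: a member is pulled in only if it defines a symbol that is currently undefined, and the archive is rescanned until nothing more is extracted. The Object and Linker types are hypothetical, and real linkers differ in details such as weak symbols, COMMON symbols, and whether earlier archives are revisited.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of a relocatable object file.
struct Object {
  std::string name;
  std::set<std::string> defined;
  std::set<std::string> undefined;
};

struct Linker {
  std::set<std::string> defined, undefined;
  std::vector<std::string> linked;

  // A regular object file on the command line is always linked in.
  void add_object(const Object &o) {
    linked.push_back(o.name);
    for (const auto &s : o.defined) {
      defined.insert(s);
      undefined.erase(s);
    }
    for (const auto &s : o.undefined)
      if (!defined.count(s))
        undefined.insert(s);
  }

  // An archive member is extracted only if it defines a currently undefined
  // symbol. An extracted member may add new undefined symbols, so repeat
  // until a full pass extracts nothing.
  void add_archive(const std::vector<Object> &members) {
    std::vector<bool> taken(members.size(), false);
    for (bool changed = true; changed;) {
      changed = false;
      for (std::size_t i = 0; i < members.size(); ++i) {
        if (taken[i])
          continue;
        bool needed = false;
        for (const auto &s : members[i].defined)
          if (undefined.count(s))
            needed = true;
        if (needed) {
          taken[i] = true;
          add_object(members[i]);
          changed = true;
        }
      }
    }
  }
};

void demo() {
  Linker ld;
  ld.add_archive({{"unused.o", {"baz"}, {}}});
  assert(ld.linked.empty());  // nothing was undefined, so nothing is extracted
}
```

The demo also shows the confusing part: an archive placed before the object that needs it contributes nothing in this model.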
LLD is the LLVM linker. Its ELF port is typically installed as ld.lld. This article makes an in-depth analysis of ld.lld's performance. The topic has been in my mind for a while. Recently Rui Ueyama released mold 1.0 and people wonder why with multi-threading its ELF port is faster than ld.lld. So I finally completed the article.
First of all, I am very glad that Rui Ueyama started mold. Our world has a plethora of compilers, but not many people learn or write linkers. As its design documentation says, there are many drastically different designs which haven't been explored. In my view, mold is innovative in that it introduced parallel symbol table initialization, symbol resolution, and relocation scan which to my knowledge hadn't been implemented before, and showed us amazing results. The innovation gives existing and future linkers incentive to optimize further.