On 2022-07-07, I added a RISC-V linker relaxation framework in ld.lld
and implemented R_RISCV_ALIGN/R_RISCV_CALL/R_RISCV_CALL_PLT
relaxation. The changes will be included in the next llvm-project
release 15.0.0. This post describes the implementation.
In ISO C++ standards, [support.c.headers.general] says:
Source files that are not intended to also be valid ISO C should not
use any of the C headers.
Then, [depr.c.headers] describes how a C header name.h
is transformed to the corresponding C++ cname header. There
is a helpful example:
[ Example: The header <cstdlib> assuredly provides its declarations
and definitions within the namespace std. It may also provide these
names within the global namespace. The header <stdlib.h> assuredly
provides the same declarations and definitions within the global
namespace, much as in the C Standard. It may also provide these names
within the namespace std. — end example ]
"may also" in the wording allows implementations to provide
mix-and-match, e.g. #include <stdlib.h> may provide
std::exit and #include <cstdlib> may
provide ::exit.
libstdc++ chooses to enable global namespace declarations with C++
cname header. For example,
#include <cstdlib> also includes the corresponding C
header stdlib.h and we get declarations in both the global
namespace and the namespace std.
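As a minimal sketch of what this means for user code, the following calls the same C library function through both names after including only the C++ header. The helper name sum_via_both_namespaces is made up for illustration, and the global-namespace call relies on the "may also" clause, so it assumes an implementation such as libstdc++ that declares both.

```cpp
#include <cstdlib>  // guarantees std::atoi; ::atoi is only "may also"

// Hypothetical helper: calls one C library function through both names.
// The global-namespace call compiles only on implementations (such as
// libstdc++) whose <cstdlib> also declares the name globally.
int sum_via_both_namespaces() {
  int a = std::atoi("42");  // always available after #include <cstdlib>
  int b = ::atoi("8");      // relies on the "may also" wording
  return a + b;
}
```

If the second call fails to compile, the implementation has chosen not to provide the global declaration from the cname header.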
The compiler knows that the declarations in the namespace
std are identical to the ones in the global namespace. The
compiler recognizes some library functions and can optimize them.
Because libstdc++ brings these names into the namespace std with
using-declarations, the compiler can also optimize some C library
functions in the namespace std (e.g. many std::mem* and
std::str* functions).
For some C standard library headers, libstdc++ provides wrappers
(libstdc++-v3/include/c_compatibility/) which take
precedence over the glibc headers. The configuration of libstdc++ uses
--enable-cheaders=c_global
by default. The if GLIBCXX_C_HEADERS_C_GLOBAL conditional in
libstdc++-v3/include/Makefile.am shows that the 6
wrappers
(complex.h, fenv.h, tgmath.h, math.h, stdatomic.h, stdlib.h)
shadow the C library headers of the same name. For example,
#include <stdlib.h> includes the wrapper
stdlib.h which includes cstdlib, therefore
bringing exit into the namespace std.
Recently I have fixed two glibc rtld bugs related to early GOT
relocation for retro-computing architectures: m68k and powerpc32. They
are related to the obscure PI_STATIC_AND_HIDDEN macro which
I am going to demystify.
In 2002, PI_STATIC_AND_HIDDEN
was introduced into glibc rtld (runtime loader). This macro
indicates whether accesses to the following types of variables need
dynamic relocations.
static specifier: static int a;
(STB_LOCAL)
hidden visibility attribute:
__attribute__((visibility("hidden"))) int a;
(STB_GLOBAL STV_HIDDEN),
__attribute__((weak, visibility("hidden"))) int a;
(STB_WEAK STV_HIDDEN)
PI in the macro name is an abbreviation for "position
independent". This is a misnomer: a code sequence using GOT is typically
position-independent as well.
In -fPIC mode, the compiler assumes that all non-local
STV_DEFAULT symbols may be preemptible at run time. A
GOT-generating relocation is used and the GOT is typically unavoidable
at link time (on some architectures the linker can optimize out the
GOT). This case is not interesting to rtld as rtld does not need to
export such variables.
Excluding these cases (non-local STV_DEFAULT), all other
variables are known to be non-preemptible at compile time. The compiler
can generate code which is guaranteed to avoid dynamic relocations at
link time.
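The three kinds of variable can be put side by side in a small translation unit. This is only a sketch: the variable and function names are invented, and the visibility attribute assumes GCC or Clang targeting ELF.

```cpp
// Assumes GCC or Clang on an ELF target; the attribute is a compiler extension.
static int local_counter;                                  // STB_LOCAL
__attribute__((visibility("hidden"))) int hidden_counter;  // STB_GLOBAL STV_HIDDEN
int default_counter;                                       // STB_GLOBAL STV_DEFAULT

// Touches all three variables. In -fPIC mode only default_counter is
// treated as possibly preemptible, so only it typically goes through the GOT.
int bump_all() {
  ++local_counter;
  ++hidden_counter;
  ++default_counter;
  return local_counter + hidden_counter + default_counter;
}
```

Compiling this with something like g++ -fPIC -O2 -S and comparing the three accesses in the assembly output is a quick way to see which references carry a GOT marker on a given architecture.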
Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC
architectures with PC-relative instructions
To avoid dynamic relocations, the most common approach is to generate
PC-relative instructions, as most modern architectures (e.g. aarch64,
riscv, and x86-64) provide. Using PC-relative instructions to reference
variables assumes that the distance from code to data is a link-time
constant. Nowadays this condition is satisfied everywhere except the
rare FDPIC ABI.
Here are some assembly fragments from architectures using PC-relative
instructions. The instructions may not be familiar to you, but that is
fine. We can see that there is no GOT related marker. I have added some
comments indicating the relocation type and the referenced symbol.
var in the C code has internal linkage which lowers to the
STB_LOCAL binding. References to such local symbols are
often redirected to the section symbol (.bss): the
link-time behaviors are identical.
Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC
architectures without PC-relative instructions
Many older architectures do not have PC-relative instructions.
x86-32 does not have PC-relative instructions, but it provides a way
to avoid a load from a GOT entry. It achieves this with a detour:
compute the address of _GLOBAL_OFFSET_TABLE_ (GOT base
symbol), then add an offset (S-_GLOBAL_OFFSET_TABLE_) to
get the symbol address. _GLOBAL_OFFSET_TABLE_ is computed
this way: compute the address of a location in code, then add an offset
(_GLOBAL_OFFSET_TABLE_ - PC).
You probably see now how the x86-32 ABI was misdesigned: the
involvement of _GLOBAL_OFFSET_TABLE_ is unnecessary. A
relocation with the calculation of S-P (symbol address minus the
relocated place) would achieve the same net effect.
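The arithmetic behind the detour can be checked with plain integers. In this sketch the Layout struct, the function names, and all addresses are invented for illustration. The two link-time constants correspond to the calculations of R_386_GOTPC (GOT - P) and R_386_GOTOFF (S - GOT), and the one-step variant assumes a hypothetical single offset S - P.

```cpp
#include <cstdint>

// Addresses modeled as 32-bit integers; every value here is hypothetical.
struct Layout {
  std::uint32_t pc;   // address of the code location that materializes PC
  std::uint32_t got;  // address of _GLOBAL_OFFSET_TABLE_
  std::uint32_t sym;  // address S of the non-preemptible variable
};

// x86-32 style: GOT base = PC + (GOT - PC), then S = GOT base + (S - GOT).
std::uint32_t via_got_base(const Layout& l) {
  std::uint32_t got_minus_pc = l.got - l.pc;    // link-time constant
  std::uint32_t sym_minus_got = l.sym - l.got;  // link-time constant
  std::uint32_t got_base = l.pc + got_minus_pc;
  return got_base + sym_minus_got;
}

// One step: S = PC + (S - PC), with no GOT anchor involved.
std::uint32_t via_pc(const Layout& l) {
  std::uint32_t sym_minus_pc = l.sym - l.pc;  // link-time constant
  return l.pc + sym_minus_pc;
}
```

Both functions return the same address for any layout, because the GOT term cancels out. Unsigned wraparound keeps the result correct even when the symbol sits below the GOT or the code.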
The relocations with GOT in their names
just use the GOT as an anchor. They don't indicate a load from a GOT
entry.
powerpc64 does not have PC-relative instructions before POWER10.
Earlier microarchitectures use TOC-relative relocations to compute the
symbol address.
A few older architectures tend to use a load from a GOT entry. The
GOT entry needs a relative relocation (instead of
R_*_GLOB_DAT: the symbol is non-preemptible, so no symbol
search is needed). See All about Global
Offset Table. In glibc, these architectures define
HIDDEN_VAR_NEEDS_DYNAMIC_RELOC.
Some architectures even assume the distance from code to data may not
be a link-time constant (see All about
Procedure Linkage Table). They do not provide a relocation with a
calculation of S-_GLOBAL_OFFSET_TABLE_ or S-P.
    # nios2: r22 is a callee-saved register which requires a spill and expensive setup
    ldw r3, %got(var)(r22)  # R_NIOS2_GOT16 var
    ldw r2, 0(r3)
    addi r2, r2, 1
    stw r2, 0(r3)

    # powerpc32: r30 is a callee-saved register which requires a spill and expensive setup
    lwz 9,.LC0-.LCTOC1(30)
    lwz 3,0(9)
    addi 3,3,1
    stw 3,0(9)

    .section ".got2","aw"  # Like a manual GOT section
    .align 2
    .LCTOC1 = .+32768
    .LC0:
    .long .LANCHOR0  # R_PPC_ADDR32 .bss; may become R_PPC_RELATIVE at link time

    .section ".bss"
    .set .LANCHOR0,. + 0
    var:
    .zero 4
The first task of rtld is to relocate itself and bind all symbols to
itself. Afterward, non-preemptible functions and data can be freely
accessed.
On architectures where a GOT entry is used to access a
non-preemptible variable, rtld needs to be careful not to reference such
variables before relative relocations are applied. In
rtld.c, _dl_start has the following code:
    if (bootstrap_map.l_addr) {
      // Apply R_*_RELATIVE, R_*_GLOB_DAT, and R_*_JUMP_SLOT.
      ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0);
    }
_rtld_local_ro is a hidden global variable. Taking its
address may be reordered before ELF_DYNAMIC_RELOCATE by the
compiler. On an architecture using a GOT entry to load the address, the
reordering will make the subsequent memory store
(_rtld_local_ro.dl_find_object) crash, since the GOT
address is incorrect: it's zero or the link-time address instead of the
run-time address.
I was pretty sure there was a relocation bug, but it was not immediately
clear which piece of code was at fault.
Nowadays there aren't many choices for powerpc32 images. Void Linux ppc still provides
powerpc32 glibc and musl images. I downloaded one and fed it into qemu,
booted it with
qemu-system-ppc -machine mac99 -m 2047M -cdrom void-live-ppc-20210825.iso -net nic -net user,smb=$HOME/Dev -boot d.
I booted into the 4.4.261 kernel because gdb aborts immediately with
5.13.12 kernel. Daniel Kolesa mentioned this 5.x kernel incompatibility
to me and nobody has looked into it yet.
The live CD provides free space of about 1GiB and I can install
cifs-utils and gdb. Then run ld.so under gdb.
    xbps-install -S
    xbps-install cifs-utils gdb cgdb
    mkdir ~/Dev
    mount -t cifs -o vers=3.0 //10.0.2.4/qemu ~/Dev
    cd ~/Dev/glibc/out/ppc
    gdb -ex 'directory ../../elf' -ex r elf/ld.so
ld.so has 671 R_68K_RELATIVE relocations
and one R_68K_GLOB_DAT for
__stack_chk_guard@@GLIBC_2.4. The following function is
used to apply a relocation. It is shared by self-relocation and
relocation for other modules. The self-relocation code defines
RTLD_BOOTSTRAP and needs just R_68K_RELATIVE,
R_68K_GLOB_DAT, and R_68K_JMP_SLOT.
    if (__builtin_expect (r_type == R_68K_RELATIVE, 0))
      *reloc_addr = map->l_addr + reloc->r_addend;
    else
      {
        ...
        switch (r_type)
          {
          case R_68K_COPY:
            ...
          case R_68K_GLOB_DAT:
          case R_68K_JMP_SLOT:
            *reloc_addr = value;
            break;
However, somehow many case labels were available for self-relocation.
GCC compiles the switch statement into a jump table which requires
loading an address from GOT. With some clean-up to generic relocation
code, GCC decides to perform loop-invariant code motion and hoists the
load of the jump table address. The hoisted load is before relative
relocations are applied, so the jump table address is incorrect.
The foolproof approach is to add an optimization barrier (e.g.
calling a non-inlinable function after relative relocations are
resolved). That is non-trivial given the code structure. So Andreas
Schwab suggested a simple approach by avoiding the jump table: handle
just the essential relocations.
The faulty code was well concealed and I could not have found it without
a debugger. It took me a while to set up an m68k image using q800. The
memory is limited to 1000MiB and the emulation is very slow. Linux 5.19
is expected to gain support for a virtual Motorola 68000 machine.
With qemu-system-m68k -M virt things will become
better.
    # Installation
    7z x debian-11.0.0-m68k-NETINST-1.iso install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz
    mv install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz .
stage 1 (ldso/dlstart.c): only relative relocations are
applied. This allows static variables to be accessed.
stage 2 __dls2: This applies non-relative
relocations.
stage 2b __dls2b: Set up thread pointer with a TLS
stub.
stage 3 __dls3: Load the executable and immediately
loaded shared objects. Apply relocations and possibly relocate rtld/libc
itself again for possible symbol interposition (e.g.
R_*_COPY, interposed malloc implementation).
Each stage uses a PC-relative code sequence to load the address of
the next stage entry point, and then jump to it. This serves as a strong
compiler barrier preventing code reordering.
(In glibc, ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0); in elf/rtld.c
is kinda like musl's stage 1 plus stage 2.)
Stage 1 computes the entry of stage 2 with
GETFUNCSYM(&dls2, __dls2, base+dyn[DT_PLTGOT]); where
GETFUNCSYM is defined for every port:
This approach is elegant. It even allows a static or hidden function
call with a dynamic relocation, though I haven't found such an
architecture in my testing.
glibc's sigaction sets the sa_restorer
field of sigaction to __restore_rt, and sets
the SA_RESTORER flag. The kernel sets up the
__restore_rt frame with saved process context information
(ucontext_t structure) before jumping to the signal
handler. See kernel
arch/x86/kernel/signal.c:setup_rt_frame. Upon returning
from the signal handler, control passes to __restore_rt.
See man 2 sigreturn.
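From the application's side none of this is visible: the program only installs a handler, and the libc wrapper fills in the restorer. A minimal sketch follows, assuming a POSIX system. The names on_signal, g_last_signal, and handler_ran_after_raise are invented for this example.

```cpp
#include <csignal>   // std::raise
#include <signal.h>  // POSIX sigaction, sigemptyset

// Written by the handler, read by the caller.
static volatile sig_atomic_t g_last_signal = 0;

static void on_signal(int signo) { g_last_signal = signo; }

// Installs on_signal for SIGUSR1, raises the signal, and reports whether the
// handler ran. When on_signal returns, control passes to the sigreturn
// trampoline (__restore_rt with glibc on x86-64), not directly to raise.
bool handler_ran_after_raise() {
  struct sigaction sa = {};
  sa.sa_handler = on_signal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;  // user code does not set SA_RESTORER; libc handles it
  if (sigaction(SIGUSR1, &sa, nullptr) != 0)
    return false;
  g_last_signal = 0;
  if (std::raise(SIGUSR1) != 0)
    return false;
  return g_last_signal == SIGUSR1;
}
```

Setting a breakpoint in on_signal under gdb and printing a backtrace should show the trampoline frame as "<signal handler called>".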
__restore_rt is implemented in assembly. It comes with
DWARF call frame information in .eh_frame.
With the information, libunwind can unwind through the sigreturn
trampoline without knowing the ucontext_t structure. Note
that all general purpose registers are encoded.
libunwind/docs/unw_get_reg.man says
However, for signal frames (see unw_is_signal_frame(3)), it is
usually possible to access all registers.
Volatile registers are also saved in the saved process context
information. This is different from other frames where volatile
registers' information is typically lost.
As a relatively new port, Linux AArch64 defines the sigreturn
trampoline __kernel_rt_sigreturn in the VDSO (see
arch/arm64/kernel/vdso/sigreturn.S). This is unlike x86-64
which defines the function in libc. We can use gdb to dump the VDSO.
    (gdb) i proc m
    process 430749
    ...
    0xfffff7ffc000 0xfffff7ffd000 0x1000 0x0 [vdso]
    (gdb) dump binary memory vdso.so 0xfffff7ffc000 0xfffff7ffd000
As of Linux 5.8 (https://git.kernel.org/linus/87676cfca14171fc4c99d96ae2f3e87780488ac4),
vdso.so does not have PT_GNU_EH_FRAME.
Therefore unwinders (llvm-project libunwind, nongnu libunwind,
libgcc_s.so.1) ignore its unwind tables. In gdb,
gdb/aarch64-linux-tdep.c recognizes the two instructions
and encodes how the kernel sets up the ucontext_t
structure.
Previously, vdso.so generated a small set of CFI
instructions to encode X29 (FP) and X30 (LR).
However, there was a serious problem: CFI cannot describe a sigreturn
trampoline frame. AArch64 does not define a register number for PC and
provides no direct way to encode the PC of the previous frame. Instead,
it sets return_address_register to X30 and the unwinder updates the PC
to whatever value the saved X30 is. Actually, with nongnu libunwind and
unw_get_reg(&cursor, UNW_REG_IP, &pc); unw_get_reg(&cursor, UNW_AARCH64_X30, &x30);,
we know pc == x30. This approach works fine when LR forms a
chain since we know between two adjacent frames, the sets
{PC, X30} differ by one element. However, when unwinding
through the sigreturn trampoline, the CFI can describe the previous PC
but not the previous X30.
musl x86-64
src/signal/x86_64/restore.s implements a sigreturn
trampoline __restore_rt. There is no .eh_frame
information.
nongnu libunwind does not know that __restore_rt is a
sigreturn trampoline (unw_is_signal_frame always returns
0). On ELF targets, -O1 and above typically imply
-fomit-frame-pointer and many functions do not save RBP.
Note: some functions may save RBP even with
-fomit-frame-pointer.
In the absence of a valid frame chain, combined with the fact that
nongnu libunwind does not recognize Linux x86-64's sigreturn trampoline,
libunwind cannot unwind through the __restore_rt frame. gdb
recognizes the sigreturn trampoline frame and with its FP-based
unwinding it can retrieve several frames, but not the ones above
raise.
    % ld.lld @response.release.txt && ./nongnu
    pc=0x0000000000206add sp=0x00007ffc018618a0 0 ./nongnu:
    pc=0x00007f9fedcd602f sp=0x00007ffc018620c0 0 /home/ray/musl/out/release/lib/libc.so:
    pc=0x0000000000000000 sp=0x00007ffc01862db0 0
    % gdb ./nongnu -x =(printf 'b handler\nhandle SIGUSR1 nostop\nr\nbt')
    ...
    #0 handler (signo=10) at a.c:9
    #1 <signal handler called>
    #2 0x00007ffff7fae78a in __restore_sigs () from /home/ray/musl/out/release/lib/libc.so
    #3 0x00007ffff7fae8f1 in raise () from /home/ray/musl/out/release/lib/libc.so
    #4 0x0000000000000000 in ?? ()
If musl is built with -fno-omit-frame-pointer, nongnu
libunwind will use its FP-based fallback (see
src/x86_64/Gstep.c). The output looks like:
unw_step uses the saved RBP to infer RSP/RBP/RIP in the
previous frame. If the signal handler saves RBP and calls
unw_step, the saved RBP is essentially the RBP value in the
frame.
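The frame-pointer step can be modeled on a simulated stack. This is a textbook sketch under the assumption of the conventional push %rbp; mov %rsp,%rbp prologue, not the code in src/x86_64/Gstep.c, which has additional validity checks. The Regs and Memory types and the function name are invented here.

```cpp
#include <cstdint>
#include <map>

// Register state of one frame (only what FP-based unwinding needs).
struct Regs {
  std::uint64_t rip;
  std::uint64_t rsp;
  std::uint64_t rbp;
};

// Simulated memory: address -> 8-byte value. Hypothetical model only.
using Memory = std::map<std::uint64_t, std::uint64_t>;

// With the conventional prologue, [rbp] holds the caller's RBP and
// [rbp + 8] holds the return address; the caller's RSP is rbp + 16.
Regs step_with_frame_pointer(const Memory& mem, const Regs& cur) {
  Regs prev;
  prev.rbp = mem.at(cur.rbp);      // saved RBP of the caller
  prev.rip = mem.at(cur.rbp + 8);  // return address into the caller
  prev.rsp = cur.rbp + 16;         // stack pointer after the return
  return prev;
}
```

If a function in the chain does not maintain RBP as a frame pointer, the loaded values are meaningless, which is why a valid frame chain is a precondition for this fallback.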
Actually, not every source file needs to be built with
-fno-omit-frame-pointer. We just need to build the source
files that transfer control to the user program, and their callers. For
this example, building src/signal/raise.c with
-fno-omit-frame-pointer allows us to unwind to
main. Additionally rebuilding
src/env/__libc_start_main.c allows us to unwind to
_start.
musl's Makefile specifies
-fno-asynchronous-unwind-tables (see option to enable
eh_frame for a 2011 discussion). If CFLAGS -g is
specified, libc.so will have .debug_frame. gdb
can retrieve the caller of raise:
    #0 handler (signo=10) at a.c:9
    #1 <signal handler called>
    #2 __restore_sigs (set=set@entry=0x7fffffffe240) at ../../arch/x86_64/syscall_arch.h:40
    #3 0x00007ffff7fa36e0 in raise (sig=sig@entry=10) at ../../src/signal/raise.c:11
    #4 0x00000000002071ff in main () at a.c:33
nongnu libunwind can be built with --enable-debug-frame
to support .debug_frame. Unfortunately, since it does not
recognize the sigreturn trampoline, it cannot retrieve the main frame for this
example.
RISC-V
Like AArch64, Linux RISC-V defines the sigreturn trampoline __vdso_rt_sigreturn
in the VDSO.
    ENTRY(__vdso_rt_sigreturn)
        .cfi_startproc
        .cfi_signal_frame
        li a7, __NR_rt_sigreturn
        scall
        .cfi_endproc
    ENDPROC(__vdso_rt_sigreturn)
llvm-project libunwind added support for unwinding through a
sigreturn trampoline in https://reviews.llvm.org/D148499 (2023-05).
The output looks like the following on Arch Linux RISC-V (riscv64gc).
Linux kernel: arch/x86/kernel/signal.c:setup_rt_frame, arch/riscv/kernel/vdso/rt_sigreturn.S:__vdso_rt_sigreturn
Core dump
The kernel core dumper coredump.c is simple. The glibc
__restore_rt page or the VDSO is not prioritized in the
presence of a core file limit. If the page is missing in the core file,
gdb prog core -ex bt -batch will not be able to unwind past
the sigreturn trampoline. A userspace core dumper may be handy.
FORTRAN 77 COMMON blocks compiled to COMMON symbols. You could
declare a COMMON block in more than one file, with each specifying the
number, type, and size of the variable. The linker allocated enough
space to satisfy the largest size.
Binary sizes are important. Filesystem compression is ergonomic but
typically does not leverage application information well. Compressing
allocable sections (text, data) increases program startup time and
introduces memory overhead. In addition, filesystem compression is not
sufficiently portable.
Debug sections are large and contribute to a significant portion of
the binary size. Therefore, it is appealing to compress debug
sections.
Here is a -DCMAKE_BUILD_TYPE=Debug build directory of
llvm-project where I just ran ninja clang (on 2022-10-21).
Here are the total sizes of .o files, text sections, and debug sections.
Debug information is typically much larger than the text
sections.
Some assemblers and linkers offer a feature to compress debug
sections.
llvm-objcopy supports --compress-debug-sections=zlib to
compress debug sections. We can use the option to estimate what we would
get if the assembler compressed debug sections.
    % for i in **/*.o; do /tmp/Rel/bin/llvm-objcopy --compress-debug-sections=zlib $i /tmp/c/o && readelf -WS /tmp/c/o | awk 'BEGIN{FPAT="\\[.*?\\]|\\S+"} $2~/\.debug_/{d += strtonum("0x"$6)} END{print d}'; done | awk '{s+=$1} END{print s}'
    161691798
For debug sections, we have a compression ratio of 3.90! The total .o
size is 995438992 bytes, 68% of the original.
Then let's check zstd.
    % for i in **/*.o; do /tmp/Rel/bin/llvm-objcopy --compress-debug-sections=zstd $i /tmp/c/o && readelf -WS /tmp/c/o | awk 'BEGIN{FPAT="\\[.*?\\]|\\S+"} $2~/\.debug_/{d += strtonum("0x"$6)} END{print d}'; done | awk '{s+=$1} END{print s}'
    159341878
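To put the two totals side by side, the relative saving of one compressed size over another can be computed directly. This is a small sketch with an invented function name. The two byte counts are the totals printed by the zlib and zstd commands.

```cpp
#include <cstdint>

// Totals of compressed .debug_* section sizes reported by the two runs.
constexpr std::uint64_t kZlibDebugBytes = 161691798;
constexpr std::uint64_t kZstdDebugBytes = 159341878;

// Fraction by which `candidate` is smaller than `baseline`
// (negative if it is larger).
double relative_saving(std::uint64_t baseline, std::uint64_t candidate) {
  return 1.0 - static_cast<double>(candidate) / static_cast<double>(baseline);
}
```

For these two totals, relative_saving(kZlibDebugBytes, kZstdDebugBytes) comes out at roughly 0.0145, so the zstd output is about 1.5% smaller than the zlib output as invoked here.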
To check whether an object file has compressed debug sections, we can
use readelf.
    % readelf -S a.o
    ...
    Section Headers:
      [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
    ...
      [ 5] .debug_abbrev     PROGBITS        0000000000000000 000080 000087 00   C  0   0  8
In the readelf -S output, the Flg column
describes sh_flags where C indicates the
SHF_COMPRESSED flag.
    % readelf -t a.o
    ...
    Section Headers:
      [Nr] Name
           Type            Address          Off    Size   ES   Lk Inf Al
           Flags
    ...
      [ 5] .debug_abbrev
           PROGBITS        0000000000000000 000080 000087 00   0   0  8
           [0000000000000800]: COMPRESSED
           ZLIB, 0000000000000093, 1
History
In 2007-11, Craig Silverstein added
--compress-debug-sections=zlib to gold. When the option
was specified, gold compressed the content of a .debug*
section with zlib and changed the section name to
.debug*.zlib.$uncompressed_size.
Unix-like systems represent static libraries as .a
archives. A .a archive consists of a header and a
collection of files with metadata. Its usage is tightly coupled with the
linker. An archive almost always contains only relocatable object files
and the linker has built-in support for reading it.
    % as /dev/null -o a.o
    % rm -f b.a && ar rc b.a a.o
    % ar t b.a
    a.o
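The archive header mentioned above is easy to recognize: a regular (non-thin) archive starts with the 8 bytes "!<arch>\n". A minimal sketch of such a check follows, with an invented function name. It looks only at the magic and does not parse the member headers.

```cpp
#include <string>

// Global header of a regular ar archive: 8 bytes, including the newline.
constexpr char kArMagic[] = "!<arch>\n";

// True if the byte string starts with the regular archive magic.
bool looks_like_archive(const std::string& bytes) {
  return bytes.compare(0, 8, kArMagic) == 0;
}
```

A relocatable object file such as a.o starts with the ELF magic instead, so the check distinguishes b.a from its member.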
One may add other types of files to .a but that is
almost assuredly a bad thing.
    % rm -f a.a && ar rc a.a a.o b.a  # archive in archive, bad
    % ar t a.a
    a.o
    b.a
    % echo hello > a.txt
    % rm -f a.a && ar rc a.a a.o a.txt  # text file in archive, bad
The original linker designers noticed that for many programs not
every member was needed, so they tried to allow the linker to skip
unused members. Therefore, they invented the interesting but confusing
archive member extraction rule. See Symbol
processing#Archive processing for details.