2022-08-28

-march=, -mcpu=, and -mtune=

Updated in 2025-05.

In GCC and Clang, there are three major options specifying the architecture and microarchitecture the generated code can run on. The general semantics are described below, but each target machine may assign different semantics.

-march=X: (execution domain) Generate code that can use instructions available in the architecture X
-mtune=X: (optimization domain) Optimize for the microarchitecture X, but do not change the ABI or make assumptions about available instructions
-mcpu=X: Specify both -march= and -mtune= but can be overridden by the two options. The supported values are generally the same as -mtune=. The architecture name is inferred from X

2022-08-21

glibc and DT_GNU_HASH

tl;dr "Easy Anti-Cheat"'s incompatibility with the shared objects provided by glibc 2.36 (libc.so.6, ld-linux-x86_64.so.2) is an instance of Hyrum's law.

On 2022-08-02 glibc 2.36 was released.
On the same day the x86-64 package was moved to [core] on Arch Linux.
On 2022-08-03 Jelgnum reported that with the new glibc, "Easy Anti-Cheat" cannot load the anti-cheat module (GLIBC update broke EAC for most games that use it).
Multiple Arch Linux game users confirmed the problem.
Frogging101 bisected the problem to the glibc commit Do not use --hash-style=both for building glibc shared objects.
The problem led to heated discussions, some clickbait news, and claims such as "glibc breaks ABI" and "glibc does not prioritize compatibility with pre-existing applications".

I feel compelled to demystify the incident and hope that people will stop defaming glibc.

2022-08-14

DWARF in reproducible builds

Deterministic builds with clang and lld describes several degrees of build determinism. Here is the description of local determinism:

Like incremental basic determinism, but builds are also independent of the name of the build directory. Builds of the same source code on the same machine produce exactly the same output every time, independent of the location of the source checkout directory or the build directory.

2022-07-10

RISC-V linker relaxation in lld

On 2022-07-07, I added a RISC-V linker relaxation framework in ld.lld and implemented R_RISCV_ALIGN/R_RISCV_CALL/R_RISCV_CALL_PLT relaxation. The changes will be included in the next llvm-project release 15.0.0. This post describes the implementation.

See The dark side of RISC-V linker relaxation for more information about RISC-V linker relaxation.

2022-05-29

Everything I know about glibc

Updated in 2023-08.

Repository: https://sourceware.org/git/gitweb.cgi?p=glibc.git
Wiki: https://sourceware.org/glibc/wiki/
Bugzilla: https://sourceware.org/bugzilla/
Mailing lists: {libc-announce,libc-alpha,libc-locale,libc-stable,libc-help}@sourceware.org
Patchwork: https://patchwork.sourceware.org/project/glibc/list/ Patch Review Workflow
Contribution Checklist: https://sourceware.org/glibc/wiki/Contribution%20checklist
Build bots: https://builder.sourceware.org/buildbot/#/builders, search "glibc"
public inbox: https://inbox.sourceware.org/libc-alpha/

glibc is an implementation of the user-space side of standard C/POSIX functions with Linux extensions.

2022-05-15

C standard library headers in C++

Updated in 2024-01.

In ISO C++ standards, [support.c.headers.general] says:

Source files that are not intended to also be valid ISO C should not use any of the C headers.

Then, [depr.c.headers] describes how a C header name.h is transformed to the corresponding C++ cname header. There is a helpful example:

[ Example: The header <cstdlib> assuredly provides its declarations and definitions within the namespace std. It may also provide these names within the global namespace. The header <stdlib.h> assuredly provides the same declarations and definitions within the global namespace, much as in the C Standard. It may also provide these names within the namespace std. — end example ]

"may also" in the wording allows implementations to provide mix-and-match, e.g. #include <stdlib.h> may provide std::exit and #include <cstdlib> may provide ::exit.

libstdc++ chooses to enable global namespace declarations with C++ cname header. For example, #include <cstdlib> also includes the corresponding C header stdlib.h and we get declarations in both the global namespace and the namespace std.

. /usr/include/c++/12/cstdlib
.. /usr/include/stdlib.h

The preprocessed output looks like:

extern void exit (int __status) noexcept (true) __attribute__ ((__noreturn__));

extern "C++"
{
namespace std __attribute__ ((__visibility__ ("default")))
{
  using ::exit;
}
}

The compiler knows that the declarations in the namespace std are identical to the ones in the global namespace. The compiler recognizes some library functions and can optimize them. Thanks to the using-declarations, the compiler can also optimize some C library functions in the namespace std (e.g. many std::mem* and std::str* functions).

For some C standard library headers, libstdc++ provides wrappers (libstdc++-v3/include/c_compatibility/) which take precedence over the glibc headers. The configuration of libstdc++ uses --enable-cheaders=c_global by default. The if GLIBCXX_C_HEADERS_C_GLOBAL block in libstdc++-v3/include/Makefile.am shows that 6 wrappers (complex.h, fenv.h, tgmath.h, math.h, stdatomic.h, stdlib.h) shadow the C library headers of the same name. For example, #include <stdlib.h> includes the wrapper stdlib.h which includes cstdlib, therefore bringing exit into the namespace std.

. /usr/include/c++/12/stdlib.h
.. /usr/include/c++/12/cstdlib
... /usr/include/stdlib.h

2022-04-24

PI_STATIC_AND_HIDDEN/HIDDEN_VAR_NEEDS_DYNAMIC_RELOC in glibc rtld

Recently I have fixed two glibc rtld bugs related to early GOT relocation for retro-computing architectures: m68k and powerpc32. They are related to the obscure PI_STATIC_AND_HIDDEN macro which I am going to demystify.

In 2002, PI_STATIC_AND_HIDDEN was introduced into glibc rtld (runtime loader). This macro indicates whether accesses to the following types of variables need dynamic relocations.

static specifier: static int a; (STB_LOCAL)
hidden visibility attribute: __attribute__((visibility("hidden"))) int a; (STB_GLOBAL STV_HIDDEN), __attribute__((weak, visibility("hidden"))) int a; (STB_WEAK STV_HIDDEN)

PI in the macro name is an abbreviation for "position independent". This is a misnomer: a code sequence using GOT is typically position-independent as well.

In -fPIC mode, the compiler assumes that all non-local STV_DEFAULT symbols may be preemptible at run time. A GOT-generating relocation is used and the GOT is typically unavoidable at link time (on some architectures the linker can optimize out the GOT). This case is not interesting to rtld as rtld does not need to export such variables.

Excluding these cases (non-local STV_DEFAULT), all other variables are known to be non-preemptible at compile time. The compiler can generate code which is guaranteed to avoid dynamic relocations at link time.

static int var;
int foo() { return ++var; }

On 2022-04-26, I replaced PI_STATIC_AND_HIDDEN with the opposite macro HIDDEN_VAR_NEEDS_DYNAMIC_RELOC.

Non-`HIDDEN_VAR_NEEDS_DYNAMIC_RELOC` architectures with PC-relative instructions

To avoid dynamic relocations, the most common approach is to generate PC-relative instructions, as most modern architectures (e.g. aarch64, riscv, and x86-64) provide. Using PC-relative instructions to reference variables assumes that the distance from code to data is a link-time constant. Nowadays this condition is satisfied everywhere except the rare FDPIC ABI.

Here are some assembly fragments from architectures using PC-relative instructions. The instructions may not be familiar to you, but that is fine. We can see that there is no GOT-related marker. I have added some comments indicating the relocation type and the referenced symbol. var in the C code has internal linkage which lowers to the STB_LOCAL binding. References to such local symbols are often redirected to the section symbol (.bss): the link-time behaviors are identical.

# aarch64
        adrp    x1, .LANCHOR0               # R_AARCH64_ADR_PREL_PG_HI21 .bss
        ldr     w0, [x1, #:lo12:.LANCHOR0]  # R_AARCH64_LDST32_ABS_LO12_NC .bss
        add     w0, w0, 1
        str     w0, [x1, #:lo12:.LANCHOR0]  # R_AARCH64_LDST32_ABS_LO12_NC .bss

# arm
        ldr     r3, .L3
.LPIC0:
        add     r3, pc, r3
        ldr     r0, [r3]
        add     r0, r0, #1
        str     r0, [r3]
.L3:
        .word   .LANCHOR0-(.LPIC0+8)  # R_ARM_REL32 .bss

# riscv64
        lla     a5,.LANCHOR0  # R_RISCV_PCREL_HI20+R_RISCV_PCREL_LO12_I
        lw      a0,0(a5)
        addiw   a0,a0,1
        sw      a0,0(a5)

# x86-64
        movl    var(%rip), %eax  # R_X86_64_PC32 .bss-0x4
        addl    $1, %eax
        movl    %eax, var(%rip)  # R_X86_64_PC32 .bss-0x4

Non-`HIDDEN_VAR_NEEDS_DYNAMIC_RELOC` architectures without PC-relative instructions

Many older architectures do not have PC-relative instructions.

x86-32 does not have PC-relative instructions, but it provides a way to avoid a load from a GOT entry. It achieves this with a detour: compute the address of _GLOBAL_OFFSET_TABLE_ (GOT base symbol), then add an offset (S-_GLOBAL_OFFSET_TABLE_) to get the symbol address. _GLOBAL_OFFSET_TABLE_ is computed this way: compute the address of a location in code, then add an offset (_GLOBAL_OFFSET_TABLE_ - PC).

You probably see now how the x86-32 ABI was misdesigned: the involvement of _GLOBAL_OFFSET_TABLE_ is unnecessary. A relocation with the calculation of S-_GLOBAL_OFFSET_TABLE_ would achieve the same net effect.

The relocations with GOT in their names just use the GOT as an anchor. They don't indicate a load from a GOT entry.

# x86-32
        call    __x86.get_pc_thunk.dx         # R_386_PC32   __x86.get_pc_thunk.dx
        addl    $_GLOBAL_OFFSET_TABLE_, %edx  # R_386_GOTPC  _GLOBAL_OFFSET_TABLE_
        movl    var@GOTOFF(%edx), %eax        # R_386_GOTOFF .bss
        addl    $1, %eax
        movl    %eax, var@GOTOFF(%edx)        # R_386_GOTOFF .bss

powerpc64 does not have PC-relative instructions before POWER10. Earlier microarchitectures use TOC-relative relocations to compute the symbol address.

addis 10,2,.LANCHOR0@toc@ha  # R_PPC64_TOC16_HA
lwz 9,.LANCHOR0@toc@l(10)    # R_PPC64_TOC16_LO
addi 9,9,1
extsw 3,9
stw 9,.LANCHOR0@toc@l(10)    # R_PPC64_TOC16_LO

A pending patch [PATCH v3] powerpc64: Enable static-pie will define PI_STATIC_AND_HIDDEN.

`HIDDEN_VAR_NEEDS_DYNAMIC_RELOC` architectures

A few older architectures tend to use a load from a GOT entry. The GOT entry needs a relative relocation (instead of R_*_GLOB_DAT: the symbol is non-preemptible, so no symbol search is needed). See All about Global Offset Table. In glibc, these architectures define HIDDEN_VAR_NEEDS_DYNAMIC_RELOC.

Some architectures even assume the distance from code to data may not be a link-time constant (see All about Procedure Linkage Table). They do not provide a relocation with a calculation of S-_GLOBAL_OFFSET_TABLE_ or S-P.

# m68k
        move.l var@GOT(%a5),%a0  # R_68K_GOT32O var
        move.l (%a0),%d0
        addq.l #1,%d0
        move.l %d0,(%a0)

# microblaze
        lwi     r4,r20,var@GOT  # R_MICROBLAZE_GOT_64 var
        lwi     r3,r4,0
        addik   r3,r3,1
        swi     r3,r4,0

# nios2: r22 is a callee-saved register which requires a spill and expensive setup
        ldw     r3, %got(var)(r22)  # R_NIOS2_GOT16 var
        ldw     r2, 0(r3)
        addi    r2, r2, 1
        stw     r2, 0(r3)

# powerpc32: r30 is a callee-saved register which requires a spill and expensive setup
        lwz 9,.LC0-.LCTOC1(30)
        lwz 3,0(9)
        addi 3,3,1
        stw 3,0(9)

        .section        ".got2","aw"  # Like a manual GOT section
        .align 2
.LCTOC1 = .+32768
.LC0:
        .long .LANCHOR0  # R_PPC_ADDR32 .bss; may become R_PPC_RELATIVE at link time

        .section        ".bss"
        .set    .LANCHOR0,. + 0
var:
        .zero   4

The first task of rtld is to relocate itself and bind all symbols to itself. Afterward, non-preemptible functions and data can be freely accessed.

On architectures where a GOT entry is used to access a non-preemptible variable, rtld needs to be careful not to reference such variables before relative relocations are applied. In rtld.c, _dl_start has the following code:

if (bootstrap_map.l_addr)
  {
    // Apply R_*_RELATIVE, R_*_GLOB_DAT, and R_*_JUMP_SLOT.
    ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0);
  }

__rtld_malloc_init_stubs ();

// _rtld_local_ro.dl_find_object
GLRO (dl_find_object) = &_dl_find_object;

_rtld_local_ro is a hidden global variable. The compiler may reorder taking its address before ELF_DYNAMIC_RELOCATE. On an architecture using a GOT entry to load the address, the reordering will make the subsequent memory store (_rtld_local_ro.dl_find_object) crash, since the address loaded from the GOT is incorrect: it's zero or the link-time address instead of the run-time address.

powerpc32

I recently cleaned up the bootstrap code a bit with elf: Move elf_dynamic_do_Rel RTLD_BOOTSTRAP branches outside. Afterwards, GCC powerpc32 appears to reliably reorder _rtld_local_ro, causing ld.so to crash right away.

mkdir -p out/ppc; cd out/ppc
../../configure --prefix=/tmp/glibc/ppc --host=powerpc-linux-gnu CC=powerpc-linux-gnu-gcc CXX=powerpc-linux-gnu-g++ && make -j 50 && make -j 50 install && 'cp' -f /usr/powerpc-linux-gnu/lib/libgcc_s.so.1 /tmp/glibc/ppc/lib

% elf/ld.so
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
[1]    373503 segmentation fault  elf/ld.so

I was pretty sure there was a relocation bug, but it was not immediately clear which piece of code was at fault.

Nowadays there aren't many choices for powerpc32 images. Void Linux ppc still provides powerpc32 glibc and musl images. I downloaded one and booted it in qemu with qemu-system-ppc -machine mac99 -m 2047M -cdrom void-live-ppc-20210825.iso -net nic -net user,smb=$HOME/Dev -boot d. I booted into the 4.4.261 kernel because gdb aborts immediately with the 5.13.12 kernel. Daniel Kolesa mentioned this 5.x kernel incompatibility to me and nobody has looked into it yet.

The live CD provides free space of about 1GiB and I can install cifs-utils and gdb. Then run ld.so under gdb.

xbps-install -S
xbps-install cifs-utils gdb cgdb
mkdir ~/Dev
mount -t cifs -o vers=3.0 //10.0.2.4/qemu ~/Dev
cd ~/Dev/glibc/out/ppc
gdb -ex 'directory ../../elf' -ex r elf/ld.so

gdb says stw r9,1168(r25) triggers SIGSEGV.

% powerpc-linux-gnu-objdump --disassemble=_dl_start -S elf/ld.so
...
  if (bootstrap_map.l_addr)
   1f468:       40 96 01 64     bne     cr5,1f5cc <_dl_start+0x3ac>
   1f46c:       83 3e ff c4     lwz     r25,-60(r30)  # load the address of _rtld_local_ro from GOT => 0
   1f470:       3a 61 00 10     addi    r19,r1,16
  bootstrap_map.l_relocated = 1;
   1f474:       a1 21 01 b0     lhz     r9,432(r1)
   1f478:       61 29 20 00     ori     r9,r9,8192
   1f47c:       b1 21 01 b0     sth     r9,432(r1)
  __rtld_malloc_init_stubs ();
   1f480:       4b ff d5 f1     bl      1ca70 <__rtld_malloc_init_stubs>
  GLRO (dl_find_object) = &_dl_find_object;
   1f484:       81 3e ff b8     lwz     r9,-72(r30)
    ElfW(Addr) entry = _dl_start_final (arg, &info);
   1f488:       7e 64 9b 78     mr      r4,r19
   1f48c:       7f 83 e3 78     mr      r3,r28
  GLRO (dl_find_object) = &_dl_find_object;
   1f490:       91 39 04 90     stw     r9,1168(r25)  # access 0+1168 => SIGSEGV
    ElfW(Addr) entry = _dl_start_final (arg, &info);
   1f494:       4b ff fb 9d     bl      1f030 <_dl_start_final>

Then I confirm that the GOT entry corresponds to _rtld_local_ro.

% readelf -Wr elf/ld.so | grep 4ffb8
0004ffb8  00000016 R_PPC_RELATIVE                    4efc8
% readelf -Ws elf/ld.so | grep 4efc8
     7: 0004efc8  1192 OBJECT  GLOBAL DEFAULT   14 _rtld_global_ro@@GLIBC_PRIVATE
   583: 0004efc8  1192 OBJECT  LOCAL  DEFAULT   14 _rtld_local_ro
   726: 0004efc8  1192 OBJECT  GLOBAL DEFAULT   14 _rtld_global_ro

elf: Move post-relocation code of _dl_start into _dl_start_final shall fix the bug.

Note: adding asm volatile("" ::: "memory"); in between does not prevent reordering.

Note: in the absence of a powerpc32 system, qemu-ppc-static -d in_asm elf/ld.so may provide some clue about the faulty basic block.

----------------
IN: _dl_start
0x4001f484:  813effb8  lwz      r9, -0x48(r30)
0x4001f488:  7e649b78  mr       r4, r19
0x4001f48c:  7f83e378  mr       r3, r28
0x4001f490:  91390490  stw      r9, 0x490(r25)
0x4001f494:  4bfffb9d  bl       0x4001f030

qemu: uncaught target signal 11 (Segmentation fault) - core dumped
[1]    383218 segmentation fault  qemu-ppc-static -d in_asm elf/ld.so

m68k

Last week I fixed a similar bug for m68k: m68k: Removal of ELF_DURING_STARTUP optimization broke ld.so.

ld.so has 671 R_68K_RELATIVE relocations and one R_68K_GLOB_DAT for __stack_chk_guard@@GLIBC_2.4. The following function is used to apply a relocation. It is shared by self-relocation and relocation for other modules. The self-relocation code defines RTLD_BOOTSTRAP and needs just R_68K_RELATIVE, R_68K_GLOB_DAT, and R_68K_JMP_SLOT.

// sysdeps/m68k/dl-machine.h
static inline void __attribute__ ((unused, always_inline))
elf_machine_rela (struct link_map *map, struct r_scope_elem *scope[],
                  const Elf32_Rela *reloc, const Elf32_Sym *sym,
                  const struct r_found_version *version,
                  void *const reloc_addr_arg, int skip_ifunc)
{
  Elf32_Addr *const reloc_addr = reloc_addr_arg;
  const unsigned int r_type = ELF32_R_TYPE (reloc->r_info);

  if (__builtin_expect (r_type == R_68K_RELATIVE, 0))
    *reloc_addr = map->l_addr + reloc->r_addend;
  else
    {
      ...
      switch (r_type)
        {
        case R_68K_COPY:
          ...
        case R_68K_GLOB_DAT:
        case R_68K_JMP_SLOT:
          *reloc_addr = value;
          break;

However, somehow many case labels remained available for self-relocation. GCC compiles the switch statement into a jump table, which requires loading an address from the GOT. With some clean-up to the generic relocation code, GCC decides to perform loop-invariant code motion and hoists the load of the jump table address. The hoisted load happens before relative relocations are applied, so the jump table address is incorrect.

The foolproof approach is to add an optimization barrier (e.g. calling a non-inlinable function after relative relocations are resolved). That is non-trivial given the code structure. So Andreas Schwab suggested a simple approach by avoiding the jump table: handle just the essential relocations.

The faulty code was well concealed and I could not have found it without a debugger. It took me a while to set up an m68k image using q800. The memory is limited to 1000MiB and the emulation is very slow. Linux 5.19 is expected to gain support for a virtual Motorola 68000 machine. With qemu-system-m68k -M virt things will become better.

# Installation

7z x debian-11.0.0-m68k-NETINST-1.iso install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz
mv install/kernels/vmlinux-5.16.0-5-m68k install/cdrom/initrd.gz .

qemu-img create -f qcow2 debian-m68k.qcow2 8G
qemu-system-m68k -M q800 -m 1000m -serial none -serial mon:stdio -net nic,model=dp83932 -net user -kernel vmlinux-5.16.0-5-m68k -initrd initrd.gz -append 'console=ttyS0 vga=off' -drive file=debian-m68k.qcow2,format=qcow2 -drive file=debian-11.0.0-m68k-NETINST-1.iso,format=raw,media=cdrom -nographic -boot d

# After installation

sudo qemu-nbd -c /dev/nbd0 m68k-deb10.qcow2
# Extract vmlinux and initrd
sudo qemu-nbd -d /dev/nbd0 m68k-deb10.qcow2

qemu-system-m68k -M q800 -m 1000M -kernel vmlinux-5.16.0-6-m68k -initrd initrd.img-5.16.0-6-m68k -append 'root=/dev/sda2 console=tty' -hda debian-m68k.qcow2 -net nic,model=dp83932 -net user,smb=$HOME/Dev

musl rtld

musl rtld has a clear separation of 3 stages.

stage 1 (ldso/dlstart.c): only relative relocations are applied. This allows static variables to be accessed.
stage 2 __dls2: This applies non-relative relocations.
stage 2b __dls2b: Set up thread pointer with a TLS stub.
stage 3 __dls3: Load the executable and immediately loaded shared objects. Apply relocations and possibly relocate rtld/libc itself again for possible symbol interposition (e.g. R_*_COPY, interposed malloc implementation).

Each stage uses a PC-relative code sequence to load the address of the next stage entry point, and then jump to it. This serves as a strong compiler barrier preventing code reordering.

(In glibc, elf/rtld.c ELF_DYNAMIC_RELOCATE (&bootstrap_map, NULL, 0, 0, 0); is kinda like musl's stage 1 plus stage 2.)

Stage 1 computes the entry of stage 2 with GETFUNCSYM(&dls2, __dls2, base+dyn[DT_PLTGOT]); where GETFUNCSYM is defined for every port:

// arch/m68k/reloc.h
#define GETFUNCSYM(fp, sym, got) __asm__ ( \
	".hidden " #sym "\n" \
	"lea " #sym "-.-8,%0 \n" \
	"lea (%%pc,%0),%0 \n" \
	: "=a"(*fp) : : "memory" )

// arch/powerpc/reloc.h
#define GETFUNCSYM(fp, sym, got) __asm__ ( \
	".hidden " #sym " \n" \
	"	bl 1f \n" \
	"	.long " #sym "-. \n" \
	"1:	mflr %1 \n" \
	"	lwz %0, 0(%1) \n" \
	"	add %0, %0, %1 \n" \
	: "=r"(*(fp)), "=r"((int){0}) : : "memory", "lr" )

This approach is elegant. It even allows a static or hidden function call with a dynamic relocation, though I haven't found such an architecture in my testing.

2022-04-10

Unwinding through a signal handler

This post has some notes about unwinding through a signal handler. You may want to read Stack unwinding first.

// a.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <libunwind.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void handler(int signo) {
  unw_context_t context;
  unw_cursor_t cursor;
  unw_getcontext(&context);
  unw_init_local(&cursor, &context);
  unw_word_t pc, sp;
  do {
    unw_get_reg(&cursor, UNW_REG_IP, &pc);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    printf("pc=0x%016zx sp=0x%016zx", (size_t)pc, (size_t)sp);
    Dl_info info = {};
    if (dladdr((void *)pc, &info))
      printf(" %s:%s", info.dli_fname, info.dli_sname ? info.dli_sname : "");
    puts("");
  } while (unw_step(&cursor) > 0);
  exit(0);
}

int main() {
  signal(SIGUSR1, handler);
  raise(SIGUSR1);
  return 1;
}

(printf and dladdr are not required to be async-signal-safe functions, but here we apparently know using them can't cause problems.)

Tips: we can additionally add the following code block to get memory mappings.

char buf[128];
FILE *f = fopen("/proc/self/maps", "r");
if (f) {  // /proc may be unavailable
  while (fgets(buf, sizeof buf, f))
    printf("%s", buf);
  fclose(f);
}

Build the program with either llvm-project libunwind or nongnu libunwind:

# ninja -C /tmp/Debug unwind builtins
clang -g -I llvm-project/libunwind/include a.c -no-pie --unwindlib=libunwind --rtlib=compiler-rt -ldl -Wl,-E,-rpath,/tmp/Debug/lib/x86_64-unknown-linux-gnu -o llvm

# autoreconf -i; mkdir -p out/debug; ../../configure CFLAGS='-O0 -g' CXXFLAGS='-O0 -g'; make -j 20
libunwind=/tmp/p/libunwind
clang -g -I $libunwind/include -I $libunwind/out/debug/include a.c -no-pie $libunwind/out/debug/src/.libs/libunwind.a $libunwind/out/debug/src/.libs/libunwind-x86_64.a -llzma -ldl -Wl,-E -o nongnu

(Some targets default to -fno-asynchronous-unwind-tables. In the absence of C++ exceptions, we need at least -funwind-tables.)

glibc x86-64

With either implementation, the output looks like the following on Linux glibc x86-64. I annotated the lines with location information.

pc=0x0000000000206d2a sp=0x00007fffd366bce0 ./nongnu: # in handler, the instruction after call unw_getcontext
pc=0x00007f5962cb0920 sp=0x00007fffd366c500 /lib/x86_64-linux-gnu/libc.so.6: # __restore_rt
pc=0x00007f5962cb08a1 sp=0x00007fffd366d200 /lib/x86_64-linux-gnu/libc.so.6:gsignal # raise
pc=0x0000000000206cfd sp=0x00007fffd366d320 ./nongnu:main
pc=0x00007f5962c9b7fd sp=0x00007fffd366d340 /lib/x86_64-linux-gnu/libc.so.6:__libc_start_main
pc=0x0000000000206bba sp=0x00007fffd366d410 ./nongnu:_start # from crt1.o

__restore_rt is a sigreturn trampoline defined in glibc sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:

  nop
.align 16
__restore_rt:
  movq $15, %rax  # __NR_rt_sigreturn
  syscall

(Newer ports use VDSO.)

glibc's sigaction sets the sa_restorer field of sigaction to __restore_rt, and sets the SA_RESTORER flag. The kernel sets up the __restore_rt frame with saved process context information (ucontext_t structure) before jumping to the signal handler. See kernel arch/x86/kernel/signal.c:setup_rt_frame. Upon returning from the signal handler, control passes to __restore_rt. See man 2 sigreturn.

__restore_rt is implemented in assembly. It comes with DWARF call frame information in .eh_frame.

% llvm-dwarfdump -eh-frame /lib/x86_64-linux-gnu/libc.so.6
...
00002458 00000010 00000000 CIE
  Format:                DWARF32
  Version:               1
  Augmentation:          "zRS"
  Code alignment factor: 1
  Data alignment factor: -8
  Return address column: 16
  Augmentation data:     1B

  DW_CFA_nop:
  DW_CFA_nop:


0000246c 00000078 00000018 FDE cie=00002458 pc=0003c91f...0003c929
  Format:       DWARF32
  DW_CFA_def_cfa_expression: DW_OP_breg7 RSP+160, DW_OP_deref
  DW_CFA_expression: R8 DW_OP_breg7 RSP+40
  DW_CFA_expression: R9 DW_OP_breg7 RSP+48
  DW_CFA_expression: R10 DW_OP_breg7 RSP+56
  DW_CFA_expression: R11 DW_OP_breg7 RSP+64
  DW_CFA_expression: R12 DW_OP_breg7 RSP+72
  DW_CFA_expression: R13 DW_OP_breg7 RSP+80
  DW_CFA_expression: R14 DW_OP_breg7 RSP+88
  DW_CFA_expression: R15 DW_OP_breg7 RSP+96
  DW_CFA_expression: RDI DW_OP_breg7 RSP+104
  DW_CFA_expression: RSI DW_OP_breg7 RSP+112
  DW_CFA_expression: RBP DW_OP_breg7 RSP+120
  DW_CFA_expression: RBX DW_OP_breg7 RSP+128
  DW_CFA_expression: RDX DW_OP_breg7 RSP+136
  DW_CFA_expression: RAX DW_OP_breg7 RSP+144
  DW_CFA_expression: RCX DW_OP_breg7 RSP+152
  DW_CFA_expression: RSP DW_OP_breg7 RSP+160
  DW_CFA_expression: RIP DW_OP_breg7 RSP+168
  DW_CFA_nop:
  DW_CFA_nop:

  0x3c91f: CFA=DW_OP_breg7 RSP+160, DW_OP_deref: RAX=[DW_OP_breg7 RSP+144], RDX=[DW_OP_breg7 RSP+136], RCX=[DW_OP_breg7 RSP+152], RBX=[DW_OP_breg7 RSP+128], RSI=[DW_OP_breg7 RSP+112], RDI=[DW_OP_breg7 RSP+104], RBP=[DW_OP_breg7 RSP+120], RSP=[DW_OP_breg7 RSP+160], R8=[DW_OP_breg7 RSP+40], R9=[DW_OP_breg7 RSP+48], R10=[DW_OP_breg7 RSP+56], R11=[DW_OP_breg7 RSP+64], R12=[DW_OP_breg7 RSP+72], R13=[DW_OP_breg7 RSP+80], R14=[DW_OP_breg7 RSP+88], R15=[DW_OP_breg7 RSP+96], RIP=[DW_OP_breg7 RSP+168]
...

The DW_OP_breg7 RSP offsets correspond to the ucontext_t offsets of these registers.

% cat sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
...
   do_cfa_expr                                                          \
   do_expr (8 /* r8 */, oR8)                                            \
   do_expr (9 /* r9 */, oR9)                                            \
   do_expr (10 /* r10 */, oR10)                                         \
% cat sysdeps/unix/sysv/linux/x86_64/ucontext_i.sym
...
#define ucontext(member)        offsetof (ucontext_t, member)
#define mcontext(member)        ucontext (uc_mcontext.member)
#define mreg(reg)               mcontext (gregs[REG_##reg])

oRBP            mreg (RBP)
oRSP            mreg (RSP)
oRBX            mreg (RBX)

With the information, libunwind can unwind through the sigreturn trampoline without knowing the ucontext_t structure. Note that all general purpose registers are encoded. libunwind/docs/unw_get_reg.man says

However, for signal frames (see unw_is_signal_frame(3)), it is usually possible to access all registers.

Volatile registers are also saved in the saved process context information. This is different from other frames where volatile registers' information is typically lost.

glibc AArch64

The output looks like:

pc=0x0000000000214b10 sp=0x0000ffffe81f6050 ./nongnu: # handler
pc=0x0000ffffa55cd5b0 sp=0x0000ffffe81f6c70 linux-vdso.so.1:__kernel_rt_sigreturn
pc=0x0000ffffa5438070 sp=0x0000ffffe81f7ed0 /lib/aarch64-linux-gnu/libc.so.6:gsignal
pc=0x0000000000214bfc sp=0x0000ffffe81f8000 ./nongnu:main
pc=0x0000ffffa5425090 sp=0x0000ffffe81f8010 /lib/aarch64-linux-gnu/libc.so.6:__libc_start_main
pc=0x00000000002149cc sp=0x0000ffffe81f8160 ./nongnu: # _start
pc=0x00000000002149cc sp=0x0000ffffe81f8160 ./nongnu: # _start

As a relatively new port, Linux AArch64 defines the sigreturn trampoline __kernel_rt_sigreturn in the VDSO (see arch/arm64/kernel/vdso/sigreturn.S). This is unlike x86-64 which defines the function in libc. We can use gdb to dump the VDSO.

(gdb) i proc m
process 430749
...
      0xfffff7ffc000     0xfffff7ffd000     0x1000        0x0 [vdso]
(gdb) dump binary memory vdso.so 0xfffff7ffc000     0xfffff7ffd000

  nop

.globl __kernel_rt_sigreturn
__kernel_rt_sigreturn:
  mov     x8, #__NR_rt_sigreturn  // 0xad
  svc     #0x0

As of Linux 5.8 (https://git.kernel.org/linus/87676cfca14171fc4c99d96ae2f3e87780488ac4), vdso.so does not have PT_GNU_EH_FRAME. Therefore unwinders (llvm-project libunwind, nongnu libunwind, libgcc_s.so.1) ignore its unwind tables. In gdb, gdb/aarch64-linux-tdep.c recognizes the two instructions and encodes how the kernel sets up the ucontext_t structure.

Previously, vdso.so generated a small set of CFI instructions to encode X29 (FP) and X30 (LR).

% llvm-dwarfdump -eh-frame vdso.so
000000c0 0000001c 00000000 CIE
  Format:                DWARF32
  Version:               1
  Augmentation:          "zRS"
  Code alignment factor: 4
  Data alignment factor: -8
  Return address column: 30
  Augmentation data:     1B

  DW_CFA_def_cfa: WSP +0
  DW_CFA_def_cfa: W29 +0
  DW_CFA_offset: W29 0
  DW_CFA_offset_extended_sf: W30 8
  DW_CFA_nop:
  DW_CFA_nop:
  DW_CFA_nop:

  CFA=W29: W29=[CFA], W30=[CFA+8]

000000e0 00000010 00000024 FDE cie=000000c0 pc=000005b0...000005b8
  Format:       DWARF32
  DW_CFA_nop:
  DW_CFA_nop:
  DW_CFA_nop:

  0x5b0: CFA=W29: W29=[CFA], W30=[CFA+8]

However, there was a serious problem: CFI cannot describe a sigreturn trampoline frame. AArch64 does not define a register number for PC and provides no direct way to encode the PC of the previous frame. Instead, it sets return_address_register to X30 and the unwinder updates the PC to whatever value the saved X30 is. Actually, with nongnu libunwind and unw_get_reg(&cursor, UNW_REG_IP, &pc); unw_get_reg(&cursor, UNW_AARCH64_X30, &x30);, we know pc == x30. This approach works fine when LR forms a chain since we know between two adjacent frames, the sets {PC, X30} differ by one element. However, when unwinding through the sigreturn trampoline, the CFI can describe the previous PC but not the previous X30.

musl x86-64

src/signal/x86_64/restore.s implements a sigreturn trampoline __restore_rt. There is no .eh_frame information.

nongnu libunwind does not know that __restore_rt is a sigreturn trampoline (unw_is_signal_frame always returns 0). On ELF targets, -O1 and above typically imply -fomit-frame-pointer and many functions do not save RBP. Note: some functions may save RBP even with -fomit-frame-pointer.

In the absence of a valid frame chain, combined with the fact that nongnu libunwind does not recognize Linux x86-64's sigreturn trampoline, libunwind cannot unwind through the __restore_rt frame. gdb recognizes the sigreturn trampoline frame and with its FP-based unwinding it can retrieve several frames, but not the ones above raise.

% ld.lld @response.release.txt && ./nongnu
pc=0x0000000000206add sp=0x00007ffc018618a0 0 ./nongnu:
pc=0x00007f9fedcd602f sp=0x00007ffc018620c0 0 /home/ray/musl/out/release/lib/libc.so:
pc=0x0000000000000000 sp=0x00007ffc01862db0 0
% gdb ./nongnu -x =(printf 'b handler\nhandle SIGUSR1 nostop\nr\nbt')
...
#0  handler (signo=10) at a.c:9
#1  <signal handler called>
#2  0x00007ffff7fae78a in __restore_sigs () from /home/ray/musl/out/release/lib/libc.so
#3  0x00007ffff7fae8f1 in raise () from /home/ray/musl/out/release/lib/libc.so
#4  0x0000000000000000 in ?? ()

If musl is built with -fno-omit-frame-pointer, nongnu libunwind will use its FP-based fallback (see src/x86_64/Gstep.c). The output looks like:

pc=0x0000000000206ada sp=0x00007fffd51b1830 0 ./nongnu:
pc=0x00007f0f09352858 sp=0x00007fffd51b2040 0 /home/ray/musl/out/release-fp/lib/libc.so:__setjmp
pc=0x0000000000206aaa sp=0x00007fffd51b2db0 0 ./nongnu:main
pc=0x00007f0f092f88ec sp=0x00007fffd51b2dd0 0 /home/ray/musl/out/release-fp/lib/libc.so:
pc=0x00000000002069d6 sp=0x00007fffd51b2e00 0 ./nongnu:_start

unw_step uses the saved RBP to infer RSP/RBP/RIP in the previous frame. If the signal handler saves RBP and calls unw_step, the saved RBP is essentially the RBP value in the frame.

rbp_loc = DWARF_LOC(rbp, 0);
rsp_loc = DWARF_VAL_LOC(c, rbp + 16);
rip_loc = DWARF_LOC (rbp + 8, 0);
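The three assignments above amount to simple arithmetic on the saved RBP. The following is a minimal sketch of that arithmetic only. It assumes every frame on the chain was set up with `push %rbp; mov %rsp, %rbp`. `struct fp_cursor` and `fp_step` are hypothetical names, and libunwind's real fallback performs additional validity checks.

```c
#include <stdint.h>

/* Hypothetical cursor holding the three registers an FP-based step updates. */
struct fp_cursor {
  uint64_t rip, rsp, rbp;
};

/* Step from the current frame to its caller. Assumed frame layout:
 *   [rbp + 0]  caller's RBP
 *   [rbp + 8]  return address
 *   rbp + 16   caller's RSP once the callee has returned
 * Returns 0 when the chain ends (RBP is zero), 1 otherwise. */
int fp_step(struct fp_cursor *c) {
  if (c->rbp == 0)
    return 0;
  const uint64_t *frame = (const uint64_t *)(uintptr_t)c->rbp;
  uint64_t prev_rbp = frame[0];
  uint64_t prev_rip = frame[1];
  c->rsp = c->rbp + 16;
  c->rbp = prev_rbp;
  c->rip = prev_rip;
  return 1;
}
```

Calling `fp_step` repeatedly walks the chain until a zero RBP is reached. If any function on the path does not maintain RBP as a frame pointer, the walk yields garbage or stops early, which is why the callers of the user code need -fno-omit-frame-pointer.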

Actually, not every source file needs to be built with -fno-omit-frame-pointer. We just need to build the source files that transfer control to the user program, and their callers. For this example, building src/signal/raise.c with -fno-omit-frame-pointer allows us to unwind to main. Additionally rebuilding src/env/__libc_start_main.c allows us to unwind to _start.

musl's Makefile specifies -fno-asynchronous-unwind-tables (see "option to enable eh_frame" for a 2011 discussion). If -g is added to CFLAGS, libc.so will have .debug_frame. gdb can then retrieve the caller of raise:

#0  handler (signo=10) at a.c:9
#1  <signal handler called>
#2  __restore_sigs (set=set@entry=0x7fffffffe240) at ../../arch/x86_64/syscall_arch.h:40
#3  0x00007ffff7fa36e0 in raise (sig=sig@entry=10) at ../../src/signal/raise.c:11
#4  0x00000000002071ff in main () at a.c:33

nongnu libunwind can be built with --enable-debug-frame to support .debug_frame. Unfortunately, since it does not recognize the sigreturn trampoline, it cannot retrieve the main frame for this example.

RISC-V

Like AArch64, Linux RISC-V defines the sigreturn trampoline __vdso_rt_sigreturn in the VDSO.

ENTRY(__vdso_rt_sigreturn)
	.cfi_startproc
	.cfi_signal_frame
	li a7, __NR_rt_sigreturn
	scall
	.cfi_endproc
ENDPROC(__vdso_rt_sigreturn)
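The trampoline is just two 4-byte instructions, so an unwinder can identify it by comparing instruction words. Below is a minimal sketch under these assumptions: `__NR_rt_sigreturn` is 139 on RISC-V Linux, `li a7, 139` is assembled as a single `addi a7, zero, 139`, and `scall` is the older mnemonic for `ecall`. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode a RISC-V I-type instruction from its fields. */
uint32_t rv_itype(uint32_t imm12, uint32_t rs1, uint32_t funct3, uint32_t rd,
                  uint32_t opcode) {
  return (imm12 & 0xfff) << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode;
}

/* Hypothetical helper: do the two instruction words at the PC match
 * `li a7, __NR_rt_sigreturn; ecall`? */
bool is_riscv_rt_sigreturn(const uint32_t insn[2]) {
  /* addi a7, zero, 139: a7 is x17, OP-IMM opcode is 0x13, funct3 is 0. */
  const uint32_t li_a7 = rv_itype(139, 0, 0, 17, 0x13);
  const uint32_t ecall = 0x00000073;
  return insn[0] == li_a7 && insn[1] == ecall;
}
```

Under these assumptions the first word comes out as 0x08b00893. A real unwinder must also confirm that the PC is readable before loading the two words.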

llvm-project libunwind added support for unwinding through a sigreturn trampoline in https://reviews.llvm.org/D148499 (2023-05).

The output looks like the following on Arch Linux RISC-V (riscv64gc).

[root@archlinux riscv64]# ./b
pc=0x0000000000010a82 sp=0x00007fffcdd00910 ./b:
pc=0x00007fff8af83800 sp=0x00007fffcdd01520 linux-vdso.so.1:__vdso_rt_sigreturn
pc=0x00007fff8ae7fbee sp=0x00007fffcdd01960 /usr/lib/libc.so.6:
pc=0x00007fff8ae4ad66 sp=0x00007fffcdd019b0 /usr/lib/libc.so.6:gsignal
pc=0x0000000000010a3c sp=0x00007fffcdd019c0 ./b:main
pc=0x00007fff8ae3b1d4 sp=0x00007fffcdd019e0 /usr/lib/libc.so.6:
pc=0x00007fff8ae3b27c sp=0x00007fffcdd01b10 /usr/lib/libc.so.6:__libc_start_main
pc=0x00000000000109a0 sp=0x00007fffcdd01b60 ./b:_start

Unwinders' compatibility with libc implementations

The values represent how the unwinder unwinds through the sigreturn trampoline frame.

Unwinder	Linux glibc	Linux musl
nongnu libunwind AArch64	recognizes in VDSO	not tested
nongnu libunwind x86-64	.eh_frame in libc.so.6	unwindable if FP is enabled
gdb AArch64	recognizes in VDSO	not tested
gdb x86-64	recognizes	recognizes sigreturn trampoline

Links to frame related code

gcc libgcc/config/aarch64/linux-unwind.h:aarch64_fallback_frame_state
gdb gdb/aarch64-linux-tdep.c:aarch64_linux_rt_sigframe, gdb/amd64-linux-tdep.c:amd64_linux_sigtramp_start, gdb/riscv-linux-tdep.c:riscv_linux_sigframe
llvm-project libunwind https://reviews.llvm.org/D90898. We now use syscall(SYS_rt_sigprocmask, ...) to check whether a PC pointer is addressable.
Linux kernel arch/x86/kernel/signal.c:setup_rt_frame arch/riscv/kernel/vdso/rt_sigreturn.S:__vdso_rt_sigreturn
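The rt_sigprocmask trick mentioned for llvm-project libunwind can be sketched as follows. This is a Linux-only illustration, not the library's actual code. It assumes the kernel copies the new signal set from user memory before it validates the `how` argument, and that the kernel's sigset size is 8 bytes. `addr_is_readable` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: probe whether 8 bytes at `addr` are readable without
 * risking SIGSEGV. */
bool addr_is_readable(const void *addr) {
  /* The syscall accepts a null set pointer, so treat NULL separately. */
  if (addr == NULL)
    return false;
  int saved = errno;
  /* how=~0 is invalid, so the call always fails and never changes the signal
   * mask. The failure is EFAULT if `addr` cannot be read, otherwise EINVAL.
   * A raw syscall is used so that no libc wrapper dereferences `addr`. */
  syscall(SYS_rt_sigprocmask, ~0, addr, NULL, sizeof(uint64_t));
  bool readable = errno != EFAULT;
  errno = saved;
  return readable;
}
```

An unwinder would call such a check on the PC before reading instruction bytes to look for the trampoline pattern, because a PC derived from bad unwind information may point at unmapped memory.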

Core dump

The kernel core dumper coredump.c is simple. The glibc __restore_rt page or the VDSO is not prioritized in the presence of a core file limit. If the page is missing in the core file, gdb prog core -ex bt -batch will not be able to unwind past the sigreturn trampoline. A userspace core dumper may be handy.

Frame pointer based unwinding

#define _GNU_SOURCE
#include <dlfcn.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

[[gnu::noinline]] void unwind() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
#if defined(__riscv) || defined(__loongarch__)
    void **next_fp = fp[-2], *pc = fp[-1];
#elif defined(__powerpc__)
    void **next_fp = fp[0];
    void *pc = next_fp <= fp ? 0 : next_fp[2];
#else
    void **next_fp = *fp, *pc = fp[1];
#endif
    printf("%p %p", next_fp, pc);
    Dl_info info = {};
    if (dladdr((void *)pc, &info))
      printf(" %s:%s", info.dli_fname, info.dli_sname ? info.dli_sname : "");
    puts("");

    if (next_fp <= fp) break;
    fp = next_fp;
  }
}

void handler(int signo) {
  (void)signo;  // unused; the parameter makes the type match signal()'s handler type
  unwind();
}

[[gnu::noinline]] void qux() { signal(SIGUSR1, handler); raise(SIGUSR1); }
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

2022-02-27

Analysis and introspection options in linkers

Updated in 2025-05.

Reproduce tarball

LLD offers a convenient feature to bundle all input files into a tarball, making it easier to experiment with different linker options. Use either of these commands:

clang -fuse-ld=lld -Wl,--reproduce=/tmp/rep.tar a.o b.o
LLD_REPRODUCE=/tmp/rep.tar clang -fuse-ld=lld a.o b.o

Then unpack the tarball, navigate to the directory, and invoke LLD with the response file:

cd /tmp; tar xf rep.tar; cd rep
ld.lld @response.txt  # append options like -y foo

The response file includes the --chroot option, which GNU ld does not support. In most cases, you can simply remove this option to examine GNU ld's behavior.

ld.bfd @response.txt

`--trace-symbol=<sym>`

Alias: -y sym

2022-02-20

lld 14 ELF changes

llvm-project 14 will be released soon. I added some lld/ELF notes to https://github.com/llvm/llvm-project/blob/release/14.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.

Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC architectures with PC-relative instructions

Non-HIDDEN_VAR_NEEDS_DYNAMIC_RELOC architectures without PC-relative instructions

HIDDEN_VAR_NEEDS_DYNAMIC_RELOC architectures