Stack unwinding

Updated in 2024-01.


Stack unwinding has two main uses:

  • To obtain a stack trace for a debugger, crash reporter, profiler, garbage collector, etc.
  • With personality routines and language-specific data areas, to implement C++ exceptions (Itanium C++ ABI). See C++ exception handling ABI

Stack unwinding tasks can be divided into two categories:

  • synchronous: triggered by the program itself (a C++ throw, obtaining its own stack trace, etc.). This kind of unwinding only starts at a function call, so the program counter is in the function body and never in the prologue/epilogue
  • asynchronous: triggered by a garbage collector, a signal, or an external program. This kind of unwinding can start at any instruction, including the function prologue/epilogue

Frame pointer

The classic and simplest stack unwinding scheme is based on the frame pointer: reserve a register as the frame pointer (RBP on x86-64), save the caller's frame pointer in the stack frame in the function prologue, and then set the frame pointer to the address of the saved value. The frame pointer and its saved values on the stack form a singly linked list.

pushq %rbp
movq %rsp, %rbp # after this, RBP references the current frame
...
popq %rbp
retq # RBP references the previous frame

After obtaining the initial frame pointer value (__builtin_frame_address), dereference the frame pointer repeatedly to get the frame pointer values of all stack frames. This method does not work when the program counter is at certain prologue/epilogue instructions, where the frame pointer has not been set up yet or has already been restored.

Note: on RISC-V and LoongArch, the previous frame pointer is saved at fp[-2] instead of fp[0]. See "Consider standardising which stack slot fp points to" for the RISC-V discussion.

The following code works on many architectures:

#include <stdio.h>
[[gnu::noinline]] void qux() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
#if defined(__riscv) || defined(__loongarch__)
    void **next_fp = fp[-2], *pc = fp[-1];
#elif defined(__powerpc__)
    void **next_fp = fp[0];
    void *pc = next_fp <= fp ? 0 : next_fp[2];
#else
    void **next_fp = *fp, *pc = fp[1];
#endif
    printf("%p %p\n", next_fp, pc);
    if (next_fp <= fp) break;
    fp = next_fp;
  }
}
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

The frame pointer-based method is simple, but has several drawbacks.

When the above code is compiled with -O1 or above and [[gnu::noinline]] attributes are removed, foo and bar will have tail calls, and the program output will not include the stack frame for foo and bar. -fno-omit-frame-pointer does not suppress the tail call optimization.

The compiler default of -fomit-frame-pointer has played an important role. Many targets default to -fomit-frame-pointer with -O1 or above. Therefore, in practice, it is not guaranteed that all libraries contain frame pointers. When unwinding a thread, it is necessary to check whether next_fp looks like a stack address before dereferencing it, to prevent segfaults.
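
A minimal sketch of such a check, assuming an x86-64-like layout (saved frame pointer at fp[0], return address at fp[1]), a downward-growing stack, and stack bounds that are already known. The helper name fp_is_plausible and its parameters are hypothetical, not part of any library:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: decide whether next_fp is safe to follow.
// [stack_lo, stack_hi) are the stack bounds of the thread being unwound.
static bool fp_is_plausible(uintptr_t fp, uintptr_t next_fp,
                            uintptr_t stack_lo, uintptr_t stack_hi) {
  assert(stack_lo < stack_hi);
  // Older frames live at higher addresses, so the chain must strictly grow.
  if (next_fp <= fp) return false;
  // Saved slots are pointer-aligned.
  if (next_fp % sizeof(void *) != 0) return false;
  // The frame record must start inside the stack.
  if (next_fp < stack_lo || next_fp >= stack_hi) return false;
  // Two slots (saved FP and return address) are read at next_fp.
  if (stack_hi - next_fp < 2 * sizeof(void *)) return false;
  return true;
}
```

An unwinder would call this before each `fp = next_fp` step and stop the walk as soon as it returns false. It only filters out values that cannot be frame records; a stale but in-bounds value can still produce a bogus frame.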

If we can inject code into the target thread, pthread_attr_getstack returns the stack bounds, which gives an efficient way to check whether an address lies on the stack.

pthread_attr_t attr;
void *addr = 0;
size_t size = 0;
pthread_getattr_np(pthread_self(), &attr);
pthread_attr_getstack(&attr, &addr, &size);
pthread_attr_destroy(&attr);

Another way is to parse /proc/*/maps to determine whether the address is readable (slow). There is a smart trick:

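#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Or use the write end of a pipe.
int fd = open("/dev/random", O_WRONLY);
assert(fd >= 0);
// write() makes the kernel read one byte from address. An unreadable
// address yields EFAULT instead of a segfault.
if (write(fd, address, 1) < 0 && errno == EFAULT) {
  // not readable
}
close(fd);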

On Linux, rt_sigprocmask is better:

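#include <errno.h>
#include <fcntl.h>
#include <syscall.h>
#include <unistd.h>

// `how` is deliberately invalid (~0), so the call cannot change the signal
// mask. It fails with EFAULT when address is not readable.
errno = 0;
syscall(SYS_rt_sigprocmask, ~0, address, (void *)0, /*sizeof(kernel_sigset_t)=*/8);
if (errno == EFAULT) {
  // not readable
}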

In addition, reserving a register for the frame pointer increases text size and has a negative performance impact (extra prologue/epilogue instructions, plus register pressure from having one fewer register). The impact may be quite significant on x86-32, which has few registers. On an architecture with relatively plentiful registers, e.g. x86-64, the performance loss can be very small, say, 1%.

Compiler behavior

  • -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer: all functions maintain frame pointers.
  • -fno-omit-frame-pointer -momit-leaf-frame-pointer: all non-leaf functions maintain frame pointers. Leaf functions don't maintain frame pointers.
  • -fomit-frame-pointer: functions omit frame pointers where possible. arm*-apple-darwin and thumb*-apple-darwin don't support the option.

For -O0, most targets default to -fno-omit-frame-pointer.

For -O1 and above, many targets default to -fomit-frame-pointer (while Apple and FreeBSD don't). Some targets default to -momit-leaf-frame-pointer. Specify -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer to get a similar effect to -O0.

GCC 8 is known to omit frame pointers if the function does not need a frame record for x86 (i386: Don't use frame pointer without stack access). There is a feature request: Option to force frame pointer.

libunwind

C++ exceptions and the stack unwinding done by profilers and crash reporters usually use the libunwind API and DWARF Call Frame Information. In the 1990s, Hewlett-Packard defined a set of libunwind APIs, which fall into two categories:

  • unw_*: The entry points are unw_init_local (local unwinding, current process) and unw_init_remote (remote unwinding, other processes). Applications that use libunwind directly usually use this API. For example, Linux perf calls unw_init_remote
  • _Unwind_*: This part is standardized as Level 1: Base ABI of Itanium C++ ABI: Exception Handling. The Level 2 C++ ABI calls these _Unwind_* APIs. Among them, _Unwind_Resume is the only API that is directly called by C++ compiled code. _Unwind_Backtrace is used by a few applications to obtain stack traces. Other functions are called by libsupc++/libc++abi __cxa_* functions and __gxx_personality_v0.

Hewlett-Packard has open sourced its implementation, https://www.nongnu.org/libunwind/ (one of many projects called "libunwind"). The common implementations of this API on Linux are:

  • libgcc/unwind-* (libgcc_s.so.1 or libgcc_eh.a): implements _Unwind_* and introduces some extensions: _Unwind_Resume_or_Rethrow, _Unwind_FindEnclosingFunction, __register_frame etc.
  • llvm-project/libunwind (libunwind.so or libunwind.a) is a simplified implementation of the HP API, which provides part of unw_*, but does not implement unw_init_remote. Part of the code is taken from ld64. If you use Clang, you can use --rtlib=compiler-rt --unwindlib=libunwind to select it
  • glibc's internal implementation of _Unwind_Find_FDE, usually not exported, and related to __register_frame_info

DWARF Call Frame Information

The unwind instructions required by different areas of the program are described by DWARF Call Frame Information (CFI). On ELF platforms, a modified form called .eh_frame is used. See Linux Standard Base Core Specification, Generic Part for details. The compiler, assembler, linker, and libunwind provide the corresponding support.

.eh_frame is composed of Common Information Entry (CIE) and Frame Description Entry (FDE). A CIE has these fields:

  • length: The size of the length field plus the value of length must be an integral multiple of the address size.
  • CIE_id: Constant 0. This field is used to distinguish a CIE and a FDE. In a FDE, this field is non-zero, representing CIE_pointer
  • version: Constant 1
  • augmentation: A NUL-terminated string describing the CIE/FDE parameter list.
    • z: augmentation_data_length and augmentation_data fields are present and provide arguments to interpret the remaining bytes
    • P: retrieve one byte (encoding) and a value (length decided by the encoding) from augmentation_data to indicate the personality routine pointer
    • L: retrieve one byte from augmentation_data to indicate the encoding of language-specific data area (LSDA) in FDEs. The augmentation data of a FDE stores LSDA
    • R: retrieve one byte from augmentation_data to indicate the encoding of initial_location and address_range in FDEs
    • S: an associated FDE describes a signal frame (used by unw_is_signal_frame)
  • code_alignment_factor: A constant factored out of the operands of advance-location instructions such as DW_CFA_advance_loc. On architectures where instruction lengths are multiples of 2 or 4 (RISC), a factor of 2 or 4 makes the encoded deltas smaller
  • data_alignment_factor: A constant factored out of the offset operands of instructions such as DW_CFA_offset and DW_CFA_val_offset
  • return_address_register
  • augmentation_data_length: only present if augmentation contains z.
  • augmentation_data: only present if augmentation contains z. This field provides arguments describing augmentation. For P, the argument specifies the personality (1-byte encoding and the encoded pointer). For R, the argument specifies the encoding of FDE initial_location.
  • initial_instructions: bytecode for unwinding, a common prefix used by all FDEs using this CIE
  • padding

In .debug_frame version 4 or above, address_size (4 or 8) and segment_selector_size are present. .eh_frame does not have the two fields.

Each FDE has an associated CIE. FDE has these fields:

  • length: The length of FDE itself. If it is 0xffffffff, the next 8 bytes (extended_length) record the actual length. Unless specially constructed, extended_length is not used
  • CIE_pointer: Subtract CIE_pointer from the current position to get the associated CIE
  • initial_location: The address of the first location described by the FDE. The value is encoded with a relocation referencing a section symbol
  • address_range: initial_location and address_range describe an address range
  • instructions: bytecode for unwinding, essentially (address,opcode) pairs
  • augmentation_data_length
  • augmentation_data: If the associated CIE augmentation contains L characters, language-specific data area will be recorded here
  • padding
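
To make the record layout concrete, here is a rough sketch that walks a buffer holding .eh_frame contents and classifies each record as a CIE or an FDE using the rules above (CIE_id is 0; an FDE's CIE_pointer is subtracted from the position of the field itself). The names EhRecord and eh_frame_scan are made up for illustration. The sketch assumes the host byte order matches the file and does not handle the 0xffffffff extended length form:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  size_t offset;      // offset of the record's length field
  size_t size;        // total size, including the 4-byte length field
  int is_cie;         // 1 for a CIE, 0 for an FDE
  size_t cie_offset;  // offset of the associated CIE (itself for a CIE)
} EhRecord;

static uint32_t read_u32(const uint8_t *p) {
  uint32_t v;
  memcpy(&v, p, sizeof v);  // host byte order, assumed to match the file
  return v;
}

// Returns the number of records stored in out (at most max).
static size_t eh_frame_scan(const uint8_t *buf, size_t len, EhRecord *out,
                            size_t max) {
  assert(buf && out);
  size_t off = 0, n = 0;
  while (n < max && len - off >= 8) {
    uint32_t length = read_u32(buf + off);
    // 0 is the terminator; 0xffffffff (extended length) is not handled here.
    if (length == 0 || length == 0xffffffffu) break;
    if (length < 4 || length > len - off - 4) break;  // malformed or truncated
    size_t id_pos = off + 4;  // position of CIE_id / CIE_pointer
    uint32_t id = read_u32(buf + id_pos);
    if (id > id_pos) break;  // CIE_pointer would point before the section
    out[n].offset = off;
    out[n].size = 4 + (size_t)length;
    out[n].is_cie = id == 0;
    out[n].cie_offset = id == 0 ? off : id_pos - id;
    n++;
    off += 4 + (size_t)length;
  }
  return n;
}
```

A real unwinder would go on to parse the augmentation string and the encoded initial_location of each FDE; this sketch stops at the two fixed-size header fields.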

A CIE may optionally refer to a personality routine in the text section (.cfi_personality directive). A FDE may optionally refer to its associated LSDA in .gcc_except_table (.cfi_lsda directive). The personality routine and LSDA are used in Level 2: C++ ABI of Itanium C++ ABI.

llvm-dwarfdump --eh-frame and objdump -Wf can dump the section. objdump -WF (short for --dwarf=frames-interp) gives a tabular output.

load "elf.pk";
load "dwarf-frame.pk";

type EhFrameCIE = struct {
  Dwarf_Initial_Length length;
  uint32 cie_id == 0;
  uint8 version;
  string augmentation;
  ULEB128 code_alignment_factor;
  LEB128 data_alignment_factor;
  ULEB128 return_address_register;
  if (strchr(augmentation, 'z') < augmentation'length)
    ULEB128 augmentation_length;
  if (strchr(augmentation, 'z') < augmentation'length)
    uint8[augmentation_length.value] augmentation_data;
  uint8[length.value + length'size - OFFSET] initial_instructions;
};

type EhFrameFDE = struct {
  Dwarf_Initial_Length length;
  uint32 cie_pointer;
  int32 initial_location; // augmentation 'R' decides the type
  int32 address_range; // augmentation 'R' decides the type
  uint8[length.value + length'size - OFFSET] instructions;
};

type EhFrameEntry = union {
  EhFrameCIE cie;
  EhFrameFDE fde;
};

.eh_frame vs .debug_frame

Some targets default to -fasynchronous-unwind-tables while some default to -fno-asynchronous-unwind-tables.

Here is the GCC and Clang behavior:

Compiler options                                  Produced section
-fasynchronous-unwind-tables -fexceptions         .eh_frame
-fno-asynchronous-unwind-tables -fexceptions      .eh_frame
-fasynchronous-unwind-tables -fno-exceptions      .eh_frame
-fno-asynchronous-unwind-tables -fno-exceptions   none (-g0) or .debug_frame (-g1 and above)

.eh_frame is based on .debug_frame introduced in DWARF v2. They have some differences, though:

  • .eh_frame has the SHF_ALLOC flag (indicating that the section is part of the process image) but .debug_frame does not, so the latter has very few use cases.
  • .debug_frame supports the DWARF64 format (64-bit offsets, at the cost of a slightly larger size) but .eh_frame does not (it could be extended to do so, but there is no demand)
  • A .debug_frame CIE has only the augmentation string, without augmentation_data_length and augmentation_data.
  • The version field in CIEs is different.
  • The meaning of CIE_pointer in FDEs is different. In .debug_frame it is a section offset (absolute), while in .eh_frame it is a relative offset. This change made by .eh_frame is great. If the section grows beyond what 32 bits can address, .debug_frame has to be converted to DWARF64 to represent CIE_pointer, whereas a relative offset does not have this problem (if the distance between an FDE and its CIE exceeds 32 bits, just add another CIE)
  • In .eh_frame, augmentation typically includes R and the FDE encoding is DW_EH_PE_pcrel|DW_EH_PE_sdata4 for small code models of AArch64/PowerPC64/x86-64. initial_location has 4 bytes in GCC (even if -mcmodel=large). In .debug_frame, 64-bit architectures need 8-byte initial_location. Therefore, .eh_frame is usually smaller than an equivalent .debug_frame

For two otherwise equivalent relocatable object files, one using .debug_frame while the other using .eh_frame, size(.debug_frame)+size(.rela.debug_frame) > size(.eh_frame)+size(.rela.eh_frame), perhaps larger by ~20%. If we compress .debug_frame (.eh_frame cannot be compressed), size(compressed .debug_frame)+size(.rela.debug_frame) < size(.eh_frame)+size(.rela.eh_frame).


For the following function:

void f() {
__builtin_unwind_init();
}

The compiler emits .cfi_* directives (CFI directives) to annotate the assembly. .cfi_startproc and .cfi_endproc delimit the region covered by an FDE, and the other CFI directives describe CFI instructions. A call frame is identified by an address on the stack. This address is called the Canonical Frame Address (CFA), and is usually the stack pointer value at the call site. The following example demonstrates the usage of CFI instructions:

f:
# At the function entry, CFA = rsp+8
.cfi_startproc
# %bb.0:
pushq %rbp
# Redefine CFA = rsp+16
.cfi_def_cfa_offset 16
# rbp is saved at the address CFA-16
.cfi_offset %rbp, -16
movq %rsp, %rbp
# CFA = rbp+16. CFA does not need to be redefined when rsp changes
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
# rbx is saved at the address CFA-56
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
# CFA = rsp+8
.cfi_def_cfa %rsp, 8
retq
.Lfunc_end0:
.size f, .Lfunc_end0-f
.cfi_endproc

The assembler parses CFI directives and generates .eh_frame (this mechanism was introduced by Alan Modra in 2003). The linker collects .eh_frame input sections from .o/.a files to generate the output .eh_frame. In 2006, GNU as introduced .cfi_personality and .cfi_lsda.

.eh_frame_hdr and PT_GNU_EH_FRAME

Without an index, locating the FDE that covers a PC requires scanning .eh_frame from the beginning until an FDE whose range (given by initial_location and address_range) contains the PC is found. The time spent is proportional to the number of scanned CIE and FDE records. https://sourceware.org/pipermail/binutils/2001-December/015674.html introduced .eh_frame_hdr, a binary search index table of (initial_location, FDE address) pairs.

type EhFrameHdr = struct {
  uint8 version;
  uint8 eh_frame_ptr_enc;
  uint8 fde_count_enc;
  uint8 table_enc;
  int32 eh_frame_ptr; // eh_frame_ptr_enc decides the type
  int32 fde_count; // fde_count_enc decides the type

  type Entry = struct {
    int32 initial_location; // table_enc decides the type
    int32 address; // table_enc decides the type
  };
  Entry[fde_count] table;
};

The linker collects all .eh_frame input sections. With --eh-frame-hdr, ld generates .eh_frame_hdr and creates a program header PT_GNU_EH_FRAME encompassing .eh_frame_hdr. An unwinder can parse the program headers and look for PT_GNU_EH_FRAME to locate .eh_frame_hdr. Please check out the example below.

Clang and GCC usually pass --eh-frame-hdr to ld, with the exception that gcc -static does not pass --eh-frame-hdr. The difference is a historical choice related to __register_frame_info.

GNU ld and ld.lld only support eh_frame_ptr_enc = DW_EH_PE_pcrel | DW_EH_PE_sdata4 (PC-relative int32_t), fde_count_enc = DW_EH_PE_udata4 (uint32_t), and table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4 (.eh_frame_hdr-relative int32_t). (GNU ld also supports DW_EH_PE_omit when there is no FDE.)
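
With those encodings, the lookup an unwinder performs on the table can be sketched as follows. HdrEntry and hdr_lookup are illustrative names, and the sketch assumes the caller has already converted the PC to an offset from the start of .eh_frame_hdr, since both table fields are relative to it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// One table entry with table_enc = DW_EH_PE_datarel | DW_EH_PE_sdata4.
typedef struct {
  int32_t initial_location;  // relative to the start of .eh_frame_hdr
  int32_t fde;               // relative to the start of .eh_frame_hdr
} HdrEntry;

// Returns the index of the last entry with initial_location <= pc_rel,
// or -1 if pc_rel precedes every entry. The table is sorted by
// initial_location.
static long hdr_lookup(const HdrEntry *table, size_t count, int64_t pc_rel) {
  assert(table || count == 0);
  size_t lo = 0, hi = count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (table[mid].initial_location <= pc_rel)
      lo = mid + 1;
    else
      hi = mid;
  }
  return (long)lo - 1;
}
```

The table has no length column, so a hit only says which FDE starts at or before the PC. The unwinder still has to read address_range from that FDE to confirm the PC is covered.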

With GNU poke, .eh_frame_hdr can be dumped as follows:

load "eh_frame_hdr.pk"
load elf
var efile = Elf64_File @ 0#B
var hdr = efile.get_sections_by_name(".eh_frame_hdr")[0]
printf "%Tv\n", EhFrameHdr @ hdr.sh_offset

__register_frame_info

Before .eh_frame_hdr and PT_GNU_EH_FRAME were invented, a static constructor frame_dummy in crtbegin (crtstuff.c) called __register_frame_info to register the executable's .eh_frame.

Nowadays __register_frame_info is only used by programs linked with -static. Correspondingly, if you specify -Wl,--no-eh-frame-hdr when linking, you cannot unwind (if you use a C++ exception, the program will call std::terminate).

libunwind example

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
  unw_context_t context;
  unw_cursor_t cursor;
  // Store register values into context.
  unw_getcontext(&context);
  // Locate the PT_GNU_EH_FRAME which contains PC.
  unw_init_local(&cursor, &context);
  // unw_get_reg stores into an unw_word_t.
  unw_word_t rip, rsp;
  do {
    unw_get_reg(&cursor, UNW_X86_64_RIP, &rip);
    unw_get_reg(&cursor, UNW_X86_64_RSP, &rsp);
    printf("rip: %zx rsp: %zx\n", (size_t)rip, (size_t)rsp);
  } while (unw_step(&cursor) > 0);
}

void bar() {backtrace();}
void foo() {bar();}
int main() {foo();}

If you use llvm-project/libunwind:

$CC a.c -Ipath/to/include -Lpath/to/lib -lunwind

If you use nongnu.org/libunwind, there are two options: (a) add #define UNW_LOCAL_ONLY before #include <libunwind.h>, or (b) link one more library, which on x86-64 is -l:libunwind-x86_64.so. If you use Clang, you can also use clang --rtlib=compiler-rt --unwindlib=libunwind -I path/to/include a.c. In addition to providing unw_*, this ensures that libgcc_s.so is not linked.

  • unw_getcontext: Get register value (including PC)
  • unw_init_local
    • Use dl_iterate_phdr to traverse the executable and shared objects, and find the PT_LOAD program header that contains the PC
    • Find the PT_GNU_EH_FRAME (.eh_frame_hdr) of the containing module, and save it in cursor
  • unw_step
    • Binary search for the .eh_frame_hdr item corresponding to the PC, record the FDE found and the CIE it points to
    • Execute initial_instructions in CIE
    • Execute the instructions (bytecode) in FDE. An automaton maintains the current location and CFA. Among the instructions, DW_CFA_advance_loc advances the location; DW_CFA_def_cfa_* updates CFA; DW_CFA_offset indicates that the value of a register is stored at CFA+offset
    • The automaton stops when the current location is greater than or equal to PC. In other words, the executed instruction is a prefix of FDE instructions

An unwinder locates the applicable FDE according to the program counter, and executes all the CFI instructions before the program counter.

The most common instructions are:

  • DW_CFA_def_cfa_*
  • DW_CFA_offset
  • DW_CFA_advance_loc
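
The automaton can be sketched in a few dozen lines. This is only an illustration under several assumptions: it handles just DW_CFA_nop, DW_CFA_advance_loc, DW_CFA_offset, DW_CFA_def_cfa, DW_CFA_def_cfa_register and DW_CFA_def_cfa_offset, it stops once the current location reaches pc_offset (the PC relative to initial_location), and the names UnwindRow and run_cfi are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_REGS = 32 };

typedef struct {
  uint64_t cfa_reg;               // CFA = value of cfa_reg + cfa_offset
  int64_t cfa_offset;
  int saved[MAX_REGS];            // 1 if the register is saved on the stack
  int64_t save_offset[MAX_REGS];  // the register is saved at CFA + save_offset
} UnwindRow;

static uint64_t read_uleb128(const uint8_t **p, const uint8_t *end) {
  uint64_t value = 0;
  unsigned shift = 0;
  while (*p < end) {
    uint8_t byte = *(*p)++;
    if (shift < 64) value |= (uint64_t)(byte & 0x7f) << shift;
    shift += 7;
    if (!(byte & 0x80)) break;
  }
  return value;
}

// Executes CFI instructions while the current location is below pc_offset.
// Returns 0 on success, -1 on an instruction this sketch does not handle.
static int run_cfi(const uint8_t *p, size_t n, uint64_t code_align,
                   int64_t data_align, uint64_t pc_offset, UnwindRow *row) {
  assert(p && row);
  const uint8_t *end = p + n;
  uint64_t loc = 0;
  while (p < end && loc < pc_offset) {
    uint8_t op = *p++;
    unsigned operand = op & 0x3f;
    switch (op & 0xc0) {
    case 0x40:  // DW_CFA_advance_loc
      loc += operand * code_align;
      continue;
    case 0x80:  // DW_CFA_offset
      if (operand >= MAX_REGS) return -1;
      row->saved[operand] = 1;
      row->save_offset[operand] = (int64_t)read_uleb128(&p, end) * data_align;
      continue;
    case 0xc0:  // DW_CFA_restore: not handled
      return -1;
    }
    switch (op) {
    case 0x00:  // DW_CFA_nop
      break;
    case 0x0c:  // DW_CFA_def_cfa
      row->cfa_reg = read_uleb128(&p, end);
      row->cfa_offset = (int64_t)read_uleb128(&p, end);
      break;
    case 0x0d:  // DW_CFA_def_cfa_register
      row->cfa_reg = read_uleb128(&p, end);
      break;
    case 0x0e:  // DW_CFA_def_cfa_offset
      row->cfa_offset = (int64_t)read_uleb128(&p, end);
      break;
    default:
      return -1;
    }
  }
  return 0;
}
```

To mirror the steps above, run the CIE initial_instructions first with pc_offset set to UINT64_MAX so that all of them execute, then run the FDE instructions on the same row with the real offset. On x86-64, DWARF register 6 is RBP, 7 is RSP and 16 is the return address.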

In a -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 build of clang, .text is 51.7MiB, .eh_frame is 4.2MiB, .eh_frame_hdr is 646KiB. There are 2 CIEs and 82745 FDEs.

Remarks

CFI instructions are suitable for the compiler to generate code, but cumbersome to write in hand-written assembly. In 2015, Alex Dowad contributed an awk script to musl libc to parse the assembly and automatically generate CFI directives. In fact, generating precise CFI instructions is challenging for compilers as well. For a function that does not use a frame pointer, adjusting SP requires outputting a CFI directive to redefine CFA. GCC does not parse inline assembly, so adjusting SP in inline assembly often results in imprecise CFI.

void foo() {
  asm("subq $128, %rsp\n"
      // Cannot unwind if -momit-leaf-frame-pointer
      "nop\n"
      "addq $128, %rsp\n");
}

int main() {
  foo();
}

In glibc, the commit "x86: Remove arch-specific low level lock implementation" removed sysdeps/unix/sysv/linux/x86_64/lowlevellock.h. The file used to do something like this:

#define lll_lock(futex, private) \
(void) \
({ int ignore1, ignore2, ignore3; \
if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
__asm __volatile (__lll_lock_asm_start \
"1:\tlea %2, %%" RDI_LP "\n" \
"2:\tsub $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset 128\n" \
"3:\tcallq __lll_lock_wait_private\n" \
"4:\tadd $128, %%" RSP_LP "\n" \
".cfi_adjust_cfa_offset -128\n" \
"24:" \
: "=S" (ignore1), "=&D" (ignore2), "=m" (futex), \
"=a" (ignore3) \
: "0" (1), "m" (futex), "3" (0) \
: "cx", "r11", "cc", "memory"); \
...

.cfi_adjust_cfa_offset 128 works with frames using RSP as the CFA but not with RBP. Unfortunately it is difficult to ensure a frame does not use RBP. Even with -fomit-frame-pointer, some conditions will switch to RBP.

The CFIInstrInserter pass in LLVM can insert .cfi_def_cfa_*, .cfi_offset, and .cfi_restore to adjust the CFA and callee-saved registers. The CFIFixup pass in LLVM can insert .cfi_remember_state and .cfi_restore_state. The information generated by CFIFixup is more space-efficient and is therefore preferred.

The DWARF scheme also has very low information density. The various compact unwind schemes improve on this aspect. To list a few issues:

  • CIE address_size: nobody uses different values for an architecture. Even if they do (ILP32 ABIs in AArch64 and x86-64), the information is already available elsewhere.
  • CIE segment_selector_size: It is nice that they cared about x86, but x86 itself does not need it anymore:/
  • CIE code_alignment_factor and data_alignment_factor: A RISC architecture with such preference can hard code the values.
  • CIE return_address_register: I do not know when an architecture wants to use a different register for the return address.
  • length: The DWARF's 8-byte form is definitely overengineered... For standard form prologue/epilogue, the field should not be needed.
  • initial_location and address_range: if a binary search index table is always needed, why do we need the length field?
  • instructions: bytecode is flexible but commonly a function prologue/epilogue is of a standard form and the few callee-saved registers can be encoded in a more compact way.
  • augmentation_data: While this provides flexibility, in practice a function very rarely needs anything more than a personality and an LSDA pointer.

Callee-saved registers other than FP are oftentimes unneeded but there is no compiler option to drop them.

SHT_X86_64_UNWIND

.eh_frame has special processing in linker/dynamic loader, so conventionally it should use a separate section type, but SHT_PROGBITS was used in the design. In the x86-64 psABI, the type of .eh_frame is SHT_X86_64_UNWIND (influenced by Solaris).

  • In GNU as, .section .eh_frame,"a",@unwind will generate SHT_X86_64_UNWIND, and .cfi_* will generate SHT_PROGBITS.
  • Since Clang 3.8, .cfi_* generates SHT_X86_64_UNWIND

.section .eh_frame,"a",@unwind is rare (glibc's x86 port, libffi, LuaJIT and other packages), so checking the type of .eh_frame is a good way to distinguish Clang/GCC object file :) For ld.lld 11.0.0, I contributed https://reviews.llvm.org/D85785 to allow mixed types for .eh_frame in a relocatable link;-)

Suggestion to future architectures: when defining processor-specific section types, please do not use 0x70000001 (SHT_ARM_EXIDX=SHT_IA_64_UNWIND=SHT_PARISC_UNWIND=SHT_X86_64_UNWIND=SHT_LOPROC+1) for purposes other than unwinding :) Unfortunately, SHT_CSKY_ATTRIBUTES=0x70000001 already does :)

Linker perspective

Usually in the case of COMDAT groups and -ffunction-sections, .data/.rodata needs to be split like .text, but .eh_frame is monolithic. Like many other metadata sections, the main problem with the monolithic section is that garbage collection is challenging in the linker. Unlike some other metadata sections, simply giving up on garbage collection is not an option:

  • .eh_frame_hdr is a binary search index table and duplicate/unused entries can confuse consumers.
  • .eh_frame is not small. Users want to discard unused FDE.

ld.lld has some special handling for .eh_frame:

  • -M requires special code
  • --gc-sections. --gc-sections occurs before .eh_frame deduplication/GC.
  • For -r and --emit-relocs, a relocation from .eh_frame to a STT_SECTION symbol in a discarded section (due to COMDAT group rule) should be allowed (normally such a STB_LOCAL relocation from outside of the group is disallowed).

When a linker processes .eh_frame, it needs to conceptually split .eh_frame into CIE/FDE. ld.lld splits .eh_frame before marking sections as live for --gc-sections. ld.lld handles CIE and FDE differently:

  • relocations in CIEs reference personality routine symbols. The personality routines should be marked. The idea is that personality routines are generally not referenced by other means, and would be discarded if we don't retain them when scanning CIE relocations.
  • relocations in FDEs may reference code functions (via initial_location) and LSDA (via augmentation_data). Code functions (identified by the SHF_EXECINSTR flag) are not marked. LSDA neither in a section group nor having the SHF_LINK_ORDER flag is marked.

ld.lld merges identical CIEs.

GNU ld and gold support --ld-generated-unwind-info which can synthesize CFI for PLT entries. This increases CFI coverage, but I think it is largely obsoleted and irrelevant nowadays. See --no-ld-generated-unwind-info in Explain GNU style linker options#no-ld-generated-unwind-info. Note, such a mechanism is not available for range extension thunks (also linker synthesized code).

For --icf, two text sections may have identical content and relocations but different LSDA, e.g. the two functions may have catch blocks of different types. We cannot merge the two sections. For simplicity, we can mark all text sections referenced by LSDA as not eligible for ICF.

Compact unwind descriptors

On macOS, Apple designed the compact unwind descriptors mechanism to accelerate unwinding. In theory, this technique can be used to save some space in __eh_frame, but it has not been implemented. The main idea is:

  • The FDEs of most functions follow a fixed pattern (define the CFA in the prologue, store callee-saved registers), so the FDE instructions can be compressed to 32 bits.
  • The personality/LSDA described by CIE/FDE augmentation data is very common and can be extracted as fixed fields.

Only 64-bit will be discussed below. A descriptor occupies 32 bytes:

.quad _foo
.set L1, Lfoo_end-_foo
.long L1
.long compact_unwind_description
.quad personality
.quad lsda_address

If you study .eh_frame_hdr (a binary search index table) and .ARM.exidx, you will see that the length field is redundant.

The Compact unwind descriptor is encoded as:

uint32_t : 24; // vary with different modes
uint32_t mode : 4;
uint32_t flags : 4;

Five modes are defined:

  • 0: reserved
  • 1: FP-based frame: RBP is frame pointer, frame size is variable
  • 2: SP-based frame: frame pointer is not used, frame size is fixed during compilation
  • 3: large SP-based frame: frame pointer is not used, the frame size is fixed at compile time but the value is large and cannot be represented by mode 2
  • 4: DWARF CFI escape

FP-based frame (UNWIND_MODE_BP_FRAME)

The compact unwind encoding is:

uint32_t regs : 15;
uint32_t : 1; // 0
uint32_t stack_adjust : 8;
uint32_t mode : 4;
uint32_t flags : 4;

The callee-saved registers on x86-64 are RBX, R12, R13, R14, R15, and RBP. 3 bits can encode a register, so 15 bits are enough to describe the 5 registers other than RBP (whether each is saved and where). stack_adjust records the extra stack space beyond the saved registers.

SP-based frame (UNWIND_MODE_STACK_IMMD)

The compact unwind encoding is:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t : 3;
uint32_t size : 8;
uint32_t mode : 4;
uint32_t flags : 4;

cnt represents the number of saved registers (maximum 6). reg_permutation indicates the sequence number of the saved register. size*8 represents the stack frame size.
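
The two layouts shown so far can be unpacked with plain shifts and masks. The sketch below assumes the bit-fields listed above are allocated starting from the least significant bit, so mode sits in bits 24..27. CompactUnwind and decode_compact_unwind are illustrative names rather than an existing API:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  unsigned mode;             // bits 24..27
  unsigned flags;            // bits 28..31
  unsigned fp_regs;          // mode 1: the 15-bit regs field
  unsigned stack_adjust;     // mode 1: the 8-bit stack_adjust field
  unsigned reg_permutation;  // mode 2: the 10-bit reg_permutation field
  unsigned reg_count;        // mode 2: cnt
  unsigned frame_size;       // mode 2: size*8, in bytes
} CompactUnwind;

// Returns 0 for the FP-based and SP-based modes, -1 for any other mode.
static int decode_compact_unwind(uint32_t enc, CompactUnwind *out) {
  assert(out);
  *out = (CompactUnwind){0};
  out->mode = (enc >> 24) & 0xf;
  out->flags = enc >> 28;
  switch (out->mode) {
  case 1:  // FP-based frame
    out->fp_regs = enc & 0x7fff;
    out->stack_adjust = (enc >> 16) & 0xff;
    return 0;
  case 2:  // SP-based frame
    out->reg_permutation = enc & 0x3ff;
    out->reg_count = (enc >> 10) & 0x7;
    out->frame_size = ((enc >> 16) & 0xff) * 8;
    return 0;
  default:  // reserved, large SP-based frame, or DWARF CFI escape
    return -1;
  }
}
```

Turning reg_permutation back into an ordered list of registers needs a further permutation decode that this sketch leaves out.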

Large SP-based frame (UNWIND_MODE_STACK_IND)

The compact unwind encoding is:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t adj : 3;
uint32_t size_offset : 8;
uint32_t mode : 4;
uint32_t flags : 4;

Similar to SP-based frame. In particular: the stack frame size is read from the text section. The RSP adjustment is usually represented by subq imm, %rsp, and size_offset is used to represent the distance from the instruction to the beginning of the function. The actual stack size also includes adj*8.

DWARF CFI escape

If the unwind information of a function cannot be expressed as a compact unwind descriptor for any reason, it must fall back to DWARF CFI.

In the LLVM implementation, each function is represented by only one compact unwind descriptor. If asynchronous stack unwinding occurs in an epilogue, existing implementations cannot distinguish it from unwinding in the function body: the Canonical Frame Address will be computed incorrectly, and the callee-saved registers will be read incorrectly. If it occurs in a prologue, and the prologue contains instructions other than register pushes and subq imm, %rsp, an error will occur. In addition, if shrink wrapping is enabled for a function, the prologue may not be at the beginning of the function, and asynchronous unwinding in the region between the function start and the prologue also fails. It seems that most people don't care about this issue, perhaps because it only costs a profiler a few percent of its samples.

In fact, if you use multiple descriptors to describe each area of a function, you can still unwind accurately. OpenVMS proposed [RFC] Improving compact x86-64 compact unwind descriptors in 2018, but unfortunately there is no relevant implementation.

ARM exception handling

ARM exception handling data is divided into .ARM.exidx and .ARM.extab.

.ARM.exidx is a binary search index table, composed of 2-word pairs. The first word is a 31-bit PC-relative offset to the start of the region. The second word is most clearly described with code:

if (indexData == EXIDX_CANTUNWIND)
  return false; // like an absent .eh_frame entry. In the case of C++ exceptions, std::terminate
if (indexData & 0x80000000) {
  extabAddr = &indexData;
  extabData = indexData; // inline
} else {
  extabAddr = &indexData + signExtendPrel31(indexData);
  extabData = read32(&indexData + signExtendPrel31(indexData)); // stored in .ARM.extab
}

extabData & 0x80000000 means a compact model entry; otherwise the entry is a generic model entry.
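
The pseudocode relies on signExtendPrel31. A possible C rendering of that helper, plus the three-way classification of the second word, is sketched below. ExidxKind and classify_exidx are hypothetical names, and EXIDX_CANTUNWIND is taken to be 1 as in the ARM EHABI:

```c
#include <assert.h>
#include <stdint.h>

enum { EXIDX_CANTUNWIND = 1 };

typedef enum { ENTRY_CANTUNWIND, ENTRY_INLINE, ENTRY_EXTAB } ExidxKind;

// Interprets the low 31 bits of word as a signed offset (bit 30 is the sign).
static int32_t sign_extend_prel31(uint32_t word) {
  uint32_t v = word & 0x7fffffffu;
  return (int32_t)(v ^ 0x40000000u) - 0x40000000;
}

// Classifies the second word of an .ARM.exidx pair. For ENTRY_EXTAB,
// *extab_offset receives the offset from the word to its .ARM.extab entry.
static ExidxKind classify_exidx(uint32_t second_word, int32_t *extab_offset) {
  assert(extab_offset);
  *extab_offset = 0;
  if (second_word == EXIDX_CANTUNWIND) return ENTRY_CANTUNWIND;
  if (second_word & 0x80000000u) return ENTRY_INLINE;
  *extab_offset = sign_extend_prel31(second_word);
  return ENTRY_EXTAB;
}
```

The first word of each pair uses the same prel31 form, so sign_extend_prel31 also recovers the function start address relative to the entry.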

.ARM.exidx is equivalent to enhanced .eh_frame_hdr, compact model is equivalent to inlining the personality and lsda in .eh_frame. Consider the following three situations:

  • If the C++ exception will not be triggered and the function that may trigger the exception will not be called: no personality is needed, only one EXIDX_CANTUNWIND entry is needed, no .ARM.extab
  • If a C++ exception is triggered but no landing pad is required: personality is __aeabi_unwind_cpp_pr0, only a compact model entry is needed, no .ARM.extab
  • If there is a catch: __gxx_personality_v0 is required, .ARM.extab is required

.ARM.extab is equivalent to the combined .eh_frame and .gcc_except_table.

Generic model

uint32_t personality; // bit 31 is 0
uint32_t : 24;
uint32_t num : 8;
uint32_t opcodes[]; // opcodes, variable length
uint8_t lsda[]; // variable length

Under construction.

Windows ARM64 exception handling

See https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling. This is my favorite encoding scheme. It supports unwinding in the middle of a prolog or epilog, and it supports function fragments (used to represent unconventional stack frames such as those produced by shrink wrapping).

The information is stored in two sections, .pdata and .xdata.

uint32_t function_start_rva;
uint32_t Flag : 2;
uint32_t Data : 30;

For functions in canonical form, Packed Unwind Data is used and no .xdata record is required; a descriptor that cannot be represented by Packed Unwind Data is stored in .xdata.

Packed Unwind Data

uint32_t FunctionStartRVA;
uint32_t Flag : 2;
uint32_t FunctionLength : 11;
uint32_t RegF : 3;
uint32_t RegI : 4;
uint32_t H : 1;
uint32_t CR : 2;
uint32_t FrameSize : 9;
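A minimal sketch of how the second .pdata word could be split into the fields listed above, assuming the bit-fields are allocated from the least significant bit upward. The struct and function names are hypothetical; the units in the comments follow the Microsoft documentation linked above (FunctionLength counts 4-byte units, FrameSize counts 16-byte units).

```c
#include <stdint.h>

// Hypothetical container for the decoded fields.
struct packed_unwind {
  unsigned flag, function_length, reg_f, reg_i, h, cr, frame_size;
};

// Split the second .pdata word into its fields, lowest bits first.
static struct packed_unwind decode_packed(uint32_t w) {
  struct packed_unwind p;
  p.flag = w & 0x3;
  p.function_length = (w >> 2) & 0x7ff;  // function length in bytes / 4
  p.reg_f = (w >> 13) & 0x7;
  p.reg_i = (w >> 16) & 0xf;
  p.h = (w >> 20) & 0x1;
  p.cr = (w >> 21) & 0x3;
  p.frame_size = (w >> 23) & 0x1ff;      // stack frame size in bytes / 16
  return p;
}
```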

MIPS compact exception tables

Under construction.

Linux kernel ORC unwind tables

For x86-64, the Linux kernel uses its own unwind tables: ORC. You can find its documentation on https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html and there is an lwn.net introduction The ORCs are coming.

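objtool orc generate a.o analyzes the code in a.o (as far as I can tell from the kernel documentation, it follows the instructions itself rather than relying on .eh_frame) and generates .orc_unwind and .orc_unwind_ip. For an object file assembled from:

.globl foo
.type foo, @function
foo:
  ret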

At two addresses the unwind information changes: the start of foo and the end of foo, so 2 ORC entries will be produced. If the DWARF CFA changes (e.g. due to push/pop) in the middle of the function, there may be more entries.

.orc_unwind_ip contains two entries, representing the PC-relative addresses.

Relocation section '.rela.orc_unwind_ip' at offset 0x2028 contains 2 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000500000002 R_X86_64_PC32 0000000000000000 .text + 0
0000000000000004 0000000500000002 R_X86_64_PC32 0000000000000000 .text + 1
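Because each .orc_unwind_ip entry is a 32-bit PC-relative value (hence the R_X86_64_PC32 relocations above), the absolute address is recovered by adding the stored value to the address of the entry itself. A small sketch; the function name is chosen for this example:

```c
#include <stdint.h>

// Convert a PC-relative .orc_unwind_ip entry to an absolute address:
// the stored value is an offset from the location of the entry itself.
static uintptr_t orc_ip(const int32_t *entry) {
  return (uintptr_t)entry + (uintptr_t)(intptr_t)*entry;
}
```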

.orc_unwind contains two entries of type orc_entry. The entries encode how IP/SP/BP of the previous frame are stored.

struct orc_entry {
  s16 sp_offset; // sp_offset and sp_reg encode where SP of the previous frame is stored
  s16 bp_offset; // bp_offset and bp_reg encode where BP of the previous frame is stored
  unsigned sp_reg:4;
  unsigned bp_reg:4;
  unsigned type:2; // how IP of the previous frame is stored
  unsigned end:1;
} __attribute__((__packed__));

You may find similarities between this scheme and UNWIND_MODE_BP_FRAME and UNWIND_MODE_STACK_IMMD in Apple's compact unwind descriptors. The ORC scheme uses 16-bit integers, so presumably UNWIND_MODE_STACK_IND is not needed. During unwinding, most callee-saved registers other than BP are unneeded, so ORC does not bother recording them.

The linker resolves the relocations in .orc_unwind_ip and creates the symbols __start_orc_unwind_ip/__stop_orc_unwind_ip/__start_orc_unwind/__stop_orc_unwind to delimit the section contents. Then, a host utility scripts/sorttable sorts the contents of .orc_unwind_ip and .orc_unwind. To unwind a stack frame, unwind_next_frame:

  • performs a binary search into the .orc_unwind_ip table to figure out the relevant ORC entry
  • retrieves the previous SP with the current SP, orc->sp_reg and orc->sp_offset.
  • retrieves the previous IP with orc->type and other values.
  • retrieves the previous BP with the current BP, the previous SP, orc->bp_reg and orc->bp_offset. orc->bp_reg can be ORC_REG_UNDEFINED/ORC_REG_PREV_SP/ORC_REG_BP.
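The steps above can be sketched in C as follows. This is a simplified model rather than the kernel's code: the struct layouts, the register numbering and the function names are assumptions made for the example, the IP table holds absolute addresses, only a "call" style entry is handled (return address stored just below the previous SP), and an LP64 target is assumed.

```c
#include <stddef.h>
#include <stdint.h>

enum { REG_UNDEFINED, REG_PREV_SP, REG_SP, REG_BP };  // hypothetical numbering

struct orc { int16_t sp_offset, bp_offset; uint8_t sp_reg, bp_reg; };
struct state { uintptr_t ip, sp, bp; };

// Binary search: index of the last entry whose start address is <= ip.
// ips[] must be sorted, non-empty, and ips[0] <= ip.
static size_t orc_find(const uintptr_t *ips, size_t n, uintptr_t ip) {
  size_t lo = 0, hi = n;
  while (hi - lo > 1) {
    size_t mid = lo + (hi - lo) / 2;
    if (ips[mid] <= ip) lo = mid; else hi = mid;
  }
  return lo;
}

static uintptr_t load(uintptr_t addr) { return *(const uintptr_t *)addr; }

// One unwinding step using the ORC entry that covers s->ip.
static void step(struct state *s, const struct orc *o) {
  // Previous SP = base register + sp_offset.
  uintptr_t base = o->sp_reg == REG_BP ? s->bp : s->sp;
  uintptr_t prev_sp = base + o->sp_offset;
  // Previous BP: unchanged, or loaded relative to the previous SP or the current BP.
  uintptr_t prev_bp = s->bp;
  if (o->bp_reg == REG_PREV_SP) prev_bp = load(prev_sp + o->bp_offset);
  else if (o->bp_reg == REG_BP) prev_bp = load(s->bp + o->bp_offset);
  // Previous IP: the return address sits just below the previous SP.
  s->ip = load(prev_sp - sizeof(uintptr_t));
  s->sp = prev_sp;
  s->bp = prev_bp;
}
```

For a function body that runs after pushq %rbp; movq %rsp, %rbp, the matching entry in this model would be {sp_offset = 16, bp_offset = -16, sp_reg = REG_BP, bp_reg = REG_PREV_SP}.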

Compact C Type Format SFrame

Under construction.

For a format intended to improve on .eh_frame, the size saving appears too small.

LLVM

In LLVM, Function::needsUnwindTableEntry decides whether CFI instructions should be emitted: hasUWTable() || !doesNotThrow() || hasPersonalityFn()

On ELF targets, if a function has uwtable or personality, or does not have nounwind (needsUnwindTableEntry), it marks that .eh_frame is needed in the module. Then, a function gets .eh_frame if needsUnwindTableEntry or -g[123] is specified.

To ensure no .eh_frame, every function needs nounwind.
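The rule above can be modeled with a few lines of C. This is only a toy model of the predicate quoted from Function::needsUnwindTableEntry, not LLVM code: the struct and function names are hypothetical, and the -g[123] case is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical model of the three function attributes involved.
struct fn_attrs { bool uwtable, nounwind, personality; };

// Mirrors hasUWTable() || !doesNotThrow() || hasPersonalityFn().
static bool needs_unwind_table_entry(const struct fn_attrs *f) {
  return f->uwtable || !f->nounwind || f->personality;
}

// The module needs .eh_frame as soon as one function needs an entry.
static bool module_needs_eh_frame(const struct fn_attrs *fns, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (needs_unwind_table_entry(&fns[i]))
      return true;
  return false;
}
```

In this model, a single function without nounwind is enough to make the module need .eh_frame, which is why every function has to be nounwind (and carry neither uwtable nor a personality) to avoid the section.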

uwtable(sync) and uwtable(async) specify the amount of unwind information. (See [RFC] Asynchronous unwind tables attribute.)

If .eh_frame is not produced, but at least one function makes Function::needsUnwindTableEntry return true, .debug_frame is produced if llvm::MachineModuleInfo::DbgInfoAvailable is true or -fforce-dwarf-frame is specified.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp:352

Epilogue

It remains an open question how the future stack unwinding strategy should evolve for profiling purposes. We have at least 3 routes:

  • compact unwind scheme.
  • hardware assisted. Piggybacking on security hardening features like shadow call stack. However, this is unlikely to provide more information about callee-saved registers.
  • mainly FP-based. People don't use FP due to performance loss. If -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer doesn't hurt performance that much, it may be better than some information in .eh_frame. We can use unwind information to fill the gap, e.g. for shrink wrapping.

It is difficult for unwind information to reach 100% coverage. Linker generated code (PLT and range extension thunks) generally does not have unwind information coverage.

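Chinese version (in English translation)

The main uses of stack unwinding are:

  • To obtain a stack trace for a debugger, crash reporter, profiler, garbage collector, etc.
  • With personality routines and a language-specific data area, to implement C++ exceptions (Itanium C++ ABI). See C++ exception handling ABI

Stack unwinding tasks can be divided into two categories:

  • synchronous: triggered by the program itself, e.g. a C++ throw or obtaining its own stack trace. This kind of stack unwinding only occurs at function call sites (inside the function body, never in the prologue/epilogue)
  • asynchronous: triggered by a signal or an external program. This kind of stack unwinding can happen in the function prologue/epilogue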

Frame pointer

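The most classic and simplest stack unwinding is based on the frame pointer: fix a register as the frame pointer (RBP on x86-64), save the frame pointer in the stack frame in the function prologue, and update the frame pointer to the address of the saved frame pointer. The frame pointer and its saved values on the stack form a singly linked list. After obtaining the initial frame pointer value (__builtin_frame_address), dereference the frame pointer repeatedly to get the frame pointer values of all stack frames. This method does not work at some instructions in the prologue/epilogue.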
pushq %rbp
movq %rsp, %rbp # after this, RBP references the current frame
...
popq %rbp
retq # RBP references the previous frame

Here is a simple stack unwinding example:

#include <stdio.h>
[[gnu::noinline]] void qux() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
    printf("%p\n", fp);
    void **next_fp = *fp;
    if (next_fp <= fp) break;
    fp = next_fp;
  }
}
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

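The frame pointer based approach is simple, but it has several drawbacks.

When the code above is compiled with -O1 or above, foo and bar use tail calls, so the program output does not contain the stack frames of foo and bar (-fomit-leaf-frame-pointer does not prevent the tail calls).

In practice, it is sometimes impossible to guarantee that all libraries are built with frame pointers. When unwinding a thread, to improve robustness we need to check whether next_fp looks like a stack address. One way is to parse /proc/*/maps to check whether the address is readable (slow); another is: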

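#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

// fd comes from open("/dev/random", O_WRONLY), or is the write end of a pipe.
// write() fails with EFAULT when the kernel cannot read from address.
static bool is_readable(int fd, const void *address) {
  return write(fd, address, 1) >= 0;
}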

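On Linux, rt_sigprocmask is better:

#include <errno.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

// The invalid `how` (~0) makes the call fail with EINVAL when address is
// readable and with EFAULT when it is not; the signal mask is never changed.
static bool is_readable(const void *address) {
  errno = 0;
  syscall(SYS_rt_sigprocmask, ~0, address, (void *)0, /*sizeof(kernel_sigset_t)=*/8);
  return errno != EFAULT;
}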

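In addition, reserving a register for the frame pointer increases text size and has a performance cost (extra instructions in the prologue/epilogue, and more register pressure from having one fewer register). The cost can be quite significant on register-starved x86-32, and may exceed 1% even on x86-64, which has more registers.

Compiler behavior

  • -O0: defaults to -fno-omit-frame-pointer; all functions have a frame pointer
  • -O1 or above: defaults to -fomit-frame-pointer; a frame pointer is set up only when necessary. Specify -fno-omit-frame-pointer to get an effect similar to -O0. You can additionally specify -momit-leaf-frame-pointer to remove the frame pointer from leaf functions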

libunwind

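Stack unwinding for C++ exceptions and for profilers/crash reporters usually uses the libunwind API and DWARF Call Frame Information. In the 1990s, Hewlett-Packard defined a set of libunwind APIs, which fall into two categories:

  • unw_*: the entry points are unw_init_local (local unwinding, the current process) and unw_init_remote (remote unwinding, another process). Applications that use libunwind usually use this API. For example, Linux perf calls unw_init_remote
  • _Unwind_*: this part is standardized as Level 1: Base ABI of Itanium C++ ABI: Exception Handling. The Level 2 C++ ABI calls these _Unwind_* APIs. Among them, _Unwind_Resume is the only API called directly by compiled C++ code, _Unwind_Backtrace is used by a few applications to obtain a backtrace, and the other functions are called by libsupc++/libc++abi.

Hewlett-Packard open sourced https://www.nongnu.org/libunwind/ (there are many other projects called "libunwind" as well). Common implementations of this API on Linux are:

  • libgcc/unwind-* (libgcc_s.so.1 or libgcc_eh.a): implements _Unwind_* and introduces some extensions: _Unwind_Resume_or_Rethrow, _Unwind_FindEnclosingFunction, __register_frame
  • llvm-project/libunwind (libunwind.so or libunwind.a) is a simplified implementation of the HP API. It provides part of unw_* but does not implement unw_init_remote. Part of the code is taken from ld64. With Clang, it can be selected with --rtlib=compiler-rt --unwindlib=libunwind
  • glibc's internal implementation of _Unwind_Find_FDE, which is usually not exported and is related to __register_frame_info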

DWARF Call Frame Information

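The unwind instructions needed by different regions of a program are described by DWARF Call Frame Information (CFI), which is stored in .eh_frame on ELF platforms. The compiler, assembler, linker and libunwind provide the corresponding support.

.eh_frame consists of Common Information Entries (CIE) and Frame Description Entries (FDE). A CIE has these fields:

  • length
  • CIE_id: the constant 0. This field distinguishes a CIE from an FDE; in an FDE the field is non-zero and is the CIE_pointer
  • version: the constant 1
  • augmentation: a string describing the CIE/FDE argument list. The P character indicates a personality routine pointer; the L character indicates that the FDE augmentation data stores the language-specific data area (LSDA)
  • address_size: usually 4 or 8
  • segment_selector_size: for x86
  • code_alignment_factor: assumes that instruction lengths are all multiples of 2 or 4 (for RISC); it is a multiplier for the arguments of DW_CFA_advance_loc etc.
  • data_alignment_factor: a multiplier for the arguments of DW_CFA_offset, DW_CFA_val_offset, etc.
  • return_address_register
  • augmentation_data_length
  • augmentation_data: personality
  • initial_instructions
  • padding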

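Each FDE has an associated CIE. An FDE has these fields:

  • length: the length of the FDE itself. If it is 0xffffffff, the next 8 bytes (extended_length) record the actual length. Unless specially constructed, extended_length is never needed
  • CIE_pointer: subtract CIE_pointer from the current position to get the associated CIE
  • initial_location: the address of the first location described by this FDE. In a .o file there is a relocation here referencing a section symbol
  • address_range: initial_location and address_range describe an address range
  • instructions: the instructions used when unwinding
  • augmentation_data_length
  • augmentation_data: if the augmentation of the associated CIE contains the L character, the language-specific data area is recorded here
  • padding

A CIE references the personality in a text section. An FDE references the LSDA in .gcc_except_table. The personality and LSDA are used by Level 2: C++ ABI of the Itanium C++ ABI.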

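.eh_frame is based on .debug_frame, which was introduced in DWARF v2. They have some differences:

  • .eh_frame has the SHF_ALLOC flag (which indicates that a section is part of the in-memory image) while .debug_frame does not, so the latter has very few use cases.
  • .debug_frame supports the DWARF64 format (64-bit offsets at the cost of a slightly larger size) while .eh_frame does not (it could be extended, but there is no demand)
  • The CIE in .debug_frame has no augmentation_data_length or augmentation_data
  • The value of version in the CIE is different
  • The meaning of CIE_pointer in the FDE is different. In .debug_frame it is a section offset (absolute), while in .eh_frame it is a relative offset. This change made by .eh_frame is a good one: if the section grows beyond 32 bits, .debug_frame has to switch to DWARF64 to represent CIE_pointer, while a relative offset does not have this problem (if the distance from an FDE to its CIE exceeds 32 bits, just add another CIE)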
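To make the record layout concrete, here is a minimal sketch that reads the header of one .eh_frame record, tells a CIE from an FDE, and follows the relative CIE_pointer. The struct and function names are hypothetical; the sketch assumes the data has the host's byte order and does not handle the zero-length terminator record.

```c
#include <stdint.h>
#include <string.h>

struct eh_record {
  uint64_t length;      // record length, excluding the length field(s)
  int is_cie;           // nonzero if the CIE_id field is 0
  const uint8_t *cie;   // for an FDE: start of the associated CIE, else NULL
  const uint8_t *next;  // start of the next record
};

static struct eh_record parse_eh_record(const uint8_t *p) {
  struct eh_record r;
  uint32_t len32, id;
  const uint8_t *id_field = p + 4;
  memcpy(&len32, p, 4);
  r.length = len32;
  if (len32 == 0xffffffffu) {  // extended_length follows
    memcpy(&r.length, p + 4, 8);
    id_field = p + 12;
  }
  memcpy(&id, id_field, 4);
  r.is_cie = id == 0;
  // In .eh_frame, CIE_pointer is subtracted from the address of the field itself.
  r.cie = r.is_cie ? NULL : id_field - id;
  r.next = id_field + r.length;
  return r;
}
```

In .debug_frame the same field would instead be an offset from the start of the section, so the computation of cie would differ.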

For the following function:

void f() {
  __builtin_unwind_init();
}

The compiler annotates the assembly with .cfi_* (CFI directives). .cfi_startproc and .cfi_endproc delimit the FDE region, and the other CFI directives describe CFI instructions. A call frame is identified by an address on the stack. This address is called the Canonical Frame Address (CFA), and is usually the value of the stack pointer at the call site. The following example illustrates what the CFI instructions do:

f:
  # At the function entry, CFA = rsp+8
  .cfi_startproc
# %bb.0:
  pushq %rbp
  # Redefine CFA = rsp+16
  .cfi_def_cfa_offset 16
  # rbp is saved at the address CFA-16
  .cfi_offset %rbp, -16
  movq %rsp, %rbp
  # CFA = rbp+16. CFA does not need to be redefined when rsp changes
  .cfi_def_cfa_register %rbp
  pushq %r15
  pushq %r14
  pushq %r13
  pushq %r12
  pushq %rbx
  # rbx is saved at the address CFA-56
  .cfi_offset %rbx, -56
  .cfi_offset %r12, -48
  .cfi_offset %r13, -40
  .cfi_offset %r14, -32
  .cfi_offset %r15, -24
  popq %rbx
  popq %r12
  popq %r13
  popq %r14
  popq %r15
  popq %rbp
  # CFA = rsp+8
  .cfi_def_cfa %rsp, 8
  retq
.Lfunc_end0:
  .size f, .Lfunc_end0-f
  .cfi_endproc

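The assembler parses the CFI directives and generates .eh_frame (this mechanism was introduced by Alan Modra in 2003). The linker collects the .eh_frame input sections in .o files and generates the output .eh_frame. In 2006, GNU as introduced .cfi_personality and .cfi_lsda.

.eh_frame_hdr and PT_GNU_EH_FRAME

Locating the FDE for a pc requires scanning .eh_frame from the beginning to find a suitable FDE (checking whether pc falls in the range described by initial_location and address_range). The time spent is proportional to the number of CIE and FDE records scanned. https://sourceware.org/pipermail/binutils/2001-December/015674.html introduced .eh_frame_hdr, which contains a binary search index table of (initial_location, FDE address) pairs.

ld --eh-frame-hdr generates .eh_frame_hdr. The linker additionally creates a program header PT_GNU_EH_FRAME that covers .eh_frame_hdr. The unwinder looks for PT_GNU_EH_FRAME to locate .eh_frame_hdr; see the example below.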
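The lookup in the binary search index table can be sketched as follows. The sketch assumes the common table encoding in which both values of a pair are signed 32-bit offsets from the start of .eh_frame_hdr; the struct and function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

// One (initial_location, FDE address) pair; both are offsets from the
// start of .eh_frame_hdr in the assumed encoding.
struct hdr_entry { int32_t initial_location, fde; };

// Return the entry with the largest initial_location <= pc_off, or NULL if
// pc_off lies before the first entry. The table must be sorted.
static const struct hdr_entry *hdr_lookup(const struct hdr_entry *tab, size_t n,
                                          int32_t pc_off) {
  size_t lo = 0, hi = n;  // entries before lo are <= pc_off, entries from hi on are > pc_off
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (tab[mid].initial_location <= pc_off)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo ? &tab[lo - 1] : NULL;
}
```

The table has no length information, so after the lookup an unwinder still has to read the FDE and check pc against its address_range.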

__register_frame_info

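Before .eh_frame_hdr and PT_GNU_EH_FRAME were invented, crtbegin (crtstuff.c) had a static constructor frame_dummy that called __register_frame_info to register the executable's .eh_frame.

Nowadays __register_frame_info is only used by programs linked with -static. Correspondingly, if -Wl,--no-eh-frame-hdr is specified at link time, unwinding is impossible (if C++ exceptions are used, this leads to std::terminate).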

libunwind example

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
  unw_context_t context;
  unw_cursor_t cursor;
  // Store register values into context.
  unw_getcontext(&context);
  // Locate the PT_GNU_EH_FRAME which contains PC.
  unw_init_local(&cursor, &context);
  size_t rip, rsp;
  do {
    unw_get_reg(&cursor, UNW_X86_64_RIP, &rip);
    unw_get_reg(&cursor, UNW_X86_64_RSP, &rsp);
    printf("rip: %zx rsp: %zx\n", rip, rsp);
  } while (unw_step(&cursor) > 0);
}

void bar() {backtrace();}
void foo() {bar();}
int main() {foo();}

If you use llvm-project/libunwind:

$CC a.c -Ipath/to/include -Lpath/to/lib -lunwind

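If you use nongnu.org/libunwind, there are two options: (a) add #define UNW_LOCAL_ONLY before #include <libunwind.h>, or (b) link one more library, which is -l:libunwind-x86_64.so on x86-64. With Clang, you can also use clang --rtlib=compiler-rt --unwindlib=libunwind -I path/to/include a.c, which besides providing unw_* also ensures that libgcc_s.so is not linked.

  • unw_getcontext: gets the register values (including PC)
  • unw_init_local
    • uses dl_iterate_phdr to iterate over the executable and shared objects, and finds the PT_LOAD program header that contains PC
    • finds the PT_GNU_EH_FRAME (.eh_frame_hdr) of the containing module and stores it in cursor
  • unw_step
    • binary searches .eh_frame_hdr for the entry corresponding to PC, and records the FDE found and the CIE it points to
    • executes initial_instructions in the CIE
    • executes instructions in the FDE. It maintains a location and the CFA; location initially points to the FDE's initial_location. DW_CFA_advance_loc increases location; DW_CFA_def_cfa_* updates the CFA; DW_CFA_offset indicates that the value of a register is saved at CFA+offset
    • stops when location is greater than or equal to PC. In other words, the executed instructions are a prefix of the FDE instructions

The unwinder finds the applicable FDE based on the program counter and executes all CFI instructions before the program counter.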
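The "execute a prefix of the instructions" rule can be illustrated with a toy interpreter. This is a sketch under simplifying assumptions, not libunwind's implementation: the instructions are already decoded into a hypothetical struct cfi, the CIE's initial_instructions are simply placed in front of the FDE's instructions, and only a handful of operations are modeled.

```c
#include <stddef.h>
#include <stdint.h>

enum op { ADVANCE_LOC, DEF_CFA, DEF_CFA_OFFSET, DEF_CFA_REGISTER, OFFSET };
struct cfi { enum op op; int reg; long arg; };  // hypothetical decoded form

enum { NREGS = 17, RBP = 6, RSP = 7 };  // DWARF register numbers on x86-64
struct row {
  int cfa_reg;             // CFA = value of cfa_reg + cfa_offset
  long cfa_offset;
  int saved[NREGS];        // saved[r] != 0: register r is stored at CFA + save_offset[r]
  long save_offset[NREGS];
};

// Execute the prefix of insts that applies to pc. pc is treated as a return
// address, so a row that starts exactly at pc does not apply yet.
static struct row run_cfi(const struct cfi *insts, size_t n, uintptr_t start, uintptr_t pc) {
  struct row r = {0};
  uintptr_t loc = start;
  for (size_t i = 0; i < n; i++) {
    const struct cfi *c = &insts[i];
    if (c->op == ADVANCE_LOC) {
      if (loc + c->arg >= pc)
        break;  // the remaining instructions describe later locations
      loc += c->arg;
    } else if (c->op == DEF_CFA) {
      r.cfa_reg = c->reg;
      r.cfa_offset = c->arg;
    } else if (c->op == DEF_CFA_OFFSET) {
      r.cfa_offset = c->arg;
    } else if (c->op == DEF_CFA_REGISTER) {
      r.cfa_reg = c->reg;
    } else {  // OFFSET
      r.saved[c->reg] = 1;
      r.save_offset[c->reg] = c->arg;
    }
  }
  return r;
}
```

Fed with the beginning of the function f above (CFA = rsp+8 at entry, then .cfi_def_cfa_offset 16 and .cfi_offset %rbp, -16 after the 1-byte pushq, then .cfi_def_cfa_register %rbp after the 3-byte movq), the resulting row depends on how far pc lies into the function.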

There are several important CFI instructions:

  • DW_CFA_def_cfa_*
  • DW_CFA_offset
  • DW_CFA_advance_loc

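For a clang built with -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86: .text is 51.7MiB, .eh_frame is 4.2MiB, .eh_frame_hdr is 646KiB, and there are 2 CIEs and 82745 FDEs.

Remarks

CFI instructions suit compiler-generated code, but annotating every instruction accurately in hand-written assembly is tedious and error-prone. In 2015, Alex Dowad contributed an awk script to musl libc that parses assembly and adds CFI directives automatically. It is actually not easy for compiler-generated code either: for a function that does not use a frame pointer, every SP adjustment must be accompanied by a CFI directive that redefines the CFA. GCC does not parse inline assembly, so adjusting SP in inline assembly often results in inaccurate CFI.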

void foo() {
  asm("subq $128, %rsp\n"
      // Cannot unwind if -fomit-leaf-frame-pointer
      "nop\n"
      "addq $128, %rsp\n");
}

int main() {
  foo();
}

In contrast, CFIInstrInserter in LLVM can insert .cfi_def_cfa_* .cfi_offset .cfi_restore to adjust the CFA and callee-saved registers.

The DWARF scheme also has very low information density. The various compact unwind schemes have made improvement on this aspect. To list a few issues:

  • CIE address_size: nobody uses different values for an architecture. Even if they do (ILP32 ABIs in AArch64 and x86-64), the information is already available elsewhere.
  • CIE segment_selector_size: It is nice that they cared about x86, but x86 itself does not need it anymore:/
  • CIE code_alignment_factor and data_alignment_factor: A RISC architecture with such preference can hard code the values.
  • CIE return_address_register: I do not know when an architecture wants to use a different register for the return address.
  • length: The DWARF's 8-byte form is definitely overengineered... For standard form prologue/epilogue, the field should not be needed.
  • initial_location and address_range: if a binary search index table is always needed, why do we need the length field?
  • instructions: bytecode is flexible but commonly a function prologue/epilogue is of a standard form and the few callee-saved registers can be encoded in a more compact way.
  • augmentation_data: While this provides flexibility, in practice a function very rarely needs anything more than a personality and an LSDA pointer.

SHT_X86_64_UNWIND

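.eh_frame receives special handling in the linker and the dynamic loader, so by rights it should have a dedicated section type, but it was originally designed with SHT_PROGBITS. In the x86-64 psABI, the type of .eh_frame is SHT_X86_64_UNWIND (possibly influenced by Solaris).

  • In GNU as, .section .eh_frame,"a",@unwind produces SHT_X86_64_UNWIND, while .cfi_* produces SHT_PROGBITS
  • Since Clang 3.8, .cfi_* produces SHT_X86_64_UNWIND

.section .eh_frame,"a",@unwind is rare (glibc's x86 port, libffi, LuaJIT and a few other packages), so checking the type of .eh_frame is a good way to tell Clang and GCC object files apart :) I have an LLD 11.0.0 commit that accepts both types of .eh_frame in a relocatable link ;-)

Advice for future architectures: when defining processor-specific section types, please do not use 0x70000001 (SHT_ARM_EXIDX=SHT_IA_64_UNWIND=SHT_PARISC_UNWIND=SHT_X86_64_UNWIND=SHT_LOPROC+1) for purposes other than unwinding :) SHT_CSKY_ATTRIBUTES=0x70000001 :)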

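Issues from the linker's perspective

Usually, with COMDAT groups and -ffunction-sections, .data/.rodata need to be split up like .text, but .eh_frame is a monolithic section. As with many other metadata sections, the main problem with a monolithic section is that linker garbage collection becomes awkward. Unlike many other metadata sections, simply giving up on garbage collection is not an option: .eh_frame_hdr is a binary search index table, and duplicate or useless entries would confuse consumers.

When processing .eh_frame, the linker needs to conceptually split .eh_frame into CIEs and FDEs. For --gc-sections, the conceptual reference relation is the reverse of the actual relocation: an FDE has a relocation referencing a text section; during GC, if the referenced text section is discarded, the FDE referencing it should be discarded as well.

LLD has some special handling for .eh_frame:

  • -M requires special code
  • --gc-sections happens before .eh_frame deduplication/GC. The personality in a CIE is a valid reference, initial_location in an FDE should be ignored, and the LSDA reference in an FDE is only considered in the non-section-group case
  • In a relocatable link, a relocation from .eh_frame to the STT_SECTION symbol of a section discarded due to the COMDAT group rule is allowed (normally, an STB_LOCAL relocation from outside the section group to a discarded section should be rejected)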

Compact unwind descriptors

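On macOS, Apple designed the compact unwind descriptors mechanism to speed up unwinding. In theory this technique could be used to save some __eh_frame space, but that has not been implemented. The main ideas are:

  • The FDEs of most functions follow a fixed pattern (define the CFA and store callee-saved registers in the prologue), so the FDE instructions can be compressed into 32 bits.
  • The personality/LSDA described by the CIE/FDE augmentation data are very common and can be extracted into fixed fields.

Only 64-bit is discussed below. A descriptor occupies 32 bytes:

.quad _foo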
.set L1, Lfoo_end-_foo
.long L1
.long compact_unwind_description
.quad personality
.quad lsda_address

If you study .eh_frame_hdr (a binary search index table) and .ARM.exidx, you will see that the length field is redundant.

A compact unwind descriptor is encoded as:

uint32_t : 24; // vary with different modes
uint32_t mode : 4;
uint32_t flags : 4;

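Five modes are defined:

  • 0: reserved
  • 1: FP-based frame: RBP is the frame pointer, and the frame size is variable
  • 2: SP-based frame: no frame pointer; the frame size is fixed at compile time
  • 3: large SP-based frame: no frame pointer; the frame size is fixed at compile time but too large to be represented by mode 2
  • 4: DWARF CFI escape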
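Given the bit-field layout above (low 24 bits mode-specific, then 4 bits of mode, then 4 bits of flags), extracting the mode is a shift and a mask. A small sketch; the enum and function names are made up for the example:

```c
#include <stdint.h>

// Mode numbers as listed above (hypothetical constant names).
enum { MODE_BP_FRAME = 1, MODE_STACK_IMMD = 2, MODE_STACK_IND = 3, MODE_DWARF = 4 };

// Bits 24-27 hold the mode, bits 28-31 hold the flags.
static unsigned cu_mode(uint32_t enc) { return (enc >> 24) & 0xf; }
static unsigned cu_flags(uint32_t enc) { return enc >> 28; }
```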

FP-based frame (UNWIND_MODE_BP_FRAME)

The compact unwind descriptor is encoded as:

uint32_t regs : 15;
uint32_t : 1; // 0
uint32_t stack_adjust : 8;
uint32_t mode : 4;
uint32_t flags : 4;

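On x86-64, the callee-saved registers are RBX, R12, R13, R14, R15 and RBP. 3 bits can encode one register, so 15 bits are enough to represent the 5 registers other than RBP (whether each is saved and where). stack_adjust records the extra stack space beyond the saved registers.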
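A sketch of how the regs field could be read: five 3-bit slots, one per save location. The register numbering in the enum (0 for "no register", then RBX, R12, R13, R14, R15) is what I recall from Apple's compact_unwind_encoding.h and should be treated as an assumption; the function names are made up.

```c
#include <stdint.h>

// Assumed register numbers used inside the 3-bit slots.
enum { CU_REG_NONE, CU_REG_RBX, CU_REG_R12, CU_REG_R13, CU_REG_R14, CU_REG_R15 };

// Register recorded in save slot `slot` (0..4) of an FP-based descriptor.
static unsigned bp_frame_reg(uint32_t enc, unsigned slot) {
  return (enc >> (3 * slot)) & 0x7;
}

// The 8-bit stack_adjust field in bits 16-23.
static unsigned bp_frame_stack_adjust(uint32_t enc) {
  return (enc >> 16) & 0xff;
}
```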

SP-based frame (UNWIND_MODE_STACK_IMMD)

The compact unwind descriptor is encoded as:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t : 3;
uint32_t size : 8;
uint32_t mode : 4;
uint32_t flags : 4;

cnt is the number of saved registers (at most 6). reg_permutation is the index of the permutation of the saved registers. size*8 is the stack frame size.
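The three fields can be pulled out of the descriptor as follows. This sketch only extracts the raw values; turning the permutation index back into a register order is deliberately left out. The function names are hypothetical.

```c
#include <stdint.h>

// Number of saved registers (bits 10-12).
static unsigned immd_reg_count(uint32_t enc) { return (enc >> 10) & 0x7; }

// Index of the permutation of the saved registers (bits 0-9).
static unsigned immd_permutation(uint32_t enc) { return enc & 0x3ff; }

// Stack frame size in bytes: the 8-bit size field (bits 16-23) times 8.
static unsigned immd_frame_size(uint32_t enc) { return ((enc >> 16) & 0xff) * 8; }
```

With an 8-bit size field the largest frame this mode can describe is 255*8 = 2040 bytes, which is why a separate mode for large frames exists.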

Large SP-based frame (UNWIND_MODE_STACK_IND)

The compact unwind descriptor is encoded as:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t adj : 3;
uint32_t size_offset : 8;
uint32_t mode : 4;
uint32_t flags : 4;

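This is similar to the SP-based frame. The difference is that the stack frame size is read from the text section: the RSP adjustment is usually expressed by subq imm, %rsp, and size_offset gives the offset of that value from the start of the function. The actual stack size also includes adj*8.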
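A minimal sketch of the frame size computation for this mode, assuming that size_offset points at the 32-bit immediate of the subq instruction and that the code bytes have the host's byte order. The function name is made up.

```c
#include <stdint.h>
#include <string.h>

// func points at the first byte of the function's code.
static uint32_t ind_frame_size(const uint8_t *func, uint32_t enc) {
  uint32_t size_offset = (enc >> 16) & 0xff;  // where the immediate lives in the code
  uint32_t adj = (enc >> 13) & 0x7;
  uint32_t imm;
  memcpy(&imm, func + size_offset, 4);        // immediate of subq imm, %rsp
  return imm + adj * 8;
}
```

For example, if the function starts with subq $0x1000, %rsp (bytes 48 81 ec 00 10 00 00), the immediate is at offset 3, and with adj = 1 the computed size is 0x1000 + 8.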

DWARF CFI escape

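If a function cannot be represented by a compact unwind descriptor for whatever reason, it falls back to DWARF CFI.

In the LLVM implementation, each function is represented by a single compact unwind descriptor. If asynchronous stack unwinding occurs in the epilogue, existing implementations cannot distinguish it from stack unwinding in the function body: the Canonical Frame Address will be computed incorrectly, and the callee-saved registers will be restored incorrectly. If it occurs in the prologue, and the prologue contains instructions other than register pushes and subq $imm, %rsp, unwinding also goes wrong. In addition, if shrink wrapping is enabled for a function, the prologue may not be at the beginning of the function, and asynchronous stack unwinding between the function start and the prologue fails as well. Most people do not seem to care about this issue, perhaps because the profiler only loses a few percent of its samples.

In fact, if multiple descriptors are used to describe the different regions of a function, accurate unwinding is still possible. OpenVMS proposed [RFC] Improving compact x86-64 compact unwind descriptors in 2018, but unfortunately there is no implementation.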

ARM exception handling

The unwind information is split into two sections: .ARM.exidx and .ARM.extab.

.ARM.exidx is a binary search index table composed of 2-word pairs. The first word is a 31-bit PC-relative offset to the start of the region. The second word is easier to describe with code:

if (indexData == EXIDX_CANTUNWIND)
  return false; // like an absent .eh_frame entry. In the case of C++ exceptions, std::terminate
if (indexData & 0x80000000) {
  extabAddr = &indexData;
  extabData = indexData; // inline
} else {
  extabAddr = &indexData + signExtendPrel31(indexData);
  extabData = read32(&indexData + signExtendPrel31(indexData)); // stored in .ARM.extab
}

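extabData & 0x80000000 means a compact model entry; otherwise it is a generic model entry.

.ARM.exidx is equivalent to an enhanced .eh_frame_hdr, and the compact model is equivalent to inlining the personality and LSDA of .eh_frame. Consider the following three situations:

  • If the function neither throws a C++ exception nor calls a function that may throw: no personality is needed, only one EXIDX_CANTUNWIND entry is needed, and no .ARM.extab is needed
  • If a C++ exception may propagate through the function but no landing pad is required: the personality is __aeabi_unwind_cpp_pr0, only a compact model entry is needed, and no .ARM.extab is needed
  • If there is a catch: __gxx_personality_v0 is required, and .ARM.extab is required

.ARM.extab is equivalent to the combined .eh_frame and .gcc_except_table.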

Generic model

uint32_t personality; // bit 31 is 0
uint32_t : 24;
uint32_t num : 8;
uint32_t opcodes[]; // opcodes, variable length
uint8_t lsda[]; // variable length

Under construction.

Windows ARM64 exception handling

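See https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling. This is the encoding scheme I admire most. It supports unwinding in the middle of a prolog or epilog, and it supports function fragments (used to represent unconventional stack frames such as those produced by shrink wrapping).

The information is stored in two sections, .pdata and .xdata.

uint32_t function_start_rva;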
uint32_t Flag : 2;
uint32_t Data : 30;

For functions in canonical form, Packed Unwind Data is used and no .xdata record is required; a descriptor that cannot be represented by Packed Unwind Data is stored in .xdata.

Packed Unwind Data

uint32_t FunctionStartRVA;
uint32_t Flag : 2;
uint32_t FunctionLength : 11;
uint32_t RegF : 3;
uint32_t RegI : 4;
uint32_t H : 1;
uint32_t CR : 2;
uint32_t FrameSize : 9;

MIPS compact exception tables

Under construction.

Linux kernel ORC unwind tables

For x86-64, the Linux kernel uses its own unwind tables: ORC. The documentation is at https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html, and lwn.net has an introduction, The ORCs are coming.

objtool orc generate a.o parses .eh_frame and generates .orc_unwind and .orc_unwind_ip. For an object file assembled from:

.globl foo
.type foo, @function
foo:
  ret

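The unwind information changes at two addresses, the start of foo and the end of foo, so 2 ORC entries are needed. If the DWARF CFA changes in the middle of the function (e.g. due to push/pop), more entries may be needed.

.orc_unwind_ip contains two entries, representing PC-relative addresses.

Relocation section '.rela.orc_unwind_ip' at offset 0x2028 contains 2 entries: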
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000500000002 R_X86_64_PC32 0000000000000000 .text + 0
0000000000000004 0000000500000002 R_X86_64_PC32 0000000000000000 .text + 1

.orc_unwind contains two entries of type orc_entry, which record where the IP/SP/BP of the previous frame are stored.

struct orc_entry {
  s16 sp_offset; // sp_offset and sp_reg encode where SP of the previous frame is stored
  s16 bp_offset; // bp_offset and bp_reg encode where BP of the previous frame is stored
  unsigned sp_reg:4;
  unsigned bp_reg:4;
  unsigned type:2; // how IP of the previous frame is stored
  unsigned end:1;
} __attribute__((__packed__));

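You may find similarities between this scheme and UNWIND_MODE_BP_FRAME and UNWIND_MODE_STACK_IMMD in Apple's compact unwind descriptors. The ORC scheme uses 16-bit integers, so presumably UNWIND_MODE_STACK_IND is not needed. During unwinding, most callee-saved registers other than BP are unneeded, so ORC does not store them.

The linker resolves the relocations in .orc_unwind_ip and creates __start_orc_unwind_ip/__stop_orc_unwind_ip/__start_orc_unwind/__stop_orc_unwind to delimit the section contents. Then, a host utility scripts/sorttable sorts .orc_unwind_ip and .orc_unwind. At run time, unwind_next_frame performs the following steps to unwind a stack frame:

  • performs a binary search in .orc_unwind_ip to find the relevant ORC entry
  • retrieves the SP of the previous frame from the current SP, orc->sp_reg and orc->sp_offset
  • retrieves the IP of the previous frame from orc->type and other values
  • retrieves the BP of the previous frame from the current BP, the SP of the previous frame, orc->bp_reg and orc->bp_offset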

LLVM

In LLVM, Function::needsUnwindTableEntry decides whether CFI instructions should be emitted: hasUWTable() || !doesNotThrow() || hasPersonalityFn()

On ELF targets, if a function has uwtable or personality, or does not have nounwind (needsUnwindTableEntry), it marks that .eh_frame is needed in the module. Then, a function gets .eh_frame if needsUnwindTableEntry or -g[123] is specified.

To ensure no .eh_frame, every function needs nounwind.

uwtable is coarse-grained: it does not specify the amount of unwind information. [RFC] Asynchronous unwind tables attribute proposes to make it a gradient.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp:352

Epilogue

How the stack unwinding strategy for profiling should evolve remains an open question. We have at least three routes:

  • compact unwind
  • hardware assisted.
  • mainly FP-based