# Stack unwinding

The main usage of stack unwinding is:

• To obtain a stack trace for debugger, crash reporter, profiler, garbage collector, etc.
• With personality routines and language specific data area, to implement C++ exceptions (Itanium C++ ABI). See C++ exception handling ABI

Stack unwinding tasks can be divided into two categories:

• synchronous: triggered by the program itself, C++ throw, get its own stack trace, etc. This type of stack unwinding only occurs at the function call (in the function body, it will not appear in the prologue/epilogue)
• asynchronous: triggered by a garbage collector, signals or an external program, this kind of stack unwinding can happen in function prologue/epilogue

## Frame pointer

The most classic and simplest stack unwinding is based on the frame pointer: fix a register as the frame pointer (RBP on x86-64), put the frame pointer in the stack frame at the function prologue, and update the frame pointer to the address of the saved frame pointer. The frame pointer and its saved values in the stack form a singly linked list. After obtaining the initial frame pointer value (__builtin_frame_address), dereference the frame pointer continuously to get the frame pointer values of all stack frames. This method is not applicable to some instructions in the prologue/epilogue.

Here is an example:

The frame pointer-based method is simple, but has several drawbacks.

When the above code is compiled with -O1 or above, foo and bar will have tail calls, and the program output will not include the stack frame of foo and bar (-fomit-leaf-frame-pointer does not hinder the tail call).

In practice, it is not guaranteed that all libraries contain frame pointers. When unwinding a thread, it is necessary to check whether next_fp is like a stack address before dereferencing it to prevent segfaults. One way to check page accessibility is to parse /proc/*/maps to determine whether the address is readable (slow). There is a smart trick:

In addition, reserving a register for the frame pointer will increase text size and have negative performance impact (prologue, epilogue additional instruction overhead and register pressure caused by one fewer register), which may be quite significant on x86-32 which lack registers. On an architecture with relatively sufficient registers, e.g. x86-64, the performance loss can be more than 1%.

### Compiler behavior

• -O0: Default -fno-omit-frame-pointer, all functions have frame pointer
• -O1 or above: Preset -fomit-frame-pointer, set frame pointer only if necessary. Specify -fno-omit-leaf-frame-pointer to get a similar effect to -O0. You can additionally specify -momti-leaf-frame-pointer to remove the frame pointer of leaf functions

## libunwind

C++ exception and stack unwinding of profiler/crash reporter usually use libunwind API and DWARF Call Frame Information. In the 1990s, Hewlett-Packard defined a set of libunwind API, which is divided into two categories:

• unw_*: The entry points are unw_init_local (local unwinding, current process) and unw_init_remote (remote unwinding, other processes). Applications that usually use libunwind use this API. For example, Linux perf will call unw_init_remote
• _Unwind_*: This part is standardized as Level 1: Base ABI of Itanium C++ ABI: Exception Handling. The Level 2 C++ ABI calls these _Unwind_* APIs. Among them, _Unwind_Resume is the only API that is directly called by C++ compiled code. _Unwind_Backtrace is used by a few applications to obtain stack traces. Other functions are called by libsupc++/libc++abi __cxa_* functions and __gxx_personality_v0.

Hewlett-Packard has open sourced https://www.nongnu.org/libunwind/ (in addition to many projects called "libunwind"). The common implementations of this API on Linux are:

• libgcc/unwind-* (libgcc_s.so.1 or libgcc_eh.a): Implemented _Unwind_* and introduced some extensions: _Unwind_Resume_or_Rethrow, _Unwind_FindEnclosingFunction, __register_frame etc.
• llvm-project/libunwind (libunwind.so or libunwind.a) is a simplified implementation of HP API, which provides part of unw_*, but does not implement unw_init_remote. Part of the code is taken from ld64. If you use Clang, you can use --rtlib=compiler-rt --unwindlib=libunwind to choose
• glibc's internal implementation of _Unwind_Find_FDE, usually not exported, and related to __register_frame_info

## DWARF Call Frame Information

The unwind instructions required by different areas of the program are described by DWARF Call Frame Information (CFI) and stored by .eh_frame on the ELF platform. Compiler/assembler/linker/libunwind provides corresponding support.

.eh_frame is composed of Common Information Entry (CIE) and Frame Description Entry (FDE). CIE has these fields:

• length
• CIE_id: Constant 0. This field is used to distinguish CIE and FDE. In FDE, this field is non-zero, representing CIE_pointer
• version: Constant 1
• augmentation_string: A string describing the CIE/FDE parameter list. The P character indicates the personality routine pointer; the L character indicates that the augmentation data of the FDE stores the language-specific data area (LSDA)
• address_size: Generally 4 or 8
• segment_selector_size: For x86
• code_alignment_factor: Assuming that the instruction length is a multiple of 2 or 4 (for RISC), it affects the multiplier of parameters such as DW_CFA_advance_loc
• data_alignment_factor: The multiplier that affects parameters such as DW_CFA_offset DW_CFA_val_offset
• augmentation_data_length
• augmentation_data: personality
• initial_instructions: bytecode for unwinding, a common prefix used by all FDEs using this CIE

Each FDE has an associated CIE. FDE has these fields:

• length: The length of FDE itself. If it is 0xffffffff, the next 8 bytes (extended_length) record the actual length. Unless specially constructed, extended_length is not used
• CIE_pointer: Subtract CIE_pointer from the current position to get the associated CIE
• initial_location: The address of the first location described by the FDE. There is a relocation referring to the section symbol in .o
• instructions: bytecode for unwinding, essentially (address,opcode) pairs
• augmentation_data_length
• augmentation_data: If the associated CIE augmentation contains L characters, language-specific data area will be recorded here

A CIE may optionally refer to a personality routine in the text section. A FDE may optionally refer to its associated LSDA in .gcc_except_table. The personality routine and LSDA are used in Level 2: C++ ABI of Itanium C++ ABI.

.eh_frame is based on .debug_frame introduced in DWARF v2. They have some differences, though:

• .eh_frame has the flag of SHF_ALLOC (indicating that a section should be part of the mirror image in memory) but .debug_frame does not, so the latter has very few usage scenarios.
• debug_frame supports DWARF64 format (supports 64-bit offsets but the volume will be slightly larger) but .eh_frame does not support (in fact, it can be expanded, but lacks demand)
• There is no augmentation_data_length and augmentation_data in the CIE of .debug_frame
• The version field in CIE is different
• The meaning of CIE_pointer in FDE is different. .debug_frame indicates a section offset (absolute) and .eh_frame indicates a relative offset. This change made by .eh_frame is great. If the length of .eh_frame exceeds 32-bit, .debug_frame has to be converted to DWARF64 to represent CIE_pointer, and relative offset does not need to worry about this issue (if the distance between FDE and CIE exceeds 32-bit, add a CIE OK)

For the following function:

The compiler produces .cfi_* (CFI directives) to annotate the assembly, .cfi_startproc and .cfi_endproc annotate the FDE area, and other CFI directives describe CFI instructions. A call frame is indicated by an address on the stack. This address is called Canonical Frame Address (CFA), and is usually the stack pointer value of the call site. The following example demonstrates the usage of CFI instructions:

The assembler parses CFI directives and generates .eh_frame (this mechanism was introduced by Alan Modra in 2003). Linker collects .eh_frame input sections in .o/.a files to generate output .eh_frame. In 2006, GNU as introduced .cfi_personality and .cfi_lsda.

### .eh_frame_hdr and PT_EH_FRAME

To locate the FDE where a pc is located, you need to scan .eh_frame from the beginning to find the appropriate FDE (whether the pc falls in the interval indicated by initial_location and address_range). The time spent is proportional to the number of scanned CIE and FDE records. https://sourceware.org/pipermail/binutils/2001-December/015674.html introduced .eh_frame_hdr, a binary search index table describing (initial_location, FDE address) pairs.

The linker collects all .eh_frame input sections. With --eh-frame-hdr, ld generates .eh_frame_hdr and creates a program header PT_EH_FRAME to describe .eh_frame_hdr. An unwinder can parse the program headers and look for PT_EH_FRAME to locate .eh_frame_hdr. Please check out the example below.

### __register_frame_info

Before .eh_frame_hdr and PT_EH_FRAME were invented, there was a static constructor frame_dummy in crtbegin (crtstuff.c): calling __register_frame_info to register the executable file .eh_frame.

Now __register_frame_info is only used by programs linked with -static. Correspondingly, if you specify -Wl,--no-eh-frame-hdr when linking, you cannot unwind (if you use a C++ exception, the program will call std::terminate).

### libunwind example

If you use llvm-project/libunwind：

If you use nongnu.org/libunwind, there are two options: (a) Add #define UNW_LOCAL_ONLY before #include <libunwind.h> (b) Link one more library, on x86-64 it is -l:libunwind- x86_64.so. If you use Clang, you can also use clang --rtlib=compiler-rt --unwindlib=libunwind -I path/to/include a.c, in addition to providing unw_*, it can ensure that libgcc_s.so is not linked

• unw_getcontext: Get register value (including PC)
• unw_init_local
• Use dl_iterate_phdr to traverse executable files and shared objects, and find the PT_LOAD program header that contains the PC
• Find the PT_EH_FRAME(.eh_frame_hdr) of the module where you are, and save it in cursor
• unw_step
• Binary search for the .eh_frame_hdr item corresponding to the PC, record the FDE found and the CIE it points to
• Execute initial_instructions in CIE
• Execute the instructions (bytecode) in FDE. An automaton maintains the current location and CFA. Among the instructions, DW_CFA_advance_loc advances the location; DW_CFA_def_cfa_* updates CFA; DW_CFA_offset indicates that the value of a register is stored at CFA+offset
• The automaton stops when the current location is greater than or equal to PC. In other words, the executed instruction is a prefix of FDE instructions

An unwinder locates the applicable FDE according to the program counter, and executes all the CFI instructions before the program counter.

There are several important

• DW_CFA_def_cfa_*
• DW_CFA_offset

A -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86 clang, .text 51.7MiB, .eh_frame 4.2MiB, .eh_frame_hdr 646, 2 CIE, 82745 FDE.

### Remarks

CFI instructions are suitable for the compiler to generate code, but cumbersome to write in hand-written assembly. In 2015, Alex Dowad contributed an awk script to musl libc to parse the assembly and automatically generate CFI directives. In fact, generating precise CFI instructions is challenging for ompilers as well. For a function that does not use a frame pointer, adjusting SP requires outputting a CFI directive to redefine CFA. GCC does not parse inline assembly, so adjusting SP in inline assembly often results in imprecise CFI.

The CFIInstrInserter pass in LLVM can insert .cfi_def_cfa_* .cfi_offset .cfi_restore to adjust the CFA and callee-saved registers.

The DWARF scheme also has very low information density. The various compact unwind schemes have made improvement on this aspect. To list a few issues:

• CIE address_size: nobody uses different values for an architecture. Even if they do (ILP32 ABIs in AArch64 and x86-64), the information is already available elsewhere.
• CIE segment_selector_size: It is nice that they cared x86, but x86 itself does not need it anymore:/
• CIE code_alignment_factor and data_alignment_factor: A RISC architecture with such preference can hard code the values.
• CIE return_address_register: I do not know when an architecture wants to use a different register for the return address.
• length: The DWARF's 8-byte form is definitely overengineered... For standard form prologue/epilogue, the field should not be needed.
• initial_location and address_range: if a binary search index table is always needed, why do we need the length field?
• instructions: bytecode is flexible but commonly a function prologue/epilogue is of a standard form and the few callee-saved registers can be encoded in a more compact way.
• augmentation_data: While this provide flexibility, in practice very rarely a function needs anything more than a personality and a LSDA pointer.

Callee-saved registers other than FP are oftentimes unneeded but there is no compiler option to drop them.

#### SHT_X86_64_UNWIND

.eh_frame has special processing in linker/dynamic loader, so conventionally it should use a separate section type, but SHT_PROGBITS was used in the design. In the x86-64 psABI, the type of .eh_frame is SHT_X86_64_UNWIND (influenced by Solaris).

• In GNU as, .section .eh_frame,"a",@unwind will generate SHT_X86_64_UNWIND, and .cfi_* will generate SHT_PROGBITS.
• Since Clang 3.8, .cfi_* generates SHT_X86_64_UNWIND

.section .eh_frame,"a",@unwind is rare (glibc's x86 port, libffi, LuaJIT and other packages), so checking the type of .eh_frame is a good way to distinguish Clang/GCC object file :) For LLD 11.0.0, I contributed https://reviews.llvm.org/D85785 to allow mixed types for .eh_frame in a relocatable link;-)

Suggestion to future architectures: When defining processor-specific section types, please do not use 0x70000001 (SHT_ARM_EXIDX=SHT_IA_64_UNWIND=SHT_PARISC_UNWIND=SHT_X86_64_UNWIND=SHT_LOPROC+1) for purposes other than unwinding :) SHT_CSKY_ATTRIBUTES=0x70000001:)

Usually in the case of COMDAT group and -ffunction-sections, .data/.rodata needs to be split like .text, but .eh_frame is monolithic. Like many other metadata sections, the main problem with the monolithic section is that garbage collection is challenging in the linker. Unlike some other metadata sections, simply abandoning garbage collecting is not a choice: .eh_frame_hdr is a binary search index table and duplicate/unused entries can confuse the customers.

When a linker processes .eh_frame, it needs to conceptually split .eh_frame into CIE/FDE. During --gc-sections, the conceptual reference relationship is reversed considering the actual relocation: a FDE has a relocation referencing the text section; during GC, if the pointed text section is discarded, the FDE that references it should also be discarded.

LLD has some special handling for .eh_frame:

• -M requires special code
• --gc-sections occurs before .eh_frame deduplication/GC. The personality in a CIE is a valid reference. However, initial_location in FDE should be ignored. Moreover, a LSDA reference in a FDE in a section group should be ignored.
• In a relocatable link, a relocation from .eh_frame to a STT_SECTION symbol in a discarded section (due to COMDAT group rule) should be allowed (normally such a STB_LOCAL relocation from outside of the group is disallowed).

GNU ld and gold support --ld-generated-unwind-info which can synthesize CFI for PLT entries. I don't think they do that for range extension thunks.

## Compact unwind descriptors

On macOS, Apple designed the compact unwind descriptors mechanism to accelerate unwinding. In theory, this technique can be used to save some space in __eh_frame, but it has not been implemented. The main idea is:

• The FDE of most functions has a fixed mode (specify CFA at the prologue, store callee-saved registers), and the FDE instructions can be compressed to 32-bit.
• Personality/lsda described by CIE/FDE augmentation data is very common and can be extracted as a fixed field.

Only 64-bit will be discussed below. A descriptor occupies 32 bytes

If you study .eh_frame_hdr (binary search index table) and .ARM.exidx, you can know that the length field is redundant.

The Compact unwind descriptor is encoded as:

Five modes are defined:

• 0: reserved
• 1: FP-based frame: RBP is frame pointer, frame size is variable
• 2: SP-based frame: frame pointer is not used, frame size is fixed during compilation
• 3: large SP-based frame: frame pointer is not used, the frame size is fixed at compile time but the value is large and cannot be represented by mode 2
• 4: DWARF CFI escape

### FP-based frame (UNWIND_MODE_BP_FRAME)

The compact unwind encoding is:

The callee-saved registers on x86-64 are: RBX, R12, R13, R14, R15, RBP. 3 bits can encode a register, 15 bits are enough to represent 5 registers except RBP (whether to save and where). stack_adjust records the extra stack space outside the save register.

### SP-based frame (UNWIND_MODE_STACK_IMMD)

The compact unwind encoding is:

cnt represents the number of saved registers (maximum 6). reg_permutation indicates the sequence number of the saved register. size*8 represents the stack frame size.

### Large SP-based frame (UNWIND_MODE_STACK_IND)

Compact unwind descriptor编码为：

Similar to SP-based frame. In particular: the stack frame size is read from the text section. The RSP adjustment is usually represented by subq imm, %rsp, and size_offset is used to represent the distance from the instruction to the beginning of the function. The actual stack size also includes adj*8.

### DWARF CFI escape

If for various reasons, the compact unwind descriptor cannot be expressed, it must fall back to DWARF CFI.

## ARM exception handling

.ARM.exidx是个binary search index table，由2-word pairs组成。 第一个word是31-bit PC-relative offset to the start of the region。 第二个word用程序描述更加清晰：

tableData & 0x80000000表示一个compact model entry，否则表示一个generic model entry。

.ARM.exidx相当于增强的.eh_frame_hdr，compact model相当于内联了.eh_frame中的personality和lsda。考虑下面三种情况：

• 如果不会触发C++ exception且不会调用可能触发exception的函数：不需要personality，只需要一个EXIDX_CANTUNWIND entry，不需要.ARM.extab
• 如果有catch：需要__gxx_personality_v0，需要.ARM.extab

.ARM.extab相当于合并的.eh_frame.gcc_except_table

## Linux kernel ORC unwind tables

objtool orc generate a.o解析.eh_frame并生成.orc_unwind.orc_unwind_ip。对于这样一个.o文件：

Unwind information在两个地址发生改变：foo的开头和末尾，因此需要两个2 ORC entries。 如果DWARF CFA在函数中间变更(比如因为push/pop)，可能会需要更多entries。

.orc_unwind_ip有两个entries，表示PC-relative地址。

.orc_unwind包含类隔类型为orc_entry的entries，记录上一个栈帧的IP/SP/BP的存储位置。

Linker会resolve .orc_unwind_ip的relocations，并创建__start_orc_unwind_ip/__stop_orc_unwind_ip/__start_orc_unwind/__stop_orc_unwind用于定界。 然后，一个host utility scripts/sorttable.orc_unwind_ip.orc_unwind进行排序。 运行时unwind_next_frame执行下面的步骤unwind一个栈帧：

• .orc_unwind_ip里二分搜索需要的ORC entry
• 根据当前SP、orc->sp_regorc->sp_offset获取上一个栈帧的SP
• 根据orc->type和其他信息获取上一个栈帧的IP
• 根据当前BP、上一个栈帧的SP、orc->bp_regorc->bp_offset获取上一个栈帧的BP