2025-09-07

Stack unwinding

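Stack unwinding is mainly used for the following purposes:

Obtaining a stack trace, for debuggers, crash reporters, profilers, garbage collectors, etc.
Implementing C++ exceptions (Itanium C++ ABI), with the help of personality routines and language-specific data areas. See C++ exception handling ABI.

Stack unwinding falls into two categories:

synchronous: triggered by the program itself, e.g. a C++ throw or obtaining its own stack trace. This kind of stack unwinding only happens at function call sites (inside the function body, never in the prologue/epilogue).
asynchronous: triggered by a signal or by another program. This kind of stack unwinding can happen in a function prologue/epilogue.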

Frame pointer

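The most classic and simplest form of stack unwinding is based on a frame pointer: a register is reserved as the frame pointer (RBP on x86-64). The function prologue pushes the frame pointer into the stack frame and then updates the frame pointer to the address of the saved value. The frame pointer and the values saved on the stack form a singly linked list. After obtaining the initial frame pointer value (__builtin_frame_address), repeatedly dereferencing the frame pointer yields the frame pointer of every stack frame. This method does not work at some instructions in the prologue/epilogue.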

pushq %rbp
movq %rsp, %rbp # after this, RBP references the current frame
...
popq %rbp
retq  # RBP references the previous frame

Here is a simple stack unwinding example:

#include <stdio.h>
[[gnu::noinline]] void qux() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
#if defined(__riscv) || defined(__loongarch__)
    void **next_fp = fp[-2], *pc = fp[-1];
#elif defined(__powerpc__)
    void **next_fp = fp[0];
    void *pc = next_fp <= fp ? 0 : next_fp[2];
#else
    void **next_fp = *fp, *pc = fp[1];
#endif
    printf("%p %p\n", next_fp, pc);
    if (next_fp <= fp) break;
    fp = next_fp;
  }
}
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

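Frame pointer based unwinding is simple, but it has several drawbacks.

When the code above is compiled at -O1 or higher, foo and bar make tail calls, so the output does not contain the stack frames of foo and bar (-fomit-leaf-frame-pointer does not prevent tail calls).

In practice, it is sometimes impossible to guarantee that every library is built with frame pointers. For robustness, when unwinding a thread we need to check whether a next_fp looks like a stack address. One way is to parse /proc/*/maps and check whether the address is readable (slow). Another is: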

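#include <fcntl.h>
#include <unistd.h>

// Returns 1 if the byte at address is readable, 0 otherwise.
int readable_via_write(const void *address) {
  // Or use the write end of a pipe.
  int fd = open("/dev/random", O_WRONLY);
  if (fd < 0)
    return 0;
  int ok = write(fd, address, 1) >= 0;  // fails with EFAULT if not readable
  close(fd);
  return ok;
}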

On Linux, rt_sigprocmask is a better choice:

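#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

// Returns 1 if 8 bytes at address are readable. The invalid `how` (~0) makes
// the call fail either way; EFAULT means the kernel could not read address.
int readable_via_sigprocmask(const void *address) {
  errno = 0;
  syscall(SYS_rt_sigprocmask, ~0, address, (void *)0, /*sizeof(kernel_sigset_t)=*/8);
  return errno != EFAULT;
}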

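In addition, reserving a register for the frame pointer increases text size and has a performance cost (extra prologue/epilogue instructions, plus more register pressure from having one fewer register). The cost can be quite significant on the register-starved x86-32, and may exceed 1% even on x86-64, which has more registers.

Compiler behavior

-O0: defaults to -fno-omit-frame-pointer; all functions have a frame pointer
-O1 or above: defaults to -fomit-frame-pointer; a frame pointer is set up only when necessary. Specifying -fno-omit-frame-pointer gives an effect similar to -O0. -momit-leaf-frame-pointer can additionally be specified to drop the frame pointer from leaf functions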

libunwind

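Stack unwinding for C++ exceptions and for profilers/crash reporters usually uses the libunwind API and DWARF Call Frame Information. In the 1990s, Hewlett-Packard defined a set of libunwind APIs, which fall into two groups:

unw_*: the entry points are unw_init_local (local unwinding, the current process) and unw_init_remote (remote unwinding, another process). Applications that use libunwind usually use this set of APIs. For example, Linux perf calls unw_init_remote
_Unwind_*: this part is standardized as Level 1: Base ABI of Itanium C++ ABI: Exception Handling. The Level 2 C++ ABI calls these _Unwind_* APIs. Among them, _Unwind_Resume is the only API called directly by compiled C++ code, _Unwind_Backtrace is used by a few applications to obtain a backtrace, and the other functions are called by libsupc++/libc++abi.

Hewlett-Packard open-sourced https://www.nongnu.org/libunwind/ (besides it, there are many other projects named "libunwind"). Common implementations of this API on Linux are:

libgcc/unwind-* (libgcc_s.so.1 or libgcc_eh.a): implements _Unwind_* and introduces some extensions: _Unwind_Resume_or_Rethrow, _Unwind_FindEnclosingFunction, __register_frame, etc.
llvm-project/libunwind (libunwind.so or libunwind.a) is a simplified implementation of the HP API. It provides part of unw_* but does not implement unw_init_remote. Part of the code is derived from ld64. With Clang, it can be selected with --rtlib=compiler-rt --unwindlib=libunwind
glibc's internal implementation of _Unwind_Find_FDE, usually not exported; it is related to __register_frame_info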

DWARF Call Frame Information

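The unwind instructions needed by different regions of a program are described by DWARF Call Frame Information (CFI), which is stored in .eh_frame on ELF platforms. Compilers, assemblers, linkers, and libunwind provide the corresponding support.

.eh_frame consists of Common Information Entries (CIE) and Frame Description Entries (FDE). A CIE has these fields:

length
CIE_id: constant 0. This field distinguishes a CIE from an FDE: in an FDE the field is non-zero and holds CIE_pointer
version: constant 1
augmentation: a string describing the CIE/FDE argument list. The P character indicates a personality routine pointer; the L character indicates that the FDE augmentation data stores the language-specific data area (LSDA)
address_size: usually 4 or 8
segment_selector_size: for x86
code_alignment_factor: assumes that instruction lengths are all multiples of 2 or 4 (for RISC); the multiplier for the operands of DW_CFA_advance_loc and similar instructions
data_alignment_factor: the multiplier for the operands of DW_CFA_offset, DW_CFA_val_offset, and similar instructions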
return_address_register
augmentation_data_length
augmentation_data: personality
initial_instructions
padding

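Each FDE has an associated CIE. An FDE has these fields:

length: the length of the FDE itself. If it is 0xffffffff, the following 8 bytes (extended_length) record the actual length. Unless specially crafted, extended_length is never needed
CIE_pointer: subtracting CIE_pointer from the current position gives the associated CIE
initial_location: the address of the first location described by this FDE. In a .o file there is a relocation here that references a section symbol
address_range: initial_location and address_range describe an address range
instructions: the unwind instructions
augmentation_data_length
augmentation_data: if the augmentation of the associated CIE contains the L character, the language-specific data area is recorded here
padding

The CIE references the personality in a text section. The FDE references the LSDA in .gcc_except_table. The personality and the LSDA are used by Level 2: C++ ABI of the Itanium C++ ABI.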
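The record layout described above can be sketched as a small walker that tells CIEs from FDEs. This is only a minimal sketch and not the code of any real unwinder: eh_frame_count, struct eh_counts, and read_u32 are hypothetical names, the length fields are assumed to be 32-bit values in host byte order, and the 0xffffffff extended form is rejected rather than handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct eh_counts { int cies, fdes; };  // hypothetical result type

static uint32_t read_u32(const uint8_t *p) {
  uint32_t v;
  memcpy(&v, p, sizeof v);  // unaligned-safe read, host byte order
  return v;
}

// Walk the records of a .eh_frame image and count CIEs and FDEs.
// Returns 0 on success, -1 on a malformed or unsupported record.
int eh_frame_count(const uint8_t *buf, size_t size, struct eh_counts *out) {
  assert(buf && out);
  out->cies = out->fdes = 0;
  size_t off = 0;  // invariant: off <= size
  while (size - off >= 4) {
    uint32_t length = read_u32(buf + off);
    if (length == 0) break;                // zero terminator
    if (length == 0xffffffffu) return -1;  // extended_length: not handled here
    if (length < 4 || length > size - off - 4) return -1;
    size_t id_pos = off + 4;               // position of CIE_id / CIE_pointer
    uint32_t id = read_u32(buf + id_pos);
    if (id == 0) {
      out->cies++;
    } else {
      // CIE_pointer is subtracted from the position of the field itself.
      if (id > id_pos) return -1;
      size_t cie = id_pos - id;
      if (size - cie < 8 || read_u32(buf + cie + 4) != 0) return -1;
      out->fdes++;
    }
    off = id_pos + length;  // length does not include the length field
  }
  return 0;
}
```

A real consumer would additionally parse the augmentation string of each CIE before it can interpret the FDE fields; the sketch stops at classifying records.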

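.eh_frame is based on .debug_frame, which was introduced in DWARF v2. They differ in a few ways:

.eh_frame has the SHF_ALLOC flag (which indicates whether a section should be part of the in-memory image) while .debug_frame does not, so the latter has very few use cases.
.debug_frame supports the DWARF64 format (64-bit offsets at the cost of a slightly larger size) while .eh_frame does not (it could be extended, but there is no demand)
The CIE of .debug_frame has no augmentation_data_length or augmentation_data
The value of version in the CIE differs
The meaning of CIE_pointer in the FDE differs. In .debug_frame it is a section offset (absolute), while in .eh_frame it is a relative offset. This change made by .eh_frame is a good one. If the section size exceeds 32 bits, .debug_frame has to be converted to DWARF64 to represent CIE_pointer, whereas a relative offset does not have this problem (if the distance from an FDE to its CIE exceeds 32 bits, just add another CIE)

For the following function: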

void f() {
  __builtin_unwind_init();
}

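The compiler annotates the assembly with .cfi_* (CFI directives). .cfi_startproc and .cfi_endproc delimit the FDE region, and the other CFI directives describe CFI instructions. A call frame is identified by an address on the stack. This address is called the Canonical Frame Address (CFA) and is usually the value of the stack pointer at the call site. The following example illustrates what the CFI instructions do: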

f:
# At the function entry, CFA = rsp+8
	.cfi_startproc
# %bb.0:
	pushq	%rbp
# Redefine CFA = rsp+16
	.cfi_def_cfa_offset 16
# rbp is saved at the address CFA-16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
# CFA = rbp+16. CFA does not need to be redefined when rsp changes
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
# rbx is saved at the address CFA-56
	.cfi_offset %rbx, -56
	.cfi_offset %r12, -48
	.cfi_offset %r13, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
# CFA = rsp+8
	.cfi_def_cfa %rsp, 8
	retq
.Lfunc_end0:
	.size	f, .Lfunc_end0-f
	.cfi_endproc

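The assembler parses CFI directives and generates .eh_frame (this mechanism was introduced by Alan Modra in 2003). The linker collects the .eh_frame input sections from .o files and generates the output .eh_frame. In 2006, GNU as introduced .cfi_personality and .cfi_lsda.

`.eh_frame_hdr` and `PT_GNU_EH_FRAME`

Locating the FDE that covers a pc requires scanning .eh_frame from the beginning until a suitable FDE is found (one whose range, given by initial_location and address_range, contains the pc). The time taken is proportional to the number of CIE and FDE records scanned. https://sourceware.org/pipermail/binutils/2001-December/015674.html introduced .eh_frame_hdr, which contains a binary search index table of (initial_location, FDE address) pairs.

ld --eh-frame-hdr generates .eh_frame_hdr. The linker additionally creates the program header PT_GNU_EH_FRAME to cover .eh_frame_hdr. An unwinder looks for PT_GNU_EH_FRAME to locate .eh_frame_hdr; see the example below.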
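The lookup over the binary search index table can be sketched as follows. This is a sketch under stated assumptions, not the code of libgcc or libunwind: struct hdr_entry and hdr_lookup are hypothetical names, and the table is assumed to use the commonly seen encoding DW_EH_PE_datarel | DW_EH_PE_sdata4, so that both fields are signed 32-bit offsets from the start of .eh_frame_hdr and pc has been converted to the same form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// One entry of the binary search table. Both values are assumed to be
// offsets relative to the start of .eh_frame_hdr.
struct hdr_entry {
  int32_t initial_location;
  int32_t fde;
};

// Return the last entry whose initial_location is <= pc, or NULL if pc lies
// before the first entry. The table must be sorted by initial_location.
const struct hdr_entry *hdr_lookup(const struct hdr_entry *tab, size_t n,
                                   int32_t pc) {
  assert(tab || n == 0);
  size_t lo = 0, hi = n;  // entries before lo are <= pc, entries from hi are > pc
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (tab[mid].initial_location <= pc)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo ? &tab[lo - 1] : NULL;
}
```

The table stores only start addresses, so a caller still has to read address_range from the FDE it lands on to confirm that pc is really covered.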

`__register_frame_info`

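Before .eh_frame_hdr and PT_GNU_EH_FRAME were invented, crtbegin (crtstuff.c) had a static constructor frame_dummy, which calls __register_frame_info to register the .eh_frame of the executable.

Nowadays __register_frame_info is only used by programs linked with -static. Correspondingly, if -Wl,--no-eh-frame-hdr is specified at link time, unwinding does not work (if C++ exceptions are used, this leads to std::terminate).

libunwind example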

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
  unw_context_t context;
  unw_cursor_t cursor;
  // Store register values into context.
  unw_getcontext(&context);
  // Locate the PT_GNU_EH_FRAME which contains PC.
  unw_init_local(&cursor, &context);
  size_t rip, rsp;
  do {
    unw_get_reg(&cursor, UNW_X86_64_RIP, &rip);
    unw_get_reg(&cursor, UNW_X86_64_RSP, &rsp);
    printf("rip: %zx rsp: %zx\n", rip, rsp);
  } while (unw_step(&cursor) > 0);
}

void bar() {backtrace();}
void foo() {bar();}
int main() {foo();}

With llvm-project/libunwind:

$CC a.c -Ipath/to/include -Lpath/to/lib -lunwind

With nongnu.org/libunwind, there are two options: (a) add #define UNW_LOCAL_ONLY before #include <libunwind.h>, or (b) link one more library, which is -l:libunwind-x86_64.so on x86-64. With Clang, you can also use clang --rtlib=compiler-rt --unwindlib=libunwind -I path/to/include a.c, which, besides providing unw_*, ensures that libgcc_s.so is not linked

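unw_getcontext: obtains the register values (including the PC)
unw_init_local
- uses dl_iterate_phdr to iterate over the executable and shared objects, and finds the PT_LOAD program header that contains the PC
- finds the PT_GNU_EH_FRAME (.eh_frame_hdr) of the containing module and stores it in the cursor
unw_step
- binary searches .eh_frame_hdr for the entry corresponding to the PC, and records the FDE found and the CIE it points to
- executes the initial_instructions in the CIE
- executes the instructions in the FDE. It maintains a location and a CFA; the location initially points to the initial_location of the FDE. Among the instructions, DW_CFA_advance_loc advances the location, DW_CFA_def_cfa_* updates the CFA, and DW_CFA_offset indicates that the value of a register is saved at CFA+offset
- stops when the location is greater than or equal to the PC. In other words, the executed instructions are a prefix of the FDE instructions

The unwinder finds the applicable FDE from the program counter and executes all the CFI instructions before the program counter.

There are several important instructions: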

DW_CFA_def_cfa_*
DW_CFA_offset
DW_CFA_advance_loc
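To make the three instruction families concrete, here is a minimal sketch of an interpreter that runs a prefix of the CFI instructions. It is an illustration under stated assumptions rather than how any particular unwinder is written: struct cfi_state, cfi_exec, and cfi_lookup are hypothetical names, code_alignment_factor is assumed to be 1 and data_alignment_factor -8, only five opcodes plus DW_CFA_nop are handled, and the byte arrays are hand-encoded for the first two instructions of the function f shown earlier. The lookup argument is the exact instruction offset within the function; passing a return address minus 1 gives the "stop when location >= PC" behavior described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// x86-64 DWARF register numbers used by this sketch.
enum { REG_RBP = 6, REG_RSP = 7, REG_RA = 16, NUM_REGS = 17 };

struct cfi_state {
  unsigned cfa_reg;            // CFA = value of cfa_reg + cfa_off
  int64_t cfa_off;
  int saved[NUM_REGS];         // saved[r] != 0: r is stored at CFA + save_off[r]
  int64_t save_off[NUM_REGS];
};

static uint64_t read_uleb128(const uint8_t **p, const uint8_t *end) {
  uint64_t v = 0;
  unsigned shift = 0;
  while (*p < end) {
    uint8_t b = *(*p)++;
    if (shift < 64) v |= (uint64_t)(b & 0x7f) << shift;
    shift += 7;
    if (!(b & 0x80)) break;
  }
  return v;
}

// Execute instructions until the location would move past `lookup`.
static void cfi_exec(const uint8_t *p, const uint8_t *end, uint64_t lookup,
                     int64_t data_align, struct cfi_state *st) {
  uint64_t loc = 0;  // offset from initial_location
  while (p < end) {
    uint8_t op = *p++;
    if ((op & 0xc0) == 0x40) {         // DW_CFA_advance_loc
      loc += op & 0x3f;
      if (loc > lookup) return;        // the remaining rows describe later code
    } else if ((op & 0xc0) == 0x80) {  // DW_CFA_offset
      unsigned reg = op & 0x3f;
      int64_t off = (int64_t)read_uleb128(&p, end) * data_align;
      if (reg < NUM_REGS) { st->saved[reg] = 1; st->save_off[reg] = off; }
    } else if (op == 0x0c) {           // DW_CFA_def_cfa
      st->cfa_reg = (unsigned)read_uleb128(&p, end);
      st->cfa_off = (int64_t)read_uleb128(&p, end);
    } else if (op == 0x0d) {           // DW_CFA_def_cfa_register
      st->cfa_reg = (unsigned)read_uleb128(&p, end);
    } else if (op == 0x0e) {           // DW_CFA_def_cfa_offset
      st->cfa_off = (int64_t)read_uleb128(&p, end);
    } else {
      assert(op == 0x00 && "opcode not handled by this sketch");  // DW_CFA_nop
    }
  }
}

// CIE: CFA = rsp+8, return address at CFA-8.
static const uint8_t cie_init[] = {0x0c, REG_RSP, 8, 0x80 | REG_RA, 1};
// FDE: after pushq %rbp (1 byte), after movq %rsp, %rbp (3 bytes).
static const uint8_t fde_insns[] = {0x40 | 1, 0x0e, 16, 0x80 | REG_RBP, 2,
                                    0x40 | 3, 0x0d, REG_RBP};

struct cfi_state cfi_lookup(uint64_t lookup) {
  struct cfi_state st;
  memset(&st, 0, sizeof st);
  cfi_exec(cie_init, cie_init + sizeof cie_init, UINT64_MAX, -8, &st);
  cfi_exec(fde_insns, fde_insns + sizeof fde_insns, lookup, -8, &st);
  return st;
}
```

With this encoding, cfi_lookup(0) describes the function entry (CFA = rsp+8), cfi_lookup(1) the state after the push (CFA = rsp+16, rbp saved at CFA-16), and cfi_lookup(4) the state after the frame pointer is set up (CFA = rbp+16).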

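For a clang built with -DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86, .text is 51.7MiB, .eh_frame is 4.2MiB, .eh_frame_hdr is 646KiB, and there are 2 CIEs and 82745 FDEs.

Remarks

CFI instructions are well suited to compiler-generated code, but annotating every instruction accurately in hand-written assembly is tedious and error-prone. In 2015, Alex Dowad contributed an awk script to musl libc that parses assembly and automatically adds CFI directives. In fact, it is not easy for compiler-generated code either: for a function that does not use a frame pointer, every adjustment of SP has to be accompanied by a CFI directive that redefines the CFA. GCC does not parse inline assembly, so adjusting SP in inline assembly often results in inaccurate CFI.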

void foo() {
  asm("subq $128, %rsp\n"
  // Cannot unwind if -fomit-leaf-frame-pointer
      "nop\n"
      "addq $128, %rsp\n");
}

int main() {
  foo();
}

In contrast, CFIInstrInserter in LLVM can insert .cfi_def_cfa_* .cfi_offset .cfi_restore to adjust the CFA and callee-saved registers.

The DWARF scheme also has very low information density. The various compact unwind schemes improve on this aspect. To list a few issues:

CIE address_size: nobody uses different values for an architecture. Even if they do (ILP32 ABIs in AArch64 and x86-64), the information is already available elsewhere.
CIE segment_selector_size: It is nice that they cared about x86, but x86 itself does not need it anymore:/
CIE code_alignment_factor and data_alignment_factor: A RISC architecture with such preference can hard code the values.
CIE return_address_register: I do not know when an architecture wants to use a different register for the return address.
length: The DWARF's 8-byte form is definitely overengineered... For standard form prologue/epilogue, the field should not be needed.
initial_location and address_range: if a binary search index table is always needed, why do we need the length field?
instructions: bytecode is flexible but commonly a function prologue/epilogue is of a standard form and the few callee-saved registers can be encoded in a more compact way.
augmentation_data: While this provides flexibility, in practice a function very rarely needs anything more than a personality and an LSDA pointer.

`SHT_X86_64_UNWIND`

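.eh_frame receives special treatment in the linker and the dynamic loader, so by rights it should have a dedicated section type, but the original design used SHT_PROGBITS. In the x86-64 psABI, the type of .eh_frame is SHT_X86_64_UNWIND (probably influenced by Solaris).

In GNU as, .section .eh_frame,"a",@unwind generates SHT_X86_64_UNWIND, while .cfi_* generates SHT_PROGBITS.
Since Clang 3.8, .cfi_* generates SHT_X86_64_UNWIND

.section .eh_frame,"a",@unwind is rare (glibc's x86 port, libffi, LuaJIT, and a few other packages), so checking the type of .eh_frame is a good way to tell Clang and GCC object files apart :) I have an LLD 11.0.0 commit that accepts both types of .eh_frame in a relocatable link ;-)

Advice for future architectures: when defining processor-specific section types, please do not use 0x70000001 (SHT_ARM_EXIDX=SHT_IA_64_UNWIND=SHT_PARISC_UNWIND=SHT_X86_64_UNWIND=SHT_LOPROC+1) for purposes other than unwinding :) SHT_CSKY_ATTRIBUTES=0x70000001 :)

Problems from the linker's perspective

Usually, with COMDAT groups and -ffunction-sections, .data/.rodata need to be split up like .text, but .eh_frame is a monolithic section. As with many other metadata sections, the main problem with a monolithic section is that linker garbage collection is somewhat troublesome. Unlike other metadata sections, simply giving up on garbage collection is not an option: .eh_frame_hdr is a binary search index table, and duplicate or useless entries would confuse consumers.

When processing .eh_frame, the linker needs to conceptually split it into CIEs and FDEs. For --gc-sections, the conceptual reference direction is the opposite of the actual relocation: an FDE has a relocation referencing a text section; during GC, if the referenced text section is discarded, the FDE referencing it should be discarded as well.

LLD has some special handling for .eh_frame:

-M needs special code
--gc-sections happens before .eh_frame deduplication/GC. The personality in a CIE is a valid reference, the initial_location in an FDE should be ignored, and the lsda reference in an FDE is considered only in the non-section-group case
In a relocatable link, a reference from .eh_frame to the STT_SECTION symbol of a discarded section (due to the COMDAT group rule) is allowed (normally, for a discarded section, STB_LOCAL relocations from outside the section group should be rejected)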

Compact unwind descriptors

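On macOS, Apple designed the compact unwind descriptors mechanism to speed up unwinding. In theory this technique could also save some __eh_frame space, but that has not been implemented. The main ideas are:

The FDEs of most functions follow a fixed pattern (specify the CFA in the prologue, store callee-saved registers), so the FDE instructions can be compressed into 32 bits.
The personality/lsda described by the CIE/FDE augmentation data are common and can be pulled out into fixed fields.

Only 64-bit is discussed below. A descriptor occupies 32 bytes: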

.quad _foo
.set L1, Lfoo_end-_foo
.long L1
.long compact_unwind_description
.quad personality
.quad lsda_address

If you study .eh_frame_hdr (a binary search index table) and .ARM.exidx, you will see that the length field is redundant.

A compact unwind descriptor is encoded as:

uint32_t : 24; // vary with different modes
uint32_t mode : 4;
uint32_t flags : 4;

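Five modes are defined:

0: reserved
1: FP-based frame: RBP is the frame pointer; the frame size is variable
2: SP-based frame: no frame pointer; the frame size is fixed at compile time
3: large SP-based frame: no frame pointer; the frame size is fixed at compile time but too large to be represented by mode 2
4: DWARF CFI escape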

FP-based frame (`UNWIND_MODE_BP_FRAME`)

A compact unwind descriptor is encoded as:

uint32_t regs : 15;
uint32_t : 1; // 0
uint32_t stack_adjust : 8;
uint32_t mode : 4;
uint32_t flags : 4;

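On x86-64, the callee-saved registers are RBX, R12, R13, R14, R15, and RBP. 3 bits can encode one register, so 15 bits are enough to represent the 5 registers other than RBP (whether each is saved and where). stack_adjust records the extra stack space besides the saved registers.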
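Decoding the FP-based encoding can be sketched as below. The sketch assumes that the bit-fields listed above are allocated from the least significant bit, and that registers are numbered as in Apple's compact_unwind_encoding.h (0 for none, then RBX, R12, R13, R14, R15, RBP); both points are assumptions of this example rather than statements from the text. The names cu_reg, cu_bp_frame, and cu_decode_bp_frame are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// Register numbers, assumed to follow Apple's compact_unwind_encoding.h.
enum cu_reg {
  CU_NONE = 0, CU_RBX = 1, CU_R12 = 2, CU_R13 = 3,
  CU_R14 = 4, CU_R15 = 5, CU_RBP = 6
};

struct cu_bp_frame {
  unsigned flags;         // top 4 bits
  unsigned mode;          // next 4 bits
  unsigned stack_adjust;  // 8 bits
  enum cu_reg slots[5];   // five 3-bit register slots from the low 15 bits
};

// Returns 1 and fills *out if enc uses mode 1 (FP-based frame), otherwise 0.
int cu_decode_bp_frame(uint32_t enc, struct cu_bp_frame *out) {
  assert(out);
  out->flags = enc >> 28;
  out->mode = (enc >> 24) & 0xf;
  if (out->mode != 1)
    return 0;
  out->stack_adjust = (enc >> 16) & 0xff;
  uint32_t regs = enc & 0x7fff;
  for (int i = 0; i < 5; i++) {
    out->slots[i] = (enum cu_reg)(regs & 7);  // 0 means the slot is unused
    regs >>= 3;
  }
  return 1;
}
```

The sketch only splits the word into fields; mapping each slot to a stack address is left to the unwinder.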

SP-based frame (`UNWIND_MODE_STACK_IMMD`)

A compact unwind descriptor is encoded as:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t : 3;
uint32_t size : 8;
uint32_t mode : 4;
uint32_t flags : 4;

cnt is the number of saved registers (at most 6). reg_permutation is the index of the permutation of the saved registers. size*8 is the stack frame size.

Large SP-based frame (`UNWIND_MODE_STACK_IND`)

A compact unwind descriptor is encoded as:

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t adj : 3;
uint32_t size_offset : 8;
uint32_t mode : 4;
uint32_t flags : 4;

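This is similar to the SP-based frame. The difference is that the stack frame size is read from the text section. The RSP adjustment is usually represented by subq imm, %rsp, and size_offset is the distance from the start of the function to that instruction. The actual stack size also includes adj*8.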

DWARF CFI escape

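If, for whatever reason, a compact unwind descriptor cannot represent a function, it has to fall back to DWARF CFI.

In the LLVM implementation, each function is represented by a single compact unwind descriptor. If asynchronous stack unwinding happens in the epilogue, the existing implementation cannot distinguish it from stack unwinding in the function body: the Canonical Frame Address will be computed incorrectly, and the callee-saved registers will be read incorrectly as well. If it happens in the prologue, and the prologue contains instructions other than pushing registers and subq imm, %rsp, it will also go wrong. In addition, if shrink wrapping is enabled for a function, the prologue may not be at the start of the function, and asynchronous stack unwinding between the start and the prologue goes wrong too. Most people do not seem to care about this problem, probably because nobody minds a profiler losing a few percent of its profile.

In fact, accurate unwinding is still possible if multiple descriptors are used to describe the regions of a function. In 2018, OpenVMS proposed [RFC] Improving compact x86-64 compact unwind descriptors, but unfortunately there is no implementation.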

ARM exception handling

It consists of .ARM.exidx and .ARM.extab.

.ARM.exidx is a binary search index table made of 2-word pairs. The first word is a 31-bit PC-relative offset to the start of the region. The second word is best described with code:

if (indexData == EXIDX_CANTUNWIND)
  return false;  // like an absent .eh_frame entry. In the case of C++ exceptions, std::terminate
if (indexData & 0x80000000) {
  extabAddr = &indexData;
  extabData = indexData; // inline
} else {
  extabAddr = &indexData + signExtendPrel31(indexData);
  extabData = read32(&indexData + signExtendPrel31(indexData)); // stored in .ARM.extab
}

extabData & 0x80000000 indicates a compact model entry; otherwise it is a generic model entry.
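The pseudocode above relies on a signExtendPrel31 helper. A minimal sketch of that helper and of the three-way classification of the second word follows; sign_extend_prel31, exidx_classify, exidx_extab_offset, and the enum names are hypothetical, and EXIDX_CANTUNWIND is taken to be the value 1.

```c
#include <assert.h>
#include <stdint.h>

#define EXIDX_CANTUNWIND 1u

enum exidx_kind {
  EXIDX_KIND_CANTUNWIND,  // no unwinding through this region
  EXIDX_KIND_INLINE,      // bit 31 set: the entry is stored inline
  EXIDX_KIND_EXTAB        // prel31 offset to an entry in .ARM.extab
};

// Sign-extend a 31-bit PC-relative offset; bit 31 of the word is ignored.
int32_t sign_extend_prel31(uint32_t word) {
  int64_t v = word & 0x7fffffffu;
  if (v & 0x40000000)
    v -= (int64_t)1 << 31;
  return (int32_t)v;
}

enum exidx_kind exidx_classify(uint32_t index_data) {
  if (index_data == EXIDX_CANTUNWIND)
    return EXIDX_KIND_CANTUNWIND;
  if (index_data & 0x80000000u)
    return EXIDX_KIND_INLINE;
  return EXIDX_KIND_EXTAB;
}

// Byte offset from the second word to its .ARM.extab entry.
int32_t exidx_extab_offset(uint32_t index_data) {
  assert(exidx_classify(index_data) == EXIDX_KIND_EXTAB);
  return sign_extend_prel31(index_data);
}
```

The same sign extension applies to the first word of each pair, which locates the start of the code region.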

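.ARM.exidx is like an enhanced .eh_frame_hdr, and the compact model is like inlining the personality and lsda of .eh_frame. Consider the following three cases:

If the function does not throw C++ exceptions and does not call functions that may throw: no personality is needed, a single EXIDX_CANTUNWIND entry suffices, and .ARM.extab is not needed
If C++ exceptions may be thrown but no landing pad is needed: the personality is __aeabi_unwind_cpp_pr0, a single compact model entry suffices, and .ARM.extab is not needed
If there is a catch: __gxx_personality_v0 and .ARM.extab are needed

.ARM.extab is like a combination of .eh_frame and .gcc_except_table.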

Generic model

uint32_t personality; // bit 31 is 0
uint32_t : 24;
uint32_t num : 8;
uint32_t opcodes[];   // opcodes, variable length
uint8_t lsda[];       // variable length

To be expanded.

Windows ARM64 exception handling

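See https://docs.microsoft.com/en-us/cpp/build/arm64-exception-handling. This is my favorite encoding scheme. It supports unwinding in the middle of a prolog or epilog, and it supports function fragments (used to represent unconventional stack frames such as shrink wrapping).

The data is stored in two sections, .pdata and .xdata.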

uint32_t function_start_rva;
uint32_t Flag : 2;
uint32_t Data : 30;

For functions in canonical form, Packed Unwind Data is used and no .xdata record is needed; descriptors that Packed Unwind Data cannot represent are stored in .xdata.

Packed Unwind Data

uint32_t FunctionStartRVA;
uint32_t Flag : 2;
uint32_t FunctionLength : 11;
uint32_t RegF : 3;
uint32_t RegI : 4;
uint32_t H : 1;
uint32_t CR : 2;
uint32_t FrameSize : 9;
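Unpacking the second word of such a record can be sketched as follows. The bit positions follow the field order listed above, assuming allocation from the least significant bit. The scaling (FunctionLength counted in 4-byte units, FrameSize in 16-byte units) and the meaning of Flag (0 for an .xdata reference, 1 or 2 for packed data) are my reading of the Microsoft documentation linked above and should be treated as assumptions. struct packed_unwind and decode_packed_unwind are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct packed_unwind {
  unsigned flag, reg_f, reg_i, h, cr;
  uint32_t function_length;  // in bytes
  uint32_t frame_size;       // in bytes
};

// Decode the second word of a .pdata record that uses Packed Unwind Data.
struct packed_unwind decode_packed_unwind(uint32_t word) {
  struct packed_unwind u;
  u.flag = word & 3;
  // Flag 0 would mean the remaining bits locate an .xdata record instead.
  assert(u.flag == 1 || u.flag == 2);
  u.function_length = ((word >> 2) & 0x7ff) * 4;  // assumed 4-byte units
  u.reg_f = (word >> 13) & 7;
  u.reg_i = (word >> 16) & 0xf;
  u.h = (word >> 20) & 1;
  u.cr = (word >> 21) & 3;
  u.frame_size = (word >> 23) * 16;               // assumed 16-byte units
  return u;
}
```

Shifts and masks are used instead of C bit-fields because bit-field layout is implementation-defined.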

MIPS compact exception tables

To be expanded.

Linux kernel ORC unwind tables

On x86-64, the Linux kernel uses its own unwind tables: ORC. The documentation is at https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html, and lwn.net has an introduction, The ORCs are coming.

objtool orc generate a.o parses .eh_frame and generates .orc_unwind and .orc_unwind_ip. For a .o file like this:

.globl foo
.type foo, @function
foo:
  ret

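The unwind information changes at two addresses: the start and the end of foo, so 2 ORC entries are needed. If the DWARF CFA changes in the middle of a function (e.g. because of push/pop), more entries may be needed.

.orc_unwind_ip has two entries, representing PC-relative addresses.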

Relocation section '.rela.orc_unwind_ip' at offset 0x2028 contains 2 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000002 R_X86_64_PC32          0000000000000000 .text + 0
0000000000000004  0000000500000002 R_X86_64_PC32          0000000000000000 .text + 1

.orc_unwind contains entries of type orc_entry, which record where the IP/SP/BP of the previous frame are stored.

struct orc_entry {
  s16 sp_offset; // sp_offset and sp_reg encode where SP of the previous frame is stored
  s16 bp_offset; // bp_offset and bp_reg encode where BP of the previous frame is stored
  unsigned sp_reg:4;
  unsigned bp_reg:4;
  unsigned type:2; // how IP of the previous frame is stored
  unsigned end:1;
} __attribute__((__packed__));

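You may notice that this scheme resembles UNWIND_MODE_BP_FRAME and UNWIND_MODE_STACK_IMMD of Apple's compact unwind descriptors. The ORC scheme uses 16-bit integers, so UNWIND_MODE_STACK_IND is probably not needed. During unwinding, most callee-saved registers other than BP are not needed, so ORC does not store them.

The linker resolves the relocations in .orc_unwind_ip and creates __start_orc_unwind_ip/__stop_orc_unwind_ip/__start_orc_unwind/__stop_orc_unwind as delimiters. Then a host utility, scripts/sorttable, sorts .orc_unwind_ip and .orc_unwind. At run time, unwind_next_frame performs the following steps to unwind a stack frame:

Binary search .orc_unwind_ip for the needed ORC entry
Obtain the SP of the previous frame from the current SP, orc->sp_reg, and orc->sp_offset
Obtain the IP of the previous frame from orc->type and other information
Obtain the BP of the previous frame from the current BP, the SP of the previous frame, orc->bp_reg, and orc->bp_offset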
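The last three steps can be sketched in user space as follows. This is not the kernel's unwind_next_frame: the SK_REG_* selectors, struct sk_orc, struct sk_regs, and sk_orc_step are hypothetical names for this illustration (the kernel has its own ORC_REG_* constants and more entry types), only call-type frames on a 64-bit target are handled, and memory is read directly without the validity checks a real unwinder needs.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical register selectors for this sketch.
enum { SK_REG_UNDEFINED, SK_REG_PREV_SP, SK_REG_BP, SK_REG_SP };

struct sk_orc {
  int16_t sp_offset, bp_offset;
  unsigned sp_reg, bp_reg;
};

struct sk_regs { uint64_t ip, sp, bp; };

static uint64_t load(uint64_t addr) {
  return *(const uint64_t *)(uintptr_t)addr;
}

// Unwind one frame whose return address was pushed by a call instruction.
// Returns 1 on success, 0 if the entry is not handled by this sketch.
int sk_orc_step(const struct sk_orc *orc, struct sk_regs *r) {
  assert(orc && r);
  uint64_t prev_sp;
  // Step 2: SP of the previous frame = base register + sp_offset.
  if (orc->sp_reg == SK_REG_SP)
    prev_sp = r->sp + (uint64_t)(int64_t)orc->sp_offset;
  else if (orc->sp_reg == SK_REG_BP)
    prev_sp = r->bp + (uint64_t)(int64_t)orc->sp_offset;
  else
    return 0;
  // Step 3: for a call frame the return address sits just below the previous SP.
  r->ip = load(prev_sp - 8);
  // Step 4: BP is either reloaded relative to the previous SP or left unchanged.
  if (orc->bp_reg == SK_REG_PREV_SP)
    r->bp = load(prev_sp + (uint64_t)(int64_t)orc->bp_offset);
  else if (orc->bp_reg != SK_REG_UNDEFINED)
    return 0;
  r->sp = prev_sp;
  return 1;
}
```

For example, right after pushq %rbp an entry would have sp_offset 16 relative to SP and bp_offset -16 relative to the previous SP, so the step reads the return address at sp+8 and the saved BP at sp.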

SFrame

.sframe is a lightweight alternative to .eh_frame that uses more compact encoding at the cost of reduced flexibility. It focuses on describing three key elements: the Canonical Frame Address (CFA), the return address, and the frame pointer. Unlike .eh_frame, it does not include personality routines, Language Specific Data Area (LSDA) information, or the ability to encode extra callee-saved registers.

An .sframe section contains a header followed by an optional auxiliary header and arrays of Function Descriptor Entries (FDEs) and Frame Row Entries (FREs).

The auxiliary header, which is currently unused, could be a replacement for the .eh_frame augmentation data. This would be useful for things like personality routines, language-specific data areas (LSDAs), and signal frames.

struct sframe_header {
  struct {
    uint16_t sfp_magic;
    uint8_t sfp_version;
    uint8_t sfp_flags;
  } sfh_preamble;
  uint8_t sfh_abi_arch;
  int8_t sfh_cfa_fixed_fp_offset;
  // Used by x86-64 to define the return address slot relative to CFA
  int8_t sfh_cfa_fixed_ra_offset;
  // Size in bytes of the auxiliary header, allowing extensibility
  uint8_t sfh_auxhdr_len;
  // Numbers of FDEs and FREs
  uint32_t sfh_num_fdes;
  uint32_t sfh_num_fres;
  // Size in bytes of FREs
  uint32_t sfh_fre_len;
  // Offsets in bytes of FDEs and FREs
  uint32_t sfh_fdeoff;
  uint32_t sfh_freoff;
} ATTRIBUTE_PACKED;

Each FDE describes a function's start address and references its associated Frame Row Entries (FREs).

struct sframe_func_desc_entry {
  int32_t sfde_func_start_address;
  uint32_t sfde_func_size;
  uint32_t sfde_func_start_fre_off;
  uint32_t sfde_func_num_fres;
  // bits 0-3 fretype: sfre_start_address type
  // bit 4 fdetype: SFRAME_FDE_TYPE_PCINC or SFRAME_FDE_TYPE_PCMASK
  // bit 5 pauth_key: (AArch64 only) the signing key for the return address
  uint8_t sfde_func_info;
  // The size of the repetitive code block for SFRAME_FDE_TYPE_PCMASK; used by .plt
  uint8_t sfde_func_rep_size;
  uint16_t sfde_func_padding2;
} ATTRIBUTE_PACKED;

template <class AddrType>
struct sframe_frame_row_entry {
  // If the fdetype is SFRAME_FDE_TYPE_PCINC, this is an offset relative to sfde_func_start_address
  AddrType sfre_start_address;
  // bit 0 fre_cfa_base_reg_id: define BASE_REG as either FP or SP
  // bits 1-4 fre_offset_count: typically 1 to 3, describing CFA, FP, and RA
  // bits 5-6 fre_offset_size: byte size of offset entries (1, 2, or 4 bytes)
  sframe_fre_info sfre_info;
} ATTRIBUTE_PACKED;

Each FRE contains variable-length stack offsets stored as trailing data with sizes of uint8_t, uint16_t, or uint32_t, determined by the fre_offset_size field. The interpretation of these offsets is architecture-specific:

x86-64:

First offset: Encodes CFA as BASE_REG + offset
Second offset (if present): Encodes FP as CFA + offset
Implicit return address: Computed as CFA + sfh_cfa_fixed_ra_offset (using header field)

AArch64:

First offset: Encodes CFA as BASE_REG + offset
Second offset: Encodes return address as CFA + offset
Third offset (if present): Encodes FP as CFA + offset
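The x86-64 rules above can be sketched as a small function that turns one FRE into stack locations. This is an illustration under assumptions, not code from binutils: struct sframe_loc, sframe_offset_bytes, and sframe_x86_64_apply are hypothetical names; the base register bit is assumed to mean 0 for FP and 1 for SP; the offset size codes 0, 1, 2 are assumed to mean 1, 2, and 4 bytes; and the trailing offsets are assumed to be already decoded into signed 32-bit values, since the FP slot lies below the CFA.

```c
#include <assert.h>
#include <stdint.h>

struct sframe_loc {
  uint64_t cfa;      // Canonical Frame Address
  uint64_t ra_slot;  // address holding the return address
  uint64_t fp_slot;  // address holding the saved FP, valid if has_fp
  int has_fp;
};

// Byte size of each trailing offset, from bits 5-6 of the FRE info byte.
unsigned sframe_offset_bytes(uint8_t info) {
  return 1u << ((info >> 5) & 3);
}

// Apply one x86-64 FRE given the current SP and FP register values.
struct sframe_loc sframe_x86_64_apply(uint8_t info, const int32_t *offsets,
                                      int8_t cfa_fixed_ra_offset,
                                      uint64_t sp, uint64_t fp) {
  unsigned base_is_sp = info & 1;      // fre_cfa_base_reg_id
  unsigned count = (info >> 1) & 0xf;  // fre_offset_count
  assert(offsets && count >= 1 && count <= 2);
  struct sframe_loc loc;
  // First offset: CFA = BASE_REG + offset.
  loc.cfa = (base_is_sp ? sp : fp) + (uint64_t)(int64_t)offsets[0];
  // Return address: CFA + sfh_cfa_fixed_ra_offset from the header.
  loc.ra_slot = loc.cfa + (uint64_t)(int64_t)cfa_fixed_ra_offset;
  // Second offset, if present: FP is saved at CFA + offset.
  loc.has_fp = count >= 2;
  loc.fp_slot = loc.has_fp ? loc.cfa + (uint64_t)(int64_t)offsets[1] : 0;
  return loc;
}
```

On AArch64 the same shape applies, except that the return address location comes from the second offset instead of the header field.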

SFrame reduces size compared to .eh_frame plus .eh_frame_hdr by:

Eliminating .eh_frame_hdr through sorted sfde_func_start_address fields
Replacing CIE pointers with direct FDE-to-FRE references
Using variable-width sfre_start_address fields (1 or 2 bytes) for small functions
Storing start addresses instead of address ranges
Start addresses in a small function use 1 or 2 byte fields, more efficient than .eh_frame initial_location, which needs at least 4 bytes (DW_EH_PE_sdata4).
Hard-coding stack offsets rather than using flexible register specifications

However, the bytecode design of .eh_frame can sometimes be more efficient than .sframe, as demonstrated on x86-64.

The format includes endianness variations that complicate toolchain support. I think we should use a little-endian format universally, regardless of the target system's native endianness, removing template <class Endian> from C++ code. On the big-endian z/Architecture, this is efficient: the LOAD REVERSED instructions are used by the bswap versions in the following program, not even requiring extra instructions.

#define WIDTH(x) \
typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \
uint##x load_inc##x(uint##x *p) { return *p+1; } \
uint##x load_bswap_inc##x(uint##x *p) { return __builtin_bswap##x(*p)+1; }; \
uint##x load_eq##x(uint##x *p) { return *p==3; } \
uint##x load_bswap_eq##x(uint##x *p) { return __builtin_bswap##x(*p)==3; }; \

WIDTH(16);
WIDTH(32);
WIDTH(64);

LLVM

In LLVM, Function::needsUnwindTableEntry decides whether CFI instructions should be emitted: hasUWTable() || !doesNotThrow() || hasPersonalityFn()

On ELF targets, if a function has uwtable or personality, or does not have nounwind (needsUnwindTableEntry), it marks that .eh_frame is needed in the module. Then, a function gets .eh_frame if needsUnwindTableEntry or -g[123] is specified.

To ensure no .eh_frame, every function needs nounwind.

uwtable is coarse-grained: it does not specify the amount of unwind information. [RFC] Asynchronous unwind tables attribute proposes to make it a gradient.

lib/CodeGen/AsmPrinter/AsmPrinter.cpp:352

Epilogue

How stack unwinding strategies for profiling will evolve is an open question. We have at least three routes:

compact unwind
Hardware support
Mainly FP-based

It is difficult for unwind information to reach 100% coverage. Linker-generated code (PLT and range extension thunks) generally is not covered by unwind information.
