On most Linux platforms (except AArch32, which uses
.ARM.exidx), DWARF .eh_frame is required for
C++ exception
handling and stack
unwinding to restore callee-saved registers. While
.eh_frame can be used for call trace recording, it is often
criticized for its runtime overhead. As an alternative, developers can
enable frame pointers, or adopt SFrame, a newer format designed
specifically for profiling. This article examines the size overhead of
enabling non-DWARF stack walking mechanisms when building several LLVM
executables.
Runtime performance analysis will be added in a future update.
Stack walking mechanisms
Here is a survey of mechanisms available for x86-64:
- Frame pointers: fast but costs a register
- DWARF .eh_frame: comprehensive but slower, supports additional features like C++ exception handling
- SFrame: a new format being developed, profiling only. .eh_frame is still needed for debugging and C++ exception handling. Check out Remarks on SFrame for details.
- x86 Last Branch Record (LBR): Skylake increased the LBR stack size to 32. Supported by AMD Zen 4 as Last Branch Record Extension Version 2 (LbrExtV2)
- Apple's Compact Unwinding Format: This has llvm, lld/MachO, and libunwind implementations. Supports x86-64 and AArch64. This can mostly replace DWARF CFI, but some entries need DWARF escape.
- OpenVMS's Compact Unwinding Format: This modifies Apple's Compact Unwinding Format.
Space overhead analysis
Frame pointer size impact
For most architectures, GCC defaults to
-fomit-frame-pointer in -O compilation to free
up a register for general use. To enable frame pointers, specify
-fno-omit-frame-pointer, which reserves the frame pointer
register (e.g., rbp on x86-64) and emits push/pop
instructions in function prologues/epilogues.
For leaf functions (those that don't call other functions), while the
frame pointer register should still be reserved for consistency, the
push/pop operations are often unnecessary. Compilers provide
-momit-leaf-frame-pointer (with target-specific defaults)
to reduce code size.
The viability of this optimization depends on the target architecture:
- On AArch64, the return address is available in the link register (X30). The immediate caller can be retrieved by inspecting X30, so -momit-leaf-frame-pointer does not compromise unwinding.
- On x86-64, after the prologue instructions execute, the return address is stored at RSP plus an offset. An unwinder needs to know the stack frame size to retrieve the return address, or it must utilize DWARF information for the leaf frame and then switch to the FP chain for parent frames.
Beyond this architectural consideration, there are additional
practical reasons to use -momit-leaf-frame-pointer on
x86-64:
- Many hand-written assembly implementations (including numerous glibc functions) don't establish frame pointers, creating gaps in the frame pointer chain anyway.
- In the prologue sequence
push rbp; mov rbp, rsp, after the first instruction executes, RBP does not yet reference the current stack frame. When shrink-wrapping optimizations are enabled, the instruction region where RBP still holds the old value becomes larger, increasing the window where the frame pointer is unreliable.
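To make the gap concrete, here is a minimal Ruby sketch of a frame pointer walk over a simulated stack. The addresses, the labels standing in for return addresses, and the walk_frame_pointers helper are all made up for illustration; a real unwinder reads the sampled thread's registers and memory instead. It shows that when RBP still holds the caller's value (a leaf function without a frame pointer, or a sample taken before mov rbp, rsp executes), the walk silently skips the immediate caller.

```ruby
# Simulated stack: address => 8-byte value (all addresses are made up).
# After `push rbp; mov rbp, rsp`, [rbp] holds the caller's RBP and
# [rbp + 8] holds the return address into the caller.
def walk_frame_pointers(memory, pc, rbp, max_depth: 64)
  trace = [pc]
  max_depth.times do
    break unless memory.key?(rbp) && memory.key?(rbp + 8)
    trace << memory[rbp + 8]
    caller_rbp = memory[rbp]
    break if caller_rbp <= rbp # the stack grows down; callers live higher
    rbp = caller_rbp
  end
  trace
end

# Call chain: _start -> main -> parse -> leaf.
# Symbols stand in for return addresses inside the named function.
memory = {
  0x7fe0 => 0,       # main's frame: saved RBP (end of chain)
  0x7fe8 => :start,  # return address into _start
  0x7fc0 => 0x7fe0,  # parse's frame: saved RBP of main
  0x7fc8 => :main,   # return address into main
  0x7fa0 => 0x7fc0,  # leaf's frame: saved RBP of parse
  0x7fa8 => :parse,  # return address into parse
}

# leaf has set up its own frame: RBP points at leaf's frame.
complete = walk_frame_pointers(memory, :leaf, 0x7fa0)
# RBP still points at parse's frame: parse is missing from the trace.
stale = walk_frame_pointers(memory, :leaf, 0x7fc0)

p complete # [:leaf, :parse, :main, :start]
p stale    # [:leaf, :main, :start]
```

Recovering the missing frame requires knowing where the return address sits relative to RSP at the sampled PC, which is exactly the information the frame pointer chain does not carry.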
Given these trade-offs, three common configurations have emerged:
- omitting FP: -fomit-frame-pointer -momit-leaf-frame-pointer (smallest overhead)
- reserving FP, but removing FP push/pop for leaf functions: -fno-omit-frame-pointer -momit-leaf-frame-pointer (frame pointer chain omitting the leaf frame)
- reserving FP: -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer (complete frame pointer chain, largest overhead)
The size impact varies significantly by program. Here's a Ruby
script section_size.rb that compares section sizes:
% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{none,nonleaf,all}/bin/{llvm-mc,opt}
For instance, llvm-mc is dominated by read-only data,
making the relative .text percentage quite small, so frame
pointer impact on the VM size is minimal. ("VM size" is a metric used by
bloaty, representing the total p_memsz size of
PT_LOAD segments, excluding alignment
padding.) As expected, llvm-mc grows larger as more
functions set up the frame pointer chain. However, opt
actually becomes smaller when -fno-omit-frame-pointer is
enabled—a counterintuitive result that warrants explanation.
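As an aside on the VM size metric mentioned above, the following Ruby sketch approximates it by summing p_memsz over PT_LOAD program headers. It is not bloaty's implementation: the vm_size helper is a hypothetical name, and the code assumes a little-endian ELF64 file and does no validation beyond the magic bytes.

```ruby
PT_LOAD = 1

# Sum p_memsz over all PT_LOAD program headers of a little-endian ELF64 image.
def vm_size(data)
  unless data[0, 6] == "\x7fELF\x02\x01".b
    raise ArgumentError, 'not a little-endian ELF64 file'
  end
  phoff     = data[0x20, 8].unpack1('Q<') # e_phoff
  phentsize = data[0x36, 2].unpack1('S<') # e_phentsize
  phnum     = data[0x38, 2].unpack1('S<') # e_phnum
  (0...phnum).sum do |i|
    phdr    = data[phoff + i * phentsize, phentsize]
    p_type  = phdr[0, 4].unpack1('L<')
    p_memsz = phdr[40, 8].unpack1('Q<')
    p_type == PT_LOAD ? p_memsz : 0
  end
end

# Usage: ruby vm_size.rb /path/to/executable
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  puts vm_size(File.binread(ARGV[0]))
end
```

Because only p_memsz is summed, gaps that the loader inserts between segments for alignment are not counted.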
Without frame pointer, the compiler uses RSP-relative addressing to
access stack objects. When using the register-indirect + disp8/disp32
addressing mode, RSP needs an extra SIB byte while RBP doesn't. For
larger functions accessing many local variables, the savings from
shorter RBP-relative encodings can outweigh the additional
push rbp; mov rbp, rsp; pop rbp instructions in the
prologues/epilogues.
% echo 'mov rax, [rsp+8]; mov rax, [rbp-8]' | /tmp/Rel/bin/llvm-mc -x86-asm-syntax=intel -output-asm-variant=1 -show-encoding
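The encoding rule behind this difference can be sketched in a few lines of Ruby. The mov_load_length helper is a hypothetical name, and it only models `mov r64, [base + disp]` with no index register: one REX.W prefix byte, one opcode byte (0x8B), one ModRM byte, an optional SIB byte, and a 0-, 1-, or 4-byte displacement.

```ruby
# Byte length of `mov r64, [base + disp]` in 64-bit mode (no index register).
def mov_load_length(base, disp)
  # A base whose low three register bits are 100 (rsp, r12) cannot be
  # named in ModRM alone and needs a SIB byte.
  needs_sib = %w[rsp r12].include?(base)
  # With mod=00, r/m=101 means RIP-relative addressing, so a plain [rbp]
  # or [r13] must carry an explicit disp8 of zero.
  disp_bytes =
    if disp.zero? && !%w[rbp r13].include?(base)
      0
    elsif disp.between?(-128, 127)
      1
    else
      4
    end
  rex = 1
  opcode = 1
  modrm = 1
  rex + opcode + modrm + (needs_sib ? 1 : 0) + disp_bytes
end

puts mov_load_length('rsp', 8)      # 5 (48 8b 44 24 08)
puts mov_load_length('rbp', -8)     # 4 (48 8b 45 f8)
puts mov_load_length('rsp', 0x100)  # 8
puts mov_load_length('rbp', -0x100) # 7
```

The one-byte difference per stack access is small, but it applies to every RSP-relative load and store in a function, while the prologue/epilogue cost is paid once.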
SFrame vs .eh_frame
Oracle is advocating for SFrame adoption in Linux distributions. The SFrame implementation is handled by the assembler and linker rather than the compiler. Let's build the latest binutils-gdb to test it.
Building test program
We'll use the clang compiler from https://github.com/llvm/llvm-project/tree/release/21.x as our test program.
There are still issues related to garbage collection (object file
format design issue), so I'll just disable
-Wl,--gc-sections.
--- i/llvm/cmake/modules/AddLLVM.cmake
configure-llvm custom-sframe -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang' -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc -DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++ -DCMAKE_C_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe" -DCMAKE_CXX_FLAGS="-B$HOME/opt/binutils/bin -Wa,--gsframe"
% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe/bin/clang
The results show that .sframe (8.87 MiB) is
approximately 10% larger than the combined size of
.eh_frame and .eh_frame_hdr (7.07 + 0.99 =
8.06 MiB). While SFrame is designed for efficiency during stack walking,
it carries a non-trivial space overhead compared to traditional DWARF
unwind information.
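The roughly 10% figure follows directly from the three section sizes quoted above; this short Ruby snippet only redoes that arithmetic with the numbers from the text.

```ruby
# Section sizes in MiB, as quoted in the text above.
eh_frame = 7.07
eh_frame_hdr = 0.99
sframe = 8.87

combined = eh_frame + eh_frame_hdr
overhead = (sframe / combined - 1) * 100

puts format('.eh_frame + .eh_frame_hdr = %.2f MiB', combined)
puts format('.sframe is %.1f%% larger', overhead)
```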
SFrame vs FP
Having examined SFrame's overhead compared to .eh_frame,
let's now compare the two primary approaches for non-hardware-assisted
stack walking.
- Frame pointer approach: Reserve FP but omit push/pop for leaf functions
  g++ -fno-omit-frame-pointer -momit-leaf-frame-pointer
- SFrame approach: Omit FP and use SFrame metadata
  g++ -fomit-frame-pointer -momit-leaf-frame-pointer -Wa,--gsframe
To conduct a fair comparison, we build LLVM executables using both approaches with both Clang and GCC compilers, configuring and building test binaries with each combination.
The results reveal interesting differences between compiler implementations:
% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe,fp-gcc,sframe-gcc}/bin/{llvm-mc,opt}
- SFrame incurs a significant VM size increase.
- GCC-built binaries are significantly larger than their Clang counterparts, probably due to more aggressive inlining or vectorization strategies.
With Clang-built binaries, the frame pointer configuration produces a
smaller opt executable (55.6 MiB) compared to the SFrame
configuration (62.5 MiB). This reinforces our earlier observation that
RBP addressing can be more compact than RSP-relative addressing for
large functions with frequent local variable accesses.
Assembly comparison reveals that functions using RBP and RSP addressing produce quite similar code.
In contrast, GCC generates significantly different assembly for omit-FP
and non-omit-FP builds, even though the total size ordering is the same:
the frame pointer version of opt (70.0 MiB) is smaller than the
SFrame version (76.2 MiB).
To see where the code differs, I compared symbol sizes between the two GCC builds:
nvim -d =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-fp-gcc/bin/llvm-mc) =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-sframe-gcc/bin/llvm-mc)
Many functions, such as
_ZN4llvm15ELFObjectWriter24executePostLayoutBindingEv, have
significantly more instructions in the keep-FP build. This suggests that
GCC's frame pointer code generation may not be as optimized as its
default omit-FP path.
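For a less manual view than a side-by-side diff, the two symbol listings can be joined by name and sorted by size difference. The Ruby sketch below is only an illustration: parse_sizes and size_growth are hypothetical helpers, the parser assumes each input line has the form "<hex size> <type> <name>" (adjust it if your llvm-nm output differs), and the sample listings use made-up symbols and sizes rather than data from the builds above.

```ruby
# Parse lines of the assumed form "<hex size> <type> <name>".
def parse_sizes(text)
  text.each_line.with_object({}) do |line, sizes|
    size, _type, name = line.split(' ', 3)
    next unless name
    sizes[name.strip] = size.to_i(16)
  end
end

# Symbols present in both builds whose size changed, largest growth first.
def size_growth(base, other)
  (base.keys & other.keys)
    .map { |name| [name, other[name] - base[name]] }
    .reject { |_, delta| delta.zero? }
    .sort_by { |_, delta| -delta }
end

# Made-up listings standing in for the omit-FP and keep-FP builds.
omit_fp = parse_sizes(<<~LIST)
  0000000000000040 T foo
  0000000000000100 T bar
LIST
keep_fp = parse_sizes(<<~LIST)
  0000000000000048 T foo
  0000000000000100 T bar
  0000000000000010 T baz
LIST

size_growth(omit_fp, keep_fp).each do |name, delta|
  puts format('%+d %s', delta, name)
end
```

Symbols that exist in only one build (for example because of different inlining decisions) are ignored here and would need separate handling.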
Runtime performance analysis
TODO
perf record overhead with EH
perf record overhead with FP
Summary
This article examines the space overhead of different stack walking mechanisms when building LLVM executables.
Frame pointer configurations: Enabling frame
pointers (-fno-omit-frame-pointer) can paradoxically reduce
x86-64 binary size when stack object accesses are frequent. This occurs
because RBP-relative addressing produces more compact encodings than
RSP-relative addressing, which requires an extra SIB byte. The savings
from shorter instructions can outweigh the prologue/epilogue
overhead.
SFrame vs .eh_frame: For the x86-64
clang executable, SFrame metadata is approximately 10%
larger than the combined size of .eh_frame and
.eh_frame_hdr. Given the significant VM size overhead and
the lack of clear advantages over established alternatives, I am
skeptical about SFrame's viability as the future of stack walking for
userspace programs. While SFrame will receive a major revision (V3) in the
coming months, it needs to achieve substantial size reductions
comparable to existing compact unwinding schemes to justify its adoption
over frame pointers. I hope interested folks can implement something
similar to macOS's compact unwind descriptors (with x86-64 support) and
OpenVMS's.
GCC's frame pointer code generation appears less optimized than its default omit-frame-pointer path, as evidenced by substantial differences in generated assembly.
Runtime performance analysis remains to be conducted to complete the trade-off evaluation.
Appendix: configure-llvm
This script specifies common options when configuring llvm-project: https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm
- -DCMAKE_CXX_ARCHIVE_CREATE="$HOME/Stable/bin/llvm-ar qc --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:: Use thin archives to reduce disk usage
- -DLLVM_TARGETS_TO_BUILD=host: Build a single target
- -DCLANG_ENABLE_OBJC_REWRITER=off -DCLANG_ENABLE_STATIC_ANALYZER=off: Disable less popular components
- -DLLVM_ENABLE_PLUGINS=off -DCLANG_PLUGIN_SUPPORT=off: Disable -Wl,--export-dynamic, preventing large .dynsym and .dynstr sections
Appendix: My SFrame build
mkdir -p out/release && cd out/release
gcc -B$HOME/opt/binutils/bin and
clang -B$HOME/opt/binutils/bin -fno-integrated-as will use
as and ld from the install directory.
Appendix: Scripts
Ruby scripts used by this post are available at https://github.com/MaskRay/unwind-info-size-analyzer/