<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>MaskRay</title>
  
  
  <link href="https://maskray.me/blog/atom.xml" rel="self"/>
  
  <link href="https://maskray.me/blog/"/>
  <updated>2026-04-14T06:07:56.880Z</updated>
  <id>https://maskray.me/blog/</id>
  
  <author>
    <name>MaskRay</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Recent lld/ELF performance improvements</title>
    <link href="https://maskray.me/blog/2026-04-12-recent-lld-elf-performance-improvements"/>
    <id>https://maskray.me/blog/2026-04-12-recent-lld-elf-performance-improvements</id>
    <published>2026-04-12T07:00:00.000Z</published>
    <updated>2026-04-14T06:07:56.880Z</updated>
    
    <content type="html"><![CDATA[<p>Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current <code>main</code> against lld 22.1, <a href="https://github.com/rui314/mold">mold</a>, and <a href="https://github.com/davidlattimore/wild">wild</a>.</p><p>Headline: a Release+Asserts clang <code>--gc-sections</code> link is 1.37x as fast as lld 22.1; Chromium debug with <code>--gdb-index</code> is 1.07x as fast. mold and wild are still ahead — the last section explains why.</p><span id="more"></span><h2 id="benchmark">Benchmark</h2><p><code>lld-0201</code> is main at 2026-02-01 (6a1803929817); <code>lld-load</code> is main plus the new <code>[ELF] Parallelize input file loading</code>. <code>mold</code> and <code>wild</code> run with <code>--no-fork</code> so the wall-clock numbers include the linker process itself.</p><p>Three reproduce tarballs, <code>--threads=8</code>, <code>hyperfine -w 1 -r 10</code>, pinned to CPU cores with <code>numactl -C</code>.</p><table><colgroup><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /></colgroup><thead><tr><th>Workload</th><th>lld-0201</th><th>lld-load</th><th>mold</th><th>wild</th></tr></thead><tbody><tr><td>clang-23 Release+Asserts, <code>--gc-sections</code></td><td>1.255 s</td><td>917.8 ms</td><td>552.6 ms</td><td>367.2 ms</td></tr><tr><td>clang-23 Debug (no <code>--gdb-index</code>)</td><td>4.582 s</td><td>4.306 s</td><td>2.464 s</td><td>1.565 s</td></tr><tr><td>clang-23 Debug (<code>--gdb-index</code>)</td><td>6.291 s</td><td>5.915 s</td><td>4.001 s</td><td>N/A</td></tr><tr><td>Chromium Debug (no <code>--gdb-index</code>)</td><td>6.140 s</td><td>5.904 s</td><td>2.665 s</td><td>2.010 s</td></tr><tr><td>Chromium Debug (<code>--gdb-index</code>)</td><td>7.857 s</td><td>7.322 s</td><td>3.786 s</td><td>N/A</td></tr></tbody></table><p>Note that 
the <code>llvm/lib/Support/Parallel.cpp</code> design keeps the main thread idle during <code>parallelFor</code>, so <code>--threads=N</code> really utilizes <code>N+1</code> threads.</p><p>wild does not yet implement <code>--gdb-index</code> — it warns and skips it, producing an output about 477 MB smaller on Chromium. For fair 4-way comparisons I also strip <code>--gdb-index</code> from the response file; the <code>no --gdb-index</code> rows above use that setup.</p><p>A few observations before diving in:</p><ul><li>The <code>--gdb-index</code> surcharge on the Chromium link is <code>+1.42 s</code> for lld (5.90 s → 7.32 s) versus <code>+1.12 s</code> for mold (2.67 s → 3.79 s). This is currently one of the biggest remaining gaps.</li><li>Excluding <code>--gdb-index</code>, mold is 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.</li><li><code>clang-23 Release+Asserts --gc-sections</code> (workload 1) has collapsed from 1.255 s to 918 ms, a 1.37x speedup over 10 weeks. 
Most of that came from the parallel <code>--gc-sections</code> mark, parallel input loading, and the task-runtime cleanup below — each contributing a multiplicative factor.</li></ul><h3 id="macos-apple-m4-notes">macOS (Apple M4) notes</h3><p>The same clang-23 Release+Asserts link, <code>--threads=8</code>, on an Apple M4 (macOS 15, system allocator for all four linkers):</p><table><thead><tr><th>Linker</th><th>Wall</th><th>User</th><th>Sys</th><th>(User+Sys)/Wall</th></tr></thead><tbody><tr><td>lld-0201</td><td>324.4 ± 1.5 ms</td><td>502.1 ms</td><td>171.7 ms</td><td>2.08x</td></tr><tr><td>lld-load</td><td>221.5 ± 1.8 ms</td><td>476.5 ms</td><td>368.8 ms</td><td>3.82x</td></tr><tr><td>mold</td><td>201.2 ± 1.7 ms</td><td>875.1 ms</td><td>220.5 ms</td><td>5.44x</td></tr><tr><td>wild</td><td>107.1 ± 0.5 ms</td><td>456.8 ms</td><td>284.6 ms</td><td>6.92x</td></tr></tbody></table><h2 id="parallelize---gc-sections-mark">Parallelize <code>--gc-sections</code> mark</h2><p>Garbage collection had been a single-threaded BFS over the <code>InputSection</code> graph. On a Release+Asserts clang link, <code>markLive</code> was ~315 ms of the 1562 ms wall time (20%).</p><p><a href="https://github.com/llvm/llvm-project/commit/6f9646a598f25efa3c4db066d2d51fb248b13526">commit 6f9646a598f2</a> adds <code>markParallel</code>, a level-synchronized BFS. Each BFS level is processed with <code>parallelFor</code>; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when <code>!TrackWhyLive &amp;&amp; partitions.size() == 1</code>. Implementation details that turned out to matter:</p><ul><li>Depth-limited inline recursion (<code>depth &lt; 3</code>) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.</li><li>Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. 
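</li></ul>
<p>This load-then-compare-exchange dedup can be sketched as follows (hypothetical standalone code, not lld's actual implementation):</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Mark a section's "visited" bit. A plain relaxed load filters out the
// common already-visited case, so most visits issue no atomic RMW at
// all; only the first visitor (or a rare racing thread) pays for the
// compare-exchange. Exactly one caller observes `true`.
bool tryMarkVisited(std::atomic<uint8_t> &flags) {
  constexpr uint8_t kVisited = 1;
  uint8_t old = flags.load(std::memory_order_relaxed);
  while (!(old & kVisited)) {
    if (flags.compare_exchange_weak(old, old | kVisited,
                                    std::memory_order_relaxed))
      return true; // this thread set the bit first
    // A failed CAS reloads `old`; the loop re-checks the bit.
  }
  return false; // already visited
}
```

<ul><li>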
The vast majority of sections are visited once, so the load almost always wins.</li></ul><p>On the Release+Asserts clang link, <code>markLive</code> dropped from 315 ms to 82 ms at <code>--threads=8</code> (from 199 ms to 50 ms at <code>--threads=16</code>); total wall time improved by 1.16x–1.18x.</p><p>Two prerequisite cleanups were needed for correctness:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/6a874161621ec52b8efa125790e3e8e72bb9167a">commit 6a874161621e</a> moved <code>Symbol::used</code> into the existing <code>std::atomic&lt;uint16_t&gt; flags</code>. The bitfield was previously racing with other mark threads.</li><li><a href="https://github.com/llvm/llvm-project/commit/2118499a898b514f70fb1754ad8713a4267f7bd3">commit 2118499a898b</a> decoupled <code>SharedFile::isNeeded</code> from the mark walk. <code>--as-needed</code> used to flip <code>isNeeded</code> inside <code>resolveReloc</code>, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.</li></ul><h2 id="parallelize-input-file-loading">Parallelize input file loading</h2><p>Historically, <code>LinkerDriver::createFiles</code> walked the command line and called <code>addFile</code> serially. <code>addFile</code> maps the file (<code>MemoryBuffer::getFile</code>), sniffs the magic, and constructs an <code>ObjFile</code>, <code>SharedFile</code>, <code>BitcodeFile</code>, or <code>ArchiveFile</code>. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.</p><p>The pending patch will rewrite <code>addFile</code> to record a <code>LoadJob</code> for each non-script input together with a snapshot of the driver's state machine (<code>inWholeArchive</code>, <code>inLib</code>, <code>asNeeded</code>, <code>withLOption</code>, <code>groupId</code>). After <code>createFiles</code> finishes, <code>loadFiles</code> fans the jobs out to worker threads. 
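</p>
<p>The per-job state snapshot and the fan-out can be sketched like this (invented names and heavily simplified logic, not the actual lld patch): each job carries the driver flags in effect when its input appeared on the command line, workers claim jobs with an atomic counter, and per-job outputs are merged back in command-line order so the result is independent of thread timing.</p>

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <vector>

struct LoadJob {
  std::string path;
  bool inWholeArchive = false, inLib = false, asNeeded = false;
  int groupId = 0;
  std::vector<std::string> loaded; // per-job output buffer
};

void loadFiles(std::vector<LoadJob> &jobs, unsigned threads,
               std::vector<std::string> &out) {
  std::atomic<size_t> next{0};
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t)
    pool.emplace_back([&] {
      for (size_t i; (i = next.fetch_add(1)) < jobs.size();)
        jobs[i].loaded.push_back(jobs[i].path); // stand-in for real loading
    });
  for (auto &th : pool)
    th.join();
  // Deterministic merge: command-line order, not completion order.
  for (auto &j : jobs)
    out.insert(out.end(), j.loaded.begin(), j.loaded.end());
}
```

<p>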
Linker scripts stay on the main thread because <code>INPUT()</code> and <code>GROUP()</code> recursively call back into <code>addFile</code>.</p><p>A few subtleties made this harder than it sounds:</p><ul><li><code>BitcodeFile</code> and fatLTO construction call <code>ctx.saver</code> / <code>ctx.uniqueSaver</code>, both of which are non-thread-safe <code>StringSaver</code> / <code>UniqueStringSaver</code>. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.</li><li>Thin-archive member buffers used to be appended to <code>ctx.memoryBuffers</code> directly. To keep the output deterministic across <code>--threads</code> values, each job now accumulates into a per-job <code>SmallVector</code> which is merged into <code>ctx.memoryBuffers</code> in command-line order.</li><li><code>InputFile::groupId</code> used to be assigned inside the <code>InputFile</code> constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; <a href="https://github.com/llvm/llvm-project/commit/b6c8cba516daabced0105114a7bcc745bc52faae">b6c8cba516daabced0105114a7bcc745bc52faae</a> hoists <code>++nextGroupId</code> into the serial driver loop and stores the value into each file after construction.</li></ul><p>The output is byte-identical to the old lld and deterministic across <code>--threads</code> values, which I verified with <code>diff</code> across <code>--threads=&#123;1,2,4,8&#125;</code> on Chromium.</p><p>A <code>--time-trace</code> breakdown is useful to set expectations. On Chromium, the serial portion of <code>createFiles</code> accounts for only ~81 ms of the 5.9 s wall, and <code>loadFiles</code> (after this patch) runs in ~103 ms in parallel. Serial <code>readFile</code>/mmap is not the bottleneck. 
What moves the needle is overlapping the per-file constructor work — magic sniffing, archive member materialization, bitcode initialization — with everything else that now kicks off on the main thread while workers chew through the job list.</p><h2 id="extending-parallel-relocation-scanning">Extending parallel relocation scanning</h2><p>Relocation scanning has been parallel since LLVM 17, but three cases had opted out via <code>bool serial</code>:</p><ol type="1"><li><code>-z nocombreloc</code>, because <code>.rela.dyn</code> merged relative and non-relative relocations and needed deterministic ordering.</li><li>MIPS, because <code>MipsGotSection</code> is mutated during scanning.</li><li>PPC64, because <code>ctx.ppc64noTocRelax</code> (a <code>DenseSet</code> of <code>(Symbol*, offset)</code> pairs) was written without a lock.</li></ol><p><a href="https://github.com/llvm/llvm-project/commit/076226f378df115622d0d959e975ee7c7fb3c051">commit 076226f378df</a> and <a href="https://github.com/llvm/llvm-project/commit/dc4df5da886e09d36577b3302952bc91b5e7e154">commit dc4df5da886e</a> separate relative and non-relative dynamic relocations unconditionally and always build <code>.rela.dyn</code> with <code>combreloc=true</code>; the only remaining effect of <code>-z nocombreloc</code> is suppressing <code>DT_RELACOUNT</code>. <a href="https://github.com/llvm/llvm-project/commit/2f7bd4fa97232dfab7f2347c745005eb9e2ffd2d">commit 2f7bd4fa9723</a> then protects <code>ctx.ppc64noTocRelax</code> with the already-existing <code>ctx.relocMutex</code>, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially.</p><h2 id="target-specific-relocation-scanning">Target-specific relocation scanning</h2><p>Relocation scanning used to go through a generic loop in <code>Relocations.cpp</code> that called <code>Target-&gt;getRelExpr</code> through a virtual call for every relocation — once to classify the expression kind (PC-relative, PLT, TLS, etc.) 
and again from the TLS-optimization dispatch. On any realistic link that is a hot inner loop running over tens of millions of relocations, and the virtual call plus its post-dispatch switch are a real fraction of the cost.</p><p>The fix is to move the whole per-section scan loop into target-specific code, so each <code>Target::scanSection</code> / <code>scanSectionImpl</code> pair can inline its own <code>getRelExpr</code>, handle TLS optimization in-place, and specialize for the two or three relocation kinds that dominate on that architecture. Rolled out across most backends in early 2026:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/4b887533389c78e3e678b3af85d1dc8e3bf59e83">4b887533389c</a> x86 (i386 / x86-64). On lld's own object files, <code>R_X86_64_PC32</code> and <code>R_X86_64_PLT32</code> make up ~95% of relocations and now hit an inlined hot path.</li><li><a href="https://github.com/llvm/llvm-project/commit/371e0e2082e9">371e0e2082e9</a> AArch64, <a href="https://github.com/llvm/llvm-project/commit/4ea72c1e8cbd">4ea72c1e8cbd</a> RISC-V, <a href="https://github.com/llvm/llvm-project/commit/cd01e6526af6">cd01e6526af6</a> LoongArch, <a href="https://github.com/llvm/llvm-project/commit/c04b00de7508">c04b00de7508</a> ARM, <a href="https://github.com/llvm/llvm-project/commit/6d9169553029">6d9169553029</a> Hexagon, <a href="https://github.com/llvm/llvm-project/commit/aec1c984266c">aec1c984266c</a> SystemZ, <a href="https://github.com/llvm/llvm-project/commit/5e87f8147d68">5e87f8147d68</a> PPC32, <a href="https://github.com/llvm/llvm-project/commit/aecc4997bf12">aecc4997bf12</a> PPC64.</li></ul><p>Besides devirtualization, inlining TLS relocation handling into <code>scanSectionImpl</code> let the TLS-optimization-specific expression kinds be replaced with general ones: <code>R_RELAX_TLS_GD_TO_LE</code> / <code>R_RELAX_TLS_LD_TO_LE</code> / <code>R_RELAX_TLS_IE_TO_LE</code> fold into <code>R_TPREL</code>, <code>R_RELAX_TLS_GD_TO_IE</code> folds into <code>R_GOT_PC</code>, 
and <code>getTlsGdRelaxSkip</code> goes away. What remains in the shared dispatch path — <code>getRelExpr</code> called from <code>relocateNonAlloc</code> and <code>relocateEH</code> — is a much smaller set.</p><p>Average <code>Scan relocations</code> wall time on a clang-14 link (<code>--threads=8</code>, x86-64, 50 runs, measured via <code>--time-trace</code>) drops from 110 ms to 102 ms, a ~7% improvement from the x86 commit alone.</p><h2 id="faster-getsectionpiece">Faster <code>getSectionPiece</code></h2><p>Merge sections (<code>SHF_MERGE</code>) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in <code>MergeInputSection::pieces</code>, called from <code>MarkLive</code>, <code>includeInSymtab</code>, and <code>getRelocTargetVA</code>.</p><p><a href="https://github.com/llvm/llvm-project/commit/42cc454777274a06933abcd098ec3281158717f9">commit 42cc45477727</a> changes this in two ways:</p><ol type="1"><li>For non-string fixed-size merge sections, <code>getSectionPiece</code> uses <code>offset / entsize</code> directly.</li><li>For non-section <code>Defined</code> symbols pointing into merge sections, the piece index is pre-resolved during <code>splitSections</code> and packed into <code>Defined::value</code> as <code>((pieceIdx + 1) &lt;&lt; 32) | intraPieceOffset</code>.</li></ol><p>The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64 where the assembler emits local labels for <code>.L</code> references into mergeable strings. 
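</p>
<p>The <code>Defined::value</code> packing in item 2 can be illustrated with a small self-contained sketch (helper names invented; only the bit layout follows the commit as described above):</p>

```cpp
#include <cassert>
#include <cstdint>

// A Defined symbol pointing into a merge section stores its pre-resolved
// piece index, biased by 1 so that a zero upper half means "no piece
// resolved", in the upper 32 bits of `value`, and the intra-piece offset
// in the lower 32 bits: ((pieceIdx + 1) << 32) | intraPieceOffset.
inline uint64_t packPiece(uint32_t pieceIdx, uint32_t intraPieceOffset) {
  return (uint64_t(pieceIdx + 1) << 32) | intraPieceOffset;
}
inline bool hasPiece(uint64_t value) { return (value >> 32) != 0; }
inline uint32_t pieceIdx(uint64_t value) { return uint32_t(value >> 32) - 1; }
inline uint32_t intraOffset(uint64_t value) { return uint32_t(value); }
```

<p>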
The clang-relassert link with <code>--gc-sections</code> is 1.05x as fast.</p><h2 id="optimizing-the-underlying-llvmlibsupportparallel.cpp">Optimizing the underlying <code>llvm/lib/Support/Parallel.cpp</code></h2><p>All of the wins above rely on <code>llvm/lib/Support/Parallel.cpp</code>, the tiny work-stealing-ish task runtime shared by lld, dsymutil, and a handful of debug-info tools. Four changes in that file mattered:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/c7b5f7c635e2534a9b2b2b204998b0bc39921b7e">commit c7b5f7c635e2</a> — <code>parallelFor</code> used to pre-split work into up to <code>MaxTasksPerGroup</code> (1024) tasks and spawn each through the executor's mutex + condvar. It now spawns only <code>ThreadCount</code> workers; each grabs the next chunk via an atomic <code>fetch_add</code>. On a clang-14 link (<code>--threads=8</code>), futex calls dropped from ~31K to ~1.4K (glibc release+asserts); wall time 927 ms → 879 ms. This is the reason the parallel mark and parallel scan numbers are worth quoting at all — on the old runtime, spawn overhead was a real fraction of the work being parallelized.</li><li><a href="https://github.com/llvm/llvm-project/commit/9085f74018a4f465afa84815d64af850f09b733f">commit 9085f74018a4</a> — <code>TaskGroup::spawn()</code> replaced the mutex-based <code>Latch::inc()</code> with an atomic <code>fetch_add</code> and passes the <code>Latch&amp;</code> through <code>Executor::add()</code> so the worker calls <code>dec()</code> directly. This eliminates one <code>std::function</code> construction per spawn.</li><li><a href="https://github.com/llvm/llvm-project/commit/5b1be759295c4a2f357fbad852e04c74fc012dc1">commit 5b1be759295c</a> — removed the <code>Executor</code> abstract base class. 
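</li></ul>
<p>The chunk-claiming loop from the first commit above can be sketched as follows (simplified standalone code, not the actual <code>Parallel.cpp</code>):</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Instead of pre-splitting [begin, end) into up to 1024 tasks, spawn one
// worker per thread and let each claim the next fixed-size chunk with an
// atomic fetch_add. Spawn overhead is O(threads), not O(tasks), and idle
// workers never touch a mutex or condvar on the hot path.
void parallelForChunked(size_t begin, size_t end, size_t chunk,
                        unsigned threads,
                        const std::function<void(size_t)> &fn) {
  std::atomic<size_t> next{begin};
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t)
    pool.emplace_back([&] {
      for (;;) {
        size_t lo = next.fetch_add(chunk, std::memory_order_relaxed);
        if (lo >= end)
          return;
        size_t hi = lo + chunk < end ? lo + chunk : end;
        for (size_t i = lo; i < hi; ++i)
          fn(i);
      }
    });
  for (auto &th : pool)
    th.join();
}
```

<ul><li>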
<code>ThreadPoolExecutor</code> was always the only implementation; <code>add()</code> and <code>getThreadCount()</code> are now direct calls instead of virtual dispatches.</li><li><a href="https://github.com/llvm/llvm-project/commit/8daaa26efdda3802f73367d844b267bda3f84cbe">commit 8daaa26efdda</a> — enables nested parallel <code>TaskGroup</code> via work-stealing. Historically, nested groups ran serially to avoid deadlock (the thread that was supposed to run a nested task might be blocked in the outer group's <code>sync()</code>). Worker threads now actively execute tasks from the queue while waiting, instead of just blocking. Root-level groups on the main thread keep the efficient blocking <code>Latch::sync()</code>, so the common non-nested case pays nothing. In lld this lets <code>SyntheticSection::writeTo</code> calls with internal parallelism (<code>GdbIndexSection</code>, <code>MergeNoTailSection</code>) parallelize automatically when called from inside <code>OutputSection::writeTo</code>, instead of degenerating to serial execution on a worker thread — which was the exact situation <a href="https://reviews.llvm.org/D131247">D131247</a> had worked around by threading a root <code>TaskGroup</code> all the way down.</li></ul><h2 id="small-wins-worth-mentioning">Small wins worth mentioning</h2><ul><li><a href="https://github.com/llvm/llvm-project/commit/036b755daedb">036b755daedb</a> parallelizes <code>demoteAndCopyLocalSymbols</code>. Each file collects local <code>Symbol*</code> pointers in a per-file vector via <code>parallelFor</code>, which are merged into the symbol table serially. Linking clang-14 (<code>--no-gc-sections</code>) with its 208K <code>.symtab</code> entries is 1.04x as fast.</li></ul><h2 id="where-lld-still-loses-time">Where lld still loses time</h2><p>To locate the gap I ran <code>lld --time-trace</code>, <code>mold --perf</code>, and <code>wild --time</code> on the Chromium <code>--gdb-index</code> link (<code>--threads=8</code>). 
Grouped into comparable phases:</p><table><colgroup><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /><col style="width: 20%" /></colgroup><thead><tr><th>Work scope</th><th>lld-0201</th><th>lld-load</th><th>mold</th><th>wild</th></tr></thead><tbody><tr><td>mmap + parse sections + merge strings + symbol resolve</td><td>376 ms</td><td>292 ms</td><td>230 ms</td><td>113 ms</td></tr><tr><td><code>--gc-sections</code> mark</td><td>268 ms</td><td>79 ms</td><td>30 ms</td><td>— *</td></tr><tr><td>Scan relocations</td><td>106 ms</td><td>97 ms</td><td>60 ms</td><td>— *</td></tr><tr><td>Assign / finalize / symtab</td><td>76 ms</td><td>100 ms</td><td>27 ms</td><td>84 ms</td></tr><tr><td>Write sections</td><td>87 ms</td><td>87 ms</td><td>90 ms</td><td>110 ms</td></tr><tr><td><strong>Wall (hyperfine)</strong></td><td><strong>1255 ms</strong></td><td><strong>918 ms</strong></td><td><strong>553 ms</strong></td><td><strong>367 ms</strong></td></tr></tbody></table><p>* wild fuses <code>--gc-sections</code> marking and relocation-driven live-section propagation into one <code>Find required sections</code> pass (60 ms), so these two rows are effectively merged.</p><p>A subtlety on wild's parse number: wild's <code>Load inputs into symbol DB</code> phase by itself is only 23 ms, but it does only <code>mmap</code> + <code>.symtab</code> scan + global-name hash bucketing. Section-header parsing, mergeable-string splitting, COMDAT handling, and symbol resolution are deferred to later wild phases. 
The 113 ms row above sums those (<code>Load inputs into symbol DB</code> 23 + <code>Resolve symbols</code> 12 + <code>Section resolution</code> 21 + <code>Merge strings</code> 57) so it covers the same work lld calls <code>Parse input files</code>.</p><p>Meaningful gaps, in order of absolute impact:</p><p><strong>Parse: lld-load 292 ms vs wild 113 ms ≈ 2.6x.</strong> The biggest remaining cross-linker gap on this workload, and the same pattern holds on the larger workloads. The phase is already parallel; the gap is a constant factor in the per-object parse path (reading section headers, interning strings, splitting CIEs/FDEs, merging globals into the symbol table). On clang-relassert the 179 ms parse gap alone accounts for ~33% of the 551 ms wall-clock gap between lld-load and wild.</p><p><strong>Assign / finalize / symtab: 100 ms vs mold 27 ms ≈ 3.7x.</strong> <code>finalizeAddressDependentContent</code>, <code>assignAddresses</code>, <code>finalizeSynthetic</code>, <code>Add symbols to symtabs</code>, and <code>Finalize .eh_frame</code> together cost ~100 ms on this workload; mold's equivalents (<code>compute_section_sizes</code>, <code>compute_symtab_size</code>, <code>create_output_sections</code>, <code>set_osec_offsets</code>) total 27 ms. This gap grows linearly with the number of <code>.symtab</code> entries — on clang-debug it's 127 ms lld vs 27 ms mold, on Chromium 570 ms vs ~80 ms. I have a local branch that turns <code>SymbolTableBaseSection::finalizeContents</code> into a prefix-sum-driven parallel fill and replaces the <code>stable_partition</code> + <code>MapVector</code> shuffle with per-file <code>lateLocals</code> buffers. 1640 ELF tests pass; not posted yet.</p><p><strong><code>markLive</code>: 79 ms, 3.4x faster than the Feb 1 baseline (268 ms).</strong> This is an apples-to-oranges comparison: lld supports <code>__start_</code>/<code>__stop_</code> edges, <code>SHF_LINK_ORDER</code> dependencies, linker script <code>KEEP</code>, and other features. 
lld correctly handles <code>--gc-sections --as-needed</code> with <code>Symbol::used</code> (tests <code>gc-sections-shared.s</code>, <code>weak-shared-gc.s</code>, <code>as-needed-not-in-regular.s</code>):</p><ul><li><strong>mold over-approximates <code>DT_NEEDED</code> on two axes</strong>: it emits <code>DT_NEEDED</code> for DSOs referenced only via weak relocs, and for DSOs referenced only from GC'd sections. It also retains undefined symbols that are only reachable from dead sections in <code>.dynsym</code>.</li><li><strong>wild handles weak refs correctly but not dead-section refs</strong>: weak-only references do not force <code>DT_NEEDED</code> (matching lld), but DSOs referenced only from GC'd sections still get <code>DT_NEEDED</code> entries. wild does drop the corresponding undefined symbols from <code>.dynsym</code>, so its <code>DT_NEEDED</code> decision and its symtab-inclusion decision diverge slightly.</li><li><strong>lld is strictest on all three axes.</strong></li></ul><p><strong>Scan relocations: 97 ms vs 60 ms.</strong> A clean 1.6x ratio, but small in absolute terms. Target-specific scanning (the <code>Add target-specific relocation scanning for …</code> commits) removed some dispatch overhead; what remains is <code>InputSectionBase::relocations</code> overhead. wild folds relocation-driven liveness into <code>Find required sections</code>, which is why there's no separate wild row.</p><p>Interestingly, <strong>writing section content is not a gap</strong> (87–110 ms across all four). The earlier assumption that <code>.debug_*</code> section writes were an lld weakness didn't survive measurement.</p><p>One cost that only shows up on debug-info-heavy workloads is <code>--gdb-index</code> construction, which lld does in ~1.3 s vs mold's ~0.9 s on Chromium. The work is embarrassingly parallel per input, but lld funnels string interning through a sharded <code>DenseMap</code>; mold uses a lock-free <code>ConcurrentMap</code> sized by HyperLogLog. 
wild does not yet implement <code>--gdb-index</code>.</p><p>wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half, and its parse phase is 4–8x faster than either of the C++ linkers across all three workloads. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Since the LLVM 22 branch was cut, I&#39;ve landed patches that
parallelize more link phases and cut task-runtime overhead. This post
compares current &lt;code&gt;main&lt;/code&gt; against lld 22.1, &lt;a
href=&quot;https://github.com/rui314/mold&quot;&gt;mold&lt;/a&gt;, and &lt;a
href=&quot;https://github.com/davidlattimore/wild&quot;&gt;wild&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Headline: a Release+Asserts clang &lt;code&gt;--gc-sections&lt;/code&gt; link is
1.37x as fast as lld 22.1; Chromium debug with &lt;code&gt;--gdb-index&lt;/code&gt;
is 1.07x as fast. mold and wild are still ahead — the last section
explains why.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="performance" scheme="https://maskray.me/blog/tags/performance/"/>
    
  </entry>
  
  <entry>
    <title>Bit-field layout</title>
    <link href="https://maskray.me/blog/2026-02-22-bit-field-layout"/>
    <id>https://maskray.me/blog/2026-02-22-bit-field-layout</id>
    <published>2026-02-22T08:00:00.000Z</published>
    <updated>2026-02-24T05:30:06.513Z</updated>
    
    <content type="html"><![CDATA[<p>The C and C++ standards leave nearly every detail to the implementation. C23 §6.7.3.2:</p><blockquote><p>An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.</p></blockquote><p>C++ is also terse — <code>[class.bit]p1</code>:</p><blockquote><p>Allocation of bit-fields within a class object is implementation-defined. Alignment of bit-fields is implementation-defined. Bit-fields are packed into some addressable allocation unit.</p></blockquote><p>The actual rules come from the platform ABI:</p><ul><li><strong>Itanium ABI</strong> — used on Linux, macOS, BSD, and most non-Windows platforms. The Itanium C++ ABI (<a href="https://itanium-cxx-abi.github.io/cxx-abi/abi.html#class-types">section 2.4</a>) defers bit-field placement to "the base C ABI" but adds its own constraints (notably: bit-fields are never placed in the tail padding of a base class).</li><li><strong>System V ABI Processor Supplement</strong> — the x86-64 psABI says little about bit-fields, while the <a href="https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#id115">AArch64 AAPCS</a> has a more detailed description.</li><li><strong>Microsoft ABI</strong> — used on Windows (MSVC). In GCC and Clang, structs with the <code>ms_struct</code> attribute also mimic this ABI.</li></ul><p>Clang implements both ABIs in <code>clang/lib/AST/RecordLayoutBuilder.cpp</code>. 
It processes bit-fields in <strong>two distinct phases</strong>:</p><ol type="1"><li><strong>Layout</strong> (storage units) — assign a bit offset to every bit-field. This is ABI-specified and determines <code>sizeof</code> and <code>alignof</code>.</li><li><strong>Codegen</strong> (access units) — choose what LLVM IR loads and stores to emit. This is a compiler optimization that affects generated code but not the ABI.</li></ol><p>Understanding these separately is the key to understanding bit-fields. This article focuses on Itanium (the default on most platforms), with a section on how the Microsoft ABI differs.</p><h2 id="phase-1-storage-units">Phase 1: Storage Units</h2><p>In <code>clang/lib/AST/RecordLayoutBuilder.cpp</code>, <code>ItaniumRecordLayoutBuilder::LayoutFields</code> lays out the fields of a <code>RecordDecl</code>. For each bit-field, it calls <code>LayoutBitField</code> to determine the storage unit and bit offset.</p><p>A <strong>storage unit</strong> is a region of <code>sizeof(T)</code> bytes, by default aligned to <code>alignof(T)</code>. For an <code>int</code> bit-field, that's a 4-byte region at a 4-byte-aligned offset. 
The alignment can be reduced by the <code>packed</code> attribute and <code>#pragma pack</code>.</p><ul><li><code>StorageUnitSize = sizeof(T) * 8</code> — the unit's size in bits</li><li><code>FieldAlign = alignof(T)</code> in bits — the unit's alignment (before modifiers)</li><li><code>FieldOffset</code> — the first bit after the last bit-field</li></ul><h3 id="itaniums-core-rule">Itanium's Core Rule</h3><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> (FieldSize == <span class="number">0</span> ||</span><br><span class="line">    (AllowPadding &amp;&amp;</span><br><span class="line">     (FieldOffset &amp; (FieldAlign<span class="number">-1</span>)) + FieldSize &gt; StorageUnitSize))</span><br><span class="line">  FieldOffset = <span class="built_in">alignTo</span>(FieldOffset, FieldAlign);</span><br></pre></td></tr></table></figure><p>Compute where <code>FieldOffset</code> falls within its aligned storage unit. 
If the remaining space is less than <code>FieldSize</code>, round up to the next aligned boundary. Otherwise, pack the bit-field at the current position.</p><h3 id="declared-type-matters">Declared Type Matters</h3><p>Consider two structs that store the same total number of bits (7 + 7 + 2 = 16) but use different declared types:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">U8</span>  &#123;</span> <span class="type">uint8_t</span>  a:<span class="number">7</span>, b:<span class="number">7</span>, c:<span class="number">2</span>; &#125;;   <span class="comment">// sizeof = 3</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">U16</span> &#123;</span> <span class="type">uint16_t</span> a:<span class="number">7</span>, b:<span class="number">7</span>, c:<span class="number">2</span>; &#125;;   <span class="comment">// sizeof = 2</span></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S1</span> &#123;</span> <span class="type">int</span> a:<span class="number">14</span>; <span class="type">int</span> b:<span class="number">10</span>; <span class="type">int</span> c:<span class="number">30</span>; &#125;;   <span class="comment">// sizeof = 8</span></span><br></pre></td></tr></table></figure><p><strong>Walk-through for <code>U8</code></strong> (all fields have StorageUnitSize = 8, FieldAlign = 8):</p><ul><li><code>a</code> at bit 0. Position = 0, 0 + 7 = 7 &lt;= 8. Fits. <strong>Offset = 0.</strong></li><li><code>b</code> at bit 7. Position = 7, 7 + 7 = 14 &gt; 8. Doesn't fit. New unit at bit 8. <strong>Offset = 8.</strong></li><li><code>c</code> at bit 15. 
Position = 15 - 8 = 7, 7 + 2 = 9 &gt; 8.Doesn't fit. New unit at bit 16. <strong>Offset = 16.</strong></li></ul><p>Three 1-byte storage units. <code>sizeof(U8) = 3</code>. Eightpadding bits wasted.</p><p><strong>Walk-through for <code>U16</code></strong> (all fields haveStorageUnitSize = 16, FieldAlign = 16):</p><ul><li><code>a</code> at bit 0. Position = 0, 0 + 7 = 7 &lt;= 16. Fits.<strong>Offset = 0.</strong></li><li><code>b</code> at bit 7. Position = 7, 7 + 7 = 14 &lt;= 16. Fits.<strong>Offset = 7.</strong></li><li><code>c</code> at bit 14. Position = 14, 14 + 2 = 16 &lt;= 16. Fits.<strong>Offset = 14.</strong></li></ul><p>One 2-byte storage unit. <code>sizeof(U16) = 2</code>. No waste.</p><p><strong>Walk-through for <code>S1</code></strong> (all fields haveStorageUnitSize = 32, FieldAlign = 32):</p><ul><li><code>a</code> at bit 0. Position = 0, 14 fits in 32. <strong>Offset= 0.</strong></li><li><code>b</code> at bit 14. Position = 14, 14 + 10 = 24 &lt;= 32.Fits. <strong>Offset = 14.</strong> Bits 24–31 are padding (unfilledtail of the first storage unit).</li><li><code>c</code> at bit 24. Position = 24, 24 + 30 = 54 &gt; 32.Doesn't fit. New unit at bit 32. <strong>Offset = 32.</strong> Bits62–63 are padding (unfilled tail of the second storage unit).</li></ul><p><code>sizeof(S1) = 8</code>, <code>alignof(S1) = 4</code>.</p><p>Note: Phase 1 uses two <code>int</code> storage units, but Phase 2 isfree to merge <code>a</code>, <code>b</code>, and <code>c</code> into asingle <code>i64</code> access unit (since there are no non-bit-fieldbarriers and 8 bytes fits in a register). 
On x86_64, the LLVM type endsup as <code>&#123; i64 &#125;</code>.</p><h3 id="mixed-types">Mixed Types</h3><p>When bit-fields have different declared types, the storage unit sizechanges:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S2</span> &#123;</span> <span class="type">int</span> a:<span class="number">24</span>; <span class="type">short</span> b:<span class="number">8</span>; &#125;;   <span class="comment">// sizeof = 4</span></span><br></pre></td></tr></table></figure><ul><li><code>a</code> is <code>int</code> (StorageUnitSize = 32). Placed atbit 0.</li><li><code>b</code> is <code>short</code> (StorageUnitSize = 16,FieldAlign = 16). Current offset = 24. Position within a 16-bit alignedunit: 24 % 16 = 8. 8 + 8 = 16 &lt;= 16. Fits. <strong>Offset =24.</strong></li></ul><p><code>sizeof(S2) = 4</code>. The <code>short</code> bit-fieldoverlaps into the <code>int</code>'s storage unit. Under Itanium,storage units of different types <em>can</em> share bytes.</p><p>The <code>short</code> can also reuse space left by a smallerbit-field:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S2b</span> &#123;</span> <span class="type">int</span> a:<span class="number">16</span>; <span class="type">short</span> b:<span class="number">8</span>; &#125;;   <span class="comment">// sizeof = 4</span></span><br></pre></td></tr></table></figure><ul><li><code>a</code> is <code>int</code> (StorageUnitSize = 32). Placed atbit 0.</li><li><code>b</code> is <code>short</code> (StorageUnitSize = 16,FieldAlign = 16). Current offset = 16. Position within a 16-bit alignedunit: 16 % 16 = 0. 0 + 8 = 8 &lt;= 16. Fits. 
<strong>Offset =16.</strong></li></ul><p>Here <code>b</code>'s 16-bit storage unit (bits 16–31) falls entirelywithin <code>a</code>'s 32-bit storage unit.</p><blockquote><p>Under Microsoft ABI, <code>sizeof</code> is 8: the type size changefrom <code>int</code> to <code>short</code> forces a new storageunit.</p></blockquote><p>This overlapping extends to non-bit-field members too. Anon-bit-field can be allocated within the unfilled bytes of a precedingbit-field's storage unit:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S2c</span> &#123;</span> <span class="type">uint16_t</span> first:<span class="number">8</span>; <span class="type">uint8_t</span> second; &#125;;   <span class="comment">// sizeof = 2</span></span><br></pre></td></tr></table></figure><ul><li><code>first</code> is <code>uint16_t:8</code>. Placed at bit 0. Uses8 bits of a 16-bit storage unit (bytes 0–1).</li><li><code>second</code> is a non-bit-field <code>uint8_t</code>. Thebit-field state resets, but DataSize is only 1 byte. 
<code>second</code>(alignment 1) goes at <strong>byte 1</strong> (bit 8) — inside<code>first</code>'s storage unit.</li></ul><p>Note that this overlapping means a write to <code>first</code> viaits access unit could touch byte 1 where <code>second</code> lives.Phase 2 must ensure the access units don't clobber each other (see <ahref="#itanium-merging-algorithm">Hard constraints</a>).</p><blockquote><p>Under Microsoft ABI, <code>sizeof</code> is 4: <code>first</code>gets a full <code>uint16_t</code> unit (2 bytes), and<code>second</code> starts at byte 2 instead of byte 1.</p></blockquote><h3 id="non-bit-field-after-bit-field">Non-bit-field AfterBit-field</h3><p>When a non-bit-field field cannot fit within the remaining bytes, itresets the bit-field state and unfilled bits become padding:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S3</span> &#123;</span> <span class="type">int</span> a:<span class="number">10</span>; <span class="type">int</span> b:<span class="number">6</span>; <span class="type">char</span> c; <span class="type">int</span> d:<span class="number">6</span>; &#125;;   <span class="comment">// sizeof = 4</span></span><br></pre></td></tr></table></figure><ul><li><code>a</code> at bit 0, <code>b</code> at bit 10 — both fit in thefirst <code>int</code> storage unit. <code>a + b</code> occupy 16 bits =2 bytes, leaving 16 bits unused in the 32-bit storage unit.</li><li><code>c</code> is not a bit-field. It resets<code>UnfilledBitsInLastUnit</code> to 0. <code>c</code> (a<code>char</code>, alignment 1) goes at <strong>byte 2</strong> (bit16). A subsequent bit-field could have used bits 16–31, but thenon-bit-field <code>c</code> claims byte 2.</li><li><code>d</code> is a new <code>int</code> bit-field. Current bitoffset = 24 (byte 3). Position = 24 % 32 = 24. 
24 + 6 = 30 &lt;= 32.Fits. <strong>Offset = 24.</strong></li></ul><p><code>sizeof(S3) = 4</code>.</p><blockquote><p>Under Microsoft ABI, <code>sizeof</code> is 12:<code>a</code>+<code>b</code> get a full <code>int</code> unit (4bytes), <code>c</code> starts at byte 4, and <code>d</code> gets a new<code>int</code> unit at byte 8.</p></blockquote><h3 id="bit-field-after-non-bit-field">Bit-field AfterNon-bit-field</h3><p>The overlap works in the other direction too. When a bit-fieldfollows a non-bit-field, its storage unit can encompass the precedingbytes:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">NB</span> &#123;</span> <span class="type">char</span> a; <span class="type">int</span> b:<span class="number">4</span>; &#125;;   <span class="comment">// sizeof = 4</span></span><br></pre></td></tr></table></figure><ul><li><code>a</code> is a <code>char</code> at byte 0. DataSize = 1byte.</li><li><code>b</code> is <code>int:4</code>. FieldOffset = 8, FieldAlign =32, StorageUnitSize = 32. Position: <code>8 &amp; 31 = 8</code>.<code>8 + 4 = 12 ≤ 32</code>. Fits. <strong>Offset = 8.</strong></li></ul><p><code>b</code>'s 4-byte <code>int</code> storage unit (bytes 0–3)encompasses <code>a</code> at byte 0. No padding is inserted — the corerule only cares whether the field fits within an aligned unit, notwhether that unit overlaps earlier non-bit-field storage.</p><blockquote><p>Under Microsoft ABI, <code>sizeof</code> is 8: <code>b</code>'s<code>int</code> unit starts at byte 4, after <code>a</code> is paddedto <code>int</code> alignment.</p></blockquote><h3 id="attributes-and-pragmas">Attributes and Pragmas</h3><p>Several attributes and pragmas alter the placement rules. 
They allwork by changing <code>FieldAlign</code>.</p><p><strong><code>packed</code></strong> — sets<code>FieldAlign = 1</code> (bit-granular packing). Bitfields pack atthe next available <em>bit</em> with no alignment constraint.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> [[<span class="title">gnu</span>:</span>:packed]] P &#123; <span class="type">int</span> x:<span class="number">4</span>, y:<span class="number">30</span>, z:<span class="number">30</span>; &#125;;</span><br><span class="line"><span class="comment">// 4 + 30 + 30 = 64 bits = 8 bytes. sizeof = 8.</span></span><br></pre></td></tr></table></figure><blockquote><p>Under Microsoft ABI, <code>sizeof</code> is 12: each bit-field mustfit within a single <code>int</code> unit, so <code>x</code>,<code>y</code>, and <code>z</code> each get their own 4-byte unit.</p></blockquote><p><code>packed</code> can also be applied to individual fields:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">P2</span> &#123;</span> <span class="type">short</span> a:<span class="number">8</span>; [[gnu::packed]] <span class="type">int</span> b:<span class="number">30</span>; &#125;;   <span class="comment">// sizeof = 6, b at bit 8</span></span><br><span class="line"><span class="comment">// Without packed on b: b at bit 32, sizeof = 8</span></span><br></pre></td></tr></table></figure><p>Without packed, <code>b</code>'s FieldAlign is 32, so it doesn't fitin <code>a</code>'s <code>short</code> storage unit and starts a new<code>int</code> unit at bit 32. 
With packed, <code>b</code>'sFieldAlign drops to 1, so it packs immediately after <code>a</code> atbit 8.</p><p><strong><code>#pragma pack(N)</code></strong> — caps<code>FieldAlign</code> at <code>N * 8</code> bits and suppresses thepadding-insertion test (<code>AllowPadding = false</code>, so theoverflow check is skipped — the field is placed at the current offsetwithout rounding up).</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="keyword">pragma</span> pack(1)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">PP</span> &#123;</span> <span class="type">char</span> a; <span class="type">int</span> b:<span class="number">4</span>; <span class="type">int</span> c:<span class="number">28</span>; <span class="type">char</span> s; &#125;;   <span class="comment">// sizeof = 6</span></span><br><span class="line"><span class="meta">#<span class="keyword">pragma</span> pack()</span></span><br></pre></td></tr></table></figure><p><code>b</code> packs at bit 8 by the normal core rule —<code>(8 &amp; 31) + 4 = 12 ≤ 32</code>, so it fits. 
Without<code>#pragma pack</code>, <code>c:28</code> at bit 12 would fail thesame check — <code>12 + 28 = 40 &gt; 32</code> — and round up to bit 32.With <code>#pragma pack(1)</code>, <code>AllowPadding</code> is false,so the overflow check is skipped and <code>c</code> stays at bit 12.Total: <code>a</code>(8) + <code>b</code>+<code>c</code>(32) +<code>s</code>(8) = 48 bits = 6 bytes.</p><p><strong><code>aligned(N)</code></strong> — forces minimum alignment.Overrides <code>packed</code>, but is itself overridden by<code>#pragma pack</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">A</span> &#123;</span> <span class="type">char</span> a; [[gnu::aligned(<span class="number">16</span>)]] <span class="type">int</span> b:<span class="number">1</span>; <span class="type">char</span> c; &#125;;</span><br><span class="line"><span class="comment">// b aligned to 16 bytes = bit 128. c at byte 17. sizeof = 32, alignof = 16.</span></span><br></pre></td></tr></table></figure><p><strong>Precedence</strong> (for non-zero-width bit-fields):<code>#pragma pack</code> &gt; <code>aligned</code> attr &gt;<code>packed</code> attr &gt; natural alignment.</p><h3 id="zero-width-bitfields">Zero-width Bitfields</h3><p><code>T : 0</code> rounds up to <code>alignof(T)</code>, acting as aseparator. 
Subsequent fields start in a new storage unit.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Z</span> &#123;</span> <span class="type">char</span> x; <span class="type">int</span> : <span class="number">0</span>; <span class="type">char</span> y; &#125;;</span><br><span class="line"><span class="comment">// x86:         y at offset 4, sizeof = 5, alignof = 1</span></span><br><span class="line"><span class="comment">// ARM/AArch64: y at offset 4, sizeof = 8, alignof = 4</span></span><br></pre></td></tr></table></figure><p>On most targets, anonymous bit-fields don't contribute to structalignment. But on AArch32/AArch64 (with<code>useZeroLengthBitfieldAlignment()</code>), zero-width bit-fields<em>do</em> raise the struct's alignment.</p><p>Zero-width bit-fields are exempt from both <code>packed</code> and<code>#pragma pack</code> — they always round up to<code>alignof(T)</code>.</p><h3 id="microsoft-abi-differences">Microsoft ABI Differences</h3><p>Clang uses the Microsoft layout rules in two situations: targeting aWindows triple (e.g. <code>x86_64-windows-msvc</code>), which uses<code>MicrosoftRecordLayoutBuilder</code>; or applying<code>__attribute__((ms_struct))</code> to individual structs on anytarget, which activates the <code>IsMsStruct</code> path inside<code>ItaniumRecordLayoutBuilder</code>. 
GCC documents the rules under<ahref="https://gcc.gnu.org/onlinedocs/gccint/Storage-Layout.html#:~:text=TARGET_MS_BITFIELD_LAYOUT_P"><code>TARGET_MS_BITFIELD_LAYOUT_P</code></a>.</p><p>The Microsoft ABI uses a fundamentally different layout strategy.While Itanium packs bit-fields into overlapping storage units ofpotentially different types, Microsoft allocates a<strong>complete</strong> storage unit of the declared type, thenparcels bits among successive bit-fields <strong>of the same typesize</strong>.</p><p>The key differences:</p><p><strong>Type size changes force a new storage unit.</strong> In theGCC documentation's wording: "a bit-field won't share the same storageunit with the previous bit-field if their underlying types havedifferent sizes, and the bit-field will be aligned to the highestalignment of the underlying types of itself and of the previousbit-field." Itanium would let them overlap.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Itn</span> &#123;</span> <span class="type">int</span> a:<span class="number">24</span>; <span class="type">short</span> b:<span class="number">8</span>; &#125;;                             <span class="comment">// sizeof = 4</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> __<span class="title">attribute__</span>((<span class="title">ms_struct</span>)) <span class="title">MS</span> &#123;</span> <span class="type">int</span> a:<span class="number">24</span>; <span class="type">short</span> b:<span class="number">8</span>; &#125;;   <span class="comment">// sizeof = 8</span></span><br></pre></td></tr></table></figure><p>Under Itanium, <code>b</code>'s <code>short</code> storage unitoverlaps into <code>a</code>'s <code>int</code> unit — everything fitsin 4 bytes. 
Under Microsoft, the type size changes from 4 to 2, so<code>b</code> gets its own storage unit. The <code>int</code> unit (4bytes) plus the <code>short</code> unit (2 bytes, padded to 4 foralignment) gives 8 bytes. Note that the rule is about type<em>size</em>, not type identity — <code>int a:24; unsigned b:8</code>share a unit because both types are 4 bytes.</p><p>Each unit is discrete — this is a direct consequence of the type sizerule.</p><p><strong>Zero-width bit-fields are ignored unless they follow anon-zero-width bit-field.</strong>(<code>MicrosoftRecordLayoutBuilder::layoutZeroWidthBitField</code>.)GCC's documentation: "zero-sized bit-fields are disregarded unless theyfollow another nonzero-size bit-field." When honored, they terminate thecurrent run and affect the struct's alignment.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// MS mode:</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">MS_ZW1</span> &#123;</span> <span class="type">long</span> : <span class="number">0</span>; <span class="type">char</span> bar; &#125;;                       <span class="comment">// sizeof = 1 (no preceding bit-field)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">MS_ZW2</span> &#123;</span> <span class="type">char</span> foo; <span class="type">int</span> : <span class="number">0</span>; <span class="type">char</span> bar; &#125;;              <span class="comment">// sizeof = 2 (preceding non-bit-field doesn&#x27;t count)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">MS_ZW3</span> &#123;</span> <span 
class="type">int</span> : <span class="number">0</span>; <span class="type">long</span> : <span class="number">0</span>; <span class="type">char</span> bar; &#125;;              <span class="comment">// sizeof = 1 (zero-width doesn&#x27;t count either)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">MS_ZW4</span> &#123;</span> <span class="type">char</span> foo : <span class="number">4</span>; <span class="type">int</span> : <span class="number">0</span>; <span class="type">char</span> bar; &#125;;          <span class="comment">// sizeof = 8 (non-zero-width bit-field — honored)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">MS_ZW5</span> &#123;</span> <span class="type">long</span> : <span class="number">0</span>; <span class="type">char</span> foo : <span class="number">4</span>; <span class="type">int</span> : <span class="number">0</span>; <span class="type">char</span> bar; &#125;;  <span class="comment">// sizeof = 8 (first ignored, second honored)</span></span><br></pre></td></tr></table></figure><p><strong>Alignment = type size.</strong> The alignment of afundamental type always equals its size —<code>alignof(long long) == 8</code> even on targets where the naturalalignment is 4 (like Darwin PPC32).</p><p><strong>Unions.</strong> ms_struct ignores all alignment attributesin unions. All bit-fields use alignment 1 and start at offset 0.</p><h2 id="phase-2-access-units">Phase 2: Access Units</h2><p>LLVM IR has no bit-field concept. 
To access a bit-field, theClang-generated IR must:</p><ol type="1"><li>Load an integer from memory (the <strong>access unit</strong>)</li><li>Mask and shift to extract or insert the bit-field's bits</li><li>Store the integer back</li></ol><p>The access unit is the LLVM type that gets loaded and stored.Choosing it well matters:</p><ul><li>Too narrow means multiple memory operations for adjacent bit-fieldwrites;</li><li>Too wide means touching memory unnecessarily or clobbering adjacentdata.</li></ul><p>Implementation: <code>CGRecordLowering::accumulateBitFields</code>(<code>clang/lib/CodeGen/CGRecordLayoutBuilder.cpp</code>).</p><h3 id="itanium-merging-algorithm">Itanium: Merging Algorithm</h3><p><strong>Hard constraints</strong> — an access unit must never:</p><ol type="1"><li><strong>Overlap non-bit-field storage.</strong> The C memory modelallows non-bit-field members to be accessed from other threads. Aload/store of the access unit must not touch bytes belonging to othermembers.</li><li><strong>Cross a zero-width bit-field</strong> at a byte boundary.Zero-width bit-fields define memory location boundaries — they arebarriers.</li><li><strong>Extend into reusable tail padding.</strong> In C++, aderived class may place fields in a non-POD base class's tail padding.The access unit must not overwrite those bytes.</li></ol><p><strong>Soft goals</strong> — subject to the hard constraints, accessunits should be:</p><ul><li><strong>Power-of-2 sized</strong> (1, 2, 4, 8 bytes). Non-power-of-2sizes (e.g., 3 bytes) get lowered as multiple smaller loads plus bitmanipulation.</li><li><strong>No wider than a register.</strong> Avoids multi-registerloads.</li><li><strong>Naturally aligned</strong> (on strict-alignment targets).Avoids the compiler synthesizing unaligned access sequences.</li><li><strong>As wide as possible</strong> within the above. 
Fewer, wideraccesses let LLVM combine adjacent bit-field writes into oneread-modify-write.</li></ul><p><strong>The algorithm: spans then merging.</strong></p><p><em>Step 1 — Spans.</em> Bitfields that share a byte are inseparable.They form a minimal "span" that must be in the same access unit. A spanis a maximal run of bit-fields where each successive one startsmid-byte.</p><p>Spans break at byte-aligned boundaries and at zero-width bit-fieldbarriers. A field mid-byte is unconditionally part of the current span —step 2 never sees it as a merge point.</p><p><em>Step 2 — Merge.</em> Starting from each span, try to widen theaccess unit by incorporating the next span. Accept the merge if thecombined unit:</p><ul><li>Fits in one register (<code>&lt;= RegSize</code>)</li><li>Is power-of-2 and naturally aligned (on strict-alignmenttargets)</li><li>Doesn't cross a barrier (zero-width bit-field or non-bit-fieldstorage)</li><li>The natural <code>iN</code> type fits before the limit offset</li></ul><p>Track the best candidate and install it when merging can't improvefurther.</p><p><strong>Access unit representation.</strong></p><p>Clang represents each access unit as either an integer type<code>iN</code> or an array type <code>[N x i8]</code> (see<code>CGRecordLowering::accumulateBitFields</code>). <code>iN</code> ispreferred — it generates a single load/store instruction. But LLVM's<code>iN</code> types have allocation sizes rounded up to powers of 2(<code>DataLayout.getTypeAllocSize</code>). For example,<code>i24</code> has allocation size 4 bytes.</p><p>If that rounded-up size would extend past the next field or pastreusable tail padding, the access unit is <strong>clipped</strong> to<code>[N x i8]</code>, which has an exact byte count. 
Clang assumesclipped for each new span (<code>BestClipped = true</code>) and sets itto false only when the natural <code>iN</code> fits within the availablespace (<code>BeginOffset + TypeSize &lt;= LimitOffset</code>).</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Tail padding reuse (C++)</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">A</span> &#123;</span> <span class="type">int</span> x:<span class="number">24</span>; ~A(); &#125;;      <span class="comment">// non-POD: DataSize=3, Size=4</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">B</span> :</span> A &#123; <span class="type">char</span> c; &#125;;          <span class="comment">// c at offset 3, in A&#x27;s tail padding</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// i24 allocates 4 bytes, but byte 3 belongs to B::c.</span></span><br><span class="line"><span class="comment">// Access unit for x is clipped to [3 x i8].</span></span><br></pre></td></tr></table></figure><p><strong>Strict vs cheap unaligned.</strong> On targets with cheapunaligned access (x86, AArch64 without <code>+strict-align</code>),alignment checks are skipped — spans merge freely up to register width.On strict-alignment targets (e.g. 
<code>-mstrict-align</code>), a merge is rejected if the combined access unit would not be naturally aligned at its offset within the struct.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Align</span> &#123;</span> <span class="type">char</span> x; <span class="type">short</span> a:<span class="number">12</span>; <span class="type">short</span> b:<span class="number">4</span>; <span class="type">char</span> c:<span class="number">8</span>; &#125;; <span class="comment">// sizeof = 6</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// AArch64 -mno-strict-align:  %struct.Align = type &lt;&#123; i8, i8, i32 &#125;&gt;</span></span><br><span class="line"><span class="comment">//   → a+b+c merged into one i32 at offset 2 (unaligned, but cheap)</span></span><br><span class="line"><span class="comment">// AArch64 -mstrict-align:     %struct.Align = type &#123; i8, i16, i8 &#125;</span></span><br><span class="line"><span class="comment">//   → a+b merged</span></span><br><span class="line"><span class="comment">//   → +c rejected; a+b stay as i16, c gets its own i8</span></span><br></pre></td></tr></table></figure><p><strong><code>-ffine-grained-bit-field-accesses</code>.</strong> This Clang flag disables merging entirely. Each span becomes its own access unit — no adjacent spans are combined. 
For example:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S4</span> &#123;</span> <span class="type">unsigned</span> <span class="type">long</span> f1:<span class="number">28</span>, f2:<span class="number">4</span>, f3:<span class="number">12</span>; &#125;;</span><br><span class="line"><span class="comment">// Default:        %struct.S4 = type &#123; i64 &#125;       — spans merged into one access unit</span></span><br><span class="line"><span class="comment">// Fine-grained:   %struct.S4 = type &#123; i32, i16 &#125;  — each span kept separate</span></span><br></pre></td></tr></table></figure><p><a href="https://reviews.llvm.org/D36562">The flag is incompatiblewith sanitizers</a> and is automatically disabled (with a warning) whenany sanitizer is active.</p><p>Returning to S3:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S3</span> &#123;</span> <span class="type">int</span> a:<span class="number">10</span>; <span class="type">int</span> b:<span class="number">6</span>; <span class="type">char</span> c; <span class="type">int</span> d:<span class="number">6</span>; &#125;;</span><br></pre></td></tr></table></figure><p>Phase 1 assigned: <code>a</code><span class="citation"data-cites="0">@0</span>, <code>b</code><span class="citation"data-cites="10">@10</span>, <code>c</code><span class="citation"data-cites="16">@16</span> (byte 2), <code>d</code><spanclass="citation" data-cites="24">@24</span> (byte 3).</p><p>Phase 2 sees two bit-field runs (separated by non-bit-field<code>c</code>):</p><p><em>Run 1: <code>a</code> and <code>b</code></em> (bits 0–15, 
bytes0–1). They share byte 1 (bits 8–15), so they form one span. The spancovers 2 bytes. The natural type <code>i16</code> fits exactly — noclipping needed. Access unit: <code>i16</code>.</p><p><em>Run 2: <code>d</code></em> (bits 24–29, byte 3). Single span, 6bits in 1 byte. Access unit: <code>i8</code>.</p><p>The resulting LLVM struct type:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">%struct.S3 = type &#123; i16, i8, i8 &#125;</span><br><span class="line">                    a,b   c   d</span><br></pre></td></tr></table></figure><p>To read <code>a</code>, codegen loads the <code>i16</code>, extractsbits 0–9. To read <code>b</code>, it loads the same <code>i16</code>,extracts bits 10–15. Neither load touches <code>c</code>.</p><p><strong>When clipping is needed.</strong> Widen the bit-fields so<code>a + b</code> no longer fits in 2 bytes:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S3w</span> &#123;</span> <span class="type">int</span> a:<span class="number">14</span>; <span class="type">int</span> b:<span class="number">10</span>; <span class="type">char</span> c; <span class="type">int</span> d:<span class="number">6</span>; &#125;;</span><br></pre></td></tr></table></figure><p>Phase 1 assigned: <code>a</code><span class="citation"data-cites="0">@0</span>, <code>b</code><span class="citation"data-cites="14">@14</span>, <code>c</code><span class="citation"data-cites="24">@24</span> (byte 3), <code>d</code><spanclass="citation" data-cites="32">@32</span> (byte 4).<code>sizeof(S3w) = 8</code>.</p><p><em>Run 1: <code>a</code> and <code>b</code></em> (bits 0–23, bytes0–2). The span covers 3 bytes. 
The natural type <code>i24</code> hasallocation size 4 bytes — but byte 3 belongs to <code>c</code>. Theaccess unit is <strong>clipped to <code>[3 x i8]</code></strong>.</p><p><em>Run 2: <code>d</code></em> (bits 32–37, byte 4). Access unit:<code>i8</code>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">%struct.S3w = type &#123; [3 x i8], i8, i8, [3 x i8] &#125;</span><br><span class="line">                      a,b       c    d    padding</span><br></pre></td></tr></table></figure><p><strong>Endianness.</strong></p><p>Access unit selection is endianness-agnostic — spans, merging, andclipping all work in byte offsets from the start of the struct.Endianness matters only when codegen emits the shift/mask sequence toextract or insert a bitfield within its access unit.</p><p>LLVM loads an access unit as a single integer. On little-endian, bit0 of the integer corresponds to the lowest-addressed byte's LSB —bitfield offsets from Phase 1 can be used directly as shift amounts. 
Onbig-endian, bit 0 of the integer corresponds to the highest-addressedbyte's MSB, so the bit numbering within the loaded integer isreversed.</p><p>Clang handles this in <code>setBitFieldInfo</code>(<code>CGRecordLayoutBuilder.cpp</code>):</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">Info.Offset = (<span class="type">unsigned</span>)(<span class="built_in">getFieldBitOffset</span>(FD) - Context.<span class="built_in">toBits</span>(StartOffset));</span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line"><span class="keyword">if</span> (DataLayout.<span class="built_in">isBigEndian</span>())</span><br><span class="line">  Info.Offset = Info.StorageSize - (Info.Offset + Info.Size);</span><br></pre></td></tr></table></figure><p>The little-endian offset counts up from the LSB; the big-endianoffset is mirrored to count down from the MSB.<code>EmitLoadOfBitfieldLValue</code> (<code>CGExpr.cpp</code>) thenuses <code>Info.Offset</code> uniformly — it right-shifts by<code>Offset</code> and masks to <code>Size</code> bits, which works forboth endiannesses because the flip was already baked into<code>Offset</code>.</p><h3 id="microsoft-discrete-access-units">Microsoft: Discrete AccessUnits</h3><p>Microsoft ABI's codegen is simple: each bit-field gets an access unitof its declared type. Adjacent bit-fields of the same type size shareone access unit. 
Zero-width bit-fields and type-size changes break runs.There is no complex merging — the Phase 1 storage units <em>are</em> theaccess units.</p><p>Contrast S3 under both ABIs:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S3</span> &#123;</span> <span class="type">int</span> a:<span class="number">10</span>; <span class="type">int</span> b:<span class="number">6</span>; <span class="type">char</span> c; <span class="type">int</span> d:<span class="number">6</span>; &#125;;</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Itanium:   %struct.S3  = type &#123; i16, i8, i8 &#125;        // a,b merged into i16, d is i8</span><br><span class="line">Microsoft: %struct.MS3 = type &#123; i32, i8, i32 &#125;       // a,b share i32 unit, d gets own i32</span><br></pre></td></tr></table></figure><p>Itanium's Phase 2 merges <code>a</code> and <code>b</code> into thetightest access unit that covers both (<code>i16</code>), and clips orshrinks to avoid touching <code>c</code>. 
Microsoft uses the fulldeclared type (<code>int</code> = <code>i32</code>) for each storageunit — no merging, no clipping.</p><p>Similarly for mixed types:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">S2</span> &#123;</span> <span class="type">int</span> a:<span class="number">24</span>; <span class="type">short</span> b:<span class="number">8</span>; &#125;;</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Itanium:   %struct.S2  = type &#123; i32 &#125;                 // a and b merged into one i32</span><br><span class="line">Microsoft: %struct.MS2 = type &#123; i32, i16 &#125;            // separate units: i32 for a, i16 for b</span><br></pre></td></tr></table></figure><p>Itanium merges <code>a</code> and <code>b</code> into a single<code>i32</code> since they share the same 4 bytes. Microsoft gives eachits own access unit matching the declared type.</p><h2 id="conclusion">Conclusion</h2><p>Phase 1 decides <em>where</em> bits go — it's specified by the ABIand determines <code>sizeof</code> and <code>alignof</code>. Phase 2decides <em>how</em> to access them — it's a compiler optimization thataffects codegen but not the binary layout. They answer differentquestions and often produce different-sized units. The storage unit fora bit-field is determined by its declared type; the access unit isdetermined by what's safe and efficient to load.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;The C and C++ standards leave nearly every detail to the
implementation. C23 §6.7.3.2:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An implementation may allocate</summary>
      
    
    
    
    
    <category term="c" scheme="https://maskray.me/blog/tags/c/"/>
    
    <category term="gcc" scheme="https://maskray.me/blog/tags/gcc/"/>
    
    <category term="clang" scheme="https://maskray.me/blog/tags/clang/"/>
    
  </entry>
  
  <entry>
    <title>Call relocation types</title>
    <link href="https://maskray.me/blog/2026-02-16-call-relocation-types"/>
    <id>https://maskray.me/blog/2026-02-16-call-relocation-types</id>
    <published>2026-02-16T08:00:00.000Z</published>
    <updated>2026-02-17T18:31:45.390Z</updated>
    
    <content type="html"><![CDATA[<p>Most architectures encode direct branch/call instructions with aPC-relative displacement. This post discusses a specific category ofbranch relocations: those used for direct function calls and tail calls.Some architectures use two ELF relocation types for a callinstruction:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"># i386, x86-64</span><br><span class="line">call foo              # R_386_PC32, R_X86_64_PC32</span><br><span class="line">call foo@plt          # R_386_PLT32, R_X86_64_PLT32</span><br><span class="line"></span><br><span class="line"># m68k</span><br><span class="line">bsr.l foo             # R_68K_PC32</span><br><span class="line">bsr.l foo@plt         # R_68K_PLT32</span><br><span class="line"></span><br><span class="line"># s390/s390x</span><br><span class="line">brasl %r14, foo       # R_390_PC32DBL</span><br><span class="line">brasl %r14, foo@plt   # R_390_PLT32DBL</span><br><span class="line"></span><br><span class="line"># sparc</span><br><span class="line">call    foo, 0        # not PIC: R_SPARC_WDISP30</span><br><span class="line">call    foo, 0        # gas -KPIC: R_SPARC_WPLT30</span><br></pre></td></tr></table></figure><span id="more"></span><p>This post describes why I think this happened.</p><h2 id="static-linking-one-type-suffices">Static linking: one typesuffices</h2><p>In the static linking model, all symbols are resolved at link time:every symbol is 
either defined in a relocatable object file or anundefined weak symbol. A branch instruction with a PC-relativedisplacement—x86 <code>call</code>, m68k <code>bsr.l</code>, s390<code>brasl</code>—can reuse the same PC-relative data relocation typeused for data references.</p><ul><li>i386: <code>R_386_PC32</code> for both <code>call foo</code> and<code>.long foo - .</code></li><li>x86-64: <code>R_X86_64_PC32</code> for both <code>call foo</code>and <code>.long foo - .</code></li><li>m68k: <code>R_68K_PC32</code> for <code>bsr.l foo</code>,<code>move.l var,%d0</code>, and <code>.long foo - .</code></li><li>s390x: <code>R_390_PC32DBL</code> for <code>brasl %r14, foo</code>,<code>larl %r1, var</code>, and <code>.long foo - .</code></li></ul><p>No separate "call" relocation type is needed. The linker simplypatches the displacement to point to the symbol address.</p><h2 id="dynamic-linking-changes-the-picture">Dynamic linking changes thepicture</h2><p>With System V Release 4 style shared libraries, variable access andfunction calls diverge.</p><p>For <strong>variables and function addresses</strong>, a referencefrom one component to a symbol defined in another cannot use a plainPC-relative relocation, because the distance between the two componentsis not known at link time. The <ahref="/blog/2021-08-29-all-about-global-offset-table">Global OffsetTable</a> was introduced for this purpose, along with GOT-generatingrelocation types. (Additionally, <ahref="/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected">copyrelocations</a> are a workaround for external data symbols from<code>-fno-pic</code> relocatable files.) To satisfy the <strong>pointerequality</strong> requirement, a PC-relative data relocation in anexecutable must resolve to the same address as its counterpart in ashared object—this is why GOT indirection is used for symbols not knownat compile time to be preemptible.</p><p>For <strong>direct function calls</strong>, the situation isdifferent. 
A call instruction has "transfer control there by any means"semantics - the caller usually doesn't care <em>how</em> the callee isreached, only that it gets there. This allows the linker to interpose a<a href="/blog/2021-09-19-all-about-procedure-linkage-table">PLTstub</a> when the target is in another component, without any specialcode sequence at the call site. Alternatively, some architecturessupport an indirect call sequence that bypasses PLT entirely: <ahref="/blog/2021-09-19-all-about-procedure-linkage-table#fno-plt"><code>-fno-plt</code>on x86</a> and <ahref="/blog/2023-09-04-toolchain-notes-on-mips#:~:text=mno%2dplt"><code>-mno-plt</code>on MIPS</a> (o32/n32 non-PIC).</p><p>Variable accesses do not have the same semantics - so the PC-relativedata relocation type cannot be reused on a call instruction.</p><p>This is why separate branch relocation types were introduced:<code>R_386_PLT32</code>, <code>R_68K_PLT32</code>,<code>R_390_PLT32DBL</code>, and so on. The relocation type carries thesemantic information: "this is a function call that can use PLTindirection."</p><h2 id="misleading-names">Misleading names</h2><p>The <code>@plt</code> notation in assembly and the <code>PLT32</code>relocation type names are misleading. 
They suggest that a PLT entry is involved, but that is often not the case: when the callee is defined in the same component, the linker resolves the branch directly and no PLT entry is created.</p><p><code>R_386_CALL32</code> and <code>R_X86_64_CALL32</code> would have been better names.</p><p>In addition, the <code>@plt</code> notation itself is problematic as a <a href="/blog/2025-03-16-relocation-generation-in-assemblers#relocation-specifier-flavors">relocation specifier</a>.</p><h2 id="architecture-comparison">Architecture comparison</h2><p><strong>Single type (clean design).</strong> Some architectures recognized from the start that one call relocation type is sufficient. The linker can decide whether a PLT stub is needed based on the symbol's binding and visibility.</p><ul><li>AArch64: <code>R_AARCH64_CALL26</code> for <code>bl</code> and <code>R_AARCH64_JUMP26</code> for <code>b</code>.</li><li>PowerPC64 ELFv2: <code>R_PPC64_REL24</code> for <code>bl</code>.</li></ul><p>These architectures never had the naming confusion: there is no "PLT" in the relocation name, and no redundant pair.</p><p><strong>Redundant pairs (misguided).</strong> Some architectures introduced separate "PLT" and "non-PLT" call relocation types, creating a distinction without a real difference.</p><ul><li>SPARC: <code>R_SPARC_WPLT30</code> alongside <code>R_SPARC_WDISP30</code>. The assembler decides at assembly time based on PIC mode and symbol preemptibility, when ideally the linker should make these decisions.</li><li>PPC32: <code>R_PPC_REL24</code> (non-PIC) and <code>R_PPC_PLTREL24</code> (PIC) have genuinely different semantics (the addend of <code>R_PPC_PLTREL24</code> encodes the r30 GOT pointer setup). However, <code>R_PPC_LOCAL24PC</code> is entirely useless; all occurrences can be replaced with <code>R_PPC_REL24</code>.</li><li>RISC-V: <code>R_RISCV_CALL_PLT</code> alongside the now-removed <code>R_RISCV_CALL</code>. The community recognized that only one relocation is needed. 
<code>R_RISCV_CALL_PLT</code> is kept (despite thename, does not mandate a PLT entry).</li></ul><p>x86-64 started with <code>R_X86_64_PC32</code> for<code>call foo</code> (inherited from the static-linking mindset) and<code>R_X86_64_PLT32</code> for <code>call foo@plt</code> (symbols notcompile-time known to be non-preemptible). In 2018, binutils <ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=22791"class="uri">https://sourceware.org/bugzilla/show_bug.cgi?id=22791</a>switched to <code>R_X86_64_PLT32</code> for <code>call foo</code>. LLVMintegrated assembler followed suit.</p><p>This means <code>R_X86_64_PC32</code> is now effectively reserved fordata references, and <code>R_X86_64_PLT32</code> marks all calls—a cleanseparation achieved by convention.</p><p>However, GNU Assembler still produces <code>R_X86_64_PC32</code> when<code>call foo</code> references an <code>STB_LOCAL</code> symbol. I'vesent a patch to fix this: <ahref="https://sourceware.org/pipermail/binutils/2026-February/148251.html">[PATCHv2] x86: keep PLT32 relocation for local symbols instead of convertingto PC32</a>.</p><p>GCC's s390 port seems to always generate <code>@plt</code> (even forhidden visibility functions), leading to <code>R_390_PLT32DBL</code>relocations.</p><h2 id="range-extension-thunks">Range extension thunks</h2><p>When a branch target is out of range, some architectures allow thelinker to insert a <ahref="/blog/2026-01-25-long-branches-in-compilers-assemblers-and-linkers#linker-range-extension-thunks">rangeextension thunks</a>: On AArch64 and PowerPC64, this is wellestablished.</p><p>On x86-64, the ±2GiB range of <code>call</code>/<code>jmp</code> hasbeen sufficient so far, but as executables grow, <ahref="/blog/2023-05-14-relocation-overflow-and-code-models">relocationoverflow</a> becomes a concern. 
There are proposals to add rangeextension thunks to x86-64, which would require the linker to identifycall sites-a PC-relative data relocation like <code>R_X86_64_PC32</code>would not be suitable (due to pointer equality requirement), makingconsistent use of <code>R_X86_64_PLT32</code> for calls all the moreimportant.</p><h2 id="recommendation-for-future-architectures">Recommendation forfuture architectures</h2><p>For a specific instruction or pseudo instruction for function callsand tail calls, use a single call relocation type—no "PLT" vs. "non-PLT"distinction. The assembler should emit the same relocation, and thelinker, which knows whether the symbol is preemptible, decides whether aPLT stub is needed. Optionally enable range extension thunks for therelocation type. AArch64's <code>R_AARCH64_CALL26</code> and PowerPC64ELFv2's <code>R_PPC64_REL24</code> demonstrate this approach well.</p><p>This discussion does not apply to intra-function branches, whichtarget local labels.</p><h2 id="see-also">See also</h2><ul><li><a href="/blog/2021-09-19-all-about-procedure-linkage-table">Allabout Procedure Linkage Table</a> for how PLT works</li><li><a href="/blog/2021-08-29-all-about-global-offset-table">All aboutGlobal Offset Table</a> for the data-access counterpart</li><li><ahref="/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected">Copyrelocations, canonical PLT entries and protected visibility</a> for thevariable-access complications</li><li><ahref="/blog/2025-03-16-relocation-generation-in-assemblers">Relocationgeneration in assemblers</a> for how assemblers decide which relocationsto emit</li><li><ahref="/blog/2026-01-25-long-branches-in-compilers-assemblers-and-linkers">Longbranches in compilers, assemblers, and linkers</a> for range extensionthunks</li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;Most architectures encode direct branch/call instructions with a
PC-relative displacement. This post discusses a specific category of
branch relocations: those used for direct function calls and tail calls.
Some architectures use two ELF relocation types for a call
instruction:&lt;/p&gt;
&lt;figure class=&quot;highlight plaintext&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;3&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;4&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;5&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;6&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;7&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;8&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;9&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;10&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;11&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;12&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;13&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;14&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;15&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;# i386, x86-64&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;call foo              # R_386_PC32, R_X86_64_PC32&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;call foo@plt          # R_386_PLT32, R_X86_64_PLT32&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;# m68k&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;bsr.l foo             # R_68K_PC32&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;bsr.l foo@plt         # R_68K_PLT32&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;# s390/s390x&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;brasl %r14, foo       # R_390_PC32DBL&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;brasl %r14, foo@plt   # 
R_390_PLT32DBL&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;# sparc&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;call    foo, 0        # not PIC: R_SPARC_WDISP30&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;call    foo, 0        # gas -KPIC: R_SPARC_WPLT30&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/figure&gt;</summary>
    
    
    
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="binutils" scheme="https://maskray.me/blog/tags/binutils/"/>
    
  </entry>
  
  <entry>
    <title>lld 22 ELF changes</title>
    <link href="https://maskray.me/blog/2026-02-01-lld-22-elf-changes"/>
    <id>https://maskray.me/blog/2026-02-01-lld-22-elf-changes</id>
    <published>2026-02-01T08:00:00.000Z</published>
    <updated>2026-02-02T03:25:22.840Z</updated>
    
    <content type="html"><![CDATA[<p>For those unfamiliar, <a href="https://lld.llvm.org/">lld</a> is theLLVM linker, supporting PE/COFF, ELF, Mach-O, and WebAssembly ports.These object file formats differ significantly, and each port mustfollow the conventions of the platform's system linker. As a result, theports share limited code (diagnostics, memory allocation, etc) and havelargely separate reviewer groups.</p><p>With LLVM 22.1 releasing soon, I've added some notes to the <ahref="https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst"class="uri">https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst</a>as an lld/ELF maintainer. As usual, I've reviewed almost all the patchesnot authored by me.</p><p>For the first time, I used an LLM agent (Claude Code) to help lookthrough commits(<code>git log release/21.x..release/22.x -- lld/ELF</code>) and draftthe release notes. Despite my request to only read lld/ELF changes,Claude Code also crafted notes for other ports, which I retained sincetheir release notes had been quite sparse for several releases. Changesback ported to the 21.x release are removed(<code>git log --oneline llvmorg-22-init..llvmorg-21.1.8 -- lld</code>).</p><p>I'll delve into some of the key changes.</p><span id="more"></span><ul><li><code>--print-gc-sections=&lt;file&gt;</code> has been added toredirect garbage collection section listing to a file, avoidingcontamination of stdout with other linker output. (<ahref="https://github.com/llvm/llvm-project/pull/159706">#159706</a>)</li><li>A <code>VersionNode</code> lexer state has been added for betterversion script parsing. This brings the lexer behavior closer to GNU ld.(<ahref="https://github.com/llvm/llvm-project/pull/174530">#174530</a>)</li><li>Unversioned undefined symbols now use version index 0, aligning withGNU ld 2.46 behavior. 
(<ahref="https://github.com/llvm/llvm-project/pull/168189">#168189</a>)</li><li><code>.data.rel.ro.hot</code> and <code>.data.rel.ro.unlikely</code>are now recognized as RELRO sections, allowing profile-guided staticdata partitioning. (<ahref="https://github.com/llvm/llvm-project/pull/148920">#148920</a>)</li><li>DTLTO now supports archive members and bitcode members of thinarchives. (<ahref="https://github.com/llvm/llvm-project/pull/157043">#157043</a>)</li><li>For DTLTO,<code>--thinlto-remote-compiler-prepend-arg=&lt;arg&gt;</code> has beenadded to prepend an argument to the remote compiler's command line. (<ahref="https://github.com/llvm/llvm-project/pull/162456">#162456</a>)</li><li>Balanced Partitioning (BP) section ordering now skips input sectionswith null data, and filters out section symbols. (<ahref="https://github.com/llvm/llvm-project/pull/149265">#149265</a>) (<ahref="https://github.com/llvm/llvm-project/pull/151685">#151685</a>)</li><li>For AArch64, fixed a crash when using<code>--fix-cortex-a53-843419</code> with synthetic sections andimproved handling when patched code is far from the short jump. (<ahref="https://github.com/llvm/llvm-project/pull/170495">#170495</a>)</li><li>For AArch64, added support for the <code>R_AARCH64_FUNCINIT64</code>dynamic relocation type for relocating word-sized data using the returnvalue of a function. (<ahref="https://github.com/llvm/llvm-project/pull/156564">#156564</a>)</li><li>For AArch64, added support for the <code>R_AARCH64_PATCHINST</code>relocation type to support deactivation symbols. (<ahref="https://github.com/llvm/llvm-project/pull/133534">#133534</a>)</li><li>For AArch64, added support for reading AArch64 Build Attributes andconverting them into GNU Properties. 
(<ahref="https://github.com/llvm/llvm-project/pull/147970">#147970</a>)</li><li>For ARM, fixed incorrect veneer generation for wraparound branchesat the high end of the 32-bit address space branching to the low end.(<ahref="https://github.com/llvm/llvm-project/pull/165263">#165263</a>)</li><li>For LoongArch, <code>-r</code> now synthesizes<code>R_LARCH_ALIGN</code> at input section start to preserve alignmentinformation. (<ahref="https://github.com/llvm/llvm-project/pull/153935">#153935</a>)</li><li>For LoongArch, added relocation types for LA32R/LA32S. (<ahref="https://github.com/llvm/llvm-project/pull/172618">#172618</a>) (<ahref="https://github.com/llvm/llvm-project/pull/176312">#176312</a>)</li><li>For RISC-V, added infrastructure for handling vendor-specificrelocations. (<ahref="https://github.com/llvm/llvm-project/pull/159987">#159987</a>)</li><li>For RISC-V, added support for statically resolved vendor-specificrelocations. (<ahref="https://github.com/llvm/llvm-project/pull/169273">#169273</a>)</li><li>For RISC-V, <code>-r</code> now synthesizes<code>R_RISCV_ALIGN</code> at input section start to preserve alignmentinformation during two-stage linking. (<ahref="https://github.com/llvm/llvm-project/pull/151639">#151639</a>)This is an interesting <ahref="/blog/2021-03-14-the-dark-side-of-riscv-linker-relaxation#:~:text=relocatable%20linking%20challenge">relocatablelinking challenge</a> for linker relaxation.</li></ul><p>Besides me, Peter Smith (smithp35) and Jessica Clarke (jrtc27) havedone a lot of reviews.</p><p>jrtc27 has done great work simplifying the dynamic relocation system,which is highly appreciated.</p><p>I should call out <ahref="https://github.com/llvm/llvm-project/pull/172618"class="uri">https://github.com/llvm/llvm-project/pull/172618</a>: forthis relatively large addition, the author and approver are from thesame company and contributing to their architecture, and neither theauthor nor the approver is a regular lld contributor/reviewer. 
Theauthor did not request review from regular reviewers and landed thepatch just 3 minutes after their colleague's approval. I left a commentasking to keep the PR open for other maintainers to review.</p><h2 id="distributed-thinlto">Distributed ThinLTO</h2><p><a href="https://llvm.org/docs/DTLTO.html">Distributed ThinLTO(DTLTO)</a> enables distributing ThinLTO backend compilations toexternal systems (e.g., Incredibuild, distcc-like tools) during the linkstep. This feature was contributed by PlayStation, who had offered it asa proprietary technology before upstreaming.</p><p>The traditional distributed ThinLTO is implemented in Bazel in buck2.<strong>Bazel-style distribution</strong> (build system orchestrated)uses a multi-step workflow:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Compile to bitcode (made parallel by build system)</span></span><br><span class="line">clang -c -O2 -flto=thin a.c b.c</span><br><span class="line"><span class="comment"># Thin link</span></span><br><span class="line">clang -flto=thin -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp,--thinlto-emit-imports-files -Wl,--thinlto-prefix-replace=<span class="string">&#x27;;lto/&#x27;</span> a.o b.o</span><br><span class="line"><span class="comment"># Backend compilation (distributed by build system) with dynamic dependencies</span></span><br><span class="line">clang -c -O2 -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o</span><br><span class="line">clang -c -O2 -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o</span><br><span class="line"><span class="comment"># Final native link</span></span><br><span class="line">clang 
-fuse-ld=lld @a.rsp   <span class="comment"># a.rsp contains lto/a.o and lto/b.o</span></span><br></pre></td></tr></table></figure><p>The build system sees the index files from step 2 as outputs andschedules step 3 jobs across the build cluster. This requires a buildsystem that handles <strong>dynamic dependencies</strong>—outputs ofstep 2 determine inputs to step 3.</p><p><strong>DTLTO</strong> (linker orchestrated) integrates steps 2-4into a single link invocation:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">clang -flto=thin -c a.c b.c</span><br><span class="line">clang -flto=thin -fuse-ld=lld -fthinlto-distributor=&lt;distributor&gt; *.o</span><br></pre></td></tr></table></figure><p>LLD performs the thin-link internally, generates a JSON jobdescription for each backend compilation, invokes the distributorprocess, waits for native objects, and links them. The distributor isresponsible for farming out the compilations to remote machines.</p><p>DTLTO works with any build system but requires a separate distributorprocess that speaks its JSON protocol. 
DTLTO is essentially "ThinLTOdistribution for projects that don't use Bazel".</p><h2 id="pointer-field-protection">Pointer Field Protection</h2><p><code>R_AARCH64_PATCHINST</code> is a static relocation type usedwith Pointer Field Protection (PFP), which leverages Armv8.3-A PointerAuthentication (PAC) to protect pointer fields in structs.</p><p>Consider the following C++ code:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">struct</span> <span class="title class_">cls</span> &#123;</span><br><span class="line">  ~<span class="built_in">cls</span>();</span><br><span class="line">  <span class="type">long</span> *ptr;</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line">  <span class="type">long</span> *ptr2;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="type">long</span> *<span class="title">load</span><span class="params">(cls *c)</span> </span>&#123; <span class="keyword">return</span> c-&gt;ptr; &#125;</span><br><span class="line"><span class="function"><span class="type">void</span> <span class="title">store</span><span class="params">(cls *c, <span class="type">long</span> *ptr)</span> </span>&#123; c-&gt;ptr = ptr; &#125;</span><br></pre></td></tr></table></figure><p>With Pointer Field Protection enabled, the compiler generates PACinstructions to sign and authenticate pointers:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">load:</span><br><span class="line">  ldr   x8, [x0]      // Load the PAC-signed pointer from c-&gt;ptr</span><br><span class="line">  autda x8, x0        // Authenticate and strip the PAC, R_AARCH64_PATCHINST __pfp_ds__ZTS3cls.ptr</span><br><span class="line">  mov   x0, x8</span><br><span class="line">  ret</span><br><span class="line"></span><br><span class="line">store:</span><br><span class="line">  pacda x1, x0        // Sign ptr using c as a discriminator, R_AARCH64_PATCHINST __pfp_ds__ZTS3cls.ptr</span><br><span class="line">  str   x1, [x0]</span><br><span class="line">  ret</span><br></pre></td></tr></table></figure><p>Each PAC instruction is associated with an<code>R_AARCH64_PATCHINST</code> relocation referencing a<strong>deactivation symbol</strong> (the <code>__pfp_ds_</code> prefixstands for "pointer field protection deactivation symbol"). By default,<code>__pfp_ds__ZTS3cls.ptr</code> is an undefined weak symbol in everyrelocatable file.</p><p>However, if the field's address escapes in any translation unit(e.g., someone takes <code>&amp;c-&gt;ptr</code>), the compiler definesthe deactivation symbol as an absolute symbol (ELF<code>SHN_ABS</code>). When the linker sees a defined deactivationsymbol, it patches the PAC instruction to a NOP(<code>R_AARCH64_PATCHINST</code> acts as <code>R_AARCH64_ABS64</code>when the referenced symbol is defined), disabling the protection forthat field. 
This is necessary because external code could modify the pointer without signing it, which would cause authentication failures.</p><p>The linker allows duplicate definitions of absolute symbols if the values are identical.</p><p><code>R_AARCH64_FUNCINIT64</code> is a related static relocation type that produces an <code>R_AARCH64_IRELATIVE</code> dynamic relocation (<a href="/blog/2021-01-18-gnu-indirect-function">GNU indirect function</a>). It initializes function pointers in static data at load time by calling a resolver function that returns the PAC-signed address.</p><p>PFP is AArch64-specific because it relies on Pointer Authentication (PAC), a hardware feature introduced in Armv8.3-A. PAC provides dedicated instructions (<code>pacda</code>, <code>autda</code>, etc.) that cryptographically sign pointers using keys stored in system registers. x86-64 lacks an equivalent mechanism: Intel CET provides shadow stacks and indirect branch tracking for control-flow integrity, but cannot sign arbitrary data pointers stored in memory.</p><p>Takeaways:</p><ul><li>Security features need linker support. This is because many features require aggregated information across all translation units. In this case, if <em>any</em> TU exposes a field's address, the linker disables protection for this field <em>everywhere</em>. The implementation is usually lightweight.</li><li>Relocations can do more than fill in addresses: <code>R_AARCH64_PATCHINST</code> conditionally patches instructions to NOPs based on symbol resolution. 
This is a different paradigm from traditional "compute address, write it" relocations.</li></ul><h2 id="risc-v-vendor-relocations">RISC-V vendor relocations</h2><p>RISC-V's openness encourages vendors to add custom instructions. Qualcomm has the uC extensions for their microcontrollers; CHERIoT adds capability-based security.</p><p>The RISC-V psABI adopted the vendor relocation system:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Relocation 0: R_RISCV_VENDOR      references symbol &quot;QUALCOMM&quot;</span><br><span class="line">Relocation 1: R_RISCV_QC_ABS20_U  (vendor-specific type)</span><br></pre></td></tr></table></figure><p>The <code>R_RISCV_VENDOR</code> marker identifies the vendor namespace via its symbol reference. The subsequent relocation uses a vendor-specific type number that only makes sense within that namespace. Different vendors can reuse the same type numbers without conflict.</p><p>In lld 22:</p><ul><li>Infrastructure for vendor relocations was added (<a href="https://github.com/llvm/llvm-project/pull/159987">#159987</a>). The implementation folds vendor namespace information into the upper bits of <code>RelType</code>, allowing existing relocation processing code to work with minimal changes.</li><li>Support for statically-resolved vendor relocations was added (<a href="https://github.com/llvm/llvm-project/pull/169273">#169273</a>), including Qualcomm and Andes relocation types. The patch landed without involving the regular lld/ELF reviewer pool. For changes that set architectural precedents, broader consensus should be sought before merging. 
I've <a href="https://github.com/llvm/llvm-project/pull/178584#pullrequestreview-3736342355">commented</a> on this.</li></ul><p>The <a href="https://github.com/riscv-non-isa/riscv-toolchain-conventions">RISC-V toolchain conventions</a> document the vendor relocation scheme.</p><p>There's a maintainability concern: accepting vendor-specific relocations into the core linker sets a precedent. RISC-V is uniquely fragmented compared to other LLVM backends: x86, AArch64, PowerPC, and others don't have nearly as many vendors adding custom instructions and relocations. This fragmentation is a direct consequence of RISC-V's open nature and extensibility, but it creates new challenges for upstream toolchain maintainers. Accumulated vendor-specific code could become a significant maintenance burden.</p><h2 id="gnu-ld-compatibility">GNU ld compatibility</h2><p>Large corporate users of lld/ELF don't care about GNU ld compatibility. They add features for their own use cases and move on. I diligently coordinate with binutils maintainers and file feature requests when appropriate. When lld implements a new option or behavior, I often file corresponding GNU ld feature requests to keep the tools aligned.</p><p>This coordination work is largely invisible but essential for the broader toolchain ecosystem. Users benefit when they can switch between linkers without surprises.</p><hr /><p>Link: <a href="/blog/2025-09-07-lld-21-elf-changes">lld 21 ELF changes</a></p>]]></content>
    
    
    <summary type="html">&lt;p&gt;For those unfamiliar, &lt;a href=&quot;https://lld.llvm.org/&quot;&gt;lld&lt;/a&gt; is the
LLVM linker, supporting PE/COFF, ELF, Mach-O, and WebAssembly ports.
These object file formats differ significantly, and each port must
follow the conventions of the platform&#39;s system linker. As a result, the
ports share limited code (diagnostics, memory allocation, etc) and have
largely separate reviewer groups.&lt;/p&gt;
&lt;p&gt;With LLVM 22.1 releasing soon, I&#39;ve added some notes to the &lt;a
href=&quot;https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst&quot;
class=&quot;uri&quot;&gt;https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst&lt;/a&gt;
as an lld/ELF maintainer. As usual, I&#39;ve reviewed almost all the patches
not authored by me.&lt;/p&gt;
&lt;p&gt;For the first time, I used an LLM agent (Claude Code) to help look
through commits
(&lt;code&gt;git log release/21.x..release/22.x -- lld/ELF&lt;/code&gt;) and draft
the release notes. Despite my request to only read lld/ELF changes,
Claude Code also crafted notes for other ports, which I retained since
their release notes had been quite sparse for several releases. Changes
backported to the 21.x release are removed
(&lt;code&gt;git log --oneline llvmorg-22-init..llvmorg-21.1.8 -- lld&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;I&#39;ll delve into some of the key changes.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="release" scheme="https://maskray.me/blog/tags/release/"/>
    
  </entry>
  
  <entry>
    <title>Long branches in compilers, assemblers, and linkers</title>
    <link href="https://maskray.me/blog/2026-01-25-long-branches-in-compilers-assemblers-and-linkers"/>
    <id>https://maskray.me/blog/2026-01-25-long-branches-in-compilers-assemblers-and-linkers</id>
    <published>2026-01-25T08:00:00.000Z</published>
    <updated>2026-02-17T18:36:16.159Z</updated>
    
    <content type="html"><![CDATA[<p>Branch instructions on most architectures use PC-relative addressingwith a limited range. When the target is too far away, the branchbecomes "out of range" and requires special handling.</p><p>Consider a large binary where <code>main()</code> at address 0x10000calls <code>foo()</code> at address 0x8010000-over 128MiB away. OnAArch64, the <code>bl</code> instruction can only reach ±128MiB, so thiscall cannot be encoded directly. Without proper handling, the linkerwould fail with an error like "relocation out of range." The toolchainmust handle this transparently to produce correct executables.</p><p>This article explores how compilers, assemblers, and linkers worktogether to solve the long branch problem.</p><ul><li>Compiler (IR to assembly): Handles branches within a function thatexceed the range of conditional branch instructions</li><li>Assembler (assembly to relocatable file): Handles branches within asection where the distance is known at assembly time</li><li>Linker: Handles cross-section and cross-object branches discoveredduring final layout</li></ul><span id="more"></span><h2 id="branch-range-limitations">Branch range limitations</h2><p>Different architectures have different branch range limitations.Here's a quick comparison of unconditional / conditional branchranges:</p><table><colgroup><col style="width: 12%" /><col style="width: 14%" /><col style="width: 14%" /><col style="width: 12%" /><col style="width: 46%" /></colgroup><thead><tr><th>Architecture</th><th>Cond</th><th>Uncond</th><th>Call</th><th>Notes</th></tr></thead><tbody><tr><td>AArch64</td><td>±1MiB</td><td>±128MiB</td><td>±128MiB</td><td>Thunks</td></tr><tr><td>AArch32 (A32)</td><td>±32MiB</td><td>±32MiB</td><td>±32MiB</td><td>Thunks, interworking</td></tr><tr><td>AArch32 (T32)</td><td>±1MiB</td><td>±16MiB</td><td>±16MiB</td><td>Thunks, interworking</td></tr><tr><td>LoongArch</td><td>±128KiB</td><td>±128MiB</td><td>±128MiB</td><td>Linker 
relaxation</td></tr><tr><td>M68k (68020+)</td><td>±2GiB</td><td>±2GiB</td><td>±2GiB</td><td>Assembler picks size</td></tr><tr><td>MIPS (pre-R6)</td><td>±128KiB</td><td>±128KiB (<code>b offset</code>)</td><td>±128KiB (<code>bal offset</code>)</td><td>In <code>-fno-pic</code> code, pseudo-absolute<code>j</code>/<code>jal</code> can be used for a 256MiB region.</td></tr><tr><td>MIPS R6</td><td>±128KiB</td><td>±128MiB</td><td>±128MiB</td><td></td></tr><tr><td>PowerPC64</td><td>±32KiB</td><td>±32MiB</td><td>±32MiB</td><td>Thunks</td></tr><tr><td>RISC-V</td><td>±4KiB</td><td>±1MiB</td><td>±1MiB</td><td>Linker relaxation</td></tr><tr><td>SPARC</td><td>±1MiB</td><td>±8MiB</td><td>±2GiB</td><td>No thunks needed</td></tr><tr><td>SuperH</td><td>±256B</td><td>±4KiB</td><td>±4KiB</td><td>Use register-indirect if needed</td></tr><tr><td>x86-64</td><td>±2GiB</td><td>±2GiB</td><td>±2GiB</td><td>Large code model changes call sequence</td></tr><tr><td>Xtensa</td><td>±2KiB</td><td>±128KiB</td><td>±512KiB</td><td>Linker relaxation</td></tr><tr><td>z/Architecture</td><td>±64KiB</td><td>±4GiB</td><td>±4GiB</td><td>No thunks needed</td></tr></tbody></table><p>The following subsections provide detailed per-architectureinformation, including relocation types relevant for linkerimplementation.</p><h3 id="aarch32">AArch32</h3><p>In A32 state:</p><ul><li>Branch (<code>b</code>/<code>b&lt;cond&gt;</code>), conditionalbranch and link (<code>bl&lt;cond&gt;</code>)(<code>R_ARM_JUMP24</code>): ±32MiB</li><li>Unconditional branch and link (<code>bl</code>/<code>blx</code>,<code>R_ARM_CALL</code>): ±32MiB</li></ul><p>Note: <code>R_ARM_CALL</code> is for unconditional<code>bl</code>/<code>blx</code> which can be relaxed to BLX inline;<code>R_ARM_JUMP24</code> is for branches which require a veneer forinterworking.</p><p>In T32 state (Thumb state pre-ARMv8):</p><ul><li>Conditional branch (<code>b&lt;cond&gt;</code>,<code>R_ARM_THM_JUMP8</code>): ±256 bytes</li><li>Short unconditional branch 
(<code>b</code>,<code>R_ARM_THM_JUMP11</code>): ±2KiB</li><li>ARMv5T branch and link (<code>bl</code>/<code>blx</code>,<code>R_ARM_THM_CALL</code>): ±4MiB</li><li>ARMv6T2 wide conditional branch (<code>b&lt;cond&gt;.w</code>,<code>R_ARM_THM_JUMP19</code>): ±1MiB</li><li>ARMv6T2 wide branch (<code>b.w</code>,<code>R_ARM_THM_JUMP24</code>): ±16MiB</li><li>ARMv6T2 wide branch and link (<code>bl</code>/<code>blx</code>,<code>R_ARM_THM_CALL</code>): ±16MiB. <code>R_ARM_THM_CALL</code> can berelaxed to BLX.</li></ul><h3 id="aarch64">AArch64</h3><ul><li>Test bit and branch (<code>tbz</code>/<code>tbnz</code>,<code>R_AARCH64_TSTBR14</code>): ±32KiB</li><li>Compare and branch (<code>cbz</code>/<code>cbnz</code>,<code>R_AARCH64_CONDBR19</code>): ±1MiB</li><li>Conditional branches (<code>b.&lt;cond&gt;</code>,<code>R_AARCH64_CONDBR19</code>): ±1MiB</li><li>Unconditional branches (<code>b</code>/<code>bl</code>,<code>R_AARCH64_JUMP26</code>/<code>R_AARCH64_CALL26</code>):±128MiB</li></ul><p>The compiler's <code>BranchRelaxation</code> pass handlesout-of-range conditional branches by inverting the condition andinserting an unconditional branch. 
The AArch64 assembler does notperform branch relaxation; out-of-range branches produce linker errorsif not handled by the compiler.</p><h3 id="loongarch">LoongArch</h3><ul><li>Conditional branches(<code>beq</code>/<code>bne</code>/<code>blt</code>/<code>bge</code>/<code>bltu</code>/<code>bgeu</code>,<code>R_LARCH_B16</code>): ±128KiB (18-bit signed)</li><li>Compare-to-zero branches (<code>beqz</code>/<code>bnez</code>,<code>R_LARCH_B21</code>): ±4MiB (23-bit signed)</li><li>Unconditional branch/call (<code>b</code>/<code>bl</code>,<code>R_LARCH_B26</code>): ±128MiB (28-bit signed)</li><li>Medium range call (<code>pcaddu12i</code>+<code>jirl</code>,<code>R_LARCH_CALL30</code>): ±2GiB</li><li>Long range call (<code>pcaddu18i</code>+<code>jirl</code>,<code>R_LARCH_CALL36</code>): ±128GiB</li></ul><h3 id="m68k">M68k</h3><ul><li>Short branch(<code>Bcc.B</code>/<code>BRA.B</code>/<code>BSR.B</code>): ±128 bytes(8-bit displacement)</li><li>Word branch(<code>Bcc.W</code>/<code>BRA.W</code>/<code>BSR.W</code>): ±32KiB(16-bit displacement)</li><li>Long branch(<code>Bcc.L</code>/<code>BRA.L</code>/<code>BSR.L</code>, 68020+):±2GiB (32-bit displacement)</li></ul><p>GNU Assembler provides <ahref="https://sourceware.org/binutils/docs/as/M68K_002dBranch.html">pseudoopcodes</a> (<code>jbsr</code>, <code>jra</code>, <code>jXX</code>) that"automatically expand to the shortest instruction capable of reachingthe target". 
For example, <code>jeq .L0</code> emits one of <code>beq.b</code>, <code>beq.w</code>, and <code>beq.l</code> depending on the displacement.</p><p>With the long forms available on 68020 and later, M68k doesn't need linker range extension thunks.</p><h3 id="mips">MIPS</h3><ul><li>Conditional branches (<code>beq</code>/<code>bne</code>/<code>bgez</code>/<code>bltz</code>/etc, <code>R_MIPS_PC16</code>): ±128KiB</li><li>PC-relative jump (<code>b offset</code> (<code>bgez $zero, offset</code>)): ±128KiB</li><li>PC-relative call (<code>bal offset</code> (<code>bgezal $zero, offset</code>)): ±128KiB</li><li>Pseudo-absolute jump/call (<code>j</code>/<code>jal</code>, <code>R_MIPS_26</code>): branch within the current 256MiB region, only suitable for <code>-fno-pic</code> code. Deprecated in R6 in favor of <code>bc</code>/<code>balc</code></li></ul><p>microMIPS 16-bit instructions (removed in Release 6):</p><ul><li>Conditional branch (<code>beqz16</code>, <code>R_MICROMIPS_PC7_S1</code>): ±128 bytes</li><li>Unconditional branch (<code>b16</code>, <code>R_MICROMIPS_PC10_S1</code>): ±1KiB</li></ul><p>MIPS Release 6:</p><ul><li>Unconditional branch, compact (<code>bc16</code>, unclear toolchain implementation): ±1KiB</li><li>Compare and branch, compact (<code>beqc</code>/<code>bnec</code>/<code>bltc</code>/<code>bgec</code>/etc, <code>R_MIPS_PC16</code>): ±128KiB</li><li>Compare register to zero and branch, compact (<code>beqzc</code>/<code>bnezc</code>/etc, <code>R_MIPS_PC21_S2</code>): ±4MiB</li><li>Branch (and link), compact (<code>bc</code>/<code>balc</code>, <code>R_MIPS_PC26_S2</code>): ±128MiB</li></ul><p>Compiler long branch handling: both GCC (<code>mips_output_conditional_branch</code>) and LLVM (<code>MipsBranchExpansion</code>) handle out-of-range conditional branches by inverting the condition and inserting an unconditional jump.</p><p>lld implements LA25 thunks for MIPS PIC/non-PIC interoperability, but not range 
extension thunks. GNU ld also does not implement range extension thunks for MIPS.</p><p>In <code>-mno-abicalls</code> mode, GCC's <code>-mlong-calls</code> option (<a href="https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d1399bd0ff3893bb9ebea7b977c7f3ec91b728b0">added in 1993-03</a>) generates indirect call sequences that can reach any address.</p><h3 id="powerpc">PowerPC</h3><ul><li>Conditional branch (<code>bc</code>/<code>bcl</code>, <code>R_PPC64_REL14</code>): ±32KiB</li><li>Unconditional branch (<code>b</code>/<code>bl</code>, <code>R_PPC64_REL24</code>/<code>R_PPC64_REL24_NOTOC</code>): ±32MiB</li></ul><p>GCC-generated code relies on linker thunks. However, the legacy <code>-mlongcall</code> option can be used to generate long code sequences.</p><h3 id="risc-v">RISC-V</h3><ul><li>Compressed <code>c.beqz</code>: ±256 bytes</li><li>Compressed <code>c.jal</code>: ±2KiB</li><li><code>jalr</code> (I-type immediate): ±2KiB</li><li>Conditional branches (<code>beq</code>/<code>bne</code>/<code>blt</code>/<code>bge</code>/<code>bltu</code>/<code>bgeu</code>, B-type immediate): ±4KiB</li><li><code>jal</code> (J-type immediate, <code>PseudoBR</code>): ±1MiB (notably smaller than other RISC architectures: AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB)</li><li><code>PseudoJump</code> (using <code>auipc</code> + <code>jalr</code>): ±2GiB</li><li><code>beqi</code>/<code>bnei</code> (Zibi extension, 5-bit compare immediate (1 to 31 and -1)): ±4KiB</li></ul><p>Qualcomm uC Branch Immediate extension (Xqcibi):</p><ul><li><code>qc.beqi</code>/<code>qc.bnei</code>/<code>qc.blti</code>/<code>qc.bgei</code>/<code>qc.bltui</code>/<code>qc.bgeui</code> (32-bit, 5-bit compare immediate): 
±4KiB</li><li><code>qc.e.beqi</code>/<code>qc.e.bnei</code>/<code>qc.e.blti</code>/<code>qc.e.bgei</code>/<code>qc.e.bltui</code>/<code>qc.e.bgeui</code>(48-bit, 16-bit compare immediate): ±4KiB</li></ul><p>Qualcomm uC Long Branch extension (Xqcilb):</p><ul><li><code>qc.e.j</code>/<code>qc.e.jal</code> (48-bit,<code>R_RISCV_VENDOR(QUALCOMM)+R_RISCV_QC_E_CALL_PLT</code>): ±2GiB</li></ul><p>For function calls:</p><ul><li>The <a href="https://go-review.googlesource.com/c/go/+/345051">Gocompiler</a> emits a single <code>jal</code> for calls and relies on itslinker to generate trampolines when the target is out of range.</li><li>In contrast, GCC and Clang emit <code>auipc</code>+<code>jalr</code>and rely on linker relaxation to shrink the sequence when possible.</li></ul><p>The <code>jal</code> range (±1MiB) is notably smaller than other RISCarchitectures (AArch64 ±128MiB, PowerPC64 ±32MiB, LoongArch ±128MiB).This limits the effectiveness of linker relaxation ("start large andshrink"), and leads to frequent trampolines when the compileroptimistically emits <code>jal</code> ("start small and grow").</p><h3 id="sparc">SPARC</h3><ul><li>Compare and branch (<code>cxbe</code>, <code>R_SPARC_5</code>): ±64bytes</li><li>Conditional branch (<code>bcc</code>, <code>R_SPARC_WDISP19</code>):±1MiB</li><li>Unconditional branch (<code>b</code>, <code>R_SPARC_WDISP22</code>):±8MiB</li><li><code>call</code>(<code>R_SPARC_WDISP30</code>/<code>R_SPARC_WPLT30</code>): ±2GiB</li></ul><p>With ±2GiB range for <code>call</code>, SPARC doesn't need rangeextension thunks in practice.</p><h3 id="superh">SuperH</h3><p>SuperH uses fixed-width 16-bit instructions, which limits branchranges.</p><ul><li>Conditional branch (<code>bf</code>/<code>bt</code>): ±256 bytes(8-bit displacement)</li><li>Unconditional branch (<code>bra</code>): ±4KiB (12-bitdisplacement)</li><li>Branch to subroutine (<code>bsr</code>): ±4KiB (12-bitdisplacement)</li></ul><p>For longer distances, register-indirect 
branches(<code>braf</code>/<code>bsrf</code>) are used. The compiler invertsconditions and emits these when targets exceed the short ranges.</p><p>SuperH is supported by GCC and binutils, but not by LLVM.</p><h3 id="xtensa">Xtensa</h3><p>Xtensa uses variable-length instructions: 16-bit (narrow,<code>.n</code> suffix) and 24-bit (standard).</p><ul><li>Narrow conditional branch (<code>beqz.n</code>/<code>bnez.n</code>,16-bit): -28 to +35 bytes (6-bit signed + 4)</li><li>Conditional branch (compare two registers)(<code>beq</code>/<code>bne</code>/<code>blt</code>/<code>bge</code>/etc,24-bit): ±256 bytes</li><li>Conditional branch (compare with zero)(<code>beqz</code>/<code>bnez</code>/<code>bltz</code>/<code>bgez</code>,24-bit): ±2KiB</li><li>Unconditional jump (<code>j</code>, 24-bit): ±128KiB</li><li>Call(<code>call0</code>/<code>call4</code>/<code>call8</code>/<code>call12</code>,24-bit): ±512KiB</li></ul><p>The assembler performs branch relaxation: when a conditional branchtarget is too far, it inverts the condition and inserts a <code>j</code>instruction.</p><p>Per <ahref="https://www.sourceware.org/binutils/docs/as/Xtensa-Call-Relaxation.html"class="uri">https://www.sourceware.org/binutils/docs/as/Xtensa-Call-Relaxation.html</a>,for calls, GNU Assembler pessimistically generates indirect sequences(<code>l32r</code>+<code>callx8</code>) when the target distance isunknown. GNU ld then performs linker relaxation.</p><h3 id="x86-64">x86-64</h3><ul><li>Short conditional jump (<code>Jcc rel8</code>): -128 to +127bytes</li><li>Short unconditional jump (<code>JMP rel8</code>): -128 to +127bytes</li><li>Near conditional jump (<code>Jcc rel32</code>): ±2GiB</li><li>Near unconditional jump (<code>JMP rel32</code>): ±2GiB</li></ul><p>With a ±2GiB range for near jumps, x86-64 rarely encountersout-of-range branches in practice. 
That said, Google and Meta Platformsdeploy mostly statically linked executables on x86-64 production serversand have run into the huge executable problem for certainconfigurations.</p><h3 id="zarchitecture">z/Architecture</h3><ul><li>Short conditional branch (<code>BRC</code>,<code>R_390_PC16DBL</code>): ±64KiB (16-bit halfword displacement)</li><li>Long conditional branch (<code>BRCL</code>,<code>R_390_PC32DBL</code>): ±4GiB (32-bit halfword displacement)</li><li>Short call (<code>BRAS</code>, <code>R_390_PC16DBL</code>):±64KiB</li><li>Long call (<code>BRASL</code>, <code>R_390_PC32DBL</code>):±4GiB</li></ul><p>With ±4GiB range for long forms, z/Architecture doesn't need linkerrange extension thunks. LLVM's <code>SystemZLongBranch</code> passrelaxes short branches (<code>BRC</code>/<code>BRAS</code>) to longforms (<code>BRCL</code>/<code>BRASL</code>) when targets are out ofrange.</p><h2 id="compiler-branch-range-handling">Compiler: branch rangehandling</h2><p>Conditional branch instructions usually have shorter ranges thanunconditional ones, making them less suitable for linker thunks (as wewill explore later). 
Compilers typically keep conditional branch targets within the same section, allowing the compiler to handle out-of-range cases via branch relaxation.</p><p>Within a function, conditional branches may still go out of range. The compiler measures branch distances and relaxes out-of-range branches by inverting the condition and inserting an unconditional branch:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"># Before relaxation (out of range)</span><br><span class="line">beq .Lfar_target       # ±4KiB range on RISC-V</span><br><span class="line"></span><br><span class="line"># After relaxation</span><br><span class="line">bne .Lskip             # Inverted condition, short range</span><br><span class="line">j .Lfar_target         # Unconditional jump, ±1MiB range</span><br><span class="line">.Lskip:</span><br></pre></td></tr></table></figure><p>Some architectures have conditional branch instructions that encode an extra immediate operand, leaving even fewer bits for the displacement. For example, AArch64's <code>tbz</code>/<code>tbnz</code> (test bit and branch), which encode the bit number to test, have only a ±32KiB range (<code>cbz</code>/<code>cbnz</code>, which compare against zero and need no extra immediate, reach ±1MiB). RISC-V Zibi <code>beqi</code>/<code>bnei</code> have a ±4KiB range. 
The compiler handles these in a similar way:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">// Before relaxation (tbz has ±32KiB range)</span><br><span class="line">  tbz w0, #0, far</span><br><span class="line"></span><br><span class="line">// After relaxation</span><br><span class="line">  tbnz w0, #0, .Lskip   // Inverted condition</span><br><span class="line">  b far                 // Unconditional branch, ±128MiB range</span><br><span class="line">.Lskip:</span><br></pre></td></tr></table></figure><p>An Intel employee contributed <a href="https://reviews.llvm.org/D41634" class="uri">https://reviews.llvm.org/D41634</a> (in 2017) to handle the case when inversion of a branch condition is impossible. This is for an out-of-tree backend. As of Jan 2026 there is no in-tree test for this code path.</p><p>In LLVM, this is handled by the <code>BranchRelaxation</code> pass, which runs just before <code>AsmPrinter</code>. Different backends have their own implementations:</p><ul><li><code>BranchRelaxation</code>: AArch64, AMDGPU, AVR, RISC-V</li><li><code>HexagonBranchRelaxation</code>: Hexagon</li><li><code>PPCBranchSelector</code>: PowerPC</li><li><code>SystemZLongBranch</code>: SystemZ</li><li><code>MipsBranchExpansion</code>: MIPS</li><li><code>MSP430BSel</code>: MSP430</li></ul><p>The generic <code>BranchRelaxation</code> pass computes block sizes and offsets, then iterates until all branches are in range. For conditional branches, it tries to invert the condition and insert an unconditional branch. 
For unconditional branches that are still out ofrange, it calls <code>TargetInstrInfo::insertIndirectBranch</code> toemit an indirect jump sequence (e.g.,<code>adrp</code>+<code>add</code>+<code>br</code> on AArch64) or a longjump sequence (e.g., pseudo <code>jump</code> on RISC-V).</p><p>Note: The size estimates may be inaccurate due to inline assembly.LLVM uses heuristics to estimate inline assembly sizes, but for certainassembly constructs the size is not precisely known at compile time.</p><p>Unconditional branches and calls can target different sections sincethey have larger ranges. If the target is out of reach, the linker caninsert thunks to extend the range.</p><p>For x86-64, the large code model uses multiple instructions for callsand jumps to support text sections larger than 2GiB (see <ahref="/blog/2023-05-14-relocation-overflow-and-code-models#x86-64-large-code-model">Relocationoverflow and code models: x86-64 large code model</a>). This is apessimization if the callee ends up being within reach. Google and MetaPlatforms have interest in allowing range extension thunks as areplacement for the multiple instructions.</p><h2 id="assembler-instruction-relaxation">Assembler: instructionrelaxation</h2><p>The assembler converts assembly to machine code. When the target of abranch is within the same section and the distance is known at assemblytime, the assembler can select the appropriate encoding. 
This isdistinct from linker thunks, which handle cross-section or cross-objectreferences where distances aren't known until link time.</p><p>Assembler instruction relaxation handles two cases (see <ahref="/blog/2024-04-27-clang-o0-output-branch-displacement-and-size-increase">Clang-O0 output: branch displacement and size increase</a> for examples):</p><ul><li><strong>Span-dependent instructions</strong>: Select an appropriateencoding based on displacement.<ul><li>On x86, a short jump (<code>jmp rel8</code>) can be relaxed to anear jump (<code>jmp rel32</code>) when the target is far.</li><li>On RISC-V, <code>beqz</code> may be assembled to the 2-byte<code>c.beqz</code> when the displacement fits within ±256 bytes.</li></ul></li><li><strong>Conditional branch transform</strong>: Invert the conditionand insert an unconditional branch. On RISC-V, a <code>blt</code> mightbe relaxed to <code>bge</code> plus an unconditional branch.</li></ul><p>The assembler uses an iterative layout algorithm that alternatesbetween fragment offset assignment and relaxation until all fragmentsbecome legalized. See <ahref="/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19">Integratedassembler improvements in LLVM 19</a> for implementation details.</p><h2 id="linker-range-extension-thunks">Linker: range extensionthunks</h2><p>When the linker resolves relocations, it may discover that a branchtarget is out of range. At this point, the instruction encoding isfixed, so the linker cannot simply change the instruction. Instead, itgenerates <strong>range extension thunks</strong> (also called veneers,branch stubs, or trampolines).</p><p>A thunk is a small piece of linker-generated code that can reach theactual target using a longer sequence of instructions. The originalbranch is redirected to the thunk, which then jumps to the realdestination.</p><p>Range extension thunks are one type of linker-generated thunk. 
Othertypes include:</p><ul><li><strong>ARM interworking veneers</strong>: Switch between ARM andThumb instruction sets (see <ahref="/blog/2023-04-23-linker-notes-on-aarch32">Linker notes onAArch32</a>)</li><li><strong>MIPS LA25 thunks</strong>: Enable PIC and non-PIC codeinteroperability (see <ahref="/blog/2023-09-04-toolchain-notes-on-mips">Toolchain notes onMIPS</a>)</li><li><strong>PowerPC64 TOC/NOTOC thunks</strong>: Handle calls betweenfunctions using different TOC pointer conventions (see <ahref="/blog/2023-02-26-linker-notes-on-power-isa">Linker notes on PowerISA</a>)</li></ul><h3 id="short-range-vs-long-range-thunks">Short range vs long rangethunks</h3><p>A <strong>short range thunk</strong> (see <ahref="https://reviews.llvm.org/D148701">lld/ELF's AArch64implementation</a>) contains just a single branch instruction. Since ituses a branch, its reach is also limited by the branch range—it can onlyextend coverage by one branch distance. For targets further away,multiple short range thunks can be chained, or a long range thunk withaddress computation must be used.</p><p>Long range thunks use indirection and can jump to (practically)arbitrary locations.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">// Short range thunk: single branch, 4 bytes</span><br><span class="line">__AArch64AbsLongThunk_dst:</span><br><span class="line">  b dst                         // ±128MiB range</span><br><span class="line"></span><br><span class="line">// Long range thunk: address computation, 12 bytes</span><br><span class="line">__AArch64ADRPThunk_dst:</span><br><span class="line">  adrp x16, dst                 // Load page 
address (±4GiB range)</span><br><span class="line">  add x16, x16, :lo12:dst       // Add page offset</span><br><span class="line">  br x16                        // Indirect branch</span><br></pre></td></tr></table></figure><h3 id="thunk-examples">Thunk examples</h3><p><strong>AArch32 (PIC)</strong> (see <ahref="/blog/2023-04-23-linker-notes-on-aarch32">Linker notes onAArch32</a>): <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">__ARMV7PILongThunk_dst:</span><br><span class="line">  movw ip, :lower16:(dst - .)   ; ip = intra-procedure-call scratch register</span><br><span class="line">  movt ip, :upper16:(dst - .)</span><br><span class="line">  add ip, ip, pc</span><br><span class="line">  bx ip</span><br></pre></td></tr></table></figure></p><p><strong>PowerPC64 ELFv2</strong> (see <ahref="/blog/2023-02-26-linker-notes-on-power-isa">Linker notes on PowerISA</a>): <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">__long_branch_dst:</span><br><span class="line">  addis 12, 2, .branch_lt@ha    # Load high bits from branch lookup table</span><br><span class="line">  ld 12, .branch_lt@l(12)       # Load target address</span><br><span class="line">  mtctr 12                      # Move to count register</span><br><span class="line">  bctr                          # Branch to count register</span><br></pre></td></tr></table></figure></p><h3 id="thunk-impact-on-debugging-and-profiling">Thunk impact ondebugging and profiling</h3><p>Thunks are transparent at the source level but visible in 
low-level tools:</p><ul><li><strong>Stack traces</strong>: May show thunk symbols (e.g., <code>__AArch64ADRPThunk_foo</code>) between caller and callee</li><li><strong>Profilers</strong>: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function</li><li><strong>Disassembly</strong>: <code>objdump</code> or <code>llvm-objdump</code> will show thunk sections interspersed with regular code</li><li><strong>Code size</strong>: Each thunk adds bytes; large binaries may have thousands of thunks</li></ul><h3 id="lldelfs-thunk-creation-algorithm">lld/ELF's thunk creation algorithm</h3><p>lld/ELF uses a multi-pass algorithm in <code>finalizeAddressDependentContent</code>:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">assignAddresses</span>();</span><br><span class="line"><span class="keyword">for</span> (pass = <span class="number">0</span>; pass &lt; <span class="number">30</span>; ++pass) &#123;</span><br><span class="line">  <span class="comment">// Pre-create empty ThunkSections with a step size of 
about 2 * thunkSectionSpacing.</span></span><br><span class="line">  <span class="comment">// This ensures that for the most common thunk-needing relocation type, the relocations</span></span><br><span class="line">  <span class="comment">// can find a ThunkSection within thunkSectionSpacing bytes.</span></span><br><span class="line">  <span class="keyword">if</span> (pass == <span class="number">0</span>)</span><br><span class="line">    <span class="built_in">createInitialThunkSections</span>();</span><br><span class="line"></span><br><span class="line">  <span class="type">bool</span> changed = <span class="literal">false</span>;</span><br><span class="line">  <span class="keyword">for</span> (relocation : all_relocations) &#123;</span><br><span class="line">    <span class="comment">// If this relocation needs a thunk and the thunk is still in range, skip.</span></span><br><span class="line">    <span class="comment">// Otherwise, restore the original relocation.</span></span><br><span class="line">    <span class="keyword">if</span> (pass &gt; <span class="number">0</span> &amp;&amp; <span class="built_in">normalizeExistingThunk</span>(rel))</span><br><span class="line">      <span class="keyword">continue</span>;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">needsThunk</span>(rel)) <span class="keyword">continue</span>;</span><br><span class="line">    Thunk *t = <span class="built_in">getOrCreateThunk</span>(rel);</span><br><span class="line">    ts = <span class="built_in">findOrCreateThunkSection</span>(rel, src);</span><br><span class="line">    ts-&gt;<span class="built_in">addThunk</span>(t);</span><br><span class="line">    rel.sym = t-&gt;<span class="built_in">getThunkTargetSym</span>();  <span class="comment">// redirect</span></span><br><span class="line">    changed = <span class="literal">true</span>;</span><br><span class="line">  &#125;</span><br><span class="line">  
<span class="comment">// Intersperse ThunkSections and regular input sections.</span></span><br><span class="line">  <span class="built_in">mergeThunks</span>();</span><br><span class="line">  <span class="keyword">if</span> (!changed) <span class="keyword">break</span>;</span><br><span class="line">  <span class="built_in">assignAddresses</span>();  <span class="comment">// recalculate with new thunks</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Key details:</p><ul><li><strong>Multi-pass</strong>: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.</li><li><strong>Pre-allocated ThunkSections</strong>: On pass 0, <code>createInitialThunkSections</code> places empty <code>ThunkSection</code>s at regular intervals (<code>thunkSectionSpacing</code>). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.</li><li><strong>Thunk reuse</strong>: <code>getThunk</code> returns an existing thunk if one exists for the same target; <code>normalizeExistingThunk</code> checks if a previously-created thunk is still in range.</li><li><strong>ThunkSection placement</strong>: <code>getISDThunkSec</code> finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.</li></ul><h3 id="lldmachos-thunk-creation-algorithm">lld/MachO's thunk creation algorithm</h3><p>lld/MachO uses a single-pass algorithm in <code>TextOutputSection::finalize</code>:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> (callIdx = <span class="number">0</span>; callIdx &lt; inputs.<span class="built_in">size</span>(); ++callIdx) &#123;</span><br><span class="line">  <span class="comment">// Finalize sections within forward branch range (minus slop)</span></span><br><span class="line">  <span class="keyword">while</span> (finalIdx &lt; endIdx &amp;&amp; <span class="built_in">fits_in_range</span>(inputs[finalIdx]))</span><br><span class="line">    <span class="built_in">finalizeOne</span>(inputs[finalIdx++]);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Process branch relocations in this section</span></span><br><span class="line">  <span class="keyword">for</span> (Relocation &amp;r : <span class="built_in">reverse</span>(isec-&gt;relocs)) &#123;</span><br><span class="line">    <span class="keyword">if</span> (!<span class="built_in">isBranchReloc</span>(r)) <span class="keyword">continue</span>;</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">targetInRange</span>(r)) <span class="keyword">continue</span>;</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">existingThunkInRange</span>(r)) &#123; reuse it; <span class="keyword">continue</span>; &#125;</span><br><span class="line">    <span class="comment">// Create new thunk and finalize it</span></span><br><span class="line">    <span class="built_in">createThunk</span>(r);</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Key differences from lld/ELF:</p><ul><li><strong>Single pass</strong>: Addresses are assigned monotonically and never revisited</li><li><strong>Slop reservation</strong>: Reserves <code>slopScale * thunkSize</code> bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks</li><li><strong>Thunk 
naming</strong>: <code>&lt;function&gt;.thunk.&lt;sequence&gt;</code> where sequence increments per target</li></ul><p><a href="https://github.com/llvm/llvm-project/issues/50920">Thunk starvation problem</a>: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test <code>lld/test/MachO/arm64-thunk-starvation.s</code> demonstrates this edge case. The mitigation is increasing <code>--slop-scale</code>, but pathological cases with hundreds of consecutive out-of-range callees can still fail.</p><h3 id="molds-thunk-creation-algorithm">mold's thunk creation algorithm</h3><p>mold uses a two-pass approach:</p><ul><li>Pessimistically over-allocate thunks. Out-of-section relocations and relocations referencing a section not yet assigned an address pessimistically need thunks (<code>requires_thunk(ctx, isec, rel, first_pass)</code> when <code>first_pass=true</code>).</li><li>Then remove unnecessary ones.</li></ul><p>Linker pass ordering:</p><ul><li><code>compute_section_sizes()</code> calls <code>create_range_extension_thunks()</code> — final section addresses are NOT yet known</li><li><code>set_osec_offsets()</code> assigns section addresses</li><li><code>remove_redundant_thunks()</code> is called AFTER addresses are known — it removes thunks that out-of-section relocations turn out not to need</li><li>Rerun <code>set_osec_offsets()</code></li></ul><p><strong>Pass 1</strong> (<code>create_range_extension_thunks</code>): Process sections in batches using a sliding window. 
The window tracks four positions:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">Sections:   [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...</span><br><span class="line">             ^       ^       ^           ^</span><br><span class="line">             A       B       C           D</span><br><span class="line">             |       |_______|           |</span><br><span class="line">             |         batch             |</span><br><span class="line">             |                           |</span><br><span class="line">             earliest                    thunk</span><br><span class="line">             reachable                   placement</span><br><span class="line">             from C</span><br></pre></td></tr></table></figure><ul><li><strong>[B, C)</strong> = current batch of sections to process (size ≤ branch_distance/5)</li><li><strong>A</strong> = earliest section still reachable from C (for thunk expiration)</li><li><strong>D</strong> = where to place the thunk (furthest point reachable from B)</li></ul><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Simplified from OutputSection&lt;E&gt;::create_range_extension_thunks</span></span><br><span class="line"><span class="keyword">while</span> (b &lt; sections.<span class="built_in">size</span>()) &#123;</span><br><span class="line">  <span class="comment">// Advance D: find furthest point where thunk is reachable from B</span></span><br><span class="line">  <span class="keyword">while</span> (d &lt; size &amp;&amp; thunk_at_d_reachable_from_b)</span><br><span class="line">    <span class="built_in">assign_address</span>(sections[d++]);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Compute batch [B, C)</span></span><br><span class="line">  c = b + <span class="number">1</span>;</span><br><span class="line">  <span class="keyword">while</span> (c &lt; d &amp;&amp; sections[c] &lt; sections[b] + batch_size) c++;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Advance A: expire thunks no longer reachable</span></span><br><span class="line">  <span class="keyword">while</span> (a &lt; b &amp;&amp; sections[a] + branch_distance &lt; sections[c]) a++;</span><br><span class="line">  <span class="comment">// Expire thunk groups before A: clear symbol flags.</span></span><br><span class="line">  <span class="keyword">for</span> (; t &lt; thunks.<span class="built_in">size</span>() &amp;&amp; thunks[t].offset &lt; sections[a]; 
t++)</span><br><span class="line">    <span class="keyword">for</span> (sym in thunks[t].symbols) sym-&gt;flags = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Scan [B,C) relocations. If a symbol is not assigned to a thunk group yet,</span></span><br><span class="line">  <span class="comment">// assign it to the new thunk group at D.</span></span><br><span class="line">  <span class="keyword">auto</span> &amp;thunk = thunks.<span class="built_in">emplace_back</span>(<span class="keyword">new</span> <span class="built_in">Thunk</span>(offset));</span><br><span class="line">  <span class="built_in">parallel_for</span>(b, c, [&amp;](i64 i) &#123;</span><br><span class="line">    <span class="keyword">for</span> (rel in sections[i].relocs) &#123;</span><br><span class="line">      <span class="keyword">if</span> (<span class="built_in">requires_thunk</span>(rel)) &#123;</span><br><span class="line">        Symbol &amp;sym = rel.symbol;</span><br><span class="line">        <span class="keyword">if</span> (!sym.flags.<span class="built_in">test_and_set</span>()) &#123;  <span class="comment">// atomic: skip if already set</span></span><br><span class="line">          lock_guard <span class="built_in">lock</span>(mu);</span><br><span class="line">          thunk.symbols.<span class="built_in">push_back</span>(&amp;sym);</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;);</span><br><span class="line">  offset += thunk.<span class="built_in">size</span>();</span><br><span class="line">  b = c;  <span class="comment">// Move to next batch</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><strong>Pass 2</strong> (<code>remove_redundant_thunks</code>): After final addresses are known, remove thunk entries for symbols actually in range.</p><p>Key 
characteristics:</p><ul><li><strong>Pessimistic over-allocation</strong>: Assumes all out-of-section calls need thunks; safe to shrink later</li><li><strong>Batch size</strong>: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)</li><li><strong>Parallelism</strong>: Uses TBB for parallel relocation scanning within each batch</li><li><strong>Single branch range</strong>: Uses one conservative <code>branch_distance</code> per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.</li><li><strong>Thunk size not accounted in D-advancement</strong>: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.</li><li><strong>No convergence loop</strong>: Single forward pass for address assignment, no risk of non-convergence</li></ul><h3 id="gnu-lds-thunk-creation-algorithm">GNU ld's thunk creation algorithm</h3><p>Each port implements the algorithm on its own. 
There is no code sharing.</p><p>GNU ld's AArch64 port (<code>bfd/elfnn-aarch64.c</code>) uses an iterative algorithm but with a single stub type and no lookup table.</p><p><strong>Main iteration loop</strong> (<code>elfNN_aarch64_size_stubs()</code>):</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">group_sections(htab, stub_group_size, ...);  <span class="comment">// Default: 127 MiB</span></span><br><span class="line">layout_sections_again();</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> (;;) &#123;</span><br><span class="line">  stub_changed = <span class="literal">false</span>;</span><br><span class="line">  _bfd_aarch64_add_call_stub_entries(&amp;stub_changed, ...);</span><br><span class="line">  <span class="keyword">if</span> (!stub_changed)</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">  _bfd_aarch64_resize_stubs(htab);</span><br><span class="line">  layout_sections_again();</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>GNU ld's ppc64 port (<code>bfd/elf64-ppc.c</code>) uses an iterative multi-pass algorithm with a branch lookup table (<code>.branch_lt</code>) for long-range stubs.</p><p><strong>Section grouping</strong>: Sections are grouped by <code>stub_group_size</code> (~28-30 MiB default); each group gets one stub section. 
For 14-bit conditional branches (<code>R_PPC64_REL14</code>, ±32KiB range), group size is reduced by 1024x.</p><p><strong>Main iteration loop</strong> (<code>ppc64_elf_size_stubs()</code>):</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">while</span> (<span class="number">1</span>) &#123;</span><br><span class="line">  <span class="comment">// Scan all relocations in all input sections</span></span><br><span class="line">  <span class="keyword">for</span> (input_bfd; section; irela) &#123;</span><br><span class="line">    <span class="comment">// Only process branch relocations (R_PPC64_REL24, R_PPC64_REL14, etc.)</span></span><br><span class="line">    stub_type = ppc_type_of_stub(section, irela, ...);</span><br><span class="line">    <span class="keyword">if</span> (stub_type == ppc_stub_none)</span><br><span class="line">      <span class="keyword">continue</span>;</span><br><span class="line">    <span class="comment">// Create or merge stub entry</span></span><br><span class="line">    stub_entry = ppc_add_stub(...);</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Size all stubs, potentially upgrading long_branch to 
plt_branch</span></span><br><span class="line">  bfd_hash_traverse(&amp;stub_hash_table, ppc_size_one_stub, ...);</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Check for convergence</span></span><br><span class="line">  <span class="keyword">if</span> (!stub_changed &amp;&amp; all_sizes_stable)</span><br><span class="line">    <span class="keyword">break</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Re-layout sections</span></span><br><span class="line">  layout_sections_again();</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><strong>Convergence control</strong>:</p><ul><li><code>STUB_SHRINK_ITER = 20</code> (<a href="https://sourceware.org/PR28827">PR28827</a>): After 20 iterations, stub sections only grow (prevents oscillation)</li><li>Convergence when: <code>!stub_changed &amp;&amp; all section sizes stable</code></li></ul><p><strong>Stub type upgrade</strong>: <code>ppc_type_of_stub()</code> initially returns <code>ppc_stub_long_branch</code> for out-of-range branches. 
Later, <code>ppc_size_one_stub()</code> checks if the stub's branch can reach; if not, it upgrades to <code>ppc_stub_plt_branch</code> and allocates an 8-byte entry in <code>.branch_lt</code>.</p><h3 id="comparing-linker-thunk-algorithms">Comparing linker thunk algorithms</h3><table><colgroup><col style="width: 16%" /><col style="width: 18%" /><col style="width: 22%" /><col style="width: 12%" /><col style="width: 29%" /></colgroup><thead><tr><th>Aspect</th><th>lld/ELF</th><th>lld/MachO</th><th>mold</th><th>GNU ld ppc64</th></tr></thead><tbody><tr><td>Passes</td><td>Multi (max 30)</td><td>Single</td><td>Two</td><td>Multi (shrink after 20)</td></tr><tr><td>Strategy</td><td>Iterative refinement</td><td>Sliding window</td><td>Sliding window</td><td>Iterative refinement</td></tr><tr><td>Thunk placement</td><td>Pre-allocated intervals</td><td>Inline with slop</td><td>Batch intervals</td><td>Per stub-group</td></tr></tbody></table><h2 id="linker-relaxation">Linker relaxation</h2><p>Some architectures take a different approach: instead of only expanding branches, the linker can also <strong>shrink</strong> instruction sequences when the target is close enough. RISC-V and LoongArch both use this technique. 
See <a href="/blog/2021-03-14-the-dark-side-of-riscv-linker-relaxation">The dark side of RISC-V linker relaxation</a> for a deeper dive into the complexities and tradeoffs.</p><p>Consider a function call using the <code>call</code> pseudo-instruction, which expands to <code>auipc</code> + <code>jalr</code>: <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"># Before linking (8 bytes)</span><br><span class="line">call ext</span><br><span class="line"># Expands to:</span><br><span class="line">#   auipc ra, %pcrel_hi(ext)</span><br><span class="line">#   jalr ra, ra, %pcrel_lo(ext)</span><br></pre></td></tr></table></figure></p><p>If <code>ext</code> is within ±1MiB, the linker can relax this to: <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># After relaxation (4 bytes)</span><br><span class="line">jal ext</span><br></pre></td></tr></table></figure></p><p>This is enabled by <code>R_RISCV_RELAX</code> relocations that accompany <code>R_RISCV_CALL_PLT</code> relocations. 
The <code>R_RISCV_RELAX</code> relocation signals to the linker that this instruction sequence is a candidate for shrinking.</p><p>Example object code before linking: <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">0000000000000006 &lt;foo&gt;:</span><br><span class="line">       6: 97 00 00 00   auipc   ra, 0</span><br><span class="line">                R_RISCV_CALL_PLT ext</span><br><span class="line">                R_RISCV_RELAX *ABS*</span><br><span class="line">       a: e7 80 00 00   jalr    ra</span><br><span class="line">       e: 97 00 00 00   auipc   ra, 0</span><br><span class="line">                R_RISCV_CALL_PLT ext</span><br><span class="line">                R_RISCV_RELAX *ABS*</span><br><span class="line">      12: e7 80 00 00   jalr    ra</span><br></pre></td></tr></table></figure></p><p>After linking with relaxation enabled, the 8-byte <code>auipc</code>+<code>jalr</code> pairs become 4-byte <code>jal</code> instructions: <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">0000000000000244 &lt;foo&gt;:</span><br><span class="line">     244: 41 11         addi    sp, sp, -16</span><br><span class="line">     246: 06 e4         sd      ra, 8(sp)</span><br><span class="line">     248: ef 00 80 01   jal     ext</span><br><span class="line">     24c: ef 00 40 01   jal     ext</span><br><span class="line">     250: ef 00 00 01   jal     
ext</span><br></pre></td></tr></table></figure></p><p>When the linker deletes instructions, it must also adjust:</p><ul><li>Subsequent instruction offsets within the section</li><li>Symbol addresses</li><li>Other relocations that reference affected locations</li><li>Alignment directives (<code>R_RISCV_ALIGN</code>)</li></ul><p>This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.</p><p>LoongArch uses a similar approach. A <code>pcaddu18i</code>+<code>jirl</code> sequence (<code>R_LARCH_CALL36</code>, ±128GiB range) can be relaxed to a single <code>bl</code> instruction (<code>R_LARCH_B26</code>, ±128MiB range) when the target is close enough.</p><h2 id="diagnosing-out-of-range-errors">Diagnosing out-of-range errors</h2><p>When you encounter a "relocation out of range" error, check the linker diagnostic to locate the relocatable file and the offending function, then determine how the function call is lowered in assembly.</p><h2 id="summary">Summary</h2><p>Handling long branches requires coordination across the toolchain:</p><table><colgroup><col style="width: 25%" /><col style="width: 40%" /><col style="width: 33%" /></colgroup><thead><tr><th>Stage</th><th>Technique</th><th>Example</th></tr></thead><tbody><tr><td>Compiler</td><td>Branch relaxation pass</td><td>Invert condition + add unconditional jump</td></tr><tr><td>Assembler</td><td>Instruction relaxation</td><td>Invert condition + add unconditional jump</td></tr><tr><td>Linker</td><td>Range extension thunks</td><td>Generate trampolines</td></tr><tr><td>Linker</td><td>Linker relaxation</td><td>Shrink <code>auipc</code>+<code>jalr</code> to <code>jal</code> (RISC-V)</td></tr></tbody></table><p>The linker's thunk generation is particularly important for large programs where function calls may exceed branch ranges. 
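<p>The convergence concern shared by the iterative algorithms above — inserting a thunk grows the layout, which can push a previously-in-range call out of range — can be sketched with a toy model. This is purely illustrative Python, not any linker's actual code; the byte sizes and the <code>count_thunks</code> helper are invented for the example.</p>

```python
RANGE = 100  # pretend branch reach in bytes (a real AArch64 bl reaches +/-128MiB)
THUNK = 12   # bytes per thunk, appended past the end of the text section

def count_thunks(calls, text_end):
    """calls: list of (call_offset, target_offset) pairs.
    Targets at or beyond text_end sit after the thunk area, so every
    thunk we append pushes them THUNK bytes further away."""
    thunks = set()
    passes = 0
    while True:
        passes += 1
        changed = False
        growth = THUNK * len(thunks)
        for off, target in calls:
            dest = target + growth if target >= text_end else target
            if abs(dest - off) > RANGE and (off, target) not in thunks:
                thunks.add((off, target))  # create a thunk for this call
                changed = True
        if not changed:  # converged: no new thunks were needed this pass
            return passes, len(thunks)

# The second call is out of range, so pass 1 adds a thunk; that growth
# pushes the first call (previously exactly in range) out of range too,
# so pass 2 adds another thunk, and pass 3 confirms convergence.
print(count_thunks([(0, 100), (4, 300)], text_end=50))  # -> (3, 2)
```

<p>The same cascade is why lld/ELF caps its loop at 30 passes and why GNU ld's ppc64 port stops shrinking stubs after 20 iterations.</p>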
Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.</p><p>The linker relaxation approach adopted by RISC-V and LoongArch is an alternative that avoids range extension thunks but introduces other complexities.</p><h2 id="related">Related</h2><ul><li><a href="/blog/2023-05-14-relocation-overflow-and-code-models">Relocation overflow and code models</a></li><li><a href="/blog/2023-04-23-linker-notes-on-aarch32">Linker notes on AArch32</a></li><li><a href="/blog/2023-03-05-linker-notes-on-aarch64">Linker notes on AArch64</a></li><li><a href="/blog/2023-02-26-linker-notes-on-power-isa">Linker notes on Power ISA</a></li><li><a href="/blog/2023-02-19-linker-notes-on-x86">Linker notes on x86</a></li><li><a href="/blog/2023-09-04-toolchain-notes-on-mips">Toolchain notes on MIPS</a></li><li><a href="/blog/2024-02-11-toolchain-notes-on-z-architecture">Toolchain notes on z/Architecture</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;Branch instructions on most architectures use PC-relative addressing
with a limited range. When the target is too far away, the branch
becomes &quot;out of range&quot; and requires special handling.&lt;/p&gt;
&lt;p&gt;Consider a large binary where &lt;code&gt;main()&lt;/code&gt; at address 0x10000
calls &lt;code&gt;foo()&lt;/code&gt; at address 0x8010000, over 128MiB away. On
AArch64, the &lt;code&gt;bl&lt;/code&gt; instruction can only reach ±128MiB, so this
call cannot be encoded directly. Without proper handling, the linker
would fail with an error like &quot;relocation out of range.&quot; The toolchain
must handle this transparently to produce correct executables.&lt;/p&gt;
&lt;p&gt;This article explores how compilers, assemblers, and linkers work
together to solve the long branch problem.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compiler (IR to assembly): Handles branches within a function that
exceed the range of conditional branch instructions&lt;/li&gt;
&lt;li&gt;Assembler (assembly to relocatable file): Handles branches within a
section where the distance is known at assembly time&lt;/li&gt;
&lt;li&gt;Linker: Handles cross-section and cross-object branches discovered
during final layout&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="binutils" scheme="https://maskray.me/blog/tags/binutils/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>Maintaining shadow branches for GitHub PRs</title>
    <link href="https://maskray.me/blog/2026-01-22-maintaining-shadow-branches-for-github-prs"/>
    <id>https://maskray.me/blog/2026-01-22-maintaining-shadow-branches-for-github-prs</id>
    <published>2026-01-22T08:00:00.000Z</published>
    <updated>2026-01-23T17:19:08.557Z</updated>
    
<content type="html"><![CDATA[<p>I've created <a href="https://github.com/MaskRay/pr-shadow">pr-shadow</a> with vibe coding, a tool that maintains a shadow branch for GitHub pull requests (PR) that never requires force-pushing. This addresses pain points I described in <a href="/blog/2023-09-09-reflections-on-llvm-switch-to-github-pull-requests#patch-evolution">Reflections on LLVM's switch to GitHub pull requests#Patch evolution</a>.</p><span id="more"></span><h2 id="the-problem">The problem</h2><p>GitHub structures pull requests around branches, enforcing a branch-centric workflow. There are multiple problems when you force-push a branch after a rebase:</p><ul><li>The UI displays "force-pushed the BB branch from X to Y". Clicking "compare" shows <code>git diff X..Y</code>, which includes unrelated upstream commits—not the actual patch difference. For a project like LLVM with 100+ commits daily, this makes the comparison essentially useless.</li><li>Inline comments may become "outdated" or misplaced after force pushes.</li><li>If your commit message references an issue or another PR, each force push creates a new link on the referenced page, cluttering it with duplicate mentions. (Adding backticks around the link text works around this, but it's not ideal.)</li></ul><p>These difficulties lead to recommendations favoring <a href="https://github.com/orgs/community/discussions/3478">less flexible workflows</a> that only append commits (including merge commits) and discourage rebases. 
However, this means working with an outdated base, and switching between the main branch and PR branches causes numerous rebuilds, especially painful for large repositories like llvm-project.</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">git switch main; git pull; ninja -C build</span><br><span class="line"></span><br><span class="line"><span class="comment"># Switching to a feature branch with an outdated base requires numerous rebuilds.</span></span><br><span class="line">git switch feature0</span><br><span class="line">git merge origin/main  <span class="comment"># I prefer `git rebase main` to remove merge commits, which clutter the history</span></span><br><span class="line">ninja -C out/release</span><br><span class="line"></span><br><span class="line"><span class="comment"># Switching to another feature branch with an outdated base requires numerous rebuilds.</span></span><br><span class="line">git switch feature1</span><br><span class="line">git merge origin/main</span><br><span class="line">ninja -C out/release</span><br><span class="line"></span><br><span class="line"><span class="comment"># Listing fixup commits ignoring upstream merges requires the clumsy --first-parent.</span></span><br><span class="line">git <span class="built_in">log</span> --first-parent</span><br></pre></td></tr></table></figure><p>In a large repository, avoiding rebases isn't realistic—other commits frequently modify nearby lines, and rebasing is often the only way to discover that 
your patch needs adjustments due to interactions with other landed changes.</p><p>In 2022, GitHub introduced "Pull request title and description" for squash merging. This means updating the final commit message requires editing via the web UI. I prefer editing the local commit message and syncing the PR description from it.</p><h2 id="the-solution">The solution</h2><p>After updating my <code>main</code> branch, before switching to a feature branch, I always run</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git rebase main feature</span><br></pre></td></tr></table></figure><p>to minimize the number of modified files. To avoid the force-push problems, I use pr-shadow to maintain a shadow PR branch (e.g., <code>pr/feature</code>) that only receives fast-forward commits (including merge commits).</p><p>I work freely on my local branch (rebase, amend, squash), then sync to the PR branch using <code>git commit-tree</code> to create a commit with the same tree but parented to the previous PR HEAD.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">Local branch (feature)     PR branch (pr/feature)</span><br><span class="line">        A                         A (init)</span><br><span class="line">        |                         |</span><br><span class="line">        B (amend)                 C1 &quot;Fix bug&quot;</span><br><span class="line">        |                         |</span><br><span class="line">        C (rebase)                C2 &quot;Address review&quot;</span><br></pre></td></tr></table></figure><p>Reviewers see clean diffs between C1 and C2, even though the underlying commits were rewritten.</p><p>When a 
rebase is detected (<code>git merge-base</code> with main/master changed), the new PR commit is created as a merge commit with the new merge-base as the second parent. GitHub displays these as "condensed" merges, preserving the diff view for reviewers.</p><h2 id="usage">Usage</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Initialize and create PR</span></span><br><span class="line">git switch -c feature</span><br><span class="line">edit &amp;&amp; git commit -m feature</span><br><span class="line"></span><br><span class="line"><span class="comment"># Set `git merge-base origin/main feature` as the initial base. Push to pr/feature and open a GitHub PR.</span></span><br><span class="line">prs init</span><br><span class="line"><span class="comment"># Same but create a draft PR. 
Repeated `init`s are rejected.</span></span><br><span class="line">prs init --draft</span><br><span class="line"></span><br><span class="line"><span class="comment"># Work locally (rebase, amend, etc.)</span></span><br><span class="line">git fetch origin main:main</span><br><span class="line">git rebase main</span><br><span class="line">git commit --amend</span><br><span class="line"></span><br><span class="line"><span class="comment"># Sync to PR</span></span><br><span class="line">prs push <span class="string">&quot;Rebase and fix bug&quot;</span></span><br><span class="line"><span class="comment"># Force push if remote diverged due to messing with pr/feature directly.</span></span><br><span class="line">prs push --force <span class="string">&quot;Rewrite&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Update PR title/body from local commit message.</span></span><br><span class="line">prs desc</span><br><span class="line"></span><br><span class="line"><span class="comment"># Run gh commands on the PR.</span></span><br><span class="line">prs gh view</span><br><span class="line">prs gh checks</span><br></pre></td></tr></table></figure><p>The tool supports both fork-based workflows (pushing to your fork) and same-repo workflows (for branches like <code>user/&lt;name&gt;/feature</code>). It also works with GitHub Enterprise, auto-detecting the host from the repository URL.</p><h2 id="related-work">Related work</h2><p>The name "prs" is a tribute to <a href="https://github.com/spacedentist/spr">spr</a>, which implements a similar shadow branch concept. However, spr pushes user branches to the main repository rather than a personal fork. While necessary for stacked pull requests, this approach is discouraged for single PRs as it clutters the upstream repository. 
pr-shadow avoids this by pushing to your fork by default.</p><p>I owe an apology to folks who receive <code>users/MaskRay/feature</code> branches (if they use the default <code>fetch = +refs/heads/*:refs/remotes/origin/*</code> to receive user branches). I had been abusing spr for a long time after <a href="/blog/2023-09-09-reflections-on-llvm-switch-to-github-pull-requests#patch-evolution">LLVM's GitHub transition</a> to avoid unnecessary rebuilds when switching between the main branch and PR branches.</p><p>Additionally, spr embeds a PR URL in commit messages (e.g., <code>Pull Request: https://github.com/llvm/llvm-project/pull/150816</code>), which can cause downstream forks to add unwanted backlinks to the original PR.</p><p>If I need stacked pull requests, I will probably use pr-shadow with the base patch and just rebase the stacked ones; it's unclear how spr handles stacked PRs.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;I&#39;ve created &lt;a
href=&quot;https://github.com/MaskRay/pr-shadow&quot;&gt;pr-shadow&lt;/a&gt; with vibe
coding, a tool that maintains a shadow branch for GitHub pull requests
(PR) that never requires force-pushing. This addresses pain points I
described in &lt;a
href=&quot;/blog/2023-09-09-reflections-on-llvm-switch-to-github-pull-requests#patch-evolution&quot;&gt;Reflections
on LLVM&#39;s switch to GitHub pull requests#Patch evolution&lt;/a&gt;.&lt;/p&gt;</summary>
    
    
    
    
    <category term="github" scheme="https://maskray.me/blog/tags/github/"/>
    
    <category term="git" scheme="https://maskray.me/blog/tags/git/"/>
    
  </entry>
  
  <entry>
    <title>2025年总结</title>
    <link href="https://maskray.me/blog/2025-12-31-summary"/>
    <id>https://maskray.me/blog/2025-12-31-summary</id>
    <published>2025-12-31T08:00:00.000Z</published>
    <updated>2026-01-01T05:28:06.576Z</updated>
    
    <content type="html"><![CDATA[<p>TODO</p><p>一如既往，主要在工具链领域耕耘。但由于工作忙碌在opensource社区投入的时间减少了。</p><h2 id="blogging">Blogging</h2><p>不包括这篇总结，一共写了18篇文章。</p><ul><li><ahref="/blog/2025-01-12-understanding-and-improving-clang-ftime-report">Understandingand improving Clang -ftime-report</a></li><li><a href="/blog/2025-01-20-natural-loops">Natural loops</a></li><li><a href="/blog/2025-02-02-lld-20-elf-changes">lld 20 ELFchanges</a></li><li><a href="/blog/2025-02-17-migrating-comments-to-giscus">Migratingcomments to giscus</a></li><li><a href="/blog/2025-03-09-compiling-c++-with-clang-api">CompilingC++ with the Clang API</a></li><li><ahref="/blog/2025-03-16-relocation-generation-in-assemblers">Relocationgeneration in assemblers</a></li><li><ahref="/blog/2025-04-06-llvm-integrated-assembler-improving-mcexpr-mcvalue">LLVMintegrated assembler: Improving MCExpr and MCValue</a></li><li><ahref="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">LLVMintegrated assembler: Improving expressions and relocations</a></li><li><a href="/blog/2025-07-13-gcc-miscompiles-llvm">GCC 13.3.0miscompiles LLVM</a></li><li><ahref="/blog/2025-07-27-llvm-integrated-assembler-engineering-better-fragments">LLVMintegrated assembler: Engineering better fragments</a></li><li><ahref="/blog/2025-08-17-llvm-integrated-assembler-improving-sections-and-symbols">LLVMintegrated assembler: Improving sections and symbols</a></li><li><ahref="/blog/2025-08-24-understanding-alignment-from-source-to-object-file">Understandingalignment - from source to object file</a></li><li><ahref="/blog/2025-08-31-benchmarking-compression-programs">Benchmarkingcompression programs</a></li><li><a href="/blog/2025-09-07-lld-21-elf-changes">lld 21 ELFchanges</a></li><li><a href="/blog/2025-09-28-remarks-on-sframe">Remarks onSFrame</a></li><li><ahref="/blog/2025-10-26-stack-walking-space-and-time-trade-offs">Stackwalking: space and time 
trade-offs</a></li><li><ahref="/blog/2025-12-07-sacramento-travelogue">Sacramento游记</a></li><li><a href="/blog/2025-12-14-weak-avl-tree">Weak AVL Tree</a></li></ul><h2 id="llvm-project">llvm-project</h2><ul><li><p>翻新了integrated assembler，写了4篇相关的blog posts: <ahref="https://maskray.me/blog/tags/assembler/"class="uri">https://maskray.me/blog/tags/assembler/</a></p></li><li><p><ahref="https://github.com/llvm/llvm-project/pulls?q=sort%3Aupdated-desc+is%3Apr+created%3A%3E2025-01-01+reviewed-by%3AMaskRay">Reviewednumerous patches</a>. query<code>is:pr created:&gt;2025-01-01 reviewed-by:MaskRay</code> =&gt; "989Closed"</p></li></ul><h2 id="linux-kernel">Linux kernel</h2><p>贡献了两个commits，被引用了一次。</p><h2 id="ccls">ccls</h2><ul><li><code>clang.prependArgs</code></li><li>支持了LLVM 21和22</li></ul><h2 id="elf-specification">ELF specification</h2><p>尝试推进<ahref="/blog/2024-03-31-a-compact-section-header-table-for-elf">compactsection header table</a>，没有取得共识。 一些成员希望采用generalcompression (likezstd)的方式，像<code>SHF_COMPRESSED</code>那样压缩section headertable。包括我在内的另一些人不喜欢采用general compression。</p><h2 id="misc">Misc</h2><p>Reported 6 feature requests or bugs to binutils.</p><ul><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=33633"><code>ld --build-id does not use symtab/strtab content</code></a></li><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=33370"><code>gas: monolithic .sframe violates COMDAT group rule</code></a></li><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=33336"><code>gas: Clarify whitespace between a label's symbol and its colon</code></a></li><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=33331"><code>ld: Add --print-gc-sections=file</code></a></li><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=33236"><code>ld riscv: Relocatable linking challenge with R_RISCV_ALIGN</code></a></li><li><ahref="https://sourceware.org/bugzilla/show_bug.cgi?id=32720"><code>ld: add --why-live</code></a></li></ul><h2 
id="旅行">旅行</h2><ul><li>第一次去：台南、西安、兰州、天水、Sacramento、Puerto Vallarta,Jalisco, Mexico、Mazatlán, Sinaloa, Mexico</li><li>曾经去过：台北(上一次是近11年前)、北京</li></ul>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;TODO&lt;/p&gt;
&lt;p&gt;一如既往，主要在工具链领域耕耘。但由于工作忙碌在open
source社区投入的时间减少了。&lt;/p&gt;
&lt;h2 id=&quot;blogging&quot;&gt;Blogging&lt;/h2&gt;
&lt;p&gt;不包括这篇总结，一共写了18篇文章。&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a
href</summary>
      
    
    
    
    
    <category term="summary" scheme="https://maskray.me/blog/tags/summary/"/>
    
  </entry>
  
  <entry>
    <title>Weak AVL Tree</title>
    <link href="https://maskray.me/blog/2025-12-14-weak-avl-tree"/>
    <id>https://maskray.me/blog/2025-12-14-weak-avl-tree</id>
    <published>2025-12-14T08:00:00.000Z</published>
    <updated>2026-02-16T06:56:40.580Z</updated>
    
<content type="html"><![CDATA[<p>tl;dr: Weak AVL trees are replacements for AVL trees and red-black trees.</p><p>The 2014 paper <a href="https://sidsen.azurewebsites.net/papers/rb-trees-talg.pdf"><em>Rank-Balanced Trees</em></a> (Haeupler, Sen, Tarjan) presents a framework using ranks and rank differences to define balanced binary search trees.</p><ul><li>Each node has a non-negative integer rank <code>r(x)</code>. Null nodes have rank -1.</li><li>The rank difference of a node <code>x</code> with parent <code>p(x)</code> is <code>r(p(x)) − r(x)</code>.</li><li>A node is <code>i,j</code> if its children have rank differences <code>i</code> and <code>j</code> (unordered), e.g., a 1,2 node has children with rank differences 1 and 2.</li><li>A node is called a 1-child if its rank difference is 1.</li></ul><p>Several balanced trees fit this framework:</p><ul><li>AVL tree: Ranks are defined as heights. Every node is 1,1 or 1,2 (rank differences of children).</li><li>Red-Black tree: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. (red: 0-child; black: 1-child; null nodes are black)</li><li>Weak AVL tree (the new tree described by this paper): All rank differences are 1 or 2, and every leaf has rank 0.<ul><li>A weak AVL tree without 2,2 nodes is an AVL tree.</li></ul></li></ul><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">AVL trees ⫋ weak AVL trees ⫋ red-black trees</span><br></pre></td></tr></table></figure><h2 id="weak-avl-tree">Weak AVL Tree</h2><p>Weak AVL trees are replacements for AVL trees and red-black trees. A single insertion or deletion operation requires at most two rotations (forming a double rotation when two are needed). In contrast, AVL deletion requires O(log n) rotations, and red-black deletion requires up to three.</p><p>Without deletions, a weak AVL tree is exactly an AVL tree. 
With deletions, its height remains at most that of an AVL tree with the same number of insertions but no deletions.</p><p>The rank rules imply:</p><ul><li>Null nodes have rank -1, leaves have rank 0, unary nodes have rank 1.</li></ul><h3 id="insertion">Insertion</h3><p>The new node <code>x</code> has a rank of 0, changed from the null node of rank -1. There are three cases.</p><ul><li>If the tree was previously empty, the new node becomes the root.</li><li>If the parent of the new node was previously a unary node (1,2 node), it is now a 1,1 binary node.</li><li>If the parent of the new node was previously a leaf (1,1 node), it is now a 0,1 binary node, leading to a rank violation.</li></ul><p>When the tree was previously non-empty, <code>x</code> has a parent node. We call the following subroutine with <code>x</code> indicating the new node to handle the second and third cases.</p><p>The following subroutine handles the rank increase of <code>x</code>. We call <code>break</code> if there is no more rank violation, i.e. 
we are done.</p><p>The 2014 paper isn't very clear about the conditions.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Assume that x&#x27;s rank has just increased by 1 and rank_diff(x) has been updated.</span></span><br><span class="line"></span><br><span class="line">p = x-&gt;parent;</span><br><span class="line"><span class="keyword">if</span> (<span class="built_in">rank_diff</span>(x) == <span class="number">1</span>) &#123;</span><br><span class="line">  <span class="comment">// x was previously a 2-child before increasing x-&gt;rank.</span></span><br><span class="line">  <span 
class="comment">// Done.</span></span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">  <span class="keyword">for</span> (;;) &#123;</span><br><span class="line">    <span class="comment">// Otherwise, p is a 0,1 node (previously a 1,1 node before increasing x-&gt;rank).</span></span><br><span class="line">    <span class="comment">// x being a 0-child is a violation.</span></span><br><span class="line"></span><br><span class="line">    Promote p.</span><br><span class="line">    <span class="comment">// Since we have promoted both x and p, it&#x27;s as if rank_diff(x&#x27;s sibling) is flipped.</span></span><br><span class="line">    <span class="comment">// p is now a 1,2 node.</span></span><br><span class="line"></span><br><span class="line">    x = p;</span><br><span class="line">    p = p-&gt;parent;</span><br><span class="line">    <span class="comment">// x is a 1,2 node.</span></span><br><span class="line">    <span class="keyword">if</span> (!p) <span class="keyword">break</span>;</span><br><span class="line">    d = p-&gt;ch[<span class="number">1</span>] == x;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">rank_diff</span>(x) == <span class="number">1</span>) &#123; <span class="keyword">break</span>; &#125;</span><br><span class="line">    <span class="comment">// Otherwise, x is a 0-child, leading to a new rank violation.</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">auto</span> sib = p-&gt;ch[!d];</span><br><span class="line">    <span class="keyword">if</span> (<span class="built_in">rank_diff</span>(sib) == <span class="number">2</span>) &#123; <span class="comment">// p is a 0,2 node</span></span><br><span class="line">      <span class="keyword">auto</span> y = x-&gt;ch[d^<span class="number">1</span>];</span><br><span class="line">      <span class="keyword">if</span> (y 
&amp;&amp; <span class="built_in">rank_diff</span>(y) == <span class="number">1</span>) &#123;</span><br><span class="line">        <span class="comment">// y is a 1-child. y must be the previous `x` in the last iteration.</span></span><br><span class="line">        Perform a <span class="type">double</span> rotation involving `p`, `x`, <span class="keyword">and</span> `y`.</span><br><span class="line">      &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        <span class="comment">// Otherwise, y is null or a 2-child.</span></span><br><span class="line">        Perform a single rotation involving `p` <span class="keyword">and</span> `x`.</span><br><span class="line">        x is now a <span class="number">1</span>,<span class="number">1</span> node <span class="keyword">and</span> there is no more violation.</span><br><span class="line">      &#125;</span><br><span class="line">      <span class="keyword">break</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Otherwise, p is a 0,1 node. Go to the next iteration.</span></span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Insertion never introduces a 2,2 node, so insertion-only sequences produce AVL trees.</p><h3 id="deletion">Deletion</h3><p>TODO: Describe deletion</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">if (!was_2 &amp;&amp; !x &amp;&amp; !p-&gt;ch[0] &amp;&amp; !p-&gt;ch[1] &amp;&amp; p-&gt;rp()) &#123;</span><br><span class="line">  // p was unary and becomes 2,2. 
Demote it.</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="implementation">Implementation</h3><p>Since valid rank differences can only be 1 or 2, ranks can be encoded efficiently using bit flags. There are three approaches:</p><ul><li>Store two bits representing the rank differences to each child. Bit 0: rank difference to left child (1 = diff is 2, 0 = diff is 1). Bit 1: rank difference to right child.</li><li>Store a single bit representing the parity (even/odd) of the node's absolute rank. The rank difference to a child is computed by comparing parities. Same parity → rank difference of 2. Different parity → rank difference of 1.</li><li>Store a 1-bit rank difference parity in each node.</li></ul><p>FreeBSD's <code>sys/tree.h</code> (<a href="https://reviews.freebsd.org/D25480" class="uri">https://reviews.freebsd.org/D25480</a>, 2020) uses the first approach. The <code>rb_</code> prefix remains as it can also indicate <code>Rank-Balanced</code> :) Note: its insertion operation can be further optimized as the following code demonstrates.</p><p><a href="https://github.com/pvachon/wavl_tree" class="uri">https://github.com/pvachon/wavl_tree</a> and <a href="https://crates.io/crates/wavltree" class="uri">https://crates.io/crates/wavltree</a> use the second approach.</p><p>The third approach is less efficient because a null node can be either a 1-child (parent is binary) or a 2-child (parent is unary), requiring the sibling node to be probed to determine the rank difference: <code>int rank_diff(Node *p, int d) &#123; return p-&gt;ch[d] ? p-&gt;ch[d]-&gt;par_and_flg &amp; 1 : p-&gt;ch[!d] ? 
2 : 1; &#125;</code></p><p><a href="https://maskray.me/blog/2025-12-14-weak-avl-tree" class="uri">https://maskray.me/blog/2025-12-14-weak-avl-tree</a> is a C++ implementation covering both approaches, supporting the following operations:</p><ul><li><code>insert</code>: insert a node</li><li><code>remove</code>: remove a node</li><li><code>rank</code>: count elements less than a key</li><li><code>select</code>: find the k-th smallest element (0-indexed)</li><li><code>prev</code>: find the largest element less than a key</li><li><code>next</code>: find the smallest element greater than a key</li></ul><p><strong>Node structure:</strong></p><ul><li><code>ch[2]</code>: left and right child pointers.</li><li><code>par_and_flg</code>: packs the parent pointer with 2 flag bits in the low bits. Bit 0 indicates whether the left child has rank difference 2; bit 1 indicates whether the right child has rank difference 2. A cleared bit means rank difference 1.</li><li><code>i</code>: the key value.</li><li><code>sum</code>, <code>size</code>: augmented data maintained by <code>mconcat</code> for order statistics operations.</li></ul><p><strong>Helper methods:</strong></p><ul><li><code>rd2(d)</code>: returns true if child <code>d</code> has rank difference 2.</li><li><code>flip(d)</code>: toggles the rank difference of child <code>d</code> between 1 and 2.</li><li><code>clr_flags()</code>: sets both children to rank difference 1 (used after rotations to reset a node to 1,1).</li></ul><p><strong>Invariants:</strong></p><ul><li>Leaves always have <code>flags() == 0</code>, meaning both null children are 1-children (null nodes have rank -1, a leaf has rank 0).</li><li>After each insertion or deletion, <code>mconcat</code> is called along the path to the root to update augmented data.</li></ul><p><strong>Rotations:</strong></p><p>The <code>rotate(x, d)</code> function rotates node <code>x</code> in direction <code>d</code>. 
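As a concrete illustration, here is a minimal sketch of such a rotation. It deliberately simplifies the layout described above: a plain <code>parent</code> pointer replaces the packed <code>par_and_flg</code> field, and the names (<code>Node</code>, <code>mconcat</code>, <code>rotate</code>) mirror the prose but are illustrative assumptions, not the post's actual code. Rank-difference maintenance is left to the caller, as in the post.

```cpp
// Simplified sketch, not the post's actual implementation: the real node
// packs the parent pointer with two rank-difference flag bits.
struct Node {
  Node *ch[2] = {nullptr, nullptr};
  Node *parent = nullptr;
  long i = 0, sum = 0;  // key and augmented sum
  int size = 1;         // augmented subtree size
  void mconcat() {      // recompute augmented data from the children
    sum = i;
    size = 1;
    for (Node *c : ch)
      if (c) { sum += c->sum; size += c->size; }
  }
};

// Lift x->ch[d] to replace x, then refresh augmented data bottom-up.
// Rank differences are fixed up by the caller afterwards.
Node *rotate(Node *x, int d, Node *&root) {
  Node *y = x->ch[d];
  x->ch[d] = y->ch[!d];
  if (y->ch[!d]) y->ch[!d]->parent = x;
  y->parent = x->parent;
  if (!x->parent) root = y;
  else x->parent->ch[x->parent->ch[1] == x] = y;
  y->ch[!d] = x;
  x->parent = y;
  x->mconcat();  // x is now y's child, so update it first
  y->mconcat();
  return y;
}
```

A caller would then adjust the rank-difference flags, e.g., via the <code>flip</code>/<code>clr_flags</code> helpers described above.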
It lifts <code>x-&gt;ch[d]</code> to replace <code>x</code>, and updates the augmented data for <code>x</code>. The caller is responsible for updating rank differences.</p><h3 id="misc">Misc</h3><p>Visualization: <a href="https://tjkendev.github.io/bst-visualization/avl-tree/bu-weak.html" class="uri">https://tjkendev.github.io/bst-visualization/avl-tree/bu-weak.html</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;tl;dr: Weak AVL trees are replacements for AVL trees and red-black
trees.&lt;/p&gt;
&lt;p&gt;The 2014 paper &lt;a
href=&quot;https://sidsen.azurewebsites.net</summary>
      
    
    
    
    
    <category term="algorithm" scheme="https://maskray.me/blog/tags/algorithm/"/>
    
    <category term="data structure" scheme="https://maskray.me/blog/tags/data-structure/"/>
    
  </entry>
  
  <entry>
    <title>Sacramento游记</title>
    <link href="https://maskray.me/blog/2025-12-07-sacramento-travelogue"/>
    <id>https://maskray.me/blog/2025-12-07-sacramento-travelogue</id>
    <published>2025-12-07T08:00:00.000Z</published>
    <updated>2026-01-20T02:11:00.054Z</updated>
    
    <content type="html"><![CDATA[<p>周末从旧金山湾南部去Sacramento参观。</p><span id="more"></span><h2 id="周六">周六</h2><p>周六上午看了Crocker Art Museum，相当不错。博物馆以Edwin BryantCrocker命名(他是Central Pacific Railroad的The Big Four之一CharlesCrocker的兄弟)</p><figure><img src="/static/2025-12-07-sacramento-travelogue/20251206_115206.webp"style="width:40.0%" alt="老子和尹喜。书卷中是道德经通行本前两章内容" /><figcaptionaria-hidden="true">老子和尹喜。书卷中是道德经通行本前两章内容</figcaption></figure><p>十九世纪，Sacramento被视作“二埠”(SanFrancisco为“大埠”)。我好奇Sacramento是否还存在Chinatown。 在DOCO -DowntownCommons(购物商场)附近简单逛逛后，沿着"Chinatown"指示牌向北走来到4thSt和J St路口。马路对面的高大建筑溯源堂(Soo Yuen BenevolentAssociation)没有开放。 穿过JSt后看到溯源堂左边有个牌坊，上书“沙加缅度华埠”。穿过牌坊则来到一个广场，没有看到人迹。</p><figure><img src="/static/2025-12-07-sacramento-travelogue/20251206_141624.webp"style="width:40.0%" alt="牌坊" /><figcaption aria-hidden="true">牌坊</figcaption></figure><p>1950和1960年代，I-5州际公路建设和城市更新项目拆除了部分Chinatown建筑，I-5如今即位于Chinatown遗迹西侧。残存的街区仅限于J Street和I Street之间、3rd Street和5thStreet之间的两个街区。 人口也逐渐迁出，此处已是一个ghosttown，非常冷清。</p><p>Google地图显示的场所"ChinatownMall"似乎是中华会馆遗迹，无人、不可进入。一个Reddit贴文显示这是1959年建成的，现在已经荒废。有一个建筑物写着“邓高密公所”，也没有人。 中山纪念馆(415 JSt，似乎是1971年建成)则是我们唯一找到的开放的场所，在周六周日13:00至15:00开放，里面悬挂着挂着孙中山像、美国国旗和民国旗。</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251206_142216.webp" title="中山纪念馆"/><img src="/static/2025-12-07-sacramento-travelogue/20251206_142231.webp" title="中山纪念馆"/><img src="/static/2025-12-07-sacramento-travelogue/20251206_142003.webp" title="中山纪念馆"/><img src="/static/2025-12-07-sacramento-travelogue/20251206_142607.webp" title="中山纪念馆"/></figure><p>下午参观了California State Railroad Museum，周边停车都是flat rate$20。</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251206_151521.webp" title="Chinese Railroad Workers Historic Photos & Painting Exhibition"/><img src="/static/2025-12-07-sacramento-travelogue/20251206_151647.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_151723.webp" 
title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_151755.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_153213.swebp" title=""/></figure><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251206_151521.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_151723.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_151746.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251206_152855.webp" title="train"/><img src="/static/2025-12-07-sacramento-travelogue/20251206_153336.webp" title="train"/></figure><p>The historical narration felt like reliving The Iron Horse (1924 film).</p><h2 id="周日">Sunday</h2><p>Arrived at the Leland Stanford Mansion at 11:57, just in time for a 12:00 tour. Visitors may enter only with the free guided tour.</p><p>The mansion was originally built in 1856. Leland Stanford purchased the property in 1861. After being elected governor, he conducted state business in 1862 and 1863 in rooms added to the house. Two later governors (the 9th and 10th) also worked here after him.</p><p>In 1868, the Stanfords' only child, Leland Jr., was born here. The mansion was expanded in 1872. The Stanfords and their son lived here until 1876, when they moved to San Francisco (their San Francisco residence was later destroyed in the 1906 earthquake). In 1900, Leland Stanford's widow donated the mansion to the church for use as an orphanage, which it remained until 1978, when the state government purchased the historic property. The building was then renovated and eventually opened to the public as a museum.</p><p>Leland, one of the Big Four of the Central Pacific Railroad, built the western portion of the First Transcontinental Railroad. There are twelve train motifs inside the mansion.</p><p>Stanford University's full name is Leland Stanford Junior University, in memory of Leland Jr., who died young while traveling in Europe.</p><p>In the afternoon I visited the California Museum, entering for free through Bank of America's Museum on Us program.</p><figure><img src="/static/2025-12-07-sacramento-travelogue/20251207_140847.webp" style="width:40.0%" alt="Early Chinatowns" /><figcaption aria-hidden="true">Early Chinatowns</figcaption></figure><p>The exhibits covered the construction of the First Transcontinental Railroad, the 1942 incarceration of Japanese Americans (incidentally, today, December 7, is the 84th anniversary of the December 7, 1941 attack on Pearl Harbor), and Chinatowns. After reading the description of the town of Locke (around 14:46), I decided to leave Sacramento immediately and drive to Locke.</p><figure><img src="/static/2025-12-07-sacramento-travelogue/20251207_144048.webp" style="width:40.0%" alt="Rethinking &quot;Chinatown&quot;" /><figcaption aria-hidden="true">Rethinking "Chinatown"</figcaption></figure><h2 id="locke">Locke</h2><p>Much of the following is translated and summarized from a Fall 2023 Locke Foundation newsletter: <a href="https://locke-foundation.org/wp-content/uploads/2023/09/LF-newsletter-Fall-2023-final.pdf" class="uri">https://locke-foundation.org/wp-content/uploads/2023/09/LF-newsletter-Fall-2023-final.pdf</a></p><p>Founded in 1915, Locke is the largest and best-preserved rural Chinese American community remaining in the United States. (It is the only Chinatown left in the Sacramento-San Joaquin Delta.) It was not a traditional urban "Chinatown" but a standalone town built by Chinese for Chinese. Chinese immigrants had come to California chasing the Gold Mountain dream, but the Alien Land Law barred them from owning land, so they could only lease land and build on it. (Some commentators call it the only town in America founded by, inhabited exclusively by, and run by Chinese: a town of its own, not a district of some city.)</p><p>Locke later declined, likely for several reasons:</p><ul><li>The Chinese Exclusion Act (1882): no replenishment from overseas immigration.</li><li>Economic shocks. The Great Depression of the 1930s hit the town hard. After Prohibition ended in 1933, the pleasure seekers who had been drawn to the "Monte Carlo of California" to gamble and carouse disappeared.</li><li>Farm mechanization. Large-scale mechanization in the 1940s and 1950s put many farm workers out of work and made small tenant farming hard to sustain. Demand for the manual labor that Chinese workers depended on plummeted.</li><li>Out-migration. After World War II, many young Chinese Americans left the town for better economic opportunities in the cities.</li><li>Land ownership. Because California's 1913 Alien Land Law barred Asians from buying land, the Chinese could only lease from the George Locke family. Although the law was ruled unconstitutional in 1952, Locke's residents were never able to buy the land under their houses.</li></ul><p>In 1976, the family of Hong Kong businessman Clarence Chu bought the Locke estate. He spoke the Zhongshan dialect and earned the trust of the old residents. He eventually pushed the county to create a land subdivision so that residents could buy the lots under their houses at very low prices (just $3,000-5,000 per lot). In 2004, nearly a century of historical injustice was finally corrected.</p><p>The Locke Management Company (LMC) handles day-to-day town management, essentially a homeowners' association for the whole town; the Locke Foundation (LF) focuses on education, preservation, and promotion: running the museums, holding festivals, awarding scholarships, recording oral histories, and more.</p><figure style="width: 40%"><img src="/static/2025-12-07-sacramento-travelogue/20251207_154358.webp" title="main street"/></figure><p><a href="https://locke-foundation.org/locke-museums/" class="uri">https://locke-foundation.org/locke-museums/</a> lists four buildings and a park.</p><ul><li>Dai Loy Gambling House. The only surviving building of the eight gambling houses that once operated; it closed in 1951.</li><li>Boarding House Museum</li><li>Jan Ying Chinese Association Museum</li><li>Joe Shoong Chinese School</li><li>Memorial Park</li></ul><p>The closing times on Google Maps are inaccurate; all four buildings close at 16:00. According to the video <a href="https://www.youtube.com/watch?v=wtzcOgaMYcQ" class="uri">https://www.youtube.com/watch?v=wtzcOgaMYcQ</a>, the staff member locking up was in fact Clarence Chu, the owner of several of the buildings! He bought the town of Locke from the George Locke family in 1976.</p><p>Dai Loy Gambling House</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251207_155640.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251207_160159.webp" title=""/><img src="/static/2025-12-07-sacramento-travelogue/20251207_155736.webp" title=""/></figure><p>Boarding House</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251207_153939.webp" title="Boarding House Museum"/><img src="/static/2025-12-07-sacramento-travelogue/20251207_153914.webp" title="Boarding House Museum"/><img src="/static/2025-12-07-sacramento-travelogue/20251207_153857.webp" title="Boarding House Museum"/></figure><p>Jan Ying Chinese Association</p><figure style="width: 40%"><img src="/static/2025-12-07-sacramento-travelogue/20251207_155542.webp" title=""/></figure><p>Chinese school</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251207_155319.webp" title="Chinese school"/><img src="/static/2025-12-07-sacramento-travelogue/20251207_154913.webp" title="Chinese school"/></figure><p>Memorial Park</p><figure style="display: flex; gap: 5px;"><img src="/static/2025-12-07-sacramento-travelogue/20251207_155400.webp" title="Memorial Park"/><img src="/static/2025-12-07-sacramento-travelogue/20251207_154721.webp" title="National Historic Landmark monument"/><img src="/static/2025-12-07-sacramento-travelogue/20251207_154736.webp" title="National Historic Landmark monument"/></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;周末从旧金山湾南部去Sacramento参观。&lt;/p&gt;</summary>
    
    
    
    
    <category term="travel" scheme="https://maskray.me/blog/tags/travel/"/>
    
  </entry>
  
  <entry>
    <title>Stack walking: space and time trade-offs</title>
    <link href="https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs"/>
    <id>https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs</id>
    <published>2025-10-26T07:00:00.000Z</published>
    <updated>2025-12-01T08:59:40.484Z</updated>
    
    <content type="html"><![CDATA[<p>On most Linux platforms (except AArch32, which uses <code>.ARM.exidx</code>), DWARF <code>.eh_frame</code> is required for <a href="/blog/2020-12-12-c++-exception-handling-abi">C++ exception handling</a> and <a href="/blog/2020-11-08-stack-unwinding">stack unwinding</a> to restore callee-saved registers. While <code>.eh_frame</code> can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.</p><p>Runtime performance analysis will be added in a future update.</p><span id="more"></span><h2 id="stack-walking-mechanisms">Stack walking mechanisms</h2><p>Here is a survey of mechanisms available for x86-64:</p><ul><li>Frame pointers: Fast and simple, but costs a register.</li><li>DWARF <code>.eh_frame</code>: Comprehensive but slower; supports additional features like C++ exception handling.</li><li>SFrame: A new experimental format that only supports profiling. <code>.eh_frame</code> remains necessary for debugging and C++ exception handling. Check out <a href="/blog/2025-09-28-remarks-on-sframe">Remarks on SFrame</a> for details.</li><li>LLVM's Compact Unwinding Format: A highly space-efficient format, <a href="https://faultlore.com/blah/compact-unwinding/">implemented by Apple for Mach-O binaries</a>, with llvm, lld/MachO, and libunwind implementations. Supports x86-64 and AArch64. It can mostly replace DWARF CFI, though some entries need a DWARF escape (the <code>__eh_frame</code> section would be tiny). OpenVMS modified it for their x86-64 port.</li><li>x86 Last Branch Record (LBR): A hardware feature that captures a limited history of the most recent branches (up to 32 on Skylake+).
When configured to track branches for SamplePGO, the limited depth means it won't reliably capture deep stack traces. Traditionally Intel only, but AMD Zen 4 has since implemented <a href="https://lkml.kernel.org/lkml/b6bb0abaa8a54c0b6d716344700ee11a1793d709.1660211399.git.sandipan.das@amd.com/T/">Last Branch Record Extension Version 2 (LbrExtV2)</a></li><li>Control-flow Enforcement Technology (CET) Shadow Stack: This hardware security hardening feature can be used to get stack traces. While it introduces some overhead, it offers the flexibility of process-specific enablement.</li></ul><h2 id="space-overhead-analysis">Space overhead analysis</h2><h3 id="frame-pointer-size-impact">Frame pointer size impact</h3><p>For most architectures, GCC defaults to <code>-fomit-frame-pointer</code> in <code>-O</code> compilation to free up a register for general use. To enable frame pointers, specify <code>-fno-omit-frame-pointer</code>, which reserves the frame pointer register (e.g., <code>rbp</code> on x86-64) and emits push/pop instructions in function prologues/epilogues.</p><p>For leaf functions (those that don't call other functions), while the frame pointer register should still be reserved for consistency, the push/pop operations are often unnecessary. Compilers provide <code>-momit-leaf-frame-pointer</code> (with target-specific defaults) to reduce code size.</p><p>The viability of this optimization depends on the target architecture:</p><ul><li>On AArch64, the return address is available in the link register (X30). The immediate caller can be retrieved by inspecting X30, so <code>-momit-leaf-frame-pointer</code> does not compromise unwinding.</li><li>On x86-64, after the prologue instructions execute, the return address is stored at RSP plus an offset.
An unwinder needs to know the stack frame size to retrieve the return address, or it must utilize DWARF information for the leaf frame and then switch to the FP chain for parent frames.</li></ul><p>Beyond this architectural consideration, there are additional practical reasons to use <code>-momit-leaf-frame-pointer</code> on x86-64:</p><ul><li>Many hand-written assembly implementations (including numerous glibc functions) don't establish frame pointers, creating gaps in the frame pointer chain anyway.</li><li>In the prologue sequence <code>push rbp; mov rbp, rsp</code>, after the first instruction executes, RBP does not yet reference the current stack frame. When shrink-wrapping optimizations are enabled, the instruction region where RBP still holds the old value becomes larger, increasing the window where the frame pointer is unreliable.</li></ul><p>Given these trade-offs, three common configurations have emerged:</p><ul><li>omitting FP: <code>-fomit-frame-pointer -momit-leaf-frame-pointer</code> (smallest overhead)</li><li>reserving FP, but removing FP push/pop for leaf functions: <code>-fno-omit-frame-pointer -momit-leaf-frame-pointer</code> (frame pointer chain omitting the leaf frame)</li><li>reserving FP: <code>-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer</code> (complete frame pointer chain, largest overhead)</li></ul><p>The size impact varies significantly by program.
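</p><p>As a minimal illustration of the bookkeeping behind such a comparison, here is a Ruby sketch; the section sizes below are made-up sample values for illustration, not measurements from any build:</p>

```ruby
# Compare .text/EH proportions and total size across two hypothetical builds.
# All section sizes here are invented sample values, not real measurements.
def summarize(name, sections, baseline_vm = nil)
  vm = sections.values.sum
  eh = sections.fetch(".eh_frame", 0) + sections.fetch(".eh_frame_hdr", 0)
  delta = baseline_vm ? format("%+.1f%%", (vm - baseline_vm) * 100.0 / baseline_vm) : "-"
  { name: name, text_pct: (100.0 * sections[".text"] / vm).round(1),
    eh_pct: (100.0 * eh / vm).round(1), vm: vm, vm_increase: delta }
end

none    = { ".text" => 2_100_000, ".eh_frame" => 360_000, ".eh_frame_hdr" => 8_000, ".rodata" => 6_400_000 }
nonleaf = { ".text" => 2_120_000, ".eh_frame" => 295_000, ".eh_frame_hdr" => 7_000, ".rodata" => 6_400_000 }

base = summarize("custom-none", none)
rel  = summarize("custom-nonleaf", nonleaf, base[:vm])
puts "#{rel[:name]}: .text #{rel[:text_pct]}%  EH #{rel[:eh_pct]}%  VM #{rel[:vm_increase]}"
```

<p>Summing section sizes is a simplification; the actual script derives VM size from the <code>PT_LOAD</code> segments.</p><p>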
Here's a <a href="https://github.com/MaskRay/object-file-size-analyzer/blob/master/section_size.rb">Ruby script <code>section_size.rb</code></a> that compares section sizes:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-&#123;none,nonleaf,all&#125;/bin/&#123;llvm-mc,opt&#125;</span><br><span class="line">Filename                            |       .text size |        EH size |  VM size | VM increase</span><br><span class="line">------------------------------------+------------------+----------------+----------+------------</span><br><span class="line">/tmp/out/custom-none/bin/llvm-mc    |  2114687 (23.7%) |  367992 (4.1%) |  8914057 |           -</span><br><span class="line">/tmp/out/custom-nonleaf/bin/llvm-mc |  2124143 (24.0%) |  301688 (3.4%) |  8856713 |       -0.6%</span><br><span class="line">/tmp/out/custom-all/bin/llvm-mc     |  2149535 (24.0%) |  362408 (4.1%) |  8942729 |       +0.3%</span><br><span class="line">/tmp/out/custom-none/bin/opt        | 39018511 (70.2%) | 4561112 (8.2%) | 55583965 |           -</span><br><span class="line">/tmp/out/custom-nonleaf/bin/opt     | 38879897 (71.4%) | 3542288 (6.5%) | 54424789 |       -2.1%</span><br><span class="line">/tmp/out/custom-all/bin/opt         | 38980905 (71.0%) | 3888624 (7.1%) | 54871285 |       -1.3%</span><br></pre></td></tr></table></figure><p>For instance, <code>llvm-mc</code> is dominated by read-only data, making the relative <code>.text</code> percentage quite small, so frame pointer impact on the VM size is minimal.
("VM size" is a metric used by bloaty, representing the total <code>p_memsz</code> size of <code>PT_LOAD</code> segments, excluding <a href="/blog/2023-12-17-exploring-the-section-layout-in-linker-output">alignment padding</a>.) As expected, <code>llvm-mc</code> grows larger as more functions set up the frame pointer chain. However, <code>opt</code> actually becomes smaller when <code>-fno-omit-frame-pointer</code> is enabled, a counterintuitive result that warrants explanation.</p><p>Without frame pointers, the compiler uses RSP-relative addressing to access stack objects. When using the register-indirect + disp8/disp32 addressing mode, RSP needs an extra SIB byte while RBP doesn't. For larger functions accessing many local variables, the savings from shorter RBP-relative encodings can outweigh the additional <code>push rbp; mov rbp, rsp; pop rbp</code> instructions in the prologues/epilogues.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;mov rax, [rsp+8]; mov rax, [rbp-8]&#x27; | /tmp/Rel/bin/llvm-mc -x86-asm-syntax=intel -output-asm-variant=1 -show-encoding</span><br><span class="line">        mov     rax, qword ptr [rsp + 8]        # encoding: [0x48,0x8b,0x44,0x24,0x08]</span><br><span class="line">        mov     rax, qword ptr [rbp - 8]        # encoding: [0x48,0x8b,0x45,0xf8]</span><br><span class="line"></span><br><span class="line"># ModR/M byte 0x44: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=100 (SIB byte follows)</span><br><span class="line"># ModR/M byte 0x45: Mod=01 (register-indirect addressing + disp8), Reg=0 (dest reg RAX), R/M=101 (RBP)</span><br></pre></td></tr></table></figure><h3 id="sframe-vs-.eh_frame">SFrame vs .eh_frame</h3><p>Oracle is advocating for
SFrame adoption in Linux distributions. The SFrame implementation is handled by the assembler and linker rather than the compiler. Let's build the latest binutils-gdb to test it.</p><p><strong>Building the test program</strong></p><p>We'll use the clang compiler from <a href="https://github.com/llvm/llvm-project/tree/release/21.x" class="uri">https://github.com/llvm/llvm-project/tree/release/21.x</a> as our test program.</p><p>There are still issues related to garbage collection (an <a href="/blog/2025-09-28-remarks-on-sframe#:~:text=garbage">object file format design issue</a>), so I'll just disable <code>-Wl,--gc-sections</code>.</p><figure class="highlight patch"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">--- i/llvm/cmake/modules/AddLLVM.cmake</span></span><br><span class="line"><span class="comment">+++ w/llvm/cmake/modules/AddLLVM.cmake</span></span><br><span class="line"><span class="meta">@@ -331,4 +331,4 @@</span> function(add_link_opts target_name)</span><br><span class="line">         # TODO Revisit this later on z/OS.</span><br><span class="line"><span class="deletion">-        set_property(TARGET $&#123;target_name&#125; APPEND_STRING PROPERTY</span></span><br><span class="line"><span class="deletion">-                     LINK_FLAGS &quot; -Wl,--gc-sections&quot;)</span></span><br><span class="line"><span class="addition">+        #set_property(TARGET $&#123;target_name&#125; APPEND_STRING PROPERTY</span></span><br><span class="line"><span class="addition">+        #             LINK_FLAGS &quot; -Wl,--gc-sections&quot;)</span></span><br><span class="line">       endif()</span><br></pre></td></tr></table></figure><figure
class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">configure-llvm custom-sframe -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS=<span class="string">&#x27;clang&#x27;</span> -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off -DCMAKE_&#123;EXE,SHARED&#125;_LINKER_FLAGS=-fuse-ld=bfd -DCMAKE_C_COMPILER=<span class="variable">$HOME</span>/opt/gcc-15/bin/gcc -DCMAKE_CXX_COMPILER=<span class="variable">$HOME</span>/opt/gcc-15/bin/g++ -DCMAKE_C_FLAGS=<span class="string">&quot;-B<span class="variable">$HOME</span>/opt/binutils/bin -Wa,--gsframe&quot;</span> -DCMAKE_CXX_FLAGS=<span class="string">&quot;-B<span class="variable">$HOME</span>/opt/binutils/bin -Wa,--gsframe&quot;</span></span><br><span class="line">ninja -C /tmp/out/custom-sframe clang</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line">% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe/bin/clang</span><br><span class="line">    
FILE SIZE        VM SIZE</span><br><span class="line"> --------------  --------------</span><br><span class="line">  63.9%  88.0Mi  73.9%  88.0Mi    .text</span><br><span class="line">  11.1%  15.2Mi   0.0%       0    .strtab</span><br><span class="line">   7.2%  9.96Mi   8.4%  9.96Mi    .rodata</span><br><span class="line">   6.4%  8.87Mi   7.5%  8.87Mi    .sframe</span><br><span class="line">   5.1%  7.07Mi   5.9%  7.07Mi    .eh_frame</span><br><span class="line">   2.9%  3.96Mi   0.0%       0    .symtab</span><br><span class="line">   1.4%  1.98Mi   1.7%  1.98Mi    .data.rel.ro</span><br><span class="line">   0.9%  1.23Mi   1.0%  1.23Mi    [LOAD #4 [R]]</span><br><span class="line">   0.7%   999Ki   0.8%   999Ki    .eh_frame_hdr</span><br><span class="line">   0.0%       0   0.5%   614Ki    .bss</span><br><span class="line">   0.2%   294Ki   0.2%   294Ki    .data</span><br><span class="line">   0.0%  23.1Ki   0.0%  23.1Ki    .rela.dyn</span><br><span class="line">   0.0%  8.99Ki   0.0%  8.99Ki    .dynstr</span><br><span class="line">   0.0%  8.77Ki   0.0%  8.77Ki    .dynsym</span><br><span class="line">   0.0%  7.24Ki   0.0%  7.24Ki    .rela.plt</span><br><span class="line">   0.0%  6.73Ki   0.0%       0    [Unmapped]</span><br><span class="line">   0.0%  6.29Ki   0.0%  3.84Ki    [21 Others]</span><br><span class="line">   0.0%  4.84Ki   0.0%  4.84Ki    .plt</span><br><span class="line">   0.0%  3.36Ki   0.0%  3.30Ki    .init_array</span><br><span class="line">   0.0%  2.50Ki   0.0%  2.50Ki    .hash</span><br><span class="line">   0.0%  2.44Ki   0.0%  2.44Ki    .got.plt</span><br><span class="line"> 100.0%   137Mi 100.0%   119Mi    TOTAL</span><br><span class="line">% ~/Dev/object-file-size-analyzer/eh_size.rb /tmp/out/custom-sframe/bin/clang</span><br><span class="line">clang: sframe=9303875 eh_frame=7408976 eh_frame_hdr=1023004 eh=8431980 sframe/eh_frame=1.2558 sframe/eh=1.1034</span><br></pre></td></tr></table></figure><p>The results show that 
<code>.sframe</code> (8.87 MiB) isapproximately 10% larger than the combined size of<code>.eh_frame</code> and <code>.eh_frame_hdr</code> (7.07 + 0.99 =8.06 MiB). While SFrame is designed for efficiency during stack walking,it carries a non-trivial space overhead compared to traditional DWARFunwind information.</p><h3 id="sframe-vs-fp">SFrame vs FP</h3><p>Having examined SFrame's overhead compared to <code>.eh_frame</code>,let's now compare the two primary approaches for non-hardware-assistedstack walking.</p><ul><li><strong>Frame pointer approach</strong>: Reserve FP but omitpush/pop for leaf functions<code>g++ -fno-omit-frame-pointer -momit-leaf-frame-pointer</code></li><li><strong>SFrame approach</strong>: Omit FP and use SFrame metadata<code>g++ -fomit-frame-pointer -momit-leaf-frame-pointer -Wa,--gsframe</code></li></ul><p>To conduct a fair comparison, we build LLVM executables using bothapproaches with both Clang and GCC compilers. The following scriptconfigures and builds test binaries with each combination:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#!/bin/zsh</span></span><br><span class="line"><span 
class="function"><span class="title">conf</span></span>() &#123;</span><br><span class="line">  configure-llvm <span class="variable">$1</span> -DCMAKE_EXE_LINKER_FLAGS=<span class="string">&#x27;-fuse-ld=bfd -pie -Wl,-z,pack-relative-relocs&#x27;</span> \</span><br><span class="line">    -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off <span class="variable">$&#123;@:2&#125;</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">clang=(-DCMAKE_CXX_COMPILER=/tmp/Rel/bin/clang++ -DCMAKE_C_COMPILER=/tmp/Rel/bin/clang)</span><br><span class="line">gcc=(<span class="string">&quot;-DCMAKE_C_COMPILER=<span class="variable">$HOME</span>/opt/gcc-15/bin/gcc&quot;</span> <span class="string">&quot;-DCMAKE_CXX_COMPILER=<span class="variable">$HOME</span>/opt/gcc-15/bin/g++&quot;</span>)</span><br><span class="line"></span><br><span class="line">compact=<span class="string">&quot;-fomit-frame-pointer -momit-leaf-frame-pointer -B<span class="variable">$HOME</span>/opt/binutils/bin -mllvm -elf-compact-unwind -mllvm -x86-epilog-cfi=0&quot;</span></span><br><span class="line">fp=<span class="string">&quot;-fno-omit-frame-pointer -momit-leaf-frame-pointer -B<span class="variable">$HOME</span>/opt/binutils/bin -Wa,--gsframe=no&quot;</span></span><br><span class="line">sframe=<span class="string">&quot;-fomit-frame-pointer -momit-leaf-frame-pointer -B<span class="variable">$HOME</span>/opt/binutils/bin -Wa,--gsframe&quot;</span></span><br><span class="line"></span><br><span class="line">conf custom-compact -DCMAKE_&#123;C,CXX&#125;_FLAGS=<span class="string">&quot;<span class="variable">$compact</span>&quot;</span> <span class="variable">$&#123;clang[@]&#125;</span> \</span><br><span class="line">  -DCMAKE_EXE_LINKER_FLAGS=<span class="string">&#x27;-fuse-ld=lld -pie -Wl,-z,pack-relative-relocs&#x27;</span> \</span><br><span class="line">  
-DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=lld</span><br><span class="line"></span><br><span class="line">conf custom-fp -DCMAKE_&#123;C,CXX&#125;_FLAGS=<span class="string">&quot;-fno-integrated-as <span class="variable">$fp</span>&quot;</span> <span class="variable">$&#123;clang[@]&#125;</span></span><br><span class="line">conf custom-sframe -DCMAKE_&#123;C,CXX&#125;_FLAGS=<span class="string">&quot;-fno-integrated-as <span class="variable">$sframe</span>&quot;</span> <span class="variable">$&#123;clang[@]&#125;</span></span><br><span class="line"></span><br><span class="line">conf custom-fp-gcc -DCMAKE_&#123;C,CXX&#125;_FLAGS=<span class="string">&quot;<span class="variable">$fp</span>&quot;</span> <span class="variable">$&#123;gcc[@]&#125;</span></span><br><span class="line">conf custom-sframe-gcc -DCMAKE_&#123;C,CXX&#125;_FLAGS=<span class="string">&quot;<span class="variable">$sframe</span>&quot;</span> <span class="variable">$&#123;gcc[@]&#125;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> compact fp sframe  fp-gcc sframe-gcc; <span class="keyword">do</span> ninja -C /tmp/out/custom-<span class="variable">$i</span> llvm-mc opt; <span class="keyword">done</span></span><br></pre></td></tr></table></figure><p>The results reveal interesting differences between compilerimplementations:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line">% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-&#123;fp,sframe,compact,fp-gcc,sframe-gcc&#125;/bin/&#123;llvm-mc,opt&#125;</span><br><span class="line">Filename                               |       .text size |        EH size |   .sframe size |  VM size | VM increase</span><br><span class="line">---------------------------------------+------------------+----------------+----------------+----------+------------</span><br><span class="line">/tmp/out/custom-fp/bin/llvm-mc         |  2120895 (23.5%) |  301528 (3.3%) |       0 (0.0%) |  9043221 |           -</span><br><span class="line">/tmp/out/custom-sframe/bin/llvm-mc     |  2109231 (22.3%) |  367424 (3.9%) |  348041 (3.7%) |  9474085 |       +4.8%</span><br><span class="line">/tmp/out/custom-compact/bin/llvm-mc    |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -4.5%</span><br><span class="line">/tmp/out/custom-fp-gcc/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |  9389677 |       +3.8%</span><br><span class="line">/tmp/out/custom-sframe-gcc/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |  9780985 |       +8.2%</span><br><span class="line">/tmp/out/custom-fp/bin/opt             | 38769545 (69.9%) | 3547688 (6.4%) |       0 (0.0%) | 55425217 |           -</span><br><span class="line">/tmp/out/custom-sframe/bin/opt         | 38891295 (62.4%) | 4559644 (7.3%) | 4448874 (7.1%) | 62292133 |      +12.4%</span><br><span class="line">/tmp/out/custom-compact/bin/opt        | 38898415 (74.8%) | 1200764 (2.3%) |       0 
(0.0%) | 52020449 |       -6.1%</span><br><span class="line">/tmp/out/custom-fp-gcc/bin/opt         | 54654215 (78.1%) | 3631196 (5.2%) |       0 (0.0%) | 70001373 |      +26.3%</span><br><span class="line">/tmp/out/custom-sframe-gcc/bin/opt     | 53644895 (70.4%) | 4857364 (6.4%) | 5263676 (6.9%) | 76206149 |      +37.5%</span><br><span class="line"></span><br><span class="line">% ruby ~/Dev/object-file-size-analyzer/eh_size.rb  /tmp/out/custom-compact/bin/opt</span><br><span class="line">opt: sframe=0 eh_frame=267008 eh_frame_hdr=933756 eh=1200764 sframe/eh_frame=0.0 sframe/eh=0.0</span><br><span class="line">% ruby ~/Dev/object-file-size-analyzer/eh_size.rb  /tmp/out/custom-sframe/bin/opt</span><br><span class="line">opt: sframe=4448874 eh_frame=3938448 eh_frame_hdr=621196 eh=4559644 sframe/eh_frame=1.1296 sframe/eh=0.9757</span><br><span class="line"></span><br><span class="line">% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-&#123;fp-sync,sframe-sync,compact-sync&#125;/bin/&#123;llvm-mc,opt&#125;</span><br><span class="line">Filename                                 |       .text size |        EH size |   .sframe size |  VM size | VM increase</span><br><span class="line">-----------------------------------------+------------------+----------------+----------------+----------+------------</span><br><span class="line">/tmp/out/custom-fp-sync/bin/llvm-mc      |  2120895 (24.1%) |  263396 (3.0%) |       0 (0.0%) |  8802093 |           -</span><br><span class="line">/tmp/out/custom-sframe-sync/bin/llvm-mc  |  2109231 (23.2%) |  291084 (3.2%) |  248654 (2.7%) |  9090325 |       +3.3%</span><br><span class="line">/tmp/out/custom-compact-sync/bin/llvm-mc |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -1.8%</span><br><span class="line">/tmp/out/custom-fp-sync/bin/opt          | 38769545 (72.2%) | 2997572 (5.6%) |       0 (0.0%) | 53706041 |           -</span><br><span class="line">/tmp/out/custom-sframe-sync/bin/opt      | 
38891295 (66.9%) | 3425116 (5.9%) | 2951292 (5.1%) | 58091421 |       +8.2%</span><br><span class="line">/tmp/out/custom-compact-sync/bin/opt     | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 |       -3.1%</span><br></pre></td></tr></table></figure><ul><li>SFrame incurs a significant VM size increase.</li><li>GCC-built binaries are significantly larger than their Clang counterparts, probably due to more aggressive inlining or vectorization strategies.</li><li><code>/tmp/out/custom-compact</code> has a significantly smaller EH size. See details below.</li></ul><p>With Clang-built binaries, the frame pointer configuration produces a smaller <code>opt</code> executable (55.6 MiB) compared to the SFrame configuration (62.5 MiB). This reinforces our earlier observation that RBP addressing can be more compact than RSP-relative addressing for large functions with frequent local variable accesses.</p><p>Assembly comparison reveals that functions using RBP and RSP addressing produce quite similar code.</p><p>In contrast, GCC-built binaries show the opposite trend: the frame pointer version of <code>opt</code> (70.0 MiB) is smaller than the SFrame version (76.2 MiB).</p><p>The generated assembly differs significantly between omit-FP and non-omit-FP builds, so I compared symbol sizes between the two GCC builds. <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">nvim -d =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-fp-gcc/bin/llvm-mc) =(/tmp/Rel/bin/llvm-nm -U --size-sort /tmp/out/custom-sframe-gcc/bin/llvm-mc)</span><br></pre></td></tr></table></figure></p><p>Many functions, such as <code>_ZN4llvm15ELFObjectWriter24executePostLayoutBindingEv</code>, have significantly more instructions in the keep-FP build. This suggests that GCC's frame pointer code generation may not be as optimized as its default omit-FP path.</p><p>The <code>/tmp/out/custom-compact</code> build uses my llvm-project branch (<a href="http://github.com/MaskRay/llvm-project/tree/demo-unwind" class="uri">http://github.com/MaskRay/llvm-project/tree/demo-unwind</a>) that ports Mach-O compact unwind to ELF, allowing the majority of <code>.eh_frame</code> FDEs to replace CFI instructions with unwind descriptors. Linker behavior:</p><ul><li>Split FDEs into two groups: descriptor-based (augmentation 'C') and instruction-based</li><li>Generate <code>.eh_frame_hdr</code> version 2 with 12-byte table entries when compact FDEs are present: <code>(pc_ptr, unwind_descriptor_or_fde_ptr)</code>. Compact FDEs described by <code>.eh_frame_hdr</code> inline are removed from the output <code>.eh_frame</code> section.</li></ul><p>Note: <code>.ARM.exidx</code> and <a href="https://maskray.me/blog/2020-11-08-stack-unwinding#:~:text=MIPS">MIPS compact exception tables</a> also describe unwind descriptors inline in a binary search index table.</p><p>FDEs not representable by compact unwind (e.g. the shrink wrapping optimization) use the traditional CFI instructions (called DWARF escape in the Mach-O compact unwind information).</p><p>This implementation involves several components:</p><ul><li><code>-mllvm -elf-compact-unwind</code>: Emits <code>.eh_frame</code> CIEs with augmentation character 'C' and FDEs using unwind descriptors.</li><li><code>-mllvm -x86-epilog-cfi=0</code>: Disables epilogue CFI for x86 (primarily implemented by <a href="https://reviews.llvm.org/D42848">D42848</a> in 2018, notably disabled for Darwin and Windows). Without this option most frames will not utilize unwind descriptors because the current Mach-O compact unwind implementation does not support <code>popq %rbp; .cfi_def_cfa %rsp, 8; ret</code>.
I believe this is still fair, as we expect an 8-byte descriptor to be sufficient to describe epilogue CFI.</li><li>lld/ELF changes: FDEs are split into descriptor-based (augmentation 'C') and CFI-instruction-based groups. When compact FDEs are present, <code>.eh_frame_hdr</code> version 2 is generated with 12-byte table entries containing (pc_ptr, unwind_descriptor_or_fde_ptr). The PC pointer remains 4 bytes, while the 8-byte entry indicates either an unwind descriptor (odd value) or an FDE pointer (even value).</li></ul><p>With the current implementation, 4937 out of 77648 FDEs (6.36%) require a DWARF escape, while the remaining FDEs can be replaced with unwind descriptors.</p><p><code>.eh_frame_hdr</code> will become even smaller if we implement the two-level page table structure of Mach-O <code>__unwind_info</code>.</p><h2 id="runtime-performance-analysis">Runtime performance analysis</h2><p>TODO</p><p>perf record overhead with EH</p><p>perf record overhead with FP</p><p>Here is a <a href="https://llvm-compile-time-tracker.com/compare.php?from=5d0f1591f8b91ac7919910c4e3e9614a8804c02a&amp;to=76cdaf78a7d4b06130031818397da11b8985ab08&amp;stat=instructions:u">benchmark run from llvm-compile-time-tracker.com</a>.</p><p>The <code>stable2-O3</code> benchmark is relevant. When enabling FP for non-leaf functions, the <code>instructions:u</code> metric increases by +2.44% while <code>wall-time</code> (a noisy metric) increases by just 0.56%.</p><h2 id="summary">Summary</h2><p>This article examines the space overhead of different stack walking mechanisms when building LLVM executables.</p><p><strong>Frame pointer configurations:</strong> Enabling frame pointers (<code>-fno-omit-frame-pointer</code>) can paradoxically reduce x86-64 binary size when stack object accesses are frequent. This occurs because RBP-relative addressing produces more compact encodings than RSP-relative addressing, which requires an extra SIB byte.
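</p><p>A minimal Ruby sketch makes the encoding difference concrete by decoding the ModR/M bytes from the two <code>mov</code> forms shown earlier (field layout per the x86-64 instruction format; this is an illustration, not part of any build script):</p>

```ruby
# Decode an x86-64 ModR/M byte to show why [rsp+disp8] needs an extra SIB
# byte while [rbp-disp8] does not.
def decode_modrm(byte)
  mod = (byte >> 6) & 3
  rm  = byte & 7
  # For memory operands (Mod != 11), R/M=100 means "SIB byte follows".
  { mod: mod, reg: (byte >> 3) & 7, rm: rm, needs_sib: mod != 3 && rm == 4 }
end

# mov rax, [rsp+8] -> 48 8b 44 24 08  (5 bytes: REX, opcode, ModR/M, SIB, disp8)
# mov rax, [rbp-8] -> 48 8b 45 f8     (4 bytes: REX, opcode, ModR/M, disp8)
rsp_form = decode_modrm(0x44)
rbp_form = decode_modrm(0x45)
puts "ModR/M 0x44 needs SIB: #{rsp_form[:needs_sib]}"
puts "ModR/M 0x45 needs SIB: #{rbp_form[:needs_sib]}"
```

<p>One byte per memory operand looks small, but it is paid at every RSP-relative stack access, which is how the totals in the tables above add up.</p><p>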
The savings from shorter instructions can outweigh the prologue/epilogue overhead.</p><p><strong>SFrame vs .eh_frame:</strong> For the x86-64 <code>clang</code> executable, SFrame metadata is approximately 10% larger than the combined size of <code>.eh_frame</code> and <code>.eh_frame_hdr</code>. Given the significant VM size overhead and the lack of clear advantages over established alternatives, I am skeptical about SFrame's viability as the future of stack walking for userspace programs. While SFrame will receive a major revision V3 in the upcoming months, it needs to achieve substantial size reductions comparable to existing compact unwinding schemes to justify its adoption over frame pointers. I hope interested folks can implement something similar to macOS's compact unwind descriptors (with x86-64 support) and OpenVMS's.</p><p><strong>ELF compact unwind</strong>: My prototype porting Mach-O compact unwind to ELF demonstrates significant promise. The approach reduces VM size by 4.5-6.1% compared to frame pointers, achieving the smallest binaries in my benchmarks. By replacing verbose CFI instructions with 8-byte unwind descriptors (with DWARF escape for complex cases like shrink-wrapped functions), <code>.eh_frame</code> shrinks dramatically—only 6.36% of FDEs require the traditional CFI format. This approach, once completed, offers a compelling alternative to SFrame: better compression, compatibility with existing <code>.eh_frame</code> infrastructure, and a clear path to implementation.</p><p><strong><a href="https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900/34?u=maskray">LLVM community: I need your support</a></strong>. I've raised technical objections to the SFrame RFC as a maintainer. Some engineers dismissed them. Now they're escalating to Project Council to override technical review. 
This looks OKR-driven, not merit-driven.</p><p>GCC's frame pointer code generation appears less optimized than its default omit-frame-pointer path, as evidenced by substantial differences in generated assembly.</p><p>Runtime performance analysis remains to be conducted to complete the trade-off evaluation.</p><h2 id="appendix-configure-llvm">Appendix: <code>configure-llvm</code></h2><p>This script specifies common options when configuring llvm-project: <a href="https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm" class="uri">https://github.com/MaskRay/Config/blob/master/home/bin/configure-llvm</a></p><ul><li><code>-DCMAKE_CXX_ARCHIVE_CREATE="$HOME/Stable/bin/llvm-ar qc --thin &lt;TARGET&gt; &lt;OBJECTS&gt;" -DCMAKE_CXX_ARCHIVE_FINISH=:</code>: Use thin archives to reduce disk usage</li><li><code>-DLLVM_TARGETS_TO_BUILD=host</code>: Build a single target</li><li><code>-DCLANG_ENABLE_OBJC_REWRITER=off -DCLANG_ENABLE_STATIC_ANALYZER=off</code>: Disable less popular components</li><li><code>-DLLVM_ENABLE_PLUGINS=off -DCLANG_PLUGIN_SUPPORT=off</code>: Disable <code>-Wl,--export-dynamic</code>, preventing large <code>.dynsym</code> and <code>.dynstr</code> sections</li></ul><h2 id="appendix-my-sframe-build">Appendix: My SFrame build</h2><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">mkdir</span> -p out/release &amp;&amp; <span class="built_in">cd</span> out/release</span><br><span class="line">../../configure --prefix=<span class="variable">$HOME</span>/opt/binutils --disable-multilib</span><br><span class="line">make -j $(<span class="built_in">nproc</span>) all-ld all-binutils all-gas</span><br><span class="line">make -j $(<span class="built_in">nproc</span>) install-ld install-binutils 
install-gas</span><br></pre></td></tr></table></figure><p><code>gcc -B$HOME/opt/binutils/bin</code> and <code>clang -B$HOME/opt/binutils/bin -fno-integrated-as</code> will use <code>as</code> and <code>ld</code> from the install directory.</p><h2 id="appendix-scripts">Appendix: Scripts</h2><p>Ruby scripts used by this post are available at <a href="https://github.com/MaskRay/object-file-size-analyzer/" class="uri">https://github.com/MaskRay/object-file-size-analyzer/</a></p>]]></content>
    
    
    <summary type="html">&lt;p&gt;On most Linux platforms (except AArch32, which uses
&lt;code&gt;.ARM.exidx&lt;/code&gt;), DWARF &lt;code&gt;.eh_frame&lt;/code&gt; is required for
&lt;a href=&quot;/blog/2020-12-12-c++-exception-handling-abi&quot;&gt;C++ exception
handling&lt;/a&gt; and &lt;a href=&quot;/blog/2020-11-08-stack-unwinding&quot;&gt;stack
unwinding&lt;/a&gt; to restore callee-saved registers. While
&lt;code&gt;.eh_frame&lt;/code&gt; can be used for call trace recording, it is often
criticized for its runtime overhead. As an alternative, developers can
enable frame pointers, or adopt SFrame, a newer format designed
specifically for profiling. This article examines the size overhead of
enabling non-DWARF stack walking mechanisms when building several LLVM
executables.&lt;/p&gt;
&lt;p&gt;Runtime performance analysis will be added in a future update.&lt;/p&gt;</summary>
    
    
    
    
    <category term="gcc" scheme="https://maskray.me/blog/tags/gcc/"/>
    
    <category term="sframe" scheme="https://maskray.me/blog/tags/sframe/"/>
    
  </entry>
  
  <entry>
    <title>Remarks on SFrame</title>
    <link href="https://maskray.me/blog/2025-09-28-remarks-on-sframe"/>
    <id>https://maskray.me/blog/2025-09-28-remarks-on-sframe</id>
    <published>2025-09-28T07:00:00.000Z</published>
    <updated>2025-11-29T05:44:51.880Z</updated>
    
    <content type="html"><![CDATA[<p>SFrame is a new <a href="/blog/2020-11-08-stack-unwinding">stack walking format</a> for userspace profiling, inspired by Linux's in-kernel <a href="https://docs.kernel.org/arch/x86/orc-unwinder.html">ORC unwind format</a>. While SFrame eliminates some <code>.eh_frame</code> CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than <code>.eh_frame</code>'s bytecode-style CFI instructions. In llvm-project executables I've tested on x86-64, the <code>.sframe</code> section is 20% larger than <code>.eh_frame</code>. It also remains significantly larger than highly compact schemes like <a href="https://www.corsix.org/content/windows-arm64-unwind-codes">Windows ARM64 unwind codes</a>.</p><p>SFrame describes three elements for each function:</p><ul><li>Canonical Frame Address (CFA): The base address for stack frame calculations</li><li>Return address</li><li>Frame pointer</li></ul><p>An <code>.sframe</code> section follows a straightforward layout:</p><ul><li>Header: Contains metadata and offset information</li><li>Auxiliary header (optional): Reserved for future extensions</li><li>Function Descriptor Entries (FDEs): Array describing each function</li><li>Frame Row Entries (FREs): Arrays of unwinding information per function</li></ul><span id="more"></span><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">struct</span> [[gnu::packed]] sframe_header &#123;</span><br><span class="line">  <span class="keyword">struct</span> &#123;</span><br><span class="line">    <span class="type">uint16_t</span> sfp_magic;</span><br><span class="line">    <span class="type">uint8_t</span> sfp_version;</span><br><span class="line">    <span class="type">uint8_t</span> sfp_flags;</span><br><span class="line">  &#125; sfh_preamble;</span><br><span class="line">  <span class="type">uint8_t</span> sfh_abi_arch;</span><br><span class="line">  <span class="type">int8_t</span> sfh_cfa_fixed_fp_offset;</span><br><span class="line">  <span class="comment">// Used by x86-64 to define the return address slot relative to CFA</span></span><br><span class="line">  <span class="type">int8_t</span> sfh_cfa_fixed_ra_offset;</span><br><span class="line">  <span class="comment">// Size in bytes of the auxiliary header, allowing extensibility</span></span><br><span class="line">  <span class="type">uint8_t</span> sfh_auxhdr_len;</span><br><span class="line">  <span class="comment">// Numbers of FDEs and FREs</span></span><br><span class="line">  <span class="type">uint32_t</span> sfh_num_fdes;</span><br><span class="line">  <span class="type">uint32_t</span> sfh_num_fres;</span><br><span class="line">  <span class="comment">// Size in bytes of FREs</span></span><br><span class="line">  <span class="type">uint32_t</span> sfh_fre_len;</span><br><span class="line">  <span class="comment">// Offsets in bytes of FDEs and FREs</span></span><br><span class="line">  <span class="type">uint32_t</span> sfh_fdeoff;</span><br><span class="line">  <span class="type">uint32_t</span> sfh_freoff;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>While magic numbers are a popular choice for file formats, 
they deviate from established ELF conventions, which utilize the section type for distinction.</p><p>The version field resembles similar uses within DWARF section headers. SFrame will likely evolve over time, unlike ELF's more stable control structures. This means we'll probably need to keep producers and consumers evolving in lockstep, which creates a stronger case for internal versioning. An internal version field would allow linkers to upgrade or ignore unsupported low-version input pieces, providing more flexibility in handling version mismatches.</p><h2 id="data-structures">Data structures</h2><h3 id="function-descriptor-entries-fdes">Function Descriptor Entries (FDEs)</h3><p>Function Descriptor Entries serve as the bridge between functions and their unwinding information. Each FDE describes a function's location and provides a direct link to its corresponding Frame Row Entries (FREs), which contain the actual unwinding data.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">struct</span> [[gnu::packed]] sframe_func_desc_entry &#123;</span><br><span class="line">  <span class="type">int32_t</span> sfde_func_start_address;</span><br><span class="line">  <span class="type">uint32_t</span> sfde_func_size;</span><br><span class="line">  <span class="type">uint32_t</span> sfde_func_start_fre_off;</span><br><span class="line">  <span class="type">uint32_t</span> sfde_func_num_fres;</span><br><span class="line">  <span class="comment">// bits 0-3 fretype: sfre_start_address 
type</span></span><br><span class="line">  <span class="comment">// bit 4 fdetype: SFRAME_FDE_TYPE_PCINC or SFRAME_FDE_TYPE_PCMASK</span></span><br><span class="line">  <span class="comment">// bit 5 pauth_key: (AArch64 only) the signing key for the return address</span></span><br><span class="line">  <span class="type">uint8_t</span> sfde_func_info;</span><br><span class="line">  <span class="comment">// The size of the repetitive code block for SFRAME_FDE_TYPE_PCMASK; used by .plt</span></span><br><span class="line">  <span class="type">uint8_t</span> sfde_func_rep_size;</span><br><span class="line">  <span class="type">uint16_t</span> sfde_func_padding2;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>The current design has room for optimization. The <code>sfde_func_num_fres</code> field uses a full 32 bits, which is wasteful for most functions. We could use <code>uint16_t</code> instead, requiring exceptionally large functions to be split across multiple FDEs.</p><p>It's important to note that SFrame's function concept represents code ranges rather than logical program functions. This distinction becomes particularly relevant with compiler optimizations like hot-cold splitting, where a single logical function may span multiple non-contiguous code ranges, each requiring its own FDE.</p><p>The padding field <code>sfde_func_padding2</code> represents unnecessary overhead in modern architectures where unaligned memory access performs efficiently, making the alignment benefits negligible.</p><p>To enable binary search on <code>sfde_func_start_address</code>, FDEs must maintain a fixed size, which precludes the use of variable-length integer encodings like PrefixVarInt.</p><h3 id="frame-row-entries-fres">Frame Row Entries (FREs)</h3><p>Frame Row Entries contain the actual unwinding information for specific program counter ranges within a function. 
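</p>
<p>As an aside, the fixed FDE size noted above is exactly what enables the sorted-array lookup: a plain binary search over <code>sfde_func_start_address</code>. A minimal sketch (field names abbreviated to the two the search needs; assumes the <code>SFRAME_F_FDE_SORTED</code> order):</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for the two FDE fields the search needs.
struct Fde { int32_t start; uint32_t size; };

// Return the index of the FDE covering `pc` (expressed in the same space as
// sfde_func_start_address), or -1 if no function covers it.
int find_fde(const std::vector<Fde> &fdes, int32_t pc) {
  auto it = std::upper_bound(fdes.begin(), fdes.end(), pc,
                             [](int32_t p, const Fde &f) { return p < f.start; });
  if (it == fdes.begin()) return -1;
  --it; // last FDE starting at or before pc
  return uint32_t(pc - it->start) < it->size ? int(it - fdes.begin()) : -1;
}
```

<p>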
The template design allows for different address sizes based on the function's characteristics.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> &lt;<span class="keyword">class</span> <span class="title class_">AddrType</span>&gt;</span><br><span class="line"><span class="keyword">struct</span> [[gnu::packed]] sframe_frame_row_entry &#123;</span><br><span class="line">  <span class="comment">// If the fdetype is SFRAME_FDE_TYPE_PCINC, this is an offset relative to sfde_func_start_address</span></span><br><span class="line">  AddrType sfre_start_address;</span><br><span class="line">  <span class="comment">// bit 0 fre_cfa_base_reg_id: define BASE_REG as either FP or SP</span></span><br><span class="line">  <span class="comment">// bits 1-4 fre_offset_count: typically 1 to 3, describing CFA, FP, and RA</span></span><br><span class="line">  <span class="comment">// bits 5-6 fre_offset_size: byte size of offset entries (1, 2, or 4 bytes)</span></span><br><span class="line">  sframe_fre_info sfre_info;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>Each FRE contains variable-length stack offsets stored as trailing data. 
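</p>
<p>The bit fields packed into <code>sfre_info</code> (see the comments in the struct above) can be unpacked as follows. This is a sketch: the 0/1 base-register mapping and the 1/2/4-byte size encoding are stated here as assumptions based on the layout described above:</p>

```cpp
#include <cstdint>

// Unpack the sfre_info bit fields described in the struct comments.
inline int fre_cfa_base_reg_id(uint8_t info) { return info & 1; }          // BASE_REG selector (assumed: 0 = FP, 1 = SP)
inline int fre_offset_count(uint8_t info)    { return (info >> 1) & 0xf; } // typically 1 to 3
inline int fre_offset_bytes(uint8_t info)    { return 1 << ((info >> 5) & 3); } // size code 0/1/2 -> 1/2/4 bytes
```

<p>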
The <code>fre_offset_size</code> field determines whether offsets use 1, 2, or 4 bytes (<code>uint8_t</code>, <code>uint16_t</code>, or <code>uint32_t</code>), allowing optimal space usage based on stack frame sizes.</p><h2 id="architecture-specific-stack-offsets">Architecture-specific stack offsets</h2><p>SFrame adapts to different processor architectures by varying its offset encoding to match their respective calling conventions and architectural constraints.</p><h3 id="x86-64">x86-64</h3><p>The x86-64 implementation takes advantage of the architecture's predictable stack layout:</p><ul><li>First offset: Encodes CFA as <code>BASE_REG + offset</code></li><li>Second offset (if present): Encodes FP as <code>CFA + offset</code></li><li>Return address: Computed implicitly as <code>CFA + sfh_cfa_fixed_ra_offset</code> (using the header field)</li></ul><h3 id="aarch64">AArch64</h3><p>AArch64's more flexible calling conventions require explicit return address tracking:</p><ul><li>First offset: Encodes CFA as <code>BASE_REG + offset</code></li><li>Second offset: Encodes return address as <code>CFA + offset</code></li><li>Third offset (if present): Encodes FP as <code>CFA + offset</code></li></ul><p>The explicit return address encoding accommodates AArch64's variable stack layouts and link register usage patterns.</p><h3 id="s390x">s390x</h3><p>FP and return address may not be saved at the same time. 
In leaf functions GCC might save the return address and FP to floating-point registers.</p><ul><li>First offset: Encodes CFA as <code>BASE_REG + offset</code></li><li>Second offset (if present): Encodes the return address as one of<ul><li>stack slot: <code>CFA + offset2, if (offset2 &amp; 1 == 0)</code></li><li>register number: <code>offset2 &gt;&gt; 1, if (offset2 &amp; 1 == 1)</code></li><li>not saved: <code>if (offset2 == SFRAME_FRE_RA_OFFSET_INVALID)</code></li></ul></li><li>Third offset (if present)<ul><li>FP stack slot = CFA + offset3, if (offset3 &amp; 1 == 0)</li><li>FP register number = offset3 &gt;&gt; 1, if (offset3 &amp; 1 == 1)</li></ul></li></ul><p>The format uses 0 (an invalid RA-offset-from-CFA value) to indicate that the return address is not saved, while FP is saved.</p><h2 id="toolchain-implementation">Toolchain implementation</h2><p>In the GNU toolchain, the assembler in GNU Binutils reinterprets CFI directives and generates the <code>.sframe</code> section, while GCC itself has no knowledge of SFrame.</p><p>Some scenarios that cannot be described by <code>.eh_frame</code> in the absence of the frame pointer are equally inexpressible in SFrame. Additionally, SFrame has extra limitations, as certain CFI directives cannot be re-encoded into the SFrame format. 
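</p>
<p>As a concrete illustration of how these per-architecture encodings are consumed, the s390x parity scheme from the previous section can be decoded as below. This is a sketch; it assumes <code>SFRAME_FRE_RA_OFFSET_INVALID</code> is the value 0, as described above:</p>

```cpp
#include <cstdint>

struct RaLocation {
  bool saved;   // false when offset2 == SFRAME_FRE_RA_OFFSET_INVALID (0)
  bool in_reg;  // true: `value` is a register number; false: RA is at CFA + value
  int32_t value;
};

// Decode the second (return address) s390x offset per the parity rules above.
inline RaLocation decode_s390x_ra(int32_t offset2) {
  if (offset2 == 0) return {false, false, 0};         // RA not saved
  if (offset2 & 1) return {true, true, offset2 >> 1}; // register number
  return {true, false, offset2};                      // stack slot at CFA + offset2
}
```

<p>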
You can take a look at <code>as_warn</code> code in binutils-gdb <code>gas/gen-sframe.c</code> to learn about some of these cases.</p><p>On the other hand, the assembler approach allows SFrame to work with hand-written assembly files with CFI directives.</p><h2 id="orc-and-.sframe">ORC and <code>.sframe</code></h2><p>TODO</p><h2 id="eh_frame-and-.sframe"><code>.eh_frame</code> and <code>.sframe</code></h2><p>SFrame reduces header size compared to <code>.eh_frame</code> plus <code>.eh_frame_hdr</code> by:</p><ul><li>Eliminating <code>.eh_frame_hdr</code> through sorted <code>sfde_func_start_address</code> fields</li><li>Replacing CIE pointers with direct FDE-to-FRE references</li><li>Using variable-width <code>sfre_start_address</code> fields (1 or 2 bytes) for small functions</li><li>Storing start addresses instead of the address ranges that <code>.eh_frame</code> uses</li><li>Start addresses in a small function use 1 or 2 byte fields, more efficient than <code>.eh_frame</code> initial_location, which needs at least 4 bytes (<code>DW_EH_PE_sdata4</code>).</li><li>Hard-coding stack offsets rather than using flexible register specifications</li></ul><p>However, the bytecode design of <code>.eh_frame</code> can sometimes be more efficient than <code>.sframe</code>, as demonstrated on x86-64.</p><hr /><p>SFrame serves as a specialized complement to <code>.eh_frame</code> rather than a complete replacement. The current version does not include personality routines, Language Specific Data Area (LSDA) information, or the ability to encode extra callee-saved registers. 
While these constraints make SFrame ideal for profilers, they prevent it from supporting C++ exception handling, where libstdc++/libc++abi requires the full <code>.eh_frame</code> feature set.</p><p>In practice, executables and shared objects will likely contain all three sections:</p><ul><li><code>.eh_frame</code>: Complete unwinding information for exception handling</li><li><code>.eh_frame_hdr</code> (encompassed by the <code>PT_GNU_EH_FRAME</code> program header): Fast lookup table for <code>.eh_frame</code></li><li><code>.sframe</code> (encompassed by the <code>PT_GNU_SFRAME</code> program header)</li></ul><p>The auxiliary header, currently unused, provides a pathway for future enhancements. It could potentially accommodate <code>.eh_frame</code> augmentation data such as personality routines, language-specific data areas (LSDAs), and signal frame handling, bridging some of the current functionality gaps.</p><h2 id="large-text-section-support">Large text section support</h2><p>The <code>sfde_func_start_address</code> field uses a signed 32-bit offset to reference functions, providing a ±2GB addressing range from the field's location. This signed encoding offers flexibility in section ordering: <code>.sframe</code> can be placed either before or after text sections.</p><p>However, this approach faces limitations with large binaries, particularly when LLVM generates <code>.ltext</code> sections for x86-64. 
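</p>
<p>The reachability constraint can be phrased as a simple check (a sketch; <code>fde_reachable</code> is a hypothetical helper for illustration, not part of any SFrame implementation):</p>

```cpp
#include <cstdint>
#include <limits>

// Can a function at `func` be referenced by the signed 32-bit
// sfde_func_start_address field located at address `field`?
inline bool fde_reachable(uint64_t field, uint64_t func) {
  int64_t d = int64_t(func) - int64_t(field);
  return d >= std::numeric_limits<int32_t>::min() &&
         d <= std::numeric_limits<int32_t>::max();
}
```

<p>When <code>.ltext</code> pushes functions more than 2GB away from the <code>.sframe</code> section, this check fails and the format has no fallback encoding.</p>
<p>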
The typical section layout creates significant gaps between <code>.sframe</code> and <code>.ltext</code>:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">.ltext          // Large text section</span><br><span class="line">.lrodata        // Large read-only data</span><br><span class="line">.rodata         // Regular read-only data</span><br><span class="line">// .eh_frame and .sframe position</span><br><span class="line">.text           // Regular text section</span><br><span class="line">.data</span><br><span class="line">.bss</span><br><span class="line">.ldata          // Large data</span><br><span class="line">.lbss           // Large BSS</span><br></pre></td></tr></table></figure><h2 id="object-file-format-design-issues">Object file format design issues</h2><h3 id="mandatory-index-building-problems">Mandatory index building problems</h3><p>Currently, Binutils enforces a single-element structure within each <code>.sframe</code> section, regardless of whether it resides in a relocatable object or final executable. While the <code>SFRAME_F_FDE_SORTED</code> flag can be cleared to permit unsorted FDEs, proposed unwinder implementations for the Linux kernel do not seem to support multiple elements in a single section. The design choice makes linker merging mandatory rather than optional.</p><p>This design choice stems from Linux kernel requirements, where kernel modules are relocatable files created with <code>ld -r</code>. The pending SFrame support for linux-perf expects each module to contain a single indexed format for efficient runtime processing. 
Consequently, GNU ld merges all input <code>.sframe</code> sections into a single indexed element, even when producing relocatable files. This behavior deviates from standard <a href="/blog/2022-11-21-relocatable-linking">relocatable linking</a> conventions that suppress synthetic section finalization.</p><p>This approach differs from almost every metadata section, which supports multiple concatenated elements, each with its own header and body. LLVM supports numerous well-behaved metadata sections (<code>__asan_globals</code>, <code>.stack_sizes</code>, <code>__patchable_function_entries</code>, <code>__llvm_prf_cnts</code>, <code>__sancov_bools</code>, <code>__llvm_covmap</code>, <code>__llvm_gcov_ctr_section</code>, <code>.llvmcmd</code>, and <code>llvm_offload_entries</code>) that concatenate without issues. SFrame stands apart as the only metadata section demanding version-specific merging as default linker behavior, creating an unprecedented maintenance burden. For optimal portability, unwinders should support multiple-element structures within a <code>.sframe</code> section.</p><p>For optimal portability, we must support object files from diverse origins—not just those built from a single toolchain. In environments where almost everything is built from source with a single toolchain offering strong SFrame support, forcing default-on index building may be acceptable. However, we must also accommodate environments with prebuilt object files using older SFrame versions, or toolchains that don't support old formats. I believe unwinders should support multiple-element structures within a <code>.sframe</code> section. When a linker builds an index for <code>.sframe</code>, it should be viewed as an optimization that relieves the unwinder from constructing its own index at runtime. 
This index construction should remain optional rather than required.</p><h3 id="section-group-compliance-and-garbage-collection-issues">Section group compliance and garbage collection issues</h3><p>GNU Assembler generates a single <code>.sframe</code> section containing relocations to <code>STB_LOCAL</code> symbols from multiple text sections, including those in different section groups.</p><p>This creates ELF specification violations when a referenced text section is discarded by the <a href="/blog/2021-07-25-comdat-and-section-group">COMDAT section group rule</a>. The ELF specification states:</p><blockquote><p>A symbol table entry with <code>STB_LOCAL</code> binding that is defined relative to one of a group's sections, and that is contained in a symbol table section that is not part of the group, must be discarded if the group members are discarded. References to this symbol table entry from outside the group are not allowed.</p></blockquote><p>The problem manifests when inline functions are deduplicated:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cat</span> &gt; a.cc &lt;&lt;<span class="string">&#x27;eof&#x27;</span></span><br><span class="line">[[gnu::noinline]] inline int <span class="function"><span class="title">inl</span></span>() &#123; <span class="built_in">return</span> 0; &#125;</span><br><span class="line">auto *fa = inl;</span><br><span class="line">eof</span><br><span class="line"><span class="built_in">cat</span> &gt; b.cc &lt;&lt;<span class="string">&#x27;eof&#x27;</span></span><br><span class="line">[[gnu::noinline]] inline int <span class="function"><span 
class="title">inl</span></span>() &#123; <span class="built_in">return</span> 0; &#125;</span><br><span class="line">auto *fb = inl;</span><br><span class="line">eof</span><br><span class="line">~/opt/gcc-15/bin/g++ -Wa,--gsframe -c a.cc b.cc</span><br></pre></td></tr></table></figure><p>Linkers correctly reject this violation:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">% ld.lld a.o b.o</span><br><span class="line">ld.lld: error: relocation refers to a discarded section: .text._Z3inlv</span><br><span class="line">&gt;&gt;&gt; defined in b.o</span><br><span class="line">&gt;&gt;&gt; referenced by b.cc</span><br><span class="line">&gt;&gt;&gt;               b.o:(.sframe+0x1c)</span><br><span class="line"></span><br><span class="line">% gold a.o b.o</span><br><span class="line">b.o(.sframe+0x1c): error: relocation refers to local symbol &quot;.text._Z3inlv&quot; [2], which is defined in a discarded section</span><br><span class="line">  section group signature: &quot;inl()&quot;</span><br><span class="line">  prevailing definition is from a.o</span><br></pre></td></tr></table></figure><p>(In 2020, I reported a <a href="https://gcc.gnu.org/PR93195">similar issue</a> for GCC <code>-fpatchable-function-entry=</code>.)</p><p>Some linkers don't implement this error check. A separate issue arises with garbage collection: by default, an unreferenced <code>.sframe</code> section will be discarded. 
If the linker implements a workaround to force-retain <code>.sframe</code>, it might inadvertently retain all text sections referenced by <code>.sframe</code>, even those that would otherwise be garbage collected.</p><p>The solution requires restructuring the assembler's output strategy. Instead of creating a monolithic <code>.sframe</code> section, the assembler should generate individual SFrame sections corresponding to each text section. When a text section belongs to a COMDAT group, its associated SFrame section must join the same group. For standalone text sections, the <code>SHF_LINK_ORDER</code> flag should establish the proper association.</p><p>This approach would create multiple SFrame sections within relocatable files, making the size optimization benefits of a simplified linking view format even more compelling. While this comes with the overhead of additional section headers (where each <code>Elf64_Shdr</code> consumes 64 bytes), it's a cost we should pay to be a good ELF citizen. This reinforces the value of my <a href="/blog/2024-04-01-light-elf-exploring-potential-size-reduction">section header reduction proposal</a>.</p><h3 id="version-compatibility-challenges">Version compatibility challenges</h3><p>The current design creates significant version compatibility problems. When a linker only supports v3 but encounters object files with v2 <code>.sframe</code> sections, it faces impossible choices:</p><ul><li>Discard v2 sections: Silently losing functionality</li><li>Report errors: Breaking builds with mixed-version object files</li><li>Concatenate sections: Currently unsupported by unwinders</li><li>Upgrade v2 to v3: Requires maintaining version-specific merge logic for every version</li></ul><p>This differs fundamentally from reading a format—each version needs version-specific <em>merging</em> logic in every linker. 
Consider the scenario where v2 uses layout A, v3 uses layout B, and v4 uses layout C. A linker receiving objects with all three versions must produce coherent output with proper indexing while maintaining version-specific merge logic for each.</p><p>Real-world mixing scenarios include:</p><ul><li>Third-party vendor libraries built with older toolchains</li><li>Users linking against prebuilt libraries from different sources</li><li>Users who don't need SFrame but must handle prebuilt libraries with older versions</li><li>Users updating their linker to a newer version that drops legacy SFrame support</li></ul><p>Most users will not need stack tracing features—this may change eventually, but that will take many years. In the meantime, they must accept unneeded information while handling the resulting compatibility issues.</p><p>Requiring version-specific merging as default behavior would create maintenance burden unmatched by any other loadable metadata section.</p><h3 id="proposed-format-separation">Proposed format separation</h3><p>A future version should distinguish between linking and execution views to resolve the compatibility and maintenance challenges outlined above. This separation has precedent in existing debug formats: <code>.debug_pubnames</code>/<code>.gdb_index</code> provides an excellent model for separate linking and execution views. 
DWARF v5's <code>.debug_names</code> takes a different approach, unifying both views at the cost of larger linking formats—a reasonable tradeoff since relocatable files contain only a single <code>.debug_names</code> section, and debuggers can efficiently load sections with concatenated name tables.</p><p>For SFrame, the separation would work as follows:</p><p><strong>Separate linking format.</strong> Assemblers produce a simpler format, omitting index-specific metadata fields such as <code>sfh_num_fdes</code>, <code>sfh_num_fres</code>, <code>sfh_fdeoff</code>, and <code>sfh_freoff</code>.</p><p><strong>Default concatenation behavior.</strong> Linkers concatenate <code>.sframe</code> input sections by default, consistent with DWARF and other metadata sections. Linkers can handle mixed-version scenarios gracefully without requiring version-specific merge logic, eliminating the impossible maintenance burden of keeping version-specific merge logic for every SFrame version in every linker implementation. Distributions can roll out SFrame support incrementally without requiring all linkers to support index building immediately.</p><p>The unwinder implementation cost is manageable. Stack unwinders already need to support <code>.sframe</code> sections across the main executable and all shared objects. Supporting multiple concatenated elements within a single <code>.sframe</code> section presents no fundamental technical barrier—this is a one-time implementation cost that provides forward and backward compatibility.</p><p><strong>Optional index construction.</strong> When the opt-in option <code>--sframe-index</code> is requested, the linker builds an index from recognized versions while reporting warnings for unrecognized ones. This is analogous to <a href="/blog/2022-10-30-distribution-of-debug-information"><code>--gdb-index</code> and <code>--debug-names</code></a>.</p><p>With this approach, the linker builds <code>.sframe_idx</code> from input <code>.sframe</code> sections. 
To support the Linux kernel workflow (<code>ld -r</code> for kernel modules), <code>ld -r --sframe-index</code> must also generate the indexed format.</p><p>The index construction happens before section matching in linker scripts. The output section description <code>.sframe_idx : &#123; *(.sframe_idx) &#125;</code> places the synthesized <code>.sframe_idx</code> into the <code>.sframe_idx</code> output section. <code>.sframe</code> input sections have been replaced by the linker-synthesized <code>.sframe_idx</code>, so we don't write <code>*(.sframe)</code>.</p><h2 id="alternative-deriving-sframe-from-.eh_frame">Alternative: Deriving SFrame from .eh_frame</h2><p>An alternative approach could eliminate the need for assemblers to generate <code>.sframe</code> sections directly. Instead, the linker would merge and optimize <code>.eh_frame</code> as usual (which requires CIE and FDE boundary information), then derive <code>.sframe</code> (or <code>.sframe_idx</code>) from the optimized <code>.eh_frame</code>.</p><p>This approach offers a significant advantage: since the linker only reads the stable <code>.eh_frame</code> format and produces <code>.sframe</code> or <code>.sframe_idx</code> as output, version compatibility concerns disappear entirely.</p><p>While CFI instruction decoding introduces additional complexity (a step previously unneeded), this is balanced by the architectural advantage of centralizing the conversion logic. Rather than scattering format-specific processing code throughout the linker (similar to how <code>SHF_MERGE</code> and <code>.eh_frame</code> require special internal representations), the transformation logic remains localized.</p><p>The counterargument centers on maintenance burden. 
This fine-grained knowledge of the SFrame format may expose the linker to more frequent updates as the format evolves—a serious risk, given that the linker's foundational role in the build process demands exceptional stability and robustness.</p><h3 id="post-processing-alternative">Post-processing alternative</h3><p>A more cautious intermediate strategy could leverage existing Linux distribution post-processing tools, modifying them to append <code>.sframe</code> sections to executable and shared object files after linking completes. While this introduces more friction than native linker support and requires integration into package build systems, it offers several compelling advantages:</p><ul><li>Allows <code>.sframe</code> format experimentation without imposing linker complexity</li><li>Provides time for the format to mature and prove its value before committing to linker integration</li><li>Enables testing across diverse userspace packages in real-world scenarios</li><li>Post-link tools can optimize and even overwrite sections in-place without linker constraints</li><li>For cases where optimization significantly shrinks the section, <code>.sframe</code> can be placed at the end of the file (similar to BOLT moving <code>.rodata</code>)</li></ul><p>However, this approach faces practical challenges. Post-processing adds build complexity, particularly with features like build-ids and read-only file systems. The success of <code>.gdb_index</code>, where linker support (<code>--gdb-index</code>) proved more popular than post-link tools, suggests that native linker support eventually becomes necessary for widespread adoption.</p><p>The key question is timing: should linker integration be the starting point or the outcome of proven stability?</p><h2 id="shf_alloc-considerations">SHF_ALLOC considerations</h2><p>The <code>.sframe</code> section carries the <code>SHF_ALLOC</code> flag, meaning it's loaded as part of the program's read-only data segment. 
This design choice creates tradeoffs:</p><p><strong>With SHF_ALLOC:</strong></p><ul><li><code>.sframe</code> contributes to initial read-only data segment consumption</li><li>Can be accessed directly as part of the memory-mapped area, relying on the kernel's page-fault-on-demand mechanism</li></ul><p><strong>Without SHF_ALLOC:</strong></p><ul><li>No upfront memory cost</li><li>Tracers must open the file and initiate IO to mmap the section on demand</li><li>Runtime cost may not amortize well for frequent tracing</li></ul><p>Analysis of 337 files in /usr/bin and /usr/lib/x86_64-linux-gnu/ shows <code>.eh_frame</code> typically consumes 5.2% (median: 5.1%) of file size:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">EH_Frame size distribution:</span><br><span class="line">  Min: 0.3%    Max: 11.5%    Mean: 5.2%    Median: 5.1%</span><br><span class="line"></span><br><span class="line">  0%-1%: 7 files      5%-6%: 62 files</span><br><span class="line">  1%-2%: 17 files     6%-7%: 33 files</span><br><span class="line">  2%-3%: 37 files     7%-8%: 36 files</span><br><span class="line">  3%-4%: 49 files     8%-9%: 20 files</span><br><span class="line">  4%-5%: 50 files     9%-10%: 20 files</span><br><span class="line">                      10%-12%: 6 files</span><br></pre></td></tr></table></figure><p>If <code>.sframe</code> size is comparable to <code>.eh_frame</code>, this represents significant overhead for applications that never use stack tracing—likely the majority of users. 
Most users will not need stack trace features, raising the question of whether having <code>.sframe</code> always loaded is an acceptable overhead for distributions shipping it by default.</p><p>perf supports <code>.debug_frame</code> (tools/perf/util/unwind-libunwind-local.c), which does not have <code>SHF_ALLOC</code>. While there's a difference between the status quo and what's optimal, the non-<code>SHF_ALLOC</code> approach deserves consideration for scenarios where runtime tracing overhead can be amortized or where memory footprint matters more than immediate access.</p><h2 id="kernel-challenges">Kernel challenges</h2><p>The <code>.sframe</code> section may not be resident in physical memory. SFrame proposers are attempting to defer user stack traces until syscall boundaries.</p><p>Ian Rogers points out that BPF programs can no longer simply stack trace user code. This change breaks stack trace deduplication, a commonly used BPF primitive.</p><h2 id="miscellaneous-minor-considerations">Miscellaneous minor considerations</h2><p><strong>Linker relaxation considerations:</strong></p><p>Since <code>.sframe</code> carries the <code>SHF_ALLOC</code> flag, it affects text section addresses and consequently influences <a href="/blog/2022-07-10-riscv-linker-relaxation-in-lld">linker relaxation</a> on architectures like RISC-V and LoongArch.</p><p>If variable-length encoding is introduced to the format, <code>.sframe</code> would behave as an address-dependent section similar to <code>.relr.dyn</code>. However, this dependency should not pose significant implementation challenges.</p><p><strong>Endianness considerations:</strong></p><p>The SFrame format currently supports endianness variants, which complicates toolchain implementation. 
While runtime consumers typically target a single endianness, development tools must handle both variants to support cross-compilation workflows.</p><p>The endianness discussion in <a href="https://lwn.net/Articles/1035727/">The future of 32-bit support in the kernel</a> reinforces my belief in preferring universal little-endian for new formats. A universal little-endian approach would reduce implementation complexity by eliminating the need for:</p><ul><li>Endianness-aware function calls like <code>read32le(config, p)</code> where <code>config-&gt;endian</code> specifies the object file's byte order</li><li>Template-based abstractions such as <code>template &lt;class Endian&gt;</code> that must wrap every data access function</li></ul><p>Instead, toolchain code could use straightforward calls like <code>read32le(p)</code>, streamlining both implementation and maintenance.</p><p>This approach remains efficient even on big-endian architectures like IBM z/Architecture and POWER. z/Architecture's LOAD REVERSED instructions, for instance, handle byte swapping with minimal overhead, often requiring no additional instructions beyond normal loads. 
While slight performance differences may exist compared to native endian operations, the toolchain simplification benefits generally outweigh these concerns.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="keyword">define</span> WIDTH(x) \</span></span><br><span class="line"><span class="meta">typedef __UINT##x##_TYPE__ [[gnu::aligned(1)]] uint##x; \</span></span><br><span class="line"><span class="meta">uint##x load_inc##x(uint##x *p) &#123; return *p+1; &#125; \</span></span><br><span class="line"><span class="meta">uint##x load_bswap_inc##x(uint##x *p) &#123; return __builtin_bswap##x(*p)+1; &#125;; \</span></span><br><span class="line"><span class="meta">uint##x load_eq##x(uint##x *p) &#123; return *p==3; &#125; \</span></span><br><span class="line"><span class="meta">uint##x load_bswap_eq##x(uint##x *p) &#123; return __builtin_bswap##x(*p)==3; &#125;; \</span></span><br><span class="line"><span class="meta"></span></span><br><span class="line">WIDTH(<span class="number">16</span>);</span><br><span class="line">WIDTH(<span class="number">32</span>);</span><br><span class="line">WIDTH(<span class="number">64</span>);</span><br></pre></td></tr></table></figure><p>However, I understand that my opinion is probably not popular within the object file format community and faces resistance from stakeholders with significant big-endian investments.</p><h2 id="questioned-benefits">Questioned benefits</h2><p>SFrame's primary benefit centers on enabling frame pointer omission while preserving unwinding capabilities. 
In scenarios where users already omit leaf frame pointers, SFrame could theoretically allow switching from <code>-fno-omit-frame-pointer -momit-leaf-frame-pointer</code> to <code>-fomit-frame-pointer -momit-leaf-frame-pointer</code>. This benefit appears most significant on x86-64, which has limited general-purpose registers (without APX). Performance analyses show mixed results: some studies claim frame pointers degrade performance by less than 1%, while others suggest 1-2%. However, this argument overlooks a critical tradeoff—SFrame unwinding itself performs worse than frame pointer unwinding, potentially negating any performance gains from register availability.</p><p>Another claimed advantage is SFrame's ability to provide coverage in function prologues and epilogues, where frame-pointer-based unwinding may miss frames. Yet this overlooks a straightforward alternative: frame pointer unwinding can be enhanced to detect prologue and epilogue patterns by disassembling instructions at the program counter.</p><p>SFrame also faces a practical consideration: the <code>.sframe</code> section likely requires kernel page-in during unwinding, while the process stack is more likely already resident in physical memory. As Ian Rogers noted in <a href="https://lwn.net/Articles/1030223/">LWN</a>, system-wide profiling encounters limitations when system calls haven't transitioned to user code, BPF helpers may return placeholder values, and JIT compilers require additional SFrame support.</p><p>Looking ahead, hardware-assisted unwinding through features like x86 Shadow Stack and AArch64 Guarded Control Stack may reshape the entire landscape, potentially reducing the relevance of metadata-based unwinding formats. Meanwhile, compact unwinding schemes like <a href="https://www.corsix.org/content/windows-arm64-unwind-codes">Windows ARM64</a> demonstrate that significantly smaller metadata formats remain viable alternatives to both SFrame and <code>.eh_frame</code>. 
Proposals like Asynchronous Compact Unwind Descriptors have demonstrated that compact unwind formats can work with shrink-wrapping optimizations. There is a feature request for a compact unwind format for AArch64: <a href="https://github.com/ARM-software/abi-aa/issues/344" class="uri">https://github.com/ARM-software/abi-aa/issues/344</a></p><h2 id="summary">Summary</h2><p>Beyond these fundamental questions about SFrame's value proposition, the format does offer a size improvement over the Linux kernel's ORC unwinder. Its design presents several implementation challenges that merit consideration for future versions:</p><ul><li>Object file format design issues (mandatory index building, section group compliance, version compatibility)</li><li>Limited large text section support restricts deployment in modern binaries</li><li>Size issue</li></ul><p>These technical concerns, combined with the fundamental value questions raised above, suggest that careful consideration is warranted before widespread adoption.</p><h2 id="if-we-proceed-here-is-how-to-do-it-right">If we proceed, here is how to do it right</h2><p>According to <a href="https://github.com/llvm/llvm-project/issues/64449#issuecomment-3433777733">this comment on llvm-project #64449</a>, "v3 is the version that will be submitted upstream when the time is right." Please share feedback on the format before it's finalized, even if you may not be impressed with the design.</p><p>To ensure rapid SFrame evolution without compatibility concerns, a better approach is to build a library that parses <code>.eh_frame</code> and generates SFrame. The Linux kernel can then use this library (in objtool?) to generate SFrame for vmlinux and modules. Relying on assembler/linker output for this critical metadata format requires a level of stability that is currently concerning.</p><p>The ongoing maintenance implications warrant particular attention. Observing the binutils mailing list reveals a significant volume of SFrame commits. 
Most linker features stabilize quickly after initial implementation, but SFrame appears to require continued evolution. Given the linker's foundational role in the build process, which demands exceptional stability and robustness, the long-term maintenance burden deserves careful consideration.</p><p>Early integration into the GNU toolchain has provided valuable feedback for format evolution, but this comes at the cost of coupling the format's maturity to linker stability. The SFrame GNU toolchain developers exhibit a <a href="https://sourceware.org/pipermail/binutils/2025-October/144974.html">concerning tendency to disregard ELF and linker conventions</a>—a serious problem for all linker maintainers.</p><h3 id="learning-from-existing-compact-unwind-implementations">Learning from existing compact unwind implementations</h3><p>LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6. The efficiency gains are dramatic even if it might only cover synchronous unwinding needs. OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their <a href="https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282">2018 post on LLVM Discourse</a>. This isn't to suggest we should simply adopt the existing compact unwind format wholesale. The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.</p><p>On AArch64, there are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;SFrame is a new &lt;a href=&quot;/blog/2020-11-08-stack-unwinding&quot;&gt;stack
walking format&lt;/a&gt; for userspace profiling, inspired by Linux&#39;s
in-kernel &lt;a
href=&quot;https://docs.kernel.org/arch/x86/orc-unwinder.html&quot;&gt;ORC unwind
format&lt;/a&gt;. While SFrame eliminates some &lt;code&gt;.eh_frame&lt;/code&gt; CIE/FDE
overhead, it sacrifices functionality (e.g., personality, LSDA,
callee-saved registers) and flexibility, and its stack offsets are less
compact than &lt;code&gt;.eh_frame&lt;/code&gt;&#39;s bytecode-style CFI instructions.
In llvm-project executables I&#39;ve tested on x86-64, the &lt;code&gt;.sframe&lt;/code&gt;
section is 20% larger than &lt;code&gt;.eh_frame&lt;/code&gt;. It also remains
significantly larger than highly compact schemes like &lt;a
href=&quot;https://www.corsix.org/content/windows-arm64-unwind-codes&quot;&gt;Windows
ARM64 unwind codes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;SFrame describes three elements for each function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Canonical Frame Address (CFA): The base address for stack frame
calculations&lt;/li&gt;
&lt;li&gt;Return address&lt;/li&gt;
&lt;li&gt;Frame pointer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An &lt;code&gt;.sframe&lt;/code&gt; section follows a straightforward layout:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Header: Contains metadata and offset information&lt;/li&gt;
&lt;li&gt;Auxiliary header (optional): Reserved for future extensions&lt;/li&gt;
&lt;li&gt;Function Descriptor Entries (FDEs): Array describing each
function&lt;/li&gt;
&lt;li&gt;Frame Row Entries (FREs): Arrays of unwinding information per
function&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="sframe" scheme="https://maskray.me/blog/tags/sframe/"/>
    
  </entry>
  
  <entry>
    <title>lld 21 ELF changes</title>
    <link href="https://maskray.me/blog/2025-09-07-lld-21-elf-changes"/>
    <id>https://maskray.me/blog/2025-09-07-lld-21-elf-changes</id>
    <published>2025-09-07T07:00:00.000Z</published>
    <updated>2026-02-02T00:31:25.015Z</updated>
    
    <content type="html"><![CDATA[<p>LLVM 21.1 has been released. As usual, I maintain lld/ELF and have added some notes to <a href="https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst" class="uri">https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst</a>. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.</p><span id="more"></span><ul><li>Added <code>-z dynamic-undefined-weak</code> to make undefined weak symbols dynamic when the dynamic symbol table is present. (<a href="https://github.com/llvm/llvm-project/pull/143831">#143831</a>)</li><li>For <code>-z undefs</code> (default for <code>-shared</code>), relocations referencing undefined strong symbols now behave like relocations referencing undefined weak symbols.</li><li><code>--why-live=&lt;glob&gt;</code> prints for each symbol matching <code>&lt;glob&gt;</code> a chain of items that kept it live during garbage collection. This is inspired by the Mach-O LLD feature of the same name.</li><li><code>--thinlto-distributor=</code> and <code>--thinlto-remote-compiler=</code> options are added to support Integrated Distributed ThinLTO. (<a href="https://github.com/llvm/llvm-project/pull/142757">#142757</a>)</li><li>Linker script <code>OVERLAY</code> descriptions now support virtual memory regions (e.g. <code>&gt;region</code>) and <code>NOCROSSREFS</code>.</li><li>When the last <code>PT_LOAD</code> segment is executable and includes BSS sections, its <code>p_memsz</code> member is now correct. (<a href="https://github.com/llvm/llvm-project/pull/139207">#139207</a>)</li><li>Spurious <code>ASSERT</code> errors before the layout converges are now fixed.</li><li>For ARM and AArch64, <code>--xosegment</code> and <code>--no-xosegment</code> control whether to place executable-only and readable-executable sections in the same segment. The default option is <code>--no-xosegment</code>. 
(<a href="https://github.com/llvm/llvm-project/pull/132412">#132412</a>)</li><li>For AArch64, added support for the <code>SHF_AARCH64_PURECODE</code> section flag, which indicates that the section only contains program code and no data. An output section will only have this flag set if all input sections also have it set. (<a href="https://github.com/llvm/llvm-project/pull/125689">#125689</a>, <a href="https://github.com/llvm/llvm-project/pull/134798">#134798</a>)</li><li>For AArch64 and ARM, added <code>-zexecute-only-report</code>, which checks for missing <code>SHF_AARCH64_PURECODE</code> and <code>SHF_ARM_PURECODE</code> section flags on executable sections. (<a href="https://github.com/llvm/llvm-project/pull/128883">#128883</a>)</li><li>For AArch64, <code>-z nopac-plt</code> has been added.</li><li>For AArch64 and X86_64, added <code>--branch-to-branch</code>, which rewrites branches that point to another branch instruction to instead branch directly to the target of the second instruction. Enabled by default at <code>-O2</code>.</li><li>For AArch64, added support for <code>-zgcs-report-dynamic</code>, enabling checks for GNU GCS Attribute Flags in Dynamic Objects when GCS is enabled. Inherits value from <code>-zgcs-report</code> (capped at <code>warning</code> level) unless user-defined, ensuring compatibility with the GNU ld linker.</li><li>The default Hexagon architecture version in ELF object files produced by lld is changed to v68. 
This change is only effective when the version is not provided in the command line by the user and cannot be inferred from inputs.</li><li>For LoongArch, the initial-exec to local-exec TLS optimization has been implemented.</li><li>For LoongArch, several relaxation optimizations are supported, including relaxation for <code>R_LARCH_PCALA_HI20/LO12</code> and <code>R_LARCH_GOT_PC_HI20/LO12</code> relocations, instruction relaxation for <code>R_LARCH_CALL36</code>, TLS local-exec (<code>LE</code>)/global dynamic (<code>GD</code>)/local dynamic (<code>LD</code>) model relaxation, and TLSDESC code sequence relaxation.</li><li>For RISCV, an oscillation bug due to call relaxation is now fixed. (<a href="https://github.com/llvm/llvm-project/pull/142899">#142899</a>)</li><li>For x86-64, the <code>.ltext</code> section is now placed before <code>.rodata</code>.</li></ul><hr /><p>Link: <a href="/blog/2025-02-02-lld-20-elf-changes">lld 20 ELF changes</a></p>]]></content>
    
    
    <summary type="html">&lt;p&gt;LLVM 21.1 has been released. As usual, I maintain lld/ELF and have
added some notes to &lt;a
href=&quot;https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst&quot;
class=&quot;uri&quot;&gt;https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst&lt;/a&gt;.
I&#39;ve meticulously reviewed nearly all the patches that are not authored
by me. I&#39;ll delve into some of the key changes.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="release" scheme="https://maskray.me/blog/tags/release/"/>
    
  </entry>
  
  <entry>
    <title>Benchmarking compression programs</title>
    <link href="https://maskray.me/blog/2025-08-31-benchmarking-compression-programs"/>
    <id>https://maskray.me/blog/2025-08-31-benchmarking-compression-programs</id>
    <published>2025-08-31T07:00:00.000Z</published>
    <updated>2025-09-17T16:55:28.974Z</updated>
    
    <content type="html"><![CDATA[<p>tl;dr <a href="https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7" class="uri">https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7</a> is a single-file Ruby program that downloads and compiles multiple compression utilities, then benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.</p><p>Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:</p><ul><li>Low ratio, high speed: <em>lz4, snappy, Oodle Selkie</em>.</li><li>Medium ratio, medium speed: <em>zlib, zstd, brotli, Oodle Kraken</em>.</li><li>High ratio, low speed: <em>LZMA, bzip2, bzip3, bsc, zpaq, kanzi, Oodle Leviathan</em>.</li></ul><p><strong>Low ratio</strong> Codecs in this category prioritize speed above all else. The compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage like solid-state drives. These codecs typically produce byte-aligned output and often skip the final step of entropy encoding, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.</p><p><strong>Medium ratio</strong> This is the sweet spot for many tasks. The codecs achieve better compression ratios by employing entropy encoding, usually Huffman coding.</p><p><em>zstd</em> has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).</p><p><strong>High ratio</strong> They are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times, and large memory usage. 
They are perfect for archival purposes or data distribution where the files are compressed once and decompressed infrequently. Codecs typically have 3 important components:</p><ul><li>Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.</li><li>Prediction model: This model anticipates the next piece of data based on what has already been processed.</li><li>Entropy encoding: Traditional codecs use an arithmetic encoder, which is replaced by the more efficient Range variant of Asymmetric Numeral Systems (rANS).</li></ul><p>Some projects apply neural network models, such as Recurrent Neural Network, Long Short-Term Memory, and Transformer, to the prediction model. They are usually very slow.</p><hr /><p>This categorization is loose. Many modern programs offer a wide range of compression levels that allow them to essentially span multiple categories. For example, a high-level <em>zstd</em> compression can achieve a ratio comparable to <em>xz</em> (a high-compression codec) by using more RAM and CPU. While <em>zstd</em>'s compression speed or ratio is generally lower, its decompression speed is often much faster than that of <em>xz</em>.</p><h2 id="benchmarking">Benchmarking</h2><p>I want to benchmark the single worker performance of a few compression programs:</p><ul><li><em>lz4</em>: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's <em>Snappy</em>.</li><li><em>zstd</em>: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. 
However, other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the <code>Accept-Encoding: zstd</code> HTTP header. Decompression memory usage is very low.</li><li><em>brotli</em>: Uses a combination of LZ77, a 2nd order context model, Huffman coding, and a static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by <em>zstd</em>. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. <em>brotli</em> is supported by the <code>Accept-Encoding: br</code> HTTP header. Decompression memory usage is quite low.</li><li><em>bzip3</em>: Combines BWT, RLE, and LZP and uses an arithmetic encoder. Memory usage is large.</li><li><em>xz</em>: LZMA2 with a few filters. The filters must be enabled explicitly.</li><li>lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower while memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed <code>stdint.h</code> build errors, and installed <code>lzhamtest</code>. The command line program <code>lzhamtest</code> should really be renamed to <code>lzham</code>.</li><li><em>zpaq</em>: Functions as a command-line archiver supporting multiple files. It combines context mixing with an arithmetic encoder but operates very slowly.</li><li><em>kanzi</em>: There are a wide variety of transforms and entropy encoders, unusual for a compression program. For the compression speed of enwik8, it's Pareto superior to <em>xz</em>, but decompression is slower. 
Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.</li></ul><p>I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a <code>cstdio</code> header issue.</p><p>Many modern compressors are parallel by default. I have to disable this behavior by using options like <code>-T1</code>. Still, <em>zstd</em> uses a worker thread for I/O overlap, but I don't bother with <code>--single-thread</code>.</p><p>To ensure fairness, each program is built with consistent compiler optimizations, such as <code>-O3 -march=native</code>.</p><p>Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:</p><ul><li>Adding new compressors is easy: just modify <code>COMPRESSORS</code>.</li><li>Benchmark results are cached in files named <code>cache_$basename_$digest.json</code>, allowing reuse of previous runs for the same input file.</li><li>Adding a new compression level does not invalidate existing benchmark results for other levels.</li><li>The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the <code>hsl</code> function in CSS).</li></ul><p>The single-file Ruby program is available at <a href="https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7" class="uri">https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7</a></p><h2 id="limitation">Limitation</h2><p>A single run might not be representative.</p><p>Running the executable incurs initialization overhead, which would be amortized in a library setup. 
However, a library setup would make updating libraries more difficult.</p><h2 id="demo">Demo</h2><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">ruby bench.rb enwik8</span><br><span class="line"><span class="comment"># The first iframe below</span></span><br><span class="line"></span><br><span class="line">ruby bench.rb clang</span><br><span class="line"><span class="comment"># The second iframe below</span></span><br></pre></td></tr></table></figure><div class="demo"><p><iframe src="/static/2025-08-31-benchmarking-compression-programs/enwik8.html" width=100% height=800px></iframe></p><p><iframe src="/static/2025-08-31-benchmarking-compression-programs/clang.html" width=100% height=800px></iframe></p></div><p>Many programs exhibit a stable decompression speed (uncompressed size / decompression time). There is typically a slightly higher decompression speed at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file means there are fewer bytes for the decompression algorithm to process, resulting in faster decompression. Some programs, like <em>zpaq</em> and <em>kanzi</em>, use different algorithms that can result in significantly different decompression speeds.</p><p><code>xz -9</code> doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.</p><blockquote><p>From <code>install/include/lzma/container.h</code></p><p>For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased.</p></blockquote>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;tl;dr &lt;a
href=&quot;https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7&quot;
class=&quot;uri&quot;&gt;https://gist.github.com/MaskRay/74cdaa83c1f4</summary>
      
    
    
    
    
    <category term="ruby" scheme="https://maskray.me/blog/tags/ruby/"/>
    
    <category term="compression" scheme="https://maskray.me/blog/tags/compression/"/>
    
  </entry>
  
  <entry>
    <title>Understanding alignment - from source to object file</title>
    <link href="https://maskray.me/blog/2025-08-24-understanding-alignment-from-source-to-object-file"/>
    <id>https://maskray.me/blog/2025-08-24-understanding-alignment-from-source-to-object-file</id>
    <published>2025-08-24T07:00:00.000Z</published>
    <updated>2026-04-12T18:22:31.494Z</updated>
    
    <content type="html"><![CDATA[<p>Updated in 2026-04.</p><p>Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is usually done to meet the requirements of the programming language, ABI, or the underlying hardware. Misaligned memory accesses might be expensive, or might even cause traps, on certain architectures.</p><p>This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.</p><span id="more"></span><h2 id="alignment-in-c-source-code">Alignment in C++ source code</h2><p><a href="https://eel.is/c++draft/basic.align">C++ [basic.align]</a> specifies</p><blockquote><p>Object types have alignment requirements ([basic.fundamental], [basic.compound]) which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier ([dcl.align]). Attempting to create an object ([intro.object]) in storage that does not meet the alignment requirements of the object's type is undefined behavior.</p></blockquote><p><code>alignas</code> can be used to request a stricter alignment. <a href="https://eel.is/c++draft/dcl.align">[dcl.align]</a></p><blockquote><p>An alignment-specifier may be applied to a variable or to a class data member, but it shall not be applied to a bit-field, a function parameter, or an exception-declaration ([except.handle]). An alignment-specifier may also be applied to the declaration of a class (in an elaborated-type-specifier ([dcl.type.elab]) or class-head ([class]), respectively). 
An alignment-specifier with an ellipsis is apack expansion ([temp.variadic]).</p></blockquote><p>Example: <figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">alignas</span>(<span class="number">16</span>) <span class="type">int</span> i0;</span><br><span class="line"><span class="keyword">struct</span> <span class="title class_">alignas</span>(<span class="number">8</span>) S &#123;&#125;;</span><br></pre></td></tr></table></figure></p><p>If the strictest <code>alignas</code> on a declaration is weaker thanthe alignment it would have without any alignas specifiers, the programis ill-formed.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;alignas(2) int v;&#x27; | clang -fsyntax-only -xc++ -</span><br><span class="line">&lt;stdin&gt;:1:1: error: requested alignment is less than minimum alignment of 4 for type &#x27;int&#x27;</span><br><span class="line">    1 | alignas(2) int v;</span><br><span class="line">      | ^</span><br><span class="line">1 error generated.</span><br></pre></td></tr></table></figure><p>However, the GNU extension <code>__attribute__((aligned(1)))</code>can request a weaker alignment.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">typedef</span> <span class="type">int32_t</span> __attribute__((aligned(<span class="number">1</span>))) <span class="type">unaligned_int32_t</span>;</span><br></pre></td></tr></table></figure><p>Further reading: <ahref="https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8">Whatis the Strict 
Aliasing Rule and Why do we care?</a></p><h2 id="llvm-ir-representation">LLVM IR representation</h2><p>In the LLVM Intermediate Representation (IR), both global variablesand functions can have an <code>align</code> attribute to specify theirrequired alignment.</p><p><a href="https://llvm.org/docs/LangRef.html#global-variables">Globalvariable alignment</a>:</p><blockquote><p>An explicit alignment may be specified for a global, which must be apower of 2. If not present, or if the alignment is set to zero, thealignment of the global is set by the target to whatever it feelsconvenient. If an explicit alignment is specified, the global is forcedto have exactly that alignment. Targets and optimizers are not allowedto over-align the global if the global has an assigned section. In thiscase, the extra alignment could be observable: for example, code couldassume that the globals are densely packed in their section and try toiterate over them as an array, alignment padding would break thisiteration. For TLS variables, the module flag MaxTLSAlign, if present,limits the alignment to the given value. Optimizers are not allowed toimpose a stronger alignment on these variables. The maximum alignment is1 &lt;&lt; 32.</p></blockquote><p>Function alignment</p><blockquote><p>An explicit alignment may be specified for a function. If notpresent, or if the alignment is set to zero, the alignment of thefunction is set by the target to whatever it feels convenient. If anexplicit alignment is specified, the function is forced to have at leastthat much alignment. All alignments must be a power of 2.</p></blockquote><p>An explicit preferred alignment (<code>prefalign</code>) may also bespecified for a function definition (must be a power of 2). Unlike<code>align</code>, it is a hint: the final alignment will generallyland somewhere between the minimum and preferred values. 
If absent, thepreferred alignment is determined in a target-specific way(<code>STI-&gt;getTargetLowering()-&gt;getPrefFunctionAlignment()</code>).(<ahref="https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019/3"class="uri">https://discourse.llvm.org/t/rfc-enhancing-function-alignment-attributes/88019/3</a>)</p><figure class="highlight llvm"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">define</span> void <span class="title">@f</span>() <span class="keyword">align</span> <span class="number">2</span> prefalign(<span class="number">16</span>) &#123; <span class="keyword">ret</span> void &#125;</span><br></pre></td></tr></table></figure><hr /><p>In addition, <code>align</code> can be used in parameter attributesto decorate a pointer or <ahref="https://reviews.llvm.org/D115161">vector of pointers</a>.</p><h2 id="llvm-back-end-representation">LLVM back end representation</h2><p><strong>Global variables</strong><code>AsmPrinter::emitGlobalVariable</code> determines the alignment forglobal variables based on a set of nuanced rules:</p><ul><li>With an explicit alignment (<code>explicit</code>),<ul><li>If the variable has a section attribute, return<code>explicit</code>.</li><li>Otherwise, compute a preferred alignment for the data layout(<code>getPrefTypeAlign</code>, referred to as <code>pref</code>).Return<code>pref &lt; explicit ? explicit : max(E, getABITypeAlign)</code>.</li></ul></li><li>Without an explicit alignment: return<code>getPrefTypeAlign</code>.</li></ul><p><code>getPrefTypeAlign</code> employs a heuristic for global variabledefinitions: if the variable's size exceeds 16 bytes and the preferredalignment is less than 16 bytes, it sets the alignment to 16 bytes. Thisheuristic balances performance and memory efficiency for common cases,though it may not be optimal for all scenarios. 
(See <ahref="https://discourse.llvm.org/t/preferred-alignment-of-globals-16bytes/24410">Preferredalignment of globals &gt; 16bytes</a> in 2012)</p><p>For assembly output, AsmPrinter emits <code>.p2align</code> (power of2 alignment) directives with a zero fill value (i.e. the padding bytesare zeros). <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;int v0;&#x27; | clang --target=x86_64 -S -xc - -o -</span><br><span class="line">        .file   &quot;-&quot;</span><br><span class="line">        .type   v0,@object                      # @v0</span><br><span class="line">        .bss</span><br><span class="line">        .globl  v0</span><br><span class="line">        .p2align        2, 0x0</span><br><span class="line">v0:</span><br><span class="line">        .long   0                               # 0x0</span><br><span class="line">        .size   v0, 4</span><br><span class="line">...</span><br></pre></td></tr></table></figure></p><p><strong>Functions</strong> For functions,<code>AsmPrinter::emitFunctionHeader</code> emits alignment directivesbased on the machine function's alignment settings.</p><p><code>MachineFunction::init()</code> sets the <em>minimum</em>alignment from the subtarget:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">void</span> <span class="title">MachineFunction::init</span><span class="params">()</span> </span>&#123;</span><br><span 
class="line">...</span><br><span class="line">  Alignment = STI.<span class="built_in">getTargetLowering</span>()-&gt;<span class="built_in">getMinFunctionAlignment</span>();</span><br></pre></td></tr></table></figure><p>The <em>preferred</em> alignment is computed separately by<code>MachineFunction::getPreferredAlignment()</code>:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Align <span class="title">MachineFunction::getPreferredAlignment</span><span class="params">()</span> <span class="type">const</span> </span>&#123;</span><br><span class="line">  Align PrefAlignment;</span><br><span class="line">  <span class="keyword">if</span> (MaybeAlign A = F.<span class="built_in">getPreferredAlignment</span>()) <span class="comment">// explicit prefalign IR attr</span></span><br><span class="line">    PrefAlignment = *A;</span><br><span class="line">  <span class="keyword">else</span> <span class="keyword">if</span> (!F.<span class="built_in">hasOptSize</span>())</span><br><span class="line">    PrefAlignment = STI.<span class="built_in">getTargetLowering</span>()-&gt;<span class="built_in">getPrefFunctionAlignment</span>();</span><br><span class="line">  <span class="keyword">else</span></span><br><span class="line">    PrefAlignment = <span class="built_in">Align</span>(<span class="number">1</span>);</span><br><span class="line">  <span class="keyword">return</span> std::<span class="built_in">max</span>(PrefAlignment, <span class="built_in">getAlignment</span>());</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here is a summary of 
the minimum and preferred function alignmentvalues across major LLVM targets:</p><table><colgroup><col style="width: 22%" /><col style="width: 27%" /><col style="width: 30%" /><col style="width: 19%" /></colgroup><thead><tr><th>Target</th><th>MinAlign</th><th>PrefAlign</th><th>Notes</th></tr></thead><tbody><tr><td>X86</td><td>(default)</td><td>16</td><td>All variants</td></tr><tr><td>AArch64</td><td>4</td><td>8–64</td><td>CPU-dependent: 8 (A64FX, NeoverseE1), 16 (most cores), 32(ExynosM3), 64 (Ampere1B/C)</td></tr><tr><td>ARM</td><td>2 (Thumb) / 4</td><td>1–8</td><td>Default 1; Exynos sets 8 for non-Thumb</td></tr><tr><td>RISC-V</td><td>2 (Zca) / 4</td><td>1 (default)</td><td>CPU-specific via TuneInfo</td></tr><tr><td>PowerPC</td><td>4</td><td>16</td><td>PWR8+ only; older CPUs have no explicit pref</td></tr><tr><td>SystemZ</td><td>2</td><td>16</td><td></td></tr><tr><td>LoongArch</td><td>4</td><td>32</td><td></td></tr><tr><td>MIPS</td><td>4 (GP32) / 8 (GP64)</td><td>(not set)</td><td></td></tr></tbody></table><p>When the integrated assembler supports <code>.prefalign</code> andthe function's preferred alignment exceeds its minimum alignment,<code>AsmPrinter::emitFunctionHeader</code> emits <code>.p2align</code>for the minimum alignment followed by <code>.prefalign</code> with thepreferred alignment and a function end symbol (<ahref="https://github.com/llvm/llvm-project/pull/184032"class="uri">https://github.com/llvm/llvm-project/pull/184032</a>). Whenthe minimum and preferred alignments coincide, only<code>.p2align</code> is emitted: the <code>.prefalign</code> wouldeither be redundant (if equal) or weaker (if the minimum exceeds thepreferred), so it is suppressed. 
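The directive-selection rule just described can be sketched as follows. This is a simplified model using a single minimum-alignment value; the actual check in <code>AsmPrinter::emitFunctionHeader</code> compares against the backend minimum, as the <code>g</code> example below shows, and the label name here is illustrative:

```cpp
#include <string>

// Simplified model of the emission rule described above: .p2align conveys
// the minimum alignment; .prefalign is emitted only when the preferred
// alignment strictly exceeds the minimum, since otherwise it would be
// redundant (equal) or weaker (minimum exceeds preferred).
std::string alignmentDirectives(unsigned log2Min, unsigned log2Pref) {
  std::string out;
  if (log2Min > 0)
    out += ".p2align " + std::to_string(log2Min) + "\n";
  if (log2Pref > log2Min)
    out += ".prefalign " + std::to_string(log2Pref) + ", .Lfunc_end, nop\n";
  return out;
}
```

For x86 defaults (minimum 1, preferred 16), only `.prefalign 4` appears; when an explicit `align 8 prefalign(32)` is given, both directives are emitted.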
The behavior is the same regardless of<code>-ffunction-sections</code> (initially <code>.prefalign</code>required function sections because the section size carried the bodylength; symbol-based <code>.prefalign</code> removed thatdependency).</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;void f()&#123;&#125; [[gnu::aligned(32)]] void g()&#123;&#125;&#x27; | clang --target=x86_64 -S -xc - -o -</span><br><span class="line">        .att_syntax</span><br><span class="line">        .file   &quot;-&quot;</span><br><span class="line">        .text</span><br><span class="line">        .globl  f                               # -- Begin function f</span><br><span class="line">        .prefalign      4, .Lfunc_end0, nop</span><br><span class="line">        .type   f,@function</span><br><span class="line">f:                                      # @f</span><br><span class="line">...</span><br><span class="line">.Lfunc_end0:</span><br><span class="line">        .size   f, .Lfunc_end0-f</span><br><span class="line">...</span><br><span class="line">        .globl  g                               # -- Begin function g</span><br><span class="line">        .p2align        5</span><br><span class="line">        .type   g,@function</span><br><span class="line">g:                                      # 
@g</span><br><span class="line">...</span><br><span class="line">.Lfunc_end1:</span><br><span class="line">        .size   g, .Lfunc_end1-g</span><br></pre></td></tr></table></figure><p>For <code>f</code>, the x86 minimum function alignment is 1 (no<code>.p2align</code> needed), while the preferred alignment is 16, soonly <code>.prefalign 4, .Lfunc_end0, nop</code> is emitted; the finalalignment depends on <code>f</code>'s body size. For <code>g</code>,<code>[[gnu::aligned(32)]]</code> lowers to both <code>align 32</code>and <code>prefalign(32)</code> in LLVM IR: <code>.p2align 5</code>establishes a 32-byte minimum, and<code>.prefalign 5, .Lfunc_end1, nop</code> is emitted alongside itbecause the check in <code>AsmPrinter::emitFunctionHeader</code>compares the machine function's backend minimum (1 on x86) against thepreferred alignment (32).</p><p>To see both directives side by side, the minimum alignment must bestrictly less than the preferred alignment. That requires the two to beset independently at the IR level, e.g.:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;define void @g() align 8 prefalign(32) &#123; ret void &#125;&#x27; | llc -mtriple=x86_64 -o -</span><br><span class="line">...</span><br><span class="line">        .globl  g                               # -- Begin function g</span><br><span class="line">        .p2align        3</span><br><span class="line">        .prefalign      5, .Lfunc_end0, nop</span><br><span class="line">        .type   g,@function</span><br></pre></td></tr></table></figure><p>The emitted <code>.p2align</code> directives omit the fill valueargument: for code sections, this space is filled with no-opinstructions. 
The <code>.prefalign</code> directive's fill operand isrequired; <code>nop</code> requests target-appropriate NOPinstructions.</p><h2 id="assembly-representation">Assembly representation</h2><p>GNU Assembler supports multiple alignment directives:</p><ul><li><p><code>.p2align 3</code>: align to 2**3</p></li><li><p><code>.balign 8</code>: align to 8</p></li><li><p><code>.align 8</code>: this is identical to <code>.balign</code>on some targets and <code>.p2align</code> on the others.</p></li><li><p><code>.prefalign log2_align, end_sym, nop|fill_byte</code> (LLVMextension, <a href="https://github.com/llvm/llvm-project/pull/184032"class="uri">https://github.com/llvm/llvm-project/pull/184032</a>): padsthe current location so that the code between the directive and<code>end_sym</code> starts at a body-size-dependent alignment.<code>log2_align</code> is a log2 exponent in [0, 63] (e.g.<code>4</code> means 16-byte alignment); <code>end_sym</code> must be asymbol defined in the same section. The fill operand is required:<code>nop</code> fills with target-appropriate NOP instructions, whilean integer in [0, 255] fills with that byte value.</p><p>The alignment is determined by the <em>body size</em>(<code>end_sym</code> offset minus the padded start), letting<code>pref_align = 1 &lt;&lt; log2_align</code>:</p><ul><li>body_size <code>&lt; pref_align</code>: align to<code>std::bit_ceil(body_size)</code>, the smallest power of 2 ≥body_size</li><li>body_size <code>≥ pref_align</code>: align to<code>pref_align</code></li></ul><p>In <code>ELFObjectWriter</code>, the section's<code>sh_addralign</code> is set to the maximum of regular alignmentvalues and computed alignments over all <code>.prefalign</code>fragments. 
To also enforce a minimum alignment, emit a <code>.p2align</code> before <code>.prefalign</code>.</p><p>If the cache block size is 64 and the goal is to minimize the number of cache blocks a function spans, it suffices to align the function start to <code>min(64, std::bit_ceil(body_size))</code>. That's the minimum alignment that prevents an unnecessary boundary crossing. For example, a 12-byte function aligned to min(64, bit_ceil(12)) = 16 is guaranteed not to cross a 64-byte boundary unnecessarily.</p></li></ul><p>Clang supports "direct object emission": <code>clang -c</code> typically bypasses a separate assembler, and the LLVM AsmPrinter directly uses the <code>MCObjectStreamer</code> API. This allows Clang to emit the machine code directly into the object file, bypassing the need to parse and interpret alignment directives and instructions from a text-based assembly file.</p><p>These alignment directives have an optional third argument: the maximum number of bytes to skip. If doing the alignment would require skipping more bytes than the specified maximum, the alignment is not done at all. GCC's <code>-falign-functions=m:n</code> utilizes this feature.</p><p>Feature requests:</p><ul><li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=33943">gas: .prefalign directive for body-size-dependent function alignment</a></li><li><a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124314">GCC: Emit .prefalign for body-size-dependent function alignment</a></li></ul><h2 id="object-file-format">Object file format</h2><p>In an object file, the section alignment is determined by the strictest alignment directive present in that section. 
The assemblersets the section's overall alignment to the maximum of all thesedirectives, as if an implicit directive were at the start.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">.section .text.a,&quot;ax&quot;</span><br><span class="line"># implicit alignment max(4, 8)</span><br><span class="line"></span><br><span class="line">.long 0</span><br><span class="line">.balign 4</span><br><span class="line">.long 0</span><br><span class="line">.balign 8</span><br></pre></td></tr></table></figure><p>This alignment is stored in the <code>sh_addralign</code> fieldwithin the ELF section header table. You can inspect this value usingtools such as <code>readelf -WS</code> (<code>llvm-readelf -S</code>) or<code>objdump -h</code> (<code>llvm-objdump -h</code>).</p><h2 id="linker-considerations">Linker considerations</h2><p>The linker combines multiple object files into a single executable.When it maps input sections from each object file into output sectionsin the final executable, it ensures that section alignments specified inthe object files are preserved.</p><h3 id="how-the-linker-handles-section-alignment">How the linker handlessection alignment</h3><p><strong>Output section alignment</strong>: This is the maximum<code>sh_addralign</code> value among all its contributing inputsections. This ensures the strictest alignment requirements are met.</p><p><strong>Section placement</strong>: The linker also uses input<code>sh_addralign</code> information to position each input sectionwithin the output section. 
As illustrated in the following example, eachinput section (like <code>a.o:.text.f</code> or <code>b.o:.text</code>)is aligned according to its <code>sh_addralign</code> value before beingplaced sequentially.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">output .text</span><br><span class="line">  # align to sh_addralign(a.o:.text). No-op if this is the first section without any preceding DOT assignment or data command.</span><br><span class="line">  a.o:.text</span><br><span class="line">  # align to sh_addralign(a.o:.text.f)</span><br><span class="line">  a.o:.text.f</span><br><span class="line">  # align to sh_addralign(b.o:.text)</span><br><span class="line">  b.o:.text</span><br><span class="line">  # align to sh_addralign(b.o:.text.g)</span><br><span class="line">  b.o:.text.g</span><br></pre></td></tr></table></figure><p><strong>Link script control</strong> A linker script can override thedefault alignment behavior. The <code>ALIGN</code> keyword enforces astricter alignment. For example <code>.text : ALIGN(32) &#123; ... &#125;</code>aligns the section to at least a 32-byte boundary. This is often done tooptimize for specific hardware or for memory mapping requirements.</p><p>The <code>SUBALIGN</code> keyword on an output section overrides theinput section alignments.</p><p><strong>Padding</strong>: To achieve the required alignment, thelinker may insert padding between sections or before the first inputsection (if there is a gap after the output section start). 
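The placement rule illustrated above can be sketched as follows. This is a simplified model (real linkers also handle DOT assignments, data commands, and linker-script overrides); the struct and function names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct InputSection {
  uint64_t addralign;  // sh_addralign, a power of 2
  uint64_t size;
};

// Round `off` up to the next multiple of the power-of-2 `align`.
uint64_t alignTo(uint64_t off, uint64_t align) {
  return (off + align - 1) & ~(align - 1);
}

// Place input sections sequentially, aligning each to its sh_addralign.
// Returns {output section size, output section alignment}: the output
// alignment is the maximum over all input alignments.
std::pair<uint64_t, uint64_t>
layoutOutputSection(const std::vector<InputSection> &inputs) {
  uint64_t off = 0, outAlign = 1;
  for (const InputSection &s : inputs) {
    off = alignTo(off, s.addralign);  // padding inserted here if needed
    off += s.size;
    outAlign = std::max(outAlign, s.addralign);
  }
  return {off, outAlign};
}
```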
The fill value is determined by the following rules:</p><ul><li>If specified, use the <a href="https://sourceware.org/binutils/docs/ld/Output-Section-Attributes.html"><code>=fillexp</code> output section attribute</a> (within an output section description).</li><li>For a non-code section, use zero.</li><li>Otherwise, use a trap or no-op instruction.</li></ul><h3 id="padding-and-section-reordering">Padding and section reordering</h3><p>Linkers typically preserve the order of input sections from object files. To minimize the padding required between sections, linker scripts can use a <code>SORT_BY_ALIGNMENT</code> keyword to arrange input sections in descending order of their alignment requirements. Similarly, GNU ld supports <a href="/blog/2022-02-06-all-about-common-symbols#sort-common"><code>--sort-common</code></a> to sort COMMON symbols by decreasing alignment.</p><p>While this sorting can reduce wasted space, modern linking strategies often prioritize other factors, such as cache locality (for performance) and data similarity (for Lempel–Ziv compression ratio), which can conflict with sorting by alignment. (Search <code>--bp-compression-sort=</code> on <a href="/blog/2020-11-15-explain-gnu-linker-options">Explain GNU style linker options</a>.)</p><h3 id="system-page-size">System page size</h3><p>The alignment of a variable or function can be as large as the system page size. Some implementations allow a larger alignment. (<a href="/blog/2023-12-17-exploring-the-section-layout-in-linker-output#over-aligned-segment">Over-aligned segment</a>)</p><h2 id="abi-compliance">ABI compliance</h2><p>Some platforms have special rules. For example,</p><ul><li>On SystemZ, the <code>larl</code> (load address relative long) instruction cannot generate odd addresses. To prevent GOT indirection, compilers ensure that symbols are at least aligned by 2. 
(<a href="/blog/2024-02-11-toolchain-notes-on-z-architecture">Toolchain notes on z/Architecture</a>)</li><li>On AIX, the default alignment mode is <code>power</code>: for double and long double, the first member of this data type is aligned according to its natural alignment value; subsequent members of the aggregate are aligned on 4-byte boundaries. (<a href="https://reviews.llvm.org/D79719" class="uri">https://reviews.llvm.org/D79719</a>)</li><li>z/OS caps the maximum alignment of static storage variables to 16. (<a href="https://reviews.llvm.org/D98864" class="uri">https://reviews.llvm.org/D98864</a>)</li></ul><p>The standard representation of the Itanium C++ ABI requires member function pointers to be even, to distinguish between virtual and non-virtual functions.</p><blockquote><p>In the standard representation, a member function pointer for a virtual function is represented with ptr set to 1 plus the function's v-table entry offset (in bytes), converted to a function pointer as if by <code>reinterpret_cast&lt;fnptr_t&gt;(uintfnptr_t(1 + offset))</code>, where <code>uintfnptr_t</code> is an unsigned integer of the same size as <code>fnptr_t</code>.</p></blockquote><p>Conceptually, a pointer to member function is a tuple:</p><ul><li>A function pointer or virtual table index, discriminated by the least significant bit</li><li>A displacement to apply to the <code>this</code> pointer</li></ul><p>Due to the least significant bit discriminator, member functions need a stricter alignment even if <code>__attribute__((aligned(1)))</code> is specified:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">virtual</span> <span class="type">void</span> <span class="title">bar1</span><span class="params">()</span> __<span class="title">attribute__</span><span class="params">((aligned(<span 
class="number">1</span>)))</span></span>;</span><br></pre></td></tr></table></figure><p>Side note: check out <a href="https://rants.vastheman.com/2021/09/21/msvc/">MSVC C++ ABI Member Function Pointers</a> for a comparison with the MSVC C++ ABI.</p><h2 id="architecture-considerations">Architecture considerations</h2><p>Contemporary architectures generally support unaligned memory access, likely with very small performance penalties. However, some implementations might restrict or penalize unaligned accesses heavily, or require specific handling. Even on architectures supporting unaligned access, atomic operations might still require alignment.</p><ul><li>On AArch64, a bit in the system control register <code>sctlr_el1</code> enables alignment checking.</li><li>On x86, if the AM bit is set in the CR0 register and the AC bit is set in the EFLAGS register, alignment checking of user-mode data accesses is enabled.</li></ul><p>Linux's RISC-V port supports <code>prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS);</code> to enable strict alignment.</p><p><code>clang -fsanitize=alignment</code> can detect misaligned memory access. Check out my <a href="/blog/2023-01-29-all-about-undefined-behavior-sanitizer#fsanitizealignment">write-up</a>.</p><p>In 1989, US Patent 4814976, which covers "RISC computer with unaligned reference handling and method for the same" (4 instructions: lwl, lwr, swl, and swr), was granted to MIPS Computer Systems Inc. It created a barrier for other RISC processors; see <a href="https://www.probell.com/lexra/">The Lexra Story</a>.</p><blockquote><p>Almost every microprocessor in the world can emulate the functionality of unaligned loads and stores in software. MIPS Technologies did not invent that. By any reasonable interpretation of the MIPS Technologies' patent, Lexra did not infringe. In mid-2001 Lexra received a ruling from the USPTO that all claims in the lawsuit were invalid because of prior art in an IBM CISC patent. 
However, MIPS Technologies appealed the USPTO ruling in Federal court, adding to Lexra's legal costs and hurting its sales. That forced Lexra into an unfavorable settlement. The patent expired on December 23, 2006, at which point it became legal for anybody to implement the complete MIPS-I instruction set, including unaligned loads and stores.</p></blockquote><h2 id="aligning-code-for-performance">Aligning code for performance</h2><p>GCC offers a family of performance-tuning options named <code>-falign-*</code> that instruct the compiler to align certain code segments to specific memory boundaries. These options might improve performance by preventing certain instructions from crossing cache line boundaries (or instruction fetch boundaries), which can otherwise cause an extra cache miss.</p><ul><li><code>-falign-functions=n</code>: Align functions.</li><li><code>-falign-labels=n</code>: Align branch targets.</li><li><code>-falign-jumps=n</code>: Align branch targets that can only be reached by jumping.</li><li><code>-falign-loops=n</code>: Align the beginning of loops.</li></ul><p>Important considerations:</p><p><strong>Inefficiency with small functions</strong>: Aligning small functions can be inefficient and may not be worth the overhead. To address this, GCC introduced <code>-flimit-function-alignment</code> in 2016. When the <code>-falign-functions</code> max-skip (the padding budget, defaulting to N-1) is greater than or equal to the function's code size, this option caps the <code>.p2align</code> max-skip operand to the function size minus one, preventing the NOP padding from exceeding the function body itself. 
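The capping rule can be sketched as follows; this is a model of the described behavior with illustrative names, not GCC's actual code:

```cpp
#include <cstdint>

// -falign-functions=N uses a .p2align max-skip of N-1 by default. With
// -flimit-function-alignment, when that budget is >= the function's code
// size, the max-skip is capped at size-1, so the inserted NOP padding can
// never exceed the function body itself.
uint64_t cappedMaxSkip(uint64_t alignFunctions, uint64_t funcSize) {
  uint64_t maxSkip = alignFunctions - 1;  // default padding budget
  if (funcSize > 0 && maxSkip >= funcSize)
    maxSkip = funcSize - 1;
  return maxSkip;
}
```

A small function whose body is only 4 bytes thus gets `.p2align 4,,3` rather than `.p2align 4` (max-skip 15) under `-falign-functions=16`.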
GCC computes the function code sizevia <code>shorten_branches</code> in <code>final.cc</code>, which storesit in <code>crtl-&gt;max_insn_address</code>, then<code>assemble_start_function</code> in <code>varasm.cc</code> uses itto cap <code>max_skip</code>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;int add1(int a)&#123;return a+1;&#125;&#x27; | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 | grep p2align</span><br><span class="line">        .p2align 4</span><br><span class="line">% echo &#x27;int add1(int a)&#123;return a+1;&#125;&#x27; | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 -flimit-function-alignment | grep p2align</span><br><span class="line">        .p2align 4,,3</span><br></pre></td></tr></table></figure><p>The max-skip operand, if present, is evaluated at parse time, so youcannot do: <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">.p2align 4, , b-a</span><br><span class="line">a:</span><br><span class="line">  nop</span><br><span class="line">b:</span><br></pre></td></tr></table></figure></p><p>In LLVM, the x86 backend does not implement<code>TargetInstrInfo::getInstSizeInBytes</code>, making it challengingto implement <code>-flimit-function-alignment</code>.</p><p><strong>Cold code</strong>: These options don't apply to coldfunctions. To ensure that cold functions are also aligned, use<code>-fmin-function-alignment=n</code> instead.</p><p><strong>Benchmarking</strong>: Aligning functions can make benchmarksmore reliable. 
For example, on x86-64, a hot function less than 32 bytesmight be placed in a way that uses one or two cache lines (determined by<code>function_addr % cache_line_size</code>), making benchmark resultsnoisy. Using <code>-falign-functions=32</code> can ensure the functionalways occupies a single cache line, leading to more consistentperformance measurements.</p><hr /><p>LLVM notes: In <code>clang/lib/CodeGen/CodeGenModule.cpp</code>,<code>-falign-functions=N</code> and <code>[[gnu::aligned(N)]]</code>now set <strong>both</strong> the minimum alignment and the preferredalignment (consistent with GCC). The separate<code>-fpreferred-function-alignment=N</code> option controls only thepreferred alignment hint without affecting the minimum.</p><p>A low-overhead loop (also called a zero-overhead loop) is ahardware-assisted looping mechanism found in many processorarchitectures, particularly digital signal processors (DSPs). Theprocessor includes dedicated registers that store the loop startaddress, loop end address, and loop count. A hardware loop typicallyconsists of three components:</p><ul><li>Loop setup instruction: Sets the loop end address and iterationcount</li><li>Loop body: Contains the actual instructions to be repeated</li><li>Loop end instruction: Jumps back to the loop body if furtheriterations are required</li></ul><p>Here is an example from Arm v8.1-M low-overhead branch extension.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">1:</span><br><span class="line">  dls lr, Rn    // Setup loop with count in Rn</span><br><span class="line">  ...           
// Loop body instructions</span><br><span class="line">2:</span><br><span class="line">  le lr, 1b     // Loop end - branch back to label 1 if needed</span><br></pre></td></tr></table></figure><p>To minimize the number of cache lines used by the loop body, ideally the loop body (the instruction immediately following DLS) should be aligned to a 64-byte boundary. However, GNU Assembler lacks a directive to specify alignment like "align DLS to a multiple of 64 plus 60 bytes." Inserting an alignment directive after the DLS is counterproductive, as it would introduce unwanted NOP instructions at the beginning of the loop body, negating the performance benefits of the low-overhead loop mechanism.</p><p>It would be desirable to simulate the functionality with <code>.org ((.+4+63) &amp; -64) - 4  // ensure that .+4 is aligned to 64-byte boundary</code>, but this complex expression involves bitwise AND and is not a relocatable expression. The LLVM integrated assembler reports <code>expected absolute expression</code>, and GNU Assembler emits a similar error.</p><p>A potential solution would be to extend the alignment directives with an optional offset parameter:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"># Align to 64-byte boundary with 60-byte offset, using NOP padding in code sections</span><br><span class="line">.balign 64, , , 60</span><br><span class="line"></span><br><span class="line"># Same alignment with offset, but skip at most 16 bytes of padding</span><br><span class="line">.balign 64, , 16, 60</span><br></pre></td></tr></table></figure><p>Xtensa's <code>LOOP</code> instruction has a similar alignment requirement, but I am not familiar with the details. The GNU Assembler implements this alignment as a special machine-dependent fragment.
(<a href="https://sourceware.org/binutils/docs/as/Xtensa-Automatic-Alignment.html" class="uri">https://sourceware.org/binutils/docs/as/Xtensa-Automatic-Alignment.html</a>)</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Updated in 2026-04.&lt;/p&gt;
&lt;p&gt;Alignment refers to the practice of placing data or code at memory
addresses that are multiples of a specific value, typically a power of
2. This is usually done to meet the requirements of the programming
language, ABI, or the underlying hardware. Misaligned memory accesses
might be expensive or will cause traps on certain architectures.&lt;/p&gt;
&lt;p&gt;This blog post explores how alignment is represented and managed as
C++ code is transformed through the compilation pipeline: from source
code to LLVM IR, assembly, and finally the object file. We&#39;ll focus on
alignment for both variables and functions.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="linker" scheme="https://maskray.me/blog/tags/linker/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>LLVM integrated assembler: Improving sections and symbols</title>
    <link href="https://maskray.me/blog/2025-08-17-llvm-integrated-assembler-improving-sections-and-symbols"/>
    <id>https://maskray.me/blog/2025-08-17-llvm-integrated-assembler-improving-sections-and-symbols</id>
    <published>2025-08-17T07:00:00.000Z</published>
    <updated>2025-08-23T07:46:47.059Z</updated>
    
    <content type="html"><![CDATA[<p>In my previous post, <ahref="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">LLVMintegrated assembler: Improving expressions and relocations</a> delvedinto enhancements made to LLVM's expression resolving and relocationgeneration. This post covers recent refinements to MC, focusing onsections and symbols.</p><span id="more"></span><h2 id="sections">Sections</h2><p>Sections are named, contiguous blocks of code or data within anobject file. They allow you to logically group related parts of yourprogram. The assembler places code and data into these sections as itprocesses the source file.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCSection</span> &#123;</span><br><span class="line">...</span><br><span class="line">  <span class="keyword">enum</span> <span class="title class_">SectionVariant</span> &#123;</span><br><span class="line">    SV_COFF = <span class="number">0</span>,</span><br><span class="line">    SV_ELF,</span><br><span class="line">    SV_GOFF,</span><br><span class="line">    SV_MachO,</span><br><span class="line">    SV_Wasm,</span><br><span class="line">    SV_XCOFF,</span><br><span class="line">    SV_SPIRV,</span><br><span class="line">    SV_DXContainer,</span><br><span class="line">  &#125;;</span><br></pre></td></tr></table></figure><p>In LLVM 20, the 
<ahref="https://github.com/llvm/llvm-project/blob/release/20.x/llvm/include/llvm/MC/MCSection.h"><code>MCSection</code>class</a> used an enum called <code>SectionVariant</code> todifferentiate between various object file formats, such as ELF, Mach-O,and COFF. These subclasses are used in contexts where the section typeis known at compile-time, such as in <code>MCStreamer</code> and <ahref="https://github.com/llvm/llvm-project/blob/release/21.x/llvm/include/llvm/MC/MCObjectWriter.h"><code>MCObjectTargetWriter</code></a>.This change eliminates the need for runtime type information (RTTI)checks, simplifying the codebase and improving efficiency.</p><p>Additionally, the storage for fragments' fixups (adjustments toaddresses and offsets) has been moved into the <code>MCSection</code>class.</p><h2 id="symbols">Symbols</h2><p>Symbols are names that represent memory addresses or values.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCSymbol</span> &#123;</span><br><span class="line"><span class="keyword">protected</span>:</span><br><span class="line">  <span class="comment">/// The kind of the symbol.  
If it is any value other than unset then this</span></span><br><span class="line">  <span class="comment">/// class is actually one of the appropriate subclasses of MCSymbol.</span></span><br><span class="line">  <span class="keyword">enum</span> <span class="title class_">SymbolKind</span> &#123;</span><br><span class="line">    SymbolKindUnset,</span><br><span class="line">    SymbolKindCOFF,</span><br><span class="line">    SymbolKindELF,</span><br><span class="line">    SymbolKindGOFF,</span><br><span class="line">    SymbolKindMachO,</span><br><span class="line">    SymbolKindWasm,</span><br><span class="line">    SymbolKindXCOFF,</span><br><span class="line">  &#125;;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/// A symbol can contain an Offset, or Value, or be Common, but never more</span></span><br><span class="line">  <span class="comment">/// than one of these.</span></span><br><span class="line">  <span class="keyword">enum</span> <span class="title class_">Contents</span> : <span class="type">uint8_t</span> &#123;</span><br><span class="line">    SymContentsUnset,</span><br><span class="line">    SymContentsOffset,</span><br><span class="line">    SymContentsVariable,</span><br><span class="line">    SymContentsCommon,</span><br><span class="line">    SymContentsTargetCommon, <span class="comment">// Index stores the section index</span></span><br><span class="line">  &#125;;</span><br></pre></td></tr></table></figure><p>Similar to sections, the <ahref="https://github.com/llvm/llvm-project/blob/release/20.x/llvm/include/llvm/MC/MCSymbol.h"><code>MCSymbol</code>class</a> also used a discriminator enum, SymbolKind, to distinguishbetween object file formats. This enum has also been removed.</p><p>Furthermore, the <code>MCSymbol</code> class had an<code>enum Contents</code> to specify the kind of symbol. 
This name was a bit confusing, so it has been <a href="https://github.com/llvm/llvm-project/commit/190778a8ba6d30995b7e1b4b4a556ab6444bdf3a">renamed to <code>enum Kind</code></a> for clarity. A symbol can be one of the following:</p><ul><li>regular symbol</li><li><a href="/blog/2023-05-08-assemblers#symbol-equating-directives">equated symbol</a></li><li><a href="/blog/2022-02-06-all-about-common-symbols">common symbol</a></li></ul><p>A special enumerator, <code>SymContentsTargetCommon</code>, which was used by AMDGPU for a specific type of common symbol, has also been <a href="https://github.com/llvm/llvm-project/commit/aa96e20dcefa7d73229c98a7d2727696ff949459">removed</a>. The functionality it provided is now handled by updating <code>ELFObjectWriter</code> to respect the symbol's section index (<code>SHN_AMDGPU_LDS</code> for this special AMDGPU symbol).</p><p><code>sizeof(MCSymbol)</code> has been reduced to 24 bytes on 64-bit systems.</p><p>The previous blog post <a href="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">LLVM integrated assembler: Improving expressions and relocations</a> describes other changes:</p><ul><li>The <code>MCSymbol::IsUsed</code> flag was a workaround for detecting a subset of invalid reassignments and has been <a href="https://github.com/llvm/llvm-project/commit/e015626f189dc76f8df9fdc25a47638c6a2f3feb">removed</a>.</li><li>The <code>MCSymbol::IsResolving</code> flag was added to detect cyclic dependencies of equated symbols.</li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;In my previous post, &lt;a
href=&quot;/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations&quot;&gt;LLVM
integrated assembler: Improving expressions and relocations&lt;/a&gt; delved
into enhancements made to LLVM&#39;s expression resolving and relocation
generation. This post covers recent refinements to MC, focusing on
sections and symbols.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>LLVM integrated assembler: Engineering better fragments</title>
    <link href="https://maskray.me/blog/2025-07-27-llvm-integrated-assembler-engineering-better-fragments"/>
    <id>https://maskray.me/blog/2025-07-27-llvm-integrated-assembler-engineering-better-fragments</id>
    <published>2025-07-27T07:00:00.000Z</published>
    <updated>2026-03-27T16:54:20.176Z</updated>
    
    <content type="html"><![CDATA[<p>In my previous assembler posts, I've discussed improvements on <ahref="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">expressionresolving and relocation generation</a>. Now, let's turn our attentionto recent refinements within section fragments. Understanding how anassembler utilizes these fragments is key to appreciating theimprovements we've made. At a high level, the process unfolds in threemain stages:</p><ul><li>Parsing phase: The assembler constructs section fragments. Thesefragments represent sequences of regular instructions or data, <ahref="/blog/2024-04-27-understanding-clang-o0-size-increase-branch-displacement">span-dependentinstructions</a>, alignment directives, and other elements.</li><li>Section layout phase: Once fragments are built, the assemblerassigns offsets to them and finalizes the span-dependent content.</li><li><ahref="/blog/2025-03-16-relocation-generation-in-assemblers">Relocationdecision phase</a>: In the final stage, the assembler evaluates fixupsand, if necessary, updates the content of the fragments.</li></ul><span id="more"></span><p>When the LLVM integrated assembler was introduced in 2009, itssection and fragment design was quite basic. Performance wasn't theconcern at the time. As LLVM evolved, many assembler features added overthe years came to rely heavily on this original design. This created acomplex web that made optimizing the fragment representationincreasingly challenging.</p><p>Here's a look at some of the features that added to this complexityover the years:</p><ul><li>2010: Mach-O <code>.subsection_via_symbols</code> and atoms</li><li>2012: NativeClient's bundle alignment mode. 
I've created a dedicatedchapter for this.</li><li>2015: Hexagon instruction bundle</li><li>2016: CodeView variable definition ranges</li><li>2018: RISC-V linker relaxation</li><li>2020: x86 <code>-mbranches-within-32B-boundaries</code></li><li>2023: LoongArch linker relaxation. This is largely identical toRISC-V linker relaxation. Any refactoring or improvements to the RISC-Vlinker relaxation often necessitate corresponding changes to theLoongArch implementation.</li><li>2023: z/OS <a href="https://en.wikipedia.org/wiki/GOFF">GOFF(Generalized Object File Format)</a></li></ul><p>I've included the start year for each feature to indicate when it wasinitially introduced, to the best of my knowledge. This doesn't implythat maintenance stopped after that year. On the contrary, many of thesefeatures, like RISC-V linker relaxation, require ongoing, activemaintenance.</p><p>Despite the intricate history, I've managed to untangle thesedependencies and implement the necessary fixes. And that, in a nutshell,is what this blog post is all about!</p><h2 id="reducing-sizeofmcfragment">Reducing sizeof(MCFragment)</h2><p>A significant aspect of optimizing fragment management involveddirectly reducing the memory footprint of the MCFragment object itself.Several targeted changes contributed to making<code>sizeof(MCFragment)</code> smaller, as mentioned by my previousblog post: <ahref="/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19#fragments">Integratedassembler improvements in LLVM 19</a>.</p><ul><li><a href="https://github.com/llvm/llvm-project/pull/94913">MCInst:decrease inline element count to 6</a></li><li><a href="https://github.com/llvm/llvm-project/pull/95293">[MC]Reduce size of MCDataFragment by 8 bytes</a> by <span class="citation"data-cites="aengelke">@aengelke</span></li><li><a href="https://github.com/llvm/llvm-project/pull/95341">[MC] MoveMCFragment::Atom to MCSectionMachO::Atoms</a></li></ul><p>The fragment management system has also been streamlined 
by transitioning from a doubly-linked list (<code>llvm::iplist</code>) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements. It's worth noting that the complexities introduced by features like NaCl's bundle alignment mode, x86's <code>-mbranches-within-32B-boundaries</code> option, and Hexagon's instruction bundles presented challenges.</p><h2 id="the-quest-for-trivially-destructible-fragments">The quest for trivially destructible fragments</h2><p>Historically, <code>MCFragment</code> subclasses, specifically <code>MCDataFragment</code> and <code>MCRelaxableFragment</code>, relied on <code>SmallVector</code> member variables to store their content and fixups. This approach, while functional, presented two key inefficiencies:</p><ul><li>Inefficient storage of small objects: The content and fixups for individual fragments are typically very small. Storing a multitude of these tiny objects individually within <code>SmallVectors</code> led to less-than-optimal memory utilization.</li><li>Non-trivial destructors: When deallocating sections, the <code>~MCSection</code> destructor had to meticulously traverse the fragment list and explicitly destroy each fragment.</li></ul><p>In 2024, <span class="citation" data-cites="aengelke">@aengelke</span> initiated a draft to store fragment content out-of-line. Building upon that foundation, I've extended this approach to also store fixups out-of-line, and ensured compatibility with the aforementioned features that cause complexity (especially RISC-V and LoongArch linker relaxation).</p><p>Furthermore, <code>MCRelaxableFragment</code> previously contained <code>MCInst Inst;</code>, which also necessitated a non-trivial destructor.
To address this, I've redesigned its data structure.operands are now stored within the parent MCSection, and the<code>MCRelaxableFragment</code> itself only holds references:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">uint32_t</span> Opcode = <span class="number">0</span>;</span><br><span class="line"><span class="type">uint32_t</span> Flags = <span class="number">0</span>; <span class="comment">// x86-only for the EVEX prefix</span></span><br><span class="line"><span class="type">uint32_t</span> OperandStart = <span class="number">0</span>;</span><br><span class="line"><span class="type">uint32_t</span> OperandSize = <span class="number">0</span>;</span><br></pre></td></tr></table></figure><p>Unfortunately, we still need to encode <code>MCInst::Flags</code> tosupport the x86 EVEX prefix, e.g., <code>&#123;evex&#125; xorw $foo, %ax</code>.My hope is that the x86 maintainers might refactor<code>X86MCCodeEmitter::encodeInstruction</code> to make this flagstorage unnecessary.</p><p>The new design of <code>MCFragment</code> and <code>MCSection</code>is as follows:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCFragment</span> &#123;</span><br><span class="line">  ...</span><br><span class="line">  <span class="comment">// Track content and fixups for the fixed-size part as fragments are</span></span><br><span class="line">  <span class="comment">// appended to the section. The content remains immutable, except when</span></span><br><span class="line">  <span class="comment">// modified by applyFixup.</span></span><br><span class="line">  <span class="type">uint32_t</span> ContentStart = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> ContentEnd = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> FixupStart = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> FixupEnd = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// Track content and fixups for the optional variable-size tail part,</span></span><br><span class="line">  <span class="comment">// typically modified during relaxation.</span></span><br><span class="line">  <span class="type">uint32_t</span> VarContentStart = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> VarContentEnd = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> VarFixupStart = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> VarFixupEnd = <span class="number">0</span>;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span 
class="title class_">MCSection</span> &#123;</span><br><span class="line">  ...</span><br><span class="line">  <span class="comment">// Content and fixup storage for fragments</span></span><br><span class="line">  SmallVector&lt;<span class="type">char</span>, <span class="number">0</span>&gt; ContentStorage;</span><br><span class="line">  SmallVector&lt;MCFixup, <span class="number">0</span>&gt; FixupStorage;</span><br><span class="line">  SmallVector&lt;MCOperand, <span class="number">0</span>&gt; MCOperandStorage;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>(As a side note, <ahref="https://llvm.org/docs/Proposals/VariableNames.html">the LLVM<code>CamelCase</code> variables are odd</a>. As the MC maintainer, I'dbe delighted to see them refactored to <code>camelBack</code> or<code>snake_case</code> if people agree on the direction.)</p><p>Key changes:</p><ul><li><a href="https://github.com/llvm/llvm-project/pull/146307">MC: Storefragment content and fixups out-of-line</a></li><li><a href="https://github.com/llvm/llvm-project/pull/146462">CodeView:Move MCCVDefRangeFragment storage to MCContext/MCFragment. NFC</a></li><li><a href="https://github.com/llvm/llvm-project/pull/147229">MC: StoreMCRelaxableFragment MCInst out-of-line</a></li></ul><h2 id="fewer-fragments-fixed-size-part-and-variable-tail">Fewerfragments: fixed-size part and variable tail</h2><p>Prior to LLVM 21.1, the assembler, operated with a fragment designdating back to 2009, placed every span-dependent instruction into itsown distinct fragment. 
The x86 code sequence <code>push rax; jmp foo; nop; jmp foo</code> would be represented with numerous fragments: <code>MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo)</code>.</p><p>A more efficient approach emerged: storing both a <strong>fixed-size part and an optional variable-size tail</strong> within a single fragment.</p><ul><li>The fixed-size part maintains a consistent size throughout the assembly process.</li><li>The variable-size tail, if present, encodes elements that can change in size or content, such as a span-dependent instruction, an alignment directive, a fill directive, or other similar span-dependent constructs.</li></ul><p>The new design led to significantly fewer fragments:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">MCFragment(fixed: push rax, variable: jmp foo)</span><br><span class="line">MCFragment(fixed: nop, variable: jmp foo)</span><br></pre></td></tr></table></figure><p>Key changes:</p><ul><li><a href="https://github.com/llvm/llvm-project/pull/148544">MC: Restructure MCFragment as a fixed part and a variable tail</a>.</li><li><a href="https://github.com/llvm/llvm-project/pull/149030">MC: Encode FT_Align in fragment's variable-size tail</a></li></ul><h2 id="reducing-instruction-encoding-overhead">Reducing instruction encoding overhead</h2><p>Encoding individual instructions is the most performance-critical operation within <code>MCObjectStreamer</code>.
Recognizing this,significant effort has been dedicated to reducing this overhead sinceMay 2023.</p><ul><li><a href="https://reviews.llvm.org/D145791">[MC] Always encodeinstruction into SmallVector</a></li><li><ahref="https://github.com/llvm/llvm-project/commit/e8934075b90a38f9dd3361472e696f11e50a8aa4">[MC]Remove the legacy overload of encodeInstruction</a> with a lot of priorcleanups</li><li><a href="https://github.com/llvm/llvm-project/pull/94950">[MC][ELF]Emit instructions directly into fragment</a></li><li><a href="https://github.com/llvm/llvm-project/pull/94947">[MC][X86]Avoid copying MCInst in emitInstrEnd</a> in 2024-06</li><li><ahref="https://github.com/llvm/llvm-project/commit/d77ac81e93e5e2df5275b687b53049d9acfe1357">X86AsmBackend:Remove some overhead from auto padding feature</a></li><li><ahref="https://github.com/llvm/llvm-project/commit/3fe6d276dc952b3b2b487cb67a999c3981cf9563">X86AsmBackend:Simplify isRightAfterData for the auto-pad feature</a></li></ul><p>It's worth mentioning that x86's instruction padding features,introduced in 2020, have imposed considerable overhead. Specifically,these features are:</p><ul><li><code>-mbranches-within-32B-boundaries</code>. 
See <ahref="https://reviews.llvm.org/D70157">Align branches within 32-Byteboundary(NOP padding)</a></li><li><a href="https://reviews.llvm.org/D75203">[X86] Relax existinginstructions to reduce the number of nops needed for alignmentpurposes</a></li><li><a href="https://reviews.llvm.org/D76286">"Enhanced relaxation"</a>:The feature allows x86 prefix padding for all instructions, effectivelymaking all instructions span-dependent and requiring its own fragment.My <a href="https://reviews.llvm.org/D94542">D94542</a> disabled this bydefault due to concern of <code>-g</code> vs <code>-g0</code>differences.</li></ul><p>My recent optimization efforts demanded careful attention to theseparticularly complex and performance-sensitive code.</p><h2 id="eager-fragment-creation">Eager fragment creation</h2><p>Encoding an instruction is a far more frequent operation thanappending a variable-size tail to the current fragment. In the previousdesign, the instruction encoder was burdened with an extra check: it hadto determine if the current fragment already had a variable-sizetail.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">encodeInstruction:</span><br><span class="line">  if (current fragment has a variable-size tail)</span><br><span class="line">    start a new fragment</span><br><span class="line">  append data to the current fragment</span><br><span class="line"></span><br><span class="line">emitValueToAlignment:</span><br><span class="line">  Encode the alignment in the variable-size tail of the current fragment</span><br><span class="line"></span><br><span 
class="line">emitDwarfLocDirective:</span><br><span class="line">  Encode the .loc in the variable-size tail of the current fragment</span><br></pre></td></tr></table></figure><p>Our new strategy optimizes this by maintaining a current fragmentthat is guaranteed not to have a variable-size tail. This meansfunctions appending data to the fixed-size part no longer need toperform this check. Instead, any function that sets a variable-size tailwill now immediately start a new fragment.</p><p>Here's how the workflow looks with this optimization:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">encodeInstruction:</span><br><span class="line">  assert(current fragment doesn&#x27;t have a variable-size tail)</span><br><span class="line">  append data to the current fragment</span><br><span class="line"></span><br><span class="line">emitValueToAlignment:</span><br><span class="line">  Encode the alignment in the variable-size tail of the current fragment</span><br><span class="line">  start a new fragment</span><br><span class="line"></span><br><span class="line">emitDwarfLocDirective:</span><br><span class="line">  Encode the .loc in the variable-size tail of the current fragment</span><br><span class="line">  start a new fragment</span><br></pre></td></tr></table></figure><p>Key changes:</p><ul><li><a href="https://github.com/llvm/llvm-project/pull/149471">MC:Simplify fragment reuse determination</a></li><li><ahref="https://github.com/llvm/llvm-project/commit/39c8cfb70d203439e3296dfdfe3d41f1cb2ec551">MC:Optimize getOrCreateDataFragment</a></li></ul><p>It's 
worth noting that the first patch was made possible thanks to the removal of the bundle alignment mode.</p><h2 id="fragment-content-in-trailing-data">Fragment content in trailing data</h2><p>Our <code>MCFragment</code> class manages four distinct sets of appendable data: fixed-size content, fixed-size fixups, variable-size tail content, and variable-size tail fixups. Of these, the fixed-size content is typically the largest. We can optimize its storage by utilizing it as trailing data, akin to a flexible array member.</p><p>This approach offers several compelling advantages:</p><ul><li>Improved data locality: Storing the content after the MCFragment object enhances cache utility.</li><li>Simplified metadata: We can replace the pair of <code>uint32_t ContentStart = 0; uint32_t ContentEnd = 0;</code> with a single <code>uint32_t ContentSize;</code>.</li></ul><p>This optimization leverages a clever technique made possible by using a special-purpose bump allocator. After allocating <code>sizeof(MCFragment)</code> bytes for a new fragment, we know that any remaining space within the current bump allocator block immediately follows the fragment's end. This contiguous space can then be efficiently used for the fragment's trailing data.</p><p>However, this design introduces a few important considerations:</p><ul><li>Tail fragment appends only: Data can only be appended to the tail fragment of a subsection. Fragments located in the middle of a subsection are immutable in their fixed-size content. Any post-assembler-layout adjustments must target the variable-size tail.</li><li>Dynamic allocation management: When new data needs to be appended, a function is invoked to ensure the current bump allocator block has sufficient space. If not, the current fragment is closed (its fixed-size content is finalized), and a new fragment is started. 
For instance, an 8-byte sequence could be stored as one single fragment, or, if space constraints dictate, as two fragments each encoding 4 bytes.</li><li>New block allocation: If the available space in the current block is insufficient, a new block large enough to accommodate both an MCFragment and the required bytes for its trailing data is allocated.</li><li>Section/subsection switching: The previously saved fragment list tail cannot be simply reused. This is because it's tied to the memory space of the previous bump allocator block. Instead, a new fragment must be allocated using the current bump allocator block and appended to the new subsection's tail.</li></ul><p>I have thought about making the variable-size content immediately follow the fixed-size content, but leb128 and x86's potentially very long instructions (15 bytes) stopped me from doing it. There is certainly room for future improvements, though.</p><p>Key changes:</p><ul><li><a href="https://github.com/llvm/llvm-project/pull/150183">GOFF: Only register sections within MCObjectStreamer::changeSection</a></li><li><a href="https://github.com/llvm/llvm-project/pull/150574">MC: Allocate initial fragment and define section symbol in changeSection</a></li><li><a href="https://github.com/llvm/llvm-project/pull/150846">MCFragment: Use trailing data for fixed-size part</a></li></ul><h2 id="fragment-fixups-stored-in-section">Fragment fixups stored in section</h2><p>TODO</p><p>MCFragment should not hold references to fixups stored in the parent MCSection. Instead, fixups reference the fragment.</p><p>The optional variable-size tail of a fragment can have at most one fixup.</p><h2 id="deprecating-complexity-nativeclients-bundle-alignment-mode">Deprecating complexity: NativeClient's bundle alignment mode</h2><p>Google's now-discontinued Native Client (NaCl) project provided a sandboxing environment through a combination of Software Fault Isolation (SFI) and memory segmentation. 
A distinctive feature of its SFI implementation was the "bundle alignment mode", which adds NOP padding to ensure that no instruction crosses a 32-byte alignment boundary. The verifier's job is to check all instructions starting at 32-byte-multiple addresses.</p><p>While the core concept of aligned bundling is intriguing, its implementation within the LLVM assembler proved problematic. Introduced in 2012, this feature imposed noticeable performance penalties on users who had no need for NaCl and, perhaps more critically, significantly increased the complexity of MC's internal workings. I was particularly concerned by its pervasive modifications to <code>MCObjectStreamer</code> and <code>MCAssembler</code>.</p><p>The complexity deepened with the introduction of:</p><ul><li>2014: <a href="https://reviews.llvm.org/D5915">MCStreamer's pending labels</a>, which led to more complexity:<ul><li>2015: <a href="https://reviews.llvm.org/D10325">[MC] Ensure that pending labels are flushed when -mc-relax-all flag is used</a></li><li>2019: <a href="https://reviews.llvm.org/D71368">[MC] Match labels to existing fragments even when switching sections.</a> by an Apple developer. In a nutshell, the pending labels mechanism was causing headaches for Mach-O, requiring additional code to manage.</li></ul></li><li>2015: <a href="https://reviews.llvm.org/D8072">NaCl's <code>mc-relax-all</code> optimization</a></li></ul><p>In <code>MCObjectStreamer</code>, newly defined labels were put into a "pending label" list and initially assigned to an <code>MCDummyFragment</code> associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system introduced complexity, and a missing <code>flushPendingLabels</code> could lead to subtle bugs related to incorrect symbol values. <code>flushPendingLabels</code> was called by many <code>MCObjectStreamer</code> functions, noticeably once for each new fragment, adding overhead. 
It also complicated label difference evaluation due to <code>MCDummyFragment</code> in <code>MCExpr.cpp:AttemptToFoldSymbolOffsetDifference</code>.</p><p>For the following code, aligned bundling requires that <code>.Ltmp0</code> be defined at the <code>addl</code> instruction.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">$ clang var.c -S -o - -fPIC -m32</span><br><span class="line">...</span><br><span class="line">.bundle_lock align_to_end</span><br><span class="line">  calll   .L0$pb</span><br><span class="line">.bundle_unlock</span><br><span class="line">.L0$pb:</span><br><span class="line">  popl    %eax</span><br><span class="line">.Ltmp0:</span><br><span class="line">  addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax</span><br></pre></td></tr></table></figure><p>Recognizing these long-standing issues, a series of pivotal changes were undertaken:</p><ul><li>2024: <a href="https://github.com/llvm/llvm-project/pull/95188">[MC] Aligned bundling: remove special handling for RelaxAll</a> removed an optimization for NaCl in the <a href="/blog/2024-04-27-understanding-clang-o0-size-increase-branch-displacement"><code>mc-relax-all</code> mode</a></li><li>2024: <a href="https://github.com/llvm/llvm-project/commit/75006466296ed4b0f845cbbec4bf77c21de43b40">[MC] Remove pending labels</a></li><li>2024: <a href="https://github.com/llvm/llvm-project/commit/8fa4fe1f995a9bc85666d63e84c094f9a09686">[MC] AttemptToFoldSymbolOffsetDifference: remove MCDummyFragment check. NFC</a></li><li>2025: Finally, <a href="https://github.com/llvm/llvm-project/pull/148781">MC: Remove bundle alignment mode</a>, after Derek Schuff agreed to drop NaCl support from LLVM.</li></ul><p>Should 
future features require a variant of bundle alignment, I firmly believe a much cleaner implementation is necessary. This could potentially be achieved through a backend hook within <code>X86AsmBackend::finishLayout</code>, applied after the primary assembler layout phase, similar to how the <code>-mbranches-within-32B-boundaries</code> option is handled, though even that implementation warrants an extensive revisit itself.</p><h2 id="lessons-learned">Lessons learned</h2><p><strong>The cost of missing early optimization</strong></p><p>Early design choices can have a far-reaching impact on future code. The initial LLVM MC design, while admirably simple in its inception, inadvertently created a rigid foundation. As new features piled on, each relying more and more on the specific fragment internals, rectifying foundational inefficiencies became incredibly challenging. Hyrum's Law was evident: features built on this foundation inevitably depended on all its observable behaviors. Optimizing the underlying structure required not just a change to the core, but also a thorough fix for all its unsuspecting users. I encountered significant struggles with the deeply ingrained complexities stemming from NaCl's bundle alignment mode, x86's <code>-mbranches-within-32B-boundaries</code> option, and the intricacies of RISC-V linker relaxation.</p><p><strong>Cargo cult programming and snowball effect</strong></p><p>I observed instances of "cargo cult programming", where existing solutions were copied without a full understanding of their underlying rationale or applicability. For example:</p><ul><li>The WebAssembly implementation heavily mirrored that of ELF. Consequently, many improvements made to the ELF component often necessitated corresponding, sometimes redundant, changes to the WebAssembly implementation. 
In addition, the WebAssembly implementation copied ELF-specific code that was irrelevant for WebAssembly's architecture, adding unnecessary bloat and complexity.</li><li>LoongArch's RISC-V replication: LoongArch's linker relaxation implementation directly copied the approach taken for RISC-V. Refactoring or improvements to RISC-V's linker relaxation frequently require mirrored changes in the LoongArch codebase, creating parallel maintenance burdens. I am particularly glad that I landed my foundational <a href="https://reviews.llvm.org/D153097">[RISCV] Make linker-relaxable instructions terminate MCDataFragment</a> and <a href="https://reviews.llvm.org/D155357">[RISCV] Allow delayed decision for ADD/SUB relocations</a> in 2023, before the LoongArch team replicated the RISC-V approach. This timing, I hope, mitigated some future headaches for their implementation.</li></ul><p>These patterns illustrate how initial design choices, or the expedience of copying existing solutions, can lead to a "snowball effect" of accumulating complexity and redundant code that makes future optimization and maintenance significantly harder. On a positive note, I'm also pleased that <a href="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">the streamlining of the relocation generation framework</a> was completed before Apple's upstreaming of their Mach-O support for 32-bit RISC-V. This critical work should provide a more robust and less complex base for their contributions, and reduce maintenance on my end.</p><p><strong>The cost of features</strong></p><p>Specific features, particularly those designed for niche or specialized use cases like NaCl's bundle alignment mode, introduced disproportionate complexity and performance overhead across the entire assembler. Even though NaCl itself was deprecated in 2020, it took until 2025 to finally excise its complex support from LLVM. 
This highlights a common challenge in large, open-source projects: while many developers are motivated to add new features, there's often far less incentive or dedicated effort to streamline or remove their underlying implementation complexities once they're no longer strictly necessary or have become a performance drain.</p><p>I want to acknowledge the work of individuals like Rafael Ávila de Espíndola, Saleem Abdulrasool, and Nirav Dave, whose improvements to LLVM MC were vital. Without their contributions, the MC layer would undoubtedly be in a far less optimized state today.</p><h2 id="epilogue">Epilogue</h2><p>This extensive work on fragment optimization would not have been possible without the invaluable contributions of <a href="https://aengelke.net/">Alexis Engelke</a>. My sincere thanks go to Alexis for his meticulous reviews of numerous patches, his insightful suggestions, and for contributing many significant improvements himself.</p><p>What have I learned through this process?</p><h2 id="appendix-how-gnu-assembler-mastered-fragments-decades-ago">Appendix: How GNU Assembler mastered fragments decades ago</h2><p>After dedicating several paragraphs to explaining the historical shortcomings of LLVM MC's fragment representation, a natural question arises: how does GNU Assembler (GAS), arguably the other most popular assembler on Linux systems, approach fragment handling?</p><p>Delving into its history reveals a fascinating answer. 
The earliest commit I could locate is a cvs2svn-generated record from April 1991. Given the 1987 copyright notice within the code, it's highly probable that this foundational work on fragments was laid down as early as 1987.</p><p>You can explore this initial structure in as.h here: <a href="https://github.com/bminor/binutils-gdb/commit/3a69b3aca678a3caf3ade7f9d42d18233b097ec6#diff-0771d3312685417eb5061a8f0856da4f0406ca8bd6c7d68b6a50a026a4e48c9dR212" class="uri">https://github.com/bminor/binutils-gdb/commit/3a69b3aca678a3caf3ade7f9d42d18233b097ec6#diff-0771d3312685417eb5061a8f0856da4f0406ca8bd6c7d68b6a50a026a4e48c9dR212</a>. Please check out <code>as.h</code> and <code>frags.c</code>.</p><p>Observing the <code>frag</code> struct, a few points stand out:</p><ul><li>While the exact purpose of <code>fr_offset</code> isn't immediately clear to me, <code>fr_fix</code> and <code>fr_var</code> bear a striking resemblance to the concepts we've recently introduced in MCFragment. It might make the variable-size content immediately follow the fixed-size content, though.</li><li>The <code>char fr_literal[1]</code> demonstrates an early use of what we now call a flexible array member. Today, GCC and Clang's <code>-fstrict-flex-arrays=2</code> would report a warning.</li><li><code>fr_symbol</code> could be more appropriately placed within a union.</li><li><code>fr_pcrel_adjust</code> and <code>fr_bsr</code> would ideally be architecture-specific data.</li><li>Fragments are allocated using <a href="https://www.gnu.org/software/libc/manual/html_node/Obstacks.html">obstacks</a>, which appear to be a more sophisticated form of a bump allocator, with additional bookkeeping overhead.</li></ul><p>But truly, I should stop the minor nit-picking. What astonishes me is the sheer foresight demonstrated in GAS's fragment allocator design. Conceived in 1987 or even earlier, it masterfully anticipated solutions that LLVM MC, first conceived in 2009, has only now achieved decades later. 
This design held the lead on fragment architecture for nearly four decades!</p><p>My greatest tribute goes to the original authors of GNU Assembler for this remarkable piece of engineering.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * A code fragment (frag) is some known number of chars, followed by some</span></span><br><span class="line"><span class="comment"> * unknown number of chars. Typically the unknown number of chars is an</span></span><br><span class="line"><span class="comment"> * instruction address whose size is yet unknown. 
We always know the greatest</span></span><br><span class="line"><span class="comment"> * possible size the unknown number of chars may become, and reserve that</span></span><br><span class="line"><span class="comment"> * much room at the end of the frag.</span></span><br><span class="line"><span class="comment"> * Once created, frags do not change address during assembly.</span></span><br><span class="line"><span class="comment"> * We chain the frags in (a) forward-linked list(s). The object-file address</span></span><br><span class="line"><span class="comment"> * of the 1st char of a frag is generally not known until after relax().</span></span><br><span class="line"><span class="comment"> * Many things at assembly time describe an address by &#123;object-file-address</span></span><br><span class="line"><span class="comment"> * of a particular frag&#125;+offset.</span></span><br><span class="line"><span class="comment"></span></span><br><span class="line"><span class="comment"> <span class="doctag">BUG:</span> it may be smarter to have a single pointer off to various different</span></span><br><span class="line"><span class="comment">notes for different frag kinds. See how code pans </span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">frag</span>                        /* <span class="title">a</span> <span class="title">code</span> <span class="title">fragment</span> */</span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">        <span class="type">unsigned</span> <span class="type">long</span> fr_address; <span class="comment">/* Object file address. */</span></span><br><span class="line">        <span class="class"><span class="keyword">struct</span> <span class="title">frag</span> *<span class="title">fr_next</span>;</span>        <span class="comment">/* Chain forward; ascending address order. 
*/</span></span><br><span class="line">                                <span class="comment">/* Rooted in frch_root. */</span></span><br><span class="line"></span><br><span class="line">        <span class="type">long</span> fr_fix;        <span class="comment">/* (Fixed) number of chars we know we have. */</span></span><br><span class="line">                                <span class="comment">/* May be 0. */</span></span><br><span class="line">        <span class="type">long</span> fr_var;        <span class="comment">/* (Variable) number of chars after above. */</span></span><br><span class="line">                                <span class="comment">/* May be 0. */</span></span><br><span class="line">        <span class="class"><span class="keyword">struct</span> <span class="title">symbol</span> *<span class="title">fr_symbol</span>;</span> <span class="comment">/* For variable-length tail. */</span></span><br><span class="line">        <span class="type">long</span> fr_offset;        <span class="comment">/* For variable-length tail. */</span></span><br><span class="line">        <span class="type">char</span>        *fr_opcode;        <span class="comment">/*-&gt;opcode low addr byte,for relax()ation*/</span></span><br><span class="line">        relax_stateT fr_type;   <span class="comment">/* What state is my tail in? */</span></span><br><span class="line">        relax_substateT        fr_subtype;</span><br><span class="line">                <span class="comment">/* These are needed only on the NS32K machines */</span></span><br><span class="line">        <span class="type">char</span>        fr_pcrel_adjust;</span><br><span class="line">        <span class="type">char</span>        fr_bsr;</span><br><span class="line">        <span class="type">char</span>        fr_literal [<span class="number">1</span>];        <span class="comment">/* Chars begin here. 
*/</span></span><br><span class="line">                                <span class="comment">/* One day we will compile fr_literal[0]. */</span></span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;In my previous assembler posts, I&#39;ve discussed improvements on &lt;a
href=&quot;/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations&quot;&gt;expression
resolving and relocation generation&lt;/a&gt;. Now, let&#39;s turn our attention
to recent refinements within section fragments. Understanding how an
assembler utilizes these fragments is key to appreciating the
improvements we&#39;ve made. At a high level, the process unfolds in three
main stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parsing phase: The assembler constructs section fragments. These
fragments represent sequences of regular instructions or data, &lt;a
href=&quot;/blog/2024-04-27-understanding-clang-o0-size-increase-branch-displacement&quot;&gt;span-dependent
instructions&lt;/a&gt;, alignment directives, and other elements.&lt;/li&gt;
&lt;li&gt;Section layout phase: Once fragments are built, the assembler
assigns offsets to them and finalizes the span-dependent content.&lt;/li&gt;
&lt;li&gt;&lt;a
href=&quot;/blog/2025-03-16-relocation-generation-in-assemblers&quot;&gt;Relocation
decision phase&lt;/a&gt;: In the final stage, the assembler evaluates fixups
and, if necessary, updates the content of the fragments.&lt;/li&gt;
&lt;/ul&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>GCC 13.3.0 miscompiles LLVM</title>
    <link href="https://maskray.me/blog/2025-07-13-gcc-miscompiles-llvm"/>
    <id>https://maskray.me/blog/2025-07-13-gcc-miscompiles-llvm</id>
    <published>2025-07-13T07:00:00.000Z</published>
    <updated>2026-02-16T06:56:40.572Z</updated>
    
    <content type="html"><![CDATA[<p>For years, I've been involved in updating LLVM's MC layer. A recent journey led me to <a href="/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations">eliminate the <code>FK_PCRel_</code> fixup kinds</a>:</p><pre><code>MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.

Removing FK_PCRel_ simplifies the overall fixup mechanism.</code></pre><span id="more"></span><p>As a prerequisite, I had to update several backends that relied on the now-deleted fixup kinds. It was during this process that something unexpected happened. Contributors <a href="https://github.com/llvm/llvm-project/commit/c20379198c7fb66b9514d21ae1e07b0705e3e6fa">reported that when built by GCC 13.3.0</a>, the LLVM integrated assembler had test failures.</p><p>To investigate, I downloaded and built GCC 13.3.0 locally:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">../../configure --prefix=<span class="variable">$HOME</span>/opt/gcc-13.3.0 --disable-bootstrap --enable-languages=c,c++ --disable-libsanitizer --disable-multilib</span><br><span class="line">make -j 30 &amp;&amp; make -j 30 install</span><br></pre></td></tr></table></figure><p>I then built a Release build (<code>-O3</code>) of LLVM. 
Sure enough, the failure was reproducible:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">% /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj</span><br><span class="line">Unknown immediate size</span><br><span class="line">UNREACHABLE executed at /home/ray/llvm/llvm/lib/Target/X86/MCTargetDesc/X86BaseInfo.h:904!</span><br><span class="line">PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.</span><br><span class="line">Stack dump:</span><br><span class="line">0.      Program arguments: /tmp/out/custom-gcc-13/bin/llc llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll -mtriple=i686 -o s -filetype=obj</span><br><span class="line">1.      Running pass &#x27;Function Pass Manager&#x27; on module &#x27;llvm/test/CodeGen/X86/2008-08-06-RewriterBug.ll&#x27;.</span><br><span class="line">2.      Running pass &#x27;X86 Assembly Printer&#x27; on function &#x27;@foo&#x27;</span><br><span class="line">Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):</span><br><span class="line">0  llc 0x0000000002f06bcb</span><br><span class="line">fish: Job 1, &#x27;/tmp/out/custom-gcc-13/bin/llc …&#x27; terminated by signal SIGABRT (Abort)</span><br></pre></td></tr></table></figure><p>Interestingly, a RelWithDebInfo build (<code>-O2 -g</code>) of LLVM did not reproduce the failure, suggesting either undefined behavior or an optimization-related issue within GCC 13.3.0.</p><h2 id="the-bisection-trail">The bisection trail</h2><p>I built GCC at the <code>releases/gcc-13</code> branch, and the issue vanished. This strongly indicated that the problem lay somewhere between the <code>releases/gcc-13.3.0</code> tag and the <code>releases/gcc-13</code> branch.</p><p>The bisection led me to a specific commit, directing me to <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109934#c6" class="uri">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109934#c6</a>.</p><p>I developed a workaround at the code block with the typo "RemaningOps". Although I had observed it before, I was hesitant to introduce a commit solely for a typo fix. However, it became clear this was the perfect opportunity to address both the typo and implement a workaround for the GCC miscompilation. This led to the landing of <a href="https://github.com/llvm/llvm-project/commit/6d67794d164ebeedbd287816e1541964fb5d6c99">this commit</a>, resolving the miscompilation.</p><p>Sam James from Gentoo mentioned that the miscompilation was introduced by a commit cherry-picked into GCC 13.3.0. GCC 13.2.0 and GCC 13.4.0 are good.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;For years, I&#39;ve been involved in updating LLVM&#39;s MC layer. A recent
journey led me to &lt;a
href=&quot;/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations&quot;&gt;eliminate
the &lt;code&gt;FK_PCRel_&lt;/code&gt; fixup kinds&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.

Removing FK_PCRel_ simplifies the overall fixup mechanism.&lt;/code&gt;&lt;/pre&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="gcc" scheme="https://maskray.me/blog/tags/gcc/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>LLVM integrated assembler: Improving expressions and relocations</title>
    <link href="https://maskray.me/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations"/>
    <id>https://maskray.me/blog/2025-05-26-llvm-integrated-assembler-improving-expressions-and-relocations</id>
    <published>2025-05-26T07:00:00.000Z</published>
    <updated>2025-08-09T23:39:39.437Z</updated>
    
    <content type="html"><![CDATA[<p>My previous post, <a href="/blog/2025-04-06-llvm-integrated-assembler-improving-mcexpr-mcvalue">LLVM integrated assembler: Improving MCExpr and MCValue</a>, delved into enhancements made to LLVM's internal MCExpr and MCValue representations. This post covers recent refinements to MC, focusing on expression resolving and relocation generation.</p><span id="more"></span><h2 id="preventing-cyclic-dependencies">Preventing cyclic dependencies</h2><p><a href="/blog/2023-05-08-assemblers#symbol-equating-directives">Equated symbols</a> may form a cycle, which is not allowed.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"># CHECK: [[#@LINE+2]]:7: error: cyclic dependency detected for symbol &#x27;a&#x27;</span><br><span class="line"># CHECK: [[#@LINE+1]]:7: error: expression could not be evaluated</span><br><span class="line">a = a + 1</span><br><span class="line"></span><br><span class="line"># CHECK: [[#@LINE+3]]:6: error: cyclic dependency detected for symbol &#x27;b1&#x27;</span><br><span class="line"># CHECK: [[#@LINE+1]]:6: error: expression could not be evaluated</span><br><span class="line">b0 = b1</span><br><span class="line">b1 = b2</span><br><span class="line">b2 = b0</span><br></pre></td></tr></table></figure><p>Previously, LLVM's integrated assembler used an occurs check to detect these cycles when parsing symbol equating directives.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">bool</span> <span class="title">parseAssignmentExpression</span><span class="params">(StringRef Name, <span class="type">bool</span> allow_redef,</span></span></span><br><span class="line"><span class="params"><span class="function">                               MCAsmParser &amp;Parser, MCSymbol *&amp;Sym,</span></span></span><br><span class="line"><span class="params"><span class="function">                               <span class="type">const</span> MCExpr *&amp;Value)</span> </span>&#123;</span><br><span class="line">  ...</span><br><span class="line">  <span class="comment">// Validate that the LHS is allowed to be a variable (either it has not been</span></span><br><span class="line">  <span class="comment">// used as a symbol, or it is an absolute symbol).</span></span><br><span class="line">  Sym = Parser.<span class="built_in">getContext</span>().<span class="built_in">lookupSymbol</span>(Name);</span><br><span class="line">  <span class="keyword">if</span> (Sym) &#123;</span><br><span class="line">    <span class="comment">// Diagnose assignment to a label.</span></span><br><span class="line">    <span class="comment">//</span></span><br><span class="line">    <span class="comment">// <span class="doctag">FIXME:</span> Diagnostics. 
Note the location of the definition as a label.</span></span><br><span class="line">    <span class="comment">// <span class="doctag">FIXME:</span> Diagnose assignment to protected identifier (e.g., register name).</span></span><br><span class="line">    <span class="keyword">if</span> (Value-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym))</span><br><span class="line">      <span class="keyword">return</span> Parser.<span class="built_in">Error</span>(EqualLoc, <span class="string">&quot;Recursive use of &#x27;&quot;</span> + Name + <span class="string">&quot;&#x27;&quot;</span>);</span><br><span class="line">    ...</span><br><span class="line">  &#125;</span><br></pre></td></tr></table></figure><p><code>isSymbolUsedInExpression</code> implemented the occurs check as a tree (or, more accurately, a DAG) traversal.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">bool</span> <span class="title">MCExpr::isSymbolUsedInExpression</span><span class="params">(<span
class="type">const</span> MCSymbol *Sym)</span> <span class="type">const</span> </span>&#123;</span><br><span class="line">  <span class="keyword">switch</span> (<span class="built_in">getKind</span>()) &#123;</span><br><span class="line">  <span class="keyword">case</span> MCExpr::Binary: &#123;</span><br><span class="line">    <span class="type">const</span> MCBinaryExpr *BE = <span class="built_in">static_cast</span>&lt;<span class="type">const</span> MCBinaryExpr *&gt;(<span class="keyword">this</span>);</span><br><span class="line">    <span class="keyword">return</span> BE-&gt;<span class="built_in">getLHS</span>()-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym) ||</span><br><span class="line">           BE-&gt;<span class="built_in">getRHS</span>()-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym);</span><br><span class="line">  &#125;</span><br><span class="line">  <span class="keyword">case</span> MCExpr::Target: &#123;</span><br><span class="line">    <span class="type">const</span> MCTargetExpr *TE = <span class="built_in">static_cast</span>&lt;<span class="type">const</span> MCTargetExpr *&gt;(<span class="keyword">this</span>);</span><br><span class="line">    <span class="keyword">return</span> TE-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym);</span><br><span class="line">  &#125;</span><br><span class="line">  <span class="keyword">case</span> MCExpr::Constant:</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">  <span class="keyword">case</span> MCExpr::SymbolRef: &#123;</span><br><span class="line">    <span class="type">const</span> MCSymbol &amp;S = <span class="built_in">static_cast</span>&lt;<span class="type">const</span> MCSymbolRefExpr *&gt;(<span class="keyword">this</span>)-&gt;<span class="built_in">getSymbol</span>();</span><br><span class="line">    <span class="keyword">if</span> (S.<span 
class="built_in">isVariable</span>() &amp;&amp; !S.<span class="built_in">isWeakExternal</span>())</span><br><span class="line">      <span class="keyword">return</span> S.<span class="built_in">getVariableValue</span>()-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym);</span><br><span class="line">    <span class="keyword">return</span> &amp;S == Sym;</span><br><span class="line">  &#125;</span><br><span class="line">  <span class="keyword">case</span> MCExpr::Unary: &#123;</span><br><span class="line">    <span class="type">const</span> MCExpr *SubExpr =</span><br><span class="line">        <span class="built_in">static_cast</span>&lt;<span class="type">const</span> MCUnaryExpr *&gt;(<span class="keyword">this</span>)-&gt;<span class="built_in">getSubExpr</span>();</span><br><span class="line">    <span class="keyword">return</span> SubExpr-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym);</span><br><span class="line">  &#125;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="built_in">llvm_unreachable</span>(<span class="string">&quot;Unknown expr kind!&quot;</span>);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>While generally effective, this routine wasn't universally applied across all symbol equating scenarios, such as with <code>.weakref</code> or some target-specific parsing code, leading to potential undetected cycles and therefore infinite loops during assembly.</p><p>To address this, I adopted a 2-color depth-first search (DFS) algorithm. While a 3-color DFS is typical for DAGs, a 2-color approach suffices for our trees, although this might lead to more work when a symbol is visited multiple times. Shared subexpressions are very rare in LLVM.</p><p>Here is the relevant change to <code>evaluateAsRelocatableImpl</code>.
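</p><p>As a standalone illustration before the patch itself, here is a minimal sketch of the 2-color idea. The <code>Symbol</code>/<code>Expr</code>/<code>evaluate</code> names below are simplified stand-ins for this post, not LLVM's actual classes: evaluation sets an "is resolving" bit on entry and clears it on exit, and re-entering a symbol whose bit is already set signals a cycle.</p>

```cpp
#include <cassert>

// Toy expression: either a constant, or (symbol reference + constant addend).
struct Expr;

struct Symbol {
  const Expr *value = nullptr; // non-null for equated symbols
  bool isResolving = false;    // the 2-color mark, like the new MCSymbol bit
};

struct Expr {
  long constant = 0;
  Symbol *ref = nullptr;
};

// Returns false on cycles, mirroring the early bail-out that
// evaluateAsRelocatableImpl performs when Sym.isResolving() is set.
static bool evaluate(const Expr &e, long &result) {
  if (!e.ref) {
    result = e.constant;
    return true;
  }
  Symbol &s = *e.ref;
  if (s.isResolving)
    return false; // cyclic dependency detected
  if (!s.value)
    return false; // undefined symbol; a real assembler emits a relocation
  s.isResolving = true; // color the node on entry
  bool ok = evaluate(*s.value, result);
  s.isResolving = false; // scope_exit equivalent: uncolor on exit
  if (ok)
    result += e.constant;
  return ok;
}
```

<p>With these stand-ins, an equation like <code>a = a + 1</code> fails to evaluate, while <code>b = 3</code> followed by a reference to <code>b + 1</code> evaluates successfully.</p><p>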
I also need a new bit from<code>MCSymbol</code>.</p><figure class="highlight patch"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">@@ -497,13 +498,25 @@</span> bool MCExpr::evaluateAsRelocatableImpl(MCValue &amp;Res, const MCAssembler *Asm,</span><br><span class="line"></span><br><span class="line">   case SymbolRef: &#123;</span><br><span class="line">     const MCSymbolRefExpr *SRE = cast&lt;MCSymbolRefExpr&gt;(this);</span><br><span class="line"><span class="deletion">-    const MCSymbol &amp;Sym = SRE-&gt;getSymbol();</span></span><br><span class="line"><span class="addition">+    MCSymbol &amp;Sym = const_cast&lt;MCSymbol &amp;&gt;(SRE-&gt;getSymbol());</span></span><br><span class="line">     const auto Kind = SRE-&gt;getKind();</span><br><span class="line">     bool Layout = Asm &amp;&amp; Asm-&gt;hasLayout();</span><br><span class="line"></span><br><span class="line">     // Evaluate recursively if this is a variable.</span><br><span class="line"><span class="addition">+    if (Sym.isResolving()) &#123;</span></span><br><span class="line"><span 
class="addition">+      if (Asm &amp;&amp; Asm-&gt;hasFinalLayout()) &#123;</span></span><br><span class="line"><span class="addition">+        Asm-&gt;getContext().reportError(</span></span><br><span class="line"><span class="addition">+            Sym.getVariableValue()-&gt;getLoc(),</span></span><br><span class="line"><span class="addition">+            &quot;cyclic dependency detected for symbol &#x27;&quot; + Sym.getName() + &quot;&#x27;&quot;);</span></span><br><span class="line"><span class="addition">+        Sym.IsUsed = false;</span></span><br><span class="line"><span class="addition">+        Sym.setVariableValue(MCConstantExpr::create(0, Asm-&gt;getContext()));</span></span><br><span class="line"><span class="addition">+      &#125;</span></span><br><span class="line"><span class="addition">+      return false;</span></span><br><span class="line"><span class="addition">+    &#125;</span></span><br><span class="line">     if (Sym.isVariable() &amp;&amp; (Kind == MCSymbolRefExpr::VK_None || Layout) &amp;&amp;</span><br><span class="line">         canExpand(Sym, InSet)) &#123;</span><br><span class="line"><span class="addition">+      Sym.setIsResolving(true);</span></span><br><span class="line"><span class="addition">+      auto _ = make_scope_exit([&amp;] &#123; Sym.setIsResolving(false); &#125;);</span></span><br><span class="line">       bool IsMachO =</span><br><span class="line">           Asm &amp;&amp; Asm-&gt;getContext().getAsmInfo()-&gt;hasSubsectionsViaSymbols();</span><br><span class="line">       if (Sym.getVariableValue()-&gt;evaluateAsRelocatableImpl(Res, Asm,</span><br></pre></td></tr></table></figure><p>Unfortunately, I cannot remove<code>MCExpr::isSymbolUsedInExpression</code>, as it is still used byAMDGPU (<ahref="https://github.com/llvm/llvm-project/pull/112251">[AMDGPU] Avoidresource propagation for recursion through multiple functions</a>).</p><h2 id="revisiting-the-.weakref-directive">Revisiting the<code>.weakref</code> 
directive</h2><p>The <code>.weakref</code> directive has an intricate impact on the expression resolving framework.</p><p><code>.weakref</code> enables the creation of weak aliases without directly modifying the target symbol's binding. This allows a header file in library A to optionally depend on symbols from library B. When the target symbol is otherwise not referenced, the object file affected by the weakref directive will include an undefined weak symbol. However, when the target symbol is defined or referenced (by the user), it can retain STB_GLOBAL binding to support archive member extraction. GCC's <code>[[gnu::weakref]]</code> attribute, as used in runtime library headers like <code>libgcc/gthr-posix.h</code>, utilizes this feature.</p><p>I've noticed a few issues:</p><ul><li>Unreferenced <code>.weakref alias, target</code> created an undefined <code>target</code>.</li><li>Crash when <code>alias</code> was already defined.</li><li><code>VK_WEAKREF</code> was mis-reused by the <code>alias</code> directive of llvm-ml (MASM replacement).</li></ul><p>I addressed them with:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/2b0256e49bbe5c0dc9c8f4800b1e2f131026cb45">[MC] Ignore VK_WEAKREF in MCValue::getAccessVariant</a> (2019-12). Wow, it's interesting to realize I'd actually delved into this a few years ago!</li><li><a href="https://github.com/llvm/llvm-project/commit/95756e67c230c231c616a9aeabc2eea1e2831829">MC: Rework .weakref</a> (2025-05)</li></ul><h2 id="expression-resolving-and-reassignments">Expression resolving and reassignments</h2><p><code>=</code> and its equivalents (<code>.set</code>, <code>.equ</code>) allow a symbol to be <a href="/blog/2023-05-08-assemblers#symbol-equating-directives">equated</a> multiple times.
This means that when a symbol is referenced, its current value is captured at that moment, and subsequent reassignments do not alter prior references.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">.data</span><br><span class="line">.set x, 0</span><br><span class="line">.long x         // reference the first instance</span><br><span class="line">x = .-.data</span><br><span class="line">.long x         // reference the second instance</span><br><span class="line">.set x,.-.data</span><br><span class="line">.long x         // reference the third instance</span><br></pre></td></tr></table></figure><p>The assembly code evaluates to <code>.long 0; .long 4; .long 8</code>.</p><p>Historically, the LLVM integrated assembler restricted reassigning symbols whose value wasn't a parse-time integer constant (<code>MCConstantExpr</code>).
This was a safeguard against potentially unsafe reassignments, as an old value might still be referenced.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">% clang -c g.s</span><br><span class="line">g.s:6:8: error: invalid reassignment of non-absolute variable &#x27;x&#x27;</span><br><span class="line">.set x,.-.data</span><br><span class="line">       ^</span><br></pre></td></tr></table></figure><p>The safeguard was implemented with multiple conditions, aided by a <a href="https://reviews.llvm.org/D12347">mysterious</a> <a href="https://github.com/llvm/llvm-project/commit/9b4a824217f1fe23f83045afe7521acb791bc2d0"><code>IsUsed</code></a> variable.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Diagnose assignment to a label.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// <span class="doctag">FIXME:</span> Diagnostics.
Note the location of the definition as a label.</span></span><br><span class="line"><span class="comment">// <span class="doctag">FIXME:</span> Diagnose assignment to protected identifier (e.g., register name).</span></span><br><span class="line"><span class="keyword">if</span> (Value-&gt;<span class="built_in">isSymbolUsedInExpression</span>(Sym))</span><br><span class="line">  <span class="keyword">return</span> Parser.<span class="built_in">Error</span>(EqualLoc, <span class="string">&quot;Recursive use of &#x27;&quot;</span> + Name + <span class="string">&quot;&#x27;&quot;</span>);</span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (Sym-&gt;<span class="built_in">isUndefined</span>(<span class="comment">/*SetUsed*/</span> <span class="literal">false</span>) &amp;&amp; !Sym-&gt;<span class="built_in">isUsed</span>() &amp;&amp;</span><br><span class="line">         !Sym-&gt;<span class="built_in">isVariable</span>())</span><br><span class="line">  ; <span class="comment">// Allow redefinitions of undefined symbols only used in directives.</span></span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (Sym-&gt;<span class="built_in">isVariable</span>() &amp;&amp; !Sym-&gt;<span class="built_in">isUsed</span>() &amp;&amp; allow_redef)</span><br><span class="line">  ; <span class="comment">// Allow redefinitions of variables that haven&#x27;t yet been used.</span></span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (!Sym-&gt;<span class="built_in">isUndefined</span>() &amp;&amp; (!Sym-&gt;<span class="built_in">isVariable</span>() || !allow_redef))</span><br><span class="line">  <span class="keyword">return</span> Parser.<span class="built_in">Error</span>(EqualLoc, <span class="string">&quot;redefinition of &#x27;&quot;</span> + Name + <span class="string">&quot;&#x27;&quot;</span>);</span><br><span class="line"><span 
class="keyword">else</span> <span class="keyword">if</span> (!Sym-&gt;<span class="built_in">isVariable</span>())</span><br><span class="line">  <span class="keyword">return</span> Parser.<span class="built_in">Error</span>(EqualLoc, <span class="string">&quot;invalid assignment to &#x27;&quot;</span> + Name + <span class="string">&quot;&#x27;&quot;</span>);</span><br><span class="line"><span class="keyword">else</span> <span class="keyword">if</span> (!<span class="built_in">isa</span>&lt;MCConstantExpr&gt;(Sym-&gt;<span class="built_in">getVariableValue</span>()))</span><br><span class="line">  <span class="keyword">return</span> Parser.<span class="built_in">Error</span>(EqualLoc,</span><br><span class="line">                      <span class="string">&quot;invalid reassignment of non-absolute variable &#x27;&quot;</span> +</span><br><span class="line">                          Name + <span class="string">&quot;&#x27;&quot;</span>);</span><br></pre></td></tr></table></figure><p>Over the past few years, while making Linux kernel ports build with Clang, we worked around this by modifying the assembly code itself:</p><ul><li><a href="https://git.kernel.org/linus/a780e485b5768e78aef087502499714901b68cc4">ARM: 8971/1: replace the sole use of a symbol with its definition</a> in 2020-04</li><li><a href="https://git.kernel.org/linus/44069737ac9625a0f02f0f7f5ab96aae4cd819bc">crypto: aesni - add compatibility with IAS</a> in 2020-07</li><li><a href="https://git.kernel.org/linus/d72c4a36d7ab560127885473a310ece28988b604">powerpc/64/asm: Do not reassign labels</a> in 2021-12</li></ul><p>This prior behavior wasn't ideal. I've since enabled proper reassignment by implementing a system where the symbol is cloned upon redefinition, and the symbol table is updated accordingly.
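</p><p>The clone-on-redefinition idea can be sketched with a toy symbol table. The <code>SymbolTable</code>/<code>ref</code>/<code>set</code> names here are hypothetical, not LLVM's API: every reassignment binds the name to a fresh <code>Symbol</code> object, so a reference captured earlier keeps pointing at the object (and value) it saw at reference time.</p>

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Symbol {
  std::string name;
  long value = 0;
};

// Toy model: a use site captures a Symbol*; ".set name, v" rebinds the
// name to a clone instead of mutating the possibly-referenced object.
struct SymbolTable {
  std::unordered_map<std::string, Symbol *> byName;
  std::vector<std::unique_ptr<Symbol>> storage; // keeps old symbols alive

  Symbol *ref(const std::string &name) { return byName.at(name); }

  void set(const std::string &name, long v) {
    // Always create a fresh Symbol. Earlier references still see the old
    // object; only the latest clone would reach the emitted symbol table.
    storage.push_back(std::make_unique<Symbol>(Symbol{name, v}));
    byName[name] = storage.back().get();
  }
};
```

<p>Replaying the earlier <code>.set x, ...; .long x</code> example against this table yields the values 0, 4, and 8 at the three capture points.</p><p>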
Crucially, any existing references to the original symbol remain unchanged, and the original symbol is no longer included in the final emitted symbol table.</p><p>Before rolling out this improvement, I discovered problematic uses in the AMDGPU and ARM64EC backends that required specific fixes or workarounds. This is a common challenge when making general improvements to LLVM's MC layer: you often need to untangle and resolve individual backend-specific "hacks" before a more generic interface enhancement can be applied.</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/76ee2d34f787357eec1a5dec16b294578151881e">MCParser: Error when .set reassigns a non-redefinable variable</a></li><li><a href="https://github.com/llvm/llvm-project/commit/e015626f189dc76f8df9fdc25a47638c6a2f3feb">MC: Allow .set to reassign non-MCConstantExpr expressions</a></li></ul><p>For the following assembly, newer Clang emits relocations referencing <code>foo, foo, bar, foo</code>, like GNU Assembler.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">b = a</span><br><span class="line">a = foo</span><br><span class="line">call a</span><br><span class="line">call b</span><br><span class="line">a = bar</span><br><span class="line">call a</span><br><span class="line">call b</span><br></pre></td></tr></table></figure><h2 id="relocation-generation">Relocation generation</h2><p>For a deeper dive into the concepts of relocation generation, you might find my previous post, <a href="https://maskray.me/blog/2025-03-16-relocation-generation-in-assemblers">Relocation generation in assemblers</a>, helpful.</p><p>Driven by the need to support new RISC-V vendor relocations (e.g., Xqci extensions from Qualcomm) and my preference
against introducing an extra <code>MCAsmBackend</code> hook, I've significantly refactored LLVM's relocation generation framework. This effort generalized existing RISC-V/LoongArch ADD/SUB relocation logic and enabled its customization for other targets like AVR and PowerPC.</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/c512d951861c9e35649b6c9672c227244bb9b6be">MC: Generalize RISCV/LoongArch handleAddSubRelocations and AVR shouldForceRelocation</a></li></ul><p>The linker relaxation framework sometimes generated redundant relocations that could have been resolved. This occurred in several scenarios, including:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">.option norelax</span><br><span class="line">j label</span><br><span class="line">// For assembly input, RISCVAsmParser::ParseInstruction sets ForceRelocs (https://reviews.llvm.org/D46423).</span><br><span class="line">// For direct object emission, RISCVELFStreamer sets ForceRelocs (#77436)</span><br><span class="line">.option relax</span><br><span class="line">call foo  // linker-relaxable</span><br><span class="line"></span><br><span class="line">.option norelax</span><br><span class="line">j label   // redundant relocation due to ForceRelocs</span><br><span class="line">.option relax</span><br><span class="line"></span><br><span class="line">label:</span><br></pre></td></tr></table></figure><p>And also with label differences within a section without linker-relaxable instructions:</p><figure class="highlight plaintext"><table><tr><td
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">call foo</span><br><span class="line"></span><br><span class="line">.section .text1,&quot;ax&quot;</span><br><span class="line"># No linker-relaxable instruction. Label differences should be resolved.</span><br><span class="line">w1:</span><br><span class="line">  nop</span><br><span class="line">w2:</span><br><span class="line"></span><br><span class="line">.data</span><br><span class="line"># Redundant R_RISCV_SET32 and R_RISCV_SUB32</span><br><span class="line">.long w2-w1</span><br></pre></td></tr></table></figure><p>These issues have now been resolved through a series of patches,significantly revamping the target-neutral relocation generationframework. 
Key contributions include:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/5d87ebf3ade73d43b2dc334e4d23bc86ddc47879">[MC] Refactor fixup evaluation and relocation generation</a></li><li><a href="https://github.com/llvm/llvm-project/pull/140494">RISCV,LoongArch: Encode RELAX relocation implicitly</a></li><li><a href="https://github.com/llvm/llvm-project/pull/140692">RISCV: Remove shouldForceRelocation and unneeded relocations</a></li><li><a href="https://github.com/llvm/llvm-project/commit/b754e4085541df750c51677e522dd939e2aa9e2d">MC: Remove redundant relocations for label differences</a></li></ul><p>I've also streamlined relocation generation within the SPARC backend. Given its minimal number of relocations, the SPARC implementation could serve as a valuable reference for downstream targets seeking to customize their own relocation handling.</p><h2 id="simplification-to-assembly-and-machine-code-emission">Simplification to assembly and machine code emission</h2><p>For a dive into the core classes involved in LLVM's assembly and machine code emission, you might read my <a href="/blog/2023-05-08-assemblers#notes-on-llvm-assembly-and-machine-code-emission">Notes on LLVM assembly and machine code emission</a>.</p><p>The <code>MCAssembler</code> class orchestrates the emission process, managing <code>MCAsmBackend</code>, <code>MCCodeEmitter</code>, and <code>MCObjectWriter</code>. In turn, <code>MCObjectWriter</code> oversees <code>MCObjectTargetWriter</code>.</p><p>Historically, many member functions within the subclasses of <code>MCAsmBackend</code>, <code>MCObjectWriter</code>, and <code>MCObjectTargetWriter</code> accepted an <code>MCAssembler *</code> argument. This was often redundant, as it was typically only used to access the <code>MCContext</code> instance.
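</p><p>The shape of this cleanup can be shown with a small standalone sketch. The <code>Backend</code>/<code>Assembler</code>/<code>Context</code> types below are simplified stand-ins, not LLVM's classes: the backend keeps a back-pointer that is set once at construction, and hooks call <code>getContext()</code> instead of threading a context or assembler parameter through every signature.</p>

```cpp
#include <cassert>
#include <string>

struct Context {
  int errors = 0;
  void reportError(const std::string &) { ++errors; }
};

struct Assembler; // owns the context and the backend

struct Backend {
  Assembler *Asm = nullptr; // the new member variable
  Context &getContext();    // helper replacing per-hook parameters

  // Hook signatures no longer need a Context/Assembler argument.
  void applyFixup(long value) {
    if (value > 0xff)
      getContext().reportError("fixup out of range");
  }
};

struct Assembler {
  Context ctx;
  Backend backend;
  Assembler() { backend.Asm = this; } // wire up the back-pointer once
};

Context &Backend::getContext() { return Asm->ctx; }
```

<p>Call sites stay unchanged while the parameter lists shrink, which is the essence of the interface cleanup described above.</p><p>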
To streamline this, I've added an <code>MCAssembler *</code> member variable directly to <code>MCAsmBackend</code>, <code>MCObjectWriter</code>, and <code>MCObjectTargetWriter</code>, along with convenient helper functions like <code>getContext</code>. This change cleans up the interfaces and improves code clarity.</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/84f06b88b64352e35fc8363081e58fd37e326452">MCAsmBackend: Add member variable MCAssembler * and define getContext</a></li><li><a href="https://github.com/llvm/llvm-project/commit/fe32806d67eef72ff406dbcdc6a28d882a00e3a3">ELFObjectWriter: Remove the MCContext argument from getRelocType</a></li><li><a href="https://github.com/llvm/llvm-project/commit/1193f62f7c19e4e0cc36ee5006fa27ec108dc466">MachObjectWriter: Remove the MCAssembler argument from getSymbolAddress</a></li><li><a href="https://github.com/llvm/llvm-project/commit/9513284f25545029de68f7e09bc5c1606636c489">WinCOFFObjectWriter: Simplify code with member MCAssembler *</a></li></ul><p>Previously, the ARM, Hexagon, and RISC-V backends had unique requirements that led to extra arguments being passed to MCAsmBackend hooks. These arguments were often unneeded by other targets. I've since refactored these interfaces, replacing those specialized arguments with more generalized and cleaner approaches.</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/63adf075551221901cee551af101a484234fd1f2">ELFObjectWriter: Move Thumb-specific condition to ARMELFObjectWriter</a></li><li><a href="https://github.com/llvm/llvm-project/commit/f0ff2bea75f45a72143ac7fcd16a1199eb5ebf6e">MCAsmBackend: Remove MCSubtargetInfo argument</a></li><li><a href="https://github.com/llvm/llvm-project/commit/5710759eb390c0d5274c2a4d43967282d7df1993">MCAsmBackend,X86: Pass MCValue to fixupNeedsRelaxationAdvanced.
NFC</a></li><li><a href="https://github.com/llvm/llvm-project/commit/2ff226ae2c9bdafc686d698b69b4a8519213f325">MCAsmBackend,Hexagon: Remove MCRelaxableFragment from fixupNeedsRelaxationAdvanced</a></li><li><a href="https://github.com/llvm/llvm-project/commit/871b0a32216770b84fe6fed412610ad03dafbf7f">MCAsmBackend: Simplify applyFixup</a></li></ul><h2 id="future-plan">Future plan</h2><p>The assembler's ARM port has a limitation where only relocations with implicit addends (REL) are handled. For <a href="/blog/2024-03-09-a-compact-relocation-format-for-elf">CREL</a>, we aim to use explicit addends across all targets to simplify linker/tooling implementation, but this is incompatible with <code>ARMAsmBackend</code>'s current design. See this ARM CREL assembler issue: <a href="https://github.com/llvm/llvm-project/issues/141678" class="uri">https://github.com/llvm/llvm-project/issues/141678</a>.</p><p>To address this issue, we should:</p><ul><li>In <code>MCAssembler::evaluateFixup</code>, generalize <code>MCFixupKindInfo::FKF_IsAlignedDownTo32Bits</code> (an ARM hack, also used by other backends) to support more fixups, including <code>ARM::fixup_arm_uncondbl</code> (<code>R_ARM_CALL</code>).
Create a new hook in <code>MCAsmBackend</code>.</li><li>In <code>ARMAsmBackend</code>, move the <code>Value -= 8</code> code from <code>adjustFixupValue</code> to the new hook.</li></ul><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">unsigned</span> <span class="title">ARMAsmBackend::adjustFixupValue</span><span class="params">(<span class="type">const</span> MCAssembler &amp;Asm,</span></span></span><br><span class="line"><span class="params"><span class="function">...</span></span></span><br><span class="line"><span class="params"><span class="function">  <span class="keyword">case</span> ARM::fixup_arm_condbranch:</span></span></span><br><span class="line"><span class="params"><span class="function">  <span class="keyword">case</span> ARM::fixup_arm_uncondbranch:</span></span></span><br><span class="line"><span class="params"><span class="function">  <span class="keyword">case</span> ARM::fixup_arm_uncondbl:</span></span></span><br><span class="line"><span class="params"><span class="function">  <span class="keyword">case</span> ARM::fixup_arm_condbl:</span></span></span><br><span class="line"><span class="params"><span class="function">  <span class="keyword">case</span> ARM::fixup_arm_blx:</span></span></span><br><span class="line"><span class="params"><span class="function">    <span class="comment">// Check that the relocation value is legal.</span></span></span></span><br><span class="line"><span class="params"><span class="function">    Value -= <span
class="number">8</span>;</span></span></span><br><span class="line"><span class="params"><span class="function">    <span class="keyword">if</span> (!isInt&lt;<span class="number">26</span>&gt;(Value)) &#123;</span></span></span><br><span class="line"><span class="params"><span class="function">      Ctx.reportError(Fixup.getLoc(), <span class="string">&quot;Relocation out of range&quot;</span>);</span></span></span><br><span class="line"><span class="params"><span class="function">      <span class="keyword">return</span> <span class="number">0</span>;</span></span></span><br></pre></td></tr></table></figure><p>Enabling RELA/CREL support requires significant effort and exceeds my expertise or willingness to address for AArch32. However, I do want to add a new MCAsmBackend hook to minimize AArch32's invasive modifications to the generic relocation generation framework.</p><p>For reference, the arm-vxworks port in binutils <a href="https://sourceware.org/pipermail/binutils/2006-March/046211.html">introduced RELA support in 2006</a>.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;My previous post, &lt;a
href=&quot;/blog/2025-04-06-llvm-integrated-assembler-improving-mcexpr-mcvalue&quot;&gt;LLVM
integrated assembler: Improving MCExpr and MCValue&lt;/a&gt;, delved into
enhancements made to LLVM&#39;s internal MCExpr and MCValue representations.
This post covers recent refinements to MC, focusing on expression
resolving and relocation generation.&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>LLVM integrated assembler: Improving MCExpr and MCValue</title>
    <link href="https://maskray.me/blog/2025-04-06-llvm-integrated-assembler-improving-mcexpr-mcvalue"/>
    <id>https://maskray.me/blog/2025-04-06-llvm-integrated-assembler-improving-mcexpr-mcvalue</id>
    <published>2025-04-06T07:00:00.000Z</published>
    <updated>2025-05-31T18:53:41.609Z</updated>
    
    <content type="html"><![CDATA[<p>In my previous post, <ahref="/blog/2025-03-16-relocation-generation-in-assemblers"><em>RelocationGeneration in Assemblers</em></a>, I explored some key concepts behindLLVM’s integrated assemblers. This post dives into recent improvementsI’ve made to refine that system.</p><p>The LLVM integrated assembler handles fixups and relocatableexpressions as distinct entities. Relocatable expressions, inparticular, are encoded using the <code>MCValue</code> class, whichoriginally looked like this:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCValue</span> &#123;</span><br><span class="line">  <span class="type">const</span> MCSymbolRefExpr *SymA = <span class="literal">nullptr</span>, *SymB = <span class="literal">nullptr</span>;</span><br><span class="line">  <span class="type">int64_t</span> Cst = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> RefKind = <span class="number">0</span>;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><span id="more"></span><p>In this structure:</p><ul><li><code>RefKind</code> acts as an optional relocation specifier,though only a handful of targets actually use it.</li><li><code>SymA</code> represents an optional symbol reference (theaddend).</li><li><code>SymB</code> represents another optional symbol reference (thesubtrahend).</li><li><code>Cst</code> holds a constant value.</li></ul><p>While functional, this design had its flaws. 
For one, the wayrelocation specifiers were encoded varied across architectures:</p><ul><li>Targets like COFF, Mach-O, and ELF's PowerPC, SystemZ, and X86 embedthe relocation specifier within <code>MCSymbolRefExpr *SymA</code> aspart of <code>SubclassData</code>.</li><li>Conversely, ELF targets such as AArch64, MIPS, and RISC-V store itas a target-specific subclass of <code>MCTargetExpr</code>, and convertit to <code>MCValue::RefKind</code> during<code>MCValue::evaluateAsRelocatable</code>.</li></ul><p>Another issue was with <code>SymB</code>. Despite being typed as<code>const MCSymbolRefExpr *</code>, its<code>MCSymbolRefExpr::VariantKind</code> field went unused. This isbecause expressions like <code>add - sub@got</code> are notrelocatable.</p><p>Over the weekend, I tackled these inconsistencies and reworked therepresentation into something cleaner:</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCValue</span> &#123;</span><br><span class="line">  <span class="type">const</span> MCSymbol *SymA = <span class="literal">nullptr</span>, *SymB = <span class="literal">nullptr</span>;</span><br><span class="line">  <span class="type">int64_t</span> Cst = <span class="number">0</span>;</span><br><span class="line">  <span class="type">uint32_t</span> Specifier = <span class="number">0</span>;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This updated design not only aligns more closely with the concept ofrelocatable expressions but also shaves off some compiler time in LLVM.The ambiguous <code>RefKind</code> has been renamed to<code>Specifier</code> for clarity. 
Additionally, targets thatpreviously encoded the relocation specifier within<code>MCSymbolRefExpr</code> (rather than using<code>MCTargetExpr</code>) can now access it directly via<code>MCValue::Specifier</code>.</p><p>To support this change, I made a few adjustments:</p><ul><li><ahref="https://github.com/llvm/llvm-project/commit/83c3ec1b07c6c6857379cbdc6819262f2813b8e3">Introduced<code>getAddSym</code> and <code>getSubSym</code> methods</a>, returning<code>const MCSymbol *</code>, as replacements for <code>getSymA</code>and <code>getSymB</code>.</li><li>Eliminated dependencies on the old accessors,<code>MCValue::getSymA</code> and <code>MCValue::getSymB</code>.</li><li><ahref="https://github.com/llvm/llvm-project/commit/33246f79e87a0e629ae776d1811a1175a3f10065">Reworkedthe expression folding code that handles + and -</a></li><li><ahref="https://github.com/llvm/llvm-project/commit/94821ce45fe93aa78cc5ea03cd9deac91b7af127">Storedthe <code>const MCSymbolRefExpr *SymA</code> specifier at<code>MCValue::Specifier</code></a></li><li>Some targets relied on PC-relative fixups with explicit specifiersforcing relocations. 
I have <ahref="https://github.com/llvm/llvm-project/commit/38c3ad36be1facbe6db2dede7e93c0f12fb4e1dc">defined<code>MCAsmBackend::shouldForceRelocation</code> for SystemZ</a> and <ahref="https://github.com/llvm/llvm-project/commit/4182d2dcb5ecbfc34d41a6cd11810cd36844eddb">cleanedup ARM and PowerPC</a></li><li><ahref="https://github.com/llvm/llvm-project/commit/d5893fc2a7e1191afdb4940469ec9371a319b114">Changedthe type of <code>SymA</code> and <code>SymB</code> to<code>const MCSymbol *</code></a></li><li><ahref="https://github.com/llvm/llvm-project/commit/e5923936109ce4ce7be2c8fb3372b14d33c385d9">Replacedthe temporary <code>getSymSpecifier</code> with<code>getSpecifier</code></a></li><li><ahref="https://github.com/llvm/llvm-project/commit/8fa5b6cc0293d806e36b90d4116e5925fa5d7f2e">Replacedthe legacy <code>getAccessVariant</code> with<code>getSpecifier</code></a></li></ul><h2 id="streamlining-mach-o-support">Streamlining Mach-O support</h2><p>Mach-O assembler support in LLVM has accumulated significanttechnical debt, impacting both target-specific and generic code. Oneparticularly nagging issue was the<code>const SectionAddrMap *Addrs</code> parameter in<code>MCExpr::evaluateAs*</code> functions. This parameter existed tohandle cross-section label differences, primarily for generating(compact) unwind information in Mach-O. 
A typical example of this can beseen in assembly like:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">        .section        __TEXT,__text,regular,pure_instructions</span><br><span class="line">Leh_func_begin0:</span><br><span class="line">        .section        __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support</span><br><span class="line">Ltmp3:</span><br><span class="line">Ltmp4 = Leh_func_begin0-Ltmp3</span><br><span class="line">        .long   Ltmp4</span><br></pre></td></tr></table></figure><p>The <code>SectionAddrMap *Addrs</code> parameter always felt like aclunky workaround to me. It wasn’t until I dug into the <ahref="https://github.com/llvm/llvm-project/tree/main/llvm/lib/Target/AArch64/MCTargetDesc/AArch64MachObjectWriter.cpp">Mach-OAArch64 object writer</a> that I realized this hack wasn't necessary forthat writer. 
This discovery prompted a cleanup effort to remove the dependency on <code>SectionAddrMap</code> for ARM and X86 and eliminate the parameter:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/1b7759de8e6979dda2d949b1ba1c742922e5c366">[MC,MachO] Replace SectionAddrMap workaround with cleaner variable handling</a></li><li><a href="https://github.com/llvm/llvm-project/commit/b90a92687f399df5afe3e1a2493b0d9c6295ac8c">MCExpr: Remove unused SectionAddrMap workaround</a></li></ul><p>While I was at it, I also tidied up <code>MCSymbolRefExpr</code> by <a href="https://github.com/llvm/llvm-project/commit/768ccf69f3febe962e0d63dc87fbee31e59547a7">removing the clunky <code>HasSubsectionsViaSymbolsBit</code></a>, further simplifying the codebase.</p><h2 id="streamlining-instprinter">Streamlining InstPrinter</h2><p>The MCExpr code also determines how expression operands in assembly instructions are printed. I have made improvements in this area as well:</p><ul><li><a href="https://github.com/llvm/llvm-project/commit/3acccf042ab8a7b7e663bb2b2fac328d9bf65b38">[MC] Don't print () around $ names</a></li><li><a href="https://github.com/llvm/llvm-project/commit/04a67528d303ac4be7943b2ae57222f9c9fd509a">[MC] Simplify MCBinaryExpr/MCUnaryExpr printing by reducing parentheses</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;In my previous post, &lt;a
href=&quot;/blog/2025-03-16-relocation-generation-in-assemblers&quot;&gt;&lt;em&gt;Relocation
Generation in Assemblers&lt;/em&gt;&lt;/a&gt;, I explored some key concepts behind
LLVM’s integrated assemblers. This post dives into recent improvements
I’ve made to refine that system.&lt;/p&gt;
&lt;p&gt;The LLVM integrated assembler handles fixups and relocatable
expressions as distinct entities. Relocatable expressions, in
particular, are encoded using the &lt;code&gt;MCValue&lt;/code&gt; class, which
originally looked like this:&lt;/p&gt;
&lt;figure class=&quot;highlight cpp&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;1&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;2&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;3&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;4&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;5&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;line&quot;&gt;&lt;span class=&quot;keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;title class_&quot;&gt;MCValue&lt;/span&gt; &amp;#123;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;  &lt;span class=&quot;type&quot;&gt;const&lt;/span&gt; MCSymbolRefExpr *SymA = &lt;span class=&quot;literal&quot;&gt;nullptr&lt;/span&gt;, *SymB = &lt;span class=&quot;literal&quot;&gt;nullptr&lt;/span&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;  &lt;span class=&quot;type&quot;&gt;int64_t&lt;/span&gt; Cst = &lt;span class=&quot;number&quot;&gt;0&lt;/span&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;  &lt;span class=&quot;type&quot;&gt;uint32_t&lt;/span&gt; RefKind = &lt;span class=&quot;number&quot;&gt;0&lt;/span&gt;;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;line&quot;&gt;&amp;#125;;&lt;/span&gt;&lt;br&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/figure&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
  <entry>
    <title>Relocation generation in assemblers</title>
    <link href="https://maskray.me/blog/2025-03-16-relocation-generation-in-assemblers"/>
    <id>https://maskray.me/blog/2025-03-16-relocation-generation-in-assemblers</id>
    <published>2025-03-16T07:00:00.000Z</published>
    <updated>2026-02-28T07:43:35.618Z</updated>
    
<content type="html"><![CDATA[<p>This post explores how GNU Assembler and LLVM integrated assembler generate relocations, an important step in producing a relocatable file. Relocations identify parts of instructions or data that cannot be fully determined during assembly because they depend on the final memory layout, which is only established at link time or load time. These are essentially placeholders that will be filled in (typically with absolute addresses or PC-relative offsets) during the linking process.</p><h2 id="relocation-generation-the-basics">Relocation generation: the basics</h2><p>Symbol references are the primary candidates for relocations. For instance, in the x86-64 instruction <code>movl sym(%rip), %eax</code> (GNU syntax), the assembler calculates the displacement between the program counter (PC) and <code>sym</code>. This distance affects the instruction's encoding and typically triggers an <code>R_X86_64_PC32</code> relocation, unless <code>sym</code> is a local symbol defined within the current section.</p><p>Both the GNU assembler and LLVM integrated assembler utilize multiple passes during assembly, with several key phases relevant to relocation generation:</p><span id="more"></span><h2 id="parsing-phase">Parsing phase</h2><p>During parsing, the assembler builds section fragments that contain instructions and other directives. It parses each instruction into its opcode (e.g., <code>movl</code>) and operands (e.g., <code>sym(%rip), %eax</code>). It identifies registers, immediate values (like 3 in <code>movl $3, %eax</code>), and expressions.</p><p>Expressions can be constants, symbol references (like <code>sym</code>), or unary and binary operators (<code>-sym</code>, <code>sym0-sym1</code>). Those unresolvable at parse time (potential relocation candidates) turn into "fixups". 
These often skip immediate operand range checks, as shown here:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">% echo &#x27;addi a0, a0, 2048&#x27; | llvm-mc -triple=riscv64</span><br><span class="line">&lt;stdin&gt;:1:14: error: operand must be a symbol with %lo/%pcrel_lo/%tprel_lo modifier or an integer in the range [-2048, 2047]</span><br><span class="line">addi a0, a0, 2048</span><br><span class="line">             ^</span><br><span class="line">% echo &#x27;addi a0, a0, %lo(x)&#x27; | llvm-mc -triple riscv64 -show-encoding</span><br><span class="line">        addi    a0, a0, %lo(x)                  # encoding: [0x13,0x05,0bAAAA0101,A]</span><br><span class="line">                                        #   fixup A - offset: 0, value: %lo(x), kind: fixup_riscv_lo12_i</span><br></pre></td></tr></table></figure><p>A fixup ties to a specific location (an offset within a fragment), with its value being the expression (which must eventually evaluate to a relocatable expression).</p><p>Meanwhile, the assembler tracks defined and referenced symbols, and for ELF, it tracks symbol bindings (<code>STB_LOCAL, STB_GLOBAL, STB_WEAK</code>) from directives like <code>.globl</code>, <code>.weak</code>, or the rarely used <code>.local</code>.</p><h2 id="section-layout-phase">Section layout phase</h2><p>After parsing, the assembler arranges each section by assigning precise offsets to its fragments: instructions, data, or other directives (e.g., <code>.line</code>, <code>.uleb128</code>). It calculates sizes and adjusts for alignment. 
This phase finalizes symbol offsets (e.g., <code>start:</code> at offset 0x10) while leaving external ones for the linker.</p><p>This phase, which employs a fixed-point iteration, is quite complex. I won't go into details, but you might find <a href="/blog/2024-04-27-clang-o0-output-branch-displacement-and-size-increase">Clang's -O0 output: branch displacement and size increase</a> interesting.</p><h2 id="relocation-decision-phase">Relocation decision phase</h2><p>Then the assembler evaluates each fixup to determine if it can be resolved directly or requires a relocation entry. This process starts by attempting to convert fixups into relocatable expressions.</p><h3 id="evaluating-relocatable-expressions">Evaluating relocatable expressions</h3><p>In their most general form, relocatable expressions follow the pattern <code>relocation_specifier(sym_a - sym_b + offset)</code>, where</p><ul><li><code>relocation_specifier</code>: This may be absent. I will explain this concept later.</li><li><code>sym_a</code> is a symbol reference (the "addend")</li><li><code>sym_b</code> is an optional symbol reference (the "subtrahend")</li><li><code>offset</code> is a constant value</li></ul><p>Most common cases involve only <code>sym_a</code> or <code>offset</code> (e.g., <code>movl sym(%rip), %eax</code> or <code>movl $3, %eax</code>). Only a few target architectures support the subtrahend term (<code>sym_b</code>). 
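</p><p>On a target that does support the subtrahend, an unresolvable difference surfaces as a paired relocation. Here is a minimal sketch (assuming an llvm-mc build with the RISC-V backend; the output is abridged and illustrative):</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"># A linker-relaxable call sits between the two labels, so b-a cannot</span><br><span class="line"># be folded to a constant at assembly time.</span><br><span class="line">% echo -e &#x27;a: call f\nb:\n.quad b-a&#x27; | llvm-mc -triple=riscv64 -mattr=+relax -filetype=obj - | llvm-readobj -r -</span><br><span class="line">...</span><br><span class="line">  R_RISCV_ADD64 b 0x0</span><br><span class="line">  R_RISCV_SUB64 a 0x0</span><br></pre></td></tr></table></figure><p>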
Notable exceptions include AVR and RISC-V, as explored in <a href="/blog/2021-03-14-the-dark-side-of-riscv-linker-relaxation">The dark side of RISC-V linker relaxation</a>.</p><p>Attempting to use unsupported expression forms will result in assembly errors:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">% echo -e &#x27;movl a+b, %eax\nmovl a-b, %eax&#x27; | clang -c -xassembler -</span><br><span class="line">&lt;stdin&gt;:1:1: error: expected relocatable expression</span><br><span class="line">movl a+b, %eax</span><br><span class="line">^</span><br><span class="line">&lt;stdin&gt;:2:1: error: symbol &#x27;b&#x27; can not be undefined in a subtraction expression</span><br><span class="line">movl a-b, %eax</span><br><span class="line">^</span><br></pre></td></tr></table></figure><p>Let's use some notations from the AArch64 psABI.</p><ul><li><code>S</code> is the address of the symbol.</li><li><code>A</code> is the addend for the relocation.</li><li><code>P</code> is the address of the place being relocated (derived from <code>r_offset</code>).</li><li><code>GOT</code> is the address of the Global Offset Table, the table of code and data addresses to be resolved at dynamic link time.</li><li><code>GDAT(S+A)</code> represents a pointer-sized entry in the <code>GOT</code> for address <code>S+A</code>.</li></ul><h3 id="pc-relative-fixups">PC-relative fixups</h3><p>PC-relative fixups compute their values as <code>sym_a - current_location + offset</code> (<code>S - P + A</code>) and can be seen as a special case that uses <code>sym_b</code>. 
(I’ve skipped <code>- sym_b</code>, since no target I know permits a subtrahend here.)</p><p>When <code>sym_a</code> is a non-ifunc local symbol defined within the current section, these PC-relative fixups evaluate to constants. But if <code>sym_a</code> is a global or weak symbol in the same section, a relocation entry is generated. This ensures <a href="/blog/2021-05-16-elf-interposition-and-bsymbolic">ELF symbol interposition</a> stays in play.</p><p>In contrast, label differences (e.g. <code>.quad g-f</code>) can be resolved even if <code>f</code> and <code>g</code> are global.</p><p>On some targets (e.g., AArch64, PowerPC, RISC-V), the PC-relative offset is relative to the start of the instruction (P), while on others (e.g., AArch32, x86) it is relative to P plus a constant.</p><h3 id="resolution-outcomes">Resolution outcomes</h3><p>The assembler's evaluation of fixups leads to one of three outcomes:</p><ul><li>Error: When the expression isn't supported.</li><li>Resolved fixups: The assembler updates the relevant bits in the instruction directly. No relocation entry is needed.<ul><li>There are target-specific exceptions that make the fixup unresolved. In AArch64 <code>adrp x0, l0; l0:</code>, the immediate might be either 0 or 1, dependent on the instruction address. 
In RISC-V, linker relaxation might make fixups unresolved.</li></ul></li><li>Unresolved fixups: When the fixup evaluates to a relocatable expression but not a constant, the assembler<ul><li>Generates an appropriate relocation (offset, type, symbol, addend).</li><li>For targets that use RELA, usually zeros out the bits in the instruction field that will be modified by the linker.</li><li>For targets that use REL, leaves the addend in the instruction field.</li><li>If the referenced symbol is defined and local, and the relocation type is not in exceptions (gas <code>tc_fix_adjustable</code>), the relocation references the section symbol instead of the local symbol. See <a href="#section-symbol-adjustment">Section symbol adjustment</a> for details and caveats.</li></ul></li></ul><p>Fixup resolution depends on the fixup type:</p><ul><li>PC-relative fixups that describe the symbol itself (the relocation operation looks like <code>S - P + A</code>) resolve to a constant if <code>sym_a</code> is a non-ifunc local symbol defined in the current section.</li><li><code>relocation_specifier(S + A)</code> style fixups resolve when <code>S</code> refers to an absolute symbol.</li><li>Other fixups, including TLS and GOT related ones, remain unresolved.</li></ul><p>For ELF targets, if a non-TLS relocation operation references the symbol itself <code>S</code> (not <code>GDAT</code>), it may be adjusted to reference the section symbol instead (see below).</p><p>If you are interested in relocation representations in different object file formats, please check out my post <a href="/blog/2024-01-14-exploring-object-file-formats">Exploring object file formats</a>.</p><p>If an equated symbol <code>sym</code> is resolved relative to a section, relocations are generated against <code>sym</code>. 
Otherwise, if it resolves to a constant or an undefined symbol, relocations are generated against that constant or undefined symbol.</p><h3 id="section-symbol-adjustment">Section symbol adjustment</h3><p>When the assembler generates an unresolved fixup for a local symbol, it can convert the relocation to reference the section symbol (<code>STT_SECTION</code>) instead, folding the original symbol's offset within the section into the addend. This allows the original local symbol to be omitted from <code>.symtab</code>. The tradeoff is that the <code>STT_SECTION</code> symbol itself must be present, so the conversion saves <code>.symtab</code> entries only when a section has more than one local symbol referenced by relocations. This is common in practice:</p><ul><li>Text sections often contain labels for jump targets or C++ exception handling.</li><li>DWARF <code>.debug_*</code> sections contain labels referenced by other <code>.debug_*</code> sections.</li><li><code>SHF_STRINGS</code> sections (<code>.rodata.str1.1</code>, <code>.debug_str</code>, <code>.debug_line_str</code>) have a label for each string literal.</li></ul><p>Not all relocations are eligible for this conversion. PLT-generating and GOT-generating relocations, for example, may require dynamic relocations where the symbol identity is significant, so they must reference the original symbol. In GNU Assembler, the backend hook <code>tc_fix_adjustable</code> controls which relocation types are excluded from the conversion.</p><p>While TLS relocations could be adjusted, lld/ELF does not support TLS relocations against section symbols.</p><p>Relocations referencing symbols within <code>SHF_MERGE</code> sections also require extra care, because the linker may rearrange or deduplicate content within these sections. 
On most architectures, an absolute (<code>S + A</code>) or PC-relative (<code>S + A - P</code>) relocation pointing to a <code>SHF_MERGE</code> section can safely be converted when the addend is zero, since the relocation still refers to the exact start of a merge piece.</p><p>On x86-64, however, a PC-relative reference to a mergeable string can produce a non-zero addend. For example, <code>int foo(int b) &#123; return "abcdef"[b]; &#125;</code> compiles to:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">leaq    .LC0(%rip), %rax</span><br><span class="line"># R_X86_64_PC32          .LC0 - 4</span><br></pre></td></tr></table></figure><p>The <code>R_X86_64_PC32</code> relocation type uses the end of the instruction as the PC reference point, so the addend includes a -4 adjustment.</p><p>If this relocation were converted to reference the section symbol, the addend -4 would point into a different merge piece or before the section entirely. After the linker's merge section optimization, the byte at that offset may no longer correspond to the same byte in the original relocatable file. Therefore, GAS's x86-64 port disables the <code>STT_SECTION</code> conversion when the relocation references a <code>SHF_MERGE</code> section with a non-zero addend.</p><p>RISC-V applies the same rule for a related reason: linker relaxation can change distances between symbols, so an addend that appears safe at assembly time may become incorrect after relaxation. 
The current implementation keeps <code>.L.str</code> and <code>.L.str1</code> in <code>.symtab</code> regardless of <code>-mrelax</code>/<code>-mno-relax</code>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">.section        .rodata.str1.1,&quot;aMS&quot;,@progbits,1</span><br><span class="line">.L.str:</span><br><span class="line">.asciz  &quot;abcdef&quot;</span><br><span class="line"></span><br><span class="line">.section .rodata,&quot;a&quot;</span><br><span class="line">.L.str1:</span><br><span class="line">.asciz &quot;a&quot;</span><br><span class="line"></span><br><span class="line">.data</span><br><span class="line">.long .L.str    # can convert: addend would be 0</span><br><span class="line">.long .L.str1   # can convert: non-SHF_MERGE section</span><br></pre></td></tr></table></figure><p>Binutils feature request: <a href="https://sourceware.org/bugzilla/show_bug.cgi?id=33885" class="uri">https://sourceware.org/bugzilla/show_bug.cgi?id=33885</a></p><h3 id="fixup-overflow-check">Fixup overflow check</h3><p>For <code>.long x</code>, GAS accepts <code>x</code> if its value is in the range <code>(-2**32, 2**32)</code>. This design allows <code>.long x</code> to work regardless of signedness. When a symbol is involved, GAS supports both <code>.long sym-0xffffffff</code> and <code>.long sym+1</code>, as well as <code>.long sym+0xffffffff</code> and <code>.long sym-1</code>. 
However,<code>.long sym+0x100000000</code> is rejected in favor of<code>.long sym+0</code>.</p><p>The underlying check asks: "can this value be truncated to 32 bitswithout losing bit-pattern information?" The accepted range is the unionof:</p><ul><li><code>uint32_t</code> values: <code>[0, 2**32)</code></li><li><code>int32_t</code> values: <code>[-2**31, 2**31)</code></li><li>Negative values that fit in 33 bits:<code>(-2**32, -2**31)</code></li></ul><p>The union gives <code>(-2**32, 2**32)</code>.</p><p>Note: the union of just <code>int32_t</code> and<code>uint32_t</code> is <code>[-2**31, 2**32)</code>, which matches<code>checkIntUInt</code> in lld/ELF (<ahref="https://reviews.llvm.org/D63690"class="uri">https://reviews.llvm.org/D63690</a>).</p><h2 id="examples-in-action">Examples in action</h2><p><strong>Branches</strong></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">% echo -e &#x27;call fun\njmp fun&#x27; | clang -c -xassembler - -o - | fob -dr -</span><br><span class="line">...</span><br><span class="line">       0: e8 00 00 00 00                callq   0x5 &lt;.text+0x5&gt;</span><br><span class="line">                0000000000000001:  R_X86_64_PLT32       fun-0x4</span><br><span class="line">       5: e9 00 00 00 00                jmp     0xa &lt;.text+0xa&gt;</span><br><span class="line">                0000000000000006:  R_X86_64_PLT32       fun-0x4</span><br><span class="line">% echo -e &#x27;bl fun\nb fun&#x27; | clang --target=aarch64 -c -xassembler - -o - | fob -dr -</span><br><span 
class="line">...</span><br><span class="line">       0: 94000000      bl      0x0 &lt;.text&gt;</span><br><span class="line">                0000000000000000:  R_AARCH64_CALL26     fun</span><br><span class="line">       4: 14000000      b       0x4 &lt;.text+0x4&gt;</span><br><span class="line">                0000000000000004:  R_AARCH64_JUMP26     fun</span><br></pre></td></tr></table></figure><p><strong>Absolute and PC-relative symbol references</strong></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">% echo -e &#x27;movl a, %eax\nmovl a(%rip), %eax&#x27; | clang -c -xassembler - -o - | llvm-objdump -dr -</span><br><span class="line">...</span><br><span class="line">       0: 8b 04 25 00 00 00 00          movl    0x0, %eax</span><br><span class="line">                0000000000000003:  R_X86_64_32S a</span><br><span class="line">       7: 8b 05 00 00 00 00             movl    (%rip), %eax            # 0xd &lt;.text+0xd&gt;</span><br><span class="line">                0000000000000009:  R_X86_64_PC32        a-0x4</span><br></pre></td></tr></table></figure><p><code>(a-.)(%rip)</code> would probably be more semantically correctbut is not adopted by GNU Assembler.</p><h2 id="relocation-specifiers">Relocation specifiers</h2><p>Relocation specifiers guide the assembler on how to resolve andencode expressions into instructions. 
They specify details like:</p><ul><li>Whether to reference the symbol itself, its Procedure Linkage Table (PLT) entry, or its Global Offset Table (GOT) entry.</li><li>Which part of a symbol's address to use (e.g., lower or upper bits).</li><li>Whether to use an absolute address or a PC-relative one.</li></ul><p>This concept appears across various architectures but with inconsistent terminology. The Arm architecture refers to elements like <code>:lo12:</code> and <code>:lower16:</code> as "relocation specifiers". IBM's AIX documentation also uses this term. Many GNU Binutils target documents simply call these "modifiers", while AVR documentation uses "relocatable expression modifiers".</p><p>Picking the right term was tricky. "Relocatable expression modifier" nails the idea of tweaking relocatable expressions but feels overly verbose. "Relocation modifier", though concise, suggests adjustments happen during the linker's relocation step rather than the assembler's expression evaluation. I landed on "relocation specifier" as the winner. It's clear, aligns with Arm and IBM’s usage, and fits the assembler's role seamlessly.</p><p>For example, RISC-V <code>addi</code> can be used with either an absolute address or a PC-relative address. Relocation specifiers <code>%lo</code> and <code>%pcrel_lo</code> could differentiate the two uses. 
Similarly, <code>%hi</code>, <code>%pcrel_hi</code>, and <code>%got_pcrel_hi</code> could differentiate the uses of <code>lui</code> and <code>auipc</code>.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"># Position-dependent code (PDC) - absolute addressing</span><br><span class="line">lui     a0, %hi(var)                    # Load upper immediate with high bits of symbol address</span><br><span class="line">addi    a0, a0, %lo(var)                # Add lower 12 bits of symbol address</span><br><span class="line"></span><br><span class="line"># Position-independent code (PIC) - PC-relative addressing</span><br><span class="line">auipc   a0, %pcrel_hi(var)              # Add upper PC-relative offset to PC</span><br><span class="line">addi    a0, a0, %pcrel_lo(.Lpcrel_hi1)  # Add lower 12 bits of PC-relative offset</span><br><span class="line"></span><br><span class="line"># Position-independent code via Global Offset Table (GOT)</span><br><span class="line">auipc   a0, %got_pcrel_hi(var)          # Calculate address of GOT entry relative to PC</span><br><span class="line">ld      a0, %pcrel_lo(.Lpcrel_hi1)(a0)  # Load var&#x27;s address from GOT</span><br></pre></td></tr></table></figure><p>Why use <code>%hi</code> with <code>lui</code> if it's always paired? It's about clarity and explicitness. <code>%hi</code> ensures consistency with <code>%lo</code> and cleanly distinguishes it from <code>%pcrel_hi</code>. 
Since both <code>lui</code> and <code>auipc</code> share the U-type instruction format, tying relocation specifiers to formats rather than specific instructions is a smart, flexible design choice.</p><h2 id="relocation-specifier-flavors">Relocation specifier flavors</h2><p>Assemblers use various syntaxes for relocation specifiers, reflecting architectural quirks and historical conventions. Below, we explore the main flavors, their usage across architectures, and some of their peculiarities.</p><p><strong><code>expr@specifier</code></strong></p><p>This is likely the most widespread syntax, adopted by many binutils targets, including ARC, C-SKY, Power, M68K, SuperH, SystemZ, and x86, among others. It's also used in Mach-O object files, e.g., <code>adrp x8, _bar@GOTPAGE</code>.</p><p>This suffix style puts the specifier after an <code>@</code>. It's intuitive—think <code>sym@got</code>. In PowerPC, operators can get elaborate, such as <code>sym@toc@l(9)</code>. Here, <code>@toc@l</code> is a single, indivisible operator, not two separate <code>@</code> pieces, indicating a TOC-relative reference with a low 16-bit extraction.</p><p>Parsing is loose: while both <code>expr@specifier+expr</code> and <code>expr+expr@specifier</code> are accepted (by many targets), conceptually it's just <code>specifier(expr+expr)</code>. For example, x86 accepts <code>sym@got+4</code> or <code>sym+4@got</code>, but don't misread—<code>@got</code> applies to <code>sym+4</code>, not just <code>sym</code>.</p><p><strong><code>%specifier(expr)</code></strong></p><p>MIPS, SPARC, RISC-V, and LoongArch favor this prefix style, wrapping the expression in parentheses for clarity. 
In MIPS, parentheses are optional, and operators can nest:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"># MIPS</span><br><span class="line">addiu   $2, $2, %lo(0x12345)</span><br><span class="line">addiu   $2, $2, %lo 0x12345</span><br><span class="line">lui     $1, %hi(%neg(%gp_rel(main)))</span><br><span class="line">ld      $1, %got_page($.str)($gp)</span><br></pre></td></tr></table></figure><p>Like <code>expr@specifier</code>, the specifier applies to the whole expression. Don't misinterpret <code>%lo(3)+sym</code>: it resolves as <code>sym+3</code> with an <code>R_MIPS_LO16</code> relocation.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"># MIPS</span><br><span class="line">addiu   $2, $2, %lo(3)+sym  # R_MIPS_LO16  sym+0x3</span><br><span class="line">addiu   $2, $2, %lo 3+sym   # R_MIPS_LO16  sym+0x3</span><br></pre></td></tr></table></figure><p>SPARC has an anti-pattern. 
Its <code>%lo</code> and <code>%hi</code> expand to different relocation types depending on whether gas's <code>-KPIC</code> option (<code>llvm-mc -position-independent</code>) is specified.</p><p><strong><code>expr(specifier)</code></strong></p><p>A simpler suffix style, this is used by AArch32 for data directives. It's less common but straightforward, placing the operator in parentheses after the expression.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">.word sym(gotoff)</span><br><span class="line">.long f(FUNCDESC)</span><br><span class="line"></span><br><span class="line">.long f(got)+3    // allowed by GNU assembler and LLVM integrated assembler, but probably not used in the wild</span><br></pre></td></tr></table></figure><p><strong><code>:specifier:expr</code></strong></p><p>AArch32 and AArch64 adopt this colon-framed prefix notation, avoiding the confusion that parentheses might introduce.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">// AArch32</span><br><span class="line">movw    r0, :lower16:x</span><br><span class="line"></span><br><span class="line">// AArch64</span><br><span class="line">add     x8, x8, :lo12:sym</span><br><span class="line"></span><br><span class="line">adrp    x0, :got:var</span><br><span class="line">ldr     x0, [x0, :got_lo12:var]</span><br></pre></td></tr></table></figure><p>Applying this syntax to data directives or instructions' first operands, however, could create parsing ambiguity. 
In both GNU Assembler and LLVM, <code>.word :plt:fun</code> would be interpreted as <code>.word: plt: fun</code>, treating <code>.word</code> and <code>plt</code> as labels, rather than achieving the intended meaning.</p><p>One idea is to use <code>#</code> for disambiguation:<figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">.word #:gotpcrel:var</span><br></pre></td></tr></table></figure></p><p><strong>Recommendation</strong></p><p>For new architectures, I'd suggest adopting <code>%specifier(expr)</code> and avoiding <code>@specifier</code>. The <code>%</code> symbol works seamlessly with data directives, and during operand parsing, the parser can simply peek at the first token to check for a relocation specifier.</p><p>I favor <code>%specifier(expr)</code> over <code>%specifier expr</code> because it provides clearer scoping, especially in data directives with multiple operands, such as <code>.long %lo(a), %lo(b)</code>.</p><p>( <code>%specifier(...)</code> resembles <code>%</code> expansion in GNU Assembler's altmacro mode. <figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">.altmacro</span><br><span class="line">.macro m arg; .long \arg; .endm</span><br><span class="line">.data; m %(1+2)</span><br></pre></td></tr></table></figure> )</p><p><strong>Inelegance</strong></p><p>RISC-V favors <code>%specifier(expr)</code> but clings to <code>call sym@plt</code> for <a href="https://github.com/riscv-non-isa/riscv-elf-psabi-doc/issues/98">legacy reasons</a>.</p><p>AArch64 uses <code>:specifier:expr</code>, yet PAuth ABI (<code>.quad (g + 7)@AUTH(ia,0)</code>) cannot use <code>:</code> after data directives due to parsing ambiguity. 
<code>R_AARCH64_PLT32</code>,<code>R_AARCH64_GOTPCREL32</code>, and <code>R_AARCH64_FUNCINIT</code>were fixed in <ahref="https://github.com/llvm/llvm-project/pull/155776">llvm/llvm-project#155776</a>to use <code>%pltpcrel(foo)</code> and <code>%gotpcrel(foo)</code>instead of the unofficial <code>foo@plt - .</code> /<code>foo@gotpcrel</code> forms.</p><h2 id="tls-symbols">TLS symbols</h2><p>When a symbol is defined in a section with the <code>SHF_TLS</code>flag (Thread-Local Storage), GNU assembler assigns it the type<code>STT_TLS</code> in the symbol table. For undefined TLS symbols, theprocess differs: GCC and Clang don’t emit explicit labels. Instead,assemblers identify these symbols through TLS-specific relocationspecifiers in the code, deduce their thread-local nature, and set theirtype to <code>STT_TLS</code> accordingly.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">// AArch64</span><br><span class="line">add     x8, x8, :tprel_hi12:tls</span><br><span class="line"></span><br><span class="line">// x86</span><br><span class="line">movl    %fs:tls@TPOFF, %eax</span><br></pre></td></tr></table></figure><h2 id="composed-relocations">Composed relocations</h2><p>Most instructions trigger zero or one relocation, but some generatetwo. Often, one acts as a marker, paired with a standard relocation. Forexample:</p><ul><li>PPC64 <ahref="/blog/2021-02-14-all-about-thread-local-storage#:~:text=R_PPC64_TLSGD"><code>bl __tls_get_addr(x@tlsgd)</code></a>pairs a marker <code>R_PPC64_TLSGD</code> with<code>R_PPC64_REL24</code></li><li>PPC64's link-time GOT-indirect to PC-relative optimization (withPower10's prefixed instruction) generates a<code>R_PPC64_PCREL_OPT</code> relocation following a GOT relocation. 
<ahref="https://reviews.llvm.org/D79864"class="uri">https://reviews.llvm.org/D79864</a></li><li>RISC-V linker relaxation uses <code>R_RISCV_RELAX</code> alongsideanother relocation, and<code>R_RISCV_ADD*</code>/<code>R_RISCV_SUB*</code> pairs.</li><li>Mach-O scattered relocations for label differences.</li><li>XCOFF represents a label difference with a pair of <ahref="https://reviews.llvm.org/D77424"><code>R_POS</code> and<code>R_NEG</code> relocations</a>.</li></ul><p>These marker cases tie into "composed relocations", as outlined inthe Generic ABI:</p><blockquote><p>If multiple consecutive relocation records are applied to the samerelocation location (<code>r_offset</code>), they are composed insteadof being applied independently, as described above. By consecutive, wemean that the relocation records are contiguous within a singlerelocation section. By composed, we mean that the standard applicationdescribed above is modified as follows:</p><ul><li><p>In all but the last relocation operation of a composed sequence,the result of the relocation expression is retained, rather than havingpart extracted and placed in the relocated field. The result is retainedat full pointer precision of the applicable ABI processorsupplement.</p></li><li><p>In all but the first relocation operation of a composed sequence,the addend used is the retained result of the previous relocationoperation, rather than that implied by the relocation type.</p></li></ul><p>Note that a consequence of the above rules is that the locationspecified by a relocation type is relevant for the first element of acomposed sequence (and then only for relocation records that do notcontain an explicit addend field) and for the last element, where thelocation determines where the relocated value will be placed. 
For all other relocation operands in a composed sequence, the location specified is ignored.</p><p>An ABI processor supplement may specify individual relocation types that always stop a composition sequence, or always start a new one.</p></blockquote><h2 id="implicit-addends">Implicit addends</h2><p>ELF <code>SHT_REL</code> and Mach-O utilize implicit addends. TODO</p><ul><li><code>R_MIPS_HI16</code> (https://reviews.llvm.org/D101773)</li></ul><h2 id="gnu-assembler-internals">GNU Assembler internals</h2><p>GNU Assembler utilizes <code>struct fix</code> to represent both the fixup and the relocatable expression.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">fix</span> &#123;</span></span><br><span class="line">  ...</span><br><span class="line">  <span class="comment">/* NULL or Symbol whose value we add in.  */</span></span><br><span class="line">  symbolS *fx_addsy;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* NULL or Symbol whose value we subtract.  */</span></span><br><span class="line">  symbolS *fx_subsy;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* Absolute number we add in.  */</span></span><br><span class="line">  valueT fx_offset;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>The relocation specifier is part of the instruction instead of part of <code>struct fix</code>. 
Targets have different internalrepresentations of instructions.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// gas/config/tc-aarch64.c</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">reloc</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">  bfd_reloc_code_real_type type;</span><br><span class="line">  expressionS <span class="built_in">exp</span>;</span><br><span class="line">  <span class="type">int</span> pc_rel;</span><br><span class="line">  <span class="class"><span class="keyword">enum</span> <span class="title">aarch64_opnd</span> <span class="title">opnd</span>;</span></span><br><span class="line">  <span class="type">uint32_t</span> flags;</span><br><span class="line">  <span class="type">unsigned</span> need_libopcodes_p : <span class="number">1</span>;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> 
<span class="title">aarch64_instruction</span></span></span><br><span class="line"><span class="class">&#123;</span></span><br><span class="line">  aarch64_inst base;</span><br><span class="line">  aarch64_operand_error parsing_error;</span><br><span class="line">  <span class="type">int</span> cond;</span><br><span class="line">  <span class="class"><span class="keyword">struct</span> <span class="title">reloc</span> <span class="title">reloc</span>;</span></span><br><span class="line">  <span class="type">unsigned</span> gen_lit_pool : <span class="number">1</span>;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="comment">// gas/config/tc-ppc.c</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ppc_fixup</span></span></span><br><span class="line"><span class="class"> &#123;</span></span><br><span class="line">   expressionS <span class="built_in">exp</span>;</span><br><span class="line">   <span class="type">int</span> opindex;</span><br><span class="line">   bfd_reloc_code_real_type reloc;</span><br><span class="line"> &#125;;</span><br></pre></td></tr></table></figure><p>The 2002 message <ahref="https://sourceware.org/pipermail/binutils/2002-August/021813.html">stageone of gas reloc rewrite</a> describes the passes.</p><p>In PPC, the result of <code>@l</code> and <code>@ha</code> can beeither signed or unsigned, determined by the instruction opcode.</p><p>In <code>md_apply_fix</code>, TLS-related relocation specifiers call<code>S_SET_THREAD_LOCAL (fixP-&gt;fx_addsy);</code>.</p><h2 id="llvm-internals">LLVM internals</h2><p>LLVM integrated assembler encodes fixups and relocatable expressionsseparately.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">MCFixup</span> &#123;</span><br><span class="line">  <span class="comment">/// The value to put into the fixup location. The exact interpretation of the</span></span><br><span class="line">  <span class="comment">/// expression is target dependent, usually it will be one of the operands to</span></span><br><span class="line">  <span class="comment">/// an instruction or an assembler directive.</span></span><br><span class="line">  <span class="type">const</span> MCExpr *Value = <span class="literal">nullptr</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/// The byte index of start of the relocation inside the MCFragment.</span></span><br><span class="line">  <span class="type">uint32_t</span> Offset = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/// The target dependent kind of fixup item this is. 
The kind is used to</span></span><br><span class="line">  <span class="comment">/// determine how the operand value should be encoded into the instruction.</span></span><br><span class="line">  MCFixupKind Kind = FK_NONE;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/// The source location which gave rise to the fixup, if any.</span></span><br><span class="line">  SMLoc Loc;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>LLVM encodes relocatable expressions as <code>MCValue</code>,<figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">class MCValue &#123;</span><br><span class="line">  const MCSymbol *SymA = nullptr, *SymB = nullptr;</span><br><span class="line">  int64_t Cst = 0;</span><br><span class="line">  uint32_t Specifier = 0;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure></p><p>with:</p><ul><li><code>Specifier</code> as an optional relocation specifier (named<code>RefKind</code> before LLVM 21)</li><li><code>SymA</code> as an optional symbol reference (addend)</li><li><code>SymB</code> as an optional symbol reference (subtrahend)</li><li><code>Cst</code> as a constant value</li></ul><p>This mirrors the relocatable expression concept, but<code>Specifier</code>—<ahref="https://github.com/llvm/llvm-project/commit/0999cbd0b9ed8aa893cce10d681dec6d54b200ad">addedin 2014 for AArch64 as <code>RefKind</code></a>—remains rare amongtargets. (I've recently made some cleanup to some targets. 
For instance, I migrated PowerPC's <a href="https://github.com/llvm/llvm-project/commit/89812985358784b16fb66928ad4da411386f4720"><code>@l</code> and <code>@ha</code> folding to use <code>Specifier</code></a>.)</p><p>AArch64 implements a clean approach to select the relocation type. It dispatches on the fixup kind (an operand within a specific instruction format), then refines it with the relocation specifier.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// AArch64ELFObjectWriter::getRelocType</span></span><br><span class="line"><span class="type">unsigned</span> Kind = Fixup.<span class="built_in">getTargetKind</span>();</span><br><span class="line"><span class="keyword">switch</span> (Kind) &#123;</span><br><span class="line"><span class="comment">// Handle generic MCFixupKind.</span></span><br><span class="line"><span class="keyword">case</span> FK_Data_1:</span><br><span class="line"><span class="keyword">case</span> FK_Data_2:</span><br><span class="line">  ...</span><br><span class="line"></span><br><span class="line"><span class="comment">// Handle target-specific MCFixupKind.</span></span><br><span class="line"><span class="keyword">case</span> AArch64::fixup_aarch64_add_imm12:</span><br><span class="line">  <span class="keyword">if</span> (RefKind == AArch64::S_DTPREL_HI12)</span><br><span class="line">    
<span class="keyword">return</span> <span class="built_in">R_CLS</span>(TLSLD_ADD_DTPREL_HI12);</span><br><span class="line">  <span class="keyword">if</span> (RefKind == AArch64::S_TPREL_HI12)</span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">R_CLS</span>(TLSLE_ADD_TPREL_HI12);</span><br><span class="line">  ...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><code>MCAssembler::evaluateFixup</code> and<code>ELFObjectWriter::recordRelocation</code> record a relocation.</p><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// MCAssembler::evaluateFixup</span></span><br><span class="line">Evaluate `<span class="type">const</span> MCExpr *Fixup::Value` to a relocatable expression.</span><br><span class="line">Determine the fixup value. 
Adjust the value <span class="keyword">if</span> FKF_IsPCRel.</span><br><span class="line">If the relocatable expression is a constant, treat <span class="keyword">this</span> fixup as resolved.</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (IsResolved &amp;&amp; is_reloc_directive)</span><br><span class="line">  IsResolved = <span class="literal">false</span>;</span><br><span class="line">Backend.<span class="built_in">applyFixup</span>(...)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment">// applyFixup</span></span><br><span class="line"><span class="keyword">if</span> (...)</span><br><span class="line">  IsResolved = <span class="literal">false</span>;</span><br><span class="line"><span class="keyword">if</span> (!IsResolved) &#123;</span><br><span class="line">  <span class="comment">// For exposition I&#x27;ve inlined ELFObjectWriter::recordRelocation here.</span></span><br><span class="line">  <span class="comment">// the function roughly maps to GNU Assembler&#x27;s `md_apply_fix` and `tc_gen_reloc`,</span></span><br><span class="line">  Type = TargetObjectWriter-&gt;<span class="built_in">getRelocType</span>(Ctx, Target, Fixup, IsPCRel)</span><br><span class="line">  Determine whether SymA can be converted to a section symbol.</span><br><span class="line">  Relocations.<span class="built_in">push_back</span>(...)</span><br><span class="line">&#125;</span><br><span class="line"><span class="comment">// Write a value to the relocated location. 
When using relocations with explicit addends, the function is a no-op when `IsResolved` is true.</span></span><br></pre></td></tr></table></figure><p><code>FKF_IsPCRel</code> applies to fixups whose relocationoperations look like <code>S - P + A</code>, like branches andPC-relative operations, but not to GOT-related operations (e.g.,<code>GDAT - P + A</code>).</p><h3 id="mcsymbolrefexpr-issues"><code>MCSymbolRefExpr</code> issues</h3><p>The expression structure follows a traditional object-orientedhierarchy:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">MCExpr</span><br><span class="line">  MCConstantExpr: Value</span><br><span class="line">  MCSymbolRefExpr: VariantKind, Symbol</span><br><span class="line">  MCUnaryExpr: Op, Expr</span><br><span class="line">  MCBinaryExpr: Op, LHS, RHS</span><br><span class="line">  MCTargetExpr:</span><br><span class="line">    X86MCExpr: x86 register</span><br><span class="line">  MCSpecifierExpr: expression with a relocation specifier</span><br></pre></td></tr></table></figure><p><code>MCSymbolRefExpr::VariantKind</code> enums the relocationspecifier, but it's a poor fit:</p><ul><li>Other expressions, like <code>MCConstantExpr</code> (e.g., PPC<code>4@l</code>) and <code>MCBinaryExpr</code> (e.g., PPC<code>(a+1)@l</code>), also need it.</li><li>Semantics blur when folding expressions with <code>@</code>, whichis unavoidable when <code>@</code> can occur at any position within thefull expression.</li><li>The generic <code>MCSymbolRefExpr</code> lacks target-specifichooks, cluttering the interface with any target-specific logic.</li></ul><p>Consider what happens with addition or subtraction:</p><figure 
class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">MCBinaryExpr</span><br><span class="line">  LHS(MCSymbolRefExpr): VariantKind, SymA</span><br><span class="line">  RHS(MCSymbolRefExpr): SymB</span><br></pre></td></tr></table></figure><p>Here, the specifier attaches only to the LHS, leaving the full result uncovered. This awkward design demands workarounds.</p><ul><li>Parsing <code>a+4@got</code> exposes clumsiness. After <code>AsmParser::parseExpression</code> processes <code>a+4</code>, it detects <code>@got</code> and retrofits it onto <code>MCSymbolRefExpr(a)</code>, which feels hacked together.</li><li>PowerPC's <code>@l</code> and <code>@ha</code> optimization needs <code>PPCAsmParser::extractSpecifier</code> and <code>PPCAsmParser::applySpecifier</code> to convert an <code>MCSymbolRefExpr</code> to an <code>MCSpecifierExpr</code>.</li></ul><p>Worse, the abstraction leaks: <code>MCSymbolRefExpr</code> is accessed widely in backend code, and while <code>MCBinaryExpr</code> with a constant RHS mimics <code>MCSymbolRefExpr</code> semantically, such code often handles only the latter.</p><h3 id="mcfixup-should-store-mcvalue-instead-of-mcexpr"><code>MCFixup</code> should store <code>MCValue</code> instead of <code>MCExpr</code></h3><p>The <code>const MCExpr *MCFixup::getValue()</code> method feels inconvenient and less elegant compared to GNU Assembler's unified fixup/relocatable expression, for these reasons:</p><ul><li>A relocation specifier can be encoded by every sub-expression in the <code>MCExpr</code> tree, rather than by the fixup itself (or the instruction, as in GNU Assembler). Supporting all of <code>a+4@got, a@got+4, (a+4)@got</code> requires extensive hacks in LLVM MCParser.</li><li><code>evaluateAsRelocatable</code> converts an MCExpr to an MCValue without updating the MCExpr itself. 
This leads to redundant evaluations, as <code>MCAssembler::evaluateFixup</code> is called multiple times, such as in <code>MCAssembler::fixupNeedsRelaxation</code> and <code>MCAssembler::layout</code>.</li></ul><p>Storing an MCValue directly in MCFixup, or adding a relocation specifier member, could eliminate the need for many target-specific <code>MCTargetFixup</code> classes that manage relocation specifiers. However, target-specific evaluation hooks would still be needed for specifiers like PowerPC <code>@l</code> or RISC-V <code>%lo()</code>.</p><p>Computing label differences would also be simplified, as we could utilize <code>SymA</code> and <code>SymB</code>.</p><p>Our long-term goal is to encode the relocation specifier within <code>MCFixup</code>. (<a href="https://github.com/llvm/llvm-project/issues/135592" class="uri">https://github.com/llvm/llvm-project/issues/135592</a>)</p><p><code>MCSymbolRefExpr::VariantKind</code>, the legacy way to encode relocations, should be completely removed (probably in a distant future, as many cleanups are required).</p><h3 id="asmparser-exprspecifier">AsmParser: <code>expr@specifier</code></h3><p>In LLVM's assembly parser library (LLVMMCParser), the parsing of <code>expr@specifier</code> was supported for all targets until I updated it to be <a href="https://github.com/llvm/llvm-project/commit/a0671758eb6e52a758bd1b096a9b421eec60204c">an opt-in feature</a> in March 2025.</p><p>AsmParser's <code>@specifier</code> parsing is suboptimal, necessitating lexer workarounds.</p><p>The <code>@</code> symbol can appear after a symbol or an expression (via <code>parseExpression</code>) and may occur multiple times within a single operand, making it challenging to validate and reject invalid cases.</p><p>In the GNU Assembler, COFF targets permit <code>@</code> within identifier names, and MinGW supports constructs like <code>.long ext24@secrel32</code>. 
It appears that a recognized suffix is treated as a specifier, while an unrecognized suffix results in a symbol that includes the <code>@</code>.</p><p>The PowerPC AsmParser (<code>llvm/lib/Target/PowerPC/AsmParser/PPCAsmParser.cpp</code>) parses an operand and then calls <code>PPCAsmParser::extractSpecifier</code> to extract the optional <code>@</code> specifier. When the <code>@</code> specifier is detected and removed, it generates a <code>PPCMCExpr</code>. This functionality is currently implemented for <code>@l</code> and <code>@ha</code>, and it would be beneficial to extend it to include all specifiers.</p><h3 id="asmprinter">AsmPrinter</h3><p>In <code>llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp</code>, <code>AsmPrinter::lowerConstant</code> outlines how LLVM handles the emission of a global variable initializer. When processing <code>ConstantExpr</code> elements, this function may generate data directives in the assembly code that involve differences between symbols.</p><p>One significant use case for this intricate code is <code>clang++ -fexperimental-relative-c++-abi-vtables</code>. This feature produces a PC-relative relocation that points to either the PLT (Procedure Linkage Table) entry of a function or the function symbol directly.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;This post explores how GNU Assembler and LLVM integrated assembler
generate relocations, an important step to generate a relocatable file.
Relocations identify parts of instructions or data that cannot be fully
determined during assembly because they depend on the final memory
layout, which is only established at link time or load time. These are
essentially placeholders that will be filled in (typically with absolute
addresses or PC-relative offsets) during the linking process.&lt;/p&gt;
&lt;h2 id=&quot;relocation-generation-the-basics&quot;&gt;Relocation generation: the
basics&lt;/h2&gt;
&lt;p&gt;Symbol references are the primary candidates for relocations. For
instance, in the x86-64 instruction &lt;code&gt;movl sym(%rip), %eax&lt;/code&gt;
(GNU syntax), the assembler calculates the displacement between the
program counter (PC) and &lt;code&gt;sym&lt;/code&gt;. This distance affects the
instruction&#39;s encoding and typically triggers a
&lt;code&gt;R_X86_64_PC32&lt;/code&gt; relocation, unless &lt;code&gt;sym&lt;/code&gt; is a
local symbol defined within the current section.&lt;/p&gt;
&lt;p&gt;Both the GNU assembler and LLVM integrated assembler utilize multiple
passes during assembly, with several key phases relevant to relocation
generation:&lt;/p&gt;</summary>
    
    
    
    
    <category term="llvm" scheme="https://maskray.me/blog/tags/llvm/"/>
    
    <category term="binutils" scheme="https://maskray.me/blog/tags/binutils/"/>
    
    <category term="assembler" scheme="https://maskray.me/blog/tags/assembler/"/>
    
  </entry>
  
</feed>
