A deep dive into SmallVector::push_back

SmallVector is LLVM's most-used container, and push_back its hot operation. For the trivially-copyable specialization the fast path should be fast.

1
2
3
#include <llvm/ADT/SmallVector.h>

void f(llvm::SmallVectorImpl<int> &v, int x) { v.push_back(x); }

clang -S --target=x86_64 -O2 -DNDEBUG a.cc generates:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
push   rbp                 # callee-saved spills + a stack realignment,
push rbx # all on the fast path
push rax
mov eax, [rdi + 8] # size
cmp eax, [rdi + 12] # vs capacity
jae .Lgrow
.Lstore: # reached from the fast path AND from .Lgrow
mov rcx, [rdi]
mov [rcx + rax*4], esi
inc dword ptr [rdi + 8]
add rsp, 8
pop rbx
pop rbp
ret
.Lgrow:
mov rbx, rdi # keep `this`/`x` alive across the call
mov ebp, esi
call SmallVectorBase<unsigned>::grow_pod
...
jmp .Lstore

push_back reserves capacity and then stores, so the store at .Lstore is shared between the no-grow and post-grow paths. On the grow path this and x must survive the grow_pod call, which means they are saved in callee-saved registers, leading to push rbx/push rbp in the prologue. push rbp is needed to maintain the 16-byte alignment of the stack frame.

GCC's output is also inefficient:

1
2
3
4
5
push   rbp ; mov ebp, esi    # x -> rbp, in the entry block
push rbx ; mov rbx, rdi # this -> rbx
... ; cmp ; jnb .Lslow
.Lmerge: # reached by both paths, reads rbx/rbp
mov rdx, [rbx] ; mov [rdx+rax*4], ebp ; ...

Shrink wrapping can't remove it

Shrink wrapping relocates the save/restore of callee-saved registers; it never duplicates a block. To carry this/x across the conditional grow_pod call into a store the fast path also reaches, a callee-saved register must be live from entry. clang -mllvm -debug-only=shrink-wrap reports No Shrink wrap candidate found. GCC's -fshrink-wrap-separate (on at -O2) does not optimize this as well.

The transformation that would help is tail duplication — give the slow path its own copy of the store so the fast path keeps this/x in their argument registers. Neither compiler does it here, and it is not shrink-wrapping's job.

The fix: tail calling the slow path

Move the grow-and-store out of line and tail calls it:

1
2
3
4
5
6
7
8
9
10
11
12
13
LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {
T Tmp = Elt; // in case Elt aliases storage that grow() invalidates
this->grow(this->size() + 1);
std::memcpy(reinterpret_cast<void *>(this->end()), &Tmp, sizeof(T));
this->set_size(this->size() + 1);
}

void push_back(ValueParamT Elt) {
if (LLVM_UNLIKELY(this->size() >= this->capacity()))
return growAndPushBack(Elt);
std::memcpy(reinterpret_cast<void *>(this->end()), &Elt, sizeof(T));
this->set_size(this->size() + 1);
}
1
2
3
4
5
6
7
mov    eax, [rdi + 8]
cmp eax, [rdi + 12]
jae growAndPushBack # TAILCALL
mov rcx, [rdi]
mov [rcx + rax*4], esi
inc dword ptr [rdi + 8]
ret

7 instructions instead of 14, no callee-saved registers, nothing to shrink-wrap.

growAndPushBack is an vague linkage function that is emitted in a separate section using COMDAT, costing some .o file size.

Two details:

  • noinline is load-bearing. Without it Clang and GCC may sometimes inline the helper back and the prologue returns.
  • T Tmp = Elt handles Elt referencing the vector's own storage. It is elided for small by-value types and, for large by-const& types, is smaller than reusing the old pointer-relocation machinery.
1
2
3
4
5
6
7
#include <llvm/ADT/SmallVector.h>
// noinline growAndPushBack is load-bearing for both Clang and GCC.
void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImpl<int> &v) {
for (unsigned l = 0; l < n; l += 2)
for (unsigned i = 0; i < 2; ++i)
v.push_back(i);
}

Results

lld .text shrinks 40,512 bytes; by-const& element types win most, e.g. GotSection::addConstant goes 167 → 45 bytes. On the LLVM compile-time tracker the clang build is 0.41–0.51% fewer instructions:u across every configuration, for +0.13% binary size.

Sorted by relative size, a few outliers grow ~13.8% — the constexpr ByteCode interpreter (Interp.cpp, EvalEmitter.cpp). A smaller push_back likely perturbs the bottom-up inliner's near-threshold decisions.

Aside: "approximately trivially copyable"

1
2
3
std::is_trivially_copy_constructible<T> &&
std::is_trivially_move_constructible<T> &&
std::is_trivially_destructible<T>

is the predicate that selects the SmallVectorTemplateBase<..., true> specialization, where copy/move construction optimizes to memcpy and destroy_range is a no-op.

The condition is deliberately broader than is_trivially_copyable, which also demands trivial assignment. The motivating case is std::pair<int,int>: its constructors are trivial, but its assignment is user-provided (to support pair<T&,U&>), so is_trivially_copyable is false. SmallVector only ever copies or moves elements by construction into uninitialized storage (memcpy), never by assignment, so the distinction is unobservable and memcpy is sound — and std::pair<POD,POD> stays on the fast path.

The condition is also stronger than trivial relocatability, whose operational definition is just is_trivially_move_constructible && is_trivially_destructible. The extra is_trivially_copy_constructible is there because SmallVector also calls memcpys when copying a live element — push_back(const T&), or copy-constructing from another vector.

Aside: a smaller header than std::vector

std::vector stores three 8-byte members. SmallVector stores a begin pointer plus a 32-bit size and a 32-bit capacity when sizeof(T) >= 4.

1
2
3
4
template <class Size_T> class SmallVectorBase {
void *BeginX;
Size_T Size, Capacity;
};

So for int, pointers, and most structs the header is 16 bytes — 8 fewer than std::vector. The cost of carrying a size instead of an end pointer is that addressing the end (begin + size * sizeof(T)) needs a multiply, visible when sizeof(T) is not a power of two.

std::vector::push_back is slow in both libc++ and libstdc++

Neither standard vector has a frame-free push_back fast path.

libc++'s push_back forwards to emplace_back, which routes the grow decision through std::__if_likely_else(cond, fast, slow). The slow path is kept out of line, but as a by-reference lambda, so its closure {&__end_, &__x, this} is materialized on the stack and the trailing this->__end_ = __end is a merge. The fast path therefore spills the closure and runs with a 48-byte frame:

1
2
3
4
5
6
7
8
9
push   rbx
sub rsp, 48
mov [rsp+12], esi # spill x
mov rax, [rdi+8] # __end
mov [rsp+16], rax
lea rcx, [rsp+16] ; mov [rsp+24], rcx # } closure {&__end,
lea rcx, [rsp+12] ; mov [rsp+32], rcx # } &x, this}, built
mov [rsp+40], rdi # } on the fast path
jae .Lslow # else: store x; this->__end_ = __end

libstdc++ is heavier still: its push_back inlines _M_realloc_insert, pulling the whole reallocation — operator new, memcpy, operator delete, and the length_error throw — into the function. To keep state live across those calls the fast path holds six callee-saved registers, on both g++ and clang.

A direct out-of-line member taking (this, Elt) in registers — the growAndPushBack above — is what keeps the fast path free of both a frame and callee-saved registers.

Note: many libc++ builds enable hardening by default. Disable it (and exceptions) for the best performance:

1
-fno-exceptions -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_NONE

Boost's small_vector has the same frame

boost::container::small_vector<int, N>::push_back tells the same story, independent of the inline capacity N (even N == 0):

1
2
3
4
5
6
7
8
9
sub    rsp, 24                  # frame on the fast path
mov [rsp+12], esi # spill x — dead on the fast path
mov rax, [rdi+8] # size (64-bit)
lea rdx, [4*rax] ; add rdx, [rdi] ; cmp rax, [rdi+16] # end; vs capacity
je .Lgrow
mov [rdx], esi ; inc rax ; mov [rdi+8], rax ; add rsp, 24 ; ret
.Lgrow:
lea r8, [rsp+12] # &x, for vector::priv_insert_…'s insert_emplace_proxy<const int&>
call ...

x is spilled only so the cold grow path can pass its address to the emplace proxy, yet the sub rsp, 24 and the spill land on the fast path (the store itself uses esi). Boost also keeps size and capacity as size_t, so small_vector<int,0> is 24 bytes — like std::vector, 8 more than SmallVector<int,0>.

Takeaways

  • A fast/slow merge that rejoins after a call forces callee-saved spills onto the hot path, and shrink-wrapping can't remove them.
  • A tail-called out-of-line slow path removes the overhead.
  • Inliner behavior makes size effects are non-monotonic.

Recent LLVM hash table improvements

LLVM has several hash tables. They used quadratic probing with in-band sentinel keys (empty, tombstone); recent work has been replacing that with linear probing with tombstone key removed.

  • DenseMap (replacement for std::unordered_map): DenseMapInfo::getEmptyKey() / getTombstoneKey().
    • DenseSet: implemented using DenseMap
    • compiler-rt/lib/sanitizer_common/sanitizer_dense_map.h ports the implementation for sanitizers.
  • SmallPtrSet (replacement for std::unordered_set<T *>): hard-coded -1 (empty) and -2 (tombstone).
  • StringMap (replacement for std::unordered_map<std::string, V>)
    • StringSet: implemented using StringMap

For the open-addressed DenseMap and SmallPtrSet, pointers, references, and iterators are invalidated by insert. StringMap is different: each entry lives in a heap-allocated StringMapEntry<V> node, so entry pointers survive grow. std::unordered_map, being node-based, keeps surviving-element pointers valid across both insert and erase and only invalidates the erased element's own iterator. LLVM code rarely needs that stronger contract — callers do not hold long-lived references into the container across mutation — and that gap is what gives pass to relocating erase and bit-array occupancy.

Recently,

  • Tombstones have been removed from DenseMap and SmallPtrSet. erase() also invalidates pointers.
  • DenseMap has also retired its empty-key sentinel, leading to significant performance improvements. DenseMap with integer keys (int/unsigned/size_t) had -1/-2 reserved — a footgun, now fixed.
  • StringMap got Algorithm R deletion too. Its entries are separately heap-allocated, so erase keeps entry pointers valid but invalidates iterators; erase-while-iterating moved to remove_if.

Read More

Fighting Hyrum's Law in LLVM

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. — Hyrum's Law

In a compiler, the most common form of Hyrum's Law is dependence on unspecified behavior — hash bucket order, the order of equal elements after std::sort, padding offsets. The same framing covers a few cases that are technically undefined behavior (use of an invalidated iterator) or plain incidental properties (ABI struct layout, ELF section offsets).

When the compiler itself harbors such a dependency, the symptom is usually output that varies build-to-build: an unstable sort that lands differently after the standard library changes, a hash map whose iteration order shifts when the hash function does. Occasionally the variation is run-to-run within a single build — DenseMap<void *, X> keys with an ASLR-derived seed reorder buckets each invocation. Either way, reproducible builds, bisection, and bug reports all assume same input → same output, and a stealth Hyrum dependency breaks that.

This post surveys some mechanisms that perturb the contract's blind spots so dependencies cannot quietly form.

Read More

Recent lld/ELF performance improvements

Updated in 2026-05.

Since the LLVM 22 branch was cut, I've landed patches that parallelize more link phases and cut task-runtime overhead. This post compares current main against lld 22.1, mold, and wild.

Headline: a Release+Asserts clang --gc-sections link is 1.34x as fast as lld 22.1; Chromium debug with --gdb-index is 1.09x as fast. mold and wild are still ahead — the last section explains why.

Read More

Bit-field layout

The C and C++ standards leave nearly every detail to the implementation. C23 §6.7.3.2:

An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified

C++ is also terse — [class.bit]p1:

Allocation of bit-fields within a class object is implementation-defined. Alignment of bit-fields is implementation-defined. Bit-fields are packed into some addressable allocation unit.

Read More

Call relocation types

Most architectures encode direct branch/call instructions with a PC-relative displacement. This post discusses a specific category of branch relocations: those used for direct function calls and tail calls. Some architectures use two ELF relocation types for a call instruction:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# i386, x86-64
call foo # R_386_PC32, R_X86_64_PC32
call foo@plt # R_386_PLT32, R_X86_64_PLT32

# m68k
bsr.l foo # R_68K_PC32
bsr.l foo@plt # R_68K_PLT32

# s390/s390x
brasl %r14, foo # R_390_PC32DBL
brasl %r14, foo@plt # R_390_PLT32DBL

# sparc
call foo, 0 # not PIC: R_SPARC_WDISP30
call foo, 0 # gas -KPIC: R_SPARC_WPLT30

Read More

lld 22 ELF changes

For those unfamiliar, lld is the LLVM linker, supporting PE/COFF, ELF, Mach-O, and WebAssembly ports. These object file formats differ significantly, and each port must follow the conventions of the platform's system linker. As a result, the ports share limited code (diagnostics, memory allocation, etc) and have largely separate reviewer groups.

With LLVM 22.1 releasing soon, I've added some notes to the https://github.com/llvm/llvm-project/blob/release/22.x/lld/docs/ReleaseNotes.rst as an lld/ELF maintainer. As usual, I've reviewed almost all the patches not authored by me.

For the first time, I used an LLM agent (Claude Code) to help look through commits (git log release/21.x..release/22.x -- lld/ELF) and draft the release notes. Despite my request to only read lld/ELF changes, Claude Code also crafted notes for other ports, which I retained since their release notes had been quite sparse for several releases. Changes back ported to the 21.x release are removed (git log --oneline llvmorg-22-init..llvmorg-21.1.8 -- lld).

I'll delve into some of the key changes.

Read More

Long branches in compilers, assemblers, and linkers

Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000-over 128MiB away. On AArch64, the bl instruction can only reach ±128MiB, so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range." The toolchain must handle this transparently to produce correct executables.

This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.

  • Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions
  • Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time
  • Linker: Handles cross-section and cross-object branches discovered during final layout

Read More

2025年总结

TODO

一如既往,主要在工具链领域耕耘。但由于工作忙碌在open source社区投入的时间减少了。

Blogging

不包括这篇总结,一共写了18篇文章。

llvm-project

Linux kernel

贡献了两个commits,被引用了一次。

ccls

  • clang.prependArgs
  • 支持了LLVM 21和22

ELF specification

尝试推进compact section header table,没有取得共识。 一些成员希望采用general compression (like zstd)的方式,像SHF_COMPRESSED那样压缩section header table。包括我在内的另一些人不喜欢采用general compression。

Misc

Reported 6 feature requests or bugs to binutils.

旅行

  • 第一次去:台南、西安、兰州、天水、Sacramento、Puerto Vallarta, Jalisco, Mexico、Mazatlán, Sinaloa, Mexico
  • 曾经去过:台北(上一次是近11年前)、北京