SmallVector is LLVM's most-used container, and
push_back its hot operation. For the trivially-copyable
specialization the fast path should be fast.
1 |
|
clang -S --target=x86_64 -O2 -DNDEBUG a.cc
generates:
1 | push rbp # callee-saved spills + a stack realignment, |
push_back reserves capacity and then stores, so
the store at .Lstore is shared between the no-grow and
post-grow paths. On the grow path this and x
must survive the grow_pod call, which means they are saved
in callee-saved registers, leading to
push rbx/push rbp in the prologue.
push rbp is needed to maintain the 16-byte alignment of the
stack frame.
GCC's output is also inefficient:
1 | push rbp ; mov ebp, esi # x -> rbp, in the entry block |
Shrink wrapping can't remove it
Shrink wrapping relocates the save/restore of callee-saved registers;
it never duplicates a block. To carry this/x
across the conditional grow_pod call into a store the fast
path also reaches, a callee-saved register must be live from entry.
clang -mllvm -debug-only=shrink-wrap reports
No Shrink wrap candidate found. GCC's
-fshrink-wrap-separate (on at -O2) does not
optimize this as well.
The transformation that would help is tail duplication —
give the slow path its own copy of the store so the fast path keeps
this/x in their argument registers. Neither
compiler does it here, and it is not shrink-wrapping's job.
The fix: tail calling the slow path
Move the grow-and-store out of line and tail calls it:
1 | LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) { |
1 | mov eax, [rdi + 8] |
7 instructions instead of 14, no callee-saved registers, nothing to shrink-wrap.
growAndPushBack is an vague linkage function that is
emitted in a separate section using COMDAT, costing some .o
file size.
Two details:
noinlineis load-bearing. Without it Clang and GCC may sometimes inline the helper back and the prologue returns.T Tmp = ElthandlesEltreferencing the vector's own storage. It is elided for small by-value types and, for large by-const&types, is smaller than reusing the old pointer-relocation machinery.
1 |
|
Results
lld .text shrinks 40,512 bytes;
by-const& element types win most, e.g.
GotSection::addConstant goes 167 → 45 bytes. On the LLVM compile-time
tracker the clang build is 0.41–0.51% fewer
instructions:u across every configuration, for +0.13%
binary size.
Sorted by relative size, a few outliers grow ~13.8% — the constexpr
ByteCode interpreter (Interp.cpp,
EvalEmitter.cpp). A smaller push_back likely
perturbs the bottom-up inliner's near-threshold decisions.
Aside: "approximately trivially copyable"
1 | std::is_trivially_copy_constructible<T> && |
is the predicate that selects the
SmallVectorTemplateBase<..., true> specialization,
where copy/move construction optimizes to memcpy and
destroy_range is a no-op.
The condition is deliberately broader than
is_trivially_copyable, which also demands trivial
assignment. The motivating case is
std::pair<int,int>: its constructors are trivial, but
its assignment is user-provided (to support
pair<T&,U&>), so
is_trivially_copyable is false.
SmallVector only ever copies or moves elements by
construction into uninitialized storage (memcpy), never by
assignment, so the distinction is unobservable and memcpy
is sound — and std::pair<POD,POD> stays on the fast
path.
The condition is also stronger than trivial relocatability, whose
operational definition is just
is_trivially_move_constructible && is_trivially_destructible.
The extra is_trivially_copy_constructible is there because
SmallVector also calls memcpys when
copying a live element — push_back(const T&),
or copy-constructing from another vector.
Aside: a smaller header
than std::vector
std::vector stores three 8-byte members.
SmallVector stores a begin pointer plus a 32-bit size and a
32-bit capacity when sizeof(T) >= 4.
1 | template <class Size_T> class SmallVectorBase { |
So for int, pointers, and most structs the header is
16 bytes — 8 fewer than std::vector. The
cost of carrying a size instead of an end pointer is that addressing the
end (begin + size * sizeof(T)) needs a multiply, visible
when sizeof(T) is not a power of two.
std::vector::push_back is slow in both libc++ and libstdc++
Neither standard vector has a frame-free
push_back fast path.
libc++'s push_back forwards to
emplace_back, which routes the grow decision through
std::__if_likely_else(cond, fast, slow). The slow path is
kept out of line, but as a by-reference lambda, so its closure
{&__end_, &__x, this} is materialized on the stack
and the trailing this->__end_ = __end is a merge. The
fast path therefore spills the closure and runs with a 48-byte
frame:
1 | push rbx |
libstdc++ is heavier still: its push_back inlines
_M_realloc_insert, pulling the whole reallocation —
operator new, memcpy,
operator delete, and the length_error throw —
into the function. To keep state live across those calls the fast path
holds six callee-saved registers, on both g++ and clang.
A direct out-of-line member taking (this, Elt) in
registers — the growAndPushBack above — is what keeps the
fast path free of both a frame and callee-saved registers.
Note: many libc++ builds enable hardening by default. Disable it (and exceptions) for the best performance:
1 | -fno-exceptions -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_NONE |
Boost's small_vector has the same frame
boost::container::small_vector<int, N>::push_back
tells the same story, independent of the inline capacity N
(even N == 0):
1 | sub rsp, 24 # frame on the fast path |
x is spilled only so the cold grow path can pass its
address to the emplace proxy, yet the sub rsp, 24 and the
spill land on the fast path (the store itself uses esi).
Boost also keeps size and capacity as size_t, so
small_vector<int,0> is 24 bytes — like
std::vector, 8 more than
SmallVector<int,0>.
Takeaways
- A fast/slow merge that rejoins after a call forces callee-saved spills onto the hot path, and shrink-wrapping can't remove them.
- A tail-called out-of-line slow path removes the overhead.
- Inliner behavior makes size effects are non-monotonic.