In assembly languages, some instructions with an immediate operand can be encoded in two (or more) forms with different sizes. On x86-64, a jump (JMP or JCC) can be encoded either in 2 bytes with an 8-bit relative offset, or in 5 bytes (JMP) or 6 bytes (JCC) with a 32-bit relative offset. The short form is preferred because it takes less space. However, when the target of the jump is too far away, the long form must be used.
    ja foo    # jump short if above, 77 <rel8>
    ja foo    # jump near if above, 0f 87 <rel32>
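The size choice can be sketched as a range check on the displacement. This is only an illustration: the helper names (`jump_disp`, `fits_rel8`, `jcc_size`, `jmp_size`) are hypothetical and not taken from any assembler, and the displacement is assumed to be measured from the end of the jump instruction, as on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the displacement that gets encoded is the target
   address minus the address of the instruction following the jump. */
int64_t jump_disp(int64_t jump_addr, int encoded_size, int64_t target_addr) {
    assert(encoded_size == 2 || encoded_size == 5 || encoded_size == 6);
    return target_addr - (jump_addr + encoded_size);
}

/* True if disp fits in a signed 8-bit relative offset. */
bool fits_rel8(int64_t disp) {
    return disp >= INT8_MIN && disp <= INT8_MAX;
}

/* Conditional jump: short form is opcode + rel8 (2 bytes),
   near form is 0F 8x + rel32 (6 bytes). */
int jcc_size(int64_t disp) {
    return fits_rel8(disp) ? 2 : 6;
}

/* Unconditional jump: short form is EB + rel8 (2 bytes),
   near form is E9 + rel32 (5 bytes). */
int jmp_size(int64_t disp) {
    return fits_rel8(disp) ? 2 : 5;
}
```

The catch is that the displacement depends on the sizes of the instructions between the jump and its target, which is what makes the problem span-dependent.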
A 1978 paper by Thomas G. Szymanski ("Assembling Code for Machines with Span-Dependent Instructions") used the term "span-dependent instructions" to refer to such instructions. Assemblers grapple with the challenge of choosing the optimal size for these instructions, often referred to as the "branch displacement problem" since branches are the most common type. A good resource for understanding Szymanski's work is Assembling Span-Dependent Instructions.
Popular assemblers still used today tend to favor a "start small and grow" approach, typically requiring one more pass than Szymanski's "start big and shrink" method. This approach often results in smaller code and can handle additional complexities like alignment directives.
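A "start small and grow" pass can be sketched as a fixed-point loop. This is a toy model under stated assumptions, not the code of any real assembler: `struct item`, `relax`, and `MAX_ITEMS` are hypothetical names, every jump is treated as a JCC (2 or 6 bytes), and alignment directives are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_ITEMS = 1024 };

/* Hypothetical model: an item is either fixed bytes or a conditional jump. */
struct item {
    bool is_jump;   /* true: conditional jump; false: fixed-size bytes */
    int size;       /* current encoded size in bytes */
    size_t target;  /* jumps only: index of the item jumped to (n = end) */
};

/* Starts every jump in the short form and grows the ones that do not reach.
   Returns the number of passes made, including the final pass that changes
   nothing. Sizes only grow, so the loop terminates. */
int relax(struct item *items, size_t n) {
    assert(n <= MAX_ITEMS);
    int64_t offset[MAX_ITEMS + 1];
    int passes = 0;
    bool changed;

    for (size_t i = 0; i < n; i++)
        if (items[i].is_jump)
            items[i].size = 2; /* optimistic: short form */

    do {
        changed = false;
        passes++;
        /* Lay out all items with their current sizes. */
        offset[0] = 0;
        for (size_t i = 0; i < n; i++)
            offset[i + 1] = offset[i] + items[i].size;
        /* Grow every short jump whose displacement no longer fits rel8. */
        for (size_t i = 0; i < n; i++) {
            if (!items[i].is_jump || items[i].size != 2)
                continue;
            assert(items[i].target <= n);
            int64_t disp = offset[items[i].target] - offset[i + 1];
            if (disp < INT8_MIN || disp > INT8_MAX) {
                items[i].size = 6; /* switch to the near form */
                changed = true;
            }
        }
    } while (changed);
    return passes;
}
```

Growing one jump can push another jump's target out of range, so a cascade may need several passes. If every jump already starts in the long form, there is nothing left to grow and a single pass suffices.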
In LLVM, the MC library (Machine Code) is responsible for assembly, disassembly, and object file formats. Within MC, "assembler relaxation" deals with span-dependent instructions. This is distinct from linker relaxation.
Eli Bendersky provides a detailed explanation in a 2013 blog post and highlights an interesting behavior:
"For example, when compiling with -O0, the LLVM assembler simply relaxes all jumps it encounters on first sight. This allows it to put all instructions immediately into data fragments, which ensures there's much fewer fragments overall, so the assembly process is faster and consumes less memory."
When -O0 is enabled and the integrated assembler is used (the default for most targets), clangDriver passes the -mrelax-all flag to the LLVM MC library. This sets the MCRelaxAll flag in MCTargetOptions, instructing the assembler to potentially start with the long form (near) for JMP and JCC instructions on the X86 target only. Other instructions like ADD/SUB/CMP and non-x86 architectures remain unaffected.
Here is an example:

    void bar(void);

    void foo(int a) {
      // -mrelax-all: near jump (6 bytes)
      // -mno-relax-all or -fno-integrated-as: short jump (2 bytes)
      if (a) bar();
    }
The impact of -mrelax-all on text section size is significant, especially when there are many branch instructions. In an x86-64 release build using lld, -mrelax-all increased the .text section size by 7.9%. This translates to a 5.4% increase in VM size and a 4.6% increase in the overall file size.
Dean Michael Berris proposed removing the -mrelax-all default for -O0 in 2016, but the proposal stalled. -mrelax-all also interacted badly with RISC-V's conditional branch transforms, leading Craig Topper to remove -mrelax-all as the -O0 default for RISC-V.
While -mrelax-all might have offered slight compile time benefits in the past, the gains are negligible today. Benchmarking using stage 2 builds of Clang showed no measurable difference between -mrelax-all and -mno-relax-all. On llvm-compile-time-tracker running the llvm-test-suite/CTMark benchmark, switching to -mno-relax-all increased compile time slightly, by 0.62%, while the text section size decreased by 4.44%.
A difference in assembly output between optimisation levels would be quite surprising. GCC and the GNU assembler do not exhibit a similar expansion of JMP/JCC instructions, even at -O0.
These arguments strengthen the case for removing -mrelax-all as the default for -O0. My patch has landed and will be included in the next major release, LLVM 19.1.
Understanding the compile time difference
I have studied a notoriously huge file, llvm/lib/Target/X86/X86ISelLowering.cpp.
Fragment count: A significant difference exists in the number of assembler fragments generated:

* -mrelax-all: 89633
* -mno-relax-all: 143852
With -mrelax-all, the number of MCRelaxableFragments is substantially reduced (to zero when building Clang). This reduction likely contributes to the compile time difference.
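Why the fragment count differs can be pictured with a simplified counting model. This is a sketch under my own assumptions rather than LLVM's actual code: `struct frag_counts` and `count_fragments` are hypothetical names, each relaxable instruction is assumed to get a fragment of its own unless everything is relaxed up front, and consecutive non-relaxable instructions are assumed to share one data fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tally of the two fragment kinds in the model. */
struct frag_counts {
    size_t data;       /* fragments holding already-final bytes */
    size_t relaxable;  /* fragments holding one instruction that may grow */
};

/* is_relaxable[i] says whether instruction i has a short and a long form.
   With relax_all, every instruction is emitted in its final form straight
   into the current data fragment. */
struct frag_counts count_fragments(const bool *is_relaxable, size_t n,
                                   bool relax_all) {
    assert(is_relaxable != NULL || n == 0);
    struct frag_counts c = {0, 0};
    bool data_open = false; /* is there a data fragment we can append to? */

    for (size_t i = 0; i < n; i++) {
        if (is_relaxable[i] && !relax_all) {
            c.relaxable++;      /* own fragment for this instruction */
            data_open = false;  /* later bytes need a new data fragment */
        } else if (!data_open) {
            c.data++;
            data_open = true;
        }
    }
    return c;
}
```

In this model, a stream that alternates between ordinary instructions and jumps produces a new fragment at almost every jump, while relaxing everything up front collapses the same stream into a single data fragment.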
Fixed-point algorithm iterations: -mrelax-all ensures the fixed-point algorithm (almost always) converges in a single iteration. In contrast, with -mno-relax-all, around 6% of sections require additional iterations. However, this difference is likely not the primary factor affecting compile time.
Why did people not complain about the code size increase?
Because people generally care less about -O0 code size. -O0 is often used alongside -g for debugging purposes. The total file size increase caused by -mrelax-all might seem less significant in comparison.
In addition, not all projects can be successfully built with -O0. This is typically due to issues like very large programs or mandatory inlining behavior.
For a discussion on size reduction ideas in ELF relocatable files, please check out my blog post about Light ELF.