Updated in 2022-12.
Linker optimization/relaxation
Because the linker has a global view and layout information, it can perform some peephole optimizations which are difficult/impossible to do on the compiler side. Generic link-time code sequence transformation is risky, because semantic information is lost and what the linker sees are byte streams. However, if every instruction in the candidate code sequence is associated with one ore more relocations, the ABI and the implementation can assign (additional) semantics to the relocation types and make such transformation safe. This technique is usually called linker optimization or linker relaxation. It seems that the term "linker optimization" is often used when the number of bytes does not change while "linker relaxation" is used when the number of bytes decreases.
In GNU ld, gold and ld.lld, their i386, x86-64 and ppc64 ports have implemented various linker optimizations. Here is an example of x86-64 GOTPCRELX optimization (available in GNU ld since 2015).
1 | # Load the address via GOT indirection. |
More ports have implemented TLS code sequence optimization. See All about thread-local storage for details.
Because the term "link-time optimization" is similar to linker relaxation but is usually used in a narrow sense which is very different (communicate symbol resolution to the compiler, combine the information of multiple translation units, and perform IR level optimization), I (and some other folks) have used "linker relaxation" to refer to the transformations without changing the code sequence length.
Now let's move our focus to linker relaxation.
RISC architectures usually need more than one instructions to materialize a symbol address. There are usually two instructions, the first materializing the high bits and the second materializing the low bits. In many cases, an alternative instruction can replace the two instructions. This requires the linker to be able to delete an instruction from the section.
Another case is the design of branch instructions. On some RISC architectures (e.g. AVR), a long instruction can be replaced with a short instruction.
1 | jmp dest # 4 bytes, R_AVR_CALL |
A branch instruction typically has a much shorter range than the 32-bit address space. In rare cases the jump target may be out of range. Most RISC architectures use range extension thunks: let the linker redirect the branch instruction to a thunk which materializes the target address and jumps there.
RISC-V linker relaxation
Instead of giving relocation types such as
R_RISCV_HI20, R_RISCV_LO12, R_RISCV_PCREL_LO12_I, R_RISCV_PCREL_LO12_S, R_RISCV_CALL
additional semantics, the designer introduced a new relocation type
R_RISCV_RELAX
. You will see that at the same location there
are two relocations. The ELF specification says:
If multiple consecutive relocation records are applied to the same relocation location (r_offset), they are composed instead of being applied independently, as described above. By consecutive, we mean that the relocation records are contiguous within a single relocation section. By composed, we mean that the standard application described above is modified as follows:
- In all but the last relocation operation of a composed sequence, the result of the relocation expression is retained, rather than having part extracted and placed in the relocated field. The result is retained at full pointer precision of the applicable ABI processor supplement.
- In all but the first relocation operation of a composed sequence, the addend used is the retained result of the previous relocation operation, rather than that implied by the relocation type.
Note that a consequence of the above rules is that the location specified by a relocation type is relevant for the first element of a composed sequence (and then only for relocation records that do not contain an explicit addend field) and for the last element, where the location determines where the relocated value will be placed. For all other relocation operands in a composed sequence, the location specified is ignored.
An ABI processor supplement may specify individual relocation types that always stop a composition sequence, or always start a new one.
In the composed sequence, R_RISCV_RELAX
performs the
operation of deleting bytes.
In -no-pie
mode, GNU ld can perform GP relaxation.
1 | lui a0, %hi(sym) # R_RISCV_HI20, R_RISCV_RELAX |
From binutils 2.41 onwards, GNU ld will support --no-relax-gp
to disable GP relaxation.
For branches which can potentially be long ranged, RISC-V uses 2 instructions. This nicely avoids range extension thunks. If the branch turns out to be short ranges, GNU ld can delete one instruction and even compress the remaining one.
1 | call fun # auipc ra, ..; jalr ra, ..(ra) |
Example
1 | void ext(void); |
1 | 0000000000000000 <.text>: |
Two instructions are used to materialize a call. If the call turns out to be short ranged, the first instruction can be dropped.
1 | 000000000000244 <foo>: |
Assembler implications
The existing documentation has praised the scheme a lot. This article, as you have already seen from the title, will delve into the downside.
Alignment directives
Deleting bytes may affect the following alignment directive. In the
following example, if the call code sequence is shortened to 4 bytes,
the mv instruction may not start at an offset aligned by 16.
1
2
3call fun
.balign 16
mv a1, a0
The address this problem, depending on whether RVC (compressed
instructions) can be used, the assembler emits a padding of align-4 (no
RVC) or align-2 (RVC) bytes for an alignment directive. Without
generality, let's assume RVC is unavailable and the padding is align-4.
To tell the linker the number of bytes, an R_RISCV_ALIGN
with addend align-4 is emitted at the beginning of the padding.
The linker can delete some bytes to satisfy the alignment requirement. The deleted number of bytes must be less than or equal to align-4.
1 | call fun # 8 bytes |
Since the assembler emits NOPs according to the worse case, the
alignment requirement is not necessarily correct when no relaxation is
performed. Therefore, a linker which has not implemented deleting bytes
(e.g. ld.lld) should conservatively bail out if such an R_RISCV_ALIGN is
seen. This explains the ld.lld diagnostic
error: relocation R_RISCV_ALIGN requires unimplemented linker relaxation
.
Linker friendly
R_RISCV_ALIGN
In ELF, many relocation types capable of linker optimization are pure optional features: R_AVR_CALL, R_PPC64_PCREL_OPT, R_X86_64_GOTPCRELX, R_X86_64_REX_GOTPCRELX, TLS relocation types supporting GD->LE/GD->IE/LD->LE/etc optimizations.
Currently R_RISCV_ALIGN
just encodes the expected
alignment. Can it encode the actual bytes of NOPs beside the expected
alignment? Then linkers not supporting linker relaxation can happily
accept -mrelax
object files.
There are several schemes
- Encode the actual bytes of NOPs in the symbol part of the r_info
field. The associated symbol can be absolute
(
st_shndx==SHN_ABS
). - Encode the actual bytes of NOPs in the high bits of the r_addend field.
- Introduce a new relocation type used together with
R_RISCV_ALIGN
.
The third approach is probably least favored. The first one should be compatible with existing implementations.
I filed https://github.com/riscv-non-isa/riscv-elf-psabi-doc/issues/183 for this issue, but closed it after I implemented linker relaxation for ld.lld.
.align [abs-expr[, abs-expr[, abs-expr]]]
In GNU assembler, the third argument of .align
specifies
the maximum number of bytes that should be skipped by this alignment
directive. This is unfortunately not representable with the
R_RISCV_ALIGN
scheme. The second argument is not
either.
Fixup against a local symbol
For a reference to a local symbol in the same section (e.g. a branch target), traditionally no relocation is needed. The fixup is resolved at assembly time. With linker relaxation, we need to preserve the relocation so that the linker can fix the offset.
1 | .globl fun |
Label differences related to text section symbols
Label differences related to local text section symbols are often used to describe metadata sections. In the presence of linker relaxation, label differences generally needed to be treated as non-constants.
The assembler behavior is not well documented. See https://github.com/riscv-non-isa/riscv-asm-manual/issues/80.
For A-B
, if A
is defined relative to a text
section or is undefined, generally we should emit a pair of ADD/SUB
relocations (e.g. R_RISCV_ADD32
/R_RISCV_SUB32
,
R_RISCV_ADD64
/R_RISCV_SUB64
). In other cases,
we should use PC-relative relocations.
1 | w: |
In the LLVM integrated assembler, whether we should emit a pair of ADD/SUB relocations is decided upfront at parsing time. (https://reviews.llvm.org/D145474)
Currently both GNU assembler and LLVM integrated assembler emit
R_RISCV_ADD32
/R_RISCV_SUB32
pairs regardless
of the relax/norelax state. I think it will be useful to have a way to
emit a constant in the norelax case to make relocatable object files
smaller. This will help clang -g:
1 | .section .debug_info,"",@progbits |
If 32-bit label differences always lead to
R_RISCV_ADD32
/R_RISCV_SUB32
pairs, it will be
cumbersome to get rid of the relocations in -mno-relax
mode. An object file producer can do some hard work emitting a constant
but this isn't feasible for an assembly file producer.
DWARF
DWARF is a widely used debugging information format.
Code addresses
In DWARF, a debugging information entry (DIE) describing an entity
that has a range of machine code adddresses may have
DW_AT_low_pc/DW_AT_high_pc/DW_AT_ranges
attributes to
describe the addresses.
On other architectures, the typical implementation uses assembly directives this way and needs one relocation referencing the start of the function.
1 | .quad .Lfunc_begin0 # DW_AT_low_pc |
Due to linker relaxation, the length is not a constant, so the label
difference will actually produce two relocations, a pair of
R_RISCV_ADD32
and R_RISCV_SUB32
. So RISC-V
gives us two relocations which take some space in the object file.
1 | 0x0000002a: DW_TAG_subprogram [2] |
An alternative design is to let the value of the
DW_AT_high_pc
be of class address, specifically,
DW_FORM_addr
. It takes 8 bytes on ELFCLASS64 but can remove
one relocation.
1 | .quad .Lfunc_begin0 # DW_AT_low_pc |
Line number information
Line number information gives association from source file locations to machine instruction addresses. It is conceptually a matrix with one row for each instruction. The matrix has columns for:
- the source file name
- the source line number
- the source column number
- and so on
DWARF uses a byte-coded language to encode the matrix. The specification says:
Most of the instructions in a line number program are special opcodes.`
For the above example (consecutive ext()
calls), on most
architectures, a call takes one special opcode of one byte.
llvm-dwarfdump --debug-line
can dump the matrix:
1 | # x86-64 |
However, one RISC-V, it is bloated with two relocations associated
with one row!
DW_LNS_advance_line+DW_LNS_fixed_advance_pc+DW_LNS_copy
take 6 bytes. Two Elf64_Rela
relocations take 48 bytes.
1 | # RISC-V |
54x waste! So, what is wrong?
Well, due to linker relaxation, the address increment between two calls is not a compile-time constant. A special opecode is encoded with the following formula:
1 | # DWARF v4 introduced maximum_operations_per_instruction for VLIW architectures. |
The variables except operation_increment are compile-time constants,
but we don't have a relocation type representing multiplication. If such
a relocation type exists and the compiler can ensure that the maximum
address_advance
does not make the ubyte opcode
overflow (this is not simple), we can make the linked output much
smaller. That said, relocations are still the primary overhead of object
files.
Among major binary formats (Mach-O (8 bytes), PE-COFF (10 bytes)),
Elf64_Rela
(24 bytes) on ELF has a very noticeable
overhead. I don't know whether RISC-V people may want to switch to
ELFCLASS32 for 64-bit RISCV in the future. Currently the Linux kernel
associates ELFCLASS32 to ILP32 ABI variants, but nothing prevents
ELFCLASS32 from being used for small code model object files.
Call frame information
A frame description entry (FDE) in call frame information
(.debug_frame/.eh_frame
) encodes the length of the entity.
A pair of relocations are needed.
Similar to line number information, the call frame instructions are
encoded in a byte-coded language. The DW_CFA_advance_loc
instruction has a 6-bit operand (encoded wit hthe opcode) representing
that the location delta is operand * code_alignment_factor
.
The instruction can be relocated by a pair of R_RISCV_SET6
and R_RISCV_SUB6
.
Split DWARF
-gsplit-dwarf
produces a .dwo
file which
will not be processed by the linker. If .dwo
files contain
relocations, they will not be resolved. Therefore the practice is that
.dwo
files do not contain relocations.
Currently Clang and GCC use DW_AT_high_pc
or
DW_AT_ranges
to describe address ranges where the range
sizes/endpoints are described by relocations. Such implementations are
incompatible with linker relaxation. To work with linker relaxation,
DW_AT_high_pc
and DW_AT_ranges
need to use
indexes into .debug_addr
(e.g.
DW_RLE_startx_endx
).
Range lists and location lists
Due to lack of support for uleb128 label differences, the compact
DW_LLE_offset_pair
description cannot be used. GCC uses the
DW_LLE_startx_endx
description (PR99090).
The operands are indices into the .debug_addr
section. The
two values in .debug_addr
are relocated by
R_RISCV_64
relocations.
Language-specific data area
In the Itanium C++ ABI, the information needed to process exceptions
is called language-specific data area (LSDA). On ELF targets, this is
usually stored in the .gcc_except_table
section. See C++ exception
handling ABI for details.
1 | int comdat() { |
Call site records describe the landing pad offset/length. Without
linker relaxation, the values are assembly time constant and actually
.gcc_except_table
has no relocations referencing text
sections.
1 | .section .gcc_except_table,"a",@progbits |
With linker relaxation, in the generic case we need to have relocation pairs representing label differences. If a label difference may exceed 127, a uleb128 directive will not be suitable because we don't have uleb128 label difference relocation types.
In addition, uleb128 relaxation is difficult. The call site table length field in the header is encoded in uleb128. The value and other uleb128 offsets/lengths can cause oscillation. See GNU as (PR4029).
Anyway GCC and Clang have chosen .word
to encode
offsets/lengths.
1 | .section .gcc_except_table,"a",@progbits |
The other problem is that relocations from
.gcc_except_table
to the text section can cause some linker
garbage collection difficulty. It requires fragmented
.gcc_except_table
sections. See C++ exception
handling ABI for details.
ld --emit-relocs
A fair amount of work is needed to keep --emit-relocs
output updated in the presence of linker relaxation. Relocations
associated with shrunk instructions may look strange and some
relocations may feel out of the place.
Linker implementation
GNU ld does the following steps: 1
2
3
4
5
6
7
8
9
10
11
12
13lang_check_relocs ();
ldemul_after_allocation ();
ldelf_map_segments
lang_relax_sections
lang_size_sections (&relax_again, false);
bfd_relax_section
_bfd_riscv_relax_delete
_bfd_riscv_relax_align
ldwrite ();
bfd_final_link
riscv_elf_relocate_section
See RISC-V linker relaxation in lld for the ld.lld implementation.
Does link-time optimization or post link-time optimization help?
Nearly no. This is a phase ordering issue. IR level (and future machine IR level) link-time optimization is performed very early, just after symbol resolution. It has to be done early because the later steps need access to its emitted input sections. The LTO library has no layout information to guide its choice of branch instructions. As an example, for a reference from .data to .text, the LTO library does not know the distance.
Epilog
I sometimes call linker relaxation poor man's link-time optimization with nice ergonomics. I agree it is useful, probably more so for embedded systems, but there appears to be great relocatable object file size cost. Certain toolchain components have high complexity but they are mostly settled down now.
LoongArch folks appear zealous and want to add linker relaxation (https://github.com/loongson/LoongArch-Documentation/pull/77). I have expressed my concern on the GitHub PR and https://sourceware.org/pipermail/binutils/2022-December/125322.html regarding adding support without sorting out all the issues.