---
# try also 'default' to start simple
theme: default
# random image from a curated Unsplash collection by Anthony
# like them? see https://unsplash.com/collections/94734566/slidev
# background: https://source.unsplash.com/collection/94734566/1920x1080
# apply any windi css classes to the current slide
class: 'text-center'
# https://sli.dev/custom/highlighters.html
highlighter: shiki
# show line numbers in code blocks
lineNumbers: false
# persist drawings in exports and build
drawings:
persist: false
# use UnoCSS (experimental)
css: unocss
---
# Implement RISC-V linker relaxation in lld
---
layout: 'intro'
---
宋方睿
---
# What is linker optimization/relaxation?
- Link-time code sequence transformation (peephole optimization)
- Difficult/impossible to do on the compiler side, e.g. (a) whether `extern int v;` resolves to this component or another? (b) whether a defined symbol is sufficiently close to the current instruction?
- The linker has a global view and address information
- Optimizable code sequences are communicated via relocations
"Linker relaxation" is used when the number of bytes changes (Andes NDS32, Xtensa, RISC-V, etc)
Link-time optimization (LTO) is usually used in a narrow sense.
It involves IR level optimization and requires (re)compilation (optimization and code generation) and re-linking.
---
# x86-64 GOTPCRELX optimization
```c
extern int var;
int foo(void) { return var; }
```
If `var` is non-preemptible (known to resolve to the same component at link time)
```asm
movq var@GOTPCREL(%rip), %rax # load address from GOT, R_X86_64_REX_GOTPCRELX
movl (%rax), %eax # load variable value
```
```asm
leaq var(%rip), %rax # set address relative to RIP
movl (%rax), %eax # load variable value
```
---
# PowerPC64 TOC optimization
```c
extern int var;
int foo(void) { return var; }
```
If `var` is non-preemptible (known to resolve to the same component at link time)
```asm
addis 3, 2, .LC0@toc@ha
ld 3, .LC0@toc@l(3)
lwa 3, 0(3)
.section .toc,"aw",@progbits
.LC0:
.tc var[TC],var
```
```asm
addis 3, 2, var@toc@ha
addi 3, 3, var@toc@l
lwa 3, 0(3)
.section .toc,"aw",@progbits
.LC0:
.tc var[TC],var
```
The linker may garbage collect the `.toc` section.
---
# RISC-V linker relaxation
* `R_RISCV_ALIGN` relaxation
* call relaxation
* local-exec TLS relaxation
* lui relaxation
Key insight: RISC typically needs two instructions to access a symbol.
Oftentimes one instruction is sufficient if the symbol is sufficiently close.
---
# Call relaxation
```c
void ext(void); void foo(void) { ext(); ext(); }
```
Relax to one instruction if `isInt<21>(displacement)`.
If `displacement` is smaller, `c.j` or `c.jal` may be used.
```asm
# pseudo: call ext@plt
auipc ra, 0 # R_RISCV_CALL_PLT(ext), R_RISCV_RELAX
jalr ra, 0(ra)
# pseudo: call ext@plt
auipc ra, 0 # R_RISCV_CALL_PLT(ext), R_RISCV_RELAX
jalr ra, 0(ra)
```
```asm
# If 32-bit and displacement is representable as an int12.
c.jal ra, ext
c.jal ra, ext
# Otherwise, if displacement is representable as an int21.
jal ra, ext
jal ra, ext
```
```asm
# pseudo: jump ext@plt, t0
auipc t0, 0
jalr zero, 0(t0)
```
```asm
# If rd==zero and displacement is representable as an int12.
c.j ext
# Otherwise, if displacement is representable as an int21.
jal zero, ext
```
Other RISC architectures prefer one jump instruction. They need range extension thunks (veneers).
RISC-V circumvents range extension thunks.
---
# `R_RISCV_ALIGN` relaxation
```asm
auipc ra, 0 # R_RISCV_CALL_PLT(ext), R_RISCV_RELAX
jalr ra, 0(ra)
auipc ra, 0 # R_RISCV_CALL_PLT(ext), R_RISCV_RELAX
jalr ra, 0(ra)
.balign 16 # R_RISCV_ALIGN
ret
```
---
# lld
* LLVM linker
* Ports are independent: PE-COFF, ELF, WebAssembly, Mach-O
* Nowadays I use the command name "ld.lld" to refer to the ELF port
---
# lld ELF
* Parse command line options
* Find and scan input files (.o, .so, .a), interleaved with symbol resolution
* Call LLVM LTO to get ELF object files
* Global transforms (section based garbage collection, identical code folding, etc)
* Create synthetic sections
* Map input sections and synthetic (linker-generated) sections into output sections
* Scan relocations
* Finalize synthetic sections
* Layout (addresses, thunks, `SHT_RELR`, symbol assignments)
* Assign file offsets
* Write header and sections
Challenges?
---
# Scan relocations
* Makes dynamic relocation decisions
* Determine the sizes of `.got`, `.got.plt`, `.plt`, `.rela.dyn`, `.relr.dyn`
* One pass, infeasible to change
---
# Symbol values
* A linker script can define symbols and decide output section addresses from symbol values.
* The changed section layout may change symbol values.
```text
SECTIONS {
.mid 0x10800 : { mid_start = .; *(.mid); mid_end = .; }
.high 0x110000+(mid_end-mid_start) : { *(.high) }
.high2 0x210000+SIZEOF(.mid) : { *(.high2) }
}
```
3 output sections.
The size change of `.mid` will change the address of `.high`.
---
# Design
Add another relocation scanning pass for linker relaxation.
* Scan relocations
* Finalize synthetic sections
* Layout (_relaxation,_ addresses, `SHT_RELR`, symbol assignments)
* Assign file offsets
The relaxation pass computes:
* for each relocated location, the replacement relocation type, the rewritten code sequence, and the number of bytes to delete
* the size of each input code section
* `st_value` and `st_size` for each symbol defined relative to the section
---
# Implementation
* Record deleted ranges. Lazy deletion
* Record the size of input section
* Consider deleted ranges when computing section addresses and symbol values (for relocation displacement)
* Delete bytes at the end of the layout phase
```cpp
template void Writer::finalizeAddressDependentContent() {
...
uint32_t pass = 0;
for (;;) {
Create thunks or call relaxOnce; // relaxOnce is new
++pass;
Report "not converged" if pass is too large;
Update address-dependent sections;
Assign addresses to sections and symbols;
}
if (!config->relocatable && config->emachine == EM_RISCV)
riscvFinalizeRelax(pass);
...
}
```
---
```cpp
bool RISCV::relaxOnce(int pass) const {
...
bool changed = false;
for (OutputSection *osec : outputSections) {
if (!(osec->flags & SHF_EXECINSTR))
continue;
for (InputSection *sec : getInputSections(*osec, storage))
changed |= relax(*sec);
}
return changed;
}
```
---
```cpp
static bool relax(InputSection &sec) {
Restore original st_value for symbols relative to this section.
std::fill_n(aux.relocTypes.get(), sec.relocations.size(), R_RISCV_NONE);
aux.writes.clear();
for (auto [i, r] : llvm::enumerate(sec.relocations)) {
const uint64_t loc = secAddr + r.offset - delta;
uint32_t &cur = aux.relocDeltas[i], remove = 0;
switch (r.type) {
case R_RISCV_ALIGN: remove = the number of bytes to delete; break;
case R_RISCV_CALL: case R_RISCV_CALL_PLT:
if (i + 1 != sec.relocations.size() && sec.relocations[i + 1].type == R_RISCV_RELAX)
relaxCall(sec, i, loc, r, remove);
break;
// Other relaxable relocation types
}
Update symbol st_value/st_size according to symbol anchors;
delta += remove;
if (delta != cur) { cur = delta; changed = true; }
}
Update trailing symbol anchors;
sec.bytesDropped = delta;
return changed;
}
```
---
Postponed range deletion: for each input code section, we iterate over its non-resolved relocations.
For an `R_RISCV_ALIGN` associated with some bytes to delete, we copy all content from the previous location to `r_offset`, then skip some bytes for the next copy.
```asm
... # Copy all content from the previous location to r_offset
.balign 8 # R_RISCV_ALIGN(r_addend=6)
# A prefix of the NOPs may be skipped for the next memcpy
addi a0, a0, 1
```
Say we need to delete 2 bytes. If we use `[]` to indicate the copied bytes, the current and the next copy patterns will look like:
```text
old: ...] NOP NOP [NOP NOP NOP NOP ADDI ADDI ADDI ADDI ...]
old: next copy
new: ...] [NOP NOP NOP NOP ADDI ADDI ADDI ADDI ...]
new:
```
---
# Call relaxation
```asm
call dest@plt # R_RISCV_CALL_PLT, R_RISCV_RELAX
```
If auipc+jalr can be relaxed to a 4-byte jal, we ignore auipc, replace jalr with jal, and increment `p` and `offset` so that next memcpy will start copying from the first byte after jalr.
The rewritten instruction starts at the first byte indicated by `skip=4`.
```text
old: ...] AUIPC AUIPC AUIPC AUIPC JALR JALR JALR JALR [.........]
remove=4 skip=4 next copy
new: ...] JAL JAL JAL JAL [.........]
```
---
```asm
tail dest@plt
```
If the `tail` pseudo instruction which can be relaxed to `c.j`.
```text
old: ...] AUIPC AUIPC AUIPC AUIPC JALR JALR JALR JALR [.........]
remove=6 skip=2 next copy
new: ...] C.J C.J [.........]
```
---
```cpp
void elf::riscvFinalizeRelax(int passes) {
...
for (OutputSection *osec : outputSections) {
if (!(osec->flags & SHF_EXECINSTR))
continue;
for (InputSection *sec : getInputSections(*osec, storage)) {
RISCVRelaxAux &aux = *sec->relaxAux;
if (!aux.relocDeltas)
continue;
Allocate space for the new section content to `p`;
sec->rawData = makeArrayRef(p, newSize);
// Update section content: remove NOPs for R_RISCV_ALIGN and rewrite
// instructions for relaxed relocations.
for (size_t i = 0, e = rels.size(); i != e; ++i) {
uint32_t remove = aux.relocDeltas[i] - delta;
delta = aux.relocDeltas[i];
if (remove == 0)
continue;
// Copy from last location to the current relocated location.
const Relocation &r = rels[i];
uint64_t size = r.offset - offset;
memcpy(p, old.data() + offset, size);
p += size;
```
---
```cpp
// For R_RISCV_ALIGN, we will place `offset` in a location (among NOPs)
// to satisfy the alignment requirement. If `remove` is a multiple of 4,
// it is as if we have skipped some NOPs. Otherwise we are in the middle
// of a 4-byte NOP, and we need to rewrite the NOP sequence.
int64_t skip = 0;
if (r.type == R_RISCV_ALIGN) {
if (remove % 4 != 0) {
skip = r.addend - remove;
Rewrite `skip` bytes with nop and an optional trailing c.nop;
}
} else if (RelType newType = aux.relocTypes[i]) {
Rewrite code sequence;
}
p += skip;
offset = r.offset + skip + remove;
}
memcpy(p, old.data() + offset, old.size() - offset);
Subtract the previous relocDeltas value from the relocation offset.
For a pair of R_RISCV_CALL/R_RISCV_RELAX with the same offset, decrease
their r_offset by the same delta.
}
}
}
```
---
# Next steps
* lui relaxation
* TLSDESC and its relaxation (on my plate)
* Unallocatable global pointer
---
# Acknowledgement
* 陳枝懋 (Chih-Mao Chen), Jessica Clarke, Greg McGary, Peter Smith
* Kito Cheng, Felix Yan, Xinlong Wu, 汪辰