lld 17 ELF changes

LLVM 17 will be released soon. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/17.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.

  • When --threads= is not specified, the number of threads is now capped at 16. A large --threads= value can harm performance, especially with some system malloc implementations like glibc's. (D147493)
  • --remap-inputs= and --remap-inputs-file= are added to remap input files. (D148859)
  • --lto= is now available to support clang -funified-lto (D123805)
  • --lto-CGO[0-3] is now available to control CodeGenOpt::Level independent of the LTO optimization level. (D141970)
  • --check-dynamic-relocations= is now correct for 32-bit targets when the addend is larger than 0x80000000. (D149347)
  • --print-memory-usage has been implemented for memory regions. (D150644)
  • SHF_MERGE, --icf=, and --build-id=fast have switched to 64-bit xxh3. (D154813)
  • Quoted output section names can now be used in linker scripts. (#60496, https://github.com/llvm/llvm-project/issues/60496)
  • MEMORY can now be used without a SECTIONS command. (D145132)
  • REVERSE can now be used in input section descriptions to reverse the order of input sections. (D145381)
  • Program header assignment can now be used within OVERLAY. This functionality was accidentally lost in 2020. (D150445)
  • Operators ^ and ^= can now be used in linker scripts.
  • LoongArch is now supported.
  • DT_AARCH64_MEMTAG_* dynamic tags are now supported. (D143769)
  • AArch32 port now supports BE-8 and BE-32 modes for big-endian. (D140201) (D140202) (D150870)
  • R_ARM_THM_ALU_ABS_G* relocations are now supported. (D153407)
  • .ARM.exidx sections may start at non-zero output section offset. (D148033)
  • Arm Cortex-M Security Extensions is now implemented. (D139092)
  • BTI landing pads are now added to PLT entries accessed by range extension thunks or relative vtables. (D148704) (D153264)
  • AArch64 short range thunk has been implemented to mitigate the performance loss of a long range thunk. (D148701)
  • R_AVR_8_LO8/R_AVR_8_HI8/R_AVR_8_HLO8/R_AVR_LO8_LDI_GS/R_AVR_HI8_LDI_GS have been implemented. (D147100) (D147364)
  • --no-power10-stubs now works for PowerPC64.
  • DT_PPC64_OPT is now supported. (D150631)
  • PT_RISCV_ATTRIBUTES is added to include the SHT_RISCV_ATTRIBUTES section. (D152065)
  • R_RISCV_PLT32 is added to support C++ relative vtables. (D143115)
  • RISC-V global pointer relaxation has been implemented. Specify --relax-gp to enable the linker relaxation. (D143673)
  • The symbol value of foo is correctly handled when --wrap=foo and RISC-V linker relaxation are used. (D151768)
  • x86-64 large data sections are now placed away from code sections to alleviate relocation overflow pressure. (D150510)

When using glibc malloc on a machine with a large std::thread::hardware_concurrency (say, more than 16), parallel relocation scanning can be considerably slower without the --threads=16 throttling.
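The throttling described above can be sketched as follows. This is a minimal illustration, not lld's actual code: the helper names cappedThreads and defaultThreadCount and the constant kDefaultThreadCap are hypothetical, and the fallback to one thread when the hardware concurrency is unknown is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical constant: the default cap applied when --threads= is absent.
constexpr unsigned kDefaultThreadCap = 16;

// Clamp a reported hardware thread count to the default cap.
// std::thread::hardware_concurrency() may return 0 when the value is
// unknown; this sketch falls back to a single thread in that case.
constexpr unsigned cappedThreads(unsigned hardwareThreads) {
  return hardwareThreads == 0 ? 1u
                              : std::min(hardwareThreads, kDefaultThreadCap);
}

// Thread count to use when the user gave no explicit --threads= value.
unsigned defaultThreadCount() {
  return cappedThreads(std::thread::hardware_concurrency());
}
```

With this shape, a 64-thread machine gets 16 threads by default, while an explicit --threads=64 would bypass the helper entirely.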

Unless an extension is too specific to LLVM internals (e.g. --lto-*), I usually try to get it accepted by the binutils community as well. The feature request for --remap-inputs= and --remap-inputs-file= was a success story: GNU ld 2.41 implements both options.
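To give a feel for what input remapping does, here is a minimal sketch. It assumes the remap file consists of `from-glob=to-file` lines plus `#` comment lines, and it simplifies the left-hand side to a literal path instead of a glob. The type RemapTable and the functions parseRemapFile and remap are hypothetical names for illustration, not lld's implementation.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Ordered list of (from, to) pairs. The real options accept glob patterns
// on the left-hand side; this sketch only handles literal paths.
using RemapTable = std::vector<std::pair<std::string, std::string>>;

// Parse remap-file text: one "from=to" entry per line, '#' starts a comment.
RemapTable parseRemapFile(const std::string &text) {
  RemapTable table;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#')
      continue;
    std::string::size_type eq = line.find('=');
    if (eq == std::string::npos)
      continue; // a real linker would report an error here
    table.emplace_back(line.substr(0, eq), line.substr(eq + 1));
  }
  return table;
}

// Return the replacement for path, or path itself if no entry matches.
std::string remap(const RemapTable &table, const std::string &path) {
  for (const auto &entry : table)
    if (entry.first == path)
      return entry.second;
  return path;
}
```

For example, a table parsed from the single line `a.o=b.o` turns the input `a.o` into `b.o` and leaves every other input untouched.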

The PT_RISCV_ATTRIBUTES output is still not quite right. I also question its usefulness. Unfortunately, at this stage, it's difficult to get rid of it.

This cycle has a surprising number of new features, and I have spent lots of spare time reviewing them to ensure that they are robust and properly tested. Most stuff is completely unrelated to my day job.

There are quite a few AArch32 changes from Arm engineers, primarily about big-endian support and Cortex-M Security Extensions.

I was firm that the RISC-V global pointer relaxation needs to be opt-in. I had a GNU ld --relax-gp patch last year and used this opportunity (the ld.lld feature proposal) to move GNU ld --relax-gp forward. In GNU ld the relaxation is unfortunately opt-out, but having an option is a step forward.
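The core range check behind global pointer relaxation can be sketched as below. A symbol can be addressed with a single gp-relative instruction only if its distance from the global pointer (the value of `__global_pointer$`) fits in a signed 12-bit immediate. The function name fitsGpRelative is hypothetical, and the sketch ignores alignment and section-layout concerns that a real linker has to consider.

```cpp
#include <cassert>
#include <cstdint>

// True if `symbol` is reachable as gp + imm, where imm is a signed 12-bit
// immediate in the range [-2048, 2047].
constexpr bool fitsGpRelative(uint64_t symbol, uint64_t gp) {
  // Unsigned subtraction wraps; the cast recovers the signed distance.
  int64_t offset = static_cast<int64_t>(symbol - gp);
  return offset >= -2048 && offset <= 2047;
}
```

Symbols outside this window keep their original, longer addressing sequence.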

This release adds support for LoongArch, a relatively new architecture that took inspiration from MIPS and RISC-V.

Speed

Unlike previous releases, this one brings only a minor performance improvement over lld 16. I added a simplified version of 64-bit xxh3 to the LLVMSupport library and used it in lld.

Linking a -DCMAKE_BUILD_TYPE=Debug build of clang 16:

% hyperfine --warmup 2 --min-runs 25 "numactl -C 20-27 "{/tmp/out/custom-16/bin/ld.lld,/tmp/out/custom-17/bin/ld.lld}" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 3.159 s ± 0.035 s [User: 7.089 s, System: 3.076 s]
Range (min … max): 3.095 s … 3.250 s 25 runs

Benchmark 2: numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 3.131 s ± 0.027 s [User: 6.851 s, System: 3.101 s]
Range (min … max): 3.080 s … 3.198 s 25 runs

Summary
'numactl -C 20-27 /tmp/out/custom-17/bin/ld.lld @response.txt --threads=8' ran
1.01 ± 0.01 times faster than 'numactl -C 20-27 /tmp/out/custom-16/bin/ld.lld @response.txt --threads=8'

The effect on the total link time is small. However, when I measure the share of the total link time spent in the hash function, that share has dropped to nearly one third of its previous value. On some workloads and machines the effect may be larger.
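Why a much cheaper hash barely moves the total can be seen with an Amdahl-style estimate. The sketch below assumes hashing used to take a fraction f of the link time and became k times cheaper while everything else stayed the same. The function name newRelativeTime and the sample values in the usage note are hypothetical and are not measurements from the benchmark above.

```cpp
#include <cassert>
#include <cmath>

// Link time after the change, relative to the old link time (1.0 = same).
// f: fraction of the old link time spent hashing (0 <= f <= 1).
// k: factor by which hashing became cheaper (k >= 1).
constexpr double newRelativeTime(double f, double k) {
  return (1.0 - f) + f / k;
}
```

For instance, with a hypothetical f = 0.015 and k = 3, the new link time is 0.99 of the old one, which is a speedup of roughly 1%.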