2025-01
Where is mold faster than lld?

mold has fully parallel symbol resolution, but at the cost of lower single-core performance and higher total CPU time (when multiple link jobs run concurrently, this can be a net loss).

Relocation scanning is faster: error handling is simplified, and each architecture supports only one of REL and RELA. Range extension thunks are incomplete. lld supports REL, RELA, and CREL for every architecture, with some virtual function overhead.

--gc-sections is simplified: it is conservative and handles fewer features, and it is parallelized with oneTBB.

Assigning sections is faster. Since linker script SECTIONS commands and some awkward SHF_LINK_ORDER cases are unsupported, mold can take shortcuts.

Finalizing synthetic sections and writing the output are faster. mold uses oneTBB for parallelism, with a good scheduler that balances CPU resources between processing synthetic sections and writing large numbers of input sections, while lld can only use the poor man's llvm/Support/Parallel.h.

Most linker script syntax is unsupported. The symbol representation can also be simplified somewhat.

mold uses a fork trick (unless --no-fork). This is arguably a benchmarking game, and it can make it hard for the linker's parent process to estimate the linker's resource consumption.

Nearly every function is templated on template <typename E>, which reduces virtual function overhead but greatly increases code size.
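The template-vs-virtual trade-off can be sketched as follows (hypothetical type names, not mold's real code; mold's actual ELF-type parameters carry much more, e.g. relocation enums and word sizes): parameterizing on the target at compile time turns would-be virtual calls into direct, inlinable ones, at the cost of one instantiation per target.

```cpp
#include <string>

// Hypothetical ELF-type tags in the spirit of mold's template parameter E.
struct X86_64 { static constexpr bool is_rela = true;  static constexpr const char *name = "x86-64"; };
struct ARM32  { static constexpr bool is_rela = false; static constexpr const char *name = "arm"; };

// One copy of this function is instantiated per target, so the branch is
// resolved at compile time and there is no virtual dispatch. The cost is
// that every instantiation adds to code size.
template <typename E>
std::string reloc_form() {
  if constexpr (E::is_rela)
    return std::string(E::name) + ": RELA (explicit addends)";
  else
    return std::string(E::name) + ": REL (implicit addends)";
}
```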
https://maximullaris.com/awk_tech_notes.html
"AWK’s main goal was to be extremely terse yet productive language well suited for one-liners."
2025-02
https://github.com/yosefk/funtrace describes a small function tracing runtime. funtrace.cpp is the runtime, linked into the program. It hooks the fentry/return/__cyg_profile_func_enter entry points inserted by -pg and -finstrument-functions instrumentation, recording the function address and the x86 __rdtsc() value; funtrace2viz/src/main.rs, written in Rust, then converts the trace to the Chrome trace event format.
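A minimal sketch of such a runtime (my own toy, not funtrace's real code: std::chrono stands in for __rdtsc, and an unbounded vector stands in for funtrace's per-thread buffers). With -finstrument-functions, the compiler inserts calls to these two hooks around every function body.

```cpp
#include <chrono>
#include <vector>

// Trace event: function address, timestamp, enter/exit flag.
struct Event { void *fn; long long ts; bool enter; };
static std::vector<Event> g_trace;

static long long now() {
  // Portable stand-in for x86 __rdtsc().
  return std::chrono::steady_clock::now().time_since_epoch().count();
}

// The hooks themselves must not be instrumented, or they would recurse.
extern "C" {
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *) {
  g_trace.push_back({fn, now(), true});
}
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *) {
  g_trace.push_back({fn, now(), false});
}
}
// Link this into a program built with -finstrument-functions; a
// post-processor can then turn g_trace into Chrome trace event JSON.
```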
https://herecomesthemoon.net/2025/01/type-inference-in-rust-and-cpp/ mentions that Rust can adopt a Hindley-Milner type system because:
- no function overloads
- lack of implicit conversions (except limited and specific conversions like lifetime shortening and references to pointers)
- no inheritance
- no specialization
C++, on the other hand, has all of these features, which makes HM inference infeasible.
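One concrete way overloading defeats inference in C++ (a toy example, not from the linked post): the name of an overloaded function has no single type, so nothing can be inferred from it alone, and the caller must commit to an argument type.

```cpp
#include <string>

int arity(int)         { return 1; }  // "arity" is now an overload set,
int arity(std::string) { return 2; }  // not a value with one type.

// auto f = arity;                       // error: cannot deduce which overload
auto f = [](int x) { return arity(x); };  // must pick an argument type first

// In Rust, a (non-overloaded) fn item has exactly one type, so
// Hindley-Milner inference can propagate it freely.
```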
To minimize output size even without ld --gc-sections (https://maskray.me/blog/2021-02-28-linker-garbage-collection), libc implementations often aggressively separate function variants into individual files. E.g. asprintf.c sprintf.c vasprintf.c vsprintf.c vsnprintf.c
The archive processing semantics (https://maskray.me/blog/2021-06-20-symbol-processing#archive-processing) ensure that while libc.a:x.o is extracted, libc.a:y.o may remain unneeded. The linker simply never sees the input sections from libc.a:y.o, so section-based garbage collection is not needed.
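The rule can be sketched as a toy resolver (hypothetical structures and names; real linkers do a left-to-right scan with lazy symbols rather than this fixed-point loop): a member is extracted only when it defines a currently-undefined symbol, and extraction can in turn pull in more members, while the rest are never opened at all.

```cpp
#include <set>
#include <string>
#include <vector>

struct Member {
  std::string name;
  std::vector<std::string> defs;    // symbols this member defines
  std::vector<std::string> undefs;  // symbols this member references
};

// Toy archive semantics: extract members until no undefined symbol can be
// satisfied. Members that never satisfy an undefined symbol (like
// libc.a:y.o above) contribute no input sections at all.
std::set<std::string> extract(std::set<std::string> undef,
                              const std::vector<Member> &archive) {
  std::set<std::string> extracted, defined;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const Member &m : archive) {
      if (extracted.count(m.name))
        continue;
      bool needed = false;
      for (const std::string &d : m.defs)
        if (undef.count(d))
          needed = true;
      if (!needed)
        continue;
      extracted.insert(m.name);
      changed = true;
      for (const std::string &d : m.defs) {
        defined.insert(d);
        undef.erase(d);
      }
      for (const std::string &u : m.undefs)
        if (!defined.count(u))
          undef.insert(u);  // extraction may create new undefined symbols
    }
  }
  return extracted;
}
```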
bzip2 looks like RLE+BWT+Huffman with quite small block sizes (a meaningful choice back when machines did not have much RAM). bzip3 tries both RLE and LZP (using a rather large minimum match length: LZP_MIN_MATCH of 40). Given the inherent slowness of the BWT, replacing Huffman with an arithmetic coder (which has a better compression ratio) seems like a logical optimization. tANS is used in zstd and LZFSE, but rANS later became more popular in newer codecs. Anyhow, the bottleneck is likely in the BWT.
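For reference, the BWT core itself is simple even if fast suffix sorting is not (a naive O(n² log n) sketch of the textbook transform; real codecs use suffix-array construction instead):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Naive Burrows-Wheeler transform: sort all rotations of the input
// (which should end in a unique sentinel like '$') and take the last
// column. Similar contexts end up adjacent, which is what makes the
// output compressible by the later RLE/entropy-coding stages.
std::string bwt(const std::string &s) {
  std::vector<std::string> rot;
  for (size_t i = 0; i < s.size(); i++)
    rot.push_back(s.substr(i) + s.substr(0, i));
  std::sort(rot.begin(), rot.end());
  std::string out;
  for (const std::string &r : rot)
    out += r.back();
  return out;
}
```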
#define LZP_DICTIONARY 18
Further tuning of the LZP predictor might be possible, but any changes at this stage would impact codec compatibility.
The BWT step is a major contributor to the slow decompression speed of bzip2 and bzip3. LZMA, and zstd at higher compression levels, offer faster decompression.
I haven't read bsc, another spiritual successor to bzip2. How does bzip3 compare with it? bzip3's source code does refer to bsc. https://encode.su/threads/3763-bsc-m03-(experimental-M03-sorting-compressor)
Achieved the milestone of 6000 commits in the llvm/llvm-project repository. I should identify gaps, learn more things, and pay less attention to commit counts.
2025-03
https://reviews.llvm.org/D23110 populated the generic assembly parser with MIPS expression modifiers (e.g. %pcrel_hi).
TLS handling in the LLVM integrated assembler. AArch64 encodes the TLS kind in MCExpr:

ImmVal = AArch64MCExpr::create(ImmVal, RefKind, getContext());

PowerPC encodes the TLS kind as MCSymbolRefExpr's VariantKind (stored in MCSymbolRefExpr::SubclassData).
The 2010 LLVM MC commit https://github.com/llvm/llvm-project/commit/55992564152f0fce6758a4495cc39422f5e1cc94 introduced MCSymbolRefExpr::VariantKind with x86 relocation operators, but it's flawed: other expressions (e.g. MCBinaryExpr) need it too, and the semantics get messy (e.g. (a@plt)-b). Many targets overload the generic interface. MCTargetExpr is a better fit, as AArch64 and RISC-V show. The lengthy list of PowerPC-specific VK_PPC_ entries is disheartening, though cleaning it up now feels like a daunting task.