MaskRay

2025-09-11

2025-01

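Where is mold faster than lld?

It has fully parallel symbol resolution, but at the cost of lower single-core performance and higher total CPU time (which may not pay off when many link processes run concurrently).

Relocation scanning is somewhat faster: error handling is simplified, and each architecture supports only one of REL and RELA. Range extension thunks are incomplete. lld supports REL, RELA, and CREL for any given architecture, which incurs some virtual function overhead.

--gc-sections is simplified: it is conservative and handles fewer features, and it is parallelized with oneTBB.

Assigning sections is somewhat faster. Since mold supports neither linker script SECTIONS nor some of the trickier SHF_LINK_ORDER cases, it can take shortcuts.

Finalizing synthetic sections and writing the output are somewhat faster. Both are parallelized with oneTBB, whose good scheduler distributes CPU resources evenly when processing synthetic sections or writing a large number of input sections. lld has to make do with the poor man's llvm/Support/Parallel.h.

Most linker script syntax is unsupported, so the symbol representation can be somewhat simpler.

It uses the fork trick (unless --no-fork), which is arguably benchmark gaming. It may also make it hard for the linker's parent process to estimate the linker's resource consumption.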
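A minimal sketch of the fork trick, assuming a POSIX system. This is not mold's code: `run_detached`, `demo_link`, and `demo_cleanup` are hypothetical names. The parent returns as soon as the child reports over a pipe that the output is complete, so whoever launched the linker never observes the child's remaining cleanup time.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins for "write the output file" and "tear down". */
void demo_link(void) { /* produce the output here */ }
void demo_cleanup(void) { /* e.g. munmap/free of large buffers */ }

/* Returns 0 in the parent once the child signals completion, -1 on error.
   The child never returns from this function. */
int run_detached(void (*do_link)(void), void (*slow_cleanup)(void)) {
  int fds[2];
  if (pipe(fds) != 0)
    return -1;
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    /* Child: does the real work, then notifies the parent. */
    close(fds[0]);
    do_link();
    char ok = 1;
    if (write(fds[1], &ok, 1) != 1)
      _exit(1);
    close(fds[1]);
    slow_cleanup(); /* nobody waits for this part */
    _exit(0);
  }
  /* Parent: waits for the notification byte, not for the child to exit. */
  close(fds[1]);
  char ok = 0;
  ssize_t n = read(fds[0], &ok, 1);
  close(fds[0]);
  return (n == 1 && ok == 1) ? 0 : -1;
}
```

A build system that measures the wall time or rusage of the process it spawned only sees the parent here, which is the accounting problem mentioned above.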

Almost all functions are templated with template <typename E>, which reduces virtual function overhead but, in turn, greatly increases code size.

https://maximullaris.com/awk_tech_notes.html

"AWK’s main goal was to be extremely terse yet productive language well suited for one-liners."

2025-02

https://github.com/yosefk/funtrace describes a small function-tracing runtime. funtrace.cpp is the runtime and is linked together with the program. It hooks the entry/return callbacks emitted by -pg and -finstrument-functions instrumentation (fentry/return/__cyg_profile_func_enter), recording the function address and the x86 __rdtsc() value. funtrace2viz/src/main.rs, written in Rust, then converts the records to the Chrome trace event format.
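A minimal sketch of the -finstrument-functions half of this idea, not funtrace's actual code. `trace_buf`, `trace_len`, `TRACE_CAP`, and `read_ticks` are hypothetical names, the buffer is not thread-safe, and the non-x86 fallback clock is my assumption. Only the two hook names and their signatures come from GCC/Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* The hooks themselves must not be instrumented, or they would recurse. */
#define NO_INSTR __attribute__((no_instrument_function))

enum { TRACE_CAP = 1 << 12 }; /* hypothetical ring-buffer capacity */

struct trace_event {
  void *fn;       /* address of the instrumented function */
  uint64_t ticks; /* timestamp counter at entry/exit */
  int is_entry;   /* 1 for entry, 0 for exit */
};

struct trace_event trace_buf[TRACE_CAP];
size_t trace_len; /* total events recorded; index wraps modulo TRACE_CAP */

NO_INSTR static uint64_t read_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
  return __builtin_ia32_rdtsc();
#else
  /* Assumption: fall back to a C11 wall clock on other targets. */
  struct timespec ts;
  timespec_get(&ts, TIME_UTC);
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

NO_INSTR static void record(void *fn, int is_entry) {
  struct trace_event *e = &trace_buf[trace_len++ % TRACE_CAP];
  e->fn = fn;
  e->ticks = read_ticks();
  e->is_entry = is_entry;
}

/* Called by code compiled with -finstrument-functions. */
NO_INSTR void __cyg_profile_func_enter(void *fn, void *call_site) {
  (void)call_site;
  record(fn, 1);
}

NO_INSTR void __cyg_profile_func_exit(void *fn, void *call_site) {
  (void)call_site;
  record(fn, 0);
}
```

Compile this file without -finstrument-functions and the rest of the program with it; a post-processing tool would then symbolize `fn` and pair entry/exit events.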

https://herecomesthemoon.net/2025/01/type-inference-in-rust-and-cpp/ mentions that Rust can adopt a Hindley-Milner type system because of its:

no function overloads
lack of implicit conversions (except limited and specific conversions like lifetime shortening and references to pointers)
no inheritance
no specialization

C++, on the other hand, has all of these features (overloading, implicit conversions, inheritance, specialization), which makes HM-style inference infeasible.

To minimize output size even without ld --gc-sections (https://maskray.me/blog/2021-02-28-linker-garbage-collection), libc implementations often aggressively separate function variants into individual files. E.g. asprintf.c sprintf.c vasprintf.c vsprintf.c vsnprintf.c

The archive processing semantics (https://maskray.me/blog/2021-06-20-symbol-processing#archive-processing) ensures that while libc.a:x.o is extracted, libc.a:y.o may remain unneeded. The linker just doesn't see the input sections from libc.a:y.o, so section-based garbage collection is not needed.
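A toy model of that extraction rule, as a sketch only. The dependency edges between the `*printf` members are hypothetical (real libc implementations differ), each member defines exactly one symbol, and archive ordering subtleties are ignored. `struct member`, `demo_libc`, and `extract_members` are names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct member {
  const char *name;     /* archive member, e.g. "sprintf.o" */
  const char *defines;  /* the one symbol this member defines */
  const char *needs[2]; /* undefined references, NULL-terminated */
  bool extracted;
};

/* Hypothetical dependency graph, for illustration only. */
struct member demo_libc[] = {
    {"asprintf.o", "asprintf", {"vasprintf", NULL}, false},
    {"vasprintf.o", "vasprintf", {"vsnprintf", NULL}, false},
    {"sprintf.o", "sprintf", {"vsprintf", NULL}, false},
    {"vsprintf.o", "vsprintf", {"vsnprintf", NULL}, false},
    {"vsnprintf.o", "vsnprintf", {NULL, NULL}, false},
};

enum { MAX_SYMS = 32 };

/* Extract every member that defines a currently undefined symbol, starting
   from one undefined reference `root`. Returns the number extracted. */
int extract_members(struct member *ar, int n, const char *root) {
  const char *work[MAX_SYMS];
  int top = 0, extracted = 0;
  work[top++] = root;
  while (top > 0) {
    const char *sym = work[--top];
    for (int i = 0; i < n; i++) {
      if (ar[i].extracted || strcmp(ar[i].defines, sym) != 0)
        continue;
      ar[i].extracted = true;
      extracted++;
      /* The extracted member's own references become new undefined symbols. */
      for (int j = 0; j < 2 && ar[i].needs[j]; j++)
        if (top < MAX_SYMS)
          work[top++] = ar[i].needs[j];
    }
  }
  return extracted;
}
```

Under these assumed edges, a program that only references `sprintf` pulls in sprintf.o, vsprintf.o, and vsnprintf.o, while asprintf.o and vasprintf.o never reach the linker's input section list.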

bzip2 seems like RLE+BWT+Huffman with quite small block sizes (which made sense when machines did not have much RAM). bzip3 tries both RLE and LZP (using a rather large minimum match length: LZP_MIN_MATCH of 40). Given the inherent slowness of the BWT, replacing Huffman with an arithmetic coder (which has a better compression ratio) seems like a logical optimization. tANS is used in zstd and LZFSE, but rANS has since become more popular in newer codecs. Anyhow, the bottleneck is likely in the BWT.

#define LZP_DICTIONARY 18
#define LZP_MIN_MATCH 40

Further tuning of the LZP predictor might be possible, but any changes at this stage would impact codec compatibility.

The BWT step contributes significantly to the slow decompression speed of bzip2 and bzip3. LZMA and zstd, even at their higher compression levels, decompress faster.
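For reference, a naive forward BWT that sorts all rotations. This is a sketch to show what the transform computes, not how bzip2 or bzip3 implement it (real codecs use much faster suffix sorting). `bwt_naive` and the two file-scope variables are hypothetical names, and the function is not reentrant because `qsort` takes no context argument.

```c
#include <stdlib.h>
#include <string.h>

/* Context for the comparator. */
static const unsigned char *bwt_text;
static size_t bwt_n;

/* Compare the rotations starting at offsets *a and *b. */
static int cmp_rot(const void *a, const void *b) {
  size_t i = *(const size_t *)a, j = *(const size_t *)b;
  for (size_t k = 0; k < bwt_n; k++) {
    unsigned char x = bwt_text[(i + k) % bwt_n];
    unsigned char y = bwt_text[(j + k) % bwt_n];
    if (x != y)
      return x < y ? -1 : 1;
  }
  return 0;
}

/* Writes the last column of the sorted rotation matrix to out (n bytes,
   n > 0) and returns the row holding the original string, or (size_t)-1
   if allocation fails. */
size_t bwt_naive(const char *in, size_t n, char *out) {
  size_t *rot = malloc(n * sizeof *rot);
  if (!rot)
    return (size_t)-1;
  for (size_t i = 0; i < n; i++)
    rot[i] = i;
  bwt_text = (const unsigned char *)in;
  bwt_n = n;
  qsort(rot, n, sizeof *rot, cmp_rot);
  size_t primary = 0;
  for (size_t i = 0; i < n; i++) {
    if (rot[i] == 0)
      primary = i;
    out[i] = in[(rot[i] + n - 1) % n]; /* char preceding the rotation start */
  }
  free(rot);
  return primary;
}
```

For the input "banana" this yields "nnbaaa" with primary index 3. The inverse transform has to chase one pointer per output byte, which is a large part of why BWT-based decompression is slow.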

I haven't read bsc, another spiritual successor of bzip2. How does bzip3 compare with it? The source code does refer to bsc. https://encode.su/threads/3763-bsc-m03-(experimental-M03-sorting-compressor)

Achieved the milestone of 6000 commits in the llvm/llvm-project repository. I should identify gaps, learn more things, and pay less attention to commit counts.

2025-03

https://reviews.llvm.org/D23110 populated the generic assembly parser with MIPS expression modifiers (e.g. %pcrel_hi).

TLS handling in LLVM integrated assembler. AArch64 encodes the TLS kind in MCExpr: ImmVal = AArch64MCExpr::create(ImmVal, RefKind, getContext()); PowerPC encodes the TLS kind as MCSymbolRefExpr's VariantKind (encoded in MCSymbolRefExpr::SubclassData).

The 2010 LLVM MC commit https://github.com/llvm/llvm-project/commit/55992564152f0fce6758a4495cc39422f5e1cc94 introduced MCSymbolRefExpr::VariantKind with x86 relocation operators, but it's flawed - other expressions (e.g. MCBinaryExpr) need it too, and semantics get messy (e.g. (a@plt)-b). Many targets overload the generic interface. MCTargetExpr is a better fit, as AArch64 and RISC-V show. The lengthy list of PowerPC-specific VK_PPC_ entries is disheartening, though cleaning it up now feels like a daunting task.

Development on gold slowed considerably around a decade ago, following Cary's retirement. gold exhibits more issues on non-x86 architectures. I have extensive experience in this area: in 2018, I contributed to Google's migration from gold to lld.

lld's --warn-backrefs is an excellent tool for ensuring compatibility with GNU linkers. I played a key role in Google's effort to transition from gold to lld. Around late 2019, I observed that many executables could no longer link with gold due to archive ordering problems stemming from missing cc_library dependencies. In 2020, I addressed numerous dependency issues and implemented --warn-backrefs. About a year later, I found that executables could still be linked with gold, provided I omitted some options it didn't recognize. This compatibility would not have been achievable without the deployment of --warn-backrefs.

2025-04

expr@specifier parsing is a hack in LLVM AsmParser. In GNU Assembler's COFF targets, @ is allowed as part of identifiers. https://reviews.llvm.org/D1978

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38534 Optimizing noreturn nothrow functions by skipping callee-saved register spills is problematic. GCC's x86 port implemented the optimization, which caused debug information regressions with noreturn calls like abort(), necessitating the addition of -mnoreturn-no-callee-saved-registers for x86.

2025-08

POWER10 prefixed instructions may not cross 64-byte cache line boundaries. GNU Assembler and LLVM integrated assembler (https://reviews.llvm.org/D72570) enforce at least 64-byte section alignment. If a prefixed instruction would fall at offset 60 modulo 64, a 4-byte NOP is inserted before it. A clever mechanism ensures labels defined on the same source line are placed after the NOP.

.space 60
a:
b: c: paddi 1, 2, 8589934576, 0
d: nop


~/Dev/binutils/out/ppc64le/gas/as-new llvm/test/MC/PowerPC/ppc64-prefix-align.s -mpwr10
llvm-objdump -d --show-all-symbols

...
0000000000000000 <.text>:
...

000000000000003c <a>:
      3c: 00 00 00 60                   nop

0000000000000040 <b>:
0000000000000040 <c>:
      40: ff ff 01 06 f0 ff 22 38       paddi 1, 2, 8589934576, 0

0000000000000048 <d>:
      48: 00 00 00 60                   nop
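The padding rule can be sketched as a small helper. This is not the GNU Assembler or LLVM implementation: `prefixed_padding` and `place_prefixed` are hypothetical names, and the sketch assumes the offset is already 4-byte aligned and the section is at least 64-byte aligned, as described above.

```c
#include <stdint.h>

enum { PREFIXED_INSN_SIZE = 8, NOP_SIZE = 4, BOUNDARY = 64 };

/* Bytes of NOP padding required before an 8-byte prefixed instruction that
   would otherwise start at `offset`. With 4-byte-aligned offsets, only
   offset % 64 == 60 makes the instruction straddle a 64-byte boundary. */
unsigned prefixed_padding(uint64_t offset) {
  uint64_t in_line = offset % BOUNDARY;
  return in_line + PREFIXED_INSN_SIZE > BOUNDARY ? NOP_SIZE : 0;
}

/* Offset at which the prefixed instruction (and same-line labels) ends up. */
uint64_t place_prefixed(uint64_t offset) {
  return offset + prefixed_padding(offset);
}
```

For the example above, `place_prefixed(0x3c)` is 0x40, matching where paddi lands in the disassembly.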

---

https://sourceware.org/pipermail/binutils/2025-August/143620.html

Thanks for raising this topic, and thanks to Alan for removing CloudABI.

FTR FreeBSD dropped CloudABI in 2021
https://reviews.freebsd.org/D31923 and clang followed suit in 2023
(https://reviews.llvm.org/D158920).

It seems that many people expressed regret over the discontinuation of
CloudABI. https://val.packett.cool/blog/use-openat/ ("path.join
Considered Harmful, or openat() All The Things") says:

> Anyway, thankfully, what’s fulfilling the need for an ABI is WASI, the WebAssembly System Interface, which is… basically kinda sorta just a wasm32-cloudabi target if you look at it! (Well, with the whole Component Model thing that’s only going to be one aspect of it but still.) The WASI overview explicitly references CloudABI and Capsicum. And even the aforementioned research into Capsicumizing existing software in the form of libpreopen. In a way, we have won after all! :) The industry-hyped, Wasm-workgroup-blessed, by-all-compilers-supported ABI for POSIX-y applications is based on exactly these ideas.

Side note: it seems that the Android NDK is exploring alternatives to
WebAssembly for sandboxing, such as Lightweight Fault Isolation (LFI).
The performance of WebAssembly is bad.
LFI might need assembly-level shenanigans like Native Client.

---

Say we have nodes indexed from 0 to n-1.
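We can derive an implicit binary search tree by making node i a child of node `(i | (i+1)) & ~(2 << __builtin_ctz(~i))`, ignoring the link when that index is not below n.
Depending on n, the nodes form a single tree or a small forest (e.g. n=16 gives one tree rooted at 15, n=10 gives two rooted at 7 and 9, and n=21 gives three rooted at 15, 19, and 20).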

This implicit tree structure can be used as an interval tree. An interval tree is a data structure used to store a set of intervals and quickly find which ones contain a specific point.
To store an interval [l,r), use `(r & (-1u << 31-__builtin_clz(l^r))) - 1` to find the highest node within the interval, which serves as the designated storage location.

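```c
#include <stdio.h>

int main() {
  const int N = 16;
  // Emit the implicit tree in Graphviz DOT format.
  puts("digraph G {");
  for (int i = 0; i < N; i++) {
    // Parent of node i; nodes whose parent falls outside [0, N) are roots.
    int p = (i | (i+1)) & ~(2 << __builtin_ctz(~i));
    if (p < N)
      printf("%d -> %d\n", i, p);
  }
  puts("}");
  // Uncomment to print the storage node of each interval read from stdin.
  // int l, r;
  // while (scanf("%d%d", &l, &r) == 2) // [l, r), requires l < r
  //   printf("%d\n", (r & (-1u << 31-__builtin_clz(l^r))) - 1);
}
```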
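The two formulas can be wrapped as helpers to sketch how a stabbing query would find candidate nodes. `tree_parent`, `interval_home`, and `on_path_to_root` are hypothetical names, and the sketch assumes 32-bit `unsigned`, `l < r`, and small `n`. The key property is that every point in [l, r) has the interval's storage node on its path to the root, so a query for point x only needs to inspect the nodes on that path.

```c
#include <stdbool.h>

/* Parent of node i in the implicit tree (may be >= n, meaning i is a root). */
unsigned tree_parent(unsigned i) {
  return (i | (i + 1)) & ~(2u << __builtin_ctz(~i));
}

/* Highest node inside [l, r): where the interval is stored. Requires l < r. */
unsigned interval_home(unsigned l, unsigned r) {
  return (r & (-1u << (31 - __builtin_clz(l ^ r)))) - 1;
}

/* True if `node` lies on the path from x up to its root among nodes 0..n-1. */
bool on_path_to_root(unsigned x, unsigned node, unsigned n) {
  for (unsigned i = x; i < n; i = tree_parent(i))
    if (i == node)
      return true;
  return false;
}
```

Being on the path is necessary but not sufficient: node 6 also passes through node 3, yet 6 is outside [2, 5), so a real query would still compare x against each interval stored at the visited nodes.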
