LLVM 21.1 has been released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/21.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches not authored by me. I'll delve into some of the key changes.
Benchmarking compression programs
tl;dr: https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7 is a single-file Ruby program that downloads and compiles multiple compression utilities, benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.
Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:
- Low ratio, high speed: lz4, snappy, Oodle Selkie.
- Medium ratio, medium speed: zlib, zstd, brotli, Oodle Kraken.
- High ratio, low speed: LZMA, bzip2, bzip3, bsc, zpaq, kanzi, Oodle Leviathan.
Low ratio: Codecs in this category prioritize speed above all else. Compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage such as solid-state drives. These codecs typically produce byte-aligned output and often skip the final step of entropy encoding, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.
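To make this concrete, here is a minimal round trip through lz4's one-shot C API (a sketch, assuming liblz4 is installed; error handling is kept to a bare minimum):

```cpp
#include <lz4.h>

#include <cstdio>
#include <string>
#include <vector>

int main() {
  // Highly redundant sample input; a real benchmark would read a file.
  std::string src(1 << 20, 'a');
  std::vector<char> comp(LZ4_compressBound(static_cast<int>(src.size())));

  // One-shot, byte-oriented compression at the default (fast) level.
  int csize = LZ4_compress_default(src.data(), comp.data(),
                                   static_cast<int>(src.size()),
                                   static_cast<int>(comp.size()));
  if (csize <= 0)
    return 1;

  // Bounds-checked decompression into a buffer of the known original size.
  std::vector<char> out(src.size());
  int dsize = LZ4_decompress_safe(comp.data(), out.data(), csize,
                                  static_cast<int>(out.size()));
  std::printf("%zu -> %d -> %d bytes\n", src.size(), csize, dsize);
  return dsize == static_cast<int>(src.size()) ? 0 : 1;
}
```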
Medium ratio: This is the sweet spot for many tasks. These codecs achieve a better compression ratio by employing entropy encoding, usually Huffman coding.
zstd has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).
High ratio: These codecs are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times and large memory usage. They are perfect for archival purposes or data distribution where files are compressed once and decompressed infrequently. Codecs in this category typically have three important components:
- Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.
- Prediction model: This model anticipates the next piece of data based on what has already been processed.
- Entropy encoding: Traditional codecs use an arithmetic encoder, which is increasingly replaced by the more efficient range variant of Asymmetric Numeral Systems (rANS).
Some projects apply neural network models, such as recurrent neural networks, long short-term memory, and Transformers, to the prediction model. They are usually very slow.
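To make the "transforms" component above concrete, here is a hedged sketch of a simple delta filter (not lifted from any particular codec, but in the same spirit as the branch/call/jump filters): it rewrites the data so that the later modeling and entropy stages see more repetition.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replace each byte with its difference from the previous one. Slowly varying
// data (sampled audio, sorted integers, ...) turns into runs of small values,
// which the prediction model and entropy coder handle much better.
std::vector<uint8_t> deltaEncode(const std::vector<uint8_t> &in) {
  std::vector<uint8_t> out(in.size());
  uint8_t prev = 0;
  for (size_t i = 0; i < in.size(); ++i) {
    out[i] = static_cast<uint8_t>(in[i] - prev);
    prev = in[i];
  }
  return out;
}

// The exact inverse: transforms must be lossless so decompression can undo them.
std::vector<uint8_t> deltaDecode(const std::vector<uint8_t> &in) {
  std::vector<uint8_t> out(in.size());
  uint8_t prev = 0;
  for (size_t i = 0; i < in.size(); ++i) {
    prev = static_cast<uint8_t>(prev + in[i]);
    out[i] = prev;
  }
  return out;
}
```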
This categorization is loose. Many modern programs offer a wide range of compression levels that allow them to essentially span multiple categories. For example, a high-level zstd compression can achieve a ratio comparable to xz (a high-compression codec) by using more RAM and CPU. While zstd's compression speed or ratio generally falls a bit short of xz's at such settings, its decompression speed is often much faster.
Benchmarking
I want to benchmark the single worker performance of a few compression programs:
- lz4: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's Snappy.
- zstd: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. However, the other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the Accept-Encoding: zstd HTTP header. Decompression memory usage is very low. (See the API sketch after this list.)
- brotli: Uses a combination of LZ77, a 2nd-order context model, Huffman coding, and a static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by zstd. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. brotli is supported by the Accept-Encoding: br HTTP header. Decompression memory usage is quite low.
- bzip3: Combines BWT, RLE, and LZP and uses an arithmetic encoder. Memory usage is large.
- xz: LZMA2 with a few filters. The filters must be enabled explicitly.
- lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower, and memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed stdint.h build errors, and installed lzhamtest. The command line program lzhamtest should really be renamed to lzham.
- zpaq: Functions as a command-line archiver supporting multiple files. It combines context mixing with an arithmetic encoder but operates very slowly.
- kanzi: Offers a wide variety of transforms and entropy encoders, which is unusual for a compression program. For compressing enwik8, it is Pareto superior to xz in speed and ratio, but decompression is slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.
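As referenced in the zstd entry above, here is a minimal one-shot round trip with libzstd's stable C API (a sketch; production code should check for ZSTD_CONTENTSIZE_UNKNOWN/ZSTD_CONTENTSIZE_ERROR and report errors properly):

```cpp
#include <zstd.h>

#include <cstdio>
#include <string>
#include <vector>

int main() {
  std::string src(1 << 20, 'a');  // sample input
  std::vector<char> comp(ZSTD_compressBound(src.size()));

  // One-shot compression. Level 3 is the default; 19 trades CPU and RAM for ratio.
  size_t csize = ZSTD_compress(comp.data(), comp.size(), src.data(), src.size(), 19);
  if (ZSTD_isError(csize))
    return 1;

  // The frame header stores the decompressed size, enabling one-shot decoding.
  unsigned long long rsize = ZSTD_getFrameContentSize(comp.data(), csize);
  std::vector<char> out(rsize);
  size_t dsize = ZSTD_decompress(out.data(), out.size(), comp.data(), csize);
  if (ZSTD_isError(dsize))
    return 1;

  std::printf("%zu -> %zu -> %zu bytes\n", src.size(), csize, dsize);
}
```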
I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a cstdio header issue.
Many modern compressors are parallel by default. I have to disable this behavior by using options like -T1. Still, zstd uses a worker thread for I/O overlap, but I don't bother with --single-thread.
To ensure fairness, each program is built with consistent compiler optimizations, such as -O3 -march=native.
Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics, including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:
- Adding new compressors is easy: just modify COMPRESSORS.
- Benchmark results are cached in files named cache_$basename_$digest.json, allowing reuse of previous runs for the same input file.
- Adding a new compression level does not invalidate existing benchmark results for other levels.
- The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the hsl function in CSS).
The single-file Ruby program is available at https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7
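For readers curious how execution time and memory usage can be measured when spawning an external compressor, here is an illustrative C++ sketch using plain POSIX APIs (this is not how the Ruby script is implemented; it only mirrors the idea, and the command line is a hypothetical example):

```cpp
#include <chrono>
#include <cstdio>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  // Hypothetical command; a real benchmark would build this per compressor/level.
  const char *argv[] = {"zstd", "-19", "-k", "-f", "enwik8", nullptr};

  auto start = std::chrono::steady_clock::now();
  pid_t pid = fork();
  if (pid == 0) {
    execvp(argv[0], const_cast<char *const *>(argv));
    _exit(127);  // exec failed
  }
  int status;
  waitpid(pid, &status, 0);
  std::chrono::duration<double> wall = std::chrono::steady_clock::now() - start;

  // Resource usage of terminated children; ru_maxrss is in KiB on Linux.
  struct rusage ru;
  getrusage(RUSAGE_CHILDREN, &ru);
  std::printf("wall %.2f s, maxrss %ld KiB, exit %d\n", wall.count(),
              ru.ru_maxrss, WEXITSTATUS(status));
}
```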
Limitation
A single run might not be representative.
Running the executable incurs initialization overhead, which would be amortized in a library-based setup. However, using libraries would make updating them more difficult.
Demo
ruby bench.rb enwik8
Many programs exhibit a stable decompression speed (uncompressed size / decompression time), and it is typically slightly higher at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file gives the decompression algorithm fewer bytes to process, resulting in faster decompression. Some programs, like zpaq and kanzi, use different algorithms that can result in significantly different decompression speeds.
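A rough illustration with made-up numbers: suppose a decoder consumes compressed input at about 500 MB/s. A 100 MB file squeezed to 25 MB is then decoded in roughly 0.05 s (about 2000 MB/s of output), while a weaker 40 MB compression of the same file needs about 0.08 s (about 1250 MB/s of output), so the better-compressed file appears to decompress faster.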
xz -9 doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.
From install/include/lzma/container.h:
"For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased."
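For a rough sense of scale (based on xz's documented defaults; worth double-checking against your xz version): in threaded mode the default block size is three times the LZMA2 dictionary size, and -9 uses a 64 MiB dictionary, giving a 192 MiB block size. A file under ~100 MiB therefore fits into a single block, so only one worker compresses it, and each worker needs on the order of 3 × 192 MiB ≈ 576 MiB.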
Understanding alignment - from source to object file
Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is done to meet the requirements of the programming language, the ABI, or the underlying hardware. Misaligned memory accesses might be expensive, or might even trap, on certain architectures.
This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.
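As a small preview of the pipeline the post walks through, consider an over-aligned global (a sketch; the exact IR and directives vary with the compiler version and target):

```cpp
// The language guarantees that the address of `lut` is a multiple of 64.
alignas(64) int lut[16];

// An over-aligned type: every `Vec` object is 32-byte aligned.
struct alignas(32) Vec { float v[8]; };
Vec gv;

// With Clang, `lut` shows up in LLVM IR roughly as
//   @lut = global [16 x i32] zeroinitializer, align 64
// the assembly output precedes the symbol with an alignment directive such as
//   .p2align 6
// and the object file records the requirement in the containing section's
// sh_addralign together with the symbol's offset within that section.
```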
LLVM integrated assembler: Improving sections and symbols
My previous post, LLVM integrated assembler: Improving expressions and relocations, delved into enhancements made to LLVM's expression resolving and relocation generation. This post covers recent refinements to MC, focusing on sections and symbols.
LLVM integrated assembler: Engineering better fragments
In my previous assembler posts, I discussed improvements to expression resolving and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:
- Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
- Section layout phase: Once fragments are built, the assembler assigns offsets to them and finalizes the span-dependent content.
- Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.
GCC 13.3.0 miscompiles LLVM
For years, I've been involved in updating LLVM's MC layer. A recent journey led me to eliminate the FK_PCRel_ fixup kinds:
MCFixup: Remove FK_PCRel_
The generic FK_Data_ fixup kinds handle both absolute and PC-relative fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the backend has to handle PC-relative FK_Data_. However, the existence of FK_PCRel_ encouraged backends to implement it as a separate fixup type, leading to redundant and error-prone code. Removing FK_PCRel_ simplifies the overall fixup mechanism.
LLVM integrated assembler: Improving expressions and relocations
My previous post, LLVM integrated assembler: Improving MCExpr and MCValue, delved into enhancements made to LLVM's internal MCExpr and MCValue representations. This post covers recent refinements to MC, focusing on expression resolving and relocation generation.
LLVM integrated assembler: Improving MCExpr and MCValue
In my previous post, Relocation Generation in Assemblers, I explored some key concepts behind LLVM’s integrated assemblers. This post dives into recent improvements I’ve made to refine that system.
The LLVM integrated assembler handles fixups and relocatable expressions as distinct entities. Relocatable expressions, in particular, are encoded using the MCValue class, which originally looked like this:
class MCValue {
Relocation generation in assemblers
This post explores how the GNU Assembler and the LLVM integrated assembler generate relocations, an important step in producing a relocatable file. Relocations identify parts of instructions or data that cannot be fully determined during assembly because they depend on the final memory layout, which is only established at link time or load time. These are essentially placeholders that will be filled in (typically with absolute addresses or PC-relative offsets) during the linking process.
Relocation generation: the basics
Symbol references are the primary candidates for relocations. For instance, in the x86-64 instruction movl sym(%rip), %eax (GNU syntax), the assembler calculates the displacement between the program counter (PC) and sym. This distance affects the instruction's encoding and typically triggers an R_X86_64_PC32 relocation, unless sym is a local symbol defined within the current section.
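For readers who prefer to start from C/C++, a translation unit along these lines typically produces such an access (a sketch; the exact instruction and relocation depend on the code model and on whether position-independent code is requested):

```cpp
// extern.cc
extern int sym;  // defined in some other translation unit

int load() {
  // With the default small code model, compilers commonly emit a RIP-relative
  // load like `movl sym(%rip), %eax`, for which the assembler records an
  // R_X86_64_PC32 relocation against `sym`. When compiling as position-
  // independent code, the access typically goes through the GOT instead,
  // because `sym` may be preemptible.
  return sym;
}
```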
Both the GNU assembler and LLVM integrated assembler utilize multiple passes during assembly, with several key phases relevant to relocation generation:
Compiling C++ with the Clang API
This post describes how to compile a single C++ source file to an object file with the Clang API. Here is the code. It behaves like a simplified clang executable that handles -c and -S.