The .sframe format is a lightweight alternative to
.eh_frame and .eh_frame_hdr designed for
efficient stack
unwinding. By trading some functionality and flexibility for
compactness, SFrame achieves a significantly smaller size while
maintaining the essential unwinding capabilities needed by
profilers.
SFrame focuses on three fundamental elements for each function:
Canonical Frame Address (CFA): The base address for stack frame
calculations
Return address
Frame pointer
An .sframe section follows a straightforward layout:
Header: Contains metadata and offset information
Auxiliary header (optional): Reserved for future extensions
Function Descriptor Entries (FDEs): Array describing each
function
Frame Row Entries (FREs): Arrays of unwinding information per
function
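To make the layout concrete, here is a simplified C-style sketch of the header and FDE records. Field names and widths are approximations for illustration only; the binutils SFrame specification is the authoritative reference.

```cpp
#include <cstdint>

// Simplified sketch of the on-disk records (not the authoritative layout).
struct SFrameHeader {
  uint16_t magic;              // format magic number
  uint8_t  version;
  uint8_t  flags;
  uint8_t  abi_arch;           // ABI/architecture identifier
  int8_t   cfa_fixed_fp_offset;
  int8_t   cfa_fixed_ra_offset;
  uint8_t  auxhdr_len;         // size of the optional auxiliary header
  uint32_t num_fdes;           // number of Function Descriptor Entries
  uint32_t num_fres;           // number of Frame Row Entries
  uint32_t fre_len;            // total size of the FRE sub-section
  uint32_t fde_off;            // offset of the FDE array
  uint32_t fre_off;            // offset of the FRE sub-section
};

// One FDE per function; it points at the FREs that describe how to recover
// the CFA, frame pointer, and return address for each PC range.
struct SFrameFDE {
  int32_t  start_address;      // function start address
  uint32_t size;               // function size in bytes
  uint32_t start_fre_off;      // offset of the function's first FRE
  uint32_t num_fres;
  uint8_t  info;               // FRE type and related flags
};
```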
tl;dr https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7
is a single-file Ruby program that downloads and compiles multiple
compression utilities, then benchmarks their compression and
decompression performance on a specified input file, and finally generates
an HTML file with scatter charts. Scroll to the end to view example HTML
pages.
Compression algorithms can be broadly categorized into three groups
based on their typical compression ratio and decompression speed:
Low ratio, high speed: lz4, snappy, Oodle Selkie.
Medium ratio, medium speed: zlib, zstd, brotli, Oodle
Kraken.
High ratio, low speed: xz (LZMA), zpaq, Oodle Leviathan.
Low ratio: Codecs in this category prioritize speed
above all else. Their compression and decompression speeds are comparable.
They are designed to decompress so quickly that they don't introduce a
noticeable delay when reading data from storage like solid-state drives.
These codecs typically produce byte-aligned output and often skip the
final step of entropy encoding, which, while crucial for high
compression, is computationally intensive. They are excellent choices
for applications where latency is critical, such as kernel features like
zswap.
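As a concrete example of this category, here is a minimal one-shot round trip using the LZ4 C API (assuming liblz4 is installed; link with -llz4):

```cpp
// Minimal one-shot LZ4 round trip.
#include <lz4.h>

#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

int main() {
  std::string src(1 << 16, 'a');  // highly compressible input
  std::vector<char> dst(LZ4_compressBound((int)src.size()));
  int csize = LZ4_compress_default(src.data(), dst.data(),
                                   (int)src.size(), (int)dst.size());
  std::vector<char> out(src.size());
  int dsize = LZ4_decompress_safe(dst.data(), out.data(), csize, (int)out.size());
  bool ok = dsize == (int)src.size() &&
            !std::memcmp(src.data(), out.data(), (size_t)dsize);
  std::printf("%zu -> %d bytes, round trip %s\n", src.size(), csize,
              ok ? "ok" : "failed");
}
```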
Medium ratio: This is the sweet spot for many tasks.
These codecs achieve better compression ratios by employing entropy
encoding, usually Huffman coding.
zstd has emerged as a clear leader, gaining popularity and
effectively supplanting older codecs like the venerable DEFLATE
(zlib).
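For reference, a one-shot round trip with the zstd C API looks like this (assuming libzstd is installed; link with -lzstd; level 19 is used here only as an illustration of a higher level):

```cpp
// One-shot zstd round trip.
#include <zstd.h>

#include <cstdio>
#include <string>
#include <vector>

int main() {
  std::string src(1 << 20, 'z');
  std::vector<char> dst(ZSTD_compressBound(src.size()));
  // Level 19 trades compression time for ratio; level 3 is the default.
  size_t csize = ZSTD_compress(dst.data(), dst.size(), src.data(), src.size(), 19);
  if (ZSTD_isError(csize)) return 1;
  std::vector<char> out(src.size());
  size_t dsize = ZSTD_decompress(out.data(), out.size(), dst.data(), csize);
  std::printf("%zu -> %zu bytes, decompressed back to %zu bytes\n",
              src.size(), csize, dsize);
}
```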
High ratio: These codecs are designed to squeeze every last
bit of redundancy out of the data, often at the cost of significantly
longer compression and decompression times and larger memory usage. They
are perfect for archival purposes or data distribution, where files are
compressed once and decompressed infrequently. Codecs in this category
typically have three important components:
Transforms: Codecs typically implement strong transforms to increase
redundancy, even very specific ones like branch/call/jump filters for
machine code.
Prediction model: This model anticipates the next piece of data
based on what has already been processed.
Entropy encoding: Traditional codecs use an arithmetic coder, which
newer designs increasingly replace with the more efficient range variant
of Asymmetric Numeral Systems (rANS).
Some projects apply neural network models, such as recurrent neural
networks, long short-term memory, and Transformers, to the prediction
model. They are usually very slow.
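To make the prediction-model idea concrete, here is a toy order-1 model in C++. It is purely illustrative: real codecs use far more elaborate context models and feed the resulting probabilities into an arithmetic or rANS coder rather than making a hard guess.

```cpp
// Toy order-1 prediction model: for each previous byte (the context), count
// which byte follows most often.
#include <array>
#include <cstdint>
#include <iostream>
#include <string>

struct Order1Model {
  std::array<std::array<uint32_t, 256>, 256> counts{};

  void update(uint8_t ctx, uint8_t next) { ++counts[ctx][next]; }

  uint8_t predict(uint8_t ctx) const {
    int best = 0;
    for (int b = 1; b < 256; ++b)
      if (counts[ctx][b] > counts[ctx][best]) best = b;
    return (uint8_t)best;
  }
};

int main() {
  static Order1Model model;  // 256 KiB of counters; keep it off the stack
  std::string data = "abababababac";
  for (size_t i = 1; i < data.size(); ++i)
    model.update(data[i - 1], data[i]);
  std::cout << "after 'a' the model predicts '" << model.predict('a') << "'\n";
}
```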
This categorization is loose. Many modern programs offer a wide range
of compression levels that allow them to essentially span multiple
categories. For example, zstd at a high compression level can
achieve a ratio comparable to xz (a high-compression codec) by
using more RAM and CPU. While zstd generally trails xz slightly in
compression ratio or speed, its decompression is often much faster.
Benchmarking
I want to benchmark the single worker performance of a few
compression programs:
lz4: Focuses on speed over compression ratio. Memory usage
is extremely low. It seems Pareto superior to Google's
Snappy.
zstd: Gained significant traction and obsoleted many
existing codecs. Its LZ77 variant uses three recent match offsets like
LZX. For entropy encoding, it employs Huffman coding for literals and
2-way interleaved Finite State Entropy for Huffman weights, literal
lengths, match lengths, and offset codes. The large alphabet of literals
makes Huffman a good choice, as compressing them with FSE provides
little gain for a speed cost. However, other symbols have a small range,
making them a sweet spot for FSE. zstd works on multiple streams at the
same time to utilize instruction-level parallelism. zstd is supported by
the Accept-Encoding: zstd HTTP header. Decompression memory
usage is very low.
brotli: Uses a combination of LZ77, 2nd order context
model, Huffman coding, and static dictionary. The decompression speed is
similar to gzip with a higher ratio. At lower levels, its performance is
overshadowed by zstd. Compared with DEFLATE, it employs a
larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller
minimum match length (2 instead of 3). It has a predefined dictionary
that works well for web content (but feels less elegant) and supports
120 transforms. brotli is supported by the
Accept-Encoding: br HTTP header. Decompression memory usage
is quite low.
bzip3: Combines BWT, RLE, and LZP with an arithmetic
coder. Memory usage is large.
xz: LZMA2 with a few filters. The filters must be enabled
explicitly.
lzham: Provides a compression ratio similar to LZMA but with faster
decompression. Compression is slightly slower while memory usage is
larger. The build system is not well-polished for Linux. I have forked
it, fixed stdint.h build errors, and installed
lzhamtest. The command line program lzhamtest
should really be renamed to lzham.
zpaq: Functions as a command-line archiver supporting
multiple files. It combines context mixing with an arithmetic coder but
operates very slowly.
kanzi: It offers a wide variety of transforms and entropy
encoders, unusual for a compression program. For the compression speed of
enwik8, it's Pareto superior to xz, but decompression is
slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial
memory.
I'd like to test lzham (not updated for a few years), but I'm having
trouble getting it to compile due to a cstdio header
issue.
Many modern compressors are parallel by default. I have to disable
this behavior by using options like -T1. Still,
zstd uses a worker thread for I/O overlap, but I don't bother
with --single-thread.
To ensure fairness, each program is built with consistent compiler
optimizations, such as -O3 -march=native.
Below is a Ruby program that downloads and compiles multiple
compression utilities, then compresses and decompresses a specified input
file. It collects performance metrics including execution time, memory
usage, and compression ratio, and finally generates an HTML file with
scatter charts visualizing the results. The program has several notable
features:
Adding new compressors is easy: just modify
COMPRESSORS.
Benchmark results are cached in files named
cache_$basename_$digest.json, allowing reuse of previous
runs for the same input file.
Adding a new compression level does not invalidate existing
benchmark results for other levels.
The script generates an HTML file with interactive scatter charts.
Each compressor is assigned a unique, deterministic color based on a
hash of its name (using the hsl function in CSS).
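The deterministic-color trick is easy to reproduce in any language; here is a C++ illustration of the idea (the benchmark script itself does this in Ruby):

```cpp
// Map a compressor's name to a stable hue with FNV-1a, so the same name
// always gets the same hsl() color across runs.
#include <cstdint>
#include <cstdio>
#include <string>

unsigned hueFor(const std::string &name) {
  uint32_t h = 2166136261u;
  for (unsigned char c : name)
    h = (h ^ c) * 16777619u;
  return h % 360;
}

int main() {
  for (const char *name : {"zstd", "xz", "brotli"})
    std::printf("%s -> hsl(%u, 70%%, 50%%)\n", name, hueFor(name));
}
```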
Running each executable incurs initialization overhead, which would be
amortized in a library-based setup. However, linking against libraries
would make updating them more difficult.
Demo
ruby bench.rb enwik8 # The first iframe below
ruby bench.rb clang # The second iframe below
Many programs exhibit a stable decompression speed (uncompressed size
/ decompression time). There is typically a slightly higher
decompression speed at higher compression levels. If you think of the
compressed content as a form of "byte code", a more highly compressed
file means there are fewer bytes for the decompression algorithm to
process, resulting in faster decompression. Some programs, like
zpaq and kanzi, use different algorithms that can
result in significantly different decompression speeds.
xz -9 doesn't use parallelism on the two files under
~100 MiB because their uncompressed size is smaller than the default
block size for level 9.
From install/include/lzma/container.h
For each thread, about 3 * block_size bytes of memory will be
allocated. This may change in later liblzma versions. If so, the memory
usage will probably be reduced, not increased.
Alignment refers to the practice of placing data or code at memory
addresses that are multiples of a specific value, typically a power of
2. This is done to meet the requirements of the programming language,
ABI, or the underlying hardware. Misaligned memory accesses may be
expensive or may even trap on certain architectures.
This blog post explores how alignment is represented and managed as
C++ code is transformed through the compilation pipeline: from source
code to LLVM IR, assembly, and finally the object file. We'll focus on
alignment for both variables and functions.
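Here is a small, illustrative example of the kinds of alignment requests that flow through that pipeline (not necessarily the exact code examined later):

```cpp
#include <cstdint>

struct alignas(32) Vec {   // an over-aligned type: alignment is part of the type
  float lanes[8];
};

alignas(64) int counter;   // an over-aligned global: the 64-byte requirement is
                           // recorded on the symbol/section in the object file

static_assert(alignof(Vec) == 32, "alignas raises the type's alignment");

int main() {
  Vec v{};                 // the stack slot for v must be 32-byte aligned
  return (int)(reinterpret_cast<std::uintptr_t>(&v) % 32);  // expected: 0
}
```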
In my previous assembler posts, I've discussed improvements to expression
resolving and relocation generation. Now, let's turn our attention
to recent refinements within section fragments. Understanding how an
assembler utilizes these fragments is key to appreciating the
improvements we've made. At a high level, the process unfolds in three
main stages:
Parsing phase: The assembler constructs section fragments. These
fragments represent sequences of regular instructions or data, span-dependent
instructions, alignment directives, and other elements.
Section layout phase: Once fragments are built, the assembler
assigns offsets to them and finalizes the span-dependent content.
Relocation
decision phase: In the final stage, the assembler evaluates fixups
and, if necessary, updates the content of the fragments.
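The following is a conceptual C++ sketch of the section layout phase, not LLVM's actual MCFragment classes: fragments carry either encoded bytes or an alignment requirement, and a single walk assigns offsets.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// A section is modeled as a list of fragments.
struct Fragment {
  enum Kind { Data, Align, Relaxable } kind;
  std::string bytes;       // encoded contents for Data/Relaxable fragments
  uint64_t alignment = 1;  // requirement for Align fragments (power of 2)
  uint64_t offset = 0;     // assigned during layout
};

uint64_t layoutSection(std::vector<Fragment> &frags) {
  uint64_t offset = 0;
  for (Fragment &f : frags) {
    f.offset = offset;
    if (f.kind == Fragment::Align)
      offset = (offset + f.alignment - 1) & ~(f.alignment - 1);
    else
      offset += f.bytes.size();  // a Relaxable fragment may grow later if a
                                 // longer instruction encoding is required
  }
  return offset;  // final section size
}

int main() {
  std::vector<Fragment> sec = {
      {Fragment::Data, std::string(7, '\x90')},    // 7 bytes of instructions
      {Fragment::Align, "", 16},                   // like .p2align 4
      {Fragment::Relaxable, std::string(2, '\0')}, // e.g. a short jump
  };
  std::printf("section size: %llu\n", (unsigned long long)layoutSection(sec));
}
```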
MCFixup: Remove FK_PCRel_
The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.
However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.
Removing FK_PCRel_ simplifies the overall fixup mechanism.
My previous post, LLVM
integrated assembler: Improving MCExpr and MCValue, delved into
enhancements made to LLVM's internal MCExpr and MCValue representations.
This post covers recent refinements to MC, focusing on expression
resolving and relocation generation.
In my previous post, Relocation
Generation in Assemblers, I explored some key concepts behind
LLVM’s integrated assemblers. This post dives into recent improvements
I’ve made to refine that system.
The LLVM integrated assembler handles fixups and relocatable
expressions as distinct entities. Relocatable expressions, in
particular, are encoded using the MCValue class, which
originally looked like this:
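Roughly, in simplified form (field names follow older versions of LLVM's MCValue.h; accessors and details are omitted and may differ across versions):

```cpp
#include <cstdint>

class MCSymbolRefExpr;

class MCValue {
  const MCSymbolRefExpr *SymA = nullptr; // the added symbol, if any
  const MCSymbolRefExpr *SymB = nullptr; // the subtracted symbol, if any
  int64_t Cst = 0;                       // constant addend
  uint32_t RefKind = 0;                  // target-specific modifier
};
```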
This post explores how GNU Assembler and LLVM integrated assembler
generate relocations, an important step in producing a relocatable file.
Relocations identify parts of instructions or data that cannot be fully
determined during assembly because they depend on the final memory
layout, which is only established at link time or load time. These are
essentially placeholders that will be filled in (typically with absolute
addresses or PC-relative offsets) during the linking process.
Relocation generation: the
basics
Symbol references are the primary candidates for relocations. For
instance, in the x86-64 instruction movl sym(%rip), %eax
(GNU syntax), the assembler calculates the displacement between the
program counter (PC) and sym. This distance affects the
instruction's encoding and typically triggers a
R_X86_64_PC32 relocation, unless sym is a
local symbol defined within the current section.
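Conceptually, the decision looks like the following C++ sketch (not the actual GNU assembler or LLVM code):

```cpp
#include <cstdint>
#include <cstdio>

// "Resolve now or emit a relocation" for a PC-relative reference.
struct Symbol { bool defined; bool isLocal; int section; uint64_t offset; };

struct Eval { bool needsReloc; int64_t value; };

Eval evaluatePCRel(const Symbol &sym, int curSection, uint64_t fixupOffset) {
  if (sym.defined && sym.isLocal && sym.section == curSection)
    // Both ends are known at assembly time: fold the displacement directly
    // into the instruction's immediate field.
    return {false, (int64_t)sym.offset - (int64_t)fixupOffset};
  // Otherwise leave a placeholder for the linker, e.g. R_X86_64_PC32.
  return {true, 0};
}

int main() {
  Symbol local{true, true, /*section=*/1, /*offset=*/0x40};
  Symbol external{false, false, -1, 0};
  std::printf("%d %d\n", evaluatePCRel(local, 1, 0x10).needsReloc,
              evaluatePCRel(external, 1, 0x10).needsReloc);  // prints: 0 1
}
```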
Both the GNU assembler and LLVM integrated assembler utilize multiple
passes during assembly, with several key phases relevant to relocation
generation: