Weak AVL Tree

tl;dr: Weak AVL trees are replacements for AVL trees and red-black trees.

The 2014 paper Rank-Balanced Trees (Haeupler, Sen, Tarjan) presents a framework that uses ranks and rank differences to describe balanced binary search trees.

  • Each node has a non-negative integer rank r(x). Null nodes have rank -1.
  • The rank difference of a node x with parent p(x) is r(p(x)) − r(x).
  • A node is i,j if its children have rank differences i and j (unordered), e.g., a 1,2 node has children with rank differences 1 and 2.
  • A node is an i-child if its rank difference is i, e.g., a 1-child has rank difference 1.
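
To make these definitions concrete, here is a minimal sketch that stores explicit integer ranks. The struct and function names are only for illustration; the implementations discussed later encode rank differences in one or two bits instead of storing ranks.

#include <algorithm>
#include <utility>

struct Node {
  Node *ch[2] = {nullptr, nullptr};
  int rank = 0; // leaves have rank 0
};

int rank_of(const Node *x) { return x ? x->rank : -1; } // null nodes have rank -1

// Rank difference of child x with respect to its parent p: r(p) - r(x).
int rank_diff(const Node *p, const Node *x) { return p->rank - rank_of(x); }

// The (unordered) pair of rank differences that classifies p as an i,j node.
std::pair<int, int> node_type(const Node *p) {
  int a = rank_diff(p, p->ch[0]), b = rank_diff(p, p->ch[1]);
  return std::minmax(a, b);
}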

Several balanced trees fit this framework:

  • AVL tree: Ranks are defined as heights. Every node is a 1,1 or 1,2 node (rank differences of its children).
  • Red-Black tree: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. (red: 0-child; black: 1-child; null nodes are black)
  • Weak AVL tree (new tree described by this paper): All rank differences are 1 or 2, and every leaf has rank 0.
    • A weak AVL tree without 2,2 nodes is an AVL tree.
AVL trees ⫋ weak AVL trees ⫋ red-black trees

Weak AVL Tree

Weak AVL trees are replacements for AVL trees and red-black trees. A single insertion or deletion operation requires at most two rotations (forming a double rotation when two are needed). In contrast, AVL deletion requires O(log n) rotations, and red-black deletion requires up to three.

Without deletions, a weak AVL tree is exactly an AVL tree. With deletions, its height remains at most that of an AVL tree with the same number of insertions but no deletions.

The rank rules imply:

  • Null nodes have rank -1, leaves have rank 0, and unary nodes have rank 1 (the null child forces the rank to be at most 1, while the real child, a leaf of rank 0, forces it to be at least 1).
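
As a sanity check, these rules can be verified recursively. This hedged sketch reuses the explicit-rank Node from the sketch above and simply asserts the weak AVL invariants:

#include <cassert>

// Returns the rank of the subtree rooted at x if it satisfies the weak AVL
// rules (all rank differences are 1 or 2, every leaf has rank 0); asserts otherwise.
int check(const Node *x) {
  if (!x) return -1; // null nodes have rank -1
  int dl = x->rank - check(x->ch[0]);
  int dr = x->rank - check(x->ch[1]);
  assert((dl == 1 || dl == 2) && (dr == 1 || dr == 2));
  if (!x->ch[0] && !x->ch[1])
    assert(x->rank == 0);
  return x->rank;
}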

Insertion

The new node x has rank 0, replacing a null node of rank -1. There are three cases.

  • If the tree was previously empty, the new node becomes the root.
  • If the parent of the new node was previously a unary node (1,2 node), it is now a 1,1 binary node.
  • If the parent of the new node was previously a leaf (1,1 node), it is now a 0,1 unary node, leading to a rank violation.

When the tree was previously non-empty, x has a parent node. To handle the second and third cases, we call the following subroutine with x denoting the new node.

The subroutine handles the rank increase of x; it breaks out of the loop once there is no more rank violation, i.e., we are done.

The 2014 paper isn't very clear about the conditions.

// Assume that x's rank has just increased by 1 and rank_diff(x) has been updated.

p = x->parent;
if (rank_diff(x) == 1) {
  // x was previously a 2-child before increasing x->rank. Done.
} else {
  for (;;) {
    // Otherwise, p is a 0,1 node (previously a 1,1 node before increasing x->rank).
    // x being a 0-child is a violation.

    Promote p.
    // Since we have promoted both x and p, it's as if rank_diff(x's sibling) is flipped.
    // p is now a 1,2 node.

    x = p;
    p = p->parent;
    // x is a 1,2 node.
    if (!p) break;
    d = p->ch[1] == x;

    if (rank_diff(x) == 1) break;
    // Otherwise, x is a 0-child, leading to a new rank violation.

    auto sib = p->ch[!d];
    if (rank_diff(sib) == 2) { // p is a 0,2 node
      auto y = x->ch[d^1];
      if (y && rank_diff(y) == 1) {
        // y is a 1-child. y must be the previous `x` in the last iteration.
        Perform a double rotation involving `p`, `x`, and `y`.
      } else {
        // Otherwise, y is null or a 2-child.
        Perform a single rotation involving `p` and `x`.
        // x is now a 1,1 node and there is no more violation.
      }
      break;
    }

    // Otherwise, p is a 0,1 node. Go to the next iteration.
  }
}

Insertion never introduces a 2,2 node, so insertion-only sequences produce AVL trees.

Deletion

TODO: Describe deletion

if (!was_2 && !x && !p->ch[0] && !p->ch[1] && p->rp()) {
// p was unary and becomes 2,2. Demote it.
}

Implementation

Since valid rank differences can only be 1 or 2, ranks can be encoded efficiently using bit flags. There are three approaches:

  • Store two bits representing the rank differences to each child. Bit 0: rank difference to left child (1 = diff is 2, 0 = diff is 1). Bit 1: rank difference to right child
  • Store a single bit representing the parity (even/odd) of the node's absolute rank. The rank difference to a child is computed by comparing parities. Same parity → rank difference of 2. Different parity → rank difference of 1
  • Store a 1-bit rank difference parity in each node.
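
As an illustration of the second approach, here is a hedged sketch with made-up names (PNode, parity), storing the parity of the node's absolute rank:

// `parity` is the low bit of the node's absolute rank; null nodes have rank -1,
// i.e., odd parity.
struct PNode {
  PNode *ch[2];
  bool parity; // rank & 1
};

// Same parity as the parent means an even rank difference, which must be 2;
// different parity means 1.
int rank_diff(const PNode *p, const PNode *c) {
  bool cp = c ? c->parity : true;
  return p->parity == cp ? 2 : 1;
}

// Promoting or demoting a node by 1 toggles only its own bit; with the first
// approach, the parent's flag bit for this node must be updated as well.
void promote_or_demote(PNode *x) { x->parity = !x->parity; }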

FreeBSD's sys/tree.h (https://reviews.freebsd.org/D25480, 2020) uses the first approach. The rb_ prefix remains, as it can also be read as Rank-Balanced. :) Note: its insertion operation can be further optimized, as the following code demonstrates.

https://github.com/pvachon/wavl_tree and https://crates.io/crates/wavltree use the second approach.

The third approach is less efficient because a null node can be either a 1-child (parent is binary) or a 2-child (parent is unary), requiring the sibling node to be probed to determine the rank difference: int rank_diff(Node *p, int d) { return p->ch[d] ? p->ch[d]->par_and_flg & 1 : p->ch[!d] ? 2 : 1; }

https://maskray.me/blog/2025-12-14-weak-avl-tree is a C++ implementation covering both approaches, supporting the following operations:

  • insert: insert a node
  • remove: remove a node
  • rank: count elements less than a key
  • select: find the k-th smallest element (0-indexed)
  • prev: find the largest element less than a key
  • next: find the smallest element greater than a key
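
The order-statistics queries rely only on the size augmentation. Below is a hedged, stripped-down sketch (not necessarily how the linked implementation works), using a node with just ch, i, and size:

struct Node {
  Node *ch[2];
  long i;   // key
  int size; // number of nodes in the subtree
};

int size_of(const Node *x) { return x ? x->size : 0; }

// rank: count elements whose key is less than `key`.
int rank(const Node *x, long key) {
  int r = 0;
  while (x) {
    if (key <= x->i) {
      x = x->ch[0];
    } else {
      r += size_of(x->ch[0]) + 1; // left subtree and x itself are less than key
      x = x->ch[1];
    }
  }
  return r;
}

// select: find the k-th smallest element (0-indexed), or null if out of range.
const Node *select(const Node *x, int k) {
  while (x) {
    int ls = size_of(x->ch[0]);
    if (k < ls) {
      x = x->ch[0];
    } else if (k == ls) {
      return x;
    } else {
      k -= ls + 1;
      x = x->ch[1];
    }
  }
  return nullptr;
}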

Node structure:

  • ch[2]: left and right child pointers.
  • par_and_flg: packs the parent pointer with 2 flag bits in the low bits. Bit 0 indicates whether the left child has rank difference 2; bit 1 indicates whether the right child has rank difference 2. A cleared bit means rank difference 1.
  • i: the key value.
  • sum, size: augmented data maintained by mconcat for order statistics operations.

Helper methods:

  • rd2(d): returns true if child d has rank difference 2.
  • flip(d): toggles the rank difference of child d between 1 and 2.
  • clr_flags(): sets both children to rank difference 1 (used after rotations to reset a node to 1,1).
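
Putting the above together, the node might look roughly like the following. This is a sketch based on the description above, not the exact code from the linked post; the mconcat definition in particular is only a plausible guess.

#include <cstdint>

struct Node {
  Node *ch[2];           // left and right child pointers
  uintptr_t par_and_flg; // parent pointer | 2 flag bits (bit d set: ch[d] is a 2-child)
  long i;                // key
  long sum;              // augmented data
  int size;              // augmented data

  Node *parent() const { return (Node *)(par_and_flg & ~(uintptr_t)3); }
  unsigned flags() const { return par_and_flg & 3; }
  bool rd2(int d) const { return par_and_flg >> d & 1; } // child d has rank difference 2?
  void flip(int d) { par_and_flg ^= (uintptr_t)1 << d; } // toggle between 1 and 2
  void clr_flags() { par_and_flg &= ~(uintptr_t)3; }     // reset to a 1,1 node
};

// mconcat recomputes a node's augmented data from its children (sketched here as
// subtree size and key sum).
void mconcat(Node *x) {
  x->size = 1;
  x->sum = x->i;
  for (Node *c : x->ch)
    if (c) {
      x->size += c->size;
      x->sum += c->sum;
    }
}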

Invariants:

  • Leaves always have flags() == 0, meaning both null children are 1-children (null nodes have rank -1, leaf has rank 0).
  • After each insertion or deletion, mconcat is called along the path to the root to update augmented data.

Rotations:

The rotate(x, d) function rotates node x in direction d. It lifts x->ch[d] to replace x, and updates the augmented data for x. The caller is responsible for updating rank differences.
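
Building on the Node sketch above, rotate(x, d) might look like the following. set_parent is an assumed helper that rewrites the parent portion of par_and_flg while preserving the two flag bits.

// Assumed helper: replace the parent pointer, keep the flag bits.
void set_parent(Node *x, Node *p) {
  x->par_and_flg = (uintptr_t)p | (x->par_and_flg & 3);
}

// Lift y = x->ch[d] into x's place. Rank-difference flags are left untouched;
// as described above, the caller fixes them (and y's augmented data) afterwards.
void rotate(Node *x, int d) {
  Node *y = x->ch[d], *p = x->parent();
  x->ch[d] = y->ch[!d];       // y's inner subtree moves under x
  if (x->ch[d])
    set_parent(x->ch[d], x);
  y->ch[!d] = x;              // x moves below y
  if (p)
    p->ch[p->ch[1] == x] = y; // splice y into x's old position
  set_parent(y, p);
  set_parent(x, y);
  mconcat(x);                 // x's subtree changed; refresh its augmented data
}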

Misc

Visualization: https://tjkendev.github.io/bst-visualization/avl-tree/bu-weak.html

Stack walking: space and time trade-offs

On most Linux platforms (except AArch32, which uses .ARM.exidx), DWARF .eh_frame is required for C++ exception handling and stack unwinding to restore callee-saved registers. While .eh_frame can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.

Runtime performance analysis will be added in a future update.

Read More

Remarks on SFrame

SFrame is a new stack walking format for userspace profiling, inspired by Linux's in-kernel ORC unwind format. While SFrame eliminates some .eh_frame CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than .eh_frame's bytecode-style CFI instructions. In the llvm-project executables I've tested on x86-64, the .sframe section is 20% larger than .eh_frame. It also remains significantly larger than highly compact schemes like Windows ARM64 unwind codes.

SFrame describes three elements for each function:

  • Canonical Frame Address (CFA): The base address for stack frame calculations
  • Return address
  • Frame pointer

An .sframe section follows a straightforward layout:

  • Header: Contains metadata and offset information
  • Auxiliary header (optional): Reserved for future extensions
  • Function Descriptor Entries (FDEs): Array describing each function
  • Frame Row Entries (FREs): Arrays of unwinding information per function

Read More

Benchmarking compression programs

tl;dr https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7 is a single-file Ruby program that downloads and compiles multiple compression utilities, then benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.

Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:

  • Low ratio, high speed: lz4, snappy, Oodle Selkie.
  • Medium ratio, medium speed: zlib, zstd, brotli, Oodle Kraken.
  • High ratio, low speed: LZMA, bzip2, bzip3, bsc, zpaq, kanzi, Oodle Leviathan.

Low ratio Codecs in this category prioritize speed above all else; their compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage such as solid-state drives. These codecs typically produce byte-aligned output and often skip the final entropy-encoding step, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.

Medium ratio This is the sweet spot for many tasks. These codecs achieve a better compression ratio by employing entropy encoding, usually Huffman coding.

zstd has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).

High ratio These codecs are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times and larger memory usage. They are perfect for archival purposes or data distribution, where files are compressed once and decompressed infrequently. Such codecs typically have three important components:

  • Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.
  • Prediction model: This model anticipates the next piece of data based on what has already been processed.
  • Entropy encoding: Traditional codecs use an arithmetic encoder, which is being replaced by the more efficient range variant of Asymmetric Numeral Systems (rANS).

Some projects apply neural network models, such as recurrent neural networks, long short-term memory, and Transformers, to the prediction model. They are usually very slow.


This categorization is loose. Many modern programs offer a wide range of compression levels that essentially let them span multiple categories. For example, at a high compression level, zstd can achieve a ratio comparable to xz (a high-compression codec) by using more RAM and CPU. While zstd's compression speed or ratio generally remains lower, its decompression speed is often much faster than that of xz.

Benchmarking

I want to benchmark the single worker performance of a few compression programs:

  • lz4: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's Snappy.
  • zstd: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. However, other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the Accept-Encoding: zstd HTTP header. Decompression memory usage is very low.
  • brotli: Uses a combination of LZ77, 2nd order context model, Huffman coding, and static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by zstd. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. brotli is supported by the Accept-Encoding: br HTTP header. Decompression memory usage is quite low.
  • bzip3: Combines BWT, RLE, and LZP and uses arithmetic encoder. Memory usage is large.
  • xz: LZMA2 with a few filters. The filters must be enabled explicitly.
  • lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower while memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed stdint.h build errors, and installed lzhamtest. The command line program lzhamtest should really be renamed to lzham.
  • zpaq: Functions as a command-line archiver supporting multiple files. It combines context mixing with arithmetic encoder but operates very slowly.
  • kanzi: It offers a wide variety of transforms and entropy encoders, which is unusual for a compression program. For the compression speed of enwik8, it's Pareto superior to xz, but decompression is slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.

I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a cstdio header issue.

Many modern compressors are parallel by default. I have to disable this behavior with options like -T1. Even then, zstd uses a worker thread for I/O overlap, but I don't bother with --single-thread.

To ensure fairness, each program is built with consistent compiler optimizations, such as -O3 -march=native.

Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:

  • Adding new compressors is easy: just modify COMPRESSORS.
  • Benchmark results are cached in files named cache_$basename_$digest.json, allowing reuse of previous runs for the same input file.
  • Adding a new compression level does not invalidate existing benchmark results for other levels.
  • The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the hsl function in CSS).

The single file Ruby program is available at https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7

Limitation

A single run might not be representative.

Running the executable incurs initialization overhead, which would be amortized in a library setup. However, library setup would make updating libraries more difficult.

Demo

ruby bench.rb enwik8
# The first iframe below

ruby bench.rb clang
# The second iframe below

Many programs exhibit a stable decompression speed (uncompressed size / decompression time). There is typically a slightly higher decompression speed at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file means there are fewer bytes for the decompression algorithm to process, resulting in faster decompression. Some programs, like zpaq and kanzi, use different algorithms that can result in significantly different decompression speeds.

xz -9 doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.

From install/include/lzma/container.h

For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased.

Understanding alignment - from source to object file

Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is usually done to meet the requirements of the programming language, ABI, or the underlying hardware. Misaligned memory accesses might be expensive or might even cause traps on certain architectures.

This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.

Read More

LLVM integrated assembler: Engineering better fragments

In my previous assembler posts, I've discussed improvements on expression resolving and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:

  • Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
  • Section layout phase: Once fragments are built, the assembler assigns offsets to them and finalizes the span-dependent content.
  • Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.

Read More

GCC 13.3.0 miscompiles LLVM

For years, I've been involved in updating LLVM's MC layer. A recent journey led me to eliminate the FK_PCRel_ fixup kinds:

MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.

Removing FK_PCRel_ simplifies the overall fixup mechanism.

Read More