Weak AVL Tree

tl;dr: Weak AVL trees are replacements for AVL trees and red-black trees.

The 2014 paper Rank-Balanced Trees (Haeupler, Sen, Tarjan) presents a framework using ranks and rank differences to define binary search trees.

  • Each node has a non-negative integer rank r(x). Null nodes have rank -1.
  • The rank difference of a node x with parent p(x) is r(p(x)) − r(x).
  • A node is i,j if its children have rank differences i and j (unordered), e.g., a 1,2 node has children with rank differences 1 and 2.
  • A node is called an i-child if its rank difference is i, e.g., a 1-child has rank difference 1. A minimal code sketch of these definitions follows below.
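
The definitions above translate directly into code. Here is a minimal sketch with explicit integer ranks, for illustration only; practical implementations encode rank differences or rank parity in one or two bits, as the Implementation section discusses.

// Illustration only: explicit ranks. Practical implementations store
// 1-2 bits per node instead (see the Implementation section).
struct RankNode {
  RankNode *ch[2]{};
  int rank = 0; // leaves have rank 0
};

// Null nodes have rank -1.
int rank_of(const RankNode *x) { return x ? x->rank : -1; }

// Rank difference of child d of p; in a valid weak AVL tree this is 1 or 2.
int rank_diff(const RankNode *p, int d) { return p->rank - rank_of(p->ch[d]); }

// A node is i,j if its children have rank differences i and j (unordered).
bool is_11(const RankNode *p) { return rank_diff(p, 0) == 1 && rank_diff(p, 1) == 1; }
bool is_22(const RankNode *p) { return rank_diff(p, 0) == 2 && rank_diff(p, 1) == 2; }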

Several balanced trees fit this framework:

  • AVL tree: Ranks are defined as heights. Every node is 1,1 or 1,2 (rank differences of children).
  • Red-Black tree: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. (red: 0-child; black: 1-child; null nodes are black)
  • Weak AVL tree (new tree described by this paper): All rank differences are 1 or 2, and every leaf has rank 0.
    • A weak AVL tree without 2,2 nodes is an AVL tree.
AVL trees ⫋ weak AVL trees ⫋ red-black trees

Weak AVL Tree

Weak AVL trees are replacements for AVL trees and red-black trees. A single insertion or deletion operation requires at most two rotations (forming a double rotation when two are needed). In contrast, AVL deletion requires O(log n) rotations, and red-black deletion requires up to three.

Without deletions, a weak AVL tree is exactly an AVL tree. With deletions, its height remains at most that of an AVL tree with the same number of insertions but no deletions.

The rank rules imply:

  • Null nodes have rank -1, leaves have rank 0, and unary nodes have rank 1: a unary node's null child (rank -1) forces its rank to be at most 1, while its leaf child (rank 0) forces it to be at least 1.

Insertion

The new node x gets rank 0, replacing a null node of rank -1. There are three cases.

  • If the tree was previously empty, the new node becomes the root.
  • If the parent of the new node was previously a unary node (1,2 node), it is now a 1,1 binary node.
  • If the parent of the new node was previously a leaf (1,1 node), it is now a 0,1 binary node, leading to a rank violation.

When the tree was previously non-empty, x has a parent. To handle the second and third cases, we call the following subroutine with x denoting the new node.

The subroutine propagates the rank increase of x up the tree. break means that no rank violation remains, i.e. we are done.

The 2014 paper isn't very clear about the conditions.

// Assume that x's rank has just increased by 1 and rank_diff(x) has been updated.

p = x->parent;
if (rank_diff(x) == 1) {
  // x was previously a 2-child before increasing x->rank.
  // Done.
} else {
  for (;;) {
    // Otherwise, p is a 0,1 node (previously a 1,1 node before increasing x->rank).
    // x being a 0-child is a violation.

    Promote p.
    // Since we have promoted both x and p, it's as if rank_diff(x's sibling) is flipped.
    // p is now a 1,2 node.

    x = p;
    p = p->parent;
    // x is a 1,2 node.
    if (!p) break;
    d = p->ch[1] == x;

    if (rank_diff(x) == 1) { break; }
    // Otherwise, x is a 0-child, leading to a new rank violation.

    auto sib = p->ch[!d];
    if (rank_diff(sib) == 2) { // p is a 0,2 node
      auto y = x->ch[d^1];
      if (y && rank_diff(y) == 1) {
        // y is a 1-child. y must be the previous `x` in the last iteration.
        Perform a double rotation involving `p`, `x`, and `y`.
      } else {
        // Otherwise, y is null or a 2-child.
        Perform a single rotation involving `p` and `x`.
        // x is now a 1,1 node and there is no more violation.
      }
      break;
    }

    // Otherwise, p is a 0,1 node. Go to the next iteration.
  }
}

Insertion never introduces a 2,2 node, so insertion-only sequences produce AVL trees.

Deletion

TODO: Describe deletion
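
Deletion removes a node as in a plain binary search tree and then repairs the rank rules. Below is a rough sketch of the rebalancing loop, based on the paper and written in the same pseudocode style as the insertion subroutine above; `x` denotes the child that replaced the removed node (possibly null) and `p` its parent. See the two implementations below for working code.

// Sketch of the paper's deletion rebalancing.
while (p && rank_diff(x) == 3) {
  auto sib = p's other child;
  if (rank_diff(sib) == 2) {
    Demote p. // p was a 2,3 node and is now 1,2.
    x = p; p = p->parent;
  } else if (sib is a 2,2 node) {
    Demote sib and p.
    x = p; p = p->parent;
  } else {
    // sib is a 1-child with at least one child of rank difference 1.
    Perform a single or double rotation involving `p` and `sib`.
    // At most two rotations in total; no violation remains.
    break;
  }
}

One subtle case arises before the loop: when the parent of a removed leaf was previously unary, it becomes a 2,2 leaf, violating the rule that every leaf has rank 0, and must be demoted first: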

// From the second implementation below; rp() returns the node's rank parity.
if (!was_2 && !x && !p->ch[0] && !p->ch[1] && p->rp()) {
  // p was unary and has become a 2,2 leaf. Demote it.
}

Implementation

In 2020, FreeBSD changed its sys/tree.h to use weak AVL trees instead of red-black trees (https://reviews.freebsd.org/D25480). The rb_ prefix remains, as it can also be read as Rank-Balanced. :)

This implementation stores 2 bits per node to encode the parity of the rank difference to each child. An alternative approach encodes a 1-bit absolute rank parity in each node, as seen in https://github.com/pvachon/wavl_tree and https://crates.io/crates/wavltree.

Below are two C++ implementations covering both approaches, supporting the following operations:

  • insert: insert a node
  • remove: remove a node
  • rank: count elements less than a key
  • select: find the k-th smallest element (0-indexed)
  • prev: find the largest element less than a key
  • next: find the smallest element greater than a key

The insertion and deletion operations are inspired by the FreeBSD implementation, with the insertion further optimized.

We encode rank differences instead of the absolute rank values. Since valid rank differences can only be 1 or 2, a single bit suffices to encode each. We take 2 bits from the parent pointer to store the two child rank differences.

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

struct Node {
  Node *ch[2]{};
  uintptr_t par_and_flg{};
  int i{}, sum{}, size{};

  Node *parent() const { return reinterpret_cast<Node*>(par_and_flg & ~3UL); }
  void set_parent(Node *p) { par_and_flg = (par_and_flg & 3) | reinterpret_cast<uintptr_t>(p); }
  uintptr_t flags() const { return par_and_flg & 3; }
  bool rd2(int d) const { return par_and_flg & (1 << d); }
  void flip(int d) { par_and_flg ^= (1 << d); }
  void clr_flags() { par_and_flg &= ~3UL; }

  void mconcat() {
    sum = i;
    size = 1;
    if (ch[0]) sum += ch[0]->sum, size += ch[0]->size;
    if (ch[1]) sum += ch[1]->sum, size += ch[1]->size;
  }

  bool operator<(const Node &o) const { return i < o.i; }
};

struct WAVL {
  Node *root{};

  ~WAVL() {
    auto destroy = [](auto &self, Node *n) -> void {
      if (!n) return;
      self(self, n->ch[0]);
      self(self, n->ch[1]);
      delete n;
    };
    destroy(destroy, root);
  }

  Node *rotate(Node *x, int d) {
    auto pivot = x->ch[d];
    if ((x->ch[d] = pivot->ch[d^1])) x->ch[d]->set_parent(x);
    pivot->set_parent(x->parent());
    if (!x->parent()) root = pivot;
    else x->parent()->ch[x != x->parent()->ch[0]] = pivot;
    pivot->ch[d^1] = x;
    x->set_parent(pivot);
    x->mconcat();
    return pivot;
  }

  void insert(Node *x) {
    assert(x->ch[0] == nullptr && x->ch[1] == nullptr);
    if (!root) {
      root = x;
      x->mconcat();
      return;
    }
    Node *p = root;
    int d = 0;
    for (;;) {
      d = *p < *x;
      if (!p->ch[d])
        break;
      p = p->ch[d];
    }
    p->ch[d] = x;
    x->par_and_flg = reinterpret_cast<uintptr_t>(p);
    auto *x2 = x;
    if (p->rd2(d)) {
      p->flip(d);
    } else {
      assert(p->rd2(d^1) == 0);
      p->flip(d^1);
      int d1 = d;
      for (x = p, p = x->parent(); p; x = p, p = x->parent()) {
        d = (p->ch[1] == x);
        if (p->rd2(d)) {
          p->flip(d);
          break;
        }
        p->flip(d^1);
        if (!p->rd2(d ^ 1)) {
          if ((d^1) == d1) {
            assert(!x->rd2(d1) && (x->ch[d1] == x2 || x->ch[d1]->flags() == 1 || x->ch[d1]->flags() == 2));
            x->flip(d);
            auto y = rotate(x, d^1); // y is previous x
            if (y->rd2(d))
              x->flip(d^1);
            else if (y->rd2(d^1))
              p->flip(d);
            x = y;
          }
          x = rotate(p, d);
          x->clr_flags();
          break;
        }
        d1 = d;
      }
    }
    for (; x2; x2 = x2->parent()) x2->mconcat();
  }

  void remove(Node *x) {
    auto old = x;
    auto p = x->parent();
    auto right = x->ch[1];
    Node *child;
    if (!x->ch[0]) x = child = right;
    else if (!right) x = child = x->ch[0];
    else {
      if (!(child = right->ch[0])) {
        child = right->ch[1];
        p = x = right;
      } else {
        do x = child; while ((child = x->ch[0]));
        child = x->ch[1];
        p = x->parent();
        p->ch[0] = child;
        old->ch[1]->set_parent(x);
        x->ch[1] = old->ch[1];
      }
      old->ch[0]->set_parent(x);
      x->ch[0] = old->ch[0];
      x->par_and_flg = old->par_and_flg;
    }
    if (!old->parent()) root = x;
    else old->parent()->ch[old != old->parent()->ch[0]] = x;
    if (child) child->set_parent(p);

    Node *x2 = p;
    if (p) {
      x = child;
      if (p->ch[0] == x && p->ch[1] == x) {
        p->clr_flags();
        x = p;
        p = x->parent();
      }
      while (p) {
        int d2 = (p->ch[1] == x);
        if (!p->rd2(d2)) {
          p->flip(d2);
          break;
        }
        if (p->rd2(d2 ^ 1)) {
          p->flip(d2 ^ 1);
          x = p;
          p = x->parent();
          continue;
        }
        auto sib = p->ch[d2^1];
        if (sib->flags() == 3) {
          sib->clr_flags();
          x = p;
          p = x->parent();
          continue;
        }
        sib->flip(d2^1);
        if (sib->rd2(d2))
          p->flip(d2);
        else if (!sib->rd2(d2^1)) {
          p->flip(d2);
          x = rotate(sib, d2);
          if (x->rd2(d2^1)) sib->flip(d2);
          if (x->rd2(d2)) p->flip(d2^1);
          x->par_and_flg |= 3;
        }
        rotate(p, d2^1);
        break;
      }
    }
    for (; x2; x2 = x2->parent()) x2->mconcat();
  }

  Node *find(int key) const {
    auto tmp = root;
    while (tmp) {
      if (key < tmp->i) tmp = tmp->ch[0];
      else if (key > tmp->i) tmp = tmp->ch[1];
      else return tmp;
    }
    return nullptr;
  }

  Node *min() const {
    Node *p = nullptr;
    for (auto n = root; n; n = n->ch[0]) p = n;
    return p;
  }

  int rank(int key) const {
    int r = 0;
    for (auto n = root; n; ) {
      if (key <= n->i) n = n->ch[0];
      else {
        r += 1 + (n->ch[0] ? n->ch[0]->size : 0);
        n = n->ch[1];
      }
    }
    return r;
  }

  int select(int k) const {
    auto x = root;
    while (x) {
      int lsz = x->ch[0] ? x->ch[0]->size : 0;
      if (k < lsz) x = x->ch[0];
      else if (k == lsz) return x->i;
      else k -= lsz + 1, x = x->ch[1];
    }
    return -1;
  }

  int prev(int key) const {
    int res = -1;
    for (auto x = root; x; )
      if (key <= x->i) x = x->ch[0];
      else { res = x->i; x = x->ch[1]; }
    return res;
  }

  int next(int key) const {
    int res = -1;
    for (auto x = root; x; )
      if (key >= x->i) x = x->ch[1];
      else { res = x->i; x = x->ch[0]; }
    return res;
  }

  static Node *next(Node *x) {
    if (x->ch[1]) {
      x = x->ch[1];
      while (x->ch[0]) x = x->ch[0];
    } else {
      while (x->parent() && x == x->parent()->ch[1]) x = x->parent();
      x = x->parent();
    }
    return x;
  }
};

void print_tree(Node *n, int d = 0) {
  if (!n) return;
  print_tree(n->ch[0], d + 1);
  printf("%*s%d (%d,%d)\n", 2*d, "", n->i, n->rd2(0) ? 2 : 1, n->rd2(1) ? 2 : 1);
  print_tree(n->ch[1], d + 1);
}

int compute_rank(Node *n, bool debug = false) {
  if (!n) return -1;
  int lr = compute_rank(n->ch[0], debug), rr = compute_rank(n->ch[1], debug);
  if (lr < -1 || rr < -1) return -2;
  int rank_l = lr + (n->rd2(0) ? 2 : 1);
  int rank_r = rr + (n->rd2(1) ? 2 : 1);
  if (rank_l != rank_r) {
    if (debug) printf("node %d: rank mismatch left=%d right=%d\n", n->i, rank_l, rank_r);
    return -2;
  }
  if (!n->ch[0] && !n->ch[1] && n->flags() != 0) {
    if (debug) printf("node %d: leaf must be 1,1 but flags=%lu\n", n->i, n->flags());
    return -2;
  }
  int expected_sum = n->i + (n->ch[0] ? n->ch[0]->sum : 0) + (n->ch[1] ? n->ch[1]->sum : 0);
  if (n->sum != expected_sum) {
    if (debug) printf("node %d: sum mismatch got=%d expected=%d\n", n->i, n->sum, expected_sum);
    return -2;
  }
  int expected_size = 1 + (n->ch[0] ? n->ch[0]->size : 0) + (n->ch[1] ? n->ch[1]->size : 0);
  if (n->size != expected_size) {
    if (debug) printf("node %d: size mismatch got=%d expected=%d\n", n->i, n->size, expected_size);
    return -2;
  }
  return rank_l;
}

bool verify_tree(const WAVL &tree, bool verbose = false) {
  int rank = compute_rank(tree.root);
  if (rank < -1) {
    printf("INVALID TREE\n");
    compute_rank(tree.root, true);
    return false;
  }
  if (verbose) printf("Tree verified, rank = %d\n", rank);
  return true;
}

int main() {
  srand(42);
  WAVL tree;
  int i = 0;
  std::vector<int> a(20);
  std::iota(a.begin(), a.end(), 1);
  std::shuffle(a.begin(), a.end(), std::default_random_engine(42));
  for (int val : a) {
    auto n = new Node;
    n->i = val;
    tree.insert(n);
    if (i++ < 6) {
      printf("-- %d After insertion of %d\n", i, val);
      print_tree(tree.root);
    }
  }
  printf("\nSum\tof values = %d\n", tree.root->sum);
  verify_tree(tree, true);

  for (int val : {5, 10, 15}) {
    if (auto found = tree.find(val)) {
      tree.remove(found);
      delete found;
    }
  }
  printf("After removing 5, 10, 15:\n");
  printf("\nSum\tof values = %d\n", tree.root->sum);
  verify_tree(tree, true);

  std::vector<Node*> ref;
  for (auto n = tree.min(); n; n = WAVL::next(n)) ref.push_back(n);

  for (int i = 0; i < 100000; i++) {
    if (ref.size() < 5 || (ref.size() < 1000 && rand() % 2 == 0)) {
      auto n = new Node;
      n->i = rand() % 100000;
      tree.insert(n);
      ref.push_back(n);
    } else {
      int idx = rand() % ref.size();
      tree.remove(ref[idx]);
      delete ref[idx];
      ref[idx] = ref.back();
      ref.pop_back();
    }
    if (i%100 == 0 && !verify_tree(tree)) {
      printf("FAILED at iteration %d\n", i);
      return 1;
    }
  }

  while (!ref.empty()) {
    tree.remove(ref.back());
    delete ref.back();
    ref.pop_back();
    if (tree.root && !verify_tree(tree)) {
      printf("FAILED during final cleanup\n");
      return 1;
    }
  }
  printf("Stress test passed\n");

  // Test rank, select, prev, next
  printf("\nTesting rank/select/prev/next...\n");
  std::vector<int> vals = {10, 20, 30, 40, 50};
  for (int v : vals) {
    auto n = new Node;
    n->i = v;
    tree.insert(n);
  }

  // rank tests (number of elements < key)
  assert(tree.rank(5) == 0);
  assert(tree.rank(10) == 0);
  assert(tree.rank(15) == 1);
  assert(tree.rank(20) == 1);
  assert(tree.rank(25) == 2);
  assert(tree.rank(50) == 4);
  assert(tree.rank(55) == 5);

  // select tests (0-indexed)
  assert(tree.select(0) == 10);
  assert(tree.select(1) == 20);
  assert(tree.select(2) == 30);
  assert(tree.select(3) == 40);
  assert(tree.select(4) == 50);
  assert(tree.select(5) == -1);

  // prev tests (largest < key)
  assert(tree.prev(10) == -1);
  assert(tree.prev(11) == 10);
  assert(tree.prev(20) == 10);
  assert(tree.prev(21) == 20);
  assert(tree.prev(50) == 40);
  assert(tree.prev(55) == 50);

  // next tests (smallest > key)
  assert(tree.next(5) == 10);
  assert(tree.next(10) == 20);
  assert(tree.next(15) == 20);
  assert(tree.next(40) == 50);
  assert(tree.next(50) == -1);
  assert(tree.next(55) == -1);

  printf("rank/select/prev/next tests passed\n");
}

Here is another implementation that encodes the parity of the absolute ranks.

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

struct Node {
  Node *ch[2]{};
  uintptr_t par_and_flg{};
  int i{}, sum{}, size{};

  Node *parent() const { return reinterpret_cast<Node*>(par_and_flg & ~1UL); }
  void set_parent(Node *p) { par_and_flg = (par_and_flg & 1) | reinterpret_cast<uintptr_t>(p); }
  uintptr_t flags() const { return par_and_flg & 3; }
  bool rp() const { return par_and_flg & 1; }
  void flip() { par_and_flg ^= 1; }
  void clr_flags() { par_and_flg &= ~1UL; }

  void mconcat() {
    sum = i;
    size = 1;
    if (ch[0]) sum += ch[0]->sum, size += ch[0]->size;
    if (ch[1]) sum += ch[1]->sum, size += ch[1]->size;
  }

  bool operator<(const Node &o) const { return i < o.i; }
};

struct WAVL {
  Node *root{};

  ~WAVL() {
    auto destroy = [](auto &self, Node *n) -> void {
      if (!n) return;
      self(self, n->ch[0]);
      self(self, n->ch[1]);
      delete n;
    };
    destroy(destroy, root);
  }

  Node *rotate(Node *x, int d) {
    auto pivot = x->ch[d];
    if ((x->ch[d] = pivot->ch[d^1])) x->ch[d]->set_parent(x);
    pivot->set_parent(x->parent());
    if (!x->parent()) root = pivot;
    else x->parent()->ch[x != x->parent()->ch[0]] = pivot;
    pivot->ch[d^1] = x;
    x->set_parent(pivot);
    x->mconcat();
    return pivot;
  }

  void insert_rebalance(Node *p, int d) {
    Node *x;
    bool par_x, par_y, par_p;
    for (;;) {
      // Promote p.
      p->flip();
      x = p;
      p = p->parent();
      if (!p)
        return;
      d = p->ch[1] == x;
      par_x = x->rp();
      par_y = !p->ch[d^1] || p->ch[d^1]->rp();
      par_p = p->rp();
      // Continue if p is a 0,1 node.
      if (!(par_x == par_p && par_x != par_y))
        break;
    }

    // If p is not 0,2, finish.
    if (par_x != par_p || par_x != par_y)
      return;

    Node *y = x->ch[d^1];
    if (y && y->rp() != par_x) {
      rotate(x, d^1);
      rotate(p, d);
      y->flip();
      x->flip();
      x = y;
    } else {
      rotate(p, d);
    }
    p->flip();
  }

  void insert(Node *x) {
    assert(x->ch[0] == nullptr && x->ch[1] == nullptr);
    if (!root) {
      root = x;
      x->mconcat();
      return;
    }
    Node *p = root;
    int d = 0;
    for (;;) {
      d = *p < *x;
      if (!p->ch[d])
        break;
      p = p->ch[d];
    }
    p->ch[d] = x;
    x->set_parent(p);

    if (!p->ch[d^1])
      insert_rebalance(p, d);
    for (; x; x = x->parent())
      x->mconcat();
  }

  Node *find(int key) const {
    auto tmp = root;
    while (tmp) {
      if (key < tmp->i) tmp = tmp->ch[0];
      else if (key > tmp->i) tmp = tmp->ch[1];
      else return tmp;
    }
    return nullptr;
  }

  Node *min() const {
    Node *p = nullptr;
    for (auto n = root; n; n = n->ch[0]) p = n;
    return p;
  }

  int rank(int key) const {
    int r = 0;
    for (auto n = root; n; ) {
      if (key <= n->i) n = n->ch[0];
      else {
        r += 1 + (n->ch[0] ? n->ch[0]->size : 0);
        n = n->ch[1];
      }
    }
    return r;
  }

  int select(int k) const {
    auto x = root;
    while (x) {
      int lsz = x->ch[0] ? x->ch[0]->size : 0;
      if (k < lsz) x = x->ch[0];
      else if (k == lsz) return x->i;
      else k -= lsz + 1, x = x->ch[1];
    }
    return -1;
  }

  int prev(int key) const {
    int res = -1;
    for (auto x = root; x; )
      if (key <= x->i) x = x->ch[0];
      else { res = x->i; x = x->ch[1]; }
    return res;
  }

  int next(int key) const {
    int res = -1;
    for (auto x = root; x; )
      if (key >= x->i) x = x->ch[1];
      else { res = x->i; x = x->ch[0]; }
    return res;
  }

  static Node *next(Node *x) {
    if (x->ch[1]) {
      x = x->ch[1];
      while (x->ch[0]) x = x->ch[0];
    } else {
      while (x->parent() && x == x->parent()->ch[1]) x = x->parent();
      x = x->parent();
    }
    return x;
  }

  static bool is_22_node(Node *n) {
    if (n->rp()) {
      return (!n->ch[0] || n->ch[0]->rp()) && (!n->ch[1] || n->ch[1]->rp());
    } else {
      return n->ch[0] && !n->ch[0]->rp() && n->ch[1] && !n->ch[1]->rp();
    }
  }

  void remove(Node *n) {
    Node *y = n;
    if (n->ch[0] && n->ch[1])
      for (y = n->ch[1]; y->ch[0]; y = y->ch[0]);
    Node *p = y->parent(), *x = y->ch[0] ? y->ch[0] : y->ch[1];
    bool was_2 = p && y->rp() == p->rp();

    if (p) p->ch[p->ch[1] == y] = x;
    else root = x;
    if (x) x->set_parent(p);

    if (y != n) {
      y->ch[0] = n->ch[0]; y->ch[1] = n->ch[1];
      y->par_and_flg = n->par_and_flg;
      if (n->parent()) n->parent()->ch[n->parent()->ch[1] == n] = y;
      else root = y;
      y->ch[0]->set_parent(y);
      if (y->ch[1]) y->ch[1]->set_parent(y);
      if (p == n) p = y;
    }

    Node *x2 = p;
    if (p) {
      if (!was_2 && !x && !p->ch[0] && !p->ch[1] && p->rp()) {
        was_2 = p->parent() && p->rp() == p->parent()->rp();
        p->flip();
        x = p;
        p = p->parent();
      }
      for (; p && was_2; x = p, p = p->parent()) {
        int d = p->ch[1] == x;
        Node *y = p->ch[d^1];
        bool creates_3 = p->parent() && p->rp() == p->parent()->rp();
        if (y->rp() == p->rp()) {
          p->flip();
        } else if (is_22_node(y)) {
          y->flip();
          p->flip();
        } else {
          Node *v = y->ch[d], *w = y->ch[d^1];
          if ((w ? w->rp() : true) != y->rp()) {
            rotate(p, d^1);
            y->flip();
            p->flip();
            if (!p->ch[0] && !p->ch[1]) p->flip();
          } else {
            rotate(y, d);
            rotate(p, d^1);
            y->flip();
          }
          break;
        }
        if (!creates_3) break;
      }
    }

    for (; x2; x2 = x2->parent()) x2->mconcat();
  }
};

int print_tree(Node *n, int d = 0) {
  if (!n) return -1;
  int r = print_tree(n->ch[0], d + 1);
  r += n->rp() == (r & 1) ? 2 : 1;
  printf("%*s%d (%d)\n", 2*d, "", n->i, r);
  print_tree(n->ch[1], d + 1);
  return r;
}

int compute_rank(Node *n, bool debug = false) {
  if (!n) return -1;
  int lr = compute_rank(n->ch[0], debug), rr = compute_rank(n->ch[1], debug);
  if (lr < -1 || rr < -1) return -2;
  int rank_l = lr + (n->rp() == (lr & 1) ? 2 : 1);
  int rank_r = rr + (n->rp() == (rr & 1) ? 2 : 1);
  if (rank_l != rank_r) {
    if (debug) printf("node %d: rank mismatch left=%d right=%d\n", n->i, rank_l, rank_r);
    return -2;
  }
  if (!n->ch[0] && !n->ch[1] && rank_l != 0) {
    if (debug) printf("node %d: leaf must be 1,1 but flags=%lu\n", n->i, n->flags());
    return -2;
  }
  int expected_sum = n->i + (n->ch[0] ? n->ch[0]->sum : 0) + (n->ch[1] ? n->ch[1]->sum : 0);
  if (n->sum != expected_sum) {
    if (debug) printf("node %d: sum mismatch got=%d expected=%d\n", n->i, n->sum, expected_sum);
    return -2;
  }
  int expected_size = 1 + (n->ch[0] ? n->ch[0]->size : 0) + (n->ch[1] ? n->ch[1]->size : 0);
  if (n->size != expected_size) {
    if (debug) printf("node %d: size mismatch got=%d expected=%d\n", n->i, n->size, expected_size);
    return -2;
  }
  return rank_l;
}

bool verify_tree(const WAVL &tree, bool verbose = false) {
  int rank = compute_rank(tree.root);
  if (rank < -1) {
    printf("INVALID TREE\n");
    compute_rank(tree.root, true);
    return false;
  }
  if (verbose) printf("Tree verified, rank = %d\n", rank);
  return true;
}

int main() {
  srand(42);
  WAVL tree;
  int i = 0;
  std::vector<int> a(20);
  std::iota(a.begin(), a.end(), 1);
  std::shuffle(a.begin(), a.end(), std::default_random_engine(42));
  for (int val : a) {
    auto n = new Node;
    n->i = val;
    tree.insert(n);
    if (i++ < 6) {
      printf("-- %d After insertion of %d\n", i, val);
      print_tree(tree.root);
    }
  }
  printf("\nSum\tof values = %d\n", tree.root->sum);
  verify_tree(tree, true);

  for (int val : {5, 10, 15}) {
    if (auto found = tree.find(val)) {
      tree.remove(found);
      delete found;
    }
  }
  printf("After removing 5, 10, 15:\n");
  printf("\nSum\tof values = %d\n", tree.root->sum);
  verify_tree(tree, true);

  std::vector<Node*> ref;
  for (auto n = tree.min(); n; n = WAVL::next(n)) ref.push_back(n);

  for (int i = 0; i < 100000; i++) {
    if (ref.size() < 5 || (ref.size() < 1000 && rand() % 2 == 0)) {
      auto n = new Node;
      n->i = rand() % 100000;
      tree.insert(n);
      ref.push_back(n);
    } else {
      int idx = rand() % ref.size();
      tree.remove(ref[idx]);
      delete ref[idx];
      ref[idx] = ref.back();
      ref.pop_back();
    }
    if (i%100 == 0 && !verify_tree(tree)) {
      printf("FAILED at iteration %d\n", i);
      return 1;
    }
  }

  while (!ref.empty()) {
    tree.remove(ref.back());
    delete ref.back();
    ref.pop_back();
    if (tree.root && !verify_tree(tree)) {
      printf("FAILED during final cleanup\n");
      return 1;
    }
  }
  printf("Stress test passed\n");

  // Test rank, select, prev, next
  printf("\nTesting rank/select/prev/next...\n");
  std::vector<int> vals = {10, 20, 30, 40, 50};
  for (int v : vals) {
    auto n = new Node;
    n->i = v;
    tree.insert(n);
  }

  // rank tests (number of elements < key)
  assert(tree.rank(5) == 0);
  assert(tree.rank(10) == 0);
  assert(tree.rank(15) == 1);
  assert(tree.rank(20) == 1);
  assert(tree.rank(25) == 2);
  assert(tree.rank(50) == 4);
  assert(tree.rank(55) == 5);

  // select tests (0-indexed)
  assert(tree.select(0) == 10);
  assert(tree.select(1) == 20);
  assert(tree.select(2) == 30);
  assert(tree.select(3) == 40);
  assert(tree.select(4) == 50);
  assert(tree.select(5) == -1);

  // prev tests (largest < key)
  assert(tree.prev(10) == -1);
  assert(tree.prev(11) == 10);
  assert(tree.prev(20) == 10);
  assert(tree.prev(21) == 20);
  assert(tree.prev(50) == 40);
  assert(tree.prev(55) == 50);

  // next tests (smallest > key)
  assert(tree.next(5) == 10);
  assert(tree.next(10) == 20);
  assert(tree.next(15) == 20);
  assert(tree.next(40) == 50);
  assert(tree.next(50) == -1);
  assert(tree.next(55) == -1);

  printf("rank/select/prev/next tests passed\n");
}

Node structure:

  • ch[2]: left and right child pointers.
  • par_and_flg: packs the parent pointer with 2 flag bits in the low bits. Bit 0 indicates whether the left child has rank difference 2; bit 1 indicates whether the right child has rank difference 2. A cleared bit means rank difference 1.
  • i: the key value.
  • sum, size: augmented data maintained by mconcat for order statistics operations.

Helper methods:

  • rd2(d): returns true if child d has rank difference 2.
  • flip(d): toggles the rank difference of child d between 1 and 2.
  • clr_flags(): sets both children to rank difference 1 (used after rotations to reset a node to 1,1).

Invariants:

  • Leaves always have flags() == 0, meaning both null children are 1-children (null nodes have rank -1, leaves have rank 0).
  • After each insertion or deletion, mconcat is called along the path to the root to update augmented data.

Rotations:

The rotate(x, d) function rotates node x in direction d. It lifts x->ch[d] to replace x, and updates the augmented data for x. The caller is responsible for updating rank differences.

Rank encoding:

The paper suggests encoding a 1-bit rank difference in each node, but I find this approach less efficient. A null node can be either a 1-child (parent is binary) or a 2-child (parent is unary), so retrieving the child rank difference requires probing the sibling node:

int rank_diff(Node *p, int d) { return p->ch[d] ? p->ch[d]->par_and_flg & 1 : p->ch[!d] ? 2 : 1; }
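
With the two-bit child encoding used by the first implementation, by contrast, the lookup never leaves the parent (rd2 as defined in the Node struct above):

// Two-bit encoding: each child's rank difference is stored in the parent,
// so no sibling probe is needed.
int rank_diff(Node *p, int d) { return p->rd2(d) ? 2 : 1; }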

Misc

Visualization: https://tjkendev.github.io/bst-visualization/avl-tree/bu-weak.html

Stack walking: space and time trade-offs

On most Linux platforms (except AArch32, which uses .ARM.exidx), DWARF .eh_frame is required for C++ exception handling and stack unwinding to restore callee-saved registers. While .eh_frame can be used for call trace recording, it is often criticized for its runtime overhead. As an alternative, developers can enable frame pointers, or adopt SFrame, a newer format designed specifically for profiling. This article examines the size overhead of enabling non-DWARF stack walking mechanisms when building several LLVM executables.

Runtime performance analysis will be added in a future update.

Read More

Remarks on SFrame

SFrame is a new stack walking format for userspace profiling, inspired by Linux's in-kernel ORC unwind format. While SFrame eliminates some .eh_frame CIE/FDE overhead, it sacrifices functionality (e.g., personality, LSDA, callee-saved registers) and flexibility, and its stack offsets are less compact than .eh_frame's bytecode-style CFI instructions. In the llvm-project executables I've tested on x86-64, the .sframe section is 20% larger than .eh_frame. It also remains significantly larger than highly compact schemes like Windows ARM64 unwind codes.

SFrame describes three elements for each function:

  • Canonical Frame Address (CFA): The base address for stack frame calculations
  • Return address
  • Frame pointer

An .sframe section follows a straightforward layout:

  • Header: Contains metadata and offset information
  • Auxiliary header (optional): Reserved for future extensions
  • Function Descriptor Entries (FDEs): Array describing each function
  • Frame Row Entries (FREs): Arrays of unwinding information per function
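
To make the layout concrete, here is the header as I read it from binutils' include/sframe.h. This is a paraphrase for illustration; field names and widths may differ across SFrame versions.

#include <cstdint>

// Paraphrased from binutils' include/sframe.h; illustrative, not normative.
struct sframe_preamble {
  uint16_t sfp_magic;   // 0xdee2
  uint8_t sfp_version;
  uint8_t sfp_flags;
} __attribute__((packed));

struct sframe_header {
  sframe_preamble sfh_preamble;
  uint8_t sfh_abi_arch;            // ABI/arch identifier
  int8_t sfh_cfa_fixed_fp_offset;  // fixed FP offset from CFA, if any
  int8_t sfh_cfa_fixed_ra_offset;  // fixed RA offset from CFA, if any
  uint8_t sfh_auxhdr_len;          // auxiliary header length
  uint32_t sfh_num_fdes;           // number of function descriptor entries
  uint32_t sfh_num_fres;           // number of frame row entries
  uint32_t sfh_fre_len;            // total size of the FRE subsection
  uint32_t sfh_fdeoff;             // offset of the FDE subsection
  uint32_t sfh_freoff;             // offset of the FRE subsection
} __attribute__((packed));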

Read More

Benchmarking compression programs

tl;dr: https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7 is a single-file Ruby program that downloads and compiles multiple compression utilities, benchmarks their compression and decompression performance on a specified input file, and finally generates an HTML file with scatter charts. Scroll to the end to view example HTML pages.

Compression algorithms can be broadly categorized into three groups based on their typical compression ratio and decompression speed:

  • Low ratio, high speed: lz4, snappy, Oodle Selkie.
  • Medium ratio, medium speed: zlib, zstd, brotli, Oodle Kraken.
  • High ratio, low speed: LZMA, bzip2, bzip3, bsc, zpaq, kanzi, Oodle Leviathan.

Low ratio Codecs in this category prioritize speed above all else. Their compression and decompression speeds are comparable. They are designed to decompress so quickly that they don't introduce a noticeable delay when reading data from storage like solid-state drives. These codecs typically produce byte-aligned output and often skip the final step of entropy encoding, which, while crucial for high compression, is computationally intensive. They are excellent choices for applications where latency is critical, such as kernel features like zswap.

Medium ratio This is the sweet spot for many tasks. These codecs achieve a better compression ratio by employing entropy encoding, usually Huffman coding.

zstd has emerged as a clear leader, gaining popularity and effectively supplanting older codecs like the venerable DEFLATE (zlib).

High ratio They are designed to squeeze every last bit of redundancy out of the data, often at the cost of significantly longer compression and decompression times, and large memory usage. They are perfect for archival purposes or data distribution where the files are compressed once and decompressed infrequently. Codecs typically have 3 important components:

  • Transforms: Codecs typically implement strong transforms to increase redundancy, even very specific ones like branch/call/jump filters for machine code.
  • Prediction model: This model anticipates the next piece of data based on what has already been processed.
  • Entropy encoding: Traditional codecs use an arithmetic coder, increasingly replaced by the more efficient range variant of Asymmetric Numeral Systems (rANS).

Some projects apply neural network models, such as recurrent neural networks, Long Short-Term Memory, and Transformers, to the prediction model. They are usually very slow.


This categorization is loose. Many modern programs offer a wide range of compression levels that allow them to essentially span multiple categories. For example, a high-level zstd compression can achieve a ratio comparable to xz (a high-compression codec) by using more RAM and CPU. While zstd's compression speed or ratio is generally lower, its decompression speed is often much faster than that of xz.

Benchmarking

I want to benchmark the single worker performance of a few compression programs:

  • lz4: Focuses on speed over compression ratio. Memory usage is extremely low. It seems Pareto superior to Google's Snappy.
  • zstd: Gained significant traction and obsoleted many existing codecs. Its LZ77 variant uses three recent match offsets like LZX. For entropy encoding, it employs Huffman coding for literals and 2-way interleaved Finite State Entropy for Huffman weights, literal lengths, match lengths, and offset codes. The large alphabet of literals makes Huffman a good choice, as compressing them with FSE provides little gain for a speed cost. However, other symbols have a small range, making them a sweet spot for FSE. zstd works on multiple streams at the same time to utilize instruction-level parallelism. zstd is supported by the Accept-Encoding: zstd HTTP header. Decompression memory usage is very low.
  • brotli: Uses a combination of LZ77, 2nd order context model, Huffman coding, and static dictionary. The decompression speed is similar to gzip with a higher ratio. At lower levels, its performance is overshadowed by zstd. Compared with DEFLATE, it employs a larger sliding window (from 16KiB-16B to 16MiB-16B) and a smaller minimum match length (2 instead of 3). It has a predefined dictionary that works well for web content (but feels less elegant) and supports 120 transforms. brotli is supported by the Accept-Encoding: br HTTP header. Decompression memory usage is quite low.
  • bzip3: Combines BWT, RLE, and LZP, and uses an arithmetic coder. Memory usage is large.
  • xz: LZMA2 with a few filters. The filters must be enabled explicitly.
  • lzham: Provides a compression ratio similar to LZMA but with faster decompression. Compression is slightly slower, and memory usage is larger. The build system is not well-polished for Linux. I have forked it, fixed stdint.h build errors, and installed lzhamtest. The command line program lzhamtest should really be renamed to lzham.
  • zpaq: Functions as a command-line archiver supporting multiple files. It combines context mixing with an arithmetic coder but operates very slowly.
  • kanzi: There is a wide variety of transforms and entropy encoders, unusual for a compression program. For the compression speed of enwik8, it's Pareto superior to xz, but decompression is slower. Levels 8 and 9 belong to the PAQ8 family and consume substantial memory.

I'd like to test lzham (not updated for a few years), but I'm having trouble getting it to compile due to a cstdio header issue.

Many modern compressors are parallel by default; I disable this behavior with options like -T1. Even then, zstd uses a worker thread for I/O overlap, but I don't bother with --single-thread.

To ensure fairness, each program is built with consistent compiler optimizations, such as -O3 -march=native.

Below is a Ruby program that downloads and compiles multiple compression utilities, then compresses and decompresses a specified input file. It collects performance metrics including execution time, memory usage, and compression ratio, and finally generates an HTML file with scatter charts visualizing the results. The program has several notable features:

  • Adding new compressors is easy: just modify COMPRESSORS.
  • Benchmark results are cached in files named cache_$basename_$digest.json, allowing reuse of previous runs for the same input file.
  • Adding a new compression level does not invalidate existing benchmark results for other levels.
  • The script generates an HTML file with interactive scatter charts. Each compressor is assigned a unique, deterministic color based on a hash of its name (using the hsl function in CSS).

The single file Ruby program is available at https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7

Limitation

A single run might not be representative.

Running the executable incurs initialization overhead, which would be amortized in a library setup. However, library setup would make updating libraries more difficult.

Demo

ruby bench.rb enwik8
# The first iframe below

ruby bench.rb clang
# The second iframe below

Many programs exhibit a stable decompression speed (uncompressed size / decompression time). There is typically a slightly higher decompression speed at higher compression levels. If you think of the compressed content as a form of "byte code", a more highly compressed file means there are fewer bytes for the decompression algorithm to process, resulting in faster decompression. Some programs, like zpaq and kanzi, use different algorithms that can result in significantly different decompression speeds.

xz -9 doesn't use parallelism on the two files under ~100 MiB because their uncompressed size is smaller than the default block size for level 9.

From install/include/lzma/container.h

For each thread, about 3 * block_size bytes of memory will be allocated. This may change in later liblzma versions. If so, the memory usage will probably be reduced, not increased.

Understanding alignment - from source to object file

Alignment refers to the practice of placing data or code at memory addresses that are multiples of a specific value, typically a power of 2. This is usually done to meet the requirements of the programming language, ABI, or the underlying hardware. Misaligned memory accesses may be expensive, or may even trap, on certain architectures.

This blog post explores how alignment is represented and managed as C++ code is transformed through the compilation pipeline: from source code to LLVM IR, assembly, and finally the object file. We'll focus on alignment for both variables and functions.
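
As a small preview of the pipeline, consider a variable with a requested alignment. The LLVM IR and assembly shown in the comments are abbreviated from what Clang typically emits on x86-64.

// align.cc: request 32-byte alignment in C++ source.
alignas(32) int buf[8];

// Clang carries the request into LLVM IR as an attribute (abbreviated):
//   @buf = dso_local global [8 x i32] zeroinitializer, align 32
// and the backend lowers it to an assembly directive before the symbol:
//   .globl buf
//   .p2align 5          # 2^5 = 32 bytes
//   buf:
//     .zero 32
// The object file then records the requirement in the containing section's
// sh_addralign so that the linker can honor it.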

Read More

LLVM integrated assembler: Engineering better fragments

In my previous assembler posts, I've discussed improvements on expression resolving and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:

  • Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
  • Section layout phase: Once fragments are built, the assembler assigns offsets to them and finalizes the span-dependent content.
  • Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.

Read More

GCC 13.3.0 miscompiles LLVM

For years, I've been involved in updating LLVM's MC layer. A recent journey led me to eliminate the FK_PCRel_ fixup kinds:

MCFixup: Remove FK_PCRel_

The generic FK_Data_ fixup kinds handle both absolute and PC-relative
fixups. ELFObjectWriter sets IsPCRel to true for `.long foo-.`, so the
backend has to handle PC-relative FK_Data_.

However, the existence of FK_PCRel_ encouraged backends to implement it
as a separate fixup type, leading to redundant and error-prone code.

Removing FK_PCRel_ simplifies the overall fixup mechanism.

Read More