This article introduces CREL (previously known as RELLEB), a new relocation format that offers a substantial size reduction (LLVM implementation in my fork).
ELF's design emphasizes natural size and alignment guidelines for its control structures. This principle, outlined in Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism, promotes ease of random access for structures like program headers, section headers, and symbols.
All data structures that the object file format defines follow the "natural" size and alignment guidelines for the relevant class. If necessary, data structures contain explicit padding to ensure 4-byte alignment for 4-byte objects, to force structure sizes to a multiple of four, etc. Data also have suitable alignment from the beginning of the file. Thus, for example, a structure containing an Elf32_Addr member will be aligned on a 4-byte boundary within the file. Other classes would have appropriately scaled definitions. To illustrate, the 64-bit class would define Elf64_Addr as an 8-byte object, aligned on an 8-byte boundary. Following the strictest alignment for each object allows the format to work on any machine in a class. That is, all ELF structures on all 32-bit machines have congruent templates. For portability, ELF uses neither bit-fields nor floating-point values, because their representations vary, even among processors with the same byte order. Of course the programs in an ELF file may use these types, but the format itself does not.
While beneficial for many control structures, the natural size guideline presents significant drawbacks for relocations. Since relocations are typically processed sequentially, they don't gain the same random-access advantages. The large 24-byte Elf64_Rela structure makes the cost apparent. For a detailed comparison of relocation formats, see Exploring object file formats#Relocations.
Furthermore, Elf32_Rel
and Elf32_Rela
sacrifice flexibility to maintain a smaller size, limiting relocation
types to a maximum of 255. This constraint has become noticeable for
AArch32 and RISC-V, and especially when platform-specific relocations
are needed. While the 24-bit symbol index field is less elegant, it
hasn't posed significant issues in real-world use cases.
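For reference, here are the standard definitions and r_info accessors (as in elf.h); the ELFCLASS32 macros show where the 8-bit type and 24-bit symbol index limits come from:

typedef struct {
  Elf64_Addr r_offset;   /* location to apply the relocation */
  Elf64_Xword r_info;    /* symbol index (high 32 bits) and type (low 32 bits) */
  Elf64_Sxword r_addend; /* constant addend */
} Elf64_Rela;            /* 24 bytes per relocation */

#define ELF64_R_SYM(i)  ((i) >> 32)
#define ELF64_R_TYPE(i) ((i) & 0xffffffff)

/* ELFCLASS32 packs r_info into 32 bits: a 24-bit symbol index and an 8-bit type. */
#define ELF32_R_SYM(i)  ((i) >> 8)
#define ELF32_R_TYPE(i) ((unsigned char) (i))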
In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other control structures, offering a significant size advantage over ELF.
Inspired by WebAssembly, I will start the discussion with a generic compression approach and then propose an alternative format (CREL) that addresses ELF's limitations.
Compressed relocations
While the standard SHF_COMPRESSED
feature is commonly
used for debug sections, its application can easily extend to relocation
sections. I have developed a Clang/lld prototype that demonstrates this
by compressing SHT_RELA
sections.
The compressed SHT_RELA
section occupies
sizeof(Elf64_Chdr) + size(compressed)
bytes. The
implementation retains uncompressed content if compression would result
in a larger size.
In scenarios with numerous small relocation sections (such as when using -ffunction-sections -fdata-sections), the 24-byte Elf64_Chdr header can introduce significant overhead. This observation raises the question of whether encoding Elf64_Chdr fields using ULEB128 could further optimize file sizes. With larger monolithic sections (.text, .data, .eh_frame), the compression ratio would be higher as well.
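For reference, this is the standard compression header whose 24 bytes are paid once per compressed section (as in elf.h):

typedef struct {
  Elf64_Word ch_type;       /* compression algorithm, e.g. ELFCOMPRESS_ZLIB or ELFCOMPRESS_ZSTD */
  Elf64_Word ch_reserved;
  Elf64_Xword ch_size;      /* uncompressed size */
  Elf64_Xword ch_addralign; /* uncompressed alignment */
} Elf64_Chdr;               /* 24 bytes */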
# configure-llvm is my wrapper of cmake that specifies some useful options.
Relocations consume a significant portion (approximately 20.9%) of
the file size. Despite the overhead of
-ffunction-sections -fdata-sections
, the compression
technique yields a significant reduction of 14.5%!
However, dropping in-place relocation processing is a downside.
CREL relocation format
The 1990 ELF paper ELF: An Object File to Mitigate Mischievous Misoneism says "ELF allows extension and redefinition for other control structures." Let's explore CREL, a new and more compact relocation format designed to replace REL and RELA. Our emphasis is on simplicity over absolutely minimal encoding: CREL uses a byte-oriented encoding that avoids complex compression techniques (e.g., dictionary-based compression or an entropy encoder). As a byte-oriented format, CREL relocations can be further compressed by other codecs, if desired. Using CREL in relocatable files can decrease memory usage.
See the end of the article for a detailed format description.
A SHT_CREL section (preferred name: .crel<name>) holds compact relocation entries that decode to Elf32_Rela or Elf64_Rela depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. The entries use ULEB128 and SLEB128 exclusively, and there is no endianness difference.

Here are the key design choices:
Relocation count (ULEB128): This allows for efficient retrieval of the relocation count without decoding the entire section. While a uint32_t (like SHT_HASH) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in the common case that the count can be encoded in one to three bytes.
Shifted offset: 64-bit data sections frequently have absolute relocations spaced 8 bytes apart. Additionally, on RISC architectures, offsets are often multiples of 2 or 4. A shift value of 2 allows delta offsets within the [0, 64) range to be encoded in a single byte, often avoiding two-byte encoding. In an AArch64 -O3 build, the shifted offset technique reduces size(.crel*) by 12.8%.

Many C++ virtual tables have their first relocation at offset 0x10. In the absence of the shifted offset technique, the relocation at offset 0x10 cannot be encoded in one byte.
Relocation section '.crel.data.rel.ro._ZTVN12_GLOBAL__N_113InlineSpillerE' at offset 0x116fe contains 5 entries:
The shifted offset also works really well for dynamic relocations,
whose offset differences are almost always a multiple of
sizeof(Elf_Addr)
.
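To make the single-byte claim concrete (my arithmetic, based on the entry encoding described later in this article): a one-byte ULEB128 value carries 7 payload bits, up to 3 of which hold presence flags, leaving 4 bits for the shifted delta offset. With shift==2, one byte covers offset deltas 0, 4, ..., 60, i.e. the [0, 64) range mentioned above; with shift==3, dynamic relocations spaced sizeof(Elf_Addr)==8 bytes apart have a shifted delta of just 1.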
Addend bit:
CREL was initially designed with explicit addends in each relocation entry. However, this approach created redundancy when I extended CREL to dynamic relocations (discussed thoroughly in another section). While RELR remains an option, the goal is for CREL to be a viable alternative even without RELR. In addition, when implicit addends are used, I feel sad if one bit in every relocation entry is wasted.
To address these concerns, a single bit flag has been introduced in the CREL header:
- addend_bit==1: The flags include a bit to signal the presence of the delta addend.
- addend_bit==0: The flag bit is stolen to encode one more bit for the offset. The addend is implicit and not encoded within the relocation.
This bit flag effectively balances efficiency by avoiding unnecessary storage for dynamic relocations while maintaining flexibility for cases requiring explicit addend values.
Assembler-produced SHT_CREL sections are supposed to always set the addend_bit bit.
Delta encoding for r_offset (ULEB128): Section offsets can be large, and relocations are typically ordered. Storing the difference between consecutive offsets offers compression potential. In most cases, a single byte suffices. While there are exceptions (the general dynamic TLS model of s390/s390x uses a local "out-of-order" pair: R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), we are optimizing for the common case.

For ELFCLASS32, r_offset members are computed using modular arithmetic modulo 4294967296.
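For instance (my arithmetic): if an ELFCLASS32 entry's r_offset decreases by 4 relative to the previous entry, the encoder emits a delta of 4294967292 (0xfffffffc), and the decoder's 32-bit addition wraps around to recover the correct offset.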
Delta encoding for r_symidx (SLEB128): This is more for consistency than for benefit.

Absolute symbol indexes allow one-byte encoding for symbols in the range [0,128) and offer a minor size advantage for static relocations when the symbol table is sorted by usage frequency. Delta encoding, on the other hand, optimizes for the scenario where the symbol table exhibits locality: neighboring symbols are frequently mutually called.

Delta symbol indexes also enable one-byte encoding for GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by symbol index. For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded as the repeated pair 0x05 0x01 (when addend_bit==0 && shift==3: offset++, symidx++). Absolute encoding could only partially claim this optimization, by arranging symbols in a "cold0 hot cold1" pattern.
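To see where the repeated 0x05 0x01 pair comes from (my arithmetic, using the entry encoding described later in this article): with addend_bit==0 there are two flag bits, so an entry's first ULEB128 value is delta_offset*4 + flags. An 8-byte offset delta becomes delta_offset = 1 after the shift of 3, and flags = 1 marks a present delta symbol index, giving 1*4 + 1 = 0x05; the SLEB128 delta symbol index +1 then encodes as 0x01.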
In my experiments, absolute encoding with ULEB128 results in slightly larger .o file sizes for both x86-64 and AArch64 builds.
Delta encoding for r_type
(SLEB128):
Some psABIs utilize relocation types greater than 128. AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding. Delta encoding allows all but the first relocation's type to be encoded in a single byte. An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.
If the AArch32 psABI could be redesigned, allocating
[0,64)
for Thumb relocation types and [64,*)
for ARM relocation types would optimize delta encoding even further.
While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations. This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.
Delta encoding for addend (SLEB128):
Encoding the delta addend offers a slight size advantage and optimizes for cases like:
.quad .data + 0x78
Symbol index/type/addend omission

Relocations often exhibit patterns that can be exploited for size reduction:

- Symbol index/type remain constant with varying addends, common for STT_SECTION symbols, e.g. .rodata, .eh_frame, .debug_str_offsets, .debug_names, .debug_line, and .debug_addr.
- Type/addend remain constant with varying symbol indexes, common for non-section symbols, e.g. function calls, C++ virtual tables, and dynamic relocations.
// Type/addend do not change.
We use the least significant bits of the offset member to signal the presence of symbol index, type, and addend information. This allows us to omit delta fields when they match the previous entry.
CREL steals 3 bits from the offset member. I tried stealing just one bit and using a negative symbol index to signal type/addend omission, but offsets generally require fewer bits to encode, so stealing bits from offsets is superior.
Omitting the symbol index information is especially beneficial for
reducing debug build sizes. For example, most .debug_names
relocations can be encoded using only 4 bytes (offset and delta addend),
instead of the 7 bytes required otherwise.
While RISC architectures often require multiple relocations with different types to access global data, making type omission slightly less beneficial than on x86, the frequent use of call instructions still offers a large size reduction with type omission.
With a limited number of types and frequent zero addends (except
R_*_RELATIVE
and R_*_IRELATIVE
), dynamic
relocations also benefit from type/addend omission.
I have developed a prototype at https://github.com/MaskRay/llvm-project/tree/demo-crel.
CREL demonstrates superior size reduction compared to the SHF_COMPRESSED SHT_RELA approach.
LEB128 among variable-length integer encodings
LEB128 and UTF-8 stand out as the two most commonly used byte-oriented, variable-length integer encoding schemes. Binary encodings often employ LEB128. While alternatives like PrefixVarInt (or a suffix-based variant) might excel when encoding larger integers, LEB128 offers advantages when most integers fit within one or two bytes, as it avoids the need for shift operations in the common one-byte representation.
While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generated code is inferior to, or at best on par with, SLEB128 for one-byte encodings on x86, AArch64, and RISC-V.
// One-byte case for SLEB128
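For reference, a typical 32-bit zigzag implementation (my sketch, not from the prototype) looks like the following; note the extra shift and XOR on both the encode and decode paths, which the plain one-byte SLEB128 case avoids:

// Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
uint32_t zigzag_encode(int32_t i) { return ((uint32_t)i << 1) ^ (uint32_t)(i >> 31); }
int32_t zigzag_decode(uint32_t v) { return (int32_t)(v >> 1) ^ -(int32_t)(v & 1); }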
While some variable-length integer schemes allocate more integers to the one-byte bucket, I do not believe they would lead to a noticeable improvement over LEB128, and their complexity is higher. For example, when I assign one extra bit to offsets (by clearing addend_bit in the header), .crel.dyn merely decreases by 2.5%.
Here is an extremely simple C decoder implementation for ULEB128 and SLEB128. The clever use of 64/128 is from Stefan O'Rear. The return type uint64_t can be changed to size_t when used in a dynamic loader.

static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
  uint64_t acc = 0, shift = 0, byte;
  do {
    byte = *(*buf)++;
    // With sleb_uleb==128 (ULEB128), subtract 128 from every byte that has
    // the continuation bit. With sleb_uleb==64 (SLEB128), also subtract 128
    // from a final byte in [64,128), which sign-extends the result.
    acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
    shift += 7;
  } while (byte >= 128);
  return acc;
}

uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }
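For completeness, matching encoders might look like this (my sketch, not part of the original snippet):

static void write_uleb128(unsigned char **buf, uint64_t v) {
  while (v >= 128) {
    *(*buf)++ = (v & 127) | 128; /* low 7 bits plus continuation bit */
    v >>= 7;
  }
  *(*buf)++ = v;
}

static void write_sleb128(unsigned char **buf, int64_t v) {
  for (;;) {
    unsigned char byte = v & 127;
    v >>= 7; /* arithmetic shift preserves the sign */
    if ((v == 0 && !(byte & 64)) || (v == -1 && (byte & 64))) {
      *(*buf)++ = byte; /* sign bit agrees with the remainder; last byte */
      return;
    }
    *(*buf)++ = byte | 128;
  }
}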
I used a modified lld to analyze the LEB128 length distribution in an x86-64 release build of lld with CREL enabled.
zo[std::min(getULEB128Size(offset-old_offset), 3u) - 1]++;
The distribution of ULEB128/SLEB128 lengths is:
1 2 3+ any
80.5% of LEB128 encodings take 1 byte and 19.2% take 2 bytes. 2-byte delta symidx members are quite common, but I do not plan to steal bits from other members for symidx.
VLQ (variable-length quantity) used by the MIDI file format is a big-endian variant of LEB128. While VLQ offers some advantages:
- Lexicographic ordering: This can be beneficial for database sorting but isn't a factor for object files.
- Potential for better sign extension.
git uses a bijective variant of VLQ. While bijectivity is a nice property, it is not useful here. The potential benefits are outweighed by the slightly more complex encoder and the advantages of sticking with the popular LEB128 in object files.
https://sqlite.org/src4/doc/trunk/www/varint.wiki describes a scheme that encodes 0~240 in one byte, but it is a bit complex. A simplified version I explored led to larger relocatable files and dynamic relocations compared to LEB128. Additionally, implementing 67-bit delta offsets and flags without 128-bit integer support would be cumbersome.
uint64_t read_vu128(const uint8_t *&p) {
Experiments
build | format | .o size | size(.rel*) | .o size decrease
---|---|---|---|---
-O3 | RELA | 136012504 | 28235448 |
-O3 | CREL | 111583312 | 3806234 | 18.0%
aarch64 -O3 | RELA | 124965808 | 25855800 |
aarch64 -O3 | CREL | 102529784 | 3388307 | 18.0%
ppc64le -O3 | RELA | 129017272 | 26589192 |
ppc64le -O3 | CREL | 105860576 | 3432419 | 17.9%
riscv64 -O3 | RELA | 227189744 | 91396344 |
riscv64 -O3 | CREL | 149343352 | 13549699 | 34.3%
-O1 -g | RELA | 1506173760 | 340965576 |
-O1 -g | CREL | 1202445768 | 37237274 | 20.2%
-O3 -g $SPLIT | RELA | 549003848 | 104227128 |
-O3 -g $SPLIT | CREL | 459768736 | 14992114 | 16.3%
SPLIT="-gpubnames -gsplit-dwarf"
Let's compare x86_64 -O3 builds of lld: size(.crel*)/size(.rel*) = 3806234/28235448 = 13.5%. The total .o file size has decreased by 18.0%. In addition, the maximum resident set size of the linker (also lld) using mimalloc has decreased by 4.2%.
It would be interesting to explore the potential gains of combining zstd compression with CREL.
configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-Wa,--crel -Xclang --compress-relocations=zstd'
Here is a Ruby script to count relocations:

agg=Hash.new 0
I debated whether to name the new section SHT_RELOC (.reloc<name>) or SHT_RELLEB (.relleb<name>). Ultimately, I chose SHT_CREL because its unique name minimizes potential confusion: SHT_RELOC could be confused with SHT_REL and SHT_RELA, and the LEB128 aspect of RELLEB is not that strong a defining feature.
Case study
Let's explore some real-world scenarios where relocation size is critical.
Marker relocations
Marker relocations indicate that certain linker optimizations/relaxations are applicable. While many marker relocations are used sparingly, RISC-V relocatable files are typically filled with R_RISCV_RELAX relocations, whose size contribution is quite substantial.
.llvm_addrsig
On many Linux targets, Clang emits a special section called .llvm_addrsig (type SHT_LLVM_ADDRSIG, the LLVM address-significance table) by default to allow ld.lld --icf=safe. The .llvm_addrsig section stores symbol indexes in ULEB128 format, independent of relocations. Consequently, tools like ld -r and objcopy risk invalidating the section due to symbol table modifications.
Ideally, using relocations would keep the section valid under such operations. However, the size concern of REL/RELA in ELF hinders this approach. In contrast, lld's Mach-O port chose a relocation-based representation for __DATA,__llvm_addrsig.
If CREL is adopted, we can consider switching to the relocation representation.
.llvm.call-graph-profile
LLVM leverages a special section called
.llvm.call-graph-profile
(type
SHT_LLVM_CALL_GRAPH_PROFILE
) for both instrumentation- and
sample-based profile-guided optimization (PGO). lld utilizes
this information ((from_symbol, to_symbol, weight) tuples) to
optimize section ordering within an input section description, enhancing
cache utilization and minimizing TLB thrashing.
Similar to .llvm_addrsig
, the
.llvm.call-graph-profile
section initially faced the symbol
index invalidation problem, which was solved by switching to
relocations. I opted for REL over RELA to reduce code size.
DWARF sections
In a non-split-DWARF build, .rela.debug_str_offsets
and
.rela.debug_addr
consume a significant portion of the file
size.
DWARF v5 accelerated name-based access with the introduction of the
.debug_names
section. However, in a
clang -g -gsplit-dwarf -gpubnames
generated relocatable
file, the .rela.debug_names
section can consume a
significant portion (approximately 10%) of the file size.
Relocation section '.crel.debug_names' at offset 0x65c0 contains 200 entries:
This size increase has sparked discussions within the LLVM community about potentially altering the file format for linking purposes.
.debug_line and .debug_addr also contribute a large number of relocations.
Relocation section '.crel.debug_addr' at offset 0x64f1 contains 51 entries:
Many adjacent relocations share the same section symbol. CREL can
compress most relocations into a few bytes depending on addend sizes
(offset+=O, addend+=A
).
By teaching the assembler to use implicit addends, we achieve even
greater size reduction by compressing most relocations into a single
byte (mostly 0x04, offset+=1<<3
). However, this might
harm compression in the presence of
--compress-debug-sections=zstd
. Personally I recommend that
we don't use implicit addends.
CREL for dynamic relocations
CREL excels with static relocations, but what about the dynamic case?
A substantial part of position-independent executables (PIEs) and dynamic shared objects (DSOs) is occupied by dynamic relocations. While RELR (a compact relative relocation format) offers size-saving benefits for relative relocations, other dynamic relocations can benefit from a compact relocation format. There are a few properties:
- There are far fewer relocation types.
- The offsets are often spaced sizeof(Elf_Addr) bytes apart. No two dynamic relocations can share the same offset.
- Each symbol is associated with very few dynamic relocations, typically 1 or 2 (R_*_JUMP_SLOT and R_*_GLOB_DAT). When a symbol is associated with more dynamic relocations, it is typically a base class function residing in multiple C++ virtual tables, e.g. __cxa_pure_virtual. -fexperimental-relative-c++-abi-vtables would eliminate such dynamic relocations.
Android's packed relocation format (linker implementation:
ld.lld --pack-dyn-relocs=android
) was an earlier design
that applies to all dynamic relocations at the cost of complexity. It
replaces .rel.dyn
/.rela.dyn
but does not
change the section name.
Additionally, Apple linkers and dyld use LEB128 encoding for bind opcodes.
Here is a one-liner to dump the relative relocation and non-relative
relocation sizes for each shared object in a system library directory:
ruby -e 'Dir.glob("/usr/lib/x86_64-linux-gnu/*.so.*").each{|f| next if File.symlink?(f) || `file #{f}`!~/shared object/; s=`readelf -Wr #{f}`.lines; nr=s.count{|x|x=~/R_/&&x !~/_RELATIVE/}*24; r=s.count{|x|x=~/_RELATIVE/}*24; s=File.size(f); puts "#{f}\t#{s}\t#{r} (#{(r*100.0/s).round(2)}%)\t#{nr} (#{(nr*100.0/s).round(2)}%)" }'
I believe CREL compresses dynamic relocations well, but it is far from an optimal dynamic relocation format. A generalized RELR format would leverage these dynamic relocation properties better. Here's a possible encoding:
// R_*_RELATIVE group
We need to enumerate all dynamic relocation types, including R_*_IRELATIVE and the R_*_TLSDESC variants used by some ports. Some R_*_TLSDESC relocations have a symbol index of zero, but the straightforward encoding does not utilize this property.
Traditionally, we have two dynamic relocation ranges for executables and shared objects (except static position-dependent executables):

- .rela.dyn ([DT_RELA, DT_RELA + DT_RELASZ)) or .rel.dyn ([DT_REL, DT_REL + DT_RELSZ))
- .rela.plt ([DT_JMPREL, DT_JMPREL + DT_PLTRELSZ)): Stores JUMP_SLOT relocations. DT_PLTREL specifies DT_REL or DT_RELA.
IRELATIVE relocations can be placed in either range, but preferably in .rel[a].dyn.
Some GNU ld ports (e.g. SPARC) treat .rela.plt
as a
subset of .rela.dyn
, introducing complexity for dynamic
loaders.
CREL adoption considerations
- New dynamic tag (DT_CREL): Identifies CREL relocations, separate from the existing DT_REL/DT_RELA.
- No DT_CRELSZ: The relocation count can be derived from the CREL header.
- The output section description .rela.dyn : { *(.rela.dyn) *(.rela.plt) } is incompatible with CREL.
Challenges with lazy binding
glibc's lazy binding scheme relies on random access to relocation entries within the DT_JMPREL table. CREL's sequential nature prevents this. However, eager binding doesn't require random access. Therefore, when -z now (eager binding) is enabled, we can:

- Set DT_PLTREL to DT_CREL.
- Replace .rel[a].plt with .crel.plt.
Challenges with statically linked position-dependent executables
glibc introduces additional complexity for IRELATIVE relocations in statically linked position-dependent executables. They should only contain IRELATIVE relocations and no other dynamic relocations.
glibc's csu/libc-start.c
processes IRELATIVE relocations
in the range [__rela_iplt_start, __rela_iplt_end)
(or [__rel_iplt_start, __rel_iplt_end)
, determined at build
time through ELF_MACHINE_IREL
). While CREL relocations
cannot be decoded in the middle of the section, we can still place
IRELATIVE relocations in .crel.dyn
because there wouldn't
be any other relocation types (position-dependent executables don't have
RELATIVE relocations). When CREL is enabled, we can define
__crel_iplt_start
and __crel_iplt_end
for
statically linked position-dependent executables.
If glibc only intends to support addend_bit==0, the code can simply be:

extern const uint8_t __crel_iplt_start[] __attribute__ ((weak));
extern const uint8_t __crel_iplt_end[] __attribute__ ((weak));

if (&__crel_iplt_start != &__crel_iplt_end) {
  const uint8_t *p = __crel_iplt_start;
  // The header encodes count*8 + addend_bit*4 + shift; addend_bit is 0 here.
  size_t offset = 0, count = read_uleb128 (&p), shift = count & 3;
  for (count >>= 3; count; count--) {
    uint8_t rel_head = *p++;
    offset += rel_head >> 2;
    if (rel_head & 128) // continuation: read the remaining delta-offset bits
      offset += (read_uleb128 (&p) << 5) - 32;
    if (rel_head & 2)   // skip the delta type if present
      read_sleb128 (&p);
    elf_crel_irel ((ElfW (Addr) *) (offset << shift));
  }
}
Considering implicit addends for CREL
Many dynamic relocations have zero addends:

- COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends.
- Absolute relocations could use non-zero addends with STT_SECTION symbols, but linkers convert them to relative relocations.

Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF might require non-zero addends. Switching from DT_RELA to DT_REL offers a minor size advantage.
I considered defining two separate dynamic tags (DT_CREL and DT_CRELA) to distinguish between implicit and explicit addends. However, this would have introduced complexity:

- Should llvm-readelf -r dump the zero addends for DT_CRELA?
- Should dynamic loaders support both dynamic tags?
I placed the delta addend bit next to the offset bits so that it can be reused for offsets. Thanks to Stefan O'Rear for convincing me that my original idea of reserving a single bit flag (addend_bit) within the CREL header is elegant. Dynamic loaders prioritizing simplicity can hardcode the desired addend_bit value.
ld.lld -z crel
defaults to implicit addends
(addend_bit==0
), but the option of using in-relocation
addends is available with -z crel -z rela
.
DT_AARCH64_AUTH_RELR vs CREL
The AArch64 PAuth ABI introduces DT_AARCH64_AUTH_RELR
as
a variant of RELR for signed relocations. However, its benefit seems
limited.
In a release build of Clang 16, using -z crel -z rela resulted in a .crel.dyn section size of only 1.0% of the file size. Notably, enabling implicit addends with -z crel -z rel further reduced the size to just 0.3%. While DT_AARCH64_AUTH_RELR would achieve a noticeably smaller relocation size if most relative relocations were encoded with it, the advantage seems less significant considering CREL's already compact size.
Furthermore, DT_AARCH64_AUTH_RELR introduces additional complexity to the linker due to its 32-bit addend limitation: the in-place 64-bit value encodes a 32-bit schema, leaving just 32 bits for the implicit addend. If the addend does not fit into 32 bits, DT_AARCH64_AUTH_RELR cannot be used. CREL with addends would avoid this complexity.
I have filed Quantifying the benefits of DT_AARCH64_AUTH_RELR.
I've implemented ld.lld -z crel
to replace
.rel[a].dyn
and .rel[a].plt
with
.crel.dyn
and .crel.plt
. Dynamic relocations
are sorted by (r_type, r_offset)
to better utilize
CREL.
Let's link clang-16-release using RELA,
-z pack-relative-relocs
,
--pack-dyn-relocs=android+relr
, and
-z pack-relative-relocs -z crel
and analyze the
results.
% fld.lld @response.txt -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
Analysis
- Relative relocations usually outnumber non-relative relocations.
- RELR significantly optimizes relative relocations, offering the largest size reduction.
- CREL further improves the non-relative portion, compressing it to 5.77%, even better than Android packed relocations (6.60%)! Android's r_info sharing with RELOCATION_GROUPED_BY_INFO_FLAG has been overshadowed by our shifted offset technique.
- The non-relative relocation advantage is less pronounced since .relr.dyn still accounts for a significant portion of the size.
- .crel.dyn using DT_CREL (implicit addends) without RELR is more than 3x as large as the RELR combination .relr.dyn+.rela.dyn.
Decoding ULEB128/SLEB128 would necessitate more work in the dynamic loader.
I have a musl patch for DT_CREL support. In an x86-64 -O2 build, CREL support linked with -z pack-relative-relocs -z crel increases the size of libc.so by just 200 bytes (0.0256%) compared to a non-CREL build with -z pack-relative-relocs.
- https://github.com/MaskRay/musl/tree/crel: musl patch
- https://github.com/MaskRay/llvm-project/tree/demo-crel:
ld.lld -z crel
However, DT_CREL currently uses the non-standard value 0x40000026
(see this comment: https://github.com/llvm/llvm-project/pull/91280#discussion_r1659384454).
I hope that we can switch to 0x26 before lld gets -z crel
support.
Linker notes
--emit-relocs
and -r
necessitate combining
relocation sections. The output size may differ from the sum of input
sections. The total relocation count must be determined, a new header
written, and section content regenerated, as symbol indexes and addends
may have changed. If the linker does not attempt to determine the offset
shift in another relocation scan, the offset shift in the header can be
set to 0. Debug sections, .eh_frame
, and
.gcc_except_table
require special handling to rewrite
relocations referencing a dead symbol to R_*_NONE
. This
also necessitates updating the relocation type.
--emit-relocs
and -r
copy CREL relocation
sections (e.g. .crel.text
) to the output. When
.rela.text
is also present, linkers are required to merge
.rela.text
into .crel.text
.
GNU ld allows certain unknown section types:

- [SHT_LOUSER, SHT_HIUSER] and non-SHF_ALLOC
- [SHT_LOOS, SHT_HIOS] and non-SHF_OS_NONCONFORMING

but reports errors and stops linking for others (unless --no-warn-mismatch is specified). When linking a relocatable file using SHT_CREL, you might encounter errors like the following:
% clang -Wa,--crel -fuse-ld=bfd a.c b.c
Older lld and mold do not report errors. I have filed:
- https://github.com/llvm/llvm-project/issues/84812 (milestone: 19.1)
- https://github.com/rui314/mold/issues/1215 (milestone: 2.4.2)
In addition, when there is one .eh_frame
section with
CIE pieces but no relocation, _bfd_elf_parse_eh_frame
will
report an error.
mips64el
mips64el has an incorrect r_info: a 32-bit little-endian symbol index followed by a 32-bit big-endian type. If mips64el decides to adopt CREL, they can utilize this opportunity to fix r_info.
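For context, here is my summary of the layout (per the MIPS64 psABI; a sketch, not from the article). r_info effectively behaves as a struct rather than a single 64-bit field:

struct Elf64MipsRInfo {
  uint32_t r_sym;  // symbol index, little-endian on mips64el
  uint8_t r_ssym;  // special symbol
  uint8_t r_type3; // third relocation type
  uint8_t r_type2; // second relocation type
  uint8_t r_type;  // first relocation type
};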
Data compression
I analyzed relocatable files in lld 18 (x86_64, -O3
builds) and extracted RELA and CREL relocations. I investigated the
potential benefits of compressing these combined sections within the
file.
It's important to consider that data compression, while beneficial for size reduction, can introduce challenges for random access and increase memory usage especially for the linker. Therefore, it might not be a suitable solution.
with open(f'{f}.bin', 'wb') as out, \
CREL, being a byte-oriented format, allows for further compression. It can be regarded as a very efficient filter before a Lempel-Ziv compressor.
Even more surprising, CREL outperforms RELA compressed with lz4 (level 9) and zstd (level 6).
% stat -c '%s %n' *.bin*
Dynamic relocations are also worth investigating.

ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android
ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel
llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst
Interestingly, both zstd and LZMA2's default levels on RELA outperform Android's packed relocation format. Even better, CREL outperforms them both!
The results do not suggest that we add a Lempel-Ziv compressor to CREL. It would significantly increase complexity and decoder memory usage. The filesystem's transparent compression can handle this for us conveniently.
CREL proposal for the generic ABI
The latest revision has been proposed at https://groups.google.com/g/generic-abi/c/ppkaxtLb0P0. I have also created:
In https://www.sco.com/developers/gabi/latest/ch4.sheader.html, make the following changes.
In Figure 4-9: Section Types, sh_type, append a row:

SHT_CREL | 20
Add text:
SHT_CREL - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.
In Figure 4-16: Special Sections, append:

.crelname | SHT_CREL | see below
Change the text below:
.relname, .relaname, and .crelname
These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .crel.text.
In Figure 4-23: Relocation Entries, add:
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sword r_addend;
} Elf32_Crel;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Crel;
Add text above "A relocation section references two other sections":
A SHT_CREL section holds compact relocation entries that decode to Elf32_Crel or Elf64_Crel depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded value count * 8 + addend_bit * 4 + shift (35-bit or 67-bit unsigned), where:

- count: Relocation count (32-bit or 64-bit unsigned).
- addend_bit: 1 indicates that relocation entries encode addends. 0 indicates implicit addends (stored in the location to be modified).
- shift: The shift value (0 to 3) that applies to delta_offset in relocation entries.
Relocation entries (which encode r_offset, r_symidx, r_type, and r_addend) follow the header. Note: r_info in traditional REL/RELA formats has been split into r_symidx and r_type, allowing uint32_t relocation types for ELFCLASS32 as well.
- Delta offset and flags (ULEB128): Holds delta_offset * (addend_bit ? 8 : 4) + flags (35-bit or 67-bit unsigned), where:
  - delta_offset: Difference in r_offset from the previous entry (truncated to Elf32_Addr or Elf64_Addr), right shifted by shift.
  - flags: 0 to 7 if addend_bit is 1; otherwise 0 to 3.
    - flags & 1: Indicates whether a delta symbol index is present.
    - flags & 2: Indicates whether a delta type is present.
    - flags & 4: Indicates whether a delta addend is present.
- Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry, truncated to a 32-bit signed integer.
- Delta type (SLEB128, if present): The difference in relocation type from the previous entry, truncated to a 32-bit signed integer.
- Delta addend (SLEB128, if present): The difference in addend from the previous entry, truncated to a 32-bit or 64-bit signed integer depending on the object file class.
ULEB128 or SLEB128 encoded values use the canonical representation (i.e., the shortest byte sequence). For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.
Encoding/decoding delta offset and flags does not need multi-precision arithmetic. We can just unroll and special case the first iteration. The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.
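As an illustration of the unrolling (my sketch against the description above, with flagBits == 3 for addend_bit == 1): the first ULEB128 byte of delta_offset * 8 + flags can be emitted by hand, after which the remainder is an ordinary 64-bit ULEB128 value. emitByte and encodeULEB128 are assumed helpers.

// Encode delta_offset*8 + flags (up to 67 bits) without 128-bit arithmetic.
void encodeOffsetAndFlags(uint64_t deltaOffset, unsigned flags) {
  uint8_t b = (uint8_t)((deltaOffset << 3) | flags); // low 7 bits are (b & 127)
  if (deltaOffset < 16) {
    emitByte(b & 127);               // the whole value fits in one byte
  } else {
    emitByte((b & 127) | 128);       // set the continuation bit
    encodeULEB128(deltaOffset >> 4); // the remaining bits, i.e. value >> 7
  }
}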
Example C++ encoder:
// encodeULEB128(uint64_t, raw_ostream &os);
Example C++ decoder:
// uint64_t decodeULEB128(uint8_t *&p);
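Since the example above is truncated, here is a minimal decoder sketch I derived from the format description; decodeULEB128/decodeSLEB128 are assumed helpers and Reloc is a hypothetical struct, so treat this as an illustration rather than the proposal's reference code:

#include <cstdint>
#include <vector>

uint64_t decodeULEB128(const uint8_t *&p); // assumed helper
int64_t decodeSLEB128(const uint8_t *&p);  // assumed helper

struct Reloc { uint64_t offset; uint32_t symidx, type; int64_t addend; };

std::vector<Reloc> decodeCrel(const uint8_t *p) {
  const uint64_t hdr = decodeULEB128(p);       // count*8 + addend_bit*4 + shift
  const uint64_t count = hdr / 8, shift = hdr % 4;
  const unsigned flagBits = (hdr & 4) ? 3 : 2; // addend_bit selects 2 or 3 flag bits
  std::vector<Reloc> relocs;
  uint64_t offset = 0;
  uint32_t symidx = 0, type = 0;
  int64_t addend = 0;
  for (uint64_t i = 0; i != count; ++i) {
    const uint64_t b = decodeULEB128(p);       // delta offset and flags combined
    offset += (b >> flagBits) << shift;
    if (b & 1) symidx += decodeSLEB128(p);     // delta symbol index
    if (b & 2) type += decodeSLEB128(p);       // delta type
    if (flagBits == 3 && (b & 4))
      addend += decodeSLEB128(p);              // delta addend (addend_bit only)
    relocs.push_back({offset, symidx, type, addend});
  }
  return relocs;
}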
Both encoder and decoder can be simplified if the desired
addend_bit
is hardcoded, making flagBits
an
integer literal.
In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_CREL | 38 | d_ptr | optional | optional
Add text below:
DT_CREL - This element is similar to DT_REL, except its table uses the CREL format. The relocation count can be inferred from the header.
Update DT_PLTREL and DT_PLTRELSZ:

- DT_PLTRELSZ: This element holds the total size, in bytes, of the relocation entries associated with the procedure linkage table. If an entry of type DT_JMPREL is present and the DT_PLTREL entry value is DT_REL or DT_RELA, a DT_PLTRELSZ must accompany it.
- DT_PLTREL: This member specifies the type of relocation entry to which the procedure linkage table refers. The d_val member holds DT_REL, DT_RELA, or DT_CREL, as appropriate. All relocations in a procedure linkage table must use the same relocation type.
Abandoned proposal RELLEB (last revision)
A SHT_CREL section holds compact relocation entries that decode to Elf32_Crel or Elf64_Crel depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_symidx, r_type, and r_addend. Note that the r_info member in traditional REL/RELA formats has been split into separate r_symidx and r_type members, allowing uint32_t relocation types for ELFCLASS32 as well.
In the following description, Elf_Addr/Elf_SAddr denote uint32_t/int32_t for ELFCLASS32 or uint64_t/int64_t for ELFCLASS64.
- First member (ULEB128): Holds 2 * delta_offset + eq (33-bit or 65-bit unsigned), where:
  - delta_offset: Difference in r_offset from the previous entry (Elf_Addr).
  - eq: Indicates whether the symbol index/type match the previous entry (1 for match, 0 otherwise).
- Second member (SLEB128) if eq is 1:
  - Difference in r_addend from the previous entry (Elf_SAddr).
- Second member (SLEB128) if eq is 0:
  - If type and addend match the previous entry, the encoded value is the symbol index; type and addend are omitted.
  - Otherwise, the bitwise NOT of the encoded value (33-bit signed) is the symbol index; delta type and delta addend follow:
    - Delta type (SLEB128): The difference in relocation type from the previous entry (32-bit signed).
    - Delta addend (SLEB128): The difference in r_addend relative to the previous entry (signed Elf_Addr).

The bitwise NOT of symbol index 0xffffffff is -0x100000000 (33-bit) instead of 0 (32-bit).
Encoder in pseudo-code:

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size());
for (const Reloc &rel : relocs) {
  if (symidx == rel.r_symidx && type == rel.r_type) {
    // Symbol index/type match the previous entry. Encode the addend.
    encodeULEB128(2 * uint128_t(rel.r_offset - offset) + 1); // at most 65-bit
    encodeSLEB128(rel.r_addend - addend);
    addend = rel.r_addend;
  } else {
    encodeULEB128(2 * uint128_t(rel.r_offset - offset)); // at most 65-bit
    if (type == rel.r_type && addend == rel.r_addend) {
      // Type/addend match the previous entry. Encode the symbol index.
      encodeSLEB128(rel.r_symidx);
    } else {
      // No optimization is applied. Encode symbol index, type, and addend.
      encodeSLEB128(~static_cast<int64_t>(rel.r_symidx));
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type));
      type = rel.r_type;
      encodeSLEB128(static_cast<Elf_SAddr>(rel.r_addend - addend));
      addend = rel.r_addend;
    }
    symidx = rel.r_symidx;
  }
  offset = rel.r_offset;
}
Encoding/decoding an unsigned 65-bit value does not need multi-precision arithmetic. We can just unroll and special-case the first iteration. Example C++ encoder:
// encodeULEB128(uint64_t, raw_ostream &os);
Example C++ decoder:
// uint64_t decodeULEB128(uint8_t *&p);