tl;dr Clang 19 will remove the -mrelax-all default at
-O0, significantly decreasing the text section size for
x86.
Span-dependent instructions
In assembly languages, some instructions with an immediate operand
can be encoded in two (or more) forms with different sizes. On x86-64, a
direct JMP/JCC can be encoded either in 2 bytes with a 8-bit relative
offset or 6 bytes with a 32-bit relative offset. A short jump is
preferred because it takes less space. However, when the target of the
jump is too far away (out of range for a 8-bit relative offset), a near
jump must be used.
1 2 3 4
ja foo # jump short if above, 77 <rel8> ja foo # jump near if above, 0f 87 <rel32> .nops 126 foo: ret
A 1978 paper by Thomas G. Szymanski ("Assembling Code for
Machines with Span-Dependent Instructions") used the term
"span-dependent instructions" to refer to such instructions with short
and long forms. Assemblers grapple with the challenge of choosing the
optimal size for these instructions, often referred to as the "branch
displacement problem" since branches are the most common type. A good
resource for understanding Szymanski's work is Assembling
Span-Dependent Instructions.
QOI, the Quite OK Image format,
has been gaining in popularity. Chris Wellons offers a great analysis.
QOI's key advantages is its simplicity. Being a byte-oriented format
without entropy encoding, it can be further compressed with generic data
compression programs like LZ4, XZ, and zstd. PNG, on the other hand,
uses DEFLATE compression internally and is typically resistant to
further compression. By applying a stronger compression algorithm on QOI
output, you can often achieve a smaller file size compared to PNG.
XZ
Lasse Collin has shared some effective options for compressing
uncompressed BMP/TIFF files. I tested them on the QOI benchmark images.
When the color table (palette) is used, a delta filter would increase
the compressed size and should be disabled.
While DEFLATE-compressed PNG files can hardly be further compressed,
we can convert these PNG files to uncompressed ones then apply xz. The
.png.xz results below do not apply a filter, and the files
are generally larger than .qoi.xz.
directory
.png
.png.xz
.qoi.xz
.bmp.xz
.tiff.xz
.p[ap]m.xz
icon_512
11154424
7861652
7476640
8042032
8064476
8039192
icon_64
828119
750836
708480
730472
757760
735296
photo_kodak
15394305
14464504
12902852
13612440
13616140
13610844
photo_tecnick
237834256
254803292
213268188
210591724
210508596
210468412
photo_wikipedia
88339751
100449996
86679696
86380124
86274480
86241296
pngimg
229608249
134233476
193382668
186389368
186654256
186420564
screenshot_game
266238855
237976536
218915316
216626004
216847500
216765748
screenshot_web
40272678
24690360
21321460
21458496
21532360
21533432
textures_photo
37854634
36393340
28967008
30054968
30064236
30059784
textures_pk
43523493
40868036
54117600
41990596
40632916
46695172
textures_pk01
18946769
15734348
14950836
14835648
14853420
14839312
textures_pk02
102962935
86037000
82279000
79374112
79348768
79336276
textures_plants
51765329
53044044
43681548
44913260
45021996
45048652
While compressing QOI with XZ (.qoi.xz) can achieve good
results, using a delta filter directly on the uncompressed BMP format
(.bmp.xz) can sometimes lead to even smaller files. (TIFF
and PPM/PAM, when compressed, can achieve similar file sizes to
.bmp.xz.) This suggests that QOI is probably not better
than a plain delta filter.
It's important to note that uncompressed BMP/TIFF files are huge.
This can be problematic if the decompressed data can't be streamed
directly into the program's internal structures. In such cases, a large
temporary buffer would be needed, wasting memory.
Drop LZ match finders
QOI_OP_INDEX essentially does length-1 LZ77 using a
conceptual window that contains 64 unique pixels. When further
compressed, another match finder seems to help very little.
Lasse Collin mentioned that the LZ layer cannot be disabled but it
can be made really weak using
xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1. Let's
try it.
This size increase is small for certain directories but quite large
for the others. For the directories with small size increases, relying
purely on delta coding and a fast entropy encoder will give a strong
competitor.
PNG
The PNG International Standard defines the compression method 0 as
DEFLATE with a sliding window of at most 32768 bytes. Technically new
compression methods can be defined, but that would break compatibility
of existing decoders and stakeholders would just resort to new image
formats. However, it would be a nice experiment to check that after the
compression part is improved, how PNG compares with newer image
formats.
ELF's design emphasizes natural
size and alignment guidelines for its control structures. While
ensured efficient processing in the old days, this can lead to larger
file sizes. I propose "Light ELF" (EV_LIGHT, version 2) – a
whimsical exploration inspired by Light Elves of Tolkien's legendarium
(who had seen the light of the Two Trees in Valinor).
In a release build of llvm-project
(-O3 -ffunction-sections -fdata-sections, the section
header tables occupy 13.4% of the .o file size.
I propose an alternative section header table format that is signaled
by e_shentsize == 0 in the ELF header.
e_shentsize == sizeof(Elf64_Shdr) (or the 32-bit
counterpart) selects the traditional section header table format.
In ISO C++ standards, [basic.start.term] specifies that:
Constructed objects ([dcl.init]) with static storage duration are
destroyed and functions registered with std::atexit are called as part
of a call to std::exit ([support.start.term]). The call to std::exit is
sequenced before the destructions and the registered functions. [Note
1: Returning from main invokes std::exit ([basic.start.main]). — end
note]
For example, consider the following code:
1
structA { ~A(); } a;
The destructor for object a will be registered for execution at
program termination.
This article introduces CREL (previously known as RELLEB), a new
relocation format offering incredible size reduction (LLVM
implementation in my fork).
ELF's design emphasizes natural size and alignment guidelines for its
control structures. This principle, outlined in Proceedings of the
Summer 1990 USENIX Conference, ELF: An Object File to Mitigate
Mischievous Misoneism, promotes ease of random access for
structures like program headers, section headers, and symbols.
All data structures that the object file format defines follow the
"natural" size and alignment guidelines for the relevant class. If
necessary, data structures contain explicit padding to ensure 4-byte
alignment for 4-byte objects, to force structure sizes to a multiple of
four, etc. Data also have suitable alignment from the beginning of the
file. Thus, for example, a structure containing an Elf32_Addr member
will be aligned on a 4-byte boundary within the file. Other classes
would have appropriately scaled definitions. To illustrate, the 64-bit
class would define Elf64 Addr as an 8-byte object, aligned on an 8-byte
boundary. Following the strictest alignment for each object allows the
format to work on any machine in a class. That is, all ELF structures on
all 32-bit machines have congruent templates. For portability, ELF uses
neither bit-fields nor floating-point values, because their
representations vary, even among pro- cessors with the same byte order.
Of course the programs in an ELF file may use these types, but the
format itself does not.
This article describes ABI and toolchain considerations about systems
without a Memory Management Unit (MMU). We will focus on FDPIC and the
in-development FDPIC ABI for RISC-V, with updates as I delve deeper into
the topic.
Embedded systems often lack MMUs, relying on real-time operating
systems (RTOS) like VxWorks or special Linux configurations
(CONFIG_MMU=n). In these systems, the offset between the
text and data segments is often not knwon at compile time. Therefore, a
dedicated register is typically set to somewhere in the data segment and
writable data is accessed relative to this register.
Why is the offset not knwon at compile time? There are primarily two
reasons.
First, eXecute in Place (XIP), where code resides in ROM while the
data segment is copied to RAM. Therefore, the offset between the text
and data segments is often not knwon at compile time.
Second, all processes share the same address space without MMU.
However, it is still desired for these processes to share text segments.
Therefore needs a mechanism for code to find its corresponding data.
This article describes some notes about z/Architecture
with a focus on the ELF ABI and ELF linkers. An lld/ELF patch
sparked my motivation to study the architecture and write this post.
z/Architecture
is a big-endian mainframe computer architecture supporting 24-bit,
31-bit, and 64-bit addressing modes. It is the latest generation in a
lineage stretching back to the 1964 with IBM System/360 (32-bit
general-purpose registers and 24-bit addressing). This lineage includes
System/370 (1970), System/370 Extended Architecture (1983), Enterprise
Systems Architecture/370 (1988), and Enterprise Systems Architecture/390
(1990). For a deeper dive into the design choices behind
z/Architecture's extension from ESA/390, you can refer to
"Development and attributes of z/Architecture."