MaskRay

Clang's -O0 output: branch displacement and size increase

2024-04-27T07:00:00.000Z

tl;dr Clang 19 will remove the -mrelax-all default at -O0, significantly decreasing the text section size for x86.

Span-dependent instructions

In assembly languages, some instructions with an immediate operand can be encoded in two (or more) forms with different sizes. On x86-64, a direct JMP/JCC can be encoded either in 2 bytes with an 8-bit relative offset or in 5 bytes (JMP) or 6 bytes (JCC) with a 32-bit relative offset. A short jump is preferred because it takes less space. However, when the target of the jump is too far away (out of range for an 8-bit relative offset), a near jump must be used.

ja foo    # jump short if above, 77 <rel8>
ja foo    # jump near if above, 0f 87 <rel32>
.nops 126
foo: ret

A 1978 paper by Thomas G. Szymanski ("Assembling Code for Machines with Span-Dependent Instructions") used the term "span-dependent instructions" to refer to such instructions with short and long forms. Assemblers grapple with the challenge of choosing the optimal size for these instructions, often referred to as the "branch displacement problem" since branches are the most common type. A good resource for understanding Szymanski's work is Assembling Span-Dependent Instructions.

Start small and grow

Popular assemblers still used today tend to favor a "start small and grow" approach, typically requiring one more pass than Szymanski's "start big and shrink" method. This approach often results in smaller code and can handle additional complexities like alignment directives.
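To make the "start small and grow" idea concrete, here is a minimal sketch in C++. It is not LLVM MC's implementation: the Item struct, the relax function, and the flat item list are made up for illustration, and only the x86 JCC sizes mentioned above (2 and 6 bytes) are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One item of a toy instruction stream (hypothetical type): either a run of
// fixed-size bytes or a conditional branch to the start of another item.
struct Item {
  bool isBranch;
  uint32_t size; // current size in bytes; a branch starts at 2
  size_t target; // index of the destination item (branches only)
};

// Every branch starts in its 2-byte short form and is widened to the 6-byte
// near form once its displacement no longer fits in a signed 8-bit field.
// Returns the number of layout passes until nothing changes.
int relax(std::vector<Item> &items) {
  int passes = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    ++passes;
    // offsets[i] is the start of item i; offsets.back() is the section end.
    std::vector<int64_t> offsets(items.size() + 1, 0);
    for (size_t i = 0; i < items.size(); ++i)
      offsets[i + 1] = offsets[i] + items[i].size;
    for (size_t i = 0; i < items.size(); ++i) {
      Item &it = items[i];
      if (!it.isBranch || it.size == 6)
        continue;
      assert(it.target < items.size());
      // x86 displacements are relative to the end of the branch instruction.
      int64_t disp = offsets[it.target] - offsets[i + 1];
      if (disp < -128 || disp > 127) {
        it.size = 6; // sizes only grow, so the loop terminates
        changed = true;
      }
    }
  }
  return passes;
}
```

Feeding it the earlier example (two ja foo followed by .nops 126 and ret) widens the first branch, whose displacement is 128, and keeps the second, whose displacement is 126, in the short form; the loop settles on the second pass.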

In LLVM, the MC library (Machine Code) is responsible for assembly, disassembly, and object file formats. Within MC, "assembler relaxation" deals with span-dependent instructions. This is distinct from linker relaxation.

Eli Bendersky provides a detailed explanation in a 2013 blog post and highlights an interesting behavior:

For example, when compiling with -O0, the LLVM assembler simply relaxes all jumps it encounters on first sight. This allows it to put all instructions immediately into data fragments, which ensures there's much fewer fragments overall, so the assembly process is faster and consumes less memory.

When -O0 is enabled and the integrated assembler is used (the common default), clangDriver passes the -mrelax-all flag to the LLVM MC library. This sets the MCRelaxAll flag in MCTargetOptions, instructing the assembler to potentially start with the long form (near) for JMP and JCC instructions on the X86 target only. Other instructions like ADD/SUB/CMP and non-x86 architectures remain unaffected.

`-mrelax-all` tradeoff

Here is an example:

void foo(int a) {
  // -mrelax-all: near jump (6 bytes)
  // -mno-relax-all or -fno-integrated-as: short jump (2 bytes)
  if (a) bar();
}

The assembly (clang -S) looks like:

foo:                                    # @foo
# %bb.0:                                # %entry
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        cmpl    $0, -4(%rbp)
        je      .LBB0_2
# %bb.1:                                # %if.then
        movb    $0, %al
        callq   bar@PLT
.LBB0_2:                                # %if.end
        addq    $16, %rsp
        popq    %rbp
        retq

The JE instruction assembles to either a short jump (8-bit relative offset) or a near jump (32-bit relative offset).

# -mrelax-all
MCSection
  MCDataFragment: empty
  MCAlignFragment: alignment=4
  MCDataFragment: instructions including JE (jump near if equal, 6 bytes)

# -mno-relax-all
MCSection
  MCDataFragment: empty
  MCAlignFragment: alignment=4
  MCDataFragment: instructions before JE (push; mov; sub; mov; cmp)
  MCRelaxableFragment: JE (jump short if equal, 2 bytes). This JE could be expanded, but not in this case.
  MCDataFragment: instructions after JE (mov; call; add; pop; ret)

The impact of -mrelax-all on text section size is significant, especially when there are many branch instructions. In an x86-64 release build of lld, -mrelax-all increased the .text section size by 7.9%. This translates to a 5.4% increase in VM size and a 4.6% increase in the overall file size.

Dean Michael Berris proposed to remove the -mrelax-all default for -O0 in 2016, but it stalled. -mrelax-all caused undesired interaction issues with RISC-V's conditional branch transforms, leading Craig Topper to remove -mrelax-all at -O0 for RISC-V recently.

While -mrelax-all might have offered slight compile time benefits in the past, the gains are negligible today. Benchmarking using stage 2 builds of Clang showed no measurable difference between -mrelax-all and -mno-relax-all. On llvm-compile-time-tracker running the llvm-test-suite/CTMark benchmark, removing the default increased compile time slightly, by 0.62%, while the text section size decreased by 4.44%.

Having the assembler behave differently at different optimisation levels would also be quite surprising. GCC/GNU assembler don't exhibit similar expansion of JMP/JCC instructions even at -O0.

These arguments strengthen the case for removing -mrelax-all as the default for -O0. My patch has landed and will be included in the next major release, LLVM 19.1.

Understanding the compile time difference

I have studied a notoriously huge file, llvm/lib/Target/X86/X86ISelLowering.cpp.

Fragment count: A significant difference exists in the number of assembler fragments generated:

-mrelax-all: 89633
-mno-relax-all: 143852

With -mrelax-all, the number of MCRelaxableFragments is substantially reduced (to zero when building Clang). This reduction likely contributes to the compile time difference.

Fixed-point iteration: -mrelax-all ensures the fixed-point iteration algorithm (almost always) converges in a single iteration. In contrast, with -mno-relax-all, around 6% of sections require additional iterations. However, this difference is likely not the primary factor affecting compile time.

// -mrelax-all
1: 13919
2: 1

// -mno-relax-all
1: 13103
2: 793
3: 23
4: 1

Why didn't people complain about the code size increase?

Because people generally care less about -O0 code size.

-O0 is frequently used with -g to include debugging information. This debug information can overshadow the size increase caused by -mrelax-all. (-O1 or above sacrifices some debuggability.)

In addition, not all projects can be successfully built with -O0. This is typically due to issues like very large programs or mandatory inlining behavior.

For a discussion on size reduction ideas in ELF relocatable files, please check out my blog post about Light ELF.

You might also be interested in my notes about GNU assembler and LLVM integrated assembler.

When QOI meets XZ

2024-04-23T07:00:00.000Z

QOI, the Quite OK Image format, has been gaining in popularity. Chris Wellons offers a great analysis.

QOI's key advantage is its simplicity. Being a byte-oriented format without entropy encoding, it can be further compressed with generic data compression programs like LZ4, XZ, and zstd. PNG, on the other hand, uses DEFLATE compression internally and is typically resistant to further compression. By applying a stronger compression algorithm on QOI output, you can often achieve a smaller file size compared to PNG.

XZ

Lasse Collin has shared some effective options for compressing uncompressed BMP/TIFF files. I tested them on the QOI benchmark images.

When the color table (palette) is used, a delta filter would increase the compressed size and should be disabled.

% cat ~/tmp/b.sh
#!/bin/zsh -ue
f() {
  pngcrush -fix -m 1 -l 0 $1 ${1/.png/.uncompressed.png}
  [[ -f ${1/.png/.uncompressed.png} ]] || cp $1 ${1/.png/.uncompressed.png}
  /tmp/p/qoi/qoiconv $1 ${1/.png/.qoi}
  convert $1 ${1/.png/.bmp}
  convert $1 -compress none ${1/.png/.tiff}
  xz --lzma2=pb=0 -fk ${1/.png/.qoi}
  if [[ $(file $1) =~ RGBA ]]; then
    pnm=${1/.png/.pam}
    convert $1 $pnm
    xz --delta=dist=4 --lzma2=lc=4 -fk $pnm
    xz --delta=dist=4 --lzma2=lc=4 -fk ${1/.png/.bmp}
    xz --delta=dist=4 --lzma2=lc=4 -fk ${1/.png/.tiff}
  elif [[ $(file $1) =~ 'colormap' ]]; then
    pnm=${1/.png/.ppm}
    convert $1 $pnm
    xz --lzma2=pb=0 -fk $pnm
    xz --lzma2=pb=0 -fk ${1/.png/.bmp}
    xz --lzma2=pb=0 -fk ${1/.png/.tiff}
  else
    pnm=${1/.png/.ppm}
    convert $1 $pnm
    xz --delta=dist=3 --lzma2=pb=0 -fk $pnm
    xz --delta=dist=3 --lzma2=pb=0 -fk ${1/.png/.bmp}
    xz --delta=dist=3 --lzma2=pb=0 -fk ${1/.png/.tiff}
  fi
  stat -c '%n %s' $1 ${1/.png/.qoi.xz} $pnm.xz ${1/.png/.bmp.xz} ${1/.png/.tiff.xz}
}

f $1

cd /tmp/dc-img/images/
ls -1 **/*.png | rush ~/tmp/b.sh '"{}"'
ls -1 **/*.uncompressed.png | rush 'xz -fk --lzma2=pb=0 "{}"'

ruby -e 'puts "directory\t.png\t.png.xz\t.qoi.xz\t.bmp.xz\t.tiff.xz\t.p[ap]m.xz"; Dir.glob("*").each{|dir| next unless File.directory? dir;
  png=pngxz=qoi=bmp=pnm=tiff=0; Dir.glob("#{dir}/*.qoi.xz").each{|f|
  png+=File.size(f.sub(/\.qoi.xz/,".png"));
  pngxz+=File.size(f.sub(/\.qoi.xz/,".uncompressed.png.xz"));
  qoi+=File.size(f); bmp+=File.size(f.sub(/\.qoi/,".bmp")); ppm=f.sub(/\.qoi/,".ppm");  pnm+=File.exist?(ppm) ? File.size(ppm) : File.size(f.sub(/\.qoi/,".pam")); tiff+=File.size(f.sub(/\.qoi/,".tiff"));
};
  puts "#{dir}\t#{png}\t#{pngxz}\t#{qoi}\t#{bmp}\t#{tiff}\t#{pnm}"
}'

While DEFLATE-compressed PNG files can hardly be further compressed, we can convert these PNG files to uncompressed ones and then apply xz. The .png.xz results below do not apply a filter, and the files are generally larger than .qoi.xz.

directory	.png	.png.xz	.qoi.xz	.bmp.xz	.tiff.xz	.p[ap]m.xz
icon_512	11154424	7861652	7476640	8042032	8064476	8039192
icon_64	828119	750836	708480	730472	757760	735296
photo_kodak	15394305	14464504	12902852	13612440	13616140	13610844
photo_tecnick	237834256	254803292	213268188	210591724	210508596	210468412
photo_wikipedia	88339751	100449996	86679696	86380124	86274480	86241296
pngimg	229608249	134233476	193382668	186389368	186654256	186420564
screenshot_game	266238855	237976536	218915316	216626004	216847500	216765748
screenshot_web	40272678	24690360	21321460	21458496	21532360	21533432
textures_photo	37854634	36393340	28967008	30054968	30064236	30059784
textures_pk	43523493	40868036	54117600	41990596	40632916	46695172
textures_pk01	18946769	15734348	14950836	14835648	14853420	14839312
textures_pk02	102962935	86037000	82279000	79374112	79348768	79336276
textures_plants	51765329	53044044	43681548	44913260	45021996	45048652

While compressing QOI with XZ (.qoi.xz) can achieve good results, using a delta filter directly on the uncompressed BMP format (.bmp.xz) can sometimes lead to even smaller files. (TIFF and PPM/PAM, when compressed, can achieve similar file sizes to .bmp.xz.) This suggests that QOI is probably not better than a plain delta filter.

It's important to note that uncompressed BMP/TIFF files are huge. This can be problematic if the decompressed data can't be streamed directly into the program's internal structures. In such cases, a large temporary buffer would be needed, wasting memory.
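For readers unfamiliar with delta filters, here is a small sketch of byte-wise delta coding in the spirit of xz's --delta=dist=N option. It illustrates the idea rather than reproducing liblzma's code, and the function names deltaEncode and deltaDecode are made up for this example. With dist=3, each byte of a 24-bit RGB image is replaced by its difference from the same channel of the previous pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each output byte is the difference (mod 256) from the input byte `dist`
// positions earlier. The first `dist` bytes have no predecessor and are
// copied unchanged.
std::vector<uint8_t> deltaEncode(const std::vector<uint8_t> &in, size_t dist) {
  assert(dist >= 1);
  std::vector<uint8_t> out(in.size());
  for (size_t i = 0; i < in.size(); ++i)
    out[i] = static_cast<uint8_t>(in[i] - (i >= dist ? in[i - dist] : 0));
  return out;
}

// Inverse transform: add back the already-decoded byte `dist` positions
// earlier.
std::vector<uint8_t> deltaDecode(const std::vector<uint8_t> &in, size_t dist) {
  assert(dist >= 1);
  std::vector<uint8_t> out(in.size());
  for (size_t i = 0; i < in.size(); ++i)
    out[i] = static_cast<uint8_t>(in[i] + (i >= dist ? out[i - dist] : 0));
  return out;
}
```

A smooth gradient such as (10,20,30), (11,21,31), (12,22,32) turns into 10,20,30 followed by six 1 bytes, which a general-purpose compressor handles much better than the raw values.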

Drop LZ match finders

QOI_OP_INDEX essentially does length-1 LZ77 using a conceptual window that contains 64 unique pixels. When further compressed, another match finder seems to help very little.
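The "window of 64 pixels" can be sketched as follows. The hash formula is the one given in the QOI specification; the Pixel and SeenPixels types and the lookupOrInsert helper are invented for this illustration, and a real encoder also tries runs and small differences, which are not modeled here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Pixel {
  uint8_t r, g, b, a;
  bool operator==(const Pixel &o) const {
    return r == o.r && g == o.g && b == o.b && a == o.a;
  }
};

// Hash from the QOI specification: maps a pixel to one of 64 slots.
inline unsigned qoiHash(Pixel p) {
  return (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) % 64;
}

// Toy model of the array of previously seen pixels.
struct SeenPixels {
  std::array<Pixel, 64> slots{}; // zero-initialized, as in the specification

  // Returns the slot number when the pixel is already stored there (an
  // encoder could then emit the one-byte QOI_OP_INDEX), or -1 after storing
  // the pixel so that a later occurrence can be matched.
  int lookupOrInsert(Pixel p) {
    unsigned h = qoiHash(p);
    assert(h < slots.size());
    if (slots[h] == p)
      return static_cast<int>(h);
    slots[h] = p;
    return -1;
  }
};
```

Because a slot holds only the most recent pixel with that hash, the structure behaves like a tiny match finder for single pixels, which is why a second LZ layer finds little left to exploit.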

Lasse Collin mentioned that the LZ layer cannot be disabled but it can be made really weak using xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1. Let's try it.

% =time -f '%e' xz -fk Prune_video_game_screenshot_2.qoi && stat -c %s Prune_video_game_screenshot_2.qoi.xz
0.76
2462360
% =time -f '%e' xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1 -fk Prune_video_game_screenshot_2.qoi && stat -c %s Prune_video_game_screenshot_2.qoi.xz
0.27
2526664

Indeed, weakening the LZ layer improves compression speed significantly. Now, let's test all benchmark images.

% cat ~/tmp/qoi-weak-xz.sh
#!/bin/zsh
/tmp/p/qoi/qoiconv $1 ${1/.png/.qoi}
xz --lzma2=pb=0 -fk ${1/.png/.qoi}
xz --lzma2=dict=4KiB,mode=fast,nice=2,mf=hc3,depth=1 -c ${1/.png/.qoi} > ${1/.png/.qoi.weak-lz.xz}
% cd /tmp/dc-img/images
% ls -1 **/*.png | rush ~/tmp/qoi-weak-xz.sh '"{}"'

ruby -e 'puts "directory\tstrong\tweak\tincrease"; Dir.glob("*").each{|dir| next unless File.directory? dir;
  strong=weak=0; Dir.glob("#{dir}/*.qoi.weak-lz.xz").each{|f| weak+=File.size(f); strong+=File.size(f.sub(/\.weak-lz/,""));};
  puts "#{dir}\t#{strong}\t#{weak}\t#{(100.0*weak/strong-100).round(2)}%"
}'

directory	strong	weak	increase
icon_512	7476640	8629900	15.42%
icon_64	708480	735036	3.75%
photo_kodak	12902852	13464072	4.35%
photo_tecnick	213268188	217460392	1.97%
photo_wikipedia	86679696	88609716	2.23%
pngimg	193382668	206679224	6.88%
screenshot_game	218915316	234889060	7.3%
screenshot_web	21321460	24820020	16.41%
textures_photo	28967008	31249492	7.88%
textures_pk	54117600	57956168	7.09%
textures_pk01	14950836	15749556	5.34%
textures_pk02	82279000	87747576	6.65%
textures_plants	43681548	45494084	4.15%

This size increase is small for certain directories but quite large for the others. For the directories with small size increases, relying purely on delta coding and a fast entropy encoder will give a strong competitor.

PNG

The PNG International Standard defines the compression method 0 as DEFLATE with a sliding window of at most 32768 bytes. Technically new compression methods can be defined, but that would break compatibility with existing decoders, and stakeholders would just resort to new image formats. However, it would be a nice experiment to see how PNG compares with newer image formats once its compression layer is improved.

Light ELF: exploring potential size reduction

2024-04-01T07:00:00.000Z

ELF's design emphasizes natural size and alignment guidelines for its control structures. While this ensured efficient processing in the old days, it can lead to larger file sizes. I propose "Light ELF" (EV_LIGHT, version 2), a whimsical exploration inspired by the Light Elves of Tolkien's legendarium (who had seen the light of the Two Trees in Valinor).

In a light ELF file, the e_version member of the ELF header is set to 2. EV_CURRENT remains 1 for backward compatibility.

#define EV_NONE 0
#define EV_CURRENT 1
#define EV_LIGHT 2

When linking a program, traditional ELF (version 1) and light ELF (version 2) files can be mixed together.

Relocations

Light ELF utilizes CREL for relocations. REL and RELA from traditional ELF are unused.

Existing lazy binding schemes rely on random access to relocation entries within the DT_JMPREL table. Due to CREL's sequential nature, keeping lazy binding requires a memory allocation that holds decoded JUMP_SLOT relocations.

Section header table

Traditional ELF (version 1) section header tables can be large. Light ELF addresses this through a compact section header table format signaled by e_shentsize == 0 in the ELF header.

A compact section header table for ELF contains the details. Its current version is copied below for your convenience.

nshdr denotes the number of sections (including SHN_UNDEF). The section header table (located at e_shoff) begins with nshdr Elf_Word values. These values specify the offset of each section header relative to e_shoff.

Following these offsets, nshdr section headers are encoded. Each header begins with a presence byte indicating which subsequent Elf_Shdr members use explicit values vs. defaults:

sh_name, ULEB128 encoded
sh_type, ULEB128 encoded (if presence & 1), defaults to SHT_PROGBITS
sh_flags, ULEB128 encoded (if presence & 2), defaults to 0
sh_addr, ULEB128 encoded (if presence & 4), defaults to 0
sh_offset, ULEB128 encoded
sh_size, ULEB128 encoded (if presence & 8), defaults to 0
sh_link, ULEB128 encoded (if presence & 16), defaults to 0
sh_info, ULEB128 encoded (if presence & 32), defaults to 0
sh_addralign, ULEB128 encoded as log2 value (if presence & 64), defaults to 1
sh_entsize, ULEB128 encoded (if presence & 128), defaults to 0

In traditional ELF, sh_addralign can be 0 or a positive integral power of two, where 0 and 1 mean the section has no alignment constraints. While the compact encoding cannot encode a sh_addralign value of 0, there is no loss of generality.

Example C++ code that decodes a specific section header:

// readULEB128(const uint8_t *&p);

const uint8_t *sht = base + ehdr->e_shoff;
const uint8_t *p = sht + ((Elf_Word*)sht)[i];
uint8_t presence = *p++;
Elf_Shdr shdr = {};
shdr.sh_name = readULEB128(p);
shdr.sh_type = presence & 1 ? readULEB128(p) : ELF::SHT_PROGBITS;
shdr.sh_flags = presence & 2 ? readULEB128(p) : 0;
shdr.sh_addr = presence & 4 ? readULEB128(p) : 0;
shdr.sh_offset = readULEB128(p);
shdr.sh_size = presence & 8 ? readULEB128(p) : 0;
shdr.sh_link = presence & 16 ? readULEB128(p) : 0;
shdr.sh_info = presence & 32 ? readULEB128(p) : 0;
shdr.sh_addralign = presence & 64 ? 1UL << readULEB128(p) : 1;
shdr.sh_entsize = presence & 128 ? readULEB128(p) : 0;
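The snippet above only declares readULEB128. A minimal sketch of such a helper could look like the following; it assumes well-formed input and performs no bounds checking against the end of the file, which a real ELF reader would need.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one ULEB128 value starting at p and advances p past it.
// Each byte carries 7 payload bits, least significant group first; the high
// bit is set on every byte except the last one.
uint64_t readULEB128(const uint8_t *&p) {
  uint64_t value = 0;
  unsigned shift = 0;
  for (;;) {
    uint8_t byte = *p++;
    assert(shift < 64 && "encoding too long for a 64-bit value");
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0)
      return value;
    shift += 7;
  }
}
```

Values below 128 take a single byte, which is why most members of a typical section header shrink from 4 or 8 bytes to 1 or 2.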

While the current format allows for O(1) in-place random access of section headers using offsets at the beginning of the table, this access pattern seems uncommon in practice. At least, I haven't encountered (or remembered) any instances within the llvm-project codebase. Therefore, I'm considering removing this functionality.

In a release build of llvm-project (-O3 -ffunction-sections -fdata-sections -Wa,--crel), the traditional section header tables occupy 16.4% of the .o file size while the compact section header table drastically reduces the ratio to 4.7%.

Symbol table

Like other sections, symbol table and string table sections (SHT_SYMTAB and SHT_STRTAB) can be compressed through SHF_COMPRESSED. However, compressing the dynamic symbol table (.dynsym) and its associated string table (.dynstr) is not recommended.

Symbol table sections have a non-zero sh_entsize, which remains unchanged after compression.

The string table, which stores symbol names (also section names in LLVM output), is typically much larger than the symbol table itself. To reduce its size, we can utilize a text compression algorithm. While compressing the string table, compressing the symbol table along with it might make sense, but using a compact encoding for the symbol table itself won't provide significant benefits.

Program headers

Program headers are individually large (each Elf64_Phdr is 56 bytes) and do not need random access, but an executable or shared object typically has only a few of them. Consequently, their overall size contribution is relatively small. Light ELF maintains the existing format.

Section compression

Compressed sections face a challenge due to header overhead, especially for ELFCLASS64.

typedef struct {
  Elf32_Word ch_type;
  Elf32_Word ch_size;
  Elf32_Word ch_addralign;
} Elf32_Chdr;

typedef struct {
  Elf64_Word ch_type;
  Elf64_Word ch_reserved;
  Elf64_Xword ch_size;
  Elf64_Xword ch_addralign;
} Elf64_Chdr;

The overhead and alignment padding limit the effectiveness when used with features like -ffunction-sections and -fdata-sections that generate many smaller sections. For example, I have found that the large Elf64_Chdr makes evaluating compressed .rela.* sections difficult. Light ELF addresses this challenge by introducing a header format with a smaller footprint:

ch_type, ULEB128 encoded
ch_size, ULEB128 encoded
ch_addralign, ULEB128 encoded as log2 value

This approach allows Light ELF to represent the header information in just 3 bytes for smaller sections, compared to the 24 bytes required by the traditional Elf64_Chdr. The content is no longer guaranteed to be word-aligned, a property that most compression libraries don't require anyway.
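As a rough check of the "3 bytes" figure, the following sketch computes the size of the proposed header. The function names uleb128Size and lightChdrSize are made up for this example; the field layout follows the three-member list above.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes ULEB128 needs for v: one byte per started group of 7 bits.
unsigned uleb128Size(uint64_t v) {
  unsigned n = 1;
  while (v >>= 7)
    ++n;
  return n;
}

// Size in bytes of the compact compression header sketched above.
// addralign must be a power of two because it is stored as its log2.
unsigned lightChdrSize(uint64_t type, uint64_t size, uint64_t addralign) {
  assert(addralign != 0 && (addralign & (addralign - 1)) == 0);
  unsigned lg = 0;
  while ((addralign >> lg) != 1)
    ++lg;
  return uleb128Size(type) + uleb128Size(size) + uleb128Size(lg);
}
```

With ch_type = 1 (ELFCOMPRESS_ZLIB), an uncompressed size below 128 bytes, and any power-of-two alignment, lightChdrSize returns 3; sizes from 128 up to 16383 bytes need one more byte.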

Furthermore, compressed sections with the SHF_ALLOC flag are allowed. Using them outside of relocatable files needs caution, though.

Experiments

I have developed a Clang/lld prototype that implements compact section header table and CREL (https://github.com/MaskRay/llvm-project/tree/april-2024).

`.o size`	sht size	build
136012504	18284992	-O3
111583312	18284992	-O3 -Wa,--crel
97976973	4604341	-O3 -Wa,--crel,--cshdr
2174179112	260281280	-g
1763231672	260281280	-g -Wa,--crel
1577187551	74234983	-g -Wa,--crel,--cshdr

Light ELF: a thought experiment

By now, you might have realized that this post is a joke. While bumping e_version and modifying Elf_Chdr might not be feasible, it's interesting to consider the possibilities of compact section headers and compressed symbol/string tables. Perhaps this can spark some interesting discussions!

A compact section header table for ELF

2024-03-31T07:00:00.000Z

ELF's design emphasizes natural size and alignment guidelines for its control structures. However, this approach has substantial size drawbacks.

In a release build of llvm-project (-O3 -ffunction-sections -fdata-sections), the section header tables occupy 13.4% of the .o file size.

I propose an alternative section header table format that is signaled by e_shentsize == 0 in the ELF header. e_shentsize == sizeof(Elf64_Shdr) (or the 32-bit counterpart) selects the traditional section header table format.

nshdr denotes the number of sections (including SHN_UNDEF). The compact section header table (located at e_shoff) begins with nshdr Elf_Word values. These values specify the offset of each section header relative to e_shoff.

Following these offsets, nshdr section headers are encoded. Each header begins with a presence byte indicating which subsequent Elf_Shdr members use explicit values vs. defaults:

sh_name, ULEB128 encoded
sh_type, ULEB128 encoded (if presence & 1), defaults to SHT_PROGBITS
sh_flags, ULEB128 encoded (if presence & 2), defaults to 0
sh_addr, ULEB128 encoded (if presence & 4), defaults to 0
sh_offset, ULEB128 encoded
sh_size, ULEB128 encoded (if presence & 8), defaults to 0
sh_link, ULEB128 encoded (if presence & 16), defaults to 0
sh_info, ULEB128 encoded (if presence & 32), defaults to 0
sh_addralign, ULEB128 encoded as log2 value (if presence & 64), defaults to 1
sh_entsize, ULEB128 encoded (if presence & 128), defaults to 0

Example C++ code that decodes a section header:

// readULEB128(const uint8_t *&p);

const uint8_t *sht = base + ehdr->e_shoff;
const uint8_t *p = sht + ((Elf_Word*)sht)[i];
uint8_t presence = *p++;
Elf_Shdr shdr = {};
shdr.sh_name = readULEB128(p);
shdr.sh_type = presence & 1 ? readULEB128(p) : ELF::SHT_PROGBITS;
shdr.sh_flags = presence & 2 ? readULEB128(p) : 0;
shdr.sh_addr = presence & 4 ? readULEB128(p) : 0;
shdr.sh_offset = readULEB128(p);
shdr.sh_size = presence & 8 ? readULEB128(p) : 0;
shdr.sh_link = presence & 16 ? readULEB128(p) : 0;
shdr.sh_info = presence & 32 ? readULEB128(p) : 0;
shdr.sh_addralign = presence & 64 ? 1UL << readULEB128(p) : 1;
shdr.sh_entsize = presence & 128 ? readULEB128(p) : 0;
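The writer side mirrors the decoder. The sketch below is not taken from the prototype: the Shdr struct, kShtProgbits, writeULEB128, and encodeShdr are names invented for this illustration, and only the per-section part is encoded (the table of nshdr offsets that precedes the headers is not shown).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical in-memory section header used only for this sketch.
struct Shdr {
  uint64_t name, type, flags, addr, offset, size, link, info, addralign, entsize;
};

constexpr uint64_t kShtProgbits = 1; // value of SHT_PROGBITS

void writeULEB128(std::vector<uint8_t> &out, uint64_t v) {
  do {
    uint8_t byte = v & 0x7f;
    v >>= 7;
    out.push_back(static_cast<uint8_t>(v ? byte | 0x80 : byte));
  } while (v);
}

// Appends one compact section header: the presence byte, then sh_name, then
// the remaining members in order, skipping those equal to their default.
void encodeShdr(std::vector<uint8_t> &out, const Shdr &s) {
  // sh_addralign 0 and 1 both mean "no constraint" and map to the default.
  uint64_t align = s.addralign ? s.addralign : 1;
  assert((align & (align - 1)) == 0 && "sh_addralign must be a power of two");
  unsigned lg = 0;
  while ((align >> lg) != 1)
    ++lg;

  uint8_t presence = 0;
  if (s.type != kShtProgbits) presence |= 1;
  if (s.flags) presence |= 2;
  if (s.addr) presence |= 4;
  if (s.size) presence |= 8;
  if (s.link) presence |= 16;
  if (s.info) presence |= 32;
  if (lg) presence |= 64;
  if (s.entsize) presence |= 128;

  out.push_back(presence);
  writeULEB128(out, s.name);
  if (presence & 1) writeULEB128(out, s.type);
  if (presence & 2) writeULEB128(out, s.flags);
  if (presence & 4) writeULEB128(out, s.addr);
  writeULEB128(out, s.offset);
  if (presence & 8) writeULEB128(out, s.size);
  if (presence & 16) writeULEB128(out, s.link);
  if (presence & 32) writeULEB128(out, s.info);
  if (presence & 64) writeULEB128(out, lg);
  if (presence & 128) writeULEB128(out, s.entsize);
}
```

For a small .text-like header (sh_flags, sh_size, and sh_addralign present, everything else at its default, all values below 128) this emits 6 bytes, compared with the 64 bytes of an Elf64_Shdr.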

You can still enjoy the advantage of O(1) random access of section headers through the offsets at the beginning of the section header table.

Experiments

I have developed a Clang/lld prototype that implements compact section header table and CREL.

`.o size`	sht size	build
136012504	18284992	-O3
111583312	18284992	-O3 -Wa,--crel
97976973	4604341	-O3 -Wa,--crel,--cshdr
2174179112	260281280	-g
1763231672	260281280	-g -Wa,--crel
1577187551	74234983	-g -Wa,--crel,--cshdr

More ideas

Symbol table sections have a non-zero sh_entsize, which remains unchanged after compression.

C++ exit-time destructors

2024-03-17T07:00:00.000Z

In ISO C++ standards, [basic.start.term] specifies that:

Constructed objects ([dcl.init]) with static storage duration are destroyed and functions registered with std::atexit are called as part of a call to std::exit ([support.start.term]). The call to std::exit is sequenced before the destructions and the registered functions. [Note 1: Returning from main invokes std::exit ([basic.start.main]). — end note]

For example, consider the following code:

struct A { ~A(); } a;

The destructor for object a will be registered for execution at program termination.

`__cxa_atexit`

The Itanium C++ ABI employs __cxa_atexit rather than atexit for object destructor registration for two primary reasons:

Limited atexit guarantee: ISO C (up to C23) only guarantees support for at least 32 registered functions, although most implementations support many more.
Dynamic library unloading: __cxa_atexit provides a mechanism for handling destructors when dynamic libraries are unloaded via dlclose before program termination.

Several standard libraries, including glibc, musl, and FreeBSD libc, implement atexit using __cxa_atexit.

In glibc, atexit returns __cxa_atexit ((void (*) (void *)) func, NULL, __dso_handle), where __dso_handle is part of libc itself.
musl uses 0 instead of __dso_handle.

https://itanium-cxx-abi.github.io/cxx-abi/abi.html#dso-dtor-runtime-api provides detailed documentation on object destruction mechanisms. Let's illustrate this with a GCC and glibc example:

cat > a.cc <<'eof'
#include <dlfcn.h>
int main() {
  void *h = dlopen("./b.so", RTLD_NOW);
  ((void (*)())dlsym(h, "foo"))();
  dlclose(h);
}
eof
cat > b.cc <<'eof'
#include <stdio.h>
struct A { ~A(); } ga;
A::~A() { printf("~A %p\n", this); }
extern "C" void foo() {
  static A a;
  puts("foo");
}
eof
g++ -fpic -shared b.cc -o b.so
g++ a.cc -o a

An invocation yields:

foo
~A 0x7f70d66c4c79  // for the static-local variable
~A 0x7f70d66c4c78  // for the global variable

Key points:

The compiler registers destructors with __cxa_atexit using the __dso_handle symbol as an argument.
crtbeginS.o defines the .fini_array section (triggering __do_global_dtors_aux) and the hidden symbol __dso_handle.
Since 2017, lld defines __dso_handle as a hidden symbol if crtbegin does not.
dlclose invokes .fini_array functions. __cxa_finalize(d) iterates through the termination function list, calling matching destructors based on the DSO handle.
__cxa_atexit implementations typically allocate memory dynamically and may fail. The failures are simply ignored.
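The interplay between registration and __cxa_finalize(d) can be modeled in a few lines. This is a toy model with invented names (ExitEntry, exitList, toyCxaAtexit, toyCxaFinalize), not how glibc or any other libc stores its entries; it only mirrors the behavior the Itanium C++ ABI describes: entries run in reverse registration order, and a non-null handle selects the entries of one DSO.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One registered termination function (hypothetical layout).
struct ExitEntry {
  void (*fn)(void *); // destructor thunk
  void *arg;          // address of the object to destroy
  void *dso;          // DSO handle passed at registration
  bool done;          // set once the entry has run
};

std::vector<ExitEntry> exitList;

// Models __cxa_atexit(fn, arg, dso); returns 0 on success.
int toyCxaAtexit(void (*fn)(void *), void *arg, void *dso) {
  assert(fn != nullptr);
  exitList.push_back({fn, arg, dso, false});
  return 0;
}

// Models __cxa_finalize(d): runs the entries that have not run yet, newest
// first. A null d selects every entry; otherwise only entries registered
// with the handle d are run.
void toyCxaFinalize(void *d) {
  for (size_t i = exitList.size(); i-- > 0;) {
    if (exitList[i].done || (d != nullptr && exitList[i].dso != d))
      continue;
    exitList[i].done = true;
    // Copy the entry first: fn may register more entries and reallocate.
    ExitEntry e = exitList[i];
    e.fn(e.arg);
  }
}
```

In the dlopen example above, b.so registers both destructors with its own __dso_handle, so the dlclose-triggered __cxa_finalize runs them in reverse order: the function-local static first, then the global.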

Note: In glibc, the DF_1_NODELETE flag marks a shared object as non-unloadable, so dlclose will not unload it. Additionally, symbol lookups that bind to STB_GNU_UNIQUE symbols automatically set this flag.

musl provides a no-op implementation for dlclose and __cxa_finalize.

Thread storage duration variables

Objects with thread storage duration that have non-trivial destructors will register those destructors using __cxa_thread_atexit during construction.

When exit-time destructors are undesired

Exit-time destructors for static and thread storage duration variables can be undesired due to

Unnecessary overhead and complexity: This includes operating system kernels and memory-constrained systems.
Potential race conditions: Destructors might execute during thread termination, while other threads still attempt to access the object. Examples: webkit

Clang provides -Wexit-time-destructors (disabled by default) to warn about exit-time destructors.

% clang++ -c -Wexit-time-destructors g.cc
g.cc:1:20: warning: declaration requires an exit-time destructor [-Wexit-time-destructors]
    1 | struct A { ~A(); } a;
      |                    ^
1 warning generated.

Disabling exit-time destructors

Next, I will describe some approaches to disable exit-time destructors.

Pointer/reference to a dynamically-allocated object

We can use a reference or pointer that refers to a dynamically-allocated object.

struct A { int v; ~A(); };
A &g = *new A; // or A *const g = new A;
A &foo() {
  static A &a = *new A;
  return a; // or static A *a = new A; return *a
}

This approach prevents the destructor from running at program exit, as pointers and references have a trivial destructor. Note that this does not create a memory leak, since the pointer/reference is part of the root set.

The primary downside is unnecessary pointer indirection when accessing the object. Additionally, this approach uses a mutable pointer in the data segment and requires a memory allocation.

# %bb.2:                                 // initializer
        movl    $4, %edi
        callq   _Znwm@PLT
        movq    %rax, _ZZ3foovE1a(%rip)  // store pointer of the heap-allocated object to _ZZ3foovE1a
...
        movq    _ZZ3foovE1a(%rip), %rax  // load a pointer from _ZZ3foovE1a

Class template with an empty destructor

A common approach, as outlined in P1247, is to use a class template with an empty destructor to prevent exit-time destruction:

#include <cstddef>
#include <new>
#include <utility>

template <class T> class no_destroy {
  alignas(T) std::byte data[sizeof(T)];
public:
  template <class... Ts> no_destroy(Ts&&... ts) { new (data) T(std::forward<Ts>(ts)...); }
  T &get() { return *reinterpret_cast<T *>(data); }
};

no_destroy<widget> my_widget; // widget: some user-defined class type

libstdc++ employs a variant that uses a union member.

struct A { ~A(); };

namespace {
  struct constant_init {
    union { A obj; };
    constexpr constant_init() : obj() { }
    ~constant_init() { /* do nothing, union member is not destroyed */ }
  };
  constinit constant_init global;
}

A* get() { return &global.obj; }

C++20 supports constexpr destructors:

template <class T> union no_destroy {
  template <typename... Ts>
  explicit constexpr no_destroy(Ts&&... args) : obj(std::forward<Ts>(args)...) {}
  constexpr ~no_destroy() {}
  T obj;
};

Libraries like absl::NoDestructor and folly::Indestructible offer similar functionality. The absl version optimizes for trivially destructible types.
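To see that such a wrapper really suppresses the destructor, here is a small self-contained sketch. NoDestroy and Tracker are names made up for this example; the wrapper follows the placement-new pattern shown earlier and is not the absl or folly implementation.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Storage for a T that is constructed in place and never destroyed.
// The wrapper itself has a trivial destructor, so a variable of this type
// with static storage duration needs no exit-time destructor registration.
template <class T> class NoDestroy {
  alignas(T) unsigned char data[sizeof(T)];

public:
  template <class... Ts> explicit NoDestroy(Ts &&...ts) {
    new (data) T(std::forward<Ts>(ts)...);
  }
  T &get() { return *reinterpret_cast<T *>(data); }
};

// Helper type that counts how often its destructor runs.
struct Tracker {
  static int destroyed;
  int id;
  explicit Tracker(int id) : id(id) { assert(id >= 0); }
  ~Tracker() { ++destroyed; }
};
int Tracker::destroyed = 0;
```

Constructing a NoDestroy<Tracker> in a block and leaving the block keeps Tracker::destroyed at 0, whereas a plain Tracker in the same position increments it.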

Compiler optimization for no-op destructors

Ideally, compilers should optimize out exit-time destructors for empty user-provided destructors:

struct C { C(); ~C() {} };
void foo() { static C c; }

LLVM has addressed this since 2011. Its GlobalOpt pass eliminates __cxa_atexit calls related to empty destructors, along with other global variable optimizations.

In contrast, GCC has had an open feature request for this optimization since 2005.

`no_destroy` attribute

Clang supports [[clang::no_destroy]] (alternative form: __attribute__((no_destroy))) to disable exit-time destructors for variables of static or thread storage duration. Its -fno-c++-static-destructors option allows disabling exit-time destructors globally.

July 2018 discussion: https://discourse.llvm.org/t/rfc-suppress-c-static-destructor-registration/49128
Patch: https://reviews.llvm.org/D50994 with follow-up https://reviews.llvm.org/D54344
Documentation: https://clang.llvm.org/docs/AttributeReference.html#no-destroy

Standardization efforts for this attribute are underway (P1247R0).

I recently encountered a scenario where the no_destroy attribute would have been beneficial. I've filed a GCC feature request (PR114357) after I learned that GCC doesn't have the attribute.

Case study

LLVM provides ManagedStatic to construct an object on-demand (good for reducing startup time) and make destruction explicit through llvm_shutdown. ManagedStatic is intended to be used at namespace scope. A prime example is LLVM's statistics mechanisms (-stats and -time-passes).

Programs using LLVM can strategically avoid calling llvm_shutdown for fast teardown by skipping some destructors. The lld linker employs this approach unless the LLD_IN_TEST environment variable is set to a non-zero integer.

DSO plugin users requiring library unloading may find ManagedStatic unsuitable. This is because:

A DSO may not be able to determine if other active LLVM users exist within the process, making it unsafe to call llvm_shutdown.
If llvm_shutdown is deferred until around program exit, executing destructors becomes unsafe once the DSO's code has been removed.

The mold linker improves perceived linking speed by spawning a separate process for the linking task. This allows the parent process (the one launched from the shell or other programs) to exit early. This approach eliminates overhead associated with static destructors and other operations.

A compact relocation format for ELF

2024-03-09T08:00:00.000Z

This article introduces CREL (previously known as RELLEB), a new relocation format offering incredible size reduction (LLVM implementation in my fork).

ELF's design emphasizes natural size and alignment guidelines for its control structures. This principle, outlined in Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism, promotes ease of random access for structures like program headers, section headers, and symbols.

All data structures that the object file format defines follow the "natural" size and alignment guidelines for the relevant class. If necessary, data structures contain explicit padding to ensure 4-byte alignment for 4-byte objects, to force structure sizes to a multiple of four, etc. Data also have suitable alignment from the beginning of the file. Thus, for example, a structure containing an Elf32_Addr member will be aligned on a 4-byte boundary within the file. Other classes would have appropriately scaled definitions. To illustrate, the 64-bit class would define Elf64_Addr as an 8-byte object, aligned on an 8-byte boundary. Following the strictest alignment for each object allows the format to work on any machine in a class. That is, all ELF structures on all 32-bit machines have congruent templates. For portability, ELF uses neither bit-fields nor floating-point values, because their representations vary, even among processors with the same byte order. Of course the programs in an ELF file may use these types, but the format itself does not.

While beneficial for many control structures, the natural size guideline presents significant drawbacks for relocations. Since relocations are typically processed sequentially, they don't gain the same random-access advantages. The large 24-byte Elf64_Rela structure highlights the drawback. For a detailed comparison of relocation formats, see Exploring object file formats#Relocations.

Furthermore, Elf32_Rel and Elf32_Rela sacrifice flexibility to maintain a smaller size, limiting relocation types to a maximum of 255. This constraint has become noticeable for AArch32 and RISC-V, and especially when platform-specific relocations are needed. While the 24-bit symbol index field is less elegant, it hasn't posed significant issues in real-world use cases.

In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other control structures, offering a significant size advantage over ELF.

Inspired by WebAssembly, I will start the discussion with a generic compression algorithm and then propose an alternative format (CREL) that addresses ELF's limitations.

Compressed relocations

While the standard SHF_COMPRESSED feature is commonly used for debug sections, its application can easily extend to relocation sections. I have developed a Clang/lld prototype that demonstrates this by compressing SHT_RELA sections.

The compressed SHT_RELA section occupies sizeof(Elf64_Chdr) + size(compressed) bytes. The implementation retains uncompressed content if compression would result in a larger size.

In scenarios with numerous smaller relocation sections (such as when using -ffunction-sections -fdata-sections), the 24-byte Elf64_Chdr header can introduce significant overhead. This observation raises the question of whether encoding Elf64_Chdr fields using ULEB128 could further optimize file sizes. With larger monolithic sections (.text, .data, .eh_frame), the compression ratio would be higher as well.
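To make the header overhead concrete, here is a minimal Python sketch of the keep-whichever-is-smaller rule described above. It uses zlib from the standard library as a stand-in for zstd, and `stored_size` and `rela` are hypothetical helper names for illustration, not part of the prototype.

```python
import struct
import zlib

ELF64_CHDR_SIZE = 24  # ch_type (4) + ch_reserved (4) + ch_size (8) + ch_addralign (8)

def stored_size(raw: bytes) -> int:
    """Bytes a relocation section occupies when compression is kept only if smaller."""
    compressed = ELF64_CHDR_SIZE + len(zlib.compress(raw))
    return min(len(raw), compressed)

def rela(offset: int, symidx: int, rtype: int, addend: int) -> bytes:
    # Little-endian Elf64_Rela: r_offset, r_info, r_addend (24 bytes).
    return struct.pack("<QQq", offset, (symidx << 32) | rtype, addend)

small = rela(0x10, 5, 1, 0)  # a section with a single relocation
large = b"".join(rela(8 * i, 5, 1, 8 * i) for i in range(1000))

print(stored_size(small), len(small))  # the header alone rules out compression
print(stored_size(large), len(large))  # regular content benefits from compression
```

A one-entry section can never win, because the 24-byte header is already as large as the entry itself.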

# configure-llvm is my wrapper of cmake that specifies some useful options.
configure-llvm s2-custom0 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld'
configure-llvm s2-custom1 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-Xclang=--compress-relocations=zstd
ninja -C /tmp/out/s2-custom0 lld
ninja -C /tmp/out/s2-custom1 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom0/**/*.o").sum{|f| File.size(f)}'  # 135996752
ruby -e 'p Dir.glob("/tmp/out/s2-custom1/**/*.o").sum{|f| File.size(f)}'  # 116424688

Relocations consume a significant portion (approximately 20.9%) of the file size. Despite the overhead of -ffunction-sections -fdata-sections, the compression technique yields a significant reduction of 14.5%!

However, dropping in-place relocation processing is a downside.

CREL relocation format

The 1990 ELF paper ELF: An Object File to Mitigate Mischievous Misoneism says "ELF allows extension and redefinition for other control structures." Let's explore CREL, a new and more compact relocation format designed to replace REL and RELA. Our emphasis is on simplicity over absolute minimal encoding. This is achieved by using a byte-oriented encoding that avoids complex compression techniques (e.g., dictionary-based compression, entropy encoder). As a byte-oriented format, CREL relocations can be further compressed by other codecs, if desired. Using CREL in relocatable files can also decrease memory usage.

See the end of the article for a detailed format description.

A SHT_CREL section (preferred name: .crel) holds compact relocation entries that decode to Elf32_Rela or Elf64_Rela depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. The entries use ULEB128 and SLEB128 exclusively and there is no endianness difference.

Here are key design choices:

Relocation count (ULEB128):

This allows for efficient retrieval of the relocation count without decoding the entire section. While a uint32_t (like SHT_HASH) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in most cases when the number of relocations can be encoded in one to three bytes.

Shifted offset:

64-bit data sections frequently have absolute relocations spaced 8 bytes apart. Additionally, in RISC architectures, offsets are often multiples of 2 or 4. A shift value of 2 allows delta offsets within the [0, 64) range to be encoded in a single byte, often avoiding the need for two-byte encoding. In an AArch64 -O3 build, the shifted offset technique reduces size(.crel*) by 12.8%.

Many C++ virtual tables have the first relocation at offset 0x10. In the absence of the shifted offset technique, the relocation at offset 0x10 cannot be encoded in one byte.

Relocation section '.crel.data.rel.ro._ZTVN12_GLOBAL__N_113InlineSpillerE' at offset 0x116fe contains 5 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000010  0000007e00000001 R_X86_64_64            0000000000000000 _ZN4llvm7Spiller6anchorEv + 0
0000000000000018  0000000c00000001 R_X86_64_64            0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpillerD2Ev + 0
0000000000000020  0000000f00000001 R_X86_64_64            0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpillerD0Ev + 0
0000000000000028  0000001100000001 R_X86_64_64            0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpiller5spillERN4llvm13LiveRangeEditE + 0
0000000000000030  0000001a00000001 R_X86_64_64            0000000000000000 .text._ZN12_GLOBAL__N_113InlineSpiller16postOptimizationEv + 0

The shifted offset also works really well for dynamic relocations, whose offset differences are almost always a multiple of sizeof(Elf_Addr).
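The one-byte limit is easy to check with a few lines of Python. This is a sketch that assumes the delta offset sits above three flag bits in a single ULEB128 value, as in the explicit-addend layout; `uleb128_size` and `offset_member_size` are illustrative names.

```python
def uleb128_size(value: int) -> int:
    """Number of bytes ULEB128 needs for a non-negative integer."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

FLAG_BITS = 3  # symidx, type and addend presence bits (assumed addend_bit==1)

def offset_member_size(delta_offset: int, shift: int) -> int:
    # The delta offset is stored in units of 1 << shift, above the flag bits.
    assert delta_offset % (1 << shift) == 0
    return uleb128_size((delta_offset >> shift) << FLAG_BITS)

# First relocation of the virtual table shown above: offset 0x10 from an initial 0.
print(offset_member_size(0x10, shift=0))  # 0x10 << 3 == 0x80 needs a second byte
print(offset_member_size(0x10, shift=3))  # 2 << 3 == 0x10 fits in one byte
```

With shift 2, the largest delta offset that still fits in one byte is 60, which matches the [0, 64) range quoted earlier.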

Addend bit:

CREL was initially designed with explicit addends in each relocation entry. However, this approach created redundancy when I extended CREL for dynamic relocations (discussed thoroughly in another section). While RELR remains an option, the goal is for CREL to be a viable alternative even without RELR. In addition, when implicit addends are used, I feel sad if one bit in every relocation entry is wasted.

To address these concerns, a single bit flag has been introduced in the CREL header:

addend_bit==1: The flags include a bit to signal the presence of the delta addend.
addend_bit==0: The flags bit is stolen to encode one more bit for the offset. The addend is implicit and not encoded within the relocation.

This bit flag effectively balances efficiency by avoiding unnecessary storage for dynamic relocations while maintaining flexibility for cases requiring explicit addend values.

Assembler-produced SHT_CREL sections are supposed to always set the addend_bit bit.

Delta encoding for r_offset (ULEB128):

Section offsets can be large, and relocations are typically ordered. Storing the difference between consecutive offsets offers compression potential. In most cases, a single byte will suffice. While there are exceptions (general dynamic TLS model of s390/s390x uses a local "out-of-order" pair: R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), we are optimizing for the common case.

For ELFCLASS32, r_offset members are calculated using modular arithmetic modulo 4294967296.
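As a small illustration of the modular arithmetic, the following Python sketch converts offsets to deltas and back for ELFCLASS32. The function names are made up for this example, and the sample offsets are arbitrary.

```python
MASK32 = (1 << 32) - 1  # ELFCLASS32 offsets wrap modulo 4294967296

def to_deltas(offsets):
    prev, deltas = 0, []
    for off in offsets:
        deltas.append((off - prev) & MASK32)  # a backward step wraps around
        prev = off
    return deltas

def from_deltas(deltas):
    off, offsets = 0, []
    for d in deltas:
        off = (off + d) & MASK32
        offsets.append(off)
    return offsets

# The second entry steps back by 2, like the s390 "out-of-order" pair.
offsets = [0x1000, 0x0FFE, 0x1008]
deltas = to_deltas(offsets)
print([hex(d) for d in deltas])
```

The backward step becomes a large unsigned delta, which costs more bytes but still round-trips correctly.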

Delta encoding for r_symidx (SLEB128):

This is more for consistency and less for the benefit.

Absolute symbol indexes allow one-byte encoding for symbols in the range [0,128) and offer minor size advantage for static relocations when the symbol table is sorted by usage frequency. Delta encoding, on the other hand, might optimize for the scenario when the symbol table presents locality: neighbor symbols are frequently mutually called.

Delta symbol index enables one-byte encoding for GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by symbol index. For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3, offset++, symidx++). Delta encoding has a disadvantage: it cannot fully exploit a symbol table sorted by usage frequency. It can partially reclaim that optimization by arranging symbols in a "cold0 hot cold1" pattern.

In my experiments, absolute encoding with ULEB128 results in slightly larger .o file sizes for both x86-64 and AArch64 builds.
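The repeated 0x05 0x01 pattern for GOT/PLT relocations can be reproduced with a short Python sketch. It assumes addend_bit==0 (two flag bits, with bit 0 signalling a symbol index delta) and shift==3; `steady_state_entry` is a hypothetical helper that only handles values small enough for one-byte encodings.

```python
SHIFT = 3            # offsets are counted in 8-byte words
FLAG_BITS = 2        # addend_bit==0: only the symidx and type presence bits
SYMIDX_PRESENT = 1   # flag bit 0: a delta symbol index follows

def steady_state_entry(delta_offset: int, delta_symidx: int) -> bytes:
    # Both values are assumed small enough for one-byte ULEB128/SLEB128.
    first = ((delta_offset >> SHIFT) << FLAG_BITS) | SYMIDX_PRESENT
    assert first < 0x80 and 0 <= delta_symidx < 0x40
    return bytes([first, delta_symidx])

# Next GOT slot (offset += 8), next symbol (symidx += 1), same type.
print(steady_state_entry(8, 1).hex(" "))
```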

Delta encoding for r_type (SLEB128):

Some psABIs utilize relocation types greater than 128. AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding. Delta encoding allows all but the first relocation's type to be encoded in a single byte. An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.
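Here is a rough Python comparison of absolute and delta type encoding for a plausible AArch64 sequence. The sequence itself is made up for illustration; `sleb128_size` is a helper defined for this sketch.

```python
def sleb128_size(value: int) -> int:
    """Number of bytes SLEB128 needs for a signed integer."""
    size = 1
    while not -64 <= value < 64:
        value >>= 7  # arithmetic shift keeps the sign in Python
        size += 1
    return size

# R_AARCH64_ADR_PREL_PG_HI21 (275), R_AARCH64_ADD_ABS_LO12_NC (277),
# R_AARCH64_CALL26 (283), then the first two again.
types = [275, 277, 283, 275, 277]

absolute = sum(sleb128_size(t) for t in types)
deltas = [t - prev for prev, t in zip([0] + types, types)]
delta = sum(sleb128_size(d) for d in deltas)
print(absolute, delta, deltas)
```

Only the first delta (275, relative to the initial type 0) needs two bytes; every later change fits in one.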

If the AArch32 psABI could be redesigned, allocating [0,64) for Thumb relocation types and [64,*) for ARM relocation types would optimize delta encoding even further.

While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations. This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.

Delta encoding for addend (SLEB128):

Encoding the delta addend offers a slight size advantage and optimizes for cases like:

.quad .data + 0x78
.quad .data + 0x80
.quad .data + 0x88
...
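For the three .quad directives above, a quick Python sketch shows the bytes produced by absolute and delta SLEB128 addends. The `sleb128` encoder is a standard implementation written for this example, not code from the prototype.

```python
def sleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift keeps the sign in Python
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

addends = [0x78, 0x80, 0x88]
absolute = b"".join(sleb128(a) for a in addends)
delta = b"".join(sleb128(a - prev) for prev, a in zip([0] + addends, addends))
print(absolute.hex(" "), "|", delta.hex(" "))
```

Each absolute addend needs two bytes because it is at least 64, while every delta after the first is 8 and fits in one byte.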

Symbol index/type/addend omission

Relocations often exhibit patterns that can be exploited for size reduction:

Symbol index/type remain constant with varying addends, common for STT_SECTION symbols, e.g. .rodata, .eh_frame, .debug_str_offsets, .debug_names, .debug_line, and .debug_addr.
Type/addend remain constant with varying symbol indexes, common for non-section symbols, e.g. function calls, C++ virtual tables, and dynamic relocations.

// Type/addend do not change.
// R_AARCH64_CALL(g0), ...
// R_X86_64_PLT32(g0-4), ...
void f() { g0(); g1(); g2(); }

We use the least significant bits of the offset member to signal the presence of symbol index, type, and addend information. This allows us to omit delta fields when they match the previous entry.

CREL steals 3 bits from the offset member. I have tried stealing just a bit and utilizing negative symbol index to signal type/addend omission, but offsets generally require fewer bits to encode and stealing bits from offsets is superior.
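The omission scheme can be sketched as a tiny Python encoder. This is a simplified illustration under stated assumptions, not the reference implementation: it assumes explicit addends (addend_bit==1), non-decreasing offsets, a header of count*8 + addend_bit*4 + shift, and delta fields emitted in the order symidx, type, addend. The symbol indexes and offsets in the example are invented.

```python
def uleb128(value: int) -> bytes:
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def sleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def encode_crel(relocs, shift=0):
    """relocs: (offset, symidx, type, addend) tuples, offsets non-decreasing."""
    out = bytearray(uleb128(len(relocs) * 8 + 4 + shift))  # count, addend_bit=1, shift
    offset = symidx = rtype = addend = 0
    for off, sym, typ, add in relocs:
        # A flag bit is set only when the field differs from the previous entry.
        flags = (sym != symidx) | ((typ != rtype) << 1) | ((add != addend) << 2)
        out += uleb128((((off - offset) >> shift) << 3) | flags)
        if flags & 1:
            out += sleb128(sym - symidx)
        if flags & 2:
            out += sleb128(typ - rtype)
        if flags & 4:
            out += sleb128(add - addend)
        offset, symidx, rtype, addend = off, sym, typ, add
    return bytes(out)

# Three calls as in f() above: R_X86_64_PLT32 (type 4), addend -4.
calls = [(5, 3, 4, -4), (10, 4, 4, -4), (15, 5, 4, -4)]
encoded = encode_crel(calls)
print(encoded.hex(" "), len(encoded), "bytes vs", 24 * len(calls), "for RELA")
```

After the first entry, the type and addend never change, so each further call costs two bytes: one for offset plus flags and one for the symbol index delta.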

Omitting the symbol index information is especially beneficial for reducing debug build sizes. For example, most .debug_names relocations can be encoded using only 4 bytes (offset and delta addend), instead of the 7 bytes required otherwise.

While RISC architectures often require multiple relocations with different types to access global data, making type omission slightly less beneficial than x86, the frequent use of call instructions offers large size reduction with type omission.

With a limited number of types and frequent zero addends (except R_*_RELATIVE and R_*_IRELATIVE), dynamic relocations also benefit from type/addend omission.

I have developed a prototype at https://github.com/MaskRay/llvm-project/tree/demo-crel. CREL demonstrates superior size reduction compared to the SHF_COMPRESSED SHT_RELA approach.

LEB128 among variable-length integer encodings

LEB128 and UTF-8 stand out as the two most commonly used byte-oriented, variable-length integer encoding schemes. Binary encodings often employ LEB128. While alternatives like PrefixVarInt (or a suffix-based variant) might excel when encoding larger integers, LEB128 offers advantages when most integers fit within one or two bytes, as it avoids the need for shift operations in the common one-byte representation.

While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generated code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V.

// One-byte case for SLEB128: bit 6 is the sign bit.
int64_t from_signext(uint64_t v) {
  return v < 64 ? v : v - 128;
}

// One-byte case for ULEB128 with zig-zag encoding
int64_t from_zigzag(uint64_t z) {
  return (z >> 1) ^ -(z & 1);
}
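For reference, both one-byte schemes cover the same value range, [-64, 64). The following Python sketch checks this; it uses the 64-bit form of the zigzag transform, and the function names are chosen for this example.

```python
MASK64 = (1 << 64) - 1

def to_zigzag(i: int) -> int:
    # 64-bit zigzag: (i >> 63) ^ (i << 1), truncated to 64 bits.
    return ((i >> 63) ^ (i << 1)) & MASK64

def from_zigzag(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def one_byte_sleb128_decode(v: int) -> int:
    # One-byte SLEB128: subtract 128 when the sign bit (bit 6) is set.
    return v - 128 if v >= 64 else v

for i in (-64, -1, 0, 1, 63):
    print(i, hex(to_zigzag(i)), hex(i & 0x7F))
```

Since the ranges are identical, the choice comes down to the decoder's instruction sequence rather than to encoded size.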

While some variable-length integer schemes allocate more integers within the one-byte bucket, I do not believe they would lead to noticeable improvement over LEB128. For example, when I assign one extra bit to offsets (by clearing addend_bit in the header), .crel.dyn merely decreases by 2.5%.

Here is an extremely simple C decoder implementation for ULEB128 and SLEB128. The clever use of 64/128 is from Stefan O'Rear. The return type uint64_t can be changed to size_t when used in a dynamic loader.

static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
  uint64_t acc = 0, shift = 0, byte;
  do {
    byte = *(*buf)++;
    acc |= (byte - 128*(byte >= sleb_uleb)) << shift;
    shift += 7;
  } while (byte >= 128);
  return acc;
}

uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }
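To convince yourself that the 64/128 trick handles both encodings, here is a Python port of the routine with a round-trip check. The encoders `uleb128` and `sleb128` and the helper `to_signed` are standard implementations added for this sketch; Python integers are unbounded, so the port masks the result to 64 bits to mimic uint64_t.

```python
MASK64 = (1 << 64) - 1

def read_leb128(buf: bytes, pos: int, sleb_uleb: int):
    """Port of the C routine above; returns (value mod 2**64, new position)."""
    acc = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        acc |= (byte - 128 * (byte >= sleb_uleb)) << shift
        shift += 7
        if byte < 128:
            return acc & MASK64, pos

def to_signed(v: int) -> int:
    return v - (1 << 64) if v >> 63 else v

def uleb128(value: int) -> bytes:
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def sleb128(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

print(read_leb128(uleb128(300), 0, 128)[0])
print(to_signed(read_leb128(sleb128(-200), 0, 64)[0]))
```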

I have used a modified lld to analyze the LEB128 length distribution in an x86-64 release build of lld that enables CREL.

zo[std::min(getULEB128Size(offset-old_offset), 3u) - 1]++;
if (b & 1) {
  auto x = readSLEB128(p); symidx += x; zs[std::min(getSLEB128Size(x), 3u) - 1]++;
}
if (b & 2) {
  auto x = readSLEB128(p); type += x; zt[std::min(getSLEB128Size(x), 3u) - 1]++;
}
if (b & 4) {
  auto x = readSLEB128(p); addend += x; za[std::min(getSLEB128Size(x), 3u) - 1]++;
}

The distribution of ULEB128/SLEB128 lengths is:

        1       2       3+      any
offset  633056  48846   0       681902
type    187759  0       0       187759
symidx  360230  229610  879     590719
addend  191523  52293   2899    246715

80.4% of LEB128 encodings are 1 byte and 19.4% are 2 bytes. 2-byte delta symidx members are quite common, but I do not plan to steal bits from other members for symidx.

Experiments

build	format	`.o size`	`size(.rel*)`	.o size decrease
-O3	RELA	136012504	28235448
-O3	CREL	111583312	3806234	18.0%
aarch64 -O3	RELA	124965808	25855800
aarch64 -O3	CREL	102529784	3388307	18.0%
ppc64le -O3	RELA	129017272	26589192
ppc64le -O3	CREL	105860576	3432419	17.9%
riscv64 -O3	RELA	227189744	91396344
riscv64 -O3	CREL	149343352	13549699	34.3%
-O1 -g	RELA	1506173760	340965576
-O1 -g	CREL	1202445768	37237274	20.2%
-O3 -g $SPLIT	RELA	549003848	104227128
-O3 -g $SPLIT	CREL	459768736	14992114	16.3%

SPLIT="-gpubnames -gsplit-dwarf"

Let's compare x86_64 -O3 builds of lld. size(.crel*)/size(.rel*) = 3806234 / 28235448, 13.5%. The total .o file size has decreased by 18.0%. In addition, the maximum resident set size of the linker (also lld) using mimalloc has decreased by 4.2%.

It would be interesting to explore the potential gains of combining zstd compression with CREL.

configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-Wa,--crel -Xclang --compress-relocations=zstd'
ninja -C /tmp/out/s2-custom3 lld

ruby -e 'p Dir.glob("/tmp/out/s2-custom3/**/*.o").sum{|f| File.size(f)}'  # 111383192

I debated whether to name the new section SHT_RELOC (.reloc) or SHT_RELLEB (.relleb). Ultimately, I chose SHT_CREL because its unique name minimizes potential confusion, whereas SHT_RELOC could be confused with SHT_REL and SHT_RELA and the LEB128 part is not that strong.

Case study

Let's explore some real-world scenarios where relocation size is critical.

Marker relocations

Marker relocations are utilized to indicate that a certain linker optimization/relaxation is applicable. While many marker relocations are used sparingly, RISC-V relocatable files are typically filled up with R_RISCV_RELAX relocations. Their size contribution is quite substantial.

`.llvm_addrsig`

On many Linux targets, Clang emits a special section called .llvm_addrsig (type SHT_LLVM_ADDRSIG, LLVM address-significance table) by default to allow ld.lld --icf=safe. The .llvm_addrsig section stores symbol indexes in ULEB128 format, independent of relocations. Consequently, tools like ld -r and objcopy risk invalidating the section due to symbol table modifications.

Ideally, using relocations would allow certain operations. However, the size concern of REL/RELA in ELF hinders this approach. In contrast, lld's Mach-O port chose a relocation-based representation for __DATA,__llvm_addrsig.

If CREL is adopted, we can consider switching to the relocation representation.

.llvm.call-graph-profile

LLVM leverages a special section called .llvm.call-graph-profile (type SHT_LLVM_CALL_GRAPH_PROFILE) for both instrumentation- and sample-based profile-guided optimization (PGO). lld utilizes this information ((from_symbol, to_symbol, weight) tuples) to optimize section ordering within an input section description, enhancing cache utilization and minimizing TLB thrashing.

Similar to .llvm_addrsig, the .llvm.call-graph-profile section initially faced the symbol index invalidation problem, which was solved by switching to relocations. I opted for REL over RELA to reduce code size.

DWARF sections

In a non-split-DWARF build, .rela.debug_str_offsets and .rela.debug_addr consume a significant portion of the file size.

DWARF v5 accelerated name-based access with the introduction of the .debug_names section. However, in a clang -g -gsplit-dwarf -gpubnames generated relocatable file, the .rela.debug_names section can consume a significant portion (approximately 10%) of the file size.

Relocation section '.crel.debug_names' at offset 0x65c0 contains 200 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
000000000000002c  0000002f0000000a R_X86_64_32            0000000000000000 .debug_info + 0
00000000000004d8  000000320000000a R_X86_64_32            0000000000000000 .debug_str + 1ab
00000000000004dc  000000320000000a R_X86_64_32            0000000000000000 .debug_str + f61
00000000000004e0  000000320000000a R_X86_64_32            0000000000000000 .debug_str + f7f
...

This size increase has sparked discussions within the LLVM community about potentially altering the file format for linking purposes.

.debug_line and .debug_addr also contribute a lot of relocations.

Relocation section '.crel.debug_addr' at offset 0x64f1 contains 51 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000008  0000004300000001 R_X86_64_64            0000000000000000 _ZN4llvm30VerifyDisableABIBreakingChecksE + 0
0000000000000010  0000002d00000001 R_X86_64_64            0000000000000000 .rodata.str1.1 + 0
0000000000000018  0000002d00000001 R_X86_64_64            0000000000000000 .rodata.str1.1 + b
0000000000000020  0000002d00000001 R_X86_64_64            0000000000000000 .rodata.str1.1 + 13
...

Relocation section '.crel.debug_line' at offset 0x69a5 contains 81 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000022  000000350000000a R_X86_64_32            0000000000000000 .debug_line_str + 0
0000000000000026  000000350000000a R_X86_64_32            0000000000000000 .debug_line_str + 18
000000000000002a  000000350000000a R_X86_64_32            0000000000000000 .debug_line_str + 2c
...

Many adjacent relocations share the same section symbol. CREL can compress most relocations into a few bytes depending on addend sizes (offset+=O, addend+=A).

By teaching the assembler to use implicit addends, we achieve even greater size reduction by compressing most relocations into a single byte (mostly 0x04, offset+=1<<3). However, this might harm compression in the presence of --compress-debug-sections=zstd. Personally I recommend that we don't use implicit addends.

CREL for dynamic relocations

CREL excels with static relocations, but what about the dynamic case?

A substantial part of position-independent executables (PIEs) and dynamic shared objects (DSOs) is occupied by dynamic relocations. While RELR (a compact relative relocation format) offers size-saving benefits for relative relocations, other dynamic relocations can benefit from a compact relocation format. There are a few properties:

There are far fewer relocation types.
The offsets are often spaced sizeof(Elf_Addr) apart. No two dynamic relocations can share the same offset.
Each symbol is associated with very few dynamic relocations, typically 1 or 2 (R_*_JUMP_SLOT and R_*_GLOB_DAT). When a symbol is associated with more dynamic relocations, it is typically a base class function residing in multiple C++ virtual tables, e.g. __cxa_pure_virtual. -fexperimental-relative-c++-abi-vtables would eliminate such dynamic relocations.

Android's packed relocation format (linker implementation: ld.lld --pack-dyn-relocs=android) was an earlier design that applies to all dynamic relocations at the cost of complexity. It replaces .rel.dyn/.rela.dyn but does not change the section name.

Additionally, Apple linkers and dyld use LEB128 encoding for bind opcodes.

Here is a one-liner to dump the relative relocation and non-relative relocation sizes for each shared object in a system library directory:

ruby -e 'Dir.glob("/usr/lib/x86_64-linux-gnu/*.so.*").each{|f| next if File.symlink?(f) || `file #{f}`!~/shared object/; s=`readelf -Wr #{f}`.lines; nr=s.count{|x|x=~/R_/&&x !~/_RELATIVE/}*24; r=s.count{|x|x=~/_RELATIVE/}*24; s=File.size(f); puts "#{f}\t#{s}\t#{r} (#{(r*100.0/s).round(2)}%)\t#{nr} (#{(nr*100.0/s).round(2)}%)" }'

I believe CREL compresses dynamic relocations well, but it is far from an optimal dynamic relocation format. A generalized RELR format would leverage the dynamic relocation properties well. Here's a possible encoding:

// R_*_RELATIVE group
Encode(length of the group)
Encode(R_*_RELATIVE)
RELR

// R_*_GLOB_DAT/absolute address relocation group
Encode(length of the group)
Encode(R_*_GLOB_DAT)
Use RELR to encode offsets
Encode symbol indexes separately

// R_*_JUMP_SLOT group
Encode(length of the group)
Encode(R_*_JUMP_SLOT)
Use RELR to encode offsets
Encode symbol indexes separately

...

We need to enumerate all dynamic relocation types including R_*_IRELATIVE, R_*_TLSDESC used by some ports. Some R_*_TLSDESC relocations have a symbol index of zero, but the straightforward encoding does not utilize this property.
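The "Use RELR to encode offsets" step in the groups above could look like the following Python sketch of the 64-bit RELR scheme, where an even entry is an address and an odd entry is a bitmap for the next 63 words. It covers only the offset part; the group header and the separate symbol index stream are not shown, and the function names are illustrative.

```python
WORD = 8               # sizeof(Elf64_Addr)
NBITS = 8 * WORD - 1   # a bitmap entry describes the next 63 words

def relr_encode(offsets):
    """offsets: sorted, distinct, word-aligned addresses."""
    out, i, n = [], 0, len(offsets)
    while i < n:
        out.append(offsets[i])          # even entry: an address
        base = offsets[i] + WORD
        i += 1
        while True:
            bitmap = 0
            while i < n:
                d = offsets[i] - base
                if d >= NBITS * WORD or d % WORD:
                    break
                bitmap |= 1 << (d // WORD)
                i += 1
            if bitmap == 0:
                break
            out.append((bitmap << 1) | 1)  # odd entry: a bitmap
            base += NBITS * WORD
    return out

def relr_decode(entries):
    offsets, where = [], 0
    for entry in entries:
        if entry & 1 == 0:
            offsets.append(entry)
            where = entry + WORD
        else:
            bits, i = entry >> 1, 0
            while bits:
                if bits & 1:
                    offsets.append(where + i * WORD)
                bits >>= 1
                i += 1
            where += NBITS * WORD
    return offsets

# 70 consecutive GOT-like slots plus one isolated slot.
slots = [0x1000 + 8 * i for i in range(70)] + [0x5000]
entries = relr_encode(slots)
print([hex(e) for e in entries])
```

Here 71 offsets shrink to four 8-byte entries, which is why reusing RELR for the offset stream of each group is attractive.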

Traditionally, we have two dynamic relocation ranges for executables and shared objects (except static position-dependent executables):

.rela.dyn ([DT_RELA, DT_RELA + DT_RELASZ)) or .rel.dyn ([DT_REL, DT_REL + DT_RELSZ))
.rela.plt ([DT_JMPREL, DT_JMPREL + DT_PLTRELSZ)): Stores JUMP_SLOT relocations. DT_PLTREL specifies DT_REL or DT_RELA.

IRELATIVE relocations can be placed in either range, but preferably in .rel[a].dyn.

Some GNU ld ports (e.g. SPARC) treat .rela.plt as a subset of .rela.dyn, introducing complexity for dynamic loaders.

CREL adoption considerations

New dynamic tag (DT_CREL): To identify CREL relocations, separate from existing DT_REL/DT_RELA.
No DT_CRELSZ: Relocation count can be derived from the CREL header.
Output section description .rela.dyn : { *(.rela.dyn) *(.rela.plt) } is incompatible with CREL.

Challenges with lazy binding

glibc's lazy binding scheme relies on random access to relocation entries within the DT_JMPREL table. CREL's sequential nature prevents this. However, eager binding doesn't require random access. Therefore, when -z now (eager binding) is enabled, we can:

Set DT_PLTREL to DT_CREL.
Replace .rel[a].plt with .crel.plt.

Challenges with statically linked position-dependent executables

glibc introduces additional complexity for IRELATIVE relocations in statically linked position-dependent executables. They should only contain IRELATIVE relocations and no other dynamic relocations.

glibc's csu/libc-start.c processes IRELATIVE relocations in the range [__rela_iplt_start, __rela_iplt_end) (or [__rel_iplt_start, __rel_iplt_end), determined at build time through ELF_MACHINE_IREL). While CREL relocations cannot be decoded in the middle of the section, we can still place IRELATIVE relocations in .crel.dyn because there wouldn't be any other relocation types (position-dependent executables don't have RELATIVE relocations). When CREL is enabled, we can define __crel_iplt_start and __crel_iplt_end for statically linked position-dependent executables.

If glibc only intends to support addend_bit==0, the code can simply be:

extern const uint8_t __crel_iplt_start[] __attribute__ ((weak));
extern const uint8_t __crel_iplt_end[] __attribute__ ((weak));
if (&__crel_iplt_start != &__crel_iplt_end) {
  const uint8_t *p = __crel_iplt_start;
  size_t offset = 0, count = read_uleb128 (&p), shift = count & 3;
  for (count >>= 3; count; count--) {
    uint8_t rel_head = *p++;
    offset += rel_head >> 2;
    if (rel_head & 128)
      offset += (read_uleb128 (&p) << 5) - 32;
    if (rel_head & 2)
      read_sleb128 (&p);
    elf_crel_irel ((ElfW (Addr) *) (offset << shift));
  }
}
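The byte-level arithmetic of the loop above, in particular the `- 32` correction for the continuation bit, can be checked with a Python mirror. This is only a sketch for experimentation: `irelative_offsets` is a hypothetical name, the sample bytes are hand-built, and the type value 37 is R_X86_64_IRELATIVE.

```python
def read_uleb128(buf: bytes, pos: int):
    acc = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        shift += 7
        if byte < 128:
            return acc, pos

def irelative_offsets(crel: bytes):
    """Mirror of the C loop: returns the decoded r_offset values (addend_bit==0)."""
    hdr, pos = read_uleb128(crel, 0)
    assert hdr & 4 == 0, "this sketch only handles implicit addends"
    count, shift = hdr >> 3, hdr & 3
    offset, offsets = 0, []
    for _ in range(count):
        head = crel[pos]
        pos += 1
        offset += head >> 2            # includes the continuation bit as +32
        if head & 128:
            rest, pos = read_uleb128(crel, pos)
            offset += (rest << 5) - 32  # undo the +32 from the line above
        if head & 2:
            # Delta type: the value is not needed, so only its length matters.
            _, pos = read_uleb128(crel, pos)
        offsets.append(offset << shift)
    return offsets

# Header: count=2, addend_bit=0, shift=3. Entry 1: offset 0x4018 with a type
# delta of 37. Entry 2: the next 8-byte slot, same type.
sample = bytes.fromhex("138e402504")
print([hex(o) for o in irelative_offsets(sample)])
```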

Considering implicit addends for CREL

Many dynamic relocations have zero addends:

COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends.
Absolute relocations could use non-zero addends with STT_SECTION symbol, but linkers convert them to relative relocations.

Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF might require non-zero addends. Switching from DT_RELA to DT_REL offers a minor size advantage.

I considered defining two separate dynamic tags (DT_CREL and DT_CRELA) to distinguish between implicit and explicit addends. However, this would have introduced complexity:

Should llvm-readelf -r dump the zero addends for DT_CRELA?
Should dynamic loaders support both dynamic tags?

I placed the delta addend bit next to offset bits so that it can be reused for offsets. Thanks to Stefan O'Rear for making me believe that my original thought of reserving a single bit flag (addend_bit) within the CREL header is elegant. Dynamic loaders prioritizing simplicity can hardcode the desired addend_bit value.

ld.lld -z crel defaults to implicit addends (addend_bit==0), but the option of using in-relocation addends is available with -z crel -z rela.

DT_AARCH64_AUTH_RELR vs CREL

The AArch64 PAuth ABI introduces DT_AARCH64_AUTH_RELR as a variant of RELR for signed relocations. However, its benefit seems limited.

In a release build of Clang 16, using -z crel -z rela resulted in a .crel.dyn section size of only 1.0% of the file size. Notably, enabling implicit addends with -z crel -z rel further reduced the size to just 0.3%. While DT_AARCH64_AUTH_RELR will achieve a noticeably smaller relocation size if most relative relocations are encoded with it, the advantage seems less significant considering CREL's already compact size.

Furthermore, DT_AARCH64_AUTH_RELR introduces additional complexity to the linker due to its 32-bit addend limitation: the in-place 64-bit value encodes a 32-bit schema, giving just 32 bits to the implicit addend. If the addend does not fit into 32 bits, DT_AARCH64_AUTH_RELR cannot be used. CREL with addends would avoid this complexity.

I have filed Quantifying the benefits of DT_AARCH64_AUTH_RELR.

I've implemented ld.lld -z crel to replace .rel[a].dyn and .rel[a].plt with .crel.dyn and .crel.plt. Dynamic relocations are sorted by (r_type, r_offset) to better utilize CREL.

Let's link clang-16-debug using RELA, -z pack-relative-relocs, --pack-dyn-relocs=android+relr, and -z pack-relative-relocs -z crel and analyze the results.

% fld.lld @response.txt -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .rela.dyn         RELA            00000000005df318 5df318 c3a980 18   A  3   0  8
  [ 9] .rela.plt         RELA            0000000001219c98 1219c98 001f38 18  AI  3  26  8
% fld.lld @response.txt -z pack-relative-relocs -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .rela.dyn         RELA            00000000005df340 5df340 011088 18   A  3   0  8
  [ 9] .relr.dyn         RELR            00000000005f03c8 5f03c8 0259d0 08   A  0   0  8
  [10] .rela.plt         RELA            0000000000615d98 615d98 001f38 18  AI  3  27  8
% fld.lld @response.txt --pack-dyn-relocs=android+relr -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .rela.dyn         ANDROID_RELA    00000000005df318 5df318 0011fc 01   A  3   0  8
  [ 9] .relr.dyn         RELR            00000000005e0518 5e0518 0259d0 08   A  0   0  8
  [10] .rela.plt         RELA            0000000000605ee8 605ee8 001f38 18  AI  3  27  8
% fld.lld @response.txt -z pack-relative-relocs -z crel -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .crel.dyn         CREL            00000000005df340 5df340 000fbc 00   A  3   0  8
  [ 9] .relr.dyn         RELR            00000000005e0300 5e0300 0259d0 08   A  0   0  8
  [10] .rel.plt          REL             0000000000605cd0 605cd0 0014d0 10  AI  3  27  8
% fld.lld @response.txt -z crel -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .crel.dyn         CREL            00000000005df318 5df318 082c29 00   A  3   0  8
  [ 9] .rel.plt          REL             0000000000661f48 661f48 0014d0 10  AI  3  26  8
% fld.lld @response.txt -z crel -z rela -o - | fllvm-readelf -S - | grep -E ' \.c?rel.?\.'
  [ 8] .crel.dyn         CREL            00000000005df318 5df318 1b8c69 00   A  3   0  8
  [ 9] .rela.plt         RELA            0000000000797f88 797f88 001f38 18  AI  3  26  8

Analysis

Relative relocations usually outnumber non-relative relocations.
RELR significantly optimizes relative relocations, offering the largest size reduction.
CREL further improves the non-relative portion, compressing that portion to 5.77%, even better than Android packed relocations (6.60%)! Android's r_info sharing with RELOCATION_GROUPED_BY_INFO_FLAG has been overshadowed by our shifted offset technique.
The non-relative relocation advantage is less pronounced since .relr.dyn still accounts for a significant portion of the size.
.crel.dyn using DT_CREL (implicit addends) without RELR is more than 3x as large as RELR .relr.dyn+.rela.dyn.
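The 5.77% and 6.60% figures follow directly from the hexadecimal section sizes in the readelf output above. A few lines of Python reproduce them; the variable names are chosen for this example.

```python
# Sizes of the non-relative dynamic relocation section, from the output above.
rela_dyn = 0x011088  # .rela.dyn with -z pack-relative-relocs
android = 0x0011FC   # .rela.dyn with --pack-dyn-relocs=android+relr
crel_dyn = 0x000FBC  # .crel.dyn with -z pack-relative-relocs -z crel

crel_ratio = crel_dyn / rela_dyn
android_ratio = android / rela_dyn
print(f"CREL {crel_ratio:.2%}, Android {android_ratio:.2%}")
```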

Decoding ULEB128/SLEB128 would necessitate more work in the dynamic loader.

Stefan O'Rear has a [PATCH] ldso: DT_CREL compact relocation support patch for musl. In an x86-64 -O2 build, CREL support linked with -z pack-relative-relocs -z crel increases the size of libc.so by just 200 bytes (0.0256%) compared to a non-CREL build with -z pack-relative-relocs.

Linker notes

--emit-relocs and -r necessitate combining relocation sections. The output size may differ from the sum of input sections. The total relocation count must be determined, a new header written, and section content regenerated, as symbol indexes and addends may have changed. If the linker does not attempt to determine the offset shift in another relocation scan, the offset shift in the header can be set to 0. Debug sections, .eh_frame, and .gcc_except_table require special handling to rewrite relocations referencing a dead symbol to R_*_NONE. This also necessitates updating the relocation type.

--emit-relocs and -r copy CREL relocation sections (e.g. .crel.text) to the output. When .rela.text is also present, linkers are required to merge .rela.text into .crel.text.

GNU ld allows certain unknown section types:

[SHT_LOUSER,SHT_HIUSER] and non-SHF_ALLOC
[SHT_LOOS,SHT_HIOS] and non-SHF_OS_NONCONFORMING

but reports errors and stops linking for others (unless --no-warn-mismatch is specified). When linking a relocatable file using SHT_CREL, you might encounter errors like the following:

% clang -Wa,--crel -fuse-ld=bfd a.c b.c
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/a-1e0778.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/b-9963f0.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: error in /tmp/a-1e0778.o(.eh_frame); no .eh_frame_hdr table will be created
/usr/bin/ld.bfd: error in /tmp/b-9963f0.o(.eh_frame); no .eh_frame_hdr table will be created
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Older lld and mold do not report errors. I have filed:

https://github.com/llvm/llvm-project/issues/84812 (milestone: 19.1)
https://github.com/rui314/mold/issues/1215 (milestone: 2.4.2)

In addition, when there is one .eh_frame section with CIE pieces but no relocation, _bfd_elf_parse_eh_frame will report an error.

mips64el

mips64el has an incorrect r_info: a 32-bit little-endian symbol index followed by a 32-bit big-endian type. If mips64el decides to adopt CREL, they can utilize this opportunity to fix r_info.
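
To make the layout difference concrete, here is a minimal sketch that decodes the same 8 bytes both ways. The function names are hypothetical, and the mips64el reader treats the type as a single big-endian 32-bit field, as described above.

```python
import struct


def decode_r_info_generic_le(raw: bytes):
    """Generic little-endian ELF64: r_info = (symidx << 32) | type."""
    (info,) = struct.unpack('<Q', raw)
    return info >> 32, info & 0xFFFFFFFF


def decode_r_info_mips64el(raw: bytes):
    """mips64el: 32-bit little-endian symbol index, then 32-bit big-endian type."""
    (symidx,) = struct.unpack('<I', raw[:4])
    (rtype,) = struct.unpack('>I', raw[4:])
    return symidx, rtype


# Symbol index 5 and type 18, laid out the mips64el way.
raw = struct.pack('<I', 5) + struct.pack('>I', 18)
print(decode_r_info_mips64el(raw))    # symbol index and type as intended
print(decode_r_info_generic_le(raw))  # a generic reader misinterprets both fields
```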

Data compression

I analyzed relocatable files in lld 18 (x86_64, -O3 builds) and extracted RELA and CREL relocations. I investigated the potential benefits of compressing these combined sections within the file.

It's important to consider that data compression, while beneficial for size reduction, can introduce challenges for random access and increase memory usage especially for the linker. Therefore, it might not be a suitable solution.

import subprocess
from pathlib import Path

import lief

# f and where are set by the caller: f is presumably the relocation kind
# ('crel' or 'rela') and where is the directory containing the object files.
with open(f'{f}.bin', 'wb') as out, \
        open(f'{f}.bin.lz4', 'wb') as out_lz4, \
        open(f'{f}.bin.zst', 'wb') as out_zstd, \
        open(f'{f}.bin.xz', 'wb') as out_xz:
    for ofile in Path(where).glob('**/*.o'):
        content = bytes()
        elf = lief.parse(str(ofile))
        # Concatenate all relocation sections of the requested kind.
        for sec in elf.sections:
            if sec.name.startswith(f'.{f}'):
                out.write(sec.content)
                content += sec.content
        # Compress per object file; keep the raw bytes when compression does not help.
        sub = subprocess.run(['lz4'], input=content, capture_output=True, check=True)
        out_lz4.write(sub.stdout if len(sub.stdout) < len(content) else content)
        sub = subprocess.run(['zstd'], input=content, capture_output=True, check=True)
        out_zstd.write(sub.stdout if len(sub.stdout) < len(content) else content)
        sub = subprocess.run(['xz'], input=content, capture_output=True, check=True)
        out_xz.write(sub.stdout if len(sub.stdout) < len(content) else content)

CREL, being a byte-oriented format, allows for further compression. It can be regarded as a very efficient filter before a Lempel-Ziv compressor.

Even more surprising, CREL outperforms RELA compressed with lz4 (level 9) and zstd (level 6).

% stat -c '%s %n' *.bin*
3808775 crel.bin
2855535 crel.bin.lz4
2184912 crel.bin.xz
2319041 crel.bin.zst
28238016 rela.bin
9679414 rela.bin.lz4
2835180 rela.bin.xz
4281547 rela.bin.zst
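
A quick way to read these numbers is to express every size relative to uncompressed RELA. The sketch below only re-uses the sizes printed above; the helper name is made up.

```python
# Sizes in bytes, copied from the stat output above.
sizes = {
    'crel.bin': 3808775, 'crel.bin.lz4': 2855535,
    'crel.bin.xz': 2184912, 'crel.bin.zst': 2319041,
    'rela.bin': 28238016, 'rela.bin.lz4': 9679414,
    'rela.bin.xz': 2835180, 'rela.bin.zst': 4281547,
}


def ratio_to_rela(name: str) -> float:
    """Size of one output relative to uncompressed RELA."""
    return sizes[name] / sizes['rela.bin']


for name, size in sizes.items():
    print(f'{name:13} {size:>9} {ratio_to_rela(name):6.1%}')
```

Uncompressed CREL is about 13.5% of uncompressed RELA. It is smaller than RELA compressed with lz4 or zstd; only xz-compressed RELA is smaller than raw CREL.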

Dynamic relocations are also worth investigating.

ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android
ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel
llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst

Interestingly, both zstd and LZMA2's default levels on RELA outperform Android's packed relocation format. Even better, CREL outperforms them both!

The results do not suggest that we add a Lempel-Ziv compressor to CREL. It would significantly increase complexity and decoder memory usage. The filesystem's transparent compression can handle this for us conveniently.

CREL proposal for the generic ABI

The latest revision has been proposed at https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/qQWVkGpuAQAJ. I have also created:

In https://www.sco.com/developers/gabi/latest/ch4.sheader.html, make the following changes.

In Figure 4-9: Section Types, sh_type, append a row

SHT_CREL | 20

Add text:

SHT_CREL - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.

In Figure 4-16: Special Sections, append

.crelname | SHT_CREL | see below

Change the text below:

.relname, .relaname, and .crelname

These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .crel.text.

In Figure 4-23: Relocation Entries, add:

typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sword r_addend;
} Elf32_Crel;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Crel;

Add text above "A relocation section references two other sections":

count: Relocation count (32-bit or 64-bit unsigned).
addend_bit: 1 indicates that relocation entries encode addends. 0 indicates implicit addends (stored in the location to be modified).
shift: The shift value (0 to 3) applies to delta_offset in relocation entries.

Relocation entries (which encode r_offset, r_symidx, r_type, and r_addend) follow the header. Note: r_info in traditional REL/RELA formats has been split into r_symidx and r_type, allowing uint32_t relocation types for ELFCLASS32 as well.

Delta offset and flags (ULEB128): Holds delta_offset * (addend_bit ? 8 : 4) + flags (35-bit or 67-bit unsigned), where:
- delta_offset: Difference in r_offset from the previous entry (truncated to Elf32_Addr or Elf64_Addr), right shifted by shift.
- flags: 0 to 7 if addend_bit is 1; otherwise 0 to 3.
- flags & 1: Indicate if delta symbol index is present.
- flags & 2: Indicate if delta type is present.
- flags & 4: Indicate if delta addend is present.
Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry, truncated to a 32-bit signed integer.
Delta type (SLEB128, if present): The difference in relocation type from the previous entry, truncated to a 32-bit signed integer.
Delta addend (SLEB128, if present): The difference in addend from the previous entry, truncated to a 32-bit or 64-bit signed integer depending on the object file class.

ULEB128 or SLEB128 encoded values use the canonical representation (i.e., the shortest byte sequence). For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.
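
As an illustration of the canonical-representation requirement, the following sketch shows a padded ULEB128 byte sequence that decodes to the same value as the shortest one but would not be allowed. All function names here are hypothetical helpers, not part of the proposal.

```python
def encode_uleb128(value: int) -> bytes:
    """Shortest (canonical) ULEB128 encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            break
    return result


def is_canonical_uleb128(data: bytes) -> bool:
    """Canonical means re-encoding the decoded value gives the same bytes."""
    return encode_uleb128(decode_uleb128(data)) == data


padded = bytes([0x85, 0x80, 0x00])  # the value 5 with two redundant padding bytes
print(decode_uleb128(padded), is_canonical_uleb128(padded))
```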

Encoding/decoding delta offset and flags does not need multi-precision arithmetic. We can just unroll and special case the first iteration. The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.

Example C++ encoder:

// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

const uint8_t addendBit = config->isRela ? 4 : 0, flagBits = config->isRela ? 3 : 2;
Elf_Addr offsetMask = 8, offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (const Elf_Crel &rel : relocs)
  offsetMask |= rel.r_offset;
int shift = std::countr_zero(offsetMask);
encodeULEB128(relocs.size() * 8 + addendBit + shift, os);
for (const Elf_Crel &rel : relocs) {
  Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
  offset = rel.r_offset;
  uint8_t b = (deltaOffset << flagBits) + (symidx != rel.r_symidx) +
              (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
  if (deltaOffset < (0x80 >> flagBits)) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> (7 - flagBits), os);
  }
  if (b & 1) {
    encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
    symidx = rel.r_symidx;
  }
  if (b & 2) {
    encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
    type = rel.r_type;
  }
  if (b & 4 & addendBit) {
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}

Example C++ decoder:

// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

const auto hdr = decodeULEB128(p);
const size_t count = hdr / 8, flagBits = hdr & 4 ? 3 : 2, shift = hdr % 4;
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> flagBits;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << (7 - flagBits)) - (0x80 >> flagBits);
  if (b & 1)
    symidx += decodeSLEB128(p);
  if (b & 2)
    type += decodeSLEB128(p);
  if (b & 4 & hdr)
    addend += decodeSLEB128(p);
  rels[i] = {offset << shift, symidx, type, addend};
}

Both encoder and decoder can be simplified if the desired addend_bit is hardcoded, making flagBits an integer literal.
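
For experimentation, the format can also be sketched in Python, where arbitrary-precision integers remove the need for the first-byte special case. This is an illustrative round-trip model under my reading of the description above, not the reference implementation; every function name is hypothetical and relocations are plain (offset, symidx, type, addend) tuples.

```python
def uleb128(v):
    out = bytearray()
    while True:
        byte, v = v & 0x7F, v >> 7
        if not v:
            return bytes(out + bytes([byte]))
        out.append(byte | 0x80)


def sleb128(v):
    out = bytearray()
    while True:
        byte, v = v & 0x7F, v >> 7  # >> is an arithmetic shift on Python ints
        if (v == 0 and not byte & 0x40) or (v == -1 and byte & 0x40):
            return bytes(out + bytes([byte]))
        out.append(byte | 0x80)


def read_leb128(data, pos, signed):
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if signed and byte & 0x40:
        result -= 1 << shift
    return result, pos


def to_signed(v, bits):
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v


def encode_crel(relocs, is_rela=True, bits=64):
    mask = (1 << bits) - 1
    offset_mask = 8  # caps the shift at 3
    for r in relocs:
        offset_mask |= r[0]
    shift = (offset_mask & -offset_mask).bit_length() - 1  # count trailing zeros
    flag_bits = 3 if is_rela else 2
    out = bytearray(uleb128(len(relocs) * 8 + (4 if is_rela else 0) + shift))
    offset = symidx = rtype = addend = 0
    for r_offset, r_symidx, r_type, r_addend in relocs:
        delta_offset = ((r_offset - offset) & mask) >> shift
        flags = (symidx != r_symidx) | (rtype != r_type) << 1
        if is_rela and addend != r_addend:
            flags |= 4
        out += uleb128(delta_offset << flag_bits | flags)
        if flags & 1:
            out += sleb128(to_signed(r_symidx - symidx, 32))
        if flags & 2:
            out += sleb128(to_signed(r_type - rtype, 32))
        if flags & 4:
            out += sleb128(to_signed(r_addend - addend, bits))
            addend = r_addend
        offset, symidx, rtype = r_offset, r_symidx, r_type
    return bytes(out)


def decode_crel(data, bits=64):
    mask = (1 << bits) - 1
    hdr, pos = read_leb128(data, 0, signed=False)
    count, flag_bits, shift = hdr >> 3, 3 if hdr & 4 else 2, hdr & 3
    offset = symidx = rtype = addend = 0
    relocs = []
    for _ in range(count):
        v, pos = read_leb128(data, pos, signed=False)
        offset = (offset + (v >> flag_bits << shift)) & mask
        flags = v & ((1 << flag_bits) - 1)
        if flags & 1:
            d, pos = read_leb128(data, pos, signed=True)
            symidx = (symidx + d) & 0xFFFFFFFF
        if flags & 2:
            d, pos = read_leb128(data, pos, signed=True)
            rtype = (rtype + d) & 0xFFFFFFFF
        if flags & 4:
            d, pos = read_leb128(data, pos, signed=True)
            addend = to_signed(addend + d, bits)
        relocs.append((offset, symidx, rtype, addend))
    return relocs
```

With this model, two relocations (0x10, 1, 2, -4) and (0x18, 1, 2, -4) encode to 6 bytes (one header byte, four bytes for the first entry, one for the second), where two Elf64_Rela entries take 48 bytes.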

In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_CREL | 38 | d_ptr | optional | optional

Add text below:

DT_CREL - This element is similar to DT_REL, except its table uses the CREL format. The relocation count can be inferred from the header.

Update DT_PLTREL and DT_PLTRELSZ:

DT_PLTRELSZ: This element holds the total size, in bytes, of the relocation entries associated with the procedure linkage table. If an entry of type DT_JMPREL is present and the DT_PLTREL entry value is DT_REL or DT_RELA, a DT_PLTRELSZ must accompany it.
DT_PLTREL: This member specifies the type of relocation entry to which the procedure linkage table refers. The d_val member holds DT_REL, DT_RELA, or DT_CREL, as appropriate. All relocations in a procedure linkage table must use the same relocation type.

Abandoned proposal RELLEB (last revision)

A SHT_CREL section holds compact relocation entries that decode to Elf32_Crel or Elf64_Crel depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_symidx, r_type, and r_addend. Note that the r_info member in traditional REL/RELA formats has been split into separate r_symidx and r_type members, allowing uint32_t relocation types for ELFCLASS32 as well.

In the following description, Elf_Addr/Elf_SAddr denote uint32_t/int32_t for ELFCLASS32 or uint64_t/int64_t for ELFCLASS64.

First member (ULEB128): Holds 2 * delta_offset + eq (33-bit or 65-bit unsigned), where:
- delta_offset: Difference in r_offset from the previous entry (Elf_Addr).
- eq: Indicates if the symbol index/type match the previous entry (1 for match, 0 otherwise).
Second member (SLEB128) if eq is 1:
- Difference in r_addend from the previous entry (Elf_SAddr).
Second member (SLEB128) if eq is 0:
- If type and addend match the previous entry, the encoded value is the symbol index; type and addend are omitted.
- Otherwise, the bitwise NOT of the encoded value (33-bit signed) is the symbol index; delta type and delta addend follow:
  - Delta type (SLEB128): The difference in relocation type from the previous entry (32-bit signed).
  - Delta addend (SLEB128): The difference in r_addend relative to the previous entry (signed Elf_Addr).

The bitwise NOT of symbol index 0xffffffff is -0x100000000 (33-bit) instead of 0 (32-bit).
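
The 33-bit requirement can be checked with a few lines of arithmetic. In this sketch (the helper name is made up), truncating the bitwise NOT to 32 bits yields 0, which a decoder would read as "symbol index 0 with unchanged type and addend", while a 33-bit value stays negative and round-trips.

```python
def to_signed(value: int, bits: int) -> int:
    """Truncate to `bits` bits and interpret as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


symidx = 0xFFFFFFFF
not33 = to_signed(~symidx, 33)  # negative: the decoder knows type/addend follow
not32 = to_signed(~symidx, 32)  # the sign is lost when truncated to 32 bits

print(hex(not33), not32, hex(~not33))
```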

Encoder in pseudo-code:

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size());
for (const Reloc &rel : relocs) {
  if (symidx == rel.r_symidx && type == rel.r_type) {
    // Symbol index/type match the previous entry. Encode the addend.
    encodeULEB128(2 * uint128_t(rel.r_offset - offset) + 1); // at most 65-bit
    encodeSLEB128(static_cast<Elf_SAddr>(rel.r_addend - addend));
    addend = rel.r_addend;
  } else {
    encodeULEB128(2 * uint128_t(rel.r_offset - offset)); // at most 65-bit
    if (type == rel.r_type && addend == rel.r_addend) {
      // Type/addend match the previous entry. Encode the symbol index.
      encodeSLEB128(rel.r_symidx);
    } else {
      // No optimization is applied. Encode symbol index, type, and addend.
      encodeSLEB128(~static_cast<int64_t>(rel.r_symidx));
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type));
      type = rel.r_type;
      encodeSLEB128(static_cast<Elf_SAddr>(rel.r_addend - addend));
      addend = rel.r_addend;
    }
    symidx = rel.r_symidx;
  }
  offset = rel.r_offset;
}

Encoding/decoding an unsigned 65-bit integer does not need multi-precision arithmetic. We can just unroll and special case the first iteration. Example C++ encoder:

// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size(), os);
for (const Reloc &rel : relocs) {
  auto deltaOffset = static_cast<uint64_t>(rel.r_offset - offset);
  offset = rel.r_offset;
  // odd is 1 when the symbol index and type match the previous entry.
  uint8_t odd = rel.r_symidx == symidx && rel.r_type == type, b = deltaOffset * 2 + odd;
  if (deltaOffset < 0x40) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> 6, os);
  }
  symidx = rel.r_symidx;
  if (!odd && type == rel.r_type && addend == rel.r_addend) {
    encodeSLEB128(symidx, os);
  } else {
    if (!odd) {
      encodeSLEB128(~static_cast<int64_t>(symidx), os);
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
      type = rel.r_type;
    }
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}

Example C++ decoder:

// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

size_t count = decodeULEB128(p);
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> 1;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << 6) - 0x40;
  int64_t x = decodeSLEB128(p);
  if (b & 1) {
    addend += x;
  } else {
    if (x < 0) {
      x = ~x;
      type += decodeSLEB128(p);
      addend += decodeSLEB128(p);
    }
    symidx = x;
  }
  rels[i] = {offset, symidx, type, addend};
}

My involvement with LLVM 18

2024-02-25T08:00:00.000Z

LLVM 18 will soon be released. This post provides a summary of my contributions in this release cycle to record my learning progress.

LLVM binary utility maintenance, e.g.
- adopted llvm-readobj style ObjectFile specific dumpers
- [llvm-objdump][X86] Add @plt symbols for .plt.got
- [llvm-readobj] Print for relocation target with an empty name
- Support --decompress/-z (#82594)
sanitizer maintenance, e.g.
- [asan] Enable StackSafetyAnalysis by default
- asan_static x86-64: Support 64-bit ASAN_SHADOW_OFFSET_CONST (#75748)
- [asan] Report executable/DSO name for report_globals=2 and odr-violation checking (#71879)
- changed lsan to work with high-entropy ASLR for x86-64 Linux
- removed crypt and crypt_r interceptors to work with glibc
- implemented interceptors for glibc 2.38 __isoc23_strtol and __isoc23_scanf family functions
- tsan: Respect !nosanitize metadata and remove gcov special case
- [dfsan] Wrap glibc 2.38 __isoc23_* functions (#79958)
- -fsanitize=alignment: check memcpy/memmove arguments (#67766)
gcov maintenance
- Ignore blocks from another file to fix a crash
LTO maintenance
- Improve diagnostics handling when parsing module-level inline assembly (#75726)
MC maintenance
- [MC,AArch64] Suppress local symbol to STT_SECTION conversion for GOT relocations
- MC Make .pseudo_probe created sections deterministic after D91878
- Change .reloc to register used symbols
- Change SHF_LINK_ORDER and section group parsing order to match GNU assembler
AArch32
- [ARM,ELF] Fix access to dso_preemptable __stack_chk_guard with static relocation model (#70014)
AArch64
- Suppress local symbol to STT_SECTION conversion for GOT relocations
- Restrict MOVZ/MOVK to non-PIC large code model (#70178)
- clang: Define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 for AArch64 (#74954)
MIPS
- Use generic isBlockOnlyReachableByFallthrough (#80799)
RISC-V
- Clean up code after improved assembler support for linker relaxation
- Support R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128 for .uleb128 directives
- Parse SHF_LINK_ORDER argument before section group name (#77407)
- R_RISCV_CALL/R_RISCV_CALL_PLT assembler and assembly parser cleanup
- Force relocations if initial MCSubtargetInfo contains FeatureRelax (#77436)
x86
- Support inline assembly constraint "Ws"
- Change displacement overflow when parsing assembly code (#75747)
- Fix MSVC-style inline assembly call fptr and jmp fptr (#73207)
- In 32-bit mode, fix FastISel -fno-pic for intrinsics to emit R_386_PC32 instead of R_386_PLT32 (#51078)
- clang: Support arch=x86-64{,-v2,-v3,-v4} for target_clones attribute
- clang: __builtin_cpu_supports: support x86-64{,-v2,-v3,-v4}
libunwind
- Bump to CXX_STANDARD 17 (#75986)

lld

See lld 18 ELF changes

MC

Removed many obsolete workarounds from the integrated assembler
Fixed placement of function entry comments
Re-architected a substantial part of the integrated assembler that is used by RISC-V linker relaxation, fixing some longstanding bugs. See The dark side of RISC-V linker relaxation for details.

Clang

Driver maintenance

[Driver] -###: exit with code 1 if hasErrorOccurred
Report errors for target-specific options on unsupported targets
Remove RequiresPIE and msan's NeedPIE setting (#77689)
Add -fandroid-pad-segment/-fno-android-pad-segment (#77244)
Support -mtls-dialect=desc

Others:

Function multi-versioning: don't set comdat for internal linkage resolvers

Code review

Reviewed many patches, including ADT/Support, binary utilities, MC, lld (sometimes non-ELF ports even if my primary expertise is in ELF), clangDriver, LTO, sanitizers, LoongArch, RISC-V, x86-64 medium/large code models, etc.

TODO: is:pr is:closed sort:updated-desc review-requested:@me lists pull requests that requested a review from me, but it's unclear how to list pull requests on which I've commented.

MMU-less systems and FDPIC

2024-02-20T08:00:00.000Z

This article describes ABI and toolchain considerations about systems without a Memory Management Unit (MMU). We will focus on FDPIC and the in-development FDPIC ABI for RISC-V, with updates as I delve deeper into the topic.

Embedded systems often lack MMUs, relying on real-time operating systems (RTOS) like VxWorks or special Linux configurations (CONFIG_MMU=n). In these systems, the offset between the text and data segments is often not known at compile time. Therefore, a dedicated register is typically set to somewhere in the data segment and writable data is accessed relative to this register.

Why is the offset not known at compile time? There are primarily two reasons.

First, eXecute in Place (XIP), where code resides in ROM while the data segment is copied to RAM. Therefore, the offset between the text and data segments is often not known at compile time.

Second, all processes share the same address space without an MMU. However, it is still desirable for these processes to share text segments. Therefore, a mechanism is needed for code to find its corresponding data.

Compiler support for unknown text-data segment offset

`-msep-data`

GCC's m68k port added -msep-data in 2003-10.

Add -msep-data and -mid-shared-library support for uClinux. These are two special PIC variants that allow executing linux applications in ROM filesystems without loading an additional copy in memory (XIP).
With -msep-data, references to global data are made through register A5 which is loaded with a pointer to the start of the data/bss segment allocated in RAM.
The -mid-shared-library option allows using a special shared library flavour that allows allocationg a distinct data/bss section for each process without the need to relocate code in both library and application.

-msep-data is PIC only and updates -fno-pic to -fPIE. In this mode, a5 is read-only and holds the address of _GLOBAL_OFFSET_TABLE_. When not used with -mid-shared-library, -fPIC -msep-data is unnecessary. Just stick with -fPIE -msep-data.

`-mid-shared-library`

This option was added to GCC's m68k port along with -msep-data. The documentation says:

Generate code that supports shared libraries via the library ID method. This allows for execute-in-place and shared libraries in an environment without virtual memory management. This option implies -fPIC.

-mid-shared-library is PIC only and updates -fno-pic to -fPIE. You compile a source file with -mid-shared-library -mshared-library-id=n, and the functions will be attached to library ID n. At function entry a5 points to an array that maps a library ID to the corresponding GOT base address. The compiler generates move.l -(n+1)*4(%a5),%a5 to obtain the actual GOT base address. The a5 will then be used to access the corresponding data segment.
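
The displacement in that move.l instruction follows directly from the library ID. A tiny sketch of the arithmetic, assuming 4-byte pointers as on m68k (the function name is hypothetical):

```python
def got_base_slot_displacement(library_id: int, pointer_size: int = 4) -> int:
    """Displacement from a5 of the slot holding the GOT base of `library_id`."""
    return -(library_id + 1) * pointer_size


for n in range(4):
    print(f'library ID {n}: move.l {got_base_slot_displacement(n)}(%a5),%a5')
```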

gcc/config/bfin added -msep-data in 2006.

-mno-pic-data-is-text-relative

This ARM option is similar to -msep-data and only makes sense with -fpie/-fpic. In 2013, -mno-pic-data-is-text-relative, generalized from the ARM VxWorks RTP port, was added to assume that text and data segments don't have a fixed displacement. On non-VxWorks-RTP targets, -mno-pic-data-is-text-relative implies -msingle-pic-base:

Treat the register used for PIC addressing as read-only, rather than loading it in the prologue for each function. The runtime system is responsible for initializing this register with an appropriate value before execution begins.

r9 is used as the static base (arm_pic_register) in the position-independent data model to access the data segment. Since r9 is not changed, dynamic linking seems unsupported as a DSO needs a different data segment.

GCC's s390x port added -mno-pic-data-is-text-relative in 2017 for kpatch (live kernel patching).

`-fropi` and `-frwpi`

[RFC][ARM] Add support for embedded position-independent code (ROPI/RWPI)
[ARM] Add support for embedded position-independent code
[ARM] Command-line options for embedded position-independent code

Clang ARM's -fropi and -frwpi are special -fno-pic variants that are only intended for static linking. While regular -fno-pic assumes absolute addressing for both code and data, -fropi and -frwpi add a twist by enforcing relative addressing based on specific assumptions about relocation. Both options consider the text-data segment offset unknown at compile time.

-fropi assumes code and read-only data will be relocated at runtime, making absolute addressing unsuitable. Instead, PC-relative addressing is used. The .ARM.attributes section contains Tag_ABI_PCS_RO_data: 1 like -fpic.
-frwpi assumes writable data will be relocated at runtime, making absolute addressing unsuitable. Instead, writable data is accessed relative to the static base register. The .ARM.attributes section contains Tag_ABI_PCS_RW_data: 2.

You can use -fropi and -frwpi together to require relative addressing for both code and data. Compared with -fno-pic -frwpi, -fno-pic -fropi -frwpi needs one more instruction to retrieve a function address.

In terms of semantics, I think -fno-pic -fropi -frwpi is identical to -fpie -mno-pic-data-is-text-relative with hidden visibility declarations. In practice, GCC -fpie -mno-pic-data-is-text-relative utilizes GOT-relative relocations (R_ARM_GOT_BREL), not MOVW/MOVT instructions.

`-mfdpic`

We will discuss this in detail later.

Compiler option summary

-msep-data and -mno-pic-data-is-text-relative are the same, relying on -fpie/-fpic semantics to enforce relative addressing for the text segment. -fropi and -frwpi offer finer control. You can choose to use relative addressing for the text segment only (-fropi), the data segment only (-frwpi), or both.

Neither -msep-data nor -fropi -frwpi supports shared libraries. -msep-data's variant -mid-shared-library provides a library ID based shared library, which works for some cases but is inflexible.

Now, let's review OS support. While I'm not an RTOS expert, let's explore Linux's executable file loaders and see how they handle MMU-less scenarios.

Linux binfmt loaders

fs/Kconfig.binfmt defines a few loaders.

BINFMT_ELF defaults to y and depends on MMU.
BINFMT_ELF_FDPIC defaults to y when BINFMT_ELF is not selected. A few architectures support BINFMT_ELF_FDPIC for NOMMU. ARM supports FDPIC even with an MMU.
BINFMT_FLAT is provided for a few architectures.

Therefore, both BINFMT_ELF_FDPIC and BINFMT_FLAT can be used for MMU-less systems. BINFMT_FLAT is a very old solution that does not allow dynamic linking while BINFMT_ELF_FDPIC supports dynamic linking.

BTW, BINFMT_AOUT, removed in 2022, had been supported for alpha/arm/x86-32.

Binary flat format

Linux's BINFMT_FLAT refers to an object file format used by μClinux: Binary Flat format (BFLT). https://myembeddeddev.blogspot.com/2010/02/uclinux-flat-file-format.html has an introduction. BFLT is an executable file only format, not used for relocatable files. An executable file is typically converted from ELF using elf2flt. ld-elf2flt is a ld wrapper that invokes elf2flt when the option -elf2flt is seen.

Linux's BINFMT_FLAT supports both version 2 (OLD_FLAT_VERSION) and version 4. Version 4 supports eXecute in Place (XIP), where code resides in ROM while the data segment is copied to RAM. Therefore, the offset between the text and data segments is often not known at compile time.

Greg added ID-based shared library support to be used with -mid-shared-library in 2003, which was removed in April 2022. The code supported one executable and at most 3 shared libraries.

The tooling for shared library support seems to be called eXtended FLAT (XFLAT). It is a limited shared library scheme that disallows global variable sharing. Quoting XFLAT FAQ:

XFLAT provides an alternative mechanism to bind and relocate functions using a thunk layer that is inserted between each inter-module function call. However, without a GOT it is not possible to bind and relocate data. In short, with no GOT XFLAT cannot support sharing of global variables between program and shared library modules.

FDPIC

FDPIC can be seen as an extended -mno-pic-data-is-text-relative mode that utilizes function descriptors to support PIC register changes for dynamic linking. An FDPIC executable can be loaded using either the regular Linux ELF loader for MMU systems or fs/binfmt_elf_fdpic.c for MMU-less systems. fs/binfmt_elf_fdpic.c has been available since 2002. It supports both MMU and NOMMU configurations but does not support ET_EXEC executables in NOMMU mode. Each architecture that supports FDPIC defines an EI_OSABI value to be checked by the loader.

Several architectures define an FDPIC ABI.

Fujitsu FR-V: The FR-V FDPIC ABI, initial version in 2004
ADI Blackfin: Blackfin FDPIC ABI
SuperH: The SH FDPIC ABI, initial version in 2008
ARM FDPIC ABI, initial version in 2013. ARM FDPIC Toolchain and ABI provides a good summary.

Here is a summary.

The read-only sections, which can be shared, are commonly referred to as the "text segment", whereas the writable sections are non-shared and commonly referred to as the "data segment". Functions and certain data symbols (.rodata) reside in the text segment, while other data symbols and the GOT reside in the data segment. Special entries called "canonical function descriptors" also reside in the GOT.

A call-clobbered register is reserved as the FDPIC register, used to access the data segment. Upon entry to a function, the FDPIC register holds the address of _GLOBAL_OFFSET_TABLE_. The text segment can be referenced using PC-relative addressing. The data segment including GOT is referenced using indirect FDPIC-register-relative addressing. We will see later that sometimes it's unknown whether a non-preemptible symbol resides in the text segment or the data segment, in which case GOT-indirect addressing with the FDPIC register has to be used.

A function call is called external if the destination may reside in another module, which has a different data segment and therefore needs a different FDPIC register value. Therefore, an external function call needs to update the FDPIC register as well as changing the program counter (PC). The FDPIC register can be spilled into a stack slot or a call-saved register, if the caller needs to reference the data segment later. The FDPIC register is call-clobbered to allow external tail calls and avoid PLT saving the register.

Calling a function pointer, including calling a PLT entry, also sets both the FDPIC register and PC. When the address of a function is taken, the address of its canonical function descriptor is obtained, not that of the entry point. The descriptor, which resides in the GOT, contains pointers to both the function's entry point and its FDPIC register value. The two GOT entries are relocated by a dynamic relocation of type R_*_FUNCDESC_VALUE (e.g. R_FRV_FUNCDESC_VALUE).

If the symbol is preemptible, the code sequence loads a GOT entry. When the symbol is a function, the GOT entry is relocated by a dynamic relocation R_*_FUNCDESC and will contain the address of the function descriptor.
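
A toy model may help picture what a call through a function descriptor does. This is purely illustrative: the class and function names are invented, a Python callable stands in for the entry point, and integers stand in for FDPIC register values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FunctionDescriptor:
    entry_point: Callable[[int], str]  # stands in for the entry point address
    fdpic_value: int                   # FDPIC register value of the defining module


def call_via_descriptor(desc: FunctionDescriptor, caller_fdpic: int):
    spilled = caller_fdpic             # caller spills its FDPIC register
    fdpic = desc.fdpic_value           # load the callee module's FDPIC value
    result = desc.entry_point(fdpic)   # jump to the entry point
    fdpic = spilled                    # caller reloads its own value afterwards
    return result, fdpic


# A function in another module, whose data segment lives at a different address.
lib_fun = FunctionDescriptor(lambda got: f'callee sees GOT at {got:#x}', 0x2000)
print(call_via_descriptor(lib_fun, caller_fdpic=0x1000))
```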

Let's check out examples taking addresses of functions and variables.

__attribute__((visibility("hidden"))) void hidden_fun();
void fun();
__attribute__((visibility("hidden"))) extern int hidden_var;
extern int var;
__attribute__((visibility("hidden"))) const int ro_hidden_var = 42;

void *addr_hidden_fun() { return hidden_fun; }
void *addr_fun() { return fun; }
void *addr_hidden_var() { return &hidden_var; }
void *addr_var() { return &var; }
const int *addr_ro_hidden_var() { return &ro_hidden_var; }
int read_hidden_var() { return hidden_var; }
int read_var() { return var; }

Function access in FDPIC

Canonical function descriptors are stored in the GOT, and their access depends on whether the referenced function is preemptible or not.

For non-preemptible functions: the address of the descriptor is directly computed by adding an offset to the FDPIC register.
For preemptible functions: a GOT entry is loaded first. This entry, relocated by a R_*_FUNCDESC dynamic relocation, holds the final address of the function descriptor.

// arm-linux-gnueabihf-gcc -c -fpic -mfdpic -Wa,--fdpic
addr_hidden_fun: // non-preemptible function
  ldr     r0, .L3            // r0 = &.got[n] - FDPIC
  add     r0, r0, r9         // r0 = &.got[n]; the address of the canonical function descriptor
  ...
.L3:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_FUNCDESC_VALUE, is the canonical function descriptor.
  .word   hidden_fun(GOTOFFFUNCDESC) // R_ARM_GOTOFFFUNCDESC(hidden_fun)

addr_fun: // preemptible function
  ldr     r3, .L6            // r3 = &.got[n] - FDPIC
  ldr     r0, [r9, r3]       // r0 = &.got[n]; the address of the canonical function descriptor
  ...
.L6:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_FUNCDESC, will contain the address of the canonical function descriptor.
  .word   fun(GOTFUNCDESC)   // R_ARM_GOTFUNCDESC(fun)
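
The two cases in the listing above can be summarized as a lookup. The relocation names and instructions are taken from the ARM listing; the function itself is only a hypothetical summary, not how a compiler is structured.

```python
def arm_fdpic_funcdesc_access(preemptible: bool) -> dict:
    """How the address of a function's canonical descriptor is obtained on ARM FDPIC."""
    if preemptible:
        # Load a GOT entry that holds the address of the descriptor.
        return {
            'static_reloc': 'R_ARM_GOTFUNCDESC',
            'dynamic_reloc': 'R_ARM_FUNCDESC',
            'instruction': 'ldr r0, [r9, r3]',
        }
    # The descriptor itself lives in this module's GOT: add its offset to r9.
    return {
        'static_reloc': 'R_ARM_GOTOFFFUNCDESC',
        'dynamic_reloc': 'R_ARM_FUNCDESC_VALUE',
        'instruction': 'add r0, r0, r9',
    }


for preemptible in (False, True):
    print(preemptible, arm_fdpic_funcdesc_access(preemptible))
```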

Unfortunately, when linking a DSO, an R_ARM_GOTOFFFUNCDESC relocation referencing a hidden symbol results in a linker error. This error likely arises because the generated R_ARM_FUNCDESC_VALUE dynamic relocation requires a dynamic symbol. While this can be implemented using an STB_LOCAL STT_SECTION dynamic symbol, GNU ld currently lacks support for this approach.

% arm-linux-gnueabihf-gcc -fpic -mfdpic -O2 -Wa,--fdpic q.c -shared
/tmp/ccxpnij8.o: in function `addr_hidden_fun':
q.c:(.text+0x10): dangerous relocation: no dynamic index information available
collect2: error: ld returned 1 exit status

Let's try sh4. sh4-linux-gnu-gcc -fpic -mfdpic -O2 q.c -shared -nostdlib allows taking the address of a hidden function but not a protected function (my pending fix).

Then, let's see a global variable initialized by the address of a function and a C++ virtual table.

struct A { virtual void foo(); };
void call(A *a) { a->foo(); }
auto *var_call = call;

// arm-linux-gnueabihf-g++ -c -fpic -mfdpic -Wa,--fdpic
  ldr     r3, [r0]           // load vtable
...
  ldr     r3, [r3]           // load vtable entry `.word _ZN1A3fooEv(FUNCDESC)`
  ldr     r9, [r3, #4]       // load FDPIC register value
  ldr     r3, [r3]           // load foo's entry point
  blx     r3

.section        .data.rel,"aw"
var_call:
// Function descriptor address, relocated by R_ARM_FUNCDESC dynamic relocation
  .word   _Z4callP1A(FUNCDESC) // R_ARM_FUNCDESC

.section        .data.rel.ro,"aw"
_ZTV1A:
  .word   0
  .word   _ZTI1A
// Function descriptor address, relocated by R_ARM_FUNCDESC dynamic relocation
  .word   _ZN1A3fooEv(FUNCDESC) // R_ARM_FUNCDESC

TODO: -fexperimental-relative-c++-abi-vtables

Data access in FDPIC

GOT-indirect addressing is required for accessing data symbols under two conditions:

Preemptible symbols: Traditional GOT requirement.
Non-preemptible symbols with potential data segment placement: This includes
- Writable data symbols: This covers both locally declared (int var;) and externally declared (extern int var;) non-const variables.
- Potential dynamic initialization: const A a; extern const int var;
- Certain guaranteed constant initialization: extern constinit const int *const extern_const;. Constant initialization may require a relocation, e.g. constinit const int *const extern_const = &var;

addr_hidden_var: // non-preemptible data with potential data segment placement
  ldr     r3, .L9             // r3 = &.got[n] - FDPIC
  ldr     r0, [r9, r3]
  ...
.L9:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_RELATIVE, will contain the address of hidden_var.
  .word   hidden_var(GOT)    // R_ARM_GOT_BREL

addr_var: // preemptible data
  ldr     r3, .L12
  ldr     r0, [r9, r3]
  ...
.L12:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_GLOB_DAT, will contain the address of var.
  .word   var(GOT)           // R_ARM_GOT_BREL

The dynamic relocations R_*_RELATIVE and R_*_GLOB_DAT do not use the standard + load_base semantics. It seems that musl fdpic doesn't support the special R_*_RELATIVE.

If the referenced data symbol is non-preemptible and guaranteed to be in the text segment, we can use PC-relative addressing. However, this scenario is remarkably rare in practice. The most likely use case is like the following:

const int ro_array[] = {1, 2, 3, 4}; // text segment

int read_ro_array_elem(int i) { return ro_array[i]; }

GCC's arm port does not seem to utilize PC-relative addressing. We can try GCC's SuperH port:

// sh4-linux-gnu-gcc -S -fpic -mfdpic -O2 q.c
addr_hidden_var: // non-preemptible data
  mov.l   .L12,r0
  rts
  add     r12,r0
.L13:
  .align 2
.L12:
  .long   hidden_var@GOTOFF

addr_ro_hidden_var: // non-preemptible data
  mov.l   .L18,r0
  rts
  mov.l   @(r0,r12),r0
.L19:
  .align 2
.L18:
  .long   ro_hidden_var@GOT

It optimizes addr_hidden_var but not addr_ro_hidden_var.

Thread-local storage in FDPIC

ARM FDPIC ABI defines static TLS relocations R_ARM_TLS_GD32_FDPIC, R_ARM_TLS_LDM32_FDPIC, and R_ARM_TLS_IE32_FDPIC to be relative to GOT, as opposed to their non-FDPIC counterparts, which are relative to PC.

PLT in FDPIC

The PLT entry needs to update the FDPIC register as well as changing the program counter (PC). binutils' arm port uses the following code sequence.

foo@plt:
ldr r12, .L1
add r12, r12, r9
ldr r9, [r12, #4]
ldr pc, [r12]
.L1: .word foo(GOTOFFFUNCDESC)

Lazy binding could be implemented, but it is difficult if the architecture does not allow atomic updates of two words. binutils' arm port simply disables lazy binding.

Let's inspect an example involving consecutive function calls.

void f0(void);
void f1(void);
void f2(void);
void g() { f0(); f1(); f2(); }

g:
  push    {r4, lr}
  mov     r4, r9
  bl      f0(PLT)
  mov     r9, r4
  bl      f1(PLT)
  mov     r9, r4
  pop     {r4, lr}
  b       f2(PLT)
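
The save/restore dance around each call can be modeled in C. This is only a sketch of the calling convention: fdpic_reg, call_via_descriptor, and record_got are names I made up for illustration, and a real descriptor call happens in the PLT or inline assembly, not in a C helper.

```c
#include <assert.h>
#include <stddef.h>

/* A function descriptor: entry point at offset 0, FDPIC register value next
   (the ARM PLT reads them with ldr pc, [r12] and ldr r9, [r12, #4]). */
struct funcdesc {
  void (*entry)(void);
  void *got;
};

/* Hypothetical stand-in for the FDPIC register (r9 on ARM). */
void *fdpic_reg;

/* Call through a descriptor, then restore the caller's FDPIC register,
   mirroring the mov r4, r9 ... mov r9, r4 pair around each bl. */
void call_via_descriptor(const struct funcdesc *fd) {
  assert(fd != NULL && fd->entry != NULL);
  void *saved = fdpic_reg; /* mov r4, r9 */
  fdpic_reg = fd->got;     /* ldr r9, [r12, #4] */
  fd->entry();             /* ldr pc, [r12] */
  fdpic_reg = saved;       /* mov r9, r4 */
}

/* Example callee that records the FDPIC register value it observes. */
void *observed_got;
void record_got(void) { observed_got = fdpic_reg; }
```

The callee sees its own module's GOT value, and the caller gets its own value back afterwards, which is the whole point of the extra mov instructions.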

If GCC implements -fno-plt, it can use the following code sequence:

g:
  push    {r4, lr}
  mov     r4, r9
  // call f0
  ldr     r12, .L0
  add     r12, r12, r4
  ldr     r9, [r12, #4]
  ldr     pc, [r12]
  // call f1
  ldr     r12, .L1
  add     r12, r12, r4
  ldr     r9, [r12, #4]
  ldr     pc, [r12]
  // tail call f2
  ldr     r12, .L2
  add     r12, r12, r4
  ldr     r9, [r12, #4]
  pop     {r4, lr}
  ldr     pc, [r12]

.L0: .word f0(GOTOFFFUNCDESC)
.L1: .word f1(GOTOFFFUNCDESC)
.L2: .word f2(GOTOFFFUNCDESC)

Relative relocations and `.rofixup` section

Unlike standard R_*_RELATIVE relocations that use "*loc += load_base" semantics, the load address in FDPIC mode is dependent on the containing segment. The following code adapted from musl demonstrates the behavior:

static void *laddr(const struct dso *p, size_t v) {
  size_t j=0;
  for (; v-p->loadmap->segs[j].p_vaddr >= p->loadmap->segs[j].p_memsz; j++);
  return (void *)(v - p->loadmap->segs[j].p_vaddr + p->loadmap->segs[j].addr);
}
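
To see why a single load_base is not enough, here is a self-contained sketch of the same lookup with a two-segment load map. The struct layout, the name fdpic_laddr, and the addresses are assumptions for illustration; musl's real types differ, and musl's laddr assumes that some segment always covers the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified load map: one record per loaded segment. */
struct loadseg {
  uintptr_t addr;    /* run-time address the segment was mapped at */
  uintptr_t p_vaddr; /* link-time virtual address of the segment */
  uintptr_t p_memsz; /* size of the segment in memory */
};

struct loadmap {
  size_t nsegs;
  const struct loadseg *segs;
};

/* Translate a link-time address to a run-time address.
   Returns 0 when no segment covers v. */
uintptr_t fdpic_laddr(const struct loadmap *m, uintptr_t v) {
  assert(m != NULL);
  for (size_t j = 0; j < m->nsegs; j++) {
    const struct loadseg *s = &m->segs[j];
    /* Unsigned wraparound also rejects v < p_vaddr. */
    if (v - s->p_vaddr < s->p_memsz)
      return v - s->p_vaddr + s->addr;
  }
  return 0;
}
```

With a text segment (link-time address 0) mapped at 0x40000000 and a data segment (link-time address 0x2000) mapped at 0x70000000, 0x10 becomes 0x40000010 while 0x2004 becomes 0x70000004: two addresses, two different deltas.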

In -pie and -shared links, a dynamic section is present, and non-preemptible function and data pointers are relocated by R_*_FUNCDESC_VALUE and R_*_RELATIVE dynamic relocations. For -no-pie links, the situation varies:

Dynamic links: A dynamic section is present. We can still use dynamic relocations.
Static links: There is no dynamic section. In the non-FDPIC case, there are not even relocations (other than R_*_IRELATIVE, unsupported in musl/uclibc-ng).

FDPIC executables of type ET_EXEC present a unique challenge: while the text segment has a fixed address, the data segment has an unknown address at link time and requires relocations. To address this, a linker-created section named .rofixup was introduced in the first FDPIC ABI (FR-V), and later adopted by other FDPIC ABIs.

.rofixup holds non-preemptible function and data pointers, which have R_*_RELATIVE semantics. The last entry of .rofixup is special and holds the address of _GLOBAL_OFFSET_TABLE_. In a -pie or -shared link, .rofixup has only one entry. __ROFIXUP_LIST__ and __ROFIXUP_END__ are defined as encapsulation symbols of .rofixup.

At run time, the loader sets the FDPIC register to the relocated _GLOBAL_OFFSET_TABLE_ value before transferring control to the entry point of the executable.
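
A rough model of how startup code might walk .rofixup is sketched below. It simulates memory as an array of 32-bit words indexed by run-time byte address, and the names (struct seg, to_runtime, apply_rofixup) are hypothetical; this is my reading of the description above, not code from any libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg {
  uint32_t addr;    /* run-time address */
  uint32_t p_vaddr; /* link-time address */
  uint32_t p_memsz; /* size in memory */
};

/* Segment-dependent translation from a link-time to a run-time address. */
static uint32_t to_runtime(const struct seg *segs, size_t n, uint32_t v) {
  for (size_t j = 0; j < n; j++)
    if (v - segs[j].p_vaddr < segs[j].p_memsz)
      return v - segs[j].p_vaddr + segs[j].addr;
  assert(!"address not covered by any segment");
  return 0;
}

/* mem: simulated run-time memory, indexed by byte address / 4.
   fixups: .rofixup contents, i.e. link-time addresses of words to relocate,
   with the link-time address of _GLOBAL_OFFSET_TABLE_ as the last entry.
   Returns the value to place in the FDPIC register. */
uint32_t apply_rofixup(uint32_t *mem, const struct seg *segs, size_t nsegs,
                       const uint32_t *fixups, size_t nfixups) {
  assert(nfixups >= 1);
  for (size_t i = 0; i + 1 < nfixups; i++) {
    uint32_t where = to_runtime(segs, nsegs, fixups[i]);
    /* The word holds a link-time pointer; rewrite it in place. */
    mem[where / 4] = to_runtime(segs, nsegs, mem[where / 4]);
  }
  return to_runtime(segs, nsegs, fixups[nfixups - 1]);
}
```

Note that the fixed-up word and the pointer stored in it may live in different segments, so each gets its own translation.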

Here is an example:

.globl fun; fun: bx lr
.section .rodata,"a"
.globl var; var: .long 0

.section .data.rel.ro,"aw"
.long fun(FUNCDESC)          // R_ARM_FUNCDESC_VALUE or two .rofixup entries
.long var                    // R_ARM_RELATIVE or one .rofixup entry

Reflections on FDPIC

FDPIC can be seen as:

An extended -msep-data/-mno-pic-data-is-text-relative mode that utilizes function descriptors to support PIC register changes for dynamic linking.
A fixed PPC64 ELFv1 function descriptors ABI. However, PPC64 ELFv1's trick of st_value referring to the function descriptor is better than the existing FDPIC ABIs (sh, arm).

FDPIC resembles PPC64 ELFv2 TOC where the FDPIC register is set by the caller instead of the callee, avoiding global/local entry and tail call complexity.

-fno-pic -mfdpic with hidden visibility declarations can replace -fno-pic -fropi -frwpi, though clobbered r9 across function calls has slight overhead.
-fPIE -mfdpic with hidden visibility declarations can replace -fPIE -msep-data, though setting the call-clobbered FDPIC register has slight overhead.

-mfdpic often generates smaller code than -mno-fdpic on architectures where PC-relative addressing is expensive. This includes:

sh4: Lacks PC-relative addressing entirely.
arm: Needs LDR with .word _GLOBAL_OFFSET_TABLE_-(.LPIC0+4), which is expensive.

Since FDPIC works effectively even on systems with MMUs, it raises the intriguing possibility of replacing the standard calling ABI entirely.

Toolchain notes

-mfdpic enables FDPIC code generation. GCC's sh port got FDPIC support in 2015. -mfdpic implies -fPIE, so -fno-pic -mfdpic and -fPIE -mfdpic have the same codegen behavior. -fPIC -mfdpic may have different generated code as it additionally sets flag_shlib.

The cfgexpand pass calls sh_get_fdpic_reg_initial_val to retrieve the FDPIC register value from a pseudo register, registering the pseudo register on the first invocation. At the start of the ira (Integrated Register Allocator) pass, allocate_initial_values initializes the pseudo register to the hard register r12 at the function entry point. sh is the only port that defines TARGET_ALLOCATE_INITIAL_VALUE.

In GCC's arm port, -fno-pic -mfdpic generated code does not work.

In addition, external function calls save and restore r9.

gas's arm port needs --fdpic to assemble FDPIC-related relocation types. GCC configured with an arm*-*-uclinuxfdpiceabi target utilizes arm/uclinuxfdpiceabi.h and transforms -mfdpic to --fdpic when assembling a file. For other targets, -Wa,--fdpic is needed to assemble the output. [PATCH] arm: Support -mfdpic for more targets will make -Wa,--fdpic unneeded.

-mfdpic -mtls-dialect=gnu2 is not supported. The ARM FDPIC ABI uses ldr to load a 32-bit constant embedded in the text segment. The offset is used to materialize the address of a GOT entry (canonical function descriptor, address of the canonical function descriptor, or address of data).

You can configure binutils with --target=arm-unknown-uclinuxfdpiceabi to get a BFD linker that supports FDPIC emulations.

% ~/Dev/binutils-gdb/out/arm-fdpic/ld/ld-new -V
GNU ld (GNU Binutils) 2.42.50.20240222
  Supported emulations:
   armelf_linux_eabi
   armelfb_linux_eabi
   armelf_linux_fdpiceabi
   armelfb_linux_fdpiceabi
% ~/Dev/binutils-gdb/out/arm-fdpic/ld/ld-new -m armelf_linux_fdpiceabi -shared a.o -o a.so

GNU ld's arm port fails on R_ARM_GOTOFFFUNCDESC referencing a hidden function symbol (PR31408).

% cat a.c
__attribute__((visibility("hidden"))) void fun_hidden();
void *fun_hidden_addr() { return fun_hidden; }
% ./bin/ld-new -m armelf_linux_fdpiceabi a.o
[1]    3819239 segmentation fault  ./bin/ld-new a.o
% ./bin/ld-new -m armelf_linux_fdpiceabi -shared a.o
./bin/ld-new: BFD (GNU Binutils) 2.42.50.20240224 internal error, aborting at ../../../bfd/elf32-arm.c:16466 in allocate_dynrelocs_for_symbol

./bin/ld-new: Please report this bug.

In -no-pie mode, certain non-function references that require a .rofixup entry lead to a segfault (PR31407).

Global/weak non-hidden symbols referenced by R_ARM_FUNCDESC are unnecessarily exported (PR31409).

RISC-V FDPIC

Several proposals exist for defining FDPIC-like ABIs to work for MMU-less systems.

[RFC] RISC-V ELF FDPIC psABI addendum and RISC-V FDPIC/NOMMU toolchain/runtime support: This offers a starting point, but needs further discussion.
lowRISC ePIC: While simple and interesting, ePIC lacks dynamic linking support.
Stefan O'Rear's proposal: This proposal holds promise and deserves close attention.

Undoubtedly, GP should be used as the FDPIC register.

Loading a constant near code (like ARM) is not efficient. Instead, consider a two-instruction sequence:

Use hi20 and lo12 instructions to generate an offset relative to the GP register.
Use c.add a0, gp to compute the address of the GOT entry.
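
For readers less familiar with the hi20/lo12 split, here is a small sketch of the arithmetic. It assumes the new GP-relative operators would follow the usual RISC-V %hi/%lo convention (the low part is a signed 12-bit immediate, so the high part is rounded by 0x800); split_hi_lo and combine_hi_lo are illustrative names, not part of any proposal.

```c
#include <assert.h>
#include <stdint.h>

struct hi_lo {
  uint32_t hi20; /* immediate for lui (upper 20 bits) */
  int32_t lo12;  /* signed 12-bit immediate for the load or addi */
};

struct hi_lo split_hi_lo(int32_t offset) {
  uint32_t u = (uint32_t)offset;
  struct hi_lo r;
  /* Adding 0x800 compensates for the sign extension of the low part. */
  r.hi20 = ((u + 0x800u) >> 12) & 0xfffffu;
  uint32_t low = u & 0xfffu;
  r.lo12 = low >= 0x800u ? (int32_t)low - 0x1000 : (int32_t)low;
  return r;
}

/* Recombine the way the hardware does: lui places hi20 in bits 31:12,
   then the sign-extended 12-bit immediate is added (modulo 2^32). */
int32_t combine_hi_lo(struct hi_lo p) {
  return (int32_t)((p.hi20 << 12) + (uint32_t)p.lo12);
}
```

For example, an offset of 0x12345fff splits into hi20 = 0x12346 and lo12 = -1, because 0x12346000 - 1 = 0x12345fff.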

Maciej's code sequence supports both function and data access through indirect GP-relative addressing. We can easily enhance it by adding R_RISCV_RELAX to enable linker relaxation and improve performance. Additionally, for consistency with similar notations on x86-64 and AArch64 ("gotpcrel"), let's adopt "gotgprel" notation.

.L0:
lui rX, %gotgprel_hi20(sym)  # R_RISCV_?(sym); R_RISCV_RELAX
c.add rX, gp                 # R_RISCV_?(.L0)
ld rY, %gotgprel_lo12(sym)   # R_RISCV_?(.L0); rY = address

For data access, the code sequence is followed by instructions like:

lb a0,0(rY)
sb a1,0(rY)

Function descriptors and data have different semantics, requiring two relocation types. Stefan O'Rear proposes:

R_RISCV_FUNCDESC_GOTGPREL_HI: Find or create two GOT entries for the canonical function descriptor.
R_RISCV_GOTGPREL_HI: Find or create a GOT entry for the symbol, and return an offset from the FDPIC register.

Drawing inspiration from ARM FDPIC, two additional relocation types are needed for TLS. This results in a 4-type scheme.

RISC-V FDPIC: optimization

Addressing performance concerns is crucial. Stefan suggests an "indirect-to-relative optimization and relaxation scheme":

R_RISCV_PIC_ADD: Tags c.add rX, gp to enable optimization
R_RISCV_INTERMEDIATE_LOAD: Tags ld rY, (rX) to enable optimization

Indirect GP-relative addressing can be optimized to direct GP-relative addressing under specific conditions:

Non-preemptible functions
Non-preemptible data in the data segment

# Indirect GP-relative to direct GP-relative
lui rX, 
c.add rX, gp
addi rY, rX,

GOT-indirect addressing can be optimized to PC-relative for non-preemptible data in the text segment.

# Indirect GP-relative to PC-relative
auipc rX, %pcrel_hi20(sym)
c.nop                               # deletable
addi rY, rX, %pcrel_lo12(sym)

GOT-indirect addressing can be optimized to absolute addressing for non-preemptible data in the text segment.

# Indirect GP-relative to absolute
lui rX, %hi20(sym)
c.nop                               # deletable
addi rY, rX, %lo12(sym)

This can be used for SHN_ABS and unresolved undefined weak symbols. With -no-pie linking, regular symbols are eligible for this optimization as well. However, linkers may choose not to implement this since the added complexity might outweigh the benefits.

RISC-V FDPIC: thread-local storage

To handle TLSDESC, we introduce a new relocation type: R_RISCV_TLSDESC_GPREL_HI. This type instructs the linker to find or create two GOT entries unless optimized to local-exec or initial-exec. The combined hi20 and lo12 offsets compute the GP-relative offset to the first GOT entry.

label:
lui rX, %tlsdesc_gprel_hi(sym)      # R_RISCV_TLSDESC_GPREL_HI(sym); R_RISCV_RELAX
c.add a0, gp                        # R_RISCV_PIC_ADD(label)
ld rY, rX, %tlsdesc_load_lo(label)  # R_RISCV_TLSDESC_LOAD_LO12(label)
addi a0, rX, %tlsdesc_add_lo(label) # R_RISCV_TLSDESC_ADD_LO12(label)
jalr t0, rY, %tlsdesc_call(label)   # R_RISCV_TLSDESC_CALL(label)

Existing relocation types, R_RISCV_TLSDESC_LOAD_LO12 and R_RISCV_TLSDESC_ADD_LO12, are extended to work with R_RISCV_TLSDESC_GPREL_HI.

# TLSDESC to initial-exec optimization
lui a0, 
c.add a0, gp
ld a0, (a0)

# TLSDESC to local-exec optimization
lui a0, 
addi a0, a0,

For the initial-exec TLS model, we need a new pseudoinstruction, say, la.tls.ie.fd rX, sym. It expands to:

lui rX, 0                    # R_RISCV_TLS_GOTGPREL_HI20(sym)
c.add rX, gp                 # R_RISCV_PIC_ADD(label)
ld rX, 0(rX)                 # R_RISCV_PIC_LO12_I(label)

Stefan's scheme defines R_RISCV_PIC_LO12_I as an alias for R_RISCV_PCREL_LO12_I. Since the symbol is GP-relative instead of PC-relative, avoiding PCREL in the relocation type name makes sense.

Stefan's 11-type scheme adds R_RISCV_PIC_ADDR_LO12_I to be associated with ld rX, 0(rX) instead. I have not yet figured out the reasoning.

RISC-V FDPIC: `-fno-plt`

Regular -fno-plt code loads the .got.plt entry using PC-relative addressing and performs an indirect branch. The FDPIC -fno-plt variant needs to load both the FDPIC register and the destination address.

lui rX, 0
c.add rX, gp
ld gp, 8(rX)
ld rX, 0(rX)
c.jr rX

libc implementations with FDPIC support

uclibc-ng supports AArch32, Blackfin, and FR-V.
musl supports SuperH.

lld 18 ELF changes

2024-02-18T08:00:00.000Z

LLVM 18 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/18.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.

--fat-lto-objects option is added to support LLVM FatLTO. Without --fat-lto-objects, LLD will link LLVM FatLTO objects using the relocatable object file. (D146778)
-Bsymbolic-non-weak is added to directly bind non-weak definitions. (D158322)
--lto-validate-all-vtables-have-type-infos, which complements --lto-whole-program-visibility, is added to disable unsafe whole-program devirtualization. --lto-known-safe-vtables= can be used to mark known-safe vtable symbols. (D155659)
--save-temps --lto-emit-asm now derives ELF/asm file names from bitcode file names. ld.lld --save-temps a.o d/b.o -o out will create ELF relocatable files out.lto.a.o/d/out.lto.b.o instead of out1.lto.o/out2.lto.o. (#78835)
--no-allow-shlib-undefined now reports errors for DSO referencing non-exported definitions. (#70769)
common-page-size can now be larger than the system page-size. (#57618)
When call graph profile information is available due to instrumentation or sample PGO, input sections are now sorted using the new cdsort algorithm, better than the previous hfsort algorithm. (D152840)
Symbol assignments like a = DEFINED(a) ? a : 0; are now handled. (#65866)
OVERLAY now supports optional start address and LMA (#77272)
Relocations referencing a symbol defined in /DISCARD/ section now lead to an error. (#69295)
For AArch64 MTE, global variable descriptors have been implemented. (D152921)
R_AARCH64_GOTPCREL32 is now supported. (#72584)
R_LARCH_PCREL20_S2/R_LARCH_ADD6/R_LARCH_CALL36 and extreme code model relocations are now supported.
--emit-relocs is now supported for RISC-V linker relaxation. (D159082)
Call relaxation respects RVC when mixing +c and -c relocatable files. (#73977)
R_RISCV_GOT32_PCREL is now supported. (#72587)
R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128 relocations are now supported. (#72610) (#77261)
RISC-V TLSDESC is now supported. (#79239)
SystemZ (s390x) is now supported. (#75643)

Although a substantial feature, the s390x port benefited from Ulrich Weigand's meticulously prepared patch, complete with comprehensive tests from the outset. While I typically provide extensive feedback for such large additions, the patch's exceptional quality minimized the number of comment rounds needed in this instance. I truly appreciate Ulrich taking the time to reply to my remarks on the s390x ABI. His guidance was instrumental.

The RISC-V port has received a few new relocations for .uleb128 label differences and TLSDESC. I'm glad I made these assembler and linker changes in time, allowing LoongArch developers to port the code for LoongArch. LoongArch borrows many designs from RISC-V. If they were to implement these features first, I'd probably have to spend more time on code reviews, and the outcome would be less rewarding since I wouldn't be the original patch author.

R_AARCH64_GOTPCREL32 (G(GDAT(S))+A-P) and R_RISCV_GOT32_PCREL, similar to R_X86_64_GOTPCREL, are new ABI additions used to optimize preemptible _ZTI* symbol references for clang -fexperimental-relative-c++-abi-vtables. The simplification utilizes the optimization originally done for Mach-O, which applies to GOT equivalent global variables (global unnamed, constant, discardable linkage).

        .long   0                               # 0x0
-       .long   (_ZTI1A.rtti_proxy-.L_ZTV1A.local)-8
+       .long   _ZTI1A@GOTPCREL-4
        .long   (_ZN1A3fooEv@PLT-.L_ZTV1A.local)-8

-       .hidden _ZTI1A.rtti_proxy               # @_ZTI1A.rtti_proxy
-       .type   _ZTI1A.rtti_proxy,@object
-       .section        .data.rel.ro._ZTI1A.rtti_proxy,"aGw",@progbits,_ZTI1A.rtti_proxy,comdat
-       .globl  _ZTI1A.rtti_proxy
-       .p2align        3, 0x0
-_ZTI1A.rtti_proxy:
-       .quad   _ZTI1A
-       .size   _ZTI1A.rtti_proxy, 8

To the best of my knowledge, there is no performance-specific change.

Link: lld 17 ELF changes

Toolchain notes on z/Architecture

2024-02-11T08:00:00.000Z

This article describes some notes about z/Architecture with a focus on the ELF ABI and ELF linkers. An lld/ELF patch sparked my motivation to study the architecture and write this post.

z/Architecture is a big-endian mainframe computer architecture supporting 24-bit, 31-bit, and 64-bit addressing modes. It is the latest generation in a lineage stretching back to 1964 with the IBM System/360 (32-bit general-purpose registers and 24-bit addressing). This lineage includes System/370 (1970), System/370 Extended Architecture (1983), Enterprise Systems Architecture/370 (1988), and Enterprise Systems Architecture/390 (1990). For a deeper dive into the design choices behind z/Architecture's extension from ESA/390, you can refer to "Development and attributes of z/Architecture."

Linux on IBM Z is a 64-bit operating system on z/Architecture, related to an older effort porting Linux to ESA/390. As the Wikipedia page clarifies:

Historically the Linux kernel architecture designations were "s390" and "s390x" to distinguish between the 32-bit and 64-bit Linux on IBM Z kernels respectively, but "s390" now also refers generally to the one Linux on IBM Z kernel architecture.

Documents

z/Architecture Principles of Operation: This is the instruction set manual with an unusual name inherited from IBM System/360 Principles of Operation.
Assembler Language Programming for IBM System z: This book is more readable than Principles of Operation.
z/Architecture Reference Summary: A concise reference of instructions.
zSeries ELF Application Binary Interface Supplement (v1.0.2), 2002: This ABI document has been superseded by s390x-abi.
https://github.com/IBM/s390x-abi: The latest version of the psABI (processor supplement to the System V ABI) resides here. While the absence of updates between 2002 and 2021 might seem odd, rest assured the documentation is actively maintained.

Instruction notes

Each instruction has a length of two, four, or six bytes (one to three halfwords) and must be located at a halfword boundary. Six-byte instructions have been available since S/360. The two most significant bits of the first halfword determine the length of the instruction.
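
The length rule is simple enough to write down: 00 means two bytes, 01 and 10 mean four bytes, and 11 means six bytes. The function name below is mine; only the first byte of the instruction is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Length in bytes of a z/Architecture instruction, decoded from the two
   most significant bits of its first byte. */
unsigned insn_length(uint8_t first_byte) {
  switch (first_byte >> 6) {
  case 0:
    return 2; /* 00: one halfword, e.g. br (0x07) */
  case 3:
    return 6; /* 11: three halfwords, e.g. larl/brasl (0xc0), lg/lgf (0xe3) */
  default:
    return 4; /* 01 and 10: two halfwords, e.g. la (0x41), ear (0xb2) */
  }
}
```

A disassembler can therefore find instruction boundaries without knowing every opcode.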

There are more than 1000 basic instructions.

There are 16 64-bit general-purpose registers, each treated as two independent 32-bit parts. Certain instructions operate on the high 32-bit part. E.g. aih %r2, 1 adds 1 to the high 32-bit part. I suspect that using these instructions to overcome register scarcity wouldn't be a good idea due to the overhead of register synchronization.

PC-relative addressing is supported with the general-instructions-extension facility (February 2008). For example, only one instruction is needed to load _GLOBAL_OFFSET_TABLE_ (see "Global Offset Table" below) into a register (usually r12).

larl %r12, _GLOBAL_OFFSET_TABLE_ # r12 = _GLOBAL_OFFSET_TABLE_

The RIL instruction format, consisting of 6 bytes, encodes a register and a 32-bit immediate operand. This enables it to implement valuable instructions like BRASL (Branch Relative And Save Long, like x86's CALL) and LARL (Load Address Relative Long, like x86's MOV with RIP-relative addressing).

int var;
void fun();
int foo() { fun(); return var; }

// s390x
brasl   %r14, fun@PLT
lgfrl   %r2, var

// x86-64
callq   fun@PLT
movl    var(%rip), %eax

Note: RISC-V's JALR is akin to BRASL, as you can specify which register to save the return address. z/Architecture's BRASL has a length of 6 bytes, so encoding the register, while wasting the encoding space, is more affordable.

LGF (Load, RXY-type) performs a load with a register offset and a 20-bit displacement (base+offset+disp20), but does not support a scaled index operation. Two instructions, consisting of 12 bytes, are required to perform a simple index operation:

int foo(int *a, long i) { return a[i+3]; }
// sllg    %r3,%r3,2
// lgf     %r2,12(%r3,%r2)
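
Spelled out in C, the two instructions compute the byte address by hand: a shift to scale the index, then base + index + displacement. This is just an illustration of the address arithmetic with a made-up function name, using int32_t to pin down the 4-byte element size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Equivalent of `return a[i+3];` with the addressing made explicit. */
int64_t load_indexed(const int32_t *a, long i) {
  assert(a != NULL);
  long scaled = i * 4; /* sllg %r3,%r3,2 */
  const unsigned char *p = (const unsigned char *)a + scaled + 12; /* 12(%r3,%r2) */
  int32_t v;
  memcpy(&v, p, sizeof v);
  return v; /* lgf sign-extends the 32-bit value to 64 bits */
}
```

The +3 in the source folds into the 12-byte displacement, but the multiplication by 4 still costs a separate instruction.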

-march=arch9 introduces some conditional instructions like LOCR/LOCGR (Load On Condition, RRF-c-type), if (cond) r1 = r2. In contrast, 4 bytes on other architectures can generally implement a more powerful three-register conditional move.

While I haven't had extensive time to study the instruction set architecture (ISA), I do see some clear limitations:

Many instructions have different behaviors in 24-bit addressing, 31-bit addressing, and 64-bit addressing. This is needless complication for a CPU when 24-bit and 31-bit addressing become irrelevant for modern programs.
The legacy instruction formats inherited from S/360, S/370, and S/390 restrict the design flexibility of z/Architecture.
Instructions supporting 24-bit addressing become redundant in a 64-bit environment but must be retained for compatibility reasons.

This raises the question: when would IBM prefer designing a completely new architecture and implementing a dynamic binary translator for existing programs?

ABI notes

r14 is used as the link register while r15 is the stack pointer. In s390x-abi, registers r6 to r13 and r15 are designated as non-volatile (not clobbered by a function call). Registers r2 to r6 are used for integer arguments.

r6 being non-volatile for argument storage seems uncommon compared to other architectures.
Only 5 registers (r2 to r6) are used for integer argument storage, which is inadequate. It is unclear why r1 and r7 are not used.

The stack alignment is 8 bytes. Most 64-bit architectures employ 16-byte alignment.

Symbols representing a section offset must be halfword aligned. Compilers assume that an external symbol (e.g. extern char a;) is halfword aligned. -munaligned-symbols removes the assumption.

Compilers

LLVM supports IBM z10 and newer models. 31-bit addressing is not supported.

Global Offset Table

The .got section has 3 reserved entries. The linker defines _GLOBAL_OFFSET_TABLE_ at the start of .got. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).

The assembler modifier @GOTENT designates a 32-bit immediate operand. The assembler modifier @GOT designates an immediate operand of either 16-bit or 32-bit.

Compilers generate a LGRL (Load Relative Long) instruction to load the GOT entry of a symbol. When the symbol is non-preemptible and not an ifunc, the GOT indirection can be optimized to LARL (Load Address Relative Long). This is similar to x86-64's GOTPCRELX optimization.

lgrl %r1, var@GOT            # R_390_GOTENT(var)

=>

larl %r1, var

Procedure Linkage Table

At 32 bytes per entry, PLTs are notably larger than other architectures. Only the first 14 bytes (encompassing three instructions) are strictly necessary for eager binding.

larl %r1, .got.plt[n]
lg %r1, 0(%r1)
br %r1
basr %r1, %r0
lgf %r1, 12(%r1)
jg .plt[0]
.long relocation offset

For lazy PLT binding, the .got.plt entry refers to basr %r1, %r0 (14 bytes relative to the PLT entry), which stores the next instruction address into r1. PLT0 is called with r1 set to the relocation offset. PLT0 sets up arguments and calls .got[2], the PLT resolver in glibc.
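
The byte offsets in the PLT entry follow from the instruction sizes: larl, lg, lgf, and jg (an alias of brcl) are 6 bytes each, br and basr are 2 bytes each, and the trailing .long is 4 bytes. A quick sanity check of the layout, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

enum { PLT_NUM_FIELDS = 7 };

/* Sizes in bytes of: larl, lg, br, basr, lgf, jg, .long */
static const unsigned plt_field_size[PLT_NUM_FIELDS] = {6, 6, 2, 2, 6, 6, 4};

/* Byte offset of field `index` from the start of the PLT entry;
   index == PLT_NUM_FIELDS yields the total entry size. */
unsigned plt_field_offset(size_t index) {
  assert(index <= PLT_NUM_FIELDS);
  unsigned off = 0;
  for (size_t i = 0; i < index; i++)
    off += plt_field_size[i];
  return off;
}
```

basr sits at offset 14 and leaves the address of the next instruction (offset 16) in r1, so lgf %r1, 12(%r1) reads from offset 28, which is exactly where the .long relocation offset lives. The total comes to 32 bytes.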

mold utilizes a 16-byte PLT entry scheme that uses basr %r0, %r1 instead of br %r1 so that PLT0 can compute the relocation offset using r0.

Relocations

There are 5 absolute relocation types: R_390_{8,16,20,32,64}. They can be used as data relocations (.byte, .short, etc) as well as code relocations.

R_390_8 is used by instruction formats with an 8-bit immediate operand (e.g. SI).
R_390_16 is used by instruction formats with a 16-bit immediate operand (e.g. RI).
R_390_20 is used by instruction formats with a 20-bit immediate operand (e.g. RSY, RXY).
R_390_32 is used by instruction formats with a 32-bit immediate operand (e.g. RIL).

The assembler modifier @PLTOFF designates R_390_PLTOFF16, R_390_PLTOFF32, and R_390_PLTOFF64. Their computation &.plt[n] - .got + A is similar to R_X86_64_PLTOFF64 used by x86-64's large code model. However, -mcmodel=large is unsupported, so these relocations do not seem useful.

Relocation types R_390_GOTPLT* (.got.plt[n] relative to .got or PC) seem unused. GCC never emits the assembler modifier @GOTPLT. I believe these relocations are not useful in the presence of PC-relative addressing.

Thread Local Storage

Refer to All about thread-local storage for TLS. s390x utilizes TLS Variant II, implemented by glibc in 2003. The 64-bit TLS ABI closely mirrors the 32-bit TLS ABI, which itself is inspired by x86-32. Unlike other architectures that revamped their TLS ABI during the 64-bit transition, s390x's predates modern instructions like LGFI and LGRL, resulting in a less efficient implementation compared to newer architectures.

First, let's look at thread pointer accessing.

s390: 32-bit thread pointer stored in 32-bit access register a0.
s390x: 64-bit thread pointer split across a0 and a1, both still 32-bit.

This necessitates three instructions (14 bytes) to retrieve the full thread pointer, while 64-bit access registers would simplify this:

ear     %r0, %a0             # r0 = hi(r0) | a0
sllg    %r1, %r0, 32         # r1 = r0<<32
ear     %r1, %a1             # r1 = hi(r1) | a1 = a0<<32 | a1

Access registers hold 32-bit access-list-entry tokens (ALET), which are not used on Linux.
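
The three-instruction thread pointer sequence can be mimicked in C to check the bit manipulation. The helper names mirror the mnemonics but are otherwise my own; ear is modeled as replacing only the low 32 bits of the destination register.

```c
#include <assert.h>
#include <stdint.h>

/* ear: copy a 32-bit access register into the low half of a GPR,
   leaving the high half untouched. */
uint64_t ear(uint64_t gpr, uint32_t ar) {
  return (gpr & 0xffffffff00000000ull) | ar;
}

/* sllg: 64-bit logical left shift into a (possibly different) register. */
uint64_t sllg(uint64_t src, unsigned amount) {
  assert(amount < 64);
  return src << amount;
}

/* Assemble the 64-bit thread pointer from a0 (high half) and a1 (low half).
   stale_r0 is whatever r0 held before; its value must not matter. */
uint64_t thread_pointer(uint64_t stale_r0, uint32_t a0, uint32_t a1) {
  uint64_t r0 = ear(stale_r0, a0); /* ear  %r0, %a0     */
  uint64_t r1 = sllg(r0, 32);      /* sllg %r1, %r0, 32 */
  return ear(r1, a1);              /* ear  %r1, %a1     */
}
```

The stale high half of r0 is shifted out by sllg, which is why the sequence needs no explicit clearing instruction.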

General dynamic TLS model

In the general dynamic TLS model, a key difference compared to other architectures is the use of __tls_get_offset instead of __tls_get_addr. The process involves several steps, illustrated by the provided assembly code:

ear     %r0, %a0
sllg    %r1, %r0, 32
ear     %r1, %a1             # r1 = TP
larl    %r12, _GLOBAL_OFFSET_TABLE_ # r12 = _GLOBAL_OFFSET_TABLE_

lgrl    %r2, .LCPI0_0        # r2 = *(.LCPI0_0) = an offset into .got
# r2 = __tls_get_offset(r2) = dtv[m]+a@DTPOFF - TP
brasl   %r14, __tls_get_offset@PLT:tls_gdcall:a # R_390_PLT32DBL(__tls_get_offset+0x2) at offset+2, R_390_TLS_GDCALL(a) at offset
lgf     %r2, 0(%r2,%r1)      # r2 = *(r2+r1) = *(dtv[m]+a@DTPOFF) = a

.section        .data.rel.ro,"aw",@progbits
.LCPI0_0:
  .quad   a@TLSGD            # R_390_TLS_GD64(a); linker resolves this to an offset into .got

Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_: Four instructions are required but can be shared by subsequent TLS accesses. This step can be reordered.
Obtaining the GOT offset: The offset (a@TLSGD) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure), relocated by dynamic relocations R_390_TLS_DTPMOD and R_390_TLS_DTPOFF. The dynamic loader will set the values to (m, a@DTPOFF), the module ID and an offset of the symbol relative to the dynamic TLS block.
Finding the offset relative to the current dynamic TLS block (DTPOFF): __tls_get_offset(r2) returns dtv[m] + a@DTPOFF - TP. __tls_get_addr in other architectures just returns dtv[m] + a@DTPOFF.
Adding the thread pointer to get the symbol address in the current thread

In glibc, __tls_get_offset is defined as:

// unsigned long __tls_get_offset(unsigned long offset);

__tls_get_offset:
la      %r2,0(%r2,%r12)
jg      __tls_get_addr
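
Here is a small numeric model of the arithmetic, using plain integers instead of real pointers. It follows the description above (GOT offset in, TP-relative offset out) and is not glibc's implementation; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct tls_index {
  uint64_t module; /* filled in via R_390_TLS_DTPMOD */
  uint64_t offset; /* filled in via R_390_TLS_DTPOFF */
};

/* What __tls_get_addr returns on most architectures. */
uint64_t model_tls_get_addr(const uint64_t *dtv, struct tls_index ti) {
  return dtv[ti.module] + ti.offset;
}

/* got: GOT contents as 64-bit entries; got_offset: byte offset of the
   tls_index pair, as loaded from .data.rel.ro. */
uint64_t model_tls_get_offset(const uint64_t *dtv, uint64_t tp,
                              const uint64_t *got, uint64_t got_offset) {
  assert(got_offset % 8 == 0);
  const uint64_t *entry = got + got_offset / 8; /* la %r2,0(%r2,%r12) */
  struct tls_index ti = {entry[0], entry[1]};
  return model_tls_get_addr(dtv, ti) - tp; /* result is TP-relative */
}
```

The caller then adds TP back to obtain the address. Since everything is modulo 2^64, this works whether the TLS block is above or below the thread pointer.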

While this approach works, it's considered the least efficient implementation of general dynamic TLS among the architectures I have analyzed. Here is why:

Inefficient tls_index argument (similar to AArch32): This requires an extra lookup in .data.rel.ro.
Unnecessary use of _GLOBAL_OFFSET_TABLE_ (similar to x86-32): Instead of loading a@TLSGD, and then adding _GLOBAL_OFFSET_TABLE_, it is easier to just load the GOT entry address using LGRL.
Redundant argument: __tls_get_offset takes the GOT offset instead of the direct GOT entry address.
Indirect return value: Instead of returning the final TLS symbol address directly, __tls_get_offset only provides an offset, requiring an extra instruction for addition with the TP.

The 64-bit TLS ABI, modeled closely after the 32-bit ABI, was codified before nice instructions like LGRL (general-instructions-extension facility, February 2008) were available. It clearly comes at the cost of performance.

The marker relocation R_390_TLS_GDCALL comes after R_390_PLT32DBL, different from other architectures.

The general-dynamic code sequence can be optimized to initial-exec or local-exec.

// general-dynamic to initial-exec
lgrl    %r2, .LCPI0_0        # r2 = *(.LCPI0_0) = &.got[n]-_GLOBAL_OFFSET_TABLE_
lg      %r2, 0(%r2,%r12)     # r2 = TP offset
lgf     %r2, 0(%r2,%r1)      # r2 = *(r2+r1) = TLS value in the current thread

.section .data.rel.ro,"aw",@progbits
.LCPI0_0:
  .quad &.got[n]-_GLOBAL_OFFSET_TABLE_ # .got[n], relocated by R_390_TLS_TPOFF, holds the TP offset

// general-dynamic to local-exec
lgrl    %r2, .LCPI0_0        # r2 = *(.LCPI0_0) = TP offset
brcl    0, .                 # nop
lgf     %r2, 0(%r2,%r1)      # r2 = *(r2+r1) = TLS value in the current thread

.section .data.rel.ro,"aw",@progbits
.LCPI0_0:
  .quad a@NTPOFF

In both cases, the linker only needs to patch one instruction, instead of four for PPC64.

Local dynamic TLS model

The process involves several steps, illustrated by the provided assembly code:

lgrl    %r2,.LC0             # r2 = *(.LC0) = GOT offset of a tls_index object holding {module_ID, 0}
# r2 = __tls_get_offset(r2) = dtv[m]-TP
brasl   %r14,__tls_get_offset@PLT:tls_ldcall:a # R_390_PLT32DBL(__tls_get_offset+0x2) at offset+2, R_390_TLS_LDCALL(a) at offset

ear     %r3, %a0
sllg    %r4, %r3, 32
ear     %r4, %a1             # r4 = TP
la      %r2,0(%r2,%r4)       # r2 = r2+r4 = dtv[m]

lgrl    %r1, .LC1            # r1 = a@DTPOFF
lgf     %r1,0(%r1,%r2)       # r1 = *(a@DTPOFF + dtv[m]) = a

lgrl    %r1, .LC2            # r1 = b@DTPOFF
lgf     %r1,0(%r1,%r2)       # r1 = *(b@DTPOFF + dtv[m]) = b

.section .data.rel.ro,"aw"
.align 8
.LC0: .quad a@TLSLDM         # R_390_TLS_LDM64(a); linker resolves this to a GOT offset of tls_index{m, 0}
.LC1: .quad a@DTPOFF         # R_390_TLS_LDO64(a); linker resolves this to a's offset relative to dtv[m]
.LC2: .quad b@DTPOFF         # R_390_TLS_LDO64(b); linker resolves this to b's offset relative to dtv[m]

Retrieving the thread pointer and _GLOBAL_OFFSET_TABLE_
Obtaining the GOT offset: The offset (a@TLSLDM) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure): the module ID and a zero. The module ID entry is relocated by a dynamic relocation R_390_TLS_DTPMOD.
Finding the dynamic TLS block address: __tls_get_offset(r2) returns dtv[m] - TP. It is not dtv[m] + XXX - TP because the second GOT entry is zero.
Adding DTPOFF to get the symbol address in the current thread

The first three steps can be shared among TLS symbols.
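
The address arithmetic of the steps above can be modeled in C++. This is only a rough sketch: the function names and sample values are hypothetical, and `get_offset_result` merely stands in for the value returned by __tls_get_offset.

```cpp
#include <cstdint>

// Models `ear %r3,%a0; sllg %r4,%r3,32; ear %r4,%a1`: the 64-bit thread
// pointer is split across access registers a0 (high half) and a1 (low half).
std::uint64_t thread_pointer(std::uint32_t a0, std::uint32_t a1) {
  return (static_cast<std::uint64_t>(a0) << 32) | a1;
}

// Models `la %r2,0(%r2,%r4)` followed by adding a symbol's DTPOFF.
// get_offset_result plays the role of dtv[m]-TP (hypothetical parameter name).
std::uint64_t local_dynamic_address(std::uint64_t get_offset_result,
                                    std::uint64_t tp, std::uint64_t dtpoff) {
  std::uint64_t block = get_offset_result + tp; // dtv[m], shared by all symbols
  return block + dtpoff;                        // address of one symbol
}
```

Only the last addition depends on the symbol, which is why the earlier steps can be shared between a and b.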

Similar to general-dynamic, the marker relocation R_390_TLS_LDCALL comes after R_390_PLT32DBL, different from other architectures. This makes the lld implementation awkward.

The local-dynamic code sequence can be optimized to local-exec.

lgrl    %r2,.LC0             # r2 = 0
brcl    0, .                 # nop

ear     %r3, %a0
sllg    %r4, %r3, 32
ear     %r4, %a1             # r4 = TP
la      %r2,0(%r2,%r4)       # r2 = r2+r4 = TP

lgrl    %r1, .LC1            # r1 = a@NTPOFF
lgf     %r1,0(%r1,%r2)       # r1 = *(a@NTPOFF + TP) = a

lgrl    %r1, .LC2            # r1 = b@NTPOFF
lgf     %r1,0(%r1,%r2)       # r1 = *(b@NTPOFF + TP) = b

.section .data.rel.ro,"aw"
.align 8
.LC0: .quad 0
.LC1: .quad a@NTPOFF         # a's TP offset
.LC2: .quad b@NTPOFF         # b's TP offset

Initial Exec TLS model

lgrl    %r1, a@INDNTPOFF     # R_390_TLS_IEENT(a); linker resolves this to a GOT entry holding the TP offset
lgf     %r1, 0(%r1,%r7)      # r1 = *(a@NTPOFF + TP) = a

Optimizing the code sequence to local-exec is straightforward: changing the first instruction to lgfi %r1, a@NTPOFF. However, LGFI (Load Immediate) is part of the extended-immediate facility (September 2005), introduced with System z9 109, unavailable when the ABI was defined.

Relocation types R_390_TLS_IE32/R_390_TLS_IE64 for the initial-exec TLS model seem not useful.

Local Exec TLS model

The code sequence loads the TP offset indirectly in a manner similar to AArch32.

lgrl    %r1, .LC0            # r1 = a@NTPOFF
lgf     %r1, 0(%r1,%r7)      # r1 = *(a@NTPOFF + TP) = a

.section .data.rel.ro,"aw"
.LC0: .quad a@NTPOFF         # R_390_TLS_LE64(a); linker resolves this to the TP offset, a negative integer

The indirection is unfortunate. Similarly, LGFI (Load Immediate) can be used instead.

Linux distributions

https://almalinux.org/blog/how-we-built-almalinux-86-for-s390x/
https://wiki.alpinelinux.org/wiki/S390x/Installation
https://wiki.debian.org/SupportedArchitectures
https://alt.fedoraproject.org/alt/
https://wiki.gentoo.org/wiki/Project:S390
https://en.opensuse.org/ZSystems

Raw symbol names in inline assembly

2024-01-30T08:00:00.000Z

For operands in asm statements, GCC has supported the constraints "i" and "s" for a long time (since at least 1992).

// gcc/common.md
(define_constraint "i"
  "Matches a general integer constant."
  (and (match_test "CONSTANT_P (op)")
       (match_test "!flag_pic || LEGITIMATE_PIC_OPERAND_P (op)")))

(define_constraint "s"
  "Matches a symbolic integer constant."
  (and (match_test "CONSTANT_P (op)")
       (match_test "!CONST_SCALAR_INT_P (op)")
       (match_test "!flag_pic || LEGITIMATE_PIC_OPERAND_P (op)")))

CONSTANT_P matches a class of RTL expressions called RTX_CONST_OBJ.

An RTX code that represents a constant object. HIGH is also included in this class.

The most interesting objects in this class are constant integers, constant floating points, symbol or label references with a constant offset. "s" is like "i", but does not match constant integers (e.g. "s"(1) is an error). So "s" essentially matches a symbol or label reference with a constant offset.

"s" can be used to create an artificial reference for linker garbage collection, define sections to hold symbol addresses, or even enable more creative applications.

namespace ns { extern int a[2][2]; }
void fun();
void foo() {
label:
  asm(".pushsection .xxx,\"aw\"; .dc.a %0; .popsection" :: "s"(&ns::a[1][1]));
  asm(".reloc ., BFD_RELOC_NONE, %0" :: "s"(fun));
  asm("// %0" :: "s"(&&label));
}

C++ templates can make this easier to use.

template <class T, T &x>
void ref() { asm (".reloc ., BFD_RELOC_NONE, %0" :: "s"(x)); }
void use() { ref<decltype(ns::a), ns::a>(); }

// Materialize the symbol address manually
asm("adrp %0, %1\nadd %0, %0, :lo12:%1" : "=r"(ret) : "S"(&var));

Using the generic r or m constraint in such cases would instruct GCC to generate instructions to compute the address, which can be wasteful if the materialized address isn't actually needed.

// aarch64
  asm("// %0" :: "r"(fun));
// adrp    x0, _GLOBAL_OFFSET_TABLE_
// ldr     x0, [x0, #:gotpage_lo15:_Z3funv]

The condition !flag_pic || LEGITIMATE_PIC_OPERAND_P (op) highlights a key distinction in GCC's handling of symbol references:

Non-PIC code (-fno-pic): The "i" and "s" constraints are freely permitted.
PIC code (-fpie and -fpic): The architecture-specific LEGITIMATE_PIC_OPERAND_P(X) macro dictates whether these constraints are allowed.

While the default implementation (gcc/defaults.h) is permissive (used by MIPS, PowerPC, and RISC-V), many ports impose stricter restrictions, often disallowing preemptible symbols under PIC.

This differentiation probably stems from historical and architectural considerations:

Non-PIC code: Absolute addresses could be directly embedded in instructions like an immediate integer operand.
PIC code with dynamic linking: The need for GOT indirection often requires an addressing mode different from absolute addressing and more than one instruction.

Nevertheless, I think this symbol preemptibility limitation for "s" is unfortunate. Ideally, we could retain the current "i" for immediate integer operand (after linking), and design "s" for a raw symbol name with a constant offset, ignoring symbol preemptibility. This architecture-agnostic "s" would simplify metadata section utilization and boost code portability.

Below are some architecture-specific notes.

AArch32

In gcc/config/arm, LEGITIMATE_PIC_OPERAND_P(X) has a complex definition and it seems to disallow any non-TLS symbol reference, which means that "s" cannot be used for PIC.

"US" can be used for a symbol reference without an offset (e.g. &a[0] when a is an array) in PIC code, but there is no good way to match &a[1]. To get rid of the # prefix, use the modifier "c".

extern int a[4];
void foo() { asm("// %c0" :: "US"(&a[0])); }

AArch64

In gcc/config/aarch64, LEGITIMATE_PIC_OPERAND_P(X) disallows any symbol reference, which means that "i" and "s" cannot be used for PIC. Instead, the constraint "S" has been supported since the initial port (2012) to reference a symbol or label.

Clang 7 also implemented "S".

RISC-V

gcc/config/riscv uses the generic LEGITIMATE_PIC_OPERAND_P(X), so "s" can be used in PIC mode.

The constraint "S" is supported (since the beginning of the port in 2017) for a similar purpose, but requires a non-preemptible symbol.

I implemented the constraint "S" for Clang 14 but realized that "S" is less useful in GCC, so I sent a patch to implement "s".

x86

We can use the constraint "s" (or "i") with the modifier "p" to print the raw symbol name without syntax-specific prefixes, but it does not work when:

the symbol is preemptible (similar to RISC-V's "S")
or -mcmodel=large
or -mcmodel=medium for large data

void foo() {
label:
  asm("// %p0" :: "i"(foo));  // Does not work if foo is preemptible
  asm("// %p0" :: "i"(&&label));
}

I filed the feature request for a new constraint in May 2022 and eventually went ahead and implemented it myself. The patch landed today, catching the GCC 14 release. I have also implemented "Ws" for Clang 18.1.

BTW, you can also use the modifier "c".

Require a constant operand and print the constant expression with no punctuation.

I think having such a long list of modifiers is unfortunate.

Summary

If your program wants to adopt raw symbol names in inline assembly, consider the following list for best portability and semantics:

AArch32: "US" (for symbol reference without an offset)
AArch64: "S"
x86: "Ws", GCC 14+, Clang 18+
MIPS/PowerPC/RISC-V: "s"

Applications

Linux kernel's jump label patching showcases the practical benefits of accessing raw symbol names in inline assembly. Actually, this feature likely drew me down this rabbit hole about raw symbol names a few years ago.

Many ports use the constraint "i", which is more or less a hack.

// gcc/common.md
(define_constraint "i"
  "Matches a general integer constant."
  (and (match_test "CONSTANT_P (op)")
       (match_test "!flag_pic || LEGITIMATE_PIC_OPERAND_P (op)")))

In the non-PIC mode, "i" does work with a constant, symbol reference, or label reference. However, in the PIC mode, "i" on a symbol reference is rejected by certain GCC ports (e.g. aarch64). I went ahead and sent an arm64 patch :)

BTW, the jump label patching implementation prevents kernel compilation without optimizations (along with other clever tricks). include/linux/jump_label.h offers an interesting example of function overloading using __builtin_types_compatible_p and an undefined symbol.

#define static_branch_likely(x)                                                 \
({                                                                              \
        bool branch;                                                            \
        if (__builtin_types_compatible_p(typeof(*x), struct static_key_true))   \
                branch = !arch_static_branch(&(x)->key, true);                  \
        else if (__builtin_types_compatible_p(typeof(*x), struct static_key_false)) \
                branch = !arch_static_branch_jump(&(x)->key, true);             \
        else                                                                    \
                branch = ____wrong_branch_error();                              \
        likely_notrace(branch);                                                         \
})

Ideally, if the kernel switches to C++, a template would provide a more elegant and portable solution, enabling compilation without optimizations.

template <class T, T &key>
bool arch_static_branch(bool branch) {
  asm_volatile_goto(... : : "Ws"(key), "i" (2 | branch) : : l_yes);
  return false;
l_yes:
  return true;
}

Modified condition/decision coverage (MC/DC) and compiler implementations

2024-01-28T08:00:00.000Z

Key metrics for code coverage include:

function coverage: determines whether each function has been executed.
line coverage (aka statement coverage): determines whether every line has been executed.
branch coverage: ensures that both the true and false branches of each conditional statement or the condition of each loop statement have been evaluated.

Condition coverage offers a more fine-grained evaluation of branch coverage. It requires that each individual boolean subexpression (condition) within a compound expression be evaluated to both true and false. For example, for the boolean expression in if (a>0 && f(b) && c==0), which has the conditions a>0, f(b), and c==0, condition coverage would require tests that:

Evaluate a>0 to true and false
Evaluate f(b) to true and false
Evaluate c==0 to true and false

A condition combination refers to a specific set of boolean values assigned to individual conditions within a boolean expression. Multiple condition coverage ensures that all possible condition combinations are tested. In the example above, with three conditions, there would be 2³ = 8 possible condition combinations.
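
To make the gap between the two criteria concrete, here is a minimal C++ sketch. It assumes each test is simply a triple of values for a>0, f(b), and c==0, and it deliberately ignores short-circuit evaluation; the type and function names are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// One test assigns a value to each of the three conditions a>0, f(b), c==0.
using ConditionValues = std::array<bool, 3>;

// Condition coverage: every condition has been seen both true and false.
bool has_condition_coverage(const std::vector<ConditionValues> &tests) {
  for (std::size_t i = 0; i < 3; ++i) {
    bool seen_true = false, seen_false = false;
    for (const ConditionValues &t : tests) {
      if (t[i]) seen_true = true;
      else seen_false = true;
    }
    if (!seen_true || !seen_false) return false;
  }
  return true;
}

// Number of distinct condition combinations exercised, out of 2^3 = 8.
std::size_t combinations_covered(const std::vector<ConditionValues> &tests) {
  std::set<unsigned> seen;
  for (const ConditionValues &t : tests)
    seen.insert((t[0] ? 1u : 0u) | (t[1] ? 2u : 0u) | (t[2] ? 4u : 0u));
  return seen.size();
}
```

With just the two tests "all true" and "all false", has_condition_coverage returns true while combinations_covered reports only 2 of the 8 combinations.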

Modified condition/decision coverage

While multiple condition coverage may not be practical, Modified condition/decision coverage (MC/DC) offers a more cost-effective approach. Introduced in 1992 by DO-178B (later superseded by DO-178C), MC/DC became mandatory for Level A software in the aviation industry. Its popularity has since extended to safety-critical applications in automotive (ISO 26262) and other domains. Notably, SQLite boasts 100% MC/DC coverage (see the SQLite testing page: https://sqlite.org/testing.html#mcdc).

Consider a boolean expression like (A && B) || (C && D). This has four conditions (A, B, C, and D), each potentially a subexpression like x>0. Tests evaluate condition combinations (ABCD) and their corresponding outcomes.

Multiple flavors of MC/DC exist, with Unique-Cause MC/DC representing the strongest variant. When demonstrating the independence of A in the boolean expression (A && B) || (C && D), Unique-Cause MC/DC requires two tests with different outcomes and:

A is false in one test and true in the other.
B, C and D values remain identical.

The two tests form an independence pair for A. A coverage set comprises tests offering such independence pairs for each condition. However, achieving this set may be impossible in the presence of strongly coupled conditions.

Coupling examples:

The two conditions in x==0 && x!=0 are strongly coupled: changing one automatically changes the other.
x==0 || x==1 || x==3 exhibits weakly coupled conditions: changing x from 0 to 2 alters only the first condition, while changing it to 1 affects the first two.
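
As a small illustration of Unique-Cause independence pairs, the following C++ sketch enumerates them by brute force for (A && B) || (C && D). The bit encoding (bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D) and the function names are assumptions made for this example only.

```cpp
#include <utility>
#include <vector>

// Decision from the text. Bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D.
bool decision(unsigned v) {
  bool a = v & 1, b = v & 2, c = v & 4, d = v & 8;
  return (a && b) || (c && d);
}

// Unique-cause independence pairs for the condition at bit `cond`:
// only that condition differs between the two tests and the outcome flips.
std::vector<std::pair<unsigned, unsigned>> unique_cause_pairs(unsigned cond) {
  std::vector<std::pair<unsigned, unsigned>> pairs;
  for (unsigned v = 0; v < 16; ++v) {
    if (v & (1u << cond)) continue;  // start from tests where the condition is false
    unsigned w = v | (1u << cond);   // same test with only that condition set to true
    if (decision(v) != decision(w)) pairs.emplace_back(v, w);
  }
  return pairs;
}
```

For A (cond = 0) this yields three pairs: B must be true and C && D must be false, which leaves three choices for (C, D).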

Masking MC/DC

Masking involves setting one operand of a boolean operator to a value that renders the other operand's influence on the outcome irrelevant. Examples:

Masking the LHS of && with A && false (outcome is always false, unaffected by A).
Masking the LHS of || with A || true (outcome is always true, unaffected by A).

Due to short-circuit semantics, the RHS of && is not evaluated when the LHS is false.

Masking MC/DC demonstrates condition independence by showing the condition in question affects the outcome and keeping other conditions masked. For example, to prove the independence of A in the boolean expression (A && B) || (C && D), C and D can change values as long as C && D remains false. In this way, each condition allows more independence pairs than Unique-Cause MC/DC.

In 2001, masking MC/DC was considered an acceptable method for meeting objective 5 of Table A-7 in DO-178B.

Unique-Cause + Masking MC/DC is weaker than Unique-Cause MC/DC but stronger than Masking MC/DC, allowing masking only for strongly coupled conditions.
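
The masking rule for condition A in (A && B) || (C && D) can be sketched in C++ as follows. The "unmasked" predicate is hand-derived for this one expression (B true so that A shows through &&, and C && D false so that A && B shows through ||); it is an assumption of this example, not a general masking algorithm. Bit 0 = A, bit 1 = B, bit 2 = C, bit 3 = D.

```cpp
#include <utility>
#include <vector>

bool decision(unsigned v) {
  bool a = v & 1, b = v & 2, c = v & 4, d = v & 8;
  return (a && b) || (c && d);
}

// Hand-derived for this expression only: A is visible in the outcome
// when B is true and C && D is false.
bool a_is_unmasked(unsigned v) {
  bool b = v & 2, c = v & 4, d = v & 8;
  return b && !(c && d);
}

// Masking independence pairs for A: A differs, A is unmasked in both tests,
// and the outcome differs. C and D may change between the two tests.
std::vector<std::pair<unsigned, unsigned>> masking_pairs_for_a() {
  std::vector<std::pair<unsigned, unsigned>> pairs;
  for (unsigned f = 0; f < 16; ++f) {
    if ((f & 1) || !a_is_unmasked(f)) continue;    // A is false here
    for (unsigned t = 0; t < 16; ++t) {
      if (!(t & 1) || !a_is_unmasked(t)) continue; // A is true here
      if (decision(f) != decision(t)) pairs.emplace_back(f, t);
    }
  }
  return pairs;
}
```

This yields 9 pairs for A, compared with the 3 pairs available when B, C, and D must all stay identical.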

Minimum coverage set size

If an expression has N unique conditions, both Unique-Cause MC/DC and Unique-Cause + Masking MC/DC require a minimum of N+1 tests. It is not clear whether this is an exact bound. When N<=4, it is always possible to get Unique-Cause MC/DC with N+1 tests. Masking MC/DC requires a minimum of ceil(2*sqrt(N)) tests. See An Investigation of Three Forms of the Modified Condition Decision Coverage (MCDC) Criterion for details.
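
The two lower bounds can be compared with a few lines of C++. This is just arithmetic on the formulas quoted above; the function names are hypothetical, and ceil(2*sqrt(N)) is computed with integers as the smallest k such that k*k >= 4*N.

```cpp
// Lower bound quoted for Unique-Cause (and Unique-Cause + Masking) MC/DC.
unsigned unique_cause_min_tests(unsigned n) { return n + 1; }

// Lower bound quoted for Masking MC/DC: ceil(2*sqrt(n)),
// i.e. the smallest k with k*k >= 4*n.
unsigned masking_min_tests(unsigned n) {
  unsigned k = 0;
  while (k * k < 4 * n) ++k;
  return k;
}
```

For N = 4 the bounds are 5 and 4; for N = 9 they are 10 and 6, so the gap widens as the number of conditions grows.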

Binary decision diagram

Binary decision diagram (BDD) is a data structure that is used to represent a boolean function. Boolean expressions with && and || compile to reduced ordered BDDs.

There is another coverage metric called object branch coverage, which determines whether each branch is taken at least once and is also not taken at least once. Object branch coverage does not guarantee MC/DC, but does when the reduced ordered BDD is a tree.

(B && C) || A is a non-tree example where achieving object branch coverage requires 3 tests, which are insufficient to guarantee MC/DC. If the expression is rewritten to A || (B && C), then the reduced ordered BDD will become a tree, making object branch coverage guarantee MC/DC.

GCC

Since GCC 3.4, GCC has employed .gcno and .gcda files to store control-flow graph information and arc execution counts, respectively. This format has undergone enhancements but remains structurally consistent. .gcno files contain function records describing basic blocks, arcs between them, and lines within each basic block. Column information is only available for functions. .gcda files store arc execution counts.

gcov identifies basic blocks on a particular line (usually one) and locates successor basic blocks to infer branches. When -b is specified, gcov prints branch probabilities, though the output may be unclear since .gcno does not encode what true and false branches are.

cat > a.c <<e
int test(int a, int b) {
  if (a > 0 && b > 0)
    return 1;
  return 0;
}

int main() {
  test(0, 1);
  test(1, 0);
  test(1, 1);
}
e
gcc --coverage a.c -o a && ./a
gcov -b a.

The output

        -:    0:Source:a.c
        -:    0:Graph:a.gcno
        -:    0:Data:a.gcda
        -:    0:Runs:1
function test called 3 returned 100% blocks executed 100%
        3:    1:int test(int a, int b) {
        3:    2:  if (a > 0 && b > 0)
branch  0 taken 67% (fallthrough)
branch  1 taken 33%
branch  2 taken 50% (fallthrough)
branch  3 taken 50%
        1:    3:    return 1;
        2:    4:  return 0;
        -:    5:}
        -:    6:
function main called 1 returned 100% blocks executed 100%
        1:    7:int main() {
        1:    8:  test(0, 1);
call    0 returned 100%
        1:    9:  test(1, 0);
call    0 returned 100%
        1:   10:  test(1, 1);
call    0 returned 100%
        -:   11:}

However, there is no direct MC/DC support. I believe that people just approximate MC/DC with branch coverage. For side-effect-free expressions like (B && C) || A, there might be avenues for compiler transformation into a tree-style BDD, such as A || (B && C). However, I am not aware of such tools.

Efficient Test Coverage Measurement for MC/DC describes a code instrumentation technique to determine masking MC/DC. For a boolean expression with N conditions, each condition is assigned 2 bits:

One bit shows that the condition independently affects the outcome while evaluating to true.
The other bit shows that the condition independently affects the outcome while evaluating to false.

The instrumentation adds a few bitwise instructions that record the branches taken in conditions and applies a filter for masking effects. When both bits assigned to a condition are 1, we have found an independence pair for this condition.

Jørgen Kvalsvik posted the first MC/DC patch to gcov in March 2022 and PATCH v9 in December 2023. With this patch, we compile source files using gcc --coverage -fcondition-coverage and pass --conditions to gcov. The output should look like:

        3:   17:void fn (int a, int b, int c, int d) {
        3:   18:    if ((a && (b || c)) && d)
conditions covered 3/8
condition  0 not covered (true false)
condition  1 not covered (true)
condition  2 not covered (true)
condition  3 not covered (true)
        1:   19:        x = 1;
        -:   20:    else
        2:   21:        x = 2;
        3:   22:}

Clang

Clang offers a sophisticated approach to code coverage called Source-based Code Coverage. Unlike gcov's line-oriented method, Clang utilizes coverage mapping, a format capable of encoding:

Nested control structures
Line/column information
Macro expansion tracking (ExpansionRegion)

In January 2021, the framework was enhanced with branch coverage. This addition:

Creates a new region for && and || operators.
Tracks execution counts for individual conditions.

The primary data structure changes are the additions of the second counter (CountedRegion::FalseExecutionCount and CounterMappingRegion::FalseCount) and a new CounterMappingRegion::RegionKind named BranchRegion.

x = x > 0 && y > 0;  // 2 extra counters

if (x > 0 && y > 0)  // 2 extra counters, while 1 suffices
  x = 1;

if (true && y > 0)   // 2 extra counters, while 1 suffices
  x = 1;

When the boolean expression is used in an if statement, the then counter can be reused by the right operand of the logical operator, but this optimization has not been implemented (mentioned by D84467).

The presentation "Branch Coverage: Squeezing more out of LLVM Source-based Code Coverage, 2020" elaborates on the design.

clang -fprofile-instr-generate -fcoverage-mapping a.c -o a
./a
llvm-profdata merge -sparse default.profraw -o a.profdata
llvm-cov show a -instr-profile=a.profdata -show-branches=count

  1|      3|int test(int a, int b) {
  2|      3|  if (a > 0 && b > 0)
------------------
|  Branch (2:7): [True: 2, False: 1]
|  Branch (2:16): [True: 1, False: 1]
------------------
  3|      1|    return 1;
  4|      2|  return 0;
  5|      3|}
  6|       |
  7|      1|int main() {
  8|      1|  test(0, 1);
  9|      1|  test(1, 0);
 10|      1|  test(1, 1);
 11|      1|}
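
The True/False numbers in the report above can be reproduced with a hand-written model of per-condition counters. This is a sketch of the idea only, not the code Clang emits; the struct and member names are invented for the example.

```cpp
// Hypothetical model of a branch region: one counter per outcome.
struct BranchCounter {
  unsigned true_count = 0;
  unsigned false_count = 0;
};

struct TestCounters {
  BranchCounter lhs; // a > 0
  BranchCounter rhs; // b > 0
};

// Same logic as test() in a.c, with manual counting of each condition.
int test(int a, int b, TestCounters &counters) {
  bool lhs = a > 0;
  if (lhs) ++counters.lhs.true_count; else ++counters.lhs.false_count;
  if (!lhs) return 0; // short-circuit: b > 0 is never evaluated
  bool rhs = b > 0;
  if (rhs) ++counters.rhs.true_count; else ++counters.rhs.false_count;
  return rhs ? 1 : 0;
}
```

Calling test(0, 1), test(1, 0), and test(1, 1) leaves lhs at True 2 / False 1 and rhs at True 1 / False 1, matching the two Branch lines in the report.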

A Rust feature request was since then filed: https://github.com/rust-lang/rust/issues/79649

In January 2024, Clang's Source-based Code Coverage got masking MC/DC capability.

-fcoverage-mcdc tells Clang to instrument && and || expressions to record the condition combinations and outcomes, and store the reduced ordered BDDs into the coverage mapping section. The bitmap is stored in the __llvm_prf_bits section in a raw profile.
llvm-profdata merge merges bitmaps from multiple raw profiles and stores the merged bitmap into an indexed profile.
When passing --show-mcdc to llvm-cov show, llvm-cov reads a profdata file, retrieves the bitmap, computes independence pairs, and prints the information.

Clang adopts a distinct approach to Masking MC/DC compared to the paper "Efficient Test Coverage Measurement for MC/DC". Instead of complex masking value computation, it uses a "boring algorithm":

Encodes boolean expressions with N conditions as integers within [0, 2**N).
When the expression result is determined, sets a bit in a bitmap indexed by the integer.
Limits condition count to 6 for space optimization.

For example,

if (a && b || c)
  return 1;

Let's say in one execution path a=c=1 and b=0. The condition combination (0b101) leads to an index of 5. The instrumentation locates the relevant word in the bitmap and sets bit 5.

The approach is described in detail in "MC/DC: Enabling easy-to-use safety-critical code coverage analysis with LLVM" in 2022 LLVM Developers' Meeting.
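
The indexing scheme can be modeled in a few lines of C++. This is a simplified sketch, not Clang's generated code: the function name and the single 64-bit word standing in for the bitmap are assumptions, and condition i is assumed to map to bit i of the index (a = bit 0, b = bit 1, c = bit 2), which is consistent with the 0b101 example.

```cpp
#include <cstdint>

// Evaluates a && b || c while building the condition-combination index,
// then sets the corresponding bit in `bitmap` (a stand-in for the real bitmap).
bool record_decision(bool a, bool b, bool c, std::uint64_t &bitmap) {
  unsigned index = 0;
  // Each evaluated condition that is true contributes one bit to the index.
  auto cond = [&index](unsigned id, bool value) {
    if (value) index |= 1u << id;
    return value;
  };
  // Short-circuit evaluation decides which conditions are evaluated at all.
  bool result = (cond(0, a) && cond(1, b)) || cond(2, c);
  bitmap |= std::uint64_t{1} << index;
  return result;
}
```

With a=c=1 and b=0 the index is 5, so bit 5 is set. A limit of 6 conditions means at most 2**6 = 64 indices, which in this model fits in one 64-bit word.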

Pros:

Easier to understand
Each condition instrumentation adds just one single bitwise OR instruction, instead of possibly three (one bitwise AND plus two bitwise OR).

Cons:

More bits to encode N conditions (2**N vs. 2*N)
More metadata to encode the reduced ordered BDDs, required by the reader to compute independence pairs
Determining independence pairs involves a brute-force algorithm in llvm-cov, which has a high time complexity but is probably acceptable due to the limited condition count.

References

An Investigation of Three Forms of the Modified Condition Decision Coverage (MCDC) Criterion, John Joseph Chilenski, 2001
Formalization and Comparison of MCDC and Object Branch Coverage Criteria, 2012
Efficient Test Coverage Measurement for MC/DC, 2013
Branch Coverage: Squeezing more out of LLVM Source-based Code Coverage, 2020
MC/DC: Enabling easy-to-use safety-critical code coverage analysis with LLVM, 2022

RISC-V TLSDESC works!

2024-01-23T08:00:00.000Z

Back in 2019, I studied a bit about RISC-V and filed Support Thread-Local Storage Descriptors (TLSDESC). Last year, Tatsuyuki Ishi added a specification for TLSDESC.

LLVM

On the LLVM side, the RISC-V TLSDESC work has been completed.

The most important patch is [RISCV] Support Global Dynamic TLSDESC in the RISC-V backend by Paul Kirth. The linker patch by me is also significant. Furthermore, Clang requires a -mtls-dialect= patch.

These patches are expected to be included in the upcoming LLVM 18.1 release. To obtain TLSDESC code sequences, compile your program with clang --target=riscv64-linux -fpic -mtls-dialect=desc.

GCC

Latest patch: https://inbox.sourceware.org/gcc-patches/20231205070152.38360-1-ishitatsuyuki@gmail.com/

binutils

Latest patch: https://inbox.sourceware.org/binutils/20231128085109.28422-1-ishitatsuyuki@gmail.com/

glibc

Latest patch: https://inbox.sourceware.org/libc-alpha/20230914084033.222120-1-ishitatsuyuki@gmail.com/

musl

musl added support in February 2024.

Bionic

No patch yet.

Testing

The LLVM patches need testing. Unfortunately, I didn't have a RISC-V image at hand, so I used qemu-user.

Patch musl per Re: Draft riscv64 TLSDESC implementation

diff --git c/arch/riscv64/reloc.h w/arch/riscv64/reloc.h
index 1ca13811..7c7c0611 100644
--- c/arch/riscv64/reloc.h
+++ w/arch/riscv64/reloc.h
@@ -17,6 +17,7 @@
 #define REL_DTPMOD      R_RISCV_TLS_DTPMOD64
 #define REL_DTPOFF      R_RISCV_TLS_DTPREL64
 #define REL_TPOFF       R_RISCV_TLS_TPREL64
+#define REL_TLSDESC     R_RISCV_TLSDESC

 #define CRTJMP(pc,sp) __asm__ __volatile__( \
        "mv sp, %1 ; jr %0" : : "r"(pc), "r"(sp) : "memory" )
diff --git c/include/elf.h w/include/elf.h
index 72d17c3a..7f342a23 100644
--- c/include/elf.h
+++ w/include/elf.h
@@ -3254,6 +3254,7 @@ enum
 #define R_RISCV_TLS_DTPREL64    9
 #define R_RISCV_TLS_TPREL32     10
 #define R_RISCV_TLS_TPREL64     11
+#define R_RISCV_TLSDESC         12

 #define R_RISCV_BRANCH          16
 #define R_RISCV_JAL             17
diff --git c/src/ldso/riscv64/tlsdesc.s w/src/ldso/riscv64/tlsdesc.s
new file mode 100644
index 00000000..56d1ce89
--- /dev/null
+++ w/src/ldso/riscv64/tlsdesc.s
@@ -0,0 +1,33 @@
+.text
+.global __tlsdesc_static
+.hidden __tlsdesc_static
+.type __tlsdesc_static,%function
+__tlsdesc_static:
+       ld a0,8(a0)
+       jr t0
+
+.global __tlsdesc_dynamic
+.hidden __tlsdesc_dynamic
+.type __tlsdesc_dynamic,%function
+__tlsdesc_dynamic:
+       add sp,sp,-16
+       sd t1,(sp)
+       sd t2,8(sp)
+
+       ld t2,-8(tp) # t2=dtv
+
+       ld a0,8(a0)  # a0=&{modidx,off}
+       ld t1,8(a0)  # t1=off
+       ld a0,(a0)   # a0=modidx
+       sll a0,a0,3  # a0=8*modidx
+
+       add a0,a0,t2 # a0=dtv+8*modidx
+       ld a0,(a0)   # a0=dtv[modidx]
+       add a0,a0,t1 # a0=dtv[modidx]+off
+       sub a0,a0,tp # a0=dtv[modidx]+off-tp
+
+       ld t1,(sp)
+       ld t2,8(sp)
+       add sp,sp,16
+       jr t0
+

(mkdir -p out/rv64 && cd out/rv64 && ../../configure --target=riscv64-linux-gnu && make -j 50)

Adjust ~/musl/out/rv64/lib/musl-gcc.specs and update ~/musl/out/rv64/obj/musl-gcc

cat > ~/musl/out/rv64/obj/musl-gcc <<'eof'
#!/bin/sh
exec "${REALGCC:-riscv64-linux-gnu-gcc}" "$@" -specs ~/musl/out/rv64/lib/musl-gcc.specs
eof

I have also modified musl-clang (clang wrapper). Adjust ~/musl/out/rv64/obj/musl-clang to use --target=riscv64-linux-musl. Adjust ~/musl/out/rv64/obj/ld.musl-clang to define cc="/tmp/Rel/bin/clang --target=riscv64-linux-gnu" and invoke exec /tmp/Rel/bin/ld.lld "$@" -lc.

Prepare a runtime test mentioned at the end of https://maskray.me/blog/2021-02-14-all-about-thread-local-storage

cat > ./a.c <<eof
#include <assert.h>
int foo();
int bar();
int main() {
  assert(foo() == 2);
  assert(foo() == 4);
  assert(bar() == 2);
  assert(bar() == 4);
}
eof

cat > ./b.c <<eof
__thread int tls0;
extern __thread int tls1;
int foo() { return ++tls0 + ++tls1; }
static __thread int tls2, tls3;
int bar() { return ++tls2 + ++tls3; }
eof

echo '__thread int tls1;' > ./c.c

sed 's/        /\t/' > ./Makefile <<'eof'
.MAKE.MODE = meta curDirOk=true

CC := ~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w
LDFLAGS := -Wl,-rpath=.

all: a0 a1 a2

run: all
        ./a0 && ./a1 && ./a2

c.so: c.o; ${LINK.c} -shared $> -o $@
bc.so: b.o c.o; ${LINK.c} -shared $> -o $@
b.so: b.o c.so; ${LINK.c} -shared $> -o $@

a0: a.o b.o c.o; ${LINK.c} $> -o $@
a1: a.o b.so; ${LINK.c} $> -o $@
a2: a.o bc.so; ${LINK.c} $> -o $@
eof

bmake run => succeeded!

% bmake run
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -c a.c
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -c b.c
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -c c.c
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. a.o b.o c.o -o a0
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. -shared c.o -o c.so
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. -shared b.o c.so -o b.so
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. a.o b.so -o a1
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. -shared b.o c.o -o bc.so
~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w -g  -Wl,-rpath=. a.o bc.so -o a2
./a0 && ./a1 && ./a2

Test GCC

During my development of the linker patch, the Clang Driver patch was actually not ready yet. I used a more hacky approach by compiling using GCC, replacing some assembly fragments with TLSDESC code sequences, and assembling using Clang.

Compile b.c to bb.s. Replace general-dynamic code sequences (e.g. la.tls.gd a0,tls0; call __tls_get_addr@plt) with TLSDESC, e.g.

.Ltlsdesc_hi0:
  auipc a0, %tlsdesc_hi(tls0)
  ld  a1, %tlsdesc_load_lo(.Ltlsdesc_hi0)(a0)
  addi  a0, a0, %tlsdesc_add_lo(.Ltlsdesc_hi0)
  jalr  t0, 0(a1), %tlsdesc_call(.Ltlsdesc_hi0)
  add   a0, a0, tp
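
The arithmetic behind this sequence and the __tlsdesc_dynamic resolver from the musl patch can be modeled in C++. This is a rough sketch under stated assumptions: TlsIndex, the function names, and the flat dtv array are hypothetical stand-ins, and real resolvers operate on registers rather than function arguments.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of the {modidx, off} pair the descriptor points to.
struct TlsIndex {
  std::size_t modidx;
  std::size_t off;
};

// Mirrors the arithmetic in __tlsdesc_dynamic: dtv[modidx] + off - tp.
std::intptr_t tlsdesc_dynamic_offset(const std::uintptr_t *dtv,
                                     const TlsIndex &ti, std::uintptr_t tp) {
  return static_cast<std::intptr_t>(dtv[ti.modidx] + ti.off - tp);
}

// Mirrors the caller's final `add a0, a0, tp`.
std::uintptr_t tls_address(std::intptr_t tp_offset, std::uintptr_t tp) {
  return tp + static_cast<std::uintptr_t>(tp_offset);
}
```

Because the resolver returns a TP-relative offset, the static resolver can hand back a precomputed constant and the caller's code sequence stays the same in both cases.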

Create an alias bin/ld.lld to be used with -Bbin -fuse-ld=lld. I made some adjustments to the Makefile so that an invocation looks like:

% bmake run
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -c a.c
/tmp/Rel/bin/clang --target=riscv64-linux -c bb.s -o b.o
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -c c.c
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. a.o b.o c.o -o a0
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. -shared c.o -o c.so
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. -shared b.o c.so -o b.so
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. a.o b.so -o a1
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. -shared b.o c.o -o bc.so
~/musl/out/rv64/obj/musl-gcc -O1 -g -fpic -Bbin -fuse-ld=lld -g  -Wl,-rpath=. a.o bc.so -o a2
./a0 && ./a1 && ./a2

Exploring object file formats

2024-01-14T08:00:00.000Z

My journey with the LLVM project began with a deep dive into the world of lld and binary utilities. Countless hours were spent unraveling the intricacies of object file formats and shaping LLVM's relevant components. Though my interests have since broadened, object file formats remain a personal fascination, often drawing me into discussions around potential changes within LLVM.

This article compares several prominent object file formats, drawing upon my experience and insights.

At the heart of each format lies the representation of essential components like symbols, sections, and relocations. For each control structure, we'll begin with ELF, a widely used format, before venturing into the landscapes of other notable formats.

History of object file formats

Before delving into the technical side, I will share some notes about my archaeological journey.

a.out

The a.out format was designed for PDP-11 and appeared on the first version of Unix. The quantities were 16-bit, but can be naturally extended to 32-bit or 64-bit.

In Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism by James Q. Arnold provided some description.

For 32-bit machines, the a.out format was extended in several ways. Most obviously, 16-bit quantities were enlarged to 32-bit values. The symbol table changed to allow names of unlimited length. Relocation entries also changed significantly. Larger programs and different relocation conventions made it necessary to associate a relocation entry with an explicit address, instead of relying on the implicit correspondence between program sections and relocation records.

Many Unix and Unix-like operating systems, including SunOS, HP-UX, BSD, and Linux, used a.out before switching to ELF.

The most noticeable extension is dynamic shared library support. (This feature is distinct from static shared library, where each shared library needs a fixed address in the address space.) There are two flavors:

In 1988, SunOS 4.0 was released with an extended a.out binary format with dynamic shared library support.
In 1993, on NetBSD, https://github.com/NetBSD/src/commit/97ca10e37476fb84a20a8ec4b0be3188db703670 (A linker supporting shared libraries.) and https://github.com/NetBSD/src/commit/3d68d0acaed0a32f929b2c174146c62940005a18 (A linker supporting shared libraries (run-time part).) added shared library support similar to the SunOS scheme.

FreeBSD a.out(5) provides a nice description.

If you follow recent years' Linux kernel news, there were some discussions when Linux eventually removed a.out support in 2022.

COFF

a.out supports three fixed loadable sections TEXT, DATA, and BSS, which is too restrictive. COFF introduces custom section support and allows up to 32767 sections. The ELF paper contains some remarks:

Common Object File Format (COFF), was designed primarily to support electronic switching systems (the telephone network). Its distinguishing features were multiple sections (text, data, uninitialized memory, reserved memory, overlays, etc.), some support for multiple target processors, defined structures for symbol tables and relocations, and debugging information tailored for the C language.

According to scnhdr.h in System V Release 2 for NS32xxx, COFF was designed no later than 1982. Then, System V Release 3 adopted COFF, which motivated a lot of follow-ups.

Windows extended COFF to the Portable Executable (PE) format. Symbian OS before 9 used PE as well.
Texas Instruments modified COFF for its TI toolset and then switched to ELF.
ECOFF used by Tru64 UNIX changed symbol representation.
IBM developed XCOFF (COFF combined with the TOC module format concept, CSECT, etc) and used it for AIX.

Key drawbacks:

Hard-wiring debugging information tailored for the C language into the symbol structure is complex, space-inefficient, and ugly.
The auxiliary symbol record design is inflexible and inefficient.
Symbol and section structures that are not 32-bit aligned caused performance issues on earlier systems.
No support for weak symbols. PE implemented an inflexible weak definition mechanism called "weak externals".

Mach-O

Carnegie Mellon University developed the Mach kernel as a proof of the microkernel concept. The operating system used a format derived from a.out with minor modifications, named the Mach object file format. The abbreviation, Mach-O, is often used instead. The NeXTSTEP operating system and then Darwin adopted Mach-O.

Dynamic shared library support on Mach-O came later than other object file formats. In a NeXTSTEP manual released in 1995, I can find MH_FVMLIB (fixed virtual memory library, which appears to be a static shared library scheme) but not MH_DYLIB (used by modern macOS for .dylib files).

Key drawbacks:

No COMDAT support.
Scarcity of relocation types.
255 section limit. .subsections_via_symbols has some downsides (discussed later).

In my opinion, Mach-O is the most limited among Mach-O/PE/ELF. However, I want to acknowledge certain innovative features like .subsections_via_symbols and S_ATTR_LIVE_SUPPORT.

ELF

Driven by frustrations with the inherent constraints of COFF, coupled with a self-imposed byte order dilemma, AT&T introduced a groundbreaking format: Executable and Linking Format (ELF). ELF revisited fixed content and hard-wired concepts in previous object file formats, removed unnecessary elements, and made control structures more flexible.

This pivotal shift was embraced by System V Release 4, marking a new era in object file format design. In the 1990s, many Unix and Unix-like operating systems, including Solaris, IRIX, HP-UX, Linux, and FreeBSD, switched to ELF.

Symbols

At a minimum, a symbol control structure needs to encode the name, section, and value. We can require that every symbol is defined in relation to some section. We can use a section index of zero to represent an undefined symbol.

In a minimal object file format with only a few hard-coded sections (a.out), the section field can be omitted. A type field can be used to decide whether the symbol can reference a function or a data object.

// ELFCLASS32, 16 bytes
typedef struct {
  Elf32_Word st_name;
  Elf32_Addr st_value;
  Elf32_Word st_size;
  unsigned char st_info;
  unsigned char st_other;
  Elf32_Half st_shndx;
} Elf32_Sym;

// ELFCLASS64, 24 bytes
typedef struct {
  Elf64_Word st_name;     // index into the string table
  unsigned char st_info;  // type and binding
  unsigned char st_other; // visibility and others
  Elf64_Half st_shndx;    // section index
  Elf64_Addr st_value;
  Elf64_Xword st_size;
} Elf64_Sym;

The symbol name is represented as a 32-bit index into the string table. A 32-bit integer suffices, while a 16-bit integer would be too small.

st_shndx uses a size-saving trick. The 16-bit member encodes a section index. If the member is SHN_XINDEX (0xffff), then the actual value is contained in the associated section of type SHT_SYMTAB_SHNDX. This is a very nice trick because the number of sections is almost always smaller than 0xff00. In pathological cases, there can be more sections, where a section of type SHT_SYMTAB_SHNDX is needed.

st_info specifies the symbol's type (4 bits) and binding (4 bits) attributes. Types are allocated very conservatively and usually imply different linker behaviors. The inherently different linker behaviors for symbol types are not that many. So while 4 bits seem small, they are sufficient in practice. As we will learn, this is significantly smaller than COFF's type and storage class representation. A symbol's binding is for the local/weak/global distinction. The reserved 4 bits can accommodate more values, but only GNU reserves one value (STB_GNU_UNIQUE) (a misfeature in my opinion).

In COFF, function symbols can use an auxiliary symbol record to encode the size of function (x_fsize; TotalSize in PE). In ELF, st_size is a fixed member, used for copy relocations and symbolizers. If we eliminate copy relocations and don't need the symbolization heuristics, this field will become garbage.
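To make the escape mechanism concrete, here is a minimal sketch of how a consumer might resolve the real section index. The function name resolve_section_index and the shndx_table parameter are made up for illustration; only SHN_XINDEX and the one-word-per-symbol layout of SHT_SYMTAB_SHNDX come from the generic ABI. Other reserved values (such as SHN_ABS) are simply passed through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHN_XINDEX 0xffffu /* escape value defined by the generic ABI */

/* Return the real section index of symbol number `symndx`.
 * `st_shndx` is the 16-bit member of the symbol table entry.
 * `shndx_table` is the content of the associated SHT_SYMTAB_SHNDX section
 * (one 32-bit word per symbol), or NULL if the file has no such section.
 * Both parameter names are hypothetical. */
static uint32_t resolve_section_index(uint16_t st_shndx, size_t symndx,
                                      const uint32_t *shndx_table) {
  if (st_shndx != SHN_XINDEX)
    return st_shndx;           /* common case: the index fits in 16 bits */
  assert(shndx_table != NULL); /* the escape value requires the side table */
  return shndx_table[symndx];
}
```

In the common case the side table is never touched, which is why the trick costs nothing for ordinary object files.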
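The 4+4 bit split can be sketched as follows. The helpers mirror the ELF32_ST_BIND/ELF32_ST_TYPE/ELF32_ST_INFO macros of the generic ABI, but the function names here are my own, and only a handful of the defined constants are listed.

```c
#include <assert.h>
#include <stdint.h>

/* A few binding and type values; the lists are deliberately incomplete. */
enum { STB_LOCAL = 0, STB_GLOBAL = 1, STB_WEAK = 2, STB_GNU_UNIQUE = 10 };
enum { STT_NOTYPE = 0, STT_OBJECT = 1, STT_FUNC = 2 };

/* The binding lives in the high 4 bits, the type in the low 4 bits. */
static inline unsigned st_bind(uint8_t info) { return info >> 4; }
static inline unsigned st_type(uint8_t info) { return info & 0xf; }

static inline uint8_t st_info(unsigned bind, unsigned type) {
  assert(bind < 16 && type < 16); /* each sub-field has only 4 bits */
  return (uint8_t)((bind << 4) | type);
}
```

For example, a global function symbol has st_info(STB_GLOBAL, STT_FUNC), i.e. 0x12.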

Here is a demonstration if we remove st_size.

// 16 bytes
struct Elf64_Sym_minimized {
  Elf64_Word st_name;     // index into the string table
  unsigned char st_info;  // type and binding
  unsigned char st_other; // visibility and others
  Elf64_Half st_shndx;    // section index
  Elf64_Addr st_value;
};

Symbols (a.out)

// a.out (System V), 16 bytes
struct nlist {
  char n_name[8];
#if pdp11
  int n_type;
#else
  char n_type;
  char n_other;
  short n_desc;
#endif
  unsigned n_value;
};

a.out uses a nlist to represent a symbol table entry. In the original format for PDP-11, the assembler generates symbols of at most 7 bytes. n_name[8] can hold the name with a NUL end. Unix's appreciation of shorter identifier names is related to this :)

To support longer names, extensions add a string table after the symbol table, and allow n_name to be interpreted as an index (n_strx) into the string table. This member then becomes a size-saving trick by inlining a short name (8 bytes or less) into the structure. Some variants, like binutils' 64-bit a.out format, use an index exclusively and removed n_name.

n_type, broken down into three sub-fields, describes whether a symbol is defined or undefined, external or local, and the symbol type. The values listed on the FreeBSD manpage are also used on PDP-11.

For a defined symbol, n_type describes whether it is relative to TEXT, DATA, or BSS.
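As a rough sketch, the three sub-fields can be extracted with masks. The constant values below are the ones I recall from the FreeBSD a.out(5)/nlist.h definitions; the helper function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masks for the three sub-fields of n_type. */
enum { N_EXT = 0x01, N_TYPE = 0x1e, N_STAB = 0xe0 };
/* Values of the N_TYPE sub-field. */
enum { N_UNDF = 0x00, N_ABS = 0x02, N_TEXT = 0x04, N_DATA = 0x06, N_BSS = 0x08 };

/* External (global) symbol? */
static bool nlist_is_external(uint8_t n_type) { return (n_type & N_EXT) != 0; }
/* Symbolic debugger (stab) entry? */
static bool nlist_is_debug(uint8_t n_type) { return (n_type & N_STAB) != 0; }
/* N_UNDF for an undefined symbol, otherwise N_ABS/N_TEXT/N_DATA/N_BSS. */
static unsigned nlist_kind(uint8_t n_type) { return n_type & N_TYPE; }
```

An external function defined in TEXT would have n_type == (N_TEXT | N_EXT), so no separate section member is needed.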

Symbols (COFF)

// COFF (System V Release 3), 18 bytes in the absence of padding
struct syment {
  union {
    char _n_name[SYMNMLEN]; /* old COFF version */
    struct {
      long _n_zeroes; /* new == 0 */
      long _n_offset; /* offset into string table */
    } _n_n;
  } _n;
  unsigned long n_value; /* value of symbol */
  short n_scnum; /* section number */
  unsigned short n_type; /* type and derived type */
  char n_sclass; /* storage class */
  char n_numaux; /* number of aux. entries */
};

COFF adopts a.out's approach to save space in symbol names. This likely made sense when most symbols were shorter. However, with today's often lengthy symbol names, this inlining technique complicates code and increases the control structure size (from 4 to 8 bytes).

The section number is a 16-bit signed integer, supporting up to 32,767 sections. Positive values indicate a section index, while special values include:

N_UNDEF (0): Undefined symbol (distinct from a.out's n_type representation).
N_ABS (-1): Symbol has an absolute value.
N_DEBUG (-2): Special debugging symbol (value is meaningless).

COFF's n_type and n_sclass encode C's type and storage class information. PE assigns longer names to these types and storage classes, e.g., IMAGE_SYM_TYPE_CHAR/IMAGE_SYM_TYPE_SHORT, IMAGE_SYM_CLASS_AUTOMATIC/IMAGE_SYM_CLASS_EXTERNAL. While values are mostly consistent, minor differences exist:

PE's IMAGE_SYM_TYPE_VOID (1) is different from System V Release 3's #define T_ARG 1 /* function argument (only used by compiler) */.
PE's IMAGE_SYM_CLASS_WEAK_EXTERNAL (105) is different from System V Release 3's #define C_ALIAS 105 /* duplicate tag */.

Symbols with C_EXT (IMAGE_SYM_CLASS_EXTERNAL) are global and added to the linker's global symbol table, akin to ELF's STB_GLOBAL symbol binding.

System V ships a symbolic debugger (sdb), which utilizes n_type and n_sclass. If we acknowledge that the debugging information format is outdated, n_type and n_sclass serve as a wasteful counterpart to ELF's st_info.

n_numaux relates to Auxiliary Symbol Records, allowing extra information but introducing non-uniform symbol table entries. While seemingly beneficial, their use cases are limited and could often be encoded using separate sections. In PE, an auxiliary symbol record can represent weak definitions, but weak references are not supported. They can also provide extra information to section symbols.

ECOFF defines Local Symbol Entry (SYMR) and External Symbol Entry (EXTR).
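The extra code needed by the inlining technique looks roughly like this. It is only a sketch: the union and function names are hypothetical, fixed-width types replace long, and strtab is assumed to point at the start of the string table so that the stored offset can be added to it directly.

```c
#include <stdint.h>
#include <string.h>

#define SYMNMLEN 8

/* Simplified stand-in for the _n union of struct syment. */
union coff_name {
  char inline_name[SYMNMLEN]; /* short name, NUL-padded, maybe unterminated */
  struct {
    uint32_t zeroes; /* 0 means the name lives in the string table */
    uint32_t offset; /* byte offset into the string table */
  } ref;
};

/* Return a NUL-terminated name. Inline names are copied into `buf` because an
 * 8-character name has no terminator inside the structure. */
static const char *coff_symbol_name(const union coff_name *n,
                                    const char *strtab,
                                    char buf[SYMNMLEN + 1]) {
  if (n->ref.zeroes == 0)
    return strtab + n->ref.offset;
  memcpy(buf, n->inline_name, SYMNMLEN);
  buf[SYMNMLEN] = '\0';
  return buf;
}
```

Every consumer has to handle both branches, whereas ELF's plain 4-byte st_name needs a single addition.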

typedef struct {
  coff_long value;
  coff_int iss;
  coff_uint st : 6;
  coff_uint sc : 5;
  coff_uint reserved : 1;
  coff_uint index : 20;
} SYMR, *pSYMR;

typedef struct {
  SYMR asym;
  coff_uint jmptbl:1;
  coff_uint cobol_main:1;
  coff_uint weakext:1;
  coff_uint reserved:29;
  coff_int ifd;
} EXTR, *pEXTR;

Symbols (Mach-O)

// Mach-O, 12 bytes
struct nlist {
  uint32_t n_strx;
  uint8_t n_type;
  uint8_t n_sect;
  int16_t n_desc;
  uint32_t n_value;
};

// Mach-O, 16 bytes
struct nlist_64 {
  uint32_t n_strx;
  uint8_t n_type;
  uint8_t n_sect;
  uint16_t n_desc;
  uint64_t n_value;
};

Mach-O's nlist and nlist_64 are not that different from a.out's, with n_other changed to n_sect to indicate the section index. The 8-bit n_sect field restricts representable sections to 255 without out-of-band data (discussed later). If we extend n_sect to 32-bit, with alignment padding the structure size will increase to 24 bytes, the same as Elf64_Sym.

Like a.out, the N_EXT bit of n_type indicates an external symbol. The N_PEXT bit indicates a private external symbol.

Key bits in n_desc are N_WEAK_DEF, N_WEAK_REF, and N_ALT_ENTRY.

Sections

// ELF, 40 bytes
typedef struct {
  Elf32_Word sh_name;
  Elf32_Word sh_type;
  Elf32_Word sh_flags;
  Elf32_Addr sh_addr;
  Elf32_Off sh_offset;
  Elf32_Word sh_size;
  Elf32_Word sh_link;
  Elf32_Word sh_info;
  Elf32_Word sh_addralign;
  Elf32_Word sh_entsize;
} Elf32_Shdr;

// ELF, 64 bytes
typedef struct {
  Elf64_Word sh_name;
  Elf64_Word sh_type;
  Elf64_Xword sh_flags;
  Elf64_Addr sh_addr;
  Elf64_Off sh_offset;
  Elf64_Xword sh_size;
  Elf64_Word sh_link;
  Elf64_Word sh_info;
  Elf64_Xword sh_addralign;
  Elf64_Xword sh_entsize;
} Elf64_Shdr;

The section name is represented as a 32-bit index into the string table. If we use a 16-bit integer, a large number of section names with a symbol suffix (e.g. .text.foo .text.bar) could make the index overflow.

sh_type categorizes the section's contents and semantics. It avoids hard-coding magic names in many scenarios. Technically a 16-bit type could work pretty well but was deemed insufficient for flexibility.

sh_flags describe miscellaneous attributes, e.g. writable and executable permissions, and whether the section should appear in a loadable segment. This member is 32-bit in Elf32_Shdr while 64-bit in Elf64_Shdr. In practice no architecture defines flags for bits 32 to 63, therefore this member is somewhat wasteful.

Location and size. sh_offset gives the byte offset from the beginning of the file to the first byte in the section. To support object files larger than 4GiB, this member has to be 64-bit. sh_size gives the section's size in bytes. A section type of SHT_NOBITS occupies no space in the file. To support sections larger than 4GiB, this member has to be 64-bit.

Address and alignment. sh_addr describes the address at which the section's first byte should reside for an executable or shared object. It should be zero for relocatable files. sh_addralign holds the address alignment. In practice this member must be a power of 2 even if the generic ABI does not require so. This member is 64-bit in ELF64, which allows an alignment up to 2**63. In practice, an alignment larger than the page size (or the largest huge page size, if huge pages are enabled) does not make sense, and a maximum value of 2**31 is sufficient. Therefore, we could use a log2 value to hold the alignment.

Connection information. sh_link holds a section index. sh_info holds either a section index or a symbol index. If you recall that st_shndx is 16 bits for a very solid reason, you will know that the two fields are somewhat wasteful.

For a table of fixed-size entries, sh_entsize holds the entry size in bytes. In some use cases this member is not a power of two. In practice, one byte suffices.

While ELF's section header structure is designed for flexibility, potential optimizations could reduce its size without significant loss of functionality. By using smaller data types for sh_flags, sh_link, sh_info, and sh_entsize based on practical needs, we could make the structure significantly smaller.
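The log2 idea can be sketched with two small helpers. The function names are hypothetical, and I assume the alignment is zero or a power of two (the generic ABI treats 0 and 1 alike as "no alignment constraint").

```c
#include <assert.h>
#include <stdint.h>

/* Encode a power-of-two alignment as its base-2 logarithm, which fits in one
 * byte. 0 and 1 both mean "no constraint" and are encoded as 0. */
static uint8_t encode_align(uint64_t align) {
  if (align == 0)
    return 0;
  assert((align & (align - 1)) == 0); /* must be a power of two */
  uint8_t shift = 0;
  while (align > 1) {
    align >>= 1;
    ++shift;
  }
  return shift;
}

/* Recover the alignment in bytes from its one-byte encoding. */
static uint64_t decode_align(uint8_t shift) {
  assert(shift < 64);
  return (uint64_t)1 << shift;
}
```

With this encoding a page alignment of 4096 is stored as 12, and even 2**63 would still fit in the single byte.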

// 32 bytes
struct Elf32_Shdr_minimized {
  Elf32_Word sh_name;
  Elf32_Word sh_type;    // Making this uint16_t and reordering it can decrease the size to 28 bytes
  Elf32_Word sh_flags;
  Elf32_Addr sh_addr;
  Elf32_Off sh_offset;
  Elf32_Word sh_size;
  uint8_t sh_addralign;
  uint8_t sh_entsize;
  Elf32_Half sh_link;
  Elf32_Half sh_info;
};

// 40 bytes
struct Elf64_Shdr_minimized {
  Elf64_Word sh_name;
  Elf64_Word sh_flags;
  Elf64_Addr sh_addr;
  Elf64_Off sh_offset;
  Elf64_Xword sh_size;
  Elf64_Half sh_type;
  uint8_t sh_addralign;
  uint8_t sh_entsize;
  Elf64_Half sh_link;
  Elf64_Half sh_info;
};

Reducing sh_type to 2 bytes loses a bit of flexibility. If this is deemed insufficient, we could take 3 bits from sh_addralign (by turning it into a bitfield) and give them to sh_type.
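The stated sizes can be checked at compile time. The sketch below restates the minimized layouts with plain fixed-width integer types instead of the Elf32_*/Elf64_* typedefs (so it does not depend on <elf.h>); the struct names are made up, and the sizes assume a typical ABI where each integer is aligned to at most its own size.

```c
#include <assert.h>
#include <stdint.h>

struct shdr32_min { /* mirrors Elf32_Shdr_minimized */
  uint32_t sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size;
  uint8_t sh_addralign, sh_entsize;
  uint16_t sh_link, sh_info;
};

struct shdr32_min16 { /* same, with a 16-bit sh_type moved next to the small members */
  uint32_t sh_name, sh_flags, sh_addr, sh_offset, sh_size;
  uint16_t sh_type;
  uint8_t sh_addralign, sh_entsize;
  uint16_t sh_link, sh_info;
};

struct shdr64_min { /* mirrors Elf64_Shdr_minimized */
  uint32_t sh_name, sh_flags;
  uint64_t sh_addr, sh_offset, sh_size;
  uint16_t sh_type;
  uint8_t sh_addralign, sh_entsize;
  uint16_t sh_link, sh_info;
};

static_assert(sizeof(struct shdr32_min) == 32, "30 bytes of members + 2 padding");
static_assert(sizeof(struct shdr32_min16) == 28, "no padding needed");
static_assert(sizeof(struct shdr64_min) == 40, "no padding needed");
```

The 32-bit variant with a full-width sh_type wastes 2 bytes on tail padding, which is exactly what the reordering with a 16-bit sh_type reclaims.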

Sections (COFF)

// COFF (System V Release 3), 40 bytes, when sizeof(long) == 4
struct scnhdr {
  char            s_name[8];      /* section name */
  long            s_paddr;        /* physical address */
  long            s_vaddr;        /* virtual address */
  long            s_size;         /* section size */
  long            s_scnptr;       /* file ptr to raw data for section */
  long            s_relptr;       /* file ptr to relocation */
  long            s_lnnoptr;      /* file ptr to line numbers */
  unsigned short  s_nreloc;       /* number of relocation entries */
  unsigned short  s_nlnno;        /* number of line number entries */
  long            s_flags;        /* flags */
};

// PE, 40 bytes
struct section {
  char Name[8];
  uint32_t VirtualSize;
  uint32_t VirtualAddress;
  uint32_t SizeOfRawData;
  uint32_t PointerToRawData;
  uint32_t PointerToRelocations;
  uint32_t PointerToLineNumbers;
  uint16_t NumberOfRelocations;
  uint16_t NumberOfLineNumbers;
  uint32_t Characteristics;
};

PE's section control structure demonstrates a minor modification compared to COFF, s_paddr => VirtualSize.

The presented structure measures as 40 bytes when long is 4 bytes. If we extend s_paddr, s_vaddr, s_size, s_scnptr, s_relptr, s_lnnoptr to 8 bytes, the structure will be 64 bytes.

The section name supports up to 8 bytes. A longer name would require an extension similar to the symbol control structure.

Encoding both s_paddr and s_vaddr is wasteful. ELF encodes the physical address in the segment and therefore removes the member from its section structure.

COFF embeds the location and size of relocations into the section structure. This is actually pretty nice. A 16-bit s_nreloc may appear restrictive but is sufficient for typical relocatable files. In practice, however, the number of relocations can exceed 65536 for a single section when using relocatable linking.

s_lnnoptr and s_nlnno point to line number entries, which relate addresses to source file line numbers. The embedded nature is inflexible.

/*  There is one line number entry for every "breakpointable" source line in a
section. Line numbers are grouped on a per function basis; the first entry in a
function grouping will have l_lnno = 0 and in place of physical address will be
the symbol table index of the function name. */
struct lineno {
  union {
    long l_symndx;/* sym. table index of function name iff l_lnno == 0 */
    long l_paddr;   /* (physical) address of line number */
  } l_addr;
  unsigned short l_lnno ;/* line number */
};

This simple format is deprecated. In DWARF, special opcodes in line number information can encode the information in a more space-efficient way and present more information like the column number.

Sections (Mach-O)

// Mach-O, 80 bytes
struct section_64 {
  char sectname[16];
  char segname[16];
  uint64_t addr;
  uint64_t size;
  uint32_t offset;
  uint32_t align;
  uint32_t reloff;
  uint32_t nreloc;
  uint32_t flags;
  uint32_t reserved1; // index into the indirect symbol table specified by LC_DYSYMTAB
  uint32_t reserved2; // __TEXT,__stub: entry size; otherwise: 0
  uint32_t reserved3;
};

How does Mach-O end up with such a huge section structure? Let's find out...

A Mach-O binary is divided into segments, each housing one or more sections. The section structure encodes the section name and the segment name, both of which can be up to 16 bytes. This representation allows the section names to be read without a string table, but is restrictive for descriptive names. Section semantics are derived from the name (unlike ELF).

The segment name is redundantly encoded within the section structure. We could derive the segment from the section name and flags, e.g., S_ATTR_SOME_INSTRUCTIONS => __TEXT, S_ZEROFILL => ZeroFill __DATA.

There is a severe limitation: a maximum of 255 sections due to nlist::n_sect being a uint8_t. This is apparently too restrictive. Thankfully, an innovative feature .subsections_via_symbols overcomes the limitation. The feature uses a monolithic section with "atoms" dividing it into pieces (subsections). This is more size-efficient than ELF's -ffunction-sections -fdata-sections -fno-unique-section-names. However, there are assembler limitations, relocation processing complexity, and potential loss of ability to ensure that two non-local symbols are not reordered.

Like COFF, Mach-O embeds the location and size of relocations into the section structure.

reserved1 and reserved2 are used similarly to ELF's connection information.

__TEXT,__stub (like ELF's .plt), __TEXT,__got (like ELF's .got), __TEXT,__la_symbol_ptr (like ELF's .got.plt), and __DATA,__thread_ptrs set reserved1 as an index into the indirect symbol table (the offset is specified by indirectsymoff in a LC_DYSYMTAB command).

For __TEXT,__stub, reserved2 is the size of one entry, e.g., 6 for x86-64 (jmpq *__la_symbol_ptr(%rip)). This is analogous to ELF x86-64's DT_X86_64_PLTSZ. For other sections, reserved2 is zero.

Relocations

// ELFCLASS32, 8 bytes
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_info;
} Elf32_Rel;
// ELFCLASS32, 12 bytes
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_info;
  Elf32_Sword r_addend;
} Elf32_Rela;

// ELFCLASS64, 16 bytes
typedef struct {
  Elf64_Addr r_offset;
  Elf64_Xword r_info;
} Elf64_Rel;
// ELFCLASS64, 24 bytes
typedef struct {
  Elf64_Addr r_offset;
  Elf64_Xword r_info;
  Elf64_Sxword r_addend;
} Elf64_Rela;

r_info specifies the symbol table index with respect to which the relocation must be made, and the type of relocation to apply.

ELFCLASS32: 8-bit type, 24-bit symbol index
ELFCLASS64: 32-bit type, 32-bit symbol index
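The two layouts can be sketched as pack/unpack helpers. They follow the generic ABI's ELF32_R_SYM/ELF32_R_TYPE/ELF32_R_INFO and ELF64_R_* macros, written as functions with lowercase names of my own choosing.

```c
#include <assert.h>
#include <stdint.h>

/* ELFCLASS32: symbol index in the high 24 bits, type in the low 8 bits. */
static inline uint32_t elf32_r_sym(uint32_t info) { return info >> 8; }
static inline uint32_t elf32_r_type(uint32_t info) { return info & 0xff; }
static inline uint32_t elf32_r_info(uint32_t sym, uint32_t type) {
  assert(sym < (1u << 24) && type < 256); /* both must fit their sub-fields */
  return (sym << 8) | type;
}

/* ELFCLASS64: symbol index in the high 32 bits, type in the low 32 bits. */
static inline uint32_t elf64_r_sym(uint64_t info) { return (uint32_t)(info >> 32); }
static inline uint32_t elf64_r_type(uint64_t info) { return (uint32_t)info; }
static inline uint64_t elf64_r_info(uint32_t sym, uint32_t type) {
  return ((uint64_t)sym << 32) | type;
}
```

The assertion in elf32_r_info marks the two ELFCLASS32 limits: at most 2**24 symbols and 256 relocation types.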

There are two variants, REL and RELA. Let's quote the generic ABI:

As specified previously, only Elf32_Rela and Elf64_Rela entries contain an explicit addend. Entries of type Elf32_Rel and Elf64_Rel store an implicit addend in the location to be modified. Depending on the processor architecture, one form or the other might be necessary or more convenient. Consequently, an implementation for a particular machine may use one form exclusively or either form depending on context.

Relocatable files need a lot of relocation types while executables and shared objects need only a few. The former is often called static relocations while the latter is called dynamic relocations.

Of the few dynamic relocation types, most do not need the addend member. lld provides an option -z rel to use SHT_REL/DT_REL dynamic relocations.

If we disregard the REL dynamic relocation scenario, then all modern architectures use RELA exclusively. Most architectures encode the immediate with only a few bits, which are inadequate for many relocatable file uses.

ELFCLASS64, with its 64-bit members, doubles the size compared to ELFCLASS32's 32-bit members. Since relocations often comprise a substantial portion of object files, this size difference can lead to user concerns. However, in practice, a 24-bit symbol index is often sufficient, even in 64-bit contexts. Therefore, if a 64-bit architecture needs fewer than 256 relocation types, ELFCLASS32 can be a viable and more size-efficient option.

In March 2024, I proposed CREL as an alternative relocation format.
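The difference between the implicit and explicit addend can be illustrated with a hypothetical absolute 32-bit relocation that computes S + A. This is a sketch only: the function names are invented, the location is accessed in host byte order, and real relocation types are defined per architecture.

```c
#include <stdint.h>
#include <string.h>

/* REL: the addend A is implicit, stored in the location being modified. */
static void apply_abs32_rel(uint8_t *loc, uint32_t sym_value) {
  uint32_t addend;
  memcpy(&addend, loc, sizeof addend); /* read the implicit addend */
  uint32_t value = sym_value + addend; /* S + A */
  memcpy(loc, &value, sizeof value);
}

/* RELA: the addend A comes from r_addend; the old content of the location is
 * ignored. */
static void apply_abs32_rela(uint8_t *loc, uint32_t sym_value, int32_t addend) {
  uint32_t value = sym_value + (uint32_t)addend; /* S + A */
  memcpy(loc, &value, sizeof value);
}
```

The REL form only works when the location has enough bits to hold the addend, which is the limitation that pushes architectures with narrow immediates toward RELA.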

Relocations (a.out)

// a.out (System V Release 2), 8 bytes
struct relocation_info {
  int   r_address;
  unsigned r_symbolnum : 24,
           r_pcrel : 1,
           r_length : 2,
           r_extern : 1,
           r_pad : 4;
};

r_symbolnum mirrors ELF's ELF32_R_SYM.

The other bitfields resemble ELF's ELF32_R_TYPE, but are split into distinct fields:

r_pcrel
r_length
others

Reserving dedicated semantics for individual bits can limit adaptability. COFF and ELF opted to remove bitfields in favor of a type to provide greater flexibility.

Relocations (COFF)

// COFF, 10 bytes on disk, 12 bytes with alignment padding
struct reloc {
  long r_vaddr;
  long r_symndx;
  unsigned short r_type;
};

This format resembles ELF's Elf32_Rel.

r_vaddr gives the virtual address of the location at which to apply the relocation action. If we interpret r_vaddr as an offset (as PE does) and restrict section size to 32 bits, we could reuse this structure for 64-bit architectures.

r_symndx is a 32-bit symbol table index.

r_type is a 16-bit relocation type, limited in number compared to ELF.

COFF generally supports fewer relocation types than ELF. System V Release 3 defines very few relocations for each architecture. In binutils, include/coff/*.h files define relocations for more architectures.

While ELF uses the REL/RELA for both relocatable files and executables, in PE image files, the import address table and base relocation table (.reloc) are a completely different design.

Relocations (Mach-O)

// Mach-O, 8 bytes
struct relocation_info {
  int32_t r_address;       /* offset in the section to what is being relocated */
  uint32_t r_symbolnum:24, /* symbol index if r_extern == 1 or section ordinal if r_extern == 0 */
           r_pcrel:1,      /* was relocated pc relative already */
           r_length:2,     /* 0=byte, 1=word, 2=long, 3=quad */
           r_extern:1,     /* does not include value of sym referenced */
           r_type:4;       /* if not 0, machine specific relocation type */
};

struct scattered_relocation_info {
#ifdef __BIG_ENDIAN__
  uint32_t r_scattered : 1, r_pcrel : 1, r_length : 2, r_type : 4,
      r_address : 24;
#else
  uint32_t r_address : 24, r_type : 4, r_length : 2, r_pcrel : 1,
      r_scattered : 1;
#endif
  int32_t r_value;
};

Mach-O's relocation structure closely mirrors a.out's with adapted r_symbolnum meaning. When r_extern == 0 (local), the r_symbolnum member references a section index instead of a symbol index. This is to support custom sections, breaking the three-section limitation (text, data, and bss) of traditional a.out.

As aforementioned, dedicating bits to bitfields (r_pcrel, r_length, and r_scattered) greatly restricted the number of relocation types.

Related to the relocation type limitation, a .long foo - . in a data section requires a pair of relocations, SUBTRACTOR and UNSIGNED. I have some notes on Port LLVM XRay to Apple systems.

Mach-O uses a number of sections in the __LINKEDIT segment to communicate information to dyld.

File header

File header (a.out)

Dennis MacAlistair Ritchie's A.OUT (V) manpage (1971) describes the original a.out format. The header contains 6 words.

a "br .+14" instruction (205(8))
The size of the program text
The size of the symbol table
The size of the relocation bits area
The size of a data area
A zero word (unused at present)

The text relocations are implicit.

Later versions introduced new magic numbers, separated text relocations and data relocations, and added an entry point (a_entry).

Size comparison

TODO

Size reduction opportunities

ELFCLASS32 structures are already compact, offering limited size reduction potential. ELFCLASS64 structures, while flexible, can be optimized by sacrificing some flexibility (64-bit quantities). The 64-bit symbol control structure is compact, but the section and relocation structures are quite wasteful if we can sacrifice some flexibility.

As the ELF paper acknowledges, "Relocatable and executable files do not necessarily have the same constraints, and we considered using two file formats. Eventually, we decided the two activities were similar enough that a single format would suffice." There are more tools inspecting executables than relocatable files. So, naturally, we might want to change just relocatable files. Can we use ELFCLASS32 relocatable files for 64-bit architectures?

Well, x86-64 and AArch64 make a clear distinction between ELFCLASS32 and ELFCLASS64. ELFCLASS32 is for ILP32 (x32, aarch64_ilp32) while ELFCLASS64 is for LP64. However, the discontinued Itanium architecture sets a precedent that ELFCLASS32 can be used for LP64 programs. Quoting its psABI (Intel Itanium Processor-specific Application Binary Interface (ABI)):

For Itanium architecture ILP32 relocatable (i.e. of type ET_REL) objects, the file class value in e_ident[EI_CLASS] must be ELFCLASS32. For LP64 relocatable objects, the file class value may be either ELFCLASS32 or ELFCLASS64, and a conforming linker must be able to process either or both classes. ET_EXEC or ET_DYN object file types must use ELFCLASS32 for ILP32 and ELFCLASS64 for LP64 programs.
Addresses appearing in ELFCLASS32 relocatable objects for LP64 programs are implicitly extended to 64 bits by zero-extending.
Note: Some constructs legal in LP64 programs, e.g. absolute 64-bit addresses outside the 32-bit range, may require use of an ELFCLASS64 relocatable object file.

Given the prior art, it seems promising to allow ELFCLASS32 when the code size concerns people. Ideally there should be a marker to distinguish ILP32 and LP64-using-ELFCLASS32 object files.

The primary changes reside in the assembler and linker. It's also important to ensure that binary manipulation programs (like objcopy) and dump tools are happy with them.

Further optimization potential lies in exploring the use of Elf32_Rel instead of Elf32_Rela for even smaller relocations.

Replacing control structures

This approach is independent of whether ELFCLASS32 is adopted and can be applied to both ELFCLASS32 and ELFCLASS64. The ELF paper is clear, "ELF allows extension and redefinition for other control structures." However, caution is warranted due to the significant impact on the ecosystem as many tools rely on the existing structures.

One promising example is Elf32_Shdr_minimized, a custom structure reduced to 32 bytes from the standard Elf32_Shdr's 40 bytes. Although I would be nervous about it, if we reduce sh_type to a uint16_t, the structure size can be reduced to 28 bytes.

stabs

Earlier debuggers operated using a debugging information format called "stabs" (short for symbol table entries; dating back to at least UNIX/32V in 1979). Stabs is encoded using extra symbol table entries in the a.out object file format.

.stabs "string",type,other,desc,value
.stabn type,other,desc,value
.stabd type,other,desc

Stabs was ported to COFF for System V Release 2, used on some machines. System V Release 4 switched to ELF and abandoned stabs in favor of a newly developed format called DWARF. Its debugger sdb was rewritten to support DWARF, and stabs was no longer supported. (The first version of DWARF was later published by the UNIX International Programming Languages Special Interest Group (SIG) in January 1992.)

However, stabs continued to be used in other operating systems, including *BSD, AIX, and IRIX. For example, the GNU assembler added stabs support for ELF (n_strx is 32-bit).

GCC 13 removed stabs support.

Stabs is less efficient than DWARF. When compiling a non-trivial program (so that the boilerplate in DWARF is less significant), you may observe that .stab and .stabstr consume more space than .debug_* sections, even if DWARF is more expressive and contains more information.

Heterogeneity and challenge

While the diversity of operating systems and architectures poses complexity for application developers, the object file format heterogeneity presents a unique challenge for toolchain development, probably not very tangible to application developers and users.

Integrating features like Link Time Optimization (LTO), Profile-Guided Optimization (PGO), and sanitizers is complicated by object file format-specific limitations and nuances. While most developers primarily concern themselves with a specific format, they still need to tread carefully during development to avoid disruptions to other platforms.

TODO

WebAssembly

2023 year in review

2023-12-31T08:00:00.000Z

As always, I mainly worked on toolchains.

llvm-project

I made 700+ commits this year. Many are clean-up commits or fixups for others' work. I hope that I can do more useful work next year.

Enabled --features=layering_check for Bazel's llvm and clang projects
Implemented LLVM_ENABLE_REVERSE_ITERATION for StringMap
Added llvm::xxh3_64bits and adopted it in llvm/clang/lld
Made AArch64 changes to msan and dfsan
Made various improvements to the clang driver, including -### exit code, auxiliary files, errors for unsupported target-specific options, -fsanitize=kcfi, and XRay
Made various sanitizer changes
Made -fsanitize=function work for C and non-x86 architectures
Fixed several major problems of assembler implementation of RISC-V linker relaxation
clangSema: %lb recognition for printf/scanf, checking the failure memory order for atomic_compare_exchange-family built-in functions, -Wc++11-narrowing-const-reference
llvm-objdump: @plt symbols for x86 .plt.got, mapping symbol improvements, --disassemble-symbols changes, etc
Supported R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128 for .uleb128 directives
gcov: fix instrumentation crashes when using inline variables, #line, and #include, made llvm-cov gcov work with a smaller stack size
LTO: fixed local ifunc for ThinLTO, reported errors when parsing module-level inline assembly
CodeLayout: fixed a correctness bug and improved performance to be suitable for lld/ELF --call-graph-profile-sort=hfsort default

Reviewed many commits. A lot of people don't add a Reviewed By: tag. Anyway, counting commits with the tag can give an underestimate.

% git shortlog -sn 2679e8bba3e166e3174971d040b9457ec7b7d768...main --grep 'Reviewed .*MaskRay' | awk '{s+=$1}END{print s}'
395

Many GitHub pull requests are not counted.

I created a read-only archive of the LLVM phabricator instance.

lld/ELF

lld/ELF is quite stable. I have made some maintenance changes. As usual, I wrote the ELF port's release notes for the two releases. See lld 16 ELF changes and lld 17 ELF changes for detail.

I made some slides on how I improved the performance of lld/ELF.

psABI

I have made changes to x86-64, AArch64, and RISC-V psABI documents.

binutils

Reported many bugs and feature requests:

ld: Should --gc-sections respect RHS of a symbol assignment?
objcopy: add support for changing ELF symbol visibility
rtld: resolve ifunc relocations after JUMP_SLOT/GLOB_DAT/etc
ld riscv: --emit-relocs does not retain the original relocation type
gas aarch64: GOT relocations referencing a local symbol should not be changed to reference STT_SECTION
objcopy --set-section-flags: support toggling a flag
gas x86: reject {call,jmp} [offset func] in Intel syntax

My commits:

PR30592 objcopy: allow --set-section-flags to add or remove SHF_X86_64_LARGE
ld: Allow R_386_GOT32 for call *__tls_get_addr@GOT(%reg)
ld: Allow R_X86_64_GOTPCREL for call *__tls_get_addr@GOTPCREL(%rip)
RISC-V: Add --[no-]relax-gp to ld

GCC

I had one patch landed supporting -mlarge-data-threshold= for x86-64 -mcmodel=medium.

Linux kernel

8 commits. Consulted on a number of toolchain questions.

Blog

Wrote 29 blog posts (including this one, mainly about toolchains) and revised many posts initially written between 2020 and 2023.

Misc

Trips: Orlando, Philadelphia, Harrisburg, Trenton, Newark, New York City, Alaska, Ontario, Quebec, Nova Scotia, Chicago, Atlanta, Miami, Jamaica, Haiti.

Mastodon: https://hachyderm.io/@meowray

reviews.llvm.org became a read-only archive

2023-12-30T08:00:00.000Z

For approximately 10 years, reviews.llvm.org functioned as the code review site for the LLVM project, utilizing a Phabricator instance. This website hosted numerous invaluable code review discussions. However, following LLVM's transition to GitHub pull requests, there arises a necessity for a read-only archive of the existing Phabricator instance. (https://archive.org/ archives a subset of the reviews.llvm.org/Dxxxxx pages.)

The intent is to eliminate a SQL engine. Phabricator operates on a complex database schema. To minimize time investment, the most feasible approach seems to involve downloading the static HTML pages and employing a lightweight scraping process.

Raphaël Gomès developed phab-archive to serve a read-only archive for Mercurial's Phabricator instance. I have modified the code to suit reviews.llvm.org.

The DNS records of reviews.llvm.org have been pointed to the archive website.

Read-only pages

The review discussions primarily happen on /Dxxx pages, which should be archived. There are far fewer discussions on /rL$svn_rev (when LLVM used svn) and /rG$git_commit pages. We skip archiving them as a compromise.

Some /Dxxx pages contain a large number of modified files (usually tests). Phabricator presents a "Load File" button. If we expand every button, the resulting HTML can be very large. We need to limit the number of buttons to click.

The file hierarchy is quite straightforward. archive/unprocessed/diffs contains raw HTML pages while templates/diffs contains scraped HTML pages alongside patch files.

% tree archive/unprocessed/diffs | head -n 12
archive/unprocessed/diffs
├── 1
│   ├── D1-4.html
│   ├── D1-5.html
│   └── D1.html
├── 10
│   ├── D10-33.html
│   └── D10.html
├── 100
│   ├── D100000-335683.html
│   ├── D100000-335688.html
│   ├── D100000-335689.html
% tree templates/diffs/ | head -n 20
templates/diffs/
├── 1
│   ├── D1-4.diff
│   ├── D1-4.html
│   ├── D1-5.diff
│   ├── D1-5.html
│   ├── D1.diff
│   └── D1.html
├── 10
│   ├── D10-33.diff
│   ├── D10-33.html
│   ├── D10.diff
│   └── D10.html
├── 100
│   ├── D100000-335683.diff
│   ├── D100000-335683.html
│   ├── D100000-335688.diff
│   ├── D100000-335688.html
│   ├── D100000-335689.diff
│   ├── D100000-335689.html
% cat templates/diffs/1/D1-4.diff
Index: include/llvm/ADT/StringMap.h
===================================================================
--- include/llvm/ADT/StringMap.h
+++ include/llvm/ADT/StringMap.h
@@ -34,7 +34,7 @@
 public:
   template 
   static void Initialize(StringMapEntry &T, InitTy InitVal) {
-    T.second = InitVal;
+    T.test= InitVal;
   }
 };

% du -sh archive/unprocessed/
270G    archive/unprocessed/
% du -sh templates/diffs
282G    templates/diffs

At present, some https://reviews.llvm.org/Dxxxxx pages might be inaccessible. https://reviews.llvm.org/Dxxxxx?download=true is an alternative if you just need the patch file but not discussions.

Embedded images are currently unavailable. https://reviews.llvm.org/D71786 is an example. https://reviews.llvm.org/D135657 is another example with embedded images in a comment.

% rg -l 'phabricator-remarkup-embed-image' templates/diffs/ | wc -l
3332

Nginx

I aim to utilize Nginx solely to serve URIs.

/D2 => /diffs/2/D2.html
/D2?id=&download=true => /diffs/2/D2.diff
/D2?id=10 => /diffs/2/D2-10.html
/D2?id=10&download=true => /diffs/2/D2-10.diff

/D123?id=5 => /diffs/123/D123-5.html
/D1234?id=5 => /diffs/123/D1234-5.html

/rL$svn_rev => https://github.com/llvm/llvm-project/commit/$git_commit
/rG$git_commit => https://github.com/llvm/llvm-project/commit/$git_commit

We just need URL mapping and some Nginx location directives.

map_hash_max_size 400000;
map_hash_bucket_size 128;
map $request_uri $svn_rev {
  ~^/rL([0-9]+) $1;
}
map $svn_rev $git_commit {
  include /var/www/phab-archive/svn_url_rewrite.conf;
}

server {
  listen 80 default_server;
  listen [::]:80 default_server;

  if ($git_commit) {
    return 301 https://github.com/llvm/llvm-project/commit/$git_commit;
  }

  root /var/www/phab-archive/www;
  server_name reviews.llvm.org;

  types {
    text/html html;
    text/plain diff;
  }

  location ~ "^/D(?<diff>.{1,3})$" {
    set $ext ".html";
    if ($arg_download) { set $ext ".diff"; }
    if ($arg_id ~ ^(\d+)$) { rewrite ^ /diffs/$diff/D$diff-$arg_id$ext? last; }
    try_files /diffs/$diff/D$diff$ext =404;
  }
  location ~ ^/D(?<dir>...)(?<tail>.+) {
    set $ext ".html";
    if ($arg_download) { set $ext ".diff"; }
    if ($arg_id ~ ^(\d+)$) { rewrite ^ /diffs/$dir/D$dir$tail-$arg_id$ext? last; }
    try_files /diffs/$dir/D$dir$tail$ext =404;
  }
}
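
For testing the URL scheme outside Nginx, the /Dxxx mapping listed above can be mirrored in a few lines of Python. This is only a sketch: the function name archive_path is made up for illustration, and it models just the two location blocks (not the /rL and /rG redirects or try_files).

```python
import re
from urllib.parse import parse_qs, urlsplit


def archive_path(uri: str) -> str:
    """Map a /Dxxx request URI to a path below the web root (illustrative)."""
    parts = urlsplit(uri)
    if not re.fullmatch(r"/D.+", parts.path):
        raise ValueError(f"not a differential URI: {uri}")
    number = parts.path[2:]
    # Directories are named after the first three characters of the number.
    directory = number[:3]
    # keep_blank_values so that an empty "id=" is still parsed as a key.
    query = parse_qs(parts.query, keep_blank_values=True)
    # Like `if ($arg_download)`: an empty string or "0" counts as false.
    download = query.get("download", [""])[0]
    ext = ".diff" if download not in ("", "0") else ".html"
    # Like `if ($arg_id ~ ^(\d+)$)`: only a numeric id selects a specific diff.
    diff_id = query.get("id", [""])[0]
    suffix = f"-{diff_id}" if re.fullmatch(r"[0-9]+", diff_id) else ""
    return f"/diffs/{directory}/D{number}{suffix}{ext}"


for uri in ["/D2", "/D2?id=&download=true", "/D2?id=10", "/D1234?id=5"]:
    print(uri, "=>", archive_path(uri))
```

Keeping the three-character directory prefix in one place makes it easy to compare the sketch against the files produced by the scraper.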

The second round of crawling

Among D1 to D159553, there were 1669 pages that were not downloaded. These differentials might have been deleted by the author, had a permission error (e.g. the author did not make it publicly readable), or the crawler encountered an error (e.g. an emulated button click failed).

In January 2024, I got access to the machine hosting the Phabricator instance and crawled 759 differentials. Among them, 184 differentials have a state other than "Closed".

Statistics

We can make a copy of process-html.py and modify it to get some statistics.

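from bs4 import BeautifulSoup


def process_html(html, diff):
    soup = BeautifulSoup(html, "html.parser")
    # Review state (e.g. "Closed"), title, and author of the differential.
    status = soup.select_one(".phui-tag-core").text
    title = soup.select_one(".phui-header-header").text
    author = soup.select_one(".phui-head-thing-view > strong").text
    # Keep handles whose text contains "commits" (e.g. llvm-commits).
    sub = []
    for div in soup.select(".phui-handle.phui-link-person"):
        if 'commits' in div.text:
            sub.append(div.text)
    # Emit one tab-separated row per differential.
    print(diff, status, title, author, ','.join(sub), sep='\t')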

I have collected differentials that are not “Closed” at https://gist.githubusercontent.com/MaskRay/798de69eb9e7ec7c3e98507265dc5514/raw/. The majority of differentials are "Closed" (indicating a landed patch, unless mis-tagged), therefore not interesting. The rows contain subscribers that look like *-commits, e.g. llvm-commits (a mailing list). This should help find pending patches for subprojects, such as clang, flang, and libcxx.

Exploring the section layout in linker output

2023-12-17T08:00:00.000Z

This article describes section layout and its interaction with dynamic loaders and huge pages.

Let's begin with a Linux x86-64 example involving global variables exhibiting various properties such as read-only versus writable, zero-initialized versus non-zero, and more.

#include <stdio.h>
const int ro = 1;
int w0, w1 = 1;
int *const pw0 = &w0;
int main() {
  printf("%d %d %d %p\n", ro, w0, w1, pw0);
}

% clang -c -fpie a.c
% clang -pie -fuse-ld=lld -Wl,-z,separate-loadable-segments a.o -o a
% objdump -wt a | grep -P 'main|w[01]|ro$'
00000000000010f0 g     F .text  000000000000002e              main
0000000000003044 g     O .bss   0000000000000004              w0
0000000000003010 g     O .data  0000000000000004              w1
000000000000058c g     O .rodata        0000000000000004              ro
0000000000002010 g     O .data.rel.ro   0000000000000008              pw0
% readelf -Wl a
...
Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x000268 0x000268 R   0x8
  INTERP         0x0002a8 0x00000000000002a8 0x00000000000002a8 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000628 0x000628 R   0x1000
  LOAD           0x001000 0x0000000000001000 0x0000000000001000 0x000180 0x000180 R E 0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x0001e0 0x001000 RW  0x1000
  LOAD           0x003000 0x0000000000003000 0x0000000000003000 0x000040 0x000048 RW  0x1000
  DYNAMIC        0x002018 0x0000000000002018 0x0000000000002018 0x0001a0 0x0001a0 RW  0x8
  GNU_RELRO      0x002000 0x0000000000002000 0x0000000000002000 0x0001e0 0x001000 R   0x1
  GNU_EH_FRAME   0x0005a0 0x00000000000005a0 0x00000000000005a0 0x00001c 0x00001c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0
  NOTE           0x0002c4 0x00000000000002c4 0x00000000000002c4 0x000020 0x000020 R   0x4
...

(We will discuss -Wl,-z,separate-loadable-segments later.)

We can see that these functions and global variables are placed in different sections.

.rodata: read-only data without dynamic relocations, constant in the link unit
.text: functions
.data.rel.ro: read-only data associated with dynamic relocations, constant after relocation resolving, part of the PT_GNU_RELRO segment
.data: writable data
.bss: writable data known to be zeros

Section and segment layout

TODO I may write more about how linkers lay out sections and segments.

Anyhow, the linker will place .data and .bss in the same PT_LOAD program header (segment) and the rest into different PT_LOAD segments. (There are some nuances. If you use GNU ld's -z noseparate-code or lld's --no-rosegment, .rodata and .text will be placed in the same PT_LOAD segment.)

The PT_LOAD segments have different flags (p_flags): PF_R, PF_R|PF_X, PF_R|PF_W. Subsequently, the dynamic loader, also known as the dynamic linker, will invoke mmap to map the file into memory. The memory areas (VMA) have different memory permissions corresponding to segment flags.

For a PT_LOAD segment, its associated memory area starts at alignDown(p_vaddr, pagesize) and ends at alignUp(p_vaddr+p_memsz, pagesize).

    Start Addr           End Addr       Size     Offset  Perms  objfile
0x555555554000     0x555555555000     0x1000        0x0  r--p   /tmp/c/a
0x555555555000     0x555555556000     0x1000     0x1000  r-xp   /tmp/c/a
0x555555556000     0x555555557000     0x1000     0x2000  r--p   /tmp/c/a
0x555555557000     0x555555558000     0x1000     0x3000  rw-p   /tmp/c/a

Let's assume the page size is 4096 bytes. We'll calculate the alignDown(p_vaddr, pagesize) values and display them alongside the "Start Addr" values:

Start Addr       alignDown(p_vaddr, pagesize)
0x555555554000   0x0000000000000000
0x555555555000   0x0000000000001000
0x555555556000   0x0000000000002000
0x555555557000   0x0000000000003000

We observe that the start address equals the base address plus alignDown(p_vaddr, pagesize).
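
The start and end addresses in the mapping table can be recomputed from the program headers. The following is a minimal sketch in Python, assuming a 4096-byte page size and the load base 0x555555554000 seen above; the helper names are mine, not taken from any loader.

```python
PAGE_SIZE = 4096
BASE = 0x555555554000  # load base observed in the mapping table

# (p_vaddr, p_memsz) of the four PT_LOAD segments in the readelf output.
LOADS = [(0x0000, 0x628), (0x1000, 0x180), (0x2000, 0x1000), (0x3000, 0x48)]


def align_down(x: int, a: int) -> int:
    """Round x down to a multiple of a (a must be a power of two)."""
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    """Round x up to a multiple of a (a must be a power of two)."""
    return (x + a - 1) & ~(a - 1)


def memory_area(base: int, p_vaddr: int, p_memsz: int, pagesize: int = PAGE_SIZE):
    """Return the [start, end) range of the memory area for one PT_LOAD."""
    start = base + align_down(p_vaddr, pagesize)
    end = base + align_up(p_vaddr + p_memsz, pagesize)
    return start, end


for p_vaddr, p_memsz in LOADS:
    start, end = memory_area(BASE, p_vaddr, p_memsz)
    print(f"{start:#x} {end:#x} {end - start:#x}")
```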

`--no-rosegment`

This option asks lld to combine the read-only and the RX segments. The output file will consume less address space at run-time.

    Start Addr           End Addr       Size     Offset  Perms  objfile
0x555555554000     0x555555555000     0x1000        0x0  r-xp   /tmp/c/a
0x555555555000     0x555555556000     0x1000        0x0  r--p   /tmp/c/a
0x555555556000     0x555555557000     0x1000     0x1000  rw-p   /tmp/c/a

MAXPAGESIZE

A page serves as the granularity at which memory exhibits different permissions, and within a page, we cannot have varying permissions. Using the previous example where p_align is 4096, if the page size is larger, for example, 65536 bytes, the program might crash.

Typically, the dynamic loader allocates memory for the first PT_LOAD segment (PF_R) at a specific address allocated by the kernel. Subsequent PT_LOAD segments then overwrite the previous memory regions. Consequently, certain code pages or significant global variables might be replaced by garbage, leading to a crash.

So, how can we create a link unit that works across different page sizes? We simply determine the maximum page size, let's say, 2097152, and then pass -z max-page-size=2097152 to the linker. The linker will set p_align values of PT_LOAD segments to MAXPAGESIZE.
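
To see why the segments collide, we can recompute the page-aligned memory areas of the earlier example for both page sizes. This is a rough Python sketch; overlapping_pairs is a hypothetical helper, and the segment values are the (p_vaddr, p_memsz) pairs from the first readelf output.

```python
# (p_vaddr, p_memsz) of the PT_LOAD segments linked with p_align = 4096.
LOADS = [(0x0000, 0x628), (0x1000, 0x180), (0x2000, 0x1000), (0x3000, 0x48)]


def align_down(x: int, a: int) -> int:
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def overlapping_pairs(loads, pagesize):
    """Return indexes (i, i+1) of adjacent segments whose areas overlap."""
    areas = [
        (align_down(vaddr, pagesize), align_up(vaddr + memsz, pagesize))
        for vaddr, memsz in loads
    ]
    return [
        (i, i + 1)
        for i in range(len(areas) - 1)
        if areas[i + 1][0] < areas[i][1]
    ]


print(overlapping_pairs(LOADS, 4096))   # page size matches p_align
print(overlapping_pairs(LOADS, 65536))  # larger page size
```

With 4096-byte pages no pair overlaps, while with 65536-byte pages every area starts at offset 0 from the base, so each later mapping lands on top of the earlier ones.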

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x000268 0x000268 R   0x8
  INTERP         0x0002a8 0x00000000000002a8 0x00000000000002a8 0x00001c 0x00001c R   0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000640 0x000640 R   0x200000
  LOAD           0x200000 0x0000000000200000 0x0000000000200000 0x000180 0x000180 R E 0x200000
  LOAD           0x400000 0x0000000000400000 0x0000000000400000 0x0001e0 0x001000 RW  0x200000
  LOAD           0x600000 0x0000000000600000 0x0000000000600000 0x000040 0x000048 RW  0x200000
  DYNAMIC        0x400018 0x0000000000400018 0x0000000000400018 0x0001a0 0x0001a0 RW  0x8
  GNU_RELRO      0x400000 0x0000000000400000 0x0000000000400000 0x0001e0 0x001000 R   0x1
  GNU_EH_FRAME   0x0005b8 0x00000000000005b8 0x00000000000005b8 0x00001c 0x00001c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0
  NOTE           0x0002c4 0x00000000000002c4 0x00000000000002c4 0x000038 0x000038 R   0x4

In a linker script, the max-page-size can be obtained using CONSTANT(MAXPAGESIZE).

For completeness, if you need to run a prebuilt executable on a system with a larger page size, you can modify the executable by merging PT_LOAD segments and combining their permissions. It's likely there will be a sizable RWX PT_LOAD segment, reminiscent of OMAGIC.

Over-aligned segment

It is possible to increase the p_align value of one single PT_LOAD segment using an aligned attribute. When this value exceeds the page size, the question arises: should the kernel loader or the dynamic loader determine a suitable base address to meet this alignment requirement?

In 2020, the Linux kernel loader made the decision to align the base address according to the maximum p_align. This facilitates transparent huge pages for mapped files at the cost of reduced address randomization.

% cat align.c
#include <stdio.h>
__attribute__((aligned(A))) int aligned;
int main() { printf("%p\n", &aligned); }
% cc -DA=4096 align.c -o align && ./align
0x55e994c13000
% cc -DA=2097152 align.c -o align && ./align
0x55639a400000

Should a userspace dynamic loader do the same? If it does, a variable with an alignment greater than the page size will indeed align accordingly. As of glibc 2.35, it has followed suit.

On the other hand, the traditional interpretation dictates that a variable with an alignment greater than the page size is invalid. Most other dynamic loaders do not implement this particular logic, which has some overhead.

-z separate-loadable-segments

In previous examples using -z separate-loadable-segments, the p_vaddr values of PT_LOAD segments are multiples of MAXPAGESIZE. The generic ABI says "loadable process segments must have congruent values for p_vaddr and p_offset, modulo the page size."

p_offset - This member gives the offset from the beginning of the file at which the first byte of the segment resides.
p_vaddr - This member gives the virtual address at which the first byte of the segment resides in memory.

This alignment requirement aligns with the mmap documentation. For example, Linux man-pages specifies, "offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE)."

The p_offset values are also multiples of MAXPAGESIZE. After laying out a PT_LOAD segment, the linker must pad the end by inserting zeros so that the next PT_LOAD segment starts at a multiple of MAXPAGESIZE.

However, the alignment padding is wasteful. Fortunately, we can link a.o using different MAXPAGESIZE and different alignment settings: -z noseparate-code, -z separate-code, -z separate-loadable-segments.
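
The congruence rule is easy to check mechanically. Below is a small Python sketch; is_congruent is a made-up helper, and the sample values are the (p_offset, p_vaddr) pairs of the PT_LOAD segments from the -z noseparate-code readelf output shown later in this article.

```python
def is_congruent(p_offset: int, p_vaddr: int, pagesize: int) -> bool:
    """True if p_vaddr and p_offset are congruent modulo the page size."""
    return (p_vaddr - p_offset) % pagesize == 0


# (p_offset, p_vaddr) of PT_LOAD segments linked with MAXPAGESIZE = 4096.
SEGMENTS = [(0x000, 0x0000), (0x630, 0x1630), (0x7B0, 0x27B0), (0x990, 0x3990)]

for p_offset, p_vaddr in SEGMENTS:
    print(
        f"{p_offset:#x} {p_vaddr:#x}",
        is_congruent(p_offset, p_vaddr, 4096),
        is_congruent(p_offset, p_vaddr, 65536),
    )
```

The pairs are congruent modulo 4096 but, apart from the first one, not modulo 65536, which is another way to see why a link unit built for a small MAXPAGESIZE cannot be mapped on a system with larger pages.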

clang -pie -fuse-ld=lld -Wl,-z,noseparate-code a.o -o a0.4096
clang -pie -fuse-ld=lld -Wl,-z,noseparate-code,-z,max-page-size=65536 a.o -o a0.65536
clang -pie -fuse-ld=lld -Wl,-z,noseparate-code,-z,max-page-size=2097152 a.o -o a0.2097152

clang -pie -fuse-ld=lld -Wl,-z,separate-code a.o -o a1.4096
clang -pie -fuse-ld=lld -Wl,-z,separate-code,-z,max-page-size=65536 a.o -o a1.65536
clang -pie -fuse-ld=lld -Wl,-z,separate-code,-z,max-page-size=2097152 a.o -o a1.2097152

clang -pie -fuse-ld=lld -Wl,-z,separate-loadable-segments a.o -o a2.4096
clang -pie -fuse-ld=lld -Wl,-z,separate-loadable-segments,-z,max-page-size=65536 a.o -o a2.65536
clang -pie -fuse-ld=lld -Wl,-z,separate-loadable-segments,-z,max-page-size=2097152 a.o -o a2.2097152

% stat -c %s a0.4096 a0.65536 a0.2097152
6168
6168
6168
% stat -c %s a1.4096 a1.65536 a1.2097152
12392
135272
4198504
% stat -c %s a2.4096 a2.65536 a2.2097152
16120
200440
6295288

We can derive two properties:

Under one MAXPAGESIZE, we have size(noseparate-code) < size(separate-code) < size(separate-loadable-segments).
For -z noseparate-code, increasing MAXPAGESIZE does not change the output size.

AArch64 and PowerPC64 have a default MAXPAGESIZE of 65536. Staying with the -z noseparate-code default ensures that they will not experience unnecessary size increase.

`-z noseparate-code`

How does -z noseparate-code work? Let's illustrate this with an example.

At the end of the read-only PT_LOAD segment, the address is 0x628. Instead of starting the next segment at alignUp(0x628, MAXPAGESIZE) = 0x1000, we start at alignUp(0x628, MAXPAGESIZE) + 0x628 % MAXPAGESIZE = 0x1628. Since the .text section has an alignment (sh_addralign) of 16, we start at 0x1630. Although the address is advanced beyond necessity, the file offset (congruent to the address, modulo MAXPAGESIZE) can be decreased to 0x630, merely 8 bytes (due to alignment padding) after the previous section's end.

Moving forward, the end of the executable PT_LOAD segment has an address of 0x17b0. Instead of starting the next segment at alignUp(0x17b0, MAXPAGESIZE) = 0x2000, we start at alignUp(0x17b0, MAXPAGESIZE) + 0x17b0 % MAXPAGESIZE = 0x27b0. While we advance the address more than needed, the file offset can be decreased to 0x7b0, precisely at the previous section's end.
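
The two computations above follow the same pattern, which can be written down as a short Python sketch. The function names are hypothetical and this only restates the arithmetic in the text; it is not lld's actual implementation.

```python
MAXPAGESIZE = 0x1000


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def next_segment_addr(prev_end: int, sh_addralign: int,
                      maxpagesize: int = MAXPAGESIZE) -> int:
    """Address of the next segment: skip one page, keep the in-page offset."""
    addr = align_up(prev_end, maxpagesize) + prev_end % maxpagesize
    # The first section of the segment may require extra alignment.
    return align_up(addr, sh_addralign)


def next_file_offset(prev_file_end: int, addr: int,
                     maxpagesize: int = MAXPAGESIZE) -> int:
    """Smallest offset >= prev_file_end congruent to addr mod maxpagesize."""
    return prev_file_end + (addr - prev_file_end) % maxpagesize


text_addr = next_segment_addr(0x628, 16)   # R segment ends at 0x628
relro_addr = next_segment_addr(0x17B0, 8)  # RX segment ends at 0x17b0
print(hex(text_addr), hex(next_file_offset(0x628, text_addr)))
print(hex(relro_addr), hex(next_file_offset(0x7B0, relro_addr)))
```

The key point is that the address jumps to the next page while the file offset only moves by the alignment padding.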

% readelf -WSl a0.4096
...
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        00000000000002a8 0002a8 00001c 00   A  0   0  1
  ...
  [12] .eh_frame         PROGBITS        00000000000005c0 0005c0 000068 00   A  0   0  8
  [13] .text             PROGBITS        0000000000001630 000630 00011e 00  AX  0   0 16
  ...
  [16] .plt              PROGBITS        0000000000001780 000780 000030 00  AX  0   0 16
  [17] .fini_array       FINI_ARRAY      00000000000027b0 0007b0 000008 08  WA  0   0  8
  ...
  [20] .dynamic          DYNAMIC         00000000000027c8 0007c8 0001a0 10  WA  7   0  8
  [21] .got              PROGBITS        0000000000002968 000968 000028 00  WA  0   0  8
  [22] .relro_padding    NOBITS          0000000000002990 000990 000670 00  WA  0   0  1
  [23] .data             PROGBITS        0000000000003990 000990 000014 00  WA  0   0  8
  ...
  [26] .bss              NOBITS          00000000000039d0 0009d0 000008 00  WA  0   0  4
...
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000628 0x000628 R   0x1000
  LOAD           0x000630 0x0000000000001630 0x0000000000001630 0x000180 0x000180 R E 0x1000
  LOAD           0x0007b0 0x00000000000027b0 0x00000000000027b0 0x0001e0 0x000850 RW  0x1000
  LOAD           0x000990 0x0000000000003990 0x0000000000003990 0x000040 0x000048 RW  0x1000
  DYNAMIC        0x0007c8 0x00000000000027c8 0x00000000000027c8 0x0001a0 0x0001a0 RW  0x8
  GNU_RELRO      0x0007b0 0x00000000000027b0 0x00000000000027b0 0x0001e0 0x000850 R   0x1

-z separate-code performs the trick when transitioning from the first RW PT_LOAD segment to the second, whereas -z separate-loadable-segments doesn't.

When MAXPAGESIZE is larger than the actual page size

Let's consider two adjacent PT_LOAD segments. The memory area associated with the first segment ends at alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize) while the memory area associated with the second one starts at alignDown(load[i+1].p_vaddr, pagesize). When the actual page size equals MAXPAGESIZE, the two addresses are identical. However, if the actual page size is smaller, a gap emerges between these addresses.

A typical link unit generally presents three gaps. These gaps might either be unmapped or mapped. When mapped, they necessitate struct vm_area_struct objects within the Linux kernel. As of Linux 6.3.13, the size of struct vm_area_struct is 152 bytes. For instance, 10000 mapped object files would require 10000 * 3 * sizeof(struct vm_area_struct) = 4,560,000 bytes, signifying a considerable memory footprint. You can refer to Extra struct vm_area_struct with ---p created when PAGE_SIZE .

Dynamic loaders typically invoke mmap using PROT_READ, encompassing the whole file, followed by multiple mmap calls using MAP_FIXED and the corresponding flags. When dynamic loaders, like musl, don't process gaps, the gaps retain r--p permissions. However, in glibc's elf/dl-map-segments.h, the has_holes code employs mprotect to transition permissions from r--p to ---p.

While ---p might be perceived as a security enhancement, personally, I don't believe it significantly impacts exploitability. While there might be numerous gadgets in r-xp areas, reducing gadgets in r--p areas doesn't seem notably impactful. (https://isopenbsdsecu.re/mitigations/rop_removal/)
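
The three gaps and the memory estimate can be reproduced with a short Python sketch. It assumes the PT_LOAD values from the -z max-page-size=2097152 readelf output above and a 4096-byte actual page size; find_gaps is an illustrative helper, not kernel or loader code.

```python
VM_AREA_STRUCT_SIZE = 152  # bytes, as of Linux 6.3.13 according to the text

# (p_vaddr, p_memsz) from the -z max-page-size=2097152 example.
LOADS = [(0x000000, 0x640), (0x200000, 0x180),
         (0x400000, 0x1000), (0x600000, 0x48)]


def align_down(x: int, a: int) -> int:
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def find_gaps(loads, pagesize):
    """Return [start, end) ranges between adjacent memory areas."""
    gaps = []
    for (vaddr, memsz), (next_vaddr, _) in zip(loads, loads[1:]):
        end = align_up(vaddr + memsz, pagesize)
        next_start = align_down(next_vaddr, pagesize)
        if next_start > end:
            gaps.append((end, next_start))
    return gaps


gaps = find_gaps(LOADS, 4096)
print([(hex(s), hex(e)) for s, e in gaps])
# Cost if every gap of 10000 mapped files needs its own vm_area_struct.
print(10000 * len(gaps) * VM_AREA_STRUCT_SIZE)
```

When the actual page size equals MAXPAGESIZE, find_gaps returns an empty list for the same input.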

Unmap the gap

When the Linux kernel loads the executable and its interpreter (if present) (fs/binfmt_elf.c), the gap gets unmapped, thereby freeing a struct vm_area_struct object. Implementing a similar approach in dynamic loaders could yield comparable savings.

However, unmapping the gap carries the risk of an unrelated future mmap occupying the gap:

564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504        /sample/build/main
   ================ an unrelated mmap may be placed in the gap
564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main

It is not clear whether the potential occurrence of an unrelated mmap is considered a regression in security. Personally, I don't think this poses a significant issue as the program does not access the gaps. This property can be guaranteed for direct access when input relocations to the linker use symbols with in-bounds addends (e.g. when x is defined relative to an input section, we know R_X86_64_PC32(x) must be in-bounds).

However, some programs may expect contiguous map areas of a file (such as when glibc link_map::l_contiguous is set to 1). Does this choice render the program exploitable if an attacker can ensure a map within the gap instead of outside the file? It seems to me that they could achieve everything with a map outside of the file.

Having said that, the presence of an unrelated map between maps associated with a single file descriptor remains odd, so it's preferable to avoid it if possible.

Extend the memory area to cover the gap

This appears to be the best solution.

When creating a memory area, instead of setting the end to alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize), we can extend the end to min(alignDown(min(load[i+1].p_vaddr), pagesize), alignUp(file_end_addr, pagesize)).

564d8e90f000-564d8e91f000 r--p 00000000 08:05 2519504        /sample/build/main (the end is extended)
564d8e91f000-564d8e920000 r-xp 00010000 08:05 2519504        /sample/build/main

For the last PT_LOAD segment, we could also just use alignDown(min(load[i+1].p_vaddr), pagesize) and ignore alignUp(file_end_addr, pagesize). Accessing a byte beyond the backed file will result in a SIGBUS signal.
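
As a rough illustration of the extended end, here is a Python sketch applied to the /sample/build/main addresses above. extended_end is a hypothetical helper, and the file end address is an assumption: the example only tells us that the file covers at least offset 0x10000 plus one page.

```python
PAGE_SIZE = 4096


def align_down(x: int, a: int) -> int:
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def extended_end(next_start: int, file_end_addr: int,
                 pagesize: int = PAGE_SIZE) -> int:
    """End of a memory area extended to cover the gap before the next one.

    next_start is the run-time address of the next PT_LOAD segment and
    file_end_addr is the address corresponding to the end of the file.
    """
    return min(align_down(next_start, pagesize),
               align_up(file_end_addr, pagesize))


base = 0x564D8E90F000
next_start = 0x564D8E91F000
assumed_file_end = base + 0x11000  # assumption: the file is 0x11000 bytes
print(hex(extended_end(next_start, assumed_file_end)))
# If the file ended inside the first page, the extension would stop there.
print(hex(extended_end(next_start, base + 0x800)))
```

The min keeps the mapping from reaching past the end of the file, where an access would raise SIGBUS.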

A new linker option?

Personally I favor the area end extending approach. I've also pondered whether this falls under the purview of linkers. Such a change seems intrusive and unsightly. If the linker extends the end of p_memsz to cover the gap, should it also extend p_filesz?

If it doesn't, we create a PT_LOAD with p_filesz/p_memsz that is not for BSS, which is weird.
If it does, we have an output file featuring overlapping file offset ranges, which is weird as well.

Moreover, a PT_LOAD whose end isn't backed by a section is unusual. I'm concerned that many binary manipulation tools may not handle this case correctly. Utilizing a linker script can intentionally create discontiguous address ranges. I'm concerned that the linker might not discern such cases with intelligent logic regarding p_filesz/p_memsz.

This feature request seems to be within the realm of loaders and specific information, such as the page size, is only accessible to loaders. I believe loaders are better equipped to handle this task.

Transparent huge pages for mapped files

Some programs optimize their usage of the limited Translation Lookaside Buffer (TLB) by employing transparent huge pages. When the Linux kernel loads an executable, it takes into account the p_align field to create a memory area. If p_align is 4096, the memory area will commence at a multiple of 4096, but not necessarily at a multiple of a huge page.

Transparent huge pages for mapped files have several requirements including:

the memory area's start address and the start file offset align with a huge page (include/linux/huge_mm.h:transhuge_vma_suitable).
CONFIG_READ_ONLY_THP_FOR_FS is enabled (scripts/config -e TRANSPARENT_HUGEPAGE -e TRANSPARENT_HUGEPAGE_MADVISE -e READ_ONLY_THP_FOR_FS)
~~the VMA has the VM_EXEC flag~~ (I removed this condition for v6.8)
the file is not opened for write

When madvise(addr, len, MADV_HUGEPAGE) is called, the kernel code path is do_madvise -> madvise_vma_behavior -> hugepage_madvise -> khugepaged_enter_vma -> thp_vma_allowable_order+__khugepaged_enter.

To ensure that addr-fileoff is a multiple of a huge page, we should link the executable using -z max-page-size= with the huge page size.

In kernels with the VM_EXEC requirement (before v6.8), if we want to remap the file as huge pages from the ELF header, we must specify --no-rosegment to ld.lld.

Build the following program with c++ -fuse-ld=lld -Wl,-z,max-page-size=2097152 and run it. We do not define COLLAPSE for now.
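
Whether a mapping satisfies the addr-fileoff condition can be checked from a /proc/self/maps line. The Python sketch below assumes 2 MiB huge pages; the helper names are made up, and the check covers only this alignment condition, not the other kernel requirements.

```python
HPAGE_SIZE = 1 << 21  # 2 MiB, assumed huge page size


def parse_maps_line(line: str):
    """Return (start, end, file offset) from one /proc/<pid>/maps line."""
    fields = line.split()
    start_s, end_s = fields[0].split("-")
    return int(start_s, 16), int(end_s, 16), int(fields[2], 16)


def addr_minus_fileoff_aligned(line: str, hpage: int = HPAGE_SIZE) -> bool:
    """True if start address minus file offset is a multiple of hpage."""
    start, _end, offset = parse_maps_line(line)
    return (start - offset) % hpage == 0


def contains_huge_page(line: str, hpage: int = HPAGE_SIZE) -> bool:
    """True if the area contains at least one hpage-aligned, hpage-sized range."""
    start, end, _offset = parse_maps_line(line)
    first = (start + hpage - 1) // hpage * hpage
    return first + hpage <= end


linked_2m = "55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119 /home/ray/tmp/test"
linked_4k = "555555554000-555555555000 r--p 00000000 08:05 1 /tmp/c/a"  # inode is made up
print(addr_minus_fileoff_aligned(linked_2m), contains_huge_page(linked_2m))
print(addr_minus_fileoff_aligned(linked_4k), contains_huge_page(linked_4k))
```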

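#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>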

// Adapted from https://mazzo.li/posts/check-huge-page.html
// normal page, 4KiB
#define PAGE_SIZE (1 << 12)
// huge page, 2MiB
#define HPAGE_SIZE (1 << 21)

// See  for
// format which these bitmasks refer to
#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))

extern char __ehdr_start[];
__attribute__((used)) const char pad[HPAGE_SIZE] = {};

// Checks if the page pointed at by `ptr` is huge. Assumes that `ptr` has
// already been allocated.
static void check_huge_page(void *ptr) {
  if (getuid())
    return warnx("not root; skip KPF_THP check");
  int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
  if (pagemap_fd < 0)
    errx(1, "could not open /proc/self/pagemap: %s", strerror(errno));
  int kpageflags_fd = open("/proc/kpageflags", O_RDONLY);
  if (kpageflags_fd < 0)
    errx(1, "could not open /proc/kpageflags: %s", strerror(errno));

  // each entry is 8 bytes long
  uint64_t ent;
  if (pread(pagemap_fd, &ent, sizeof(ent), ((uintptr_t)ptr) / PAGE_SIZE * 8) != sizeof(ent))
    errx(1, "could not read from pagemap\n");

  if (!PAGEMAP_PRESENT(ent))
    errx(1, "page not present in /proc/self/pagemap, did you allocate it?\n");
  if (!PAGEMAP_PFN(ent))
    errx(1, "page frame number not present, run this program as root\n");

  uint64_t flags;
  if (pread(kpageflags_fd, &flags, sizeof(flags), PAGEMAP_PFN(ent) << 3) != sizeof(flags))
    errx(1, "could not read from kpageflags\n");
  if (!(flags & (1ull << KPF_THP)))
    errx(1, "could not allocate huge page\n");
  if (close(pagemap_fd) < 0)
    errx(1, "could not close /proc/self/pagemap: %s", strerror(errno));
  if (close(kpageflags_fd) < 0)
    errx(1, "could not close /proc/kpageflags: %s", strerror(errno));
}

int main() {
  printf("__ehdr_start: %p\n", __ehdr_start);
  int ret, tries = 2;
#ifdef COLLAPSE // use Linux 6.1 MADV_COLLAPSE
  do {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_COLLAPSE);
  } while (ret && errno == EAGAIN && --tries);
  printf("madvise(MADV_COLLAPSE): %d\n", ret);
  if (ret) {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
    if (ret)
      err(1, "madvise");
  }
#else
  ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
  if (ret)
    err(1, "madvise");
#endif

  size_t size = HPAGE_SIZE;
  char *buf = (char *)aligned_alloc(HPAGE_SIZE, size);
  madvise(buf, 2 << 20, MADV_HUGEPAGE);
  *((volatile char *)buf);
  check_huge_page(buf);

  int fd = open("/proc/self/maps", O_RDONLY);
  read(fd, buf, HPAGE_SIZE);
  write(STDOUT_FILENO, buf, strstr(buf, "[stack]\n") - buf + 8);
  close(fd);

#ifndef COLLAPSE
  fd = open("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs", O_RDONLY);
  read(fd, buf, 32);
  close(fd);
  usleep(atoi(buf) * 1000);
#endif

  memcpy(buf, __ehdr_start, HPAGE_SIZE);
  check_huge_page(__ehdr_start);
}

The output looks like:

% g++ test.cc -o ~/tmp/test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152 && sudo ~/tmp/test
__ehdr_start: 0x55f3b1c00000
55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119                 /home/ray/tmp/test
55f3b1e00000-55f3b1e01000 r--p 00200000 103:03 555277119                 /home/ray/tmp/test
55f3b2000000-55f3b2002000 r-xp 00200000 103:03 555277119                 /home/ray/tmp/test
55f3b2201000-55f3b2202000 r--p 00201000 103:03 555277119                 /home/ray/tmp/test
55f3b2401000-55f3b2402000 rw-p 00201000 103:03 555277119                 /home/ray/tmp/test
55f3b3a9a000-55f3b3abb000 rw-p 00000000 00:00 0                          [heap]

Thanks to 周洲仪 for helping me figure out the khugepaged behavior.

usleep gives khugepaged an opportunity to collapse pages (hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush). In the fortunate scenario when this collapse occurs, and the next page fault is triggered (memcpy(buf, __ehdr_start, HPAGE_SIZE)), the kernel will populate the pmd with a huge page (handle_page_fault ...=> handle_pte_fault ...=> do_fault_around => filemap_map_pages ...=> do_set_pmd => set_pmd_at).

However, in an unfortunate case, check_huge_page(__ehdr_start) will fail with could not allocate huge page. scan_sleep_millisecs defaults to 10000 (10 seconds). Reducing the value increases the likelihood of the fortunate case.

Linux 6.1 introduces MADV_COLLAPSE to attempt a synchronous collapse of the native pages mapped by the memory range into Transparent Huge Pages (THPs). While success is not guaranteed, a successful collapse eliminates the need to wait for the khugepaged daemon (madvise_collapse => hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush). In the event of repeated MADV_COLLAPSE failures, a fallback mechanism using MADV_HUGEPAGE can be employed.

% g++ -static -DCOLLAPSE test.cc -o test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152
% sudo ./test
__ehdr_start: 0x200000
madvise(MADV_COLLAPSE): -1
...
test: could not allocate huge page
% sudo ./test
__ehdr_start: 0x55f3b1c00000
madvise(MADV_COLLAPSE): 0
00200000-00429000 r--p 00000000 fd:03 260                                /root/test
00628000-0069f000 r-xp 00228000 fd:03 260                                /root/test
0089e000-008a3000 r--p 0029e000 fd:03 260                                /root/test
00aa2000-00aa5000 rw-p 002a2000 fd:03 260                                /root/test
00aa5000-00aab000 rw-p 00000000 00:00 0
01800000-01822000 rw-p 00000000 00:00 0                                  [heap]
7fd141600000-7fd141800000 rw-p 00000000 00:00 0
7fd141800000-7fd141a00000 rw-p 00000000 00:00 0
7fd141a00000-7fd141a01000 rw-p 00000000 00:00 0
7ffe69edf000-7ffe69f00000 rw-p 00000000 00:00 0                          [stack]

In -z noseparate-code layouts, the file content starts somewhere at the first page, potentially wasting half a huge page on unrelated content. Switching to -z separate-code allows reclaiming the benefits of the half huge page but increases the file size. Balancing these aspects poses a challenge. One potential solution is using fallocate(FALLOC_FL_PUNCH_HOLE), which introduces complexity into the linker. However, this approach feels like a workaround to address a kernel limitation. It would be preferable if a file-backed huge page didn't necessitate a file offset aligned to a huge page boundary.

Cost of RELRO

To accommodate PT_GNU_RELRO, the RW region will possess two permissions after the runtime linker maps the program. While GNU ld provides one RW segment split by the dynamic loader, lld employs two explicit RW PT_LOAD segments. After relocation resolving, the effects of lld and GNU ld are similar.

For those curious, explore my notes on GNU ld's file size increase due to RELRO.

Due to RELRO, covering the two RW PT_LOAD segments necessitates a minimum of 2 (huge) pages. In contrast, without RELRO, only one (huge) page is required at minimum. This means potentially wasting up to MAXPAGESIZE-1 bytes, which could otherwise be utilized to cover more data.
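The page count argument can be made concrete with a little arithmetic. The sketch below assumes each part starts on a page boundary, which is a simplification of real layouts; pagesFor, pagesWithRelro, and pagesWithoutRelro are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Pages needed to cover `size` bytes that start on a page boundary.
inline std::uint64_t pagesFor(std::uint64_t size, std::uint64_t pageSize) {
  assert(pageSize != 0);
  return (size + pageSize - 1) / pageSize;
}

// The RELRO part and the non-RELRO part end up with different permissions,
// so they cannot share a page: each part is rounded up separately.
inline std::uint64_t pagesWithRelro(std::uint64_t relroSize,
                                    std::uint64_t rwSize,
                                    std::uint64_t pageSize) {
  return pagesFor(relroSize, pageSize) + pagesFor(rwSize, pageSize);
}

// Without RELRO the whole RW region is one contiguous run of pages.
inline std::uint64_t pagesWithoutRelro(std::uint64_t relroSize,
                                       std::uint64_t rwSize,
                                       std::uint64_t pageSize) {
  return pagesFor(relroSize + rwSize, pageSize);
}
```

For example, with 2 MiB pages, 64 KiB of RELRO data plus 128 KiB of other writable data needs two huge pages with RELRO and one without.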

Nowadays, RELRO is considered a security baseline and removing it might unsettle security-minded individuals.

Linker notes on PE/COFF

2023-12-03T08:00:00.000Z

This article describes linker notes about Portable Executable (PE) and Common Object File Format (COFF) used on Windows and UEFI environments.

In ELF, an object file can be a relocatable file, an executable file, or a shared object file. On Windows, the term "object file" usually refers to relocatable files like ELF. Such files use the Common Object File Format (COFF) while image files (e.g. executables and DLLs) use the Portable Executable (PE) format.

Input files

The input files to the linker can be object files, archive files, and import libraries. GNU ld and lld-link allow linking against DLL files without an import library.

Object files

Import files

An import file (.lib) is a special archive file. Each member represents a symbol to be imported. The symbol __imp_$sym is inserted into the global symbol table.

The import header has a Type field indicating IMPORT_OBJECT_CODE/IMPORT_OBJECT_DATA/IMPORT_OBJECT_CONST.

For an import type of IMPORT_OBJECT_CONST, the symbol $sym is defined as an alias for __imp_$sym.

For an import type of IMPORT_OBJECT_CODE, the symbol $sym is defined as an import thunk, which is like a PLT entry in ELF.

GNU ld and lld-link allow linking against DLL files without an import library. The behavior is as if the linker synthesizes an import library from a DLL file.

Symbols

An object file contributes defined and undefined symbols. An import file contributes defined symbols in a DLL that can be referenced by __imp_$sym.

A defined symbol can be any of the following kinds:

special (ignored in the global symbol table)
common (section number is IMAGE_SYM_UNDEFINED and value is not 0)
absolute (section number is -1)
regular (section number is positive)

An undefined symbol has a storage class of IMAGE_SYM_CLASS_EXTERNAL, a section number of IMAGE_SYM_UNDEFINED (zero), and a value of zero.

An undefined symbol with a storage class of IMAGE_SYM_CLASS_WEAK_EXTERNAL is a weak external, which is actually like a weak definition in ELF.
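The rules above can be condensed into a small classifier. This is a sketch under stated assumptions: the numeric constants follow the PE/COFF specification, while Kind, CoffSymbol, and classify are hypothetical names, and treating every remaining case as "special" is a simplification.

```cpp
#include <cassert>
#include <cstdint>

// Values as given in the PE/COFF specification.
constexpr int IMAGE_SYM_UNDEFINED = 0;
constexpr int IMAGE_SYM_ABSOLUTE = -1;
constexpr int IMAGE_SYM_DEBUG = -2;
constexpr int IMAGE_SYM_CLASS_EXTERNAL = 2;
constexpr int IMAGE_SYM_CLASS_WEAK_EXTERNAL = 105;

enum class Kind { Undefined, WeakExternal, Common, Absolute, Regular, Special };

struct CoffSymbol {
  int sectionNumber;
  std::uint32_t value;
  int storageClass;
};

inline Kind classify(const CoffSymbol &s) {
  assert(s.sectionNumber >= IMAGE_SYM_DEBUG);  // no smaller section number exists
  if (s.sectionNumber > 0)
    return Kind::Regular;
  if (s.sectionNumber == IMAGE_SYM_ABSOLUTE)
    return Kind::Absolute;
  if (s.sectionNumber == IMAGE_SYM_UNDEFINED) {
    if (s.storageClass == IMAGE_SYM_CLASS_WEAK_EXTERNAL)
      return Kind::WeakExternal;
    if (s.storageClass == IMAGE_SYM_CLASS_EXTERNAL)
      return s.value == 0 ? Kind::Undefined : Kind::Common;
  }
  // Assumption: everything else (e.g. debug symbols) is ignored in the
  // global symbol table.
  return Kind::Special;
}
```

Note that a common symbol and an undefined symbol differ only in whether the value field (the size, for a common symbol) is zero.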

PE requires explicit annotations for exported symbols and imported symbols in DLL files. There are differences between code symbols and data symbols.

COMDAT

Refer to COMDAT and section group.

Imported code symbols

// b.dll
__declspec(dllexport) void f() {}

// a.exe
void local(void) {}
void __declspec(dllimport) f(void);
int main(void) {
  local();
  f();
}

Linking b.dll gives us b.lib (see "Import files" above).

# b.dll
.globl f
f:

.section        .drectve,"yni"
.ascii  " -export:f"

a.obj has two function calls. The call to f references the prefixed symbol __imp_f.

# a.obj
  callq   local
  callq   *__imp_f(%rip)

call *__imp_f(%rip) is like -fno-plt codegen for ELF. In this case when we know that f is defined elsewhere, the generated code is more efficient.

When linking a.exe, we need to pass the import file b.lib as an input file. The linker parses the import file and creates a definition for __imp_f pointing to the import address table entry.

TODO import table

Actually, when __imp_f is defined, the unprefixed symbol f is also defined. Normally, the unprefixed f is unused and will be discarded. However, if the user code calls the unprefixed symbol (e.g. call f; like ELF -fplt), the f definition will be retained in the linker output and point to a thunk:

  call f  # generated code without using dllimport

f:  # x86-64 thunk
  jmpq *__imp_f(%rip)

Different architectures have different thunk implementations.

// x86-32 and x86-64
jmp *0x0   // references an entry in the import address table

// AArch32
mov.w ip, #0
mov.t ip, #0
ldr.w pc, [ip]

// AArch64
adrp x16, #0
ldr x16, [x16]
br x16

TODO link.exe will issue a warning.

Imported data symbols

// b.dll
__declspec(dllexport) int var;

// a.exe
int local_var;
__declspec(dllimport) extern int var;
int main() { return local_var + var;  }

# b.dll
.bss
.globl var
var:

.section        .drectve,"yni"
.ascii  " -export:var,data"

The linker parses the import file and creates a definition for __imp_var pointing to the import address table entry. Unlike a code symbol, the linker does not create a definition for var (without the __imp_ prefix).

With a dllimport:

movq __imp_var(%rip), %rax
movl (%rax), %eax

If dllimport is not specified, we get a reference to the unprefixed symbol:

movq var(%rip), %rax

link.exe will report an error.

MinGW implements runtime pseudo relocations to patch the text section so that absolute pointers and relative offsets to the symbol will be rewritten to bind to the actual definition.

movq var(%rip), %rax # the runtime will rewrite this to point to the definition in b.dll

If the variable is defined out of the +-2GiB range from the current location, the runtime pseudo relocation can't fix the issue. See crt: Check pseudo relocations for overflows and error out clearly.
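The +-2GiB limit comes from the signed 32-bit displacement in a RIP-relative operand. Here is a minimal sketch of the range check; fitsInRel32 and rel32For are hypothetical helpers, not functions from the mingw-w64 runtime, and the assert merely stands in for "error out clearly".

```cpp
#include <cassert>
#include <cstdint>

// True if `target` is reachable through a signed 32-bit displacement
// measured from `nextInsn`, the address right after the instruction.
inline bool fitsInRel32(std::uint64_t nextInsn, std::uint64_t target) {
  std::int64_t disp = static_cast<std::int64_t>(target - nextInsn);
  return disp >= INT32_MIN && disp <= INT32_MAX;
}

// The displacement a fixup would store into the 32-bit field.
inline std::int32_t rel32For(std::uint64_t nextInsn, std::uint64_t target) {
  assert(fitsInRel32(nextInsn, target) && "target is out of the +-2GiB range");
  return static_cast<std::int32_t>(target - nextInsn);
}
```

An image loaded near 0x140000000 and a DLL loaded near 0x7ff800000000 are far more than 2GiB apart, so a patched movq var(%rip), %rax could not reach the variable in that layout.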

For a non-definition declaration, GCC conservatively assumes that the variable may be defined in a DLL and generates an indirection. This is similar to a GOT code sequence in ELF.

extern int extern_var;
int main() { return extern_var; }

// MSVC
  movl    extern_var(%rip), %eax

// GCC
  movq    .refptr.extern_var(%rip), %rax
  movl    (%rax), %eax

  .section        .rdata$.refptr.extern_var,"dr",discard,.refptr.extern_var
  .p2align        3, 0x0
  .globl  .refptr.extern_var
.refptr.extern_var:
  .quad   extern_var

Non-dllexport definition and dllimport

A dllimport symbol referenced by an object file is normally satisfied by an import file. link.exe allows another object file to provide the definition. In such a case, link.exe will issue a warning (Linker Tools Warning LNK4217). lld-link has implemented this feature for compatibility.

echo '__declspec(dllimport) int foo(); int main() { return foo(); }' > a.cc
echo 'int foo() { return 42; }' > b.cc
clang-cl -c a.cc b.cc
lld-link -nodefaultlib -entry:main a.obj b.obj

lld-link: warning: a.obj: locally defined symbol imported: int __cdecl foo(void) (defined in b.obj) [LNK4217]

MinGW

MinGW provides auto exporting and auto importing features to make PE DLL files work like ELF shared objects. When producing a DLL file, if no symbol is chosen to be exported, almost all symbols are exported by default (--export-all-symbols).

If an undefined symbol $sym is unresolved and __imp_$sym is defined, $sym will be aliased to __imp_$sym. TODO: example

If the symbol .refptr.$sym is present, it will be aliased to __imp_$sym as well. mingw-w64 defaults to -mcmodel=medium and uses .refptr.$sym. TODO: example

https://github.com/ziglang/zig/issues/9845

Manual `__imp_` definition

The user can define the __imp_ symbol manually instead of letting the linker do so.

https://github.com/llvm/llvm-project/issues/57982

$ cat lto-dllimp1.c
void __declspec(dllimport) importedFunc(void);
void other(void);

void entry(void) {
    importedFunc();
    other();
}
$ cat lto-dllimp2.c
static void importedFuncReplacement(void) {
}
void (*__imp_importedFunc)(void) = importedFuncReplacement;

void other(void) {
}

Shared library comparison with ELF

The design of shared libraries saw major advancements around 1988. Before 1988, there were shared library implementations in the a.out and COFF object file formats, but they had severe limitations, such as fixed addresses and the requirement of extra files like import files.

Such limitations are evidenced in 1986 Summer USENIX Technical Conference & Exhibition Proceedings, Shared Libraries on UNIX System V from AT&T. Its shared library (presumably using the COFF object file format) must have a fixed virtual address, which is called a "static shared library" in the terminology of Linkers and Loaders.

In 1988, SunOS 4.0 was released with an extended a.out binary format with dynamic shared library support. Unlike previous static shared library schemes, the a.out shared libraries are position independent and can be loaded at different addresses. The dynamic linker source code is available somewhere and I find that its GOT and PLT schemes are exactly like what we have for ELF today.

AT&T and Sun collaborated to create the first System V release 4 ABI (using ELF). AT&T contributed the ELF object format. Sun contributed all of the dynamic linking implementation from SunOS 4.x. In 1992, SunOS 5.0 (Solaris 2.0) switched to ELF.

For ELF, the designers tried to make shared libraries similar to static libraries. There is no need to annotate export and import symbols to work with shared libraries.

I cannot find more information about System V release 3's shared library support, but the Windows DLL is assuredly inspired by it, given that the PE object file format is based on COFF and the PE specification refers to COFF in numerous places.

So, is the shared library design in ELF more advanced? It is. However, two aspects are worth deep thought.

The manual export and import annotations have their strengths.
Choices made to make ELF shared libraries flexible had major downsides.
- Performance downside due to symbol interposition on the compiler side. See -fno-semantic-interposition
- Performance downside due to symbol interposition on the linker and loader side. See ELF interposition and -Bsymbolic
- Underlinking problems exacerbated by the -z undefs default in linkers. See Dependency related linker options.

Limitations

The number of symbols cannot exceed 65535. Several open-source projects have faced problems that a DLL file cannot export more than 65535 symbols. (GNU ld has a diagnostic error: export ordinal too large:).

A section header has only 8 bytes for the name field. link.exe truncates long section names to 8 bytes. For a section with a long name and the IMAGE_SCN_MEM_DISCARDABLE flag, lld uses a non-standard string table and issues a warning.
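To illustrate the 8-byte limit, here is a sketch of the truncating behavior only (it does not model the string table that lld uses). packSectionName and unpackSectionName are hypothetical helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Fill the 8-byte name field of a section header. Shorter names are
// NUL-padded, an 8-byte name has no terminator, longer names are truncated.
inline std::array<char, 8> packSectionName(const std::string &name) {
  assert(name.find('\0') == std::string::npos);
  std::array<char, 8> field{};  // zero-initialized
  std::size_t n = name.size() < field.size() ? name.size() : field.size();
  std::memcpy(field.data(), name.data(), n);
  return field;
}

// Read the name back, stopping at the first NUL or after 8 bytes.
inline std::string unpackSectionName(const std::array<char, 8> &field) {
  std::size_t n = 0;
  while (n < field.size() && field[n] != '\0')
    n++;
  return std::string(field.data(), n);
}
```

With this scheme .debug_info and .debug_abbrev would both read back as .debug_i (and .debug_a respectively would not collide, but .debug_line and .debug_loc would), which is why long names need a side table.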

COMDAT limitation: MSVC link.exe will report a duplicate symbol error (error LNK2005) for an external symbol defined in an IMAGE_COMDAT_SELECT_ASSOCIATIVE section, even if it would be discarded after handling the leader symbol.

DSO undef and non-exported definition

2023-10-31T07:00:00.000Z

DSO undef and non-exported def

If a DSO has an undefined STB_GLOBAL symbol that is defined in a relocatable object file but not exported, should the --no-allow-shlib-undefined feature report an error? You may want to check out Dependency related linker options for a discussion of this option and the symbol exporting rule.

For quite some time, the --no-allow-shlib-undefined feature has been implemented in lld/ELF as follows:

for (SharedFile *file : ctx.sharedFiles) {
  bool allNeededIsKnown =
      llvm::all_of(file->dtNeeded, [&](StringRef needed) {
        return symtab.soNames.count(CachedHashStringRef(needed));
      });
  if (!allNeededIsKnown)
    continue;
  for (Symbol *sym : file->requiredSymbols)
    if (sym->isUndefined() && !sym->isWeak())
      diagnose("undefined reference due to --no-allow-shlib-undefined: " +
               toString(*sym) + "\n>>> referenced by " + toString(file));
}

Recently I noticed that GNU ld implemented a related error in April 2003 (discussion).

echo '.globl _start; _start: call shared' > main.s && clang -c main.s
echo '.globl shared; shared: call foo' > a.s && clang -shared -fpic a.s -o a.so
echo '.globl foo; foo:' > def.s && clang -c def.s && clang -shared def.o -o def.so
echo '.globl foo; .hidden foo; foo:' > def-hidden.s && clang -c def-hidden.s

% ld.bfd main.o a.so def.o
% ld.bfd main.o a.so def-hidden.o
ld.bfd: a.out: hidden symbol `foo' in def-hidden.o is referenced by DSO
ld.bfd: final link failed: bad value

A non-local default or protected visibility symbol can satisfy a DSO reference. The linker will export the symbol to the dynamic symbol table. Therefore, ld.bfd main.o a.so def.o succeeds as intended.

We encounter an error for ld.bfd main.o a.so def-hidden.o because a symbol with hidden visibility cannot be exported, and it's unable to satisfy the reference in a.so at run-time.

Here is another interesting case: we use a version script to change the binding of a defined symbol to STB_LOCAL, causing it to be unable to satisfy the reference in a.so at run-time. GNU ld also reports an error in this case.

% ld.bfd --version-script=local.ver main.o a.so def.o
ld.bfd: a.out: local symbol `foo' in def.o is referenced by DSO
ld.bfd: final link failed: bad value
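The three ld.bfd links above follow one rule: a definition in the executable can satisfy a DSO reference only if it can be exported to the dynamic symbol table. A minimal model of that rule is sketched below; Definition, canSatisfyDsoReference, and checkExamples are hypothetical names, not linker internals.

```cpp
#include <cassert>

enum class Binding { Local, Global, Weak };
enum class Visibility { Default, Protected, Hidden, Internal };

struct Definition {
  Binding binding;
  Visibility visibility;
  bool localizedByVersionScript;  // matched a local: pattern
};

inline bool canSatisfyDsoReference(const Definition &d) {
  // A local symbol, or one whose binding a version script changes to
  // STB_LOCAL, never reaches the dynamic symbol table.
  if (d.binding == Binding::Local || d.localizedByVersionScript)
    return false;
  // Hidden and internal symbols cannot be exported either.
  return d.visibility == Visibility::Default ||
         d.visibility == Visibility::Protected;
}

// The three cases discussed in the text.
inline void checkExamples() {
  // def.o: succeeds
  assert(canSatisfyDsoReference({Binding::Global, Visibility::Default, false}));
  // def-hidden.o: "hidden symbol `foo' ... is referenced by DSO"
  assert(!canSatisfyDsoReference({Binding::Global, Visibility::Hidden, false}));
  // def.o with local.ver: "local symbol `foo' ... is referenced by DSO"
  assert(!canSatisfyDsoReference({Binding::Global, Visibility::Default, true}));
}
```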

My recent commit https://github.com/llvm/llvm-project/commit/1981b1b6b92f7579a30c9ed32dbdf3bc749c1b40 strengthened LLD's --no-allow-shlib-undefined to detect cases in which the non-exported definitions are garbage-collected. I have landed https://github.com/llvm/llvm-project/pull/70769 to cover non-garbage-collected cases for LLD 18.

DSO undef, non-exported def, and DSO def

A variation of the scenario mentioned above occurs when a DSO definition is also present. Even if the executable does not export foo, another DSO (def.so) may provide it. GNU ld's check allows for this case.

ld.bfd main.o a.so def-hidden.o def.so # succeeded
ld.lld main.o a.so def-hidden.o def.so # failed after commit 1981b1b6b92f7579a30c9ed32dbdf3bc749c1b40

It turns out that https://github.com/llvm/llvm-project/commit/1981b1b6b92f7579a30c9ed32dbdf3bc749c1b40 unexpectedly strengthened --no-allow-shlib-undefined to also catch this ODR violation. More precisely, when all three conditions are met, the new --no-allow-shlib-undefined code reports an error.

There is a DSO undef that can be satisfied by a definition from another DSO (referred to as SharedSymbol in lld/ELF).
The SharedSymbol is overridden by a non-exported (usually of hidden visibility) definition in a relocatable object file (Defined).
The section containing the Defined is garbage-collected (it is not part of .dynsym and is not marked as live).

An exported symbol is a GC root, making its section live. A non-exported symbol, however, can be discarded when its section is discarded.

So, is this error legitimate? At run-time, the undefined symbol foo in a.so will be bound to def.so, even if the executable does not export foo, so we are fine. This suggests that the --no-allow-shlib-undefined code probably should not report an error.

However, both def-hidden.o and def.so define foo, and we know the definitions are different and less likely benign. At the very least, they are not exactly the same due to different visibilities or one being localized by a version script.

A real-world report boils down to

% ld.lld @response.txt -y _Znam
...
libfdio.so: reference to _Znam
libclang_rt.asan.so: shared definition of _Znam
libc++.a(stdlib_new_delete.cpp.obj): definition of _Znam
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: _Znam
>>> referenced by libfdio.so

How does libfdio.so obtain a reference to _Znam? Well, libfdio.so is linked against both libclang_rt.asan.so and libc++.a. Due to symbol processing rules, the definition from libclang_rt.asan.so takes precedence. (See Symbol processing#Shared object overriding archive.)

An appropriate solution is to replace libc++.a with an AddressSanitizer-instrumented version that does not define _Znam.

I have also encountered issues stemming from the combination of multiple definitions from libgcc.a (with hidden visibility) and libclang_rt.builtins.a (with default visibility), relying on archive member extraction rules.

% ld.lld @response.txt -y __divti3
...
a.so: reference to __divti3
libgcc.a(_divdi3.o): definition of __divti3
libc++.so: shared definition of __divti3
# A lazy symbol in libclang_rt.builtins.a is not reported by -y
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __divti3
>>> referenced by a.so

a.so is linked against libc++.so and libclang_rt.builtins.a and obtains a reference to __divti3 due to libc++.so. For the executable link, the undesired situation arises as the definition in libgcc.a takes precedence. What we actually want is for libgcc.a to provide the missing components from libclang_rt.builtins.a.

Some users compile relocatable object files with -fvisibility=hidden to disallow dynamic linking. However, when their system includes specific shared objects, it increases the risk of conflicting multiple definition symbols.

While this additional check introduced in https://github.com/llvm/llvm-project/commit/1981b1b6b92f7579a30c9ed32dbdf3bc749c1b40 may not perfectly fit into --no-allow-shlib-undefined, I believe it has value. As a result, I have proposed --[no-]allow-non-exported-symbols-shared-with-dso. However, I am also on the fence about introducing a new option, as it may not get used.

Technically, the check can be extended to default visibility to catch all link-time symbol interposition. However, I suspect that there are a lot of benign violations and in the absence of an ignore list mechanism, this extension will not be useful.

AddressSanitizer: global variable instrumentation

2023-10-15T07:00:00.000Z

AddressSanitizer (ASan) is a compiler technology that detects addressability-related memory errors with some additional checks. It consists of two components: compiler instrumentation and a runtime library. To put it simply,

The compiler instruments global variables, stack frames, and heap allocations to monitor shadow memory.
The compiler also instruments memory access instructions to verify shadow memory.
In case of an error, the inserted code invokes a callback (implemented in the runtime library) to report the error along with a stack trace. Typically, the program will terminate after displaying the error message.

This article describes global variable instrumentation.

Global variable instrumentation

AddressSanitizer instruments certain defined global variables of LLVM external or internal linkage. To be instrumented, the variable must satisfy a bunch of conditions.

It is not thread-local.
Its alignment is not too large.
It is not synthesized by LLVM.
It does not have the no_sanitize_address attribute in LLVM IR. Variables receive this attribute when annotated as __attribute__((no_sanitize("address"))) or __attribute__((disable_sanitizer_instrumentation)) in C/C++.

int g0;
const long g1 = 42;

Each instrumented global variable is padded with a right redzone to detect out-of-bounds accesses.

@g0 = dso_local global { i32, [28 x i8] } zeroinitializer, comdat, align 32
@g1 = dso_local constant { i64, [24 x i8] } zeroinitializer, comdat, align 32
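The paddings above (28 bytes after a 4-byte int, 24 bytes after an 8-byte long) are consistent with the sizing rule sketched below. This is modeled on a reading of LLVM's instrumentation pass rather than copied from it: redzoneSizeForGlobal, kMinRedzone, and kMaxRedzone are hypothetical names, and the policy for objects larger than 16 bytes, including the cap, should be treated as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMinRedzone = 32;       // matches "align 32" above
constexpr std::uint64_t kMaxRedzone = 1 << 18;  // assumed upper bound

inline std::uint64_t redzoneSizeForGlobal(std::uint64_t size) {
  std::uint64_t rz;
  if (size <= kMinRedzone / 2) {
    // Small objects such as int: pad up to exactly one 32-byte unit.
    rz = kMinRedzone - size;
  } else {
    // Roughly a quarter of the object size, clamped to [min, max].
    rz = std::clamp((size / kMinRedzone / 4) * kMinRedzone, kMinRedzone,
                    kMaxRedzone);
    // Round the padded size up to a multiple of the minimum redzone.
    if (size % kMinRedzone)
      rz += kMinRedzone - size % kMinRedzone;
  }
  assert((size + rz) % kMinRedzone == 0);
  return rz;
}
```

Under this rule, a 10-byte array receives a 22-byte redzone, so it also occupies 32 bytes in total.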

On ELF platforms, by default (since Clang 17.0) each instrumented global variable receives an associated __asan_global_$name variable, which is located within the asan_globals section. Additionally, there are several related variables, including some unnamed ones (@0 and @1), as well as __odr_asan_gen_g0 and __odr_asan_gen_g1, along with metadata nodes (!0 and !1), which we will discuss in more detail later.

@___asan_gen_.1 = private unnamed_addr constant [3 x i8] c"g0\00", align 1
@___asan_gen_.2 = private unnamed_addr constant [3 x i8] c"g1\00", align 1
@__asan_global_g0 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @0 to i64), i64 4, i64 32, i64 ptrtoint (ptr @___asan_gen_.1 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 ptrtoint (ptr @__odr_asan_gen_g0 to i64) }, section "asan_globals", comdat($g0), !associated !0
@__asan_global_g1 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @1 to i64), i64 8, i64 32, i64 ptrtoint (ptr @___asan_gen_.2 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 ptrtoint (ptr @__odr_asan_gen_g1 to i64) }, section "asan_globals", comdat($g1), !associated !1
@llvm.compiler.used = appending global [4 x ptr] [ptr @g0, ptr @g1, ptr @__asan_global_g0, ptr @__asan_global_g1], section "llvm.metadata"

!0 = !{ptr @g0}
!1 = !{ptr @g1}

The module constructor asan.module_ctor processes garbage-collectable asan_globals input sections. This constructor invokes a runtime callback to register the instrumented global variables, which involves poisoning the redzone and conducting ODR violation checks. I will discuss ODR violation checking later.

define internal void @asan.module_ctor() #0 comdat {
  call void @__asan_init()
  call void @__asan_version_mismatch_check_v8()
  call void @__asan_register_elf_globals(i64 ptrtoint (ptr @___asan_globals_registered to i64), i64 ptrtoint (ptr @__start_asan_globals to i64), i64 ptrtoint (ptr @__stop_asan_globals to i64))
  ret void
}

The runtime poisons the redzone of each instrumented global variable.

void __asan_register_elf_globals(uptr *flag, void *start, void *stop) {
  if (*flag) return;
  if (!start) return;
  CHECK_EQ(0, ((uptr)stop - (uptr)start) % sizeof(__asan_global));
  __asan_global *globals_start = (__asan_global*)start;
  __asan_global *globals_stop = (__asan_global*)stop;
  __asan_register_globals(globals_start, globals_stop - globals_start);
  *flag = 1;
}

void __asan_register_globals(__asan_global *globals, uptr n) {
  if (!flags()->report_globals) return;
  ...
  for (uptr i = 0; i < n; i++)
    RegisterGlobal(&globals[i]);

  // Poison the metadata. It should not be accessible to user code.
  PoisonShadow(reinterpret_cast<uptr>(globals), n * sizeof(__asan_global),
               kAsanGlobalRedzoneMagic);
}

static void RegisterGlobal(const Global *g) {
  ...
  if (CanPoisonMemory())
    PoisonRedZones(*g);
}

Every full granule in the shadow of the redzone is filled with 0xf9 (kAsanGlobalRedzoneMagic) while a partial granule is filled in a manner similar to partially-addressable stack memory.

ALWAYS_INLINE void PoisonRedZones(const Global &g) {
  uptr aligned_size = RoundUpTo(g.size, ASAN_SHADOW_GRANULARITY);
  FastPoisonShadow(g.beg + aligned_size, g.size_with_redzone - aligned_size,
                   kAsanGlobalRedzoneMagic);
  if (g.size != aligned_size) {
    FastPoisonShadowPartialRightRedzone(
        g.beg + RoundDownTo(g.size, ASAN_SHADOW_GRANULARITY),
        g.size % ASAN_SHADOW_GRANULARITY, ASAN_SHADOW_GRANULARITY,
        kAsanGlobalRedzoneMagic);
  }
}
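To make the resulting shadow layout concrete, here is a small model of what the shadow bytes of one padded global look like after poisoning. It is a sketch, not the runtime's code: shadowForGlobal is a hypothetical helper, and it assumes an 8-byte shadow granularity and a global that starts on a granule boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kGranule = 8;  // one shadow byte per 8 application bytes
constexpr std::uint8_t kGlobalRedzoneMagic = 0xf9;

// Shadow bytes covering a global of `size` bytes padded to `sizeWithRedzone`.
inline std::vector<std::uint8_t> shadowForGlobal(std::uint64_t size,
                                                 std::uint64_t sizeWithRedzone) {
  assert(size <= sizeWithRedzone && sizeWithRedzone % kGranule == 0);
  std::vector<std::uint8_t> shadow;
  for (std::uint64_t off = 0; off < sizeWithRedzone; off += kGranule) {
    if (off + kGranule <= size)
      shadow.push_back(0);  // all 8 bytes addressable
    else if (off < size)
      shadow.push_back(static_cast<std::uint8_t>(size - off));  // first k bytes
    else
      shadow.push_back(kGlobalRedzoneMagic);  // pure redzone
  }
  return shadow;
}
```

For a 10-byte array padded to 32 bytes this yields 00 02 f9 f9: one full granule, one granule with 2 addressable bytes, and two redzone granules.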

global-buffer-overflow example

If an access occurs within a redzone byte poisoned by 0xf9 or within a partial redzone preceding 0xf9, the runtime will report a global-buffer-overflow error. Here is an example:

cat > a.c <<e
#include <string.h>
int main(int argc, char **argv) {
  static char a[10];
  memset(a, 0, 10);
  return a[argc * 5];
}
e
clang -fsanitize=address a.c -o a

% ./a 1  # a[argc * 5] == a[10] is out-of-bounds
=================================================================
==240472==ERROR: AddressSanitizer: global-buffer-overflow on address 0x5592092356aa at pc 0x5592088dc38f bp 0x7ffd457ab520 sp 0x7ffd457ab518
READ of size 1 at 0x5592092356aa thread T0
    #0 0x5592088dc38e  (/tmp/c/a+0x14238e)
    #1 0x7fd59d38f6c9  (/lib/x86_64-linux-gnu/libc.so.6+0x276c9) (BuildId: 2ac5fa07c22f99cfd5dc47c70cd5f0e78b974269)
    #2 0x7fd59d38f784  (/lib/x86_64-linux-gnu/libc.so.6+0x27784) (BuildId: 2ac5fa07c22f99cfd5dc47c70cd5f0e78b974269)
    #3 0x559208800f80  (/tmp/c/a+0x66f80)

0x5592092356aa is located 0 bytes after global variable 'main.a' defined in 'a.c' (0x5592092356a0) of size 10
SUMMARY: AddressSanitizer: global-buffer-overflow (/tmp/c/a+0x14238e)
Shadow bytes around the buggy address:
  0x559209235400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x559209235680: 00 00 00 00 00[02]f9 f9 00 00 00 00 00 00 00 00
  0x559209235700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x559209235900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
...
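The [02] shadow byte in the report explains why a[10] is flagged while a[9] is not. The check that instrumented code performs can be sketched as follows; accessIsBad is a hypothetical name, and the sketch assumes an 8-byte granule and an access that does not cross a granule boundary.

```cpp
#include <cassert>
#include <cstdint>

// Decide whether an access of `accessSize` bytes at `addr` must be reported,
// given the shadow byte that covers addr's 8-byte granule.
inline bool accessIsBad(std::uint64_t addr, unsigned accessSize,
                        std::uint8_t shadowByte) {
  assert(accessSize >= 1 && (addr & 7) + accessSize <= 8);
  if (shadowByte == 0)
    return false;  // the whole granule is addressable
  if (shadowByte >= 0x80)
    return true;   // poison values such as 0xf9
  // Partial granule: only the first `shadowByte` bytes are addressable.
  std::uint64_t lastByte = (addr & 7) + accessSize - 1;
  return lastByte >= shadowByte;
}
```

In the report, 0x5592092356aa is at offset 2 within its granule and the shadow byte is 02, so the one-byte read touches the first non-addressable byte.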

ODR violation checker

The global variable poisoning mechanism offers a straightforward means to detect differences in variable definitions between two components, such as between the main executable and a shared object, or between two shared objects. This can be considered a category of ODR violations.

echo 'int var; int main() { return var; }' > a.cc
echo 'long var;' > b.cc
clang++ -fpic -fsanitize=address -shared b.cc -o b.so
clang++ -fsanitize=address a.cc ./b.so -o a

% ./a
=================================================================
==1299789==ERROR: AddressSanitizer: odr-violation (0x56107ea3f500):
  [1] size=4 'var' a.cc in /tmp/c/a
  [2] size=8 'var' b.cc in ./b.so
These globals were registered at these points:
  [1]:
    #0 0x56107df99996  (/tmp/c/a+0x7b996)
    #1 0x56107df9aab9  (/tmp/c/a+0x7cab9)
    #2 0x7f72e5a457f5  (/lib/x86_64-linux-gnu/libc.so.6+0x277f5) (BuildId: 2ac5fa07c22f99cfd5dc47c70cd5f0e78b974269)

  [2]:
    #0 0x56107df99996  (/tmp/c/a+0x7b996)
    #1 0x56107df9aab9  (/tmp/c/a+0x7cab9)
    #2 0x7f72e604dd2d  (/lib64/ld-linux-x86-64.so.2+0x4d2d) (BuildId: accffc5784c4a469d09348e3f7ec53a74096fbd3)

==1299789==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'var' at a.cc in /tmp/c/a
==1299789==ABORTING

The default mode, detect_odr_violation=2, also prohibits symbol interposition on variables. If you change long to int in b.cc, you will still encounter an odr-violation error. In contrast, with detect_odr_violation=1, errors are suppressed if the registered variables are of the same size.

% ASAN_OPTIONS=detect_odr_violation=1 ./a
% ASAN_OPTIONS=detect_odr_violation=2 ./a
=================================================================
==2574052==ERROR: AddressSanitizer: odr-violation (0x562d39db1200):
...

For a variable named $var, a one-byte variable, __odr_asan_gen_$var, is created with the original linkage (essentially must be external) and visibility.

If $var is defined in two instrumented modules, their __odr_asan_gen_$var symbols refer to the same copy due to symbol interposition. When registering $var, the runtime checks whether __odr_asan_gen_$var is already 1, and if yes, the program has an ODR violation; otherwise __odr_asan_gen_$var is set to 1.

@__odr_asan_gen_g0 = global i8 0, align 1
@__odr_asan_gen_g1 = global i8 0, align 1

@0 = private alias { i32, [28 x i8] }, ptr @g0
@1 = private alias { i32, [28 x i8] }, ptr @g1

The private aliases @0 and @1 were due to http://reviews.llvm.org/D15642.
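The indicator scheme, together with the two detect_odr_violation levels described earlier, can be modeled in a few lines. This is a simplified sketch and not the runtime's implementation: Global, Runtime, and registerGlobal are hypothetical names, and level 0 (checking disabled) is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Global {
  std::string name;
  std::size_t size;
  unsigned char *odrIndicator;  // stands in for __odr_asan_gen_$var
};

struct Runtime {
  int detectOdrViolation = 2;  // mirrors ASAN_OPTIONS=detect_odr_violation
  std::vector<Global> registered;

  // Returns true if registering `g` would be reported as an odr-violation.
  bool registerGlobal(const Global &g) {
    assert(g.odrIndicator != nullptr);
    bool violation = false;
    if (*g.odrIndicator != 0) {
      // Another module already registered a variable sharing this indicator.
      for (const Global &prev : registered)
        if (prev.odrIndicator == g.odrIndicator &&
            (detectOdrViolation >= 2 || prev.size != g.size))
          violation = true;
    }
    *g.odrIndicator = 1;
    registered.push_back(g);
    return violation;
  }
};
```

Two modules that both define var share one indicator byte after symbol interposition, so the second registration sees the byte already set; with level 1 the report is suppressed when the sizes match.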

If a.supp contains the following text, running the program with the environment variable ASAN_OPTIONS=suppressions=a.supp suppresses errors due to the variable name var.

odr_violation:^var$

An ODR violation is reported for two different linked units, say, exe and b.so. With static linking, the issue can be suppressed due to archive member extraction semantics if the b.a member is not extracted.

ODR indicator

The previous example uses -fsanitize-address-use-odr-indicator.

Prior to Clang 16, -fno-sanitize-address-use-odr-indicator was the default for non-Windows platforms. The runtime checks whether a variable has been registered by verifying whether its redzone has been poisoned, and reports an ODR violation when the redzone has been poisoned.

@___asan_gen_.1 = private unnamed_addr constant [3 x i8] c"g0\00", align 1
@___asan_gen_.2 = private unnamed_addr constant [3 x i8] c"g1\00", align 1
@__asan_global_g0 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @g0 to i64), i64 4, i64 32, i64 ptrtoint (ptr @___asan_gen_.1 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 0 }, section "asan_globals", !associated !0
@__asan_global_g1 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @g1 to i64), i64 8, i64 32, i64 ptrtoint (ptr @___asan_gen_.2 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 0 }, section "asan_globals", !associated !1
@llvm.compiler.used = appending global [4 x ptr] [ptr @g0, ptr @g1, ptr @__asan_global_g0, ptr @__asan_global_g1], section "llvm.metadata"

This mode eliminates the need for an additional variable like __odr_asan_gen_$var, but it can lead to interaction issues when mixing instrumented and uninstrumented components. In the case of a shared object, if the reference to $var in __asan_global_$var is interposed with an uninstrumented variable due to symbol interposition, it may result in a spurious error stating, "The following global variable is not properly aligned."

For Clang 16, I introduced the use of -fsanitize-address-use-odr-indicator by default for non-Windows targets (see https://reviews.llvm.org/D137227).

(Additionally, https://reviews.llvm.org/D127911 changed the ODR indicator symbol name to __odr_asan_gen_$demangled.)

Copy relocations

Private aliases have an interesting interaction with copy relocations. This issue is reported at https://gcc.gnu.org/PR68016.

The default -fsanitize-address-use-odr-indicator in Clang 16 and later cannot detect the global-buffer-overflow error below:

echo 'int f[5] = {1};' > foo.cc
echo 'extern int f[5]; int main() { return f[5]; }' > a.cc
clang++ -fpic -fsanitize=address -mllvm -asan-use-private-alias=1 -shared foo.cc -o foo1.so
clang++ -fno-pic -fsanitize=address -mllvm -asan-use-private-alias=1 -no-pie a.cc ./foo1.so -o a1
./a1 # no error

clang++ -fpic -fsanitize=address -mllvm -asan-use-private-alias=0 -shared foo.cc -o foo0.so
clang++ -fno-pic -fsanitize=address -mllvm -asan-use-private-alias=0 -no-pie a.cc ./foo0.so -o a0
./a0 # error

The definition of f in foo.cc is instrumented, resulting in the creation of __asan_global_f. However, the executable actually accesses the copy created by the linker due to copy relocation.

When -asan-use-private-alias=1 is in effect (the default since Clang 16), the __asan_global_f variable references the unused copy inside the shared object. The executable accesses the copy-relocated variable, whose redzone is not poisoned, resulting in no error.

Conversely, when -asan-use-private-alias=0 is in effect, the __asan_global_f variable references the copy-relocated variable and poisons the redzone within the executable. Consequently, accessing f[5] leads to the expected error.

Garbage collection

Since Clang 17, asan.module_ctor is, by default, placedin a COMDAT group. When multiple instrumented relocatable object filesare linked together, only one asan.module_ctor isretained.

__asan_global_g0 is positioned in a section that links to the section defining g0 using the SHF_LINK_ORDER flag. During linking, if the linker discards the section defining g0, the asan_globals section containing __asan_global_g0 will also be discarded. For more detail on SHF_LINK_ORDER, you can refer to Metadata sections, COMDAT and SHF_LINK_ORDER.

Before Clang 17, the default behavior was to use -fno-sanitize-address-globals-dead-stripping. In this mode, the instrumentation places pointers to instrumented global variables in a metadata array and calls __asan_register_globals. __asan_register_globals then iterates over the array and registers each global variable.

@g0 = dso_local global { i32, [28 x i8] } zeroinitializer, align 32
@g1 = dso_local global { i64, [24 x i8] } zeroinitializer, align 32

@___asan_gen_.1 = private unnamed_addr constant [3 x i8] c"g0\00", align 1
@___asan_gen_.2 = private unnamed_addr constant [3 x i8] c"g1\00", align 1

@llvm.compiler.used = appending global [2 x ptr] [ptr @g0, ptr @g1], section "llvm.metadata"
@0 = internal global [2 x { i64, i64, i64, i64, i64, i64, i64, i64 }] [{ i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @1 to i64), i64 4, i64 32, i64 ptrtoint (ptr @___asan_gen_.1 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 ptrtoint (ptr @__odr_asan_gen_g0 to i64) }, { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @2 to i64), i64 4, i64 32, i64 ptrtoint (ptr @___asan_gen_.2 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 ptrtoint (ptr @__odr_asan_gen_g1 to i64) }]

@1 = private alias { i32, [28 x i8] }, ptr @g0
@2 = private alias { i32, [28 x i8] }, ptr @g1

define internal void @asan.module_ctor() #0 {
  call void @__asan_init()
  call void @__asan_version_mismatch_check_v8()
  call void @__asan_register_globals(i64 ptrtoint (ptr @0 to i64), i64 2)
  ret void
}

asan.module_ctor references the metadata array @0, which, in turn, references @1 and @2. @1 and @2 reference the global variables g0 and g1, respectively. Unfortunately, this means that g0 and g1 cannot be discarded by section-based garbage collection.

It's important to note that this version of asan.module_ctor is not placed within a COMDAT group. In another compile unit, a separate asan.module_ctor references a different metadata array. As a result, these asan.module_ctor functions cannot share the same implementation.

In a linked component, both __asan_init and __asan_version_mismatch_check_v8 will be called multiple times, incurring a small overhead.
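The metadata array and the registration loop described above can be sketched as follows. This is a simplified model, not the runtime's code: the struct mirrors the eight i64 fields of each record in the IR, but the field names are my labels, the addresses and sizes are hypothetical, and the byte-granular set of poisoned addresses is a stand-in for real shadow memory.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical mirror of one 8 x i64 metadata record; field names are assumed.
struct ToyAsanGlobal {
  uint64_t beg;                // address of the global (via its private alias)
  uint64_t size;               // size of the user-visible object
  uint64_t size_with_redzone;  // size including the trailing padding
  uint64_t name;               // pointer to the variable name string
  uint64_t module_name;        // pointer to the module name string
  uint64_t has_dynamic_init;   // non-zero for dynamically initialized globals
  uint64_t location;           // zero in the IR above
  uint64_t odr_indicator;      // address of __odr_asan_gen_$var
};

using ToyShadow = std::set<uint64_t>;  // addresses of poisoned bytes

// Iterates over the metadata array and poisons each global's redzone.
void toy_register_globals(const ToyAsanGlobal *globals, uint64_t n,
                          ToyShadow &poisoned) {
  for (uint64_t i = 0; i < n; ++i) {
    const ToyAsanGlobal &g = globals[i];
    assert(g.size <= g.size_with_redzone);
    for (uint64_t a = g.beg + g.size; a < g.beg + g.size_with_redzone; ++a)
      poisoned.insert(a);
  }
}

// Registers two made-up globals, as a module constructor would with its array.
ToyShadow toy_register_example() {
  const ToyAsanGlobal metadata[2] = {
      {0x1000, 4, 32, 0, 0, 0, 0, 0},  // a 4-byte global padded to 32 bytes
      {0x1020, 8, 32, 0, 0, 0, 0, 0},  // an 8-byte global padded to 32 bytes
  };
  ToyShadow poisoned;
  toy_register_globals(metadata, 2, poisoned);
  return poisoned;
}
```

Because the constructor passes the array by address, the array and everything it references stay reachable from the constructor, which is the garbage collection obstacle noted above.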

Regrettably, the default setting of -fsanitize-address-globals-dead-stripping in Clang 17 had a bug. Specifically, when there are no global variables, and the unique module ID is non-empty, a COMDAT asan.module_ctor is created without any __asan_register_elf_globals calls. If this COMDAT is selected as the prevailing copy by the linker, the linkage unit will lack a __asan_register_elf_globals call, resulting in an unpoisoned redzone and a non-functional ODR violation checker.

I have fixed this in the main branch (#67745), but LLVM 17.0.2 does not contain the fix.

Global variable metadata

Before Clang 15, Clang's instrumentation included llvm.asan.globals, and the AddressSanitizer runtime required its object file feature for symbolization.

https://reviews.llvm.org/D127552 enabled debug information for symbolization and https://reviews.llvm.org/D127911 deleted the metadata node llvm.asan.globals.

initialization-order-fiasco

AddressSanitizer provides a check to detect whether a dynamic initializer for one global variable accesses dynamically initialized global variables defined in another compile unit. This catches certain initialization order fiasco issues.

Here is an example:

cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1;
static int fa0() { return 1; }
int a0 = fa0();
int main() { printf("%d %d\n", a0, a1); }
eof
cat > a1.cc <<'eof'
extern int a0;
static int fa1() { return a0+1; }
int a1 = fa1();
eof
clang++ -fsanitize=address a0.cc a1.cc -o a

% ASAN_OPTIONS=strict_init_order=1 ./a
=================================================================
==124921==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x5577b1cd6b00 at pc 0x5577b12fbbca bp 0x7ffe75a0a280 sp 0x7ffe75a0a260
READ of size 4 at 0x5577b1cd6b00 thread T0
    #0 0x5577b12fbbc9 in fa1() /tmp/t/d/a1.cc:2:27
    #1 0x5577b12fbbec in __cxx_global_var_init /tmp/t/d/a1.cc:3:10
    #2 0x5577b12fbc64 in _GLOBAL__sub_I_a1.cc /tmp/t/d/a1.cc
    #3 0x7ff44e0107f5 in call_init csu/../csu/libc-start.c:145:3
    #4 0x7ff44e0107f5 in __libc_start_main csu/../csu/libc-start.c:347:5
    #5 0x5577b11b46d0 in _start (/tmp/t/d/a+0x766d0)

0x5577b1cd6b00 is located 0 bytes inside of global variable 'a0' defined in '/tmp/t/d/a0.cc:4' (0x5577b1cd6b00) of size 4
  registered at:
    #0 0x5577b11d1da4 in __asan_register_globals /usr/local/google/home/maskray/llvm/compiler-rt/lib/asan/asan_globals.cpp:363:3
    #1 0x5577b11d2181 in __asan_register_elf_globals /usr/local/google/home/maskray/llvm/compiler-rt/lib/asan/asan_globals.cpp:346:3
    #2 0x5577b12fbb57 in asan.module_ctor a0.cc
    #3 0x7ff44e0107f5 in call_init csu/../csu/libc-start.c:145:3
    #4 0x7ff44e0107f5 in __libc_start_main csu/../csu/libc-start.c:347:5

SUMMARY: AddressSanitizer: initialization-order-fiasco /tmp/t/d/a1.cc:2:27 in fa1()
...

When check_initialization_order is enabled, while strict_init_order is disabled, AddressSanitizer performs a weak check allowing a compile unit that is about to be initialized to access global variables in an already initialized compile unit. In this scenario, the previous example does not result in an error:

% ASAN_OPTIONS=check_initialization_order=1:strict_init_order=0 ./a
1 2

For the following case, the weak check can still catch the initialization order fiasco:

cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1;
int a0 = []() { return a1-1; }();
int main() { printf("%d %d\n", a0, a1); }
eof
cat > a1.cc <<'eof'
extern int a0;
static int fa1() { return 2; }
int a1 = fa1();
eof
clang++ -g -fsanitize=address a0.cc a1.cc -o a
ASAN_OPTIONS=check_initialization_order=1:strict_init_order=0 ./a

Clang translates C++ dynamic initialization into a global initialization function within the llvm.global_ctors list. AddressSanitizer augments this global initialization function with __asan_before_dynamic_init and __asan_after_dynamic_init. These two functions work together to check for initialization order issues when check_initialization_order is enabled.

For instrumented global variables with initializers, the has_dynamic_init variable in the __asan_global metadata is set to true. These variables are collected into the dynamic_init_globals array.

__asan_before_dynamic_init is called for each compile unit. This function iterates over dynamic_init_globals and poisons those whose DynInitGlobal::initialized value is false. Subsequently, the global initialization function is executed. If it accesses the poisoned memory, it triggers a report for an initialization order issue. Following this, __asan_after_dynamic_init processes these global variables, unpoisoning them.

void __asan_before_dynamic_init(const char *module_name) {
  ...
  for (uptr i = 0, n = dynamic_init_globals->size(); i < n; ++i) {
    DynInitGlobal &dyn_g = (*dynamic_init_globals)[i];
    const Global *g = &dyn_g.g;
    if (dyn_g.initialized)
      continue;
    if (g->module_name != module_name)
      PoisonShadowForGlobal(g, kAsanInitializationOrderMagic);
    else if (!strict_init_order)
      dyn_g.initialized = true;
  }
}

void __asan_after_dynamic_init() {
  ...
  for (uptr i = 0, n = dynamic_init_globals->size(); i < n; ++i) {
    DynInitGlobal &dyn_g = (*dynamic_init_globals)[i];
    const Global *g = &dyn_g.g;
    if (!dyn_g.initialized) {
      // Unpoison the whole global.
      PoisonShadowForGlobal(g, 0);
      // Poison redzones back.
      PoisonRedZones(*g);
    }
  }
}
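
To see how the two functions interact, here is a small model of the logic above applied to the a0.cc/a1.cc examples. It is a sketch under stated assumptions: ToyGlobal, ToyRuntime and reports_fiasco are hypothetical names, a boolean flag replaces shadow memory, and redzones are ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for DynInitGlobal; `poisoned` replaces shadow memory.
struct ToyGlobal {
  std::string name;
  std::string module_name;
  bool initialized;
  bool poisoned;
};

struct ToyRuntime {
  bool strict_init_order;
  std::vector<ToyGlobal> globals;

  // Same branch structure as __asan_before_dynamic_init above.
  void before_dynamic_init(const std::string &module_name) {
    for (ToyGlobal &g : globals) {
      if (g.initialized) continue;
      if (g.module_name != module_name) g.poisoned = true;
      else if (!strict_init_order) g.initialized = true;
    }
  }

  // Same branch structure as __asan_after_dynamic_init above.
  void after_dynamic_init() {
    for (ToyGlobal &g : globals)
      if (!g.initialized) g.poisoned = false;
  }

  bool is_poisoned(const std::string &name) const {
    for (const ToyGlobal &g : globals)
      if (g.name == name) return g.poisoned;
    assert(false && "unknown global");
    return false;
  }
};

// Initializes a0.cc and then a1.cc. `a0_reads` and `a1_reads` name the global
// that each initializer reads, or "" for none. Returns true if a read hits a
// poisoned global, i.e. an initialization-order-fiasco would be reported.
bool reports_fiasco(bool strict, const std::string &a0_reads,
                    const std::string &a1_reads) {
  ToyRuntime rt;
  rt.strict_init_order = strict;
  rt.globals.push_back({"a0", "a0.cc", false, false});
  rt.globals.push_back({"a1", "a1.cc", false, false});
  const std::string modules[] = {"a0.cc", "a1.cc"};
  const std::string reads[] = {a0_reads, a1_reads};
  bool reported = false;
  for (int i = 0; i < 2; ++i) {
    rt.before_dynamic_init(modules[i]);
    if (!reads[i].empty() && rt.is_poisoned(reads[i])) reported = true;
    rt.after_dynamic_init();
  }
  return reported;
}
```

The first example (a1.cc reads a0) is reports_fiasco(strict, "", "a0"): it is reported only in strict mode. The second example (a0.cc reads a1) is reports_fiasco(false, "a1", ""), which the weak check still reports.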

The check is applicable when the accessed variable resides in another linked unit.

For example, consider that b.so consists of b0.cc and b1.cc, while the main executable a contains a0.cc and a1.cc.

cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1, b0, b1;
static int fa0() { return 1; }
int a0 = fa0();
int main() { printf("%d %d %d %d\n", a0, a1, b0, b1); }
eof
echo 'static int fa1() { return 2; } int a1 = fa1();' > a1.cc
echo 'static int fb0() { return 3; } int b0 = fb0();' > b0.cc
echo 'static int fb1() { return 4; } int b1 = fb1();' > b1.cc
sed 's/^        /\t/' > Makefile <<'eof'
.MAKE.MODE := meta curDirOk=true
CXX := clang++
CXXFLAGS := -g -fsanitize=address
a: a0.cc a1.cc b.so
        ${LINK.cc} -Wl,-rpath=. $> -o $@
b.so: b0.cc b1.cc
        ${LINK.cc} -fpic -shared $> -o $@
clean:
        rm -f *.meta a b.so
eof
bmake

In check_initialization_order=1,strict_init_order=0 mode,

globals in b0.cc and b1.cc are registered
b0.cc: __asan_before_dynamic_init marks b0 as initialized and poisons b1. Global initialization is run. __asan_register_globals unpoisons b1
b1.cc: __asan_before_dynamic_init marks b1 as initialized and poisons b0. Global initialization is run. __asan_register_globals unpoisons b0
globals in a0.cc and a1.cc are registered
a0.cc: __asan_before_dynamic_init marks a0 as initialized and poisons a1. Global initialization is run. __asan_register_globals unpoisons a1
a1.cc: __asan_before_dynamic_init marks a1 as initialized and poisons a0. Global initialization is run. __asan_register_globals unpoisons a0

In check_initialization_order=1,strict_init_order=1 mode,

globals in b0.cc and b1.cc are registered
b0.cc: __asan_before_dynamic_init poisons b1. Global initialization is run
b1.cc: __asan_before_dynamic_init poisons b0. Global initialization is run
globals in a0.cc and a1.cc are registered
a0.cc: __asan_before_dynamic_init poisons b0, b1, a1. Global initialization is run. __asan_register_globals unpoisons b0, b1, a1
a1.cc: __asan_before_dynamic_init poisons b0, b1, a0. Global initialization is run. __asan_register_globals unpoisons b0, b1, a0

The instrumentation can be disabled with an entry in asan_ignorelist.txt:

global:var=init

An initialization-order-fiasco error cannot be suppressed using ASAN_OPTIONS=suppressions=a.supp.
