Constructed objects ([dcl.init]) with static storage duration are destroyed and functions registered with std::atexit are called as part of a call to std::exit ([support.start.term]). The call to std::exit is sequenced before the destructions and the registered functions. [Note 1: Returning from main invokes std::exit ([basic.start.main]). -- end note]
For example, consider the following code:
struct A { ~A(); } a;
The destructor for object a will be registered for execution at program termination.
__cxa_atexit
The Itanium C++ ABI employs __cxa_atexit rather than atexit for object destructor registration for two primary reasons:
- atexit guarantee: ISO C (up to C23) guarantees support for 32 registered functions, although most implementations support many more.
- __cxa_atexit provides a mechanism for handling destructors when dynamic libraries are unloaded via dlclose before program termination.
Several standard libraries, including glibc, musl, and FreeBSD libc, implement atexit using __cxa_atexit.
atexit returns __cxa_atexit((void (*)(void *))func, NULL, __dso_handle), where __dso_handle is part of libc itself.
__dso_handle.
cat > a.cc <<'eof'
An invocation yields:
foo
Key points:
- __cxa_atexit using the __dso_handle symbol as an argument.
- crtbeginS.o defines the .fini_array section (triggering __do_global_dtors_aux) and the hidden symbol __dso_handle.
- __dso_handle as a hidden symbol if crtbegin does not.
- dlclose invokes .fini_array functions.
- __cxa_finalize(d) iterates through the termination function list, calling matching destructors based on the DSO handle.
Note: In glibc, the DF_1_NODELETE flag marks a shared object as not unloadable. Additionally, symbol lookups with STB_GNU_UNIQUE automatically set this flag.
musl provides a no-op dlclose and __cxa_finalize.
Objects with thread storage duration that have non-trivial destructors will register those destructors using __cxa_thread_atexit during construction.
Exit-time destructors for static and thread storage duration variables can be undesired due to
Clang provides -Wexit-time-destructors (disabled by default) to warn about exit-time destructors.
% clang++ -c -Wexit-time-destructors g.cc
Next, I will describe some approaches to disable exit-time destructors.
We can use a reference or pointer that refers to a dynamically-allocated object.
struct A { int v; ~A(); };
This approach prevents the destructor from running at program exit, as pointers and references have a trivial destructor. Note that this does not create a memory leak, since the pointer/reference is part of the root set.
The primary downside is unnecessary pointer indirection when accessing the object. Additionally, this approach uses a mutable pointer in the data segment and requires a memory allocation.
# %bb.2: // initializer
A common approach, as outlined in
template <class T> class no_destroy {
libstdc++ employs a variant that uses a union member.
struct A { ~A(); };
C++20 supports constexpr destructors:
template <class T> union no_destroy {
Libraries like absl::NoDestructor and folly::Indestructible offer similar functionality. The absl version optimizes for trivially destructible types.
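To make the idea concrete, here is a minimal sketch of such a wrapper. It is not the absl, folly, or libstdc++ implementation: NoDestroySketch, Counted, and no_destroy_demo are hypothetical names, and the sketch assumes it is acceptable to construct the object in raw storage and never run its destructor.

```cpp
#include <new>
#include <utility>

// Hypothetical wrapper: builds T inside aligned raw storage and never calls
// ~T(). The wrapper itself is trivially destructible, so a variable of this
// type with static storage duration needs no exit-time destructor.
template <class T> class NoDestroySketch {
  alignas(T) unsigned char buf[sizeof(T)];

public:
  template <class... Args> explicit NoDestroySketch(Args &&...args) {
    ::new (static_cast<void *>(buf)) T(std::forward<Args>(args)...);
  }
  // A stricter version would pass the pointer through std::launder (C++17).
  T &get() { return *reinterpret_cast<T *>(buf); }
};

// Test type that counts how many times its destructor runs.
struct Counted {
  static int destroyed;
  int v;
  explicit Counted(int v) : v(v) {}
  ~Counted() { ++destroyed; }
};
int Counted::destroyed = 0;

// Returns the number of ~Counted() calls after a wrapped object goes out of
// scope, or -1 if the wrapped value was not constructed as expected.
int no_destroy_demo() {
  {
    NoDestroySketch<Counted> x(42);
    if (x.get().v != 42)
      return -1;
  }  // x's lifetime ends here without running ~Counted()
  return Counted::destroyed;
}
```

The trade-off is the same as with the libraries above: resources owned by the wrapped object are intentionally left for the operating system to reclaim.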
Ideally, compilers should optimize out exit-time destructors for empty user-provided destructors:
struct C { C(); ~C() {} };
LLVM has addressed this by removing __cxa_atexit calls related to empty destructors, along with other global variable optimizations.
In contrast, GCC has an
no_destroy attribute
Clang supports [[clang::no_destroy]] (alternative form: __attribute__((no_destroy))) to disable exit-time destructors for variables of static or thread storage duration. Its -fno-c++-static-destructors option allows disabling exit-time destructors globally.
Standardization efforts for this attribute are underway
I recently encountered a scenario where the no_destroy attribute would have been beneficial. I've filed a GCC feature request (PR114357) after I learned that GCC doesn't have the attribute.
LLVM provides ManagedStatic to construct an object on demand (good for reducing startup time) and make destruction explicit through llvm_shutdown. ManagedStatic is intended to be used at namespace scope. A prime example is LLVM's statistics mechanisms (-stats and -time-passes).
Programs using LLVM can strategically avoid calling llvm_shutdown for fast teardown by skipping some destructors. The lld linker employs this approach unless the LLD_IN_TEST environment variable is set to a non-zero integer.
DSO plugin users requiring library unloading may find ManagedStatic unsuitable. This is because:
- llvm_shutdown.
- If llvm_shutdown is deferred until around program exit, executing destructors becomes unsafe once the DSO's code has been removed.
The mold linker improves perceived linking speed by spawning a separate process for the linking task. This allows the parent process (the one launched from the shell or other programs) to exit early. This approach eliminates overhead associated with static destructors and other operations.
All data structures that the object file format defines follow the "natural" size and alignment guidelines for the relevant class. If necessary, data structures contain explicit padding to ensure 4-byte alignment for 4-byte objects, to force structure sizes to a multiple of four, etc. Data also have suitable alignment from the beginning of the file. Thus, for example, a structure containing an Elf32_Addr member will be aligned on a 4-byte boundary within the file. Other classes would have appropriately scaled definitions. To illustrate, the 64-bit class would define Elf64_Addr as an 8-byte object, aligned on an 8-byte boundary. Following the strictest alignment for each object allows the format to work on any machine in a class. That is, all ELF structures on all 32-bit machines have congruent templates. For portability, ELF uses neither bit-fields nor floating-point values, because their representations vary, even among processors with the same byte order. Of course the programs in an ELF file may use these types, but the format itself does not.
While beneficial for many control structures, the natural size guideline presents significant drawbacks for relocations. Since relocations are typically processed sequentially, they don't gain the same random-access advantages. The large 24-byte Elf64_Rela structure highlights the drawback. For a detailed comparison of relocation formats, see
Furthermore, Elf32_Rel and Elf32_Rela sacrifice flexibility to maintain a smaller size, limiting relocation types to a maximum of 255. This constraint has become noticeable for AArch32 and RISC-V, and especially when platform-specific relocations are needed. While the 24-bit symbol index field is less elegant, it hasn't posed significant issues in real-world use cases.
In contrast, the
Inspired by WebAssembly, I will explore real-world scenarios where relocation size is critical and propose an alternative format (RELLEB) that addresses ELF's limitations.
A substantial part of position-independent executables (PIEs) and dynamic shared objects (DSOs) is occupied by dynamic relocations. While RELR (a compact relative relocation format) offers size-saving benefits for relative relocations, other dynamic relocations can benefit from a compact relocation format. There are a few properties:
- Elf_Addr. No two dynamic relocations can share the same offset.
- (R_*_JUMP_SLOT and R_*_GLOB_DAT). When a symbol is associated with more dynamic relocations, it is typically a base class function residing in multiple C++ virtual tables. -fexperimental-relative-c++-abi-vtables would eliminate such dynamic relocations.
Android's packed relocation format (linker implementation: ld.lld --pack-dyn-relocs=android) was an earlier design that applies to all dynamic relocations at the cost of complexity.
Additionally, Apple linkers and dyld use LEB128 encoding for bind opcodes.
Marker relocations are utilized to indicate that a certain linker optimization/relaxation is applicable. While many marker relocations are rarely used, RISC-V relocatable files are typically filled with R_RISCV_RELAX relocations. Their size contribution is quite substantial.
.llvm_addrsig
On many Linux targets, Clang emits a special section called .llvm_addrsig (type SHT_LLVM_ADDRSIG, LLVM address-significance table) by default to allow ld.lld --icf=safe. The .llvm_addrsig section stores symbol indexes in ULEB128 format, independent of relocations. Consequently, tools like ld -r and objcopy risk invalidating the section due to symbol table modifications.
Ideally, using relocations would allow certain operations. However, the size concern of REL/RELA in ELF hinders this approach. In contrast, lld's Mach-O port uses a relocation-based representation for __DATA,__llvm_addrsig.
.llvm.call-graph-profile
LLVM leverages a special section called .llvm.call-graph-profile (type SHT_LLVM_CALL_GRAPH_PROFILE) for both instrumentation- and sample-based profile-guided optimization (PGO). lld
Similar to .llvm_addrsig, the .llvm.call-graph-profile section initially faced the symbol index invalidation problem, which was solved by switching to relocations. I opted for REL over RELA to reduce code size.
DWARF v5 accelerated name-based access with the introduction of the .debug_names section. However, in a clang -g -gsplit-dwarf -gpubnames generated relocatable file, the .rela.debug_names section can consume a significant portion (approximately 10%) of the file size.
Relocation section '.relleb.debug_names' at offset 0x65c0 contains 200 entries:
This size increase has sparked discussions within the LLVM community about potentially
The availability of a more compact relocation format would likely alleviate the need for such format changes.
.debug_line and .debug_addr also contribute a lot of relocations.
Relocation section '.relleb.debug_addr' at offset 0x64f1 contains 51 entries:
Many adjacent relocations share the same section symbol. We will see later that the proposed RELLEB does not take much advantage of this property.
While the standard SHF_COMPRESSED feature is commonly used for debug sections, its application can easily extend to relocation sections. I have developed a Clang/lld prototype that demonstrates this by compressing SHT_RELA sections.
The compressed SHT_RELA section occupies sizeof(Elf64_Chdr) + size(compressed) bytes. The implementation retains uncompressed content if compression would result in a larger size.
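The size rule can be written down as a small decision function. This is only a sketch of the rule stated above, not code from the prototype: chooseRelaEncoding and SectionChoice are hypothetical names, and treating a tie as "keep uncompressed" is an assumption.

```cpp
#include <cstdint>

// sizeof(Elf64_Chdr): ch_type (4) + ch_reserved (4) + ch_size (8) +
// ch_addralign (8) bytes.
constexpr uint64_t kElf64ChdrSize = 24;

struct SectionChoice {
  bool compressed;  // whether SHF_COMPRESSED content is emitted
  uint64_t size;    // resulting section size in bytes
};

// Hypothetical helper: compress only when header plus payload is strictly
// smaller than the raw SHT_RELA content (tie handling is an assumption).
inline SectionChoice chooseRelaEncoding(uint64_t rawSize,
                                        uint64_t compressedPayload) {
  uint64_t total = kElf64ChdrSize + compressedPayload;
  if (total < rawSize)
    return SectionChoice{true, total};
  return SectionChoice{false, rawSize};
}
```

For a two-entry section (48 bytes of Elf64_Rela), a 30-byte compressed payload would already lose to the raw content because of the 24-byte header.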
In scenarios with numerous smaller relocation sections (such as when using -ffunction-sections -fdata-sections), the 24-byte Elf64_Chdr header can introduce significant overhead. This observation raises the question of whether encoding Elf64_Chdr fields using ULEB128 could further optimize file sizes. With larger monolithic sections (.text, .data, .eh_frame), compression ratio would be higher as well.
# configure-llvm is my wrapper of cmake that specifies some useful options.
Relocations consume a significant portion (approximately 20.9%) of the file size. Despite the overhead of -ffunction-sections -fdata-sections, the compression technique yields a significant reduction of 14.5%!
However, dropping in-place relocation processing is a downside.
The 1990 ELF paper ELF: An Object File to Mitigate Mischievous Misoneism says "ELF allows extension and redefinition for other control structures." Inspired by WebAssembly, let's explore RELLEB, a new and more compact relocation format designed to replace RELA. Our emphasis is on simplicity over absolute minimal encoding. See the end of the article for a detailed format description.
A SHT_RELLEB section (preferred name: .relleb<name>) holds compact relocation entries that decode to Elf32_Rela or Elf64_Rela depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. The entries use ULEB128 and SLEB128 exclusively and there is no endianness difference.
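For readers unfamiliar with the two variable-length encodings, here is a minimal sketch of both. appendULEB128 and appendSLEB128 are hypothetical helper names that write into a std::vector; they are not the LLVM functions, although the byte sequences follow the standard LEB128 definition.

```cpp
#include <cstdint>
#include <vector>

// Unsigned LEB128: emit 7 bits per byte, least significant group first;
// the high bit of each byte says whether more bytes follow.
void appendULEB128(uint64_t v, std::vector<uint8_t> &out) {
  do {
    uint8_t byte = v & 0x7f;
    v >>= 7;
    if (v != 0)
      byte |= 0x80;
    out.push_back(byte);
  } while (v != 0);
}

// Signed LEB128: same grouping, but stop once the remaining bits are a pure
// sign extension of bit 6 of the byte just produced.
void appendSLEB128(int64_t v, std::vector<uint8_t> &out) {
  bool more = true;
  while (more) {
    uint8_t byte = v & 0x7f;
    v >>= 7;  // arithmetic shift on mainstream compilers (required by C++20)
    if ((v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40)))
      more = false;
    else
      byte |= 0x80;
    out.push_back(byte);
  }
}
```

Because each value is a byte sequence with a fixed grouping order, the same bytes are produced on little-endian and big-endian hosts, which is why the format has no endianness difference.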
Here are key design choices:
Relocation count (ULEB128):
This allows for efficient retrieval of the relocation count without decoding the entire section. While a uint32_t (like SHT_HASH) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in most cases when the number of relocations can be encoded in one to three bytes.
Delta encoding for r_offset (ULEB128):
Section offsets can be large, and relocations are typically ordered. Storing the difference between consecutive offsets offers compression potential. In most cases, a single byte will suffice. While there are exceptions (general dynamic TLS model of s390/s390x uses a local "out-of-order" pair: R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)), we are optimizing for the common case. Switching to SLEB128 would increase the total .o size by 0.1%.
For ELFCLASS32, r_offset members are calculated using arithmetic modulo 4294967296.
Delta encoding for r_type (SLEB128):
Some psABIs utilize relocation types greater than 128. AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding. Delta encoding allows all but the first relocation's type to be encoded in a single byte. An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.
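A quick way to see the benefit is to compare SLEB128 sizes of absolute types and of deltas. In this sketch, slebSize is a hypothetical helper; the three relocation type values are the AArch64 psABI numbers.

```cpp
#include <cstdint>

// Number of bytes SLEB128 needs to encode v.
unsigned slebSize(int64_t v) {
  unsigned n = 0;
  bool more = true;
  while (more) {
    uint8_t byte = v & 0x7f;
    v >>= 7;
    // Stop when the rest is only sign extension of bit 6.
    more = !((v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40)));
    ++n;
  }
  return n;
}

// AArch64 static relocation type values (from the psABI).
constexpr int64_t R_AARCH64_ADR_PREL_PG_HI21 = 275;
constexpr int64_t R_AARCH64_ADD_ABS_LO12_NC = 277;
constexpr int64_t R_AARCH64_CALL26 = 283;

// Bytes for the type field of a CALL26 -> ADR_PREL_PG_HI21 -> ADD_ABS_LO12_NC
// sequence when each type is stored as a delta from the previous one
// (the previous type of the first entry is treated as zero).
unsigned deltaTypeBytes() {
  return slebSize(R_AARCH64_CALL26 - 0) +
         slebSize(R_AARCH64_ADR_PREL_PG_HI21 - R_AARCH64_CALL26) +
         slebSize(R_AARCH64_ADD_ABS_LO12_NC - R_AARCH64_ADR_PREL_PG_HI21);
}
```

Only the first entry pays two bytes; the deltas -8 and +2 fit in one byte each, whereas storing the absolute values would cost two bytes per entry.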
If the AArch32 psABI could be redesigned, allocating [0,64) for Thumb relocation types and [64,*) for ARM relocation types would optimize delta encoding even further.
While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations. This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.
Symbol table index and type/addend presence (SLEB128):
ULEB128 optimizes for the common case when the symbol index is encodable in one or two bytes. Using SLEB128 and delta encoding instead of ULEB128 for the symbol index field would increase the total size by 0.4%.
The sign bit determines type/addend presence:
This scheme optimizes for consecutive static relocations with identical types and addends, a common pattern in many architectures. Example:
// R_AARCH64_CALL(g0), ...
While RISC architectures often require multiple relocations with different types to access global data, the frequent use of call instructions outweighs this in most cases. Overall, type/addend omission is generally more advantageous than just addend omission (tested using an aarch64-linux-gnu build).
On RISC-V, due to frequent relocation type changes and R_RISCV_RELAX, addend omission yields slightly smaller .relleb* sections than type/addend omission (by 1.9%).
With a limited number of types and frequent zero addends (except R_*_RELATIVE and R_*_IRELATIVE), dynamic relocations also benefit from type/addend omission.
Delta encoding for addend (SLEB128):
When the addend changes, we use an addend delta. This offers a slight size advantage (about 0.20%) and optimizes for cases like:
.quad .data + 0x78
The .debug_line_str references in a .debug_line section follow this pattern.
Note: when the bitwise NOT code path is taken, the zero delta type and addend is not utilized.
I have developed a prototype at SHF_COMPRESSED SHT_RELA approach.
configure-llvm s2-custom2 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS=-mrelleb
The total relocation section size has decreased from 28767768 to 4872672, 16.9% of the original size. RELLEB yields a significant file size reduction of 17.2%!
In aarch64-linux-gnu builds, the total relocation section size has decreased from 25762752 to 4698182, 18.2% of the original size. RELLEB yields a file size reduction of 16.5%.
In riscv64-linux-gnu builds, the total relocation section size has decreased from 91054800 to 17522812, 19.2% of the original size. RELLEB yields a file size reduction of 32.4%.
In an x86-64 clang -g -gsplit-dwarf -gpubnames build, .rela* sections consume 19.1% of the file size. The total relocation section size has decreased from 105622824 to 24559569, 23.3% of the original size. RELLEB yields a file size reduction of 14.6%.
It would be interesting to explore the potential gains of combining zstd compression with RELLEB.
configure-llvm s2-custom3 -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_{C,CXX}_FLAGS='-mrelleb -Xclang --compress-relocations=zstd'
While the 25.8% reduction in RELLEB section size suggests room for further optimization, the overall decrease of only 1.10% in .o file sizes indicates that the current compact relocation format offers a reasonable compromise. (In the absence of the addend presence and delta addend technique, the overall decrease is about 1.5%.)
I debated whether to name the new section SHT_RELOC (.reloc<name>) or SHT_RELLEB (.relleb<name>). Ultimately, I chose SHT_RELLEB because its unique name minimizes potential confusion, whereas SHT_RELOC could be confused with SHT_REL and SHT_RELA.
RELLEB is not the optimal format for sections like .rodata, .debug_names, .debug_line, and .debug_addr. These sections often have many relocations with the same type and symbol, a pattern that the generic RELR format (discussed below for dynamic relocations) could exploit more effectively.
Specifically for .debug_names, the RELLEB format results in a size(.relleb.debug_names) / size(.rela.debug_names) ratio of 27.7%. Modifying RELLEB to use the sign bit of the symbol index for omitting the relocation type (instead of the addend) could improve this ratio, but at the cost of larger .relleb.text sections.
RELLEB excels with static relocations, but what about the dynamic case? I believe its benefits for dynamic relocations are less pronounced. An optimal dynamic relocation format would differ substantially. A generalized RELR format would leverage the dynamic relocation properties well. Here's a possible encoding:
// R_*_RELATIVE group
We need to enumerate all dynamic relocation types including R_*_IRELATIVE, R_*_TLSDESC used by some ports. Some R_*_TLSDESC relocations have a symbol index of zero, but the straightforward encoding does not utilize this property.
Traditionally, we have two dynamic relocation ranges.
- .rela.dyn: [DT_RELA, DT_RELA + DT_RELASZ) (or .rel.dyn: [DT_REL, DT_REL + DT_RELSZ))
- .rela.plt: [DT_JMPREL, DT_JMPREL + DT_PLTRELSZ). DT_PLTREL specifies DT_REL or DT_RELA.
Some GNU ld ports treat .rela.plt as a subset of .rela.dyn, introducing complexity for dynamic loaders.
Android's packed relocation format replaces .rel.dyn/.rela.dyn but does not change the section name.
If we aim to replace these sections with RELLEB, we'd need a new dynamic tag (DT_RELLEB) and ensure no overlap with DT_JMPREL. DT_RELLEBSZ is not needed, because the relocation count can be inferred from the header. We would need to disallow output section descriptions like .rela.dyn : { *(.rela.dyn) *(.rela.plt) }.
In glibc, there is additional complexity for ET_EXEC executables due to __rela_iplt_start.
I've implemented -z relleb to replace .rel.dyn/.rela.dyn but not yet .rel.plt/.rela.plt. Dynamic relocations are sorted by (r_type, r_offset) to better utilize RELLEB.
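The sort key can be expressed directly with std::tie. This is a sketch rather than the lld implementation: DynReloc and sortForRelleb are hypothetical names, and the use of a stable sort is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical in-memory form of a decoded dynamic relocation.
struct DynReloc {
  uint64_t r_offset;
  uint32_t r_type;
  uint32_t r_symidx;
  int64_t r_addend;
};

// Order by (r_type, r_offset): entries of the same type become adjacent, so
// the type delta is zero and offset deltas within a group stay non-negative.
void sortForRelleb(std::vector<DynReloc> &relocs) {
  std::stable_sort(relocs.begin(), relocs.end(),
                   [](const DynReloc &a, const DynReloc &b) {
                     return std::tie(a.r_type, a.r_offset) <
                            std::tie(b.r_type, b.r_offset);
                   });
}
```

Grouping by type first trades one large offset delta at each group boundary for many entries whose type field can be omitted.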
Let's link clang-16-debug using RELA, --pack-dyn-relocs=relr, --pack-dyn-relocs=android+relr, and --pack-dyn-relocs=relr -z relleb and analyze the results.
% llvm-readelf -S clang | grep ' \.rel.*\.'
Analysis
- RELOCATION_GROUPED_BY_INFO_FLAG sharing r_info.
- .relr.dyn still accounts for a significant portion of the size.
Decoding ULEB128/SLEB128 would necessitate more work in the dynamic loader. However, since there is no implementation yet, we don't know the performance.
--emit-relocs and -r necessitate combining relocation sections. The output size may differ from the sum of input sections. The total relocation count must be determined, a new header written, and section content regenerated, as symbol indexes and addends may have changed. Debug sections, .eh_frame, and .gcc_except_table require special handling to rewrite relocations referencing a dead symbol to R_*_NONE. This also necessitates updating the relocation type.
--emit-relocs and -r copy RELLEB relocation sections (e.g. .relleb.text) to the output. When .rela.text is also present, linkers are required to merge .rela.text into .relleb.text.
GNU ld allows certain unknown section types:
- [SHT_LOUSER,SHT_HIUSER] and non-SHF_ALLOC
- [SHT_LOOS,SHT_HIOS] and non-SHF_OS_NONCONFORMING
but reports errors and stops linking for others (unless --no-warn-mismatch is specified). When linking a relocatable file using SHT_RELLEB, you might encounter errors like the following:
% clang -mrelleb -fuse-ld=bfd a.c b.c
Older lld and mold do not report errors. I have filed:
In addition, when there is one .eh_frame section with CIE pieces but no relocation, _bfd_elf_parse_eh_frame will report an error.
mips64el has an incorrect r_info: a 32-bit little-endian symbol index followed by a 32-bit big-endian type. If mips64el decides to adopt RELLEB, they can utilize this opportunity to fix r_info.
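To show what the quirk means for a reader of the file, the sketch below converts a mips64el r_info, as obtained from a plain little-endian 64-bit load, into the generic ELF64 layout (symbol index in the high 32 bits, type in the low 32 bits). normalizeMips64elRInfo is a hypothetical name; the byte shuffling simply follows the layout described above.

```cpp
#include <cstdint>

// raw: the 8 bytes of r_info loaded as one little-endian 64-bit integer.
// On mips64el the first 4 bytes hold the symbol index (little-endian) and
// the last 4 bytes hold the type field stored big-endian.
uint64_t normalizeMips64elRInfo(uint64_t raw) {
  uint32_t sym = static_cast<uint32_t>(raw);      // low half: symbol index
  uint32_t t = static_cast<uint32_t>(raw >> 32);  // high half: swapped type
  // Byte-swap the type field back to host (little-endian) order.
  uint32_t type = (t >> 24) | ((t >> 8) & 0xff00u) | ((t << 8) & 0xff0000u) |
                  (t << 24);
  return (static_cast<uint64_t>(sym) << 32) | type;
}
```

For example, the file bytes 05 00 00 00 00 00 12 03 load as 0x0312000000000005 and normalize to symbol index 5 with type field 0x1203.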
The initial revision has been proposed at
In
In Figure 4-9: Section Types, sh_type, append a row
SHT_RELLEB | 20
Add text:
SHT_RELLEB - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.
In Figure 4-16: Special Sections, append
.rellebname | SHT_RELLEB | see below
Change the text below:
.relname, .relaname, and .rellebname
These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .relleb.text.
In Figure 4-23: Relocation Entries, add:
typedef struct {
Add text above "A relocation section references two other sections":
A SHT_RELLEB section holds compact relocation entries that decode to Elf32_Relr or Elf64_Relr depending on the object file class (32-bit or 64-bit). Its content begins with a ULEB128-encoded relocation count, followed by entries encoding r_offset, r_type, r_symidx, and r_addend. Note that the r_info member in traditional REL/RELA formats has been split into separate r_type and r_symidx members, allowing uint32_t relocation types for ELFCLASS32 as well.
- r_offset relative to the previous entry, represented as a 32-bit or 64-bit unsigned integer for ELFCLASS32/ELFCLASS64, respectively.
- r_addend relative to the previous entry, represented as a 32-bit or 64-bit signed integer for ELFCLASS32/ELFCLASS64, respectively.
The bitwise NOT of symbol index 0xffffffff is -0x100000000 (64-bit) instead of 0 (32-bit).
Example C++ encoder:
// encodeULEB128(uint64_t, raw_ostream &os);
For the first relocation entry, the previous offset, type, and addend members are treated as zero.
In Figure 5-10: Dynamic Array Tags, d_tag, add:
DT_RELLEB | 38 | d_ptr | optional | optional
Add text below:
DT_RELLEB - This element is similar to DT_RELA, except its table uses the RELLEB format. The relocation count can be inferred from the header.
[llvm-objdump][X86] Add @plt symbols for .plt.got
[llvm-readobj] Print <null> for relocation target with an empty name
--decompress
/-z
(ASAN_SHADOW_OFFSET_CONST
(__isoc23_strtol
and__isoc23_scanf
family functions__isoc23_*
functions (-fsanitize=alignment
: check memcpy/memmove arguments(#67766).pseudo_probe
createdsections deterministic after D91878.reloc
to register used symbolsSHF_LINK_ORDER
and section group parsing orderto match GNU assembler__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
forAArch64 (R_RISCV_CALL
/R_RISCV_CALL_PLT
assemblerand assembly parser cleanupcall fptr
andjmp fptr
(-fno-pic
for intrinsics toemit R_386_PC32
instead of R_386_PLT32
(arch=x86-64{,-v2,-v3,-v4}
fortarget_clones
attribute__builtin_cpu_supports
: supportx86-64{,-v2,-v3,-v4}
CXX_STANDARD 17
(Driver maintenance
-mtls-dialect=desc
Others:
Reviewed many patches, including ADT/Support, binary utilities, MC, lld (sometimes non-ELF ports even if my primary expertise is in ELF), clangDriver, LTO, sanitizers, LoongArch, RISC-V, x86-64 medium/large code models, etc.
TODO
is:pr is:closed sort:updated-desc review-requested:@me lists pull requests that requested a review from me, but it's unclear how to list pull requests on which I've commented.
Embedded systems often lack MMUs, relying on real-time operating systems (RTOS) like VxWorks or special Linux configurations (CONFIG_MMU=n). In these systems, the offset between the text and data segments is often not known at compile time. Therefore, a dedicated register is typically set to somewhere in the data segment and writable data is accessed relative to this register.
Why is the offset not known at compile time? There are primarily two reasons.
First, eXecute in Place (XIP), where code resides in ROM while the data segment is copied to RAM. Therefore, the offset between the text and data segments is often not known at compile time.
Second, all processes share the same address space without an MMU. However, it is still desirable for these processes to share text segments. Therefore, a mechanism is needed for code to find its corresponding data.
-msep-data
GCC's m68k port added -msep-data in 2003-10.
Add -msep-data and -mid-shared-library support for uClinux. These are two special PIC variants that allow executing linux applications in ROM filesystems without loading an additional copy in memory (XIP).
With -msep-data, references to global data are made through register A5 which is loaded with a pointer to the start of the data/bss segment allocated in RAM.
The -mid-shared-library option allows using a special shared library flavour that allows allocating a distinct data/bss section for each process without the need to relocate code in both library and application.
-msep-data is PIC only and updates -fno-pic to -fPIE. In this mode, a5 is read-only and holds the address of _GLOBAL_OFFSET_TABLE_. When not used with -mid-shared-library, -fPIC -msep-data is unnecessary. Just stick with -fPIE -msep-data.
-mid-shared-library
-msep-data. The documentation says:
Generate code that supports shared libraries via the library ID method. This allows for execute-in-place and shared libraries in an environment without virtual memory management. This option implies -fPIC.
-mid-shared-library is PIC only and updates -fno-pic to -fPIE. You compile a source file with -mid-shared-library -mshared-library-id=n, and the functions will be attached to library ID n. At function entry a5 points to an array that maps a library ID to the corresponding GOT base address. The compiler generates move.l -(n+1)*4(%a5),%a5 to obtain the actual GOT base address. a5 is then used to access the corresponding data segment.
gcc/config/bfin added -msep-data in 2006.
-mno-pic-data-is-text-relative
This ARM option is similar to -msep-data and only makes sense with -fpie/-fpic. In 2013, -mno-pic-data-is-text-relative, generalized from the ARM
-mno-pic-data-is-text-relative implies -msingle-pic-base:
Treat the register used for PIC addressing as read-only, rather than loading it in the prologue for each function. The runtime system is responsible for initializing this register with an appropriate value before execution begins.
r9 is used as the static base (arm_pic_register) in the position-independent data model to access the data segment. Since r9 is not changed, dynamic linking seems unsupported as a DSO needs a different data segment.
GCC's s390x port added -mno-pic-data-is-text-relative in 2017.
-fropi and -frwpi
Clang ARM's -fropi and -frwpi are special -fno-pic variants that are only intended for static linking. While regular -fno-pic assumes absolute addressing for both code and data, -fropi and -frwpi add a twist by enforcing relative addressing based on specific assumptions about relocation. Both options consider the text-data segment offset unknown at compile time.
- -fropi assumes code and read-only data will be relocated at runtime, making absolute addressing unsuitable. Instead, PC-relative addressing is used. The .ARM.attributes section contains Tag_ABI_PCS_RO_data: 1 like -fpic.
- -frwpi assumes writable data will be relocated at runtime, making absolute addressing unsuitable. Instead, writable data is accessed relative to the static base register. The .ARM.attributes section contains Tag_ABI_PCS_RW_data: 2.
You can use -fropi and -frwpi together to require relative addressing for both code and data. Compared with -fno-pic -frwpi, -fno-pic -fropi -frwpi needs one more instruction to retrieve a function address.
In terms of semantics, I think -fno-pic -fropi -frwpi is identical to -fpie -mno-pic-data-is-text-relative with hidden visibility declarations. In practice, GCC -fpie -mno-pic-data-is-text-relative utilizes GOT-relative relocations (R_ARM_GOT_BREL), not MOVW/MOVT instructions.
-mfdpic
We will discuss this in detail later.
-msep-data and -mno-pic-data-is-text-relative are the same, relying on -fpie/-fpic semantics to enforce relative addressing for the text segment. -fropi and -frwpi offer finer control. You can choose to use relative addressing for text segment only (-fropi), data segment only (using -frwpi), or both.
Neither -msep-data nor -fropi -frwpi supports shared libraries. -msep-data's variant -mid-shared-library provides a library ID based shared library, which works for some cases but is inflexible.
Now, let's review OS support. While I'm not an RTOS expert, let's explore Linux's executable file loaders and see how they handle MMU-less scenarios.
fs/Kconfig.binfmt defines a few loaders.
- BINFMT_ELF defaults to y and depends on MMU.
- BINFMT_ELF_FDPIC defaults to y when BINFMT_ELF is not selected. A few architectures support BINFMT_ELF_FDPIC for NOMMU. ARM supports FDPIC even with an MMU.
- BINFMT_FLAT is provided for a few architectures.
Therefore, both BINFMT_ELF_FDPIC and BINFMT_FLAT can be used for MMU-less systems. BINFMT_FLAT is a very old solution that does not allow dynamic linking while BINFMT_ELF_FDPIC supports dynamic linking.
BTW, BINFMT_AOUT, removed in 2022, had been supported for alpha/arm/x86-32.
Linux's BINFMT_FLAT refers to an object file format used by μClinux: Binary Flat format (BFLT). ld-elf2flt is a ld wrapper that invokes elf2flt when the option -elf2flt is seen.
Linux's BINFMT_FLAT supports both version 2 (OLD_FLAT_VERSION) and version 4. Version 4 supports eXecute in Place (XIP), where code resides in ROM while the data segment is copied to RAM. Therefore, the offset between the text and data segments is often not known at compile time.
Greg added -mid-shared-library in 2003, which was
The tooling for shared library support seems to be called eXtended FLAT (XFLAT). It is a limited shared library scheme that disallows global variable sharing. Quoting
XFLAT provides an alternative mechanism to bind and relocate functions using a thunk layer that is inserted between each inter-module function call. However, without a GOT it is not possible to bind and relocate data. In short, with no GOT XFLAT cannot support sharing of global variables between program and shared library modules.
FDPIC can be seen as an extended -mno-pic-data-is-text-relative mode that utilizes function descriptors to support PIC register changes for dynamic linking. A FDPIC executable can be loaded using either the regular Linux ELF loader for MMU systems or fs/binfmt_elf_fdpic.c for MMU-less systems. fs/binfmt_elf_fdpic.c has been ET_EXEC executables in NOMMU mode. Each architecture that supports FDPIC defines an EI_OSABI value to be checked by the loader.
Several architectures define a FDPIC ABI.
Here is a summary.
The read-only sections, which can be shared, are commonly referred to as the "text segment", whereas the writable sections are non-shared and commonly referred to as the "data segment". Functions and certain data symbols (.rodata) reside in the text segment, while other data symbols and the GOT reside in the data segment. Special entries called "canonical function descriptors" also reside in the GOT.
A call-clobbered register is reserved as the FDPIC register, used to access the data segment. Upon entry to a function, the FDPIC register holds the address of _GLOBAL_OFFSET_TABLE_. The text segment can be referenced using PC-relative addressing. The data segment including GOT is referenced using indirect FDPIC-register-relative addressing. We will see later that sometimes it's unknown whether a non-preemptible symbol resides in the text segment or the data segment, in which case GOT-indirect addressing with the FDPIC register has to be used.
A function call is called external if the destination may reside in another module, which has a different data segment and therefore needs a different FDPIC register value. Therefore, an external function call needs to update the FDPIC register as well as changing the program counter (PC). The FDPIC register can be spilled into a stack slot or a call-saved register, if the caller needs to reference the data segment later. The FDPIC register is call-clobbered to
Calling a function pointer, including calling a PLT entry, also sets both the FDPIC register and PC. When the address of a function is taken, the address of its canonical function descriptor is obtained, not that of the entry point. The descriptor, which resides in the GOT, contains pointers to both the function's entry point and its FDPIC register value. The two GOT entries are relocated by a dynamic relocation of type R_*_FUNCDESC_VALUE (e.g. R_FRV_FUNCDESC_VALUE).
If the symbol is preemptible, the code sequence loads a GOT entry. When the symbol is a function, the GOT entry is relocated by a dynamic relocation R_*_FUNCDESC and will contain the address of the function descriptor.
Let's check out examples taking addresses of functions and variables.
__attribute__((visibility("hidden"))) void hidden_fun();
void fun();
__attribute__((visibility("hidden"))) extern int hidden_var;
extern int var;
__attribute__((visibility("hidden"))) const int ro_hidden_var = 42;
void *addr_hidden_fun() { return hidden_fun; }
void *addr_fun() { return fun; }
void *addr_hidden_var() { return &hidden_var; }
void *addr_var() { return &var; }
const int *addr_ro_hidden_var() { return &ro_hidden_var; }
int read_hidden_var() { return hidden_var; }
int read_var() { return var; }
Canonical function descriptors are stored in the GOT, and their access depends on whether the referenced function is preemptible or not.
R_*_FUNCDESC dynamic relocation, holds the final address of the function descriptor.
1 | // arm-linux-gnueabihf-gcc -c -fpic -mfdpic -Wa,--fdpic |
Unfortunately, when linking a DSO, an R_ARM_GOTOFFFUNCDESC relocation referencing a hidden symbol results in a linker error. This error likely arises because the generated R_ARM_FUNCDESC_VALUE dynamic relocation requires a dynamic symbol. While this can be implemented using an STB_LOCAL STT_SECTION dynamic symbol, GNU ld currently lacks support for this approach.
1 | % arm-linux-gnueabihf-gcc -fpic -mfdpic -O2 -Wa,--fdpic q.c -shared |
Let's try sh4. sh4-linux-gnu-gcc -fpic -mfdpic -O2 q.c -shared -nostdlib allows taking the address of a hidden function but not a protected function (my
Then, let's see a global variable initialized by the address of a function and a C++ virtual table.
struct A { virtual void foo(); };
void call(A *a) { a->foo(); }
auto *var_call = call;
1 | // arm-linux-gnueabihf-g++ -c -fpic -mfdpic -Wa,--fdpic |
TODO: -fexperimental-relative-c++-abi-vtables
GOT-indirect addressing is required for accessing data symbols under two conditions:
int var;) and externally declared (extern int var;) non-const variables.
const A a; extern const int var; extern constinit const int *const extern_const;. Constant initialization may require a relocation, e.g. constinit const int *const extern_const = &var;
1 | addr_hidden_var: // non-preemptible data with potential data segment placement |
The dynamic relocations R_*_RELATIVE and R_*_GLOB_DAT do not use the standard + load_base semantics. It seems that musl fdpic doesn't support the special R_*_RELATIVE.
If the referenced data symbol is non-preemptible and guaranteed to be in the text segment, we can use PC-relative addressing. However, this scenario is remarkably rare in practice. The most likely use case is like the following:
1 | const int ro_array[] = {1, 2, 3, 4}; // text segment |
GCC's arm port does not seem to utilize PC-relative addressing. We can try GCC's SuperH port:
1 | // sh4-linux-gnu-gcc -S -fpic -mfdpic -O2 q.c |
It optimizes addr_hidden_var but not addr_ro_hidden_var.
ARM FDPIC ABI defines static TLS relocations R_ARM_TLS_GD32_FDPIC, R_ARM_TLS_LDM32_FDPIC, R_ARM_TLS_IE32_FDPIC to be relative to GOT, as opposed to their non-FDPIC counterparts, which are relative to PC.
The PLT entry needs to update the FDPIC register as well as changing the program counter (PC). binutils' arm port uses the following code sequence.
1 | foo@plt: |
Lazy binding could be implemented, but it is difficult if the architecture does not allow atomic updates of two words. binutils' arm port just disables lazy binding.
Let's inspect an example involving consecutive function calls.
void f0(void);
void f1(void);
void f2(void);
void g() { f0(); f1(); f2(); }
1 | g: |
If GCC implements -fno-plt, it can use the following code sequence:
g:
push {r4, lr}
mov r4, r9
// call f0
ldr r12, .L0
add r12, r12, r4
ldr r9, [r12, #4]
ldr pc, [r12]
// call f1
ldr r12, .L1
add r12, r12, r4
ldr r9, [r12, #4]
ldr pc, [r12]
// tail call f2
ldr r12, .L2
add r12, r12, r4
ldr r9, [r12, #4]
pop {r4, lr}
ldr pc, [r12]
.L0: .word f0(GOTOFFFUNCDESC)
.L1: .word f1(GOTOFFFUNCDESC)
.L2: .word f2(GOTOFFFUNCDESC)
.rofixup section
Unlike standard R_*_RELATIVE relocations that use "*loc += load_base" semantics, the load address in FDPIC mode is dependent on the containing segment. The following code adapted from musl demonstrates the behavior:
static void *laddr(const struct dso *p, size_t v) {
size_t j=0;
for (; v-p->loadmap->segs[j].p_vaddr >= p->loadmap->segs[j].p_memsz; j++);
return (void *)(v - p->loadmap->segs[j].p_vaddr + p->loadmap->segs[j].addr);
}
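The per-segment translation can also be tried outside the dynamic loader. The following is a minimal sketch, not musl's code: struct seg, struct loadmap, and translate are hypothetical stand-ins for the loader's load map, and the explicit segment count and the 0 return for an unmapped address are assumptions added for illustration (the musl loop assumes a containing segment always exists).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the loader's load map. */
struct seg {
    uintptr_t addr;    /* run-time address where the segment was loaded */
    uintptr_t p_vaddr; /* link-time virtual address of the segment */
    size_t p_memsz;    /* size of the segment in memory */
};

struct loadmap {
    size_t nsegs;
    const struct seg *segs;
};

/* Translate a link-time address to its run-time address.
 * Returns 0 if no segment contains v (an assumption of this sketch). */
static uintptr_t translate(const struct loadmap *m, uintptr_t v) {
    for (size_t j = 0; j < m->nsegs; j++) {
        const struct seg *s = &m->segs[j];
        /* Unsigned subtraction wraps when v < p_vaddr, so one comparison
         * checks p_vaddr <= v < p_vaddr + p_memsz. */
        if (v - s->p_vaddr < s->p_memsz)
            return v - s->p_vaddr + s->addr;
    }
    return 0;
}
```

With a text segment and a data segment loaded at unrelated addresses, the same link-time delta between two symbols no longer holds at run time, which is why a single load_base cannot be used.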
In -pie and -shared links, a dynamic section is present, and non-preemptible function and data pointers are relocated by R_*_FUNCDESC_VALUE and R_*_RELATIVE dynamic relocations. For -no-pie links, the situation varies:
R_*_IRELATIVE, unsupported in musl/uclibc-ng).
FDPIC executables of type ET_EXEC present a unique challenge: while the text segment has a fixed address, the data segment has an unknown address at link time and requires relocations. To address this, a linker-created section named .rofixup was introduced in the first FDPIC ABI (FR-V), and later adopted by other FDPIC ABIs.
.rofixup holds non-preemptible function and data pointers, which have R_*_RELATIVE semantics. The last entry of .rofixup is special and holds the address of _GLOBAL_OFFSET_TABLE_. In a -pie or -shared link, .rofixup has only one entry. __ROFIXUP_LIST__ and __ROFIXUP_END__ are defined as encapsulation symbols of .rofixup.
At run time, the loader sets the FDPIC register to the relocated _GLOBAL_OFFSET_TABLE_ value before transferring control to the entry point of the executable.
Here is an example:
.globl fun; fun: bx lr
.section .rodata,"a"
.globl var; var: .long 0
.section .data.rel.ro,"aw"
.long fun(FUNCDESC) // R_ARM_FUNCDESC_VALUE or two .rofixup entries
.long var // R_ARM_RELATIVE or one .rofixup entry
FDPIC can be seen as:
-msep-data/-mno-pic-data-is-text-relative mode that utilizes function descriptors to support PIC register changes for dynamic linking.
st_value referring to the function descriptor is better than the existing FDPIC ABIs (sh, arm).
FDPIC resembles PPC64 ELFv2 TOC where the FDPIC register is set by the caller instead of the callee, avoiding global/local entry and tail call complexity.
-fno-pic -mfdpic with hidden visibility declarations can replace -fno-pic -fropi -frwpi, though clobbered r9 across function calls has slight overhead.
-fPIE -mfdpic with hidden visibility declarations can replace -fPIE -msep-data, though setting the call-clobbered FDPIC register has slight overhead.
-mfdpic often generates smaller code than -mno-fdpic on architectures where PC-relative addressing is expensive. This includes:
LDR with .word _GLOBAL_OFFSET_TABLE_-(.LPIC0+4), which is expensive.
Since FDPIC works effectively even on systems with MMUs, it raises the intriguing possibility of replacing the standard calling ABI entirely.
-mfdpic enables FDPIC code generation. GCC's sh port got
-mfdpic implies -fPIE, so -fno-pic -mfdpic and -fPIE -mfdpic have the same codegen behavior. -fPIC -mfdpic may have different generated code as it additionally sets flag_shlib.
The cfgexpand pass calls sh_get_fdpic_reg_initial_val to retrieve the FDPIC register value from a pseudo register, and registers the pseudo register for the first invocation. At the start of the ira (Integrated Register Allocator) pass, allocate_initial_values initializes the pseudo register to the hard register r12 at the function entry point. sh is the only port that defines TARGET_ALLOCATE_INITIAL_VALUE.
In GCC's arm port, -fno-pic -mfdpic generated code does not work.
In addition, external function calls save and restore r9.
gas's arm port needs --fdpic to assemble FDPIC-related relocation types. GCC configured with an arm*-*-uclinuxfdpiceabi target utilizes arm/uclinuxfdpiceabi.h and transforms -mfdpic to --fdpic when assembling a file. For other targets, -Wa,--fdpic is needed to assemble the output. -Wa,--fdpic unneeded.
-mfdpic -mtls-dialect=gnu2 is not supported. The ARM FDPIC ABI uses ldr to load a 32-bit constant embedded in the text segment. The offset is used to materialize the address of a GOT entry (canonical function descriptor, address of the canonical function descriptor, or address of data).
You can configure binutils with --target=arm-unknown-uclinuxfdpiceabi to get a BFD linker that supports FDPIC emulations.
% ~/Dev/binutils-gdb/out/arm-fdpic/ld/ld-new -V
GNU ld (GNU Binutils) 2.42.50.20240222
Supported emulations:
armelf_linux_eabi
armelfb_linux_eabi
armelf_linux_fdpiceabi
armelfb_linux_fdpiceabi
% ~/Dev/binutils-gdb/out/arm-fdpic/ld/ld-new -m armelf_linux_fdpiceabi -shared a.o -o a.so
GNU ld's arm port fails on R_ARM_GOTOFFFUNCDESC referencing a hidden function symbol (
% cat a.c
__attribute__((visibility("hidden"))) void fun_hidden();
void *fun_hidden_addr() { return fun_hidden; }
% ./bin/ld-new -m armelf_linux_fdpiceabi a.o
[1] 3819239 segmentation fault ./bin/ld-new a.o
% ./bin/ld-new -m armelf_linux_fdpiceabi -shared a.o
./bin/ld-new: BFD (GNU Binutils) 2.42.50.20240224 internal error, aborting at ../../../bfd/elf32-arm.c:16466 in allocate_dynrelocs_for_symbol
./bin/ld-new: Please report this bug.
In -no-pie mode, certain non-function references that require a .rofixup entry lead to a segfault (
Global/weak non-hidden symbols referenced by R_ARM_FUNCDESC are unnecessarily exported (
Several proposals exist for defining FDPIC-like ABIs to work for MMU-less systems.
Undoubtedly, GP should be used as the FDPIC register.
Loading a constant near code (like ARM) is not efficient. Instead, consider a two-instruction sequence:
c.add a0, gp to compute the address of the GOT entry.
Maciej's code sequence supports both function and data access through indirect GP-relative addressing. We can easily enhance it by adding R_RISCV_RELAX to enable linker relaxation and improve performance. Additionally, for consistency with similar notations on x86-64 and AArch64 ("gotpcrel"), let's adopt "gotgprel" notation.
1 | .L0: |
For data access, the code sequence is followed by instructions like:
lb a0,0(rY)
sb a1,0(rY)
Function descriptors and data have different semantics, requiring two relocation types. Stefan O'Rear proposes:
R_RISCV_FUNCDESC_GOTGPREL_HI: Find or create two GOT entries for the canonical function descriptor.
R_RISCV_GOTGPREL_HI: Find or create a GOT entry for the symbol, and return an offset from the FDPIC register.
Drawing inspiration from ARM FDPIC, two additional relocation types are needed for TLS. This results in a 4-type scheme.
Addressing performance concerns is crucial. Stefan suggests an "indirect-to-relative optimization and relaxation scheme":
R_RISCV_PIC_ADD: Tags c.add rX, gp to enable optimization
R_RISCV_INTERMEDIATE_LOAD: Tags ld rY, <gotgprel_lo12>(rX) to enable optimization
Indirect GP-relative addressing can be optimized to direct GP-relative addressing under specific conditions:
1 | # Indirect GP-relative to direct GP-relative |
GOT-indirect addressing can be optimized to PC-relative for non-preemptible data in the text segment.
1 | # Indirect GP-relative to PC-relative |
GOT-indirect addressing can be optimized to absolute addressing for non-preemptible data in the text segment.
1 | # Indirect GP-relative to absolute |
This can be used for SHN_ABS and unresolved undefined weak symbols. With -no-pie linking, regular symbols are eligible for this optimization as well. However, linkers may choose not to implement this since the added complexity might outweigh the benefits.
To handle TLSDESC, we introduce a new relocation type: R_RISCV_TLSDESC_GPREL_HI. This type instructs the linker to find or create two GOT entries unless optimized to local-exec or initial-exec. The combined hi20 and lo12 offsets compute the GP-relative offset to the first GOT entry.
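How a hi20/lo12 pair recombines into one offset can be sketched in C. This assumes the proposed relocations split the GP-relative offset the same way existing RISC-V %hi/%lo pairs do (the low 12 bits are sign-extended, so the high part is rounded accordingly); struct hi_lo, split_hi_lo, and combine are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical holder for the two halves of a GP-relative offset. */
struct hi_lo {
    int32_t hi20; /* value placed in the lui immediate */
    int32_t lo12; /* sign-extended 12-bit immediate, in [-2048, 2047] */
};

/* Split an offset so that (hi20 * 4096) + lo12 == offset. */
static struct hi_lo split_hi_lo(int32_t offset) {
    struct hi_lo r;
    /* Take the low 12 bits and sign-extend them. */
    r.lo12 = ((offset & 0xfff) ^ 0x800) - 0x800;
    /* The remainder is an exact multiple of 4096; use 64-bit math to
     * avoid overflow near INT32_MAX. */
    r.hi20 = (int32_t)(((int64_t)offset - r.lo12) / 4096);
    return r;
}

/* Recombine the halves the way lui followed by a lo12 immediate would. */
static int64_t combine(struct hi_lo p) {
    return (int64_t)p.hi20 * 4096 + p.lo12;
}
```

Because lo12 is signed, an offset such as 2048 is encoded as hi20 = 1 and lo12 = -2048 rather than hi20 = 0 and lo12 = 2048.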
1 | label: |
Existing relocation types, R_RISCV_TLSDESC_LOAD_LO12 and R_RISCV_TLSDESC_ADD_LO12, are extended to work with R_RISCV_TLSDESC_GPREL_HI.
1 | # TLSDESC to initial-exec optimization |
For the initial-exec TLS model, we need a new pseudoinstruction, say, la.tls.ie.fd rX, sym. It expands to:
1 | lui rX, 0 # R_RISCV_TLS_GOTGPREL_HI20(sym) |
Stefan's scheme defines R_RISCV_PIC_LO12_I as an alias for R_RISCV_PCREL_LO12_I. Since the symbol is GP-relative instead of PC-relative, avoiding PCREL in the relocation type name makes sense.
Stefan's 11-type scheme adds R_RISCV_PIC_ADDR_LO12_I to be associated with ld rX, 0(rX) instead. I have not yet figured out the reasoning.
-fno-plt
Regular -fno-plt code loads the .got.plt entry using PC-relative addressing and performs an indirect branch. The FDPIC -fno-plt variant needs to load both the FDPIC register and the destination address.
1 | lui rX, 0 |
--fat-lto-objects option is added to support LLVM FatLTO. Without --fat-lto-objects, LLD will link LLVM FatLTO objects using the relocatable object file. (D146778 <https://reviews.llvm.org/D146778>_)
-Bsymbolic-non-weak is added to directly bind non-weak definitions. (
--lto-validate-all-vtables-have-type-infos, which complements --lto-whole-program-visibility, is added to disable unsafe whole-program devirtualization. --lto-known-safe-vtables=<glob> can be used to mark known-safe vtable symbols. (
--save-temps --lto-emit-asm now derives ELF/asm filenames from bitcode file names. ld.lld --save-temps a.o d/b.o -o out will create ELF relocatable files out.lto.a.o/d/out.lto.b.o instead of out1.lto.o/out2.lto.o. (
--no-allow-shlib-undefined now reports errors for DSO referencing non-exported definitions. (
cdsort algorithm, better than the previous hfsort algorithm. (
a = DEFINED(a) ? a : 0; are now handled. (
OVERLAY now supports optional start address and LMA (
/DISCARD/ section now lead to an error. (
R_AARCH64_GOTPCREL32 is now supported. (
R_LARCH_PCREL20_S2/R_LARCH_ADD6/R_LARCH_CALL36 and extreme code model relocations are now supported.
--emit-relocs is now supported for RISC-V linker relaxation. (
R_RISCV_GOT32_PCREL is now supported. (
R_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128 relocations are now supported. (
Although a substantial feature, the s390x port benefited from Ulrich Weigand's meticulously prepared patch, complete with comprehensive tests from the outset. While I typically provide extensive feedback for such large additions, the patch's exceptional quality minimized the number of comment rounds needed in this instance. I truly appreciate Ulrich taking the time to reply to
The RISC-V port has received a few new relocations for .uleb128 label differences and TLSDESC. I'm glad I made these assembler and linker changes in time, allowing LoongArch developers to port the code for LoongArch. LoongArch borrows many designs from RISC-V. If they were to implement these features first, I'd probably have to spend more time on code reviews, and the outcome would be less rewarding since I wouldn't be the original patch author.
R_AARCH64_GOTPCREL32 (G(GDAT(S))+A-P) and R_RISCV_GOT32_PCREL, similar to R_X86_64_GOTPCREL, are new ABI additions used to optimize preemptible _ZTI* symbol references for clang -fexperimental-relative-c++-abi-vtables. The simplification utilizes the
 .long 0 # 0x0
- .long (_ZTI1A.rtti_proxy-.L_ZTV1A.local)-8
+ .long _ZTI1A@GOTPCREL-4
.long (_ZN1A3fooEv@PLT-.L_ZTV1A.local)-8
- .hidden _ZTI1A.rtti_proxy # @_ZTI1A.rtti_proxy
- .type _ZTI1A.rtti_proxy,@object
- .section .data.rel.ro._ZTI1A.rtti_proxy,"aGw",@progbits,_ZTI1A.rtti_proxy,comdat
- .globl _ZTI1A.rtti_proxy
- .p2align 3, 0x0
-_ZTI1A.rtti_proxy:
- .quad _ZTI1A
- .size _ZTI1A.rtti_proxy, 8
To the best of my knowledge, there is no performance-specific change.
Link: lld 17 ELF changes
Linux on IBM Z is a 64-bit operating system on z/Architecture, related to an older effort porting Linux to ESA/390. As the Wikipedia page clarifies:
Historically the Linux kernel architecture designations were "s390" and "s390x" to distinguish between the 32-bit and 64-bit Linux on IBM Z kernels respectively, but "s390" now also refers generally to the one Linux on IBM Z kernel architecture.
Each instruction has a length of two, four or six bytes (one to three halfwords), and must be located at a halfword boundary. Six-byte instructions have been available since S/360. The two most significant bits of the first halfword determine the length of the instruction.
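The length rule is simple enough to express as a small decoder. The sketch below relies on the z/Architecture convention that the bit patterns 00, 01/10, and 11 select two, four, and six bytes respectively; insn_length is a hypothetical helper name.

```c
#include <stdint.h>

/* Return the instruction length in bytes, given the first halfword
 * (big-endian, so the opcode byte is in the upper 8 bits). */
static unsigned insn_length(uint16_t first_halfword) {
    switch (first_halfword >> 14) { /* two most significant bits */
    case 0:
        return 2; /* 00: one halfword */
    case 3:
        return 6; /* 11: three halfwords */
    default:
        return 4; /* 01 or 10: two halfwords */
    }
}
```

For example, br %r14 is encoded as the single halfword 0x07fe, while the first halfword of brasl %r14 is 0xc0e5 and selects a six-byte instruction.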
There are more than 1000 basic instructions.
There are 16 64-bit general-purpose registers, each treated as two independent 32-bit parts. Certain instructions operate on the high 32-bit part. E.g. aih %r2, 1 adds 1 to the high 32-bit part. I suspect that using these instructions to overcome register scarcity wouldn't be a good idea due to the overhead of register synchronization.
PC-relative addressing is supported with the general-instructions-extension facility (February 2008). For example, only one instruction is needed to load _GLOBAL_OFFSET_TABLE_ (see "Global Offset Table" below) into a register (usually r12).
larl %r12, _GLOBAL_OFFSET_TABLE_ # r12 = _GLOBAL_OFFSET_TABLE_
The RIL instruction format, consisting of 6 bytes, encodes a register and a 32-bit immediate operand. This enables it to implement valuable instructions like BRASL (Branch Relative And Save Long, like x86's CALL) and LARL (Load Address Relative Long, like x86's MOV with RIP-relative addressing).
1 | int var; |
1 | // s390x |
Note: RISC-V's JALR is akin to BRASL, as you can specify which register to save the return address. z/Architecture's BRASL has a length of 6 bytes, so encoding the register, while wasting the encoding space, is more affordable.
LGF (Load, RXY-type) performs a load with a register offset and a 20-bit displacement (base+offset+disp20), but does not support a scaled index operation. Two instructions, consisting of 12 bytes, are required to perform a simple index operation:
int foo(int *a, long i) { return a[i+3]; }
// sllg %r3,%r3,2
// lgf %r2,12(%r3,%r2)
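To make the address arithmetic of the two instructions explicit, here is a C sketch of what they compute. It assumes a 4-byte int; foo_source and foo_lowered are hypothetical names, and the multiplication stands in for the shift so that negative indices stay well-defined in C.

```c
#include <string.h>

/* The source-level operation from the example above. */
static int foo_source(const int *a, long i) { return a[i + 3]; }

/* What the instruction pair computes, spelled out step by step. */
static int foo_lowered(const int *a, long i) {
    long scaled = i * 4; /* sllg %r3,%r3,2: scale the index by sizeof(int) */
    /* lgf %r2,12(%r3,%r2): base + index + displacement 12 (3 * 4) */
    const char *address = (const char *)a + scaled + 12;
    int value;
    memcpy(&value, address, sizeof value); /* 32-bit load */
    return value;
}
```

The constant 3 folds into the displacement for free, but the scaling by 4 still costs a separate instruction because the addressing mode has no scaled index.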
-march=arch9 introduces some conditional instructions like LOCR/LOCGR (Load On Condition, RRF-c-type), if (cond) r1 = r2. In contrast, 4 bytes on other architectures can generally implement a more powerful three-register conditional move.
While I haven't had extensive time to study the instruction set architecture (ISA), I do see some clear limitations:
This raises the question: when would IBM prefer designing a completely new architecture and implementing a dynamic binary translator for existing programs?
r14 is used as the link register while r15 is the stack pointer. In s390x-abi, registers r6 to r13 and r15 are designated as non-volatile (not clobbered by a function call). Registers r2 to r6 are used for integer arguments.
The stack alignment is 8 bytes. Most 64-bit architectures employ 16-byte alignment.
Symbols representing a section offset must be halfword aligned. Compilers assume that an external symbol (e.g. extern char a;) is halfword aligned. -munaligned-symbols removes the assumption.
LLVM supports IBM z10 and newer models. 31-bit addressing is not supported.
The .got section has 3 reserved entries. The linker defines _GLOBAL_OFFSET_TABLE_ at the start of .got. _GLOBAL_OFFSET_TABLE_[0] stores the link-time address of _DYNAMIC, which is used by glibc. _GLOBAL_OFFSET_TABLE_[1] and _GLOBAL_OFFSET_TABLE_[2] are for lazy binding PLT (_dl_runtime_resolve and link map in glibc).
The assembler modifier @GOTENT designates a 32-bit immediate operand. The assembler modifier @GOT designates an immediate operand of either 16-bit or 32-bit.
Compilers generate a LGRL (Load Relative Long) instruction to load the GOT entry of a symbol. When the symbol is non-preemptible and not an
1 | lgrl %r1, var@GOT # R_390_GOTENT(var) |
At 32 bytes per entry, PLTs are notably larger than on other architectures. Only the first 14 bytes (encompassing three instructions) are strictly necessary for eager binding.
1 | larl %r1, .got.plt[n] |
For lazy PLT binding, the .got.plt entry refers to basr %r1, %r0 (14 bytes relative to the PLT entry), which stores the next instruction address into r1. PLT0 is called with r1 set to the relocation offset. PLT0 sets up arguments and calls .got[2], the PLT resolver in glibc.
mold utilizes a 16-byte PLT entry scheme that uses basr %r0, %r1 instead of br %r1 so that PLT0 can compute the relocation offset using r0.
There are 5 absolute relocation types: R_390_{8,16,20,32,64}. They can be used as data relocations (.byte, .short, etc) as well as code relocations.
R_390_8 is used by instruction formats with an 8-bit immediate operand (e.g. SI).
R_390_16 is used by instruction formats with a 16-bit immediate operand (e.g. RI).
R_390_20 is used by instruction formats with a 20-bit immediate operand (e.g. RSY, RXY).
R_390_32 is used by instruction formats with a 32-bit immediate operand (e.g. RIL).
The assembler modifier @PLTOFF designates R_390_PLTOFF16, R_390_PLTOFF32, and R_390_PLTOFF64. Their computation &.plt[n] - .got + A is similar to R_X86_64_PLTOFF64 used by x86-64's large code model. However, -mcmodel=large is unsupported, so these relocations do not seem useful.
Relocation types R_390_GOTPLT* (.got.plt[n] relative to .got or PC) seem unused. GCC never emits the assembler modifier @GOTPLT. I believe these relocations are not useful in the presence of PC-relative addressing.
Refer to
First, let's look at thread pointer accessing.
a0.
a0 and a1, both still 32-bit.
This necessitates three instructions (14 bytes) to retrieve the full thread pointer, while 64-bit access registers would simplify this:
ear %r0, %a0 # r0 = hi(r0) | a0
sllg %r1, %r0, 32 # r1 = r0<<32
ear %r1, %a1 # r1 = hi(r1) | a1 = a0<<32 | a1
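The same combination can be written in C to show the data flow. This is only a model of the three instructions, with the access register contents passed in as plain arguments; thread_pointer is a hypothetical name.

```c
#include <stdint.h>

/* Model of the three-instruction sequence that assembles the 64-bit
 * thread pointer from the 32-bit access registers a0 and a1. */
static uint64_t thread_pointer(uint32_t a0, uint32_t a1) {
    uint64_t r0 = a0;       /* ear %r0, %a0: low half of r0 = a0 */
    uint64_t r1 = r0 << 32; /* sllg %r1, %r0, 32: move a0 to the high half */
    /* ear %r1, %a1: replace the low half of r1 with a1 */
    r1 = (r1 & 0xffffffff00000000ULL) | a1;
    return r1;
}
```

The model ignores the stale high half of r0, which is harmless because the shift by 32 discards it.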
Access registers hold 32-bit access-list-entry tokens (ALET), which are not used on Linux.
In the general dynamic TLS model, a key difference compared to other architectures is the use of __tls_get_offset instead of __tls_get_addr. The process involves several steps, illustrated by the provided assembly code:
1 | ear %r0, %a0 |
_GLOBAL_OFFSET_TABLE_: Four instructions are required but can be shared by subsequent TLS accesses. This step can be reordered.
a@TLSGD) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure), relocated by dynamic relocations R_390_TLS_DTPMOD and R_390_TLS_DTPOFF. The dynamic loader will set the values to (m, a@DTPOFF), the module ID and an offset of the symbol relative to the dynamic TLS block.
DTPOFF): __tls_get_offset(r2) returns dtv[m] + a@DTPOFF - TP. __tls_get_addr in other architectures just returns dtv[m] + a@DTPOFF.
In glibc, __tls_get_offset is defined as:
// unsigned long __tls_get_offset(unsigned long offset);
__tls_get_offset:
la %r2,0(%r2,%r12)
jg __tls_get_addr
While this approach works, it's considered the least efficient implementation of general dynamic TLS among the architectures I have analyzed. Here is why:
tls_index argument (similar to AArch32): This requires an extra lookup in .data.rel.ro.
_GLOBAL_OFFSET_TABLE_ (similar to x86-32): Instead of loading a@TLSGD, and then adding _GLOBAL_OFFSET_TABLE_, it is easier to just load the GOT entry address using LGRL.
__tls_get_offset takes the GOT offset instead of the direct GOT entry address.
__tls_get_offset only provides an offset, requiring an extra instruction for addition with the TP.
The 64-bit TLS ABI, modeled closely after the 32-bit ABI, was codified before nice instructions like LGRL (general-instructions-extension facility, February 2008) were available. It clearly comes at the cost of performance.
The marker relocation R_390_TLS_GDCALL comes after R_390_PLT32DBL, different from other architectures.
The general-dynamic code sequence can be optimized to initial-exec or local-exec.
1 | // general-dynamic to initial-exec |
In both cases, the linker only needs to patch one instruction, instead of four for PPC64.
The process involves several steps, illustrated by the provided assembly code:
1 | lgrl %r2,.LC0 # r2 = *(.LC0) = GOT offset of a tls_index object holding {module_ID, 0} |
_GLOBAL_OFFSET_TABLE_
a@TLSLDM) is stored in the .data.rel.ro section. The offset refers to two GOT entries (a tls_index structure): the module ID and a zero. The module ID entry is relocated by a dynamic relocation R_390_TLS_DTPMOD.
__tls_get_offset(r2) returns dtv[m] - TP. It is not dtv[m] + XXX - TP because the second GOT entry is zero.
The first three steps can be shared among TLS symbols.
Similar to general-dynamic, the marker relocation R_390_TLS_LDCALL comes after R_390_PLT32DBL, different from other architectures. This makes the lld implementation awkward.
The local-dynamic code sequence can be optimized to local-exec.
1 | lgrl %r2,.LC0 # r2 = 0 |
1 | lgrl %r1, a@INDNTPOFF # R_390_TLS_IEENT(a); linker resolves this to a GOT holding the TP offset |
Optimizing the code sequence to local-exec is straightforward: changing the first instruction to lgfi %r1, a@NTPOFF. However, LGFI (Load Immediate) is part of the extended-immediate facility (September 2005), introduced with System z9 109, unavailable when the ABI was defined.
Relocation types R_390_TLS_IE32/R_390_TLS_IE64 for the initial-exec TLS model do not seem useful.
The code sequence loads the TP offset indirectly in a manner similar to AArch32.
lgrl %r1, .LC0 # r1 = a@NTPOFF
lgf %r1, 0(%r1,%r7) # r1 = *(a@NTPOFF + TP) = a
.section .data.rel.ro,"aw"
.LC0: .quad a@NTPOFF # R_390_TLS_LE64(a); linker resolves this to the TP offset, a negative integer
The indirection is unfortunate. Similarly, LGFI (Load Immediate) can be used instead.
1 | // gcc/common.md |
CONSTANT_P matches a class of RTL expressions called RTX_CONST_OBJ.
An RTX code that represents a constant object. HIGH is also included in this class.
The most interesting objects in this class are constant integers, constant floating points, symbol or label references with a constant offset. "s" is like "i", but does not match constant integers (e.g. "s"(1) is an error). So "s" essentially matches a symbol or label reference with a constant offset.
"s" can be used to create an artificial reference for
1 | namespace ns { extern int a[2][2]; } |
C++ templates can make this easier to use.
template <class T, T &x>
void ref() { asm (".reloc ., BFD_RELOC_NONE, %0" :: "s"(x)); }
void use() { ref<decltype(ns::a), ns::a>(); }
1 | // Materialize the symbol address manually |
Using the generic r or m constraint in such cases would instruct GCC to generate instructions to compute the address, which can be wasteful if the materialized address isn't actually needed.
1 | // aarch64 |
The condition !flag_pic || LEGITIMATE_PIC_OPERAND_P (op) highlights a key distinction in GCC's handling of symbol references:
-fno-pic): The "i" and "s" constraints are freely permitted.
-fpie and -fpic): The architecture-specific LEGITIMATE_PIC_OPERAND_P(X) macro dictates whether these constraints are allowed.
While the default implementation (gcc/defaults.h) is permissive (used by MIPS, PowerPC, and RISC-V), many ports impose stricter restrictions, often disallowing preemptible symbols under PIC.
This differentiation probably stems from historical and architectural considerations:
Nevertheless, I think this symbol preemptibility limitation for "s" is unfortunate. Ideally, we could retain the current "i" for immediate integer operand (after linking), and design "s" for a raw symbol name with a constant offset, ignoring symbol preemptibility. This architecture-agnostic "s" would simplify metadata section utilization and boost code portability.
Below are some architecture-specific notes.
In gcc/config/arm, LEGITIMATE_PIC_OPERAND_P(X) has a complex definition and it seems to disallow any non-TLS symbol reference, which means that "s" cannot be used for PIC.
"US" can be used for a symbol reference without an offset (e.g. &a[0] when a is an array) in PIC code, but there is no good way to match &a[1]. To get rid of the # prefix, use the modifier "c".
1 | extern int a[4]; |
In gcc/config/aarch64, LEGITIMATE_PIC_OPERAND_P(X) disallows any symbol reference, which means that "i" and "s" cannot be used for PIC. Instead, the
Clang 7 also implemented "S".
gcc/config/riscv uses the generic LEGITIMATE_PIC_OPERAND_P(X), so "s" can be used in PIC mode.
The constraint "S" is supported (since the beginning of the port in 2017) for a similar purpose, but requires a non-preemptible symbol.
I implemented the constraint "S" for Clang 14 but realized that "S" is less useful in GCC, so I sent a patch to
We can use the constraint "s" (or "i") with the modifier "p" to print the raw symbol name without syntax-specific prefixes, but it does not work when:
-mcmodel=large
-mcmodel=medium for large data
1 | void foo() { |
I filed the
BTW, you can also use the modifier "c".
Require a constant operand and print the constant expression with no punctuation.
I think having such a
If your program wants to adopt raw symbol names in inline assembly, consider the following list for best portability and semantics:
Linux kernel's
Many ports use the constraint "i", which is more or less a hack.
1 | // gcc/common.md |
In the non-PIC mode, "i" does work with a constant, symbol reference, or label reference. However, in the PIC mode, "i" on a symbol reference is rejected by certain GCC ports (e.g. aarch64). I went ahead and sent an
BTW, the jump label patching implementation prevents kernel compilation without optimizations (along with other clever tricks). include/linux/jump_label.h offers an interesting example of __builtin_types_compatible_p and an undefined symbol.
Ideally, if the kernel switches to C++, a template would provide a more elegant and portable solution, enabling compilation without optimizations.
template <class T, T &key>
bool arch_static_branch(bool branch) {
asm_volatile_goto(... : : "Ws"(key), "i" (2 | branch) : : l_yes);
return false;
l_yes:
return true;
}
Condition coverage offers a more fine-grained evaluation of branch coverage. It requires that each individual boolean subexpression (condition) within a compound expression be evaluated to both true and false. For example, in the boolean expression if (a>0 && f(b) && c==0), which has the conditions a>0, f(b), and c==0, condition coverage would require tests that:
a>0 to true and false
f(b) to true and false
c==0 to true and false
A condition combination refers to a specific set of boolean values assigned to individual conditions within a boolean expression. Multiple condition coverage ensures that all possible condition combinations are tested. In the example above, with three conditions, there would be 2³ = 8 possible condition combinations.
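The combination count can be checked by brute force. The sketch below models the three conditions as plain booleans (so a>0, f(b), and c==0 become c1, c2, c3); decision3 and enumerate_combinations are hypothetical names.

```c
#include <stdbool.h>

/* Model of the decision a>0 && f(b) && c==0 with each condition
 * replaced by a boolean input. */
static bool decision3(bool c1, bool c2, bool c3) { return c1 && c2 && c3; }

/* Walk every assignment of the three conditions. Reports how many
 * combinations exist and how many of them make the decision true. */
static void enumerate_combinations(int *combinations, int *true_outcomes) {
    *combinations = 0;
    *true_outcomes = 0;
    for (unsigned mask = 0; mask < 8; mask++) {
        bool c1 = (mask & 1u) != 0;
        bool c2 = (mask & 2u) != 0;
        bool c3 = (mask & 4u) != 0;
        ++*combinations;
        if (decision3(c1, c2, c3))
            ++*true_outcomes;
    }
}
```

The number of combinations doubles with every added condition, which is why the exhaustive criterion quickly becomes impractical for larger expressions.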
While multiple condition coverage may not be practical,
Consider a boolean expression like (A && B) || (C && D). This has four conditions (A, B, C, and D), each potentially a subexpression like x>0. Tests evaluate condition combinations (ABCD) and their corresponding outcomes.
Multiple flavors of MC/DC exist, with Unique-Cause MC/DC representing the strongest variant. When demonstrating the independence of A in the boolean expression (A && B) || (C && D), Unique-Cause MC/DC requires two tests with different outcomes and:

- A differs between the two tests
- B, C, and D are held the same

The two tests form an independence pair for A. A coverage set comprises tests offering such independence pairs for each condition. However, achieving this set may be impossible in the presence of strongly coupled conditions.
Coupling examples:

- x==0 && x!=0 are strongly coupled: changing one automatically changes the other.
- x==0 || x==1 || x==3 exhibits weakly coupled conditions: changing x from 0 to 2 alters only the first condition, while changing it to 1 affects the first two.

Masking involves setting one operand of a boolean operator to a value that renders the other operand's influence on the outcome irrelevant. Examples:

- && with A && false (outcome is always false, unaffected by A).
- || with A || true (outcome is always true, unaffected by A).

Due to short-circuit semantics, the RHS of && is not evaluated when the LHS is false.
Masking MC/DC demonstrates condition independence by showing that the condition in question affects the outcome while keeping other conditions masked. For example, to prove the independence of A in the boolean expression (A && B) || (C && D), C and D can change values as long as C && D remains false. In this way, each condition allows more independence pairs than Unique-Cause MC/DC.
Since 2001, masking MC/DC has been considered an acceptable method for meeting objective 5 of Table A-7 in DO-178B.

Unique-Cause + Masking MC/DC is weaker than Unique-Cause MC/DC but stronger than Masking MC/DC, allowing masking only for strongly coupled conditions.

If an expression has N unique conditions, both Unique-Cause MC/DC and Unique-Cause + Masking MC/DC require a minimum of N+1 tests. It is not clear whether this is an exact bound. ceil(2*sqrt(N)). See An Investigation of Three Forms of the Modified Condition Decision Coverage (MCDC) Criterion for detail.
A binary decision diagram (BDD) is a data structure used to represent a boolean function. Boolean expressions with && and || compile to reduced ordered BDDs.

There is another coverage metric called object branch coverage, which determines whether each branch is taken at least once and is also not taken at least once. Object branch coverage does not guarantee MC/DC, but does when the reduced ordered BDD is a tree.

(B && C) || A is a non-tree example: achieving object branch coverage requires 3 tests, which are insufficient to guarantee MC/DC. If the expression is rewritten to A || (B && C), then the reduced ordered BDD will become a tree, making object branch coverage guarantee MC/DC.
Since GCC 3.4, GCC has employed .gcno and .gcda files to store control-flow graph information and arc execution counts, respectively. This format has undergone enhancements but remains structurally consistent. .gcno files contain function records describing basic blocks, arcs between them, and lines within each basic block. Column information is only available for functions. .gcda files store arc execution counts.

gcov identifies basic blocks on a particular line (usually one) and locates successor basic blocks to infer branches. When -b is specified, gcov prints branch probabilities, though the output may be unclear since .gcno does not encode which branches are the true and false ones.
1 | cat > a.c <<e |
The output:

```
        -:    0:Source:a.c
        -:    0:Graph:a.gcno
        -:    0:Data:a.gcda
        -:    0:Runs:1
function test called 3 returned 100% blocks executed 100%
        3:    1:int test(int a, int b) {
        3:    2:  if (a > 0 && b > 0)
branch  0 taken 67% (fallthrough)
branch  1 taken 33%
branch  2 taken 50% (fallthrough)
branch  3 taken 50%
        1:    3:    return 1;
        2:    4:  return 0;
        -:    5:}
        -:    6:
function main called 1 returned 100% blocks executed 100%
        1:    7:int main() {
        1:    8:  test(0, 1);
call    0 returned 100%
        1:    9:  test(1, 0);
call    0 returned 100%
        1:   10:  test(1, 1);
call    0 returned 100%
        -:   11:}
```
However, there is no direct MC/DC support. I believe that people just approximate MC/DC with branch coverage. For side-effect-free expressions like (B && C) || A, there might be avenues for compiler transformation into a tree-style BDD, such as A || (B && C). However, I am not aware of any such tools.
Efficient Test Coverage Measurement for MC/DC describes a code instrumentation technique to determine masking MC/DC. For a boolean expression with N conditions, each condition is assigned 2 bits:

The instrumentation adds a few bitwise instructions that record the branches taken in conditions and apply a filter for masking effects. When both bits assigned to a condition are 1, we have found an independence pair for this condition.
Jørgen Kvalsvik posted the first MC/DC patch to gcov in March 2022. To use the feature, compile with gcc --coverage -fcondition-coverage and pass --conditions to gcov. The output should look like:

```
        3:   17:void fn (int a, int b, int c, int d) {
        3:   18:    if ((a && (b || c)) && d)
conditions covered 3/8
condition  0 not covered (true false)
condition  1 not covered (true)
condition  2 not covered (true)
condition  3 not covered (true)
        1:   19:        x = 1;
        -:   20:    else
        2:   21:        x = 2;
        3:   22:}
```
Clang offers a sophisticated approach to code coverage called
ExpansionRegion
)In January 2021, the framework has been
&&
and ||
operators.

The primary data structure changes are the addition of a second counter (CountedRegion::FalseExecutionCount and CounterMappingRegion::FalseCount) and a new CounterMappingRegion::RegionKind named BranchRegion.
1 | x = x > 0 && y > 0; // 2 extra counters |
When the boolean expression is used in an if statement, the then counter can be reused by the right operand of the logical operator, but this optimization has not been implemented (mentioned by D84467).
The presentation "Branch Coverage: Squeezing more out of LLVM Source-based Code Coverage, 2020" elaborates on the design.
1 | clang -fprofile-instr-generate -fcoverage-mapping a.c -o a |
1 | 1| 3|int test(int a, int b) { |
A Rust feature request has since been filed: https://github.com/rust-lang/rust/issues/79649
In January 2024, Clang's Source-based Code Coverage
- -fcoverage-mcdc tells Clang to instrument && and || expressions to record the condition combinations and outcomes, and store the reduced ordered BDDs into the coverage mapping section. The bitmap is stored in the __llvm_prf_bits section in a raw profile.
- llvm-profdata merge merges bitmaps from multiple raw profiles and stores the merged bitmap into an indexed profile.
- With --show-mcdc passed to llvm-cov show, llvm-cov reads a profdata file, retrieves the bitmap, computes independence pairs, and prints the information.

Clang adopts a distinct approach to Masking MC/DC compared to the paper "Efficient Test Coverage Measurement for MC/DC". Instead of complex masking value computation, it uses a "boring algorithm":
[0, 2**N).

For example,

```c
if (a && b || c)
  return 1;
```
Let's say in one execution path a=c=1 and b=0. The condition combination (0b101) leads to an index of 5. The instrumentation locates the relevant word in the bitmap and sets bit 5.
The approach is described in detail in "MC/DC: Enabling easy-to-use safety-critical code coverage analysis with LLVM" in 2022 LLVM Developers' Meeting.
Pros:
Cons:
2**N vs. 2*N)

On the LLVM side, the RISC-V TLSDESC work has been completed.
The most important patch is the -mtls-dialect= patch.

These patches are expected to be included in the upcoming LLVM 18.1 release. To obtain TLSDESC code sequences, compile your program with clang --target=riscv64-linux -fpic -mtls-dialect=desc.
Latest patch:
Latest patch:
Latest patch:
musl
No patch yet.
The LLVM patches need testing. Unfortunately, I didn't have a RISC-V image at hand, so I used qemu-user.
Patch musl per

```diff
diff --git c/arch/riscv64/reloc.h w/arch/riscv64/reloc.h
index 1ca13811..7c7c0611 100644
--- c/arch/riscv64/reloc.h
+++ w/arch/riscv64/reloc.h
 #define REL_DTPMOD R_RISCV_TLS_DTPMOD64
 #define REL_DTPOFF R_RISCV_TLS_DTPREL64
 #define REL_TPOFF R_RISCV_TLS_TPREL64
+#define REL_TLSDESC R_RISCV_TLSDESC
 #define CRTJMP(pc,sp) __asm__ __volatile__( \
  "mv sp, %1 ; jr %0" : : "r"(pc), "r"(sp) : "memory" )
diff --git c/include/elf.h w/include/elf.h
index 72d17c3a..7f342a23 100644
--- c/include/elf.h
+++ w/include/elf.h
 enum
 #define R_RISCV_TLS_DTPREL64 9
 #define R_RISCV_TLS_TPREL32 10
 #define R_RISCV_TLS_TPREL64 11
+#define R_RISCV_TLSDESC 12
 #define R_RISCV_BRANCH 16
 #define R_RISCV_JAL 17
diff --git c/src/ldso/riscv64/tlsdesc.s w/src/ldso/riscv64/tlsdesc.s
new file mode 100644
index 00000000..56d1ce89
--- /dev/null
+++ w/src/ldso/riscv64/tlsdesc.s
+.text
+.global __tlsdesc_static
+.hidden __tlsdesc_static
+.type __tlsdesc_static,%function
+__tlsdesc_static:
+  ld a0,8(a0)
+  jr t0
+
+.global __tlsdesc_dynamic
+.hidden __tlsdesc_dynamic
+.type __tlsdesc_dynamic,%function
+__tlsdesc_dynamic:
+  add sp,sp,-16
+  sd t1,(sp)
+  sd t2,8(sp)
+
+  ld t2,-8(tp) # t2=dtv
+
+  ld a0,8(a0) # a0=&{modidx,off}
+  ld t1,8(a0) # t1=off
+  ld a0,(a0) # a0=modidx
+  sll a0,a0,3 # a0=8*modidx
+
+  add a0,a0,t2 # a0=dtv+8*modidx
+  ld a0,(a0) # a0=dtv[modidx]
+  add a0,a0,t1 # a0=dtv[modidx]+off
+  sub a0,a0,tp # a0=dtv[modidx]+off-tp
+
+  ld t1,(sp)
+  ld t2,8(sp)
+  add sp,sp,16
+  jr t0
+
```
1 | (mkdir -p out/rv64 && cd out/rv64 && ../../configure --target=riscv64-linux-gnu && make -j 50) |
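Adjust ~/musl/out/rv64/lib/musl-gcc.specs and update ~/musl/out/rv64/obj/musl-gcc. The here-document delimiter is quoted so that ${REALGCC...} and "$@" are written to the script literally instead of being expanded when the file is created.

```sh
cat > ~/musl/out/rv64/obj/musl-gcc <<'eof'
#!/bin/sh
exec "${REALGCC:-riscv64-linux-gnu-gcc}" "$@" -specs ~/musl/out/rv64/lib/musl-gcc.specs
eof
```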
I have also modified musl-clang (clang wrapper). Adjust ~/musl/out/rv64/obj/musl-clang to use --target=riscv64-linux-musl. Adjust ~/musl/out/rv64/obj/ld.musl-clang to define cc="/tmp/Rel/bin/clang --target=riscv64-linux-gnu" and invoke exec /tmp/Rel/bin/ld.lld "$@" -lc.
Prepare a runtime test mentioned at the end of

```sh
cat > ./a.c <<eof
#include <assert.h>
int foo();
int bar();
int main() {
  assert(foo() == 2);
  assert(foo() == 4);
  assert(bar() == 2);
  assert(bar() == 4);
}
eof
cat > ./b.c <<eof
#include <stdio.h>
__thread int tls0;
extern __thread int tls1;
int foo() { return ++tls0 + ++tls1; }
static __thread int tls2, tls3;
int bar() { return ++tls2 + ++tls3; }
eof
echo '__thread int tls1;' > ./c.c
sed 's/ /\t/' > ./Makefile <<'eof'
.MAKE.MODE = meta curDirOk=true
CC := ~/musl/out/rv64/obj/musl-clang -O1 -g -fpic -mtls-dialect=desc -w
LDFLAGS := -Wl,-rpath=.
all: a0 a1 a2
run: all
 ./a0 && ./a1 && ./a2
c.so: c.o; ${LINK.c} -shared $> -o $@
bc.so: b.o c.o; ${LINK.c} -shared $> -o $@
b.so: b.o c.so; ${LINK.c} -shared $> -o $@
a0: a.o b.o c.o; ${LINK.c} $> -o $@
a1: a.o b.so; ${LINK.c} $> -o $@
a2: a.o bc.so; ${LINK.c} $> -o $@
eof
bmake run
```
=> succeeded!
1 | % bmake run |
During my development of the linker patch, the Clang Driver patch was actually not ready yet. I used a more hacky approach by compiling using GCC, replacing some assembly fragments with TLSDESC code sequences, and assembling using Clang.
Compile b.c to bb.s. Replace general-dynamic code sequences (e.g. la.tls.gd a0,tls0; call __tls_get_addr@plt) with TLSDESC, e.g.

```asm
.Ltlsdesc_hi0:
  auipc a0, %tlsdesc_hi(tls0)
  ld a1, %tlsdesc_load_lo(.Ltlsdesc_hi0)(a0)
  addi a0, a0, %tlsdesc_add_lo(.Ltlsdesc_hi0)
  jalr t0, 0(a1), %tlsdesc_call(.Ltlsdesc_hi0)
  add a0, a0, tp
```
Create an alias bin/ld.lld to be used with -Bbin -fuse-ld=lld. I made some adjustments to the Makefile so that an invocation looks like:
1 | % bmake run |
This article compares several prominent object file formats, drawing upon my experience and insights.

At the heart of each format lies the representation of essential components like symbols, sections, and relocations. For each control structure, we'll begin with ELF, a widely used format, before venturing into the landscapes of other notable formats.

Before delving into the technical side, I will share some notes about my archaeological journey.
The a.out format was designed for PDP-11 and appeared in the first version of Unix. The quantities were 16-bit, but can be naturally extended to 32-bit or 64-bit.

In Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism by James Q. Arnold provided some description.

For 32-bit machines, the a.out format was extended in several ways. Most obviously, 16-bit quantities were enlarged to 32-bit values. The symbol table changed to allow names of unlimited length. Relocation entries also changed significantly. Larger programs and different relocation conventions made it necessary to associate a relocation entry with an explicit address, instead of relying on the implicit correspondence between program sections and relocation records.

Many Unix and Unix-like operating systems, including SunOS, HP-UX, BSD, and Linux, used a.out before switching to ELF.

The most noticeable extension is dynamic shared library support. (This feature is distinct from static shared library, where each shared library needs a fixed address in the address space.) There are two flavors:
A linker supporting shared libraries.
) and A linker supporting shared libraries (run-time part).
) added shared library support similar to the SunOS scheme.

FreeBSD a.out(5) provides a nice description.
If you follow recent years' Linux kernel news, there were some discussions when Linux eventually

a.out supports three fixed loadable sections TEXT, DATA, and BSS, which is too restrictive. COFF introduces custom section support and allows up to 32767 sections. The ELF paper contains some remarks:

Common Object File Format (COFF), was designed primarily to support electronic switching systems (the telephone network). Its distinguishing features were multiple sections (text, data, uninitialized memory, reserved memory, overlays, etc.), some support for multiple target processors, defined structures for symbol tables and relocations, and debugging information tailored for the C language.
According to scnhdr.h in System V Release 2 for NS32xxx, COFF was designed no later than 1982. Then, System V Release 3 adopted COFF, which motivated a lot of follow-ups.
Key drawbacks:
Carnegie Mellon University developed the Mach kernel as a proof of the microkernel concept. The operating system used a format derived from a.out with minor modifications, named the Mach object file format. The abbreviation, Mach-O, is often used instead. The NeXTSTEP operating system and then Darwin adopted Mach-O.

Dynamic shared library support on Mach-O came later than other object file formats. In a NeXTSTEP manual released in 1995, I can find MH_FVMLIB (fixed virtual memory library, which appears to be a static shared library scheme) but not MH_DYLIB (used by modern macOS for .dylib files).
Key drawbacks:
.subsections_via_symbols has some downsides (discussed later).

In my opinion, Mach-O is the most limited among Mach-O/PE/ELF. However, I want to acknowledge certain innovative features like .subsections_via_symbols and S_ATTR_LIVE_SUPPORT.
In response to the frustrations and inherent constraints of COFF, coupled with a self-imposed byte order dilemma, AT&T introduced a groundbreaking format: Executable and Linking Format (ELF). ELF revisited fixed content and hard-wired concepts in previous object file formats, removed unnecessary elements, and made control structures more flexible.

This pivotal shift was embraced by System V Release 4, marking a new era in object file format design. In the 1990s, many Unix and Unix-like operating systems, including Solaris, IRIX, HP-UX, Linux, and FreeBSD, switched to ELF.
At a minimum, a symbol control structure needs to encode the name, section, and value. We can require that every symbol is defined in relation to some section. We can use a section index of zero to represent an undefined symbol.

In a minimal object file format with only a few hard-coded sections (a.out), the section field can be omitted. A type field can be used to decide whether the symbol can reference a function or a data object.
1 | // ELFCLASS32, 16 bytes |
The symbol name is represented as a 32-bit index into the string table. A 32-bit integer suffices, while a 16-bit integer would be too small.

st_shndx uses a size-saving trick. The 16-bit member encodes a section index. If the member is SHN_XINDEX (0xffff), then the actual value is contained in the associated section of type SHT_SYMTAB_SHNDX. This is a very nice trick because the number of sections is almost always smaller than 0xff00. In pathological cases, there can be more sections, where a section of type SHT_SYMTAB_SHNDX is needed.
st_info specifies the symbol's type (4 bits) and binding (4 bits) attributes. Types are allocated very conservatively and usually imply different linker behaviors. The inherently different linker behaviors for symbol types are not that many. So while 4 bits seem small, they are sufficient in practice. As we will learn, this is significantly smaller than COFF's type and storage class representation. A symbol's binding is for the local/weak/global distinction. The reserved 4 bits can accommodate more values, but only GNU reserves one value (STB_GNU_UNIQUE) (a misfeature in my opinion).
In COFF, function symbols can use an auxiliary symbol record to encode the size of function (x_fsize; TotalSize in PE). In ELF, st_size is a fixed member, used for copy relocations and symbolizers. If we eliminate copy relocations and don't need the symbolization heuristics, this field will become garbage.
Here is what the structure looks like if we remove st_size.

```c
// 16 bytes
struct Elf64_Sym_minimized {
  Elf64_Word st_name;     // index into the string table
  unsigned char st_info;  // type and binding
  unsigned char st_other; // visibility and others
  Elf64_Half st_shndx;    // section index
  Elf64_Addr st_value;
};
```
1 | // a.out (System V), 16 bytes |
a.out uses nlist to represent a symbol table entry. In the original format for PDP-11, the assembler generates symbols of at most 7 bytes. n_name[8] can hold the name with a NUL end. Unix's appreciation of shorter identifier names is related to this:)

To support longer names, extensions add a string table after the symbol table, and allow n_name to be interpreted as an index (n_strx) into the string table. This member then becomes a size-saving trick by inlining a short name (8 bytes or less) into the structure. Some variants, like binutils' 64-bit a.out format, use an index exclusively and removed n_name.

n_type, broken down into three sub-fields, describes whether a symbol is defined or undefined, external or local, and the symbol type. The values listed on the FreeBSD manpage are also used on PDP-11.

For a defined symbol, n_type describes whether it is relative to TEXT, DATA, or BSS.
1 | // COFF (System V Release 3), 18 bytes in the absence padding |
COFF adopts a.out's approach to save space in symbol names. This likely made sense when most symbols were shorter. However, with today's often lengthy symbol names, this inlining technique complicates code and increases the control structure size (from 4 to 8 bytes).
The section number is a 16-bit signed integer, supporting up to 32,767 sections. Positive values indicate a section index, while special values include:

- N_UNDEF (0): Undefined symbol (distinct from a.out's n_type representation).
- N_ABS (-1): Symbol has an absolute value.
- N_DEBUG (-2): Special debugging symbol (value is meaningless).

COFF's n_type and n_sclass encode C's type and storage class information. PE assigns longer names to these types and storage classes, e.g., IMAGE_SYM_TYPE_CHAR/IMAGE_SYM_TYPE_SHORT, IMAGE_SYM_CLASS_AUTOMATIC/IMAGE_SYM_CLASS_EXTERNAL. While values are mostly consistent, minor differences exist:

- IMAGE_SYM_TYPE_VOID (1) is different from System V Release 3's #define T_ARG 1 /* function argument (only used by compiler) */.
- IMAGE_SYM_CLASS_WEAK_EXTERNAL (105) is different from System V Release 3's #define C_ALIAS 105 /* duplicate tag */.

Symbols with C_EXT (IMAGE_SYM_CLASS_EXTERNAL) are global and added to the linker's global symbol table, akin to ELF's STB_GLOBAL symbol binding.
System V ships a symbolic debugger (sdb), which utilizes n_type and n_sclass. If we acknowledge that the debugging information format is outdated, n_type and n_sclass serve as a wasteful counterpart to ELF's st_info.

n_numaux relates to Auxiliary Symbol Records, allowing extra information but introducing non-uniform symbol table entries. While seemingly beneficial, their use cases are limited and could often be encoded using separate sections. In PE, an auxiliary symbol record can represent weak definitions, but weak references are not supported. They can also provide extra information to section symbols.

ECOFF defines Local Symbol Entry (SYMR) and External Symbol Entry (EXTR).
1 | typedef struct { |
1 | // Mach-O, 12 bytes |
Mach-O's nlist and nlist_64 are not that different from a.out's, with n_other changed to n_sect to indicate the section index. The 8-bit n_sect field restricts representable sections to 255 without out-of-band data (discussed later). If we extend n_sect to 32-bit, with alignment padding the structure size will increase to 24 bytes, the same as Elf64_Sym.

Like a.out, the N_EXT bit of n_type indicates an external symbol. The N_PEXT bit indicates a private external symbol.

Key bits in n_desc are N_WEAK_DEF, N_WEAK_REF, and N_ALT_ENTRY.
1 | // ELF, 40 bytes |
The section name is represented as a 32-bit index into the string table. If we use a 16-bit integer, a large number of section names with a symbol suffix (e.g. .text.foo .text.bar) could make the index overflow.

sh_type categorizes the section's contents and semantics. It avoids hard-coding magic names in many scenarios. Technically a 16-bit type could work pretty well but was deemed insufficient for flexibility.

sh_flags describes miscellaneous attributes, e.g. writable and executable permissions, and whether the section should appear in a loadable segment. This member is 32-bit in Elf32_Shdr while 64-bit in Elf64_Shdr. In practice no architecture defines flags for bits 32 to 63, therefore this member is somewhat wasteful.

Location and size. sh_offset gives the byte offset from the beginning of the file to the first byte in the section. To support object files larger than 4GiB, this member has to be 64-bit. sh_size gives the section's size in bytes. A section of type SHT_NOBITS occupies no space in the file. To support sections larger than 4GiB, this member has to be 64-bit.

Address and alignment. sh_addr describes the address at which the section's first byte should reside for an executable or shared object. It should be zero for relocatable files. sh_addralign holds the address alignment. In practice this member must be a power of 2 even if the generic ABI does not require so. This member is 64-bit in ELF64, which allows an alignment up to 2**63. In practice, an alignment larger than the page size (or the largest huge page size, if huge pages are enabled) does not make sense, and a maximum value of 2**31 is sufficient. Therefore, we could use a log2 value to hold the alignment.
Connection information. sh_link holds a section index. sh_info holds either a section index or a symbol index. If you recall that st_shndx is 16 bits for a very solid reason, you will know that the two fields are somewhat wasteful.

For a table of fixed-size entries, sh_entsize holds the entry size in bytes. In some use cases this member is not a power of two. In practice, one byte suffices.
While ELF's section header structure is designed for flexibility, potential optimizations could reduce its size without significant loss of functionality. By using smaller data types for sh_flags, sh_link, sh_info, and sh_entsize based on practical needs, we could make the structure significantly smaller.

```c
// 32 bytes
struct Elf32_Shdr_minimized {
  Elf32_Word sh_name;
  Elf32_Word sh_type; // Making this uint16_t and reordering it can decrease the size to 28 bytes
  Elf32_Word sh_flags;
  Elf32_Addr sh_addr;
  Elf32_Off sh_offset;
  Elf32_Word sh_size;
  uint8_t sh_addralign;
  uint8_t sh_entsize;
  Elf32_Half sh_link;
  Elf32_Half sh_info;
};

// 40 bytes
struct Elf64_Shdr_minimized {
  Elf64_Word sh_name;
  Elf64_Word sh_flags;
  Elf64_Addr sh_addr;
  Elf64_Off sh_offset;
  Elf64_Xword sh_size;
  Elf64_Half sh_type;
  uint8_t sh_addralign;
  uint8_t sh_entsize;
  Elf64_Half sh_link;
  Elf64_Half sh_info;
};
```
Reducing sh_type to 2 bytes loses a bit of flexibility. If this is deemed insufficient, we could take 3 bits from sh_addralign (by turning it into a bitfield) and give them to sh_type.
1 | // COFF (System V Release 3), 40 bytes, when sizeof(long) == 4 |
PE's section control structure demonstrates a minor modification compared to COFF, s_paddr => VirtualSize.

The presented structure measures 40 bytes when long is 4 bytes. If we extend s_paddr, s_vaddr, s_size, s_scnptr, s_relptr, s_lnnoptr to 8 bytes, the structure will be 64 bytes.

The section name supports up to 8 bytes. A longer name would require an extension similar to the symbol control structure.

Encoding both s_paddr and s_vaddr is wasteful. ELF encodes the physical address in the segment and therefore removes the member from its section structure.

COFF embeds the location and size of relocations into the section structure. This is actually pretty nice. A 16-bit s_nreloc may appear restrictive but is sufficient for relocatable files. In practice, though, the number of relocations can exceed 65536 for a single section when using relocatable linking.
s_lnnoptr and s_nlnno point to line number entries, which relate addresses to source file line numbers. The embedded nature is inflexible.

```c
/* There is one line number entry for every "breakpointable" source line in a
   section. Line numbers are grouped on a per function basis; the first entry in a
   function grouping will have l_lnno = 0 and in place of physical address will be
   the symbol table index of the function name. */
struct lineno {
  union {
    long l_symndx; /* sym. table index of function name iff l_lnno == 0 */
    long l_paddr;  /* (physical) address of line number */
  } l_addr;
  unsigned short l_lnno; /* line number */
};
```

This simple format is deprecated. In DWARF, special opcodes in line number information can encode the information in a more space-efficient way and present more information like the column number.
1 | // Mach-O, 80 bytes |
How does Mach-O end up with such a huge section structure? Let's find out...

A Mach-O binary is divided into segments, each housing one or more sections. The section structure encodes the section name and the segment name, both of which can be up to 16 bytes. This representation allows the section names to be read without a string table, but is restrictive for descriptive names. Section semantics are derived from the name (unlike ELF).

The segment name is redundantly encoded within the section structure. We could derive the segment from the section name and flags, e.g., S_ATTR_SOME_INSTRUCTIONS => __TEXT, S_ZEROFILL => ZeroFill __DATA.

There is a severe limitation: a maximum of 255 sections due to nlist::n_sect being a uint8_t. This is apparently too restrictive. Thankfully, an innovative feature, .subsections_via_symbols, overcomes the limitation. The feature uses a monolithic section with "atoms" dividing it into pieces (subsections). This is more size-efficient than ELF's -ffunction-sections -fdata-sections -fno-unique-section-names. However, there are assembler limitations, relocation processing complexity, and potential loss of the ability to ensure that two non-local symbols are not reordered.

Like COFF, Mach-O embeds the location and size of relocations into the section structure.

reserved1 and reserved2 are used similarly to ELF's connection information.

__TEXT,__stub (like ELF's .plt), __TEXT,__got (like ELF's .got), __TEXT,__la_symbol_ptr (like ELF's .got.plt), and __DATA,__thread_ptrs set reserved1 as an index into the indirect symbol table (the offset is specified by indirectsymoff in a LC_DYSYMTAB command).

For __TEXT,__stub, reserved2 is the size of one entry, e.g., 6 for x86-64 (jmpq *__la_symbol_ptr(%rip)). This is analogous to ELF x86-64's DT_X86_64_PLTSZ. For other sections, reserved2 is zero.
1 | // ELFCLASS32, 8 bytes |
r_info specifies the symbol table index with respect to which the relocation must be made, and the type of relocation to apply.

There are two variants, REL and RELA. Let's quote the generic ABI:

As specified previously, only Elf32_Rela and Elf64_Rela entries contain an explicit addend. Entries of type Elf32_Rel and Elf64_Rel store an implicit addend in the location to be modified. Depending on the processor architecture, one form or the other might be necessary or more convenient. Consequently, an implementation for a particular machine may use one form exclusively or either form depending on context.

Relocatable files need a lot of relocation types while executables and shared objects need only a few. The former are often called static relocations while the latter are called dynamic relocations.

Of the few dynamic relocation types, most do not need the addend member. lld provides an option -z rel to use SHT_REL/DT_REL dynamic relocations.

If we disregard the REL dynamic relocation scenario, then all modern architectures use RELA exclusively. Most architectures encode the immediate with only a few bits, which are inadequate for many relocatable file uses.

ELFCLASS64, with its 64-bit members, doubles the size compared to ELFCLASS32's 32-bit members. Since relocations often comprise a substantial portion of object files, this size difference can lead to user concerns. However, in practice, a 24-bit symbol index is often sufficient, even in 64-bit contexts. Therefore, if a 64-bit architecture's relocation type requirements are less than 256, ELFCLASS32 can be a viable and more size-efficient option.
In March 2024, I proposed
1 | // a.out (System V Release 2), 8 bytes |
r_symbolnum mirrors ELF's ELF32_R_SYM.

The other bitfields resemble ELF's ELF32_R_TYPE, but are split into distinct fields:

- r_pcrel
- r_length

Reserving dedicated semantics for individual bits can limit adaptability. COFF and ELF opted to remove bitfields in favor of a type to provide greater flexibility.
1 | // COFF, 10 bytes on disk, 12 bytes with alignment padding |
This format resembles ELF's Elf32_Rel.

r_vaddr gives the virtual address of the location at which to apply the relocation action. If we interpret r_vaddr as an offset (as PE does) and restrict section size to 32 bits, we could reuse this structure for 64-bit architectures.

r_symndx is a 32-bit symbol table index.

r_type is a 16-bit relocation type, limited in number compared to ELF.

COFF generally supports fewer relocation types than ELF. System V Release 3 defines very few relocations for each architecture. In binutils, include/coff/*.h files define relocations for more architectures.

While ELF uses REL/RELA for both relocatable files and executables, in PE image files, the import address table and base relocation table (.reloc) are a completely different design.
1 | // Mach-O, 16 bytes |
Mach-O's relocation structure closely mirrors a.out's with an adapted r_symbolnum meaning. When r_extern == 0 (local), the r_symbolnum member references a section index instead of a symbol index. This is to support custom sections, breaking the three-section limitation (text, data, and bss) of traditional a.out.

As mentioned above, dedicating bits to bitfields (r_pcrel, r_length, and r_scattered) greatly restricts the number of relocation types.

Related to the relocation type limitation, a `.long foo - .` in a data section requires a pair of relocations, SUBTRACTOR and UNSIGNED. I have some notes on
Mach-O uses a number of sections in the __LINKEDIT
segment to communicate information to dyld.
Dennis MacAlistair Ritchie's A.OUT (V)
manpage (1971) describes the original a.out format. The headercontains 6 words.
The text relocations are implicit.
Later versions introduced new magic numbers, separated text relocations and data relocations, and added an entry point (a_entry).
TODO
ELFCLASS32 structures are already compact, offering limited size reduction potential. ELFCLASS64 structures, while flexible, can be optimized by sacrificing some flexibility (64-bit quantities). The 64-bit symbol control structure is compact, but the section and relocation structures are quite wasteful if we can give up that flexibility.
As the ELF paper acknowledges, "Relocatable and executable files do not necessarily have the same constraints, and we considered using two file formats. Eventually, we decided the two activities were similar enough that a single format would suffice." There are more tools inspecting executables than relocatable files. So, naturally, we might want to change just relocatable files. Can we use ELFCLASS32 relocatable files for 64-bit architectures?
Well, x86-64 and AArch64 draw a clear distinction between ELFCLASS32 and ELFCLASS64. ELFCLASS32 is for ILP32 (x32, aarch64_ilp32) while ELFCLASS64 is for LP64. However, the discontinued Itanium architecture sets a precedent that ELFCLASS32 can be used for LP64 programs. Quoting its psABI (Intel Itanium Processor-specific Application Binary Interface (ABI)):
For Itanium architecture ILP32 relocatable (i.e. of type ET_REL) objects, the file class value in e_ident[EI_CLASS] must be ELFCLASS32. For LP64 relocatable objects, the file class value may be either ELFCLASS32 or ELFCLASS64, and a conforming linker must be able to process either or both classes. ET_EXEC or ET_DYN object file types must use ELFCLASS32 for ILP32 and ELFCLASS64 for LP64 programs.
Addresses appearing in ELFCLASS32 relocatable objects for LP64 programs are implicitly extended to 64 bits by zero-extending.
Note: Some constructs legal in LP64 programs, e.g. absolute 64-bit addresses outside the 32-bit range, may require use of an ELFCLASS64 relocatable object file.
Given the prior art, it seems promising to allow ELFCLASS32 when code size is a concern. Ideally there should be a marker to distinguish ILP32 and LP64-using-ELFCLASS32 object files.
The primary changes reside in the assembler and linker. It's also important to ensure that binary manipulation programs (like objcopy) and dump tools are happy with them.
Further optimization potential lies in exploring the use of Elf32_Rel instead of Elf32_Rela for even smaller relocations.
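To put rough numbers on the potential savings, here is a small sketch that computes the standard entry sizes with Python's struct module. The field lists follow the generic ABI definitions; the relocation count of 1000 is an arbitrary illustrative value.

```python
import struct

# Field layouts from the generic ABI (little-endian, no padding needed).
ELF32_REL = struct.Struct("<II")    # r_offset, r_info
ELF32_RELA = struct.Struct("<IIi")  # r_offset, r_info, r_addend
ELF64_RELA = struct.Struct("<QQq")  # r_offset, r_info, r_addend

n_relocs = 1000  # hypothetical number of relocations in one object file
for name, layout in [("Elf32_Rel", ELF32_REL),
                     ("Elf32_Rela", ELF32_RELA),
                     ("Elf64_Rela", ELF64_RELA)]:
    print(f"{name}: {layout.size} bytes/entry, {layout.size * n_relocs} bytes total")
```

With REL, the addend lives in the relocated location itself, which is why the entry can drop the third field.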
This approach is independent of whether ELFCLASS32 is adopted and can be applied to both ELFCLASS32 and ELFCLASS64. The ELF paper is clear, "ELF allows extension and redefinition for other control structures." However, caution is warranted due to the significant impact on the ecosystem as many tools rely on the existing structures.
One promising example is Elf32_Shdr_minimized, a custom structure reduced to 32 bytes from the standard Elf32_Shdr's 40 bytes. It would make me nervous, but if we reduce sh_type to a uint16_t, the structure size can shrink to 28 bytes.
Earlier debuggers operated using a debugging information format called "stabs" (short for symbol table entries; dating back to at least UNIX/32V in 1979). Stabs is encoded using extra symbol table entries in the a.out object file format.
1 | .stabs "string",type,other,desc,value |
Stabs was ported to COFF for System V Release 2, used on some machines. System V Release 4 switched to ELF and abandoned stabs in favor of a newly developed format called DWARF. Its debugger sdb was rewritten to support DWARF, and stabs was no longer supported. (The first version of DWARF was later published by the UNIX International Programming Languages Special Interest Group (SIG) in January 1992.)
However, stabs continued to be used in other operating systems, including *BSD, AIX, and IRIX. For example, the GNU assembler added stabs support for ELF (n_strx is 32-bit).
GCC 13
Stabs is less efficient than DWARF. When compiling a non-trivial program (so that the boilerplate in DWARF is less significant), you may observe that .stab and .stabstr consume more space than .debug_* sections, even though DWARF is more expressive and contains more information.
While the diversity of operating systems and architectures poses complexity for application developers, the heterogeneity of object file formats presents a unique challenge for toolchain development, one that is probably not very tangible to application developers and users.
Integrating features like Link Time Optimization (LTO), Profile-Guided Optimization (PGO), and sanitizers is complex due to object file format-specific limitations and nuances. While most developers primarily concern themselves with a specific format, they still need to tread carefully during development to avoid disruptions to other platforms.
WebAssembly
I made 700+ commits this year. Many are clean-up commits or fixups for others' work. I hope that I can do more useful work next year.
--features=layering_check
for Bazel's llvm andclang projectsLLVM_ENABLE_REVERSE_ITERATION
forStringMap
llvm::xxh3_64bits
and adopted it inllvm/clang/lld-###
exit code, -fsanitize=kcfi
,and XRay-fsanitize=function
work for C and non-x86architectures%lb
recognition for printf/scanf, checking the failure memory order for atomic_compare_exchange
-familybuilt-in functions, -Wc++11-narrowing-const-reference
@plt
symbols for x86.plt.got
, mapping symbol improvements,--disassemble-symbols
changes, etcR_RISCV_SET_ULEB128/R_RISCV_SUB_ULEB128
for.uleb128
directives#line
, and #include
, made llvm-cov gcov workwith a smaller stack size--call-graph-profile-sort=hfsort
defaultReviewed many commits. A lot of people don't add aReviewed By:
tag. Anyway, counting commits with the tag cangive an underestimate. 1
2% git shortlog -sn 2679e8bba3e166e3174971d040b9457ec7b7d768...main --grep 'Reviewed .*MaskRay' | awk '{s+=$1}END{print s}'
395
Many GitHub pull requests are not counted.
I
lld/ELF is quite stable. I have made some maintenance changes. As usual, I wrote the ELF port's release notes for the two releases. See
I made
I have made changes to the x86-64, AArch64, and RISC-V psABI documents.
Reported many bugs and feature requests:
ld: Should --gc-sections respect RHS of a symbol assignment?
objcopy: add support for changing ELF symbol visibility
rtld: resolve ifunc relocations after JUMP_SLOT/GLOB_DAT/etc
ld riscv: --emit-relocs does not retain the original relocation type
gas aarch64: GOT relocations referencing a local symbol should not be changed to reference STT_SECTION
objcopy --set-section-flags: support toggling a flag
gas x86: reject {call,jmp} [offset func] in Intel syntax
My commits:
PR30592 objcopy: allow --set-section-flags to add or remove SHF_X86_64_LARGE
ld: Allow R_386_GOT32 for call *__tls_get_addr@GOT(%reg)
ld: Allow R_X86_64_GOTPCREL for call *__tls_get_addr@GOTPCREL(%rip)
RISC-V: Add --[no-]relax-gp to ld
I had one patch landed supporting -mlarge-data-threshold= for x86-64 -mcmodel=medium.
8 commits. Consulted on a number of toolchain questions.
Wrote 29 blog posts (including this one, mainly about toolchains) and revised many posts initially written between 2020 and 2023.
Trips: Orlando, Philadelphia, Harrisburg, Trenton, Newark, New York City, Alaska, Ontario, Quebec, Nova Scotia, Chicago, Atlanta, Miami, Jamaica, Haiti.
Mastodon: https://hachyderm.io/@meowray
]]>reviews.llvm.org/Dxxxxx
pages.) The intent is to eliminate a SQL engine. Phabricator operates on a
Raphaël Gomès developed
The DNS records of reviews.llvm.org have been pointed to the
The review discussions primarily happen on /Dxxx pages, which should be archived. There are far fewer discussions on /rL$svn_rev (when LLVM used svn) and /rG$git_commit pages. We skip archiving them as a compromise.
Some /Dxxx pages contain a large number of modified files (usually tests). Phabricator presents a "Load File" button. If we expand every button, the resulting HTML can be very large. We need to limit the number of buttons to click.
The file hierarchy is quite straightforward. archive/unprocessed/diffs contains raw HTML pages while templates/diffs contains scraped HTML pages alongside patch files.
1 | % tree archive/unprocessed/diffs | head -n 12 |
1 | % du -sh archive/unprocessed/ |
At present, some https://reviews.llvm.org/Dxxxxx pages might be inaccessible. https://reviews.llvm.org/Dxxxxx?download=true is an alternative if you just need the patch file but not discussions.
Embedded images are currently unavailable.
% rg -l 'phabricator-remarkup-embed-image' templates/diffs/ | wc -l
3332
I aim to utilize Nginx solely to serve URIs.
1 | /D2 => /diffs/2/D2.html |
We just need URL mapping and some Nginx location
directives.
1 | map_hash_max_size 400000; |
Among D1 to D159553, there were 1669 pages that were not downloaded. These differentials might have been deleted by the author, had a permission error (e.g. the author did not make it publicly readable), or hit a crawler error (e.g. an emulated button click failed).
In January 2024, I got access to the machine hosting the Phabricator instance and crawled 759 differentials. Among them, 184 differentials have a state other than "Closed".
We can make a copy of process-html.py and modify it to get some statistics.
from bs4 import BeautifulSoup

def process_html(html, diff):
    # Print one tab-separated row per differential:
    # id, status, title, author, and any *-commits subscribers.
    soup = BeautifulSoup(html, "html.parser")
    status = soup.select_one(".phui-tag-core").text
    title = soup.select_one(".phui-header-header").text
    author = soup.select_one(".phui-head-thing-view > strong").text
    sub = []
    for div in soup.select(".phui-handle.phui-link-person"):
        if 'commits' in div.text:
            sub.append(div.text)
    print(diff, status, title, author, ','.join(sub), sep='\t')
I have collected differentials that are not “Closed” at *-commits
, e.g.
Let's begin with a Linux x86-64 example involving global variables exhibiting various properties such as read-only versus writable, zero-initialized versus non-zero, and more.
1 |
|
1 | % clang -c -fpie a.c |
(We will discuss -Wl,-z,separate-loadable-segments
We can see that these functions and global variables are placed in different sections.
.rodata: read-only data without dynamic relocations, constant in the link unit
.text: functions
.data.rel.ro: read-only data associated with dynamic relocations, constant after relocation resolving, part of the PT_GNU_RELRO segment
.data: writable data
.bss: writable data known to be zeros
TODO I may write more about how linkers lay out sections and segments.
Anyhow, the linker will place .data and .bss in the same PT_LOAD program header (segment) and the rest into different PT_LOAD segments. (There are some nuances. If you use GNU ld's -z noseparate-code or lld's --no-rosegment, .rodata and .text will be placed in the same PT_LOAD segment.)
The PT_LOAD segments have different flags (p_flags): PF_R, PF_R|PF_X, PF_R|PF_W. Subsequently, the dynamic loader, also known as the dynamic linker, will invoke mmap to map the file into memory. The memory areas (VMA) have different memory permissions corresponding to segment flags.
For a PT_LOAD segment, its associated memory area starts at alignDown(p_vaddr, pagesize) and ends at alignUp(p_vaddr+p_memsz, pagesize).
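The rounding rule can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the segment values are made up, and the load base is an assumption standing in for whatever address the kernel picks for a position-independent executable.

```python
def align_down(x: int, a: int) -> int:
    """Round x down to a multiple of a (a must be a power of two)."""
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    """Round x up to a multiple of a (a must be a power of two)."""
    return (x + a - 1) & ~(a - 1)


def memory_area(base: int, p_vaddr: int, p_memsz: int, pagesize: int = 4096):
    """Return the [start, end) range of the memory area for one PT_LOAD segment."""
    start = base + align_down(p_vaddr, pagesize)
    end = base + align_up(p_vaddr + p_memsz, pagesize)
    return start, end


# Hypothetical segment: p_vaddr=0x1630, p_memsz=0x180, loaded at base 0x555555554000.
start, end = memory_area(0x555555554000, 0x1630, 0x180)
print(hex(start), hex(end))
```

Both ends move outward to page boundaries, so the memory area always covers whole pages even when the segment itself starts or ends mid-page.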
1 | Start Addr End Addr Size Offset Perms objfile |
Let's assume the page size is 4096 bytes. We'll calculate the alignDown(p_vaddr, pagesize) values and display them alongside the "Start Addr" values:
Start Addr      alignDown(p_vaddr, pagesize)
0x555555554000  0x0000000000000000
0x555555555000  0x0000000000001000
0x555555556000  0x0000000000002000
0x555555557000  0x0000000000003000
We observe that the start address equals the base address plus alignDown(p_vaddr, pagesize).
--no-rosegment
This option asks lld to combine the read-only and the RX segments. The output file will consume less address space at run-time.
1 | Start Addr End Addr Size Offset Perms objfile |
A page serves as the granularity at which memory exhibits different permissions, and within a page, we cannot have varying permissions. Using the previous example where p_align is 4096, if the page size is larger, for example, 65536 bytes, the program might crash.
Typically, the dynamic loader allocates memory for the first PT_LOAD segment (PF_R) at a specific address allocated by the kernel. Subsequent PT_LOAD segments then overwrite the previous memory regions. Consequently, certain code pages or significant global variables might be replaced by garbage, leading to a crash.
So, how can we create a link unit that works across different page sizes? We simply determine the maximum page size, let's say, 2097152, and then pass -z max-page-size=2097152 to the linker. The linker will set p_align values of PT_LOAD segments to MAXPAGESIZE.
1 | Program Headers: |
In a linker script, the max-page-size can be obtained using CONSTANT(MAXPAGESIZE).
For completeness, if you need to run a prebuilt executable on a system with a larger page size, you can modify the executable by merging PT_LOAD segments and combining their permissions. It's likely there will be a sizable RWX PT_LOAD segment, reminiscent of OMAGIC.
It is possible to increase the p_align value of one single PT_LOAD segment using an aligned attribute. When this value exceeds the page size, the question arises: should the kernel loader or the dynamic loader determine a suitable base address to meet this alignment requirement?
In 2020, the Linux kernel loader made the decision to p_align
. This facilitates
1 | % cat align.c |
Should a userspace dynamic loader do the same? If it does, a variable with an alignment greater than the page size will indeed align accordingly. As of glibc 2.35, it has
On the other hand, the traditional interpretation dictates that a variable with an alignment greater than the page size is invalid. Most other dynamic loaders do not implement this particular logic, which has some overhead.
-z separate-loadable-segments
In previous examples using -z separate-loadable-segments, the p_vaddr values of PT_LOAD segments are multiples of MAXPAGESIZE. The generic ABI says "loadable process segments must have congruent values for p_vaddr and p_offset, modulo the page size."
p_offset - This member gives the offset from the beginning of the file at which the first byte of the segment resides.
p_vaddr - This member gives the virtual address at which the first byte of the segment resides in memory.
This alignment requirement aligns with the mmap documentation. For example, Linux man-pages specifies, "offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE)."
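A short sketch shows why the congruence rule and the mmap offset requirement fit together. The helper names and segment values are hypothetical; the point is only that rounding both p_vaddr and p_offset down by the same amount is possible exactly when they are congruent modulo the page size.

```python
def align_down(x: int, a: int) -> int:
    """Round x down to a multiple of a (a must be a power of two)."""
    return x & ~(a - 1)


def is_congruent(p_vaddr: int, p_offset: int, pagesize: int) -> bool:
    """Check the generic ABI rule: p_vaddr == p_offset (mod pagesize)."""
    return p_vaddr % pagesize == p_offset % pagesize


def mmap_request(p_vaddr: int, p_offset: int, pagesize: int):
    """Page-aligned (address, file offset) pair a loader could pass to mmap."""
    if not is_congruent(p_vaddr, p_offset, pagesize):
        raise ValueError("segment cannot be mapped at this page size")
    return align_down(p_vaddr, pagesize), align_down(p_offset, pagesize)


# Hypothetical segment at address 0x1630 and file offset 0x630.
print(is_congruent(0x1630, 0x630, 4096))   # congruent for 4 KiB pages
print(is_congruent(0x1630, 0x630, 65536))  # not congruent for 64 KiB pages
print([hex(v) for v in mmap_request(0x1630, 0x630, 4096)])
```

The second check fails because the layout was only computed for 4 KiB pages, which is the situation a larger MAXPAGESIZE is meant to avoid.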
The p_offset values are also multiples of MAXPAGESIZE. After laying out a PT_LOAD segment, the linker must pad the end by inserting zeros so that the next PT_LOAD segment starts at a multiple of MAXPAGESIZE.
However, the alignment padding is wasteful. Fortunately, we can link a.o using different MAXPAGESIZE and different alignment settings: -z noseparate-code, -z separate-code, -z separate-loadable-segments.
1 | clang -pie -fuse-ld=lld -Wl,-z,noseparate-code a.o -o a0.4096 |
1 | % stat -c %s a0.4096 a0.65536 a0.2097152 |
We can derive two properties:
size(noseparate-code) < size(separate-code) < size(separate-loadable-segments).
With -z noseparate-code, increasing MAXPAGESIZE does not change the output size.
AArch64 and PowerPC64 have a default MAXPAGESIZE of 65536. Staying with the -z noseparate-code default ensures that they will not experience unnecessary size increase.
-z noseparate-code
How does -z noseparate-code work? Let's illustrate this with an example.
At the end of the read-only PT_LOAD segment, the address is 0x628. Instead of starting the next segment at alignUp(0x628, MAXPAGESIZE) = 0x1000, we start at alignUp(0x628, MAXPAGESIZE) + 0x628 % MAXPAGESIZE = 0x1628. Since the .text section has an alignment (sh_addralign) of 16, we start at 0x1630. Although the address is advanced beyond necessity, the file offset (congruent to the address, modulo MAXPAGESIZE) can be decreased to 0x630, merely 8 bytes (due to alignment padding) after the previous section's end.
Moving forward, the end of the executable PT_LOAD segment has an address of 0x17b0. Instead of starting the next segment at alignUp(0x17b0, MAXPAGESIZE) = 0x2000, we start at alignUp(0x17b0, MAXPAGESIZE) + 0x17b0 % MAXPAGESIZE = 0x27b0. While we advance the address more than needed, the file offset can be decreased to 0x7b0, precisely at the previous section's end.
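The arithmetic above can be replayed with a small sketch. It is not the linker's actual implementation; the function name and its parameters are hypothetical, and it assumes the previous segment's end address and end file offset are already congruent modulo MAXPAGESIZE.

```python
def align_up(x: int, a: int) -> int:
    """Round x up to a multiple of a (a must be a power of two)."""
    return (x + a - 1) & ~(a - 1)


def next_segment_start(prev_end_addr: int, prev_end_offset: int,
                       max_page_size: int, sh_addralign: int = 1):
    """Return (address, file offset) where the next PT_LOAD segment starts."""
    # Skip to the next page but keep the position within the page, so the
    # file offset does not need to be padded to a page boundary.
    addr = align_up(prev_end_addr, max_page_size) + prev_end_addr % max_page_size
    aligned = align_up(addr, sh_addralign)
    # Only the section alignment padding is added to the file offset.
    offset = prev_end_offset + (aligned - addr)
    return aligned, offset


# Read-only segment ends at 0x628; .text has sh_addralign of 16.
text_addr, text_off = next_segment_start(0x628, 0x628, 0x1000, 16)
print(hex(text_addr), hex(text_off))

# Executable segment ends at address 0x17b0, file offset 0x7b0.
rw_addr, rw_off = next_segment_start(0x17b0, 0x7b0, 0x1000)
print(hex(rw_addr), hex(rw_off))
```

In both steps the address jumps by roughly one page while the file offset grows by at most the section alignment padding, which is why the output size does not depend on MAXPAGESIZE.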
1 | % readelf -WSl a0.4096 |
-z separate-code performs the trick when transitioning from the first RW PT_LOAD segment to the second, whereas -z separate-loadable-segments doesn't.
Let's consider two adjacent PT_LOAD segments. The memory area associated with the first segment ends at alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize) while the memory area associated with the second one starts at alignDown(load[i+1].p_vaddr, pagesize). When the actual page size equals MAXPAGESIZE, the two addresses are identical. However, if the actual page size is smaller, a gap emerges between these addresses.
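A sketch of the gap computation, with made-up segment values: the first segment is assumed to end at address 0x628, and the second is assumed to start at 0x10628 in a file linked with a MAXPAGESIZE of 65536. The function name is hypothetical.

```python
def align_down(x: int, a: int) -> int:
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def gap_between(first_end: int, second_vaddr: int, pagesize: int):
    """Return (gap_start, gap_end) between two adjacent PT_LOAD memory areas.

    first_end is load[i].p_vaddr + load[i].p_memsz and second_vaddr is
    load[i+1].p_vaddr. The gap is empty when gap_start == gap_end.
    """
    return align_up(first_end, pagesize), align_down(second_vaddr, pagesize)


for pagesize in (65536, 4096):
    lo, hi = gap_between(0x628, 0x10628, pagesize)
    print(f"page size {pagesize}: gap of {hi - lo:#x} bytes")
```

When the runtime page size matches MAXPAGESIZE the two areas touch; with 4 KiB pages, fifteen pages are left between them.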
A typical link unit generally presents three gaps. These gaps might either be unmapped or mapped. When mapped, they necessitate struct vm_area_struct objects within the Linux kernel. As of Linux 6.3.13, the size of struct vm_area_struct is 152 bytes. For instance, 10000 mapped object files would require 10000 * 3 * sizeof(struct vm_area_struct) = 4,560,000 bytes, signifying a considerable memory footprint. You can refer to
Dynamic loaders typically invoke mmap using PROT_READ, encompassing the whole file, followed by multiple mmap calls using MAP_FIXED and the corresponding flags. When dynamic loaders, like musl, don't process gaps, the gaps retain r--p permissions. However, in glibc's elf/dl-map-segments.h, the has_holes code employs mprotect to transition permissions from r--p to ---p.
While ---p might be perceived as a security enhancement, personally, I don't believe it significantly impacts exploitability. While there might be numerous gadgets in r-xp areas, reducing gadgets in r--p areas doesn't seem notably impactful. (https://isopenbsdsecu.re/mitigations/rop_removal/)
When the Linux kernel loads the executable and its interpreter (if present) (fs/binfmt_elf.c), the gap gets unmapped, thereby freeing a struct vm_area_struct object. Implementing a similar approach in dynamic loaders could yield comparable savings.
However, unmapping the gap carries the risk of an unrelated future mmap occupying the gap:
1 | 564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504 /sample/build/main |
It is not clear whether the potential occurrence of an unrelated mmap is considered a regression in security. Personally, I don't think this poses a significant issue as the program does not access the gaps. This property can be guaranteed for direct access when input relocations to the linker use symbols with in-bounds addends (e.g. when x is defined relative to an input section, we know R_X86_64_PC32(x) must be in-bounds).
However, some programs may expect contiguous map areas of a file (such as when glibc link_map::l_contiguous is set to 1). Does this choice render the program exploitable if an attacker can ensure a map within the gap instead of outside the file? It seems to me that they could achieve everything with a map outside of the file.
Having said that, the presence of an unrelated map between maps associated with a single file descriptor remains odd, so it's preferable to avoid it if possible.
This appears to be the best solution.
When creating a memory area, instead of setting the end to alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize), we can extend the end to min(alignDown(min(load[i+1].p_vaddr), pagesize), alignUp(file_end_addr, pagesize)).
1 | 564d8e90f000-**564d8e91f000** r--p 00000000 08:05 2519504 /sample/build/main (the end is extended) |
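The extended-end rule can be sketched as follows. The function name and all sample addresses are hypothetical; file_end_addr stands for the address that corresponds to the end of the file, as in the formula above.

```python
def align_down(x: int, a: int) -> int:
    return x & ~(a - 1)


def align_up(x: int, a: int) -> int:
    return (x + a - 1) & ~(a - 1)


def extended_end(next_p_vaddr: int, file_end_addr: int, pagesize: int) -> int:
    """End of a memory area stretched to cover the gap before the next segment.

    The end never goes past the page containing the end of the file, so the
    extension stays within pages that are at least partly file-backed.
    """
    return min(align_down(next_p_vaddr, pagesize),
               align_up(file_end_addr, pagesize))


pagesize = 4096
plain_end = align_up(0x628, pagesize)  # end without the extension
print(hex(plain_end))
# Next segment at 0x10628, file long enough to back the whole gap.
print(hex(extended_end(0x10628, 0x12000, pagesize)))
# Same layout, but the file ends early, so the extension is clamped.
print(hex(extended_end(0x10628, 0x3456, pagesize)))
```

In the first case the area grows from one page to sixteen and meets the start of the next area, so no separate gap mapping remains.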
For the last PT_LOAD segment, we could also just use alignDown(min(load[i+1].p_vaddr), pagesize) and ignore alignUp(file_end_addr, pagesize). Accessing a byte beyond the backed file will result in a SIGBUS signal.
Personally I favor the area end extending approach. I've also pondered whether this falls under the purview of linkers. Such a change seems intrusive and unsightly. If the linker extends the end of p_memsz to cover the gap, should it also extend p_filesz?
Moreover, a PT_LOAD whose end isn't backed by a section is unusual. I'm concerned that many binary manipulation tools may not handle this case correctly. A linker script can intentionally create discontiguous address ranges, and I'm concerned that the linker might not discern such cases with intelligent logic regarding p_filesz/p_memsz.
This feature request seems to be within the realm of loaders, and specific information, such as the page size, is only accessible to loaders. I believe loaders are better equipped to handle this task.
Some programs optimize their usage of the limited Translation Lookaside Buffer (TLB) by employing transparent huge pages. When the Linux kernel loads an executable, it takes into account the p_align field to create a memory area. If p_align is 4096, the memory area will commence at a multiple of 4096, but not necessarily at a multiple of a huge page.
Transparent huge pages for mapped files have several requirements including:
include/linux/huge_mm.h:transhuge_vma_suitable).
CONFIG_READ_ONLY_THP_FOR_FS is enabled (scripts/config -e TRANSPARENT_HUGEPAGE -e TRANSPARENT_HUGEPAGE_MADVISE -e READ_ONLY_THP_FOR_FS)
VM_EXEC flag
When madvise(addr, len, MADV_HUGEPAGE) is called, the kernel code path is do_madvise -> madvise_vma_behavior -> hugepage_madvise -> khugepaged_enter_vma -> thp_vma_allowable_order+__khugepaged_enter.
To ensure that addr-fileoff is a multiple of a huge page, we should link the executable using -z max-page-size= with the huge page size.
In kernels with the VM_EXEC requirement (before v6.8), if we want to remap the file as huge pages from the ELF header, we must specify --no-rosegment to ld.lld.
Build the following program with c++ -fuse-ld=lld -Wl,-z,max-page-size=2097152 and run it. We do not define COLLAPSE for now.
// Adapted from https://mazzo.li/posts/check-huge-page.html
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h> // KPF_THP
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// normal page, 4KiB
#define PAGE_SIZE (1 << 12)
// huge page, 2MiB
#define HPAGE_SIZE (1 << 21)

// See <https://www.kernel.org/doc/Documentation/vm/pagemap.txt> for
// format which these bitmasks refer to
// (bit 63: page present; bits 0-54: page frame number)
#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))

extern char __ehdr_start[];
__attribute__((used)) const char pad[HPAGE_SIZE] = {};

// Checks if the page pointed at by `ptr` is huge. Assumes that `ptr` has
// already been allocated.
static void check_huge_page(void *ptr) {
  if (getuid())
    return warnx("not root; skip KPF_THP check");
  int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
  if (pagemap_fd < 0)
    errx(1, "could not open /proc/self/pagemap: %s", strerror(errno));
  int kpageflags_fd = open("/proc/kpageflags", O_RDONLY);
  if (kpageflags_fd < 0)
    errx(1, "could not open /proc/kpageflags: %s", strerror(errno));

  // each entry is 8 bytes long
  uint64_t ent;
  if (pread(pagemap_fd, &ent, sizeof(ent), ((uintptr_t)ptr) / PAGE_SIZE * 8) != sizeof(ent))
    errx(1, "could not read from pagemap\n");
  if (!PAGEMAP_PRESENT(ent))
    errx(1, "page not present in /proc/self/pagemap, did you allocate it?\n");
  if (!PAGEMAP_PFN(ent))
    errx(1, "page frame number not present, run this program as root\n");

  uint64_t flags;
  if (pread(kpageflags_fd, &flags, sizeof(flags), PAGEMAP_PFN(ent) << 3) != sizeof(flags))
    errx(1, "could not read from kpageflags\n");
  if (!(flags & (1ull << KPF_THP)))
    errx(1, "could not allocate huge page\n");
  if (close(pagemap_fd) < 0)
    errx(1, "could not close /proc/self/pagemap: %s", strerror(errno));
  if (close(kpageflags_fd) < 0)
    errx(1, "could not close /proc/kpageflags: %s", strerror(errno));
}

int main() {
  printf("__ehdr_start: %p\n", __ehdr_start);
  int ret, tries = 2;
#ifdef COLLAPSE
  // Try a synchronous collapse first and fall back to MADV_HUGEPAGE.
  do {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_COLLAPSE);
  } while (ret && errno == EAGAIN && --tries);
  printf("madvise(MADV_COLLAPSE): %d\n", ret);
  if (ret) {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
    if (ret)
      err(1, "madvise");
  }
#else
  ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
  if (ret)
    err(1, "madvise");
#endif

  // Sanity check with an anonymous huge page.
  size_t size = HPAGE_SIZE;
  char *buf = (char *)aligned_alloc(HPAGE_SIZE, size);
  madvise(buf, 2 << 20, MADV_HUGEPAGE);
  *((volatile char *)buf);
  check_huge_page(buf);

  // Dump /proc/self/maps up to and including the [stack] line.
  int fd = open("/proc/self/maps", O_RDONLY);
  read(fd, buf, HPAGE_SIZE);
  write(STDOUT_FILENO, buf, strstr(buf, "[stack]\n") - buf + 8);
  close(fd);

  // Give khugepaged one scan interval to collapse the file-backed pages.
  fd = open("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs", O_RDONLY);
  read(fd, buf, 32);
  close(fd);
  usleep(atoi(buf) * 1000);

  memcpy(buf, __ehdr_start, HPAGE_SIZE);
  check_huge_page(__ehdr_start);
}
The output looks like:
% g++ test.cc -o ~/tmp/test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152 && sudo ~/tmp/test
__ehdr_start: 0x55f3b1c00000
55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119 /home/ray/tmp/test
55f3b1e00000-55f3b1e01000 r--p 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2000000-55f3b2002000 r-xp 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2201000-55f3b2202000 r--p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b2401000-55f3b2402000 rw-p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b3a9a000-55f3b3abb000 rw-p 00000000 00:00 0 [heap]
Thanks to 周洲仪 for helping me figure out the khugepaged behavior.
usleep gives khugepaged an opportunity to collapse pages (hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush). In the fortunate scenario when this collapse occurs, and the next page fault is triggered (memcpy(buf, __ehdr_start, HPAGE_SIZE)), the kernel will populate the pmd with a huge page (handle_page_fault ...=> handle_pte_fault ...=> do_fault_around => filemap_map_pages ...=> do_set_pmd => set_pmd_at).
However, in an unfortunate case, check_huge_page(__ehdr_start) will fail with could not allocate huge page. scan_sleep_millisecs defaults to 10000 (10 seconds). Reducing the value increases the likelihood of the fortunate case.
Linux 6.1 introduces MADV_COLLAPSE to attempt a synchronous collapse of the native pages mapped by the memory range into Transparent Huge Pages (THPs). While success is not guaranteed, a successful collapse eliminates the need to wait for the khugepaged daemon (madvise_collapse => hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush). In the event of repeated MADV_COLLAPSE failures, a fallback mechanism using MADV_HUGEPAGE can be employed.
% g++ -static -DCOLLAPSE test.cc -o test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152
% sudo ./test
__ehdr_start: 0x200000
madvise(MADV_COLLAPSE): -1
...
test: could not allocate huge page
% sudo ./test
__ehdr_start: 0x55f3b1c00000
madvise(MADV_COLLAPSE): 0
00200000-00429000 r--p 00000000 fd:03 260 /root/test
00628000-0069f000 r-xp 00228000 fd:03 260 /root/test
0089e000-008a3000 r--p 0029e000 fd:03 260 /root/test
00aa2000-00aa5000 rw-p 002a2000 fd:03 260 /root/test
00aa5000-00aab000 rw-p 00000000 00:00 0
01800000-01822000 rw-p 00000000 00:00 0 [heap]
7fd141600000-7fd141800000 rw-p 00000000 00:00 0
7fd141800000-7fd141a00000 rw-p 00000000 00:00 0
7fd141a00000-7fd141a01000 rw-p 00000000 00:00 0
7ffe69edf000-7ffe69f00000 rw-p 00000000 00:00 0 [stack]
In -z noseparate-code layouts, the file content starts somewhere at the first page, potentially wasting half a huge page on unrelated content. Switching to -z separate-code allows reclaiming the benefits of the half huge page but increases the file size. Balancing these aspects poses a challenge. One potential solution is using fallocate(FALLOC_FL_PUNCH_HOLE), which introduces complexity into the linker. However, this approach feels like a workaround to address a kernel limitation. It would be preferable if a file-backed huge page didn't necessitate a file offset aligned to a huge page boundary.
To accommodate PT_GNU_RELRO, the RW region will possess two permissions after the runtime linker maps the program. While GNU ld provides one RW segment split by the dynamic loader, lld employs two explicit RW PT_LOAD segments. After relocation resolving, the effects of lld and GNU ld are similar.
For those curious, explore my notes on GNU ld's
Due to RELRO, covering the two RW PT_LOAD segments necessitates a minimum of 2 (huge) pages. In contrast, without RELRO, only one (huge) page is required at minimum. This means potentially wasting up to MAXPAGESIZE-1 bytes, which could otherwise be utilized to cover more data.
Nowadays, RELRO is considered a security baseline and removing it might unsettle security-minded individuals.
In ELF, an object file can be a relocatable file, an executable file, or a shared object file. On Windows, the term "object file" usually refers to what ELF calls relocatable files. Such files use the Common Object File Format (COFF) while image files (e.g. executables and DLLs) use the Portable Executable (PE) format.
The input files to the linker can be object files, archive files, and import libraries. GNU ld and lld-link allow linking against DLL files without an import library.
An import file (.lib) is a special archive file. Each member represents a symbol to be imported. The symbol __imp_$sym is inserted into the global symbol table.
The import header has a Type field indicating IMPORT_OBJECT_CODE/IMPORT_OBJECT_DATA/IMPORT_OBJECT_CONST.
For an import type of IMPORT_OBJECT_DATA, the symbol $sym is defined as an alias for __imp_$sym.
For an import type of IMPORT_OBJECT_CODE, the symbol $sym is defined as an import thunk, which is like a PLT entry in ELF.
GNU ld and lld-link allow linking against DLL files without an import library. The behavior is as if the linker synthesizes an import library from a DLL file.
An object file contributes defined and undefined symbols. An import file contributes defined symbols in a DLL that can be referenced by __imp_$sym.
A defined symbol can be any of the following kinds:
IMAGE_SYM_UNDEFINED
and value is not 0)
An undefined symbol has a storage class of IMAGE_SYM_CLASS_EXTERNAL, a section number of IMAGE_SYM_UNDEFINED (zero), and a value of zero.
An undefined symbol with a storage class of IMAGE_SYM_CLASS_WEAK_EXTERNAL is a weak external, which is actually like a weak definition in ELF.
PE requires explicit annotations for exported symbols and imported symbols in DLL files. There are differences between code symbols and data symbols.
Refer to
1 | // b.dll |
Linking b.dll gives us b.lib (see "Import files" above).
# b.dll
.globl f
f:
.section .drectve,"yni"
.ascii " -export:f"
a.obj has two function calls. The call to f references the prefixed symbol __imp_f.
# a.obj
callq local
callq *__imp_f(%rip)
call *__imp_f(%rip) is like -fno-plt codegen for ELF. In this case when we know that f is defined elsewhere, the generated code is more efficient.
When linking a.exe, we need to pass the import file b.lib as an input file. The linker parses the import file and creates a definition for __imp_f pointing to the import address table entry.
TODO import table
Actually, when __imp_f is defined, the unprefixed symbol f is also defined. Normally, the unprefixed f is unused and will be discarded. However, if the user code calls the unprefixed symbol (e.g. call f; like ELF -fplt), the f definition will be retained in the linker output and point to a thunk:
  call f # generated code without using dllimport

f: # x86-64 thunk
  jmpq *__imp_f(%rip)
Different architectures have different thunk implementations.
// x86-32 and x86-64
jmp *0x0 // references an entry in the import address table

// AArch32
mov.w ip, #0
mov.t ip, #0
ldr.w pc, [ip]

// AArch64
adrp x16, #0
ldr x16, [x16]
br x16
TODO link.exe will issue a warning.
1 | // b.dll |
1 | # b.dll |
The linker parses the import file and creates a definition for__imp_var
pointing to the import address table entry.Unlike a code symbol, the linker does not create a definition forvar
(without the __imp_
prefix).
With a dllimport
: 1
2movq __imp_var(%rip), %rax
movl (%rax), %eax
If dllimport is not specified, we get a reference to the unprefixed symbol:
movq var(%rip), %rax
link.exe will report an error.
MinGW implements runtime pseudo relocations to patch the text section so that absolute pointers and relative offsets to the symbol will be rewritten to bind to the actual definition.
movq var(%rip), %rax # the runtime will rewrite this to point to the definition in b.dll
If the variable is defined out of the +-2GiB range from the current location, the runtime pseudo relocation can't fix the issue. See
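The +-2GiB limit follows from the 32-bit signed displacement of a RIP-relative operand. A minimal sketch of the range check, assuming addresses are modeled as plain 64-bit integers and that the displacement is measured from the end of the 4-byte field (fits_rel32 is a hypothetical helper, not a MinGW runtime function):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true if a 32-bit PC-relative field located at `place` can be
// patched to reach `target`.
bool fits_rel32(std::uint64_t place, std::uint64_t target) {
  // Unsigned subtraction wraps around; reinterpreting the result as
  // signed yields the displacement.
  std::int64_t disp = static_cast<std::int64_t>(target - (place + 4));
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}
```

When the check fails, no value written into the field can make the instruction reach the definition, which is why the runtime cannot fix such a reference.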
For a non-definition declaration, GCC conservatively assumes that the variable may be defined in a DLL and generates an indirection. This is similar to a GOT code sequence in ELF.
extern int extern_var;
int main() { return extern_var; }
1 | // MSVC |
A dllimport symbol referenced by an object file is normally satisfied by an import file. link.exe allows another object file to provide the definition. In such a case, link.exe will issue a warning (
1 | echo '__declspec(dllimport) int foo(); int main() { return foo(); }' > a.cc |
1 | lld-link: warning: a.obj: locally defined symbol imported: int __cdecl foo(void) (defined in b.obj) [LNK4217] |
MinGW provides auto exporting and auto importing features to make PE DLL files work like ELF shared objects. When producing a DLL file, if no symbol is chosen to be exported, almost all symbols are exported by default (--export-all-symbols).
If an undefined symbol $sym is unresolved and __imp_$sym is defined, $sym will be aliased to __imp_$sym. TODO: example
If the symbol .refptr.$sym is present, it will be aliased to __imp_$sym as well. mingw-w64 defaults to -mcmodel=medium and uses .refptr.$sym. TODO: example
https://github.com/ziglang/zig/issues/9845
__imp_ definition
The user can define __imp_ instead of letting the linker do so.
https://github.com/llvm/llvm-project/issues/57982
$ cat lto-dllimp1.c
void __declspec(dllimport) importedFunc(void);
void other(void);
void entry(void) {
  importedFunc();
  other();
}
$ cat lto-dllimp2.c
static void importedFuncReplacement(void) {
}
void (*__imp_importedFunc)(void) = importedFuncReplacement;
void other(void) {
}
The design of shared libraries saw major advancements around 1988. Before 1988, there were shared library implementations in the a.out and COFF object file formats, but they had severe limitations, such as fixed addresses and the requirement of extra files like import files.
Such limitations are evidenced in 1986 Summer USENIX Technical Conference & Exhibition Proceedings, Shared Libraries on UNIX System V from AT&T. Its shared library (presumably using the COFF object file format) must have a fixed virtual address, which is called "static shared library" in Linkers and Loaders's term.
In 1988, SunOS 4.0 was released with an extended a.out binary format with dynamic shared library support. Unlike previous static shared library schemes, the a.out shared libraries are position independent and can be loaded at different addresses. The dynamic linker source code is available somewhere and I find that its GOT and PLT schemes are exactly like what we have for ELF today.
AT&T and Sun collaborated to create the first System V release 4 ABI (using ELF). AT&T contributed the ELF object format. Sun contributed all of the dynamic linking implementation from SunOS 4.x. In 1992, SunOS 5.0 (Solaris 2.0) switched to ELF.
For ELF, the designers tried to make shared libraries similar to static libraries. There is no need to annotate export and import symbols to work with shared libraries.
I cannot find more information about System V release 3's shared library support, but the Windows DLL is assuredly inspired by it, given that the PE object file format is based on COFF and the PE specification refers to COFF in numerous places.
So, is the shared library design in ELF more advanced? It is. However, two aspects are worth careful thought.
-z undefs default in linkers. See
The number of symbols cannot exceed 65535. Several open-source projects have faced problems that a DLL file cannot export more than 65535 symbols. (GNU ld has a diagnostic error: export ordinal too large:).
A section header has only 8 bytes for the name field. link.exe truncates long section names to 8 bytes. For a section with a long name and the IMAGE_SCN_MEM_DISCARDABLE flag, lld uses a non-standard string table and issues a warning.
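The truncation described above can be sketched as follows. This is an illustration under stated assumptions, not link.exe's implementation: short_section_name and field_to_string are hypothetical helpers, and the 8-byte field is modeled as NUL-padded and not NUL-terminated when full.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

constexpr std::size_t kNameFieldSize = 8;

// Copies at most 8 bytes of `name` into a zero-padded name field.
std::array<char, kNameFieldSize> short_section_name(const std::string &name) {
  std::array<char, kNameFieldSize> field{};  // zero-initialized
  std::size_t n = std::min(name.size(), kNameFieldSize);
  std::memcpy(field.data(), name.data(), n);
  return field;
}

// Reads the field back; stops at the first NUL or after 8 bytes.
std::string field_to_string(const std::array<char, kNameFieldSize> &field) {
  std::size_t len = 0;
  while (len < field.size() && field[len] != '\0')
    ++len;
  assert(len <= kNameFieldSize);
  return std::string(field.data(), len);
}
```

With this model, .text round-trips unchanged while .debug_info comes back as .debug_i, which is why a long name needs a string table to survive.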
COMDAT limitation: MSVC link.exe will report a IMAGE_COMDAT_SELECT_ASSOCIATIVE section, even if it would be discarded after handling the leader symbol.
If a DSO has an undefined STB_GLOBAL symbol that is defined in a relocatable object file but not exported, should the --no-allow-shlib-undefined feature report an error? You may want to check out
For quite some time, the --no-allow-shlib-undefined feature has been implemented in lld/ELF as follows:
1 | for (SharedFile *file : ctx.sharedFiles) { |
Recently I noticed that GNU ld implemented a related error in
1 | echo '.globl _start; _start: call shared' > main.s && clang -c main.s |
1 | % ld.bfd main.o a.so def.o |
A non-local default or protected visibility symbol can satisfy a DSO reference. The linker will export the symbol to the dynamic symbol table. Therefore, ld.bfd main.o a.so def.o succeeds as intended.
We encounter an error for ld.bfd main.o a.so def-hidden.o because a symbol with hidden visibility cannot be exported, and it's unable to satisfy the reference in a.so at run-time.
Here is another interesting case: we use a version script to change the binding of a defined symbol to STB_LOCAL, causing it to be unable to satisfy the reference in a.so at run-time. GNU ld also reports an error in this case.
% ld.bfd --version-script=local.ver main.o a.so def.o
ld.bfd: a.out: local symbol `foo' in def.o is referenced by DSO
ld.bfd: final link failed: bad value
My recent commit --no-allow-shlib-undefined to detect cases in which the non-exported definitions are garbage-collected. I have landed https://github.com/llvm/llvm-project/pull/70769 to cover non-garbage-collected cases for LLD 18.
A variation of the scenario mentioned above occurs when a DSO definition is also present. Even if the executable does not export foo, another DSO (def.so) may provide it. GNU ld's check allows for this case.
1 | ld.bfd main.o a.so def-hidden.o def.so # succeeded |
It turns out that --no-allow-shlib-undefined to also catch this ODR violation. More precisely, when all three conditions are met, the new --no-allow-shlib-undefined code reports an error.
- SharedSymbol in lld/ELF).
- SharedSymbol is overridden by a non-exported (usually of hidden visibility) definition in a relocatable object file (Defined).
- Defined is garbage-collected (it is not part of .dynsym and is not marked as live).
An exported symbol is a GC root, making its section live. A non-exported symbol, however, can be discarded when its section is discarded.
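The conditions can be summarized as a predicate over what the linker knows about one symbol. This is a hypothetical model, not lld's code: SymbolState and reports_error are invented names, and the first condition, which is only partly preserved in the text, is assumed to mean a DSO reference that a definition in another DSO satisfies.

```cpp
#include <cassert>

// Hypothetical, simplified per-symbol state.
struct SymbolState {
  bool undefined_in_dso;      // some DSO has an undefined reference to it
  bool defined_in_other_dso;  // another DSO defines it (a SharedSymbol)
  bool defined_in_object;     // a relocatable object also defines it (a Defined)
  bool exported;              // the object definition is part of .dynsym
  bool section_live;          // the defining section survived garbage collection
};

// True when all conditions of the garbage-collected case hold.
bool reports_error(const SymbolState &s) {
  bool overridden_by_non_exported = s.defined_in_object && !s.exported;
  return s.undefined_in_dso && s.defined_in_other_dso &&
         overridden_by_non_exported && !s.section_live;
}
```

For instance, reports_error({true, true, true, false, false}) holds, while an exported and live definition, {true, true, true, true, true}, does not trigger the error.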
So, is this error legitimate? At run-time, the undefined symbol foo in a.so will be bound to def.so, even if the executable does not export foo, so we are fine. This suggests that the --no-allow-shlib-undefined code probably should not report an error.
However, both def-hidden.o and def.so define foo, and we know the definitions are different and less likely benign. At the very least, they are not exactly the same due to different visibilities or one being localized by a version script.
A real-world report boils down to
% ld.lld @response.txt -y _Znam
...
libfdio.so: reference to _Znam
libclang_rt.asan.so: shared definition of _Znam
libc++.a(stdlib_new_delete.cpp.obj): definition of _Znam
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: _Znam
>>> referenced by libfdio.so
How does libfdio.so obtain a reference to _Znam? Well, libfdio.so is linked against both libclang_rt.asan.so and libc++.a. Due to symbol processing rules, the definition from libclang_rt.asan.so takes precedence. (See
An appropriate solution is to replace libc++.a with an AddressSanitizer-instrumented version that does not define _Znam.
I have also encountered issues stemming from the combination of multiple definitions from libgcc.a (with hidden visibility) and libclang_rt.builtins.a (with default visibility), relying on archive member extraction rules.
% ld.lld @response.txt -y __divti3
...
a.so: reference to __divti3
libgcc.a(_divdi3.o): definition of __divti3
libc++.so: shared definition of __divti3
# A lazy symbol in libclang_rt.builtins.a is not reported by -y
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __divti3
>>> referenced by a.so
a.so is linked against libc++.so and libclang_rt.builtins.a and obtains a reference to __divti3 due to libc++.so. For the executable link, the undesired situation arises as the definition in libgcc.a takes precedence. What we actually want is for libgcc.a to provide the missing components from libclang_rt.builtins.a.
Some users compile relocatable object files with -fvisibility=hidden to disallow dynamic linking. However, when their system includes specific shared objects, it increases the risk of conflicting multiple definition symbols.
While this additional check introduced in --no-allow-shlib-undefined, I believe it has value. As a result, I have proposed --[no-]allow-non-exported-symbols-shared-with-dso. However, I am also on the fence about introducing a new option, as it may not get used.
Technically, the check can be extended to default visibility to catch all link-time symbol interposition. However, I suspect that there are a lot of benign violations and, in the absence of an ignore list mechanism, this extension will not be useful.
This article describes global variable instrumentation.
AddressSanitizer instruments certain defined global variables of LLVM external or internal linkage. To be instrumented, the variable must satisfy a number of conditions.
no_sanitize_address attribute in LLVM IR. Variables receive this attribute when annotated as __attribute__((no_sanitize("address"))) or __attribute__((disable_sanitizer_instrumentation)) in C/C++.
1 | int g0; |
Each instrumented global variable is padded with a right redzone to detect out-of-bounds accesses.
@g0 = dso_local global { i32, [28 x i8] } zeroinitializer, comdat, align 32
@g1 = dso_local constant { i64, [24 x i8] } zeroinitializer, comdat, align 32
On ELF platforms, by default (since Clang 17.0) each instrumented global variable receives an associated __asan_global_$name variable, which is located within the asan_globals section. Additionally, there are several related variables, including some unnamed ones (@0 and @1), as well as __odr_asan_gen_g0 and __odr_asan_gen_g1, along with metadata nodes (!0 and !1), which we will discuss in more detail later.
1 | @___asan_gen_.1 = private unnamed_addr constant [3 x i8] c"g0\00", align 1 |
The module constructor asan.module_ctor processes garbage-collectable asan_globals input sections. This constructor invokes a runtime callback to register the instrumented global variables, which involves poisoning the redzone and conducting ODR violation checks. I will discuss ODR violation checking later.
define internal void @asan.module_ctor() #0 comdat {
  call void @__asan_init()
  call void @__asan_version_mismatch_check_v8()
  call void @__asan_register_elf_globals(i64 ptrtoint (ptr @___asan_globals_registered to i64), i64 ptrtoint (ptr @__start_asan_globals to i64), i64 ptrtoint (ptr @__stop_asan_globals to i64))
  ret void
}
The runtime poisons the redzone of each instrumented global variable.
void __asan_register_elf_globals(uptr *flag, void *start, void *stop) {
  if (*flag) return;
  if (!start) return;
  CHECK_EQ(0, ((uptr)stop - (uptr)start) % sizeof(__asan_global));
  __asan_global *globals_start = (__asan_global*)start;
  __asan_global *globals_stop = (__asan_global*)stop;
  __asan_register_globals(globals_start, globals_stop - globals_start);
  *flag = 1;
}
void __asan_register_globals(__asan_global *globals, uptr n) {
  if (!flags()->report_globals) return;
  ...
  for (uptr i = 0; i < n; i++)
    RegisterGlobal(&globals[i]);
  // Poison the metadata. It should not be accessible to user code.
  PoisonShadow(reinterpret_cast<uptr>(globals), n * sizeof(__asan_global),
               kAsanGlobalRedzoneMagic);
}
static void RegisterGlobal(const Global *g) {
  ...
  if (CanPoisonMemory())
    PoisonRedZones(*g);
}
Every full granule in the shadow of the redzone is filled with 0xf9 (kAsanGlobalRedzoneMagic) while a partial granule is filled in a manner similar to partially-addressable stack memory.
ALWAYS_INLINE void PoisonRedZones(const Global &g) {
  uptr aligned_size = RoundUpTo(g.size, ASAN_SHADOW_GRANULARITY);
  FastPoisonShadow(g.beg + aligned_size, g.size_with_redzone - aligned_size,
                   kAsanGlobalRedzoneMagic);
  if (g.size != aligned_size) {
    FastPoisonShadowPartialRightRedzone(
        g.beg + RoundDownTo(g.size, ASAN_SHADOW_GRANULARITY),
        g.size % ASAN_SHADOW_GRANULARITY, ASAN_SHADOW_GRANULARITY,
        kAsanGlobalRedzoneMagic);
  }
}
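To make the resulting shadow layout concrete, here is a small model that computes one shadow byte per granule. It is a sketch, not the runtime's code: shadow_for_global is a hypothetical helper, and the granularity is assumed to be 8 bytes (the common value of ASAN_SHADOW_GRANULARITY).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kGranularity = 8;  // assumed ASAN_SHADOW_GRANULARITY
constexpr std::uint8_t kGlobalRedzoneMagic = 0xf9;

// Returns the shadow bytes for a global occupying `size` bytes followed by
// a redzone, `size_with_redzone` bytes in total. 0 means fully addressable,
// k in 1..7 means only the first k bytes of the granule are addressable,
// and 0xf9 marks a redzone granule.
std::vector<std::uint8_t> shadow_for_global(std::uint64_t size,
                                            std::uint64_t size_with_redzone) {
  assert(size_with_redzone % kGranularity == 0 && size < size_with_redzone);
  std::vector<std::uint8_t> shadow(size_with_redzone / kGranularity, 0);
  std::uint64_t i = size / kGranularity;    // skip fully addressable granules
  std::uint64_t rem = size % kGranularity;  // bytes in the partial granule
  if (rem != 0)
    shadow[i++] = static_cast<std::uint8_t>(rem);
  for (; i < shadow.size(); ++i)
    shadow[i] = kGlobalRedzoneMagic;
  return shadow;
}
```

Under these assumptions a 4-byte int padded to 32 bytes gets the shadow bytes 04 f9 f9 f9, and an 8-byte variable padded to 32 bytes gets 00 f9 f9 f9.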
If an access occurs within a redzone byte poisoned by 0xf9 or within a partial redzone preceding 0xf9, the runtime will report a global-buffer-overflow error. Here is an example:
1 | cat > a.c <<e |
1 | % ./a 1 # a[argc * 5] == a[10] is out-of-bounds |
The global variable poisoning mechanism offers a straightforward means to detect differences in variable definitions between two components, such as between the main executable and a shared object, or between two shared objects. This can be considered a category of ODR violations.
1 | echo 'int var; int main() { return var; }' > a.cc |
1 | % ./a |
The default mode, detect_odr_violation=2, also prohibits symbol interposition on variables. If you change long to int in b.cc, you will still encounter an odr-violation error. In contrast, with detect_odr_violation=1, errors are suppressed if the registered variables are of the same size.
% ASAN_OPTIONS=detect_odr_violation=1 ./a
% ASAN_OPTIONS=detect_odr_violation=2 ./a
=================================================================
==2574052==ERROR: AddressSanitizer: odr-violation (0x562d39db1200):
...
For a variable named $var, a one-byte variable, __odr_asan_gen_$var, is created with the original linkage (essentially must be external) and visibility.
If $var is defined in two instrumented modules, their __odr_asan_gen_$var symbols refer to the same copy due to symbol interposition. When registering $var, the runtime checks whether __odr_asan_gen_$var is already 1, and if yes, the program has an ODR violation; otherwise __odr_asan_gen_$var is set to 1.
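The indicator check, together with the difference between detect_odr_violation=1 and detect_odr_violation=2, can be modeled as follows. This is a simplified sketch rather than the runtime's implementation; GlobalDesc, Registry, and register_global are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, reduced form of the per-variable metadata.
struct GlobalDesc {
  std::size_t size;             // size of the variable in this module
  std::uint8_t *odr_indicator;  // address that __odr_asan_gen_$var resolves to
};

struct Registry {
  std::vector<GlobalDesc> registered;
  int detect_odr_violation = 2;  // models the ASAN_OPTIONS value

  // Returns true if registering `g` would be reported as an ODR violation.
  bool register_global(const GlobalDesc &g) {
    assert(g.odr_indicator != nullptr);
    bool violation = false;
    if (*g.odr_indicator == 0) {
      *g.odr_indicator = 1;  // first registration of this variable
    } else {
      // The indicator is already set: another module registered the same
      // symbol. Mode 1 only complains when the sizes differ.
      for (const GlobalDesc &prev : registered)
        if (prev.odr_indicator == g.odr_indicator &&
            (detect_odr_violation >= 2 || prev.size != g.size))
          violation = true;
    }
    registered.push_back(g);
    return violation;
  }
};
```

Two modules whose indicators interpose to the same byte trigger a report in mode 2 regardless of size, whereas mode 1 stays silent until the sizes differ.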
1 | @__odr_asan_gen_g0 = global i8 0, align 1 |
The private aliases @0 and @1 were due to
If a.supp contains the following text, running the program with the environment variable ASAN_OPTIONS=suppressions=a.supp suppresses errors due to the variable name var.
odr_violation:^var$
An ODR violation is reported for two different linked units, say, exe and b.so. With static linking, the issue can be suppressed due to archive member extraction semantics if the b.a member is not extracted.
The previous example uses -fsanitize-address-use-odr-indicator.
Prior to Clang 16, -fno-sanitize-address-use-odr-indicator was the default for non-Windows platforms. The runtime checks whether a variable has been registered by verifying whether its redzone has been poisoned, and reports an ODR violation when the redzone has been poisoned.
@___asan_gen_.1 = private unnamed_addr constant [3 x i8] c"g0\00", align 1
@___asan_gen_.2 = private unnamed_addr constant [3 x i8] c"g1\00", align 1
@__asan_global_g0 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @g0 to i64), i64 4, i64 32, i64 ptrtoint (ptr @___asan_gen_.1 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 0 }, section "asan_globals", !associated !0
@__asan_global_g1 = private global { i64, i64, i64, i64, i64, i64, i64, i64 } { i64 ptrtoint (ptr @g1 to i64), i64 8, i64 32, i64 ptrtoint (ptr @___asan_gen_.2 to i64), i64 ptrtoint (ptr @___asan_gen_ to i64), i64 0, i64 0, i64 0 }, section "asan_globals", !associated !1
@llvm.compiler.used = appending global [4 x ptr] [ptr @g0, ptr @g1, ptr @__asan_global_g0, ptr @__asan_global_g1], section "llvm.metadata"
This mode eliminates the need for an additional variable like __odr_asan_gen_$var, but it can lead to interaction issues when mixing instrumented and uninstrumented components. In the case of a shared object, if the reference to $var in __asan_global_$var is interposed with an uninstrumented variable due to symbol interposition, it may result in a spurious error stating, "The following global variable is not properly aligned."
For Clang 16, I introduced the use of -fsanitize-address-use-odr-indicator by default for non-Windows targets (see https://reviews.llvm.org/D137227).
(Additionally, https://reviews.llvm.org/D127911 changed the ODR indicator symbol name to __odr_asan_gen_$demangled.)
Private aliases have an interesting interaction with copy relocations. This issue is reported at https://gcc.gnu.org/PR68016.
The default -fsanitize-address-use-odr-indicator in Clang 16 and later cannot detect the global-buffer-overflow error below:
1 | echo 'int f[5] = {1};' > foo.cc |
The definition of f in foo.cc is instrumented, resulting in the creation of __asan_global_f. However, the executable actually accesses the copy created by the linker due to copy relocation.
When -asan-use-private-alias=1 is in effect (the default since Clang 16), the __asan_global_f variable references the unused copy inside the shared object. The executable accesses the copy-relocated variable, whose redzone is not poisoned, resulting in no error.
Conversely, when -asan-use-private-alias=0 is in effect, the __asan_global_f variable references the copy-relocated variable and poisons the redzone within the executable. Consequently, accessing f[5] leads to the expected error.
Since Clang 17, asan.module_ctor is, by default, placed in a COMDAT group. When multiple instrumented relocatable object files are linked together, only one asan.module_ctor is retained.
__asan_global_g0 is positioned in a section that links to the section defining g0 using the SHF_LINK_ORDER flag. During linking, if the linker discards the section defining g0, the asan_globals section containing __asan_global_g0 will also be discarded. For more detail on SHF_LINK_ORDER, you can refer to
Before Clang 17, the default behavior was to use -fno-sanitize-address-globals-dead-stripping. In this mode, the instrumentation places pointers to instrumented global variables in a metadata array and calls __asan_register_globals. __asan_register_globals then iterates over the array and registers each global variable.
1 | @g0 = dso_local global { i32, [28 x i8] } zeroinitializer, align 32 |
asan.module_ctor references the metadata array @0, which, in turn, references @1 and @2. @1 and @2 reference the global variables g0 and g1, respectively. This unfortunately indicates that g0 and g1 cannot be discarded by section-based garbage collection.
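The reachability argument can be replayed with a tiny mark phase. This is a hypothetical model in which each name stands for the section that defines it and live_sections is an invented helper; a real linker works on sections and relocations rather than strings.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps a section to the sections it references.
using Graph = std::map<std::string, std::vector<std::string>>;

// Returns every section reachable from the GC roots.
std::set<std::string> live_sections(const Graph &refs,
                                    const std::vector<std::string> &roots) {
  assert(!roots.empty());
  std::set<std::string> live;
  std::vector<std::string> worklist(roots);
  while (!worklist.empty()) {
    std::string s = worklist.back();
    worklist.pop_back();
    if (!live.insert(s).second)
      continue;  // already visited
    auto it = refs.find(s);
    if (it != refs.end())
      for (const std::string &target : it->second)
        worklist.push_back(target);
  }
  return live;
}
```

With the edges asan.module_ctor -> @0, @0 -> {@1, @2}, @1 -> g0, and @2 -> g1, and the constructor as a root, both g0 and g1 end up live even if nothing else references them.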
It's important to note that this version of asan.module_ctor is not placed within a COMDAT group. In another compile unit, a separate asan.module_ctor references a different metadata array. As a result, these asan.module_ctor functions cannot share the same implementation.
In a linked component, both __asan_init and __asan_version_mismatch_check_v8 will be called multiple times, incurring a small overhead.
Regrettably, the default setting of -fsanitize-address-globals-dead-stripping in Clang 17 had a bug. Specifically, when there are no global variables, and the unique module ID is non-empty, a COMDAT asan.module_ctor is created without any __asan_register_elf_globals calls. If this COMDAT is selected as the prevailing copy by the linker, the linkage unit will lack a __asan_register_elf_globals call, resulting in an unpoisoned redzone and a non-functional ODR violation checker.
I have fixed this in the main branch (
Before Clang 15, Clang's instrumentation included llvm.asan.globals, and the AddressSanitizer runtime required its object file feature for symbolization.
https://reviews.llvm.org/D127552 enabled debug information for symbolization and llvm.asan.globals.
AddressSanitizer provides a check to detect whether a dynamic initializer for one global variable accesses dynamically initialized global variables defined in another compile unit, which helps identify certain initialization order fiasco issues.
Here is an example:
cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1;
static int fa0() { return 1; }
int a0 = fa0();
int main() { printf("%d %d\n", a0, a1); }
eof
cat > a1.cc <<'eof'
extern int a0;
static int fa1() { return a0+1; }
int a1 = fa1();
eof
clang++ -fsanitize=address a0.cc a1.cc -o a

% ASAN_OPTIONS=strict_init_order=1 ./a
=================================================================
==124921==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x5577b1cd6b00 at pc 0x5577b12fbbca bp 0x7ffe75a0a280 sp 0x7ffe75a0a260
READ of size 4 at 0x5577b1cd6b00 thread T0
    #0 0x5577b12fbbc9 in fa1() /tmp/t/d/a1.cc:2:27
    #1 0x5577b12fbbec in __cxx_global_var_init /tmp/t/d/a1.cc:3:10
    #2 0x5577b12fbc64 in _GLOBAL__sub_I_a1.cc /tmp/t/d/a1.cc
    #3 0x7ff44e0107f5 in call_init csu/../csu/libc-start.c:145:3
    #4 0x7ff44e0107f5 in __libc_start_main csu/../csu/libc-start.c:347:5
    #5 0x5577b11b46d0 in _start (/tmp/t/d/a+0x766d0)
0x5577b1cd6b00 is located 0 bytes inside of global variable 'a0' defined in '/tmp/t/d/a0.cc:4' (0x5577b1cd6b00) of size 4
  registered at:
    #0 0x5577b11d1da4 in __asan_register_globals /usr/local/google/home/maskray/llvm/compiler-rt/lib/asan/asan_globals.cpp:363:3
    #1 0x5577b11d2181 in __asan_register_elf_globals /usr/local/google/home/maskray/llvm/compiler-rt/lib/asan/asan_globals.cpp:346:3
    #2 0x5577b12fbb57 in asan.module_ctor a0.cc
    #3 0x7ff44e0107f5 in call_init csu/../csu/libc-start.c:145:3
    #4 0x7ff44e0107f5 in __libc_start_main csu/../csu/libc-start.c:347:5
SUMMARY: AddressSanitizer: initialization-order-fiasco /tmp/t/d/a1.cc:2:27 in fa1()
...
When check_initialization_order is enabled, while strict_init_order is disabled, AddressSanitizer performs a weak check allowing a compile unit that is about to be initialized to access global variables in an already initialized compile unit. In this scenario, the previous example does not result in an error:
% ASAN_OPTIONS=check_initialization_order=1:strict_init_order=0 ./a
1 2
For the following case, the weak check can still catch the initialization order fiasco:
cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1;
int a0 = []() { return a1-1; }();
int main() { printf("%d %d\n", a0, a1); }
eof
cat > a1.cc <<'eof'
extern int a0;
static int fa1() { return 2; }
int a1 = fa1();
eof
clang++ -g -fsanitize=address a0.cc a1.cc -o a
ASAN_OPTIONS=check_initialization_order=1:strict_init_order=0 ./a
Clang translates C++ dynamic initialization into a global initialization function within the llvm.global_ctors list. AddressSanitizer augments this global initialization function with __asan_before_dynamic_init and __asan_after_dynamic_init. These two functions work together to check for initialization order issues when check_initialization_order is enabled.
For instrumented global variables with initializers, the has_dynamic_init variable in the __asan_global metadata is set to true. These variables are collected into the dynamic_init_globals array.
__asan_before_dynamic_init is called for each compile unit. This function iterates over dynamic_init_globals and poisons those whose DynInitGlobal::initialized value is false. Subsequently, the global initialization function is executed. If it accesses the poisoned memory, it triggers a report for an initialization order issue. Following this, __asan_after_dynamic_init processes these global variables, unpoisoning them.
1 | void __asan_before_dynamic_init(const char *module_name) { |
The check is applicable when the accessed variable resides in another linked unit.
For example, consider that b.so consists of b0.cc and b1.cc, while the main executable a contains a0.cc and a1.cc.
cat > a0.cc <<'eof'
#include <stdio.h>
extern int a1, b0, b1;
static int fa0() { return 1; }
int a0 = fa0();
int main() { printf("%d %d %d %d\n", a0, a1, b0, b1); }
eof
echo 'static int fa1() { return 2; } int a1 = fa1();' > a1.cc
echo 'static int fb0() { return 3; } int b0 = fb0();' > b0.cc
echo 'static int fb1() { return 4; } int b1 = fb1();' > b1.cc
sed 's/^ /\t/' > Makefile <<'eof'
.MAKE.MODE := meta curDirOk=true
CXX := clang++
CXXFLAGS := -g -fsanitize=address
a: a0.cc a1.cc b.so
  ${LINK.cc} -Wl,-rpath=. $> -o $@
b.so: b0.cc b1.cc
  ${LINK.cc} -fpic -shared $> -o $@
clean:
  rm -f *.meta a b.so
eof
bmake
In check_initialization_order=1,strict_init_order=0 mode,
- b0.cc and b1.cc are registered
- __asan_before_dynamic_init marks b0 as initialized and poisons b1. Global initialization is run. __asan_after_dynamic_init unpoisons b1
- __asan_before_dynamic_init marks b1 as initialized and poisons b0. Global initialization is run. __asan_after_dynamic_init unpoisons b0
- a0.cc and a1.cc are registered
- __asan_before_dynamic_init marks a0 as initialized and poisons a1. Global initialization is run. __asan_after_dynamic_init unpoisons a1
- __asan_before_dynamic_init marks a1 as initialized and poisons a0. Global initialization is run. __asan_after_dynamic_init unpoisons a0
In check_initialization_order=1,strict_init_order=1 mode,
- b0.cc and b1.cc are registered
- __asan_before_dynamic_init poisons b1. Global initialization is run
- __asan_before_dynamic_init poisons b0. Global initialization is run
- a0.cc and a1.cc are registered
- __asan_before_dynamic_init poisons b0,b1,a1. Global initialization is run. __asan_after_dynamic_init unpoisons b0,b1,a1
- __asan_before_dynamic_init poisons b0,b1,a0. Global initialization is run. __asan_after_dynamic_init unpoisons b0,b1,a0
The instrumentation can be disabled with an entry in asan_ignorelist.txt:
global:var=init
An initialization-order-fiasco error cannot be suppressed using ASAN_OPTIONS=suppressions=a.supp.
1 | C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(back end)=> relocatable object file |
If we follow the internal representations of instructions, a more detailed diagram looks like this:
C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(instruction selector)=> MachineInstr =(AsmPrinter)=> MCInst =(assembler)=> relocatable object file
LLVM and Clang are designed as a collection of libraries. This post describes how different libraries work together to create the final relocatable object file. I will focus on how a function goes through the multiple compilation stages.
The compiler frontend primarily comprises the following libraries:
The clangDriver library is located in clang/lib/Driver/ and clang/include/Driver/, while other libraries have similar structures. In general, when a header file in one library (let's call it library A) is needed by another library, it is exposed to clang/include/$A/. Downstream projects can also include the header file from clang/include/$A/.
Let's use a C++ source file as an example.
1 | % cat a.cc |
The entry point of the Clang executable is implemented in clang/tools/driver/. clang_main creates a clang::driver::Driver instance, calls BuildCompilation to construct a clang::driver::Compilation instance, and then calls ExecuteCompilation.
You may read
1 | BuildCompilation |
For clang++ -g a.cc, clangDriver identifies the following phases: preprocessor, compiler (C++ to LLVM IR), backend, assembler, and linker. The first several phases can be performed by one single clang::driver::tools::Clang object (also known as Clang cc1), while the final phase requires an external program (the linker).
1 | % clang++ -g a.cc '-###' |
cc1_main in clangDriver calls ExecuteCompilerInvocation defined in clangFrontend.
clangFrontend defines CompilerInstance, which manages various classes, including CompilerInvocation, DiagnosticsEngine, TargetInfo, FileManager, SourceManager, Preprocessor, ASTContext, ASTConsumer, and Sema.
1 | ExecuteCompilerInvocation |
In ExecuteCompilerInvocation, a FrontendAction is created based on the CompilerInstance argument and then executed. When using the -emit-obj option, the selected FrontendAction is an EmitObjAction, which is a derivative of CodeGenAction.
During FrontendAction::BeginSourceFile, several classes mentioned earlier are created, and a BackendConsumer is also established. The BackendConsumer serves as a wrapper around CodeGenerator, which is another derivative of ASTConsumer. Finally, in FrontendAction::BeginSourceFile, CompilerInstance::setASTConsumer is called to create a CodeGenModule object, responsible for managing an LLVM IR module.
In FrontendAction::Execute, CodeGenAction::ExecuteAction is invoked, primarily handling the compilation of LLVM IR files. This function, in turn, calls the base function ASTFrontendAction::ExecuteAction, which, in essence, triggers the entry point of clangParse: ParseAST.
clangParse consumes tokens from clangLex and invokes parser actions, many of which are named Act*, defined in clangSema. clangSema performs semantic analysis and generates AST nodes.
1 | ParseAST |
When ParseTopLevelDecl consumes a tok::eof token, implicit instantiations are performed.
In the end, we get a full AST (actually a misnomer as the representation is not abstract, not only about syntax, and is not a tree). ParseAST calls virtual functions HandleTopLevelDecl and HandleTranslationUnit.
BackendConsumer defined in clangCodeGen overrides HandleTopLevelDecl and HandleTranslationUnit to generate unoptimized LLVM IR and hand over the IR to LLVM for machine code generation.
1 | BackendConsumer::HandleTopLevelDecl |
BackendConsumer::HandleTopLevelDecl generates LLVM IR for each top-level declaration. This means that Clang generates a function at a time.
BackendConsumer::HandleTranslationUnit invokes EmitBackendOutput to create an LLVM IR file, an assembly file, or a relocatable object file. EmitBackendOutput establishes an optimization pipeline and a machine code generation pipeline.
Now let's explore CodeGenFunction::EmitFunctionBody. Generating IR for a variable declaration and a return statement involves the following functions, among others:
EmitFunctionBody
EmitCompoundStmtWithoutScope
EmitStmt
EmitSimpleStmt
EmitDeclStmt
EmitDecl
EmitVarDecl
EmitStopPoint
EmitReturnStmt
EmitScalarExpr
ScalarExprEmitter::EmitBinOps
After generating the LLVM IR, clangCodeGen proceeds to execute EmitAssemblyHelper::RunOptimizationPipeline to perform middle-end optimizations and subsequently EmitAssemblyHelper::RunCodegenPipeline to generate machine code.
For our integer division example, the function foo in the unoptimized LLVM IR looks like this (attributes are omitted):
; Function Attrs: mustprogress noinline uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca float, align 4
  %c.addr = alloca i32, align 4
  %s = alloca i32, align 4
  store i32 %a, ptr %a.addr, align 4, !tbaa !5
  store float %b, ptr %b.addr, align 4, !tbaa !9
  store i32 %c, ptr %c.addr, align 4, !tbaa !5
  call void @llvm.lifetime.start.p0(i64 4, ptr %s) #4
  %0 = load i32, ptr %a.addr, align 4, !tbaa !5
  %1 = load float, ptr %b.addr, align 4, !tbaa !9
  %2 = load float, ptr %b.addr, align 4, !tbaa !9
  %cmp = fcmp oeq float %1, %2
  %conv = zext i1 %cmp to i32
  %add = add nsw i32 %0, %conv
  store i32 %add, ptr %s, align 4, !tbaa !5
  %3 = load i32, ptr %s, align 4, !tbaa !5
  %4 = load i32, ptr %c.addr, align 4, !tbaa !5
  %call = call noundef i32 @_Z3divIiET_S0_S0_(i32 noundef %3, i32 noundef %4)
  call void @llvm.lifetime.end.p0(i64 4, ptr %s) #4
  ret i32 %call
}
EmitAssemblyHelper::RunOptimizationPipeline creates a pass manager to schedule the middle-end optimization pipeline. This pass manager executes numerous optimization passes and analyses.
The option -mllvm -print-pipeline-passes provides insight into these passes:
% clang -c -O1 -mllvm -print-pipeline-passes a.c
annotation2metadata,forceattrs,declare-to-assign,inferattrs,coro-early,...
For our integer division example, the optimized LLVM IR looks like this:
; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) local_unnamed_addr #0 {
entry:
%cmp = fcmp ord float %b, 0.000000e+00
%conv = zext i1 %cmp to i32
%add = add nsw i32 %conv, %a
%div.i = sdiv i32 %add, %c
ret i32 %div.i
}
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @main() local_unnamed_addr #1 {
entry:
%call = tail call noundef i32 @_Z3fooifi(i32 noundef 3, float noundef 2.000000e+00, i32 noundef 1)
ret i32 %call
}
The most noticeable differences are the following:
- SROAPass runs mem2reg and optimizes out many AllocaInsts.
- InstCombinePass (InstCombinerImpl::visitFCmpInst) replaces fcmp oeq float %1, %1 with fcmp ord float %1, 0.000000e+00, canonicalizing the NaN test to FCmpInst::FCMP_ORD.
- InlinerPass inlines the instantiated div function into its caller foo.
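The InstCombine rewrite is valid because a floating-point value compares equal to itself unless it is NaN, and an ordered comparison against a non-NaN constant tests exactly that. Here is a minimal C++ sketch of the equivalence. The helper names are made up for illustration and this is not LLVM code:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Source-level form: `b == b` is what becomes `fcmp oeq float %b, %b`.
inline bool self_equal(float b) { return b == b; }

// Canonical form: `fcmp ord float %b, 0.0` holds when neither operand is NaN.
// The constant 0.0 is never NaN, so only b matters.
inline bool ordered_with_zero(float b) { return !std::isnan(b); }

// True when both forms give the same answer for b.
inline bool forms_agree(float b) { return self_equal(b) == ordered_with_zero(b); }
```

Both forms agree for ordinary values, infinities, and NaN, assuming the code is not built with -ffast-math, which lets the compiler assume NaN never occurs.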
The demarcation between the middle end and the back end may not be entirely distinct. Within LLVMTargetMachine::addPassesToEmitFile, several IR passes are scheduled. It's reasonable to consider these IR passes (everything before addCoreISelPasses) as part of the middle end, while the phase beginning with instruction selection can be regarded as the actual back end.
Here is an overview of LLVMTargetMachine::addPassesToEmitFile:
1 | LLVMTargetMachine::addPassesToEmitFile |
These IR and machine passes are scheduled by the legacy pass manager. The option -mllvm -debug-pass=Structure provides insight into these passes:
clang -c -O1 a.c -mllvm -debug-pass=Structure
There are three instruction selectors: SelectionDAG, FastISel, and GlobalISel. FastISel is integrated within the SelectionDAG framework.
For most targets, FastISel is the default for clang -O0 while SelectionDAG is the default for optimized builds. However, for most AArch64 -O0 configurations, GlobalISel is the default.
To force using GlobalISel, we can specify -mllvm -global-isel.
See
SelectionDAG: normal code path
LLVM IR =(visit)=> SDNode =(DAGCombiner,LegalizeTypes,DAGCombiner,Legalize,DAGCombiner,Select,Schedule)=> MachineInstr
SelectionDAG: FastISel (fast but not optimal)
LLVM IR =(FastISel)=> MachineInstr
1 | TargetPassConfig::addCoreISelPasses |
Each backend implements a derived class of SelectionDAGISel. For example, the X86 backend implements X86DAGToDAGISel and overrides runOnMachineFunction to set up variables like X86Subtarget and then invokes the base function SelectionDAGISel::runOnMachineFunction.
SelectionDAGISel creates a SelectionDAGBuilder. For each basic block, SelectionDAGISel::SelectBasicBlock iterates over all IR instructions and calls SelectionDAGBuilder::visit on them, creating a new SDNode for each Value that becomes part of the DAG.
The initial DAG may contain types and operations that are not natively supported by the target. SelectionDAGISel::CodeGenAndEmitDAG invokes LegalizeTypes and Legalize to convert unsupported types and operations to supported ones.
ScheduleDAGSDNodes::EmitSchedule emits the machine code (MachineInstrs) in the scheduled order.
Let's take a closer look at our foo function.
For the IR instruction %cmp = fcmp ord float %b, 0.000000e+00, SelectionDAGBuilder::visit creates a new SDNode with the opcode ISD::SETCC (t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch).
SelectionDAGBuilder::visit
  SelectionDAGBuilder::visitFCmp
A new SDNode with the opcode ISD::ZERO_EXTEND is created for %conv = zext i1 %cmp to i32.
For the IR instruction %add = add nsw i32 %conv, %a, SelectionDAGBuilder::visit creates a new SDNode with the opcode ISD::ADD.
SelectionDAGBuilder::visit
  SelectionDAGBuilder::visitAdd
    SelectionDAGBuilder::visitBinary # binary operators are handled similarly
Similarly, SelectionDAGBuilder::visit creates a new SDNode with the opcode ISD::SDIV for %div.i = sdiv i32 %add, %c. For the ret i32 %div.i instruction, the created SDNode has a target-specific opcode X86ISD::RET_GLUE (target-specific opcodes are legal for almost all targets).
After all instructions are visited, we get an initial DAG that looks like:
Initial selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
t10: i32 = zero_extend t9
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t10, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t12: i32 = sdiv t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
The DAGCombiner process changes t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch to t19: i1 = setcc t4, t4, seto:ch. After the initial combining, the output looks like:
Optimized lowered selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 16 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t19: i1 = setcc t4, t4, seto:ch
t10: i32 = zero_extend t19
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t10, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t12: i32 = sdiv t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
The LegalizeTypes process changes t10: i32 = zero_extend t19 to t23: i32 = any_extend t22; t25: i32 = and t23, Constant:i32<1>. After legalization (LegalizeTypes, Legalize, and the combining in between), the DAG looks like the following:
Optimized legalized selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t30: i32 = X86ISD::FCMP t4, t4
t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30
t26: i32 = zero_extend t32
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t26, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t29: i32,i32 = sdivrem t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t29
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
x86 division instructions compute both the quotient and the remainder. To leverage this property, X86ISelLowering.cpp sets ISD::SDIV to Expand. The Legalize process will expand the ISD::SDIV node. In SelectionDAGLegalize::ExpandNode, the node is replaced with a new node with the opcode ISD::SDIVREM.
X86ISelLowering.cpp sets ISD::SETCC for the type to Custom. X86TargetLowering::LowerOperation provides custom lowering hooks and replaces the ISD::SETCC node with t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30 that uses another node t30: i32 = X86ISD::FCMP t4, t4.
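The two-result shape of sdivrem (i32,i32) has a direct source-level analogue in std::div from <cstdlib>, which returns the quotient and remainder together. A small sketch, where the wrapper name sdivrem is my own choice for illustration:

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>

// Returns {quotient, remainder}, mirroring the two results of an sdivrem node.
// On x86, one idiv instruction produces both values.
inline std::pair<int, int> sdivrem(int a, int b) {
  assert(b != 0 && "division by zero is undefined");
  std::div_t r = std::div(a, b);
  return {r.quot, r.rem};
}
```

The quotient truncates toward zero and the remainder takes the sign of the dividend, so sdivrem(-7, 2) yields -3 and -1.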
The SelectionDAGISel::Select process creates MachineSDNode (a derived class of SDNode with a negative NodeType) objects, which will be converted to MachineInstr. SelectionDAGISel::Select is overridden by targets to perform custom handling for certain instructions and hand over the rest to SelectCode. SelectCode is the entry point of TableGen-generated pattern matching code for instruction selection. While some SDNodes become MachineSDNodes, some SDNodes (e.g. CopyFromReg) remain SDNode.
In our example,
- X86::RET is selected for the X86ISD::RET_GLUE node.
- ISD::Register and ISD::CopyFromReg remain the same.
- X86::IDIV32r is selected for the ISD::SDIVREM node.
- X86::ADD32rr is selected for the ISD::ADD node.
- X86::MOVZX32rr8 is selected for the ISD::ZERO_EXTEND node.
In the EmitSchedule process, MachineInstr objects are created from these MachineSDNode and regular SDNode objects.
FastISel, typically used for clang -O0, represents a fast path of SelectionDAG that generates less optimized machine code.
When FastISel is enabled, SelectAllBasicBlocks tries to skip SelectBasicBlock and select instructions with FastISel. However, FastISel only handles a subset of IR instructions. For unhandled instructions, SelectAllBasicBlocks falls back to SelectBasicBlock to handle the remaining instructions in the basic block.
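The fallback policy can be pictured with a toy model. Everything below is hypothetical: the opcode strings, the set of "fast" opcodes, and the program-order walk are assumptions for illustration and do not mirror the real SelectAllBasicBlocks implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Selector { Fast, DAG };

// Hypothetical stand-in for "the fast path knows how to select this opcode".
inline bool fastCanHandle(const std::string &opcode) {
  return opcode == "add" || opcode == "load" || opcode == "store" || opcode == "ret";
}

// For one basic block, records which selector handled each instruction.
// Once the fast path meets an instruction it cannot handle, the rest of the
// block is handed to the slower, more general selector.
inline std::vector<Selector> selectBlock(const std::vector<std::string> &block) {
  std::vector<Selector> result;
  std::size_t i = 0;
  for (; i < block.size() && fastCanHandle(block[i]); ++i)
    result.push_back(Selector::Fast);
  for (; i < block.size(); ++i)
    result.push_back(Selector::DAG);
  assert(result.size() == block.size());
  return result;
}
```

In this model a block of load, add, sdiv, ret is selected as Fast, Fast, DAG, DAG: the unhandled sdiv pushes the remainder of the block onto the general path.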
GlobalISel is a new instruction selection framework that operates on the entire function, in contrast to the basic block view of SelectionDAG. GlobalISel offers improved performance and modularity (multiple passes instead of one monolithic pass).
GlobalISel works on MachineInstr directly (initially in generic form), replacing the separate intermediate representation, SDNode, which was used in the SelectionDAG framework.
LLVM IR =(IRTranslator)=> generic MachineInstr =(Legalizer)=> regular and generic MachineInstr =(RegBankSelect,GlobalInstructionSelect)=> regular MachineInstr
1 | TargetPassConfig::addCoreISelPasses |
Similar to FastISel, GlobalISel does not handle all instructions. If GlobalISel fails to handle a function, SelectionDAG will be used as the fallback.
After instruction selection, there are machine SSA optimizations, register allocation, machine late optimizations, and pre-emit passes.
1 | TargetPassConfig::addMachinePasses |
After register allocation, the machine function is no longer in SSA form. Then, the VirtRegRewriter pass replaces all virtual register references with physical register references.
X86::RET describes a pseudo instruction. X86ExpandPseudo::ExpandMI expands the X86::RET MachineInstr to an X86::RET64.
LLVMTargetMachine::addAsmPrinter incorporates a target-specific AsmPrinter (derived from AsmPrinter) pass into the machine code generation pipeline. These target-specific AsmPrinter passes are responsible for converting MachineInstrs to MCInsts and emitting them to an MCStreamer.
In our x86 case, the target-specific class is named X86AsmPrinter. X86AsmPrinter::runOnMachineFunction invokes AsmPrinter::emitFunctionBody to emit the function body. The base member function handles function header/footer, comments, and instructions. Target-specific classes override emitInstruction to lower MachineInstrs with target-specific opcodes to MCInsts.
An MCInst object can also be created by the LLVM integrated assembler. MCInst is like a simplified version of MachineInstr with less information. When MCInst is emitted to an MCStreamer, it results in either assembly code or bytes for a relocatable object file.
Clang has the capability to output either assembly code or an object file. Generating an object file directly without involving an assembler is referred to as "direct object emission".
To provide a unified interface, MCStreamer is created to handle the emission of both assembly code and object files. The two primary subclasses of MCStreamer are MCAsmStreamer and MCObjectStreamer, responsible for emitting assembly code and machine code respectively.
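To make the "one interface, two outputs" idea concrete, here is a toy model. The Toy* types are invented for this sketch and are far simpler than the real MCStreamer hierarchy. The only real-world fact used is that the x86-64 near return encodes as the single byte 0xc3.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy instruction: a mnemonic for text output and its encoded bytes.
struct ToyInst {
  std::string mnemonic;
  std::vector<std::uint8_t> encoding;
};

// Common interface, playing the role the text assigns to MCStreamer.
struct ToyStreamer {
  virtual ~ToyStreamer() = default;
  virtual void emitInstruction(const ToyInst &inst) = 0;
};

// Text output, in the spirit of MCAsmStreamer.
struct ToyAsmStreamer : ToyStreamer {
  std::string text;
  void emitInstruction(const ToyInst &inst) override {
    text += "\t" + inst.mnemonic + "\n";
  }
};

// Byte output, in the spirit of MCObjectStreamer.
struct ToyObjectStreamer : ToyStreamer {
  std::vector<std::uint8_t> bytes;
  void emitInstruction(const ToyInst &inst) override {
    bytes.insert(bytes.end(), inst.encoding.begin(), inst.encoding.end());
  }
};

// The producer is written once against the base interface.
inline void emitReturn(ToyStreamer &out) {
  out.emitInstruction(ToyInst{"retq", {0xc3}});
}
```

The caller decides whether it gets assembly text or machine code bytes by choosing the streamer. emitReturn itself does not change.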
In the case of an assembly input file, LLVM creates an MCAsmParser object (LLVMMCParser) and a target-specific MCTargetAsmParser object. The MCAsmParser is responsible for tokenizing the input, parsing assembler directives, and invoking the MCTargetAsmParser to parse an instruction. Both the MCAsmParser and MCTargetAsmParser objects can call the MCStreamer API to emit assembly code or machine code.
In our case, LLVMAsmPrinter calls the MCStreamer API to emit assembly code or machine code.
If the streamer is an MCAsmStreamer, the MCInst will be pretty-printed. If the streamer is an MCELFStreamer (other object file formats are similar), MCELFStreamer::emitInstToData will use ${Target}MCCodeEmitter from LLVM${Target}Desc to encode the MCInst, emit its byte sequence, and record needed relocations. An ELFObjectWriter object is used to write the relocatable object file.
In our example, we get a relocatable object file. If we invoke clang -S a.cc to get the assembly, it will look like:
...
.globl _Z3fooifi # -- Begin function _Z3fooifi
.p2align 4, 0x90
.type _Z3fooifi,@function
_Z3fooifi: # @_Z3fooifi
.cfi_startproc
# %bb.0:
xorl %eax, %eax
ucomiss %xmm0, %xmm0
setnp %al
addl %edi, %eax
cltd
idivl %esi
retq
.Lfunc_end0:
.size _Z3fooifi, .Lfunc_end0-_Z3fooifi
.cfi_endproc
...
.globl main
.p2align 4, 0x90
.type main,@function
main: # @main
.cfi_startproc
# %bb.0:
movss .LCPI1_0(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero
movl $3, %edi
movl $1, %esi
jmp _Z3fooifi # TAILCALL
.Lfunc_end1:
.size main, .Lfunc_end1-main
.cfi_endproc
You may read my post
I may update this article as the process stabilizes further.
The move to GitHub pull requests has been a topic of discussion over the past few years. Several lengthy threads on the subject have emerged:
This transition could very well be the most contentious infrastructure change in LLVM's history. If you have the patience to delve into the discussions within the mentioned threads, you'll come across
Nevertheless, a decision has been made, and the targeted transition date was set for September 1, 2023. The negotiation during the decision-making process could have been handled more strategically to
On September 1, 2023 (or 2nd in some parts of the world),
In general, I believe that the majority of contributors consider GitHub to offer better accessibility. However, I have heard quite a few opinions suggesting that GitHub's code review capability is significantly worse than that of Phabricator. I will delve into this topic further later in this post.
Having contributed patches to more than 200 projects, many of which are one-off and even trivial, I genuinely appreciate it when a project uses GitHub pull requests. This is because I am already familiar with the system. On the other hand, if a project relies on a self-hosted code review website, I might find it less convenient as I'm not particularly keen on registering a username on a website I may never visit again. Even worse, I might need to invest time in getting acquainted with the system if I haven't encountered similar instances before.
The same argument applies to LLVM's self-hosted Phabricator instance. Many contributors have not used Phabricator before and would consider both the website and the command-line tool (
GitHub provides GitHub Apps and GitHub Actions to extend its functionality. With these, we can automate pull request labelling, testing, code analysis, code coverage, and potentially even a
Phabricator can also handle automation, but there are far fewer resources available for it. LLVM's self-hosted Phabricator instance, for instance, relies on
The llvm-project repository is vast. With a code frequency of 100+ commits every day, it's practically impossible for anyone to monitor every new commit. Nonetheless, many people wish to stay informed about changes to specific components, making patch subscription essential.
One way to achieve this is through mailing lists, such as
The other method is to utilize the code review tool, formerly Phabricator and now GitHub. With Phabricator, users can set up fairly complex subscription rules known as Herald. When a patch title, description, affected files, or the acting user matches certain criteria, you can take actions like adding yourself as a reviewer/subscriber or sending a one-off email.
GitHub, however, is less flexible in this regard. Individual users can choose to
To enable component-based subscription, the llvm organization on GitHub has created multiple pr-subscribers-* teams, which users can freely join. ( https://github.com/orgs/llvm/teams/pr-subscribers-* page and not on any other page. It's not reasonable to expect a maintainer to routinely check the https://github.com/orgs/llvm/teams/pr-subscribers-* pages for pending join requests. So if they miss the email notification that says would like to join "LLVM", the request may remain pending indefinitely. )
Then we use pr-subscribers-* teams. For example, a pull request affecting clang/lib/Driver/XX will receive the labels clang and clang:driver, and the pr-subscribers-clang and pr-subscribers-clang:driver teams will be notified.
On https://github.com/notifications, the "Team mentioned" tab lists these patches.
.github/CODEOWNERS
Previously, pr-subscribers-* teams were added to .github/CODEOWNERS. Due to GitHub's CODEOWNERS mechanism, the pr-subscribers-clang team would be added as a reviewer when a pull request affecting clang/xx was created. pr-subscribers-clang members would receive an email notification about the pull request with a full diff.
However, a complication arose when a member of the pr-subscribers-* team approved a change. It resulted in a message saying $user approved these changes on behalf of llvm/pr-subscribers-xx, which could be misleading if the user did not wish to assume such authority. In addition, the team was automatically removed as a team reviewer, adding to the confusion. This use case wasn't in line with GitHub's intended functionality, and there was a risk that GitHub's future changes might disrupt our workflow.
Filtering is another crucial aspect of managing notifications. GitHub supports
I have joined several pr-subscribers-* teams that I am interested in as replacements for my Herald rules on Phabricator. I am still in the process of making a curated list of Gmail filters. Here are some filters I currently use:
from:(notifications@github.com) to:(mention@noreply.github.com)
from:(notifications@github.com) to:(team_mention@noreply.github.com)
to:(review_requested@noreply.github.com)
to:(push@noreply.github.com)
from:(notifications@github.com) subject:(llvm-project "Issue #")
from:(notifications@github.com) (-cc:(review_requested@noreply.github.com OR mention@noreply.github.com OR team_mention@noreply.github.com) "this pull request" KEYWORDS_I_WANT_TO_IGNORE)
However, it's worth noting that these notifications also appear on
Another option is to switch to a pull-based workflow if LLVM provides a public-inbox instance. https://github.com/pulls provides a dashboard where one can list review requests.
Code review is the top reason we pick a code review tool. Let's assess how GitHub pull requests fare in addressing this challenge.
In Phabricator, the body of a differential (Phabricator's term for a patch) is a patch file. The patch file is based on a specific commit, but Phabricator is not required to know the base commit. A stable identifier, Differential Revision: https://reviews.llvm.org/Dxxxxx, in the commit message connects a patch file to a differential. When you amend a patch, Phabricator recognizes that the differential has evolved from patch X to patch Y. The user interface allows for comparisons between any two revisions associated with a differential. Additionally, review comments are confidently associated with the source line.
On the other hand, GitHub structures the concept of pull requests around branches and enforces a branch-centric workflow. A pull request centers on the difference (commits) between the base branch and the feature branch. GitHub does not employ a stable identifier for commit tracking. If commits are rebased, reordered, or combined, GitHub can easily become confused.
When you force-push a branch after a rebase, the user interface git diff X..Y
, which includes unrelated commits. Ideally, GitHub would show the difference between the two patch files, as Phabricator does, but it only displays the difference between the two head commits. These unrelated in-between commits might be acceptable for projects with lower commit frequency but can be challenging for a project with a code frequency of 100+ commits every day.
The fidelity of preserving inline comments after a force push has always been a weakness. The comments may be presented as "outdated". In the past, there was a notorious "lost inline comment" problem. Nowadays, the situation has improved, but some users still report that inline comments may occasionally become misplaced.
Due to the difficulties in comparing revisions and the lack of confidence in preserving inline comments, some recommendations suggest adopting
In a large repository, avoiding pulling upstream commits may not be realistic due to other commits frequently modifying nearby lines. Some people use a remote branch to save their work. Having to worry about whether a rebase could cause spam makes the branch more difficult to use. When working with both the latest main branch and the pull request branch, switching between branches results in numerous rebuilds. Rebasing follow-up commits could lead to merge conflicts and more pain. In addition, a popular convention among many LLVM contributors is to commit tests before landing the functional change, which also mandates force-push. (The GitHub UI will show a merge conflict.) Sometimes, only through rebasing can one notice that the patch needs adjustments to more code or tests due to its interaction with another landed patch:
There are two workflows you can do. One is the merge-based workflow recommended by
1 | git switch pr1 |
You can invoke git log --first-parent to list the fixup commits.
Personally I prefer the rebase-based alternative: perform a rebase and force push, then make a functional change and either append a new commit or amend the old commit.
When the functional change is made, leave a comment so that reviewers can locate the useful "compare" button and disregard the "compare" button associated with the rebase push.
GitHub's repository setting allows three options for pull requests: "Allow merge commits", "Allow squash merging", and "Allow rebase merging".
The default effect is quite impactful.
GitHub <noreply@github.com>
.In 2022, GitHub finally
If you land a patch in a manual way, it is easy for the pull request to end up in the "Closed" status (red), even if the commit has the #xxxxx identifier. I am not sure whether this status might discourage people from manually landing patches, which I believe should be fully supported and not discouraged.
I am not well-versed in reviewing patch series on GitHub, but this is a widely acknowledged pain point. Numerous projects are exploring ways to alleviate this issue.
In Phabricator, since a differential consists of a patch file and doesn't have a fixed base branch, we can freely reorder two differentials.
The issue
User branches to the llvm-project
repository are
Pros
- git fetch origin pull/xx/head:local_branch. Phabricator allows arc patch Dxxxxx and the less-known method: curl -L 'https://reviews.llvm.org/Dxxxxx?download=1' | patch -p1
- @mention applies to every user on GitHub, which is convenient when you need to seek a user's opinion. You cannot expect every user to have an account on a self-hosted instance.
- gh is more powerful than arcanist.
Cons
Note: the button below the "Files changed" tab allows switching between unified and split diff views. Users who are used to Phabricator's side-by-side diff view may want to adjust this setting.
I've noticed that my review productivity has decreased, a sentiment shared by many others. It's disheartening that
I hope that future iterations of GitHub will incorporate some ideas from Pull Request feature requests for GitHub #56635.
I've voiced numerous concerns regarding GitHub pull requests, and for that, I apologize. It's essential to acknowledge that GitHub contributes significantly to open source projects in many positive ways. My intention in sharing these concerns is to express a genuine hope that GitHub pull requests can be enhanced to better support large projects.
I would also like to express my gratitude to
The thread "How’s it going with pull requests?" has some nice analysis.
I use https://getcord.github.io/spr/ to
% cargo install spr
% spr --version
spr 1.3.4
% git remote -v
maskray git@github.com:MaskRay/llvm-project.git (fetch)
maskray git@github.com:MaskRay/llvm-project.git (push)
origin git@github.com:llvm/llvm-project.git (fetch)
origin git@github.com:llvm/llvm-project.git (push)
My ~/.gitconfig:
[spr]
githubAuthToken = ghp_xxxxxxxxxxx
githubRemoteName = origin
githubRepository = llvm/llvm-project
githubMasterBranch = main
branchPrefix = users/MaskRay/spr/
requireTestPlan = false
After a rebase and an amend, spr diff will create a merge commit. GitHub's UI gives a You are viewing a condensed version of this merge commit message.
To review an incoming pull request, I run gh pr checkout -b pr$id $id.
I have enabled a browser extension
Alt + Click on a resolved inline comment expands all resolved inline comments. Another Alt + Click will hide all inline comments. Note that some inline comments may become out-of-line, possibly due to a large local change or a force push.
I primarily look at the "Files changed" tab. I use "Conversation" for reading standalone comments, but not inline comments. I want all resolvable comments to be expanded by default, so I have installed this userscript:
// ==UserScript==
// @name GitHub pull: expand all resolvable comments by default
// @version 2024-01-21
// @author You
// @match https://github.com/*/files
// @icon https://www.google.com/s2/favicons?sz=64&domain=github.com
// @grant none
// ==/UserScript==
document.querySelectorAll('.js-resolvable-timeline-thread-container').forEach(e => { e.setAttribute('open', '') })
On September 5, 2023, I added a red banner to reviews.llvm.org to discourage new patches.
Transitioning existing differentials to GitHub pull requests could potentially cause disruption.
It is anticipated that at some point next year /Dxxxxx
pages (referenced by many commits), we can utilize
As activities on Phabricator wind down, maintenance should become more lightweight.
In the past two weeks, there have been different IP addresses crawling /source/llvm-github/ pages. It looks very much like a botnet, as adjacent files are visited by IP addresses from very different autonomous systems. Such a visit will cause Phabricator to spawn a process like git log --skip=0 -n 30 --pretty=format:%H:%P 988a16af929ece9453622ea256911cdfdf079d47 -- llvm/lib/Demangle/ItaniumDemangle.cpp that takes a few seconds (llvm-project is huge). I have redirected some pages to https://github.com/llvm/llvm-project/.
On September 8, 2023, I resized the database disk to 850GB to
On September 18, 2023, I noticed an aggressive spider that made the website slow. I blocked the user agents for some aggressive spiders.
On October 19, 2023, we received an attack of malicious requests using fake but strange user agents. I blocked two related IP ranges and updated /etc/php/fpm/pool.d/www.conf (seemed un-configured before) to reserve more processes.
On November 23, 2023, SendGrid used by the website
I am adapting Mercurial's phab-archive project to mirror reviews.llvm.org/Dxxxx pages. I just need to make a few adjustments (python,js,css,html).
The "History" button functions properly, although there isn't a "Compare" button.
In the llvm-project project, I sometimes find myself assigned as a reviewer for MIPS patches. I want to be transparent that I have no interest in MIPS, but my concern lies with the specific components that are impacted (Clang driver, ld.lld, MC, compiler-rt, etc.). Therefore, regrettably, I have to spend some time studying MIPS.
Using copper as a mirror, one can straighten their attire; using the past as a mirror, one can understand rise and fall; using people as a mirror, one can discern gains and losses. -- 贞观政要
Earlier ISAs, such as MIPS I, MIPS II, MIPS III, and MIPS IV, can be selected by gcc -march=mips[1-4]. MIPS V has no implementation and GCC doesn't support -march=mips5.
Successive versions are split into MIPS32 and MIPS64. Release 1, Release 2, Release 3, and Release 5 can be selected by -march=mips{32,64} and -march=mips{32,64}r{2,3,5}. Release 4 is skipped and -march=mips{32,64}r4 is not supported.
Release 6 is very different from prior releases, with removal and reorganization of some instructions.
- -march=mips{32,64} defines __mips_isa_rev to 1.
- -march=mips{32,64}r2 defines __mips_isa_rev to 2.
- -march=mips{32,64}r3 defines __mips_isa_rev to 3.
- -march=mips{32,64}r5 defines __mips_isa_rev to 5.
- -march=mips{32,64}r6 defines __mips_isa_rev to 6.
__mips_isa_rev.
The ISA revisions of the -march= values supported by GCC can be found at gcc/config/mips/mips-cpus.def.
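The mapping in the list above is regular enough to write down as a small lookup. This is only a sketch of that list. The function name isaRev and the sentinel 0 for values outside the list are choices of this example, not something GCC defines.

```cpp
#include <cassert>
#include <string>

// Returns the __mips_isa_rev value listed above for a -march= value of the
// form mips{32,64} or mips{32,64}r{2,3,5,6}. Returns 0 (a sentinel chosen for
// this sketch) for anything else, e.g. the unsupported r4.
inline int isaRev(const std::string &march) {
  for (const char *base : {"mips32", "mips64"}) {
    const std::string prefix(base);
    if (march == prefix)
      return 1;
    for (int rev : {2, 3, 5, 6})
      if (march == prefix + "r" + std::to_string(rev))
        return rev;
  }
  return 0;
}
```

In real code, the preprocessor check would be on the macro itself, e.g. testing whether __mips_isa_rev is defined and at least 2.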
The EF_MIPS_ARCH (0xf0000000) part of e_flags indicates the ISA. This is not really a good design because ISA levels may not be linear and the small number of bits can easily run out. e_flags has only 32 bits in both ELFCLASS32 and ELFCLASS64 objects. Allocating every spare bit should be done extremely carefully.
https://refspecs.linuxfoundation.org/elf/mipsabi.pdf
This is an ILP32 ABI designed for the 32-bit CPU MIPS R3000 that implements the MIPS I ISA. It is the original System V Processor Supplement ABI for MIPS. Common target triples: mips[el]-unknown-linux-gnu.
In GCC, -mabi=32 selects this ABI, which is identified as ABI_32. The macro _ABIO32=1 is defined.
Assemblers set the EF_MIPS_ABI_O32 flag in e_flags.
There are some major flaws:
MIPS I has 32 floating-point registers. Two registers are paired for holding a double precision number.
For o32, -mfp64 selects a variant that enables 64-bit floating-point registers. This requires at least -march=mips32r2 and GCC will emit .module fp=64; .module oddspreg.
When a 64-bit capable CPU is used with o32, assemblers set the EF_MIPS_32BITMODE flag in e_flags.
This is an LP64 ABI designed for 64-bit CPUs (MIPS III and newer). Common target triples: mips64[el]-unknown-linux-gnuabi64.
Does anyone know where I can get a copy of the ABI document?
In GCC, -mabi=64 selects this ABI, which is identified as ABI_64. The macro _ABI64=3 is defined.
Assemblers do not set a particular bit in e_flags.
This ABI has fixed some flaws of o32 but introduces an issue related to compiler optimization: $gp is now callee-saved. We will discuss this later.
This is an ILP32 ABI for 64-bit CPUs (MIPS III and newer), often named n32. Common target triples: mips64[el]-unknown-linux-gnuabin32.
In GCC, -mabi=n32 selects this ABI, which is identified as ABI_N32. The macro _ABIN32=2 is defined. -march= values such as mips3 and mips64 are compatible.
Assemblers set the EF_MIPS_ABI2 bit in e_flags.
This is o32 extended for 64-bit CPUs.
In GCC, -mabi=o64 selects this ABI, which is identified as ABI_O64. The macro _ABIO64=4 is defined.
Assemblers set the EF_MIPS_ABI_O64 flag in e_flags.
Assemblers set the EF_MIPS_ABI_EABI32 flag in e_flags.
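Putting the e_flags markers from the ABI descriptions together, a reader of an ELF header could classify the ABI roughly as follows. The numeric values are the ones from common <elf.h> headers (an assumption, since the text only gives the names), and treating "no marker plus ELFCLASS64" as n64 is a simplification of this sketch.

```cpp
#include <cassert>
#include <cstdint>

enum class MipsAbi { O32, N32, N64, O64, EABI32, Unknown };

// Values as found in common <elf.h> headers (assumed, not from the text).
constexpr std::uint32_t kAbiMask = 0x0000f000;   // EF_MIPS_ABI
constexpr std::uint32_t kAbiO32 = 0x00001000;    // EF_MIPS_ABI_O32
constexpr std::uint32_t kAbiO64 = 0x00002000;    // EF_MIPS_ABI_O64
constexpr std::uint32_t kAbiEabi32 = 0x00003000; // EF_MIPS_ABI_EABI32
constexpr std::uint32_t kAbi2 = 0x00000020;      // EF_MIPS_ABI2, set for n32

// isElf64 tells whether the object is ELFCLASS64. n64 sets no dedicated
// e_flags bit, so the ELF class is the only hint this sketch has for it.
inline MipsAbi classifyAbi(std::uint32_t eflags, bool isElf64) {
  switch (eflags & kAbiMask) {
  case kAbiO32:
    return MipsAbi::O32;
  case kAbiO64:
    return MipsAbi::O64;
  case kAbiEabi32:
    return MipsAbi::EABI32;
  default:
    break;
  }
  if (eflags & kAbi2)
    return MipsAbi::N32;
  return isElf64 ? MipsAbi::N64 : MipsAbi::Unknown;
}
```

The asymmetry is visible in the code: o32, o64, and EABI32 share one multi-bit field, n32 has its own bit, and n64 has nothing.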
o32 and o64 are called TARGET_OLDABI in GCC. n32 and n64 are called TARGET_NEWABI in GCC.
gcc/config.gcc defines the default ABI (macro MIPS_ABI_DEFAULT) for different target triples.
In the old ABIs, the local label prefix for assembly is $, which is strange and confusing, as register names are also prefixed with $.
GCC emits a special section to indicate the ABI for GDB.
.mdebug.abi32
.mdebug.abiN32
.mdebug.abi64
.mdebug.abiO64
.mdebug.eabi32
.mdebug.eabi64
In n32/n64 ABIs, $gp is callee-saved. This is unfortunate and often leads to more instructions compared with o32. Prologue/epilogue code needs to save and restore $gp, so tail calls are often inhibited. Technically, the GOT address is not necessarily loaded in $gp and GCC can pick a volatile register if profitable, but GCC doesn't have the optimization.
.cpsetup $reg1, offset|$reg2, label pseudo-op was introduced (
See
IEEE 754-2008 says
A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1.
-mnan=2008 instructs GCC to emit a .nan 2008 directive, which causes GNU assembler to set the EF_MIPS_NAN2008 bit in e_flags.
MIPS32/MIPS64 Release 6 defaults to -mnan=2008 while prior ISAs default to -mnan=legacy.
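For single precision, the "first bit of the trailing significand" is bit 22. The sketch below classifies a raw binary32 bit pattern under both conventions. The helper names are made up, and the legacy rule (quiet when that bit is 0) is stated from my understanding of pre-2008 MIPS rather than from the quote above.

```cpp
#include <cassert>
#include <cstdint>

// binary32 layout: 1 sign bit, 8 exponent bits, 23 trailing significand bits.
constexpr std::uint32_t kExpMask = 0x7f800000;
constexpr std::uint32_t kFracMask = 0x007fffff;
constexpr std::uint32_t kD1 = 0x00400000; // first bit of the trailing significand

// NaN: exponent all ones and a non-zero trailing significand.
inline bool isNaNBits(std::uint32_t bits) {
  return (bits & kExpMask) == kExpMask && (bits & kFracMask) != 0;
}

// nan2008 == true:  IEEE 754-2008 convention, d1 == 1 means quiet.
// nan2008 == false: legacy MIPS convention, assumed to be the inverse.
inline bool isQuietNaN(std::uint32_t bits, bool nan2008) {
  if (!isNaNBits(bits))
    return false;
  const bool d1 = (bits & kD1) != 0;
  return d1 == nan2008;
}
```

The pattern 0x7fc00000 is a quiet NaN under -mnan=2008 but a signaling NaN under the legacy encoding, which is why mixing the two kinds of objects is a problem.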
-modd-spreg for o32
Enable the use of odd-numbered single-precision floating-point registers for the o32 ABI. This option requires -march=mips32 or above.
I think the option should not be used with n32 or n64.
1 | % grep DT_MIPS_ include/elf/mips.h |
Among these dynamic tags, only DT_MIPS_LOCAL_GOTNO, DT_MIPS_GOTSYM, DT_MIPS_SYMTABNO, and DT_MIPS_PLTGOT are really needed in rtld. DT_MIPS_LOCAL_GOTNO and DT_MIPS_SYMTABNO-DT_MIPS_GOTSYM.
For executable output, DT_MIPS_RLD_MAP_REL holds the offset to the linker synthesized .rld_map. If DT_MIPS_RLD_MAP_REL is unavailable, glibc rtld looks for DT_MIPS_RLD_MAP, which is emitted for ET_EXEC executables.
In all the ABIs, "when calling position independent functions $25 must contain the address of the called function."
We will see below that non-PIC code may call PIC functions (in another translation unit) with j or jal instructions. When the callee is defined in the executable, to ensure that register $25 is correct at the callee entry, the linker must insert a trampoline.

```asm
// Prior to Release 6
jal func

=>

jal __LA25Thunk_func

__LA25Thunk_func:
  lui $25, %hi(func)
  j func
  addiu $25, $25, %lo(func)
  nop
```

When the callee is defined in a shared object, the linker will create a PLT entry to set up register $25.
See "Linking PIC and non-PIC in the same object" on STO_MIPS_PLT
to binutils.
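The lui/addiu pair in the thunk relies on how %hi and %lo split a 32-bit address. Because addiu sign-extends its 16-bit immediate, %hi has to round up whenever bit 15 of the address is set. The following is a minimal sketch of that arithmetic; the helper names hi16, lo16, and materialize are hypothetical and only model the 32-bit case.

```cpp
#include <cstdint>

// %hi(x): upper 16 bits, rounded up when %lo(x) will be negative.
constexpr uint32_t hi16(uint32_t addr) { return (addr + 0x8000u) >> 16; }

// %lo(x): lower 16 bits, interpreted as signed because addiu sign-extends.
constexpr int32_t lo16(uint32_t addr) {
  return static_cast<int16_t>(addr & 0xffffu);
}

// Value left in $25 by `lui $25, %hi(x); addiu $25, $25, %lo(x)`.
constexpr uint32_t materialize(uint32_t addr) {
  return (hi16(addr) << 16) + static_cast<uint32_t>(lo16(addr));
}

// Bit 15 is set, so %hi is 0x41 rather than 0x40 and %lo is negative.
static_assert(hi16(0x00408000u) == 0x0041u, "hi rounds up");
static_assert(lo16(0x00408000u) == -0x8000, "lo is sign-extended");
static_assert(materialize(0x00408000u) == 0x00408000u, "round trip");
```

The static_assert lines double as a worked example: for the address 0x00408000 the two halves are 0x0041 and -0x8000, and they add back up to the original address.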
-mno-shared for o32/n32 non-PIC
For the o32 ABI, GCC-generated assembly normally uses the .cpload $25 pseudo-op at function entry to set up $gp.

```asm
lui $gp,%hi(_gp_disp)
addiu $gp,$gp,%lo(_gp_disp)
addu $gp,$gp,.cpload argument
```

For -fno-pic code, we can replace the three instructions with two using -mno-shared:

```asm
lui $28,%hi(__gnu_local_gp)
addiu $28,$28,%lo(__gnu_local_gp)
```

__gnu_local_gp is defined by the linker.
In addition, for a function call that is known to be defined in the executable, GCC generates j and jal instructions (prior to Release 6). An R_MIPS_26 relocation is required.
1 | void ext(); |
GCC 4.3 has enabled -mno-shared by default for non-PIC.
The n64 ABI does not have the optimization. Suppressing the j/jal optimization can prevent R_MIPS_26 overflows.
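To see why R_MIPS_26 can overflow: in the classic (non-microMIPS) encoding, j and jal carry a 26-bit word index, and the upper bits of the destination are taken from the address of the delay slot. The target therefore has to lie in the same 256 MiB-aligned region as the instruction following the jump. Here is a rough sketch of that reachability rule, with hypothetical helper names; real linkers may phrase the check differently.

```cpp
#include <cstdint>

// j/jal replace only the low 28 bits of the delay-slot address (pc + 4).
constexpr uint64_t kRegionMask = ~uint64_t{0x0fffffff};

// True if a jal at `pc` can reach `target` directly.
constexpr bool reachableByJal(uint64_t pc, uint64_t target) {
  return ((pc + 4) & kRegionMask) == (target & kRegionMask) &&
         target % 4 == 0;
}

// The 26-bit instr_index field that R_MIPS_26 fills in.
constexpr uint32_t encodeJalField(uint64_t target) {
  return static_cast<uint32_t>(target >> 2) & 0x03ffffffu;
}

static_assert(reachableByJal(0x00400000, 0x0fff0000), "same region");
static_assert(!reachableByJal(0x00400000, 0x10000000), "next region");
// The region is determined by the delay slot, not by the jal itself.
static_assert(reachableByJal(0x0ffffffc, 0x10000000), "delay slot decides");
static_assert(encodeJalField(0x00400010) == 0x00100004u, "word index");
```

A text segment that straddles a 256 MiB boundary, or a callee placed far from its caller, fails this check, which is the overflow mentioned above.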
-mno-abicalls
TODO
-mplt for o32/n32 non-PIC
MIPS did not have PLT at all until 2008. For a default visibility external linkage function, a direct call would require PLT, so GCC MIPS generates a GOT code sequence instead (like x86 -fno-plt).
However, the GOT code sequence is long and requires indirection via $25. For -fno-pic code where the callee is defined in the executable itself, this code sequence is rather inefficient.
Since 2008, -fno-pic -mplt instructs GCC to emit a direct call instead:

```asm
// For the code above,
// jal ext instead of
// lw $25,%call16(ext)($28); .reloc .,R_MIPS_JALR,ext; jalr $25
```

-mplt also works with n64 -msym32.
Technically the optimization applies to -fpie as well, but GCC does not implement it because there is no PIC PLT entry. The existing PLT entry utilizes absolute relocation types: lui $15, %hi(.got.plt entry); l[wd] $25, %lo(.got.plt entry)($15)
Unfortunately, -mplt conflates function calls and variable accesses. When -mplt is in effect, GCC generates an absolute relocation accessing external data, which may lead to copy relocations if the data ends up defined by a shared object.
-mno-explicit-relocs
Don't use assembler relocation operators such as %hi, %lo, %call16, %got_disp. Instead, use assembler macros.
-mexplicit-relocs (default) facilitates instruction reordering.
-msym32 for n64
Assume that all symbols have 32-bit values, regardless of the selected ABI. This is similar to -mcmodel=small for other architectures.
When -msym32 is used with the n64 ABI, the -mplt optimization will apply.
There are many special sections. Let me just quote a comment from binutils/readelf.c:process_mips_specific:

```c
/* We have a lot of special sections.  Thanks SGI! */
```

There are many sections from other vendors. Anyway, this remark sets up the mood.
.reginfo section
GNU assembler creates this section to hold an Elf32_RegInfo object for o32 and n32 ABIs.
1 | /* A section of type SHT_MIPS_REGINFO contains the following |
.MIPS.options section
The n64 ABI uses .MIPS.options instead of .reginfo.
.MIPS.abiflags section
In 2014, .MIPS.abiflags (type SHT_MIPS_ABIFLAGS) and PT_MIPS_ABIFLAGS were introduced.
1 | A new implicitly generated section will be present on all new modules. The section contains a versioned data structure which represents essential information to allow a program loader to determine the requirements of the application. ELF e_flags currently contain some of this information but space is limited and e_flags are not available after loading an application. |
GNU ld has a quite involved merging strategy for this section.
.option pic0 and .option pic2 directives
TODO
While the ABIs use REL for dynamic relocations, the o32 ABI also uses REL for relocatable files while n64 uses RELA. The little-endian n64 ABI messed up the r_info field in relocations.
The n64 ABI can pack up to 3 relocations with the same offset into one record (r_type, r_type2, r_type3). This is primarily for $gp setup: R_MIPS_GPREL16 + R_MIPS_SUB + R_MIPS_{HI,LO}16.
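To make the r_info remark concrete, here is a sketch of the field order, assuming the MIPS64 ELF layout in which r_info is the sequence r_sym (32 bits), r_ssym, r_type3, r_type2, r_type (8 bits each) rather than a single 64-bit integer. In a little-endian file, a plain 64-bit load therefore finds r_sym in the low half and r_type in the top byte, the opposite of what the generic ELF64_R_SYM/ELF64_R_TYPE macros expect. The struct and function names below are hypothetical; the relocation type values (R_MIPS_HI16 = 5, R_MIPS_GPREL16 = 7, R_MIPS_SUB = 24) are the usual ELF constants.

```cpp
#include <cstdint>

struct Mips64RelInfo {
  uint32_t sym;
  uint8_t ssym, type3, type2, type;
};

// Decode r_info as obtained by a 64-bit little-endian load.
constexpr Mips64RelInfo decodeLittleEndianInfo(uint64_t raw) {
  return {static_cast<uint32_t>(raw), static_cast<uint8_t>(raw >> 32),
          static_cast<uint8_t>(raw >> 40), static_cast<uint8_t>(raw >> 48),
          static_cast<uint8_t>(raw >> 56)};
}

// Rearrange into the conventional form: symbol index in the high 32 bits,
// then ssym, type3, type2, type from high to low.
constexpr uint64_t toCanonicalInfo(uint64_t raw) {
  return (raw << 32) | ((raw >> 8) & 0xff000000) |
         ((raw >> 24) & 0x00ff0000) | ((raw >> 40) & 0x0000ff00) |
         ((raw >> 56) & 0x000000ff);
}

// Symbol 5 with the packed triple R_MIPS_GPREL16, R_MIPS_SUB, R_MIPS_HI16.
constexpr uint64_t kRaw = 0x0718050000000005;
static_assert(decodeLittleEndianInfo(kRaw).sym == 5, "symbol index");
static_assert(decodeLittleEndianInfo(kRaw).type == 7, "R_MIPS_GPREL16");
static_assert(decodeLittleEndianInfo(kRaw).type2 == 24, "R_MIPS_SUB");
static_assert(decodeLittleEndianInfo(kRaw).type3 == 5, "R_MIPS_HI16");
static_assert(toCanonicalInfo(kRaw) == 0x0000000500051807, "canonical");
```

Tools that read little-endian n64 objects need a conversion of this kind before they can apply the usual symbol/type extraction.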
Certain relocation types, such as R_MIPS_GOT16/R_MIPS_GPREL16, have different calculation when the referenced symbol is local or external. This is a bad design.
When referencing a local symbol, the n32/n64 ABIs replace a R_MIPS_GOT16/R_MIPS_LO16 pair with a R_MIPS_GOT_PAGE/R_MIPS_GOT_OFST pair.
R_MIPS_PC64
This relocation type does not exist, but we can emulate a 64-bit PC-relative relocation with R_MIPS_PC32 + R_MIPS_64 + R_MIPS_NONE. LLVM implemented this relocation in https://reviews.llvm.org/D80390 while binutils
1 | .globl foo |
TODO
Here is an incomplete list.
FreeBSD 14 discontinued MIPS. MIPS code was removed
The entry point for an executable or shared object is __start instead of _start.
E_MIPS_*
macros are
The following script demonstrates that we can use a GCC cross compiler configured with 32-bit MIPS to build 64-bit object files.
1 |
|
When an option is checked with APIs such as hasArg and getLastArg, its "claimed" bit is set.

```cpp
template<typename ...OptSpecifiers>
Arg *getLastArg(OptSpecifiers ...Ids) const {
  Arg *Res = nullptr;
  for (Arg *A : filtered(Ids...)) {
    Res = A;
    Res->claim();
  }
  return Res;
}
```
After all options are processed, Clang reports a -Wunused-command-line-argument diagnostic for each unclaimed option. There are multiple possible messages, but "argument unused during compilation" is the most common one.

```
% clang -c a.c -la
clang: warning: -la: 'linker' input unused [-Wunused-command-line-argument]
% clang -c -Werror a.c -la
clang: error: -la: 'linker' input unused [-Werror,-Wunused-command-line-argument]
% clang -c -Werror=unused-command-line-argument a.c -la
clang: error: -la: 'linker' input unused [-Werror,-Wunused-command-line-argument]
```
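The claim mechanism can be modeled in a few lines. This is a toy sketch, not clangDriver code: OptID, ToyArg, and the helper functions are hypothetical stand-ins for the option IDs, llvm::opt::Arg, and ArgList. It mimics a compile-only invocation like the one above (with -fpic added for illustration): the compile action queries -c and -fpic, nothing queries -l, so one argument stays unclaimed and would be diagnosed.

```cpp
#include <cstddef>

enum OptID { OPT_c, OPT_fpic, OPT_l };

// Hypothetical stand-in for llvm::opt::Arg: an option ID plus a claimed bit.
struct ToyArg {
  OptID id;
  bool claimed;
};

// Toy analogue of getLastArg: every match is claimed, the last one wins.
template <std::size_t N>
constexpr ToyArg *getLastArg(ToyArg (&args)[N], OptID id) {
  ToyArg *res = nullptr;
  for (std::size_t i = 0; i != N; ++i) {
    if (args[i].id == id) {
      res = &args[i];
      res->claimed = true;
    }
  }
  return res;
}

// Number of arguments that would each get an unused-argument diagnostic.
template <std::size_t N>
constexpr std::size_t countUnclaimed(const ToyArg (&args)[N]) {
  std::size_t n = 0;
  for (std::size_t i = 0; i != N; ++i)
    if (!args[i].claimed)
      ++n;
  return n;
}

// Models a compile-only job: -c and -fpic are queried, -l is not.
constexpr std::size_t unclaimedAfterCompileOnly() {
  ToyArg args[] = {{OPT_c, false}, {OPT_fpic, false}, {OPT_l, false}};
  getLastArg(args, OPT_c);
  getLastArg(args, OPT_fpic);
  return countUnclaimed(args);
}

static_assert(unclaimedAfterCompileOnly() == 1, "only -l stays unclaimed");
```

The point of the model is that the diagnostic is a by-product of which code paths happened to query an option, which is why the heuristics discussed next are needed.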
There are many heuristics to enhance the desirability of -Wunused-command-line-argument, which can be rather subjective. For instance, options that are relevant only during compilation do not result in -Wunused-command-line-argument diagnostics when linking is performed. This is necessary to support linking actions that use CFLAGS or CXXFLAGS.

```
% clang -faddrsig -fpic -march=generic a.o
```
In clangDriver, there are many job actions, e.g. preprocess, precompile, compile, backend, assemble, link. For most actions (preprocess/precompile/compile/backend/etc), clang::driver::tools::Clang is selected. Clang::ConstructJob is where most compilation only options are handled. Target-specific options are handled by Clang's member function RenderTargetOptions.
For assembly files that do not need preprocessing (e.g. .s (clang::driver::types::TY_PP_Asm)), the driver selects clang::driver::tools::ClangAs to use the integrated assembler. ClangAs::ConstructJob claims very few options.
For a link job, oftentimes clang::driver::tools::gcc::Linker is selected. Its ConstructJob forwards Link_Group and LinkOption options to the linker, and processes -Wa, options.
Assembling a .s file uses ClangAs. As mentioned, it claims very few options and we can get many -Wunused-command-line-argument diagnostics.

```
% gcc -S -fpic -mfpmath=sse a.c
% gcc -c -fpic -mfpmath=sse a.s
% clang -c -fpic -mfpmath=sse a.s
clang: warning: argument unused during compilation: '-mfpmath=sse' [-Wunused-command-line-argument]
```

Preprocessing and assembling a .S file selects Clang. We will get behaviors similar to compiling a .c file: compilation only options do not lead to diagnostics.

```
% clang -c -fpic -mfpmath=sse a.S
```
There is a tension between -Wunused-command-line-argument and default options. Let's consider a scenario where we specify --rtlib=compiler-rt --unwindlib=libunwind in CFLAGS and CXXFLAGS to utilize compiler-rt and LLVM libunwind. clangDriver claims --rtlib and --unwindlib in the following code snippet:

```cpp
if (!Args.hasArg(options::OPT_nostdlib, options::OPT_r)) {
  if (!Args.hasArg(options::OPT_nodefaultlibs)) {
    // Handle --rtlib and --unwindlib.
  }
  if (!Args.hasArg(options::OPT_nostartfiles)) {
    // Handle --rtlib. Append clang_rt.crtend.o or GCC style crtend{,S,T}.o
  }
}
```
However, if a build target employs -nostdlib or -nodefaultlibs, options such as --rtlib, --unwindlib, and many other linker options (e.g. -static-libstdc++ and -pthread) will not be claimed, resulting in unused argument diagnostics:

```
% clang --rtlib=compiler-rt --unwindlib=libunwind -stdlib=libstdc++ -static-libstdc++ -pthread -nostdlib a.o
clang: warning: argument unused during compilation: '--rtlib=compiler-rt' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '--unwindlib=libunwind' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-static-libstdc++' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
```

While some options like -stdlib= do not trigger a diagnostic, this seems more like happenstance than a deliberate design choice.
To suppress the diagnostics, we can use --start-no-unused-arguments and --end-no-unused-arguments (Clang 14) like the following:

```
% clang --start-no-unused-arguments --rtlib=compiler-rt --unwindlib=libunwind -stdlib=libstdc++ -static-libstdc++ -pthread --end-no-unused-arguments -nostdlib a.o
```

There is also a heavy hammer -Qunused-arguments to suppress -Wunused-command-line-argument diagnostics regardless of options' positions:

```
% clang -Qunused-arguments --rtlib=compiler-rt --unwindlib=libunwind -stdlib=libstdc++ -static-libstdc++ -pthread -nostdlib a.o
```

-Qunused-arguments is similar to -Wno-unused-command-line-argument.
Conveniently, options specified in a Clang configuration file areautomatically claimed.
1 | cat >a.cfg <<e |
In the last command, I specify -no-canonical-prefixes so that clang will find dirname(clang)/clang.cfg, otherwise Clang would try to find dirname(realname(clang))/clang.cfg.
GCC has -m options. These options are often referred to as "machine-specific" or sometimes as "target-specific". I am not certain whether there are any distinctions in these terms within the context of GCC. In Clang, we prefer the term "target-specific".
For instance, certain targets implement -mtls-dialect=. This is achieved through files like gcc/config/aarch64/aarch64.opt. If we use such options on unsupported targets, we will encounter an error:
1 | % aarch64-linux-gnu-gcc -c -mtls-dialect=desc a.c |
However, Clang employs a unified table named clang/include/clang/Driver/Options.td for all options, thereby eliminating the need for maintaining target-specific lists. Historically, an unsupported option led to a -Wunused-command-line-argument warning (an error if -Werror is used):

```
% clang-16 -c --target=aarch64 -mavx a.c
clang-16: warning: argument unused during compilation: '-mavx' [-Wunused-command-line-argument]
```
For Clang 17, I have added a TargetSpecific flag in clang/include/clang/Driver/Options.td and annotated numerous options accordingly. If an option is annotated as TargetSpecific, the -Wunused-command-line-argument diagnostic will be elevated to an error:

```
% clang -c --target=aarch64 -mavx a.c
clang: error: unsupported option '-mavx' for target 'aarch64'
```
For -x assembler input, clangDriver

```
% clang -c -mfpmath=sse a.s
clang: warning: argument unused during compilation: '-mfpmath=sse' [-Wunused-command-line-argument]
```
I have mentioned that
Options that are relevant only during compilation do not result in -Wunused-command-line-argument diagnostics when linking is performed.
They also don't lead to "unsupported option" errors when linking is performed. Certain driver options that only affect linking are marked as LinkerInput in Clang.
1 | % clang --target=aarch64 -mavx a.c -fdriver-only # should report an error but not |