This article describes target-specific details about AArch64 in ELF linkers. AArch64 is the 64-bit execution state for the Arm architecture. The AArch64 execution state runs the A64 instruction set. The AArch32 and AArch64 execution states use very different instruction sets, so many pieces of software use two ports for the two execution states of the Arm architecture.
Historically, both the architecture and its instruction set were called "ARM", which led many software projects to use "ARM" or "arm" as their port names. In 2011, ARMv8 introduced two execution states, AArch32 and AArch64. The previous instruction sets "ARM" and "Thumb" were renamed to "A32" and "T32", respectively. In 2017, the architecture was renamed to the "Arm architecture" to reflect the rebranding of the company name. So, the "ARMv8-A" architecture profile is now named "Armv8-A".
For the AArch64 execution state, while many projects use "AArch64" as their port name, for legacy reasons, macOS, Windows, the Linux kernel, and some BSD operating systems unfortunately use "arm64". (Support for AArch64 was added to the Linux kernel in version 3.7. Initially, the patch set was named "aarch64", but it was later changed at the request of kernel developers.)
ABI documents
- ELF for the Arm® 64-bit Architecture (AArch64)
- System V ABI for the Arm® 64-bit Architecture (AArch64)
Global Offset Table
The Global Offset Table consists of two sections:
- .got.plt holds code addresses for PLT.
- .got holds other addresses and offsets.
The symbol _GLOBAL_OFFSET_TABLE_ is defined at the beginning of the .got section. GNU ld reserves a single entry for .got and .got[0] holds the link-time address of _DYNAMIC for a legacy reason. Versions of glibc prior to 2.35 have the _DYNAMIC requirement. See All about Global Offset Table.
.got.plt[1] and .got.plt[2] are for lazy binding PLT. Linkers communicate the address of .got.plt to rtld with the dynamic tag DT_PLTGOT.
Procedure Linkage Table
The registers x16 (IP0) and x17 (IP1) are the first and second intra-procedure-call temporary registers. They may be used by PLT entries and veneers.
The PLT header looks like:

    bti c // If BTI
    stp x16, x30, [sp,#-16]!
    adrp x16, &.got.plt[2]
    ldr x17, [x16, :lo12: &.got.plt[2]]
    add x16, x16, :lo12: &.got.plt[2]
    br x17
The Nth PLT entry looks like:

    bti c // If BTI
    adrp x16, &.got.plt[N + 3]
    ldr x17, [x16, :lo12: &.got.plt[N + 3]]
    add x16, x16, :lo12: &.got.plt[N + 3]
    autia1716 // If PAC-PLT
    br x17
When BTI is enabled for the output file, the code sequence starts with bti c. When PAC-PLT is enabled, the code sequence includes autia1716 before br x17.
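The adrp / :lo12: pairing used by the PLT entries can be illustrated with a small Python sketch. It assumes 8-byte .got.plt entries and a 0-based N; the function names and the sample addresses are made up for illustration and do not come from any linker.

```python
GOT_ENTRY_SIZE = 8  # assumption: each .got.plt slot holds one 64-bit address


def got_plt_slot(got_plt_base, n):
    """Address of .got.plt[N + 3], the slot loaded by the Nth PLT entry."""
    return got_plt_base + (n + 3) * GOT_ENTRY_SIZE


def adrp_lo12(pc, target):
    """Split target into the adrp page delta and the :lo12: low 12 bits."""
    page_delta = (target >> 12) - (pc >> 12)  # counted in 4KiB pages
    lo12 = target & 0xFFF
    return page_delta, lo12


def resolve(pc, page_delta, lo12):
    """Value of x16 after `adrp x16, ...` then `add x16, x16, :lo12: ...`."""
    return (((pc >> 12) + page_delta) << 12) + lo12


# Hypothetical layout: .got.plt at 0x30000, PLT entry code at 0x10f20.
slot = got_plt_slot(0x30000, 0)
delta, lo = adrp_lo12(0x10F20, slot)
print(hex(slot), hex(delta), hex(lo))
```

adrp only materializes the 4KiB page of the slot, which is why both the ldr and the add need the :lo12: part of the same address.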
Relocation optimization
See the "GOT optimization" section of All about Global Offset Table for GOT optimization.
There are a few optimization schemes besides GOT optimization, e.g.

    add x2, x2, 0 // R_<CLS>_ADD_ABS_LO12_NC

    adrp x0, symbol

--no-relax disables the optimization.
See the "Relocation optimization" section of ELF for the Arm® 64-bit Architecture (AArch64).
Thread Local Storage
AArch64 uses a variant of TLS Variant I: the static TLS blocks are placed above the thread pointer. The thread pointer points to the end of the thread control block.
The linker performs TLS optimization.
The traditional general dynamic and local dynamic TLS models are obsolete and not supported by ld.lld.
See All about thread-local storage.
Program Property
A .note.gnu.property section contains program property notes that describe special handling requirements for the linker and the dynamic loader.
The linker parses input .note.gnu.property sections and recognizes command line options -z force-bti and -z pac-plt to compute the output .note.gnu.property (type is SHT_NOTE) section.
Without these options, linkers only set the feature bit in the output file if all the input relocatable object files have the corresponding feature set.
    for (ELFFileBase *f : ctx.objectFiles) {
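The "all inputs must agree" rule amounts to a bitwise AND over the feature words of the inputs. The following Python sketch models only that rule; it is not lld's code. The bit values follow GNU_PROPERTY_AARCH64_FEATURE_1_BTI and GNU_PROPERTY_AARCH64_FEATURE_1_PAC, while the function name and the treatment of -z force-bti as a plain OR are simplifying assumptions.

```python
from functools import reduce

# Bits of the GNU_PROPERTY_AARCH64_FEATURE_1_AND property.
FEATURE_1_BTI = 1 << 0
FEATURE_1_PAC = 1 << 1


def output_features(input_features, force_bti=False):
    """AND together the feature words of all relocatable inputs.

    input_features holds one integer per input object file; 0 stands for a
    file without any feature bits. force_bti models `-z force-bti` by
    setting the BTI bit regardless of the inputs (an assumption; real
    linkers may also warn about inputs that lack the bit).
    """
    if input_features:
        features = reduce(lambda a, b: a & b, input_features)
    else:
        features = 0
    if force_bti:
        features |= FEATURE_1_BTI
    return features


both = FEATURE_1_BTI | FEATURE_1_PAC
print(output_features([both, both]))           # every input has BTI and PAC
print(output_features([both, FEATURE_1_PAC]))  # one input lacks BTI
```

A single input without the bit is enough to clear it in the output, which is why one legacy object file can silently disable BTI for a whole executable.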
Range extension thunks
Function calls typically use B and BL instructions. The two instructions have a range of +/-128MiB and may use 2 relocation types: R_AARCH64_CALL26 and R_AARCH64_JUMP26. The range is larger than the branch range for many other instruction sets. If the destination is not reachable by a single B/BL, linkers may insert a veneer (range extension thunk).
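The reachability test behind this decision is a signed-range check on the branch offset. Here is a minimal Python sketch, assuming 4-byte-aligned addresses and ignoring relocation addends; the function names are illustrative only.

```python
# B/BL encode a signed 26-bit immediate scaled by 4, giving +/-128MiB.
BRANCH26_MIN = -(1 << 27)
BRANCH26_MAX = (1 << 27) - 4  # largest encodable forward offset


def in_branch26_range(src, dst):
    """True if a B/BL at address src can reach dst directly."""
    offset = dst - src
    return BRANCH26_MIN <= offset <= BRANCH26_MAX


def needs_thunk(src, dst):
    """True if the linker would have to route the call through a veneer."""
    return not in_branch26_range(src, dst)


print(needs_thunk(0x1000, 0x1000 + (1 << 27) - 4))  # still reachable
print(needs_thunk(0x1000, 0x1000 + (1 << 27)))      # one step too far
```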
-no-pie links may use a thunk with absolute addressing targeting any location in the 64-bit address space.

    <caller>:
      bl __AArch64AbsLongThunk_nonpreemptible
      b __AArch64AbsLongThunk_nonpreemptible

    <__AArch64AbsLongThunk_nonpreemptible>:
      ldr x16, .+8
      br x16

    <$d>:
      .word 0x00000000
      .word 0x00000010

    <.plt>:
-pie and -shared links need to use a thunk with PC-relative addressing targeting a range of +/-4GiB.

    <caller>:
      bl __AArch64ADRPThunk_nonpreemptible
      b __AArch64ADRPThunk_nonpreemptible

    <__AArch64ADRPThunk_nonpreemptible>:
      adrp x16, nonpreemptible
      add x16, x16, :lo12: nonpreemptible
      br x16
The branch target of a thunk may be a PLT entry:

    <caller>:
      bl __AArch64ADRPThunk_preemptible

    <__AArch64ADRPThunk_preemptible>:
      adrp x16, preemptible@plt
      add x16, x16, :lo12: preemptible@plt
      br x16

    ...

    <preemptible@plt>:
      adrp x16, &.got.plt[N + 3]
      ldr x17, [x16, :lo12: &.got.plt[N + 3]]
      add x16, x16, :lo12: &.got.plt[N + 3]
      br x17
--fix-cortex-a53-843419
This option enables a linker workaround for Arm Cortex-A53 erratum 843419. Full details are available in the ARM-EPM-048406 document.
Linkers scan for an adrp in the last two instructions of a 4KiB page, followed by a load or store instruction and two other instructions. Once an erratum condition is detected, linkers try to rewrite it into an alternative code sequence. See the comments in the implementations for details.
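The first stage of such a scan, finding adrp instructions in the last two instruction slots of a 4KiB page, can be sketched in Python as follows. This covers only the position check, not the full erratum condition (the following load/store and its register dependencies are not examined), and the function names are hypothetical rather than taken from any linker. It assumes a 4-byte-aligned base address and little-endian instruction encoding.

```python
import struct

PAGE_SIZE = 0x1000


def is_adrp(insn):
    """ADRP has bit 31 set and bits 28..24 equal to 0b10000."""
    return (insn & 0x9F000000) == 0x90000000


def adrp_candidates(code, base):
    """Return addresses of adrp instructions at page offset 0xff8 or 0xffc.

    code is the raw bytes of an executable section loaded at address base.
    """
    result = []
    for off in range(0, len(code) - 3, 4):
        addr = base + off
        if addr % PAGE_SIZE not in (0xFF8, 0xFFC):
            continue
        (insn,) = struct.unpack_from("<I", code, off)
        if is_adrp(insn):
            result.append(addr)
    return result


NOP = 0xD503201F
ADRP_X16 = 0x90000010  # adrp x16, <current page>
code = struct.pack("<4I", NOP, ADRP_X16, NOP, ADRP_X16)
print([hex(a) for a in adrp_candidates(code, 0xFF4)])
```

Only the adrp at page offset 0xff8 is reported in this example; the second adrp lands at offset 0 of the next page and is therefore not a candidate.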
In ld.lld this is implemented as a thunk, similar to a range extension thunk. ld.lld additionally sets a workaround when relocating R_AARCH64_JUMP26.
Small code model
On x86-64, symbols in the small code model are required to be located in the range [0, 2**31 − 2**24).
The AArch64 small code model allows for a maximum text segment size of 2GiB and a maximum combined span of text and data segments of 4GiB. For small position-independent code (pic), there is an additional restriction on the size of the Global Offset Table (GOT), which must be smaller than 32KiB. The maximum combined span of text and data segments is larger than that of x86-64.
Linked image sizes for AArch64 and x86-64 are comparable, but AArch64 linked images are more resistant to relocation overflows.
There are several types of relocation overflows that we need to pay attention to:
- .text <-> .rodata
- .text <-> .eh_frame (.eh_frame has 32-bit offsets)
- .text <-> .bss
- .rodata <-> .bss
In many programs, .text <-> .data/.bss relocations have the tightest constraints. Overflows due to .text <-> .rodata relocations are possible but rare (I have seen such issues in the past). .rodata is usually larger than .data+.bss. .rodata <-> .bss overflows usually do not occur, but metadata needs to be careful to use .quad label-. instead of .long label-. in such references. Such issues can be trivially fixed on the compiler side.
For .text <-> .rodata and .text <-> .bss references, x86-64 uses R_X86_64_REX_GOTPCRELX/R_X86_64_PC32 relocations, which have a small range [-2**31,2**31). In contrast, R_AARCH64_ADR_PREL_PG_HI21 on AArch64 has a doubled range [-2**32,2**32), making it unlikely that AArch64 will hit an issue before the binary becomes excessively oversized for x86-64.
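The difference between the two ranges can be made concrete with a short Python sketch. It is a simplification that ignores addends (and the -4 bias typical of x86-64 PC-relative operands), and the function names are invented for illustration.

```python
def fits_pc32(place, target):
    """R_X86_64_PC32-style check: signed 32-bit byte displacement."""
    return -(1 << 31) <= target - place < (1 << 31)


def fits_adrp(place, target):
    """R_AARCH64_ADR_PREL_PG_HI21-style check: signed 21-bit page delta.

    2**21 pages of 4KiB each give a reach of [-2**32, 2**32) bytes.
    """
    page_delta = (target >> 12) - (place >> 12)
    return -(1 << 20) <= page_delta < (1 << 20)


place = 0x1000
target = place + 3 * (1 << 30)  # data placed 3GiB away from the code
print(fits_pc32(place, target), fits_adrp(place, target))
```

A reference spanning 3GiB overflows the x86-64 relocation but is still representable on AArch64.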
--android-memtag-{mode,stack,heap}
The options instruct ld.lld to create DT_AARCH64_MEMTAG_* dynamic tags. See Memtag ABI Extension to ELF for the Arm® 64-bit Architecture (AArch64).