This article describes section layout and its interaction with dynamic loaders and huge pages.
Let's begin with a Linux x86-64 example involving global variables exhibiting various properties such as read-only versus writable, zero-initialized versus non-zero, and more.
% clang -c -fpie a.c
(We will discuss -Wl,-z,separate-loadable-segments
later.)
We can see that these functions and global variables are placed in different sections.
- .rodata: read-only data without dynamic relocations, constant in the link unit
- .text: functions
- .data.rel.ro: read-only data associated with dynamic relocations, constant after relocation resolving, part of the PT_GNU_RELRO segment
- .data: writable data
- .bss: writable data known to be zeros
Section and segment layout
TODO I may write more about how linkers lay out sections and segments.
Anyhow, the linker will place .data
and
.bss
in the same PT_LOAD
program header
(segment) and the rest into different PT_LOAD
segments.
(There are some nuances. If you use GNU ld's -z noseparate-code
or lld's --no-rosegment
,
.rodata
and .text
will be placed in the same
PT_LOAD
segment.)
The PT_LOAD
segments have different flags
(p_flags
): PF_R
, PF_R|PF_X
,
PF_R|PF_W
. Subsequently, the dynamic loader, also known as
the dynamic linker, will invoke mmap
to map the file into
memory. The memory areas (VMA) have different memory permissions
corresponding to segment flags.
For a PT_LOAD
segment, its associated memory area starts
at alignDown(p_vaddr, pagesize)
and ends at
alignUp(p_vaddr+p_memsz, pagesize)
.
Start Addr End Addr Size Offset Perms objfile
Let's assume the page size is 4096 bytes. We'll calculate the
alignDown(p_vaddr, pagesize)
values and display them
alongside the "Start Addr" values:

Start Addr          alignDown(p_vaddr, pagesize)
0x555555554000      0x0000000000000000
0x555555555000      0x0000000000001000
0x555555556000      0x0000000000002000
0x555555557000      0x0000000000003000
We observe that the start address equals the base address plus
alignDown(p_vaddr, pagesize)
.
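The start and end computation above can be sketched in C. This is only an illustration of the arithmetic, not code from any loader: the helper names (align_down, align_up, load_segment_vma) and the struct are hypothetical, and the page size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t x, uint64_t align) {
  return align_down(x + align - 1, align);
}

/* Hypothetical helper type: the page-granular memory area of one PT_LOAD. */
struct vma_range {
  uint64_t start;
  uint64_t end;
};

/* Memory area for a PT_LOAD segment loaded at the given base address. */
static inline struct vma_range load_segment_vma(uint64_t base, uint64_t p_vaddr,
                                                uint64_t p_memsz,
                                                uint64_t pagesize) {
  struct vma_range r;
  r.start = base + align_down(p_vaddr, pagesize);
  r.end = base + align_up(p_vaddr + p_memsz, pagesize);
  return r;
}
```

With the base address 0x555555554000 from the table and an illustrative segment with p_vaddr 0x1630 and p_memsz 0x180, load_segment_vma yields the area 0x555555555000 to 0x555555556000 for a 4096-byte page size.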
--no-rosegment
This option asks lld to combine the read-only and the RX segments. The output file will consume less address space at run-time.
Start Addr End Addr Size Offset Perms objfile
MAXPAGESIZE
A page serves as the granularity at which memory exhibits different
permissions, and within a page, we cannot have varying permissions.
Using the previous example where p_align
is 4096, if the
page size is larger, for example, 65536 bytes, the program might
crash.
Typically, the dynamic loader allocates memory for the first
PT_LOAD
segment (PF_R
) at a specific address
allocated by the kernel. Subsequent PT_LOAD
segments then
overwrite the previous memory regions. Consequently, certain code pages
or significant global variables might be replaced by garbage, leading to
a crash.
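To make the failure mode concrete, here is a minimal sketch that checks whether two adjacent segments end up sharing a page for a given runtime page size. The function name segments_share_page is hypothetical and the addresses in the usage note are illustrative, not taken from a real binary.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t x, uint64_t align) {
  return align_down(x + align - 1, align);
}

/* Returns 1 if the page-rounded end of the previous segment extends past the
   page-rounded start of the next segment, i.e. mapping the next segment
   would overwrite pages that belong to the previous one. */
static inline int segments_share_page(uint64_t prev_end, uint64_t next_start,
                                      uint64_t pagesize) {
  return align_up(prev_end, pagesize) > align_down(next_start, pagesize);
}
```

For a layout where one segment ends at 0x628 and the next starts at 0x1630, the check reports no overlap with 4096-byte pages but an overlap with 65536-byte pages.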
So, how can we create a link unit that works across different page
sizes? We simply determine the maximum page size, let's say, 2097152,
and then pass -z max-page-size=2097152
to the linker. The
linker will set p_align
values of PT_LOAD
segments to MAXPAGESIZE.
Program Headers:
In a linker script, the max-page-size
can be obtained
using CONSTANT(MAXPAGESIZE)
.
For completeness, if you need to run a prebuilt executable on a
system with a larger page size, you can modify the executable by merging
PT_LOAD
segments and combining their permissions. It's
likely there will be a sizable RWX PT_LOAD
segment,
reminiscent of OMAGIC.
Over-aligned segment
It is possible to increase the p_align
value of one
single PT_LOAD
segment using an aligned
attribute. When this value exceeds the page size, the question arises:
should the kernel loader or the dynamic loader determine a suitable base
address to meet this alignment requirement?
In 2020, the Linux kernel loader made the decision to align
the base address according to the maximum p_align. This
facilitates transparent huge pages for mapped files at the
expense of reduced address randomization.
% cat align.c
Should a userspace dynamic loader do the same? If it does, a variable with an alignment greater than the page size will indeed align accordingly. As of glibc 2.35, it has followed suit.
On the other hand, the traditional interpretation dictates that a variable with an alignment greater than the page size is invalid. Most other dynamic loaders do not implement this particular logic, which has some overhead.
-z separate-loadable-segments
In previous examples using
-z separate-loadable-segments
, the p_vaddr
values of PT_LOAD
segments are multiples of MAXPAGESIZE.
The generic ABI says "loadable process segments must have congruent
values for p_vaddr and p_offset, modulo the page size."
p_offset - This member gives the offset from the beginning of the file at which the first byte of the segment resides.
p_vaddr - This member gives the virtual address at which the first byte of the segment resides in memory.
This alignment requirement aligns with the mmap
documentation. For example, Linux man-pages specifies, "offset must be a
multiple of the page size as returned by sysconf(_SC_PAGE_SIZE)."
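The congruence requirement is easy to express as a check. The following is a small sketch with a hypothetical function name (is_congruent). It takes plain integers rather than an ELF program header structure to stay self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p_vaddr and p_offset are congruent modulo pagesize,
   which is what lets a loader mmap the segment directly from the file. */
static inline int is_congruent(uint64_t p_vaddr, uint64_t p_offset,
                               uint64_t pagesize) {
  assert(pagesize != 0);
  return p_vaddr % pagesize == p_offset % pagesize;
}
```

For example, an illustrative segment with p_vaddr 0x1630 and p_offset 0x630 satisfies the requirement for a 4096-byte page size but not for a 65536-byte one.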
The p_offset
values are also multiples of MAXPAGESIZE.
After laying out a PT_LOAD
segment, the linker must pad
the end by inserting zeros so that the next PT_LOAD
segment
starts at a multiple of MAXPAGESIZE.
However, the alignment padding is wasteful. Fortunately, we can link
a.o
using different MAXPAGESIZE and different alignment
settings:
-z noseparate-code, -z separate-code, and -z separate-loadable-segments.
clang -pie -fuse-ld=lld -Wl,-z,noseparate-code a.o -o a0.4096
% stat -c %s a0.4096 a0.65536 a0.2097152
We can derive two properties:
- Under one MAXPAGESIZE, we have size(noseparate-code) < size(separate-code) < size(separate-loadable-segments).
- For -z noseparate-code, increasing MAXPAGESIZE does not change the output size.
AArch64 and PowerPC64 have a default MAXPAGESIZE of 65536. Staying
with the -z noseparate-code
default ensures that they will
not experience unnecessary size increase.
-z noseparate-code
How does -z noseparate-code
work? Let's illustrate this
with an example.
At the end of the read-only PT_LOAD
segment, the address
is 0x628. Instead of starting the next segment at
alignUp(0x628, MAXPAGESIZE) = 0x1000
, we start at
alignUp(0x628, MAXPAGESIZE) + 0x628 % MAXPAGESIZE = 0x1628
.
Since the .text
section has an alignment
(sh_addralign
) of 16, we start at 0x1630. Although the
address is advanced beyond necessity, the file offset (congruent to the
address, modulo MAXPAGESIZE) can be decreased to 0x630, merely 8 bytes
(due to alignment padding) after the previous section's end.
Moving forward, the end of the executable PT_LOAD
segment has an address of 0x17b0. Instead of starting the next segment
at alignUp(0x17b0, MAXPAGESIZE) = 0x2000
, we start at
alignUp(0x17b0, MAXPAGESIZE) + 0x17b0 % MAXPAGESIZE = 0x27b0
.
While we advance the address more than needed, the file offset can be
decreased to 0x7b0, precisely at the previous section's end.
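The address and offset arithmetic in this example can be sketched as follows. This mirrors the calculation described in the text and is not the linker's actual code. The function names next_segment_addr and next_segment_offset are hypothetical, and MAXPAGESIZE and the section alignment are assumed to be powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t x, uint64_t align) {
  return align_down(x + align - 1, align);
}

/* Address of the next PT_LOAD: advance to the next MAXPAGESIZE boundary,
   keep the offset within the page, then honor the section alignment. */
static inline uint64_t next_segment_addr(uint64_t prev_end,
                                         uint64_t maxpagesize,
                                         uint64_t sh_addralign) {
  uint64_t addr = align_up(prev_end, maxpagesize) + prev_end % maxpagesize;
  return align_up(addr, sh_addralign);
}

/* Smallest file offset >= prev_end_offset that is congruent to addr
   modulo MAXPAGESIZE. */
static inline uint64_t next_segment_offset(uint64_t prev_end_offset,
                                           uint64_t addr,
                                           uint64_t maxpagesize) {
  uint64_t off = align_down(prev_end_offset, maxpagesize) + addr % maxpagesize;
  if (off < prev_end_offset)
    off += maxpagesize;
  return off;
}
```

With MAXPAGESIZE 0x1000, a previous end of 0x628 and an alignment of 16 give address 0x1630 and file offset 0x630, and a previous end of 0x17b0 (file offset 0x7b0) gives address 0x27b0 and file offset 0x7b0, matching the numbers above.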
% readelf -WSl a0.4096
-z separate-code
performs the trick when transitioning from
the first RW PT_LOAD
segment to the second, whereas
-z separate-loadable-segments
doesn't.
When MAXPAGESIZE is larger than the actual page size
Let's consider two adjacent PT_LOAD
segments. The
memory area associated with the first segment ends at
alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize)
while
the memory area associated with the second one starts at
alignDown(load[i+1].p_vaddr, pagesize)
. When the actual
page size equals MAXPAGESIZE, the two addresses are identical. However,
if the actual page size is smaller, a gap emerges between these
addresses.
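The size of such a gap follows directly from the two expressions. Here is a minimal sketch, assuming power-of-two page sizes. The function name segment_gap and the segment values in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t x, uint64_t align) {
  return align_down(x + align - 1, align);
}

/* Number of bytes between the memory area of segment i and the memory area
   of segment i+1 for the given runtime page size (0 if they touch). */
static inline uint64_t segment_gap(uint64_t prev_vaddr, uint64_t prev_memsz,
                                   uint64_t next_vaddr, uint64_t pagesize) {
  uint64_t prev_end = align_up(prev_vaddr + prev_memsz, pagesize);
  uint64_t next_start = align_down(next_vaddr, pagesize);
  return next_start > prev_end ? next_start - prev_end : 0;
}
```

For an illustrative pair of segments laid out for a 2 MiB MAXPAGESIZE (the first covering 0 to 0x628, the second starting at 0x200000), the gap is 0x1ff000 bytes with 4096-byte pages and zero when the page size equals MAXPAGESIZE.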
A typical link unit generally presents three gaps. These gaps might
either be unmapped or mapped. When mapped, they necessitate
struct vm_area_struct
objects within the Linux kernel. As
of Linux 6.3.13, the size of struct vm_area_struct
is 152
bytes. For instance, 10000 mapped object files would require
10000 * 3 * sizeof(struct vm_area_struct) = 4,560,000 bytes
,
signifying a considerable memory footprint. You can refer to Extra
struct vm_area_struct with ---p created when PAGE_SIZE <
max-page-size.
Dynamic loaders typically invoke mmap
using
PROT_READ
, encompassing the whole file, followed by
multiple mmap
calls using MAP_FIXED
and the
corresponding flags. When dynamic loaders, like musl, don't process
gaps, the gaps retain r--p
permissions. However, in glibc's
elf/dl-map-segments.h
, the has_holes
code
employs mprotect
to transition permissions from
r--p
to ---p
.
While ---p
might be perceived as a security enhancement,
personally, I don't believe it significantly impacts exploitability.
While there might be numerous gadgets in r-xp
areas,
reducing gadgets in r--p
areas doesn't seem notably
impactful. (https://isopenbsdsecu.re/mitigations/rop_removal/)
Unmap the gap
When the Linux kernel loads the executable and its interpreter (if
present) (fs/binfmt_elf.c
), the gap gets unmapped, thereby
freeing a struct vm_area_struct
object. Implementing a
similar approach in dynamic loaders could yield comparable savings.
However, unmapping the gap carries the risk of an unrelated future
mmap
occupying the gap:
564d8e90f000-564d8e910000 r--p 00000000 08:05 2519504 /sample/build/main
It is not clear whether the potential occurrence of an unrelated mmap
is considered a regression in security. Personally, I don't think this
poses a significant issue as the program does not access the gaps. This
property can be guaranteed for direct access when input relocations to
the linker use symbols with in-bounds addends (e.g. when x is defined
relative to an input section, we know R_X86_64_PC32(x)
must
be in-bounds).
However, some programs may expect contiguous mapped areas of a file
(such as when glibc link_map::l_contiguous
is set to 1).
Does this choice render the program exploitable if an attacker can
ensure a map within the gap instead of outside the file? It seems to me
that they could achieve everything with a map outside of the file.
Having said that, the presence of an unrelated map between maps associated with a single file descriptor remains odd, so it's preferable to avoid it if possible.
Extend the memory area to cover the gap
This appears to be the best solution.
When creating a memory area, instead of setting the end to
alignUp(load[i].p_vaddr+load[i].p_memsz, pagesize)
, we can
extend the end to
min(alignDown(min(load[i+1].p_vaddr), pagesize), alignUp(file_end_addr, pagesize))
.
564d8e90f000-564d8e91f000 r--p 00000000 08:05 2519504 /sample/build/main (the end is extended)
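The extended end can be sketched as a small helper. This is only an illustration of the formula above, not loader code. The name extended_vma_end is hypothetical, and file_end_addr is assumed to be the address that corresponds to the end of the file.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align (align must be a power of two). */
static inline uint64_t align_down(uint64_t x, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static inline uint64_t align_up(uint64_t x, uint64_t align) {
  return align_down(x + align - 1, align);
}

/* End of the memory area for segment i when it is extended to cover the gap:
   stop at the start of the next segment's area, but never go past the
   page-rounded end of the file. */
static inline uint64_t extended_vma_end(uint64_t next_vaddr,
                                        uint64_t file_end_addr,
                                        uint64_t pagesize) {
  uint64_t next_start = align_down(next_vaddr, pagesize);
  uint64_t file_end = align_up(file_end_addr, pagesize);
  return next_start < file_end ? next_start : file_end;
}
```

With illustrative inputs where the next segment starts at 0x564d8e91f7b0 and the file extends well beyond it, the result is 0x564d8e91f000, the extended end shown in the listing above.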
For the last PT_LOAD
segment, we could also just use
alignDown(min(load[i+1].p_vaddr), pagesize)
and ignore
alignUp(file_end_addr, pagesize)
. Accessing a byte beyond
the backing file will result in a SIGBUS
signal.
A new linker option?
Personally I favor the area end extending approach. I've also pondered whether this falls under the purview of linkers. Such a change seems intrusive and unsightly. If the linker extends the end of p_memsz to cover the gap, should it also extend p_filesz?
- If it doesn't, we create a PT_LOAD with p_filesz/p_memsz that is not for BSS, which is weird.
- If it does, we have an output file featuring overlapping file offset ranges, which is weird as well.
Moreover, a PT_LOAD whose end isn't backed by a section is unusual. I'm concerned that many binary manipulation tools may not handle this case correctly. Utilizing a linker script can intentionally create discontiguous address ranges. I'm concerned that the linker might not discern such cases with intelligent logic regarding p_filesz/p_memsz.
This feature request seems to be within the realm of loaders and specific information, such as the page size, is only accessible to loaders. I believe loaders are better equipped to handle this task.
Transparent huge pages for mapped files
Some programs optimize their usage of the limited Translation
Lookaside Buffer (TLB) by employing transparent huge pages. When the
Linux kernel loads an executable, it takes into account the
p_align
field to create a memory area. If
p_align
is 4096, the memory area will commence at a
multiple of 4096, but not necessarily at a multiple of a huge page.
Transparent huge pages for mapped files have several requirements including:
- the memory area's start address and the start file offset align with a huge page (include/linux/huge_mm.h:transhuge_vma_suitable).
- CONFIG_READ_ONLY_THP_FOR_FS is enabled (scripts/config -e TRANSPARENT_HUGEPAGE -e TRANSPARENT_HUGEPAGE_MADVISE -e READ_ONLY_THP_FOR_FS)
- the VMA has the VM_EXEC flag (I removed this condition for v6.8)
- the file is not opened for write
When madvise(addr, len, MADV_HUGEPAGE)
is called, the
kernel code path is
do_madvise -> madvise_vma_behavior -> hugepage_madvise -> khugepaged_enter_vma -> thp_vma_allowable_order+__khugepaged_enter
.
To ensure that addr-fileoff
is a multiple of a huge
page, we should link the executable using -z max-page-size=
with the huge page size.
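The alignment condition can be written as a one-line check. This is a sketch of the condition as stated here, not the kernel's transhuge_vma_suitable implementation. The function name is hypothetical, and the usage note takes its addresses from /proc/self/maps-style listings.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the mapping's start address and file offset differ by a
   multiple of the huge page size, so that huge-page-aligned addresses map
   to huge-page-aligned file offsets. */
static inline int thp_offset_aligned(uint64_t addr, uint64_t fileoff,
                                     uint64_t hpage_size) {
  assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
  return ((addr - fileoff) & (hpage_size - 1)) == 0;
}
```

A mapping at 0x55f3b1c00000 with file offset 0 passes the check for 2 MiB huge pages, while a mapping at 0x555555554000 with file offset 0 does not.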
In kernels with the VM_EXEC
requirement (before v6.8),
if we want to remap the file as huge pages from the ELF header, we must
specify --no-rosegment
to ld.lld.
Build the following program with
c++ -fuse-ld=lld -Wl,-z,max-page-size=2097152
and run it.
We do not define COLLAPSE
for now.
// Adapted from https://mazzo.li/posts/check-huge-page.html
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h> // KPF_THP
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// normal page, 4KiB
#define PAGE_SIZE (1 << 12)
// huge page, 2MiB
#define HPAGE_SIZE (1 << 21)

// See <https://www.kernel.org/doc/Documentation/vm/pagemap.txt> for
// format which these bitmasks refer to
#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))

extern char __ehdr_start[];
__attribute__((used)) const char pad[HPAGE_SIZE] = {};

// Checks if the page pointed at by `ptr` is huge. Assumes that `ptr` has
// already been allocated.
static void check_huge_page(void *ptr) {
  if (getuid())
    return warnx("not root; skip KPF_THP check");
  int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
  if (pagemap_fd < 0)
    errx(1, "could not open /proc/self/pagemap: %s", strerror(errno));
  int kpageflags_fd = open("/proc/kpageflags", O_RDONLY);
  if (kpageflags_fd < 0)
    errx(1, "could not open /proc/kpageflags: %s", strerror(errno));

  // each entry is 8 bytes long
  uint64_t ent;
  if (pread(pagemap_fd, &ent, sizeof(ent), ((uintptr_t)ptr) / PAGE_SIZE * 8) != sizeof(ent))
    errx(1, "could not read from pagemap\n");
  if (!PAGEMAP_PRESENT(ent))
    errx(1, "page not present in /proc/self/pagemap, did you allocate it?\n");
  if (!PAGEMAP_PFN(ent))
    errx(1, "page frame number not present, run this program as root\n");

  uint64_t flags;
  if (pread(kpageflags_fd, &flags, sizeof(flags), PAGEMAP_PFN(ent) << 3) != sizeof(flags))
    errx(1, "could not read from kpageflags\n");
  if (!(flags & (1ull << KPF_THP)))
    errx(1, "could not allocate huge page\n");
  if (close(pagemap_fd) < 0)
    errx(1, "could not close /proc/self/pagemap: %s", strerror(errno));
  if (close(kpageflags_fd) < 0)
    errx(1, "could not close /proc/kpageflags: %s", strerror(errno));
}

int main() {
  printf("__ehdr_start: %p\n", __ehdr_start);
#ifdef COLLAPSE
  // Try a synchronous collapse first; fall back to MADV_HUGEPAGE on failure.
  int ret, tries = 2;
  do {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_COLLAPSE);
  } while (ret && errno == EAGAIN && --tries);
  printf("madvise(MADV_COLLAPSE): %d\n", ret);
  if (ret) {
    ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
    if (ret)
      err(1, "madvise");
  }
#else
  int ret = madvise(__ehdr_start, HPAGE_SIZE, MADV_HUGEPAGE);
  if (ret)
    err(1, "madvise");
#endif

  // Sanity check: an anonymous huge page should be reported as KPF_THP.
  size_t size = HPAGE_SIZE;
  char *buf = (char *)aligned_alloc(HPAGE_SIZE, size);
  madvise(buf, 2 << 20, MADV_HUGEPAGE);
  *((volatile char *)buf);
  check_huge_page(buf);

  // Print /proc/self/maps up to and including the [stack] line.
  int fd = open("/proc/self/maps", O_RDONLY);
  read(fd, buf, HPAGE_SIZE);
  write(STDOUT_FILENO, buf, strstr(buf, "[stack]\n") - buf + 8);
  close(fd);

  // Sleep for one khugepaged scan interval, then touch the file-backed range.
  fd = open("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs", O_RDONLY);
  read(fd, buf, 32);
  close(fd);
  usleep(atoi(buf) * 1000);

  memcpy(buf, __ehdr_start, HPAGE_SIZE);
  check_huge_page(__ehdr_start);
}
The output looks like:
% g++ test.cc -o ~/tmp/test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152 && sudo ~/tmp/test
__ehdr_start: 0x55f3b1c00000
55f3b1c00000-55f3b1e00000 r--p 00000000 103:03 555277119 /home/ray/tmp/test
55f3b1e00000-55f3b1e01000 r--p 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2000000-55f3b2002000 r-xp 00200000 103:03 555277119 /home/ray/tmp/test
55f3b2201000-55f3b2202000 r--p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b2401000-55f3b2402000 rw-p 00201000 103:03 555277119 /home/ray/tmp/test
55f3b3a9a000-55f3b3abb000 rw-p 00000000 00:00 0 [heap]
Thanks to 周洲仪 for helping me figure out the khugepaged behavior.
usleep
gives khugepaged an opportunity to collapse pages
(hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush
).
In the fortunate scenario when this collapse occurs, and the next page
fault is triggered (memcpy(buf, __ehdr_start, HPAGE_SIZE)
),
the kernel will populate the pmd
with a huge page
(handle_page_fault ...=> handle_pte_fault ...=> do_fault_around => filemap_map_pages ...=> do_set_pmd => set_pmd_at
).
However, in an unfortunate case,
check_huge_page(__ehdr_start)
will fail with
could not allocate huge page
.
scan_sleep_millisecs
defaults to 10000 (10 seconds).
Reducing the value increases the likelihood of the fortunate case.
Linux 6.1 introduces MADV_COLLAPSE
to attempt a
synchronous collapse of the native pages mapped by the memory range into
Transparent Huge Pages (THPs). While success is not guaranteed, a
successful collapse eliminates the need to wait for the khugepaged
daemon
(madvise_collapse => hpage_collapse_scan_file => collapse_file => retract_page_tables => pmdp_collapse_flush
).
In the event of repeated MADV_COLLAPSE
failures, a fallback
mechanism using MADV_HUGEPAGE
can be employed.
% g++ -static -DCOLLAPSE test.cc -o test -O2 -fuse-ld=lld -Wl,-z,max-page-size=2097152
% sudo ./test
__ehdr_start: 0x200000
madvise(MADV_COLLAPSE): -1
...
test: could not allocate huge page
% sudo ./test
__ehdr_start: 0x200000
madvise(MADV_COLLAPSE): 0
00200000-00429000 r--p 00000000 fd:03 260 /root/test
00628000-0069f000 r-xp 00228000 fd:03 260 /root/test
0089e000-008a3000 r--p 0029e000 fd:03 260 /root/test
00aa2000-00aa5000 rw-p 002a2000 fd:03 260 /root/test
00aa5000-00aab000 rw-p 00000000 00:00 0
01800000-01822000 rw-p 00000000 00:00 0 [heap]
7fd141600000-7fd141800000 rw-p 00000000 00:00 0
7fd141800000-7fd141a00000 rw-p 00000000 00:00 0
7fd141a00000-7fd141a01000 rw-p 00000000 00:00 0
7ffe69edf000-7ffe69f00000 rw-p 00000000 00:00 0 [stack]
In -z noseparate-code
layouts, the file content starts
somewhere at the first page, potentially wasting half a huge page on
unrelated content. Switching to -z separate-code
allows
reclaiming the benefits of the half huge page but increases the file
size. Balancing these aspects poses a challenge. One potential solution
is using fallocate(FALLOC_FL_PUNCH_HOLE)
, which introduces
complexity into the linker. However, this approach feels like a
workaround to address a kernel limitation. It would be preferable if a
file-backed huge page didn't necessitate a file offset aligned to a huge
page boundary.
Cost of RELRO
To accommodate PT_GNU_RELRO
, the RW
region
will possess two permissions after the runtime linker maps the program.
While GNU ld provides one RW segment split by the dynamic loader, lld
employs two explicit RW PT_LOAD
segments. After relocation
resolving, the effects of lld and GNU ld are similar.
For those curious, explore my notes on GNU ld's file size increase due to RELRO.
Due to RELRO, covering the two RW PT_LOAD
segments
necessitates a minimum of 2 (huge) pages. In contrast, without RELRO,
only one (huge) page is required at minimum. This means potentially
wasting up to MAXPAGESIZE-1 bytes, which could otherwise be utilized to
cover more data.
Nowadays, RELRO is considered a security baseline and removing it might unsettle security-minded individuals.