Control-flow integrity

Control-flow integrity (CFI) refers to techniques which prevent control-flow hijacking attacks. This article describes some compiler/hardware features with a focus on llvm-project implementations.

CFI schemes are commonly divided into forward-edge (e.g. indirect calls) and backward-edge (mainly function returns). AIUI exception handling and symbol interposition are not categorized.

Let's start with backward-edge CFI. Fine-grained schemes check that a return address refers to a possible caller in the control-flow graph. This is a very difficult problem and the additional guarantee is possibly not that useful.

Coarse-grained schemes just check that return addresses are not tampered with. Return addresses are typically stored on a memory region called the "stack", along with function arguments, local variables, and register save areas. Stack smashing is an attack that overwrites the return address to hijack the control flow. The name was popularized by Aleph One (Elias Levy)'s paper Smashing The Stack For Fun And Profit.

StackGuard/Stack Smashing Protector

StackGuard: Automatic Adaptive Detection and Prevention of Buffer-Overflow Attacks (1998) introduced an approach to detect tampered return addresses on the stack.

GCC 4.1 implemented -fstack-protector and -fstack-protector-all. More variants were added over the years, e.g. -fstack-protector-strong in GCC 4.9 and -fstack-protector-explicit in 2015-01.

The feature has two related function attributes: stack_protect and no_stack_protector. Strangely stack_protect is not named stack_protector.

  • In the prologue, a canary is loaded from a secret location and placed before the return address on the stack.
  • In the epilogue, the canary is loaded again and compared with the entry before the return address.

The idea is that an attack overwriting the return address will likely overwrite the canary value as well. If the attacker doesn't know the canary value, returning from the function will crash.
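The idea can be sketched with a hand-rolled canary in C (copy_checked and fake_guard are invented names for illustration; a real compiler inserts the equivalent automatically under -fstack-protector):

```c
#include <stdint.h>
#include <string.h>

// Illustrative stand-in for the secret canary value.
static const uint64_t fake_guard = 0x595e9fbd94fda766ULL;

// Returns 0 if the canary survived, 1 if it was clobbered (a real
// implementation would call __stack_chk_fail and abort instead).
int copy_checked(const char *src, size_t n) {
  struct {
    char buf[8];
    uint64_t canary; // sits between buf and the (conceptual) return address
  } frame;
  frame.canary = fake_guard;         // prologue: place the canary
  memcpy(&frame, src, n);            // simulate a write that may overflow buf
  return frame.canary != fake_guard; // epilogue: compare before returning
}
```

A 16-byte write into the 8-byte buffer spills into the canary slot and is detected by the epilogue comparison.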

The canary is stored either in a libc global variable (__stack_chk_guard) or a field in the thread control block (e.g. %fs:40 on x86-64). Every architecture may have a preference on the location. In a glibc port, only one of -mstack-protector-guard=global and -mstack-protector-guard=tls is supported (BZ #26817). (Changing this requires symbol versioning wrestling and may not be necessary as we can move to a more fine-grained scheme.) With the current thread stack allocation scheme, the thread control block is allocated beside the thread stack. A large out-of-bounds stack write can potentially overwrite the canary.

GCC source code lets either libssp or libc provide __stack_chk_guard. libssp seems obsolete on modern systems.

musl sets ((char *)&__stack_chk_guard)[1] = 0:

  • The NUL byte serves as a terminator to sabotage string reads used as an information leak.
  • A string overflow which attempts to preserve the canary while overwriting bytes after it (an implementation detail of struct pthread) will fail.
  • A one-byte overwrite is still detectable.
  • The set of possible values is decreased, but this seems acceptable for 64-bit systems.
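A small sketch of the terminator property (the payload construction and function names are hypothetical, and the byte layout assumes a little-endian machine): even an attacker who knows the canary cannot reproduce it with string functions, because the embedded NUL byte stops the copy.

```c
#include <stdint.h>
#include <string.h>

// A canary whose byte 1 is zero, mimicking musl's tweak.
// On little-endian, the bytes of this value are 77 00 66 55 44 33 22 11.
static uint64_t make_canary(void) { return 0x1122334455660077ULL; }

// The attacker embeds the known canary in a string payload followed by
// more data; strcpy stops at the NUL byte inside the canary.
size_t truncated_payload_len(void) {
  uint64_t canary = make_canary();
  char payload[10];
  memcpy(payload, &canary, 8);
  payload[8] = 'X';
  payload[9] = '\0';
  char dest[16];
  strcpy(dest, payload); // copies only up to the NUL at byte 1
  return strlen(dest);
}
```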

-mstack-protector-guard-symbol= can change the global variable name from __stack_chk_guard to something else. It was introduced for the Linux kernel. See arch/*/include/asm/stackprotector.h for how the canary is initialized in the Linux kernel. It is at least per-cpu. Some ports support per-task stack canary (search config STACKPROTECTOR_PER_TASK).

Retguard

OpenBSD introduced Retguard in 2017. See RETGUARD and stack canaries. It is more fine-grained than StackGuard, at the cost of more expensive function prologues/epilogues.

For each instrumented function, a cookie is allocated from a pool of 4000 entries. In the prologue, return_address ^ cookie is pushed next to the return address (similar to XOR Random Canary). The epilogue pops the XOR value and the return address and verifies that they match ((value ^ cookie) == return_address).

Not encrypting the return address directly is important to preserve return address prediction for the CPU. The two int3 instructions are to disrupt ROP gadgets which may form from je ...; retq (02 cc cc c3 is addb %ah, %cl; int3; retq). (ROP gadgets removal says that the gadget removal may not be useful.) If a static branch predictor exists (likely not-taken for a forward branch) the initial prediction is likely wrong.

// From https://www.openbsd.org/papers/eurobsdcon2018-rop.pdf
// prologue
ffffffff819ff700: 4c 8b 1d 61 21 24 00 mov 2367841(%rip),%r11 # <__retguard_2759>
ffffffff819ff707: 4c 33 1c 24 xor (%rsp),%r11
ffffffff819ff70b: 55 push %rbp
ffffffff819ff70c: 48 89 e5 mov %rsp,%rbp
ffffffff819ff70f: 41 53 push %r11

// epilogue
ffffffff8115a457: 41 5b pop %r11
ffffffff8115a459: 5d pop %rbp
ffffffff8115a45a: 4c 33 1c 24 xor (%rsp),%r11
ffffffff8115a45e: 4c 3b 1d 03 74 ae 00 cmp 11432963(%rip),%r11 # <__retguard_2759>
ffffffff8115a465: 74 02 je ffffffff8115a469
ffffffff8115a467: cc int3
ffffffff8115a468: cc int3
ffffffff8115a469: c3 retq
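The XOR trick can be modeled in C (function names are made up; the cookie stands in for one of the 4000 per-function entries):

```c
#include <stdint.h>

// Prologue: combine the return address with the per-function cookie
// before pushing the result next to the return address.
uint64_t retguard_protect(uint64_t return_address, uint64_t cookie) {
  return return_address ^ cookie;
}

// Epilogue: XOR-ing with the cookie again must recover the return
// address; anything else means the saved value or the stack was tampered with.
int retguard_verify(uint64_t saved, uint64_t cookie, uint64_t return_address) {
  return (saved ^ cookie) == return_address;
}
```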

SafeStack

Code-Pointer Integrity (2014) proposed a stack object instrumentation scheme which was merged into LLVM in 2015. Use clang -fsanitize=safe-stack. When the option is used in a link action, the driver links in the runtime (compiler-rt/lib/safestack).

The pass moves some stack objects into a separate stack (normally referenced by a thread-local variable __safestack_unsafe_stack_ptr or via a function call __safestack_pointer_address). The moved objects are those which cannot be proven free of stack smashing (the analysis mainly relies on ScalarEvolution).

As an example, the local variable a below is moved to the unsafe stack as there is a risk that bar may have out-of-bounds accesses.

void bar(int *);
void foo() {
  int a;
  bar(&a);
}
foo:                                    # @foo
# %bb.0: # %entry
pushq %r14
pushq %rbx
pushq %rax
movq __safestack_unsafe_stack_ptr@GOTTPOFF(%rip), %rbx
movq %fs:(%rbx), %r14
leaq -16(%r14), %rax
movq %rax, %fs:(%rbx)
leaq -4(%r14), %rdi
callq bar@PLT
movq %r14, %fs:(%rbx)
addq $8, %rsp
popq %rbx
popq %r14
retq

Shadow stack

This technique is very old. Stack Shield (1999) is an early scheme based on assembly instrumentation.

During a function call, the return address is stored on a shadow stack; the normal stack may hold a copy or, in a variant, no copy at all. Upon return, an entry is popped from the shadow stack and either used as the return address or, in the variant, compared with the normal return address.
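A minimal software shadow stack can be sketched as follows (this models the compare-on-return variant; the names and fixed depth are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_DEPTH 128

static uint64_t shadow[SHADOW_DEPTH];
static unsigned shadow_top;

// Called in the prologue: remember the return address.
void shadow_push(uint64_t return_address) {
  assert(shadow_top < SHADOW_DEPTH);
  shadow[shadow_top++] = return_address;
}

// Called in the epilogue: returns 1 if the normal-stack return address
// still matches the shadow copy, 0 if it was tampered with.
int shadow_pop_check(uint64_t return_address) {
  assert(shadow_top > 0);
  return shadow[--shadow_top] == return_address;
}
```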

Userspace

-fsanitize=shadow-call-stack

See https://clang.llvm.org/docs/ShadowCallStack.html. The instrumentation stores the return address in a shadow stack during a function call. Upon return, the return address is popped from the shadow stack. The return address is also stored on the regular stack for return address prediction and compatibility with unwinders, but is otherwise unused.

To use this for AArch64, run clang --target=aarch64-unknown-linux-gnu -fsanitize=shadow-call-stack -ffixed-x18.

str	x30, [x18], #8      // instrumented
sub sp, sp, #32
stp x29, x30, [sp, #16]
...
ldp x29, x30, [sp, #16]
add sp, sp, #32
ldr x30, [x18, #-8]! // instrumented
ret

This is implemented for RISC-V as well. Interestingly, you can still use -ffixed-x18, as x18 (aka s2) is a callee-saved register.

GCC ported the feature in 2022-02 (milestone: 12.0).

In the Linux kernel, select CONFIG_SHADOW_CALL_STACK to use this scheme.

Hardware-assisted

Intel Control-flow Enforcement Technology and AMD Shadow Stack

Supported by Intel's 11th Gen and AMD Zen 3.

A RET instruction pops the return address from both the regular stack and the shadow stack, and compares them. A control protection exception (#CP) is raised in case of a mismatch.

If all relocatable files with .note.gnu.property have set the GNU_PROPERTY_X86_FEATURE_1_SHSTK bit, or -z shstk is specified, the output will have the bit.

On Windows the scheme is branded as Hardware-enforced Stack Protection. In the MSVC linker, /cetcompat marks an executable image as compatible with Control-flow Enforcement Technology (CET) Shadow Stack.

setjmp/longjmp need to save and restore the shadow stack pointer.

Stack unwinding

If the shadow stack contains just return addresses, we will get a side benefit of an efficient stack unwinding scheme. The frame pointer can be omitted now that we no longer need a frame chain to trace through return addresses.

In many implementations, a shadow stack is difficult to access for a different process. This makes out-of-process unwinding difficult.

ARMv8.3 Pointer Authentication

Also available as Armv8.1-M Pointer Authentication.

Instructions are provided to sign a pointer with a 64-bit user-chosen context value (usually zero, X16, or SP) and a 128-bit secret key. The computed Pointer Authentication Code is stored in the unused high bits of the pointer. The instructions are allocated from the HINT space for compatibility with older CPUs.

A major use case is to sign/authenticate the return address. paciasp is inserted at the start of the function prologue to sign the to-be-saved LR (X30). (It serves as an implicit bti c as well.) autiasp is inserted at the end of the function epilogue to authenticate the loaded LR.

paciasp                        # instrumented
sub sp, sp, #0x20
stp x29, x30, [sp, #0x10]
...
ldp x29, x30, [sp, #0x10]
add sp, sp, #0x20
autiasp # instrumented
ret
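The signing idea can be modeled in C. The mixing function below is a toy stand-in (real hardware uses QARMA or an implementation-defined cipher, with keys in registers invisible to EL0), and all names are invented:

```c
#include <stdint.h>

#define PTR_MASK 0x0000ffffffffffffULL // low 48 bits hold the pointer

// Toy MAC: fold the pointer, context, and key into 16 bits.
static uint64_t toy_mac(uint64_t ptr, uint64_t ctx, uint64_t key) {
  uint64_t h = ptr ^ ctx ^ key;
  h *= 0x9e3779b97f4a7c15ULL; // splitmix64-style mixing constant
  return (h >> 48) & 0xffff;
}

// Sign: store the MAC in the unused high bits of the pointer.
uint64_t pac_sign(uint64_t ptr, uint64_t ctx, uint64_t key) {
  uint64_t p = ptr & PTR_MASK;
  return p | (toy_mac(p, ctx, key) << 48);
}

// Authenticate: return the stripped pointer, or 0 on failure (real
// hardware poisons the pointer so a later dereference faults).
uint64_t pac_auth(uint64_t signed_ptr, uint64_t ctx, uint64_t key) {
  uint64_t p = signed_ptr & PTR_MASK;
  if ((signed_ptr >> 48) != toy_mac(p, ctx, key))
    return 0;
  return p;
}
```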

ld.lld added support in https://reviews.llvm.org/D62609 (with substantial changes afterwards). If -z pac-plt is specified, autia1716 is used for a PLT entry; a relocatable file with .note.gnu.property and the GNU_PROPERTY_AARCH64_FEATURE_1_PAC bit cleared gets a warning.


Now let's discuss forward-edge CFI schemes.

For an indirect call, ideally we can enforce that the target is among the targets in the control-flow graph. This is very difficult to implement efficiently, so many schemes just ensure that the function signature matches. E.g. in void f(void (*indirect)(void)) { indirect(); }, if indirect actually refers to a function with arguments, the CFI scheme will flag this indirect call.

In C, checking that the argument and return types match suffices. Many schemes use type hashes. In C++, language constructs such as virtual functions and pointers to member functions add more restrictions on what can be indirectly called. -fsanitize=cfi has implemented many C++-specific checks, which are missing from many other CFI schemes.
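The type-hash idea can be sketched in C (proto_hash, typed_fn, and checked_call are invented for illustration; real schemes hash a mangled type string at compile time):

```c
#include <stdint.h>

// FNV-1a over a prototype string, standing in for a compile-time type hash.
uint32_t proto_hash(const char *sig) {
  uint32_t h = 2166136261u;
  for (; *sig; sig++) {
    h ^= (unsigned char)*sig;
    h *= 16777619u;
  }
  return h;
}

// A function pointer bundled with the hash of its prototype.
struct typed_fn {
  uint32_t hash;
  void (*fn)(int);
};

// An instrumented indirect call site: compare hashes before calling.
// Returns 1 and calls fn on a match; 0 on a mismatch (a real scheme traps).
int checked_call(struct typed_fn f, uint32_t expected, int arg) {
  if (f.hash != expected)
    return 0;
  f.fn(arg);
  return 1;
}

// Sample callee used in the usage example.
int last_arg;
void takes_int(int x) { last_arg = x; }
```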

pax-future.txt

https://pax.grsecurity.net/docs/pax-future.txt (c.2) describes a scheme where

  • a call site provides a hash indicating the intended function prototype
  • the callee's epilogue checks whether the hash matches its own prototype

In case of a mismatch, it indicates that the callee does not have the intended prototype. The program should terminate.

callee
epilogue: mov register,[esp]
cmp [register+1],MAGIC
jnz .1
retn
.1: jmp esp

caller:
call callee
test eax,MAGIC

The epilogue assumes that a call site hash is present, so an instrumented function cannot be called by a non-instrumented call site.

-fsanitize=cfi

See https://clang.llvm.org/docs/ControlFlowIntegrity.html. Clang has implemented the traditional indirect function call check and many C++ specific checks under this umbrella option. They all rely on Type Metadata and link-time optimizations: LTO collects functions with the same signature so that a type check can be performed efficiently.

The underlying LLVM IR technique is shared with virtual function call optimization (devirtualization).

struct A { void f(); };
struct B { void f(); };
void A::f() {}
void B::f() {}
static void (A::*fptr)();

int main(int argc, char **argv) {
  A *a = new A;
  if (argv[1]) fptr = (void (A::*)())&B::f; // caught by -fsanitize=cfi-mfcall
  else fptr = &A::f; // good
  (a->*fptr)();
  delete a;
}
struct A { virtual void f() {} };
struct B : A { void g() {} virtual ~B() {} };
struct C : A { void g() {} };
struct D { virtual void f() {} };
void af(A *a) { a->f(); }
void bg(B *b) { b->g(); }
int main(int argc, char **argv) {
  B b;
  C c;
  D d;
  af(reinterpret_cast<A *>(&d)); // caught by -fsanitize=cfi-vcall
  bg(&b); // good
  bg(reinterpret_cast<B *>(&c)); // caught by -fsanitize=cfi-nvcall
}

Cross-DSO CFI requires a runtime (see compiler-rt/lib/cfi).

Control Flow Guard

Windows 8.1 Preview introduced Control Flow Guard. It was implemented in llvm-project in 2019. Use clang-cl /guard:cf or clang --target=x86_64-pc-windows-gnu -mguard=cf.

The compiler instruments indirect calls to call a global function pointer (___guard_check_icall_fptr or __guard_dispatch_icall_fptr) and records valid indirect call targets in special sections: .gfids$y (address-taken functions), .giats$y (address-taken IAT entries), .gljmp$y (longjmp targets), and .gehcont$y (ehcont targets). An instrumented file with applicable functions defines the @feat.00 symbol with at least one bit of 0x4800.

The linker combines the sections, marks additional symbols (e.g. /entry), and creates address tables (__guard_fids_table, __guard_iat_table, __guard_longjmp_table (unless /guard:nolongjmp), __guard_eh_cont_table).

At run-time, the global function pointer refers to a function which verifies that an indirect call target is valid.

leaq   target(%rip), %rax
callq *%rax

=>

leaq target(%rip), %rax
callq *__guard_dispatch_icall_fptr(%rip)
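The run-time check can be sketched with a validity bitmap (the 16-byte granularity, toy address-space size, and names are assumptions for illustration; the real guard function consults tables built by the linker and maintained by the OS):

```c
#include <stdint.h>

#define GRAN 16 // hypothetical: one validity bit per 16-byte slot

static uint8_t valid_bitmap[1 << 16]; // covers a toy address space

// Mark an address as a legitimate indirect-call target (conceptually
// done from __guard_fids_table when the image is loaded).
void mark_valid_target(uintptr_t addr) {
  uintptr_t slot = addr / GRAN;
  valid_bitmap[slot / 8] |= (uint8_t)(1u << (slot % 8));
}

// What the guard check function decides before dispatching the call.
int is_valid_target(uintptr_t addr) {
  uintptr_t slot = addr / GRAN;
  return (valid_bitmap[slot / 8] >> (slot % 8)) & 1;
}
```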

eXtended Flow Guard

This is an improved Control Flow Guard and is similar to pax-future.txt (c.2). Use cl.exe /guard:xfg. On x86-64, the instrumentation calls the function pointer __guard_xfg_dispatch_icall_fptr (instead of __guard_dispatch_icall_fptr) and provides the prototype hash as an argument. The runtime checks whether the prototype hash matches the callee.

void bar(int a) {}
void foo(void (*f)(int)) { f(42); }
gxfg$y  SEGMENT
__guard_xfg?foo@@YAXP6AXH@Z@Z DDSymXIndex: FLAT:?foo@@YAXP6AXH@Z@Z
DD 00H
DQ ba49be2b36da9170H
gxfg$y ENDS
; COMDAT gxfg$y
gxfg$y SEGMENT
__guard_xfg?bar@@YAXH@Z DDSymXIndex: FLAT:?bar@@YAXH@Z
DD 00H
DQ c6c1864950d77370H
gxfg$y ENDS

?bar@@YAXH@Z PROC ; bar, COMDAT
ret 0
?bar@@YAXH@Z ENDP

?foo@@YAXP6AXH@Z@Z PROC ; foo, COMDAT
mov rax, rcx
mov r10, -4124868134247632016 ; c6c1864950d77370H
mov ecx, 42 ; 0000002aH
rex_jmp QWORD PTR __guard_xfg_dispatch_icall_fptr
?foo@@YAXP6AXH@Z@Z ENDP ; foo

-fsanitize=kcfi

Introduced to llvm-project in 2022-11 (milestone: 16.0.0; I was glad to be a reviewer). There is a GCC feature request.

The instrumentation does:

  • store a hash of the function prototype before the function entry.
  • in an indirect call site, load the hash, trap if it does not match the expected prototype.
  • record the trap location in a section named .kcfi_traps. The Linux kernel uses the section to check whether a trap is caused by KCFI.
  • define a weak absolute symbol __kcfi_typeid_<func> if the function has a C identifier name and is address-taken.
__cfi_bar:
.rept 11; nop; .endr
movl $27004076, %eax # imm = 0x19C0CAC
bar:
retq

__cfi_foo:
.rept 11; nop; .endr
movl $2992198919, %eax # imm = 0xB2595507
foo:
movq %rdi, %rax
movl $42, %edi
movl $4267963220, %r10d # imm = 0xFE63F354
addl -4(%rax), %r10d
je .Ltmp0
.Ltmp1:
ud2
.section .kcfi_traps,"ao",@progbits,.text
.Ltmp2:
.long .Ltmp1-.Ltmp2
.text
.Ltmp0:
jmpq *%rax # TAILCALL

The large number of NOPs is to leave room for FineIBT patching at run-time.
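The addl/je pair works because the call site embeds the negated hash, so a single addition wrapping to zero signals a match. This can be checked with the immediates from the listing above:

```c
#include <stdint.h>

// The call site stores -hash as a 32-bit immediate; adding the hash
// found before the callee's entry wraps to 0 exactly on a match, so one
// addl + je implements the whole check.
int kcfi_hash_matches(uint32_t callee_hash, uint32_t callsite_imm) {
  return (uint32_t)(callee_hash + callsite_imm) == 0;
}
```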

An instrumented indirect call site cannot call an uninstrumented target. This property appears to be satisfied in the Linux kernel.

See cfi: Switch to -fsanitize=kcfi for the Linux kernel implementation.

FineIBT

This is an instrumentation and hardware hybrid scheme. It uses both an indirect branch target indicator (ENDBR) and a prototype hash check. See https://dustri.org/b/paper-notes-fineibt.html for an analysis.

testb 0x11, fs:0x48 is a compromise that allows the check to be disabled at run-time. fs:0x48 occupies an unused field (tcbhead_t::unused_vgetcpu_cache in sysdeps/x86_64/nptl/tls.h). Unfortunately, this also allows an attacker to disable FineIBT by zeroing this byte.

x86/ibt: Implement FineIBT is the Linux kernel implementation which patches -fsanitize=kcfi output into FineIBT.

Hardware-assisted

Intel Indirect Branch Tracking

This is part of Intel Control-flow Enforcement Technology. When enabled, the CPU ensures that every indirect branch lands on a special instruction (ENDBR, endbr32 for x86-32 and endbr64 for x86-64), otherwise a control-protection (#CP) exception is raised. A jump/call instruction with the notrack prefix skips the check.

With -fcf-protection={branch,full}, the compiler inserts ENDBR at the start of a basic block which may be reached indirectly. This is conservative:

  • every address-taken basic block needs ENDBR
  • every non-internal-linkage function needs ENDBR as it may be reached via PLT
  • every function compiled for the large code model needs ENDBR as it may be reached via a large code model branch
  • landing pads for exception handling need ENDBR

__attribute__((nocf_check)) can disable ENDBR insertion for a function.

ld.lld added support in 2020-01:

  • PLT entries need ENDBR
  • If all relocatable files with .note.gnu.property (which contains NT_GNU_PROPERTY_TYPE_0 notes) have set the GNU_PROPERTY_X86_FEATURE_1_IBT bit, or -z ibt is specified, the output will have the bit and synthesize ENDBR compatible PLT entries.

The GNU ld implementation uses a second PLT scheme (.plt.sec). I was sad about it, but in the end I decided to follow suit for ld.lld. Some tools (e.g. objdump) use heuristics to detect foo@plt, and using an alternative form would require them to adapt. mold nevertheless settled on its own scheme.

swapcontext may return with an indirect branch. Such a function can be annotated with __attribute__((indirect_return)) so that call sites will be followed by an ENDBR.

glibc can be built with --enable-cet. By default it uses the NT_GNU_PROPERTY_TYPE_0 notes to decide whether to enable CET. If the executable or any shared object lacks the GNU_PROPERTY_X86_FEATURE_1_IBT bit, rtld calls arch_prctl with ARCH_CET_DISABLE to disable IBT. SHSTK is similar. A dlopen'ed shared object triggers the check as well.

As of 2023-01, the Linux kernel support is work in progress. It does not support ARCH_CET_STATUS/ARCH_CET_DISABLE yet.

qemu does not support Indirect Branch Tracking yet.

Don't conservatively mark non-local-linkage functions

The current scheme as well as ARM BTI conservatively adds an indirect branch indicator to every non-internal-linkage function, which increases the attack surface. Alternative CET ABI mentioned that we can use NOTRACK in PLT entries, and use new relocation types to indicate address-significant functions. Supporting dlsym/dlclose needs special tweaking.

Technically the address significance part can be achieved with the existing SHT_LLVM_ADDRSIG, so that we can avoid introducing a new relocation type for each architecture. I am somewhat unhappy that SHT_LLVM_ADDRSIG optimizes for size and does not make binary manipulation convenient (objcopy and ld -r set sh_link=0 for SHT_LLVM_ADDRSIG and invalidate the section). See Explain GNU style linker options. Since this only concerns x86, introducing new x86-32/x86-64 relocation types is not bad.

When LTO is concerned, this imposes more difficulties as LLVMCodeGen does not know (https://reviews.llvm.org/D140363):

  • whether a function is visible to a native relocatable file (VisibleToRegularObj)
  • or whether the address of a function is taken in at least one IR module.

The two properties address the two problems:

  • If the link mixes bitcode files and ELF relocatable files, for a function in a bitcode file, F.hasAddressTaken() doesn't indicate that its address is not taken by an ELF relocatable file.
  • For ThinLTO, a function may have false F.hasAddressTaken() for the definition in one module and true F.hasAddressTaken() for a reference in another module.

Both properties need additional bookkeeping between LLVMLTO and LLVMCodeGen. The first property additionally needs the linker to communicate information to LLVMLTO.

If we make ThinLTO properly combine the address-taken property (close to !GV.use_empty() && !GV.hasAtLeastLocalUnnamedAddr()), and provide VisibleToRegularObj to LLVMCodeGen, we can use the following condition to decide whether ENDBR is needed with an appropriate code model:

AddressTaken || (!F.hasLocalLinkage() && (VisibleToRegularObj || !F.hasHiddenVisibility()))

(Note: some large x86-64 executables are facing relocation overflow pressure. Range extension thunks may be a future direction. If NOTRACK is not used, we will need to conservatively mark local linkage functions.)

ARMv8.5 Branch Target Identification

This is similar to Intel Indirect Branch Tracking. When enabled, the CPU ensures that every indirect branch lands on a special instruction, otherwise a Branch Target exception is raised. The most common landing instructions are bti {c,j,jc} with different branch instruction compatibility. paciasp and pacibsp are implicitly bti c.

This feature is more fine-grained: each memory page has a bit VM_ARM64_BTI (set with the mmap flag PROT_BTI) indicating that indirect branches into the page must land on a BTI instruction. Uninstrumented code can be mapped simply by skipping PROT_BTI, leaving its branch targets unchecked.

The value of -mbranch-protection= can be none (no hardening), standard (bti with a non-leaf protection scope), or + separated bti and pac-ret[+b-key,+leaf]. If bti is enabled, the compiler inserts bti {c,j,jc} at the start of a basic block which may be reached indirectly. For a non-internal-linkage function, its entry may be reached by a PLT or range extension thunk, so it is conservatively marked as needing bti c.

ld.lld added support in https://reviews.llvm.org/D62609 (with substantial changes afterwards). If all relocatable files with .note.gnu.property have set the GNU_PROPERTY_AARCH64_FEATURE_1_BTI bit, or -z force-bti is specified, the output will have the bit and synthesize BTI compatible PLT entries.

qemu has supported Branch Target Identification since 2019.

Compiler warnings

clang -Wcast-function-type warns when a function pointer is cast to an incompatible function pointer. Calling such a cast function pointer likely leads to -fsanitize=cfi/-fsanitize=kcfi runtime errors.

Clang 16.0.0 makes -Wcast-function-type stricter and warns in more cases. -Wno-cast-function-type-strict restores the previous behavior, which ignores many cases including some ABI-equivalent ones.
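For example (the functions below are hypothetical), clang flags the cast with -Wcast-function-type, and calling through the cast pointer would be flagged by -fsanitize=cfi-icall at run time:

```c
typedef int (*int_fn)(int);

int add1(int x) { return x + 1; }
long addl_one(long x) { return x + 1; }

int demo(void) {
  int_fn ok = add1;              // compatible prototype: no warning
  int_fn bad = (int_fn)addl_one; // warning: cast-function-type
  (void)bad;                     // calling bad would be undefined behavior
  return ok(41);
}
```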

Summary

Forward-edge CFI

  • check that the target is a valid indirect branch target but do not check the function signature: Control Flow Guard, Intel Indirect Branch Tracking, ARMv8.5 Branch Target Identification
  • check that the target is a valid indirect branch target and check the function signature: eXtended Flow Guard, -fsanitize=cfi, -fsanitize=kcfi, FineIBT