2020-02-01

Starting from -fpatchable-function-entry=N[,M]

The Linux kernel uses many GCC options to support ftrace.

-pg
-mfentry
-mnop-mcount
-mprofile-kernel (PowerPC)
-mrecord-mcount
-mhotpatch=pre-halfwords,post-halfwords
-fpatchable-function-entry=N[,M]

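-pg support is already present in the "prehistoric" era (Initial revision) of the current GCC git repo. -pg inserts a call to mcount() after the function prologue (Linux x86); on other OSes or architectures the function may have a different name, such as _mcount, __mcount, or .mcount. The trace information can be used by gprof and gcov.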

# gcc -S -pg -O3 -fno-asynchronous-unwind-tables
foo:
	pushq	%rbp
	movq	%rsp, %rbp
1:	call	*mcount@GOTPCREL(%rip)

-pg takes effect after inlining.

When linking, GCC selects a different crt1 file, gcrt1.o.
libc implements gcrt1.o (glibc: sysdeps/x86_64/_mcount.S and gmon/mcount.c; FreeBSD: sys/amd64/amd64/prof_machdep.c).
musl does not provide gcrt1.o: https://www.openwall.com/lists/musl/2014/11/05/2

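How glibc uses it:

gcrt1.o defines __gmon_start__. The other crt1.o variants do not.
crti.o uses an undefined weak reference to __gmon_start__ to detect gcrt1.o, and calls it if present.
__gmon_start__ in gcrt1.o calls __monstartup to initialize. Finishing initialization before the program runs avoids call-once synchronization.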

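GCC r21495 (1998) introduced -finstrument-functions, which inserts __cyg_profile_func_enter(this_fn, call_site) after the function prologue and __cyg_profile_func_exit(this_fn, call_site) before the epilogue. A program that implements these two functions can record function calls. Each function takes two arguments, which has a considerable impact on code size. Moreover, many applications do not actually need the call_site argument.

void __cyg_profile_func_enter(void *this_fn, void *call_site);
void __cyg_profile_func_exit(void *this_fn, void *call_site);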

# gcc -S -O3 -finstrument-functions -fno-asynchronous-unwind-tables
foo:
  subq    $8, %rsp
  leaq    foo(%rip), %rdi
  movq    8(%rsp), %rsi
  call    __cyg_profile_func_enter@PLT
  movq    8(%rsp), %rsi
  leaq    foo(%rip), %rdi
  call    __cyg_profile_func_exit@PLT
  xorl    %eax, %eax
  addq    $8, %rsp
  ret

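-finstrument-functions applies before inlining by default. This reflects control flow well but costs a lot: after inlining, a single function may contain several calls to __cyg_profile_func_enter(). clang provides -finstrument-functions-after-inlining to instrument after inlining instead. On GCC x86, -pg -mfentry -minstrument-return=call inserts call __return__ at function return and can serve as a substitute for -finstrument-functions -finstrument-functions-after-inlining.

The earliest ftrace implementation in the Linux kernel, 16444a8a40d (2008), used -pg and mcount. Linux defined mcount itself. It compares a function pointer to check whether ftrace is enabled; if it is not, mcount behaves like an empty function.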

#ifdef CONFIG_FTRACE
ENTRY(mcount)
	cmpq $ftrace_stub, ftrace_trace_function
	jnz trace
.globl ftrace_stub
ftrace_stub:
	...
#endif

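Executing call mcount after the prologue of every function is very expensive. So the Linux kernel later recorded the PCs of mcount's callers in a hash table, and a daemon running once per second walked the hash table and rewrote the call mcount of functions that did not need tracing into NOPs.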

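Later, 8da3821ba56 changed this from "JIT" to "AOT". At build time, a Perl script, scripts/recordmcount.pl, runs objdump to record the addresses of all call mcount instructions and stores them in the __mcount_loc section. At boot, the kernel rewrites every call mcount into a NOP up front, which removes the need for the daemon. Because Perl plus objdump was too slow, 16444a8a40d added a C implementation, scripts/recordmcount.c, in 2010.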

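One drawback of mcount is that the stack frame size is hard to determine, so ftrace cannot access the tracee's arguments. GCC r162651 (2010, GCC 4.6) introduced -mfentry, which replaces the call mcount after the prologue with a call __fentry__ before the prologue. In 2011, d57c5d51a30 added -mfentry support for x86-64.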

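GCC r206111 (2013) introduced the SystemZ-specific -mhotpatch. Note its description: there is only one NOP after the function entry, and the types of the NOPs before the entry are restricted. That lacks generality, so other architectures cannot use it. It was later generalized to -mhotpatch=pre-halfwords,post-halfwords.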

GCC b54214fe22107618e7dd7c6abd3bff9526fcb3e5 (2013-03) ported -mprofile-kernel to PowerPC64 ELFv2. In 2016, "powerpc/ftrace: Add Kconfig & Make glue for mprofile-kernel" and the few commits before it started using this option.

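GCC r215629 (2014) introduced -mrecord-mcount and -mnop-mcount. -mrecord-mcount replaces linux/scripts/recordmcount.{pl,c}. -mnop-mcount cannot be used with PIC; it replaces the call to __fentry__ with a NOP. The design did not consider generality: most RISC architectures cannot use a -mnop-mcount that takes no argument. As of this writing, only x86 and SystemZ support -mnop-mcount.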

(In 2019, Linux x86 removed mcount support in 562e14f7229.)

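GCC r250521 (2017) introduced -fpatchable-function-entry=N[,M]. Like the SystemZ-specific -mhotpatch=, it inserts M NOPs before the function entry and N-M NOPs after it. It is now used by Linux arm64 and parisc. The design idea is good, but the implementation has many problems, so it is only usable for the Linux kernel.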

In 2018, GCC x86 introduced -minstrument-return=call, which works with -pg -mfentry to insert call __return__ at function return. -minstrument-return=nop5 inserts a 5-byte NOP instead.

# gcc -fpatchable-function-entry=3,1 -S -O3 a.c -fno-asynchronous-unwind-tables
	.section	__patchable_function_entries,"aw",@progbits
	.quad	.LPFE1
	.text
.LPFE1:
	nop
	.type	foo, @function
foo:
	nop
	nop
	xorl	%eax, %eax
	ret

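https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93197 __patchable_function_entries gets discarded by ld --gc-sections (linker section garbage collection), which makes GCC's implementation unusable for most programs. This was eventually fully fixed when PowerPC ELFv2 support was added: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99899 (milestone: 13.0)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93195 A link error occurs when the COMDAT section group that a __patchable_function_entries entry refers to is discarded. This makes the option unusable for many C++ programs that use inline functions.
The error message gets the option name wrong: gcc -fpatchable-function-entry=a -c a.c => cc1: error: invalid arguments for ‘-fpatchable_function_entry’
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93194 __patchable_function_entries does not specify a section alignment. This was my second GCC patch.
The entries in __patchable_function_entries should use PC-relative relocations rather than absolute relocations, to avoid R_*_RELATIVE dynamic relocations after linking. At first I could not accept this, because the other defects can be fixed on the clang side while staying backward compatible, but the relocation type cannot be changed. Later I realized that MIPS does not provide R_MIPS_PC64, so I chose to forgive GCC. That is MIPS for you: an ISA defect leads the psABI to "invent" a clever ELF trick as a workaround, which in turn introduces new problems. "mips is really the worst abi i've ever seen." "you mean worst dozen abis ;"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424 When AArch64 Branch Target Identification is enabled, the NOP sled should come after the BTI.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93492 When x86 Indirect Branch Tracking is enabled, the NOP sled should come after ENDBR32/ENDBR64. Just before I started implementing -fpatchable-function-entry=, I happened to be adding -z force-ibt to lld, so on seeing the AArch64 problem it was natural to suspect that x86 had a similar one.
Interaction with -fasynchronous-unwind-tables was not considered. Once again, the Linux kernel uses -fno-asynchronous-unwind-tables, so it is understandable that the GCC implementation did not think about this.
The initial .loc directive should come before the NOP sled. Otherwise, symbolizing the function address yields no file name or line number.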

Fixing --gc-sections and COMDAT is trickier and also needs features from GNU as and GNU ld on the binutils side:

https://sourceware.org/bugzilla/show_bug.cgi?id=25380 Support for unique section IDs. GNU as added support on February 2: https://sourceware.org/ml/binutils/2020-02/msg00020.html
https://sourceware.org/bugzilla/show_bug.cgi?id=25381 Support for SHF_LINK_ORDER. HJ Lu posted a patch: https://sourceware.org/ml/binutils/2020-02/msg00028.html
GNU ld --gc-sections semantics https://sourceware.org/ml/binutils/2019-11/msg00266.html

Apart from the AArch64 BTI one, I reported all of these issues.

The steps for adding -fpatchable-function-entry= to clang were:

D72215 Introduce the LLVM function attribute "patchable-function-entry", with AArch64 AsmPrinter support
D72220 x86 AsmPrinter support
D72221 Implement the function attribute __attribute__((patchable_function_entry(0,0))) in clang
D72222 Add the driver option -fpatchable-function-entry=N[,0] to clang
D73070 Introduce the LLVM function attribute "patchable-function-prefix"
Move codegen passes to change the order of the NOP sled and BTI/ENDBR, which also fixes the interaction of XRay and -mfentry with -fcf-protection=branch.
D73680 AArch64 BTI: handle the patch label position when M=0: bti c; .Lpatch0: nop rather than .Lpatch0: bti c; nop
x86 ENDBR32/ENDBR64: handle the patch label position when M=0: endbr64; .Lpatch0: nop rather than .Lpatch0: endbr64; nop

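All of the patches above, except the x86 ENDBR patch label adjustment, will be included in clang 10.0.0.

Before -fpatchable-function-entry=, clang already had several ways to insert code at the function entry:

-fxray-instrument. XRay traces with an approach similar to -finstrument-functions and, like the Linux kernel, modifies code at run time
Azul Systems introduced PatchableFunction for JIT. I reused this pass when introducing "patchable-function-entry"
IR feature: prologue data, which adds arbitrary bytes after the function entry. Used by the function sanitizer
IR feature: prefix data, which adds arbitrary bytes before the function entry. Used for GHC TABLES_NEXT_TO_CODE, where the info table is placed before the entry code. GHC's LLVM backend is currently still in a state of long neglect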

PowerPC64 ELFv2

For the PowerPC ELFv2 implementation, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99888

-fpatchable-function-entry=5,2 produces the following output:

        .globl foo
        .type   foo, @function
foo:
.LFB0:
        .cfi_startproc
.LCF0:
0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .section        __patchable_function_entries,"awo",@progbits,foo
        .align 3
        .8byte  .LPFE1
        .section        ".text"
.LPFE1:
        nop
        nop
        .localentry     foo,.-foo
        nop
        nop
        nop
        mflr 0
        std 0,16(1)
        stdu 1,-32(1)

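The NOPs come after the global entry. There are M NOPs before the local entry and N-M NOPs after it. Because the distance between the global entry and the local entry is restricted, M can only take a few values, such as 0 (2-2), 2 (4-2), 6 (8-2), and 14 (16-2).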

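In PR99888, I pointed out back in 2022 that the NOPs do not need to be contiguous. Discussion in late 2023 also showed that the current contiguous NOPs are inconvenient for kernel and userspace live patching. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112980 plans to change the implementation.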