2020-10-15

Intra-module function calls and libc symbol renaming

Function calls across translation units

A libc implementation sometimes calls functions that are defined in other translation units of the same libc, for example:

#include <string.h>

void foo(void *dst, const void *src) {
  memcpy(dst, src, 90);
}

When this translation unit is compiled to a .o with -fPIC, a call to memcpy is generated. On some architectures this incurs a small PIC setup overhead (see below).

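In an ELF shared object, a defined non-local STV_DEFAULT symbol is preemptible (interposable) by default, i.e. it can be replaced at run time. A reference to a preemptible symbol cannot be bound to the definition inside the module.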

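When this .o is linked with -shared, the linker assumes that memcpy may be replaced at run time by the executable or by another shared object, even if memcpy is defined in another translation unit of the same shared object. When it sees a relocation type that represents a function call, it has to create a PLT entry.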

# Although memcpy is within the module, the linker cannot resolve the relocation to the address of memcpy (direct call).
# It has to resolve the relocation to the PLT entry (symbolized as memcpy@plt in objdump).
f: call memcpy@PLT  # R_X86_64_PLT32

memcpy: ...

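Run-time replacement provides flexibility at the cost of performance. Each call executes the few instructions of the PLT entry, which load a function pointer from .got.plt (on most architectures) and jump through it indirectly. There is also the one-time cost of ld.so resolving the symbol.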
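The cost described above can be modelled in plain C. This is only a toy model, not how a real PLT is implemented: the function pointer got_plt_square stands in for one .got.plt slot, square_lazy_stub stands in for the lazy-binding path through ld.so, and all names are made up for illustration.

```c
typedef int (*func_ptr)(int);

/* Stands in for a function defined in another module. */
static int real_square(int x) { return x * x; }

/* Counts how often the simulated resolver (the "ld.so" path) runs. */
static int resolver_runs;

static int square_lazy_stub(int x);

/* Models one .got.plt slot: it initially points at the lazy-binding stub. */
static func_ptr got_plt_square = square_lazy_stub;

/* Models lazy binding: resolve once, patch the slot, then call the target. */
static int square_lazy_stub(int x) {
  resolver_runs++;
  got_plt_square = real_square;
  return got_plt_square(x);
}

/* Models square@plt: every call loads the slot and jumps through it. */
int square_via_plt(int x) { return got_plt_square(x); }

int resolver_run_count(void) { return resolver_runs; }
```

Every call through square_via_plt pays for the load and the indirect call, while the resolver runs only on the first call.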

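In 99.9% of cases the Procedure Linkage Table (PLT) is used to call a function defined in another shared object or in the executable. (What? You ask what the remaining 0.1% is? It is an obscure, nearly forgotten trick in GNU indirect functions. binutils gained RISC-V ifunc support today, and in an earlier email I mentioned the approach LLD uses.)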

Function calls within a translation unit and `-fno-semantic-interposition`

GCC 5 introduced -fno-semantic-interposition. If the called function is defined in the same translation unit,

// f is global STV_DEFAULT.
void f() {}
void g() { f(); }

the compiler has two choices:

-fsemantic-interposition (the GCC default): conservatively assume that run-time preemption may happen, and suppress all related inter-procedural optimizations (such as inlining).
f:
...
g:
# Inlining f into g may change semantics if f is preempted at runtime.
call f@PLT
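-fno-semantic-interposition: optimistically assume that no preemption happens, and allow inter-procedural optimizations. This has in fact been Clang's behavior all along. (My contribution) When this option is specified, Clang 11 sets dso_local at the LLVM IR level. On x86, if the function ends up not being inlined, the compiler emits a call to the local alias .Lf$local and does not use GOT-generating relocations. Local symbols are non-preemptible, which prevents the linker from creating GOT and PLT entries.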
f: # STB_GLOBAL
.Lf$local: # STB_LOCAL
...
g:
call .Lf$local
If f is defined in another translation unit, this option still cannot prevent the PLT.
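The local-alias idea can also be written by hand in GNU C, which may help to see what the compiler does for us. This is a minimal sketch that assumes GCC or Clang on an ELF target; f_local is a hypothetical name playing the role of .Lf$local.

```c
/* f is a global STV_DEFAULT function, as in the example above. */
int f(void) { return 41; }

/* A second, local (internal linkage) name for the same code.
   Hand-written counterpart of the compiler-generated .Lf$local. */
static __typeof(f) f_local __attribute__((alias("f")));

/* The call goes through the local alias, so it cannot be preempted
   and the compiler is free to inline it or emit a direct call. */
int g(void) { return f_local() + 1; }
```

Like the compiler option, this only helps when f is defined in the same translation unit.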

In Clang 11, Serge Guelton added -fsemantic-interposition.

Link-time solution: `--dynamic-list`

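Many libc functions are called by other libc functions. Letting all of these calls go through the PLT costs considerable performance. Linkers provide several mechanisms to make some symbols non-preemptible: -Bsymbolic, -Bsymbolic-functions, version scripts and --dynamic-list.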

musl uses --dynamic-list to specify precisely which symbols in libc.so are preemptible:

{
environ; __environ; stdin; stdout; stderr; timezone; daylight; tzname; ...
malloc; calloc; realloc; free; ...
}

--dynamic-list has different semantics for executables and shared objects. The shared object semantics apply here:

executable: Put matched non-local defined symbols into the dynamic symbol table
shared object: Matched defined non-local STV_DEFAULT symbols are preemptible, and others are non-preemptible. (Implies -Bsymbolic but does not set DF_SYMBOLIC.) (References to preemptible symbols cannot be bound to the definitions within the shared object.)

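Most functions are not in this list, because most functions cannot be redefined by user programs. The C standard says that defining any standard library function in a program is undefined behavior, but in practice many implementations relax this to allow replacement malloc implementations (the best known are jemalloc and tcmalloc, used through LD_PRELOAD). Sanitizers also preempt a large number of library functions.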

Since 1.1.20, musl lets user programs replace a small set of malloc-related functions, which is why these functions are in the dynamic list.

Before 1.1.20, musl's malloc implementation could not be replaced. musl used -Bsymbolic-functions, which is less powerful than --dynamic-list: all STT_FUNC symbols are non-preemptible, while all STT_OBJECT symbols remain preemptible.

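A translation unit compiled with -fno-PIC can only be used in an executable. Traditionally, -fno-PIC on many architectures accesses external STT_OBJECT symbols with absolute or PC-relative relocations rather than through the Global Offset Table (GOT). When the accessed symbol is defined in a shared object, a copy relocation (an ugly hack) is produced at link time: the executable copies a number of bytes from the shared object, and the shared object's references to the symbol are redirected to the executable's copy.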

With copy relocations, the program stays consistent only if the symbol is preemptible. If it were non-preemptible, the executable and the shared object would operate on different copies.

// On x86, direct access is generated in both -fno-PIC and -fPIE.
// stdout requires a copy relocation.
// If libc.so is linked with -Bsymbolic, modifying stdout in the executable will not be observed by libc.so, vice versa.
#include <stdio.h>
int main() { fprintf(stdout, "%d", 42); }

The overhead of PLT setup

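The link-option approach is elegant. On architectures that have instructions for PC-relative data access, it is perfect. On other architectures, an external function call that may need a PLT carries extra overhead, because link options do not affect the code generated at compile time. Below are the instructions generated for this C program: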

#ifdef HIDDEN
void ext() __attribute__((visibility("hidden")));
#else
void ext();
#endif
void f() { ext(); ext(); }

On architectures with PC-relative data access instructions, the generated instruction sequence is the same whether or not ext is hidden.

i386

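On i386, the ABI requires ebx to point to the GOT base when a PLT entry is entered, because the PLT entry uses ebx-relative addressing to load the function pointer from .got.plt (or, rarely, .plt.got). ebx therefore has to be set up before an external function is called. Because ebx is callee-saved, a function that modifies ebx must restore it before returning, which adds a save/restore pair.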

# ext is STV_DEFAULT
f:
	pushl	%ebx
	call	__x86.get_pc_thunk.bx
	addl	$_GLOBAL_OFFSET_TABLE_, %ebx
	subl	$8, %esp
	call	ext@PLT
	call	ext@PLT
	addl	$8, %esp
	popl	%ebx
	ret

# ext is STV_HIDDEN
f:
	subl	$12, %esp
	call	ext
	addl	$12, %esp
	jmp	ext

PowerPC64

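POWER10 has PC-relative data access instructions. Before that, ELFv2 used the TOC (Table Of Contents) to reduce the cost of lacking PC-relative instructions. The ABI requires r2 to point to the TOC base of the current module (executable or shared object). An external function call may modify r2, so r2 has to be restored after the bl instruction. The compiler places a nop after every bl to an external function (they must have been inspired by the MIPS delay slot), and the linker patches it into an ld instruction when needed. Besides the cost of the extra nop, this also defeats tail calls.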

# ext is STV_DEFAULT
f:
.Lfunc_begin0:
.Lfunc_gep0:
	addis 2, 12, .TOC.-.Lfunc_gep0@ha
	addi 2, 2, .TOC.-.Lfunc_gep0@l
.Lfunc_lep0:
	.localentry	f, .Lfunc_lep0-.Lfunc_gep0
	mflr 0
	std 0, 16(1)
	stdu 1, -32(1)
	bl ext
	nop
	bl ext
	nop
	addi 1, 1, 32
	ld 0, 16(1)
	mtlr 0
	blr

# ext is STV_HIDDEN
f:
.Lfunc_begin0:
.Lfunc_gep0:
	addis 2, 12, .TOC.-.Lfunc_gep0@ha
	addi 2, 2, .TOC.-.Lfunc_gep0@l
.Lfunc_lep0:
	.localentry	f, .Lfunc_lep0-.Lfunc_gep0
	mflr 0
	std 0, 16(1)
	stdu 1, -32(1)
	bl ext
	addi 1, 1, 32
	ld 0, 16(1)
	mtlr 0
	b ext

Surprisingly, the two kinds of call make no difference on MIPS, perhaps because its instruction sequences are already long...

Compile-time solution: hidden aliases

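glibc's approach is:

Define memcpy with STV_DEFAULT visibility together with a hidden alias __GI_memcpy
In an internal header, declare memcpy with an asm label that points to the hidden __GI_memcpy
Do not use -fno-builtin-memcpy, so that memcpy calls can still be expanded inline
Include the internal header wherever memcpy is referenced. memcpy is then either expanded inline or becomes a call to __GI_memcpy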

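Because __GI_memcpy is hidden, the compiler and the linker know that its definition is inside the module and cannot be provided by another module, so the PLT setup overhead is avoided.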

extern void *memcpy(void *__restrict, const void *__restrict, unsigned long);
extern __typeof(memcpy) memcpy __asm__("__GI_memcpy") __attribute__((visibility("hidden")));

void f() {
  memcpy(...);
}

So why not call __GI_memcpy directly? Because mangling function names like that makes for a poor user experience...
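The pieces of the glibc scheme can be put together in one file. This is a minimal sketch that assumes GCC or Clang on an ELF target. The function copy_bytes and the symbols __GI_copy_bytes and __EI_copy_bytes are hypothetical stand-ins that only mimic glibc's naming, and in a real library the declaration, the definition and the caller would live in different files.

```c
#include <stddef.h>

/* "Internal header": inside the library, the C name copy_bytes refers to the
   hidden symbol __GI_copy_bytes. The asm label comes before the first use. */
extern void *copy_bytes(void *, const void *, size_t)
    __asm__("__GI_copy_bytes") __attribute__((visibility("hidden")));

/* The definition is therefore emitted under the symbol name __GI_copy_bytes. */
void *copy_bytes(void *dst, const void *src, size_t n) {
  unsigned char *d = dst;
  const unsigned char *s = src;
  for (size_t i = 0; i < n; i++)
    d[i] = s[i];
  return dst;
}

/* The exported STV_DEFAULT symbol "copy_bytes", an alias of the hidden one. */
extern __typeof(copy_bytes) __EI_copy_bytes __asm__("copy_bytes")
    __attribute__((alias("__GI_copy_bytes")));

/* An internal caller: the call binds to __GI_copy_bytes, which is known to be
   defined inside the module, so no PLT entry is needed. */
void *duplicate_block(void *dst, const void *src, size_t n) {
  return copy_bytes(dst, src, n);
}
```

Callers keep writing copy_bytes in C, and only the emitted symbol name changes.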

musl also uses hidden aliases starting with __ in many places, and calls them when the PLT setup overhead needs to be avoided, for example:

// src/include/sys/mman.h
__attribute__((__visibility__("hidden"))) void *__mmap(void *, size_t, int, int, int, off_t);

// src/mman/mmap.c
void *__mmap(void *start, size_t len, int prot, int flags, int fd, off_t off)
{
  ...
}

extern __typeof(__mmap) mmap __attribute__((__weak__, __alias__("__mmap")));

`STV_PROTECTED`

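Besides STV_DEFAULT and STV_HIDDEN, there is another visibility that suits this scenario better: STV_PROTECTED. One drawback is that STV_PROTECTED has historically been used very little and is undertested, although the lack of testing should not have been a problem in recent years.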

STV_PROTECTED is defined in the ELF specification as follows. The symbol is exposed to other components yet non-preemptible, which looks like the perfect solution.

A symbol defined in the current component is protected if it is visible in other components but not preemptable, meaning that any reference to such a symbol from within the defining component must be resolved to the definition in that component, even if there is a definition in another component that would preempt by the default rules.

So why does libc not use STV_PROTECTED?

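Traditionally, on many architectures, -fno-PIC code takes the address of an external function the same way it accesses an external STT_OBJECT symbol: with an absolute or PC-relative relocation, not through the Global Offset Table (GOT). The linker creates a PLT entry in the executable whose symbol has st_value!=0 (the STT_FUNC counterpart of a copy relocation). The address of the function in the executable is the run-time address of this PLT entry. For pointer equality, the linker tries to make this PLT entry preempt the definition in the shared object. STV_PROTECTED, however, does not allow preemption, and the conflict leads to an error.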

// b.c - b.so
__attribute__((visibility("protected"))) void foo() {}
void *addr_foo() { return (void *)foo; }

// a.c - exe
#include <stdio.h>
void foo();
int main() { printf("%p\n", (void *)foo); }

Linking reports an error:

# ld.lld
error: cannot preempt symbol: foo

# ld.bfd
relocation R_X86_64_32 against protected symbol `foo' can not be used when making a PIE object; recompile with -fPIE

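If the linker did allow the absolute or PC-relative relocation in the executable, the address of foo seen by the executable at run time would differ from the address of foo seen by the shared object.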

asm label in Clang

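For most functions that the compiler knows nothing about (no built-in knowledge, cannot be expanded inline (as memcpy can) or replaced with another implementation (printf->puts), and no intrinsic lowers to them), implementing asm labels is straightforward. However, in Clang before today, asm labels had no effect on quite a few library functions (including the most important ones, memset/memcpy). I fixed this today in D88712.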

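The main difficulty is this: if the C function foo has built-in semantics X, and the symbol foo has built-in semantics X, then refusing to compile the C function foo to the symbol foo would be illogical. In other words, the following three statements cannot all hold.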

The frontend function foo has built-in semantics X
The symbol foo has built-in semantics X
The C function foo cannot be compiled to the symbol foo

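In glibc's case, the first statement is required. If the compiler pretended not to know memcpy, it could not expand memcpy calls whose n is a constant, which may hurt performance. This also means that -fno-builtin-memcpy (or the stronger -fno-builtin and -ffreestanding) is unacceptable. The third statement is required too, because renaming is the whole point of the asm label...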
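To make the first point concrete, here is the kind of call that benefits from built-in knowledge. The helper load_u32 is a hypothetical example, not glibc code. With the built-in enabled, GCC and Clang typically turn the constant-size memcpy into a single load when optimizing, whereas -fno-builtin-memcpy would leave a real function call.

```c
#include <stdint.h>
#include <string.h>

/* n is the constant sizeof(uint32_t), so a compiler that knows the semantics
   of memcpy may replace the call with one 4-byte load. */
uint32_t load_u32(const unsigned char *p) {
  uint32_t v;
  memcpy(&v, p, sizeof v);
  return v;
}
```

Comparing the output of -O2 with and without -fno-builtin-memcpy shows the difference.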

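So we have to refute the second statement. In other words, after Clang generates LLVM IR, IR optimizations must not assume that the symbol foo has built-in semantics X. Neither GCC nor Clang can do this today. For Clang to implement it, an LLVM IR feature supporting renaming would have to be introduced. Without renaming support, the optimizer would need to know that intrinsics which lower to the symbol foo must not be generated. This functionality is currently missing.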

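Not being able to refute the second statement leaves a slight inconsistency in the system. glibc handles it with a second layer of renaming, adding asm("memcpy = __GI_memcpy;") to every translation unit: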

memcpy = __GI_memcpy;

# If __GI_memcpy is undefined, this produces a relocation referencing __GI_memcpy.
call memcpy@PLT

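For translation units that do not #include <string.h>, the asm label cannot substitute for this renaming.
For memcpy calls synthesized by GCC during optimization, this asm line guarantees that GNU as performs the renaming.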
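The assembler-level rename can be tried in a single file. This is a minimal sketch that assumes GCC or Clang targeting ELF, with an assembler that accepts symbol assignments. The names scale and __GI_scale are hypothetical, and the hidden target is defined in the same file only to keep the example self-contained (in the memcpy case above, __GI_memcpy is undefined in most translation units).

```c
/* Assembler-level rename: references to the symbol "scale" in this
   translation unit's assembly are treated as references to "__GI_scale". */
__asm__("scale = __GI_scale");

/* The hidden implementation. */
__attribute__((visibility("hidden"))) int __GI_scale(int x) { return 2 * x; }

/* Deliberately declared without an asm label: the compiler emits calls to
   "scale" and the assembler redirects them. */
int scale(int x);

int scale_twice(int x) { return scale(scale(x)); }
```

Since the rename happens in the assembler, it also covers calls that the compiler synthesizes on its own, which an asm label on a C declaration cannot reach.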

I have another patch that implements this GNU as logic.

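In addition, Clang supports another renaming syntax inherited from Sun Studio: #pragma redefine_extname oldname newname. Internally this is implemented with asm labels. The GCC documentation mentions this feature (https://gcc.gnu.org/onlinedocs/gcc/Symbol-Renaming-Pragmas.html), but it did not work in my testing...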

Redeclaration

A function can be redeclared multiple times. The requirement is that an asm label is added before the first use.

typedef unsigned long size_t;

// #pragma redefine_extname memcpy __GI_memcpy // before the first use, works
extern void *memcpy(void *, const void *, size_t);

#pragma redefine_extname memcpy __GI_memcpy // before the first use, works

void *test_memcpy(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }

// #pragma redefine_extname memcpy __GI_memcpy // after the first use, does not work

As https://reviews.llvm.org/D88712 mentions, the asm label does not apply to Clang generated memcpy calls. Certain optimization passes can synthesize built-in function calls. It's really difficult to fix, also related to the function-at-a-time mode.