2020-09-27

gcov与LLVM中的实现

打算以后不定期写一点LLVM的学习(开发)笔记。写作上不想过多花时间(加语文水平所限...)，所以字句不作过多斟酌。

gcov

https://gcc.gnu.org/onlinedocs/gcc/Gcov.html

_Optimally Profiling and Tracing Programs_描述了一个edge-frequency/edge-placement problem。选择control-flow graph的一些边，加上监控代码，推导所有边的执行次数。 gcov是一种实现。

在gcov的模型中，一个源文件包含若干函数，一个函数包含若干基本块，一个基本块占据若干行，这些信息保存在.gcno文件中。 Instrument程序，在基本块间转移时记录边的执行次数，程序退出时为每个translation unit输出一个.gcda文件。 .gcda文件可以累计多次程序执行的计数。

根据.gcno描述的图信息和.gcda记录的边的执行次数，可以计算基本块的执行次数，进而计算行的执行次数。

使用流程：

gcc --coverage a.c b.c 生成a.gcno和b.gcno，instrument(记录边执行次数)并链接libgcov.a
./a.out 程序退出时输出a.gcda和b.gcda
gcov -t a. b. 生成coverage报告

有两个主要GCC选项：

-ftest-profile: 输出notes file (.gcno)，描述GCC版本、函数基本信息(名称、起始行列号、结束行列号(GCC 8添加)、校验码(源文件更新后可以判断.gcno/.gcda是否过期))、基本块 (flags (是否树边、是否异常控制流添加的)、包含的行)
-fprofile-arcs: instrument程序，程序执行后更新data files (.gcda)，描述程序执行次数和边执行次数。可以设置环境变量GCOV_PREFIX控制.gcda文件的创建路径

.gcno/.gcda均由一串uint32组成，文件格式参见gcc/gcov-io.h。编码简单，但是空间使用有点浪费。简单说来，-ftest-profile描述控制流图而-fprofile-arcsinstrument程序。这两个选项在GCC 3.4定形，之前版本用.bb{,g}和.da。

-fprofile-arcs亦可用作链接选项，链接libgcov.{a,so}。通常-ftest-profile -fprofile-arcs同时指定，可以简写为--coverage。(目前为止我唯一的CPython commit就是把-lgcov换成了--coverage)

gcov有个经典前端lcov，原先为Linux kernel设计。lcov设计的一套中间格式应用非常广泛，比如Bazel的coverage功能就用lcov作为所有程序设计语言coverage的中间格式。 http://ltp.sourceforge.net/coverage/lcov/geninfo.1.php

gcov常用的命令行选项：

-i: GCC 4.9~8生成一个近似lcov的intermediate format。GCC 9之后生成.json.gz(有点讨厌的上游改动～)。如果只是想单纯看行执行次数，从JSON提取信息比提取intermediate format麻烦多了
-r: 只输出编译时(源文件使用相对路径名)的.gcno/.gcda的信息。可跳过系统头文件
-s prefix: 若源文件路径名前缀为prefix，去除后变为相对路径，可使-r生效
-t: 输出到stdout，而非.gcov文件

GCC 4.7、4.8、8、9有打破backward compatibility的改动。如果编译程序用的GCC版本和gcov版本不一致会得到warning，如果没有breaking changes，其实是良性的：

# gcov removes the extension name, so you can use any of a. a.c a.gcno a.gcda
% gcc-9 --coverage a.c; gcov-10 -t a.
a.gcno:version 'A93*', prefer 'B01*'
a.gcda:version 'A93*', prefer version 'B01*'
...

行执行次数

源文件和函数的summary info都会输出Lines executed。对于某个基本块占据的一行，标记该基本块所在的函数包含该行。若该基本块执行次数大于0，则给所在的函数的lines_executed计数器+1。

GCC 10的gcov statistics实现的基本流程：

add_line_counts(coverage_info *coverage, function_info *fn)
  // Accmulate info for one function
  for each block
    for every location (a pair of (source file index, lines))
      for every line
        if fn is grouped && line belongs to the range of fn
          set line to the line_info of the function
          line->exists = 1
          increment line->count
        else
          set line to the line_info of the file
          line->exists = 1
          increment function's lines & lines_executed if necessary
          increment line->count

accumulate_line_info(line_info *line, source_info *src, bool add_coverage)
  // Accumulate info for one line
  for each outgoing arc
    arc->cs_count = arc->count
  compute cycle counts
  line count = incoming counts + cycle counts
  if line->exists && add_coverage
    src->coverage.lines++
    if line->count
      src->coverage.lines_executed++

accumulate_line_counts(source_info *src)
  // Accumulate info for one file
  for each grouped function
    for each line
      call accumulate_line_info with add_coverage=false
  for each line
    call accumulate_line_info with add_coverage=true
  for each grouped function
    for each fn_line where fn_line->exists
      increment src's lines and lines_executed if necessary

generate_results(const char *filename)
  for each function
    call add_line_counts
  for each source file
    call accumulate_line_counts
    call file_summary

得到lines和lines_executed后，Lines executed的百分比就是简单的lines和lines_executed的商。

is_group概念是用于处理在同一位置的多个函数的(如template; https://gcc.gnu.org/PR48463)。gcov -f会显示多个函数：

------------------
| _ZN3FooIfEC2Ev:
|    6|      2|  {
|    7|      2|    b = 123;
|    8|      2|  }
------------------
| _ZN3FooIiEC2Ev:
|    6|      1|  {
|    7|      1|    b = 123;
|    8|      1|  }
------------------

循环执行次数

行执行次数(上面伪代码的compute cycle counts)使用了一个启发式规则：incoming counts+cycle counts。 Incoming counts：不在同一行的前驱进入该基本块的执行次数和。 Cycle counts：在同一行的基本块之间的循环执行次数。可以这样理解：如果有一个循环写在同一行上(macro里的循环也属于这一类)，那么循环执行次数也应该计入该行的执行次数。

对于一个for，它的执行次数部分来自上一行(entering block(s))，部分来自循环内的latch(有back edge回到header)。

对于下面的简单程序，相关的基本块如下：

bb0: entry block
bb1: exit block (not considered here)
bb2: entering block (also preheader)
bb5: header
bb3: loop body
bb4: latch

它们之间的执行次数：(bb0,bb2,1), (bb2,bb5,1), (bb5,bb3,11), (bb3,bb4,11), (bb4,bb5,11) 根据编译器版本，bb5可能会占据for所在行，或不占据，有下面两种方案。使用incoming counts+cycle counts的策略可以计算出相同的行执行次数结果(for循环计算为执行了12次)。

int main() {                    // bb2
  for (int i = 0; i < 11; i++)  // bb2, bb4, bb5
    i = i;                      // bb3
}

int main() {                    // bb2
  for (int i = 0; i < 11; i++)  // bb2, bb4
    i = i;                      // bb3
}

循环执行次数是network flow problems中circulation problem的应用：maximum circulation就是循环流量。

gcov原先使用 J. C. Tiernan, An Efficient Search Algorithm to Find the Elementary Circuits of a Graph, Comm ACM 1970 描述的算法(the worst-case time bound is exponential in the number of elementary circuits)，枚举圈(cycle, aka simple circuit, aka elementary circuit)然后cycle cancelling。 2016年GCC PR67992用了改进的Donald B. Johnson算法，时间复杂度：，其中c是圈数。最坏情况下也是exponential in the size of the graph的。 (Boost把该算法归功于K. A. Hawick and H. A. James。gcov继承了这一名字。但实际上这篇论文并没有对Johnson的算法进行改进)

事实上每次消圈时至少把一条边的流量降低为0，因此最多有个圈。PR90380 (target: GCC 8.3)跳过非正数边，把时间复杂度降低到(理论可以做到，但实现中有一个线性的比较)。

llvm-cov gcov也有GCC PR90380修复的指数复杂度问题，该问题被https://reviews.llvm.org/D93036修复。实际上，找个圈使用cycle enumeration的Donald B. Johnson算法意义不大，可以换成普通的cycle detection。 https://reviews.llvm.org/D93073实现了简化，将复杂度降低到。

进一步思考：我们处理的是reducible flow graph(对于irreducible flow graph没有直观合理的cycle counts)，每一个natural loop都有一个back edge标识。因此构建dominator tree，找back edges，找出natural loops并把arcs计数器清零(清零的原因是我们还要计算从其他行进入的执行次数，要避免重复计数)，复杂度可以优化到。实现semi-NCA算法(复杂度是)并不复杂，但找出natural loops会比较麻烦。因此找natural loops的方案没有实用价值。

construct dominator tree
find back edges (the arc tail dominates the arc header)
for each back edge
  find the associated natural loop
  increase cycle counts
  decrease arc->count for each in-loop arc
compute incoming counts
line count = incoming counts + cycle counts

gcov in LLVM

最早由Nick Lewycky在2011年贡献。

使用流程示意：

clang --coverage -c a.c b.c 生成a.gcno和b.gcno，instrument(记录边执行次数)
clang --coverage a.o b.o 链接libclang_rt.profile-$arch.a (实现在compiler-rt)
./a.out 程序退出时输出a.gcda和b.gcda
llvm-cov gcov -t a. b. 生成coverage报告

llvm-cov gcov接口模拟gcov。在Clang 11之前(我施工之前)，Clang吐出的.gcno/.gcda是一种伪GCC 4.2格式，实际上没有办法用任何gcov版本阅读:(

Instrumentation

Instrument一条边的指令非常简单，给一个计数器加一。多线程程序同时操作一个计数器就会有data race。 GCC追求精确，如果编译时指定-pthread，计数器会用atomic operation，多线程竞争带来的性能损失可能很大(https://gcc.gnu.org/PR97065)。 Clang -fprofile-generate和-fprofile-arcs没有用atomic operation。多线程并发执行可能会有部分计数丢失。

// g++
addq    $1, __gcov0._Z3fooRSt6vectorIiSaIiEE(%rip)
// g++ -pthread
lock addq       $1, __gcov0._Z3fooRSt6vectorIiSaIiEE(%rip)

Clang的PGO会检测循环，把计数器+1转换为+N，在性能提升的同时可以大幅改善多线程冲突导致计数丢失的问题(https://reviews.llvm.org/D34085)。

替代atomic operation的一种思路是在function prologue处获取CPU id(可惜没有移植性好的函数)，获得一个per-cpu counter array。函数内的所有计数器都相对于该array，假设不发生process migration。 Linux上若有rseq倒是可以访问__rseq_abi.cpu_id，但使用seq需要runtime在每个新建thread上执行rseq系统调用…… (假如要保护process migration，每个计数器增加都用rseq会消耗过多__rseq_cs) 这种方法还有一个缺点是一个寄存器被固定用作计数器用途。

PGO/coverage选项可以参阅https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html。 GCC的profile guided optimization -fprofile-generate隐含-fprofile-arcs，亦即PGO和coverage复用同一套instrument模块和runtime。这么做也有缺点，因为PGO和coverage取舍上有所不同。比如coverage用户可能会更在意行执行次数的准确性，假如三个计时器有A=B+C的关系而实际上不成立，会给用户造成困惑。而对于PGO来说，损失10%的边执行次数未尝不可接受。

Clang其实还有另一套coverage机制，在2014年开发：-fprofile-instr-generate -fcoverage-mapping。 clangCodeGen实现了-fprofile-instr-generate，既用于Clang的PGO instrumentation，也用于coverage。假如你用llvm-cov导出coverage信息，可以发现它有非常精确的列号和覆盖区间信息，因为它能直接操作AST，知道每一个语句的起始和结束位置。 --coverage则用了调试信息，大部分时候只有点信息，而没有区间信息，甚至有些时候没有列号信息。

-fprofile-instr-generate instrument一个函数时记录所有边的执行次数。假如你学过电路，可能知道Kirchhoff's circuit law。若一个顶点有E条边，那么记录E-1条边的执行次数，即可计算剩下一条边的执行次数。对于一个V个基本块、E条边的图，只需要测量E-V+1条边即可计算所有边的执行次数。在控制流图中添加一个单个后继的basic block，不需要instrument新边。在控制流图中添加一个两个后继的basic block，只需要instrument一条新边。这一优化可以少instrument超过一半的边。我最近给Clang 12添加了这一优化，可能有10%~30%的性能提升，性能和关闭value profiling的-fprofile-generate类似，比-fprofile-instr-generate快。

当然这么做并非没有缺陷：对于异常控制流，如fork/execve，少instrument一些边可以导致计数不准确。

2015年Clang添加了-fprofile-generate(选项名模拟GCC，实际上并不兼容)，IR-level PGO。作为一个IR pass，它在优化管线内的位置比较灵活，而且可以方便地用于其他语言前端(ldc、Rust)。比如，在instrument前添加一个inliner可以大幅减少记录的边数。GCC也有类似优化。对于下面的程序，bar很有可能没有行执行次数：

int bar() {
  return 1;
}

__attribute__((noinline))
int foo() {
  return bar() + 2;
}

clang --coverage目前没有在instrument前添加inliner，因此bar的执行次数被准确地保留下来。你可以把bar替换成一个/usr/include/c++/里的某个C++ STL，先执行inling的好处就更加明显了：很多时候你不想了解C++ STL里复杂实现的各个部分的行执行次数，然而他们却占据了好多计数器。

还有一个有意思的地方，如果用了--coverage -fsanitize=thread，thread sanitizer会添加在优化管线非常靠后的地方。计数器应该用atomic operation，否则会触发data race报告。我在Clang 12中修复了这个问题。

C++的函数默认都会抛异常😹。为了计数器精确，GCC会拆分基本块，添加fake arcs，使得一个外部函数调用和下一条语句的计数可能不同。对于下面的程序，bb 0和bb 1分别表示entry block/exit block，bb 2~5之间有边。

extern void f();
void g() {   // bb 2
  puts("0"); // bb 2
  f();       // bb 3
  puts("1"); // bb 4
}            // bb 5

如果没有异常、dxit、execve、fork等干涉控制流的函数，函数g只需要一个计数器，让所有语句共享。而GCC从bb 2~4各引一条边到bb 1，导致了三个新的计数器。Clang没有实现fake arcs。

我的gcov in LLVM贡献

大部分修改在LLVM 11.0.0中。

几乎重写了llvm-cov gcov的所有统计代码，支持读取GCC 3.4~10 .gcno/.gcda
给llvm-cov gcov实现了-i -r -s -t
让Clang和libclang_rt.profile-支持GCC 4.7~9所有instrument改动。可以用-Xclang -coverage-version='409*'生成GCC 4.9兼容的.gcno/.gcda；A93*则是GCC 9.3，B01*则是GCC 10.1
添加了big-endian实现

lcov

Create llvm-gcov:

1 2	#!/bin/sh exec llvm-cov gcov "$@"

clang --coverage a.c -o a  # generate a.gcno
./a  # generate a.gcda
lcov -c --gcov-tool $PWD/llvm-gcov -d . -o output.lcov
genhtml output.lcov -o out

Specify --base-directory to specify the source directory.

gcov in Linux kernel

有意思的是libgcov在Linux kernel中也有实现：https://github.com/torvalds/linux/blob/master/kernel/gcov/gcc_4_7.c “关机输出.gcda文件”显然不是好的用户体验～.gcno/.gcda实际上是用debugfs提供的。 kernel/gcov/clang.c是Android提供的用于Clang的runtime实现。

gcov used J. C. Tiernan, An Efficient Search Algorithm to Find the Elementary Circuits of a Graph, Comm ACM 1970. The worst-case time bound is exponential in the number of elementary circuits. It enumerated cycles (aka simple circuit, aka elementary circuit) and performed cycle cancelling. In 2016, the resolution to GCC PR67992 switched to Donald B. Johnson's algorithm to improve performance. The theoretical time complexity is where is the number of cycles, which is exponential in the size of the graph. (Boost attributed the algorithm to K. A. Hawick and H. A. James, and gcov inherited this name. However, that paper did not improve Johnson's algorithm.)

Actually every step of cycle cancelling decreases the count of at lease one arc to 0, so there are at most cycles. The resolution to PR90380 (target: GCC 8.3) skipped non-positive arcs and decreased the time complexity to (in theory it could be but the implementation has a linear scan).

Thinking more, we are processing a reducible flow graph (there is no intuitive cycle count for an irreducible flow graph). Every natural loop is identified by a back edge. By constructing a dominator tree, finding back edges, identifying natural loops and clearing the arc counters (we will compute incoming counts so we clear counters to prevent duplicates), the time complexity can be decreased to . In practice, the semi-NCA algorithm (time complexity: , but considered faster than the almost linear Lengauer-Tarjan's algorithm) is not difficult to implement, but identifying natural loops is troublesome. So the method is not useful.