Clang is a C/C++ compiler that generates LLVM IR and utilitizes LLVM
to generate relocatable object files. Using the classic three-stage
compiler structure, the stages can be described as follows:
1
C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(back end)=> relocatable object file
If we follow the internal representations of instructions, a more
detailed diagram looks like this: 1
C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(instruction selector)=> MachineInstr =(AsmPrinter)=> MCInst =(assembler)=> relocatable object file
LLVM and Clang are designed as a collection of libraries. This post describes how different libraries work together to create the final relocatable object file. I will focus on how a function goes through the multiple compilation stages.
Compiler frontend
The compiler frontend primarily comprises the following libraries:
- clangDriver
- clangFrontend
- clangParse and clangSema
- clangCodeGen
The clangDriver library is located in clang/lib/Driver/
and clang/include/Driver/
, while other libraries have
similar structures. In general, when a header file in one library (let's
call it library A) is needed by another library, it is exposed to
clang/include/$A/
. Downstream projects can also include the
header file from clang/include/$A/
.
Let's use a C++ source file as an example.
1 | % cat a.cc |
The entry point of the Clang executable is implemented in
clang/tools/driver/
. clang_main
creates a
clang::driver::Driver
instance, calls
BuildCompilation
to construct a
clang::driver::Compilation
instance, and then calls
ExecuteCompilation
.
clangDriver
clangDriver parses the command line arguments, constructs compilation actions, assigns actions to tools, generates commands for these tools, and executes the commands.
You may read Compiler driver and cross compilation for additional information.
1 | BuildCompilation |
For clang++ -g a.cc
, clangDriver identifies the
following phases: preprocessor, compiler (C++ to LLVM IR), backend,
assembler, and linker. The first several phases can be performed by one
single clang::driver::tools::Clang
object (also known as
Clang cc1), while the final phase requires an external program (the
linker).
1 | % clang++ -g a.cc '-###' |
cc1_main
in clangDriver calls
ExecuteCompilerInvocation
defined in clangFrontend.
clangFrontend
clangFrontend
defines CompilerInstance
,
which manages various classes, including
CompilerInvocation
, DiagnosticsEngine
,
TargetInfo
, FileManager
,
SourceManager
, Preprocessor
,
ASTContext
, ASTConsumer
, and
Sema
.
1 | ExecuteCompilerInvocation |
In ExecuteCompilerInvocation
, a FrontAction
is created based on the CompilerInstance
argument and then
executed. When using the -emit-obj
option, the selected
FrontAction
is an EmitObjAction
, which is a
derivative of CodeGenAction
.
During FrontendAction::BeginSourceFile
, several classes
mentioned earlier are created, and a BackendConsumer
is
also established. The BackendConsumer
serves as a wrapper
around CodeGenerator
, which is another derivative of
ASTConsumer
. Finally, in
FrontendAction::BeginSourceFile
,
CompilerInstance::setASTConsumer
is called to create a
CodeGenModule
object, responsible for managing an LLVM IR
module.
In FrontendAction::Execute
,
CodeGenAction::ExecuteAction
is invoked, primarily handling
the compilation of LLVM IR files. This function, in turn, calls the base
function ASTFrontendAction::ExecuteAction
, which, in
essence, triggers the entry point of clangParse
:
ParseAST
.
clangParse and clangSema
clangParse
consumes tokens from clangLex
and invokes parser actions, many of which are named Act*
,
defined in clangSema
. clangSema
performs
semantic analysis and generates AST nodes.
1 | ParseAST |
When ParseTopLevelDecl
consumes a tok::eof
token, implicit instantiations are performed.
In the end, we get a full AST (actually a misnomer as the
representation is not abstract, not only about syntax, and is not a
tree). ParseAST
calls virtual functions
HandleTopLevelDecl
and
HandleTranslationUnit
.
clangCodeGen
BackendConsumer
defined in clangCodeGen overrides
HandleTopLevelDecl
and HandleTranslationUnit
to generate unoptimized LLVM IR and hand over the IR to LLVM for machine
code generation.
1 | BackendConsumer::HandleTopLevelDecl |
BackendConsumer::HandleTopLevelDecl
generates LLVM IR
for each top-level declaration. This means that Clang generates a
function at a time.
BackendConsumer::HandleTranslationUnit
invokes
EmitBackendOutput
to create an LLVM IR file, an assembly
file, or a relocatable object file. EmitBackendOutput
establishes an optimization pipeline and a machine code generation
pipeline.
Now let's explore CodeGenFunction::EmitFunctionBody
.
Generating IR for a variable declaration and a return statement involve
the following functions, among others: 1
2
3
4
5
6
7
8
9
10
11EmitFunctionBody
EmitCompoundStmtWithoutScope
EmitStmt
EmitSimpleStmt
EmitDeclStmt
EmitDecl
EmitVarDecl
EmitStopPoint
EmitReturnStmt
EmitScalarExpr
ScalarExprEmitter::EmitBinOps
After generating the LLVM IR, clangCodeGen proceeds to execute
EmitAssemblyHelper::RunOptimizationPipeline
to perform
middle-end optimizations and subsequently
EmitAssemblyHelper::RunCodegenPipeline
to generate machine
code.
For our integer division example, the function foo
in
the unoptimized LLVM IR looks like this (attributes are omitted):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24; Function Attrs: mustprogress noinline uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) #0 {
entry:
%a.addr = alloca i32, align 4
%b.addr = alloca float, align 4
%c.addr = alloca i32, align 4
%s = alloca i32, align 4
store i32 %a, ptr %a.addr, align 4, !tbaa !5
store float %b, ptr %b.addr, align 4, !tbaa !9
store i32 %c, ptr %c.addr, align 4, !tbaa !5
call void @llvm.lifetime.start.p0(i64 4, ptr %s) #4
%0 = load i32, ptr %a.addr, align 4, !tbaa !5
%1 = load float, ptr %b.addr, align 4, !tbaa !9
%2 = load float, ptr %b.addr, align 4, !tbaa !9
%cmp = fcmp oeq float %1, %2
%conv = zext i1 %cmp to i32
%add = add nsw i32 %0, %conv
store i32 %add, ptr %s, align 4, !tbaa !5
%3 = load i32, ptr %s, align 4, !tbaa !5
%4 = load i32, ptr %c.addr, align 4, !tbaa !5
%call = call noundef i32 @_Z3divIiET_S0_S0_(i32 noundef %3, i32 noundef %4)
call void @llvm.lifetime.end.p0(i64 4, ptr %s) #4
ret i32 %call
}
Compiler middle end
EmitAssemblyHelper::RunOptimizationPipeline
creates a
pass manager to schedule the middle-end optimization pipeline. This pass
manager executes numerous optimization passes and analyses.
The option -mllvm -print-pipeline-passes
provides
insight into these passes: 1
2% clang -c -O1 -mllvm -print-pipeline-passes a.c
annotation2metadata,forceattrs,declare-to-assign,inferattrs,coro-early,...
For our integer division example, the optimized LLVM IR looks like
this: 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) local_unnamed_addr #0 {
entry:
%cmp = fcmp ord float %b, 0.000000e+00
%conv = zext i1 %cmp to i32
%add = add nsw i32 %conv, %a
%div.i = sdiv i32 %add, %c
ret i32 %div.i
}
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @main() local_unnamed_addr #1 {
entry:
%call = tail call noundef i32 @_Z3fooifi(i32 noundef 3, float noundef 2.000000e+00, i32 noundef 1)
ret i32 %call
}
The most notaceable differences are the following
SROAPass
runs mem2reg and optimizes out manyAllocaInst
s.InstCombinePass
(InstCombinerImpl::visitFCmpInst
) replacesfcmp oeq float %1, %1
withfcmp ord float %1, 0.000000e+00
, canonicalize NaN testing toFCmpInst::FCMP_ORD
.InlinerPass
inlines the instantiateddiv
function into its callerfoo
Compiler back end
The demarcation between the middle end and the back end may not be
entirely distinct. Within
LLVMTargetMachine::addPassesToEmitFile
, several IR passes
are scheduled. It's reasonable to consider these IR passes (everything
before addCoreISelPasses
) as part of the middle end, while
the phase beginning with instruction selection can be regarded as the
actual back end.
Here is an overview of
LLVMTargetMachine::addPassesToEmitFile
:
1 | LLVMTargetMachine::addPassesToEmitFile |
These IR and machine passes are scheduled by the legacy pass manager.
The option -mllvm -debug-pass=Structure
provides insight
into these passes: 1
clang -c -O1 a.c -mllvm -debug-pass=Structure
Instruction selector
There are three instruction selectors: SelectionDAG, FastISel, and GlobalISel. FastISel is integrated within the SelectionDAG framework.
For most targets, FastISel is the default for clang -O0
while SelectionDAG is the default for optimized builds. However, for
most AArch64 -O0
configurations, GlobalISel is the
default.
To force using GlobalISel, we can specify
-mllvm -global-isel
.
SelectionDAG
See https://llvm.org/docs/WritingAnLLVMBackend.html#instruction-selector.
1
2
3
4
5SectionDAG: normal code path
LLVM IR =(visit)=> SDNode =(DAGCombiner,LegalizeTypes,DAGCombiner,Legalize,DAGCombiner,Select,Schedule)=> MachineInstr
SectionDAG: FastISel (fast but not optimal)
LLVM IR =(FastISel)=> MachineInstr
1 | TargetPassConfig::addCoreISelPasses |
Each backend implements a derived class of SelectionDAGISel. For
example, the X86 backend implements X86DAGToDAGISel
and
overrides runOnMachineFunction
to set up variables like
X86Subtarget
and then invokes the base function
SelectionDAGISel::runOnMachineFunction
.
SelectionDAGISel
creates a
SelectionDAGBuilder
. For each basic block,
SelectionDAGISel::SelectBasicBlock
iterates over all IR
instructions and calls SelectionDAGBuilder::visit
on them,
creating a new SDNode
for each Value
that
becomes part of the DAG.
The initial DAG may contain types and operations that are not
natively supported by the target.
SelectionDAGISel::CodeGenAndEmitDAG
invokes
LegalizeTypes
and Legalize
to convert
unsupported types and operations to supported ones.
ScheduleDAGSDNodes::EmitSchedule
emits the machine code
(MachineInstr
s) in the scheduled order.
FinalizeISel
expands instructions marked with
MCID::UsesCustomInserter
.
Let's take a closer look at our foo
function.
For the IR instruction
%cmp = fcmp ord float %b, 0.000000e+00
,
SelectionDAGBuilder::visit
creates a new
SDNode
with the opcode ISD::SETCC
(t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
).
1
2SelectionDAGBuilder::visit
SelectionDAGBuilder::visitFcmp
A new SDNode
with the opcode
ISD::ZERO_EXTEND
is created for
%conv = zext i1 %cmp to i32
.
For the IR instruction %add = add nsw i32 %conv, %a
,
SelectionDAGBuilder::visit
creates a new
SDNode
with the opcode ISD::ADD
.
1
2
3SelectionDAGBuilder::visit
SelectionDAGBuilder::visitAdd
SelectionDAGBuilder::visitBinary # binary operators are handled similarly
Similarly, SelectionDAGBuilder::visit
creates a new
SDNode
with the opcode ISD::SDIV
for
%div.i = sdiv i32 %add, %c
,
SelectionDAGBuilder::visit
. For the
ret i32 %div.i
instruction, the created SDNode
has a target-specific opcode X86ISD::RET_GLUE
(target-specific opcodes are legal for almost all targets).
After all instructions are visited, we get an initial DAG that looks
like: 1
2
3
4
5
6
7
8
9
10
11
12Initial selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
t10: i32 = zero_extend t9
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t10, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t12: i32 = sdiv t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
The DAGCombiner
process changes
t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
to t19: i1 = setcc t4, t4, seto:ch
. After the initial
combining, the output looks like: 1
2
3
4
5
6
7
8
9
10
11
12Optimized lowered selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 16 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t19: i1 = setcc t4, t4, seto:ch
t10: i32 = zero_extend t19
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t10, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t12: i32 = sdiv t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
The LegalizeTypes
process changes
t10: i32 = zero_extend t19
to
t23: i32 = any_extend t22; t25: i32 = and t23, Constant:i32<1>
.
The result of LegalizeTypes
looks like the following:
1
2
3
4
5
6
7
8
9
10
11
12
13Optimized legalized selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
t0: ch,glue = EntryToken
t4: f32,ch = CopyFromReg t0, Register:f32 %1
t30: i32 = X86ISD::FCMP t4, t4
t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30
t26: i32 = zero_extend t32
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t11: i32 = add nsw t26, t2
t6: i32,ch = CopyFromReg t0, Register:i32 %2
t29: i32,i32 = sdivrem t11, t6
t15: ch,glue = CopyToReg t0, Register:i32 $eax, t29
t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
x86 division instructions computes both the quotient and the
reminder. To leverage this property, X86ISelLowering.cpp
sets ISD::SDIV
to Expand
. The
Legalize
process will expand the ISD::SDIV
node. In SelectionDAGLegalize::ExpandNode
, the node is
replaced with a new node with the opcode ISD::SDIVREM
.
X86ISelLowering.cpp sets ISD::SETCC
for the type to
Custom
. X86TargetLowering::LowerOperation
provides custom lowering hooks and replaces the ISD::SETCC
node with
t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30
that uses another node t30: i32 = X86ISD::FCMP t4, t4
.
The SelectionDAGISel::Select
process creates
MachineSDNode
(derivied class of SDNode
with
negative NodeType
) objects, which will be converted to
MachineInstr
. SelectionDAGISel::Select
is
derived by targets to perform custom handling for certain instructions
and handle over the rest to SelectCode
.
SelectCode
is the entry point of TableGen-generated pattern
matching code for instruction selection. While some SDNode
s
become MachineSDNode
s, some SDNode
s (e.g.
CopyFromReg
) remain SDNode
.
In our example,
X86::RET
is selected for theX86ISD::RET_GLUE
node.- Some nodes, such as
ISD::Register
andISD::CopyFromReg
, remain the same. X86::IDIV32r
is selected for theISD::SDIVREM
node.X86::ADD32rr
is selected for theISD::ADD
node.X86::MOVZX32rr8
is selected for theISD::ZERO_EXTEND
node.
In the EmitSchedule
process, MachineInstr
objects are created from these MachineSDNode
and regular
SDNode
objects.
FastISel, typically used for clang -O0
, represents a
fast path of SelectionDAG that generates less optimized machine
code.
When FastISel is enabled, SelectAllBasicBlocks
tries to
skip SelectBasicBlock
and select instructions with
FastISel. However, FastISel only handles a subset of IR instructions.
For unhandled instructions, SelectAllBasicBlocks
falls back
to SelectBasicBlock
to handle the remaining instructions in
the basic block.
GlobalISel
GlobalISel is a new instruction selection framework that operates on the entire function, in contrast to the basic block view of SelectionDAG. GlobalISel offers improved performance and modularity (multiple passes instead of one monolithic pass).
The design of the generic
MachineInstr
replaces an intermediate representation,
SDNode
, which was used in the SelectionDAG framework.
1 | LLVM IR =(IRTranslator)=> generic MachineInstr =(Legalizer)=> regular and generic MachineInstr =(RegBankSelect,GlobalInstructionSelect)=> regular MachineInstr |
1 | TargetPassConfig::addCoreISelPasses |
Similar to FastISel, GlobalISel does not handle all instructions. If GlobalISel fails to handle a function, SelectionDAG will be used as the fallback.
Machine passes
After instruction selector, there are machine SSA optimizations, register allocation, machine late optimizations, and pre-emit passes.
1 | TargetPassConfig::addMachinePasses |
PHIElimination
and
TwoAddressInstructionPass
precede the register allocation
pass. After PHIElimination
, the machine function is no
longer in a SSA form.
The optimization level affects the default register allocation strategy:
-O0
: A simpler allocator calledRegAllocFast
is used.-O1
and above: A sophisticated allocator calledRegAllocGreedy
is used. It runs additional passes and utilizesVirtRegRewriter
to finalize the allocation by replacing virtual register references with actual physical registers.
X86::RET
describes a pseudo instruction.
X86ExpandPseudo::ExpandMI
expands the X86::RET
MachineInstr
to an X86::RET64
.
AsmPrinter
LLVMTargetMachine::addAsmPrinter
incorporates a
target-specific AsmPrinter
(derived from
AsmPrinter
) pass into the machine code generation pipeline.
These target-specific AsmPrinter
passes are responsible for
converting MachineInstr
s to MCInst
s and
emitting them to an MCStreamer
.
In our x86 case, the target-specific class is named
X86AsmPrinter
.
X86AsmPrinter::runOnMachineFunction
invokes
AsmPrinter::emitFunctionBody
to emit the function body. The
base member function handles function header/footer, comments, and
instructions. Target-specific classes override
emitInstruction
to lower MachineInstr
s with
target-specific opcodes to MCInst
s.
An MCInst
object can also be created by the LLVM
integrated assembler. MCInst
is like a simplified version
of MachineInstr
with less information. When
MCInst
is emitted to an MCStreamer
, it results
in either assembly code or bytes for a relocatable object file.
MC
Clang has the capability to output either assembly code or an object file. Generating an object file directly without involving an assembler is referred to as "direct object emission".
To provide a unified interface, MCStreamer
is created to
handle the emission of both assembly code and object files. The two
primary subclasses of MCStreamer
are
MCAsmStreamer
and MCObjectStreamer
,
responsible for emitting assembly code and machine code
respectively.
In the case of an assembly input file, LLVM creates an
MCAsmParser
object (LLVMMCParser) and a target-specific
MCTargetAsmParser
object. The MCAsmParser
is
responsible for tokenizing the input, parsing assembler directives, and
invoking the MCTargetAsmParser
to parse an instruction.
Both the MCAsmParser
and MCTargetAsmParser
objects can call the MCStreamer
API to emit assembly code
or machine code.
In our case, LLVMAsmPrinter calls the MCStreamer
API to
emit assembly code or machine code.
If the streamer is an MCAsmStreamer
, the
MCInst
will be pretty-printed. If the streamer is an
MCELFStreamer
(other object file formats are similar),
MCELFStreamer::emitInstToData
will use
${Target}MCCodeEmitter
from LLVM${Target}Desc to encode the
MCInst
, emit its byte sequence, and records needed
relocations. An ELFObjectWriter
object is used to write the
relocatable object file.
In our example, we get a relocatable object file. If we invoke
clang -S a.cc
to get the assembly, it will look like:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32...
.globl _Z3fooifi # -- Begin function _Z3fooifi
.p2align 4, 0x90
.type _Z3fooifi,@function
_Z3fooifi: # @_Z3fooifi
.cfi_startproc
# %bb.0:
xorl %eax, %eax
ucomiss %xmm0, %xmm0
setnp %al
addl %edi, %eax
cltd
idivl %esi
retq
.Lfunc_end0:
.size _Z3fooifi, .Lfunc_end0-_Z3fooifi
.cfi_endproc
...
.globl main
.p2align 4, 0x90
.type main,@function
main: # @main
.cfi_startproc
# %bb.0:
movss .LCPI1_0(%rip), %xmm0 # xmm0 = mem[0],zero,zero,zero
movl $3, %edi
movl $1, %esi
jmp _Z3fooifi # TAILCALL
.Lfunc_end1:
.size main, .Lfunc_end1-main
.cfi_endproc
You may read my post Assemblers for more information about the LLVM integrated assembler.