Clang is a C/C++ compiler that generates LLVM IR and utilizes LLVM to
generate relocatable object files. Using the classic three-stage
compiler structure, the stages can be described as follows:
```
C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(back end)=> relocatable object file
```
If we follow the internal representations of instructions, a more
detailed diagram looks like this:
```
C/C++ =(front end)=> LLVM IR =(middle end)=> LLVM IR (optimized) =(instruction selector)=> MachineInstr =(AsmPrinter)=> MCInst =(assembler)=> relocatable object file
```
LLVM and Clang are designed as a collection of libraries. This post describes how different libraries work together to create the final relocatable object file. I will focus on how a function goes through the multiple compilation stages.
Compiler frontend
The compiler frontend primarily comprises the following libraries:
- clangDriver
- clangFrontend
- clangParse and clangSema
- clangCodeGen
The clangDriver library is located in clang/lib/Driver/
and clang/include/clang/Driver/, while other libraries have
similar structures. In general, when a header file in one library (let's
call it library A) is needed by another library, it is exposed to
clang/include/clang/$A/. Downstream projects can also
include the header file using
#include <clang/$A/$file.h>.
Let's use a C++ source file as an example.
```
% cat a.cc
template <class T> T div(T a, T b) { return a / b; }

int foo(int a, float b, int c) {
  int s = a + (b == b);
  return div(s, c);
}

int main() { return foo(3, 2.0f, 1); }
```
The entry point of the Clang executable is implemented in
clang/tools/driver/. clang_main creates a
clang::driver::Driver instance, calls
BuildCompilation to construct a
clang::driver::Compilation instance, and then calls
ExecuteCompilation.
clangDriver
clangDriver parses the command line arguments, constructs compilation actions, assigns actions to tools, generates commands for these tools, and executes the commands.
You may read Compiler driver and cross compilation for additional information.
```
BuildCompilation
```
For clang++ -g a.cc, clangDriver identifies the
following phases: preprocessor, compiler (C++ to LLVM IR), backend,
assembler, and linker. The first several phases can be performed by one
single clang::driver::tools::Clang object (also known as
Clang cc1), while the final phase requires an external program (the
linker).
```
% clang++ -g a.cc '-###'
```
cc1_main (in clang/tools/driver/) calls
ExecuteCompilerInvocation defined in clangFrontend.
clangFrontend
clangFrontend defines CompilerInstance,
which manages various classes, including
CompilerInvocation, DiagnosticsEngine,
TargetInfo, FileManager,
SourceManager, Preprocessor,
ASTContext, ASTConsumer, and
Sema.
```
ExecuteCompilerInvocation
```
In ExecuteCompilerInvocation, a FrontendAction
is created based on the CompilerInstance argument and then
executed. When using the -emit-obj option, the selected
FrontendAction is an EmitObjAction, which is a
derivative of CodeGenAction.
During FrontendAction::BeginSourceFile, several classes
mentioned earlier are created, and a BackendConsumer is
also established. The BackendConsumer serves as a wrapper
around CodeGenerator, which is another derivative of
ASTConsumer. Finally, in
FrontendAction::BeginSourceFile,
CompilerInstance::setASTConsumer is called to create a
CodeGenModule object, responsible for managing an LLVM IR
module.
In FrontendAction::Execute,
CodeGenAction::ExecuteAction is invoked, primarily handling
the compilation of LLVM IR files. This function, in turn, calls the base
function ASTFrontendAction::ExecuteAction, which, in
essence, triggers the entry point of clangParse:
ParseAST.
clangParse and clangSema
clangParse consumes tokens from clangLex
and invokes parser actions, many of which are named Act*,
defined in clangSema. clangSema performs
semantic analysis and generates AST nodes.
```
ParseAST
```
When ParseTopLevelDecl consumes a tok::eof
token, implicit instantiations are performed.
In the end, we get a full AST ("AST" is actually a misnomer, as the
representation is neither abstract, nor purely about syntax, nor a
tree). ParseAST calls the virtual functions
HandleTopLevelDecl and
HandleTranslationUnit.
clangCodeGen
BackendConsumer defined in clangCodeGen overrides
HandleTopLevelDecl and HandleTranslationUnit
to generate unoptimized LLVM IR and hand over the IR to LLVM for machine
code generation.
```
BackendConsumer::HandleTopLevelDecl
```
BackendConsumer::HandleTopLevelDecl generates LLVM IR
for each top-level declaration. This means that Clang generates IR one
function at a time.
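The callback protocol can be sketched with a tiny stand-in (hypothetical types and names; the real interface is clang::ASTConsumer, which is much richer):

```cpp
#include <string>
#include <vector>

// Hypothetical miniature of the ASTConsumer callback protocol.
struct Decl { std::string name; };

struct Consumer {
  virtual ~Consumer() = default;
  // Called once per top-level declaration, as the parser finishes it.
  virtual bool HandleTopLevelDecl(const Decl &d) = 0;
  // Called once at the end of the translation unit.
  virtual void HandleTranslationUnit() = 0;
};

// A BackendConsumer-like implementation: "generate IR" one function at
// a time, then finalize at the end of the translation unit.
struct IRGenConsumer : Consumer {
  std::vector<std::string> generated;
  bool finalized = false;
  bool HandleTopLevelDecl(const Decl &d) override {
    generated.push_back("define " + d.name);  // stand-in for IR emission
    return true;
  }
  void HandleTranslationUnit() override { finalized = true; }
};

// Drive the consumer the way ParseAST would.
inline IRGenConsumer parseAll(const std::vector<Decl> &tu) {
  IRGenConsumer c;
  for (const Decl &d : tu) c.HandleTopLevelDecl(d);
  c.HandleTranslationUnit();
  return c;
}
```
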
BackendConsumer::HandleTranslationUnit invokes
EmitBackendOutput to create an LLVM IR file, an assembly
file, or a relocatable object file. EmitBackendOutput
establishes an optimization pipeline and a machine code generation
pipeline.
Now let's explore CodeGenFunction::EmitFunctionBody.
Generating IR for a variable declaration and a return statement involves
the following functions, among others:
```
EmitFunctionBody
  EmitCompoundStmtWithoutScope
    EmitStmt
      EmitSimpleStmt
        EmitDeclStmt
          EmitDecl
            EmitVarDecl
      EmitStopPoint
      EmitReturnStmt
        EmitScalarExpr
          ScalarExprEmitter::EmitBinOps
```
After generating the LLVM IR, clangCodeGen proceeds to execute
EmitAssemblyHelper::RunOptimizationPipeline to perform
middle-end optimizations and subsequently
EmitAssemblyHelper::RunCodegenPipeline to generate machine
code.
For our integer division example, the function foo in
the unoptimized LLVM IR looks like this (attributes are omitted):
```llvm
; Function Attrs: mustprogress noinline uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) #0 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca float, align 4
  %c.addr = alloca i32, align 4
  %s = alloca i32, align 4
  store i32 %a, ptr %a.addr, align 4, !tbaa !5
  store float %b, ptr %b.addr, align 4, !tbaa !9
  store i32 %c, ptr %c.addr, align 4, !tbaa !5
  call void @llvm.lifetime.start.p0(i64 4, ptr %s) #4
  %0 = load i32, ptr %a.addr, align 4, !tbaa !5
  %1 = load float, ptr %b.addr, align 4, !tbaa !9
  %2 = load float, ptr %b.addr, align 4, !tbaa !9
  %cmp = fcmp oeq float %1, %2
  %conv = zext i1 %cmp to i32
  %add = add nsw i32 %0, %conv
  store i32 %add, ptr %s, align 4, !tbaa !5
  %3 = load i32, ptr %s, align 4, !tbaa !5
  %4 = load i32, ptr %c.addr, align 4, !tbaa !5
  %call = call noundef i32 @_Z3divIiET_S0_S0_(i32 noundef %3, i32 noundef %4)
  call void @llvm.lifetime.end.p0(i64 4, ptr %s) #4
  ret i32 %call
}
```
Compiler middle end
EmitAssemblyHelper::RunOptimizationPipeline creates an
LLVM pass manager to schedule the middle-end optimization pipeline. The
new pass manager uses a nested structure: ModulePassManager
→ CGSCCPassManager (call graph SCCs) →
FunctionPassManager → LoopPassManager. Each
level provides the appropriate IR unit and analysis results to its
passes.
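A minimal sketch of this nesting, assuming just the Module and Function levels (the CGSCC and Loop levels work analogously; all names here are toy stand-ins, not the real LLVM pass manager API):

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy IR units.
struct Function { std::string name; int instCount; };
struct Module { std::vector<Function> funcs; };

// A pass manager is itself a pass over its IR unit, so managers nest:
// a module-level manager can contain an adaptor that runs a
// function-level manager on every function in the module.
struct FunctionPassManager {
  std::vector<std::function<void(Function &)>> passes;
  void run(Function &f) { for (auto &p : passes) p(f); }
};

struct ModulePassManager {
  std::vector<std::function<void(Module &)>> passes;
  void addFunctionPasses(FunctionPassManager fpm) {
    // Adaptor: run the nested manager on each function in the module.
    passes.push_back([fpm](Module &m) mutable {
      for (Function &f : m.funcs) fpm.run(f);
    });
  }
  void run(Module &m) { for (auto &p : passes) p(m); }
};
```

The real design additionally threads analysis managers through each level so that passes can query (and invalidate) cached analysis results.
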
```
EmitAssemblyHelper::RunOptimizationPipeline
```
To get a sense of which passes matter most, we can compile sqlite3
(1051 functions) with opt -O2 --print-changed and count how
many invocations actually changed the IR.
At the module level, IPSCCPPass (interprocedural sparse
conditional constant propagation), GlobalOptPass, and
DeadArgumentEliminationPass each fire once and change the
IR.
Among function passes (totals across all pipeline instances), the
workhorse passes are SimplifyCFGPass (2287),
InstCombinePass (2136), SROAPass (1048), and
EarlyCSEPass (952). These appear multiple times in the
pipeline and run after every major transformation. Other notable passes
include JumpThreadingPass (444),
CorrelatedValuePropagationPass (345), GVNPass
(307), and SCCPPass (103).
For loop passes, LoopRotatePass (588) and
LICMPass (501) are the most active, followed by
IndVarSimplifyPass (332).
In buildModuleOptimizationPipeline,
LoopUnrollPass (316), LoopVectorizePass (14),
and SLPVectorizerPass (10) are infrequent on sqlite3 but
can be critical for numerical code.
The inliner pipeline is a CGSCC pass that, for each SCC in bottom-up
order, runs the inliner followed by function simplification passes
(SROA, InstCombine, GVN, etc.). This means div<int>
is inlined into foo during CGSCC traversal, and the
post-inline cleanup immediately simplifies the result.
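The bottom-up order can be sketched with a plain postorder walk over an acyclic call graph (a simplification: the real inliner walks call-graph SCCs, which collapse mutual recursion into one unit):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Visit callees before callers, so that by the time a caller is
// processed, its callees have already been inlined into and simplified.
using CallGraph = std::map<std::string, std::vector<std::string>>;

inline void postorder(const CallGraph &cg, const std::string &f,
                      std::set<std::string> &seen,
                      std::vector<std::string> &out) {
  if (!seen.insert(f).second) return;
  auto it = cg.find(f);
  if (it != cg.end())
    for (const std::string &callee : it->second)
      postorder(cg, callee, seen, out);
  out.push_back(f);  // emitted only after all callees
}

inline std::vector<std::string> bottomUpOrder(const CallGraph &cg) {
  std::set<std::string> seen;
  std::vector<std::string> out;
  for (const auto &entry : cg) postorder(cg, entry.first, seen, out);
  return out;
}
```

For our example's call graph (main calls foo, foo calls div), the walk yields div, then foo, then main, which is why div<int> is already available for inlining when foo is visited.
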
The option -mllvm -print-pipeline-passes prints the
pipeline structure, and -Xclang -fdebug-pass-manager shows
passes as they execute:
```
% clang -c -O1 -mllvm -print-pipeline-passes a.c
annotation2metadata,forceattrs,declare-to-assign,inferattrs,coro-early,...
```
For our integer division example, the optimized LLVM IR looks like
this:
```llvm
; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @_Z3fooifi(i32 noundef %a, float noundef %b, i32 noundef %c) local_unnamed_addr #0 {
entry:
  %cmp = fcmp ord float %b, 0.000000e+00
  %conv = zext i1 %cmp to i32
  %add = add nsw i32 %conv, %a
  %div.i = sdiv i32 %add, %c
  ret i32 %div.i
}

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none) uwtable
define dso_local noundef i32 @main() local_unnamed_addr #1 {
entry:
  %call = tail call noundef i32 @_Z3fooifi(i32 noundef 3, float noundef 2.000000e+00, i32 noundef 1)
  ret i32 %call
}
```
The most noticeable differences are the following:
- SROAPass runs mem2reg and optimizes out many AllocaInsts.
- InstCombinePass (InstCombinerImpl::visitFCmpInst) replaces fcmp oeq float %1, %1 with fcmp ord float %1, 0.000000e+00, canonicalizing NaN testing to FCmpInst::FCMP_ORD.
- InlinerPass inlines the instantiated div function into its caller foo.
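The fcmp rewrite is sound because of an IEEE-754 identity, which a small sketch can verify (assuming default floating-point semantics, not -ffast-math):

```cpp
#include <cmath>
#include <limits>

// `x == x` is true exactly when x is not NaN, which is also what
// `fcmp ord x, 0.0` computes (both operands ordered, i.e. neither is
// NaN, and 0.0 never is).
inline bool oeqSelf(float x) { return x == x; }          // fcmp oeq x, x
inline bool ordZero(float x) { return !std::isnan(x); }  // fcmp ord x, 0.0
```
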
Compiler back end
The demarcation between the middle end and the back end may not be
entirely distinct. Within
LLVMTargetMachine::addPassesToEmitFile, several IR passes
are scheduled. It's reasonable to consider these IR passes (everything
before addCoreISelPasses) as part of the middle end, while
the phase beginning with instruction selection can be regarded as the
actual back end.
Here is an overview of
LLVMTargetMachine::addPassesToEmitFile:
```
LLVMTargetMachine::addPassesToEmitFile
```
These IR and machine passes are scheduled by the legacy pass manager.
The option -mllvm -debug-pass=Structure provides insight
into these passes:
```
% clang -c -O1 a.c -mllvm -debug-pass=Structure
```
Instruction selector
There are three instruction selectors: SelectionDAG, FastISel, and GlobalISel. FastISel is integrated within the SelectionDAG framework.
For most targets, FastISel is the default for clang -O0
while SelectionDAG is the default for optimized builds. However, for
most AArch64 -O0 configurations, GlobalISel is the
default.
To force using GlobalISel, we can specify
-mllvm -global-isel.
SelectionDAG
The official documentation https://llvm.org/docs/WritingAnLLVMBackend.html#instruction-selector is dry and incomplete. 2024 LLVM Dev Mtg - A Beginners’ Guide to SelectionDAG has a great introduction.
```
SelectionDAG: normal code path
```
```
TargetPassConfig::addCoreISelPasses
```
Each backend implements a derived class of
SelectionDAGISel. For example, the X86 backend implements
X86DAGToDAGISel and overrides
runOnMachineFunction to set up variables like
X86Subtarget and then invokes the base function
SelectionDAGISel::runOnMachineFunction.
SelectionDAGISel creates a
SelectionDAGBuilder.
A function is split into basic blocks and each SelectionDAG
represents a single basic block.
SelectionDAGISel::SelectBasicBlock iterates over all IR
instructions and calls SelectionDAGBuilder::visit on them,
creating a new SDNode for each Value that
becomes part of the initial DAG.
The initial DAG may contain types and operations that are not
natively supported by the target.
SelectionDAGISel::CodeGenAndEmitDAG invokes
LegalizeTypes and Legalize to convert
unsupported types and operations to supported ones.
ScheduleDAGSDNodes::EmitSchedule emits the machine code
(MachineInstrs) in the scheduled order.
FinalizeISel expands instructions marked with
MCID::UsesCustomInserter.
-mllvm -debug-only=isel-dump provides insight into the
instruction selection phases.
SelectionDAG concept
A SelectionDAG is a directed acyclic graph where each node
(SDNode) represents an operation (opcode + flags), takes 0
or more operands (SDUse), and produces 1 or more results
(SDValue). Each result has a machine value type
(MVT), e.g. i32, f64,
v4f32. Operands reference results of other nodes, forming
def-use edges.
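A heavily simplified, hypothetical stand-in for this structure (the real definitions live in llvm/CodeGen/SelectionDAGNodes.h and carry far more state):

```cpp
#include <string>
#include <vector>

// Toy machine value types.
enum class MVT { i1, i8, i32, f32, Other, Glue };

struct SDNode;

// An SDValue names one particular result of a node.
struct SDValue {
  SDNode *node = nullptr;
  unsigned resNo = 0;
};

struct SDNode {
  std::string opcode;            // e.g. "add", "setcc", "CopyFromReg"
  std::vector<SDValue> ops;      // operands: results of other nodes
  std::vector<MVT> resultTypes;  // one MVT per produced value
};

// Def-use edge: an operand of `user` refers to result `resNo` of `def`.
inline SDValue use(SDNode &def, unsigned resNo) { return {&def, resNo}; }
```

For instance, the setcc in our example takes the same CopyFromReg result twice as its operands, mirroring `setcc t4, t4`.
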
An EntryToken node represents the start-of-block
dependency. Each DAG has a root node, usually the terminator instruction
(e.g. ISD::RET).
Beyond data values, two special MVT types encode
ordering constraints:
- Chain (ch) values enforce ordering between side-effecting nodes (loads, stores, calls) while allowing the scheduler to interleave unrelated nodes in between.
- Glue (MVT::Glue) forces two nodes to be scheduled immediately adjacent: no instruction can be inserted between them. A node produces glue as its last result and the consumer takes it as its last operand, forming a linear chain (at most one glue in, one glue out per node). Common uses: CopyToReg nodes glued to CALL/RET (preventing argument register clobbering), and flag-setting instructions glued to flag consumers (e.g. CMP then CMOV on x86, preventing EFLAGS clobbering).
In -debug-only=isel output, leaf nodes with zero
operands (constants, registers) are printed inline with their users in
debug dumps. Other nodes are printed one per line; single-use nodes are
indented under their sole user, while multi-use nodes appear at the top
level.
Now, let's take a closer look at our foo function.
SelectionDAG construction
For the IR instruction
%cmp = fcmp ord float %b, 0.000000e+00,
SelectionDAGBuilder::visit creates a new
SDNode with the opcode ISD::SETCC
(t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch).
```
SelectionDAGBuilder::visit
  SelectionDAGBuilder::visitFcmp
```
A new SDNode with the opcode
ISD::ZERO_EXTEND is created for
%conv = zext i1 %cmp to i32.
For the IR instruction %add = add nsw i32 %conv, %a,
SelectionDAGBuilder::visit creates a new
SDNode with the opcode ISD::ADD.
```
SelectionDAGBuilder::visit
  SelectionDAGBuilder::visitAdd
    SelectionDAGBuilder::visitBinary # binary operators are handled similarly
```
Similarly, SelectionDAGBuilder::visit creates a new
SDNode with the opcode ISD::SDIV for
%div.i = sdiv i32 %add, %c. For the
ret i32 %div.i instruction, the created SDNode
has a target-specific opcode X86ISD::RET_GLUE
(target-specific opcodes are legal for almost all targets).
After all instructions are visited, we get an initial DAG that looks
like:
```
Initial selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
  t0: ch,glue = EntryToken
  t4: f32,ch = CopyFromReg t0, Register:f32 %1
  t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
  t10: i32 = zero_extend t9
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t11: i32 = add nsw t10, t2
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t12: i32 = sdiv t11, t6
  t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
  t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
```
The DAGCombiner process changes
t9: i1 = setcc t4, ConstantFP:f32<0.000000e+00>, seto:ch
to t19: i1 = setcc t4, t4, seto:ch. After the initial DAG
combining, the output looks like:
```
Optimized lowered selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
  t4: f32,ch = CopyFromReg t0, Register:f32 %1
  t19: i1 = setcc t4, t4, seto:ch
  t10: i32 = zero_extend t19
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t11: i32 = add nsw t10, t2
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t12: i32 = sdiv t11, t6
  t15: ch,glue = CopyToReg t0, Register:i32 $eax, t12
  t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
```
Type legalization
The type legalizer (entry: SelectionDAG::LegalizeTypes)
traverses nodes in a topological order. The targets call
addRegisterClass and
TargetLoweringBase::computeRegisterProperties to set up
legal types. Nodes with illegal value types are transformed according to
a few predefined actions:
- TypePromoteInteger
- TypeExpandInteger
- TypeSoftenFloat
- TypePromoteFloat
- TypeSoftPromoteFloat
- TypeScalarizeVector
- ...
```cpp
EVT ResultVT = N->getValueType(i);
```
Many actions call a hook, CustomLowerNode, before the
generic code runs. If the target has registered a Custom action for
the operation with the illegal type, the custom handler runs
instead. This allows target-specific code to legalize an operation
with its original value type.
In our example, i1 is an illegal type and the
transformed-to type is i8. The type legalizer will change
t10: i32 = zero_extend t19 to
t23: i32 = any_extend t22; t25: i32 = and t23, Constant:i32<1>.
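A small sketch of why this rewrite is correct: any_extend leaves the high bits unspecified, and the final and masks them off, recovering the zero-extended value (the "garbage" byte values below are illustrative):

```cpp
#include <cstdint>

// After promoting i1 to i8, the upper 7 bits of the byte may hold
// anything. any_extend to i32 makes no promise about them either, so
// `and ..., 1` is exactly what zero-extension of the original i1 needs.
inline uint32_t anyExtendThenMask(uint8_t promotedBool) {
  uint32_t anyExt = promotedBool;  // t23: i32 = any_extend t22
  return anyExt & 1u;              // t25: i32 = and t23, 1
}
```
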
Operation legalization
Targets support a subset of (Opcode, MVT) combinations; these are
legal. The rest are illegal and need to be transformed into legal
operations for the target. Targets call
setOperationAction(Opcode, MVT, LegalizeAction) to tell
the legalizer whether a (Opcode, MVT) combination is legal or requires
a legalize action:
- Legal
- Promote
- Expand
- Custom
The nodes are visited in a topological order.
x86 division instructions compute both the quotient and the
remainder. X86ISelLowering.cpp sets the legalization
action of ISD::SDIV with type i32 to
Expand.
```cpp
setOperationAction(ISD::SDIV, VT, Expand);
```
SelectionDAGLegalize::LegalizeOp notices that the
operation is Expand and calls
SelectionDAGLegalize::ExpandNode. The node is replaced with
a new node with the opcode ISD::SDIVREM.
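A sketch of the rationale: a divrem-style primitive computes both results at once, and the expanded sdiv simply keeps the quotient (plain C++ stand-ins here, not LLVM APIs):

```cpp
#include <utility>

// On x86, idiv produces the quotient (in eax) and the remainder (in
// edx) simultaneously, so the legalizer rewrites sdiv as
// "divrem, then take result 0".
inline std::pair<int, int> sdivrem(int a, int b) {
  return {a / b, a % b};  // one hardware instruction on x86
}
inline int expandedSdiv(int a, int b) { return sdivrem(a, b).first; }
```

When both the quotient and the remainder of the same operands are needed, this expansion also lets the DAG share a single sdivrem node.
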
The legalization action of ISD::SETCC for i8 is
Custom.
```cpp
setOperationAction(ISD::SETCC, VT, Custom);
```
X86TargetLowering::LowerOperation provides custom
lowering hooks and replaces the ISD::SETCC node with
t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30
that uses another node t30: i32 = X86ISD::FCMP t4, t4.
Legalization might create new nodes. These new nodes need
legalization as well. SelectionDAG::Legalize uses an
iterative algorithm.
The result of Legalize looks like the following:
```
Optimized legalized selection DAG: %bb.0 '_Z3fooifi:entry'
SelectionDAG has 17 nodes:
  t0: ch,glue = EntryToken
  t4: f32,ch = CopyFromReg t0, Register:f32 %1
  t30: i32 = X86ISD::FCMP t4, t4
  t32: i8 = X86ISD::SETCC TargetConstant:i8<11>, t30
  t26: i32 = zero_extend t32
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t11: i32 = add nsw t26, t2
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t29: i32,i32 = sdivrem t11, t6
  t15: ch,glue = CopyToReg t0, Register:i32 $eax, t29
  t16: ch = X86ISD::RET_GLUE t15, TargetConstant:i32<0>, Register:i32 $eax, t15:1
```
Instruction selection
DoInstructionSelection visits DAG nodes in a topological
order (root node first; operator selected before operands) and calls
SelectionDAGISel::Select. Select replaces most
generic SDNodes with MachineSDNodes (a derived
class of SDNode with a negative NodeType), which
will later be converted to MachineInstrs. Some
SDNodes (e.g. CopyFromReg) remain plain
SDNodes.
Select is overridden by targets to perform custom logic,
handing the rest over to SelectCode. SelectCode
is the entry point of the TableGen-generated pattern matching code for
instruction selection: TableGen generates code that matches DAG patterns
and emits machine nodes.
You can pass -mllvm -debug-only=isel to Clang to observe
the process. In our example,
- X86::RET is selected for the X86ISD::RET_GLUE node.
- Some nodes, such as ISD::Register and ISD::CopyFromReg, remain the same.
- X86::IDIV32r is selected for the ISD::SDIVREM node.
- X86::ADD32rr is selected for the ISD::ADD node.
- X86::MOVZX32rr8 is selected for the ISD::ZERO_EXTEND node.
SelectionDAG instruction scheduling
After instruction selection, the DAG nodes must be linearized into a
sequence of MachineInstrs.
SelectionDAGISel::CodeGenAndEmitDAG calls
CreateScheduler to create a ScheduleDAGSDNodes
subclass. createDefaultScheduler picks a scheduler based on
the target's scheduling preference
(TargetLowering::getSchedulingPreference):
- Most targets (X86, RISC-V, AArch64, etc.) enable the machine scheduler, and createDefaultScheduler returns source, a trivial scheduler that preserves source order, deferring real scheduling to the later MachineScheduler pass. -O0 also uses source.
- For targets without the machine scheduler, the scheduling preference (TargetLowering::getSchedulingPreference) selects a ScheduleDAGRRList variant: list-burr (register pressure), list-hybrid (latency vs register pressure), or list-ilp (instruction-level parallelism).
While a simpler linearize scheduler exists that just
does a topological sort, it cannot handle physical register
dependencies. The source scheduler uses the full
ScheduleDAGRRList machinery (physical register liveness
tracking, backtracking) to ensure correctness when instructions have
physical register constraints (e.g. $eax for x86
idiv), while preferring source order when there is
scheduling freedom. GlobalISel does not need this step since it operates
on MachineInstrs directly without an intermediate DAG.
These can be overridden with
-mllvm -pre-RA-sched=<name>.
The ScheduleDAGRRList variants perform bottom-up list
scheduling: a data dependency graph is built in which all glued nodes
in a chain are merged into a single scheduling unit
(SUnit), so the scheduler cannot separate them.
The algorithm starts from the DAG root (e.g. RET) and
works backward: each iteration pops the highest-priority node from a
priority queue and appends it to the schedule, then releases
predecessors whose successors have all been scheduled. The final
sequence is reversed to get the execution order. Bottom-up naturally
models register pressure — at each step, the live values are exactly
those whose uses have been scheduled but whose definitions have not,
allowing the priority queue (using Sethi-Ullman numbers) to minimize
simultaneously live values.
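Sethi-Ullman numbering itself is simple to state for a binary expression tree (a sketch; the scheduler's actual computation handles DAGs and several special cases):

```cpp
#include <algorithm>

// Sethi-Ullman number = registers needed to evaluate a subtree when the
// more demanding child is evaluated first. A leaf needs 1 register; an
// interior node needs max(l, r) if the children differ, and l + 1 if
// they are equal (one extra register to hold the first result).
// Nodes are assumed to be either leaves or to have both children.
struct Expr {
  Expr *lhs = nullptr, *rhs = nullptr;
};

inline int sethiUllman(const Expr *e) {
  if (!e->lhs && !e->rhs) return 1;  // leaf: one register
  int l = sethiUllman(e->lhs);
  int r = sethiUllman(e->rhs);
  return l == r ? l + 1 : std::max(l, r);
}
```
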
The scheduler tracks physical register liveness: when a node is
scheduled, its predecessors' physical register defs become live; when
the defining node is later scheduled, those registers are released. If
scheduling a candidate would clobber a live physical register (checked
against all aliases, implicit defs, register masks, and inline asm
clobbers), the candidate is deferred to an interference list. When all
candidates are blocked, PickNodeToScheduleBottomUp resolves
the conflict with escalating strategies: first, backtrack by
unscheduling the node that keeps the register alive and add an
artificial ordering edge to prevent recurrence; if backtracking would
create a cycle, duplicate the defining node or insert cross-class
register copies as a last resort.
ScheduleDAGSDNodes::EmitSchedule converts the scheduled
MachineSDNode and SDNode objects into
MachineInstrs.
Instruction scheduling
Instruction scheduling and register allocation are conflicting goals. Scheduling wants to increase instruction-level parallelism by separating dependent instructions, which increases the number of simultaneously live values (register pressure). Register allocation wants to minimize register pressure to avoid spills.
To balance these concerns, LLVM schedules instructions multiple
times: The pre-RA machine scheduler
(MachineScheduler::runOnMachineFunction) runs at -O1 and
above, immediately before register allocation. Some targets also enable
a post-RA scheduler (PostRASchedulerID or
PostMachineSchedulerID) at -O2 and above, since register
allocation introduces spill/reload code and removes false dependencies
(anti and output dependencies on virtual registers), making rescheduling
profitable.
MachineScheduler
MachineSchedulerBase::scheduleRegions iterates over
basic blocks and splits each into scheduling regions — contiguous runs
of instructions separated by scheduling boundaries (calls, inline
assembly, and other instructions identified by
TargetInstrInfo::isSchedulingBoundary). Each region is
scheduled independently.
The pre-RA scheduler uses ScheduleDAGMILive, which
builds a dependency DAG over MachineInstrs and tracks
register pressure using RegisterPressureTracker. Its
schedule() method builds the DAG with register pressure
information (buildDAGWithRegPressure), finds top and bottom
roots, then calls SchedImpl->pickNode(IsTopNode) in a
loop until the top and bottom frontiers meet.
The default strategy is GenericScheduler, which performs
bidirectional list scheduling. It maintains two
SchedBoundary zones (top and bottom), each with its own
ready queue. pickNodeBidirectional picks the best candidate
from either zone using a hierarchical set of heuristics in
tryCandidate: bias physical register defs toward their
uses, avoid exceeding register pressure limits, avoid stalls on
unbuffered resources, keep clustered nodes together, balance critical
resource consumption, avoid serializing long latency chains, and fall
back to original instruction order.
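The cascade structure can be sketched as a comparator that lets the first differing heuristic decide (hypothetical fields; GenericScheduler::tryCandidate weighs many more criteria than these three):

```cpp
// A tryCandidate-style cascade: each criterion only breaks ties left by
// the previous one.
struct Candidate {
  int pressureExcess;  // register pressure over the limit (lower wins)
  int stallCycles;     // stall on an unbuffered resource (lower wins)
  int sourceOrder;     // original order (lower wins, final tiebreak)
};

// Returns true if `a` should be scheduled before `b`.
inline bool tryCandidate(const Candidate &a, const Candidate &b) {
  if (a.pressureExcess != b.pressureExcess)
    return a.pressureExcess < b.pressureExcess;
  if (a.stallCycles != b.stallCycles)
    return a.stallCycles < b.stallCycles;
  return a.sourceOrder < b.sourceOrder;
}
```
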
Scheduling models
Both the machine scheduler and the post-RA scheduler rely on
scheduling models defined in target .td
files. A SchedMachineModel describes the processor's
pipeline properties, while ProcResource and
WriteRes definitions map instruction classes to pipeline
resources and latencies. For example, X86SchedHaswell.td models the
Haswell microarchitecture:
```
def HaswellModel : SchedMachineModel {
  let IssueWidth = 4;
  let MicroOpBufferSize = 192; // Based on the reorder buffer.
  let LoadLatency = 5;
  let MispredictPenalty = 16;
}
def HWPort0 : ProcResource<1>;
def HWPort1 : ProcResource<1>;
...
def HWPort0156: ProcResGroup<[HWPort0, HWPort1, HWPort5, HWPort6]>;
defm : HWWriteResPair<WriteALU, [HWPort0156], 1>; // ALU ops: port 0/1/5/6, latency 1
defm : HWWriteResPair<WriteIMul64, [HWPort1, HWPort6], 4, [1,1], 2>; // 64-bit mul: latency 4, 2 uops
```
The scheduler uses these to model resource contention and instruction
latencies. -mllvm -debug-only=machine-scheduler shows the
scheduling decisions, and -mllvm -debug-only=pre-RA-sched
shows the SelectionDAG scheduler.
-mllvm -misched-dump-schedule-trace prints a cycle-by-cycle
resource booking table.
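How a latency table feeds scheduling can be sketched by computing when each instruction's result becomes ready along dependency chains (the latency values below are illustrative, not Haswell's real numbers):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Toy use of a scheduling model: given per-opcode latencies (as a
// WriteRes table would supply) and dependency edges, compute the cycle
// at which each instruction's result is ready. The critical path is a
// lower bound on the schedule length; resource contention (ports,
// issue width) only lengthens it.
struct Inst {
  std::string opcode;
  std::vector<int> deps;  // indices of earlier instructions we read
};

inline int criticalPath(const std::vector<Inst> &insts,
                        const std::map<std::string, int> &latency) {
  std::vector<int> ready(insts.size(), 0);
  int last = 0;
  for (size_t i = 0; i < insts.size(); ++i) {
    int start = 0;  // earliest cycle all operands are available
    for (int d : insts[i].deps) start = std::max(start, ready[d]);
    ready[i] = start + latency.at(insts[i].opcode);
    last = std::max(last, ready[i]);
  }
  return last;
}
```
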
FastISel
FastISel, typically used for clang -O0, is a
fast path within the SelectionDAG framework that reduces compile time
at the cost of less optimized machine code.
When FastISel is enabled, SelectAllBasicBlocks tries to
skip SelectBasicBlock and select instructions with
FastISel. However, FastISel only handles a subset of IR instructions.
For unhandled instructions, SelectAllBasicBlocks falls back
to SelectBasicBlock to handle the remaining instructions in
the basic block.
GlobalISel
GlobalISel is a new instruction selection framework that operates on the entire function, in contrast to the basic block view of SelectionDAG. GlobalISel offers improved performance and modularity (multiple passes instead of one monolithic pass).
GlobalISel's generic MachineInstr replaces the
intermediate representation (SDNode) used by the
SelectionDAG framework.
```
LLVM IR =(IRTranslator)=> generic MachineInstr =(Legalizer)=> regular and generic MachineInstr =(RegBankSelect,GlobalInstructionSelect)=> regular MachineInstr
```
```
TargetPassConfig::addCoreISelPasses
```
Similar to FastISel, GlobalISel does not handle all instructions. If GlobalISel fails to handle a function, SelectionDAG will be used as the fallback.
Machine passes
After instruction selection, there are machine SSA optimizations, register allocation, machine late optimizations, and pre-emit passes.
```
TargetPassConfig::addMachinePasses
```
PHIElimination and
TwoAddressInstructionPass precede the register allocation
pass. After PHIElimination, the machine function is no
longer in SSA form.
The optimization level affects the default register allocation strategy:
- -O0: a simple allocator, RegAllocFast, is used.
- -O1 and above: a sophisticated allocator, RegAllocGreedy, is used. It runs additional passes and utilizes VirtRegRewriter to finalize the allocation by replacing virtual register references with physical registers.
X86::RET describes a pseudo instruction.
X86ExpandPseudo::ExpandMI expands the X86::RET
MachineInstr to an X86::RET64.
AsmPrinter
LLVMTargetMachine::addAsmPrinter incorporates a
target-specific AsmPrinter (derived from
AsmPrinter) pass into the machine code generation pipeline.
These target-specific AsmPrinter passes are responsible for
converting MachineInstrs to MCInsts and
emitting them to an MCStreamer.
In our x86 case, the target-specific class is named
X86AsmPrinter.
X86AsmPrinter::runOnMachineFunction invokes
AsmPrinter::emitFunctionBody to emit the function body. The
base member function handles function header/footer, comments, and
instructions. Target-specific classes override
emitInstruction to lower MachineInstrs with
target-specific opcodes to MCInsts. Pseudo instructions
that expand to multiple instructions are expanded here.
An MCInst object can also be created by the LLVM
integrated assembler. MCInst is like a simplified version
of MachineInstr with less information. When
MCInst is emitted to an MCStreamer, it results
in either assembly code or bytes for a relocatable object file.
MC
Clang has the capability to output either assembly code or an object file. Generating an object file directly without involving an assembler is referred to as "direct object emission".
To provide a unified interface, MCStreamer is created to
handle the emission of both assembly code and object files. The two
primary subclasses of MCStreamer are
MCAsmStreamer and MCObjectStreamer,
responsible for emitting assembly code and machine code
respectively.
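The split can be sketched with one emission interface and two backends (a hypothetical simplification of the MCStreamer hierarchy):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One emission API, two implementations: text out vs bytes out.
struct Streamer {
  virtual ~Streamer() = default;
  virtual void emitInstruction(const std::string &mnemonic,
                               const std::vector<uint8_t> &encoding) = 0;
};

// Pretty-prints assembly text; ignores the encoding.
struct AsmStreamer : Streamer {
  std::string text;
  void emitInstruction(const std::string &mnemonic,
                       const std::vector<uint8_t> &) override {
    text += "\t" + mnemonic + "\n";
  }
};

// Appends encoded bytes to the current section ("direct object
// emission"); ignores the mnemonic.
struct ObjectStreamer : Streamer {
  std::vector<uint8_t> bytes;
  void emitInstruction(const std::string &,
                       const std::vector<uint8_t> &encoding) override {
    bytes.insert(bytes.end(), encoding.begin(), encoding.end());
  }
};
```

Callers such as the AsmPrinter emit through the base interface and never need to know which output format was requested.
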
In the case of an assembly input file, LLVM creates an
MCAsmParser object (LLVMMCParser) and a target-specific
MCTargetAsmParser object. The MCAsmParser is
responsible for tokenizing the input, parsing assembler directives, and
invoking the MCTargetAsmParser to parse an instruction.
Both the MCAsmParser and MCTargetAsmParser
objects can call the MCStreamer API to emit assembly code
or machine code.
In our case, LLVMAsmPrinter calls the MCStreamer API to
emit assembly code or machine code.
If the streamer is an MCAsmStreamer, the
MCInst will be pretty-printed. If the streamer is an
MCELFStreamer (other object file formats are similar),
MCELFStreamer::emitInstToData will use
${Target}MCCodeEmitter from LLVM${Target}Desc to encode the
MCInst, emit its byte sequence, and record the needed
relocations. An ELFObjectWriter object is used to write the
relocatable object file.
In our example, we get a relocatable object file. If we invoke
clang -S a.cc to get the assembly, it will look like:
```
...
        .globl  _Z3fooifi                       # -- Begin function _Z3fooifi
        .p2align        4, 0x90
        .type   _Z3fooifi,@function
_Z3fooifi:                              # @_Z3fooifi
        .cfi_startproc
# %bb.0:
        xorl    %eax, %eax
        ucomiss %xmm0, %xmm0
        setnp   %al
        addl    %edi, %eax
        cltd
        idivl   %esi
        retq
.Lfunc_end0:
        .size   _Z3fooifi, .Lfunc_end0-_Z3fooifi
        .cfi_endproc
...
        .globl  main
        .p2align        4, 0x90
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        movss   .LCPI1_0(%rip), %xmm0           # xmm0 = mem[0],zero,zero,zero
        movl    $3, %edi
        movl    $1, %esi
        jmp     _Z3fooifi                       # TAILCALL
.Lfunc_end1:
        .size   main, .Lfunc_end1-main
        .cfi_endproc
...
```
You may read my post Assemblers for more information about the LLVM integrated assembler.