Within the LLVM project, MC is a library responsible for handling assembly, disassembly, and object file formats. Intro to the LLVM MC Project, which was written back in 2010, remains a good source to understand the high-level structures.
In the latest release cycle, substantial effort has been dedicated to refining MC's internal representation for improved performance and readability. These changes have decreased compile time significantly. This blog post will delve into the details, providing insights into the specific changes.
Merged MCAsmLayout into MCAssembler
MCAssembler manages assembler states (including sections and symbols) and implements post-parsing passes (computing a layout and writing an object file). MCAsmLayout, tightly coupled with MCAssembler, was in charge of symbol and fragment offsets during MCAssembler::Finish.
MCAsmLayout was a wrapper of MCAssembler and a section order vector (actually Mach-O specific). Many MCAssembler and MCExpr member functions had a const MCAsmLayout & parameter, contributing to slight overhead. Here are some functions that are called frequently:
- MCAssembler::computeFragmentSize is called a lot in the layout process.
- MCAsmBackend::handleFixup and MCAsmBackend::applyFixup evaluate each fixup and produce relocations.
- MCAssembler::fixupNeedsRelaxation determines whether an MCRelaxableFragment needs relaxation due to an MCFixup.
- MCAssembler::relaxFragment and MCAssembler::relaxInstruction relax a fragment.
I started to merge MCAsmLayout into MCAssembler and simplify MC code, and eventually removed llvm/include/llvm/MC/MCAsmLayout.h.
Fragments
Fragments represent sequences of non-relaxable instructions, relaxable instructions, alignment directives, and other elements. MCDataFragment and MCRelaxableFragment, whose sizes are crucial for memory consumption, have undergone several optimizations:
- MCInst: decrease inline element count to 6
- [MC] Reduce size of MCDataFragment by 8 bytes by @aengelke
- [MC] Move MCFragment::Atom to MCSectionMachO::Atoms
The fragment management system has also been streamlined by transitioning from a doubly-linked list (llvm::iplist) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements.
Furthermore, I introduced the "current fragment" concept (MCStreamer::CurFrag), allowing for faster appending of new fragments.
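To make the list change concrete, here is a minimal sketch of a singly-linked fragment list with a "current fragment" pointer. The Fragment, Section, and Streamer types and every member name below are hypothetical stand-ins, not the LLVM classes; the sketch only assumes that appending always happens right after the current fragment, which is why no backward link is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Section;

// Hypothetical stand-in for a fragment: forward link only, no Prev pointer.
struct Fragment {
  Fragment *Next = nullptr;
  Section *Parent = nullptr;
  std::size_t Size = 0;
};

// Hypothetical stand-in for a section: head for iteration, tail for resuming.
struct Section {
  Fragment *Head = nullptr;
  Fragment *Tail = nullptr;
};

class Streamer {
  std::vector<std::unique_ptr<Fragment>> Storage; // owns every fragment
  Fragment *CurFrag = nullptr;                    // where the next append goes

  Fragment *allocate(Section &Sec, std::size_t Size) {
    Storage.push_back(std::make_unique<Fragment>());
    Fragment *F = Storage.back().get();
    F->Parent = &Sec;
    F->Size = Size;
    return F;
  }

public:
  // Resume appending at the end of Sec, creating an empty first fragment
  // if the section has none yet.
  void switchSection(Section &Sec) {
    if (!Sec.Head)
      Sec.Head = Sec.Tail = allocate(Sec, 0);
    CurFrag = Sec.Tail;
  }

  // O(1) append after the current fragment; no list traversal.
  Fragment *addFragment(std::size_t Size) {
    assert(CurFrag && "call switchSection first");
    Fragment *F = allocate(*CurFrag->Parent, Size);
    CurFrag->Next = F;
    CurFrag->Parent->Tail = F;
    CurFrag = F;
    return F;
  }
};

// Forward-only traversal is all that layout needs in this sketch.
std::size_t sectionSize(const Section &Sec) {
  std::size_t Total = 0;
  for (const Fragment *F = Sec.Head; F; F = F->Next)
    Total += F->Size;
  return Total;
}
```

Switching away from a section and back resumes at its tail, so interleaved emission into several sections still appends in constant time.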
I have also simplified and optimized fragment offset computation.
Previously, fragment offsets were computed lazily by getFragmentOffset, and all sections were iteratively relaxed until they all converged. This was inefficient: the section that converged the slowest determined the number of iterations for every other section, resulting in unneeded computation.
The new layout algorithm assigns fragment offsets and iteratively refines them for each section until it's optimized. Then, it moves on to the next section. If relaxation doesn't change anything, fragment offset assignment will be skipped. This way, sections that converge quickly don't have to wait for the slowest ones, resulting in a significant decrease in compile time for full LTO.
bool MCAssembler::relaxOnce() {
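The per-section convergence idea can be illustrated with a toy model. This is a sketch under stated assumptions, not the MCAssembler code: the Fragment fields, the 2-byte/5-byte jump sizes, the signed 8-bit range, and the function names are all invented for illustration. Each section is laid out and relaxed until nothing changes before the next section is touched, and offsets are reassigned only after a round that changed something.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy fragment: either fixed-size data or a relaxable jump.
struct Fragment {
  std::uint64_t Offset = 0;
  std::uint64_t Size = 0;
  bool Relaxable = false;
  std::size_t Target = 0; // index of the target fragment in the same section
};
using Section = std::vector<Fragment>;

constexpr std::uint64_t ShortSize = 2; // assumed short jump encoding size
constexpr std::uint64_t LongSize = 5;  // assumed long jump encoding size

Fragment makeData(std::uint64_t Size) {
  Fragment F;
  F.Size = Size;
  return F;
}

Fragment makeJump(std::size_t Target) {
  Fragment F;
  F.Size = ShortSize;
  F.Relaxable = true;
  F.Target = Target;
  return F;
}

// Assign consecutive offsets to the fragments of one section.
void assignOffsets(Section &Sec) {
  std::uint64_t Offset = 0;
  for (Fragment &F : Sec) {
    F.Offset = Offset;
    Offset += F.Size;
  }
}

// Grow every short jump whose displacement does not fit in a signed byte.
// Returns true if any fragment changed size.
bool relaxSection(Section &Sec) {
  bool Changed = false;
  for (Fragment &F : Sec) {
    if (!F.Relaxable || F.Size == LongSize)
      continue;
    std::int64_t Disp = static_cast<std::int64_t>(Sec[F.Target].Offset) -
                        static_cast<std::int64_t>(F.Offset + F.Size);
    if (Disp > 127 || Disp < -128) {
      F.Size = LongSize;
      Changed = true;
    }
  }
  return Changed;
}

// Converge each section on its own; returns the number of rounds in which
// offsets had to be reassigned, summed over all sections.
unsigned layout(std::vector<Section> &Sections) {
  unsigned Rounds = 0;
  for (Section &Sec : Sections) {
    assignOffsets(Sec);
    while (relaxSection(Sec)) { // offsets are reassigned only after a change
      assignOffsets(Sec);
      ++Rounds;
    }
  }
  return Rounds;
}
```

Because sizes only grow in this model, each section's loop terminates, and a section whose jumps already fit costs a single pass regardless of how long other sections take.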
Symbols
@aengelke made two notable performance improvements to symbol handling.
In MCObjectStreamer, newly defined labels were put into a "pending label" list and initially assigned to an MCDummyFragment associated with the current section. The symbols were reassigned to a new fragment when the next instruction or directive was parsed. This pending label system, while necessary for aligned bundling, introduced complexity and potential for subtle bugs.
To streamline this, I revamped the implementation by directly adjusting offsets of existing fragments, eliminating over 100 lines of code and reducing the potential for errors.
Details: In 2014, [MC] Attach labels to existing fragments instead of using a separate fragment introduced flushPendingLabels to support the aligned bundling assembler extension for Native Client. [MC] Match labels to existing fragments even when switching sections, built on top of flushPendingLabels, added further complication.
In MCObjectStreamer, a newly defined label was temporarily assigned to an MCDummyFragment. The symbol would be reassigned to a new fragment when the next instruction or directive was parsed. The MCDummyFragment was not in the section's fragment list. However, during expression evaluation, it should be considered as the temporary end of the section.
For the following code, aligned bundling requires that .Ltmp0 is defined at addl.
$ clang var.c -S -o - -fPIC -m32
...
.bundle_lock align_to_end
calll .L0$pb
.bundle_unlock
.L0$pb:
popl %eax
.Ltmp0:
addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax
Worse, a lot of directive handling code had to add flushPendingLabels, and a missing flushPendingLabels could lead to subtle bugs related to incorrect symbol values.
(MCAsmStreamer doesn't call flushPendingLabels in its handlers. This is the reason that we cannot change MCAsmStreamer::getAssemblerPtr to use an MCAssembler and change AsmParser::parseExpression.)
Sections
Section handling was also refined. MCStreamer maintains a section stack for features like the .pushsection/.popsection/.previous directives. Many functions relied on the section stack for loading the current section, which introduced overhead due to the additional indirection and nullable return values.
By leveraging the "current fragment" concept, the need for the section stack was eliminated in most cases, simplifying the codebase and improving efficiency.
I have eliminated nullable getCurrentSectionOnly uses and changed getCurrentSectionOnly to leverage the "current fragment" concept. This change also revealed an interesting quirk in NVPTX assembly related to DWARF sections.
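As a rough illustration of why the current fragment removes the extra indirection, the sketch below keeps a section stack only for push/pop bookkeeping and answers "what is the current section?" from the fragment pointer. All types and member names are hypothetical stand-ins rather than the MCStreamer API, and the assumption that every section always owns at least one fragment is mine.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Section;

// Hypothetical fragment that knows which section it belongs to.
struct Fragment {
  Section *Parent;
};

// Hypothetical section; assumed to always own at least one fragment.
struct Section {
  explicit Section(std::string N) : Name(std::move(N)), LastFrag{this} {}
  Section(const Section &) = delete;
  Section &operator=(const Section &) = delete;

  std::string Name;
  Fragment LastFrag;
};

class Streamer {
  std::vector<Section *> SectionStack; // consulted only by push/pop
  Fragment *CurFrag;                   // never null after construction

public:
  explicit Streamer(Section &Initial) : CurFrag(&Initial.LastFrag) {}

  void switchSection(Section &Sec) { CurFrag = &Sec.LastFrag; }

  // Remember the current section, then switch to Sec.
  void pushSection(Section &Sec) {
    SectionStack.push_back(CurFrag->Parent);
    switchSection(Sec);
  }

  // Return to the most recently pushed section; false if nothing was pushed.
  bool popSection() {
    if (SectionStack.empty())
      return false;
    switchSection(*SectionStack.back());
    SectionStack.pop_back();
    return true;
  }

  // Non-nullable: one pointer hop from the current fragment, no stack lookup.
  Section &getCurrentSectionOnly() const { return *CurFrag->Parent; }
};
```

Returning a reference instead of a nullable pointer means callers no longer need a null check on the hot path.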
Section symbols
Many section creation functions (MCContext::get*Section) had a const char *BeginSymName parameter to support the section symbol concept. This led to issues when we wanted to treat the section name as a symbol. In 2017, the parameter was removed for ELF, streamlining section symbol handling.
I changed the way MC handles section symbols for COFF and removed the unused parameters for WebAssembly. The work planned for XCOFF is outlined in https://github.com/llvm/llvm-project/issues/96810.
Expression evaluation
Expression evaluation in MCAssembler::layout previously employed a complex lazy evaluation algorithm, which aimed to minimize the number of fragment relaxations. It proved difficult to understand and resulted in complex recursion detection.
To address this, I removed lazy evaluation in favor of eager fragment relaxation. This simplification improved the reliability of the layout process, eliminating the need for intricate workarounds like the MCFragment::IsBeingLaidOut flag introduced earlier.
Note: the benefit of lazy evaluation largely diminished when https://reviews.llvm.org/D76114 invalidated all sections to fix the correctness issue for the following assembly:
.section .text1,"ax"
In addition, I removed an overload of isSymbolRefDifferenceFullyResolvedImpl, enabling constant folding for variable differences in Mach-O.
Target-specific features misplaced in the generic implementation
I have made efforts to relocate target-specific functionalities to their respective target implementations:
- [MC,X86] emitInstruction: remove virtual function calls due to Intel JCC Erratum
- [MC,X86] De-virtualize emitPrefix
- [MC] Move Mach-O specific getAtom and isSectionAtomizableBySymbols to Mach-O files
- [MC] Move ELFWriter::createMemtagRelocs to AArch64TargetELFStreamer::finish
- [MC] Move MCAsmLayout::SectionOrder to MachObjectWriter::SectionOrder
- Move MCSection::LayoutOrder to MCSectionMachO
- [MC] Move isPrivateExtern to MCSymbolMachO
- [MC] Export llvm::WinCOFFObjectWriter and access it from MCWinCOFFStreamer
- [MC] Move VersionInfo to MachObjectWriter
- [MC] Export llvm::ELFObjectWriter
- MCObjectWriter: Remove XCOFF specific virtual functions
The class hierarchy has been cleaned up by making more MC*ObjectWriter classes public and accessing them from MC*Streamer.
CREL
The integrated assembler now supports CREL (compact relocation) for ELF.
Once the Clang and lld patches are merged, enabling compact relocations is as simple as this:
clang -c -Wa,--crel,--allow-experimental-crel a.c && clang -fuse-ld=lld a.o
Note: This feature is unstable. While relocatable files created with Clang version A will work with lld version A, they might not be compatible with lld version B (where A is older than B).
As the future of the generic ABI remains uncertain, CREL might not get "standardized". In that case, I will just get the section type code agreed upon with the GNU community to ensure wider compatibility.
Assembly parser
- \+, the per-macro invocation count, is now available for .irp/.irpc/.rept.
- [MCParser] .altmacro: Support argument expansion not preceded by \
- [MC] Support .cfi_label
Summary
I've been contributing to MC for several years. Back then, while many contributed, most focused on adding specific features. Rafael Ávila de Espíndola was the last to systematically review and improve the MC layer. Unfortunately, simplification efforts stalled after Rafael's departure in 2018.
Picking up where Rafael left off, I'm diving into the MC layer to streamline its design. A big thanks to @aengelke for his invaluable performance-centric contributions in this release cycle. LLVM 19 introduces significant enhancements to the integrated assembler, resulting in notable performance gains, reduced memory usage, and a more streamlined codebase. These optimizations pave the way for future improvements.
I compiled the preprocessed SQLite Amalgamation (from llvm-test-suite) using a Release build of clang:
build | 2024-05-14 | 2024-07-02 |
---|---|---|
-O0 | 0.5304 | 0.4930 |
-O0 -g | 0.8818 | 0.7967 |
-O2 | 6.249 | 6.098 |
-O2 -g | 7.931 | 7.682 |
clang -c -w sqlite3.i
The AsmPrinter pass, which is coupled with the assembler, consumes a significant portion of the -O0 compile time. I have modified the -ftime-report mechanism to decrease the per-instruction overhead. The decrease in compile time matches the decrease in the time spent in AsmPrinter. Coupled with a recent observation that BOLT, which heavily utilizes MC, is ~8% faster, it's clear that MC modifications have yielded substantial improvements.
Noticeable optimizations in previous releases
[MC] Always encode instruction into SmallVector optimized MCCodeEmitter::encodeInstruction for x86 by avoiding raw_ostream::write overhead. I have migrated other targets and removed the extra overload.
raw_ostream::write =(inlinable)=> flush_tied_then_write (unneeded TiedStream check) =(virtual function call)=> raw_svector_ostream::write_impl ==> SmallVector append(ItTy in_start, ItTy in_end) (range; less efficient than push_back).
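The difference between the two paths can be sketched with standard-library stand-ins. This is not LLVM code: ByteStream and VectorStream are hypothetical types playing the role of a stream wrapper, std::vector<char> stands in for the SmallVector buffer, and the 1-byte opcode plus 4-byte little-endian immediate is an invented encoding. The point is only that the stream path pays a virtual call and a range insert per write, while the direct path is a plain push_back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stream interface: every write goes through a virtual call.
struct ByteStream {
  virtual ~ByteStream() = default;
  virtual void write(const char *Ptr, std::size_t Size) = 0;
};

// Hypothetical stream that forwards to a vector via a range insert.
struct VectorStream final : ByteStream {
  explicit VectorStream(std::vector<char> &B) : Buf(B) {}
  void write(const char *Ptr, std::size_t Size) override {
    Buf.insert(Buf.end(), Ptr, Ptr + Size);
  }
  std::vector<char> &Buf;
};

// Stream path: one virtual write per emitted byte.
void encodeViaStream(std::uint8_t Opcode, std::uint32_t Imm, ByteStream &OS) {
  char Byte = static_cast<char>(Opcode);
  OS.write(&Byte, 1);
  for (int I = 0; I != 4; ++I) {
    Byte = static_cast<char>((Imm >> (8 * I)) & 0xff);
    OS.write(&Byte, 1);
  }
}

// Direct path: append bytes to the buffer with push_back.
void encodeDirect(std::uint8_t Opcode, std::uint32_t Imm,
                  std::vector<char> &CB) {
  CB.push_back(static_cast<char>(Opcode));
  for (int I = 0; I != 4; ++I)
    CB.push_back(static_cast<char>((Imm >> (8 * I)) & 0xff));
}
```

Both functions produce the same bytes; only the cost per byte differs, which matters when the encoder runs once per instruction.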
[MC] Optimize relaxInstruction: remove SmallVector copy. NFC removes code and fixup copy for relaxInstruction.
Roadmap
Symbol redefinition
llvm-mc: Diagnose misuse (mix) of defined symbols and labels. added a redefinition error. This check was refined many times. I hope to fix this in the future.
Addressing Mach-O weakness
The Mach-O assembler lacks the robustness of its ELF counterpart. Notably, certain aspects of the Mach-O implementation, such as the conditions for constant folding in MachObjectWriter::isSymbolRefDifferenceFullyResolvedImpl (different for x86-64 and AArch64), warrant revisiting.
Additionally, the Mach-O assembler has a hack to maintain compatibility with Apple's cctools assembler when the relocation addend is non-zero.
.data
This leads to another workaround in MCFragment.cpp:getSymbolOffsetImpl ([MC] Recursively calculate symbol offset), which is to support the following assembly:
l_a:
Misc
emitLabel at switchSection was for DWARF sections, which might no longer be useful.