2024-06-30

Integrated assembler improvements in LLVM 19

Within the LLVM project, MC is a library responsible for handling assembly, disassembly, and object file formats. Intro to the LLVM MC Project, which was written back in 2010, remains a good source to understand the high-level structures.

In the latest release cycle, substantial effort has been dedicated to refining MC's internal representation for improved performance and readability. These changes have decreased compile time significantly. This blog post will delve into the details, providing insights into the specific changes.

Merged `MCAsmLayout` into `MCAssembler`

MCAssembler manages assembler states (including sections, symbols) and implements post-parsing passes (computing a layout and writing an object file). MCAsmLayout, tightly coupled with MCAssembler, was in charge of symbol and fragment offsets during MCAssembler::Finish. MCAsmLayout was a wrapper of MCAssembler and a section order vector (actually Mach-O specific). Many MCAssembler and MCExpr member functions have a const MCAsmLayout & parameter, contributing to slight overhead. Here are some functions that are called frequently:

MCAssembler::computeFragmentSize is called a lot in the layout process.
MCAsmBackend::handleFixup and MCAsmBackend::applyFixup evaluate each fixup and produce relocations.
MCAssembler::fixupNeedsRelaxation determines whether a MCRelaxableFragment needs relaxation due to a MCFixup.
MCAssembler::relaxFragment and MCAssembler::relaxInstruction relax a fragment.

I started to merge MCAsmLayout into MCAssembler and simplify MC code, and eventually removed llvm/include/llvm/MC/MCAsmLayout.h.

Fragments

Fragments represent sequences of non-relaxable instructions, relaxable instructions, alignment directives, and other elements. MCDataFragment and MCRelaxableFragment, whose sizes are crucial for memory consumption, have undergone several optimizations.

The fragment management system has also been streamlined by transitioning from a doubly-linked list (llvm::iplist) to a singly-linked list, eliminating unnecessary overhead. A few prerequisite commits removed backward iterator requirements.

Furthermore, I introduced the "current fragment" concept (MCStreamer::CurFrag), allowing for faster appending of new fragments.
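As a rough illustration of these two changes, here is a toy sketch of a forward-only fragment list with a "current fragment" pointer. It is not the LLVM implementation: ToyFragment, ToySection, ToyStreamer, and all member names are hypothetical stand-ins chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy model: hypothetical names, not the real MCFragment/MCSection/MCStreamer.
struct ToySection;

struct ToyFragment {
  ToyFragment *Next = nullptr; // single forward link; no Prev pointer
  ToySection *Parent = nullptr;
  std::vector<unsigned char> Contents;
};

struct ToySection {
  ToyFragment *Head = nullptr;
  ToyFragment *Tail = nullptr;
};

struct ToyStreamer {
  std::vector<std::unique_ptr<ToyFragment>> Storage; // owns all fragments
  ToyFragment *CurFrag = nullptr;                    // where new data goes

  // O(1) append after the section's tail; the new fragment becomes current.
  ToyFragment *append(ToySection &Sec) {
    Storage.push_back(std::make_unique<ToyFragment>());
    ToyFragment *F = Storage.back().get();
    F->Parent = &Sec;
    if (Sec.Tail)
      Sec.Tail->Next = F;
    else
      Sec.Head = F;
    Sec.Tail = F;
    CurFrag = F;
    return F;
  }

  // Resume emitting into Sec, creating its first fragment if needed.
  void switchSection(ToySection &Sec) {
    if (Sec.Tail)
      CurFrag = Sec.Tail;
    else
      append(Sec);
  }

  // Bytes go straight into the current fragment; no list lookup is needed.
  void emitByte(unsigned char B) {
    assert(CurFrag && "no current fragment");
    CurFrag->Contents.push_back(B);
  }
};

// Count fragments by walking forward only.
std::size_t countFragments(const ToySection &Sec) {
  std::size_t N = 0;
  for (const ToyFragment *F = Sec.Head; F; F = F->Next)
    ++N;
  return N;
}
```

In this sketch, dropping the backward link saves one pointer per fragment, and every emit goes through CurFrag instead of asking the section for its last fragment.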

I have also simplified and optimized fragment offset computation:

[MC] Relax fragments eagerly

Previously, fragment offsets were computed lazily by getFragmentOffset, and all sections were relaxed iteratively until every one of them converged. The section that converged the slowest determined the number of iterations for all the others, leading to unneeded computation.

The new layout algorithm assigns fragment offsets and iteratively refines them for one section until that section converges. Then, it moves on to the next section. If relaxation doesn't change anything, fragment offset assignment is skipped. This way, sections that converge quickly don't have to wait for the slowest ones, resulting in a significant decrease in compile time for full LTO.

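bool MCAssembler::relaxOnce() {
  bool ChangedAny = false;
  for (MCSection &Sec : *this) {
    // Iteration cap for this section; NumFrags is defined outside this excerpt.
    auto MaxIter = NumFrags + 1;
    uint64_t OldSize = getSectionAddressSize(Sec);
    bool Changed;
    do {
      // Reassign offsets from the start of the section, relaxing as we go.
      uint64_t Offset = 0;
      Changed = false;
      for (MCFragment &F : Sec) {
        if (F.Offset != Offset) {
          F.Offset = Offset;
          Changed = true;
        }
        relaxFragment(F);
        Offset += computeFragmentSize(F);
      }

      // A change in the section size also requires another pass.
      Changed |= OldSize != Offset;
      ChangedAny |= Changed;
      OldSize = Offset;
    } while (Changed && --MaxIter);
    // Give up if the section did not converge within the cap.
    if (MaxIter == 0)
      return false;
  }
  return ChangedAny;
}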
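To make the per-section convergence concrete, here is a self-contained toy model of the same idea. It is only a sketch under simplifying assumptions: ToyFragment, ToySection, and layoutSection are hypothetical, a jump is 2 bytes in short form and 5 bytes in long form, and a jump only ever grows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model with hypothetical types; sizes loosely mimic x86 short/near jumps.
struct ToyFragment {
  enum Kind { Data, Jump };
  Kind K = Data;
  uint64_t Size = 0;      // current encoded size
  uint64_t Offset = 0;    // offset within the section
  std::size_t Target = 0; // for Jump: index of the fragment jumped to
  bool Relaxed = false;   // for Jump: already using the long form
};

using ToySection = std::vector<ToyFragment>;

ToyFragment makeData(uint64_t Size) {
  ToyFragment F;
  F.Size = Size;
  return F;
}

ToyFragment makeJump(std::size_t Target) {
  ToyFragment F;
  F.K = ToyFragment::Jump;
  F.Size = 2; // start in the short form
  F.Target = Target;
  return F;
}

// Grow a short jump to the long form if its displacement no longer fits in
// a signed byte. Returns true if the fragment changed.
bool relaxFragment(ToySection &Sec, ToyFragment &F) {
  if (F.K != ToyFragment::Jump || F.Relaxed)
    return false;
  int64_t Disp = int64_t(Sec[F.Target].Offset) - int64_t(F.Offset + F.Size);
  if (Disp >= -128 && Disp <= 127)
    return false;
  F.Relaxed = true;
  F.Size = 5;
  return true;
}

// Lay out one section until its offsets stop changing and return the number
// of passes. Each section converges independently of the others.
unsigned layoutSection(ToySection &Sec) {
  unsigned Passes = 0;
  bool Changed;
  do {
    Changed = false;
    uint64_t Offset = 0;
    for (ToyFragment &F : Sec) {
      if (F.Offset != Offset) {
        F.Offset = Offset;
        Changed = true;
      }
      Changed |= relaxFragment(Sec, F);
      Offset += F.Size;
    }
    ++Passes;
  } while (Changed);
  return Passes;
}
```

In this model, a section containing only data fragments settles after its second pass, no matter how many passes a jump-heavy section laid out before or after it needs.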

Symbols

@aengelke made two noticeable performance improvements:

In MCObjectStreamer, newly defined labels were put into a "pending label" list and initially assigned to a MCDummyFragment associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system, while necessary for aligned bundling, introduced complexity and potential for subtle bugs.

To streamline this, I revamped the implementation by directly adjusting offsets of existing fragments, eliminating over 100 lines of code and reducing the potential for errors.

Details: In 2014, [MC] Attach labels to existing fragments instead of using a separate fragment introduced flushPendingLabels to support the aligned bundling assembler extension for Native Client. A later commit built on top of flushPendingLabels added further complication: [MC] Match labels to existing fragments even when switching sections.

In MCObjectStreamer, a newly defined label was temporarily assigned to a MCDummyFragment. The symbol would be reassigned to a new fragment when the next instruction or directive was parsed. The MCDummyFragment was not in the section's fragment list. However, during expression evaluation, it should be considered as the temporary end of the section.

For the following code, aligned bundling requires that .Ltmp0 is defined at addl.

$ clang var.c -S -o - -fPIC -m32
...
.bundle_lock align_to_end
  calll   .L0$pb
.bundle_unlock
.L0$pb:
  popl    %eax
.Ltmp0:
  addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %eax

Worse, a lot of directive handling code had to call flushPendingLabels, and a missing call could lead to subtle bugs related to incorrect symbol values.

(MCAsmStreamer doesn't call flushPendingLabels in its handlers. This is the reason that we cannot change MCAsmStreamer::getAssemblerPtr to use a MCAssembler and change AsmParser::parseExpression.)

Sections

Section handling was also refined. MCStreamer maintains a section stack for features like the .pushsection/.popsection/.previous directives. Many functions relied on the section stack for loading the current section, which introduced overhead due to the additional indirection and nullable return values.

By leveraging the "current fragment" concept, the need for the section stack was eliminated in most cases, simplifying the codebase and improving efficiency.

I have eliminated nullable getCurrentSectionOnly uses and changed getCurrentSectionOnly to leverage the "current fragment" concept. This change also revealed an interesting quirk in NVPTX assembly related to DWARF sections.
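The difference between the two ways of finding the current section can be sketched with a toy model. Everything here is hypothetical (ToyStreamer, currentSectionViaStack, currentSectionViaFragment); it only illustrates why a fragment-based lookup avoids both the stack access and the null case.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model with hypothetical names; not the real MCStreamer interface.
struct ToySection {
  const char *Name;
};

struct ToyFragment {
  ToySection *Parent;
};

struct ToyStreamer {
  // (current, previous) pairs, needed only for push/pop/previous directives.
  std::vector<std::pair<ToySection *, ToySection *>> SectionStack;
  ToyFragment *CurFrag = nullptr;

  // Stack-based query: returns null when the stack is empty, so every
  // caller has to handle that case.
  ToySection *currentSectionViaStack() const {
    return SectionStack.empty() ? nullptr : SectionStack.back().first;
  }

  // Fragment-based query: a single load from the current fragment, and a
  // reference rather than a nullable pointer.
  ToySection &currentSectionViaFragment() const {
    assert(CurFrag && "no current fragment");
    return *CurFrag->Parent;
  }
};
```

Returning a reference in the second query is a design choice of this sketch; it encodes the "never null once a fragment exists" property in the type.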

Section symbols

Many section creation functions (MCContext::get*Section) had a const char *BeginSymName parameter to support the section symbol concept. This led to issues when we wanted to treat the section name as a symbol. In 2017, the parameter was removed for ELF, streamlining section symbol handling.

I changed the way MC handles section symbols for COFF and removed the unused parameters for WebAssembly. The work planned for XCOFF is outlined in https://github.com/llvm/llvm-project/issues/96810.

Expression evaluation

Expression evaluation in MCAssembler::layout previously employed a complex lazy evaluation algorithm, which aimed to minimize the number of fragment relaxations. It proved difficult to understand and resulted in complex recursion detection.

To address this, I removed lazy evaluation in favor of eager fragment relaxation. This simplification improved the reliability of the layout process, eliminating the need for intricate workarounds like the MCFragment::IsBeingLaidOut flag introduced earlier.

Note: the benefit of lazy evaluation largely diminished when https://reviews.llvm.org/D76114 invalidated all sections to fix the correctness issue for the following assembly:

.section .text1,"ax"
 .skip after-before,0x0
.L0:

  .section .text2
before:
  jmp .L0
after:

In addition, I removed an overload of isSymbolRefDifferenceFullyResolvedImpl, enabling constant folding for variable differences in Mach-O.

Target-specific features misplaced in the generic implementation

I have made efforts to relocate target-specific functionalities to their respective target implementations:

The class hierarchy has been cleaned up by making more MC*ObjectWriter public and accessing them from MC*Streamer.

CREL

The integrated assembler now supports CREL (compact relocation) for ELF.

Once the Clang and lld patches are merged, enabling compact relocations is as simple as this:

clang -c -Wa,--crel,--allow-experimental-crel a.c && clang -fuse-ld=lld a.o

Note: This feature is unstable. While relocatable files created with Clang version A will work with lld version A, they might not be compatible with a newer lld version B (where A is older than B).

As the future of the generic ABI remains uncertain, CREL might not get "standardized". In that case, I will just get the section code agreed with the GNU community to ensure wider compatibility.

Assembly parser

\+, the per-macro invocation count, is now available for .irp/.irpc/.rept.
[MCParser] .altmacro: Support argument expansion not preceded by \
[MC] Support .cfi_label

Summary

I've been contributing to MC for several years. Back then, while many contributed, most focused on adding specific features. Rafael Ávila de Espíndola was the last to systematically review and improve the MC layer. Unfortunately, simplification efforts stalled after Rafael's departure in 2018.

Picking up where Rafael left off, I'm diving into the MC layer to streamline its design. A big thanks to @aengelke for his invaluable performance-centric contributions in this release cycle. LLVM 19 introduces significant enhancements to the integrated assembler, resulting in notable performance gains, reduced memory usage, and a more streamlined codebase. These optimizations pave the way for future improvements.

I compiled the preprocessed SQLite Amalgamation (from llvm-test-suite) using a Release build of clang:

build	2024-05-14	2024-07-02
-O0	0.5304	0.4930
-O0 -g	0.8818	0.7967
-O2	6.249	6.098
-O2 -g	7.931	7.682

clang -c -w sqlite3.i

The AsmPrinter pass, which is coupled with the assembler, consumes a significant portion of the -O0 compile time. I have modified the -ftime-report mechanism to decrease the per-instruction overhead. The decrease in compile time matches the decrease in the time spent in AsmPrinter. Coupled with a recent observation that BOLT, which heavily utilizes MC, is ~8% faster, it's clear that MC modifications have yielded substantial improvements.

Noticeable optimizations in previous releases

[MC] Always encode instruction into SmallVector optimized MCCodeEmitter::encodeInstruction for x86 by avoiding raw_ostream::write overhead. I have migrated other targets and removed the extra overload.

raw_ostream::write =(inlinable)=> flush_tied_then_write (unneeded TiedStream check) =(virtual function call)=> raw_svector_ostream::write_impl ==> SmallVector append(ItTy in_start, ItTy in_end) (range; less efficient than push_back).
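The shape of that call chain, and of the direct alternative, can be sketched with standard containers. This is a toy model only: ToyStream and ToyVectorStream are hypothetical stand-ins for the stream classes, std::vector stands in for SmallVector, and the two-byte "instruction" is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model; ToyStream and ToyVectorStream are hypothetical, not raw_ostream.
struct ToyStream {
  virtual ~ToyStream() = default;
  // Non-virtual front end forwarding to a virtual back end, similar in
  // spirit to the call chain described above.
  void write(const char *Ptr, std::size_t Size) { writeImpl(Ptr, Size); }

private:
  virtual void writeImpl(const char *Ptr, std::size_t Size) = 0;
};

struct ToyVectorStream : ToyStream {
  explicit ToyVectorStream(std::vector<char> &Buf) : Out(Buf) {}

private:
  void writeImpl(const char *Ptr, std::size_t Size) override {
    Out.insert(Out.end(), Ptr, Ptr + Size); // range append
  }
  std::vector<char> &Out;
};

// Stream-based encoder: one virtual call per byte written.
void encodeViaStream(uint8_t Opcode, uint8_t Imm, ToyStream &OS) {
  char B[2] = {char(Opcode), char(Imm)};
  OS.write(&B[0], 1);
  OS.write(&B[1], 1);
}

// Buffer-based encoder: plain push_back, no virtual dispatch.
void encodeIntoBuffer(uint8_t Opcode, uint8_t Imm, std::vector<char> &CB) {
  CB.push_back(char(Opcode));
  CB.push_back(char(Imm));
}
```

Both encoders produce the same bytes; the second one simply skips the stream layer, which is the kind of saving the change above targets.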

[MC] Optimize relaxInstruction: remove SmallVector copy. NFC removes code and fixup copy for relaxInstruction.

Roadmap

Symbol redefinition

A redefinition error was added by llvm-mc: Diagnose misuse (mix) of defined symbols and labels. The check has been refined many times, and I hope to fix it properly in the future.

Addressing Mach-O weakness

The Mach-O assembler lacks the robustness of its ELF counterpart. Notably, certain aspects of the Mach-O implementation, such as the conditions for constant folding in MachObjectWriter::isSymbolRefDifferenceFullyResolvedImpl (different for x86-64 and AArch64), warrant revisiting.

Additionally, the Mach-O assembler has a hack to maintain compatibility with Apple's cctools assembler when the relocation addend is non-zero.

.data
a = b + 4
.long a   # ARM64_RELOC_UNSIGNED(a) instead of b; This might work around the linker bug(?) when the referenced symbol is b and the addend is 4.

c = d
.long c   # ARM64_RELOC_UNSIGNED(d)

y:
x = y + 4
.long x   # ARM64_RELOC_UNSIGNED(x) instead of y

This leads to another workaround in MCFragment.cpp:getSymbolOffsetImpl ([MC] Recursively calculate symbol offset), which is to support the following assembly:

l_a:
l_b = l_a + 1
l_c = l_b
.long l_c
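The chain l_c -> l_b -> l_a + 1 above is what makes a recursive lookup necessary. Here is a toy sketch of such a lookup. It is not getSymbolOffsetImpl: ToySymbol, SymbolTable, and the MaxDepth guard are hypothetical simplifications, and a variable is restricted to the form "Base + Addend".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model with hypothetical types; not the real MCSymbol/MCExpr classes.
struct ToySymbol {
  bool IsVariable = false;
  uint64_t LabelOffset = 0; // used when !IsVariable
  std::string Base;         // used when IsVariable: value is Base + Addend
  int64_t Addend = 0;
};

using SymbolTable = std::map<std::string, ToySymbol>;

ToySymbol makeLabel(uint64_t Offset) {
  ToySymbol S;
  S.LabelOffset = Offset;
  return S;
}

ToySymbol makeVariable(const std::string &Base, int64_t Addend) {
  ToySymbol S;
  S.IsVariable = true;
  S.Base = Base;
  S.Addend = Addend;
  return S;
}

// Resolve Name to an offset by following variable definitions down to a
// label. Returns false for unknown symbols or chains deeper than MaxDepth
// (a crude guard against cycles such as "x = y; y = x").
bool getSymbolOffset(const SymbolTable &Syms, const std::string &Name,
                     uint64_t &Out, unsigned MaxDepth = 16) {
  auto It = Syms.find(Name);
  if (It == Syms.end() || MaxDepth == 0)
    return false;
  const ToySymbol &S = It->second;
  if (!S.IsVariable) {
    Out = S.LabelOffset;
    return true;
  }
  uint64_t BaseOffset = 0;
  if (!getSymbolOffset(Syms, S.Base, BaseOffset, MaxDepth - 1))
    return false;
  Out = BaseOffset + uint64_t(S.Addend);
  return true;
}
```

With l_a placed at some offset N, this sketch resolves both l_b and l_c to N + 1 by walking the chain of definitions.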

Misc

emitLabel at switchSection was for DWARF sections, which might no longer be useful.
