In my previous post, Relocation Generation in Assemblers, I explored some key concepts behind LLVM’s integrated assemblers. This post dives into recent improvements I’ve made to refine that system.
The LLVM integrated assembler handles fixups and relocatable
expressions as distinct entities. Relocatable expressions, in
particular, are encoded using the MCValue class, which
originally looked like this:
class MCValue { ... };
In this structure:
- RefKind acts as an optional relocation specifier, though only a handful of targets actually use it.
- SymA represents an optional symbol reference (the addend).
- SymB represents another optional symbol reference (the subtrahend).
- Cst holds a constant value.
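The field list above can be pictured with a minimal C++ sketch. This is a reconstruction from the description, not the verbatim LLVM definition: MCSymbol and MCSymbolRefExpr are reduced to stand-in structs, OldMCValue is a hypothetical name, and the integer widths of Cst and RefKind are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for llvm::MCSymbol; only a name is kept for illustration.
struct MCSymbol {
  const char *Name;
};

// Stand-in for llvm::MCSymbolRefExpr: a symbol plus a variant kind.
// Kind == 0 means "no relocation specifier" in this sketch.
struct MCSymbolRefExpr {
  const MCSymbol *Sym;
  uint16_t Kind;
};

// Hypothetical name. Models the value SymA - SymB + Cst, where both symbol
// operands are full symbol-reference expressions rather than plain symbols.
struct OldMCValue {
  const MCSymbolRefExpr *SymA = nullptr; // addend; may carry a specifier
  const MCSymbolRefExpr *SymB = nullptr; // subtrahend; its Kind goes unused
  int64_t Cst = 0;                       // constant term (width assumed)
  uint32_t RefKind = 0;                  // optional specifier (width assumed)
};

// A value with neither symbol set is a plain constant.
inline bool isAbsolute(const OldMCValue &V) { return !V.SymA && !V.SymB; }
```

Note that the relocation specifier can live in two places here (SymA's Kind or RefKind), which is the inconsistency the rest of this post is about.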
While functional, this design had its flaws. For one, the way relocation specifiers were encoded varied across architectures:
- Targets like COFF, Mach-O, and ELF's PowerPC, SystemZ, and X86 embed the relocation specifier within MCSymbolRefExpr *SymA as part of SubclassData.
- Conversely, ELF targets such as AArch64, MIPS, and RISC-V store it as a target-specific subclass of MCTargetExpr, and convert it to MCValue::RefKind during MCExpr::evaluateAsRelocatable.
Another issue was with SymB. Despite being typed as
const MCSymbolRefExpr *, its
MCSymbolRefExpr::VariantKind field went unused. This is
because expressions like add - sub@got are not
relocatable.
Over the weekend, I tackled these inconsistencies and reworked the representation into something cleaner:
class MCValue { ... };
This updated design not only aligns more closely with the concept of
relocatable expressions but also shaves off some compile time in LLVM.
The ambiguous RefKind has been renamed to
Specifier for clarity. Additionally, targets that
previously encoded the relocation specifier within
MCSymbolRefExpr (rather than using
MCTargetExpr) can now access it directly via
MCValue::Specifier.
To support this change, I made a few adjustments:
- Introduced getAddSym and getSubSym methods, returning const MCSymbol *, as replacements for getSymA and getSymB.
- Eliminated dependencies on the old accessors, MCValue::getSymA and MCValue::getSymB.
- Reworked the expression folding code that handles + and -.
- Stored the const MCSymbolRefExpr *SymA specifier at MCValue::Specifier.
- Some targets relied on PC-relative fixups with explicit specifiers forcing relocations. I have defined MCAsmBackend::shouldForceRelocation for SystemZ and cleaned up ARM and PowerPC.
- Changed the type of SymA and SymB to const MCSymbol *.
- Replaced the temporary getSymSpecifier with getSpecifier.
- Replaced the legacy getAccessVariant with getSpecifier.
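To make the end state of the list above concrete, here is a rough C++ sketch of the reworked shape together with a simplified fold for + and -. The accessor names getAddSym, getSubSym, and getSpecifier come from the list; everything else (MCValueSketch, its get factory, getConstant, foldAddSub, and the exact folding rules) is a hypothetical simplification of mine, not LLVM's actual folding code.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for llvm::MCSymbol.
struct MCSymbol {
  const char *Name;
};

// Hypothetical model of the reworked value: SymA - SymB + Cst, with plain
// symbols and a single specifier for the whole expression.
class MCValueSketch {
  const MCSymbol *SymA = nullptr, *SymB = nullptr;
  int64_t Cst = 0;
  uint32_t Specifier = 0;

public:
  static MCValueSketch get(const MCSymbol *A, const MCSymbol *B = nullptr,
                           int64_t C = 0, uint32_t Spec = 0) {
    MCValueSketch V;
    V.SymA = A;
    V.SymB = B;
    V.Cst = C;
    V.Specifier = Spec;
    return V;
  }
  const MCSymbol *getAddSym() const { return SymA; }
  const MCSymbol *getSubSym() const { return SymB; }
  int64_t getConstant() const { return Cst; }
  uint32_t getSpecifier() const { return Specifier; }
};

// Simplified fold of L + R (Subtract == false) or L - R (Subtract == true).
// Returns false when the result does not fit the SymA - SymB + Cst shape.
inline bool foldAddSub(const MCValueSketch &L, const MCValueSketch &R,
                       bool Subtract, MCValueSketch &Res) {
  // Subtracting swaps the roles of R's add and sub symbols.
  const MCSymbol *RAdd = Subtract ? R.getSubSym() : R.getAddSym();
  const MCSymbol *RSub = Subtract ? R.getAddSym() : R.getSubSym();
  // A specifier on the subtrahend (add - sub@got) is not representable.
  if (Subtract && R.getSpecifier())
    return false;
  // At most one specifier for the whole value.
  if (L.getSpecifier() && R.getSpecifier())
    return false;
  // At most one add symbol and one sub symbol.
  if ((L.getAddSym() && RAdd) || (L.getSubSym() && RSub))
    return false;
  const MCSymbol *Add = L.getAddSym() ? L.getAddSym() : RAdd;
  const MCSymbol *Sub = L.getSubSym() ? L.getSubSym() : RSub;
  uint32_t Spec = L.getSpecifier() ? L.getSpecifier() : R.getSpecifier();
  // sym - sym cancels when no specifier is attached.
  if (Add && Add == Sub && !Spec)
    Add = Sub = nullptr;
  // Unsigned arithmetic avoids signed-overflow undefined behavior.
  uint64_t C = Subtract
                   ? uint64_t(L.getConstant()) - uint64_t(R.getConstant())
                   : uint64_t(L.getConstant()) + uint64_t(R.getConstant());
  Res = MCValueSketch::get(Add, Sub, int64_t(C), Spec);
  return true;
}
```

With symbols a and b, folding (a + 8) - (b + 3) in this sketch yields the add symbol a, the sub symbol b, and the constant 5, while a + b and a - b@got are rejected.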
Streamlining Mach-O support
Mach-O assembler support in LLVM has accumulated significant
technical debt, impacting both target-specific and generic code. One
particularly nagging issue was the
const SectionAddrMap *Addrs parameter in
MCExpr::evaluateAs* functions. This parameter existed to
handle cross-section label differences, primarily for generating
(compact) unwind information in Mach-O. A typical example of this can be
seen in assembly like:
.section __TEXT,__text,regular,pure_instructions
The SectionAddrMap *Addrs parameter always felt like a
clunky workaround to me. It wasn’t until I dug into the Mach-O
AArch64 object writer that I realized this hack wasn't necessary for
that writer. This discovery prompted a cleanup effort to remove the
dependency on SectionAddrMap for ARM and X86 and eliminate
the parameter:
- [MC,MachO] Replace SectionAddrMap workaround with cleaner variable handling
- MCExpr: Remove unused SectionAddrMap workaround
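For readers unfamiliar with the removed parameter, the idea behind it can be illustrated as follows. This is only a sketch under my own assumptions: Section, Label, SectionAddrMapSketch, and labelDifference are hypothetical stand-ins, and the real evaluation code in MCExpr is considerably more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins: a section and a label at an offset inside it.
struct Section {
  const char *Name;
};
struct Label {
  const Section *Sec;
  uint64_t Offset;
};

// Hypothetical stand-in for a section-to-address map.
using SectionAddrMapSketch = std::map<const Section *, uint64_t>;

// Computes A - B. Labels in the same section need only their offsets;
// labels in different sections also need each section's address, which is
// what the optional map supplies. Returns false if the difference cannot
// be resolved with the information given.
inline bool labelDifference(const Label &A, const Label &B,
                            const SectionAddrMapSketch *Addrs,
                            int64_t &Diff) {
  if (A.Sec == B.Sec) {
    Diff = int64_t(A.Offset - B.Offset);
    return true;
  }
  if (!Addrs)
    return false;
  auto IA = Addrs->find(A.Sec);
  auto IB = Addrs->find(B.Sec);
  if (IA == Addrs->end() || IB == Addrs->end())
    return false;
  Diff = int64_t((IA->second + A.Offset) - (IB->second + B.Offset));
  return true;
}
```

Threading such a map through every evaluation function is what made the parameter feel like a workaround: most callers never need it.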
While I was at it, I also tidied up MCSymbolRefExpr by
removing
the clunky HasSubsectionsViaSymbolsBit, further
simplifying the codebase.
Streamlining InstPrinter
The MCExpr code also determines how expression operands in assembly instructions are printed. I have made improvements in this area as well: