In my previous post, Relocation Generation in Assemblers, I explored some key concepts behind LLVM’s integrated assemblers. This post dives into recent improvements I’ve made to refine that system.
The LLVM integrated assembler handles fixups and relocatable expressions as distinct entities. Relocatable expressions, in particular, are encoded using the MCValue class, which originally looked like this:
class MCValue { ... };
In this structure:
- RefKind acts as an optional relocation specifier, though only a handful of targets actually use it.
- SymA represents an optional symbol reference (the addend).
- SymB represents another optional symbol reference (the subtrahend).
- Cst holds a constant value.
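As a rough mental model, the four members described above can be sketched with stand-in types. This is only an illustration: Symbol, SymbolRefExpr, and OldValue are hypothetical names, and the integer widths are assumptions rather than LLVM's actual declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for MCSymbol and MCSymbolRefExpr.
struct Symbol {
  std::string Name;
};
struct SymbolRefExpr {
  const Symbol *Sym;
  uint16_t VariantKind; // relocation specifier such as @got; 0 means none
};

// Sketch of the old shape: SymA - SymB + Cst, plus an optional RefKind.
struct OldValue {
  const SymbolRefExpr *SymA = nullptr; // added symbol reference
  const SymbolRefExpr *SymB = nullptr; // subtracted symbol reference
  int64_t Cst = 0;                     // constant part
  uint32_t RefKind = 0;                // optional relocation specifier

  // A value with no symbol references is a plain constant.
  bool isAbsolute() const { return !SymA && !SymB; }
};
```

Note that in this shape a specifier can live in two places, SymA->VariantKind or RefKind, which is the inconsistency the rest of this post is about.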
While functional, this design had its flaws. For one, the way relocation specifiers were encoded varied across architectures:
- Targets like COFF, Mach-O, and ELF's PowerPC, SystemZ, and X86 embed the relocation specifier within MCSymbolRefExpr *SymA as part of SubclassData.
- Conversely, ELF targets such as AArch64, MIPS, and RISC-V store it as a target-specific subclass of MCTargetExpr, and convert it to MCValue::RefKind during MCExpr::evaluateAsRelocatable.
Another issue was with SymB. Despite being typed as const MCSymbolRefExpr *, its MCSymbolRefExpr::VariantKind field went unused. This is because expressions like add - sub@got are not relocatable.
Over the weekend, I tackled these inconsistencies and reworked the representation into something cleaner:
class MCValue { ... };
This updated design not only aligns more closely with the concept of relocatable expressions but also shaves off some compile time in LLVM. The ambiguous RefKind has been renamed to Specifier for clarity. Additionally, targets that previously encoded the relocation specifier within MCSymbolRefExpr (rather than using MCTargetExpr) can now access it directly via MCValue::Specifier.
To support this change, I made a few adjustments:
- Introduced getAddSym and getSubSym methods, returning const MCSymbol *, as replacements for getSymA and getSymB.
- Eliminated dependencies on the old accessors, MCValue::getSymA and MCValue::getSymB.
- Reworked the expression folding code that handles + and -
- Stored the const MCSymbolRefExpr *SymA specifier at MCValue::Specifier
- Some targets relied on PC-relative fixups with explicit specifiers forcing relocations. I have defined MCAsmBackend::shouldForceRelocation for SystemZ and cleaned up ARM and PowerPC
- Changed the type of SymA and SymB to const MCSymbol *
- Replaced the temporary getSymSpecifier with getSpecifier
- Replaced the legacy getAccessVariant with getSpecifier
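To make the reworked shape and the + and - folding more concrete, here is a minimal sketch under stated assumptions. Symbol, Value, and fold are hypothetical names. The accessor names mirror the ones listed above, but the folding rules are a simplification (only identical symbols cancel, and a specifier on a subtracted operand is rejected), not LLVM's actual logic.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for MCSymbol.
struct Symbol {
  std::string Name;
};

// Sketch of the reworked shape: AddSym - SubSym + Cst, with one specifier.
class Value {
  const Symbol *SymA = nullptr; // added symbol
  const Symbol *SymB = nullptr; // subtracted symbol
  int64_t Cst = 0;
  uint32_t Specifier = 0; // 0 means "no specifier" in this sketch

public:
  Value() = default;
  Value(const Symbol *A, const Symbol *B, int64_t C, uint32_t Spec = 0)
      : SymA(A), SymB(B), Cst(C), Specifier(Spec) {}
  const Symbol *getAddSym() const { return SymA; }
  const Symbol *getSubSym() const { return SymB; }
  int64_t getConstant() const { return Cst; }
  uint32_t getSpecifier() const { return Specifier; }
  bool isAbsolute() const { return !SymA && !SymB; }
};

// Fold L + R, or L - R when Negate is set. Returns nullopt when the result
// cannot be expressed as a single AddSym - SubSym + Cst.
std::optional<Value> fold(const Value &L, const Value &R, bool Negate) {
  // Assumption: a specifier on a subtracted operand (add - sub@got) or on
  // both operands is treated as not relocatable.
  if (Negate && R.getSpecifier())
    return std::nullopt;
  if (L.getSpecifier() && R.getSpecifier())
    return std::nullopt;

  // Negation swaps the roles of R's added and subtracted symbols.
  const Symbol *Adds[2] = {L.getAddSym(),
                           Negate ? R.getSubSym() : R.getAddSym()};
  const Symbol *Subs[2] = {L.getSubSym(),
                           Negate ? R.getAddSym() : R.getSubSym()};
  int64_t RC = Negate ? -R.getConstant() : R.getConstant();

  // Cancel a symbol that is both added and subtracted.
  for (auto &A : Adds)
    for (auto &B : Subs)
      if (A && A == B)
        A = B = nullptr;

  // More than one added or subtracted symbol does not fit the shape.
  if ((Adds[0] && Adds[1]) || (Subs[0] && Subs[1]))
    return std::nullopt;

  uint32_t Spec = L.getSpecifier() ? L.getSpecifier() : R.getSpecifier();
  return Value(Adds[0] ? Adds[0] : Adds[1], Subs[0] ? Subs[0] : Subs[1],
               L.getConstant() + RC, Spec);
}
```

With symbols a and b, folding (a + 4) - (b + 1) yields a - b + 3, and adding b + 1 back cancels b to give a + 4. Because the symbols are plain pointers, the cancellation check is a simple pointer comparison with no per-symbol specifier to inspect.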
Streamlining Mach-O support
Mach-O assembler support in LLVM has accumulated significant technical debt, impacting both target-specific and generic code. One particularly nagging issue was the const SectionAddrMap *Addrs parameter in MCExpr::evaluateAs* functions. This parameter existed to handle cross-section label differences, primarily for generating (compact) unwind information in Mach-O. A typical example of this can be seen in assembly like:
.section __TEXT,__text,regular,pure_instructions
The SectionAddrMap *Addrs parameter always felt like a clunky workaround to me. It wasn’t until I dug into the Mach-O AArch64 object writer that I realized this hack wasn’t necessary for that writer. This discovery prompted a cleanup effort to remove the dependency on SectionAddrMap for ARM and X86 and eliminate the parameter:
- [MC,MachO] Replace SectionAddrMap workaround with cleaner variable handling
- MCExpr: Remove unused SectionAddrMap workaround
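To illustrate why a cross-section label difference needs extra information at all, here is a toy model. It is not how LLVM implements it: Label, SectionAddrs, and labelDiff are hypothetical names, and the section names and addresses in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical model: a label is an offset within a named section.
struct Label {
  std::string Section;
  uint64_t Offset;
};

// Hypothetical stand-in for a section address map: name -> start address.
using SectionAddrs = std::map<std::string, uint64_t>;

// Evaluate A - B. Within one section the offsets suffice. Across sections
// the section start addresses are required, so without a map the
// difference stays unresolved.
std::optional<int64_t> labelDiff(const Label &A, const Label &B,
                                 const SectionAddrs *Addrs = nullptr) {
  if (A.Section == B.Section)
    return static_cast<int64_t>(A.Offset - B.Offset);
  if (!Addrs)
    return std::nullopt;
  auto IA = Addrs->find(A.Section);
  auto IB = Addrs->find(B.Section);
  if (IA == Addrs->end() || IB == Addrs->end())
    return std::nullopt;
  return static_cast<int64_t>((IA->second + A.Offset) -
                              (IB->second + B.Offset));
}
```

In this model, two labels in __text at offsets 0x10 and 0x4 differ by 12 regardless of layout, while a label in a different section can only be subtracted once both section start addresses are known. Threading that map through every evaluation call is the kind of plumbing the commits above removed.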
While I was at it, I also tidied up MCSymbolRefExpr by removing the clunky HasSubsectionsViaSymbolsBit, further simplifying the codebase.