2024-10-20

ccls and LSP Semantic Tokens

I've spent countless hours writing and reading C++ code. For many years, Emacs has been my primary editor, and I leverage ccls' (my C++ language server) rainbow semantic highlighting feature.

The feature relies on two custom notification messages $ccls/publishSemanticHighlight and $ccls/publishSkippedRanges. $ccls/publishSemanticHighlight provides a list of symbols, each with kind information (function, type, or variable) of itself and its semantic parent (e.g. a member function's parent is a class), storage duration, and a list of ranges.

struct CclsSemanticHighlightSymbol {
  int id = 0;
  SymbolKind parentKind;
  SymbolKind kind;
  uint8_t storage;
  std::vector<std::pair<int, int>> ranges;

  std::vector<lsRange> lsRanges; // Only used by vscode-ccls
};

struct CclsSemanticHighlight {
  DocumentUri uri;
  std::vector<CclsSemanticHighlightSymbol> symbols;
};

An editor can use consistent colors to highlight different occurrences of a symbol. Different colors can be assigned to different symbols.

Tobias Pisani created emacs-cquery (the predecessor to emacs-ccls) in Nov 2017. Despite not being a fan of Emacs Lisp, I added the rainbow semantic highlighting feature for my own use in early 2018. My setup also relied heavily on these two settings:

Bolding and underlining variables of static storage duration
Italicizing member functions and variables

(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

Key symbol properties (member, static) were visually prominent in my Emacs environment.

My Emacs hacking days are a distant memory – beyond basic configuration tweaks, I haven't touched elisp code since 2018. As my Elisp skills faded, I increasingly turned to Neovim for various editing tasks. Naturally, I wanted to migrate my C++ development workflow to Neovim as well. However, a major hurdle emerged: Neovim lacked the beloved rainbow highlighting I enjoyed in Emacs.

Thankfully, Neovim supports "semantic tokens" from LSP 3.16, a standardized approach adopted by many editors.

I've made changes to ccls (available on a branch; PR) to support semantic tokens. This involves adapting the $ccls/publishSemanticHighlight code to additionally support textDocument/semanticTokens/full and textDocument/semanticTokens/range.
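For readers unfamiliar with the wire format: a semantic tokens response carries a flat integer array with five numbers per token (line delta, start-character delta, length, token type index, modifier bitmask), where positions are relative to the previous token. Below is a minimal sketch of that encoding. The `Token` struct and `encodeSemanticTokens` are hypothetical names for illustration, not the types or functions ccls uses.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical token record for illustration; ccls's own types differ.
struct Token {
  uint32_t line, startChar, length;
  uint32_t tokenType;      // index into the legend's tokenTypes
  uint32_t tokenModifiers; // bitmask over the legend's tokenModifiers
};

// Encode tokens (assumed to be sorted by position) into the flat `data`
// array of a textDocument/semanticTokens/full response.
std::vector<uint32_t> encodeSemanticTokens(const std::vector<Token> &tokens) {
  std::vector<uint32_t> data;
  data.reserve(tokens.size() * 5);
  uint32_t prevLine = 0, prevChar = 0;
  for (const Token &t : tokens) {
    uint32_t deltaLine = t.line - prevLine;
    // The start is relative to the previous token only on the same line.
    uint32_t deltaStart = deltaLine == 0 ? t.startChar - prevChar : t.startChar;
    data.insert(data.end(),
                {deltaLine, deltaStart, t.length, t.tokenType, t.tokenModifiers});
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}
```

A range request can reuse the same encoder after filtering the tokens down to the requested range.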

I utilize a few token modifiers (static, classScope, functionScope, namespaceScope) for highlighting:

vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])

While this approach is a significant improvement over relying solely on nvim-treesitter, I'm still eager to implement rainbow semantic tokens. Although LSP semantic tokens don't directly distinguish symbols, we can create custom modifiers to achieve similar results.

tokenModifiers: {
  "declaration", "definition", "static", ...

  "id0", "id1", ... "id9",
}

In the user-provided initialization options, I set highlight.rainbow to 10.

ccls assigns the same modifier ID to tokens belonging to the same symbol and tries to give different symbols different IDs. Because there are only a few predefined IDs (each linked to a specific color), two symbols occasionally collide and share a color. This is uncommon and generally acceptable.
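To make the idea concrete, here is a minimal sketch of how rainbow IDs can ride on the modifier bitmask. It assumes the rainbow slot is simply the symbol ID modulo the rainbow count. That choice, and the names `makeModifierLegend` and `rainbowModifierBit`, are my assumptions for illustration and not necessarily what ccls does.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Build a modifier legend: the ordinary modifiers followed by
// "id0" .. "id<rainbow-1>".
std::vector<std::string> makeModifierLegend(std::vector<std::string> base,
                                            unsigned rainbow) {
  for (unsigned i = 0; i < rainbow; i++)
    base.push_back("id" + std::to_string(i));
  return base;
}

// Assumption: choose the rainbow slot as symbolId modulo `rainbow`.
// Modifiers travel as one bit each, so numBase + rainbow must stay <= 32.
uint32_t rainbowModifierBit(unsigned numBase, unsigned rainbow,
                            unsigned symbolId) {
  return uint32_t(1) << (numBase + symbolId % rainbow);
}
```

With 3 ordinary modifiers and a rainbow count of 10, symbols 7 and 17 both map to id7, which is the kind of collision described above.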

For a token with type variable, Neovim's built-in LSP plugin assigns a highlight group @lsp.typemod.variable.id$i.cpp where $i is an integer between 0 and 9. This allows us to customize a unique foreground color for each modifier ID.

local func_colors = {
  '#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
  '#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local type_colors = {
  '#e1afc3', '#d533bb', '#9b677f', '#e350b6', '#a04360',
  '#dd82bc', '#de3864', '#ad3f87', '#dd7a90', '#e0438a',
}
local param_colors = {
  '#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
  '#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local var_colors = {
  '#429921', '#58c1a4', '#5ec648', '#36815b', '#83c65d',
  '#419b2f', '#43cc71', '#7eb769', '#58bf89', '#3e9f4a',
}
local all_colors = {
  class = type_colors,
  constructor = func_colors,
  enum = type_colors,
  enumMember = var_colors,
  field = var_colors,
  ['function'] = func_colors,
  method = func_colors,
  parameter = param_colors,
  struct = type_colors,
  typeAlias = type_colors,
  typeParameter = type_colors,
  variable = var_colors
}
for type, colors in pairs(all_colors) do
  for i = 1,#colors do
    vim.api.nvim_set_hl(0, string.format('@lsp.typemod.%s.id%s.cpp', type, i-1), {fg=colors[i]})
  end
end

vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])

Now, let's analyze the C++ code above using this configuration.

While the results are visually pleasing, I need help implementing code lens functionality.

Inactive code highlighting

Inactive code regions (skipped ranges in Clang) are typically displayed in grey. While this can be helpful for identifying unused code, it can sometimes hinder understanding the details. I simply disabled the inactive code feature.

#ifdef X
... // colorful
#else
... // normal instead of grey
#endif

Refresh

When opening a large project, the initial indexing or cache loading process can be time-consuming, often leading to empty lists of semantic tokens for the initially opened files. While ccls prioritizes indexing these files, it's unclear how to notify the client to refresh the files. The existing workspace/semanticTokens/refresh request, unfortunately, doesn't accept text document parameters.

In contrast, with $ccls/publishSemanticHighlight, ccls proactively sends the notification after an index update (see main_OnIndexed).

void main_OnIndexed(DB *db, WorkingFiles *wfiles, IndexUpdate *update) {
  ...

  db->applyIndexUpdate(update);

  // Update indexed content, skipped ranges, and semantic highlighting.
  if (update->files_def_update) {
    auto &def_u = *update->files_def_update;
    if (WorkingFile *wfile = wfiles->getFile(def_u.first.path)) {
      wfile->setIndexContent(g_config->index.onChange ? wfile->buffer_content
                                                      : def_u.second);
      QueryFile &file = db->files[update->file_id];
      // Publish notifications to the file.
      emitSkippedRanges(wfile, file);
      emitSemanticHighlight(db, wfile, file);
      // But how do we send a workspace/semanticTokens/refresh request?????
    }
  }
}

While the semantic token request supports partial results in the specification, Neovim does not implement them. Even if it did, I believe a notification message with a text document parameter would be a more efficient and direct approach.

export interface SemanticTokensParams extends WorkDoneProgressParams,
	PartialResultParams {
	/**
	 * The text document.
	 */
	textDocument: TextDocumentIdentifier;
}
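For comparison, here is a minimal sketch of what a workspace/semanticTokens/refresh request looks like on the wire. It is a server-to-client request without params, so there is nowhere to name a document. `frame` and `makeSemanticTokensRefresh` are hypothetical helpers for illustration; ccls has its own message-writing code.

```cpp
#include <string>

// Frame a JSON-RPC payload with the LSP base-protocol header.
std::string frame(const std::string &body) {
  return "Content-Length: " + std::to_string(body.size()) + "\r\n\r\n" + body;
}

// Build the refresh request. It carries no params, so the server can only
// ask the client to re-request tokens for whatever documents it chooses.
std::string makeSemanticTokensRefresh(int requestId) {
  return frame("{\"jsonrpc\":\"2.0\",\"id\":" + std::to_string(requestId) +
               ",\"method\":\"workspace/semanticTokens/refresh\"}");
}
```

A server should send this only when the client advertises the workspace.semanticTokens.refreshSupport capability.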

Other clients

emacs-ccls

Once this feature branch is merged, Emacs users can simply remove the following lines:

(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

How should lsp-semantic-token-modifier-faces be changed to support rainbow semantic tokens in lsp-mode and emacs-ccls?

The general approach is similar to the following, but we need a feature from lsp-mode (https://github.com/emacs-lsp/lsp-mode/issues/4590).

(setq lsp-semantic-tokens-enable t)
(defface lsp-face-semhl-namespace-scope
         '((t :weight bold)) "highlight for namespace scope symbols" :group 'lsp-semantic-tokens)
(cl-loop for color in '("#429921" "#58c1a4" "#5ec648" "#36815b" "#83c65d"
                        "#417b2f" "#43cc71" "#7eb769" "#58bf89" "#3e9f4a")
       for i = 0 then (1+ i)
       do (custom-declare-face (intern (format "lsp-face-semhl-id%d" i))
                               `((t :foreground ,color))
                               "" :group 'lsp-semantic-tokens))
(setq lsp-semantic-token-modifier-faces
      `(("declaration" . lsp-face-semhl-interface)
        ("definition" . lsp-face-semhl-definition)
        ("implementation" . lsp-face-semhl-implementation)
        ("readonly" . lsp-face-semhl-constant)
        ("static" . lsp-face-semhl-static)
        ("deprecated" . lsp-face-semhl-deprecated)
        ("abstract" . lsp-face-semhl-keyword)
        ("async" . lsp-face-semhl-macro)
        ("modification" . lsp-face-semhl-operator)
        ("documentation" . lsp-face-semhl-comment)
        ("defaultLibrary" . lsp-face-semhl-default-library)
        ("classScope" . lsp-face-semhl-member)
        ("namespaceScope" . lsp-face-semhl-namespace-scope)
        ,@(cl-loop for i below 10 ; id0..id9, matching the ten faces declared above
                   collect (cons (format "id%d" i)
                                 (intern (format "lsp-face-semhl-id%d" i))))
        ))

vscode-ccls

We need help removing the $ccls/publishSemanticHighlight feature and adopting the built-in semantic tokens support. vscode-ccls is not actively maintained, and I'm unable to maintain a plugin for an editor I don't frequently use.

Misc

I use a trick to switch ccls builds without changing editor configurations.

#!/bin/zsh
#export CCLS_TRACEME=s
export LD_PRELOAD=/usr/lib/libmimalloc.so

type=
[[ -f /tmp/ccls-build ]] && type=$(</tmp/ccls-build)

case $type in
  strace)
    exec strace -s999 -e read,write -o /tmp/strace.log -f ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=1 "$@";;
  debug)
    exec ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=2 "$@";;
  release)
    exec ~/ccls/out/release/ccls --log-file=/tmp/cc.log -v=1 "$@";;
  *)
    exec /usr/bin/ccls --log-file=/tmp/cc.log -v=1 "$@";;
esac

Usage:

echo debug > /tmp/ccls-build
nvim # out/debug/ccls is now used

2024-08-18

My involvement with LLVM 19

LLVM 19.1 will soon be released. This post provides a summary of my contributions in this release cycle to record my learning progress.

2024-08-04

lld 19 ELF changes

LLVM 19 will be released. As usual, I maintain lld/ELF and have added some notes to https://github.com/llvm/llvm-project/blob/release/19.x/lld/docs/ReleaseNotes.rst. I've meticulously reviewed nearly all the patches that are not authored by me. I'll delve into some of the key changes.

2024-07-21

Mapping symbols: rethinking for efficiency

In object files, certain code patterns embed data among instructions, and transitions can occur between instruction sets. This creates hurdles for disassemblers, which might misinterpret data as code, resulting in inaccurate output. Furthermore, code written for one instruction set could be incorrectly disassembled as another. To address these issues, some architectures (Arm, C-SKY, NDS32, RISC-V, etc.) define mapping symbols to explicitly denote state transitions. Let's explore this concept using an AArch32 code example:

2024-07-07

Linker compatibility and the "User-Agent" problem

The output of ld.lld -v includes the message "compatible with GNU linkers" to satisfy the detection mechanism used by GNU Libtool. This problem is described in Software compatibility and our own "User-Agent" problem.

The latest m4/libtool.m4 continues to rely on a GNU check.

2024-06-30

Integrated assembler improvements in LLVM 19

Within the LLVM project, MC is a library responsible for handling assembly, disassembly, and object file formats. Intro to the LLVM MC Project, which was written back in 2010, remains a good source to understand the high-level structures.

In the latest release cycle, substantial effort has been dedicated to refining MC's internal representation for improved performance and readability. These changes have decreased compile time significantly. This blog post will delve into the details, providing insights into the specific changes.

2024-06-02

Understanding orphan sections

GNU ld's output section layout is determined by a linker script, which can be either internal (default) or external (specified with -T or -dT). Within the linker script, SECTIONS commands define how input sections are mapped into output sections.

Input sections not explicitly placed by SECTIONS commands are termed "orphan sections".

Orphan sections are sections present in the input files which are not explicitly placed into the output file by the linker script. The linker will still copy these sections into the output file by either finding, or creating a suitable output section in which to place the orphaned input section.

GNU ld's default behavior is to create output sections to hold these orphan sections and insert these output sections into appropriate places.

Orphan section placement is crucial because GNU ld's built-in linker scripts, while understanding common sections like .text/.rodata/.data, are unaware of custom sections. These custom sections should still be included in the final output file.

Grouping: Orphan input sections are grouped into orphan output sections that share the same name.
Placement: These grouped orphan output sections are then inserted into the output sections defined in the linker script. They are placed near similar sections to minimize the number of PT_LOAD segments needed.
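The grouping step can be sketched as follows. This is a deliberately simplified model, not GNU ld's implementation: it treats a section as placed by the script only when its exact name appears in a set, whereas real linker scripts match with patterns, and it leaves out the placement step entirely. `InputSection` and `groupOrphans` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct InputSection {
  std::string file, name;
};

using OutputSection = std::pair<std::string, std::vector<InputSection>>;

// Group input sections that the script did not place into orphan output
// sections of the same name, in order of first appearance.
std::vector<OutputSection>
groupOrphans(const std::vector<InputSection> &inputs,
             const std::set<std::string> &placedByScript) {
  std::vector<OutputSection> out;
  std::map<std::string, std::size_t> index; // section name -> position in out
  for (const InputSection &s : inputs) {
    if (placedByScript.count(s.name))
      continue; // explicitly placed by SECTIONS, so not an orphan
    auto it = index.find(s.name);
    if (it == index.end()) {
      it = index.emplace(s.name, out.size()).first;
      out.emplace_back(s.name, std::vector<InputSection>{});
    }
    out[it->second].second.push_back(s);
  }
  return out;
}
```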

2024-05-26

Evolution of the ELF object file format

The ELF object file format is adopted by many UNIX-like operating systems. While I've previously delved into the control structures of ELF and its predecessors, tracing the historical evolution of ELF and its relationship with the System V ABI can be interesting in itself.

The format consists of the generic specification, processor-specific specifications, and OS-specific specifications. Three key documents often surface when searching for the generic specification:

Tool Interface Standard (TIS) Portable Formats Specification, version 1.2 on https://refspecs.linuxfoundation.org/
System V Application Binary Interface - DRAFT - 10 June 2013 on www.sco.com
Oracle Solaris Linkers and Libraries Guide

The TIS specification breaks ELF into the generic specification, a processor-specific specification (x86), and an OS-specific specification (System V Release 4). However, it has not been updated since 1995. The Solaris guide, though well-written, includes Solaris-specific extensions not applicable to Linux and *BSD. This leaves us primarily with the System V ABI hosted on www.sco.com, which dedicates Chapters 4 and 5 to the ELF format.

Let's trace the ELF history to understand its relationship with the System V ABI.

2024-05-12

Exploring GNU extensions in the Linux kernel

The Linux kernel is written in C, but it also leverages extensions provided by GCC. In 2022, it moved from GCC/Clang -std=gnu89 to -std=gnu11. This article explores my notes on how these GNU extensions are utilized within the kernel.

2024-04-27

Clang's -O0 output: branch displacement and size increase

tl;dr Clang 19 will remove the -mrelax-all default at -O0, significantly decreasing the text section size for x86.

Span-dependent instructions

In assembly languages, some instructions with an immediate operand can be encoded in two (or more) forms with different sizes. On x86-64, a direct JMP/JCC can be encoded either in 2 bytes with an 8-bit relative offset, or in 5 bytes (JMP) or 6 bytes (JCC) with a 32-bit relative offset. A short jump is preferred because it takes less space. However, when the target of the jump is too far away (out of range for an 8-bit relative offset), a near jump must be used.

ja foo    # jump near if above, 0f 87 <rel32> (foo is 128 bytes past the end of a short form)
ja foo    # jump short if above, 77 <rel8> (foo is 126 bytes away)
.nops 126
foo: ret

A 1978 paper by Thomas G. Szymanski ("Assembling Code for Machines with Span-Dependent Instructions") used the term "span-dependent instructions" to refer to such instructions with short and long forms. Assemblers grapple with the challenge of choosing the optimal size for these instructions, often referred to as the "branch displacement problem" since branches are the most common type. A good resource for understanding Szymanski's work is Assembling Span-Dependent Instructions.
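A minimal sketch of one common strategy, "start with the short form and grow until everything fits", is shown below. It is a toy model covering only conditional jumps with the two sizes mentioned above. It is neither Szymanski's algorithm nor LLVM MC's implementation, and `Fragment` and `relax` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// One fragment of a toy section: either a conditional jump to the start of
// another fragment, or an opaque run of bytes.
struct Fragment {
  bool isJump;
  int target; // index of the fragment jumped to (jumps only)
  int size;   // current encoded size in bytes (a short Jcc is 2)
};

constexpr int kNearJcc = 6; // 0f 8x <rel32>

// Grow short jumps to near jumps until every displacement fits in rel8.
// Returns the number of layout passes performed.
int relax(std::vector<Fragment> &frags) {
  int passes = 0;
  for (bool changed = true; changed;) {
    changed = false;
    ++passes;
    // Lay out all fragments using their current sizes.
    std::vector<int> offset(frags.size() + 1, 0);
    for (std::size_t i = 0; i < frags.size(); i++)
      offset[i + 1] = offset[i] + frags[i].size;
    for (std::size_t i = 0; i < frags.size(); i++) {
      Fragment &f = frags[i];
      if (!f.isJump || f.size == kNearJcc)
        continue;
      // The displacement is relative to the end of the jump instruction.
      int disp = offset[f.target] - offset[i + 1];
      if (disp < -128 || disp > 127) {
        f.size = kNearJcc; // sizes only ever grow, so the loop terminates
        changed = true;
      }
    }
  }
  return passes;
}
```

Feeding it two 2-byte jumps to the same label, followed by 126 bytes of padding and a 1-byte `ret`, grows the first jump to 6 bytes and leaves the second at 2 bytes after two passes.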