clang-format and single-line statements

The Google C++ Style is widely adopted by projects. It contains a brace omission guideline in Looping and branching statements:

For historical reasons, we allow one exception to the above rules: the curly braces for the controlled statement or the line breaks inside the curly braces may be omitted if as a result the entire statement appears on either a single line (in which case there is a space between the closing parenthesis and the controlled statement) or on two lines (in which case there is a line break after the closing parenthesis and there are no braces).

1
2
3
4
5
6
7
8
9
// OK - fits on one line.
if (x == kFoo) { return new Foo(); }

// OK - braces are optional in this case.
if (x == kFoo) return new Foo();

// OK - condition fits on one line, body fits on another.
if (x == kBar)
Bar(arg1, arg2, arg3);

In clang-format's predefined Google style for C++, there are two related style options:

1
2
3
% clang-format --dump-config --style=Google | grep -E 'AllowShort(If|Loop)'
AllowShortIfStatementsOnASingleLine: WithoutElse
AllowShortLoopsOnASingleLine: true

The two options cause clang-format to aggressively join lines for the following code:

1
2
3
4
5
6
7
8
for (int x : a)
foo(x);

while (cond())
foo(x);

if (x)
foo(x);

As a heavy debugger user, I find this behavior cumbersome.

1
2
3
4
5
6
7
// clang-format --style=Google
#include <vector>
void foo(int v) {}
int main() {
std::vector<int> a{1, 2, 3};
for (int x : a) foo(x); // breakpoint
}

When GDB stops at the for loop, how can I step into the loop body? Unfortunately, it's not simple.

If I run step, GDB will dive into the implementation detail of the range-based for loop. It will stop at the std::vector::begin function. Stepping out and executing step again will stop at the std::vector::end function. Stepping out and executing step another time will stop at the operator!= function of the iterator type. Here is an interaction example with GDB:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
(gdb) n
5 for (int x : a) foo(v);
(gdb) s
std::vector<int, std::allocator<int> >::begin (this=0x7fffffffdcc0) at /usr/include/c++/14.2.1/bits/stl_vector.h:873
873 begin() _GLIBCXX_NOEXCEPT
(gdb) fin
Run till exit from #0 std::vector<int, std::allocator<int> >::begin (this=0x7fffffffdcc0) at /usr/include/c++/14.2.1/bits/stl_vector.h:873
0x00005555555561d5 in main () at a.cc:5
5 for (int x : a) foo(v);
Value returned is $1 = 1
(gdb) s
std::vector<int, std::allocator<int> >::end (this=0x7fffffffdcc0) at /usr/include/c++/14.2.1/bits/stl_vector.h:893
893 end() _GLIBCXX_NOEXCEPT
(gdb) fin
Run till exit from #0 std::vector<int, std::allocator<int> >::end (this=0x7fffffffdcc0) at /usr/include/c++/14.2.1/bits/stl_vector.h:893
0x00005555555561e5 in main () at a.cc:5
5 for (int x : a) foo(v);
Value returned is $2 = 0
(gdb) s
__gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > > (__lhs=1, __rhs=0) at /usr/include/c++/14.2.1/bits/stl_iterator.h:1235
1235 { return __lhs.base() != __rhs.base(); }
(gdb) fin
Run till exit from #0 __gnu_cxx::operator!=<int*, std::vector<int, std::allocator<int> > > (__lhs=1, __rhs=0) at /usr/include/c++/14.2.1/bits/stl_iterator.h:1235
0x0000555555556225 in main () at a.cc:5
5 for (int x : a) foo(v);
Value returned is $3 = true
(gdb) s
__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator* (this=0x7fffffffdca0) at /usr/include/c++/14.2.1/bits/stl_iterator.h:1091
1091 { return *_M_current; }
(gdb) fin
Run till exit from #0 __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >::operator* (this=0x7fffffffdca0) at /usr/include/c++/14.2.1/bits/stl_iterator.h:1091
0x00005555555561f7 in main () at a.cc:5
5 for (int x : a) foo(v);
Value returned is $4 = (int &) @0x55555556b2b0: 1

You can see that this can significantly hinder the debugging process, as it forces the user to delve into uninteresting function calls of the range-based for loop.

In contrast, when the loop body is on the next line, we can just run next to skip the three uninteresting function calls:

1
2
for (int x : a) // next
foo(x); // step

The AllowShortIfStatementsOnASingleLine style option is similar. While convenient for simple scenarios, it can sometimes hinder debuggability.

For the following code, it's not easy to skip the c() and d() function calls if you just want to step into foo(v).

1
if (c() && d()) foo(v);

Many developers, mindful of potential goto fail-like issues, often opt to include braces in their code. clang-format's default style can further reinforce this practice.

1
2
3
4
5
6
7
// clang-format does not join lines.
if (v) {
foo(v);
}
for (int x : a) {
foo(x);
}

Other predefined styles

clang-format's Chromium style is a variant of the Google style and does not have the aforementioned problem. The LLVM style, and many styles derived from it, do not have the problem either.

1
2
3
4
5
6
% clang-format --dump-config --style=Chromium | grep -E 'AllowShort(If|Loop)'
AllowShortIfStatementsOnASingleLine: Never
AllowShortLoopsOnASingleLine: false
% clang-format --dump-config --style=LLVM | grep -E 'AllowShort(If|Loop)'
AllowShortIfStatementsOnASingleLine: Never
AllowShortLoopsOnASingleLine: false

A comparative look at other Languages

Go, Odin, and Rust require {} for if statements but omit (), striking a balance between clarity and conciseness. C/C++'s required ()` makes opt-in braces feel a bit verbose.

C3 and Jai, similar to C++, make {} optional.

Removing global state from LLD

LLD, the LLVM linker, is a mature and fast linker supporting multiple binary formats (ELF, Mach-O, PE/COFF, WebAssembly). Designed as a standalone program, the code base relies heavily on global state, making it less than ideal for library integration. As outlined in RFC: Revisiting LLD-as-a-library design, two main hurdles exist:

  • Fatal errors: they exit the process without returning control to the caller. This was actually addressed for most scenarios in 2020 by utilizing llvm::sys::Process::Exit(val, /*NoCleanup=*/true) and CrashRecoveryContext (longjmp under the hood).
  • Global variable conflicts: shared global variables do not allow two concurrent invocation.

I understand that calling a linker API could be convenient, especially when you want to avoid shipping another executable (which can be large when you link against LLVM statically). However, I believe that invoking LLD as a separate process remains the recommended approach. There are several advantages:

  • Build system control: Build systems gain greater control over scheduling and resource allocation for LLD. In an edit-compile-link cycle, the link could need more resources and threading is more useful.
  • Better parallelism management
  • Global state isolation: LLVM's global state (primarily cl::opt and ManagedStatic) is isolated.

Read More

Keeping pace with LLVM: compatibility strategies

LLVM's C++ API doesn't offer a stability guarantee. This means function signatures can change or be removed between versions, forcing projects to adapt.

On the other hand, LLVM has an extensive API surface. When a library like llvm/lib/Y relies functionality from another library, the API is often exported in header files under llvm/include/llvm/X/, even if it is not intended to be user-facing.

To be compatible with multiple LLVM versions, many projects rely on #if directives based on the LLVM_VERSION_MAJOR macro. This post explores the specific techniques used by ccls to ensure compatibility with LLVM versions 7 to 19. For the latest release (ccls 0.20241108), support for LLVM versions 7 to 9 has been discontinued.

Given the tight coupling between LLVM and Clang, the LLVM_VERSION_MAJOR macro can be used for both version detection. There's no need to check CLANG_VERSION_MAJOR.

Read More

Tinkering with Neovim

After migrating from Vim to Emacs as my primary C++ editor in 2015, I switched from Vim to Neovim for miscellaneous non-C++ tasks as it is more convenient in a terminal. Customizing the editor with a language you are comfortable with is important. I found myself increasingly drawn to Neovim's terminal-based simplicity for various tasks. Recently, I've refined my Neovim setup to the point where I can confidently migrate my entire C++ workflow away from Emacs.

This post explores the key improvements I've made to achieve this transition. My focus is on code navigation.

Read More

ccls and LSP Semantic Tokens

I've spent countless hours writing and reading C++ code. For many years, Emacs has been my primary editor, and I leverage ccls' (my C++ language server) rainbow semantic highlighting feature.

The feature relies on two custom notification messages $ccls/publishSemanticHighlight and $ccls/publishSkippedRanges. $ccls/publishSemanticHighlight provides a list of symbols, each with kind information (function, type, or variable) of itself and its semantic parent (e.g. a member function's parent is a class), storage duration, and a list of ranges.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
struct CclsSemanticHighlightSymbol {
int id = 0;
SymbolKind parentKind;
SymbolKind kind;
uint8_t storage;
std::vector<std::pair<int, int>> ranges;

std::vector<lsRange> lsRanges; // Only used by vscode-ccls
};

struct CclsSemanticHighlight {
DocumentUri uri;
std::vector<CclsSemanticHighlightSymbol> symbols;
};

An editor can use consistent colors to highlight different occurrences of a symbol. Different colors can be assigned to different symbols.

Tobias Pisani created emacs-cquery (the predecessor to emacs-ccls) in Nov 2017. Despite not being a fan of Emacs Lisp, I added the rainbow semantic highlighting feature for my own use in early 2018. My setup also relied heavily on these two settings:

  • Bolding and underlining variables of static duration storage
  • Italicizing member functions and variables
1
2
(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

Key symbol properties (member, static) were visually prominent in my Emacs environment.

My Emacs hacking days are a distant memory – beyond basic configuration tweaks, I haven't touched elisp code since 2018. As my Elisp skills faded, I increasingly turned to Neovim for various editing tasks. Naturally, I wanted to migrate my C++ development workflow to Neovim as well. However, a major hurdle emerged: Neovim lacked the beloved rainbow highlighting I enjoyed in Emacs.

Thankfully, Neovim supports "semantic tokens" from LSP 3.16, a standardized approach adopted by many editors.

I've made changes to ccls (available on a branch; PR) to support semantic tokens. This involves adapting the $ccls/publishSemanticHighlight code to additionally support textDocument/semanticTokens/full and textDocument/semanticTokens/range.

I utilize a few token modifiers (static, classScope, functionScope, namespaceScope) for highlighting:

1
2
3
4
5
vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])
treesitter, tokyonight-moon

While this approach is a significant improvement over relying solely on nvim-treesitter, I'm still eager to implement rainbow semantic tokens. Although LSP semantic tokens don't directly distinguish symbols, we can create custom modifiers to achieve similar results.

1
2
3
4
5
tokenModifiers: {
"declaration", "definition", "static", ...

"id0", "id1", ... "id9",
}

In the user-provided initialization options, I set highlight.rainbow to 10.

ccls assigns the same modifier ID to tokens belonging to the same symbol, aiming for unique IDs for different symbols. While we only have a few predefined IDs (each linked to a specific color), there's a slight possibility of collisions. However, this is uncommon and generally acceptable.

For a token with type variable, Neovim's built-in LSP plugin assigns a highlight group @lsp.typemod.variable.id$i.cpp where $i is an integer between 0 and 9. This allows us to customize a unique foreground color for each modifier ID.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
local func_colors = {
'#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
'#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local type_colors = {
'#e1afc3', '#d533bb', '#9b677f', '#e350b6', '#a04360',
'#dd82bc', '#de3864', '#ad3f87', '#dd7a90', '#e0438a',
}
local param_colors = {
'#e5b124', '#927754', '#eb992c', '#e2bf8f', '#d67c17',
'#88651e', '#e4b953', '#a36526', '#b28927', '#d69855',
}
local var_colors = {
'#429921', '#58c1a4', '#5ec648', '#36815b', '#83c65d',
'#419b2f', '#43cc71', '#7eb769', '#58bf89', '#3e9f4a',
}
local all_colors = {
class = type_colors,
constructor = func_colors,
enum = type_colors,
enumMember = var_colors,
field = var_colors,
['function'] = func_colors,
method = func_colors,
parameter = param_colors,
struct = type_colors,
typeAlias = type_colors,
typeParameter = type_colors,
variable = var_colors
}
for type, colors in pairs(all_colors) do
for i = 1,#colors do
for _, lang in pairs({'c', 'cpp'}) do
vim.api.nvim_set_hl(0, string.format('@lsp.typemod.%s.id%s.%s', type, i-1, lang), {fg=colors[i]})
end
end
end

vim.cmd([[
hi @lsp.mod.classScope.cpp gui=italic
hi @lsp.mod.static.cpp gui=bold
hi @lsp.typemod.variable.namespaceScope.cpp gui=bold,underline
]])

Now, let's analyze the C++ code above using this configuration.

tokyonight-moon

While the results are visually pleasing, I need help implementing code lens functionality.

Inactive code highlighting

Inactive code regions (skipped ranges in Clang) are typically displayed in grey. While this can be helpful for identifying unused code, it can sometimes hinder understanding the details. I simply disabled the inactive code feature.

1
2
3
4
5
#ifdef X
... // colorful
#else
... // normal instead of grey
#endif

Refresh

When opening a large project, the initial indexing or cache loading process can be time-consuming, often leading to empty lists of semantic tokens for the initially opened files. While ccls prioritizes indexing these files, it's unclear how to notify the client to refresh the files. The existing workspace/semanticTokens/refresh request, unfortunately, doesn't accept text document parameters.

In contrast, with $ccls/publishSemanticHighlight, ccls proactively sends the notification after an index update (see main_OnIndexed).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
void main_OnIndexed(DB *db, WorkingFiles *wfiles, IndexUpdate *update) {
...

db->applyIndexUpdate(update);

// Update indexed content, skipped ranges, and semantic highlighting.
if (update->files_def_update) {
auto &def_u = *update->files_def_update;
if (WorkingFile *wfile = wfiles->getFile(def_u.first.path)) {
wfile->setIndexContent(g_config->index.onChange ? wfile->buffer_content
: def_u.second);
QueryFile &file = db->files[update->file_id];
// Publish notifications to the file.
emitSkippedRanges(wfile, file);
emitSemanticHighlight(db, wfile, file);
// But how do we send a workspace/semanticTokens/refresh request?????
}
}
}

While the semantic token request supports partial results in the specification, Neovim lacks this implementation. Even if it were, I believe a notification message with a text document parameter would be a more efficient and direct approach.

1
2
3
4
5
6
7
export interface SemanticTokensParams extends WorkDoneProgressParams,
PartialResultParams {
/**
* The text document.
*/
textDocument: TextDocumentIdentifier;
}

Other clients

emacs-ccls

Once this feature branch is merged, Emacs users can simply remove the following lines:

1
2
(setq ccls-sem-highlight-method 'font-lock)
(ccls-use-default-rainbow-sem-highlight)

How to change lsp-semantic-token-modifier-faces to support rainbow semantic tokens in lsp-mode and emacs-ccls?

The general approach is similar to the following, but we need a feature from lsp-mode (https://github.com/emacs-lsp/lsp-mode/issues/4590).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
(setq lsp-semantic-tokens-enable t)
(defface lsp-face-semhl-namespace-scope
'((t :weight bold)) "highlight for namespace scope symbols" :group 'lsp-semantic-tokens)
(cl-loop for color in '("#429921" "#58c1a4" "#5ec648" "#36815b" "#83c65d"
"#417b2f" "#43cc71" "#7eb769" "#58bf89" "#3e9f4a")
for i = 0 then (1+ i)
do (custom-declare-face (intern (format "lsp-face-semhl-id%d" i))
`((t :foreground ,color))
"" :group 'lsp-semantic-tokens))
(setq lsp-semantic-token-modifier-faces
`(("declaration" . lsp-face-semhl-interface)
("definition" . lsp-face-semhl-definition)
("implementation" . lsp-face-semhl-implementation)
("readonly" . lsp-face-semhl-constant)
("static" . lsp-face-semhl-static)
("deprecated" . lsp-face-semhl-deprecated)
("abstract" . lsp-face-semhl-keyword)
("async" . lsp-face-semhl-macro)
("modification" . lsp-face-semhl-operator)
("documentation" . lsp-face-semhl-comment)
("defaultLibrary" . lsp-face-semhl-default-library)
("classScope" . lsp-face-semhl-member)
("namespaceScope" . lsp-face-semhl-namespace-scope)
,@(cl-loop for i from 0 to 10
collect (cons (format "id%d" i)
(intern (format "lsp-face-semhl-id%d" i))))
))

vscode-ccls

We require assistance to eliminate the $ccls/publishSemanticHighlight feature and adopt built-in semantic tokens support. Due to the lack of active maintenance for vscode-ccls, I'm unable to maintain this plugin for an editor I don't frequently use.

Misc

I use a trick to switch ccls builds without changing editor configurations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#!/bin/zsh
#export CCLS_TRACEME=s
export LD_PRELOAD=/usr/lib/libmimalloc.so

type=
[[ -f /tmp/ccls-build ]] && type=$(</tmp/ccls-build)

case $type in
strace)
exec strace -s999 -e read,write -o /tmp/strace.log -f ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=1 "$@";;
debug)
exec ~/ccls/out/debug/ccls --log-file=/tmp/cc.log -v=2 "$@";;
release)
exec ~/ccls/out/release/ccls --log-file=/tmp/cc.log -v=1 "$@";;
*)
exec /usr/bin/ccls --log-file=/tmp/cc.log -v=1 "$@";;
esac

Usage:

1
2
echo debug > /tmp/ccls-build
nvim # out/debug/ccls is now used

Mapping symbols: rethinking for efficiency

In object files, certain code patterns embed data within instructions or transitions occur between instruction sets. This can create hurdles for disassemblers, which might misinterpret data as code, resulting in inaccurate output. Furthermore, code written for one instruction set could be incorrectly disassembled as another. To address these issues, some architectures (Arm, C-SKY, NDS32, RISC-V, etc) define mapping symbols to explicitly denote state transition. Let's explore this concept using an AArch32 code example:

Read More

Integrated assembler improvements in LLVM 19

Within the LLVM project, MC is a library responsible for handling assembly, disassembly, and object file formats. Intro to the LLVM MC Project, which was written back in 2010, remains a good source to understand the high-level structures.

In the latest release cycle, substantial effort has been dedicated to refining MC's internal representation for improved performance and readability. These changes have decreased compile time significantly. This blog post will delve into the details, providing insights into the specific changes.

Read More