Programming language behavior

FORTRAN 77 COMMON blocks compiled to COMMON symbols. You could declare a COMMON block in more than one file, with each declaration specifying the number, type, and size of the variables. The linker allocated enough space to satisfy the largest size.

This feature somehow made its way into C. Unix C compilers traditionally permitted a variable to have tentative definitions in different compilation units, and the linker would allocate enough space without reporting an error.

This behavior is in contrast to both the C and C++ standards, but GCC and Clang traditionally defaulted to -fcommon for C. GCC since 10 and Clang since 11 default to -fno-common.

Assembler behavior

The directive .comm identifier, size[, alignment] instructs the assembler to define a COMMON symbol with the specified size and the optional alignment.

In the ELF object file format, the symbol is represented as a STT_OBJECT STB_GLOBAL symbol whose st_shndx field holds SHN_COMMON. In readelf, the SHN_COMMON value is shown as COM.

The st_value field holds the alignment. This is an interesting abuse. Regular definitions are relative to a section (st_value is a section offset) and the section alignment (sh_addralign) is sufficient to encode the symbol alignment information. For COMMON symbols, the section information is unavailable but fortunately st_value is vacant.
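To make the encoding concrete, here is a minimal sketch of the symbol an assembler might emit for .comm x, 4, 4. The struct mirrors the field layout of Elf64_Sym, but Sym64, make_common, is_common, and common_alignment are hypothetical names chosen for illustration; only the constant values come from the generic ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constant values as defined by the generic ABI. */
enum { STB_GLOBAL = 1, STT_OBJECT = 1 };
#define SHN_COMMON 0xfff2

/* Same fields as Elf64_Sym; renamed (hypothetical) to avoid clashing with <elf.h>. */
typedef struct {
  uint32_t st_name;
  uint8_t st_info;   /* (binding << 4) | type */
  uint8_t st_other;
  uint16_t st_shndx; /* SHN_COMMON for a COMMON symbol */
  uint64_t st_value; /* holds the alignment for a COMMON symbol */
  uint64_t st_size;
} Sym64;

/* Build the symbol for `.comm name, size, align` (illustrative helper). */
Sym64 make_common(uint32_t name, uint64_t size, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
  Sym64 s = {0};
  s.st_name = name;
  s.st_info = (uint8_t)((STB_GLOBAL << 4) | STT_OBJECT);
  s.st_shndx = SHN_COMMON;
  s.st_value = align; /* not a section offset: there is no section yet */
  s.st_size = size;
  return s;
}

bool is_common(const Sym64 *s) { return s->st_shndx == SHN_COMMON; }

/* Returns the alignment of a COMMON symbol, or 0 if the symbol is not COMMON. */
uint64_t common_alignment(const Sym64 *s) {
  return is_common(s) ? s->st_value : 0;
}
```

A consumer that reads st_value as an address without first checking st_shndx would misinterpret the alignment, which is why is_common is checked first.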

The binding STB_WEAK is not allowed. Other symbol types are not allowed either.

The generic ABI supports STT_COMMON as another way to label a COMMON symbol. It says:

Symbols with type STT_COMMON label uninitialized common blocks. In relocatable objects, these symbols are not allocated and must have the special section index SHN_COMMON (see below). In shared objects and executables these symbols must be allocated to some section in the defining object.

In relocatable objects, symbols with type STT_COMMON are treated just as other symbols with index SHN_COMMON. If the link-editor allocates space for the SHN_COMMON symbol in an output section of the object it is producing, it must preserve the type of the output symbol as STT_COMMON.

When the dynamic linker encounters a reference to a symbol that resolves to a definition of type STT_COMMON, it may (but is not required to) change its symbol resolution rules as follows: instead of binding the reference to the first symbol found with the given name, the dynamic linker searches for the first symbol with that name with type other than STT_COMMON. If no such symbol is found, it looks for the STT_COMMON definition of that name that has the largest size.

--elf-stt-common=yes causes GNU assembler to use STT_COMMON. It is super rare in the wild, though.

Symbol resolution

The key is: a COMMON symbol does not lead to a duplicate definition error with any kind of definition.

However, the size and alignment fields may be updated when two COMMON symbols are merged. The quoted generic ABI text describes the behavior when a COMMON symbol has different sizes in relocatable objects. The output symbol gets the largest size.

Platforms differ in how the alignment is selected. GNU ld and ld.lld pick the largest alignment.

Mach-O ld64 lets the copy with the largest size decide the alignment.
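The two merging policies can be sketched as follows. This is a simplified model rather than code from any linker: Common, merge_elf, and merge_ld64 are hypothetical names, and the tie-breaking choice in merge_ld64 (the first copy wins when sizes are equal) is my assumption, not something stated above.

```c
#include <stdint.h>

/* Size and alignment of one COMMON symbol copy (hypothetical model type). */
typedef struct {
  uint64_t size, align;
} Common;

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* GNU ld / ld.lld style: largest size and largest alignment, independently. */
Common merge_elf(Common a, Common b) {
  Common r = {max_u64(a.size, b.size), max_u64(a.align, b.align)};
  return r;
}

/* ld64 style: the copy with the largest size also decides the alignment.
   Assumption: on equal sizes the first copy is kept. */
Common merge_ld64(Common a, Common b) {
  return a.size >= b.size ? a : b;
}
```

For a 4-byte copy aligned to 16 and an 8-byte copy aligned to 4, merge_elf yields size 8 with alignment 16, while merge_ld64 yields size 8 with alignment 4.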

When a common symbol is merged with a shared symbol, GNU ld and ld.lld (see D71161) increase st_size if the shared symbol has a larger st_size.

In ELF, the precedence is STB_GLOBAL > COMMON > STB_WEAK.

When the link editor combines several relocatable object files, it does not allow multiple definitions of STB_GLOBAL symbols with the same name. On the other hand, if a defined global symbol exists, the appearance of a weak symbol with the same name will not cause an error. The link editor honors the global definition and ignores the weak ones. Similarly, if a common symbol exists (that is, a symbol whose st_shndx field holds SHN_COMMON), the appearance of a weak symbol with the same name will not cause an error. The link editor honors the common definition and ignores the weak ones.
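The precedence can be modeled with a ranked enum. This is a rough sketch under simplifying assumptions: Kind and resolve are hypothetical names, size and alignment merging is left out, and shared object definitions are ignored.

```c
/* Definition kinds ordered by precedence: higher value wins. */
typedef enum { UNDEF = 0, WEAK_DEF = 1, COMMON_DEF = 2, GLOBAL_DEF = 3 } Kind;

/* Resolve two occurrences of the same name seen in relocatable objects.
   Returns 0 and stores the winning kind in *out, or returns -1 for a
   duplicate definition error (two STB_GLOBAL non-COMMON definitions). */
int resolve(Kind a, Kind b, Kind *out) {
  if (a == GLOBAL_DEF && b == GLOBAL_DEF)
    return -1;
  *out = a > b ? a : b; /* COMMON + COMMON stays COMMON and is merged */
  return 0;
}
```

Note that only the GLOBAL_DEF + GLOBAL_DEF combination is an error; a COMMON symbol never triggers one, whatever it is paired with.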

GNU ld ported a strange rule from Sun's linker in 1999-12; see "GNU-ld behaviour does not match native linker behaviour", which says:

Here is a table showing when an element is pulled in from an archive with the Solaris 2.6 linker and ar program:

When a symbol is COMMON and ld sees an archive, ld checks whether the archive index provides a STB_GLOBAL definition of the symbol. If yes, ld extracts the archive member as well. This is contrary to the usual rule that only an undefined symbol leads to archive member extraction.

ld.lld since 12.0.0 has this behavior (D86142) with the enabled-by-default --fortran-common option.
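The extraction decision can be summarized as a small predicate. This is a simplified sketch and not ld.lld's actual code: SymState, ArchiveEntry, and should_extract are hypothetical names, the fortran_common parameter stands in for the option's state, and details such as weak undefined symbols are ignored.

```c
#include <stdbool.h>

/* Current state of the symbol in the linker's symbol table. */
typedef enum { SYM_UNDEFINED, SYM_COMMON, SYM_DEFINED } SymState;

/* What the archive offers for this name (hypothetical model). */
typedef struct {
  bool in_index;          /* the archive index lists the name */
  bool strong_definition; /* the member has a STB_GLOBAL non-COMMON definition */
} ArchiveEntry;

/* Decide whether the archive member should be extracted. */
bool should_extract(SymState cur, ArchiveEntry e, bool fortran_common) {
  if (!e.in_index)
    return false;
  if (cur == SYM_UNDEFINED)
    return true; /* the usual rule */
  if (cur == SYM_COMMON)
    return fortran_common && e.strong_definition; /* the extra rule */
  return false; /* already defined: nothing to do */
}
```

With the rule disabled, a COMMON symbol behaves like any other definition and never pulls in an archive member.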

Say b0.a and b1.a are mostly identical archives, but b0.a objects are compiled with -fcommon while b1.a objects are compiled with -fno-common. If a.o references b0.a, this archive lookup behavior may cause a duplicate definition error for ld a.o b0.a b1.a, whereas without the rule b0.a would simply shadow b1.a.

What I am most concerned with is how to parallelize symbol resolution in the presence of this archive lookup rule.

GNU ld and ld.lld treat COMMON symbols as though they are in an input section named COMMON. *(COMMON) in a linker script can match these symbols.

Error-prone COMMON symbols

With -fcommon, due to the linker symbol resolution rule, a tentative definition int x; may be overridden by a STB_GLOBAL definition in another compilation unit. This is error-prone since the user may assume an initial value of zero if unaware of an int x = 1; elsewhere.

GNU ld and ld.lld support --warn-common, which detects the error-prone overriding.

Some legacy code may inadvertently rely on COMMON symbols by having something like int x; in a header file. Such code may not compile with -fno-common.

.bss allocation

When producing an executable or shared object, the linker allocates space in .bss to hold COMMON symbols. In GNU ld, COMMON symbols are placed after .bss and .bss.* input sections.

In a relocatable link, COMMON symbols remain COMMON.

Run-time behavior

When a.c and b.c are in the same component (main executable or shared object), with -fcommon, it's clear that the two x resolve to the same copy and the output is 1.

If b.c is compiled and linked into a different component, this works with the help of ELF symbol interposition. When linking the shared object, x is preemptible (default visibility non-local binding) and its access requires GOT indirection. When linking the executable, the linker exports x to the dynamic symbol table because it is used by an input shared object.

If you make x non-preemptible (e.g. via -Bsymbolic) in b.so, b.so will get its own copy.

--no-define-common

In 2001-09, the change "optionally postpone assignment of Common" added this option to be used with -shared.

Here is my understanding: glibc around 2.1.3 used to have an ld.so bug where ELF interposition might not work. Using --no-define-common with shared objects can make COMMON symbols undefined and circumvent the bug.

gold confuses --define-common with -d/FORCE_COMMON_ALLOCATION and implements --define-common with -d semantics. Its --no-define-common is incompatible with GNU ld.

-d, -dc, -dp

In a relocatable link, a COMMON symbol remains COMMON in the output. If -dc is specified, the linker will allocate space to COMMON symbols. -d and -dp are aliases for -dc.

The output has a regular STB_GLOBAL definition. Linking the relocatable output with another which defines x will lead to a duplicate definition error.

The options are obscure and might be used to work around some legacy programs. If the relocatable output is fed into the linker again, ignoring -dc should usually work as well. A problem arises only when a program inspects the relocatable output by itself and does not recognize COMMON symbols. That would imply that the program cannot process a relocatable object with COMMON symbols produced by the assembler either.

For ld.lld, I removed -dp and ignored -d/-dc for 15.0.0: https://github.com/llvm/llvm-project/issues/53660.

--sort-common

By sorting COMMON symbols by decreasing alignment, some padding can be saved. However, I think this hardly ever has any real size benefit. For example, musl specifies --sort-common by default. With -fcommon, I see a 24-byte decrease in .bss, whose total size is 11344 bytes.

Actually, this can degrade performance if COMMON symbols in an object file have locality and --sort-common breaks the locality.
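To see where the padding savings come from, here is a small sketch that lays out COMMON symbols in a given order and compares the end offset with and without sorting. Common, layout_end, and sort_common are hypothetical names, and the layout model (contiguous allocation from offset 0, power-of-two alignments) is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Size and alignment of one COMMON symbol (hypothetical model type). */
typedef struct {
  uint64_t size, align; /* align must be a nonzero power of two */
} Common;

/* Place symbols in order starting at offset 0 and return the end offset. */
uint64_t layout_end(const Common *syms, size_t n) {
  uint64_t off = 0;
  for (size_t i = 0; i < n; i++) {
    off = (off + syms[i].align - 1) & ~(syms[i].align - 1); /* pad up */
    off += syms[i].size;
  }
  return off;
}

/* qsort comparator: larger alignment first. */
static int by_decreasing_align(const void *a, const void *b) {
  uint64_t x = ((const Common *)a)->align;
  uint64_t y = ((const Common *)b)->align;
  return (x < y) - (x > y);
}

/* Model of --sort-common: reorder symbols by decreasing alignment. */
void sort_common(Common *syms, size_t n) {
  qsort(syms, n, sizeof *syms, by_decreasing_align);
}
```

For four symbols with (size, alignment) of (1, 1), (8, 8), (1, 1), (4, 4), the original order ends at offset 24 while the sorted order ends at offset 14. Real programs rarely have enough COMMON symbols with mixed alignments for this to matter much.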

edata, end, and etext

For legacy reasons GNU ld's internal linker script has PROVIDE(edata = .); and similar symbol assignments for the other two symbols. In GNU ld, the definition precedence is: regular symbol assignment > relocatable object definition > PROVIDE symbol assignment.

If a relocatable object file defines end, it will take precedence over the internal linker script PROVIDE(end = .);. This makes sense because the global variable int end; is valid in C and C++.

Before ld.lld 15, int end; compiled with -fcommon is overridden by the linker definition. This will be fixed by D120389.

LLVM IR

In LLVM IR, a COMMON symbol has the "common" linkage. It is an interposable linkage and some optimizations are suppressed. For example:

• InstCombine assumes that the addresses of a common global i8 and an external global i32 may be the same.
• llvm.objectsize intrinsic does not know the size. This may lead to conservative assumptions for some _chk functions.