# Compressed debug sections

In a GCC/Clang -g1 or -g2 build, the debug information is often much larger than text sections. Some assemblers and linkers offer an optional feature which compresses debug sections.

## History

In 2007-11, Craig Silverstein added --compress-debug-sections=zlib to gold. When the option was specified, gold compressed the content of a .debug* section with zlib and changed the section name to .debug*.zlib.$uncompressed_size.

In 2008-04, Craig Silverstein changed the format and contributed "Patch to handle compressed sections" to gdb. The compressed section was renamed to .zdebug*.

In 2010-06, Cary Coutant added --compress-debug-sections to gas and added reading support to objdump and readelf.

"ELF Section Compression" has a nice summary of the .zdebug format. The article lists some problems with the format which led to a new format standardized by the generic ELF ABI in 2012. I recommend that folks interested in the ELF format read this article. My thinking about implementing ELF features has been influenced by profound discussions like this article and other discussions on the generic ABI mailing list.

In Solaris 11.2, its linker introduced -z compress-sections to compress candidate sections.

The generic ABI format led to modifications to the existing assembler and linker options in binutils. In 2015-04, H.J. Lu added --compress-debug-sections=[none|zlib|zlib-gnu|zlib-gabi] to both gas and GNU ld. In 2015-07, H.J. Lu added the same option to gold. zlib and zlib-gnu indicated the .zdebug format while zlib-gabi indicated the generic ABI format.

In 2015-07, [PATCH] Make default compression gABI compliant (milestone: binutils 2.26) changed zlib to indicate the generic ABI format.

The assembler and linker --compress-debug-sections= options are long and difficult to use. In 2014-06, Rainer Orth added -gz and -gz=[none|zlib|zlib-gnu] to GCC.

## .zdebug remnant

https://github.com/golang/go still uses .zdebug for ELF. I hope they can migrate to the generic ABI format, so I have filed https://github.com/golang/go/issues/50796.

## Usage

The compiler driver option -gz (or a variant like -gz=zlib) combines two tasks.

For object generation, it acts as -Wa,--compress-debug-sections=zlib and asks the assembler to compress debug sections in the output .o (and .dwo with -gsplit-dwarf). A compressed section has the SHF_COMPRESSED flag and its content begins with a compression header structure that identifies the compression algorithm.

Currently only ELFCOMPRESS_ZLIB is defined:

"The section data is compressed with the ZLIB algorithm. The compressed ZLIB data bytes begin with the byte immediately following the compression header, and extend to the end of the section. Additional documentation for ZLIB may be found at http://zlib.net."
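To make the layout concrete, here is a minimal Python sketch of reading and writing the content of an SHF_COMPRESSED section. It assumes little-endian ELF64, where the compression header (Elf64_Chdr) holds ch_type, ch_reserved, ch_size and ch_addralign. The helper names `compress_section` and `decompress_section` are made up for illustration and do not belong to any real tool.

```python
import struct
import zlib

ELFCOMPRESS_ZLIB = 1  # ch_type value defined by the generic ABI

# Elf64_Chdr: ch_type (u32), ch_reserved (u32), ch_size (u64), ch_addralign (u64).
# Assumption: little-endian ELF64. ELF32 uses a smaller 12-byte header.
ELF64_CHDR = struct.Struct("<IIQQ")


def compress_section(data: bytes, addralign: int = 1) -> bytes:
    """Build section content: compression header followed by a zlib stream."""
    header = ELF64_CHDR.pack(ELFCOMPRESS_ZLIB, 0, len(data), addralign)
    return header + zlib.compress(data)


def decompress_section(content: bytes) -> bytes:
    """Recover the original bytes from SHF_COMPRESSED section content."""
    ch_type, _reserved, ch_size, _addralign = ELF64_CHDR.unpack_from(content)
    if ch_type != ELFCOMPRESS_ZLIB:
        raise ValueError(f"unsupported ch_type {ch_type}")
    # The zlib data starts right after the header and runs to the section end.
    data = zlib.decompress(content[ELF64_CHDR.size:])
    if len(data) != ch_size:
        raise ValueError("ch_size does not match the uncompressed size")
    return data
```

ch_size lets a consumer allocate the output buffer before uncompressing, which is why the sketch checks it against the actual result.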

For linking, -gz acts as -Wl,--compress-debug-sections=zlib and asks the linker to compress debug sections in the linked image. If you need uncompression for linker input but don't need compression for the linker output, don't bother with -gz. The linker recognizes compressed input and uncompresses it automatically.

You may not want to use -gz if you combine assembly and linking in one step (gcc -g -gz a.c without -S or -c). The intermediate .o file will be discarded. The debug sections compressed by the assembler will immediately be uncompressed by the linker, which wastes effort.

## Evaluation

I tested two -DCMAKE_BUILD_TYPE=Debug builds of clang. clang.dw4 is the uncompressed DWARF v4 build while clang.dw5 is the uncompressed DWARF v5 build. objcopy --compress-debug-sections compresses .debug* sections. The effect is as if the linker was invoked with --compress-debug-sections=zlib.

There are some interesting points about the compression ratios.

• .debug_str is a string table. It compresses really well.
• pre-DWARF-v5 .debug_loc and .debug_ranges compress well because their encodings are relatively inefficient.
• Compressed .debug_rnglists is larger than compressed .debug_ranges. A manually tuned byte-oriented encoding may not beat LZ77:)
• .debug_info does not compress well, which may be a good thing: the encoding is pretty compact.

-gsplit-dwarf{,=split} moves a large portion of .debug_{abbrev,info,loclists,rnglists,str,str_offsets} (and pre-DWARF-v5 .debug_loc) into .dwo files. .debug_line is left and -gz may still be useful.

## Compressing more sections

In 2021-02, I filed a feature request ld: Support compressing arbitrary sections (generalized --compress-debug-sections=). This has not seen immediate use cases.

## Pros and cons

The pros are obvious: compression decreases the sizes of relocatable object files, linked images, or both.

Compression imposes memory usage costs at many stages of development. The assembler, the linker, and the debugger need to allocate additional memory to hold the compressed or uncompressed data.

For object generation, the performance depends on the environment. On one hand, compression performs more work and creates uncompression work for the linker. On the other hand, compressed data is smaller, so I/O is faster. I tested some C++ source files. Other phases of the compiler dominate and -gz seems to barely affect the compile time.

For linking, compression may greatly increase link time. In a project, the content of a .o file may go into multiple linked images. The same data may be compressed multiple times, but it is nearly impossible to reuse a compression result because every linked image has different post-relocation content.

In addition, compressed debug sections may make the debugger significantly slower. Neither gdb nor lldb uncompresses debug sections in parallel.

If you care about the file size and debugging is rare, compressed debug sections in linked images may be good to have.

## Linkers

zlib uses the DEFLATE compressed data format, which was not designed for parallel uncompression. Fortunately the linker does not need to deal with this task, because uncompressing input sections deals with many zlib streams which are embarrassingly parallel.

However, compressing output sections is a major bottleneck. I tested a -DCMAKE_BUILD_TYPE=Debug build of clang with 265MiB SHF_ALLOC sections and 920MiB uncompressed debug sections. If I specify --compress-debug-sections=zlib in a --threads=1 link, "Compress debug sections" takes 2/3 of the link time. In a --threads=8 link, "Compress debug sections" takes nearly 70% of the link time.

We basically have four choices to improve the situation.

• tune zlib parameters
• alternative compression format
• more optimized library
• parallel divide-and-conquer

The first has been done (D70658). ld.lld switched to compression level 1 (Z_BEST_SPEED) for -O1 (default). The previous default (6) was known to save little additional size while taking too much time.
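A rough way to see the trade-off on your own data is to compress the same bytes at several levels and compare size and time. This is only a sketch: `compare_levels` is a hypothetical helper, and the numbers depend heavily on the input, so I show no expected output.

```python
import time
import zlib


def compare_levels(data: bytes, levels=(zlib.Z_BEST_SPEED, 6, zlib.Z_BEST_COMPRESSION)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        size = len(zlib.compress(data, level))
        results[level] = (size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    # Assumption: a synthetic string-table-like input stands in for debug data.
    sample = b"".join(b"_ZN4llvm%dfooEv\0" % i for i in range(200000))
    for level, (size, seconds) in compare_levels(sample).items():
        print(f"level {level}: {size} bytes in {seconds:.3f}s")
```

Feeding it the raw content of a real .debug_info or .debug_str section (for example dumped with objcopy) gives a more representative picture than the synthetic sample.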

The second choice is not really feasible. zstd is better than zlib in all metrics: compression speed, decompression speed, and compression ratio. However, it is not standardized (the only flag the generic ABI specifies is ELFCOMPRESS_ZLIB). It does not have debug producer/consumer support. Using a better format is an ecosystem issue that requires significant undertaking and stakeholder buy-in.

The third choice is difficult for a linker like lld. Importing a library to llvm-project has a significantly high barrier. A new CMake configuration has bad discoverability and benefits very few groups. libdeflate is efficient and seems to do a good job, but I do not know how much more efficient a library must be to justify an import.

The fourth choice is feasible. Rui Ueyama told me that mold optimizes --compress-debug-sections=zlib with sharding. I researched a bit. pigz has a great comment about how it leverages multi-threading: https://github.com/madler/pigz/blob/master/pigz.c. It has some sophisticated features that a linker may not need:

• unless -i is specified, the last 32KiB (window size) from the previous chunk is used as a dictionary (deflateSetDictionary) to improve compression ratio.
• sync marker even in the absence of --rsyncable

I submitted [ELF] Parallelize --compress-debug-sections=zlib (milestone: LLD 14.0.0) to improve the ld.lld algorithm. With divide-and-conquer, we need to stitch the compressed streams together and add the zlib header and the trailer (checksum). Per RFC 1950, a zlib stream consists of a 2-byte header (CMF and FLG), the compressed data, and a 4-byte Adler-32 checksum. We don't use a preset dictionary, so the DICTID field that would otherwise follow the 2-byte header is omitted.
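The layout can be checked against what zlib itself produces. The sketch below picks apart a stream returned by Python's `zlib.compress`. `describe_zlib_stream` is a hypothetical helper, and the field names follow RFC 1950.

```python
import struct
import zlib


def describe_zlib_stream(stream: bytes) -> dict:
    """Split a zlib stream (without preset dictionary) into its RFC 1950 parts."""
    cmf, flg = stream[0], stream[1]
    return {
        "method": cmf & 0x0F,                       # CM: 8 means DEFLATE
        "window_bits": (cmf >> 4) + 8,              # CINFO + 8
        "header_check_ok": (cmf * 256 + flg) % 31 == 0,  # FCHECK rule
        "has_preset_dict": bool(flg & 0x20),        # FDICT bit
        "deflate_size": len(stream) - 2 - 4,        # bytes between header and trailer
        "adler32": struct.unpack(">I", stream[-4:])[0],  # big-endian trailer
    }


if __name__ == "__main__":
    data = b"debug " * 1000
    info = describe_zlib_stream(zlib.compress(data, 1))
    print(info)
    # The trailer is the Adler-32 of the uncompressed data.
    assert info["adler32"] == zlib.adler32(data)
```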

The compressed data is encoded in the DEFLATE compressed data format (RFC1951). It consists of several shards.

DEFLATE blocks are a bit sequence. We need to ensure every shard starts at a byte boundary for concatenation. We use Z_SYNC_FLUSH for all shards but the last to flush the output to a byte boundary. (Z_FULL_FLUSH can be used as well, but Z_FULL_FLUSH clears the hash table which just wastes time.)

The last block requires the BFINAL flag. We call deflate with Z_FINISH to set the flag as well as flush the output to a byte boundary. Under the hood, all of Z_SYNC_FLUSH, Z_FULL_FLUSH, and Z_FINISH emit a non-compressed block (called stored block in zlib). RFC1951 says "Any bits of input up to the next byte boundary are ignored."
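Here is a small Python sketch of the flush behavior, assuming each shard is compressed independently as raw DEFLATE (negative wbits, so zlib adds no header or trailer). `deflate_shard` is a made-up name and this is not the ld.lld code.

```python
import zlib


def deflate_shard(shard: bytes, last: bool, level: int = 1) -> bytes:
    """Compress one shard as raw DEFLATE, ending at a byte boundary."""
    # wbits=-15 selects raw DEFLATE: no zlib header, no Adler-32 trailer.
    c = zlib.compressobj(level, zlib.DEFLATED, -15)
    # Z_SYNC_FLUSH pads to a byte boundary; Z_FINISH also sets BFINAL.
    mode = zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH
    return c.compress(shard) + c.flush(mode)


if __name__ == "__main__":
    first = deflate_shard(b"abc" * 1000, last=False)
    second = deflate_shard(b"xyz" * 1000, last=True)
    # A sync flush ends with an empty stored block: LEN=0x0000, NLEN=0xffff.
    assert first.endswith(b"\x00\x00\xff\xff")
    # The concatenation is one valid raw DEFLATE stream.
    assert zlib.decompress(first + second, -15) == b"abc" * 1000 + b"xyz" * 1000
```

Because every shard uses its own compressor state, no shard can reference data from a previous one, which is what makes plain concatenation valid.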

After the compressed data, the last 4 bytes of the zlib stream are an Adler-32 checksum. adler32_combine can combine two Adler-32 checksums given the length of the second input, i.e. adler32_combine(adler32(A), adler32(B), len(B)) = adler32(cat(A, B)).
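Python's zlib module does not expose adler32_combine, so the following is my own sketch of the arithmetic rather than zlib's implementation. Adler-32 keeps two 16-bit sums modulo 65521: the low half is 1 plus the sum of all bytes, and the high half is the sum of all intermediate low-half values. Appending B after A shifts each of B's len(B) intermediate values by (low half of A) - 1.

```python
import zlib

BASE = 65521  # largest prime below 2**16, as used by Adler-32


def adler32_combine(adler1: int, adler2: int, len2: int) -> int:
    """Combine adler32(A) and adler32(B) into adler32(A + B); len2 is len(B)."""
    a1, b1 = adler1 & 0xFFFF, adler1 >> 16
    a2, b2 = adler2 & 0xFFFF, adler2 >> 16
    # Both halves start from 1 for an empty input, so one of the 1s is dropped.
    a = (a1 + a2 - 1) % BASE
    # Each of B's len2 running sums is offset by (a1 - 1) once A precedes it.
    b = (b1 + b2 + len2 * (a1 - 1)) % BASE
    return (b << 16) | a


if __name__ == "__main__":
    part_a, part_b = b"hello, " * 500, b"world" * 700
    combined = adler32_combine(zlib.adler32(part_a), zlib.adler32(part_b), len(part_b))
    assert combined == zlib.adler32(part_a + part_b)
```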

The final step is to write DEFLATE blocks in parallel.
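Putting the pieces together, here is an end-to-end Python sketch of the sharded approach. It is not the ld.lld implementation: the function names, the shard size and the thread count are assumptions for illustration, and it joins the shards in memory, whereas a linker would write each shard to its precomputed offset in the output file.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BASE = 65521


def adler32_combine(adler1: int, adler2: int, len2: int) -> int:
    """adler32(A + B) from adler32(A), adler32(B) and len(B)."""
    a = ((adler1 & 0xFFFF) + (adler2 & 0xFFFF) - 1) % BASE
    b = ((adler1 >> 16) + (adler2 >> 16) + len2 * ((adler1 & 0xFFFF) - 1)) % BASE
    return (b << 16) | a


def _compress_shard(job):
    shard, last = job
    c = zlib.compressobj(1, zlib.DEFLATED, -15)  # level 1, raw DEFLATE
    mode = zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH
    return c.compress(shard) + c.flush(mode), zlib.adler32(shard), len(shard)


def parallel_zlib_compress(data: bytes, shard_size: int = 1 << 20, threads: int = 4) -> bytes:
    """Compress data into one zlib stream by compressing shards concurrently."""
    shards = [data[i:i + shard_size] for i in range(0, len(data), shard_size)] or [b""]
    jobs = [(shard, i == len(shards) - 1) for i, shard in enumerate(shards)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(_compress_shard, jobs))  # keeps shard order
    checksum = 1  # Adler-32 of the empty input
    for _, adler, size in results:
        checksum = adler32_combine(checksum, adler, size)
    header = b"\x78\x01"  # CMF/FLG: DEFLATE, 32KiB window, no preset dictionary
    return header + b"".join(out for out, _, _ in results) + struct.pack(">I", checksum)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096
    stream = parallel_zlib_compress(payload, shard_size=1 << 18)
    # Any zlib consumer can read the result as an ordinary stream.
    assert zlib.decompress(stream) == payload
```

The size of each compressed shard is known only after compression, which is why the offsets of the shards (and of everything after the section) can be computed only once all shards are done.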

Can we do better? At one point in time, the compressed data is stored in two places: one in the allocated memory holding the compressed shard, the other in the memory-mapped output file. It would be nice if we could avoid the memory allocation. Unfortunately we need to compute the section size first, otherwise we do not know the offsets of the following sections and the section header table (which is usually placed after the last section). There is no good way to estimate the compressed section size without doing the compression. Technically, if the section header table along with .symtab/.shstrtab/.strtab is moved before the debug sections, we can compress the debug sections and append them to the output file. The output file will unfortunately be unconventional, and this will not work when a linker script specifies the exact order of sections.

It is just too hacky to do so much to save a little memory. In addition, placing .symtab/.shstrtab/.strtab and the section header table at the end of the file has the nice property that an optimized strip program can alter just these bytes if it performs symbol-only operations, though I don't know of any strip that leverages this property.

I notified binutils about using a parallel implementation. dwp packages multiple .dwo files and performs a linker-like task. dwp does not support --compress-debug-sections yet, but I do not know whether anyone has such a need.

dwz is a DWARF optimization and duplicate removal tool. There is a feature request for adding --compress-debug-sections.

### More ideas

Q: For compressed input and uncompressed output, can we avoid the temporary buffer holding the uncompressed output?

It's possible but there will be difficulties in resolving relocations. My estimation is that avoiding the temporary buffer may not provide significant speed-up/memory savings.

Q: For compressed input and compressed output, can we not re-compress its data?

The answer is yes in the absence of relocations, but the changes are involved and it may not be a worthwhile trade-off. zlib provides examples/gzjoin.c which concatenates multiple gzip streams (gzip uses DEFLATE as well) without re-compression. It needs to uncompress input to locate the last block in each input stream. If a linker leverages this optimization:

• it needs to port the involved code
• it needs to retain compressed data. This is difficult in ld.lld because features like --gdb-index will update the InputSection member (originally pointing to the compressed data) to point to the uncompressed data. Retaining both will need an extra member in the section representation, increasing memory usage.

Q: Is this still useful with filesystem compression?

Such a comparison will be useful. I have not heard that anyone has a quantitative comparison.

The storage can do compression, but applications can usually make better decisions.

• A build artifact may be transferred over the network several times. A storage-level solution requires compression or uncompression multiple times in the whole system.
• Debug information pages and other pages (e.g. SHF_ALLOC) have very different use rates. A storage-level solution cannot easily leverage this property.