feat: Support -z pack-relative-relocs without actual packing #1701

Merged
mati865 merged 83 commits into wild-linker:main from mati865:push-ruqzpzktuvvm
Apr 4, 2026

Conversation


@mati865 mati865 commented Mar 17, 2026

Works now but doesn't pack RELR entries via bitmaps, so the size reduction is not that big.

Doesn't work yet:

./a.out: error while loading shared libraries: ./a.out: DT_RELR without GLIBC_ABI_DT_RELR dependency

I haven't yet figured out how to cleanly synthesise the GLIBC_ABI_DT_RELR version for the __libc_start_main symbol. I'd prefer to avoid matching that symbol by name, but there might be no other choice.


mati865 commented Mar 17, 2026

Ah, I have misunderstood that part. That version has to be just declared, not assigned to any symbol.
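For reference, a minimal sketch of the records involved (field layouts per the ELF gABI symbol-versioning spec; the structs and values here are illustrative, not wild's actual emission code). The fix amounts to adding a Vernaux for GLIBC_ABI_DT_RELR under the libc.so.6 Verneed, with a version index that no dynsym entry uses:

```rust
// Illustrative only: the .gnu.version_r records involved. Glibc is
// satisfied as long as a GLIBC_ABI_DT_RELR Vernaux exists under the
// libc.so.6 Verneed; no symbol needs to carry its version index.
#[repr(C)]
#[allow(dead_code)]
struct Verneed {
    vn_version: u16, // structure version, always 1
    vn_cnt: u16,     // number of associated Vernaux entries
    vn_file: u32,    // .dynstr offset of "libc.so.6"
    vn_aux: u32,     // byte offset to the first Vernaux
    vn_next: u32,    // byte offset to the next Verneed, or 0
}

#[repr(C)]
#[allow(dead_code)]
struct Vernaux {
    vna_hash: u32,  // SysV ELF hash of the version name
    vna_flags: u16, // e.g. VER_FLG_WEAK; 0 here
    vna_other: u16, // version index; unused by any symbol in this case
    vna_name: u32,  // .dynstr offset of "GLIBC_ABI_DT_RELR"
    vna_next: u32,  // byte offset to the next Vernaux, or 0
}

/// The SysV ELF hash function used for `vna_hash`.
fn elf_hash(name: &str) -> u32 {
    let mut h: u32 = 0;
    for &b in name.as_bytes() {
        h = (h << 4).wrapping_add(u32::from(b));
        let g = h & 0xf000_0000;
        h ^= g >> 24;
        h &= !g;
    }
    h
}

fn main() {
    println!("vna_hash = {:#010x}", elf_hash("GLIBC_ABI_DT_RELR"));
}
```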


mati865 commented Mar 17, 2026

It works now but doesn't support the major selling point of RELR, which is compacting the addresses via bitmaps. I'm not sure how to approach that, as we need sorted entries before they are written for it to work (use a temp buffer?). Currently, I'm working around the problem by sorting the entries after they are already written.

Clang release builds (without debuginfo), linked without and with -z pack-relative-relocs:

❯ ls bin*
.rwxr-xr-x 248M mateusz 17 mar 17:37  bin.default-ld
.rwxr-xr-x 248M mateusz 17 mar 17:37  bin.default-wild
.rwxr-xr-x 242M mateusz 17 mar 17:38  bin.pack-ld
.rwxr-xr-x 244M mateusz 17 mar 17:45  bin.pack-wild

Even without the compaction, this is a small win for binary size.

This is how compacted vs non-compacted entries look:

❯ readelf -Wr bin.pack-ld | rg relr -A 5
Relocation section '.relr.dyn' at offset 0x71f100 contains 7264 entries which relocate 273137 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000c8a8940 000000000c8a8940  __frame_dummy_init_array_entry
0001:  ffffffffffffffff 000000000c8a8948  __frame_dummy_init_array_entry + 0x8
                        000000000c8a8950  __frame_dummy_init_array_entry + 0x10
                        000000000c8a8958  __frame_dummy_init_array_entry + 0x18

❯ readelf -Wr bin.pack-wild | rg relr -A 5
Relocation section '.relr.dyn' at offset 0x70f868 contains 269508 entries which relocate 269508 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000ca36288 000000000ca36288  __frame_dummy_init_array_entry
0001:  000000000ca36290 000000000ca36290  __frame_dummy_init_array_entry + 0x8
0002:  000000000ca36298 000000000ca36298  __frame_dummy_init_array_entry + 0x10
0003:  000000000ca362a0 000000000ca362a0  __frame_dummy_init_array_entry + 0x18
The performance impact is not bad considering the sort workaround:
❯ OUT=/tmp/bin powerprofilesctl launch -p performance hyperfine -w 5 './run-with ~/Projects/wild/target/release/wild' './run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs' './run-with ld.bfd' './run-with ld.bfd -z pack-relative-relocs' './run-with ~/Projects/wild/target/debug/wild' './run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs'
Benchmark 1: ./run-with ~/Projects/wild/target/release/wild
  Time (mean ± σ):      55.3 ms ±   1.8 ms    [User: 1.0 ms, System: 1.3 ms]
  Range (min … max):    52.4 ms …  59.9 ms    51 runs

Benchmark 2: ./run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs
  Time (mean ± σ):      56.3 ms ±   1.1 ms    [User: 1.3 ms, System: 1.0 ms]
  Range (min … max):    54.4 ms …  60.7 ms    52 runs

Benchmark 3: ./run-with ld.bfd
  Time (mean ± σ):      1.853 s ±  0.052 s    [User: 1.391 s, System: 0.454 s]
  Range (min … max):    1.730 s …  1.925 s    10 runs

Benchmark 4: ./run-with ld.bfd -z pack-relative-relocs
  Time (mean ± σ):      1.987 s ±  0.046 s    [User: 1.541 s, System: 0.440 s]
  Range (min … max):    1.882 s …  2.062 s    10 runs

Benchmark 5: ./run-with ~/Projects/wild/target/debug/wild
  Time (mean ± σ):     320.7 ms ±   3.2 ms    [User: 1.7 ms, System: 1.1 ms]
  Range (min … max):   314.1 ms … 326.7 ms    10 runs

Benchmark 6: ./run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs
  Time (mean ± σ):     424.5 ms ±   3.7 ms    [User: 1.3 ms, System: 1.4 ms]
  Range (min … max):   418.2 ms … 430.8 ms    10 runs

Summary
  ./run-with ~/Projects/wild/target/release/wild ran
    1.02 ± 0.04 times faster than ./run-with ~/Projects/wild/target/release/wild -z pack-relative-relocs
    5.80 ± 0.20 times faster than ./run-with ~/Projects/wild/target/debug/wild
    7.67 ± 0.26 times faster than ./run-with ~/Projects/wild/target/debug/wild -z pack-relative-relocs
   33.48 ± 1.44 times faster than ./run-with ld.bfd
   35.90 ± 1.43 times faster than ./run-with ld.bfd -z pack-relative-relocs

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch 6 times, most recently from 5085561 to 5e53268 Compare March 19, 2026 22:41

@davidlattimore davidlattimore left a comment


Given this is marked as a draft, I assume there's still stuff that you want to address, so I haven't reviewed thoroughly yet. I did skim over it though.

let mut verneed_info = state.verneed_info;

if let Some(v) = state.verneed_info.as_ref()
if let Some(v) = &mut verneed_info

It looks like this mut perhaps isn't needed.

@mati865 mati865 Mar 20, 2026


Good catch, this is a leftover from one of the previous attempts at making the glibc special symbol logic not feel like a plaster on a broken limb. I'm fairly satisfied with this version, but as you noticed it still needs a bit of clean-up.

Also, too bad none of the linters caught it.


mati865 commented Mar 20, 2026

Thanks, the status from #1701 (comment) is still up-to-date. Not much progress so far. At least I no longer hate the way this PR implements the glibc special symbol version.

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch from 5e53268 to 00d2009 Compare March 20, 2026 18:49

mati865 commented Mar 21, 2026

Looking at readelf --got-contents output and the RELR addresses (with sorting disabled) from a linked Clang binary got me thinking. It looks like this:

Relocation section '.relr.dyn' at offset 0x70f868 contains 269508 entries which relocate 269508 locations:
Index: Entry            Address           Symbolic Address
0000:  000000000cdbd4f0 000000000cdbd4f0  __dso_handle
0001:  000000000ca37900 000000000ca37900  __do_global_dtors_aux_fini_array_entry
0002:  000000000ca36288 000000000ca36288  __frame_dummy_init_array_entry
0003:  000000000ca36290 000000000ca36290  __frame_dummy_init_arra[...] + 0x8
0004:  000000000cc15a08 000000000cc15a08  _ZTVSt23_Sp_counted_ptr[...] + 0x10
0005:  000000000cc15a10 000000000cc15a10  _ZTVSt23_Sp_counted_ptr[...] + 0x18
0006:  000000000cc15a18 000000000cc15a18  _ZTVSt23_Sp_counted_ptr[...] + 0x20
0007:  000000000cc15a20 000000000cc15a20  _ZTVSt23_Sp_counted_ptr[...] + 0x28
0008:  000000000cc15a28 000000000cc15a28  _ZTVSt23_Sp_counted_ptr[...] + 0x30
0009:  000000000cc15a40 000000000cc15a40  _ZTVSt23_Sp_counted_ptr[...] + 0x10
0010:  000000000cc15a48 000000000cc15a48  _ZTVSt23_Sp_counted_ptr[...] + 0x18
0011:  000000000cc15a50 000000000cc15a50  _ZTVSt23_Sp_counted_ptr[...] + 0x20
0012:  000000000cc15a58 000000000cc15a58  _ZTVSt23_Sp_counted_ptr[...] + 0x28
0013:  000000000cc15a60 000000000cc15a60  _ZTVSt23_Sp_counted_ptr[...] + 0x30
0014:  000000000cda60b0 000000000cda60b0  _GLOBAL_OFFSET_TABLE_
0015:  000000000cc15a78 000000000cc15a78  _ZTVN12_GLOBAL__N_128AA[...] + 0x10
0016:  000000000cc15a80 000000000cc15a80  _ZTVN12_GLOBAL__N_128AA[...] + 0x18
0017:  000000000cc15a88 000000000cc15a88  _ZTVN12_GLOBAL__N_128AA[...] + 0x20
...

But if I map the addresses to their sections, we get:

0000:  000000000cdbd4f0 000000000cdbd4f0  .data
0001:  000000000ca37900 000000000ca37900  .fini_array
0002:  000000000ca36288 000000000ca36288  .init_array
0003:  000000000ca36290 000000000ca36290  .init_array
0004:  000000000cc15a08 000000000cc15a08  .data.rel.ro
0005:  000000000cc15a10 000000000cc15a10  .data.rel.ro
0006:  000000000cc15a18 000000000cc15a18  .data.rel.ro
0007:  000000000cc15a20 000000000cc15a20  .data.rel.ro
0008:  000000000cc15a28 000000000cc15a28  .data.rel.ro
0009:  000000000cc15a40 000000000cc15a40  .data.rel.ro
0010:  000000000cc15a48 000000000cc15a48  .data.rel.ro
0011:  000000000cc15a50 000000000cc15a50  .data.rel.ro
0012:  000000000cc15a58 000000000cc15a58  .data.rel.ro
0013:  000000000cc15a60 000000000cc15a60  .data.rel.ro
0014:  000000000cda60b0 000000000cda60b0  .got
0015:  000000000cc15a78 000000000cc15a78  .data.rel.ro
0016:  000000000cc15a80 000000000cc15a80  .data.rel.ro
0017:  000000000cc15a88 000000000cc15a88  .data.rel.ro

The relocations are not as unordered as I previously thought. Currently, sections are written at the first available slot, but relocations within each section are already ordered (not verified, but it seems plausible).
So, if I could apply the layout to the .relr.dyn section by offsetting the relocations, rather than writing them at the first available slot, everything would naturally land in the perfect order.

The GOT won't be a problem either (at least in Clang's case). Even though there are 2002 entries with relative relocs:

❯ readelf -W --got-contents bin | rg RELATIVE | wc -l
2002

They are sequential:

❯ readelf -W --got-contents bin | rg -ow 'R_.*?\s' | uniq
R_X86_64_RELATIVE
R_X86_64_GLOB_DAT
R_X86_64_TPOFF64
R_X86_64_GLOB_DAT
R_X86_64_TPOFF64
R_X86_64_GLOB_DAT

Other relocations should be irrelevant here.


With some creativity this solution could be extended to avoid over-allocation as well. If we store the first and last addresses of each written relocation chunk for each section somewhere, we can increment the sizes by the chunk size instead of the relocation count.

Maybe it will be clearer this way:

```
Id   Reloc address
00: 0000

# 16 bytes apart, cannot be packed with 00
01: 0010
02: 0018
03: 0020
# 3 consecutive addresses: emit 01 as the real reloc and move 02 and 03 into the bitmap

04: 00d0
```
In that example we would end up with 3 relocation entries (for 00, 01, and 04) rather than 5. Sounds almost too good to be true.


mati865 commented Mar 24, 2026

There is something I didn't account for: buckets and groups.
I still cannot figure out how the buffer splitting works with regard to threads and groups (input files). This is the best I could come up with so far: eac18b6

It works only for simple cases: using a single thread (--threads=1) and a low number of inputs. For example, running:

❯ wild/tests/build/pack-relative-relocs.c/default-host/pack-relative-relocs.c.save/run-with cargo r --bin wild -q -- --threads=1

wild on  HEAD (eac18b6) is 📦 v0.8.0 via 🦀 v1.94.0
❯ readelf -Wr wild/tests/build/pack-relative-relocs.c/default-host/pack-relative-relocs.c.save/bin | rg -A99 relr
Relocation section '.relr.dyn' at offset 0xd60 contains 10 entries which relocate 10 locations:
Index: Entry            Address           Symbolic Address
0000:  0000000000003120 0000000000003120  __frame_dummy_init_array_entry
0001:  0000000000003128 0000000000003128  __frame_dummy_init_array_entry + 0x8
0002:  0000000000003130 0000000000003130  __frame_dummy_init_array_entry + 0x10
0003:  0000000000003138 0000000000003138  __do_global_dtors_aux_fini_array_entry
0004:  0000000000003140 0000000000003140  __do_global_dtors_aux_fini_array_entry + 0x8
0005:  0000000000003148 0000000000003148  __do_global_dtors_aux_fini_array_entry + 0x10
0006:  0000000000003150 0000000000003150  ptrs_b
0007:  0000000000003158 0000000000003158  ptrs_b + 0x8
0008:  0000000000003160 0000000000003160  ptrs_b + 0x10
0009:  0000000000004418 0000000000004418  __dso_handle

This shows the addresses are in order, but adding threads or inputs makes the offsets fall apart.

EDIT: Regarding the fields in common structs, some of those were ELF-specific when I started, and while rebasing I didn't bother cleaning up code that doesn't even work yet. I'll move them if this ever starts working correctly.

@davidlattimore

The closest equivalent I can think of in the linker is how dynamic symbol definitions are handled. During the GC phase, objects collect up the dynamic symbols that they need to define, storing them in CommonGroupState::dynamic_symbol_definitions. Afterwards, merge_dynamic_symbol_definitions combines them together and they're supplied to the epilogue, which is responsible for sorting them and then writing them. Perhaps you can do something similar here?

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch from d8707e5 to e98b0a1 Compare March 26, 2026 22:12

mati865 commented Mar 26, 2026

Thanks for the pointer. I had looked there but I don't see how that would work without huge changes to how relocations are handled. Although that might be unavoidable to enable packing via bitmaps.

Switching to OutputSectionPartMap helped make sense of the buffer splitting per group, but I couldn't come up with anything sensible, so in the end one hack is replaced with another...
Also, I'm afraid this didn't get me any closer to bitmap packing. Even the last of the 200k+ relocations in the Clang binary shows the problem. This is the info that I gather during the layout phase:

group_id: 123, part_id: 161, end_offset: 2155960, relocations within chunk: 1
group_id: 142, part_id: 161, end_offset: 2155968, relocations within chunk: 1
group_id: 159, part_id: 161, end_offset: 2156000, relocations within chunk: 4
group_id: 164, part_id: 161, end_offset: 2156040, relocations within chunk: 5
group_id: 169, part_id: 161, end_offset: 2156048, relocations within chunk: 1
group_id: 170, part_id: 161, end_offset: 2156056, relocations within chunk: 1
group_id: 171, part_id: 161, end_offset: 2156064, relocations within chunk: 1

Compared with the output:

❯ readelf -r /tmp/bin | tail -14
269494:  000000000cdbd800 000000000cdbd800  _ZZN5clang6markup15EmitPlistHeade[...]
269495:  000000000cdbd808 000000000cdbd808  _ZN4llvm6detail18UniqueFunctionBa[...]
269496:  000000000cdbd810 000000000cdbd810  _ZN5clang6format20DefaultFallback[...]
269497:  000000000cdbd818 000000000cdbd818  _ZN5clang6format18DefaultFormatStyleE
269498:  000000000cdbd820 000000000cdbd820  _ZN5clang6format26StyleOptionHelp[...]
269499:  000000000cdbd828 000000000cdbd828  _ZZN5clang6format16getParseCatego[...]
269500:  000000000cdbd830 000000000cdbd830  _ZZN4llvm18instrprof_categoryEvE1[...]
269501:  000000000cdbd838 000000000cdbd838  _ZN4llvm19InstrProfCorrelator24Nu[...]
269502:  000000000cdbd840 000000000cdbd840  _ZN4llvm19InstrProfCorrelator20CF[...]
269503:  000000000cdbd848 000000000cdbd848  _ZN4llvm19InstrProfCorrelator25Fu[...]
269504:  000000000cdbd850 000000000cdbd850  _ZZN4llvm19sampleprof_categoryEvE[...]
269505:  000000000cdbd860 000000000cdbd860  _ZZN4llvm6object15object_category[...]
269506:  000000000cdbd868 000000000cdbd868  _ZZN4llvm20BitcodeErrorCategoryEv[...]
269507:  000000000cdbd870 000000000cdbd870  _ZZN4llvm8codeview15CVErrorCatego[...]

There is a single relocation missing between 000000000cdbd850 and 000000000cdbd860. There is no way to figure that out before symbols get offsets assigned, and probably not even until we start applying relocations, which conflicts with output preallocation.
The symbol that created the offset hole in the relocations is no different from any other symbol:

❯ readelf -Ws /tmp/bin | rg '000000000cdbd860|000000000cdbd858|000000000cdbd850'
116965: 000000000cdbd850     8 OBJECT  LOCAL  DEFAULT   27 _ZZN4llvm19sampleprof_categoryEvE13ErrorCategory
117966: 000000000cdbd858     8 OBJECT  LOCAL  DEFAULT   27 _ZZL17computeMemberDataRN4llvm11raw_ostreamES1_NS_6object7Archive4KindEbbNS_17SymtabWritingModeEP6SymMapRNS_11LLVMContextENS_8ArrayRefINS_16NewArchiveMemberEEESt8optionalIbENS_12function_refIFvNS_5ErrorEEEEE11PaddingData
118280: 000000000cdbd860     8 OBJECT  LOCAL  DEFAULT   27 _ZZN4llvm6object15object_categoryEvE14error_category

All three are Local Data: NON_INTERPOSABLE | DIRECT according to --sym-info.

I don't know if it's worth investing any more time into this approach, as there is very little hope of implementing fully functional packed relocations this way.

At least the performance of the current hack is within noise margins on my machine:


wild-main - the commit this branch is based on.
wild-pack-sort-hack - writing relocations unsorted and sorting them afterwards.
wild-pack-split-hack - splitting buffers in a way that makes the symbols land in the right spots.
I confirmed the proper ordering of offsets using readelf -r /tmp/bin.packed.pack-split-hack | awk '/^Relocation section.*\.relr\.dyn/,/^$/ {if ($1 ~ /^[0-9a-f]+:/) print $2}' | sort -c.

❯ powerprofilesctl launch -p performance hyperfine -w 20 -L suffix main,pack-sort-hack,pack-split-hack 'OUT=/tmp/bin.packed.{suffix} ./run-with ~/Projects/wild/target/release/wild-{suffix} -z pack-relative-relocs' 'OUT=/tmp/bin.{suffix} ./run-with ~/Projects/wild/target/release/wild-{suffix}'
Benchmark 1: OUT=/tmp/bin.packed.main ./run-with ~/Projects/wild/target/release/wild-main -z pack-relative-relocs
  Time (mean ± σ):      54.5 ms ±   1.2 ms    [User: 1.2 ms, System: 1.1 ms]
  Range (min … max):    52.1 ms …  56.8 ms    53 runs

Benchmark 2: OUT=/tmp/bin.main ./run-with ~/Projects/wild/target/release/wild-main
  Time (mean ± σ):      54.4 ms ±   1.1 ms    [User: 1.3 ms, System: 1.0 ms]
  Range (min … max):    52.2 ms …  56.4 ms    56 runs

Benchmark 3: OUT=/tmp/bin.packed.pack-sort-hack ./run-with ~/Projects/wild/target/release/wild-pack-sort-hack -z pack-relative-relocs
  Time (mean ± σ):      56.1 ms ±   0.9 ms    [User: 1.1 ms, System: 1.1 ms]
  Range (min … max):    54.0 ms …  58.0 ms    52 runs

Benchmark 4: OUT=/tmp/bin.pack-sort-hack ./run-with ~/Projects/wild/target/release/wild-pack-sort-hack
  Time (mean ± σ):      54.7 ms ±   0.9 ms    [User: 1.1 ms, System: 1.2 ms]
  Range (min … max):    52.7 ms …  56.9 ms    54 runs

Benchmark 5: OUT=/tmp/bin.packed.pack-split-hack ./run-with ~/Projects/wild/target/release/wild-pack-split-hack -z pack-relative-relocs
  Time (mean ± σ):      54.6 ms ±   1.1 ms    [User: 1.1 ms, System: 1.2 ms]
  Range (min … max):    52.4 ms …  57.5 ms    54 runs

Benchmark 6: OUT=/tmp/bin.pack-split-hack ./run-with ~/Projects/wild/target/release/wild-pack-split-hack
  Time (mean ± σ):      54.9 ms ±   1.0 ms    [User: 1.1 ms, System: 1.2 ms]
  Range (min … max):    52.9 ms …  57.5 ms    55 runs

Summary
  OUT=/tmp/bin.main ./run-with ~/Projects/wild/target/release/wild-main ran
    1.00 ± 0.03 times faster than OUT=/tmp/bin.packed.main ./run-with ~/Projects/wild/target/release/wild-main -z pack-relative-relocs
    1.00 ± 0.03 times faster than OUT=/tmp/bin.packed.pack-split-hack ./run-with ~/Projects/wild/target/release/wild-pack-split-hack -z pack-relative-relocs
    1.01 ± 0.03 times faster than OUT=/tmp/bin.pack-sort-hack ./run-with ~/Projects/wild/target/release/wild-pack-sort-hack
    1.01 ± 0.03 times faster than OUT=/tmp/bin.pack-split-hack ./run-with ~/Projects/wild/target/release/wild-pack-split-hack
    1.03 ± 0.03 times faster than OUT=/tmp/bin.packed.pack-sort-hack ./run-with ~/Projects/wild/target/release/wild-pack-sort-hack -z pack-relative-relocs

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch from e98b0a1 to fe42bb7 Compare March 26, 2026 22:14
@davidlattimore

I haven't looked (recently) at your code changes, but I gave some thought to how to simplify packed relocations... would it help if we reduced scope and only packed relative relocations within each input section? i.e. when we scan the relocations for an input section (during the layout phase), we allocate an entry in RELR when we first encounter a relative relocation, then if we encounter any more within the subsequent 63 addresses, allocate another entry (bitmask entry). That'd make sizing much more localised. Obviously it wouldn't give as much space saving as doing the operation globally, but I bet it would still give some savings. Especially when you have a section containing a vtable.


mati865 commented Mar 27, 2026

Hmm, how can we tell in layout if symbols are adjacent? If I add elf_symbol.size() near my allocate_relr call, it only shows zeroes. Very likely I'm trying to access the wrong thing - this part is copied from elf_writer.rs.

I've implemented (or at least I hope so) counting of the possible savings if we consider symbols adjacent only when they are in the same section and the same file, and at most 4 bytes away from the previous symbol. That 4-bytes-apart criterion is bad, but I haven't yet figured out how to solve it. I'll take another look over Saturday or Sunday.

I did this to see what kind of savings that approach would give, and whether your suggestion would work.

@davidlattimore

Hmm, how can we tell in layout if symbols are adjacent?

During the size-calculating (GC) phase, you can't, but do you need to? I'm probably missing something, but I thought what mattered was (a) whether a particular relocation was relative, i.e. not a dynamic symbol or some other exotic type, and (b) the offset of the relocation itself. I didn't think the addresses to which the relocations referred mattered, since I assumed those were addends that were stored at the location where the relocation is to be applied and only needed to be computed in the write stage.

Perhaps I should describe my understanding of RELR and RELR packing to make sure we're both on the same page...

In RELA relocations, we store offset, info and addend. REL then optimises by storing the addend inline at the target location, so it only needs to store offset and info. RELR avoids storing info by specialising to work only with relative relocations. So it's half the size of REL and 1/3rd the size of RELA, but REL or RELA is still needed for relocation types other than relative ones.

RELR packing then goes further and optionally allows a bitmask after an offset. The bitmask indicates whether each of the next 63 addresses should also have a relative relocation applied. One bit is used to indicate whether an entry is a relocation offset or a bitmask.

So even without packing, i.e. just writing a list of relative relocation offsets, RELR provides a 2/3rds reduction in space usage compared with RELA.
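To make the format above concrete, here is a sketch (not wild's code; assumes a 64-bit target with 8-byte words) of how a consumer expands RELR entries back into the addresses to relocate, mirroring the decoding loop the dynamic loader uses:

```rust
/// Expand RELR entries into the list of addresses to relocate.
/// Even-valued entries are plain addresses; odd-valued entries are
/// bitmaps (the low bit is the marker) whose bits 1..=63 select which
/// of the next 63 machine words also get a relative relocation.
fn decode_relr(entries: &[u64]) -> Vec<u64> {
    const WORD: u64 = 8;
    let mut out = Vec::new();
    // Address of the first word the next bitmap entry would cover.
    let mut next = 0u64;
    for &entry in entries {
        if entry & 1 == 0 {
            // Address entry: relocate this word directly.
            out.push(entry);
            next = entry + WORD;
        } else {
            // Bitmap entry: bit i selects the word at next + (i-1)*8.
            let mut bits = entry >> 1;
            let mut i = 0u64;
            while bits != 0 {
                if bits & 1 != 0 {
                    out.push(next + i * WORD);
                }
                bits >>= 1;
                i += 1;
            }
            next += 63 * WORD;
        }
    }
    out
}

fn main() {
    // One address entry plus a bitmap covering the two following words:
    // three relocations in 16 bytes, versus 72 bytes as RELA.
    assert_eq!(decode_relr(&[0x1000, 0b111]), vec![0x1000, 0x1008, 0x1010]);
}
```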

mem_sizes: &mut crate::output_section_part_map::OutputSectionPartMap<u64>,
output_kind: crate::output_kind::OutputKind,
pack_relative_relocs: bool,
relr_part_sizes: &mut crate::output_section_part_map::OutputSectionPartMap<(

If we do RELR packing only locally within each input section, then I'd expect that we could allocate RELR entries just like we currently allocate RELA entries.

}
}

fn compute_relr_offsets_by_group<P: Platform>(

I've only sort of read this method, so I don't fully understand what it's doing. I suspect if we do RELR allocation entirely within an input section, it shouldn't be needed. One other thing I was thinking was about sorting. The input relocations for a section are generally sorted already, so I don't think we need to do any sorting. If some compiler for some reason produces relocations that aren't sorted, all that happens is that we get less packing, which seems OK.


mati865 commented Mar 28, 2026

During the size calculating (GC) phase, you can't, but do you need to? I'm probably missing something, but I thought what mattered was (a) whether a particular relocation was relative, i.e. not a dynamic symbol or some other exotic type and (b) the offset of the relocation itself. I didn't think the addresses to which the relocations referred to mattered, since I assumed those were addends that were stored at the location where the relocation is to be applied and only needed to be computed in the write stage.

No, I was missing something. I thought you could only pack adjacent addresses, so we'd need base address + bitmap entries after every gap.

So, is it completely valid to have as many gaps as you want, as long as they fit in that ~504-byte span of the bitmap? That changes everything.

So even without packing, i.e. just writing a list of relative relocation offsets, RELR provides a 2/3rds reduction in space usage compared with RELA.

Yeah, I tested that in #1701 (comment).
This time I was wondering how much my dumb solution without gaps in bitmaps would leave on the table.

@davidlattimore

So, is it completely valid to have as many spaces as you want, so long they fit in that 500 bytes space of bitmap?

By "spaces" do you mean gaps between relocations? Yes, I'd assume that's fine, those offsets would just get a zero bit in the bitmask.

The way I imagine it working is that the first relative relocation you encounter costs 1 entry (8 bytes). If you then encounter another relative relocation within the next 63 machine words (504 bytes), then an additional entry (bitmap) is allocated. Subsequent relocations within those 63 machine words are then "free" in that they just set the corresponding bit, but don't occupy any additional space. After those 63 words are up, if we encounter another relocation within the subsequent 63 machine words, we allocate another bitmap entry, and so on until either we reach the end of the section or we get to the end of the 63 words without encountering an eligible relocation.
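The allocation scheme described above can be sketched as follows (illustrative only, not the PR's code; it assumes the relative relocation offsets within a section are already sorted and 8-byte aligned):

```rust
/// Pack sorted, 8-byte-aligned relocation offsets into RELR entries:
/// the first relocation of a run costs one address entry; relocations
/// within each following 63-word window are folded into bitmap entries.
fn encode_relr(offsets: &[u64]) -> Vec<u64> {
    const WORD: u64 = 8;
    let mut entries = Vec::new();
    let mut i = 0;
    while i < offsets.len() {
        // First relocation of a run: emit a plain address entry.
        let base = offsets[i];
        entries.push(base);
        i += 1;
        // Each bitmap entry covers the 63 words starting at `next`.
        let mut next = base + WORD;
        while i < offsets.len() && offsets[i] < next + 63 * WORD {
            let mut bitmap: u64 = 1; // low bit marks this as a bitmap
            while i < offsets.len() && offsets[i] < next + 63 * WORD {
                let bit = (offsets[i] - next) / WORD + 1;
                bitmap |= 1 << bit;
                i += 1;
            }
            entries.push(bitmap);
            next += 63 * WORD;
        }
    }
    entries
}

fn main() {
    // Three consecutive words: one address entry plus one bitmap entry.
    assert_eq!(encode_relr(&[0x1000, 0x1008, 0x1010]), vec![0x1000, 0b111]);
    // A gap beyond the 63-word window starts a new run.
    assert_eq!(encode_relr(&[0x1000, 0x2000]), vec![0x1000, 0x2000]);
}
```

Gaps inside a window are free: an uncovered word just leaves its bit at zero, which is the point of the clarification above.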

Another small complication that I thought of is that in executable code, some of the relocations won't be at machine-word boundaries. i.e. their offsets in the section won't be an exact multiple of 8 bytes. In those cases, the relocations that are an exact 8 byte multiple would be fine to put into RELR, but the unaligned relocations, even if relative, would still need to be put into the regular relocations table.

A further, related complication is what to do if the section we're processing has alignment smaller than 8. I suspect in that case, we need to just not use RELR for that section, since we don't know at that point which relocations will end up aligned to 8 byte boundaries. But that's probably OK. On x86_64, functions tend to have 16 byte alignment. On RISC architectures, the alignment is generally smaller - 4 byte for aarch64 and 2 bytes for riscv - but I don't think those architectures could really use RELR on code anyway, since they're not using full 8 byte relocations in the code. Those architectures could still benefit by getting smaller relocations for vtables, which is where the real benefits probably come anyway.


mati865 commented Mar 28, 2026

By "spaces" do you mean gaps between relocations? Yes, I'd assume that's fine, those offsets would just get a zero bit in the bitmask.

Yes, 2 AM isn't the best time for productivity.

The way I imagine it working is that the first relative relocation you encounter costs 1 entry (8 bytes). If you then encounter another relative relocation within the next 63 machine words (504 bytes), then an additional entry (bitmap) is allocated. Subsequent relocations within those 63 machine words are then "free" in that they just set the corresponding bit, but don't occupy any additional space. After those 63 words are up, if we encounter another relocation within the subsequent 63 machine words, we allocate another bitmap entry, and so on until either we reach the end of the section or we get to the end of the 63 words without encountering an eligible relocation.

You are right. I worked with examples that were too small, so other linkers produced:

0: base address
1: bitmap
                     [multiple adjacent entries]
2: base address (this one is over 600 bytes away from the last bitmap)
3: bitmap

Based on that I made the wrong assumption that the entries must be adjacent, otherwise a new base entry must be introduced. I found similar patterns in the Clang binary (it wasn't hard, because there are over 7k RELR entries after packing...).

Now, realizing that the third RELR entry (the one with index 2) was created because of that >504-byte gap in the addresses, the gaps indeed make sense. And sure enough, I found RELR bitmaps with gaps in the Clang binary.

Another small complication that I thought of is that in executable code, some of the relocations won't be at machine-word boundaries. i.e. their offsets in the section won't be an exact multiple of 8 bytes. In those cases, the relocations that are an exact 8 byte multiple would be fine to put into RELR, but the unaligned relocations, even if relative, would still need to be put into the regular relocations table.

I'm not sure if I follow. I think this would result in less saving than just yoloing all relative relocations without a specific order, and be much more complex code-wise. At least for x86_64.

With my debug prints from the latest commits, on a small example I get:

❯ wild/tests/build/pack-relative-relocs.c/default-host/pack-relative-relocs.c.save/run-with cargo r --bin wild -q -- --threads=32
key (file-258, 1), offsets {144, 224}
key (file-258, 6), offsets {0}
key (file-512, 1), offsets {0, 17}
key (file-513, 1), offsets {0, 17}
key (file-513, 3), offsets {0, 4, 8}

Those are addends grouped by file and section index, obtained at elf.rs:4839 (the symbol values are zero). Unless I made a mistake at some point, we would move offsets/addends 17 and 4 into .rela.dyn, which would be a waste.
Right now they nicely end up as RELR relocs 8 bytes apart, without packing:

Index: Entry            Address           Symbolic Address
0000:  0000000000003120 0000000000003120  __frame_dummy_init_array_entry
0001:  0000000000003128 0000000000003128  __frame_dummy_init_arra[...] + 0x8
0002:  0000000000003130 0000000000003130  __frame_dummy_init_arra[...] + 0x10
0003:  0000000000003138 0000000000003138  __do_global_dtors_aux_fini_array_entry
0004:  0000000000003140 0000000000003140  __do_global_dtors_aux_f[...] + 0x8
0005:  0000000000003148 0000000000003148  __do_global_dtors_aux_f[...] + 0x10
0006:  0000000000003150 0000000000003150  ptrs_b
0007:  0000000000003158 0000000000003158  ptrs_b + 0x8
0008:  0000000000003160 0000000000003160  ptrs_b + 0x10
0009:  0000000000004418 0000000000004418  __dso_handle

When building Clang without -z pack-relative-relocs, both GNU ld and Wild create ~248 MiB files. Enabling -z pack-relative-relocs for GNU ld reduces this to ~242 MiB, a ~6 MiB saving.
Turning all relative relocs into RELR (without packing) in Wild reduces the size to ~244 MiB, giving us a ~4 MiB saving and leaving ~2 MiB on the table.
Even with efficient packing, which I doubt could be achieved without properly handling RELR entries, I doubt the savings would outweigh the .rela.dyn size overhead.

A further, related complication is what to do if the section we're processing has alignment smaller than 8. I suspect in that case, we need to just not use RELR for that section, since we don't know at that point which relocations will end up aligned to 8 byte boundaries. But that's probably OK. On x86_64, functions tend to have 16 byte alignment. On RISC architectures, the alignment is generally smaller - 4 byte for aarch64 and 2 bytes for riscv - but I don't think those architectures could really use RELR on code anyway, since they're not using full 8 byte relocations in the code. Those architectures could still benefit by getting smaller relocations for vtables, which is where the real benefits probably come anyway.

That's a valid point. For my small examples GNU ld uses .rela.dyn exclusively when targeting riscv, but for aarch64 it only created packed RELR. In this PR Wild uses RELR for all architectures.

@davidlattimore
Member

I'm not sure I follow. I think this would result in less of a saving than just yoloing all relative relocations without a specific order, and be much more complex code-wise. At least for x86_64.

I had a response written describing how I thought it would work and wouldn't be complex. Then I realised that all of the relocations in the functions are likely being resolved at link time anyway, so there shouldn't be any dynamic relocations for the text sections. So feel free to disregard my comments about relocations in functions / executable code, I just wasn't thinking it through properly.

Even with the efficient packing, which I doubt could be achieved without properly handling RELR, I doubt this would outweigh the .rela.dyn size overhead.

I'm not sure I understand what you're saying here.

One other thing that occurred to me - possibly you're already handling it - is relocations that the linker itself is generating. In particular in the GOT. If we have symbols that are non-interposable, but for which we still end up having a GOT entry, then we emit a relative relocation for that GOT entry.

@mati865
Member Author

mati865 commented Mar 29, 2026

I'm not sure I understand what you're saying here.

If we kept some relative relocations in .rela.dyn rather than moving everything to .relr.dyn, the size reduction from packing probably wouldn't offset the bigger size of RELA entries.

In other words, emitting all relative relocations as RELR without packing would yield smaller binaries than mixing RELA and packed RELR. At least for x86_64.

This was referring to:

In those cases, the relocations that are an exact 8 byte multiple would be fine to put into RELR, but the unaligned relocations, even if relative, would still need to be put into the regular relocations table.


One other thing that occurred to me - possibly you're already handling it - is relocations that the linker itself is generating. In particular in the GOT. If we have symbols that are non-interposable, but for which we still end up having a GOT entry, then we emit a relative relocation for that GOT entry.

Yeah, this already works for x86_64. The changes were pretty minimal.

I'll do AArch64 Clang build in the morning to see if it also works.

@davidlattimore
Member

I'm not sure I understand what you're saying here.

If we kept some relative relocations in .rela.dyn rather than moving everything to .relr.dyn, the size reduction from packing probably wouldn't offset the bigger size of RELA entries.

In other words, emitting all relative relocations as RELR without packing would yield smaller binaries than mixing RELA and packed RELR. At least for x86_64.

This was referring to:

In those cases, the relocations that are an exact 8 byte multiple would be fine to put into RELR, but the unaligned relocations, even if relative, would still need to be put into the regular relocations table.

But I thought that if a relocation is for an unaligned offset then it just couldn't use RELR? So we wouldn't have a choice. That said, I'd expect that most of the time all the relevant relocations would have at least 8 byte alignment as would the sections they occur in.

@mati865
Member Author

mati865 commented Mar 29, 2026

But I thought that if a relocation is for an unaligned offset then it just couldn't use RELR? So we wouldn't have a choice.

Oh, sorry. I thought that if we encountered an unaligned relocation we could recreate it as aligned, because I had messed something up in the code and it was printing wrong addends.

Yeah, we cannot create RELR entries for odd addresses because they are interpreted as a bitmap. However, we can create RELR entries for unaligned even addresses; we just cannot pack them. Added a test that captures it well:

readelf -r output
❯ readelf -Wr wild/tests/build/elf/x86_64/pack-relative-relocs/default/pack-relative-relocs.c.wild

Relocation section '.rela.dyn' at offset 0x8a8 contains 2 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000002c40  0000000100000006 R_X86_64_GLOB_DAT      0000000000000000 exit_syscall + 0
0000000000002c48  0000000200000006 R_X86_64_GLOB_DAT      0000000000000000 runtime_init + 0

Relocation section '.relr.dyn' at offset 0x8d8 contains 3 entries which relocate 9 locations:
Index: Entry            Address           Symbolic Address
0000:  0000000000003c50 0000000000003c50  aligned_ptr
0001:  0000000000003c59 0000000000003c68  unaligned_ptr_even
                        0000000000003c70  unaligned_ptr_even + 0x8
                        0000000000003c80  _edata + 0x4
                        0000000000003ca0  _edata + 0x24
                        0000000000003ca8  _edata + 0x2c
                        0000000000003cb0  _edata + 0x34
                        0000000000003cb8  _edata + 0x3c
0002:  0000000000003c6a 0000000000003c6a  unaligned_ptr_even + 0x2

❯ readelf -Wr wild/tests/build/elf/x86_64/pack-relative-relocs/default/pack-relative-relocs.c.ld

Relocation section '.rela.dyn' at offset 0x3c8 contains 1 entry:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000004011  0000000000000008 R_X86_64_RELATIVE                         4000

Relocation section '.rela.plt' at offset 0x3e0 contains 2 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000003ff0  0000000100000007 R_X86_64_JUMP_SLOT     0000000000000000 runtime_init + 0
0000000000003ff8  0000000200000007 R_X86_64_JUMP_SLOT     0000000000000000 exit_syscall + 0

Relocation section '.relr.dyn' at offset 0x410 contains 2 entries which relocate 2 locations:
Index: Entry            Address           Symbolic Address
0000:  0000000000004008 0000000000004008  aligned_ptr
0001:  0000000000004022 0000000000004022  unaligned_ptr_even + 0x2

This PR turned unaligned_ptr_even into a RELR entry, so it was interpreted as a bitmap, which segfaulted AArch64 qemu (not just the binary itself) - I'm awarding myself a virtual medal for that.
unaligned_ptr_even + 0x2, while looking out of place, turns out to work fine with GNU ld.
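A minimal decoder mirroring glibc's interpretation makes both observations concrete; this is an illustrative sketch, not Wild's implementation:

```rust
/// Minimal RELR decoder following the glibc loader's interpretation.
/// Even words are plain addresses; odd words are bitmaps. That's why an
/// odd address can never be emitted as its own entry, while an unaligned
/// *even* address works as a standalone (unpackable) entry.
fn decode_relr(relr: &[u64]) -> Vec<u64> {
    let mut out = Vec::new();
    let mut next: u64 = 0; // slot that bit 1 of the next bitmap refers to
    for &entry in relr {
        if entry & 1 == 0 {
            // Address entry: relocate `entry`, advance to the next slot.
            out.push(entry);
            next = entry + 8;
        } else {
            // Bitmap entry: bit i (1..=63) relocates next + 8 * (i - 1).
            for i in 0u64..63 {
                if (entry >> (i + 1)) & 1 != 0 {
                    out.push(next + 8 * i);
                }
            }
            next += 63 * 8;
        }
    }
    out
}

fn main() {
    // An unaligned even address decodes as itself:
    assert_eq!(decode_relr(&[0x3c6a]), vec![0x3c6a]);
    // An odd word would instead be consumed as a bitmap; an even address
    // followed by a bitmap word expands into a whole run of slots.
    assert_eq!(
        decode_relr(&[0x3120, 0x1ff]),
        vec![0x3120, 0x3128, 0x3130, 0x3138, 0x3140, 0x3148, 0x3150, 0x3158, 0x3160]
    );
}
```

Any odd word is consumed as a bitmap, so an odd address can never stand alone; an unaligned even address round-trips fine, it just can't participate in packing.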

That said, I'd expect that most of the time all the relevant relocations would have at least 8 byte alignment as would the sections they occur in.

Yeah, I didn't really encounter them in the few real-world binaries I tried. It was a bug in my code that led me to that thought.


I'll have to reconsider a few things regarding this PR over the next few days.

@mati865 mati865 force-pushed the push-ruqzpzktuvvm branch 2 times, most recently from 321f4c9 to ee67e2c Compare March 30, 2026 21:23
@mati865
Member Author

mati865 commented Apr 1, 2026

Thanks for the review. I believe I addressed almost everything apart from #1701 (comment)

I might have a go at doing the packing after you submit if you're not going to.

Please do if you want. I'm not planning another attempt at it any time soon.

Added one change that was not discussed: db54cf4. IIRC with RELA it doesn't matter what you write into the relocated address, and writing zero might be preferred.

Unfortunately #[must_use] on write_address_relocation won't extend to the value inside the Result. So if we want a warning about the unused result, we would need an unwieldy wrapper type.

@davidlattimore
Member

Unfortunately #[must_use] on write_address_relocation won't extend to the value inside Result. So if we want the warning about unused result we would need an unwieldy wrapper type.

I think it's fine. There are only a couple of callers of the function, and if someone forgets to use the returned value, then hopefully tests will pick it up - especially the tests that verify that we write to every byte of the file.

/// Symbol will point to the start of the first loadable segment.
LoadBaseAddress,

// TODO: Versions are ELF specific
Member

It's fine to have platform-specific things in common code, so long as they don't interfere with other platforms

// We'll emit a warning when writing the file if it's an executable.
return;
};
for (index, def_info) in self.internal_symbols.symbol_definitions.iter().enumerate() {
Member

Is an internal sym def actually needed or could we just check args.pack_relative_relocs here and then request the appropriate symbol?

Member Author

@mati865 mati865 Apr 2, 2026

Sorry, I had to unexpectedly go out and hastily committed and pushed the changes.
So, I was testing two approaches: importing just the version, and importing the whole dynamic symbol. Importing just the version in a hacky way worked suspiciously well on my machine (it even linked a working Clang binary) but wreaked havoc on the CI, as can be seen in this failure:

version.__cxa_finalize
  wild GLIBC_2.2.5
  ld local or global

I'm guessing the versions were offset by a single place.

Pushed a refined version, but I still need to make CI happy.

Member Author

I still need to make CI happy.

To my surprise, PIE isn't the default everywhere yet.

Member

@davidlattimore davidlattimore left a comment

Thanks for your work on this! Looks good. Just two small comments. Don't feel you need to wait for me once addressed.

/// The number of verdef records provided in version script.
pub(crate) verdef_count: u16,
/// LLD creates GLIBC_ABI_DT_RELR as the last version across all inputs, we mimic that.
pub(crate) final_version_index: u16,
Member

Is this still needed?

Member Author

Oops, thought I already removed it.

},
)
.sub_option("pack-relative-relocs", "", |args, _| {
if args.arch != Architecture::RISCV64 {
Member

Do the other linkers just ignore the flag when linking for riscv64? Could perhaps do with a comment here saying whether we're matching the behaviour of other linkers, or whether this is something that we just don't support. I'm assuming we shouldn't error when this flag is requested for riscv64?

Member Author

Good point. I had only tested riscv64 via the integration test, which hides the warning from GNU ld because the link itself succeeded; the test then failed on the assertions.
If I link manually, on both Arch and Ubuntu 25.10 I get: /usr/lib/gcc/riscv64-linux-gnu/15.1.0/../../../../riscv64-linux-gnu/bin/ld: warning: -z pack-relative-relocs ignored.

LLD, on the other hand, enables it; I guess we want to match LLD in that case. I'll update it.

@mati865 mati865 changed the title feat: Support -z pack-relative-relocs feat: Support -z pack-relative-relocs without actual packing Apr 3, 2026
@mati865 mati865 merged commit 14fdad2 into wild-linker:main Apr 4, 2026
24 checks passed
@mati865 mati865 deleted the push-ruqzpzktuvvm branch April 4, 2026 09:36

Development

Successfully merging this pull request may close these issues.

wild: error: unrecognized option(s): --use-android-relr-tags, --pack-dyn-relocs=relr

2 participants