feat: Support -z pack-relative-relocs without actual packing#1701
feat: Support -z pack-relative-relocs without actual packing#1701mati865 merged 83 commits intowild-linker:mainfrom
-z pack-relative-relocs without actual packing#1701Conversation
|
Ah, I have misunderstood that part. That version has to be just declared, not assigned to any symbol. |
|
It works now but doesn't support the major selling point of RELR which is compacting of the addresses via bitmaps. I'm not sure how to approach that as we need sorted entries before they are written for it work (use a temp buffer?). Currently, I'm working around the problem by sorting the entries after they are already written. Clang release (without debuginfo) builds without and with Even without the compaction this is a small win to the size. This is how compacted vs non-compacted entries look like: Performance impact is not bad considering the sort workaround |
5085561 to
5e53268
Compare
davidlattimore
left a comment
There was a problem hiding this comment.
Given this is marked as a draft, I assume there's still stuff that you want to address, so I haven't reviewed thoroughly yet. I did skim over it though.
libwild/src/elf.rs
Outdated
| let mut verneed_info = state.verneed_info; | ||
|
|
||
| if let Some(v) = state.verneed_info.as_ref() | ||
| if let Some(v) = &mut verneed_info |
There was a problem hiding this comment.
It looks like this mut perhaps isn't needed.
There was a problem hiding this comment.
Good catch, this is a leftover from one of the previous attempts at making glibc special symbol logic not feel like a plaster on a broken limb. This version I'm fairly satisfied with, but as you noticed it needs a bit of clean-up still.
Also, too bad none of the linter caught it.
|
Thanks, the status from #1701 (comment) is still up-to-date. Not much progress, so far. At least I no longer hate the way this PR implements glibc special symbol version. |
5e53268 to
00d2009
Compare
|
Looking at But if I map addresses to the sections we get: The relocations are not as unordered as I previously thought. Currently, sections are written at the first available slot, but relocations within each section are already ordered (not verified but seems plausible). GOTs won't be a problem as well (at least in Clang's case). Even though there are 2002 entries with relative reloc: They are sequential: Other relocations should be irrelevant here. With some creativity this solution could be extended to avoid overallcoation as well. If we store first and last addresses of each written relocations chunk for each section somewhere, we can increment the sizes by the chunk size, instead of relocations count. Maybe it will be clearer that way: |
|
There is something I didn't account for: buckets and groups. It works only for simple cases: using single thread ( Shows the addresses are in order, but adding threads or inputs makes the offsets fall apart. EDIT: Regarding the fields in common structs, some of those were ELF specific when I started and I didn't bother cleaning up the code that doesn't even work when rebasing. I'll move them if this ever starts working correctly. |
|
The closest equivalent I can think of in the linker is how dynamic symbol definitions are handled. During the GC phase, objects collect up the dynamic symbols that they need to define, storing them in |
d8707e5 to
e98b0a1
Compare
|
Thanks for the pointer. I had looked there but I don't see how that would work without huge changes to how relocations are handled. Although that might be unavoidable to enable packing via bitmaps. Switching to Compared with the output: There is a single relocation missing between 000000000cdbd850 and 000000000cdbd860. There is no way to figure it out before symbols get offsets assigned and probably even until we start doing relocations which conflicts with output preallocation. All three are I don't know if it's worth investing any more time into that approach as there is very little hope to implement fully functional packed relocations this way. At least the performance of current hack is within noise margins on my machine Details
|
e98b0a1 to
fe42bb7
Compare
|
I haven't looked (recently) at your code changes, but I gave some thought to how to simplify packed relocations... would it help if we reduced scope and only packed relative relocations within each input section? i.e when we scan the relocations for an input section (during the layout phase), we allocate an entry in RELR when we first encounter a relative relocation, then if we encounter any more within the subsequent 63 addresses, allocate another entry (bitmask entry). That'd make sizing much more localised. Obviously it wouldn't give as much space saving as doing the operation globally, but I bet it would still give some savings. Especially when you have a section containing a vtable. |
|
Hmm, how can we tell in layout if symbols are adjacent? If I add I've implemented (or at least I hope so) counting of possible savings if we consider symbols adjacent only if they are in the same section, the same file, and they are at most 4 bytes away from the previous symbol. That 4 bytes apart is bad but I haven't yet figured out how to solve that. I'll take another look into that over Saturday or Sunday. I did it to see what kind of savings would that approach would give, and to see if your suggestion would work. |
During the size calculating (GC) phase, you can't, but do you need to? I'm probably missing something, but I thought what mattered was (a) whether a particular relocation was relative, i.e. not a dynamic symbol or some other exotic type and (b) the offset of the relocation itself. I didn't think the addresses to which the relocations referred to mattered, since I assumed those were addends that were stored at the location where the relocation is to be applied and only needed to be computed in the write stage. Perhaps I should describe my understanding of RELR on RELR packing to make sure we're both on the same page... In RELA relocations, we store offset, info and addend. REL then optimises by storing the addend inline at the target location, so it only needs to store offset and info. RELR avoids storing info by specialising to work only with relative relocations. So it's half the size of REL and 1/3rd the size of RELA, but REL or RELA is still needed for relocation types other than relative ones. REL packing then goes further and optionally allows a bitmask after an offset. The bitmask indicates whether each of the next 63 addresses should also have a relative relocation applied. One bit is used to indicate whether an entry is a relocation offset or a bitmask. So even without packing, i.e. just writing a list of relative relocation offsets, RELR provides a 2/3rds reduction in space usage compared with RELA. |
libwild/src/macho.rs
Outdated
| mem_sizes: &mut crate::output_section_part_map::OutputSectionPartMap<u64>, | ||
| output_kind: crate::output_kind::OutputKind, | ||
| pack_relative_relocs: bool, | ||
| relr_part_sizes: &mut crate::output_section_part_map::OutputSectionPartMap<( |
There was a problem hiding this comment.
If we just do RELR packing only locally within each input section, then I'd expect that we could just allocate RELR entries just like we currently allocate RELA entries.
libwild/src/layout.rs
Outdated
| } | ||
| } | ||
|
|
||
| fn compute_relr_offsets_by_group<P: Platform>( |
There was a problem hiding this comment.
I've only sort of read this method, so I don't fully understand what it's doing. I suspect if we do RELR allocation entirely within an input section, it shouldn't be needed. One other thing I was thinking was about sorting. The input relocations for a section are generally sorted already, so I don't think we need to do any sorting. If some compiler for some reason produces relocations that aren't sorted, all that happens is that we get less packing, which seems OK.
No, I was missing something. I thought you can only pack adjacent addresses, so we'd need base address+bitmap entries after every gap. So, is it completely valid to have as many spaces as you want, so long they fit in that 500 bytes space of bitmap? That changes everything.
Yeah, I tested that in #1701 (comment) |
By "spaces" do you mean gaps between relocations? Yes, I'd assume that's fine, those offsets would just get a zero bit in the bitmask. The way I imagine it working is that the first relative relocation you encounter costs 1 entry (8 bytes). If you then encounter another relative relocation within the next 63 machine words (504 bytes), then an additional entry (bitmap) is allocated. Subsequent relocations within those 63 machine words are then "free" in that they just set the corresponding bit, but don't occupy any additional space. After those 63 words are up, if we encounter another relocation within the subsequent 63 machine words, we allocate another bitmap entry, and so on until either we reach the end of the section or we get to the end of the 63 words without encountering an eligible relocation. Another small complication that I thought of is that in executable code, some of the relocations won't be at machine-word boundaries. i.e. their offsets in the section won't be an exact multiple of 8 bytes. In those cases, the relocations that are an exact 8 byte multiple would be fine to put into RELR, but the unaligned relocations, even if relative, would still need to be put into the regular relocations table. A further, related complication is what to do if the section we're processing has alignment smaller than 8. I suspect in that case, we need to just not use RELR for that section, since we don't know at that point which relocations will end up aligned to 8 byte boundaries. But that's probably OK. On x86_64, functions tend to have 16 byte alignment. On RISC architectures, the alignment is generally smaller - 4 byte for aarch64 and 2 bytes for riscv - but I don't think those architectures could really use RELR on code anyway, since they're not using full 8 byte relocations in the code. Those architectures could still benefit by getting smaller relocations for vtables, which is where the real benefits probably come anyway. |
Yes, 2 AM isn't the best time for productivity.
You are right. I worked on too small examples, so other linkers produced: Based on that I made the wrong assumption that the entries must be adjacent, otherwise new base entry shall be introduced. I found similar patterns in Clang binary (it wasn't hard because there are over 7k RERL entries after packing...). Now, realizing that the third RELR entry (the one with index 2) was created because that >504 bytes gap in the addresses the gaps indeed make sense. And sure enough I found RERL bitmaps with gaps in Clang binary.
I'm not sure if I follow. I think this would result in less saving than just yoloing all relative relocations without a specific order and be much more complex code wise. At least for x86_64. With my prints from latest commits on a small example I get: Those are addends grouped by file and section index obtained at When building Clang without
That's valid point. For my small exapmles GNU ld uses |
I had a response written describing how I thought it would work and wouldn't be complex. Then I realised that all of the relocations in the functions are likely being resolved at link time anyway, so there shouldn't be any dynamic relocations for the text sections. So feel free to disregard my comments about relocations in functions / executable code, I just wasn't thinking it through properly.
I'm not sure I understand what you're saying here. One other thing that occurred to me - possibly you're already handling it - is relocations that the linker itself is generating. In particular in the GOT. If we have symbols that are non-interposable, but for which we still end up having a GOT entry, then we emit a relative relocation for that GOT entry. |
If we kept some relative relocations in .rela.dyn rather than moving everything to .relr.dyn. The size reduction from packing probably wouldn't offset the bigger size of RELA entires. In other words making all relative relocation as RELR without packing would yield smaller binaries than mixed RELA and RELR with packing. At least for x86_64. This was referring to:
Yeah, this already works for x86_64. The changes were pretty minimal. I'll do AArch64 Clang build in the morning to see if it also works. |
But I thought that if a relocation is for an unaligned offset then it just couldn't use RELR? So we wouldn't have a choice. That said, I'd expect that most of the time all the relevant relocations would have at least 8 byte alignment as would the sections they occur in. |
Oh, sorry. I thought if we encounter unaligned relocation we would recreate it as aligned because I messed up something with the code and it prints wrong addends. Yeah, we cannot create RELRs entries for odd addresses because they are interpreted as a mask. However, we can create RELRs for unaligned even addresses, we just cannot pack them. Added a test that captures it well: readelf -r outputThis PR turned
Yeah, I didn't really encounter them in a few real world binaries I tried. It was a bug in my code that led me to that thought. I'll have to reconsider a few things regarding this PR over the next few days. |
321f4c9 to
ee67e2c
Compare
This reverts commit 537591d.
|
Thanks for the review. I believe I addressed almost everything apart from #1701 (comment)
Please do if you want. I'm not planning another attempt at it any time soon. Added one change that was not discussed: db54cf4 IIRC with RELA it doesn't matter what you write into relocated address and writing zero might be preferred. Unfortunately |
I think it's fine. There's only a couple of callers of the function and if someone forgets to use the returned value, then hopefully tests will pick it up - especially the tests that verify that we write to every byte of the file. |
libwild/src/parsing.rs
Outdated
| /// Symbol will point to the start of the first loadable segment. | ||
| LoadBaseAddress, | ||
|
|
||
| // TODO: Versions are ELF specific |
There was a problem hiding this comment.
It's fine to have platform-specific things in common code, so long as they don't interfere with other platforms
libwild/src/layout.rs
Outdated
| // We'll emit a warning when writing the file if it's an executable. | ||
| return; | ||
| }; | ||
| for (index, def_info) in self.internal_symbols.symbol_definitions.iter().enumerate() { |
There was a problem hiding this comment.
Is an internal sym def actually needed or could we just check args.pack_relative_relocs here and then request the appropriate symbol?
There was a problem hiding this comment.
Sorry, I had to unexpectantly go out and hastily committed and pushed the changes.
So, I was testing two approaches: importing just the version and the whole dynamic symbol. Importing just the version in a hacky way worked suspiciously fine on my machine (even linked a working Clang binary) but wreak havoc on the CI as can be seen in this failure:
version.__cxa_finalize
wild GLIBC_2.2.5
ld local or global
I'm guessing the versions were offset by a single place.
Pushed a refined version, but I still need to make CI happy.
There was a problem hiding this comment.
I still need to make CI happy.
To my surprise, PIE isn't the default everywhere yet.
davidlattimore
left a comment
There was a problem hiding this comment.
Thanks for your work on this! Looks good. Just two small comments. Don't feel you need to wait for me once addressed.
libwild/src/elf.rs
Outdated
| /// The number of verdef records provided in version script. | ||
| pub(crate) verdef_count: u16, | ||
| /// LLD creates GLIBC_ABI_DT_RELR as the last version across all inputs, we mimic that. | ||
| pub(crate) final_version_index: u16, |
There was a problem hiding this comment.
Oops, thought I already removed it.
libwild/src/args/elf.rs
Outdated
| }, | ||
| ) | ||
| .sub_option("pack-relative-relocs", "", |args, _| { | ||
| if args.arch != Architecture::RISCV64 { |
There was a problem hiding this comment.
Do the other linkers just ignore the flag when linking for riscv64? Could perhaps do with a comment here saying whether we're matching the behaviour of other linkers, or whether this is something that we just don't support. I'm assuming we shouldn't error when this flag is requested for riscv64?
There was a problem hiding this comment.
Good point. I had tested riscv64 only via the integration test which hides the warning from GNU ld because it linked successfully, and failed on the assertions.
If I link manually, on both Arch and Ubuntu 25.10 I get: /usr/lib/gcc/riscv64-linux-gnu/15.1.0/../../../../riscv64-linux-gnu/bin/ld: warning: -z pack-relative-relocs ignored.
LLD on the other hand enables it, guess we want to match LLD in that case. I'll update it.
-z pack-relative-relocs-z pack-relative-relocs without actual packing
Works now but doesn't pack RELR entries via bitmaps, so the size reduction is not that big.
Doesn't work yet:I haven't yet figured out how to cleanly synthesiseGLIBC_ABI_DT_RELRversion for__libc_start_mainsymbol. I'd prefer to avoid matching that symbol by the name, but there might be no other choice.