storage: add splitfdstream for efficient layer transfer with reflink support #651

Open

giuseppe wants to merge 12 commits into containers:main from giuseppe:splitfdstream

Conversation

@giuseppe (Member)

Add a new splitfdstream package that enables efficient container layer transfer between storage instances by separating tar metadata from file content. File content is passed as file descriptors.

@mtrmac (Contributor) left a comment


  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and I expect the rebase to be somewhat non-trivial.

@giuseppe (Member Author)

  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and I expect the rebase to be somewhat non-trivial.

What other work would block this feature?

I don't love the amount of extra code either, but it is almost all in new APIs/packages that are experimental and don't leak too much into the rest of the library.

We have been going around this problem for a long time, and in the end this seems like an ~ok solution: it avoids exposing too much of the containers-storage internals by augmenting the tar format, which is already plugged in everywhere.

@giuseppe (Member Author)

To add more context: the goal of this feature is to be able to copy a layer, similarly to what we would do with Diff, but with the possibility of deduplicating files using reflinks. We need two separate copies because OS files can't be kept in the containers storage for several reasons: a podman system reset -f should not break the system, and OS files need to be relabeled.

There are other interesting use cases enabled by an RPC, like facilitating usage from different languages (Rust in our case), or injecting into a build container to grab R/O access to layers/images, but those are out of scope for now.

I realize the complexity is not trivial just to get access to the raw files, but what alternatives do we have?

From the store API PoV we could just extend it to return a map $LAYER_FILE -> $PATH_TO_THE_FILE, but how would we use it from Rust?

Any suggestions?

@mtrmac (Contributor)

mtrmac commented Feb 10, 2026

(Just to avoid a misunderstanding, I didn’t really read the PR. I have, at this point, ~no opinion on RPC vs. Go API.)

	return &tarReaderIterator{
		tr:     tr,
		trBuf:  trBuf,
		buffer: make([]byte, 1<<20),
Contributor:

Seems like 1<<20 should be in a const somewhere with a rationale?


// splitFDStreamSocketDescriptor is the fd for the Unix socket used to
// receive file descriptors via SCM_RIGHTS in the re-exec child.
const splitFDStreamSocketDescriptor = 5
Contributor:

Hardcoding this seems a bit fragile; can't we pass it as an env var or a CLI arg?

Comment on lines +34 to +35
// FDs are streamed to the child process one-at-a-time over a Unix socket
// using SCM_RIGHTS, avoiding EMFILE from inheriting too many FDs at exec.
Contributor:

It doesn't have to be truly one at a time, see also https://github.com/cgwalters/jsonrpc-fdpass?tab=readme-ov-file#41-fd-batching

In my initial PoC work here I handled this by sending the splitfdstream inline data itself as an fd (it could be an O_TMPFILE, a memfd, or a pipe), and then sending all the associated fds over jsonrpc-fdpass, which automatically handles buffering.

Ensuring we process just one recvmsg() at a time (only ~200 fds per message on Linux) seems really reasonable and unlikely to get anywhere close to modern fd limits.


So if we do choose jsonrpc-fdpass, I think we also need to standardize this "splitfdstream serialization".

return err
}

// Try copy_file_range - kernel-level copy, more efficient than userspace
Contributor:

Surprising we weren't doing this before.


// ReadLine reads bytes until a newline is encountered, using recvmsg.
// Any FDs received are closed since line-reading doesn't expect them.
func (p *FDPasser) ReadLine() ([]byte, error) {
Contributor:

Why would one want this?


// JSON-RPC 2.0 protocol constants
const (
JSONRPCVersion = "2.0"

@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 23ed544 to 8d74f84 Compare February 11, 2026 16:07
@giuseppe (Member Author)

Still a PoC, so you can skip the review as it is not yet ready, but we can use it to analyze the complexity.

Moved to use https://github.com/cgwalters/jsonrpc-fdpass-go

Now the code size is almost 50% of the last revision's.

@packit-as-a-service

Packit jobs failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 1b087bb to bc6c5ff Compare February 12, 2026 09:15
@giuseppe giuseppe force-pushed the splitfdstream branch 7 times, most recently from a0876b6 to d77fbc4 Compare February 17, 2026 12:44
@giuseppe giuseppe changed the title from "[WIP] storage: add splitfdstream for efficient layer transfer with reflink support" to "storage: add splitfdstream for efficient layer transfer with reflink support" Feb 19, 2026
@giuseppe giuseppe marked this pull request as ready for review February 19, 2026 14:58
@TomSweeneyRedHat (Member)

Per @mtrmac, this should merge before #677 does

@mtrmac (Contributor)

mtrmac commented Mar 10, 2026

No, #677 would be infrastructure for this PR. I really mean that I have little opinion on merging this PR, and that without it, #677 is not clearly an improvement.

@giuseppe (Member Author)

This needs to be rebased on top of #677 anyway; I'll need to rework it on top of that.

@TomSweeneyRedHat (Member)

@giuseppe looks like a rebase may be needed here.

@github-actions github-actions bot added the "common" and "image" labels Mar 20, 2026
@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 88bf912 to a7806a8 Compare March 23, 2026 07:39
@giuseppe (Member Author)

@kolyshkin are these changes to the tar package a problem for the work you are doing there?

@giuseppe (Member Author)

@kolyshkin can you PTAL?

}
defer tarStream.Close()

// Write splitfdstream data directly to a memfd to avoid buffering in memory.
Contributor:

This type of comment I find a bit distracting because it's almost circular: a memfd is memory too. How about just:

A memfd is a convenient way to buffer the metadata

Or maybe just no comment at all. Dunno.

Here and in other places like

	// Set up ID mappings
	idMappings := options.IDMappings

The comment basically adds nothing.

return streamFile, fds, nil
}

// convertTarToSplitFDStream converts a tar stream to a splitfdstream by parsing
Contributor:

This obviously works, but it seems pretty inefficient: reading the entire contents of a large file into memory just to discard it.

I guess we're doing this because the current API is "give a tarball" and this just builds on that without changing anything more fundamental?

	if buf == nil {
		buf = make([]byte, archive.CopyBufferSize)
	}
	// File not found in diff directory (e.g., naiveDiff was used),
Contributor:

When would the file not be found?

// Write a single prefix for the total size, then stream
// data in chunks. The reader expects exactly one prefix
// per file entry.
if err := writer.WriteInlinePrefix(header.Size); err != nil {
Contributor:

Would it perhaps be more elegant if writer.WriteInline returned a sub-io.Writer impl? Then we could just io.Copy to it.

rb, err = h.GetLayerInfoPiped(ctx, req.Args)
case "FinishPipe":
rb, err = h.FinishPipe(ctx, req.Args)
case "GetSplitFDStreamSocket":
Contributor:

It seems like this may grow into a general replacement for the bespoke proxy API, right? How about calling it OpenJsonRPCFdPass?

"runtime"
"sync"

fdpass "github.com/cgwalters/jsonrpc-fdpass-go"

Comment on lines +271 to +272
// The library limits to MaxFDsPerMessage per sendmsg, so remaining
// FDs are sent as follow-up "fds" notifications.
Contributor:

Hmm, the jsonrpc-fdpass protocol specifies sending more fds as whitespace, but it's possible the Go implementation isn't doing this right.


// findManifest finds the image manifest from BigData keys.
// It looks for manifest-* keys containing an image manifest (has "config" field),
// filtering out manifest lists/indexes.
func findManifest(store Store, imageID string) ([]byte, error) {
Contributor:

Surely there's a common API for this?

giuseppe added 2 commits April 7, 2026 08:53
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
giuseppe added 9 commits April 7, 2026 14:19
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Extend the store with splitfdstream capabilities exposed via a
UNIX socket for JSON-RPC communication.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Implement the SplitFDStreamDriver interface for the overlay driver,
enabling efficient layer operations with reflink support.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 4834b91 to 7af6099 Compare April 7, 2026 13:25
Add a new json-proxy method that returns a Unix socket file descriptor
to the client.  The client then speaks the jsonrpc-fdpass protocol
directly over that socket for splitfdstream operations, bypassing the
json-proxy for bulk data transfer.

Bump protocolVersion to "0.2.9".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@cgwalters (Contributor)

I think this is overall OK as is, however - just to cross-reference from composefs/composefs-rs#228 (comment) - I think we should get a bit more ambitious and try adding a full new proxy API based on jsonrpc-fdpass instead. There are a lot of warts and bugs we could fix at the same time (like the bad idea of auto-converting v2s2 to OCI, and other things from the historical PRs there).

That would then make the architecture much clearer: There's a jsonrpc-fdpass API for containers-storage:. We rework https://github.com/bootc-dev/containers-image-proxy-rs/ to support this (it'd be a relatively thin wrapper on top of jsonrpc-fdpass, just with stronger Rust types basically) and then it becomes very straightforward to plug in all of the existing places.

Just to also xref again I think this API has the potential to solve problems like containers/skopeo#658 (comment) too. But of course, that's where the choice of IPC becomes a big deal.

@giuseppe (Member Author)

giuseppe commented Apr 8, 2026

I think this is overall OK as is, however - just to cross-reference from composefs/composefs-rs#228 (comment) - I think we should get a bit more ambitious and try adding a full new proxy API based on jsonrpc-fdpass instead. There are a lot of warts and bugs we could fix at the same time (like the bad idea of auto-converting v2s2 to OCI, and other things from the historical PRs there).

Isn't that something we can do incrementally? I've tried to keep this as small as possible (e.g. dropped write support) to facilitate merging it, but I think we should bring that back and let the copy code use this as a shortcut for more efficient copies to different stores.


Labels

common (Related to "common" package), image (Related to "image" package), storage (Related to "storage" package)

4 participants