overlay: don't fail Put() when merged/ rename fails by trilamsr · Pull Request #738 · containers/container-libs

trilamsr · 2026-04-03T04:45:45Z

The rename in Put() is a non-critical optimization — it atomically replaces merged/ with a fresh directory. Get() handles an existing merged/ via MkdirAllAndChownNew regardless. Put()'s actual job is to unmount the overlay, which MNT_DETACH already accomplished by that point.

Previously, if the rename failed (e.g. EBUSY from a stale DCACHE_MOUNTED flag on the dentry), Put() returned an error. In CRI-O this manifests as a failed KillPodSandbox that retries indefinitely — pods stuck Terminating for weeks.

Don't fail Put() when the rename fails. Clean up the temp dir and return success.
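The behavior described above can be sketched in isolation. This is a minimal illustration, not the overlay driver's code: `replaceDirBestEffort` and `runDemo` are hypothetical names, and a non-empty target directory stands in for the EBUSY case, which needs a stale mount to reproduce. Whether removing the temporary directory is itself safe is questioned later in this thread.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// replaceDirBestEffort tries to swap dir for a fresh, empty directory with a
// single rename. A failed rename is reported through replaced=false, not
// through err, so callers do not treat it as fatal.
// Hypothetical helper: this is not the overlay driver's real code.
func replaceDirBestEffort(dir string) (replaced bool, err error) {
	// Create the replacement next to dir so the rename stays on one filesystem.
	tmp, err := os.MkdirTemp(filepath.Dir(dir), "merged-tmp-")
	if err != nil {
		return false, fmt.Errorf("creating replacement directory: %w", err)
	}
	// syscall.Rename is used because os.Rename refuses to replace a directory.
	if err := syscall.Rename(tmp, dir); err != nil {
		// Non-critical: keep the old directory and drop the unused replacement.
		os.Remove(tmp)
		return false, nil
	}
	return true, nil
}

// runDemo exercises both outcomes in a scratch directory. It returns whether
// each call replaced the directory and how many entries remain afterwards.
func runDemo() (first, second bool, entries int, err error) {
	base, err := os.MkdirTemp("", "put-demo-")
	if err != nil {
		return false, false, 0, err
	}
	defer os.RemoveAll(base)

	merged := filepath.Join(base, "merged")
	if err = os.Mkdir(merged, 0o700); err != nil {
		return false, false, 0, err
	}
	// Empty target: the rename succeeds.
	if first, err = replaceDirBestEffort(merged); err != nil {
		return false, false, 0, err
	}
	// Non-empty target: rename(2) fails, standing in for the EBUSY case.
	if err = os.WriteFile(filepath.Join(merged, "busy"), nil, 0o600); err != nil {
		return false, false, 0, err
	}
	if second, err = replaceDirBestEffort(merged); err != nil {
		return false, false, 0, err
	}
	list, err := os.ReadDir(base)
	if err != nil {
		return false, false, 0, err
	}
	return first, second, len(list), nil
}

func main() {
	first, second, entries, err := runDemo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("first replaced=%v, second replaced=%v, entries left=%d\n", first, second, entries)
}
```

In the second call the rename fails, the function still returns a nil error, and no temporary directory is left behind.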

Fixes #737

packit-as-a-service · 2026-04-03T05:04:04Z

Packit jobs failed. @containers/packit-build please check.

trilamsr · 2026-04-03T05:11:22Z

Note: this is hard to unit test because reproducing the EBUSY requires a separate mount namespace holding a shared mount reference (Bidirectional propagation). A simple bind mount in a single namespace gets detached by MNT_DETACH and doesn't trigger the race. Happy to write an integration test using unshare(CLONE_NEWNS) if needed.

giuseppe · 2026-04-03T19:34:28Z

> After MNT_DETACH, the kernel defers unmount until all references are released. With shared mounts (e.g. mountPropagation: Bidirectional from a CSI driver), another namespace holds a reference, so rename(2) on the mountpoint fails with EBUSY:

That looks like an error in the way this mount is managed. You are likely using skip_mount_home, but in this case you shouldn't be leaking private mounts around and making them impossible to unmount.

I don't understand how this patch could address that though. You are just removing the tmpMountpoint and ignoring the error.

trilamsr · 2026-04-06T19:15:08Z

Hey Giuseppe, thanks for the pointer — I dug into this on our nodes.

skip_mount_home isn't set. MakePrivate(home) runs and works — current overlay mounts are private. The 18 stale shared mounts are from before the current CRI-O instance, and we traced them through the kernel to understand the EBUSY.

The shared overlay mounts were detached by a previous MNT_DETACH — confirmed via stat -f on merged/ returning ext4 (not overlay), and strace showing umount2(path, MNT_DETACH) = -1 EINVAL. The EINVAL comes from can_umount() → check_mnt() because umount_tree() already set mnt_ns = NULL on the previous detach.
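The `stat -f` check above can be approximated in Go with statfs(2). This is a minimal sketch for Linux only. `fsTypeName` and `isOverlayMounted` are hypothetical helpers, and the magic numbers are the ones defined in the kernel's `linux/magic.h`. On a host whose root filesystem is itself overlay, the result says nothing about whether a detach happened.

```go
package main

import (
	"fmt"
	"syscall"
)

// Filesystem magic numbers from the Linux UAPI header linux/magic.h.
const (
	overlayfsSuperMagic = 0x794c7630
	ext4SuperMagic      = 0xef53 // shared by ext2, ext3 and ext4
)

// fsTypeName maps a statfs f_type value to a short, human-readable name.
func fsTypeName(magic int64) string {
	switch magic {
	case overlayfsSuperMagic:
		return "overlay"
	case ext4SuperMagic:
		return "ext2/ext3/ext4"
	default:
		return fmt.Sprintf("unknown (0x%x)", magic)
	}
}

// isOverlayMounted reports whether path currently resolves to an overlay
// filesystem. Hypothetical helper, not part of containers/storage.
func isOverlayMounted(path string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return false, err
	}
	return int64(st.Type) == overlayfsSuperMagic, nil
}

func main() {
	onOverlay, err := isOverlayMounted(".")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Println("current directory on overlay:", onOverlay)
}
```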

But DCACHE_MOUNTED is still set on the merged/ dentry — put_mountpoint() only clears it when m_count reaches 0 (line 788 of fs/namespace.c), and the shared peer group keeps m_count > 0. So rename() checks d_mountpoint(), sees the flag, returns EBUSY.

Verified on the live node:

umount2 with any flags → EINVAL (mount already detached)
rmdir merged/ → EBUSY (d_mountpoint)
mount --make-private merged/ → EINVAL (kernel rejects on overlay)
crictl stopp / crictl rmp -f → both fail with same EBUSY
Only a node reboot clears these
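The two errno values in the list above can be told apart in Go with `errors.Is`. This is a rough sketch: `describeTeardownError` and `pathError` are hypothetical helpers, and the descriptions reflect this thread's diagnosis, not a general kernel guarantee. EINVAL from umount2 in particular also covers a path that was never a mount point.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// describeTeardownError gives a rough reading of an error returned while
// tearing down a mountpoint. The wording is an assumption based on this
// thread, not documented behavior for every kernel.
func describeTeardownError(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, syscall.EBUSY):
		return "busy: path is still treated as a mountpoint"
	case errors.Is(err, syscall.EINVAL):
		return "invalid: nothing mounted here in this namespace"
	default:
		return "other: " + err.Error()
	}
}

// pathError builds an error shaped like the ones the os package returns.
// It exists only so the example can run without a real stale mount.
func pathError(op, path string, errno syscall.Errno) error {
	return &os.PathError{Op: op, Path: path, Err: errno}
}

func main() {
	fmt.Println(describeTeardownError(pathError("rmdir", "merged", syscall.EBUSY)))
	fmt.Println(describeTeardownError(pathError("umount2", "merged", syscall.EINVAL)))
}
```

Matching with `errors.Is` works through the wrapping that `os` functions add, so the same check applies whether the error comes from a raw syscall or from `os.Remove`.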

The current code in Put() hits this EBUSY on the rename and returns an error. CRI-O treats that as a failed KillPodSandbox and retries forever — pods stuck Terminating indefinitely (42 days, 500k+ events in our case). The patch returns success instead, which is correct because the mount is already detached and the container is dead. The rename is just the merged/ directory replacement optimization — DeferredRemove handles the actual layer cleanup later.

Same pattern reported in moby/moby#34948 and containers/podman#18831.

giuseppe · 2026-04-07T15:26:23Z

containers/podman#18831 is a years-old issue that was actually fixed.

If skip_mount_home is not used, how did a different namespace manage to grab a reference to this mount point?

Do you have a mountpoint on /var/lib/containers/storage/overlay?

mtrmac

I really know nothing about this space, this is not a review in any practical sense.

To save others the time, the code references to m_count/put_mountpoint from earlier no longer exist in current kernel sources, after torvalds/linux@d72c773.

And here I show my ignorance: torvalds/linux@8ed936b suggests that a Rename replacing a mountpoint should automatically unmount the mounts; and at least the immediately visible EBUSY error path is gated on is_local_mountpoint. So… isn’t this something that the kernel is expected to handle?

mtrmac · 2026-04-07T16:27:42Z

storage/drivers/overlay/overlay.go

+			// Clean up the temp dir and return success since MNT_DETACH already detached
+			// the mount from the calling namespace.


If this handling were sufficient and correct to do, we could just as well skip the Rename entirely, so why is it left around? (I’m not saying that this is sufficient and correct, just that it doesn’t look like a complete/coherent theory of the problem.)

Good question. The rename isn't just cleanup — it atomically replaces merged/ with a fresh directory that has correct ownership via MkdirAndChown. Skipping it always would mean the next Get() mounts on the old merged/ which may have stale permissions.

On EBUSY the directory is empty (overlay is already detached) so skipping is fine as a fallback. The only alternative is returning an error, which makes CRI-O retry KillPodSandbox forever — pods stuck Terminating for 42 days in our case. Not great.

> Skipping it always would mean the next Get() mounts on the old merged/ which may have stale permissions.

[Skipping a side discussion about permissions]. So is that safe and correct to do? If so, why not do that always? If not, why do that ever?

The CRI-O retry is not an argument in any direction. For the little I know, maybe the only alternative really is “do not do that; it doesn’t work and we can’t [currently] make it reliably work”.

You're right, my framing was off. Let me restate.

The rename is a non-critical optimization — it atomically refreshes merged/ with clean ownership. But Get() doesn't depend on it: MkdirAllAndChownNew handles an existing merged/ directory fine. Put()'s actual job is to unmount the overlay, which MNT_DETACH already accomplished by that point.

The real issue is that Put() returns a hard error when this optional cleanup step fails. That error propagates to CRI-O as a failed KillPodSandbox, which retries forever. The fix is: don't fail Put() on a non-critical rename failure. The rename stays because it's useful when it works (zero cost), but its failure shouldn't fail the function.

Happy to update the commit message and comments to reflect this framing if it makes more sense.

trilamsr · 2026-04-07T16:55:04Z

Good questions. Yes, there's a private bind on /var/lib/containers/storage/overlay from MakePrivate:

14282 30 8:1 ... /var/lib/containers/storage/overlay ... - ext4 /dev/sda1 rw

Private, no shared tag. New overlay mounts correctly get parent 14282.
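The properties read off the mountinfo line above (parent ID, filesystem type, presence of a `shared:N` tag) can be extracted programmatically. The sketch below parses a single line in the format documented in proc(5). `mountEntry`, `parseMountinfoLine` and `looksLeaked` are hypothetical names, and the sample lines in `main` are made up for illustration, not taken from the affected nodes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// mountEntry holds the mountinfo fields discussed in this thread.
type mountEntry struct {
	ID, ParentID int
	MountPoint   string
	FSType       string
	Shared       bool // true when an optional "shared:N" tag is present
}

// parseMountinfoLine parses one line of /proc/<pid>/mountinfo.
// Layout: ID parent major:minor root mountpoint options [optional...] - fstype source superopts
func parseMountinfoLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	// The optional fields start at index 6 and end at a lone "-".
	sep := -1
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep < 0 || sep+1 >= len(fields) {
		return mountEntry{}, fmt.Errorf("malformed mountinfo line: %q", line)
	}
	id, err := strconv.Atoi(fields[0])
	if err != nil {
		return mountEntry{}, fmt.Errorf("bad mount ID: %w", err)
	}
	parent, err := strconv.Atoi(fields[1])
	if err != nil {
		return mountEntry{}, fmt.Errorf("bad parent ID: %w", err)
	}
	e := mountEntry{ID: id, ParentID: parent, MountPoint: fields[4], FSType: fields[sep+1]}
	for _, opt := range fields[6:sep] {
		if strings.HasPrefix(opt, "shared:") {
			e.Shared = true
		}
	}
	return e, nil
}

// looksLeaked flags overlay mounts that are shared or that do not hang off
// the expected private bind mount. The heuristic is an assumption.
func looksLeaked(e mountEntry, expectedParent int) bool {
	return e.FSType == "overlay" && (e.Shared || e.ParentID != expectedParent)
}

func main() {
	// Made-up sample lines, not copied from the nodes in this thread.
	lines := []string{
		"101 30 8:1 /store /store rw,relatime - ext4 /dev/sda1 rw",
		"250 30 0:55 / /store/abc/merged rw,relatime shared:12 - overlay overlay rw",
	}
	for _, l := range lines {
		e, err := parseMountinfoLine(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s: parent=%d shared=%v leaked=%v\n", e.MountPoint, e.ParentID, e.Shared, looksLeaked(e, 101))
	}
}
```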

We have two affected nodes with different stories:

Node A (58 shared overlays): These are under /mnt/nvme7/containers/storage/overlay — a separate NVMe storage path that MakePrivate doesn't cover. The NVMe device is mounted shared by our infra DaemonSet. This is our bug and we're fixing it downstream.

Node B (18 shared overlays): These are under /var/lib/containers/storage/overlay with parent 30 (/) instead of 14282. We can't fully explain this one — MakePrivate ran, the bind exists, and current test mounts correctly get parent 14282. Layer directories predate the current CRI-O process so something went wrong during the transition, but logs are rotated and we can't reproduce it.

Re #18831 — understood it's fixed. The NVMe case is a different path and we're handling it on our side. The 18 unexplained mounts on the other node are what motivated the defensive EBUSY handling here — whatever caused them, the pods shouldn't be stuck forever.

The rename in Put() is a non-critical optimization: it atomically replaces merged/ with a fresh directory. Get() handles an existing merged/ directory via MkdirAllAndChownNew regardless. Put()'s actual job is to unmount the overlay, which MNT_DETACH already accomplished by this point.

Previously, if the rename failed (e.g. EBUSY from a stale DCACHE_MOUNTED flag on the dentry), Put() returned an error. In CRI-O this manifests as a failed KillPodSandbox that retries indefinitely, leaving pods stuck Terminating for weeks.

Don't fail Put() when the rename fails. Clean up the temp dir and return success.

Fixes: containers#737
Signed-off-by: Tri Lam <trilamsr@gmail.com>

giuseppe · 2026-04-07T21:04:07Z

storage/drivers/overlay/overlay.go

+			// succeeded and that's what matters.
+			if errors.Is(err, unix.EBUSY) {
+				logrus.Debugf("Failed to replace mountpoint %s overlay: %s - %v (mount still held, skipping)", id, mountpoint, err)
+				os.Remove(tmpMountpoint)


The os.Remove would reintroduce this issue: containers/storage#1827

That is why we used a rename. Could we use other variants of rename, like RENAME_EXCHANGE and then rmdir the directory once it is not at tmpMountpoint?

Tested RENAME_EXCHANGE on the actual zombie mount — it also returns EBUSY. The kernel checks d_mountpoint() for both regular rename and RENAME_EXCHANGE:

regular rename: ret=-1 errno=16 (Device or resource busy)
RENAME_EXCHANGE: ret=-1 errno=16 (Device or resource busy)

The DCACHE_MOUNTED flag blocks all rename variants on the target dentry. Only rmdir and rename check this flag — but since we can't do either, the only option is to leave merged/ as-is and not fail Put().

What kernel version are you using?

5.15.0-1074-oracle (OKE / Oracle Linux)

github-actions bot added the storage label (Related to "storage" package) Apr 3, 2026

trilamsr force-pushed the fix-overlay-put-ebusy-bidirectional branch 2 times, most recently from 0e3e133 to 7279008, April 3, 2026 05:00

trilamsr changed the title ~~overlay: skip rename in Put() when mount is still held by another namespace~~ overlay: handle EBUSY from rename in Put() when mount is shared Apr 3, 2026

trilamsr mentioned this pull request Apr 3, 2026

overlay: Put() returns EBUSY when CSI drivers use Bidirectional mount propagation #737

Open

trilamsr force-pushed the fix-overlay-put-ebusy-bidirectional branch from 7279008 to 180c142, April 3, 2026 05:03

trilamsr mentioned this pull request Apr 3, 2026

removing mount point \"/var/lib/containers/storage/overlay/374e64c77093aadbffc91a1fe8cc53fe5c722e898670793bec73bc55e5291961/merged\": directory not empty" #100

Open

mtrmac reviewed Apr 7, 2026

trilamsr force-pushed the fix-overlay-put-ebusy-bidirectional branch from 180c142 to c6128a9, April 7, 2026 20:39

trilamsr changed the title ~~overlay: handle EBUSY from rename in Put() when mount is shared~~ overlay: don't fail Put() when merged/ rename fails Apr 7, 2026

giuseppe reviewed Apr 7, 2026

3 participants
