overlay: don't fail Put() when merged/ rename fails#738
overlay: don't fail Put() when merged/ rename fails#738trilamsr wants to merge 1 commit intocontainers:mainfrom
Conversation
0e3e133 to
7279008
Compare
7279008 to
180c142
Compare
|
Packit jobs failed. @containers/packit-build please check. |
|
Note: this is hard to unit test because reproducing the EBUSY requires a separate mount namespace holding a shared mount reference (Bidirectional propagation). A simple bind mount in a single namespace gets detached by |
that looks like an error in the way this mount is managed. You are likely using I don't understand how this patch could address that though. You are just removing the tmpMountpoint and ignoring the error. |
|
Hey Giuseppe, thanks for the pointer — I dug into this on our nodes.
The shared overlay mounts were detached by a previous But Verified on the live node:
The current code in Same pattern reported in moby/moby#34948 and containers/podman#18831. |
|
containers/podman#18831 is a years old issue that was actually fixed. If Do you have a mountpoint on |
mtrmac
left a comment
There was a problem hiding this comment.
I really know nothing about this space, this is not a review in any practical sense.
To save others the time, the code references to m_count/ put_mountpoint from earlier no longer exist in current kernel sources, after torvalds/linux@d72c773 .
And here I show my ignorance: torvalds/linux@8ed936b suggests that a Rename replacing a mountpoint should automatically unmount the mounts; and at least the immediately visible EBUSY error path is gated on is_local_mountpoint. So… isn’t this something that the kernel is expected to handle?
storage/drivers/overlay/overlay.go
Outdated
| // Clean up the temp dir and return success since MNT_DETACH already detached | ||
| // the mount from the calling namespace. |
There was a problem hiding this comment.
If this handling were sufficient and correct to do, we could just as well skip the Rename entirely, so why is it left around? (I’m not saying that this is sufficient and correct, just that it doesn’t look like a complete/coherent theory of the problem.)
There was a problem hiding this comment.
Good question. The rename isn't just cleanup — it atomically replaces merged/ with a fresh directory that has correct ownership via MkdirAndChown. Skipping it always would mean the next Get() mounts on the old merged/ which may have stale permissions.
On EBUSY the directory is empty (overlay is already detached) so skipping is fine as a fallback. The only alternative is returning an error, which makes CRI-O retry KillPodSandbox forever — pods stuck Terminating for 42 days in our case. Not great.
There was a problem hiding this comment.
Skipping it always would mean the next Get() mounts on the old merged/ which may have stale permissions.
[Skipping a side discussion about permissions]. So is that safe and correct to do? If so, why not do that always? If not, why do that ever?
The CRI-O retry is not an argument in any direction. For the little I know, maybe the only alternative really is “do not do that it doesn’t work and we can’t [currently] make it reliably work”.
There was a problem hiding this comment.
You're right, my framing was off. Let me restate.
The rename is a non-critical optimization — it atomically refreshes merged/ with clean ownership. But Get() doesn't depend on it: MkdirAllAndChownNew handles an existing merged/ directory fine. Put()'s actual job is to unmount the overlay, which MNT_DETACH already accomplished by that point.
The real issue is that Put() returns a hard error when this optional cleanup step fails. That error propagates to CRI-O as a failed KillPodSandbox, which retries forever. The fix is: don't fail Put() on a non-critical rename failure. The rename stays because it's useful when it works (zero cost), but its failure shouldn't fail the function.
Happy to update the commit message and comments to reflect this framing if it makes more sense.
|
Good questions. Yes, there's a private bind on Private, no shared tag. New overlay mounts correctly get parent 14282. We have two affected nodes with different stories: Node A (58 shared overlays): These are under Node B (18 shared overlays): These are under Re #18831 — understood it's fixed. The NVMe case is a different path and we're handling it on our side. The 18 unexplained mounts on the other node are what motivated the defensive EBUSY handling here — whatever caused them, the pods shouldn't be stuck forever. |
The rename in Put() is a non-critical optimization — it atomically replaces merged/ with a fresh directory. Get() handles an existing merged/ directory via MkdirAllAndChownNew regardless. Put()'s actual job is to unmount the overlay, which MNT_DETACH already accomplished by this point. Previously, if the rename failed (e.g. EBUSY from a stale DCACHE_MOUNTED flag on the dentry), Put() returned an error. In CRI-O this manifests as a failed KillPodSandbox that retries indefinitely — pods stuck Terminating for weeks. Don't fail Put() when the rename fails. Clean up the temp dir and return success. Fixes: containers#737 Signed-off-by: Tri Lam <trilamsr@gmail.com>
180c142 to
c6128a9
Compare
| // succeeded and that's what matters. | ||
| if errors.Is(err, unix.EBUSY) { | ||
| logrus.Debugf("Failed to replace mountpoint %s overlay: %s - %v (mount still held, skipping)", id, mountpoint, err) | ||
| os.Remove(tmpMountpoint) |
There was a problem hiding this comment.
the os.Remove would reintroduce this issue: containers/storage#1827
That is why we used a rename. Could we use other variants of rename, like RENAME_EXCHANGE and then rmdir the directory once it is not at tmpMountpoint?
There was a problem hiding this comment.
Tested RENAME_EXCHANGE on the actual zombie mount — it also returns EBUSY. The kernel checks d_mountpoint() for both regular rename and RENAME_EXCHANGE:
regular rename: ret=-1 errno=16 (Device or resource busy)
RENAME_EXCHANGE: ret=-1 errno=16 (Device or resource busy)
The DCACHE_MOUNTED flag blocks all rename variants on the target dentry. Only rmdir and rename check this flag — but since we can't do either, the only option is to leave merged/ as-is and not fail Put().
There was a problem hiding this comment.
what kernel version are you using?
There was a problem hiding this comment.
5.15.0-1074-oracle (OKE / Oracle Linux)
The rename in
Put()is a non-critical optimization — it atomically replacesmerged/with a fresh directory.Get()handles an existingmerged/viaMkdirAllAndChownNewregardless.Put()'s actual job is to unmount the overlay, whichMNT_DETACHalready accomplished by that point.Previously, if the rename failed (e.g. EBUSY from a stale
DCACHE_MOUNTEDflag on the dentry),Put()returned an error. In CRI-O this manifests as a failedKillPodSandboxthat retries indefinitely — pods stuck Terminating for weeks.Don't fail
Put()when the rename fails. Clean up the temp dir and return success.Fixes #737