Cleanup NIXL file mounts by podkidyshev · Pull Request #840 · NVIDIA/cloudai

podkidyshev · 2026-03-17T19:24:16Z

Summary

Cleanup mounted files to nixl after execution to free space

Test Plan

Manual runs on EOS
Automated CI

Additional Notes

Additional functionality for CLOUDAI-11

coderabbitai · 2026-03-17T19:24:23Z

📝 Walkthrough

Walkthrough

Adds a no-op cleanup hook to CommandGenStrategy, implements NIXL-specific artifact removal, invokes cleanup from Slurm runners on job completion, and adds tests verifying invocation and filesystem cleanup.

Changes

Cohort / File(s)	Summary
Core base `src/cloudai/_core/command_gen_strategy.py`	Added public method `cleanup_job_artifacts(self) -> None` (no-op hook) and updated copyright year to 2026.
Slurm runners `src/cloudai/systems/slurm/slurm_runner.py`, `src/cloudai/systems/slurm/single_sbatch_runner.py`	Added `completed_test_runs(self, job: BaseJob) -> list[TestRun]`; `SlurmRunner.on_job_completion()` now iterates completed test runs and calls `cleanup_job_artifacts()` per test run inside a broad `try/except` with warning logs.
NIXL workloads `src/cloudai/workloads/common/nixl.py`	Added `_cleanup_targets()` and `cleanup_job_artifacts()` in `NIXLCmdGenBase` to compute cleanup directories (filepath/device_list mounts) and remove them using `shutil.rmtree`, logging removals.
Tests — runners & NIXL `tests/test_get_job_id.py`, `tests/test_single_sbatch_runner.py`, `tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py`, `tests/workloads/nixl_kvbench/test_command_gen_slurm.py`	Added tests asserting Slurm runner calls cleanup; added tests verifying NIXL strategies remove expected mount directories; updated `kvbench_tr` fixture to include `output_path`; added mocks and assertions for cleanup calls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through mounts at close of day,
Nibbled leftover files and tucked them away,
Paths cleared neatly, logs in a line,
Tests gave a nod — tidy and fine,
Hooray for cleanups, one hop at a time! 🥕

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Cleanup NIXL file mounts' directly and clearly describes the main purpose of the changeset, which adds cleanup functionality for mounted files in NIXL workloads.
Description check	✅ Passed	The description is directly related to the changeset, explaining that it adds cleanup of mounted files to NIXL after execution to free space, which aligns with the implemented changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ipod/cloudai-11/cleanup-nixl-mounts

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-03-20T18:02:43Z

Greptile Summary

This PR adds post-job cleanup of NIXL container-mount directories (filepath_mount and device_list_mounts) to free disk space after a job completes. It introduces a cleanup_job_artifacts() hook on the CommandGenStrategy base class (no-op by default), dispatches it from SlurmRunner.on_job_completion, and implements the actual cleanup in NIXLCmdGenBase. A new completed_test_runs(job) override in SingleSbatchRunner ensures cleanup runs for all test runs within a single-sbatch job (not just the one nominally associated with the Slurm job object). Exception handling prevents cleanup failures from disrupting job completion flow, and all new logic is covered by targeted unit tests.

Key changes:

CommandGenStrategy.cleanup_job_artifacts() — extensible no-op hook for post-job cleanup
SlurmRunner.on_job_completion() — iterates completed_test_runs and calls cleanup per TestRun, with try/except guarding
SingleSbatchRunner.completed_test_runs() — overrides base to return all scenario test runs, correctly covering the single-job-for-all-runs model
NIXLCmdGenBase._cleanup_targets() / cleanup_job_artifacts() — removes filepath_mount and device_list_mounts directories based on cmd_args_dict inspection
Tests added for SlurmRunner, SingleSbatchRunner, nixl-bench, and nixl-kvbench cleanup paths

Confidence Score: 4/5

Safe to merge — well-structured and well-tested; one minor style issue with exists() vs is_dir() before shutil.rmtree.
The logic is sound, exception handling prevents cleanup failures from affecting job completion, and all new paths have test coverage. The only deduction is for using cleanup_target.exists() instead of cleanup_target.is_dir() before calling shutil.rmtree, which could silently fire a NotADirectoryError exception rather than cleanly skipping non-directory paths.
src/cloudai/workloads/common/nixl.py — minor is_dir() vs exists() style issue in cleanup_job_artifacts

Important Files Changed

Filename	Overview
src/cloudai/_core/command_gen_strategy.py	Adds a no-op `cleanup_job_artifacts()` hook to the base strategy class, providing a safe default for workloads that don't need cleanup.
src/cloudai/systems/slurm/slurm_runner.py	Adds `completed_test_runs(job)` returning the single job's `TestRun`, and extends `on_job_completion` to call `cleanup_job_artifacts()` per test run with proper exception handling.
src/cloudai/systems/slurm/single_sbatch_runner.py	Overrides `completed_test_runs(job)` to return all test runs in the scenario (correct for single-sbatch mode where one job covers all runs); `job` parameter is intentionally unused.
src/cloudai/workloads/common/nixl.py	Implements `cleanup_job_artifacts()` and `_cleanup_targets()` to remove `filepath_mount` and `device_list_mounts` directories post-job; minor style issue: `exists()` used instead of `is_dir()` before `shutil.rmtree`.

Sequence Diagram

sequenceDiagram
    participant Runner as SlurmRunner / SingleSbatchRunner
    participant System as SlurmSystem
    participant CmdGen as CommandGenStrategy (NIXLCmdGenBase)
    participant FS as Filesystem

    Runner->>System: complete_job(job)
    Runner->>Runner: store_job_metadata(job)
    Runner->>Runner: completed_test_runs(job)
    note over Runner: SlurmRunner → [job.test_run]<br/>SingleSbatchRunner → list(all_trs)
    loop for each TestRun
        Runner->>CmdGen: get_cmd_gen_strategy(system, tr)
        CmdGen->>CmdGen: _cleanup_targets()
        note over CmdGen: checks filepath & device_list<br/>from cmd_args_dict
        loop for each cleanup_target
            CmdGen->>FS: cleanup_target.is_dir()?
            alt directory exists
                CmdGen->>FS: shutil.rmtree(cleanup_target)
                note over FS: removes filepath_mount/<br/>and device_list_mounts/
            end
        end
    end
    note over Runner: Exception caught → logging.warning

_{Last reviewed commit: "Update src/cloudai/w..."}

src/cloudai/workloads/common/nixl.py

src/cloudai/systems/slurm/single_sbatch_runner.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 224-225: The method completed_test_runs in single_sbatch_runner.py
currently accepts a parameter named job that is unused (triggering ARG002);
rename the parameter from job to _job in the completed_test_runs signature to
mark it intentionally unused and update any references inside the method (there
are none) so lint stops complaining—keep the body returning list(self.all_trs)
unchanged.

In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-251: The cleanup currently calls resolve() in
_cleanup_targets() which lets rmtree follow symlinks outside the run dir;
instead, return non-resolved targets (keep (self.test_run.output_path /
"filepath_mount") and "device_list_mounts") from _cleanup_targets(), and in
cleanup_job_artifacts() for each target first check existence, then compute
resolved_root = self.test_run.output_path.resolve() and target_resolved =
target.resolve(); only allow deletion if
target_resolved.is_relative_to(resolved_root); handle symlinks safely by
unlinking when target.is_symlink() (or target.is_file()) and using shutil.rmtree
for real directories, otherwise skip and log/refuse to delete to avoid escaping
the run directory (referencing cleanup_job_artifacts, _cleanup_targets,
test_run.output_path, filepath_mount, device_list_mounts).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e86174c-28f1-4102-bfb6-d14268f624d8

📥 Commits

Reviewing files that changed from the base of the PR and between 9916d02 and 6f2338a.

📒 Files selected for processing (8)

src/cloudai/_core/command_gen_strategy.py
src/cloudai/systems/slurm/single_sbatch_runner.py
src/cloudai/systems/slurm/slurm_runner.py
src/cloudai/workloads/common/nixl.py
tests/test_get_job_id.py
tests/test_single_sbatch_runner.py
tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py
tests/workloads/nixl_kvbench/test_command_gen_slurm.py

src/cloudai/systems/slurm/single_sbatch_runner.py

src/cloudai/workloads/common/nixl.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

coderabbitai

♻️ Duplicate comments (1)

src/cloudai/workloads/common/nixl.py (1)

235-250: ⚠️ Potential issue | 🔴 Critical

Prevent cleanup from deleting paths outside the run directory.

At Line 246 and Line 250, calling .resolve() before deletion lets symlinks escape self.test_run.output_path; then Line 238 can recursively delete arbitrary directories. Keep unresolved targets and enforce an output-root boundary before deletion.

🔒 Proposed safe cleanup patch

     def cleanup_job_artifacts(self) -> None:
+        output_root = self.test_run.output_path.resolve()
         for cleanup_target in self._cleanup_targets():
-            if cleanup_target.exists():
-                shutil.rmtree(cleanup_target)
-                logging.debug(f"Cleaned up job artifact: {cleanup_target}")
+            if not cleanup_target.exists() and not cleanup_target.is_symlink():
+                continue
+
+            if cleanup_target.is_symlink() or cleanup_target.is_file():
+                logging.warning("Skipping non-directory cleanup target: %s", cleanup_target)
+                cleanup_target.unlink(missing_ok=True)
+                continue
+
+            resolved_target = cleanup_target.resolve()
+            if resolved_target != output_root and output_root not in resolved_target.parents:
+                logging.warning("Skipping cleanup target outside output path: %s", cleanup_target)
+                continue
+
+            shutil.rmtree(cleanup_target)
+            logging.debug("Cleaned up job artifact: %s", cleanup_target)

     def _cleanup_targets(self) -> list[Path]:
         cleanup_targets: list[Path] = []

         filepath_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("filepath"))
         if filepath_raw:
-            cleanup_targets.append((self.test_run.output_path / "filepath_mount").resolve())
+            cleanup_targets.append(self.test_run.output_path / "filepath_mount")

         device_list_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("device_list"))
         if device_list_raw and get_files_from_device_list(device_list_raw):
-            cleanup_targets.append((self.test_run.output_path / "device_list_mounts").resolve())
+            cleanup_targets.append(self.test_run.output_path / "device_list_mounts")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/common/nixl.py` around lines 235 - 250, The cleanup
currently calls .resolve() in _cleanup_targets which can let symlinks escape the
intended run directory; update _cleanup_targets (used by cleanup_job_artifacts)
to stop resolving targets (return Path objects rooted under
self.test_run.output_path without calling .resolve()), and then in
cleanup_job_artifacts before calling shutil.rmtree verify each candidate is
contained inside the run output root (compare a resolved output root like
self.test_run.output_path.resolve() with the candidate's absolute path via a
safe common-path/is_relative_to check using resolve(strict=False) or
os.path.commonpath) and skip/delete only if the candidate is inside that
boundary; this prevents deleting paths outside the run directory while keeping
targets unresolved until validated.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-250: The cleanup currently calls .resolve() in
_cleanup_targets which can let symlinks escape the intended run directory;
update _cleanup_targets (used by cleanup_job_artifacts) to stop resolving
targets (return Path objects rooted under self.test_run.output_path without
calling .resolve()), and then in cleanup_job_artifacts before calling
shutil.rmtree verify each candidate is contained inside the run output root
(compare a resolved output root like self.test_run.output_path.resolve() with
the candidate's absolute path via a safe common-path/is_relative_to check using
resolve(strict=False) or os.path.commonpath) and skip/delete only if the
candidate is inside that boundary; this prevents deleting paths outside the run
directory while keeping targets unresolved until validated.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 47c06ae3-4e1b-4a53-b5da-c8efe8993cc9

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2338a and 1a88b1c.

📒 Files selected for processing (1)

src/cloudai/workloads/common/nixl.py

src/cloudai/workloads/common/nixl.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-239: Static analysis flags the f-string in
cleanup_job_artifacts; inside the loop that iterates over
self._cleanup_targets() and calls shutil.rmtree on cleanup_target (after
cleanup_target.is_dir()), replace the f-string logging.debug(f"Cleaned up job
artifact: {cleanup_target}") with lazy formatting logging.debug("Cleaned up job
artifact: %s", cleanup_target) to avoid eager string interpolation and keep
consistent logging style.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 496ba64f-0519-417c-bb1b-b2f34fa7cc4d

📥 Commits

Reviewing files that changed from the base of the PR and between 1a88b1c and a2d6070.

📒 Files selected for processing (1)

src/cloudai/workloads/common/nixl.py

src/cloudai/workloads/common/nixl.py

cleanup nixl mounts

c9f8142

podkidyshev changed the title ~~cleanup nixl mounts~~ Cleanup NIXL file mounts Mar 17, 2026

podkidyshev added 7 commits March 19, 2026 17:20

using host os filepaths for cleanup

2a501dc

change commands order

271cc88

using python for clenaup

386edb2

Merge branch 'main' into ipod/cloudai-11/cleanup-nixl-mounts

5c1d99a

remove redundant tests

4098546

revert redundant line change

87514ce

update copyright

6f2338a

podkidyshev marked this pull request as ready for review March 20, 2026 17:58

podkidyshev requested review from amaslenn, jeffnvidia and srivatsankrishnan as code owners March 20, 2026 17:58

greptile-apps bot reviewed Mar 20, 2026

View reviewed changes

src/cloudai/workloads/common/nixl.py Show resolved Hide resolved

src/cloudai/systems/slurm/single_sbatch_runner.py Show resolved Hide resolved

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

src/cloudai/systems/slurm/single_sbatch_runner.py Show resolved Hide resolved

src/cloudai/workloads/common/nixl.py Show resolved Hide resolved

Update src/cloudai/workloads/common/nixl.py

1a88b1c

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

greptile-apps bot reviewed Mar 20, 2026

View reviewed changes

src/cloudai/workloads/common/nixl.py Outdated Show resolved Hide resolved

safer nixl artifacts deletion

a2d6070

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

src/cloudai/workloads/common/nixl.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup NIXL file mounts#840

Cleanup NIXL file mounts#840
podkidyshev wants to merge 10 commits intomainfrom
ipod/cloudai-11/cleanup-nixl-mounts

podkidyshev commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Mar 20, 2026 •

edited

Loading

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

podkidyshev commented Mar 17, 2026

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai bot commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai bot commented Mar 17, 2026 •

edited

Loading

greptile-apps bot commented Mar 20, 2026 •

edited

Loading