Skip to content

Cleanup NIXL file mounts#840

Open
podkidyshev wants to merge 10 commits intomainfrom
ipod/cloudai-11/cleanup-nixl-mounts
Open

Cleanup NIXL file mounts#840
podkidyshev wants to merge 10 commits intomainfrom
ipod/cloudai-11/cleanup-nixl-mounts

Conversation

@podkidyshev
Copy link
Contributor

Summary

Cleanup mounted files to nixl after execution to free space

Test Plan

  • Manual runs on EOS
  • Automated CI

Additional Notes

Additional functionality for CLOUDAI-11

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Adds a no-op cleanup hook to CommandGenStrategy, implements NIXL-specific artifact removal, invokes cleanup from Slurm runners on job completion, and adds tests verifying invocation and filesystem cleanup.

Changes

Cohort / File(s) Summary
Core base
src/cloudai/_core/command_gen_strategy.py
Added public method cleanup_job_artifacts(self) -> None (no-op hook) and updated copyright year to 2026.
Slurm runners
src/cloudai/systems/slurm/slurm_runner.py, src/cloudai/systems/slurm/single_sbatch_runner.py
Added completed_test_runs(self, job: BaseJob) -> list[TestRun]; SlurmRunner.on_job_completion() now iterates completed test runs and calls cleanup_job_artifacts() per test run inside a broad try/except with warning logs.
NIXL workloads
src/cloudai/workloads/common/nixl.py
Added _cleanup_targets() and cleanup_job_artifacts() in NIXLCmdGenBase to compute cleanup directories (filepath/device_list mounts) and remove them using shutil.rmtree, logging removals.
Tests — runners & NIXL
tests/test_get_job_id.py, tests/test_single_sbatch_runner.py, tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py, tests/workloads/nixl_kvbench/test_command_gen_slurm.py
Added tests asserting Slurm runner calls cleanup; added tests verifying NIXL strategies remove expected mount directories; updated kvbench_tr fixture to include output_path; added mocks and assertions for cleanup calls.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through mounts at close of day,
Nibbled leftover files and tucked them away,
Paths cleared neatly, logs in a line,
Tests gave a nod — tidy and fine,
Hooray for cleanups, one hop at a time! 🥕

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Cleanup NIXL file mounts' directly and clearly describes the main purpose of the changeset, which adds cleanup functionality for mounted files in NIXL workloads.
Description check ✅ Passed The description is directly related to the changeset, explaining that it adds cleanup of mounted files to NIXL after execution to free space, which aligns with the implemented changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/cloudai-11/cleanup-nixl-mounts

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev changed the title cleanup nixl mounts Cleanup NIXL file mounts Mar 17, 2026
@podkidyshev podkidyshev marked this pull request as ready for review March 20, 2026 17:58
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds post-job cleanup of NIXL container-mount directories (filepath_mount and device_list_mounts) to free disk space after a job completes. It introduces a cleanup_job_artifacts() hook on the CommandGenStrategy base class (no-op by default), dispatches it from SlurmRunner.on_job_completion, and implements the actual cleanup in NIXLCmdGenBase. A new completed_test_runs(job) override in SingleSbatchRunner ensures cleanup runs for all test runs within a single-sbatch job (not just the one nominally associated with the Slurm job object). Exception handling prevents cleanup failures from disrupting job completion flow, and all new logic is covered by targeted unit tests.

Key changes:

  • CommandGenStrategy.cleanup_job_artifacts() — extensible no-op hook for post-job cleanup
  • SlurmRunner.on_job_completion() — iterates completed_test_runs and calls cleanup per TestRun, with try/except guarding
  • SingleSbatchRunner.completed_test_runs() — overrides base to return all scenario test runs, correctly covering the single-job-for-all-runs model
  • NIXLCmdGenBase._cleanup_targets() / cleanup_job_artifacts() — removes filepath_mount and device_list_mounts directories based on cmd_args_dict inspection
  • Tests added for SlurmRunner, SingleSbatchRunner, nixl-bench, and nixl-kvbench cleanup paths

Confidence Score: 4/5

  • Safe to merge — well-structured and well-tested; one minor style issue with exists() vs is_dir() before shutil.rmtree.
  • The logic is sound, exception handling prevents cleanup failures from affecting job completion, and all new paths have test coverage. The only deduction is for using cleanup_target.exists() instead of cleanup_target.is_dir() before calling shutil.rmtree, which could silently fire a NotADirectoryError exception rather than cleanly skipping non-directory paths.
  • src/cloudai/workloads/common/nixl.py — minor is_dir() vs exists() style issue in cleanup_job_artifacts

Important Files Changed

Filename Overview
src/cloudai/_core/command_gen_strategy.py Adds a no-op cleanup_job_artifacts() hook to the base strategy class, providing a safe default for workloads that don't need cleanup.
src/cloudai/systems/slurm/slurm_runner.py Adds completed_test_runs(job) returning the single job's TestRun, and extends on_job_completion to call cleanup_job_artifacts() per test run with proper exception handling.
src/cloudai/systems/slurm/single_sbatch_runner.py Overrides completed_test_runs(job) to return all test runs in the scenario (correct for single-sbatch mode where one job covers all runs); job parameter is intentionally unused.
src/cloudai/workloads/common/nixl.py Implements cleanup_job_artifacts() and _cleanup_targets() to remove filepath_mount and device_list_mounts directories post-job; minor style issue: exists() used instead of is_dir() before shutil.rmtree.

Sequence Diagram

sequenceDiagram
    participant Runner as SlurmRunner / SingleSbatchRunner
    participant System as SlurmSystem
    participant CmdGen as CommandGenStrategy (NIXLCmdGenBase)
    participant FS as Filesystem

    Runner->>System: complete_job(job)
    Runner->>Runner: store_job_metadata(job)
    Runner->>Runner: completed_test_runs(job)
    note over Runner: SlurmRunner → [job.test_run]<br/>SingleSbatchRunner → list(all_trs)
    loop for each TestRun
        Runner->>CmdGen: get_cmd_gen_strategy(system, tr)
        CmdGen->>CmdGen: _cleanup_targets()
        note over CmdGen: checks filepath & device_list<br/>from cmd_args_dict
        loop for each cleanup_target
            CmdGen->>FS: cleanup_target.is_dir()?
            alt directory exists
                CmdGen->>FS: shutil.rmtree(cleanup_target)
                note over FS: removes filepath_mount/<br/>and device_list_mounts/
            end
        end
    end
    note over Runner: Exception caught → logging.warning
Loading

Last reviewed commit: "Update src/cloudai/w..."

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 224-225: The method completed_test_runs in single_sbatch_runner.py
currently accepts a parameter named job that is unused (triggering ARG002);
rename the parameter from job to _job in the completed_test_runs signature to
mark it intentionally unused and update any references inside the method (there
are none) so lint stops complaining—keep the body returning list(self.all_trs)
unchanged.

In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-251: The cleanup currently calls resolve() in
_cleanup_targets() which lets rmtree follow symlinks outside the run dir;
instead, return non-resolved targets (keep (self.test_run.output_path /
"filepath_mount") and "device_list_mounts") from _cleanup_targets(), and in
cleanup_job_artifacts() for each target first check existence, then compute
resolved_root = self.test_run.output_path.resolve() and target_resolved =
target.resolve(); only allow deletion if
target_resolved.is_relative_to(resolved_root); handle symlinks safely by
unlinking when target.is_symlink() (or target.is_file()) and using shutil.rmtree
for real directories, otherwise skip and log/refuse to delete to avoid escaping
the run directory (referencing cleanup_job_artifacts, _cleanup_targets,
test_run.output_path, filepath_mount, device_list_mounts).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e86174c-28f1-4102-bfb6-d14268f624d8

📥 Commits

Reviewing files that changed from the base of the PR and between 9916d02 and 6f2338a.

📒 Files selected for processing (8)
  • src/cloudai/_core/command_gen_strategy.py
  • src/cloudai/systems/slurm/single_sbatch_runner.py
  • src/cloudai/systems/slurm/slurm_runner.py
  • src/cloudai/workloads/common/nixl.py
  • tests/test_get_job_id.py
  • tests/test_single_sbatch_runner.py
  • tests/workloads/nixl_bench/test_command_gen_strategy_slurm.py
  • tests/workloads/nixl_kvbench/test_command_gen_slurm.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/cloudai/workloads/common/nixl.py (1)

235-250: ⚠️ Potential issue | 🔴 Critical

Prevent cleanup from deleting paths outside the run directory.

At Line 246 and Line 250, calling .resolve() before deletion lets symlinks escape self.test_run.output_path; then Line 238 can recursively delete arbitrary directories. Keep unresolved targets and enforce an output-root boundary before deletion.

🔒 Proposed safe cleanup patch
     def cleanup_job_artifacts(self) -> None:
+        output_root = self.test_run.output_path.resolve()
         for cleanup_target in self._cleanup_targets():
-            if cleanup_target.exists():
-                shutil.rmtree(cleanup_target)
-                logging.debug(f"Cleaned up job artifact: {cleanup_target}")
+            if not cleanup_target.exists() and not cleanup_target.is_symlink():
+                continue
+
+            if cleanup_target.is_symlink() or cleanup_target.is_file():
+                logging.warning("Skipping non-directory cleanup target: %s", cleanup_target)
+                cleanup_target.unlink(missing_ok=True)
+                continue
+
+            resolved_target = cleanup_target.resolve()
+            if resolved_target != output_root and output_root not in resolved_target.parents:
+                logging.warning("Skipping cleanup target outside output path: %s", cleanup_target)
+                continue
+
+            shutil.rmtree(cleanup_target)
+            logging.debug("Cleaned up job artifact: %s", cleanup_target)

     def _cleanup_targets(self) -> list[Path]:
         cleanup_targets: list[Path] = []

         filepath_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("filepath"))
         if filepath_raw:
-            cleanup_targets.append((self.test_run.output_path / "filepath_mount").resolve())
+            cleanup_targets.append(self.test_run.output_path / "filepath_mount")

         device_list_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("device_list"))
         if device_list_raw and get_files_from_device_list(device_list_raw):
-            cleanup_targets.append((self.test_run.output_path / "device_list_mounts").resolve())
+            cleanup_targets.append(self.test_run.output_path / "device_list_mounts")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/common/nixl.py` around lines 235 - 250, The cleanup
currently calls .resolve() in _cleanup_targets which can let symlinks escape the
intended run directory; update _cleanup_targets (used by cleanup_job_artifacts)
to stop resolving targets (return Path objects rooted under
self.test_run.output_path without calling .resolve()), and then in
cleanup_job_artifacts before calling shutil.rmtree verify each candidate is
contained inside the run output root (compare a resolved output root like
self.test_run.output_path.resolve() with the candidate's absolute path via a
safe common-path/is_relative_to check using resolve(strict=False) or
os.path.commonpath) and skip/delete only if the candidate is inside that
boundary; this prevents deleting paths outside the run directory while keeping
targets unresolved until validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-250: The cleanup currently calls .resolve() in
_cleanup_targets which can let symlinks escape the intended run directory;
update _cleanup_targets (used by cleanup_job_artifacts) to stop resolving
targets (return Path objects rooted under self.test_run.output_path without
calling .resolve()), and then in cleanup_job_artifacts before calling
shutil.rmtree verify each candidate is contained inside the run output root
(compare a resolved output root like self.test_run.output_path.resolve() with
the candidate's absolute path via a safe common-path/is_relative_to check using
resolve(strict=False) or os.path.commonpath) and skip/delete only if the
candidate is inside that boundary; this prevents deleting paths outside the run
directory while keeping targets unresolved until validated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 47c06ae3-4e1b-4a53-b5da-c8efe8993cc9

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2338a and 1a88b1c.

📒 Files selected for processing (1)
  • src/cloudai/workloads/common/nixl.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-239: Static analysis flags the f-string in
cleanup_job_artifacts; inside the loop that iterates over
self._cleanup_targets() and calls shutil.rmtree on cleanup_target (after
cleanup_target.is_dir()), replace the f-string logging.debug(f"Cleaned up job
artifact: {cleanup_target}") with lazy formatting logging.debug("Cleaned up job
artifact: %s", cleanup_target) to avoid eager string interpolation and keep
consistent logging style.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 496ba64f-0519-417c-bb1b-b2f34fa7cc4d

📥 Commits

Reviewing files that changed from the base of the PR and between 1a88b1c and a2d6070.

📒 Files selected for processing (1)
  • src/cloudai/workloads/common/nixl.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant