Description
Summary
The _verify_commit(self, ref: str, path: Path) -> InstallStatusResult method is duplicated in src/cloudai/systems/kubernetes/kubernetes_installer.py and src/cloudai/systems/slurm/slurm_installer.py.
Proposed Action
Extract the shared implementation into a common location — for example, a protected method on BaseInstaller or a dedicated InstallerVerifyMixin — and update both KubernetesInstaller and SlurmInstaller to rely on the shared implementation.
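A minimal sketch of the mixin option, to make the proposal concrete. The InstallStatusResult dataclass and the two installer classes below are stand-ins, not the real cloudai classes, and the _resolve_commit helper is hypothetical. The verification logic (resolve both HEAD and the ref with git rev-parse and compare the hashes) is an assumption about what the duplicated method does, not a copy of it.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class InstallStatusResult:
    """Stand-in for the project's result type; the real fields may differ."""

    success: bool
    message: str = ""


class InstallerVerifyMixin:
    """Hypothetical single home for the commit verification logic."""

    def _resolve_commit(self, rev: str, path: Path) -> Optional[str]:
        """Return the commit hash `rev` points to, or None if it does not resolve."""
        proc = subprocess.run(
            ["git", "rev-parse", "--verify", "--quiet", f"{rev}^{{commit}}"],
            cwd=path,
            capture_output=True,
            text=True,
        )
        return proc.stdout.strip() if proc.returncode == 0 else None

    def _verify_commit(self, ref: str, path: Path) -> InstallStatusResult:
        """Check that the checkout at `path` sits on the commit `ref` resolves to."""
        head = self._resolve_commit("HEAD", path)
        if head is None:
            return InstallStatusResult(False, f"{path} is not a usable git checkout")
        expected = self._resolve_commit(ref, path)
        if expected is None:
            return InstallStatusResult(False, f"cannot resolve '{ref}' in {path}")
        if head != expected:
            return InstallStatusResult(False, f"HEAD is {head}, expected {expected} ({ref})")
        return InstallStatusResult(True)


# Stand-ins for the two real installers; each would drop its own copy of the method.
class KubernetesInstaller(InstallerVerifyMixin):
    pass


class SlurmInstaller(InstallerVerifyMixin):
    pass
```

Because the git call is isolated in one small helper, unit tests can override _resolve_commit and exercise the comparison logic without a real repository. Placing the same method on BaseInstaller instead would work equally well if no other installer needs to opt out.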
Additional Scope: Workload-level installation hooks
Some workloads require custom installation steps beyond cloning a git repository. For example, Megatron-Bridge currently installs numpy and wandb via pip install at submit time or runtime, inside the launcher wrapper script, rather than at install time. This is fragile: parallel submissions can race when writing to the same site-packages directory, and a transient network failure can break an otherwise valid launch.
The installable object model should be extended to support custom installation hooks, allowing workloads to declare additional steps (e.g. pip install numpy wandb) that are executed once at install time, within the managed installation phase, rather than on every job submission. Megatron-Bridge is the concrete motivating example.
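One possible shape for such hooks, as a rough sketch only. Every name here (PipInstallHook, Installable, install_hooks, run_install_hooks) is hypothetical and does not reflect the existing cloudai object model. The idea is that a workload declares its extra steps as data and the installer runs them during the install phase. The command runner is injectable, so the flow can be dry-run or tested without touching the network.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipInstallHook:
    """Hypothetical hook that installs extra pip packages at install time."""

    packages: List[str]

    def command(self, python: str = sys.executable) -> List[str]:
        # Use "python -m pip" so the packages land in the chosen interpreter's environment.
        return [python, "-m", "pip", "install", *self.packages]


@dataclass
class Installable:
    """Stand-in for an installable item that carries optional hooks."""

    name: str
    install_hooks: List[PipInstallHook] = field(default_factory=list)


def _run_checked(cmd: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(cmd, check=True)


def run_install_hooks(
    item: Installable, runner: Callable[[List[str]], None] = _run_checked
) -> None:
    """Run every declared hook in order; intended to be called by the installer."""
    for hook in item.install_hooks:
        runner(hook.command())


# Dry run: print the command instead of executing it.
megatron_bridge = Installable("megatron-bridge", [PipInstallHook(["numpy", "wandb"])])
run_install_hooks(megatron_bridge, runner=print)
```

With this arrangement a failed pip install surfaces as an installation failure rather than a failed job, and the launcher wrapper script no longer needs to modify site-packages on each submission.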
References
- PR: Fix commit verification: commit/branch/tag support #824
- Review comment: Fix commit verification: commit/branch/tag support #824 (comment)
- PR: Megatron-Bridge updates #821
- Review comment: Megatron-Bridge updates #821 (comment)
Requested by @podkidyshev.