Deduplicate git verification logic across installers and support workload-level installation hooks #825

@coderabbitai

Summary

The _verify_commit(self, ref: str, path: Path) -> InstallStatusResult method is currently duplicated in both src/cloudai/systems/kubernetes/kubernetes_installer.py and src/cloudai/systems/slurm/slurm_installer.py.

Proposed Action

Extract the shared implementation into a common location (for example, a protected method on BaseInstaller or a dedicated InstallerVerifyMixin) and update both KubernetesInstaller and SlurmInstaller to rely on the shared implementation.
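A minimal sketch of the mixin option, to make the proposal concrete. Only the `_verify_commit(self, ref, path)` signature and the class names come from this issue. The `InstallStatusResult` stand-in, the `_git_output` helper, and the verification logic itself (comparing `HEAD` with the resolved ref) are assumptions for illustration; the real duplicated implementation may check something different.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InstallStatusResult:
    """Stand-in for the project's result type; the real fields may differ."""

    success: bool
    message: str = ""


class InstallerVerifyMixin:
    """Hypothetical shared home for the duplicated git verification logic."""

    def _git_output(self, path: Path, *args: str) -> tuple[int, str]:
        """Run `git -C <path> <args>`; return (returncode, stdout or stderr).

        Hypothetical helper, kept separate so tests can override it.
        """
        proc = subprocess.run(
            ["git", "-C", str(path), *args], capture_output=True, text=True
        )
        out = proc.stdout if proc.returncode == 0 else proc.stderr
        return proc.returncode, out.strip()

    def _verify_commit(self, ref: str, path: Path) -> InstallStatusResult:
        # Assumed logic: the checkout is valid if HEAD equals the commit
        # that `ref` resolves to.
        code, head = self._git_output(path, "rev-parse", "HEAD")
        if code != 0:
            return InstallStatusResult(False, f"Cannot read HEAD in {path}: {head}")

        code, wanted = self._git_output(
            path, "rev-parse", "--verify", f"{ref}^{{commit}}"
        )
        if code != 0:
            return InstallStatusResult(False, f"Cannot resolve '{ref}' in {path}: {wanted}")

        if head != wanted:
            return InstallStatusResult(
                False, f"{path} is at {head}, expected {wanted} ('{ref}')"
            )
        return InstallStatusResult(True, f"{path} is at '{ref}' ({head})")


# In the real code base these classes already exist and inherit from
# BaseInstaller; they are shown bare here only to illustrate the reuse.
class KubernetesInstaller(InstallerVerifyMixin):
    pass


class SlurmInstaller(InstallerVerifyMixin):
    pass
```

Placing the method directly on BaseInstaller would work the same way; the mixin only matters if some installers should not inherit git-specific behavior.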

Additional Scope: Workload-level installation hooks

Some workloads require custom installation steps beyond cloning a git repository. For example, Megatron-Bridge currently runs pip install for numpy and wandb inside the launcher wrapper script at submit time or runtime, rather than at install time. This is fragile: parallel submissions can race on writes to site-packages, and transient network failures can break otherwise valid launches.

The installable object model should be extended to support custom installation hooks, allowing workloads to declare additional steps (e.g. pip install numpy wandb) that are executed once at install time, within the managed installation phase, rather than on every job submission. Megatron-Bridge is the concrete motivating example.
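One possible shape for such hooks is sketched below. Every name here (`PipInstallHook`, `GitRepoInstallable`, `install_hooks`, `run_install_hooks`, the `.install_hooks_done` marker file) is hypothetical and not taken from the existing code base. The sketch also leaves open which Python environment the packages should land in, which the real design would need to decide.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class PipInstallHook:
    """Hypothetical hook: install extra packages during the install phase."""

    packages: tuple[str, ...]

    def command(self, python: str = sys.executable) -> list[str]:
        return [python, "-m", "pip", "install", *self.packages]


@dataclass
class GitRepoInstallable:
    """Hypothetical installable: a git checkout plus optional extra steps."""

    url: str
    commit: str
    install_hooks: list[PipInstallHook] = field(default_factory=list)


def run_install_hooks(
    item: GitRepoInstallable, install_dir: Path, run=subprocess.run
) -> list[list[str]]:
    """Run the item's hooks once per install_dir; return the commands executed.

    A marker file records completion, so repeated installs (and job
    submissions) skip the hooks instead of re-running pip.
    """
    install_dir.mkdir(parents=True, exist_ok=True)
    marker = install_dir / ".install_hooks_done"
    if marker.exists():
        return []

    executed = []
    for hook in item.install_hooks:
        cmd = hook.command()
        run(cmd, check=True)  # raises on failure, so the marker is not written
        executed.append(cmd)

    marker.write_text("\n".join(" ".join(cmd) for cmd in executed))
    return executed
```

Under this sketch, Megatron-Bridge would declare `install_hooks=[PipInstallHook(("numpy", "wandb"))]` on its installable, and the launcher wrapper script would no longer need to call pip at all.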

References

Requested by @podkidyshev.
