Open
Description
Background
PR #830 introduced the exclude_nodes parameter to SlurmSystem.get_nodes_by_spec() and wired it through the Megatron-Bridge command generation strategy. However, several other workloads and infrastructure components call get_nodes_by_spec() directly without forwarding test_run.exclude_nodes, so the exclusion has no effect for those paths.
Affected call sites
The following production code locations call get_nodes_by_spec() without passing exclude_nodes:
| File | Line | Caller context |
|---|---|---|
| `src/cloudai/workloads/nemo_run/slurm_command_gen_strategy.py` | 126 | NeMo-Run command gen |
| `src/cloudai/workloads/nemo_launcher/slurm_command_gen_strategy.py` | 40 | NeMo-Launcher command gen |
| `src/cloudai/workloads/triton_inference/slurm_command_gen_strategy.py` | 86 | Triton Inference (`_get_server_client_split`) |
| `src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` | 90 | DeepEP command gen |
| `src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py` | 206 | AI Dynamo command gen |
| `src/cloudai/systems/slurm/slurm_command_gen_strategy.py` | 325 | Base Slurm strategy (`_enable_vboost_cmd`) |
| `src/cloudai/systems/slurm/single_sbatch_runner.py` | 95, 118 | Single-sbatch runner (node list aggregation and per-test-run node resolution) |
Required changes
For each call site above:
- Parse `test_run.exclude_nodes` (a comma-separated string) into a `set[str]` using `parse_node_list` (already available in `slurm_system.py`), consistent with how `SlurmCommandGenStrategy` does it in this PR.
- Pass the resulting set as `exclude_nodes=...` to `get_nodes_by_spec()`.
- For `single_sbatch_runner.py`, collect and union all `exclude_nodes` values across the test runs being batched.
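The three steps above could look roughly like the sketch below. It is illustrative only and does not use the real cloudai code: `parse_node_list_stub`, `RunStub`, `excluded_set`, `union_excluded`, and `get_nodes_by_spec_stub` are hypothetical stand-ins. The stub parser only splits on commas, and the stub node resolver does not reflect the actual signature or return type of `SlurmSystem.get_nodes_by_spec()`; only the `exclude_nodes=...` keyword is taken from this issue.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


def parse_node_list_stub(node_list: str) -> list[str]:
    """Hypothetical, simplified stand-in for parse_node_list: comma split only."""
    return [name.strip() for name in node_list.split(",") if name.strip()]


@dataclass
class RunStub:
    """Hypothetical stand-in for a test run; only the fields used here."""
    name: str
    exclude_nodes: Optional[str] = None  # comma-separated string, may be unset


def excluded_set(test_run: RunStub) -> set[str]:
    """Step 1: turn test_run.exclude_nodes into a set (empty if unset)."""
    if not test_run.exclude_nodes:
        return set()
    return set(parse_node_list_stub(test_run.exclude_nodes))


def union_excluded(test_runs: Iterable[RunStub]) -> set[str]:
    """Step 3: union the exclusions of all test runs in one batch."""
    result: set[str] = set()
    for test_run in test_runs:
        result |= excluded_set(test_run)
    return result


def get_nodes_by_spec_stub(
    available: list[str],
    num_nodes: int,
    exclude_nodes: Optional[set[str]] = None,
) -> list[str]:
    """Hypothetical resolver: pick num_nodes from available, skipping exclusions."""
    excluded = exclude_nodes or set()
    candidates = [node for node in available if node not in excluded]
    if len(candidates) < num_nodes:
        raise ValueError("not enough nodes left after applying exclude_nodes")
    return candidates[:num_nodes]


if __name__ == "__main__":
    run_a = RunStub("a", exclude_nodes="node01, node02")
    run_b = RunStub("b", exclude_nodes="node02,node04")
    pool = ["node01", "node02", "node03", "node04", "node05"]

    # Step 2: forward the parsed set at the call site.
    print(get_nodes_by_spec_stub(pool, 2, exclude_nodes=excluded_set(run_a)))

    # Single-sbatch case: one combined exclusion set for the whole batch.
    print(sorted(union_excluded([run_a, run_b])))
```

Treating an unset or empty `exclude_nodes` as an empty set keeps the current behaviour (no exclusion) for test runs that do not configure it.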
Priority
Low — this is a follow-up to PR #830; existing behaviour (no exclusion) is preserved until these sites are updated.