Propagate exclude_nodes to all get_nodes_by_spec callers (nemo-run, nemo-launcher, single-sbatch-runner, and others) #832

@coderabbitai

Background

PR #830 introduced the exclude_nodes parameter to SlurmSystem.get_nodes_by_spec() and wired it through the Megatron-Bridge command generation strategy. However, several other workloads and infrastructure components call get_nodes_by_spec() directly without forwarding test_run.exclude_nodes, so the exclusion has no effect for those paths.

Affected call sites

The following production code locations call get_nodes_by_spec() without passing exclude_nodes:

| File | Line | Caller context |
| --- | --- | --- |
| src/cloudai/workloads/nemo_run/slurm_command_gen_strategy.py | 126 | NeMo-Run command gen |
| src/cloudai/workloads/nemo_launcher/slurm_command_gen_strategy.py | 40 | NeMo-Launcher command gen |
| src/cloudai/workloads/triton_inference/slurm_command_gen_strategy.py | 86 | Triton Inference (_get_server_client_split) |
| src/cloudai/workloads/deepep/slurm_command_gen_strategy.py | 90 | DeepEP command gen |
| src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py | 206 | AI Dynamo command gen |
| src/cloudai/systems/slurm/slurm_command_gen_strategy.py | 325 | Base Slurm strategy (_enable_vboost_cmd) |
| src/cloudai/systems/slurm/single_sbatch_runner.py | 95, 118 | Single-sbatch runner (node list aggregation and per-test-run node resolution) |

Required changes

For each call site above:

  1. Parse test_run.exclude_nodes (a comma-separated string) into a set[str] using parse_node_list (already available in slurm_system.py), consistent with how SlurmCommandGenStrategy does it in PR #830.
  2. Pass the resulting set as exclude_nodes=... to get_nodes_by_spec().
  3. For single_sbatch_runner.py, collect and union all exclude_nodes values across the test runs being batched.
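
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `FakeTestRun`, `parse_node_list_stub`, `excluded_for`, and `excluded_for_batch` are hypothetical names, and the stub only splits on commas, whereas the real `parse_node_list` in slurm_system.py may handle more node-list syntax.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeTestRun:
    """Hypothetical stand-in for a test run; only the field this issue discusses."""

    name: str
    exclude_nodes: Optional[str] = None  # comma-separated string, per the issue


def parse_node_list_stub(node_list: str) -> list[str]:
    """Simplified stand-in for parse_node_list: splits on commas only (assumption)."""
    return [n.strip() for n in node_list.split(",") if n.strip()]


def excluded_for(test_run: FakeTestRun) -> set[str]:
    """Step 1: turn one test run's exclude_nodes string into a set of node names."""
    if not test_run.exclude_nodes:
        return set()
    return set(parse_node_list_stub(test_run.exclude_nodes))


def excluded_for_batch(test_runs: Iterable[FakeTestRun]) -> set[str]:
    """Step 3: union the exclusions of every test run batched together."""
    excluded: set[str] = set()
    for test_run in test_runs:
        excluded |= excluded_for(test_run)
    return excluded


# Step 2 would then pass the resulting set as exclude_nodes=... to
# get_nodes_by_spec() at each call site; the other arguments are unchanged.
```

For a per-test-run call site the set from `excluded_for(test_run)` would be forwarded; for the single-sbatch runner's aggregated node list, the set from `excluded_for_batch(...)` would be used instead.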

Priority

Low. This is a follow-up to PR #830; existing behaviour (no exclusion) is preserved until these sites are updated.
