Clarify hosted RL cluster avoidance or remediation proof for repeated step-3 empty-batch failures #416

@Kbediako

Hosted RL behavior on two clusters

  • Runs on q8ddnllz5o5dq31ii63z25o9 repeatedly fail at step 3 with:
    • Step 3 produced 0 training samples
    • RuntimeError: Step 3 failed after 5 consecutive empty batches
  • Verified examples on q8ddnllz5o5dq31ii63z25o9:
    • gqqlebczerz80saax5o950wg
    • zvjqfiiug41ahk4m4tvw3013
    • mxjbesdb0hekj4c2sqo8l4t4
    • srykllg19vpw22rea0bvbt1b
  • Comparison runs on nkcrlx7jgflan7pazeluxn67 complete successfully:
    • zcwzgyxqoclerrzn1326070e
    • uldp1uwkodle6wlejbk2j63p
    • f3zumfwflmlvpje6w8kllto7
    • ny8ercq2q2s0cy1onu65tzr2

Placement behavior observed

  • Repeated fresh launches of the same config family were heavily skewed to q8ddnllz5o5dq31ii63z25o9.
  • Six consecutive fresh launches all landed on q8ddnllz5o5dq31ii63z25o9 and were stopped immediately:
    • chqm1zsgr4c0t6bkh17lz7kw
    • wnf7ei2w7zsxbtsnrblouiaf
    • k2rpl2elmmqqude53fg3mwme
    • g4cwszonx8pl37n74n1hml9p
    • q74752yizltk1yoorzb163j9
    • h0emwegnuqcx7l6893th5xg3
  • A subsequent retry loop then hit q8ddnllz5o5dq31ii63z25o9 39 consecutive times before attempt 40 finally landed on nkcrlx7jgflan7pazeluxn67:
    • rgjaidws2qp23t2yvqigv6tm
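
The retry loop described above can be sketched roughly as follows. This is a minimal illustration of the launch-check-stop pattern, not the exact script used: `launch_run` and `stop_run` are hypothetical callables standing in for however a run is started and stopped, and they are not Prime CLI or SDK functions. The only assumption about the platform is that the assigned cluster ID is visible once a run has been created.

```python
from typing import Callable, List, Optional, Tuple

# Cluster ID taken from this report; runs landing here are stopped and retried.
BAD_CLUSTER = "q8ddnllz5o5dq31ii63z25o9"


def launch_until_off_cluster(
    launch_run: Callable[[], Tuple[str, str]],  # hypothetical: returns (run_id, cluster_id)
    stop_run: Callable[[str], None],            # hypothetical: stops a run by ID
    avoid: str = BAD_CLUSTER,
    max_attempts: int = 50,
) -> Tuple[Optional[str], List[str]]:
    """Launch fresh runs until one lands on a cluster other than `avoid`.

    Returns (kept_run_id, stopped_run_ids). kept_run_id is None if every
    attempt up to max_attempts landed on the avoided cluster.
    """
    stopped: List[str] = []
    for _ in range(max_attempts):
        run_id, cluster_id = launch_run()
        if cluster_id != avoid:
            # Placement succeeded elsewhere: keep this run.
            return run_id, stopped
        # Landed on the avoided cluster: stop immediately and try again.
        stop_run(run_id)
        stopped.append(run_id)
    return None, stopped
```

A loop like this is only a workaround: each attempt creates and discards a run, which is why a supported placement or avoidance control would be preferable.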

Public user-visible controls checked

  • Prime CLI 0.5.37 does not expose a user-facing hosted RL placement control in prime rl -h or prime rl run -h.
  • The public prime source includes cluster_name in the RL config plumbing, but the command definition describes it as "Admin-only: target a specific cluster by name."

Questions

  1. Is there a supported way for hosted RL users to avoid a specific cluster after repeated failures on that cluster?
  2. Is cluster_name staff/internal only, or can it be enabled for non-admin hosted RL users?
  3. Is the repeated assignment to q8ddnllz5o5dq31ii63z25o9 expected scheduler behavior, account affinity, or a known routing issue?
  4. Is q8ddnllz5o5dq31ii63z25o9 known to have a runtime problem related to repeated step-3 empty batches, or has it been remediated/quarantined?
