-
Notifications
You must be signed in to change notification settings - Fork 36
Open
Description
Hosted RL behavior on two clusters
- Runs on
q8ddnllz5o5dq31ii63z25o9repeatedly fail at step3with:Step 3 produced 0 training samplesRuntimeError: Step 3 failed after 5 consecutive empty batches
- Verified examples on
q8ddnllz5o5dq31ii63z25o9:gqqlebczerz80saax5o950wgzvjqfiiug41ahk4m4tvw3013mxjbesdb0hekj4c2sqo8l4t4srykllg19vpw22rea0bvbt1b
- Comparison runs on
nkcrlx7jgflan7pazeluxn67complete successfully:zcwzgyxqoclerrzn1326070euldp1uwkodle6wlejbk2j63pf3zumfwflmlvpje6w8kllto7ny8ercq2q2s0cy1onu65tzr2
Placement behavior observed
- Repeated fresh launches of the same config family were heavily skewed to
q8ddnllz5o5dq31ii63z25o9. - Six consecutive fresh launches all landed on
q8ddnllz5o5dq31ii63z25o9and were stopped immediately:chqm1zsgr4c0t6bkh17lz7kwwnf7ei2w7zsxbtsnrblouiafk2rpl2elmmqqude53fg3mwmeg4cwszonx8pl37n74n1hml9pq74752yizltk1yoorzb163j9h0emwegnuqcx7l6893th5xg3
- A subsequent retry loop then hit
q8ddnllz5o5dq31ii63z25o939consecutive times before attempt40finally landed onnkcrlx7jgflan7pazeluxn67:rgjaidws2qp23t2yvqigv6tm
Public user-visible controls checked
- Prime CLI
0.5.37does not expose a user-facing hosted RL placement control inprime rl -horprime rl run -h. - Public
primesource includescluster_namein RL config plumbing, but the command definition marks itAdmin-only: target a specific cluster by name.
Questions
- Is there a supported way for hosted RL users to avoid a specific cluster after repeated failures on that cluster?
- Is
cluster_namestaff/internal only, or can it be enabled for non-admin hosted RL users? - Is the repeated assignment to
q8ddnllz5o5dq31ii63z25o9expected scheduler behavior, account affinity, or a known routing issue? - Is
q8ddnllz5o5dq31ii63z25o9known to have a runtime problem related to repeated step-3 empty batches, or has it been remediated/quarantined?
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels