Skip to content

Fix: Crash on attempted checkpoint save when training loop is too fast #567#600

Open
kf-rahman wants to merge 3 commits intosdatkinson:mainfrom
kf-rahman:main
Open

Fix: Crash on attempted checkpoint save when training loop is too fast #567#600
kf-rahman wants to merge 3 commits intosdatkinson:mainfrom
kf-rahman:main

Conversation

@kf-rahman
Copy link
Copy Markdown

Summary

Problem: back-to-back .ckpt then .nam writes could intermittently fail on fast loops.

Solution: bounded try, excepts around both writes.

Tests: added tight-loop save-every-step test; deterministic monkeypatch test to simulate flaky writes; both pass post-fix.

Scope: no behavior change except resilience; logs a warning on retries.

Changes

  • Added retry wrapper around super()._save_checkpoint() and .export()
  • Added tight-loop stress test and deterministic flaky write test
  • Confirmed tests pass and training completes without crash

Testing

  • pytest tests/test_save_checkpoint_fail.py → 2 passed
  • Simulated transient I/O via monkeypatch → retried and succeeded

sdatkinson
sdatkinson previously approved these changes Nov 27, 2025
Copy link
Copy Markdown
Owner

@sdatkinson sdatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR!

There are a few suggestions I'd love for you to address if possible.

Either way, it's a step forward so if you don't address them in, say, a week, then I'll merge as-is.

@kf-rahman
Copy link
Copy Markdown
Author

Hey, addressed few of the changes. Sorry couldnt get to it earlier. Thanks for the review :)

@kf-rahman
Copy link
Copy Markdown
Author

kf-rahman commented Feb 22, 2026

Hey was there anything elses needed for this or is it good to go? @sdatkinson

Copy link
Copy Markdown
Owner

@sdatkinson sdatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits. I can grab them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants