
Checkpointing State Saving and Loading (NGWPC-10159)#167

Draft
idtodd wants to merge 7 commits into development from idt-save-state-checkpointing


@idtodd idtodd commented Apr 6, 2026

Add checkpointing options for the NGEN simulation. These are controlled by the state_saving property of the realization config JSON. When checkpoint saving is enabled and a checkpoint step is reached, a new subfolder named after the checkpoint is created and the model states are written to it. After the states have been saved successfully, the folders of any prior checkpoints are deleted. When loading checkpoints, NGEN walks the subfolders of the checkpointing path in reverse numeric order and loads from the first folder that contains a complete, loadable state.

To save checkpoints, add an item like the following to the state_saving array:

{
  "direction": "save",
  "label": "...", // used only for reporting
  "path": "...", // path to the root folder states will be saved to. A new subfolder will be created for each checkpoint
  "type": "FilePerUnit",
  "when": "Checkpoint",
  "frequency": 1000 // integer for how many steps run between each checkpoint made
}
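The save behaviour described above (write a new subfolder, then delete older ones) can be sketched roughly as follows. This is an illustrative Python sketch, not the NGEN implementation: the function names, the `.state` file extension, the use of the step number as the subfolder name, and the `step % frequency == 0` trigger are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def save_checkpoint(root: Path, step: int, unit_states: Dict[str, bytes]) -> Path:
    """Write one file per unit into root/<step>, then drop older checkpoint folders."""
    target = root / str(step)  # assumption: the subfolder is named after the step
    target.mkdir(parents=True, exist_ok=True)
    for unit_id, payload in unit_states.items():
        (target / f"{unit_id}.state").write_bytes(payload)
    # Prior checkpoints are removed only after every state was written,
    # so a failed save leaves the previous checkpoint in place.
    for other in root.iterdir():
        if other.is_dir() and other != target:
            shutil.rmtree(other)
    return target


def run(root: Path, total_steps: int, frequency: int) -> List[int]:
    """Hypothetical time loop that checkpoints every `frequency` steps."""
    saved = []
    for step in range(1, total_steps + 1):
        # (the model update for this step would happen here)
        if step % frequency == 0:
            save_checkpoint(root, step, {"cat-1": b"a", "cat-2": b"b"})
            saved.append(step)
    return saved


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    saved_steps = run(root, total_steps=2500, frequency=1000)
    remaining = sorted(p.name for p in root.iterdir())
```

With `frequency` set to 1000 and 2500 steps, this sketch writes checkpoints at steps 1000 and 2000 and keeps only the most recent folder on disk.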

To load checkpoints, add an item like the following to the state_saving array:

{
  "direction": "load",
  "label": "...", // used only for reporting
  "path": "...", // path to the root folder checkpoint saves were saved to. The last subfolder with all required states will be selected when loading
  "type": "FilePerUnit",
  "when": "Checkpoint"
}
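The folder selection on load (reverse numeric order, first complete folder wins) can be sketched as below. This is a minimal Python illustration under assumptions: `find_latest_complete`, the `.state` file naming, and the definition of "complete" as "one file present per required unit" are hypothetical stand-ins for whatever NGEN actually checks.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def find_latest_complete(root: Path, required_units: List[str]) -> Optional[Path]:
    """Return the highest-numbered subfolder holding a state file for every unit."""
    numbered = [p for p in root.iterdir() if p.is_dir() and p.name.isdigit()]
    # Sort by integer value, not by string, so "2000" ranks above "900".
    for folder in sorted(numbered, key=lambda p: int(p.name), reverse=True):
        if all((folder / f"{unit}.state").is_file() for unit in required_units):
            return folder
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate an interrupted save: "2000" is missing the state for cat-2.
    layout = {"900": ["cat-1", "cat-2"], "1000": ["cat-1", "cat-2"], "2000": ["cat-1"]}
    for name, units in layout.items():
        (root / name).mkdir()
        for unit in units:
            (root / name / f"{unit}.state").write_bytes(b"x")
    chosen = find_latest_complete(root, ["cat-1", "cat-2"])
    chosen_name = chosen.name if chosen else None
```

Because older folders are deleted only after a successful save, an incomplete newest folder next to a complete older one is the case this fallback is meant to handle.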

Additions

  • Parsing for checkpointing save and load configurations.
  • Methods on NgenSimulation and BMI interfaces for loading a checkpointing state. The main difference between the checkpointing and hot start methods is that checkpointing has no additional step to reset the BMI's internal time after loading a state.
  • Methods on NgenSimulation and BMI interfaces for saving a checkpointing state. The main difference between the checkpointing and end-of-run methods is that additional data regarding the time and output state of NgenSimulation and Layer objects is saved.
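The distinction between the two load paths can be illustrated with a toy model. This Python sketch is purely hypothetical: `ToyModel` and its methods are not NGEN or BMI APIs, and it only mirrors the statement above that a hot start resets the model's internal time after loading while a checkpoint load does not.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyModel:
    """Stand-in for a model with serializable state and an internal clock."""
    storage: float = 0.0
    current_time: float = 0.0

    def update(self, dt: float) -> None:
        self.storage += 1.0
        self.current_time += dt

    def serialize(self) -> Dict[str, float]:
        return {"storage": self.storage, "current_time": self.current_time}

    def load_checkpoint(self, saved: Dict[str, float]) -> None:
        # Checkpoint load: restore everything, including the clock,
        # so the run resumes where it stopped.
        self.storage = saved["storage"]
        self.current_time = saved["current_time"]

    def load_hot_start(self, saved: Dict[str, float]) -> None:
        # Hot start: restore the state, then reset the clock so a new
        # simulation period begins at time zero.
        self.storage = saved["storage"]
        self.current_time = 0.0


original = ToyModel()
for _ in range(3):
    original.update(3600.0)
saved = original.serialize()

resumed = ToyModel()
resumed.load_checkpoint(saved)

hot = ToyModel()
hot.load_hot_start(saved)
```

After three hourly steps, `resumed` keeps the saved clock value while `hot` carries the same storage with its clock back at zero.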

Removals

Changes

  • Some generic save and load methods were renamed to specify they're used with checkpointing. Because checkpointing requires extra data regarding the current time of the simulation, separate state saving and loading methods will be maintained to determine what state needs to be preserved and whether additional processing needs to happen after a state is restored.

Testing

Screenshots

Notes

  • The current implementation does not preserve the catchment outputs (CSVs). There is ongoing discussion about changing the catchment outputs (primarily converting them to NetCDF and writing periodic outputs instead of one lump output at the end), so implementing preservation now seemed likely to be wasted effort in the near future.

Todos

  • To account for delayed MPI messages from remote nexuses, an MPI_Barrier is run before saving a checkpoint. This could slow down the program, so an alternative approach would be to also store MPI messages that have not been received so they can be resent. Properly coordinating which messages are sent and received on checkpoint save and load would likely be a very large effort, so we would probably only want to do this if we notice a significant performance impact from the MPI_Barrier use in checkpointing.

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist (automated report can be put here)

Target Environment support

  • Linux

