
Fix SingleMemoryStorageSchedule checkpoint clearing during forward replay #248

Open
sghelichkhani wants to merge 1 commit into dolfin-adjoint:master from sghelichkhani:sghelichkhani/singlemem-checkpoint-clearing
Conversation

@sghelichkhani
Contributor

Summary

Fixes #211.

During forward replay, the checkpoint-clearing logic for SingleMemoryStorageSchedule uses the adjoint_dependencies of the previous timestep to decide whether a checkpoint is safe to discard. However, adjoint_dependencies is only fully populated during the reverse pass (after PR #210), so during forward replay it can be incomplete, and checkpoints that are still needed can be cleared.

This replaces the adjoint_dependencies check with a block variable identity check:

if var.output.block_variable is var:
    var._checkpoint = None

If var.output.block_variable is var, the Function still holds the correct value in memory, so the checkpoint copy is redundant. If it is not var, a later block has taken ownership and will overwrite the Function during replay, so the checkpoint must be kept.
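
To make the two cases concrete, here is a minimal, self-contained sketch of the identity check. The `ToyFunction` and `ToyBlockVariable` classes and the `clear_if_redundant` helper are hypothetical stand-ins written for this illustration; they only mimic the `output`, `block_variable`, and `_checkpoint` attributes used above and are not the library's actual classes or the code changed in this PR.

```python
class ToyBlockVariable:
    """Hypothetical stand-in for a tape block variable."""

    def __init__(self, output):
        self.output = output      # the object this variable belongs to
        self._checkpoint = None   # stored copy of the value, if any


class ToyFunction:
    """Hypothetical stand-in for a Function that tracks its latest block variable."""

    def __init__(self, value):
        self.value = value
        self.block_variable = None

    def create_block_variable(self):
        # Each new write to the function gets a fresh block variable,
        # which replaces the previous one as the "current" owner.
        self.block_variable = ToyBlockVariable(self)
        return self.block_variable


def clear_if_redundant(var):
    """Drop the checkpoint only when `var` is still the current block variable."""
    if var.output.block_variable is var:
        var._checkpoint = None
        return True
    return False


u = ToyFunction(1.0)

# First block writes u and checkpoints the value.
first = u.create_block_variable()
first._checkpoint = 1.0

# A later block writes u again and takes ownership.
u.value = 2.0
second = u.create_block_variable()
second._checkpoint = 2.0

cleared_first = clear_if_redundant(first)    # not the current owner: kept
cleared_second = clear_if_redundant(second)  # current owner: cleared
```

In this sketch the older variable `first` keeps its checkpoint because `u` has since been overwritten, while `second` can drop its copy because `u` still holds that value in memory.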

Test

Tested against the MFE on issue #211.

Development

Successfully merging this pull request may close these issues.

indiscriminately clearing checkpoints with SingleMemoryStorageSchedule corrupts the adjoints
