A GPipe implementation in PyTorch
An I/O benchmark for deep learning applications
Very-Low Overhead Checkpointing System
Extending DOLFINx with checkpointing functionality
Keras wrapper that autosaves what ModelCheckpoint cannot.
Sayiir: a simple, embeddable durable workflow engine in Rust, with Node.js/Python bindings. Checkpoint-based recovery, no deterministic replay. A simplified alternative to Temporal, Restate, and Airflow.
Git-like branching, checkpointing, and comparison for AI agent execution paths. pip install agentgit
A Python package for checkpointing, saving, and loading objects.
A Python package for performing memory-intensive computations in parallel using chunks and checkpointing.
A lightweight checkpointing program written in C.
Code and tutorial on integrating wandb sweeps with Slurm pre-emption
This Flink project consumes streams from an Azure Event Hub and produces to a different Event Hub. It also includes the config files for deploying it on Kubernetes.
Hangman Game Word Predictor (Character-level attention)
A standalone Flink producer used for testing the contents of the flink-consume-produce-ek repo.
A shared library to help test your code with failure-injection
Robust distributed checkpointing and job management system for multi-GPU SLURM workloads
Execution runtime for intelligent agents with event-sourced, recoverable task orchestration.
Work in progress