A deterministic, hash-based backup engine written in Python that performs incremental backups, duplicate detection, versioned history, and retention cleanup with full logging and crash-safe state management.
This project focuses on correctness, safety, and auditability, not UI shortcuts or automation gimmicks.
- Only new or modified files are processed
- Uses content hashing (not timestamps)
- Safe across restarts
- Deterministic decision logic
- Filesystem event monitoring using `watchdog`
- Processes changes incrementally
- Debounced to prevent duplicate processing
- Thread-safe under concurrent file activity
- Detects duplicates by content hash, not filename
- Copies duplicates into a `_duplicates/` folder
- Original files are never deleted automatically
- Deterministic behavior across runs
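Grouping files by content hash is enough to find duplicates independent of filenames. A small sketch (hypothetical helper, assuming a `path → digest` mapping has already been computed):

```python
def find_duplicates(hashes):
    """Group paths by content hash; return {digest: [paths]} for groups > 1."""
    by_hash = {}
    for path, digest in hashes.items():
        by_hash.setdefault(digest, []).append(path)
    # Sorting the paths keeps the output deterministic across runs.
    return {d: sorted(p) for d, p in by_hash.items() if len(p) > 1}
```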
- Modified files are never overwritten
- Previous versions are archived automatically
- Latest version always exists at the top level
- Historical versions stored in `versions/`
When a source file is deleted:
- The corresponding backup file is moved to `_deleted/`
- The file is removed from persistent state
- No data is permanently deleted automatically
- All actions are logged
Example:
```
backup/
├── file.txt
├── _deleted/
│   └── file.txt
```
This ensures:
- No accidental data loss
- Full audit trail
- Deterministic lifecycle behavior
- Keeps only the most recent N versions per file
- Old versions are deleted deterministically
- All deletions are logged
- All actions are logged
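Retention is a deterministic prune of the oldest versions beyond the configured limit. A minimal sketch (hypothetical helper, not the project's `retention.py` itself):

```python
from pathlib import Path

def prune_versions(version_dir, max_versions):
    """Delete the oldest vN.bak files beyond the retention limit."""
    files = sorted(Path(version_dir).glob("v*.bak"),
                   key=lambda p: int(p.stem[1:]))  # numeric order: v2 < v10
    doomed = files[:-max_versions] if max_versions > 0 else files
    for f in doomed:
        f.unlink()
    return [f.name for f in doomed]  # return names so callers can log them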
The system automatically ignores filesystem events originating from the backup directory.
This prevents:
- Recursive self-backup loops
- Infinite nested backup growth
- Disk exhaustion due to misconfiguration
Even if backup_root is placed inside a monitored source directory, the system remains stable.
- Engine-level locking prevents race conditions
- Concurrent file events handled safely
- State updates are atomic
- State writes use temporary file + atomic replace
- Windows-safe retry logic prevents file-lock failures
- Protects against JSON corruption
- Safe under forced termination
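The temp-file-plus-atomic-replace pattern guarantees the state file is always either the old complete version or the new complete version, never a partial write. A sketch with the Windows retry loop (hypothetical helper, assuming JSON state as in `state.json`):

```python
import json
import os
import tempfile
import time

def save_state(state, path, retries=5):
    """Write JSON to a temp file, fsync it, then atomically replace `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        for attempt in range(retries):
            try:
                os.replace(tmp, path)  # atomic on both POSIX and Windows
                return
            except PermissionError:    # transient Windows file lock
                time.sleep(0.1 * (attempt + 1))
        os.replace(tmp, path)          # final attempt; let errors surface
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)             # never leave a stray temp file
```

The temp file lives in the same directory as the target so the rename never crosses filesystems.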
```
backup_tool/
├── src/
│   ├── __init__.py
│   ├── main.py        # Entry point
│   ├── engine.py      # Orchestration layer
│   ├── config.py      # Config loading & validation
│   ├── state.py       # Atomic state persistence
│   ├── scanner.py     # Filesystem scanning
│   ├── hasher.py      # Hashing logic
│   ├── planner.py     # Decision engine
│   ├── executor.py    # Safe file operations
│   ├── retention.py   # Retention rules
│   ├── watcher.py     # Event-driven monitoring
│   └── logger.py      # Logging
├── config.json
├── logs/
├── testdata/
│   ├── source/
│   └── backup/
├── tests/
│   └── test_planner.py
└── README.md
```
- Monitor source directories (event-driven)
- Hash modified or new files
- Compare against stored state
- Plan actions:
  - `backup` – new file
  - `modified` – content changed
  - `duplicate` – same content, different path
  - `skip` – unchanged
- Execute file operations safely
- Archive previous versions
- Archive deleted files into `_deleted/`
- Apply retention rules
- Persist updated state atomically
The system is deterministic and safe across restarts.
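The four-way decision above can be sketched as a pure function, which is what makes it easy to test (hypothetical signature, assuming `state` maps path to last-seen digest and `seen_hashes` accumulates digests planned this run):

```python
def plan_action(path, digest, state, seen_hashes):
    """Classify one file as backup, modified, duplicate, or skip."""
    if path in state:
        # Known path: compare stored hash to decide skip vs. modified.
        return "skip" if state[path] == digest else "modified"
    if digest in seen_hashes:
        return "duplicate"  # same content already seen under another path
    seen_hashes.add(digest)
    return "backup"
```

Because the function depends only on its inputs, the same scan always produces the same plan.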
```json
{
  "sources": ["./testdata/source"],
  "backup_root": "./testdata/backup",
  "log_dir": "./logs",
  "hash_algorithm": "sha256",
  "retention": {
    "max_versions_per_file": 10
  }
}
```

| Key | Description |
|---|---|
| `sources` | Directories to back up |
| `backup_root` | Backup destination |
| `log_dir` | Log output directory |
| `hash_algorithm` | Hash algorithm (recommended: sha256) |
| `retention.max_versions_per_file` | Versions to keep per file |
```bash
python -m src.main
```
Behavior:
- First run copies all files
- Subsequent runs finish quickly
- Modified files create versions
- Retention is enforced automatically
- State updated atomically
When monitoring is enabled, the tool:
- Watches source folders continuously
- Reacts to file creation/modification/deletion
- Archives deleted files safely
- Maintains deterministic state
```
backup/
├── one.txt
├── two.txt
├── state.json
├── _duplicates/
│   └── one_copy.txt
├── _deleted/
│   └── old_file.txt
└── versions/
    └── one.txt/
        ├── v3.bak
        ├── v4.bak
        ├── v5.bak
        ├── v6.bak
        └── v7.bak
```
Only the most recent N versions are retained; deleted files are archived in `_deleted/`.
Each run creates a timestamped log file in logs/.
Example:
```
PLANNER → BACKUP: source/one.txt
BACKUP: source/one.txt → backup/one.txt
ARCHIVE: backup/one.txt → versions/one.txt/v3.bak
UPDATED: source/one.txt → backup/one.txt
ARCHIVE DELETE: backup/old.txt → backup/_deleted/old.txt
STATE REMOVE: source/old.txt
RETENTION delete: versions/one.txt/v1.bak
Run complete
```
All actions are logged deterministically.
After installing dependencies and verifying the project runs, you can create a local test environment to observe incremental behavior, versioning, duplicates, and logs.
From the project root:
```bash
mkdir -p testdata/source
mkdir -p testdata/backup
```
Your structure should now include:
```
testdata/
├── source/
└── backup/
```
Create some files in the source directory:
```bash
echo "hello world" > testdata/source/one.txt
echo "another file" > testdata/source/two.txt
echo "hello world" > testdata/source/one_copy.txt
```
This setup intentionally includes:
- A duplicate file (one_copy.txt)
- Multiple distinct files
```bash
python -m src.main
```
Expected behavior:
- All files are copied
- Duplicates are detected by hash
- State file is created
- Log file is written
Example log output:
```
BACKUP: source/one.txt → backup/one.txt
BACKUP: source/two.txt → backup/two.txt
DUPLICATE: source/one_copy.txt → backup/_duplicates/one_copy.txt
Run complete
```
Edit an existing file:
```bash
echo "hello world v2" >> testdata/source/one.txt
```
Add a new file:
```bash
echo "new file" > testdata/source/three.txt
```
Run again:
```bash
python -m src.main
```
Expected behavior:
- Modified files are archived, not overwritten
- New files are backed up
- Unchanged files are skipped
Example log output:
```
ARCHIVE: backup/one.txt → versions/one.txt/v1.bak
UPDATED: source/one.txt → backup/one.txt
BACKUP: source/three.txt → backup/three.txt
SKIP: source/two.txt
Run complete
```
If you repeatedly modify the same file:
```bash
echo "change" >> testdata/source/one.txt
python -m src.main
```
Once the number of versions exceeds the configured limit:
```
RETENTION delete: versions/one.txt/v1.bak
```
Check:
- `testdata/backup/` → current files
- `testdata/backup/versions/` → version history
- `testdata/backup/_duplicates/` → detected duplicates
- `logs/` → full audit trail
What this verifies:
- Incremental behavior
- Hash-based duplicate detection
- Versioned backups
- Retention enforcement
- Deterministic logging