Deterministic OCR pipeline in Rust with Tokio + Rayon orchestration and Tesseract execution.
The project explores a strongly decoupled processing architecture where OCR acts only as an input generator for a deterministic pipeline.
```mermaid
flowchart TD
    A[main.rs<br>Bootstrap Orchestrator]
    A --> B[state.rs<br>Application State]
    A --> C[task.rs<br>Async Orchestration]
    A --> D[reducer.rs<br>Deterministic Reducer]
    A --> E[observation.rs<br>Observation Model]
    A --> F[state_bridge.rs<br>Persistence Boundary]
    C --> G[OCR Pipeline]
    G --> H[Tokio spawn_blocking + Rayon par_iter]
    H --> D
    D --> F
    F --> I[data/observations.jsonl]
```
The architecture separates:
- orchestration
- deterministic logic
- observation
- persistence
This allows the core logic to remain deterministic and replayable.
- **Reducer purity**: the reducer must remain pure, deterministic, and free of side effects.
- **External metadata injection**: timestamps, ids, and other non-deterministic metadata must come from the runtime layer.
- **Persistence boundary isolation**: storage concerns must stay outside the core and pass only through the bridge.
- **Derived event model**: events must be derived from reducer results, not emitted as computation side effects.
- **First-class observability**: observations are part of the system contract and must remain structured and auditable.
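The first two principles can be sketched in a few lines of Rust. All names below (`Decision`, `Observation`, `reduce`) are hypothetical illustrations, not the project's real types: the reducer returns a plain value, and the runtime layer attaches non-deterministic metadata afterwards.

```rust
// Hypothetical sketch types, not the project's real modules.
#[derive(Debug, PartialEq)]
struct Decision {
    winning_line: String,
}

// The runtime layer attaches non-deterministic metadata AFTER the reducer
// has produced its value; the reducer itself never sees clocks or ids.
#[allow(dead_code)]
struct Observation {
    decision: Decision,
    timestamp_ms: u64, // injected by the runtime layer
    id: u64,           // injected by the runtime layer
}

// Pure reducer: the same candidate lines always yield the same Decision.
fn reduce(candidates: &[&str]) -> Decision {
    let winner = candidates
        .iter()
        // Deterministic tie-break: longest line wins; among equal lengths,
        // the lexicographically smallest line wins.
        .max_by(|a, b| a.len().cmp(&b.len()).then(b.cmp(a)))
        .unwrap_or(&"");
    Decision { winning_line: winner.to_string() }
}

fn main() {
    let decision = reduce(&["hello wor1d", "hello world"]);
    // Replayable: re-running the reducer reproduces the same decision.
    assert_eq!(decision, reduce(&["hello wor1d", "hello world"]));
    let obs = Observation { decision, timestamp_ms: 0, id: 1 };
    println!("{}", obs.decision.winning_line);
}
```

Because `reduce` touches no clock, no RNG, and no I/O, replaying the same inputs reproduces every decision byte-for-byte.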
See the full decision log: `docs/DECISIONS.md`.
```text
src/
  main.rs          # Bootstrap and runtime startup
  state.rs         # Application state (paths, dirs, OCR language)
  task.rs          # Async orchestration (Tokio + Rayon boundary)
  reducer.rs       # Deterministic merge/vote logic
  observation.rs   # Observation model and validation rules
  state_bridge.rs  # Persistence boundary (JSONL today)
  ocrys/
    mod.rs         # OCR facade
    tesseract.rs   # Tesseract CLI integration
    normalize.rs   # (legacy/experimental normalization)
    types.rs       # OCRDocument / OCRPage / OCRLine
scripts/
  setup_data.sh         # Prepare data directories and ownership
  process_all.sh        # Run all images via Docker image
  process_all_local.sh  # Run all images locally via cargo
```
- `main.rs` loads `AppState` and parses input document paths.
- `task.rs` schedules one async task per document using `JoinSet`.
- Each document crosses into CPU workers using `spawn_blocking`.
- Rayon executes OCR variants in parallel: `original`, `high_contrast`, `rotated`.
- `reducer.rs` merges variant outputs deterministically using fuzzy clustering:
  - line alignment by position (line index as positional proxy)
  - similarity scoring via `strsim::jaro_winkler`
  - stable winner selection by cluster size
  - deterministic tie-break rules
- The final observation is appended by `state_bridge.rs` to `data/observations.jsonl`.
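The variant fan-out can be sketched without external crates. Here `std::thread` stands in for the real `tokio::task::spawn_blocking` + Rayon `par_iter` boundary so the example compiles on its own; `run_variants` and its string outputs are illustrative, not the project's real API.

```rust
use std::thread;

// Dependency-free sketch of the fan-out: in the real pipeline this happens
// inside tokio::task::spawn_blocking with Rayon's par_iter; plain threads
// stand in here. One worker runs per OCR variant of the document.
fn run_variants(doc: &str) -> Vec<String> {
    let variants = ["original", "high_contrast", "rotated"];
    let handles: Vec<_> = variants
        .iter()
        .map(|v| {
            let doc = doc.to_string();
            let v = v.to_string();
            // One CPU-bound worker per OCR variant, like one Rayon task.
            thread::spawn(move || format!("{doc}:{v}"))
        })
        .collect();
    // Joining in spawn order preserves variant order, which keeps the
    // downstream merge deterministic regardless of thread scheduling.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let outputs = run_variants("fixtures/sample.png");
    assert_eq!(outputs.len(), 3);
    println!("{outputs:?}");
}
```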
```mermaid
flowchart TD
    A[OCR Variant Output<br>original / contrast / rotated]
    A --> B[Line Extraction]
    B --> C[Positional Alignment<br>group by line index]
    C --> D[Fuzzy Similarity Check<br>Jaro-Winkler]
    D --> E[Cluster Formation]
    E --> F[Cluster Size Ranking]
    F --> G[Tie-break Rules<br>deterministic]
    G --> H[Winning Text Line]
    H --> I[Final Merged OCR Output]
    I --> J[Observation Record]
    J --> K[State Bridge]
    K --> L[data/observations.jsonl]
```
The reducer merges OCR variants deterministically.
Processing steps:
- Extract lines from each OCR variant.
- Align candidate lines by position (current implementation: line index).
- Compare textual similarity using Jaro-Winkler.
- Build clusters of similar lines across OCR variants.
- Rank clusters by size.
- Select the winning line using deterministic tie-break rules.
- Produce a final merged output.
- Emit an observation record describing the decision.
Input: `OCRVariant[]`
Output: `DeterministicMergedOCR`
Guarantees:
- deterministic output
- stable tie-break rules
- replayable decisions
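The steps above can be sketched as a dependency-free Rust program. A trivial byte-overlap score stands in for `strsim::jaro_winkler` (an external crate), and every name (`merge_line`, `merge_variants`, the 0.9 threshold) is an assumption for illustration, not the project's real implementation.

```rust
// Stand-in similarity metric (NOT Jaro-Winkler): fraction of positions
// whose bytes match. Kept here only so the sketch is self-contained.
fn similarity(a: &str, b: &str) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    let matches = a.bytes().zip(b.bytes()).filter(|(x, y)| x == y).count();
    matches as f64 / a.len().max(b.len()) as f64
}

// Merge one aligned group of candidate lines across variants.
fn merge_line(candidates: &[&str]) -> String {
    let threshold = 0.9; // assumed threshold, for illustration only
    candidates
        .iter()
        .map(|c| {
            // Cluster size = candidates (including self) above the threshold.
            let size = candidates
                .iter()
                .filter(|o| similarity(c, o) >= threshold)
                .count();
            (size, *c)
        })
        // Largest cluster wins; on equal size, the lexicographically
        // smallest line wins (deterministic tie-break).
        .max_by(|(sa, ca), (sb, cb)| sa.cmp(sb).then(cb.cmp(ca)))
        .map(|(_, c)| c.to_string())
        .unwrap_or_default()
}

// Align variants by line index (the current positional proxy), then merge.
fn merge_variants(variants: &[Vec<&str>]) -> Vec<String> {
    let n = variants.iter().map(|v| v.len()).max().unwrap_or(0);
    (0..n)
        .map(|i| {
            let candidates: Vec<&str> =
                variants.iter().filter_map(|v| v.get(i).copied()).collect();
            merge_line(&candidates)
        })
        .collect()
}

fn main() {
    let variants = vec![
        vec!["hello world", "ocr test"],
        vec!["hello world", "ocr te5t"],
        vec!["he11o world", "ocr test"],
    ];
    let merged = merge_variants(&variants);
    // Two of three variants agree on each line, so the majority reading wins.
    assert_eq!(merged, vec!["hello world", "ocr test"]);
    println!("{merged:?}");
}
```

Because ranking and tie-breaking depend only on the candidate strings, the same variant outputs always reproduce the same merged document.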
Core stack:
- Language / Runtime: Rust + Tokio async runtime
- OCR Engine: Tesseract CLI
- Parallel Compute: Rayon
- Serialization: serde + serde_json
- String Similarity: strsim (Jaro-Winkler)
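The stack above maps to a dependency set along these lines. Versions and feature flags here are assumptions for illustration, not copied from the project's `Cargo.toml`:

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }
rayon = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
strsim = "0.11"
```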
Requires a local Tesseract installation. Run a single file:

```sh
cargo run -- fixtures/sample.png
```
Batch process a folder:

```sh
./scripts/process_all_local.sh fixtures
```
Build image:

```sh
docker build -t optimo:latest .
```

Run one file:

```sh
mkdir -p data
docker run --rm \
  -v "$(pwd)/fixtures:/app/fixtures:ro" \
  -v "$(pwd)/data:/app/data" \
  optimo:latest /app/fixtures/sample.png
```

Run all images in a folder:

```sh
./scripts/process_all.sh fixtures
```
`data/observations.jsonl`: append-only decision records (one JSON object per line).
Artifacts generated during runs: `data/ocrys/latest/`
Example record:

```json
{"decision":"ocr_converged","lines":3,"preview":"hello world ocr test 2024 optimo pipeline ","source":"/app/fixtures/sample.png"}
```

- Default OCR language in `AppState` is currently `ita`.
- `observation.rs` already defines richer typed observations (`OcrObservation`) for the next persistence phase.
- JSONL is the current persistence backend.
- SQLite is planned and can be introduced behind `state_bridge.rs` without changing reducer or orchestration logic.
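One way that swap could be kept behind the boundary is a small trait that orchestration and reducer code depend on, with the JSONL writer as today's implementation. `StateBridge` and `JsonlBridge` below are hypothetical names sketching the idea, not the project's real API:

```rust
use std::io::Write;

// Hypothetical persistence boundary: callers only see this trait, so a
// SQLite-backed implementation can later replace the JSONL one without
// touching reducer or orchestration logic.
trait StateBridge {
    fn append(&mut self, record: &str) -> std::io::Result<()>;
}

// Current-style backend: append-only JSONL written to any Write sink
// (a file in production, an in-memory buffer in tests).
struct JsonlBridge<W: Write> {
    sink: W,
}

impl<W: Write> StateBridge for JsonlBridge<W> {
    fn append(&mut self, record: &str) -> std::io::Result<()> {
        // One JSON object per line, matching the observations.jsonl format.
        writeln!(self.sink, "{record}")
    }
}

fn main() {
    let mut bridge = JsonlBridge { sink: Vec::new() };
    bridge
        .append(r#"{"decision":"ocr_converged","lines":3}"#)
        .unwrap();
    let out = String::from_utf8(bridge.sink).unwrap();
    assert!(out.ends_with('\n'));
    print!("{out}");
}
```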
OCR is currently used only as a pipeline stress-test and input generator.
The long-term objective is a deterministic document analysis engine where:
- parsing
- validation
- rule evaluation
- structural checks
can run through the same reducer/observation pipeline.
This design allows the system to evolve without modifying the deterministic core.