Deterministic OCR pipeline in Rust with Tokio + Rayon orchestration and Tesseract execution.
The project explores a strongly decoupled processing architecture where OCR acts only as an input generator for a deterministic pipeline.
```mermaid
flowchart TD
    A[main.rs<br>Bootstrap Orchestrator]
    A --> B[state.rs<br>Application State]
    A --> C[task.rs<br>Async Orchestration]
    A --> D[reducer.rs<br>Deterministic Reducer]
    A --> E[observation.rs<br>Observation Model]
    A --> F[state_bridge.rs<br>Persistence Boundary]
    C --> G[OCR Pipeline]
    G --> H[Tokio spawn_blocking + Rayon par_iter]
    H --> D
    D --> F
    F --> I[data/observations.jsonl]
```
The architecture separates:
- orchestration
- deterministic logic
- observation
- persistence
This allows the core logic to remain deterministic and replayable.
- **Reducer purity**: the reducer must remain pure, deterministic, and free of side effects.
- **External metadata injection**: timestamps, ids, and other non-deterministic metadata must come from the runtime layer.
- **Persistence boundary isolation**: storage concerns must stay outside the core and pass only through the bridge.
- **Derived event model**: events must be derived from reducer results, not emitted as computation side effects.
- **First-class observability**: observations are part of the system contract and must remain structured and auditable.
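The first two principles can be sketched in a few lines of Rust. All names below (`Decision`, `Observation`, `reduce`) are hypothetical illustrations, not the project's real types: the reducer returns a plain value, and the runtime layer attaches non-deterministic metadata afterwards.

```rust
// Hypothetical sketch types, not the project's real modules.
#[derive(Debug, PartialEq)]
struct Decision {
    winning_line: String,
}

// The runtime layer attaches non-deterministic metadata AFTER the reducer
// has produced its value; the reducer itself never sees clocks or ids.
#[allow(dead_code)]
struct Observation {
    decision: Decision,
    timestamp_ms: u64, // injected by the runtime layer
    id: u64,           // injected by the runtime layer
}

// Pure reducer: the same candidate lines always yield the same Decision.
fn reduce(candidates: &[&str]) -> Decision {
    let winner = candidates
        .iter()
        // Deterministic tie-break: longest line wins; among equal lengths,
        // the lexicographically smallest line wins.
        .max_by(|a, b| a.len().cmp(&b.len()).then(b.cmp(a)))
        .unwrap_or(&"");
    Decision { winning_line: winner.to_string() }
}

fn main() {
    let decision = reduce(&["hello wor1d", "hello world"]);
    // Replayable: re-running the reducer reproduces the same decision.
    assert_eq!(decision, reduce(&["hello wor1d", "hello world"]));
    let obs = Observation { decision, timestamp_ms: 0, id: 1 };
    println!("{}", obs.decision.winning_line);
}
```

Because `reduce` touches no clock, no RNG, and no I/O, replaying the same inputs reproduces every decision byte-for-byte.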
See the full decision log: `docs/DECISIONS.md`.
```text
src/
  main.rs          # Bootstrap and runtime startup
  state.rs         # Application state (paths, dirs, OCR language)
  task.rs          # Async orchestration (Tokio + Rayon boundary)
  reducer.rs       # Deterministic merge/vote logic
  observation.rs   # Observation model and validation rules
  state_bridge.rs  # Persistence boundary (JSONL today)
  ocrys/
    mod.rs         # OCR facade
    tesseract.rs   # Tesseract CLI integration
    normalize.rs   # (legacy/experimental normalization)
    types.rs       # OCRDocument / OCRPage / OCRLine
scripts/
  setup_data.sh         # Prepare data directories and ownership
  process_all.sh        # Run all images via Docker image
  process_all_local.sh  # Run all images locally via cargo
```
- `main.rs` loads `AppState` and parses input document paths.
- `task.rs` schedules one async task per document using `JoinSet`.
- Each document crosses into CPU workers using `spawn_blocking`.
- Rayon executes OCR variants in parallel: `original`, `high_contrast`, `rotated`.
- `reducer.rs` merges variant outputs deterministically using fuzzy clustering:
  - line alignment by position (line index as positional proxy)
  - similarity scoring via `strsim::jaro_winkler`
  - stable winner selection by cluster size
  - deterministic tie-break rules
- The final observation is appended by `state_bridge.rs` to `data/observations.jsonl`.
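The variant fan-out can be sketched without external crates. Here `std::thread` stands in for the real `tokio::task::spawn_blocking` + Rayon `par_iter` boundary so the example compiles on its own; `run_variants` and its string outputs are illustrative, not the project's real API.

```rust
use std::thread;

// Dependency-free sketch of the fan-out: in the real pipeline this happens
// inside tokio::task::spawn_blocking with Rayon's par_iter; plain threads
// stand in here. One worker runs per OCR variant of the document.
fn run_variants(doc: &str) -> Vec<String> {
    let variants = ["original", "high_contrast", "rotated"];
    let handles: Vec<_> = variants
        .iter()
        .map(|v| {
            let doc = doc.to_string();
            let v = v.to_string();
            // One CPU-bound worker per OCR variant, like one Rayon task.
            thread::spawn(move || format!("{doc}:{v}"))
        })
        .collect();
    // Joining in spawn order preserves variant order, which keeps the
    // downstream merge deterministic regardless of thread scheduling.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let outputs = run_variants("fixtures/sample.png");
    assert_eq!(outputs.len(), 3);
    println!("{outputs:?}");
}
```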
```mermaid
flowchart TD
    A[OCR Variant Output<br>original / contrast / rotated]
    A --> B[Line Extraction]
    B --> C[Positional Alignment<br>group by line index]
    C --> D[Fuzzy Similarity Check<br>Jaro-Winkler]
    D --> E[Cluster Formation]
    E --> F[Cluster Size Ranking]
    F --> G[Tie-break Rules<br>deterministic]
    G --> H[Winning Text Line]
    H --> I[Final Merged OCR Output]
    I --> J[Observation Record]
    J --> K[State Bridge]
    K --> L[data/observations.jsonl]
```
The reducer merges OCR variants deterministically.
Processing steps:
- Extract lines from each OCR variant.
- Align candidate lines by position (current implementation: line index).
- Compare textual similarity using Jaro-Winkler.
- Build clusters of similar lines across OCR variants.
- Rank clusters by size.
- Select the winning line using deterministic tie-break rules.
- Produce a final merged output.
- Emit an observation record describing the decision.
Input: `OCRVariant[]`
Output: `DeterministicMergedOCR`
Guarantees:
- deterministic output
- stable tie-break rules
- replayable decisions
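The steps above can be sketched as a dependency-free Rust program. A trivial byte-overlap score stands in for `strsim::jaro_winkler` (an external crate), and every name (`merge_line`, `merge_variants`, the 0.9 threshold) is an assumption for illustration, not the project's real implementation.

```rust
// Stand-in similarity metric (NOT Jaro-Winkler): fraction of positions
// whose bytes match. Kept here only so the sketch is self-contained.
fn similarity(a: &str, b: &str) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    let matches = a.bytes().zip(b.bytes()).filter(|(x, y)| x == y).count();
    matches as f64 / a.len().max(b.len()) as f64
}

// Merge one aligned group of candidate lines across variants.
fn merge_line(candidates: &[&str]) -> String {
    let threshold = 0.9; // assumed threshold, for illustration only
    candidates
        .iter()
        .map(|c| {
            // Cluster size = candidates (including self) above the threshold.
            let size = candidates
                .iter()
                .filter(|o| similarity(c, o) >= threshold)
                .count();
            (size, *c)
        })
        // Largest cluster wins; on equal size, the lexicographically
        // smallest line wins (deterministic tie-break).
        .max_by(|(sa, ca), (sb, cb)| sa.cmp(sb).then(cb.cmp(ca)))
        .map(|(_, c)| c.to_string())
        .unwrap_or_default()
}

// Align variants by line index (the current positional proxy), then merge.
fn merge_variants(variants: &[Vec<&str>]) -> Vec<String> {
    let n = variants.iter().map(|v| v.len()).max().unwrap_or(0);
    (0..n)
        .map(|i| {
            let candidates: Vec<&str> =
                variants.iter().filter_map(|v| v.get(i).copied()).collect();
            merge_line(&candidates)
        })
        .collect()
}

fn main() {
    let variants = vec![
        vec!["hello world", "ocr test"],
        vec!["hello world", "ocr te5t"],
        vec!["he11o world", "ocr test"],
    ];
    let merged = merge_variants(&variants);
    // Two of three variants agree on each line, so the majority reading wins.
    assert_eq!(merged, vec!["hello world", "ocr test"]);
    println!("{merged:?}");
}
```

Because ranking and tie-breaking depend only on the candidate strings, the same variant outputs always reproduce the same merged document.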
Core stack:
- Language / Runtime: Rust + Tokio async runtime
- OCR Engine: Tesseract CLI
- Parallel Compute: Rayon
- Serialization: serde + serde_json
- String Similarity: strsim (Jaro-Winkler)
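The stack above maps to a dependency set along these lines. Versions and feature flags here are assumptions for illustration, not copied from the project's `Cargo.toml`:

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }
rayon = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
strsim = "0.11"
```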
Requires a local Tesseract installation. Run a single file:

```sh
cargo run -- fixtures/sample.png
```
Batch process a folder:

```sh
./scripts/process_all_local.sh fixtures
```
Build image:

```sh
docker build -t optimo:latest .
```

Run one file:

```sh
mkdir -p data
docker run --rm \
  -v "$(pwd)/fixtures:/app/fixtures:ro" \
  -v "$(pwd)/data:/app/data" \
  optimo:latest /app/fixtures/sample.png
```

Run all images in a folder:

```sh
./scripts/process_all.sh fixtures
```
`data/observations.jsonl`: append-only decision records (one JSON object per line).
Artifacts generated during runs: `data/ocrys/latest/`
Example record:

```json
{"decision":"ocr_converged","lines":3,"preview":"hello world ocr test 2024 optimo pipeline ","source":"/app/fixtures/sample.png"}
```

- Default OCR language in `AppState` is currently `ita`.
- `observation.rs` already defines richer typed observations (`OcrObservation`) for the next persistence phase.
- JSONL is the current persistence backend.
- SQLite is planned and can be introduced behind `state_bridge.rs` without changing reducer or orchestration logic.
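One way that swap could be kept behind the boundary is a small trait that orchestration and reducer code depend on, with the JSONL writer as today's implementation. `StateBridge` and `JsonlBridge` below are hypothetical names sketching the idea, not the project's real API:

```rust
use std::io::Write;

// Hypothetical persistence boundary: callers only see this trait, so a
// SQLite-backed implementation can later replace the JSONL one without
// touching reducer or orchestration logic.
trait StateBridge {
    fn append(&mut self, record: &str) -> std::io::Result<()>;
}

// Current-style backend: append-only JSONL written to any Write sink
// (a file in production, an in-memory buffer in tests).
struct JsonlBridge<W: Write> {
    sink: W,
}

impl<W: Write> StateBridge for JsonlBridge<W> {
    fn append(&mut self, record: &str) -> std::io::Result<()> {
        // One JSON object per line, matching the observations.jsonl format.
        writeln!(self.sink, "{record}")
    }
}

fn main() {
    let mut bridge = JsonlBridge { sink: Vec::new() };
    bridge
        .append(r#"{"decision":"ocr_converged","lines":3}"#)
        .unwrap();
    let out = String::from_utf8(bridge.sink).unwrap();
    assert!(out.ends_with('\n'));
    print!("{out}");
}
```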
OCR is currently used only as a pipeline stress-test and input generator.
The long-term objective is a deterministic document analysis engine where:
- parsing
- validation
- rule evaluation
- structural checks
can run through the same reducer/observation pipeline.
This design allows the system to evolve without modifying the deterministic core.