socr

Multi-engine document OCR with cascading fallback and quality audit.

socr orchestrates multiple OCR engines: it calls each one as a CLI subprocess, audits the output quality, and falls back to a different engine when the results are poor. Each engine is a standalone CLI tool (gemini-ocr, deepseek-ocr, marker-ocr, etc.) that can also be used independently.

Install

pip install socr

# With specific engine backends
pip install socr[gemini]          # Google Gemini (cloud)
pip install socr[local]           # DeepSeek + Nougat (local/free)
pip install socr[all]             # All engines

Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.

Usage

# Process a PDF
socr paper.pdf

# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker

# Save extracted figures
socr paper.pdf --save-figures

# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run        # preview what would be processed
socr batch ~/Papers/ --reprocess      # force reprocess all

# Check which engines are available
socr engines

How it works

PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown

Primary OCR — Calls the primary engine CLI on the whole PDF
Quality audit — Heuristic checks (word count, garbage ratio, repetition)
Fallback — If audit fails, tries a different engine
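
The three heuristics named above could look roughly like the sketch below. It is not socr's actual implementation: the function name `audit_text`, the garbage and repetition thresholds, and the exact definitions of "garbage" and "repetition" are assumptions for illustration. Only the 50-word minimum mirrors the `audit_min_words` value shown in the Configuration section.

```python
from collections import Counter


def audit_text(text, min_words=50, max_garbage_ratio=0.2, max_repetition=0.3):
    """Return (passed, reasons) for a piece of OCR output.

    Hypothetical heuristics; the thresholds are illustrative, not socr's.
    """
    reasons = []
    words = text.split()

    # 1. Word count: very little text suggests the engine produced nothing useful.
    if len(words) < min_words:
        reasons.append("too few words")

    # 2. Garbage ratio: share of visible characters that are neither
    #    alphanumeric nor common punctuation.
    visible = [c for c in text if not c.isspace()]
    allowed = set(".,;:!?'\"()[]{}-_/\\#*|$%&+=<>@`~^")
    garbage = sum(1 for c in visible if not (c.isalnum() or c in allowed))
    if visible and garbage / len(visible) > max_garbage_ratio:
        reasons.append("high garbage ratio")

    # 3. Repetition: share of non-empty lines taken up by the single most
    #    common line (a model stuck in a loop repeats one line many times).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) >= 5:
        top_count = Counter(lines).most_common(1)[0][1]
        if top_count / len(lines) > max_repetition:
            reasons.append("repeated lines")

    return (not reasons, reasons)
```

A failed audit returns the list of reasons, which is convenient for logging why a fallback engine was triggered.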

Each engine is a separate CLI binary. socr calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
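
A minimal sketch of the subprocess-plus-fallback loop described above. The names `run_cli` and `ocr_with_fallback` are hypothetical, and the engine runner is passed in as a callable so that the sketch does not assume any engine's actual command-line flags or output location.

```python
import subprocess


def run_cli(cmd, timeout=300):
    """Run a command and return its stdout, or None on failure or timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout if proc.returncode == 0 else None


def ocr_with_fallback(pdf_path, engines, run_engine, audit):
    """Try each engine in order and return (engine, markdown) for the first
    result that passes the audit, or (None, None) if none does.

    run_engine(engine, pdf_path) -> markdown string, or None on failure
    audit(markdown) -> True if the output is acceptable
    """
    for engine in engines:
        markdown = run_engine(engine, pdf_path)
        if markdown is not None and audit(markdown):
            return engine, markdown
    return None, None
```

In a real setup, `run_engine` would wrap `run_cli` with each engine's own command line. What socr does when every engine fails the audit is not stated here, so returning `(None, None)` is an assumption of this sketch.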

Engines

Engine	Package	Type	Notes
Gemini	`gemini-ocr-cli`	Cloud	Google Gemini, ~$0.0002/page
Mistral	`mistral-ocr-cli`	Cloud	Mistral AI, ~$0.001/page
Marker	`marker-ocr-cli`	Local	Layout-aware (Surya + Texify)
DeepSeek	`deepseek-ocr-cli`	Local	Via Ollama
Nougat	`nougat-ocr-cli`	Local	Academic papers; requires Python <3.13

Check availability:

$ socr engines

  [+] gemini       cloud, ~$0.0002/page
  [+] marker       local, layout-aware (Surya + Texify)
  [+] mistral      cloud, ~$0.001/page
  [+] deepseek     local via Ollama
  [x] nougat       local, academic papers
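
For scripting, the listing above can be parsed into a dictionary. This sketch assumes that `[+]` marks an available engine and `[x]` an unavailable one, and that the format matches the sample shown; the helper name `parse_engines` is hypothetical and the output format may differ between versions.

```python
def parse_engines(output):
    """Map engine name -> (available, description) from `socr engines` text."""
    engines = {}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("[+]"):
            available = True
        elif line.startswith("[x]"):
            available = False
        else:
            continue  # skip the prompt line and blank lines
        parts = line[3:].split(None, 1)
        if not parts:
            continue
        name = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        engines[name] = (available, description)
    return engines
```

The text to parse could come from `subprocess.run(["socr", "engines"], capture_output=True, text=True).stdout`.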

CLI reference

socr process <PDF> [OPTIONS]
  -o, --output-dir PATH       Output directory
  --primary ENGINE             Primary OCR engine (gemini, marker, deepseek, etc.)
  --fallback ENGINE            Fallback engine
  --no-audit                   Skip quality audit
  --save-figures               Save extracted figure images
  --timeout SECONDS            Subprocess timeout (default: 300)
  --profile NAME               Load ~/.config/socr/{name}.yaml
  --config PATH                Custom YAML config file
  -q, --quiet                  Suppress non-error output
  -v, --verbose                Verbose output
  --dry-run                    List files without processing
  --reprocess                  Force reprocess already-done files

socr batch <DIR> [OPTIONS]
  Same options as process, plus:
  --limit N                    Process first N files

socr engines                   Show available engines
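
When calling socr from a Python script, the options above can be assembled into an argument list. The helper below is a hypothetical convenience, not part of socr; it uses only flags listed in this reference.

```python
def build_process_command(pdf, output_dir=None, primary=None, fallback=None,
                          timeout=None, save_figures=False, no_audit=False):
    """Build an argv list for `socr process` from the documented options."""
    cmd = ["socr", "process", str(pdf)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if primary is not None:
        cmd += ["--primary", primary]
    if fallback is not None:
        cmd += ["--fallback", fallback]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if save_figures:
        cmd.append("--save-figures")
    if no_audit:
        cmd.append("--no-audit")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with file names that contain spaces.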

Output

output/<doc_stem>/
├── <doc_stem>.md        # OCR text
├── metadata.json        # Processing stats
└── figures/             # With --save-figures
    └── figure_1_page3.png
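
A short sketch of reading one result back from the layout above. The function name `load_result` is hypothetical, figures are assumed to be PNG files as in the example, and no particular keys are assumed inside `metadata.json`.

```python
import json
from pathlib import Path


def load_result(output_root, doc_stem):
    """Load the markdown, metadata, and figure paths for one processed document."""
    doc_dir = Path(output_root) / doc_stem
    markdown = (doc_dir / f"{doc_stem}.md").read_text(encoding="utf-8")

    # metadata.json is treated as optional here; its schema is not assumed.
    metadata_path = doc_dir / "metadata.json"
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    else:
        metadata = {}

    # figures/ only exists when --save-figures was used.
    figures_dir = doc_dir / "figures"
    figures = sorted(figures_dir.glob("*.png")) if figures_dir.is_dir() else []
    return markdown, metadata, figures
```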

Configuration

Create ~/.config/socr/config.yaml:

primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50

Or use profiles: a file at ~/.config/socr/fast.yaml is loaded with socr paper.pdf --profile fast
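
Before handing a config or profile to socr, it can help to check it for typos. The sketch below validates an already-parsed config dictionary (for example, one loaded with PyYAML's `yaml.safe_load`) against the keys shown above. The type table is inferred from the example values and the helper name `validate_config` is hypothetical; socr may accept keys beyond these, so an "unknown key" result is better read as a warning than as an error.

```python
# Key names come from the example config above; types are inferred from its values.
EXPECTED_TYPES = {
    "primary_engine": str,
    "fallback_engine": str,
    "timeout": int,
    "save_figures": bool,
    "audit_enabled": bool,
    "audit_min_words": int,
}


def validate_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key, value in config.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key}")
        elif expected is int and isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{key}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```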

Engine CLIs

Each backend is an independent CLI tool:

gemini-ocr-cli — Google Gemini
deepseek-ocr-cli — DeepSeek via Ollama
mistral-ocr-cli — Mistral AI
marker-ocr-cli — Marker (Surya + Texify)
nougat-ocr-cli — Meta Nougat

License

MIT
