Document Intelligence Benchmark Platform - An OCR evaluation framework powered by DeepSeek OCR via Ollama.
DocuLens is a CLI tool designed to evaluate OCR (Optical Character Recognition) accuracy using DeepSeek OCR. It provides comprehensive metrics including Character Error Rate (CER), Word Error Rate (WER), and detailed content analysis with grounding information.
- DeepSeek OCR Integration: State-of-the-art OCR via Ollama local deployment
- Document Processing: PDF and image file support (PNG, JPG, JPEG, GIF, WebP)
- Output Cleaning: Remove grounding tags and convert HTML tables to markdown
- Evaluation Metrics: CER, WER, character/word accuracy, error breakdown
- Grounding Data: Bounding boxes and region type detection
- Structured Output: Organized output folders with timestamps
- Batch Processing: Process multiple documents at once
DocuLens/
├── src/
│ └── doculens/
│ ├── __init__.py
│ ├── cli/
│ │ └── main.py # CLI commands
│ ├── config/
│ │ ├── __init__.py
│ │ └── settings.py # Pydantic settings
│ ├── processing/
│ │ ├── __init__.py
│ │ ├── cleaner.py # Output cleaning utilities
│ │ └── evaluator.py # OCR evaluation metrics
│ └── providers/
│ ├── __init__.py
│ ├── base.py # Abstract OCR provider
│ ├── factory.py # Provider factory
│ └── ollama_deepseek.py # DeepSeek OCR via Ollama
├── output/
│ ├── ocr/ # OCR output files
│ ├── evaluations/ # Evaluation reports
│ └── comparisons/ # Comparison reports
├── samples/ # Sample documents & ground truth
├── tests/ # Test suite
├── .env # Environment configuration
├── pyproject.toml # Project configuration
└── README.md
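The `providers` package follows an abstract-base-plus-factory layout (`base.py`, `factory.py`, `ollama_deepseek.py`). A minimal sketch of how such a layout typically fits together — class and method names here are illustrative, not the project's actual API:

```python
from abc import ABC, abstractmethod


class OCRProvider(ABC):
    """Abstract interface every OCR backend implements (cf. base.py)."""

    @abstractmethod
    def extract_text(self, document_path: str) -> str:
        """Run OCR on a document and return the extracted text."""


class OllamaDeepSeekProvider(OCRProvider):
    """Concrete provider that would call DeepSeek OCR through Ollama."""

    def __init__(self, host: str = "http://localhost:11434", model: str = "deepseek-ocr"):
        self.host = host
        self.model = model

    def extract_text(self, document_path: str) -> str:
        # A real implementation would send the document to the Ollama API.
        raise NotImplementedError


def create_provider(name: str = "ollama") -> OCRProvider:
    """Factory (cf. factory.py): map a provider name to an implementation."""
    providers = {"ollama": OllamaDeepSeekProvider}
    if name not in providers:
        raise ValueError(f"Unknown provider: {name}")
    return providers[name]()
```

This keeps the CLI decoupled from any single OCR backend: adding a new engine means one new subclass and one new factory entry.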
# Clone the repository
git clone https://github.com/alexandergg/DocuLens.git
cd DocuLens
# Install dependencies
uv sync
# Pull DeepSeek OCR model (6.7GB)
ollama pull deepseek-ocr
Create a .env file in the project root:
# Logging
DOCULENS_LOG_LEVEL=INFO
# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=deepseek-ocr
# Output
DOCULENS_OUTPUT_DIR=output
DOCULENS_OUTPUT_FORMAT=markdown
Verify the configuration and check that Ollama is reachable:
uv run doculens config
uv run doculens health
# Basic processing
uv run doculens process samples/document.pdf
# With clean output (removes grounding tags)
uv run doculens process samples/document.pdf --clean
# With evaluation report
uv run doculens process samples/document.pdf --clean --evaluate
# Custom output path (no timestamp)
uv run doculens process samples/document.pdf -o output/custom.md --no-timestamp
# Clean OCR output file
uv run doculens clean output/ocr/document_20260130_120000.md
# Basic evaluation (heuristic quality assessment)
uv run doculens evaluate output/ocr/document.md
# With ground truth comparison (CER/WER metrics)
uv run doculens evaluate output/ocr/document.md --ground-truth samples/ground_truth.txt
# Process all PDFs in a directory
uv run doculens batch samples/ --pattern "*.pdf" --clean
When no reference text is provided, DocuLens performs heuristic quality assessment:
- Content extraction quality (word/character count)
- Structure detection (headings, tables, lists)
- DeepSeek grounding analysis (region types, bounding boxes)
- Formatting quality (line length, fragmentation)
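A rough sketch of what such a heuristic pass might look like — the checks and signals below are illustrative, not DocuLens's actual scoring logic:

```python
import re


def heuristic_quality(text: str) -> dict:
    """Score OCR output on simple structural signals (illustrative only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "word_count": len(text.split()),
        "heading_count": sum(1 for ln in lines if ln.lstrip().startswith("#")),
        "table_rows": sum(1 for ln in lines if ln.lstrip().startswith("|")),
        "list_items": sum(1 for ln in lines if re.match(r"\s*[-*]\s", ln)),
        # Many very short lines often indicate fragmented OCR output.
        "fragmented": sum(1 for ln in lines if len(ln) < 10) > len(lines) / 2,
    }
```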
When a reference text file is provided, DocuLens calculates:
| Metric | Description |
|---|---|
| CER | Character Error Rate - edit distance at character level |
| WER | Word Error Rate - edit distance at word level |
| Accuracy | 1 - error rate (character and word level) |
| Insertions | Extra characters/words in OCR output |
| Deletions | Missing characters/words from OCR output |
| Substitutions | Incorrect characters/words |
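CER and WER are both derived from Levenshtein edit distance over the reference and OCR output. A minimal sketch of the calculation — not the project's actual implementation, which also breaks out insertions, deletions, and substitutions separately:

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: same distance computed over word tokens."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Accuracy in the table above is then simply `1 - cer(...)` or `1 - wer(...)`.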
| Score | Grade | Interpretation |
|---|---|---|
| 95-100 | A+ | Excellent |
| 90-94 | A | Very Good |
| 85-89 | B+ | Good |
| 80-84 | B | Above Average |
| 70-79 | C | Average |
| 60-69 | D | Below Average |
| 0-59 | F | Poor |
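The grade bands above reduce to a simple threshold lookup; a sketch, assuming a 0-100 score:

```python
def grade(score: float) -> str:
    """Map a 0-100 quality score to a letter grade per the table above."""
    bands = [(95, "A+"), (90, "A"), (85, "B+"), (80, "B"), (70, "C"), (60, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"
```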
All outputs are organized with timestamps for traceability:
output/
├── ocr/
│ ├── document_20260130_180000.md # Raw OCR output
│ └── document_clean_20260130_180500.md # Cleaned output
├── evaluations/
│ ├── document_20260130_180000_evaluation.md # Markdown report
│ └── document_20260130_180000_evaluation.json # JSON report
└── comparisons/
└── (future comparison reports)
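The `YYYYMMDD_HHMMSS` suffix in the filenames above comes straight from `strftime`; a sketch of such a naming helper (illustrative, not the project's code):

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(source: str, out_dir: str = "output/ocr", suffix: str = "") -> Path:
    """Build e.g. output/ocr/document_clean_20260130_180500.md from an input path."""
    stem = Path(source).stem + (f"_{suffix}" if suffix else "")
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{stem}_{stamp}.md"
```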
DeepSeek OCR outputs include grounding information:
<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here
| Type | Description |
|---|---|
| text | Body paragraphs |
| title | Section headers |
| sub_title | Subsection headers |
| table | Data tables |
| image | Charts/figures |
| image_caption | Figure captions |
| table_caption | Table titles |
Use --clean flag to remove grounding tags for readable output.
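A sketch of the kind of regex pass the `--clean` step performs, assuming the tag format shown above (the actual `cleaner.py` may differ):

```python
import re

# Matches a <|ref|>...<|/ref|> label plus its <|det|>...<|/det|> bounding box.
GROUNDING = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>\[\[.*?\]\]<\|/det\|>")


def strip_grounding(text: str) -> str:
    """Drop grounding tags, keeping only the referenced text."""
    return GROUNDING.sub(r"\1", text)
```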
doculens --help
Commands:
config View or manage configuration
health Check DeepSeek OCR (Ollama) health and connectivity
process Process a document and extract text using DeepSeek OCR
batch Process multiple documents in batch
clean Clean OCR output by removing grounding tags
evaluate Evaluate OCR output quality and accuracy
| Option | Short | Description |
|---|---|---|
| --output | -o | Output file path |
| --format | -f | Output format: markdown, json, text |
| --clean | -c | Remove grounding tags |
| --evaluate | -e | Generate evaluation report |
| --no-timestamp | | Don't add timestamp to filename |
| Option | Short | Description |
|---|---|---|
| --ground-truth | -g | Path to ground truth text file |
| --output | -o | Output path for evaluation report |
| --no-timestamp | | Don't add timestamp to filename |
uv run pytest
# Linting
uv run ruff check .
# Type checking
uv run mypy src/
MIT License
- DeepSeek OCR - State-of-the-art OCR model
- Ollama - Local LLM deployment
- Typer - CLI framework
- Rich - Terminal formatting