DocuLens

Document Intelligence Benchmark Platform - An OCR evaluation framework powered by DeepSeek OCR via Ollama.

Overview

DocuLens is a CLI tool designed to evaluate OCR (Optical Character Recognition) accuracy using DeepSeek OCR. It provides comprehensive metrics including Character Error Rate (CER), Word Error Rate (WER), and detailed content analysis with grounding information.

Features

  • DeepSeek OCR Integration: State-of-the-art OCR via Ollama local deployment
  • Document Processing: PDF and image file support (PNG, JPG, JPEG, GIF, WebP)
  • Output Cleaning: Remove grounding tags and convert HTML tables to markdown
  • Evaluation Metrics: CER, WER, character/word accuracy, error breakdown
  • Grounding Data: Bounding boxes and region type detection
  • Structured Output: Organized output folders with timestamps
  • Batch Processing: Process multiple documents at once

Project Structure

DocuLens/
├── src/
│   └── doculens/
│       ├── __init__.py
│       ├── cli/
│       │   └── main.py              # CLI commands
│       ├── config/
│       │   ├── __init__.py
│       │   └── settings.py          # Pydantic settings
│       ├── processing/
│       │   ├── __init__.py
│       │   ├── cleaner.py           # Output cleaning utilities
│       │   └── evaluator.py         # OCR evaluation metrics
│       └── providers/
│           ├── __init__.py
│           ├── base.py              # Abstract OCR provider
│           ├── factory.py           # Provider factory
│           └── ollama_deepseek.py   # DeepSeek OCR via Ollama
├── output/
│   ├── ocr/                         # OCR output files
│   ├── evaluations/                 # Evaluation reports
│   └── comparisons/                 # Comparison reports
├── samples/                         # Sample documents & ground truth
├── tests/                           # Test suite
├── .env                             # Environment configuration
├── pyproject.toml                   # Project configuration
└── README.md
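The providers/ package follows a familiar abstract-base-plus-factory pattern. A minimal sketch of how base.py and factory.py might fit together (class and function names here are illustrative, not the actual DocuLens API):

```python
from abc import ABC, abstractmethod


class OCRProvider(ABC):
    """Abstract OCR provider; concrete providers wrap a specific backend."""

    @abstractmethod
    def extract_text(self, document_path: str) -> str:
        """Run OCR on the document and return the extracted text."""


class OllamaDeepSeekProvider(OCRProvider):
    """Illustrative provider backed by a local Ollama deployment."""

    def __init__(self, host: str = "http://localhost:11434",
                 model: str = "deepseek-ocr") -> None:
        self.host = host
        self.model = model

    def extract_text(self, document_path: str) -> str:
        # A real implementation would send the document to the Ollama API.
        raise NotImplementedError


def create_provider(name: str) -> OCRProvider:
    """Factory: map a provider name to a concrete implementation."""
    providers = {"ollama-deepseek": OllamaDeepSeekProvider}
    try:
        return providers[name]()
    except KeyError:
        raise ValueError(f"Unknown provider: {name}") from None
```

New backends slot in by subclassing the abstract base and registering with the factory, without touching the CLI layer.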

Installation

Prerequisites

  • Python 3.11+
  • uv package manager
  • Ollama

Setup

# Clone the repository
git clone https://github.com/alexandergg/DocuLens.git
cd DocuLens

# Install dependencies
uv sync

# Pull DeepSeek OCR model (6.7GB)
ollama pull deepseek-ocr

Configuration

Create a .env file in the project root:

# Logging
DOCULENS_LOG_LEVEL=INFO

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=deepseek-ocr

# Output
DOCULENS_OUTPUT_DIR=output
DOCULENS_OUTPUT_FORMAT=markdown
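settings.py loads these variables via Pydantic settings; the sketch below shows the same idea with a stdlib dataclass for illustration (the attribute names are assumptions, not the project's actual schema):

```python
import os
from dataclasses import dataclass, field


@dataclass
class DocuLensSettings:
    """Illustrative settings loader mirroring the .env variables above."""
    log_level: str = field(
        default_factory=lambda: os.getenv("DOCULENS_LOG_LEVEL", "INFO"))
    ollama_host: str = field(
        default_factory=lambda: os.getenv("OLLAMA_HOST", "http://localhost:11434"))
    ollama_model: str = field(
        default_factory=lambda: os.getenv("OLLAMA_MODEL", "deepseek-ocr"))
    output_dir: str = field(
        default_factory=lambda: os.getenv("DOCULENS_OUTPUT_DIR", "output"))
    output_format: str = field(
        default_factory=lambda: os.getenv("DOCULENS_OUTPUT_FORMAT", "markdown"))
```

Every variable has a sensible default, so a missing .env file still yields a working configuration.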

Usage

Check Configuration

uv run doculens config

Health Check

uv run doculens health

Process a Document

# Basic processing
uv run doculens process samples/document.pdf

# With clean output (removes grounding tags)
uv run doculens process samples/document.pdf --clean

# With evaluation report
uv run doculens process samples/document.pdf --clean --evaluate

# Custom output path (no timestamp)
uv run doculens process samples/document.pdf -o output/custom.md --no-timestamp

Clean Existing Output

# Clean OCR output file
uv run doculens clean output/ocr/document_20260130_120000.md

Evaluate OCR Output

# Basic evaluation (heuristic quality assessment)
uv run doculens evaluate output/ocr/document.md

# With ground truth comparison (CER/WER metrics)
uv run doculens evaluate output/ocr/document.md --ground-truth samples/ground_truth.txt

Batch Processing

# Process all PDFs in a directory
uv run doculens batch samples/ --pattern "*.pdf" --clean

Evaluation Metrics

Without Ground Truth

When no reference text is provided, DocuLens performs heuristic quality assessment:

  • Content extraction quality (word/character count)
  • Structure detection (headings, tables, lists)
  • DeepSeek grounding analysis (region types, bounding boxes)
  • Formatting quality (line length, fragmentation)

With Ground Truth

When a reference text file is provided, DocuLens calculates:

Metric         Description
CER            Character Error Rate: edit distance at the character level
WER            Word Error Rate: edit distance at the word level
Accuracy       1 - error rate (character and word level)
Insertions     Extra characters/words in the OCR output
Deletions      Characters/words missing from the OCR output
Substitutions  Incorrect characters/words
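CER and WER both reduce to edit (Levenshtein) distance over characters or words, divided by the reference length. A self-contained sketch of that calculation (not the project's actual evaluator code):

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Accuracy is then simply 1 - cer (or 1 - wer), clamped at zero for pathological outputs.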

Quality Grades

Score   Grade  Interpretation
95-100  A+     Excellent
90-94   A      Very Good
85-89   B+     Good
80-84   B      Above Average
70-79   C      Average
60-69   D      Below Average
0-59    F      Poor
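The score-to-grade mapping above is a simple threshold lookup; a sketch (the function name is illustrative):

```python
def quality_grade(score: float) -> str:
    """Map a 0-100 quality score to a letter grade per the table above."""
    bands = [(95, "A+"), (90, "A"), (85, "B+"), (80, "B"), (70, "C"), (60, "D")]
    for threshold, grade in bands:
        if score >= threshold:
            return grade
    return "F"
```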

Output Structure

All outputs are organized with timestamps for traceability:

output/
├── ocr/
│   ├── document_20260130_180000.md           # Raw OCR output
│   └── document_clean_20260130_180500.md     # Cleaned output
├── evaluations/
│   ├── document_20260130_180000_evaluation.md    # Markdown report
│   └── document_20260130_180000_evaluation.json  # JSON report
└── comparisons/
    └── (future comparison reports)

DeepSeek OCR Grounding

DeepSeek OCR outputs include grounding information:

<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here

Region Types

Type           Description
text           Body paragraphs
title          Section headers
sub_title      Subsection headers
table          Data tables
image          Charts/figures
image_caption  Figure captions
table_caption  Table titles

Use the --clean flag to strip grounding tags when you want readable output.
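Stripping or parsing these tags is a straightforward regex job; a sketch of what the cleaning step might look like (names are illustrative, not the actual cleaner.py implementation):

```python
import re

# Pattern for DeepSeek grounding markup:
#   <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<box>.*?)\]\]<\|/det\|>\s*"
)


def strip_grounding(text: str) -> str:
    """Remove grounding tags so only the extracted text remains."""
    return GROUNDING_RE.sub("", text)


def extract_regions(text: str) -> list[tuple[str, list[int]]]:
    """Return (label, [x1, y1, x2, y2]) pairs from grounded output."""
    return [
        (m.group("label"), [int(v) for v in m.group("box").split(",")])
        for m in GROUNDING_RE.finditer(text)
    ]
```

The same pattern serves both purposes: substitution for clean output, iteration for the bounding-box report.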

CLI Reference

doculens --help

Commands:
  config    View or manage configuration
  health    Check DeepSeek OCR (Ollama) health and connectivity
  process   Process a document and extract text using DeepSeek OCR
  batch     Process multiple documents in batch
  clean     Clean OCR output by removing grounding tags
  evaluate  Evaluate OCR output quality and accuracy

Process Options

Option          Short  Description
--output        -o     Output file path
--format        -f     Output format: markdown, json, or text
--clean         -c     Remove grounding tags
--evaluate      -e     Generate an evaluation report
--no-timestamp         Don't add a timestamp to the filename

Evaluate Options

Option          Short  Description
--ground-truth  -g     Path to the ground truth text file
--output        -o     Output path for the evaluation report
--no-timestamp         Don't add a timestamp to the filename

Development

Running Tests

uv run pytest

Code Quality

# Linting
uv run ruff check .

# Type checking
uv run mypy src/

License

MIT License
