DocuLens

Document Intelligence Benchmark Platform - An OCR evaluation framework powered by DeepSeek OCR via Ollama.

Overview

DocuLens is a CLI tool designed to evaluate OCR (Optical Character Recognition) accuracy using DeepSeek OCR. It provides comprehensive metrics including Character Error Rate (CER), Word Error Rate (WER), and detailed content analysis with grounding information.

Features

  • DeepSeek OCR Integration: State-of-the-art OCR via Ollama local deployment
  • Document Processing: PDF and image file support (PNG, JPG, JPEG, GIF, WebP)
  • Output Cleaning: Remove grounding tags and convert HTML tables to markdown
  • Evaluation Metrics: CER, WER, character/word accuracy, error breakdown
  • Grounding Data: Bounding boxes and region type detection
  • Structured Output: Organized output folders with timestamps
  • Batch Processing: Process multiple documents at once

Project Structure

DocuLens/
├── src/
│   └── doculens/
│       ├── __init__.py
│       ├── cli/
│       │   └── main.py              # CLI commands
│       ├── config/
│       │   ├── __init__.py
│       │   └── settings.py          # Pydantic settings
│       ├── processing/
│       │   ├── __init__.py
│       │   ├── cleaner.py           # Output cleaning utilities
│       │   └── evaluator.py         # OCR evaluation metrics
│       └── providers/
│           ├── __init__.py
│           ├── base.py              # Abstract OCR provider
│           ├── factory.py           # Provider factory
│           └── ollama_deepseek.py   # DeepSeek OCR via Ollama
├── output/
│   ├── ocr/                         # OCR output files
│   ├── evaluations/                 # Evaluation reports
│   └── comparisons/                 # Comparison reports
├── samples/                         # Sample documents & ground truth
├── tests/                           # Test suite
├── .env                             # Environment configuration
├── pyproject.toml                   # Project configuration
└── README.md
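The providers/ package follows a familiar abstract-base-plus-factory pattern. A minimal sketch of how base.py and factory.py might fit together (class and function names here are illustrative, not the actual DocuLens API):

```python
from abc import ABC, abstractmethod


class OCRProvider(ABC):
    """Abstract OCR provider; concrete providers wrap a specific backend."""

    @abstractmethod
    def extract_text(self, document_path: str) -> str:
        """Run OCR on the document and return the extracted text."""


class OllamaDeepSeekProvider(OCRProvider):
    """Illustrative provider backed by a local Ollama deployment."""

    def __init__(self, host: str = "http://localhost:11434",
                 model: str = "deepseek-ocr") -> None:
        self.host = host
        self.model = model

    def extract_text(self, document_path: str) -> str:
        # A real implementation would send the document to the Ollama API.
        raise NotImplementedError


def create_provider(name: str) -> OCRProvider:
    """Factory: map a provider name to a concrete implementation."""
    providers = {"ollama-deepseek": OllamaDeepSeekProvider}
    try:
        return providers[name]()
    except KeyError:
        raise ValueError(f"Unknown provider: {name}") from None
```

New backends slot in by subclassing the abstract base and registering with the factory, without touching the CLI layer.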

Installation

Prerequisites

  • Python 3.11+
  • uv package manager
  • Ollama

Setup

# Clone the repository
git clone https://github.com/alexandergg/DocuLens.git
cd DocuLens

# Install dependencies
uv sync

# Pull DeepSeek OCR model (6.7GB)
ollama pull deepseek-ocr

Configuration

Create a .env file in the project root:

# Logging
DOCULENS_LOG_LEVEL=INFO

# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=deepseek-ocr

# Output
DOCULENS_OUTPUT_DIR=output
DOCULENS_OUTPUT_FORMAT=markdown
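settings.py loads these variables via Pydantic settings; the sketch below shows the same idea with a stdlib dataclass for illustration (the attribute names are assumptions, not the project's actual schema):

```python
import os
from dataclasses import dataclass, field


@dataclass
class DocuLensSettings:
    """Illustrative settings loader mirroring the .env variables above."""
    log_level: str = field(
        default_factory=lambda: os.getenv("DOCULENS_LOG_LEVEL", "INFO"))
    ollama_host: str = field(
        default_factory=lambda: os.getenv("OLLAMA_HOST", "http://localhost:11434"))
    ollama_model: str = field(
        default_factory=lambda: os.getenv("OLLAMA_MODEL", "deepseek-ocr"))
    output_dir: str = field(
        default_factory=lambda: os.getenv("DOCULENS_OUTPUT_DIR", "output"))
    output_format: str = field(
        default_factory=lambda: os.getenv("DOCULENS_OUTPUT_FORMAT", "markdown"))
```

Every variable has a sensible default, so a missing .env file still yields a working configuration.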

Usage

Check Configuration

uv run doculens config

Health Check

uv run doculens health

Process a Document

# Basic processing
uv run doculens process samples/document.pdf

# With clean output (removes grounding tags)
uv run doculens process samples/document.pdf --clean

# With evaluation report
uv run doculens process samples/document.pdf --clean --evaluate

# Custom output path (no timestamp)
uv run doculens process samples/document.pdf -o output/custom.md --no-timestamp

Clean Existing Output

# Clean OCR output file
uv run doculens clean output/ocr/document_20260130_120000.md

Evaluate OCR Output

# Basic evaluation (heuristic quality assessment)
uv run doculens evaluate output/ocr/document.md

# With ground truth comparison (CER/WER metrics)
uv run doculens evaluate output/ocr/document.md --ground-truth samples/ground_truth.txt

Batch Processing

# Process all PDFs in a directory
uv run doculens batch samples/ --pattern "*.pdf" --clean

Evaluation Metrics

Without Ground Truth

When no reference text is provided, DocuLens performs heuristic quality assessment:

  • Content extraction quality (word/character count)
  • Structure detection (headings, tables, lists)
  • DeepSeek grounding analysis (region types, bounding boxes)
  • Formatting quality (line length, fragmentation)

With Ground Truth

When a reference text file is provided, DocuLens calculates:

Metric         Description
CER            Character Error Rate: edit distance at the character level
WER            Word Error Rate: edit distance at the word level
Accuracy       1 - error rate (character and word level)
Insertions     Extra characters/words in the OCR output
Deletions      Characters/words missing from the OCR output
Substitutions  Incorrect characters/words
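CER and WER both reduce to edit (Levenshtein) distance over characters or words, divided by the reference length. A self-contained sketch of that calculation (not the project's actual evaluator code):

```python
def levenshtein(ref, hyp) -> int:
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Accuracy is then simply 1 - cer (or 1 - wer), clamped at zero for pathological outputs.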

Quality Grades

Score   Grade  Interpretation
95-100  A+     Excellent
90-94   A      Very Good
85-89   B+     Good
80-84   B      Above Average
70-79   C      Average
60-69   D      Below Average
0-59    F      Poor
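The score-to-grade mapping above is a simple threshold lookup; a sketch (the function name is illustrative):

```python
def quality_grade(score: float) -> str:
    """Map a 0-100 quality score to a letter grade per the table above."""
    bands = [(95, "A+"), (90, "A"), (85, "B+"), (80, "B"), (70, "C"), (60, "D")]
    for threshold, grade in bands:
        if score >= threshold:
            return grade
    return "F"
```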

Output Structure

All outputs are organized with timestamps for traceability:

output/
├── ocr/
│   ├── document_20260130_180000.md           # Raw OCR output
│   └── document_clean_20260130_180500.md     # Cleaned output
├── evaluations/
│   ├── document_20260130_180000_evaluation.md    # Markdown report
│   └── document_20260130_180000_evaluation.json  # JSON report
└── comparisons/
    └── (future comparison reports)

DeepSeek OCR Grounding

DeepSeek OCR outputs include grounding information:

<|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
Extracted text content here

Region Types

Type           Description
text           Body paragraphs
title          Section headers
sub_title      Subsection headers
table          Data tables
image          Charts/figures
image_caption  Figure captions
table_caption  Table titles

Use the --clean flag to strip grounding tags when you want readable output.
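Stripping or parsing these tags is a straightforward regex job; a sketch of what the cleaning step might look like (names are illustrative, not the actual cleaner.py implementation):

```python
import re

# Pattern for DeepSeek grounding markup:
#   <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<box>.*?)\]\]<\|/det\|>\s*"
)


def strip_grounding(text: str) -> str:
    """Remove grounding tags so only the extracted text remains."""
    return GROUNDING_RE.sub("", text)


def extract_regions(text: str) -> list[tuple[str, list[int]]]:
    """Return (label, [x1, y1, x2, y2]) pairs from grounded output."""
    return [
        (m.group("label"), [int(v) for v in m.group("box").split(",")])
        for m in GROUNDING_RE.finditer(text)
    ]
```

The same pattern serves both purposes: substitution for clean output, iteration for the bounding-box report.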

CLI Reference

doculens --help

Commands:
  config    View or manage configuration
  health    Check DeepSeek OCR (Ollama) health and connectivity
  process   Process a document and extract text using DeepSeek OCR
  batch     Process multiple documents in batch
  clean     Clean OCR output by removing grounding tags
  evaluate  Evaluate OCR output quality and accuracy

Process Options

Option          Short  Description
--output        -o     Output file path
--format        -f     Output format: markdown, json, or text
--clean         -c     Remove grounding tags
--evaluate      -e     Generate an evaluation report
--no-timestamp         Don't add a timestamp to the filename

Evaluate Options

Option          Short  Description
--ground-truth  -g     Path to the ground truth text file
--output        -o     Output path for the evaluation report
--no-timestamp         Don't add a timestamp to the filename

Development

Running Tests

uv run pytest

Code Quality

# Linting
uv run ruff check .

# Type checking
uv run mypy src/

License

MIT License
