██████╗ ██████╗██████╗ ██████╗ ███████╗ ██████╗ ██████╗ ██████╗ ███╗ ██╗ █████╗ ████████╗ ██████╗ ██████╗
██╔═══██╗██╔════╝██╔══██╗ ██╔══██╗██╔════╝██╔════╝██╔═══██╗██╔════╝ ████╗ ██║██╔══██╗╚══██╔══╝██╔═══██╗██╔══██╗
██║ ██║██║ ██████╔╝ ██████╔╝█████╗ ██║ ██║ ██║██║ ███╗██╔██╗██║███████║ ██║ ██║ ██║██████╔╝
██║ ██║██║ ██╔══██╗ ██╔══██╗██╔══╝ ██║ ██║ ██║██║ ██║██║╚████║██╔══██║ ██║ ██║ ██║██╔══██╗
╚██████╔╝╚██████╗██║ ██║ ██║ ██║███████╗╚██████╗╚██████╔╝╚██████╔╝██║ ╚███║██║ ██║ ██║ ╚██████╔╝██║ ██║
╚═════╝ ╚═════╝╚═╝ ╚═╝ ╚═╝ ╚═╝╚══════╝ ╚═════╝ ╚═════╝ ╚═════╝ ╚═╝ ╚══╝╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝
Turn raw PDFs into analysis-ready, binarized page images — with a full Streamlit lab for exploring every preprocessing stage, and CRNN + ViT model modules ready for OCR training integration.
Quick Start · Architecture · Features · Setup · Usage Guide · Roadmap · Troubleshooting
| Component | Status | Notes |
|---|---|---|
| PDF to Image Conversion | Working | pdf2image, DPI=400, per-page JPEG output |
| Preprocessing Explorer | Working | Full OpenCV pipeline in Streamlit UI |
| CRNN Model Definition | Included | Forward pass defined; training loop is an outline |
| CTC Decoding | In Progress | Decoding not yet wired into app.py |
| ViT Backbone | In Progress | Architecture defined; input alignment needed |
| End-to-end Training | In Progress | Batching/collation and CSV wiring incomplete |
```bash
# 1. Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# 2. Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create the required input folder
mkdir pdf_files

# 5. Launch the Streamlit lab
python -m streamlit run app.py --server.port 8501
```

Note: Poppler is required for PDF conversion. See Setup for platform-specific instructions.
OCR-Recognator is a reproducible, inspectable, production-approachable system for turning scanned documents and PDFs into text. We built it to solve a real friction point — getting raw documents into a state where OCR models can actually work on them — and to make every stage of that process visible and debuggable.
Here is what the system does today and where it is headed:
- You drop PDFs into `pdf_files/`, and the system renders every page as a high-DPI JPEG in seconds
- You can explore every preprocessing step side-by-side in a browser UI — see exactly what the image looks like after deskewing, binarization, and centering before any model ever touches it
- The CRNN and ViT model modules are included and ready for integration once the training data pipeline is finalized
- CTC decoding and end-to-end inference are the next milestone (see Roadmap)
What is working right now: PDF conversion and the preprocessing explorer UI
What is included and ready for wiring: OCR inference modules in ocr_model.py and model_architecture.py
```mermaid
flowchart TD
    U[User] --> UI["Streamlit UI — app.py"]
    UI -->|"Convert"| CONV["pdf_to_images.py — DPI=400"]
    CONV --> CONV_OUT["converted_images/ — page_n.jpg"]
    UI -->|"Preprocess"| PRE["preprocess_images.py"]
    PRE --> PRE_OUT["preprocessed_images/ — binarized pages"]
    UI --> V1["View: Converted Images"]
    UI --> V2["View: Preprocessed Images"]
    subgraph Preprocessing_Chain["Preprocessing Chain"]
        IN["Input Images from pdf_files/"] --> RESIZE["Resize — target_width=1024"]
        RESIZE --> DENOISE["Denoise — median + bilateral"]
        DENOISE --> CONTRAST["Gamma + CLAHE — clipLimit=3.5"]
        CONTRAST --> DESKEW["Deskew — Canny + Hough Lines"]
        DESKEW --> SHARP["Unsharp Mask — amount=1.2"]
        SHARP --> THRESH{"Threshold — Otsu or Sauvola"}
        THRESH --> CLEAN["Morphological Cleanup"]
        CLEAN --> CENTER["Center — Largest Contour"]
    end
    subgraph OCR_Modules["OCR Model Modules"]
        GT["ground_truth.csv — filename, transcription"] --> CRNN["ocr_model.py — CRNN + CTCLoss"]
        PRE_OUT --> CRNN
        CRNN -. "future: inference" .-> UI
        ViT["model_architecture.py — ViT + LSTM Head"] -. "optional backbone" .-> CRNN
    end
```
- Powered by `pdf2image` (Poppler-backed) at DPI=400 for crisp renders
- Outputs per-page JPEGs named `converted_images/<pdf_stem>_page_<n>.jpg`
- DPI is configurable for speed vs. quality tradeoff
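For orientation, the conversion step amounts to a thin wrapper around `pdf2image.convert_from_path`. A minimal sketch, assuming the naming convention above (the function names here are illustrative, not the actual API of `pdf_to_images.py`):

```python
from pathlib import Path

def page_filename(pdf_stem, n):
    """Naming convention for converted pages: <pdf_stem>_page_<n>.jpg."""
    return f"{pdf_stem}_page_{n}.jpg"

def convert_pdf(pdf_path, out_dir="converted_images", dpi=400):
    """Render every page of a PDF to a per-page JPEG via pdf2image/Poppler."""
    from pdf2image import convert_from_path  # imported lazily; needs Poppler
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    stem = Path(pdf_path).stem
    saved = []
    for n, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_filename(stem, n)
        page.save(target, "JPEG")
        saved.append(str(target))
    return saved
```

Dropping `dpi` from 400 to 200 roughly quarters the pixel count per page, which is the speed/quality tradeoff mentioned above.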
Every input image passes through this deterministic, 9-stage pipeline:
| Stage | Method | Parameters |
|---|---|---|
| Resize | Aspect-ratio preserved | Target width: 1024px |
| Denoise | Median blur | kernel_size=3 |
| Edge-preserving denoise | Bilateral filter | d=9, sigmaColor=50, sigmaSpace=50 |
| Contrast | CLAHE | clipLimit=3.5, tileGridSize=(8,8) |
| Brightness | Gamma correction | Tunable gamma |
| Deskew | Canny + Hough Lines | threshold=100, minLineLen=50, maxGap=10 |
| Sharpening | Unsharp mask | amount=1.2, kernel=(5,5), sigma=1.0 |
| Binarization | Otsu or Sauvola | Sauvola: window=35, k=0.15 |
| Morphological cleanup | Opening + closing | Noise removal and centering |
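To make the binarization stage concrete, here is a pure-NumPy sketch of the Sauvola rule `T = m * (1 + k * (s/R - 1))` with the table's defaults (`window=35`, `k=0.15`). This is for illustration only — the project itself uses scikit-image's `threshold_sauvola`:

```python
import numpy as np

def sauvola_binarize(gray, window=35, k=0.15, R=128.0):
    """Sauvola threshold T = m * (1 + k * (s / R - 1)), where m and s are the
    local mean and standard deviation over a window x window neighborhood.
    Computed with integral images, so the cost is independent of window size."""
    img = gray.astype(np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    # Integral images of pixel values and squared values (leading zero row/col)
    s1 = np.pad(np.cumsum(np.cumsum(padded, 0), 1), ((1, 0), (1, 0)))
    s2 = np.pad(np.cumsum(np.cumsum(padded ** 2, 0), 1), ((1, 0), (1, 0)))
    h, w = img.shape

    def window_sum(ii):
        # Inclusion-exclusion over the integral image
        return (ii[window:window + h, window:window + w]
                - ii[:h, window:window + w]
                - ii[window:window + h, :w]
                + ii[:h, :w])

    n = float(window * window)
    mean = window_sum(s1) / n
    var = window_sum(s2) / n - mean ** 2
    std = np.sqrt(np.clip(var, 0.0, None))
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return np.where(img > threshold, 255, 0).astype(np.uint8)
```

The local mean pulls the threshold up in bright regions and the local std term relaxes it near high-contrast text, which is why Sauvola beats global Otsu on unevenly lit scans.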
- Sidebar — PDF selector from `pdf_files/`
- Button "1. Convert to Image" — runs `pdf_to_images.py`
- Button "2. Preprocess" — runs `preprocess_images.py`
- Tabs — side-by-side converted vs. preprocessed image viewer
- `scripts/ocr_model.py` — CRNN (CNN + bidirectional LSTM) with `CTCLoss` training outline
- `scripts/model_architecture.py` — ViT feature extractor + LSTM head for sequence modeling
```mermaid
flowchart LR
    UI["Streamlit UI"] -->|Convert| PDF2IMG["pdf_to_images.py"]
    PDF2IMG -->|page JPEGs| CONVERTED["converted_images/"]
    UI -->|Preprocess| PRE["preprocess_images.py"]
    PDFIN["pdf_files/"] -->|jpg/png/tiff| PRE
    PRE -->|binarized pages| PREOUT["preprocessed_images/"]
    PREOUT -->|future inference| OCR["ocr_model.py"]
    OCR -->|CTC decode| TXT["Text output"]
    GT["ground_truth.csv"] --> OCR
    UI --> NOTE["UI shows only files containing selected PDF stem"]
    CONVERTED --> NOTE
    PREOUT --> NOTE
```
Known gap (important): `preprocess_images.py` currently reads images from `pdf_files/` (not from `converted_images/`). To preprocess pages after conversion, you must copy the converted page images from `converted_images/` back into `pdf_files/` (so filenames contain the PDF stem the UI expects). This is tracked in Roadmap Phase 1.
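Until Phase 1.1 lands, that copy step can be scripted. A throwaway helper along these lines (the function name and defaults are ours, not part of the repo):

```python
import shutil
from pathlib import Path

def copy_converted_pages(src="converted_images", dst="pdf_files"):
    """Mirror converted page JPEGs into pdf_files/ so preprocess_images.py
    (which currently reads from pdf_files/) can see them."""
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for page in sorted(src_dir.glob("*.jpg")):
        shutil.copy2(page, dst_dir / page.name)  # preserves timestamps
        copied.append(page.name)
    return copied
```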
```
OCR-Recognator/
│
├── app.py                     # Streamlit UI — main entry point
├── requirements.txt           # Python dependencies
├── ground_truth.csv           # Placeholder — replace with real CSV data before training
├── walkthrough.md
├── codebase_analysis.md
│
├── scripts/
│   ├── pdf_to_images.py       # PDF to per-page JPEG (DPI=400)
│   ├── preprocess_images.py   # 9-stage image preprocessing chain
│   ├── ocr_model.py           # CRNN + CTCLoss training outline
│   ├── model_architecture.py  # ViT + LSTM head (definition)
│   └── create_ground_truth.py # Ground truth CSV helper
│
├── pdf_files/                 # INPUT — place PDFs here (create manually)
├── converted_images/          # OUTPUT — per-page JPEGs from PDF conversion
└── preprocessed_images/       # OUTPUT — binarized and deskewed page images
```
| Dependency | Purpose | Install |
|---|---|---|
| Python 3.9+ | Runtime | python.org |
| Poppler | PDF rendering backend | See below |
| pip packages | All Python dependencies | pip install -r requirements.txt |
**Windows**

Download from poppler-windows releases and extract to one of these paths (auto-detected by the script):

```
C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin
C:\poppler\bin
C:\Program Files\poppler\bin
```

Alternatively, add Poppler's `bin/` folder to your system PATH.
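The auto-detection described above can be sketched as a defensive probe that falls back to PATH (illustrative only; the actual logic in `pdf_to_images.py` differs and is being hardened in Roadmap Phase 3.1):

```python
import os
import shutil

# Candidate Windows install locations, checked in order (mirrors the list above)
WINDOWS_POPPLER_PATHS = [
    r"C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin",
    r"C:\poppler\bin",
    r"C:\Program Files\poppler\bin",
]

def find_poppler_path():
    """Return an explicit Poppler bin directory, or None to rely on PATH.
    Raises a clear error instead of a NameError when nothing is found."""
    for candidate in WINDOWS_POPPLER_PATHS:
        if os.path.isdir(candidate):
            return candidate
    if shutil.which("pdftoppm"):  # Poppler is already on PATH
        return None
    raise FileNotFoundError(
        "Poppler not found: install it and add its bin/ folder to PATH, "
        "or extract it to one of the known locations."
    )
```

The return value can be passed straight to `pdf2image.convert_from_path(..., poppler_path=...)`, which accepts `None` when Poppler is resolvable from PATH.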
**macOS**

```bash
brew install poppler
```

**Linux**

```bash
sudo apt-get install poppler-utils   # Debian / Ubuntu
sudo yum install poppler-utils       # CentOS / RHEL
```

```bash
# Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS / Linux

# Install all dependencies
pip install -r requirements.txt

# Create the required input directory
mkdir pdf_files
```

```bash
python -m streamlit run app.py --server.port 8501
```

Navigate to http://localhost:8501
- Drop your PDFs into `pdf_files/`
- Select a PDF from the sidebar dropdown
- Click "1. Convert to Image" — outputs appear in `converted_images/`

Note: Currently reads `*.jpg` / `*.png` / `*.tiff` from `pdf_files/`, not from `converted_images/`. As a workaround, copy the converted page images into `pdf_files/` before running preprocessing. This will be fixed in Phase 1.

- Click "2. Preprocess" — binarized pages are saved to `preprocessed_images/`
- Open the "Preprocessed Images" tab to inspect results side-by-side
```bash
# After preparing a valid ground_truth.csv:
python scripts/ocr_model.py
```

Expected CSV format:

```csv
filename,transcription
preprocessed_images/doc_page_1.jpg,Hello World
preprocessed_images/doc_page_2.jpg,Sample text on page two
```

Note: The repo-root `ground_truth.csv` is a placeholder and must be replaced with real labeled data before training.
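Before training, it is worth validating the CSV programmatically so schema problems fail fast rather than mid-epoch. A small sketch using pandas (the helper name is ours, not the repo's):

```python
import pandas as pd

def load_ground_truth(csv_path="ground_truth.csv"):
    """Load the labels file and fail fast if the schema is wrong."""
    df = pd.read_csv(csv_path)
    missing = {"filename", "transcription"} - set(df.columns)
    if missing:
        raise ValueError(f"ground_truth.csv is missing columns: {missing}")
    # Drop rows with empty paths or transcriptions rather than training on them
    return df.dropna(subset=["filename", "transcription"])
```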
| Stage | Bottleneck | Recommendation |
|---|---|---|
| PDF Conversion | DPI=400 is CPU and memory intensive | Reduce to 200 DPI for faster iteration; keep 400 for production quality |
| Preprocessing | Bilateral filter + Hough transform | Cost scales with image resolution and document complexity |
| CRNN Training | CPU-bound without a GPU | Use CUDA-enabled PyTorch for a 10-50x speedup |
| ViT Backbone | Heavy memory footprint | Requires proper 3-channel input normalization (see Troubleshooting) |
We are actively developing OCR-Recognator toward a complete, production-grade OCR system. Below is the full roadmap, written so that contributors and collaborators can understand exactly what we are building, why each step matters, and what the expected outcome is.
Goal: Make the system end-to-end in the UI. Right now a user can convert and preprocess but cannot read the text out. Phase 1 closes that gap.
1.1 — Align preprocessing input to conversion output
At the moment preprocess_images.py reads from pdf_files/ instead of converted_images/. This means after converting a PDF you have to manually copy the output before you can preprocess it — which breaks the single-button workflow we want. We will update the script to read directly from converted_images/ so that the full pipeline runs in two clicks: convert, then preprocess.
1.2 — Add OCR inference and CTC decoding
We will load trained weights into the CRNN model, run a forward pass on the preprocessed page images, and implement CTC decoding starting with greedy search and optionally extending to beam search. This is the core capability the project is built toward.
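Greedy CTC decoding itself is compact: take the argmax at each timestep, collapse repeated symbols, then drop blanks. A minimal sketch under the usual CTC conventions (blank index 0; the function name and `(T, B, C)` layout are assumptions, not the repo's API):

```python
import torch

def ctc_greedy_decode(log_probs, idx_to_char, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks.
    log_probs: tensor of shape (T, B, num_classes), e.g. CRNN output after
    log_softmax. Returns one decoded string per batch element."""
    best = log_probs.argmax(dim=-1)  # (T, B) class indices
    texts = []
    for b in range(best.shape[1]):
        chars, prev = [], blank
        for cls in best[:, b].tolist():
            if cls != blank and cls != prev:  # skip blanks and repeated symbols
                chars.append(idx_to_char[cls])
            prev = cls
        texts.append("".join(chars))
    return texts
```

Beam search replaces the per-timestep argmax with a set of candidate prefixes scored by summed path probabilities; greedy decoding is the natural first milestone because it needs no extra state.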
1.3 — Wire inference into app.py
Once decoding works in isolation we will surface the output inside the Streamlit UI — decoded text displayed alongside each preprocessed image, with the ability to copy the output or export it as JSON with page-level metadata.
Outcome of Phase 1: A user can drop a PDF in, click two buttons, and read the extracted text in the browser.
Goal: Make the training pipeline actually run correctly end-to-end on real data, not just in outline form.
2.1 — Fix the ground truth pipeline
The current ground_truth.csv in the repo root is a placeholder containing Python code, not CSV data. create_ground_truth.py also writes to the wrong path. We will replace the placeholder with a real labeled dataset, enforce the filename,transcription schema throughout, and align all script paths so that ocr_model.py can find and load the CSV without manual edits.
2.2 — Fix CRNN batching for CTC
The current DataLoader uses a passthrough collate_fn that returns a raw list of samples. The training loop then tries to unpack this as if it were already tensor batches — which crashes. We need to implement a proper collate_fn that stacks the image tensors, concatenates the label sequences, and computes input_lengths and target_lengths correctly so that CTCLoss receives valid inputs.
2.3 — Add training validation and metrics
We will add character error rate (CER) and word error rate (WER) evaluation on a held-out validation split, logged per epoch so we can track whether the model is actually learning.
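CER is the character-level Levenshtein edit distance normalized by the reference length; WER is the same computation over word tokens. A minimal sketch (the helper name is ours):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance between the two sequences,
    normalized by the reference length. Works on any sequence, so
    cer(ref.split(), hyp.split()) gives word error rate (WER)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)
```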
Outcome of Phase 2: Running python scripts/ocr_model.py trains a real model without crashing, and we have metrics to know whether it is improving.
Goal: Make the system reliable, testable, and safe to run on real workloads.
3.1 — Fix Poppler path handling
The current pdf_to_images.py references poppler_path before it is defined, which throws a NameError on any system where Poppler is not on PATH. We will initialize the variable defensively, probe the known Windows install paths, and fall back cleanly to PATH with a clear error message if nothing is found.
3.2 — Streamlit performance and safety
For production use we need to add @st.cache_data decorators on conversion and preprocessing results so that re-running the same file is instant, progress bars so users know the pipeline is running and not frozen, concurrency limits to prevent multiple heavy processes from running simultaneously, and automatic cleanup of converted_images/ and preprocessed_images/ to prevent unbounded disk growth.
3.3 — ViT integration
model_architecture.py defines a ViT-based model head but the current preprocessing outputs single-channel binary images, whereas ViT expects 3-channel RGB input with specific normalization. We will add a channel expansion adapter and normalization layer so the two components can be connected without shape errors.
3.4 — Tests and regression validation
We will add smoke tests covering the conversion and preprocessing scripts, a golden-document regression set (a fixed PDF where we know what the preprocessed output should look like), and integration tests that run the full pipeline on a small labeled sample and check that CER stays within an acceptable range.
Outcome of Phase 3: The system is reliable enough to deploy, testable enough to iterate on confidently, and hardened against the known failure modes documented in the Troubleshooting section.
PDF Conversion Fails — NameError: poppler_path
Cause: poppler_path is referenced before being initialized in scripts/pdf_to_images.py.
Fix: Initialize the variable at the top of the function:

```python
poppler_path = None  # Add this line before the if-check
```

Or ensure `pdftoppm` is on your system PATH so the variable is never needed.
Preprocessing Does Not Process Converted Pages
Cause: scripts/preprocess_images.py reads from ../pdf_files looking for *.jpg / *.png / *.tiff, not from converted_images/.
Workaround: Copy the converted page images from converted_images/ into pdf_files/, or change the source path in preprocess_images.py.
Permanent fix: Tracked in Roadmap Phase 1.1.
ground_truth.csv Causes pd.read_csv() to Fail
Cause: The repo-root ground_truth.csv contains Python code, not CSV rows.
Fix: Replace it with a valid CSV file:

```csv
filename,transcription
preprocessed_images/page_1.jpg,Your ground truth text here
```

Also note: `create_ground_truth.py` writes to `../ground_truth/ground_truth.csv`, while `ocr_model.py` expects `ground_truth.csv` at the repo root. Align these paths before training. Tracked in Roadmap Phase 2.1.
CRNN Training Loop Crashes on Batch Unpacking
Cause: DataLoader(..., collate_fn=lambda x: x) returns a list of samples, but the training loop unpacks batches as if images and labels were already separated tensors.
Fix: Implement a proper `collate_fn`:

```python
def ocr_collate_fn(batch):
    # Stack images into a (B, C, H, W) tensor
    images = torch.stack([item[0] for item in batch])
    labels = [item[1] for item in batch]
    # The CRNN downsamples width by 4, so each sample yields width/4 timesteps
    input_lengths = torch.full((len(batch),), images.shape[-1] // 4, dtype=torch.long)
    target_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
    # CTCLoss accepts targets as one concatenated 1-D tensor
    targets = torch.cat(labels)
    return images, targets, input_lengths, target_lengths
```

Tracked in Roadmap Phase 2.2.
ViT Model Input Shape Mismatch
Cause: ViTModel expects 3-channel RGB images with specific normalization. Preprocessing produces grayscale or binary single-channel images.
Fix: Add an input adapter before passing images to the ViT backbone:

```python
# Expand single-channel to 3-channel and normalize for ViT
x = x.repeat(1, 3, 1, 1)  # (B, 1, H, W) to (B, 3, H, W)
x = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])(x)
```

Tracked in Roadmap Phase 3.3.
| Layer | Technology | Role |
|---|---|---|
| UI | Streamlit | Interactive preprocessing lab |
| PDF Rendering | pdf2image + Poppler | PDF to high-DPI JPEG |
| Image Processing | OpenCV | Denoise, deskew, morph, sharpen |
| Thresholding | scikit-image | Sauvola adaptive binarization |
| Deep Learning | PyTorch | CRNN model and CTC training |
| Transformers | HuggingFace Transformers | ViT feature extractor |
| Data | pandas | Ground truth CSV loading |
```bash
python -m streamlit run app.py --server.port 8501
```

```nginx
# Nginx reverse proxy example
server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_set_header Host $host;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

Production recommendations:
- Run OCR model inference in a separate worker process to avoid blocking the Streamlit event loop
- Add upload size limits and disk cleanup policies for `converted_images/` and `preprocessed_images/`
- Use caching decorators (`@st.cache_data`) for conversion and preprocessing results
Distributed under the MIT License. See LICENSE for details.
Built by Amar Pawar