
 ██████╗  ██████╗██████╗        ██████╗ ███████╗ ██████╗ ██████╗  ██████╗ ███╗  ██╗ █████╗ ████████╗ ██████╗ ██████╗ 
██╔═══██╗██╔════╝██╔══██╗       ██╔══██╗██╔════╝██╔════╝██╔═══██╗██╔════╝ ████╗ ██║██╔══██╗╚══██╔══╝██╔═══██╗██╔══██╗
██║   ██║██║     ██████╔╝       ██████╔╝█████╗  ██║     ██║   ██║██║  ███╗██╔██╗██║███████║   ██║   ██║   ██║██████╔╝
██║   ██║██║     ██╔══██╗       ██╔══██╗██╔══╝  ██║     ██║   ██║██║   ██║██║╚████║██╔══██║   ██║   ██║   ██║██╔══██╗
╚██████╔╝╚██████╗██║  ██║       ██║  ██║███████╗╚██████╗╚██████╔╝╚██████╔╝██║ ╚███║██║  ██║   ██║   ╚██████╔╝██║  ██║
 ╚═════╝  ╚═════╝╚═╝  ╚═╝       ╚═╝  ╚═╝╚══════╝ ╚═════╝ ╚═════╝  ╚═════╝ ╚═╝  ╚══╝╚═╝  ╚═╝   ╚═╝    ╚═════╝ ╚═╝  ╚═╝

High-Resolution Document Intelligence · From Raw PDFs to Binarized, OCR-Ready Page Images


Python Streamlit PyTorch OpenCV


Turn raw PDFs into analysis-ready, binarized page images — with a full Streamlit lab for exploring every preprocessing stage, and CRNN + ViT model modules ready for OCR training integration.


Quick Start  ·  Architecture  ·  Features  ·  Setup  ·  Usage Guide  ·  Roadmap  ·  Troubleshooting


Current Status

| Component | Status | Notes |
| --- | --- | --- |
| PDF to Image Conversion | Working | pdf2image, DPI=400, per-page JPEG output |
| Preprocessing Explorer | Working | Full OpenCV pipeline in Streamlit UI |
| CRNN Model Definition | Included | Forward pass defined; training loop is an outline |
| CTC Decoding | In Progress | Decoding not yet wired into app.py |
| ViT Backbone | In Progress | Architecture defined; input alignment needed |
| End-to-end Training | In Progress | Batching/collation and CSV wiring incomplete |

Quick Start

```shell
# 1. Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# 2. Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create the required input folder
mkdir pdf_files

# 5. Launch the Streamlit lab
python -m streamlit run app.py --server.port 8501
```

Note: Poppler is required for PDF conversion. See Setup for platform-specific instructions.


Overview

OCR-Recognator is a reproducible, inspectable, production-approachable system for turning scanned documents and PDFs into text. We built it to solve a real friction point — getting raw documents into a state where OCR models can actually work on them — and to make every stage of that process visible and debuggable.

Here is what the system does today and where it is headed:

  • You drop PDFs into pdf_files/, and the system renders every page as a high-DPI JPEG in seconds
  • You can explore every preprocessing step side-by-side in a browser UI — see exactly what the image looks like after deskewing, binarization, and centering before any model ever touches it
  • The CRNN and ViT model modules are included and ready for integration once the training data pipeline is finalized
  • CTC decoding and end-to-end inference are the next milestone (see Roadmap)

What is working right now: PDF conversion and the preprocessing explorer UI
What is included and ready for wiring: OCR inference modules in ocr_model.py and model_architecture.py


Architecture

```mermaid
flowchart TD
    U[User] --> UI["Streamlit UI — app.py"]

    UI -->|"Convert"| CONV["pdf_to_images.py — DPI=400"]
    CONV --> CONV_OUT["converted_images/ — page_n.jpg"]

    UI -->|"Preprocess"| PRE["preprocess_images.py"]
    PRE --> PRE_OUT["preprocessed_images/ — binarized pages"]

    UI --> V1["View: Converted Images"]
    UI --> V2["View: Preprocessed Images"]

    subgraph Preprocessing_Chain["Preprocessing Chain"]
        IN["Input Images from pdf_files/"] --> RESIZE["Resize — target_width=1024"]
        RESIZE --> DENOISE["Denoise — median + bilateral"]
        DENOISE --> CONTRAST["Gamma + CLAHE — clipLimit=3.5"]
        CONTRAST --> DESKEW["Deskew — Canny + Hough Lines"]
        DESKEW --> SHARP["Unsharp Mask — amount=1.2"]
        SHARP --> THRESH{"Threshold — Otsu or Sauvola"}
        THRESH --> CLEAN["Morphological Cleanup"]
        CLEAN --> CENTER["Center — Largest Contour"]
    end

    subgraph OCR_Modules["OCR Model Modules"]
        GT["ground_truth.csv — filename, transcription"] --> CRNN["ocr_model.py — CRNN + CTCLoss"]
        PRE_OUT --> CRNN
        CRNN -. "future: inference" .-> UI
        ViT["model_architecture.py — ViT + LSTM Head"] -. "optional backbone" .-> CRNN
    end
```

Features

PDF to High-Resolution Images

  • Powered by pdf2image (Poppler-backed) at DPI=400 for crisp renders
  • Outputs per-page JPEGs named converted_images/<pdf_stem>_page_<n>.jpg
  • DPI is configurable for speed vs. quality tradeoff
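To reason about the speed-versus-quality tradeoff, note that rendered pixel dimensions scale linearly with DPI. A quick back-of-envelope calculation (a standalone sketch, not code from the scripts — `page_pixels` is a hypothetical helper):

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a rendered page at a given DPI."""
    return (round(width_in * dpi), round(height_in * dpi))

# A US-Letter page (8.5 x 11 in):
print(page_pixels(8.5, 11, 400))  # (3400, 4400) -- about 15 MP per page
print(page_pixels(8.5, 11, 200))  # (1700, 2200) -- 4x fewer pixels to render and store
```

Halving the DPI quarters the pixel count, which is why 200 DPI is a reasonable setting for fast iteration.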

Production-Grade Preprocessing Chain

Every input image passes through this deterministic, 9-stage pipeline:

| Stage | Method | Parameters |
| --- | --- | --- |
| Resize | Aspect-ratio-preserving scale | target width: 1024px |
| Denoise | Median blur | kernel_size=3 |
| Edge-preserving denoise | Bilateral filter | d=9, sigmaColor=50, sigmaSpace=50 |
| Contrast | CLAHE | clipLimit=3.5, tileGridSize=(8,8) |
| Brightness | Gamma correction | tunable gamma |
| Deskew | Canny + Hough lines | threshold=100, minLineLen=50, maxGap=10 |
| Sharpening | Unsharp mask | amount=1.2, kernel=(5,5), sigma=1.0 |
| Binarization | Otsu or Sauvola | Sauvola: window=35, k=0.15 |
| Morphological cleanup | Opening + closing | noise removal and centering |
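The deskew stage is the least obvious of these: the script detects line segments with Canny + probabilistic Hough, then derives a rotation angle from them. A dependency-free sketch of just the angle-aggregation step, assuming segments arrive as `(x1, y1, x2, y2)` tuples (the function name and filtering threshold here are illustrative, not taken from `preprocess_images.py`):

```python
import math
from statistics import median

def estimate_skew(segments, max_abs_deg=15.0):
    """Median angle (degrees) of near-horizontal line segments.

    segments: iterable of (x1, y1, x2, y2) endpoints, e.g. from
    cv2.HoughLinesP. Angles far from horizontal (vertical rules,
    noise) are discarded before taking the median.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) <= max_abs_deg:
            angles.append(angle)
    return median(angles) if angles else 0.0

# Two text baselines tilted roughly 2 degrees, plus one vertical rule to ignore:
segs = [(0, 0, 100, 3), (0, 50, 100, 54), (10, 0, 10, 100)]
print(round(estimate_skew(segs), 1))  # → 2.0
```

The median (rather than the mean) keeps a single outlier segment from dragging the whole page rotation off.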

Streamlit Lab Interface

  • Sidebar — PDF selector from pdf_files/
  • Button "1. Convert to Image" — runs pdf_to_images.py
  • Button "2. Preprocess" — runs preprocess_images.py
  • Tabs — side-by-side converted vs. preprocessed image viewer

Model Modules (Training / Integration)

  • scripts/ocr_model.py — CRNN (CNN + bidirectional LSTM) with CTCLoss training outline
  • scripts/model_architecture.py — ViT feature extractor + LSTM head for sequence modeling

Data Flow

```mermaid
flowchart LR
  UI["Streamlit UI"] -->|Convert| PDF2IMG["pdf_to_images.py"]
  PDF2IMG -->|page JPEGs| CONVERTED["converted_images/"]

  UI -->|Preprocess| PRE["preprocess_images.py"]
  PDFIN["pdf_files/"] -->|jpg/png/tiff| PRE
  PRE -->|binarized pages| PREOUT["preprocessed_images/"]

  PREOUT -->|future inference| OCR["ocr_model.py"]
  OCR -->|CTC decode| TXT["Text output"]
  GT["ground_truth.csv"] --> OCR

  UI --> NOTE["UI shows only files containing selected PDF stem"]
  CONVERTED --> NOTE
  PREOUT --> NOTE
```

Known gap (important): preprocess_images.py currently reads images from pdf_files/ (not from converted_images/). To preprocess pages after conversion, you must copy the converted page images from converted_images/ back into pdf_files/ (so filenames contain the PDF stem the UI expects). This is tracked in Roadmap Phase 1.


Project Structure

```text
OCR-Recognator/
│
├── app.py                      # Streamlit UI — main entry point
├── requirements.txt            # Python dependencies
├── ground_truth.csv            # Placeholder — replace with real CSV data before training
├── walkthrough.md
├── codebase_analysis.md
│
├── scripts/
│   ├── pdf_to_images.py        # PDF to per-page JPEG (DPI=400)
│   ├── preprocess_images.py    # 9-stage image preprocessing chain
│   ├── ocr_model.py            # CRNN + CTCLoss training outline
│   ├── model_architecture.py   # ViT + LSTM head (definition)
│   └── create_ground_truth.py  # Ground truth CSV helper
│
├── pdf_files/                  # INPUT  — place PDFs here (create manually)
├── converted_images/           # OUTPUT — per-page JPEGs from PDF conversion
└── preprocessed_images/        # OUTPUT — binarized and deskewed page images
```

Setup

Prerequisites

| Dependency | Purpose | Install |
| --- | --- | --- |
| Python 3.9+ | Runtime | python.org |
| Poppler | PDF rendering backend | See below |
| pip packages | All Python dependencies | `pip install -r requirements.txt` |

Installing Poppler

Windows

Download from poppler-windows releases and extract to one of these paths (auto-detected by the script):

```text
C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin
C:\poppler\bin
C:\Program Files\poppler\bin
```

Alternatively, add Poppler's bin/ folder to your system PATH.

macOS

```shell
brew install poppler
```

Linux

```shell
sudo apt-get install poppler-utils    # Debian / Ubuntu
sudo yum install poppler-utils        # CentOS / RHEL
```

Full Setup Steps

```shell
# Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS / Linux

# Install all dependencies
pip install -r requirements.txt

# Create the required input directory
mkdir pdf_files
```

Usage Guide

1. Launch the Lab

```shell
python -m streamlit run app.py --server.port 8501
```

Navigate to http://localhost:8501

2. Convert PDFs

  1. Drop your PDFs into pdf_files/
  2. Select a PDF from the sidebar dropdown
  3. Click "1. Convert to Image" — outputs appear in converted_images/

3. Preprocess Images

Note: Currently reads *.jpg / *.png / *.tiff from pdf_files/, not from converted_images/. As a workaround, copy the converted page images into pdf_files/ before running preprocessing. This will be fixed in Phase 1.

  1. Click "2. Preprocess" — binarized pages are saved to preprocessed_images/
  2. Open the "Preprocessed Images" tab to inspect results side-by-side

4. OCR Training (CRNN)

```shell
# After preparing a valid ground_truth.csv:
python scripts/ocr_model.py
```

Expected CSV format:

```csv
filename,transcription
preprocessed_images/doc_page_1.jpg,Hello World
preprocessed_images/doc_page_2.jpg,Sample text on page two
```

Note: The repo-root ground_truth.csv is a placeholder and must be replaced with real labeled data before training.


Performance Notes

| Stage | Bottleneck | Recommendation |
| --- | --- | --- |
| PDF Conversion | DPI=400 is CPU and memory intensive | Reduce to 200 DPI for faster iteration; keep 400 for production quality |
| Preprocessing | Bilateral filter + Hough transform | Cost scales with image resolution and document complexity |
| CRNN Training | CPU-bound without a GPU | Use CUDA-enabled PyTorch for a 10-50x speedup |
| ViT Backbone | Heavy memory footprint | Requires proper 3-channel input normalization (see Troubleshooting) |

Roadmap

We are actively developing OCR-Recognator toward a complete, production-grade OCR system. Below is the full roadmap, written so that contributors and collaborators can understand exactly what we are building, why each step matters, and what the expected outcome is.


Phase 1 — Close the Pipeline Gap and Add Inference

Goal: Make the system end-to-end in the UI. Right now a user can convert and preprocess but cannot read the text out. Phase 1 closes that gap.

1.1 — Align preprocessing input to conversion output

At the moment preprocess_images.py reads from pdf_files/ instead of converted_images/. This means after converting a PDF you have to manually copy the output before you can preprocess it — which breaks the single-button workflow we want. We will update the script to read directly from converted_images/ so that the full pipeline runs in two clicks: convert, then preprocess.

1.2 — Add OCR inference and CTC decoding

We will load trained weights into the CRNN model, run a forward pass on the preprocessed page images, and implement CTC decoding starting with greedy search and optionally extending to beam search. This is the core capability the project is built toward.
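Greedy CTC decoding itself is simple: take the per-timestep argmax, collapse consecutive repeats, then drop blanks. A minimal sketch of that rule, assuming blank index 0 and a hypothetical character map (the real decoder will operate on the CRNN's output tensor, not hand-built index lists):

```python
def ctc_greedy_decode(indices, idx_to_char, blank=0):
    """Collapse repeated indices, then drop blanks.

    indices: per-timestep argmax of the model's output (list of ints).
    idx_to_char: mapping from non-blank index to character.
    """
    out = []
    prev = None
    for i in indices:
        # Emit only on a change of index, and never emit the blank
        if i != prev and i != blank:
            out.append(idx_to_char[i])
        prev = i
    return "".join(out)

# Hypothetical alphabet: 0 = blank, 1 = 'h', 2 = 'e', 3 = 'l', 4 = 'o'
chars = {1: "h", 2: "e", 3: "l", 4: "o"}
print(ctc_greedy_decode([1, 1, 2, 0, 3, 3, 0, 3, 4, 0], chars))  # → hello
```

Note the blank between the two `l` runs: that is what lets CTC represent doubled letters, and it is why repeats are collapsed before blanks are removed, not after.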

1.3 — Wire inference into app.py

Once decoding works in isolation we will surface the output inside the Streamlit UI — decoded text displayed alongside each preprocessed image, with the ability to copy the output or export it as JSON with page-level metadata.
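The JSON export we have in mind pairs each page's decoded text with page-level metadata. A rough sketch of the shape (field names are provisional, not a finalized schema):

```python
import json

def export_pages(pdf_name, page_texts):
    """Serialize decoded text with page-level metadata as JSON."""
    payload = {
        "source_pdf": pdf_name,
        "pages": [
            {"page": i + 1, "text": text}
            for i, text in enumerate(page_texts)
        ],
    }
    return json.dumps(payload, indent=2)

print(export_pages("report.pdf", ["Hello World", "Page two"]))
```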

Outcome of Phase 1: A user can drop a PDF in, click two buttons, and read the extracted text in the browser.


Phase 2 — Training Correctness and Data Quality

Goal: Make the training pipeline actually run correctly end-to-end on real data, not just in outline form.

2.1 — Fix the ground truth pipeline

The current ground_truth.csv in the repo root is a placeholder containing Python code, not CSV data. create_ground_truth.py also writes to the wrong path. We will replace the placeholder with a real labeled dataset, enforce the filename,transcription schema throughout, and align all script paths so that ocr_model.py can find and load the CSV without manual edits.
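One detail worth enforcing when we regenerate the file: transcriptions can contain commas, so the CSV must be written with a real CSV writer, not string concatenation. A stdlib-only sketch of the `filename,transcription` schema and a round-trip check (this is illustrative, not the current `create_ground_truth.py`):

```python
import csv
import io

rows = [
    ("preprocessed_images/doc_page_1.jpg", "Hello, World"),
    ("preprocessed_images/doc_page_2.jpg", "Sample text"),
]

buf = io.StringIO()  # stand-in for the real ground_truth.csv file
writer = csv.writer(buf)  # quotes any field containing a comma
writer.writerow(["filename", "transcription"])
writer.writerows(rows)

# Round-trip check: commas inside transcriptions survive parsing
buf.seek(0)
parsed = list(csv.DictReader(buf))
print(parsed[0]["transcription"])  # → Hello, World
```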

2.2 — Fix CRNN batching for CTC

The current DataLoader uses a passthrough collate_fn that returns a raw list of samples. The training loop then tries to unpack this as if it were already tensor batches — which crashes. We need to implement a proper collate_fn that stacks the image tensors, concatenates the label sequences, and computes input_lengths and target_lengths correctly so that CTCLoss receives valid inputs.

2.3 — Add training validation and metrics

We will add character error rate (CER) and word error rate (WER) evaluation on a held-out validation split, logged per epoch so we can track whether the model is actually learning.
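CER is edit distance over characters divided by reference length; WER is the same computation over token lists. A self-contained sketch of the CER side (standard Levenshtein dynamic programming, not yet wired into the training loop):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("hello world", "helo world"))  # 1 deletion / 11 chars ≈ 0.091
```

Logging this per epoch on a held-out split tells us directly whether training is converging, independent of the raw CTC loss value.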

Outcome of Phase 2: Running python scripts/ocr_model.py trains a real model without crashing, and we have metrics to know whether it is improving.


Phase 3 — Production Hardening

Goal: Make the system reliable, testable, and safe to run on real workloads.

3.1 — Fix Poppler path handling

The current pdf_to_images.py references poppler_path before it is defined, which throws a NameError on any system where Poppler is not on PATH. We will initialize the variable defensively, probe the known Windows install paths, and fall back cleanly to PATH with a clear error message if nothing is found.
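The defensive initialization could look roughly like this (a sketch of the planned fix, using the Windows paths from the Setup section; the function name is provisional):

```python
import os
import shutil

# Known Windows install locations, probed in order; PATH is the fallback.
CANDIDATE_DIRS = [
    r"C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin",
    r"C:\poppler\bin",
    r"C:\Program Files\poppler\bin",
]

def resolve_poppler_path():
    """Return an explicit Poppler bin dir, or None to rely on PATH.

    Raises a clear error only when Poppler is found nowhere, instead
    of the current NameError.
    """
    for candidate in CANDIDATE_DIRS:
        if os.path.isdir(candidate):
            return candidate
    if shutil.which("pdftoppm"):
        return None  # pdf2image will locate Poppler via PATH
    raise RuntimeError(
        "Poppler not found: install it or add its bin/ folder to PATH "
        "(see Setup)."
    )
```

The return value is passed straight to pdf2image's `convert_from_path(..., poppler_path=...)`, which treats `None` as "search PATH".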

3.2 — Streamlit performance and safety

For production use we need to add @st.cache_data decorators on conversion and preprocessing results so that re-running the same file is instant, progress bars so users know the pipeline is running and not frozen, concurrency limits to prevent multiple heavy processes from running simultaneously, and automatic cleanup of converted_images/ and preprocessed_images/ to prevent unbounded disk growth.

3.3 — ViT integration

model_architecture.py defines a ViT-based model head but the current preprocessing outputs single-channel binary images, whereas ViT expects 3-channel RGB input with specific normalization. We will add a channel expansion adapter and normalization layer so the two components can be connected without shape errors.

3.4 — Tests and regression validation

We will add smoke tests covering the conversion and preprocessing scripts, a golden-document regression set (a fixed PDF where we know what the preprocessed output should look like), and integration tests that run the full pipeline on a small labeled sample and check that CER stays within an acceptable range.

Outcome of Phase 3: The system is reliable enough to deploy, testable enough to iterate on confidently, and hardened against the known failure modes documented in the Troubleshooting section.


Troubleshooting

PDF Conversion Fails — NameError: poppler_path

Cause: poppler_path is referenced before being initialized in scripts/pdf_to_images.py.

Fix: Initialize the variable at the top of the function:

```python
poppler_path = None  # Add this line before the if-check
```

Or ensure pdftoppm is on your system PATH so the variable is never needed.

Preprocessing Does Not Process Converted Pages

Cause: scripts/preprocess_images.py reads from ../pdf_files looking for *.jpg / *.png / *.tiff, not from converted_images/.

Workaround: Copy the converted page images from converted_images/ into pdf_files/, or change the source path in preprocess_images.py.

Permanent fix: Tracked in Roadmap Phase 1.1.

ground_truth.csv Causes pd.read_csv() to Fail

Cause: The repo-root ground_truth.csv contains Python code, not CSV rows.

Fix: Replace it with a valid CSV file:

```csv
filename,transcription
preprocessed_images/page_1.jpg,Your ground truth text here
```

Also note: create_ground_truth.py writes to ../ground_truth/ground_truth.csv, while ocr_model.py expects ground_truth.csv at the repo root. Align these paths before training. Tracked in Roadmap Phase 2.1.

CRNN Training Loop Crashes on Batch Unpacking

Cause: DataLoader(..., collate_fn=lambda x: x) returns a list of samples, but the training loop unpacks batches as if images and labels were already separated tensors.

Fix: Implement a proper collate_fn:

```python
import torch

def ocr_collate_fn(batch):
    # batch: list of (image_tensor, label_tensor) pairs from the Dataset
    images = torch.stack([item[0] for item in batch])
    labels = [item[1] for item in batch]
    # CRNN downsamples width by 4, so each sample yields W//4 timesteps
    input_lengths = torch.full((len(batch),), images.shape[-1] // 4, dtype=torch.long)
    target_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
    targets = torch.cat(labels)  # CTCLoss accepts concatenated 1-D targets
    return images, targets, input_lengths, target_lengths
```

Tracked in Roadmap Phase 2.2.

ViT Model Input Shape Mismatch

Cause: ViTModel expects 3-channel RGB images with specific normalization. Preprocessing produces grayscale or binary single-channel images.

Fix: Add an input adapter before passing images to the ViT backbone:

```python
from torchvision import transforms

# Expand single-channel to 3-channel and normalize for ViT
x = x.repeat(1, 3, 1, 1)  # (B, 1, H, W) to (B, 3, H, W)
x = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])(x)
```

Tracked in Roadmap Phase 3.3.


Tech Stack

| Layer | Technology | Role |
| --- | --- | --- |
| UI | Streamlit | Interactive preprocessing lab |
| PDF Rendering | pdf2image + Poppler | PDF to high-DPI JPEG |
| Image Processing | OpenCV | Denoise, deskew, morph, sharpen |
| Thresholding | scikit-image | Sauvola adaptive binarization |
| Deep Learning | PyTorch | CRNN model and CTC training |
| Transformers | HuggingFace Transformers | ViT feature extractor |
| Data | pandas | Ground truth CSV loading |

Deployment

Local (Recommended)

```shell
python -m streamlit run app.py --server.port 8501
```

Remote / Production

```nginx
# Nginx reverse proxy example
server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_set_header Host $host;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

Production recommendations:

  • Run OCR model inference in a separate worker process to avoid blocking the Streamlit event loop
  • Add upload size limits and disk cleanup policies for converted_images/ and preprocessed_images/
  • Use caching decorators (@st.cache_data) for conversion and preprocessing results

License

Distributed under the MIT License. See LICENSE for details.


Built by Amar Pawar
