██████╗ ██████╗██████╗ ██████╗ ███████╗ ██████╗ ██████╗ ██████╗ ███╗ ██╗ █████╗ ████████╗ ██████╗ ██████╗
██╔═══██╗██╔════╝██╔══██╗ ██╔══██╗██╔════╝██╔════╝██╔═══██╗██╔════╝ ████╗ ██║██╔══██╗╚══██╔══╝██╔═══██╗██╔══██╗
██║ ██║██║ ██████╔╝ ██████╔╝█████╗ ██║ ██║ ██║██║ ███╗██╔██╗██║███████║ ██║ ██║ ██║██████╔╝
██║ ██║██║ ██╔══██╗ ██╔══██╗██╔══╝ ██║ ██║ ██║██║ ██║██║╚████║██╔══██║ ██║ ██║ ██║██╔══██╗
╚██████╔╝╚██████╗██║ ██║ ██║ ██║███████╗╚██████╗╚██████╔╝╚██████╔╝██║ ╚███║██║ ██║ ██║ ╚██████╔╝██║ ██║
╚═════╝ ╚═════╝╚═╝ ╚═╝ ╚═╝ ╚═╝╚══════╝ ╚═════╝ ╚═════╝ ╚═════╝ ╚═╝ ╚══╝╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝
Turn raw PDFs into analysis-ready, binarized page images — with a full Streamlit lab for exploring every preprocessing stage, and CRNN + ViT model modules ready for OCR training integration.
Quick Start · Architecture · Features · Setup · Usage Guide · Roadmap · Troubleshooting
| Component | Status | Notes |
|---|---|---|
| PDF to Image Conversion | Working | pdf2image, DPI=400, per-page JPEG output |
| Preprocessing Explorer | Working | Full OpenCV pipeline in Streamlit UI |
| CRNN Model Definition | Included | Forward pass defined; training loop is an outline |
| CTC Decoding | In Progress | Decoding not yet wired into app.py |
| ViT Backbone | In Progress | Architecture defined; input alignment needed |
| End-to-end Training | In Progress | Batching/collation and CSV wiring incomplete |
```bash
# 1. Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# 2. Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create the required input folder
mkdir pdf_files

# 5. Launch the Streamlit lab
python -m streamlit run app.py --server.port 8501
```

Note: Poppler is required for PDF conversion. See Setup for platform-specific instructions.
OCR-Recognator is a reproducible, inspectable, production-approachable system for turning scanned documents and PDFs into text. We built it to solve a real friction point — getting raw documents into a state where OCR models can actually work on them — and to make every stage of that process visible and debuggable.
Here is what the system does today and where it is headed:
- You drop PDFs into `pdf_files/`, and the system renders every page as a high-DPI JPEG in seconds
- You can explore every preprocessing step side-by-side in a browser UI — see exactly what the image looks like after deskewing, binarization, and centering before any model ever touches it
- The CRNN and ViT model modules are included and ready for integration once the training data pipeline is finalized
- CTC decoding and end-to-end inference are the next milestone (see Roadmap)
What is working right now: PDF conversion and the preprocessing explorer UI
What is included and ready for wiring: OCR inference modules in ocr_model.py and model_architecture.py
```mermaid
flowchart TD
    U[User] --> UI["Streamlit UI — app.py"]
    UI -->|"Convert"| CONV["pdf_to_images.py — DPI=400"]
    CONV --> CONV_OUT["converted_images/ — page_n.jpg"]
    UI -->|"Preprocess"| PRE["preprocess_images.py"]
    PRE --> PRE_OUT["preprocessed_images/ — binarized pages"]
    UI --> V1["View: Converted Images"]
    UI --> V2["View: Preprocessed Images"]
    subgraph Preprocessing_Chain["Preprocessing Chain"]
        IN["Input Images from pdf_files/"] --> RESIZE["Resize — target_width=1024"]
        RESIZE --> DENOISE["Denoise — median + bilateral"]
        DENOISE --> CONTRAST["Gamma + CLAHE — clipLimit=3.5"]
        CONTRAST --> DESKEW["Deskew — Canny + Hough Lines"]
        DESKEW --> SHARP["Unsharp Mask — amount=1.2"]
        SHARP --> THRESH{"Threshold — Otsu or Sauvola"}
        THRESH --> CLEAN["Morphological Cleanup"]
        CLEAN --> CENTER["Center — Largest Contour"]
    end
    subgraph OCR_Modules["OCR Model Modules"]
        GT["ground_truth.csv — filename, transcription"] --> CRNN["ocr_model.py — CRNN + CTCLoss"]
        PRE_OUT --> CRNN
        CRNN -. "future: inference" .-> UI
        ViT["model_architecture.py — ViT + LSTM Head"] -. "optional backbone" .-> CRNN
    end
```
- Powered by `pdf2image` (Poppler-backed) at DPI=400 for crisp renders
- Outputs per-page JPEGs named `converted_images/<pdf_stem>_page_<n>.jpg`
- DPI is configurable for speed vs. quality tradeoff
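For orientation, the conversion step amounts to a thin wrapper around `pdf2image.convert_from_path`. A minimal sketch, assuming the naming convention above (the function names here are illustrative, not the actual API of `pdf_to_images.py`):

```python
from pathlib import Path

def page_filename(pdf_stem, n):
    """Naming convention for converted pages: <pdf_stem>_page_<n>.jpg."""
    return f"{pdf_stem}_page_{n}.jpg"

def convert_pdf(pdf_path, out_dir="converted_images", dpi=400):
    """Render every page of a PDF to a per-page JPEG via pdf2image/Poppler."""
    from pdf2image import convert_from_path  # imported lazily; needs Poppler
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    stem = Path(pdf_path).stem
    saved = []
    for n, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_filename(stem, n)
        page.save(target, "JPEG")
        saved.append(str(target))
    return saved
```

Dropping `dpi` from 400 to 200 roughly quarters the pixel count per page, which is the speed/quality tradeoff mentioned above.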
Every input image passes through this deterministic, 9-stage pipeline:
| Stage | Method | Parameters |
|---|---|---|
| Resize | Aspect-ratio preserved | Target width: 1024px |
| Denoise | Median blur | kernel_size=3 |
| Edge-preserving denoise | Bilateral filter | d=9, sigmaColor=50, sigmaSpace=50 |
| Contrast | CLAHE | clipLimit=3.5, tileGridSize=(8,8) |
| Brightness | Gamma correction | Tunable gamma |
| Deskew | Canny + Hough Lines | threshold=100, minLineLen=50, maxGap=10 |
| Sharpening | Unsharp mask | amount=1.2, kernel=(5,5), sigma=1.0 |
| Binarization | Otsu or Sauvola | Sauvola: window=35, k=0.15 |
| Morphological cleanup | Opening + closing | Noise removal and centering |
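To make the binarization stage concrete, here is a pure-NumPy sketch of the Sauvola rule `T = m * (1 + k * (s/R - 1))` with the table's defaults (`window=35`, `k=0.15`). This is for illustration only — the project itself uses scikit-image's `threshold_sauvola`:

```python
import numpy as np

def sauvola_binarize(gray, window=35, k=0.15, R=128.0):
    """Sauvola threshold T = m * (1 + k * (s / R - 1)), where m and s are the
    local mean and standard deviation over a window x window neighborhood.
    Computed with integral images, so the cost is independent of window size."""
    img = gray.astype(np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    # Integral images of pixel values and squared values (leading zero row/col)
    s1 = np.pad(np.cumsum(np.cumsum(padded, 0), 1), ((1, 0), (1, 0)))
    s2 = np.pad(np.cumsum(np.cumsum(padded ** 2, 0), 1), ((1, 0), (1, 0)))
    h, w = img.shape

    def window_sum(ii):
        # Inclusion-exclusion over the integral image
        return (ii[window:window + h, window:window + w]
                - ii[:h, window:window + w]
                - ii[window:window + h, :w]
                + ii[:h, :w])

    n = float(window * window)
    mean = window_sum(s1) / n
    var = window_sum(s2) / n - mean ** 2
    std = np.sqrt(np.clip(var, 0.0, None))
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return np.where(img > threshold, 255, 0).astype(np.uint8)
```

The local mean pulls the threshold up in bright regions and the local std term relaxes it near high-contrast text, which is why Sauvola beats global Otsu on unevenly lit scans.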
- Sidebar — PDF selector from `pdf_files/`
- Button "1. Convert to Image" — runs `pdf_to_images.py`
- Button "2. Preprocess" — runs `preprocess_images.py`
- Tabs — side-by-side converted vs. preprocessed image viewer
- `scripts/ocr_model.py` — CRNN (CNN + bidirectional LSTM) with `CTCLoss` training outline
- `scripts/model_architecture.py` — ViT feature extractor + LSTM head for sequence modeling
```mermaid
flowchart LR
    UI["Streamlit UI"] -->|Convert| PDF2IMG["pdf_to_images.py"]
    PDF2IMG -->|page JPEGs| CONVERTED["converted_images/"]
    UI -->|Preprocess| PRE["preprocess_images.py"]
    PDFIN["pdf_files/"] -->|jpg/png/tiff| PRE
    PRE -->|binarized pages| PREOUT["preprocessed_images/"]
    PREOUT -->|future inference| OCR["ocr_model.py"]
    OCR -->|CTC decode| TXT["Text output"]
    GT["ground_truth.csv"] --> OCR
    UI --> NOTE["UI shows only files containing selected PDF stem"]
    CONVERTED --> NOTE
    PREOUT --> NOTE
```
Known gap (important): `preprocess_images.py` currently reads images from `pdf_files/` (not from `converted_images/`). To preprocess pages after conversion, you must copy the converted page images from `converted_images/` back into `pdf_files/` (so filenames contain the PDF stem the UI expects). This is tracked in Roadmap Phase 1.
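Until Phase 1.1 lands, that copy step can be scripted. A throwaway helper along these lines (the function name and defaults are ours, not part of the repo):

```python
import shutil
from pathlib import Path

def copy_converted_pages(src="converted_images", dst="pdf_files"):
    """Mirror converted page JPEGs into pdf_files/ so preprocess_images.py
    (which currently reads from pdf_files/) can see them."""
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for page in sorted(src_dir.glob("*.jpg")):
        shutil.copy2(page, dst_dir / page.name)  # preserves timestamps
        copied.append(page.name)
    return copied
```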
```
OCR-Recognator/
│
├── app.py                     # Streamlit UI — main entry point
├── requirements.txt           # Python dependencies
├── ground_truth.csv           # Placeholder — replace with real CSV data before training
├── walkthrough.md
├── codebase_analysis.md
│
├── scripts/
│   ├── pdf_to_images.py       # PDF to per-page JPEG (DPI=400)
│   ├── preprocess_images.py   # 9-stage image preprocessing chain
│   ├── ocr_model.py           # CRNN + CTCLoss training outline
│   ├── model_architecture.py  # ViT + LSTM head (definition)
│   └── create_ground_truth.py # Ground truth CSV helper
│
├── pdf_files/                 # INPUT — place PDFs here (create manually)
├── converted_images/          # OUTPUT — per-page JPEGs from PDF conversion
└── preprocessed_images/       # OUTPUT — binarized and deskewed page images
```
| Dependency | Purpose | Install |
|---|---|---|
| Python 3.9+ | Runtime | python.org |
| Poppler | PDF rendering backend | See below |
| pip packages | All Python dependencies | pip install -r requirements.txt |
**Windows**

Download from poppler-windows releases and extract to one of these paths (auto-detected by the script):

```
C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin
C:\poppler\bin
C:\Program Files\poppler\bin
```

Alternatively, add Poppler's `bin/` folder to your system PATH.
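The auto-detection described above can be sketched as a defensive probe that falls back to PATH (illustrative only; the actual logic in `pdf_to_images.py` differs and is being hardened in Roadmap Phase 3.1):

```python
import os
import shutil

# Candidate Windows install locations, checked in order (mirrors the list above)
WINDOWS_POPPLER_PATHS = [
    r"C:\Program Files\Release-24.08.0-0\poppler-24.08.0\Library\bin",
    r"C:\poppler\bin",
    r"C:\Program Files\poppler\bin",
]

def find_poppler_path():
    """Return an explicit Poppler bin directory, or None to rely on PATH.
    Raises a clear error instead of a NameError when nothing is found."""
    for candidate in WINDOWS_POPPLER_PATHS:
        if os.path.isdir(candidate):
            return candidate
    if shutil.which("pdftoppm"):  # Poppler is already on PATH
        return None
    raise FileNotFoundError(
        "Poppler not found: install it and add its bin/ folder to PATH, "
        "or extract it to one of the known locations."
    )
```

The return value can be passed straight to `pdf2image.convert_from_path(..., poppler_path=...)`, which accepts `None` when Poppler is resolvable from PATH.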
**macOS**

```bash
brew install poppler
```

**Linux**

```bash
sudo apt-get install poppler-utils   # Debian / Ubuntu
sudo yum install poppler-utils       # CentOS / RHEL
```

```bash
# Clone the repository
git clone https://github.com/amarcoder01/OCR-Recognator.git
cd OCR-Recognator

# Create and activate a virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS / Linux

# Install all dependencies
pip install -r requirements.txt

# Create the required input directory
mkdir pdf_files
```

```bash
python -m streamlit run app.py --server.port 8501
```

Navigate to http://localhost:8501
- Drop your PDFs into `pdf_files/`
- Select a PDF from the sidebar dropdown
- Click "1. Convert to Image" — outputs appear in `converted_images/`

Note: Currently reads `*.jpg` / `*.png` / `*.tiff` from `pdf_files/`, not from `converted_images/`. As a workaround, copy the converted page images into `pdf_files/` before running preprocessing. This will be fixed in Phase 1.

- Click "2. Preprocess" — binarized pages are saved to `preprocessed_images/`
- Open the "Preprocessed Images" tab to inspect results side-by-side
```bash
# After preparing a valid ground_truth.csv:
python scripts/ocr_model.py
```

Expected CSV format:

```csv
filename,transcription
preprocessed_images/doc_page_1.jpg,Hello World
preprocessed_images/doc_page_2.jpg,Sample text on page two
```

Note: The repo-root `ground_truth.csv` is a placeholder and must be replaced with real labeled data before training.
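Before training, it is worth validating the CSV programmatically so schema problems fail fast rather than mid-epoch. A small sketch using pandas (the helper name is ours, not the repo's):

```python
import pandas as pd

def load_ground_truth(csv_path="ground_truth.csv"):
    """Load the labels file and fail fast if the schema is wrong."""
    df = pd.read_csv(csv_path)
    missing = {"filename", "transcription"} - set(df.columns)
    if missing:
        raise ValueError(f"ground_truth.csv is missing columns: {missing}")
    # Drop rows with empty paths or transcriptions rather than training on them
    return df.dropna(subset=["filename", "transcription"])
```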
| Stage | Bottleneck | Recommendation |
|---|---|---|
| PDF Conversion | DPI=400 is CPU and memory intensive | Reduce to 200 DPI for faster iteration; keep 400 for production quality |
| Preprocessing | Bilateral filter + Hough transform | Cost scales with image resolution and document complexity |
| CRNN Training | CPU-bound without a GPU | Use CUDA-enabled PyTorch for a 10-50x speedup |
| ViT Backbone | Heavy memory footprint | Requires proper 3-channel input normalization (see Troubleshooting) |
We are actively developing OCR-Recognator toward a complete, production-grade OCR system. Below is the full roadmap, written so that contributors and collaborators can understand exactly what we are building, why each step matters, and what the expected outcome is.
Goal: Make the system end-to-end in the UI. Right now a user can convert and preprocess but cannot read the text out. Phase 1 closes that gap.
1.1 — Align preprocessing input to conversion output
At the moment preprocess_images.py reads from pdf_files/ instead of converted_images/. This means after converting a PDF you have to manually copy the output before you can preprocess it — which breaks the single-button workflow we want. We will update the script to read directly from converted_images/ so that the full pipeline runs in two clicks: convert, then preprocess.
1.2 — Add OCR inference and CTC decoding
We will load trained weights into the CRNN model, run a forward pass on the preprocessed page images, and implement CTC decoding starting with greedy search and optionally extending to beam search. This is the core capability the project is built toward.
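Greedy CTC decoding itself is compact: take the argmax at each timestep, collapse repeated symbols, then drop blanks. A minimal sketch under the usual CTC conventions (blank index 0; the function name and `(T, B, C)` layout are assumptions, not the repo's API):

```python
import torch

def ctc_greedy_decode(log_probs, idx_to_char, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks.
    log_probs: tensor of shape (T, B, num_classes), e.g. CRNN output after
    log_softmax. Returns one decoded string per batch element."""
    best = log_probs.argmax(dim=-1)  # (T, B) class indices
    texts = []
    for b in range(best.shape[1]):
        chars, prev = [], blank
        for cls in best[:, b].tolist():
            if cls != blank and cls != prev:  # skip blanks and repeated symbols
                chars.append(idx_to_char[cls])
            prev = cls
        texts.append("".join(chars))
    return texts
```

Beam search replaces the per-timestep argmax with a set of candidate prefixes scored by summed path probabilities; greedy decoding is the natural first milestone because it needs no extra state.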
1.3 — Wire inference into app.py
Once decoding works in isolation we will surface the output inside the Streamlit UI — decoded text displayed alongside each preprocessed image, with the ability to copy the output or export it as JSON with page-level metadata.
Outcome of Phase 1: A user can drop a PDF in, click two buttons, and read the extracted text in the browser.
Goal: Make the training pipeline actually run correctly end-to-end on real data, not just in outline form.
2.1 — Fix the ground truth pipeline
The current ground_truth.csv in the repo root is a placeholder containing Python code, not CSV data. create_ground_truth.py also writes to the wrong path. We will replace the placeholder with a real labeled dataset, enforce the filename,transcription schema throughout, and align all script paths so that ocr_model.py can find and load the CSV without manual edits.
2.2 — Fix CRNN batching for CTC
The current DataLoader uses a passthrough collate_fn that returns a raw list of samples. The training loop then tries to unpack this as if it were already tensor batches — which crashes. We need to implement a proper collate_fn that stacks the image tensors, concatenates the label sequences, and computes input_lengths and target_lengths correctly so that CTCLoss receives valid inputs.
2.3 — Add training validation and metrics
We will add character error rate (CER) and word error rate (WER) evaluation on a held-out validation split, logged per epoch so we can track whether the model is actually learning.
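CER is the character-level Levenshtein edit distance normalized by the reference length; WER is the same computation over word tokens. A minimal sketch (the helper name is ours):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance between the two sequences,
    normalized by the reference length. Works on any sequence, so
    cer(ref.split(), hyp.split()) gives word error rate (WER)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)
```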
Outcome of Phase 2: Running python scripts/ocr_model.py trains a real model without crashing, and we have metrics to know whether it is improving.
Goal: Make the system reliable, testable, and safe to run on real workloads.
3.1 — Fix Poppler path handling
The current pdf_to_images.py references poppler_path before it is defined, which throws a NameError on any system where Poppler is not on PATH. We will initialize the variable defensively, probe the known Windows install paths, and fall back cleanly to PATH with a clear error message if nothing is found.
3.2 — Streamlit performance and safety
For production use we need to add @st.cache_data decorators on conversion and preprocessing results so that re-running the same file is instant, progress bars so users know the pipeline is running and not frozen, concurrency limits to prevent multiple heavy processes from running simultaneously, and automatic cleanup of converted_images/ and preprocessed_images/ to prevent unbounded disk growth.
3.3 — ViT integration
model_architecture.py defines a ViT-based model head but the current preprocessing outputs single-channel binary images, whereas ViT expects 3-channel RGB input with specific normalization. We will add a channel expansion adapter and normalization layer so the two components can be connected without shape errors.
3.4 — Tests and regression validation
We will add smoke tests covering the conversion and preprocessing scripts, a golden-document regression set (a fixed PDF where we know what the preprocessed output should look like), and integration tests that run the full pipeline on a small labeled sample and check that CER stays within an acceptable range.
Outcome of Phase 3: The system is reliable enough to deploy, testable enough to iterate on confidently, and hardened against the known failure modes documented in the Troubleshooting section.
PDF Conversion Fails — NameError: poppler_path
Cause: poppler_path is referenced before being initialized in scripts/pdf_to_images.py.
Fix: Initialize the variable at the top of the function:

```python
poppler_path = None  # Add this line before the if-check
```

Or ensure `pdftoppm` is on your system PATH so the variable is never needed.
Preprocessing Does Not Process Converted Pages
Cause: scripts/preprocess_images.py reads from ../pdf_files looking for *.jpg / *.png / *.tiff, not from converted_images/.
Workaround: Copy the converted page images from converted_images/ into pdf_files/, or change the source path in preprocess_images.py.
Permanent fix: Tracked in Roadmap Phase 1.1.
ground_truth.csv Causes pd.read_csv() to Fail
Cause: The repo-root ground_truth.csv contains Python code, not CSV rows.
Fix: Replace it with a valid CSV file:

```csv
filename,transcription
preprocessed_images/page_1.jpg,Your ground truth text here
```

Also note: `create_ground_truth.py` writes to `../ground_truth/ground_truth.csv`, while `ocr_model.py` expects `ground_truth.csv` at the repo root. Align these paths before training. Tracked in Roadmap Phase 2.1.
CRNN Training Loop Crashes on Batch Unpacking
Cause: DataLoader(..., collate_fn=lambda x: x) returns a list of samples, but the training loop unpacks batches as if images and labels were already separated tensors.
Fix: Implement a proper `collate_fn`:

```python
def ocr_collate_fn(batch):
    # Stack images into a (B, C, H, W) tensor
    images = torch.stack([item[0] for item in batch])
    labels = [item[1] for item in batch]
    # The CRNN downsamples width by 4, so each sample yields width/4 timesteps
    input_lengths = torch.full((len(batch),), images.shape[-1] // 4, dtype=torch.long)
    target_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
    # CTCLoss accepts targets as one concatenated 1-D tensor
    targets = torch.cat(labels)
    return images, targets, input_lengths, target_lengths
```

Tracked in Roadmap Phase 2.2.
ViT Model Input Shape Mismatch
Cause: ViTModel expects 3-channel RGB images with specific normalization. Preprocessing produces grayscale or binary single-channel images.
Fix: Add an input adapter before passing images to the ViT backbone:

```python
# Expand single-channel to 3-channel and normalize for ViT
x = x.repeat(1, 3, 1, 1)  # (B, 1, H, W) to (B, 3, H, W)
x = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])(x)
```

Tracked in Roadmap Phase 3.3.
| Layer | Technology | Role |
|---|---|---|
| UI | Streamlit | Interactive preprocessing lab |
| PDF Rendering | pdf2image + Poppler | PDF to high-DPI JPEG |
| Image Processing | OpenCV | Denoise, deskew, morph, sharpen |
| Thresholding | scikit-image | Sauvola adaptive binarization |
| Deep Learning | PyTorch | CRNN model and CTC training |
| Transformers | HuggingFace Transformers | ViT feature extractor |
| Data | pandas | Ground truth CSV loading |
```bash
python -m streamlit run app.py --server.port 8501
```

```nginx
# Nginx reverse proxy example
server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8501;
        proxy_set_header Host $host;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

Production recommendations:
- Run OCR model inference in a separate worker process to avoid blocking the Streamlit event loop
- Add upload size limits and disk cleanup policies for `converted_images/` and `preprocessed_images/`
- Use caching decorators (`@st.cache_data`) for conversion and preprocessing results
Distributed under the MIT License. See LICENSE for details.
Built by Amar Pawar