PDF-to-JSON Extractor with AI

title

emoji

colorFrom

colorTo

sdk

app_port

PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

📋 Table of Contents

Overview
Features
Technology Stack
Installation
Usage
Configuration
Author

🎯 Overview

This application converts PDF documents into structured JSON format using:

OpenAI GPT-4 Vision: For intelligent content extraction
Template-based extraction: Customizable JSON schemas for different document types
Streamlit UI: Interactive web interface for easy PDF processing
Docker support: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

✨ Features

AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
Template System: Pre-configured JSON templates for common document types
Batch Processing: Handle multiple PDFs efficiently
Image Preview: Visual confirmation of PDF pages before extraction
Format Validation: Ensures extracted JSON matches defined schema
Hugging Face Spaces: Ready for cloud deployment

🛠 Technology Stack

Python 3.9+ - Primary programming language
OpenAI API - GPT-4 Vision for intelligent extraction
pypdfium2 - PDF rendering and image conversion
Streamlit - Interactive web UI framework
Pillow (PIL) - Image processing
Pandas - Data manipulation

🚀 Installation

Prerequisites

Python 3.9 or higher
OpenAI API key (Get one here)

Setup

Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```
Install dependencies: ```bash pip install -r requirements.txt ```
Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```

💻 Usage

Command Line

```bash python extractor.py path/to/document.pdf ```

Streamlit Web UI

```bash streamlit run src/streamlit_app.py ```

Docker

```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```

⚙️ Configuration

Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).

🎓 Use Cases

HR & Recruitment: Batch process resume PDFs
Accounting: Extract invoice data
Data Entry: Automate form digitization
Document Management: Convert scanned documents to searchable JSON

🔒 Security & Privacy

Never commit API keys - use environment variables
PDFs are processed in-memory, not stored
Review OpenAI's data usage policies for compliance

👨‍💻 Author

Pradyumn Tendulkar

Data Science Graduate Student | ML Engineer

⭐ If you found this project helpful, please consider giving it a star!

📝 License: MIT

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
.streamlit		.streamlit
src		src
templates		templates
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Dockerfile		Dockerfile
README.md		README.md
extractor.py		extractor.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF-to-JSON Extractor with AI

📋 Table of Contents

🎯 Overview

✨ Features

🛠 Technology Stack

🚀 Installation

Prerequisites

Setup

💻 Usage

Command Line

Streamlit Web UI

Docker

⚙️ Configuration

🎓 Use Cases

🔒 Security & Privacy

👨‍💻 Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PDF-to-JSON Extractor with AI

📋 Table of Contents

🎯 Overview

✨ Features

🛠 Technology Stack

🚀 Installation

Prerequisites

Setup

💻 Usage

Command Line

Streamlit Web UI

Docker

⚙️ Configuration

🎓 Use Cases

🔒 Security & Privacy

👨‍💻 Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages