Skip to content

pradyten/pdf-extractor

Repository files navigation

title emoji colorFrom colorTo sdk app_port tags pinned short_description
Pdf Extractor
🚀
red
red
docker
8501
streamlit
false
pdf_extractor

PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

📋 Table of Contents

🎯 Overview

This application converts PDF documents into structured JSON format using:

  • OpenAI GPT-4 Vision: For intelligent content extraction
  • Template-based extraction: Customizable JSON schemas for different document types
  • Streamlit UI: Interactive web interface for easy PDF processing
  • Docker support: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

✨ Features

  • AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
  • Template System: Pre-configured JSON templates for common document types
  • Batch Processing: Handle multiple PDFs efficiently
  • Image Preview: Visual confirmation of PDF pages before extraction
  • Format Validation: Ensures extracted JSON matches defined schema
  • Hugging Face Spaces: Ready for cloud deployment

🛠 Technology Stack

  • Python 3.9+ - Primary programming language
  • OpenAI API - GPT-4 Vision for intelligent extraction
  • pypdfium2 - PDF rendering and image conversion
  • Streamlit - Interactive web UI framework
  • Pillow (PIL) - Image processing
  • Pandas - Data manipulation

🚀 Installation

Prerequisites

Setup

  1. Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```

  2. Install dependencies: ```bash pip install -r requirements.txt ```

  3. Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```

💻 Usage

Command Line

```bash python extractor.py path/to/document.pdf ```

Streamlit Web UI

```bash streamlit run src/streamlit_app.py ```

Docker

```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```

⚙️ Configuration

Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).

🎓 Use Cases

  • HR & Recruitment: Batch process resume PDFs
  • Accounting: Extract invoice data
  • Data Entry: Automate form digitization
  • Document Management: Convert scanned documents to searchable JSON

🔒 Security & Privacy

  • Never commit API keys - use environment variables
  • PDFs are processed in-memory, not stored
  • Review OpenAI's data usage policies for compliance

👨‍💻 Author

Pradyumn Tendulkar

Data Science Graduate Student | ML Engineer


⭐ If you found this project helpful, please consider giving it a star!

📝 License: MIT

About

pdf-extractor

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors