| title | emoji | colorFrom | colorTo | sdk | app_port | tags | pinned | short_description | |
|---|---|---|---|---|---|---|---|---|---|
Pdf Extractor |
🚀 |
red |
red |
docker |
8501 |
|
false |
pdf_extractor |
Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.
This application converts PDF documents into structured JSON format using:
- OpenAI GPT-4 Vision: For intelligent content extraction
- Template-based extraction: Customizable JSON schemas for different document types
- Streamlit UI: Interactive web interface for easy PDF processing
- Docker support: Containerized deployment for production environments
Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.
- AI-Powered Extraction: Uses GPT-4 Vision to understand document structure
- Template System: Pre-configured JSON templates for common document types
- Batch Processing: Handle multiple PDFs efficiently
- Image Preview: Visual confirmation of PDF pages before extraction
- Format Validation: Ensures extracted JSON matches defined schema
- Hugging Face Spaces: Ready for cloud deployment
- Python 3.9+ - Primary programming language
- OpenAI API - GPT-4 Vision for intelligent extraction
- pypdfium2 - PDF rendering and image conversion
- Streamlit - Interactive web UI framework
- Pillow (PIL) - Image processing
- Pandas - Data manipulation
- Python 3.9 or higher
- OpenAI API key (Get one here)
-
Clone the repository: ```bash git clone https://github.com/pradyten/pdf-extractor.git cd pdf-extractor ```
-
Install dependencies: ```bash pip install -r requirements.txt ```
-
Configure OpenAI API key: ```bash export OPENAI_API_KEY='your-api-key-here' ```
```bash python extractor.py path/to/document.pdf ```
```bash streamlit run src/streamlit_app.py ```
```bash docker build -t pdf-extractor . docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor ```
Define custom templates in `extractor.py` for different document types (resumes, invoices, forms).
- HR & Recruitment: Batch process resume PDFs
- Accounting: Extract invoice data
- Data Entry: Automate form digitization
- Document Management: Convert scanned documents to searchable JSON
- Never commit API keys - use environment variables
- PDFs are processed in-memory, not stored
- Review OpenAI's data usage policies for compliance
Pradyumn Tendulkar
Data Science Graduate Student | ML Engineer
- GitHub: @pradyten
- LinkedIn: Pradyumn Tendulkar
- Email: pktendulkar@wpi.edu
⭐ If you found this project helpful, please consider giving it a star!
📝 License: MIT