PDFStract — The Unified Data Preparation Layer for RAG

Extract. Chunk. Embed in one line of code.

One unified API. Switch between 10+ extraction libraries, 10+ chunking methods, and multiple embedding providers with a single parameter change. Focus on your RAG outcomes, not library dependencies.


Quick Start

from pdfstract import PDFStract

pdfstract = PDFStract()

# Complete pipeline: Extract → Chunk → Embed
result = pdfstract.convert_chunk_embed('document.pdf')

# Or step by step
text = pdfstract.convert('document.pdf', library='auto')
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=512)
vectors = pdfstract.embed_texts([c['text'] for c in chunks['chunks']])

# CLI: Full pipeline in one command
pdfstract convert-chunk-embed document.pdf --library auto --chunker auto --embedding auto

Installation

pip install pdfstract              # Base - pymupdf4llm, markitdown
pip install pdfstract[standard]    # + OCR (pytesseract, unstructured)
pip install pdfstract[advanced]    # + ML-powered (marker, docling, paddleocr)
pip install pdfstract[all]         # Everything

Why PDFStract?

No single PDF extractor, chunker, or embedding provider works best for every document.

PDFStract lets you swap, compare, and automate your data preparation strategy through a single API:

  • Extract: 10+ libraries (Marker, Docling, PyMuPDF4LLM, PaddleOCR, Unstructured, and more)
  • Chunk: 10+ methods (Token, Semantic, Sentence, Recursive, Code-aware, and more)
  • Embed: Multiple providers (OpenAI, Azure, Google, Ollama, Sentence Transformers)

Switch any component with a single parameter change. No code refactoring needed.

Python API

Extract

from pdfstract import PDFStract

pdfstract = PDFStract()

# Auto-select best available library
text = pdfstract.convert('document.pdf', library='auto')

# Use specific library
text = pdfstract.convert('document.pdf', library='marker')
text = pdfstract.convert('document.pdf', library='docling', output_format='json')

# Batch processing
results = pdfstract.batch_convert('./pdfs', library='pymupdf4llm', parallel_workers=4)

# Async
text = await pdfstract.convert_async('document.pdf', library='marker')

Chunk

# Token-based chunking
chunks = pdfstract.chunk(text, chunker='token', chunk_size=512, chunk_overlap=50)

# Semantic chunking
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=1024)

# Access results
for chunk in chunks['chunks']:
    print(f"Chunk {chunk['chunk_id']}: {chunk['token_count']} tokens")
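To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a standalone sketch of token chunking with overlap. This is an illustration of the general technique, not pdfstract's actual implementation: consecutive chunks share `chunk_overlap` tokens so that context is not lost at chunk boundaries.

```python
# Standalone illustration of token chunking with overlap (not pdfstract's
# internal implementation). Each chunk holds up to `chunk_size` tokens, and
# consecutive chunks share `chunk_overlap` tokens.
def token_chunk(tokens, chunk_size=512, chunk_overlap=50):
    step = chunk_size - chunk_overlap  # how far each chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            'chunk_id': len(chunks),
            'tokens': window,
            'token_count': len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end
    return chunks

# 1200 tokens with size 512 / overlap 50 -> chunks starting at 0, 462, 924
chunks = token_chunk(list(range(1200)), chunk_size=512, chunk_overlap=50)
```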

Embed

# Embed multiple texts
vectors = pdfstract.embed_texts(["First text", "Second text"], model='sentence-transformers')

# Embed single text
vector = pdfstract.embed_text("Hello world", model='openai')

# List available providers
providers = pdfstract.list_available_embeddings()
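The vectors returned by `embed_texts` are typically compared with cosine similarity to find the chunks most relevant to a query. A minimal sketch in plain Python, using hypothetical 3-dimensional vectors for readability (real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity between two embedding vectors: dot product divided by
# the product of their magnitudes. 1.0 means the vectors point the same way.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for embed_texts() output.
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

# Index of the most similar document vector.
best = max(range(len(docs)), key=lambda i: cosine_similarity(query, docs[i]))
```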

Combined Pipelines

# Convert + Chunk
result = pdfstract.convert_chunk('document.pdf', library='marker', chunker='semantic')

# Convert + Chunk + Embed (full RAG pipeline)
result = pdfstract.convert_chunk_embed(
    'document.pdf',
    library='docling',
    chunker='semantic',
    embedding='sentence-transformers'
)

# Each chunk has its embedding attached
for chunk in result['chunking_result']['chunks']:
    print(f"Chunk {chunk['chunk_id']}: {len(chunk['embedding'])} dimensions")

CLI

# List available tools
pdfstract libs
pdfstract chunkers
pdfstract embeddings-list

# Extract
pdfstract convert document.pdf --library marker

# Chunk
pdfstract convert-chunk document.pdf --library docling --chunker semantic

# Full pipeline
pdfstract convert-chunk-embed document.pdf --embedding sentence-transformers

# Batch processing
pdfstract batch ./pdfs --library pymupdf4llm --parallel 4

# Compare libraries
pdfstract compare sample.pdf -l marker -l docling -l pymupdf4llm

Web UI

git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
make up

Open http://localhost:3000 for the Web UI and http://localhost:8000 for the API.

PDFStract UI (screenshot)

What's Included

| Tier     | Libraries                                      |
|----------|------------------------------------------------|
| Base     | pymupdf4llm, markitdown                        |
| Standard | + pytesseract, unstructured                    |
| Advanced | + marker, docling, paddleocr, deepseek, mineru |

Chunkers: token, sentence, semantic, recursive, code, sdpm, late, slumber, neural

Embeddings: OpenAI, Azure OpenAI, Google, Ollama, Sentence Transformers, Model2Vec

Documentation

📖 pdfstract.com — Full documentation, guides, and API reference

Use Cases

  • RAG systems and knowledge bases
  • Document intelligence pipelines
  • LLM fine-tuning dataset preparation
  • Semantic search applications

Contributing

Contributions welcome! Fork, create a feature branch, and submit a pull request.

Support

Questions or issues? Open an issue on GitHub.


Made with ❤️ for AI RAG pipelines · GitHub · PyPI
