PDFStract — The Unified Data Preparation Layer for RAG

Extract. Chunk. Embed in one line of code.

One unified API. Switch between 10+ extraction libraries, 10+ chunking methods, and multiple embedding providers with a single parameter change. Focus on your RAG outcomes, not library dependencies.


Quick Start

from pdfstract import PDFStract

pdfstract = PDFStract()

# Complete pipeline: Extract → Chunk → Embed
result = pdfstract.convert_chunk_embed('document.pdf')

# Or step by step
text = pdfstract.convert('document.pdf', library='auto')
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=512)
vectors = pdfstract.embed_texts([c['text'] for c in chunks['chunks']])

# CLI: Full pipeline in one command
pdfstract convert-chunk-embed document.pdf --library auto --chunker auto --embedding auto

Installation

pip install pdfstract              # Base - pymupdf4llm, markitdown
pip install pdfstract[standard]    # + OCR (pytesseract, unstructured)
pip install pdfstract[advanced]    # + ML-powered (marker, docling, paddleocr)
pip install pdfstract[all]         # Everything

Why PDFStract?

No single PDF extractor, chunker, or embedding provider works best for every document.

PDFStract lets you swap, compare, and automate your data preparation strategy through a single API:

  • Extract: 10+ libraries (Marker, Docling, PyMuPDF4LLM, PaddleOCR, Unstructured, and more)
  • Chunk: 10+ methods (Token, Semantic, Sentence, Recursive, Code-aware, and more)
  • Embed: Multiple providers (OpenAI, Azure, Google, Ollama, Sentence Transformers)

Switch any component with a single parameter change. No code refactoring needed.

Python API

Extract

from pdfstract import PDFStract

pdfstract = PDFStract()

# Auto-select best available library
text = pdfstract.convert('document.pdf', library='auto')

# Use specific library
text = pdfstract.convert('document.pdf', library='marker')
text = pdfstract.convert('document.pdf', library='docling', output_format='json')

# Batch processing
results = pdfstract.batch_convert('./pdfs', library='pymupdf4llm', parallel_workers=4)

# Async
text = await pdfstract.convert_async('document.pdf', library='marker')

Chunk

# Token-based chunking
chunks = pdfstract.chunk(text, chunker='token', chunk_size=512, chunk_overlap=50)

# Semantic chunking
chunks = pdfstract.chunk(text, chunker='semantic', chunk_size=1024)

# Access results
for chunk in chunks['chunks']:
    print(f"Chunk {chunk['chunk_id']}: {chunk['token_count']} tokens")
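To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a standalone sketch of token chunking with overlap. This is an illustration of the general technique, not pdfstract's actual implementation: consecutive chunks share `chunk_overlap` tokens so that context is not lost at chunk boundaries.

```python
# Standalone illustration of token chunking with overlap (not pdfstract's
# internal implementation). Each chunk holds up to `chunk_size` tokens, and
# consecutive chunks share `chunk_overlap` tokens.
def token_chunk(tokens, chunk_size=512, chunk_overlap=50):
    step = chunk_size - chunk_overlap  # how far each chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            'chunk_id': len(chunks),
            'tokens': window,
            'token_count': len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end
    return chunks

# 1200 tokens with size 512 / overlap 50 -> chunks starting at 0, 462, 924
chunks = token_chunk(list(range(1200)), chunk_size=512, chunk_overlap=50)
```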

Embed

# Embed multiple texts
vectors = pdfstract.embed_texts(["First text", "Second text"], model='sentence-transformers')

# Embed single text
vector = pdfstract.embed_text("Hello world", model='openai')

# List available providers
providers = pdfstract.list_available_embeddings()
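The vectors returned by `embed_texts` are typically compared with cosine similarity to find the chunks most relevant to a query. A minimal sketch in plain Python, using hypothetical 3-dimensional vectors for readability (real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity between two embedding vectors: dot product divided by
# the product of their magnitudes. 1.0 means the vectors point the same way.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for embed_texts() output.
query = [1.0, 0.0, 0.0]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]

# Index of the most similar document vector.
best = max(range(len(docs)), key=lambda i: cosine_similarity(query, docs[i]))
```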

Combined Pipelines

# Convert + Chunk
result = pdfstract.convert_chunk('document.pdf', library='marker', chunker='semantic')

# Convert + Chunk + Embed (full RAG pipeline)
result = pdfstract.convert_chunk_embed(
    'document.pdf',
    library='docling',
    chunker='semantic',
    embedding='sentence-transformers'
)

# Each chunk has its embedding attached
for chunk in result['chunking_result']['chunks']:
    print(f"Chunk {chunk['chunk_id']}: {len(chunk['embedding'])} dimensions")

CLI

# List available tools
pdfstract libs
pdfstract chunkers
pdfstract embeddings-list

# Extract
pdfstract convert document.pdf --library marker

# Chunk
pdfstract convert-chunk document.pdf --library docling --chunker semantic

# Full pipeline
pdfstract convert-chunk-embed document.pdf --embedding sentence-transformers

# Batch processing
pdfstract batch ./pdfs --library pymupdf4llm --parallel 4

# Compare libraries
pdfstract compare sample.pdf -l marker -l docling -l pymupdf4llm

Web UI

git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
make up

Open http://localhost:3000 for the Web UI and http://localhost:8000 for the API.

PDFStract UI (screenshot)

What's Included

| Tier     | Libraries                                      |
|----------|------------------------------------------------|
| Base     | pymupdf4llm, markitdown                        |
| Standard | + pytesseract, unstructured                    |
| Advanced | + marker, docling, paddleocr, deepseek, mineru |

Chunkers: token, sentence, semantic, recursive, code, sdpm, late, slumber, neural

Embeddings: OpenAI, Azure OpenAI, Google, Ollama, Sentence Transformers, Model2Vec

Documentation

📖 pdfstract.com — Full documentation, guides, and API reference

Use Cases

  • RAG systems and knowledge bases
  • Document intelligence pipelines
  • LLM fine-tuning dataset preparation
  • Semantic search applications

Contributing

Contributions welcome! Fork, create a feature branch, and submit a pull request.

Support

Questions or issues? Open an issue on GitHub.


Made with ❤️ for AI RAG pipelines · GitHub · PyPI
