pdfdeal

🗺️ ENGLISH | 简体中文

Handle PDFs more easily, using Doc2X's document conversion capabilities for format-preserving file conversion and RAG enhancement.

Introduction

Doc2X Support

Doc2X is a new universal document OCR tool that converts images or PDF files into Markdown/LaTeX text while preserving formulas and text formatting. It performs better than similar tools in most scenarios. pdfdeal provides wrapper classes for sending requests to Doc2X.

Processing PDFs

Use various OCR or PDF recognition tools to recognize images and add the recognized text to the original text. You can set the output format to PDF, which ensures that the recognized text keeps the same page numbers as the original in the new PDF. pdfdeal also offers various practical file processing tools.

After converting and pre-processing PDFs with Doc2X, you can achieve better recognition rates in knowledge base applications such as graphrag, Dify, and FastGPT.

Markdown Document Processing Features

pdfdeal also provides a series of powerful tools to handle Markdown documents:

Convert HTML tables to Markdown format: Allows conversion of HTML formatted tables to Markdown format for easy use in Markdown documents.
Upload images to remote storage services: Supports uploading local or online images in Markdown documents to remote storage services to ensure image persistence and accessibility.
Convert online images to local images: Allows downloading and converting online images in Markdown documents to local images for offline use.
Document splitting and separator addition: Supports splitting Markdown documents by headings or adding separators within documents for better organization and management.
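
As an illustration of the first feature in the list above, here is a minimal sketch of converting an HTML table to a Markdown table using only the Python standard library. It is not pdfdeal's implementation; the names `_TableParser` and `html_table_to_markdown` are hypothetical, and the sketch assumes a simple table without nested tables, `rowspan`, or `colspan`.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collect the cell text of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    """Return a Markdown table; the first HTML row is used as the header."""
    parser = _TableParser()
    parser.feed(html)
    rows = parser.rows
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "<table><tr><th>Name</th><th>Score</th></tr><tr><td>A</td><td>1</td></tr></table>"
    print(html_table_to_markdown(sample))
```

For the actual pdfdeal functions and their parameters, refer to the documentation.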

For detailed feature introduction and usage, please refer to the documentation link.

Cases

graphrag

See how to use it with graphrag. graphrag does not support PDF input directly, but you can use the doc2x CLI tool to convert a PDF to a txt document for it.

FastGPT/Dify or other RAG systems

For knowledge base applications, you can use pdfdeal's built-in document enhancements, such as uploading images to remote storage services or adding breaks by paragraph. See Integration with RAG applications.
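
To make the idea of adding breaks concrete, the sketch below inserts a separator line before every Markdown heading so that a RAG system can split the document on that marker. It is an illustration only, not pdfdeal's implementation: the function names and the `<!--split-->` separator string are hypothetical choices.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")  # ATX headings such as "# Title" or "## Part"
FENCE = "`" * 3                      # opening/closing marker of a fenced code block


def add_separators(markdown: str, separator: str = "<!--split-->") -> str:
    """Insert `separator` before every heading except the first line."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # ignore "#" lines inside code blocks
        elif not in_fence and HEADING.match(line) and out:
            out.append(separator)
            out.append("")
        out.append(line)
    return "\n".join(out)


def split_on_separator(markdown: str, separator: str = "<!--split-->") -> list:
    """Split a separated document into non-empty, stripped chunks."""
    return [chunk.strip() for chunk in markdown.split(separator) if chunk.strip()]


if __name__ == "__main__":
    doc = "# A\ntext\n## B\nmore"
    for chunk in split_on_separator(add_separators(doc)):
        print(repr(chunk))
```

Splitting on headings keeps each chunk topically coherent, which tends to help retrieval; the separator should match whatever delimiter your knowledge base application is configured to split on.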

Documentation

For details, please refer to the documentation.

Or check out the documentation repository pdfdeal-docs.

Quick Start

For details, please refer to the documentation.

Installation

Install using pip:

pip install --upgrade pdfdeal

If you need document processing tools:

pip install --upgrade "pdfdeal[rag]"

Use the Doc2X PDF API to process all PDF files in a specified folder

from pdfdeal import Doc2X

client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
    model="v3-2026",  # optional, default is server-side v2
    formula_level=1,  # optional: 0(default/recommended)=keep formulas; 1=inline formulas -> text; 2=all formulas (inline+block) -> text
)
print(success)
print(failed)
print(flag)
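
The three return values can be turned into a short report. The sketch below is a hypothetical helper, not part of pdfdeal, and it rests on assumptions about the return shapes that you should verify against the documentation: `success` is taken to be a list with one entry per input file (an empty string where conversion failed), `failed` a list of dicts with `"error"` and `"path"` keys, and `flag` a boolean that is true when any file failed.

```python
def summarize_results(success, failed, flag):
    """Summarize pdf2file-style return values (shapes are assumptions)."""
    # Keep only non-empty success entries.
    converted = [item for item in success if item]
    # Keep only failure entries that actually carry an error message.
    errors = [
        (entry.get("path", ""), entry.get("error", ""))
        for entry in failed
        if isinstance(entry, dict) and entry.get("error")
    ]
    return {"converted": converted, "errors": errors, "had_error": bool(flag)}


if __name__ == "__main__":
    # Made-up sample values standing in for a real pdf2file call.
    report = summarize_results(
        ["./Output/a.docx", ""],
        [{"error": "", "path": ""}, {"error": "timeout", "path": "tests/pdf/b.pdf"}],
        True,
    )
    print(report)
```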

Use the Doc2X PDF API to process the specified PDF file and specify the name of the exported file

from pdfdeal import Doc2X

client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/single/pdf2file",
    output_names=["sample1.zip"],
    output_format="md_dollar",
)
print(success)
print(failed)
print(flag)

V3 JSON updates

When model="v3-2026":

output_format="json" now saves the raw Doc2X v3 JSON (result.pages...) instead of the legacy simplified [{text, location}] structure.
Raw v3 JSON is always saved as a sidecar .json file, even when output_format does not include json (for example text, detailed, md, docx).
If output_format includes json, the sidecar JSON name follows the json slot in output_names.
If output_format does not include json, the sidecar JSON name follows the first non-empty entry in output_names.
If output_names is omitted, the sidecar JSON falls back to the original PDF basename.
The deprecated direct upload is no longer used. oss_choose="always" and oss_choose="auto" both use the preupload API. oss_choose="never" and oss_choose="none" now raise an error.

Example:

from pdfdeal import Doc2X

client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/v3",
    output_format="text,json",
    output_names=[["plain.txt", "viz.data"]],
    model="v3-2026",
)
print(success)  # ["page text...", "./Output/test/v3/viz.json"]
print(failed)
print(flag)
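
The sidecar naming rules listed in this section can be written out as a small function. This is a sketch of the rules as stated, not pdfdeal's code: `sidecar_json_name` is a hypothetical name, `output_names` here is the per-file list of names (one slot per format), and replacing the chosen name's extension with `.json` is inferred from the example above, where `viz.data` yields `viz.json`. Falling back to the PDF basename when the json slot is empty is also an assumption.

```python
from pathlib import Path


def sidecar_json_name(pdf_path, output_format, output_names=None):
    """Resolve the sidecar JSON file name for one PDF (illustrative only)."""
    formats = [fmt.strip() for fmt in output_format.split(",")]
    names = list(output_names or [])
    chosen = ""
    if "json" in formats:
        # Rule 1: use the name in the same slot as "json".
        slot = formats.index("json")
        if slot < len(names):
            chosen = names[slot] or ""
    else:
        # Rule 2: use the first non-empty name.
        chosen = next((name for name in names if name), "")
    # Rule 3: with no usable name, fall back to the PDF basename.
    stem = Path(chosen).stem if chosen else Path(pdf_path).stem
    return stem + ".json"


if __name__ == "__main__":
    print(sidecar_json_name("tests/pdf/sample.pdf", "text,json", ["plain.txt", "viz.data"]))
    print(sidecar_json_name("tests/pdf/sample.pdf", "docx", ["report.docx"]))
    print(sidecar_json_name("tests/pdf/sample.pdf", "md"))
```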

Helper scripts for v3 figure/table crops

Two helper scripts were added under scripts/:

extract_v3_figures.py: extract figure crops from a PDF using Doc2X v3 JSON
extract_v3_tables.py: extract table crops from a PDF using Doc2X v3 JSON

Both scripts:

validate that the v3 JSON matches the crop rules first
render only pages containing target blocks with fitz at the requested dpi
save full-page PNGs under _pages/
crop target regions using the block bbox/xyxy and page coordinates from the v3 JSON
write manifest.json with crop metadata
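
The cropping step depends on mapping a block's box from the JSON's page coordinate space onto the rendered page image. The sketch below shows that arithmetic only; it is not the scripts' code. It assumes the box is given as `[x0, y0, x1, y1]` in the same coordinate space as a page width and height taken from the v3 JSON, and that rendering at a given dpi scales PDF points by `dpi / 72`. The function names are hypothetical, and the actual v3 JSON field names are not shown here.

```python
def rendered_size(page_points, dpi):
    """Approximate pixel size of a page rendered at `dpi` (72 points per inch)."""
    width_pt, height_pt = page_points
    return (int(round(width_pt * dpi / 72)), int(round(height_pt * dpi / 72)))


def bbox_to_pixels(xyxy, page_size, image_size):
    """Scale an xyxy box from page coordinates to image pixels, clamped to the image."""
    x0, y0, x1, y1 = xyxy
    page_w, page_h = page_size
    img_w, img_h = image_size
    sx, sy = img_w / page_w, img_h / page_h

    def clamp(value, upper):
        return max(0, min(upper, int(round(value))))

    # min/max make the result valid even if the corners arrive swapped.
    left = clamp(min(x0, x1) * sx, img_w)
    top = clamp(min(y0, y1) * sy, img_h)
    right = clamp(max(x0, x1) * sx, img_w)
    bottom = clamp(max(y0, y1) * sy, img_h)
    return (left, top, right, bottom)


if __name__ == "__main__":
    image = rendered_size((612, 792), dpi=200)  # US Letter page in points
    print(image)
    print(bbox_to_pixels([100, 200, 300, 400], page_size=(850, 1100), image_size=image))
```

The resulting tuple is in the left, top, right, bottom order that image libraries such as Pillow expect for a crop box.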

Examples:

python scripts/extract_v3_figures.py \
  --pdf /path/to/input.pdf \
  --v3-json /path/to/input_v3.json \
  --dpi 200 \
  --output-dir ./Output/figures

python scripts/extract_v3_tables.py \
  --pdf /path/to/input.pdf \
  --v3-json /path/to/input_v3.json \
  --dpi 200 \
  --output-dir ./Output/tables

You can also import the helpers directly:

from pdfdeal import extract_v3_figure_images, extract_v3_table_images

figure_summary = extract_v3_figure_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/figures",
)
table_summary = extract_v3_table_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/tables",
)
print(figure_summary["crop_count"], figure_summary["manifest_path"])
print(table_summary["crop_count"], table_summary["manifest_path"])

See the online documentation for details.
