PDF to Markdown conversion and quote-to-bbox resolution.
- Convert: Send PDF pages to a vision-capable LLM (via Pydantic AI) to produce clean Markdown with `<!--page-->` markers between pages.
- Resolve: Given verbatim quote strings, locate them in the source PDF and return bounding box coordinates. Uses pypdfium2 for per-character bbox extraction and seq-smith for Smith-Waterman alignment.
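
To make the Resolve step concrete, here is a minimal pure-Python sketch of the idea: run a Smith-Waterman local alignment of the quote against the page text, then merge the per-character boxes covered by the match. It is an illustration under stated assumptions, not groundmark's implementation. The `BBox` type, `local_align`, `union`, the scoring values, and the synthetic character boxes are all stand-ins; the real package uses seq-smith for alignment and pypdfium2 for the character boxes. The sketch also assumes a top-left origin, so `top` is numerically smaller than `bottom`.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    # Hypothetical stand-in for a (top, left, bottom, right) box.
    # Assumes a top-left origin: y grows downward, so top < bottom.
    top: float
    left: float
    bottom: float
    right: float


def local_align(quote, text, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment.

    Returns (score, start, end) so that text[start:end] is the best local match.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(quote) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if quote[i - 1] == text[j - 1] else mismatch
            cell = max(
                0,
                score[i - 1][j - 1] + sub,  # align the two characters
                score[i - 1][j] + gap,      # skip a quote character
                score[i][j - 1] + gap,      # skip a text character
            )
            score[i][j] = cell
            if cell > best:
                best, best_cell = cell, (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_cell
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if quote[i - 1] == text[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


def union(boxes):
    """Smallest box enclosing all the given boxes (top-left origin assumed)."""
    return BBox(
        min(b.top for b in boxes),
        min(b.left for b in boxes),
        max(b.bottom for b in boxes),
        max(b.right for b in boxes),
    )


# Synthetic page: the extracted text has a doubled space the quote lacks.
page_text = "The patient  presented with acute chest pain."
# One fake 5-unit-wide box per character, all on a single line.
char_boxes = [BBox(10.0, 5.0 * k, 20.0, 5.0 * k + 5.0) for k in range(len(page_text))]

quote = "patient presented with"
score, start, end = local_align(quote.lower(), page_text.lower())
quote_box = union(char_boxes[start:end])
print(repr(page_text[start:end]), quote_box)
```

Local alignment is a natural fit here because text extracted from a PDF rarely matches a quote byte for byte: the gap penalty lets the match absorb the stray space instead of failing outright. A real resolver would also need to split the result into one box per line or page rather than taking a single union.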

```python
import asyncio

from groundmark import DocumentIndex
from groundmark.convert import Config, convert


async def main():
    pdf_bytes = open("document.pdf", "rb").read()

    # PDF -> Markdown (requires pydantic-ai, install with e.g. groundmark[bedrock])
    result = await convert(pdf_bytes, Config(model="bedrock:au.anthropic.claude-sonnet-4-6"))
    print(result.markdown[:500])

    # Resolve verbatim quotes to PDF bounding boxes
    doc = DocumentIndex(pdf_bytes)
    resolved = doc.resolve(["the patient presented with"])
    # -> {"the patient presented with": [(page, BBox(top, left, bottom, right)), ...]}

    # The DocumentIndex can be reused for multiple resolve calls against the same PDF
    more = doc.resolve(["another quote from the same paper"])


if __name__ == "__main__":
    asyncio.run(main())
```

```bash
# Resolve only (no LLM dependencies)
uv add groundmark

# With LLM provider extra(s) for conversion
uv add groundmark --extra anthropic,bedrock,google,openai
```

The LLM call for PDF-to-Markdown conversion can take several minutes for large documents, especially with Opus on Bedrock. Timeout defaults by provider:

| Provider | Default | Environment Variable |
|---|---|---|
| Bedrock (boto3) | 300s | AWS_READ_TIMEOUT |
| Anthropic (httpx) | 600s | — (use ModelSettings(timeout=...)) |

For Bedrock with Opus, 300s may not be enough. Set a higher timeout:

```bash
export AWS_READ_TIMEOUT=600
```

This project is licensed under the MIT License - see the LICENSE file for details.
