Variant literature assessment pipeline with AI extraction.
Flowa is a single async pipeline that processes genetic variant literature:
query → download → convert → extract → aggregate
- Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
- Download: Fetch PDFs from PMC (main article + supplements)
- Convert: PDF → Markdown via groundmark (LLM-based conversion)
- Extract: Per-paper evidence extraction via LLM
- Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via groundmark
Papers are processed in parallel. LLM concurrency is controlled via --llm-concurrency.
# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123| Variable | Description | Example |
|---|---|---|
FLOWA_STORAGE_BASE |
Storage path for PDFs, extractions, results | s3://bucket, gs://bucket, file:///path |
FLOWA_CONVERT_MODEL |
LLM for PDF→Markdown conversion (groundmark) | bedrock:au.anthropic.claude-sonnet-4-6 |
FLOWA_EXTRACTION_MODEL |
LLM for extraction and aggregation | bedrock:au.anthropic.claude-opus-4-6 |
Models use pydantic-ai format. Examples:
- AWS Bedrock:
bedrock:au.anthropic.claude-sonnet-4-6(convert),bedrock:au.anthropic.claude-opus-4-6(extraction) - Google Gemini:
google-gla:gemini-3-pro - OpenAI:
openai:gpt-5.2
Provider credentials:
| Provider | Required Variables |
|---|---|
| AWS Bedrock | AWS_PROFILE + AWS_REGION, or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION |
| Google Gemini | GOOGLE_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Backend | FLOWA_STORAGE_BASE |
Additional Variables |
|---|---|---|
| AWS S3 | s3://bucket-name |
AWS credentials (see above) |
| Google Cloud Storage | gs://bucket-name |
GOOGLE_APPLICATION_CREDENTIALS or workload identity |
| S3-compatible (MinIO) | s3://bucket-name |
FSSPEC_S3_ENDPOINT_URL, FSSPEC_S3_KEY, FSSPEC_S3_SECRET |
| Local filesystem | file:///path |
— |
Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.
| Variable | Description | Default |
|---|---|---|
FLOWA_PROMPT_SET |
Name of the prompt set directory to use | generic |
prompts/{prompt_set}/
├── extraction_prompt.txt # Prompt template for individual paper extraction
├── extraction_schema.py # Pydantic model defining ExtractionResult
├── aggregate_prompt.txt # Prompt template for cross-paper aggregation
└── aggregate_schema.py # Pydantic model defining AggregateResult
Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:
extraction_schema.py must define ExtractionResult with:
evidence[].citations[].quote(str) — verbatim quote from the paper
aggregate_schema.py must define AggregateResult with:
results[category].citations[].paper_id(str) — paper identifierresults[category].citations[].quote(str) — verbatim quote resolved to PDF bounding boxes
All other fields can be customized freely. See prompts/generic/ for the default implementation.
The pipeline uses a unified citation format:
[display text](#cite:paperId "verbatim quote to highlight")paperId= AuthorYear label (e.g.,Smith2024) frompaper_id_mapping- The title attribute carries a verbatim quote that scopes the PDF highlight
- Display text is free-form
During aggregation, quotes are resolved against each paper's source PDF (via groundmark.DocumentIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
papers/{encoded_doi}/
source.pdf # Downloaded PDF
markdown.md # LLM-generated Markdown
metadata.json # PubMed metadata (title, authors, date, etc.)
assessments/{variant_id}/
workflow.json # Pipeline run metadata
variant_details.json # VariantValidator output
query.json # Query results (DOI list)
aggregate.json # Aggregated assessment with pre-resolved bboxes
aggregate_raw.json # Raw LLM conversation
extractions/
{encoded_doi}.json # Per-paper extraction (quotes + commentary)
{encoded_doi}_raw.json # Raw LLM conversation
export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvardocker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
-e FLOWA_STORAGE_BASE=s3://bucket \
-e FLOWA_CONVERT_MODEL=bedrock:au.anthropic.claude-sonnet-4-6 \
-e FLOWA_EXTRACTION_MODEL=bedrock:au.anthropic.claude-opus-4-6 \
-e AWS_REGION=ap-southeast-2 \
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvarCreate a job definition with the flowa container image. A typical run processes up to 50 papers with LLM calls for conversion, extraction, and aggregation — allow sufficient time and retries.
aws batch register-job-definition \
--job-definition-name flowa-worker \
--type container \
--container-properties '{
"image": "123456789.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
"resourceRequirements": [
{"type": "VCPU", "value": "2"},
{"type": "MEMORY", "value": "8192"}
],
"environment": [
{"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
{"name": "FLOWA_CONVERT_MODEL", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
{"name": "FLOWA_EXTRACTION_MODEL", "value": "bedrock:au.anthropic.claude-opus-4-6"}
]
}' \
--retry-strategy '{"attempts": 2}' \
--timeout '{"attemptDurationSeconds": 3600}'Submit a job:
aws batch submit-job \
--job-name "flowa-VAR123" \
--job-definition flowa-worker \
--job-queue flowa-queue \
--container-overrides '{
"command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
}'