
Implement OpenAI two‑pass ETS extraction pipeline with QC, review integration, migrations and tests#239

Open
karilint wants to merge 10 commits into codex/prepare-implementation-plan-for-recode-extraction from codex/implement-openai-two-pass-trait-extraction

@karilint karilint commented Mar 5, 2026

Motivation

  • Replace the brittle regex/legacy extractor with a production two‑pass OpenAI (Responses API) workflow (PASS1 evidence → PASS2 ETS) and add deterministic QC before any ETS import.
  • Preserve legacy baseline extractor behind flags and keep the existing review/import UI but adapt it to accept LLM‑generated ETS payloads and re‑normalize after curator edits.
  • Build a DB‑driven trait vocabulary (abbr. dictionary + trait list) with TTL caching to prime LLM prompts and preserve page provenance for auditability.
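The TTL caching idea can be sketched as follows. This is a minimal illustration, not the code in trait_vocabulary.py: the class name `TTLCache`, the `loader` and `bootstrap` parameters, and the choice to fall back to the last good value when the loader fails are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Caches one loaded vocabulary for ttl_seconds (hypothetical helper)."""

    def __init__(
        self,
        loader: Callable[[], dict],
        ttl_seconds: float,
        bootstrap: Optional[dict] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader          # e.g. a function that queries the DB
        self._ttl = ttl_seconds
        self._bootstrap = bootstrap or {}
        self._clock = clock            # injectable so tests can fake time
        self._value: Optional[dict] = None
        self._loaded_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        expired = now - self._loaded_at >= self._ttl
        if self._value is None or expired:
            try:
                self._value = self._loader()
            except Exception:
                # Assumed policy: keep the last good value, else use bootstrap.
                if self._value is None:
                    self._value = self._bootstrap
            self._loaded_at = now
        return self._value
```

With `ttl_seconds=6 * 3600`, repeated `get()` calls within six hours reuse the cached vocabulary and only the first call after expiry reloads it.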

Description

  • Added a production OpenAI adapter using Structured Outputs and pydantic models with retries/backoff in app/recode_extraction/adapters/openai_client.py and prompt templates in app/recode_extraction/services/openai_two_pass_prompts.py.
  • Implemented a DB‑driven trait vocabulary service with 6‑hour cache and bootstrap fallback in app/recode_extraction/services/trait_vocabulary.py.
  • Implemented deterministic QC/normalization (range, mean±SD, point parsing, dedupe, ETS validation and provenance tagging) in app/recode_extraction/services/qc.py and wired it into the pipeline.
  • Extended the orchestrator to run the openai_two_pass backend: per‑page PASS1 evidence extraction, persistence of the merged evidence, PASS2 structuring, then QC, after which candidates are created with ets_payload in ExtractedAssertionModel. The legacy pipeline is preserved in _run_legacy_pipeline (see app/recode_extraction/services/orchestrator.py).
  • Adapted review/import flow so persist_approved_assertions_to_ets will import prefilled ets_payload (and re‑normalize when curators edit values/units) in app/recode_extraction/services/review.py.
  • Added model fields and a migration to persist artifacts and QC results: pass1_evidence_package, pass2_structured_package and qc_summary on SourceExtractionRun, plus qc_errors and a larger unmapped_reason on ExtractedAssertionModel (app/recode_extraction/migrations/0006_openai_two_pass_fields.py, app/recode_extraction/models.py).
  • Exposed backend selection and QC indicators in views/templates and added settings toggles and OpenAI config variables in app/config/settings.py and app/recode_extraction/templates/*.
  • Added tests that mock OpenAI and PDF text extraction to validate orchestration and QC behavior in app/tests/recode_extraction/test_openai_two_pass_pipeline.py and app/tests/recode_extraction/test_openai_qc_normalization.py, and updated review tests to cover prefilled ETS payload persistence.
  • Updated docs (docs/recode_integration.md) and app/requirements.txt (openai, pydantic), and kept the NE/RE graph pipeline out of production paths.
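To make the deterministic QC step concrete, here is a rough sketch of range, mean±SD and point parsing with dedupe. It is not the code in qc.py: the function names `parse_value` and `dedupe`, the output keys, and the accepted separators are assumptions chosen for illustration.

```python
import re
from typing import List, Optional

_NUM = r"[-+]?\d+(?:\.\d+)?"
_MEAN_SD = re.compile(rf"^\s*({_NUM})\s*(?:±|\+/-)\s*({_NUM})\s*$")
_RANGE = re.compile(rf"^\s*({_NUM})\s*(?:-|–|to)\s*({_NUM})\s*$")
_POINT = re.compile(rf"^\s*({_NUM})\s*$")


def parse_value(text: str) -> Optional[dict]:
    """Classify a raw value string; return None when it fails QC."""
    m = _MEAN_SD.match(text)
    if m:
        return {"kind": "mean_sd", "mean": float(m.group(1)), "sd": float(m.group(2))}
    m = _RANGE.match(text)
    if m:
        low, high = float(m.group(1)), float(m.group(2))
        if low > high:
            return None  # inverted range is treated as a QC error
        return {"kind": "range", "min": low, "max": high}
    m = _POINT.match(text)
    if m:
        return {"kind": "point", "value": float(m.group(1))}
    return None


def dedupe(records: List[dict]) -> List[dict]:
    """Drop exact duplicate records while preserving first-seen order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Because the rules are plain regular expressions with no model call, the same input always yields the same result, which is what allows QC to gate ETS import deterministically.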

Testing

  • Static/compile checks: ran python -m py_compile on new modules and related tests and they passed locally in this environment.
  • Unit tests (mocked OpenAI): added app/tests/recode_extraction/test_openai_two_pass_pipeline.py and app/tests/recode_extraction/test_openai_qc_normalization.py, which mock the OpenAI client and PdfToTextService so that no network calls occur.
  • Full test run: attempted with pytest --ds=config.settings ... but the Django test harness failed during DB setup because the MySQL NAME setting was missing/None (the CI/local DB config is not available in this environment), so the automated Django tests could not complete. The failures are environmental, not logic failures in the added code.
  • Migration: created 0006_openai_two_pass_fields.py to add the JSON fields and qc_errors; the makemigrations run was prepared and the migration file is committed.
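The mocking approach used in the unit tests can be illustrated with a small standalone sketch. The function `run_two_pass` and the client method names `extract_evidence` and `structure_ets` are hypothetical stand-ins, not the actual orchestrator or adapter API.

```python
from unittest.mock import MagicMock


def run_two_pass(client, pages):
    """Hypothetical stand-in: PASS1 once per page, then a single PASS2 call."""
    evidence = [client.extract_evidence(page) for page in pages]
    return client.structure_ets(evidence)


def test_two_pass_uses_mocked_client():
    # The mock replaces the OpenAI adapter, so no network call is made.
    client = MagicMock()
    client.extract_evidence.side_effect = lambda page: {"page_text": page}
    client.structure_ets.return_value = {"records": []}

    result = run_two_pass(client, ["page one", "page two"])

    assert client.extract_evidence.call_count == 2  # one PASS1 call per page
    client.structure_ets.assert_called_once()       # one PASS2 call overall
    assert result == {"records": []}
```

A test of this shape needs neither an API key nor a database, so it can run even where the Django DB setup is unavailable.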

Files of primary interest: app/recode_extraction/adapters/openai_client.py, app/recode_extraction/services/{openai_two_pass_prompts.py,trait_vocabulary.py,qc.py,openai_two_pass.py}, app/recode_extraction/services/orchestrator.py, app/recode_extraction/services/review.py, app/recode_extraction/models.py, app/recode_extraction/migrations/0006_openai_two_pass_fields.py, tests under app/tests/recode_extraction/ and docs update docs/recode_integration.md.

If you want, I can: run the test matrix inside a configured CI (with DB), trim or extend prompt examples, or split the PR into smaller PRs (prompts/client + orchestrator + QC + UI) for easier review.

