Skip to content

Add OpenAI two-pass extraction pipeline with QC, compaction, and review integration#241

Open
karilint wants to merge 14 commits intocodex/prepare-implementation-plan-for-recode-extractionfrom
codex/implement-openai-two-pass-trait-extraction-wyfifs
Open

Add OpenAI two-pass extraction pipeline with QC, compaction, and review integration#241
karilint wants to merge 14 commits intocodex/prepare-implementation-plan-for-recode-extractionfrom
codex/implement-openai-two-pass-trait-extraction-wyfifs

Conversation

@karilint
Copy link
Copy Markdown
Owner

@karilint karilint commented Mar 6, 2026

Motivation

  • Introduce a production-ready two-pass OpenAI extraction backend to detect raw evidence (PASS1) and produce ETS-structured trait records (PASS2) while reducing token use and noisy analytical content.
  • Improve robustness of the existing pipeline by adding QC/normalization, deterministic deduplication, timing instrumentation and safer handling of schema drift in importer code.
  • Surface PASS1/PASS2 artifacts and QC to the UI and persist necessary fields on extraction runs and assertions for review and downstream persistence.

Description

  • Add OpenAI adapter recode_extraction.adapters.openai_client.OpenAITwoPassClient and prompt builders in openai_two_pass_prompts.py, plus lightweight service exports in openai_two_pass.py.
  • Implement orchestration support in recode_extraction.services.orchestrator.RecodePipelineRunner with a new openai_two_pass path that runs per-page PASS1 calls, compacts evidence (pass1_compaction.py), runs PASS2, runs QC/normalization (qc.py), and persists candidate ExtractedAssertionModel rows; add timing logs and safer page-number handling.
  • Add trait vocabulary service trait_vocabulary.py to build abbreviation dictionaries and trait lists from the DB, cached for performance.
  • Extend data models and DB via migrations to store pass1_evidence_package, pass2_structured_package, qc_summary, qc_errors and extend unmapped_reason length.
  • Add UI and view updates to show PASS1/PASS2 availability, QC summary and allow selecting the OpenAI backend when enabled; add settings flags and many RECODE_OPENAI_* settings in settings.py.
  • Add review flow improvements to services.review to accept prefilled ets_payload when persisting approved assertions and to validate presence of review columns at runtime.
  • Tighten imports module to avoid import cycles and fix validation regex escaping in base_validation.py.
  • Harden BaseImporter.get_or_create_source_location to handle None values and database schema drift when coord_text lacks a default.
  • Add requirements for openai and pydantic and a collection of unit tests and fixtures covering adapter, compaction, QC/normalization, orchestrator flows, review persistence, and importer edge cases.

Testing

  • Ran unit tests for the new OpenAI pipeline and helpers including app/tests/recode_extraction/test_openai_two_pass_pipeline.py, test_openai_client_adapter.py, test_openai_qc_normalization.py, test_openai_schema.py, test_pass1_compaction.py, test_openai_two_pass_pipeline.py, and updated review/importer tests, and they all passed.
  • Ran existing orchestrator and review tests (app/tests/recode_extraction/test_orchestrator.py, test_qc_review.py) which succeeded and show timing logs and new persistence behaviors.
  • Adapter-level OpenAI calls are mocked via the mock_openai fixture in app/tests/recode_extraction/conftest.py to avoid live API calls during tests.

Codex Task

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant