Add OpenAI two-pass extraction pipeline with QC, compaction, and review integration by karilint · Pull Request #241 · karilint/mammalbase

karilint · 2026-03-06T15:14:26Z

Motivation

Introduce a production-ready two-pass OpenAI extraction backend to detect raw evidence (PASS1) and produce ETS-structured trait records (PASS2) while reducing token use and noisy analytical content.
Improve robustness of the existing pipeline by adding QC/normalization, deterministic deduplication, timing instrumentation and safer handling of schema drift in importer code.
Surface PASS1/PASS2 artifacts and QC to the UI and persist necessary fields on extraction runs and assertions for review and downstream persistence.

Description

Add OpenAI adapter recode_extraction.adapters.openai_client.OpenAITwoPassClient and prompt builders in openai_two_pass_prompts.py, plus lightweight service exports in openai_two_pass.py.
Add an openai_two_pass path to recode_extraction.services.orchestrator.RecodePipelineRunner that runs per-page PASS1 calls, compacts the evidence (pass1_compaction.py), runs PASS2, applies QC/normalization (qc.py), and persists candidate ExtractedAssertionModel rows. Also add timing logs and safer page-number handling.
Add trait vocabulary service trait_vocabulary.py to build abbreviation dictionaries and trait lists from the DB, cached for performance.
Extend the data models, with accompanying migrations, to store pass1_evidence_package, pass2_structured_package, qc_summary, and qc_errors, and increase the maximum length of unmapped_reason.
Update the UI and views to show PASS1/PASS2 availability and the QC summary, and to allow selecting the OpenAI backend when it is enabled; add feature flags and the RECODE_OPENAI_* settings in settings.py.
Improve the review flow in services.review: accept a prefilled ets_payload when persisting approved assertions, and validate at runtime that the review columns exist.
Tighten the imports module to avoid import cycles, and fix validation regex escaping in base_validation.py.
Harden BaseImporter.get_or_create_source_location to handle None values and database schema drift when coord_text lacks a default.
Add requirements for openai and pydantic and a collection of unit tests and fixtures covering adapter, compaction, QC/normalization, orchestrator flows, review persistence, and importer edge cases.
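
The openai_two_pass flow described above (per-page PASS1, deterministic compaction, PASS2, then QC, with timing) can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names, the callable signatures, the evidence dict shape, and the hash-based dedup key are all assumptions; only the pass1_evidence_package, pass2_structured_package, and qc_errors field names come from the description.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List

Evidence = Dict[str, object]


def evidence_key(item: Evidence) -> str:
    """Hypothetical dedup key: hash of page number plus whitespace/case-normalized text."""
    text = " ".join(str(item.get("text", "")).split()).lower()
    payload = json.dumps({"page": item.get("page"), "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compact_evidence(items: List[Evidence]) -> List[Evidence]:
    """Order by page (stable sort) and keep the first item for each key."""
    seen = set()
    kept: List[Evidence] = []
    for item in sorted(items, key=lambda i: i.get("page") or 0):
        key = evidence_key(item)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


def run_two_pass(
    pages: List[str],
    pass1: Callable[[int, str], List[Evidence]],
    pass2: Callable[[List[Evidence]], List[dict]],
    qc: Callable[[List[dict]], List[str]],
) -> dict:
    """Run PASS1 per page, compact, run PASS2 once, then QC; record wall-clock timings."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    raw: List[Evidence] = []
    for number, text in enumerate(pages, start=1):
        for item in pass1(number, text):
            raw.append({**item, "page": number})  # page numbers are 1-based here (assumption)
    timings["pass1"] = time.perf_counter() - start

    compacted = compact_evidence(raw)

    start = time.perf_counter()
    records = pass2(compacted)
    timings["pass2"] = time.perf_counter() - start

    return {
        "pass1_evidence_package": compacted,
        "pass2_structured_package": records,
        "qc_errors": qc(records),
        "timings": timings,
    }
```

Passing the PASS1, PASS2, and QC steps in as callables keeps the sketch independent of any model client, so the same flow can be driven by stubs in tests.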

Testing

Ran unit tests for the new OpenAI pipeline and helpers, including app/tests/recode_extraction/test_openai_two_pass_pipeline.py, test_openai_client_adapter.py, test_openai_qc_normalization.py, test_openai_schema.py, and test_pass1_compaction.py, plus the updated review/importer tests; all passed.
Ran the existing orchestrator and review tests (app/tests/recode_extraction/test_orchestrator.py, test_qc_review.py), which passed and show the timing logs and new persistence behaviors.
Adapter-level OpenAI calls are mocked via the mock_openai fixture in app/tests/recode_extraction/conftest.py to avoid live API calls during tests.
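
As a rough illustration of that mocking approach, the sketch below replaces a model client with a canned reply so no network call is made. It is not the mock_openai fixture itself: extract_evidence, make_mock_client, and the client's complete method are hypothetical names invented for this example, and the real adapter's interface may differ.

```python
import json
from unittest.mock import MagicMock


def extract_evidence(client, page_text):
    """Hypothetical PASS1 wrapper: send one page, parse a JSON list reply."""
    reply = client.complete(prompt=f"Extract evidence:\n{page_text}")
    return json.loads(reply)


def make_mock_client(canned):
    """Build a stand-in client whose complete() returns canned JSON."""
    client = MagicMock()
    client.complete.return_value = json.dumps(canned)
    return client


client = make_mock_client([{"text": "HB 120 mm"}])
evidence = extract_evidence(client, "Head-body length (HB) 120 mm")

# The mock records the call, so a test can check how the client was used.
client.complete.assert_called_once()
```

In a pytest suite the same stand-in would typically be returned from a fixture so every adapter test shares it.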

Codex Task

Improve OpenAI extraction coverage and handle coord_text DB drift

7899e8b

karilint added the codex label Mar 6, 2026 — with ChatGPT Codex Connector

karilint added 13 commits March 6, 2026 18:10

Improve pass2 coverage and associated reference validation

745df2c

Parse single-value cranial means in table fallback parser

58bcc11

Use SourceDocument citation as constant reference in extraction

c1450c9

Combine mean+range rows and move trait abbreviations into names

1c24cf7

Fix legacy coord_text schema drift for SourceLocation inserts

d091a6f

Fix coord_text drift on extracted assertions and correct assumptions

d1ec371

Improve PASS2 metadata handling and scientific name normalization

cd3ad7f

Drop legacy count_text and sex_text columns from extracted assertions

36f4a77

Drop remaining legacy *_text columns on extracted assertions

1866e27

Default RECODE OpenAI models to gpt-5-mini

9a59711

Handle legacy snippet column drift on extracted assertions

efaa833

Add model-selection UI and Claude placeholders for two-pass backends

13eaee1

Cache extracted text on SourceDocument and reuse across runs

8a41432

karilint wants to merge 14 commits into codex/prepare-implementation-plan-for-recode-extraction from codex/implement-openai-two-pass-trait-extraction-wyfifs
