
Add score_answer versioning and historical scorer resolution #3

Open
sileod wants to merge 1 commit into main from codex/add-versioning-to-score_answer-function

@sileod (Owner) commented Mar 22, 2026

Motivation

  • Provide stable, auditable scorer behavior by recording scorer version, content hash, and commit with each generated example to allow reproducible scoring.
  • Allow loading legacy score_answer implementations by version, hash, or from a historical commit/file so that scorer changes remain backward-compatible.
  • Surface the change in the task authoring guide so task authors know to bump versions when changing scorers.

Description

  • Add a new reasoning_core/score_answer_history.py module to compute scorer hashes, locate repository commits, load historical scorer source from files or git commits, resolve callables, and return the appropriate scoring function via resolve_score_answer_fn.
  • Extend reasoning_core/template.py to declare Task.score_answer_version and Task.score_answer_history, provide helpers score_answer_hash, resolve_score_answer_fn, and score_answer_for_entry, and record _score_answer metadata (version, hash, commit) in generate_example().
  • Update reasoning_core/__init__.py to invoke the new per-entry scoring entrypoint via DATASETS[task_name].score_answer_for_entry(...) so scorers chosen per-entry are applied correctly.
  • Update TASK_AUTHORING_GUIDE.md to document the _score_answer metadata and the recommended workflow for bumping scorer compatibility.
  • Add unit tests in tests/test_score_answer_versioning.py covering default metadata recording, loading a legacy scorer from a file via score_answer_history, and error behavior when a requested legacy version is not registered.
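
To make the flow described above concrete, here is a minimal sketch of per-entry scorer resolution. It is not the code in this PR: `ToyTask`, `source_hash`, `load_scorer`, and the two source strings are hypothetical stand-ins, the hash is assumed to be a SHA-256 of the scorer source, and the commit field is omitted for brevity. Only the attribute and method names `score_answer_version`, `score_answer_history`, `resolve_score_answer_fn`, `score_answer_for_entry`, and the `_score_answer` key come from the description.

```python
import hashlib

# Hypothetical scorer sources; in the PR these would come from files or git commits.
CURRENT_SOURCE = '''
def score_answer(answer, entry):
    return float(answer.strip() == entry["answer"])
'''

LEGACY_SOURCE = '''
def score_answer(answer, entry):
    return float(answer == entry["answer"])
'''


def source_hash(source: str) -> str:
    """Content hash of a scorer's source text (assumed scheme: SHA-256)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def load_scorer(source: str):
    """Execute scorer source in a fresh namespace and return its score_answer."""
    namespace = {}
    exec(source, namespace)
    return namespace["score_answer"]


class ToyTask:
    score_answer_version = 2
    score_answer_history = {1: LEGACY_SOURCE}  # version -> legacy scorer source

    def generate_example(self):
        # Record which scorer produced this entry so it can be re-scored later.
        return {
            "answer": "42",
            "_score_answer": {
                "version": self.score_answer_version,
                "hash": source_hash(CURRENT_SOURCE),
            },
        }

    def resolve_score_answer_fn(self, version):
        if version == self.score_answer_version:
            return load_scorer(CURRENT_SOURCE)
        # An unregistered version raises KeyError.
        return load_scorer(self.score_answer_history[version])

    def score_answer_for_entry(self, answer, entry):
        # Entries without metadata fall back to the current scorer.
        meta = entry.get("_score_answer", {})
        version = meta.get("version", self.score_answer_version)
        return self.resolve_score_answer_fn(version)(answer, entry)
```

In this sketch, an entry stamped with version 1 is scored by the stricter legacy function even after the task has moved on to version 2, which is the backward-compatibility property the motivation section asks for.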

Testing

  • Ran pytest tests/test_score_answer_versioning.py; all new tests passed.
  • The tests validate that _score_answer metadata is recorded, that legacy scorer files can be loaded for a historical version, and that missing score_answer_history entries raise a KeyError as expected.
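
For readers who want a feel for the shape of such tests, the following is a hedged sketch and not the contents of tests/test_score_answer_versioning.py. `ToyTask` is a hypothetical stand-in for a Task subclass, and the functions use plain assertions so they run with or without pytest.

```python
class ToyTask:
    """Hypothetical stand-in for a Task subclass; not the PR's implementation."""

    score_answer_version = 2
    score_answer_history = {}  # no legacy versions registered

    def score_answer(self, answer, entry):
        return float(answer == entry["answer"])

    def generate_example(self):
        return {
            "answer": "4",
            "_score_answer": {"version": self.score_answer_version},
        }

    def resolve_score_answer_fn(self, version):
        if version == self.score_answer_version:
            return self.score_answer
        # Dict lookup raises KeyError for an unregistered version.
        return self.score_answer_history[version]


def test_default_metadata_is_recorded():
    entry = ToyTask().generate_example()
    assert entry["_score_answer"]["version"] == 2


def test_unregistered_version_raises_key_error():
    try:
        ToyTask().resolve_score_answer_fn(1)
    except KeyError:
        pass
    else:
        raise AssertionError("expected KeyError for unregistered version")
```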

Codex Task
