Conversation
|
|
||
| for inspect_eval_path in (Path(__file__).parent / Path("data/inspect/inspect_shortened/")).glob("*.json"): | ||
| converted_eval = _load_eval(adapter, inspect_eval_path.resolve(), metadata_args) | ||
| assert converted_eval.detailed_evaluation_results is not None |
There was a problem hiding this comment.
Good idea to add multiple tests, thanks! Could you also add more assertions to each test in a some smart way? Maybe serialize the expected responses for a few important fields, using the same name as the provided eval file.
We do not need to check every field. At a minimum, please consider:
ModelInfo.idModelInfo.developerSourceDataHf.dataset_name- fields from each
EvaluationResult, especiallyMetricConfig.evaluation_description(currently the same as metric.name) andScoreDetails.score
There was a problem hiding this comment.
Pull request overview
Adds shortened Inspect AI evaluation artifacts (5 samples each) to serve as local test fixtures and improve coverage for eval conversion/loading paths.
Changes:
- Added shortened Inspect output JSON fixtures for several eval tasks (MMLU 0-shot, Lab Bench LitQA, CommonsenseQA).
- Included full per-sample logs (
events,attachments,model_usage, etc.) alongside the samples to reproduce conversion edge cases.
Reviewed changes
Copilot reviewed 9 out of 51 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/data/inspect/inspect_shortened/mmlu-0-shot.json | New shortened MMLU fixture capturing samples + full Inspect run metadata/logs |
| tests/data/inspect/inspect_shortened/lab-bench-litqa.json | New shortened Lab Bench LitQA fixture capturing samples + full Inspect run metadata/logs |
| tests/data/inspect/inspect_shortened/commonsense-qa.json | New shortened CommonsenseQA fixture capturing samples + full Inspect run metadata/logs |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "events": [ | ||
| { | ||
| "uuid": "85Cm9N26sAudnVuVd55zVH", | ||
| "span_id": "3082418315d542c78ad9857407895180", | ||
| "timestamp": "2026-03-11T17:26:23.755886+01:00", | ||
| "working_start": 83617.802226654, | ||
| "event": "span_begin", | ||
| "id": "3082418315d542c78ad9857407895180", | ||
| "type": "init", | ||
| "name": "init" | ||
| }, | ||
| { | ||
| "uuid": "ksFZvJbSDdGaPGsTyfnsHq", | ||
| "span_id": "3082418315d542c78ad9857407895180", |
There was a problem hiding this comment.
The fixture includes full per-sample events traces and attachments payloads, which significantly increases repo size and can slow down test checkout/CI and local runs (especially if you add “many” evals as mentioned in the PR description). Consider slimming fixtures to only the minimal fields required to reproduce the conversion/loading bug (e.g., keep version/eval/results/samples and any specific problematic fields), or move bulky traces into a compressed artifact (e.g., .json.gz) that tests optionally unzip when needed.
| "events": [ | |
| { | |
| "uuid": "85Cm9N26sAudnVuVd55zVH", | |
| "span_id": "3082418315d542c78ad9857407895180", | |
| "timestamp": "2026-03-11T17:26:23.755886+01:00", | |
| "working_start": 83617.802226654, | |
| "event": "span_begin", | |
| "id": "3082418315d542c78ad9857407895180", | |
| "type": "init", | |
| "name": "init" | |
| }, | |
| { | |
| "uuid": "ksFZvJbSDdGaPGsTyfnsHq", | |
| "span_id": "3082418315d542c78ad9857407895180", | |
| "events": [] | |
| "span_id": "3082418315d542c78ad9857407895180", |
| "attachments": { | ||
| "81d0f840aa99ceed1bd35ff1b55d114f": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.\n\nThe European Union (EU) replaced which of the following organizations?\n\nA) NATO\nB) EEC\nC) UN\nD) ASEAN" | ||
| } |
There was a problem hiding this comment.
The fixture includes full per-sample events traces and attachments payloads, which significantly increases repo size and can slow down test checkout/CI and local runs (especially if you add “many” evals as mentioned in the PR description). Consider slimming fixtures to only the minimal fields required to reproduce the conversion/loading bug (e.g., keep version/eval/results/samples and any specific problematic fields), or move bulky traces into a compressed artifact (e.g., .json.gz) that tests optionally unzip when needed.
| "attachments": { | |
| "81d0f840aa99ceed1bd35ff1b55d114f": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.\n\nThe European Union (EU) replaced which of the following organizations?\n\nA) NATO\nB) EEC\nC) UN\nD) ASEAN" | |
| } | |
| "attachments": {} |
| "solver": "multiple_choice", | ||
| "params": { | ||
| "template": "\nThe following is a multiple choice question about biology.\nPlease answer by responding with the letter of the correct answer.\n\nThink step by step.\n\nQuestion: {question}\nOptions:\n{choices}\n\nYou MUST include the letter of the correct answer within the following format: 'ANSWER: $LETTER' (without quotes). For example, \u2019ANSWER: <answer>\u2019, where <answer> is the correct letter. Always answer in exactly this format of a single letter, even if you are unsure. We require this because we use automatic parsing.\n", | ||
| "cot": true, | ||
| "multiple_correct": false, | ||
| "max_tokens": null, | ||
| "kwargs": {} | ||
| } |
There was a problem hiding this comment.
Storing large, repeated prompt templates inside test fixtures makes diffs noisy and increases maintenance burden if templates change upstream (even when the conversion logic under test doesn’t depend on the exact wording). If the loader/converter only needs to validate that template exists (or that it can be null), consider replacing the full template string with a shorter sentinel value while preserving the same structural shape (including escaping/unicode patterns if those are what trigger parser issues).
| "name": "tau/commonsense_qa", | ||
| "location": "tau/commonsense_qa", | ||
| "samples": 5, | ||
| "sample_ids": [ | ||
| "001b0f5a841fd81d13fbe67c7c7179d6", | ||
| "001cb999a61a5c8b4031ff53cf261714", | ||
| "004607228ad49b69eac932c1005d6106", | ||
| "008b7ba0c039f6d0d542c6c90aae173c", | ||
| "009a7aabffe0583fc2df46656b29c326" | ||
| ], | ||
| "shuffled": false |
There was a problem hiding this comment.
These fixtures embed raw dataset prompts/choices from third-party datasets (e.g., tau/commonsense_qa, cais/mmlu). Depending on the dataset licenses, committing verbatim content into the repo may create redistribution/compliance issues. If licensing is uncertain, prefer generating fixtures from synthetic/minimized examples that still hit the same conversion edge cases, or add an explicit note/attribution + confirmation that redistribution is permitted (and consider excluding copyrighted question text where possible).
| "name": "tau/commonsense_qa", | |
| "location": "tau/commonsense_qa", | |
| "samples": 5, | |
| "sample_ids": [ | |
| "001b0f5a841fd81d13fbe67c7c7179d6", | |
| "001cb999a61a5c8b4031ff53cf261714", | |
| "004607228ad49b69eac932c1005d6106", | |
| "008b7ba0c039f6d0d542c6c90aae173c", | |
| "009a7aabffe0583fc2df46656b29c326" | |
| ], | |
| "shuffled": false | |
| "name": "synthetic/commonsense_qa_like", | |
| "location": "synthetic/commonsense_qa_like", | |
| "samples": 0, | |
| "sample_ids": [], | |
| "shuffled": false, | |
| "note": "Synthetic commonsense-qa-like fixture metadata; does not include any original tau/commonsense_qa question or answer text." |
| "input": "Eating is part of living, but your body doesn't use it all and the next day you will be doing what?", | ||
| "choices": [ | ||
| "reduced", | ||
| "getting full", | ||
| "becoming full", | ||
| "chewing", | ||
| "defecating" | ||
| ], | ||
| "target": "E", | ||
| "messages": [ | ||
| { | ||
| "id": "QDiKXrMcGEMcpjKsFeNu7p", | ||
| "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nEating is part of living, but your body doesn't use it all and the next day you will be doing what?\n\nA) reduced\nB) getting full\nC) becoming full\nD) chewing\nE) defecating", | ||
| "source": "input", | ||
| "role": "user" | ||
| }, | ||
| { | ||
| "id": "5zYC3Yo4o55SVfhVpB36Kg", | ||
| "content": "ANSWER: E", |
There was a problem hiding this comment.
These fixtures embed raw dataset prompts/choices from third-party datasets (e.g., tau/commonsense_qa, cais/mmlu). Depending on the dataset licenses, committing verbatim content into the repo may create redistribution/compliance issues. If licensing is uncertain, prefer generating fixtures from synthetic/minimized examples that still hit the same conversion edge cases, or add an explicit note/attribution + confirmation that redistribution is permitted (and consider excluding copyrighted question text where possible).
| "input": "Eating is part of living, but your body doesn't use it all and the next day you will be doing what?", | |
| "choices": [ | |
| "reduced", | |
| "getting full", | |
| "becoming full", | |
| "chewing", | |
| "defecating" | |
| ], | |
| "target": "E", | |
| "messages": [ | |
| { | |
| "id": "QDiKXrMcGEMcpjKsFeNu7p", | |
| "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nEating is part of living, but your body doesn't use it all and the next day you will be doing what?\n\nA) reduced\nB) getting full\nC) becoming full\nD) chewing\nE) defecating", | |
| "source": "input", | |
| "role": "user" | |
| }, | |
| { | |
| "id": "5zYC3Yo4o55SVfhVpB36Kg", | |
| "content": "ANSWER: E", | |
| "input": "You put a tray of water into the freezer and leave it there overnight. What will the water most likely turn into?", | |
| "choices": [ | |
| "steam", | |
| "ice", | |
| "sand", | |
| "salt", | |
| "smoke" | |
| ], | |
| "target": "B", | |
| "messages": [ | |
| { | |
| "id": "QDiKXrMcGEMcpjKsFeNu7p", | |
| "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nYou put a tray of water into the freezer and leave it there overnight. What will the water most likely turn into?\n\nA) steam\nB) ice\nC) sand\nD) salt\nE) smoke", | |
| "source": "input", | |
| "role": "user" | |
| }, | |
| { | |
| "id": "5zYC3Yo4o55SVfhVpB36Kg", | |
| "content": "ANSWER: B", |
The evals are shortened to 5 samples each and are sourced from a full evaluation on Inspect AI
No idea if that many evals are actually useful since many might be redundant, but since the conversion of some of them trigger errors the current testing coverage seems to be insufficient
also not sure if just checking that an eval loads and is not empty is the best way