added testing dataset by joeda · Pull Request #64 · evaleval/every_eval_ever

joeda · 2026-03-12T17:49:47Z

The evals are shortened to 5 samples each and are sourced from a full evaluation run with Inspect AI.

I'm not sure that this many evals are actually useful, since many might be redundant. However, converting some of them triggers errors, so the current test coverage seems to be insufficient.

I'm also not sure that just checking that an eval loads and is not empty is the best approach.

damian1996 · 2026-03-17T21:45:45Z

tests/test_inspect_adapter.py

+
+    for inspect_eval_path in (Path(__file__).parent / Path("data/inspect/inspect_shortened/")).glob("*.json"):
+        converted_eval = _load_eval(adapter, inspect_eval_path.resolve(), metadata_args)
+        assert converted_eval.detailed_evaluation_results is not None


Good idea to add multiple tests, thanks! Could you also add more assertions to each test in some smart way? Maybe serialize the expected values for a few important fields into a file that uses the same name as the provided eval file.

We do not need to check every field. At a minimum, please consider:

ModelInfo.id

ModelInfo.developer

SourceDataHf.dataset_name

fields from each EvaluationResult, especially MetricConfig.evaluation_description (currently the same as metric.name) and ScoreDetails.score
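One way the request above could be sketched: keep a small expected-values file next to each fixture and compare only the listed fields. Everything here is an assumption for illustration: the helper names (`expected_path_for`, `get_path`, `check_fields`), the `.expected.json` naming, and the dotted attribute paths such as `model_info.id`, which may not match the real attribute names on the converted eval object.

```python
import math
from pathlib import Path
from types import SimpleNamespace
from typing import Any


def expected_path_for(fixture: Path) -> Path:
    """Map 'mmlu-0-shot.json' to 'mmlu-0-shot.expected.json' in the same folder."""
    return fixture.with_suffix(".expected.json")


def get_path(obj: Any, dotted: str) -> Any:
    """Follow a dotted attribute path such as 'model_info.id' (hypothetical names)."""
    for part in dotted.split("."):
        obj = getattr(obj, part)
    return obj


def check_fields(converted: Any, expected: dict[str, Any]) -> list[str]:
    """Return one message per mismatching field; an empty list means all match."""
    mismatches = []
    for dotted, want in expected.items():
        got = get_path(converted, dotted)
        if isinstance(want, float) and isinstance(got, (int, float)):
            # Scores are floats, so compare with a tolerance instead of ==.
            same = math.isclose(got, want, rel_tol=1e-9)
        else:
            same = got == want
        if not same:
            mismatches.append(f"{dotted}: expected {want!r}, got {got!r}")
    return mismatches


# Stand-in for a converted eval; in the real test it would come from the adapter.
converted = SimpleNamespace(model_info=SimpleNamespace(id="org/model", developer="org"))
print(check_fields(converted, {"model_info.id": "org/model", "model_info.developer": "org"}))
```

In the loop over fixtures, the test could load `expected_path_for(inspect_eval_path)` with `json.loads` and assert that `check_fields(...)` returns an empty list, which keeps the per-eval expectations out of the test code itself.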

I'll check soon.

Copilot

Pull request overview

Adds shortened Inspect AI evaluation artifacts (5 samples each) to serve as local test fixtures and improve coverage for eval conversion/loading paths.

Changes:

Added shortened Inspect output JSON fixtures for several eval tasks (MMLU 0-shot, Lab Bench LitQA, CommonsenseQA).
Included full per-sample logs (events, attachments, model_usage, etc.) alongside the samples to reproduce conversion edge cases.

Reviewed changes

Copilot reviewed 9 out of 51 changed files in this pull request and generated 5 comments.

File	Description
tests/data/inspect/inspect_shortened/mmlu-0-shot.json	New shortened MMLU fixture capturing samples + full Inspect run metadata/logs
tests/data/inspect/inspect_shortened/lab-bench-litqa.json	New shortened Lab Bench LitQA fixture capturing samples + full Inspect run metadata/logs
tests/data/inspect/inspect_shortened/commonsense-qa.json	New shortened CommonsenseQA fixture capturing samples + full Inspect run metadata/logs


Copilot · 2026-03-24T23:32:47Z

tests/data/inspect/inspect_shortened/mmlu-0-shot.json

+      "events": [
+        {
+          "uuid": "85Cm9N26sAudnVuVd55zVH",
+          "span_id": "3082418315d542c78ad9857407895180",
+          "timestamp": "2026-03-11T17:26:23.755886+01:00",
+          "working_start": 83617.802226654,
+          "event": "span_begin",
+          "id": "3082418315d542c78ad9857407895180",
+          "type": "init",
+          "name": "init"
+        },
+        {
+          "uuid": "ksFZvJbSDdGaPGsTyfnsHq",
+          "span_id": "3082418315d542c78ad9857407895180",


The fixture includes full per-sample events traces and attachments payloads, which significantly increases repo size and can slow down test checkout/CI and local runs (especially if you add “many” evals as mentioned in the PR description). Consider slimming fixtures to only the minimal fields required to reproduce the conversion/loading bug (e.g., keep version/eval/results/samples and any specific problematic fields), or move bulky traces into a compressed artifact (e.g., .json.gz) that tests optionally unzip when needed.

Suggested change

-      "events": [
-        {
-          "uuid": "85Cm9N26sAudnVuVd55zVH",
-          "span_id": "3082418315d542c78ad9857407895180",
-          "timestamp": "2026-03-11T17:26:23.755886+01:00",
-          "working_start": 83617.802226654,
-          "event": "span_begin",
-          "id": "3082418315d542c78ad9857407895180",
-          "type": "init",
-          "name": "init"
-        },
-        {
-          "uuid": "ksFZvJbSDdGaPGsTyfnsHq",
-          "span_id": "3082418315d542c78ad9857407895180",
+      "events": []
+          "span_id": "3082418315d542c78ad9857407895180",
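The slimming idea from the comment above could be applied with a small script instead of hand-editing each fixture. This is only a sketch: it assumes the log is a JSON object with a top-level `samples` list whose entries carry `events` and `attachments`, which matches the excerpts quoted in this review but has not been checked against the full Inspect log schema. The function name `slim_log` is hypothetical.

```python
import copy
from typing import Any


def slim_log(log: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the log with bulky per-sample traces emptied.

    The keys are kept (as an empty list / empty dict) so the structural
    shape seen by the converter stays the same.
    """
    slimmed = copy.deepcopy(log)  # never mutate the caller's data
    for sample in slimmed.get("samples", []):
        if "events" in sample:
            sample["events"] = []
        if "attachments" in sample:
            sample["attachments"] = {}
    return slimmed


# Tiny stand-in log; real fixtures are much larger.
log = {
    "version": 2,
    "samples": [
        {"id": 1, "events": [{"event": "span_begin"}], "attachments": {"k": "v"}}
    ],
}
print(slim_log(log)["samples"][0])
```

Whether emptying these fields is safe depends on whether any of the conversion errors mentioned in the PR description are triggered by `events` or `attachments` content; if they are, those specific samples would need to keep their traces.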

Copilot · 2026-03-24T23:32:48Z

tests/data/inspect/inspect_shortened/mmlu-0-shot.json

+      "attachments": {
+        "81d0f840aa99ceed1bd35ff1b55d114f": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.\n\nThe European Union (EU) replaced which of the following organizations?\n\nA) NATO\nB) EEC\nC) UN\nD) ASEAN"
+      }


The fixture includes full per-sample events traces and attachments payloads, which significantly increases repo size and can slow down test checkout/CI and local runs (especially if you add “many” evals as mentioned in the PR description). Consider slimming fixtures to only the minimal fields required to reproduce the conversion/loading bug (e.g., keep version/eval/results/samples and any specific problematic fields), or move bulky traces into a compressed artifact (e.g., .json.gz) that tests optionally unzip when needed.

Suggested change

-      "attachments": {
-        "81d0f840aa99ceed1bd35ff1b55d114f": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.\n\nThe European Union (EU) replaced which of the following organizations?\n\nA) NATO\nB) EEC\nC) UN\nD) ASEAN"
-      }
+      "attachments": {}
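For the alternative mentioned in the comment (keeping bulky traces in a compressed `.json.gz` artifact), a test helper could unzip a fixture on demand and hand back a plain `.json` path, since the existing test passes a path to `_load_eval`. A minimal sketch, assuming fixtures are stored either as `.json` or `.json.gz`; `materialize_fixture` is a hypothetical name, and `out_dir` would typically be a temporary directory such as pytest's `tmp_path`.

```python
import gzip
import shutil
from pathlib import Path


def materialize_fixture(path: Path, out_dir: Path) -> Path:
    """Return a path to an uncompressed JSON file for the given fixture.

    Plain .json fixtures are returned unchanged; .json.gz fixtures are
    decompressed into out_dir first.
    """
    if path.suffix != ".gz":
        return path
    target = out_dir / path.with_suffix("").name  # 'x.json.gz' -> 'x.json'
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The glob in the test would then need to match both suffixes (for example by globbing `*.json*`), and each path would go through `materialize_fixture` before being passed to the loader.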

Copilot · 2026-03-24T23:32:48Z

tests/data/inspect/inspect_shortened/lab-bench-litqa.json

+        "solver": "multiple_choice",
+        "params": {
+          "template": "\nThe following is a multiple choice question about biology.\nPlease answer by responding with the letter of the correct answer.\n\nThink step by step.\n\nQuestion: {question}\nOptions:\n{choices}\n\nYou MUST include the letter of the correct answer within the following format: 'ANSWER: $LETTER' (without quotes). For example, \u2019ANSWER: <answer>\u2019, where <answer> is the correct letter. Always answer in exactly this format of a single letter, even if you are unsure. We require this because we use automatic parsing.\n",
+          "cot": true,
+          "multiple_correct": false,
+          "max_tokens": null,
+          "kwargs": {}
+        }


Storing large, repeated prompt templates inside test fixtures makes diffs noisy and increases maintenance burden if templates change upstream (even when the conversion logic under test doesn’t depend on the exact wording). If the loader/converter only needs to validate that template exists (or that it can be null), consider replacing the full template string with a shorter sentinel value while preserving the same structural shape (including escaping/unicode patterns if those are what trigger parser issues).
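The sentinel replacement described above could be done mechanically. The sketch below walks a parsed fixture and swaps every string-valued `template` for a short placeholder that keeps the same kinds of characters (newlines, `{question}`/`{choices}` fields, and the `\u2019` quote) in case those are what exercise the parser. The sentinel text and the function name `replace_templates` are made up for illustration, and `null` templates are deliberately left untouched.

```python
from typing import Any

# Short stand-in that preserves the escaping patterns seen in the real template.
SENTINEL = "\nTEMPLATE {question}\n{choices}\n\u2019ANSWER: <answer>\u2019\n"


def replace_templates(node: Any, sentinel: str = SENTINEL) -> Any:
    """Return a copy of node with every string 'template' value replaced."""
    if isinstance(node, dict):
        return {
            key: sentinel
            if key == "template" and isinstance(value, str)
            else replace_templates(value, sentinel)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [replace_templates(item, sentinel) for item in node]
    return node  # scalars (including None) pass through unchanged


step = {
    "solver": "multiple_choice",
    "params": {"template": "a very long prompt ...", "cot": True, "kwargs": {}},
}
print(replace_templates(step)["params"]["template"] == SENTINEL)
```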

Copilot · 2026-03-24T23:32:48Z

tests/data/inspect/inspect_shortened/commonsense-qa.json

+      "name": "tau/commonsense_qa",
+      "location": "tau/commonsense_qa",
+      "samples": 5,
+      "sample_ids": [
+        "001b0f5a841fd81d13fbe67c7c7179d6",
+        "001cb999a61a5c8b4031ff53cf261714",
+        "004607228ad49b69eac932c1005d6106",
+        "008b7ba0c039f6d0d542c6c90aae173c",
+        "009a7aabffe0583fc2df46656b29c326"
+      ],
+      "shuffled": false


These fixtures embed raw dataset prompts/choices from third-party datasets (e.g., tau/commonsense_qa, cais/mmlu). Depending on the dataset licenses, committing verbatim content into the repo may create redistribution/compliance issues. If licensing is uncertain, prefer generating fixtures from synthetic/minimized examples that still hit the same conversion edge cases, or add an explicit note/attribution + confirmation that redistribution is permitted (and consider excluding copyrighted question text where possible).

Suggested change

-      "name": "tau/commonsense_qa",
-      "location": "tau/commonsense_qa",
-      "samples": 5,
-      "sample_ids": [
-        "001b0f5a841fd81d13fbe67c7c7179d6",
-        "001cb999a61a5c8b4031ff53cf261714",
-        "004607228ad49b69eac932c1005d6106",
-        "008b7ba0c039f6d0d542c6c90aae173c",
-        "009a7aabffe0583fc2df46656b29c326"
-      ],
-      "shuffled": false
+      "name": "synthetic/commonsense_qa_like",
+      "location": "synthetic/commonsense_qa_like",
+      "samples": 0,
+      "sample_ids": [],
+      "shuffled": false,
+      "note": "Synthetic commonsense-qa-like fixture metadata; does not include any original tau/commonsense_qa question or answer text."

Copilot · 2026-03-24T23:32:48Z

tests/data/inspect/inspect_shortened/commonsense-qa.json

+      "input": "Eating is part of living, but your body doesn't use it all and the next day you will be doing what?",
+      "choices": [
+        "reduced",
+        "getting full",
+        "becoming full",
+        "chewing",
+        "defecating"
+      ],
+      "target": "E",
+      "messages": [
+        {
+          "id": "QDiKXrMcGEMcpjKsFeNu7p",
+          "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nEating is part of living, but your body doesn't use it all and the next day you will be doing what?\n\nA) reduced\nB) getting full\nC) becoming full\nD) chewing\nE) defecating",
+          "source": "input",
+          "role": "user"
+        },
+        {
+          "id": "5zYC3Yo4o55SVfhVpB36Kg",
+          "content": "ANSWER: E",


These fixtures embed raw dataset prompts/choices from third-party datasets (e.g., tau/commonsense_qa, cais/mmlu). Depending on the dataset licenses, committing verbatim content into the repo may create redistribution/compliance issues. If licensing is uncertain, prefer generating fixtures from synthetic/minimized examples that still hit the same conversion edge cases, or add an explicit note/attribution + confirmation that redistribution is permitted (and consider excluding copyrighted question text where possible).

Suggested change

-      "input": "Eating is part of living, but your body doesn't use it all and the next day you will be doing what?",
-      "choices": [
-        "reduced",
-        "getting full",
-        "becoming full",
-        "chewing",
-        "defecating"
-      ],
-      "target": "E",
-      "messages": [
-        {
-          "id": "QDiKXrMcGEMcpjKsFeNu7p",
-          "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nEating is part of living, but your body doesn't use it all and the next day you will be doing what?\n\nA) reduced\nB) getting full\nC) becoming full\nD) chewing\nE) defecating",
-          "source": "input",
-          "role": "user"
-        },
-        {
-          "id": "5zYC3Yo4o55SVfhVpB36Kg",
-          "content": "ANSWER: E",
+      "input": "You put a tray of water into the freezer and leave it there overnight. What will the water most likely turn into?",
+      "choices": [
+        "steam",
+        "ice",
+        "sand",
+        "salt",
+        "smoke"
+      ],
+      "target": "B",
+      "messages": [
+        {
+          "id": "QDiKXrMcGEMcpjKsFeNu7p",
+          "content": "Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D,E.\n\nYou put a tray of water into the freezer and leave it there overnight. What will the water most likely turn into?\n\nA) steam\nB) ice\nC) sand\nD) salt\nE) smoke",
+          "source": "input",
+          "role": "user"
+        },
+        {
+          "id": "5zYC3Yo4o55SVfhVpB36Kg",
+          "content": "ANSWER: B",
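If synthetic fixtures are preferred, samples with the same shape as the excerpt above could be generated rather than written by hand. The sketch below builds the `input`, `choices`, `target`, and `messages` fields shown in this review and reuses the prompt wording visible in the quoted fixture. The function name, the placeholder message ids, and the omission of any further message fields (the quoted excerpt is cut off after `content`) are assumptions, so the output would still need to be merged into a complete log before the converter can load it.

```python
import string
from typing import Any


def make_synthetic_sample(
    question: str, choices: list[str], target_index: int
) -> dict[str, Any]:
    """Build a multiple-choice sample dict shaped like the quoted fixture excerpt."""
    letters = string.ascii_uppercase[: len(choices)]
    options = "\n".join(f"{letter}) {choice}" for letter, choice in zip(letters, choices))
    prompt = (
        "Answer the following multiple choice question. The entire content of your "
        "response should be of the following format: 'ANSWER: $LETTER' (without quotes) "
        f"where LETTER is one of {','.join(letters)}.\n\n{question}\n\n{options}"
    )
    target = letters[target_index]
    return {
        "input": question,
        "choices": list(choices),
        "target": target,
        "messages": [
            # Placeholder ids; real Inspect logs use generated ids.
            {"id": "synthetic-msg-0", "content": prompt, "source": "input", "role": "user"},
            # Only the fields visible in the review excerpt are filled in here.
            {"id": "synthetic-msg-1", "content": f"ANSWER: {target}"},
        ],
    }


sample = make_synthetic_sample("Which of these is a fruit?", ["rock", "apple", "chair"], 1)
print(sample["target"], sample["messages"][1]["content"])
```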

Johannes Janosovits added 2 commits March 12, 2026 18:38

added testing dataset

fda65aa

removed failed

d2bd6f1

nelaturuharsha requested a review from damian1996 March 17, 2026 20:50

damian1996 reviewed Mar 17, 2026

evijit requested a review from Copilot March 24, 2026 23:31

Copilot AI reviewed Mar 24, 2026

joeda wants to merge 2 commits into evaleval:main from joeda:add-many-inspect
