Draft Proposal: Agent Session Result Layer #70
elronbandel wants to merge 1 commit into evaleval:main from
Conversation
This adapter is a good example of the abstraction gap we are trying to name. Terminal-Bench 2.0 is evaluating an agent + model system, but the current mapping still has to center the record on the model. That works for ingestion, but it also shows the current schema is accommodating agent evaluation through metadata and config escape hatches rather than representing the evaluated agent system as a first-class object. In other words: the data fits, but the abstraction is still model-shaped. This is exactly why a session-result layer and a first-class agent-system layer would be useful: they would let EEE represent what is already being evaluated here directly, rather than indirectly.
Draft Proposal: Agent Session Result Layer
This PR proposes extending Every Eval Ever with session-level reporting for agentic evaluations. It is a starting point for discussion, not a finished specification.
Full context and motivation: What Agent Evaluation Teams Don’t Tell You
The gap
EEE already captures:

- Configuration-level fields (`agentic_eval_config`, `eval_limits`, `sandbox`)
- Interaction-level fields (`interaction_type`, `messages`, `tool_calls`)

What is missing is standardized session-level semantics: how the run ended, which side failed, how much interaction occurred, and what system was actually evaluated. These are part of the evaluation result, not just diagnostics.
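To make the gap concrete, here is a hedged sketch of what consumers must do today: with no session-level fields, the outcome of a run has to be reverse-engineered from raw interaction data with per-benchmark heuristics. The record shape and the `infer_outcome` helper are illustrative assumptions, not part of EEE.

```python
# Illustrative only: guessing a session outcome from interaction-level
# fields, because no standardized session_result exists in the record.

def infer_outcome(record: dict) -> str:
    """Heuristically guess how the run ended from messages alone."""
    messages = record.get("messages", [])
    if not messages:
        return "unknown"
    last = messages[-1]
    # Brittle heuristic: an assistant turn with no tool calls
    # "probably" means the agent decided it was done.
    if last.get("role") == "assistant" and not last.get("tool_calls"):
        return "probably_finished"
    return "probably_truncated"

record = {  # hypothetical ingested record
    "messages": [
        {"role": "user", "content": "fix the failing test"},
        {"role": "assistant", "content": "running pytest",
         "tool_calls": [{"name": "shell"}]},
        {"role": "assistant", "content": "done, test passes"},
    ]
}
```

Nothing here distinguishes "agent finished" from "benchmark timed out after the last assistant turn", which is exactly the ambiguity a standardized `session_result` would remove.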
Proposed extensions (all optional under `evaluation_results[]`)

- `session_result`
  - `status`: `success | unsuccessful | unfinished | error | cancelled | limit_reached`
  - `is_finished`: `boolean`
  - `finish_accepted`: `boolean`
  - `stop_reason`: `agent_done | timeout | max_steps | error | cancelled | benchmark_policy`
  - `error_attribution`: `agent | benchmark | external | unknown`
  - `error_detail`: `string`
- `session_accounting`
  - `step_count`, `action_count`, `invalid_action_count`, `parallel_action_max`: `integer`
  - `time_to_first_action`, `wall_clock_seconds`, `agent_cost`, `benchmark_cost`: `number`
- `agent_system` / `benchmark_system`
- `eval_conditions`: `internet_access`, `memory_exposure`, `reset_policy`, `permissions`, `repeated_runs`, `seed`
- `robustness` (optional / emerging): `method`, `num_variants`, `variance_metric`, `variance_value`

Backward compatibility
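Because every proposed field is optional, pre-extension records remain valid. A minimal sketch of what that looks like in practice (the validator, entry shapes, and values are illustrative, not part of the spec; only the field names and enums come from the proposal):

```python
# Illustrative validator for the proposed optional session fields.
# Enum values are taken from the proposal; everything else is a sketch.

SESSION_STATUS = {"success", "unsuccessful", "unfinished",
                  "error", "cancelled", "limit_reached"}
STOP_REASON = {"agent_done", "timeout", "max_steps",
               "error", "cancelled", "benchmark_policy"}
ERROR_ATTRIBUTION = {"agent", "benchmark", "external", "unknown"}

def validate_session_result(entry: dict) -> bool:
    """All session fields are optional; absent fields never fail."""
    sr = entry.get("session_result")
    if sr is None:
        return True  # backward compatible: old records carry no session semantics
    return all([
        sr.get("status") in SESSION_STATUS | {None},
        sr.get("stop_reason") in STOP_REASON | {None},
        sr.get("error_attribution") in ERROR_ATTRIBUTION | {None},
    ])

new_entry = {  # hypothetical post-extension evaluation_results[] entry
    "session_result": {
        "status": "success",
        "is_finished": True,
        "finish_accepted": True,
        "stop_reason": "agent_done",
    },
    "session_accounting": {
        "step_count": 42,
        "wall_clock_seconds": 310.5,
        "agent_cost": 0.87,
    },
}
old_entry = {}  # pre-extension record: no session fields at all

assert validate_session_result(new_entry)
assert validate_session_result(old_entry)
```

The design choice this illustrates: existing producers and consumers need no changes, while new producers can opt in field by field.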
Suggested rollout

- `session_result` and `session_accounting`

Discussion welcome
Feedback is especially welcome on:
This proposal grew out of building the Open General Agent Leaderboard and surveying eight evaluation systems. More detail is in the linked post.