Draft Proposal: Agent Session Result Layer #70
elronbandel wants to merge 1 commit into evaleval:main from
Conversation
This adapter is a good example of the abstraction gap we are trying to name. Terminal-Bench 2.0 is evaluating an agent + model system, but the current mapping still has to center the record on the model. That works for ingestion, but it also shows the current schema is accommodating agent evaluation through metadata and config escape hatches rather than representing the evaluated agent system as a first-class object. In other words: the data fits, but the abstraction is still model-shaped. This is exactly why a session-result layer and a first-class agent-system layer would be useful: they would let EEE represent what is already being evaluated here directly, rather than indirectly.
Draft Proposal: Agent Session Result Layer
This PR proposes extending Every Eval Ever with session-level reporting for agentic evaluations. It is a starting point for discussion, not a finished specification.
Full context and motivation: What Agent Evaluation Teams Don’t Tell You
The gap
EEE already captures:

- Configuration-level fields (`agentic_eval_config`, `eval_limits`, `sandbox`)
- Interaction-level fields (`interaction_type`, `messages`, `tool_calls`)

What is missing is standardized session-level semantics: how the run ended, which side failed, how much interaction occurred, and what system was actually evaluated. These are part of the evaluation result, not just diagnostics.
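To make the gap concrete, here is a hedged sketch of what consumers must do today: with no session-level fields, the outcome of a run has to be reverse-engineered from raw interaction data with per-benchmark heuristics. The record shape and the `infer_outcome` helper are illustrative assumptions, not part of EEE.

```python
# Illustrative only: guessing a session outcome from interaction-level
# fields, because no standardized session_result exists in the record.

def infer_outcome(record: dict) -> str:
    """Heuristically guess how the run ended from messages alone."""
    messages = record.get("messages", [])
    if not messages:
        return "unknown"
    last = messages[-1]
    # Brittle heuristic: an assistant turn with no tool calls
    # "probably" means the agent decided it was done.
    if last.get("role") == "assistant" and not last.get("tool_calls"):
        return "probably_finished"
    return "probably_truncated"

record = {  # hypothetical ingested record
    "messages": [
        {"role": "user", "content": "fix the failing test"},
        {"role": "assistant", "content": "running pytest",
         "tool_calls": [{"name": "shell"}]},
        {"role": "assistant", "content": "done, test passes"},
    ]
}
```

Nothing here distinguishes "agent finished" from "benchmark timed out after the last assistant turn", which is exactly the ambiguity a standardized `session_result` would remove.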
Proposed extensions (all optional under `evaluation_results[]`)

- `session_result`
  - `status`: `success | unsuccessful | unfinished | error | cancelled | limit_reached`
  - `is_finished`: `boolean`
  - `finish_accepted`: `boolean`
  - `stop_reason`: `agent_done | timeout | max_steps | error | cancelled | benchmark_policy`
  - `error_attribution`: `agent | benchmark | external | unknown`
  - `error_detail`: `string`
- `session_accounting`
  - `step_count`, `action_count`, `invalid_action_count`, `parallel_action_max`: `integer`
  - `time_to_first_action`, `wall_clock_seconds`, `agent_cost`, `benchmark_cost`: `number`
- `agent_system` / `benchmark_system`
- `eval_conditions`: `internet_access`, `memory_exposure`, `reset_policy`, `permissions`, `repeated_runs`, `seed`
- `robustness` (optional / emerging): `method`, `num_variants`, `variance_metric`, `variance_value`

Backward compatibility
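Because every proposed field is optional, pre-extension records remain valid. A minimal sketch of what that looks like in practice (the validator, entry shapes, and values are illustrative, not part of the spec; only the field names and enums come from the proposal):

```python
# Illustrative validator for the proposed optional session fields.
# Enum values are taken from the proposal; everything else is a sketch.

SESSION_STATUS = {"success", "unsuccessful", "unfinished",
                  "error", "cancelled", "limit_reached"}
STOP_REASON = {"agent_done", "timeout", "max_steps",
               "error", "cancelled", "benchmark_policy"}
ERROR_ATTRIBUTION = {"agent", "benchmark", "external", "unknown"}

def validate_session_result(entry: dict) -> bool:
    """All session fields are optional; absent fields never fail."""
    sr = entry.get("session_result")
    if sr is None:
        return True  # backward compatible: old records carry no session semantics
    return all([
        sr.get("status") in SESSION_STATUS | {None},
        sr.get("stop_reason") in STOP_REASON | {None},
        sr.get("error_attribution") in ERROR_ATTRIBUTION | {None},
    ])

new_entry = {  # hypothetical post-extension evaluation_results[] entry
    "session_result": {
        "status": "success",
        "is_finished": True,
        "finish_accepted": True,
        "stop_reason": "agent_done",
    },
    "session_accounting": {
        "step_count": 42,
        "wall_clock_seconds": 310.5,
        "agent_cost": 0.87,
    },
}
old_entry = {}  # pre-extension record: no session fields at all

assert validate_session_result(new_entry)
assert validate_session_result(old_entry)
```

The design choice this illustrates: existing producers and consumers need no changes, while new producers can opt in field by field.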
Suggested rollout

- `session_result` and `session_accounting`

Discussion welcome
Feedback is especially welcome on:
This proposal grew out of building the Open General Agent Leaderboard and surveying eight evaluation systems. More detail is in the linked post.