
Draft Proposal: Agent Session Result Layer#70

Draft
elronbandel wants to merge 1 commit into evaleval:main from elronbandel:feature/session-result-layer

Conversation


@elronbandel elronbandel commented Mar 17, 2026

Draft Proposal: Agent Session Result Layer

This PR proposes extending Every Eval Ever with session-level reporting for agentic evaluations. It is a starting point for discussion, not a finished specification.

Full context and motivation: What Agent Evaluation Teams Don’t Tell You

The gap

EEE already captures:

  • aggregate score outcomes
  • agentic setup (agentic_eval_config, eval_limits, sandbox)
  • instance-level traces (interaction_type, messages, tool_calls)

What is missing is standardized session-level semantics: how the run ended, which side failed, how much interaction occurred, and what system was actually evaluated. These are part of the evaluation result, not just diagnostics.

Proposed extensions (all optional under evaluation_results[])

session_result

  • status: success | unsuccessful | unfinished | error | cancelled | limit_reached
  • is_finished: boolean
  • finish_accepted: boolean
  • stop_reason: agent_done | timeout | max_steps | error | cancelled | benchmark_policy
  • error_attribution: agent | benchmark | external | unknown
  • error_detail: string
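
To make the proposed shape concrete, here is a sketch of what a conforming `session_result` record might look like. The field names and enum values come from this proposal; the validation helper is hypothetical and not part of any existing EEE tooling.

```python
# Illustrative sketch of a session_result record as proposed above.
# The enums mirror the proposed values; the checker is hypothetical.

SESSION_STATUSES = {"success", "unsuccessful", "unfinished", "error",
                    "cancelled", "limit_reached"}
STOP_REASONS = {"agent_done", "timeout", "max_steps", "error",
                "cancelled", "benchmark_policy"}
ERROR_ATTRIBUTIONS = {"agent", "benchmark", "external", "unknown"}

def validate_session_result(sr: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    if sr.get("status") not in SESSION_STATUSES:
        problems.append(f"unknown status: {sr.get('status')!r}")
    if sr.get("stop_reason") not in STOP_REASONS:
        problems.append(f"unknown stop_reason: {sr.get('stop_reason')!r}")
    if "error_attribution" in sr and sr["error_attribution"] not in ERROR_ATTRIBUTIONS:
        problems.append(f"unknown error_attribution: {sr['error_attribution']!r}")
    return problems

# A session that ran out of steps before the agent declared itself done.
example = {
    "status": "limit_reached",
    "is_finished": False,
    "finish_accepted": False,
    "stop_reason": "max_steps",
    "error_attribution": "benchmark",
    "error_detail": "hit step limit before the agent submitted",
}
assert validate_session_result(example) == []
```

Note how `status`, `stop_reason`, and `error_attribution` answer three distinct questions (how it ended, why it stopped, and whose fault any failure was), which is why they are separate fields rather than one enum.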

session_accounting

  • step_count, action_count, invalid_action_count, parallel_action_max: integer
  • time_to_first_action, wall_clock_seconds, agent_cost, benchmark_cost: number
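
A hypothetical `session_accounting` record using the fields proposed above; the int/float split follows the proposal, and the derived metrics below are illustrations of what downstream analysis could compute, not part of the schema.

```python
# Hypothetical session_accounting record; values are illustrative.
accounting = {
    "step_count": 42,            # integer fields
    "action_count": 40,
    "invalid_action_count": 3,
    "parallel_action_max": 4,
    "time_to_first_action": 1.8, # number fields (seconds / dollars)
    "wall_clock_seconds": 310.5,
    "agent_cost": 0.12,
    "benchmark_cost": 0.03,
}

# Simple derived signals a leaderboard might report:
valid_fraction = 1 - accounting["invalid_action_count"] / accounting["action_count"]
total_cost = accounting["agent_cost"] + accounting["benchmark_cost"]
```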

agent_system / benchmark_system

  • White-box description of the evaluated agent (models, tools, subagents, memory) and of the benchmark-side runtime, grader, and protocol
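
The proposal only says "white-box description", so the nesting below is one possible shape, not a schema commitment; all names and values are hypothetical.

```python
# Hypothetical agent_system / benchmark_system records. The structure is
# an illustration of "white-box description", not a fixed schema.
agent_system = {
    "models": [{"name": "example-model-v1", "role": "planner"}],
    "tools": ["browser", "terminal"],
    "subagents": [{"name": "coder", "model": "example-model-v1"}],
    "memory": {"type": "scratchpad", "persistent": False},
}

benchmark_system = {
    "runtime": "docker",
    "grader": "unit-tests",
    "protocol": "single-attempt",
}
```

Making this a first-class object is what lets a record say "this score belongs to this agent composition" instead of smuggling that identity through `model_info.additional_details`.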

eval_conditions

  • internet_access, memory_exposure, reset_policy, permissions, repeated_runs, seed
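
A hypothetical `eval_conditions` record; the field names come from this proposal, while every value (and the string conventions for `memory_exposure` and `reset_policy`) is an assumption for illustration.

```python
# Hypothetical eval_conditions record; values are illustrative only.
eval_conditions = {
    "internet_access": False,
    "memory_exposure": "none",
    "reset_policy": "fresh_sandbox_per_instance",
    "permissions": ["read_fs", "write_tmp"],
    "repeated_runs": 3,
    "seed": 1234,
}
```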

robustness (optional / emerging)

  • method, num_variants, variance_metric, variance_value
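
As one sketch of how the robustness fields could be populated: if a benchmark reran each instance under paraphrased prompts, it might report the score spread across variants. The method name and scores below are invented for illustration.

```python
import statistics

# Hypothetical: scores for the same instance across 5 paraphrased prompts.
scores = [0.80, 0.84, 0.76, 0.82, 0.78]

robustness = {
    "method": "prompt_paraphrase",       # illustrative method name
    "num_variants": len(scores),
    "variance_metric": "score_stddev",
    "variance_value": round(statistics.pstdev(scores), 3),
}
```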

Backward compatibility

  • No existing fields are removed or changed
  • All new fields are optional
  • Existing records remain valid
  • Model-only evaluations are unchanged

Suggested rollout

  • Add fields as optional in schema
  • Encourage early adoption of session_result and session_accounting
  • Keep composition/conditions/robustness optional while conventions converge

Discussion welcome

Feedback is especially welcome on:

  • field naming
  • schema placement/hierarchy
  • priority order for adoption

This proposal grew out of building the Open General Agent Leaderboard and surveying eight evaluation systems. More detail is in the linked post.


elronbandel commented Mar 18, 2026

This adapter is a good example of the abstraction gap we are trying to name.

Terminal-Bench 2.0 is evaluating an agent + model system, but the current mapping still has to center the record on model_info, then attach agent identity in model_info.additional_details, attach agentic setup in generation_config.generation_args, and encode the agent-model pair again in evaluation_id.

That works for ingestion, but it also shows the current schema is accommodating agent evaluation through metadata and config escape hatches rather than representing the evaluated agent system as a first-class object.

In other words: the data fits, but the abstraction is still model-shaped.

This is exactly why a session-result layer and a first-class agent-system layer would be useful. They would let EEE represent what is already being evaluated here directly, rather than indirectly.

