
EmergenceWebVoyager

An enhanced version of WebVoyager with methodological improvements for web agent evaluations

Introduction

EmergenceWebVoyager builds upon the original WebVoyager benchmark to address methodological limitations in web agent evaluations. This project implements principles outlined in our paper "Towards Methodological Consistency in WebAgent Evaluations" to provide a more rigorous, consistent, and reproducible framework for assessing web agent capabilities.

Key Contributions

  • Improved Methodological Framework: Standardized evaluation protocols that ensure consistent assessment across different web agents
  • Task Assertions: Each task includes a series of assertions that guide evaluators toward consistent success judgments.
  • Dynamic Task Generation: Enhanced task instantiation process that creates time-appropriate benchmarks
  • Structured Annotation System: Purpose-built annotation tool for consistent human evaluation
  • Transparent Leaderboard: Open, verifiable results with video evidence of agent performance

Leaderboard

Execution Videos

Check out the execution videos on the leaderboard pages. These videos provide transparent evidence of agent performance and enable verification of evaluation results.

Discord

Got feedback or found a bug? Hop into our Discord!

Benchmark Overview

EmergenceWebVoyager provides a comprehensive suite of web navigation tasks designed to evaluate web agents across various dimensions of capability. The benchmark addresses several methodological limitations identified in previous web agent evaluations:

  1. Temporal Relevance: Tasks that automatically update to remain contextually appropriate over time
  2. Reproducibility: Standardized testing protocol with complete task trajectories
  3. Comprehensive Metrics: Multi-dimensional evaluation beyond a simple success/failure binary
  4. Transparent Evaluation: Open leaderboard with verifiable execution videos

Task Structure

Each benchmark task consists of:

  • A templated intent with dynamic instantiation parameters
  • Clear success criteria in the form of assessment questions
  • Reference answers that illustrate possible successful outcomes
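To make the structure above concrete, a task entry might look like the following sketch. All field names and values here are illustrative assumptions, not the repository's actual task schema:

```python
# Hypothetical sketch of a single benchmark task entry.
# Field names ("intent_template", "assertions", etc.) are assumptions
# for illustration, not the repository's real JSON schema.
task = {
    # Templated intent with dynamic instantiation parameters
    "intent_template": "Find a flight from {origin} to {destination} on {date}.",
    "instantiation_params": {"origin": "SFO", "destination": "JFK"},
    # Success criteria expressed as assessment questions
    "assertions": [
        "Did the agent reach a flight search results page?",
        "Does the chosen flight match the requested route and date?",
    ],
    # Reference answers illustrating possible successful outcomes
    "reference_answers": [
        "Any nonstop SFO to JFK flight on the instantiated date.",
    ],
}

print(sorted(task.keys()))
```

An evaluator would walk through the `assertions` list in order, answering each question from the agent's recorded trajectory.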

Installation and Setup

Setup Environment

# Clone the repository
git clone https://github.com/emergenceai/EmergenceWebVoyager.git
cd EmergenceWebVoyager

# Create and activate virtual environment
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

Usage

Generating Benchmark Tasks

The benchmark includes a task instantiation script that generates time-appropriate versions of the task templates:

cd tasks
python instantiate_tasks.py

This will create a new JSON file with instantiated tasks using current dates and context-appropriate parameters.
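The date-filling step can be sketched as follows. This is a minimal illustration assuming templates use `str.format`-style placeholders; it is not the actual logic of `instantiate_tasks.py`:

```python
# Minimal sketch of dynamic task instantiation: fill a {date} placeholder
# with a context-appropriate future date so tasks stay temporally relevant.
# Assumes str.format-style templates; not the repository's real script.
from datetime import date, timedelta

def instantiate(template: str, days_ahead: int = 30) -> str:
    """Replace the {date} placeholder with a date `days_ahead` from today."""
    target = date.today() + timedelta(days=days_ahead)
    return template.format(date=target.isoformat())

print(instantiate("Book a hotel room for {date}."))
```

Regenerating tasks this way keeps benchmarks answerable at evaluation time (for example, a flight-booking task always asks about a future date).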

Annotation Tool

The benchmark includes a web-based annotation tool for consistent human evaluation:

cd AnnotationTool
python main.py

Visit http://localhost:8000 in your browser to access the annotation interface.

Project Structure

  • tasks/ - Contains task templates and instantiation scripts
  • AnnotationTool/ - Web-based tool for human evaluation of agent performance
  • leaderboard/ - Interactive leaderboard interface for viewing results
