TraineeBench is a dynamic evaluation framework designed to evolve MLLM agent benchmarking from static, laboratory-controlled tests to stochastic, production-oriented workplace scenarios. By simulating a "corporate internship," the benchmark subjects agents to a continuous stream of tasks with shifting priorities and deadlines. Unlike information-complete benchmarks, TraineeBench enforces partial observability, requiring agents to proactively explore and uncover latent clues through interaction.
See the full paper on arXiv.
- Dynamic Workflows: Moves beyond static Q&A to simulate real-world task streams with changing requirements.
- Partial Observability: Agents must actively explore the environment to find necessary information.
- Procedural Generation: Decouples logical meta-task rules from randomized environment variables, enabling the generation of infinite, unique task instances.
- Stress Testing: Rigorously evaluates context-aware scheduling, robust decision-making, and strategic evolution.
Before generating the benchmark, you must configure the LLM API services.
- Create the config file: Create an `api_config.json` file in the project root.

  ```shell
  touch api_config.json
  ```
- Populate the configuration: Fill in your model details as shown below. The top-level key (e.g., `"gpt-4o"`) acts as the model alias you will use in CLI commands.

  ```json
  {
    "gpt-4o-mini": {
      "model_name": "gpt-4o-mini",
      "api_key_var": "sk-your_api_key",
      "base_url": "https://your.api.provider/v1/",
      "proxy_url": false
    },
    "gpt-4o": {
      "model_name": "gpt-4o",
      "api_key_var": "sk-your_api_key",
      "base_url": "https://your.api.provider/v1/",
      "proxy_url": "http://your.proxy.url/"
    }
  }
  ```

  - `model_name`: The actual model identifier required by the service provider.
  - `api_key_var`: Your actual API key string.
  - `proxy_url`: Set to `false` if no proxy is needed, or provide the proxy URL string.
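Before running the benchmark, it can be useful to sanity-check the config. The sketch below is a hypothetical helper (not part of TraineeBench) that validates each model alias against the schema shown above:

```python
import json
import os
import tempfile

# Hypothetical validator (not shipped with TraineeBench): check that every
# model alias in api_config.json carries the fields shown in the README.
REQUIRED_KEYS = {"model_name", "api_key_var", "base_url", "proxy_url"}

def load_api_config(path):
    with open(path) as f:
        config = json.load(f)
    for alias, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"alias {alias!r} missing keys: {sorted(missing)}")
    return config

# Smoke-test against a sample entry from this README.
sample = {
    "gpt-4o-mini": {
        "model_name": "gpt-4o-mini",
        "api_key_var": "sk-your_api_key",
        "base_url": "https://your.api.provider/v1/",
        "proxy_url": False,
    }
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
config = load_api_config(f.name)
os.unlink(f.name)
print(sorted(config))  # the aliases usable as --npc-model values
```

Each top-level key the validator accepts is a legal value for `--npc-model` in the commands below.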
Use the following command to procedurally generate benchmark instances based on your configuration.
```shell
uv run environments/traineebench/gen_bench_from_config.py \
    --config-path environments/traineebench/traineebench_config.json \
    --bench-path benchmarks/traineebench \
    --npc-model gpt-4o-mini
```

Parameters:

- `--npc-model`: The alias of the model used to simulate NPCs (must match a key in `api_config.json`).
Once generated, launch the benchmark evaluation harness:
```shell
uv run run_traineebench.py
```

As described in our paper, TraineeBench can generate an unlimited number of task instances from random parameters and can combine tasks. You can therefore follow the instructions below to create an entirely new `customized_config.json` and generate a large number of custom scenarios, then use this continuous stream of scenarios to train your agent and improve its performance.
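The mechanism behind this unlimited generation is the decoupling of meta-task rules from randomized environment variables (see the Procedural Generation feature above). The toy sketch below illustrates the idea; the task template and field names are hypothetical, not TraineeBench's actual schema:

```python
import random

# Toy illustration of procedural task generation: one fixed meta-task rule
# combined with randomized environment variables yields a unique, reproducible
# instance per seed. Template and names are hypothetical, not TraineeBench's.
META_TASK = "Compile the {topic} report and submit it to {npc} by day {deadline}."

TOPICS = ["sales", "hiring", "budget"]
NPCS = ["manager_Li", "manager_Kim"]

def generate_instance(seed):
    rng = random.Random(seed)  # same seed -> same instance (reproducible)
    return META_TASK.format(
        topic=rng.choice(TOPICS),
        npc=rng.choice(NPCS),
        deadline=rng.randint(1, 5),
    )

tasks = [generate_instance(s) for s in range(3)]
for t in tasks:
    print(t)
```

Because instances are derived from seeds, the same logical meta-task can be re-instantiated endlessly with fresh surface details, which is what makes continuous training data possible.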
Use the following command to generate a new randomized `customized_config.json`:
```shell
uv run environments/traineebench/customized_bench_configs.py \
    --config-path environments/traineebench/customized_config.json \
    --scenario-nums 10 \
    --day-nums 2
```

The `customized_bench_configs.py` script above offers limited customization. If you wish to generate config files with greater freedom, refer to `environments/traineebench/customized_bench_configs.py` and `environments/traineebench/task_hub.py`.
After generating `customized_config.json`, use `gen_bench_from_config.py` to build the customized benchmark:
```shell
uv run environments/traineebench/gen_bench_from_config.py \
    --config-path environments/traineebench/customized_config.json \
    --bench-path benchmarks/customized_bench \
    --npc-model gpt-4o-mini
```