TraineeBench is a dynamic evaluation framework designed to evolve MLLM agent benchmarking from static, laboratory-controlled tests to stochastic, production-oriented workplace scenarios. By simulating a "corporate internship," the benchmark subjects agents to a continuous stream of tasks with shifting priorities and deadlines. Unlike information-complete benchmarks, TraineeBench enforces partial observability, requiring agents to proactively explore and uncover latent clues through interaction.
See the full paper on arXiv.
- Dynamic Workflows: Moves beyond static Q&A to simulate real-world task streams with changing requirements.
- Partial Observability: Agents must actively explore the environment to find necessary information.
- Procedural Generation: Decouples logical meta-task rules from randomized environment variables, enabling the generation of infinite, unique task instances.
- Stress Testing: Rigorously evaluates context-aware scheduling, robust decision-making, and strategic evolution.
Before generating the benchmark, you must configure the LLM API services.
- Create the config file: Create an `api_config.json` file in the project root.

  ```shell
  touch api_config.json
  ```
- Populate the configuration: Fill in your model details as shown below. The top-level key (e.g., `"gpt-4o"`) acts as the model alias you will use in CLI commands.

  ```json
  {
    "gpt-4o-mini": {
      "model_name": "gpt-4o-mini",
      "api_key_var": "sk-your_api_key",
      "base_url": "https://your.api.provider/v1/",
      "proxy_url": false
    },
    "gpt-4o": {
      "model_name": "gpt-4o",
      "api_key_var": "sk-your_api_key",
      "base_url": "https://your.api.provider/v1/",
      "proxy_url": "http://your.proxy.url/"
    }
  }
  ```

  - `model_name`: The actual model identifier required by the service provider.
  - `api_key_var`: Your actual API key string.
  - `proxy_url`: Set to `false` if no proxy is needed, or provide the proxy URL string.
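Before running the benchmark, it can be useful to sanity-check the config. The sketch below is a hypothetical helper (not part of TraineeBench) that validates each model alias against the schema shown above:

```python
import json
import os
import tempfile

# Hypothetical validator (not shipped with TraineeBench): check that every
# model alias in api_config.json carries the fields shown in the README.
REQUIRED_KEYS = {"model_name", "api_key_var", "base_url", "proxy_url"}

def load_api_config(path):
    with open(path) as f:
        config = json.load(f)
    for alias, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"alias {alias!r} missing keys: {sorted(missing)}")
    return config

# Smoke-test against a sample entry from this README.
sample = {
    "gpt-4o-mini": {
        "model_name": "gpt-4o-mini",
        "api_key_var": "sk-your_api_key",
        "base_url": "https://your.api.provider/v1/",
        "proxy_url": False,
    }
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
config = load_api_config(f.name)
os.unlink(f.name)
print(sorted(config))  # the aliases usable as --npc-model values
```

Each top-level key the validator accepts is a legal value for `--npc-model` in the commands below.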
Use the following command to procedurally generate benchmark instances based on your configuration.
```shell
uv run environments/traineebench/gen_bench_from_config.py \
    --config-path environments/traineebench/traineebench_config.json \
    --bench-path benchmarks/traineebench \
    --npc-model gpt-4o-mini
```

Parameters:

- `--npc-model`: The alias of the model used to simulate NPCs (must match a key in `api_config.json`).
Once generated, launch the benchmark evaluation harness:
```shell
uv run run_traineebench.py
```

As described in our paper, TraineeBench can generate an unlimited number of task instances from random parameters and can combine tasks. You can therefore follow the instructions below to create an entirely new `customized_config.json` and generate a large number of custom scenarios, then use this continuous stream of scenarios to train your agent and improve its performance.
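The mechanism behind this unlimited generation is the decoupling of meta-task rules from randomized environment variables (see the Procedural Generation feature above). The toy sketch below illustrates the idea; the task template and field names are hypothetical, not TraineeBench's actual schema:

```python
import random

# Toy illustration of procedural task generation: one fixed meta-task rule
# combined with randomized environment variables yields a unique, reproducible
# instance per seed. Template and names are hypothetical, not TraineeBench's.
META_TASK = "Compile the {topic} report and submit it to {npc} by day {deadline}."

TOPICS = ["sales", "hiring", "budget"]
NPCS = ["manager_Li", "manager_Kim"]

def generate_instance(seed):
    rng = random.Random(seed)  # same seed -> same instance (reproducible)
    return META_TASK.format(
        topic=rng.choice(TOPICS),
        npc=rng.choice(NPCS),
        deadline=rng.randint(1, 5),
    )

tasks = [generate_instance(s) for s in range(3)]
for t in tasks:
    print(t)
```

Because instances are derived from seeds, the same logical meta-task can be re-instantiated endlessly with fresh surface details, which is what makes continuous training data possible.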
Use the following command to generate a new randomized `customized_config.json`:
```shell
uv run environments/traineebench/customized_bench_configs.py \
    --config-path environments/traineebench/customized_config.json \
    --scenario-nums 10 \
    --day-nums 2
```

The `customized_bench_configs.py` script above offers limited customization. If you wish to generate config files with greater freedom, refer to `environments/traineebench/customized_bench_configs.py` and `environments/traineebench/task_hub.py`.
After generating `customized_config.json`, use `gen_bench_from_config.py` to build the customized benchmark:
```shell
uv run environments/traineebench/gen_bench_from_config.py \
    --config-path environments/traineebench/customized_config.json \
    --bench-path benchmarks/customized_bench \
    --npc-model gpt-4o-mini
```