Overview • Installation • Model Training • Model Download • One-Click Evaluation • Community Development
- [2026-01-12] 🚀🚀🚀 We open-sourced AgentCPM-Explore, a 4B-parameter agent foundation model that is competitive across 8 classic long-horizon and hard agent benchmarks, including GAIA, HLE, and BrowseComp. It achieves state-of-the-art performance at its parameter scale, enabling longer action chains and more accurate deep-research capabilities, thereby breaking the performance ceiling of on-device agents.
AgentCPM-Explore is an open-source agent foundation model jointly developed by THUNLP, Renmin University of China, and ModelBest. It is built on Qwen3-4B-Thinking-2507 (4 billion parameters), bringing the long-horizon task-solving capabilities of large models to on-device deployment.
Key highlights of AgentCPM-Explore include:
- The first on-device agent model with only 4B total parameters to be evaluated across 8 long-horizon and complex agent benchmarks, including GAIA, HLE, and BrowseComp.
- Supports over 100 turns of continuous environment interaction, enabling multi-source information cross-validation, dynamic search strategy adjustment, real-time verification of up-to-date information, and sustained deep exploration until task completion.
- Fully open-sourced pipeline, including an asynchronous agent reinforcement learning framework and a unified tool sandbox management platform, supporting community-driven development and custom extensions.
Demo (accelerated playback):
demo_en.mp4
Experimental Results:
| Model | GAIA (text only) | BrowseComp | BrowseComp (ZH) | HLE | Frames | WebWalkerQA | Seal-0 | xbench-DeepSearch |
|---|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | | |
| Claude-4.5-sonnet | 71.2% | 19.6% | 40.8% | 24.5% | 85.0% | / | 53.4% | 66.0% |
| Gemini Deep Research | / | / | / | 26.9% | / | / | / | / |
| Deepseek-V3.2 | 63.5% | 67.6% | 65.0% | 40.8% | 80.2% | / | 38.5% | 71.0% |
| Minimax-M2 | 75.7% | 44.0% | 48.5% | 31.8% | / | / | / | 72.0% |
| OpenAI-GPT-5-high | 76.4% | 54.9% | 65.0% | 35.2% | / | / | 51.4% | 77.8% |
| GLM-4.6 | 71.9% | 45.1% | 49.5% | 30.4% | / | / | / | 70.0% |
| Kimi-Researcher | / | / | / | 26.9% | 78.8% | / | 36.0% | 69.0% |
| Seed-1.8 | 87.4% | 67.6% | 81.3% | 40.9% | / | / | / | / |
| **Open-Source Models** | | | | | | | | |
| MiroThinker 8B | 66.4% | 31.1% | 40.2% | 21.5% | 80.6% | 60.6% | 40.4% | 60.6% |
| Tongyi DeepResearch 30B | 70.9% | 43.4% | 46.7% | 32.9% | 90.6% | 72.2% | / | 75.0% |
| ASearcher QWQ 32B v2 | 58.7% | / | / | / | 74.5% | / | / | 51.1% |
| iterresearch-30B-A3B | 72.8% | 37.3% | 45.2% | 28.8% | 71.0% | / | 39.6% | / |
| WebSailor-V2-30B-A3B (RL) | 74.1% | 35.3% | 44.1% | 30.6% | / | / | / | 73.7% |
| WebLeaper-30B-A3B-RUC | 73.2% | 38.8% | / | / | / | / | 48.6% | 72.0% |
| WebDancer (QwQ-32B) | 51.5% | 3.8% | 18.0% | / | / | 47.9% | / | 38.3% |
| ⭐ AgentCPM-Explore 4B | 63.9% | 24.1% | 29.1% | 19.1% | 82.7% | 68.1% | 40.5% | 70.0% |
- Docker & Docker Compose
- Python 3.10+
- At least 8GB RAM (16GB+ recommended)
AgentDock is the unified tool sandbox management platform for AgentCPM-Explore. It provides containerized deployment and management for MCP (Model Context Protocol) services.
Core Architecture:
| Component | Port | Description |
|---|---|---|
| `agentdock-manager` | 8080 | Management UI, container lifecycle management, health monitoring, API routing |
| `agentdock-mongodb` | 27017 | Persistent state storage |
| `agentdock-node-full` | 8004/8092 | Full-featured MCP node (GitHub, Slack, document processing, etc.) |
| `agentdock-node-explore` | 8014/8102 | Exploration node (web search, crawling, code execution, etc.) |
Quick Deployment:
# 1. Enter the AgentDock folder
cd AgentDock
# 2. Set the environment variables
cp .env.example .env
# Edit the .env file, setting the MongoDB password and optional API keys
# 3. One-click startup
docker compose up -d
# 4. Access the management dashboard
open http://localhost:8080

Set the environment variables (`.env`):
# Required: MongoDB authentication
MONGODB_USERNAME=admin
MONGODB_PASSWORD=your_password
# Optional: API Keys of search tools
JINA_API_KEY=your_jina_key # Jina Reader API
GOOGLE_SERP_API_KEY=your_serp_key # Google Search API
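Before running `docker compose up -d`, it can help to confirm the required variables are actually set. The following is a minimal sketch, not part of the project: it assumes simple `KEY=VALUE` lines (no quoting or `export`), and the variable names match the example above.

```python
# check_env.py -- sanity check for the AgentDock .env file (illustrative sketch).
# Assumes plain KEY=VALUE lines; comments and blank lines are skipped.

REQUIRED = ["MONGODB_USERNAME", "MONGODB_PASSWORD"]   # must be set
OPTIONAL = ["JINA_API_KEY", "GOOGLE_SERP_API_KEY"]    # search-tool keys

def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, dropping inline '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def missing_required(env: dict) -> list:
    """Return required variable names that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

if __name__ == "__main__":
    sample = "MONGODB_USERNAME=admin\nMONGODB_PASSWORD=secret\n"
    print(missing_required(parse_env(sample)))  # -> []
```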
- QuickStart tutorial video (setup & run): https://www.youtube.com/watch?v=j3dtYY9KCd0
  Recommended: follow along in the provided evaluation Docker container to avoid environment discrepancies.
- Multi-model, multi-tool collaborative environment setup: first, start the AgentDock tool sandbox platform to provide unified MCP (Model Context Protocol) tool services. For API-based models, configure the model's `BASE_URL` and `API_KEY`; for locally hosted models, ensure the model service is accessible. Configure the required tool parameters in the `config.toml` file.
- Launch the environment: out of the box, one-click startup. The AgentDock unified tool sandbox platform supports launching all services with a single `docker compose up -d` command, including the management dashboard, database, and tool nodes.
- Run execution: quickly experience the core capabilities of the framework via the QuickStart script, allowing you to run a complete Agent task without complex configuration.
- Prepare the evaluation environment (recommended):
  We provide a Docker image with all evaluation dependencies pre-installed. It is recommended to pull the image and run it directly:

  # Pull the image (supports amd64/arm64 architectures)
  docker pull yuyangfu/agenttoleap-eval:v2.0
  # Start the container (adjust the -v path as needed)
  docker run -dit --name agenttoleap --gpus all --network host -v $(pwd):/workspace yuyangfu/agenttoleap-eval:v2.0
  # Enter the container
  docker exec -it agenttoleap /bin/bash
  cd /workspace
- Configure and run:
  Open `quickstart.py` in the project root directory and make simple configurations in the [USER CONFIGURATION] section:
  - Custom task: modify the `QUERY` variable to the instruction you want to test (e.g., "Check the results of last night's UEFA Champions League matches").
  - Model information: provide your LLM `API_KEY`, `MODEL_NAME`, and `BASE_URL`.
  - Tool service: set `MANAGER_URL` to the address of your MCP tool server (e.g., `http://localhost:8000`; make sure the service is already running).

  After configuration, run:

  python quickstart.py

  The script will automatically create a demo task (by default, querying today's arXiv computer science papers), generate the execution workflow, and start the evaluation process.
- View results:
  After execution completes, results are saved under the `outputs/quickstart_results/` directory. Inspect `dialog.json` for the full interaction trace, including tool calls and reasoning chains.
  Note: in QuickStart mode, automatic scoring is skipped by default; this mode is intended only to demonstrate the Agent's execution capabilities.
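A quick way to sift through a saved trace is to count turns and tool calls. The sketch below is illustrative only: the turn schema (`"role"`, `"tool"` fields) is an assumption, so check the actual structure of your `dialog.json` before relying on it.

```python
# inspect_trace.py -- summarize a saved interaction trace (illustrative sketch).
# The turn schema ("role", "tool") is an assumed format, not the project's spec.
import json

def summarize(turns: list) -> dict:
    """Count turns and tool calls in a list of trace entries."""
    tools = [t["tool"] for t in turns if t.get("role") == "tool_call"]
    return {
        "turns": len(turns),
        "tool_calls": len(tools),
        "tools_used": sorted(set(tools)),
    }

if __name__ == "__main__":
    # e.g. turns = json.load(open("outputs/quickstart_results/dialog.json"))
    turns = [
        {"role": "assistant", "content": "I will search first."},
        {"role": "tool_call", "tool": "web_search"},
        {"role": "tool_result", "content": "..."},
        {"role": "assistant", "content": "Final answer."},
    ]
    print(summarize(turns))  # -> {'turns': 4, 'tool_calls': 1, 'tools_used': ['web_search']}
```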
To fully reproduce the reported results, the web-information summarization model must be launched with a matching configuration. Taking a locally hosted model as an example, the model is launched via sglang as follows:
export SUMMARY_MODEL="Qwen3-14b"
export SUMMARY_BASE_URL="YOUR-BASE-URL"
export SUMMARY_API_KEY="YOUR-API-KEY"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python sglang_init.py \
    --model-path YOUR-MODEL-PATH \
    --port YOUR-PORT \
    --tp-size 1 \
    --dp-size 8 \
    --api-key YOUR-API-KEY \
    --served-model-name YOUR-MODEL-NAME \
    --mcp_manager_url YOUR-SERVER-IP-ADDRESS

Our training is based on our in-house AgentRL framework.
Detailed training documentation: Please refer to AgentRL Training Documentation for a complete training guide, including environment setup, data preparation, training script configuration, and other details.
We provide a complete automated evaluation framework that supports one-click evaluation on 8 classic agent benchmarks, including GAIA and HLE. Each benchmark can be managed independently, while results are exported in a unified format—making it easy for developers to add new benchmarks on top of this framework.
Note: To ensure consistency in the evaluation environment, it is strongly recommended to run the evaluation within the Docker container mentioned in the QuickStart section.
For detailed parameter configuration, report explanation, and instructions on adding custom benchmarks, please refer to the AgentToLeaP Documentation.
Before running evaluation, please edit the corresponding launch script under AgentToLeaP/benchmarks/ (e.g., AgentToLeaP/benchmarks/gaia/run.sh).
| Variable | Example | Description |
|---|---|---|
| `MODEL_NAME` | "Qwen3-4B" | Name of the model under evaluation (API model field) |
| `BASE_URL` | "..." | Primary model API base URL |
| `API_KEY` | "sk-..." | Primary model API key |
| `MANAGER_URL` | "..." | Tool server (AgentDock) endpoint |
Take the GAIA benchmark as an example:
# 1. Enter the benchmark folder
cd AgentToLeaP/benchmarks/gaia
# 2. Adjust the configs in run.sh
# 3. Launch the evaluation
bash run.sh

Evaluation results will be saved under the directory specified by `EVALUATION_ROOT_DIR`, including the interaction trajectory `dialog.json`, raw results `result.json`, and a detailed report for each task.
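Since every benchmark exports results in a unified format, rolling per-task results into one summary is straightforward. The following sketch assumes illustrative field names (`"task_id"`, `"score"`); adapt them to the actual fields in `result.json`.

```python
# aggregate_results.py -- roll per-task results into one summary (sketch).
# Field names ("task_id", "score") are illustrative assumptions, not the
# actual AgentToLeaP schema -- adapt them to your result.json files.
import json

def aggregate(results: list) -> dict:
    """Compute task count and mean score over a list of result records."""
    scores = [r["score"] for r in results]
    return {
        "n_tasks": len(scores),
        "accuracy": sum(scores) / len(scores) if scores else 0.0,
    }

if __name__ == "__main__":
    raw = '[{"task_id": "gaia-001", "score": 1}, {"task_id": "gaia-002", "score": 0}]'
    print(aggregate(json.loads(raw)))  # -> {'n_tasks': 2, 'accuracy': 0.5}
```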
This framework is designed to be easily extensible. To add a new evaluation dataset:
- Create a directory: create a new folder under `AgentToLeaP/benchmarks/`.
- Prepare the data: inside this folder, create a `.jsonl` file with the same name.
- Configure the script: copy any existing `run.sh` and adjust its environment variables.
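For the data-preparation step, a `.jsonl` file holds one JSON record per line. The sketch below writes and reads such a file; the field names (`id`, `question`, `answer`) are an assumed schema for illustration — mirror the fields used by an existing benchmark such as `gaia` instead.

```python
# make_benchmark.py -- write a minimal .jsonl dataset (illustrative sketch).
# The record fields (id/question/answer) are assumptions; copy the schema
# of an existing benchmark's .jsonl file for real use.
import json

def write_jsonl(path: str, rows: list) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path: str) -> list:
    """Read a .jsonl file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

if __name__ == "__main__":
    rows = [{"id": "demo-1", "question": "What year was arXiv launched?", "answer": "1991"}]
    write_jsonl("my_benchmark.jsonl", rows)
    print(read_jsonl("my_benchmark.jsonl") == rows)  # -> True
```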
For more detailed instructions, please refer to the AgentToLeaP Documentation.
If developers want to integrate custom tools into the environment for training and evaluation, they can configure them by following the steps below:
1. Create an MCP tool service
Create a new tool service under the AgentDock/agentdock-node-explore/mcp_servers/ directory:
mkdir mcp_servers/my_custom_tool

2. Implement the tool logic
Create a tool service that conforms to the MCP protocol (Python example):
# mcp_servers/my_custom_tool/server.py
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("my-custom-tool")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="my_tool",
            description="tool description",
            inputSchema={"type": "object", "properties": {...}},
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "my_tool":
        result = process(arguments)  # process() implements your tool's own logic
        return [TextContent(type="text", text=result)]
    raise ValueError(f"Unknown tool: {name}")

3. Register the tool in the configuration file
Edit config.toml and add the new tool:
[mcpServers.my_custom_tool]
command = "python"
args = ["mcp_servers/my_custom_tool/server.py"]
env = { MY_API_KEY = "your_key" }

4. Restart the service to apply changes
docker compose restart agentdock-node-explore

Once one or more tools have been registered on the unified management platform, you can run inference, using the Qwen3 series as an example:
python quickstart.py \
--model_name "Qwen3-4B" \
--base_url "http://localhost:8000/v1" \
--api_key "your_api_key" \
--manager_url "http://localhost:8080"

If you need to switch to a different model, please refer to that model's documentation to obtain the special tokens required for tool calling. Then add a corresponding tool-call parser under the `src/tool_parser/` directory to parse the model's tool-invocation format, enabling access to the tool services and retrieval of execution results.
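As a sketch of what such a parser might look like, the snippet below extracts calls emitted as `<tool_call>{...}</tool_call>` JSON blocks. That tag format is an assumption (it resembles some Qwen chat templates); other models use different special tokens, and the actual parsers under `src/tool_parser/` may follow a different interface.

```python
# simple_parser.py -- minimal tool-call parser (illustrative sketch).
# Assumes the model wraps calls as <tool_call>{...json...}</tool_call>;
# this tag format is an assumption, not a guarantee for any given model.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list:
    """Return a list of {"name": ..., "arguments": ...} dicts found in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of aborting the rollout
        calls.append({"name": payload.get("name"),
                      "arguments": payload.get("arguments", {})})
    return calls

if __name__ == "__main__":
    out = 'Let me search. <tool_call>{"name": "web_search", "arguments": {"q": "GAIA"}}</tool_call>'
    print(parse_tool_calls(out))  # -> [{'name': 'web_search', 'arguments': {'q': 'GAIA'}}]
```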
This project builds upon and integrates ideas, tools, and resources from several open-source frameworks and models, including verl, trl, TongYi Deep Research, DeepSeek, as well as datasets such as ASearcher, WebExplorer, NVIDIA Nemotron, DeepDive, WebWalker, MiroVerse-Voyager1.0, HybridQA, and MegaScience.
Project Lead: Haotian Chen
Contributors (in alphabetical order): Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang
Project Supervisor: Yankai Lin, Zhiyuan Liu, Maosong Sun
If AgentCPM-Explore is useful for your research, please cite the codebase:
@software{AgentCPMExplore2026,
title = {AgentCPM-Explore: An End-to-End Infrastructure for Training and Evaluating LLM Agents},
  author = {Haotian Chen and Xin Cong and Shengda Fan and Yuyang Fu and Ziqin Gong and Yaxi Lu and Yishan Li and Boye Niu and Chengjun Pan and Zijun Song and Huadong Wang and Yesai Wu and Yueying Wu and Zihao Xie and Yukun Yan and Zhong Zhang and Yankai Lin and Zhiyuan Liu and Maosong Sun},
year = {2026},
url = {https://github.com/OpenBMB/AgentCPM-Explore}
}