AgentCPM-Explore

WeChat | 🎥 Demo Video

中文 | English

Overview | Installation | Model Training | Model Download | One-Click Evaluation | Community Development

📰 Latest News

  • [2026-01-12] 🚀🚀🚀 We open-sourced AgentCPM-Explore, a 4B-parameter agent foundation model evaluated on 8 classic long-horizon, high-difficulty agent benchmarks, including GAIA, HLE, and BrowseComp. It achieves state-of-the-art performance at its parameter scale, enabling longer action chains and more accurate deep-research capabilities, thereby breaking the performance ceiling of on-device agents.

🌟 Overview

AgentCPM-Explore is an open-source agent foundation model jointly developed by THUNLP, Renmin University of China, and ModelBest. It is built upon Qwen3-4B-Thinking-2507 with 4 billion parameters, bringing the long-horizon task-solving capabilities of large models to on-device deployment.

Key highlights of AgentCPM-Explore include:

  • The first on-device agent model with only 4B full parameters to be evaluated across 8 long-horizon, complex agent benchmarks, including GAIA, HLE, and BrowseComp.
  • Supports over 100 turns of continuous environment interaction, enabling multi-source information cross-validation, dynamic search strategy adjustment, real-time verification of up-to-date information, and sustained deep exploration until task completion.
  • Fully open-sourced pipeline, including an asynchronous agent reinforcement learning framework and a unified tool sandbox management platform, supporting community-driven development and custom extensions.

Demo (accelerated playback):

demo_en.mp4

Experimental Results:

| Model | GAIA (text only) | BrowseComp | BrowseComp (ZH) | HLE | Frames | WebWalkerQA | Seal-0 | xbench-DeepSearch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-Source Models** | | | | | | | | |
| Claude-4.5-sonnet | 71.2% | 19.6% | 40.8% | 24.5% | 85.0% | / | 53.4% | 66.0% |
| Gemini Deep Research | / | / | / | 26.9% | / | / | / | / |
| Deepseek-V3.2 | 63.5% | 67.6% | 65.0% | 40.8% | 80.2% | / | 38.5% | 71.0% |
| Minimax-M2 | 75.7% | 44.0% | 48.5% | 31.8% | / | / | / | 72.0% |
| OpenAI-GPT-5-high | 76.4% | 54.9% | 65.0% | 35.2% | / | / | 51.4% | 77.8% |
| GLM-4.6 | 71.9% | 45.1% | 49.5% | 30.4% | / | / | / | 70.0% |
| Kimi-Researcher | / | / | / | 26.9% | 78.8% | / | 36.0% | 69.0% |
| Seed-1.8 | 87.4% | 67.6% | 81.3% | 40.9% | / | / | / | / |
| **Open-Source Models** | | | | | | | | |
| MiroThinker 8B | 66.4% | 31.1% | 40.2% | 21.5% | 80.6% | 60.6% | 40.4% | 60.6% |
| Tongyi DeepResearch 30B | 70.9% | 43.4% | 46.7% | 32.9% | 90.6% | 72.2% | / | 75.0% |
| ASearcher QWQ 32B v2 | 58.7% | / | / | / | 74.5% | / | / | 51.1% |
| iterresearch-30B-A3B | 72.8% | 37.3% | 45.2% | 28.8% | 71.0% | / | 39.6% | / |
| WebSailor-V2-30B-A3B (RL) | 74.1% | 35.3% | 44.1% | 30.6% | / | / | / | 73.7% |
| WebLeaper-30B-A3B-RUC | 73.2% | 38.8% | / | / | / | / | 48.6% | 72.0% |
| WebDancer (QwQ-32B) | 51.5% | 3.8% | 18.0% | / | / | 47.9% | / | 38.3% |
| AgentCPM-Explore 4B | 63.9% | 24.1% | 29.1% | 19.1% | 82.7% | 68.1% | 40.5% | 70.0% |

⚡ Installation

⚙️ Requirements

  • Docker & Docker Compose
  • Python 3.10+
  • At least 8GB RAM (16GB+ recommended)

🐳 AgentDock Tool Sandbox Platform

AgentDock is the unified tool sandbox management platform for AgentCPM-Explore. It provides containerized deployment and management for MCP (Model Context Protocol) services.

Core Architecture:

| Component | Port | Description |
| --- | --- | --- |
| agentdock-manager | 8080 | Management UI, container lifecycle management, health monitoring, API routing |
| agentdock-mongodb | 27017 | Persistent state storage |
| agentdock-node-full | 8004/8092 | Full-featured MCP node (GitHub, Slack, document processing, etc.) |
| agentdock-node-explore | 8014/8102 | Exploration node (web search, crawling, code execution, etc.) |

Quick Deployment:

# 1. Enter the AgentDock folder
cd AgentDock

# 2. Set the environment variables
cp .env.example .env
# Edit the .env file to set the MongoDB password and optional API keys

# 3. One-click startup
docker compose up -d

# 4. Access the management dashboard
open http://localhost:8080

Set the environment variables (.env):

# Required: MongoDB authentication
MONGODB_USERNAME=admin
MONGODB_PASSWORD=your_password

# Optional: API Keys of search tools
JINA_API_KEY=your_jina_key        # Jina Reader API
GOOGLE_SERP_API_KEY=your_serp_key # Google Search API

🚀 QuickStart

  • QuickStart tutorial video (setup & run): https://www.youtube.com/watch?v=j3dtYY9KCd0
    Recommended: follow along in the provided evaluation Docker container to avoid environment discrepancies.

  • Multi-model, multi-tool collaborative environment setup: First, start the AgentDock tool sandbox platform to provide unified MCP (Model Context Protocol) tool services. When working with API-based models, configure the model’s BASE_URL and API_KEY. When working with locally hosted models, ensure the model service is accessible. Configure the required tool parameters in the config.toml file.

  • Launch the environment: startup works out of the box. The AgentDock unified tool sandbox platform launches all services, including the management dashboard, database, and tool nodes, with a single docker compose up -d command.

  • Run a task: the QuickStart script lets you experience the framework's core capabilities by running a complete Agent task without complex configuration.

  1. Prepare Evaluation Environment (Recommended):
    We provide a Docker image with all evaluation dependencies pre-installed. It is recommended to pull the image and run it directly:

    # Pull the image (Supports amd64/arm64 architectures)
    docker pull yuyangfu/agenttoleap-eval:v2.0
    
    # Start the container (Adjust the -v path as needed)
    docker run -dit --name agenttoleap --gpus all --network host -v $(pwd):/workspace yuyangfu/agenttoleap-eval:v2.0
    
    # Enter the container
    docker exec -it agenttoleap /bin/bash
    cd /workspace
  2. Configure and run:
    Open quickstart.py in the project root directory and make simple configurations in the [USER CONFIGURATION] section:

  • Custom task: Modify the QUERY variable to the instruction you want to test (e.g., “Check the results of last night’s UEFA Champions League matches”).
  • Model information: Provide your LLM API_KEY, MODEL_NAME, and BASE_URL.
  • Tool service: Set MANAGER_URL to the address of your MCP tool server (e.g., http://localhost:8000; make sure the service is already running).
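Concretely, the [USER CONFIGURATION] block in quickstart.py might look like the following sketch. The variable names mirror the bullets above, but the values are placeholders and the exact names in the shipped script may differ:

```python
# [USER CONFIGURATION] -- illustrative sketch; check quickstart.py for the actual names
QUERY = "Check the results of last night's UEFA Champions League matches"  # task instruction
MODEL_NAME = "Qwen3-4B"                  # model identifier sent to the API
BASE_URL = "http://localhost:8000/v1"    # OpenAI-compatible endpoint of your model
API_KEY = "your_api_key"                 # credential for the model service
MANAGER_URL = "http://localhost:8000"    # MCP tool server (must already be running)
```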

After configuration, run:

python quickstart.py

The script will automatically create a demo task (by default, querying today’s arXiv computer science papers), generate the execution workflow, and start the evaluation process.

  3. View Results

After execution completes, results will be saved under the outputs/quickstart_results/ directory. You can inspect dialog.json to obtain the full interaction trace, including tool calls and reasoning chains.
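As a sketch of inspecting the trace programmatically (the dialog.json schema assumed below is hypothetical; adjust the keys to what your run actually produces):

```python
import json

# Hypothetical schema: a list of turns, each with a "role" and optional "tool_calls".
def summarize_trace(turns):
    """Count turns per role and collect the names of all tool calls."""
    roles, tools = {}, []
    for turn in turns:
        roles[turn["role"]] = roles.get(turn["role"], 0) + 1
        for call in turn.get("tool_calls", []):
            tools.append(call["name"])
    return roles, tools

# Example usage with an in-memory trace instead of
# outputs/quickstart_results/dialog.json:
trace = [
    {"role": "assistant", "tool_calls": [{"name": "web_search"}]},
    {"role": "tool"},
    {"role": "assistant"},
]
roles, tools = summarize_trace(trace)
```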

Note: QuickStart mode skips automatic scoring by default; it is intended only to demonstrate the Agent’s execution capabilities.

To fully reproduce the reported results, the startup configuration of the web-information summarization model must be aligned with ours. Taking a locally hosted model as an example, the model is launched via sglang with the following configuration:

export SUMMARY_MODEL="Qwen3-14B"
export SUMMARY_BASE_URL="YOUR-BASE-URL"
export SUMMARY_API_KEY="YOUR-API-KEY"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python sglang_init.py \
--model-path YOUR-MODEL-PATH \
--port YOUR-PORT \
--tp-size 1 \
--dp-size 8 \
--api-key YOUR-API-KEY \
--served-model-name YOUR-MODEL-NAME \
--mcp_manager_url YOUR-SERVER-IP-ADDRESS

🎓 Model Training

Our training is based on our in-house AgentRL framework.

Detailed training documentation: Please refer to AgentRL Training Documentation for a complete training guide, including environment setup, data preparation, training script configuration, and other details.

📊 One-Click Evaluation

We provide a complete automated evaluation framework that supports one-click evaluation on 8 classic agent benchmarks, including GAIA and HLE. Each benchmark can be managed independently, while results are exported in a unified format—making it easy for developers to add new benchmarks on top of this framework.

Note: To ensure consistency in the evaluation environment, it is strongly recommended to run the evaluation within the Docker container mentioned in the QuickStart section.

For detailed parameter configuration, report explanation, and instructions on adding custom benchmarks, please refer to the AgentToLeaP Documentation.

⚙️ 1. Core Parameter Configuration

Before running evaluation, please edit the corresponding launch script under AgentToLeaP/benchmarks/ (e.g., AgentToLeaP/benchmarks/gaia/run.sh).

| Variable | Example | Description |
| --- | --- | --- |
| MODEL_NAME | "Qwen3-4B" | Name of the model under evaluation (API model field) |
| BASE_URL | "..." | Primary model API base URL |
| API_KEY | "sk-..." | Primary model API key |
| MANAGER_URL | "..." | Tool server (AgentDock) endpoint |
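Put together, the top of a launch script might look like the following sketch (all values are placeholders; check the actual run.sh for the authoritative variable names):

```shell
# Example header of AgentToLeaP/benchmarks/gaia/run.sh -- placeholder values
MODEL_NAME="Qwen3-4B"                 # model under evaluation
BASE_URL="http://localhost:8000/v1"   # primary model API base URL
API_KEY="sk-..."                      # primary model API key
MANAGER_URL="http://localhost:8080"   # AgentDock tool server endpoint
```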

🚀 2. Run Evaluation

Take the GAIA benchmark as an example:

# 1. Enter the benchmark folder
cd AgentToLeaP/benchmarks/gaia

# 2. Adjust the configs in run.sh

# 3. Launch the evaluation
bash run.sh

📄 3. Viewing Reports

Evaluation results will be saved under the directory specified by EVALUATION_ROOT_DIR, including the interaction trajectory (dialog.json), the raw results (result.json), and a detailed report for each task.

➕ 4. Adding a Custom Benchmark

This framework is designed to be easily extensible. To add a new evaluation dataset:

  1. Create a directory: Create a new folder under AgentToLeaP/benchmarks/.
  2. Prepare the data: Inside this folder, create a .jsonl task file with the same name as the folder.
  3. Configure the script: Copy any existing run.sh and adjust environment variables.
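The task-file format is not fully specified here, so the following sketch assumes a minimal schema with id/question/answer fields; adapt it to whatever fields your benchmark loader actually expects:

```python
import json

# Hypothetical minimal schema: one JSON object per line.
tasks = [
    {"id": "demo-001", "question": "What year was the Eiffel Tower completed?", "answer": "1889"},
    {"id": "demo-002", "question": "Who wrote 'Pride and Prejudice'?", "answer": "Jane Austen"},
]

# Write one JSON object per line, as .jsonl requires.
with open("my_benchmark.jsonl", "w", encoding="utf-8") as f:
    for task in tasks:
        f.write(json.dumps(task, ensure_ascii=False) + "\n")

# Reading it back line by line mirrors how a .jsonl loader would consume it.
with open("my_benchmark.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```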

For more detailed instructions, please refer to the AgentToLeaP Documentation.

🤝 Community Development

🛠️ Integrating Custom Tools

To integrate custom tools into the environment for training and evaluation, follow the steps below:

1. Create an MCP tool service

Create a new tool service under the AgentDock/agentdock-node-explore/mcp_servers/ directory:

mkdir mcp_servers/my_custom_tool

2. Implement the tool logic

Create a tool service that conforms to the MCP protocol (Python example):

# mcp_servers/my_custom_tool/server.py
from mcp.server import Server
from mcp.types import Tool, TextContent

server = Server("my-custom-tool")

@server.list_tools()
async def list_tools():
    # Advertise the tool and describe its input with a JSON schema
    return [
        Tool(
            name="my_tool",
            description="tool description",
            inputSchema={"type": "object", "properties": {...}}
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "my_tool":
        result = process(arguments)  # process() stands for your tool's own logic
        return [TextContent(type="text", text=result)]
    raise ValueError(f"Unknown tool: {name}")

3. Register the tool in the configuration file

Edit config.toml and add the new tool:

[mcpServers.my_custom_tool]
command = "python"
args = ["mcp_servers/my_custom_tool/server.py"]
env = { MY_API_KEY = "your_key" } 

4. Restart the service to apply changes

docker compose restart agentdock-node-explore

🔧 Integrating Custom Models

Once your tools are registered with the unified management platform, you can run inference with any compatible model. Taking the Qwen3 series as an example:

python quickstart.py \
    --model_name "Qwen3-4B" \
    --base_url "http://localhost:8000/v1" \
    --api_key "your_api_key" \
    --manager_url "http://localhost:8080"

If you need to switch to a different model, please refer to the corresponding model documentation to obtain the required special tokens for tool calling. Then, add a corresponding tool-call parser under the src/tool_parser/ directory to parse the model’s tool invocation format, enabling access to the tool services and retrieval of execution results.
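As an illustration of what such a parser does (the <tool_call> tag format below is an assumption; the actual special tokens are model-specific, so consult the model's documentation for the exact markers), a minimal version could extract JSON tool invocations from the raw model output:

```python
import json
import re

# Assumed format: each invocation is wrapped in <tool_call>...</tool_call>
# containing a JSON object with "name" and "arguments". Real models use
# their own special tokens; this regex is only a placeholder.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(model_output: str):
    """Return a list of (name, arguments) pairs found in the model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(model_output):
        payload = json.loads(match.group(1))
        calls.append((payload["name"], payload.get("arguments", {})))
    return calls

# Example usage on a synthetic model output:
output = ('Let me search. <tool_call>{"name": "web_search", '
          '"arguments": {"query": "arXiv cs papers"}}</tool_call>')
calls = parse_tool_calls(output)
```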

🙏 Acknowledgements

This project builds upon and integrates ideas, tools, and resources from several open-source frameworks and models, including verl, trl, TongYi Deep Research, DeepSeek, as well as datasets such as ASearcher, WebExplorer, NVIDIA Nemotron, DeepDive, WebWalker, MiroVerse-Voyager1.0, HybridQA, and MegaScience.

🤝 Contributions

Project Lead: Haotian Chen

Contributors (in alphabetical order): Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang

Project Supervisors: Yankai Lin, Zhiyuan Liu, Maosong Sun


📄 Citation

If AgentCPM-Explore is useful for your research, please cite the codebase:

@software{AgentCPMExplore2026,
  title  = {AgentCPM-Explore: An End-to-End Infrastructure for Training and Evaluating LLM Agents},
  author = {Haotian Chen and Xin Cong and Shengda Fan and Yuyang Fu and Ziqin Gong and Yaxi Lu and Yishan Li and Boye Niu and Chengjun Pan and Zijun Song and Huadong Wang and Yesai Wu and Yueying Wu and Zihao Xie and Yukun Yan and Zhong Zhang and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year   = {2026},
  url    = {https://github.com/OpenBMB/AgentCPM-Explore}
}