Agent-STAR

This repository is the official implementation of our paper: Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe.

We use TravelPlanner as a long-horizon tool-use testbed, where agents must iteratively call tools to satisfy multifaceted constraints, i.e., commonsense and hard constraints. We implement STAR [Data Synthesis → SFT → RL], a unified post-training pipeline that systematically studies the agentic RL design space across five axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability.

Please consider citing or giving a 🌟 if Agent-STAR is helpful to your work!

@misc{wu2026agentstar,
      title={Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe}, 
      author={Xixi Wu and Qianguo Sun and Ruiyang Zhang and Chao Song and Junlong Wu and Yiyan Qi and Hong Cheng},
      year={2026},
      eprint={2603.21972},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.21972}, 
}

🎙️ News

📑 [2026-03-24] Our paper is released on arXiv!

🌱 [2026-03-23] We release the data synthesis and inference framework of our STAR-pipeline, along with three STAR-SFT models and three STAR-RL models.

👀 Stay tuned for continuous updates: new STAR checkpoints, RL training frameworks, and RL scripts will be added over time!



📚 Dataset

We open-source our data synthesis code and 17K+ synthetic queries.

Synthetic datasets are available at https://huggingface.co/datasets/xxwu/Agent-STAR-TravelDataset.

| Data | Description |
| --- | --- |
| TravelPlanner_Val180.jsonl | Official TravelPlanner validation set of 180 instances |
| TravelTotal_17K.jsonl | All 17K+ synthetic queries after element sampling, feasibility checking, and back-translation |
| Travel_Mixed_1K_RL.jsonl | Default 1K RL training set with mixed difficulty |
| Travel_{Difficulty}_1K.jsonl | Difficulty-specific 1K sets (Easy / Medium / Hard) for controlled experiments |
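Each file is in JSON-Lines format, one query record per line. A minimal sketch of loading such a file (the field names below are hypothetical stand-ins; consult the dataset card for the actual schema):

```python
import json
from io import StringIO

# Toy stand-in for one of the JSONL files above; real field names may differ.
sample = StringIO(
    '{"query": "Plan a 3-day trip from Seattle to Denver...", "difficulty": "easy"}\n'
    '{"query": "Plan a 7-day trip visiting two cities...", "difficulty": "hard"}\n'
)

# Each non-empty line is a standalone JSON object.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), records[0]["difficulty"])  # 2 easy
```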

To generate your own training samples, follow the three-step pipeline:

  • Step 0 - Prepare the travel database

  • Step 1 - Element sampling

    cd DataSynthesis
    python3 step1_generate_dict.py --generate_dict --easy_num 1000 --medium_num 1000 --hard_num 1000
  • Step 2 - Feasibility checking

    Merge the sampled files into a single file, e.g., combine_3k.jsonl, then:

    python3 step2_dict_feasibility.py --path output/combine_3k.jsonl

    Output files: *_feasible.jsonl and *_failed.jsonl.

  • Step 3 - Query generation

    Use LLM back-translation to turn feasible JSON elements into natural-language TravelPlanner queries.

    Fill API keys in step3_generate_query.py first:

    python3 step3_generate_query.py --input_file output/combine_3k_feasible.jsonl --output_file output/combine_3k_final.jsonl
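The merge required before Step 2 is plain line-level concatenation of the Step 1 JSON-Lines outputs. A minimal sketch (the toy file names here are illustrative; in practice you would concatenate the files produced by step1_generate_dict.py):

```python
import glob
import json
import os
import tempfile

# Toy stand-ins for the per-difficulty files produced in Step 1.
workdir = tempfile.mkdtemp()
for name, rows in [("easy.jsonl", [{"id": 1}]), ("hard.jsonl", [{"id": 2}])]:
    with open(os.path.join(workdir, name), "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Concatenate every sampled file into a single combined file.
combined = os.path.join(workdir, "combine_3k.jsonl")
with open(combined, "w") as out:
    for path in sorted(glob.glob(os.path.join(workdir, "*.jsonl"))):
        if path == combined:
            continue  # do not re-read the output file itself
        with open(path) as f:
            out.write(f.read())

print(sum(1 for _ in open(combined)))  # 2
```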

🔧 Environment

For inference, the required packages are:

vllm==0.11.0
transformers==4.57.3
torch==2.8.0

Before running inference, download the travel database and put all the CSV files under the database/ folder.

If you use commercial LLMs such as Gemini3-Pro, Seed-1.8 / Seed-2.0, Kimi-K2.5, DeepSeek-V3.2, Qwen3-397B-A17B, etc., fill in the corresponding API keys in Inference/config.json.
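As a rough illustration only, an API-backed model entry typically needs an endpoint, a key, and a model name. The field names below are assumptions, not the file's actual schema; check Inference/config.json in the repo for the real layout:

```json
{
  "deepseek-chat": {
    "api_base": "https://api.example.com/v1",
    "api_key": "YOUR_API_KEY",
    "model_name": "deepseek-chat"
  }
}
```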

🤗 Model

We release both the STAR-SFT and STAR-RL tuned models at 1.5B / 3B / 7B scales for reproduction and further research.

Backbones: Qwen2.5-Instruct.

SFT: we fine-tune from the Qwen2.5-Instruct base using 1K successful trajectories generated by DeepSeek-V3.2-Exp-Thinking on synthetic queries. The corresponding SFT scripts are provided under SFTScripts/ and are based on LLaMAFactory==0.9.4.dev0.

RL: we follow the paper's scale-aware recipe: smaller models benefit from curriculum-style rewards and exploration-heavy algorithms, while 7B can leverage GRPO with the dense SUM reward for stronger performance and faster convergence.

| Model | Path |
| --- | --- |
| Agent-STAR-SFT-1.5B | https://huggingface.co/xxwu/Agent-STAR-SFT-1.5B |
| Agent-STAR-SFT-3B | https://huggingface.co/xxwu/Agent-STAR-SFT-3B |
| Agent-STAR-SFT-7B | https://huggingface.co/xxwu/Agent-STAR-SFT-7B |
| Agent-STAR-RL-1.5B | https://huggingface.co/xxwu/Agent-STAR-RL-1.5B |
| Agent-STAR-RL-3B | https://huggingface.co/xxwu/Agent-STAR-RL-3B |
| Agent-STAR-RL-7B | https://huggingface.co/xxwu/Agent-STAR-RL-7B |

The figure summarizes TravelPlanner test-set success across training variants and model scales.

🚀 Run Inference

The inference code lives under Inference/. The pipeline is: ReAct inference → Post-processing [NL → structured JSON] → Evaluation.
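The ReAct stage alternates model reasoning with tool calls against the travel database until the agent emits a final plan. A toy, self-contained sketch of that loop (the scripted "model", the tool, and the Thought/Action/Observation text format are illustrative assumptions, not the repo's actual interface):

```python
# Toy ReAct loop: a scripted "model" alternates Thought/Action steps,
# the environment executes each Action, and the loop stops at Finish.
def fake_model(history):
    # Stand-in for an LLM call; returns the next ReAct step.
    if "Observation: flights found" in history:
        return "Thought: I have the flights. Action: Finish[plan drafted]"
    return "Thought: I need flights. Action: FlightSearch[Seattle->Denver]"

def run_tool(action):
    # Stand-in for the travel-database tools (flight search, etc.).
    if action.startswith("FlightSearch"):
        return "Observation: flights found"
    return "Observation: unknown tool"

history, max_turns = "", 60
for _ in range(max_turns):
    step = fake_model(history)
    history += step + "\n"
    if "Action: Finish" in step:
        break  # the agent has produced its final plan
    action = step.split("Action: ", 1)[1]
    history += run_tool(action) + "\n"

print(history.count("Action:"))  # 2
```

The real loop is bounded by --max_turns and --max_context, which cap how long the agent can keep calling tools.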

Quick scripts

We provide ready-to-use scripts under Inference/scripts/:

  • infer_sotamodel.sh: run inference with strong commercial LLMs
  • infer_testset.sh: deploy a local model using vLLM and run inference on the TravelPlanner test set

Step-by-step

  • Step 0: Prepare the environment

    • Put the TravelPlanner travel CSVs under database/.
    • If you use commercial LLMs, fill Inference/config.json including API endpoints and keys.
  • Step 1: Run ReAct inference

    From the repo root, you can run:

    cd Inference
    python3 -u main.py \
      --model YOUR_MODEL_PATH_OR_NAME \
      --save_suffix YOUR_SUFFIX \
      --max_workers 20 \
      --split validation \
      --max_context 32768 \
      --max_turns 60

    Set --split to test for the test set.

    If you want to run inference on your own synthesized dataset, add: --use_custom_query --input_file YOUR_FILE.jsonl.

  • Step 2: Post-processing to format plans into the evaluation JSON

    Fill in the API key for the formatting model (DeepSeek-V3.2) in utils.py first:

    python3 -u post_process.py \
      --path PREDICTIONS.jsonl \
      --split validation \
      --format_model deepseek-chat

    Set --split to test for the test set.

    post_process.py converts the model’s natural-language itinerary into the strict JSON format used by the evaluators.

    ⚠️ IMPORTANT & 🚨 REWARD-HACKING PREVENTION

    • For the test set, post_process.py runs utils.py::format_planning_for_official_test to normalize transportation labels (e.g., driving/... → Self-driving) so TravelPlanner's checker matches correctly.
    • For local evaluation (e.g., the validation set), we use Inference/eval_commonsense.py::detect_transportation_type for stricter transport matching.
  • Step 3: Evaluation

    • Validation set:

      python3 -u eval.py --path YOUR_FORMATTED.jsonl --save_score
      
    • Test set:

      Use the official TravelPlanner leaderboard.
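The transport normalization flagged in Step 2 can be pictured as a simple label mapping. A toy sketch only: the actual rules live in utils.py and eval_commonsense.py, and the mapping below is illustrative, not the repo's real table:

```python
# Illustrative mapping from free-form transport mentions to the
# canonical labels the TravelPlanner checker expects.
CANONICAL = {
    "driving": "Self-driving",
    "self-driving": "Self-driving",
    "drive": "Self-driving",
    "flight": "Flight",
    "taxi": "Taxi",
}

def normalize_transport(text):
    # Case-insensitive lookup with a pass-through for unknown labels.
    return CANONICAL.get(text.strip().lower(), text.strip())

print(normalize_transport("Driving"))  # Self-driving
```

Without such normalization, a plan that says "driving" would fail the checker's exact match against "Self-driving" even when the itinerary is correct.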

📮 Contact

If you have any questions about usage or reproducibility, or would like to discuss, please feel free to open an issue on GitHub or contact the authors via email at xxwu@se.cuhk.edu.hk.

🙏 Acknowledgements

We sincerely thank the authors of TravelPlanner for creating a realistic and challenging benchmark, and for providing open-source resources and ongoing support for consistent evaluation. We also appreciate the open-sourced rLLM framework that supports our RL training. The first author, X. Wu, would also like to thank her previous colleagues from DeepResearch Team@Tongyi Lab for helpful hands-on experiences in agentic RL and valuable insights.
