This repository is the official implementation of our paper: Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe.
We use TravelPlanner as a long-horizon tool-use testbed, where agents must iteratively call tools to satisfy multifaceted constraints, i.e., commonsense and hard constraints. We implement STAR [Data Synthesis → SFT → RL], a unified post-training pipeline that systematically studies the agentic RL design space across five axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability.
Please consider citing or giving a 🌟 if Agent-STAR is helpful to your work!
```bibtex
@misc{wu2026agentstar,
      title={Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe},
      author={Xixi Wu and Qianguo Sun and Ruiyang Zhang and Chao Song and Junlong Wu and Yiyan Qi and Hong Cheng},
      year={2026},
      eprint={2603.21972},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.21972},
}
```
📑 [2026-03-24] Our paper is released on arXiv!
🌱 [2026-03-23] We release the data synthesis and inference framework of our STAR-pipeline, along with three STAR-SFT models and three STAR-RL models.
👀 Stay tuned for continuous updates: new STAR checkpoints, RL training frameworks, and RL scripts will be added over time!
We open-source our data synthesis code and 17K+ synthetic queries.
Synthetic datasets are available at https://huggingface.co/datasets/xxwu/Agent-STAR-TravelDataset.
| Data | Description |
|---|---|
| TravelPlanner_Val180.jsonl | Official TravelPlanner validation set of 180 instances |
| TravelTotal_17K.jsonl | All 17K+ synthetic queries after element sampling, feasibility checking, and back-translation |
| Travel_Mixed_1K_RL.jsonl | Default 1K RL training set with mixed difficulty |
| Travel_{Difficulty}_1K.jsonl | Difficulty-specific 1K sets, i.e., Easy / Medium / Hard for controlled experiments |
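All of the files above are standard JSON Lines, one record per line. A minimal loader sketch (the per-record field names vary by file, so inspect the keys yourself):

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example: inspect the first record of the mixed-difficulty RL set
# records = load_jsonl("Travel_Mixed_1K_RL.jsonl")
# print(records[0].keys())
```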
To generate your own training samples, follow the three-step pipeline:
- Step 0 - Prepare the travel database
  - Download: https://huggingface.co/datasets/xxwu/Agent-STAR-TravelDatabase
  - Put all CSV files under `database/`.

- Step 1 - Element sampling

  ```bash
  cd DataSynthesis
  python3 step1_generate_dict.py --generate_dict --easy_num 1000 --medium_num 1000 --hard_num 1000
  ```

- Step 2 - Feasibility checking

  Merge the sampled files into a single file, e.g., `combine_3k.jsonl`, then:

  ```bash
  python3 step2_dict_feasibility.py --path output/combine_3k.jsonl
  ```

  Output files: `*_feasible.jsonl` and `*_failed.jsonl`.

- Step 3 - Query generation

  Use LLM back-translation to turn feasible JSON elements into natural-language TravelPlanner queries. Fill in the API keys in `step3_generate_query.py` first:

  ```bash
  python3 step3_generate_query.py --input_file output/combine_3k_feasible.jsonl --output_file output/combine_3k_final.jsonl
  ```
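The merge in Step 2 is plain line-level concatenation of JSONL shards. A small helper sketch (input file names are whatever Step 1 produced on your machine):

```python
def merge_jsonl(in_paths, out_path):
    """Concatenate several JSONL shards into one file, line by line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():               # skip blank lines
                        out.write(line.rstrip("\n") + "\n")

# e.g. merge_jsonl(["output/easy.jsonl", "output/medium.jsonl",
#                   "output/hard.jsonl"], "output/combine_3k.jsonl")
```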
For inference, the required packages are:
```
vllm==0.11.0
transformers==4.57.3
torch==2.8.0
```
Before running inference, download the travel database and put all CSV files under the `database/` folder.
If you use commercial LLMs such as Gemini3-Pro, Seed-1.8 / Seed-2.0, Kimi-K2.5, DeepSeek-V3.2, Qwen3-397B-A17B, etc., fill in the corresponding API keys in `Inference/config.json`.
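For orientation, `Inference/config.json` maps each model to its endpoint and key. The sketch below is purely illustrative; the key names here are assumptions, so follow the template shipped in the repo:

```json
{
  "deepseek-chat": {
    "api_key": "sk-...",
    "base_url": "https://api.deepseek.com"
  }
}
```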
We release both the STAR-SFT and STAR-RL tuned models at the 1.5B / 3B / 7B scales for reproduction and further research.
Backbones: Qwen2.5-Instruct.
SFT: fine-tuned from the Qwen2.5-Instruct base on 1K successful trajectories generated by DeepSeek-V3.2-Exp-Thinking on synthetic queries. The corresponding SFT scripts are provided under `SFTScripts/` and are based on `LLaMAFactory==0.9.4.dev0`.
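A rough sketch of what a LLaMA-Factory SFT config for this setup could look like (field names follow LLaMA-Factory conventions; the dataset entry and paths are hypothetical, so defer to the files in `SFTScripts/`):

```yaml
# Illustrative LLaMA-Factory SFT config (not the shipped script)
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: star_travel_sft_1k     # hypothetical dataset registry entry
template: qwen
cutoff_len: 32768
output_dir: saves/agent-star-sft-7b
```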
RL: we follow the paper’s scale-aware recipe: smaller models benefit from curriculum-style rewards and exploration-heavy algorithms, while the 7B model can leverage GRPO with the dense SUM reward for stronger performance and faster convergence.
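As a rough illustration of the group-relative advantage at the core of GRPO (a minimal sketch of the standard formulation, not our training code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one prompt, scored by a dense reward:
# grpo_advantages([0.2, 0.5, 0.5, 0.8]) gives symmetric advantages
# around zero, so only relative quality within the group matters.
```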
| Model | Path |
|---|---|
| Agent-STAR-SFT-1.5B | https://huggingface.co/xxwu/Agent-STAR-SFT-1.5B |
| Agent-STAR-SFT-3B | https://huggingface.co/xxwu/Agent-STAR-SFT-3B |
| Agent-STAR-SFT-7B | https://huggingface.co/xxwu/Agent-STAR-SFT-7B |
| Agent-STAR-RL-1.5B | https://huggingface.co/xxwu/Agent-STAR-RL-1.5B |
| Agent-STAR-RL-3B | https://huggingface.co/xxwu/Agent-STAR-RL-3B |
| Agent-STAR-RL-7B | https://huggingface.co/xxwu/Agent-STAR-RL-7B |
The figure summarizes TravelPlanner test-set success across training variants and model scales.
The inference code lives under Inference/. The pipeline is:
ReAct inference → Post-processing [NL → structured JSON] → Evaluation.
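The ReAct stage can be pictured as the loop below (a simplified sketch: the real prompts and tool interfaces live in `Inference/main.py`, and `parse_action` plus the `Action: Tool[arg]` format are illustrative assumptions):

```python
import re

def parse_action(step):
    """Extract 'Action: ToolName[argument]' from a model step (hypothetical format)."""
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
    return (m.group(1), m.group(2)) if m else (None, None)

def react_loop(model, tools, query, max_turns=60):
    """Alternate model steps and tool observations until a final plan appears."""
    history = [f"Query: {query}"]
    for _ in range(max_turns):
        step = model("\n".join(history))   # model emits Thought + Action
        history.append(step)
        if "Final Plan:" in step:          # terminal answer reached
            return step
        name, arg = parse_action(step)
        if name in tools:                  # execute the tool call
            history.append(f"Observation: {tools[name](arg)}")
    return None                            # ran out of turns
```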
We provide ready-to-use scripts under Inference/scripts/:
- `infer_sotamodel.sh`: run inference with strong commercial LLMs
- `infer_testset.sh`: deploy a local model with vLLM and run inference on the TravelPlanner test set
- Step 0: Prepare the environment
  - Put the TravelPlanner travel CSVs under `database/`.
  - If you use commercial LLMs, fill in `Inference/config.json`, including API endpoints and keys.

- Step 1: Run ReAct inference

  From the repo root:

  ```bash
  cd Inference
  python3 -u main.py \
      --model YOUR_MODEL_PATH_OR_NAME \
      --save_suffix YOUR_SUFFIX \
      --max_workers 20 \
      --split validation \
      --max_context 32768 \
      --max_turns 60
  ```

  Use `--split test` for the test set. To run inference on your own synthesized dataset, add `--use_custom_query --input_file YOUR_FILE.jsonl`.

- Step 2: Post-processing to format plans into the evaluation JSON

  Fill in the API key of the formatting model, DeepSeek-V3.2, in `utils.py` first:

  ```bash
  python3 -u post_process.py \
      --path PREDICTIONS.jsonl \
      --split validation \
      --format_model deepseek-chat
  ```

  `post_process.py` converts the model’s natural-language itinerary into the strict JSON format used by the evaluators.

  ⚠️ IMPORTANT & 🚨 REWARD-HACKING PREVENTION
  - For the test set, `post_process.py` runs `utils.py::format_planning_for_official_test` to normalize `transportation` (e.g., `driving/...` → `Self-driving`) so TravelPlanner’s checker matches correctly.
  - For local evaluation, e.g., on the validation set, we use `Inference/eval_commonsense.py::detect_transportation_type` for stricter transport matching.

- Step 3: Evaluation
  - Validation set:

    ```bash
    python3 -u eval.py --path YOUR_FORMATTED.jsonl --save_score
    ```

  - Test set: submit to the official TravelPlanner leaderboard.
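The transportation normalization mentioned in Step 2 can be pictured as a toy mapping (a simplified sketch based only on the `driving/...` → `Self-driving` example above; the real, stricter rules live in `utils.py` and `Inference/eval_commonsense.py`):

```python
def normalize_transportation(raw):
    """Map free-form transport strings onto canonical labels (toy version)."""
    text = raw.lower()
    if "driv" in text:        # e.g. "driving", "self-drive"
        return "Self-driving"
    if "taxi" in text:
        return "Taxi"
    return raw                # leave anything else untouched
```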
If you have any questions about usage or reproducibility, or would like to discuss, please feel free to open an issue on GitHub or contact the authors via email at xxwu@se.cuhk.edu.hk.
We sincerely thank the authors of TravelPlanner for creating a realistic and challenging benchmark, and for providing open-source resources and ongoing support for consistent evaluation. We also appreciate the open-sourced rLLM framework that supports our RL training. The first author, X. Wu, would also like to thank her previous colleagues from DeepResearch Team@Tongyi Lab for helpful hands-on experiences in agentic RL and valuable insights.

