This is the official repository for the paper:
InteractAgent: Agentic Human Motion Interaction with Memory-Augmented LLMs
Tao Zhang*, Senhe Zhang*, Zeyu Zhang*†, Yujia Zhang, Dong Gong‡
*Equal contribution. †Project lead. ‡Corresponding author.
- `demo.mp4`
- `rescaled_1_re.mp4`: A 3D-rendered figure walks through a modern, open-plan bedroom and living area with terrazzo flooring, contemporary furniture, and soft natural lighting. The character pauses to sit in a mustard-yellow armchair, then stands and stretches before walking toward the sofa, giving a sense of scale and daily life within the stylish interior space.
- `rescaled_3_re.mp4`: A 3D-rendered figure walks from the bedroom area through a stylish open-plan space with terrazzo flooring, past a mustard armchair and modern sofa, heading toward a raised marble-tiled zone near a decorative umbrella and potted plant, showcasing spatial flow and interior design in a minimalist, contemporary home.
InteractAgent is a training-free framework for generating long-horizon, scene-aware human motion from images and natural-language goals. The paper formulates the system around three core stages:
- Scene understanding with a multimodal LLM.
- Atomic action planning in natural language.
- Motion execution with a frozen MotionLLM executor.
The paper introduces two reflection mechanisms that make the system robust over long horizons:
- Online Observation: step-wise evaluation of each executed action to decide whether to continue, redo, or replan.
- Memory-Augmented Reflection: trajectory-level refinement across multiple attempts, with adaptive stopping when the task is already solved.
In this repository, the practical implementation centers on Qwen-VL based scene reasoning and Motion-Agent based motion generation. The code also includes an interactive planner, a closed-loop prompt refinement mode, and utilities for converting generated motion into SMPL-X parameters for downstream visualization.
- `main.py`: main entry point with `interactive`, `motion-generation`, and `closed-loop` modes.
- `interactive_scene_planner.py`: menu-driven workflow for scene analysis, task creation, motion generation, and multi-round prompt reflection.
- `interactive_qwenvl.py`: Qwen-VL API wrapper for image-conditioned reasoning.
- `prompt_templates.py`: scene-analysis, planning, trajectory, and reflection prompt templates plus response parsing.
- `enhanced_motion_generator.py`: Motion-Agent based motion generation with token concatenation and sequence decoding.
- `scene_motion_planner.py`: earlier end-to-end scene-to-motion pipeline for batch-style processing.
- `trumans_utils/`: utilities and configs for joint-to-SMPL-X conversion and TRUMANS-style assets.
- `Motion-Agent/`: bundled Motion-Agent code used as the low-level motion executor.
The paper describes InteractAgent as a dual-loop system:
- The forward pipeline takes scene images and a task, extracts a scene description, decomposes the goal into short executable actions, and sends those actions to a frozen MotionLLM executor.
- Online Observation evaluates each rendered atomic action and can either accept it, redo it, or replan the remaining steps to prevent early mistakes from cascading.
- Memory-Augmented Reflection stores planning-execution history across attempts and lets the multimodal model improve the next plan using previous failures and successes.
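The dual-loop control flow described above can be sketched as follows. This is an illustrative skeleton, not the repository's actual code: the callables stand in for the multimodal planner/evaluator and the frozen motion executor, and all names are hypothetical.

```python
def run_with_reflection(plan_fn, execute_fn, observe_fn, solved_fn,
                        task, max_attempts=3):
    """Dual-loop sketch: Online Observation (inner loop) wrapped by
    Memory-Augmented Reflection (outer loop). All callables are
    hypothetical stand-ins for the planner, executor, and evaluator."""
    memory = []        # trajectory-level history shared across attempts
    executed = []
    for _ in range(max_attempts):
        plan = plan_fn(task, memory)          # propose atomic actions
        executed, i = [], 0
        while i < len(plan):
            result = execute_fn(plan[i])      # run one atomic action
            verdict = observe_fn(result)      # "continue" | "redo" | "replan"
            if verdict == "continue":
                executed.append(result)
                i += 1
            elif verdict == "redo":
                pass                          # retry the same atomic action
            else:                             # replan the remaining steps
                plan = plan[:i] + plan_fn(task, memory + [executed])
        memory.append((plan, executed))
        if solved_fn(executed):               # adaptive stopping
            break
    return executed
```

The inner loop prevents early mistakes from cascading; the outer loop lets each new attempt condition on the stored history of previous plans and executions.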
This repository implements the same overall idea as a research codebase, but with a practical emphasis on Qwen-driven planning and Motion-Agent execution. The paper's main experiments use GPT-4o as the multimodal planner/evaluator, while the current scripts default to qwen-vl-max for scene analysis and reflection.
```bash
conda create -n interactagent python=3.10
conda activate interactagent
pip install -r requirements.txt
```

Set a valid Qwen API key before running the planner:

```bash
export QWEN_API_KEY=your_api_key_here
```

The current code validates this key in `config.py`, and most planner paths rely on it.
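The key check can be approximated as below; this is a hypothetical stand-in, not the exact logic in `config.py`:

```python
import os

def get_qwen_api_key():
    """Read and minimally validate the Qwen API key from the environment.
    Illustrative only; config.py's actual validation may differ."""
    key = os.environ.get("QWEN_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "QWEN_API_KEY is not set; export it before running the planner.")
    return key
```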
This repository includes the Motion-Agent source code, but the required model weights are not bundled.
- The motion generation path in `main.py` expects a MotionLLM checkpoint at `Motion-Agent/ckpt/motionllm.pth`.
- The enhanced generator also expects access to a Gemma backbone directory named `gemma2b`.
- The SMPL-X conversion path expects TRUMANS-related assets under `trumans_utils/`, including checkpoint files referenced by `trumans_utils/config/config_sample_synhsi.yaml`.
In other words, the source code is present, but you still need to place the pretrained weights in the expected locations before motion generation or SMPL-X conversion can run successfully.
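A quick way to confirm the weights are in place is a small path check like the one below. The paths are taken from the list above; the helper itself is illustrative and not part of the repository:

```python
from pathlib import Path

# Asset locations the scripts expect, relative to the repository root.
REQUIRED_ASSETS = [
    "Motion-Agent/ckpt/motionllm.pth",                  # MotionLLM checkpoint
    "gemma2b",                                          # Gemma backbone directory
    "trumans_utils/config/config_sample_synhsi.yaml",   # TRUMANS config
]

def missing_assets(repo_root="."):
    """Return the subset of REQUIRED_ASSETS that does not exist yet."""
    root = Path(repo_root)
    return [p for p in REQUIRED_ASSETS if not (root / p).exists()]
```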
```bash
python main.py --check-env
```

This checks core Python dependencies and validates that the Qwen key and motion checkpoint path are available.
Launch the interactive menu:
```bash
python main.py --mode interactive
```

This mode is the most complete end-to-end demo in the repository. It lets you:
- Provide a single image or dual-view images.
- Ask Qwen-VL to analyze the scene.
- Generate an atomic motion plan in `"A person [action]"` format.
- Execute the plan with MotionLLM.
- Optionally convert the resulting motion to SMPL-X.
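Atomic plans in this format are simple to split into individual steps. A minimal sketch (the actual parsing in `prompt_templates.py` may differ):

```python
def split_atomic_actions(plan_text):
    """Split a semicolon-separated plan into atomic steps and keep only
    those matching the expected "A person ..." form. Sketch only."""
    steps = [s.strip() for s in plan_text.split(";")]
    return [s for s in steps if s.lower().startswith("a person ")]
```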
If you already have a motion prompt and want to run MotionLLM directly:
```bash
python main.py \
    --mode motion-generation \
    --motion-gen "A person turns to the right; A person walks forward 3 meters" \
    --motion-output-dir ./demo
```

To also convert the generated motion to SMPL-X:
```bash
python main.py \
    --mode motion-generation \
    --motion-gen "A person walks forward 2 meters" \
    --motion-output-dir ./demo \
    --convert-smplx
```

The repository also exposes a lightweight closed-loop mode that alternates between scene reasoning, MotionLLM generation, and prompt refinement:
```bash
python main.py \
    --mode closed-loop \
    --scene-image-url "https://your-public-image-url" \
    --task "Walk from the bed to the sofa and sit down" \
    --iterations 3 \
    --motion-output-dir ./demo
```

Useful options:
- `--qwen-model`: choose the Qwen model used for planning and reflection.
- `--rendered-video-url`: provide a public video URL so Qwen can critique an already rendered result.
- `--ask-video-url`: enter the render URL interactively after generation.
- `--prev-prompt`: seed the loop with a previous MotionLLM prompt that should be improved.
- `--history-file`: store reflection history as JSONL for later reuse.
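Because the `--history-file` format is JSONL (one JSON record per line), it can be read and extended with standard tooling. A minimal sketch; the record fields shown are illustrative, not the repository's exact schema:

```python
import json
from pathlib import Path

def append_history(path, record):
    """Append one reflection record to a JSONL file as a single line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_history(path):
    """Load all records from a JSONL history file (empty list if absent)."""
    p = Path(path)
    if not p.exists():
        return []
    return [json.loads(line)
            for line in p.read_text(encoding="utf-8").splitlines()
            if line.strip()]
```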
- Scene images are generally expected to be public URLs so the multimodal API can access them.
- Tasks are natural-language instructions such as navigation, reaching, sitting, or multi-step indoor interaction goals.
- Motion prompts are decomposed into short atomic steps, usually in the form `"A person ..."`, for compatibility with the parser and MotionLLM workflow.
Depending on the mode and available assets, the pipeline can generate:
- `motionllm_generated.mp4`: a rendered motion video.
- `motionllm_generated.npy`: generated joint motion data.
- `motionllm_generated_smplx.pkl`: SMPL-X parameters for downstream visualization.
- `test_motionllm_integrated.py`: a helper Blender script generated during SMPL-X export.
- `logs/closed_loop_history.jsonl`: prompt/reflection history from closed-loop runs.
- The repository is a research codebase, not a polished package. Some paths are hard-coded and assume local checkpoints are placed exactly where the scripts expect them.
- An `api` mode is declared in `main.py` but is not implemented.
- The closed-loop mode in this repo performs prompt-level reflection and history-aware replanning, but it remains a simplified version of the paper's full formulation, which adds rendered trajectory feedback and adaptive stopping.
- The paper evaluates the framework on long-horizon human-scene interaction benchmarks including TRUMANS and UniHSI. This repository mainly exposes the generation and reflection pipeline used to build that system.
This project builds on:
- Motion-Agent for the MotionLLM executor and token-based motion generation backbone.
- TRUMANS-related assets and utilities in `trumans_utils/` for joint-to-SMPL-X conversion.