IVE is an agentic exploration framework for robotics that uses vision-language models to drive autonomous data collection in open-ended environments. The system converts RGB-D observations into semantic scene graphs, imagines novel configurations, predicts their physical plausibility, and generates executable skill sequences. IVE achieves a 4.1–7.8× increase in entropy of visited states compared to RL-based exploration baselines, while producing demonstrations that match or exceed the quality of human-collected data for downstream learning.
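The coverage claim above is measured as entropy over visited states. As a reference point, here is a minimal sketch of Shannon entropy over an empirical state-visitation distribution (the paper's exact state discretization and log base are not specified here; this is just the standard formulation):

```python
import math
from collections import Counter

def visitation_entropy(states):
    """Shannon entropy (in nats) of the empirical distribution over visited states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A policy that revisits a single state has zero entropy;
# uniform coverage over distinct states maximizes it.
```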
The pipeline consists of four components that work together in a closed loop:
| Component | Role |
|---|---|
| Scene Describer | Converts visual observations into a semantic scene graph — a structured representation of what objects are present and their spatial relationships |
| Explorer | Proposes a novel target scene graph that has not been visited before, guided by memory retrieval over previously seen states |
| Verifier | Checks whether the proposed transition from the current scene graph to the target is physically plausible before committing to execution |
| Action Tools | Translates the verified target scene graph into a concrete skill sequence that manipulates objects in the environment |
At each step the Explorer and Verifier iterate: if the Verifier rejects a proposal, the Explorer revises it using the rejection reason as feedback.
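The loop above can be sketched as follows. This is a minimal illustration: the four components are passed in as plain callables here, whereas in the codebase they are VLM-backed modules with their own interfaces.

```python
# Sketch of one IVE exploration step (interfaces hypothetical).
#   describe(obs)            -> current scene graph
#   propose(graph, feedback) -> candidate target scene graph
#   verify(graph, target)    -> (is_plausible, rejection_reason)
#   act(graph, target)       -> executed skill sequence

def explore_step(obs, describe, propose, verify, act, max_rounds=3):
    current = describe(obs)
    feedback = None
    for _ in range(max_rounds):          # cf. explorer_verifier_iteration
        target = propose(current, feedback)
        ok, reason = verify(current, target)
        if ok:
            return act(current, target)  # verified: hand off to Action Tools
        feedback = reason                # revise using the rejection reason
    return None                          # no plausible target found this step
```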
This codebase uses a modified version of VimaBench for tabletop manipulation. We extended it with two action modes:
- **Relation-based Placement** — place an object at a named spatial relation relative to another object (e.g., In Front Of, Stacked On, To The Left Of).
- **Region-based Placement** — place an object at a specific cell in a named grid overlaid on the workspace (e.g., B3, D7).
Both modes are available as action tools and can be configured or extended; see `# ACTIONTOOL_DESIGN` in the code and the files under `prompt/action_design_3/`.
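As an orientation aid, the two placement modes can be thought of as taking the following shapes. The field names below are hypothetical; the real schemas are defined by the action-tool prompts under `prompt/action_design_3/`.

```python
# Illustrative action shapes (field names hypothetical).

relation_action = {
    "tool": "relation_based_placement",
    "object": "red block",
    "relation": "Stacked On",      # one of the named spatial relations
    "anchor": "blue bowl",
}

region_action = {
    "tool": "region_based_placement",
    "object": "red block",
    "cell": "B3",                  # cell in the grid overlaid on the workspace
}

def parse_cell(cell):
    """Map a grid label like 'B3' to zero-based (row, col) indices."""
    return ord(cell[0].upper()) - ord("A"), int(cell[1:]) - 1
```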
```bash
conda create -n scenegraph-explr-sim python=3.11
conda activate scenegraph-explr-sim
```

Install PyTorch (tested with torch 2.4.1 + cu118) and the remaining dependencies:

```bash
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
cd VimaBench
pip install -e .
cd ..
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
pip install openai omegaconf scikit-learn wandb
```

Copy `.env.example` to `.env` and add your OpenAI API key:

```bash
cp .env.example .env
# edit .env and set OPENAI_API_KEY
```

Pick an actor in `config/sim_config.yaml`:
| Actor | Description |
|---|---|
| `gpt` | Full IVE pipeline — Scene Describer → Explorer → Verifier → Action Tools |
| `heuristic_scenegraph_generator` | Heuristic scene graph generator with VLM action planner |
| `random_actiontool` | Random action tool selection (baseline) |
| `random_position` | Random pick-and-place position (baseline) |
Then run:

```bash
python run_sim.py
```

Trajectory and scene graph data are saved to `dataset/`.

Config values can be overridden from the command line:

```bash
python run_sim.py actor=random_actiontool num_episode=5
```

To replay collected trajectories:

```bash
python run_replay.py
```

All settings are in `config/sim_config.yaml`. Key options:
| Field | Description |
|---|---|
| `actor` | Exploration strategy (see above) |
| `model` | OpenAI model for all VLM calls |
| `num_episode` | Number of episodes to collect |
| `num_queries_per_episode` | Number of VLM queries per episode |
| `num_steps_per_query` | Number of actions per query |
| `explorer_verifier_iteration` | Max Explorer–Verifier refinement rounds per step |
| `saving_dir` | Output directory for dataset |
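The `key=value` overrides shown above are handled by OmegaConf's dotlist merging in the actual script. A stdlib-only sketch of the idea follows; the type-coercion rules here are illustrative, not OmegaConf's exact behavior.

```python
# Sketch of OmegaConf-style "a.b=c" dotlist overrides on a nested dict config.

def apply_overrides(config, argv):
    for arg in argv:
        key, _, raw = arg.partition("=")
        node = config
        *parents, leaf = key.split(".")
        for p in parents:                     # walk/create nested sections
            node = node.setdefault(p, {})
        for cast in (int, float):             # best-effort typing
            try:
                node[leaf] = cast(raw)
                break
            except ValueError:
                continue
        else:
            node[leaf] = {"true": True, "false": False}.get(raw.lower(), raw)
    return config

cfg = {"actor": "gpt", "num_episode": 10}
apply_overrides(cfg, ["actor=random_actiontool", "num_episode=5"])
# cfg is now {"actor": "random_actiontool", "num_episode": 5}
```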
- Find `# SCENEGRAPH_DESIGN` in the code to change the scene graph format.
- Find `# ACTIONTOOL_DESIGN` in the code to change the action primitives.
- Update the corresponding files in `prompt/action_design_*/` and `prompt/scenegraph_design_*/` accordingly.
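For orientation, one plausible minimal shape for a scene graph, plus an Explorer-style "not visited before" check over a memory of past graphs. This is purely illustrative; the actual format is whatever `# SCENEGRAPH_DESIGN` and the prompt files define.

```python
# Hypothetical minimal scene-graph shape: objects plus pairwise spatial relations.

scene_graph = {
    "objects": ["red block", "blue bowl", "yellow pan"],
    "relations": [
        ("red block", "Stacked On", "blue bowl"),
        ("yellow pan", "To The Left Of", "blue bowl"),
    ],
}

def is_novel(candidate, visited):
    """True if no previously visited graph has the same relation set."""
    key = frozenset(candidate["relations"])
    return key not in {frozenset(g["relations"]) for g in visited}
```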
```bibtex
@inproceedings{lee2025imagine,
  title={Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models},
  author={Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
  booktitle={Conference on Robot Learning},
  pages={4837--4858},
  year={2025},
  organization={PMLR}
}
```
