IVE is an agentic exploration framework for robotics that uses vision-language models to drive autonomous data collection in open-ended environments. The system converts RGB-D observations into semantic scene graphs, imagines novel configurations, predicts their physical plausibility, and generates executable skill sequences. IVE achieves a 4.1–7.8× increase in entropy of visited states compared to RL-based exploration baselines, while producing demonstrations that match or exceed the quality of human-collected data for downstream learning.
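The coverage claim above is measured as entropy over visited states. As a reference point, here is a minimal sketch of Shannon entropy over an empirical state-visitation distribution (the paper's exact state discretization and log base are not specified here; this is just the standard formulation):

```python
import math
from collections import Counter

def visitation_entropy(states):
    """Shannon entropy (in nats) of the empirical distribution over visited states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A policy that revisits a single state has zero entropy;
# uniform coverage over distinct states maximizes it.
```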
The pipeline consists of four components that work together in a closed loop:
| Component | Role |
|---|---|
| Scene Describer | Converts visual observations into a semantic scene graph — a structured representation of what objects are present and their spatial relationships |
| Explorer | Proposes a novel target scene graph that has not been visited before, guided by memory retrieval over previously seen states |
| Verifier | Checks whether the proposed transition from the current scene graph to the target is physically plausible before committing to execution |
| Action Tools | Translates the verified target scene graph into a concrete skill sequence that manipulates objects in the environment |
At each step the Explorer and Verifier iterate: if the Verifier rejects a proposal, the Explorer revises it using the rejection reason as feedback.
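The loop above can be sketched as follows. This is a minimal illustration: the four components are passed in as plain callables here, whereas in the codebase they are VLM-backed modules with their own interfaces.

```python
# Sketch of one IVE exploration step (interfaces hypothetical).
#   describe(obs)            -> current scene graph
#   propose(graph, feedback) -> candidate target scene graph
#   verify(graph, target)    -> (is_plausible, rejection_reason)
#   act(graph, target)       -> executed skill sequence

def explore_step(obs, describe, propose, verify, act, max_rounds=3):
    current = describe(obs)
    feedback = None
    for _ in range(max_rounds):          # cf. explorer_verifier_iteration
        target = propose(current, feedback)
        ok, reason = verify(current, target)
        if ok:
            return act(current, target)  # verified: hand off to Action Tools
        feedback = reason                # revise using the rejection reason
    return None                          # no plausible target found this step
```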
This codebase uses a modified version of VimaBench for tabletop manipulation. We extended it with two action modes:
- **Relation-based Placement** — place an object at a named spatial relation relative to another object (e.g., In Front Of, Stacked On, To The Left Of).
- **Region-based Placement** — place an object at a specific cell in a named grid overlaid on the workspace (e.g., B3, D7).
Both modes are available as action tools and can be configured or extended; see `# ACTIONTOOL_DESIGN` in the code and the files under `prompt/action_design_3/`.
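As an orientation aid, the two placement modes can be thought of as taking the following shapes. The field names below are hypothetical; the real schemas are defined by the action-tool prompts under `prompt/action_design_3/`.

```python
# Illustrative action shapes (field names hypothetical).

relation_action = {
    "tool": "relation_based_placement",
    "object": "red block",
    "relation": "Stacked On",      # one of the named spatial relations
    "anchor": "blue bowl",
}

region_action = {
    "tool": "region_based_placement",
    "object": "red block",
    "cell": "B3",                  # cell in the grid overlaid on the workspace
}

def parse_cell(cell):
    """Map a grid label like 'B3' to zero-based (row, col) indices."""
    return ord(cell[0].upper()) - ord("A"), int(cell[1:]) - 1
```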
```bash
conda create -n scenegraph-explr-sim python=3.11
conda activate scenegraph-explr-sim
```

Install PyTorch (tested with torch 2.4.1 + cu118) and the remaining dependencies:

```bash
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
cd VimaBench
pip install -e .
cd ..
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
pip install openai omegaconf scikit-learn wandb
```

Copy `.env.example` to `.env` and add your OpenAI API key:

```bash
cp .env.example .env
# edit .env and set OPENAI_API_KEY
```

Pick an actor in `config/sim_config.yaml`:
| Actor | Description |
|---|---|
| `gpt` | Full IVE pipeline — Scene Describer → Explorer → Verifier → Action Tools |
| `heuristic_scenegraph_generator` | Heuristic scene graph generator with VLM action planner |
| `random_actiontool` | Random action tool selection (baseline) |
| `random_position` | Random pick-and-place position (baseline) |
Then run:

```bash
python run_sim.py
```

Trajectory and scene graph data are saved to `dataset/`.

Config values can be overridden from the command line:

```bash
python run_sim.py actor=random_actiontool num_episode=5
```

To replay collected trajectories:

```bash
python run_replay.py
```

All settings are in `config/sim_config.yaml`. Key options:
| Field | Description |
|---|---|
| `actor` | Exploration strategy (see above) |
| `model` | OpenAI model for all VLM calls |
| `num_episode` | Number of episodes to collect |
| `num_queries_per_episode` | Number of VLM queries per episode |
| `num_steps_per_query` | Number of actions per query |
| `explorer_verifier_iteration` | Max Explorer–Verifier refinement rounds per step |
| `saving_dir` | Output directory for dataset |
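The `key=value` overrides shown above are handled by OmegaConf's dotlist merging in the actual script. A stdlib-only sketch of the idea follows; the type-coercion rules here are illustrative, not OmegaConf's exact behavior.

```python
# Sketch of OmegaConf-style "a.b=c" dotlist overrides on a nested dict config.

def apply_overrides(config, argv):
    for arg in argv:
        key, _, raw = arg.partition("=")
        node = config
        *parents, leaf = key.split(".")
        for p in parents:                     # walk/create nested sections
            node = node.setdefault(p, {})
        for cast in (int, float):             # best-effort typing
            try:
                node[leaf] = cast(raw)
                break
            except ValueError:
                continue
        else:
            node[leaf] = {"true": True, "false": False}.get(raw.lower(), raw)
    return config

cfg = {"actor": "gpt", "num_episode": 10}
apply_overrides(cfg, ["actor=random_actiontool", "num_episode=5"])
# cfg is now {"actor": "random_actiontool", "num_episode": 5}
```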
- Find `# SCENEGRAPH_DESIGN` in the code to change the scene graph format.
- Find `# ACTIONTOOL_DESIGN` in the code to change the action primitives.
- Update the corresponding files in `prompt/action_design_*/` and `prompt/scenegraph_design_*/` accordingly.
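For orientation, one plausible minimal shape for a scene graph, plus an Explorer-style "not visited before" check over a memory of past graphs. This is purely illustrative; the actual format is whatever `# SCENEGRAPH_DESIGN` and the prompt files define.

```python
# Hypothetical minimal scene-graph shape: objects plus pairwise spatial relations.

scene_graph = {
    "objects": ["red block", "blue bowl", "yellow pan"],
    "relations": [
        ("red block", "Stacked On", "blue bowl"),
        ("yellow pan", "To The Left Of", "blue bowl"),
    ],
}

def is_novel(candidate, visited):
    """True if no previously visited graph has the same relation set."""
    key = frozenset(candidate["relations"])
    return key not in {frozenset(g["relations"]) for g in visited}
```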
```bibtex
@inproceedings{lee2025imagine,
  title={Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models},
  author={Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
  booktitle={Conference on Robot Learning},
  pages={4837--4858},
  year={2025},
  organization={PMLR}
}
```
