
Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models (CoRL2025)

[Paper] [Website]

Overview

IVE is an agentic exploration framework for robotics that uses vision-language models to drive autonomous data collection in open-ended environments. The system converts RGB-D observations into semantic scene graphs, imagines novel configurations, predicts their physical plausibility, and generates executable skill sequences. IVE achieves a 4.1–7.8× increase in entropy of visited states compared to RL-based exploration baselines, while producing demonstrations that match or exceed the quality of human-collected data for downstream learning.
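As a concrete illustration, a scene graph here can be thought of as a set of objects plus pairwise spatial relations. The encoding below is a minimal sketch, not the repo's exact schema (see `# SCENEGRAPH_DESIGN` in the code for the real format):

```python
# Minimal, illustrative scene-graph encoding (not the repo's exact schema):
# objects are names, relations are (subject, relation, anchor) triples.
scene_graph = {
    "objects": ["red block", "blue block", "green bowl"],
    "relations": [
        ("red block", "Stacked On", "blue block"),
        ("green bowl", "To The Left Of", "blue block"),
    ],
}

# A "novel" target graph is one whose relation set has not been visited
# before -- conceptually what the Explorer checks against its memory.
visited = {frozenset(scene_graph["relations"])}
target = frozenset({("red block", "Stacked On", "green bowl")})
is_novel = target not in visited  # True
```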

System Components

The pipeline consists of four components that work together in a closed loop:

| Component | Role |
|---|---|
| Scene Describer | Converts visual observations into a semantic scene graph: a structured representation of which objects are present and their spatial relationships |
| Explorer | Proposes a novel target scene graph that has not been visited before, guided by memory retrieval over previously seen states |
| Verifier | Checks whether the proposed transition from the current scene graph to the target is physically plausible before committing to execution |
| Action Tools | Translates the verified target scene graph into a concrete skill sequence that manipulates objects in the environment |

At each step the Explorer and Verifier iterate: if the Verifier rejects a proposal, the Explorer revises it using the rejection reason as feedback.
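That refinement loop can be sketched as follows; the function and method names are illustrative, not the repo's actual API:

```python
def explore_step(explorer, verifier, current_graph, max_rounds=3):
    """One Explorer-Verifier refinement loop (illustrative sketch).

    `explorer.propose` and `verifier.check` are hypothetical interfaces:
    propose(graph, feedback) -> target graph,
    check(current, target) -> (is_plausible, reason).
    """
    feedback = None
    for _ in range(max_rounds):
        target = explorer.propose(current_graph, feedback=feedback)
        ok, reason = verifier.check(current_graph, target)
        if ok:
            return target    # hand off to the Action Tools
        feedback = reason    # rejection reason guides the next proposal
    return None              # give up once the iteration budget is spent
```

In the actual config, the iteration budget corresponds to `explorer_verifier_iteration`.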

Simulation Environment

This codebase uses a modified version of VimaBench for tabletop manipulation. We extended it with two action modes:

Relation-based Placement — place an object at a named spatial relation relative to another object (e.g., In Front Of, Stacked On, To The Left Of).

*(Figure: simulation scene)*

Region-based Placement — place an object at a specific cell in a named grid overlaid on the workspace (e.g., B3, D7).

*(Figure: simulation scene with grid overlay)*

Both modes are available as action tools and can be configured or extended; see `# ACTIONTOOL_DESIGN` in the code and the files under `prompt/action_design_3/`.
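The two modes can be pictured as two small action schemas. The dataclass names below are hypothetical, chosen only to mirror the descriptions above, not taken from the repo:

```python
from dataclasses import dataclass

@dataclass
class RelationPlacement:
    """Place `obj` at a named spatial relation relative to `anchor`."""
    obj: str
    relation: str  # e.g. "In Front Of", "Stacked On", "To The Left Of"
    anchor: str

@dataclass
class RegionPlacement:
    """Place `obj` at a named grid cell on the workspace."""
    obj: str
    cell: str      # e.g. "B3", "D7"

plan = [
    RelationPlacement("red block", "Stacked On", "blue block"),
    RegionPlacement("green bowl", "D7"),
]
```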

Installation

```shell
conda create -n scenegraph-explr-sim python=3.11
conda activate scenegraph-explr-sim
```

Install PyTorch (tested with torch 2.4.1 + cu118) and dependencies:

```shell
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
cd VimaBench
pip install -e .
cd ..
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
pip install openai omegaconf scikit-learn wandb
```

Setup

Copy `.env.example` to `.env` and add your OpenAI API key:

```shell
cp .env.example .env
# edit .env and set OPENAI_API_KEY
```

How to Run

1. Collect dataset

Pick an actor in `config/sim_config.yaml`:

| Actor | Description |
|---|---|
| `gpt` | Full IVE pipeline: Scene Describer → Explorer → Verifier → Action Tools |
| `heuristic_scenegraph_generator` | Heuristic scene graph generator with VLM action planner |
| `random_actiontool` | Random action-tool selection (baseline) |
| `random_position` | Random pick-and-place position (baseline) |

Then run:

```shell
python run_sim.py
```

Trajectory and scene graph data are saved to `dataset/`.

Config values can be overridden from the command line:

```shell
python run_sim.py actor=random_actiontool num_episode=5
```

2. Replay saved trajectories (sanity check)

```shell
python run_replay.py
```

Configuration

All settings are in `config/sim_config.yaml`. Key options:

| Field | Description |
|---|---|
| `actor` | Exploration strategy (see the actor table above) |
| `model` | OpenAI model used for all VLM calls |
| `num_episode` | Number of episodes to collect |
| `num_queries_per_episode` | Number of VLM queries per episode |
| `num_steps_per_query` | Number of actions per query |
| `explorer_verifier_iteration` | Max Explorer–Verifier refinement rounds per step |
| `saving_dir` | Output directory for the collected dataset |
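The `key=value` command-line overrides shown above follow OmegaConf-style semantics (OmegaConf is in the dependency list). The pure-Python sketch below only illustrates how such overrides replace YAML defaults; it is not the repo's actual loading code:

```python
def apply_overrides(cfg: dict, overrides: list[str]) -> dict:
    """Apply key=value CLI overrides to a flat config dict.

    Illustrative only: the real pipeline presumably merges the YAML file
    and CLI arguments via omegaconf rather than this hand-rolled parser.
    """
    out = dict(cfg)
    for item in overrides:
        key, _, value = item.partition("=")
        # Keep integer-looking values as ints, everything else as strings.
        out[key] = int(value) if value.isdigit() else value
    return out

base = {"actor": "gpt", "num_episode": 10, "saving_dir": "dataset/"}
cfg = apply_overrides(base, ["actor=random_actiontool", "num_episode=5"])
# cfg["actor"] == "random_actiontool", cfg["num_episode"] == 5
```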

Modifying Action Tools / Scene Graph Design

- Search for `# SCENEGRAPH_DESIGN` in the code to change the scene graph format.
- Search for `# ACTIONTOOL_DESIGN` in the code to change the action primitives.
- Update the corresponding files in `prompt/action_design_*/` and `prompt/scenegraph_design_*/` accordingly.

Citation

```bibtex
@inproceedings{lee2025imagine,
  title={Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models},
  author={Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
  booktitle={Conference on Robot Learning},
  pages={4837--4858},
  year={2025},
  organization={PMLR}
}
```

About

Official implementation for [CoRL'25] Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models
