From Global Semantics to Region-aligned Guidance
RePlan is an instruction-based image editing framework designed to conquer Instruction-Visual Complexity (IV-Complexity). By coupling a Multimodal Large Language Model (MLLM) planner with a diffusion-based editor, RePlan shifts the paradigm from coarse global semantics to Region-aligned Guidance. It achieves significant improvements in fine-grained visual reasoning and background consistency without requiring massive paired training data.
Multiple regional instructions together with a global modification are executed simultaneously in a single forward pass with FLUX.2-Klein-9B. Our region attention injection is the first method to support high-quality, parallel, controllable editing across this many regions at once. The edits remain well localized, without spillover, while preserving a globally coherent appearance, effectively avoiding the cross-region interference and instruction omissions that commonly occur in highly complex scenarios.
- [Mar 19th, 2026] We added support for stage offloading to reduce GPU memory usage. With this update, FLUX.2-Klein-4B + RePlan can now run on a 16GB GPU!
- [Mar 19th, 2026] We now support the latest open-source state-of-the-art editing models, including Qwen-Image-Edit-2511, FireRed-Image-Edit, and FLUX.2-Klein.
- [Dec 26th, 2025] We updated the Gradio demo for custom attention control and optimized inference settings when using Qwen-Image-Edit as the backbone.
- [Dec 19th, 2025] We released the paper, model, and data of RePlan!
- Overview
- Introduction
- Environment Setup
- Inference
- Evaluation on IV-Edit Benchmark
- Train Your Own Planner
- Citation
Current instruction-based editing models struggle when intricate instructions meet cluttered, realistic scenes, a challenge we define as Instruction-Visual Complexity (IV-Complexity). In these scenarios, high-level global context is insufficient to distinguish specific targets from semantically similar objects (e.g., telling a "used cup" apart from a clean glass on a messy desk).
Existing methods, including unified VLM-diffusion architectures, predominantly rely on Global Semantic Guidance: they compress instructions into global feature vectors that lack spatial grounding. Consequently, edits often "spill over" into unrelated areas or modify the wrong targets, failing to preserve background consistency.
RePlan introduces a Plan-then-Execute framework that explicitly links text to pixels. Our key contributions include:
- Reasoning-Guided Planning: A VLM planner performs Chain-of-Thought (CoT) reasoning to decompose complex instructions into structured, region-specific guidance (Bounding Boxes + Local Hints); an illustrative plan format is sketched after this list.
- Training-Free Attention Injection: We introduce a mechanism tailored for Multimodal DiT (MMDiT) that executes edits via region-constrained attention. This enables precise, multi-region parallel edits in a single pass while preserving the background, without requiring any training of the DiT backbone.
- Efficient GRPO Training: We enhance the planner's reasoning capabilities using Group Relative Policy Optimization (GRPO). Remarkably, we achieve strong planning performance using only ~1k instruction-only samples, bypassing the need for large-scale paired image datasets.
- Interactive & Flexible Editing: RePlan's intermediate region guidance is fully editable, enabling user-in-the-loop intervention. Users can adjust bounding boxes or hints directly to refine results. Furthermore, our attention mechanism supports regional negative prompts to prevent bleeding effects.
- IV-Edit Benchmark: To foster future research, we establish IV-Edit, the first benchmark specifically designed to evaluate IV-Complex editing, filling the gap left by current subject-dominated evaluation sets.
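For intuition, the structured guidance produced by the planner can be pictured as a global prompt plus a list of region entries, each holding a bounding box and a local hint. The snippet below is purely illustrative: the field names and coordinate convention are assumptions, and the authoritative schema is whatever the released planner checkpoint emits.

```python
# Illustrative plan structure only; field names and the [x1, y1, x2, y2] pixel
# convention are assumptions, not the planner's guaranteed schema.
plan = {
    "global_prompt": "Replace the used cup on the desk with a small potted plant",
    "regions": [
        {
            "bbox": [412, 305, 598, 470],   # assumed [x1, y1, x2, y2] in pixels
            "hint": "a small potted plant in a ceramic pot",
        },
        # ...more entries for multi-region edits
    ],
}
```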
First, clone the repository:
git clone https://github.com/taintaintainu/RePlan-IVEdit-page.git
cd RePlan
If you only want to run demos or local inference, set up a lightweight environment:
conda create -n replan_infer python=3.10
conda activate replan_infer
pip install -e .
pip install flash-attn --no-build-isolation
For full training or evaluation on the IV-Edit benchmark:
conda create -n replan python=3.10
conda activate replan
pip install -e .[full]
pip install flash-attn --no-build-isolation
We provide both a Web Interface for interactive visualization/editing and a Command Line Interface.
Launch a local web server to interactively edit images. This interface supports visualizing and manually modifying the region guidance (Bounding Boxes & Hints) generated by the planner before execution.
By default, RePlan keeps the original inference behavior unchanged. For users with limited GPU memory, you can optionally enable stage-boundary offloading with --stage_offload cpu, which offloads the inactive model between the planning (VLM) stage and the editing stage. With this option enabled, flux_klein_4b + RePlan can run on a 16GB GPU.
# use Flux.1 Kontext dev as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_kontext"
# use Qwen-Image-Edit as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "qwenimageedit"
# use Qwen-Image-Edit-2511 / FireRed / FLUX.2-Klein variants
python replan/inference/app.py --server_port 8080 --pipeline_type "qwenimageedit2511"
python replan/inference/app.py --server_port 8080 --pipeline_type "firerededit"
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein"
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein_4b"
# optional: reduce peak GPU memory by offloading between VLM and edit stages
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein_4b" --stage_offload cpu
- Upload & Select: Click the top-left area to upload an image, or select any provided example.
- Baseline Method: Copy your instruction directly into the Global Prompt text box, skip the AI Plan step, and click Run Editing. The result will be identical to using the selected baseline backbone (e.g., Flux.1 Kontext dev, Qwen-Image-Edit, Qwen-Image-Edit-2511, FireRed-Image-Edit, or FLUX.2-Klein).
- RePlan Method: Enter your instruction and click AI Plan. After a few seconds, the model generates the Global Prompt and Region Guidance. You can see the visual BBoxes (Bounding Boxes) in the interactive interface. Click Run Editing to generate the RePlan result.
- Manual Correction: If you are unsatisfied with the AI-generated guidance, you can modify it:
  - Global Prompt: Edit the text in the prompt box below the interface.
  - Move/Resize BBoxes: Select a BBox in the interactive window to drag it, or resize it by dragging its edges.
  - Add/Delete BBoxes: Drag on any empty area in the interface to create a new BBox. To delete one, select it and click the Delete button in the bottom-right.
  - Edit Hints: After selecting a BBox, modify its specific Local Hint in the bottom-right text box.
  - Click Run Editing to apply your manual changes.
- Manual Design: You can skip the AI Plan entirely. Manually enter a Global Prompt, draw your own BBoxes/Hints using the tools above, and click Run Editing.
Example of the advanced control panel. "Main": global prompt. "Hi": Hint i. "N.Bi": Noise of bbox i. "I.Bi": Image of bbox i. "N.BG": Noise of background. "I.BG": Image of background.
- Interactive Attention Matrix: Defines the fine-grained attention mask between specific components to control information flow (see the sketch below). Noise/Image components distinguish the Latent Noise Patches from the Input Image Patches.
- Rule Switch Ratio: Determines the phase transition point (0.0-1.0) from the Custom Rules (defined in the attention matrix above) to the Default Rules (described in the paper). The cutoff step is Ratio × total steps (e.g., Ratio 0.7 with 50 steps means Custom Rules apply for steps 0–35).
- Locality Control Strategy: The demo's predefined custom rule disables Noise Background → Noise BBox attention to isolate the background from the edit target. The Rule Switch Ratio controls the duration (number of steps) of this isolation:
  - Higher Ratio: Enforces stronger locality. Increase it to fix semantic spillover (edits leaking into the background).
  - Lower Ratio: Allows more global interaction. Decrease it to fix boundary artifacts (unnatural, sharp edges).
- Expand Value: Expands the effective attention mask relative to the bounding box size (e.g., 0.15 expands the mask by 15% of the bbox size).
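For readers who want intuition for the rules above, the sketch below shows one way a region-constrained attention mask and the rule-switch step could be built. It is a simplified illustration, not the repository's implementation: the token grouping, mask shape, and helper names are assumptions.

```python
import torch

def locality_attention_mask(num_tokens: int, noise_ids: torch.Tensor, bbox_ids: torch.Tensor) -> torch.Tensor:
    """Boolean mask (True = attention allowed) that blocks Noise-Background -> Noise-BBox
    attention, mirroring the demo's locality rule. Token grouping is a simplified assumption."""
    mask = torch.ones(num_tokens, num_tokens, dtype=torch.bool)
    is_noise = torch.zeros(num_tokens, dtype=torch.bool)
    is_noise[noise_ids] = True
    in_bbox = torch.zeros(num_tokens, dtype=torch.bool)
    in_bbox[bbox_ids] = True

    noise_bg = (is_noise & ~in_bbox).nonzero(as_tuple=True)[0]   # noise tokens outside all bboxes
    noise_bbox = (is_noise & in_bbox).nonzero(as_tuple=True)[0]  # noise tokens inside a bbox
    mask[noise_bg[:, None], noise_bbox[None, :]] = False         # cut background -> target attention
    return mask

# Rule Switch Ratio: custom rules apply for roughly the first ratio * total_steps denoising steps.
total_steps, ratio = 50, 0.7
switch_step = round(ratio * total_steps)   # 35: custom rules before this step, default rules after
```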
Launch a session to load the model once and perform multiple inference rounds:
--stage_offload is optional and defaults to none, which preserves the original behavior. Set --stage_offload cpu only if you want to reduce peak GPU memory by offloading the inactive model between the planning and editing stages.
python replan/inference/run_replan.py --pipeline_type "flux_kontext" # use Flux.1 Kontext dev as RePlan backbone
python replan/inference/run_replan.py --pipeline_type "qwenimageedit" # use Qwen-Image-Edit as RePlan backbone
python replan/inference/run_replan.py --pipeline_type "qwenimageedit2511" # use Qwen-Image-Edit-2511
python replan/inference/run_replan.py --pipeline_type "firerededit" # use FireRed-Image-Edit-1.1
python replan/inference/run_replan.py --pipeline_type "flux_klein" # use FLUX.2-klein-9B
python replan/inference/run_replan.py --pipeline_type "flux_klein_4b" # use FLUX.2-klein-4B
# optional: reduce peak GPU memory by offloading between VLM and edit stages
python replan/inference/run_replan.py --pipeline_type "flux_klein_4b" --stage_offload cpu
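Conceptually, --stage_offload cpu keeps only the active stage's model on the GPU: the editor waits on CPU while the VLM plans, then the two swap. Below is a rough sketch of the idea, assuming placeholder planner/editor objects rather than the repository's actual classes.

```python
import torch

def run_with_stage_offload(planner, editor, image, instruction, device="cuda"):
    # Planning stage: keep the VLM on the GPU, park the diffusion editor on CPU.
    editor.to("cpu")
    planner.to(device)
    plan = planner.plan(image, instruction)        # placeholder API

    # Editing stage: swap the two models and free cached GPU memory.
    planner.to("cpu")
    torch.cuda.empty_cache()
    editor.to(device)
    return editor.edit(image, plan)                # placeholder API
```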
Run inference directly by providing arguments (the program ends after one editing round):
python replan/inference/run_replan.py \
--image "assets/cup.png" \
--instruction "Replace the cup that has been used and left on the desk with a small potted plant" \
--output_dir "output/inference" \
--pipeline_type "flux_kontext"
- --image: Path to the input image (relative or absolute).
- --instruction: Editing instruction text.
- --output_dir: Directory to save results (default: ./output/inference).
- --only_save_image: If set, only saves the edited image (omits the VLM response and visualization).
- --vlm_ckpt_path: Path to the Planner VLM checkpoint (default: TainU/RePlan-Qwen2.5-VL-7B).
- --pipeline_type: Diffusion pipeline type. Preferred explicit values are flux_kontext, qwenimageedit, qwenimageedit2509, qwenimageedit2511, firerededit, flux_klein, and flux_klein_4b. For backward compatibility, the legacy aliases flux, qwen, qwen2511, qwen_plus, and klein are also accepted.
- --stage_offload: Optional stage-boundary offload mode. none keeps the existing behavior unchanged; cpu offloads the inactive model between the VLM planning stage and the image editing stage to reduce peak GPU memory.
- --lora_path: (Experimental) Path to LoRA weights.
We propose the IV-Edit Benchmark to evaluate performance on IV-Complex editing scenarios.
- Dataset: ~800 manually verified instruction-image pairs.
- Capabilities: Tests 7 referring types (e.g., Spatial, Knowledge) and 16 task types (e.g., Attribute Modification, Physics Reasoning). Please refer to the paper appendix for specific categories.
- Metrics: We use Gemini-2.5-Pro to evaluate Target, Consistency, Quality, and Effect.
Examples from IV-Edit spanning a wide range of real-world scenarios and fine-grained instruction intents, including spatial, structural, and reasoning-intensive edits. Each instruction is decomposed into a referring expression and a task type, reflecting the need for both grounded understanding and visual transformation.
Evaluation consists of two steps: generating edited images and scoring them using Gemini.
Generate edited images using the desired backbone. Configuration files control model parameters and output paths.
# RePlan with Flux backbone
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_flux.yaml
# RePlan with Qwen-Image-Edit
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_qwen_image.yaml
# RePlan with Qwen-Image-Edit-2511
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_qwen_image2511.yaml
# RePlan with FireRed-Image-Edit-1.1
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_fire_red_image.yaml
# RePlan with FLUX.2-klein-9B
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_klein.yaml
# Original Flux.1 Kontext Dev
bash replan/eval/scripts/gen_kontext.sh replan/eval/config/flux_kontext.yaml
# Original Qwen-Image-Edit
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/qwen_image.yaml
# Original Qwen-Image-Edit-2511 / FireRed-Image-Edit-1.1
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/qwen_image_2511.yaml
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/fire_red_image.yaml
# Original FLUX.2-klein-9B
bash replan/eval/scripts/gen_klein.sh replan/eval/config/flux_klein.yaml
# Flux Kontext with Global Instruction Rephrasing (using Qwen)
bash replan/eval/scripts/gen_kontext_rephrase.sh replan/eval/config/qwen_kontext_rephrase.yaml
We use Gemini-2.5-Pro for evaluation.
- Configure API: Update replan/eval/api_config.yaml with your credentials. We support:
  - Vertex AI (credential or API key)
  - Google GenAI (API key)
  - OpenAI-compatible 3rd-party endpoints
- Run Evaluation:
# --edited_images_dir must match the output directory from step 1
# --output_json is the path to save the aggregated results
python evaluate.py \
    --edited_images_dir ./output/replan_flux \
    --output_json ./output/replan_flux.json
Quantitative comparison of open-source and proprietary image editing models on four evaluation dimensions. We also report Overall and Weighted scores. For open-source models, the highest score in each column is marked in bold, while the second highest is underlined. RePlan achieves the best consistency and overall score among open-source models.
The RePlan VLM planner is optimized using Group Relative Policy Optimization (GRPO) in two stages, requiring only ~1k samples.
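As background on the objective, GRPO samples a group of candidate plans for each instruction and normalizes every plan's reward against its own group, so no learned value function is needed. Below is a minimal sketch of the standard group-relative advantage (generic GRPO math, not RePlan-specific code).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (num_groups, group_size), one row per instruction.
    Each sampled plan's advantage is its reward normalized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 instructions, 4 sampled plans each
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.7, 0.3]])
print(group_relative_advantages(rewards))
```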
Ensure full dependencies are installed:
pip install -e .[full]
Stage 1 optimizes for valid JSON structure and reasoning-chain adequacy using format-related rewards.
bash replan/train/scripts/grpo_stage1.sh
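For intuition, a format-related reward of this kind can be as simple as checking that the planner's response parses as JSON with the expected fields and carries a non-trivial reasoning chain. The sketch below is illustrative only; the key names and scoring tiers are assumptions, and the actual reward functions live under replan/train/reward_function.

```python
import json

# Illustrative only: key names ("reasoning", "global_prompt", "regions") and the
# scoring tiers are assumptions, not the repository's reward definition.
def format_reward(response: str) -> float:
    try:
        plan = json.loads(response)
    except json.JSONDecodeError:
        return 0.0                       # not valid JSON
    if not isinstance(plan, dict):
        return 0.2                       # parses, but not an object
    score = 0.4
    if all(k in plan for k in ("global_prompt", "regions")) and plan.get("regions"):
        score += 0.3                     # structured region guidance present
    if len(str(plan.get("reasoning", ""))) > 50:
        score += 0.3                     # non-trivial reasoning chain
    return score
```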
Convert Checkpoint: You may need to convert the saved checkpoint to HuggingFace format:
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage1/global_step_30/actor
Stage 2 uses a large VLM (e.g., Qwen2.5-VL-72B) as a reward model to evaluate the execution quality (Target, Effect, Consistency) of the generated plans.
We use a reward-server paradigm for reward calculation, which requires deploying:
1. A VLLM server for the VLM reward model.
2. An editing model server (Kontext).
1. Start Reward Model Server (VLLM)
Navigate to the reward function directory:
cd replan/train/reward_function
Launch the VLLM server (example using Qwen2.5-VL-72B on 8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 4096 \
--limit-mm-per-prompt '{"image":2}' \
--disable-mm-preprocessor-cache \
--gpu-memory-utilization 0.35 \
--enforce-eager \
--port 8000
Note: You can use other VLM models or OpenAI-compatible APIs. Smaller models are supported but may degrade reward quality.
2. Start Editing Model Server (Kontext)
Deploy the image editing model. Each specified GPU will host a separate instance for data-parallel inference.
# Usage: bash start_kontext_server.sh <gpu_ids> <port>
bash start_kontext_server.sh 0,1,2,3,4,5,6,7 8001
Note: Since the reward and editing servers are accessed sequentially, they can share GPUs if memory permits.
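For reference, one reward round-trip roughly works as follows: render the plan through the Kontext server, then ask the VLM server to score the result. The sketch below only illustrates the flow; the Kontext request/response schema and the scoring prompt are assumptions, while the endpoints match the servers started above.

```python
import base64, requests
from openai import OpenAI

KONTEXT_URL = "http://localhost:8001/v1/images/generations"              # editing server from step 2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")    # VLLM server from step 1

instruction = "Replace the used cup on the desk with a small potted plant"
plan_prompt = instruction  # in practice, the planner's rendered plan for this sample

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# 1) Execute the plan on the editing server (payload/response schema is an assumption).
edit = requests.post(KONTEXT_URL, json={"prompt": plan_prompt, "image": b64("input.png")}).json()
edited_b64 = edit["data"][0]["b64_json"]

# 2) Ask the VLM reward model to score the edit (two images fits --limit-mm-per-prompt above).
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": f"Instruction: {instruction}\nScore Target, Effect, and Consistency from 0-10."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + b64("input.png")}},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + edited_b64}},
    ]}],
)
print(resp.choices[0].message.content)
```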
3. Configure and Run Training
Update the server URLs in replan/train/scripts/grpo_stage2.sh:
export KONTEXT_SERVER_URL="http://localhost:8001/v1/images/generations" # Your Kontext server URL
export OPENAI_API_BASE="http://localhost:8000/v1" # Your VLLM server URL
(If servers are on different nodes, update localhost to the corresponding IP addresses.)
Launch training:
bash replan/train/scripts/grpo_stage2.sh
4. Convert Final Checkpoint
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage2/global_step_40/actor
@article{qu2025replan,
title={RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing},
author={Tianyuan Qu and Lei Ke and Xiaohang Zhan and Longxiang Tang and Yuqi Liu and Bohao Peng and Bei Yu and Dong Yu and Jiaya Jia},
journal={arXiv preprint arXiv:2512.16864},
year={2025}
}