From Global Semantics to Region-aligned Guidance
RePlan is an instruction-based image editing framework designed to conquer Instruction-Visual Complexity (IV-Complexity). By coupling a Multimodal Large Language Model (MLLM) planner with a diffusion-based editor, RePlan shifts the paradigm from coarse global semantics to Region-aligned Guidance. It achieves significant improvements in fine-grained visual reasoning and background consistency without requiring massive paired training data.
Multiple regional instructions together with a global modification are executed simultaneously in a single forward pass with FLUX.2-Klein-9B. Our region attention injection is the first method to support high-quality, parallel, controllable editing across this many regions at once. The edits remain well localized, without spillover, while preserving a globally coherent appearance, effectively avoiding the cross-region interference and instruction omissions that commonly occur in highly complex scenarios.
- [Mar 19th, 2026] We added support for stage offloading to reduce GPU memory usage. With this update, FLUX.2-Klein-4B + RePlan can now run on a 16GB GPU!
- [Mar 19th, 2026] We now support the latest open-source state-of-the-art editing models, including Qwen-Image-Edit-2511, FireRed-Image-Edit, and FLUX.2-Klein.
- [Dec 26th, 2025] We updated the Gradio demo for custom attention control and optimized inference settings when using Qwen-Image-Edit as the backbone.
- [Dec 19th, 2025] We released the paper, model, and data of RePlan!
- Overview
- Introduction
- Environment Setup
- Inference
- Evaluation on IV-Edit Benchmark
- Train Your Own Planner
- Citation
Current instruction-based editing models struggle when intricate instructions meet cluttered, realistic scenes, a challenge we define as Instruction-Visual Complexity (IV-Complexity). In these scenarios, high-level global context is insufficient to distinguish specific targets from semantically similar objects (e.g., telling a "used cup" apart from a clean glass on a messy desk).
Existing methods, including unified VLM-diffusion architectures, predominantly rely on Global Semantic Guidance: they compress instructions into global feature vectors that lack spatial grounding. Consequently, edits often "spill over" into unrelated areas or modify the wrong targets, failing to preserve background consistency.
RePlan introduces a Plan-then-Execute framework that explicitly links text to pixels. Our key contributions include:
- Reasoning-Guided Planning: A VLM planner performs Chain-of-Thought (CoT) reasoning to decompose complex instructions into structured, region-specific guidance (Bounding Boxes + Local Hints); an illustrative plan format is sketched after this list.
- Training-Free Attention Injection: We introduce a mechanism tailored for Multimodal DiT (MMDiT) that executes edits via region-constrained attention. This enables precise, multi-region parallel edits in a single pass while preserving the background, without requiring any training of the DiT backbone.
- Efficient GRPO Training: We enhance the planner's reasoning capabilities using Group Relative Policy Optimization (GRPO). Remarkably, we achieve strong planning performance using only ~1k instruction-only samples, bypassing the need for large-scale paired image datasets.
- Interactive & Flexible Editing: RePlan's intermediate region guidance is fully editable, enabling user-in-the-loop intervention. Users can adjust bounding boxes or hints directly to refine results. Furthermore, our attention mechanism supports regional negative prompts to prevent bleeding effects.
- IV-Edit Benchmark: To foster future research, we establish IV-Edit, the first benchmark specifically designed to evaluate IV-Complex editing, filling the gap left by current subject-dominated evaluation sets.
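For intuition, the structured guidance produced by the planner can be pictured as a global prompt plus a list of region entries, each holding a bounding box and a local hint. The snippet below is purely illustrative: the field names and coordinate convention are assumptions, and the authoritative schema is whatever the released planner checkpoint emits.

```python
# Illustrative plan structure only; field names and the [x1, y1, x2, y2] pixel
# convention are assumptions, not the planner's guaranteed schema.
plan = {
    "global_prompt": "Replace the used cup on the desk with a small potted plant",
    "regions": [
        {
            "bbox": [412, 305, 598, 470],   # assumed [x1, y1, x2, y2] in pixels
            "hint": "a small potted plant in a ceramic pot",
        },
        # ...more entries for multi-region edits
    ],
}
```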
First, clone the repository:
git clone https://github.com/taintaintainu/RePlan-IVEdit-page.git
cd RePlan
If you only want to run demos or local inference, set up a lightweight environment:
conda create -n replan_infer python=3.10
conda activate replan_infer
pip install -e .
pip install flash-attn --no-build-isolation
For full training or evaluation on the IV-Edit benchmark:
conda create -n replan python=3.10
conda activate replan
pip install -e .[full]
pip install flash-attn --no-build-isolation
We provide both a Web Interface for interactive visualization/editing and a Command Line Interface.
Launch a local web server to interactively edit images. This interface supports visualizing and manually modifying the region guidance (Bounding Boxes & Hints) generated by the planner before execution.
By default, RePlan keeps the original inference behavior unchanged. For users with limited GPU memory, you can optionally enable stage-boundary offloading with --stage_offload cpu, which offloads the inactive model between the planning (VLM) stage and the editing stage. With this option enabled, flux_klein_4b + RePlan can run on a 16GB GPU.
# use Flux.1 Kontext dev as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_kontext"
# use Qwen-Image-Edit as the backbone
python replan/inference/app.py --server_port 8080 --pipeline_type "qwenimageedit"
# use Qwen-Image-Edit-2511 / FireRed / FLUX.2-Klein variants
python replan/inference/app.py --server_port 8080 --pipeline_type "qwenimageedit2511"
python replan/inference/app.py --server_port 8080 --pipeline_type "firerededit"
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein"
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein_4b"
# optional: reduce peak GPU memory by offloading between VLM and edit stages
python replan/inference/app.py --server_port 8080 --pipeline_type "flux_klein_4b" --stage_offload cpu
- Upload & Select: Click the top-left area to upload an image, or select any provided example.
- Baseline Method: Copy your instruction directly into the Global Prompt text box, skip the AI Plan step, and click Run Editing. The result will be identical to using the selected baseline backbone (e.g., Flux.1 Kontext dev, Qwen-Image-Edit, Qwen-Image-Edit-2511, FireRed-Image-Edit, or FLUX.2-Klein).
- RePlan Method: Enter your instruction and click AI Plan. After a few seconds, the model generates the Global Prompt and Region Guidance. You can see the visual BBoxes (Bounding Boxes) in the interactive interface. Click Run Editing to generate the RePlan result.
- Manual Correction: If you are unsatisfied with the AI-generated guidance, you can modify it:
  - Global Prompt: Edit the text in the prompt box below the interface.
  - Move/Resize BBoxes: Select a BBox in the interactive window to drag it, or resize it by dragging its edges.
  - Add/Delete BBoxes: Drag on any empty area in the interface to create a new BBox. To delete one, select it and click the Delete button in the bottom-right.
  - Edit Hints: After selecting a BBox, modify its specific Local Hint in the bottom-right text box.
  - Click Run Editing to apply your manual changes.
- Manual Design: You can skip the AI Plan entirely. Manually enter a Global Prompt, draw your own BBoxes/Hints using the tools above, and click Run Editing.
Example of the advanced control panel. "Main": global prompt. "Hi": Hint i. "N.Bi": Noise of bbox i. "I.Bi": Image of bbox i. "N.BG": Noise of background. "I.BG": Image of background.
- Interactive Attention Matrix: Defines the fine-grained attention mask between specific components to control information flow (see the sketch below). Noise/Image components distinguish the Latent Noise Patches from the Input Image Patches.
- Rule Switch Ratio: Determines the phase transition point (0.0-1.0) from the Custom Rules (defined in the attention matrix above) to the Default Rules (described in the paper). The cutoff step is Ratio × total steps (e.g., Ratio 0.7 with 50 steps means Custom Rules apply for steps 0–35).
- Locality Control Strategy: The demo's predefined custom rule disables Noise Background → Noise BBox attention to isolate the background from the edit target. The Rule Switch Ratio controls the duration (number of steps) of this isolation:
  - Higher Ratio: Enforces stronger locality. Increase it to fix semantic spillover (edits leaking into the background).
  - Lower Ratio: Allows more global interaction. Decrease it to fix boundary artifacts (unnatural, sharp edges).
- Expand Value: Expands the effective attention mask relative to the bounding box size (e.g., 0.15 expands the mask by 15% of the bbox size).
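For readers who want intuition for the rules above, the sketch below shows one way a region-constrained attention mask and the rule-switch step could be built. It is a simplified illustration, not the repository's implementation: the token grouping, mask shape, and helper names are assumptions.

```python
import torch

def locality_attention_mask(num_tokens: int, noise_ids: torch.Tensor, bbox_ids: torch.Tensor) -> torch.Tensor:
    """Boolean mask (True = attention allowed) that blocks Noise-Background -> Noise-BBox
    attention, mirroring the demo's locality rule. Token grouping is a simplified assumption."""
    mask = torch.ones(num_tokens, num_tokens, dtype=torch.bool)
    is_noise = torch.zeros(num_tokens, dtype=torch.bool)
    is_noise[noise_ids] = True
    in_bbox = torch.zeros(num_tokens, dtype=torch.bool)
    in_bbox[bbox_ids] = True

    noise_bg = (is_noise & ~in_bbox).nonzero(as_tuple=True)[0]   # noise tokens outside all bboxes
    noise_bbox = (is_noise & in_bbox).nonzero(as_tuple=True)[0]  # noise tokens inside a bbox
    mask[noise_bg[:, None], noise_bbox[None, :]] = False         # cut background -> target attention
    return mask

# Rule Switch Ratio: custom rules apply for roughly the first ratio * total_steps denoising steps.
total_steps, ratio = 50, 0.7
switch_step = round(ratio * total_steps)   # 35: custom rules before this step, default rules after
```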
Launch a session to load the model once and perform multiple inference rounds:
--stage_offload is optional and defaults to none, which preserves the original behavior. Set --stage_offload cpu only if you want to reduce peak GPU memory by offloading the inactive model between the planning and editing stages.
python replan/inference/run_replan.py --pipeline_type "flux_kontext" # use Flux.1 Kontext dev as RePlan backbone
python replan/inference/run_replan.py --pipeline_type "qwenimageedit" # use Qwen-Image-Edit as RePlan backbone
python replan/inference/run_replan.py --pipeline_type "qwenimageedit2511" # use Qwen-Image-Edit-2511
python replan/inference/run_replan.py --pipeline_type "firerededit" # use FireRed-Image-Edit-1.1
python replan/inference/run_replan.py --pipeline_type "flux_klein" # use FLUX.2-klein-9B
python replan/inference/run_replan.py --pipeline_type "flux_klein_4b" # use FLUX.2-klein-4B
# optional: reduce peak GPU memory by offloading between VLM and edit stages
python replan/inference/run_replan.py --pipeline_type "flux_klein_4b" --stage_offload cpu
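Conceptually, --stage_offload cpu keeps only the active stage's model on the GPU: the editor waits on CPU while the VLM plans, then the two swap. Below is a rough sketch of the idea, assuming placeholder planner/editor objects rather than the repository's actual classes.

```python
import torch

def run_with_stage_offload(planner, editor, image, instruction, device="cuda"):
    # Planning stage: keep the VLM on the GPU, park the diffusion editor on CPU.
    editor.to("cpu")
    planner.to(device)
    plan = planner.plan(image, instruction)        # placeholder API

    # Editing stage: swap the two models and free cached GPU memory.
    planner.to("cpu")
    torch.cuda.empty_cache()
    editor.to(device)
    return editor.edit(image, plan)                # placeholder API
```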
Run inference directly by providing arguments (the program ends after one editing round):
python replan/inference/run_replan.py \
--image "assets/cup.png" \
--instruction "Replace the cup that has been used and left on the desk with a small potted plant" \
--output_dir "output/inference" \
--pipeline_type "flux_kontext"
- --image: Path to the input image (relative or absolute).
- --instruction: Editing instruction text.
- --output_dir: Directory to save results (default: ./output/inference).
- --only_save_image: If set, only saves the edited image (omits the VLM response and visualization).
- --vlm_ckpt_path: Path to the Planner VLM checkpoint (default: TainU/RePlan-Qwen2.5-VL-7B).
- --pipeline_type: Diffusion pipeline type. Preferred explicit values are flux_kontext, qwenimageedit, qwenimageedit2509, qwenimageedit2511, firerededit, flux_klein, and flux_klein_4b. For backward compatibility, the legacy aliases flux, qwen, qwen2511, qwen_plus, and klein are also accepted.
- --stage_offload: Optional stage-boundary offload mode. none keeps the existing behavior unchanged; cpu offloads the inactive model between the VLM planning stage and the image editing stage to reduce peak GPU memory.
- --lora_path: (Experimental) Path to LoRA weights.
We propose the IV-Edit Benchmark to evaluate performance on IV-Complex editing scenarios.
- Dataset: ~800 manually verified instruction-image pairs.
- Capabilities: Tests 7 referring types (e.g., Spatial, Knowledge) and 16 task types (e.g., Attribute Modification, Physics Reasoning). Please refer to the paper appendix for specific categories.
- Metrics: We use Gemini-2.5-Pro to evaluate Target, Consistency, Quality, and Effect.
Examples from IV-Edit spanning a wide range of real-world scenarios and fine-grained instruction intents, including spatial, structural, and reasoning-intensive edits. Each instruction is decomposed into a referring expression and a task type, reflecting the need for both grounded understanding and visual transformation.
Evaluation consists of two steps: generating edited images and scoring them using Gemini.
Generate edited images using the desired backbone. Configuration files control model parameters and output paths.
# RePlan with Flux backbone
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_flux.yaml
# RePlan with Qwen-Image-Edit
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_qwen_image.yaml
# RePlan with Qwen-Image-Edit-2511
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_qwen_image2511.yaml
# RePlan with FireRed-Image-Edit-1.1
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_fire_red_image.yaml
# RePlan with FLUX.2-klein-9B
bash replan/eval/scripts/gen_replan.sh replan/eval/config/replan_klein.yaml
# Original Flux.1 Kontext Dev
bash replan/eval/scripts/gen_kontext.sh replan/eval/config/flux_kontext.yaml
# Original Qwen-Image-Edit
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/qwen_image.yaml
# Original Qwen-Image-Edit-2511 / FireRed-Image-Edit-1.1
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/qwen_image_2511.yaml
bash replan/eval/scripts/gen_qwen_image.sh replan/eval/config/fire_red_image.yaml
# Original FLUX.2-klein-9B
bash replan/eval/scripts/gen_klein.sh replan/eval/config/flux_klein.yaml
# Flux Kontext with Global Instruction Rephrasing (using Qwen)
bash replan/eval/scripts/gen_kontext_rephrase.sh replan/eval/config/qwen_kontext_rephrase.yaml
We use Gemini-2.5-Pro for evaluation.
- Configure API: Update replan/eval/api_config.yaml with your credentials. We support:
  - Vertex AI (credential or API key)
  - Google GenAI (API key)
  - OpenAI-compatible 3rd-party endpoints
- Run Evaluation:
# --edited_images_dir must match the output directory from step 1
# --output_json is the path to save the aggregated results
python evaluate.py \
    --edited_images_dir ./output/replan_flux \
    --output_json ./output/replan_flux.json
Quantitative comparison of open-source and proprietary image editing models on four evaluation dimensions. We also report Overall and Weighted scores. For open-source models, the highest score in each column is marked in bold, while the second highest is underlined. RePlan achieves the best consistency and overall score among open-source models.
The RePlan VLM planner is optimized using Group Relative Policy Optimization (GRPO) in two stages, requiring only ~1k samples.
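As background on the objective, GRPO samples a group of candidate plans for each instruction and normalizes every plan's reward against its own group, so no learned value function is needed. Below is a minimal sketch of the standard group-relative advantage (generic GRPO math, not RePlan-specific code).

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (num_groups, group_size), one row per instruction.
    Each sampled plan's advantage is its reward normalized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 instructions, 4 sampled plans each
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.7, 0.3]])
print(group_relative_advantages(rewards))
```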
Ensure full dependencies are installed:
pip install -e .[full]
Stage 1 optimizes for valid JSON structure and reasoning-chain adequacy using format-related rewards.
bash replan/train/scripts/grpo_stage1.sh
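For intuition, a format-related reward of this kind can be as simple as checking that the planner's response parses as JSON with the expected fields and carries a non-trivial reasoning chain. The sketch below is illustrative only; the key names and scoring tiers are assumptions, and the actual reward functions live under replan/train/reward_function.

```python
import json

# Illustrative only: key names ("reasoning", "global_prompt", "regions") and the
# scoring tiers are assumptions, not the repository's reward definition.
def format_reward(response: str) -> float:
    try:
        plan = json.loads(response)
    except json.JSONDecodeError:
        return 0.0                       # not valid JSON
    if not isinstance(plan, dict):
        return 0.2                       # parses, but not an object
    score = 0.4
    if all(k in plan for k in ("global_prompt", "regions")) and plan.get("regions"):
        score += 0.3                     # structured region guidance present
    if len(str(plan.get("reasoning", ""))) > 50:
        score += 0.3                     # non-trivial reasoning chain
    return score
```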
Convert Checkpoint: You may need to convert the saved checkpoint to HuggingFace format:
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage1/global_step_30/actor
Stage 2 uses a large VLM (e.g., Qwen2.5-VL-72B) as a reward model to evaluate the execution quality (Target, Effect, Consistency) of the generated plans.
We use a reward-server paradigm for reward calculation, which requires deploying:
1. A VLLM server for the VLM reward model.
2. An editing model server (Kontext).
1. Start Reward Model Server (VLLM)
Navigate to the reward function directory:
cd replan/train/reward_function
Launch the VLLM server (example using Qwen2.5-VL-72B on 8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 4096 \
--limit-mm-per-prompt '{"image":2}' \
--disable-mm-preprocessor-cache \
--gpu-memory-utilization 0.35 \
--enforce-eager \
--port 8000
Note: You can use other VLM models or OpenAI-compatible APIs. Smaller models are supported but may degrade reward quality.
2. Start Editing Model Server (Kontext)
Deploy the image editing model. Each specified GPU will host a separate instance for data-parallel inference.
# Usage: bash start_kontext_server.sh <gpu_ids> <port>
bash start_kontext_server.sh 0,1,2,3,4,5,6,7 8001
Note: Since the reward and editing servers are accessed sequentially, they can share GPUs if memory permits.
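For reference, one reward round-trip roughly works as follows: render the plan through the Kontext server, then ask the VLM server to score the result. The sketch below only illustrates the flow; the Kontext request/response schema and the scoring prompt are assumptions, while the endpoints match the servers started above.

```python
import base64, requests
from openai import OpenAI

KONTEXT_URL = "http://localhost:8001/v1/images/generations"              # editing server from step 2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")    # VLLM server from step 1

instruction = "Replace the used cup on the desk with a small potted plant"
plan_prompt = instruction  # in practice, the planner's rendered plan for this sample

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# 1) Execute the plan on the editing server (payload/response schema is an assumption).
edit = requests.post(KONTEXT_URL, json={"prompt": plan_prompt, "image": b64("input.png")}).json()
edited_b64 = edit["data"][0]["b64_json"]

# 2) Ask the VLM reward model to score the edit (two images fits --limit-mm-per-prompt above).
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": f"Instruction: {instruction}\nScore Target, Effect, and Consistency from 0-10."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + b64("input.png")}},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64," + edited_b64}},
    ]}],
)
print(resp.choices[0].message.content)
```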
3. Configure and Run Training
Update the server URLs in replan/train/scripts/grpo_stage2.sh:
export KONTEXT_SERVER_URL="http://localhost:8001/v1/images/generations" # Your Kontext server URL
export OPENAI_API_BASE="http://localhost:8000/v1" # Your VLLM server URL
(If servers are on different nodes, update localhost to the corresponding IP addresses.)
Launch training:
bash replan/train/scripts/grpo_stage2.sh
4. Convert Final Checkpoint
python -m replan.train.model_merger --local_dir ./ckpt/easy_r1/stage2/global_step_40/actor
@article{qu2025replan,
title={RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing},
author={Tianyuan Qu and Lei Ke and Xiaohang Zhan and Longxiang Tang and Yuqi Liu and Bohao Peng and Bei Yu and Dong Yu and Jiaya Jia},
journal={arXiv preprint arXiv:2512.16864},
year={2025}
}