feat: support first+last frame multi-conditioning for I2V #23

zhaopengme wants to merge 1 commit into Blaizzy:main
Conversation
- Fix negative frame_idx bug in apply_conditioning (e.g. -1 for last frame)
- Add end_image and end_image_strength params to generate_video()
- Add _build_i2v_conditionings() helper to construct the conditioning list
- Update all 4 pipeline branches (DISTILLED, DEV, DEV_TWO_STAGE, DEV_TWO_STAGE_HQ) to encode and apply both first-frame and last-frame conditioning
- Add --end-image and --end-image-strength CLI arguments

When both image and end_image are provided, the video is conditioned to start from the first image and end at the last image, creating a smooth transition between the two frames.

Made-with: Cursor
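The commit message mentions a `_build_i2v_conditionings()` helper, but the PR diff for it is not shown here. The following is only a hypothetical sketch of what such a helper could look like; the `Conditioning` dataclass and the argument names are assumptions for illustration, not the repo's actual API:

```python
from dataclasses import dataclass


@dataclass
class Conditioning:
    """Illustrative stand-in for a latent conditioning entry."""
    latent: object       # encoded image latent (an mx.array in the real pipeline)
    frame_idx: int       # 0 = first frame, -1 = last frame
    strength: float


def build_i2v_conditionings(image_latent, image_strength,
                            end_latent=None, end_strength=None):
    """Build the list of latent conditionings for I2V generation.

    The first image is pinned to frame 0; when an end image is given,
    it is pinned to frame -1 (the last latent frame), and its strength
    defaults to the first image's strength.
    """
    conds = [Conditioning(image_latent, 0, image_strength)]
    if end_latent is not None:
        if end_strength is None:
            end_strength = image_strength  # reuse first-image strength by default
        conds.append(Conditioning(end_latent, -1, end_strength))
    return conds
```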
Pull request overview
This PR extends the LTX-2 I2V generation path to support dual conditioning (first frame + last frame) while fixing negative frame index handling in latent conditioning.
Changes:

- Fix `apply_conditioning()` to correctly handle negative `frame_idx` values (e.g., `-1` for the last frame).
- Add `end_image`/`end_image_strength` to `generate_video()` and wire them through all pipeline branches to apply first+last frame conditioning.
- Expose `--end-image` and `--end-image-strength` in the CLI and add a helper to build the conditioning list.
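As a minimal sketch of how the two new flags could be declared, assuming an argparse-style CLI (the parser below is illustrative; the real option definitions live in `generate.py` and are not shown in this thread):

```python
import argparse

# Hypothetical subset of the CLI: only the flags touched by this PR.
parser = argparse.ArgumentParser(description="I2V generation (illustrative)")
parser.add_argument("--image", help="first-frame conditioning image")
parser.add_argument("--image-strength", type=float, default=1.0)
parser.add_argument("--end-image", help="last-frame conditioning image")
parser.add_argument(
    "--end-image-strength",
    type=float,
    default=None,
    help="defaults to --image-strength when omitted",
)

args = parser.parse_args(["--image", "a.png", "--end-image", "b.png"])
```

Leaving `--end-image-strength` with a `None` default lets the pipeline fall back to `image_strength`, matching the `if end_image_strength is None` branch quoted later in the review.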
Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `mlx_video/models/ltx_2/generate.py` | Adds end-frame conditioning support and applies first+last conditioning across all pipelines + CLI args. |
| `mlx_video/models/ltx_2/conditioning/latent.py` | Normalizes negative conditioning indices so `-1` can target the last latent frame. |
```python
console.print(
    f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
)
```
When --end-image is set, image_frame_idx is ignored and the first image is always conditioned at frame 0 (see _build_i2v_conditionings). The log line still prints frame={image_frame_idx}, which can be misleading (e.g., user passes --image-frame-idx 8 and sees it echoed even though it won’t be used). Consider printing frame=0 (or explicitly indicating it’s ignored) whenever end_image is provided.
Suggested change:

```diff
-console.print(
-    f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
-)
+if has_end_image:
+    console.print(
+        f"[dim]First image: {image} (strength={image_strength}, frame=0; image_frame_idx ignored when --end-image is set)[/]"
+    )
+else:
+    console.print(
+        f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
+    )
```
```python
is_i2v = image is not None or end_image is not None
has_end_image = end_image is not None
if end_image_strength is None:
    end_image_strength = image_strength
```
If both image and end_image are provided, a 1-frame video (or any case where latent_frames == 1) can’t satisfy “first+last” conditioning: frame_idx=-1 normalizes to the same frame as the first conditioning and will overwrite it. Consider validating upfront (e.g., require num_frames > 1 when both images are set, or raise a clear error) to avoid silently dropping one conditioning.
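The upfront check suggested here could look like the following sketch. Note the `temporal_downsample=8` factor and the `latent_frames` formula are assumptions about the VAE's temporal compression for illustration; they are not taken from this PR:

```python
def validate_dual_conditioning(num_frames: int, has_image: bool,
                               has_end_image: bool,
                               temporal_downsample: int = 8) -> int:
    """Reject configurations where first+last conditioning would collide.

    Assumes the common VAE layout: one latent frame for the first pixel
    frame, then one per `temporal_downsample` pixel frames. Returns the
    latent frame count when the configuration is valid.
    """
    latent_frames = 1 + (num_frames - 1) // temporal_downsample
    if has_image and has_end_image and latent_frames < 2:
        raise ValueError(
            f"num_frames={num_frames} yields only {latent_frames} latent "
            "frame(s); first- and last-frame conditioning would target the "
            "same frame. Use a longer video or drop --end-image."
        )
    return latent_frames
```

Failing fast here surfaces the conflict as a clear error instead of silently letting the `-1` conditioning overwrite the frame-0 one.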
```python
input_image = load_image(image, height=s1_h, width=s1_w, dtype=model_dtype)
stage1_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s1_h, s1_w, dtype=model_dtype))
mx.eval(stage1_image_latent)
input_image = load_image(image, height=s2_h, width=s2_w, dtype=model_dtype)
stage2_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s2_h, s2_w, dtype=model_dtype))
mx.eval(stage2_image_latent)

if has_end_image:
    end_input = load_image(end_image, height=s1_h, width=s1_w, dtype=model_dtype)
    stage1_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s1_h, s1_w, dtype=model_dtype))
    mx.eval(stage1_end_image_latent)
    end_input = load_image(end_image, height=s2_h, width=s2_w, dtype=model_dtype)
    stage2_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s2_h, s2_w, dtype=model_dtype))
```
Several newly added lines here are far beyond Black’s default line length (e.g., nested vae_encoder(prepare_image_for_encoding(...)) calls on a single line). Since the repo enforces Black via pre-commit, this will likely be reformatted (or fail CI if not run). Please run Black / reformat these calls into the standard multi-line style used elsewhere in this file for readability and consistency.
Suggested change:

```diff
-input_image = load_image(image, height=s1_h, width=s1_w, dtype=model_dtype)
-stage1_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s1_h, s1_w, dtype=model_dtype))
-mx.eval(stage1_image_latent)
-input_image = load_image(image, height=s2_h, width=s2_w, dtype=model_dtype)
-stage2_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s2_h, s2_w, dtype=model_dtype))
-mx.eval(stage2_image_latent)
-if has_end_image:
-    end_input = load_image(end_image, height=s1_h, width=s1_w, dtype=model_dtype)
-    stage1_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s1_h, s1_w, dtype=model_dtype))
-    mx.eval(stage1_end_image_latent)
-    end_input = load_image(end_image, height=s2_h, width=s2_w, dtype=model_dtype)
-    stage2_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s2_h, s2_w, dtype=model_dtype))
+input_image = load_image(
+    image,
+    height=s1_h,
+    width=s1_w,
+    dtype=model_dtype,
+)
+stage1_image_latent = vae_encoder(
+    prepare_image_for_encoding(
+        input_image,
+        s1_h,
+        s1_w,
+        dtype=model_dtype,
+    )
+)
+mx.eval(stage1_image_latent)
+input_image = load_image(
+    image,
+    height=s2_h,
+    width=s2_w,
+    dtype=model_dtype,
+)
+stage2_image_latent = vae_encoder(
+    prepare_image_for_encoding(
+        input_image,
+        s2_h,
+        s2_w,
+        dtype=model_dtype,
+    )
+)
+mx.eval(stage2_image_latent)
+if has_end_image:
+    end_input = load_image(
+        end_image,
+        height=s1_h,
+        width=s1_w,
+        dtype=model_dtype,
+    )
+    stage1_end_image_latent = vae_encoder(
+        prepare_image_for_encoding(
+            end_input,
+            s1_h,
+            s1_w,
+            dtype=model_dtype,
+        )
+    )
+    mx.eval(stage1_end_image_latent)
+    end_input = load_image(
+        end_image,
+        height=s2_h,
+        width=s2_w,
+        dtype=model_dtype,
+    )
+    stage2_end_image_latent = vae_encoder(
+        prepare_image_for_encoding(
+            end_input,
+            s2_h,
+            s2_w,
+            dtype=model_dtype,
+        )
+    )
```
```python
# Normalize negative indices (e.g. -1 -> last frame)
if frame_idx < 0:
    frame_idx = frame_idx % f
```
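The normalization above relies on Python's modulo semantics, where the result of `%` takes the sign of the divisor, so negative indices wrap around just like negative list indexing. A standalone sketch (the function name is illustrative, not from the PR):

```python
def normalize_frame_idx(frame_idx: int, num_latent_frames: int) -> int:
    """Mirror the snippet's normalization: negative indices wrap around,
    so -1 targets the last latent frame."""
    if frame_idx < 0:
        frame_idx = frame_idx % num_latent_frames
    return frame_idx
```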
apply_conditioning now supports negative frame_idx values via modulo normalization, but VideoConditionByLatentIndex’s docstring still describes frame_idx only as “0 = first frame”. Consider updating the public-facing documentation (class docstring and/or apply_conditioning doc) to explicitly state that negative indices are accepted (e.g., -1 = last frame).
Could you share a before and after video? |

