feat: support first+last frame multi-conditioning for I2V #23

Open
zhaopengme wants to merge 1 commit into Blaizzy:main from zhaopengme:main
Conversation

@zhaopengme

  • Fix negative frame_idx bug in apply_conditioning (e.g. -1 for last frame)
  • Add end_image and end_image_strength params to generate_video()
  • Add _build_i2v_conditionings() helper to construct conditioning list
  • Update all 4 pipeline branches (DISTILLED, DEV, DEV_TWO_STAGE, DEV_TWO_STAGE_HQ) to encode and apply both first-frame and last-frame conditioning
  • Add --end-image and --end-image-strength CLI arguments

When both image and end_image are provided, the video is conditioned to start from the first image and end at the last image, creating a smooth transition between the two frames.

Made-with: Cursor

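As a rough sketch of the conditioning list described above — the `Conditioning` fields and the helper signature here are assumptions for illustration, not the PR's actual `mlx_video` code:

```python
# Hypothetical sketch of a first+last frame conditioning builder; field and
# parameter names mirror the PR description but are assumptions, not the
# actual mlx_video API.
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Conditioning:
    latent: Any        # VAE-encoded image latent
    frame_idx: int     # 0 = first frame, -1 = last frame
    strength: float


def build_i2v_conditionings(
    image_latent: Any,
    image_strength: float,
    end_image_latent: Optional[Any] = None,
    end_image_strength: Optional[float] = None,
) -> List[Conditioning]:
    # The first image is always conditioned at frame index 0.
    conditionings = [Conditioning(image_latent, 0, image_strength)]
    if end_image_latent is not None:
        # Fall back to the first image's strength when none is given,
        # matching the PR's default for --end-image-strength.
        strength = (
            image_strength if end_image_strength is None else end_image_strength
        )
        conditionings.append(Conditioning(end_image_latent, -1, strength))
    return conditionings
```

With both latents supplied, the list pins frame 0 to the first image and frame -1 (the last latent frame) to the end image.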
Copilot AI review requested due to automatic review settings March 23, 2026 06:43

Copilot AI left a comment


Pull request overview

This PR extends the LTX-2 I2V generation path to support dual conditioning (first frame + last frame) while fixing negative frame index handling in latent conditioning.

Changes:

  • Fix apply_conditioning() to correctly handle negative frame_idx values (e.g., -1 for last frame).
  • Add end_image / end_image_strength to generate_video() and wire them through all pipeline branches to apply first+last frame conditioning.
  • Expose --end-image and --end-image-strength in the CLI and add a helper to build the conditioning list.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

  • mlx_video/models/ltx_2/generate.py — Adds end-frame conditioning support, applies first+last conditioning across all pipelines, and adds the CLI args.
  • mlx_video/models/ltx_2/conditioning/latent.py — Normalizes negative conditioning indices so -1 can target the last latent frame.


Comment on lines +1854 to +1856
console.print(
f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
)

Copilot AI Mar 23, 2026


When --end-image is set, image_frame_idx is ignored and the first image is always conditioned at frame 0 (see _build_i2v_conditionings). The log line still prints frame={image_frame_idx}, which can be misleading (e.g., user passes --image-frame-idx 8 and sees it echoed even though it won’t be used). Consider printing frame=0 (or explicitly indicating it’s ignored) whenever end_image is provided.

Suggested change
console.print(
    f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
)
if has_end_image:
    console.print(
        f"[dim]First image: {image} (strength={image_strength}, frame=0; image_frame_idx ignored when --end-image is set)[/]"
    )
else:
    console.print(
        f"[dim]First image: {image} (strength={image_strength}, frame={image_frame_idx})[/]"
    )

Comment on lines +1807 to +1810
is_i2v = image is not None or end_image is not None
has_end_image = end_image is not None
if end_image_strength is None:
end_image_strength = image_strength

Copilot AI Mar 23, 2026


If both image and end_image are provided, a 1-frame video (or any case where latent_frames == 1) can’t satisfy “first+last” conditioning: frame_idx=-1 normalizes to the same frame as the first conditioning and will overwrite it. Consider validating upfront (e.g., require num_frames > 1 when both images are set, or raise a clear error) to avoid silently dropping one conditioning.
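A minimal sketch of such an upfront check — the function name and error message are hypothetical, not code from the PR:

```python
def validate_dual_conditioning(latent_frames: int, has_end_image: bool) -> None:
    # Hypothetical guard: with a single latent frame, frame_idx=-1
    # normalizes to frame 0 and would silently overwrite the
    # first-frame conditioning, so require at least two frames.
    if has_end_image and latent_frames < 2:
        raise ValueError(
            "first+last frame conditioning requires at least 2 latent frames; "
            f"got {latent_frames}"
        )
```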

Comment on lines +2106 to +2118
input_image = load_image(image, height=s1_h, width=s1_w, dtype=model_dtype)
stage1_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s1_h, s1_w, dtype=model_dtype))
mx.eval(stage1_image_latent)
input_image = load_image(image, height=s2_h, width=s2_w, dtype=model_dtype)
stage2_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s2_h, s2_w, dtype=model_dtype))
mx.eval(stage2_image_latent)

if has_end_image:
    end_input = load_image(end_image, height=s1_h, width=s1_w, dtype=model_dtype)
    stage1_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s1_h, s1_w, dtype=model_dtype))
    mx.eval(stage1_end_image_latent)
    end_input = load_image(end_image, height=s2_h, width=s2_w, dtype=model_dtype)
    stage2_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s2_h, s2_w, dtype=model_dtype))

Copilot AI Mar 23, 2026


Several newly added lines here are far beyond Black’s default line length (e.g., nested vae_encoder(prepare_image_for_encoding(...)) calls on a single line). Since the repo enforces Black via pre-commit, this will likely be reformatted (or fail CI if not run). Please run Black / reformat these calls into the standard multi-line style used elsewhere in this file for readability and consistency.

Suggested change
input_image = load_image(image, height=s1_h, width=s1_w, dtype=model_dtype)
stage1_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s1_h, s1_w, dtype=model_dtype))
mx.eval(stage1_image_latent)
input_image = load_image(image, height=s2_h, width=s2_w, dtype=model_dtype)
stage2_image_latent = vae_encoder(prepare_image_for_encoding(input_image, s2_h, s2_w, dtype=model_dtype))
mx.eval(stage2_image_latent)
if has_end_image:
    end_input = load_image(end_image, height=s1_h, width=s1_w, dtype=model_dtype)
    stage1_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s1_h, s1_w, dtype=model_dtype))
    mx.eval(stage1_end_image_latent)
    end_input = load_image(end_image, height=s2_h, width=s2_w, dtype=model_dtype)
    stage2_end_image_latent = vae_encoder(prepare_image_for_encoding(end_input, s2_h, s2_w, dtype=model_dtype))
input_image = load_image(
    image,
    height=s1_h,
    width=s1_w,
    dtype=model_dtype,
)
stage1_image_latent = vae_encoder(
    prepare_image_for_encoding(
        input_image,
        s1_h,
        s1_w,
        dtype=model_dtype,
    )
)
mx.eval(stage1_image_latent)
input_image = load_image(
    image,
    height=s2_h,
    width=s2_w,
    dtype=model_dtype,
)
stage2_image_latent = vae_encoder(
    prepare_image_for_encoding(
        input_image,
        s2_h,
        s2_w,
        dtype=model_dtype,
    )
)
mx.eval(stage2_image_latent)
if has_end_image:
    end_input = load_image(
        end_image,
        height=s1_h,
        width=s1_w,
        dtype=model_dtype,
    )
    stage1_end_image_latent = vae_encoder(
        prepare_image_for_encoding(
            end_input,
            s1_h,
            s1_w,
            dtype=model_dtype,
        )
    )
    mx.eval(stage1_end_image_latent)
    end_input = load_image(
        end_image,
        height=s2_h,
        width=s2_w,
        dtype=model_dtype,
    )
    stage2_end_image_latent = vae_encoder(
        prepare_image_for_encoding(
            end_input,
            s2_h,
            s2_w,
            dtype=model_dtype,
        )
    )

Comment on lines +108 to +110
# Normalize negative indices (e.g. -1 -> last frame)
if frame_idx < 0:
    frame_idx = frame_idx % f

Copilot AI Mar 23, 2026


apply_conditioning now supports negative frame_idx values via modulo normalization, but VideoConditionByLatentIndex’s docstring still describes frame_idx only as “0 = first frame”. Consider updating the public-facing documentation (class docstring and/or apply_conditioning doc) to explicitly state that negative indices are accepted (e.g., -1 = last frame).
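The normalization works because Python's `%` operator returns a result in `[0, f)` for a positive modulus. A standalone sketch of the same logic (function and parameter names are illustrative, not the PR's code):

```python
def normalize_frame_idx(frame_idx: int, num_frames: int) -> int:
    # Python's % yields a non-negative result for a positive modulus,
    # so -1 maps to the last frame, -2 to the second-to-last, and so on.
    if frame_idx < 0:
        frame_idx = frame_idx % num_frames
    return frame_idx
```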

@Blaizzy
Owner

Blaizzy commented Mar 23, 2026

Could you share a before and after video?

@zhaopengme
Author

Reference first image:

[image: first]

First video:

video_6fd8a68187df444fb66ddad5fc25a347.mp4

Second video:

video_17d988fff1e4446fbec7c798928a08fa.mp4

Reference end image:

[image: end]

Third video:

video_18423d7276ef4e84b53bce1b8fa223fc.mp4

This is my repo:

https://github.com/zhaopengme/MLXGateway
