
Potential correctness issue in multi-image Qwen visual preprocessing: reused host buffer with async HtoD copy can corrupt visual_input #58

@kimsunguk0

Description


Summary

We found what looks like a general correctness issue in the Qwen multi-image visual preprocessing path in v0.6.0.

This does not appear specific to our downstream model/task. The issue shows up at the visual input boundary before the visual encoder runs.

In our local investigation, the root cause appears to be reuse of a host-side resized-image buffer together with cudaMemcpyAsync, without any guarantee that the previous host-to-device copy has completed before the next image's preprocessing step overwrites the same host buffer.

As a result, for multi-image inputs, the actual visual_input tensor can become corrupted even when the visual metadata is correct.


Observed behavior

For a fixed multi-image input, the following metadata matched our PyTorch reference exactly (or nearly exactly):

  • image_grid_thw: exact
  • cu_seqlens: exact
  • fast_pos_embed_idx: exact
  • fast_pos_embed_weight: nearly exact

However, visual_input itself showed a systematic mismatch against the PyTorch processor output.

This mismatch was observed in the multi-image path and was not explained by metadata differences.


Root cause we found

The issue appears to come from:

  • a reused host-side resized image buffer
  • copied to device using cudaMemcpyAsync
  • without guaranteeing copy completion before the next image preprocessing step reuses and overwrites the same host buffer

In other words, the next image can overwrite the host buffer before the previous async HtoD copy has fully consumed it.

This creates a correctness issue specifically for multi-image preprocessing.
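To make the suspected pattern concrete, here is a minimal sketch of the hazard. All names (resize_host_buf, preprocess_resize, etc.) are illustrative placeholders, not the actual TensorRT-Edge-LLM code:

```cuda
// Hypothetical sketch of the racy pattern; buffer and function names
// are illustrative, not taken from the real source.
unsigned char* resize_host_buf;  // single host buffer, reused per image
cudaHostAlloc(&resize_host_buf, buf_bytes, cudaHostAllocDefault);

for (int i = 0; i < num_images; ++i) {
    // CPU writes the resized image i into the shared host buffer.
    preprocess_resize(images[i], resize_host_buf);

    // The copy is only *enqueued* here; the DMA engine may still be
    // reading resize_host_buf when the next iteration overwrites it.
    cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, resize_host_buf,
                    buf_bytes, cudaMemcpyHostToDevice, stream);
    // BUG: no synchronization before the buffer is reused above.
}
```

Because cudaMemcpyAsync returns as soon as the copy is queued on the stream, the CPU-side resize of image i+1 can race with the still-in-flight HtoD copy of image i, corrupting whichever bytes the device reads after the overwrite.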


Evidence

In our local comparison against the PyTorch reference:

Before local fix

  • visual_input mean abs diff: 0.00285263
  • multimodal_output_embedding mean abs diff: 0.00313699
  • inputs_embeds mean abs diff: 0.00388316

After forcing copy completion before host buffer reuse

  • visual_input mean abs diff: 0.00000000
  • multimodal_output_embedding mean abs diff: 0.00057067
  • inputs_embeds mean abs diff: 0.00193488

So the visual_input mismatch itself dropped to an exact match once the host-buffer reuse ordering was fixed.
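For reference, enforcing copy completion before reuse can be done in at least two standard ways; a sketch of both, again with illustrative names (host_bufs, copy_done, etc. are assumptions, not the actual code):

```cuda
// Option 1: block until the enqueued copy has drained the host buffer.
// Correct but serializes the HtoD copy with the next CPU resize.
cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, resize_host_buf,
                buf_bytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

// Option 2: double-buffer the host side and gate each buffer on an
// event, so the resize of image i+1 overlaps the copy of image i.
unsigned char* host_bufs[2];   // two pinned buffers, allocated up front
cudaEvent_t copy_done[2];      // one event per buffer slot
for (int i = 0; i < num_images; ++i) {
    int slot = i % 2;
    if (i >= 2)
        cudaEventSynchronize(copy_done[slot]);  // wait until slot is free
    preprocess_resize(images[i], host_bufs[slot]);
    cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, host_bufs[slot],
                    buf_bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copy_done[slot], stream);   // marks slot in flight
}
```

Our local fix corresponds to the simpler synchronize-before-reuse variant; the double-buffered version is one way to restore ordering without giving up copy/compute overlap.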


Why we think this is a real bug

We do not think this is just harmless downstream numerical drift, because:

  • the mismatch appears at the visual input boundary itself
  • the visual metadata is already correct
  • the problem is systematic in the multi-image path
  • enforcing correct copy ordering removes the visual_input mismatch

So this looks like a preprocessing/input correctness issue rather than an expected tolerance issue.


Scope

We are intentionally not focusing here on downstream task-specific metrics.

Our concern is narrower and more generic:

  • same fixed multi-image input
  • same visual metadata
  • different visual_input tensor caused by host-buffer reuse + async copy ordering

That seems like a general multimodal runtime correctness issue.


Environment

  • TensorRT-Edge-LLM: v0.6.0
  • Model family involved: Qwen multi-image visual path
  • Issue type: multimodal visual preprocessing / visual input boundary correctness

Offer

If useful, we can also provide:

  • the exact local patch location
  • a minimal repro for the multi-image path
  • before/after tensor comparison dumps
  • a cleanup PR based on our local fix
