Summary
We found what looks like a general correctness issue in the Qwen multi-image visual preprocessing path in v0.6.0.
This does not appear specific to our downstream model/task. The issue shows up at the visual input boundary before the visual encoder runs.
In our local investigation, the root cause appears to be reuse of a host-side resized image buffer together with cudaMemcpyAsync, without guaranteeing that the previous host-to-device copy has completed before the same host buffer is overwritten by the next image preprocessing step.
As a result, for multi-image inputs, the actual visual_input tensor can become corrupted even when the visual metadata is correct.
Observed behavior
For a fixed multi-image input, the following metadata matched our PyTorch reference exactly (or nearly exactly):
- image_grid_thw: exact
- cu_seqlens: exact
- fast_pos_embed_idx: exact
- fast_pos_embed_weight: nearly exact
However, visual_input itself showed a systematic mismatch against the PyTorch processor output.
This mismatch was observed in the multi-image path and was not explained by metadata differences.
Root cause we found
The issue appears to come from:
- a reused host-side resized image buffer
- copied to device using cudaMemcpyAsync
- without guaranteeing copy completion before the next image preprocessing step reuses and overwrites the same host buffer
In other words, the next image can overwrite the host buffer before the previous async HtoD copy has fully consumed it.
This creates a correctness issue specifically for multi-image preprocessing.
Evidence
In our local comparison against the PyTorch reference:
Before local fix
- visual_input mean abs diff: 0.00285263
- multimodal_output_embedding mean abs diff: 0.00313699
- inputs_embeds mean abs diff: 0.00388316
After forcing copy completion before host buffer reuse
- visual_input mean abs diff: 0.00000000
- multimodal_output_embedding mean abs diff: 0.00057067
- inputs_embeds mean abs diff: 0.00193488
In other words, once the host buffer is reused only after the previous copy completes, the visual_input mismatch disappears entirely.
Why we think this is a real bug
We do not think this is just harmless downstream numerical drift, because:
- the mismatch appears at the visual input boundary itself
- the visual metadata is already correct
- the problem is systematic in the multi-image path
- enforcing correct copy ordering removes the visual_input mismatch
So this looks like a preprocessing/input correctness issue rather than an expected tolerance issue.
Scope
We are intentionally not focusing here on downstream task-specific metrics.
Our concern is narrower and more generic:
- same fixed multi-image input
- same visual metadata
- a different visual_input tensor, caused by host-buffer reuse plus async copy ordering
That seems like a general multimodal runtime correctness issue.
Environment
- TensorRT-Edge-LLM: v0.6.0
- Model family involved: Qwen multi-image visual path
- Issue type: multimodal visual preprocessing / visual input boundary correctness
Offer
If useful, we can also provide:
- the exact local patch location
- a minimal repro for the multi-image path
- before/after tensor comparison dumps
- a cleanup PR based on our local fix