
Potential correctness issue in multi-image Qwen visual preprocessing: reused host buffer with async HtoD copy can corrupt visual_input #58

@kimsunguk0

Description


Summary

We found what looks like a general correctness issue in the Qwen multi-image visual preprocessing path in v0.6.0.

This does not appear specific to our downstream model/task. The issue shows up at the visual input boundary before the visual encoder runs.

In our local investigation, the root cause appears to be reuse of a host-side resized-image buffer together with cudaMemcpyAsync, without any guarantee that the previous host-to-device copy has completed before the next image's preprocessing step overwrites the same host buffer.

As a result, for multi-image inputs, the actual visual_input tensor can become corrupted even when the visual metadata is correct.


Observed behavior

For a fixed multi-image input, the following metadata matched our PyTorch reference exactly (or nearly exactly):

  • image_grid_thw: exact
  • cu_seqlens: exact
  • fast_pos_embed_idx: exact
  • fast_pos_embed_weight: nearly exact

However, visual_input itself showed a systematic mismatch against the PyTorch processor output.

This mismatch was observed in the multi-image path and was not explained by metadata differences.


Root cause we found

The issue appears to come from:

  • a reused host-side resized image buffer
  • copied to device using cudaMemcpyAsync
  • without guaranteeing copy completion before the next image preprocessing step reuses and overwrites the same host buffer

In other words, the next image can overwrite the host buffer before the previous async HtoD copy has fully consumed it.

This creates a correctness issue specifically for multi-image preprocessing.
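To make the suspected pattern concrete, here is a minimal sketch of the hazard. All names (resize_host_buf, preprocess_resize, etc.) are illustrative placeholders, not the actual TensorRT-Edge-LLM code:

```cuda
// Hypothetical sketch of the racy pattern; buffer and function names
// are illustrative, not taken from the real source.
unsigned char* resize_host_buf;  // single host buffer, reused per image
cudaHostAlloc(&resize_host_buf, buf_bytes, cudaHostAllocDefault);

for (int i = 0; i < num_images; ++i) {
    // CPU writes the resized image i into the shared host buffer.
    preprocess_resize(images[i], resize_host_buf);

    // The copy is only *enqueued* here; the DMA engine may still be
    // reading resize_host_buf when the next iteration overwrites it.
    cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, resize_host_buf,
                    buf_bytes, cudaMemcpyHostToDevice, stream);
    // BUG: no synchronization before the buffer is reused above.
}
```

Because cudaMemcpyAsync returns as soon as the copy is queued on the stream, the CPU-side resize of image i+1 can race with the still-in-flight HtoD copy of image i, corrupting whichever bytes the device reads after the overwrite.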


Evidence

In our local comparison against the PyTorch reference:

Before local fix

  • visual_input mean abs diff: 0.00285263
  • multimodal_output_embedding mean abs diff: 0.00313699
  • inputs_embeds mean abs diff: 0.00388316

After forcing copy completion before host buffer reuse

  • visual_input mean abs diff: 0.00000000
  • multimodal_output_embedding mean abs diff: 0.00057067
  • inputs_embeds mean abs diff: 0.00193488

So the visual_input mismatch itself dropped to an exact match once the host-buffer reuse ordering was fixed.
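For reference, enforcing copy completion before reuse can be done in at least two standard ways; a sketch of both, again with illustrative names (host_bufs, copy_done, etc. are assumptions, not the actual code):

```cuda
// Option 1: block until the enqueued copy has drained the host buffer.
// Correct but serializes the HtoD copy with the next CPU resize.
cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, resize_host_buf,
                buf_bytes, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

// Option 2: double-buffer the host side and gate each buffer on an
// event, so the resize of image i+1 overlaps the copy of image i.
unsigned char* host_bufs[2];   // two pinned buffers, allocated up front
cudaEvent_t copy_done[2];      // one event per buffer slot
for (int i = 0; i < num_images; ++i) {
    int slot = i % 2;
    if (i >= 2)
        cudaEventSynchronize(copy_done[slot]);  // wait until slot is free
    preprocess_resize(images[i], host_bufs[slot]);
    cudaMemcpyAsync(device_buf + (size_t)i * buf_bytes, host_bufs[slot],
                    buf_bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copy_done[slot], stream);   // marks slot in flight
}
```

Our local fix corresponds to the simpler synchronize-before-reuse variant; the double-buffered version is one way to restore ordering without giving up copy/compute overlap.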


Why we think this is a real bug

We do not think this is just harmless downstream numerical drift, because:

  • the mismatch appears at the visual input boundary itself
  • the visual metadata is already correct
  • the problem is systematic in the multi-image path
  • enforcing correct copy ordering removes the visual_input mismatch

So this looks like a preprocessing/input correctness issue rather than an expected tolerance issue.


Scope

We are intentionally not focusing here on downstream task-specific metrics.

Our concern is narrower and more generic:

  • same fixed multi-image input
  • same visual metadata
  • different visual_input tensor caused by host-buffer reuse + async copy ordering

That seems like a general multimodal runtime correctness issue.


Environment

  • TensorRT-Edge-LLM: v0.6.0
  • Model family involved: Qwen multi-image visual path
  • Issue type: multimodal visual preprocessing / visual input boundary correctness

Offer

If useful, we can also provide:

  • the exact local patch location
  • a minimal repro for the multi-image path
  • before/after tensor comparison dumps
  • a cleanup PR based on our local fix
