Liyang Chen¹³, Hongkai Chen³, Yujun Cai², Sifan Li³⁴, Qingwen Ye³, Yiwei Wang⁴
¹ University of California, Los Angeles ² The University of Queensland
³ vivo Mobile Communication Co., Ltd. ⁴ University of California, Merced
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we introduce HALCON to mitigate IH. HALCON follows a three-stage procedure: it first generates initial audio to expose hallucinated segments, then identifies and masks the corresponding unreliable video features, and finally regenerates the audio using the corrected conditioning. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our HALCON method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
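As a rough illustration of the two metrics, the following is a minimal sketch; the per-video segment format and the function names are hypothetical, not the repository's actual API:

```python
# Minimal sketch of the IH@vid / IH@dur metrics described above.
# Each video is represented by its total duration and a list of
# hallucinated (start_sec, end_sec) segments; this layout is assumed.

def ih_at_vid(videos):
    """Fraction of videos containing at least one hallucinated segment."""
    flagged = sum(1 for v in videos if v["segments"])
    return flagged / len(videos)

def ih_at_dur(videos):
    """Fraction of total duration covered by hallucinated segments."""
    hallucinated = sum(e - s for v in videos for (s, e) in v["segments"])
    total = sum(v["duration"] for v in videos)
    return hallucinated / total

videos = [
    {"duration": 10.0, "segments": [(2.0, 4.0)]},  # 2 s hallucinated
    {"duration": 10.0, "segments": []},            # clean video
]
print(ih_at_vid(videos))  # 0.5
print(ih_at_dur(videos))  # 0.1
```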
We recommend using conda and Python 3.10. Create an environment and install the dependencies:
git clone https://github.com/vivo/HALCON.git
cd HALCON
conda create -n halcon python=3.10 -y
conda activate halcon
pip install -r requirements.txt
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
Ensure GPU support is enabled for efficient inference.
Download the V2A model (e.g., ThinkSound).
cd HALCON
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
Download the ThinkSound cache extracted from the Kling-Audio-Eval dataset: Kling-Audio-Eval-ThinkSound.tar
We recommend using hf_transfer for a faster download:
pip install -U hf_transfer huggingface_hub
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download \
Helios1208/Kling-Audio-Eval-cache \
Kling-Audio-Eval-ThinkSound.tar \
--repo-type dataset \
--local-dir /workspace
tar -xvf Kling-Audio-Eval-ThinkSound.tar
Expected structure:
/workspace/KlingFoley/
├── Animal/ # class 1
├── Domestic sounds, home sounds/ # class 2
├── Human sounds/ # class 3
├── ...
You must update the dataset paths in the configuration file: HALCON/ThinkSound/configs/klingfoley.json.
Replace the /workspace/KlingFoley prefix in both "path" and "split_path" with your actual absolute path where the dataset was extracted:
{
"dataset_type": "multimodal_dir",
"test_datasets": [
{
"id": "klingfoley",
"path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}",
"split_path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}/video.txt"
}
],
"random_crop": true,
"input_type": "prompt"
}
Do not modify the {L1_path}/{L2_path} portion of the strings, as these are placeholders used by the code to navigate the directory structure.
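For intuition, the placeholders are presumably expanded per class and per clip at runtime, along the lines of this sketch (the substitution values below are hypothetical; the real logic lives in the repository code):

```python
# Hypothetical illustration of how the {L1_path}/{L2_path} placeholders
# in the config could be expanded; values are illustrative only.
template = "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}"
resolved = template.format(L1_path="Animal", L2_path="clip_0001")
print(resolved)  # /YOUR/ACTUAL/PATH/Animal/clip_0001
```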
The HALCON pipeline consists of three stages:
Generate Audio for each video-text pair.
python eval_all.py \
--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
--dataset-config "ThinkSound/configs/klingfoley.json" \
--model-config "ThinkSound/configs/model_configs/thinksound.json" \
--duration-sec "10" \
--save-dir "PATH_TO_SAVE_DIR/{L1_path}/{L2_path}"
Execute multiple detectors and collect hallucination results.
bash detect.sh "PATH_TO_SAVE_DIR"
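Conceptually, the majority-voting fusion over per-detector outputs can be sketched as follows (frame-level boolean flags; the data layout and function name are illustrative, not the script's actual interface):

```python
# Sketch of majority-voting fusion across audio event detectors.
# Each detector emits one boolean flag per frame indicating a
# hallucinated event (e.g., speech/music with no visual source).

def majority_vote(detections):
    """A frame is flagged as hallucinated if more than half of the
    detectors flag it."""
    n_detectors = len(detections)
    n_frames = len(detections[0])
    fused = []
    for t in range(n_frames):
        votes = sum(d[t] for d in detections)
        fused.append(votes * 2 > n_detectors)
    return fused

dets = [
    [True,  True,  False, False],  # detector A
    [True,  False, False, True],   # detector B
    [True,  True,  False, False],  # detector C
]
print(majority_vote(dets))  # [True, True, False, False]
```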
Mask the unreliable video features and regenerate the audio with the corrected conditioning.
python eval_frame.py \
--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
--dataset-config "configs/klingfoley.json" \
--model-config "configs/thinksound.json" \
--results-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE/{L1_path}/{L2_path}" \
--first-root-dir "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}" \
--csv-path "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}/cache/segment_fused_mv.csv" \
--save-dir "PATH_TO_NEW_SAVE_DIR/{L1_path}/{L2_path}" \
--duration-sec "10" \
--replace_metaclip
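The feature-masking step in this stage can be pictured as zeroing the video features over the detected intervals before regeneration. A minimal sketch with assumed shapes and frame rate (HALCON's actual masking operates on the model's internal conditioning features):

```python
import numpy as np

# Sketch: zero out per-frame video features over hallucinated time
# intervals before re-conditioning the V2A model. Shapes/fps assumed.

def mask_features(features, segments, fps):
    """features: (T, D) per-frame video features;
    segments: list of (start_sec, end_sec) hallucinated intervals."""
    masked = features.copy()
    for start, end in segments:
        s, e = int(start * fps), int(end * fps)
        masked[s:e] = 0.0  # drop unreliable conditioning
    return masked

feats = np.ones((100, 4))            # 10 s of features at 10 fps
out = mask_features(feats, [(2.0, 4.0)], fps=10)
print(out[25].sum(), out[50].sum())  # 0.0 4.0
```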
@article{chen2025detecting,
title={Detecting and mitigating insertion hallucination in video-to-audio generation},
author={Chen, Liyang and Chen, Hongkai and Cai, Yujun and Li, Sifan and Ye, Qingwen and Wang, Yiwei},
journal={arXiv preprint arXiv:2510.08078},
year={2025}
}
This project is licensed under the terms of the Apache License 2.0.
We gratefully acknowledge the V2A frameworks ThinkSound and MMAudio, which supported the development and evaluation of our method.


