Liyang Chen¹³, Hongkai Chen³, Yujun Cai², Sifan Li³⁴, Qingwen Ye³, Yiwei Wang⁴
¹ University of California, Los Angeles ² The University of Queensland
³ vivo Mobile Communication Co., Ltd. ⁴ University of California, Merced
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we introduce HALCON to mitigate IH. HALCON follows a three-stage procedure: it first generates initial audio to expose hallucinated segments, then identifies and masks the corresponding unreliable video features, and finally regenerates the audio using the corrected conditioning. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our HALCON method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
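As a rough illustration of the two metrics, the following is a minimal sketch; the per-video segment format and the function names are hypothetical, not the repository's actual API:

```python
# Minimal sketch of the IH@vid / IH@dur metrics described above.
# Each video is represented by its total duration and a list of
# hallucinated (start_sec, end_sec) segments; this layout is assumed.

def ih_at_vid(videos):
    """Fraction of videos containing at least one hallucinated segment."""
    flagged = sum(1 for v in videos if v["segments"])
    return flagged / len(videos)

def ih_at_dur(videos):
    """Fraction of total duration covered by hallucinated segments."""
    hallucinated = sum(e - s for v in videos for (s, e) in v["segments"])
    total = sum(v["duration"] for v in videos)
    return hallucinated / total

videos = [
    {"duration": 10.0, "segments": [(2.0, 4.0)]},  # 2 s hallucinated
    {"duration": 10.0, "segments": []},            # clean video
]
print(ih_at_vid(videos))  # 0.5
print(ih_at_dur(videos))  # 0.1
```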
We recommend using conda and Python 3.10. Create an environment and install the dependencies:
git clone https://github.com/vivo/HALCON.git
cd HALCON
conda create -n halcon python=3.10 -y
conda activate halcon
pip install -r requirements.txt
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
Ensure GPU support is enabled for efficient inference.
Download the V2A model (e.g., ThinkSound).
cd HALCON
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts
Download the ThinkSound cache extracted from the Kling-Audio-Eval dataset: Kling-Audio-Eval-ThinkSound.tar
We recommend using hf_transfer for a faster download:
pip install -U hf_transfer huggingface_hub
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download \
Helios1208/Kling-Audio-Eval-cache \
Kling-Audio-Eval-ThinkSound.tar \
--repo-type dataset \
--local-dir /workspace
tar -xvf Kling-Audio-Eval-ThinkSound.tar
Expected structure:
/workspace/KlingFoley/
├── Animal/ # class 1
├── Domestic sounds, home sounds/ # class 2
├── Human sounds/ # class 3
├── ...
You must update the dataset paths in the configuration file: HALCON/ThinkSound/configs/klingfoley.json.
Replace the /workspace/KlingFoley prefix in both "path" and "split_path" with your actual absolute path where the dataset was extracted:
{
"dataset_type": "multimodal_dir",
"test_datasets": [
{
"id": "klingfoley",
"path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}",
"split_path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}/video.txt"
}
],
"random_crop": true,
"input_type": "prompt"
}
Do not modify the {L1_path}/{L2_path} portion of the strings, as these are placeholders used by the code to navigate the directory structure.
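For intuition, the placeholders are presumably expanded per class and per clip at runtime, along the lines of this sketch (the substitution values below are hypothetical; the real logic lives in the repository code):

```python
# Hypothetical illustration of how the {L1_path}/{L2_path} placeholders
# in the config could be expanded; values are illustrative only.
template = "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}"
resolved = template.format(L1_path="Animal", L2_path="clip_0001")
print(resolved)  # /YOUR/ACTUAL/PATH/Animal/clip_0001
```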
The HALCON pipeline consists of three stages:
Generate Audio for each video-text pair.
python eval_all.py \
--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
--dataset-config "ThinkSound/configs/klingfoley.json" \
--model-config "ThinkSound/configs/model_configs/thinksound.json" \
--duration-sec "10" \
--save-dir "PATH_TO_SAVE_DIR/{L1_path}/{L2_path}"
Execute multiple detectors and collect hallucination results.
bash detect.sh "PATH_TO_SAVE_DIR"
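Conceptually, the majority-voting fusion over per-detector outputs can be sketched as follows (frame-level boolean flags; the data layout and function name are illustrative, not the script's actual interface):

```python
# Sketch of majority-voting fusion across audio event detectors.
# Each detector emits one boolean flag per frame indicating a
# hallucinated event (e.g., speech/music with no visual source).

def majority_vote(detections):
    """A frame is flagged as hallucinated if more than half of the
    detectors flag it."""
    n_detectors = len(detections)
    n_frames = len(detections[0])
    fused = []
    for t in range(n_frames):
        votes = sum(d[t] for d in detections)
        fused.append(votes * 2 > n_detectors)
    return fused

dets = [
    [True,  True,  False, False],  # detector A
    [True,  False, False, True],   # detector B
    [True,  True,  False, False],  # detector C
]
print(majority_vote(dets))  # [True, True, False, False]
```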
Mask the unreliable video features and regenerate the audio with the corrected conditioning.
python eval_frame.py \
--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
--dataset-config "configs/klingfoley.json" \
--model-config "configs/thinksound.json" \
--results-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE/{L1_path}/{L2_path}" \
--first-root-dir "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}" \
--csv-path "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}/cache/segment_fused_mv.csv" \
--save-dir "PATH_TO_NEW_SAVE_DIR/{L1_path}/{L2_path}" \
--duration-sec "10" \
--replace_metaclip
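The feature-masking step in this stage can be pictured as zeroing the video features over the detected intervals before regeneration. A minimal sketch with assumed shapes and frame rate (HALCON's actual masking operates on the model's internal conditioning features):

```python
import numpy as np

# Sketch: zero out per-frame video features over hallucinated time
# intervals before re-conditioning the V2A model. Shapes/fps assumed.

def mask_features(features, segments, fps):
    """features: (T, D) per-frame video features;
    segments: list of (start_sec, end_sec) hallucinated intervals."""
    masked = features.copy()
    for start, end in segments:
        s, e = int(start * fps), int(end * fps)
        masked[s:e] = 0.0  # drop unreliable conditioning
    return masked

feats = np.ones((100, 4))            # 10 s of features at 10 fps
out = mask_features(feats, [(2.0, 4.0)], fps=10)
print(out[25].sum(), out[50].sum())  # 0.0 4.0
```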
@article{chen2025detecting,
title={Detecting and mitigating insertion hallucination in video-to-audio generation},
author={Chen, Liyang and Chen, Hongkai and Cai, Yujun and Li, Sifan and Ye, Qingwen and Wang, Yiwei},
journal={arXiv preprint arXiv:2510.08078},
year={2025}
}
This project is licensed under the terms of the Apache License 2.0.
We gratefully acknowledge the V2A frameworks ThinkSound and MMAudio, which supported the development and evaluation of our method.


