HALCON: Hallucination Enhanced Video-to-Audio Generation

Liyang Chen¹³, Hongkai Chen³, Yujun Cai², Sifan Li³⁴, Qingwen Ye³, Yiwei Wang

¹ University of California, Los Angeles ² The University of Queensland

³ vivo Mobile Communication Co., Ltd. ⁴ University of California, Merced



📌 Overview

Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk, driven by dataset biases such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors, and we introduce two novel metrics to quantify the prevalence and severity of the issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose HALCON, a three-stage procedure that mitigates IH: it first generates initial audio to expose hallucinated segments, then identifies and masks the corresponding unreliable video features, and finally regenerates the audio using the corrected conditioning. Experiments on several mainstream V2A benchmarks reveal that state-of-the-art models suffer from severe IH. In contrast, HALCON reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
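The two metrics can be sketched as simple aggregates over per-video hallucinated intervals. The helper below is a hypothetical illustration (the interval representation and function name are assumptions, not the repository's actual evaluation code):

```python
def ih_metrics(detections, durations):
    """Compute IH@vid and IH@dur from per-video hallucinated intervals.

    detections: dict video_id -> list of (start_sec, end_sec) hallucinated spans
    durations:  dict video_id -> total video duration in seconds
    """
    n_videos = len(durations)
    # IH@vid: fraction of videos containing at least one hallucinated event
    n_hallucinated = sum(1 for vid in durations if detections.get(vid))
    # IH@dur: fraction of total duration covered by hallucinated events
    total_dur = sum(durations.values())
    hall_dur = sum(end - start
                   for vid in durations
                   for start, end in detections.get(vid, []))
    return n_hallucinated / n_videos, hall_dur / total_dur
```

For example, two 10-second videos where only one contains a 2-second hallucinated segment yield IH@vid = 0.5 and IH@dur = 0.1.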


⚙️ Installation

🧪 Environment

We recommend conda and Python 3.10. Create an environment and install the dependencies:

git clone https://github.com/vivo/HALCON.git
cd HALCON
conda create -n halcon python=3.10 -y
conda activate halcon
pip install -r requirements.txt
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

Ensure GPU support is enabled for efficient inference.
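A quick sanity check that the CUDA build of PyTorch installed above can actually see a GPU:

```python
import torch

# Verify the CUDA wheel is active; a False here means inference will fall
# back to CPU and be much slower.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```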

🤖 V2A Model

Download the V2A model weights (e.g., ThinkSound):

cd HALCON
git lfs install
git clone https://huggingface.co/liuhuadai/ThinkSound ckpts

📂 Dataset Preparation

Download the ThinkSound cache extracted from the Kling-Audio-Eval dataset: Kling-Audio-Eval-ThinkSound.tar

1. Download

We recommend enabling hf_transfer for a faster download:

pip install -U hf_transfer huggingface_hub
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download \
  Helios1208/Kling-Audio-Eval-cache \
  Kling-Audio-Eval-ThinkSound.tar \
  --repo-type dataset \
  --local-dir /workspace
tar -xvf Kling-Audio-Eval-ThinkSound.tar

Expected structure:

/workspace/KlingFoley/
    ├── Animal/															# class 1
    ├── Domestic sounds, home sounds/									# class 2
    ├── Human sounds/													# class 3
    ├── ...
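After extraction, you can verify the class directories with a small helper (a hypothetical convenience snippet, not part of the repository):

```python
from pathlib import Path

def list_classes(root):
    """Return the top-level class directory names under the dataset root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())
```

Calling `list_classes("/workspace/KlingFoley")` should print directory names such as `Animal` and `Human sounds`; an empty result means the tar was extracted to a different location.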

2. Configuration Setup

You must update the dataset paths in the configuration file: HALCON/ThinkSound/configs/klingfoley.json. Replace the /workspace/KlingFoley prefix in both "path" and "split_path" with your actual absolute path where the dataset was extracted:

{
    "dataset_type": "multimodal_dir",
    "test_datasets": [
        {
            "id": "klingfoley",
            "path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}",
            "split_path": "/YOUR/ACTUAL/PATH/{L1_path}/{L2_path}/video.txt"
        }
    ],
    "random_crop": true,
    "input_type": "prompt"
}

Do not modify the {L1_path}/{L2_path} portion of the strings, as these are placeholders used by the code to navigate the directory structure.
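If you prefer not to edit the JSON by hand, the prefix swap can be scripted while leaving the placeholders untouched (a hypothetical helper; the function name is an assumption):

```python
import json

def set_dataset_root(config_path, new_root, old_root="/workspace/KlingFoley"):
    """Replace the dataset root prefix in the config, keeping the
    {L1_path}/{L2_path} placeholders intact."""
    with open(config_path) as f:
        cfg = json.load(f)
    for ds in cfg["test_datasets"]:
        # Only the leading prefix changes; the placeholder suffix is preserved.
        ds["path"] = ds["path"].replace(old_root, new_root, 1)
        ds["split_path"] = ds["split_path"].replace(old_root, new_root, 1)
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=4)
```

For example, `set_dataset_root("ThinkSound/configs/klingfoley.json", "/data/kling")` rewrites both paths in place.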


🚀 Usage

The HALCON pipeline consists of three stages:

🧭 Stage 1: Generation (via ThinkSound)

Generate audio for each video-text pair:

python eval_all.py \
	--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
	--dataset-config "ThinkSound/configs/klingfoley.json" \
	--model-config "ThinkSound/configs/model_configs/thinksound.json" \
	--duration-sec "10" \
	--save-dir "PATH_TO_SAVE_DIR/{L1_path}/{L2_path}"

🛠️ Stage 2: Hallucination Detection

Execute multiple detectors and collect hallucination results.

bash detect.sh "PATH_TO_SAVE_DIR"
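Conceptually, the majority-voting ensemble marks a time frame as hallucinated when more than half of the detectors flag it. A minimal sketch, assuming a per-frame boolean representation (the actual output format of detect.sh may differ):

```python
def majority_vote(detector_frames):
    """Fuse per-frame flags from several audio event detectors.

    detector_frames: list of equal-length lists; detector_frames[d][t] is True
    if detector d flags frame t as a hallucinated acoustic event.
    Returns a fused per-frame decision by strict majority.
    """
    n = len(detector_frames)
    # A frame is hallucinated iff more than n/2 detectors agree.
    return [sum(flags) * 2 > n for flags in zip(*detector_frames)]
```

With three detectors, a frame flagged by two of them is kept, while a frame flagged by only one is discarded as a spurious detection.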

🧠 Stage 3: Feature Conditioning and Re-Generation

Mask the unreliable video features identified in Stage 2 and regenerate the audio with the corrected conditioning.

python eval_frame.py \
	--root-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE" \
	--dataset-config "configs/klingfoley.json" \
	--model-config "configs/thinksound.json" \
	--results-dir "PATH_TO_KLING_AUDIO_EVAL_CACHE/{L1_path}/{L2_path}" \
	--first-root-dir "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}" \
	--csv-path "PATH_TO_FIRST_SAVE_DIR/{L1_path}/{L2_path}/cache/segment_fused_mv.csv" \
	--save-dir "PATH_TO_NEW_SAVE_DIR/{L1_path}/{L2_path}" \
	--duration-sec "10" \
	--replace_metaclip
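The core idea of Stage 3 is to neutralize the video features over frames that fall inside detected hallucination intervals before regenerating. A minimal sketch with a zero mask, using plain lists for illustration (the real pipeline in eval_frame.py operates on model feature tensors and may use a different masking strategy):

```python
def mask_unreliable_frames(feats, intervals, fps):
    """Zero out per-frame feature vectors inside hallucinated time intervals.

    feats:     list of per-frame feature vectors (lists of floats)
    intervals: list of (start_sec, end_sec) hallucinated spans
    fps:       frame rate of the feature sequence
    Returns a masked copy; the input is left unmodified.
    """
    out = [list(f) for f in feats]
    for start, end in intervals:
        # Convert seconds to frame indices, clamped to the sequence bounds.
        s = max(0, int(start * fps))
        e = min(len(out), int(end * fps) + 1)
        for t in range(s, e):
            out[t] = [0.0] * len(out[t])
    return out
```

Masking only the flagged frames preserves the conditioning signal for the rest of the video, which is why regeneration can remove hallucinations without hurting overall quality.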

📊 Experiments

*(Main results figures.)*


🧩 Examples

*(Framework and example figures.)*


📖 Citation

@article{chen2025detecting,
  title={Detecting and mitigating insertion hallucination in video-to-audio generation},
  author={Chen, Liyang and Chen, Hongkai and Cai, Yujun and Li, Sifan and Ye, Qingwen and Wang, Yiwei},
  journal={arXiv preprint arXiv:2510.08078},
  year={2025}
}

📜 License

This project is licensed under the terms of the Apache License 2.0.


🙏 Acknowledgements

We gratefully acknowledge the V2A frameworks ThinkSound and MMAudio, which supported the development and experimentation of our method.
