NKU-HLT/DIFFA

DIFFA Series

🔥 News

  • 2026.03.03: Released the DIFFA-2 checkpoint and code.
  • 2026.01: Our new paper DIFFA-2 is now available on arXiv. 🎉
  • 2025.11: DIFFA has been accepted to AAAI 2026!
  • 2025.08: Released the DIFFA checkpoint and code.
  • 2025.07: Our paper DIFFA is available on arXiv. 🎉

DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

arXiv · 🤗 Hugging Face · GitHub

We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong autoregressive (AR) LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding.

DIFFA: Large Language Diffusion Models Can Listen and Understand

arXiv · 🤗 Hugging Face · GitHub


DIFFA is the first diffusion-based large audio-language model (LALM) for spoken language understanding.
It leverages a frozen diffusion LLM with dual adapters (semantic + acoustic) to enhance audio perception and reasoning.
As the first exploration of diffusion-based large language models (dLLMs) in speech and audio understanding, DIFFA opens new directions for non-autoregressive multimodal learning. This repository provides the training data, checkpoints, inference scripts, and reproducible training pipelines to facilitate further research on diffusion LLMs in the audio domain.
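The dual-adapter idea can be sketched in a few lines. This is a toy illustration only: the dimensions, the linear-projection form of the adapters, and the prefix-token fusion are assumptions for exposition, not the released model's configuration.

```python
import numpy as np

# Toy sketch of the dual-adapter design (illustrative assumptions, not the
# paper's actual architecture details). A frozen diffusion LLM consumes
# embeddings of size d_model; two adapters project encoder features into
# that space: a "semantic" adapter for semantic speech features and an
# "acoustic" adapter for acoustic features. Their outputs are concatenated
# as audio prefix tokens ahead of the text embeddings.

rng = np.random.default_rng(0)

d_sem, d_ac, d_model = 768, 512, 1024           # hypothetical dimensions
W_sem = rng.standard_normal((d_sem, d_model)) * 0.02  # semantic adapter
W_ac = rng.standard_normal((d_ac, d_model)) * 0.02    # acoustic adapter

def adapt(sem_feats, ac_feats):
    """Project both feature streams and stack them as audio prefix tokens."""
    sem_tokens = sem_feats @ W_sem              # (T_sem, d_model)
    ac_tokens = ac_feats @ W_ac                 # (T_ac, d_model)
    return np.concatenate([sem_tokens, ac_tokens], axis=0)

# Example: 50 semantic frames and 100 acoustic frames -> 150 prefix tokens.
prefix = adapt(rng.standard_normal((50, d_sem)),
               rng.standard_normal((100, d_ac)))
print(prefix.shape)  # (150, 1024)
```

During training only the adapters (and, in later stages, task-specific components) would be updated while the diffusion LLM stays frozen, which is what keeps the approach practical at scale.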

📖 Citation

If you find DIFFA useful, please cite:

@article{zhou2026diffa,
  title={DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding},
  author={Zhou, Jiaming and Cheng, Xuxin and Zhao, Shiwan and Jia, Yuhang and Liu, Cao and Zeng, Ke and Cai, Xunliang and Qin, Yong},
  journal={arXiv preprint arXiv:2601.23161},
  year={2026}
}

@article{zhou2025diffa,
  title={DIFFA: Large Language Diffusion Models Can Listen and Understand},
  author={Zhou, Jiaming and Chen, Hongjie and Zhao, Shiwan and Kang, Jian and Li, Jie and Wang, Enzhi and Guo, Yujie and Sun, Haoqin and Wang, Hui and Kong, Aobo and others},
  journal={arXiv preprint arXiv:2507.18452},
  year={2025}
}

About

The official implementation of the DIFFA series of dLLM-based large audio-language models.
