A curated list of speech recognition learning resources, models, and toolkits (continuously updated).
- (Recommended) Automatic Speech Recognition (ASR) 2018-2019 Lectures, School of Informatics, University of Edinburgh [Website]
- Speech recognition, EECS E6870 - Spring 2016, Columbia University [Website]
- CS224N: Natural Language Processing with Deep Learning, Stanford [Website] [Video(Winter 2021)] [Video(Winter 2017)]
- CS224S: Spoken Language Processing (Winter 2021), Stanford [Website]
- DLHLP: DEEP LEARNING FOR HUMAN LANGUAGE PROCESSING, 2020 SPRING, Hung-yi Lee [Website] [Video(Spring 2020)]
- Microsoft DEV287x: Speech Recognition Systems, 2019 [Website]
- Speech Recognition from Beginner to Expert (语音识别从入门到精通), 2019, 谢磊 (Lei Xie) (NOT FREE) [Website]
- Introduction to Digital Speech Processing (數位語音處理概論), National Taiwan University, 李琳山 (Lin-shan Lee) [Website]
- Fundamentals of Speech Recognition, Lawrence Rabiner, Biing-Hwang Juang, 1993 [Book]
- Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001 [Book]
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky & James H. Martin [Website] [Book 3rd Ed]
- Automatic speech recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014 [Book]
- Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999 [Website] [Book]
- Analyzing Deep Learning: Practical Speech Recognition (《解析深度学习:语音识别实践》), 俞栋 (Dong Yu), 邓力 (Li Deng), Publishing House of Electronics Industry
- Kaldi Speech Recognition in Action (《Kaldi 语音识别实战》), 陈果果 (Guoguo Chen), Publishing House of Electronics Industry
- Speech Recognition: Principles and Applications (《语音识别:原理与应用》), 洪青阳 (Qingyang Hong), Publishing House of Electronics Industry
- Fundamentals of Speech Recognition (《语音识别基本法》), 汤志远 (Zhiyuan Tang), Publishing House of Electronics Industry
- Statistical Learning Methods (《统计学习方法》), 李航 (Hang Li), Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), 韩继庆 (Jiqing Han), Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), 赵力 (Li Zhao), China Machine Press
- FireRedASR2S (2026) - All-in-one ASR system integrating VAD, language identification, and punctuation. Xiaohongshu. [Paper] [Code]
- Qwen3-ASR (2025) - Speech recognition model supporting 52 languages/dialects with streaming and non-streaming inference. Alibaba. [Paper] [Code]
- Fun-ASR (2025) - End-to-end speech recognition large model by Tongyi Lab, providing Fun-ASR-Nano and Fun-ASR-MLT-Nano. Alibaba. [Paper] [Code]
- Index-ASR (2025) - LLM-based ASR with contextual hotword injection and noise robustness. Bilibili. (Not open-sourced) [Paper]
- FireRedASR (2025) - Open-source industrial-grade Mandarin speech recognition models. Xiaohongshu. [Paper] [Code]
- Qwen2-Audio (2024) - Large audio-language model supporting ASR, translation, and audio understanding. Alibaba. [Paper] [Code]
- SenseVoice (2024) - Multilingual speech recognition with emotion and audio event detection. Alibaba. [Paper] [Code]
- Moonshine (2024) - Tiny yet powerful ASR models optimized for edge devices. Useful Sensors. [Paper] [Code]
- USM (2023) - Universal Speech Model scaling ASR to 100+ languages with 12M hours of speech. Google. [Paper]
- SeamlessM4T (2023) - Massively multilingual & multimodal machine translation supporting speech and text. Meta. [Paper] [Code]
- Whisper (2022) - Large-scale weakly supervised speech recognition trained on 680k hours of web data. OpenAI. [Paper] [Code]
- WavLM (2022) - Pre-trained model for full stack speech processing tasks. Microsoft. [Paper] [Code]
- HuBERT (2021) - Self-supervised speech representation learning by masked prediction. Meta. [Paper] [Code]
- wav2vec 2.0 (2020) - Self-supervised learning framework for speech representations. Meta. [Paper] [Code]
- wav2vec (2019) - Unsupervised pre-training for speech recognition. Meta. [Paper] [Code]
- Canary (2024) - Multilingual ASR and translation model with FastConformer encoder and Transformer decoder, supporting 25 languages. NVIDIA. [Code]
- Paraformer-v2 (2024) - Improved non-autoregressive transformer for noise-robust speech recognition. Alibaba. [Paper] [Unofficial Code]
- FastConformer / Parakeet (2023) - Optimized Conformer variant approximately 2.4x faster. Parakeet is a family of ASR models built on FastConformer with CTC/RNN-T/TDT decoders. NVIDIA. [Paper - FastConformer] [Paper - TDT] [Code]
- Zipformer (2023) - Efficient Conformer variant with a U-Net-like downsampled encoder and attention-weight reuse for ASR. Next-gen Kaldi. [Paper] [Code]
- Paraformer (2022) - Fast and accurate non-autoregressive end-to-end speech recognition with parallel Transformer. Alibaba. [Paper] [Code]
- E-Branchformer (2022) - Enhanced Branchformer for speech recognition. CMU & JHU. [Paper] [Code]
- Branchformer (2022) - Parallel branch architecture combining self-attention and convolution. NTT. [Paper] [Code]
- Conformer (2020) - Convolution-augmented Transformer for speech recognition. Google. [Paper]
- ContextNet (2020) - CNN-RNN-Transducer model with global context for streaming ASR. Google. [Paper]
- Transformer-Transducer (2020) - Transformer-based model for streaming end-to-end ASR. Facebook. [Paper]
- LAS / Listen, Attend and Spell (2015) - Attention-based sequence-to-sequence model for ASR. Google. [Paper]
- Deep Speech 2 (2015) - End-to-end speech recognition in English and Mandarin. Baidu. [Paper] [Code]
- Deep Speech (2014) - Scaling up end-to-end speech recognition. Baidu. [Paper]
- RNN-Transducer (2012) - Sequence transduction with recurrent neural networks. Graves. [Paper]
- DNN-HMM (2012) - Deep neural networks for acoustic modeling in speech recognition. Hinton et al. [Paper]
- GMM-HMM (2007) - The application of hidden Markov models in speech recognition. Gales & Young. [Paper]
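Many of the end-to-end models above (Deep Speech, Conformer, Zipformer, etc.) are trained with CTC and decoded with its collapse rule: merge consecutive repeats, then drop blanks. A minimal sketch of greedy CTC decoding in pure Python (illustrative only, not taken from any particular toolkit; label ids are made up):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Apply the CTC collapse rule to a sequence of per-frame argmax label
    ids: merge consecutive duplicates, then remove blank symbols."""
    out = []
    prev = None
    for label in frame_ids:
        # Only emit a label when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# e.g. with blank=0, c=1, a=2, t=3, nine frames collapse to "cat":
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3]))  # [1, 2, 3]
# A blank between repeats keeps them distinct ("aa" stays two labels):
print(ctc_greedy_decode([2, 0, 2]))  # [2, 2]
```

Real decoders replace the per-frame argmax with beam search over the full label distributions, often fused with a language model, but the collapse rule is the same.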
- WFST
- An Introduction to Weighted Automata in Machine Learning, Awni Hannun, 2021. [PDF]
- k2: differentiable WFST/FSA algorithms; see also the next-gen Kaldi toolkit entries below.
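In decoding, WFSTs are usually weighted with negative log probabilities in the tropical semiring (⊕ = min, ⊗ = +), so the best hypothesis is the minimum-weight accepting path. A toy sketch of that search with Dijkstra's algorithm (the states, labels, and weights below are invented for illustration; production decoders like Kaldi's operate on composed HCLG graphs instead):

```python
import heapq

def shortest_path(arcs, start, final):
    """Min-weight accepting path through a weighted automaton in the
    tropical semiring.  arcs[state] = [(next_state, label, weight)].
    Returns (total_weight, label_sequence), or None if final is unreachable."""
    heap = [(0.0, start, [])]
    settled = {}
    while heap:
        weight, state, labels = heapq.heappop(heap)
        if state in settled and settled[state] <= weight:
            continue  # already reached this state more cheaply
        settled[state] = weight
        if state == final:
            return weight, labels
        for nxt, label, arc_w in arcs.get(state, []):
            # ⊗ = + : path weight is the sum of arc weights.
            heapq.heappush(heap, (weight + arc_w, nxt, labels + [label]))
    return None

# Hypothetical 3-state acceptor: "h" (0.5) beats "x" (2.0) out of state 0.
arcs = {0: [(1, "h", 0.5), (1, "x", 2.0)],
        1: [(2, "i", 0.3)]}
print(shortest_path(arcs, start=0, final=2))  # (0.8, ['h', 'i'])
```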
(listed in no particular order)
- kaldi [Github] [Doc]
- next-gen Kaldi [Github]
- k2: FSA/FST algorithms, differentiable, with PyTorch compatibility. [Github] [Doc]
- icefall: Speech recognition recipes using k2. [Github] [Doc]
- sherpa: Streaming and non-streaming ASR server for next-gen Kaldi. [Github] [Doc]
- sherpa-onnx: Real-time speech recognition using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, x86_64 servers, and WebSocket server/client, with APIs for C/C++, Python, Kotlin, C#, and Go. [Github] [Doc]
- sherpa-ncnn: Real-time speech recognition using next-gen Kaldi with ncnn, without an Internet connection. Supports iOS, Android, Raspberry Pi, VisionFive2, etc. [Github] [Doc]
- lhotse: Tools for handling speech data in machine learning projects. [Github] [Doc]
- snowfall (deprecated) [Github]
- FunASR [Github] [Doc]
- A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models.
- Gao, Z., Li, Z., Wang, J., Luo, H., Shi, X., Chen, M., ... & Zhang, S. (2023). FunASR: A Fundamental End-to-End Speech Recognition Toolkit. arXiv preprint arXiv:2305.11013.
- espnet/espnet2 [Github]
- Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.
- wenet [Github]
- Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021.
- Zhang B, Wu D, Yao Z, et al. Unified streaming and non-streaming two-pass end-to-end model for speech recognition[J]. arXiv preprint arXiv:2012.05481, 2020.
- Wu D, Zhang B, Yang C, et al. U2++: Unified two-pass bidirectional end-to-end model for speech recognition[J]. arXiv preprint arXiv:2106.05642, 2021.
- NeMo [Github] [Doc]
- NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS).
- Fairseq [Github] [Doc]
- Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
- speechbrain [Github] [Doc]
- SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch.
- paddlespeech [Github] [Doc]
- PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
- eesen (R.I.P.) [Github]
- Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015: 167-174.
- warp_ctc [Github]
- A fast parallel implementation of CTC, on both CPU and GPU.
- htk: the Hidden Markov Model Toolkit (HTK) from Cambridge University, the classic GMM-HMM toolkit.
- sphinx: CMU Sphinx, Carnegie Mellon's open-source speech recognition toolkit.
- Prabhavalkar R, Horiguchi S, Jain A, et al. End-to-end speech recognition: A survey[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. [Paper]
- Li J, Deng L, Haeb-Umbach R, et al. Robust automatic speech recognition: A bridge to practical applications[M]. Academic Press, 2015.
- Chen S, Wang C, Chen Z, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022. [Paper]
- Hsu W N, Bolte B, Tsai Y H, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021. [Paper]
- Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//NeurIPS. 2020. [Paper]
- Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[C]//Interspeech. 2020. [Paper]
- Chan W, Jaitly N, Le Q V, et al. Listen, attend and spell: A neural network that learns to translate speech to text[C]//ICASSP. 2016. [Paper]
- Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin[C]//ICML. 2016. [Paper]
- Graves A. Sequence transduction with recurrent neural networks[J]. arXiv preprint arXiv:1211.3711, 2012. [Paper]
- CTC: Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]//ICML. 2006: 369-376. [Paper]
- Chain/LF-MMI: Povey D, Peddinti V, Galvez D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech. 2016: 2751-2755. [Paper]
- TDNN: Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts[C]//Interspeech. 2015: 3214-3218. [Paper]
- DNN-HMM: Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. [Paper]
- EM: Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models[J]. International Computer Science Institute, 1998, 4(510): 126. [Paper]
- HMM: Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. [Paper]
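The core of Rabiner's HMM tutorial is the Viterbi algorithm: find the most likely hidden-state sequence given observations, via dynamic programming in log space to avoid underflow. A self-contained sketch on a hypothetical two-state silence/speech HMM (all probabilities below are invented for illustration):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence,
    computed with log probabilities (Viterbi dynamic programming)."""
    # Initialization: log P(state) + log P(first observation | state).
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 1 - 1, -1):
        if t == 0:
            break
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy HMM: low-energy frames suggest silence, high-energy speech.
states = ("sil", "speech")
start_p = {"sil": 0.6, "speech": 0.4}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# → ['sil', 'speech', 'speech']
```

In a real ASR decoder the states are context-dependent phone states and the emission scores come from a GMM or neural acoustic model, but the recursion is the same.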