A curated list of speech recognition learning resources, models, and toolkits (continuously updated).
- (Recommended) Automatic Speech Recognition (ASR) 2018-2019 Lectures, School of Informatics, University of Edinburgh [Website]
- Speech recognition, EECS E6870 - Spring 2016, Columbia University [Website]
- CS224N: Natural Language Processing with Deep Learning, Stanford [Website] [Video(Winter 2021)] [Video(Winter 2017)]
- CS224S: Spoken Language Processing (Winter 2021), Stanford [Website]
- DLHLP: DEEP LEARNING FOR HUMAN LANGUAGE PROCESSING, 2020 SPRING, Hung-yi Lee [Website] [Video(Spring 2020)]
- Microsoft DEV287x: Speech Recognition Systems, 2019 [Website]
- Speech Recognition from Beginner to Expert (语音识别从入门到精通), 2019, 谢磊 (Lei Xie) (NOT FREE) [Website]
- Introduction to Digital Speech Processing (數位語音處理概論), National Taiwan University, 李琳山 (Lin-shan Lee) [Website]
- Fundamentals of Speech Recognition, Lawrence Rabiner, Biing-Hwang Juang, 1993 [Book]
- Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001 [Book]
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky & James H. Martin [Website] [Book 3rd Ed]
- Automatic speech recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014 [Book]
- Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999 [Website] [Book]
- Analyzing Deep Learning: Practical Speech Recognition (《解析深度学习:语音识别实践》), 俞栋 (Dong Yu), 邓力 (Li Deng), Publishing House of Electronics Industry
- Kaldi Speech Recognition in Action (《Kaldi 语音识别实战》), 陈果果 (Guoguo Chen), Publishing House of Electronics Industry
- Speech Recognition: Principles and Applications (《语音识别:原理与应用》), 洪青阳 (Qingyang Hong), Publishing House of Electronics Industry
- Fundamentals of Speech Recognition (《语音识别基本法》), 汤志远 (Zhiyuan Tang), Publishing House of Electronics Industry
- Statistical Learning Methods (《统计学习方法》), 李航 (Hang Li), Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), 韩继庆 (Jiqing Han), Tsinghua University Press
- Speech Signal Processing (《语音信号处理》), 赵力 (Li Zhao), China Machine Press
- FireRedASR2S (2026) - All-in-one ASR system integrating VAD, language identification, and punctuation. Xiaohongshu. [Paper] [Code]
- Qwen3-ASR (2025) - Speech recognition model supporting 52 languages/dialects with streaming and non-streaming inference. Alibaba. [Paper] [Code]
- Fun-ASR (2025) - End-to-end speech recognition large model by Tongyi Lab, providing Fun-ASR-Nano and Fun-ASR-MLT-Nano. Alibaba. [Paper] [Code]
- Index-ASR (2025) - LLM-based ASR with contextual hotword injection and noise robustness. Bilibili. (Not open-sourced) [Paper]
- FireRedASR (2025) - Open-source industrial-grade Mandarin speech recognition models. Xiaohongshu. [Paper] [Code]
- Qwen2-Audio (2024) - Large audio-language model supporting ASR, translation, and audio understanding. Alibaba. [Paper] [Code]
- SenseVoice (2024) - Multilingual speech recognition with emotion and audio event detection. Alibaba. [Paper] [Code]
- Moonshine (2024) - Tiny yet powerful ASR models optimized for edge devices. Useful Sensors. [Paper] [Code]
- USM (2023) - Universal Speech Model scaling ASR to 100+ languages with 12M hours of speech. Google. [Paper]
- SeamlessM4T (2023) - Massively multilingual & multimodal machine translation supporting speech and text. Meta. [Paper] [Code]
- Whisper (2022) - Large-scale weakly supervised speech recognition trained on 680k hours of web data. OpenAI. [Paper] [Code]
- WavLM (2022) - Pre-trained model for full stack speech processing tasks. Microsoft. [Paper] [Code]
- HuBERT (2021) - Self-supervised speech representation learning by masked prediction. Meta. [Paper] [Code]
- wav2vec 2.0 (2020) - Self-supervised learning framework for speech representations. Meta. [Paper] [Code]
- wav2vec (2019) - Unsupervised pre-training for speech recognition. Meta. [Paper] [Code]
- Canary (2024) - Multilingual ASR and translation model with FastConformer encoder and Transformer decoder, supporting 25 languages. NVIDIA. [Code]
- Paraformer-v2 (2024) - Improved non-autoregressive transformer for noise-robust speech recognition. Alibaba. [Paper] [Unofficial Code]
- FastConformer / Parakeet (2023) - Optimized Conformer variant approximately 2.4x faster. Parakeet is a family of ASR models built on FastConformer with CTC/RNN-T/TDT decoders. NVIDIA. [Paper - FastConformer] [Paper - TDT] [Code]
- Zipformer (2023) - Efficient Conformer variant with a U-Net-like downsampled encoder and attention-weight reuse for ASR. Next-gen Kaldi. [Paper] [Code]
- Paraformer (2022) - Fast and accurate non-autoregressive end-to-end speech recognition with parallel Transformer. Alibaba. [Paper] [Code]
- E-Branchformer (2022) - Enhanced Branchformer for speech recognition. CMU & JHU. [Paper] [Code]
- Branchformer (2022) - Parallel branch architecture combining self-attention and convolution. NTT. [Paper] [Code]
- Conformer (2020) - Convolution-augmented Transformer for speech recognition. Google. [Paper]
- ContextNet (2020) - CNN-RNN-Transducer model with global context for streaming ASR. Google. [Paper]
- Transformer-Transducer (2020) - Transformer-based model for streaming end-to-end ASR. Facebook. [Paper]
- LAS / Listen, Attend and Spell (2015) - Attention-based sequence-to-sequence model for ASR. Google. [Paper]
- Deep Speech 2 (2015) - End-to-end speech recognition in English and Mandarin. Baidu. [Paper] [Code]
- Deep Speech (2014) - Scaling up end-to-end speech recognition. Baidu. [Paper]
- RNN-Transducer (2012) - Sequence transduction with recurrent neural networks. Graves. [Paper]
- DNN-HMM (2012) - Deep neural networks for acoustic modeling in speech recognition. Hinton et al. [Paper]
- GMM-HMM (2007) - The application of hidden Markov models in speech recognition. Gales & Young. [Paper]
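Many of the end-to-end models above (Deep Speech, Conformer, Zipformer, etc.) are trained with CTC and decoded with its collapse rule: merge consecutive repeats, then drop blanks. A minimal sketch of greedy CTC decoding in pure Python (illustrative only, not taken from any particular toolkit; label ids are made up):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Apply the CTC collapse rule to a sequence of per-frame argmax label
    ids: merge consecutive duplicates, then remove blank symbols."""
    out = []
    prev = None
    for label in frame_ids:
        # Only emit a label when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# e.g. with blank=0, c=1, a=2, t=3, nine frames collapse to "cat":
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3]))  # [1, 2, 3]
# A blank between repeats keeps them distinct ("aa" stays two labels):
print(ctc_greedy_decode([2, 0, 2]))  # [2, 2]
```

Real decoders replace the per-frame argmax with beam search over the full label distributions, often fused with a language model, but the collapse rule is the same.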
- WFST
- An Introduction to Weighted Automata in Machine Learning, Awni Hannun, 2021. [PDF]
- k2: differentiable WFST/FSA algorithms; see also the next-gen Kaldi toolkit entries below.
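In decoding, WFSTs are usually weighted with negative log probabilities in the tropical semiring (⊕ = min, ⊗ = +), so the best hypothesis is the minimum-weight accepting path. A toy sketch of that search with Dijkstra's algorithm (the states, labels, and weights below are invented for illustration; production decoders like Kaldi's operate on composed HCLG graphs instead):

```python
import heapq

def shortest_path(arcs, start, final):
    """Min-weight accepting path through a weighted automaton in the
    tropical semiring.  arcs[state] = [(next_state, label, weight)].
    Returns (total_weight, label_sequence), or None if final is unreachable."""
    heap = [(0.0, start, [])]
    settled = {}
    while heap:
        weight, state, labels = heapq.heappop(heap)
        if state in settled and settled[state] <= weight:
            continue  # already reached this state more cheaply
        settled[state] = weight
        if state == final:
            return weight, labels
        for nxt, label, arc_w in arcs.get(state, []):
            # ⊗ = + : path weight is the sum of arc weights.
            heapq.heappush(heap, (weight + arc_w, nxt, labels + [label]))
    return None

# Hypothetical 3-state acceptor: "h" (0.5) beats "x" (2.0) out of state 0.
arcs = {0: [(1, "h", 0.5), (1, "x", 2.0)],
        1: [(2, "i", 0.3)]}
print(shortest_path(arcs, start=0, final=2))  # (0.8, ['h', 'i'])
```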
(listed in no particular order)
- kaldi [Github] [Doc]
- next-gen Kaldi [Github]
- k2: FSA/FST algorithms, differentiable, with PyTorch compatibility. [Github] [Doc]
- icefall: Speech recognition recipes using k2. [Github] [Doc]
- sherpa: Streaming and non-streaming ASR server for next-gen Kaldi. [Github] [Doc]
- sherpa-onnx: Real-time speech recognition using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, x86_64 servers, and WebSocket server/client, with APIs for C/C++, Python, Kotlin, C#, and Go. [Github] [Doc]
- sherpa-ncnn: Real-time speech recognition using next-gen Kaldi with ncnn, without an Internet connection. Supports iOS, Android, Raspberry Pi, VisionFive2, etc. [Github] [Doc]
- lhotse: Tools for handling speech data in machine learning projects. [Github] [Doc]
- snowfall (deprecated) [Github]
- FunASR [Github] [Doc]
- A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models.
- Gao, Z., Li, Z., Wang, J., Luo, H., Shi, X., Chen, M., ... & Zhang, S. (2023). FunASR: A Fundamental End-to-End Speech Recognition Toolkit. arXiv preprint arXiv:2305.11013.
- espnet/espnet2 [Github]
- Watanabe S, Hori T, Karita S, et al. Espnet: End-to-end speech processing toolkit[J]. arXiv preprint arXiv:1804.00015, 2018.
- wenet [Github]
- Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021.
- Zhang B, Wu D, Yao Z, et al. Unified streaming and non-streaming two-pass end-to-end model for speech recognition[J]. arXiv preprint arXiv:2012.05481, 2020.
- Wu D, Zhang B, Yang C, et al. U2++: Unified two-pass bidirectional end-to-end model for speech recognition[J]. arXiv preprint arXiv:2106.05642, 2021.
- NeMo [Github] [Doc]
- NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS).
- Fairseq [Github] [Doc]
- Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
- speechbrain [Github] [Doc]
- SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch.
- paddlespeech [Github] [Doc]
- PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
- eesen (R.I.P.) [Github]
- Miao Y, Gowayyed M, Metze F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding[C]//2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015: 167-174.
- warp_ctc [Github]
- A fast parallel implementation of CTC, on both CPU and GPU.
- htk: the Hidden Markov Model Toolkit (HTK) from Cambridge University, the classic GMM-HMM toolkit.
- sphinx: CMU Sphinx, Carnegie Mellon's open-source speech recognition toolkit.
- Prabhavalkar R, Horiguchi S, Jain A, et al. End-to-end speech recognition: A survey[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. [Paper]
- Li J, Deng L, Haeb-Umbach R, et al. Robust automatic speech recognition: A bridge to practical applications[M]. Academic Press, 2015.
- Chen S, Wang C, Chen Z, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022. [Paper]
- Hsu W N, Bolte B, Tsai Y H, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021. [Paper]
- Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//NeurIPS. 2020. [Paper]
- Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[C]//Interspeech. 2020. [Paper]
- Chan W, Jaitly N, Le Q V, et al. Listen, attend and spell: A neural network that learns to translate speech to text[C]//ICASSP. 2016. [Paper]
- Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin[C]//ICML. 2016. [Paper]
- Graves A. Sequence transduction with recurrent neural networks[J]. arXiv preprint arXiv:1211.3711, 2012. [Paper]
- CTC: Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]//ICML. 2006: 369-376. [Paper]
- Chain/LF-MMI: Povey D, Peddinti V, Galvez D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech. 2016: 2751-2755. [Paper]
- TDNN: Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts[C]//Interspeech. 2015: 3214-3218. [Paper]
- DNN-HMM: Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. [Paper]
- EM: Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models[J]. International Computer Science Institute, 1998, 4(510): 126. [Paper]
- HMM: Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. [Paper]
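The core of Rabiner's HMM tutorial is the Viterbi algorithm: find the most likely hidden-state sequence given observations, via dynamic programming in log space to avoid underflow. A self-contained sketch on a hypothetical two-state silence/speech HMM (all probabilities below are invented for illustration):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence,
    computed with log probabilities (Viterbi dynamic programming)."""
    # Initialization: log P(state) + log P(first observation | state).
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 1 - 1, -1):
        if t == 0:
            break
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy HMM: low-energy frames suggest silence, high-energy speech.
states = ("sil", "speech")
start_p = {"sil": 0.6, "speech": 0.4}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p))
# → ['sil', 'speech', 'speech']
```

In a real ASR decoder the states are context-dependent phone states and the emission scores come from a GMM or neural acoustic model, but the recursion is the same.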