Speech Recognition Learning Resources

A curated list of speech recognition learning resources, models, and toolkits (continuously updated).

Courses

  • (Recommended) Automatic Speech Recognition (ASR) 2018-2019 Lectures, School of Informatics, University of Edinburgh [Website]
  • Speech Recognition, EECS E6870, Spring 2016, Columbia University [Website]
  • CS224N: Natural Language Processing with Deep Learning, Stanford [Website] [Video (Winter 2021)] [Video (Winter 2017)]
  • CS224S: Spoken Language Processing (Winter 2021), Stanford [Website]
  • DLHLP: Deep Learning for Human Language Processing, Spring 2020, Hung-yi Lee [Website] [Video (Spring 2020)]
  • Microsoft DEV287x: Speech Recognition Systems, 2019 [Website]
  • Speech Recognition from Beginner to Expert (语音识别从入门到精通), 2019, Lei Xie (NOT FREE) [Website]
  • Introduction to Digital Speech Processing (數位語音處理概論), National Taiwan University, Lin-shan Lee [Website]

Books

  • Fundamentals of Speech Recognition, Lawrence Rabiner, Biing-Hwang Juang, 1993 [Book]
  • Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, 2001 [Book]
  • Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Daniel Jurafsky & James H. Martin [Website] [Book 3rd Ed]
  • Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014 [Book]
  • Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, 1999 [Website] [Book]
  • Analyzing Deep Learning: Practical Speech Recognition (《解析深度学习:语音识别实践》), Dong Yu and Li Deng, Publishing House of Electronics Industry
  • Kaldi Speech Recognition in Action (《Kaldi 语音识别实战》), Guoguo Chen, Publishing House of Electronics Industry
  • Speech Recognition: Principles and Applications (《语音识别:原理与应用》), Qingyang Hong, Publishing House of Electronics Industry
  • Fundamentals of Speech Recognition (《语音识别基本法》), Zhiyuan Tang, Publishing House of Electronics Industry
  • Statistical Learning Methods (《统计学习方法》), Hang Li, Tsinghua University Press
  • Speech Signal Processing (《语音信号处理》), Jiqing Han, Tsinghua University Press
  • Speech Signal Processing (《语音信号处理》), Li Zhao, China Machine Press

Models

Pre-trained / Foundation Models

  • FireRedASR2S (2026) - All-in-one ASR system integrating VAD, language identification, and punctuation. Xiaohongshu. [Paper] [Code]
  • Qwen3-ASR (2025) - Speech recognition model supporting 52 languages/dialects with streaming and non-streaming inference. Alibaba. [Paper] [Code]
  • Fun-ASR (2025) - End-to-end speech recognition large model by Tongyi Lab, providing Fun-ASR-Nano and Fun-ASR-MLT-Nano. Alibaba. [Paper] [Code]
  • Index-ASR (2025) - LLM-based ASR with contextual hotword injection and noise robustness. Bilibili. (Not open-sourced) [Paper]
  • FireRedASR (2025) - Open-source industrial-grade Mandarin speech recognition models. Xiaohongshu. [Paper] [Code]
  • Qwen2-Audio (2024) - Large audio-language model supporting ASR, translation, and audio understanding. Alibaba. [Paper] [Code]
  • SenseVoice (2024) - Multilingual speech recognition with emotion and audio event detection. Alibaba. [Paper] [Code]
  • Moonshine (2024) - Tiny yet powerful ASR models optimized for edge devices. Useful Sensors. [Paper] [Code]
  • USM (2023) - Universal Speech Model scaling ASR to 100+ languages with 12M hours of speech. Google. [Paper]
  • SeamlessM4T (2023) - Massively multilingual & multimodal machine translation supporting speech and text. Meta. [Paper] [Code]
  • Whisper (2022) - Large-scale weakly supervised speech recognition trained on 680k hours of web data. OpenAI. [Paper] [Code]
  • WavLM (2022) - Pre-trained model for full stack speech processing tasks. Microsoft. [Paper] [Code]
  • HuBERT (2021) - Self-supervised speech representation learning by masked prediction. Meta. [Paper] [Code]
  • wav2vec 2.0 (2020) - Self-supervised learning framework for speech representations. Meta. [Paper] [Code]
  • wav2vec (2019) - Unsupervised pre-training for speech recognition. Meta. [Paper] [Code]

End-to-End Models

  • Canary (2024) - Multilingual ASR and translation model with FastConformer encoder and Transformer decoder, supporting 25 languages. NVIDIA. [Code]
  • Paraformer-v2 (2024) - Improved non-autoregressive transformer for noise-robust speech recognition. Alibaba. [Paper] [Unofficial Code]
  • FastConformer / Parakeet (2023) - Optimized Conformer variant approximately 2.4x faster. Parakeet is a family of ASR models built on FastConformer with CTC/RNN-T/TDT decoders. NVIDIA. [Paper - FastConformer] [Paper - TDT] [Code]
  • Zipformer (2023) - Efficient Conformer variant with a reweighted attention mechanism for ASR. Next-gen Kaldi. [Paper] [Code]
  • Paraformer (2022) - Fast and accurate non-autoregressive end-to-end speech recognition with parallel Transformer. Alibaba. [Paper] [Code]
  • E-Branchformer (2022) - Enhanced Branchformer for speech recognition. CMU & JHU. [Paper] [Code]
  • Branchformer (2022) - Parallel branch architecture combining self-attention and convolution. NTT. [Paper] [Code]
  • Conformer (2020) - Convolution-augmented Transformer for speech recognition. Google. [Paper]
  • ContextNet (2020) - CNN-RNN-Transducer model with global context for streaming ASR. Google. [Paper]
  • Transformer-Transducer (2020) - Transformer-based model for streaming end-to-end ASR. Facebook. [Paper]
  • LAS / Listen, Attend and Spell (2015) - Attention-based sequence-to-sequence model for ASR. Google. [Paper]
  • Deep Speech 2 (2015) - End-to-end speech recognition in English and Mandarin. Baidu. [Paper] [Code]
  • Deep Speech (2014) - Scaling up end-to-end speech recognition. Baidu. [Paper]
  • RNN-Transducer (2012) - Sequence transduction with recurrent neural networks. Graves. [Paper]

Traditional Models

  • DNN-HMM (2012) - Deep neural networks for acoustic modeling in speech recognition. Hinton et al. [Paper]
  • GMM-HMM (2007) - The application of hidden Markov models in speech recognition. Gales & Young. [Paper]
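The GMM-HMM systems above score an observation sequence with the forward algorithm, which sums the probability over all hidden state paths. A toy sketch with discrete emissions and made-up numbers; real acoustic models replace the emission table with Gaussian-mixture densities over acoustic features.

```python
# Forward algorithm for a discrete-emission HMM: total likelihood of an
# observation sequence, summed over all state paths. GMM-HMM acoustic
# models use the same recursion with Gaussian-mixture emission densities.

def hmm_forward(obs, init, trans, emit):
    """obs: observation indices; init[i]: P(state i at t=0);
    trans[i][j]: P(j | i); emit[i][o]: P(o | state i)."""
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state model with 2 observation symbols (all numbers illustrative)
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(hmm_forward([0, 1, 0], init, trans, emit))  # about 0.10893
```

A quick sanity check: the likelihoods of all possible length-1 sequences sum to 1 (here 0.62 + 0.38), since the forward recursion marginalizes a proper distribution.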

Tutorials

  • WFST
    • An Introduction to Weighted Automata in Machine Learning, Awni Hannun, 2021. [PDF]
  • k2
    • Speech Recognition with Next-Generation Kaldi (K2, Lhotse, Icefall), Interspeech 2021. [Video]
    • Progress in ASR with Next-Gen Kaldi, BAAI 2022. [Video] [Slides]
    • Speech Recognition with Icefall + Lhotse, Interspeech 2023. [Slides] [Tutorial Notebook]
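The WFST material above rests on one core operation: searching a weighted graph in the tropical semiring, where weights (negative log-probabilities) add along a path and the minimum is taken across alternative paths. A minimal sketch using Dijkstra's algorithm over a hand-made toy graph; production decoders like those in Kaldi work over composed HCLG transducers rather than a raw adjacency map.

```python
import heapq

# Shortest distance through a small weighted graph in the tropical
# semiring: weights add along a path ("times"), and the minimum is taken
# across alternative paths ("plus") -- the core of WFST-based decoding.

def shortest_distance(arcs, start, final):
    """arcs: {state: [(next_state, weight), ...]}. Returns min path weight."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, s = heapq.heappop(heap)
        if s == final:
            return d
        if d > dist.get(s, float("inf")):
            continue  # stale heap entry, already relaxed via a cheaper path
        for t, w in arcs.get(s, []):
            nd = d + w  # tropical "times": accumulate weight along the path
            if nd < dist.get(t, float("inf")):  # tropical "plus": keep the min
                dist[t] = nd
                heapq.heappush(heap, (nd, t))
    return float("inf")

# Toy graph: path 0->1->2 costs 1.5 + 0.5, direct arc 0->2 costs 2.5
arcs = {0: [(1, 1.5), (2, 2.5)], 1: [(2, 0.5)]}
print(shortest_distance(arcs, 0, 2))  # -> 2.0
```

Dijkstra only applies because tropical weights are non-negative and the semiring is idempotent; general semirings need the more generic shortest-distance algorithms described in the WFST literature.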

Toolkits

Listed in no particular order.

Papers

Survey / Overview

  • Prabhavalkar R, Horiguchi S, Jain A, et al. End-to-end speech recognition: A survey[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. [Paper]
  • Li J, Deng L, Haeb-Umbach R, et al. Robust automatic speech recognition: A bridge to practical applications[M]. Academic Press, 2015.

Pre-training & Self-supervised Learning

  • Chen S, Wang C, Chen Z, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022. [Paper]
  • Hsu W N, Bolte B, Tsai Y H, et al. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021. [Paper]
  • Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]//NeurIPS. 2020. [Paper]

End-to-End ASR

  • Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[C]//Interspeech. 2020. [Paper]
  • Chan W, Jaitly N, Le Q V, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition[C]//ICASSP. 2016. [Paper]
  • Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin[C]//ICML. 2016. [Paper]
  • Graves A. Sequence transduction with recurrent neural networks[J]. arXiv preprint arXiv:1211.3711, 2012. [Paper]
  • CTC: Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376. [Paper]
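The CTC paper's central quantity is the probability of a label sequence summed over all frame-level alignments, computed by a forward (alpha) recursion over the label sequence with blanks inserted between and around labels. A small pure-Python sketch of that recursion, with invented two-frame probabilities; the blank is index 0, matching the convention in most toolkits.

```python
# CTC forward (alpha) recursion: total probability of a label sequence
# under per-frame output distributions, summed over all valid alignments.
# Follows the recursion in Graves et al. (2006); blank has index 0.

def ctc_forward(frame_probs, labels):
    """frame_probs[t][k]: P(symbol k at frame t); labels: targets, no blanks."""
    # Extended sequence: blank between and around labels, e.g. [a] -> [_, a, _]
    ext = [0]
    for l in labels:
        ext += [l, 0]
    S, T = len(ext), len(frame_probs)
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][ext[0]]          # start in leading blank...
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]      # ...or in the first label
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]                        # stay on the same symbol
            if s > 0:
                a += alpha[s - 1]               # advance by one position
            # Skip the intervening blank, unless the current symbol is a
            # blank or repeats the label two positions back.
            if s > 1 and ext[s] != 0 and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * frame_probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Two frames, alphabet {blank, a}; target "a" (index 1)
probs = [[0.4, 0.6], [0.3, 0.7]]
# Valid alignments: "aa", "_a", "a_" -> 0.42 + 0.28 + 0.18 = 0.88
print(ctc_forward(probs, [1]))  # -> 0.88
```

The negative log of this quantity is the CTC training loss; toolkits compute the same recursion in log space, batched over utterances.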

Traditional / Foundational

  • Chain/LF-MMI: Povey D, Peddinti V, Galvez D, et al. Purely sequence-trained neural networks for ASR based on lattice-free MMI[C]//Interspeech. 2016: 2751-2755. [Paper]
  • TDNN: Peddinti V, Povey D, Khudanpur S. A time delay neural network architecture for efficient modeling of long temporal contexts[C]//Interspeech. 2015: 3214-3218. [Paper]
  • DNN-HMM: Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. [Paper]
  • EM: Bilmes J A. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models[J]. International Computer Science Institute, 1998, 4(510): 126. [Paper]
  • HMM: Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. [Paper]
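Rabiner's tutorial above centers on three HMM problems; the decoding one is solved by the Viterbi algorithm, which replaces the forward recursion's sum with a max and keeps backpointers to recover the single most likely state path. A toy sketch with discrete emissions and illustrative numbers, not a production decoder.

```python
# Viterbi decoding for a discrete-emission HMM: the single most likely
# state sequence, as described in Rabiner's 1989 tutorial.

def viterbi(obs, init, trans, emit):
    """Returns the most probable state path for observation indices `obs`."""
    n = len(init)
    delta = [init[i] * emit[i][obs[0]] for i in range(n)]
    back = []  # backpointers: best predecessor of each state at each step
    for o in obs[1:]:
        ptr, new = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * trans[i][j])
            new.append(delta[best] * trans[best][j] * emit[j][o])
            ptr.append(best)
        back.append(ptr)
        delta = new
    # Trace backpointers from the best final state
    state = max(range(n), key=delta.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy 2-state model with 2 observation symbols (all numbers illustrative)
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 1, 1, 0], init, trans, emit))  # -> [0, 1, 1, 0]
```

Here the decoded path simply tracks the observations, because each state strongly prefers one symbol; with flatter emission probabilities the transition matrix would dominate and the path could differ from the frame-wise argmax.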
