A machine learning project that classifies New Zealand bird sounds using Vision Transformer (ViT) models fine-tuned with LoRA (Low-Rank Adaptation) on mel spectrogram representations of audio recordings.
ManuAI transforms bird audio recordings into mel spectrograms and uses computer vision techniques to classify New Zealand bird species. The project treats audio classification as an image classification problem: Google's Vision Transformer (ViT) is fine-tuned with LoRA for parameter-efficient training on the spectrogram images.
Key takeaway: visually classifying sound, i.e. "seeing" sounds as images.
- Audio-to-Image Conversion: Converts bird audio recordings to mel spectrograms for visual processing
- LoRA Fine-tuning: Parameter-efficient fine-tuning of pre-trained ViT models
- Class Imbalance Handling: Implements augmentation to handle imbalanced datasets
- Automated Data Pipeline: Complete pipeline from data download to model training
- Early Stopping: Prevents overfitting with configurable early stopping callbacks
Follow these steps to train and use ManuAI:
1. Download Data: run `download_data.py` to fetch New Zealand bird recordings from Xeno-canto and Kaggle.

   ```bash
   python download_data.py
   ```

2. Preprocess Data: run `preprocess_data.py` to segment and convert audio files into mel spectrograms.

   ```bash
   python preprocess_data.py
   ```

3. Fine-tune the Model: open and run all cells in `lora-finetune.ipynb` to fine-tune the Vision Transformer model using LoRA.

4. Run Inference: use `inference.py` to classify new bird audio samples.

   ```bash
   python inference.py
   ```
See each script/notebook for additional options and configuration details.
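The inference step above can be sketched with the Hugging Face Transformers API (the helper name and checkpoint path are hypothetical; `inference.py` holds the project's actual logic):

```python
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

def classify_spectrogram(model, processor, image):
    """Run one mel-spectrogram image through a fine-tuned ViT classifier."""
    inputs = processor(images=image.convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Map the highest-scoring class index back to its label string
    return model.config.id2label[logits.argmax(-1).item()]

# Hypothetical usage with a saved fine-tuned checkpoint directory:
# model = ViTForImageClassification.from_pretrained("./checkpoint")
# processor = ViTImageProcessor.from_pretrained("./checkpoint")
# print(classify_spectrogram(model, processor, Image.open("segment.png")))
```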
The project uses bird recordings from Xeno-canto, a citizen-science project focused on sharing bird sounds from around the world, as well as the 'New Zealand Bird Sound' Kaggle dataset.
- Download: Fetches New Zealand bird recordings via Xeno-canto API and Kaggle.
- Segmentation: Splits recordings into 4-second segments
- Quality Filtering: Removes low-quality or silent segments
- Spectrogram Conversion: Converts audio to mel spectrograms
- Augmentation: Applies data augmentation techniques
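A minimal sketch of the segmentation and quality-filtering stages (the -40 dB silence threshold and function name are illustrative assumptions; see `preprocess_data.py` for the actual parameters):

```python
import numpy as np

def segment_waveform(y, sr, seg_seconds=4, silence_db=-40.0):
    """Split a waveform into fixed-length segments, dropping near-silent ones."""
    seg_len = seg_seconds * sr
    segments = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # RMS level in dB relative to full scale; silent audio goes very negative
        rms = np.sqrt(np.mean(seg ** 2))
        level_db = 20 * np.log10(rms + 1e-10)
        if level_db > silence_db:
            segments.append(seg)
    return segments
```

Fixed 4-second windows keep every spectrogram the same width, so all training images share one shape.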
The model currently supports classification of 20 New Zealand bird species including:
- Tui (Prosthemadera novaeseelandiae)
- Bellbird (Anthornis melanura)
- Kaka (Nestor meridionalis)
- Robin (Petroica species)
- Morepork (Ninox novaeseelandiae)
- Fantail (Rhipidura fuliginosa)
- And many more...
- Google ViT-Base-Patch16-224: Pre-trained Vision Transformer
- Input Size: 224x224 RGB images (mel spectrograms)
- Patch Size: 16x16 pixels
- Class Weighting: Handles imbalanced datasets
- Early Stopping: Prevents overfitting
- Learning Rate Scheduling: Warmup and decay
- Mixed Precision: Optional FP16 training
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License.
- Xeno-canto for providing the bird sound database
- Hugging Face for the Transformers library and model hosting
- Google Research for the Vision Transformer architecture
- Microsoft for the LoRA (Low-Rank Adaptation) technique
If you use this project in your research, please cite:
```bibtex
@misc{manuai2025,
  title={ManuAI: New Zealand Bird Sound Classification using Vision Transformers},
  author={Harry Wills},
  year={2025},
  url={https://github.com/harrywillss/ManuAI}
}
```

Harry Wills - @harrywillss
Project Link: https://github.com/harrywillss/ManuAI
Made with ❤️ for New Zealand's native bird conservation