
FlexSED: Towards Open-Vocabulary Sound Event Detection

arXiv | Hugging Face Models | Hugging Face Space

FlexSED is an easy-to-use, open-vocabulary sound event detection (SED) system. It can be used for data annotation and labeling, and for building evaluation metrics for audio generation.

News

  • Oct 2025: 📦 Released code and pretrained checkpoint
  • Sep 2025: 🎉 FlexSED Spotlighted at WASPAA 2025

Installation

Clone the repository:

git clone https://github.com/JHU-LCAP/FlexSED.git 

Install the dependencies:

cd FlexSED
pip install -r requirements.txt

Usage

from api import FlexSED
import torch
import soundfile as sf

# load model
flexsed = FlexSED(device='cuda')

# run inference
events = ["Door", "Male Speech", "Laughter", "Dog"]
preds = flexsed.run_inference("example.wav", events)

# visualize predictions
flexsed.to_multi_plot(preds, events, fname="example")

# (Optional) visualize predictions as a video
# flexsed.to_multi_video(preds, events, audio_path="example.wav", fname="example")
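
If you also want event onset/offset timestamps rather than plots, a simple threshold-and-merge pass over the predictions works. The sketch below is a minimal example, not part of the FlexSED API: it assumes preds behaves like a (num_events, num_frames) array of per-frame probabilities on the CPU, and the frame rate and threshold shown are placeholder values to check against the FlexSED config.

import numpy as np

def probs_to_intervals(probs, frame_rate=25.0, threshold=0.5):
    # frame_rate (frames per second) and threshold are assumed values;
    # check the FlexSED config for the model's actual frame resolution.
    active = np.asarray(probs) >= threshold
    # Pad with zeros so every active run has a rising and a falling edge.
    padded = np.concatenate(([0], active.astype(int), [0]))
    edges = np.diff(padded)
    onsets = np.where(edges == 1)[0] / frame_rate
    offsets = np.where(edges == -1)[0] / frame_rate
    return list(zip(onsets, offsets))

# Print detected segments for each queried event.
for event, probs in zip(events, preds):
    for onset, offset in probs_to_intervals(probs):
        print(f"{event}: {onset:.2f}s - {offset:.2f}s")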

Training

  1. Download the AudioSet-Strong subset. The dataset is available from both WavCaps and HF-AS-Strong. Thanks to the contributors for providing these resources.

  2. Prepare metadata following the preprocessing steps. Feel free to check the processed metadata.

    (If you wish to create a validation split, remove a subset of samples from the training metadata and format them the same way as the test metadata. Recommended: ~2000 samples across ~50 sound classes; see the sketch after these steps.)

  3. Update file paths for both metadata and audio in src/configs.

  4. Extract CLAP embeddings:

    python src/prepare_clap.py
  5. Run training:

    python src/train.py
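
For the optional validation split in step 2, the sketch below shows one way to carve validation samples out of the training metadata. The file names and the "labels" field are assumptions about the metadata layout rather than the repository's actual schema; match them to the processed metadata you use, and remember to reformat the held-out records the same way as the test metadata.

import json
import random

# Hypothetical file name; point this at your training metadata.
with open("train_metadata.json") as f:
    records = json.load(f)

random.seed(0)
random.shuffle(records)

val, train, val_classes = [], [], set()
for rec in records:
    # Move clips into the validation set until it reaches ~2000 samples
    # covering ~50 sound classes; everything else stays in training.
    if len(val) < 2000 or len(val_classes) < 50:
        val.append(rec)
        val_classes.update(rec.get("labels", []))  # "labels" is an assumed field
    else:
        train.append(rec)

with open("val_metadata.json", "w") as f:
    json.dump(val, f, indent=2)
with open("train_metadata_reduced.json", "w") as f:
    json.dump(train, f, indent=2)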

Reference

If you find the code useful for your research, please consider citing:

@article{hai2025flexsed,
  title={FlexSED: Towards Open-Vocabulary Sound Event Detection},
  author={Hai, Jiarui and Wang, Helin and Guo, Weizhe and Elhilali, Mounya},
  journal={arXiv preprint arXiv:2509.18606},
  year={2025}
}
