This codebase builds a sound event detection (SED) demo. The SED models are taken directly from the large-scale audio tagging systems trained on AudioSet: https://github.com/qiuqiangkong/audioset_tagging_cnn. If this codebase is helpful, please cite [1].
Demo video: https://www.youtube.com/watch?v=QyFNIhRxFrY
Clone the codebase, download the pretrained checkpoint, and install the dependencies:

git clone https://github.com/qiuqiangkong/audioset_tagging_cnn
wget -O "Cnn14_DecisionLevelMax_mAP=0.385.pth" "https://zenodo.org/record/3576403/files/Cnn14_DecisionLevelMax_mAP=0.385.pth?download=1"
pip install -r requirements.txt
Then run the SED demo on the provided example:

MODEL_TYPE="Cnn14_DecisionLevelMax"
CHECKPOINT_PATH="Cnn14_DecisionLevelMax_mAP=0.385.pth"
CUDA_VISIBLE_DEVICES=0 python3 sound_event_detection.py --model_type=$MODEL_TYPE --checkpoint_path=$CHECKPOINT_PATH --audio_path="examples/R9_ZSCveAHg_7s.wav" --video_path="examples/R9_ZSCveAHg_7s.mp4" --out_video_path="results/silent_with_text_R9_ZSCveAHg_7s.mp4" --cuda
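Under the hood, sound_event_detection.py loads the checkpoint into the Cnn14_DecisionLevelMax model, runs the 32 kHz waveform through it, and uses the resulting frame-wise class probabilities to annotate the video. Below is a minimal sketch of that inference step; it assumes the model class, the checkpoint's "model" key, and the config.labels list from the audioset_tagging_cnn repository, and is not the script itself.

```python
# Minimal SED inference sketch (not the actual sound_event_detection.py script).
import librosa
import numpy as np
import torch

from models import Cnn14_DecisionLevelMax   # model definitions from audioset_tagging_cnn
import config                                # assumed to expose the 527 AudioSet label names as config.labels

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyper-parameters of the released AudioSet checkpoints.
model = Cnn14_DecisionLevelMax(sample_rate=32000, window_size=1024, hop_size=320,
                               mel_bins=64, fmin=50, fmax=14000, classes_num=527)
checkpoint = torch.load("Cnn14_DecisionLevelMax_mAP=0.385.pth", map_location=device)
model.load_state_dict(checkpoint["model"])
model.to(device).eval()

# Load the example audio at the 32 kHz rate the model was trained on.
(waveform, _) = librosa.load("examples/R9_ZSCveAHg_7s.wav", sr=32000, mono=True)
waveform = torch.Tensor(waveform[None, :]).to(device)   # shape: (batch, samples)

with torch.no_grad():
    # (frames, 527) frame-wise class probabilities.
    framewise_output = model(waveform)["framewise_output"][0].cpu().numpy()

# Print the strongest class once per second (frames are ~10 ms apart: 320-sample hop at 32 kHz).
for second, class_idx in enumerate(np.argmax(framewise_output, axis=-1)[::100]):
    print(second, config.labels[class_idx])
```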
The output video shows the predicted sound event labels overlaid as text on the (silent) video frames.
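Such a text overlay can be produced along the following lines with OpenCV: map each video frame to the corresponding SED frame via the two frame rates, then draw the top-scoring labels onto the frame. This is only an illustrative sketch, not the repository's actual rendering code; the 100 frames-per-second SED rate, the output path, and the use of OpenCV are assumptions of this sketch.

```python
# Illustrative overlay sketch with OpenCV (not the repo's implementation).
# `framewise_output` (frames, 527) and `config.labels` come from the snippet above.
import cv2
import numpy as np

sed_fps = 100          # assumed SED frame rate: 32000 Hz / 320-sample hop
cap = cv2.VideoCapture("examples/R9_ZSCveAHg_7s.mp4")
video_fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("results/overlay_sketch.mp4",              # arbitrary output path
                         cv2.VideoWriter_fourcc(*"mp4v"), video_fps, (width, height))

frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Map this video frame to the closest SED frame.
    sed_idx = min(int(frame_idx / video_fps * sed_fps), len(framewise_output) - 1)
    # Draw the two highest-probability classes at this time.
    for k, class_idx in enumerate(np.argsort(framewise_output[sed_idx])[::-1][:2]):
        text = "{}: {:.2f}".format(config.labels[class_idx], framewise_output[sed_idx, class_idx])
        cv2.putText(frame, text, (10, 30 + 30 * k), cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (255, 255, 255), 2)
    writer.write(frame)
    frame_idx += 1

cap.release()
writer.release()
```

Writing frames with cv2.VideoWriter drops the audio track, which is why the result is a "silent" video with text.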
[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, Mark D. Plumbley. "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." arXiv preprint arXiv:1912.10211 (2019).