Author: Yiqiang Cai ([email protected]), Xi'an Jiaotong-Liverpool University
Task 1 of the DCASE Challenge focuses on Acoustic Scene Classification (ASC): recognizing the environment in which an audio recording was captured, such as streets, parks, or airports. For a detailed description of the challenge and this task, please visit the DCASE website. The main challenges of this task are summarized below:
- Domain Shift: The test set contains recordings from devices unseen during training. (2020~)
- Short Duration: The duration of audio recordings was reduced from 10s (~2021) to 1s (2022~).
- Low Complexity: Limited model parameters (128K INT8) and computational overhead (30 MMACs). (2022~)
- Data Efficiency: Train models with less data, using subsets of 5%, 10%, 25%, 50% and 100% of the training set. (2024~)
This repository provides an easy way to train your models on the DCASE Task 1 datasets. The example system (TF-SepNet + BEATs teacher) won the Judges' Award for DCASE2024 Challenge Task 1. The corresponding paper has been accepted by the DCASE 2024 Workshop and is available here.
- All configurations of model, dataset and training can be done via a simple YAML file (see the illustrative sketch after this feature list).
- Entire system is implemented using PyTorch Lightning.
- Logging is implemented using TensorBoard. (Wandb API is also supported.)
- Various task-related techniques have been included.
- 3 Spectrogram Extractors: Cnn3Mel, CpMel, BEATsMel.
- 3 High-performing Backbones: BEATs, TF-SepNet, BC-ResNet.
- 4 Plug-and-play Data Augmentation Techniques: MixUp, FreqMixStyle, SpecAugmentation, Device Impulse Response Augmentation.
- 2 Model Compression Methods: Post-training Quantization, Knowledge Distillation.
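For orientation, a training config roughly follows the structure sketched below. This is illustrative only: the exact keys, class paths and values come from the shipped files in config/ (e.g. config/tfsepnet_train.yaml) and may differ.
# Illustrative structure only; consult config/tfsepnet_train.yaml for the real keys.
trainer:
  max_epochs: 100               # overridable on the CLI via --trainer.max_epochs
optimizer:
  lr: 0.01                      # overridable via --optimizer.lr
model:
  class_path: model.lit_asc.LitAcousticSceneClassificationSystem
  init_args:
    backbone: TFSepNet          # assumed shorthand; real configs may nest class_path/init_args
data:
  class_path: DCASEDataModule   # assumed class path
  init_args:
    audio_dir: ../TAU-urban-acoustic-scenes-2022-mobile-development/development
    train_subset: split100      # data-efficiency splits: 5%, 10%, 25%, 50%, 100%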
- Clone this repository.
- Create and activate a conda environment:
conda create -n dcase_t1
conda activate dcase_t1
- Install PyTorch version that suits your system. For example:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# or for cuda >= 12.1
pip install torch torchvision torchaudio
- Install requirements:
pip install -r requirements.txt
- Download and extract the TAU Urban Acoustic Scenes 2020 Mobile, Development dataset, the TAU Urban Acoustic Scenes 2022 Mobile, Development dataset and the Microphone Impulse Response dataset according to your needs. The dataset directories should be placed in the parent directory of the code directory.
You should end up with a directory that contains, among other files, the following:
- ../TAU-urban-acoustic-scenes-2020-mobile-development/development/audio/: A directory containing 23,035 audio files in wav format.
- ../TAU-urban-acoustic-scenes-2022-mobile-development/development/audio/: A directory containing 230,350 audio files in wav format.
- ../microphone_impulse_response/: A directory containing 67 impulse response files in wav format.
- Several default configuration yaml files are provided in config/. The training procedure can be started by running the following command:
python main.py fit --config config/tfsepnet_train.yaml
You can select or modify the config files under code-dcase-2024/config/..., or directly override individual arguments on the command line:
python main.py fit --config config/tfsepnet_train.yaml --trainer.max_epochs 30
python main.py fit --config config/tfsepnet_train.yaml --optimizer.lr 0.006
python main.py fit --config config/tfsepnet_train.yaml --data.audio_dir ../TAU-urban-acoustic-scenes-2020-mobile-development/development
python main.py fit --config config/tfsepnet_train.yaml --data.train_subset split100
- Test model:
python main.py test --config config/tfsepnet_test.yaml --ckpt_path path/to/ckpt
- View results:
tensorboard --logdir log/tfsepnet_train # Check training results
tensorboard --logdir log/tfsepnet_test # Check testing results
The results will then be available at localhost:6006.
- Quantize model:
python main.py validate --config config/tfsepnet_quant.yaml --ckpt_path path/to/ckpt
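For context, post-training quantization in PyTorch generally follows a prepare/calibrate/convert workflow. The snippet below is a generic sketch using torch.ao.quantization, not the exact procedure implemented in config/tfsepnet_quant.yaml; model and calibration_loader are placeholders for your trained FP32 backbone and a small calibration set.
import torch
import torch.ao.quantization as tq

model.eval()                                        # placeholder: trained FP32 model
model.qconfig = tq.get_default_qconfig("fbgemm")    # default INT8 config for x86 backends
prepared = tq.prepare(model)                        # insert observers into the module tree
with torch.no_grad():
    for x, _ in calibration_loader:                 # placeholder: a few validation batches
        prepared(x)                                 # collect activation statistics
quantized = tq.convert(prepared)                    # swap modules for INT8 implementations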
The available arguments and documentation can be displayed from the command line.
- To see the available commands type:
python main.py --help
- View all available options by passing --help after a subcommand:
python main.py fit --help
- View the documentation and arguments of a specific class:
python main.py fit --model.help LitAcousticSceneClassificationSystem
python main.py fit --data.help DCASEDataModule
- View the documentation of a specific argument:
python main.py fit --model.help LitAcousticSceneClassificationSystem --model.init_args.backbone.help TFSepNet
For convenience, please download the pre-trained BEATs checkpoints into model/beats/checkpoints/.
- Freeze the encoder and fine-tune only the classifier of the self-supervised pre-trained BEATs, BEATs (SSL)*:
python main.py fit --config config/beats_ssl_star.yaml
- Fine-tune the self-supervised pre-trained BEATs with the encoder unfrozen, BEATs (SSL):
python main.py fit --config config/beats_ssl.yaml
- Fine-tune the BEATs model that was self-supervised pre-trained and then supervised fine-tuned on AudioSet, with the encoder unfrozen, BEATs (SSL+SL):
python main.py fit --config config/beats_ssl+sl.yaml
- Test model:
python main.py test --config config/beats_test.yaml
- Get predictions from fine-tuned BEATs:
python main.py predict --config config/beats_predict.yaml
Before knowledge distillation, make sure that the logits of the teacher model have been generated and placed in your preferred directory. Alternatively, we also provide the logits of fine-tuned BEATs for easier implementation; please download and extract them into log/ before use. Specify the paths of the logits files in config/tfsepnet_kd.yaml. If more than one logits file is given, the logits are averaged to form a teacher ensemble:
logits_files:
- log/beats_ssl_star/predictions_split*.pt
- log/beats_ssl/predictions_split*.pt
- log/beats_ssl+sl/predictions_split*.pt
Distilling knowledge from fine-tuned BEATs to TF-SepNet:
python main.py fit --config config/tfsepnet_kd.yaml
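For reference, averaging several teacher logits files and distilling with a temperature-scaled KL term typically looks like the sketch below. The file pattern, temperature and loss weighting are illustrative and not taken from config/tfsepnet_kd.yaml; it is also assumed that each predictions_*.pt file stores a plain tensor of logits.
import glob
import torch
import torch.nn.functional as F

# Build the teacher ensemble by averaging logits from several files.
paths = glob.glob("log/beats_ssl/predictions_split*.pt")          # illustrative pattern
teacher_logits = torch.stack([torch.load(p) for p in paths]).mean(dim=0)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: temperature-scaled KL divergence against the teacher ensemble.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard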
Deploy your model in model/backbones/ and inherit from _BaseBackbone:
class YourModel(_BaseBackbone):
...
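As a rough sketch only (assuming _BaseBackbone behaves like a standard nn.Module and that the constructor arguments and import path below match what the repo expects, which may not be the case):
import torch.nn as nn
from model.backbones import _BaseBackbone      # assumed import path

class YourModel(_BaseBackbone):
    def __init__(self, in_channels: int = 1, num_classes: int = 10):
        super().__init__()
        # Minimal conv stack over (batch, channel, freq, time) spectrogram input.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))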
Implement a new spectrogram extractor in util/spec_extractor/ and inherit from _SpecExtractor:
class NewExtractor(_SpecExtractor):
...
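A minimal illustration (assuming _SpecExtractor is an nn.Module-style wrapper mapping a batch of waveforms to spectrograms; the torchaudio transforms and their parameters are placeholders, not the settings used by Cnn3Mel/CpMel/BEATsMel):
import torchaudio
from util.spec_extractor import _SpecExtractor   # assumed import path

class NewExtractor(_SpecExtractor):
    def __init__(self, sample_rate: int = 32000, n_mels: int = 256):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()

    def forward(self, waveform):
        # (batch, samples) -> (batch, n_mels, frames) log-mel spectrogram
        return self.to_db(self.mel(waveform))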
Declare a new data augmentation method in util/data_augmentation/ and inherit from _DataAugmentation:
(Changes also need to be made in model/lit_asc/LitAcousticSceneClassificationSystem)
class NewAugmentation(_DataAugmentation):
...
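As an illustration only (assuming _DataAugmentation is applied as an nn.Module to batched spectrograms of shape (batch, channel, freq, time); the masking parameters are placeholders), a simple SpecAugment-style time mask could look like:
import torch
from util.data_augmentation import _DataAugmentation   # assumed import path

class NewAugmentation(_DataAugmentation):
    """Zero out a random contiguous block of time frames (SpecAugment-style)."""

    def __init__(self, max_mask_frames: int = 10, p: float = 0.5):
        super().__init__()
        self.max_mask_frames = max_mask_frames
        self.p = p

    def forward(self, x):
        if not self.training or torch.rand(1).item() > self.p:
            return x
        width = int(torch.randint(1, self.max_mask_frames + 1, (1,)))
        start = int(torch.randint(0, max(x.size(-1) - width, 1), (1,)))
        x = x.clone()
        x[..., start:start + width] = 0.0
        return x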
More instructions can be found in the LightningCLI documentation.
If you find our code helpful, we would appreciate it if you cite the following:
@inproceedings{Cai2024workshop,
author = "Cai, Yiqiang and Li, Shengchen and Shao, Xi",
title = "Leveraging Self-Supervised Audio Representations for Data-Efficient Acoustic Scene Classification",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
month = "October",
year = "2024",
pages = "21--25",
}