An Improved Event-Independent Network (EIN) for Polyphonic Sound Event Localization and Detection (SELD)
from the Centre for Vision, Speech and Signal Processing, University of Surrey.
- Introduction
- Requirements
- Download Dataset
- Preprocessing
- QuickEvaluate
- Usage
- Results
- FAQs
- Citing
- Reference
This is a PyTorch implementation of Event-Independent Networks for Polyphonic SELD.
Event-Independent Networks for Polyphonic SELD uses a trackwise output format and multi-task learning (MTL) with a soft parameter-sharing scheme. For more information, please read the papers here.
The features of this method are:
- It uses a trackwise output format to detect different sound events of the same type but with different DoAs.
- It uses permutation-invariant training (PIT) to solve the track permutation problem introduced by the trackwise output format (a loss sketch is given after this list).
- It uses multi-head self-attention (MHSA) to separate tracks.
- It uses multi-task learning (MTL) of a soft parameter-sharing scheme for joint-SELD.
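To make the trackwise PIT idea concrete, here is a minimal chunk-level PIT loss sketch for a two-track output, written against hypothetical tensor shapes and a simplified SED + DoA loss (it is not the exact loss implemented in this repository): the loss is evaluated under both possible track permutations of the targets, and the smaller value is kept for each chunk.

```python
import torch
import torch.nn.functional as F

def pit_loss_two_tracks(sed_pred, doa_pred, sed_true, doa_true):
    """Minimal chunk-level PIT loss sketch for two tracks (hypothetical shapes).

    sed_pred, sed_true: (batch, time, 2, classes), SED probabilities / targets
    doa_pred, doa_true: (batch, time, 2, 3), Cartesian DoA vectors
    """
    def chunk_loss(perm):
        # SED + DoA loss per chunk, with the target tracks re-ordered by `perm`.
        sed = F.binary_cross_entropy(
            sed_pred, sed_true[:, :, perm], reduction='none').mean(dim=(1, 2, 3))
        doa = F.mse_loss(
            doa_pred, doa_true[:, :, perm], reduction='none').mean(dim=(1, 2, 3))
        return sed + doa  # shape: (batch,)

    # Evaluate both track permutations and keep the better one per chunk.
    loss = torch.minimum(chunk_loss([0, 1]), chunk_loss([1, 0]))
    return loss.mean()
```

With more than two tracks, the same idea applies but the minimum is taken over all track permutations; see the papers for the exact loss and the soft parameter-sharing MTL setup.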
Currently, the code is available for the TAU-NIGENS Spatial Sound Events 2020 dataset. Data augmentation methods are not included.
We provide two ways to set up the environment. Both are based on Anaconda.
- Use the provided `prepare_env.sh`. Note that you need to set the `anaconda_dir` in `prepare_env.sh` to your Anaconda directory, then directly run `bash scripts/prepare_env.sh`.
- Use the provided `environment.yml`. Note that you also need to set the `prefix` to your target env directory, then directly run `conda env create -f environment.yml`.
After setting up your environment, don't forget to activate it: `conda activate ein`
Downloading the dataset is easy. Directly run `bash scripts/download_dataset.sh`
The data and meta files need to be preprocessed: `.wav` files will be saved as `.h5` files, and meta files will also be converted to `.h5` files. After downloading the data, directly run `bash scripts/preproc.sh`
Preprocessing of the meta files (labels) separates the labels into different tracks, each containing at most one event and its corresponding DoA. The same event is consistently placed in the same track. For frame-level permutation-invariant training, this may not be necessary, but for chunk-level PIT or no PIT, consistently arranging the same event in the same track is reasonable; a sketch of such an assignment is shown below.
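As an illustration of this label-to-track assignment, the sketch below greedily places each event on the first track that is free at its onset, so that an event occupies a single track for its whole duration. The event dictionary fields used here are hypothetical and do not mirror the repository's actual meta-file format.

```python
def assign_events_to_tracks(events, num_tracks=2):
    """Assign each event to one track for its entire duration.

    `events` is a hypothetical list of dicts with 'onset' and 'offset' frame
    indices (plus class and DoA fields); the real meta-file format differs.
    Returns {track_id: list of events}.
    """
    tracks = {t: [] for t in range(num_tracks)}
    for event in sorted(events, key=lambda e: e['onset']):
        for t in range(num_tracks):
            # A track is free if the new event starts after its last event ends.
            if not tracks[t] or tracks[t][-1]['offset'] <= event['onset']:
                tracks[t].append(event)
                break
        else:
            raise ValueError('More overlapping events than available tracks')
    return tracks
```

Because each event stays on one track, chunk-level PIT only has to consider permutations of whole tracks rather than resolving a new permutation at every frame.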
We uploaded the pre-trained model here. Download it and unzip it in the code folder (the `EIN-SELD` folder) using `wget 'https://zenodo.org/record/4158864/files/out_train.zip' && unzip out_train.zip`
Then directly run `bash scripts/predict.sh && sh scripts/evaluate.sh`
Hyper-parameters are stored in `./configs/ein_seld/seld.yaml`. You can change some of them, such as `train_chunklen_sec`, `train_hoplen_sec`, `test_chunklen_sec`, `test_hoplen_sec`, `batch_size`, `lr`, and others.
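For reference, a minimal sketch of inspecting these hyper-parameters with PyYAML is shown below; the key nesting in the comments is illustrative only, so check `seld.yaml` itself for the actual structure.

```python
import yaml  # PyYAML

# Load the experiment configuration.
with open('./configs/ein_seld/seld.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg)  # inspect the available sections and hyper-parameters

# Hypothetical accesses -- the real nesting in seld.yaml may differ:
# batch_size = cfg['training']['batch_size']
# chunk_len = cfg['data']['train_chunklen_sec']
```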
To train a model yourself, set up `./configs/ein_seld/seld.yaml` and directly run `bash scripts/train.sh`
`train_fold` and `valid_fold` in `./configs/ein_seld/seld.yaml` specify which folds to use for training and validation. Note that `valid_fold` can be `None`, which means no validation is needed; this is usually used when training on folds 1-6.
`overlap` can be `1`, `2`, or the combined `1&2`, which means training on non-overlapping sound events only, on overlapping sound events only, or on both.
`--seed` is set to a random integer by default. You can set it to a fixed number, but results will not be exactly the same if an RNN or Transformer is used.
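For reference, fixing seeds in a PyTorch project typically looks like the sketch below; this is generic seeding code, not necessarily what `train.sh` does internally, and some cuDNN kernels used by RNNs remain non-deterministic even with fixed seeds.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed Python, NumPy and PyTorch (CPU and all visible GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels; some ops may still be non-deterministic.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```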
Depending on your resources, you can consider adding the `--read_into_mem` argument in `train.sh` to pre-load all of the data into memory and increase the training speed.
`--num_workers` also affects the training speed; adjust it according to your resources.
Prediction generates results and saves them to the `./out_infer` folder. The saved results are the submission results for the DCASE challenge. Directly run `bash scripts/predict.sh`
Prediction runs on the `testset_type` set, which can be `dev` or `eval`. If it is `dev`, `test_fold` cannot be `None`.
Evaluation evaluates the generated submission result. Directly run `bash scripts/evaluate.sh`
Note that EINV2-DA is a single model with a plain VGGish architecture, using only the channel-rotation and SpecAugment data-augmentation methods.
- If you have any questions, please email [email protected] or report an issue here.
- Currently, `pin_memory` can only be set to `True`. For more information, please check the PyTorch documentation and the Nvidia Developer Blog.
- After downloading, you can delete the `downloaded_packages` folder to save some space.
If you use the code, please consider citing the papers below:
@article{cao2020anevent,
title={An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection},
author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and An, Fengyan and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2010.13092},
year={2020}
}
@article{cao2020event,
title={Event-Independent Network for Polyphonic Sound Event Localization and Detection},
author={Cao, Yin and Iqbal, Turab and Kong, Qiuqiang and Zhong, Yue and Wang, Wenwu and Plumbley, Mark D},
journal={arXiv preprint arXiv:2010.00140},
year={2020}
}
- Archontis Politis, Sharath Adavanne, and Tuomas Virtanen. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020. URL
- Annamaria Mesaros, Sharath Adavanne, Archontis Politis, Toni Heittola, and Tuomas Virtanen. Joint measurement of localization and detection of sound events. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz, NY, Oct 2019. URL
- Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, March 2019. URL