Official PyTorch implementation of the following papers:
Sound Source Localization is All About Cross-Modal Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
ICCV 2023
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)
arXiv 2024
The IS3 dataset is available here, or you can simply run `download_is3.sh`.
The IS3 data is organized as follows:
Note that in the IS3 dataset, each annotation is saved as a separate file. For example, the sample image `accordion_baby_10467` contains two annotations, one for the accordion and one for the baby; these are saved as `accordion_baby_10467_accordion` and `accordion_baby_10467_baby` for straightforward use. You can always project the bounding boxes or segmentation maps onto the original image to see them all at once, as in the sketch below.
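The overlay takes only a few lines with PIL. This is a minimal sketch, not part of the released code: the JSON field names (`bbox`, `category`) and the `[x, y, w, h]` box format are assumptions, so adjust them to the actual `IS3_annotation.json` schema.

```python
# Minimal sketch: draw every annotation of one sample onto its image.
# Assumptions: IS3_annotation.json maps annotation names to entries with
# "bbox" ([x, y, w, h]) and "category" fields -- adjust to the real schema.
import json
from PIL import Image, ImageDraw

with open("IS3_annotation.json") as f:
    annotations = json.load(f)

sample = "accordion_baby_10467"
image = Image.open(f"images/{sample}.jpg").convert("RGB")
draw = ImageDraw.Draw(image)

for name, ann in annotations.items():
    # e.g. accordion_baby_10467_accordion and accordion_baby_10467_baby
    if name.startswith(sample):
        x, y, w, h = ann["bbox"]
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        draw.text((x, y), ann["category"], fill="red")

image.save(f"{sample}_annotated.png")
```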
The `images` and `audio_waw` folders contain all the image and audio files, respectively. The `IS3_annotation.json` file contains the ground-truth bounding box and category information for each annotation. The `gt_segmentation` folder contains segmentation maps in binary image format, one per annotation; you can query a file name in `IS3_annotation.json` to get the semantic category of its segmentation map, as in the sketch below.
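As a hedged example of that lookup (again assuming the entries expose a `category` field and that the masks are PNGs named after the annotation):

```python
# Minimal sketch: load one binary segmentation map and fetch its semantic
# category from IS3_annotation.json. The "category" field and the .png
# extension are assumptions -- check the files shipped with the dataset.
import json
import numpy as np
from PIL import Image

with open("IS3_annotation.json") as f:
    annotations = json.load(f)

name = "accordion_baby_10467_accordion"
mask = np.array(Image.open(f"gt_segmentation/{name}.png")) > 0  # binary mask

print(annotations[name]["category"], f"covers {mask.mean():.1%} of the frame")
```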
The model checkpoints are available for the following experiments:
| Training Set | Test Set | Model Type | Performance (cIoU) | Checkpoint |
|---|---|---|---|---|
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. | 39.94 | Link |
| VGGSound-144K | VGG-SS | NN w/ Self-Sup. Pre. Enc. | 39.16 | Link |
| VGGSound-144K | VGG-SS | NN w/ Sup. Pre. Enc. Pre-trained Vision | 41.42 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. | 85.20 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Self-Sup. Pre. Enc. | 84.80 | Link |
| Flickr-SoundNet-144K | Flickr-SoundNet | NN w/ Sup. Pre. Enc. Pre-trained Vision | 86.00 | Link |
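To sanity-check a downloaded checkpoint before wiring it into the evaluation scripts, something like the following works. It is a sketch only: whether the file stores a raw `state_dict` or wraps it under a `"model"` key is an assumption.

```python
# Minimal sketch: peek inside a downloaded checkpoint. The "model" key is an
# assumption; some checkpoints store the state_dict at the top level.
import torch

ckpt = torch.load("checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for key in list(state_dict)[:5]:          # print the first few layers
    print(key, tuple(state_dict[key].shape))
```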
We provide a zip file that contains model checkpoints and a few data samples from VGGSound:

https://mm.kaist.ac.kr/share/kccv_tutorial.zip
Download the dataset and set up the environment as described below.
```sh
sh environment.sh
sh download_is3.sh
```
Now enjoy `Sound Localization Demo.ipynb`!
If you find this code useful, please consider giving a star ⭐ and citing us:
```bibtex
@inproceedings{senocak2023sound,
  title={Sound source localization is all about cross-modal alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7777--7787},
  year={2023}
}
```
If you use this dataset, please consider giving a star ⭐ and citing us:
```bibtex
@article{senocak2024align,
  title={Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  journal={arXiv preprint arXiv:2407.13676},
  year={2024}
}
```