Skip to content

kaistmm/ssl_kccv

Repository files navigation

Sound Source Localization is All About Alignment (ICCV’23)

Official PyTorch implementation of our following papers:

Sound Source Localization is All About Cross-Modal Alignment

Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)

ICCV 2023

Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

Arda Senocak*, Hyeonggon Ryu*, Junsik Kim*, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung (* Equal Contribution)

arXiV 2024

Index

Overview

Pipeline

Interactive Synthetic Sound Source (IS3) Dataset

is3

IS3 dataset is available here or you can simply run download_is3.sh.

The IS3 data is organized as follows:

Note that in IS3 dataset, each annotation is saved as a separate file. For example; the sample accordion_baby_10467 image contains two annotations for accordion and baby objects. These annotations are saved as accordion_baby_10467_accordion and accordion_baby_10467_baby for straightforward use. You can always project bounding boxes or segmentation maps onto the original image to see them all at once.

images and audio_waw folders contain all the image and audio files respectively.

IS3_annotation.json file contains ground truth bounding box and category information of each annotation.

gt_segmentation folder contains segmentation maps in binary image format for each annotation. You can query the file name in IS3_annotation.json to get semantic category of each segmentation map.

Model Checkpoints

The model checkpoints are available for the following experiments:

Training Set Test Set Model Type Performance (cIoU) Checkpoint
VGGSound-144K VGG-SS NN w/ Sup. Pre. Enc. 39.94 Link
VGGSound-144K VGG-SS NN w/ Self-Sup. Pre. Enc. 39.16 Link
VGGSound-144K VGG-SS NN w/ Sup. Pre. Enc. Pre-trained Vision 41.42 Link
Flickr-SoundNet-144K Flickr-SoundNet NN w/ Sup. Pre. Enc. 85.20 Link
Flickr-SoundNet-144K Flickr-SoundNet NN w/ Self-Sup. Pre. Enc. 84.80 Link
Flickr-SoundNet-144K Flickr-SoundNet NN w/ Sup. Pre. Enc. Pre-trained Vision 86.00 Link

Demo

We provide a zip file that contains model checkpoints and a few data samples from VGGSound.

https://mm.kaist.ac.kr/share/kccv_tutorial.zip

Download the dataset and set up the environment as described below.

sh environment.sh
sh download_is3.sh

Now enjoy the Sound Localization Demo.ipynb!

Citation

If you find this code useful, please consider giving a star ⭐ and citing us:

@inproceedings{senocak2023sound,
  title={Sound source localization is all about cross-modal alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7777--7787},
  year={2023}
}

If you use this dataset, please consider giving a star ⭐ and citing us:

@article{senocak2024align,
  title={Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment},
  author={Senocak, Arda and Ryu, Hyeonggon and Kim, Junsik and Oh, Tae-Hyun and Pfister, Hanspeter and Chung, Joon Son},
  journal={arXiv preprint arXiv:2407.13676},
  year={2024}
}

ssl_kccv