This repository contains the PyTorch implementation of our NeurIPS 2022 paper and the associated datasets:
Few-Shot Audio-Visual Learning of Environment Acoustics
Sagnik Majumder, Changan Chen*, Ziad Al-Halah*, Kristen Grauman
The University of Texas at Austin, Facebook AI Research
*Equal contribution
Project website: https://vision.cs.utexas.edu/projects/fs_rir/
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and---in a major departure from traditional methods---generalizing to novel environments in a few-shot manner.
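At a high level, the model builds an acoustic context from a few audio-visual observations with self-attention and answers arbitrary source-receiver pose queries with cross-attention. The snippet below is only a minimal sketch of that idea in PyTorch; the module layout, feature dimensions, and RIR output parameterization are assumptions for illustration, not the released model.

```python
import torch
import torch.nn as nn

class FewShotRIRSketch(nn.Module):
    """Illustrative sketch: self-attention over a few audio-visual context
    observations, cross-attention from query source-receiver poses to that
    context, and a head that regresses an RIR representation. All sizes and
    the output parameterization are placeholders, not the paper's exact model."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6, rir_dim=16000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads)
        self.context_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads)
        self.query_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.rir_head = nn.Linear(d_model, rir_dim)  # placeholder flattened RIR output

    def forward(self, context_feats, query_pose_feats):
        # context_feats: (N_context, B, d_model) fused RGB/depth/echo/pose embeddings
        # query_pose_feats: (N_query, B, d_model) embedded query source-receiver poses
        memory = self.context_encoder(context_feats)             # self-attention builds the acoustic context
        decoded = self.query_decoder(query_pose_feats, memory)   # cross-attention to that context
        return self.rir_head(decoded)                            # (N_query, B, rir_dim)
```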
This code has been tested with python 3.6.13, habitat-api 0.1.4, habitat-sim 0.1.4, and torch 1.4.0. Additional python package requirements are available in requirements.txt.
First, install the required versions of habitat-api, habitat-sim and torch inside a conda environment.
Next, install the remaining dependencies, either with pip3 install -r requirements.txt or by parsing requirements.txt to get the names and versions of the individual dependencies and installing them one by one.
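If you go the one-by-one route, a small helper like the sketch below can parse requirements.txt and install each pinned package in turn; this is purely a convenience, and pip3 install -r requirements.txt remains the simpler path.

```python
# Install each pinned dependency from requirements.txt one at a time,
# skipping blank lines and comments.
import subprocess
import sys

with open("requirements.txt") as f:
    for line in f:
        req = line.strip()
        if not req or req.startswith("#"):
            continue
        subprocess.check_call([sys.executable, "-m", "pip", "install", req])
```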
For speechmetrics, install it from this repo.
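After everything is installed, a quick way to confirm the pinned versions are active in the conda environment is to print them from Python; the assumption here is that all three packages expose a __version__ attribute.

```python
# Print installed versions; compare against the pinned versions above.
# Assumes habitat-api and habitat-sim expose __version__ attributes.
import torch
import habitat
import habitat_sim

print("torch:", torch.__version__)              # expected 1.4.0
print("habitat-api:", habitat.__version__)      # expected 0.1.4
print("habitat-sim:", habitat_sim.__version__)  # expected 0.1.4
```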
Download the Few-shot-RIR-specific datasets from this link, extract the zip, and copy the data directory into the project root. The extracted data directory should have 6 subdirectories:
- audio_data: the sinusoidal sweep audio for computing IRs and anechoic audio for computing MOSE
- compute_mos: pickle files containing the mapping from different query source-receiver pairs to the corresponding LibriSpeech anechoic audio for computing MOS
- eval_arbitraryRIRQuery_datasets: pickle files that define a uniformly sampled context and queries for deterministic evaluation
- metadata: pickle file that gives the subgraph index for every node in a scene
- valid_poses: pickle files that contain just the echo IR poses (source = receiver) or give the split of arbitrary IR poses for train, seen-environment eval, and unseen-environment eval
- cached_room_acoustic_parameters: pickle files that contain the channelwise RT60 and DRR values for each node in a scene
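As a quick sanity check that the extraction worked, the pickle files listed above can be opened directly. The glob pattern and .pkl extension below are assumptions about the on-disk layout, so adjust them to whatever is actually extracted.

```python
# Peek inside one of the extracted pickle files to confirm the download.
# The directory name comes from the list above; the nesting and extension are guesses.
import glob
import pickle

pkl_paths = glob.glob("data/cached_room_acoustic_parameters/**/*.pkl", recursive=True)
print("found", len(pkl_paths), "pickle files")

if pkl_paths:
    with open(pkl_paths[0], "rb") as f:
        obj = pickle.load(f)
    print(type(obj))  # typically a dict keyed by scene/node; structure may vary
```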
Download the SoundSpaces Matterport3D binaural RIRs and metadata, and extract them into directories named data/binaural_rirs/mp3d and data/metadata/mp3d, respectively.
Download the Matterport3D dataset, and cache the observations relevant for the SoundSpaces simulator using this script from the SoundSpaces repository. Use resolutions of 128 x 128 for both RGB and depth sensors. Place the cached observations for all scenes (.pkl files) in data/scene_observations/mp3d.
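Before launching a run, it can help to verify that the data layout described above is in place; the path list in the sketch below simply mirrors the directories mentioned in this section.

```python
# Verify that the expected data directories exist before training/eval.
import os

expected = [
    "data/binaural_rirs/mp3d",
    "data/metadata/mp3d",
    "data/scene_observations/mp3d",
    "data/audio_data",
    "data/valid_poses",
]
for path in expected:
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{status:7s} {path}")
```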
For further information about the structure of the associated datasets, refer to rir_rendering/config/default.py or the task configs.
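To see which options a particular run overrides on top of the defaults in rir_rendering/config/default.py, the config YAML can be read directly with a plain YAML parser (the train config is used here as an example; assumes PyYAML is installed).

```python
# Print the options a config file overrides; defaults live in
# rir_rendering/config/default.py. Plain YAML parsing, no repo code needed.
import yaml

with open("rir_rendering/config/train/uniform_context_sampler.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(key, "->", value)
```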
8 GPU DataParallel training:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 main.py --exp-config rir_rendering/config/train/uniform_context_sampler.yaml --model-dir runs/fs_rir --run-type train NUM_PROCESSES 1
8 GPU DataParallel testing:
- Seen environments
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 main.py --exp-config rir_rendering/config/test/uniform_context_sampler.yaml --model-dir runs_eval/fs_rir --run-type eval EVAL_CKPT_PATH_DIR runs_eval/fs_rir/data/seen_eval_best_ckpt.pth NUM_PROCESSES 1
- Unseen environments
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 main.py --exp-config rir_rendering/config/test/uniform_context_sampler.yaml --model-dir runs_eval/fs_rir --run-type eval EVAL_CKPT_PATH_DIR runs_eval/fs_rir/data/unseen_eval_best_ckpt.pth NUM_PROCESSES 1
Compute eval metric values, such as the STFT error, RTE, and DRRE, using scripts/impulse_quality/compute_evalMetrics.ipynb, and MOSE using scripts/impulse_quality/mos/run_mos.py and scripts/impulse_quality/mos/compute_mose.ipynb. Additionally, to compute MOS and subsequently MOSE, set the UniformContextSampler.dump_audio_waveforms flag in rir_rendering/config/test/uniform_context_sampler.yaml to True so that the predicted and ground-truth IRs are dumped to disk.
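For reference, a common formulation of an STFT-based error between a predicted and a ground-truth IR is the mean absolute difference of their magnitude spectrograms, roughly as in the generic scipy-based sketch below; the exact STFT parameters and metric definitions used in compute_evalMetrics.ipynb may differ.

```python
# Rough sketch of an STFT-error style metric between two IR waveforms.
# Assumes both 1-D waveforms have the same length; parameters are illustrative.
import numpy as np
from scipy.signal import stft

def stft_mag_error(pred_ir, gt_ir, n_fft=512, hop=128):
    """Mean absolute error between magnitude spectrograms of two waveforms."""
    _, _, pred_spec = stft(pred_ir, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, gt_spec = stft(gt_ir, nperseg=n_fft, noverlap=n_fft - hop)
    return np.mean(np.abs(np.abs(pred_spec) - np.abs(gt_spec)))
```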
Notes:
- Metric values reported in the paper are median values.
- The model converges around 120-140 epochs; the next arXiv version will be updated to reflect that.
Download model checkpoints from this link.
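After downloading, a checkpoint can be loaded on CPU to confirm it is readable; the path below is taken from the eval commands above, and the printed keys depend on how the checkpoints were saved.

```python
# Load a downloaded checkpoint on CPU and list its top-level keys.
import torch

ckpt = torch.load("runs_eval/fs_rir/data/unseen_eval_best_ckpt.pth", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```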
@inproceedings{
majumder2022fewshot,
title={Few-Shot Audio-Visual Learning of Environment Acoustics},
author={Sagnik Majumder and Changan Chen and Ziad Al-Halah and Kristen Grauman},
booktitle={Advances in Neural Information Processing Systems},
editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
year={2022},
url={https://openreview.net/forum?id=PIXGY1WgU-S}
}
This project is released under the MIT license, as found in the LICENSE file.