Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe and discriminate. Our D3Net unifies dense captioning and visual grounding in 3D in a self-critical manner. This self-critical property of D3Net also introduces discriminability during object caption generation and enables semi-supervised training on ScanNet data with partially annotated descriptions. Our method outperforms SOTA methods in both tasks on the ScanRefer dataset, surpassing the SOTA 3D dense captioning method by a significant margin.
Please also check out the project website here.
For additional detail, please see the D3Net paper:
"D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding"
by Dave Zhenyu Chen, Qirui Wu, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.
The code is tested on Ubuntu 18.04 LTS with PyTorch 1.9.1 CUDA 11.1 installed. Please follow the instructions on the PyTorch official website to set up PyTorch with correct version first.
Install the necessary packages listed out in requirements.txt
:
pip install -r requirements.txt
As we implement PointGroup by ourselves, it is required to install the Minkowski Engine first in order to run our code. Please see the installation instructions for more details.
Before moving on to the next step, please don't forget to set the relevant project/data root path in conf/path.yaml
.
- Download the ScanRefer dataset and unzip it under
data/
. - Download the preprocessed GLoVE embeddings (~990MB) and put them under
data/
. - Download the ScanNetV2 dataset and put (or link)
scans/
under (or to)data/scannet/scans/
(Please follow the ScanNet Instructions for downloading the ScanNet dataset).
After this step, there should be folders containing the ScanNet scene data under the
data/scannet/scans/
with names likescene0000_00
- Pre-process ScanNet data. A folder named
split_data/
will be generated underdata/scannet/
after running the following command. Roughly 26GB free space is needed for this step:
cd data/scannet/
python prepare_scannet.py
-
(Optional) Pre-process the multiview features from ENet.
a. Download the ENet pretrained weights (1.4MB) and put it under
data/
b. Download and decompress the extracted ScanNet frames (~13GB).
c. Change the data paths in
conf/path.yaml
marked with TODO accordingly.d. Extract the ENet features:
cd data/scannet/ python compute_multiview_features.py
e. Project ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database:
python project_multiview_features.py --maxpool
As learning the object relative orientations in the relational graph requires CAD model alignment annotations in Scan2CAD, please refer to the Scan2CAD official release (you need ~8MB on your disk). Once the data is downloaded, extract the zip file under data/
and change the path to Scan2CAD annotations (SCAN2CAD
) in conf/path.yaml
. As Scan2CAD doesn't cover all instances in ScanRefer, please download the mapping file and place it under SCAN2CAD
. Parsing the raw Scan2CAD annotations by the following command:
python scripts/Scan2CAD_to_ScanNet.py
And don't forget to refer to Pytorch Geometric to install the graph support.
For the stage-wise training, we need to prepare module weights stated as follows. If you would like to play around our checkpoint, please download and unzip it under outputs
.
Run the following script to start training the PointGroup detection backbone using the multiview features and normals:
python scripts/train.py --config conf/pointgroup.yaml
The trained model as well as the intermediate results will be dumped into outputs/<output_folder>
. For evaluating the model ([email protected]), please run the following script - You should get [email protected] around 47 at this point:
python scripts/eval.py --folder <output_folder> --task detection
Prepare the PointGroup weights for training the speaker:
python scripts/prepare_weights.py --path <path_to_checkpoint> --config conf/pointgroup.yaml --model detector --model_name pointgroup
The prepared weights will be put under pretrained
.
We will now fine-tune the PointGroup detector and train the speaker module with XLE loss:
python scripts/train.py --config conf/pointgroup-speaker.yaml
For evaluating the model ([email protected]), please run the following script - You should get [email protected] around 46 at this point:
python scripts/eval.py --folder <output_folder> --task captioning
NOTE: We recommend compiling the
box_intersection.pyx
for faster evaluation:cd lib/utils && python cython_compile.py build_ext --inplace
Prepare the fine-tuned PointGroup weights and speaker checkpoint for next steps:
# prepare detector weights
python scripts/prepare_weights.py --path <path_to_checkpoint> --config conf/pointgroup-speaker.yaml --model detector --model_name detector
# prepare speaker weights
python scripts/prepare_weights.py --path <path_to_checkpoint> --config conf/pointgroup-speaker.yaml --model speaker --model_name speaker
After the detector is fine-tuned, let's move on to the listener module:
python scripts/train.py --config conf/pointgroup-listener.yaml
For evaluating the model ([email protected]), please run the following script - You should get [email protected] around 35 at this point:
python scripts/eval.py --folder <output_folder> --task grounding
Prepare the listener checkpoint for next steps:
python scripts/prepare_weights.py --path <path_to_checkpoint> --config conf/pointgroup-listener.yaml --model listener --model_name listener
Finally, it is time to put everything together for the joint speaker-listener training!
python scripts/train.py --config conf/pointgroup-speaker-listener.yaml
For evaluating the model performance, please run the following script - Note that since we're using reinforcement learning, you'll expect some variance in the trained model:
# detection
python scripts/eval.py --folder <output_folder> --task detection
# grounding
python scripts/eval.py --folder <output_folder> --task grounding
# captioning
python scripts/eval.py --folder <output_folder> --task captioning
If you found our work helpful, please kindly cite the relavant papers:
@misc{chen2022d3net,
title={D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans},
author={Dave Zhenyu Chen and Qirui Wu and Matthias Nießner and Angel X. Chang},
year={2021},
eprint={2112.01551},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@inproceedings{chen2021scan2cap,
title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans},
author={Chen, Zhenyu and Gholami, Ali and Nie{\ss}ner, Matthias and Chang, Angel X},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={3193--3203},
year={2021}
}
@InProceedings{chen2020scanrefer,
title={ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language},
author={Chen, Zhenyu and Chang, Angel X. and Nie{\ss}ner, Matthias},
editor={Vedaldi, Andrea and Bischof, Horst. and Brox, Thomas and Frahm, Jan-Michael},
booktitle={Computer Vision -- ECCV 2020},
publisher={Springer International Publishing},
pages={202--221},
year={2020},
isbn={978-3-030-58565-5}
}
D3Net is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Copyright (c) 2022 Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang