This is the repository of the paper "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes" (CVPR Workshops '23).
## Installation

- Install CUDA-enabled PyTorch by following https://pytorch.org/get-started/locally/ (note that this code has been tested with PyTorch 1.9.0 and 1.10.2 + cudatoolkit 11.3).
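  For reference, a minimal sketch of the install command for the tested configuration (PyTorch 1.10.2 + cudatoolkit 11.3); the wheel index may change over time, so prefer the selector on the PyTorch site:

  ```bash
  # tested configuration: PyTorch 1.10.2 + cudatoolkit 11.3
  pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 \
      -f https://download.pytorch.org/whl/cu113/torch_stable.html

  # verify that CUDA is visible to PyTorch
  python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
  ```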
- Install the remaining necessary dependencies from `requirements.txt`:

  ```bash
  pip install -r requirements.txt
  ```
- Compile the CUDA modules for the PointNet++ backbone by running `setup.py` inside `lib/pointnet2/`:

  ```bash
  cd lib/pointnet2
  python setup.py install
  ```

  (Note that this requires the full CUDA toolkit. If it fails, see Troubleshooting.)
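  Before compiling, it can help to confirm that the full toolkit is actually visible:

  ```bash
  nvcc --version    # should print the toolkit version (e.g. 11.3)
  echo $CUDA_HOME   # should point to your CUDA install root
  ```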
## Data Preparation

- Download the ScanQA dataset and place it under `data/qa/`.
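  Assuming the standard ScanQA release, `data/qa/` should then contain the question-answer JSON splits (file names below follow the official release and may differ):

  ```bash
  mkdir -p data/qa
  # expected layout after downloading:
  # data/qa/ScanQA_v1.0_train.json
  # data/qa/ScanQA_v1.0_val.json
  # data/qa/ScanQA_v1.0_test_w_obj.json
  # data/qa/ScanQA_v1.0_test_wo_obj.json
  ```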
- Download the ScanRefer dataset and unzip it under `data/`. To download the ScanRefer dataset, you need to fill out this form.
- Download the ScanNetV2 dataset and put `scans/` under `data/scannet/`. To download the ScanNetV2 dataset, follow https://github.com/daveredrum/ScanRefer/blob/master/data/scannet/README.md.
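  The helper scripts under `data/scannet/` (see the directory tree below) come from the ScanRefer data pipeline; assuming that pipeline, the downloaded scans are typically preprocessed with:

  ```bash
  cd data/scannet
  python batch_load_scannet_data.py   # writes the processed scenes to scannet_data/
  ```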
- Generate the top-down image views for all scenes with `run_generate.py` (`generate_top_down.py` renders the top-down image view for a single scene):

  ```bash
  python run_generate.py
  ```
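  The rendering relies on `open3d` (see Troubleshooting), so a quick import check before launching the full run can save time:

  ```bash
  python -c "import open3d; print(open3d.__version__)"
  ```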
- Download the PointNet++(-1x) checkpoint from https://github.com/facebookresearch/DepthContrast and store the checkpoint under the `checkpoints/` directory.
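  For example (the file name is a placeholder; use whatever the DepthContrast release provides):

  ```bash
  mkdir -p checkpoints
  mv ~/Downloads/<pointnet_1x_checkpoint>.pth checkpoints/
  ```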
In the end, the `data/` directory should have the following structure:

```
data/
├── qa/
├── scannet/
│   ├── batch_load_scannet_data.py
│   ├── load_scannet_data.py
│   ├── meta_data/
│   ├── model_util_scannet.py
│   ├── scannet_data/
│   ├── scannet_utils.py
│   ├── scans/
│   └── visualize.py
├── ScanRefer_filtered.*
└── top_imgs/
```
## Training

- Execute `scripts/pretrain.py`:

  ```bash
  python scripts/pretrain.py --no_height
  ```
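  Pretraining writes its checkpoint into a timestamped run folder under `outputs/`; note the folder name, since `--pretrain_src` in the next step expects it:

  ```bash
  ls outputs/   # e.g. <timestamp>_<tag_name>/  (exact naming depends on the run)
  ```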
- Execute `scripts/train.py`:
  - Training with pre-trained weights:

    ```bash
    python scripts/train.py --no_height --tokenizer_name clip --pretrain_src <folder_name_of_ckpt_file>
    ```

    `<folder_name_of_ckpt_file>` corresponds to the folder under `outputs/` with the timestamp + (optional) `<tag_name>`.
  - Training from scratch:

    ```bash
    python scripts/train.py --no_height --tokenizer_name clip
    ```
## Evaluation

- Evaluate trained models with the val dataset:

  ```bash
  python scripts/eval.py --folder <folder_name> --qa --force --tokenizer_name clip
  ```

  `<folder_name>` corresponds to the folder under `outputs/` with the timestamp + `<tag_name>`.
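  For example, with a hypothetical run folder `2023-03-01_12-34-56_clip` under `outputs/`:

  ```bash
  python scripts/eval.py --folder 2023-03-01_12-34-56_clip --qa --force --tokenizer_name clip
  ```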
## Troubleshooting

- Installation of `open3d` fails:

  ```
  user@device:~/3D-VQA-dev$ pip install open3d
  ERROR: Could not find a version that satisfies the requirement open3d (from versions: none)
  ERROR: No matching distribution found for open3d
  ```

  - Make sure to generate the top-view images on a desktop computer. The device that you are running the training on might not have a prebuilt `open3d` package available.
  - Comment out `open3d` in `requirements.txt` and thus omit the installation of `open3d` on this device.
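  A sketch of the second option, assuming `open3d` sits at the start of its own line in `requirements.txt`:

  ```bash
  sed -i 's/^open3d/# open3d/' requirements.txt   # comment out the open3d line
  pip install -r requirements.txt                 # install everything else
  ```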
- Execution of `lib/pointnet2/setup.py` fails:

  ```
  user@device:~/3D-VQA-dev/lib/pointnet2$ python setup.py install
  OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
  ```

  - Make sure that `CUDA_HOME` is set.
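  A minimal sketch, assuming CUDA 11.3 is installed at the default location (adjust the path to your install):

  ```bash
  export CUDA_HOME=/usr/local/cuda-11.3
  export PATH=$CUDA_HOME/bin:$PATH
  export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
  ```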
  ```
  user@device:~/3D-VQA-dev$ python lib/pointnet2/setup.py install
  FileNotFoundError: [Errno 2] No such file or directory: '_version.py'
  ```

  - Make sure to execute the code inside of `lib/pointnet2/`, as described in the CUDA layers installation step above.
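  That is, run the build from inside the directory rather than from the repository root:

  ```bash
  cd lib/pointnet2
  python setup.py install
  ```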
## Citation

```bibtex
@inproceedings{Parelli_2023_CVPR,
    author    = {Maria Parelli and Alexandros Delitzas and Nikolas Hars and Georgios Vlassis and Sotirios Anagnostidis and Gregor Bachmann and Thomas Hofmann},
    title     = {CLIP-Guided Vision-Language Pre-Training for Question Answering in 3D Scenes},
    booktitle = {Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
    year      = {2023}
}
```
## Acknowledgements

This project builds upon ATR-DBI/ScanQA and daveredrum/ScanRefer. It also makes use of openai/CLIP.