Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, Zhicheng Yan
Multi-view, pose-free, RGB-only 3D reconstruction in one step. Also supports novel view synthesis and relative pose estimation.
Please see more visual results and video on our website!
- 2025-1-1: Released a Gradio demo, all checkpoints, training/evaluation code, and training/evaluation trajectories of ScanNet.
- 2025-1-8: Improved demo view selection, giving better quality for multiple rooms.
We have only tested this on a Linux server with CUDA 12.4.
- Clone MV-DUSt3R+
git clone https://github.com/facebookresearch/mvdust3r.git
cd mvdust3r
- Create the virtual environment with Anaconda.
./install.sh
(Change the versions of PyTorch and PyTorch3D if you need a different CUDA version.)
- (Optional, for faster runtime) Compile the CUDA kernels for RoPE (the same as in DUSt3R and CroCo)
cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
Please download the checkpoints here into the folder checkpoints before running the demo or evaluation.
Name | Description |
---|---|
MVD.pth | MV-DUSt3R |
MVDp_s1.pth | MV-DUSt3R+ trained on stage 1 (8 views) |
MVDp_s2.pth | MV-DUSt3R+ trained on stage 1 then stage 2 (mixed 4~12 views) |
DUSt3R_ViTLarge_BaseDecoder_224_linear.pth | The pretrained DUSt3R model; our models are finetuned from it |
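Before launching the demo or evaluation, it can help to confirm that the expected files are in place. The following is a minimal sketch, not part of the repository: the helper `missing_checkpoints` is a hypothetical name, and it assumes the checkpoints sit directly inside the folder checkpoints under the file names listed in the table above.

```python
from pathlib import Path

# Checkpoint file names copied from the table above.
CHECKPOINT_NAMES = [
    "MVD.pth",
    "MVDp_s1.pth",
    "MVDp_s2.pth",
    "DUSt3R_ViTLarge_BaseDecoder_224_linear.pth",
]


def missing_checkpoints(folder="checkpoints", names=CHECKPOINT_NAMES):
    """Return the names from `names` that are not regular files in `folder`.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(folder)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```

Only the checkpoint you intend to pass to the demo or evaluation needs to be present, so a non-empty result is not necessarily an error.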
python demo.py --weights ./checkpoints/{CHECKPOINT}
You will see the UI like this:
The input can be multiple images (a single image is not supported) or a video. You will see the point cloud along with the predicted camera poses (3DGS visualization is left as future work).
The confidence threshold controls how many low-confidence points are filtered out.
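As an illustration of what such a threshold does, here is a minimal sketch in plain Python. It is not the demo's actual code: the function name `filter_by_confidence`, the per-point confidence list, and the choice to keep points whose confidence is greater than or equal to the threshold are all assumptions made for this example.

```python
def filter_by_confidence(points, confidences, threshold):
    """Keep only the points whose confidence is at least `threshold`.

    Assumption for illustration: one confidence value per 3D point, and
    a point is kept when confidence >= threshold. The demo may use a
    different comparison or confidence scale.
    """
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have the same length")
    return [p for p, c in zip(points, confidences) if c >= threshold]


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
    conf = [0.5, 2.0, 3.5]
    # A higher threshold removes more points.
    print(len(filter_by_confidence(pts, conf, 1.0)))
    print(len(filter_by_confidence(pts, conf, 3.0)))
```

Raising the threshold yields a sparser but cleaner point cloud; lowering it keeps more points at the cost of more noise.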
The No. of video frames setting only applies when the input is a video; it controls how many frames are uniformly sampled from the video for reconstruction.
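Uniform frame selection can be sketched as follows. This is one possible scheme under stated assumptions, not necessarily the one the demo uses: the function name `uniform_frame_indices` is hypothetical, and the sketch picks the first frame of each of `num_frames` equal-length segments of the video.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Return `num_frames` evenly spaced frame indices in [0, total_frames).

    Assumption for illustration: the video is split into `num_frames`
    equal segments and the first frame of each segment is taken. If the
    video has fewer frames than requested, every frame is returned.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        return list(range(total_frames))
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * total_frames) // num_frames for i in range(num_frames)]


if __name__ == "__main__":
    print(uniform_frame_indices(10, 5))
```

For a 10-frame video and 5 requested frames, this selects indices 0, 2, 4, 6 and 8.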
Note that the demo's inference is slower than what is claimed in the paper due to the overhead of Gradio and model loading. If you need faster runtime, please use our evaluation code.
Some tips to improve quality, especially for multiple rooms.
We use five datasets for training and testing: ScanNet, ScanNet++, HM3D, Gibson, and MP3D. Please go to their websites to sign the license agreements, then download and extract them into the folder data. Here are more instructions.
Currently we have released the trajectories of ScanNet for evaluation. Please download them to the folder trajectories. More trajectories for training and more data will be released later.
The folder scripts contains the following scripts for evaluation on ScanNet:
Name | Description |
---|---|
test_mvd.sh | MV-DUSt3R |
test_mvdp_stage1.sh | MV-DUSt3R+ trained on stage 1 (8 views) |
test_mvdp_stage2.sh | MV-DUSt3R+ trained on stage 1 then stage 2 (mixed 4~12 views) |
They should reproduce the paper's results on ScanNet (Tab. 2, 3, 4, S2, S3, and S5).
We are still preparing the release of the training trajectories and the trajectory generation code. In the meantime, the training scripts are also in the folder scripts and provide more information about our training.
Name | Description |
---|---|
train_mvd.sh | MV-DUSt3R, finetuned from DUSt3R |
train_mvdp_stage1.sh | MV-DUSt3R+ stage 1 training (8 views), finetuned from DUSt3R |
train_mvdp_stage2.sh | MV-DUSt3R+ stage 2 training (mixed 4~12 views), finetuned from the stage 1 model |
See here for more hyperparameter explanations.
@article{tang2024mv,
title={MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds},
author={Tang, Zhenggang and Fan, Yuchen and Wang, Dilin and Xu, Hongyu and Ranjan, Rakesh and Schwing, Alexander and Yan, Zhicheng},
journal={arXiv preprint arXiv:2412.06974},
year={2024}
}
This project is released under the CC BY-NC 4.0 license.
Many thanks to:
- DUSt3R for the codebase.