Yonghui Wang1,2, Shi-Yong Chen2, Zhenxing Zhou2, Siyi Li2, Haoran Li1,2, Wengang Zhou1, Houqiang Li1
1 University of Science and Technology of China (USTC) 2 Game AI Center, Tencent IEG
🎉 The Qwen2-VL and InternVL2 checkpoints have been released! (Hugging Face ckpt)
ROOT is a Vision-Language Model (VLM)-based system for indoor scene understanding.
It combines GPT-4V with vision models to detect objects, extract spatial metadata, and generate hierarchical scene graphs that capture relationships such as support, contain, hang, and attach.
- Object Perception: Detects indoor objects using GPT-4V.
- Indoor Scene Parsing: Extracts object bounding boxes, masks, etc.
- Hierarchical Scene Graphs: Captures spatial relationships such as support, contain, hang, and attach.
- Distance Estimation: Estimates distances between objects.
- Extensibility: Supports downstream tasks like 3D scene generation and scene-based Q&A.
- Download `depth_anything_metric_depth_indoor.pt` and place it in the `foundation/Depth_Anything` directory.
- Download the Qwen2-VL model from our Hugging Face and put it in `ckpts/Qwen2-VL-7B-FULL-full`. Additionally, you need to download the Qwen2.5 model and put it in `ckpts/Qwen2.5-3B-Instruct`.
- Add the Azure OpenAI token to the environment variable `OPENAI_API_KEY`, uncomment line 28 of `api/gpt4v_azure.py`, and comment out line 29. Alternatively, you can directly add your api_key to the token parameter in line 24 of `api/gpt4v_azure.py`.
- Run the System:

```bash
# Run with main script
python main.py

# Run with demo app
python app.py
```
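If you want a quick pre-flight check before launching, a minimal sketch like the one below (not part of the repository; the paths simply mirror the setup steps above) can verify that the key and checkpoints are in place:

```python
# check_setup.py -- hypothetical sanity check, not shipped with ROOT.
import os
from pathlib import Path

# The Azure OpenAI key must be exported before running main.py / app.py.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY to your Azure OpenAI key first."

# Checkpoints downloaded in the setup steps above.
required = [
    "foundation/Depth_Anything/depth_anything_metric_depth_indoor.pt",
    "ckpts/Qwen2-VL-7B-FULL-full",
    "ckpts/Qwen2.5-3B-Instruct",
]
missing = [p for p in required if not Path(p).exists()]
assert not missing, f"Missing files/directories: {missing}"

print("Setup looks complete; run `python main.py` or `python app.py`.")
```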
Learn how to fine-tune your own SceneVLM for custom indoor environments, using Qwen2-VL as an example. You can download our checkpoint here.
- First, execute the first two steps of our method - iterative object perception and indoor scene parsing. This will obtain various meta information about the indoor scene, including:
- Object list
- Masks
- Bounding boxes
- Depth information
- Distance information between objects
Note: At this stage, you will obtain distance information between objects. You can use this information to either (see the sketch below):
- Train SceneVLM for distance prediction capabilities
- Use it directly in your downstream tasks
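For example, the pairwise distances can be turned into simple instruction-tuning samples. This is only an illustration; the field names below (`obj_a`, `obj_b`, `distance_m`) are hypothetical and not the format produced by our parsing step:

```python
# Illustrative only: convert pairwise distance metadata into Q&A-style samples
# for distance-prediction fine-tuning. Field names are hypothetical.
def distance_qa_samples(distances):
    samples = []
    for d in distances:
        samples.append({
            "question": f"How far apart are the {d['obj_a']} and the {d['obj_b']}?",
            "answer": f"They are approximately {d['distance_m']:.2f} meters apart.",
        })
    return samples

pairs = [{"obj_a": "white sofa", "obj_b": "dining table", "distance_m": 1.8}]
print(distance_qa_samples(pairs))
```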
The following example demonstrates hierarchical scene graph generation, but the same process applies to distance prediction.
- Using the object names and their masks, run our `utils/show_point.py` script to generate the training input images. As shown below, the left is the original image and the right is the input image for SceneVLM:
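For intuition, a rough sketch of this marker-drawing step is shown below; the actual implementation lives in `utils/show_point.py`, so treat the helper and its defaults here as placeholders:

```python
# Rough sketch: draw a numbered dot at each object's mask centroid so that
# SceneVLM can refer to objects by index. Not the repo's actual implementation.
import numpy as np
from PIL import Image, ImageDraw

def draw_object_points(image_path, masks, names, out_path="scenevlm_input.png"):
    """masks: list of HxW boolean arrays, aligned with `names`."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for idx, (mask, name) in enumerate(zip(masks, names)):
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())          # mask centroid
        draw.ellipse((cx - 6, cy - 6, cx + 6, cy + 6), fill="red")
        draw.text((cx + 8, cy - 8), f"{idx}: {name}", fill="red")
    img.save(out_path)
    return out_path
```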
- Construct the training data as follows:
<details>
<summary>Click to expand the complete scene graph data structure</summary>

```json
{
"floor": {
"support": [
{
"rug": {
"support": [
{
"dining table": {}
},
{
"white sofa": {
"support": [
{
"colorful pillow_0": {}
},
{
"colorful pillow_3": {}
},
{
"colorful pillow_2": {}
}
]
}
},
{
"modern chairs": {}
}
]
}
}
]
},
"ceiling": {
"attach": [
{
"chandelier": {}
}
]
},
"wall": {
"hang": [
{
"paintings": {}
},
{
"wooden door": {}
}
]
}
}
```

</details>
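For downstream use, the nested dictionary can be flattened into (parent, relation, child) triples. The helper below is not part of the released code; it is a minimal sketch that assumes the structure shown above (each child is a single-key dict):

```python
# Minimal sketch: flatten the nested scene graph into (parent, relation, child) triples.
RELATIONS = {"support", "contain", "hang", "attach"}

def to_triples(node):
    """node: dict mapping object name -> {relation: [child dicts]}."""
    triples = []
    for name, relations in node.items():
        for relation, children in relations.items():
            if relation not in RELATIONS:
                continue
            for child in children:                        # each child is a single-key dict
                for child_name, child_relations in child.items():
                    triples.append((name, relation, child_name))
                    triples.extend(to_triples({child_name: child_relations}))
    return triples

# e.g. to_triples(scene_graph) yields ('floor', 'support', 'rug'),
# ('rug', 'support', 'dining table'), ('ceiling', 'attach', 'chandelier'), ...
```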
During training, we include Chain-of-Thought (CoT) data. The CoT description is generated by prompting GPT-4 with the template from `prompt/cot_prompt.txt` along with the above JSON structure. Example CoT output:
The rug is supported by the floor, and on top of the rug, there is a dining table, a white sofa, and modern chairs. The white sofa supports colorful pillow_0, colorful pillow_3, and colorful pillow_2. The chandelier is attached to the ceiling. On the wall, paintings are hanging, as well as a wooden door.
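One possible way to assemble that GPT-4 prompt is sketched below; it assumes the scene-graph JSON is simply appended after the template in `prompt/cot_prompt.txt`, which may differ from the actual template usage:

```python
# Hypothetical prompt assembly for CoT generation; adapt to the real template.
import json

def build_cot_prompt(scene_graph, template_path="prompt/cot_prompt.txt"):
    with open(template_path, "r", encoding="utf-8") as f:
        template = f.read()
    # Append the hierarchical scene graph so GPT-4 can verbalize it.
    return template + "\n" + json.dumps(scene_graph, indent=2)

# The resulting string is then sent to GPT-4 (e.g. via api/gpt4v_azure.py)
# to obtain a CoT description like the example above.
```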
- Format your annotation file as follows:
```json
[
{
"messages": [
{
"content": "<image>\n + [ssg_prompt.py]",
"role": "user"
},
{
"content": "[cot]\n + ```json\n[json_answer]```",
"role": "assistant"
}
],
"images": [
"[your_img_path]"
]
},
{},
]
```
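A helper along the following lines can generate that annotation file. It is only a sketch: the prompt text, CoT string, and image path are placeholders, and the user prompt is assumed to come from `prompt/ssg_prompt.py` as referenced above:

```python
# Hypothetical annotation builder matching the format above.
import json

def make_sample(image_path, ssg_prompt, cot, scene_graph):
    answer = cot + "\n```json\n" + json.dumps(scene_graph, indent=2) + "\n```"
    return {
        "messages": [
            {"content": "<image>\n" + ssg_prompt, "role": "user"},
            {"content": answer, "role": "assistant"},
        ],
        "images": [image_path],
    }

sample = make_sample(
    "[your_img_path]",
    ssg_prompt="<prompt text from prompt/ssg_prompt.py>",
    cot="The rug is supported by the floor, ...",
    scene_graph={"floor": {"support": [{"rug": {}}]}},
)
with open("scenevlm_sft.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```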
Follow the Qwen2-VL finetuning method from LLaMA-Factory to train your model.
Note: If you only want to obtain the indoor hierarchical scene graph and already have a list of indoor objects (instead of generating the object list through our first step, iterative object perception), you only need GroundingDINO and SAM for detection and segmentation. Then manually construct the JSON data and fine-tune from our released weights; in our experience, just a few thousand samples are enough to achieve very good scene graph generation results.
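Under that setup, the detection-and-segmentation step might look like the sketch below. It follows the publicly documented GroundingDINO and Segment-Anything inference APIs, but the config/checkpoint paths, thresholds, and object list are placeholders, so check them against each project's README:

```python
# Rough sketch: detect a known object list with GroundingDINO, then segment each box with SAM.
import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import SamPredictor, sam_model_registry

IMAGE = "imgs/living_room.jpg"                           # placeholder image path
OBJECTS = "sofa . dining table . rug . chandelier"       # your known object list

# GroundingDINO: open-set detection from the text prompt.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image(IMAGE)                  # uint8 HWC array, preprocessed tensor
boxes, logits, phrases = predict(
    model=dino, image=image, caption=OBJECTS,
    box_threshold=0.35, text_threshold=0.25,
)

# Convert normalized cxcywh boxes to pixel xyxy for SAM.
h, w, _ = image_source.shape
boxes_xyxy = boxes * torch.tensor([w, h, w, h])
boxes_xyxy[:, :2] -= boxes_xyxy[:, 2:] / 2
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]

# SAM: one mask per detected box.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

masks = []
for box, phrase in zip(boxes_xyxy, phrases):
    m, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    masks.append((phrase, m[0]))                         # (object name, HxW boolean mask)
```

The resulting object names and masks can then be fed into `utils/show_point.py` and the JSON construction steps above.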
Explore our interactive demo using Jupyter Notebook:
- Example Notebook: `demo.ipynb`
- Features: Step-by-step guidance and usage examples.
```
ROOT-VLM-System/
├── api/              # VLM API
├── asset/            # Icons, architecture diagrams, and example outputs
├── foundation/       # Core models and dependencies
├── demo.ipynb        # Jupyter notebook demo
├── main.py           # Main entry point for the system
├── LICENSE           # Project license
└── README.md         # Documentation
```
- Indoor object perception
- Indoor scene parsing
- Scene graph generation
- Distance estimation
- SceneVLM model weights
- Docker support
We are working on providing a Docker environment for seamless deployment. Stay tuned!
We extend our gratitude to the authors of the following projects for their foundational contributions:
- GroundingDINO: Scene parsing.
- Segment-Anything: Object segmentation.
- Depth-Anything: Depth estimation.
- InternVL: Fine-tuning base.
- Qwen2-VL: Fine-tuning base.
- GPT-4V: Vision-language reasoning.
If you find our work helpful in your research, please consider 🌟 starring this repository and citing us:
```bibtex
@article{wang2024rootvlmbasedindoor,
title={ROOT: VLM-based System for Indoor Scene Understanding and Beyond},
author={Yonghui Wang and Shi-Yong Chen and Zhenxing Zhou and Siyi Li and Haoran Li and Wengang Zhou and Houqiang Li},
journal={arXiv preprint arXiv:2411.15714},
year={2024}
}
```