ROOT System Logo

ROOT: VLM-based System for Indoor Scene Understanding and Beyond

Yonghui Wang¹,², Shi-Yong Chen², Zhenxing Zhou², Siyi Li², Haoran Li¹,², Wengang Zhou¹, Houqiang Li¹
¹ University of Science and Technology of China (USTC)   ² Game AI Center, Tencent IEG

Paper | Project | Weights | License


🎉 The Qwen2-VL and Intern2-VL checkpoints have been released! Hugging Face ckpt

💡 Introduction

System Architecture

ROOT is a Vision-Language Model (VLM)-based system for indoor scene understanding.
It combines GPT-4V with vision models to detect objects, extract spatial metadata, and generate hierarchical scene graphs that encode four relationship types: support, contain, hang, and attach.

Hierarchical scene graph

🔍 Features

  • Object Perception: Detects indoor objects using GPT-4V.
  • Indoor Scene Parsing: Extracts object bounding boxes, masks, depth, and other scene metadata.
  • Hierarchical Scene Graphs: Captures spatial relationships such as support, contain, hang, and attach.
  • Distance Estimation: Estimates distances between objects.
  • Extensibility: Supports downstream tasks like 3D scene generation and scene-based Q&A.

🚀 Quickstart

  1. Download the depth_anything_metric_depth_indoor.pt checkpoint and place it in the foundation/Depth_Anything directory.

  2. Download the Qwen2-VL model from our Hugging Face page and put it in ckpts/Qwen2-VL-7B-FULL-full. You also need to download the Qwen2.5 model and put it in ckpts/Qwen2.5-3B-Instruct.

  3. Add your Azure OpenAI token to the environment variable OPENAI_API_KEY, then uncomment line 28 of api/gpt4v_azure.py and comment out line 29. Alternatively, you can add your API key directly to the token parameter on line 24 of api/gpt4v_azure.py. (A Python launcher sketch is shown after these steps.)

  4. Run the System

    # Run with main script
    python main.py
    
    # Run with demo app
    python app.py
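
If you prefer to set the API key and launch the system from Python rather than your shell, a minimal launcher sketch is shown below. The script name launch_root.py and the placeholder key are illustrative and not part of the repository; the entry points main.py and app.py are the ones listed above.

# launch_root.py -- illustrative launcher sketch (not shipped with the repository).
# It passes OPENAI_API_KEY to the child process and runs the main pipeline;
# replace the placeholder with your own Azure OpenAI key.
import os
import subprocess

env = dict(os.environ, OPENAI_API_KEY="<your-azure-openai-api-key>")  # placeholder key
subprocess.run(["python", "main.py"], env=env, check=True)            # or ["python", "app.py"] for the demo app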

🔧 Finetuning

Learn how to finetune your own SceneVLM for custom indoor environments, using Qwen2-VL as an example. You can download our checkpoint here.

Data Preparation

  1. First, execute the first two steps of our method (iterative object perception and indoor scene parsing). This yields various meta information about the indoor scene, including:
    • Object list
    • Masks
    • Bounding boxes
    • Depth information
    • Distance information between objects

Note: At this stage, you will obtain distance information between objects. You can use this information to either:

  • Train SceneVLM for distance prediction capabilities
  • Use it directly in your downstream tasks

The following example demonstrates hierarchical scene graph generation, but the same process applies to distance prediction.
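
For concreteness, the per-scene metadata could be organized along the following lines. This is only an illustrative sketch: the field names, value formats, and file layout are assumptions, not the exact schema produced by ROOT.

# Illustrative sketch of per-scene metadata; all field names and paths are assumptions.
scene_meta = {
    "objects": ["rug", "white sofa", "dining table"],            # object list from iterative perception
    "boxes": {"white sofa": [120, 340, 610, 700]},               # xyxy pixel bounding boxes
    "masks": {"white sofa": "masks/white_sofa.png"},             # per-object segmentation masks
    "depth": "depth/scene_0001.npy",                             # metric depth map for the image
    "distances": {("white sofa", "dining table"): 1.8},          # estimated distance in metres
}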

  2. Using the object names and their masks, run our utils/show_point.py script to generate the training input images. As shown below, the left is the original image and the right is the input image for SceneVLM:
  3. Construct the training data as follows:
Complete scene graph data structure:
{
    "floor": {
        "support": [
            {
                "rug": {
                    "support": [
                        {
                            "dining table": {}
                        },
                        {
                            "white sofa": {
                                "support": [
                                    {
                                        "colorful pillow_0": {}
                                    },
                                    {
                                        "colorful pillow_3": {}
                                    },
                                    {
                                        "colorful pillow_2": {}
                                    }
                                ]
                            }
                        },
                        {
                            "modern chairs": {}
                        }
                    ]
                }
            }
        ]
    },
    "ceiling": {
        "attach": [
            {
                "chandelier": {}
            }
        ]
    },
    "wall": {
        "hang": [
            {
                "paintings": {}
            },
            {
                "wooden door": {}
            }
        ]
    }
}
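
For reference, the nested structure above can be flattened into (parent, relation, child) triples with a few lines of Python. This is a convenience sketch based only on the format shown here, not a utility shipped with the repository; scene_graph.json is a hypothetical file holding the JSON above.

import json

def walk(graph):
    """Print (parent, relation, child) triples from a nested scene-graph dict."""
    for parent, relations in graph.items():
        for relation, children in relations.items():
            for child in children:                  # each child is a one-key dict
                for child_name in child:
                    print(f"{parent} --{relation}--> {child_name}")
                walk(child)                         # recurse into the child's own relations

with open("scene_graph.json") as f:                 # hypothetical path to the JSON shown above
    walk(json.load(f))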

During training, we include Chain-of-Thought (CoT) data. The CoT description is generated by prompting GPT-4 with the template from prompt/cot_prompt.txt along with the above JSON structure. Example CoT output:

The rug is supported by the floor, and on top of the rug, there is a dining table, a white sofa, and modern chairs. The white sofa supports colorful pillow_0, colorful pillow_3, and colorful pillow_2. The chandelier is attached to the ceiling. On the wall, paintings are hanging, as well as a wooden door.
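
The repository's api/gpt4v_azure.py wrapper handles the actual GPT-4 calls; as a rough illustration only, generating a CoT description with the generic openai SDK against Azure might look like the sketch below. The endpoint, API version, deployment name, and scene_graph.json path are placeholders.

import os
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    api_version="2024-02-01",                               # placeholder API version
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

scene_graph = json.load(open("scene_graph.json"))           # the nested structure shown earlier
prompt = open("prompt/cot_prompt.txt").read() + "\n" + json.dumps(scene_graph, indent=4)

response = client.chat.completions.create(
    model="<your-gpt4-deployment>",                         # placeholder deployment name
    messages=[{"role": "user", "content": prompt}],
)
cot_text = response.choices[0].message.content              # CoT description like the example above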

  4. Format your annotation file as follows:
[
    {
        "messages": [
            {
                "content": "<image>\n + [ssg_prompt.py]",
                "role": "user"
            },
            {
                "content": "[cot]\n + ```json\n[json_answer]```",
                "role": "assistant"
            }
        ],
        "images": [
            "[your_img_path]"
        ]
    },
    {},
]
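
A small helper along these lines can assemble samples in the format above. The function name and the placeholder paths and strings are illustrative; the user prompt should come from ssg_prompt.py and the CoT text from the GPT-4 step described earlier.

import json

def build_sample(image_path, ssg_prompt, cot_text, scene_graph):
    """Assemble one training sample in the annotation format shown above."""
    answer = f"{cot_text}\n```json\n{json.dumps(scene_graph, indent=4)}\n```"
    return {
        "messages": [
            {"content": f"<image>\n{ssg_prompt}", "role": "user"},
            {"content": answer, "role": "assistant"},
        ],
        "images": [image_path],
    }

samples = [build_sample("images/scene_0001.jpg",             # placeholder image path
                        "<prompt from ssg_prompt.py>",       # placeholder prompt text
                        "<CoT description from GPT-4>",      # placeholder CoT text
                        json.load(open("scene_graph.json")))]

with open("train_annotations.json", "w") as f:               # annotation file for finetuning
    json.dump(samples, f, ensure_ascii=False, indent=4)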

Training

Follow the Qwen2-VL finetuning method from LLaMA-Factory to train your model.

Note: If you only want to obtain the indoor hierarchical scene graph and already have a list of indoor objects (instead of generating the object list through our first step of iterative object perception), you only need GroundingDINO and SAM for detection and segmentation; then manually construct the JSON data and fine-tune from our weights. In our experience, just a few thousand data samples are enough to achieve very good scene graph generation results.


📓 Jupyter Notebook Demo

Explore our interactive demo using Jupyter Notebook:

  • Example Notebook: demo.ipynb
  • Features: Step-by-step guidance and usage examples.

🗂️ Repository Structure

ROOT-VLM-System/
├── api/                   # VLM api
├── asset/                 # Icons, architecture diagrams, and example outputs
├── foundation/            # Core models and dependencies
├── demo.ipynb             # Jupyter notebook demo
├── main.py                # Main entry point for the system
├── LICENSE                # Project license
└── README.md              # Documentation

📃 TODO

  • Indoor object perception
  • Indoor scene parsing
  • Scene graph generation
  • Distance estimation
  • SceneVLM model weights
  • Docker support

🐳 Docker Support (Coming Soon)

We are working on providing a Docker environment for seamless deployment. Stay tuned!


🎉 Acknowledgements

We extend our gratitude to the authors of the open-source projects that ROOT builds on, including Depth Anything, GroundingDINO, SAM, Qwen2-VL, and LLaMA-Factory, for their foundational contributions.


📑 Citation

If you find our work helpful in your research, please consider 🌟 starring this repository and citing us:

@article{wang2024rootvlmbasedindoor,
  title={ROOT: VLM-based System for Indoor Scene Understanding and Beyond}, 
  author={Yonghui Wang and Shi-Yong Chen and Zhenxing Zhou and Siyi Li and Haoran Li and Wengang Zhou and Houqiang Li},
  journal={arXiv preprint arXiv:2411.15714},
  year={2024}
}
