Update from drivelm private data release
ChonghaoSima committed Dec 21, 2023
1 parent 38d403d commit 958f90f
Showing 20 changed files with 629 additions and 165 deletions.
235 changes: 163 additions & 72 deletions README.md

Large diffs are not rendered by default.

Binary file added assets/images/repo/drivelm_teaser.jpg
Binary file added assets/images/repo/drivelm_timeline.jpg
Binary file added assets/images/repo/drivelm_timeline_v2.jpg
Binary file removed assets/images/repo/image_example.png
Binary file added assets/images/repo/paper_data.jpg
Binary file added assets/images/repo/paper_data_comp.png
Binary file added assets/images/repo/paper_model_pipeline.jpg
Binary file added assets/images/repo/paper_qualitative.jpg
Binary file added assets/images/repo/paper_teaser.jpg
Binary file removed assets/images/repo/stats.jpeg
Binary file removed assets/images/repo/stats.png
Binary file added assets/images/repo/title_v2.jpg
Binary file added assets/video/graph.mp4
44 changes: 44 additions & 0 deletions docs/data_details.md
@@ -0,0 +1,44 @@
## Features of the DriveLM-Data <a name="features"></a>

- 🛣 Completeness in functionality (covering **Perception**, **Prediction**, and **Planning** QA pairs).


<p align="center">
<img src="../assets/images/repo/point_1.png">
</p>


- 🔜 Reasoning for future events that have not yet happened.
- Many **"What If"**-style questions: imagine the future by language.


<p align="center">
<img src="../assets/images/repo/point_2.png" width=70%>
</p>

- ♻ Task-driven decomposition.
- **One** scene-level description decomposed into **many** frame-level trajectories & planning QA pairs.

<p align="center">
<img src="../assets/images/repo/point_3.png">
</p>

## How about the annotation process? <a name="annotation"></a>

The annotation process is different for DriveLM-nuScenes and DriveLM-CARLA.

<p align="center">
<img src="../assets/images/repo/paper_data.jpg">
</p>

**For DriveLM-nuScenes**, we divide the annotation process into three steps:

1️⃣ Keyframe selection. Given all frames in one clip, the annotator selects the keyframes that need annotation. The criterion is that those frames should involve changes in ego-vehicle movement status (lane changes, sudden stops, starting after a stop, etc.).

2️⃣ Key object selection. Given the keyframes, the annotator needs to pick out key objects in the six surrounding images. The criterion is that those objects should be able to affect the action of the ego vehicle (traffic signals, pedestrians crossing the road, other vehicles moving in the direction of the ego vehicle, etc.).

3️⃣ Question and answer annotation. Given those key objects, we automatically generate questions regarding single or multiple objects about perception, prediction, and planning. More details can be found in our data.

**For DriveLM-CARLA**, we employ an automated annotation approach:

We collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert. We set up a series of routes in urban, residential, and rural areas and execute the expert on these routes. During this process, we collect the necessary sensor data, generate relevant QAs based on privileged information about objects and the scene, and organize the logical relationships to connect this series of QAs into a graph.
98 changes: 98 additions & 0 deletions docs/data_prep_nus.md
@@ -0,0 +1,98 @@
## Download data
We kindly ask you to fill out the [form](https://docs.google.com/forms/d/e/1FAIpQLSeX6CR3u-15IV-TKx2uPv1wiKjydjZ__NNW98H4nR5JZtQa2Q/viewform) before downloading. To get started, download the nuScenes subset image data and the DriveLM-nuScenes QA json files below.

<!-- <a href="https://docs.google.com/forms/d/e/1FAIpQLSfm8k7LjITLRdXgbURxk46dq5Q2n8qGoRX0nWqQNE1U_322wQ/viewform?usp=sf_link" target="_blank">
<img src="https://img.shields.io/badge/Any%20comments%20welcome!-white?logo=google%20forms&label=Google%20Forms&labelColor=blue">
</a> -->

| nuScenes subset images | DriveLM-nuScenes version-1.0|
|-------|-------|
| [Google Drive](https://drive.google.com/file/d/1DeosPGYeM2gXSChjMODGsQChZyYDmaUz/view?usp=sharing) | [Google Drive](https://drive.google.com/file/d/1LK7pYHytv64neN1626u6eTQBy1Uf4IQH/view?usp=sharing) |
|[Baidu Netdisk](https://pan.baidu.com/s/11xvxPzUY5xTIsJQrYFogqg?pwd=mk95)|[Baidu Netdisk](https://pan.baidu.com/s/1PAdotDY0MN3nkz8w_XhDsw?pwd=l4wf) |
|[HuggingFace](https://huggingface.co/datasets/OpenDriveLab/DriveLM/blob/main/drivelm_nus_imgs_train.zip)|[HuggingFace](https://huggingface.co/datasets/OpenDriveLab/DriveLM/blob/main/v1_0_train_nus.json)|

You can also download the full nuScenes dataset [HERE](https://www.nuscenes.org/download) to enable video input.

Our DriveLM dataset contains a collection of questions and answers. Currently, only the training set is publicly available. The dataset is named `v1_0_train_nus.json`.

<!-- - `v1_0_train.json`/`v1_0_val.json`: In this file, questions and answers are not augmented using GPT-3.5/4.0. The answers tend to follow relatively fixed patterns, resulting in straightforward and less diverse responses. -->

<!-- - `gpt_augmented_v1_0_train.json`/`gpt_augmented_v1_0_val.json`: Unlike the previous file, questions and answers in this version have been augmented using GPT. This optimization enhances the diversity of Q&A pairs. Consequently, responses are not limited to simple and direct Q&A, but may include richer expressions and content. -->
## Prepare the dataset

Organize the data structure as follows:

```
DriveLM
├── data/
│ ├── QA_dataset_nus/
│ │ ├── v1_0_train_nus.json
│ ├── nuscenes/
│ │ ├── samples/
```
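
A quick way to verify the layout is a minimal sketch like the one below, assuming it is run from the `DriveLM` root; the paths simply mirror the tree above:

```python
from pathlib import Path

# Sanity-check the expected layout; the paths mirror the tree shown above.
root = Path("data")
assert (root / "QA_dataset_nus" / "v1_0_train_nus.json").is_file(), "QA json not found"
assert (root / "nuscenes" / "samples").is_dir(), "nuScenes samples folder not found"
print("DriveLM data layout looks good.")
```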


#### File structure

The QA pairs are stored in `v1_0_train_nus.json`. Below is the structure of the JSON file. All `coordinates` mentioned are referenced from the `upper-left` corner of the respective camera image, with the `right` and `bottom` directions serving as the positive x and y axes, respectively.
```
v1_0_train_nus.json
├── scene_token:{
│ ├── "scene_description": "The ego vehicle proceeds along the current road, preparing to enter the main road after a series of consecutive right turns.",
│ ├── "key_frames":{
│ │ ├── "frame_token_1":{
│ │ │ ├── "key_object_infos":{"<c1,CAM_FRONT,258.3,442.5>": {"Category": "Vehicle", "Status": "Moving", "Visual_description": "White Sedan", "2d_bbox": [x_min, y_min, x_max, y_max]}, ...},
│ │ │ ├── "QA":{
│ │ │ │ ├── "perception":[
│ │ │ │ │ ├── {"Q": "What are the important objects in the current scene?", "A": "The important objects are <c1,CAM_FRONT,258.3,442.5>, <c2,CAM_FRONT,1113.3,505.0>, ...", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None},
│ │ │ │ │ ├── {"Q": "xxx", "A": "xxx", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│ │ │ │ ├── ],
│ │ │ │ ├── "prediction":[
│ │ │ │ │ ├── {"Q": "What is the future state of <c1,CAM_FRONT,258.3,442.5>?", "A": "Slightly offset to the left in maneuvering.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│ │ │ │ ├── ],
│ │ │ │ ├── "planning":[
│ │ │ │ │ ├── {"Q": "In this scenario, what are safe actions to take for the ego vehicle?", "A": "Brake gently to a stop, turn right, turn left.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}, ...
│ │ │ │ ├── ],
│ │ │ │ ├── "behavior":[
│ │ │ │ │ ├── {"Q": "Predict the behavior of the ego vehicle.", "A": "The ego vehicle is going straight. The ego vehicle is driving slowly.", "C": None, "con_up": None, "con_down": None, "cluster": None, "layer": None}
│ │ │ │ ├── ]
│ │ │ ├── },
│ │ │ ├── "image_paths":{
│ │ │ │ ├── "CAM_FRONT": "xxx",
│ │ │ │ ├── "CAM_FRONT_LEFT": "xxx",
│ │ │ │ ├── "CAM_FRONT_RIGHT": "xxx",
│ │ │ │ ├── "CAM_BACK": "xxx",
│ │ │ │ ├── "CAM_BACK_LEFT": "xxx",
│ │ │ │ ├── "CAM_BACK_RIGHT": "xxx",
│ │ │ ├── }
│ │ ├── },
│ │ ├── "frame_token_2":{
│ │ │ ├── "key_object_infos":{"<c1,CAM_BACK,612.5,490.6>": {"Category": "Traffic element", "Status": "None", "Visual_description": "Stop sign", "2d_bbox": [x_min, y_min, x_max, y_max]}, ...},
│ │ │ ├── "QA":{
│ │ │ │ ├── "perception":[...],
│ │ │ │ ├── "prediction":[...],
│ │ │ │ ├── "planning":[...],
│ │ │ │ ├── "behavior":[...]
│ │ │ ├── },
│ │ │ ├── "image_paths":{...}
│ │ ├── }
│ ├── }
├── }
```

- `scene_token` is the same as in nuScenes dataset.
- `scene_description` is a one-sentence summary of the ego-vehicle behavior in the roughly 20-second video clip (the notion of a scene in the nuScenes dataset).
- Under `key_frames`, each key frame is identified by the `frame_token`, which corresponds to the `token` in the nuScenes dataset.
- `key_object_infos` is a mapping from a `c tag` (e.g. \<c1,CAM_FRONT,258.3,442.5\>) to further information about the corresponding key object, such as its category, status, visual description, and 2D bounding box.
- `QA` is divided into different tasks, and the QA pairs under each task are formulated as a list of dictionaries. Each dictionary contains the keys `Q` (question), `A` (answer), `C` (context), `con_up`, `con_down`, `cluster`, and `layer`. Currently, the values of the context-related keys are set to `None`, serving as tentative placeholders for future fields related to DriveLM-CARLA. See the loading sketch below.
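
A minimal sketch of loading and traversing the file, following the structure described above (the path matches the layout in this doc; only the standard library is used):

```python
import json

# Load the DriveLM-nuScenes QA annotations (path follows the layout above).
with open("data/QA_dataset_nus/v1_0_train_nus.json") as f:
    drivelm = json.load(f)

for scene_token, scene in drivelm.items():
    print(scene_token, scene["scene_description"])
    for frame_token, frame in scene["key_frames"].items():
        key_objects = frame["key_object_infos"]   # c tag -> category, status, description, 2d bbox
        image_paths = frame["image_paths"]        # camera name -> image path
        for task in ("perception", "prediction", "planning", "behavior"):
            for qa in frame["QA"].get(task, []):
                question, answer = qa["Q"], qa["A"]
                # "C", "con_up", "con_down", "cluster", "layer" are currently None placeholders.
```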


**Note:** The `c tag` label is used to indicate key objects selected during the annotation process. These objects include not only those present in the ground truth but also objects that are not, such as landmarks and traffic lights. Each key frame contains a minimum of three and a maximum of six key objects. The `c tag` is organized as `<c,CAM,x,y>`, where `c` is the identifier and `CAM` indicates the camera in which the key object's center point is situated; `x` and `y` are the horizontal and vertical coordinates of the 2D bounding box in that camera's coordinate system, with the `upper-left` corner as the `origin` and the `right` and `bottom` directions as the `positive x and y axes`, respectively.
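
For illustration, a `c tag` string could be split into its parts as follows; `parse_c_tag` is a hypothetical helper, not part of any released toolkit:

```python
import re

def parse_c_tag(tag: str):
    """Split a key-object tag such as '<c1,CAM_FRONT,258.3,442.5>' into its parts."""
    match = re.fullmatch(r"<(c\d+),([A-Z_]+),([\d.]+),([\d.]+)>", tag)
    if match is None:
        raise ValueError(f"not a valid c tag: {tag}")
    identifier, camera, x, y = match.groups()
    return identifier, camera, float(x), float(y)  # center point in that camera's image coordinates

print(parse_c_tag("<c1,CAM_FRONT,258.3,442.5>"))  # ('c1', 'CAM_FRONT', 258.3, 442.5)
```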

In contrast to the `c tag`, for the question "Identify all the traffic elements in the front view," the output is presented as a list formatted as `[(c, s, x1, y1, x2, y2), ...]`. Here, `c` denotes the category, `s` represents the status, and `x1, y1, x2, y2` indicate the offsets of the top-left and bottom-right corners of the box relative to the center point.


<p align="center">
<img alt="data" src="https://github.com/OpenDriveLab/DriveLM-new/assets/75412366/58d3a3f9-93b1-4899-a1c2-93c04a5978f0" width=90%>
</p>

91 changes: 0 additions & 91 deletions docs/getting_started.md

This file was deleted.

6 changes: 6 additions & 0 deletions docs/gvqa.md
@@ -0,0 +1,6 @@
### What is GVQA?
The most exciting aspect of the dataset is that the questions and answers (`QA pairs`) are connected in a graph-style structure, with QA pairs as the nodes and potential logical progressions as the edges. The motivation for doing this in the AD domain is that AD tasks are well defined per stage, from raw sensor input to final control action, through perception, prediction, and planning.

Its key difference from prior VQA tasks for AD is the availability of logical dependencies between QAs, which can be used to guide the answering process. Below is a demo video illustrating the idea.

https://github.com/OpenDriveLab/DriveLM-new/assets/75412366/78c32442-73c8-4f1d-ab69-34c15e7060af
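
As a loose illustration of the graph idea (not the released tooling), the per-frame QA lists could be arranged into a directed graph following the perception → prediction → planning → behavior progression; once the `con_up`/`con_down` fields are populated, they are intended to carry the actual edges:

```python
import networkx as nx  # any graph library would do; networkx is only an assumption here

STAGES = ["perception", "prediction", "planning", "behavior"]

def build_stage_graph(frame_qa: dict) -> nx.DiGraph:
    """Connect QA nodes of consecutive stages; a stand-in until con_up/con_down carry real links."""
    graph = nx.DiGraph()
    for stage in STAGES:
        for idx, qa in enumerate(frame_qa.get(stage, [])):
            graph.add_node((stage, idx), question=qa["Q"], answer=qa["A"])
    for upstream, downstream in zip(STAGES, STAGES[1:]):
        for u in range(len(frame_qa.get(upstream, []))):
            for v in range(len(frame_qa.get(downstream, []))):
                graph.add_edge((upstream, u), (downstream, v))
    return graph
```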
