Add QCNet
patrick-llgc committed Jun 24, 2024
1 parent 6922609 commit f65a21c
Showing 6 changed files with 35 additions and 15 deletions.
17 changes: 9 additions & 8 deletions README.md
@@ -35,8 +35,8 @@
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)

## 2024-06 (8)
- [LINGO-1: Exploring Natural Language for Autonomous Driving](https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/) [[Notes](paper_notes/lingo_1.md)] [Wayve, open-loop world model]
- [LINGO-2: Driving with Natural Language](https://wayve.ai/thinking/lingo-2-driving-with-language/) [[Notes](paper_notes/lingo_2.md)] [Wayve, closed-loop world model]
- [OpenVLA: An Open-Source Vision-Language-Action Model](https://arxiv.org/abs/2406.09246) [open source RT-2]
- [Parting with Misconceptions about Learning-based Vehicle Motion Planning](https://arxiv.org/abs/2306.07962) <kbd>CoRL 2023</kbd> [Simple non-learning based baseline]
- [QuAD: Query-based Interpretable Neural Motion Planning for Autonomous Driving](https://arxiv.org/abs/2404.01486) [Waabi]
@@ -46,10 +46,10 @@
- [EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching](https://arxiv.org/abs/2003.02746) [[Notes](paper_notes/eudm.md)] <kbd>ICRA 2020</kbd> [Wenchao Ding, Shaojie Shen, Behavior planning]
- [TPP: Tree-structured Policy Planning with Learned Behavior Models](https://arxiv.org/abs/2301.11902) <kbd>ICRA 2023</kbd> [Marco Pavone, Nvidia, Behavior planning]
- [MARC: Multipolicy and Risk-aware Contingency Planning for Autonomous Driving](https://arxiv.org/abs/2308.12021) [[Notes](paper_notes/marc.md)] <kbd>RAL 2023</kbd> [Shaojie Shen, Behavior planning]
- [EPSILON: An Efficient Planning System for Automated Vehicles in Highly Interactive Environments](https://arxiv.org/abs/2108.07993) <kbd>TRO 2021</kbd> [Wenchao Ding, encyclopedia of pnc]
- [trajdata: A Unified Interface to Multiple Human Trajectory Datasets](https://arxiv.org/abs/2307.13924) <kbd>NeurIPS 2023</kbd> [Marco Pavone, Nvidia]
- [Optimal Vehicle Trajectory Planning for Static Obstacle Avoidance using Nonlinear Optimization](https://arxiv.org/abs/2307.09466) [Xpeng]
- [Jointly Learnable Behavior and Trajectory Planning for Self-Driving Vehicles](https://arxiv.org/abs/1910.04586) [[Notes](paper_notes/joint_learned_bptp.md)] <kbd>IROS 2019 Oral</kbd> [Uber ATG, behavioral planning, motion planning]
- [Enhancing End-to-End Autonomous Driving with Latent World Model](https://arxiv.org/abs/2406.08481)
- [OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments](https://arxiv.org/abs/2312.09243) [Jiwen Lu]
- [RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision](https://arxiv.org/abs/2309.09502) <kbd>ICRA 2024</kbd>
@@ -60,7 +60,6 @@
- [Learning-Based Approach for Online Lane Change Intention Prediction](https://ieeexplore.ieee.org/document/6629564/) <kbd>IV 2013</kbd> [SVM, LC intention prediction]
- [Traffic Flow-Based Crowdsourced Mapping in Complex Urban Scenario](https://ieeexplore.ieee.org/document/10171417) <kbd>RAL 2023</kbd> [Wenchao Ding, Huawei, crowdsourced map]
- [FlowMap: Path Generation for Automated Vehicles in Open Space Using Traffic Flow](https://arxiv.org/abs/2305.01622) <kbd>ICRA 2023</kbd>
- [Hybrid A-star: Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments](https://www.semanticscholar.org/paper/Path-Planning-for-Autonomous-Vehicles-in-Unknown-Dolgov-Thrun/0e8c927d9c2c46b87816a0f8b7b8b17ed1263e9c) <kbd>IJRR 2010</kbd> [Dolgov, Thrun, Searching]
- [Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame](https://www.semanticscholar.org/paper/Optimal-trajectory-generation-for-dynamic-street-in-Werling-Ziegler/6bda8fc13bda8cffb3bb426a73ce5c12cc0a1760) <kbd>ICRA 2010</kbd> [Werling, Thrun, Sampling] [MUST READ for planning folks]
- [Autonomous Driving on Curvy Roads Without Reliance on Frenet Frame: A Cartesian-Based Trajectory Planning Method](https://ieeexplore.ieee.org/document/9703250) <kbd>TITS 2022</kbd>
@@ -72,9 +72,13 @@
- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) [[Notes](paper_notes/alphago.md)] <kbd>Nature 2016</kbd> [DeepMind, MCTS]
- [AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play](https://www.science.org/doi/full/10.1126/science.aar6404) <kbd>Science 2018</kbd> [DeepMind]
- [MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model](https://www.nature.com/articles/s41586-020-03051-4) <kbd>Nature 2020</kbd> [DeepMind]
- [Grandmaster-Level Chess Without Search](https://arxiv.org/abs/2402.04494) [DeepMind]
- [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295) [Mobileye, desire and traj optimization]
- [Comprehensive Reactive Safety: No Need For A Trajectory If You Have A Strategy](https://arxiv.org/abs/2207.00198) <kbd>IROS 2022</kbd> [Da Fang, Qcraft]
- [BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning](https://arxiv.org/abs/2310.10357) <kbd>AAAI 2024</kbd>
- [LLM-MCTS: Large Language Models as Commonsense Knowledge for Large-Scale Task Planning](https://arxiv.org/abs/2305.14078) <kbd>NeurIPS 2023</kbd>
- [HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction](https://openaccess.thecvf.com/content/CVPR2022/papers/Zhou_HiVT_Hierarchical_Vector_Transformer_for_Multi-Agent_Motion_Prediction_CVPR_2022_paper.pdf) <kbd>CVPR 2022</kbd> [Zikang Zhou, agent-centric, motion prediction]
- [QCNet: Query-Centric Trajectory Prediction](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhou_Query-Centric_Trajectory_Prediction_CVPR_2023_paper.pdf) [[Notes](paper_notes/qcnet.md)] <kbd>CVPR 2023</kbd> [Zikang Zhou, scene-centric, motion prediction]

## 2024-03 (11)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
@@ -248,8 +251,6 @@
- [BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection](https://arxiv.org/abs/2206.10092) [[Notes](paper_notes/bevdepth.md)] [BEVNet, NuScenes SOTA, Megvii]
- [CVT: Cross-view Transformers for real-time Map-view Semantic Segmentation](https://arxiv.org/abs/2205.02833) [[Notes](paper_notes/cvt.md)] <kbd>CVPR 2022 oral</kbd> [UTAustin, Philipp]
- [Wayformer: Motion Forecasting via Simple & Efficient Attention Networks](https://arxiv.org/abs/2207.05844) [[Notes](paper_notes/wayformer.md)] [Behavior prediction, Waymo]

## 2022-06 (3)
- [BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection](https://arxiv.org/abs/2203.17054) [[Notes](paper_notes/bevdet4d.md)] [BEVNet]
@@ -1702,8 +1703,8 @@
- [Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?](https://arxiv.org/abs/2107.03332)
- [Topology Preserving Local Road Network Estimation from Single Onboard Camera Image](https://arxiv.org/abs/2112.10155) [BEVNet, Luc Van Gool]
- [Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine](https://arxiv.org/abs/2311.16452) [Small LLM prompting, Microsoft]
- [CoT: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) <kbd>NeurIPS 2022</kbd>
- [ToT: Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601) [[Notes](paper_notes/tot.md)] <kbd>NeurIPS 2023 Oral</kbd>
- [Cumulative Reasoning with Large Language Models](https://arxiv.org/abs/2308.04371)
- [A Survey of Techniques for Maximizing LLM Performance](https://www.youtube.com/watch?v=ahnGLM-RC1Y&ab_channel=OpenAI) [OpenAI]
- [Drive AGI](https://github.com/OpenDriveLab/DriveAGI)
2 changes: 1 addition & 1 deletion paper_notes/gaia_1.md
@@ -7,7 +7,7 @@ tl;dr: World model capable of multi-future video generation for autonomous driving
#### Overall impression
A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's action as the world evolves. One possible solution is to learn a world model. A **world model** is a predictive model of the future that learns a general representation of the world in order to understand the consequences of its actions (or in other words, captures expected future events). **World modeling** has been used as a pretraining task to learn a compact and general representation in a self-supervised way.
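
A minimal sketch of what such an action-conditioned world-modeling objective can look like (purely illustrative: the module sizes and the one-token-per-frame simplification are my assumptions, not GAIA-1's actual architecture, which tokenizes video with a discrete VQ encoder and trains a large autoregressive transformer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWorldModel(nn.Module):
    """Toy action-conditioned next-token predictor (one token per frame)."""
    def __init__(self, vocab_size=1024, action_dim=2, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.action_emb = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_tokens, actions):
        # frame_tokens: (B, T) discrete codes from a (frozen) video tokenizer
        # actions: (B, T, action_dim), e.g. speed and curvature
        x = self.token_emb(frame_tokens) + self.action_emb(actions)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.backbone(x, mask=causal))

model = TinyWorldModel()
tokens = torch.randint(0, 1024, (2, 16))  # stand-in for tokenized frames
actions = torch.randn(2, 16, 2)
logits = model(tokens, actions)
# self-supervised objective: predict frame t+1 from frames/actions up to t
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 1024),
                       tokens[:, 1:].reshape(-1))
```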

GAIA-1's output is still limited to the video domain. The input can be conditioned on action, making it a world model. In contrast, the follow-up work of [Lingo-2](lingo_2.md) can output actions. --> Yet Lingo-2 is not built strictly on top of GAIA-1.

Note that some generative models excel at generating visually convincing content, but they may fall short in learning representations of the evolving world dynamics that are crucial for precise and robust decision making in complex scenarios. --> Sora

2 changes: 1 addition & 1 deletion paper_notes/lingo1.md → paper_notes/lingo_1.md
@@ -5,7 +5,7 @@ _June 2024_
tl;dr: Open-loop AD commentator with LLM.

#### Overall impression
Lingo-1's commentary was not integrated with the driving model, and remains an open-loop system. Lingo-1 is enhanced by the release of Lingo-1X, which extends the VLM to a VLX model by adding referential segmentation as the X. This is enhanced further by the successor [Lingo-2](lingo_2.md), which is a VLA model and finally achieves closed loop.

This is the first step toward a fully explainable E2E system. The language model can be coupled with the driving model, offering a nice interface to the E2E blackbox.

8 changes: 4 additions & 4 deletions paper_notes/lingo2.md → paper_notes/lingo_2.md
@@ -2,14 +2,14 @@

_June 2024_

tl;dr: First closed-loop world model that can output actions for autonomous driving, built by modifying an LLM.

#### Overall impression
This is perhaps the second world-model driven autonomous driving system deployed in the real world, after FSDv12. Another example is [ApolloFM (from AIR Tsinghua, blog in Chinese)](https://mp.weixin.qq.com/s/8d1qXTm5v4H94HxAibp1dA). Lingo-2 is more like [RT-2](rt2.md) in the sense that it piggybacks on an LLM as a starting point and adds multimodality adaptors to it. It is not native vision (as [GAIA-1](gaia_1.md) is) nor native action. FSD v12 is highly speculated to be native vision and action.

Wayve calls this model a VLAM (vision-language-action model). It improves upon the previous work of [Lingo-1](lingo_1.md), which is an open-loop driving commentator, and [Lingo-1-X](https://wayve.ai/thinking/lingo-1-referential-segmentation/), which can output referential segmentations. Lingo-1-X extends the vision-language model to the VLX (vision-language-X) domain. Lingo-2 now officially dives into the new domain of decision making and includes action as the X output.

The action output from Lingo-2's VLAM is a bit different from that of [RT-2](rt2.md). Lingo-2 predicts trajectory waypoints (like ApolloFM) rather than direct actions (as in FSD).
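
For intuition on the difference, here is a toy sketch of how waypoint outputs still get turned into low-level actions downstream by a tracker (a pure-pursuit-style controller; all parameter values and the 0.5s waypoint spacing are my assumptions, not from the Lingo-2 report):

```python
import math

def waypoints_to_action(waypoints, wheelbase=2.8, lookahead=5.0, dt=0.5):
    """Toy pure-pursuit tracker: (x, y) waypoints in the ego frame
    (x forward), assumed dt seconds apart, to a (steering, speed) action."""
    # pick the first waypoint at least `lookahead` meters away
    target = next((p for p in waypoints
                   if math.hypot(p[0], p[1]) >= lookahead), waypoints[-1])
    ld = math.hypot(target[0], target[1])
    alpha = math.atan2(target[1], target[0])  # heading error to the target
    # pure pursuit steering law: atan(2 L sin(alpha) / ld)
    steering = math.atan2(2.0 * wheelbase * math.sin(alpha), ld)
    # speed implied by the spacing of the first two waypoints
    speed = math.hypot(waypoints[1][0] - waypoints[0][0],
                       waypoints[1][1] - waypoints[0][1]) / dt
    return steering, speed

steer, v = waypoints_to_action([(1.0, 0.0), (2.0, 0.1), (4.0, 0.4), (6.0, 1.0)])
```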

The paper claims that this is a strong first indication of the alignment between explanations and decision-making. --> Lingo-2 is outputting driving behavior and textual predictions in real time, but I feel the "alignment" claim needs to be examined further.

2 changes: 1 addition & 1 deletion paper_notes/mpdm.md
@@ -26,7 +26,7 @@ Despite simple design, MPDM is a pioneering work in decision making
- Approximate interaction with deterministic closed-loop simulation. Given a sampled policy and the driver model, the behaviors of other agents are deterministic.
- The decoupling of vehicle behaviors: the instantaneous behaviors of the agents are independent of each other.
- The formulation is highly inspiring and is the foundation of [EPSILON](epsilon.md) and all follow-up works.
- The horizon is 10s with 0.25s timesteps (see the sketch below).
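
A minimal 1-D sketch of that decision loop under toy assumptions (two agents on one lane, an IDM-like driver model, made-up cost weights; the real system forward-simulates full multi-lane scenes with semantic policies such as lane changes):

```python
from dataclasses import dataclass, replace

@dataclass
class Agent:
    s: float  # longitudinal position (m)
    v: float  # speed (m/s)

def idm_accel(ego, lead, v0=15.0, T=1.5, a=1.5, b=2.0, s0=2.0):
    # simplified IDM driver model: deterministic given the scene state
    gap = max(lead.s - ego.s, 0.1)
    s_star = s0 + ego.v * T + ego.v * (ego.v - lead.v) / (2 * (a * b) ** 0.5)
    return a * (1 - (ego.v / v0) ** 4 - (s_star / gap) ** 2)

def rollout(policy, ego, lead, horizon_s=10.0, dt=0.25):
    """Deterministic closed-loop rollout of one candidate ego policy:
    40 steps of 0.25 s cover the 10 s horizon."""
    cost = 0.0
    for _ in range(int(horizon_s / dt)):
        a_ego = policy(ego, lead)
        a_lead = 0.0  # lead cruises here; in MPDM it runs a driver model too
        ego = replace(ego, s=ego.s + ego.v * dt, v=max(ego.v + a_ego * dt, 0.0))
        lead = replace(lead, s=lead.s + lead.v * dt, v=max(lead.v + a_lead * dt, 0.0))
        # stage cost: deviation from desired speed + harsh collision penalty
        cost += 0.01 * (ego.v - 13.0) ** 2 + (100.0 if lead.s - ego.s < 2.0 else 0.0)
    return cost

policies = {
    "accelerate": lambda e, l: 0.5,
    "follow": lambda e, l: idm_accel(e, l),
}
ego, lead = Agent(s=0.0, v=12.0), Agent(s=30.0, v=10.0)
best = min(policies, key=lambda name: rollout(policies[name], ego, lead))
```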

#### Technical details
- How important is closed-loop realism in MPDM? The paper seems to argue that inaccuracy in the closed-loop simulation does not affect final algorithm performance that much; whether the simulation is closed-loop at all seems to be the key.
19 changes: 19 additions & 0 deletions paper_notes/qcnet.md
@@ -0,0 +1,19 @@
# [QCNet: Query-Centric Trajectory Prediction](https://openaccess.thecvf.com/content/CVPR2023/papers/Zhou_Query-Centric_Trajectory_Prediction_CVPR_2023_paper.pdf)

_June 2024_

tl;dr: Query-centric prediction that marries agent-centric and scene-centric prediction.

#### Overall impression
Winning solution on the Argoverse and Waymo prediction benchmarks.

#### Key ideas
- Local coordinate system for each agent that leverages invariance to global translation and rotation (see the sketch after this list).
- Long-horizon prediction over 6-8s is achieved by AR decoding of 1s at a time, followed by a trajectory refiner. --> This suggests that a target-oriented approach such as [TNT](tnt.md) might have been too hard. [TNT](tnt.md) seems to have been proposed to minimize FDE directly.
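
A minimal sketch of the query-centric trick (simplified and with assumed shapes; the paper embeds these relative features with Fourier features and fuses them inside attention, over map elements as well as agents):

```python
import torch

def relative_pose_features(pos, heading):
    """pos: (N, 2), heading: (N,) for N scene elements in global coordinates.
    Returns (N, N, 5) features of every element j expressed in element i's
    local frame, invariant to global translation and rotation of the scene."""
    rel = pos.unsqueeze(0) - pos.unsqueeze(1)  # rel[i, j] = pos[j] - pos[i]
    cos, sin = heading.cos().unsqueeze(1), heading.sin().unsqueeze(1)
    # rotate each row of displacements into element i's heading frame
    x = cos * rel[..., 0] + sin * rel[..., 1]
    y = -sin * rel[..., 0] + cos * rel[..., 1]
    rel_heading = heading.unsqueeze(0) - heading.unsqueeze(1)
    return torch.stack(
        [x, y, rel.norm(dim=-1), rel_heading.cos(), rel_heading.sin()], dim=-1)

pos = torch.tensor([[0.0, 0.0], [10.0, 5.0], [3.0, -2.0]])
heading = torch.tensor([0.0, 1.57, -0.5])
feats = relative_pose_features(pos, heading)  # (3, 3, 5)
# shifting or rotating the whole scene leaves `feats` unchanged, so encoded
# scene elements can be shared across agents and cached across time steps
```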

#### Technical details
- Summary of technical details, such as important training details, or bugs of previous benchmarks.

#### Notes
- [Tech blog in Chinese by 周梓康](https://mp.weixin.qq.com/s/Aek1ThqbrKWCSMHG6Xr9eA)
