September 2020
tl;dr: Predict depth distribution of each pixel for differentiable rendering of a BEV map.
The paper builds on quite a few previous works such as OFT, PyrOccNet, MonoLayout and pseudo-lidar.
It proposes probabilistic 3D lifting by predicting a depth distribution for each pixel in the RGB image. In a way, it unifies the one-hot lifting of pseudo-lidar and the uniform lifting of OFT. This is a trick commonly used in differentiable rendering. --> Actually, Pseudo-Lidar v3 also uses this soft rasterizing trick to make depth lifting and projection differentiable.
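To make the unification concrete, here is a minimal sketch in plain Python. The names `lift_pixel` and `context`, the 4-bin depth grid, and the toy numbers are illustrative assumptions, not the paper's code. Each pixel's context feature vector is scaled by its depth probabilities (an outer product), so a one-hot distribution behaves like pseudo-lidar-style lifting and a uniform one like OFT-style lifting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(depth_probs, context):
    """Outer product of depth probabilities (D) and a feature vector (C).

    Returns a D x C nested list: one scaled copy of the feature per depth bin.
    """
    return [[p * c for c in context] for p in depth_probs]


D = 4                       # toy number of depth bins (assumption, for readability)
context = [1.0, -2.0, 0.5]  # toy C=3 feature vector for a single pixel

one_hot = [0.0, 0.0, 1.0, 0.0]           # all mass on one depth: pseudo-lidar style
uniform = [1.0 / D] * D                  # equal mass on every depth: OFT style
soft = softmax([0.1, 2.0, 0.3, -1.0])    # a predicted distribution in between

for name, probs in [("one_hot", one_hot), ("uniform", uniform), ("soft", soft)]:
    lifted = lift_pixel(probs, context)
    # Summing over the depth axis recovers the original feature,
    # because the probabilities sum to 1.
    collapsed = [sum(row[c] for row in lifted) for c in range(len(context))]
    print(name, [round(v, 3) for v in collapsed])
```

Because the lifted features are a smooth function of the depth logits, gradients from a BEV loss can flow back into the depth prediction, which a hard one-hot assignment would not allow.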
The semantic BEV map prediction needs to fuse predictions from all cameras into a single cohesive representation of the scene. This is full representation learning of the entire 360° scene local to the ego vehicle, conditioned exclusively on camera input. The ultimate goal of BEV map prediction is to learn a dense representation for motion planning.
Fishing Net uses BEV grid resolutions of 10 cm and 20 cm per pixel. Lift Splat Shoot uses 50 cm per pixel. Both are coarser than the typical 4 cm or 5 cm per pixel resolution used for mapping, as in DAGMapper.
- View transformation: Probabilistic pixel-wise depth prediction
- Lift: probabilistic (and differentiable) 3D lifting.
- [4, 45] meters, 1 meter bin. Very much like DORN.
- Essentially each pixel in (u, v) creates 42 3D points. This is a huge point cloud.
- Splat: point pillar generation
- Shoot: motion planning. Predict a distribution over K templates.
- Lift-Splat has 3D structure at initialization. This is better than the baseline methods used by MonoLayout.
- "Resolution":
- Camera images: HxW = 128x352
- BEV grid: XxY, 200x200 @ 0.5 m/pixel = 100m x 100m
- Depth resolution: [4, 45] meters @ 1 meter interval.
- Frustum pooling via cumsum trick (integral image)
- Robust training
- Camera dropout during training adds to the robustness --> similar to the input dropout of HD maps in PIXOR++.
- Training with noisy extrinsics makes the network more robust to calibration noise.
- The next step is to use a video pipeline to boost depth prediction accuracy.
- Code available on GitHub
- Twitter feed
- Why the outer product?
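
The splat step and the cumsum trick listed above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the released implementation: `cell_index` and `cumsum_pool` are hypothetical helper names, features are scalars rather than vectors, and the grid is assumed to be ego-centered over [-50 m, 50 m) so that 200 cells at 0.5 m cover 100 m per axis. The idea is to sort the lifted points by BEV cell, take a running sum of their features, and read off each cell's total as the difference of the running sum at consecutive cell boundaries.

```python
import math
from itertools import accumulate


def cell_index(x, y, lo=-50.0, res=0.5, n=200):
    """Map a metric (x, y) to a flattened BEV cell id, or None if off-grid.

    Assumes a square, ego-centered grid starting at `lo` meters on both axes.
    """
    ix = math.floor((x - lo) / res)
    iy = math.floor((y - lo) / res)
    if 0 <= ix < n and 0 <= iy < n:
        return ix * n + iy
    return None


def cumsum_pool(cell_ids, feats):
    """Sum-pool scalar features per cell via sort + cumulative sum.

    Returns a dict {cell_id: summed feature}.
    """
    # Sort point indices so that points in the same cell are contiguous.
    order = sorted(range(len(cell_ids)), key=lambda i: cell_ids[i])
    ids = [cell_ids[i] for i in order]
    running = list(accumulate(feats[i] for i in order))

    pooled = {}
    prev = 0.0  # running sum at the end of the previous cell's run
    for k, cid in enumerate(ids):
        is_last_in_cell = k == len(ids) - 1 or ids[k + 1] != cid
        if is_last_in_cell:
            pooled[cid] = running[k] - prev
            prev = running[k]
    return pooled


# Toy usage: five lifted points with scalar features falling into three cells.
toy_ids = [2, 0, 2, 1, 0]
toy_feats = [1.0, 2.0, 3.0, 4.0, 5.0]
print(cumsum_pool(toy_ids, toy_feats))
```

With tensors, the same pattern replaces a per-cell loop with one sort, one cumulative sum, and one boundary mask, which is why it scales to the very large point cloud produced by lifting every pixel to many depths.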