October 2019
tl;dr: Mono 3DOD by estimating pose and shape of vehicles and render-and-compare loss.
3DOD is critical for prediction and path planning. However 3D ground truth is hard to obtain. 3D RCNN only needs 2D annotation (depth and semantic segmeantion). It also need accurate intrinsics/extrinsics to make it work.
This video seems to stem from the concept of this video of PASCAL 3D
First learn the low-dimensional space from CAD models for each subtype. PCA is used. But AutoEncoder seems also OK, such as RoI10D which are heavily inspired by this work and seems more practical.
Analysis by synthesis: Estimate the shape, pose and size parameters of the cars, and render (synthesize) the scene. Then the mask and depth map are compared with ground truth to generate loss.
The shape and pose are weakly supervised and arise from end-to-end training.
- Estimate 2D bbox, 3D center projected on 2D from RoIAligned features. Estimate shape and pose with RoIAligned feature concatenated with the intrinsics of virtual RoI camera.
- Latent space of car shapes
- The authors argue that it is hard to predict 3D property such as shape and pose from RoIAligned features. --> RoI10D did use the RoIAligned features.
- Improved multi-bin loss: weighted sum of bin center by confidence score, and L1 loss. This way there is no need to regress for the residual like the deep3dbox.
- Render and compare uses operation of CUDA-OPENGL, and seems quite engineering heavy to make this work.
- 3D dataset with pose estimation
- Pascal 3D
- multiview car dataset: cars on turntable
- PCA is simple and efficient, but RoI10D reported degeneracy of PCA models and favors 3D auto-encoder.