January 2020
tl;dr: Generate synthetic views of the image to reduce the complexity of monocular 3D object detection (3D MOD) neural networks.
The paper builds on the work of monoDIS. The main idea is that a standard network has to build distinct representations devoted to recognizing objects at specific depths, with little margin for generalization across depths. As a result, the network's capacity has to scale with the depth range, and so does the amount of training data.
This is a classical tradeoff of model/data complexity vs inference complexity. If the image has an inherent structure (in autonomous driving camera images, closer objects appear at the bottom of the image and farther objects appear higher up), it can be exploited using row-aware or depth-aware convolution (cf M3D RPN). In this paper, they build a row-wise image pyramid from the original image.
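The row/depth structure mentioned above follows from the pinhole model. A minimal sketch, assuming a flat ground plane, a camera with zero pitch, and illustrative (not paper-provided) values for `f_y`, `c_y` and `cam_height`:

```python
def ground_row(z, f_y=700.0, c_y=180.0, cam_height=1.65):
    """Image row (pixels, measured from the top) of a ground point at depth z.

    Assumptions for illustration only: pinhole camera, flat ground,
    horizontal optical axis. f_y, c_y and cam_height are made-up,
    KITTI-like numbers, not values taken from the paper.
    """
    if z <= 0:
        raise ValueError("depth must be positive")
    # A ground point sits cam_height below the optical axis,
    # so it projects f_y * cam_height / z pixels below the principal point.
    return c_y + f_y * cam_height / z


for z in (10.0, 25.0, 50.0):
    print(f"depth {z:5.1f} m -> row {ground_row(z):.1f}")
# depth  10.0 m -> row 295.5
# depth  25.0 m -> row 226.2
# depth  50.0 m -> row 203.1
```

Rows grow toward the bottom of the image, so nearby ground points land lower and distant ones approach the horizon row `c_y`. This is the structure that a row-wise pyramid can exploit.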
The paper also has a good introduction to monocular 3D object detection.
- Training and inference discrepancy
  - Training: train a NN to make correct predictions within a limited depth range.
    - Generate n_v = 8 virtual images per original image.
    - Ground-truth-guided sampling procedure (cf PointRend). The object should be completely visible (not cropped). The virtual camera is randomly shifted by an offset in [-Z_res/2, 0].
    - GT falling outside the preset depth range [0, Z_res] is set to ignore/dont_care.
    - The depth is shifted by Z_v to ensure depth invariance.
  - Inference:
    - Sample a virtual camera every Z_res/2 (cut out horizontal strips).
    - Rescale the strips so that they all have the same height.
- The paper gives a detailed analysis of the virtual camera intrinsic parameters, but does not use them for training or inference. The procedure is basically crop and rescale.
- Questions and notes on how to improve/revise the current work
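
The training and inference sampling described in the bullets above can be sketched as follows. This is one reading of these notes, not the paper's implementation: the function names, the clamping of Z_v at zero, and the values `z_res = 5.0` and `z_max = 20.0` are assumptions made for illustration.

```python
import math
import random


def virtual_camera_depths(z_max, z_res):
    """Inference: place a virtual camera every z_res / 2 until z_max is covered."""
    step = z_res / 2.0
    return [i * step for i in range(math.ceil(z_max / step))]


def relative_depth(z_obj, z_v, z_res):
    """Depth as seen from a virtual camera at z_v.

    Returns None when the object falls outside [0, z_res],
    i.e. the ground truth would be marked ignore/dont_care.
    """
    z_rel = z_obj - z_v  # shifting by z_v is what makes the target depth-invariant
    return z_rel if 0.0 <= z_rel <= z_res else None


def sample_training_cameras(z_gt, z_res, n_v=8, rng=random):
    """Training: GT-guided sampling of n_v virtual cameras around one object.

    Assumption: the camera sits at the object depth plus a random shift
    drawn from [-z_res / 2, 0], clamped so that z_v is never negative.
    """
    return [max(0.0, z_gt + rng.uniform(-z_res / 2.0, 0.0)) for _ in range(n_v)]


# Inference: which virtual cameras see an object at 6 m, and at what relative depth?
z_res, z_max = 5.0, 20.0
hits = [
    (z_v, relative_depth(6.0, z_v, z_res))
    for z_v in virtual_camera_depths(z_max, z_res)
    if relative_depth(6.0, z_v, z_res) is not None
]
print(hits)  # [(2.5, 3.5), (5.0, 1.0)]

# Training: every sampled camera keeps the object inside [0, z_res].
cams = sample_training_cameras(z_gt=6.0, z_res=z_res, rng=random.Random(0))
print(all(relative_depth(6.0, z_v, z_res) is not None for z_v in cams))  # True
```

Because the cameras are spaced Z_res/2 apart while each one covers a range of Z_res, neighbouring depth windows overlap, so an object is generally seen by more than one strip. In every strip the network only ever has to regress a depth in [0, Z_res].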