virtual_cam.md

File metadata and controls

33 lines (22 loc) · 2.06 KB

January 2020

tl;dr: Generate synthetic views of the image to reduce the complexity of 3D MOD (monocular object detection) neural networks.

Overall impression

The paper builds on the work of monoDIS. The main idea is that the network has to build distinct representations devoted to recognizing objects at specific depths, with little generalization across depths. As a result, network capacity has to scale with the depth range covered, and training data has to scale up accordingly as well.

This is a classical tradeoff of model/data complexity vs inference complexity. If the image has an inherent structure (in autonomous driving camera images, closer objects appear at the bottom of the image and farther-away objects appear higher up), it can be exploited using row-aware or depth-aware convolution (cf M3D RPN). In this paper, the authors instead build a row-wise image pyramid of the original image.

The paper also has a good introduction to monocular 3D object detection.

Key ideas

  • Training and inference discrepancy
  • Training: train a NN to make correct predictions within a limited depth range.
    • Generate nv = 8 virtual images per original image.
    • Ground-truth-guided sampling procedure (cf PointRend). The object should be completely visible (not cropped). Random shift of the virtual cam by [-Z_res/2, 0].
    • GT falling outside the preset depth range [0, Z_res] is set to ignore/dont_care.
    • The depth is shifted by Zv so the network only regresses a depth-invariant local offset.
  • Inference:
    • Sample a virtual view every Z_res/2 (cut out horizontal strips).
    • Rescale each strip to the same height.
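The inference procedure above can be sketched as follows. This is a minimal reconstruction, not the paper's code: it assumes a flat ground plane and a pinhole camera at height `cam_height`, so a ground point at depth Z projects to image row v = c_y + f_y * cam_height / Z; the function names and the nearest-neighbor resize are my own choices for illustration.

```python
import numpy as np

def virtual_view_rows(f_y, c_y, cam_height, z_near, z_far):
    """Map a ground-plane depth interval [z_near, z_far] to image rows.

    Hypothetical helper assuming a flat ground plane: depth Z projects
    to row v = c_y + f_y * cam_height / Z, so far depths sit higher up.
    """
    v_top = c_y + f_y * cam_height / z_far      # far depth -> higher row
    v_bottom = c_y + f_y * cam_height / z_near  # near depth -> lower row
    return int(round(v_top)), int(round(v_bottom))

def generate_virtual_views(image, f_y, c_y, cam_height, z_res, z_max, out_h):
    """Cut a horizontal strip every z_res/2 and rescale each to height out_h.

    Each virtual view covers the local depth range [z_v, z_v + z_res];
    the detector then only predicts a local depth, and z_v is added back.
    """
    views = []
    z_v = 0.0
    while z_v < z_max:
        v_top, v_bottom = virtual_view_rows(
            f_y, c_y, cam_height, max(z_v, 1e-3), z_v + z_res)
        strip = image[max(v_top, 0):min(v_bottom, image.shape[0])]
        # nearest-neighbor resize to a common height (crop-and-rescale)
        scale = out_h / max(strip.shape[0], 1)
        rows = (np.arange(out_h) / scale).astype(int).clip(0, strip.shape[0] - 1)
        views.append((strip[rows], z_v))
        z_v += z_res / 2.0  # stride of half the depth range per virtual cam
    return views
```

With z_res = 10 m and z_max = 40 m this yields 8 views per image, matching the nv = 8 used at training time in the notes above.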

Technical details

  • The paper gives a detailed analysis of the virtual camera intrinsic parameters, but does not use them for training or inference. In practice, virtual view generation is just crop and rescale.
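For reference, the crop-and-rescale operation does have a simple effect on the pinhole intrinsics, which is presumably what the paper's analysis covers. The sketch below is the standard relation (not code from the paper): a crop shifts the principal point, and an isotropic rescale multiplies focal lengths and principal point by the scale factor.

```python
import numpy as np

def crop_rescale_intrinsics(K, crop_top, crop_left, scale):
    """Intrinsics of the virtual camera obtained by cropping then rescaling.

    Standard pinhole relation: the crop translates the principal point
    (c_x, c_y), and scaling the image multiplies the first two rows of K
    (f_x, f_y, c_x, c_y) by the same factor.
    """
    K_v = K.astype(float).copy()
    K_v[0, 2] -= crop_left  # principal point moves with the crop origin
    K_v[1, 2] -= crop_top
    K_v[:2] *= scale        # rows 0 and 1 carry f_x, c_x and f_y, c_y
    return K_v
```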

Notes

  • Questions and notes on how to improve/revise the current work