SGDepth: Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance

August 2020

tl;dr: Build a Mannequin-like dataset for monodepth. Use segmentation masks to filter out truly moving objects.

Overall impression

The paper addresses the moving object issue by adaptively filtering out regions with large dynamic movement. The motion segmentation idea was explored before in Competitive Collaboration.

Segmentation techniques are also used in Every Pixel Counts, which proposes an implicit binary segmentation. SGDepth does not extend the image projection model to include cars, but simply excludes the car pixels. This alone would lead to poor performance, as the depth of car pixels would not be learned at all.

However, this method still seems to suffer from the infinite depth problem. We need to integrate the depth estimation with depth hints; PackNet-SG provides an intuitive way to address this.

SGDepth develops a method to detect frames with non-moving cars, similar to that of the Mannequin dataset. In other words, moving cars should be excluded from the loss computation, while stationary cars should still be used.

Key ideas

  • Major problems with monodepth
    • Occlusion/disocclusion
    • Static frames (little ego motion)
    • DC objects (dynamic-class objects: cars, pedestrians, etc.)
    • Monodepth2 tackles the first two with the minimum reprojection loss and automasking; most previous works left the third issue open.
  • Loss
    • Min reproj loss
    • Smoothness loss
  • Warping masks and masking out cars (see the sketch after this list)
    • Masks are warped like the input images, but with nearest-neighbor sampling, since class indices in semantic segmentation maps have no ordinal meaning.
    • If the warped mask and the predicted mask on the target image have a large IoU, we can assume the cars in the scene are not moving and use the frame for training as-is. Otherwise we filter out all car pixels in the scene.
    • Scheduling of the masking threshold: the threshold is dynamically determined by the fraction of images to be filtered out. As training progresses, fewer and fewer images are masked out; masking mainly guides training in the beginning, and the network gradually sees more noisy samples.
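
A minimal PyTorch sketch of this masking scheme, as referenced in the list above. All function names, tensor shapes, and the fixed IoU threshold are my own assumptions for illustration; the paper schedules the threshold over training rather than fixing it.

```python
# Sketch of the DC-object masking (names, shapes, and the fixed threshold
# are illustrative assumptions; the paper schedules the threshold instead).
import torch
import torch.nn.functional as F

def warp_mask_nearest(src_mask, grid):
    """Warp a semantic mask like an image, but with nearest-neighbor sampling.

    src_mask: (B, 1, H, W) integer class map, passed as float for grid_sample.
    grid:     (B, H, W, 2) sampling grid in [-1, 1], e.g. from projected depth.
    """
    # mode="nearest" keeps class indices intact; bilinear would blend them.
    return F.grid_sample(src_mask.float(), grid, mode="nearest",
                         padding_mode="border", align_corners=False)

def dc_iou(warped_mask, target_mask, dc_classes):
    """IoU of dynamic-class (DC) pixels between warped source and target masks."""
    warped_dc = torch.zeros_like(warped_mask, dtype=torch.bool)
    target_dc = torch.zeros_like(target_mask, dtype=torch.bool)
    for c in dc_classes:
        warped_dc |= warped_mask == c
        target_dc |= target_mask == c
    inter = (warped_dc & target_dc).float().sum(dim=(1, 2, 3))
    union = (warped_dc | target_dc).float().sum(dim=(1, 2, 3))
    return inter / union.clamp(min=1.0)  # (B,) one IoU per image

def masked_min_reprojection_loss(reproj_losses, target_mask, iou_per_img,
                                 dc_classes, iou_thresh=0.5):
    """Monodepth2-style per-pixel minimum reprojection loss, with DC pixels
    zeroed out in frames where warped and predicted masks disagree (moving)."""
    # Minimum over source frames handles occlusion/disocclusion.
    min_loss, _ = torch.min(torch.stack(reproj_losses, dim=1), dim=1)  # (B,1,H,W)
    dc = torch.zeros_like(target_mask, dtype=torch.bool)
    for c in dc_classes:
        dc |= target_mask == c
    # Keep DC pixels only in frames whose masks are consistent (static cars).
    keep = (iou_per_img > iou_thresh).view(-1, 1, 1, 1) | ~dc
    return (min_loss * keep.float()).sum() / keep.float().sum().clamp(min=1.0)
```

The nearest-neighbor `grid_sample` is the key detail here: bilinear sampling would blend neighboring class indices into meaningless intermediate values.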

Technical details

Notes

  • Code on GitHub
  • Can we use optical flow and epipolar constraints to do motion segmentation? (See the sketch below.)
  • With motion segmentation, we could also tell whether a car is parked or moving.
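
A rough sketch of the motion segmentation idea in the note above, using OpenCV's Farneback optical flow and the Sampson distance to a RANSAC-fitted fundamental matrix. Everything here (function names, thresholds) is my own assumption, not from the paper.

```python
# Speculative sketch: flag pixels whose flow violates the epipolar constraint.
import cv2
import numpy as np

def motion_segmentation(img1, img2, sampson_thresh=1.0):
    """Return a boolean (H, W) map; True = pixel likely on a moving object."""
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = g1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts1 = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    pts2 = (pts1 + flow.reshape(-1, 2)).astype(np.float32)

    # Fit F with RANSAC; the static background should dominate the inliers.
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)

    # Sampson distance of each flow correspondence to the epipolar geometry.
    ones = np.ones((pts1.shape[0], 1), np.float32)
    x1 = np.hstack([pts1, ones])
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T       # (N, 3) epipolar lines F x1
    Ftx2 = x2 @ F        # (N, 3) epipolar lines F^T x2
    x2Fx1 = np.sum(x2 * Fx1, axis=1)
    denom = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    sampson = x2Fx1**2 / np.maximum(denom, 1e-8)

    return (sampson > sampson_thresh).reshape(h, w)
```

One caveat: this cannot flag objects moving along their own epipolar lines (e.g. a lead car driving at the ego vehicle's speed), which is exactly the infinite depth failure case mentioned above.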