SGDepth: Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
August 2020
tl;dr: Build a Mannequin dataset for monodepth. Use segmentation masks to filter out objects that are really moving.
The paper addresses the moving-object issue by adaptively filtering out regions that have large dynamic movement. The motion segmentation idea was explored before in Competitive Collaboration.
Segmentation techniques are also used in Every Pixel Counts, which proposes an implicit binary segmentation. SGDepth does not extend the image projection model to include cars; it simply excludes the car pixels. But this alone would lead to poor performance, as the depth of car pixels would not be learned at all.
But this method still seems to suffer from the infinite depth problem. We need to integrate the depth estimation with depth hints. PackNet-SG provides an intuitive way to
SGDepth develops a method to detect frames with non-moving cars, similar to the idea behind the Mannequin dataset. In other words, moving cars should be excluded from the loss computation, while stationary cars should still be used.
- Major problems with monodepth
- Occlusion/disocclusion
- Static frames (little ego motion)
- DC objects (Dynamic Class objects, cars/pedestrians/etc)
- Monodepth2 tackles the first two by minimum reprojection loss and automasking. Most previous projects left the third issue open.
- Loss
- Min reproj loss
- Smoothness loss
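
A minimal sketch of the per-pixel minimum reprojection loss with automasking, in the Monodepth2 style mentioned above. It assumes the photometric errors have already been computed per pixel; the function name and the flat-list layout are illustrative choices, not taken from the SGDepth code.

```python
def min_reprojection_loss(warped_errors, identity_errors):
    """Mean of the per-pixel minimum photometric error, with automasking.

    warped_errors:   one flat per-pixel error list per source frame,
                     computed between the target and the warped source.
    identity_errors: same layout, computed between the target and the
                     unwarped source (used only for automasking).
    """
    n_pixels = len(warped_errors[0])
    total, kept = 0.0, 0
    for p in range(n_pixels):
        # Taking the minimum over source frames handles occlusion/disocclusion.
        best_warped = min(err[p] for err in warped_errors)
        best_identity = min(err[p] for err in identity_errors)
        # Automask: drop pixels that look the same without warping
        # (static frames, or objects moving along with the camera).
        if best_warped < best_identity:
            total += best_warped
            kept += 1
    return total / kept if kept else 0.0


# Two source frames, three pixels; the last pixel is automasked.
warped = [[0.2, 0.5, 0.9], [0.4, 0.1, 0.8]]
identity = [[0.6, 0.3, 0.05], [0.7, 0.6, 0.1]]
loss = min_reprojection_loss(warped, identity)
```

Note that automasking only removes objects that move with roughly the same velocity as the camera; other moving DC objects still contaminate the loss, which is the gap the semantic masking below is meant to close.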
- Warping mask and Masking out cars
- Like warping input images, but uses nearest neighbor sampling as the pixel values in semantic segmentation results do not have ordinal meaning.
- If the warped mask and the predicted mask on the target image have a large IoU, then we can assume that the cars in the scene are not moving, and the image can be used for training as is. Otherwise we need to filter out all cars in the scene.
- Scheduling of masking thresholds: The threshold is determined dynamically by the fraction of images to be filtered out. As training progresses, fewer and fewer images are masked. Masking only guides training in the beginning, and the network sees more and more noisy samples later on.
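
The masking steps above can be sketched as follows. This is an illustration under assumptions, not the SGDepth implementation: masks are binary DC-class masks stored as nested lists, `sample_xy` stands in for the projected sampling coordinates, and the linear decay in `mask_fraction` is a hypothetical schedule (the notes only say the masked fraction shrinks during training).

```python
import math


def warp_mask_nearest(mask, sample_xy):
    """Warp a label mask with nearest-neighbor sampling.

    sample_xy[y][x] is the (x, y) source coordinate for target pixel (x, y).
    Labels are categorical, so they are never interpolated.
    """
    h, w = len(mask), len(mask[0])
    out = []
    for row in sample_xy:
        out_row = []
        for x, y in row:
            xi, yi = math.floor(x + 0.5), math.floor(y + 0.5)
            inside = 0 <= xi < w and 0 <= yi < h
            out_row.append(mask[yi][xi] if inside else 0)
        out.append(out_row)
    return out


def dc_iou(warped_mask, target_mask):
    """IoU between the warped DC mask and the DC mask predicted on the target."""
    pairs = [(a, b) for ra, rb in zip(warped_mask, target_mask)
             for a, b in zip(ra, rb)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    # Assumption: an image without DC pixels counts as static.
    return inter / union if union else 1.0


def mask_fraction(epoch, total_epochs):
    """Hypothetical linear schedule: mask everything first, nothing at the end."""
    return max(0.0, 1.0 - epoch / total_epochs)


def iou_threshold(ious, fraction):
    """Threshold such that about `fraction` of the samples have IoU below it."""
    ranked = sorted(ious)
    k = int(fraction * len(ranked))
    if k <= 0:
        return 0.0
    if k >= len(ranked):
        return float("inf")
    return ranked[k]


# Samples whose IoU falls below the threshold get their DC pixels masked out.
ious = [0.9, 0.1, 0.7, 0.4]
threshold = iou_threshold(ious, mask_fraction(epoch=5, total_epochs=10))
masked = [iou < threshold for iou in ious]
```

Deriving the threshold from a target fraction, rather than fixing an IoU value, keeps the share of masked images under direct control regardless of how the IoU distribution looks on a given dataset.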
- ENet for real time segmentation network
- Uses the same network (encoder + task specific decoder) to do both monodepth and semantic segmentation. This is different from the work in Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-aware Representation.
- Depth map prediction by predicting $1/(a \sigma + b)$, where $\sigma$ is the post-sigmoid prediction.
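
A small sketch of the depth parameterization above. It assumes $a$ and $b$ are chosen so that depth stays within $[d_{min}, d_{max}]$, which is the Monodepth2 convention; the default bounds of 0.1 and 100 are Monodepth2's and are an assumption here, not confirmed for SGDepth.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_to_depth(sigma, min_depth=0.1, max_depth=100.0):
    """Map a post-sigmoid value in [0, 1] to depth via 1 / (a * sigma + b)."""
    b = 1.0 / max_depth        # sigma = 0 gives the farthest depth
    a = 1.0 / min_depth - b    # sigma = 1 gives the nearest depth
    return 1.0 / (a * sigma + b)


# A raw network output (logit) of 0 corresponds to sigma = 0.5.
depth = sigmoid_to_depth(sigmoid(0.0))
```

Under this parameterization, a pixel whose $\sigma$ collapses toward 0 is pushed to the maximum depth, which is how the infinite depth failure on moving cars shows up in the output.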
- code on github
- Can we use optical flow and epipolar constraints to do motion segmentation?
- If we do motion segmentation, then we can also tell if a car is parked or not.
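
One way to make the question above concrete: for a static scene point, a correspondence $(x_1, x_2)$ between two frames must satisfy the epipolar constraint $x_2^T E x_1 = 0$, so a large residual hints at independent motion. The sketch below is only an illustration with hypothetical helper names and a hypothetical tolerance; it assumes normalized image coordinates, a known ego-motion, and a pure translation (so $E = [t]_\times$).

```python
def skew(t):
    """Cross-product matrix [t]_x; equals the essential matrix when R = I."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def epipolar_residual(E, x1, x2):
    """Algebraic residual |x2^T E x1| for homogeneous normalized points."""
    Ex1 = [sum(E[r][c] * x1[c] for c in range(3)) for r in range(3)]
    return abs(sum(x2[r] * Ex1[r] for r in range(3)))


def is_moving(E, x1, x2, tol=0.01):
    """Flag a flow correspondence as dynamic (tol is a hypothetical value)."""
    return epipolar_residual(E, x1, x2) > tol


# Camera translates along x, so static points only shift horizontally.
E = skew((1.0, 0.0, 0.0))
static_pt = is_moving(E, (0.2, 0.3, 1.0), (0.25, 0.3, 1.0))
moving_pt = is_moving(E, (0.2, 0.3, 1.0), (0.25, 0.4, 1.0))
```

A known limitation of this test is that objects moving along their epipolar line, such as a car driving straight ahead of the ego vehicle, produce no residual, so the epipolar check alone cannot separate a parked car from a leading one.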