November 2019
tl;dr: Use sparse depth measurements plus RGB to generate a dense depth map.
The architecture used in this paper is simple; the main innovation is the new research direction the paper opens up.
RGB-based depth prediction methods are unreliable. Adding ~100 sparse depth samples reduces the root mean square error by over 50% (on both NYU-v2 and KITTI).
- The sparse depth samples can come from low-resolution depth sensors or be computed from the output of feature-based SLAM.
- The authors also mention that even with many samples (>1k) we still see blurry edges. --> The authors suspect that adding skip connections may help. This is further explored in Single Modal Weighted Average.
- The architecture is a simple encoder-decoder with UpConv layers. The sparse depth is concatenated with RGB to form an RGBd input image (see the sketch below).
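A minimal sketch of the RGBd fusion (the tensor names, image size, and first-conv shape are my assumptions, not from the paper):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: fuse RGB (3 ch) and sparse depth (1 ch) into a
# 4-channel RGBd tensor, then feed it to the first encoder conv.
rgb = torch.rand(1, 3, 228, 304)          # color image
sparse_d = torch.zeros(1, 1, 228, 304)    # mostly-zero sparse depth map
rgbd = torch.cat([rgb, sparse_d], dim=1)  # (1, 4, H, W) RGBd input

first_conv = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3)
features = first_conv(rgbd)               # encoder continues from here
```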
- The sparse samples from SLAM/VIO algorithms are too sparse to be directly useful for motion planning. But with the help of the color image, these sparse cues can be reconstructed/densified into a dense depth map.
- The UpProj and UpConv upsampling modules were proposed by Laina et al.; a sketch of UpProj follows.
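A sketch of the UpProj block in the spirit of Laina et al. (unpool, then a two-branch residual-style convolution); the module/variable names and channel sizes here are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unpool(x):
    # 2x "unpooling": each value goes to the top-left of a 2x2 block, zeros elsewhere
    n, c, h, w = x.shape
    out = torch.zeros(n, c, h * 2, w * 2, device=x.device, dtype=x.dtype)
    out[:, :, ::2, ::2] = x
    return out

class UpProj(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.conv_skip = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        x = unpool(x)                              # double spatial resolution
        main = self.conv2(F.relu(self.conv1(x)))   # 5x5 -> ReLU -> 3x3 branch
        skip = self.conv_skip(x)                   # 5x5 projection branch
        return F.relu(main + skip)
```

For example, `UpProj(256, 128)` maps a (1, 256, 8, 10) feature map to (1, 128, 16, 20). UpConv is the simpler variant: just the unpool followed by a single 5x5 convolution.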
- BerHu loss is the inverted Huber (smooth L1) loss: L1 close to the origin and L2 in regions further away. --> But plain L1 seems to perform better than BerHu, and much better than L2 (which produces over-smooth boundaries).
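For reference, the BerHu loss as defined in Laina et al. (the batch-dependent threshold $c$ is their convention; I am assuming the same choice applies here):

$$
\mathcal{B}(e) =
\begin{cases}
|e| & |e| \le c \\
\dfrac{e^2 + c^2}{2c} & |e| > c
\end{cases}
\qquad
c = \frac{1}{5} \max_i |e_i|
$$

where $e$ is the per-pixel prediction error. The quadratic branch is shifted so the two pieces meet at $|e| = c$, keeping the loss continuous.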
- Nearest-neighbor sampling is used to avoid creating spurious sparse depth points.
- During training, each valid depth pixel is sampled with probability m/n (targeting m samples out of n valid pixels) to increase the robustness of the network; a sketch follows below.
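A minimal NumPy sketch of this Bernoulli sampling (the function and parameter names are hypothetical):

```python
import numpy as np

def sample_sparse_depth(depth_gt, m, rng=None):
    """Keep each of the n valid pixels with probability m/n, so roughly
    m sparse samples survive (the exact count varies per draw).

    depth_gt: (H, W) ground-truth depth map, 0 where invalid.
    m: target number of sparse samples.
    """
    rng = rng or np.random.default_rng()
    valid = depth_gt > 0
    n = valid.sum()
    keep = rng.random(depth_gt.shape) < (m / n)
    return np.where(valid & keep, depth_gt, 0.0)
```

Resampling a fresh pattern every iteration means the network never sees the same sparse mask twice, which is where the robustness comes from.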
- Three input modalities for depth prediction:
- RGB
- sd (sparse depth)
- RGB-d
- The performance gap between sd (sparse depth) and RGBd shrinks as the number of samples grows: when the number of points reaches 1k (less than 1.5% of total pixels), the color image becomes nearly irrelevant for depth prediction.
- Author's presentation on YouTube.
- Open questions: How accurate are the points generated by SLAM/VIO? Can they replace active depth sensors? Can we do 3D object detection (3DOD) on the sparse point clouds generated by SLAM?