July 2019
tl;dr: Estimate the intrinsics in addition to the extrinsics of the camera from any video.
This work eliminates the assumption that the camera intrinsics are available, which opens up a whole lot of possibilities to learn from a much wider range of videos.
The network regresses depth, ego-motion, object motion, and camera intrinsics from monocular videos.
- Estimate each of the intrinsic parameters (focal lengths and principal point) from the video itself; see the intrinsics sketch after this list.
- Occlusion-aware loss: only the most foreground (un-occluded) pixels are used when calculating the photometric loss; see the loss sketch after this list.
- A foreground (possible-mobility) mask is used to exclude possibly moving objects from the loss.
- Use randomized layer normalization (this is quite weird); see the normalization sketch after this list.
- Sometimes a single overall supervision signal is shared by two tightly coupled parameters, and it is not enough to get an accurate estimate for each parameter individually (cf. Deep3Dbox).
- In detail, how was the lens (distortion) correction regressed?
- See the interview with the CEO of isee on this paper.
- Q: Can we project the intermediate representation (3D points) to BEV instead of back to the camera plane for the loss calculation? This would eliminate the need for the occlusion-aware loss.
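
Below is a minimal sketch of how a head on the motion network could regress per-frame intrinsics and assemble them into a camera matrix K. This is not the paper's architecture; the module name, layer sizes, and activations are my assumptions.

```python
# Hypothetical intrinsics head: softplus keeps focal lengths positive,
# sigmoid keeps the principal point inside the image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntrinsicsHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc_focal = nn.Linear(feat_dim, 2)   # -> (fx, fy)
        self.fc_center = nn.Linear(feat_dim, 2)  # -> (cx, cy)

    def forward(self, feat, img_w, img_h):
        size = torch.tensor([img_w, img_h], dtype=feat.dtype, device=feat.device)
        focal = F.softplus(self.fc_focal(feat)) * size
        center = torch.sigmoid(self.fc_center(feat)) * size
        K = torch.zeros(feat.shape[0], 3, 3, dtype=feat.dtype, device=feat.device)
        K[:, 0, 0], K[:, 1, 1] = focal[:, 0], focal[:, 1]
        K[:, 0, 2], K[:, 1, 2] = center[:, 0], center[:, 1]
        K[:, 2, 2] = 1.0
        return K
```

Lens distortion would need an additional output on top of this; how exactly the paper regresses that correction is the open question above.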
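A minimal sketch of how I read the occlusion-aware photometric loss together with the possible-motion mask (tensor names and shapes are assumptions, not the paper's code): the loss is accumulated only where the surface warped from the source frame is the most foreground one, and pixels flagged as possibly moving are dropped.

```python
import torch

def occlusion_aware_photometric_loss(warped_src, warped_depth, tgt_img, tgt_depth,
                                     mobile_mask=None, eps=1e-3):
    """warped_src/tgt_img: (B, 3, H, W); warped_depth/tgt_depth/mobile_mask: (B, 1, H, W)."""
    # Keep only pixels where the warped surface lies in front of (or on) the
    # target frame's surface, i.e. the most foreground, un-occluded pixels.
    visible = (warped_depth <= tgt_depth + eps).float()
    if mobile_mask is not None:
        # Drop pixels that may belong to moving objects (mask == 1 means "may move").
        visible = visible * (1.0 - mobile_mask)
    diff = (warped_src - tgt_img).abs()
    return (diff * visible).sum() / (visible.sum() * diff.shape[1] + 1e-8)
```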
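My understanding of randomized layer normalization, sketched below: plain layer normalization whose statistics are multiplied by Gaussian noise during training, so that some of batch norm's regularizing stochasticity is kept without depending on the batch. The normalization axes and the noise scale are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class RandomizedLayerNorm(nn.Module):
    """Layer normalization with multiplicative Gaussian noise on the statistics
    at training time. Axes and sigma are assumptions for illustration."""

    def __init__(self, num_channels, sigma=0.5, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.sigma = sigma
        self.eps = eps

    def forward(self, x):
        # Per-sample, per-channel statistics over the spatial dimensions,
        # so nothing depends on the batch (unlike batch norm).
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True)
        if self.training:
            # Multiplicative Gaussian noise on the statistics acts as a
            # regularizer in place of batch norm's batch-level stochasticity.
            mean = mean * (1.0 + torch.randn_like(mean) * self.sigma)
            var = var * (1.0 + torch.randn_like(var) * self.sigma)
        # clamp guards against the noise making the variance negative.
        return self.gamma * (x - mean) / torch.sqrt(var.clamp(min=0.0) + self.eps) + self.beta
```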