PYVA: Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation
September 2021
tl;dr: Transformers to lift image to BEV.
This paper uses a cross-attention transformer structure (although the authors do not spell that out explicitly) to lift image features to BEV, then performs road layout and vehicle segmentation on the BEV features.
It is difficult for a CNN to fit a view projection model because convolutional layers have locally confined receptive fields. Transformers are better suited to this job because of their global attention mechanism.
Road layout provides crucial context for inferring the position and orientation of vehicles. The paper introduces a context-aware discriminator loss to refine the results.
- CVP (cycled view projection)
- 2-layer MLP to project image feature X to BEV feature X', following VPN
- Add a cycle consistency loss (X' is projected back to the image view as X'' and compared with X) to ensure that X' retains most of the information in X
- CVT (cross view transformer)
- X' as query, X as key, X'' as value
- Context-aware Discriminator. This follows MonoLayout but takes it one step further.
- distinguish predicted and gt vehicles
- distinguish predicted and gt correlation between vehicle and road
- Summary of technical details
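The CVP and CVT bullets above can be made concrete with a small NumPy sketch. This is an illustration of the general idea only, not the paper's implementation: the tensor sizes, hidden width, L1 cycle loss, plain softmax attention, and residual fusion are all assumptions, and the helper names (`init_mlp`, `mlp`, `cross_attention`) are hypothetical. It reads "X/X'' as key/value" as X being the key and X'' the value.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_in, n_hidden, n_out, scale=0.1):
    """Random weights for a 2-layer MLP (hypothetical helper)."""
    return (rng.normal(0.0, scale, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, scale, (n_hidden, n_out)), np.zeros(n_out))

def mlp(x, w1, b1, w2, b2):
    """2-layer MLP with ReLU, applied along the flattened spatial axis."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Generic scaled dot-product attention on (tokens, channels) matrices."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ value, weights

# Toy sizes (assumed): C channels, H x W feature map flattened to N positions.
C, H, W = 8, 4, 4
N = H * W
X = rng.normal(size=(C, N))                  # front-view feature X

# CVP: MLP over spatial positions (VPN-style), plus the reverse projection.
to_bev = init_mlp(N, 32, N)
to_front = init_mlp(N, 32, N)
X_bev = mlp(X, *to_bev)                      # X'  (BEV feature)
X_cycle = mlp(X_bev, *to_front)              # X'' (projected back to front view)
cycle_loss = np.abs(X - X_cycle).mean()      # assumed L1 cycle consistency

# CVT: X' queries the front-view features; each spatial position is a token.
attended, weights = cross_attention(X_bev.T, X.T, X_cycle.T)
fused = X_bev.T + attended                   # assumed residual fusion, shape (N, C)
```

Every BEV position can attend to every image position here, which is the global receptive field that the notes argue a purely convolutional projection lacks. In training, `cycle_loss` would be added to the segmentation and discriminator losses.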