lighthead_rcnn.md

April 2019

tl;dr: Faster than two-stage detectors and more accurate than one-stage detectors.

Overall impression

The paper analyzes the computation burden in Faster RCNN and R-FCN, and proposes a more balanced network. The authors' fine-tune-fu is amazing.

It is now possible to integrate FPN into R-FCN with the changed architecture of light head RCNN.

PS RoIPooling is replaced with PS RoIAlign. This change alone improves AP by more than 1 point. --> PS RoIAlign is further extended to rotated PS RoIAlign in RoI Transformer.
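To make the position-sensitive pooling step concrete, here is a minimal NumPy sketch of plain PS RoI average pooling with quantized bins. It is not PS RoIAlign, which instead samples each bin with bilinear interpolation. The function name `ps_roi_pool`, the channel layout (one group of `c_out` channels per bin, in row-major bin order), and the toy sizes are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def ps_roi_pool(score_map, roi, p, c_out):
    """Position-sensitive RoI average pooling (quantized bins, not RoIAlign).

    score_map: array of shape (p * p * c_out, H, W)
    roi: (x0, y0, x1, y1) in feature-map coordinates
    Returns an array of shape (c_out, p, p).
    """
    channels, height, width = score_map.shape
    assert channels == p * p * c_out
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / p
    bin_h = (y1 - y0) / p
    out = np.zeros((c_out, p, p), dtype=np.float64)
    for i in range(p):          # bin row
        for j in range(p):      # bin column
            # Quantize the bin to integer pixel bounds, keeping at least one pixel.
            ys = min(max(int(np.floor(y0 + i * bin_h)), 0), height - 1)
            ye = min(max(int(np.ceil(y0 + (i + 1) * bin_h)), ys + 1), height)
            xs = min(max(int(np.floor(x0 + j * bin_w)), 0), width - 1)
            xe = min(max(int(np.ceil(x0 + (j + 1) * bin_w)), xs + 1), width)
            # Bin (i, j) reads only its own group of c_out channels (assumed layout).
            k = (i * p + j) * c_out
            out[:, i, j] = score_map[k:k + c_out, ys:ye, xs:xe].mean(axis=(1, 2))
    return out

# Toy thin score map: 10 x 7 x 7 = 490 channels on a 14 x 14 grid.
rng = np.random.default_rng(0)
thin_map = rng.standard_normal((7 * 7 * 10, 14, 14))
pooled = ps_roi_pool(thin_map, roi=(0, 0, 14, 14), p=7, c_out=10)
print(pooled.shape)  # (10, 7, 7)
```

The pooled output has only `c_out` channels per bin, so its size depends on the thin channel count rather than on the number of classes.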

Key ideas

  • Faster RCNN has a heavy head (with two fc layers), and R-FCN has a heavy score map. Even if the base network is made lighter, not much improvement is gained.
    • Faster RCNN's computation grows when more proposals are needed. Its head has a global avg pooling and 2 fc layers, which run once per RoI.
    • R-FCN produces a very large score map with #classes x p x p channels. This makes the per-RoI head nearly cost-free, but the score map itself is expensive.
    • R-FCN cannot leverage FPN directly: applying it to the high-resolution feature maps would consume too much memory.
  • Two main modifications:
    • A large-kernel separable convolution converts the feature map from the backbone into a thin score map. C_mid controls the computational complexity.
    • The score map channels are reduced from #classes x p x p to 10 x p x p. (This is only a reduction when #classes >> 10.) Since 10 is not necessarily #classes, an fc layer is needed for the final prediction.
  • The pooled feature map has only 10 channels.
  • With a lightweight backbone such as Xception, it can achieve ~100 fps inference speed.
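A rough weight count shows why the large-kernel separable convolution is cheap and how C_mid tunes its cost. The sketch below assumes two parallel branches (k x 1 followed by 1 x k, and 1 x k followed by k x 1) and ignores biases. The values k = 15, C_in = 2048 and C_mid in {64, 256} are illustrative assumptions, not numbers taken from these notes.

```python
def dense_conv_params(k, c_in, c_out):
    """Weights of a single dense k x k convolution (no bias)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_mid, c_out):
    """Weights of a two-branch large-kernel separable convolution (no bias).

    Assumed structure: branch A is (k x 1) then (1 x k), branch B is
    (1 x k) then (k x 1); both go c_in -> c_mid -> c_out and are summed.
    """
    one_branch = k * c_in * c_mid + k * c_mid * c_out
    return 2 * one_branch

k, c_in, p = 15, 2048, 7   # illustrative values (assumptions)
c_out = 10 * p * p         # thin score map: 10 x p x p = 490 channels

dense = dense_conv_params(k, c_in, c_out)
for c_mid in (64, 256):
    sep = separable_conv_params(k, c_in, c_mid, c_out)
    print(f"C_mid={c_mid}: {sep:,} weights, {dense / sep:.1f}x fewer than dense")
```

The separable cost is linear in both k and C_mid, whereas the dense k x k convolution is quadratic in k.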

Technical details

  • The score map for COCO is reduced from 3969 channels (7x7x81) to 490 channels (7x7x10).

Notes

  • For special applications like vehicle or pedestrian detection, it perhaps does not save much, as #classes is small (1 or 2).
  • In the CV community, most of the effort goes into building a well-performing general object detector. However, real-world industrial applications need well-performing detectors that focus on specific object classes, most likely with limited computational resources and inference time.
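The first note can be checked with the same channel arithmetic used for COCO. The sketch below compares score-map channel counts only. It assumes R-FCN uses one p x p group per class plus one for background, and the helper names are hypothetical.

```python
def rfcn_channels(num_fg_classes, p=7):
    """R-FCN score-map channels: (classes + background) x p x p (assumed)."""
    return (num_fg_classes + 1) * p * p

def light_head_channels(p=7, c_thin=10):
    """Light-head thin score-map channels: 10 x p x p, independent of #classes."""
    return c_thin * p * p

light = light_head_channels()
for n in (1, 2, 9, 20, 80):
    rfcn = rfcn_channels(n)
    print(f"{n:>2} classes: R-FCN {rfcn:>4} vs light-head {light} ({rfcn / light:.1f}x)")
```

Under these assumptions the two counts match at 9 foreground classes. With 1 or 2 classes the R-FCN score map is already thinner than 490 channels, so the channel reduction only pays off for many-class datasets such as COCO (80 classes, 3969 vs 490).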