Skip to content

Latest commit

 

History

History
35 lines (25 loc) · 2.2 KB

retina_face.md

File metadata and controls

35 lines (25 loc) · 2.2 KB

September 2019

tl;dr: Single stage face detection with landmark regression and dense face regression. Landmark regression helps object detection.

Overall impression

The paper is from the same authors of ArcFace.

Joint face detection and alignment (landmark regression) has been widely used, such as MT-CNN. This paper also adds a dense face regression branch by generating mesh and reproject to 2D image and calculating the photometric loss. This self-supervised branch (can be turned off during inference) helps boost the performance of object detection.

Face detection can be trained on GPU, but inference on CPU is still the most popular choice, for surveillance purpose, etc.

Key ideas

  • Mask RCNN helps with object detection: Dense pixel annotation helps with object detection.
  • Multiple loss:
    • cls loss: face vs not face
    • Face box regression: zero loss for negative anchors
    • Facial landmark regression: zero loss for negative anchors, normalized according to same anchor as face box.
    • Dense regression loss: renders a face mesh and reprojects for self-supervised learning.
  • Context module: inspired by single stage headless detector (SSH) and PyramidBox, basically concats features from different levels together after FPN.
    • input 256 --> 128 --> 64 --> 64
    • 128 + 64 + 64 = 256 output
    • all 3x3 conv replaced with DCN (deformable)
  • Mesh decode is a pretrained model, and it extracts 128 points per face. Then an efficient 3D mesh renderer projects the 128 points to the 2D plane, and compares the photometric loss. Additional parameters in camera and illumination parameters (7+9) for each face.

Technical details

  • Face detection features smaller ratio variations but much larger scale variation.
  • Anchors: 75% of anchors are from P2, high resolution image.
  • Accuracy evaluation normalized by face box size (√WxH)

Notes

  • Both the retinaFace and the STN (supervised transformer network) from MSRA predicts landmark together from the first stage (RPN for STN, and FPN for retinaFace). Maybe regressing landmarks with bbox simultaneously works after all.