December 2020

tl;dr: End-to-end object detection with one-to-one label assignment.

Overall impression

The study builds upon FCOS. DeFCN points out that one-to-many label assignment is what makes NMS necessary, so a good one-to-one assignment policy is the key. The paper is also inspired by MultiBox and DETR to assign labels by bipartite matching, with a matching cost that depends on the predictions, which lets the network learn a better assignment policy.

DeFCN and OneNet:

  • DeFCN shows that a hand-crafted one-to-one label assignment already yields OK-ish performance (10% relative drop in KPI). OneNet also mentions that a predefined location cost + classification cost yields an OK baseline.
  • Both DeFCN and OneNet adopt a bbox formulation consisting of a point inside the GT bbox + 4 distances to the edges. This handles eccentric objects and objects whose center is occluded.
  • OneNet seems to perform worse than DeFCN.
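
To make the "point + 4 distances" formulation concrete, here is a minimal sketch. The function names `encode_box` and `decode_box` are made up for illustration and are not taken from either codebase; boxes are assumed to be `(x1, y1, x2, y2)` in pixels.

```python
def encode_box(px, py, box):
    """Distances (left, top, right, bottom) from a point to the box edges."""
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)


def decode_box(px, py, left, top, right, bottom):
    """Recover (x1, y1, x2, y2) from a point and its four edge distances."""
    return (px - left, py - top, px + right, py + bottom)


# The point does not have to be the box center: any point inside the box works.
gt_box = (10, 20, 50, 80)
distances = encode_box(15, 70, gt_box)   # point near the bottom-left corner
restored = decode_box(15, 70, *distances)
```

Because any interior point can regress the full box, an off-center point can still represent an object whose center is occluded.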

Key ideas

  • One-to-one label assignment is key.
    • One-to-one based on center or anchor is already OK.
    • Matching cost by foreground loss (as in DETR) improves KPI
    • A modified, prediction-aware one-to-one (POTO) matching cost is even better. The foreground loss (cls + reg) may carry loss weights chosen for training, and that weighting is not necessarily optimal for bipartite matching.
    • The matching cost does not need to be differentiable, so in theory we could even use mAP as the cost --> see review on Zhihu.
  • POTO matching cost
    • Spatial priors help (the center of a prediction matched to a GT cannot lie outside the GT box).
    • IoU and classification score are balanced against each other (combined by multiplication, which works better than summation).
  • 3D Max Filtering (3DMF)
    • CenterNet uses 2D max filtering to replace NMS
    • Duplicate predictions mostly come from the spatial neighborhood of the most confident prediction and from neighboring scales, since objects whose size sits on the border between two FPN stages may be assigned to either neighboring stage.
    • 3DMF is a module that performs 3D max pooling (across space and scale) to produce a sharper response. It is used as a differentiable post-processing step inside the network.
  • Auxiliary loss to speed up convergence.
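
The POTO matching idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the quality is assumed to take the multiplicative form `prior * score^(1 - alpha) * IoU^alpha` with `alpha` a hyperparameter (0.8 here is an assumed default), the spatial prior is the simple "point inside GT box" test, and the Hungarian solver is swapped for a brute-force search over permutations, which is only usable for tiny inputs. All function names are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def poto_quality(score, pred_box, point, gt_box, alpha=0.8):
    """Assumed quality: spatial prior * score^(1-alpha) * IoU^alpha."""
    px, py = point
    inside = gt_box[0] <= px <= gt_box[2] and gt_box[1] <= py <= gt_box[3]
    if not inside:  # spatial prior: the point must lie inside the GT box
        return 0.0
    return score ** (1 - alpha) * iou(pred_box, gt_box) ** alpha


def match_one_to_one(quality):
    """quality[g][p] -> one prediction index per GT, maximizing total quality.

    Brute force stand-in for Hungarian matching; assumes at least one GT
    and at least as many predictions as GTs.
    """
    num_gt, num_pred = len(quality), len(quality[0])
    best, best_total = None, -1.0
    for perm in permutations(range(num_pred), num_gt):
        total = sum(quality[g][p] for g, p in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [  # (score, predicted box, location of the predicting point)
    (0.9, (0, 0, 10, 10), (5, 5)),
    (0.8, (0, 0, 9, 9), (4, 4)),      # near-duplicate of the first prediction
    (0.6, (20, 20, 30, 30), (25, 25)),
]
quality = [[poto_quality(s, b, pt, g) for (s, b, pt) in preds] for g in gts]
assignment = match_one_to_one(quality)  # prediction index chosen for each GT
```

Each GT ends up with exactly one positive sample, and the near-duplicate prediction is left as background, which is what removes the need for NMS at inference. Nothing in the matching step needs gradients, consistent with the note that the cost need not be differentiable.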

Technical details

  • By using POTO and 3D MF, the scores of duplicate samples are significantly suppressed.
  • On CrowdHuman, the recall is even higher than the theoretical upper bound of NMS-based methods (obtained by applying NMS to the GT boxes).
  • MultiBox is the first paper to propose bipartite matching between pred and GT, way earlier than DETR.
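
To illustrate why max filtering across space and scale suppresses duplicates, here is a toy sketch of the underlying operation on a score volume. It is a simplification under stated assumptions: the FPN levels are assumed to be already resized to a common resolution, the filter runs on scores rather than on features inside the head, and plain nested lists are used instead of a tensor library. `max_filter_3d` and `keep_peaks` are hypothetical names.

```python
def max_filter_3d(volume, radius=1):
    """volume[s][y][x] -> max over a (scale, y, x) window, clipped at borders."""
    S, H, W = len(volume), len(volume[0]), len(volume[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(S)]
    for s in range(S):
        for y in range(H):
            for x in range(W):
                out[s][y][x] = max(
                    volume[ss][yy][xx]
                    for ss in range(max(0, s - radius), min(S, s + radius + 1))
                    for yy in range(max(0, y - radius), min(H, y + radius + 1))
                    for xx in range(max(0, x - radius), min(W, x + radius + 1))
                )
    return out


def keep_peaks(volume, radius=1):
    """Zero every score that is not the maximum of its 3D neighborhood."""
    filtered = max_filter_3d(volume, radius)
    return [
        [
            [v if v == filtered[s][y][x] else 0.0 for x, v in enumerate(row)]
            for y, row in enumerate(level)
        ]
        for s, level in enumerate(volume)
    ]


# Two scales, 3x3 each: one strong response plus duplicates nearby in
# space (0.8) and at the neighboring scale (0.7).
scores = [
    [[0.1, 0.2, 0.1], [0.2, 0.9, 0.8], [0.1, 0.2, 0.1]],
    [[0.0, 0.0, 0.0], [0.0, 0.7, 0.0], [0.0, 0.0, 0.0]],
]
peaks = keep_peaks(scores)
```

Only the 0.9 response survives; a 2D max filter on each level separately would have kept the 0.7 duplicate at the neighboring scale.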

Notes

  • Review on Zhihu by 1st author

    • About spatial prior in matching cost

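    When α is set reasonably, the spatial prior is not strictly required, but it helps exclude bad regions during matching and improves absolute performance. The researchers use center sampling radius=1.5 in the COCO experiments and inside gt box in the CrowdHuman experiments. The reason is simple: occlusion in CrowdHuman is so severe that the center region is often completely occluded.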
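
The two spatial priors mentioned in the quote can be sketched as follows. This assumes the FCOS-style convention in which the center region is a square of half-width `radius * stride` around the GT center, clipped to the GT box; the function names and the stride value in the example are illustrative assumptions.

```python
def in_gt_box(point, gt_box):
    """'inside gt box' prior: the point only needs to fall inside the GT box."""
    px, py = point
    return gt_box[0] <= px <= gt_box[2] and gt_box[1] <= py <= gt_box[3]


def in_center_region(point, gt_box, stride, radius=1.5):
    """Center sampling prior: the point must fall near the GT center.

    The region is a square of half-width radius * stride around the center,
    clipped so that it never extends beyond the GT box.
    """
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    r = radius * stride
    region = (
        max(gt_box[0], cx - r),
        max(gt_box[1], cy - r),
        min(gt_box[2], cx + r),
        min(gt_box[3], cy + r),
    )
    return in_gt_box(point, region)


gt = (0, 0, 100, 100)
corner_point = (10, 10)
allowed_by_box_prior = in_gt_box(corner_point, gt)
allowed_by_center_prior = in_center_region(corner_point, gt, stride=8)
```

A point near the corner of the box is a valid candidate under the "inside gt box" prior but not under center sampling, which is why the looser prior suits heavily occluded scenes where the center region may not show the object at all.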