November 2020
tl;dr: Sparse proposal and iterative refinement for a two-stage end-to-end object detector.
This paper rethinks the necessity of dense priors (either anchor boxes or reference points) in object detection, in a spirit very similar to TSP. Sparse RCNN uses a small fixed set of sparse proposals (N << HWk dense priors) for object detection.
There are several papers on improving the training speed of DETR.
- Deformable DETR: sparse attention
- TSP: sparse attention
- Sparse RCNN: sparse proposal and iterative refinement
The iterative head design is quite inefficient at capturing context and relationships with other parts of the image, and thus needs quite a few iterations (~6 cascaded stages). In comparison, the sparse cross-attention in Deformable DETR and TSP may be a better way to go.
The authors also wrote OneNet, which is a single-stage, easy-to-deploy, end-to-end object detection model.
- Sparse-in, sparse-out
- DETR uses a sparse set of object queries to interact with the global (dense) image feature, so it is dense-to-sparse.
- Sparse RCNN proposes both sparse proposals and sparse features.
- Sparse RCNN uses only 100 proposals, the same as DETR. Using 300 proposals adds very little inference cost thanks to the light design of the dynamic head.
- Sparse proposal features: high-dimensional (256-d) latent features encoding the pose and shape of instances. Each proposal feature generates a series of customized parameters for its exclusive object recognition head via dynamic conv.
- Three inputs:
- Image
- A set of learned proposal boxes (fixed after training, dataset-independent, and not necessarily symmetric or shift-equivariant) --> this looks like an inefficiency that could be optimized, perhaps with something like the FoI classifier in TSP-FCOS.
- A set of learnable proposal features. The initial values of the proposal features are actually not that important, given the auto-regressive iterative refinement scheme (similar to IterDet). A sketch of both learnable inputs is given below.
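A minimal PyTorch sketch of the two learnable inputs, assuming the paper's defaults of N = 100 proposals and d = 256 dims; the variable names and the image-sized box initialization are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

N, d = 100, 256

# Learned proposal boxes in normalized (cx, cy, w, h); per the ablation,
# the initialization (here: image-sized boxes) matters little.
proposal_boxes = nn.Parameter(torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(N, 1))

# Learnable proposal features: 256-d latents encoding instance pose/shape,
# refined stage by stage and used to parameterize the dynamic head.
proposal_features = nn.Parameter(torch.randn(N, d))
```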
- Dynamic instance interactive head --> this feels more like a trick and an afterthought than an intuition-inspired design.
- The interaction between proposal features and RoI features is modeled by a dynamic convolution whose parameters are generated from the proposal features (see the sketch below).
- Newly generated object boxes and object features feed the next stage of the iterative process. Features need to be RoI-aligned again, similar to Cascade RCNN (CVPR 2018). This introduces only marginal computational overhead, as the dynamic conv is very light.
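A minimal PyTorch sketch of the dynamic instance interaction, assuming the paper's default sizes (256-d RoI channels, 7x7 RoI grid, 64-d bottleneck); the class and variable names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DynamicInstanceHead(nn.Module):
    """Dynamic instance interaction: each proposal feature generates the
    weights of two 1x1 convs that filter its own RoI features."""
    def __init__(self, d=256, d_mid=64, grid=7 * 7):
        super().__init__()
        self.d, self.d_mid = d, d_mid
        self.param_gen = nn.Linear(d, 2 * d * d_mid)  # instance-specific params
        self.norm1 = nn.LayerNorm(d_mid)
        self.norm2 = nn.LayerNorm(d)
        self.out = nn.Linear(grid * d, d)

    def forward(self, roi_feat, prop_feat):
        # roi_feat: (N, 49, d) RoI-aligned features; prop_feat: (N, d)
        N = roi_feat.shape[0]
        params = self.param_gen(prop_feat)                     # (N, 2*d*d_mid)
        w1 = params[:, : self.d * self.d_mid].view(N, self.d, self.d_mid)
        w2 = params[:, self.d * self.d_mid :].view(N, self.d_mid, self.d)
        x = torch.relu(self.norm1(torch.bmm(roi_feat, w1)))    # (N, 49, d_mid)
        x = torch.relu(self.norm2(torch.bmm(x, w2)))           # (N, 49, d)
        return self.out(x.flatten(1))                          # refined object feature
```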
- The set prediction loss is exactly the same as in DETR: Hungarian matching between predictions and ground truth, with a cost combining classification and box terms. A sketch of the matching step follows.
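A sketch of the DETR-style bipartite matching behind the set prediction loss; the real cost also includes a GIoU term (omitted here for brevity), and the function name and the 5.0 box-cost weight are illustrative:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cls_prob, pred_boxes, gt_labels, gt_boxes):
    # cls_prob: (N, num_classes); pred_boxes: (N, 4); gt_boxes: (M, 4)
    cost_cls = -cls_prob[:, gt_labels]                  # (N, M) class cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M) L1 box cost
    cost = cost_cls + 5.0 * cost_box
    pred_ids, gt_ids = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_ids, gt_ids  # one-to-one pairs; losses computed on matched pairs
```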
- Ablation studies
- The paper has a very clear evolution path of the model in Tables 2, 3 and 4. According to the first author's reply on Zhihu (知乎):
    - 18.5 → 20.5 AP: add a Cascade R-CNN-style structure
    - 20.5 → 32.2 AP: cascade structure + concatenate each stage's object features to the next stage's object features
    - 32.2 → 37.2 AP: cascade structure + apply self-attention to each stage's object features first, then concatenate them to the next stage's object features
    - 37.2 → 42.3 AP: cascade structure + apply self-attention to each stage's object features first, then use them as proposal features for instance interaction with the next stage's RoI features
- Initialization of the proposal boxes does not matter much
- Increasing the number of proposals from 100 to 500 improves performance, but also requires longer training
- Gains saturate around 6 stages; the iterative heads gradually refine boxes and remove duplicates. See the sketch of the cascade below.
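An illustrative refinement loop under stated assumptions: it reuses the hypothetical DynamicInstanceHead sketch above, a single stubbed feature map in place of the FPN, and a plain linear box head (the real model predicts parameterized box deltas):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

num_stages, N, d = 6, 100, 256
feat = torch.randn(1, d, 50, 50)            # backbone feature map (stub)
xy = torch.rand(N, 2) * 40
wh = torch.rand(N, 2) * 10
boxes = torch.cat([xy, xy + wh], dim=1)     # (x1, y1, x2, y2) proposal stubs
props = torch.randn(N, d)                   # proposal features (stub)

heads = [DynamicInstanceHead() for _ in range(num_stages)]  # unshared weights
self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
box_head = nn.Linear(d, 4)

for head in heads:
    rois = torch.cat([torch.zeros(N, 1), boxes], dim=1)   # prepend batch index 0
    roi_feat = roi_align(feat, rois, output_size=7)       # re-pool with current boxes
    roi_feat = roi_feat.flatten(2).transpose(1, 2)        # (N, 49, d)
    props = self_attn(props[None], props[None], props[None])[0][0]  # interaction
    props = head(roi_feat, props)           # dynamic instance interaction
    boxes = boxes + box_head(props)         # refined boxes for the next stage
```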
- Code on GitHub
- The Sparse RCNN paper is not very well written. Luckily, the first author clarified most of the important details on Zhihu (知乎).
- The learned proposals can be seen as an averaged GT distribution. This should be improved to be data-dependent, and the authors are working on a v2 of Sparse RCNN. The authors also argue that a reasonable statistic can already serve as qualified candidates. Maybe this is similar to the FoI classifier in TSP-FCOS.