
[CVPR 2021] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers #38

Open
XFeiF opened this issue Mar 10, 2021 · 1 comment
Labels
area/image · area/SSL (self-supervised learning) · Code (code available) · Summary/Brief (a brief summary about the paper) · task/detection · trend/Transformer (every paper uses transformer...)

Comments


XFeiF commented Mar 10, 2021

Paper
Code-pytorch

Authors:
Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen

The author Zhigang Dai also posted a Chinese explanation on Zhihu.

[Figure: The framework of the proposed UP-DETR.]


XFeiF commented Mar 10, 2021

Highlight:
The proposed UP-DETR framework unsupervisedly pre-trains the transformers of DETR. Object detection has two main tasks: object classification and localization. The pre-training pretext task, however, pushes the DETR transformer toward spatial localization learning, so the question becomes how to maintain the image classification ability. Based on this finding, the authors make the following contributions:

  1. Multi-task learning: they introduce a frozen pre-trained backbone and patch feature reconstruction to preserve the feature discrimination of the transformer. (I think this part acts like a penalty term that helps the transformer keep the important discriminative information extracted by the pre-trained backbone.)
  2. Multi-query localization: object query shuffle + attention mask, of which the former is the more important. The intuition: in general object detection there are multiple object instances in each image, and a single-query patch may cause convergence difficulty when the number of object queries is large.
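As a sketch of the attention-mask part of multi-query localization: object queries are split into groups, one group per query patch, and the decoder's self-attention is masked so queries assigned to different patches cannot attend to each other. This helper is hypothetical (not from the released code); the boolean polarity follows `torch.nn.MultiheadAttention`'s `attn_mask` convention (`True` = masked out), and the query-shuffle step is omitted.

```python
import torch

def build_query_attention_mask(num_queries: int, num_patches: int) -> torch.Tensor:
    """Boolean self-attention mask for multi-query localization.

    Queries are split into `num_patches` contiguous groups; attention is
    allowed only within a group. True means "do not attend" (the boolean
    attn_mask convention of torch.nn.MultiheadAttention).
    """
    assert num_queries % num_patches == 0, "queries must split evenly into groups"
    group = num_queries // num_patches
    mask = torch.ones(num_queries, num_queries, dtype=torch.bool)  # block everything
    for g in range(num_patches):
        s = g * group
        mask[s:s + group, s:s + group] = False  # re-allow attention within the group
    return mask
```

For example, with 10 queries and 2 patches, queries 0–4 attend only to each other, as do queries 5–9.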

The entire framework:

  1. A frozen CNN backbone extracts a visual representation, the feature map f, from the input image.
  2. The feature map, with positional encodings added, is passed to the multi-layer transformer encoder of DETR.
  3. For a randomly cropped query patch, the same CNN backbone with global average pooling extracts the patch feature p, which is flattened and added to the object queries q before being passed into the transformer decoder.
  4. The decoder then localizes the query patch: from the queries q (carrying patch feature p) and the encoded image features, it predicts the bounding box of the patch in the image.
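The four steps above can be sketched as a minimal PyTorch forward pass. This is an illustrative toy, not the paper's implementation: the single-conv "backbone", dimensions, learned positional embedding, and two-way match/no-match head are all assumptions.

```python
import torch
import torch.nn as nn

class UPDETRPretrainSketch(nn.Module):
    """Toy sketch of the UP-DETR pre-training forward pass (steps 1-4)."""

    def __init__(self, d_model: int = 256, num_queries: int = 10):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in CNN
        for p in self.backbone.parameters():
            p.requires_grad = False  # step 1: the backbone is frozen
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.queries = nn.Embedding(num_queries, d_model)       # object queries q
        self.pos = nn.Parameter(torch.zeros(1, 196, d_model))   # step 2: pos. encodings
        self.bbox_head = nn.Linear(d_model, 4)                  # step 4: box regression
        self.cls_head = nn.Linear(d_model, 2)                   # patch matched or not

    def forward(self, image, patch):
        f = self.backbone(image).flatten(2).transpose(1, 2)     # step 1: feature map f
        f = f + self.pos[:, : f.shape[1]]                       # step 2: add pos. enc.
        p = self.backbone(patch).mean(dim=(2, 3))               # step 3: GAP patch feat p
        q = self.queries.weight.unsqueeze(0) + p.unsqueeze(1)   # add p to every query
        hs = self.transformer(f, q)                             # encoder + decoder
        return self.bbox_head(hs).sigmoid(), self.cls_head(hs)  # step 4: localize patch
```

Adding the same patch feature p to every query is the single-query case; the multi-query case would add one patch feature per query group.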

The loss function has three parts:

  1. The classification loss for whether a query matches the patch or not.
  2. The bounding-box regression loss.
  3. The reconstruction loss that preserves the discriminative image features.
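The three-part loss can be sketched as below. The weights and the choice of L1 for boxes and MSE for reconstruction are assumptions for illustration; the actual paper also uses a GIoU term and assigns queries to patches via bipartite matching, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def up_detr_loss(cls_logits, matched, pred_boxes, gt_boxes,
                 pred_feat, patch_feat, w_box=5.0, w_rec=1.0):
    """Toy three-part pre-training loss.

    matched: long tensor of 0/1 labels, 1 where a query is assigned a patch.
    """
    # 1. classification loss: does this query match a patch or not?
    loss_cls = F.cross_entropy(cls_logits, matched)
    # 2. bounding-box regression loss, computed only on matched queries
    loss_box = F.l1_loss(pred_boxes[matched == 1], gt_boxes[matched == 1])
    # 3. reconstruction loss: keep decoder output close to the backbone's
    #    patch feature, preserving discriminative information
    loss_rec = F.mse_loss(pred_feat[matched == 1], patch_feat[matched == 1])
    return loss_cls + w_box * loss_box + w_rec * loss_rec
```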
