By Chenxin Tao, Honghui Wang, Xizhou Zhu, Jiahua Dong, Shiji Song, Gao Huang, Jifeng Dai
This is the official implementation of the CVPR 2022 paper "Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework".
- 2022.6.28: See Siamese Image Modeling for an application of UniGrad.
- 2022.6.28: Release code for UniGrad and gradient implementations for other methods.
We propose a unified framework for different self-supervised learning methods from the perspective of gradient analysis. It turns out that previous methods share similar gradient structures (positive gradient + negative gradient) as well as similar performance. Based on this analysis, we propose UniGrad as a concise and effective method for self-supervised learning.
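
The "positive gradient + negative gradient" structure can be illustrated with a minimal pure-Python sketch. It assumes a UniGrad-style form in which the gradient with respect to a feature `u1` is `-u2 + lam * F @ u1`, where `u2` is the feature of the positive sample and `F` is a correlation matrix of features. The function names, the uniform weighting inside `F`, the toy vectors, and the value of `lam` are all hypothetical; the exact formulas should be taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def correlation_matrix(features: List[Vector]) -> Matrix:
    """Average outer product of the feature vectors (uniform weights assumed)."""
    dim = len(features[0])
    n = len(features)
    return [
        [sum(v[i] * v[j] for v in features) / n for j in range(dim)]
        for i in range(dim)
    ]


def unified_gradient(u1: Vector, u2: Vector, corr: Matrix, lam: float) -> Vector:
    """Assumed form: positive term (-u2) plus negative term (lam * corr @ u1)."""
    positive = [-x for x in u2]
    negative = [lam * sum(row[j] * u1[j] for j in range(len(u1))) for row in corr]
    return [p + n for p, n in zip(positive, negative)]


# Toy 2-D features; the values are illustrative only.
batch = [[1.0, 0.0], [0.0, 1.0]]
F = correlation_matrix(batch)
grad = unified_gradient(u1=[1.0, 0.0], u2=[0.0, 1.0], corr=F, lam=2.0)
print(grad)
```

In this toy case, a gradient-descent step on `u1` moves it toward the positive sample `u2` (the positive term) and shrinks its component along directions that the batch already occupies (the negative term).
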
Method | Loss formula |
---|---|
contrastive learning | |
asymmetric networks | |
feature decorrelation | Barlow Twins: VICReg: |
UniGrad | |
Method | Gradient formula |
---|---|
contrastive learning | where |
asymmetric networks | |
feature decorrelation | |
UniGrad | where |
Loss | Model | Epoch | Original top-1 acc. (%) | Gradient impl. w/o momentum encoder | Gradient impl. w/ momentum encoder | Config | Pretrained model |
---|---|---|---|---|---|---|---|
Contrastive Learning | |||||||
MoCo | R50 | 100 | 67.4 | 67.6 | 70.0 | - | - |
SimCLR | R50 | 100 | 62.7 | 67.6 | 70.0 | config | - |
Asymmetric Networks | |||||||
SimSiam | R50 | 100 | 68.1 | 67.9 | 70.2 | - | - |
BYOL | R50 | 100 | 66.5 | 67.9 | 70.2 | config | model |
Feature Decorrelation | |||||||
VICReg | R50 | 100 | 68.6 | 67.6 | 69.8 | - | - |
Barlow Twins | R50 | 100 | 68.7 | 67.6 | 70.0 | config | model |
UniGrad | |||||||
UniGrad | R50 | 100 | - | - | 70.3 | config | model |
UniGrad+cutmix | R50 | 800 | - | - | 74.9 | - | - |
Note:
(1) The numbers for the original versions come from the original papers; all gradient versions are implemented by us.
(2) For MoCo, SimCLR and BYOL, our results improve upon the original performance by a large margin. The improvements mainly come from a stronger projector (3 layers with a hidden dimension of 2048). Conversely, the original Barlow Twins and VICReg get better results than our gradient versions without a momentum encoder because they use a 3-layer MLP projector with a hidden dimension of 8192.
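
To give a rough sense of how much larger the 8192-wide projector is, the sketch below counts the weights and biases of a 3-layer MLP for both hidden sizes. It assumes that "3 layers" means three linear layers, that the input is the 2048-dimensional ResNet-50 feature, and that the output width equals the hidden width; normalization layers are ignored. The helper name `mlp_param_count` is hypothetical and not part of this repository.

```python
from typing import Sequence


def mlp_param_count(dims: Sequence[int]) -> int:
    """Weights plus biases of a stack of linear layers with the given widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


# Assumed layouts: 2048-d backbone feature, output width equal to hidden width.
params_2048 = mlp_param_count([2048, 2048, 2048, 2048])
params_8192 = mlp_param_count([2048, 8192, 8192, 8192])

print(f"2048-wide projector: {params_2048:,} parameters")
print(f"8192-wide projector: {params_8192:,} parameters")
print(f"ratio: {params_8192 / params_2048:.1f}x")
```

Under these assumptions the wider projector has roughly an order of magnitude more parameters, which is consistent with projector capacity being a major factor in the differences noted above.
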
Pretraining only requires installing PyTorch and downloading the ImageNet dataset; detailed instructions can be found here.
Linear evaluation additionally requires apex, because it uses the LARS optimizer.
To conduct pretraining, set the dataset path in the config and run the following command:
./run_pretrain.sh {CONFIG_NAME} {NUM_GPUS}
For example, the command for pretraining with UniGrad on 8 GPUs is as follows:
./run_pretrain.sh ./configs/unigrad.py 8
With 8 GPUs, a 100-epoch experiment usually takes 38 hours to finish, as also reported in the paper.
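For planning compute, the figure above can be turned into a GPU-hour budget. The sketch below also projects the wall-clock time of an 800-epoch run on the same 8 GPUs, assuming that cost scales linearly with the number of epochs and that extra augmentation such as cutmix adds no overhead; both are assumptions, not measurements. The function names are hypothetical.

```python
def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours consumed by a run."""
    return wall_clock_hours * num_gpus


def projected_wall_clock_hours(
    ref_hours: float, ref_epochs: int, target_epochs: int
) -> float:
    """Scale a reference run linearly in the number of epochs (assumption)."""
    return ref_hours * target_epochs / ref_epochs


# Reference point from the text: 100 epochs on 8 GPUs in 38 hours.
budget_100ep = gpu_hours(38, 8)
hours_800ep = projected_wall_clock_hours(38, ref_epochs=100, target_epochs=800)

print(f"100 epochs: {budget_100ep} GPU-hours")
print(f"800 epochs on 8 GPUs: {hours_800ep} hours ({hours_800ep / 24:.1f} days)")
```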
To conduct linear evaluation, run the following command:
./run_linear.sh {PRETRAINED_CKPT} {DATA_PATH} {NUM_GPUS}
Note that for the 100-epoch pretrained model, the base learning rate is set to 0.1, following the evaluation setting of SimSiam. With our provided pretrained model, this command should give ~70% top-1 accuracy.
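A "base" learning rate is usually scaled by the batch size before training. The sketch below assumes the common linear scaling rule `lr = base_lr * batch_size / 256`. Whether `run_linear.sh` applies exactly this rule is an assumption, and the batch sizes shown are hypothetical examples, so check the script before relying on these values.

```python
def scaled_lr(base_lr: float, batch_size: int, reference_batch: int = 256) -> float:
    """Linear scaling rule (assumed): lr grows proportionally with batch size."""
    return base_lr * batch_size / reference_batch


# base_lr = 0.1 comes from the text; the batch sizes are hypothetical examples.
lr_by_batch = {bs: scaled_lr(0.1, bs) for bs in (256, 1024, 4096)}

for batch_size, lr in lr_by_batch.items():
    print(f"batch size {batch_size}: lr = {lr:.2f}")
```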
- Release pretrained model without momentum encoder
- Release 800-epoch UniGrad pretrained model
- Support ViT architectures
This project is released under the Apache 2.0 license.
If you find UniGrad useful in your research, please consider citing:
@inproceedings{tao2022exploring,
title={Exploring the equivalence of siamese self-supervised learning via a unified gradient framework},
author={Tao, Chenxin and Wang, Honghui and Zhu, Xizhou and Dong, Jiahua and Song, Shiji and Huang, Gao and Dai, Jifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14431--14440},
year={2022}
}