A PaddlePaddle reimplementation of facebookresearch's MoCo v3, released with the paper *An Empirical Study of Training Self-Supervised Vision Transformers*.
PaddlePaddle 2.4 is required for some of the features used here. For installation instructions, refer to installation.md.
Arrange the ImageNet (ILSVRC2012) dataset in the following directory structure:

```
dataset/
└── ILSVRC2012
    ├── train
    └── val
```
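Each of `train` and `val` is expected to contain one subdirectory per class, in the standard ImageNet layout. A quick sanity check along these lines can catch layout mistakes early (the 1000-class count is an assumption based on the standard ILSVRC2012 split):

```python
# Quick layout check; assumes the standard ImageNet-1k folder structure
# with one subdirectory per class under both train/ and val/.
import os

root = 'dataset/ILSVRC2012'
for split in ('train', 'val'):
    classes = [d for d in os.listdir(os.path.join(root, split))
               if os.path.isdir(os.path.join(root, split, d))]
    print(split, len(classes), 'class folders')  # expect 1000 for each split
```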
With a batch size of 4096, ViT-Base is pre-trained on 4 nodes (8 GPUs each, 32 GPUs in total):
```bash
# Note: set the following environment variables,
# then run this script on each of the 4 nodes.
unset PADDLE_TRAINER_ENDPOINTS
export PADDLE_NNODES=4
export PADDLE_MASTER="xxx.xxx.xxx.xxx:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FLAGS_stop_check_timeout=3600

IMAGENET_DIR=./dataset/ILSVRC2012/
python -m paddle.distributed.launch \
    --nnodes=$PADDLE_NNODES \
    --master=$PADDLE_MASTER \
    --devices=$CUDA_VISIBLE_DEVICES \
    main_moco.py \
    -a moco_vit_base \
    --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
    --epochs=300 --warmup-epochs=40 \
    --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
    ${IMAGENET_DIR}
```
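For orientation, `--moco-t` sets the InfoNCE temperature and `--moco-m-cos` switches the momentum-encoder coefficient to a cosine ramp toward 1.0, as in the paper. Below is a minimal sketch of the symmetrized objective and the schedule; it is illustrative, not the repo's code, and the tensor shapes, variable names, and `m_base=0.99` default are assumptions taken from the paper:

```python
# Minimal sketch of the MoCo v3 objective controlled by the flags above:
# --moco-t is the InfoNCE temperature, --moco-m-cos the momentum schedule.
import math
import paddle
import paddle.nn.functional as F

def contrastive_loss(q, k, t=0.2):
    # InfoNCE over in-batch negatives; positives are the diagonal pairs.
    q = F.normalize(q, axis=1)
    k = F.normalize(k, axis=1)
    logits = paddle.matmul(q, k, transpose_y=True) / t
    labels = paddle.arange(q.shape[0], dtype='int64')
    # MoCo v3 scales the loss by 2 * temperature.
    return F.cross_entropy(logits, labels) * (2 * t)

def moco_momentum(epoch, total_epochs, m_base=0.99):
    # --moco-m-cos: ramp m from m_base toward 1.0 with a half-cosine schedule.
    return 1.0 - (1.0 - m_base) * (math.cos(math.pi * epoch / total_epochs) + 1) / 2

# Symmetrized objective over two augmented views: q1, q2 come from the base
# encoder's predictor, k1, k2 from the momentum encoder.
q1, q2, k1, k2 = (paddle.randn([8, 256]) for _ in range(4))
loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
print(float(loss))
```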
By default, linear classification on frozen features/weights uses SGD with momentum and a batch size of 1024; this runs on a single 8-GPU node.
```bash
unset PADDLE_TRAINER_ENDPOINTS
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FLAGS_stop_check_timeout=3600

IMAGENET_DIR=./dataset/ILSVRC2012/
python -m paddle.distributed.launch \
    --nnodes=$PADDLE_NNODES \
    --master=$PADDLE_MASTER \
    --devices=$CUDA_VISIBLE_DEVICES \
    main_lincls.py \
    -a moco_vit_base \
    --lr=3 \
    --pretrained pretrained/checkpoint_0299.pd \
    ${IMAGENET_DIR}
```
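Under the hood, a linear probe trains only a classifier on top of frozen features. A toy illustration of the freezing mechanism in PaddlePaddle follows; the `backbone`/`head` names and dimensions are placeholders, not the repo's actual model:

```python
# Toy stand-in for "frozen backbone + trainable linear head"; the names
# 'backbone'/'head' and the layer sizes are placeholders.
import paddle
import paddle.nn as nn

model = nn.Sequential(
    ('backbone', nn.Linear(768, 768)),
    ('head', nn.Linear(768, 1000)),
)
for name, p in model.named_parameters():
    if not name.startswith('head'):
        p.stop_gradient = True  # frozen: no gradients flow to the backbone

opt = paddle.optimizer.Momentum(
    learning_rate=3.0, momentum=0.9,  # matches --lr=3 with momentum SGD
    parameters=[p for p in model.parameters() if not p.stop_gradient])
```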
To perform end-to-end fine-tuning of the ViT, first use our script to convert the pre-trained checkpoint to the PLSC DeiT format:
```bash
python extract_weight.py \
    --input pretrained/checkpoint_0299.pd \
    --output pretrained/moco_vit_base.pdparams
```
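Conceptually, the conversion loads the MoCo checkpoint and keeps only the base-encoder weights under names PLSC can consume. A rough sketch of that idea is below; the `state_dict` wrapper and the `base_encoder.` key prefix are assumptions about the checkpoint layout, not necessarily what extract_weight.py does:

```python
# Rough sketch of checkpoint extraction; the 'state_dict' wrapper and the
# 'base_encoder.' key prefix are assumptions about the checkpoint layout.
import paddle

ckpt = paddle.load('pretrained/checkpoint_0299.pd')
state = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
backbone = {k[len('base_encoder.'):]: v
            for k, v in state.items()
            if k.startswith('base_encoder.')}
paddle.save(backbone, 'pretrained/moco_vit_base.pdparams')
```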
Then run training from the converted PLSC-format checkpoint:
```bash
unset PADDLE_TRAINER_ENDPOINTS
export PADDLE_NNODES=1
export PADDLE_MASTER="127.0.0.1:12538"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FLAGS_stop_check_timeout=3600

python -m paddle.distributed.launch \
    --nnodes=$PADDLE_NNODES \
    --master=$PADDLE_MASTER \
    --devices=$CUDA_VISIBLE_DEVICES \
    plsc-train \
    -c ./configs/DeiT_base_patch16_224_in1k_1n8c_dp_fp16o1.yaml \
    -o Global.epochs=150 \
    -o Global.pretrained_model=pretrained/moco_vit_base \
    -o Global.finetune=True
```
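The `-o` options override entries of the YAML config at launch time. Before launching, a quick look at the converted weights can save a failed run; the check below is a hedged sketch that assumes the `.pdparams` file is a flat name-to-tensor dict as produced by the conversion step:

```python
# Optional sanity check on the converted weights; assumes a flat
# {parameter_name: tensor} dict saved by extract_weight.py.
import numpy as np
import paddle

state = paddle.load('pretrained/moco_vit_base.pdparams')
n_params = sum(int(np.prod(v.shape)) for v in state.values())
print(f'{len(state)} tensors, {n_params / 1e6:.1f}M parameters')  # ViT-Base is ~86M
```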
| Model | Phase | Dataset | Configs | GPUs | Epochs | Top1 Acc | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- |
| moco_vit_base | pretrain | ImageNet2012 | - | A100*N4C32 | 300 | - | download |
| moco_vit_base | linear probe | ImageNet2012 | - | A100*N1C8 | 90 | 0.7662 | |
| moco_vit_base | finetune | ImageNet2012 | config | A100*N1C8 | 150 | 0.8288 | |
```bibtex
@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}
```