ViT CIFAR-10 test with image sizes 32 and 64
acc: 83.6 (trained from scratch at image size 32)
https://github.com/YuBeomGon/vit_cifar10/blob/master/notebooks/vit-scratch-s4.ipynb

Regarding the v2 training recipe (torchvision's latest training primitives), see
https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/

acc: 88.0 (from scratch, v2 recipe)
https://github.com/YuBeomGon/vit_cifar10/blob/master/notebooks/vit-scratch-v2.ipynb

acc: 89.1 (fine-tuned at image size 64 from the 32 model)
https://github.com/YuBeomGon/vit_cifar10/blob/master/notebooks/vit-64-from-32.ipynb
confusion matrix
attention map visualization
checkpoint
- image resolution matters: accuracy stays around 70.0 with the original ViT setting, so an overlapping patch embedding is added so attention can be trained more effectively (see the first sketch after this list)
- lots of augmentation and normalization (q, k, v norm) are added, because the dataset is small and ViT has low inductive bias (see the attention sketch below)
- weight initialization schemes did not help in the CIFAR-10 case, so the default initialization is used
- patch embedding interpolation is used for fine-tuning at image size 64 (see the interpolation sketch below)
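Below is a minimal sketch of an overlapping patch embedding: a Conv2d projection whose kernel is larger than its stride, so neighboring patches share pixels. The kernel size, stride, and embedding dim here are illustrative assumptions, not the notebooks' exact settings.

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Patch embedding whose convolution kernel is larger than its stride,
    so neighboring patches overlap."""
    def __init__(self, img_size=32, stride=4, kernel_size=8, in_chans=3, embed_dim=192):
        super().__init__()
        padding = (kernel_size - stride) // 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=kernel_size, stride=stride, padding=padding)
        self.num_patches = (img_size // stride) ** 2

    def forward(self, x):
        x = self.proj(x)                     # (B, D, H/stride, W/stride)
        return x.flatten(2).transpose(1, 2)  # (B, N, D)

# 32x32 CIFAR-10 image -> 8x8 = 64 overlapping patch tokens
tokens = OverlappingPatchEmbed()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 192])
```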
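A sketch of the q, k, v normalization idea: LayerNorm applied to the per-head query, key, and value vectors inside attention. The head count and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class NormedAttention(nn.Module):
    """Multi-head self-attention with LayerNorm on the per-head q, k, v vectors,
    which helps stabilize training on small datasets."""
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.v_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, heads, N, head_dim)
        q, k, v = self.q_norm(q), self.k_norm(k), self.v_norm(v)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                     # (B, heads, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

x = torch.randn(2, 64, 192)                  # 64 tokens from a 32x32 input
print(NormedAttention()(x).shape)            # torch.Size([2, 64, 192])
```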
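The note above mentions patch embedding interpolation; a closely related step when raising the input resolution with the same patch size is interpolating the learned position embeddings (8x8 grid at 32x32 -> 16x16 grid at 64x64). The sketch below shows only that position-embedding step, assuming a class token plus a learned position-embedding table.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=8, new_grid=16):
    """Resize learned 2D position embeddings for fine-tuning at a higher resolution.

    pos_embed: (1, 1 + old_grid**2, dim) -- class token first, then patch tokens.
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    # (1, old_grid*old_grid, dim) -> (1, dim, old_grid, old_grid) for interpolation
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

pos32 = torch.randn(1, 1 + 8 * 8, 192)       # trained at image size 32
pos64 = interpolate_pos_embed(pos32)         # ready for image size 64
print(pos64.shape)                           # torch.Size([1, 257, 192])
```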
Further work
- use a LoRA / adapter-style fine-tuning scheme, as in big-model fine-tuning
- DeiT-style knowledge distillation (transfer knowledge from a big model to a tiny model)
- use layer scale (CaiT); see the sketch after this list
- loss-function tuning for the cat class
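A sketch of CaiT-style layer scale: a learnable per-channel factor, initialized to a small constant, applied to each residual branch's output. The init value 1e-4 is an assumption.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """CaiT-style layer scale: scale each residual branch output by a learnable
    per-channel factor initialized to a small constant."""
    def __init__(self, dim, init_value=1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.gamma * x

# usage inside a transformer block (attn/mlp are the block's sub-modules):
#   x = x + self.ls1(self.attn(self.norm1(x)))
#   x = x + self.ls2(self.mlp(self.norm2(x)))
```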