November 2021
tl;dr: Large-scale pretraining based on Masked Image Modeling. Similar to MAE.
This paper was published a week after MAE and appears to have been rushed out in response to it. The ideas are very similar, but the execution (hyperparameter tuning, paper writing) is considerably weaker than MAE's.
Difference between MAE and SimMIM:
- MAE uses an asymmetric encoder-decoder design, in which the encoder does not see masked patches; SimMIM uses a symmetric design, where the encoder processes all patch positions (masked ones replaced by a mask token).
- SimMIM stresses the distinction between prediction (loss on masked patches only) and reconstruction (loss on all patches), and reports that the former performs better. MAE observes the same trend (in a footnote), but also demonstrates a middle ground: training with no loss on visible patches while still predicting all patches.
- SimMIM was not validated on finer-grained downstream tasks such as object detection and segmentation.
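The prediction-vs-reconstruction distinction above comes down to which patches contribute to the loss. A minimal sketch (my own illustration, not the paper's code; shapes and the L1 choice are assumptions):

```python
import torch

def mim_loss(pred, target, mask, masked_only=True):
    """L1 loss for masked image modeling.

    pred, target: (B, N, D) predicted and ground-truth patch pixels.
    mask: (B, N) boolean, True where a patch was masked out.
    masked_only=True  -> "prediction" loss on masked patches only (SimMIM's preferred setup).
    masked_only=False -> "reconstruction" loss averaged over all patches.
    """
    per_patch = (pred - target).abs().mean(dim=-1)  # (B, N)
    if masked_only:
        # Average only over the masked positions.
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)
    return per_patch.mean()
```

With `mask` all True the two variants coincide; the difference appears only when visible patches would otherwise dilute the loss.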
Similarities between MAE and SimMIM:
- directly regress the pixels
- light decoder design
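Both shared ingredients, direct pixel regression and a light decoder, fit in a few lines. A hypothetical sketch in the SimMIM spirit (all sizes and the 2-layer encoder are my own assumptions, not either paper's configuration):

```python
import torch
import torch.nn as nn

class SimpleMIM(nn.Module):
    """Minimal masked-image-modeling sketch (illustrative, not the paper's code).

    Symmetric design: the encoder sees every patch position, with masked
    patches replaced by a learned mask token, and a light one-layer linear
    head directly regresses raw pixels.
    """
    def __init__(self, embed_dim=128, patch_pixels=768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_pixels, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(embed_dim, patch_pixels)  # the "light decoder"

    def forward(self, patches, mask):
        # patches: (B, N, patch_pixels); mask: (B, N) bool, True = masked.
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x)
        return self.head(x)  # predicted pixels for every patch position
```

The output covers all patches; combined with a masked-only loss, this reproduces the prediction setup SimMIM favors.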