
About poolformer as a tool for demonstration of MetaFormer #51

Open
YoojLee opened this issue Apr 1, 2023 · 2 comments
Comments

@YoojLee

YoojLee commented Apr 1, 2023

Hi, thanks for the wonderful work! I am really impressed by the proposed 'MetaFormer' concept and the experimental results you have provided. While reading the paper, a few questions came up regarding PoolFormer and the MetaFormer concept that I would like to share with you.

  1. As far as I understand, a MetaFormer basically consists of an input embedding followed by repeated blocks of [norm - token mixer - residual connection - norm - channel mixer - residual connection]. Does MetaFormer then place no constraint on non-overlapping patches or on a sequence of flattened patches? If so, is the combination of a token mixer and a channel mixer with the other components all that defines a 'MetaFormer', regardless of the hierarchical structure of the network or the shape of its inputs?
  2. PoolFormer uses non-parametric 2D pooling as its token mixer, which is extremely simple compared to previous token mixers. However, the patch embeddings inserted between the blocks seem to perform implicit token mixing, since each is a convolution whose stride is smaller than its kernel size and therefore yields overlapping patches. Given this overlap, I believe the resulting patches share information from the same spatial locations.

Thanks!
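To make my reading of the block structure in question 1 concrete, here is a minimal NumPy sketch of one such block. The names (`layer_norm`, `channel_mlp`, `metaformer_block`) and the toy token mixer are my own illustration, not the official implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over the channel dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def channel_mlp(x, w1, w2):
    # Two-layer MLP applied to each token independently (channel mixing).
    return np.maximum(x @ w1, 0.0) @ w2

def metaformer_block(x, token_mixer, w1, w2):
    # norm -> token mixer -> residual, then norm -> channel MLP -> residual.
    x = x + token_mixer(layer_norm(x))
    x = x + channel_mlp(layer_norm(x), w1, w2)
    return x

# The token mixer is left abstract in MetaFormer; as a placeholder,
# use a toy mixer that pulls every token toward the token mean.
mean_mixer = lambda x: x.mean(axis=0, keepdims=True) - x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))          # 16 tokens, 8 channels
w1 = rng.standard_normal((8, 32)) * 0.1   # hidden width 32
w2 = rng.standard_normal((32, 8)) * 0.1
y = metaformer_block(x, mean_mixer, w1, w2)
print(y.shape)  # (16, 8)
```

Swapping `mean_mixer` for attention, pooling, or an MLP over tokens gives the different MetaFormer instantiations, which is how I read the abstraction in the paper.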

@yuweihao
Collaborator

yuweihao commented Apr 2, 2023

Hi @YoojLee ,

Thanks for your insightful discussion.

  1. In my opinion, the core of MetaFormer is the repeated MetaFormer blocks. Thus, models using a hierarchical structure, like PVT, Swin, and PoolFormer, are also regarded as MetaFormer models.

  2. For a 4-stage hierarchical structure, the four patch embeddings shown in the paper can also be called downsampling layers, similar to those in ResNet. Downsampling can also mix tokens, but its main function is to reduce the resolution and increase the number of channels. ResNet and PoolFormer have similar hierarchical structures, so the better performance of PoolFormer demonstrates the superiority of MetaFormer. You may also refer to what makes pooling competitive performance or even more than attention? #43.

[image: poolformer_s24]
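For reference, the pooling token mixer discussed in question 2 can be sketched in a few lines of NumPy. This is my own illustration, not the official code: it uses stride-1 average pooling with zero padding and subtracts the input, since the MetaFormer block's residual connection adds the input back (implementations that exclude the padding from the average will differ slightly at the borders):

```python
import numpy as np

def pool_token_mixer(x, k=3):
    # x: feature map of shape (H, W, C).
    # Stride-1 average pooling over a k x k window, minus the identity:
    # the block's residual connection re-adds x, so the mixer only
    # contributes the pooled difference.
    p = k // 2
    H, W, C = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))  # zero padding for simplicity
    pooled = np.zeros_like(x)
    for di in range(k):
        for dj in range(k):
            pooled += xp[di:di + H, dj:dj + W]
    return pooled / (k * k) - x

# Sanity check: on a constant feature map, pooling returns the same
# constant away from the borders, so the mixer output is zero there.
x = np.ones((6, 6, 4))
y = pool_token_mixer(x)
print(y.shape)                    # (6, 6, 4)
print(abs(y[1:-1, 1:-1]).max())   # 0.0 in the interior
```

By contrast, the patch embeddings between stages are strided convolutions whose kernel size exceeds their stride, so they do mix neighboring tokens, as the question points out.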

@YoojLee
Author

YoojLee commented Apr 2, 2023

Thanks for your reply!

I just want to confirm that my understanding is correct. If I read your comment correctly, the proposed MetaFormer concept is simply a stack of MetaFormer blocks (each consisting of normalization, a token mixer, a channel mixer, and residual connections). Thus, regardless of the extent of inductive bias, or of whether the overall architecture follows a hierarchical structure, any model built by repeating MetaFormer blocks counts as a MetaFormer.
