Can I say PoolFormer is just a non-trainable MLP-like module? #10

Open
072jiajia opened this issue Dec 2, 2021 · 8 comments

Comments

@072jiajia commented Dec 2, 2021

Hi! Thanks for sharing the great work!
I have some questions about PoolFormer.
If I explain PoolFormer like the following attachments, can I say PoolFormer is just a non-trainable MLP-like model?

[Two attached images: diagrams illustrating the author's reading of PoolFormer as a non-trainable MLP-like model]

@yuweihao (Collaborator) commented Dec 2, 2021

Hi @072jiajia ,

Sure. There are a thousand Hamlets in a thousand people's eyes. Feel free to explain PoolFormer from different aspects.

@072jiajia (Author) commented Dec 2, 2021

Thanks for replying!
Assuming the above statements are true, I am curious how your model's performance can be better than ResNet's, since it would then just be a ResNet with some layers whose weights are not trainable.
Or did I miss any part of the model?

@yuweihao (Collaborator) commented Dec 2, 2021

Hi @072jiajia, in this paper we claim that MetaFormer is actually what you need for vision. The competitive performance of PoolFormer stems from MetaFormer. You can see that the MetaFormer architecture is different from the ResNet architecture. If the token mixer in MetaFormer is specified as just a simple learnable depthwise convolution, you will obtain better performance than PoolFormer. This can be implemented by replacing `self.token_mixer = Pooling(pool_size=pool_size)` in the code with `self.token_mixer = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=3, stride=1, padding=1, groups=dim)`.
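
As a rough illustration of that swap, below is a minimal, hypothetical sketch of a simplified MetaFormer-style block whose token mixer can be set to either pooling or a depthwise convolution. The `SimpleMetaFormerBlock` name, the GroupNorm choice, and the layer sizes are assumptions for illustration, not the repository's exact code.

```python
import torch
import torch.nn as nn


class Pooling(nn.Module):
    """Pooling token mixer: average pooling minus the input
    (the block's residual connection adds the input back)."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):
        return self.pool(x) - x


class SimpleMetaFormerBlock(nn.Module):
    """Illustrative block: norm -> token mixer -> residual,
    then norm -> channel MLP (two 1x1 convs) -> residual."""
    def __init__(self, dim, token_mixer=None, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)       # normalizes over (C, H, W) per sample
        self.token_mixer = token_mixer if token_mixer is not None else Pooling()
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Conv2d(dim, hidden, 1),
                                 nn.GELU(),
                                 nn.Conv2d(hidden, dim, 1))

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))  # spatial (token) mixing
        x = x + self.mlp(self.norm2(x))          # channel mixing
        return x


dim = 64
x = torch.randn(1, dim, 56, 56)

# PoolFormer-style block: non-trainable pooling as the token mixer.
pool_block = SimpleMetaFormerBlock(dim, token_mixer=Pooling(pool_size=3))

# Same block with a learnable depthwise convolution as the token mixer.
dw_block = SimpleMetaFormerBlock(
    dim, token_mixer=nn.Conv2d(dim, dim, kernel_size=3, stride=1, padding=1, groups=dim))

print(pool_block(x).shape, dw_block(x).shape)  # both torch.Size([1, 64, 56, 56])
```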

@xingshulicc commented Jul 12, 2022

I totally agree with @072jiajia.
In fact, I am not sure about the goal of this paper. The general architecture of MetaFormer is almost the same as that of vision transformers; the only difference is the token-mixer operator. By the way, your proposed pooling operation is a kind of non-trainable convolution, and the patch embedding mentioned in the paper is a trainable convolution whose kernel size equals its stride, with the embedding dimension as the number of output channels. Therefore, I think your model is a variant of the convolutional neural network; it is very similar to the Network in Network (NIN) model (proposed in 2013 by Lin et al.): the two linear layers can be implemented with two 1 x 1 convolutional layers.
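
As a concrete way to read that claim, here is a small, hedged sketch of a patch embedding implemented as a convolution whose kernel size equals its stride, and of a channel MLP implemented as two 1 x 1 convolutions in the NIN spirit; the channel counts and patch size are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Patch embedding as a trainable convolution with kernel_size == stride;
# the embedding dimension (64 here, chosen for illustration) is the output channels.
patch_embed = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=4)

# The "two linear layers" of a channel MLP, written as two 1x1 convolutions,
# as in Network in Network (Lin et al., 2013).
channel_mlp = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1),
    nn.GELU(),
    nn.Conv2d(256, 64, kernel_size=1),
)

x = torch.randn(1, 3, 224, 224)
tokens = patch_embed(x)      # (1, 64, 56, 56): non-overlapping 4x4 patches
out = channel_mlp(tokens)    # same spatial size; mixes channels only
print(tokens.shape, out.shape)
```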

@yuweihao (Collaborator)

Hi @xingshulicc ,

Thanks for your attention. The goal of this paper is not to propose novel models; rather, it is to propose a hypothesis and come up with methods to verify it.

Hypothesis: Instead of the specific token mixer, the general architecture, termed MetaFormer, is more essential for the model to achieve competitive performance.

Verification: We specify the token mixer as an extremely simple operator, pooling, and find the derived model PoolFormer outperforms well-tuned Vision Transformer/MLP-like/ResNet baselines.

MetaFormer is not a specific model but an abstract one: by regarding the attention module as just one kind of token mixer, MetaFormer is abstracted from the Transformer with the token mixer left unspecified. By contrast, PoolFormer is a specific model obtained by specifying the token mixer in MetaFormer as pooling; it is used as a tool to verify the hypothesis.

Besides PoolFormer, we also came up with other ways to verify the hypothesis, such as specifying the token mixer as a random matrix or a depthwise convolution; see the ablation study table in the paper. Since pooling is extremely simple, we finally chose it as the default tool to verify the hypothesis.
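
For reference, the identity-mapping and random-matrix token mixers mentioned here could look roughly like the following sketch; the class names, the matrix scaling, and the flattening convention are illustrative assumptions, not the paper's exact ablation setup.

```python
import torch
import torch.nn as nn


class IdentityMixer(nn.Module):
    """Ablation: no token mixing at all, so the block reduces to its channel MLP."""
    def forward(self, x):
        return x


class RandomMixer(nn.Module):
    """Ablation sketch: mix tokens with a frozen (non-trainable) random matrix
    applied over the flattened spatial dimension."""
    def __init__(self, num_tokens):
        super().__init__()
        # register_buffer keeps the matrix fixed (no gradient updates).
        self.register_buffer("mix", torch.randn(num_tokens, num_tokens) / num_tokens ** 0.5)

    def forward(self, x):              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2)          # (B, C, H*W)
        mixed = tokens @ self.mix      # token mixing with the frozen matrix
        return mixed.view(b, c, h, w)


x = torch.randn(2, 64, 14, 14)
print(IdentityMixer()(x).shape, RandomMixer(14 * 14)(x).shape)
```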

@xingshulicc commented Jul 13, 2022

Hi,
Thank you for your reply. Based on your response, can I say that the most important part of MetaFormer is the architecture abstracted from the Transformer, as presented in Figure 1 of the paper? If so, how can I further improve MetaFormer's performance in future work? Maybe the best way is to enhance the token-mixer part, as shown in the ablation study (Table 5). I am not questioning the contribution of this paper; I just feel that this article conflicts with my current research views. Of course, your article is very solid.

@yuweihao (Collaborator) commented Jul 14, 2022

Hi @xingshulicc ,

Yes, the essential part of our paper is the MetaFormer hypothesis. To improve the general architecture MetaFormer, maybe we can:

  1. Improve a component of MetaFormer other than the abstracted token mixer, or even the whole architecture. For example, propose a new normalization that steadily improves MetaFormer-like baselines (e.g., Transformer, MLP-like, or PoolFormer models).
  2. Propose a more effective/efficient optimizer that trains MetaFormer-like models better/faster than the most commonly used AdamW.
  3. and so on ...

We also want to clarify that the MetaFormer hypothesis does not mean the token mixer is insignificant; MetaFormer still has this abstracted component. It means the token mixer is not limited to a specific type, e.g., attention (MetaFormer is actually what you need vs. Attention is all you need). It makes sense that specifying a better token mixer in MetaFormer brings better performance (e.g., Pool vs. DWConv in the ablation study, Table 5). When designing new token mixers, we recommend adopting MetaFormer as the general architecture, since it guarantees competitive performance (MetaFormer guarantees a high lower bound of performance; see the ablation study that replaces pooling with identity mapping or a random matrix). Since many current papers focus on the token mixer, we hope this paper can inspire more future research devoted to improving the fundamental architecture, MetaFormer.

@xingshulicc

Hi,
Thank you for your reply. I agree with some of your opinions.
However, in the paper I did not see substantial modifications to the MetaFormer general architecture; the ablation study only compares the performance of different components (Table 5). Furthermore, I can see that the token-mixer part contributes the most to the performance improvement.
Of course, your paper really inspired me a lot, and I hope to come up with a simplified MetaFormer architecture in my future work. Thank you again.

@072jiajia 072jiajia changed the title Can I say PoolFormer is just a non-trainable MLP-like model? Can I say PoolFormer is just a non-trainable MLP-like module? Oct 25, 2022