About subtract in pooling #4

Open
Dong-Huo opened this issue Nov 26, 2021 · 16 comments

Comments

@Dong-Huo

Hi, thank you for publishing such a nice paper. I just have one question: I do not understand the subtraction of the input in Eq. (4). Is it necessary? What will happen if we just do the average pooling without subtracting the input?

@yuweihao
Collaborator

Hi @Dong-Huo ,

As shown in the paper, since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4). Experimentally, average pooling without the subtraction still works, but it performs slightly worse than with the subtraction.
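For readers following along in code, below is a minimal sketch of the pooling token mixer being discussed, written as a PyTorch module. The forward pass matches the return self.pool(x) - x quoted later in this thread; the pool size of 3 and count_include_pad=False are assumptions about the released implementation rather than details stated here.

```python
import torch.nn as nn

class Pooling(nn.Module):
    """Average-pooling token mixer with the input subtracted (Eq. 4)."""
    def __init__(self, pool_size=3):
        super().__init__()
        # stride=1 with "same" padding keeps the spatial size unchanged
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        # Subtracting x removes each token's own contribution, since the
        # surrounding MetaFormer block adds x back via its residual connection.
        return self.pool(x) - x
```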

@yuweihao
Collaborator

Duplicate of #1

yuweihao marked this as a duplicate of #1 Nov 26, 2021
@yangcf10

yangcf10 commented Dec 3, 2021

> Hi @Dong-Huo ,
>
> As shown in the paper, since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4). Experimentally, average pooling without the subtraction still works, but it performs slightly worse than with the subtraction.

@yuweihao Have you tried removing the residual connection for the token mixer? Currently you subtract the "normed" x (basically y = x + pooling(norm(x)) - norm(x)), which seems weird.
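To make that point concrete, here is a hypothetical sketch of the token-mixing half of the block (the MLP sub-block, LayerScale, and drop path are omitted, and GroupNorm is an assumed normalization choice): the residual adds the pre-norm x, while the subtraction removes the post-norm norm(x), so the two terms do not cancel.

```python
import torch.nn as nn

class PoolFormerBlockSketch(nn.Module):
    """Hypothetical sketch of the token-mixing half of a MetaFormer block."""
    def __init__(self, dim, pool_size=3):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)  # assumed normalization choice
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        y = self.norm(x)
        # Residual adds the pre-norm x; subtraction removes the post-norm y,
        # so the +x and -y terms do not cancel:
        #   out = x + pool(norm(x)) - norm(x)
        return x + (self.pool(y) - y)
```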

@yuweihao
Collaborator

yuweihao commented Dec 3, 2021

Hi @yangcf10 ,

It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to retain the residual connection whatever the token mixer is, so that we can freely specify the token mixer in MetaFormer.

Instead, I have tried removing the subtraction, i.e., replacing return self.pool(x) - x with return self.pool(x), in my preliminary experiments. return self.pool(x) also works well, with a slight performance decrease compared with return self.pool(x) - x.

@yangcf10

yangcf10 commented Dec 3, 2021

> Hi @yangcf10 ,
>
> It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to retain the residual connection whatever the token mixer is, so that we can freely specify the token mixer in MetaFormer.
>
> Instead, I have tried removing the subtraction, i.e., replacing return self.pool(x) - x with return self.pool(x), in my preliminary experiments. return self.pool(x) also works well, with a slight performance decrease compared with return self.pool(x) - x.

Thanks for the prompt reply! I understand it's mostly from empirical results, but is there any insight into why we should do the subtraction? The explanation "since the MetaFormer block already has a residual connection, we should add the subtraction" does not seem convincing. If we treat the token mixer as an abstract module, then we shouldn't consider the residual connection when designing it.

@yuweihao
Collaborator

yuweihao commented Dec 3, 2021

Hi @yangcf10 ,

Thank you for your feedback and suggestion. We will attempt to further improve the explanation "since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4)".

@Vermeille

Why don't we just remove the residual connection and the subtraction then? It would save compute and memory.

What I'm more concerned about is that the subtraction and the residual connection don't use the same "x", so they don't cancel each other. Indeed, the residual connection uses the pre-norm x, while the subtraction uses the post-norm x.

It changes the semantics to something along the lines of a block emphasizing the spatial gradients.

What do you think? Does it work as well without the residual connection and the subtraction?

@Vermeille

Okay, I saw your other comments about using DW conv instead of pooling. I understand that PoolFormer is not what your paper is about; it is about MetaFormer, and PoolFormer is indeed just a demonstration. Also, the fact that DW conv brings similar or superior performance shows that there is nothing special about this pooling layer, let alone the subtraction. Focusing on it would be missing the forest for the trees.

@yuweihao
Collaborator

yuweihao commented Dec 4, 2021

Hi @Vermeille ,

Many thanks for your attention to this work and for the insightful comment. Yes, the goal of this work is to demonstrate that the competence of transformer-like models primarily stems from the general MetaFormer architecture. Pooling/PoolFormer are just tools to demonstrate MetaFormer. If PoolFormer is considered as a practical model to use, then, as you point out, it can be further improved in terms of implementation efficiency and other aspects.

@oliver-batchelor

Is there some relation between this pooling operation and graph convolutional networks? Because graphs have no regular structure, GCNs are essentially some kind of pooling followed by an MLP, which seems a lot like PoolFormer, though the MetaFormer still has an image pyramid, which isn't present in graphs.

@yuweihao
Collaborator

yuweihao commented Dec 20, 2021

Hi @saulzar, pooling is a basic operator in deep learning. A Transformer or MetaFormer can be regarded as a type of Graph Neural Network [1]. From this perspective, attention or pooling in MetaFormer can be regarded as a type of graph attention or graph pooling, respectively.

[1] https://graphdeeplearning.github.io/post/transformers-are-gnns/

@chuong98

@yangcf10 @yuweihao

> I understand it's mostly from empirical results, but is there any insight into why we should do the subtraction?

Average pooling combined with the subtraction yields (up to a 1/9 factor) the Laplacian kernel (https://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm)

[ 1  1  1
  1 -8  1
  1  1  1 ]

which is a classical kernel of image processing; the Laplacian computes a second-order spatial derivative. So the PoolFormer block actually computes x <- x + alpha*Laplacian(x).
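To check this correspondence numerically, the short NumPy snippet below (my illustration, not code from the repository) shows that a 3x3 average pool minus the center token is exactly 1/9 of the classical Laplacian kernel above.

```python
import numpy as np

# Effective kernel of "3x3 average pool minus the center token":
avg = np.full((3, 3), 1.0 / 9.0)   # 3x3 average-pooling kernel
center = np.zeros((3, 3))
center[1, 1] = 1.0                 # identity: the token itself
kernel = avg - center

print(kernel * 9)
# [[ 1.  1.  1.]
#  [ 1. -8.  1.]
#  [ 1.  1.  1.]]
# i.e. pool(x) - x convolves x with 1/9 of the classical Laplacian kernel.
```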

@yuweihao
Collaborator

yuweihao commented Dec 28, 2021

Hi @chuong98 ,

Yes, it can be regarded as a fixed kernel from image processing (versus the learnable kernels of traditional CNNs). For each token, Laplacian(x) aggregates information from nearby tokens that differs from the token itself, while the residual connection retains the token's own information. The alpha in Normalization or LayerScale can balance nearby information against own information. Without the subtraction, since the MetaFormer block already has a residual connection, alpha instead balances [nearby information + own information] against [own information], which looks weird. The above reason may make the performance with subtraction slightly better than that without subtraction.
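As a rough sketch of this balancing (hypothetical code, not the repository's implementation; the per-channel alpha and its initial value are assumptions), the block output can be written as x + alpha * (pool(x) - x):

```python
import torch
import torch.nn as nn

class ScaledPoolingMixer(nn.Module):
    """Hypothetical sketch: a learnable per-channel scale (LayerScale-style
    alpha) weighs the Laplacian-like output against the residual branch."""
    def __init__(self, dim, pool_size=3, init_scale=1e-5):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)
        # alpha balances "nearby information" (pool(x) - x) against the
        # token's "own information" carried by the residual connection
        self.alpha = nn.Parameter(init_scale * torch.ones(dim))

    def forward(self, x):  # x: (B, C, H, W)
        laplacian_like = self.pool(x) - x
        return x + self.alpha.view(1, -1, 1, 1) * laplacian_like
```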

Thanks for your continued attention to our work. Happy new year in advance :)

@DonkeyShot21

For anyone wondering, I got the following results on ImageNet-100:

  • vanilla PoolFormer (return self.pool(x) - x): 87.64
  • simple pooling (return self.pool(x)): 87.56
  • DWConv (return dwconv(x)): 88.10
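For reference, the DWConv token mixer mentioned in the last bullet might look roughly like the following sketch; the kernel size and the absence of a pointwise convolution are assumptions, not details given in this thread.

```python
import torch.nn as nn

class DWConvMixer(nn.Module):
    """Hypothetical depthwise-conv token mixer used as a drop-in
    replacement for Pooling."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depthwise (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dwconv(x)
```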

@chuong98

That is really helpful, thanks @DonkeyShot21. @yuweihao, can you add these extra experiments to your revised version?

@yuweihao
Collaborator

Hi @chuong98, sure, we plan to add extra experiments with more token mixers (e.g., DWConv) on ImageNet-1K in our revised version.
