About subtract in pooling #4

Open
Dong-Huo opened this issue Nov 26, 2021 · 16 comments

Comments

@Dong-Huo

Hi, thank you for publishing such a nice paper. I just have one question: I do not understand the subtraction of the input in Eq. (4). Is it necessary? What will happen if we just do the average pooling without subtracting the input?

@yuweihao
Collaborator

Hi @Dong-Huo ,

As shown in the paper, since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4). Experimentally, average pooling without the subtraction still works, but it performs slightly worse than with the subtraction.
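For readers following along in code, below is a minimal sketch of the pooling token mixer being discussed, written as a PyTorch module. The forward pass matches the return self.pool(x) - x quoted later in this thread; the pool size of 3 and count_include_pad=False are assumptions about the released implementation rather than details stated here.

```python
import torch.nn as nn

class Pooling(nn.Module):
    """Average-pooling token mixer with the input subtracted (Eq. 4)."""
    def __init__(self, pool_size=3):
        super().__init__()
        # stride=1 with "same" padding keeps the spatial size unchanged
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        # Subtracting x removes each token's own contribution, since the
        # surrounding MetaFormer block adds x back via its residual connection.
        return self.pool(x) - x
```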

@yuweihao
Collaborator

Duplicate of #1

yuweihao marked this as a duplicate of #1 Nov 26, 2021
@yangcf10

yangcf10 commented Dec 3, 2021

> Hi @Dong-Huo ,
>
> As shown in the paper, since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4). Experimentally, average pooling without the subtraction still works, but it performs slightly worse than with the subtraction.

@yuweihao Have you tried removing the residual connection for the token mixer? Currently you subtract the "normed" x (basically y = x + pooling(norm(x)) - norm(x)), which seems weird.
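To make that point concrete, here is a hypothetical sketch of the token-mixing half of the block (the MLP sub-block, LayerScale, and drop path are omitted, and GroupNorm is an assumed normalization choice): the residual adds the pre-norm x, while the subtraction removes the post-norm norm(x), so the two terms do not cancel.

```python
import torch.nn as nn

class PoolFormerBlockSketch(nn.Module):
    """Hypothetical sketch of the token-mixing half of a MetaFormer block."""
    def __init__(self, dim, pool_size=3):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)  # assumed normalization choice
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        y = self.norm(x)
        # Residual adds the pre-norm x; subtraction removes the post-norm y,
        # so the +x and -y terms do not cancel:
        #   out = x + pool(norm(x)) - norm(x)
        return x + (self.pool(y) - y)
```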

@yuweihao
Collaborator

yuweihao commented Dec 3, 2021

Hi @yangcf10 ,

It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to retain the residual connection whatever the token mixer is, so that we can freely specify the token mixer in MetaFormer.

Instead, I have tried removing the subtraction, i.e., replacing return self.pool(x) - x with return self.pool(x), in my preliminary experiments. return self.pool(x) also works well, with a slight performance decrease compared with return self.pool(x) - x.

@yangcf10

yangcf10 commented Dec 3, 2021

> Hi @yangcf10 ,
>
> It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to retain the residual connection whatever the token mixer is, so that we can freely specify the token mixer in MetaFormer.
>
> Instead, I have tried removing the subtraction, i.e., replacing return self.pool(x) - x with return self.pool(x), in my preliminary experiments. return self.pool(x) also works well, with a slight performance decrease compared with return self.pool(x) - x.

Thanks for the prompt reply! I understand it's mostly from empirical results, but is there any insight into why we should do the subtraction? The explanation "since the MetaFormer block already has a residual connection, we should add the subtraction" does not seem convincing. If we treat the token mixer as an abstract module, then we shouldn't consider the residual connection when designing it.

@yuweihao
Collaborator

yuweihao commented Dec 3, 2021

Hi @yangcf10 ,

Thank you for your feedback and suggestion. We will attempt to further improve the explanation "since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4)".

@Vermeille

Why don't we just remove the residual connection and the subtraction then? It would save compute and memory.

What I'm more concerned about is that the subtraction and the residual connection don't use the same "x", so they don't cancel each other. Indeed, the residual connection uses the pre-norm x, while the subtraction uses the post-norm x.

It changes the semantics to something along the lines of a block emphasizing the spatial gradients.

What do you think? Does it work as well without the residual connection and the subtraction?

@Vermeille

Okay, I saw your other comments about using DW conv instead of pooling. I understand that PoolFormer is not what your paper is about; it is about MetaFormer, and PoolFormer is indeed just a demonstration. Also, the fact that DW conv brings similar or superior performance shows that there is nothing special about this pooling layer, let alone the subtraction. Focusing on it would be missing the forest for the trees.

@yuweihao
Collaborator

yuweihao commented Dec 4, 2021

Hi @Vermeille ,

Many thanks for your attention to this work and for the insightful comment. Yes, the goal of this work is to demonstrate that the competence of transformer-like models primarily stems from the general MetaFormer architecture. Pooling/PoolFormer are just tools to demonstrate MetaFormer. If PoolFormer is considered as a practical model to use, then, as you point out, it can be further improved in terms of implementation efficiency and other aspects.

@oliver-batchelor

Is there some relation between this pooling operation and graph convolutional networks? Because graphs have no regular structure, GCNs are essentially some kind of pooling followed by an MLP, which seems a lot like PoolFormer, though the MetaFormer still has an image pyramid, which isn't present in graphs.

@yuweihao
Collaborator

yuweihao commented Dec 20, 2021

Hi @saulzar, pooling is a basic operator in deep learning. A Transformer or MetaFormer can be regarded as a type of Graph Neural Network [1]. From this perspective, attention or pooling in MetaFormer can be regarded as a type of graph attention or graph pooling, respectively.

[1] https://graphdeeplearning.github.io/post/transformers-are-gnns/

@chuong98

@yangcf10 @yuweihao

> I understand it's mostly from empirical results, but is there any insight into why we should do the subtraction?

Average pooling combined with the subtraction yields (up to a 1/9 factor) the Laplacian kernel (https://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm)

[ 1  1  1
  1 -8  1
  1  1  1 ]

which is a classical kernel of image processing; the Laplacian computes a second-order spatial derivative. So the PoolFormer block actually computes x <- x + alpha*Laplacian(x).
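To check this correspondence numerically, the short NumPy snippet below (my illustration, not code from the repository) shows that a 3x3 average pool minus the center token is exactly 1/9 of the classical Laplacian kernel above.

```python
import numpy as np

# Effective kernel of "3x3 average pool minus the center token":
avg = np.full((3, 3), 1.0 / 9.0)   # 3x3 average-pooling kernel
center = np.zeros((3, 3))
center[1, 1] = 1.0                 # identity: the token itself
kernel = avg - center

print(kernel * 9)
# [[ 1.  1.  1.]
#  [ 1. -8.  1.]
#  [ 1.  1.  1.]]
# i.e. pool(x) - x convolves x with 1/9 of the classical Laplacian kernel.
```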

@yuweihao
Collaborator

yuweihao commented Dec 28, 2021

Hi @chuong98 ,

Yes, it can be regarded as a fixed kernel from image processing (versus the learnable kernels of traditional CNNs). For each token, Laplacian(x) aggregates information from nearby tokens that differs from the token itself, while the residual connection retains the token's own information. The alpha in Normalization or LayerScale can balance nearby information against own information. Without the subtraction, since the MetaFormer block already has a residual connection, alpha instead balances [nearby information + own information] against [own information], which looks weird. The above reason may make the performance with subtraction slightly better than that without subtraction.
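As a rough sketch of this balancing (hypothetical code, not the repository's implementation; the per-channel alpha and its initial value are assumptions), the block output can be written as x + alpha * (pool(x) - x):

```python
import torch
import torch.nn as nn

class ScaledPoolingMixer(nn.Module):
    """Hypothetical sketch: a learnable per-channel scale (LayerScale-style
    alpha) weighs the Laplacian-like output against the residual branch."""
    def __init__(self, dim, pool_size=3, init_scale=1e-5):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)
        # alpha balances "nearby information" (pool(x) - x) against the
        # token's "own information" carried by the residual connection
        self.alpha = nn.Parameter(init_scale * torch.ones(dim))

    def forward(self, x):  # x: (B, C, H, W)
        laplacian_like = self.pool(x) - x
        return x + self.alpha.view(1, -1, 1, 1) * laplacian_like
```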

Thanks for your continued attention to our work. Happy new year in advance :)

@DonkeyShot21

For anyone wondering, I got the following results on ImageNet-100:

  • vanilla PoolFormer (return self.pool(x) - x): 87.64
  • simple pooling (return self.pool(x)): 87.56
  • DWConv (return dwconv(x)): 88.10
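For reference, the DWConv token mixer mentioned in the last bullet might look roughly like the following sketch; the kernel size and the absence of a pointwise convolution are assumptions, not details given in this thread.

```python
import torch.nn as nn

class DWConvMixer(nn.Module):
    """Hypothetical depthwise-conv token mixer used as a drop-in
    replacement for Pooling."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depthwise (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dwconv(x)
```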

@chuong98

That is really helpful, thanks @DonkeyShot21. @yuweihao, can you add these extra experiments to your revised version?

@yuweihao
Collaborator

Hi @chuong98, sure, we plan to add extra experiments with more token mixers (e.g., DWConv) on ImageNet-1K in our revised version.
