[91] Three things everyone should know about Vision Transformers

[paper](https://arxiv.org/abs/2203.09795)

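New work from Touvron.
1. Connect modules in parallel.
2. When data is scarce, fine-tuning only the MHSA layers works well.
3. Splitting the 16x16 patchify layer into several layers with smaller strides, as in ResNet-D, helps (the patchification is done in multiple steps).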

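# The paper summarizes the properties of ViT well, as follows.
Parameter count is linear in depth and quadratic in width.
FLOPs are linear in depth and quadratic in width.
Peak memory (at inference) is constant in depth and quadratic in width.
Latency: in theory a wider model is better because it parallelizes more, but this does not always hold in practice.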
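
The depth/width scaling of the parameter count can be checked with a little arithmetic. The sketch below is not from the paper. It assumes a standard pre-norm ViT block with an MLP ratio of 4, and counts only the encoder blocks (no patch embedding, positional embedding, class token, or head). The function names `block_params` and `encoder_params` are hypothetical.

```python
def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Parameters (weights + biases) of one ViT block, under the assumptions above."""
    # MHSA: fused q/k/v projection plus the output projection.
    attn = 3 * (width * width + width) + (width * width + width)
    # FFN: width -> hidden -> width.
    hidden = mlp_ratio * width
    ffn = (width * hidden + hidden) + (hidden * width + width)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * 2 * width
    return attn + ffn + norms


def encoder_params(depth: int, width: int) -> int:
    """Total parameters of `depth` identical blocks."""
    return depth * block_params(width)


base = encoder_params(depth=12, width=384)
deeper = encoder_params(depth=24, width=384)
wider = encoder_params(depth=12, width=768)

print(deeper / base)  # doubling depth doubles the count
print(wider / base)   # doubling width gives a ratio just under 4
```

The width ratio is slightly below 4 because the bias and LayerNorm terms grow only linearly in width.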

# Go parallel
![image](https://user-images.githubusercontent.com/16400591/164722283-733d2d9b-3d50-4de3-8eb7-746ba233b9c8.png)

AS-IS
![image](https://user-images.githubusercontent.com/16400591/164722396-65ee2b2b-8c0b-468e-8702-55af6d96ed91.png)

TO-BE
![image](https://user-images.githubusercontent.com/16400591/164722430-3aa32d73-5df7-404c-a6e6-019bc8a9b87b.png)

Two blocks are merged into a single block.
For a fair comparison, a 36-block model is compared against an 18x2 model (18 blocks, each with two parallel branches).
![image](https://user-images.githubusercontent.com/16400591/164723814-46c37139-4daf-4aff-9cfd-c8a5c713cd3c.png)
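
A minimal sketch of the AS-IS and TO-BE wiring, as I read the figures above. The sublayers are passed in as plain callables, so `mhsa1`, `ffn1`, `mhsa2`, `ffn2` are hypothetical stand-ins rather than real attention or FFN modules, and normalization and LayerScale are left out.

```python
def sequential_blocks(x, mhsa1, ffn1, mhsa2, ffn2):
    """AS-IS: two standard blocks applied one after the other."""
    x = x + mhsa1(x)
    x = x + ffn1(x)
    x = x + mhsa2(x)
    x = x + ffn2(x)
    return x


def parallel_block(x, mhsa1, ffn1, mhsa2, ffn2):
    """TO-BE: the two MHSA branches share one input, then the two FFN branches do."""
    x = x + mhsa1(x) + mhsa2(x)
    x = x + ffn1(x) + ffn2(x)
    return x


# Toy sublayers (simple scalings) just to show that the two wirings differ.
toy = dict(
    mhsa1=lambda v: 0.1 * v,
    ffn1=lambda v: 0.2 * v,
    mhsa2=lambda v: 0.3 * v,
    ffn2=lambda v: 0.4 * v,
)
print(sequential_blocks(1.0, **toy))
print(parallel_block(1.0, **toy))
```

Both versions use the same four sublayers, so parameters and FLOPs match. Only the residual depth is halved, which is why 36 sequential blocks are paired with 18x2.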

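LS stands for LayerScale, a mechanism that stabilizes training as the network gets deeper.
It multiplies the output of each residual branch by a learnable diagonal matrix,
and the diagonal entries are initialized to a small value close to 0.
In a sense, it can be seen as a learnable per-channel attention parameter (a fixed per-channel scale that does not depend on the input).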
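
A minimal sketch of a LayerScale residual update on one token, using plain lists. The diagonal matrix is stored as a per-channel vector `gamma`. The function name, the toy branch, and the init value `1e-4` are assumptions for illustration, not values taken from these notes.

```python
def layer_scale_residual(x, branch, gamma):
    """Return x + diag(gamma) * branch(x) for one token (a list of channels)."""
    out = branch(x)
    return [xi + gi * oi for xi, gi, oi in zip(x, gamma, out)]


channels = 4
gamma = [1e-4] * channels          # learnable in practice; small at initialization
x = [1.0, 2.0, 3.0, 4.0]
toy_branch = lambda v: [10.0 * a for a in v]  # stand-in for MHSA or FFN

y = layer_scale_residual(x, toy_branch, gamma)
print(y)  # stays close to x because gamma is small
```

With `gamma` near 0, each block starts out close to the identity, which is what makes very deep stacks easier to train.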

# When data is scarce, fine-tuning only attention is effective.
![image](https://user-images.githubusercontent.com/16400591/164724668-8f73c2ec-3d96-4517-a43b-b92824fdfbfa.png)
![image](https://user-images.githubusercontent.com/16400591/164724721-220da4b5-9971-463f-8b0f-8a6bb7b7254f.png)
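
A rough sketch of how the attention-only selection could look. It only decides which parameters to train by name and does not touch a real model. The `blocks.{i}.attn.` naming pattern is an assumption (it resembles common ViT implementations), and `attention_only_trainable` is a hypothetical helper.

```python
def attention_only_trainable(param_names):
    """Map each parameter name to True (fine-tune) or False (freeze)."""
    return {name: ".attn." in name for name in param_names}


# Hypothetical parameter names for illustration.
names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.proj.weight",
    "blocks.0.mlp.fc1.weight",
    "head.weight",
]

flags = attention_only_trainable(names)
trainable = [n for n, keep in flags.items() if keep]
print(trainable)
```

In PyTorch these flags would be applied by setting `requires_grad` on each parameter. When the label set changes, a new classification head would typically be trained as well, which this sketch leaves out.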

# Patchify works better when done in multiple steps.
![image](https://user-images.githubusercontent.com/16400591/164724797-a748a96b-3b4b-43a9-b198-e6f1baa0cd96.png)
![image](https://user-images.githubusercontent.com/16400591/164724834-0b781e4f-61d1-43c4-8d9e-78a6455289c6.png)
![image](https://user-images.githubusercontent.com/16400591/164724853-b2ce8c42-8eb5-4d36-9dab-5cc5d79841e0.png)
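
The arithmetic behind a multi-step patchify can be sketched as follows. Each step is treated as a non-overlapping patch layer (kernel size equal to stride). The 4, 2, 2 factorization of 16 is my recollection of the paper's stem and should be read as an assumption, as should the 224 input size. `grid_size` is a hypothetical helper.

```python
def grid_size(image_size: int, patch_steps) -> int:
    """Side length of the token grid after successive non-overlapping patch steps."""
    side = image_size
    for p in patch_steps:
        if side % p != 0:
            raise ValueError(f"side {side} is not divisible by patch step {p}")
        side //= p
    return side


single = grid_size(224, [16])        # one 16x16 patchify layer
multi = grid_size(224, [4, 2, 2])    # 4x4, then 2x2, then 2x2

print(single, multi, multi * multi)  # 14 14 196
```

Both stems produce the same 14x14 token grid, so the transformer sees the same sequence length. The multi-step stem only changes how each 16x16 patch is embedded.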

