[102] TRT-ViT: TensorRT-oriented Vision Transformer

[paper](https://arxiv.org/pdf/2205.09579.pdf)

Skipping the TeraFLOPs / TeraParams explanation.

## TRT-ViT
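The paper derives four rules empirically and uses them to choose the architecture:
1. Transformer blocks give the best cost-performance when placed in the last stage (a widely known fact).
2. The early stages can be shallow.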
    ![image](https://user-images.githubusercontent.com/16400591/196411745-11b4a06c-8892-4980-982d-10eba8e5dc31.png)
3. Mixing transformer and bottleneck blocks gives better cost-performance than using transformer blocks alone.
4. Looking at global information first and local information second is more effective.
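
To make the four rules concrete, here is a minimal sketch of how they could translate into a per-stage block layout. The function name `plan_stages`, the block labels, and the depths `[2, 2, 6, 3]` are hypothetical illustrations, not values from the paper; see the architecture figure for the real configuration.

```python
from typing import List


def plan_stages(depths: List[int], mix_stages: int = 1) -> List[List[str]]:
    """Return block-type labels per stage (illustrative only).

    Early stages use convolutional bottleneck blocks only (rules 1 and 2).
    The last `mix_stages` stages use a mixed block (rule 3) that runs the
    global transformer part before the local bottleneck part (rule 4).
    """
    if not 0 <= mix_stages <= len(depths):
        raise ValueError("mix_stages must be between 0 and the number of stages")

    first_mix = len(depths) - mix_stages
    plan = []
    for stage_idx, depth in enumerate(depths):
        if stage_idx < first_mix:
            plan.append(["bottleneck"] * depth)
        else:
            plan.append(["transformer->bottleneck"] * depth)
    return plan


# Hypothetical depths, chosen only to show shallow early stages.
for i, blocks in enumerate(plan_stages([2, 2, 6, 3], mix_stages=1), start=1):
    print(f"stage {i}: {blocks}")
```

The sketch only captures the ordering and placement of block types; channel widths, attention details, and how the two parts are fused inside a mixed block are left out.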

![image](https://user-images.githubusercontent.com/16400591/196411590-23754eb5-f2f0-415f-b851-d22ae1447a52.png)

That's all there is to it, lol.

As the table below shows, block (C) was the most effective.
![image](https://user-images.githubusercontent.com/16400591/196412133-52373564-d8b2-4aac-8583-fe32e302cbe9.png)

The detailed architecture is as follows:
![image](https://user-images.githubusercontent.com/16400591/196412296-35513db6-6257-4e14-90b8-59b45bce14b2.png)


## Results
### ImageNet
![image](https://user-images.githubusercontent.com/16400591/196412353-f90e9e2e-85aa-4d77-9baa-1f8b9e538194.png)


#### Settings
- Swin setting (https://github.com/microsoft/Swin-Transformer/blob/d19503d7fbed704792a5e5a3a5ee36f9357d26c1/config.py)
- GPU: V100 * 8
- epochs: 300
- batch-size: 1024
- resolution: 224x224
- gradient clipping: max norm 1 
- Augmentation
  - RandAugment: `rand-m9-mstd0.5-inc1`
  - Mixup / CutMix: one of the two is chosen with probability 0.5
    - Mixup: alpha 0.8
    - Cutmix: alpha 1.0
  - random erasing: 0.25
  - stochastic depth: 0.1 (slightly modified, DeiT-style)
  - repeated augmentation and EMA are both not used (in Swin's experiments they had little effect on performance)
- optimizer
  - AdamW
  - weight decay: 0.05
  - warmup: 30 epochs
  - lr
    - 0.001
    - cosine decay
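
As a rough illustration of the optimizer schedule listed above, the sketch below computes the learning rate per epoch using the values from these notes (base lr 0.001, 30 warmup epochs, 300 total epochs). The linear warmup shape, starting from 0, a minimum lr of 0, and per-epoch granularity are assumptions; the Swin training code updates per step and uses its own warmup and minimum lr values. `lr_at_epoch` is a hypothetical helper name.

```python
import math


def lr_at_epoch(
    epoch: float,
    base_lr: float = 1e-3,      # from the notes
    warmup_epochs: int = 30,    # from the notes
    total_epochs: int = 300,    # from the notes
    min_lr: float = 0.0,        # assumption
) -> float:
    """Linear warmup followed by cosine decay (illustrative sketch)."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 0 up to base_lr.
        return base_lr * epoch / warmup_epochs
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for e in (0, 15, 30, 165, 300):
    print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.6f}")
```

With batch size 1024 spread over 8 V100s, each GPU would see 128 images per step, assuming no gradient accumulation.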

### Ablations
![image](https://user-images.githubusercontent.com/16400591/196413198-bd1a6495-34fa-4cab-9257-987d08baf837.png)

### ADE 20K
![image](https://user-images.githubusercontent.com/16400591/196413305-7bf71dc9-082b-49ab-8ae0-1edd9d91b5c0.png)

### COCO
![image](https://user-images.githubusercontent.com/16400591/196413260-88cf86b9-9852-4887-be1e-dce7086d611f.png)

