🌐 xT: Nested Tokenization for Larger Context in Large Images

Ritwik Gupta*, Shufan Li*, Tyler Zhu*, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam
Paper: https://arxiv.org/abs/2403.01915

arXiv | Project page

About

xT is a simple framework for vision transformers that aggregates global context with local details. It lets you model large images end-to-end on contemporary, memory-limited GPUs.

Installation

  • conda env create -f environment.yml

The code has been tested on Linux with NVIDIA A100 GPUs and PyTorch 2+. It relies on custom CUDA kernels from the Mamba and OpenAI Triton projects, so modifications may be required to run this repository on other operating systems or GPUs.
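Because the code depends on PyTorch 2+ and on kernels from the Mamba and Triton projects, a quick import check after creating the environment can catch a missing dependency before a job is launched. The sketch below is not part of the repository. The import names `torch`, `triton`, and `mamba_ssm` are assumptions about how those projects are exposed in this environment, and `find_missing_modules` is a hypothetical helper.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them to match environment.yml.
    required = ["torch", "triton", "mamba_ssm"]
    missing = find_missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only confirms that the modules can be found. It does not verify that the CUDA kernels build or run on a given GPU.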

Training

Launch training with the following command:

  • ./run_submit.sh <num GPUs> <port number> config=<path to config>

We also provide submitit scripts in launch_scripts for submitting training jobs to Slurm clusters.
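When sweeping over several configs, it can help to assemble the launch command programmatically. The following is a minimal sketch that assumes only the argument order given above (number of GPUs, port number, then `config=<path>`). The helper names `build_launch_command` and `launch`, the port `29500`, and the path `configs/example.yaml` are hypothetical placeholders, not names from the repository.

```python
import subprocess


def build_launch_command(num_gpus, port, config_path):
    """Build the argument list for ./run_submit.sh in the order described above."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 1 <= port <= 65535:
        raise ValueError("port must be in the range 1-65535")
    return ["./run_submit.sh", str(num_gpus), str(port), f"config={config_path}"]


def launch(num_gpus, port, config_path):
    """Run the launch script from the repository root and wait for it to exit."""
    cmd = build_launch_command(num_gpus, port, config_path)
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical values: 4 GPUs, port 29500, and a placeholder config path.
    print(" ".join(build_launch_command(4, 29500, "configs/example.yaml")))
```

Passing the arguments as a list avoids shell quoting problems when a config path contains spaces.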

Pretrained Models

Weights and configs for our experiments are available on Hugging Face.

| Name | Resolution | Top-1 Acc. | Params | Mem (GB) | Throughput (region/s) |
|---|---|---|---|---|---|
| Swin-T | 256 | 53.76 | 31M | 0.30 | 76.43 |
| Swin-T <xT> Hyper | 256/256 | 52.93 | 47M | 0.31 | 47.81 |
| Swin-T <xT> Hyper | 512/256 | 60.56 | 47M | 0.29 | 88.28 |
| Swin-T <xT> XL | 512/256 | 58.92 | 47M | 0.17 | 80.00 |
| Swin-T <xT> Mamba | 512/256 | 61.97 | 44M | 0.29 | 84.77 |
| Swin-S | 256 | 58.45 | 52M | 0.46 | 44.44 |
| Swin-S <xT> Hyper | 256/256 | 57.04 | 69M | 0.46 | 39.80 |
| Swin-S <xT> Hyper | 512/256 | 63.62 | 69M | 0.46 | 41.45 |
| Swin-S <xT> XL | 512/256 | 62.68 | 69M | 0.23 | 36.36 |
| Swin-B | 256 | 58.57 | 92M | 0.50 | 36.14 |
| Swin-B <xT> Hyper | 256/256 | 55.52 | 107M | 0.61 | 29.85 |
| Swin-B <xT> Hyper | 512/256 | 64.08 | 107M | 0.74 | 24.00 |
| Swin-B <xT> XL | 512/256 | 62.09 | 107M | 0.39 | 41.03 |
| Swin-B <xT> Mamba | 512/256 | 63.73 | 103M | 0.58 | 29.09 |
| Swin-L | 256 | 68.78 | 206M | 0.84 | 17.02 |
| Swin-L <xT> Hyper | 256/256 | 67.84 | 215M | 1.06 | 16.08 |
| Swin-L <xT> Hyper | 512/256 | 72.42 | 215M | 1.03 | 16.58 |
| Swin-L <xT> XL | 512/256 | 73.47 | 215M | 0.53 | 14.10 |
| Swin-L <xT> Mamba | 512/256 | 73.36 | 212M | 1.03 | 15.61 |
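
To compare xT variants against their baselines, the Top-1 accuracy column can be reduced to a per-backbone gain. The sketch below copies the accuracy values from the table above. The dictionary layout and the `best_gain` helper are illustrative assumptions and are not part of the repository.

```python
# Top-1 accuracy of each baseline backbone (resolution 256), from the table above.
BASELINE = {"Swin-T": 53.76, "Swin-S": 58.45, "Swin-B": 58.57, "Swin-L": 68.78}

# Top-1 accuracy of each xT variant, keyed by backbone, from the same table.
XT_VARIANTS = {
    "Swin-T": {"Hyper 256/256": 52.93, "Hyper 512/256": 60.56,
               "XL 512/256": 58.92, "Mamba 512/256": 61.97},
    "Swin-S": {"Hyper 256/256": 57.04, "Hyper 512/256": 63.62,
               "XL 512/256": 62.68},
    "Swin-B": {"Hyper 256/256": 55.52, "Hyper 512/256": 64.08,
               "XL 512/256": 62.09, "Mamba 512/256": 63.73},
    "Swin-L": {"Hyper 256/256": 67.84, "Hyper 512/256": 72.42,
               "XL 512/256": 73.47, "Mamba 512/256": 73.36},
}


def best_gain(backbone):
    """Return (variant, gain) for the xT variant with the highest Top-1 accuracy.

    The gain is the difference to the baseline in accuracy points,
    rounded to two decimals.
    """
    variant, acc = max(XT_VARIANTS[backbone].items(), key=lambda kv: kv[1])
    return variant, round(acc - BASELINE[backbone], 2)


if __name__ == "__main__":
    for name in BASELINE:
        variant, gain = best_gain(name)
        print(f"{name}: best xT variant is {variant} ({gain:+.2f} points)")
```

This summary looks at accuracy only. Memory and throughput differ across variants and should be weighed separately.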

Citation

@article{xTLargeImageModeling,
  title={xT: Nested Tokenization for Larger Context in Large Images},
  author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
  journal={arXiv preprint arXiv:2403.01915},
  year={2024}
}