This project implements a Vision Transformer (ViT) model from scratch and includes a script for visualizing attention maps on video frames. The implementation follows the original ViT paper (Dosovitskiy et al., 2020) and draws on the educational resources credited below.
- `vision_transformers.py`: Contains the implementation of the ViT model.
- `run.py`: Script for applying attention visualization to video frames using a trained ViT model.
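For orientation, here is a minimal sketch of the pieces a from-scratch ViT typically comprises: a patch embedding, a CLS token with learned positional embeddings, a transformer encoder, and a classification head. The class names and hyperparameters below are illustrative assumptions, not the actual contents of `vision_transformers.py`.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution applies one linear projection per non-overlapping patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, N, D)

class ViT(nn.Module):
    """Minimal ViT: patch embedding + CLS token + transformer encoder + head."""
    def __init__(self, img_size=224, patch_size=16, num_classes=10,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)  # pre-norm, as in the ViT paper
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                          # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend CLS, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                        # classify on the CLS token

model = ViT(num_classes=10)
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```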
- Vision Transformer (ViT) implementation from scratch
- Training script for the ViT model on image classification tasks
- Video processing script to visualize attention maps on video frames
- Train the ViT model: `python vision_transformers.py`
- Visualize attention on a video: `python run.py`
  Make sure to update `experiment_name` and `input_video_path` in the script before running. (A sketch of the overlay step appears after this list.)
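For reference, here is a hedged sketch of the overlay step such a script performs: averaging CLS-to-patch attention across heads, reshaping it to the patch grid, and blending the resulting heatmap onto a frame with OpenCV. The function names and synthetic inputs are assumptions for illustration; `run.py` may wire things differently (e.g., capturing real attention weights via forward hooks and reading frames with `cv2.VideoCapture`).

```python
import cv2
import numpy as np
import torch

def cls_attention_heatmap(attn, grid_size, frame_shape):
    """Turn CLS-to-patch attention into a uint8 heatmap sized like the frame.

    attn: (num_heads, N+1, N+1) attention weights from one encoder layer,
    where index 0 is the CLS token and the remaining N tokens are patches.
    """
    cls_attn = attn[:, 0, 1:].mean(dim=0)            # average over heads: (N,)
    grid = cls_attn.reshape(grid_size, grid_size)    # back to the patch grid
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
    heat = cv2.resize(grid.numpy(), (frame_shape[1], frame_shape[0]))
    return (heat * 255).astype(np.uint8)

def overlay(frame, heat, alpha=0.5):
    """Blend a JET-colored heatmap over a BGR frame."""
    colored = cv2.applyColorMap(heat, cv2.COLORMAP_JET)
    return cv2.addWeighted(frame, 1 - alpha, colored, alpha, 0)

# Demo with a blank frame and random attention, standing in for a real
# video frame and weights captured from the trained model.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
attn = torch.softmax(torch.randn(12, 197, 197), dim=-1)  # 14x14 patches + CLS
out = overlay(frame, cls_attention_heatmap(attn, 14, frame.shape))
cv2.imwrite("attention_frame.png", out)
```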
- PyTorch
- torchvision
- numpy
- matplotlib
- OpenCV (cv2)
This project is inspired by and adapted from the following resources:
- Implementing Vision Transformer (ViT) from Scratch by Tin Nguyen: this article provided the foundation for our ViT implementation and helped structure the code.
- Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy: while this video focuses on GPT, it offers valuable insights into transformer architecture and implementation details that were helpful in understanding and adapting the ViT model.
Additional Resources
For a deeper understanding of Vision Transformers and attention mechanisms, we recommend the following:
- Attention Is All You Need (Vaswani et al., 2017): the paper that introduced the transformer architecture.
If you have any questions or feedback, please open an issue in this repository.