
Vision Transformer (ViT) Implementation and Video Attention Visualization

(Figure: attention map overlaid on a video frame)

This project implements a Vision Transformer (ViT) model from scratch and includes a script for visualizing attention on video frames. The implementation follows the original ViT paper ("An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale") and draws on various educational resources.
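
For orientation, here is a minimal sketch of a ViT forward pass in PyTorch. The class name, argument names, and hyperparameters below are illustrative assumptions, not necessarily those used in vision_transformers.py:

    # Minimal ViT sketch (illustrative; names and hyperparameters are
    # assumptions, not necessarily those in vision_transformers.py).
    import torch
    import torch.nn as nn

    class MiniViT(nn.Module):
        def __init__(self, image_size=32, patch_size=4, dim=64, depth=4,
                     heads=4, num_classes=10):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # A strided convolution splits the image into non-overlapping
            # patches and projects each one to a dim-dimensional token.
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                         stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, x):                    # x: (B, 3, H, W)
            x = self.patch_embed(x)              # (B, dim, H/P, W/P)
            x = x.flatten(2).transpose(1, 2)     # (B, N, dim)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed
            x = self.encoder(x)
            return self.head(x[:, 0])            # classify from the [CLS] token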

Project Structure

  • vision_transformers.py: Contains the implementation of the ViT model.
  • run.py: Script for applying attention visualization to video frames using a trained ViT model.

Features

  • Vision Transformer (ViT) implementation from scratch
  • Training script for the ViT model on image classification tasks (a minimal sketch follows this list)
  • Video processing script to visualize attention maps on video frames
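
For the training feature, a minimal loop might look like the following. The dataset (CIFAR-10 here), optimizer, and hyperparameters are assumptions for illustration, not necessarily what the project's script does:

    # Illustrative training loop; dataset and hyperparameters are assumed.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def train(model, epochs=5, lr=3e-4):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        data = datasets.CIFAR10("data", train=True, download=True,
                                transform=transforms.ToTensor())
        loader = DataLoader(data, batch_size=128, shuffle=True)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        model.to(device).train()
        for epoch in range(epochs):
            for images, labels in loader:
                images, labels = images.to(device), labels.to(device)
                opt.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                opt.step()
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")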

Usage

  1. Train the ViT model:

    python vision_transformers.py
    
  2. Visualize attention on a video:

    python run.py
    

    Make sure to update the experiment_name and input_video_path in the script before running.
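
One way the overlay can be produced (a sketch under assumptions; the hooks and names in run.py may differ): take the last encoder layer's attention from the [CLS] token to the N image patches, average it over heads, reshape it into the patch grid, upsample to the frame size, and blend it with the frame:

    # Illustrative overlay helper; `cls_attn` is assumed to be a 1-D NumPy
    # array of length N holding the [CLS]-to-patch attention weights of the
    # last layer, averaged over heads. run.py's actual mechanism may differ.
    import cv2
    import numpy as np

    def overlay_attention(frame_bgr, cls_attn, grid_size):
        attn = cls_attn.reshape(grid_size, grid_size)
        attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
        # cv2.resize takes (width, height); upsample the coarse patch grid.
        attn = cv2.resize(attn.astype(np.float32),
                          (frame_bgr.shape[1], frame_bgr.shape[0]))
        heat = cv2.applyColorMap((attn * 255).astype(np.uint8),
                                 cv2.COLORMAP_JET)
        return cv2.addWeighted(frame_bgr, 0.6, heat, 0.4, 0)

Frames can be read with cv2.VideoCapture and written back out with cv2.VideoWriter.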

Requirements

  • PyTorch
  • torchvision
  • numpy
  • matplotlib
  • OpenCV (cv2)

Credits

This project is inspired by and adapted from several publicly available educational resources on Vision Transformers.


Additional Resources

For a deeper understanding of Vision Transformers and attention mechanisms, we recommend the following:

  1. Attention Is All You Need - The paper introducing the transformer architecture
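
The core operation from that paper is scaled dot-product attention; the rows of its softmax output are the attention weights that attention-map visualizations are built from. A minimal sketch:

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
        weights = scores.softmax(dim=-1)   # each row sums to 1
        return weights @ v, weights        # output and the attention map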

Contact

If you have any questions or feedback, please open an issue in this repository.
