This project implements image captioning using a Vision Transformer (ViT) architecture. The model consists of a Vision Encoder and a Text Decoder, both built from scratch using basic PyTorch building blocks. The goal is to generate descriptive captions for images with a transformer-based approach.
Here are some (somewhat) accurate captions generated for images:
However, the model generates many inaccurate captions too:
The Vision Transformer (ViT) architecture is implemented using basic PyTorch blocks, including self-attention layers, multi-layer perceptrons (MLPs), and sinusoidal positional embeddings. The model consists of two main components:
- Vision Encoder: Extracts patches from input images and processes them through transformer layers (a patch-embedding sketch is shown after this list).
- Text Decoder: Generates captions by attending to the encoded image features.
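For illustration, here is a minimal sketch of how the encoder's patch extraction and embedding step can be built from basic PyTorch blocks; the class name, patch size, and embedding dimension are assumptions for this example, not values taken from this repository's code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to a vector.
    Hypothetical sketch: the names and default sizes here are illustrative only."""

    def __init__(self, img_size=128, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size is equivalent to
        # cutting the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x

# Example: a 128x128 RGB image with 16x16 patches becomes 64 patch tokens.
# tokens = PatchEmbedding()(torch.randn(1, 3, 128, 128))  # -> (1, 64, 256)
```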
The core of the architecture is built from transformer blocks. Each block consists of the following (a minimal sketch appears after this list):
- Self-attention mechanisms (Multi-Head)
- Layer normalization
- Multi-layer perceptron (MLP)
- Optional causal masking for the decoder
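A minimal sketch of such a block, using PyTorch's built-in `nn.MultiheadAttention`; the hyperparameters are illustrative assumptions, and the cross-attention a decoder block would use to attend to the image features is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: multi-head self-attention followed by an MLP.
    Illustrative sketch; layer sizes are assumptions, not the repository's values."""

    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4, causal=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )
        self.causal = causal  # decoder blocks mask future tokens, encoder blocks do not

    def forward(self, x):
        mask = None
        if self.causal:
            # Upper-triangular boolean mask blocks attention to future positions.
            seq_len = x.size(1)
            mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                         device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x
```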
Positional embeddings are added to both the image patches and the input tokens so that spatial and sequential position information is retained throughout the encoding and decoding process.
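One common way to build sinusoidal positional embeddings is the standard formulation from "Attention Is All You Need"; the sketch below follows that formulation and is not necessarily the exact implementation used here:

```python
import math
import torch

def sinusoidal_positional_embedding(seq_len, embed_dim):
    """Fixed sine/cosine positional embeddings; illustrative sketch only."""
    position = torch.arange(seq_len).unsqueeze(1)                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2)
                         * (-math.log(10000.0) / embed_dim))          # (embed_dim/2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe                                     # (seq_len, embed_dim)

# The same table can be added to patch embeddings in the encoder and to token
# embeddings in the decoder, e.g. x = x + sinusoidal_positional_embedding(x.size(1), x.size(2))
```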
- The model was trained for 120 epochs on a Kaggle Notebook using a P100 GPU.
- The dataset used was the MS-COCO 2014 training and validation partitions.
- The optimizer was Adam with a learning rate of `1e-4`, a batch size of 128, and images resized to 128x128. A `GradScaler` was also used for training stability; a minimal sketch of this setup is shown below.
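A minimal sketch of this training setup, assuming a standard PyTorch loop with teacher forcing; the model interface, dataloader, and function names are placeholders, not the repository's actual code:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Images resized to 128x128 before being fed to the encoder.
coco_transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

def train(model, train_loader, epochs=120, lr=1e-4, device="cuda"):
    """Hypothetical training loop matching the settings listed above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()  # mixed-precision loss scaling for stability

    model.to(device)
    for epoch in range(epochs):
        for images, tokens in train_loader:  # batch_size=128 is set on the DataLoader
            images, tokens = images.to(device), tokens.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                # Assumed interface: the model takes the images and the caption
                # prefix, and predicts the next token at each position.
                logits = model(images, tokens[:, :-1])
                loss = criterion(logits.reshape(-1, logits.size(-1)),
                                 tokens[:, 1:].reshape(-1))
            scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
            scaler.step(optimizer)
            scaler.update()
```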
Below are plots of the training loss over the epochs:

Raw Training Loss:

Averaged Training Loss (window size of 512, smoothed with `np.convolve`):
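The smoothing mentioned above is a simple moving average and can be reproduced roughly as follows (assuming the raw per-step losses were logged to an array):

```python
import numpy as np

def moving_average(losses, window=512):
    """Smooth the raw per-step training loss with a sliding-window average."""
    kernel = np.ones(window) / window
    return np.convolve(losses, kernel, mode="valid")

# smoothed = moving_average(raw_losses)  # raw_losses: one loss value per training step
```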
The code structure and training process were inspired by and adapted from Luke Ditria's YouTube channel, available here.