
Image Captioning Using Vision Transformers

This project implements image captioning using a Vision Transformer (ViT) architecture. The model consists of a Vision Encoder and a Text Decoder built from scratch using basic PyTorch blocks. The goal is to generate descriptive captions for images with a transformer-based approach.

Demo

Here are some (somewhat) accurate captions generated for images:

| Image | Real Caption | Generated Caption |
| --- | --- | --- |
| COCO_val2014_000000028194 | A smiling woman in a wetsuit catches a wave on a surfboard | a man on a surfboard riding a wave in the ocean. |
| COCO_val2014_000000396274 | A plant in a garden near a white building. | a vase of flowers sitting next to a window. |
| COCO_val2014_000000140307 | A group of people sitting around a table eating underneath an umbrella. | a group of people standing around a table with food. |
| COCO_val2014_000000036484 | a cat sitting on a desk on a piece of lined paper next to a computer, pen and computer mouse. | a cat is laying on a laptop computer. |

However, the model generates many inaccurate captions too:

| Image | Real Caption | Generated Caption |
| --- | --- | --- |
| COCO_val2014_000000084270 | A busy airport with many people walking around. | a street with people walking in the rain. |
| COCO_val2014_000000049763 | The family running on the beach with many birds | a couple of people are sitting in the water |

Model Architecture

The Vision Transformer (ViT) architecture is implemented using basic PyTorch blocks, including self-attention layers, multi-layer perceptrons (MLPs), and sinusoidal positional embeddings. The model consists of two main components:

  • Vision Encoder: Extracts patches from input images and processes them through transformer layers.
  • Text Decoder: Generates captions by attending to the encoded image features.
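
The repository builds these components from scratch; purely to illustrate the data flow between them, here is a minimal sketch that leans on PyTorch's built-in transformer layers instead. The class names, dimensions, and layer counts below are illustrative assumptions, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Splits an image into patches, embeds them, and runs them through transformer layers."""
    def __init__(self, patch_size=16, embed_dim=256, depth=6, num_heads=8):
        super().__init__()
        # Patch extraction as a strided convolution: one embedding vector per patch.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                        # images: (B, 3, 128, 128)
        patches = self.patch_embed(images)            # (B, D, 8, 8)
        tokens = patches.flatten(2).transpose(1, 2)   # (B, 64, D)
        # Sinusoidal positional embeddings would be added here (see below).
        return self.encoder(tokens)                   # encoded image features

class TextDecoder(nn.Module):
    """Generates caption tokens while cross-attending to the encoded image features."""
    def __init__(self, vocab_size, embed_dim=256, depth=6, num_heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, token_ids, image_features):     # token_ids: (B, T)
        T = token_ids.size(1)
        # Causal mask so each position can only attend to earlier caption tokens.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=token_ids.device), diagonal=1
        )
        x = self.token_embed(token_ids)
        x = self.decoder(x, image_features, tgt_mask=causal_mask)
        return self.lm_head(x)                        # next-token logits over the vocabulary
```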

Transformer Blocks

The core of the architecture is built using transformer blocks. Each block consists of:

  • Multi-head self-attention
  • Layer normalization
  • Multi-layer perceptron (MLP)
  • Optional causal masking for the decoder
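
As a rough illustration, a from-scratch block with these pieces might look like the following. The pre-norm layout, dimensions, and use of nn.MultiheadAttention are assumptions rather than the project's exact implementation:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head self-attention + MLP, each wrapped in LayerNorm and a residual connection."""
    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4, causal=False):
        super().__init__()
        self.causal = causal
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, x):                                    # x: (B, T, D)
        attn_mask = None
        if self.causal:
            # Upper-triangular -inf mask: each position attends only to earlier ones.
            T = x.size(1)
            attn_mask = torch.triu(
                torch.full((T, T), float("-inf"), device=x.device), diagonal=1
            )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out                                     # residual connection
        x = x + self.mlp(self.norm2(x))                      # residual connection
        return x
```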

Positional Embeddings

Positional embeddings are added to both image patches and input tokens to retain spatial information throughout the encoding and decoding process.
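
A standard sinusoidal embedding of this kind can be computed roughly as follows (dimensions are placeholders; the project's exact formulation may differ):

```python
import math
import torch

def sinusoidal_positional_embedding(seq_len, embed_dim):
    """Fixed sin/cos embeddings as in 'Attention Is All You Need' (embed_dim assumed even)."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_dim)
    )                                                                     # (D/2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe                                                             # (T, D)
```

The resulting tensor is simply added to the patch embeddings (encoder) or token embeddings (decoder) before the first transformer block.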


Training

  • The model was trained for 120 epochs in a Kaggle Notebook on a P100 GPU.
  • The dataset was the MS-COCO 2014 training and validation partitions.
  • The optimizer was Adam with a learning rate of 1e-4, a batch size of 128, and images resized to 128x128. A GradScaler was also used for training stability (see the sketch after this list).
  • Below are plots of the training loss over the epochs:
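
A minimal sketch of one training step with these settings, using mixed precision and a GradScaler, is shown below. `VisionEncoder` and `TextDecoder` refer to the illustrative sketches above, and `train_loader` is a hypothetical DataLoader over MS-COCO (image, caption-ids) pairs:

```python
import torch
import torch.nn as nn

device = "cuda"
encoder = VisionEncoder().to(device)
decoder = TextDecoder(vocab_size=10_000).to(device)            # vocabulary size is a placeholder
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                           # prevents fp16 gradient underflow
criterion = nn.CrossEntropyLoss(ignore_index=0)                # assumes 0 is the padding token id

for images, captions in train_loader:                          # batch size 128, images 128x128
    images, captions = images.to(device), captions.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                            # mixed-precision forward pass
        features = encoder(images)
        logits = decoder(captions[:, :-1], features)           # predict the next caption token
        loss = criterion(logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    scaler.scale(loss).backward()                              # backward on the scaled loss
    scaler.step(optimizer)                                     # unscale gradients, then step Adam
    scaler.update()
```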

Raw Training Loss:

[plot: raw_trainig_loss]

Averaged Training Loss (window size of 512, using np.convolve):

[plot: averaged_training_loss]
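
The averaged curve can be reproduced with a simple moving average, for example as sketched here (`step_losses` stands in for the logged per-step loss values):

```python
import numpy as np
import matplotlib.pyplot as plt

window = 512
# 'valid' keeps only positions where the full 512-sample window fits.
smoothed = np.convolve(step_losses, np.ones(window) / window, mode="valid")

plt.plot(smoothed)
plt.xlabel("training step")
plt.ylabel("averaged loss")
plt.show()
```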


Credits

The code structure and training process were inspired by and adapted from Luke Ditria's YouTube channel, available here.