This project implements image captioning using a Vision Transformer (ViT) architecture. The model consists of a Vision Encoder and a Text Decoder, both built from scratch using basic PyTorch building blocks. The goal is to generate descriptive captions for images with a transformer-based approach.
Here are some (somewhat) accurate captions generated for images:
However, the model generates many inaccurate captions too:
The Vision Transformer (ViT) architecture is implemented using basic PyTorch blocks, including self-attention layers, multi-layer perceptrons (MLPs), and sinusoidal positional embeddings. The model consists of two main components:
- Vision Encoder: Extracts patches from input images and processes them through transformer layers (a patch-embedding sketch is shown after this list).
- Text Decoder: Generates captions by attending to the encoded image features.
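For illustration, here is a minimal sketch of how the encoder's patch extraction and embedding step can be built from basic PyTorch blocks; the class name, patch size, and embedding dimension are assumptions for this example, not values taken from this repository's code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each one to a vector.
    Hypothetical sketch: the names and default sizes here are illustrative only."""

    def __init__(self, img_size=128, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size is equivalent to
        # cutting the image into patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (batch, embed_dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x

# Example: a 128x128 RGB image with 16x16 patches becomes 64 patch tokens.
# tokens = PatchEmbedding()(torch.randn(1, 3, 128, 128))  # -> (1, 64, 256)
```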
The core of the architecture is built from transformer blocks. Each block consists of the following (a minimal sketch appears after this list):
- Self-attention mechanisms (Multi-Head)
- Layer normalization
- Multi-layer perceptron (MLP)
- Optional causal masking for the decoder
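A minimal sketch of such a block, using PyTorch's built-in `nn.MultiheadAttention`; the hyperparameters are illustrative assumptions, and the cross-attention a decoder block would use to attend to the image features is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: multi-head self-attention followed by an MLP.
    Illustrative sketch; layer sizes are assumptions, not the repository's values."""

    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4, causal=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )
        self.causal = causal  # decoder blocks mask future tokens, encoder blocks do not

    def forward(self, x):
        mask = None
        if self.causal:
            # Upper-triangular boolean mask blocks attention to future positions.
            seq_len = x.size(1)
            mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                         device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x
```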
Positional embeddings are added to both the image patches and the input tokens so that spatial and sequential position information is retained throughout the encoding and decoding process.
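One common way to build sinusoidal positional embeddings is the standard formulation from "Attention Is All You Need"; the sketch below follows that formulation and is not necessarily the exact implementation used here:

```python
import math
import torch

def sinusoidal_positional_embedding(seq_len, embed_dim):
    """Fixed sine/cosine positional embeddings; illustrative sketch only."""
    position = torch.arange(seq_len).unsqueeze(1)                     # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2)
                         * (-math.log(10000.0) / embed_dim))          # (embed_dim/2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe                                     # (seq_len, embed_dim)

# The same table can be added to patch embeddings in the encoder and to token
# embeddings in the decoder, e.g. x = x + sinusoidal_positional_embedding(x.size(1), x.size(2))
```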
- The model was trained for 120 epochs on a Kaggle Notebook using a P100 GPU.
- The dataset used was the MS-COCO 2014 training and validation partitions.
- The optimizer was Adam with a learning rate of `1e-4`, a batch size of 128, and images resized to 128x128. A `GradScaler` was also used for training stability; a minimal sketch of this setup is shown below.
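A minimal sketch of this training setup, assuming a standard PyTorch loop with teacher forcing; the model interface, dataloader, and function names are placeholders, not the repository's actual code:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Images resized to 128x128 before being fed to the encoder.
coco_transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

def train(model, train_loader, epochs=120, lr=1e-4, device="cuda"):
    """Hypothetical training loop matching the settings listed above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()  # mixed-precision loss scaling for stability

    model.to(device)
    for epoch in range(epochs):
        for images, tokens in train_loader:  # batch_size=128 is set on the DataLoader
            images, tokens = images.to(device), tokens.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                # Assumed interface: the model takes the images and the caption
                # prefix, and predicts the next token at each position.
                logits = model(images, tokens[:, :-1])
                loss = criterion(logits.reshape(-1, logits.size(-1)),
                                 tokens[:, 1:].reshape(-1))
            scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
            scaler.step(optimizer)
            scaler.update()
```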
Below are plots of the training loss over the epochs:

Raw Training Loss:

Averaged Training Loss (window size of 512, smoothed with `np.convolve`):
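The smoothing mentioned above is a simple moving average and can be reproduced roughly as follows (assuming the raw per-step losses were logged to an array):

```python
import numpy as np

def moving_average(losses, window=512):
    """Smooth the raw per-step training loss with a sliding-window average."""
    kernel = np.ones(window) / window
    return np.convolve(losses, kernel, mode="valid")

# smoothed = moving_average(raw_losses)  # raw_losses: one loss value per training step
```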
The code structure and training process were inspired by and adapted from Luke Ditria's YouTube channel, available here.