Implementation of the “Show, Attend and Tell” paper (Xu et al., 2015) using PyTorch.
The model generates natural language captions for images via an encoder–decoder architecture with an attention mechanism.
This repository demonstrates deep learning-based image captioning. Given an image, the model produces a natural language description, suitable for accessibility, image retrieval, and automatic content generation.
- Encoder: ResNet50 (default) or VGG19 backbone extracts spatial image features.
- Attention: Additive attention mechanism highlights informative image regions during captioning (a minimal sketch follows the feature list below).
- Decoder: LSTM with attention and gating generates captions step by step.
- Data Preprocessing: Converts Flickr-style `captions.txt` to `captions.json` and splits the dataset into train/test sets.
- Training Pipeline: Automated preprocessing, training, and checkpointing via `run_pipeline.py`.
- Evaluation: `eval.py` computes BLEU-1 to BLEU-4 scores; `test.py` generates captions on the test split.
- Visualization: `visualize_captions.py` displays images with predicted captions.
- Model Summary: `print_modelsummary.py` prints the architecture of the encoder and decoder.
- Results: All outputs and metrics are saved in `experiments/results/`.
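To make the attention and gating steps concrete, here is a minimal PyTorch sketch of Bahdanau-style additive attention as described in the paper. It is an illustration only, not this repository's actual module; the dimension names (`encoder_dim`, `decoder_dim`, `attention_dim`) and the gate layer `f_beta` are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Soft additive attention over spatial image features (sketch, not the repo's exact code)."""

    def __init__(self, encoder_dim: int, decoder_dim: int, attention_dim: int):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # project image features
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # project LSTM hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # one score per image region
        self.f_beta = nn.Linear(decoder_dim, encoder_dim)         # gate for the context vector

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (batch, num_regions, encoder_dim), e.g. 7x7 = 49 regions from ResNet50
        # decoder_hidden: (batch, decoder_dim), current hidden state of the LSTM decoder
        att1 = self.encoder_att(encoder_out)                        # (batch, regions, attn)
        att2 = self.decoder_att(decoder_hidden).unsqueeze(1)        # (batch, 1, attn)
        scores = self.full_att(torch.tanh(att1 + att2)).squeeze(2)  # (batch, regions)
        alpha = torch.softmax(scores, dim=1)                        # attention weights
        context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)     # (batch, encoder_dim)
        gate = torch.sigmoid(self.f_beta(decoder_hidden))           # per-channel gate
        return gate * context, alpha                                # gated context + weights


# Tiny smoke test with random tensors
if __name__ == "__main__":
    attn = AdditiveAttention(encoder_dim=2048, decoder_dim=512, attention_dim=256)
    feats = torch.randn(4, 49, 2048)    # 4 images, 49 spatial regions
    hidden = torch.randn(4, 512)
    context, alpha = attn(feats, hidden)
    print(context.shape, alpha.shape)   # torch.Size([4, 2048]) torch.Size([4, 49])
```

At each decoding step the gated context vector is combined with the previous word embedding and fed to the LSTM cell; the sigmoid gate over the hidden state is the "gating" mentioned in the decoder bullet above.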
```bash
# Clone repository
git clone <repo_url>
cd Image_captioning

# Create virtual environment
python3 -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

1. Preprocess data and train the model

   ```bash
   python3.12 run_pipeline.py
   ```

2. Evaluate (BLEU scores on the test set)

   ```bash
   python3.12 eval.py --checkpoint experiments/checkpoints/latest.pth --split test
   ```

3. Generate captions for the test split

   ```bash
   python3 test.py
   ```

4. Visualize model predictions

   ```bash
   python3 visualize_captions.py
   ```

5. Print a summary of the model

   ```bash
   python3 print_modelsummary.py
   ```

Final Test BLEU Scores:

| Metric | Score |
|---|---|
| BLEU-1 | 0.7941 |
| BLEU-2 | 0.6684 |
| BLEU-3 | 0.5700 |
| BLEU-4 | 0.4850 |
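For reference, corpus-level BLEU-1 through BLEU-4 scores of this kind are commonly computed with NLTK. The snippet below is a hedged sketch of that computation and may not match what `eval.py` does internally; the function name `bleu_scores` is made up for illustration.

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_scores(references, hypotheses):
    """references: list of lists of reference token lists (several captions per image).
    hypotheses: list of predicted token lists, aligned with references."""
    weights = {
        "BLEU-1": (1.0, 0.0, 0.0, 0.0),
        "BLEU-2": (0.5, 0.5, 0.0, 0.0),
        "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0.0),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses, weights=w)
            for name, w in weights.items()}
```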
- Supports Flickr8k/Flickr30k-style datasets.
- Place images and captions under `data/` (see the preprocessing script for details).
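As a rough sketch of what the preprocessing step does with a Flickr-style caption file, the snippet below groups `image,caption` lines into a JSON mapping. The paths, the header line, and the exact output structure are assumptions; defer to the repository's preprocessing script for the real format.

```python
import json
from collections import defaultdict

def captions_txt_to_json(txt_path="data/captions.txt", json_path="data/captions.json"):
    """Group Flickr-style 'image,caption' lines into {image_name: [caption, ...]} (sketch)."""
    grouped = defaultdict(list)
    with open(txt_path, encoding="utf-8") as f:
        next(f)  # skip the 'image,caption' header line, if present
        for line in f:
            image, caption = line.rstrip("\n").split(",", 1)
            grouped[image].append(caption.strip())
    with open(json_path, "w", encoding="utf-8") as out:
        json.dump(grouped, out, indent=2)
```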
Below: Example of a test image and a generated caption.
Pull requests and issues are welcome!
This project is licensed under the MIT License — see the LICENSE file for details.
