TensorFlow implementation of Deep Cross-Modal Projection Learning for Image-Text Matching, accepted by ECCV 2018.
We propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings.
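To give a feel for the CMPM loss, here is a minimal plain-Python sketch. It is not the repository's TensorFlow code. It assumes the formulation described in the paper: each embedding is projected onto the L2-normalized embeddings of the other modality, a softmax over those scalar projections gives a predicted matching distribution, and the loss is the KL divergence from that distribution to the normalized true matching distribution, summed over both directions. The names `cmpm_loss`, `_one_direction`, `_unit`, and `_softmax`, and the `eps` default, are hypothetical.

```python
import math


def _unit(vec, eps=1e-8):
    # L2-normalize a vector (eps avoids division by zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def _softmax(row):
    # Numerically stable softmax over one row of scores.
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def _one_direction(src, dst, labels, eps):
    # KL(p || q) averaged over the batch, projecting src onto normalized dst.
    dst_unit = [_unit(v) for v in dst]
    loss = 0.0
    for i, s in enumerate(src):
        # Scalar projection of s onto every normalized target embedding.
        proj = [sum(a * b for a, b in zip(s, d)) for d in dst_unit]
        p = _softmax(proj)
        # True matching distribution: uniform over items with the same label.
        matches = [1.0 if labels[i] == other else 0.0 for other in labels]
        total = sum(matches)
        q = [m / total for m in matches]
        loss += sum(pj * math.log(pj / (qj + eps))
                    for pj, qj in zip(p, q) if pj > 0.0)
    return loss / len(src)


def cmpm_loss(image_emb, text_emb, labels, eps=1e-8):
    # Sum of the image-to-text and text-to-image directions.
    return (_one_direction(image_emb, text_emb, labels, eps)
            + _one_direction(text_emb, image_emb, labels, eps))


aligned = cmpm_loss([[10.0, 0.0], [0.0, 10.0]],
                    [[10.0, 0.0], [0.0, 10.0]], [0, 1])
shuffled = cmpm_loss([[10.0, 0.0], [0.0, 10.0]],
                     [[0.0, 10.0], [10.0, 0.0]], [0, 1])
```

In this toy batch, `aligned` pairs each image with its own caption and should come out far smaller than `shuffled`, where the captions are swapped.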
- TensorFlow 1.4.0
- CUDA 8.0 and cuDNN 6.0
- Python 2.7
- Download the Flickr30k dataset (about 4.4 GB)
- Download the JSON annotations
- Convert the Flickr30k image-text data into TFRecords (about 15 GB)
cd builddata & sh scripts/format_and_convert_flickr.sh 0
- Download the pretrained ResNet-v1-152 checkpoint
- Train CMPM with ResNet-152 + Bi-LSTM on Flickr30k
sh scripts/train_flickr_cmpm.sh 0
- Train CMPM + CMPC with ResNet-152 + Bi-LSTM on Flickr30k
sh scripts/train_flickr_cmpm_cmpc.sh 0
- Compute R@K (K = 1, 5, 10) for image-to-text and text-to-image retrieval on Flickr30k
sh scripts/test_flickr_cmpm.sh 0
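
For readers unfamiliar with the metric, R@K is the fraction of queries for which at least one correct match appears among the K highest-scoring gallery items. The sketch below illustrates that definition in plain Python. It is not the evaluation code run by the test script, and the function name `recall_at_k` and its arguments are hypothetical.

```python
def recall_at_k(similarity, query_labels, gallery_labels, ks=(1, 5, 10)):
    """Return {k: fraction of queries with a correct item in the top k}.

    similarity[i][j] is the score between query i and gallery item j;
    a gallery item counts as correct when its label equals the query's.
    """
    hits = dict((k, 0) for k in ks)
    for row, q_label in zip(similarity, query_labels):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for k in ks:
            if any(gallery_labels[j] == q_label for j in ranked[:k]):
                hits[k] += 1
    n = float(len(query_labels))
    return dict((k, hits[k] / n) for k in ks)


# Toy example: 3 image queries against 3 text gallery items.
scores = [[0.9, 0.1, 0.3],
          [0.2, 0.4, 0.8],
          [0.1, 0.2, 0.7]]
result = recall_at_k(scores, [0, 1, 2], [0, 1, 2], ks=(1, 2))
```

For the opposite retrieval direction, the same function can be called on the transposed similarity matrix with the query and gallery labels swapped.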
- We also provide code for MSCOCO and CUHK-PEDES, which follow preparation, training, and testing procedures similar to those for Flickr30k
- Watch the disk space: MSCOCO may require 20.1 GB for images and 77.6 GB for TFRecords
If you find CMPL useful in your research, please cite our paper:
@inproceedings{ying2018CMPM,
author = {Ying Zhang and Huchuan Lu},
title = {Deep Cross-Modal Projection Learning for Image-Text Matching},
booktitle = {ECCV},
year = {2018}}
If you have any questions, please feel free to contact [email protected]