DDTN: Dual-decoder transformer network for answer grounding in visual question answering

Overview

This is the official implementation of Dual-decoder transformer network for answer grounding in visual question answering, which introduces a simple Transformer-based framework for the answer grounding task.

Updates

2022.8.6 We created this project for our paper. Thanks for your attention.

2023.4.16 Our paper Dual-decoder transformer network for answer grounding in visual question answering has been accepted by Pattern Recognition Letters. The code will be released shortly.

2023.5.21 The code is available.

Installation

Prerequisites

conda create -n DDTN python=3.7
conda activate DDTN
git clone https://github.com/zlj63501/DDTN.git
cd DDTN
pip install -r requirements.txt
cd mmf
pip install --editable .
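
A quick import check confirms that the editable install worked. This only assumes that mmf and its PyTorch dependency import cleanly; CUDA availability is reported but not required for the check:

python -c "import mmf, torch; print('mmf imported; CUDA available:', torch.cuda.is_available())"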

Data Preparation

  1. Download the VizWiz-VQA-Grounding images from VizWiz.
  2. Download the annotations and weights from Google Drive.
  3. Extract the image grid and region features according to the VinVL repository (a sketch for inspecting the extracted files follows the directory tree below).

The data structure should look like the following:

| -- DDTN
     | -- data
         | -- annotations
             | -- train.npy
             | -- val.npy
             | -- test.npy
             | -- answers.txt
             | -- vocabulary_100K.txt
         | -- weights
             | -- resnet_head.pth
         | -- features
             | -- train
                 | -- VizWiz_train_00000000.npz
                 | -- ...
             | -- val
                 | -- VizWiz_val_00000001.npz
                 | -- ...
             | -- test
                 | -- VizWiz_test_00000002.npz
                 | -- ...
     ...
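
Each .npz file holds the features extracted for one image. Below is a minimal sketch for sanity-checking an extracted file before training; the file path matches the tree above, and since the stored key names depend on the VinVL extraction script, the sketch simply enumerates whatever keys are present:

import numpy as np

# Sanity-check one extracted feature file. The key names depend on the
# VinVL extraction script, so we enumerate whatever it wrote.
feats = np.load("data/features/train/VizWiz_train_00000000.npz", allow_pickle=True)
for key in feats.files:
    arr = feats[key]
    print(key, getattr(arr, "shape", type(arr)))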

Training

We train DDTN to perform grounding and answering at the instance level on a single Titan X GPU with 12 GB of memory. The following script performs the training:

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=train_val dataset=vizwiz model=ddtn
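
MMF reads the YAML config and also accepts dotted key overrides on the command line, so hyperparameters can be adjusted without editing defaults.yaml. The following is an illustration only: training.batch_size and training.log_interval are standard MMF training keys, but the values appropriate for DDTN should be checked against projects/ddtn/configs/defaults.yaml.

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=train_val dataset=vizwiz model=ddtn training.batch_size=16 training.log_interval=100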

Evaluation

To evaluate a trained checkpoint on the validation set, run:

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=val dataset=vizwiz model=ddtn checkpoint.resume_file=save/models/xxx.ckpt
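
MMF also supports run_type=test for inference on the test split; a hedged example, reusing the checkpoint placeholder from above:

python mmf_cli/run.py config=projects/ddtn/configs/defaults.yaml run_type=test dataset=vizwiz model=ddtn checkpoint.resume_file=save/models/xxx.ckpt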

Citation

@article{ZHU202353,
title = {Dual-decoder transformer network for answer grounding in visual question answering},
journal = {Pattern Recognition Letters},
volume = {171},
pages = {53--60},
year = {2023},
issn = {0167-8655},
doi = {10.1016/j.patrec.2023.04.003},
url = {https://www.sciencedirect.com/science/article/pii/S0167865523001046},
author = {Liangjun Zhu and Li Peng and Weinan Zhou and Jielong Yang}
}

Acknowledgement

Our code is built upon the open-source MMF, mmdetection, SeqTR, and VinVL libraries.
