ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

Paper available on arXiv.

Deeparghya Dutta Barua*, Md Sakib Ul Rahman Sourove*, Md Farhan Ishmam*, Fabiha Haider, Fariha Tanjim Shifat, Md Fahim, and Farhad Alam Bhuiyan.


ChitroJera is a regionally relevant Bangla Visual Question Answering (VQA) dataset with over 15k samples that captures the cultural connotations of the Bengal region. We also establish novel baselines using multimodal pre-trained models and Large Language Models (LLMs).

Data Format

| Column Title | Description |
| --- | --- |
| image_id | The unique identifier of the image |
| category | Category of the image |
| caption | Bangla caption of the image |
| caption_en | English caption of the image |
| question | Question on the image in Bangla |
| question_en | Question on the image in English |
| answer | Answer to the question in Bangla |
| answer_en | Answer to the question in English |
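
As a quick orientation, the sketch below loads the annotations with pandas and inspects these columns; the CSV file name is illustrative and should be replaced with the actual path of the released data.

```python
# Minimal sketch: load the ChitroJera annotations and inspect the columns above.
# The file path is illustrative; adjust it to wherever the dataset CSV is stored.
import pandas as pd

df = pd.read_csv("sample_dataset/annotations.csv")  # hypothetical path

# Each row pairs an image with Bangla/English captions and a QA pair.
print(df.columns.tolist())
# Expected columns: image_id, category, caption, caption_en,
#                   question, question_en, answer, answer_en

sample = df.iloc[0]
print(sample["question"], "->", sample["answer"])
```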

Data Creation Pipeline

Figure: Overview of the ChitroJera data creation pipeline.

The images of the ChitroJera dataset are sourced from the BanglaLekhaCaptions, Bornon, and BNature datasets. We establish an automated question-answer generation pipeline using the LLMs GPT-4 and Gemini. The quality of the QA pairs is checked by domain experts based on four evaluation criteria. A few images and QA pairs have been provided in the sample_dataset folder.
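
For illustration only, the sketch below shows what one step of such an LLM-based QA generation pipeline could look like, using the OpenAI chat completions client; the model name, prompt wording, and output format are placeholders rather than the exact setup used for ChitroJera.

```python
# Illustrative sketch of LLM-based QA generation from a Bangla caption.
# The prompt and model name are placeholders, not the exact setup from the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_qa(caption_bn: str) -> str:
    """Ask the LLM for one concise Bangla question-answer pair about a caption."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Given a Bangla image caption, write one concise question "
                        "and its answer in Bangla. Put the question on one line "
                        "and the answer on the next."},
            {"role": "user", "content": caption_bn},
        ],
    )
    return response.choices[0].message.content

# Generated pairs would then be screened by domain experts against the
# four evaluation criteria before being added to the dataset.
```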

QA Statistics

| Q&A Statistic | Question (Q) | Answer (A) |
| --- | --- | --- |
| Mean character length | 33.50 | 7.10 |
| Max character length | 105 | 45 |
| Min character length | 11 | 1 |
| Mean word count | 5.86 | 1.43 |
| Max word count | 17 | 8 |
| Min word count | 3 | 1 |
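
These statistics can be recomputed from the released annotations; a minimal sketch with pandas is given below, again assuming an illustrative CSV path and the column names listed in the Data Format table.

```python
# Sketch: recompute the question/answer length statistics from the dataset.
import pandas as pd

def length_stats(texts: pd.Series) -> dict:
    """Character- and word-level length statistics for a column of text."""
    chars = texts.str.len()
    words = texts.str.split().str.len()
    return {
        "mean_chars": chars.mean(), "max_chars": chars.max(), "min_chars": chars.min(),
        "mean_words": words.mean(), "max_words": words.max(), "min_words": words.min(),
    }

df = pd.read_csv("sample_dataset/annotations.csv")  # hypothetical path
print("Q:", length_stats(df["question"]))
print("A:", length_stats(df["answer"]))
```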

Methodology Overview

Figure: Overview of the baseline methodology.

For the baselines, we consider a dual-encoder-based architecture and the zero-shot performance of LLMs. The dual-encoder model is trained in two distinct stages: pretraining and finetuning. During pretraining, the image and the text are fed into their respective encoders to obtain hidden representations, a co-attention module aligns the two modalities, and the model is optimized with image-text matching (ITM), masked language modeling (MLM), and image-text contrastive (ITC) objectives. During finetuning, a feature aggregation module is incorporated for the classification task.
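
The sketch below illustrates the dual-encoder idea in PyTorch, assuming pre-extracted image-region and text-token features; the hidden sizes, attention configuration, and mean-pooling aggregation are illustrative choices, not the exact modules used in the paper.

```python
# Minimal sketch of a dual-encoder VQA model with co-attention and a
# feature aggregation head. Dimensions and module choices are illustrative.
import torch
import torch.nn as nn

class DualEncoderVQA(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_answers=1000):
        super().__init__()
        # Project pre-extracted image and text features into a shared space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Co-attention for modality alignment: each modality attends to the other.
        self.img_to_txt = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # Feature aggregation module used at finetuning time for classification.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers)
        )

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, num_regions, img_dim); txt_feats: (B, num_tokens, txt_dim)
        v = self.img_proj(img_feats)
        t = self.txt_proj(txt_feats)
        v_aligned, _ = self.img_to_txt(query=v, key=t, value=t)
        t_aligned, _ = self.txt_to_img(query=t, key=v, value=v)
        # Aggregate by mean pooling each aligned sequence, then concatenate.
        fused = torch.cat([v_aligned.mean(dim=1), t_aligned.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # answer logits

# Example forward pass with random features.
model = DualEncoderVQA()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 1000])
```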

Quick Start

Pre-trained Multimodal Model

Installation

We recommend using a virtual environment. Install the dependencies of this repository using:

```bash
pip install -r requirements.txt
```

Training and Evaluation

To train and evaluate the model on VQA, use the following command:

```bash
python main.py
```
