
This GitHub repository focuses on an integrated approach to scene classification and image caption generation, aiming to improve the accuracy of scene evaluation in computer vision applications.




A-Multi-Modal-Approach-to-Improve-Scene-Context

WHAT AND WHY?

Scene classification is a crucial tool in computer vision, helping organize and categorize the massive influx of images uploaded to the internet daily. With diverse applications such as object detection and activity recognition, it impacts areas like blind assistance, autonomous vehicles, robotics, and augmented reality. Our study proposes an integrated approach using CNNs and an LSTM for efficient scene evaluation, achieving 81.2% test accuracy with our custom model. This highlights the importance of accurate and efficient scene classification algorithms in computer vision research and development.

OVERVIEW

Consider the image below.

Image

A certain scene classification model predicts this as:
1. religious_procession_scene - 90% confidence
2. Farm - 7% confidence
3. Hospital - 3% confidence

Here the top prediction is wrong; the correct category is farm.

and a certain caption generator produces these captions:
1. Two ladies walking in a green field - 99% confidence
2. Two people in a bus - 0.5%
3. People in a vehicle - 0.5%

We can see that the caption with the highest confidence is correct, but the scene prediction with the highest confidence is not.

We employ an integrated filtering approach that uses these confidence scores to pick the most suitable (scene, caption) pair.

So, based on our filtering approach, the wrongly classified image above would instead yield a result like: "The scene is a {farm} and {two people are sitting in front of a store with greenery in the background}".
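
To make the filtering idea concrete, here is a minimal, hypothetical sketch of how the confidence scores can drive the base selection; the variable names and the pick_base helper are illustrative, not taken from the repository:

```python
# Hypothetical illustration of the confidence-driven base selection.
# Each prediction list holds (label, confidence) pairs sorted by confidence.
scene_preds = [("religious_procession_scene", 0.90), ("farm", 0.07), ("hospital", 0.03)]
caption_preds = [("two ladies walking in a green field", 0.99),
                 ("two people in a bus", 0.005),
                 ("people in a vehicle", 0.005)]

def pick_base(scene_preds, caption_preds):
    """Treat the module whose top prediction is more confident as the base;
    the other module's candidates are then filtered against it."""
    if scene_preds[0][1] >= caption_preds[0][1]:
        return "scene", scene_preds, caption_preds
    return "caption", caption_preds, scene_preds

base_name, base, to_filter = pick_base(scene_preds, caption_preds)
print(base_name, base[0][0])   # -> caption: the 0.99 caption anchors the filtering
```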

APPROACH

SCENE CLASSIFICATION MODEL

We have employed two models to classify a scene into a set of categories.

A

First, the Places365 CNN model. This model is pre-trained on the Places365 dataset, a state-of-the-art scene-centric dataset that contains 365 categories. The Places365 authors have developed and released convolutional neural networks (CNNs) with a range of architectures trained on the dataset. The model we used is InceptionV3. While not the most accurate of the pre-trained CNNs released by the authors, it provides a good trade-off between computational efficiency and performance.
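
As a minimal inference sketch (assuming a Keras build of InceptionV3 and a converted Places365 weights file; places365_inceptionv3.h5 and scene.jpg are placeholders, not files shipped with this repository):

```python
import numpy as np
import tensorflow as tf

# Build the 365-class InceptionV3 and load converted Places365 weights
# (hypothetical file name; the official release ships Caffe/PyTorch weights).
model = tf.keras.applications.InceptionV3(weights=None, classes=365)
model.load_weights("places365_inceptionv3.h5")

# Preprocess a single image to the 299x299 input InceptionV3 expects.
img = tf.keras.utils.load_img("scene.jpg", target_size=(299, 299))
x = tf.keras.applications.inception_v3.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

probs = model.predict(x)[0]
top3 = probs.argsort()[-3:][::-1]   # indices of the 3 most confident categories
print(top3, probs[top3])            # map indices to names via the Places365 category list
```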

B

Second, we created a custom scene classification model for Indian environments. We merged some of the popular Places365 categories and added some of our own (please refer to the paper for details). Our approach focuses on smaller but more frequently occurring categories that are specific to Indian environments. We used the InceptionV3 architecture for training.
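
A sketch of this transfer-learning setup, assuming a Keras workflow; NUM_CLASSES, the frozen-base strategy, and the datasets are illustrative choices, not specifics from the paper:

```python
import tensorflow as tf

NUM_CLASSES = 30   # placeholder for the merged Places365 + Indian-environment categories

# ImageNet-pretrained InceptionV3 as a frozen feature extractor.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # datasets are placeholders
```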

Image

IMAGE CAPTION GENERATOR

We have trained a CNN-RNN model for the image caption generator using the widely used Flickr8k dataset. The CNN chosen in this study is the Xception network, pre-trained on ImageNet's 1000 categories, due to its strong feature extraction capabilities.

Traditional image caption generators typically produce a single caption. In the proposed study, a heuristic approach, the beam search algorithm, is used to generate multiple captions. A score is generated for each caption, representing the model's confidence in that output. The integrated filtering mechanism uses these confidence scores to enhance the model's understanding of the scene depicted in the image.
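
The sketch below shows one common way to implement beam search over a caption decoder; decode_step is a placeholder for one step of the CNN-RNN model (it must return log-probabilities over the vocabulary) and is not the repository's actual API:

```python
import heapq
import numpy as np

def beam_search(decode_step, start_id, end_id, beam_width=3, max_len=20):
    """Return the top `beam_width` (log-probability, token sequence) pairs."""
    beams = [(0.0, [start_id])]          # (cumulative log-prob, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_id:        # completed caption: set it aside with its score
                finished.append((logp, seq))
                continue
            log_probs = decode_step(seq) # placeholder decoder step, shape (vocab_size,)
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((logp + float(log_probs[tok]), seq + [int(tok)]))
        if not candidates:               # every beam has emitted the end token
            beams = []
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
    finished.extend(beams)               # keep any still-open beams as well
    return heapq.nlargest(beam_width, finished, key=lambda b: b[0])
```

Each returned log-probability acts as the caption's confidence score used by the filtering stage.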


For a detailed description, please refer to the paper.


Output:

Image

INTEGRATED APPROACH FOR IMPROVED RESULTS

Having the classification CNN and the image captioning module validate each other's results can produce more correct outputs. To validate results, a person could judge the relevance between given places and captions; we have integrated existing LLM APIs to make that relevance judgment, and based on the result an improved output is given.

Given the multiple scene options and caption options we have, we select one module as the base and the other to be filtered. The base is selected purely on confidence scores. Once we have the base, we generate a prompt such as:


Only one answer. Print the number of the correct option: which of the following descriptions is most suitable for a {} place? - {}


This prompt is fed to the GPT API, and in response we get the most suitable (scene, caption) pair.
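
A sketch of that call, assuming the OpenAI Python SDK; the model name and the exact prompt wording are assumptions rather than details fixed by the paper:

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def most_relevant_option(scene, captions):
    """Ask the LLM which caption best matches the given scene category."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    prompt = ("Only one answer. Print the number of the correct option: "
              f"which of the following descriptions is most suitable for a {scene} place?\n"
              f"{options}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",             # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())

# most_relevant_option("farm", ["two ladies walking in a green field",
#                               "two people in a bus"])   # -> 1
```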

RESULTS

1. Custom model - Scene Classification

Image

| Metric              | Accuracy |
|---------------------|----------|
| Train Accuracy      | 90.32%   |
| Validation Accuracy | 80.3%    |
| Test Accuracy       | 81.2%    |

2. Caption Generator

The average BLEU scores for up to 4-grams for the image caption generator, evaluated on 1000 different images, are given in the table below (a sketch of how such scores can be computed follows it):

| Metric | Score |
|--------|-------|
| B-1    | 0.25  |
| B-2    | 0.10  |
| B-3    | 0.05  |
| B-4    | 0.03  |
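
As a pointer to how such scores can be reproduced, here is a small sketch using NLTK's sentence-level BLEU; the reference and hypothesis below are toy inputs, not the paper's evaluation set:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["two", "ladies", "walking", "in", "a", "green", "field"]]
hypothesis = ["two", "people", "walking", "in", "green", "field"]

smooth = SmoothingFunction().method1   # avoids zero scores on short captions
for n in range(1, 5):
    # Cumulative BLEU-n: uniform weights over 1..n-grams, zero beyond.
    weights = tuple([1.0 / n] * n + [0.0] * (4 - n))
    score = sentence_bleu(references, hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"B-{n}: {score:.2f}")
```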

3. Integrated approach result samples

Image
