We propose a multimodal approach. We first select the best unimodal model for textual data classification and the best unimodal model for visual data classification through an automated testing process. We then fuse the two selected models into a single model that classifies damage-related material from paired image and text data. The EfficientNetB3+BERT multimodal model achieves the best accuracy, at 94.18%.

The dataset used in this research, provided by Mouzannar et al., plays a decisive role in the analysis and classification of image-tweet pairs related to disaster events. The dataset, which contains a total of 5831 image-tweet pairs, is divided into two main subsets: a training set of 5247 samples and a test set of 584 samples. This dataset is used to develop models that can identify and classify different image-tweet pairs related to disasters.
A defining characteristic of the dataset is class imbalance: the samples are not evenly distributed across classes. The imbalance is clear from the number of samples in each category. The non-damage category, which represents cases where no damage was detected, has the most samples, with 2957 cases. In contrast, the loss-of-life category, which covers image-tweet pairs showing damage that resulted in casualties during disasters, has the fewest, with only 240 cases. To facilitate the classification task, the dataset contains five distinct categories of disaster image-tweet pairs in addition to the Non-damage (ND) category.

This section presents the proposed multimodal architecture for classifying disasters from social media posts consisting of images and tweets. Our model has two parts that work side by side: the first is dedicated to capturing visual attributes, and the second extracts textual characteristics. Each step of this architecture is described below.

Step-1) Input Image and Tweet: We first split the dataset into 80% for training, 10% for testing, and 10% for validation. In this step, the disaster images and texts from the training set are presented to the proposed model batch by batch. In our work, the batch size is 12.
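The split and batching in Step-1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_indices` and `batches`, the shuffling seed, and the choice to give the rounding remainder to the test split are all assumptions made for the example.

```python
import random

def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

def batches(indices, batch_size=12):
    """Yield consecutive batches; the last batch may be smaller."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

# 5831 is the total number of image-tweet pairs stated for the dataset.
train_idx, val_idx, test_idx = split_indices(5831)
print(len(train_idx), len(val_idx), len(test_idx))
first_batch = next(batches(train_idx))
```

Working on indices rather than on the samples themselves keeps each image paired with its tweet, since both are looked up with the same index.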

Fig. 1. The methodology for disaster identification: the textual feature extractor module is represented by the right-side blocks, while the visual feature extractor module is represented by the left-side blocks.

Step-2) Image Preprocessing: First, we ensure that all images are the same size. To enable more effective processing, we pre-process each image by resizing it to 228 × 228 × 3. The pixel values are first scaled to the range 0 to 1. They are then normalized per channel using mean values of [0.485, 0.456, 0.406] and standard deviation values of [0.229, 0.224, 0.225], respectively. This brings the pixel values of each channel to approximately zero mean and unit standard deviation, which helps stabilize the training process. We also use several augmentation techniques:

a) Random Horizontal Flip: Images are flipped horizontally at random, which shows the model mirrored views of the same objects. Seeing more of the ways an object can appear improves what the model learns.

b) Color Jitter: Color jitter applies controlled random changes to image color attributes such as brightness, contrast, saturation, and hue. This helps the model learn that the content of a scene stays the same when the lighting changes, so it is not confused by images taken under different lighting conditions.

c) Random Rotation: Random rotations add variety to the orientation of the images. This is especially beneficial when images in the collection have varying orientations.
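The normalization and two of the augmentations in Step-2 can be sketched in plain Python on a tiny image. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the flip probability `p`, and the brightness range `max_delta` are hypothetical, the jitter covers brightness only, and rotation is omitted for brevity.

```python
import random

MEAN = [0.485, 0.456, 0.406]  # per-channel mean (R, G, B), as stated in Step-2
STD = [0.229, 0.224, 0.225]   # per-channel standard deviation, as stated in Step-2

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [value / 255.0 for value in rgb]
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror every row left-to-right with probability p (p is an assumption)."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

def brightness_jitter(image, rng, max_delta=0.2):
    """Multiply all pixels by one random factor in [1 - max_delta, 1 + max_delta].

    Values are clipped back to [0, 1]; max_delta is an assumed setting.
    """
    factor = rng.uniform(1.0 - max_delta, 1.0 + max_delta)
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in image]

rng = random.Random(0)
# A 2 x 3 single-channel image with values already scaled to [0, 1].
image = [[0.1, 0.5, 0.9],
         [0.2, 0.6, 1.0]]
augmented = brightness_jitter(random_horizontal_flip(image, rng), rng)
white = normalize_pixel((255, 255, 255))
```

Augmentations are applied only to training images, while the same normalization is applied to every image at training and evaluation time.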

Step-3) Visual Feature Extractor: To extract visual features from the images, we use transfer learning. First, we employ ResNet-50 [16], a pre-trained CNN model. Its major goal is to solve the vanishing gradient problem that occurs in extremely deep neural networks. As the depth of a CNN increases, gradients tend to shrink during backpropagation, which makes it harder for the model to learn and update its weights correctly and results in poor performance. ResNet-50 addresses this issue by including "skip connections" (also called "shortcut connections"), which enable the network to learn residual mappings. ResNet-50 has 50 layers and has shown outstanding results in a variety of computer vision applications such as image classification, object detection, and image segmentation.
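The idea of a skip connection can be sketched in a few lines. This is a toy illustration on plain lists, not ResNet-50 itself: `residual_block`, `transform`, and `zero_transform` are hypothetical names, and `transform` stands in for the stacked convolutional layers of a real block.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise.

    The block only has to learn the residual on top of the identity,
    and the added x gives gradients a direct path around transform.
    """
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]

def zero_transform(x):
    """A transform that has learned nothing yet: it outputs all zeros."""
    return [0.0 for _ in x]

def halve(x):
    """A toy transform used only to show a non-trivial residual."""
    return [0.5 * xi for xi in x]

identity_out = residual_block([1.0, 2.0, 3.0], zero_transform)
residual_out = residual_block([1.0, 2.0, 3.0], halve)
```

When the transform contributes nothing, the block passes its input through unchanged, which is why adding more such blocks does not by itself degrade the signal.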
Using a pre-trained DenseNet-201 model substantially reduces the number of parameters, which saves time and resources and lets us benefit from features learned on a large dataset. It is also a pre-trained CNN model, and it consists of 201 layers. The main concept of DenseNet-201 is to connect each layer to every other layer in a feed-forward fashion within a single dense block. Information from all previous layers can then flow directly into the current layer, which improves feature propagation. This alleviates the vanishing gradient problem, so the network can learn from both shallow and deep layers without losing information. It also enables DenseNet models to achieve state-of-the-art performance on different computer vision tasks while being more parameter-efficient and simpler to train. DenseNet is used for image classification, object detection, and image segmentation.
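A small worked example shows what dense connectivity means for the input of each layer. It is a sketch only: the function name is hypothetical, and the values 64 (initial channels) and 32 (channels added per layer, the growth rate) are example settings assumed for illustration.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Channels seen by each layer when all earlier outputs are concatenated.

    Layer i receives the block input plus the growth_rate channels
    produced by each of the i layers before it.
    """
    return [initial_channels + growth_rate * i for i in range(num_layers)]

# Example values are assumptions, not settings reported in this work.
channels = dense_block_input_channels(initial_channels=64, growth_rate=32, num_layers=4)
print(channels)
```

Because every layer only adds a small, fixed number of new channels and reuses all earlier feature maps, the network can stay narrow per layer, which is the source of the parameter efficiency described above.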
Lastly, EfficientNetB3 [17] belongs to a family of models designed to achieve state-of-the-art accuracy on image classification tasks while being smaller and faster than other models. EfficientNet is both a convolutional neural network architecture and a scaling method: it uses a compound coefficient to scale the depth, width, and resolution dimensions uniformly. The increased depth, width, and resolution enable the model to capture more complex patterns and features, leading to improved performance on challenging visual recognition tasks. EfficientNetB3 is used when a strong, precise, and computationally efficient deep learning model is required for computer vision applications, especially where resources may be scarce and real-time or low-latency performance is crucial.
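Compound scaling can be illustrated with the base constants reported in the EfficientNet paper (1.2 for depth, 1.1 for width, 1.15 for resolution). This is a sketch of the scaling rule only: the function names are hypothetical, and the text above does not state which coefficient value corresponds to the B3 variant, so none is assumed here.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the EfficientNet paper

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi):
    """Approximate growth in FLOPs: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, resolution x{resolution:.2f}")
```

The constants are chosen so that the FLOPs factor for one step of the coefficient is close to 2, meaning each increment roughly doubles the compute budget while keeping the three dimensions in balance.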

Step-4) Textual Feature Extractor: BERT and XLNet are used to extract features from the tweets. BERT [19], which stands for Bidirectional Encoder Representations from Transformers, converts words into numerical representations. It is a powerful natural language processing (NLP) model. It uses a transformer-based architecture that takes a bidirectional approach to processing and understanding the context of words in a sentence, whereas traditional NLP models process words in a unidirectional manner. It is pre-trained on a substantial corpus to learn contextual representations and then fine-tuned on particular NLP tasks to generate predictions. In our work, we use the pre-trained 'bert-base-uncased' model. This allows the training of machine learning models on textual data [37]. The limitations of BERT are addressed by XLNet, a generalized autoregressive pretraining technique that maximizes the expected likelihood over all permutations of the factorization order. It builds on the Transformer-XL architecture and also follows a bidirectional approach. It samples permutations of the factorization order of the input tokens and maintains the autoregressive property throughout pre-training, modeling the probability distribution of a sequence by taking each token into account one at a time. This makes it possible for XLNet to capture token dependencies efficiently.
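The effect of permuting the factorization order can be sketched with a toy example. This is not XLNet's implementation: `contexts_for_order` is a hypothetical helper that only lists, for one order, which tokens each prediction may condition on.

```python
from itertools import permutations

def contexts_for_order(tokens, order):
    """For one factorization order, pair each predicted token with its visible context.

    order is a sequence of positions; the token at order[step] may only
    see tokens whose positions appear earlier in the order.
    """
    result = []
    for step, position in enumerate(order):
        visible = sorted(order[:step])  # keep the context in original word order
        result.append((tokens[position], [tokens[i] for i in visible]))
    return result

tokens = ["flood", "hits", "town"]
example = contexts_for_order(tokens, (2, 0, 1))
all_orders = list(permutations(range(len(tokens))))
print(example)
```

With the order (2, 0, 1), the middle word is predicted from both its left and right neighbours, even though each individual prediction is still autoregressive. Averaging over many orders is what lets the model use context from both directions.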

Step-5) Choosing Best-performed CNN model: We compile a list of models to run for our visual task. We then train all of the visual models with an identical hyperparameter configuration and save the training and validation history, which stores the accuracy and loss value of each epoch. From the list of accuracies, we pick the index with the highest value, and this index identifies the selected model: index 0 selects ResNet-50, and the remaining indices select the other candidates in the list. The hyperparameter values were batch size = 12, iterations = 30000, learning rate = 1e-4, and epochs = 77, where the number of epochs is calculated by the formula

$$\text{epochs} = \frac{\text{iterations}}{\text{len(train dataset)} / \text{batch size}}$$

We also use an early stopping criterion during training to avoid overfitting. Of all the visual models, EfficientNetB3 performs best.
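The epoch calculation and the selection rule of Step-5 can be sketched as follows. The function names are hypothetical, the training-set size of 4665 assumes 80% of the 5831 pairs rounded to the nearest sample, and the accuracy histories are placeholder numbers for illustration only, not results from this work.

```python
def epochs_from_iterations(iterations, n_train, batch_size):
    """epochs = iterations / (n_train / batch_size), rounded down."""
    steps_per_epoch = n_train / batch_size
    return int(iterations / steps_per_epoch)

def select_best_model(histories):
    """Pick the model whose best per-epoch validation accuracy is highest.

    histories maps a model name to its list of per-epoch accuracies.
    """
    best_scores = {name: max(accs) for name, accs in histories.items()}
    return max(best_scores, key=best_scores.get)

# Assumed training-set size: 80% of 5831 image-tweet pairs.
n_epochs = epochs_from_iterations(iterations=30000, n_train=4665, batch_size=12)

# Placeholder histories with made-up values, only to exercise the selection rule.
placeholder_histories = {
    "model_a": [0.60, 0.70, 0.72],
    "model_b": [0.65, 0.71, 0.74],
    "model_c": [0.62, 0.76, 0.75],
}
chosen = select_best_model(placeholder_histories)
print(n_epochs, chosen)
```

Keying the histories by model name instead of list position avoids having to remember which index belongs to which model.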

Step-6) Selecting Best Performed Language Model: The best-performing language model is chosen with exactly the same procedure that was applied in the previous step to select the visual model. In this step, BERT is selected and passed to the next step, Fusion and Classification.

Step-7) Fusion and Classification: The outputs of the dense layers in the visual and text branches are combined to create a shared representation of both visual and textual aspects. We use a late fusion technique [38], concatenating the visual and textual features to achieve a deep-level representation. In late fusion, the inputs from the different modalities are first analyzed independently, and the outputs or features obtained from these modalities are then joined to reach a final prediction. The final layer of each modality uses the same number of hidden nodes. We use 512 nodes in each, so that the text and visual parts contribute fairly.
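Late fusion by concatenation can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the single linear classifier on top of the fused vector, the random placeholder features and weights, and all function names are hypothetical. The feature size of 512 per branch comes from the text, and six output classes follow from the five disaster categories plus Non-damage.

```python
import math
import random

FEATURE_DIM = 512  # nodes per modality, as stated in Step-7
NUM_CLASSES = 6    # five disaster categories plus Non-damage

def linear(vector, weights, bias):
    """Apply one fully connected layer: weights has one row per output class."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion_predict(visual_features, text_features, weights, bias):
    """Concatenate the two feature vectors, then classify the fused vector."""
    fused = visual_features + text_features  # list concatenation: 512 + 512 = 1024
    return softmax(linear(fused, weights, bias))

rng = random.Random(0)
# Random placeholders standing in for the outputs of the two trained branches.
visual = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
text = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
weights = [[rng.gauss(0.0, 0.01) for _ in range(2 * FEATURE_DIM)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = late_fusion_predict(visual, text, weights, bias)
predicted_class = probs.index(max(probs))
```

Giving both branches the same output size means neither modality dominates the fused vector simply by contributing more dimensions.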

Step-8) Evaluate The Final Model: Accuracy, precision (P), recall (R), and weighted F1-score are used to compare performance. The model's misclassification rate is also used as a metric to compare its performance across the classes. To evaluate how well the model performs overall, we use the weighted F1-score.
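The weighted F1-score can be computed as in the sketch below, assuming the usual definition in which each class's F1 is weighted by its share of the true labels. The function name and the toy labels are hypothetical and are not results from this work.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp  # true samples of this class that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (count / total) * f1
    return score

# Toy labels for illustration: class 0 is the majority class.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
toy_score = weighted_f1(y_true, y_pred)
print(round(toy_score, 4))
```

Weighting by support suits an imbalanced dataset such as this one, because the score reflects how often each class actually occurs rather than treating all classes as equally frequent.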


