This repository accompanies the paper "SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models". It provides the codebase, dataset, and supplementary materials used in our experiments. If you use, extend, or build upon this project, please cite the following paper (under review at ACL 2025):
@article{carragher2025segsub,
title={SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language Models},
author={Carragher, Peter and Rao, Nikitha and Jha, Abhinand and R, Raghav and Carley, Kathleen M},
journal={arXiv preprint arXiv:XXXX.XXXXXX},
year={2025}
}
Below, we describe the repository's structure and the purpose of each component.
SegSub/
├── augmentation/
├── quality_check/
├── vlm_finetuning/
├── vlm_evaluation/
├── figures/
- augmentation/: Framework implementation for generating knowledge conflicts in images (parametric, source, and counterfactual conflicts). Based upon Inpaint-Anything.
- quality_check/: Tools for dataset construction, including automated VLM quality checks and notebooks for manually evaluating dataset samples.
- vlm_finetuning/: Code for fine-tuning Vision-Language Models using SWIFT.
- vlm_evaluation/: Evaluation scripts for both baseline and finetuned VLMs, covering generated samples as well as randomly sampled (query, image) pairs from the original datasets.
- figures/: R scripts that generate the plots and visualizations used in the paper.
Instead of generating a dataset from scratch, you can use the provided SegSub dataset, which was used for all experiments in the paper; generating it required roughly 500 GPU hours. Note: the dataset only includes images perturbed using the framework, not the original samples. For finetuning and evaluation, the original VQA datasets must be downloaded to the correct directories (see below).
Download the SegSub dataset to SegSub/data and extract the generated images:
cd data
pip install py7zr
py7zr x images.7z
cd ..
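If you prefer to extract the archive from Python rather than the command line, py7zr exposes an equivalent API. A minimal sketch, assuming the archive was downloaded to data/images.7z:

```python
# Programmatic equivalent of `py7zr x images.7z`, run from the repository root.
import py7zr

with py7zr.SevenZipFile("data/images.7z", mode="r") as archive:
    archive.extractall(path="data")
```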
Alternatively, perturbations can be generated locally. Note that this process is time-consuming and scales with both the number of VQA samples drawn from the original datasets and the number of generations per sample. The GPU hour estimates below are for the 200,000 original generations (based on runtime with 2x NVIDIA RTX A6000). To generate your own dataset:
- download the VQA datasets (see below),
- generate a similar dataset using the scripts in augmentation/ (~300 GPU hours), and
- use the quality_check/ scripts to filter out low-quality generations (~200 GPU hours).
Whether you want to generate an alternative dataset or finetune and evaluate VLMs, you will need to download the relevant VQA datasets.
The directory structure should be as follows:
SegSub/
├── data/
│ ├── coco-images/
│ │ ├── train2014/
│ │ ├── val2014/
│ ├── webqa-images/
│ │ ├── imgs.lineidx
│ │ ├── imgs.tsv
│ ├── WebQA_train_val.json
│ ├── vqav2_val.json
│ ├── vqav2_train.json
│ ├── okvqa_val.json
│ ├── okvqa_train.json
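Once the downloads below are complete, you can confirm that the layout matches the tree above. A small sanity-check sketch, with the paths taken directly from the tree (adjust it if you only need a subset of the datasets):

```python
# Check that the expected question files and image folders are in place under data/.
from pathlib import Path

DATA = Path("data")
expected = [
    DATA / "coco-images" / "train2014",
    DATA / "coco-images" / "val2014",
    DATA / "webqa-images" / "imgs.lineidx",
    DATA / "webqa-images" / "imgs.tsv",
    DATA / "WebQA_train_val.json",
    DATA / "vqav2_val.json",
    DATA / "vqav2_train.json",
    DATA / "okvqa_val.json",
    DATA / "okvqa_train.json",
]
missing = [path for path in expected if not path.exists()]
print("All expected files present." if not missing
      else "Missing:\n" + "\n".join(str(path) for path in missing))
```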
Download and extract the WebQA image dataset from the Google Drive provided by the authors of WebQA into data/webqa-images:
mkdir data/
mkdir data/webqa-images && cd data/webqa-images
py7zr x WebQA_imgs_7z_chunks/*.7z
py7zr x WebQA_data_first_release.7z
mv WebQA_train_val.json ../
cd ../..
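WebQA stores its images as base64-encoded rows in imgs.tsv, and imgs.lineidx records the byte offset of each row. The sketch below shows one way to read a single image; the field layout follows the WebQA baseline code, so verify it against your copy of the release:

```python
# Load one image from the WebQA tsv using the byte offsets listed in imgs.lineidx.
import base64
from io import BytesIO

from PIL import Image

with open("data/webqa-images/imgs.lineidx") as f:
    lineidx = [int(line.strip()) for line in f]

def load_webqa_image(row: int):
    """Return (image_id, PIL.Image) for the given 0-based row of imgs.tsv."""
    with open("data/webqa-images/imgs.tsv", "rb") as fp:
        fp.seek(lineidx[row])
        image_id, img_b64 = fp.readline().decode("utf-8").strip().split("\t")
    return image_id, Image.open(BytesIO(base64.b64decode(img_b64)))

image_id, img = load_webqa_image(0)
print(image_id, img.size)
```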
mkdir data/coco-images && cd data/coco-images/
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip train2014.zip
unzip val2014.zip
cd ..
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip
unzip v2_Questions_Train_mscoco.zip
unzip v2_Questions_Val_mscoco.zip
mv v2_OpenEnded_mscoco_val2014_questions.json vqav2_val.json
mv v2_OpenEnded_mscoco_train2014_questions.json vqav2_train.json
cd ..
cd data
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_train2014_questions.json.zip
wget https://okvqa.allenai.org/static/data/OpenEnded_mscoco_val2014_questions.json.zip
unzip OpenEnded_mscoco_train2014_questions.json.zip
unzip OpenEnded_mscoco_val2014_questions.json.zip
mv OpenEnded_mscoco_val2014_questions.json okvqa_val.json
mv OpenEnded_mscoco_train2014_questions.json okvqa_train.json
cd ..
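Both the vqav2_*.json and okvqa_*.json files follow the standard VQA question format: a top-level "questions" list whose entries carry image_id, question, and question_id, with the corresponding COCO image named COCO_<split>2014_<image_id zero-padded to 12 digits>.jpg. A small loader sketch under those assumptions:

```python
# Pair VQA-style questions with their COCO image files.
import itertools
import json
from pathlib import Path

def load_vqa_questions(question_file: str, coco_split: str = "val2014"):
    """Yield (question_id, question, image_path) triples from a VQAv2/OK-VQA question file."""
    with open(question_file) as f:
        questions = json.load(f)["questions"]
    for q in questions:
        image_name = f"COCO_{coco_split}_{q['image_id']:012d}.jpg"
        yield q["question_id"], q["question"], Path("data/coco-images") / coco_split / image_name

for qid, question, image_path in itertools.islice(load_vqa_questions("data/okvqa_val.json"), 3):
    print(qid, question, image_path)
```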
With these datasets in place, we can finetune VLMs and evaluate their robustness to counterfactual samples and knowledge conflicts. GPU hour estimates for the training and evaluation sets are given below (based on runtime with 2x NVIDIA RTX A6000).
For setting up and running finetuning using SegSub data and SWIFT, see our finetuning scripts (16 GPU hours per model per epoch).
For setting up and running evaluation on the SegSub evaluation set (or any other dataset generated using the framework), see our evaluation scripts (6 GPU hours per model).
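Conceptually, the evaluation pairs each question with its original and perturbed images and compares the model's answers. The sketch below illustrates that loop with a hypothetical query_vlm callable standing in for whichever VLM backend the evaluation scripts wrap; it is not part of this repository:

```python
# Illustrative robustness check: ask the same question about the original and the
# perturbed image and record whether the answer changes.
from pathlib import Path
from typing import Callable

def compare_on_perturbation(query_vlm: Callable[[Path, str], str],
                            question: str,
                            original_image: Path,
                            perturbed_image: Path) -> dict:
    original_answer = query_vlm(original_image, question)
    perturbed_answer = query_vlm(perturbed_image, question)
    # A robust model should revise its answer when the perturbation changes the
    # visual evidence (counterfactual conflicts) and stay consistent otherwise.
    return {
        "question": question,
        "original_answer": original_answer,
        "perturbed_answer": perturbed_answer,
        "changed": original_answer != perturbed_answer,
    }
```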
Finally, run the R scripts in figures/ to reproduce the plots shown in the paper.