This repository contains the implementation of GIFT (Gaze Shift-Guided Cross-modal Fusion Enhancement), a method for mitigating hallucinations in Vision-Language Models (VLMs).
The implementation builds upon Transformers v4.50.0 by modifying the attention computation in the Qwen2-VL and Llama architectures: visual saliency maps are integrated into the attention mechanism to enhance cross-modal fusion.
Vision-language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by either the textual or the visual input. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying attention to visual tokens in proportion to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can amplify incorrect regions while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, since irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for a well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
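
The sketch below illustrates the two steps at the core of GIFT: building a saliency map from gaze shifts during query comprehension, and boosting attention at decoding time. It is a minimal illustration, not the repository's code; the tensor shapes, the `visual_idx`/`query_idx` ranges, the uniform query-boost factor, and the choice of applying the boost to post-softmax weights are assumptions made for the example.

```python
# Minimal sketch of the GIFT idea (illustrative only; the actual logic lives
# in the modified Transformers attention code). Shapes, index ranges, and the
# query-boost scheme are assumptions made for this example.
import torch

def gaze_shift_saliency(attn_to_visual: torch.Tensor) -> torch.Tensor:
    """attn_to_visual: [num_heads, num_query_tokens, num_visual_tokens],
    attention from successive user-query tokens to the visual tokens in the
    saliency-computation layers. Positive changes between consecutive query
    tokens ("gaze shifts") are accumulated; attention-sink tokens shift little
    and therefore receive low saliency."""
    shifts = (attn_to_visual[:, 1:, :] - attn_to_visual[:, :-1, :]).clamp(min=0.0)
    saliency = shifts.sum(dim=(0, 1))          # accumulate over heads and query steps
    return saliency / (saliency.sum() + 1e-8)  # normalize into a saliency map

def enhance_attention(attn_weights: torch.Tensor,
                      saliency: torch.Tensor,
                      visual_idx: slice,
                      query_idx: slice,
                      alpha: float = 5.0) -> torch.Tensor:
    """attn_weights: [num_heads, q_len, seq_len] post-softmax weights at one
    decoding step. Amplify attention to salient visual tokens and to the user
    query, then renormalize so each row still sums to one. The uniform query
    boost is a placeholder, not the paper's exact weighting."""
    enhanced = attn_weights.clone()
    enhanced[..., visual_idx] = enhanced[..., visual_idx] * (1.0 + alpha * saliency)
    enhanced[..., query_idx] = enhanced[..., query_idx] * (1.0 + alpha * saliency.mean())
    return enhanced / enhanced.sum(dim=-1, keepdim=True)
```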
- Create and activate conda environment:
conda create -n gift python=3.12
conda activate gift
- Install dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Install modified transformers:
cd transformers-4.50.0
pip install -e .
cd ..
The following Vision-Language Models are currently supported:
- LLaVA-1.5-7B
- LLaVA-1.5-13B
- Qwen2-VL-7B
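
For reference, a supported model can also be loaded directly through the standard Transformers API, as sketched below; the Hugging Face checkpoint name is an assumption, and inference.py presumably handles model loading itself based on the config file.

```python
# Standalone loading example for one of the supported models. The checkpoint
# name is the usual Hugging Face one and is an assumption, not something this
# repository prescribes.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
```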
Configuration files in the configs/ directory specify:
- Model parameters
- Data paths
- GIFT parameters
- use_gift: Enable/disable GIFT enhancement
- visual_saliency_computation_layers: Layers used for computing visual saliency maps
- attention_enhancement_layers: Layers where cross-modal attention is enhanced
- alpha: Scaling factor for vision attention enhancement
Example configuration (llava_1.5_7b.yaml):
model_name: "llava_1.5_7b"
max_new_tokens: 1
use_gift: true
visual_saliency_computation_layers: [11]
attention_enhancement_layers: [12,13,14,15,16,17,18,19,20,21,22]
alpha: 5.0
Default hyperparameters are available in the provided config files.
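
The configs are plain YAML, so they can be inspected or reused outside inference.py; the snippet below is a hypothetical reader (not part of the repository) whose field names match the example configuration above.

```python
# Hypothetical config reader (not part of the repository); field names match
# the example configuration above.
import yaml

with open("configs/llava_1.5_7b.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["use_gift"])                            # True
print(cfg["visual_saliency_computation_layers"])  # [11]
print(cfg["attention_enhancement_layers"])        # [12, 13, ..., 22]
print(cfg["alpha"])                               # 5.0
```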
Run inference with default settings:
python inference.py --config configs/llava_1.5_7b.yaml
The implementation modifies the following Transformers files:
- transformers-4.50.0/src/transformers/generation/utils.py
- transformers-4.50.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
- transformers-4.50.0/src/transformers/models/llama/modeling_llama.py
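
To give a rough sense of the kind of change these files carry (an illustration, not the actual diff), a GIFT-style boost can be inserted into an attention forward pass after the softmax, mirroring the enhancement step sketched earlier; the injection point and function signature here are assumptions.

```python
# Illustration only: one plausible injection point for a GIFT-style boost
# inside an attention forward pass. The real modifications live in the listed
# modeling files and generation/utils.py and may differ in detail.
import torch

def attention_with_gift(q, k, v, saliency, visual_idx, query_idx,
                        alpha=5.0, use_gift=True):
    # q, k, v: [num_heads, q_len, head_dim]; plain scaled dot-product attention
    weights = (q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5).softmax(dim=-1)
    if use_gift:
        weights = weights.clone()
        weights[..., visual_idx] *= 1.0 + alpha * saliency        # boost salient visual tokens
        weights[..., query_idx] *= 1.0 + alpha * saliency.mean()  # boost the user query
        weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize rows
    return weights @ v
```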
See CONTRIBUTING for more information.
This library is licensed under the CC BY-NC 4.0 License.