Faithfulness of Feature Attribution Methods

This repository contains the code for the paper "Counterfactuals as a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models" by Sepehr Kamahi and Yadollah Yaghoobzadeh, presented at the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP in 2024.

Overview

The codebase provides tools to evaluate the faithfulness of feature attribution methods in autoregressive language models using counterfactual generation. The approach generates fluent, in-distribution counterfactuals to create a more reliable evaluation protocol.

Paper Summary

Autoregressive language models are widely used in NLP, but their explainability remains a challenge. Evaluating the faithfulness of explanation methods—how accurately they reflect the model's inner workings—is particularly difficult for autoregressive models due to their next-token prediction nature. Existing evaluation techniques often rely on input corruption, which can result in out-of-distribution (OOD) inputs.

This study introduces a novel faithfulness evaluation protocol that utilizes counterfactual generation to create fluent, in-distribution examples. Counterfactuals are generated by modifying input tokens deemed important by attribution methods, ensuring the changes remain within the original distribution. The evaluation ranks attribution methods based on their ability to minimize token modifications required to flip the model’s predictions.
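The ranking criterion described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the repository's actual implementation: `predict` and `edit_token` are hypothetical stand-ins for the classifier and the counterfactual editor (which, in the paper, generates fluent, in-distribution replacements rather than arbitrary tokens).

```python
def flip_fraction(tokens, scores, predict, edit_token):
    """Edit tokens in descending attribution order until the model's
    label flips; return the fraction of tokens that had to be edited
    (lower = the attribution ranked truly important tokens first).
    Returns 1.0 if the label never flips."""
    original = predict(tokens)
    edited = list(tokens)
    # Rank token positions by attribution score, most important first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    for n, i in enumerate(order, start=1):
        # In the paper the replacement comes from a trained editor; here
        # edit_token is an abstract callback supplying the new token.
        edited[i] = edit_token(edited, i)
        if predict(edited) != original:
            return n / len(tokens)
    return 1.0
```

A more faithful attribution method assigns high scores to the tokens the model actually relies on, so fewer edits are needed before the prediction flips, yielding a smaller `flip_fraction`.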

Key contributions include:

  1. Proposing a new faithfulness evaluation protocol for autoregressive models that preserves input distribution.
  2. Demonstrating the importance of counterfactuals in mitigating OOD issues during evaluation.
  3. Evaluating sensitivity differences between fine-tuned and off-the-shelf models when using feature attribution.

The experiments use a fine-tuned Gemma-2b model and an off-the-shelf Gemma-2b-instruct model, evaluated on the SST-2, IMDB, and AG-News datasets. The results highlight the advantages of counterfactual generation for reliably evaluating the faithfulness of feature attribution methods, especially for off-the-shelf instruct-tuned models.

Repository Structure

  • Attribution.ipynb: Contains code for computing feature attributions for the evaluation examples.
  • lm_saliency.py: Contains helper functions for Attribution.ipynb.
  • training_cfg_imdb.py, training_cfg_news.py, training_cfg_sst.py: Scripts for training counterfactual generators (editors) on the IMDB, AG-News, and SST datasets, respectively. The training code is similar to train.py for instruction tuning on the Alpaca dataset: https://github.com/tatsu-lab/stanford_alpaca.
  • plot_corrs.ipynb: Plots correlations between different example-editing scenarios (including our counterfactual generation); produces Figures 5 and 6 of the paper.
  • train_predictor.ipynb: Notebook for training predictors for the three datasets (fine-tuned Gemma models).
  • ood_comparisons.ipynb: Computes the percentage of edited sentences that are OOD; produces Tables 1, 4, 5, and 6 of the paper.
  • mask_percent.ipynb: Computes the numbers in Tables 2, 3, 7, and 8 of the paper.
  • camel.ipynb: Generates counterfactuals using our method and computes the percentage of changes needed to flip the model's prediction.
  • camel_mask.ipynb: Generates counterfactuals using the other replacement methods considered in the paper and computes the percentage of changes needed to flip the model's prediction.

Citation

If you use this code in your research, please cite the paper:

@inproceedings{kamahi-yaghoobzadeh-2024-counterfactuals,
    title = "Counterfactuals as a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models",
    author = "Kamahi, Sepehr  and
      Yaghoobzadeh, Yadollah",
    editor = "Belinkov, Yonatan  and
      Kim, Najoung  and
      Jumelet, Jaap  and
      Mohebbi, Hosein  and
      Mueller, Aaron  and
      Chen, Hanjie",
    booktitle = "Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP",
    month = nov,
    year = "2024",
    address = "Miami, Florida, US",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.blackboxnlp-1.28",
    doi = "10.18653/v1/2024.blackboxnlp-1.28",
    pages = "452--468",
    abstract = "Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method{---}how accurately it explains the inner workings and decision-making of the model{---}is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model{'}s output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models. Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.",
}
