Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

This repository contains the data and code used to perform the experiments of the paper: Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability. It also includes the supplementary materials.

How to use

First, clone the repository and install the required dependencies:

git clone git@github.com:jgcarrasco/detecting-vulnerabilities-mech-interp.git
cd detecting-vulnerabilities-mech-interp
pip install -r requirements.txt

Then work through the Jupyter notebooks in numerical order to follow the case study presented in the paper. There is one notebook per step:

0_building_dataset.ipynb: Build the synthetic dataset of acronyms that will be used in the subsequent steps.
1_patching.ipynb: Apply a series of patching experiments to localize the circuit.
2_build_acronyms.ipynb: Build a set of acronyms with Algorithm 1 that will be used to locate vulnerabilities.
3_lens.ipynb: Locate which components of the circuit are vulnerable via logit lens.
4_interpret_vulnerabilities.ipynb: Interpret what is happening on the identified vulnerable components.
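
For readers unfamiliar with the logit lens used in 3_lens.ipynb, the idea can be sketched in a few lines: take the output of a model component, project it through the unembedding matrix, and check which token it promotes. The snippet below is a toy illustration in plain Python, not the code from the notebooks. The vocabulary, the matrix `W_U`, the component names (`head_a`, etc.) and all numbers are made-up assumptions, and the final layer normalization that a real model applies before unembedding is omitted for brevity.

```python
# Toy logit lens: project component outputs onto the vocabulary.
# Every value here is hypothetical and chosen only for illustration.


def project(vector, unembed):
    """Return vector @ unembed, where unembed has shape (d_model, vocab_size)."""
    vocab_size = len(unembed[0])
    return [
        sum(vector[i] * unembed[i][j] for i in range(len(vector)))
        for j in range(vocab_size)
    ]


def logit_lens(activations, unembed, vocab):
    """Map each named activation to the token it scores highest."""
    top_tokens = {}
    for name, vector in activations.items():
        logits = project(vector, unembed)
        best = max(range(len(vocab)), key=lambda j: logits[j])
        top_tokens[name] = vocab[best]
    return top_tokens


# Hypothetical three-token vocabulary (think of candidate acronym letters).
VOCAB = ["A", "B", "C"]

# Hypothetical unembedding matrix, shape (d_model=2, vocab_size=3).
W_U = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.5],
]

# Hypothetical outputs of three components at the final token position.
ACTIVATIONS = {
    "head_a": [2.0, 0.1],
    "head_b": [0.2, 1.5],
    "head_c": [-1.0, 0.3],
}

result = logit_lens(ACTIVATIONS, W_U, VOCAB)
print(result)
```

In this toy setting, a component whose top token differs from the correct letter on a given input would be a candidate for closer inspection; the criteria actually used in the paper are described in the notebooks themselves.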
