Iterated Prisoner's Dilemma (IPD) — LLM Agent Simulation & Strategy Modulation with Sparse Autoencoders
This repository contains the official codebase and experimental data for the paper “Interpretable Risk Mitigation in LLM Agent Systems.” We explore how Large Language Models (LLMs) behave when playing the Iterated Prisoner’s Dilemma (IPD) and demonstrate how Sparse Autoencoder (SAE) steering can help interpret and influence their strategies.
- Overview
- Features
- Repository Structure
- Usage
  - Notebook LLM Instance (Google Colab)
  - Custom LLM Endpoint Calls
  - Neuronpedia API Calls
- Data Analysis
- Citation
- License
The project focuses on risk mitigation in autonomous AI agents playing the IPD. Our approach:
- Uses Sparse Autoencoders to steer and analyze LLM outputs.
- Demonstrates how interpretability tools (SAE Lens) can help visualize internal representations of the LLM during gameplay.
- Provides multiple ways to generate and simulate IPD strategies, including local LLMs, custom endpoints, and the Neuronpedia API.

The primary goal is to reduce harmful agent behaviors and enhance trustworthiness in multi-step interactions.
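As a rough illustration of what steering with an SAE feature can look like, the sketch below adds a scaled, unit-normalized feature direction to a hidden activation vector. This is a toy in plain Python under stated assumptions: `steer`, `direction`, and `strength` are hypothetical names chosen for this example, the vectors are made up, and the notebooks in this repository may implement steering differently.

```python
import math

def steer(hidden, direction, strength):
    """Return hidden + strength * unit(direction).

    `hidden` stands in for a model activation and `direction` for the
    decoder vector of one SAE feature (both are illustrative lists here).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, direction)]

# Toy example: push a 2-d activation along the feature direction (3, 4).
steered = steer([0.0, 0.0], [3.0, 4.0], strength=5.0)
```

A positive `strength` amplifies the feature and a negative one suppresses it; in a real model the same addition would be applied to activation tensors during generation.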
- IPD Game Generation: Code to generate multi-round Prisoner’s Dilemma games using LLM prompts and strategies.
- Sparse Autoencoder Steering: An SAE-based approach to modulate and analyze LLM hidden activations.
- Multiple Strategy Implementations: Classic IPD strategies (e.g., tit-for-tat, win-stay-lose-change) alongside advanced agent-based approaches.
- Extensive Data Analysis Tools: Scripts to analyze agent performance, compare strategies, and visualize transformer activations.
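For readers unfamiliar with the classic strategies named above, here is a minimal sketch of tit-for-tat and win-stay-lose-change playing an iterated game. The payoff values (5, 3, 1, 0) are the textbook ones and are an assumption, as are all function names; this is not the code in `simulation/strategies.py`.

```python
# Payoffs as (first player, second player). The textbook values below are
# an assumption and may differ from those used in the paper.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def tit_for_tat(my_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_history[-1] if opp_history else "C"

def win_stay_lose_change(my_history, opp_history):
    """Repeat the last move after a payoff of 3 or 5, otherwise switch."""
    if not my_history:
        return "C"
    last_payoff = PAYOFFS[(my_history[-1], opp_history[-1])][0]
    if last_payoff >= 3:
        return my_history[-1]
    return "D" if my_history[-1] == "C" else "C"

def always_defect(my_history, opp_history):
    """Baseline opponent that never cooperates."""
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Play an iterated game; return both move histories and total scores."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += pay_a
        score_b += pay_b
    return hist_a, hist_b, score_a, score_b

hist_a, hist_b, score_a, score_b = play(tit_for_tat, always_defect, rounds=5)
```

Each strategy is a function of the two move histories, so an LLM-backed agent could be dropped in as another callable with the same signature.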
mitigation
├─ notebooks
│  ├─ clustering.ipynb
│  └─ colab
│     ├─ data
│     ├─ 0_steering_and_analysis.ipynb
│     ├─ 1_feature_interpretations.ipynb
│     ├─ 2_find_features_on_prompt.ipynb
│     └─ 3_steering_multiple_featires.ipynb
├─ simulation
│  ├─ analyzers.py
│  ├─ mixtral.py
│  ├─ neuropedia.py
│  ├─ rounds.py
│  └─ strategies.py
├─ pics
├─ requirements.txt
├─ results
│  ├─ SAE_results
│  ├─ mixtral_results
│  │  ├─ new_run_analysis
│  │  └─ old_run_analysis
│  └─ strategies_results
│     └─ win_stay_lose_change.tsv
└─ data
   ├─ cluster_data.npy
   ├─ cluster_data.tsv
   ├─ combined_data_mixtral.npy
   ├─ combined_data_mixtral.tsv
   ├─ neuropedia_test.tsv
   └─ raw_simulation_data
      ├─ mixtral_runs
      ├─ neuropedia_runs
      └─ strategies
- notebooks/: Jupyter notebooks for clustering, analysis, and simulations (includes Google Colab-ready notebooks).
- simulation/: Core Python scripts for generating IPD rounds, connecting to LLM endpoints, and analyzing results.
- data/: Preprocessed and raw IPD simulation data.
- results/: Output files and logs from IPD simulations and analyses.
- Notebook LLM Instance (Google Colab)
  - Start with 0_steering_and_analysis.ipynb, which covers the premise of the paper: it generates LLM agent responses, steers the generation with a provided feature ID, and then analyzes and visualizes the results.
  - 1_feature_interpretations.ipynb shows interpretations of a given feature.
  - 2_find_features_on_prompt.ipynb finds the most strongly activating features for a prompt.
  - 3_steering_multiple_featires.ipynb iterates over a set of interesting features and saves the results to Google Drive.
- Custom LLM Endpoint Calls
  - Configure your endpoint in mixtral.py by adding the URL and access credentials.
  - Use the request function to call your custom endpoint for IPD round generation. The code sends prompts to your LLM endpoint, collects the responses, and runs the IPD simulation logic.
- Neuronpedia API Calls
  - Sign up at Neuronpedia to obtain API credentials.
  - Place your credentials in neuropedia.py.
  - Run neuropedia.py to generate and steer IPD responses via the Neuronpedia API.
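The prompt, request, and response steps described for custom endpoints can be sketched roughly as follows. Everything here is an assumption made for illustration: the prompt wording, the JSON fields `prompt` and `text`, the bearer-token header, and the function names are hypothetical and do not reflect the actual interface in `mixtral.py` or the Neuronpedia API.

```python
import json
import re
import urllib.request

def build_prompt(history):
    """Render the game so far as a plain-text prompt (hypothetical wording)."""
    lines = ["You are playing the Iterated Prisoner's Dilemma."]
    for i, (mine, theirs) in enumerate(history, start=1):
        lines.append(f"Round {i}: you played {mine}, opponent played {theirs}.")
    lines.append("Reply with COOPERATE or DEFECT.")
    return "\n".join(lines)

def parse_move(reply):
    """Map free-form LLM text to 'C', 'D', or None when no move is found."""
    match = re.search(r"\b(cooperate|defect)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    return "C" if match.group(1).lower() == "cooperate" else "D"

def request_move(url, api_key, history, timeout=30):
    """POST the prompt to an endpoint and parse the move from its reply.

    The payload shape and the 'text' response field are assumptions;
    adapt them to whatever your endpoint actually expects and returns.
    """
    payload = json.dumps({"prompt": build_prompt(history)}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return parse_move(body.get("text", ""))
```

Note that `parse_move` only takes the first matching keyword, so a reply such as "I will not defect" would be read as a defection; stricter output formats or more careful parsing may be needed in practice.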
Most of the data processing and analysis scripts are located in simulation/analyzers.py. They:
- Process single-round data.
- Aggregate multi-round game data.
- Perform multi-game comparisons.
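To give a feel for this kind of analysis, the sketch below computes a per-game cooperation rate and averages it per strategy across several games. The function names and the toy input are hypothetical; they are not taken from `simulation/analyzers.py`, and the real scripts work on the TSV and NPY files under `data/`.

```python
from collections import defaultdict

def cooperation_rate(moves):
    """Fraction of 'C' moves in one game; 0.0 for an empty game."""
    return moves.count("C") / len(moves) if moves else 0.0

def compare_strategies(games):
    """Average the per-game cooperation rate for each strategy label.

    `games` is an iterable of (label, moves) pairs, where `moves` is the
    list of 'C'/'D' actions one agent took over a multi-round game.
    """
    rates = defaultdict(list)
    for label, moves in games:
        rates[label].append(cooperation_rate(moves))
    return {label: sum(values) / len(values) for label, values in rates.items()}

# Toy data, made up for this example.
games = [
    ("tit_for_tat", ["C", "C", "D", "C"]),
    ("tit_for_tat", ["C", "C", "C", "C"]),
    ("always_defect", ["D", "D", "D", "D"]),
]
summary = compare_strategies(games)
```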
If you find this code useful for your research, please cite our paper:
@article{ ... }
This project is licensed under the MIT License. Feel free to use, modify, and distribute this codebase.