Iterated Prisoner's Dilemma (IPD) — LLM Agent Simulation & Strategy Modulation with Sparse Autoencoders

This repository contains the official codebase and experimental data for the paper “Interpretable Risk Mitigation in LLM Agent Systems.” We explore how Large Language Models (LLMs) behave when playing the Iterated Prisoner’s Dilemma (IPD) and demonstrate how Sparse Autoencoder (SAE) steering can help interpret and influence their strategies.

Overview

The project focuses on risk mitigation in autonomous AI agents playing the IPD. Our approach:

Uses Sparse Autoencoders for steering and analyzing LLM outputs.
Demonstrates how interpretability tools (SAE Lens) can help visualize internal representations of the LLM during gameplay.
Provides multiple ways to generate and simulate IPD strategies, including local LLMs, custom endpoints, and the Neuronpedia API.

The primary goal is to reduce harmful agent behaviors and enhance trustworthiness in multi-step interactions.

Features

IPD Game Generation: Code to generate multi-round Prisoner’s Dilemma games using LLM prompts and strategies.
Sparse Autoencoder Steering: An SAE-based approach to modulate and analyze LLM hidden activations.
Multiple Strategy Implementations: Classic IPD strategies (e.g., tit-for-tat, win-stay-lose-change) alongside advanced agent-based approaches.
Extensive Data Analysis Tools: Scripts to analyze agent performance, compare strategies, and visualize transformer activations.
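
To make the classic strategies concrete, here is a minimal sketch of tit-for-tat and win-stay-lose-change playing against a fixed opponent. It is an illustration only, not the code in simulation/strategies.py: the payoff values (the conventional T=5, R=3, P=1, S=0), the `History` type, and the function names `play` and `always_defect` are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Assumed payoff matrix (conventional T=5, R=3, P=1, S=0);
# the repository may use different values.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# One entry per round: (own move, opponent move).
History = List[Tuple[str, str]]
Strategy = Callable[[History], str]


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]


def win_stay_lose_change(history: History) -> str:
    """Repeat the last move after a payoff of 3 or more, otherwise switch."""
    if not history:
        return "C"
    own, opponent = history[-1]
    if PAYOFFS[(own, opponent)][0] >= 3:
        return own
    return "D" if own == "C" else "C"


def always_defect(history: History) -> str:
    """Baseline opponent that defects every round."""
    return "D"


def play(a: Strategy, b: Strategy, rounds: int = 5) -> Tuple[int, int]:
    """Run a fixed number of rounds and return both total scores."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, always_defect))
    print(play(win_stay_lose_change, always_defect))
```

Each strategy sees only its own view of the history, so an LLM-backed agent could be plugged in as another `Strategy` callable under the same assumptions.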

Repository Structure

mitigation
├─ notebooks
│   ├─ clustering.ipynb
│   └─ colab
│       ├─ data
│       ├─ 0_steering_and_analysis.ipynb
│       ├─ 1_feature_interpretations.ipynb
│       ├─ 2_find_features_on_prompt.ipynb
│       └─ 3_steering_multiple_featires.ipynb
├─ simulation
│   ├─ analyzers.py
│   ├─ mixtral.py
│   ├─ neuropedia.py
│   ├─ rounds.py
│   └─ strategies.py
├─ pics
├─ requirements.txt
├─ results
│   ├─ SAE_results
│   ├─ mixtral_results
│   │   ├─ new_run_analysis
│   │   └─ old_run_analysis
│   └─ strategies_results
│       └─ win_stay_lose_change.tsv
└─ data
    ├─ cluster_data.npy
    ├─ cluster_data.tsv
    ├─ combined_data_mixtral.npy
    ├─ combined_data_mixtral.tsv
    ├─ neuropedia_test.tsv
    └─ raw_simulation_data
        ├─ mixtral_runs
        ├─ neuropedia_runs
        └─ strategies

Key Directories:

notebooks/: Jupyter notebooks for clustering, analysis, and simulations (includes Google Colab–ready notebooks).
simulation/: Core Python scripts for generating IPD rounds, hooking up to LLM endpoints, and analyzing results.
data/: Preprocessed and raw IPD simulation data.
results/: Output files and logs from IPD simulations and analyses.

Usage

Notebook LLM Instance (Google Colab)

Start with the 0_steering_and_analysis.ipynb notebook, which covers the premise of the paper: it generates LLM agent responses, steers the generation with a provided feature ID, and then analyzes and visualizes the results.
1_feature_interpretations.ipynb provides interpretations of a given feature.
2_find_features_on_prompt.ipynb finds the most strongly activating features for a prompt.
3_steering_multiple_featires.ipynb iterates over a set of interesting features and saves the results to Google Drive.
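
Steering with an SAE feature is commonly done by adding a scaled copy of the feature's decoder vector to a layer's hidden state. The sketch below shows that arithmetic on plain Python lists. It is an assumption-laden illustration, not the notebooks' implementation: the function name `steer`, the toy vectors, and the strength value are all hypothetical, and the notebooks work with real model activations rather than hand-written lists.

```python
from typing import List, Sequence


def steer(
    hidden: Sequence[float],
    decoder_direction: Sequence[float],
    strength: float,
) -> List[float]:
    """Return hidden + strength * decoder_direction, element by element."""
    if len(hidden) != len(decoder_direction):
        raise ValueError("hidden state and decoder direction must have equal length")
    return [h + strength * d for h, d in zip(hidden, decoder_direction)]


if __name__ == "__main__":
    # Toy 2-dimensional hidden state and feature direction (hypothetical values).
    hidden_state = [1.0, 2.0]
    feature_direction = [0.5, -1.0]
    print(steer(hidden_state, feature_direction, strength=2.0))
```

A positive strength pushes the hidden state toward the feature and a negative strength pushes it away; which sign and magnitude are useful for a given feature ID has to be found experimentally.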

Custom LLM Endpoint Calls

Configure your endpoint in mixtral.py by adding the URL and access credentials.
Use the request function to call your custom endpoint for IPD round generation.
The code sends prompts to your LLM endpoint, collects responses, and runs the IPD simulation logic.
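
The prompt-and-response loop for a custom endpoint might look roughly like the sketch below. Everything specific in it is an assumption: the URL, the bearer-token header, the JSON payload fields, the `text` response field, and the helper names `build_request`, `parse_move`, and `query_move` are hypothetical and do not describe the request function in mixtral.py.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your own endpoint and credentials.
ENDPOINT_URL = "https://example.com/v1/generate"
API_TOKEN = "YOUR_TOKEN"


def build_request(
    prompt: str, url: str = ENDPOINT_URL, token: str = API_TOKEN
) -> urllib.request.Request:
    """Build a JSON POST request carrying the prompt (payload shape is assumed)."""
    payload = json.dumps({"prompt": prompt, "max_tokens": 16}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def parse_move(text: str) -> str:
    """Map a free-text reply to 'C' or 'D'; return '?' if neither word appears."""
    lowered = text.lower()
    positions = {"C": lowered.find("cooperate"), "D": lowered.find("defect")}
    found = {move: pos for move, pos in positions.items() if pos >= 0}
    if not found:
        return "?"
    # If both words occur, take whichever is mentioned first.
    return min(found, key=found.get)


def query_move(prompt: str) -> str:
    """Send the prompt and parse the reply (assumes a JSON body with a 'text' field)."""
    with urllib.request.urlopen(build_request(prompt), timeout=30) as response:
        body = json.loads(response.read().decode("utf-8"))
    return parse_move(body["text"])
```

Keeping request construction and reply parsing separate from the network call makes both easy to check offline before pointing the code at a real endpoint.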

Neuronpedia API Calls

Sign up at Neuronpedia to obtain API credentials.
Place your credentials in neuropedia.py.
Run neuropedia.py to generate and steer IPD responses via the Neuronpedia API.

Data Analysis

Most of the data processing and analysis scripts are located in simulation/analyzers.py:

Process single-round data.
Aggregate multi-round game data.
Perform multi-game comparisons.
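
As a rough illustration of the aggregation step, the sketch below folds per-round rows into a per-game cooperation rate and total score. The column names (`game_id`, `round`, `move`, `payoff`), the sample rows, and the function name `summarize` are assumptions for this example; the actual TSV layout and the functions in simulation/analyzers.py may differ.

```python
import csv
import io
from collections import defaultdict
from typing import Dict

# Hypothetical per-round data; the column names are assumed, not the repo's schema.
SAMPLE_TSV = (
    "game_id\tround\tmove\tpayoff\n"
    "g1\t1\tC\t3\n"
    "g1\t2\tD\t5\n"
    "g2\t1\tD\t1\n"
)


def summarize(tsv_text: str) -> Dict[str, Dict[str, float]]:
    """Aggregate per-round rows into cooperation rate and total score per game."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    games = defaultdict(lambda: {"rounds": 0, "cooperations": 0, "score": 0})
    for row in reader:
        game = games[row["game_id"]]
        game["rounds"] += 1
        game["cooperations"] += 1 if row["move"] == "C" else 0
        game["score"] += int(row["payoff"])
    return {
        game_id: {
            "coop_rate": stats["cooperations"] / stats["rounds"],
            "total_score": stats["score"],
        }
        for game_id, stats in games.items()
    }


if __name__ == "__main__":
    for game_id, stats in summarize(SAMPLE_TSV).items():
        print(game_id, stats)
```

The per-game summaries can then be compared across strategies or steering settings.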

Citation

If you find this code useful for your research, please cite our paper:

@article{ ... }

License

This project is licensed under the MIT License. Feel free to use, modify, and distribute this codebase.

Samsung/LLM-Agent-SAE
