Iterated Prisoner's Dilemma (IPD) — LLM Agent Simulation & Strategy Modulation with Sparse Autoencoders

This repository contains the official codebase and experimental data for the paper “Interpretable Risk Mitigation in LLM Agent Systems.” We explore how Large Language Models (LLMs) behave when playing the Iterated Prisoner’s Dilemma (IPD) and demonstrate how Sparse Autoencoder (SAE) steering can help interpret and influence their strategies.

Table of Contents

  • Overview
  • Features
  • Repository Structure
  • Usage
  • Data Analysis
  • Citation
  • License

Overview

The project focuses on risk mitigation in autonomous AI agents playing the IPD. Our approach:

  • Uses Sparse Autoencoders (SAEs) for steering and analyzing LLM outputs.
  • Demonstrates how interpretability tools (SAE Lens) can help visualize the LLM's internal representations during gameplay.
  • Provides multiple ways to generate and simulate IPD strategies, including local LLMs, custom endpoints, and the Neuronpedia API.

The primary goal is to reduce harmful agent behaviors and enhance trustworthiness in multi-step interactions.
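For intuition, here is a minimal sketch of residual-stream steering with an SAE feature direction, assuming TransformerLens; the model, hook point, feature direction, and scale below are illustrative placeholders, not the configuration used in the paper or the notebooks.

import torch
from transformer_lens import HookedTransformer

# Hypothetical setup: the model, layer, and steering direction are placeholders.
model = HookedTransformer.from_pretrained("gpt2")
hook_point = "blocks.8.hook_resid_post"  # residual-stream hook to steer
# Stand-in for an SAE decoder row, e.g. sae.W_dec[feature_id] in the notebooks.
steering_vector = torch.randn(model.cfg.d_model, device=model.cfg.device)
scale = 4.0

def steer(resid, hook):
    # Add the scaled, unit-norm feature direction to every token position.
    direction = steering_vector / steering_vector.norm()
    return resid + scale * direction

prompt = "You are playing an iterated prisoner's dilemma. Your move:"
with model.hooks(fwd_hooks=[(hook_point, steer)]):
    steered_output = model.generate(prompt, max_new_tokens=20)
print(steered_output)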

Features

  • IPD Game Generation: Code to generate multi-round Prisoner’s Dilemma games using LLM prompts and strategies.
  • Sparse Autoencoder Steering: An SAE-based approach to modulate and analyze LLM hidden activations.
  • Multiple Strategy Implementations: Classic IPD strategies (e.g., tit-for-tat, win-stay-lose-change) alongside advanced agent-based approaches; a minimal strategy sketch follows this list.
  • Extensive Data Analysis Tools: Scripts to analyze agent performance, compare strategies, and visualize transformer activations.
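To make the classic strategies concrete, here is a minimal, self-contained sketch of tit-for-tat and win-stay-lose-change with the standard IPD payoff matrix; it is illustrative only and independent of the code in simulation/strategies.py.

# Standard payoffs: (my_move, opponent_move) -> (my_score, opponent_score)
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def tit_for_tat(my_history, opp_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opp_history else opp_history[-1]

def win_stay_lose_change(my_history, opp_history):
    # Repeat the last move if it scored well (3 or 5), otherwise switch.
    if not my_history:
        return "C"
    last_score, _ = PAYOFFS[(my_history[-1], opp_history[-1])]
    if last_score >= 3:
        return my_history[-1]
    return "D" if my_history[-1] == "C" else "C"

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += pay_a
        score_b += pay_b
    return score_a, score_b

print(play(tit_for_tat, win_stay_lose_change))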

Repository Structure

mitigation
├─ notebooks
│   ├─ clustering.ipynb
│   └─ colab
│       ├─ data
│       ├─ 0_steering_and_analysis.ipynb
│       ├─ 1_feature_interpretations.ipynb
│       ├─ 2_find_features_on_prompt.ipynb
│       └─ 3_steering_multiple_featires.ipynb
├─ simulation
│   ├─ analyzers.py
│   ├─ mixtral.py
│   ├─ neuropedia.py
│   ├─ rounds.py
│   └─ strategies.py
├─ pics
├─ requirements.txt
├─ results
│   ├─ SAE_results
│   ├─ mixtral_results
│   │   ├─ new_run_analysis
│   │   └─ old_run_analysis
│   └─ strategies_results
│       └─ win_stay_lose_change.tsv
└─ data
    ├─ cluster_data.npy
    ├─ cluster_data.tsv
    ├─ combined_data_mixtral.npy
    ├─ combined_data_mixtral.tsv
    ├─ neuropedia_test.tsv
    └─ raw_simulation_data
        ├─ mixtral_runs
        ├─ neuropedia_runs
        └─ strategies

Key Directories:

  • notebooks/: Jupyter notebooks for clustering, analysis, and simulations (includes Google Colab–ready notebooks).
  • simulation/: Core Python scripts for generating IPD rounds, hooking up to LLM endpoints, and analyzing results.
  • data/: Preprocessed and raw IPD simulation data.
  • results/: Output files and logs from IPD simulations and analyses.

Usage

  1. Notebook LLM Instance (Google Colab)
  • Start by familiarizing yourself with the 0_steering_and_analysis.ipynb notebook, which covers the premise of the paper: we generate LLM agent responses, steer the generation with a provided feature ID, and analyze and visualize the results.
  • In 1_feature_interpretations.ipynb you can find interpretations of a given feature.
  • In 2_find_features_on_prompt.ipynb you can find the most activating features for a prompt.
  • In 3_steering_multiple_featires.ipynb we iterate over a set of interesting features and save the results to Google Drive.
  2. Custom LLM Endpoint Calls
  • Configure your endpoint in mixtral.py by adding the URL and access credentials.
  • Use the request function to call your custom endpoint for IPD round generation (a minimal sketch follows this list).
  • The code sends prompts to your LLM endpoint, collects the responses, and runs the IPD simulation logic.
  3. Neuronpedia API Calls
  • Sign up at Neuronpedia to obtain API credentials.
  • Place your credentials in neuropedia.py.
  • Launch neuropedia.py to generate and steer IPD responses via the Neuronpedia API.
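For step 2, here is a minimal sketch of what a custom endpoint call could look like; the URL, token, payload fields, and response schema are hypothetical placeholders rather than the actual interface of mixtral.py.

import requests

# Hypothetical endpoint and credentials -- replace with your own values,
# as configured in mixtral.py. The payload schema is illustrative only.
ENDPOINT_URL = "https://your-llm-endpoint.example.com/v1/generate"
API_TOKEN = "YOUR_ACCESS_TOKEN"

def request_move(prompt: str, max_tokens: int = 64) -> str:
    """Send an IPD prompt to the endpoint and return the raw model response."""
    response = requests.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]  # response field name depends on your endpoint

prompt = "You are playing an iterated prisoner's dilemma. Reply with COOPERATE or DEFECT."
print(request_move(prompt))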

Data Analysis

Most of the data processing and analysis scripts are located in simulation/analyzers.py:

  • Process single-round data.
  • Aggregate multi-round game data.
  • Perform multi-game comparisons.
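To illustrate the kind of aggregation these scripts perform, here is a minimal pandas sketch; the file path comes from the repository tree, but the column names are a hypothetical schema, not the actual format of the result files.

import pandas as pd

# Hypothetical schema: one row per round with agent moves and scores.
# The real column names in results/ may differ -- adjust to your files.
df = pd.read_csv("results/strategies_results/win_stay_lose_change.tsv", sep="\t")

# Per-game aggregation: total score and cooperation rate per agent.
summary = df.groupby("game_id").agg(
    total_score_a=("score_a", "sum"),
    total_score_b=("score_b", "sum"),
    coop_rate_a=("move_a", lambda m: (m == "C").mean()),
    coop_rate_b=("move_b", lambda m: (m == "C").mean()),
)
print(summary.describe())  # multi-game comparison at a glance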

Citation

If you find this code useful for your research, please cite our paper:

@article{ ... }

License

This project is licensed under the MIT License. Feel free to use, modify, and distribute this codebase.
