Skip to content

ahmettrkck/safety-steering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

safety-steering

Extracts safety-relevant directions from a language model's activation space and uses them to steer behavior at inference time. Quantifies the alignment-capability tradeoff: how much general capability do you lose when you steer a model to be safer?

Based on:

questions

  1. Can we extract interpretable "safety directions" from the residual stream using contrastive activation pairs?
  2. Does steering along these directions at inference reliably increase refusal on harmful prompts?
  3. What is the alignment-capability tradeoff — how much helpfulness and general knowledge (MMLU) do you lose?

method

  1. Direction extraction: run 30 matched contrastive prompt pairs (harmful/harmless) through Gemma-2-2B-IT, collect residual stream activations at every layer, compute the mean activation difference per layer
  2. Steering: hook into the residual stream at the layer with the strongest safety direction and add α × direction_vector during the forward pass
  3. Evaluation: sweep α ∈ [-2, 2] and measure refusal rate (40 harmful prompts), helpfulness (40 benign prompts), and MMLU accuracy (20 questions)

Also extracts an "honesty" direction from 20 sycophancy-vs-truthfulness pairs.

setup

pip install -r requirements.txt

Needs a GPU with ≥8 GB VRAM for Gemma-2-2B in bfloat16, or use Colab (T4 free tier works).

usage

# 1. extract steering directions
python src/extract_directions.py --model google/gemma-2-2b-it --output_dir outputs/directions

# 2. quick sanity check — steer and generate
python src/steer.py --model google/gemma-2-2b-it --directions_dir outputs/directions --alpha 1.5

# 3. full evaluation sweep
python src/evaluate.py --model google/gemma-2-2b-it --directions_dir outputs/directions --output_dir outputs/results

For Colab with GPU, use notebooks/run_steering.ipynb.

structure

├── configs/default.yaml          # model and experiment settings
├── src/
│   ├── contrastive_pairs.py      # 30 refusal + 20 honesty extraction pairs, held-out eval sets
│   ├── extract_directions.py     # compute steering vectors (mean_diff or PCA)
│   ├── steer.py                  # hook-based activation steering at inference
│   ├── evaluate.py               # refusal rate, helpfulness, MMLU accuracy
│   └── utils.py                  # seed, device helpers
├── notebooks/
│   └── run_steering.ipynb        # full pipeline for Colab
└── outputs/                      # generated vectors and results (gitignored)

results

Run the notebook on Colab to reproduce. Results table will appear here after experiments.

refs

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors