Extracts safety-relevant directions from a language model's activation space and uses them to steer behavior at inference time. Quantifies the alignment-capability tradeoff: how much general capability do you lose when you steer a model to be safer?
Based on:
- Representation Engineering (Zou et al., 2023)
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024)
Research questions:
- Can we extract interpretable "safety directions" from the residual stream using contrastive activation pairs?
- Does steering along these directions at inference time reliably increase refusal on harmful prompts?
- What is the alignment-capability tradeoff — how much helpfulness and general knowledge (MMLU) do you lose?
Method:
- Direction extraction: run 30 matched contrastive prompt pairs (harmful/harmless) through Gemma-2-2B-IT, collect residual-stream activations at every layer, and compute the mean activation difference per layer
- Steering: hook into the residual stream at the layer with the strongest safety direction and add `α × direction_vector` during the forward pass
- Evaluation: sweep `α ∈ [-2, 2]` and measure refusal rate (40 harmful prompts), helpfulness (40 benign prompts), and MMLU accuracy (20 questions)
Also extracts an "honesty" direction from 20 sycophancy-vs-truthfulness pairs.
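The mean-difference extraction step can be sketched on synthetic data (a minimal sketch: the activations below are random toy vectors with a planted direction, whereas the real pipeline collects residual-stream activations from Gemma-2-2B-IT):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 16, 30

# Plant a ground-truth "safety" direction in toy activation space
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# Toy stand-ins for one layer's activations on harmless/harmful prompts:
# harmful activations are shifted along the planted direction, plus noise
harmless = rng.normal(size=(n_pairs, d_model))
harmful = harmless + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d_model))

# Mean-difference steering vector, normalized to unit length
direction = (harmful - harmless).mean(axis=0)
direction /= np.linalg.norm(direction)

# Cosine similarity with the planted direction should be close to 1.0
print(float(direction @ true_dir))
```

Averaging over matched pairs cancels prompt-specific variation, so even noisy per-pair differences recover the shared direction.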
```bash
pip install -r requirements.txt
```

Needs a GPU with ≥8 GB VRAM for Gemma-2-2B in bfloat16, or use Colab (T4 free tier works).
```bash
# 1. extract steering directions
python src/extract_directions.py --model google/gemma-2-2b-it --output_dir outputs/directions

# 2. quick sanity check — steer and generate
python src/steer.py --model google/gemma-2-2b-it --directions_dir outputs/directions --alpha 1.5

# 3. full evaluation sweep
python src/evaluate.py --model google/gemma-2-2b-it --directions_dir outputs/directions --output_dir outputs/results
```

For Colab with GPU, use notebooks/run_steering.ipynb.
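The hook-based steering that steer.py performs amounts to adding `α × direction` to one layer's output during the forward pass. A minimal sketch on a toy module (the `nn.Linear` stand-in and variable names are illustrative; the real code hooks a Gemma transformer layer):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy stand-in for the transformer block whose residual stream we steer
block = nn.Linear(d_model, d_model)

# Hypothetical pre-computed steering direction, unit norm
direction = torch.randn(d_model)
direction = direction / direction.norm()
alpha = 1.5

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + alpha * direction

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, d_model)
steered = block(x)
handle.remove()
unsteered = block(x)

# The steered output differs by exactly alpha * direction at every position
delta = steered - unsteered
print(torch.allclose(delta, alpha * direction.expand_as(delta)))  # True
```

Because the hook only adds a constant vector, steering costs no extra forward passes and can be toggled per request by attaching or removing the hook.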
```
├── configs/default.yaml      # model and experiment settings
├── src/
│   ├── contrastive_pairs.py  # 30 refusal + 20 honesty extraction pairs, held-out eval sets
│   ├── extract_directions.py # compute steering vectors (mean_diff or PCA)
│   ├── steer.py              # hook-based activation steering at inference
│   ├── evaluate.py           # refusal rate, helpfulness, MMLU accuracy
│   └── utils.py              # seed, device helpers
├── notebooks/
│   └── run_steering.ipynb    # full pipeline for Colab
└── outputs/                  # generated vectors and results (gitignored)
```
Run the notebook on Colab to reproduce. A results table will appear here once the experiments have been run.
- Zou et al. (2023) — Representation Engineering: A Top-Down Approach to AI Transparency
- Arditi et al. (2024) — Refusal in Language Models Is Mediated by a Single Direction
- Turner et al. (2023) — Activation Addition: Steering Language Models Without Optimization
- Burns et al. (2023) — Weak-to-Strong Generalization