Extracts safety-relevant directions from a language model's activation space and uses them to steer behavior at inference time. Quantifies the alignment-capability tradeoff: how much general capability do you lose when you steer a model to be safer?
Based on:
- Representation Engineering (Zou et al., 2023)
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024)
Research questions:
- Can we extract interpretable "safety directions" from the residual stream using contrastive activation pairs?
- Does steering along these directions at inference time reliably increase refusal on harmful prompts?
- What is the alignment-capability tradeoff — how much helpfulness and general knowledge (MMLU) do you lose?
Method:
- Direction extraction: run 30 matched contrastive prompt pairs (harmful/harmless) through Gemma-2-2B-IT, collect residual-stream activations at every layer, and compute the mean activation difference per layer
- Steering: hook into the residual stream at the layer with the strongest safety direction and add `α × direction_vector` during the forward pass
- Evaluation: sweep `α ∈ [-2, 2]` and measure refusal rate (40 harmful prompts), helpfulness (40 benign prompts), and MMLU accuracy (20 questions)
Also extracts an "honesty" direction from 20 sycophancy-vs-truthfulness pairs.
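The mean-difference extraction step can be sketched on synthetic data (a minimal sketch: the activations below are random toy vectors with a planted direction, whereas the real pipeline collects residual-stream activations from Gemma-2-2B-IT):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 16, 30

# Plant a ground-truth "safety" direction in toy activation space
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# Toy stand-ins for one layer's activations on harmless/harmful prompts:
# harmful activations are shifted along the planted direction, plus noise
harmless = rng.normal(size=(n_pairs, d_model))
harmful = harmless + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d_model))

# Mean-difference steering vector, normalized to unit length
direction = (harmful - harmless).mean(axis=0)
direction /= np.linalg.norm(direction)

# Cosine similarity with the planted direction should be close to 1.0
print(float(direction @ true_dir))
```

Averaging over matched pairs cancels prompt-specific variation, so even noisy per-pair differences recover the shared direction.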
```bash
pip install -r requirements.txt
```

Needs a GPU with ≥8 GB VRAM for Gemma-2-2B in bfloat16, or use Colab (T4 free tier works).
```bash
# 1. extract steering directions
python src/extract_directions.py --model google/gemma-2-2b-it --output_dir outputs/directions

# 2. quick sanity check — steer and generate
python src/steer.py --model google/gemma-2-2b-it --directions_dir outputs/directions --alpha 1.5

# 3. full evaluation sweep
python src/evaluate.py --model google/gemma-2-2b-it --directions_dir outputs/directions --output_dir outputs/results
```

For Colab with GPU, use notebooks/run_steering.ipynb.
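The hook-based steering that steer.py performs amounts to adding `α × direction` to one layer's output during the forward pass. A minimal sketch on a toy module (the `nn.Linear` stand-in and variable names are illustrative; the real code hooks a Gemma transformer layer):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Toy stand-in for the transformer block whose residual stream we steer
block = nn.Linear(d_model, d_model)

# Hypothetical pre-computed steering direction, unit norm
direction = torch.randn(d_model)
direction = direction / direction.norm()
alpha = 1.5

def steering_hook(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output
    return output + alpha * direction

handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, d_model)
steered = block(x)
handle.remove()
unsteered = block(x)

# The steered output differs by exactly alpha * direction at every position
delta = steered - unsteered
print(torch.allclose(delta, alpha * direction.expand_as(delta)))  # True
```

Because the hook only adds a constant vector, steering costs no extra forward passes and can be toggled per request by attaching or removing the hook.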
```
├── configs/default.yaml      # model and experiment settings
├── src/
│   ├── contrastive_pairs.py  # 30 refusal + 20 honesty extraction pairs, held-out eval sets
│   ├── extract_directions.py # compute steering vectors (mean_diff or PCA)
│   ├── steer.py              # hook-based activation steering at inference
│   ├── evaluate.py           # refusal rate, helpfulness, MMLU accuracy
│   └── utils.py              # seed, device helpers
├── notebooks/
│   └── run_steering.ipynb    # full pipeline for Colab
└── outputs/                  # generated vectors and results (gitignored)
```
Run the notebook on Colab to reproduce. A results table will appear here once the experiments have been run.
- Zou et al. (2023) — Representation Engineering: A Top-Down Approach to AI Transparency
- Arditi et al. (2024) — Refusal in Language Models Is Mediated by a Single Direction
- Turner et al. (2023) — Activation Addition: Steering Language Models Without Optimization
- Burns et al. (2023) — Weak-to-Strong Generalization