
# Quantization x Interpretability

Research project investigating how model quantization affects interpretability tools, specifically Sparse Autoencoders (SAEs).

## Key Findings

1. **SAEs Transfer Across Precisions:** SAEs trained on BF16 activations achieve 99% sample correlation when applied to INT4 activations.
2. **Degradation Has Structure:** Code-generation performance degrades by 50% at INT4, while knowledge retrieval remains stable.
3. **Smaller SAEs Transfer Better:** SAEs with a 0.5x hidden dimension transfer 2.3x better than 8x SAEs.
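The transfer finding can be illustrated with a toy sketch: a randomly initialized stand-in for a BF16-trained SAE encoder is applied to the same activations before and after simulated quantization noise, and the per-sample correlation of the resulting features is averaged. All dimensions, weights, and the noise model below are illustrative assumptions, not the project's actual setup (see `scripts/` for the real pipeline).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real experiments use model-specific dimensions.
d_model, d_sae, n_samples = 64, 128, 256

# A minimal SAE encoder: ReLU(x @ W_enc + b_enc). Random weights stand in
# for an SAE trained on BF16 activations.
W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(x):
    """Encode activations into the sparse feature space."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Baseline activations vs. the same activations with small additive noise
# standing in for INT4 quantization error.
acts_bf16 = rng.normal(size=(n_samples, d_model))
acts_int4 = acts_bf16 + rng.normal(scale=0.01, size=acts_bf16.shape)

f_bf16 = sae_encode(acts_bf16)
f_int4 = sae_encode(acts_int4)

def sample_correlation(a, b):
    """Mean per-sample Pearson correlation between two feature matrices."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

corr = sample_correlation(f_bf16, f_int4)
print(f"mean per-sample correlation: {corr:.3f}")
```

With small simulated quantization noise, the feature vectors remain almost perfectly correlated, which is the shape of the effect the 99% figure reports.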

## Project Structure

```
.
├── scripts/               # Python experiment code
├── data/                  # Experiment results (JSON)
├── figures/               # Generated visualizations
├── research_summary.html  # Interactive results summary
├── RESEARCH_FINDINGS.md   # Detailed findings
└── METHODOLOGY.md         # Experimental methodology
```

## Models Tested

- Qwen3-Coder-30B-A3B (MoE architecture)
- StarCoder2-15B (dense architecture)

## Precisions

- BF16 (baseline)
- FP16
- INT8
- INT4 (NF4 quantization)
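The INT4/NF4 setting can be reproduced with bitsandbytes via the `transformers` `BitsAndBytesConfig`. This is a generic sketch, not the project's exact configuration; the model ID and compute dtype below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes), matching the INT4 setting.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Loading is commented out here because it downloads weights and needs a GPU:
# model = AutoModelForCausalLM.from_pretrained(
#     "bigcode/starcoder2-15b",   # one of the models tested (ID assumed)
#     quantization_config=nf4_config,
#     device_map="auto",
# )
```

The BF16, FP16, and INT8 settings correspond to `torch_dtype=torch.bfloat16`, `torch_dtype=torch.float16`, and `load_in_8bit=True` respectively.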

## Metrics

- **Procrustes Alignment:** 85-89% across architectures
- **Sample Correlation:** 99%
- **Top-10 Feature Agreement:** 89%
- **Feature Correlation:** 95%
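A Procrustes-style alignment score can be sketched as follows: recover the best orthogonal map between two feature matrices and report how much of the target's norm the aligned source explains. The data, sizes, and exact normalization here are illustrative assumptions, not necessarily the project's scoring.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in feature matrices at two precisions (sizes are illustrative).
X = rng.normal(size=(200, 32))                         # baseline (BF16) features
R_true, _ = np.linalg.qr(rng.normal(size=(32, 32)))    # hidden rotation
Y = X @ R_true + rng.normal(scale=0.05, size=X.shape)  # "quantized" features

# Orthogonal Procrustes: the R minimizing ||X @ R - Y||_F is U @ Vt,
# where U, Vt come from the SVD of X.T @ Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt

# Score alignment as the fraction of Y's squared norm explained
# after rotating X onto Y.
residual = np.linalg.norm(X @ R - Y) ** 2
alignment = 1.0 - residual / np.linalg.norm(Y) ** 2
print(f"Procrustes alignment: {alignment:.3f}")
```

High alignment under a recovered rotation means the two feature spaces differ mostly by a change of basis rather than in content.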

## Setup

```shell
pip install torch transformers bitsandbytes scipy numpy
```

Usage

See scripts/ for experiment code. Main entry points:

  • overnight_production_v2.py - Full experiment pipeline
  • semantic_transfer.py - SAE transfer analysis
  • benchmark_eval.py - Benchmark evaluation

## View Results

Open `research_summary.html` in a browser to see the interactive results summary.

## Context

This research was conducted as part of the Anthropic Fellows Program application. The goal is to understand whether interpretability tools trained on full-precision models remain valid when those models are quantized for production deployment.

## Author

Jack Switzer - January 2026