A research project investigating how model quantization affects interpretability tools, specifically Sparse Autoencoders (SAEs).
- SAEs Transfer Across Precisions: BF16-trained SAEs achieve 99% sample correlation when applied to INT4 activations
- Degradation Has Structure: Code generation degrades 50% at INT4, while knowledge retrieval remains stable
- Smaller SAEs Transfer Better: 0.5x hidden dimension SAEs transfer 2.3x better than 8x SAEs
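To make the "sample correlation" in the first finding concrete, here is a minimal numpy sketch: apply a frozen SAE encoder to activations from the same prompts at two precisions and average the per-sample Pearson correlation of the resulting feature vectors. All shapes and weights are hypothetical stand-ins, and the INT4 activations are simulated as BF16 plus small noise; the actual pipeline extracts real activations from the models listed below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: residual-stream activations for the same prompts under
# two precisions. INT4 is simulated here as BF16 plus small quantization noise.
d_model, n_samples, d_sae = 64, 256, 128
acts_bf16 = rng.normal(size=(n_samples, d_model))
acts_int4 = acts_bf16 + 0.01 * rng.normal(size=(n_samples, d_model))

# A frozen SAE encoder (illustrative random weights standing in for one
# trained on BF16 activations).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(x):
    """ReLU SAE encoder: feature activations for a batch of samples."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

f_bf16 = sae_encode(acts_bf16)
f_int4 = sae_encode(acts_int4)

def per_sample_correlation(a, b):
    """Pearson correlation between paired feature vectors, one per sample."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

corr = per_sample_correlation(f_bf16, f_int4).mean()
print(f"mean per-sample feature correlation: {corr:.3f}")
```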
```
.
├── scripts/              # Python experiment code
├── data/                 # Experiment results (JSON)
├── figures/              # Generated visualizations
├── research_summary.html # Interactive results summary
├── RESEARCH_FINDINGS.md  # Detailed findings
└── METHODOLOGY.md        # Experimental methodology
```
- Qwen3-Coder-30B-A3B (MoE architecture)
- StarCoder2-15B (Dense architecture)
- BF16 (baseline)
- FP16
- INT8
- INT4 (NF4 quantization)
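For readers unfamiliar with 4-bit quantization, a hedged numpy sketch of blockwise absmax quantization. It uses a uniform signed 4-bit grid for simplicity; actual NF4 (as implemented in bitsandbytes) instead maps to a fixed 16-value codebook spaced at normal-distribution quantiles, but the blockwise absmax scaling shown here is the same idea. Block size and tensor shapes are illustrative.

```python
import numpy as np

def quantize_int4_blockwise(w, block=32):
    """Blockwise absmax 4-bit quantization (uniform grid, illustrative).

    NF4 replaces the uniform grid below with a codebook of normal-distribution
    quantiles; the per-block absmax scale works the same way.
    """
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # signed 4-bit range
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Map 4-bit codes back to floats using the stored per-block scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int4_blockwise(w)
w_hat = dequantize(q, s, w.shape)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```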
- Procrustes Alignment: 85-89% across architectures
- Sample Correlation: 99%
- Top-10 Feature Agreement: 89%
- Feature Correlation: 95%
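The Procrustes alignment score above can be sketched in numpy: solve the orthogonal Procrustes problem between two SAE weight matrices via SVD, then report the cosine similarity after rotation. The matrices below are synthetic (a rotated, noisy copy standing in for an SAE trained at a different precision), and the exact normalization used in the repo's metric may differ.

```python
import numpy as np

def procrustes_alignment(A, B):
    """Find the rotation R minimizing ||A @ R - B||_F (orthogonal Procrustes,
    solved via SVD of A^T B), then return cosine similarity of A @ R and B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    A_rot = A @ R
    return (A_rot * B).sum() / (np.linalg.norm(A_rot) * np.linalg.norm(B))

rng = np.random.default_rng(0)
# Hypothetical decoder matrices of two SAEs (e.g. BF16- vs INT4-derived):
# B is a randomly rotated copy of A plus small noise.
A = rng.normal(size=(128, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # random orthogonal matrix
B = A @ Q + 0.05 * rng.normal(size=(128, 64))

print(f"Procrustes alignment: {procrustes_alignment(A, B):.3f}")
```

A score near 1.0 means the two feature bases agree up to rotation, which is the invariance the metric is meant to capture.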
```
pip install torch transformers bitsandbytes scipy numpy
```

See `scripts/` for experiment code. Main entry points:

- `overnight_production_v2.py` - Full experiment pipeline
- `semantic_transfer.py` - SAE transfer analysis
- `benchmark_eval.py` - Benchmark evaluation
Open `research_summary.html` in a browser to see the interactive results summary.
This research was conducted as part of the Anthropic Fellows Program application. The goal is to understand whether interpretability tools trained on full-precision models remain valid when those models are quantized for production deployment.
Jack Switzer - January 2026