
# Quantization x Interpretability

Research project investigating how model quantization affects interpretability tools, specifically Sparse Autoencoders (SAEs).

## Key Findings

1. **SAEs Transfer Across Precisions:** SAEs trained on BF16 activations achieve 99% sample correlation when applied to INT4 activations.
2. **Degradation Has Structure:** Code-generation performance degrades by 50% at INT4, while knowledge retrieval remains stable.
3. **Smaller SAEs Transfer Better:** SAEs with a 0.5x hidden dimension transfer 2.3x better than 8x SAEs.
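The transfer finding can be illustrated with a toy sketch: a randomly initialized stand-in for a BF16-trained SAE encoder is applied to the same activations before and after simulated quantization noise, and the per-sample correlation of the resulting features is averaged. All dimensions, weights, and the noise model below are illustrative assumptions, not the project's actual setup (see `scripts/` for the real pipeline).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real experiments use model-specific dimensions.
d_model, d_sae, n_samples = 64, 128, 256

# A minimal SAE encoder: ReLU(x @ W_enc + b_enc). Random weights stand in
# for an SAE trained on BF16 activations.
W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)

def sae_encode(x):
    """Encode activations into the sparse feature space."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Baseline activations vs. the same activations with small additive noise
# standing in for INT4 quantization error.
acts_bf16 = rng.normal(size=(n_samples, d_model))
acts_int4 = acts_bf16 + rng.normal(scale=0.01, size=acts_bf16.shape)

f_bf16 = sae_encode(acts_bf16)
f_int4 = sae_encode(acts_int4)

def sample_correlation(a, b):
    """Mean per-sample Pearson correlation between two feature matrices."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

corr = sample_correlation(f_bf16, f_int4)
print(f"mean per-sample correlation: {corr:.3f}")
```

With small simulated quantization noise, the feature vectors remain almost perfectly correlated, which is the shape of the effect the 99% figure reports.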

## Project Structure

```
.
├── scripts/               # Python experiment code
├── data/                  # Experiment results (JSON)
├── figures/               # Generated visualizations
├── research_summary.html  # Interactive results summary
├── RESEARCH_FINDINGS.md   # Detailed findings
└── METHODOLOGY.md         # Experimental methodology
```

## Models Tested

- Qwen3-Coder-30B-A3B (MoE architecture)
- StarCoder2-15B (dense architecture)

## Precisions

- BF16 (baseline)
- FP16
- INT8
- INT4 (NF4 quantization)
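The INT4/NF4 setting can be reproduced with bitsandbytes via the `transformers` `BitsAndBytesConfig`. This is a generic sketch, not the project's exact configuration; the model ID and compute dtype below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes), matching the INT4 setting.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Loading is commented out here because it downloads weights and needs a GPU:
# model = AutoModelForCausalLM.from_pretrained(
#     "bigcode/starcoder2-15b",   # one of the models tested (ID assumed)
#     quantization_config=nf4_config,
#     device_map="auto",
# )
```

The BF16, FP16, and INT8 settings correspond to `torch_dtype=torch.bfloat16`, `torch_dtype=torch.float16`, and `load_in_8bit=True` respectively.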

## Metrics

- **Procrustes Alignment:** 85-89% across architectures
- **Sample Correlation:** 99%
- **Top-10 Feature Agreement:** 89%
- **Feature Correlation:** 95%
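A Procrustes-style alignment score can be sketched as follows: recover the best orthogonal map between two feature matrices and report how much of the target's norm the aligned source explains. The data, sizes, and exact normalization here are illustrative assumptions, not necessarily the project's scoring.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in feature matrices at two precisions (sizes are illustrative).
X = rng.normal(size=(200, 32))                         # baseline (BF16) features
R_true, _ = np.linalg.qr(rng.normal(size=(32, 32)))    # hidden rotation
Y = X @ R_true + rng.normal(scale=0.05, size=X.shape)  # "quantized" features

# Orthogonal Procrustes: the R minimizing ||X @ R - Y||_F is U @ Vt,
# where U, Vt come from the SVD of X.T @ Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt

# Score alignment as the fraction of Y's squared norm explained
# after rotating X onto Y.
residual = np.linalg.norm(X @ R - Y) ** 2
alignment = 1.0 - residual / np.linalg.norm(Y) ** 2
print(f"Procrustes alignment: {alignment:.3f}")
```

High alignment under a recovered rotation means the two feature spaces differ mostly by a change of basis rather than in content.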

## Setup

```shell
pip install torch transformers bitsandbytes scipy numpy
```

Usage

See scripts/ for experiment code. Main entry points:

  • overnight_production_v2.py - Full experiment pipeline
  • semantic_transfer.py - SAE transfer analysis
  • benchmark_eval.py - Benchmark evaluation

## View Results

Open `research_summary.html` in a browser to see the interactive results summary.

## Context

This research was conducted as part of the Anthropic Fellows Program application. The goal is to understand whether interpretability tools trained on full-precision models remain valid when those models are quantized for production deployment.

## Author

Jack Switzer - January 2026