Geometric Feature Invariance in SAEs: A Framework for Transferable Mechanistic Interpretability and Scalable AI Safety

Abstract

This research will integrate controlled evaluation on synthetic model features with cross-model analysis of geometric feature invariance, with the goal of developing a principled framework for transferable interpretability across model families and variants. Recent work has shown that Sparse Autoencoders (SAEs) trained on different LLMs learn geometrically similar feature spaces (invariance and analogous feature universality). At the same time, evaluations against synthetic ground-truth features reveal trade-offs: no current SAE architecture perfectly recovers the ground-truth features.
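One way to make "geometrically similar feature spaces" concrete is to compare the pairwise cosine similarities between feature directions, since that structure is unchanged when the whole space is rotated. The sketch below is only an illustration under that assumption (similarity up to an orthogonal transformation, which this README does not specify); the 2-D vectors, the rotation angle, and the helper names `cosine`, `gram`, and `rotate2d` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gram(features):
    """Matrix of pairwise cosine similarities between feature directions."""
    return [[cosine(u, v) for v in features] for u in features]

def rotate2d(v, theta):
    """Rotate a 2-D vector by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Hypothetical feature directions for "model A" (toy values, not real SAE output).
features_a = [[1.0, 0.0], [0.6, 0.8], [-0.5, 0.5]]
# "Model B" is assumed to hold the same features expressed in a rotated basis.
features_b = [rotate2d(v, 0.7) for v in features_a]

gram_a = gram(features_a)
gram_b = gram(features_b)

# Largest entry-wise difference between the two similarity matrices.
max_gap = max(
    abs(x - y)
    for row_a, row_b in zip(gram_a, gram_b)
    for x, y in zip(row_a, row_b)
)
print("coordinates differ:", features_a != features_b)
print("max gap between similarity matrices:", max_gap)
```

The coordinates of the two feature sets differ, yet their similarity matrices agree up to floating-point error, which is the kind of invariant structure a cross-model comparison could look for.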

Our work will address fundamental challenges in mechanistic interpretability by establishing principled methods for leveraging geometric feature universality. Efficient feature transfer is critical for AI safety because it enables:

Rapid safety evaluation of new models without restarting interpretability analysis from scratch
Early detection of hazardous capabilities by comparing feature spaces to known dangerous configurations
Reliable monitoring across deployment contexts by tracking feature drift
Scalable oversight of large model families where per-model analysis becomes infeasible

During the project we will characterize the geometric patterns that remain stable across models and develop geometric transformation methods that reliably map feature correspondences between them, validating both against synthetic ground truth. We will further demonstrate that safety-relevant interventions transfer within a model family and across fine-tuned variants of a model. This research will accelerate mechanistic interpretability and enable efficient safety analysis as AI systems grow in capability and complexity.
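Validation against synthetic ground truth could, for example, score how well a set of learned feature directions recovers known ones. The sketch below uses mean max cosine similarity, a score often used for this purpose in the SAE literature; it is not necessarily the metric this project adopts. The data are synthetic stand-ins: the "learned" features are shuffled, lightly perturbed copies of the ground truth, and the names `mean_max_cosine` and `best_match` are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_max_cosine(ground_truth, learned):
    """Average, over ground-truth features, of the best cosine match among learned features."""
    best = [max(cosine(g, f) for f in learned) for g in ground_truth]
    return sum(best) / len(best)

def best_match(ground_truth, learned):
    """For each ground-truth feature, the index of its most similar learned feature."""
    return [
        max(range(len(learned)), key=lambda j: cosine(g, learned[j]))
        for g in ground_truth
    ]

rng = random.Random(0)
dim, n_features = 16, 5

# Synthetic ground-truth feature directions (random toy vectors).
ground_truth = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_features)]

# Stand-in for learned SAE directions: ground truth in a shuffled order plus small noise.
order = [3, 0, 4, 1, 2]
learned = [[x + rng.gauss(0, 0.05) for x in ground_truth[i]] for i in order]

score = mean_max_cosine(ground_truth, learned)
matches = best_match(ground_truth, learned)
print("mean max cosine similarity:", round(score, 4))
print("ground-truth index -> learned index:", matches)
```

A score near 1 means every ground-truth feature has a close counterpart among the learned features, and the match list gives the feature correspondence that a transfer method would need to reproduce.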

Theory of Change


Activities	Develop synthetic benchmarks to test SAE transfer and the presence of invariant feature structures
Outputs	Protocols that enable reliable cross-model interpretability transfer based on invariant AI safety-related features and circuits
Outcomes	Enable interpretations and safety interventions developed on one model to reliably transfer to other models in the same family
Impact	AI safety becomes scalable without requiring complete re-interpretation and SAE generation for each new model, fine-tuned variant, or model family member
Key Assumption	Geometric feature similarity is both necessary and sufficient for interpretability transfer

