A system that takes any large language model, learns which parts of its attention mechanism matter for a specific task, removes the parts that don't, and quantizes the rest at variable precision — producing a model that runs on consumer hardware where it normally wouldn't fit.
Proven: Qwen2.5-Coder-32B-Instruct (64 layers, 40 Q-heads, 8 KV-heads) compressed to 25 Q-heads / 5 KV-heads with Q3_K_S quantization. Runs locally on M1 Pro 32GB at 5.3 tok/s generating correct Python code. A 32-billion parameter coding model on a laptop.
Large models are better than small models. But they don't fit on consumer devices. The current options are:
- Use a smaller model — sacrifice intelligence
- Use an API — sacrifice privacy, cost money, require internet
- Uniform quantization — squeeze the whole model equally, lose quality everywhere
This pipeline adds a fourth option:
- Adaptive compression — learn what matters, keep precision where it counts, remove what doesn't contribute. Same memory budget, better model. Or same quality, smaller model.
The key insight: attention heads specialize. In a 32B model with 40 query heads grouped into 8 KV groups, not all groups contribute equally to every task. For coding, some groups handle syntax, others handle variable tracking, others handle natural language reasoning. By fine-tuning briefly on coding data and capturing gradient flow through each head, we learn which groups are essential and which are expendable.
This isn't theoretical. We proved it works:
- Pruned 3 of 8 KV groups (37.5% of the KV cache)
- Quantized to 3.5 bits per weight
- Model still generates correct `is_prime`, `binary_search`, `fibonacci`, `flatten_list`, and `reverse_string` implementations
- Runs locally on a $2000 laptop with 32GB RAM
Every compression decision trades between three dimensions:
QUALITY
/\
/ \
/ \
/ sweet \
/ spot \
/____________\
SIZE ----------- SPEED
- Quality: Does the model still produce correct, coherent output?
- Size: Does it fit in the target device's memory?
- Speed: How many tokens per second?
Uniform quantization moves you along one axis. Adaptive compression lets you move diagonally — same size but better quality, because precision is concentrated where utilization is high.
| Quant | Size | Fits? | Speed | Quality | Status |
|---|---|---|---|---|---|
| BF16 (original) | 62 GB | No | — | Reference | Too large |
| Q4_K_M | 18.8 GB | No (OOM) | — | Excellent (A100) | Proven coherent on server |
| Q3_K_S | 12.9 GB | Yes | 5.3 tok/s | Good (some repetition) | Working locally |
| Q2_K | 11.0 GB | Yes | — | NaN | Too aggressive |
The gap between Q3_K_S and Q4_K_M is exactly where adaptive mixed quantization adds value: give important layers Q5_K precision and unimportant layers Q3_K, achieving Q4_K_M-level quality at Q3_K_S-level size.
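The arithmetic behind that claim is just a weighted average of bits-per-weight. A minimal sketch (the function name is illustrative, and the bpw figures are approximate k-quant values, not exact GGUF sizes):

```rust
/// Effective bits-per-weight of a mixed-precision assignment.
/// `parts` pairs each tensor group's parameter count with the
/// approximate bits-per-weight of its assigned quant type.
fn effective_bpw(parts: &[(u64, f64)]) -> f64 {
    let total_bits: f64 = parts.iter().map(|(n, b)| *n as f64 * *b).sum();
    let total_params: u64 = parts.iter().map(|(n, _)| n).sum();
    total_bits / total_params as f64
}
```

For example, 70% of parameters at ~3.5 bpw and 30% at ~5.5 bpw averages 4.1 bpw — roughly Q4_K_M territory, while staying well under a uniform Q5 budget.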
Five stages, all Rust except the LoRA training step (Python on remote GPU). Each stage is a pure function with defined inputs and outputs.
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ SCORE │───▶│ PLAN │───▶│ COMPRESS │───▶│ VERIFY │───▶│ INFER │
│ (remote) │ │ (Rust) │ │ (Rust) │ │ (Rust) │ │ (Metal) │
└──────────┘ └──────────┘ └───────────┘ └──────────┘ └──────────┘
LoRA train Utilization Prune heads Check NaN Generate
+ gradient → recipe: + mixed quant + budget tokens
capture per-tensor → GGUF file + coherence locally
quant levels
Where: Remote GPU (RunPod, 5090, reticulum grid)
What: LoRA fine-tuning on the target dataset with gradient capture
Output: gate_gradients.json — per-head utilization scores
The model is briefly fine-tuned (1-3 epochs) on task-specific data. During training, a callback captures gradient magnitudes flowing through each attention head's gate projection. Heads with consistently high gradients are critical for the task; heads with near-zero gradients are expendable.
This already works via peft-train.py with GateGradientCallback. The Academy pipeline (TeacherPipeline) runs this as part of its training loop.
Future: Sentinel orchestrates this step, farming it to available GPU nodes. The Reticulum can distribute scoring across multiple machines, each training on different dataset slices and merging utilization maps.
Where: Local (Rust, no GPU needed)
What: Turn utilization scores + device spec into a CompressionRecipe
Output: CompressionRecipe — complete specification of what to prune and how to quantize
The planner takes three inputs:
- Utilization scores from Stage 1
- A device specification (e.g., "MacBook Air 16GB", "MacBook Pro 32GB", "RTX 5090 24GB")
- The base model's architecture config
And produces a recipe specifying:
- Which KV groups to prune (utilization below threshold)
- Per-tensor quantization type (high-util layers → Q5_K/Q6_K, low-util → Q3_K)
- Dimension padding for block alignment
- Memory budget breakdown
pub fn plan_compression(
utilization: &UtilizationData,
device: &DeviceSpec,
arch: &ModelArchConfig,
) -> Result<CompressionRecipe, String>

Device presets:
- `DeviceSpec::macbook_air_16gb()` — 11 GB effective budget
- `DeviceSpec::macbook_pro_32gb()` — 24 GB effective budget
- `DeviceSpec::rtx_5090_24gb()` — 22 GB effective VRAM budget
- `DeviceSpec::from_memory_gb(total)` — auto-compute reserves
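The KV-group selection at the heart of the planner can be as simple as keeping the top-k groups by utilization — a hedged sketch (function name hypothetical; the real planner also weighs the device budget when choosing k):

```rust
/// Decide which KV groups to prune: keep the `keep` highest-utilization
/// groups, prune the rest. Returns indices of pruned groups, sorted.
fn groups_to_prune(utilization: &[f64], keep: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..utilization.len()).collect();
    // Sort ascending by score so the lowest-utilization groups come first.
    idx.sort_by(|&a, &b| utilization[a].partial_cmp(&utilization[b]).unwrap());
    let n_prune = utilization.len().saturating_sub(keep);
    let mut pruned: Vec<usize> = idx.into_iter().take(n_prune).collect();
    pruned.sort();
    pruned
}
```

With 8 groups and `keep = 5`, this reproduces the 3-of-8 prune from the proof run, dropping whichever three groups scored lowest.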
Quantization floor: Q3_K_S is the practical minimum. Q2_K produces NaN for compacted models — the combination of head pruning and extreme quantization destroys too much precision.
Where: Local (Rust, CPU-only)
What: Read safetensors, apply recipe, write GGUF
Output: Single GGUF file with mixed quantization + custom metadata
This is the core new capability. Our own GGUF writer that:
- Reads each tensor from the base model's safetensors
- For attention projections: slices out pruned heads, pads dimensions to block alignment
- Quantizes each tensor at the recipe's assigned level using candle's `GgmlDType::from_float()`
- Writes the GGUF file with standard + custom metadata
Why our own writer: llama.cpp's quantizer applies uniform quantization and rejects non-standard tensor dimensions from pruned models (e.g., ncols=3200 not divisible by 256). Our writer handles variable dimensions and per-tensor quant levels.
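The padding rule behind "handles variable dimensions" is plain arithmetic: round each affected dimension up to the k-quant super-block size of 256 and zero-fill the slack. A sketch (the function name is illustrative):

```rust
/// Round a tensor dimension up to the next multiple of the quantization
/// block size. k-quants use 256-element super-blocks, so a pruned
/// dimension like ncols=3200 must be padded before quantizing.
fn pad_to_block(dim: usize, block: usize) -> usize {
    ((dim + block - 1) / block) * block
}
```

The rejected example above pads from 3200 to 3328 (13 blocks of 256); the 128 extra columns carry zeros and are sliced away again at load time.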
Custom GGUF metadata (readable by our inference engine, ignored by others):
- `continuum.compression_recipe` — JSON string of the full recipe
- `continuum.per_layer_head_counts` — array of Q head counts per layer
- `continuum.per_layer_kv_head_counts` — array of KV head counts per layer
- `continuum.utilization_scores` — array of per-group utilization scores
Where: Local (Rust)
What: Validate the compressed GGUF before deployment
Output: Pass/fail + quality report
Checks:
- Load metadata, confirm dimensions match recipe
- Dequantize sample layers, check for NaN/Inf
- Actual file size vs budget target
- Short inference test (3-5 tokens) — does it produce coherent text?
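The NaN/Inf scan is the cheapest and highest-value of these checks, given that Q2_K failures surface exactly this way. A sketch of what the verifier might run over each dequantized sample layer (illustrative, not the actual validation code):

```rust
/// Verify a dequantized tensor: fail on the first NaN or Inf value.
/// `is_finite()` rejects both, so one pass catches either failure mode.
fn check_finite(values: &[f32]) -> Result<(), String> {
    for (i, v) in values.iter().enumerate() {
        if !v.is_finite() {
            return Err(format!("non-finite value {v} at index {i}"));
        }
    }
    Ok(())
}
```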
Where: Local (Metal/CUDA/CPU via Candle)
What: Run the compressed model
Output: Tokens
The existing LlamaGgufBackend in quantized_llama.rs already handles GGUF inference. Extended to:
- Read `continuum.per_layer_head_counts` from custom metadata
- Use per-layer head counts in the attention reshape (instead of a global uniform count)
- Derive `head_dim` from tensor shapes rather than metadata formulas
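Shape-based derivation avoids trusting metadata that went stale when heads were pruned: q_proj has `n_q_heads * head_dim` output rows, so `head_dim` falls out of a division. A sketch (signature hypothetical):

```rust
/// Derive head_dim from the q_proj weight shape rather than metadata.
/// q_proj is [n_q_heads * head_dim, hidden_size], so the row count
/// must divide evenly by the (possibly pruned) head count.
fn derive_head_dim(q_proj_rows: usize, n_q_heads: usize) -> Result<usize, String> {
    if n_q_heads == 0 || q_proj_rows % n_q_heads != 0 {
        return Err(format!(
            "q_proj rows {q_proj_rows} not divisible by {n_q_heads} heads"
        ));
    }
    Ok(q_proj_rows / n_q_heads)
}
```

For the compacted model above, a layer keeping 25 Q-heads with a 3200-row q_proj yields head_dim = 128.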
Central data structure that drives everything:
pub struct CompressionRecipe {
/// What heads to keep/prune, per-head precision tiers
pub topology: HeadTopology,
/// Per-tensor quantization: tensor name pattern → GGUF quant type
pub tensor_quant_map: Vec<TensorQuantAssignment>,
/// Target device that drove the budget
pub device_spec: DeviceSpec,
/// Memory budget breakdown
pub budget: MemoryBudget,
}
pub struct TensorQuantAssignment {
/// Glob pattern: "model.layers.*.self_attn.q_proj.weight"
pub pattern: String,
/// GGUF quant type: Q3_K_S, Q4_K_M, Q5_K_S, Q6_K, etc.
pub quant_type: GgufQuantType,
/// Why this assignment (for debugging/reports)
pub reason: String,
}

The recipe is:
- Serializable (JSON) — can be stored, versioned, shared
- Deterministic — same inputs always produce the same recipe
- Device-aware — auto-fits to the target memory budget
- Auditable — every assignment has a reason string
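Resolving a tensor name against `tensor_quant_map` is first-match-wins over the glob patterns. A minimal star-only glob sketch — the real matcher may differ; this just shows the shape of the lookup:

```rust
/// Minimal glob: `*` matches any run of characters; everything else is
/// literal. A tensor name resolves to the first pattern that matches.
fn matches_glob(pattern: &str, name: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        // No wildcard: exact match required.
        return pattern == name;
    }
    let mut rest = name;
    for (i, part) in parts.iter().copied().enumerate() {
        if i == 0 {
            // Leading literal must anchor at the start of the name.
            match rest.strip_prefix(part) {
                Some(r) => rest = r,
                None => return false,
            }
        } else if i == parts.len() - 1 {
            // Trailing literal must anchor at the end.
            return rest.ends_with(part);
        } else {
            // Middle literals may float; consume up to and past them.
            match rest.find(part) {
                Some(pos) => rest = &rest[pos + part.len()..],
                None => return false,
            }
        }
    }
    unreachable!("patterns with a wildcard return inside the loop")
}
```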
# Compress a model for a specific device target
./jtag plasticity/compress \
--capturePath=/path/to/gate_gradients.json \
--modelPath=Qwen/Qwen2.5-Coder-32B-Instruct \
--deviceSpec=32gb \
--outputPath=~/.continuum/genome/models/qwen32b-coding.gguf
# Or as a Sentinel pipeline step
- type: Command
command: plasticity/compress
params:
capturePath: "{{steps.train.data.outputDir}}"
modelPath: "{{input.baseModelPath}}"
deviceSpec: "{{input.targetDevice}}"

Uniform Q3_K_S: every tensor at 3.5 bits. Wastes precision on unimportant layers, starves important ones.
Mixed quantization with the same total size:
| Layer | Utilization | Uniform | Mixed |
|---|---|---|---|
| 0-5 (early) | Medium | Q3_K_S | Q4_K_S |
| 6-20 (mid, high-util) | High | Q3_K_S | Q5_K_M |
| 21-50 (mid, low-util) | Low | Q3_K_S | Q3_K_S |
| 51-63 (late) | High | Q3_K_S | Q5_K_S |
| embed_tokens | Critical | Q3_K_S | Q6_K |
| lm_head | Critical | Q3_K_S | Q6_K |
Same file size. Better quality. The important layers (which drive code correctness) get 5-6 bits. The unimportant layers (which contribute less) get 3 bits. The embeddings (where token identity lives) get maximum affordable precision.
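The tier assignment in the table reduces to a threshold map from utilization to quant type — a sketch with illustrative cutoffs (the actual planner's thresholds and tier set may differ):

```rust
/// Map a layer's normalized utilization score to a quant type,
/// mirroring the tiers in the assignment table. Thresholds are
/// illustrative, not the planner's actual values.
fn quant_for_utilization(score: f64) -> &'static str {
    if score >= 0.7 {
        "Q5_K" // high-utilization layers keep 5+ bits
    } else if score >= 0.4 {
        "Q4_K" // medium layers get the uniform-Q4 treatment
    } else {
        "Q3_K" // low-utilization layers absorb the compression
    }
}
```

Critical tensors like `embed_tokens` and `lm_head` would bypass this map and be pinned to Q6_K regardless of score.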
The pipeline naturally separates GPU work (scoring) from CPU work (planning, compression, verification). This enables:
- RunPod: On-demand GPU for scoring large models
- RTX 5090: Local GPU for scoring smaller models or incremental re-scoring
- Reticulum: Distribute scoring across multiple nodes, each training on different data slices, merge utilization maps
- Sentinel: Orchestrate the whole pipeline — provision GPU, run training, collect scores, compress locally, deploy
The compression step itself (Stage 3) is CPU-only and runs on any machine. A MacBook Air can compress a 70B model if it has enough disk space — it processes one tensor at a time, never loading the full model into memory.
The same pipeline works for any task:
- Coding: Train on code, prune heads that don't help with syntax/logic
- Creative writing: Train on fiction, prune heads that specialize in formal/technical language
- Translation: Train on bilingual data, prune heads that specialize in languages you don't need
- Domain expertise: Train on medical/legal/scientific text, prune generalist heads
The utilization scores are dataset-driven. Change the dataset, get a different pruning pattern, get a model optimized for a different task. Same base model, many specialized compressed variants.
This is personalized model compression. Your model, your data, your hardware, your budget.
- Gradient-based utilization scoring (`scoring.rs`)
- Head topology planning (`topology.rs`)
- Tensor compaction / head pruning (`compactor.rs`)
- Candle GGUF inference with Qwen2 support (`quantized_llama.rs`)
- Architecture-aware GGUF metadata (qwen2, llama)
- head_dim derivation for compacted models
- KV cache clear without model reload
- DeviceEmbedding (F16 on Metal)
- Benchmark harness with quality/speed/memory metrics
- Proof of concept: 32B on 32GB MacBook (Q3_K_S, 5.3 tok/s)
- CompressionRecipe type definitions
- Planner: utilization → per-tensor quant assignments
- GGUF writer (our own, not llama.cpp)
- Mixed quantization support
- Per-layer variable head counts in inference
- Pipeline IPC command (`plasticity/compress`)
- Verification stage
- Dimension padding for block alignment
- Sentinel pipeline integration
- Distributed scoring via Reticulum
- Custom sub-3-bit quantization kernels (Ternary/Q2 with custom Metal shaders)
- Per-head mixed quantization within a single tensor (requires custom GGUF tensor types)
- Auto-discovery of optimal compression for a given model+dataset+device triple
src/workers/continuum-core/src/modules/plasticity/
├── mod.rs — IPC routing, handle_command
├── types.rs — HeadTopology, CompressionRecipe, DeviceSpec, GgufQuantType
├── scoring.rs — Per-head gradient utilization scoring
├── topology.rs — Head topology I/O, group selection
├── compactor.rs — Tensor slicing (prune heads from safetensors)
├── quantizer.rs — Block quantization primitives
├── planner.rs — [NEW] Recipe planning from scores + device spec
├── gguf_writer.rs — [NEW] Mixed-quant GGUF writer
├── pipeline.rs — [NEW] End-to-end orchestration
├── validation.rs — Integration tests, GGUF verification
src/workers/continuum-core/src/inference/
├── vendored/quantized_llama.rs — GGUF inference (Qwen2 + variable heads)
├── backends/llama_gguf.rs — LlamaGgufBackend
├── backends/mod.rs — ModelBackend trait, generate()
├── model.rs — Model loading utilities
docs/genome/
├── COMPRESSION-PIPELINE.md — This document
├── plasticity_benchmark_report.json — Benchmark results
├── bench_q3ks.json — Detailed Q3_K_S benchmark data