Idea
MoE models have N experts but only 1-2 activate per token. On memory-constrained devices (MacBook Air 8-16GB), we don't need ALL experts resident. Load only the expert needed for the current task, page others from HuggingFace cache or disk.
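The selection step this idea relies on can be sketched in plain Python — a toy gate with made-up weights, not the real Qwen router, just to show that only the top-k experts' FFNs are needed per token:

```python
import math

def route(hidden: list[float], gate_weights: list[list[float]], top_k: int = 2) -> list[tuple[int, float]]:
    """Toy MoE router: score each expert, softmax, keep the top-k.

    `gate_weights` has one row per expert; a real router is a learned
    linear layer, this only illustrates the selection logic.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]                # only these experts' FFNs run this token

# 8 experts, but only 2 activate for this token:
picks = route([1.0, 0.5], [[i * 0.1, -i * 0.05] for i in range(8)], top_k=2)
print([i for i, _ in picks])
```

Everything outside the top-k is dead weight for that token, which is what makes paging the other N-1 experts plausible.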
Architecture
Resident (always loaded):
- Router/gate network (~100MB) — decides which expert to activate
- Shared attention layers (~2GB) — used by all experts
- Active expert FFN (~1-2GB) — the one currently needed
Paged (on disk/HF cache, loaded on demand):
- Other N-1 expert FFNs (~1-2GB each)
- Loaded when task domain changes (code → reasoning → creative)
- LRU eviction when memory full
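A minimal sketch of the paged side, assuming a hypothetical `load_fn` that stands in for reading one expert's FFN tensors from disk or the HF cache:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache for paged expert weights (sketch, not an implementation)."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity      # max experts resident at once
        self.load_fn = load_fn        # pages an expert in from disk/HF cache
        self._cache = OrderedDict()   # expert_id -> weights, oldest first

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as recently used
            return self._cache[expert_id]
        weights = self.load_fn(expert_id)        # page in on demand
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict least-recently-used
        return weights

loads = []
cache = ExpertCache(capacity=2, load_fn=lambda eid: loads.append(eid) or f"ffn-{eid}")
cache.get(0); cache.get(1); cache.get(0)   # expert 0 is now most recent
cache.get(2)                               # over capacity: evicts expert 1
print(loads)  # [0, 1, 2] — expert 0's reuse was served from cache
```

The same shape should work whether the payload is a LoRA adapter or an expert FFN; only `load_fn` and the eviction cost differ.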
Memory savings
Qwen3.5-35B-A3B has ~8 experts:
- Full model: ~21GB (Q4_K_M GGUF)
- Router + attention + 1 expert: ~4-6GB
- Fits a MacBook Air with headroom left for the OS and other apps
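A quick tally of the component estimates above; the KV-cache/runtime overhead line is my assumption, added to reconcile the per-component figures with the ~4-6GB total:

```python
# Rough resident-set tally (GB, quantized weights), from the estimates above.
router = 0.1                      # gate network, ~100 MB
attention = 2.0                   # shared attention layers
expert = (1.0, 2.0)               # one active expert FFN, low/high estimate
overhead = (1.0, 2.0)             # KV cache + runtime scratch (assumption)

low = router + attention + expert[0] + overhead[0]
high = router + attention + expert[1] + overhead[1]
print(f"~{low:.1f}-{high:.1f} GB resident vs ~21 GB for the full model")
```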
Connection to existing architecture
This IS genome paging (#382) at a deeper level:
- Genome paging: swap LoRA adapters in/out
- Expert paging: swap MoE expert weights in/out
- Same LRU eviction, same on-demand loading, same HF cache
Implementation questions
- Can GGUF files be partially loaded? (load specific tensor groups)
- Does Candle support lazy tensor loading?
- Can we split a GGUF into per-expert shards at publish time?
- What's the latency for paging in a 2GB expert from SSD? (~200ms?)
- Could we pre-predict the next expert needed based on task classification?
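The latency question can at least be bounded from raw sequential-read throughput; the throughput figures below are assumptions, not measurements:

```python
expert_gb = 2.0
# Assumed sequential-read throughput in GB/s (rough, not benchmarked).
throughput = {"SATA SSD": 0.5, "MacBook Air NVMe": 3.0}

page_in_ms = {name: expert_gb / gbps * 1000 for name, gbps in throughput.items()}
for name, ms in page_in_ms.items():
    print(f"{name}: ~{ms:.0f} ms to page in a {expert_gb:.0f} GB expert")
```

Even at NVMe speeds the raw read of a 2GB expert is roughly 670 ms before deserialization, so hitting ~200 ms would likely need smaller per-expert shards, mmap, or the predictive prefetch the last question suggests.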
Why this matters
- MacBook Air becomes viable for the full 35B MoE model
- Each device only stores experts it uses frequently
- Rare experts live on HF, fetched when needed
- The `-code-cont` suffix literally means "only the code expert is baked in"
Related
- Genome paging: activateSkill/evictLRU not wired end-to-end #382 (genome paging — same architecture, higher level)
- Evaluate Qwen3.5-35B-A3B as local inference model — Opus reasoning distilled, 3B active #417 (Qwen3.5 evaluation)
- 5090 tower: install Unsloth + verify MoE LoRA training works #430 (MoE training)
- Plasticity compaction — prune unused experts permanently for deployment