Idea
MoE models have N experts but only 1-2 activate per token. On memory-constrained devices (MacBook Air 8-16GB), we don't need ALL experts resident. Load only the expert needed for the current task, page others from HuggingFace cache or disk.
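The selection step this idea relies on can be sketched in plain Python — a toy gate with made-up weights, not the real Qwen router, just to show that only the top-k experts' FFNs are needed per token:

```python
import math

def route(hidden: list[float], gate_weights: list[list[float]], top_k: int = 2) -> list[tuple[int, float]]:
    """Toy MoE router: score each expert, softmax, keep the top-k.

    `gate_weights` has one row per expert; a real router is a learned
    linear layer, this only illustrates the selection logic.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]                # only these experts' FFNs run this token

# 8 experts, but only 2 activate for this token:
picks = route([1.0, 0.5], [[i * 0.1, -i * 0.05] for i in range(8)], top_k=2)
print([i for i, _ in picks])
```

Everything outside the top-k is dead weight for that token, which is what makes paging the other N-1 experts plausible.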
Architecture
Resident (always loaded):
- Router/gate network (~100MB) — decides which expert to activate
- Shared attention layers (~2GB) — used by all experts
- Active expert FFN (~1-2GB) — the one currently needed
Paged (on disk/HF cache, loaded on demand):
- Other N-1 expert FFNs (~1-2GB each)
- Loaded when task domain changes (code → reasoning → creative)
- LRU eviction when memory full
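A minimal sketch of the paged side, assuming a hypothetical `load_fn` that stands in for reading one expert's FFN tensors from disk or the HF cache:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache for paged expert weights (sketch, not an implementation)."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity      # max experts resident at once
        self.load_fn = load_fn        # pages an expert in from disk/HF cache
        self._cache = OrderedDict()   # expert_id -> weights, oldest first

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as recently used
            return self._cache[expert_id]
        weights = self.load_fn(expert_id)        # page in on demand
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict least-recently-used
        return weights

loads = []
cache = ExpertCache(capacity=2, load_fn=lambda eid: loads.append(eid) or f"ffn-{eid}")
cache.get(0); cache.get(1); cache.get(0)   # expert 0 is now most recent
cache.get(2)                               # over capacity: evicts expert 1
print(loads)  # [0, 1, 2] — expert 0's reuse was served from cache
```

The same shape should work whether the payload is a LoRA adapter or an expert FFN; only `load_fn` and the eviction cost differ.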
Memory savings
Qwen3.5-35B-A3B has ~8 experts:
- Full model: ~21GB (Q4_K_M GGUF)
- Router + attention + 1 expert: ~4-6GB
- Fits a MacBook Air with headroom left for the OS and other apps
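A quick tally of the component estimates above; the KV-cache/runtime overhead line is my assumption, added to reconcile the per-component figures with the ~4-6GB total:

```python
# Rough resident-set tally (GB, quantized weights), from the estimates above.
router = 0.1                      # gate network, ~100 MB
attention = 2.0                   # shared attention layers
expert = (1.0, 2.0)               # one active expert FFN, low/high estimate
overhead = (1.0, 2.0)             # KV cache + runtime scratch (assumption)

low = router + attention + expert[0] + overhead[0]
high = router + attention + expert[1] + overhead[1]
print(f"~{low:.1f}-{high:.1f} GB resident vs ~21 GB for the full model")
```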
Connection to existing architecture
This IS genome paging (#382) at a deeper level:
- Genome paging: swap LoRA adapters in/out
- Expert paging: swap MoE expert weights in/out
- Same LRU eviction, same on-demand loading, same HF cache
Implementation questions
- Can GGUF files be partially loaded? (load specific tensor groups)
- Does Candle support lazy tensor loading?
- Can we split a GGUF into per-expert shards at publish time?
- What's the latency for paging in a 2GB expert from SSD? (~200ms?)
- Could we pre-predict the next expert needed based on task classification?
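The latency question can at least be bounded from raw sequential-read throughput; the throughput figures below are assumptions, not measurements:

```python
expert_gb = 2.0
# Assumed sequential-read throughput in GB/s (rough, not benchmarked).
throughput = {"SATA SSD": 0.5, "MacBook Air NVMe": 3.0}

page_in_ms = {name: expert_gb / gbps * 1000 for name, gbps in throughput.items()}
for name, ms in page_in_ms.items():
    print(f"{name}: ~{ms:.0f} ms to page in a {expert_gb:.0f} GB expert")
```

Even at NVMe speeds the raw read of a 2GB expert is roughly 670 ms before deserialization, so hitting ~200 ms would likely need smaller per-expert shards, mmap, or the predictive prefetch the last question suggests.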
Why this matters
- MacBook Air becomes viable for the full 35B MoE model
- Each device only stores experts it uses frequently
- Rare experts live on HF, fetched when needed
- The `-code-cont` suffix literally means "only the code expert is baked in"
Related
- Genome paging: activateSkill/evictLRU not wired end-to-end #382 (genome paging — same architecture, higher level)
- Evaluate Qwen3.5-35B-A3B as local inference model — Opus reasoning distilled, 3B active #417 (Qwen3.5 evaluation)
- 5090 tower: install Unsloth + verify MoE LoRA training works #430 (MoE training)
- Plasticity compaction — prune unused experts permanently for deployment