MoE expert paging: load only the needed expert on demand, page rest from HF cache #433

@joelteply

Idea

MoE models have N experts, but only 1-2 activate per token. On memory-constrained devices (e.g. a MacBook Air with 8-16GB), not ALL experts need to be resident. Load only the expert needed for the current task, and page the others in from the HuggingFace cache or disk.

Architecture

Resident (always loaded):
  - Router/gate network (~100MB) — decides which expert to activate
  - Shared attention layers (~2GB) — used by all experts
  - Active expert FFN (~1-2GB) — the one currently needed

Paged (on disk/HF cache, loaded on demand):
  - Other N-1 expert FFNs (~1-2GB each)
  - Loaded when task domain changes (code → reasoning → creative)
  - LRU eviction when memory full
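
The resident/paged split above could be sketched as a byte-budgeted LRU cache. This is a minimal illustration, not a proposed implementation: `ExpertId`, `ExpertWeights`, `ExpertCache`, and the `loader` closure are hypothetical names, and the weights are a plain byte vector standing in for real tensors.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical identifier for one expert FFN.
type ExpertId = usize;

/// Placeholder for expert weights; a real implementation would hold tensors.
struct ExpertWeights {
    bytes: Vec<u8>,
}

/// LRU cache of resident experts, bounded by a byte budget.
struct ExpertCache<F: Fn(ExpertId) -> ExpertWeights> {
    capacity_bytes: usize,
    used_bytes: usize,
    resident: HashMap<ExpertId, ExpertWeights>,
    /// Front = least recently used, back = most recently used.
    order: VecDeque<ExpertId>,
    /// Assumed to page an expert in from disk or the HF cache.
    loader: F,
}

impl<F: Fn(ExpertId) -> ExpertWeights> ExpertCache<F> {
    fn new(capacity_bytes: usize, loader: F) -> Self {
        Self {
            capacity_bytes,
            used_bytes: 0,
            resident: HashMap::new(),
            order: VecDeque::new(),
            loader,
        }
    }

    /// Returns the expert, paging it in (and evicting others) on a miss.
    fn get(&mut self, id: ExpertId) -> &ExpertWeights {
        if self.resident.contains_key(&id) {
            // Hit: drop the old position so the id can be re-queued as most recent.
            self.order.retain(|&e| e != id);
        } else {
            let weights = (self.loader)(id);
            // Evict least recently used experts until the new one fits.
            while self.used_bytes + weights.bytes.len() > self.capacity_bytes {
                match self.order.pop_front() {
                    Some(victim) => {
                        if let Some(old) = self.resident.remove(&victim) {
                            self.used_bytes -= old.bytes.len();
                        }
                    }
                    // Nothing left to evict: keep the expert even if it exceeds the budget.
                    None => break,
                }
            }
            self.used_bytes += weights.bytes.len();
            self.resident.insert(id, weights);
        }
        self.order.push_back(id);
        &self.resident[&id]
    }

    fn is_resident(&self, id: ExpertId) -> bool {
        self.resident.contains_key(&id)
    }
}

fn main() {
    // Fake loader: every expert is 4 bytes, and the budget fits two experts.
    let mut cache = ExpertCache::new(8, |id: ExpertId| ExpertWeights {
        bytes: vec![id as u8; 4],
    });
    cache.get(0);
    cache.get(1);
    cache.get(0); // expert 0 becomes the most recently used
    cache.get(2); // expert 1 is now the LRU entry and gets evicted
    println!("expert 1 resident: {}", cache.is_resident(1));
}
```

The router and shared attention layers would sit outside this cache, since they are always resident. Only the expert FFNs compete for the budget.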

Memory savings

Qwen3.5-35B-A3B has ~8 experts:

  • Full model: ~21GB (Q4_K_M GGUF)
  • Router + attention + 1 expert: ~4-6GB
  • Fits MacBook Air with room for the system
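
As a rough check on these numbers, the resident footprint can be added up from the per-component estimates in the Architecture section. The `Budget` struct below is a hypothetical helper, and its inputs are this issue's rough estimates, not measurements.

```rust
/// Component sizes in GB, taken from the rough estimates in this issue.
struct Budget {
    router_gb: f64,
    attention_gb: f64,
    expert_gb: f64,
}

impl Budget {
    /// Memory needed when `resident_experts` expert FFNs are kept loaded.
    fn resident_gb(&self, resident_experts: u32) -> f64 {
        self.router_gb + self.attention_gb + self.expert_gb * resident_experts as f64
    }
}

fn main() {
    // The issue gives ~1-2GB per expert, so evaluate both ends of the range.
    for expert_gb in [1.0, 2.0] {
        let b = Budget { router_gb: 0.1, attention_gb: 2.0, expert_gb };
        println!(
            "expert = {:.1} GB -> resident = {:.1} GB",
            expert_gb,
            b.resident_gb(1)
        );
    }
}
```

With one resident expert, the component estimates sum to about 3.1-4.1GB, which is somewhat below the ~4-6GB quoted above. The difference may be headroom for the KV cache and runtime buffers, but that is an assumption; the issue does not break it down.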

Connection to existing architecture

This IS genome paging (#382) at a deeper level:

  • Genome paging: swap LoRA adapters in/out
  • Expert paging: swap MoE expert weights in/out
  • Same LRU eviction, same on-demand loading, same HF cache

Implementation questions

  1. Can GGUF files be partially loaded? (load specific tensor groups)
  2. Does Candle support lazy tensor loading?
  3. Can we split a GGUF into per-expert shards at publish time?
  4. What's the latency for paging in a 2GB expert from SSD? (~200ms?)
  5. Could we pre-predict the next expert needed based on task classification?
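
For questions 1 and 4, here is a minimal sketch using only the Rust standard library. It assumes the byte offset and length of an expert's tensor data are already known. GGUF records per-tensor offsets in its header, but parsing that header and resolving the offsets to absolute file positions is not shown here. `TensorSpan`, `read_span`, and `estimate_page_in_ms` are hypothetical names, and the demo file is a stand-in, not a real GGUF.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical index entry: where one expert's bytes live in the file.
struct TensorSpan {
    offset: u64,
    len: usize,
}

/// Reads only the requested byte range instead of loading the whole file.
fn read_span(path: &Path, span: &TensorSpan) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(span.offset))?;
    let mut buf = vec![0u8; span.len];
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Back-of-envelope page-in time for a given sustained read bandwidth.
fn estimate_page_in_ms(bytes: u64, read_bytes_per_sec: u64) -> f64 {
    bytes as f64 / read_bytes_per_sec as f64 * 1000.0
}

fn main() -> io::Result<()> {
    // Stand-in file: a 6-byte "header" followed by two 7-byte "experts".
    let path = std::env::temp_dir().join("expert_paging_demo.bin");
    std::fs::write(&path, b"HEADERexpert0expert1")?;

    // Fetch only the second expert (bytes 13..20).
    let bytes = read_span(&path, &TensorSpan { offset: 13, len: 7 })?;
    println!("{}", String::from_utf8_lossy(&bytes));

    // Assumed bandwidths, not measurements: 3 GB/s and 10 GB/s sustained reads.
    for bw in [3_000_000_000u64, 10_000_000_000] {
        println!(
            "2 GB at {} GB/s: {:.0} ms",
            bw / 1_000_000_000,
            estimate_page_in_ms(2_000_000_000, bw)
        );
    }
    Ok(())
}
```

By this arithmetic, paging in a 2GB expert in ~200ms would need roughly 10 GB/s of sustained read bandwidth. At an assumed 3 GB/s the same read takes closer to 670ms, so the latency guess in question 4 should be measured on the target hardware.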

Why this matters

  • MacBook Air becomes viable for the full 35B MoE model
  • Each device only stores experts it uses frequently
  • Rare experts live on HF, fetched when needed
  • The -code-cont suffix literally means "only the code expert is baked in"
