MoE surgery: extract individual experts for targeted training + tiny deployment #439

@joelteply

Description

The Vision

Take a 35B MoE model. Extract just the code expert (~4-5B). Fine-tune it. Publish it.
OpenCode/VSCode users get a 4B model with the coding depth of a 35B.

Why This Is Huge

MoE models have N experts, but users typically need 1-2. A coder needs the code expert.
A reasoning tool needs the reasoning expert. A creative writer needs the creative expert.

Current reality: you download all 35B of weights, but only ~3B are active for any given token. Wasteful.
With MoE surgery: you download ~4B and get 35B-quality in your domain.

Technical Approach

  1. Load the full 35B model (needs >32 GB of memory; use CPU offload or multi-GPU)
  2. Identify experts by domain — the router/gate network decides which expert activates for which tokens. Run coding prompts through the model, track which expert(s) activate most for code tokens
  3. Extract expert weights — save just the router + shared attention + target expert(s)
  4. Convert to dense — remove the MoE routing, make it a standard dense model
  5. Fine-tune the extracted expert — LoRA on top for the Continuum tool system
  6. Publish continuum-ai/qwen3.5-4b-code-cont (extracted from the 35B MoE's code expert)
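Steps 2–4 can be sketched end-to-end on a toy layer. This is a minimal NumPy illustration, not the real model: the dimensions, single MoE layer, ReLU experts, and top-1 routing are all simplifying assumptions, and a real extraction would walk every MoE block in the checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_EXPERTS = 16, 32, 8  # hidden dim, FFN dim, expert count (toy sizes)

# Toy MoE FFN layer: a router plus N independent expert MLPs.
W_router = rng.normal(size=(D, N_EXPERTS))
experts = [
    {"w_in": rng.normal(size=(D, H)), "w_out": rng.normal(size=(H, D))}
    for _ in range(N_EXPERTS)
]

def route(tokens):
    """Top-1 routing: index of the expert that fires for each token."""
    return (tokens @ W_router).argmax(axis=-1)

# Step 2: push domain activations through the router, count which
# expert fires most (stand-in for hidden states from coding prompts).
code_tokens = rng.normal(size=(512, D))
counts = np.bincount(route(code_tokens), minlength=N_EXPERTS)
target = int(counts.argmax())

# Steps 3-4: keep only the dominant expert's weights and drop the
# router entirely -- the layer is now a standard dense FFN.
dense = experts[target]

def dense_ffn(tokens):
    return np.maximum(tokens @ dense["w_in"], 0.0) @ dense["w_out"]

print("activation counts:", counts, "-> extracting expert", target)
print("dense output shape:", dense_ffn(code_tokens[:4]).shape)
```

The same counting pass also reveals when two experts share the load for a domain, which is when the 2-expert extraction in the table below makes sense.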

What Makes This Novel

  • Nobody publishes individual extracted experts from MoE models
  • The naming convention tells users exactly what they're getting
  • Plasticity compaction on top → even smaller
  • Candle can load the extracted dense model (no MoE support needed)

Size Estimates

| Config | Params | GGUF Q4_K_M | Fits MacBook Air? |
|---|---|---|---|
| Full MoE | 35B | 21.2 GB | No |
| 1 expert extracted | ~4-5B | ~3 GB | Yes |
| 2 experts extracted | ~8-9B | ~6 GB | Yes |
| Compacted 1 expert | ~2-3B | ~2 GB | Yes (phone?) |
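The table rows follow from a single rate: 21.2 GB for 35B params implies Q4_K_M averages ~4.85 bits per parameter. Treat that rate as an assumption back-derived from the first row, not an official figure:

```python
# Assumed average quantization rate, derived from 21.2 GB / 35B params.
BITS_PER_PARAM = 4.85

def q4km_gb(params_billion):
    """Estimated GGUF Q4_K_M file size in GB for a given param count."""
    return params_billion * 1e9 * BITS_PER_PARAM / 8 / 1e9

for name, p in [("Full MoE", 35), ("1 expert", 4.5),
                ("2 experts", 8.5), ("Compacted", 2.5)]:
    print(f"{name}: ~{q4km_gb(p):.1f} GB")
# Full MoE lands at ~21.2 GB, matching the table's first row.
```

Actual GGUF sizes vary slightly with tensor shapes, since Q4_K_M keeps some tensors at higher precision.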
