johnjg75dev/u-moec-tiny

u-moec-tiny: Universal Mixture-of-Experts Compressor

Lossless neural compression via predictive modeling + rANS entropy coding.

Raw Bytes -> Modality Adapter -> Byte Patches -> Hybrid Backbone (Mamba-2 + MLA + Fine-Grained MoE + MoD + AdaLN-Zero) -> rANS
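
As a rough illustration of the "Byte Patches" stage, the sketch below splits a byte string into fixed-size patches and reassembles it. The function names, the zero-padding policy, and the need to carry the original length alongside the patches are assumptions made for this example; the repo's modality adapter and patcher are not shown here and may work differently.

```python
def to_patches(data: bytes, patch_size: int = 16, pad: int = 0) -> list[bytes]:
    """Split data into fixed-size patches, padding the final patch (hypothetical helper)."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    patches = []
    for start in range(0, len(data), patch_size):
        chunk = data[start:start + patch_size]
        # Pad the tail so every patch has exactly patch_size bytes.
        chunk += bytes([pad]) * (patch_size - len(chunk))
        patches.append(chunk)
    return patches


def from_patches(patches: list[bytes], original_len: int) -> bytes:
    """Concatenate patches and drop the padding (needs the original length)."""
    return b"".join(patches)[:original_len]


data = b"hello, universal compressor!"  # 28 bytes
for size in (4, 16, 64):  # the patch sizes named in this README
    patches = to_patches(data, size)
    assert from_patches(patches, len(data)) == data
    print(size, len(patches))
```

Because the final patch is padded, a lossless pipeline built this way has to record the original length somewhere in the container so the padding can be stripped on decode.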

This repo scaffolds the implementation, starting from Tier 1 (u-moec-micro), a model with ~20M active parameters that fits on a single GPU and roundtrips through the full stack in under 24 hours. Tier 2 and Tier 3 configs are included; their model code is stubbed, but the interfaces are stable.

Layout

u-moec-tiny/
├── README.md                  <- you are here
├── pyproject.toml             <- uv/pip install
├── configs/                   <- YAML configs, one per (tier, ablation)
│   ├── tier1_micro.yaml
│   ├── tier2_small.yaml              (multi-scale ON)
│   ├── tier2_small_no_ms.yaml        (ablation: multi-scale OFF)
│   └── tier3_base.yaml
├── umoec/                     <- main package
│   ├── model/                 <- backbone, MoE, MoD, cascade, heads
│   ├── entropy/               <- rANS / tANS coders
│   ├── stats/                 <- global block stats + rolling local stats
│   ├── data/                  <- corpus loaders + sequence packing
│   ├── lora/                  <- online per-expert LoRA adapt for big files
│   ├── train/                 <- training loop, Muon optimizer, loss
│   ├── inference/             <- encoder/decoder loops, profile selection
│   └── eval/                  <- bits-per-byte, lossless roundtrip check
├── bin/                       <- CLI entry points
│   ├── compress.py
│   ├── decompress.py
│   └── train.py
├── scripts/                   <- shell entry points
│   ├── train_tier1.sh
│   ├── train_tier2.sh
│   ├── train_tier3.sh
│   └── eval_enwik8.sh
└── tests/                     <- unit + integration tests
    └── test_roundtrip.py      <- THE invariant: decompress(compress(x)) == x
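
To make the roundtrip invariant concrete, here is a toy rANS coder checked against it. This is a sketch under simplifying assumptions, not the code in umoec/entropy: it uses a static order-0 frequency table taken from the input itself instead of a neural predictor, and it skips renormalization by letting the state grow as a Python arbitrary-precision integer. All function names are hypothetical.

```python
from collections import Counter


def build_starts(freqs: dict[int, int]) -> tuple[dict[int, int], int]:
    """Cumulative start offset for each symbol, plus the total frequency."""
    starts, total = {}, 0
    for sym in sorted(freqs):
        starts[sym] = total
        total += freqs[sym]
    return starts, total


def compress(data: bytes) -> tuple[dict[int, int], int, int]:
    """Return (frequency table, length, final rANS state)."""
    freqs = dict(Counter(data))
    starts, total = build_starts(freqs)
    state = 1
    # rANS encodes in reverse so the decoder emits symbols in forward order.
    for sym in reversed(data):
        f = freqs[sym]
        state = (state // f) * total + starts[sym] + (state % f)
    return freqs, len(data), state


def decompress(freqs: dict[int, int], length: int, state: int) -> bytes:
    starts, total = build_starts(freqs)
    out = bytearray()
    for _ in range(length):
        slot = state % total
        # Linear scan for the symbol whose cumulative range contains the slot.
        sym = next(s for s in starts if starts[s] <= slot < starts[s] + freqs[s])
        state = freqs[sym] * (state // total) + slot - starts[sym]
        out.append(sym)
    return bytes(out)


for sample in (b"", b"a", b"abracadabra", bytes(range(256)) * 3):
    assert decompress(*compress(sample)) == sample  # the invariant
```

A production coder would keep the state in a fixed-width integer and emit bytes during renormalization; the big-integer version only shows the state update and why decoding exactly inverts encoding.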

Quick Start (Tier 1)

# Install (uv recommended)
uv sync

# Train Tier 1 (~6-10h on 1x RTX 4090 / A100)
./scripts/train_tier1.sh

# Evaluate on enwik8 (bits per byte)
./scripts/eval_enwik8.sh

# Roundtrip test
pytest tests/test_roundtrip.py -v

# Compress / decompress a file
python -m bin.compress input.bin output.umc --profile turbo
python -m bin.decompress output.umc output.bin
sha256sum input.bin output.bin   # must match
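
The enwik8 evaluation reports bits per byte. As a rough illustration of that metric, the sketch below sums the ideal code length, -log2 p, that a predictive model assigns to each byte and divides by the input length. The adaptive order-0 counting model is a stand-in chosen for this example (an assumption, not the repo's backbone); an entropy coder such as rANS gets close to this ideal total.

```python
import math


def adaptive_order0_bpb(data: bytes) -> float:
    """Bits per byte under an adaptive order-0 model with add-one smoothing."""
    if not data:
        return 0.0
    counts = [1] * 256  # every byte value starts with a pseudo-count of 1
    total = 256
    bits = 0.0
    for b in data:
        # Ideal cost of coding b with the probability predicted *before* seeing it.
        bits += -math.log2(counts[b] / total)
        # Update the model only afterwards, so a decoder can mirror it exactly.
        counts[b] += 1
        total += 1
    return bits / len(data)


print(adaptive_order0_bpb(b"\x00" * 10_000))     # highly predictable input
print(adaptive_order0_bpb(bytes(range(256)) * 4))  # every byte value equally likely
```

A stronger predictor lowers the summed code length, which is why compression ratio tracks model quality directly.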

Design choices (locked in)

  • Tier 1 → Tier 2 → Tier 3 build order (limited resources).
  • Tier 1: 64 routed experts (up from 32, for finer-grained specialization).
  • Tier 2 ablation: configs with multi-scale patches (4B/16B/64B) on and off.
  • Online LoRA adaptation at Tier 1: per-expert rank-8 LoRA, ~1M trainable parameters. Active experts are adapted on files >10 MB before encoding.
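
For a sense of where a figure like ~1M trainable parameters can come from, the arithmetic below counts LoRA parameters for 64 experts at rank 8. The layer dimensions (D_MODEL, D_FF) and the choice of one adapted matrix per expert are hypothetical values picked for this example; the README does not state the actual sizes or which matrices receive adapters.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA adds A (d_in x r) and B (r x d_out): r * (d_in + d_out) parameters."""
    return rank * (d_in + d_out)


NUM_EXPERTS = 64   # from the README
RANK = 8           # from the README
D_MODEL = 512      # hypothetical, not stated in the README
D_FF = 1536        # hypothetical, not stated in the README

per_expert = lora_params(D_MODEL, D_FF, RANK)   # 8 * 2048 = 16,384
total = NUM_EXPERTS * per_expert                # 64 * 16,384 = 1,048,576

print(f"per expert: {per_expert:,}  all experts: {total:,}")
```

Since only the active experts are adapted for a given file, the number of parameters actually updated in one adaptation pass could be smaller than this total.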

See docs/SCALING_PLAN.md for the full 5-tier plan and docs/ARCHITECTURE_v2.md for the architecture spec.

Status

| Component                        | Tier 1         | Tier 2             | Tier 3  |
|----------------------------------|----------------|--------------------|---------|
| Mamba-2 block                    | ✅ implemented | ⏳ stub            | ⏳ stub |
| MLA block                        | ✅ implemented | ⏳ stub            | ⏳ stub |
| Fine-grained MoE + shared        | ✅ implemented | ⏳ stub            | ⏳ stub |
| Aux-loss-free bias balance       | ✅ implemented | ⏳ stub            | ⏳ stub |
| MoD gate (scalar per token)      | ✅ implemented | ⏳ stub            | ⏳ stub |
| Cheap predictor cascade          | ✅ implemented | ⏳ stub            | ⏳ stub |
| rANS coder                       | ✅ implemented |                    |         |
| Sequence packing                 | ✅ implemented |                    |         |
| Muon optimizer                   | ✅ implemented |                    |         |
| Online LoRA adapt                | ✅ implemented | ⏳ stub            | ⏳ stub |
| Multi-scale patches (4B/16B/64B) | ❌ (16B only)  | ✅ via config flag |         |
| Tiered early-exit cascade        | ⏳ stub        | ⏳ stub            |         |
| AdaLN-Zero conditioning          | ✅ implemented | ⏳ stub            | ⏳ stub |
| FSDP-2 + Expert Parallel         | ❌ (single GPU)| ⏳ stub            | ⏳ stub |
| FP8 mixed precision              | ❌ (bf16)      | ⏳ stub            |         |

The "implemented" modules are real PyTorch code that runs end-to-end. The "stub" modules have the right interface so you can fill them in when you scale up.
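
The "aux-loss-free bias balance" row refers to balancing expert load without an auxiliary loss term. One common formulation, sketched below with toy random scores, adds a per-expert bias to the router scores for top-k selection only and nudges that bias down for overloaded experts and up for underloaded ones after each batch. The function names, the fixed step size, and the simulated skew are assumptions for this example; the repo's implementation may differ.

```python
import random


def route(scores: list[float], bias: list[float], k: int) -> list[int]:
    """Pick the top-k experts by score + bias (bias influences selection only)."""
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]


def update_bias(bias: list[float], load: list[int], step: float) -> list[float]:
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(num_experts=8, k=2, tokens=256, rounds=200, step=0.01, seed=0):
    rng = random.Random(seed)
    skew = [0.3] + [0.0] * (num_experts - 1)  # expert 0 is artificially favored
    bias = [0.0] * num_experts
    history = []  # max load divided by mean load, one entry per round
    for _ in range(rounds):
        load = [0] * num_experts
        for _ in range(tokens):
            scores = [rng.random() + skew[e] for e in range(num_experts)]
            for e in route(scores, bias, k):
                load[e] += 1
        history.append(max(load) / (sum(load) / num_experts))
        bias = update_bias(bias, load, step)
    return history, bias


history, bias = simulate()
print(f"imbalance first round: {history[0]:.2f}, last round: {history[-1]:.2f}")
```

In a real MoE layer the gate weights applied to expert outputs would still be computed from the unbiased scores, so the bias steers which experts are chosen without changing how their outputs are mixed.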

License

TBD.
