Lossless neural compression via predictive modeling + rANS entropy coding.
Raw Bytes -> Modality Adapter -> Byte Patches -> Hybrid Backbone (Mamba-2 + MLA + Fine-Grained MoE + MoD + AdaLN-Zero) -> rANS
This repo scaffolds the implementation starting from Tier 1 (u-moec-micro), a model with ~20M active parameters that fits on a single GPU and roundtrips through the full stack in under 24 hours. Tier 2 and Tier 3 configs are included; their model code is stubbed, but the interfaces are stable.
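To make the last stage of the pipeline concrete, here is a minimal sketch of the rANS state update. It assumes a static frequency table and an unbounded Python integer as the coder state, so there is no renormalization or byte-stream output. The function names are illustrative and are not taken from `umoec/entropy/`; in the real pipeline the frequencies would come from the model's per-byte predictions rather than a fixed table.

```python
from collections import Counter


def cumulative(freqs):
    """Map each symbol to the start of its slot range; return (starts, total)."""
    starts, total = {}, 0
    for sym in sorted(freqs):
        starts[sym] = total
        total += freqs[sym]
    return starts, total


def rans_encode(data, freqs):
    """Fold `data` into a single integer state.

    Symbols are pushed in reverse so the decoder pops them in forward order.
    """
    starts, total = cumulative(freqs)
    state = 1
    for sym in reversed(data):
        f = freqs[sym]
        state = (state // f) * total + starts[sym] + (state % f)
    return state


def rans_decode(state, length, freqs):
    """Invert rans_encode: pop `length` symbols back out of the state."""
    starts, total = cumulative(freqs)
    out = bytearray()
    for _ in range(length):
        slot = state % total
        # Linear search for the symbol whose slot range contains `slot`.
        sym = next(s for s in starts if starts[s] <= slot < starts[s] + freqs[s])
        out.append(sym)
        state = freqs[sym] * (state // total) + slot - starts[sym]
    return bytes(out)


data = b"abracadabra"
freqs = Counter(data)  # static stand-in for model-predicted frequencies
state = rans_encode(data, freqs)
assert rans_decode(state, len(data), freqs) == data
```

Each encoded symbol grows the state by roughly log2(total / f) bits, which is why sharper predictions (larger `f` for the symbol that actually occurs) produce smaller output.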
u-moec-tiny/
├── README.md <- you are here
├── pyproject.toml <- uv/pip install
├── configs/ <- YAML configs, one per (tier, ablation)
│ ├── tier1_micro.yaml
│ ├── tier2_small.yaml (multi-scale ON)
│ ├── tier2_small_no_ms.yaml (ablation: multi-scale OFF)
│ └── tier3_base.yaml
├── umoec/ <- main package
│ ├── model/ <- backbone, MoE, MoD, cascade, heads
│ ├── entropy/ <- rANS / tANS coders
│ ├── stats/ <- global block stats + rolling local stats
│ ├── data/ <- corpus loaders + sequence packing
│ ├── lora/ <- online per-expert LoRA adapt for big files
│ ├── train/ <- training loop, Muon optimizer, loss
│ ├── inference/ <- encoder/decoder loops, profile selection
│ └── eval/ <- bits-per-byte, lossless roundtrip check
├── bin/ <- CLI entry points
│ ├── compress.py
│ ├── decompress.py
│ └── train.py
├── scripts/ <- shell entry points
│ ├── train_tier1.sh
│ ├── train_tier2.sh
│ ├── train_tier3.sh
│ └── eval_enwik8.sh
└── tests/ <- unit + integration tests
    └── test_roundtrip.py <- THE invariant: decompress(compress(x)) == x
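The roundtrip invariant and the bits-per-byte metric listed in the tree can be sketched as follows. This is only an illustration: `zlib` stands in for the codec, and `bits_per_byte` and `check_roundtrip` are hypothetical helper names, not the API exposed by `umoec/eval/` or `tests/test_roundtrip.py`.

```python
import zlib


def bits_per_byte(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by original size in bytes (non-empty input)."""
    return 8 * len(compressed) / len(original)


def check_roundtrip(compress, decompress, payload: bytes) -> float:
    """Assert the lossless invariant, then report the compression rate."""
    blob = compress(payload)
    assert decompress(blob) == payload, "roundtrip mismatch"
    return bits_per_byte(payload, blob)


# zlib is a placeholder codec here, not the neural compressor.
sample = b"the quick brown fox jumps over the lazy dog\n" * 200
bpb = check_roundtrip(zlib.compress, zlib.decompress, sample)
print(f"{bpb:.3f} bits per byte")
```

Passing the codec as a pair of callables keeps the check independent of how the compressor is implemented, so the same helper could wrap any profile.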
# Install (uv recommended)
uv sync
# Train Tier 1 (~6-10h on 1x RTX 4090 / A100)
./scripts/train_tier1.sh
# Evaluate on enwik8 (bits per byte)
./scripts/eval_enwik8.sh
# Roundtrip test
pytest tests/test_roundtrip.py -v
# Compress / decompress a file
python -m bin.compress input.bin output.umc --profile turbo
python -m bin.decompress output.umc output.bin
sha256sum input.bin output.bin # must match
- Tier 1 → Tier 2 → Tier 3 build order (limited resources).
- Tier 1: 64 routed experts (up from 32, for finer-grained specialization).
- Tier 2 ablation: configs for multi-scale patches (4B/16B/64B) on and off.
- Online LoRA adaptation at Tier 1: per-expert rank-8 LoRA, ~1M trainable parameters. Active experts are adapted on files >10 MB before encoding.
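As a rough sanity check on the LoRA budget, the arithmetic below counts adapter parameters for rank-8 LoRA across 64 experts. The layer sizes `D_MODEL` and `D_FF`, and the assumption of two adapted matrices per expert, are hypothetical placeholders chosen only so that the total lands near the ~1M figure; the actual dimensions are defined in `configs/tier1_micro.yaml` and may differ.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


N_EXPERTS = 64  # from the Tier 1 decision above
RANK = 8        # from the Tier 1 decision above
D_MODEL = 256   # hypothetical, not read from the config
D_FF = 768      # hypothetical, not read from the config

# Assumes each expert adapts an up-projection and a down-projection.
per_expert = lora_params(D_MODEL, D_FF, RANK) + lora_params(D_FF, D_MODEL, RANK)
total = N_EXPERTS * per_expert
print(f"{per_expert:,} per expert, {total:,} total")  # 16,384 per expert, 1,048,576 total
```

This counts adapters for all experts; if only the experts active on a given file are adapted, the number of parameters actually updated would be smaller.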
See docs/SCALING_PLAN.md for the full 5-tier plan and docs/ARCHITECTURE_v2.md
for the architecture spec.
| Component | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| Mamba-2 block | ✅ implemented | ⏳ stub | ⏳ stub |
| MLA block | ✅ implemented | ⏳ stub | ⏳ stub |
| Fine-grained MoE + shared | ✅ implemented | ⏳ stub | ⏳ stub |
| Aux-loss-free bias balance | ✅ implemented | ⏳ stub | ⏳ stub |
| MoD gate (scalar per token) | ✅ implemented | ⏳ stub | ⏳ stub |
| Cheap predictor cascade | ✅ implemented | ⏳ stub | ⏳ stub |
| rANS coder | ✅ implemented | ✅ | ✅ |
| Sequence packing | ✅ implemented | ✅ | ✅ |
| Muon optimizer | ✅ implemented | ✅ | ✅ |
| Online LoRA adapt | ✅ implemented | ⏳ stub | ⏳ stub |
| Multi-scale patches (4B/16B/64B) | ❌ (16B only) | ✅ via config flag | ✅ |
| Tiered early-exit cascade | ❌ | ⏳ stub | ⏳ stub |
| AdaLN-Zero conditioning | ✅ implemented | ⏳ stub | ⏳ stub |
| FSDP-2 + Expert Parallel | ❌ (single GPU) | ⏳ stub | ⏳ stub |
| FP8 mixed precision | ❌ (bf16) | ⏳ stub | ✅ |
The "implemented" modules are real PyTorch code that runs end-to-end. The "stub" modules have the right interface so you can fill them in when you scale up.
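For readers unfamiliar with the "Aux-loss-free bias balance" row, the sketch below shows the general idea under one common formulation: a per-expert bias is added to the routing scores for top-k selection only, and is nudged after each batch according to expert load. This is an assumption about the scheme, not a copy of the code in `umoec/model/`; the function names and the step size are illustrative.

```python
def select_experts(scores, bias, k):
    """Pick the top-k experts by biased score.

    The bias influences selection only; gate weights would still be
    computed from the raw scores.
    """
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]


def update_bias(bias, load, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [
        b - step if n > mean_load else b + step if n < mean_load else b
        for b, n in zip(bias, load)
    ]


# Toy batch: every token prefers experts 0 and 1, so they start out overloaded.
n_experts, k = 4, 2
bias = [0.0] * n_experts
tokens = [[0.9, 0.5, 0.4, 0.3]] * 8

load = [0] * n_experts
for scores in tokens:
    for e in select_experts(scores, bias, k):
        load[e] += 1

bias = update_bias(bias, load)
print(load, bias)
```

Because the correction acts through the selection bias rather than an extra loss term, it does not add a competing gradient to the compression objective.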
TBD.