
Record: Trinity Ternary GPT — val_bpb 0.9650 (ternary roundtrip) #1246

Open
deborahnelson8788726 wants to merge 7 commits into openai:main from deborahnelson8788726:trinity-ternary-submission

Conversation

@deborahnelson8788726

BitNet b1.58 ternary QAT (-1,0,+1) inspired by Trinity framework. 10L 768d 8h/4kv MLP4x, relu², Partial RoPE, NeoMuon, EMA, Z-loss. Base-3 ternary packing (5 trits/byte), 14.2MB artifact under 16MB limit. 1489 steps in 10 min on 8xH100 SXM.

SSD DDD and others added 7 commits April 1, 2026 23:35
BitNet b1.58 ternary QAT (-1,0,+1) inspired by Trinity framework.
10L 768d 8h/4kv MLP4x, relu², Partial RoPE, NeoMuon, EMA, Z-loss.
Base-3 ternary packing (5 trits/byte), 14.2MB artifact under 16MB limit.
1489 steps in 10 min on 8xH100 SXM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
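The base-3 packing mentioned above exploits the fact that 3^5 = 243 ≤ 255, so five base-3 digits ("trits") fit in one byte. A minimal NumPy sketch of the idea follows; the function names are illustrative (the PR's actual packer lives in ternary_packing.zig):

```python
import numpy as np

def pack_trits(trits: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} into bytes, 5 trits per byte."""
    digits = (trits + 1).astype(np.uint8)      # map {-1, 0, 1} -> {0, 1, 2}
    pad = (-len(digits)) % 5
    digits = np.pad(digits, (0, pad))          # pad to a multiple of 5
    groups = digits.reshape(-1, 5)
    weights = np.array([81, 27, 9, 3, 1], dtype=np.uint8)  # 3^4 .. 3^0
    # Max byte value is 2 * (81+27+9+3+1) = 242, so uint8 is safe.
    return (groups * weights).sum(axis=1).astype(np.uint8)

def unpack_trits(packed: np.ndarray, n: int) -> np.ndarray:
    """Invert pack_trits; n is the original trit count (drops padding)."""
    out = np.empty((len(packed), 5), dtype=np.int8)
    vals = packed.astype(np.int16)
    for i, w in enumerate([81, 27, 9, 3, 1]):  # peel off base-3 digits
        out[:, i] = vals // w
        vals = vals % w
    return out.reshape(-1)[:n] - 1             # back to {-1, 0, 1}
```

At 5 trits/byte this is 1.6 bits per weight before lzma, which is where the ~14MB artifact figure comes from.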
- Fix #1: ternary roundtrip eval on ALL ranks with dist.broadcast
  (was: only rank 0 loaded weights → invalid eval results)
- Fix #2: pass pre-computed scales to export (avoids double quantization)
- Fix #3: keep scales as float32 (was: lossy float16 cast)
- Fix #4: import returns float32 (was: lossy bfloat16 cast)
- Fix #5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion)
- Fix #6: add dist.broadcast after the int8 roundtrip load too
- Fix #7: pass weights_only=False explicitly to suppress the FutureWarning

Ternary roundtrip is now LOSSLESS (max error = 0.0).
The previous val_bpb=0.9650 was an artifact of bug #1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
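The rank-sync fix above can be sketched as follows, assuming a PyTorch DDP setup; the function name, checkpoint path, and state-dict layout are illustrative, not the PR's actual code:

```python
import torch
import torch.distributed as dist

def load_roundtrip_weights(model: torch.nn.Module, path: str) -> None:
    """Load exported/reimported weights on rank 0, then broadcast to all
    ranks. Before the fix, only rank 0 loaded the roundtripped checkpoint,
    so other ranks evaluated the un-quantized model and the averaged
    val_bpb was invalid."""
    if dist.get_rank() == 0:
        # weights_only=False passed explicitly (per fix #7).
        state = torch.load(path, map_location="cpu", weights_only=False)
        model.load_state_dict(state)
    # Sync every parameter from rank 0 so all ranks eval identical weights.
    for p in model.parameters():
        dist.broadcast(p.detach(), src=0)
```

The same broadcast has to follow the int8 roundtrip load (fix #6), since any rank-0-only load path has the identical failure mode.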
Major changes:
- Late QAT: train in fp32 first, activate ternary STE when LR scale < 0.15
  (prevents loss explosion from 6.97→21 seen in v1/v2)
- Smaller model: 11L 512d MLP3x (26.5M params vs 65.7M) — 2x faster steps
- Weight decay 0.04 (was 0) — improves generalization
- EMA start step 50 (was 500) — captures early improvements
- Z-loss 1e-5 (was 1e-4) — less interference with STE gradients
- Late QAT gate: step >= 100 guard prevents premature activation

Smoke test on 1xH100: stable loss curve (6.94→5.32 in 100 steps)
Artifact: 6.0 MB ternary+lzma (well under 16MB)
Awaiting stable 8xH100 run for final val_bpb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
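The Late-QAT schedule above can be sketched like this, assuming BitNet b1.58-style absmean quantization with a straight-through estimator; names and thresholds mirror the commit message but the code itself is illustrative:

```python
import torch

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Ternary quantization with a straight-through estimator:
    forward uses {-1, 0, +1} * scale, backward passes gradients
    through to w unchanged."""
    scale = w.abs().mean().clamp(min=1e-8)          # absmean scale
    q = (w / scale).round().clamp(-1, 1) * scale    # ternary forward
    return w + (q - w).detach()                     # STE trick

def maybe_quantize(w, step, lr_scale, threshold=0.15, min_step=100):
    """Late-QAT gate: train in fp32 first, switch on the ternary STE
    only once the LR schedule has decayed below `threshold`, and never
    before `min_step` (guards against premature activation)."""
    if step >= min_step and lr_scale < threshold:
        return ternary_ste(w)
    return w
```

Deferring the STE until the LR has mostly decayed is what avoided the 6.97→21 loss explosion seen in v1/v2: the quantization noise arrives only after the fp32 weights have settled.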
Full 10-min training results:
- 2369 steps at 253ms/step on 8xH100 SXM
- Best fp32 val_bpb: 1.3293 (step 1500, before Late QAT)
- Int8 roundtrip val_bpb: 1.8310 (submission result)
- Ternary roundtrip val_bpb: 3.1146 (only 523 QAT steps)
- Artifact: 6.1 MB ternary / 8.0 MB int8 (both under 16MB)

Late QAT activated at step 1846 (LR scale < 0.15).
Val_bpb jumped from 1.33→2.75 when STE activated — expected, but
more QAT steps needed for convergence. Next step: tune
late_qat_threshold to activate earlier (0.3-0.5) for more QAT time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
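The artifact sizes reported above are consistent with the packing density: at 5 trits/byte, each weight costs 1.6 bits. A rough back-of-the-envelope helper (illustrative only; it ignores scales, embeddings, headers, and lzma, so real artifacts come out somewhat larger):

```python
def ternary_artifact_mb(n_params: int) -> float:
    """Approximate pre-compression size of a base-3-packed ternary
    artifact: 5 trits/byte = 1.6 bits per weight."""
    return n_params * 1.6 / 8 / 1e6
```

For the 26.5M-param model this gives about 5.3 MB, in line with the reported 6.0-6.1 MB ternary+lzma artifacts once scales and metadata are included.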
Built on SOTA #1 (PR #1019) + Trinity ternary for MLP layers.
Key change: MLP 5x width (ternary weights are cheap) vs SOTA's 3x.

8xH100 SXM results:
- 4837 steps in 10 min (123ms/step)
- val_bpb: 1.2361 (step 2000) → 1.1611 (step 4000) → 1.1357 (step 4837)
- Beats baseline (1.2244) and ternary submission (1.1570)
- Close to SOTA #4 (1.1307)

Known issue: hybrid export pipeline (ternary MLP + int6 GPTQ attn)
produces val_bpb=3.97 on roundtrip — needs debugging.
Training result is valid; export/quantization needs fixing.

Trinity contributions:
- Ternary absmean quantization for MLP (from ternary_pipeline.zig)
- Base-3 packing (5 trits/byte, from ternary_packing.zig)
- Wider MLP (5x vs 3x) enabled by ternary compression savings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
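The export-side absmean quantization listed above (attributed to ternary_pipeline.zig) can be sketched in Python; this is an illustrative reimplementation, not the PR's code, and it bakes in fixes #3 and #4 (fp32 scales on both export and import):

```python
import torch

def absmean_ternary(w: torch.Tensor):
    """Per-tensor ternary absmean quantization, after BitNet b1.58:
    scale = mean(|w|); weights snap to {-1, 0, +1}. Returns int8 trits
    plus a float32 scale (kept fp32 per fix #3)."""
    scale = w.abs().mean().clamp(min=1e-8)
    trits = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return trits, scale.to(torch.float32)

def dequantize(trits: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Import side returns float32 (fix #4), avoiding a lossy bf16 cast."""
    return trits.to(torch.float32) * scale
```

Because the dequantized tensor holds only the values {-scale, 0, +scale}, re-quantizing it reproduces the trit pattern exactly, which is what makes the roundtrip lossless once the scale-precision bugs are fixed.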
Fixed export pipeline: all weights use int6 GPTQ (no broken ternary export).
MLP 4x gave a 17.2MB artifact (over the limit); reducing to 3.5x to fit under 16MB.

Results with MLP 4x (8xH100, 5145 steps):
- Training val_bpb: 1.1380
- Roundtrip val_bpb: 1.1619 (standard), 1.1381 (sliding window s64)
- Would be #5 on the leaderboard if the artifact fit 16MB
- Artifact: 17.2MB (1.2MB over limit with full int6 prune)

Next: MLP 3.5x should fit ~16MB. Expected val_bpb ~1.14-1.15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8xH100 SXM, 5305 steps, 113ms/step:
- Training val_bpb: 1.1429
- Roundtrip standard: 1.1514
- Roundtrip sliding window s64: 1.1279 (#3-5 level!)
- Artifact: 16.67MB (0.67MB over limit)
- Pruned 44.6% of int6 ±1 values

Reducing MLP to 3.25x to fit within 16MB exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
