Record: Trinity Ternary GPT — val_bpb 0.9650 (ternary roundtrip) #1246
Open
deborahnelson8788726 wants to merge 7 commits into openai:main
Conversation
BitNet b1.58 ternary QAT (-1, 0, +1) inspired by the Trinity framework. Architecture: 10L 768d, 8h/4kv, MLP 4x, relu², partial RoPE, NeoMuon, EMA, Z-loss. Base-3 ternary packing (5 trits/byte) gives a 14.2 MB artifact, under the 16 MB limit. 1489 steps in 10 min on 8xH100 SXM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
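The base-3 packing mentioned above works because five trits fit in one byte (3⁵ = 243 ≤ 256), i.e. 1.6 bits per ternary weight instead of 8. A minimal pure-Python sketch of the scheme; the function names are illustrative, not the PR's actual `ternary_packing` API:

```python
def pack_trits(trits):
    """Pack a list of trits in {-1, 0, +1} into bytes, 5 trits per byte (base-3)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        group = trits[i:i + 5]
        group = group + [0] * (5 - len(group))  # pad the final group with zeros
        value = 0
        for t in reversed(group):
            value = value * 3 + (t + 1)  # map {-1, 0, +1} -> base-3 digits {0, 1, 2}
        out.append(value)                # max value is 3**5 - 1 = 242, fits a byte
    return bytes(out)

def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original trit count (to drop the padding)."""
    trits = []
    for b in data:
        for _ in range(5):
            trits.append(b % 3 - 1)      # peel base-3 digits back to {-1, 0, +1}
            b //= 3
    return trits[:n]
```

Five trits per byte is what turns a ~26M-parameter ternary model into a single-digit-MB payload before compression.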
- Fix #1: ternary roundtrip eval on ALL ranks with dist.broadcast (was: only rank 0 loaded weights, giving invalid eval results)
- Fix #2: pass pre-computed scales to export (avoids double quantization)
- Fix #3: keep scales as float32 (was: lossy float16 cast)
- Fix #4: import returns float32 (was: lossy bfloat16 cast)
- Fix #5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion)
- Fix #6: add dist.broadcast after the int8 roundtrip load too
- Fix #7: add weights_only=False to suppress a FutureWarning

Ternary roundtrip is now LOSSLESS (max error = 0.0). The previous val_bpb = 0.9650 was an artifact of bug #1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
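Fixes #3/#4 come down to float16 roundtrips of quantization scales being lossy (half precision keeps only ~3 decimal digits) while float32 is effectively exact at this magnitude. A stdlib-only demonstration using `struct`'s half (`'e'`) and single (`'f'`) formats; the scale value is illustrative:

```python
import struct

def roundtrip(fmt, x):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

scale = 0.04782317                  # a typical small quantization scale
half = roundtrip('<e', scale)       # float16: 10-bit mantissa, visibly lossy
single = roundtrip('<f', scale)     # float32: 23-bit mantissa, error ~1e-9 here
```

Since the exported trits are exact by construction, scale precision is the only thing standing between a lossy and a lossless roundtrip.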
Major changes:
- Late QAT: train in fp32 first, activate the ternary STE when LR scale < 0.15 (prevents the loss explosion from 6.97 to 21 seen in v1/v2)
- Smaller model: 11L 512d MLP 3x (26.5M params vs 65.7M) — 2x faster steps
- Weight decay 0.04 (was 0) — improves generalization
- EMA start step 50 (was 500) — captures early improvements
- Z-loss 1e-5 (was 1e-4) — less interference with STE gradients
- Late QAT gate: a step >= 100 guard prevents premature activation

Smoke test on 1xH100: stable loss curve (6.94 → 5.32 in 100 steps). Artifact: 6.0 MB ternary+lzma (well under 16 MB). Awaiting a stable 8xH100 run for final val_bpb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
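The Late QAT gate described above combines two conditions: the LR must have decayed below the threshold AND a minimum number of steps must have elapsed. A minimal sketch; the parameter names (`late_qat_threshold`, `min_step`) are illustrative, not the PR's exact API:

```python
def ternary_ste_active(step: int, lr_scale: float,
                       late_qat_threshold: float = 0.15,
                       min_step: int = 100) -> bool:
    """True once both gates open; the model trains in plain fp32 until then."""
    return step >= min_step and lr_scale < late_qat_threshold
```

The `step >= min_step` guard matters because LR warmup can make `lr_scale` dip below the threshold in the first few steps, which would flip the STE on far too early.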
Full 10-min training results:
- 2369 steps at 253 ms/step on 8xH100 SXM
- Best fp32 val_bpb: 1.3293 (step 1500, before Late QAT)
- Int8 roundtrip val_bpb: 1.8310 (submission result)
- Ternary roundtrip val_bpb: 3.1146 (only 523 QAT steps)
- Artifact: 6.1 MB ternary / 8.0 MB int8 (both under 16 MB)

Late QAT activated at step 1846 (LR scale < 0.15). val_bpb jumped from 1.33 to 2.75 when the STE activated — expected, but more QAT steps are needed for convergence. Next step: tune late_qat_threshold to activate earlier (0.3–0.5) for more QAT time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
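A back-of-envelope for the `late_qat_threshold` tuning mentioned above: assuming, purely for illustration, a linear decay `lr_scale = 1 - step/total_steps` (the run's actual schedule evidently differs, since it crossed 0.15 at step 1846 rather than ~2014), the threshold directly controls how many steps happen under QAT:

```python
import math

def qat_steps(total_steps: int, threshold: float) -> int:
    """Steps remaining after lr_scale = 1 - step/total first drops below threshold,
    assuming a plain linear decay (illustrative; real schedules differ)."""
    start = math.floor(total_steps * (1.0 - threshold)) + 1
    return total_steps - start
```

Even under this crude model, raising the threshold from 0.15 toward 0.5 multiplies the QAT budget several times over, which is the rationale for activating earlier.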
Built on SOTA #1 (PR #1019) + Trinity ternary for the MLP layers. Key change: MLP 5x width (ternary weights are cheap) vs SOTA's 3x.

8xH100 SXM results:
- 4837 steps in 10 min (123 ms/step)
- val_bpb: 1.2361 (step 2000) → 1.1611 (step 4000) → 1.1357 (step 4837)
- Beats the baseline (1.2244) and the ternary submission (1.1570)
- Close to SOTA #4 (1.1307)

Known issue: the hybrid export pipeline (ternary MLP + int6 GPTQ attention) produces val_bpb = 3.97 on roundtrip — needs debugging. The training result is valid; the export/quantization needs fixing.

Trinity contributions:
- Ternary absmean quantization for the MLP (from ternary_pipeline.zig)
- Base-3 packing (5 trits/byte, from ternary_packing.zig)
- Wider MLP (5x vs 3x), enabled by ternary compression savings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
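The absmean quantization listed above is the BitNet b1.58 recipe: divide by the mean absolute weight, round, and clip to {-1, 0, +1}. A pure-Python sketch over a flat weight list for clarity (the real kernel operates on tensors; see `ternary_pipeline.zig` in Trinity):

```python
def absmean_ternary(weights, eps=1e-8):
    """BitNet b1.58-style absmean quantization.
    Returns (trits, scale); dequantize each weight as trit * scale."""
    scale = sum(abs(w) for w in weights) / max(len(weights), 1) + eps
    trits = [max(-1, min(1, round(w / scale))) for w in weights]
    return trits, scale
```

Note that weights larger than ~1.5x the mean magnitude saturate to ±1, which is why a full-precision per-tensor `scale` must survive export intact (fixes #3/#4 in this PR's history).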
Fixed export pipeline: all weights use int6 GPTQ (no broken ternary export). MLP 4x gave a 17.2 MB artifact (over the limit); reducing to 3.5x to fit 16 MB.

Results with MLP 4x (8xH100, 5145 steps):
- Training val_bpb: 1.1380
- Roundtrip val_bpb: 1.1619 (standard), 1.1381 (sliding window s64)
- Would be #5 on the leaderboard if the artifact fit 16 MB
- Artifact: 17.2 MB (1.2 MB over the limit with full int6 prune)

Next: MLP 3.5x should fit in ~16 MB. Expected val_bpb ~1.14–1.15.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8xH100 SXM, 5305 steps, 113 ms/step:
- Training val_bpb: 1.1429
- Roundtrip standard: 1.1514
- Roundtrip sliding window s64: 1.1279 (#3–5 level!)
- Artifact: 16.67 MB (0.67 MB over the limit)
- Pruned 44.6% of int6 ±1 values

Reducing MLP to 3.25x to fit within 16 MB exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
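The MLP-width iterations above (4x → 3.5x → 3.25x) are driven by simple per-parameter byte arithmetic: base-3 ternary packing costs 8/5 = 1.6 bits per parameter, int6 costs 6. A back-of-envelope helper, hedged because it ignores scales, headers, higher-precision embeddings, pruning, and lzma compression, so real artifacts (e.g. the 16.67 MB one above) land somewhat off these estimates:

```python
def artifact_mb(n_params: int, bits_per_param: float) -> float:
    """Uncompressed weight payload in MiB for a given per-parameter bit width."""
    return n_params * bits_per_param / 8 / (1 << 20)
```

For example, a 26.5M-parameter model packed ternary is ~5 MiB of payload, while the same model at int6 is roughly 3.75x larger — the headroom that funds the wider MLP.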