[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866 #1471
## Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 (val_bpb: 1.0866)

**val_bpb: 1.0866** (sliding-window stride=64, 3-seed mean, std 0.0007) | **~15.98 MB** | 8×H100 SXM, 590 s
### 3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Pruning | Artifact |
|------|---------------|-------------------|---------|----------|
| 42   | 1.0874        | **1.0873**        | None    | 15,981,300 B |
| 1337 | 1.0865        | **1.0866**        | None    | 15,978,870 B |
| 2024 | 1.0862        | **1.0859**        | None    | 15,975,819 B |

**Mean: 1.0866 | Std: 0.0007** | All artifacts under 16,000,000 bytes | Zero selective pruning

Current merged SOTA: **1.1147** (PR #1019). Delta: **−0.0281 BPB**.
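The sliding-window evaluation behind the "Sliding BPB (s64)" column advances by `stride=64` and scores only the newest tokens of each window, so every scored token sees up to `seq_len` tokens of left context. A minimal sketch, where `nll_fn` is a hypothetical stand-in for the model call:

```python
import math

def sliding_window_bpb(tokens, total_bytes, nll_fn, seq_len=2048, stride=64):
    """Sliding-window eval: advance by `stride`, score only the newest
    `stride` tokens of each window. `nll_fn(window, n_last)` stands in
    for the model call and returns the summed negative log-likelihood
    (in nats) of the last `n_last` tokens of `window`."""
    total_nll = 0.0
    for pos in range(0, len(tokens), stride):
        end = min(pos + stride, len(tokens))
        start = max(0, end - seq_len)      # full left context, capped at seq_len
        total_nll += nll_fn(tokens[start:end], end - pos)
    return total_nll / math.log(2) / total_bytes  # nats -> bits, then per byte
```

With a uniform 1-bit-per-token model and one byte per token this returns exactly 1.0 bpb, which is a handy sanity check.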
### Key Changes (over PR #1445, this author)

Two major additions to the PR #1445 stack:

| Change | PR #1445 | This PR | Impact |
|--------|----------|---------|--------|
| **Tokenizer** | SP4096 | **SP8192** | Larger vocab, so each sequence covers more text |
| **Quantization clip** | Percentile search | **SDClip (c = k·std)** | Principled clipping, zero pruning, better rate-distortion |
### SDClip: Standard-Deviation-Based Clipping

Replaces the multi-percentile clip search with a single principled formula from PR #1394 (@clarkkev):

```
clip = k · std(row)
```

- **k=12.85** for int6 matrix parameters (mlp, attn)
- **k=20.0** for int8 embeddings

A higher k means a wider clip range and a larger quantization step, so more weights round to codes near zero; that lowers the entropy of the code stream and improves compression. SDClip therefore optimizes for compressed artifact size rather than reconstruction error alone, and it needs only one GPTQ pass per matrix instead of five.

Result: **zero selective pruning** across all 3 seeds. The model fits comfortably under 16 MB without discarding any quantized values.
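The clipping rule itself is tiny. A minimal per-row sketch with plain round-to-nearest (the actual pipeline applies the clip inside a GPTQ pass, which this sketch does not reproduce):

```python
import statistics

def sdclip_quantize_row(row, k=12.85, bits=6):
    """SDClip: per-row clip threshold c = k * std(row), then symmetric
    round-to-nearest into signed `bits`-bit codes (int6 -> [-31, 31]).
    Plain rounding here only illustrates the clipping rule."""
    qmax = 2 ** (bits - 1) - 1
    clip = k * statistics.pstdev(row)            # c = k * std(row)
    scale = clip / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale                          # dequantize: code * scale
```

With k=12.85 almost nothing in a typical row actually clips; the large k is there to widen the quantization step rather than to trim outliers.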
### SP8192 Tokenizer

Moving from 4096 to 8192 SentencePiece tokens gives the model more granular subword representations. Combined with SDClip's superior compression, the larger embedding table fits within the 16 MB budget despite the doubled vocabulary.
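A quick size check on that tradeoff (raw int8 storage at the model's 512-dim embedding width, ignoring per-row scale factors and brotli): doubling the vocabulary adds about 2 MiB of raw embedding weights for the improved compression to absorb.

```python
def embed_table_bytes(vocab_size, dim=512):
    # raw int8 storage: one byte per weight, before per-row scales and brotli
    return vocab_size * dim

extra = embed_table_bytes(8192) - embed_table_bytes(4096)
assert extra == 2_097_152  # ~2 MiB of additional raw embedding weights
```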
### Full Stack (carried from PR #1445)

| Parameter | Value | Source |
|-----------|-------|--------|
| **Tokenizer** | SP8192 | This work |
| **SDClip k (matrices)** | 12.85 | PR #1394, this work |
| **SDClip k (embeddings)** | 20.0 | PR #1394, this work |
| Recurrence layers | 3,4,5 (3-layer, 14 virtual) | PR #1331 |
| Weight decay | 0.095 | PR #1331 |
| Matrix LR | 0.022 | PR #1331 |
| EMA decay | 0.9965 | PR #1421 (this author) |
| Recurrence start | step 2000 | PR #1445 (this author) |
| Warmdown fraction | 0.72 | PR #1445 (this author) |
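The weight EMA follows the standard update; with decay 0.9965 the effective averaging horizon is roughly 1/(1 − 0.9965) ≈ 286 steps. A sketch:

```python
def ema_update(shadow, params, decay=0.9965):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * param.
    decay=0.9965 averages over roughly 1 / (1 - 0.9965) ~ 286 steps."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow
```

The run swaps in the EMA weights before evaluation (see `ema:applying EMA weights` in the training log).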
### Architecture

- 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
- Depth recurrence: layers 3,4,5 repeat (14 virtual layers), activated at step 2000
- Skip gates, parallel residuals from layer 7, QK-Gain 5.0
- XSA on all 11 layers, LeakyReLU(0.5)²
- Shared Value Embedding (dim=128, layers 9,10)
- Tied embeddings, logit softcap=30.0
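The depth recurrence unrolls the block of layers 3,4,5 a second time in sequence, which matches the `virtual_layers` line printed at activation in the training log. A sketch of the unrolled execution order:

```python
def virtual_layer_order(num_layers=11, recur_block=(3, 4, 5)):
    """Unrolled layer execution order: the recurrent block runs twice
    back-to-back, turning 11 physical layers into 14 virtual ones."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if i == recur_block[-1]:
            order.extend(recur_block)  # second pass through the block
    return order

# matches the log: virtual_layers=[0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```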
### Training

- FlashAttention 3 (Hopper-optimized)
- Muon optimizer (matrices): lr=0.022, WD=0.095
- Adam (head): lr=0.008, fused=True
- AdamW (embeddings): lr=0.6, WD=0.095, fused=True
- Gradient clip: 0.3; batch: 786,432 tokens/step; seq_len=2048
- Warmdown: 72%; EMA decay=0.9965; wallclock: 590 s
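The PR does not show the exact schedule shape, so the following is an assumed trapezoidal interpretation of the hyperparameters above (20 warmup steps, flat hold, linear warmdown over the final 72% of the 20,000 planned steps):

```python
def lr_at(step, base_lr=0.022, total_steps=20000, warmup_steps=20,
          warmdown_frac=0.72, min_lr=0.0):
    """Assumed trapezoidal schedule: linear warmup over `warmup_steps`,
    flat hold, then linear decay over the final `warmdown_frac` of steps."""
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step < warmdown_start:
        return base_lr
    remaining = (total_steps - step) / (total_steps - warmdown_start)
    return min_lr + (base_lr - min_lr) * remaining
```

Under this interpretation the early stop at the 590 s wallclock cap (step 4964 of 20,000 in the log) still lands inside the flat region.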
### Quantization

- Full-Hessian GPTQ + Cholesky + actorder for all int6 layers
- **SDClip** (c = k·std) instead of percentile search
- Int6 per-row for MLP + attention; int8 per-row for embeddings
- Brotli compression
- **Zero selective pruning**: model fits natively under 16 MB
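The rate-distortion claim (a wider clip compresses better) can be demonstrated on a toy stream of Gaussian weights, using zlib from the standard library as a stand-in for brotli:

```python
import random
import zlib

def sdclip_codes(xs, k, qmax=31):
    """Quantize a value stream to int6 codes with the SDClip rule c = k*std,
    biased into [0, 62] so each code fits one byte for compression."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    scale = k * std / qmax
    return bytes(min(qmax, max(-qmax, round(x / scale))) + qmax for x in xs)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(20000)]
# A wider clip (larger k) concentrates codes near zero, lowering entropy
# and shrinking the compressed stream.
small_k = len(zlib.compress(sdclip_codes(weights, 5.0), 9))
large_k = len(zlib.compress(sdclip_codes(weights, 20.0), 9))
assert large_k < small_k
```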
### Run Command

```bash
SEED=42 VOCAB_SIZE=8192 \
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
```

A review suggestion on lines 89–90 adds `DATA_DIR=./data/ \` after the `TOKENIZER_PATH` line:

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
DATA_DIR=./data/ \
```
Submission metadata:

```json
{
  "author": "Abhishek Leji",
  "github_id": "X-Abhishek-X",
  "name": "Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965",
  "blurb": "SP8192 tokenizer with SDClip quantization (c=k*std), 3-layer depth recurrence (3,4,5), EMA 0.9965, WD=0.095, MLR=0.022, early recurrence (step 2000), extended warmdown (72%). Zero selective pruning.",
  "date": "2026-04-08T00:00:00Z",
  "val_loss": 2.80668370,
  "val_bpb": 1.08655472,
  "bytes_total": 15978870
}
```

A review suggestion replaces the last field with:

```
  "bytes_total": 15978870,
  "bytes_code": 15978870
```
Training log (seed 42):
```
W0408 09:27:57.372000 46549 torch/distributed/run.py:803]
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] *****************************************
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/426ddf98-afdb-4609-a85b-b957b9d54903.txt
logit_softcap: 30.0
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 3,4,5
recur_start_step: 2000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 426ddf98-afdb-4609-a85b-b957b9d54903
scalar_lr: 0.02
sdclip_k: 12.85
sdclip_k_embed: 20.0
seed: 42
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: False
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.002
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 128
val_tokens: 40540160
model_params:37022812
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0081 val_bpb: 3.4873
1/20000 train_loss: 9.0090 train_time: 0.0m tok/s: 8179673
2/20000 train_loss: 12.1257 train_time: 0.0m tok/s: 8089933
3/20000 train_loss: 10.9394 train_time: 0.0m tok/s: 7998743
4/20000 train_loss: 9.3743 train_time: 0.0m tok/s: 7962441
5/20000 train_loss: 8.2859 train_time: 0.0m tok/s: 7933798
500/20000 train_loss: 3.4218 train_time: 0.9m tok/s: 7707008
1000/20000 train_loss: 3.2639 train_time: 1.7m tok/s: 7690673
1500/20000 train_loss: 3.1442 train_time: 2.6m tok/s: 7688175
2000/20000 train_loss: 3.1367 train_time: 3.4m tok/s: 7688467
recurrence:activated at step 2000, virtual_layers=[0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.0249 train_time: 4.7m tok/s: 7024536
3000/20000 train_loss: 2.9817 train_time: 5.7m tok/s: 6880938
3500/20000 train_loss: 3.0498 train_time: 6.8m tok/s: 6782295
4000/20000 train_loss: 2.8990 train_time: 7.8m tok/s: 6710501
4000/20000 val_loss: 2.9085 val_bpb: 1.1260
4500/20000 train_loss: 2.9393 train_time: 8.9m tok/s: 6656731
4964/20000 val_loss: 2.8123 val_bpb: 1.0887
stopping_early: wallclock_cap train_time: 590043ms step: 4964/20000
peak memory allocated: 33093 MiB reserved: 33112 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80884157 val_bpb:1.08739010 eval_time:1988ms
Serialized model: 137649029 bytes
Code size: 83119 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 10.6s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=15.98MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15898181 bytes
Total submission size int6+brotli: 15981300 bytes
final_int6_roundtrip val_loss:2.85199103 val_bpb:1.10409460 eval_time:8340ms
final_int6_sliding_window val_loss:2.80849403 val_bpb:1.08725556 eval_time:79162ms
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seed 2024 row is missing the pre-quant BPB and artifact size, but
train_seed2024.logincludes both (pre-quantization post-ema val_bpb: 1.08623375andTotal submission size ...: 15,975,819 bytes). Please fill these in (or remove the column) so the README matches the provided logs.