## Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 (val_bpb: 1.0866)

**val_bpb: 1.0866** (sliding window stride=64, 3-seed mean, std 0.0007) | **~15.98 MB** | 8xH100 SXM, 590s

### 3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Pruning | Artifact |
|------|---------------|-------------------|---------|----------|
| 42 | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | 1.0862 | **1.0859** | None | 15,975,819 B |
**Mean: 1.0866 | Std: 0.0007** | All artifacts under 16,000,000 bytes | Zero selective pruning

Current merged SOTA: **1.1147** (PR #1019). Delta: **−0.0281 BPB**.
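For reference, BPB converts the nats-per-token validation loss to bits and normalizes by the byte length of the validation text. A minimal sketch; the bytes-per-token ratio in the comment is back-solved from the reported numbers, not taken from the logs:

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert a nats-per-token loss into bits-per-byte (BPB)."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# With the seed-1337 val_loss of 2.8067 and val_bpb of 1.0866, the
# validation set works out to roughly 3.73 bytes per token.
```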

### Key Changes (over PR #1445, this author)

Two major additions to the PR #1445 stack:

| Change | PR #1445 | This | Impact |
|--------|----------|------|--------|
| **Tokenizer** | SP4096 | **SP8192** | Larger vocab, better context per sequence |
| **Quantization clip** | Percentile search | **SDClip (c = k·std)** | Principled clipping, zero pruning, better rate-distortion |

### SDClip: Standard-Deviation-Based Clipping

Replaces the multi-percentile clip search with a single principled formula from PR #1394 (@clarkkev):

```
clip = k · std(row)
```

- **k=12.85** for int6 matrix parameters (mlp, attn)
- **k=20.0** for int8 embeddings

A higher k widens the clip range and hence enlarges the quantization scale, which maps more weights to integer bins near zero; lower entropy means better compression. SDClip thus optimizes for compressed artifact size rather than reconstruction error alone, and needs only one GPTQ pass per matrix instead of five.
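In code, the rule is one line per row. A minimal sketch (in the real pipeline the clip is applied inside the GPTQ pass, not as a standalone step):

```python
import torch

def sdclip(weight: torch.Tensor, k: float) -> torch.Tensor:
    """Clip each row of a 2-D weight matrix to +/- k * std(row)."""
    clip = k * weight.std(dim=1, keepdim=True)   # per-row clip threshold
    return weight.clamp(-clip, clip)

w = torch.randn(512, 2048)
w_mat = sdclip(w, 12.85)   # int6 matrix parameters
w_emb = sdclip(w, 20.0)    # int8 embeddings
```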

Result: **zero selective pruning** across all 3 seeds. The model fits comfortably under 16MB without destroying any quantized values.

### SP8192 Tokenizer

Moving from 4096 to 8192 SentencePiece tokens gives the model more granular subword representations. Combined with SDClip's superior compression, the larger embedding table fits within the 16MB budget despite doubling the vocabulary.

### Full Stack (carried from PR #1445)

| Parameter | Value | Source |
|-----------|-------|--------|
| **Tokenizer** | SP8192 | This work |
| **SDClip k (matrices)** | 12.85 | PR #1394, this work |
| **SDClip k (embeddings)** | 20.0 | PR #1394, this work |
| Recurrence layers | 3,4,5 (3-layer, 14 virtual) | PR #1331 |
| Weight decay | 0.095 | PR #1331 |
| Matrix LR | 0.022 | PR #1331 |
| EMA decay | 0.9965 | PR #1421 (this author) |
| Recurrence start | step 2000 | PR #1445 (this author) |
| Warmdown fraction | 0.72 | PR #1445 (this author) |

### Architecture

- 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
- Depth recurrence: layers 3,4,5 repeat (virtual 14 layers), activated at step 2000
- Skip gates, parallel residuals from layer 7, QK-Gain 5.0
- XSA on all 11 layers, LeakyReLU(0.5)²
- Shared Value Embedding (dim=128, layers 9,10)
- Tied embeddings, logit softcap=30.0
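The depth recurrence amounts to replaying the recurrent block once after its first pass. A sketch of the virtual-layer ordering (function name hypothetical; the model code shares weights across the repeated blocks rather than re-listing layers):

```python
def virtual_layer_order(num_layers: int, recur: list[int]) -> list[int]:
    """Expand physical layer indices into the virtual execution order,
    repeating the recurrent block once right after its first pass."""
    order: list[int] = []
    for i in range(num_layers):
        order.append(i)
        if i == recur[-1]:
            order.extend(recur)   # second pass over the recurrent block
    return order

# 11 physical layers with layers 3,4,5 repeated -> 14 virtual layers,
# matching the virtual_layers list printed in the training log.
print(virtual_layer_order(11, [3, 4, 5]))
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```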

### Training

- FlashAttention 3 (Hopper-optimized)
- Muon optimizer (matrices): lr=0.022, WD=0.095
- Adam (head): lr=0.008, fused=True
- AdamW (embeddings): lr=0.6, WD=0.095, fused=True
- Gradient clip: 0.3, Batch: 786,432 tokens/step, seq_len=2048
- Warmdown: 72%, EMA decay=0.9965, Wallclock: 590s
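The schedule implied by warmup=20 steps, warmdown fraction 0.72, and min_lr=0 is trapezoidal. A sketch, assuming the warmdown spans the final 72% of the nominal 20,000 steps (names hypothetical):

```python
def lr_scale(step: int, total: int = 20000, warmup: int = 20,
             warmdown_frac: float = 0.72) -> float:
    """Trapezoid: linear warmup, flat top, linear warmdown to zero."""
    warmdown_start = int(total * (1.0 - warmdown_frac))   # step 5600
    if step < warmup:
        return step / warmup                              # linear warmup
    if step <= warmdown_start:
        return 1.0                                        # flat
    return (total - step) / (total - warmdown_start)      # linear decay

# e.g. matrix LR at step 12800: 0.022 * lr_scale(12800) == 0.011
```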

### Quantization

- Full Hessian GPTQ + Cholesky + actorder for all int6 layers
- **SDClip** (c = k·std) instead of percentile search
- Int6 per-row for MLP + attention, Int8 per-row for embeddings
- Brotli compression
- **Zero selective pruning** — model fits natively under 16MB
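A simplified sketch of the per-row int6 path. GPTQ's Hessian-based rounding is omitted, 6-bit codes are kept in int8 for clarity rather than bit-packed, and stdlib `zlib` stands in for the Brotli codec the record actually uses:

```python
import zlib
import numpy as np

def quantize_rows(w: np.ndarray, bits: int = 6, k: float = 12.85):
    """Per-row symmetric quantization after SDClip (simplified: the real
    pipeline makes rounding decisions via GPTQ with Hessian + actorder)."""
    clip = k * w.std(axis=1, keepdims=True)
    clipped = np.clip(w, -clip, clip)
    qmax = 2 ** (bits - 1) - 1            # 31 for int6
    scale = clip / qmax                   # per-row dequantization scale
    q = np.round(clipped / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 2048)).astype(np.float32)
q, scale = quantize_rows(w)
# Most codes land near zero, so an entropy coder shrinks them well.
payload = zlib.compress(q.tobytes())
```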

### Run Command

```bash
SEED=42 VOCAB_SIZE=8192 \
DATA_DIR=./data/ \
RECUR_START_STEP=2000 WARMDOWN_FRAC=0.72 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

### Credits

- **SDClip quantization + SP8192 baseline**: PR #1394 by @clarkkev
- **Base architecture + depth recurrence**: PR #1334 by @aryanbhosale
- **3-layer recurrence + WD/LR tuning**: PR #1331
- **EMA decay tuning (0.9965)**: PR #1421 by @X-Abhishek-X (this author)
- **Early recurrence + extended warmdown**: PR #1445 by @X-Abhishek-X (this author)
- **SP8192 + SDClip integration**: This work
---

**submission.json:**
{
"author": "Abhishek Leji",
"github_id": "X-Abhishek-X",
"name": "Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965",
"blurb": "SP8192 tokenizer with SDClip quantization (c=k*std), 3-layer depth recurrence (3,4,5), EMA 0.9965, WD=0.095, MLR=0.022, early recurrence (step 2000), extended warmdown (72%). Zero selective pruning.",
"date": "2026-04-08T00:00:00Z",
"val_loss": 2.80668370,
"val_bpb": 1.08655472,
"bytes_total": 15978870
}
---

**Training log (seed 42):**
W0408 09:27:57.372000 46549 torch/distributed/run.py:803]
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] *****************************************
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0408 09:27:57.372000 46549 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_enabled: True
gptq_reserve_seconds: 10.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/426ddf98-afdb-4609-a85b-b957b9d54903.txt
logit_softcap: 30.0
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
parallel_start_layer: 7
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
recur_layers: 3,4,5
recur_start_step: 2000
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 426ddf98-afdb-4609-a85b-b957b9d54903
scalar_lr: 0.02
sdclip_k: 12.85
sdclip_k_embed: 20.0
seed: 42
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: False
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.002
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
ve_dim: 128
ve_enabled: True
ve_layers: 9,10
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 128
val_tokens: 40540160
model_params:37022812
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0081 val_bpb: 3.4873
1/20000 train_loss: 9.0090 train_time: 0.0m tok/s: 8179673
2/20000 train_loss: 12.1257 train_time: 0.0m tok/s: 8089933
3/20000 train_loss: 10.9394 train_time: 0.0m tok/s: 7998743
4/20000 train_loss: 9.3743 train_time: 0.0m tok/s: 7962441
5/20000 train_loss: 8.2859 train_time: 0.0m tok/s: 7933798
500/20000 train_loss: 3.4218 train_time: 0.9m tok/s: 7707008
1000/20000 train_loss: 3.2639 train_time: 1.7m tok/s: 7690673
1500/20000 train_loss: 3.1442 train_time: 2.6m tok/s: 7688175
2000/20000 train_loss: 3.1367 train_time: 3.4m tok/s: 7688467
recurrence:activated at step 2000, virtual_layers=[0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.0249 train_time: 4.7m tok/s: 7024536
3000/20000 train_loss: 2.9817 train_time: 5.7m tok/s: 6880938
3500/20000 train_loss: 3.0498 train_time: 6.8m tok/s: 6782295
4000/20000 train_loss: 2.8990 train_time: 7.8m tok/s: 6710501
4000/20000 val_loss: 2.9085 val_bpb: 1.1260
4500/20000 train_loss: 2.9393 train_time: 8.9m tok/s: 6656731
4964/20000 val_loss: 2.8123 val_bpb: 1.0887
stopping_early: wallclock_cap train_time: 590043ms step: 4964/20000
peak memory allocated: 33093 MiB reserved: 33112 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80884157 val_bpb:1.08739010 eval_time:1988ms
Serialized model: 137649029 bytes
Code size: 83119 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 10.6s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
selective_prune: unpruned=15.98MB target=16.0MB
selective_prune: already fits, no pruning needed
Serialized model int6+brotli: 15898181 bytes
Total submission size int6+brotli: 15981300 bytes
final_int6_roundtrip val_loss:2.85199103 val_bpb:1.10409460 eval_time:8340ms
final_int6_sliding_window val_loss:2.80849403 val_bpb:1.08725556 eval_time:79162ms