Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0783 (3-seed mean) #1561
Status: Open. EthanYangTW wants to merge 4 commits into `openai:main` from `EthanYangTW:submission/sp8192-legal-sota-clean`.
Commits (4), all by EthanYangTW:
- `b5c2872` Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 …
- `cbe5ed1` Clean submission: remove experimental files
- `8481237` Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 …
- `18caf93` Remove old submission folder (superseded by 2026-04-12 clean rerun)
records/track_10min_16mb/2026-04-12_SP8192_LegalSOTA_Clean/README.md (50 additions)

# Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0783 (3-seed mean)

**val_bpb = 1.0783** (3-seed mean, std 0.0004) | **~15.99 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Pre-quant BPP | Sliding BPP | **TTT BPP** | Artifact (bytes) |
|------|---------------|-------------|-------------|------------------|
| 1337 | 1.0859 | 1.0798 | **1.0782** | 15,986,623 |
| 42 | 1.0856 | 1.0793 | **1.0781** | 15,983,529 |
| 2024 | 1.0862 | 1.0800 | **1.0788** | 15,986,767 |
| **Mean** | 1.0859 | 1.0797 | **1.0783** | |

## Architecture

```
SP8192 tokenizer, 11 physical / 17 virtual layers
512 dim, MLP 4x (2048 hidden), GQA 8Q/4KV, head_dim=64
Parallel residuals L7+, QK-Gain 5.0, XSA all 11 layers
LeakyReLU(0.5)², skip gates, logit softcap 30
MuonEq-R (lr=0.022, wd=0.095, momentum=0.97) + AdamW
EMA 0.997, warmdown 66.7%, loop at 35%
SDClip GPTQ int6 (k=12.85) + int8 embed (k=20) + brotli
Score-first TTT: SGD lr=0.01, mom=0.9, 3ep, 32K chunks
Hash embedding: 16384x512, zero-init, trained in TTT
~36M params, ~15.99MB artifact
```

## Compliance (Track B — Score-First TTT)

Per Issue #1017:
- **Condition 1:** Hash key uses prefix tokens only
- **Condition 2:** Full normalized softmax distribution
- **Condition 3:** Each chunk scored under no_grad() before TTT update
- **Condition 4:** Single left-to-right pass, no rescoring

No SLOT, no pre-quant TTT, no n-gram caches, no Tap-In.

## Reproduction

```bash
pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
SEED=1337 TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

PR #1420 @abaybektursun (triple loop + fused kernels), PR #1394 @clarkkev (SP8192 + SDClip), PR #1471 @X-Abhishek-X (3-layer recurrence), PR #1477 @aryanbhosale (parallel residuals + score-first TTT), PR #1460 @resouer (eval-time hash embedding), PR #399 @abaybektursun (parameter banking concept), PR #1514 @dexhunter (Muon 0.97)
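As a sanity check on the headline number, the 3-seed mean and standard deviation can be recomputed from the per-seed TTT values above. This is a minimal sketch, assuming the reported std is the sample standard deviation (ddof = 1):

```python
from statistics import mean, stdev

# Per-seed TTT val_bpb values from the results table (seeds 1337, 42, 2024)
seed_bpb = [1.07817, 1.07807, 1.07876]

m = mean(seed_bpb)   # 3-seed mean
s = stdev(seed_bpb)  # sample standard deviation (ddof=1)

print(round(m, 5), round(s, 5))  # → 1.07833 0.00037, matching submission.json
```

The match with `val_bpb_std: 0.00037` only works with the sample (not population) standard deviation, which suggests that is what the submission script computes.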
records/track_10min_16mb/2026-04-12_SP8192_LegalSOTA_Clean/submission.json (1 addition)

```json
{
  "author": "EthanYangTW",
  "github_id": "EthanYangTW",
  "name": "SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 + Score-First TTT + Hash Embedding",
  "date": "2026-04-12",
  "track": "10min_16mb",
  "val_bpb": 1.07833,
  "val_bpb_std": 0.00037,
  "seeds": [1337, 42, 2024],
  "seed_results": {
    "1337": {"val_bpb": 1.07817, "artifact_bytes": 15986623},
    "42": {"val_bpb": 1.07807, "artifact_bytes": 15983529},
    "2024": {"val_bpb": 1.07876, "artifact_bytes": 15986767}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + Triple Depth Recurrence (3,4,5 x3, 17 virtual) + Parameter Banking + Fused MLP Triton TMA + CUTLASS EVT + Muon 0.97 + Parallel Residuals (L7+) + QK-Gain 5.0 + Score-First TTT (3ep SGD lr=0.01) + Eval-Time Hash Embedding + SDClip GPTQ int6 + Brotli"
}
```
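A reviewer can mechanically check the per-seed artifact sizes against the track's 16 MB budget. This sketch uses only a subset of the fields above and assumes "16mb" means 16 MiB (16 · 2^20 bytes); the track rules define the exact limit:

```python
import json

# Subset of the submission.json fields shown above
raw = ('{"track":"10min_16mb","seed_results":'
       '{"1337":{"artifact_bytes":15986623},'
       '"42":{"artifact_bytes":15983529},'
       '"2024":{"artifact_bytes":15986767}}}')
sub = json.loads(raw)

LIMIT = 16 * 2**20  # assumed definition of "16mb"; check the track rules
sizes = [r["artifact_bytes"] for r in sub["seed_results"].values()]
assert all(b <= LIMIT for b in sizes)
print(max(sizes), LIMIT - max(sizes))  # worst-case size and remaining headroom
```

All three artifacts fit with roughly 0.79 MB of headroom under that interpretation of the cap.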
records/track_10min_16mb/2026-04-12_SP8192_LegalSOTA_Clean/train_gpt.py (5 additions; large diff not rendered by default)
records/track_10min_16mb/2026-04-12_SP8192_LegalSOTA_Clean/train_seed1337.log (277 additions)

```
W0412 04:57:35.750000 1777 torch/distributed/run.py:803]
W0412 04:57:35.750000 1777 torch/distributed/run.py:803] *****************************************
W0412 04:57:35.750000 1777 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0412 04:57:35.750000 1777 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.997
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.095
embedding_dim: 512
enable_looping_at: 0.35
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
hash_embed_enabled: True
hash_embed_size: 16384
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/daa165fe-62f5-44c7-9f7b-10d92ebec09c.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.97
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: daa165fe-62f5-44c7-9f7b-10d92ebec09c
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_adamw_wd: 0.0
ttt_batch_seqs: 32
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 3
ttt_freeze_blocks: 0
ttt_grad_clip: 1.0
ttt_lr: 0.01
ttt_momentum: 0.9
ttt_optimizer: sgd
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.667
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 40540160
model_params:35944537
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0095 val_bpb: 3.4878
1/20000 train_loss: 9.0103 train_time: 0.0m tok/s: 17603941
2/20000 train_loss: 12.2673 train_time: 0.0m tok/s: 13040294
3/20000 train_loss: 10.9224 train_time: 0.0m tok/s: 10729005
4/20000 train_loss: 9.3858 train_time: 0.0m tok/s: 9811713
5/20000 train_loss: 8.2725 train_time: 0.0m tok/s: 9334895
500/20000 train_loss: 3.3833 train_time: 0.8m tok/s: 7821276
1000/20000 train_loss: 3.2932 train_time: 1.7m tok/s: 7803444
1500/20000 train_loss: 3.1922 train_time: 2.5m tok/s: 7799631
2000/20000 train_loss: 3.1034 train_time: 3.4m tok/s: 7803281
layer_loop:enabled step:2042 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1491 train_time: 4.6m tok/s: 7186166
3000/20000 train_loss: 2.9161 train_time: 5.9m tok/s: 6721413
3500/20000 train_loss: 2.9536 train_time: 7.1m tok/s: 6477927
4000/20000 train_loss: 2.8244 train_time: 8.3m tok/s: 6306083
4000/20000 val_loss: 2.8830 val_bpb: 1.1161
4500/20000 train_loss: 2.8384 train_time: 9.5m tok/s: 6178152
4603/20000 val_loss: 2.8044 val_bpb: 1.0857
stopping_early: wallclock_cap train_time: 588166ms step: 4603/20000
peak memory allocated: 39956 MiB reserved: 40024 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80498827 val_bpb:1.08589837 eval_time:6389ms
Serialized model: 135408623 bytes
Code size: 20681 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.4s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights
Serialized model quantized+brotli: 15965942 bytes
Total submission size quantized+brotli: 15986623 bytes
quantized val_loss:2.83306033 val_bpb:1.09676594 eval_time:27828ms
quantized_sliding_window val_loss:2.78916788 val_bpb:1.07977381 eval_time:123617ms
ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.01 ttt_epochs=3 freeze_blocks=0 optimizer=sgd hash_embed=True
ttt_sliding:params unfrozen=44333145 frozen=0
ttt_chunk [1/1238] bpb=1.117492 time=44.6s
ttt_chunk [11/1238] bpb=1.069226 time=68.8s
ttt_chunk [21/1238] bpb=1.106644 time=71.4s
ttt_chunk [31/1238] bpb=1.099689 time=74.0s
ttt_chunk [41/1238] bpb=1.093361 time=76.6s
ttt_chunk [51/1238] bpb=1.086964 time=79.2s
ttt_chunk [61/1238] bpb=1.078842 time=81.8s
ttt_chunk [71/1238] bpb=1.086084 time=84.4s
ttt_chunk [81/1238] bpb=1.079623 time=87.0s
ttt_chunk [91/1238] bpb=1.076128 time=89.6s
ttt_chunk [101/1238] bpb=1.075850 time=92.2s
ttt_chunk [111/1238] bpb=1.074081 time=94.8s
ttt_chunk [121/1238] bpb=1.077203 time=97.4s
ttt_chunk [131/1238] bpb=1.080943 time=100.0s
ttt_chunk [141/1238] bpb=1.081458 time=102.6s
ttt_chunk [151/1238] bpb=1.081208 time=105.2s
ttt_chunk [161/1238] bpb=1.081698 time=107.8s
ttt_chunk [171/1238] bpb=1.081580 time=110.3s
ttt_chunk [181/1238] bpb=1.080086 time=112.9s
ttt_chunk [191/1238] bpb=1.079866 time=115.5s
ttt_chunk [201/1238] bpb=1.077432 time=118.1s
ttt_chunk [211/1238] bpb=1.081917 time=120.7s
ttt_chunk [221/1238] bpb=1.082308 time=123.3s
ttt_chunk [231/1238] bpb=1.083948 time=125.8s
ttt_chunk [241/1238] bpb=1.082189 time=128.4s
ttt_chunk [251/1238] bpb=1.082218 time=131.0s
ttt_chunk [261/1238] bpb=1.083265 time=133.6s
ttt_chunk [271/1238] bpb=1.083724 time=136.2s
ttt_chunk [281/1238] bpb=1.083000 time=138.8s
ttt_chunk [291/1238] bpb=1.084080 time=141.3s
ttt_chunk [301/1238] bpb=1.084275 time=143.9s
ttt_chunk [311/1238] bpb=1.083204 time=146.5s
ttt_chunk [321/1238] bpb=1.083052 time=149.1s
ttt_chunk [331/1238] bpb=1.083339 time=151.7s
ttt_chunk [341/1238] bpb=1.082432 time=154.3s
ttt_chunk [351/1238] bpb=1.083202 time=156.9s
ttt_chunk [361/1238] bpb=1.082090 time=159.5s
ttt_chunk [371/1238] bpb=1.080503 time=162.1s
ttt_chunk [381/1238] bpb=1.080910 time=164.7s
ttt_chunk [391/1238] bpb=1.080581 time=167.3s
ttt_chunk [401/1238] bpb=1.080644 time=169.8s
ttt_chunk [411/1238] bpb=1.081146 time=172.4s
ttt_chunk [421/1238] bpb=1.080661 time=175.0s
ttt_chunk [431/1238] bpb=1.080855 time=177.6s
ttt_chunk [441/1238] bpb=1.080873 time=180.2s
ttt_chunk [451/1238] bpb=1.082030 time=182.8s
ttt_chunk [461/1238] bpb=1.080247 time=185.4s
ttt_chunk [471/1238] bpb=1.080256 time=188.0s
ttt_chunk [481/1238] bpb=1.080434 time=190.6s
ttt_chunk [491/1238] bpb=1.080855 time=193.2s
ttt_chunk [501/1238] bpb=1.080472 time=195.8s
ttt_chunk [511/1238] bpb=1.080056 time=198.4s
ttt_chunk [521/1238] bpb=1.079531 time=201.0s
ttt_chunk [531/1238] bpb=1.079483 time=203.6s
ttt_chunk [541/1238] bpb=1.079554 time=206.2s
ttt_chunk [551/1238] bpb=1.079075 time=208.8s
ttt_chunk [561/1238] bpb=1.078385 time=211.4s
ttt_chunk [571/1238] bpb=1.077832 time=214.0s
ttt_chunk [581/1238] bpb=1.078158 time=216.6s
ttt_chunk [591/1238] bpb=1.078420 time=219.2s
ttt_chunk [601/1238] bpb=1.078327 time=221.8s
ttt_chunk [611/1238] bpb=1.078900 time=224.4s
ttt_chunk [621/1238] bpb=1.079747 time=227.0s
ttt_chunk [631/1238] bpb=1.079804 time=229.6s
ttt_chunk [641/1238] bpb=1.080233 time=232.2s
ttt_chunk [651/1238] bpb=1.080547 time=234.7s
ttt_chunk [661/1238] bpb=1.079856 time=237.3s
ttt_chunk [671/1238] bpb=1.079636 time=239.9s
ttt_chunk [681/1238] bpb=1.080911 time=242.5s
ttt_chunk [691/1238] bpb=1.081091 time=245.1s
ttt_chunk [701/1238] bpb=1.080913 time=247.7s
ttt_chunk [711/1238] bpb=1.081619 time=250.3s
ttt_chunk [721/1238] bpb=1.081895 time=252.9s
ttt_chunk [731/1238] bpb=1.081240 time=255.5s
ttt_chunk [741/1238] bpb=1.080877 time=258.1s
ttt_chunk [751/1238] bpb=1.079932 time=260.7s
ttt_chunk [761/1238] bpb=1.079347 time=263.3s
ttt_chunk [771/1238] bpb=1.078309 time=265.8s
ttt_chunk [781/1238] bpb=1.078310 time=268.5s
ttt_chunk [791/1238] bpb=1.078646 time=271.1s
ttt_chunk [801/1238] bpb=1.078925 time=273.7s
ttt_chunk [811/1238] bpb=1.078430 time=276.3s
ttt_chunk [821/1238] bpb=1.077210 time=278.9s
ttt_chunk [831/1238] bpb=1.076847 time=281.5s
ttt_chunk [841/1238] bpb=1.076337 time=284.1s
ttt_chunk [851/1238] bpb=1.076039 time=286.7s
ttt_chunk [861/1238] bpb=1.075668 time=289.3s
ttt_chunk [871/1238] bpb=1.075539 time=291.9s
ttt_chunk [881/1238] bpb=1.075073 time=294.5s
ttt_chunk [891/1238] bpb=1.074550 time=297.1s
ttt_chunk [901/1238] bpb=1.074925 time=299.7s
ttt_chunk [911/1238] bpb=1.074611 time=302.3s
ttt_chunk [921/1238] bpb=1.074869 time=304.9s
ttt_chunk [931/1238] bpb=1.075550 time=307.5s
ttt_chunk [941/1238] bpb=1.075935 time=310.1s
ttt_chunk [951/1238] bpb=1.075848 time=312.7s
ttt_chunk [961/1238] bpb=1.076667 time=315.2s
ttt_chunk [971/1238] bpb=1.077061 time=317.8s
ttt_chunk [981/1238] bpb=1.077401 time=320.4s
ttt_chunk [991/1238] bpb=1.077162 time=323.0s
ttt_chunk [1001/1238] bpb=1.077185 time=325.6s
ttt_chunk [1011/1238] bpb=1.077516 time=328.2s
ttt_chunk [1021/1238] bpb=1.078212 time=330.8s
ttt_chunk [1031/1238] bpb=1.078671 time=333.4s
ttt_chunk [1041/1238] bpb=1.079137 time=336.0s
ttt_chunk [1051/1238] bpb=1.079049 time=338.6s
ttt_chunk [1061/1238] bpb=1.079036 time=341.2s
ttt_chunk [1071/1238] bpb=1.079200 time=343.8s
ttt_chunk [1081/1238] bpb=1.079092 time=346.4s
ttt_chunk [1091/1238] bpb=1.079284 time=349.0s
ttt_chunk [1101/1238] bpb=1.079803 time=351.6s
ttt_chunk [1111/1238] bpb=1.080085 time=354.2s
ttt_chunk [1121/1238] bpb=1.080238 time=356.8s
ttt_chunk [1131/1238] bpb=1.079881 time=359.4s
ttt_chunk [1141/1238] bpb=1.079522 time=361.9s
ttt_chunk [1151/1238] bpb=1.079551 time=364.5s
ttt_chunk [1161/1238] bpb=1.079662 time=367.1s
ttt_chunk [1171/1238] bpb=1.079425 time=369.7s
ttt_chunk [1181/1238] bpb=1.078935 time=372.3s
ttt_chunk [1191/1238] bpb=1.079061 time=374.9s
ttt_chunk [1201/1238] bpb=1.079133 time=377.5s
ttt_chunk [1211/1238] bpb=1.078808 time=380.1s
ttt_chunk [1221/1238] bpb=1.078332 time=382.7s
ttt_chunk [1231/1238] bpb=1.077951 time=385.3s
ttt_chunk [1238/1238] bpb=1.077950 time=407.3s
ttt_sliding:done val_loss=2.785021 val_bpb=1.07816838 elapsed=408.8s
legal_ttt_exact val_loss:2.78502089 val_bpb:1.07816838 eval_time:409073ms
```
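The final log line reports both val_loss (nats per token) and val_bpb (bits per byte), which are related by bpb = loss / ln 2 × tokens/bytes. The tokenizer's implied bytes-per-token ratio on this validation set can be backed out from that logged pair; this is a derived quantity, not something the log prints:

```python
import math

val_loss = 2.78502089  # nats per token, from the legal_ttt_exact line
val_bpb = 1.07816838   # bits per byte, same line

bits_per_token = val_loss / math.log(2)
bytes_per_token = bits_per_token / val_bpb  # implied SP8192 compression ratio
print(round(bytes_per_token, 2))  # → 3.73 bytes per token
```

A ratio of roughly 3.7 bytes per token is plausible for an 8192-entry BPE vocabulary on English web text, which is a useful cross-check that the two logged metrics are mutually consistent.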
Review comment: The results table labels the metric as "BPP" (e.g., "Pre-quant BPP", "Sliding BPP", "TTT BPP"), but this repo's record READMEs consistently use "BPB" / `val_bpb`. Consider renaming these headers to "BPB" to avoid confusion about which metric is being reported.
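If the author adopts the suggestion, the rename is mechanical. A hypothetical GNU sed one-liner, demonstrated here on a scratch copy rather than the actual README path from this PR:

```shell
# Demo of the suggested BPP -> BPB rename on a scratch file; the real target
# would be records/.../README.md in this PR
printf '| Seed | Pre-quant BPP | Sliding BPP | TTT BPP |\n' > /tmp/bpp_demo.md
sed -i 's/\bBPP\b/BPB/g' /tmp/bpp_demo.md   # GNU sed; macOS needs -i ''
cat /tmp/bpp_demo.md
```

Word boundaries (`\b`) keep the substitution from touching any token that merely contains "BPP".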