-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) | ||
|
|
||
| **val_bpb = 1.0778** (3-seed mean, std 0.0008) | **~15.99 MB** | 8xH100 SXM | ||
|
|
||
| ## 3-Seed Results | ||
|
|
||
| | Seed | Pre-quant BPP | Sliding BPP | **TTT BPP** | | ||
| |------|---------------|-------------|-------------| | ||
| | 1337 | 1.0848 | 1.0786 | **1.0771** | | ||
| | 42 | 1.0856 | 1.0792 | **1.0776** | | ||
| | 2024 | 1.0862 | 1.0798 | **1.0787** | | ||
| | **Mean** | 1.0855 | 1.0792 | **1.0778** | | ||
|
|
||
| Merged SOTA (PR #1493): **1.0810 BPP**. Delta: **-0.0032 BPP**. | ||
|
|
||
| ## Contributions | ||
|
|
||
| ### 1. Parameter Banking with Parallel Muon (systems) | ||
|
|
||
| Restructures 66 separate weight matrices into 4 contiguous 3D parameter banks (qo, kv, mlp_up, mlp_down). Replaces DDP with manual reduce_scatter → batched Newton-Schulz → all_gather. Reduces optimizer step from 19.7ms to 1.3ms (15x faster). Critical fix: restored MuonEq-R row normalization that the refactor had dropped. Combined: **+3.8% training throughput**. | ||
|
|
||
| ### 2. Fused MLP Triton TMA Kernel (systems) | ||
|
|
||
| Fuses `fc(x) → LeakyReLU(0.5) → square` into a single Hopper TMA kernel. The 384MB MLP intermediate never touches HBM. With CUTLASS EVT backward fusion for `(grad @ proj_w) * act_grad`. Combined with banking: **+5.2% total throughput**. | ||
|
|
||
| ### 3. Muon Momentum 0.97 (training) | ||
|
|
||
| Reduced Muon optimizer momentum from default 0.99 to 0.97. Lower momentum provides less smoothing but faster adaptation to the depth-recurrent architecture. **-0.0004 BPP** improvement. | ||
|
|
||
| ### 4. Triple Depth Recurrence (architecture) | ||
|
|
||
| 17 virtual layers from 11 physical. Layers 3,4,5 looped 3x total (NUM_LOOPS=2), activated at 35% training. First legal Track B submission with triple recurrence. | ||
|
|
||
| ### 5. Eval-Time Hash Embedding (eval) | ||
|
|
||
| Zero-initialized nn.Embedding(16384, 512) created at eval time, trained through score-first TTT. Bigram hash `h = (prev_token * 2039 + curr_token) % 16384` adds learned residual before RMSNorm. | ||
|
|
||
| ### 6. TTT LR=0.01 (eval) | ||
|
|
||
| Optimized TTT learning rate from default 0.005 to 0.01. **-0.0003 BPP** free improvement. | ||
|
|
||
| ## Full Architecture | ||
|
|
||
| ``` | ||
| SP8192 tokenizer, 11 physical / 17 virtual layers | ||
| 512 dim, MLP 4x (2048 hidden), GQA 8Q/4KV, head_dim=64 | ||
| Parallel residuals L7+, QK-Gain 5.0, XSA all 11 layers | ||
| LeakyReLU(0.5)², skip gates, logit softcap 30 | ||
| MuonEq-R (lr=0.022, wd=0.095, momentum=0.97) + AdamW | ||
| EMA 0.997, warmdown 66.7%, loop at 35% | ||
| SDClip GPTQ int6 (k=12.85) + int8 embed (k=20) + brotli | ||
| Score-first TTT: SGD lr=0.01, mom=0.9, 3ep, 32K chunks | ||
| Hash embedding: 16384×512, zero-init, trained in TTT | ||
| ~36M params, ~15.99MB artifact | ||
| ``` | ||
|
|
||
| ## Compliance (Track B — Score-First TTT) | ||
|
|
||
| Per Issue #1017: | ||
| - **Condition 1:** Hash key uses prefix tokens only | ||
| - **Condition 2:** Full normalized softmax distribution | ||
| - **Condition 3:** Each chunk scored under no_grad() before TTT update | ||
| - **Condition 4:** Single left-to-right pass, no rescoring | ||
|
|
||
| No SLOT, no pre-quant TTT, no n-gram caches, no Tap-In. | ||
|
|
||
| ## Reproduction | ||
|
|
||
| ```bash | ||
| pip install brotli sentencepiece | ||
| MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 | ||
| SEED=1337 TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 \ | ||
| torchrun --standalone --nproc_per_node=8 train_gpt.py | ||
| ``` | ||
|
|
||
| Requires: CUTLASS 3.x for EVT backward fusion (optional, falls back to standard PyTorch). | ||
|
Comment on lines
+67
to
+76
|
||
|
|
||
| ## Credits | ||
|
|
||
| PR #1420 @abaybektursun (triple loop + fused kernels), PR #1394 @clarkkev (SP8192 + SDClip), PR #1471 @X-Abhishek-X (3-layer recurrence), PR #1477 @aryanbhosale (parallel residuals + score-first TTT), PR #1460 @resouer (eval-time hash embedding), PR #399 @abaybektursun (parameter banking concept), PR #1514 @dexhunter (Muon 0.97) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| {"author":"EthanYangTW","github_id":"EthanYangTW","name":"SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 + Score-First TTT + Hash Embedding","date":"2026-04-10","track":"10min_16mb","val_bpb":1.07778,"val_bpb_std":0.00080,"seeds":[1337,42,2024],"seed_results":{"1337":{"val_bpb":1.07714},"42":{"val_bpb":1.07755},"2024":{"val_bpb":1.07866}},"hardware":"8xH100 80GB SXM","pytorch_version":"2.9.1+cu128","technique_summary":"SP8192 + Triple Depth Recurrence (3,4,5 x3, 17 virtual) + Parameter Banking + Fused MLP Triton TMA + CUTLASS EVT + Muon 0.97 + Parallel Residuals (L7+) + QK-Gain 5.0 + Score-First TTT (3ep SGD lr=0.01) + Eval-Time Hash Embedding + SDClip GPTQ int6 + Brotli"} |
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,277 @@ | ||
| W0410 08:12:33.505000 225910 torch/distributed/run.py:803] | ||
| W0410 08:12:33.505000 225910 torch/distributed/run.py:803] ***************************************** | ||
| W0410 08:12:33.505000 225910 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. | ||
| W0410 08:12:33.505000 225910 torch/distributed/run.py:803] ***************************************** | ||
| Hyperparameters: | ||
| adam_eps: 1e-08 | ||
| adam_wd: 0.02 | ||
| beta1: 0.9 | ||
| beta2: 0.95 | ||
| compressor: brotli | ||
| data_dir: ./data/ | ||
| datasets_dir: ./data/datasets/fineweb10B_sp8192 | ||
| distributed: True | ||
| ema_decay: 0.997 | ||
| embed_bits: 8 | ||
| embed_clip_sigmas: 20.0 | ||
| embed_lr: 0.6 | ||
| embed_wd: 0.095 | ||
| embedding_dim: 512 | ||
| enable_looping_at: 0.35 | ||
| eval_seq_len: 2048 | ||
| eval_stride: 64 | ||
| gptq_calibration_batches: 64 | ||
| gptq_reserve_seconds: 12.0 | ||
| grad_accum_steps: 1 | ||
| grad_clip_norm: 0.3 | ||
| hash_embed_enabled: True | ||
| hash_embed_size: 16384 | ||
| head_lr: 0.008 | ||
| is_main_process: True | ||
| iterations: 20000 | ||
| ln_scale: True | ||
| local_rank: 0 | ||
| logfile: logs/90428d5c-524a-4cff-9b4e-439eedc4edc3.txt | ||
| logit_softcap: 30.0 | ||
| loop_end: 5 | ||
| loop_start: 3 | ||
| matrix_bits: 6 | ||
| matrix_clip_sigmas: 12.85 | ||
| matrix_lr: 0.022 | ||
| max_wallclock_seconds: 600.0 | ||
| min_lr: 0.0 | ||
| mlp_mult: 4.0 | ||
| model_dim: 512 | ||
| model_path: final_model.pt | ||
| muon_backend_steps: 5 | ||
| muon_beta2: 0.95 | ||
| muon_momentum: 0.97 | ||
| muon_momentum_warmup_start: 0.92 | ||
| muon_momentum_warmup_steps: 1500 | ||
| muon_row_normalize: True | ||
| muon_wd: 0.095 | ||
| num_heads: 8 | ||
| num_kv_heads: 4 | ||
| num_layers: 11 | ||
| num_loops: 2 | ||
| qk_gain_init: 5.0 | ||
| quantized_model_path: final_model.int6.ptz | ||
| rank: 0 | ||
| rope_base: 10000.0 | ||
| rope_dims: 16 | ||
| rope_train_seq_len: 2048 | ||
| run_id: 90428d5c-524a-4cff-9b4e-439eedc4edc3 | ||
| scalar_lr: 0.02 | ||
| seed: 1337 | ||
| skip_gates_enabled: True | ||
| sliding_window_enabled: True | ||
| tie_embeddings: True | ||
| tied_embed_init_std: 0.005 | ||
| tied_embed_lr: 0.03 | ||
| tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model | ||
| train_batch_tokens: 786432 | ||
| train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin | ||
| train_log_every: 500 | ||
| train_seq_len: 2048 | ||
| ttt_adamw_wd: 0.0 | ||
| ttt_batch_seqs: 32 | ||
| ttt_chunk_tokens: 32768 | ||
| ttt_enabled: True | ||
| ttt_epochs: 3 | ||
| ttt_freeze_blocks: 0 | ||
| ttt_grad_clip: 1.0 | ||
| ttt_lr: 0.01 | ||
| ttt_momentum: 0.9 | ||
| ttt_optimizer: sgd | ||
| val_batch_tokens: 524288 | ||
| val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin | ||
| val_loss_every: 4000 | ||
| vocab_size: 8192 | ||
| warmdown_frac: 0.667 | ||
| warmup_steps: 20 | ||
| world_size: 8 | ||
| xsa_last_n: 11 | ||
| train_shards: 80 | ||
| val_tokens: 40540160 | ||
| model_params:35944537 | ||
| gptq:reserving 12s, effective=588000ms | ||
| warmup_step: 1/20 | ||
| warmup_step: 2/20 | ||
| warmup_step: 3/20 | ||
| warmup_step: 4/20 | ||
| warmup_step: 5/20 | ||
| warmup_step: 6/20 | ||
| warmup_step: 10/20 | ||
| warmup_step: 20/20 | ||
| loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] | ||
| loop_warmup_step: 1/20 | ||
| loop_warmup_step: 2/20 | ||
| loop_warmup_step: 3/20 | ||
| loop_warmup_step: 4/20 | ||
| loop_warmup_step: 5/20 | ||
| loop_warmup_step: 6/20 | ||
| loop_warmup_step: 10/20 | ||
| loop_warmup_step: 20/20 | ||
| 0/20000 val_loss: 9.0095 val_bpb: 3.4878 | ||
| 1/20000 train_loss: 9.0103 train_time: 0.0m tok/s: 19203723 | ||
| 2/20000 train_loss: 12.2692 train_time: 0.0m tok/s: 13178384 | ||
| 3/20000 train_loss: 10.9250 train_time: 0.0m tok/s: 10992253 | ||
| 4/20000 train_loss: 9.3856 train_time: 0.0m tok/s: 10069344 | ||
| 5/20000 train_loss: 8.2711 train_time: 0.0m tok/s: 9588196 | ||
| 500/20000 train_loss: 3.3884 train_time: 0.8m tok/s: 7998832 | ||
| 1000/20000 train_loss: 3.2890 train_time: 1.6m tok/s: 7976048 | ||
| 1500/20000 train_loss: 3.1912 train_time: 2.5m tok/s: 7973211 | ||
| 2000/20000 train_loss: 3.1127 train_time: 3.3m tok/s: 7975349 | ||
| layer_loop:enabled step:2087 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10] | ||
| 2500/20000 train_loss: 3.1515 train_time: 4.4m tok/s: 7405479 | ||
| 3000/20000 train_loss: 2.9234 train_time: 5.6m tok/s: 6986294 | ||
| 3500/20000 train_loss: 2.9637 train_time: 6.8m tok/s: 6717228 | ||
| 4000/20000 train_loss: 2.8400 train_time: 8.0m tok/s: 6529024 | ||
| 4000/20000 val_loss: 2.8960 val_bpb: 1.1211 | ||
| 4500/20000 train_loss: 2.8532 train_time: 9.2m tok/s: 6390202 | ||
| 4737/20000 val_loss: 2.8018 val_bpb: 1.0847 | ||
| stopping_early: wallclock_cap train_time: 588152ms step: 4737/20000 | ||
| peak memory allocated: 39956 MiB reserved: 40024 MiB | ||
| ema:applying EMA weights | ||
| pre-quantization post-ema val_loss:2.80220487 val_bpb:1.08482083 eval_time:6496ms | ||
| Serialized model: 135408623 bytes | ||
| Code size: 63396 bytes | ||
| GPTQ:collecting Hessians from calibration data... | ||
| GPTQ:collected 67 Hessians in 12.4s | ||
| Quantized weights: | ||
| gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight | ||
| gptq (int8): tok_emb.weight | ||
| passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, lane_merge, skip_gates, skip_weights | ||
| Serialized model quantized+brotli: 15959849 bytes | ||
| Total submission size quantized+brotli: 16023245 bytes | ||
| quantized val_loss:2.83010605 val_bpb:1.09562225 eval_time:27068ms | ||
| quantized_sliding_window val_loss:2.78620943 val_bpb:1.07862850 eval_time:123241ms | ||
| ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.01 ttt_epochs=3 freeze_blocks=0 optimizer=sgd hash_embed=True | ||
| ttt_sliding:params unfrozen=44333145 frozen=0 | ||
| ttt_chunk [1/1238] bpb=1.116524 time=42.1s | ||
| ttt_chunk [11/1238] bpb=1.068934 time=64.4s | ||
| ttt_chunk [21/1238] bpb=1.106056 time=67.0s | ||
| ttt_chunk [31/1238] bpb=1.099473 time=69.6s | ||
| ttt_chunk [41/1238] bpb=1.093043 time=72.2s | ||
| ttt_chunk [51/1238] bpb=1.086294 time=74.8s | ||
| ttt_chunk [61/1238] bpb=1.077997 time=77.4s | ||
| ttt_chunk [71/1238] bpb=1.084929 time=80.0s | ||
| ttt_chunk [81/1238] bpb=1.078178 time=82.6s | ||
| ttt_chunk [91/1238] bpb=1.074756 time=85.2s | ||
| ttt_chunk [101/1238] bpb=1.074463 time=87.8s | ||
| ttt_chunk [111/1238] bpb=1.072804 time=90.4s | ||
| ttt_chunk [121/1238] bpb=1.075915 time=93.0s | ||
| ttt_chunk [131/1238] bpb=1.079574 time=95.6s | ||
| ttt_chunk [141/1238] bpb=1.080044 time=98.2s | ||
| ttt_chunk [151/1238] bpb=1.079880 time=100.8s | ||
| ttt_chunk [161/1238] bpb=1.080492 time=103.4s | ||
| ttt_chunk [171/1238] bpb=1.080322 time=106.0s | ||
| ttt_chunk [181/1238] bpb=1.078888 time=108.7s | ||
| ttt_chunk [191/1238] bpb=1.078744 time=111.3s | ||
| ttt_chunk [201/1238] bpb=1.076331 time=113.8s | ||
| ttt_chunk [211/1238] bpb=1.080742 time=116.4s | ||
| ttt_chunk [221/1238] bpb=1.081103 time=119.0s | ||
| ttt_chunk [231/1238] bpb=1.082775 time=121.6s | ||
| ttt_chunk [241/1238] bpb=1.080991 time=124.2s | ||
| ttt_chunk [251/1238] bpb=1.081014 time=126.8s | ||
| ttt_chunk [261/1238] bpb=1.082135 time=129.4s | ||
| ttt_chunk [271/1238] bpb=1.082521 time=132.0s | ||
| ttt_chunk [281/1238] bpb=1.081860 time=134.6s | ||
| ttt_chunk [291/1238] bpb=1.083032 time=137.2s | ||
| ttt_chunk [301/1238] bpb=1.083233 time=139.8s | ||
| ttt_chunk [311/1238] bpb=1.082095 time=142.4s | ||
| ttt_chunk [321/1238] bpb=1.081955 time=145.0s | ||
| ttt_chunk [331/1238] bpb=1.082203 time=147.6s | ||
| ttt_chunk [341/1238] bpb=1.081312 time=150.2s | ||
| ttt_chunk [351/1238] bpb=1.082068 time=152.8s | ||
| ttt_chunk [361/1238] bpb=1.080987 time=155.4s | ||
| ttt_chunk [371/1238] bpb=1.079450 time=158.0s | ||
| ttt_chunk [381/1238] bpb=1.079823 time=160.6s | ||
| ttt_chunk [391/1238] bpb=1.079473 time=163.2s | ||
| ttt_chunk [401/1238] bpb=1.079583 time=165.8s | ||
| ttt_chunk [411/1238] bpb=1.080115 time=168.4s | ||
| ttt_chunk [421/1238] bpb=1.079621 time=171.0s | ||
| ttt_chunk [431/1238] bpb=1.079801 time=173.6s | ||
| ttt_chunk [441/1238] bpb=1.079829 time=176.2s | ||
| ttt_chunk [451/1238] bpb=1.080983 time=178.8s | ||
| ttt_chunk [461/1238] bpb=1.079211 time=181.3s | ||
| ttt_chunk [471/1238] bpb=1.079199 time=183.9s | ||
| ttt_chunk [481/1238] bpb=1.079363 time=186.5s | ||
| ttt_chunk [491/1238] bpb=1.079796 time=189.1s | ||
| ttt_chunk [501/1238] bpb=1.079444 time=191.7s | ||
| ttt_chunk [511/1238] bpb=1.079065 time=194.3s | ||
| ttt_chunk [521/1238] bpb=1.078585 time=196.9s | ||
| ttt_chunk [531/1238] bpb=1.078517 time=199.5s | ||
| ttt_chunk [541/1238] bpb=1.078623 time=202.1s | ||
| ttt_chunk [551/1238] bpb=1.078109 time=204.7s | ||
| ttt_chunk [561/1238] bpb=1.077421 time=207.3s | ||
| ttt_chunk [571/1238] bpb=1.076834 time=209.9s | ||
| ttt_chunk [581/1238] bpb=1.077163 time=212.5s | ||
| ttt_chunk [591/1238] bpb=1.077398 time=215.2s | ||
| ttt_chunk [601/1238] bpb=1.077349 time=217.8s | ||
| ttt_chunk [611/1238] bpb=1.077920 time=220.4s | ||
| ttt_chunk [621/1238] bpb=1.078738 time=223.0s | ||
| ttt_chunk [631/1238] bpb=1.078804 time=225.6s | ||
| ttt_chunk [641/1238] bpb=1.079260 time=228.1s | ||
| ttt_chunk [651/1238] bpb=1.079598 time=230.7s | ||
| ttt_chunk [661/1238] bpb=1.078928 time=233.3s | ||
| ttt_chunk [671/1238] bpb=1.078663 time=235.9s | ||
| ttt_chunk [681/1238] bpb=1.079945 time=238.5s | ||
| ttt_chunk [691/1238] bpb=1.080136 time=241.1s | ||
| ttt_chunk [701/1238] bpb=1.079955 time=243.7s | ||
| ttt_chunk [711/1238] bpb=1.080626 time=246.3s | ||
| ttt_chunk [721/1238] bpb=1.080899 time=248.9s | ||
| ttt_chunk [731/1238] bpb=1.080251 time=251.5s | ||
| ttt_chunk [741/1238] bpb=1.079896 time=254.1s | ||
| ttt_chunk [751/1238] bpb=1.078964 time=256.8s | ||
| ttt_chunk [761/1238] bpb=1.078348 time=259.3s | ||
| ttt_chunk [771/1238] bpb=1.077305 time=261.9s | ||
| ttt_chunk [781/1238] bpb=1.077280 time=264.5s | ||
| ttt_chunk [791/1238] bpb=1.077593 time=267.1s | ||
| ttt_chunk [801/1238] bpb=1.077851 time=269.7s | ||
| ttt_chunk [811/1238] bpb=1.077336 time=272.3s | ||
| ttt_chunk [821/1238] bpb=1.076122 time=274.9s | ||
| ttt_chunk [831/1238] bpb=1.075773 time=277.5s | ||
| ttt_chunk [841/1238] bpb=1.075276 time=280.1s | ||
| ttt_chunk [851/1238] bpb=1.074955 time=282.7s | ||
| ttt_chunk [861/1238] bpb=1.074606 time=285.3s | ||
| ttt_chunk [871/1238] bpb=1.074471 time=287.8s | ||
| ttt_chunk [881/1238] bpb=1.073991 time=290.4s | ||
| ttt_chunk [891/1238] bpb=1.073450 time=293.1s | ||
| ttt_chunk [901/1238] bpb=1.073799 time=295.7s | ||
| ttt_chunk [911/1238] bpb=1.073483 time=298.2s | ||
| ttt_chunk [921/1238] bpb=1.073746 time=300.8s | ||
| ttt_chunk [931/1238] bpb=1.074397 time=303.4s | ||
| ttt_chunk [941/1238] bpb=1.074786 time=306.0s | ||
| ttt_chunk [951/1238] bpb=1.074704 time=308.6s | ||
| ttt_chunk [961/1238] bpb=1.075537 time=311.2s | ||
| ttt_chunk [971/1238] bpb=1.075919 time=313.8s | ||
| ttt_chunk [981/1238] bpb=1.076252 time=316.4s | ||
| ttt_chunk [991/1238] bpb=1.076020 time=319.0s | ||
| ttt_chunk [1001/1238] bpb=1.076038 time=321.6s | ||
| ttt_chunk [1011/1238] bpb=1.076366 time=324.2s | ||
| ttt_chunk [1021/1238] bpb=1.077058 time=326.8s | ||
| ttt_chunk [1031/1238] bpb=1.077493 time=329.4s | ||
| ttt_chunk [1041/1238] bpb=1.077979 time=332.0s | ||
| ttt_chunk [1051/1238] bpb=1.077907 time=334.6s | ||
| ttt_chunk [1061/1238] bpb=1.077911 time=337.2s | ||
| ttt_chunk [1071/1238] bpb=1.078056 time=339.8s | ||
| ttt_chunk [1081/1238] bpb=1.077949 time=342.4s | ||
| ttt_chunk [1091/1238] bpb=1.078151 time=345.0s | ||
| ttt_chunk [1101/1238] bpb=1.078670 time=347.6s | ||
| ttt_chunk [1111/1238] bpb=1.078942 time=350.2s | ||
| ttt_chunk [1121/1238] bpb=1.079110 time=352.8s | ||
| ttt_chunk [1131/1238] bpb=1.078766 time=355.4s | ||
| ttt_chunk [1141/1238] bpb=1.078415 time=358.0s | ||
| ttt_chunk [1151/1238] bpb=1.078445 time=360.6s | ||
| ttt_chunk [1161/1238] bpb=1.078565 time=363.2s | ||
| ttt_chunk [1171/1238] bpb=1.078340 time=365.8s | ||
| ttt_chunk [1181/1238] bpb=1.077874 time=368.4s | ||
| ttt_chunk [1191/1238] bpb=1.078011 time=371.0s | ||
| ttt_chunk [1201/1238] bpb=1.078072 time=373.6s | ||
| ttt_chunk [1211/1238] bpb=1.077756 time=376.2s | ||
| ttt_chunk [1221/1238] bpb=1.077276 time=378.7s | ||
| ttt_chunk [1231/1238] bpb=1.076906 time=381.4s | ||
| ttt_chunk [1238/1238] bpb=1.076905 time=402.1s | ||
| ttt_sliding:done val_loss=2.782366 val_bpb=1.07714044 elapsed=403.0s | ||
| legal_ttt_exact val_loss:2.78236563 val_bpb:1.07714044 eval_time:403222ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The README claims an artifact size of "~15.99 MB", but the training logs in this folder report total submission sizes of ~16.02 MB (e.g., 16,032,371 bytes on seed 42). Please reconcile these numbers and clarify whether the cap is 16,000,000 bytes or 16 MiB, and ensure the reported artifact fits the enforced limit.