diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md new file mode 100644 index 0000000000..2249d50228 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/README.md @@ -0,0 +1,204 @@ +# Recurrent Depth with Progressive Pass Growth + Error Feedback + +**val_bpb: 1.1163** (3-seed mean, std 0.0013) | **~15.96 MB** | 8×H100 SXM + +A non-record submission targeting significant improvement over [PR #549](https://github.com/openai/parameter-golf/pull/549) (LeakyReLU² baseline, 1.1194 mean bpb). Achieves **-0.0031 bpb** vs that baseline. For an in-depth analysis of depth recurrence in this competition, see [PR #363](https://github.com/openai/parameter-golf/pull/363). I targeted PR #549 when I started building this solution; by the time I finished evaluation, a newer, improved model had been published to the leaderboard. However, I believe the techniques here can be applied to any model to improve performance, with the largest benefit for submissions using TTT, since the recurrence makes use of the 10 available minutes of evaluation time very effectively. 
+ +## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128) + + +| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact | +| -------- | ---------- | --------- | ----------- | ----------------------- | ----------- | --------- | ---------- | +| 1337 | 83.5ms | 6,328 | 1.1353 | **1.1157** | -0.0196 | 566s | 15,909,018 | +| 42 | 83.5ms | 6,334 | 1.1372 | **1.1177** | -0.0195 | 579s | 15,897,530 | +| 2025 | 83.4ms | 6,334 | 1.1351 | **1.1155** | -0.0197 | 588s | 15,995,558 | +| **Mean** | **83.5ms** | **6,332** | **1.1359** | **1.1163 (std 0.0013)** | **-0.0196** | **~578s** | | + +We significantly beat the [PR #549](https://github.com/openai/parameter-golf/pull/549) LeakyReLU² baseline (1.1194 mean bpb / 1.8901 nats) by **-0.0031 bpb / -0.0053 nats** across all three seeds (1.1163 mean bpb / 1.8848 nats), achieving the goal we set out with. + +## Progressive Recurrence Architecture + +``` + ┌───────────┐ ┌───────────┐ ┌───────────┐ + │ │ │ │ │ │ + │ Tail │ │ Tail │ │ Tail │ + │ [7-10] │ │ [7-10] │ │ [7-10] │ + │ │ │ │ │ │ + ├───────────┤ ├───────────┤╮ ├───────────┤╮ + │ │ 4500 steps │ ││ 1000 steps │ ││ + │ Core │ ───────────> │ Core ││ ──────────> │ Core ││ + │ [4-6] │ │ [4-6] │2x │ [4-6] │3x + │ │ │ ││ │ ││ + ├───────────┤ ├───────────┤╯ ├───────────┤╯ + │ │ │ │ │ │ + │ Stem │ │ Stem │ │ Stem │ + │ [0-3] │ │ [0-3] │ │ [0-3] │ + │ │ │ │ │ │ + └───────────┘ └───────────┘ └───────────┘ + + 11 layers 14 layers 17 layers + (steps 0-4499) (steps 4500-5499) (steps 5500+, eval) +``` + +## The Problem: Depth Recurrence Fails Under Competition Constraints + +[PR #363](https://github.com/openai/parameter-golf/pull/363) demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but *hurts* bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a **+0.025 bpb gap** (looped worse) due to two compounding taxes: + +1. 
**Quantization error amplification.** When shared weights are quantized to int6, the quantization error is injected at every pass. After K passes through the same core, the cumulative error grows superlinearly. Additionally, hidden state magnitudes tend to explode with too many recurrent passes through a block if we do not stabilize them.
+2. **Step time overhead.** Each additional recurrence pass adds forward/backward compute. With 4 passes, +32ms/step translates to ~1200 fewer training steps in the 600s budget.
+
+## Our Solution: Late Growth + Contractive Stabilization
+
+We address both taxes by growing recurrence depth progressively during training and stabilizing the recurrent dynamics.
+
+### Progressive Pass Schedule (Late Growth)
+
+The key insight: **start training with 1 pass and gradually add passes late in training**. This preserves fast step times for the majority of training (83.5ms/step at 1-pass vs ~95ms at 3-pass), maximizing the total number of gradient updates within the 600s wallclock budget. The schedule:
+
+
| Step range | Passes | Effective layers | step_avg |
| ---------- | ------ | ---------------- | -------- |
| 0–4499 | 1 | 11 | ~83.5ms |
| 4500–5499 | 2 | 14 | ~85.5ms |
| 5500–6328 | 3 | 17 | ~91ms |
+
+
+This reduces the step/capacity trade-off that normally makes recurrence impractical under competition constraints. We get ~6,330 training steps (vs ~7,180 for the flat LeakyReLU baseline), but the final model has 17 effective layers at eval vs the baseline's 11.
+
+We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time cost (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, **3-pass wins the step/capacity trade-off**: the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes. 
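The schedule above amounts to a simple step-to-pass-count mapping. A minimal sketch (the helper name `passes_for_step` is illustrative; the actual script configures this via the `PASSES_SCHEDULE` env var):

```python
# Hypothetical helper illustrating the late-growth schedule from the table
# above. Thresholds 4500/5500 come from the README; names are illustrative.
def passes_for_step(step: int) -> int:
    """Return the number of recurrence passes to use at a given training step."""
    if step < 4500:
        return 1  # 11 effective layers, ~83.5ms/step
    if step < 5500:
        return 2  # 14 effective layers, ~85.5ms/step
    return 3      # 17 effective layers, used through the end of training and eval
```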
+
+### Learnable Residual Scaling
+
+Per-pass learnable scalars contract the residual update, preventing hidden state magnitude growth across passes:
+
+$$h_{k+1} = h_k + \alpha_k \cdot F(h_k + c_k)$$
+
+where $\alpha_k$ is initialized to 0.5 and learned during training. This ensures the recurrent dynamics are contractive — later passes refine rather than amplify.
+
+### Error Feedback Module
+
+A low-rank correction compensates for accumulated error before each recurrence pass:
+
+$$e_k = U(V^\top h_k), \qquad c_k = \mathrm{diag}(d) \cdot e_k$$
+
+where $U, V \in \mathbb{R}^{d \times r}$ with rank $r=2$ and $d \in \mathbb{R}^d$ is a learnable diagonal. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: **2,560 params** (negligible vs 26.7M model params).
+
+The feedback module is important but not strictly required — we confirmed that stable training is possible without it, and even running eval-only without feedback works, at a cost of ~0.001 bpb higher. The feedback module's main contribution is providing the recurrent passes with an error signal about the previous iteration's residual.
+
+### Jacobian Proxy Loss (Stabilizer)
+
+A regularization term penalizes hidden state growth ratio above 1.0, enforcing contractive dynamics without computing the full Jacobian:
+
+$$\mathcal{L}_J = \lambda \cdot \mathrm{ReLU}\left(\frac{\|h_{k+1} - h_k\|}{\|h_k\| + \epsilon} - 1\right)^{2}$$
+
+with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (contractive map). The model learns to adhere to this quickly, and it does not seem to affect early training dynamics. However, we did see better results with $\lambda = 0.01$ than with $0.1$, potentially because 0.1 is too restrictive: with only 3× recurrence we don't always need contractive layers, but we do need them not to explode. 
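Putting the three mechanisms above together, one recurrence pass through the shared core can be sketched as follows. This is a simplified stand-in for the script's `ResidualScale`, `ErrorFeedbackModule`, and `RecurrentStabilizer` classes — names, shapes, and the `core`/`feedback` callables here are illustrative:

```python
import torch

# Simplified sketch of one recurrence pass k, assuming:
#   h        — hidden state (batch, dim)
#   core     — the shared layer stack F
#   feedback — low-rank error-feedback module (e.g. h -> diag(d) * U(V^T h))
#   alpha    — per-pass learnable residual scales, init 0.5
def recurrent_pass(h, core, feedback, alpha, pass_idx, lam=0.01, eps=1e-6):
    # Error feedback c_k is zero on pass 0 (no prior error to correct).
    c = feedback(h) if pass_idx > 0 else torch.zeros_like(h)
    # Contractive residual update: h_{k+1} = h_k + alpha_k * F(h_k + c_k).
    h_next = h + alpha[pass_idx] * core(h + c)
    # Jacobian proxy: penalize the norm growth ratio when it exceeds 1.
    ratio = (h_next - h).norm() / (h.norm() + eps)
    proxy_loss = lam * torch.relu(ratio - 1.0).square()
    return h_next, proxy_loss
```

During training, `proxy_loss` from each pass would be summed into the cross-entropy objective; at eval only `h_next` is used.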
+
+This loss term is critical for training stability. **Without it, gradient norms and hidden state magnitudes explode** during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation.
+
+Note: the Jacobian proxy loss is only added to the training loss — it does not affect evaluation scoring, which uses pure cross-entropy.
+
+## Legal TTT Protocol
+
+Score-first legal TTT following [PR #461](https://github.com/openai/parameter-golf/pull/461):
+
+1. Val tokens split into 1,893 non-overlapping 32K-token chunks. Here 3-pass recurrence is vital, since with 4 passes we would have to increase the chunk size to fit within the time limit.
+2. **For each chunk**:
+   - **SCORE**: Sliding window eval under `torch.inference_mode()` — no gradients, no weight mutation
+   - **TRAIN**: SGD on the already-scored chunk. 3 epochs, all blocks unfrozen, cosine LR decay, grad clip 1.0
+3. Last chunk scored but never trained on
+
+
| Parameter | Value |
| ---------------- | --------------------------------- |
| Chunk size | 32,768 tokens |
| Optimizer | SGD + momentum(0.9) |
| Learning rate | 0.002 (cosine decay) |
| Epochs per chunk | 3 |
| Frozen blocks | None (all blocks adapt) |
| Gradient clip | 1.0 |
| Eval passes | 3 (matching final training phase) |
+
+
+### Timing Budget
+
+
| Phase | Time |
| ------------------------------------- | -------------------- |
| Training (wallclock cap) | 600s (10 min) |
| Standard eval (int6 + sliding window) | ~3s |
| Legal TTT (score-first + adaptation) | ~578s |
| **Total eval** | **~581s (< 10 min)** |
+
+
+## Architecture
+
+Built on the [PR #414](https://github.com/openai/parameter-golf/pull/414) stack with [PR #399](https://github.com/openai/parameter-golf/pull/399) Parallel Muon:
+
+
| Component | Setting |
| ----------------------- | ----------------------------------------------------------- |
| Layers | 11 unique (512d, 8H, 4KV) | 
+| Effective layers (eval) | 17 (4 stem + 3 core ×3 + 4 tail) | +| MLP | 3× with LeakyReLU(0.5)² | +| BigramHash | 512 | +| XSA | Last 4 layers | +| RoPE | Partial (16/64 dims) | +| LN Scale | 1/√(layer+1) | +| VE128 | Layers 9-10 | +| Recurrence core | Layers 4-6, progressive 1→2→3 passes | +| ResidualScale | Per-pass learnable, init 0.5 | +| Error Feedback | Diagonal mode, rank 2, 2560 params | +| Jacobian proxy | λ=0.01 | +| Weight avg | EMA(0.997) + SWA(every 50) | +| Quantization | Late QAT (threshold 0.15) + GPTQ-lite int6 + lzma | +| Warmup precompilation | All pass×QAT graph variants compiled during 20 warmup steps | +| Optimizer | Parameter Banking + Parallel Muon | + + +## Run Command + +```bash +cd records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance +bash run_earlyqat.sh # Single seed (set SEED env var) +``` + +Key flags: + +```bash +torchrun --standalone --nproc_per_node=8 train_gpt.py \ + --feedback-mode diagonal --feedback-rank 2 \ + --residual-scale-init 0.5 \ + --jacobian-proxy-weight 0.01 \ + --no-interpass-rmsnorm +``` + +## Tricks + +### Graph Precompilation Warmup + +`torch.compile` is lazy — it only compiles a new graph variant the first time it's encountered. With progressive recurrence (1→2→3 passes) and late QAT, this means the training loop would hit compilation stalls at step 4500 (2-pass), step 5500 (3-pass), and again when QAT enables. Under a 600s wallclock cap, these stalls are expensive. + +The fix: **precompile all graph variants during warmup before training starts**. During the 20 warmup steps: + +1. The last few warmup steps cycle through each `num_passes` variant (2-pass, 3-pass) and each with QAT toggled on +2. This forces `torch.compile` to eagerly compile every forward/backward graph that will appear during training +3. 
After warmup, model weights and optimizer states are restored to their initial values — the warmup steps have zero effect on the actual training run + +This ensures the training loop runs at full speed from step 0 with no compilation jitter when passes change or QAT kicks in. + +### Code Minification with python-minifier + +The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). After removing dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging), the file was still too large. + +[python-minifier](https://github.com/dflook/python-minifier) with `--no-rename-locals` shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names — critical because the training script uses string-based lookups for `state_dict` keys and `named_parameters`. This brought the file from 68,435 bytes down to **58,186 bytes**, comfortably fitting all seeds under the 16MB decimal limit. + +**Note:** The code was minified *after* all three seed runs completed, so the log files report `Code size: 88253 bytes` and correspondingly larger `Total submission size` values. The actual submission uses the minified 58,186-byte script — the correct per-seed totals are listed in `submission.json` and the results table above. 
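The graph-precompilation warmup trick above follows a simple save/cycle/restore pattern. A minimal sketch, with hypothetical names (`warmup_precompile`, `run_variant`) standing in for the script's actual warmup loop:

```python
import copy

# Hypothetical sketch of the precompilation warmup: run every
# (num_passes, qat_enabled) graph variant once so torch.compile builds
# each graph eagerly, then restore state so warmup has zero effect.
def warmup_precompile(state, run_variant,
                      variants=((1, False), (2, False), (3, False),
                                (2, True), (3, True))):
    snapshot = copy.deepcopy(state)     # save weights + optimizer state
    for num_passes, qat in variants:
        run_variant(num_passes, qat)    # first call triggers compilation
    state.clear()
    state.update(snapshot)              # roll back any warmup updates
```

Each (passes, QAT) combination is a distinct compiled graph, which is also why the script raises `torch._dynamo.config.recompile_limit` to 32.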
+ +## Credits + +- **Base model**: [PR #414](https://github.com/openai/parameter-golf/pull/414) by @signalrush +- **Optimizer (Parameter Banking + Parallel Muon)**: [PR #399](https://github.com/openai/parameter-golf/pull/399) by @abaybektursun +- **LeakyReLU² activation**: [PR #493](https://github.com/openai/parameter-golf/pull/493) by @parinzee +- **TTT recipe**: [PR #461](https://github.com/openai/parameter-golf/pull/461) by @Christopher-Lee-McClendon +- **Depth recurrence analysis**: [PR #363](https://github.com/openai/parameter-golf/pull/363) by @evangelinehelsinki + diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json new file mode 100644 index 0000000000..fd2cccba00 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/submission.json @@ -0,0 +1,42 @@ +{ + "name": "Recurrent Depth with Progressive Pass Growth + Error Feedback", + "val_bpb": 1.1163, + "val_bpb_std": 0.0013, + "bytes_total": 15995558, + "bytes_code": 58186, + "seeds": [1337, 42, 2025], + "seed_results": { + "1337": { + "val_loss": 1.88375543, + "val_bpb": 1.11566902, + "pre_ttt_val_bpb": 1.1353, + "ttt_time_seconds": 565.6, + "steps": 6328, + "bytes_model_int6_lzma": 15850832, + "bytes_total": 15909018 + }, + "42": { + "val_loss": 1.88715720, + "val_bpb": 1.11768375, + "pre_ttt_val_bpb": 1.1372, + "ttt_time_seconds": 579.0, + "steps": 6334, + "bytes_model_int6_lzma": 15839344, + "bytes_total": 15897530 + }, + "2025": { + "val_loss": 1.88338589, + "val_bpb": 1.11545016, + "pre_ttt_val_bpb": 1.1351, + "ttt_time_seconds": 588.0, + "steps": 6334, + "bytes_model_int6_lzma": 15937372, + "bytes_total": 15995558 + } + }, + "wallclock_seconds": 600, + "blurb": "Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. 
Late growth preserves fast step times for most of training, avoiding the step/capacity trade-off that makes naive recurrence impractical. 3-seed mean: 1.1163 (std 0.0013), -0.0031 vs PR #549 LeakyReLU baseline (1.1194). Built on PR #414 stack with Parallel Muon (PR #399). All artifacts under 16MB, all eval under 10 min.", + "author": "nestamidavaine", + "github_id": "nestamidavaine", + "date": "2026-04-01" +} diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_gpt.py b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_gpt.py new file mode 100644 index 0000000000..42b03eb369 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_gpt.py @@ -0,0 +1,617 @@ +from __future__ import annotations +_Z='passthrough_ctrl' +_Y='passthrough' +_X='momentum' +_W='shard_mom' +_V='padded_grad' +_U='fineweb_train_*.bin' +_T='diagonal' +_S='.scale' +_R='mlp_down_bank' +_Q='mlp_up_bank' +_P='kv_bank' +_O='qo_bank' +_N='attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,ve_layer_scales,ve_shared.scale' +_M='shard' +_L='scale' +_K='full_update' +_J='utf-8' +_I='cuda' +_H='0' +_G='lr' +_F='params' +_E=.0 +_D=False +_C=1. 
+_B=True +_A=None +import copy,glob,io,lzma,math,os,random,time,uuid +from pathlib import Path +import numpy as np,sentencepiece as spm,torch,torch._dynamo +torch._dynamo.config.recompile_limit=32 +import torch.distributed as dist,torch.nn.functional as F +from torch import Tensor,nn +_gpu_mem_frac=float(os.environ.get('CUDA_MEM_FRACTION',_H)) +if _gpu_mem_frac>0:torch.cuda.set_per_process_memory_fraction(_gpu_mem_frac,0) +from flash_attn_interface import flash_attn_func as flash_attn_3_func +import argparse +class RecurrentStabilizer: + def __init__(self,jacobian_proxy_weight=_E,eps=1e-06,**kw):self.jacobian_proxy_weight=jacobian_proxy_weight;self.eps=eps + def clip(self,h):return h + def jacobian_proxy_loss(self,h_in,h_out): + if self.jacobian_proxy_weight<=0:return h_in.new_zeros(()) + delta=h_out-h_in;ratio=delta.norm()/(h_in.norm()+self.eps);return self.jacobian_proxy_weight*torch.relu(ratio-_C).square() + def reset(self):0 +class ResidualScale(nn.Module): + def __init__(self,num_passes,init_value=_C):super().__init__();self.scales=nn.Parameter(torch.full((num_passes,),init_value,dtype=torch.float32)) + def forward(self,residual,pass_idx):return self.scales[pass_idx].to(dtype=residual.dtype)*residual +class LowRankResidual(nn.Module): + def __init__(self,dim,rank=2):super().__init__();self.V=nn.Parameter(torch.zeros(dim,rank));self.U=nn.Parameter(torch.zeros(dim,rank)) + def forward(self,h):return h@self.V@self.U.T +class DiagonalFeedback(nn.Module): + def __init__(self,dim,init_ones=_D):super().__init__();init_val=torch.ones(dim)if init_ones else torch.zeros(dim);self.d=nn.Parameter(init_val) + def forward(self,e):return self.d.to(dtype=e.dtype)*e +class ErrorFeedbackModule(nn.Module): + def __init__(self,dim,rank=2,feedback_mode=_T,per_pass=_D,num_passes=3,**kw): + super().__init__();self.per_pass=per_pass;self.residual=LowRankResidual(dim,rank) + if feedback_mode=='identity':self.correction=_A + elif 
per_pass:self.correction=nn.ModuleList([DiagonalFeedback(dim)for _ in range(num_passes)]) + else:self.correction=DiagonalFeedback(dim) + def forward(self,h,pass_idx): + e=self.residual(h) + if self.correction is _A:c=e + elif self.per_pass:c=self.correction[pass_idx](e) + else:c=self.correction(e) + mask=torch.tensor(_C if pass_idx>0 else _E,device=h.device,dtype=h.dtype);return c*mask + def param_count(self):return sum(p.numel()for p in self.parameters()) +_e=os.environ.get +_i=lambda k,d:int(_e(k,d)) +_f=lambda k,d:float(_e(k,d)) +_b=lambda k,d:bool(int(_e(k,d))) +class Hyperparameters:data_path=_e('DATA_PATH','./data/datasets/fineweb10B_sp1024');train_files=os.path.join(data_path,_U);val_files=os.path.join(data_path,'fineweb_val_*.bin');tokenizer_path=_e('TOKENIZER_PATH','./data/tokenizers/fineweb_1024_bpe.model');run_id=_e('RUN_ID',str(uuid.uuid4()));seed=_i('SEED',1337);val_batch_size=_i('VAL_BATCH_SIZE',524288);val_loss_every=_i('VAL_LOSS_EVERY',4000);train_log_every=_i('TRAIN_LOG_EVERY',500);iterations=_i('ITERATIONS',20000);warmdown_iters=_i('WARMDOWN_ITERS',3500);warmup_steps=_i('WARMUP_STEPS',20);train_batch_tokens=_i('TRAIN_BATCH_TOKENS',786432);train_seq_len=_i('TRAIN_SEQ_LEN',2048);eval_seq_len=_i('EVAL_SEQ_LEN',2048);max_wallclock_seconds=_f('MAX_WALLCLOCK_SECONDS',6e2);qk_gain_init=_f('QK_GAIN_INIT',1.5);vocab_size=_i('VOCAB_SIZE',1024);num_layers=_i('NUM_LAYERS',11);num_kv_heads=_i('NUM_KV_HEADS',4);model_dim=_i('MODEL_DIM',512);num_heads=_i('NUM_HEADS',8);mlp_mult=_f('MLP_MULT',3.);tie_embeddings=_b('TIE_EMBEDDINGS','1');rope_base=_f('ROPE_BASE',1e4);logit_softcap=_f('LOGIT_SOFTCAP',3e1);embed_lr=_f('EMBED_LR',.6);head_lr=_f('HEAD_LR',.008);tied_embed_lr=_f('TIED_EMBED_LR',.035);tied_embed_init_std=_f('TIED_EMBED_INIT_STD',.005);matrix_lr=_f('MATRIX_LR',.025);scalar_lr=_f('SCALAR_LR',.025);muon_momentum=_f('MUON_MOMENTUM',.99);muon_backend_steps=_i('MUON_BACKEND_STEPS',5);muon_momentum_warmup_start=_f('MUON_MOMENTUM_WARMUP_START',.92);muon_momentum_
warmup_steps=_i('MUON_MOMENTUM_WARMUP_STEPS',1500);beta1=_f('BETA1',.9);beta2=_f('BETA2',.95);adam_eps=_f('ADAM_EPS',1e-08);grad_clip_norm=_f('GRAD_CLIP_NORM',.3);eval_stride=_i('EVAL_STRIDE',64);muon_beta2=_f('MUON_BETA2',.95);swa_enabled=_b('SWA_ENABLED','1');swa_every=_i('SWA_EVERY',50);muon_wd=_f('MUON_WD',.04);adam_wd=_f('ADAM_WD',.04);qat_enabled=_b('QAT_ENABLED',_H);xsa_last_n=_i('XSA_LAST_N',4);rope_dims=_i('ROPE_DIMS',16);ln_scale=_b('LN_SCALE','1');late_qat_threshold=_f('LATE_QAT_THRESHOLD',.15);ttt_enabled=_b('TTT_ENABLED',_H);ttt_lr=_f('TTT_LR',.002);ttt_epochs=_i('TTT_EPOCHS',3);ttt_chunk_tokens=_i('TTT_CHUNK_TOKENS',32768);ttt_freeze_blocks=_i('TTT_FREEZE_BLOCKS',2);ttt_momentum=_f('TTT_MOMENTUM',.9);ttt_batch_seqs=_i('TTT_BATCH_SEQS',32);ttt_grad_clip=_f('TTT_GRAD_CLIP',_C);core_start=_i('CORE_START',3);core_end=_i('CORE_END',8);num_passes=_i('NUM_PASSES',1);core_quant_bits=_i('CORE_QUANT_BITS',6);core_quant_enabled=_b('CORE_QUANT_ENABLED',_H);eval_passes=_i('EVAL_PASSES',0);passes_schedule_str=_e('PASSES_SCHEDULE','');bigram_vocab_size=_i('BIGRAM_VOCAB_SIZE',0);bigram_dim=_i('BIGRAM_DIM',32);ve_enabled=_b('VE_ENABLED',_H);ve_dim=_i('VE_DIM',128);ve_layers=_e('VE_LAYERS','9,10') +def zeropower_via_newtonschulz5(G,steps=5,eps=1e-07): + a,b,c=3.4445,-4.775,2.0315;was_2d=G.ndim==2 + if was_2d:G=G.unsqueeze(0) + X=G.bfloat16();transposed=X.size(-2)>X.size(-1) + if transposed:X=X.mT + X=X/(X.norm(dim=(-2,-1),keepdim=_B)+eps) + for _ in range(steps):A=X@X.mT;B=b*A+c*(A@A);X=a*X+B@X + if transposed:X=X.mT + if was_2d:X=X.squeeze(0) + return X +class Muon(torch.optim.Optimizer): + def __init__(self,params,lr,momentum,backend_steps,nesterov=_B,weight_decay=_E):super().__init__(params,dict(lr=lr,momentum=momentum,backend_steps=backend_steps,nesterov=nesterov,weight_decay=weight_decay));self._built=_D + def _build(self): + self._distributed=dist.is_available()and dist.is_initialized();self._world_size=dist.get_world_size()if self._distributed else 
1;self._rank=dist.get_rank()if self._distributed else 0;ws=self._world_size;self._bank_meta=[] + for group in self.param_groups: + for p in group[_F]:B=p.shape[0];padded_B=(B+ws-1)//ws*ws;shard_B=padded_B//ws;tail=p.shape[1:];dev=p.device;self._bank_meta.append({'p':p,'B':B,_V:torch.zeros(padded_B,*tail,device=dev,dtype=torch.bfloat16),_M:torch.zeros(shard_B,*tail,device=dev,dtype=torch.bfloat16),_W:torch.zeros(shard_B,*tail,device=dev,dtype=torch.bfloat16),_K:torch.zeros(padded_B,*tail,device=dev,dtype=torch.bfloat16),_L:max(1,p.shape[-2]/p.shape[-1])**.5}) + self._bank_meta.sort(key=lambda m:-m['p'].numel());self._built=_B + def launch_reduce_scatters(self): + '' + if not self._built:self._build() + if not self._distributed:return + self._rs_futures=[] + for m in self._bank_meta: + p=m['p'] + if p.grad is _A:self._rs_futures.append(_A);continue + pg=m[_V];pg[:m['B']].copy_(p.grad.bfloat16()) + if pg.shape[0]>m['B']:pg[m['B']:].zero_() + fut=dist.reduce_scatter_tensor(m[_M],pg,op=dist.ReduceOp.AVG,async_op=_B);self._rs_futures.append(fut) + @torch.no_grad() + def step(self,closure=_A): + '';B='_rs_futures';A='momentum_buffer';loss=_A + if closure is not _A: + with torch.enable_grad():loss=closure() + if not self._built:self._build() + for group in self.param_groups: + lr=group[_G];momentum=group[_X];backend_steps=group['backend_steps'];nesterov=group['nesterov'];wd=group.get('weight_decay',_E);prev_ag_handle=_A;prev_m=_A;sharded=self._distributed and hasattr(self,B) + for(i,m)in enumerate(self._bank_meta): + p=m['p'] + if p.grad is _A:continue + if prev_ag_handle is not _A: + prev_ag_handle.wait();pp=prev_m['p'];upd=prev_m[_K][:prev_m['B']] + if wd>_E:pp.data.mul_(_C-lr*wd) + pp.add_(upd.to(dtype=pp.dtype),alpha=-lr*prev_m[_L]) + if sharded and self._rs_futures[i]is not _A:self._rs_futures[i].wait();g=m[_M];buf=m[_W] + else: + g=p.grad.bfloat16();state=self.state[p] + if A not in state:state[A]=torch.zeros_like(g) + buf=state[A] + buf.mul_(momentum).add_(g) + if 
nesterov:update=g.add(buf,alpha=momentum) + else:update=buf + update=zeropower_via_newtonschulz5(update,steps=backend_steps) + if sharded:prev_ag_handle=dist.all_gather_into_tensor(m[_K],update,async_op=_B);prev_m=m + else: + if wd>_E:p.data.mul_(_C-lr*wd) + p.add_(update.to(dtype=p.dtype),alpha=-lr*m[_L]) + if prev_ag_handle is not _A: + prev_ag_handle.wait();pp=prev_m['p'];upd=prev_m[_K][:prev_m['B']] + if wd>_E:pp.data.mul_(_C-lr*wd) + pp.add_(upd.to(dtype=pp.dtype),alpha=-lr*prev_m[_L]) + if hasattr(self,B):del self._rs_futures + return loss +def build_sentencepiece_luts(sp,vocab_size,device): + sp_vocab_size=int(sp.vocab_size());table_size=max(sp_vocab_size,vocab_size);base_bytes_np=np.zeros((table_size,),dtype=np.int16);has_leading_space_np=np.zeros((table_size,),dtype=np.bool_);is_boundary_token_np=np.ones((table_size,),dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id)or sp.is_unknown(token_id)or sp.is_unused(token_id):continue + is_boundary_token_np[token_id]=_D + if sp.is_byte(token_id):base_bytes_np[token_id]=1;continue + piece=sp.id_to_piece(token_id) + if piece.startswith('▁'):has_leading_space_np[token_id]=_B;piece=piece[1:] + base_bytes_np[token_id]=len(piece.encode(_J)) + return torch.tensor(base_bytes_np,dtype=torch.int16,device=device),torch.tensor(has_leading_space_np,dtype=torch.bool,device=device),torch.tensor(is_boundary_token_np,dtype=torch.bool,device=device) +def load_validation_tokens(pattern,seq_len): + files=[Path(p)for p in sorted(glob.glob(pattern))] + if not files:raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens=torch.cat([load_data_shard(file)for file in files]).contiguous();usable=(tokens.numel()-1)//seq_len*seq_len + if usable<=0:raise ValueError('val split too short') + return tokens[:usable+1] +def eval_val(args,model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,eval_seq_len=_A): + seq_len=eval_seq_len or 
args.train_seq_len;local_batch_tokens=args.val_batch_size//(world_size*grad_accum_steps) + if local_batch_tokens0 else _C,dtype=torch.float32);q=torch.clamp(torch.round(torch.clamp(t32,-clip_abs,clip_abs)/scale),-127,127).to(torch.int8).contiguous();return q,scale +def load_data_shard(file): + B='0: + avail=self.tokens.numel()-self.pos + if avail<=0:self._advance_file();continue + k=min(remaining,avail);chunks.append(self.tokens[self.pos:self.pos+k]);self.pos+=k;remaining-=k + return chunks[0]if len(chunks)==1 else torch.cat(chunks) +class DistributedTokenLoader: + def __init__(self,pattern,rank,world_size,device):self.rank=rank;self.world_size=world_size;self.device=device;self.stream=TokenStream(pattern) + def next_batch(self,global_tokens,seq_len,grad_accum_steps):local_tokens=global_tokens//(self.world_size*grad_accum_steps);per_rank_span=local_tokens+1;chunk=self.stream.take(per_rank_span*self.world_size);start=self.rank*per_rank_span;local=chunk[start:start+per_rank_span].to(dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);return x.to(self.device,non_blocking=_B),y.to(self.device,non_blocking=_B) +class BigramHashEmbedding(nn.Module): + def __init__(self,bigram_vocab_size,bigram_dim,model_dim): + super().__init__();self.bigram_vocab_size=bigram_vocab_size;self.embed=nn.Embedding(bigram_vocab_size,bigram_dim);nn.init.zeros_(self.embed.weight);self.proj=CastedLinear(bigram_dim,model_dim,bias=_D)if bigram_dim!=model_dim else _A + if self.proj is not _A:nn.init.zeros_(self.proj.weight) + self.scale=nn.Parameter(torch.tensor(.05,dtype=torch.float32)) + def bigram_hash(self,tokens):t=tokens.to(torch.int32);mod=self.bigram_vocab_size-1;out=torch.empty_like(t);out[...,0]=mod;out[...,1:]=torch.bitwise_xor(36313*t[...,1:],27191*t[...,:-1])%mod;return out.long() + def forward(self,token_ids): + h=self.embed(self.bigram_hash(token_ids)) + if self.proj is not _A:h=self.proj(h) + return h*self.scale.to(dtype=h.dtype) +class 
ValueEmbedding(nn.Module): + def __init__(self,vocab_size,ve_dim,model_dim): + super().__init__();self.embed=nn.Embedding(vocab_size,ve_dim);nn.init.normal_(self.embed.weight,std=.01);self.proj=CastedLinear(ve_dim,model_dim,bias=_D)if ve_dim!=model_dim else _A + if self.proj is not _A:nn.init.zeros_(self.proj.weight) + self.scale=nn.Parameter(torch.tensor(.1,dtype=torch.float32)) + def forward(self,token_ids): + h=self.embed(token_ids) + if self.proj is not _A:h=self.proj(h) + return h*self.scale.to(dtype=h.dtype) +class RMSNorm(nn.Module): + def __init__(self,eps=_A):super().__init__();self.eps=eps + def forward(self,x):return F.rms_norm(x,(x.size(-1),),eps=self.eps) +class CastedLinear(nn.Linear): + _qat_enabled:bool=_D + def forward(self,x): + w=self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim==2: + with torch.no_grad():w32=self.weight.float();row_max=w32.abs().amax(dim=1);scale=(row_max/31.).clamp_min(_C/31.);w_q=(torch.clamp(torch.round(w32/scale[:,_A]),-32,31)*scale[:,_A]).to(x.dtype) + w=w+(w_q-w).detach() + bias=self.bias.to(x.dtype)if self.bias is not _A else _A;return F.linear(x,w,bias) +def restore_low_dim_params_to_fp32(module): + with torch.no_grad(): + for(name,param)in module.named_parameters(): + if(param.ndim<2 or any(p in name for p in _N.split(',')))and param.dtype!=torch.float32:param.data=param.data.float() +class Rotary(nn.Module): + def __init__(self,dim,base=1e4,train_seq_len=1024,rope_dims=0):super().__init__();self.dim=dim;self.base=base;self.train_seq_len=train_seq_len;self.rope_dims=rope_dims if rope_dims>0 else dim;inv_freq=_C/base**(torch.arange(0,self.rope_dims,2,dtype=torch.float32)/self.rope_dims);self.register_buffer('inv_freq',inv_freq,persistent=_D);self._seq_len_cached=0;self._cos_cached=_A;self._sin_cached=_A + def forward(self,seq_len,device,dtype): + if self._cos_cached is _A or self._sin_cached is _A or self._seq_len_cached!=seq_len or self._cos_cached.device!=device: + rd=self.rope_dims + 
if seq_len>self.train_seq_len:scale=seq_len/self.train_seq_len;new_base=self.base*scale**(rd/(rd-2));inv_freq=_C/new_base**(torch.arange(0,rd,2,dtype=torch.float32,device=device)/rd) + else:inv_freq=self.inv_freq.to(device) + t=torch.arange(seq_len,device=device,dtype=inv_freq.dtype);freqs=torch.outer(t,inv_freq);self._cos_cached=freqs.cos()[_A,:,_A,:];self._sin_cached=freqs.sin()[_A,:,_A,:];self._seq_len_cached=seq_len + return self._cos_cached.to(dtype=dtype),self._sin_cached.to(dtype=dtype) +def apply_rotary_emb(x,cos,sin,rope_dims=0): + if rope_dims>0 and rope_dims=2:row_max=w32.abs().amax(dim=-1);scale=(row_max/clip_range).clamp_min(_C/clip_range);dims=(slice(_A),)*(w32.ndim-1)+(_A,);w_q=(torch.clamp(torch.round(w32/scale[dims]),-clip_range,clip_range)*scale[dims]).to(w.dtype) + else:amax=w32.abs().max();scale=(amax/clip_range).clamp_min(_C/clip_range);w_q=(torch.clamp(torch.round(w32/scale),-clip_range,clip_range)*scale).to(w.dtype) + return w+(w_q-w).detach() +class GPT(nn.Module): + def __init__(self,vocab_size,num_layers,model_dim,num_heads,num_kv_heads,mlp_mult,tie_embeddings,tied_embed_init_std,logit_softcap,rope_base,qk_gain_init,xsa_last_n=0,rope_dims=0,ln_scale=_D,core_start=3,core_end=8,num_passes=1,core_quant_bits=6,core_quant_enabled=_D,residual_scale=_A,interpass_rmsnorm=_B,bigram_vocab_size=0,bigram_dim=32,ve_enabled=_D,ve_dim=128,ve_layers='9,10'): + super().__init__();self._ve_target_dim=num_kv_heads*(model_dim//num_heads) + if logit_softcap<=_E:raise ValueError('logit_softcap must be >0') + 
self.tie_embeddings=tie_embeddings;self.tied_embed_init_std=tied_embed_init_std;self.logit_softcap=logit_softcap;self.core_start=core_start;self.core_end=min(core_end,num_layers);self.interpass_rmsnorm=interpass_rmsnorm;self.num_passes=num_passes;self.core_quant_bits=core_quant_bits;self.core_quant_enabled=core_quant_enabled;self.num_stem=core_start;self.num_core=self.core_end-core_start;self.num_tail=num_layers-self.core_end;self.residual_scale=residual_scale;self.tok_emb=nn.Embedding(vocab_size,model_dim);self.bigram=BigramHashEmbedding(bigram_vocab_size,bigram_dim,model_dim)if bigram_vocab_size>0 else _A;self.smear=SmearGate(model_dim);self.num_skip_weights=min(self.num_stem,self.num_tail);self.skip_weights=nn.Parameter(torch.ones(self.num_skip_weights,model_dim,dtype=torch.float32));head_dim=model_dim//num_heads;kv_dim=num_kv_heads*head_dim;mlp_dim=int(mlp_mult*model_dim);self.num_layers=num_layers;self.qo_bank=nn.Parameter(torch.empty(2*num_layers,model_dim,model_dim));self.kv_bank=nn.Parameter(torch.empty(2*num_layers,kv_dim,model_dim));self.mlp_up_bank=nn.Parameter(torch.empty(num_layers,mlp_dim,model_dim));self.mlp_down_bank=nn.Parameter(torch.empty(num_layers,model_dim,mlp_dim));self.blocks=nn.ModuleList([Block(model_dim,num_heads,num_kv_heads,mlp_mult,rope_base,qk_gain_init,layer_idx=i,ln_scale=ln_scale)for i in range(num_layers)]) + if rope_dims>0: + head_dim=model_dim//num_heads + for block in self.blocks:block.attn.rope_dims=rope_dims;block.attn.rotary=Rotary(head_dim,base=rope_base,train_seq_len=1024,rope_dims=rope_dims) + self.ve_layer_indices=[int(x)for x in ve_layers.split(',')if x.strip()]if ve_enabled else[];kv_dim_ve=self._ve_target_dim + if self.ve_layer_indices:self.ve_shared=ValueEmbedding(vocab_size,ve_dim,kv_dim_ve);self.ve_layer_scales=nn.ParameterList([nn.Parameter(torch.ones(1,dtype=torch.float32))for _ in self.ve_layer_indices]) + else:self.ve_shared=_A;self.ve_layer_scales=nn.ParameterList() + 
self.value_embeds=nn.ModuleList();self.final_norm=RMSNorm();self.lm_head=_A if tie_embeddings else CastedLinear(model_dim,vocab_size,bias=_D) + if self.lm_head is not _A:self.lm_head._zero_init=_B + self.mtp_heads=nn.ModuleList() + if xsa_last_n>0: + for i in range(max(0,num_layers-xsa_last_n),num_layers): + if i<self.core_start or i>=self.core_end:self.blocks[i].attn.use_xsa=_B + self._init_weights() + def _init_weights(self): + if self.tie_embeddings:nn.init.normal_(self.tok_emb.weight,mean=_E,std=self.tied_embed_init_std) + n=self.num_layers;proj_scale=_C/math.sqrt(2*n) + for i in range(n):nn.init.orthogonal_(self.qo_bank.data[i],gain=_C);nn.init.zeros_(self.qo_bank.data[n+i]);nn.init.orthogonal_(self.kv_bank.data[i],gain=_C);nn.init.orthogonal_(self.kv_bank.data[n+i],gain=_C);nn.init.orthogonal_(self.mlp_up_bank.data[i],gain=_C);nn.init.zeros_(self.mlp_down_bank.data[i]);self.qo_bank.data[n+i].mul_(proj_scale);self.mlp_down_bank.data[i].mul_(proj_scale) + for(name,module)in self.named_modules(): + if isinstance(module,nn.Linear): + if getattr(module,'_zero_init',_D):nn.init.zeros_(module.weight) + elif module.weight.ndim==2 and module.weight.shape[0]>=64 and module.weight.shape[1]>=64:nn.init.orthogonal_(module.weight,gain=_C) + def _get_ve(self,layer_idx,input_ids,ve_cache=_A): + A='ve' + if self.ve_shared is _A or layer_idx not in self.ve_layer_indices:return + if ve_cache is not _A and A not in ve_cache:ve_cache[A]=self.ve_shared(input_ids) + ve_base=ve_cache[A]if ve_cache is not _A else self.ve_shared(input_ids);ve_idx=self.ve_layer_indices.index(layer_idx);return ve_base*self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def _get_bank_weights(self,bi): + n=self.num_layers;q_w=self.qo_bank[bi];out_w=self.qo_bank[n+bi];k_w=self.kv_bank[bi];v_w=self.kv_bank[n+bi];up_w=self.mlp_up_bank[bi];down_w=self.mlp_down_bank[bi] + if self.core_quant_enabled and self.training and self.core_start<=bi0 and self.interpass_rmsnorm:x=F.rms_norm(x,(x.size(-1),)) + if feedback_fn is not 
_A:x=x+feedback_fn(x,k) + if stabilizer is not _A:x=stabilizer.clip(x) + x_before_pass=x + for j in range(self.core_start,self.core_end):h_prev=x;ve=self._get_ve(j,input_ids,ve_cache);q_w,k_w,v_w,out_w,up_w,down_w=self._get_bank_weights(j);x,_=self.blocks[j](x,x0,q_w,k_w,v_w,out_w,up_w,down_w,v_embed=ve) + if self.residual_scale is not _A and k>0:delta=x-x_before_pass;x=x_before_pass+self.residual_scale(delta,k) + h_core_out=x + for i in range(self.core_end,n): + ti=i-self.core_end + if ti0:main_loss=main_loss+stabilizer.jacobian_proxy_loss(h_core_in,h_core_out) + return main_loss + def forward_logits(self,input_ids,feedback_fn=_A,stabilizer=_A): + '';x,_,_=self._forward_hidden(input_ids,feedback_fn,stabilizer) + if self.tie_embeddings:logits_proj=F.linear(x,self.tok_emb.weight) + else:logits_proj=self.lm_head(x) + return self.logit_softcap*torch.tanh(logits_proj/self.logit_softcap) +def eval_val_sliding_ttt(args,base_model,rank,world_size,device,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,stride,batch_seqs=32,log0=print,feedback_fn=_A,feedback_module=_A): + seq_len=args.train_seq_len;total_tokens=val_tokens.numel()-1;ttt_chunk=args.ttt_chunk_tokens;window_starts=[ws for ws in range(0,total_tokens,stride)if min(ws+seq_len,total_tokens)-ws>=stride or ws==0];num_chunks=(total_tokens+ttt_chunk-1)//ttt_chunk;chunk_windows=[[]for _ in range(num_chunks)] + for ws in window_starts:end=min(ws+seq_len,total_tokens);wlen=end-ws;s=0 if ws==0 else max(wlen-stride,0);scored_start=ws+s;ci=min(scored_start//ttt_chunk,num_chunks-1);chunk_windows[ci].append(ws) + log0(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} total_windows={len(window_starts)} stride={stride} ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} 
freeze_blocks={args.ttt_freeze_blocks}");loss_sum=torch.zeros((),device=device,dtype=torch.float64);token_count=torch.zeros((),device=device,dtype=torch.float64);byte_count=torch.zeros((),device=device,dtype=torch.float64);frozen_block_ids=set(range(min(args.ttt_freeze_blocks,len(base_model.blocks))));ttt_params=[] + for(name,p)in base_model.named_parameters(): + freeze=_D + for bi in frozen_block_ids: + if f"blocks.{bi}."in name:freeze=_B;break + if freeze:p.requires_grad_(_D) + else:p.requires_grad_(_B);ttt_params.append(p) + if feedback_module is not _A: + for p in feedback_module.parameters():p.requires_grad_(_B);ttt_params.append(p) + log0(f"ttt_sliding:params unfrozen={sum(p.numel()for p in ttt_params)} frozen={sum(p.numel()for p in base_model.parameters()if not p.requires_grad)}");optimizer=torch.optim.SGD(ttt_params,lr=args.ttt_lr,momentum=args.ttt_momentum);t0=time.perf_counter() + for ci in range(num_chunks): + windows=chunk_windows[ci] + if not windows:continue + chunk_start=ci*ttt_chunk;chunk_end=min((ci+1)*ttt_chunk,total_tokens);my_s=len(windows)*rank//world_size;my_e=len(windows)*(rank+1)//world_size;my_windows=windows[my_s:my_e];base_model.eval() + with torch.inference_mode(): + for bi in range(0,len(my_windows),batch_seqs): + batch_ws=my_windows[bi:bi+batch_seqs];bsz=len(batch_ws);x_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);y_batch=torch.zeros(bsz,seq_len,dtype=torch.int64,device=device);wlens=[] + for(i,ws)in enumerate(batch_ws):end=min(ws+seq_len,total_tokens);wlen=end-ws;wlens.append(wlen);chunk_tok=val_tokens[ws:end+1].to(dtype=torch.int64,device=device);x_batch[i,:wlen]=chunk_tok[:-1];y_batch[i,:wlen]=chunk_tok[1:] + with torch.autocast(device_type=_I,dtype=torch.bfloat16):logits=base_model.forward_logits(x_batch,feedback_fn=feedback_fn) + nll=F.cross_entropy(logits.reshape(-1,logits.size(-1)).float(),y_batch.reshape(-1),reduction='none').reshape(bsz,seq_len) + for(i,ws)in enumerate(batch_ws):wlen=wlens[i];s=0 if ws==0 
else max(wlen-stride,0);scored_nll=nll[i,s:wlen].to(torch.float64);loss_sum+=scored_nll.sum();token_count+=float(wlen-s);tgt,prev=y_batch[i,s:wlen],x_batch[i,s:wlen];tb=base_bytes_lut[tgt].to(torch.float64);tb+=(has_leading_space_lut[tgt]&~is_boundary_token_lut[prev]).to(torch.float64);byte_count+=tb.sum() + is_last_chunk=ci==num_chunks-1 + if not is_last_chunk and args.ttt_epochs>0: + base_model.train();chunk_seqs=(chunk_end-chunk_start)//seq_len + if chunk_seqs>0: + cos_lr=args.ttt_lr*.5*(_C+math.cos(math.pi*ci/max(num_chunks-1,1))) + for pg in optimizer.param_groups:pg[_G]=cos_lr + my_seq_s=chunk_seqs*rank//world_size;my_seq_e=chunk_seqs*(rank+1)//world_size;my_chunk_seqs=my_seq_e-my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0,my_chunk_seqs,args.ttt_batch_seqs): + be=min(bs+args.ttt_batch_seqs,my_chunk_seqs);actual_bs=my_seq_s+bs;start_tok=chunk_start+actual_bs*seq_len;end_tok=chunk_start+(my_seq_s+be)*seq_len+1 + if end_tok>val_tokens.numel():continue + local=val_tokens[start_tok:end_tok].to(device=device,dtype=torch.int64);x=local[:-1].reshape(-1,seq_len);y=local[1:].reshape(-1,seq_len);optimizer.zero_grad(set_to_none=_B) + with torch.autocast(device_type=_I,dtype=torch.bfloat16):loss=base_model(x,y,feedback_fn=feedback_fn) + loss.backward() + if world_size>1: + for p in ttt_params: + if p.grad is not _A:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + torch.nn.utils.clip_grad_norm_(ttt_params,args.ttt_grad_clip);optimizer.step() + if rank==0 and(ci%10==0 or ci==num_chunks-1):elapsed=time.perf_counter()-t0;rl=loss_sum.item()/max(token_count.item(),1);rbpb=rl/math.log(2.)*(token_count.item()/max(byte_count.item(),1))if token_count.item()>0 else _E;log0(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + if dist.is_available()and dist.is_initialized():dist.all_reduce(loss_sum,op=dist.ReduceOp.SUM);dist.all_reduce(token_count,op=dist.ReduceOp.SUM);dist.all_reduce(byte_count,op=dist.ReduceOp.SUM) + 
val_loss=(loss_sum/token_count).item();val_bpb=val_loss/math.log(2.)*(token_count.item()/byte_count.item()) + for p in base_model.parameters():p.requires_grad_(_B) + base_model.eval();log0(f"ttt_sliding:done val_loss={val_loss:.6f}{ val_bpb=:.6f} elapsed={time.perf_counter()-t0:.1f}s");return val_loss,val_bpb +def quantize_int6_per_row(t,clip_range=31): + t32=t.float() + if t32.ndim==2: + best_q,best_s,best_err=_A,_A,float('inf') + for pct in[.999,.9995,.9999,.99999,_C]: + if pct<_C:row_clip=torch.quantile(t32.abs(),pct,dim=1) + else:row_clip=t32.abs().amax(dim=1) + s=(row_clip/clip_range).clamp_min(_C/clip_range).to(torch.float16);q=torch.clamp(torch.round(t32/s.float()[:,_A]),-clip_range,clip_range).to(torch.int8);recon=q.float()*s.float()[:,_A];err=(t32-recon).pow(2).mean().item() + if err0 else _C,dtype=torch.float16);q=torch.clamp(torch.round(t32/scale.float()),-clip_range,clip_range).to(torch.int8);return q,scale +def _unbank_state_dict(sd,num_layers): + out={};n=num_layers + for(name,tensor)in sd.items(): + if name==_O: + for i in range(n):out[f"blocks.{i}.attn.c_q.weight"]=tensor[i];out[f"blocks.{i}.attn.proj.weight"]=tensor[n+i] + elif name==_P: + for i in range(n):out[f"blocks.{i}.attn.c_k.weight"]=tensor[i];out[f"blocks.{i}.attn.c_v.weight"]=tensor[n+i] + elif name==_Q: + for i in range(n):out[f"blocks.{i}.mlp.fc.weight"]=tensor[i] + elif name==_R: + for i in range(n):out[f"blocks.{i}.mlp.proj.weight"]=tensor[i] + else:out[name]=tensor + return out +def _rebank_state_dict(sd,num_layers,template_sd): + out={};n=num_layers;qo_slices=[_A]*(2*n);kv_slices=[_A]*(2*n);up_slices=[_A]*n;down_slices=[_A]*n;consumed=set() + for i in range(n): + qk=f"blocks.{i}.attn.c_q.weight" + if qk in sd:qo_slices[i]=sd[qk];consumed.add(qk) + ok=f"blocks.{i}.attn.proj.weight" + if ok in sd:qo_slices[n+i]=sd[ok];consumed.add(ok) + kk=f"blocks.{i}.attn.c_k.weight" + if kk in sd:kv_slices[i]=sd[kk];consumed.add(kk) + vk=f"blocks.{i}.attn.c_v.weight" + if vk in 
sd:kv_slices[n+i]=sd[vk];consumed.add(vk) + fk=f"blocks.{i}.mlp.fc.weight" + if fk in sd:up_slices[i]=sd[fk];consumed.add(fk) + dk=f"blocks.{i}.mlp.proj.weight" + if dk in sd:down_slices[i]=sd[dk];consumed.add(dk) + out[_O]=torch.stack(qo_slices).to(dtype=template_sd[_O].dtype);out[_P]=torch.stack(kv_slices).to(dtype=template_sd[_P].dtype);out[_Q]=torch.stack(up_slices).to(dtype=template_sd[_Q].dtype);out[_R]=torch.stack(down_slices).to(dtype=template_sd[_R].dtype) + for(name,tensor)in sd.items(): + if name not in consumed:out[name]=tensor + return out +def mixed_quantize_int6(state_dict,int6_cats,core_start=-1,core_end=-1): + A='type';num_layers_total=max((int(k.split('.')[1])for k in state_dict if k.startswith('blocks.')),default=0)+1;late_k_layers=set(range(num_layers_total-2,num_layers_total));result={};meta={} + for(name,tensor)in state_dict.items(): + t=tensor.detach().cpu().contiguous();cat='embed'if'tok_emb'in name or'lm_head'in name else'mlp'if'.mlp.'in name else'attn'if'.attn.'in name else'other' + if not t.is_floating_point()or t.numel()<=65536:result[name]=t.to(torch.float16)if t.is_floating_point()else t;meta[name]=_Y;continue + if any(p in name for p in _N.split(',')):result[name]=t.float();meta[name]=_Z;continue + if cat in int6_cats and t.ndim>=1:q,s=quantize_int6_per_row(t);result[name+'.q']=q;result[name+_S]=s;meta[name]={A:'int6'} + else:q,s=quantize_float_tensor(t);result[name+'.q']=q;result[name+_S]=s;meta[name]={A:'int8'} + return result,meta +def dequantize_mixed_int6(result,meta,template_sd): + out={} + for(name,orig)in template_sd.items(): + info=meta.get(name) + if info is _A:continue + orig_dtype=orig.dtype + if info in(_Y,_Z,'passthrough_fp16'): + t=result[name] + if t.dtype==torch.float16 and orig_dtype in(torch.float32,torch.bfloat16):t=t.to(orig_dtype) + out[name]=t;continue + q,s=result[name+'.q'],result[name+_S] + if s.ndim>0:out[name]=(q.float()*s.float().view(q.shape[0],*[1]*(q.ndim-1))).to(orig_dtype) + 
else:out[name]=(q.float()*float(s.item())).to(orig_dtype) + return out +def parse_args():A='store_true';p=argparse.ArgumentParser();p.add_argument('--feedback-rank',type=int,default=2);p.add_argument('--feedback-mode',type=str,default=_T);p.add_argument('--per-pass-feedback',action=A);p.add_argument('--residual-scale-init',type=float,default=.5);p.add_argument('--jacobian-proxy-weight',type=float,default=.01);p.add_argument('--no-interpass-rmsnorm',action=A);return p.parse_args() +def _make_gpt(args,cli,num_passes,**kw):return GPT(vocab_size=args.vocab_size,num_layers=args.num_layers,model_dim=args.model_dim,num_heads=args.num_heads,num_kv_heads=args.num_kv_heads,mlp_mult=args.mlp_mult,tie_embeddings=args.tie_embeddings,tied_embed_init_std=args.tied_embed_init_std,logit_softcap=args.logit_softcap,rope_base=args.rope_base,qk_gain_init=args.qk_gain_init,xsa_last_n=args.xsa_last_n,rope_dims=args.rope_dims,ln_scale=args.ln_scale,core_start=args.core_start,core_end=args.core_end,num_passes=num_passes,interpass_rmsnorm=not cli.no_interpass_rmsnorm,bigram_vocab_size=args.bigram_vocab_size,bigram_dim=args.bigram_dim,ve_enabled=args.ve_enabled,ve_dim=args.ve_dim,ve_layers=args.ve_layers,**kw) +def _promote_fp32(m): + m.qo_bank.data=m.qo_bank.data.float();m.kv_bank.data=m.kv_bank.data.float();m.mlp_up_bank.data=m.mlp_up_bank.data.float();m.mlp_down_bank.data=m.mlp_down_bank.data.float() + for mod in m.modules(): + if isinstance(mod,CastedLinear):mod.float() + restore_low_dim_params_to_fp32(m) +def main(): + G='final_model.int6.ptz';F='final_model.pt';E='WORLD_SIZE';D='RANK';C='_feedback.';B='_fb.';A='base_lr';cli=parse_args();code=Path(__file__).read_text(encoding=_J);args=Hyperparameters();distributed=D in os.environ and E in os.environ;rank=int(os.environ.get(D,_H));world_size=int(os.environ.get(E,'1'));local_rank=int(os.environ.get('LOCAL_RANK',_H)) + if world_size<=0:raise ValueError('bad WORLD_SIZE') + if 8%world_size!=0:raise ValueError('WORLD_SIZE must divide 8') + 
grad_accum_steps=8//world_size;grad_scale=_C/grad_accum_steps + if not torch.cuda.is_available():raise RuntimeError('CUDA is required') + device=torch.device(_I,local_rank);torch.cuda.set_device(device) + if distributed:dist.init_process_group(backend='nccl',device_id=device);dist.barrier() + master_process=rank==0;torch.backends.cuda.matmul.allow_tf32=_B;torch.backends.cudnn.allow_tf32=_B;from torch.backends.cuda import enable_cudnn_sdp,enable_flash_sdp,enable_math_sdp,enable_mem_efficient_sdp;enable_cudnn_sdp(_D);enable_flash_sdp(_B);enable_mem_efficient_sdp(_D);enable_math_sdp(_D);logfile=_A + if master_process:os.makedirs('logs',exist_ok=_B);logfile=f"logs/{args.run_id}.txt";print(logfile) + def log0(msg,console=_B): + if not master_process:return + if console:print(msg) + if logfile is not _A: + with open(logfile,'a',encoding=_J)as f:print(msg,file=f) + log0(code,console=_D);random.seed(args.seed);np.random.seed(args.seed);torch.manual_seed(args.seed);torch.cuda.manual_seed_all(args.seed) + if not args.tokenizer_path.endswith('.model'):raise ValueError('need .model tokenizer') + sp=spm.SentencePieceProcessor(model_file=args.tokenizer_path) + if int(sp.vocab_size())!=args.vocab_size:raise ValueError('vocab size mismatch') + dataset_dir=Path(args.data_path).resolve();actual_train_files=len(list(dataset_dir.glob(_U)));effective_eval_seq_len=args.eval_seq_len if args.eval_seq_len>0 else args.train_seq_len;val_seq_len=max(args.train_seq_len,effective_eval_seq_len);val_tokens=load_validation_tokens(args.val_files,val_seq_len);base_bytes_lut,has_leading_space_lut,is_boundary_token_lut=build_sentencepiece_luts(sp,args.vocab_size,device);log0(f"val_bpb:enabled tokenizer_path={args.tokenizer_path}");log0(f"train:{dataset_dir.name} shards:{actual_train_files} 
val_tokens:{val_tokens.numel()-1}");CastedLinear._qat_enabled=args.qat_enabled;base_model=_make_gpt(args,cli,args.num_passes,core_quant_bits=args.core_quant_bits,core_quant_enabled=args.core_quant_enabled,residual_scale=_A).to(device).bfloat16();_promote_fp32(base_model);feedback=_A;feedback_fn=_A;stabilizer=_A;residual_scale=_A;extra_scalar_params=[];passes_schedule=[] + if args.passes_schedule_str: + for entry in args.passes_schedule_str.split(','):s,p=entry.strip().split(':');passes_schedule.append((int(s),int(p))) + passes_schedule.sort(key=lambda x:x[0]) + max_passes=max((p for(_,p)in passes_schedule),default=args.num_passes);max_passes=max(max_passes,args.eval_passes if args.eval_passes>0 else args.num_passes);needs_recurrence=max_passes>1 + if cli.feedback_mode!='none'and needs_recurrence: + feedback=ErrorFeedbackModule(dim=args.model_dim,rank=cli.feedback_rank,feedback_mode=cli.feedback_mode,per_pass=cli.per_pass_feedback,num_passes=max_passes).to(device).bfloat16();restore_low_dim_params_to_fp32(feedback);extra_scalar_params.extend(feedback.parameters()) + def feedback_fn(h,pass_idx):return feedback(h,pass_idx) + log0(f"feedback: {cli.feedback_mode} r={cli.feedback_rank} params={sum(p.numel()for p in feedback.parameters())}") + if needs_recurrence: + stabilizer=RecurrentStabilizer(jacobian_proxy_weight=cli.jacobian_proxy_weight) + if cli.residual_scale_init!=_C:residual_scale=ResidualScale(max_passes,cli.residual_scale_init).to(device);base_model.residual_scale=residual_scale;extra_scalar_params.extend(residual_scale.parameters()) + log0(f"recurrence: {args.core_start}-{args.core_end} passes={args.num_passes}/{max_passes} s/c/t={base_model.num_stem}/{base_model.num_core}/{base_model.num_tail} 
sched={passes_schedule}");compiled_model=torch.compile(base_model,dynamic=_D,fullgraph=_B);model=compiled_model;matrix_params=[base_model.qo_bank,base_model.kv_bank,base_model.mlp_up_bank,base_model.mlp_down_bank];block_named_params=list(base_model.blocks.named_parameters());scalar_params=[p for(name,p)in block_named_params if p.ndim<2 or any(p in name for p in _N.split(','))] + if base_model.skip_weights.numel()>0:scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate);token_lr=args.tied_embed_lr if args.tie_embeddings else args.embed_lr;tok_params=[{_F:[base_model.tok_emb.weight],_G:token_lr,A:token_lr}] + if base_model.bigram is not _A: + tok_params.append({_F:[base_model.bigram.embed.weight],_G:token_lr,A:token_lr}) + if base_model.bigram.proj is not _A:scalar_params.append(base_model.bigram.proj.weight) + scalar_params.append(base_model.bigram.scale) + if base_model.ve_shared is not _A: + tok_params.append({_F:[base_model.ve_shared.embed.weight],_G:token_lr,A:token_lr}) + if base_model.ve_shared.proj is not _A:scalar_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales:scalar_params.append(s) + optimizer_tok=torch.optim.AdamW(tok_params,betas=(args.beta1,args.beta2),eps=args.adam_eps,weight_decay=args.adam_wd,fused=_B);optimizer_muon=Muon(matrix_params,lr=args.matrix_lr,momentum=args.muon_momentum,backend_steps=args.muon_backend_steps,weight_decay=args.muon_wd) + for group in optimizer_muon.param_groups:group[A]=args.matrix_lr + scalar_params.extend(extra_scalar_params);optimizer_scalar=torch.optim.AdamW([{_F:scalar_params,_G:args.scalar_lr,A:args.scalar_lr}],betas=(args.beta1,args.beta2),eps=args.adam_eps,weight_decay=args.adam_wd,fused=_B);replicated_params=list(optimizer_tok.param_groups[0][_F]) + for pg in optimizer_tok.param_groups[1:]:replicated_params.extend(pg[_F]) + replicated_params.extend(scalar_params);optimizer_head=_A + if 
base_model.lm_head is not _A:optimizer_head=torch.optim.Adam([{_F:[base_model.lm_head.weight],_G:args.head_lr,A:args.head_lr}],betas=(args.beta1,args.beta2),eps=args.adam_eps,fused=_B);replicated_params.append(base_model.lm_head.weight) + optimizers=[optimizer_tok,optimizer_muon,optimizer_scalar] + if optimizer_head is not _A:optimizers.append(optimizer_head) + log0(f"params:{sum(p.numel()for p in base_model.parameters())} ws:{world_size} ga:{grad_accum_steps} iters:{args.iterations} wc:{args.max_wallclock_seconds:.0f}s seed:{args.seed}");train_loader=DistributedTokenLoader(args.train_files,rank,world_size,device) + def zero_grad_all(): + for opt in optimizers:opt.zero_grad(set_to_none=_B) + max_wallclock_ms=1e3*args.max_wallclock_seconds if args.max_wallclock_seconds>0 else _A + def lr_mul(step,elapsed_ms): + if args.warmdown_iters<=0:return _C + if max_wallclock_ms is _A:warmdown_start=max(args.iterations-args.warmdown_iters,0);return max((args.iterations-step)/max(args.warmdown_iters,1),_E)if warmdown_start<=step0: + initial_model_state={name:tensor.detach().cpu().clone()for(name,tensor)in base_model.state_dict().items()};initial_optimizer_states=[copy.deepcopy(opt.state_dict())for opt in optimizers];_precompile_passes=sorted(set(p for(_,p)in passes_schedule)-{args.num_passes})if passes_schedule else[];_qat_precompile_passes=_precompile_passes[-2:]if len(_precompile_passes)>=2 else _precompile_passes[:];_total_precompile=len(_precompile_passes)+len(_qat_precompile_passes);_precompile_start=args.warmup_steps-_total_precompile;model.train() + for warmup_step in range(args.warmup_steps): + if warmup_step>=_precompile_start: + _pc_idx=warmup_step-_precompile_start + if _pc_idx=stop_after_step;should_validate=last_step or args.val_loss_every>0 and step%args.val_loss_every==0 + if 
should_validate:torch.cuda.synchronize();training_time_ms+=1e3*(time.perf_counter()-t0);val_loss,val_bpb=eval_val(args,model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut);log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms/max(step,1):.2f}ms");torch.cuda.synchronize();t0=time.perf_counter() + if last_step: + if stop_after_step is not _A and step=threshold_step:target_passes=p + if target_passes!=base_model.num_passes:base_model.num_passes=target_passes;log0(f"progressive_passes: step:{step} num_passes:{target_passes}") + if args.late_qat_threshold>0 and step>100 and scale0 else _C;muon_momentum=(1-frac)*args.muon_momentum_warmup_start+frac*args.muon_momentum + for group in optimizer_muon.param_groups:group[_X]=muon_momentum + for opt in optimizers: + for group in opt.param_groups:group[_G]=group[A]*scale + grad_norm=_A + if args.grad_clip_norm>0:grad_norm=torch.nn.utils.clip_grad_norm_(base_model.parameters(),args.grad_clip_norm) + optimizer_muon.launch_reduce_scatters() + if distributed: + for p in replicated_params: + if p.grad is not _A:dist.all_reduce(p.grad,op=dist.ReduceOp.AVG) + optimizer_tok.step();optimizer_scalar.step() + if optimizer_head is not _A:optimizer_head.step() + optimizer_muon.step();zero_grad_all() + with torch.no_grad(): + _cur=dict(base_model.state_dict()) + if feedback is not _A: + for(k,v)in feedback.state_dict().items():_cur[f"_fb.{k}"]=v + for(name,t)in _cur.items():ema_state[name].mul_(ema_decay).add_(t.detach().float(),alpha=_C-ema_decay) + step+=1;approx_training_time_ms=training_time_ms+1e3*(time.perf_counter()-t0) + if args.swa_enabled and scale<.2 and step%args.swa_every==0: + if swa_state is _A:swa_state={name:t.detach().cpu().clone()for(name,t)in base_model.state_dict().items()};swa_count=1;log0(f"swa:start step:{step}") + else: + for(name,t)in 
base_model.state_dict().items():swa_state[name]+=t.detach().cpu() + swa_count+=1 + should_log_train=args.train_log_every>0 and(step<=10 or step%args.train_log_every==0 or stop_after_step is not _A) + if should_log_train:tl=train_loss.item();gn_str=f" grad_norm:{grad_norm:.4f}"if grad_norm is not _A else'';log0(f"step:{step}/{args.iterations} train_loss:{tl:.4f}{gn_str} train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms/step:.2f}ms") + reached_cap=max_wallclock_ms is not _A and approx_training_time_ms>=max_wallclock_ms + if distributed and max_wallclock_ms is not _A:reached_cap_tensor=torch.tensor(int(reached_cap),device=device);dist.all_reduce(reached_cap_tensor,op=dist.ReduceOp.MAX);reached_cap=bool(reached_cap_tensor.item()) + if stop_after_step is _A and reached_cap:stop_after_step=step + log0(f"peak memory allocated: {torch.cuda.max_memory_allocated()//1024//1024} MiB reserved: {torch.cuda.max_memory_reserved()//1024//1024} MiB");log0('ema:applying EMA weights');current_state=base_model.state_dict();model_ema={k:v for(k,v)in ema_state.items()if not k.startswith(B)};avg_state={name:model_ema[name].to(dtype=current_state[name].dtype)for name in current_state};base_model.load_state_dict(avg_state,strict=_B) + if feedback is not _A:fb_ema={k.removeprefix(B):v for(k,v)in ema_state.items()if k.startswith(B)};fb_state=feedback.state_dict();fb_avg={k:fb_ema[k].to(dtype=fb_state[k].dtype)for k in fb_state};feedback.load_state_dict(fb_avg,strict=_B) + torch.cuda.synchronize();t_diag=time.perf_counter();diag_val_loss,diag_val_bpb=eval_val(args,compiled_model,rank,world_size,device,grad_accum_steps,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut);torch.cuda.synchronize();log0(f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} eval_time:{1e3*(time.perf_counter()-t_diag):.0f}ms");full_state_dict=base_model.state_dict();export_sd=full_state_dict + if feedback is not _A: + for(k,v)in 
feedback.state_dict().items():export_sd[f"_feedback.{k}"]=v + if master_process:torch.save(export_sd,F);model_bytes=os.path.getsize(F);code_bytes=len(code.encode(_J));log0(f"Serialized model: {model_bytes} bytes");log0(f"Code size: {code_bytes} bytes") + eval_num_passes=args.eval_passes if args.eval_passes>0 else args.num_passes + if eval_num_passes!=args.num_passes: + log0(f"eval_override: num_passes {args.num_passes} -> {eval_num_passes}");base_model.num_passes=eval_num_passes + if base_model.residual_scale is not _A:old_s=base_model.residual_scale.scales.data;new_s=torch.full((eval_num_passes,),cli.residual_scale_init,dtype=torch.float32,device=old_s.device);copy_len=min(eval_num_passes,old_s.shape[0]);new_s[:copy_len]=old_s[:copy_len];base_model.residual_scale.scales=nn.Parameter(new_s) + export_sd=base_model.state_dict() + if feedback is not _A: + for(k,v)in feedback.state_dict().items():export_sd[f"_feedback.{k}"]=v + sd_cpu={k:v.detach().cpu()for(k,v)in export_sd.items()};unbanked_sd=_unbank_state_dict(sd_cpu,args.num_layers);quant_result,quant_meta=mixed_quantize_int6(unbanked_sd,{'mlp','attn'});quant_buf=io.BytesIO();torch.save({'w':quant_result,'m':quant_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=lzma.compress(quant_raw,preset=6) + if master_process: + with open(G,'wb')as f:f.write(quant_blob) + quant_file_bytes=len(quant_blob);code_bytes=len(code.encode(_J));log0(f"Serialized model int6+lzma: {quant_file_bytes} bytes");log0(f"Total submission size int6+lzma: {quant_file_bytes+code_bytes} bytes") + if distributed:dist.barrier() + with open(G,'rb')as f:quant_blob_disk=f.read() + quant_state=torch.load(io.BytesIO(lzma.decompress(quant_blob_disk)),map_location='cpu');deq_unbanked=dequantize_mixed_int6(quant_state['w'],quant_state['m'],unbanked_sd);deq_state=_rebank_state_dict(deq_unbanked,args.num_layers,sd_cpu);eval_feedback=_A;eval_feedback_fn=_A;fb_keys={k:v for(k,v)in deq_state.items()if k.startswith(C)} + if fb_keys: + deq_state={k:v 
for(k,v)in deq_state.items()if not k.startswith(C)};eval_feedback=ErrorFeedbackModule(dim=args.model_dim,rank=cli.feedback_rank,feedback_mode=cli.feedback_mode,per_pass=cli.per_pass_feedback,num_passes=eval_num_passes).to(device).bfloat16();fb_sd={k.removeprefix(C):v for(k,v)in fb_keys.items()};eval_feedback.load_state_dict(fb_sd,strict=_B) + def eval_feedback_fn(h,pass_idx):return eval_feedback(h,pass_idx) + log0(f"eval_feedback: loaded from artifact, params={eval_feedback.param_count()}") + eval_model=_make_gpt(args,cli,eval_num_passes).to(device).bfloat16() + if residual_scale is not _A:eval_rs=ResidualScale(eval_num_passes,cli.residual_scale_init).to(device);eval_model.residual_scale=eval_rs + _promote_fp32(eval_model);eval_model.load_state_dict(deq_state,strict=_B) + if args.ttt_enabled:torch.cuda.synchronize();t_ttt=time.perf_counter();ttt_loss,ttt_bpb=eval_val_sliding_ttt(args,eval_model,rank,world_size,device,val_tokens,base_bytes_lut,has_leading_space_lut,is_boundary_token_lut,stride=args.eval_stride,log0=log0,feedback_fn=eval_feedback_fn,feedback_module=eval_feedback);torch.cuda.synchronize();log0(f"legal_ttt val_loss:{ttt_loss:.4f} val_bpb:{ttt_bpb:.4f} eval_time:{1e3*(time.perf_counter()-t_ttt):.0f}ms");log0(f"legal_ttt_exact val_loss:{ttt_loss:.8f} val_bpb:{ttt_bpb:.8f}") + if distributed:dist.destroy_process_group() +if __name__=='__main__':main() \ No newline at end of file diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed1337.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed1337.log new file mode 100644 index 0000000000..6dfbd82729 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed1337.log @@ -0,0 +1,387 @@ +W0401 17:11:57.414000 166208 torch/distributed/run.py:851] +W0401 17:11:57.414000 166208 torch/distributed/run.py:851] ***************************************** +W0401 17:11:57.414000 166208 torch/distributed/run.py:851] 
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0401 17:11:57.414000 166208 torch/distributed/run.py:851] ***************************************** +logs/bigram_ve_wd3500_3pass.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +feedback: mode=diagonal rank=2 per_pass=False params=2560 +recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)] +model_params:26698335 +XSA:last_4 active_layers:[7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000 +seed:1337 +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/6500 val_loss:6.9281 val_bpb:4.1032 train_time:0ms step_avg:0.01ms +step:1/6500 train_loss:6.9290 grad_norm:0.3974 train_time:127ms step_avg:126.94ms +step:2/6500 train_loss:8.4773 grad_norm:3.3575 train_time:161ms step_avg:80.53ms +step:3/6500 train_loss:7.6671 grad_norm:1.7536 train_time:242ms step_avg:80.61ms +step:4/6500 train_loss:7.3212 grad_norm:1.1850 train_time:323ms step_avg:80.77ms +step:5/6500 
train_loss:7.1339 grad_norm:1.4214 train_time:404ms step_avg:80.83ms +step:6/6500 train_loss:6.8977 grad_norm:1.2182 train_time:485ms step_avg:80.85ms +step:7/6500 train_loss:6.8026 grad_norm:1.3195 train_time:566ms step_avg:80.81ms +step:8/6500 train_loss:6.7622 grad_norm:0.9580 train_time:647ms step_avg:80.85ms +step:9/6500 train_loss:6.5054 grad_norm:0.9286 train_time:728ms step_avg:80.87ms +step:10/6500 train_loss:6.1474 grad_norm:0.9324 train_time:810ms step_avg:80.97ms +step:50/6500 train_loss:3.7892 grad_norm:1.0422 train_time:4097ms step_avg:81.94ms +step:100/6500 train_loss:3.1689 grad_norm:0.6969 train_time:8215ms step_avg:82.15ms +step:150/6500 train_loss:2.8908 grad_norm:0.4749 train_time:12398ms step_avg:82.65ms +step:200/6500 train_loss:2.3761 grad_norm:0.3690 train_time:16541ms step_avg:82.70ms +step:250/6500 train_loss:2.4712 grad_norm:0.3667 train_time:20679ms step_avg:82.72ms +step:300/6500 train_loss:2.5395 grad_norm:0.3156 train_time:24879ms step_avg:82.93ms +step:350/6500 train_loss:2.5270 grad_norm:0.2875 train_time:29023ms step_avg:82.92ms +step:400/6500 train_loss:2.3904 grad_norm:0.2057 train_time:33223ms step_avg:83.06ms +step:450/6500 train_loss:2.3445 grad_norm:0.2296 train_time:37361ms step_avg:83.02ms +step:500/6500 train_loss:2.3812 grad_norm:0.2048 train_time:41503ms step_avg:83.01ms +step:550/6500 train_loss:2.3155 grad_norm:0.2007 train_time:45702ms step_avg:83.09ms +step:600/6500 train_loss:2.3195 grad_norm:0.1791 train_time:49845ms step_avg:83.08ms +step:650/6500 train_loss:2.3103 grad_norm:0.1746 train_time:54046ms step_avg:83.15ms +step:700/6500 train_loss:2.3268 grad_norm:0.1511 train_time:58186ms step_avg:83.12ms +step:750/6500 train_loss:2.3125 grad_norm:0.1521 train_time:62336ms step_avg:83.11ms +step:800/6500 train_loss:2.2223 grad_norm:0.1654 train_time:66542ms step_avg:83.18ms +step:850/6500 train_loss:2.2136 grad_norm:0.1482 train_time:70686ms step_avg:83.16ms +step:900/6500 train_loss:2.1112 grad_norm:0.1492 
train_time:74884ms step_avg:83.20ms +step:950/6500 train_loss:2.2038 grad_norm:0.1359 train_time:79035ms step_avg:83.20ms +step:1000/6500 train_loss:2.2579 grad_norm:0.1482 train_time:83183ms step_avg:83.18ms +step:1050/6500 train_loss:2.2082 grad_norm:0.1318 train_time:87392ms step_avg:83.23ms +step:1100/6500 train_loss:2.3091 grad_norm:0.1599 train_time:91542ms step_avg:83.22ms +step:1150/6500 train_loss:2.2358 grad_norm:0.1365 train_time:95750ms step_avg:83.26ms +step:1200/6500 train_loss:2.3392 grad_norm:0.1272 train_time:99901ms step_avg:83.25ms +step:1250/6500 train_loss:2.2380 grad_norm:0.1281 train_time:104058ms step_avg:83.25ms +step:1300/6500 train_loss:2.0824 grad_norm:0.1104 train_time:108275ms step_avg:83.29ms +step:1350/6500 train_loss:2.2349 grad_norm:0.1266 train_time:112426ms step_avg:83.28ms +step:1400/6500 train_loss:2.1730 grad_norm:0.1093 train_time:116641ms step_avg:83.32ms +step:1450/6500 train_loss:2.1051 grad_norm:0.0967 train_time:120789ms step_avg:83.30ms +step:1500/6500 train_loss:2.2093 grad_norm:0.0965 train_time:124938ms step_avg:83.29ms +step:1550/6500 train_loss:2.1683 grad_norm:0.0896 train_time:129151ms step_avg:83.32ms +step:1600/6500 train_loss:2.0650 grad_norm:0.0988 train_time:133311ms step_avg:83.32ms +step:1650/6500 train_loss:2.1750 grad_norm:0.0917 train_time:137469ms step_avg:83.31ms +step:1700/6500 train_loss:2.1297 grad_norm:0.0860 train_time:141678ms step_avg:83.34ms +step:1750/6500 train_loss:2.1828 grad_norm:0.0806 train_time:145835ms step_avg:83.33ms +step:1800/6500 train_loss:2.1386 grad_norm:0.1055 train_time:150044ms step_avg:83.36ms +step:1850/6500 train_loss:2.0174 grad_norm:0.1119 train_time:154193ms step_avg:83.35ms +step:1900/6500 train_loss:2.1075 grad_norm:0.0826 train_time:158345ms step_avg:83.34ms +step:1950/6500 train_loss:2.0046 grad_norm:0.0752 train_time:162561ms step_avg:83.36ms +step:2000/6500 train_loss:2.0548 grad_norm:0.0749 train_time:166715ms step_avg:83.36ms +step:2050/6500 train_loss:2.0980 
grad_norm:0.0746 train_time:170936ms step_avg:83.38ms +step:2100/6500 train_loss:2.0319 grad_norm:0.0740 train_time:175087ms step_avg:83.37ms +step:2150/6500 train_loss:2.1359 grad_norm:0.0763 train_time:179243ms step_avg:83.37ms +step:2200/6500 train_loss:2.1221 grad_norm:0.1377 train_time:183459ms step_avg:83.39ms +step:2250/6500 train_loss:2.1587 grad_norm:0.0783 train_time:187611ms step_avg:83.38ms +step:2300/6500 train_loss:2.0955 grad_norm:0.0806 train_time:191822ms step_avg:83.40ms +step:2350/6500 train_loss:2.1575 grad_norm:0.0721 train_time:195977ms step_avg:83.39ms +step:2400/6500 train_loss:2.0526 grad_norm:0.0739 train_time:200128ms step_avg:83.39ms +step:2450/6500 train_loss:2.0661 grad_norm:0.0732 train_time:204340ms step_avg:83.40ms +step:2500/6500 train_loss:2.1566 grad_norm:0.1040 train_time:208499ms step_avg:83.40ms +step:2550/6500 train_loss:2.1926 grad_norm:0.0828 train_time:212714ms step_avg:83.42ms +step:2600/6500 train_loss:2.0968 grad_norm:0.0773 train_time:216864ms step_avg:83.41ms +step:2650/6500 train_loss:2.0556 grad_norm:0.0839 train_time:221016ms step_avg:83.40ms +step:2700/6500 train_loss:2.0891 grad_norm:0.0734 train_time:225232ms step_avg:83.42ms +step:2750/6500 train_loss:2.0194 grad_norm:0.0732 train_time:229393ms step_avg:83.42ms +step:2800/6500 train_loss:2.1399 grad_norm:0.0801 train_time:233614ms step_avg:83.43ms +step:2850/6500 train_loss:2.0551 grad_norm:0.0691 train_time:237762ms step_avg:83.43ms +step:2900/6500 train_loss:2.0127 grad_norm:0.0757 train_time:241916ms step_avg:83.42ms +step:2950/6500 train_loss:2.0701 grad_norm:0.0740 train_time:246137ms step_avg:83.44ms +step:3000/6500 train_loss:2.1529 grad_norm:0.0738 train_time:250300ms step_avg:83.43ms +step:3050/6500 train_loss:2.0333 grad_norm:0.0755 train_time:254456ms step_avg:83.43ms +step:3100/6500 train_loss:2.0245 grad_norm:0.0759 train_time:258671ms step_avg:83.44ms +step:3150/6500 train_loss:1.9603 grad_norm:0.0692 train_time:262829ms step_avg:83.44ms 
+step:3200/6500 train_loss:2.1628 grad_norm:0.0798 train_time:267049ms step_avg:83.45ms +step:3250/6500 train_loss:2.0394 grad_norm:0.0723 train_time:271202ms step_avg:83.45ms +step:3300/6500 train_loss:2.0628 grad_norm:0.0678 train_time:275363ms step_avg:83.44ms +step:3350/6500 train_loss:2.0850 grad_norm:0.0709 train_time:279583ms step_avg:83.46ms +step:3400/6500 train_loss:2.0100 grad_norm:0.0749 train_time:283734ms step_avg:83.45ms +step:3450/6500 train_loss:2.1029 grad_norm:0.0800 train_time:287953ms step_avg:83.46ms +step:3500/6500 train_loss:2.1720 grad_norm:0.0755 train_time:292103ms step_avg:83.46ms +step:3550/6500 train_loss:1.9111 grad_norm:0.0700 train_time:296260ms step_avg:83.45ms +step:3600/6500 train_loss:2.0871 grad_norm:0.0764 train_time:300476ms step_avg:83.47ms +step:3650/6500 train_loss:1.9652 grad_norm:0.0743 train_time:304632ms step_avg:83.46ms +step:3700/6500 train_loss:2.0881 grad_norm:0.0731 train_time:308849ms step_avg:83.47ms +step:3750/6500 train_loss:1.9141 grad_norm:0.0710 train_time:313006ms step_avg:83.47ms +step:3800/6500 train_loss:2.0665 grad_norm:0.0780 train_time:317165ms step_avg:83.46ms +step:3850/6500 train_loss:2.0818 grad_norm:0.0738 train_time:321374ms step_avg:83.47ms +step:3900/6500 train_loss:2.0701 grad_norm:0.0706 train_time:325529ms step_avg:83.47ms +step:3950/6500 train_loss:2.1639 grad_norm:0.0731 train_time:329743ms step_avg:83.48ms +step:4000/6500 train_loss:1.9696 grad_norm:0.0719 train_time:333902ms step_avg:83.48ms +step:4000/6500 val_loss:2.0584 val_bpb:1.2191 train_time:333951ms step_avg:83.49ms +step:4050/6500 train_loss:2.0853 grad_norm:0.0715 train_time:338056ms step_avg:83.47ms +step:4100/6500 train_loss:2.0046 grad_norm:0.0766 train_time:342277ms step_avg:83.48ms +step:4150/6500 train_loss:2.1088 grad_norm:0.0738 train_time:346432ms step_avg:83.48ms +step:4200/6500 train_loss:2.1411 grad_norm:0.0838 train_time:350646ms step_avg:83.49ms +step:4250/6500 train_loss:2.1075 grad_norm:0.0830 
train_time:354797ms step_avg:83.48ms +step:4300/6500 train_loss:2.0524 grad_norm:0.0747 train_time:358948ms step_avg:83.48ms +step:4350/6500 train_loss:2.0648 grad_norm:0.0788 train_time:363174ms step_avg:83.49ms +step:4400/6500 train_loss:2.0249 grad_norm:0.0755 train_time:367326ms step_avg:83.48ms +step:4450/6500 train_loss:2.0430 grad_norm:0.0741 train_time:371479ms step_avg:83.48ms +step:4500/6500 train_loss:2.1161 grad_norm:0.0755 train_time:375699ms step_avg:83.49ms +progressive_passes: step:4500 num_passes:2 +step:4550/6500 train_loss:2.1261 grad_norm:0.0730 train_time:381269ms step_avg:83.80ms +step:4600/6500 train_loss:1.8320 grad_norm:0.0798 train_time:386903ms step_avg:84.11ms +step:4650/6500 train_loss:2.0453 grad_norm:0.0729 train_time:392474ms step_avg:84.40ms +step:4700/6500 train_loss:2.2250 grad_norm:0.1195 train_time:398047ms step_avg:84.69ms +step:4750/6500 train_loss:2.0101 grad_norm:0.0791 train_time:403682ms step_avg:84.99ms +step:4800/6500 train_loss:2.4105 grad_norm:0.1550 train_time:409252ms step_avg:85.26ms +step:4850/6500 train_loss:2.0919 grad_norm:0.0808 train_time:414888ms step_avg:85.54ms +step:4900/6500 train_loss:2.0343 grad_norm:0.0757 train_time:420457ms step_avg:85.81ms +step:4950/6500 train_loss:2.0808 grad_norm:0.0813 train_time:426024ms step_avg:86.07ms +step:5000/6500 train_loss:2.0848 grad_norm:0.0733 train_time:431652ms step_avg:86.33ms +step:5050/6500 train_loss:2.0497 grad_norm:0.0871 train_time:437218ms step_avg:86.58ms +step:5100/6500 train_loss:2.1074 grad_norm:0.0788 train_time:442845ms step_avg:86.83ms +step:5150/6500 train_loss:2.0053 grad_norm:0.0777 train_time:448413ms step_avg:87.07ms +step:5200/6500 train_loss:2.0183 grad_norm:0.0751 train_time:453990ms step_avg:87.31ms +step:5250/6500 train_loss:2.0446 grad_norm:0.0715 train_time:459618ms step_avg:87.55ms +step:5300/6500 train_loss:1.9820 grad_norm:0.0760 train_time:465187ms step_avg:87.77ms +step:5350/6500 train_loss:1.8966 grad_norm:0.0798 train_time:470817ms 
step_avg:88.00ms
+step:5400/6500 train_loss:2.0190 grad_norm:0.0775 train_time:476388ms step_avg:88.22ms
+step:5450/6500 train_loss:2.0444 grad_norm:0.0796 train_time:481954ms step_avg:88.43ms
+step:5500/6500 train_loss:1.9880 grad_norm:0.0786 train_time:487580ms step_avg:88.65ms
+progressive_passes: step:5500 num_passes:3
+step:5550/6500 train_loss:1.9738 grad_norm:0.0820 train_time:494245ms step_avg:89.05ms
+step:5600/6500 train_loss:1.9185 grad_norm:0.0789 train_time:500970ms step_avg:89.46ms
+step:5650/6500 train_loss:2.0222 grad_norm:0.0827 train_time:507635ms step_avg:89.85ms
+step:5700/6500 train_loss:1.9775 grad_norm:0.0862 train_time:514295ms step_avg:90.23ms
+step:5750/6500 train_loss:2.0524 grad_norm:0.0933 train_time:521020ms step_avg:90.61ms
+step:5800/6500 train_loss:1.9523 grad_norm:0.0860 train_time:528938ms step_avg:91.20ms
+step:5850/6500 train_loss:2.0892 grad_norm:0.0870 train_time:535657ms step_avg:91.57ms
+swa:start step:5900
+step:5900/6500 train_loss:1.8599 grad_norm:0.0804 train_time:542320ms step_avg:91.92ms
+step:5950/6500 train_loss:1.9200 grad_norm:0.0773 train_time:549081ms step_avg:92.28ms
+late_qat:enabled step:5968 scale:0.1496 core_quant:on
+step:6000/6500 train_loss:1.9028 grad_norm:0.0843 train_time:555878ms step_avg:92.65ms
+step:6050/6500 train_loss:1.9311 grad_norm:0.0847 train_time:562586ms step_avg:92.99ms
+step:6100/6500 train_loss:1.8777 grad_norm:0.0861 train_time:569299ms step_avg:93.33ms
+step:6150/6500 train_loss:1.9797 grad_norm:0.0873 train_time:576065ms step_avg:93.67ms
+step:6200/6500 train_loss:1.9020 grad_norm:0.0877 train_time:582782ms step_avg:94.00ms
+step:6250/6500 train_loss:2.0230 grad_norm:0.0927 train_time:589555ms step_avg:94.33ms
+step:6300/6500 train_loss:1.9017 grad_norm:0.0820 train_time:596266ms step_avg:94.65ms
+step:6328/6500 val_loss:1.9200 val_bpb:1.1371 train_time:600113ms step_avg:94.83ms
+stopping_early: wallclock_cap train_time:600113ms step:6328/6500
+peak memory allocated: 34074 MiB reserved: 34084 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9168 val_bpb:1.1353 eval_time:2969ms
+Serialized model: 105478842 bytes
+Code size: 88253 bytes
+eval_override: num_passes 1 -> 3
+Serialized model int6+lzma: 15850832 bytes
+Total submission size int6+lzma: 15939085 bytes
+eval_feedback: loaded from artifact, params=2560
+ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=26700895 frozen=0
+ ttt_chunk [1/1893] bpb=1.152954 time=0.5s
+ ttt_chunk [11/1893] bpb=1.141819 time=3.5s
+ ttt_chunk [21/1893] bpb=1.126763 time=6.5s
+ ttt_chunk [31/1893] bpb=1.125066 time=9.4s
+ ttt_chunk [41/1893] bpb=1.111860 time=12.4s
+ ttt_chunk [51/1893] bpb=1.106199 time=15.4s
+ ttt_chunk [61/1893] bpb=1.112955 time=18.3s
+ ttt_chunk [71/1893] bpb=1.111630 time=21.2s
+ ttt_chunk [81/1893] bpb=1.110738 time=24.2s
+ ttt_chunk [91/1893] bpb=1.111503 time=27.2s
+ ttt_chunk [101/1893] bpb=1.114974 time=30.1s
+ ttt_chunk [111/1893] bpb=1.117455 time=33.1s
+ ttt_chunk [121/1893] bpb=1.111134 time=37.2s
+ ttt_chunk [131/1893] bpb=1.111261 time=40.1s
+ ttt_chunk [141/1893] bpb=1.116930 time=43.1s
+ ttt_chunk [151/1893] bpb=1.118842 time=46.1s
+ ttt_chunk [161/1893] bpb=1.118422 time=49.0s
+ ttt_chunk [171/1893] bpb=1.122751 time=52.0s
+ ttt_chunk [181/1893] bpb=1.124991 time=54.9s
+ ttt_chunk [191/1893] bpb=1.132251 time=57.9s
+ ttt_chunk [201/1893] bpb=1.131038 time=60.9s
+ ttt_chunk [211/1893] bpb=1.128836 time=63.9s
+ ttt_chunk [221/1893] bpb=1.130341 time=66.8s
+ ttt_chunk [231/1893] bpb=1.128959 time=69.8s
+ ttt_chunk [241/1893] bpb=1.129200 time=72.7s
+ ttt_chunk [251/1893] bpb=1.128760 time=75.7s
+ ttt_chunk [261/1893] bpb=1.125903 time=78.6s
+ ttt_chunk [271/1893] bpb=1.124748 time=81.6s
+ ttt_chunk [281/1893] bpb=1.126141 time=84.6s
+ ttt_chunk [291/1893] bpb=1.127927 time=87.5s
+ ttt_chunk [301/1893] bpb=1.128620 time=90.5s
+ ttt_chunk [311/1893] bpb=1.130720
time=93.5s + ttt_chunk [321/1893] bpb=1.132607 time=96.4s + ttt_chunk [331/1893] bpb=1.132470 time=99.4s + ttt_chunk [341/1893] bpb=1.131561 time=102.4s + ttt_chunk [351/1893] bpb=1.133871 time=105.3s + ttt_chunk [361/1893] bpb=1.134134 time=108.3s + ttt_chunk [371/1893] bpb=1.133490 time=111.2s + ttt_chunk [381/1893] bpb=1.133694 time=114.2s + ttt_chunk [391/1893] bpb=1.133479 time=117.2s + ttt_chunk [401/1893] bpb=1.131444 time=120.1s + ttt_chunk [411/1893] bpb=1.130286 time=123.1s + ttt_chunk [421/1893] bpb=1.129405 time=126.0s + ttt_chunk [431/1893] bpb=1.129247 time=129.0s + ttt_chunk [441/1893] bpb=1.129642 time=132.0s + ttt_chunk [451/1893] bpb=1.129939 time=134.9s + ttt_chunk [461/1893] bpb=1.128810 time=137.9s + ttt_chunk [471/1893] bpb=1.129386 time=140.9s + ttt_chunk [481/1893] bpb=1.129022 time=143.8s + ttt_chunk [491/1893] bpb=1.127956 time=146.8s + ttt_chunk [501/1893] bpb=1.127475 time=149.7s + ttt_chunk [511/1893] bpb=1.126774 time=152.7s + ttt_chunk [521/1893] bpb=1.124527 time=155.7s + ttt_chunk [531/1893] bpb=1.125727 time=158.6s + ttt_chunk [541/1893] bpb=1.126064 time=161.6s + ttt_chunk [551/1893] bpb=1.125011 time=164.5s + ttt_chunk [561/1893] bpb=1.125537 time=167.5s + ttt_chunk [571/1893] bpb=1.124514 time=170.4s + ttt_chunk [581/1893] bpb=1.123709 time=173.4s + ttt_chunk [591/1893] bpb=1.123103 time=176.4s + ttt_chunk [601/1893] bpb=1.123596 time=179.3s + ttt_chunk [611/1893] bpb=1.123535 time=182.3s + ttt_chunk [621/1893] bpb=1.123429 time=185.3s + ttt_chunk [631/1893] bpb=1.124181 time=188.2s + ttt_chunk [641/1893] bpb=1.123921 time=191.2s + ttt_chunk [651/1893] bpb=1.123995 time=194.2s + ttt_chunk [661/1893] bpb=1.123446 time=197.1s + ttt_chunk [671/1893] bpb=1.123794 time=200.1s + ttt_chunk [681/1893] bpb=1.124511 time=203.1s + ttt_chunk [691/1893] bpb=1.125507 time=206.0s + ttt_chunk [701/1893] bpb=1.124950 time=209.0s + ttt_chunk [711/1893] bpb=1.124905 time=211.9s + ttt_chunk [721/1893] bpb=1.124560 time=214.9s + ttt_chunk [731/1893] 
bpb=1.124625 time=217.9s + ttt_chunk [741/1893] bpb=1.124698 time=221.7s + ttt_chunk [751/1893] bpb=1.124579 time=225.4s + ttt_chunk [761/1893] bpb=1.124500 time=228.4s + ttt_chunk [771/1893] bpb=1.124177 time=231.3s + ttt_chunk [781/1893] bpb=1.124888 time=234.2s + ttt_chunk [791/1893] bpb=1.124512 time=237.2s + ttt_chunk [801/1893] bpb=1.124853 time=240.1s + ttt_chunk [811/1893] bpb=1.124605 time=243.1s + ttt_chunk [821/1893] bpb=1.124387 time=246.9s + ttt_chunk [831/1893] bpb=1.124196 time=249.8s + ttt_chunk [841/1893] bpb=1.123552 time=252.8s + ttt_chunk [851/1893] bpb=1.123263 time=255.7s + ttt_chunk [861/1893] bpb=1.123022 time=259.5s + ttt_chunk [871/1893] bpb=1.123296 time=262.4s + ttt_chunk [881/1893] bpb=1.123489 time=265.4s + ttt_chunk [891/1893] bpb=1.123064 time=268.3s + ttt_chunk [901/1893] bpb=1.122801 time=271.3s + ttt_chunk [911/1893] bpb=1.122916 time=274.2s + ttt_chunk [921/1893] bpb=1.123400 time=277.2s + ttt_chunk [931/1893] bpb=1.123364 time=280.1s + ttt_chunk [941/1893] bpb=1.123043 time=283.0s + ttt_chunk [951/1893] bpb=1.123428 time=286.0s + ttt_chunk [961/1893] bpb=1.123504 time=288.9s + ttt_chunk [971/1893] bpb=1.124333 time=291.9s + ttt_chunk [981/1893] bpb=1.124425 time=294.8s + ttt_chunk [991/1893] bpb=1.124452 time=297.7s + ttt_chunk [1001/1893] bpb=1.124389 time=300.7s + ttt_chunk [1011/1893] bpb=1.124163 time=303.6s + ttt_chunk [1021/1893] bpb=1.124501 time=306.6s + ttt_chunk [1031/1893] bpb=1.124946 time=309.5s + ttt_chunk [1041/1893] bpb=1.124610 time=312.4s + ttt_chunk [1051/1893] bpb=1.124364 time=315.4s + ttt_chunk [1061/1893] bpb=1.124409 time=318.3s + ttt_chunk [1071/1893] bpb=1.124986 time=321.3s + ttt_chunk [1081/1893] bpb=1.125240 time=324.2s + ttt_chunk [1091/1893] bpb=1.125996 time=327.2s + ttt_chunk [1101/1893] bpb=1.126011 time=330.1s + ttt_chunk [1111/1893] bpb=1.125871 time=333.1s + ttt_chunk [1121/1893] bpb=1.125666 time=336.0s + ttt_chunk [1131/1893] bpb=1.125557 time=339.0s + ttt_chunk [1141/1893] bpb=1.125269 
time=342.8s + ttt_chunk [1151/1893] bpb=1.125279 time=345.8s + ttt_chunk [1161/1893] bpb=1.124897 time=348.7s + ttt_chunk [1171/1893] bpb=1.125181 time=351.7s + ttt_chunk [1181/1893] bpb=1.124422 time=354.6s + ttt_chunk [1191/1893] bpb=1.124309 time=357.5s + ttt_chunk [1201/1893] bpb=1.124724 time=360.5s + ttt_chunk [1211/1893] bpb=1.124261 time=363.4s + ttt_chunk [1221/1893] bpb=1.123951 time=366.4s + ttt_chunk [1231/1893] bpb=1.123660 time=369.3s + ttt_chunk [1241/1893] bpb=1.123320 time=372.3s + ttt_chunk [1251/1893] bpb=1.122737 time=375.3s + ttt_chunk [1261/1893] bpb=1.122718 time=378.2s + ttt_chunk [1271/1893] bpb=1.122354 time=381.2s + ttt_chunk [1281/1893] bpb=1.122155 time=384.1s + ttt_chunk [1291/1893] bpb=1.121913 time=387.1s + ttt_chunk [1301/1893] bpb=1.121328 time=390.1s + ttt_chunk [1311/1893] bpb=1.120942 time=393.0s + ttt_chunk [1321/1893] bpb=1.120628 time=396.0s + ttt_chunk [1331/1893] bpb=1.120549 time=398.9s + ttt_chunk [1341/1893] bpb=1.120410 time=401.9s + ttt_chunk [1351/1893] bpb=1.120330 time=404.8s + ttt_chunk [1361/1893] bpb=1.120381 time=407.8s + ttt_chunk [1371/1893] bpb=1.120255 time=410.7s + ttt_chunk [1381/1893] bpb=1.120227 time=413.7s + ttt_chunk [1391/1893] bpb=1.119838 time=416.6s + ttt_chunk [1401/1893] bpb=1.119801 time=419.6s + ttt_chunk [1411/1893] bpb=1.119912 time=422.5s + ttt_chunk [1421/1893] bpb=1.120166 time=425.5s + ttt_chunk [1431/1893] bpb=1.119882 time=428.4s + ttt_chunk [1441/1893] bpb=1.120376 time=431.4s + ttt_chunk [1451/1893] bpb=1.120698 time=434.3s + ttt_chunk [1461/1893] bpb=1.120238 time=437.2s + ttt_chunk [1471/1893] bpb=1.121294 time=440.2s + ttt_chunk [1481/1893] bpb=1.120835 time=443.1s + ttt_chunk [1491/1893] bpb=1.120648 time=446.1s + ttt_chunk [1501/1893] bpb=1.120564 time=449.0s + ttt_chunk [1511/1893] bpb=1.120603 time=452.0s + ttt_chunk [1521/1893] bpb=1.120635 time=454.9s + ttt_chunk [1531/1893] bpb=1.120111 time=457.9s + ttt_chunk [1541/1893] bpb=1.119976 time=460.8s + ttt_chunk [1551/1893] 
bpb=1.120291 time=463.8s
+ ttt_chunk [1561/1893] bpb=1.120305 time=466.7s
+ ttt_chunk [1571/1893] bpb=1.120150 time=469.7s
+ ttt_chunk [1581/1893] bpb=1.120269 time=472.6s
+ ttt_chunk [1591/1893] bpb=1.120126 time=475.6s
+ ttt_chunk [1601/1893] bpb=1.120311 time=478.6s
+ ttt_chunk [1611/1893] bpb=1.120251 time=481.5s
+ ttt_chunk [1621/1893] bpb=1.119847 time=484.5s
+ ttt_chunk [1631/1893] bpb=1.120159 time=487.4s
+ ttt_chunk [1641/1893] bpb=1.120168 time=490.4s
+ ttt_chunk [1651/1893] bpb=1.120126 time=493.3s
+ ttt_chunk [1661/1893] bpb=1.120008 time=497.0s
+ ttt_chunk [1671/1893] bpb=1.120476 time=500.0s
+ ttt_chunk [1681/1893] bpb=1.120639 time=502.9s
+ ttt_chunk [1691/1893] bpb=1.120481 time=505.9s
+ ttt_chunk [1701/1893] bpb=1.120642 time=508.9s
+ ttt_chunk [1711/1893] bpb=1.120626 time=511.8s
+ ttt_chunk [1721/1893] bpb=1.120632 time=514.8s
+ ttt_chunk [1731/1893] bpb=1.120504 time=517.7s
+ ttt_chunk [1741/1893] bpb=1.120304 time=520.7s
+ ttt_chunk [1751/1893] bpb=1.120143 time=523.7s
+ ttt_chunk [1761/1893] bpb=1.120286 time=526.6s
+ ttt_chunk [1771/1893] bpb=1.120187 time=529.6s
+ ttt_chunk [1781/1893] bpb=1.120210 time=532.5s
+ ttt_chunk [1791/1893] bpb=1.119813 time=535.5s
+ ttt_chunk [1801/1893] bpb=1.119678 time=538.4s
+ ttt_chunk [1811/1893] bpb=1.119594 time=541.4s
+ ttt_chunk [1821/1893] bpb=1.119651 time=544.4s
+ ttt_chunk [1831/1893] bpb=1.119053 time=547.3s
+ ttt_chunk [1841/1893] bpb=1.119075 time=550.3s
+ ttt_chunk [1851/1893] bpb=1.118874 time=553.3s
+ ttt_chunk [1861/1893] bpb=1.118513 time=556.2s
+ ttt_chunk [1871/1893] bpb=1.118501 time=559.3s
+ ttt_chunk [1881/1893] bpb=1.118056 time=562.2s
+ ttt_chunk [1891/1893] bpb=1.117825 time=565.2s
+ ttt_chunk [1893/1893] bpb=1.117869 time=565.6s
+ttt_sliding:done val_loss=1.883755 val_bpb=1.115669 elapsed=565.6s
+legal_ttt val_loss:1.8838 val_bpb:1.1157 eval_time:566013ms
+legal_ttt_exact val_loss:1.88375543 val_bpb:1.11566902
diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed2025.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed2025.log
new file mode 100644
index 0000000000..c825b25919
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed2025.log
@@ -0,0 +1,387 @@
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851]
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] *****************************************
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0401 18:14:56.025000 191441 torch/distributed/run.py:851] *****************************************
+logs/bigram_ve_wd3500_3pass.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:80
+val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+feedback: mode=diagonal rank=2 per_pass=False params=2560
+recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)]
+model_params:26698335
+XSA:last_4 active_layers:[7, 8, 9, 10]
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000
+seed:2025
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/6500 val_loss:6.9300 val_bpb:4.1044 train_time:0ms step_avg:0.01ms +step:1/6500 train_loss:6.9310 grad_norm:0.3872 train_time:128ms step_avg:127.73ms +step:2/6500 train_loss:8.5729 grad_norm:3.7965 train_time:162ms step_avg:81.13ms +step:3/6500 train_loss:7.7014 grad_norm:2.0579 train_time:243ms step_avg:80.95ms +step:4/6500 train_loss:7.1616 grad_norm:1.1957 train_time:324ms step_avg:81.12ms +step:5/6500 train_loss:6.9726 grad_norm:1.3804 train_time:406ms step_avg:81.15ms +step:6/6500 train_loss:6.9302 grad_norm:1.1740 train_time:487ms step_avg:81.22ms +step:7/6500 train_loss:6.8503 grad_norm:1.3501 train_time:568ms step_avg:81.16ms +step:8/6500 train_loss:6.7528 grad_norm:1.1604 train_time:649ms step_avg:81.13ms +step:9/6500 train_loss:6.4660 grad_norm:0.9217 train_time:730ms step_avg:81.11ms +step:10/6500 train_loss:6.1355 grad_norm:1.3881 train_time:811ms step_avg:81.12ms +step:50/6500 train_loss:3.7789 grad_norm:0.7948 train_time:4100ms step_avg:81.99ms +step:100/6500 train_loss:3.1790 grad_norm:0.5920 train_time:8228ms step_avg:82.28ms +step:150/6500 train_loss:2.8888 grad_norm:0.5019 train_time:12419ms step_avg:82.79ms +step:200/6500 train_loss:2.3854 grad_norm:0.4094 train_time:16561ms step_avg:82.80ms +step:250/6500 train_loss:2.4722 grad_norm:0.3397 train_time:20702ms step_avg:82.81ms +step:300/6500 train_loss:2.5404 grad_norm:0.2761 train_time:24912ms step_avg:83.04ms +step:350/6500 train_loss:2.5242 grad_norm:0.2486 train_time:29054ms step_avg:83.01ms +step:400/6500 train_loss:2.3962 grad_norm:0.2509 train_time:33255ms step_avg:83.14ms +step:450/6500 train_loss:2.3545 grad_norm:0.2513 train_time:37398ms step_avg:83.11ms +step:500/6500 train_loss:2.3796 grad_norm:0.2000 train_time:41540ms step_avg:83.08ms +step:550/6500 train_loss:2.3167 grad_norm:0.2005 train_time:45748ms step_avg:83.18ms 
+step:600/6500 train_loss:2.3201 grad_norm:0.2006 train_time:49897ms step_avg:83.16ms +step:650/6500 train_loss:2.3102 grad_norm:0.1738 train_time:54099ms step_avg:83.23ms +step:700/6500 train_loss:2.3297 grad_norm:0.1634 train_time:58239ms step_avg:83.20ms +step:750/6500 train_loss:2.3176 grad_norm:0.1924 train_time:62388ms step_avg:83.18ms +step:800/6500 train_loss:2.2269 grad_norm:0.1741 train_time:66598ms step_avg:83.25ms +step:850/6500 train_loss:2.2177 grad_norm:0.1652 train_time:70737ms step_avg:83.22ms +step:900/6500 train_loss:2.1087 grad_norm:0.1458 train_time:74951ms step_avg:83.28ms +step:950/6500 train_loss:2.2081 grad_norm:0.1481 train_time:79104ms step_avg:83.27ms +step:1000/6500 train_loss:2.2609 grad_norm:0.1369 train_time:83253ms step_avg:83.25ms +step:1050/6500 train_loss:2.2108 grad_norm:0.1563 train_time:87470ms step_avg:83.30ms +step:1100/6500 train_loss:2.3090 grad_norm:0.1307 train_time:91612ms step_avg:83.28ms +step:1150/6500 train_loss:2.2338 grad_norm:0.1242 train_time:95814ms step_avg:83.32ms +step:1200/6500 train_loss:2.3379 grad_norm:0.1267 train_time:99964ms step_avg:83.30ms +step:1250/6500 train_loss:2.2385 grad_norm:0.1357 train_time:104106ms step_avg:83.29ms +step:1300/6500 train_loss:2.0882 grad_norm:0.1181 train_time:108319ms step_avg:83.32ms +step:1350/6500 train_loss:2.2403 grad_norm:0.1335 train_time:112466ms step_avg:83.31ms +step:1400/6500 train_loss:2.1715 grad_norm:0.1002 train_time:116669ms step_avg:83.34ms +step:1450/6500 train_loss:2.1064 grad_norm:0.1013 train_time:120817ms step_avg:83.32ms +step:1500/6500 train_loss:2.2087 grad_norm:0.1064 train_time:124963ms step_avg:83.31ms +step:1550/6500 train_loss:2.1723 grad_norm:0.0932 train_time:129174ms step_avg:83.34ms +step:1600/6500 train_loss:2.0630 grad_norm:0.0914 train_time:133323ms step_avg:83.33ms +step:1650/6500 train_loss:2.1777 grad_norm:0.0843 train_time:137474ms step_avg:83.32ms +step:1700/6500 train_loss:2.1294 grad_norm:0.0812 train_time:141693ms 
step_avg:83.35ms +step:1750/6500 train_loss:2.1849 grad_norm:0.0830 train_time:145855ms step_avg:83.35ms +step:1800/6500 train_loss:2.1436 grad_norm:0.1140 train_time:150071ms step_avg:83.37ms +step:1850/6500 train_loss:2.0157 grad_norm:0.0837 train_time:154220ms step_avg:83.36ms +step:1900/6500 train_loss:2.1111 grad_norm:0.0802 train_time:158375ms step_avg:83.36ms +step:1950/6500 train_loss:2.0071 grad_norm:0.0732 train_time:162600ms step_avg:83.38ms +step:2000/6500 train_loss:2.0536 grad_norm:0.0741 train_time:166756ms step_avg:83.38ms +step:2050/6500 train_loss:2.1021 grad_norm:0.0812 train_time:170980ms step_avg:83.40ms +step:2100/6500 train_loss:2.0340 grad_norm:0.0728 train_time:175134ms step_avg:83.40ms +step:2150/6500 train_loss:2.1384 grad_norm:0.0759 train_time:179289ms step_avg:83.39ms +step:2200/6500 train_loss:2.1237 grad_norm:0.1112 train_time:183511ms step_avg:83.41ms +step:2250/6500 train_loss:2.1585 grad_norm:0.0762 train_time:187667ms step_avg:83.41ms +step:2300/6500 train_loss:2.0963 grad_norm:0.0734 train_time:191892ms step_avg:83.43ms +step:2350/6500 train_loss:2.1587 grad_norm:0.0718 train_time:196043ms step_avg:83.42ms +step:2400/6500 train_loss:2.0555 grad_norm:0.0726 train_time:200201ms step_avg:83.42ms +step:2450/6500 train_loss:2.0712 grad_norm:0.0756 train_time:204419ms step_avg:83.44ms +step:2500/6500 train_loss:2.1573 grad_norm:0.0919 train_time:208575ms step_avg:83.43ms +step:2550/6500 train_loss:2.1984 grad_norm:0.0866 train_time:212796ms step_avg:83.45ms +step:2600/6500 train_loss:2.0966 grad_norm:0.0777 train_time:216958ms step_avg:83.45ms +step:2650/6500 train_loss:2.0599 grad_norm:0.0825 train_time:221118ms step_avg:83.44ms +step:2700/6500 train_loss:2.0877 grad_norm:0.0707 train_time:225342ms step_avg:83.46ms +step:2750/6500 train_loss:2.0203 grad_norm:0.0764 train_time:229499ms step_avg:83.45ms +step:2800/6500 train_loss:2.1453 grad_norm:0.0883 train_time:233725ms step_avg:83.47ms +step:2850/6500 train_loss:2.0545 
grad_norm:0.0708 train_time:237877ms step_avg:83.47ms +step:2900/6500 train_loss:2.0150 grad_norm:0.0708 train_time:242036ms step_avg:83.46ms +step:2950/6500 train_loss:2.0711 grad_norm:0.0771 train_time:246257ms step_avg:83.48ms +step:3000/6500 train_loss:2.1547 grad_norm:0.0798 train_time:250411ms step_avg:83.47ms +step:3050/6500 train_loss:2.0322 grad_norm:0.0736 train_time:254570ms step_avg:83.47ms +step:3100/6500 train_loss:2.0229 grad_norm:0.0768 train_time:258787ms step_avg:83.48ms +step:3150/6500 train_loss:1.9607 grad_norm:0.0770 train_time:262937ms step_avg:83.47ms +step:3200/6500 train_loss:2.1643 grad_norm:0.0762 train_time:267158ms step_avg:83.49ms +step:3250/6500 train_loss:2.0406 grad_norm:0.0700 train_time:271320ms step_avg:83.48ms +step:3300/6500 train_loss:2.0632 grad_norm:0.0767 train_time:275481ms step_avg:83.48ms +step:3350/6500 train_loss:2.0895 grad_norm:0.0714 train_time:279706ms step_avg:83.49ms +step:3400/6500 train_loss:2.0162 grad_norm:0.0769 train_time:283866ms step_avg:83.49ms +step:3450/6500 train_loss:2.1041 grad_norm:0.0823 train_time:288086ms step_avg:83.50ms +step:3500/6500 train_loss:2.1725 grad_norm:0.0725 train_time:292241ms step_avg:83.50ms +step:3550/6500 train_loss:1.9173 grad_norm:0.0723 train_time:296398ms step_avg:83.49ms +step:3600/6500 train_loss:2.0913 grad_norm:0.0857 train_time:300633ms step_avg:83.51ms +step:3650/6500 train_loss:1.9646 grad_norm:0.0721 train_time:304792ms step_avg:83.50ms +step:3700/6500 train_loss:2.0906 grad_norm:0.0733 train_time:309024ms step_avg:83.52ms +step:3750/6500 train_loss:1.9146 grad_norm:0.0699 train_time:313189ms step_avg:83.52ms +step:3800/6500 train_loss:2.0682 grad_norm:0.0742 train_time:317342ms step_avg:83.51ms +step:3850/6500 train_loss:2.0838 grad_norm:0.0797 train_time:321569ms step_avg:83.52ms +step:3900/6500 train_loss:2.0688 grad_norm:0.0756 train_time:325727ms step_avg:83.52ms +step:3950/6500 train_loss:2.1688 grad_norm:0.0756 train_time:329952ms step_avg:83.53ms 
+step:4000/6500 train_loss:1.9681 grad_norm:0.0761 train_time:334116ms step_avg:83.53ms
+step:4000/6500 val_loss:2.0597 val_bpb:1.2199 train_time:334165ms step_avg:83.54ms
+step:4050/6500 train_loss:2.0878 grad_norm:0.0717 train_time:338276ms step_avg:83.53ms
+step:4100/6500 train_loss:2.0089 grad_norm:0.0768 train_time:342492ms step_avg:83.53ms
+step:4150/6500 train_loss:2.1031 grad_norm:0.0722 train_time:346653ms step_avg:83.53ms
+step:4200/6500 train_loss:2.1474 grad_norm:0.0806 train_time:350873ms step_avg:83.54ms
+step:4250/6500 train_loss:2.1078 grad_norm:0.0773 train_time:355034ms step_avg:83.54ms
+step:4300/6500 train_loss:2.0497 grad_norm:0.0714 train_time:359189ms step_avg:83.53ms
+step:4350/6500 train_loss:2.0626 grad_norm:0.0769 train_time:363402ms step_avg:83.54ms
+step:4400/6500 train_loss:2.0247 grad_norm:0.0802 train_time:367559ms step_avg:83.54ms
+step:4450/6500 train_loss:2.0398 grad_norm:0.0715 train_time:371710ms step_avg:83.53ms
+step:4500/6500 train_loss:2.1202 grad_norm:0.0758 train_time:375934ms step_avg:83.54ms
+progressive_passes: step:4500 num_passes:2
+step:4550/6500 train_loss:2.1214 grad_norm:0.0736 train_time:381502ms step_avg:83.85ms
+step:4600/6500 train_loss:1.8320 grad_norm:0.0857 train_time:387135ms step_avg:84.16ms
+step:4650/6500 train_loss:2.0429 grad_norm:0.0732 train_time:392703ms step_avg:84.45ms
+step:4700/6500 train_loss:2.2223 grad_norm:0.1153 train_time:398274ms step_avg:84.74ms
+step:4750/6500 train_loss:2.0115 grad_norm:0.0777 train_time:403912ms step_avg:85.03ms
+step:4800/6500 train_loss:2.4075 grad_norm:0.1478 train_time:409482ms step_avg:85.31ms
+step:4850/6500 train_loss:2.0903 grad_norm:0.0810 train_time:415124ms step_avg:85.59ms
+step:4900/6500 train_loss:2.0334 grad_norm:0.0771 train_time:420697ms step_avg:85.86ms
+step:4950/6500 train_loss:2.0793 grad_norm:0.0849 train_time:426276ms step_avg:86.12ms
+step:5000/6500 train_loss:2.0823 grad_norm:0.0777 train_time:431912ms step_avg:86.38ms
+step:5050/6500 train_loss:2.0483 grad_norm:0.0879 train_time:437486ms step_avg:86.63ms
+step:5100/6500 train_loss:2.1072 grad_norm:0.0761 train_time:443134ms step_avg:86.89ms
+step:5150/6500 train_loss:2.0028 grad_norm:0.0791 train_time:448705ms step_avg:87.13ms
+step:5200/6500 train_loss:2.0176 grad_norm:0.0750 train_time:454285ms step_avg:87.36ms
+step:5250/6500 train_loss:2.0460 grad_norm:0.0718 train_time:459925ms step_avg:87.60ms
+step:5300/6500 train_loss:1.9792 grad_norm:0.0751 train_time:465491ms step_avg:87.83ms
+step:5350/6500 train_loss:1.8959 grad_norm:0.0778 train_time:471138ms step_avg:88.06ms
+step:5400/6500 train_loss:2.0199 grad_norm:0.0772 train_time:476710ms step_avg:88.28ms
+step:5450/6500 train_loss:2.0468 grad_norm:0.0773 train_time:482281ms step_avg:88.49ms
+step:5500/6500 train_loss:1.9886 grad_norm:0.0815 train_time:487909ms step_avg:88.71ms
+progressive_passes: step:5500 num_passes:3
+step:5550/6500 train_loss:1.9760 grad_norm:0.0816 train_time:494576ms step_avg:89.11ms
+step:5600/6500 train_loss:1.9213 grad_norm:0.0796 train_time:501307ms step_avg:89.52ms
+step:5650/6500 train_loss:2.0222 grad_norm:0.0810 train_time:507972ms step_avg:89.91ms
+step:5700/6500 train_loss:1.9784 grad_norm:0.0853 train_time:514641ms step_avg:90.29ms
+step:5750/6500 train_loss:2.0558 grad_norm:0.0940 train_time:521373ms step_avg:90.67ms
+step:5800/6500 train_loss:1.9551 grad_norm:0.0925 train_time:528043ms step_avg:91.04ms
+step:5850/6500 train_loss:2.0869 grad_norm:0.0836 train_time:534780ms step_avg:91.42ms
+swa:start step:5900
+step:5900/6500 train_loss:1.8584 grad_norm:0.0822 train_time:541437ms step_avg:91.77ms
+step:5950/6500 train_loss:1.9194 grad_norm:0.0798 train_time:548208ms step_avg:92.14ms
+late_qat:enabled step:5974 scale:0.1500 core_quant:on
+step:6000/6500 train_loss:1.9026 grad_norm:0.0840 train_time:555011ms step_avg:92.50ms
+step:6050/6500 train_loss:1.9272 grad_norm:0.0839 train_time:561738ms step_avg:92.85ms
+step:6100/6500 train_loss:1.8776 grad_norm:0.0841 train_time:568450ms step_avg:93.19ms
+step:6150/6500 train_loss:1.9784 grad_norm:0.0836 train_time:575232ms step_avg:93.53ms
+step:6200/6500 train_loss:1.9036 grad_norm:0.0847 train_time:581961ms step_avg:93.86ms
+step:6250/6500 train_loss:2.0241 grad_norm:0.0935 train_time:588767ms step_avg:94.20ms
+step:6300/6500 train_loss:1.9003 grad_norm:0.0837 train_time:595474ms step_avg:94.52ms
+step:6334/6500 val_loss:1.9198 val_bpb:1.1370 train_time:600111ms step_avg:94.74ms
+stopping_early: wallclock_cap train_time:600111ms step:6334/6500
+peak memory allocated: 34074 MiB reserved: 34084 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9165 val_bpb:1.1351 eval_time:2972ms
+Serialized model: 105478842 bytes
+Code size: 88253 bytes
+eval_override: num_passes 1 -> 3
+Serialized model int6+lzma: 15937372 bytes
+Total submission size int6+lzma: 16025625 bytes
+eval_feedback: loaded from artifact, params=2560
+ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=26700895 frozen=0
+ ttt_chunk [1/1893] bpb=1.159074 time=0.5s
+ ttt_chunk [11/1893] bpb=1.143360 time=3.6s
+ ttt_chunk [21/1893] bpb=1.126563 time=6.7s
+ ttt_chunk [31/1893] bpb=1.124558 time=9.8s
+ ttt_chunk [41/1893] bpb=1.111652 time=12.9s
+ ttt_chunk [51/1893] bpb=1.106247 time=16.0s
+ ttt_chunk [61/1893] bpb=1.112587 time=19.0s
+ ttt_chunk [71/1893] bpb=1.111006 time=22.1s
+ ttt_chunk [81/1893] bpb=1.110399 time=25.2s
+ ttt_chunk [91/1893] bpb=1.111234 time=28.3s
+ ttt_chunk [101/1893] bpb=1.114763 time=31.4s
+ ttt_chunk [111/1893] bpb=1.117112 time=34.5s
+ ttt_chunk [121/1893] bpb=1.110626 time=37.6s
+ ttt_chunk [131/1893] bpb=1.110903 time=40.7s
+ ttt_chunk [141/1893] bpb=1.116692 time=43.7s
+ ttt_chunk [151/1893] bpb=1.118475 time=46.8s
+ ttt_chunk [161/1893] bpb=1.118025 time=49.9s
+ ttt_chunk [171/1893] bpb=1.122484 time=53.0s
+ ttt_chunk [181/1893] bpb=1.124696 time=56.0s
+ ttt_chunk [191/1893] bpb=1.131805 time=59.1s
+ ttt_chunk [201/1893] bpb=1.130526 time=62.2s
+ ttt_chunk [211/1893] bpb=1.128334 time=65.3s
+ ttt_chunk [221/1893] bpb=1.129947 time=68.4s
+ ttt_chunk [231/1893] bpb=1.128611 time=71.5s
+ ttt_chunk [241/1893] bpb=1.128931 time=74.6s
+ ttt_chunk [251/1893] bpb=1.128412 time=77.6s
+ ttt_chunk [261/1893] bpb=1.125529 time=80.7s
+ ttt_chunk [271/1893] bpb=1.124390 time=83.8s
+ ttt_chunk [281/1893] bpb=1.125835 time=86.9s
+ ttt_chunk [291/1893] bpb=1.127584 time=90.0s
+ ttt_chunk [301/1893] bpb=1.128341 time=93.0s
+ ttt_chunk [311/1893] bpb=1.130443 time=96.1s
+ ttt_chunk [321/1893] bpb=1.132480 time=99.2s
+ ttt_chunk [331/1893] bpb=1.132351 time=102.3s
+ ttt_chunk [341/1893] bpb=1.131368 time=105.4s
+ ttt_chunk [351/1893] bpb=1.133657 time=108.5s
+ ttt_chunk [361/1893] bpb=1.133975 time=111.5s
+ ttt_chunk [371/1893] bpb=1.133283 time=114.6s
+ ttt_chunk [381/1893] bpb=1.133417 time=117.7s
+ ttt_chunk [391/1893] bpb=1.133243 time=120.8s
+ ttt_chunk [401/1893] bpb=1.131189 time=123.9s
+ ttt_chunk [411/1893] bpb=1.130058 time=127.0s
+ ttt_chunk [421/1893] bpb=1.129191 time=130.1s
+ ttt_chunk [431/1893] bpb=1.129039 time=133.2s
+ ttt_chunk [441/1893] bpb=1.129411 time=136.3s
+ ttt_chunk [451/1893] bpb=1.129700 time=139.4s
+ ttt_chunk [461/1893] bpb=1.128602 time=142.5s
+ ttt_chunk [471/1893] bpb=1.129204 time=145.6s
+ ttt_chunk [481/1893] bpb=1.128811 time=148.7s
+ ttt_chunk [491/1893] bpb=1.127740 time=151.8s
+ ttt_chunk [501/1893] bpb=1.127267 time=154.9s
+ ttt_chunk [511/1893] bpb=1.126595 time=158.0s
+ ttt_chunk [521/1893] bpb=1.124313 time=161.1s
+ ttt_chunk [531/1893] bpb=1.125507 time=164.2s
+ ttt_chunk [541/1893] bpb=1.125829 time=167.2s
+ ttt_chunk [551/1893] bpb=1.124810 time=170.3s
+ ttt_chunk [561/1893] bpb=1.125331 time=173.4s
+ ttt_chunk [571/1893] bpb=1.124264 time=176.6s
+ ttt_chunk [581/1893] bpb=1.123483 time=179.6s
+ ttt_chunk [591/1893] bpb=1.122882 time=182.8s
+ ttt_chunk [601/1893] bpb=1.123371 time=185.9s
+ ttt_chunk [611/1893] bpb=1.123308 time=189.0s
+ ttt_chunk [621/1893] bpb=1.123158 time=192.1s
+ ttt_chunk [631/1893] bpb=1.123903 time=195.1s
+ ttt_chunk [641/1893] bpb=1.123643 time=198.2s
+ ttt_chunk [651/1893] bpb=1.123759 time=201.3s
+ ttt_chunk [661/1893] bpb=1.123246 time=204.4s
+ ttt_chunk [671/1893] bpb=1.123611 time=207.5s
+ ttt_chunk [681/1893] bpb=1.124330 time=210.6s
+ ttt_chunk [691/1893] bpb=1.125310 time=213.7s
+ ttt_chunk [701/1893] bpb=1.124762 time=216.7s
+ ttt_chunk [711/1893] bpb=1.124741 time=219.8s
+ ttt_chunk [721/1893] bpb=1.124374 time=222.9s
+ ttt_chunk [731/1893] bpb=1.124466 time=226.0s
+ ttt_chunk [741/1893] bpb=1.124553 time=229.1s
+ ttt_chunk [751/1893] bpb=1.124388 time=232.2s
+ ttt_chunk [761/1893] bpb=1.124284 time=235.3s
+ ttt_chunk [771/1893] bpb=1.123991 time=238.4s
+ ttt_chunk [781/1893] bpb=1.124701 time=241.5s
+ ttt_chunk [791/1893] bpb=1.124307 time=244.6s
+ ttt_chunk [801/1893] bpb=1.124621 time=247.7s
+ ttt_chunk [811/1893] bpb=1.124357 time=250.7s
+ ttt_chunk [821/1893] bpb=1.124124 time=253.8s
+ ttt_chunk [831/1893] bpb=1.123943 time=256.9s
+ ttt_chunk [841/1893] bpb=1.123340 time=260.0s
+ ttt_chunk [851/1893] bpb=1.123098 time=263.1s
+ ttt_chunk [861/1893] bpb=1.122827 time=266.2s
+ ttt_chunk [871/1893] bpb=1.123090 time=269.3s
+ ttt_chunk [881/1893] bpb=1.123268 time=272.4s
+ ttt_chunk [891/1893] bpb=1.122859 time=275.5s
+ ttt_chunk [901/1893] bpb=1.122593 time=278.5s
+ ttt_chunk [911/1893] bpb=1.122701 time=281.6s
+ ttt_chunk [921/1893] bpb=1.123194 time=284.7s
+ ttt_chunk [931/1893] bpb=1.123168 time=287.8s
+ ttt_chunk [941/1893] bpb=1.122848 time=290.9s
+ ttt_chunk [951/1893] bpb=1.123224 time=294.0s
+ ttt_chunk [961/1893] bpb=1.123313 time=297.1s
+ ttt_chunk [971/1893] bpb=1.124171 time=300.2s
+ ttt_chunk [981/1893] bpb=1.124228 time=303.3s
+ ttt_chunk [991/1893] bpb=1.124239 time=306.4s
+ ttt_chunk [1001/1893] bpb=1.124194 time=309.5s
+ ttt_chunk [1011/1893] bpb=1.123999 time=312.6s
+ ttt_chunk [1021/1893] bpb=1.124350 time=315.7s
+ ttt_chunk [1031/1893] bpb=1.124818 time=318.8s
+ ttt_chunk [1041/1893] bpb=1.124450 time=321.9s
+ ttt_chunk [1051/1893] bpb=1.124200 time=325.0s
+ ttt_chunk [1061/1893] bpb=1.124270 time=328.1s
+ ttt_chunk [1071/1893] bpb=1.124875 time=331.2s
+ ttt_chunk [1081/1893] bpb=1.125172 time=334.3s
+ ttt_chunk [1091/1893] bpb=1.125919 time=337.4s
+ ttt_chunk [1101/1893] bpb=1.125932 time=340.5s
+ ttt_chunk [1111/1893] bpb=1.125773 time=343.6s
+ ttt_chunk [1121/1893] bpb=1.125569 time=346.7s
+ ttt_chunk [1131/1893] bpb=1.125455 time=349.8s
+ ttt_chunk [1141/1893] bpb=1.125156 time=353.0s
+ ttt_chunk [1151/1893] bpb=1.125168 time=356.1s
+ ttt_chunk [1161/1893] bpb=1.124798 time=359.2s
+ ttt_chunk [1171/1893] bpb=1.125108 time=362.3s
+ ttt_chunk [1181/1893] bpb=1.124343 time=365.4s
+ ttt_chunk [1191/1893] bpb=1.124225 time=368.5s
+ ttt_chunk [1201/1893] bpb=1.124636 time=371.6s
+ ttt_chunk [1211/1893] bpb=1.124180 time=374.7s
+ ttt_chunk [1221/1893] bpb=1.123870 time=377.8s
+ ttt_chunk [1231/1893] bpb=1.123608 time=380.9s
+ ttt_chunk [1241/1893] bpb=1.123258 time=384.0s
+ ttt_chunk [1251/1893] bpb=1.122673 time=387.1s
+ ttt_chunk [1261/1893] bpb=1.122640 time=390.2s
+ ttt_chunk [1271/1893] bpb=1.122271 time=393.3s
+ ttt_chunk [1281/1893] bpb=1.122074 time=396.4s
+ ttt_chunk [1291/1893] bpb=1.121831 time=399.5s
+ ttt_chunk [1301/1893] bpb=1.121238 time=402.6s
+ ttt_chunk [1311/1893] bpb=1.120828 time=405.7s
+ ttt_chunk [1321/1893] bpb=1.120508 time=408.8s
+ ttt_chunk [1331/1893] bpb=1.120440 time=411.9s
+ ttt_chunk [1341/1893] bpb=1.120323 time=415.0s
+ ttt_chunk [1351/1893] bpb=1.120243 time=418.1s
+ ttt_chunk [1361/1893] bpb=1.120306 time=421.2s
+ ttt_chunk [1371/1893] bpb=1.120192 time=424.3s
+ ttt_chunk [1381/1893] bpb=1.120179 time=427.4s
+ ttt_chunk [1391/1893] bpb=1.119783 time=430.4s
+ ttt_chunk [1401/1893] bpb=1.119751 time=433.5s
+ ttt_chunk [1411/1893] bpb=1.119858 time=436.6s
+ ttt_chunk [1421/1893] bpb=1.120104 time=439.7s
+ ttt_chunk [1431/1893] bpb=1.119817 time=442.8s
+ ttt_chunk [1441/1893] bpb=1.120322 time=445.9s
+ ttt_chunk [1451/1893] bpb=1.120656 time=449.0s
+ ttt_chunk [1461/1893] bpb=1.120212 time=452.1s
+ ttt_chunk [1471/1893] bpb=1.121265 time=455.2s
+ ttt_chunk [1481/1893] bpb=1.120809 time=458.3s
+ ttt_chunk [1491/1893] bpb=1.120625 time=461.4s
+ ttt_chunk [1501/1893] bpb=1.120538 time=464.5s
+ ttt_chunk [1511/1893] bpb=1.120563 time=467.6s
+ ttt_chunk [1521/1893] bpb=1.120587 time=470.7s
+ ttt_chunk [1531/1893] bpb=1.120083 time=473.8s
+ ttt_chunk [1541/1893] bpb=1.119941 time=476.9s
+ ttt_chunk [1551/1893] bpb=1.120253 time=480.0s
+ ttt_chunk [1561/1893] bpb=1.120268 time=483.1s
+ ttt_chunk [1571/1893] bpb=1.120085 time=486.2s
+ ttt_chunk [1581/1893] bpb=1.120205 time=489.3s
+ ttt_chunk [1591/1893] bpb=1.120053 time=492.3s
+ ttt_chunk [1601/1893] bpb=1.120213 time=495.4s
+ ttt_chunk [1611/1893] bpb=1.120152 time=498.5s
+ ttt_chunk [1621/1893] bpb=1.119723 time=501.6s
+ ttt_chunk [1631/1893] bpb=1.120027 time=504.7s
+ ttt_chunk [1641/1893] bpb=1.120035 time=507.8s
+ ttt_chunk [1651/1893] bpb=1.119989 time=510.9s
+ ttt_chunk [1661/1893] bpb=1.119861 time=514.7s
+ ttt_chunk [1671/1893] bpb=1.120338 time=517.9s
+ ttt_chunk [1681/1893] bpb=1.120488 time=521.0s
+ ttt_chunk [1691/1893] bpb=1.120321 time=524.0s
+ ttt_chunk [1701/1893] bpb=1.120478 time=527.1s
+ ttt_chunk [1711/1893] bpb=1.120483 time=530.2s
+ ttt_chunk [1721/1893] bpb=1.120475 time=533.3s
+ ttt_chunk [1731/1893] bpb=1.120354 time=536.5s
+ ttt_chunk [1741/1893] bpb=1.120160 time=540.3s
+ ttt_chunk [1751/1893] bpb=1.119994 time=543.4s
+ ttt_chunk [1761/1893] bpb=1.120119 time=546.5s
+ ttt_chunk [1771/1893] bpb=1.120013 time=549.6s
+ ttt_chunk [1781/1893] bpb=1.120046 time=552.7s
+ ttt_chunk [1791/1893] bpb=1.119635 time=555.8s
+ ttt_chunk [1801/1893] bpb=1.119522 time=558.9s
+ ttt_chunk [1811/1893] bpb=1.119422 time=562.0s
+ ttt_chunk [1821/1893] bpb=1.119469 time=565.1s
+ ttt_chunk [1831/1893] bpb=1.118879 time=569.0s
+ ttt_chunk [1841/1893] bpb=1.118886 time=572.1s
+ ttt_chunk [1851/1893] bpb=1.118681 time=575.2s
+ ttt_chunk [1861/1893] bpb=1.118320 time=578.3s
+ ttt_chunk [1871/1893] bpb=1.118311 time=581.4s
+ ttt_chunk [1881/1893] bpb=1.117867 time=584.5s
+ ttt_chunk [1891/1893] bpb=1.117626 time=587.6s
+ ttt_chunk [1893/1893] bpb=1.117667 time=588.0s
+ttt_sliding:done val_loss=1.883386 val_bpb=1.115450 elapsed=588.0s
+legal_ttt val_loss:1.8834 val_bpb:1.1155 eval_time:588454ms
+legal_ttt_exact val_loss:1.88338589 val_bpb:1.11545016
diff --git a/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed42.log b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed42.log
new file mode 100644
index 0000000000..cbd7b63cbb
--- /dev/null
+++ b/records/track_non_record_16mb/2026-03-26_Stable_Growing_Recurrance/train_seed42.log
@@ -0,0 +1,387 @@
+W0401 17:53:03.402000 182049 torch/distributed/run.py:851]
+W0401 17:53:03.402000 182049 torch/distributed/run.py:851] *****************************************
+W0401 17:53:03.402000 182049 torch/distributed/run.py:851] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+W0401 17:53:03.402000 182049 torch/distributed/run.py:851] *****************************************
+logs/bigram_ve_wd3500_3pass.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:80
+val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+feedback: mode=diagonal rank=2 per_pass=False params=2560
+recurrence: core_start=4 core_end=7 num_passes=1 max_passes=3 stem=4 core=3 tail=4 schedule=[(0, 1), (4500, 2), (5500, 3)]
+model_params:26698335
+XSA:last_4 active_layers:[7, 8, 9, 10]
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:786432 train_seq_len:2048 iterations:6500 warmup_steps:20 max_wallclock_seconds:600.000
+seed:42
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/6500 val_loss:6.9291 val_bpb:4.1038 train_time:0ms step_avg:0.01ms
+step:1/6500 train_loss:6.9308 grad_norm:0.3913 train_time:129ms step_avg:128.83ms
+step:2/6500 train_loss:8.5088 grad_norm:3.4697 train_time:163ms step_avg:81.37ms
+step:3/6500 train_loss:7.6732 grad_norm:1.7810 train_time:244ms step_avg:81.17ms
+step:4/6500 train_loss:7.3186 grad_norm:1.2484 train_time:325ms step_avg:81.22ms
+step:5/6500 train_loss:7.0964 grad_norm:1.5246 train_time:406ms step_avg:81.18ms
+step:6/6500 train_loss:6.9715 grad_norm:1.3441 train_time:487ms step_avg:81.18ms
+step:7/6500 train_loss:6.8847 grad_norm:1.4191 train_time:568ms step_avg:81.09ms
+step:8/6500 train_loss:6.8142 grad_norm:0.9933 train_time:649ms step_avg:81.09ms
+step:9/6500 train_loss:6.4831 grad_norm:0.8023 train_time:730ms step_avg:81.15ms
+step:10/6500 train_loss:6.1384 grad_norm:1.1039 train_time:811ms step_avg:81.11ms
+step:50/6500 train_loss:3.7638 grad_norm:0.8901 train_time:4103ms step_avg:82.06ms
+step:100/6500 train_loss:3.1626 grad_norm:0.5541 train_time:8225ms step_avg:82.25ms
+step:150/6500 train_loss:2.8900 grad_norm:0.4495 train_time:12415ms step_avg:82.77ms
+step:200/6500 train_loss:2.3897 grad_norm:0.3579 train_time:16547ms step_avg:82.73ms
+step:250/6500 train_loss:2.4792 grad_norm:0.3376 train_time:20686ms step_avg:82.74ms
+step:300/6500 train_loss:2.5535 grad_norm:0.3133 train_time:24876ms step_avg:82.92ms
+step:350/6500 train_loss:2.5321 grad_norm:0.2698 train_time:29017ms step_avg:82.90ms
+step:400/6500 train_loss:2.4047 grad_norm:0.2372 train_time:33216ms step_avg:83.04ms
+step:450/6500 train_loss:2.3542 grad_norm:0.2499 train_time:37357ms step_avg:83.02ms
+step:500/6500 train_loss:2.3848 grad_norm:0.1949 train_time:41502ms step_avg:83.00ms
+step:550/6500 train_loss:2.3247 grad_norm:0.2375 train_time:45705ms step_avg:83.10ms
+step:600/6500 train_loss:2.3201 grad_norm:0.1828 train_time:49846ms step_avg:83.08ms
+step:650/6500 train_loss:2.3143 grad_norm:0.1735 train_time:54060ms step_avg:83.17ms
+step:700/6500 train_loss:2.3316 grad_norm:0.1780 train_time:58196ms step_avg:83.14ms
+step:750/6500 train_loss:2.3158 grad_norm:0.1632 train_time:62336ms step_avg:83.11ms
+step:800/6500 train_loss:2.2248 grad_norm:0.1652 train_time:66556ms step_avg:83.19ms
+step:850/6500 train_loss:2.2198 grad_norm:0.1562 train_time:70697ms step_avg:83.17ms
+step:900/6500 train_loss:2.1103 grad_norm:0.1462 train_time:74903ms step_avg:83.23ms
+step:950/6500 train_loss:2.2089 grad_norm:0.1508 train_time:79046ms step_avg:83.21ms
+step:1000/6500 train_loss:2.2624 grad_norm:0.1474 train_time:83190ms step_avg:83.19ms
+step:1050/6500 train_loss:2.2083 grad_norm:0.1428 train_time:87393ms step_avg:83.23ms
+step:1100/6500 train_loss:2.3023 grad_norm:0.1304 train_time:91536ms step_avg:83.21ms
+step:1150/6500 train_loss:2.2354 grad_norm:0.1320 train_time:95746ms step_avg:83.26ms
+step:1200/6500 train_loss:2.3407 grad_norm:0.1321 train_time:99889ms step_avg:83.24ms
+step:1250/6500 train_loss:2.2390 grad_norm:0.1242 train_time:104035ms step_avg:83.23ms
+step:1300/6500 train_loss:2.0891 grad_norm:0.1257 train_time:108257ms step_avg:83.27ms
+step:1350/6500 train_loss:2.2409 grad_norm:0.1202 train_time:112405ms step_avg:83.26ms
+step:1400/6500 train_loss:2.1705 grad_norm:0.1071 train_time:116620ms step_avg:83.30ms
+step:1450/6500 train_loss:2.1071 grad_norm:0.0987 train_time:120763ms step_avg:83.28ms
+step:1500/6500 train_loss:2.2093 grad_norm:0.1028 train_time:124909ms step_avg:83.27ms
+step:1550/6500 train_loss:2.1736 grad_norm:0.0960 train_time:129123ms step_avg:83.30ms
+step:1600/6500 train_loss:2.0634 grad_norm:0.0866 train_time:133271ms step_avg:83.29ms
+step:1650/6500 train_loss:2.1779 grad_norm:0.0914 train_time:137426ms step_avg:83.29ms
+step:1700/6500 train_loss:2.1295 grad_norm:0.0810 train_time:141642ms step_avg:83.32ms
+step:1750/6500 train_loss:2.1853 grad_norm:0.0818 train_time:145801ms step_avg:83.31ms
+step:1800/6500 train_loss:2.1455 grad_norm:0.1147 train_time:150030ms step_avg:83.35ms
+step:1850/6500 train_loss:2.0186 grad_norm:0.0845 train_time:154184ms step_avg:83.34ms
+step:1900/6500 train_loss:2.1159 grad_norm:0.0816 train_time:158338ms step_avg:83.34ms
+step:1950/6500 train_loss:2.0067 grad_norm:0.0725 train_time:162560ms step_avg:83.36ms
+step:2000/6500 train_loss:2.0534 grad_norm:0.0752 train_time:166714ms step_avg:83.36ms
+step:2050/6500 train_loss:2.0987 grad_norm:0.0760 train_time:170941ms step_avg:83.39ms
+step:2100/6500 train_loss:2.0385 grad_norm:0.0749 train_time:175092ms step_avg:83.38ms
+step:2150/6500 train_loss:2.1424 grad_norm:0.0755 train_time:179247ms step_avg:83.37ms
+step:2200/6500 train_loss:2.1272 grad_norm:0.1143 train_time:183469ms step_avg:83.40ms
+step:2250/6500 train_loss:2.1605 grad_norm:0.0768 train_time:187617ms step_avg:83.39ms
+step:2300/6500 train_loss:2.1000 grad_norm:0.0772 train_time:191838ms step_avg:83.41ms
+step:2350/6500 train_loss:2.1580 grad_norm:0.0748 train_time:195996ms step_avg:83.40ms
+step:2400/6500 train_loss:2.0524 grad_norm:0.0746 train_time:200154ms step_avg:83.40ms
+step:2450/6500 train_loss:2.0746 grad_norm:0.0777 train_time:204374ms step_avg:83.42ms
+step:2500/6500 train_loss:2.1638 grad_norm:0.1080 train_time:208537ms step_avg:83.41ms
+step:2550/6500 train_loss:2.1969 grad_norm:0.0786 train_time:212758ms step_avg:83.43ms
+step:2600/6500 train_loss:2.0995 grad_norm:0.0756 train_time:216916ms step_avg:83.43ms
+step:2650/6500 train_loss:2.0600 grad_norm:0.0839 train_time:221077ms step_avg:83.43ms
+step:2700/6500 train_loss:2.0931 grad_norm:0.0721 train_time:225310ms step_avg:83.45ms
+step:2750/6500 train_loss:2.0210 grad_norm:0.0743 train_time:229467ms step_avg:83.44ms
+step:2800/6500 train_loss:2.1436 grad_norm:0.0817 train_time:233688ms step_avg:83.46ms
+step:2850/6500 train_loss:2.0573 grad_norm:0.0767 train_time:237845ms step_avg:83.45ms
+step:2900/6500 train_loss:2.0140 grad_norm:0.0738 train_time:242002ms step_avg:83.45ms
+step:2950/6500 train_loss:2.0732 grad_norm:0.0777 train_time:246232ms step_avg:83.47ms
+step:3000/6500 train_loss:2.1495 grad_norm:0.0746 train_time:250387ms step_avg:83.46ms
+step:3050/6500 train_loss:2.0383 grad_norm:0.0830 train_time:254547ms step_avg:83.46ms
+step:3100/6500 train_loss:2.0264 grad_norm:0.0722 train_time:258785ms step_avg:83.48ms
+step:3150/6500 train_loss:1.9652 grad_norm:0.0764 train_time:262939ms step_avg:83.47ms
+step:3200/6500 train_loss:2.1629 grad_norm:0.0760 train_time:267158ms step_avg:83.49ms
+step:3250/6500 train_loss:2.0424 grad_norm:0.0697 train_time:271320ms step_avg:83.48ms
+step:3300/6500 train_loss:2.0659 grad_norm:0.0749 train_time:275478ms step_avg:83.48ms
+step:3350/6500 train_loss:2.0906 grad_norm:0.0745 train_time:279711ms step_avg:83.50ms
+step:3400/6500 train_loss:2.0139 grad_norm:0.0743 train_time:283870ms step_avg:83.49ms
+step:3450/6500 train_loss:2.1105 grad_norm:0.0828 train_time:288091ms step_avg:83.50ms
+step:3500/6500 train_loss:2.1715 grad_norm:0.0732 train_time:292251ms step_avg:83.50ms
+step:3550/6500 train_loss:1.9168 grad_norm:0.0786 train_time:296411ms step_avg:83.50ms
+step:3600/6500 train_loss:2.0902 grad_norm:0.0772 train_time:300637ms step_avg:83.51ms
+step:3650/6500 train_loss:1.9711 grad_norm:0.0710 train_time:304794ms step_avg:83.51ms
+step:3700/6500 train_loss:2.0917 grad_norm:0.0703 train_time:309022ms step_avg:83.52ms
+step:3750/6500 train_loss:1.9167 grad_norm:0.0710 train_time:313181ms step_avg:83.51ms
+step:3800/6500 train_loss:2.0676 grad_norm:0.0808 train_time:317341ms step_avg:83.51ms
+step:3850/6500 train_loss:2.0849 grad_norm:0.0734 train_time:321563ms step_avg:83.52ms
+step:3900/6500 train_loss:2.0710 grad_norm:0.0744 train_time:325721ms step_avg:83.52ms
+step:3950/6500 train_loss:2.1685 grad_norm:0.0723 train_time:329932ms step_avg:83.53ms
+step:4000/6500 train_loss:1.9678 grad_norm:0.0709 train_time:334094ms step_avg:83.52ms
+step:4000/6500 val_loss:2.0609 val_bpb:1.2206 train_time:334144ms step_avg:83.54ms
+step:4050/6500 train_loss:2.0891 grad_norm:0.0704 train_time:338250ms step_avg:83.52ms
+step:4100/6500 train_loss:2.0107 grad_norm:0.0763 train_time:342478ms step_avg:83.53ms
+step:4150/6500 train_loss:2.1066 grad_norm:0.0686 train_time:346639ms step_avg:83.53ms
+step:4200/6500 train_loss:2.1462 grad_norm:0.0812 train_time:350858ms step_avg:83.54ms
+step:4250/6500 train_loss:2.1092 grad_norm:0.0763 train_time:355014ms step_avg:83.53ms
+step:4300/6500 train_loss:2.0546 grad_norm:0.0752 train_time:359173ms step_avg:83.53ms
+step:4350/6500 train_loss:2.0656 grad_norm:0.0760 train_time:363401ms step_avg:83.54ms
+step:4400/6500 train_loss:2.0301 grad_norm:0.0769 train_time:367556ms step_avg:83.54ms
+step:4450/6500 train_loss:2.0432 grad_norm:0.0745 train_time:371713ms step_avg:83.53ms
+step:4500/6500 train_loss:2.1230 grad_norm:0.0726 train_time:375939ms step_avg:83.54ms
+progressive_passes: step:4500 num_passes:2
+step:4550/6500 train_loss:2.1247 grad_norm:0.0731 train_time:381510ms step_avg:83.85ms
+step:4600/6500 train_loss:1.8322 grad_norm:0.0886 train_time:387153ms step_avg:84.16ms
+step:4650/6500 train_loss:2.0494 grad_norm:0.0743 train_time:392726ms step_avg:84.46ms
+step:4700/6500 train_loss:2.2236 grad_norm:0.1161 train_time:398301ms step_avg:84.74ms
+step:4750/6500 train_loss:2.0131 grad_norm:0.0707 train_time:403941ms step_avg:85.04ms
+step:4800/6500 train_loss:2.4144 grad_norm:0.1498 train_time:409516ms step_avg:85.32ms
+step:4850/6500 train_loss:2.0925 grad_norm:0.0791 train_time:415156ms step_avg:85.60ms
+step:4900/6500 train_loss:2.0318 grad_norm:0.0760 train_time:420732ms step_avg:85.86ms
+step:4950/6500 train_loss:2.0796 grad_norm:0.0818 train_time:426301ms step_avg:86.12ms
+step:5000/6500 train_loss:2.0858 grad_norm:0.0780 train_time:431938ms step_avg:86.39ms
+step:5050/6500 train_loss:2.0497 grad_norm:0.0826 train_time:437509ms step_avg:86.64ms
+step:5100/6500 train_loss:2.1092 grad_norm:0.0773 train_time:443146ms step_avg:86.89ms
+step:5150/6500 train_loss:2.0069 grad_norm:0.0785 train_time:448717ms step_avg:87.13ms
+step:5200/6500 train_loss:2.0194 grad_norm:0.0775 train_time:454289ms step_avg:87.36ms
+step:5250/6500 train_loss:2.0475 grad_norm:0.0722 train_time:459924ms step_avg:87.60ms
+step:5300/6500 train_loss:1.9879 grad_norm:0.0783 train_time:465494ms step_avg:87.83ms
+step:5350/6500 train_loss:1.9020 grad_norm:0.0779 train_time:471147ms step_avg:88.06ms
+step:5400/6500 train_loss:2.0257 grad_norm:0.0774 train_time:476721ms step_avg:88.28ms
+step:5450/6500 train_loss:2.0488 grad_norm:0.0765 train_time:482296ms step_avg:88.49ms
+step:5500/6500 train_loss:1.9912 grad_norm:0.0790 train_time:487937ms step_avg:88.72ms
+progressive_passes: step:5500 num_passes:3
+step:5550/6500 train_loss:1.9789 grad_norm:0.0810 train_time:494608ms step_avg:89.12ms
+step:5600/6500 train_loss:1.9256 grad_norm:0.0786 train_time:501346ms step_avg:89.53ms
+step:5650/6500 train_loss:2.0259 grad_norm:0.0842 train_time:508018ms step_avg:89.91ms
+step:5700/6500 train_loss:1.9806 grad_norm:0.0837 train_time:514687ms step_avg:90.30ms
+step:5750/6500 train_loss:2.0596 grad_norm:0.0910 train_time:521422ms step_avg:90.68ms
+step:5800/6500 train_loss:1.9578 grad_norm:0.0867 train_time:528090ms step_avg:91.05ms
+step:5850/6500 train_loss:2.0914 grad_norm:0.0835 train_time:534823ms step_avg:91.42ms
+swa:start step:5900
+step:5900/6500 train_loss:1.8655 grad_norm:0.0820 train_time:541493ms step_avg:91.78ms
+step:5950/6500 train_loss:1.9220 grad_norm:0.0783 train_time:548258ms step_avg:92.14ms
+late_qat:enabled step:5974 scale:0.1498 core_quant:on
+step:6000/6500 train_loss:1.9047 grad_norm:0.0854 train_time:555064ms step_avg:92.51ms
+step:6050/6500 train_loss:1.9335 grad_norm:0.0832 train_time:561782ms step_avg:92.86ms
+step:6100/6500 train_loss:1.8795 grad_norm:0.0848 train_time:568501ms step_avg:93.20ms
+step:6150/6500 train_loss:1.9813 grad_norm:0.0840 train_time:575291ms step_avg:93.54ms
+step:6200/6500 train_loss:1.9069 grad_norm:0.0873 train_time:582012ms step_avg:93.87ms
+step:6250/6500 train_loss:2.0284 grad_norm:0.0927 train_time:588792ms step_avg:94.21ms
+step:6300/6500 train_loss:1.9061 grad_norm:0.0825 train_time:595509ms step_avg:94.53ms
+step:6334/6500 val_loss:1.9232 val_bpb:1.1390 train_time:600161ms step_avg:94.75ms
+stopping_early: wallclock_cap train_time:600161ms step:6334/6500
+peak memory allocated: 34074 MiB reserved: 34084 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9202 val_bpb:1.1372 eval_time:2969ms
+Serialized model: 105478842 bytes
+Code size: 88253 bytes
+eval_override: num_passes 1 -> 3
+Serialized model int6+lzma: 15839344 bytes
+Total submission size int6+lzma: 15927597 bytes
+eval_feedback: loaded from artifact, params=2560
+ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=0
+ttt_sliding:params unfrozen=26700895 frozen=0
+ ttt_chunk [1/1893] bpb=1.152477 time=0.5s
+ ttt_chunk [11/1893] bpb=1.143035 time=3.6s
+ ttt_chunk [21/1893] bpb=1.129097 time=6.6s
+ ttt_chunk [31/1893] bpb=1.127795 time=9.7s
+ ttt_chunk [41/1893] bpb=1.115135 time=12.7s
+ ttt_chunk [51/1893] bpb=1.108923 time=15.8s
+ ttt_chunk [61/1893] bpb=1.115323 time=18.8s
+ ttt_chunk [71/1893] bpb=1.114068 time=21.8s
+ ttt_chunk [81/1893] bpb=1.113434 time=24.9s
+ ttt_chunk [91/1893] bpb=1.114154 time=27.9s
+ ttt_chunk [101/1893] bpb=1.117640 time=31.0s
+ ttt_chunk [111/1893] bpb=1.119946 time=34.0s
+ ttt_chunk [121/1893] bpb=1.113175 time=37.1s
+ ttt_chunk [131/1893] bpb=1.113513 time=40.1s
+ ttt_chunk [141/1893] bpb=1.119060 time=43.1s
+ ttt_chunk [151/1893] bpb=1.120887 time=46.2s
+ ttt_chunk [161/1893] bpb=1.120409 time=49.2s
+ ttt_chunk [171/1893] bpb=1.124729 time=52.2s
+ ttt_chunk [181/1893] bpb=1.126902 time=55.3s
+ ttt_chunk [191/1893] bpb=1.134130 time=58.3s
+ ttt_chunk [201/1893] bpb=1.132989 time=61.4s
+ ttt_chunk [211/1893] bpb=1.130820 time=64.4s
+ ttt_chunk [221/1893] bpb=1.132337 time=67.4s
+ ttt_chunk [231/1893] bpb=1.131158 time=70.5s
+ ttt_chunk [241/1893] bpb=1.131378 time=73.5s
+ ttt_chunk [251/1893] bpb=1.130935 time=76.6s
+ ttt_chunk [261/1893] bpb=1.128082 time=79.6s
+ ttt_chunk [271/1893] bpb=1.126981 time=82.7s
+ ttt_chunk [281/1893] bpb=1.128422 time=85.7s
+ ttt_chunk [291/1893] bpb=1.130257 time=88.8s
+ ttt_chunk [301/1893] bpb=1.131029 time=91.8s
+ ttt_chunk [311/1893] bpb=1.133104 time=94.9s
+ ttt_chunk [321/1893] bpb=1.135079 time=97.9s
+ ttt_chunk [331/1893] bpb=1.134946 time=100.9s
+ ttt_chunk [341/1893] bpb=1.133947 time=104.0s
+ ttt_chunk [351/1893] bpb=1.136247 time=107.0s
+ ttt_chunk [361/1893] bpb=1.136484 time=110.1s
+ ttt_chunk [371/1893] bpb=1.135787 time=113.2s
+ ttt_chunk [381/1893] bpb=1.135928 time=116.2s
+ ttt_chunk [391/1893] bpb=1.135725 time=119.3s
+ ttt_chunk [401/1893] bpb=1.133622 time=122.3s
+ ttt_chunk [411/1893] bpb=1.132451 time=125.4s
+ ttt_chunk [421/1893] bpb=1.131507 time=128.4s
+ ttt_chunk [431/1893] bpb=1.131405 time=131.5s
+ ttt_chunk [441/1893] bpb=1.131771 time=134.6s
+ ttt_chunk [451/1893] bpb=1.132087 time=137.6s
+ ttt_chunk [461/1893] bpb=1.131017 time=140.7s
+ ttt_chunk [471/1893] bpb=1.131668 time=143.7s
+ ttt_chunk [481/1893] bpb=1.131297 time=146.8s
+ ttt_chunk [491/1893] bpb=1.130192 time=149.9s
+ ttt_chunk [501/1893] bpb=1.129677 time=152.9s
+ ttt_chunk [511/1893] bpb=1.128974 time=156.0s
+ ttt_chunk [521/1893] bpb=1.126707 time=159.1s
+ ttt_chunk [531/1893] bpb=1.127885 time=162.1s
+ ttt_chunk [541/1893] bpb=1.128209 time=165.2s
+ ttt_chunk [551/1893] bpb=1.127153 time=168.2s
+ ttt_chunk [561/1893] bpb=1.127686 time=171.3s
+ ttt_chunk [571/1893] bpb=1.126673 time=174.3s
+ ttt_chunk [581/1893] bpb=1.125867 time=177.4s
+ ttt_chunk [591/1893] bpb=1.125232 time=180.4s
+ ttt_chunk [601/1893] bpb=1.125737 time=183.5s
+ ttt_chunk [611/1893] bpb=1.125643 time=186.5s
+ ttt_chunk [621/1893] bpb=1.125484 time=189.7s
+ ttt_chunk [631/1893] bpb=1.126196 time=192.7s
+ ttt_chunk [641/1893] bpb=1.125945 time=195.8s
+ ttt_chunk [651/1893] bpb=1.126042 time=198.8s
+ ttt_chunk [661/1893] bpb=1.125539 time=201.9s
+ ttt_chunk [671/1893] bpb=1.125913 time=204.9s
+ ttt_chunk [681/1893] bpb=1.126614 time=208.0s
+ ttt_chunk [691/1893] bpb=1.127610 time=211.0s
+ ttt_chunk [701/1893] bpb=1.127051 time=214.1s
+ ttt_chunk [711/1893] bpb=1.127028 time=217.1s
+ ttt_chunk [721/1893] bpb=1.126685 time=220.2s
+ ttt_chunk [731/1893] bpb=1.126729 time=223.3s
+ ttt_chunk [741/1893] bpb=1.126821 time=226.3s
+ ttt_chunk [751/1893] bpb=1.126687 time=229.4s
+ ttt_chunk [761/1893] bpb=1.126623 time=232.4s
+ ttt_chunk [771/1893] bpb=1.126304 time=235.5s
+ ttt_chunk [781/1893] bpb=1.127035 time=238.5s
+ ttt_chunk [791/1893] bpb=1.126622 time=241.6s
+ ttt_chunk [801/1893] bpb=1.126957 time=244.6s
+ ttt_chunk [811/1893] bpb=1.126711 time=247.7s
+ ttt_chunk [821/1893] bpb=1.126489 time=250.8s
+ ttt_chunk [831/1893] bpb=1.126309 time=253.8s
+ ttt_chunk [841/1893] bpb=1.125658 time=256.9s
+ ttt_chunk [851/1893] bpb=1.125406 time=259.9s
+ ttt_chunk [861/1893] bpb=1.125146 time=263.0s
+ ttt_chunk [871/1893] bpb=1.125420 time=266.0s
+ ttt_chunk [881/1893] bpb=1.125586 time=269.1s
+ ttt_chunk [891/1893] bpb=1.125168 time=272.1s
+ ttt_chunk [901/1893] bpb=1.124899 time=275.1s
+ ttt_chunk [911/1893] bpb=1.125018 time=278.2s
+ ttt_chunk [921/1893] bpb=1.125495 time=281.2s
+ ttt_chunk [931/1893] bpb=1.125456 time=284.3s
+ ttt_chunk [941/1893] bpb=1.125151 time=287.3s
+ ttt_chunk [951/1893] bpb=1.125534 time=290.4s
+ ttt_chunk [961/1893] bpb=1.125608 time=293.4s
+ ttt_chunk [971/1893] bpb=1.126466 time=296.5s
+ ttt_chunk [981/1893] bpb=1.126540 time=299.5s
+ ttt_chunk [991/1893] bpb=1.126531 time=302.5s
+ ttt_chunk [1001/1893] bpb=1.126461 time=305.6s
+ ttt_chunk [1011/1893] bpb=1.126239 time=308.6s
+ ttt_chunk [1021/1893] bpb=1.126577 time=311.7s
+ ttt_chunk [1031/1893] bpb=1.127017 time=314.7s
+ ttt_chunk [1041/1893] bpb=1.126693 time=317.8s
+ ttt_chunk [1051/1893] bpb=1.126438 time=320.8s
+ ttt_chunk [1061/1893] bpb=1.126498 time=323.9s
+ ttt_chunk [1071/1893] bpb=1.127124 time=327.0s
+ ttt_chunk [1081/1893] bpb=1.127399 time=330.0s
+ ttt_chunk [1091/1893] bpb=1.128159 time=333.1s
+ ttt_chunk [1101/1893] bpb=1.128177 time=336.1s
+ ttt_chunk [1111/1893] bpb=1.128034 time=339.2s
+ ttt_chunk [1121/1893] bpb=1.127833 time=342.3s
+ ttt_chunk [1131/1893] bpb=1.127712 time=345.3s
+ ttt_chunk [1141/1893] bpb=1.127398 time=348.4s
+ ttt_chunk [1151/1893] bpb=1.127410 time=351.5s
+ ttt_chunk [1161/1893] bpb=1.127032 time=354.5s
+ ttt_chunk [1171/1893] bpb=1.127359 time=357.6s
+ ttt_chunk [1181/1893] bpb=1.126599 time=360.6s
+ ttt_chunk [1191/1893] bpb=1.126502 time=363.6s
+ ttt_chunk [1201/1893] bpb=1.126926 time=366.7s
+ ttt_chunk [1211/1893] bpb=1.126480 time=369.7s
+ ttt_chunk [1221/1893] bpb=1.126193 time=372.8s
+ ttt_chunk [1231/1893] bpb=1.125910 time=375.8s
+ ttt_chunk [1241/1893] bpb=1.125549 time=378.9s
+ ttt_chunk [1251/1893] bpb=1.124967 time=382.0s
+ ttt_chunk [1261/1893] bpb=1.124947 time=385.0s
+ ttt_chunk [1271/1893] bpb=1.124571 time=388.1s
+ ttt_chunk [1281/1893] bpb=1.124391 time=391.1s
+ ttt_chunk [1291/1893] bpb=1.124160 time=394.2s
+ ttt_chunk [1301/1893] bpb=1.123558 time=397.3s
+ ttt_chunk [1311/1893] bpb=1.123176 time=400.3s
+ ttt_chunk [1321/1893] bpb=1.122847 time=403.4s
+ ttt_chunk [1331/1893] bpb=1.122788 time=406.4s
+ ttt_chunk [1341/1893] bpb=1.122665 time=409.5s
+ ttt_chunk [1351/1893] bpb=1.122591 time=412.5s
+ ttt_chunk [1361/1893] bpb=1.122658 time=415.6s
+ ttt_chunk [1371/1893] bpb=1.122520 time=418.6s
+ ttt_chunk [1381/1893] bpb=1.122507 time=421.9s
+ ttt_chunk [1391/1893] bpb=1.122107 time=425.0s
+ ttt_chunk [1401/1893] bpb=1.122084 time=428.1s
+ ttt_chunk [1411/1893] bpb=1.122201 time=431.1s
+ ttt_chunk [1421/1893] bpb=1.122447 time=434.1s
+ ttt_chunk [1431/1893] bpb=1.122150 time=437.2s
+ ttt_chunk [1441/1893] bpb=1.122669 time=440.2s
+ ttt_chunk [1451/1893] bpb=1.123008 time=443.2s
+ ttt_chunk [1461/1893] bpb=1.122569 time=446.3s
+ ttt_chunk [1471/1893] bpb=1.123625 time=449.3s
+ ttt_chunk [1481/1893] bpb=1.123158 time=452.4s
+ ttt_chunk [1491/1893] bpb=1.122974 time=455.4s
+ ttt_chunk [1501/1893] bpb=1.122869 time=458.5s
+ ttt_chunk [1511/1893] bpb=1.122901 time=461.5s
+ ttt_chunk [1521/1893] bpb=1.122938 time=464.6s
+ ttt_chunk [1531/1893] bpb=1.122409 time=467.6s
+ ttt_chunk [1541/1893] bpb=1.122271 time=470.7s
+ ttt_chunk [1551/1893] bpb=1.122576 time=473.7s
+ ttt_chunk [1561/1893] bpb=1.122587 time=476.8s
+ ttt_chunk [1571/1893] bpb=1.122415 time=479.9s
+ ttt_chunk [1581/1893] bpb=1.122534 time=482.9s
+ ttt_chunk [1591/1893] bpb=1.122382 time=486.0s
+ ttt_chunk [1601/1893] bpb=1.122561 time=489.0s
+ ttt_chunk [1611/1893] bpb=1.122496 time=492.1s
+ ttt_chunk [1621/1893] bpb=1.122078 time=495.1s
+ ttt_chunk [1631/1893] bpb=1.122388 time=498.2s
+ ttt_chunk [1641/1893] bpb=1.122410 time=501.2s
+ ttt_chunk [1651/1893] bpb=1.122366 time=504.3s
+ ttt_chunk [1661/1893] bpb=1.122246 time=508.2s
+ ttt_chunk [1671/1893] bpb=1.122717 time=511.2s
+ ttt_chunk [1681/1893] bpb=1.122863 time=514.3s
+ ttt_chunk [1691/1893] bpb=1.122692 time=517.3s
+ ttt_chunk [1701/1893] bpb=1.122854 time=520.4s
+ ttt_chunk [1711/1893] bpb=1.122857 time=523.4s
+ ttt_chunk [1721/1893] bpb=1.122852 time=526.5s
+ ttt_chunk [1731/1893] bpb=1.122739 time=529.6s
+ ttt_chunk [1741/1893] bpb=1.122554 time=532.7s
+ ttt_chunk [1751/1893] bpb=1.122384 time=535.7s
+ ttt_chunk [1761/1893] bpb=1.122522 time=538.8s
+ ttt_chunk [1771/1893] bpb=1.122413 time=541.8s
+ ttt_chunk [1781/1893] bpb=1.122441 time=544.9s
+ ttt_chunk [1791/1893] bpb=1.122033 time=548.0s
+ ttt_chunk [1801/1893] bpb=1.121908 time=551.0s
+ ttt_chunk [1811/1893] bpb=1.121817 time=554.1s
+ ttt_chunk [1821/1893] bpb=1.121874 time=557.1s
+ ttt_chunk [1831/1893] bpb=1.121276 time=560.2s
+ ttt_chunk [1841/1893] bpb=1.121284 time=563.3s
+ ttt_chunk [1851/1893] bpb=1.121066 time=566.3s
+ ttt_chunk [1861/1893] bpb=1.120692 time=569.4s
+ ttt_chunk [1871/1893] bpb=1.120685 time=572.4s
+ ttt_chunk [1881/1893] bpb=1.120241 time=575.5s
+ ttt_chunk [1891/1893] bpb=1.120004 time=578.5s
+ ttt_chunk [1893/1893] bpb=1.120051 time=579.0s
+ttt_sliding:done val_loss=1.887157 val_bpb=1.117684 elapsed=579.0s
+legal_ttt val_loss:1.8872 val_bpb:1.1177 eval_time:579405ms
+legal_ttt_exact val_loss:1.88715720 val_bpb:1.11768375