diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/README.md b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/README.md
new file mode 100644
index 0000000000..f26bda5dbc
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/README.md
@@ -0,0 +1,71 @@
+# Record: SP8192 + Score-First TTT + Eval-Time Hash Embedding
+
+**val_bpb: 1.08269** (3-seed mean, std 0.00060) | ~15.99 MB | 8xH100 SXM | ~450s eval
+
+Merged SOTA (PR #1019, 3-seed mean): **1.88218 nats**. This run: **2.79670 nats**. Delta: **-0.914 nats**, clearing the 0.005-nat threshold.
+
+## Results (3-seed)
+
+| Seed | val_bpb | val_loss (nats) | Artifact (bytes) |
+|------|---------|-----------------|------------------|
+| 1337 | **1.08218** | 2.79537 | 15,982,929 |
+| 42 | **1.08252** | 2.79626 | 15,988,459 |
+| 2025 | **1.08337** | 2.79846 | 15,989,420 |
+| **Mean** | **1.08269** | **2.79670** | |
+
+## Changes from Merged SOTA (PR #1019)
+
+### 1. Eval-Time Hash Embedding (Novel)
+
+A zero-initialized `nn.Embedding(16384, 512)` is created at evaluation time and trained exclusively through the score-first TTT loop. At each position, a bigram hash `h = (prev_token * 2039 + curr_token) % 16384` looks up a residual vector that is added to `tok_emb(x)` before RMSNorm. The hash embedding learns document-local bigram patterns without modifying any pre-trained model weights.
+
+**Nearest PR:** PR #1413 (@kevclark) — legal score-first TTT with full-model weight updates. **Different:** We add an ephemeral hash embedding that is instantiated from zeros at eval start and adapts via the same TTT loop. This is a new adaptation target — the model tunes a separate bigram-keyed memory alongside its existing weights. No existing PR creates and trains a new embedding module from scratch at eval time (LoRA-TTT PRs #1254/#1354 create adapter matrices, but those adapt existing layers, not a standalone hash embedding).
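The lookup above can be sketched end to end. This is a minimal NumPy stand-in for the actual `nn.Embedding` path: the bucket count, multiplier 2039, and dims come from this README, while the random token table and the skip-at-position-0 handling are illustrative assumptions.

```python
import numpy as np

NUM_BUCKETS, DIM, VOCAB = 16384, 512, 8192
MULT = 2039  # bigram hash multiplier from the README

# Zero-initialized at eval start: the residual is a no-op until TTT updates it.
hash_table = np.zeros((NUM_BUCKETS, DIM), dtype=np.float32)
tok_emb = np.random.default_rng(0).standard_normal((VOCAB, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, curr_tok: int) -> int:
    """Map an input-token bigram (prefix-only, never a model prediction) to a bucket."""
    return (prev_tok * MULT + curr_tok) % NUM_BUCKETS

def embed_with_hash_residual(tokens: np.ndarray) -> np.ndarray:
    """tok_emb(x) plus the bigram-keyed residual, applied before RMSNorm.

    Position 0 has no previous token; skipping the residual there is an
    assumption, as the submission may use padding or a BOS token instead.
    """
    out = tok_emb[tokens].copy()
    for t in range(1, len(tokens)):
        out[t] += hash_table[bigram_bucket(tokens[t - 1], tokens[t])]
    return out

x = np.array([5, 17, 5, 17])
y = embed_with_hash_residual(x)
assert np.allclose(y, tok_emb[x])  # zero table: residual contributes nothing yet
```

Because each position indexes exactly one row, TTT gradients only ever touch rows whose bigrams occur in the current chunk, which is what makes the memory document-local.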
+
+**Measured delta:** -0.0004 val_bpb vs. the packed baseline without the hash embedding (ablation: 1.08307 mean without, 1.08269 mean with).
+
+### 2. Score-First TTT (Legal)
+
+SGD with momentum 0.9, LR=0.005, 3 epochs per 32K-token chunk, cosine decay. All model blocks unfrozen (freeze=0). Same mechanism as PR #549 and PR #1413.
+
+**Measured delta:** -0.002 val_bpb vs. sliding window without TTT.
+
+### 3. SP8192 Architecture Stack
+
+- 11 layers, model_dim=512, 8 heads, 4 KV heads
+- Parallel residuals (layers 7-10, PaLM-style)
+- Depth recurrence (layers 4-5, looped 2x)
+- Skip gates (sigmoid-gated skip connections)
+- QK-Gain 4.0, XSA (all 11 layers)
+- Full-Hessian GPTQ int6 + byte-shuffle + brotli compression
+- Coprime-stride weighted multi-shard data loader
+- Code packed with an lzma+base85 self-extracting wrapper (saves 32KB)
+
+## Compliance
+
+Per Issue #1017 (Track B — legal eval-time adaptation):
+
+- **Condition 1 (Causal/prefix-only):** The hash key uses `(prev_token, curr_token)` — both are input token identities from `x_batch = chunk[:-1]`, not model predictions. The hash embedding at position t depends only on prefix tokens.
+- **Condition 2 (Full normalized distribution):** The hash residual is added to the embedding before RMSNorm, followed by the standard transformer, tied LM head, and full-vocab softmax.
+- **Condition 3 (Score-before-update):** Each chunk is fully scored under `torch.no_grad()` before any TTT parameter update. The hash embedding is updated as part of the standard TTT training step, after scoring.
+- **Condition 4 (Single left-to-right pass):** One evaluation pass, no rescoring, no multi-pass selection.
+- **Precedent for eval-time-created parameters:** LoRA-TTT PRs #1254 and #1354 also instantiate new trainable parameters at eval time.
+
+No SLOT, no pre-quant TTT, no n-gram caches, no ETLB.
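The score-before-update ordering that Condition 3 describes can be shown with a framework-free sketch. The momentum 0.9, LR 0.005, and 3 epochs per chunk match this README; the toy quadratic loss, chunk shapes, and parameter vector are placeholders for the real model and its bpb metric.

```python
import numpy as np

rng = np.random.default_rng(1337)

LR, MOMENTUM, EPOCHS = 0.005, 0.9, 3   # TTT hyperparameters from the README
params = np.zeros(8)                   # stand-in for the adapted weights
velocity = np.zeros_like(params)

def loss(p: np.ndarray, chunk: np.ndarray) -> float:
    # Toy surrogate for the per-chunk score: squared error to the chunk values.
    return float(np.mean((chunk - p.mean()) ** 2))

def grad(p: np.ndarray, chunk: np.ndarray) -> np.ndarray:
    # Exact gradient of the toy loss above w.r.t. each parameter.
    return np.full_like(p, -2.0 * np.mean(chunk - p.mean()) / p.size)

scores = []
for chunk in rng.normal(loc=3.0, size=(6, 64)):   # six toy "32K-token" chunks
    scores.append(loss(params, chunk))            # 1) score first, params frozen
    for _ in range(EPOCHS):                       # 2) only then run TTT updates
        velocity = MOMENTUM * velocity + grad(params, chunk)
        params = params - LR * velocity

reported = sum(scores) / len(scores)  # the metric averages pre-update scores only
```

Each chunk's contribution to `reported` is computed before that chunk ever influences the parameters, so later chunks benefit from adaptation while the score itself stays strictly predictive.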
+
+## Reproduction
+
+```bash
+pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+No env vars needed. All defaults are the submission config.
+
+## Credits
+
+- Base architecture: PR #549 (@abaybektursun), PR #1019 (@abaybektursun)
+- Score-first TTT framework: PR #549 (@abaybektursun), PR #1413 (@kevclark)
+- Parallel residuals + depth recurrence: PR #1204 (@msisovic)
+- SP8192 + GPTQ embeddings + SDClip: PR #1394 (@clarkkev)
+- Coprime-stride loader: PR #726, PR #1060
+- Eval-time hash embedding: original to this submission
diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/requirements.txt b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/requirements.txt
new file mode 100644
index 0000000000..4b9cb77bcb
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/requirements.txt
@@ -0,0 +1 @@
+flash_attn_3
diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/submission.json b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/submission.json
new file mode 100644
index 0000000000..b393169da3
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/submission.json
@@ -0,0 +1,8 @@
+{
+  "description": "SP8192 + Score-First TTT + Eval-Time Hash Embedding",
+  "val_bpb_mean": 1.08269,
+  "val_bpb_std": 0.00060,
+  "seeds": [1337, 42, 2025],
+  "hardware": "8xH100 SXM",
+  "framework": "PyTorch 2.9.1+cu128"
+}
diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_gpt.py b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_gpt.py
new file mode 100644
index 0000000000..a78750fe08
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_gpt.py
@@ -0,0 +1 @@
+import 
lzma,base64;exec(compile(lzma.decompress(base64.b85decode(b'{Wp48S^xk9=GL@E0stWa8~^|S5YJf5;Kk@Q)?ENJn@VT6Qap3bt~@<3h>ok~)Km^%c^ys%R{D_%yAk9-_tV7^coUOo3$w>`(`ci)t`2F7>r>Ltx>>S2CRw|7ov>Wn1e~_!RLQ=%V9g?)G3yPsu%SBy!lj1PaC-x%dDmCDOZ^r^!)+WWz}ejKXTJ#^U6Ra!};QocHHXQC+4UM!QQ!-N5Xd|%~a(9)bTYIO+>B~8~@lqmri%^qEkQUy074Rh6w7V_#^s9J-3BNA`G;qyR$LYcI?e+loZVWi~B$n=TKFp{%SeHYp{oNWh;U@Ahk8M2$OU%K8B$lb*dRQXd-GR_@*KAZdRdwSd#X_bO(lvJ3fp9Otblkh?o!zlDF02+sRjLV6IqG{ieQx44UY(f20c)^AD5kE{7_@f9?Q-ePHMY$wCTcn5ij2k?>T>CFcZ<|5Bh`%hA!j2d4G(X-Bbwu<(#drck2`tR2eo$wi$p$UEHkQdFiFmlJR#zIG@3*smdlqZ?s>Cn@I!i44iGk>T1KUmKDUWEJXYFF3Mh*&Tbca$esa+z^`enxeV%UmK_#Ex_)>$lBJA(Wj|4yV%J<~unPL@@@KfP=NTcv-SVPiG3BDdu=*>C1izrS~RvqEe6Re7Xf)zp2fR3F%Ntl(>3N{Nxb8vzZkhK?{Rpe$4_KYFAQQI8+y6>9jc4wHeJ2^*|#6%ShCfQ3zc=<|0Z`Ja7Y9E|lQy~CwnZkBE%teQ=I^Ddvss1V~h8|jSxW!&||)rKK)$#K{ntoatClG0YVqyY7Qmr3Mv0zyH;LQ+{zu5s@m_NC1q7o?u96Zu_lD}&wKWnBj8K_s&H!?(%4lDiW`^h3XfN^w7^NZ}p;fxjfP(oriMP93@G`Hokf^LPBO_tUOCA9Y?Ycbc|&Ozk{6p6^Be5B~llA6b0L6cbtv^Vkq!Kv#A|GL0_5%6C1U%s;o?{>h@9KI+Dwrzrw4p&PwOoXNEPg+i3?tCZo5gvp{$48`j*%6?1_2lYE_<5L|gF)ZT%Mdifaor1Y6A-xqGx_jCe-eXB1@O;wxE}o1!2=w51P~Rf{vSRtlDlpsWdo(B|H(J|Hre)n=$!&fH&HLWy~FtJL=>Qr;5H0HeDo`2Eq0fPkI)-B?>WuR!Qs7^_Bzx?q((%!D@A{nSVKd=>Z`XgS;F+huHn#(`cbm!kgBnH^MhO^f*1SC+`LOy&4J-gGEKkGWERO-S#{sfQO=3uA-UVa^p^2!ba3o@d;o$LIWyQl7&(V<&>vkFuQG5*8tzEV3V~2{f=7B3NKBi5um4Cn`~g4!XZqa1I(w;eSCs}}<*yvx_&snIjihwB69SiB}ZJ;~f~}(5Vv^Z7Od3)MXj^A|%#SyL-VXtA&1LawKB}dnsR3Ra!of#eNYI3`qsu?m9jhKkP%}@Ez$ziMK*%!R3DrFN;7zO~0lVDl%a5+Z5ga4d+5$WTb|G5Ic{sq1ZlaK3sJhJ-1p?{Fb}VZtDUu^cMrE+gzLwxH&q4KOpi{aBwB7WU)qA7b$9e0zXA|U40=0lyK0h49)ddWv=jOnk^JN&nqWgTroN<&w>2FCfm5*wcyG#JY97nYMM-Gz!iHnI{KdzST7cOjf=fjSs4?Y)YuPq6y=y9$(Kmm8hpFf!Osj@vs-($2gJfddu!w&P?U?J&bR!UIMicEu{m*m0K#BTmn_!JtFmyFu^y66#hviAQEF3fC@77yxTPm%3N@PkzuU#)@toz!*f?m}w9n(iB?n+ZF#kb+5PBH`76Z3!VQpdfis4Ji)RD$Ni!W#^nLJ`9n=Ib$g7sS5|AJc&GeS(AI4er}qP18hR(b7?$XY-)Z((k}>m_$TZB;*KIN4q*YY|(UJM?1g42p(0lXk0Yx9a<+5VJww)lNdqEXlfqG!@>TWvV3E5A9QaDoc*~?Ca)leBF(k1NLe^d~LMi*rfFnPJiBok>ogdfz%SNghYs!@8ekhz#!fI+ag^bO
|@)scK_+n!l^9{YRPLyA8{R+b_}t~%_y2qgDqjHfCmo$7J-l@BBAh;(u{~iORLxs%15f;m>2nqy5&pw*z!U$}hB4S<*Io0#lQzuBlZC^A10La$Iq1)PHSkz`Ji0&ldT#e`G0kp>Ieq`*z*0>fR4FY&*27K2(Gh+4j|Qa8BrfnrZ>!hA;|NpSg-QPEaa^Ko&l5P29r^sFg~M1J5-3+f7Y=>&#fXlmyzZHAHW+$sH<@w234FrYi~Sn)PdrK3oEC=!OVXB!6{Ci|Javi1!Lqg#v9{7k+xArC&iA*@eGjz$XDmr?V@iXzQt<_>8xe<4!T5e*gMw&?($`vzzj(S4pxk@5^d!l0Cz=(Ggi}v%FnLq~+AV4RBRn>J!GVznI{S_9+ePwGn-a@sZ<~(QTf?+TM2~7T%UPz`b}ezDg@yLj#2&WpHHGo)^xamIhd|oaC|Y?*4x0t2vB^9~NZaN_-#X;q+*DC7#6}NMQBa4Oq`w1g5oK!8$|v&d5Qr3h=uR&^~@c1wPGN1@=`}Blc1>6CH>G7;u@RjdElXI3BjA&gldk7laqa=O7k0nWjJD`s}FSxcwnEK~l`odt6+R7qobwyU!K?6FCYN7uqOj)|hSF{frsOD|Rjz{$q|$OtGYv6p3`Y`@Lomky1p04{bg=hRLGr_AVsFOcHx?Y&VX0-D%XwxF{I2pqyI7L1W!v%$AAC$w8%T=G&Nzhzuu1S;1Ry?{D?a0#3o9DTin%hwwfkSq@MZJZo*QV%&D7HIE^HT;FytYI@jMqiK}w?Ju&tK=Gg4EVGN@sI8y$S|=o)z+s}f|)Q;ImHpF=(N-2O)D-61CMU;C610ATbnv@9(!ENsQ@xM9C166zSwGfOJ;d#?IVqK?|D`qw-ybt<3KLXP>5mpqlgraQJe)%E9ND`G~&j@t|zAm5Z3OAqzdE@Sx(K!fUFf-H))-=fVvYotdYcL9`ivcd%OwJS`BS|llc|*UgC!fd$Y5xNb7YbRadgpUP>v={`Oyj|c+;mZN?8vO&FYGfC70x8C>Cxiy{o-eF19@*krqp9thY*iZX&L00+m_Z>hd;AatNUC?a?nqWU?R&UWOeGw3gr@kx$=Vyc)BqQ!;=s>^)K@i8S#eu87?`q_dFh?IRX!jg(%0SSW!q$fnRBPK;D2G{w#H5aZ|B1GuaLt7TyZ8K!j1^(A-_m@Y|M7+sSd)%EE3Ecp@KmRRn_89}eyMKF*^*;G9u*=Q`~FlzjG<5xii1Aymy=dymJ15MHj)|r>P~Tq%pS9;QcYAbIiUEsP^j$Ib^$m=Lvz!kw`kqNI4k$}RpI>&#gt2K-kp&0!4CwZ#FpJe;*(>EVaeDmF$Tn^i*RT%&SrR`uWrpzfl;jJ*ODIPb-_dFxq26*Sdd=N7Y<0YK>hOzj1aGnLIP=-US?lFjJjF+*{Nt*dB^N&T0DmW*r@egqQ_IZia{SN1b1aF21WRnq3K=EuE`v+En3^WYx!Ctcgs9zZ@6WDoI}v{0h4ovB6b_UZtSnY2W;g&hW@bSrf&HztgyEuxT2@r7Qg|aUGs%i@nZI<#!B!HrFLsn!P#~sQ!DiJNpRurW8b?GqBV;>Fy3SXZ|{f#iYt?qb*&WihDEFDpomS70sb?Q+MiZjrmK;EiBa{p7588}w%fFB~sN&~;7DWjo@R|2)`qNzyChI^*}87zA`ggONPgfg%@yAD}f#sUSGiH5ha$?N+q(HCL(YCsKf2ngT`6qrBh(7Ji@D+qnqqzfOwPTJD)hele48)G+#1A9rXwxJRKu#n{pG4a2G&4@o@>66S4%r_ouAv)ND^%%n3A#XcN7<6qnL9s`SUmVEUow?@fdfbfC17{O(ehytxPAIP^eRxTsCLgBS(dkC$)ui5sk!duiXwbHjY_XPW@+7?~*DB*8jdEu8!rke?BHn?CAs2$ix2YSvw=%U7cVym}GZOE5iRm9^EKps)zR8jK?{d|cEFt=_1Mx&lY^P#a<59DKSL)CFL0;dEN&J~iOM?ck4PT0TwKN4t{)5NeDX{*FLsKFWdm7UpP<
>GbxdJTEL$WF^MmZIn&lx7dgZWUVKgTT!jxVDD7Ft@lCWcW0FqLP%H_N6Q_pmc9FaVfT;!S2YX;~KOuXN1>&DHnf4$R(lBNe0;zX4mRjhVMcaK49L<+TTA>Qsc{EjF}m&9qgwSSr_C2DX>&R>x5pQ}`je02_Ff>>O{cknG&_NK2?Ek6V+PsD%6bb6!zrf)I*ZBz=`8On3m7d0{{b8TtHlapW03%V>>4CG!sYsros*u!OfYbG*HO4^TiQnjP%?@rL_hw!)S#AFQ-nlI*clRL$`**jATX)vJgkZLOwLqFtyMc%7DUTGnbT`i`4@o){05qV&5`zM6MEXz;1HW@4Ptq{B9ab0MT93ADoQh24l0g+4{MY95&%5%1MOX!S@?)wi@g1ER6Ph*kwdV9kPBGvUPMYxRSzwVwy^^cGj=R^Cnr%NCIBkk}$>J<_+=brBGhQ5VcmTf&cK*uNT2Q#AGV8@=4r&~nEwI2rJzr?S0~6#mu)Qbl4Ax_1TFMcek-V(P)e(dhRvT__EnV40~mwXB#(2CnW1HDS=#dn4X<*n_8IRvrBTaWkEEAya=^WbDHnsWlt#(tNMa;%Wv|&eR9@v}C1ZkbocvLsPY}GG(plSealh7aADL_-%McYA)9N=RD76=ovi5K+Doo;6!kb5qsf?B&{TvLwN3z2?a2=XeXr?Yj-uYvjCZg$3I7)uT;yL@ii6sr{!$-sqOUo^7p5uPi3KFtt4;=g^%&o=;vG(s4xM%TPr|22-_quVq3EL`3OC9c0jl$`)5U3$y-=37!JgU=r=VQo9+;t(+r3IuxxRtjOE&_Xjf|tiN=-buTh2gG)QRx45nl4{bb%u;&9OI=&QT$<}i!?zi$fP(H4oNu%4V4Q$m=Y!&rwNA%^L+!M`zGoV=J;y?cV14x4;in9-LtCXi<{CNIrUM1&edCZq!sEO1(9dzuyKk*CGT}E@x=vKf6D6@V&N|0VNv#>r}eloR~@rMC!PdxsLal;T?l`i7dc&j6<9Q*XF=Qcb?+j$0T~7BJEA;W;MvL&_H9w<0AiQQ)4G-edik`ywLLoahj2+dd+0=f^l1hEugZ%-G=Yu3Y0;YActICo1%yONYULadxxa|nYUj};&;4g+;34dy#OD`l=dvXSx1PaiuQjd;Vx*d<58V!R-sX`C?Qatn*B*F#h_RtE8s>}xNJ)*LZYYm6kEP118d=NPx3e;m~|UTUmQoc$)Fw^)wwBk@W_vfazwN|b`zRo+pX?LP8X_%m5YPR|51Y;4dDW`U8Mmh8dndG)GwPKd2u@UrxZEn%RAOx3FDljXSSbIMrxits-(|2lX;DY6K4Z_Ieo*GC3bQ%3U5N$2Yv(3~ZP~<;RI!qMgAl(=L0?642W7RRfHYGDJ+0eRu?bKosp65(sa+(hu(nn=p56JfC2TwL@e4tcxeiKWog?_0kDj{uA-{w(P)yG0bUQf-jO3tG5Gr%Bi8#h48Kz*Ux+T`RH^rIk0L}mePuKn<+h`tdUq-kLEMx!#OQwwc{kX<(TObr!4Q^gmm7vK<6Nor|meYhed@OPV*#f}!-vs+iK6iK)-o(0vY4dp)YK!^(OcM%_FzzEqq3P;cM=nG8o(rvLyfks7QyP9mdQ6O}4F?Y)N(u9(jR%7udli=?M?Z)Bo7+ZZgew)1E5Tq$WsOLt`0EhG@C%*a6UaiMPT9^DD#OnGyCDg)nRap0o_(KTOURO3n66KTg?CxfiD|J{RpP80c=6h#g}_~*hI?LCm>zkQG9Mp4dTL#+Qgd0X#Gcj9hno-Z15S#(!22k)!Qmlq7xN3uX@ZiA5^k3V~EDDAoTVFkI+U5H$-;*~%#q9b9O%bx&sy%hiAharUWk<7pWsGlj3_Bp&eUmz8Y3DBJsW%56dPQ|L3EUOlY>{(j_i@8`BJuxT1Zfs)WHKO9Y3{?!~PwRzv&)(`VCfgw+pL60VWT=yg5I`th7_Ssjna?yAcCj$`wq>Pp9d8b0tQ8B_CeU`cIKwQIt><8FbY{%WLwfL+`$P8wYRKhksizuh?#RTXDV(
JH`O^#X(GeEJEz_^3!7>ftl1tG%O$os`jtXi#ZNdXqE4oF8zLw8D--xj8MWdGBF7iHItytyh$VrpnEW}Ce(%M<>t?Zg=L(tgWan2JcRoopm%j~3ope~jnhJ5RwAH4pS%&=e&s;C8exQ47|sbY4a1bl)-mS@u@f?**=sLk-vj$@)KZ%>;a$cXsMIxxqaJvYsgA63_Te^{Mvu1$JS&HqMr0p9^@;WbdN>#gKyz%@I+YT2*U6f5BY|O}!~^gvh-sxB%l&ihTxdaoLV7DPza|q2WCPrT4qt<|fi5-b6tRaqT=RaE{F)+a`lh$%Ybrx*6|@?5YNfp1jx7qT!I;78qCh?IUJX>9G0&&#CX5}WQuiBh{O1~F(exB|FMlpU3!J&&HJTB3C(obE%c&EBIVd9wq|eO>n6H4sd&qSZD8>ay=W)B;U-N}f_mm8Se8zyeyj6#*al^@1?D)SUMbv5cXWZ~dippbGE{+XH6im{JEcd;QBqgOfT5}}U2J5V(|E>8esfn&Re<_J~S7rV4zG70j$LvoS(}1Mj-ApCGU5tYlqvW)CZU$7w=2w^YFu|R=a+d#hJd6w`LKPh}rMqc^&VSXsQ@NSH!lpX@3_^uk!%UXaR_B{&5&doX=c}!@7{=p=PBq4?gwYiO0!1MzG^A*gN>s#S}c6sF;3DiV_z746O*7_yVQ+5%u|if!;<5Dmw*+f^q3FhE{0hg0tCYA9FQqIpFUhFI4|u;L1}mLy#t}CExOM2t?Fzbj0e0^o&i;zNBioz9c;c2Ik`Faj;2H~aOYyFciB)tVdI&Obz38AD2tpES{PbNd&nq_+{ggI_tX<%uj)w$GqY!Bo&a*+rB!T^f8AKA2dAEd|@S(c)U^k=LEh6GOV+t{sn(t=m$HJveNe+aI~C6rAgT`q{|m@HL!v&S`Yu&ROV3W{$6Tolk3NT(Zi{b<#Hi5F3*YM$bbMDCJU_Z!XFJ;>!zHedvfq9*m}d%NvD3{wE<{~>1+1d9BB&k0|g{<)(~V~-@Q+a`o=w)5LfS3b30UI+xVCN(ka|?8N&Z?6#7LE^}DOP<-H!B)jR|Loq(S#@)7pOrF_rV`44+MHFmuw?>zuFcYS3S1<$k#Q;^PDS)pX^DdY{qdbd%yww4c4<1s=%QZ8K`gDcIS7pT(L9Wx+gKb|t~MJe*-y%+77BRBE_71*LS5$6*}8v;x=!e_xzGfPllX0H9Gwf44kw}o?R^=ta|ngzo;B=)x7v)t2|F#8xKq&Bi)m$~IZd!HsilW@ZYqRl$ig`Ff^Wm6<{5tb;n*3LwJDk`n^eX29daEwM(BetuXB&jy^2@r6innTk3Dv+G)*B3)CcDh~@+}eD9t|r{+z|meQSP2Bs|02T&Xk>Y(p$1XqDnMhYeY!aE%cBBZA61%cpnUZ3dq7CkQwX(nuF<85jr-Ate){_-WrVq;&Z?gJ*+}=uVqPV;K9Tq@!L3P`{|h*Cpp`((x1o9>opqieM5IdULZ)wpA$~6QSL$SZ)`Wk{d_+0Y4FryV>JU9k*tRx;6kG|des<(*bv*+0y_bSXbyZiu%K?K6qT|(zlh0zeIcV%_^qsYS6d*;uw69`Q2Pl35h}fnU8J=jxm!uG27|9kMi&Sk;`pHddRq(BkZ|;fdZ}K_F?&ZiwtAe3}dCLNj)t|)+=H161XO$CcEk=bR+>RrD!9zpzdsS-V`c6y_giu_0Rua(k&r0NpVUHry=^YcAO|cb=o)Y43&z{UdYqg>zub;)C36q9Re{TDIg$W0(gvVPrz3*p)!Z`uQLbH#a-LMFvt8PnN;7_7;v=V@ndQGk0DS5-|LvROB#_>>H+*I5|+FbPO{xFs)4H*(rY^>KO06hwU^Kn_-=k~N(ffS0YZ%9g6GP~hAFt=T*!c{KWir-7e!eLrvw#n{?_xri9$z%(VOMOQXMbwS$w2691!++@=FfZUb!-SgPBBe(lV*bogKS!Wgtd~kxBo-d$I$yR&`XflB5~Rsb1TRd+?Uv0P-GSpCgTBis;nq!X#Hy8CI)oud%SpE_K7W8*+K
XR8(x69&jZ)oJO;kk2GPq$*&sjJW(`c=^bUw_G>yMbX>eW#vyFZ?>{#Egf$enVvicybMqC#TC%Ni%TlQ*Q1$Thnl?QG4eaz0&~3DNY-e8dbmPArk5i$K-GzUDLUu;x2>)|PXJ5D3e7isH}{lYCk-sMqT8Ga4N);m38|!iCFVa?8;%Q*gpHlb8c%@1)7_^TKfLY5&`mxtAoYXa#gt&wc|~)FJKnWZrC8(!YF{$QsK%MK5UzA8oQ~SKiguv&_GIM9oW@g3)=1iFgo}5|dWEQ7@Ar~)zX|24o5p3ACYn#mMj4K1l4Mhc^W!=i(SUg8BB7V65#?5k$GZuY0<$z|C`4w9&E|i3f1PscZ6_16XMnM+F7rMu7$LD6`!UtrafWxX$x4>=o4Qrue>5wLb{^i!A3r?`otZ?aD1~-tqH>Zfq+~cd;(ZqM1a)e{;Q=V{C37e{!WF>hIloFaNo(bClf18PIunb|yq#Ltf!+1ZnaIYq|#tLY(aftiiRm!2HOU<@rL6y$?tl)G;pmRE6$IWgVhG`8|CQAuK8Xtq<_jG#T)sVz9bWDyLFY0{5|BRZAU7%8zTi3gNia|^tkxkikx$hlD?11}w~8vXNAuGbUSzSixWr7@-7i;Bjqo%&`4N9{Oq9}c081*m;9$5h_5P7=9H7ILKad+xeWT#r-3qE}Y|yK0Zpjv`th}Dp!K?e-%2s5v{3Y&+)NdZ_9iCoQ={2It~S_N?cQU`=9XqO7nP_RjX?yhD`N_M0g6#)oIax6zhWxHMQ?;fj`dPDmvncX1{AHi!faF_z9_U)sg{tzM(DFkzsygBxGo-OU(HRRwNu)>FBrgz#4GAk!B_tiuhrzyF-B6qJyEKXMN(nDlh_{rwHI!@HAjVi><)sH!-MuJJ25$3l0h`ax#E(}hj0yZTk`q%)*TpPk$4Bs`@RntnQlYRoL-W++stUf#UDAqAmCMMPFNr!n8=F)WUx-MBq$GH8yoQ~*Cc&9etm!K6Z_05@^R^^RD!}DI@uAE7S=FDQw08MD!rhB*=F5wG!E^4uu;ZnYLNcm%Vh>iL<+b6x`giU=`+hya=z*HFf9;$aAu>0I2GEW~b2y_(cn8_{4LXm%!m@Tsi7oB6}oPxyDW4e%(2JtaL+k&@T)7r?&T>NWmk>j=aToFnRax9o#SsXjq7tIHp}{Ac-`bF5UNrqX4oWH{^PA~AnEGU)&Dh1$g2Sp$ww`h{mtn25{_YS&6N@*a8B`_n#sLlw!~Xs{##*C!E>1hjD>`TqAMWRU}ikq3Ea+Q*C3!%>~v!ck{hNozK}6lt8>g>C$vCtQKb`3d!)lMojq~Wz--NN%PET0w9jZU}hesJWJ}#92y@ZBXhd9{=C(K&v}b)(rSeg6ssuKH>DL})aEiKYABvk;%zOX^{RZhM(T#mWT!*YMLi*0?)?Y7x75xGVBErz&(JBJ##d!MI92-v{@O;egoq60nB5EKH^GD4$fM_wNC*kj`so?c56Z){DiOpzj@q)ra;TXPUD!&L-MYP|AF#|&p(C)t;C~mIAEr63(-+xCll_V^IJ5l3J>L>wttF#-2}A)H04D*a7fb!Vq_9^Bx$h$_aE2h3_zBw%#E6=m-OdrbSgl{lBUwE$4))@H<5MM52_bb}_U-iER@if=;BU~!M%I+bWdXaO;1XNzHvj+tk>3%b1W>EJ00HuA>5BsZgM-%?vBYQl0ssI200dcD')),'train_gpt.py','exec')) diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed1337.log b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed1337.log new file mode 100644 index 0000000000..1ffe897788 --- /dev/null +++ 
b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed1337.log @@ -0,0 +1,292 @@ +From https://github.com/resouer/parameter-golf + * branch exp/round-14/packed-baseline -> FETCH_HEAD +Note: switching to 'f8889961853802ce9beae36ae6453f4754fd71ab'. +You are in 'detached HEAD' state. You can look around, make experimental +changes and commit them, and you can discard any commits you make in this +state without impacting any branches by switching back to a branch. +If you want to create a new branch to retain commits you create, you may +do so (now or later) by using -c with the switch command. Example: + git switch -c +Or undo this operation with: + git switch - +Turn off this advice by setting config variable advice.detachedHead to false +HEAD is now at f888996 Tune hash embedding: 16K buckets + 10x LR +data_setup: vocab=8192 shards=128 +Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. +W0407 17:26:19.904000 230 torch/distributed/run.py:803] +W0407 17:26:19.904000 230 torch/distributed/run.py:803] ***************************************** +W0407 17:26:19.904000 230 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/0955bf9f-30f6-456c-9a1b-1bba772c0180.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + parallel_start_layer: 7 + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 0955bf9f-30f6-456c-9a1b-1bba772c0180 + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + ttt_batch_seqs: 32 + ttt_chunk_tokens: 32768 + ttt_enabled: True + ttt_epochs: 3 + ttt_freeze_blocks: 0 + ttt_grad_clip: 1.0 + ttt_loop_only: False + ttt_lr: 0.005 + ttt_momentum: 0.9 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + 
warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 128 +val_tokens: 40540160 +model_params:35943512 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0047 val_bpb: 3.4860 +1/20000 train_loss: 9.0084 train_time: 0.0m tok/s: 7979883 +2/20000 train_loss: 12.3253 train_time: 0.0m tok/s: 7956964 +3/20000 train_loss: 11.0666 train_time: 0.0m tok/s: 7894778 +4/20000 train_loss: 9.4821 train_time: 0.0m tok/s: 7862760 +5/20000 train_loss: 8.4564 train_time: 0.0m tok/s: 7840139 +500/20000 train_loss: 3.4002 train_time: 0.9m tok/s: 7642136 +1000/20000 train_loss: 3.2103 train_time: 1.7m tok/s: 7638780 +1500/20000 train_loss: 3.2043 train_time: 2.6m tok/s: 7640151 +2000/20000 train_loss: 3.1272 train_time: 3.4m tok/s: 7644863 +2500/20000 train_loss: 2.9939 train_time: 4.3m tok/s: 7646757 +layer_loop:enabled step:2858 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 2.9880 train_time: 5.2m tok/s: 7534801 +3500/20000 train_loss: 3.0117 train_time: 6.4m tok/s: 7212071 +4000/20000 train_loss: 2.9522 train_time: 7.5m tok/s: 7004380 +4000/20000 val_loss: 2.9168 val_bpb: 1.1292 +4500/20000 train_loss: 2.9283 train_time: 8.6m tok/s: 6851646 +5000/20000 train_loss: 2.9118 train_time: 9.7m tok/s: 6733583 +5031/20000 val_loss: 2.8147 val_bpb: 1.0897 +stopping_early: wallclock_cap train_time: 588136ms step: 5031/20000 +peak memory allocated: 34604 MiB reserved: 34708 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81221955 val_bpb:1.08869783 
eval_time:7242ms +Serialized model: 135426937 bytes +Code size: 17405 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 11.4s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15965524 bytes +Total submission size quantized+brotli: 15982929 bytes +final_int6_roundtrip_exact val_loss:2.84283616 val_bpb:1.10055047 eval_time:28254ms +final_int6_sliding_window val_loss:2.79967517 val_bpb:1.08384151 eval_time:116330ms +eval_hash_emb:init size=16384 dim=512 lr_mult=10x +ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 loop_only=False +ttt_sliding:params model=35943512 hash=8388608 frozen=0 + ttt_chunk [1/1238] bpb=1.118075 time=39.3s + ttt_chunk [11/1238] bpb=1.072418 time=60.4s + ttt_chunk [21/1238] bpb=1.110299 time=63.4s + ttt_chunk [31/1238] bpb=1.104399 time=66.4s + ttt_chunk [41/1238] bpb=1.097277 time=69.4s + ttt_chunk [51/1238] bpb=1.090595 time=72.5s + ttt_chunk [61/1238] bpb=1.082098 time=75.5s + ttt_chunk [71/1238] bpb=1.089282 time=78.5s + ttt_chunk [81/1238] bpb=1.082622 time=81.5s + ttt_chunk [91/1238] bpb=1.079137 time=84.5s + ttt_chunk [101/1238] bpb=1.079003 time=87.5s + ttt_chunk [111/1238] bpb=1.077055 time=90.5s + ttt_chunk [121/1238] bpb=1.079954 time=93.5s + ttt_chunk [131/1238] bpb=1.083717 time=96.5s + ttt_chunk [141/1238] bpb=1.084246 time=99.5s + ttt_chunk [151/1238] bpb=1.084058 time=102.5s + ttt_chunk [161/1238] bpb=1.084570 time=105.5s + ttt_chunk [171/1238] bpb=1.084351 time=108.5s + ttt_chunk [181/1238] bpb=1.082837 time=111.5s + ttt_chunk [191/1238] bpb=1.082471 time=114.6s + ttt_chunk [201/1238] bpb=1.080035 
time=117.6s + ttt_chunk [211/1238] bpb=1.084420 time=120.6s + ttt_chunk [221/1238] bpb=1.084683 time=123.6s + ttt_chunk [231/1238] bpb=1.086376 time=126.7s + ttt_chunk [241/1238] bpb=1.084805 time=129.7s + ttt_chunk [251/1238] bpb=1.084906 time=132.7s + ttt_chunk [261/1238] bpb=1.085894 time=135.8s + ttt_chunk [271/1238] bpb=1.086360 time=138.8s + ttt_chunk [281/1238] bpb=1.085676 time=141.8s + ttt_chunk [291/1238] bpb=1.086788 time=144.9s + ttt_chunk [301/1238] bpb=1.086983 time=147.9s + ttt_chunk [311/1238] bpb=1.085810 time=150.9s + ttt_chunk [321/1238] bpb=1.085742 time=153.9s + ttt_chunk [331/1238] bpb=1.086015 time=156.9s + ttt_chunk [341/1238] bpb=1.085132 time=159.9s + ttt_chunk [351/1238] bpb=1.085876 time=162.9s + ttt_chunk [361/1238] bpb=1.084845 time=165.9s + ttt_chunk [371/1238] bpb=1.083294 time=169.0s + ttt_chunk [381/1238] bpb=1.083703 time=172.0s + ttt_chunk [391/1238] bpb=1.083389 time=175.0s + ttt_chunk [401/1238] bpb=1.083470 time=178.0s + ttt_chunk [411/1238] bpb=1.084060 time=181.0s + ttt_chunk [421/1238] bpb=1.083528 time=184.0s + ttt_chunk [431/1238] bpb=1.083680 time=187.0s + ttt_chunk [441/1238] bpb=1.083740 time=190.0s + ttt_chunk [451/1238] bpb=1.084933 time=193.0s + ttt_chunk [461/1238] bpb=1.083184 time=196.0s + ttt_chunk [471/1238] bpb=1.083202 time=199.0s + ttt_chunk [481/1238] bpb=1.083356 time=202.1s + ttt_chunk [491/1238] bpb=1.083813 time=205.1s + ttt_chunk [501/1238] bpb=1.083402 time=208.1s + ttt_chunk [511/1238] bpb=1.083042 time=211.1s + ttt_chunk [521/1238] bpb=1.082550 time=214.1s + ttt_chunk [531/1238] bpb=1.082540 time=217.1s + ttt_chunk [541/1238] bpb=1.082638 time=220.1s + ttt_chunk [551/1238] bpb=1.082180 time=223.1s + ttt_chunk [561/1238] bpb=1.081472 time=226.1s + ttt_chunk [571/1238] bpb=1.080947 time=229.1s + ttt_chunk [581/1238] bpb=1.081304 time=232.1s + ttt_chunk [591/1238] bpb=1.081547 time=235.2s + ttt_chunk [601/1238] bpb=1.081472 time=238.2s + ttt_chunk [611/1238] bpb=1.082065 time=241.2s + ttt_chunk 
[621/1238] bpb=1.082927 time=244.2s + ttt_chunk [631/1238] bpb=1.082988 time=247.3s + ttt_chunk [641/1238] bpb=1.083444 time=250.3s + ttt_chunk [651/1238] bpb=1.083771 time=253.3s + ttt_chunk [661/1238] bpb=1.083074 time=256.3s + ttt_chunk [671/1238] bpb=1.082817 time=259.3s + ttt_chunk [681/1238] bpb=1.084133 time=262.3s + ttt_chunk [691/1238] bpb=1.084339 time=265.4s + ttt_chunk [701/1238] bpb=1.084158 time=268.4s + ttt_chunk [711/1238] bpb=1.084863 time=271.4s + ttt_chunk [721/1238] bpb=1.085172 time=274.4s + ttt_chunk [731/1238] bpb=1.084527 time=277.4s + ttt_chunk [741/1238] bpb=1.084264 time=280.4s + ttt_chunk [751/1238] bpb=1.083327 time=283.4s + ttt_chunk [761/1238] bpb=1.082729 time=286.4s + ttt_chunk [771/1238] bpb=1.081737 time=289.4s + ttt_chunk [781/1238] bpb=1.081705 time=292.4s + ttt_chunk [791/1238] bpb=1.082077 time=295.4s + ttt_chunk [801/1238] bpb=1.082371 time=298.4s + ttt_chunk [811/1238] bpb=1.081889 time=301.4s + ttt_chunk [821/1238] bpb=1.080715 time=304.4s + ttt_chunk [831/1238] bpb=1.080390 time=307.4s + ttt_chunk [841/1238] bpb=1.079961 time=310.4s + ttt_chunk [851/1238] bpb=1.079671 time=313.4s + ttt_chunk [861/1238] bpb=1.079308 time=316.4s + ttt_chunk [871/1238] bpb=1.079154 time=319.4s + ttt_chunk [881/1238] bpb=1.078678 time=322.4s + ttt_chunk [891/1238] bpb=1.078162 time=325.4s + ttt_chunk [901/1238] bpb=1.078537 time=328.4s + ttt_chunk [911/1238] bpb=1.078240 time=331.4s + ttt_chunk [921/1238] bpb=1.078529 time=334.4s + ttt_chunk [931/1238] bpb=1.079226 time=337.4s + ttt_chunk [941/1238] bpb=1.079620 time=340.4s + ttt_chunk [951/1238] bpb=1.079547 time=343.4s + ttt_chunk [961/1238] bpb=1.080376 time=346.3s + ttt_chunk [971/1238] bpb=1.080783 time=349.3s + ttt_chunk [981/1238] bpb=1.081129 time=352.3s + ttt_chunk [991/1238] bpb=1.080920 time=355.2s + ttt_chunk [1001/1238] bpb=1.080958 time=358.2s + ttt_chunk [1011/1238] bpb=1.081282 time=361.1s + ttt_chunk [1021/1238] bpb=1.081971 time=364.1s + ttt_chunk [1031/1238] bpb=1.082455 
time=367.1s + ttt_chunk [1041/1238] bpb=1.082935 time=370.0s + ttt_chunk [1051/1238] bpb=1.082857 time=373.0s + ttt_chunk [1061/1238] bpb=1.082862 time=375.9s + ttt_chunk [1071/1238] bpb=1.083027 time=378.9s + ttt_chunk [1081/1238] bpb=1.082909 time=381.9s + ttt_chunk [1091/1238] bpb=1.083105 time=384.9s + ttt_chunk [1101/1238] bpb=1.083648 time=387.8s + ttt_chunk [1111/1238] bpb=1.083949 time=390.8s + ttt_chunk [1121/1238] bpb=1.084112 time=393.8s + ttt_chunk [1131/1238] bpb=1.083788 time=396.8s + ttt_chunk [1141/1238] bpb=1.083437 time=399.7s + ttt_chunk [1151/1238] bpb=1.083476 time=402.7s + ttt_chunk [1161/1238] bpb=1.083628 time=405.7s + ttt_chunk [1171/1238] bpb=1.083412 time=408.6s + ttt_chunk [1181/1238] bpb=1.082947 time=411.6s + ttt_chunk [1191/1238] bpb=1.083090 time=414.6s + ttt_chunk [1201/1238] bpb=1.083129 time=417.6s + ttt_chunk [1211/1238] bpb=1.082835 time=420.6s + ttt_chunk [1221/1238] bpb=1.082375 time=423.6s + ttt_chunk [1231/1238] bpb=1.082009 time=426.5s + ttt_chunk [1238/1238] bpb=1.082009 time=447.1s +ttt_sliding:done val_loss=2.795371 val_bpb=1.082175 elapsed=448.0s +legal_ttt_hash val_loss:2.79537088 val_bpb:1.08217518 eval_time:448277ms +results_json: {"val_bpb": 1.08217518, "val_loss": 2.79537088, "bytes_total": 15982929, "peak_memory_mib": 34604} diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed2025.log b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed2025.log new file mode 100644 index 0000000000..167fa626b5 --- /dev/null +++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed2025.log @@ -0,0 +1,292 @@ +From https://github.com/resouer/parameter-golf + * branch exp/round-14/packed-baseline -> FETCH_HEAD +Note: switching to '81d1d128518e1934d0f953243389199d37808cc1'. +You are in 'detached HEAD' state. 
You can look around, make experimental +changes and commit them, and you can discard any commits you make in this +state without impacting any branches by switching back to a branch. +If you want to create a new branch to retain commits you create, you may +do so (now or later) by using -c with the switch command. Example: + git switch -c +Or undo this operation with: + git switch - +Turn off this advice by setting config variable advice.detachedHead to false +HEAD is now at 81d1d12 3-seed: seed=2025 +data_setup: vocab=8192 shards=128 +Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. +W0407 18:57:31.178000 230 torch/distributed/run.py:803] +W0407 18:57:31.178000 230 torch/distributed/run.py:803] ***************************************** +W0407 18:57:31.178000 230 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: ./data/
+ datasets_dir: ./data/datasets/fineweb10B_sp8192
+ distributed: True
+ ema_decay: 0.997
+ embed_bits: 8
+ embed_clip_sigmas: 20.0
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ enable_looping_at: 0.5
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_reserve_seconds: 12.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/59d86248-8899-491d-9a9b-c2983e1ea990.txt
+ logit_softcap: 30.0
+ loop_end: 5
+ loop_start: 4
+ matrix_bits: 6
+ matrix_clip_sigmas: 12.85
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_row_normalize: True
+ muon_wd: 0.085
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ num_loops: 2
+ parallel_start_layer: 7
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: 59d86248-8899-491d-9a9b-c2983e1ea990
+ scalar_lr: 0.02
+ seed: 2025
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
+ train_batch_tokens: 786432
+ train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ ttt_batch_seqs: 32
+ ttt_chunk_tokens: 32768
+ ttt_enabled: True
+ ttt_epochs: 3
+ ttt_freeze_blocks: 0
+ ttt_grad_clip: 1.0
+ ttt_loop_only: False
+ ttt_lr: 0.005
+ ttt_momentum: 0.9
+ val_batch_tokens: 524288
+ val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+ val_loss_every: 4000
+ vocab_size: 8192
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 128
+val_tokens: 40540160
+model_params:35943512
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0067 val_bpb: 3.4868
+1/20000 train_loss: 9.0104 train_time: 0.0m tok/s: 7672126
+2/20000 train_loss: 12.3810 train_time: 0.0m tok/s: 7795338
+3/20000 train_loss: 11.1231 train_time: 0.0m tok/s: 7788068
+4/20000 train_loss: 9.5024 train_time: 0.0m tok/s: 7786149
+5/20000 train_loss: 8.4666 train_time: 0.0m tok/s: 7782727
+500/20000 train_loss: 3.4027 train_time: 0.9m tok/s: 7641228
+1000/20000 train_loss: 3.2107 train_time: 1.7m tok/s: 7628820
+1500/20000 train_loss: 3.2071 train_time: 2.6m tok/s: 7627715
+2000/20000 train_loss: 3.1345 train_time: 3.4m tok/s: 7629852
+2500/20000 train_loss: 2.9943 train_time: 4.3m tok/s: 7629288
+layer_loop:enabled step:2852 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+3000/20000 train_loss: 2.9925 train_time: 5.2m tok/s: 7512829
+3500/20000 train_loss: 3.0110 train_time: 6.4m tok/s: 7214157
+4000/20000 train_loss: 2.9564 train_time: 7.5m tok/s: 7005504
+4000/20000 val_loss: 2.9203 val_bpb: 1.1306
+4500/20000 train_loss: 2.9317 train_time: 8.6m tok/s: 6852709
+5000/20000 train_loss: 2.9158 train_time: 9.7m tok/s: 6734503
+5031/20000 val_loss: 2.8175 val_bpb: 1.0907
+stopping_early: wallclock_cap train_time: 588047ms step: 5031/20000
+peak memory allocated: 34604 MiB reserved: 34708 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.81501688 val_bpb:1.08978076 eval_time:7315ms
+Serialized model: 135426937 bytes
+Code size: 17400 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 11.4s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int8): tok_emb.weight
+ passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15972020 bytes
+Total submission size quantized+brotli: 15989420 bytes
+final_int6_roundtrip_exact val_loss:2.84565551 val_bpb:1.10164193 eval_time:27942ms
+final_int6_sliding_window val_loss:2.80252245 val_bpb:1.08494377 eval_time:116213ms
+eval_hash_emb:init size=16384 dim=512 lr_mult=10x
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 loop_only=False
+ttt_sliding:params model=35943512 hash=8388608 frozen=0
+ ttt_chunk [1/1238] bpb=1.123618 time=39.1s
+ ttt_chunk [11/1238] bpb=1.074318 time=60.6s
+ ttt_chunk [21/1238] bpb=1.112277 time=63.5s
+ ttt_chunk [31/1238] bpb=1.105880 time=66.5s
+ ttt_chunk [41/1238] bpb=1.098651 time=69.4s
+ ttt_chunk [51/1238] bpb=1.092146 time=72.4s
+ ttt_chunk [61/1238] bpb=1.083323 time=75.4s
+ ttt_chunk [71/1238] bpb=1.090243 time=78.3s
+ ttt_chunk [81/1238] bpb=1.083706 time=81.3s
+ ttt_chunk [91/1238] bpb=1.080187 time=84.3s
+ ttt_chunk [101/1238] bpb=1.080060 time=87.2s
+ ttt_chunk [111/1238] bpb=1.078419 time=90.2s
+ ttt_chunk [121/1238] bpb=1.081175 time=93.2s
+ ttt_chunk [131/1238] bpb=1.084939 time=96.2s
+ ttt_chunk [141/1238] bpb=1.085544 time=99.2s
+ ttt_chunk [151/1238] bpb=1.085348 time=102.1s
+ ttt_chunk [161/1238] bpb=1.085902 time=105.1s
+ ttt_chunk [171/1238] bpb=1.085757 time=108.1s
+ ttt_chunk [181/1238] bpb=1.084136 time=111.1s
+ ttt_chunk [191/1238] bpb=1.083905 time=114.0s
+ ttt_chunk [201/1238] bpb=1.081514 time=116.9s
+ ttt_chunk [211/1238] bpb=1.085934 time=119.9s
+ ttt_chunk [221/1238] bpb=1.086255 time=122.9s
+ ttt_chunk [231/1238] bpb=1.088032 time=125.8s
+ ttt_chunk [241/1238] bpb=1.086257 time=128.8s
+ ttt_chunk [251/1238] bpb=1.086203 time=131.7s
+ ttt_chunk [261/1238] bpb=1.087177 time=134.7s
+ ttt_chunk [271/1238] bpb=1.087688 time=137.7s
+ ttt_chunk [281/1238] bpb=1.086983 time=140.6s
+ ttt_chunk [291/1238] bpb=1.088057 time=143.6s
+ ttt_chunk [301/1238] bpb=1.088260 time=146.6s
+ ttt_chunk [311/1238] bpb=1.087126 time=149.5s
+ ttt_chunk [321/1238] bpb=1.086963 time=152.5s
+ ttt_chunk [331/1238] bpb=1.087240 time=155.4s
+ ttt_chunk [341/1238] bpb=1.086320 time=158.3s
+ ttt_chunk [351/1238] bpb=1.087066 time=161.2s
+ ttt_chunk [361/1238] bpb=1.085998 time=164.2s
+ ttt_chunk [371/1238] bpb=1.084435 time=167.1s
+ ttt_chunk [381/1238] bpb=1.084855 time=170.0s
+ ttt_chunk [391/1238] bpb=1.084557 time=173.0s
+ ttt_chunk [401/1238] bpb=1.084613 time=175.9s
+ ttt_chunk [411/1238] bpb=1.085162 time=178.8s
+ ttt_chunk [421/1238] bpb=1.084693 time=181.8s
+ ttt_chunk [431/1238] bpb=1.084887 time=184.7s
+ ttt_chunk [441/1238] bpb=1.084946 time=187.6s
+ ttt_chunk [451/1238] bpb=1.086138 time=190.6s
+ ttt_chunk [461/1238] bpb=1.084343 time=193.6s
+ ttt_chunk [471/1238] bpb=1.084315 time=196.5s
+ ttt_chunk [481/1238] bpb=1.084455 time=199.4s
+ ttt_chunk [491/1238] bpb=1.084940 time=202.4s
+ ttt_chunk [501/1238] bpb=1.084569 time=205.3s
+ ttt_chunk [511/1238] bpb=1.084224 time=208.3s
+ ttt_chunk [521/1238] bpb=1.083719 time=211.2s
+ ttt_chunk [531/1238] bpb=1.083696 time=214.2s
+ ttt_chunk [541/1238] bpb=1.083804 time=217.2s
+ ttt_chunk [551/1238] bpb=1.083344 time=220.1s
+ ttt_chunk [561/1238] bpb=1.082695 time=223.1s
+ ttt_chunk [571/1238] bpb=1.082155 time=226.1s
+ ttt_chunk [581/1238] bpb=1.082488 time=229.0s
+ ttt_chunk [591/1238] bpb=1.082673 time=232.0s
+ ttt_chunk [601/1238] bpb=1.082605 time=234.9s
+ ttt_chunk [611/1238] bpb=1.083171 time=237.8s
+ ttt_chunk [621/1238] bpb=1.084036 time=240.8s
+ ttt_chunk [631/1238] bpb=1.084093 time=243.7s
+ ttt_chunk [641/1238] bpb=1.084534 time=246.7s
+ ttt_chunk [651/1238] bpb=1.084850 time=249.6s
+ ttt_chunk [661/1238] bpb=1.084198 time=252.5s
+ ttt_chunk [671/1238] bpb=1.083942 time=255.5s
+ ttt_chunk [681/1238] bpb=1.085286 time=258.4s
+ ttt_chunk [691/1238] bpb=1.085476 time=261.4s
+ ttt_chunk [701/1238] bpb=1.085254 time=264.4s
+ ttt_chunk [711/1238] bpb=1.085953 time=267.4s
+ ttt_chunk [721/1238] bpb=1.086247 time=270.3s
+ ttt_chunk [731/1238] bpb=1.085632 time=273.3s
+ ttt_chunk [741/1238] bpb=1.085354 time=276.3s
+ ttt_chunk [751/1238] bpb=1.084448 time=279.2s
+ ttt_chunk [761/1238] bpb=1.083866 time=282.2s
+ ttt_chunk [771/1238] bpb=1.082851 time=285.1s
+ ttt_chunk [781/1238] bpb=1.082851 time=288.1s
+ ttt_chunk [791/1238] bpb=1.083205 time=291.1s
+ ttt_chunk [801/1238] bpb=1.083515 time=294.1s
+ ttt_chunk [811/1238] bpb=1.083029 time=297.1s
+ ttt_chunk [821/1238] bpb=1.081844 time=300.0s
+ ttt_chunk [831/1238] bpb=1.081554 time=303.0s
+ ttt_chunk [841/1238] bpb=1.081088 time=305.9s
+ ttt_chunk [851/1238] bpb=1.080786 time=308.9s
+ ttt_chunk [861/1238] bpb=1.080437 time=311.9s
+ ttt_chunk [871/1238] bpb=1.080301 time=314.8s
+ ttt_chunk [881/1238] bpb=1.079854 time=317.7s
+ ttt_chunk [891/1238] bpb=1.079325 time=320.7s
+ ttt_chunk [901/1238] bpb=1.079725 time=323.6s
+ ttt_chunk [911/1238] bpb=1.079426 time=326.5s
+ ttt_chunk [921/1238] bpb=1.079735 time=329.4s
+ ttt_chunk [931/1238] bpb=1.080411 time=332.4s
+ ttt_chunk [941/1238] bpb=1.080796 time=335.3s
+ ttt_chunk [951/1238] bpb=1.080718 time=338.3s
+ ttt_chunk [961/1238] bpb=1.081547 time=341.2s
+ ttt_chunk [971/1238] bpb=1.081967 time=344.2s
+ ttt_chunk [981/1238] bpb=1.082324 time=347.1s
+ ttt_chunk [991/1238] bpb=1.082121 time=350.1s
+ ttt_chunk [1001/1238] bpb=1.082152 time=353.1s
+ ttt_chunk [1011/1238] bpb=1.082509 time=356.0s
+ ttt_chunk [1021/1238] bpb=1.083216 time=359.0s
+ ttt_chunk [1031/1238] bpb=1.083699 time=362.0s
+ ttt_chunk [1041/1238] bpb=1.084151 time=364.9s
+ ttt_chunk [1051/1238] bpb=1.084083 time=367.9s
+ ttt_chunk [1061/1238] bpb=1.084075 time=370.9s
+ ttt_chunk [1071/1238] bpb=1.084231 time=373.9s
+ ttt_chunk [1081/1238] bpb=1.084115 time=376.9s
+ ttt_chunk [1091/1238] bpb=1.084311 time=379.9s
+ ttt_chunk [1101/1238] bpb=1.084865 time=382.9s
+ ttt_chunk [1111/1238] bpb=1.085161 time=385.9s
+ ttt_chunk [1121/1238] bpb=1.085330 time=388.9s
+ ttt_chunk [1131/1238] bpb=1.084992 time=391.9s
+ ttt_chunk [1141/1238] bpb=1.084654 time=394.9s
+ ttt_chunk [1151/1238] bpb=1.084688 time=397.9s
+ ttt_chunk [1161/1238] bpb=1.084839 time=400.8s
+ ttt_chunk [1171/1238] bpb=1.084604 time=403.7s
+ ttt_chunk [1181/1238] bpb=1.084134 time=406.6s
+ ttt_chunk [1191/1238] bpb=1.084295 time=409.5s
+ ttt_chunk [1201/1238] bpb=1.084322 time=412.5s
+ ttt_chunk [1211/1238] bpb=1.084031 time=415.4s
+ ttt_chunk [1221/1238] bpb=1.083577 time=418.4s
+ ttt_chunk [1231/1238] bpb=1.083219 time=421.3s
+ ttt_chunk [1238/1238] bpb=1.083217 time=442.0s
+ttt_sliding:done val_loss=2.798463 val_bpb=1.083372 elapsed=442.7s
+legal_ttt_hash val_loss:2.79846273 val_bpb:1.08337213 eval_time:442961ms
+results_json: {"val_bpb": 1.08337213, "val_loss": 2.79846273, "bytes_total": 15989420, "peak_memory_mib": 34604}
diff --git a/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed42.log b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed42.log
new file mode 100644
index 0000000000..730118fc97
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-08_SP8192_TTT_HashEmbedding/train_seed42.log
@@ -0,0 +1,292 @@
+From https://github.com/resouer/parameter-golf
+ * branch exp/round-14/packed-baseline -> FETCH_HEAD
+Note: switching to 'd9a58d25d1a17b94ed8aed81756d5d20bbb1e20b'.
+You are in 'detached HEAD' state. You can look around, make experimental
+changes and commit them, and you can discard any commits you make in this
+state without impacting any branches by switching back to a branch.
+If you want to create a new branch to retain commits you create, you may
+do so (now or later) by using -c with the switch command. Example:
+  git switch -c <new-branch-name>
+Or undo this operation with:
+  git switch -
+Turn off this advice by setting config variable advice.detachedHead to false
+HEAD is now at d9a58d2 3-seed: seed=42
+data_setup: vocab=8192 shards=128
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+W0407 18:27:49.807000 230 torch/distributed/run.py:803]
+W0407 18:27:49.807000 230 torch/distributed/run.py:803] *****************************************
+W0407 18:27:49.807000 230 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: ./data/
+ datasets_dir: ./data/datasets/fineweb10B_sp8192
+ distributed: True
+ ema_decay: 0.997
+ embed_bits: 8
+ embed_clip_sigmas: 20.0
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ enable_looping_at: 0.5
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_reserve_seconds: 12.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/6ebe69d6-702c-49ce-a2d7-49ccb6893d1c.txt
+ logit_softcap: 30.0
+ loop_end: 5
+ loop_start: 4
+ matrix_bits: 6
+ matrix_clip_sigmas: 12.85
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_row_normalize: True
+ muon_wd: 0.085
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ num_loops: 2
+ parallel_start_layer: 7
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: 6ebe69d6-702c-49ce-a2d7-49ccb6893d1c
+ scalar_lr: 0.02
+ seed: 42
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
+ train_batch_tokens: 786432
+ train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ ttt_batch_seqs: 32
+ ttt_chunk_tokens: 32768
+ ttt_enabled: True
+ ttt_epochs: 3
+ ttt_freeze_blocks: 0
+ ttt_grad_clip: 1.0
+ ttt_loop_only: False
+ ttt_lr: 0.005
+ ttt_momentum: 0.9
+ val_batch_tokens: 524288
+ val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+ val_loss_every: 4000
+ vocab_size: 8192
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 128
+val_tokens: 40540160
+model_params:35943512
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0090 val_bpb: 3.4877
+1/20000 train_loss: 9.0125 train_time: 0.0m tok/s: 8044048
+2/20000 train_loss: 12.3897 train_time: 0.0m tok/s: 7965622
+3/20000 train_loss: 11.1326 train_time: 0.0m tok/s: 7807640
+4/20000 train_loss: 9.5319 train_time: 0.0m tok/s: 7803216
+5/20000 train_loss: 8.4757 train_time: 0.0m tok/s: 7794010
+500/20000 train_loss: 3.3964 train_time: 0.9m tok/s: 7648626
+1000/20000 train_loss: 3.2109 train_time: 1.7m tok/s: 7646547
+1500/20000 train_loss: 3.2005 train_time: 2.6m tok/s: 7645392
+2000/20000 train_loss: 3.1244 train_time: 3.4m tok/s: 7648616
+2500/20000 train_loss: 2.9936 train_time: 4.3m tok/s: 7650307
+layer_loop:enabled step:2859 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+3000/20000 train_loss: 2.9906 train_time: 5.2m tok/s: 7539126
+3500/20000 train_loss: 3.0086 train_time: 6.3m tok/s: 7235040
+4000/20000 train_loss: 2.9496 train_time: 7.5m tok/s: 7023009
+4000/20000 val_loss: 2.9181 val_bpb: 1.1297
+4500/20000 train_loss: 2.9315 train_time: 8.6m tok/s: 6867833
+5000/20000 train_loss: 2.9137 train_time: 9.7m tok/s: 6747165
+5039/20000 val_loss: 2.8151 val_bpb: 1.0898
+stopping_early: wallclock_cap train_time: 588037ms step: 5039/20000
+peak memory allocated: 34604 MiB reserved: 34708 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.81261061 val_bpb:1.08884921 eval_time:7267ms
+Serialized model: 135426937 bytes
+Code size: 17410 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 11.4s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int8): tok_emb.weight
+ passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15971049 bytes
+Total submission size quantized+brotli: 15988459 bytes
+final_int6_roundtrip_exact val_loss:2.84428373 val_bpb:1.10111087 eval_time:28055ms
+final_int6_sliding_window val_loss:2.80117028 val_bpb:1.08442031 eval_time:116447ms
+eval_hash_emb:init size=16384 dim=512 lr_mult=10x
+ttt_sliding:start chunks=1238 chunk_tokens=32768 total_windows=633409 stride=64 ttt_lr=0.005 ttt_epochs=3 freeze_blocks=0 loop_only=False
+ttt_sliding:params model=35943512 hash=8388608 frozen=0
+ ttt_chunk [1/1238] bpb=1.117948 time=39.4s
+ ttt_chunk [11/1238] bpb=1.071025 time=61.4s
+ ttt_chunk [21/1238] bpb=1.108952 time=64.4s
+ ttt_chunk [31/1238] bpb=1.103575 time=67.4s
+ ttt_chunk [41/1238] bpb=1.096045 time=70.4s
+ ttt_chunk [51/1238] bpb=1.089845 time=73.3s
+ ttt_chunk [61/1238] bpb=1.081582 time=76.3s
+ ttt_chunk [71/1238] bpb=1.088691 time=79.3s
+ ttt_chunk [81/1238] bpb=1.082152 time=82.3s
+ ttt_chunk [91/1238] bpb=1.078797 time=85.3s
+ ttt_chunk [101/1238] bpb=1.078487 time=88.3s
+ ttt_chunk [111/1238] bpb=1.076761 time=91.3s
+ ttt_chunk [121/1238] bpb=1.079653 time=94.3s
+ ttt_chunk [131/1238] bpb=1.083374 time=97.3s
+ ttt_chunk [141/1238] bpb=1.083945 time=100.3s
+ ttt_chunk [151/1238] bpb=1.083675 time=103.3s
+ ttt_chunk [161/1238] bpb=1.084251 time=106.3s
+ ttt_chunk [171/1238] bpb=1.084121 time=109.3s
+ ttt_chunk [181/1238] bpb=1.082684 time=112.3s
+ ttt_chunk [191/1238] bpb=1.082377 time=115.3s
+ ttt_chunk [201/1238] bpb=1.079989 time=118.3s
+ ttt_chunk [211/1238] bpb=1.084448 time=121.3s
+ ttt_chunk [221/1238] bpb=1.084648 time=124.3s
+ ttt_chunk [231/1238] bpb=1.086474 time=127.2s
+ ttt_chunk [241/1238] bpb=1.084792 time=130.2s
+ ttt_chunk [251/1238] bpb=1.084860 time=133.2s
+ ttt_chunk [261/1238] bpb=1.085918 time=136.1s
+ ttt_chunk [271/1238] bpb=1.086418 time=139.1s
+ ttt_chunk [281/1238] bpb=1.085740 time=142.1s
+ ttt_chunk [291/1238] bpb=1.086935 time=145.0s
+ ttt_chunk [301/1238] bpb=1.087181 time=148.0s
+ ttt_chunk [311/1238] bpb=1.086081 time=151.0s
+ ttt_chunk [321/1238] bpb=1.086002 time=154.0s
+ ttt_chunk [331/1238] bpb=1.086292 time=156.9s
+ ttt_chunk [341/1238] bpb=1.085431 time=159.9s
+ ttt_chunk [351/1238] bpb=1.086205 time=162.9s
+ ttt_chunk [361/1238] bpb=1.085187 time=165.9s
+ ttt_chunk [371/1238] bpb=1.083667 time=168.8s
+ ttt_chunk [381/1238] bpb=1.084077 time=171.8s
+ ttt_chunk [391/1238] bpb=1.083797 time=174.8s
+ ttt_chunk [401/1238] bpb=1.083833 time=177.8s
+ ttt_chunk [411/1238] bpb=1.084435 time=180.8s
+ ttt_chunk [421/1238] bpb=1.083933 time=183.8s
+ ttt_chunk [431/1238] bpb=1.084167 time=186.7s
+ ttt_chunk [441/1238] bpb=1.084250 time=189.7s
+ ttt_chunk [451/1238] bpb=1.085416 time=192.7s
+ ttt_chunk [461/1238] bpb=1.083638 time=195.7s
+ ttt_chunk [471/1238] bpb=1.083673 time=198.7s
+ ttt_chunk [481/1238] bpb=1.083804 time=201.6s
+ ttt_chunk [491/1238] bpb=1.084239 time=204.6s
+ ttt_chunk [501/1238] bpb=1.083855 time=207.6s
+ ttt_chunk [511/1238] bpb=1.083510 time=210.6s
+ ttt_chunk [521/1238] bpb=1.083063 time=213.6s
+ ttt_chunk [531/1238] bpb=1.083056 time=216.6s
+ ttt_chunk [541/1238] bpb=1.083175 time=219.6s
+ ttt_chunk [551/1238] bpb=1.082698 time=222.6s
+ ttt_chunk [561/1238] bpb=1.082027 time=225.6s
+ ttt_chunk [571/1238] bpb=1.081494 time=228.5s
+ ttt_chunk [581/1238] bpb=1.081869 time=231.5s
+ ttt_chunk [591/1238] bpb=1.082091 time=234.5s
+ ttt_chunk [601/1238] bpb=1.082019 time=237.5s
+ ttt_chunk [611/1238] bpb=1.082578 time=240.5s
+ ttt_chunk [621/1238] bpb=1.083417 time=243.5s
+ ttt_chunk [631/1238] bpb=1.083486 time=246.4s
+ ttt_chunk [641/1238] bpb=1.083934 time=249.4s
+ ttt_chunk [651/1238] bpb=1.084258 time=252.4s
+ ttt_chunk [661/1238] bpb=1.083613 time=255.4s
+ ttt_chunk [671/1238] bpb=1.083345 time=258.3s
+ ttt_chunk [681/1238] bpb=1.084670 time=261.3s
+ ttt_chunk [691/1238] bpb=1.084872 time=264.3s
+ ttt_chunk [701/1238] bpb=1.084669 time=267.2s
+ ttt_chunk [711/1238] bpb=1.085380 time=270.2s
+ ttt_chunk [721/1238] bpb=1.085675 time=273.1s
+ ttt_chunk [731/1238] bpb=1.085043 time=276.1s
+ ttt_chunk [741/1238] bpb=1.084760 time=279.0s
+ ttt_chunk [751/1238] bpb=1.083835 time=282.0s
+ ttt_chunk [761/1238] bpb=1.083212 time=285.0s
+ ttt_chunk [771/1238] bpb=1.082210 time=287.9s
+ ttt_chunk [781/1238] bpb=1.082171 time=290.9s
+ ttt_chunk [791/1238] bpb=1.082537 time=293.8s
+ ttt_chunk [801/1238] bpb=1.082833 time=296.8s
+ ttt_chunk [811/1238] bpb=1.082316 time=299.7s
+ ttt_chunk [821/1238] bpb=1.081119 time=302.7s
+ ttt_chunk [831/1238] bpb=1.080801 time=305.6s
+ ttt_chunk [841/1238] bpb=1.080334 time=308.6s
+ ttt_chunk [851/1238] bpb=1.080029 time=311.6s
+ ttt_chunk [861/1238] bpb=1.079690 time=314.5s
+ ttt_chunk [871/1238] bpb=1.079576 time=317.5s
+ ttt_chunk [881/1238] bpb=1.079123 time=320.4s
+ ttt_chunk [891/1238] bpb=1.078596 time=323.4s
+ ttt_chunk [901/1238] bpb=1.078982 time=326.4s
+ ttt_chunk [911/1238] bpb=1.078667 time=329.4s
+ ttt_chunk [921/1238] bpb=1.078938 time=332.3s
+ ttt_chunk [931/1238] bpb=1.079611 time=335.3s
+ ttt_chunk [941/1238] bpb=1.080021 time=338.2s
+ ttt_chunk [951/1238] bpb=1.079963 time=341.2s
+ ttt_chunk [961/1238] bpb=1.080771 time=344.2s
+ ttt_chunk [971/1238] bpb=1.081188 time=347.1s
+ ttt_chunk [981/1238] bpb=1.081548 time=350.1s
+ ttt_chunk [991/1238] bpb=1.081337 time=353.1s
+ ttt_chunk [1001/1238] bpb=1.081385 time=356.1s
+ ttt_chunk [1011/1238] bpb=1.081741 time=359.1s
+ ttt_chunk [1021/1238] bpb=1.082456 time=362.0s
+ ttt_chunk [1031/1238] bpb=1.082932 time=365.0s
+ ttt_chunk [1041/1238] bpb=1.083386 time=367.9s
+ ttt_chunk [1051/1238] bpb=1.083313 time=370.9s
+ ttt_chunk [1061/1238] bpb=1.083325 time=373.9s
+ ttt_chunk [1071/1238] bpb=1.083472 time=376.8s
+ ttt_chunk [1081/1238] bpb=1.083361 time=379.8s
+ ttt_chunk [1091/1238] bpb=1.083566 time=382.7s
+ ttt_chunk [1101/1238] bpb=1.084110 time=385.7s
+ ttt_chunk [1111/1238] bpb=1.084406 time=388.6s
+ ttt_chunk [1121/1238] bpb=1.084578 time=391.6s
+ ttt_chunk [1131/1238] bpb=1.084228 time=394.6s
+ ttt_chunk [1141/1238] bpb=1.083897 time=397.5s
+ ttt_chunk [1151/1238] bpb=1.083953 time=400.5s
+ ttt_chunk [1161/1238] bpb=1.084094 time=403.5s
+ ttt_chunk [1171/1238] bpb=1.083869 time=406.4s
+ ttt_chunk [1181/1238] bpb=1.083397 time=409.5s
+ ttt_chunk [1191/1238] bpb=1.083547 time=412.5s
+ ttt_chunk [1201/1238] bpb=1.083565 time=415.4s
+ ttt_chunk [1211/1238] bpb=1.083248 time=418.4s
+ ttt_chunk [1221/1238] bpb=1.082781 time=421.4s
+ ttt_chunk [1231/1238] bpb=1.082423 time=424.4s
+ ttt_chunk [1238/1238] bpb=1.082429 time=445.0s
+ttt_sliding:done val_loss=2.796264 val_bpb=1.082521 elapsed=445.4s
+legal_ttt_hash val_loss:2.79626379 val_bpb:1.08252085 eval_time:445694ms
+results_json: {"val_bpb": 1.08252085, "val_loss": 2.79626379, "bytes_total": 15988459, "peak_memory_mib": 34604}