From b955b6df7187cb89859d84c1055b86644ec793c7 Mon Sep 17 00:00:00 2001
From: "Dixing (Dex) Xu"
Date: Thu, 2 Apr 2026 17:34:43 +0000
Subject: [PATCH] =?UTF-8?q?Record:=20MuonEq-R=20+=20Depth=20Recurrence=20+?=
 =?UTF-8?q?=20Mixed=20Int5/Int6=20GPTQ=20=E2=80=94=20val=5Fbpb=201.0929=20?=
 =?UTF-8?q?(3-seed=20mean)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds three techniques to PR #1218's 4096-vocab high-WD stack:

- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
---
 .../README.md          | 133 +++++++++++
 .../submission.json    |  16 ++
 .../train_gpt.py       | 206 ++++++++++++++++
 .../train_seed0.log    | 221 ++++++++++++++++++
 .../train_seed1337.log | 155 ++++++++++++
 .../train_seed42.log   | 155 ++++++++++++
 6 files changed, 886 insertions(+)
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/README.md
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/submission.json
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_gpt.py
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed0.log
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed1337.log
 create mode 100644 records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed42.log

diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/README.md b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/README.md
new file mode 100644
index 0000000000..9e8c7d33f3
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/README.md
@@ -0,0 +1,133 @@
+## Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ (val_bpb: 1.0929)
+
+**val_bpb = 1.0929** (3-seed mean, std 0.0009) | **2.5145 nats** | **~15.96 MB** | 8xH100 SXM, 600s train + ~83s eval | No TTT
+
+Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult + 0.085-WD).
+
+Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> this (1.0929)
+
+### Changes from PR #1218
+
+| | PR #1218 | This |
+|---|---|---|
+| val_bpb | 1.09785 | **1.09290** |
+| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) |
+| Depth recurrence | None | **Layers 4,5 repeated** (RECUR_LAYERS=4,5) |
+| Recurrence MLP sharing | N/A | **Fully shared** (REPEAT_UNTIE_MLP=none) |
+| Mixed quantization | No | **Yes** (60 int6 + 6 int5 via Hessian sensitivity) |
+| Recurrence activation | N/A | Step 3000 with 20-step warmup |
+| Everything else | Same | Same |
+
+### What's New
+
+1. **MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization in the Muon optimizer. Improves conditioning of the NS5 iteration for non-square weight matrices. Zero-byte cost, ~0.001 BPB improvement.
+
+2. **Depth Recurrence** — Layers 4 and 5 are repeated once after the initial forward pass (virtual layers 12-13 on top of 11 physical layers). MLP weights are fully shared during recurrence (REPEAT_UNTIE_MLP=none), so this adds zero extra parameters. Activated at step 3000 with a 20-step linear warmup. ~0.003 BPB improvement.
+
+3. **Mixed Int5/Int6 GPTQ** — Hessian-based sensitivity ranking determines which layers get int6 (clip_range=31) vs int5 (clip_range=15). The 60 most sensitive layers keep int6 precision; the 6 least sensitive get int5 to save artifact bytes. Combined with full GPTQ and brotli-11 compression.
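The actual MuonEq-R implementation lives inside the compressed `train_gpt.py` blob, so the sketch below is only an illustration of the idea as described above: normalize each gradient row to unit L2 norm, then run the standard quintic Newton-Schulz (NS5) iteration from Muon. Function names are made up for this sketch, and the NS5 coefficients are the commonly published Muon ones; only the row-norm-before-NS5 ordering is taken from this record.

```python
import numpy as np

def ns5(X, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration with the standard Muon coefficients:
    # pushes the singular values of X toward ~1, approximating the
    # orthogonal factor U V^T of the input matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = X / (np.linalg.norm(X) + eps)  # Frobenius scaling: all singular values <= 1
    tall = X.shape[0] > X.shape[1]
    if tall:  # iterate on the wide orientation so X @ X.T is the small Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_eq_r(grad, steps=5, eps=1e-7):
    # MuonEq-R sketch: equalize row energies BEFORE orthogonalization, so
    # rows with small gradient magnitude are not crushed by the shared
    # Frobenius scaling -- better conditioning for non-square matrices.
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return ns5(grad / (row_norms + eps), steps=steps)
```

In the real optimizer this orthogonalized direction would then be combined with momentum and weight decay as in standard Muon; the row normalization itself is a zero-parameter, zero-byte change, consistent with the "zero-byte cost" claim above.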
+
+### Carried from PR #1218
+
+- 4096 SentencePiece BPE vocabulary
+- 4.0x MLP multiplier with sigmoid-gated activation
+- Weight decay 0.085 (high WD for better compression)
+- Full Hessian GPTQ quantization
+- XSA-all-11 attention pattern
+- BigramHash embedding (2816x160)
+- Sigmoid-gated skip connections
+- Soft-round QAT
+- Split-LR training
+- Brotli-11 compression with byte shuffle
+- EMA (decay 0.997)
+
+### Configuration
+
+```bash
+NCCL_NET=Socket \
+DATA_DIR=./data \
+SEED=1337 \
+MIXED_QUANT=1 \
+N_INT6_LAYERS=60 \
+RECUR_LAYERS=4,5 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT)
+
+### Core Results
+
+| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact |
+|------|-------|---------|--------------|-------------|-----------------|----------|
+| 1337 | 5,541 | 106.5 | 1.1000 | 1.0939 | 2.51667 | 15,933,457 |
+| 42 | 5,530 | 106.7 | 1.0987 | 1.0922 | 2.51279 | 15,981,324 |
+| 0 | 5,543 | 106.5 | 1.0988 | 1.0927 | 2.51394 | 15,960,050 |
+| **Mean** | **5,538** | **106.6** | **1.0992** | **1.0929** | **2.51447** | **15,958,277** |
+
+### Supplemental Diagnostics
+
+| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size | Total submission | Train time | Eval time |
+|------|--------------|---------------|-------------|-----------------|-----------|------------------|------------|-----------|
+| 1337 | 1.1000 | 1.1122 | 1.0939 | 2.51667 | 21,084 | 15,933,457 | 590s | 83s |
+| 42 | 1.0987 | 1.1106 | 1.0922 | 2.51279 | 21,084 | 15,981,324 | 590s | 83s |
+| 0 | 1.0988 | 1.1113 | 1.0927 | 2.51394 | 21,084 | 15,960,050 | 590s | 83s |
+| **Mean** | **1.0992** | **1.1114** | **1.0929** | **2.51447** | **21,084** | **15,958,277** | **590s** | **83s** |
+
+### Rule Compliance
+
+- No TTT (no test-time training or adaptation)
+- No SLOT (no scored-position lookup table)
+- No validation data during training
+- No training data during evaluation
+- Artifact < 16,000,000 bytes for ALL seeds (max: 15,981,324)
+- Train < 600s on 8xH100 SXM (590s)
+- Eval < 600s on 8xH100 SXM (~83s)
+
+### Architecture
+
+- 11 layers + 2 virtual (depth recurrence on layers 4,5)
+- d_model = 512, MLP 4x (2048), 8 heads (4 KV heads)
+- 4096 SentencePiece BPE vocabulary
+- BigramHash(2816x160) token embedding
+- Sigmoid-gated skip connections with soft-round QAT
+- MuonEq-R optimizer with row normalization
+- Full Hessian GPTQ (int6) with mixed int5/int6 via sensitivity ranking
+
+### Requirements
+
+- PyTorch 2.9.1+cu128
+- flash-attn 2.8.3
+- sentencepiece
+- brotli
+- 8x H100 SXM 80GB
+
+### Run Command (3-seed loop)
+
+```bash
+for SEED in 1337 42 0; do
+  NCCL_NET=Socket \
+  DATA_DIR=./data \
+  SEED=$SEED \
+  MIXED_QUANT=1 \
+  N_INT6_LAYERS=60 \
+  RECUR_LAYERS=4,5 \
+  torchrun --standalone --nproc_per_node=8 train_gpt.py \
+    2>&1 | tee train_seed${SEED}.log
+done
+```
+
+### Lineage
+
+PR #1019 (ValCalib + GPTQ + XSA + BigramHash, 1.1147) -> PR #1218 (4096-Vocab + MLP 4x + WD 0.085, 1.0979) -> this (MuonEq-R + Depth Recurrence + Mixed Quant, 1.0929)
+
+### Credits
+
+- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture — the foundation)
+- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline)
+- @msisovic for PR #1204 (depth recurrence concept)
+- MuonEq-R inspired by equalized gradient normalization literature
+
+### Included Files
+
+- `train_gpt.py` — full training + quantization + evaluation script (21,084 bytes, self-extracting)
+- `train_seed1337.log`, `train_seed42.log`, `train_seed0.log` — all seed logs
+- `submission.json` — leaderboard metadata
diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/submission.json b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/submission.json
new file mode 100644
index 0000000000..483d77269d
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/submission.json
@@ -0,0 +1,16 @@
+{ + "name": "Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ", + "val_bpb": 1.0929, + "bytes_total": 15981324, + "blurb": "Adds MuonEq-R optimizer (row-norm before NS5), depth recurrence (layers 4,5 repeated with shared MLP), and mixed int5/int6 GPTQ to PR #1218's 4096-vocab high-WD stack. 3-seed mean 1.0929 BPB, all seeds under 16MB.", + "author": "dexhunter", + "github_id": "dexhunter", + "date": "2026-04-02", + "pre_quant_val_bpb": 1.0992, + "bytes_model_compressed": 15960240, + "bytes_code": 21084, + "base_pr": 1218, + "seeds": [1337, 42, 0], + "seed_scores": [1.09386, 1.09217, 1.09267], + "eval_time_seconds": [83, 83, 83] +} diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_gpt.py b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_gpt.py new file mode 100644 index 0000000000..25c4f33992 --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_gpt.py @@ -0,0 +1,206 @@ +import lzma as L,base64 as B +__wrapper_size__=21080 +exec(L.decompress(B.b85decode(";Ol5Vgk1nHnZu%UrccSf2s5HR(X=5oFNAOTCuxCrogOEQMJ5H^qbwTDyUd%1xi3}vD%`XtQYMO3O;X5#G6g*B;Rn%i>bZ3N" +"Zj4D4jJny8zZfKU1H((!&C8|b#-hK%c`7s;1&*iCVdO;~^copz2v4&;f;Zzi4T-IWqp9GimXaHPA7@2siqG~fJ&sY6GKL%=6%W?c" +"Dzk(#$DOy-^%0_iWNenob^z%#nTidf2MRIJ|L=uLbTPt0&khgrNEoW}oq^gTZ{eBD*_mO=b&?7H8$lsM01N>hYi$O=NWH7SYj+D5" +"_ihiv+^xNBX?ym+fy1dr>w>ti{MCzGh`x4y3JIMNsm*^T<-^6qBsXdb7@q3p?t$t3uxJ;C@Q>qay);$fs$Vh0J)JH+tRoZztorO6" +"86Rd$at%BG!sW3I+1OC(_w2Aa)1uoa|*070!`!mu`gQAE#1" +"zft*9BwmSVKDu@TwEggTUe?X0MY$}E" +"q#}1dcOf82!ixekJSN1IQ%SClfG-7$(C3w4KIP48F@|Y$3b&g&L5BG`ACNd-7NCjL+W+XRiivSRVI(|?HW_?hS*{kE5P!xJ_?)YCbvmEJFezJrsNni;fkBwa-&S2hxMyT7Nuhz=9kDD)_Bd>}l!rmgKz6BK2PUq`Y@Yu_y+Zs?(PU" +"eY@D;xTA;XV-4vp+CNwnIXc7^!(c_m@Ifh4wO`_uYe^y|J~#keAEK#=%-}eY;~-kUcpN;RO0tX$S+~p8O>F=jDc-hv6%g92n4g+t" 
+"c5-y255UPwm$Kt?%Qq)oAqX}1Ib$-`ub1={0eT5Qy|w2KAVZpeu-VILM$*|+;Oo`51%57Ko{caj%Nlj1<_9lm|MhZ24iOS1+_Xqw`{8J7NQk$s@ES@(X`(`Vbyy`7oW(_Bz@BlxJY+(+LF!u<2CTI7J13d4aEV80ZCXnmkU5N=" +"o*fxdpUTA~O?5jPiphF!%J7424JpDYqv|xP@-A6(OAM!+a<@Oo-fYrA1kA79WUS^" +"*L_LZp`>@t)lD?&9ZfxLGD0pIjDy-$fk07u!ra~)aP%|*2aNoaP|Kx#3dh@`Z*kz4NiM~n6F(|Eo3YK+OjIy(XXa)Ok;}MBM8fIG" +"K-H95+;*9g+2+BF_E$-`>0`n;=n}rmL^rS;!U}>^H3iS$ZcTdo3" +"eL-;3H|Of?_LV@QdtjnRcLlhN%;^=+(PxR)hN{s9>6%ms?^ld3arHM>J9rD6_*q3qbMrPce=3AW$%mAIAaGK&4i_;oMO=~r#3|h!" +"RAdkMCE{4Q+euBp_PE#DwO?WO4p9>xC{Pf}LrlALz4I%9K|K_*1x60vpW!&=jA5n-17JfD03mi(T&NXlrwbDZ6DSPC>(CeGk%smf" +"q8es&&YP^?E;z%4RANZcbvl9bk9J1gh75!q7&DtVh?7~a" +"ByVxNf_OA9N|UJo^t0JV#>|$}k{4A(2dK6OGXQH8m0zsz}E;WMWcm)xy-qdZmFcrVKYoYLtC{ofrvaPO!y2f|&93SWjK4!ko" +"m}d*U5>~h#ta1ZoZ0o$;O<|VaywKIaY?YQvpa{vx!#$VP0%ndJW?kF2OK^|DG|yaU|y;waN0FgydECW=NbBp4}ujqu698jHhw|!P7>=0l8*q>ASDc?gB~#=>8B+lV6#eaH9RFaGcZdm`|Q02t`5F" +"S1TX*0_&VEmviazDtL6udVnEGT?M(3b)!$WqhBZT0?12hpiLf2JfKcjA#Y6IVRr-=|H-lqJ5&(KE?f&w(*`)umb{^sg%U6~fy||9" +"aHb3qHjyw1*RSG)Q|OKlY-Ot-UJYWH5d4@mFDMZjKbcE*2}l4>O6ajc<)6-je%he|@(<&q2eXVU`E;TT8|0Y!ey6w|TR33Vr#15b" +"ZYtH~6%|GMN0ChN`c8yCNa1MBy8T7bi=XP_0dUf$1aQOGV(RW@0ak{;M@Oow&#ueB>j>bE+Ss-@?Q7GAng;Cj10B2RaYCtco-Apy;2Ay54l!9=dOidh#$3e_=VS76L^(AXnh2`" +"e3|cSAbOS<#TI0}E+D75fp{q(;Gl(OIlQNkR6+TS{LX~_3RUC8)WxgJ" +"H4mkRDU7h4U09H*`c#wT*o0}Ckrx{++8Vmp-cK{~N~;kDAALn6k3%~?4so~qMgxo03J6J1Zd#f`Z(Oqa6#@jLM4C(jwsvbz_?kMD" +"B#N|>0}(}Zy~QA7n(@;HBG@4tFY" +"Ajq&3VW+2$qxt8$A*t1O)s}`Rsd+-U<6~U87n|3hwbbdlz|9Bl)4C?RpO`bP<^5j3gjX~^rXmIjYPmOTyfXP?V" +"=FeoaL2bvl#gp~Z=)cF+E2^2Vf@EqQg9+{nJ!3F?3en1+!0Bw>yw5O%JheH1;xiUA;XxvtoRN=H!pmWgb;K>3Tbu1" +"m0BInW(-T3XDOPhUe*znVGJb*VS%zu_nJj)S!_uHJfOla?C2qSb4O3(E;m-WO}RU1yAAxZ=$Ht$fX71Ka9i=z()7iGUqjvZHRa(A" +"Z0gHA+TuZcy?JxaBt1&L4i@VngDhMLH9L^MY7Tne{dN38(5;4NE>j51s^fCs>pP=9riAy0JfbXP??P3x`lGjf+E!;WA&lFou7@)e" +"-`q1@4b-P!t)q-){|bigWhO8@tPq*ghV*F7U&Kv-*8dTcPN5EC6_0R`j^vbsf;cF+E1d3E9R&-CV@mD*kg5$MZAV_)xn^`yxo*}q" 
+"3A;Nw86XW444OL+?$meL86ebAGX*M44Ft3fQUL_&9QnviJgl!yDd~ISx3Abc&(^t&wH%Hh@z+=@p" +"Y6HMebnO_+d%Bueq3-wE+T2H~l>yeZ{~V#sqi@7lT##TgR;O1GMEg@CFc2}4G6|XQPx;J@LE3sp6Y$K?hu$6x@>7NW{_WpP;=y{z" +"Q1eCMHH8yu7v*mtU3+M7*VP{^CvuRZIuC2h?8A*F1I2Uit=K7?StLo$kI*l9r3OT}jJv!|Pc#9&3NJ?CYf7PE" +"KYQq8gS;HK1!(y!2+%ixb>E|HomI-U8Nhx>c^AYMfn%Mhj9DSl>CLIHigwPc_1ga9XFn6}9#A>!j}K%0zg6$mM$c(G*(AuEx;&~t" +"k|evxbC)dtD`4+3;?z{tvg1(<{Uh2AVku9TYxgZ)oZYO1VqvOI89tx-WS4(BYWN6^Lg-u%jCw9bS7elAl|%j2^#Iz{44}-*R6d_f" +"y(CruE)ZNMX$3@6NV;nfGGb4p$C%_UF0@" +"KbGN(!g);%9b;weB6ZS?K8#6!Xv0s-1m" +"DK`IhhKsa41)@-kxvn=a*0&-;7#V1}?=>3|z2qnod_4w+%YPHGE;sU-aI+)*T)1o{5wbpJxC1#Xh%yvZT*L8Rr7Y|IyGCG6a@*7B" +"e7M>|WFjdz5*pXNNyMWlofRVp?^yIpiRgo;WxA(J85}7ayaa9@q0@g1b@YNJD(Dg(rSAjIlGVc_icT%{m*+OnukA~fbJr)Vr^?Lg" +"_hX|h`O%X@q4E?j)OA0SL?K`9t&c_uO1DZWrm=zV?b-=juG%YH3$Mk!IEHRX(R~GZS31Kny^%Ivpc^qO(#_^anJqhDdjrPog^}gU" +"ix!I1dmzEyt)Q(WYvjsD}AR$*$Ve!3" +"3x2H?BS~UP;!jEG@x=3&`|cR}RoB4ED_fVB5H?*FHYHu8c32ofs~i7Q9XSca3%f^2p4Z|KsEKn=9z3Dodu_" +"Am*mI8!V{^2p9LZ`?ZeOLf?jl507R3qWtcN{zY|8ydT(@yQKEo`X3f}x>!#wfMtm=t=5lSX" +"rq0b1oV*q_0VQght#Eslv^2THff^b+5Qo&B@vJgZ`T55=y=bcCpBEY$1r" +"L-TKlc}P*hn%zaHQGXJOC}u4j7MdZ@>ePTkvrZc!<(|8V-I-JzbteHT=@BRqWI=qO^#2lPvOIAy*8EjZoMj~ftACex3$)?!Y4ebV`NqLcQ}m;YGl;NN9}~y%6D8`u_tIX" +"o51sE7@)|M9hLc-^HvW6!$794jVGTa8r$5pjR4n*i_%Es" +"B4LO@{3QLEUs2xQ%~!ud^dV-J>B?h2e{(_&*|yDgo^M$*-xlk(SjT8!#3pZSAqiavLuip~a^H_6pZ;ifMPM#1$&3K@WX5*3s^~um" +"kpC)uKhoQ_1zYbfXLZ@&0seTTANl)j5JW8><7^E!7(T3w3x*xg%a!b!yrxboqf}sOudluI_dp$BiRjKfsNwXlhSoBgzSG|Z}pH)Lbv(und5k3W72dnO|vS#7vY;O`*rwx+(W4=$X3rS)G_bS1)a{fU)5sfbvc" +"o8<;pK(uK%i>~4uF(Re}e)1cvyD)bUn{4F(3^&mtvaCMzYF#2C>rB%K3TD&}gBr-ao5!b4+bAE{)#baxWCa$^&`4BTp%If0_UcTU" +"{gz;9TWa*uRwAI3+JDLcW@-~@%3)=^CaOLwBevjo" +"bVW(W!Zhi6s58Q|5Z@&" +"!UUUn_L$LqS)xeV;#UXO70;{%IdWW=A#7U!!_Zkj{Io;htIEs5nLddQM`%NEw}Xe~ME;4+mG{F*Qw37$$G&)n*&nssrrlP|r;o" +"vxH11T%hGZ+(X~y31?j&T7_V@LV-V!{q<)%&-jRR)x_%2nlX~{A-o629_`Ja67BjdN7PG{`1JzqMNVhKry%Sdcpnc)mD9R;Bic;Z" 
+"4PR2_taYC@4c%#~Yj7I;lfZbvpQf4hRRf?!Mo*OA7Sa^n;Su>1F" +"DJYRbm^H2CK*~#{z>2^Ir(tv>_Fi9oz`~D17^+fNhn*P=!F&$%$M8@U#&%LYZ3Q$uEZNMHGaa=%JPHYO-mB+w=@i6Y1|g9EM-;JOR1O8SkS>RY(t2FIicp4oJ;L6~FB7HSMlo+&K3Fd+8K8YJbN4sDE_=^)3Eg=%I$sG5^U" +"OJ8q8AYCJRTPLwv!I&+{7f7(88BFh|_xMmibGTYeWpmy#VWCkM!8`;PVDpRW=}?eZDs4pa+fLXE_<8`JfJ7qGg!Vjlv?SNgx}eQHTq8RWT+*VT3XECK1tdHWlEJ7" +"PW8HS-H5X*+8@0J8n#u^9sWPN>mSq&QA6qp|BQV(Uv`iysVM;uVuzHNcILR5_$KdYhxGCeA!5" +"*ikckyZuXA)#~~oB2LzAy!iII1j6?;6C!;0z@XAdWe~JlpYy~KC`nb$5J+@_$VGvf8P`2GOzQNf&MDpno5w&V&C6lDK&~ZHnpyJQ" +"+BL)29e2hK;ek>OfFFH!-XJK@bLS?;tJXJ;Uy{7;Y;w@j=Bh-@X8a}P+vf)<``d5KIdDV2xjG)ihV#w@`W*)SGl7zkdunc=k0Lj;" +"!IncCQr!2c7dXPwEiRIfWvkUfKt63l5I$yt%s(pS7fd9wV50OEMZ@kHHSu@E*h-#gJnOc-e&K-r*rfAa-wK#1;Z0!IKG4O" +"t~TM$0N`hWl3iYzKq!m&AEg<`e5Uz4pX9?7{<9+FlDp0n3d3MdY=Wko&c)lMnNmwrv8@p%HLmJ;d1G;g!lzFfi7Uyjr`G8KF&NqOC)f+mjleo-ff+r-stp2qXUf(>w^" +"z;M7-ilOuI9UryPxy5G?RyidWKBgt1r*lj+*5J3BK1NsRl#}rDM1{!ZFB1Zn?yo0}=3Gp;>PNr&Ie6+9k#48=J5y" +"#190>!bYjWmmRJ1<=h~-YWfYCQv(#Ut}B;7Kv;L_7c%Napa?Zx_^w*F7l@`BR&ITEim%r^BVIR_952v|+dOtu?C*?2`Zir{THOt$" +"b~Pm0?}u7qq|Q`R*^wK7Yt74$h1j;VsNoFnw-GquMC3R8^hq-Ih>zSP2$79*kQ{2tS5Vcg011ImQx8n`9I?h4aUXH9dY*xEMSx$$" +"`e;e{-s8_Vq4v%s1|-`cAQN%?Gnmv3I4)gVn{wDaAuco+S0p@W1P_WS**s+KhzjB_NL8&SoU3*g" +"npdYTsTudx9hyn+1=#n5Sa-`(|8}2{fAEn?y*icWS#kczgxeM?%4_(zu16tgp`0+AgrU_5NW>t^f^KF8wYW}jt4Dd(4225Xyd3%v" +"mw_XydSHbq92;u;?+f8Fc`+htT*Cg+8h6Xmq#>q{9b#Z{f$-h)Q*26m`(Hz-n3@TJ3)EM" +"iU%t#myZC~FAbsN`r`l2{v43be7I%wr=bg&EH;mqd1)D&!fY|_;27*X@b+;i!W3$|Oxw0z9e@89m>O11cFPPja3" +"Dg8}p8byH(5yKj?Z?5}5THZ3u0D5oWiFJLBz9H;~!FWrBqnrY4yoccI$(B0R>2!!*0f;YiaeiX7z*cBUmq^!&6S<2s0m5?xaK;BRz~L}A" +";Y-fKH=xy+K`cybfRQ4JhE?-*H%" +"*|^)@8doHEe%gyf6k7kPS-v(4=>2G@KINfD%`&V&s%`0Ra" +"m2TX<;I}N>WHRfc{{)d4sX>fzA1Z6!AR_y>qum+U8vkq1FevU3?)10dMsT9Qrb>kL*kmr;jG1mKB+;I(jr??z=6;|;f;4C)+7C=Q" +"4pqi^Z${lMn5}%PLf|JWsQzu0$0oSW@~yF4n%*H*rC;r}JzIQa(5No58nwAnuVWevU>K^5{k<`-6?r<-X1AG-|I&}Wp_H~|=_G3^" +"^9)a{Vt7`V)#cpQv@+?&X*iTXpJJB>-xzyQv5" 
+"iE08-Eg)>rMF(SKE&H2kQl~2*`FYFBZgOtU{sh&@0_WKK6RApuVG9~rQwH0q%p02fjrU+kGZWa9mi*e9hcMRbIW}8gPPcS1sdPQz" +"$`=g8P+q2!rwm~axRc8L4kT`Ja&&B~Up<{R*<{_d%j`xH;ID@Z" +"t|y-rW$iRwNG^x__94($GhU$ag55b!f|iJV~n{bQ=OiLfJd7Z}o4!hT$Ayl02X&{}" +"HhIfWs9#A!Yxls!buJ>C`%1PPNT}3A2D@8>%W2wM7i4@$swmUO%v|T7FQ^_w`5sGsvsYLLrX7pr&(1mO#`jcT^s6}3>=S=wQ!OsQJBQ84}(pEku|4IXSC*48D)YFGRWnPi^`Te^7-wIRxCj<_3@C(Ru^ESES~X7u17oy&cIZkt_PAu$FxG{uD;%8{I2dYop#rSvtcEj4i-}XxMnNntTpa*" +"oiV|x+;r`21Gvhw)kw4E{r4;J2r}i3tqc}y88))@?wo+j{y11Fbnp&*I>Eaforkvya|>^y4VN>?^nc@x4z_{ADfM_Q8cf8Ui6^ac" +"S4%U~RjhA&wq=|V`)3@uTx%?o4Wf^OL(yqcb#JG686DA^Sxv;L{?IgOalwfkNx723J#23&XcJvU8uOuq2LqJ)_)Y8Hwrh|guTPXs" +"`Swcoo8KWHPHO*dVKxC#R9dzw1=C+kGBubF@4OfVUPBX^1r78+N!o`G4e;Nq{D){8tnWN-qrV>Ra5Y;Jab&tlNXWqTjo#cadtP9|wi+cML6cYnP8l2j" +"(xx)rf&tl}^?lS$+i6jgO(%H#nio3H4O23SvR{f~A$+~-(" +"um`y0Ke(aH<`j5#Z6<404ni|``9VYM&~V>#lv6Qq=T$WYkY^&B{s!a{p_|#HJ?~d99o;w?h~^YD" +"pm&Q{k4EO3b|_M7t4FSh_FQBeF!@>ITy&9*Yzz6%?~z}rejY|rdVVxp$Cy>%1MX8$LCli($w-%hSwdGap&6li)erf|u+tE&UE%wj" +"dHN&CuP~SS0B+d_qSz`fOZP)Sy_*d4U{WEl7!P+k>SPvPac_je6ouRR8)0u5YjD(0c7$EwglGI-^BbmbO2PVEtdjHx!_f*so}Bia" +"6H}b2morlRS`fj|LgTJMwnHn_{touo<0Jj0jl`lYwqMdYY0g+k-NJ|" +"u9yt&;PH1&!Heetc~rE}Aam%B!CWyH=mku@H6*twFpe6DEx2KCp^4h#M?=bbXkhiL$0bad-|)nXso+uO+Z~XvNW}!@eqQ_=xYS@Z" +"=y;mQ&m)Vr3|-y(TP##99sOcYLMk*StSPEuwQFptF;=oa`!O+nt;r?OP7~3r-SEZGQvv5kX~a&R-WHB4=z!VvywR7PlE$hJrO&C3" +"XlqitP^Ng+k7D@|i2nSH+LSG$lWX*4>uy+vdNvAT`je4YCicp1FZ?ry)tA00Zg)iuB8$nF^P*-+AapQjqxWP86vfF3%L}f" +"q&R4qE<1rH$%1XyWlMro>Pd9ogE+}Vzh^rAIzoiQ8_%W$YFWnquiG3^DrGWBiopEac{m11Uw51)0!m^fs`ph~6Ypx?DXKVgAX18?" 
+"=OqIcSi|H!%R)H|#NsCEF!N0%;aw^wo<#5lLae*v9)@zerW@gc^X~-ic!@$~Kf~nRWcx|=CBLq3`M98e7)Xc}l(uyq(!@Z!xo@b&" +"sT<->qi^Q;_%-^wJqUh2yPNxnmDt2ewt)b|U60Nep`*`oWk;6dAmLoE$e;6$;~Mc15Tx3alm4c)5fSvA>H$IC~Y@u&df@D" +"Rkd!^`y3I>3P#`mru_4LLk?;~u>wY6Fgf0C*rnaDf!ICCM(6m6R0-_Sbm#Bg@SdILO9r=wB>^)EhSw6k^?!gL(_@SS@Qnr>LRwf@" +"yseKrpuutKE|lEnii4G8-U#eATAC@kFx18l;xY}7EA&u8cRuqoi3|S1h6b6cCkoK`&aY*ifw}$@7Th!V9b1(7k3ar51zHiQs%Sh4" +"m2Xg8hvCiZsUj$uVqSc$0r8pk5$XTc&j`Pt7afy_!Hf$jqhxi;?rX>{X%" +"1qo950F!E|TCKX5ngwd+c00qqP&3WGl;4PJ$6oU2i?|I)ErF1pGUH))@%sw^&;Nd)bl&f^k(VpoQ|<f{RaijrPldug{k0d1>la*6Sh;ZETcOei@1bf0|1>}tKCF$QvSL#`xrCL&Pd~|xg`?VJ_$wtu#t=1J1#}}(su%uc|nSA-^(V8_cKGbq6UKIQ6nn6ZBlKZa8qXIX14sjgH>c1lEq%$Sk!fIcp@Kg`BiugKKr;%qx8Vzb^}l$" +"Zkbq(Sc8K-7_W~gGzmH6f<1`{X^N6jygkgeCzSsivXMQJRwU8t6QRv" +"S(mzBKqCloDoK4x|49jgV6k*|kApQ!6Rm==j;~R?qo>7B8nt$h^E2PL%TnXALFGCN$jl|hW`L+fhk~oH" +"??nAM8gF$2_u7%J({HQ8F)(l@eoCC{c21<$Rx$ZLfijTw_9AyY5-PL-h~pt%_x~|IEF_M~0(-V%eJeR_LD(YT4tpx+ejojpntP0&" +"+(@a6dfXAG6%IA-4CwmwJo7gRBlIx)ssOHnX)r<9LTonwG=zp_M9Y$7JNIIZjV(M16Y>TiNglr7T&}crj*Bh5{z)#Sk#-5TsSlFY" +"6gNSAHFRmrg1kTM5y$@8Ed||_8tD<=E2E`Qea0Xgv*}Dla?Vqk^kte!w~4I^URG$9at&4tr`hTCAf*5xT9Ps#D_y-NhG0uX!dBDr" +"!2&;Vyr@LP^m2()P}2m#wpJDAW5`z;zC6sNfEoFhhQlj9lF~`K*xDcdwx5+%X$Pen8I7MOu=+X~HsLP^hOPuD&`#oEl7hY1HqOVC" +"U&J};^4kr$@tN92ygOr${uT0I?Uz*p5zZ{L1r<6+*t;DD%#k8-DKkx7b%xh|1>#BH;U~zp?ehvcI4Hfzs`Bl2bQ%GgbLMVbN9wO~" +"e?)&eQAdHRm!M3X@*1V%W{(#=$`nRkrLJ#Gxo5~~!2o{7JUn3_R5m+U-q8hVWO3a8ttXesnI~mr-!jCR$)iu+g)?fB" +"56k?^_+gZZ8g6~zIoJolIyiEjK?CiV=u&)f4{i9}RLF_&rZ4H}Xtn^Y5pmwIT16ohXC=YAHOD>DH#{~Mk@IY8h_e2q{7R_E(1" +"zY`@0eo)L^4Flop2C9Pqo|3C3Wf@5DjrZa~AOXM-`lZL&9_sVQtOm*2p8>L(i1!%ibOr?v7up`XWy!)Q&j|hjeGagI<8m~^3;byK" +"_4jEZ54eKR7oI>F41mZE0&Bkof+-r_w7)WEIy+*jv~R?k`B!HEpd!1KJ58D@tfDBaHVw@OZSI3C&6SA7Wv)`)pDtc>nkWl6fpYy-" +"3ctK$Ns*uYoB8OwK;yT^T&yQGvMBLQpYdX&k!A!lTKP)@i&#c@a42!JqW_ZkGEex!T`=}Gn#tU^V2JaidUp8z6+0Y0iLrp?X{T(E0=y(($UM`" 
+"pB?D`RdYXBy=?zo!!LNo9^%eeD91yrZ)yxA37_Y$(JrVCb$&BZyMNp&=O7{RSD%8un5i`AXt$6PrZN{!WEPU`Y@1ukajrS8zc|5G" +"vO#zXbY1O;psHSpjWtb2C$B*LBwnuAk=?tVU!~aH>FSyn>@21>7giNDuP0xbO90PnBg&Xm1_+r5xFEBaVJ>Is($-=q0ksWM6YeT>" +"p!VY1FMj~(vs&=Vqtv3Nav#j%D~0Uffs$r@gW~_pi)W^wl&ho_dF;pqD)T7m5#ebSW!-z~U{&>u!97%1;ej98`<7v-{`7h1rV#{N" +"E}b*F89sWDpEP&d3M4W$L?Ut#qj@dUk{214S%hrH$f_1&q#B+DG;Ys&Z>O6WOM_6Wk`c{DtU#LwN;Lk{Pg%X7`xRCcQ%bE-oVr^" +"LNo!P5OY5AKkt?O9bTscC_)Bgh^K-iMz-{1BW|xjR2>d+W0jyP~;e|H<^Yp+Dr4BL2y_;$001C#_I@G0_o?m&^8BYPd(}}Q86Z^)Lm8FkF" +"Ll0;$YtW~K@B?8ec&RPHKV>4CO~)T1ly@@#;>=tivUdW+k&BWdtKQFOq=AHfh}#HqYlB37%f8frxub}5n9bTv0P$FhI)%{Omxk)N" +"g4?)ghieb?#D&_1nckbIFZ$uy|G@0OTVrOl(jzhdT3~f4Vm$k0*(R7jm;Vs<3+JVXROX)R!-hssd4eNM=T+)x*h4rXi39`si3S1^" +"v!{QGT>zr~4=>%aiWRWXMS*Nr?wF2o(?yywm(csaQV<`Xjq5h5lkhaZjU0TBN" +"J`Z_*d1TxS`$ETdxT{rq_IilJM~y()MEEx+ZQ^(0Hr&pJP~UEV#-lefkEAmkJ6ol!Du7q3|6udbn~_9d1+##r2h+v2kV_sQq<1!Y" +"#8NDmYPI?Aciz*7VasT(Iy{}!dHyUh3hpV_|G^yNN)D@10-fl}P}drbHSagNR0=p8?CFZfE?~~2-<@L2SP0n1n8Bv0e%ZxrQOPT?" 
+"M42{dgYL6Scds-cP7SC`LIV?gx?+uumndGr*6nH{X9sPe7#x?K0h6ON&5m<^B3>qK>Q?Naz*);d7~vHqr|@(HS@^-8HZ=C_)*kJg" +"?0{Via|P<|h%Nm1%sGP4LPM$pE}cQYr?I(MswzH66Ug@^U1|MfWA&lk%FO67T(cu`V~sUi-fG@u_k{btp}UqGkl+mv6lMzz>$W+&" +"*meYra%-5i+*dqMOIvVmZnwAq!Jh>69LoX*FjfaIscnenLO+rfwNBn$Hgh+v>8+-f4UUvU+TJLU{jZass!h((u2)Ju2obrdi?fvM)n~WM1joeU?A_=`ndsKcRiW-X?JgA!LYy" +"ZpbaMH0cvt|6v5I6P-1rzi>}B@@%h;SgrIuHjhwY)5V2fi?8)=f&Uta^ax*8Y`d1l1oq_AoAYepjf}Fbbuj4N" +"bp_JuJ_|(~s|;5ClmQrLX``a5XlJW*X-Yn76B+sFHwUV=oY?e~AkD9m=IA3SL9c%qhcX&aA>i`RYOrNZd&FE2@CwcJtEYpn$hy7K" +"cS3N*_{|xxgYkiMm)o|s+;!hOG9mb9*a}zx"),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed0.log b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed0.log new file mode 100644 index 0000000000..a63203355c --- /dev/null +++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed0.log @@ -0,0 +1,221 @@ + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /home/dex/parameter-golf-with-cc/data + datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096 + disable_layer0_attn: False + distributed: True + ema_decay: 0.997 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_enabled: True + gptq_reserve_seconds: 10.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/afbfa28c-3630-47e4-bd72-5f5de1c7a5f7.txt + logit_softcap: 30.0 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mixed_quant: True + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_wd: 0.085 + n_int6_layers: 60 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + parallel_residual: False + parallel_start_layer: 7 + parallel_start_layer_is_physical: True + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + recur_layers_str: 4,5 + recur_start_step: 3000 + recur_warmup_steps: 20 + repeat_untie_mlp: none + repeat_untie_mlp_layers: + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: afbfa28c-3630-47e4-bd72-5f5de1c7a5f7 + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model + train_batch_tokens: 786432 + train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: 
/home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin + val_loss_every: 4000 + ve_dim: 128 + ve_enabled: True + ve_layers: 9,10 + vocab_size: 4096 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 143 +val_tokens: 45514752 +model_params:34401371 +parallel_residual: active=0 start_layer=7 start_mode=physical params=0 +recurrence: layers=[4, 5] start_step=3000 active=0 +repeat_untie_mlp: mode=none layers=[] params=0 +gptq:reserving 10s, effective=590000ms +[rank6]:[W402 14:58:27.912625816 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank7]:[W402 14:58:27.953488321 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank5]:[W402 14:58:27.963555029 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. 
If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank1]:[W402 14:58:27.029060096 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank0]:[W402 14:58:27.033791775 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank2]:[W402 14:58:27.132068157 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) +[rank3]:[W402 14:58:27.413365777 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank4]:[W402 14:58:27.438351425 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +recurrence:prewarm active=1 virtual_layers:13 +recur_warmup_step: 1/20 +recur_warmup_step: 2/20 +recur_warmup_step: 3/20 +recur_warmup_step: 4/20 +recur_warmup_step: 5/20 +recur_warmup_step: 6/20 +recur_warmup_step: 10/20 +recur_warmup_step: 20/20 +0/20000 val_loss: 8.3145 val_bpb: 3.6139 +1/20000 train_loss: 8.3158 train_time: 0.0m tok/s: 8379716 +2/20000 train_loss: 12.3152 train_time: 0.0m tok/s: 8325885 +3/20000 train_loss: 10.7849 train_time: 0.0m tok/s: 8233381 +4/20000 train_loss: 9.0455 train_time: 0.0m tok/s: 8178729 +5/20000 train_loss: 7.8696 train_time: 0.0m tok/s: 8154335 +500/20000 train_loss: 3.0030 train_time: 0.8m tok/s: 7936572 +1000/20000 train_loss: 2.9385 train_time: 1.7m tok/s: 7934751 +1500/20000 train_loss: 2.9007 train_time: 2.5m tok/s: 7933718 +2000/20000 train_loss: 2.8298 train_time: 3.3m tok/s: 7929996 +2500/20000 train_loss: 2.7083 train_time: 4.1m tok/s: 7927456 +3000/20000 train_loss: 2.8187 train_time: 5.0m tok/s: 7925349 +recurrence:activated step:3000 layers:[4, 5] virtual_layers:13 +3500/20000 train_loss: 2.6939 train_time: 5.9m tok/s: 7749739 +4000/20000 train_loss: 2.6189 train_time: 6.9m tok/s: 7623379 +4000/20000 val_loss: 2.6463 val_bpb: 1.1502 +4500/20000 train_loss: 2.5799 train_time: 7.8m tok/s: 7527152 +5000/20000 train_loss: 2.6296 train_time: 8.8m tok/s: 7451901 +5500/20000 train_loss: 2.5759 train_time: 9.8m tok/s: 7391986 +5543/20000 val_loss: 2.5281 val_bpb: 1.0988 +stopping_early: wallclock_cap train_time: 590084ms step: 5543/20000 +peak memory allocated: 30215 MiB reserved: 30244 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.52570234 val_bpb:1.09778300 eval_time:1996ms +Serialized model: 132405891 bytes +Code size: 87136 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 60 int6 (top), 6 int5 (bottom)
+ rank 0: int6 blocks.0.mlp.proj.weight sens=5542643712.0
+ rank 1: int6 blocks.1.mlp.proj.weight sens=1539542784.0
+ rank 2: int6 blocks.2.mlp.proj.weight sens=596059136.0
+ rank 3: int6 blocks.4.mlp.proj.weight sens=424618080.0
+ rank 4: int6 blocks.3.mlp.proj.weight sens=362970688.0
+ rank 5: int6 blocks.5.mlp.proj.weight sens=341713632.0
+ rank 6: int6 blocks.6.mlp.proj.weight sens=117928640.0
+ rank 7: int6 blocks.7.mlp.proj.weight sens=114622184.0
+ rank 8: int6 blocks.8.mlp.proj.weight sens=54076788.0
+ rank 9: int6 blocks.0.attn.c_q.weight sens=50331344.0
+ rank 10: int6 blocks.0.attn.c_k.weight sens=50331344.0
+ rank 11: int6 blocks.0.attn.c_v.weight sens=50331344.0
+ rank 12: int6 blocks.0.mlp.fc.weight sens=50330264.0
+ rank 13: int6 blocks.9.mlp.proj.weight sens=36771292.0
+ rank 14: int6 blocks.0.attn.proj.weight sens=34800952.0
+ rank 15: int6 blocks.4.attn.proj.weight sens=25488536.0
+ rank 16: int6 blocks.1.attn.c_q.weight sens=25165308.0
+ rank 17: int6 blocks.1.attn.c_k.weight sens=25165308.0
+ rank 18: int6 blocks.1.attn.c_v.weight sens=25165308.0
+ rank 19: int6 blocks.1.mlp.fc.weight sens=25165298.0
+ rank 20: int6 blocks.4.attn.c_q.weight sens=20133218.0
+ rank 21: int6 blocks.4.attn.c_k.weight sens=20133218.0
+ rank 22: int6 blocks.4.attn.c_v.weight sens=20133218.0
+ rank 23: int6 blocks.4.mlp.fc.weight sens=20133218.0
+ rank 24: int6 blocks.1.attn.proj.weight sens=17041284.0
+ rank 25: int6 blocks.10.mlp.proj.weight sens=16860636.0
+ rank 26: int6 blocks.5.attn.c_q.weight sens=16778138.0
+ rank 27: int6 blocks.5.attn.c_k.weight sens=16778138.0
+ rank 28: int6 blocks.5.attn.c_v.weight sens=16778138.0
+ rank 29: int6 blocks.5.mlp.fc.weight sens=16778124.0
+ rank 30: int6 blocks.2.mlp.fc.weight sens=16776878.0
+ rank 31: int6 blocks.2.attn.c_q.weight sens=16776874.0
+ rank 32: int6 blocks.2.attn.c_k.weight sens=16776874.0
+ rank 33: int6 blocks.2.attn.c_v.weight sens=16776874.0
+ rank 34: int6 blocks.2.attn.proj.weight sens=15173277.0
+ rank 35: int6 blocks.5.attn.proj.weight sens=14076313.0
+ rank 36: int6 blocks.3.attn.c_q.weight sens=12582560.0
+ rank 37: int6 blocks.3.attn.c_k.weight sens=12582560.0
+ rank 38: int6 blocks.3.attn.c_v.weight sens=12582560.0
+ rank 39: int6 blocks.3.mlp.fc.weight sens=12582557.0
+ rank 40: int6 blocks.3.attn.proj.weight sens=9016357.0
+ rank 41: int6 blocks.6.mlp.fc.weight sens=7191520.5
+ rank 42: int6 blocks.6.attn.c_q.weight sens=7191517.5
+ rank 43: int6 blocks.6.attn.c_k.weight sens=7191517.5
+ rank 44: int6 blocks.6.attn.c_v.weight sens=7191517.5
+ rank 45: int6 blocks.7.mlp.fc.weight sens=6291324.0
+ rank 46: int6 blocks.7.attn.c_q.weight sens=6291322.0
+ rank 47: int6 blocks.7.attn.c_k.weight sens=6291322.0
+ rank 48: int6 blocks.7.attn.c_v.weight sens=6291322.0
+ rank 49: int6 blocks.8.mlp.fc.weight sens=5592428.0
+ rank 50: int6 blocks.8.attn.c_q.weight sens=5592426.0
+ rank 51: int6 blocks.8.attn.c_k.weight sens=5592426.0
+ rank 52: int6 blocks.8.attn.c_v.weight sens=5592426.0
+ rank 53: int6 blocks.9.mlp.fc.weight sens=5032741.5
+ rank 54: int6 blocks.9.attn.c_q.weight sens=5032697.5
+ rank 55: int6 blocks.9.attn.c_k.weight sens=5032697.5
+ rank 56: int6 blocks.9.attn.c_v.weight sens=5032697.5
+ rank 57: int6 blocks.6.attn.proj.weight sens=4776323.0
+ rank 58: int6 blocks.10.attn.c_q.weight sens=4575501.5
+ rank 59: int6 blocks.10.attn.c_k.weight sens=4575501.5
+ rank 60: INT5 blocks.10.attn.c_v.weight sens=4575501.5
+ rank 61: INT5 blocks.10.mlp.fc.weight sens=4575034.5
+ rank 62: INT5 blocks.7.attn.proj.weight sens=3710124.8
+ rank 63: INT5 blocks.8.attn.proj.weight sens=3233290.2
+ rank 64: INT5 blocks.9.attn.proj.weight sens=2739251.0
+ rank 65: INT5 blocks.10.attn.proj.weight sens=2727218.0
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (5542643712.0), least sensitive=blocks.10.attn.proj.weight (2727218.0)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 60 int6, 6 int5
+Serialized model mixed_int5_int6+brotli: 15938966 bytes
+Total submission size mixed_int5_int6+brotli: 16026102 bytes
+final_int6_roundtrip val_loss:2.55668997 val_bpb:1.11125161 eval_time:6682ms
+final_int6_sliding_window val_loss:2.51394308 val_bpb:1.09267190 eval_time:76244ms
diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed1337.log b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed1337.log
new file mode 100644
index 0000000000..e28a118e96
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed1337.log
@@ -0,0 +1,155 @@
+
+*****************************************
+Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+*****************************************
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: /home/dex/parameter-golf-with-cc/data
+ datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
+ disable_layer0_attn: False
+ distributed: True
+ ema_decay: 0.997
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_enabled: True
+ gptq_reserve_seconds: 10.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/b34b702a-fbc4-4da8-8414-f2dcd633f2db.txt
+ logit_softcap: 30.0
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mixed_quant: True
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_wd: 0.085
+ n_int6_layers: 60
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ parallel_residual: False
+ parallel_start_layer: 7
+ parallel_start_layer_is_physical: True
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ recur_layers_str: 4,5
+ recur_start_step: 3000
+ recur_warmup_steps: 20
+ repeat_untie_mlp: none
+ repeat_untie_mlp_layers:
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: b34b702a-fbc4-4da8-8414-f2dcd633f2db
+ scalar_lr: 0.02
+ seed: 1337
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
+ train_batch_tokens: 786432
+ train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ val_batch_tokens: 524288
+ val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
+ val_loss_every: 4000
+ ve_dim: 128
+ ve_enabled: True
+ ve_layers: 9,10
+ vocab_size: 4096
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 143
+val_tokens: 45514752
+model_params:34401371
+parallel_residual: active=0 start_layer=7 start_mode=physical params=0
+recurrence: layers=[4, 5] start_step=3000 active=0
+repeat_untie_mlp: mode=none layers=[] params=0
+gptq:reserving 10s, effective=590000ms
+[rank6]:[W402 12:09:30.189206185 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank4]:[W402 12:09:30.673266528 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank5]:[W402 12:09:31.732280295 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank1]:[W402 12:09:31.736911013 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank2]:[W402 12:09:31.736988905 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank3]:[W402 12:09:31.739788901 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank0]:[W402 12:09:31.774399559 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank7]:[W402 12:09:31.991820395 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+recurrence:prewarm active=1 virtual_layers:13
+recur_warmup_step: 1/20
+recur_warmup_step: 2/20
+recur_warmup_step: 3/20
+recur_warmup_step: 4/20
+recur_warmup_step: 5/20
+recur_warmup_step: 6/20
+recur_warmup_step: 10/20
+recur_warmup_step: 20/20
+0/20000 val_loss: 8.3178 val_bpb: 3.6153
+1/20000 train_loss: 8.3199 train_time: 0.0m tok/s: 8435779
+2/20000 train_loss: 12.4107 train_time: 0.0m tok/s: 8331252
+3/20000 train_loss: 10.8800 train_time: 0.0m tok/s: 8240801
+4/20000 train_loss: 9.1260 train_time: 0.0m tok/s: 8196826
+5/20000 train_loss: 7.9464 train_time: 0.0m tok/s: 8168900
+500/20000 train_loss: 3.0028 train_time: 0.8m tok/s: 7927608
+1000/20000 train_loss: 2.9402 train_time: 1.7m tok/s: 7926588
+1500/20000 train_loss: 2.9047 train_time: 2.5m tok/s: 7925482
+2000/20000 train_loss: 2.8353 train_time: 3.3m tok/s: 7923144
+2500/20000 train_loss: 2.7111 train_time: 4.1m tok/s: 7921458
+3000/20000 train_loss: 2.8184 train_time: 5.0m tok/s: 7920483
+recurrence:activated step:3000 layers:[4, 5] virtual_layers:13
+3500/20000 train_loss: 2.6956 train_time: 5.9m tok/s: 7744900
+4000/20000 train_loss: 2.6225 train_time: 6.9m tok/s: 7618742
+4000/20000 val_loss: 2.6482 val_bpb: 1.1510
+4500/20000 train_loss: 2.5770 train_time: 7.8m tok/s: 7523362
+5000/20000 train_loss: 2.6310 train_time: 8.8m tok/s: 7449341
+5500/20000 train_loss: 2.5781 train_time: 9.8m tok/s: 7389856
+5541/20000 val_loss: 2.5307 val_bpb: 1.1000
+stopping_early: wallclock_cap train_time: 590033ms step: 5541/20000
+peak memory allocated: 30215 MiB reserved: 30244 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.52821759 val_bpb:1.09887624 eval_time:1999ms
+Serialized model: 132405891 bytes
+Code size: 21084 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 60 int6 (top), 6 int5 (bottom)
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (7491677184.0), least sensitive=blocks.9.attn.proj.weight (1747386.1)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 60 int6, 6 int5
+Serialized model mixed_int5_int6+brotli: 15912373 bytes
+Total submission size mixed_int5_int6+brotli: 15933457 bytes
+final_int6_roundtrip val_loss:2.55897871 val_bpb:1.11224640 eval_time:6860ms
+final_int6_sliding_window val_loss:2.51666785 val_bpb:1.09385621 eval_time:75894ms
diff --git a/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed42.log b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed42.log
new file mode 100644
index 0000000000..e2c3362395
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-02_MuonEqR_DepthRecurrence_MixedQuant/train_seed42.log
@@ -0,0 +1,155 @@
+
+*****************************************
+Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+*****************************************
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: /home/dex/parameter-golf-with-cc/data
+ datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
+ disable_layer0_attn: False
+ distributed: True
+ ema_decay: 0.997
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_enabled: True
+ gptq_reserve_seconds: 10.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/42b54270-2370-413c-ad3f-233c9474178a.txt
+ logit_softcap: 30.0
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mixed_quant: True
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_wd: 0.085
+ n_int6_layers: 60
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ parallel_residual: False
+ parallel_start_layer: 7
+ parallel_start_layer_is_physical: True
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ recur_layers_str: 4,5
+ recur_start_step: 3000
+ recur_warmup_steps: 20
+ repeat_untie_mlp: none
+ repeat_untie_mlp_layers:
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: 42b54270-2370-413c-ad3f-233c9474178a
+ scalar_lr: 0.02
+ seed: 42
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
+ train_batch_tokens: 786432
+ train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ val_batch_tokens: 524288
+ val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
+ val_loss_every: 4000
+ ve_dim: 128
+ ve_enabled: True
+ ve_layers: 9,10
+ vocab_size: 4096
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 143
+val_tokens: 45514752
+model_params:34401371
+parallel_residual: active=0 start_layer=7 start_mode=physical params=0
+recurrence: layers=[4, 5] start_step=3000 active=0
+repeat_untie_mlp: mode=none layers=[] params=0
+gptq:reserving 10s, effective=590000ms
+[rank4]:[W402 13:49:37.980876142 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank6]:[W402 13:49:37.994701858 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank0]:[W402 13:49:37.005634883 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank7]:[W402 13:49:37.010416786 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank3]:[W402 13:49:37.020697995 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank5]:[W402 13:49:37.033144854 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank2]:[W402 13:49:37.036124576 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank1]:[W402 13:49:37.039875255 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+recurrence:prewarm active=1 virtual_layers:13
+recur_warmup_step: 1/20
+recur_warmup_step: 2/20
+recur_warmup_step: 3/20
+recur_warmup_step: 4/20
+recur_warmup_step: 5/20
+recur_warmup_step: 6/20
+recur_warmup_step: 10/20
+recur_warmup_step: 20/20
+0/20000 val_loss: 8.3184 val_bpb: 3.6155
+1/20000 train_loss: 8.3199 train_time: 0.0m tok/s: 8443183
+2/20000 train_loss: 12.3441 train_time: 0.0m tok/s: 8327946
+3/20000 train_loss: 10.8191 train_time: 0.0m tok/s: 8220563
+4/20000 train_loss: 9.0582 train_time: 0.0m tok/s: 8164109
+5/20000 train_loss: 7.8861 train_time: 0.0m tok/s: 8141998
+500/20000 train_loss: 2.9997 train_time: 0.8m tok/s: 7924595
+1000/20000 train_loss: 2.9414 train_time: 1.7m tok/s: 7917968
+1500/20000 train_loss: 2.8954 train_time: 2.5m tok/s: 7913774
+2000/20000 train_loss: 2.8276 train_time: 3.3m tok/s: 7909248
+2500/20000 train_loss: 2.7056 train_time: 4.1m tok/s: 7905286
+3000/20000 train_loss: 2.8221 train_time: 5.0m tok/s: 7903228
+recurrence:activated step:3000 layers:[4, 5] virtual_layers:13
+3500/20000 train_loss: 2.6903 train_time: 5.9m tok/s: 7727838
+4000/20000 train_loss: 2.6214 train_time: 6.9m tok/s: 7601740
+4000/20000 val_loss: 2.6446 val_bpb: 1.1495
+4500/20000 train_loss: 2.5781 train_time: 7.9m tok/s: 7506943
+5000/20000 train_loss: 2.6278 train_time: 8.8m tok/s: 7432957
+5500/20000 train_loss: 2.5757 train_time: 9.8m tok/s: 7373905
+5530/20000 val_loss: 2.5278 val_bpb: 1.0987
+stopping_early: wallclock_cap train_time: 590032ms step: 5530/20000
+peak memory allocated: 30215 MiB reserved: 30244 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.52525062 val_bpb:1.09758666 eval_time:1999ms
+Serialized model: 132405891 bytes
+Code size: 21084 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 60 int6 (top), 6 int5 (bottom)
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (6634268160.0), least sensitive=blocks.10.attn.proj.weight (1553946.4)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 60 int6, 6 int5
+Serialized model mixed_int5_int6+brotli: 15960240 bytes
+Total submission size mixed_int5_int6+brotli: 15981324 bytes
+final_int6_roundtrip val_loss:2.55516567 val_bpb:1.11058908 eval_time:6916ms
+final_int6_sliding_window val_loss:2.51278630 val_bpb:1.09216911 eval_time:76053ms