diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/README.md b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/README.md new file mode 100644 index 0000000000..63d1c0ea24 --- /dev/null +++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/README.md @@ -0,0 +1,135 @@ +## Record: MuonEq-R + Depth Recurrence + N61 Mixed Int5/Int6 GPTQ (val_bpb: 1.0924) + +**val_bpb = 1.0924** (3-seed mean, std 0.0008) | **2.5133 nats** | **~15.98 MB** | 8xH100 SXM, 590s train + ~76s eval | No TTT + +Built on [PR #1218](https://github.com/openai/parameter-golf/pull/1218) by @clarkkev (4096-Vocab + 4.0-MLP-mult + 0.085-WD). + +Previous: [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> [PR #1218](https://github.com/openai/parameter-golf/pull/1218) (1.0979) -> [PR #1260](https://github.com/openai/parameter-golf/pull/1260) (1.0929) -> this (1.0924) + +### Changes from PR #1218 + +| | PR #1218 | This | +|---|---|---| +| val_bpb | 1.09785 | **1.09241** | +| Optimizer | Muon | **MuonEq-R** (row-norm before NS5) | +| Depth recurrence | None | **Layers 4,5 repeated** (RECUR_LAYERS=4,5) | +| Recurrence MLP sharing | N/A | **Fully shared** (REPEAT_UNTIE_MLP=none) | +| Mixed quantization | No | **Yes** (61 int6 + 5 int5 via Hessian sensitivity) | +| Recurrence activation | N/A | Step 3000 with 20-step warmup | +| Everything else | Same | Same | + +### Changes from PR #1260 + +| | PR #1260 | This | +|---|---|---| +| val_bpb | 1.09290 | **1.09241** | +| N_INT6_LAYERS | 60 | **61** | +| Seeds | 1337, 42, 0 | **42, 0, 7** | +| Code size | 21,084 | 21,396 | + +Key insight: N_INT6=61 (one more int6 layer) improves BPB by ~0.001 per seed with no architecture change. The smaller minified script (21,396 bytes vs 87K standalone) leaves enough headroom under the 16 MB artifact cap for the extra int6 layer to fit. + +### What's New + +1. 
**MuonEq-R** — Row-normalizes gradient matrices before Newton-Schulz orthogonalization in the Muon optimizer. Zero-byte cost, ~0.001 BPB improvement. + +2. **Depth Recurrence** — Layers 4 and 5 are repeated once after the initial forward pass (virtual layers 12-13 on top of 11 physical layers). MLP weights are fully shared (REPEAT_UNTIE_MLP=none), adding zero extra parameters. Activated at step 3000 with 20-step warmup. ~0.003 BPB improvement. + +3. **N_INT6=61 Mixed Quantization** — Layers are ranked by Hessian sensitivity: the top 61 get int6, the bottom 5 get int5 (vs 60+6 in PR #1260). One additional int6 layer improves BPB by ~0.001 with a minimal increase in artifact size. Combined with full GPTQ and brotli-11 compression. + +### Carried from PR #1218 + +- 4096 SentencePiece BPE vocabulary +- 4.0x MLP multiplier with sigmoid-gated activation +- Weight decay 0.085 +- Full Hessian GPTQ quantization +- XSA-all-11 attention +- BigramHash embedding (2816x160) +- Sigmoid-gated skip connections + soft-round QAT +- Split-LR training +- Brotli-11 compression with byte shuffle +- EMA (decay 0.997) + +### Configuration + +```bash +NCCL_NET=Socket \ +DATA_DIR=./data \ +SEED=42 \ +MIXED_QUANT=1 \ +N_INT6_LAYERS=61 \ +RECUR_LAYERS=4,5 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, no TTT) + +### Core Results + +| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact | +|------|-------|---------|--------------|-------------|-----------------|----------| +| 42 | 5,540 | 106.5 | 1.0985 | 1.0917 | 2.51171 | 15,996,591 | +| 0 | 5,536 | 106.6 | 1.0988 | 1.0923 | 2.51309 | 15,974,481 | +| 7 | 5,538 | 106.6 | 1.0995 | 1.0932 | 2.51522 | 15,982,332 | +| **Mean** | **5,538** | **106.6** | **1.0989** | **1.0924** | **2.51334** | **15,984,468** | +
+### Supplemental Diagnostics + +| Seed | Post-EMA BPB | Roundtrip BPB | Sliding BPB | val_loss (nats) | Code size | Total submission | Train time | Eval time |
+|------|--------------|---------------|-------------|-----------------|-----------|------------------|------------|-----------| +| 42 | 1.0985 | 1.1101 | 1.0917 | 2.51171 | 21,396 | 15,996,591 | 590s | 83s | +| 0 | 1.0988 | 1.1108 | 1.0923 | 2.51309 | 21,396 | 15,974,481 | 590s | 83s | +| 7 | 1.0995 | 1.1115 | 1.0932 | 2.51522 | 21,396 | 15,982,332 | 590s | 83s | +| **Mean** | **1.0989** | **1.1108** | **1.0924** | **2.51334** | **21,396** | **15,984,468** | **590s** | **83s** | + +### Rule Compliance + +- No TTT (no test-time training or adaptation) +- No SLOT (no scored-position lookup table) +- No validation data during training +- No training data during evaluation +- Artifact < 16,000,000 bytes for ALL seeds (max: 15,996,591) +- Train < 600s on 8xH100 SXM (590s) +- Eval < 600s on 8xH100 SXM (~83s) + +### Architecture + +- 11 layers + 2 virtual (depth recurrence on layers 4,5) +- d_model = 512, MLP 4x (2048), 8 heads, 4 KV heads +- 4096 SentencePiece BPE vocabulary +- BigramHash(2816x160) token embedding +- Sigmoid-gated skip connections with soft-round QAT +- MuonEq-R optimizer with row normalization +- Full Hessian GPTQ with 61 int6 + 5 int5 layers + +### Run Command (3-seed loop) + +```bash +for SEED in 42 0 7; do + NCCL_NET=Socket \ + DATA_DIR=./data \ + SEED=$SEED \ + MIXED_QUANT=1 \ + N_INT6_LAYERS=61 \ + RECUR_LAYERS=4,5 \ + torchrun --standalone --nproc_per_node=8 train_gpt.py \ + 2>&1 | tee train_seed${SEED}.log +done +``` + +### Lineage + +PR #1019 (1.1147) -> PR #1218 (1.0979) -> PR #1260 (1.0929) -> this (1.0924) + +### Credits + +- @clarkkev for PR #1218 (4096-Vocab + high-WD architecture) +- @abaybektursun for PR #1019 (GPTQ + XSA + BigramHash baseline) +- @msisovic for PR #1204 (depth recurrence concept) +- @dexhunter for PR #1260 (MuonEq-R + recurrence + mixed quant) + +### Included Files + +- `train_gpt.py` — full training + quantization + evaluation script (21,396 bytes, self-extracting) +- `train_seed42.log`, `train_seed0.log`, 
`train_seed7.log` — all seed logs +- `submission.json` — leaderboard metadata diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/submission.json b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/submission.json new file mode 100644 index 0000000000..7eacfbec7d --- /dev/null +++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/submission.json @@ -0,0 +1,16 @@ +{ + "name": "Record: MuonEq-R + Depth Recurrence + N61 Mixed Int5/Int6 GPTQ", + "val_bpb": 1.0924, + "bytes_total": 15996591, + "blurb": "Improves PR #1260 with N_INT6=61 (one more int6 layer). MuonEq-R optimizer, depth recurrence (layers 4,5 shared MLP), 61 int6 + 5 int5 Hessian-ranked GPTQ. 3-seed mean 1.0924 BPB, all under 16MB. No TTT, no SLOT.", + "author": "dexhunter", + "github_id": "dexhunter", + "date": "2026-04-03", + "pre_quant_val_bpb": 1.0989, + "bytes_model_compressed": 15975195, + "bytes_code": 21396, + "base_pr": 1218, + "seeds": [42, 0, 7], + "seed_scores": [1.09170, 1.09230, 1.09323], + "eval_time_seconds": [83, 83, 83] +} diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_gpt.py b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_gpt.py new file mode 100644 index 0000000000..8cec22eb4b --- /dev/null +++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_gpt.py @@ -0,0 +1,209 @@ +import lzma as L,base64 as B +__wrapper_size__=21392 +exec(L.decompress(B.b85decode(";O_K5c3l83nZu%UrccSf2s5HR(X=5oFNAOTCuxCrogOEQMJ5H^qbwTDyUd%1xi3}vD%`XtQYMO3O;X5#G6g*B;Rn%i>bZ3N" +"Zj4D4jJny8zZfKU1H((!&C8|b#-hK%c`7s;1&*iCVdO;~^copz2v4&;f;Zzi4T-IWqp9GimXaHPA7@2siqG~fJ&sY6GKL%=6%W?c" +"Dzk(#$Gv|L*h;#@DHOp|8@`$XiVm}OLBFp{8><<{&rn(asrG;vUo$w!YEi>-Zml;bkDSR0*aoXbn)wqw%Uu8Vsuzl}|mqsa!Xn%aqRjf>03f#@_{GX#>QoN+9W0>R3)MhA39b|;FjGA?+?t_&es0c#T(|nSiT0aBI!=n2eFr*50Z+V" 
+"b?}-|bab}Em}MbZY$-5t(5NfQ4;Y$QG+>vs8^hSP=xLb{5W+{EIO;03T>L-~2T}DCm3>4c&zc9eUg}6o>6rZ#LL&?ZIdFdth=$&a" +"&g8;2>k0x;aT$2|-0QWq?nbZ?4w?v`kLpLhnWku}ux`=u4|5QDwLEFcOPP-f4Zc9~f8#BHk(;82q=w|`-B;p!stpvgpW5s@i>tV2" +"zqP}g@4g4rDL`#{Vx|Xo-^00G%aF1;nt8V43v8Mu>J`+G`d_B4!qfS3;o3JYRBfzMEohTK&Lc9#7=AKIm2E*(Ib>W@#lxe~=~NDW" +"w^yyTJJ{=lIc7&!8kt735U)piPAB>AbhM$zcS=wgv8BKw%0WfssV`HdS>KOHOx!FE6okMLZJN" +"U(5Z)R#^`5&)AmP;5^Ss*;-&X1_A9u5^P1!Q0I*r2(ilBWegMMTS7L|sht>XJjM`MvQXD$&Sxo%55O9c<9s^lAbtp|Saxx9;Qa<5" +"sN6zET?iqn33pUwmAZ+=1=+4vssevys7A`qG_z_uOcq!kU`@qD^rRvMY%Y@<8=0I+!mjC1<#r+ctxxA`$+NupC" +"qb+F*kKKKqlyPY|!36<2(I0e&dXSF4)yO*8y1Yz0I|)vd9An;>k%9=8$M+9tU@gz0tI@AlwbLkD>l1dh3-WSy^mlQiq}7q|Gl!r~" +"lx0iF$k2mxJeks7~C^mt<%miLMp{c4x5RFWw(Z1DClZ-d19}y(;aDe=#IrT^MY`w5}BbOL_D~KbQ" +"CqLEH%+=ITVz?y&_i?-W=jK(&z4*=#y6M0@yIYbfQwFT;QBe#Skul5AXjK>xMOlNps?u<1^I7R" +"voLs&i?i!&f4m~lbo3;Od(bef_P4-RBYEvrH8~~pyC31fY1fxsqIX2WD72J#{4Vxb`C7F8xcu1-Rpk+l<2oG=mA)YA;2(E<-#1kM" +"HCLr@;hp_ahwCAr>X(oV&xswZ#uVb~WdP?c2I7D1*Iv!ov-l+42Dy%R|~1;zT|reeWPzX>TTy}hsPEH3Fy!BPb$" +"6Q4H9gR?O~gNiVe|L!%q7)ur5M46Z%XBiP73Zns?GjQBiM3wwjIV$bF+A9$B" +"qn3%Cl{+SGR`re4)!_n@31WhHQa})0X@w5zP-l6ggE7yXX_M;jg<_&loHIxy+*V?Os~>>geHqVz_Oj%`r_a9CP_HODa=Gy&gTY8v8T;B+Ry-6aP9R?dmAZBS7Iry1Kkb%r(tl@B~|t9Tl1)ov)eS!ZUuZcC*Cc=TDM{6Bzf!IHM?6!E;aRUPPl*=$6@o`!?q}VhnCZ" +"S50?e9(h;kMk~wEV98zafwm~igTj02h+?SGzuG^%46h3K`qAS(UD4W&FGBjC2~B19Ir0WCPD$_I^k+P=Kv1F09gfL;ha+vaA&s12" +"O|Aps(VzU&HWjdcGr7_NUZ?J6M3A_js09lgG0|EsobZ&ZC(C(t2}M05(ngAbJqr;`%F}v1LTig=%^AK@~-@=zHWP8%i#5&gZ3YiVJ_4h01r" +"pALFQ)6zzYBCpqW`5QLRu8gW|^HJ)(Bf-`8zR7U-UW-xf9$jDK@i-CA3lm&b2=T>Z!@2$lwT~FT7F31S)4k4+s(@CQoj{e=x{Os|" +"i54|lBjOt;VHqh(Z$4)9c`eifAodmTOTVJQYGRhvUws7>uK|rLolpPQ&?`kALQYmg>^J)J(Qwh_g-yeZFRMd0?4GF9c;wFf%j_zT&g>DPKjan)gKMLeir7V" +"I!)^>}JvNEP+bV-`um2}V`UZLo7VPU!wpN3`YP@v_60h}&!y)0vb{80eJk$UH)Oh&FQ{%|e39fDS5|" +"xvvj;w95Z&fePuC30dQm-Qb;3P)n51q(G=IH5by3nQSFm|10(MgV^EpBm|mhu(*&^@me`R=ggf~-?3L=k+Z!NW_1pCVjG(}Z~tdv" 
+"Io#;!d(jYL9V5|hKO}_{hhlsE>`z>Y4a&SWAUrT)L_1Og!Y-46Wbp@45O`ttK8ijLmq)4D5UqnXTrZhtO?nfA&|tcU=`c9m$K" +"4xIRy$!?s05|;^$zJ-L6VxmLBke1JFjueA#yOcinrT4FQKlCAS6mS-A!LEAvuCOAstXjvzL" +"_UZ`Kx&5o99kyYf$sK%EMSUU*?k+Kv3sI5v`?V{Hj{*qtWLiOGIuNs__EdUjBO;g_9Fm3N1;K~PG`ieUR#5I>-O4vfBpJe8X)bA$" +"QVem6AZ|)uImgF0W4_M+6~I_B7BZg!Dg;>@FuBQ=?%z?;iO7OO8S~%P#hG;^xA5l7Augs5R>t#o0)0gwOA{?U@bh" +"6*_;fg9YEG^q`;1ZdLWku>j!n{=;vDe`(OV&1{&DGr9GQ4KC{I_HiM2_3d-U+JGB#*yo*yP>`0`W%*2#L*;fT&m+<14w7AUN%C*<" +"E%Y&S>ms@oJAl>qTxLtLW4_A>TJ|&" +"MwmrhAjaYK>U0MXHZ$Mis4fNU!l>)^;9{86Q*5`%!YFYmLpSBU$|>9>tNs$8`C5{Mbt@F+p>d>4jhXRrI~^U)9@!#ElZ7-ewY#5T" +"=yHGY=J2d(W6D%74>QYg-o(HMOC|>}d)3nk_2o?s@Ub`FL!YfzF;6by16+onMu}(CR>V9JZ%NB^D{mz|v(h" +"YuqQZ+qz73f9EV3xggqQsXznXRK)qOwEC4HXuSfKg%}sf`@Vvk4yc9|-HgujmQM>pm4@NRcJIi$hUbUX{HNaIbb!|r=%X`Aw%dSb" +"=3|&hA6ZP~3M}6s?J@R^^@o1y;YSB-YiYooEQj@eTp?iyRub{qHkXeLfDi!dkhGn1+BU)37A&38{UpznqvWm#C&7V#1BudPOg66s" +"@Y!Noe~HVU#tM$9x~TPY5F{zPxx{idOAw!KAr4AY0gkWM{ar0q>S+SIJ~J*kzUjsM-o<(6Vu9B5;2Fj~nit-@s6c~n1&zK8wj$D`YK;s*tJgp$~shVmXE94~Ys&0!6RzDqvyzxvq-@&I)LB" +")I%d2-pWbQuS^AMt{-L0iLb}z`9-d$5T8^c#O+2Q`pn_RiZ!cpkf3D|y`5U!>B}q2#d9!m5Y{z#r=S>pY=(!$k#CzE^Q4VC1NsFK" +"YL=_RuNQ=Fxh9Ti4)<7RYEI+y3r_6t!zGyW2Tk1II#SqMi`w*>xQ!$>l~Ph-a#d2m+B+zWDrPJo>Nz" +"aPa}lZg84nGhf4#_P0C2lX}ZU2m4J2lZGxoWVhXC@DM%Jg~bbiCbE33BzY}nzEo9mA*CI>DefN+dmBd!V{8~JLcpfxKlXNhs31Tn" +"klp?F" +"F)@tyW8{Yg-z*K4YE1VV7Vii3nQyKKvALBQ*+!$hk33^Z0^{)cspbL8l@I}dTccN(0#?IW-*_H51)(f=%uDfDB-~Au@;ogY0;l{>`6jAyuq-t5{jYL#FHoj8DKTG75}B5^mGi`AB~Ti=WVybtFa$|yw_f@qGZgK5lm|=E)OS2nSQ_-_y&sqFiER_2F5Md" +"V*5MRg;m)utZk}KPt9VHHE_oj%4ZTVqb(#Hpx@eRwM~_$`;M^zG@at$)w=9Ms4PkJD1x2n9#!L1_Y&6FVwxw8T$r{SD^77pf02YDk90Be`5i=EIbXS#M$=LR<6LqA?&`rwU@seCwEuYDS)iC#bDOTjZBUV;?#lm+}+@|Cj`2nZp~45(SD!^n5n" +"-(#gizu`=~Ii@0oO!fOM{)E^qUvpVwQDLNU9{%bkMSwTRUQZJdK)F?U_)<7mOtdl8tniCN4It}_uNWI+Aj$+Ob~9T" +"OSnU4+4KqhAM=H_7Ci5Y$I=xgNW*VZev#0m*tHFPx36Grg%*6iAy66Or-pXcSfMELdMC6HAu}!MfzuSNQ&Vu^(YKvOsfwaHF*xpFRd3(yaw!P*bSP~bh+ZmBiL5mLbwgQjM44D0mBheBO" 
+"n#S`TFT%PgDHba1mf36sj92rDqPPgK)HDCJAj&4zKq?#k$>NjthxK<(?3wM2!fj-!`K|!bD@mZ?qTu{{>!`dytXJf&@{<&3%P64P" +"?4s#pnTExxZ~N;Zj(C;zDG0@wl-Da)$;Ada58(LBI-GDjAUSR0TOD6>T_rch@_NBw" +"`j" +";oLWQsx+d=MZ#Fi89BU}dy*oOpP7x6O>l5hgwTi6Bk|ux56_xZqb=_js;nd>*k}H%Vr#T6@{w}*%tQkfeKyPg&GeJrW(5L!sbkH{" +"xTaX=k-3G~8s#G$D$NqFJV&zOxCyM%FG@ogMBNE{S8)%XRSZY94E@0P{6uO$Sbq&wlE!Lz3xM4pb}5J*f&IwU4Ngdir8io7zGCnySzP7zr>vj8+I=bQ?V>)EN*u~%<>Fg^g+k_~}Y2u1Dt>&v!H#Vq)hQ@5|p&4!A))sBrFX5;" +";8ETQCggpFsnHl()c$JWsLAHQTG%u^LQl0I^4P)5bx7K|kyb^z98_$deq-SXvRKy" +"s~CTlI=Jy@u@-`!dcG{xCWu1=%YdwN2rk_^efPJ*U4bQ9zSbI^<~Qg(T5W9j@?``(!E#N%J)@VS47BLQ7Iw?@X~*)kQ=`k{z(3gV(Y#" +"B*rt3%B(0~ztLEDI(mFSRrKn+xhD)jyduMUnCuP=QyOfj_SEe-le854p7|3Jyt?3g_DJ_X8GV0(Ur-wI`?nTAyVjk2&<`KH$GfzG" +"b|YbHRFxBsz=To$k+y8k{gH{jg#Bj}D#QhlPWiU*m{6Xj*q_L(YPt0D5fuX7;beS&hST`P)6Igxa<@=x*_{HoPc*oq^g7kEde`tA" +"dkP+q@_vD%xfzypr4OJmW@|P~WLHdy>&DLQKI$fB7`eV7$`>8iFJ$YrrMnBQem~FG(5)MkSHfEM" +"EGrFOmJZi+@{nMNPmww)njDk^s3^$}=aD|@U|Dvhf3E+zj6^umeAP!6_nQ@ONmSX#fC2>apgd^VyHTCoHghd^9D-$#Zm^#rz85it" +"B(Vye`%_GUHxZ_aI7AEy8}q}@1@?gfu=RFknJb`!nGC^SsAY)Zv70Tgk_LY?TX<)N+0>UhFHU$b#Np&H$" +"Tk@+uA)K1rio6=N9GF6w|e#xJV42j98vE$+bz5&0QOVWS;KI%w8Z>E1DkLRs!>{|zlS~O~3#MU#zfh##5gtxi3N(G_RTewU=#A;-53#o;#" +"Aw5Ac0lk(j(IRx?g!yT^dGe=q*5>4p6sD2nwQTei)2T>53I&lvz6#oc86FG0aaiZVOLsbhy{;IbTLS2!nY*`ImNfIU^_LNsKP;u$" +"sP`}ntrL_VL#}5BjqeK8R*Sv?f2I" +">#H(Mgn(Ft&UK_a_te-*rF69uXa%?|O}BJsczSp(nKDY5KcWq~9xna+npByh7yo`6t+7`!pWZB0i|>T=JVL*MrydDDQzT-nhV;iV" +"p;lJ~6J>#0Z5*-jee^nJ!*2fubmE*}Hr8Kep3V1+#`~YCv2(ET4f>`l$$rXSE@jP+k_a>Wm0uR)O%dkn+$7Q%?LzJ2Rq1n(+OK0R" +"_fq;3pYLB)prs7Ntc6QBAFk_KC2f#2acmg_AlR$Mi+l_u!>+&!H(X1w!n3~G4(wg^EG+MUp@#ma)Wz=|J~iky)!qTaJUs=R4oodg" +"3-$)6V!lLz%Zeq?e?#>$SvwaOjd}uMY^2omT+pS@LRn3Vn$Kcc=" +"{^?L{epLa5TbRsDK_z9NSM{uw%%+gurmEq(SSw=1&JagZ!=Y;-QTFzlTE*Dth|NoAIxuPGm6^^%Z~(v@axl" +"Te*2`bzLCa80*79X$yWVL*E_el!Nl1NUjDXLBhb4^UJ6X4JV-1f3mAZVpDWhjPKik_eP}CYHhp+;gCnW@?s8MzwKqB@ufP`VhmFo" +"rZp*i*z}C=KEs!ytl~aBj;)VG@Gg6a29qM|~qzu_O|Z" 
+"hpDm!Z+5^WJ|>*OTrLxwS34g#rQc7!5OJvKqMO*GLn(++tzEQ6;}d`&sBk%?SjNGvHjJS7;#NJ?M_1(|YS1ZC(DJU(kD1?eXiw_~" +"tl#}no7QM2m<$Ug5q{F!k^!nWe(%DTE0HMYZ!IvM0RYy01mtefhX}bU6*5uAT0Siebs|&oVj=1@mQy6VLB{#Nq%1le5=-}%rQ4;WcO>y%i%U=e|RF&U%_N^!wT?ka{+!)GFT~~u-M0IaW^JNvJ}IUh4*%*gG$VDoF5_D+o#o?&X0|P|ESDS=xmMJFlC_W" +"nFaFE^<>~VokFQY8Df0`9gkkIj5W$>0R7MG5V@*}{hRT{Yup37c?wWB85XpIADNtNK}rvF~*S6Gk$9@u9J9`K-" +"pzEK2$S{D<1!El8z?DD=(Hwls?%(x0u5oRQn-;o(Uf%F!JaFQU!#?oCa<&aB;-lj9@5T$|*lNI3G@Y3igNO(X1=GXo4tPVRPZ(x&" +"F5~x@WW+nBH$m#lg($UmEz;<;=xyTT5Y`HHw&x4#CS-F|re5Gs#x6`H0qEax67!t=dF~E`k)H)Py&2v1JM&^>AQ@&+$b6*j(6AEA" +"3SW(f)j7I(F;-!L{cE>;9vIrZjTMI-IR6_DhGP7K)Si<>Y=IFd{5H4&7$nsVc-eUZ-fx?YIT81$41PBaM4vB83MyX#@szmx4RUJt" +"1Z8HJ+b~x*9n>SVYCe_I2~I2yVP6fcI4wJ7n{ky%{rXo9_AV4u76V`NW8S!#;#y8>5FP8SK9zUh#QD7E2G>ecd+N-y2rr<&Dcv!Z" +"eEegEYtz+nSF4!xvp?zlA0!o){okn3KT-1sfN^^wxp1tJMtB>Q)0cIF`(&G;H^&7kjp6~9aCb$;=FwK^5AuGU_%!`H1>z*GzdLaMuo2|U{+4*@" +"!~KsHu@aH9;K2D9?14G+iOT$?p*I%!!l9WCzgf&R;%Ml2cy5oI=7sM%2Ju" +"mHj|ZJUT**eR~=Am42u=g{tbrpdK2KY6J`+ZSP~A9fgYS;ZMid`9NXaf|HuO^r`24YQdU3+w#HZchy<>w>J7N%n0X0$t35HU&PBR" +"TjUQ&pu#c!J9(S$T5L{;fH-M1{`Q{j2mlQ>hy?=S8DsU=AJI=kHrgCe8}Ma=E113BP>V&ZR)>x-1tF^Wj3h~7n1vKKD" +"gvtyb*Cj$=k!hO9e7Liy_KuEY&@$bd8szW~s4tnSAke|&R4|2583hPS7N3+f$4N5UngsyaE3ofF>mi@rCbI4nE6fP*Khfx(PQ37b" +"Z76Tg)PtnX2-sz;WFDSB|Hw2i0o+;WPDH{e#^T*SDoslTLJEEV=c3XX=0H0F9!pimH@$=0WJ7l|zC*8eZYSJeJRSc_EdB`$wX$ez6>2" +"3;Nc*`Lhk)L7S0O(ajiMip@gh0K!Iq6||{Wngo$Hr*Aj%`Ls*jz=>;n809GSTvC-FS>L;S5&A0zg`2G3Z*hY?oy1Nli@P^{%b9J;" +"fI;M*=((z}O-@G4tmTVr9GA==7B`n;z15Kr|M&C=2|a;gPK(RFWTIcb-bEl=KOuL^9vL>*IRh?B>}f2@X!7k0IWUGD0gx)oi`Ch!" 
+"U0))P$~G4tLOSy`b)Ncc(Y#Z" +"Ps8IAS!?D1l*hN&K2@17vFAPEuv0A867K85l}~Gr6qDDNL+f>8Hv*%TB>d;(y$ik_mX2pBP)-dL>I-1>&$e^xv_Q(Ao62zRmTQyd" +"foU!*a=jaZZnJ**z)Jw6z@1iG2_?%dylTkfA2zB4-YcuCYhty!mt4`" +"COB=@^ZgI;0LKf-Y{L!=$|~d}Ld9}a)qdimA0fFr?(J*iyusv{ubxlMNR<&sBR!t-B^f>0iKoyvyTXu4^meZSUeD#HC;M!y5`*{O8z|5cccBrvol9eErrA(wwgY(2MDFl=mK<{k|{uvE81hx_)bcKJOODM;vK*q007y9lx!>k{`<6s(4*;*;YVuTZ1s@cdD%39pUrat4*AP)W6GH=ZN^)LD|4s~Jx<0wk<&FU4q*Mz$" +"GbwNByOb+>tWjHK=n3`1fCfbc31a0_VC7Tm47L2!4Sy^NGPS-shE;NctP=K4sE56KOjE|EmnIaUZkGJL!BC8J;|Bg{&3Y>5){iCMCzYd6Gp" +"c`*G5LL;9V|1|!-1oq}Sr*fc3#6l^{9mUxi81>}r=qhS{cdcIGDu_m13Q0Szc&Rp68lil*5pDpg=3jw@v)COlqeZx$R#C" +")nWc2E0Lb4j1ujWI68R0_eU7n?qN)&bu_eF@a`ulQY3K*;#el1NOMI~4sX-4ds-_v{Ds44&>pk)l46rqsnj{J*#m!ZD3LBsPot(~" +"RdlE&+G&g{@vXuRp>26fJj}Pxq1^zQf*uO2=WHr~&@ZT7!G1`TYejJ6R9jIeEESm?Djogau|huZtvo" +"Gg2WAAHt7KOrtV@(AserN`ZrH@pn3%%e)cPLaVOl6rYW-fnt3q;5U!g>zWyNM%})JqTwVh4!A@74o2sWLemPTcNaysR;P-IKxCA$5phU;R=r#=|`BjR^mK5xOk&6lh<+Y$6ArC%0r?QE@Ndws|tlP##_67" +"%bVcW`0jUd)a%He8KcXQ?LpJ${|be$(5r@&TzXfGfOM#LtN4n4bwf(DL5#RYm$Xewc0KJ`eZQFNSXw*2ZCbg)#$04(4~J@11A;lU" +"$0t>8{_%hprNpL0pdD2Z)&-vR8J_UPbY9spG`*^d4e2w5-E28jG3Sf)vx2nzMNBL_lLqN$vHg8QO&T0G4uAl$HCXt9Ql=7C!$%fD" +"tZcbNX1l?%g6w2l=U93Ja@=I!%7;<7E^WrJsqS0}(0j3!`0z$efKmC5z1yVWaPRbEXQ2NBB7Vo}W~TBD?BfrSXZ%y5y{(p@?*L?X" +"MjW+2A6zcs+l_14hUaK3k}lxf_2xlga=yv1!{}3tNvTLD" +"?((N@AUSKUm|N}1J0kwyB}p^P8fKK39bfvpMSVCYSJO+6IZlS_WuD1f3PJWA7>_)TWON*NoB>5>ju$ru%GpgQ3prKml8ubh8X(`^_~R(+Qv|u?Bd4FjCQCo#3y4n+meMC{{EOT>g+rJeI@|)TUFLoX+;geD<-$X" +"!wl;V*DEtQ_aws&rY(cxh=H>JUmK-Qh=&8Cji6Oo+xb5L!~TVmCk|)rV1EQRUW7O`%UJ3gOCKUEJpV#5UH" +"e5_4RwU3QPp2e_s`1LdGlyc!+)`ig4t95j_6Iw$4Mp-Blb3o4u00P32r;EkT4U~N_x^m%}tV~W%!BGGJVg6IXYEHjM7~kr>cqwtE" +"I$_Tu$TU*-LD3e`Egi_H9}A|!$1Uj9i1dl36*ZRAJ`I0mnOOeli7}0;N4dW1ZnOX*++E6zte_JaG4*l%pknp~Lo4Kyf>+<^77l6j" +"f6MKiq+99y3mEN8RFXoeQopXkDBicXbS)n{ufyb$$#hx=32>02jyDM<)261wgV-NN`Q8Kz{r#j8>YGJ0#7KgrRm?LX>LHcC9D#_WQO$(9jPj1-Oi&xXke2ylqKo2I8XFEG^OolexNCE^WC-ss>){Wn*6m_>v!`p9^;&19ZYyGa(zHH2" 
+"MqNiWMy$9^wn&}Yjn)UOm" +"`k*URuw8HkT1QuGAw{uM6z9>gE8LGD#6S%4k*#js0|!`I|I0t^v<)V3;)_xG@=NwfXuX(+KpNm5-ub1;qUp!oOK" +"-JnRK_lgl@!Mawf>4BdA&Sv{WORlSSkC@>Sv6S87AtI-rAqKa|v!4{}@tBdOT-5X5Cq~Dqe6r`|#Co_ArX7!a!FbVP>6{gv" +"!y)d&X8cc{U@59s%!l#0C;|-GX?;nIu%qV}*Iv(tuD2L?0*RDjS{!>chFupSP02nDLs7$+br+SGE?k-FWtf`B;NE6pOnZoP3`)wG" +"v_w2uC(q9cy!Qie$oDCiomgG@d(IfN!QCc0QEjW2-rvDz2)cYmJGs?))C=O!`aGEl=f~@RM8_?3t8Raezobat~J$rs%WQIIA4m|3NlgtAxVbE@G9" +"jDNr0f{RiN`bN#D8wx`uq^tF!(a!tN5aG54>9{gPXqL#>zWUGkau`5s_WgMysup+EJwO;lnr@t))%PjvPHbgm&n+#yi7XBk5lAxM" +"bupU)S&j)%x3Qs+VyFthr2+_suX92#wt`|2N@mr5=kcM&+MpATAlM5W)YMus4GM6&(Ocv2ysL6>><@jw{or@^FP20CFy2!9-&51N" +"Nd#k247{j>QOtiAI7m#{t`OG0I=uOgJ!B9D|4;F+W|XD$Y;jA)j-u{RJI;FYEg=$m&xvc" +"VFy!MmZ+@}+`zHUc%2Rvo+&`&4MIQ4Sdi%sn14NsO?~9YZ2C820%Q=g&#ZpSWVmIn!hXE" +"VviJ@QX1=)hoyJyNTRV^J97gzF`m6LLV->l!D>~W7=20#fC1U|WWQm$cMUkU8LxScK#szXJ|+o_0AqUgdrNRAu}^032M1" +"2DKd^Cv|=6mKpSlkKnQi4eokEW{k^Y80PN;>hy|dL{^tO=h0<=!!(LcbnG2kPU$#(BTH=DJ!d?|cI)TTfhV$dFM%;Uq085NwQI@W" +"gLVsN@m^fq66+6>UweRUn(K!B9X*H_6WYRhTd3nwHuR5-MsD?Qvsq8!H#iLrz;4=%V5bh$jrT" +"JO7aR$7Hy=HqA)Gr9K+a?hkSXuP@FH+d6sLHKrf=fL@T2K@``3)5c*4$1Kj0&iVyLuzzH#lF|b9fMh6I{Zu8gJXSosh6%{q$~s1e" +"*8_&ZQPpWGc2`^juu^7xk#G&3*4)CcoTQgX$^bQQQ$vogowx~cHy=yD158S2DbdfmeDK5>H6AS@Y=_Hnq+MA{}$nUp%Om58|GyQ_wzGldRUn?CsetdwUO5U1`(?v" +"4sOMjh||)CL#_y*1z0HK1}fxmduon{ERl$#>~f26xJ?2?%+XK^{pW?Yg{c-zF3iPs)XzT09c4SND=bIQsq<4-(9(@HZSTx=ZwW-Q" +"pMSTvHPe<^$fBnL8>a~ida&5HI~PdxV;EnY&C>xxZ=`9Jj^jf#5>{rweQigs+ukieTI!+kNK(Tu`@gbWUH$?y)~*9PkE?|hEo!B_" +"^8EARH_K<ll)>faeCU|)H1#Q#&rt-GL?ar$1o%_x" +"4^L*I@43l^M82r5l8w50>ng~rm$Fj^4F(aF*R($YoXA(`h_H;4!k33qGGHPYfr|8;o8K_D!9P0VJzpyupu9ygLBOeEhsvcdMciRT" +"Zyha?_vdBLoK^X^9}Y#B(K62a^(hM{fAcBFGr6qrxMsi+rZ}}c3W-Y2h>X;o~y&FfJ}$fsMiSiVB6n|?dUC+l7u>IxYf}<" +"(SR|jOd8d9@Y^-g@Fg8rMM0=nq3)rVcEu(C=hp=" +"VW(=m`dm$Tffr#jG^keMtF13orb&lu;$AU4Urt`pWn+hEhsbGGe9YGgJBHi)buZ`RJ~=F|9iFW4+J+JmHMF4$VcndZ!*4d^E)AfO" +"-A(WoAlUx-Jau^K?wFFEN+0S=^Mf-4;yk_}*8l"),format=L.FORMAT_RAW,filters=[{"id":L.FILTER_LZMA2}])) 
diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed0.log b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed0.log new file mode 100644 index 0000000000..e93b6a5019 --- /dev/null +++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed0.log @@ -0,0 +1,221 @@ + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +***************************************** +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: /home/dex/parameter-golf-with-cc/data + datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096 + disable_layer0_attn: False + distributed: True + ema_decay: 0.997 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_enabled: True + gptq_reserve_seconds: 10.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/15c783e2-1a72-4eb7-aba9-e2c9664fad3b.txt + logit_softcap: 30.0 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mixed_quant: True + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_wd: 0.085 + n_int6_layers: 61 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + parallel_residual: False + parallel_start_layer: 7 + parallel_start_layer_is_physical: True + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + recur_layers_str: 4,5 + recur_start_step: 3000 + recur_warmup_steps: 20 + repeat_untie_mlp: none + 
repeat_untie_mlp_layers: + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 15c783e2-1a72-4eb7-aba9-e2c9664fad3b + scalar_lr: 0.02 + seed: 0 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model + train_batch_tokens: 786432 + train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin + val_loss_every: 4000 + ve_dim: 128 + ve_enabled: True + ve_layers: 9,10 + vocab_size: 4096 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +train_shards: 143 +val_tokens: 45514752 +model_params:34401371 +parallel_residual: active=0 start_layer=7 start_mode=physical params=0 +recurrence: layers=[4, 5] start_step=3000 active=0 +repeat_untie_mlp: mode=none layers=[] params=0 +gptq:reserving 10s, effective=590000ms +[rank0]:[W403 01:19:06.145159734 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank4]:[W403 01:19:06.412397291 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. 
If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank6]:[W403 01:19:06.603538761 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank2]:[W403 01:19:06.634596706 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank3]:[W403 01:19:06.635656709 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) +[rank5]:[W403 01:19:06.707228027 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank1]:[W403 01:19:07.713477697 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) +[rank7]:[W403 01:19:07.720692185 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. 
(function operator()) +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +recurrence:prewarm active=1 virtual_layers:13 +recur_warmup_step: 1/20 +recur_warmup_step: 2/20 +recur_warmup_step: 3/20 +recur_warmup_step: 4/20 +recur_warmup_step: 5/20 +recur_warmup_step: 6/20 +recur_warmup_step: 10/20 +recur_warmup_step: 20/20 +0/20000 val_loss: 8.3145 val_bpb: 3.6139 +1/20000 train_loss: 8.3158 train_time: 0.0m tok/s: 8415050 +2/20000 train_loss: 12.3152 train_time: 0.0m tok/s: 8340228 +3/20000 train_loss: 10.7849 train_time: 0.0m tok/s: 8240668 +4/20000 train_loss: 9.0455 train_time: 0.0m tok/s: 8183188 +5/20000 train_loss: 7.8696 train_time: 0.0m tok/s: 8157335 +500/20000 train_loss: 3.0033 train_time: 0.8m tok/s: 7924771 +1000/20000 train_loss: 2.9385 train_time: 1.7m tok/s: 7922089 +1500/20000 train_loss: 2.8979 train_time: 2.5m tok/s: 7920802 +2000/20000 train_loss: 2.8279 train_time: 3.3m tok/s: 7917409 +2500/20000 train_loss: 2.7097 train_time: 4.1m tok/s: 7914943 +3000/20000 train_loss: 2.8225 train_time: 5.0m tok/s: 7912430 +recurrence:activated step:3000 layers:[4, 5] virtual_layers:13 +3500/20000 train_loss: 2.6964 train_time: 5.9m tok/s: 7737358 +4000/20000 train_loss: 2.6205 train_time: 6.9m tok/s: 7611225 +4000/20000 val_loss: 2.6457 val_bpb: 1.1499 +4500/20000 train_loss: 2.5778 train_time: 7.8m tok/s: 7515618 +5000/20000 train_loss: 2.6293 train_time: 8.8m tok/s: 7441273 +5500/20000 train_loss: 2.5730 train_time: 9.8m tok/s: 7381627 +5536/20000 val_loss: 2.5281 val_bpb: 1.0988 +stopping_early: wallclock_cap train_time: 590116ms step: 5536/20000 +peak memory allocated: 30215 MiB reserved: 30244 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.52554844 val_bpb:1.09771611 eval_time:1999ms +Serialized model: 132405891 bytes +Code size: 21396 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 61 int6 (top), 5 int5 (bottom)
+ rank 0: int6 blocks.0.mlp.proj.weight sens=5714225664.0 numel=1048576
+ rank 1: int6 blocks.1.mlp.proj.weight sens=1516211968.0 numel=1048576
+ rank 2: int6 blocks.2.mlp.proj.weight sens=582575808.0 numel=1048576
+ rank 3: int6 blocks.4.mlp.proj.weight sens=447541504.0 numel=1048576
+ rank 4: int6 blocks.3.mlp.proj.weight sens=369834464.0 numel=1048576
+ rank 5: int6 blocks.5.mlp.proj.weight sens=341793024.0 numel=1048576
+ rank 6: int6 blocks.6.mlp.proj.weight sens=118435464.0 numel=1048576
+ rank 7: int6 blocks.7.mlp.proj.weight sens=114313272.0 numel=1048576
+ rank 8: int6 blocks.8.mlp.proj.weight sens=54047244.0 numel=1048576
+ rank 9: int6 blocks.0.mlp.fc.weight sens=50330268.0 numel=1048576
+ rank 10: int6 blocks.0.attn.c_q.weight sens=50329780.0 numel=262144
+ rank 11: int6 blocks.0.attn.c_k.weight sens=50329780.0 numel=131072
+ rank 12: int6 blocks.0.attn.c_v.weight sens=50329780.0 numel=131072
+ rank 13: int6 blocks.0.attn.proj.weight sens=36198472.0 numel=262144
+ rank 14: int6 blocks.9.mlp.proj.weight sens=35929068.0 numel=1048576
+ rank 15: int6 blocks.1.mlp.fc.weight sens=25165284.0 numel=1048576
+ rank 16: int6 blocks.1.attn.c_q.weight sens=25165272.0 numel=262144
+ rank 17: int6 blocks.1.attn.c_k.weight sens=25165272.0 numel=131072
+ rank 18: int6 blocks.1.attn.c_v.weight sens=25165272.0 numel=131072
+ rank 19: int6 blocks.4.attn.proj.weight sens=24464428.0 numel=262144
+ rank 20: int6 blocks.4.attn.c_q.weight sens=20133230.0 numel=262144
+ rank 21: int6 blocks.4.attn.c_k.weight sens=20133230.0 numel=131072
+ rank 22: int6 blocks.4.attn.c_v.weight sens=20133230.0 numel=131072
+ rank 23: int6 blocks.4.mlp.fc.weight sens=20133216.0 numel=1048576
+ rank 24: int6 blocks.1.attn.proj.weight sens=17484898.0 numel=262144
+ rank 25: int6 blocks.10.mlp.proj.weight sens=16837256.0 numel=1048576
+ rank 26: int6 blocks.5.attn.c_q.weight sens=16778134.0 numel=262144
+ rank 27: int6 blocks.5.attn.c_k.weight sens=16778134.0 numel=131072
+ rank 28: int6 blocks.5.attn.c_v.weight sens=16778134.0 numel=131072
+ rank 29: int6 blocks.5.mlp.fc.weight sens=16778132.0 numel=1048576
+ rank 30: int6 blocks.2.mlp.fc.weight sens=16776892.0 numel=1048576
+ rank 31: int6 blocks.2.attn.c_q.weight sens=16776867.0 numel=262144
+ rank 32: int6 blocks.2.attn.c_k.weight sens=16776867.0 numel=131072
+ rank 33: int6 blocks.2.attn.c_v.weight sens=16776867.0 numel=131072
+ rank 34: int6 blocks.2.attn.proj.weight sens=15114738.0 numel=262144
+ rank 35: int6 blocks.5.attn.proj.weight sens=14728838.0 numel=262144
+ rank 36: int6 blocks.3.attn.c_q.weight sens=12582560.0 numel=262144
+ rank 37: int6 blocks.3.attn.c_k.weight sens=12582560.0 numel=131072
+ rank 38: int6 blocks.3.attn.c_v.weight sens=12582560.0 numel=131072
+ rank 39: int6 blocks.3.mlp.fc.weight sens=12582558.0 numel=1048576
+ rank 40: int6 blocks.3.attn.proj.weight sens=9479571.0 numel=262144
+ rank 41: int6 blocks.6.attn.c_q.weight sens=7191537.0 numel=262144
+ rank 42: int6 blocks.6.attn.c_k.weight sens=7191537.0 numel=131072
+ rank 43: int6 blocks.6.attn.c_v.weight sens=7191537.0 numel=131072
+ rank 44: int6 blocks.6.mlp.fc.weight sens=7191531.0 numel=1048576
+ rank 45: int6 blocks.7.attn.c_q.weight sens=6291326.0 numel=262144
+ rank 46: int6 blocks.7.attn.c_k.weight sens=6291326.0 numel=131072
+ rank 47: int6 blocks.7.attn.c_v.weight sens=6291326.0 numel=131072
+ rank 48: int6 blocks.7.mlp.fc.weight sens=6291326.0 numel=1048576
+ rank 49: int6 blocks.8.mlp.fc.weight sens=5592428.5 numel=1048576
+ rank 50: int6 blocks.8.attn.c_q.weight sens=5592427.5 numel=262144
+ rank 51: int6 blocks.8.attn.c_k.weight sens=5592427.5 numel=131072
+ rank 52: int6 blocks.8.attn.c_v.weight sens=5592427.5 numel=131072
+ rank 53: int6 blocks.9.mlp.fc.weight sens=5032749.0 numel=1048576
+ rank 54: int6 blocks.9.attn.c_q.weight sens=5032695.5 numel=262144
+ rank 55: int6 blocks.9.attn.c_k.weight sens=5032695.5 numel=131072
+ rank 56: int6 blocks.9.attn.c_v.weight sens=5032695.5 numel=131072
+ rank 57: int6 blocks.6.attn.proj.weight sens=4647180.0 numel=262144
+ rank 58: int6 blocks.10.attn.c_q.weight sens=4575346.5 numel=262144
+ rank 59: int6 blocks.10.attn.c_k.weight sens=4575346.5 numel=131072
+ rank 60: int6 blocks.10.attn.c_v.weight sens=4575346.5 numel=131072
+ rank 61: INT5 blocks.10.mlp.fc.weight sens=4574933.0 numel=1048576
+ rank 62: INT5 blocks.7.attn.proj.weight sens=3612980.0 numel=262144
+ rank 63: INT5 blocks.8.attn.proj.weight sens=3078859.2 numel=262144
+ rank 64: INT5 blocks.9.attn.proj.weight sens=2743971.0 numel=262144
+ rank 65: INT5 blocks.10.attn.proj.weight sens=2626454.8 numel=262144
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (5714225664.0), least sensitive=blocks.10.attn.proj.weight (2626454.8)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 61 int6, 5 int5
+Serialized model mixed_int5_int6+brotli: 15953085 bytes
+Total submission size mixed_int5_int6+brotli: 15974481 bytes
+final_int6_roundtrip val_loss:2.55573548 val_bpb:1.11083675 eval_time:6861ms
+final_int6_sliding_window val_loss:2.51309363 val_bpb:1.09230269 eval_time:76015ms
diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed42.log b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed42.log
new file mode 100644
index 0000000000..33d3f91c77
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed42.log
@@ -0,0 +1,221 @@
+
+*****************************************
+Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+*****************************************
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: /home/dex/parameter-golf-with-cc/data
+ datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
+ disable_layer0_attn: False
+ distributed: True
+ ema_decay: 0.997
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_enabled: True
+ gptq_reserve_seconds: 10.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/f383aa5a-3618-4ef2-992b-ea3b9ebde090.txt
+ logit_softcap: 30.0
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mixed_quant: True
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_wd: 0.085
+ n_int6_layers: 61
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ parallel_residual: False
+ parallel_start_layer: 7
+ parallel_start_layer_is_physical: True
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ recur_layers_str: 4,5
+ recur_start_step: 3000
+ recur_warmup_steps: 20
+ repeat_untie_mlp: none
+ repeat_untie_mlp_layers:
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: f383aa5a-3618-4ef2-992b-ea3b9ebde090
+ scalar_lr: 0.02
+ seed: 42
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
+ train_batch_tokens: 786432
+ train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ val_batch_tokens: 524288
+ val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
+ val_loss_every: 4000
+ ve_dim: 128
+ ve_enabled: True
+ ve_layers: 9,10
+ vocab_size: 4096
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 143
+val_tokens: 45514752
+model_params:34401371
+parallel_residual: active=0 start_layer=7 start_mode=physical params=0
+recurrence: layers=[4, 5] start_step=3000 active=0
+repeat_untie_mlp: mode=none layers=[] params=0
+gptq:reserving 10s, effective=590000ms
+[rank1]:[W403 01:03:36.890177871 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank5]:[W403 01:03:36.260981315 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank6]:[W403 01:03:36.265402878 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank7]:[W403 01:03:36.343299932 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank2]:[W403 01:03:36.358422482 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank4]:[W403 01:03:36.360322568 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank3]:[W403 01:03:36.363179784 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank0]:[W403 01:03:37.755336442 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+recurrence:prewarm active=1 virtual_layers:13
+recur_warmup_step: 1/20
+recur_warmup_step: 2/20
+recur_warmup_step: 3/20
+recur_warmup_step: 4/20
+recur_warmup_step: 5/20
+recur_warmup_step: 6/20
+recur_warmup_step: 10/20
+recur_warmup_step: 20/20
+0/20000 val_loss: 8.3184 val_bpb: 3.6155
+1/20000 train_loss: 8.3199 train_time: 0.0m tok/s: 8409840
+2/20000 train_loss: 12.3441 train_time: 0.0m tok/s: 8343525
+3/20000 train_loss: 10.8191 train_time: 0.0m tok/s: 8237190
+4/20000 train_loss: 9.0581 train_time: 0.0m tok/s: 8185480
+5/20000 train_loss: 7.8861 train_time: 0.0m tok/s: 8155727
+500/20000 train_loss: 2.9996 train_time: 0.8m tok/s: 7931805
+1000/20000 train_loss: 2.9390 train_time: 1.7m tok/s: 7930790
+1500/20000 train_loss: 2.8983 train_time: 2.5m tok/s: 7928987
+2000/20000 train_loss: 2.8294 train_time: 3.3m tok/s: 7926592
+2500/20000 train_loss: 2.7068 train_time: 4.1m tok/s: 7924078
+3000/20000 train_loss: 2.8195 train_time: 5.0m tok/s: 7922624
+recurrence:activated step:3000 layers:[4, 5] virtual_layers:13
+3500/20000 train_loss: 2.6923 train_time: 5.9m tok/s: 7746685
+4000/20000 train_loss: 2.6212 train_time: 6.9m tok/s: 7619819
+4000/20000 val_loss: 2.6450 val_bpb: 1.1496
+4500/20000 train_loss: 2.5778 train_time: 7.8m tok/s: 7523753
+5000/20000 train_loss: 2.6262 train_time: 8.8m tok/s: 7448744
+5500/20000 train_loss: 2.5761 train_time: 9.8m tok/s: 7388430
+5540/20000 val_loss: 2.5273 val_bpb: 1.0985
+stopping_early: wallclock_cap train_time: 590037ms step: 5540/20000
+peak memory allocated: 30215 MiB reserved: 30244 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.52474965 val_bpb:1.09736892 eval_time:1999ms
+Serialized model: 132405891 bytes
+Code size: 21396 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 61 int6 (top), 5 int5 (bottom)
+ rank 0: int6 blocks.0.mlp.proj.weight sens=6560197120.0 numel=1048576
+ rank 1: int6 blocks.1.mlp.proj.weight sens=1157604480.0 numel=1048576
+ rank 2: int6 blocks.2.mlp.proj.weight sens=535741248.0 numel=1048576
+ rank 3: int6 blocks.4.mlp.proj.weight sens=421712640.0 numel=1048576
+ rank 4: int6 blocks.3.mlp.proj.weight sens=368326176.0 numel=1048576
+ rank 5: int6 blocks.5.mlp.proj.weight sens=354418720.0 numel=1048576
+ rank 6: int6 blocks.6.mlp.proj.weight sens=149662432.0 numel=1048576
+ rank 7: int6 blocks.7.mlp.proj.weight sens=77031888.0 numel=1048576
+ rank 8: int6 blocks.8.mlp.proj.weight sens=63791736.0 numel=1048576
+ rank 9: int6 blocks.0.mlp.fc.weight sens=50330280.0 numel=1048576
+ rank 10: int6 blocks.0.attn.c_q.weight sens=50329784.0 numel=262144
+ rank 11: int6 blocks.0.attn.c_k.weight sens=50329784.0 numel=131072
+ rank 12: int6 blocks.0.attn.c_v.weight sens=50329784.0 numel=131072
+ rank 13: int6 blocks.0.attn.proj.weight sens=41123276.0 numel=262144
+ rank 14: int6 blocks.1.attn.proj.weight sens=31683348.0 numel=262144
+ rank 15: int6 blocks.9.mlp.proj.weight sens=29195914.0 numel=1048576
+ rank 16: int6 blocks.1.attn.c_q.weight sens=25165520.0 numel=262144
+ rank 17: int6 blocks.1.attn.c_k.weight sens=25165520.0 numel=131072
+ rank 18: int6 blocks.1.attn.c_v.weight sens=25165520.0 numel=131072
+ rank 19: int6 blocks.1.mlp.fc.weight sens=25165318.0 numel=1048576
+ rank 20: int6 blocks.4.attn.proj.weight sens=23533146.0 numel=262144
+ rank 21: int6 blocks.4.attn.c_q.weight sens=20133212.0 numel=262144
+ rank 22: int6 blocks.4.attn.c_k.weight sens=20133212.0 numel=131072
+ rank 23: int6 blocks.4.attn.c_v.weight sens=20133212.0 numel=131072
+ rank 24: int6 blocks.4.mlp.fc.weight sens=20133212.0 numel=1048576
+ rank 25: int6 blocks.10.mlp.proj.weight sens=18650652.0 numel=1048576
+ rank 26: int6 blocks.5.attn.c_q.weight sens=16778138.0 numel=262144
+ rank 27: int6 blocks.5.attn.c_k.weight sens=16778138.0 numel=131072
+ rank 28: int6 blocks.5.attn.c_v.weight sens=16778138.0 numel=131072
+ rank 29: int6 blocks.5.mlp.fc.weight sens=16778136.0 numel=1048576
+ rank 30: int6 blocks.2.mlp.fc.weight sens=16776893.0 numel=1048576
+ rank 31: int6 blocks.2.attn.c_q.weight sens=16776840.0 numel=262144
+ rank 32: int6 blocks.2.attn.c_k.weight sens=16776840.0 numel=131072
+ rank 33: int6 blocks.2.attn.c_v.weight sens=16776840.0 numel=131072
+ rank 34: int6 blocks.2.attn.proj.weight sens=16107929.0 numel=262144
+ rank 35: int6 blocks.5.attn.proj.weight sens=13670360.0 numel=262144
+ rank 36: int6 blocks.3.attn.c_q.weight sens=12582557.0 numel=262144
+ rank 37: int6 blocks.3.attn.c_k.weight sens=12582557.0 numel=131072
+ rank 38: int6 blocks.3.attn.c_v.weight sens=12582557.0 numel=131072
+ rank 39: int6 blocks.3.mlp.fc.weight sens=12582557.0 numel=1048576
+ rank 40: int6 blocks.3.attn.proj.weight sens=10975853.0 numel=262144
+ rank 41: int6 blocks.6.mlp.fc.weight sens=7191528.0 numel=1048576
+ rank 42: int6 blocks.6.attn.c_q.weight sens=7191527.5 numel=262144
+ rank 43: int6 blocks.6.attn.c_k.weight sens=7191527.5 numel=131072
+ rank 44: int6 blocks.6.attn.c_v.weight sens=7191527.5 numel=131072
+ rank 45: int6 blocks.7.attn.c_q.weight sens=6291326.5 numel=262144
+ rank 46: int6 blocks.7.attn.c_k.weight sens=6291326.5 numel=131072
+ rank 47: int6 blocks.7.attn.c_v.weight sens=6291326.5 numel=131072
+ rank 48: int6 blocks.7.mlp.fc.weight sens=6291326.0 numel=1048576
+ rank 49: int6 blocks.7.attn.proj.weight sens=5705765.5 numel=262144
+ rank 50: int6 blocks.8.attn.c_q.weight sens=5592423.0 numel=262144
+ rank 51: int6 blocks.8.attn.c_k.weight sens=5592423.0 numel=131072
+ rank 52: int6 blocks.8.attn.c_v.weight sens=5592423.0 numel=131072
+ rank 53: int6 blocks.8.mlp.fc.weight sens=5592423.0 numel=1048576
+ rank 54: int6 blocks.9.attn.c_q.weight sens=5032720.5 numel=262144
+ rank 55: int6 blocks.9.attn.c_k.weight sens=5032720.5 numel=131072
+ rank 56: int6 blocks.9.attn.c_v.weight sens=5032720.5 numel=131072
+ rank 57: int6 blocks.9.mlp.fc.weight sens=5032720.0 numel=1048576
+ rank 58: int6 blocks.6.attn.proj.weight sens=4689538.0 numel=262144
+ rank 59: int6 blocks.10.attn.c_q.weight sens=4575679.5 numel=262144
+ rank 60: int6 blocks.10.attn.c_k.weight sens=4575679.5 numel=131072
+ rank 61: INT5 blocks.10.attn.c_v.weight sens=4575679.5 numel=131072
+ rank 62: INT5 blocks.10.mlp.fc.weight sens=4575454.0 numel=1048576
+ rank 63: INT5 blocks.8.attn.proj.weight sens=2705658.5 numel=262144
+ rank 64: INT5 blocks.9.attn.proj.weight sens=2551724.0 numel=262144
+ rank 65: INT5 blocks.10.attn.proj.weight sens=1576195.8 numel=262144
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (6560197120.0), least sensitive=blocks.10.attn.proj.weight (1576195.8)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 61 int6, 5 int5
+Serialized model mixed_int5_int6+brotli: 15975195 bytes
+Total submission size mixed_int5_int6+brotli: 15996591 bytes
+final_int6_roundtrip val_loss:2.55414009 val_bpb:1.11014332 eval_time:6775ms
+final_int6_sliding_window val_loss:2.51171062 val_bpb:1.09170157 eval_time:75991ms
diff --git a/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed7.log b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed7.log
new file mode 100644
index 0000000000..463d8e23fb
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_N61MixedQuant/train_seed7.log
@@ -0,0 +1,221 @@
+
+*****************************************
+Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
+*****************************************
+Hyperparameters:
+ adam_eps: 1e-08
+ adam_wd: 0.02
+ beta1: 0.9
+ beta2: 0.95
+ compressor: brotli
+ data_dir: /home/dex/parameter-golf-with-cc/data
+ datasets_dir: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096
+ disable_layer0_attn: False
+ distributed: True
+ ema_decay: 0.997
+ embed_lr: 0.6
+ embed_wd: 0.085
+ embedding_dim: 512
+ eval_seq_len: 2048
+ eval_stride: 64
+ gptq_calibration_batches: 64
+ gptq_enabled: True
+ gptq_reserve_seconds: 10.0
+ grad_accum_steps: 1
+ grad_clip_norm: 0.3
+ head_lr: 0.008
+ is_main_process: True
+ iterations: 20000
+ ln_scale: True
+ local_rank: 0
+ logfile: logs/e3222df5-988f-4ad7-856c-78a860692582.txt
+ logit_softcap: 30.0
+ matrix_lr: 0.02
+ max_wallclock_seconds: 600.0
+ min_lr: 0.0
+ mixed_quant: True
+ mlp_mult: 4.0
+ model_dim: 512
+ model_path: final_model.pt
+ muon_backend_steps: 5
+ muon_beta2: 0.95
+ muon_momentum: 0.99
+ muon_momentum_warmup_start: 0.92
+ muon_momentum_warmup_steps: 1500
+ muon_wd: 0.085
+ n_int6_layers: 61
+ num_heads: 8
+ num_kv_heads: 4
+ num_layers: 11
+ parallel_residual: False
+ parallel_start_layer: 7
+ parallel_start_layer_is_physical: True
+ qk_gain_init: 4.0
+ quantized_model_path: final_model.int6.ptz
+ rank: 0
+ recur_layers_str: 4,5
+ recur_start_step: 3000
+ recur_warmup_steps: 20
+ repeat_untie_mlp: none
+ repeat_untie_mlp_layers:
+ rope_base: 10000.0
+ rope_dims: 16
+ rope_train_seq_len: 2048
+ run_id: e3222df5-988f-4ad7-856c-78a860692582
+ scalar_lr: 0.02
+ seed: 7
+ skip_gates_enabled: True
+ sliding_window_enabled: True
+ tie_embeddings: True
+ tied_embed_init_std: 0.005
+ tied_embed_lr: 0.03
+ tokenizer_path: /home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_4096_bpe.model
+ train_batch_tokens: 786432
+ train_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_train_*.bin
+ train_log_every: 500
+ train_seq_len: 2048
+ val_batch_tokens: 524288
+ val_files: /home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp4096/fineweb_val_*.bin
+ val_loss_every: 4000
+ ve_dim: 128
+ ve_enabled: True
+ ve_layers: 9,10
+ vocab_size: 4096
+ warmdown_frac: 0.667
+ warmup_steps: 20
+ world_size: 8
+ xsa_last_n: 11
+train_shards: 143
+val_tokens: 45514752
+model_params:34401371
+parallel_residual: active=0 start_layer=7 start_mode=physical params=0
+recurrence: layers=[4, 5] start_step=3000 active=0
+repeat_untie_mlp: mode=none layers=[] params=0
+gptq:reserving 10s, effective=590000ms
+[rank1]:[W403 01:50:21.434608368 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank2]:[W403 01:50:21.443540369 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank5]:[W403 01:50:21.444781033 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank4]:[W403 01:50:21.446956526 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank0]:[W403 01:50:21.452436132 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank6]:[W403 01:50:21.476742353 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank7]:[W403 01:50:21.482014708 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+[rank3]:[W403 01:50:21.482136277 reducer.cpp:1431] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+recurrence:prewarm active=1 virtual_layers:13
+recur_warmup_step: 1/20
+recur_warmup_step: 2/20
+recur_warmup_step: 3/20
+recur_warmup_step: 4/20
+recur_warmup_step: 5/20
+recur_warmup_step: 6/20
+recur_warmup_step: 10/20
+recur_warmup_step: 20/20
+0/20000 val_loss: 8.3151 val_bpb: 3.6141
+1/20000 train_loss: 8.3170 train_time: 0.0m tok/s: 8423294
+2/20000 train_loss: 12.2942 train_time: 0.0m tok/s: 8320228
+3/20000 train_loss: 10.8034 train_time: 0.0m tok/s: 8231287
+4/20000 train_loss: 9.0943 train_time: 0.0m tok/s: 8174885
+5/20000 train_loss: 7.9397 train_time: 0.0m tok/s: 8147199
+500/20000 train_loss: 3.0002 train_time: 0.8m tok/s: 7930671
+1000/20000 train_loss: 2.9435 train_time: 1.7m tok/s: 7928531
+1500/20000 train_loss: 2.9058 train_time: 2.5m tok/s: 7927060
+2000/20000 train_loss: 2.8302 train_time: 3.3m tok/s: 7923935
+2500/20000 train_loss: 2.7127 train_time: 4.1m tok/s: 7921580
+3000/20000 train_loss: 2.8226 train_time: 5.0m tok/s: 7919784
+recurrence:activated step:3000 layers:[4, 5] virtual_layers:13
+3500/20000 train_loss: 2.6936 train_time: 5.9m tok/s: 7743889
+4000/20000 train_loss: 2.6240 train_time: 6.9m tok/s: 7616960
+4000/20000 val_loss: 2.6474 val_bpb: 1.1507
+4500/20000 train_loss: 2.5806 train_time: 7.8m tok/s: 7520174
+5000/20000 train_loss: 2.6281 train_time: 8.8m tok/s: 7445186
+5500/20000 train_loss: 2.5767 train_time: 9.8m tok/s: 7384965
+5538/20000 val_loss: 2.5297 val_bpb: 1.0995
+stopping_early: wallclock_cap train_time: 590077ms step: 5538/20000
+peak memory allocated: 30215 MiB reserved: 30244 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.52732846 val_bpb:1.09848979 eval_time:2000ms
+Serialized model: 132405891 bytes
+Code size: 21396 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 66 Hessians in 9.7s
+mixed_quant: sensitivity ranking -- 61 int6 (top), 5 int5 (bottom)
+ rank 0: int6 blocks.0.mlp.proj.weight sens=6506361856.0 numel=1048576
+ rank 1: int6 blocks.1.mlp.proj.weight sens=1261223680.0 numel=1048576
+ rank 2: int6 blocks.2.mlp.proj.weight sens=623147200.0 numel=1048576
+ rank 3: int6 blocks.4.mlp.proj.weight sens=393641760.0 numel=1048576
+ rank 4: int6 blocks.5.mlp.proj.weight sens=369362624.0 numel=1048576
+ rank 5: int6 blocks.3.mlp.proj.weight sens=348602464.0 numel=1048576
+ rank 6: int6 blocks.6.mlp.proj.weight sens=110822000.0 numel=1048576
+ rank 7: int6 blocks.7.mlp.proj.weight sens=109934440.0 numel=1048576
+ rank 8: int6 blocks.0.mlp.fc.weight sens=50330284.0 numel=1048576
+ rank 9: int6 blocks.0.attn.c_q.weight sens=50329240.0 numel=262144
+ rank 10: int6 blocks.0.attn.c_k.weight sens=50329240.0 numel=131072
+ rank 11: int6 blocks.0.attn.c_v.weight sens=50329240.0 numel=131072
+ rank 12: int6 blocks.8.mlp.proj.weight sens=49911372.0 numel=1048576
+ rank 13: int6 blocks.0.attn.proj.weight sens=38699824.0 numel=262144
+ rank 14: int6 blocks.9.mlp.proj.weight sens=30399688.0 numel=1048576
+ rank 15: int6 blocks.4.attn.proj.weight sens=27087060.0 numel=262144
+ rank 16: int6 blocks.1.attn.c_q.weight sens=25165440.0 numel=262144
+ rank 17: int6 blocks.1.attn.c_k.weight sens=25165440.0 numel=131072
+ rank 18: int6 blocks.1.attn.c_v.weight sens=25165440.0 numel=131072
+ rank 19: int6 blocks.1.mlp.fc.weight sens=25165376.0 numel=1048576
+ rank 20: int6 blocks.1.attn.proj.weight sens=24764336.0 numel=262144
+ rank 21: int6 blocks.4.attn.c_q.weight sens=20133218.0 numel=262144
+ rank 22: int6 blocks.4.attn.c_k.weight sens=20133218.0 numel=131072
+ rank 23: int6 blocks.4.attn.c_v.weight sens=20133218.0 numel=131072
+ rank 24: int6 blocks.4.mlp.fc.weight sens=20133212.0 numel=1048576
+ rank 25: int6 blocks.10.mlp.proj.weight sens=19611092.0 numel=1048576
+ rank 26: int6 blocks.5.attn.c_q.weight sens=16778136.0 numel=262144
+ rank 27: int6 blocks.5.attn.c_k.weight sens=16778136.0 numel=131072
+ rank 28: int6 blocks.5.attn.c_v.weight sens=16778136.0 numel=131072
+ rank 29: int6 blocks.5.mlp.fc.weight sens=16778130.0 numel=1048576
+ rank 30: int6 blocks.2.attn.c_q.weight sens=16776884.0 numel=262144
+ rank 31: int6 blocks.2.attn.c_k.weight sens=16776884.0 numel=131072
+ rank 32: int6 blocks.2.attn.c_v.weight sens=16776884.0 numel=131072
+ rank 33: int6 blocks.2.mlp.fc.weight sens=16776876.0 numel=1048576
+ rank 34: int6 blocks.5.attn.proj.weight sens=14194073.0 numel=262144
+ rank 35: int6 blocks.3.attn.c_q.weight sens=12582560.0 numel=262144
+ rank 36: int6 blocks.3.attn.c_k.weight sens=12582560.0 numel=131072
+ rank 37: int6 blocks.3.attn.c_v.weight sens=12582560.0 numel=131072
+ rank 38: int6 blocks.3.mlp.fc.weight sens=12582558.0 numel=1048576
+ rank 39: int6 blocks.3.attn.proj.weight sens=10718067.0 numel=262144
+ rank 40: int6 blocks.2.attn.proj.weight sens=9254327.0 numel=262144
+ rank 41: int6 blocks.6.mlp.fc.weight sens=7191522.0 numel=1048576
+ rank 42: int6 blocks.6.attn.c_q.weight sens=7191519.0 numel=262144
+ rank 43: int6 blocks.6.attn.c_k.weight sens=7191519.0 numel=131072
+ rank 44: int6 blocks.6.attn.c_v.weight sens=7191519.0 numel=131072
+ rank 45: int6 blocks.7.attn.c_q.weight sens=6291327.5 numel=262144
+ rank 46: int6 blocks.7.attn.c_k.weight sens=6291327.5 numel=131072
+ rank 47: int6 blocks.7.attn.c_v.weight sens=6291327.5 numel=131072
+ rank 48: int6 blocks.7.mlp.fc.weight sens=6291325.0 numel=1048576
+ rank 49: int6 blocks.8.mlp.fc.weight sens=5592429.0 numel=1048576
+ rank 50: int6 blocks.8.attn.c_q.weight sens=5592428.5 numel=262144
+ rank 51: int6 blocks.8.attn.c_k.weight sens=5592428.5 numel=131072
+ rank 52: int6 blocks.8.attn.c_v.weight sens=5592428.5 numel=131072
+ rank 53: int6 blocks.9.attn.c_q.weight sens=5032693.5 numel=262144
+ rank 54: int6 blocks.9.attn.c_k.weight sens=5032693.5 numel=131072
+ rank 55: int6 blocks.9.attn.c_v.weight sens=5032693.5 numel=131072
+ rank 56: int6 blocks.9.mlp.fc.weight sens=5032682.5 numel=1048576
+ rank 57: int6 blocks.6.attn.proj.weight sens=4691881.5 numel=262144
+ rank 58: int6 blocks.10.attn.c_q.weight sens=4575373.5 numel=262144
+ rank 59: int6 blocks.10.attn.c_k.weight sens=4575373.5 numel=131072
+ rank 60: int6 blocks.10.attn.c_v.weight sens=4575373.5 numel=131072
+ rank 61: INT5 blocks.10.mlp.fc.weight sens=4575340.5 numel=1048576
+ rank 62: INT5 blocks.7.attn.proj.weight sens=3624991.2 numel=262144
+ rank 63: INT5 blocks.8.attn.proj.weight sens=3561298.8 numel=262144
+ rank 64: INT5 blocks.10.attn.proj.weight sens=3232795.8 numel=262144
+ rank 65: INT5 blocks.9.attn.proj.weight sens=1478642.5 numel=262144
+mixed_quant: most sensitive=blocks.0.mlp.proj.weight (6506361856.0), least sensitive=blocks.9.attn.proj.weight (1478642.5)
+GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
+mixed_quant: 61 int6, 5 int5
+Serialized model mixed_int5_int6+brotli: 15960936 bytes
+Total submission size mixed_int5_int6+brotli: 15982332 bytes
+final_int6_roundtrip val_loss:2.55721153 val_bpb:1.11147830 eval_time:7005ms
+final_int6_sliding_window val_loss:2.51522064 val_bpb:1.09322719 eval_time:76102ms