From 99b080be7ee758616038526f297e9e4f8680734c Mon Sep 17 00:00:00 2001 From: "Dixing (Dex) Xu" Date: Tue, 31 Mar 2026 06:20:48 +0000 Subject: [PATCH] =?UTF-8?q?Record:=20SLOT=20+=20Split-LR=20+=20Full=20GPTQ?= =?UTF-8?q?=20+=20XSA-all=20=E2=80=94=20val=5Fbpb=201.1015=20(3-seed=20mea?= =?UTF-8?q?n)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SLOT eval-time delta optimization + split early/late Muon LR + Full Hessian GPTQ int6 + sigmoid-gated skip connections + soft-round QAT + Brotli-11 + BigramHash(2816x160) + code minification. 3-seed mean: 1.1015 (std 0.0011), delta -0.0132 BPP / -0.0224 nats vs PR #1019. --- .../README.md | 143 +++++++++++ .../submission.json | 9 + .../train_gpt.py | 228 ++++++++++++++++++ .../train_seed1337.log | 109 +++++++++ .../train_seed2025.log | 109 +++++++++ .../train_seed42.log | 109 +++++++++ 6 files changed, 707 insertions(+) create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/README.md create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/submission.json create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_gpt.py create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed1337.log create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed2025.log create mode 100644 records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed42.log diff --git a/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/README.md b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/README.md new file mode 100644 index 0000000000..003d0d2d85 --- /dev/null +++ b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/README.md @@ -0,0 +1,143 @@ +# Record: SLOT + Split-LR + Full GPTQ + XSA-all (val_bpb: 1.1015) + +**val_bpb: 1.1015** (3-seed mean, std 0.0011) | **1.8598 nats** | **~15.65 MB** | 8xH100 SXM, 600s train + 177s eval + +Built on [PR #1019](https://github.com/openai/parameter-golf/pull/1019) by @abaybektursun. +Previous: [PR #549](https://github.com/openai/parameter-golf/pull/549) (1.1194) -> [PR #1019](https://github.com/openai/parameter-golf/pull/1019) (1.1147) -> this. + +## Results (8xH100 SXM) + +| Seed | Steps | ms/step | Post-EMA BPB | **Sliding+SLOT BPB** | val_loss (nats) | Artifact | +|------|-------|---------|-------------|---------------------|-----------------|----------| +| 1337 | 6704 | 88.2 | 1.1309 | **1.10213** | 1.8609 | 15,647,124 | +| 42 | 6706 | 88.2 | 1.1289 | **1.10019** | 1.8576 | 15,658,061 | +| 2025 | 6684 | 88.4 | 1.1310 | **1.10216** | 1.8609 | 15,650,266 | +| **Mean** | **6698** | **88.3** | **1.1303** | **1.10149** | **1.8598** | **15,651,817** | + +### Improvement vs SOTA + +| Metric | Merged SOTA (PR #1019) | This submission | Delta | +|--------|----------------------|-----------------|-------| +| val_bpb (3-seed mean) | 1.1147 | **1.1015** | **-0.0132** | +| val_loss (nats) | 1.88218 | **1.85982** | **-0.02236** | + +Clears the 0.005 nats threshold by 4.5x. + +## Changes vs Baseline (PR #1019) + +### 1. SLOT: Sample-specific LM Optimization at Test-time + +At eval time, for each sliding-window batch, we optimize a single additive delta vector (R^512) between the frozen hidden states and the logit projection. The model forward is split into `forward_hidden()` (frozen, no grad) and `compute_logits()` (carries grad for delta optimization). + +- **Delta shape**: `[1, 1, 512]` — broadcasts across batch and sequence +- **Optimizer**: AdamW (lr=0.005, weight_decay=1e-8, eps=1e-5) +- **Steps**: 8 per batch +- **Eval time overhead**: ~90s (well within 600s eval budget) + +SLOT is score-first: hidden states are computed under `torch.no_grad()`, the delta adapts through `compute_logits()` only, and final scoring uses the adapted logits. The model weights are never modified. + +Reference: Hu et al., arXiv:2505.12392v2. Also used in PR #1128, PR #1105. + +### 2. Sigmoid-Gated Skip Connections + +U-Net skip connections use learned sigmoid gates instead of simple addition: +```python +g = sigmoid(skip_gates[i]) +x = lerp(skip_weights[i] * skip, x, g) +``` +Gate starts at sigmoid(0) = 0.5 (balanced blend). Adds 2,560 params (5 gates x 512 dims). + +### 3. Soft-Round QAT with Alpha Ramp + +Late QAT uses differentiable sigmoid rounding instead of hard STE: +```python +soft_rounded = floor(scaled) + sigmoid(alpha * (frac - 0.5)) +``` +Alpha ramps from 1 (smooth) to 16 (near-hard) over 500 steps. Provides real gradients through rounding, letting weights adapt to quantization grid. + +### 4. Split Early/Late Muon Learning Rate + +Bank gradients are scaled per-layer before the Muon reduce-scatter: +- Early layers (0-4): Muon LR = 0.025 +- Late layers (5-10): Muon LR = 0.030 + +Late layers benefit from higher LR (weaker gradient signal further from loss). + +### 5. Warmdown = 4000 Steps + +Extended warmdown from 3500 to 4000 estimated steps. Holds LR higher for longer, giving the model more time at productive learning rates. + +### 6. BigramHash(2816x160) + +Enlarged bigram embedding dimension from 112 to 160. Same 2816 buckets. Richer per-bucket representation at minimal artifact cost. + +### 7. Code Minification + +`pyminify` + LZMA2 + base85 self-extracting wrapper reduces code from 101KB to 23KB, freeing ~78KB of artifact budget for model weights. + +### 8. Brotli-11 Compression with Byte-Shuffle + +Replaces LZMA-6 with Brotli quality=11 + stride-2 byte-shuffle preprocessing. Saves ~400KB vs LZMA. + +### 9. GPTQ Reserve 9s (was 14s) + +Reduced GPTQ calibration time reservation from 14s to 9s, gaining ~55 extra training steps. + +## Negative Results (tested, did not help) + +| Technique | Result | Notes | +|-----------|--------|-------| +| Turbo-Muon (AOL + Polar Express) | +2MB artifact bloat | Weight distribution changes break compression | +| No-GPTQ (PR #1120 style) | -0.005 BPP worse | GPTQ essential for our stack | +| Pure EngramLite swap | -0.003 worse | Same-budget multi-head too diluted | +| ResidLambdas | -0.003 worse | Quant error compounds through lambda scaling | +| LeakyReLU slope=0.3 | Neutral | | +| Partial key offset | Neutral | | +| BIGRAM_DIM=192 | -0.001 worse | Diminishing returns past 160 | +| TTT (score-first SGD) | Neutral on Full GPTQ stack | Post-quant weights too well-optimized | +| Mixed int5/int6 GPTQ | Broken or worse | Needs full PR #1089-style pipeline | + +## Architecture Summary + +| Component | Setting | Source | +|-----------|---------|--------| +| Layers | 11 | PR #549 | +| Model dim | 512 | PR #549 | +| Heads / KV heads | 8 / 4 (GQA) | PR #549 | +| MLP mult | 3.0x (LeakyReLU(0.5)^2) | PR #549 | +| XSA | All 11 layers | PR #1019 | +| BigramHash | 2816 x 160 | **This submission** (dim=160) | +| ValueEmbedding | dim=128, layers 9,10 | PR #549 | +| SmearGate | F.pad causal shift | PR #549, optimized | +| Skip connections | Sigmoid-gated lerp | **This submission** | +| Quantization | Full Hessian GPTQ int6 | PR #1019 | +| Compression | Brotli-11 + byte-shuffle | **This submission** | +| Optimizer | Parallel Muon + Split-LR | **This submission** (split-LR) | +| QAT | Soft-round alpha ramp 1->16 | **This submission** | +| Eval | Sliding window stride=64 + SLOT | **This submission** (SLOT) | +| Code | LZMA2 self-extracting wrapper | **This submission** | +| Warmdown | 4000 steps | **This submission** | +| Params | 27.2M | | + +## Setup & Reproduction + +```bash +# Environment: 8xH100 SXM, PyTorch 2.9.1+cu128, flash-attn 2.8.3 +export NCCL_NET=Socket # Required on GCP H100 +export SLOT_ENABLED=1 +export BIGRAM_DIM=160 +export WARMDOWN_ITERS=4000 +export SLOT_LR=0.005 +export SLOT_STEPS=8 + +# Run with torchrun (evaluate.py handles this) +SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py +SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py +SEED=2025 torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +## Acknowledgements + +Thanks to **@0hq** and **@valerio-oai** for organizing and maintaining an excellent competition. + +This submission builds directly on @abaybektursun's PR #549 and PR #1019, which established the LeakyReLU^2 + Parallel Muon + XSA + Full GPTQ stack. The SLOT technique follows Hu et al. (arXiv:2505.12392v2) and was independently validated by @AnubhavBharadwaaj (PR #1128) and @abaybektursun (PR #1105). The sigmoid-gated skip connection idea draws from @mikeapedia's PR #1089. Code minification approach adapted from PR #1089's shrink pipeline. diff --git a/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/submission.json b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/submission.json new file mode 100644 index 0000000000..61440cf21f --- /dev/null +++ b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/submission.json @@ -0,0 +1,9 @@ +{ + "name": "SLOT + Split-LR + Full GPTQ + Sigmoid-Gated Skips + Soft-Round QAT + XSA-all", + "val_bpb": 1.1015, + "bytes_total": 15658061, + "blurb": "SLOT eval-time delta optimization (lr=0.005, 8 AdamW steps per batch) + split early/late Muon LR (0.025/0.030) + Full Hessian GPTQ int6 + sigmoid-gated U-Net skip connections + soft-round QAT with alpha ramp + Brotli-11 byte-shuffle compression + BigramHash(2816x160) + code minification (23KB wrapper). 3-seed mean: 1.1015 (std 0.0011). Built on PR #1019 by @abaybektursun.", + "author": "dexhunter", + "github_id": "dexhunter", + "date": "2026-03-31" +} diff --git a/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_gpt.py b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_gpt.py new file mode 100644 index 0000000000..4225f207c5 --- /dev/null +++ b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_gpt.py @@ -0,0 +1,228 @@ +import lzma as L,base64 as B +__wrapper_size__=23349 +exec(L.decompress(B.b85decode(";TBUyW?cX?9Eu`uszf$Krci>25I-HuUNjF`N?9VI&1P%41Wt3M0HcnGi85w-CJ8_DYWqCUhZ{>g4nvkJcgHT-RDPNuEud#fAnQaL" +"SNvkmp$G^X{d(AZjRFxFy>2wf?R)sc9i|_R`TJuej!(A6t?t2t=+vR+ec}DB7#mA!0U3ZPbPd!4VG0ztIP9Y`Qh|Y5(~*gP" +"Uu|L4Sm$iKcMKwI^Xm?&;!d4_(?pkoFq&`({X;YasmK" +"ijVu_E?FVOr018tgy@qyX@~WH+K-{#(pQ3sorfjje*Nqq`AjnAH" +"#2oqxmY=ERi&#v}WlLw)|zW>xaD0R(Xii$i~3lIJSNYhuO;!r(0jIG}Mj)~w>w{F9*RCJ?e-N=g~oga2Dww6Q(2Z" +"Qnkf1d)+V-e+yFgmA)BGt=5*Ym<40o?9<_oU}`tCF`w=UrX(o<_#FV!AcE#rhRFf|eRF~3Cjq5wZ=`#2v$GndICM8;50c{)-(6Pv" +"m6W1aAqaI7KfqNozgRm8V*(YZGw*&-o>D$cr~1a(o`+us6bu$DkpQd1Cn&2q" +"85g02HR^skp|5m`b5?2gKu-HFd2ygiroiK(k`X~d1yS3C?oyQLL1J10OdE=d958oL1P4tcVKW;f)cF^7gQ=Zv3e0GnxN);;yWTj_Ge0G+07^HQcg%}14" +"d>ww#&%HCEtU6^UkHU;t50#Z0tB+=G)x1HK>6(Jj%9FHqpGhOK2oqlkrCXFz!3w+ibOxHCTcmP%5-b*AErx0=^6N=n_ol^4;hQ1=" +"*Gz_l_2ve;*DDCf+=6B>^c3YU@<~&j$lrD#^=|V!JNXSbQ18=nd-7u8cifrEg=&ND-sSlYV>x0vL!1n8$Ei^xW%QH+hEvsZvJllS3yr>snOw+lTjIj`{z@UxNNTtQ1Ck^prKdoM{F`;>LPCyV`#dJwgp^Vav" +"9}&kM$~E6Ty}O1uyAK+sUr{ggHgmN>FZ-KhFbAhPeQRwh4_S;KPThZo(3UZc<#VkCL5XBbg#JSsWw?CMeEp;zhhSxe|Q=J%YQz>mobcRCy$IC32LAHtkys" +"%;npjvY!O(1W#l8lkn(p2*}i=k8yNF{k2UVOC%ALD^`AKo|R*-u&V82Y1gjHL$+&r1tVk?Lm!3mJ&rm`xN`V$!?8G|^7ez8HN-fH" +"N(y6hHy=V{1YbD^D7-F(`(Q`{M5DbR4|TUozLQwV=_tXYi-M^C1G7pNbohKEjpf{gf`6li`WV`&pW@-&G2$ti)*@Q3Djks|%AaZX" +")t5r7o{cAHk@dIIMhY!%oreu6!J&o@0)WfPJF*xa8Q@>%BIVl%EB2{90cxcM>aHc)ZN>Y;RC|4d5qv0B2d!07YzFIA{$eU1?E+aY" +"l97Ik$AfzqoQpzKzx!6cHIQGPkIbkK09lOwS*vY3O>An2+w-2Yo#@Eg$" +"IXV{ga_?#)Al9FNu!7}K+U)`k`t;E@;V>C8luA%?`fg@M;??|;eaM{|Rjt{cr8Jpzn757f65;B0+T8y1kgRCb=Dv?ThR`heT)a*j+3<`vp9LW454gEbcL3sQ`iR4Vf@TUGw5N7Qf" +"mAP!g16qGeV6$x4;&sPhii)s!63xnl{XZ&&Plk*-^6M`);d47O3kfCzr_%UpHYgY;IAsShVCt4eR>U^A+Am|j~{8P9cQ61!x&9_B}sBL!ksV*ep" +"(8sn?fhPQ~WZ1vB2r;S0lgp$^E%9Lk=ro>FEk_jK;7aP@cW+@bey=GxtHz%a(K8R)}&gg}(pe-YASvw4eZEaR1r5(J&oPQjLRQG?M({8T5$M(J4" +"ZI_`=-+-x-jUjXU-Qlixn$6YtBH}csW##=W4Zf40N=XFestH-y{@y=cYqXq$Lww(HRQS4;;OO@&$IE6Jef8ZGn@*jO8B$9x=1VCz" +"tiAR9`S2Kx?&|SSLY6c?*$4rh8R$x*_y|7`2}c9n-JDpJoj*kMv{sg_vyIW+QC72@==xH+9q%v*kPl@bi(A+_F}AovA%xSPOupBM" +"X!bF+m!BP!Db(EUJsTb_HfJKUv?xJ-~qKo1Z}" +"{Ua7HHuI5ECE=itzw^AXvLn38AJ9OGczhObyUYhaD92BP$#>(CT0NM!t01RAryQkWrQpnI?X;k~?UO_4B1im%z" +"<&rNuz|VW@XyLB>g6O0wHdE+D8Er}L+$U|BQq8W}v%x`K!caZgf(^Di##VDW_PqR<*aAz30x{8_pIKi{c-J>+^`IXLZp@4a!>;7S" +"D-*4#drki2ued3nKJ`-`Z^he$JBvRnxs;&*vp|^l4apU2$OQ#l-buz0XeZZJn!bO>k8@84bjD@oWP5FE7<&RNH4k#;_(U+w==9Ma" +"b4T=lc4D4BLU8$Bb{f1)`%nps(`PHWO!IRZut`f6;$tKZ+%YufXhL|GAufo-9R$HD{lnMq8g2^;&qC6YsILKP$3wRep|ycyOqF%=" +"!P^t3U6vVAX)zlQ+71=Fv+M>m^;TRU5inPd|8+p*!^)k59O)VAp|&PMzwS*ToaxiF5dOI4" +"=}$~qcu@Cv$UJTyA{HOrLdY(`sez6e;-7SLGo<*YEH>~^#_Y{_vh?tKCPRzY_mHR%}FXFxG;XaVprh5-~-" +"S+P6REJOkDQ_ZraWdCEGDO^-5_}C~xZ*$0kvO=w_6hGU{!di*QV-riw`nkgETrg`;1XQ?SCDDbxCmMU6}S67>_" +"8iGt)i7aoo9kB`KmuFnChdf!#E??K+SC@#2gl_FoS*%" +"*Ad@F9XSV15jh+rTWy@SqaO|fiE5I7-CrCENCr20l0qK5of&9m3V>}YjEUwjDz%SDrnT+`UdE})m&QRB(2IZyKas`kBma#b{7t?0" +"1X?u%=W%nE;7C9(qtSI7s%PM=0N;$PKKb7Cn4CumeooMtF)=5jjY8&NlI0*rUFIZjf}-&~;$_a2>|`T-}c$NgrMbjZUOOwOs@^A=JRRW8U7" +"XbY?z7!uu^b>+ro;+VneU$S~?b~&CBhm`$&GQ4k&vm0yN{hIVrt7~rLmUo3wtIJ`!^H=r-m*2+a?eFUoUoTGGkWseMbQQIloTLrJTbE3($ZDz=%u4@poGjg(}7icWu9~cEsB>7T7D@6p3Ybd=KXv?+sc>yW?4ALD&Px9)c-EvJ)dB" +"+=xNZ@WflGMyQ*a(c98NI^{aO&DAlZljm6U9CD;i6!cE1_gJGF+$+2T;nxvS`VD7%JbbNp0SN2ot+&ul89B$ze&P%<;Jb&8~h*Y;hHiZa_UWv9AM4vEq^%`^?#sl(F=l|iksTqtn" +")dRo=N+4(i5v{)%bU%UAZ0`Av{+g?)2VqR+&IZ_@rN9|dNr(VlDxOz+C|#WYmr_&y^aj%dp`n^XO%ZY{0oEzZP!2vJi(nl-%K;?D" +"4y`5n{Z89?l`GHIQ;>7VccD=gl!ZFI&Wz;f1$U>XJL8~2VwL)`8n+G4f7!y3<~3!}k#1aNgugmS3uTa#!`AqH<_bFQUN;DGn(no~" +"c^Mqn%pCuiKt#swOscwCly&ummdMj|e*?vEIdrIm0xblKBF4-nCEo6U=^THik{+HkQ=*)a9ZNAjnjJvo%N!)JqKXbO3kICY@SjmP" +"B*n-+ZxH@`<}j;26AqX)K2gVor1P9&I77!`H5ws*eL+|nBN3XanDen;VmIwftJahG7%ux9uyM)`POxYA$>1YAxvL)D4Gkz6@jZV!6j1YUr" +"%#JaQ@S;O1haVVqsN&>iVW^-mtl7;FjLuM7QJvTl*(_vuBayARqRgQC(x7wF3MG3A6@+=j6H5Kmq0}FzqZ{27_mjPy~@4^+EF!i6+=5*vE#UrJdY4x5$!R%gy>n!!J<*eb%jGvghVfi;9@1%ez%LCSE*rEO;7s!%9xYCd>LY}q8C6m7PabZ8>;G2YN1&U(5mGRQhUtGnAq-9I" +"B-WI>6UB!Lnx5Hr$CU|z8CJr3oWNRu)H^ZX*EniR4iZMpPqHT@eRG4`g)3F3=SUz;9uw#^F9+Q+yg1N?hSfFP4@jZn#v2|^8WXUn" +"P9U&h)AjK~unzZtnDC*k!NV{@Vfy}6k_vg+#gV{N$LNRbwwjntVXYx(XEz~Hiqo#u%I!G5lI(z(iMz_~V%7o%v+Y+t?Q;=uztF_R" +"{{0Tlq?sdi4U){T2Pes@=kNU?#&+PGEXQlSpvb6$%{JS`ukp#`puv*3sSnGFqTGbchN4(kb3T?ckBu{tiboBgQFoCIN0*D3P6" +"Cs`aHdI)px41S!j>X){3;8@Zq<|WH$)SyS#d?fufl2ASw7" +"C_z}j)<*$210aI-Z$YlQyaevpS%@m|38DEU($3+)Mn9)jX&;7;O7=h(;-zzRfUZWe@tK;ay$sNeM5hXBSz(Y*=j8p8C4F?=Ou02~" +"@@2)S?^4@4@WxkzQ^#ySrTY7q=~K%3k#DL+a3#>y*IA72JI;hGEqQDV!(Jmp9-i0COafprb;}q0bMA$b2U|+l(-8Zy?gf9WP!kB|" +"Tvl1m^%gwOx-WAYx!~=&i`=*2Q<-uy_QjX~wf5b87$h?d!v#`s*%6k;x5LY#X0NcbX6zK@`&mbZy;u_#QFi9XLTA_7$`D4Z&JSMVycX_2|{%7%jltA{b)2uh?&gh3+3#`^U-TH" +"=X_%AyigC=>oh?{jFw#0H@gJZTEmcrCvB|BXM)>f+450ODew(IP44u>D^s-0+pWUSZHcGzzs{W8ut1`}%I#kE7VkBPKC(i&+I|6~" +"X;*mEiiZ2SXig7cYL2Tr0;d$HtN|Y&N%jbT;t)+s6-xZy;SQ}U&^`t|l>uZpC8R^pPrgI*RYC)de0#seOXB^l4nXb}3XhlMiRj}K" +"{CoCK){>^HrPE9ZF_oq9q64@9x$80PNo!|IMbbS?3y=e;I7T^JYf4@lYdy`V>+0#w*dgo3&j3y>^e>jkf(6y7oIvs~;#RrIMkqg#" +"0aD2yn+KGLY;|{jH-xHzCm~?;za<`T5+rqc+R}0)B-GzkMc;6zCEVY>z(z=r>H?w6qge`)F`&RDMA=CMe5;#*-wQ>q`G|AJ`5~sy" +"6&4=;$f=dJ!7vxfHF8xT!jIAPBVKoYQD?;~@Hdj_z_E+sa9~X;$qOXSu-SNmf&CkWrWIOfk{eQ(+2n5&pHf6ltuj2#pVQ3UQPM5^i6Ks#&UAm2{@LnrXdzB+v" +"jM*^!j4rf6M8&!eXj1a&bI@n{mKHB{v0Na*>OTP$$Etb|h4s;XN9" +"_KSt8;I8eGm!X}O&M@4joN_uuR%brPghuz>ko*<}dg$Yb31gKL1^M!vQe0I%sN6JQ$US%dsW11Rn**z!{M;!SaKuA1FUxtipBY)x" +"q@%!t*w*P)0EAX|)pS{E=!$xaseA8DOUFsa1m(x+*UAJ(?<*H|fOeqiHmiY=&e+6i_dbI~%UOZ9)nC9|r0zNtN^FzNPCx&R!Drm&fMs{A+tX-Z^UcLj6io|IWgPPcQ1UCZSDrp;C;w>;gAC?uDi2o+5k}" +"8XI=UVNL^iz6$%g&r&ZJFYpsfL8GmEoF#*OLH^Acx->N+qKC4YlqcG-n5RJ^7yf`Wf5=oRSEn^21;PqAzd$UN+Hp@h;JQJk9*%k)" +"ly6LVD;`qfPHPTEBEa&y}!#C)>e" +"Ic=MSbaiUKeEm9Oe*CG^-Z92xkN!7ZncfC$}ALjAEAu?|=5>~i38+b1F|W(*Z(xr@>xTgm_zz+RAd*wqEM" +"Q2#BvwLD%>sl5>PEbiJAvUQ4&o&W(Sgn{?5ARs504)6Z#f^V1*n*s@Yu-*O!^{=lKrPI" +"1@9L#{DcA4?C&X$&{Xw0<1%2A$+#`>izZh{$ssx4UUQ`BR})FuYekT$KREU#T)yjl-<@a_@TIAFMUx" +"YOqcpd6%0$PSor|8Iqgrv0XTWg~)iFlJ%X4Q?Rv9R+5U{|J8H?qYL5FuhKp8BsOCYdj`UbV9N9iR(+WUgebb&7yD1ypp7H}SfF=!" +"Qs_iM-g3Wp-Hfr15S;xn$B(#ph~&+`Qq}>ur}~qmst9$qr}WSAinMP)pn-trpttU+wXr(*KvfULWB5|n{Tbqa>G&tI&or#ck4@Um" +"A#5lhI^*QfxyqTCYf2+26*iO579JlQ>9l445>OA&>K|CfN(jtJtobV@=KY23wdk^-r}|8;x!D~V+Q1#RQ3j>ekgA5yr)_uINuMC}" +"{_B8V7{zymRVtYV7ll!3G~Eg+QX&7mL`r?a+`G*bVF9(tB%B*~WYL4iV+e=Orln`muAQ0zFB|ya9NE1Spe~YW9U!n;EKDz(9c5Z%" +"@*b_#UTaDlz8u={1OH9ru$C=QgIi{G)LsaNxEi2MkX3>ALKJtOFjTJf=Dd>#h}_bVhiJnNt!vF`~KsZY?V=J" +"cb;GXQ7_5tLwyxkluhH~%SqjPZeU5c*Gwbw)YB$y4cNterJt5`ySSW7jm>WqF-PRf^jFnbeUc|;!wZaZC%=xpGH;Pz4cAEol6DZ&Y>Uv7NssS_#yOqy9>TXqrFHN$KvT}(URY=F#A3y4~F;QUU9$Iw#s@1fLxpt{Q" +"5aHpA}!CifNozfI&5;bh9ALQ89hJHiOTEZZq!dVrMfLgm#HlybUPkwkFYR~gKw5+w}WB>`dS6tZ1E" +"t_b8^HYWMKDWC-z^1KEN#Z$;UO17y@{)sPmngTSGZ^#4^t-fT_2>lT>xJ3qrb=?{Z=*-JHiPW%J^+TdmkmIvhcK%LvwNnJP6}}7<" +"%g^Gm0uj5OLFa{ydE={oP&N@BXrW3VIU1O!u7gGVnrM^`cE^8i5Mjc6m`kCK)e%86A2{2O^?lEaqiK+rD;q)s5SU9#?K9z{uMwSFDnwH5(W;Lg%?Cexe92hR_E+F|_L})reRtxR8o7E@(*b)dD)T328K-P<{VjL!eTq&L+JqjJ8B}" +"0@}iduhXM~x2GRWkfWRuW?^pK$hMq)`N%i?71kKYWR@7zqOY|~o}5ujNYppEq_L8)J_zLPTC3njjSkkD2W0^1>^EHzRJ42fuS>5)TiiEkR+G|g<=)EHaz$ur?BOzXtO5lA>rb9RSvbYfxXEe;ogsm^4j*s=^hJSUgN+ec#NsvPt`" +"(FfZ!-RfV1k*m+uaC)9HS9}QUjRd(K8H(sLbGN{e{tTK+1xc_NZ_-<)}qG0QF3oEdBS))LL}bFDg9xUWn$SNv&qC&DW|M{gB)ED?hxb?" +"7C^u<5ZAX!A>5_4XT-@P?u{5Lz$$B?hhZjghado|s9iRcW" +"ss-5RR&|(0ld534%J$Q0709V?lN%f&*xC&Nl+_sxTB`Je$#{-5D1K^CTYkxp1a~@bK;1fdd0>Isu8_1`h8AkMgC!f#vvMFnRdx`G" +"Uws7kUg;j#@Vv))dllE4i20<3AE(#`W9Y*QL=#4BMU7ScP?boIPW|*_IjK*vTqylnfH-Lk5=fP1mX#GnhT2xUju;v7#I;2uzWSFp" +"qE2((YH}MhJ`j1Z%vxzNC&r*n{f+Z%dg;QqE?dSWunraqi!;77Kp5d{JAJt-j){X}Eb!_=*!f+z6Nam%c$7{W<~R|Qz8`<7I*DVy" +"B-5w$lZrr^A&ddHbX(ogvAcyNdXdkptBOhOK#BCN>1M~#LuSwp(Q=W7(eBU%P=IX-1h(7gQMi+w9bdF=PTYqlV`wQqiaEnL?KBd0T" +"4^O=O4i<1Br?|bhAoaWA>OO%2Yf0!$N" +"MGvtpPveu35NpK33r;_y)YW=m>VeYE1GmMc4TLpg6VL)nx{B}O71u2LOK*=h@y_vbMHoOLg~~K*PW~Dc;o$)nBBJMG_8kGUp5U+T" +"_l{0y%warcWQHxMfji@vhc9b&s>33DfS^U1)f{W1cGOw=IK7jHW6$k90mBtVY<5D0%?)&3{d>2V%kD#qd02ChhrHx1fflhC&hR}Awsxx+y@o}#6wefyU<+NMORPFB%P2W0nWiUQSA!jq@!@0{f)A@+d" +"eckOifyWJt!W=bFlBU9nKST6(IM~k&5|kMbqfvO<{_u;N6QfmWsgJ4s<5C}KTszka)YIg*iL)Y`HyrDInzOmLEDlicDl-cne#ez+" +"mu!jz{BZX@OjPuKV1WnFdr=|=#-9n(UkiNFh$X&X4kX@|4K@g>N0|hgs*j*Bv-)VSiR2gC0)k$E_^hE`lx(oL6~K*y1^PfLmmSn_" +"=V1ymF@vwnM$=S#zAX@@NwUZbS1h1Pw3X#^%YsWC&tDNqU;ytjynoZZ$A#A" +"6s+Dhq?OpC++2*;qqHZS5P{gb)s4bVSz$zy*y*a3QS2^`l*ED>BwXC!CzIEiEOk~PMv87f#t~2*!LphK" +"F_BlB5+q<3Ue?Ia*Q{q`BLXLJLm(-NpnRP5gBlCwCOoYRGI>GGQwxYkf9%*3MC+%)@aV*h^y-XmtMRy)x(~;g1&1?3SKlijPpaRX" +"h?t@mxnq?so&)SE6)9O=L=)i(qnR;pvHarmN-ig3Wzy7X~26$m@QIi+&B^0ruC6d)9$r(y3oHKfb*oNS27r1zu>kp+vu7Jx" +";fL@$eHYADw+4(>JFaK52vPQsV!XgqVXY5znoULX*2=dG5l6c3&bo|#Y(d`WOh`}csMz!zzvWUW8B^=M*&J(c@jcp6Icfj_+J2Z+" +"!U9{+3r6Cdosfzqu)9)Uuk!wRbJoV%6q7F}MW6@{_CajN6rb4f7D_r(uwa}vg?lYfShJI3%bS%NoF~?Uh{kQ3BB_9~6|5&NZvPLtS{Ymrkey!Royo6>4NenrKGqzfIy>" +"1M3nbyE|}@;t=x?XIU-sZ~|;jB^Qek6oH@(+H7M+%CY`;#y&~6%i$i^o!DVG*9S#qzduW;Lg}UG=ayN4y2RknMGYD$)ce0P}uoV4d)^" +"4-&FSxY;LYVK-;by(yrHUp0`9I=r8#m35A7K(qA7b2w&nT{zY<>r@(&=j*0=Z;)GX@Bh#jO" +"|F)%T!F1??sUcW9I(MOlv2bf-Lt>p-YmIIW^8MY-(f6_F61Uo~{nlpdXe6bEp-Ih(Vbv$i{3XiYYs-Xgm`i*w|}=" +"Dg!U0ZRWf|XF|Qbq2AVe$h8Pzyc@K%ykkh6df@UOTif~Kj;71*cZw7<>uFo()heQ79#4}W+j(!!7E{XD4w~n2YeXRX38}r+cbl3{" +"^K6WLPkLi8eVtN1&w9$&GzRPke(EGcwYw7q|q" +"1_?_i-WpCEwk*3ciQ{>(ku@q{OY_p6bD1EQt$K1K>UEfX8kVv3#A)o~BfU>?oHmn{fe^lZ1xgsEW5yD)WeZUA)uw$AM$!%I;5*)J" +">Ynq1P}G(UhCZ(2k(YEe6OGV-Zgfzp>A86>e_4r~+)33TEqno5GQQ)}Qtay^qF&ium}t8JG0csqT(Y^W1kp^qr7x;6Xg{vBWMCO6" +"5WVfjAt5T)mFe5h$u2}ltlH&2h=|m7IL2{oa;#oNpB)mjwBLZ;vnkIoDITWBM+F+z*EDHpMbym4R?^1#5ly$#w?YPMMe-5" +"L!-fq`$Ul!3lW9kvMRa4f25V^@QmlV#98xWlquCPm*Zdh6+D7^gmU_L>MBE>$FqVxmr)" +"2v7URgzZt9uznvzj-Hv?y!CS$klQJe*n?RyqCY%I" +"HMsfYgbzqL-d1;l@*o$Kb#7Yl#K{f}(BezN65ez30C#P)>e=<;x2(sRLC?f1Y(*sZoHY$B=BWddxk_dCK^Ud3qk?ZWVO@zDOwZ#K" +"JfNHiMgTHdXEf3cIwqLKv2geD9%nVidX@OzAhx)EEAUmM1fg_{`iky)miVO0>wUQLo-wsbNUn89bw#TNIyxFR5?Q~h_$4`vIw`Bb" +"wiHv%M9n&&%7h=p*+#8){8dU7y^0dlnp}Mkta>7$0U9x&hQv}ZUpwdzC2;Po)?F&>-&i$bl!%D!!)ob-U@Q@Ygp7nr{WLKN!SgTq" +"GYA|h)jO{#kmnum5J8X-=nX0aA{;X>CPUl$r?R|N5ogPWBoMNvi?3EeyEst4Nn#4z+~bZKvQEjQo6dWh)(Y|Ixao80IKidPgy6Nt" +"29Vdtm^TwP{>Z-5Z*P|V7nRd|#Fy%0&{&6C`@Z3zK{ZxFBXx1WdiRx3;YMMU!;MP6C<4nKYM`51)x!++8`n#3Gh@~u}nB|!GTr2i)RKp5?6Gq^fqLdO70~e=EmKc77" +"ynX~0?cIUs`(o-&#{~P`G1%`h9;{$LNEIf8kXE-Xj>+Ngrky|gaA9XD9I6zkB>BN~i8YpoL-gA" +"Fg}~;4x4x&TC!!Mq?zNSO;yn%z6;|hHZLu_AiejypqhwuC?6~_K5zl9g+V3n^@joz(a^L`W6CMH)04(vFthv{P9@Fy=cVUbLIXYe" +"u{Q*x2fc|XblZP=J$PItrs%&83t;jJo5+?DaG%qb@bvb;OcSWToY" +"rR!(KIA_9@%|2sKo^;}&vziU0AMk$tC>u2+e&2w9HQH~6B|}9vt@q%FH`QV_GD*Lb%Q6R;SAJmyqs`45@mgmICFRLGK#1k5!@o`{" +"%Xp;wYG6;)@92=(@5" +"ggvy6do&eKX1LT7W`Sm)f7ZIL*Rh&TaQYUm`zv1gAJw>#u-?=IiOcMw9ZT{1R|bEQ=~%`m2NiwRcN`?mBX4i}HQRyd_oTJ9_KI={" +"ATaZ(<|_V&k%zb_nXH4JqK98DUS(lrZU4c^L%u2h`--}7oxOsl1yGYux!*)sve`n>@QrG&VAt*Fl&R{Oa44F3znW_(K%RG&;W;j+" +"f*AzTAMA%TK#gS9G%nw0(%M-9R?*q|eDsDrd}}F`dVtE9kvIG$?B-(Lc>{u`Y8;!hiF14+3xBm=voJ)Olo3YDqxsXB$J!-r6P~4~" +"jBp5nz#AoTF%ursXGoE%_Jg55y>&fyg^|m7XBhK4@2X1mTU`26sEW`_" +"8#QD>&L-sqt9E@A_ci|eTKK%Jo2J4n4_~a0nVc!wfs0u*%>~)?0" +"s7XrNSSUog(~;4ua7xU|^>D2+tZd!Tr)%Wy3cfYbdpVPh)8c@Xi9A!Ej_wGSbqxuCt=>XtMCVWfLLR=!qBL{y!1T?(=qFgnDB#(U" +"jpqv}SDprQgt@r7WPv2;XZb`UFsdVVE)aI;911a{w^z7B$9#fGwcYt3|C?)ks}Y&xItD>Bp*T8U3Sk!rT@}qn3Z%*1iIb1$J(eZ2" +"0ucIiV%;>N;9hvW8SCB*_QjAE{LOrN)odJ0-l~vUsYkR2LdCFppR8?G;1Gik!cMpgSRR#yx4NnMbg_{0)QdVTRg)U|zY7EjWAp}hhsp@U<4L;h;MAr?S^bUep_<))8qZ`blj@@Fe3EH3|_$QfAprKjDx+H%qfZmJNp" +"wl{b2ZDcSZKPR0g=*>l`bsj=?iR#Yqsw1pGo)DIAEh4pCt}7f0qBtRw3oS7KeW6o_gtoxBu|R!fSa?Njy>RftK5Hd45I~;PIr{VV" +"xh2AUxlH&wG;sE)h_28aT@r56dLC`uksCc7>DJO(0q8AS0N`PBHHUEueqbgt9Dmlt;JA5G%h8Ma2^w*LF^-" +">cv(Yq`8QkHT>^JGqN~tzVoN*9f)ws#sSB(i{otxquW|DD+4)5_jnd{XwxWcC=E?Op({t1r!#wUAL6I$?gG0>jXO8;(Dw(a%C0a6bdru+ir1fpW!T~s5;#PAb29Pt" +"FRLBi*r35iT@-~Qf?yecqZ&!v-kip4X_h+A%t|EO0lG`=2uZBIiF%25y}*pO0FnoU0LfnIxA!r5q+i&837DW*)RH-nJ}r*bkI7kJ" +"8L7gXQbcHnEQ`;^zfVFOGm%^ft^o6!a2~}5H0HM?aY9rCmkjdPa1d6E+i#{@->-GQb)p!$;|qlT=ICf;pVaHCn`1Wj=u<5Y4PK#3" +"FAkKs1C|4BLZixvAX(E_I+*8$c~8XrXU9F=s?bxZ=-6KxA^Ev-m4vAARbdcD-i1@G=InWl_q5g&o6pv`-H?KjG9;" +"oOPmRcJ(M63vCE53_Z&>Q4r_F?#@)gDxl)A0sfH*Efz5sF$W{3#3)L&8Mn0OZJQtA|z!%OX%qMT7!4S2%#eCp=YYMk+V4}4$OgfQhvI;k`jFnNai&C-wP~NPNd$Pz9_Az?*?~jH>jIQdPy%B}uu-ozUh*GC^)x6CDhjP8bVu7uj>P0&|*sB0WQ4|M`+g}C*69Y@;5X$(VhPDvQofkb`a|`Sw" +">;K|~-GHeKvERay+FTsKLGO9;c$*y;F&`mDS2b?-41r-;yDMeMm;BnBouC`gz-k#e1stIC=Vm;(9N0Nm;$" +"7bJ6CQ4n6j;NT>_aa|1~D*J}M7xVtm_9V8OhP&U<21&@UIMDxFREVvR({a^g>arbxO4US>4Z@$k5j(K>>_$M#(|VrmD;+87c^*%;" +"h_oc(Pp{G$%i&SzYnyMpsAT4Co!^4#D0?pqg}(w+Y218c|5)O+2a9YusEbOanuRXc;}(?H@u44L=$`R)sM*AYy8B?I{UR~87j3Ot" +"-BX4dW{iK^*Z6Q9?E*qHA$)&y=c#nw*tuyMI^G?d&+UyY?zcDov-Ada47e06rkdT<;mZBV)u3)K^utBrKZ?Jcc%" +"#JVlwB@(>99<|siu46&%n%#Or7!V~pO?jZufD$hR?|3BMZuzQ*3�f!wp_G-CX6)mi4i<*dp#WwVOn{v*9g9H7AOqyCSGJ{$MqF" +"3(W(#9>e;i@oca&>*c&kIFoe>&>B@e)4+n04BUB03hwKw3QrG(R^;`ysWMRlmT#HsEM7xSO?}sSI>gb?Azai%?2Wpxg+o" +"x@!l*-grP7-8CccZlD%Tc*cZa?U0K;QwfRj)1B{osj%35op4&^DtKI$$68AprfN>}=}|(@`awd*=_b>uO-UiipEd#~|NKt2p+moy" +"D!*$={w&a4nas$P^epK3y!p6FKw_`j&h>yv7y(WP|I^aeqcc7n{#%ssU-E-`Z9x%AI|Mj=v8uFJroPK7sEuBbR6&{ysHbm<>50RM" +"-m&XMS7I$X%>HcKUTaGYFMLumm@|<9(ud2?7>dV7wWjuV=!@t=S%}fXF*TXiyDTp~(TJC#79?gA(" +"3wa@!H1Wp7VzGIPQ8v*OTVhB!z>$rcjz=S74wdZU_}3Pjd$lR7-2^=XRr~*}CKeT$sy6bpsckmDtTKHp_LOM|cx?3-yK$kVkZ}0t{Wb5O1eE9xP|I2>DJ8D&xi+35qM9mjIXp>4x*xsjG7i6djGnI*BGUo*y" +"Qlo%Bb$o{8?u?Y_^}m(-qkcz`ePd`%bOJm=$K%L^i!LE0iJ)k3+WG*@^-1|#Kh| 1.9063 (drop: 5.0208) +convergence_rate: 0.7489 per 1000 steps +swa_checkpoints: 0 +WARNING: step_avg 88.2ms > 70ms threshold. Possible torch.compile issue. +WARNING: artifact 15647124 bytes close to 16MB limit (352876 headroom) +=== END TRAINING ANALYSIS === + +FINAL_METRIC val_bpb: 1.10212542 +EVAL_RESULT_JSON {"candidate": "/home/dex/parameter-golf-with-cc/optimize.py", "seed": 1337, "val_bpb": 1.10212542, "val_loss": 1.86088769, "method": "sliding_window", "metric_name": "final_int6_sliding_window_exact", "metric_source": "legacy_exact_log", "artifact_size_bytes": 15647124, "artifact_limit_bytes": 16000000, "artifact_headroom_bytes": 352876, "total_steps": 6704, "avg_step_ms": 88.17, "elapsed_seconds": 896.3570425510406, "eval_time_ms": 177192, "eval_budget_ms": 600000, "eval_budget_exceeded": false, "status": "pass"} diff --git a/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed2025.log b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed2025.log new file mode 100644 index 0000000000..646610b93e --- /dev/null +++ b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed2025.log @@ -0,0 +1,109 @@ +=== evaluate.py: Starting training === +optimize.py: 23353 bytes +NPROC: 8 +timeout: 1200s +cwd: /home/dex/parameter-golf-with-cc + + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +***************************************** +logs/30016209-9484-49d0-85c0-55ad4b735cd2.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=/home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +model_params:27201116 +XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.03 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000 +seed:2025 +gptq:reserving 9000ms from training budget, effective=591000ms +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9295 val_bpb:4.1040 train_time:0ms step_avg:0.02ms +step:1/20000 train_loss:6.9263 train_time:132ms step_avg:132.03ms +step:2/20000 train_loss:8.7514 train_time:168ms step_avg:84.20ms +step:3/20000 train_loss:7.5457 train_time:255ms step_avg:84.98ms +step:4/20000 train_loss:7.2889 train_time:342ms step_avg:85.61ms +step:5/20000 train_loss:7.0991 train_time:429ms step_avg:85.89ms +step:6/20000 train_loss:7.1450 train_time:516ms step_avg:85.98ms +step:7/20000 train_loss:6.9402 train_time:603ms step_avg:86.13ms +step:8/20000 train_loss:6.7033 train_time:690ms step_avg:86.21ms +step:9/20000 train_loss:6.2786 train_time:776ms step_avg:86.22ms +step:10/20000 train_loss:5.9484 train_time:863ms step_avg:86.29ms +step:500/20000 train_loss:2.3011 train_time:43928ms step_avg:87.86ms +step:1000/20000 train_loss:2.2242 train_time:87928ms step_avg:87.93ms +step:1500/20000 train_loss:2.1392 train_time:131973ms step_avg:87.98ms +step:2000/20000 train_loss:2.1811 train_time:176082ms step_avg:88.04ms +step:2500/20000 train_loss:2.0431 train_time:220196ms step_avg:88.08ms +step:3000/20000 train_loss:2.1009 train_time:264395ms step_avg:88.13ms +step:3500/20000 train_loss:2.0559 train_time:308556ms step_avg:88.16ms +step:4000/20000 train_loss:2.0253 train_time:352718ms step_avg:88.18ms +step:4000/20000 val_loss:2.0213 val_bpb:1.1972 train_time:352773ms step_avg:88.19ms +step:4500/20000 train_loss:2.0418 train_time:396893ms step_avg:88.20ms +step:5000/20000 train_loss:1.9680 train_time:440996ms step_avg:88.20ms +step:5500/20000 train_loss:1.9548 train_time:485144ms step_avg:88.21ms +swa:start step:5900 +step:6000/20000 train_loss:1.9507 train_time:529503ms step_avg:88.25ms +late_qat:enabled step:6096 scale:0.1498 +step:6500/20000 train_loss:1.9056 train_time:574456ms step_avg:88.38ms +step:6684/20000 val_loss:1.9113 val_bpb:1.1320 train_time:591115ms step_avg:88.44ms +stopping_early: wallclock_cap train_time:591115ms step:6684/20000 +peak memory allocated: 23337 MiB reserved: 23386 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9097 val_bpb:1.1310 eval_time:2073ms +Serialized model: 106609335 bytes +Code size: 23349 bytes +gptq:calibrating with 64 batches (training data)... +gptq:calibrated 66 layers in 6.8s +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +Serialized model int6+brotli: 15626917 bytes +Total submission size int6+brotli: 15650266 bytes +final_int6_roundtrip val_loss:1.9155 val_bpb:1.1345 eval_time:6691ms +final_int6_roundtrip_exact val_loss:1.91554389 val_bpb:1.13449299 +final_int6_sliding_window val_loss:1.8609 val_bpb:1.1022 stride:64 eval_time:168622ms +final_int6_sliding_window_exact val_loss:1.86094674 val_bpb:1.10216040 +final_int8_zlib_roundtrip_exact val_loss:1.86094674 val_bpb:1.10216040 + +=== evaluate.py: Finished in 902.3s (exit code: 0) === + +=== EVALUATE.PY TRAINING ANALYSIS === +total_steps: 6684 +avg_step_ms: 88.4 +train_loss: 6.9263 -> 1.9056 (drop: 5.0207) +convergence_rate: 0.7512 per 1000 steps +swa_checkpoints: 0 +WARNING: step_avg 88.4ms > 70ms threshold. Possible torch.compile issue. +WARNING: artifact 15650266 bytes close to 16MB limit (349734 headroom) +=== END TRAINING ANALYSIS === + +FINAL_METRIC val_bpb: 1.10216040 +EVAL_RESULT_JSON {"candidate": "/home/dex/parameter-golf-with-cc/optimize.py", "seed": 2025, "val_bpb": 1.1021604, "val_loss": 1.86094674, "method": "sliding_window", "metric_name": "final_int6_sliding_window_exact", "metric_source": "legacy_exact_log", "artifact_size_bytes": 15650266, "artifact_limit_bytes": 16000000, "artifact_headroom_bytes": 349734, "total_steps": 6684, "avg_step_ms": 88.44, "elapsed_seconds": 902.3265001773834, "eval_time_ms": 177386, "eval_budget_ms": 600000, "eval_budget_exceeded": false, "status": "pass"} diff --git a/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed42.log b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed42.log new file mode 100644 index 0000000000..32a369181a --- /dev/null +++ b/records/track_10min_16mb/2026-03-31_SLOT_SplitLR_GPTQ_XSA_1.1015/train_seed42.log @@ -0,0 +1,109 @@ +=== evaluate.py: Starting training === +optimize.py: 23353 bytes +NPROC: 8 +timeout: 1200s +cwd: /home/dex/parameter-golf-with-cc + + +***************************************** +Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +***************************************** +logs/7d9ea09b-8fa0-4233-a75f-3b26f7232a33.txt +val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/home/dex/parameter-golf-with-cc/data/tokenizers/fineweb_1024_bpe.model +train_loader:dataset:fineweb10B_sp1024 train_shards:80 +val_loader:shards pattern=/home/dex/parameter-golf-with-cc/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632 +model_params:27201116 +XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +world_size:8 grad_accum_steps:1 +sdp_backends:cudnn=False flash=True mem_efficient=False math=False +attention_mode:gqa num_heads:8 num_kv_heads:4 +tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.03 scalar_lr:0.025 +train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000 +seed:42 +gptq:reserving 9000ms from training budget, effective=591000ms +warmup_step:1/20 +warmup_step:2/20 +warmup_step:3/20 +warmup_step:4/20 +warmup_step:5/20 +warmup_step:6/20 +warmup_step:7/20 +warmup_step:8/20 +warmup_step:9/20 +warmup_step:10/20 +warmup_step:11/20 +warmup_step:12/20 +warmup_step:13/20 +warmup_step:14/20 +warmup_step:15/20 +warmup_step:16/20 +warmup_step:17/20 +warmup_step:18/20 +warmup_step:19/20 +warmup_step:20/20 +step:0/20000 val_loss:6.9299 val_bpb:4.1043 train_time:0ms step_avg:0.01ms +step:1/20000 train_loss:6.9263 train_time:133ms step_avg:133.50ms +step:2/20000 train_loss:8.7688 train_time:170ms step_avg:84.91ms +step:3/20000 train_loss:7.5948 train_time:256ms step_avg:85.24ms +step:4/20000 train_loss:7.3235 train_time:342ms step_avg:85.50ms +step:5/20000 train_loss:7.1037 train_time:429ms step_avg:85.73ms +step:6/20000 train_loss:7.1014 train_time:515ms step_avg:85.81ms +step:7/20000 train_loss:6.9497 train_time:601ms step_avg:85.92ms +step:8/20000 train_loss:6.7012 train_time:688ms step_avg:86.04ms +step:9/20000 train_loss:6.2772 train_time:775ms step_avg:86.07ms +step:10/20000 train_loss:5.9729 train_time:861ms step_avg:86.14ms +step:500/20000 train_loss:2.3044 train_time:43865ms step_avg:87.73ms +step:1000/20000 train_loss:2.2291 train_time:87809ms step_avg:87.81ms +step:1500/20000 train_loss:2.1380 train_time:131797ms step_avg:87.86ms +step:2000/20000 train_loss:2.1848 train_time:175909ms step_avg:87.95ms +step:2500/20000 train_loss:2.0413 train_time:219944ms step_avg:87.98ms +step:3000/20000 train_loss:2.0994 train_time:263953ms step_avg:87.98ms +step:3500/20000 train_loss:2.0549 train_time:307946ms step_avg:87.98ms +step:4000/20000 train_loss:2.0263 train_time:351939ms step_avg:87.98ms +step:4000/20000 val_loss:2.0188 val_bpb:1.1957 train_time:351994ms step_avg:88.00ms +step:4500/20000 train_loss:2.0385 train_time:395916ms step_avg:87.98ms +step:5000/20000 train_loss:1.9656 train_time:439897ms step_avg:87.98ms +step:5500/20000 train_loss:1.9540 train_time:483877ms step_avg:87.98ms +swa:start step:5950 +step:6000/20000 train_loss:1.9498 train_time:527972ms step_avg:88.00ms +late_qat:enabled step:6115 scale:0.1498 +step:6500/20000 train_loss:1.9033 train_time:572637ms step_avg:88.10ms +step:6706/20000 val_loss:1.9078 val_bpb:1.1299 train_time:591128ms step_avg:88.15ms +stopping_early: wallclock_cap train_time:591128ms step:6706/20000 +peak memory allocated: 23337 MiB reserved: 23386 MiB +ema:applying EMA weights +DIAGNOSTIC post_ema val_loss:1.9062 val_bpb:1.1289 eval_time:2076ms +Serialized model: 106609335 bytes +Code size: 23349 bytes +gptq:calibrating with 64 batches (training data)... +gptq:calibrated 66 layers in 6.8s +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +gptq_quantize: 66 GPTQ layers, 1 naive layers +Serialized model int6+brotli: 15634712 bytes +Total submission size int6+brotli: 15658061 bytes +final_int6_roundtrip val_loss:1.9120 val_bpb:1.1324 eval_time:6840ms +final_int6_roundtrip_exact val_loss:1.91202666 val_bpb:1.13240989 +final_int6_sliding_window val_loss:1.8576 val_bpb:1.1002 stride:64 eval_time:168407ms +final_int6_sliding_window_exact val_loss:1.85761751 val_bpb:1.10018864 +final_int8_zlib_roundtrip_exact val_loss:1.85761751 val_bpb:1.10018864 + +=== evaluate.py: Finished in 896.6s (exit code: 0) === + +=== EVALUATE.PY TRAINING ANALYSIS === +total_steps: 6706 +avg_step_ms: 88.2 +train_loss: 6.9263 -> 1.9033 (drop: 5.0230) +convergence_rate: 0.7490 per 1000 steps +swa_checkpoints: 0 +WARNING: step_avg 88.2ms > 70ms threshold. Possible torch.compile issue. +WARNING: artifact 15658061 bytes close to 16MB limit (341939 headroom) +=== END TRAINING ANALYSIS === + +FINAL_METRIC val_bpb: 1.10018864 +EVAL_RESULT_JSON {"candidate": "/home/dex/parameter-golf-with-cc/optimize.py", "seed": 42, "val_bpb": 1.10018864, "val_loss": 1.85761751, "method": "sliding_window", "metric_name": "final_int6_sliding_window_exact", "metric_source": "legacy_exact_log", "artifact_size_bytes": 15658061, "artifact_limit_bytes": 16000000, "artifact_headroom_bytes": 341939, "total_steps": 6706, "avg_step_ms": 88.15, "elapsed_seconds": 896.6127007007599, "eval_time_ms": 177323, "eval_budget_ms": 600000, "eval_budget_exceeded": false, "status": "pass"}