diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/README.md b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/README.md new file mode 100644 index 0000000000..4ee5078c0f --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/README.md @@ -0,0 +1,141 @@ + +# Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip + Simplifications — val_bpb 1.08563 + +**val bpb: 1.08563** (5-seed mean, std=0.0007) + +| Seed | Steps | Pre-quant BPB | Post-quant BPB | **Sliding BPB** | Artifact | +|-|-|-|-|-|-| +| 1 | 4988 | 1.08996 | 1.10239 | **1.08554** | 15,987,547 | +| 42 | 4986 | 1.08994 | 1.10345 | **1.08664** | 15,988,983 | +| 1234 | 4989 | 1.08942 | 1.10130 | **1.08463** | 15,983,318 | +| 1337 | 4992 | 1.09079 | 1.10222 | **1.08554** | 15,984,924 | +| 2025 | 4989 | 1.09092 | 1.10239 | **1.08578** | 15,983,617 | +| **Mean** | | 1.09021 | 1.10235 | **1.08563** | 15,985,678 | + +## Changes + +This script builds on [#1218](https://github.com/openai/parameter-golf/pull/1218). The main changes are: +* Increase the vocabulary size from 4096 to 8192. +* GPTQ-quantize the embedding matrix instead of using simple round-to-nearest quantization. The other matrices were already using GPTQ. +* Remove the value embeddings. +* Replace the coprime-stride data loader from [#726](https://github.com/openai/parameter-golf/pull/726) with a simpler ShuffledSequenceLoader. +* Loop layers 4-5 twice (while sharing params): the idea is from [#1204](https://github.com/openai/parameter-golf/pull/1204), but this script uses a simpler implementation and loops twice rather than once. +* Use row-normalized Muon from [#1217](https://github.com/openai/parameter-golf/pull/1217). +* Choose the quantization clip threshold based on the standard deviation of the row rather than searching for a quantile with low reconstruction error. See the note at the end for motivation/details. 
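The standard-deviation-based clip threshold from the last bullet can be sketched as follows. This is a minimal illustrative helper of my own (the hypothetical `sd_clip_quantize` below is not the script's actual quantization code, which combines this clipping rule with GPTQ):

```python
import numpy as np

def sd_clip_quantize(row: np.ndarray, bits: int = 6, k: float = 12.85):
    """Uniformly quantize one weight row, clipping at k standard deviations."""
    c = k * row.std()                       # clip threshold c = k * std(row)
    scale = c / 2 ** (bits - 1)             # width of one quantization bin
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(row / scale), lo, hi).astype(np.int8)
    return q, scale                         # dequantize as q * scale

row = np.random.default_rng(0).normal(scale=0.02, size=4096).astype(np.float32)
q, scale = sd_clip_quantize(row)
print(q.min(), q.max())  # stays inside the int6 range [-32, 31]
```

With $k = 12.85$ essentially no weights in a typical row are actually clipped (a normal sample rarely exceeds a few standard deviations), so the threshold mainly controls the bin width, and hence the entropy of `q`, as discussed in the note at the end.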
+ 

## Requirements

Flash Attention 3 (Hopper) is required. The script imports `flash_attn_interface` directly and was run with PyTorch 2.11.0+cu130. Install commands:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
pip install -r requirements.txt
```

The tokenizer and pre-tokenized data (sp8192) are available on my [HuggingFace](https://huggingface.co/datasets/kevclark/parameter-golf). You can download them with:

```bash
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 128
```

Note this first deletes any existing `data/manifest.json` because the download script caches the manifest locally, and a stale one from the default repo won't include sp8192. Alternatively, to regenerate the tokenizer and dataset from scratch:

```bash
cat > data/tokenizer_specs_8192.json << 'EOF'
[
  {
    "name": "sp_bpe_8192",
    "kind": "sentencepiece_bpe",
    "vocab_size": 8192,
    "tokenizer_train_docs": 5000000
  }
]
EOF
python3 data/download_hf_docs_and_tokenize.py \
  --output-root data \
  --tokenizer-config data/tokenizer_specs_8192.json \
  --skip-byte
```

## Run Command

```bash
RUN_ID=1337 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Quantization–Compression Tradeoffs

Quantization and compression interact in interesting ways. The compressed size depends not just on the bitwidth, but also on the clip range (also called the scale) used during quantization. An int5 quantized network can actually compress smaller than an int4 one if the int5 quantization uses a much wider clip range. The reason is that the effectiveness of compression algorithms like `brotli` depends on the entropy of the data they are compressing, and increasing the clip range can lower that entropy.
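This effect is easy to check empirically. The toy experiment below is my own illustration (not code from the script), using synthetic normal "weights" and `zlib` as a stand-in for `brotli`: int5 with a wide 8-sigma clip compresses smaller than int4 with a 2-sigma clip, because its quantized values are more concentrated near the center bin.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000)  # stand-in for network weights, sigma = 1

def quantize(w: np.ndarray, bits: int, k: float) -> bytes:
    """Clip to [-k*sigma, k*sigma], uniformly quantize into 2**bits bins."""
    c = k * w.std()
    scale = c / 2 ** (bits - 1)
    half = 2 ** (bits - 1)
    q = np.clip(np.round(w / scale), -half, half - 1) + half  # shift to 0..2**bits - 1
    return q.astype(np.uint8).tobytes()

int4_narrow = zlib.compress(quantize(w, 4, 2.0), 9)  # int4, clip at 2 sigma
int5_wide = zlib.compress(quantize(w, 5, 8.0), 9)    # int5, clip at 8 sigma
print(len(int4_narrow), len(int5_wide))  # the int5 encoding compresses smaller
```

The entropy formula derived below predicts roughly 4 bits per weight for the (int4, 2-sigma) setting and roughly 3 bits for the (int5, 8-sigma) setting, which matches the compressed sizes closely.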
+ 

### An example

![Quantization example](quantization_example.png)

Neural network weights are approximately normally distributed (a). In this example, we could clip the weights to [-1, 1] and uniformly quantize them into int5 (b). But this seems a bit wasteful because many of those bins are spent modeling the tails of the distribution, where very few weights lie. Instead, we could clip to [-0.5, 0.5] and use int4 \(c). Or we could go one step further and use a non-uniform quantizer such as [NF4](https://arxiv.org/abs/2305.14314) (d) so there are approximately the same number of weights at each quantized value.

Now here is the surprising part: after compression, int4 is only slightly smaller than int5, and NF4 is quite a bit larger. Why? Because the effectiveness of compression depends not just on the raw number of bits, but also on the entropy of the quantized values. When we moved from int5 to int4, we made the histogram flatter, which increases entropy. NF4 flattens it even further by design, pushing the entropy higher still.

Another view is that the int4 and int5 parameters are mostly the same. The only difference is that the weights that would have been clipped to ±7 by int4 can take on larger values in int5, but as there are very few of them, this does not substantially increase compressed size.

### Mathematical explanation

Suppose our network has $n$ weights and we quantize each one to $b$ bits. The quantized model size is $s_q = n b$. However, we also compress our network after quantizing. A useful first approximation is that the compressed size $s$ is proportional to $H(q)$, the entropy of the quantized weights:

$$s \propto H(q)$$

This is not exact: compressors can also exploit structure beyond the marginal distribution. But neural network weights usually contain much less structure than natural data, so in practice their compressed size is often very close to what their entropy would suggest. So what is $H(q)$?
Suppose our weights are normally distributed:

$$w \sim \mathcal{N}(0, \sigma^2)$$

The differential entropy is

$$H(w) = \frac{1}{2}\log(2\pi e) + \log \sigma$$

Now, suppose we clip our weights between $[-c, c]$ and quantize them into $2^b$ evenly spaced bins, i.e., we uniformly quantize them into int-$b$. Each bin then has width

$$l = \frac{2c}{2^b} = \frac{c}{2^{b-1}}.$$

The entropy of the resulting quantized weights, which we call $q$, is approximately

$$
\begin{aligned}
H(q) &\approx H(w) - \log l \\
&= H(w) - \log(c / 2^{b-1}) \\
&= \frac{1}{2}\log(2\pi e) + \log \sigma - \log c + \log(2^{b-1})
\end{aligned}
$$

If we measure entropy in bits, this becomes

$$H(q) \approx \frac{1}{2}\log_2{\frac{\pi e}{2}} + \log_2{\frac{\sigma}{c}} + b$$

This approximation becomes more accurate when $c \gg \sigma$ (since in that case only a small fraction of the weights are clipped), when $b$ is large enough that the quantization bins are small, and when $n$ is large enough that we still have many weights per bin.

A natural choice is to set the clip range proportional to the standard deviation, writing $c = k\sigma$ for some hyperparameter $k$. This makes the amount of clipping scale-invariant: if the weights become 2x larger, the clip range should also become 2x larger. Substituting $c = k\sigma$ into the expression above gives

$$
\begin{aligned}
H(q) &\approx \frac{1}{2}\log_2\left(\frac{\pi e}{2}\right) + \log_2\left(\frac{\sigma}{k\sigma}\right) + b \\
&= b - \log_2 k + \text{constant}
\end{aligned}
$$

This gives two ways to reduce compressed model size: decrease $b$ (for example, go from int5 to int4), or increase $k$ (use a wider clip range so the quantized values get more concentrated near the center, which lowers their entropy). In fact, increasing $b$ and increasing $k$ have roughly opposite effects. The histogram produced by $(b, k)$ exactly matches the middle $2^b$ bins of $(b + 1, 2k)$. 
The $(b + 1, 2k)$ quantization also includes additional outer bins, but very few weights lie in those bins, so $H(q)$ may not increase by much. This is exactly what we saw in the int5 versus int4 example.

Of course, our approximations do not hold exactly in practice: the derivation ignores clipping, the weight distribution is only approximately normal, and compression depends on the full byte representation, not just the marginal histogram of quantized values. However, when I examined some trained networks, I found the standard deviation of a matrix (an estimate of $\sigma$) correlated very strongly ($R^2=0.995$) with the compression ratio of that matrix under a fixed clip width, suggesting the approximations are reasonable in practice. Lastly, I should note that usually each row is quantized separately, but the same reasoning applies on a per-row basis.

### Improved clipping

The previous practice was to search over multiple clip thresholds to find the one that minimized reconstruction error. In the new version, the clipping threshold for a matrix row is simply set to

$$c=k \cdot \text{std}(\text{row})$$

In practice, I used $b = 6, k = 12.85$ for matrix parameters (tuned so the artifact is close to 16MB) and $b=8, k = 20$ for embeddings (they are more sensitive to quantization). As the above analysis suggests, upping the matrix params to int7 or int8 while doubling/quadrupling $k$ produced similarly sized models, but I stuck with int6 to keep the script consistent with the previous version. Compared with the old approach, the new standard-deviation-based clipping has several advantages:
- **More principled:** It directly accounts for compressed size, not just reconstruction error. In the old approach, changes to the script could unexpectedly change the best clip threshold.
- **Faster:** We only need to run GPTQ once per matrix, rather than once for every candidate clip threshold. 
+- **Easier to tune:** Increasing $k$ monotonically reduces the compressed size, making it easier to control how close the model is to the 16MB cap. diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/quantization_example.png b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/quantization_example.png new file mode 100644 index 0000000000..774dae7efc Binary files /dev/null and b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/quantization_example.png differ diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/submission.json b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/submission.json new file mode 100644 index 0000000000..897e8c2286 --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/submission.json @@ -0,0 +1,48 @@ +{ + "author": "Kevin Clark", + "github_id": "clarkkev", + "name": "SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SD-Clip + Simplifications", + "blurb": "8192-token sentencepiece vocab, GPTQ-quantize embeddings, loop layers 4,5 two times, row-normalized muon, standard-deviation based weight clipping", + "date": "2026-04-05", + "track": "10min_16mb", + "val_loss": 2.80428, + "val_bpb": 1.08563, + "val_bpb_std": 0.0007, + "seeds": [1, 42, 1234, 1337, 2025], + "seed_results": { + "1": { + "val_loss": 2.80405, + "val_bpb": 1.08554, + "artifact_bytes": 15987547, + "steps": 4988 + }, + "42": { + "val_loss": 2.80691, + "val_bpb": 1.08664, + "artifact_bytes": 15988983, + "steps": 4986 + }, + "1234": { + "val_loss": 2.80170, + "val_bpb": 1.08463, + "artifact_bytes": 15983318, + "steps": 4989 + }, + "1337": { + "val_loss": 2.80407, + "val_bpb": 1.08554, + "artifact_bytes": 15984924, + "steps": 4992 + }, + "2025": { + "val_loss": 2.80467, + "val_bpb": 1.08578, + "artifact_bytes": 15983617, + "steps": 4989 + } + }, + "hardware": "8xH100 80GB SXM", + 
"pytorch_version": "2.11.0+cu130", + "bytes_total": 15985678, + "bytes_code": 15516 +} diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_gpt.py b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_gpt.py new file mode 100644 index 0000000000..3bd17c944f --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_gpt.py @@ -0,0 +1,2 @@ +import lzma as L,base64 as B +exec(L.decompress(B.b85decode(";HX|O7hM1}n@VT6Qap3bt~@<3h>ok~)Km^%c^ys%R{D_%yAk9-_tV7^coUOo3$w>`(`ci)t`2F7>r>Ltx>>S2CRw|7ov>Wn1e~_!RLQ=%V9g?)G3yPsu%SBy!lj1PaC-x%dDmCDOZ^r^!)+WWz}ejKXTJ#^U6Ra!};QocHHXQC+4UM!QQ!-N5Xd|%~a(9)bTYIO+>B~8~@lqmri%^qEkQUy074Rh6w7V_#^s9J-3BNA`G;qyR$LYcI?e+loZVWi~B$n=TKFp{%SeHYp{oNWh;U@Ahk8M2$OU%K8B$lb*dRQXd-GR_@*KAZdRdwSd#X_bO(lvJ3fp9Otblkh?o!zlDF02+sRjLV6IqG{ieQx44UY(f20c)^AD5kE{7_@f9?Q-ePHMY$wCTcn5ij2k?>T>CFcZ<|5Bh`%hA!j2d4G(X-Bbwu<(#drck2`tR2eo$wi$p$UEHkQdFiFmlJR#zIG@3*smdlqZ?s>Cn@I!i44iGk>T1KUmKDUWEJXYFF3Mh*&Tbca$esa+z^`enxeV%UmK_#Ex_)>$lBJA(Wj|4yV%J<~unPL@@@KfP=NTcv-SVPiG3BDdu=*>C1izrS~RvqEe6Re7Xf)zp2fR3F%Ntl(>3N{Nxb8vzZkhK?{Rpe$4_KYFAQQI8+y6>9jc4wHeJ2^*|#6%ShCfQ3zc=<|0Z`Ja7Y9E|lQy~CwnZkBE%teQ=I^Ddvss1V~h8|jSxW!&||)rKK)$#K{ntoatClG0YVqyY7Qmr3Mv0zyH;LQ+{zu5s@m_NC1q7o?u96Zu_lD}&wKWnBj8K_s&H!?(%4lDiW`^h3XfN^w7^NZ}p;fxjfP(oriMP93@G`Hokf^LPBO_tUOCA9Y?Ycbc|&Ozk{6p6^Be5B~llA6b0L6cbtv^Vkq!Kv#A|GL0_5%6C1U%s;o?{>h@9KI+Dwrzrw4p&PwOoXNEPg+i3?tCZo5gvp{$48`j*%6?1{L#j37BeDS}J&L-w$u~nWS{$zk4O?Fk!S@ZYnwp=3F(;(vDes=-aZELHq}To7g7z=yeQ2p0GutKswbXKoQOv_RuMXy@@&jO%_NYFmih!-#%5VZ%jXr{MtGs@^!FIB5Ti1=nf3`PswaWThv0ix3RkZHirj3xRCI9tQdV>PHUW`Gf@xZ#GFH)hbx;q0mDMQ>t*;g(bBexR-dM-1Il;vgAt_%BRrh;H4wqb*Y*66qAb1EnJPpeXEZ%!`t7r&ormIj~5cxA|rzYsncwy(;231c6bVX?c==hD)ZwFyKx$_?(6&!V*v&mOsK)uq|_^yCTbxoA4tVoS>UbVE~aVZYcP7^!=Tc|Gly&<&c^@NjtNrVnF~ti+?ZYl?VyPo4&q-f}vM4c>x<04P#~YE5WOSw48K4<8rFW)S<Vd)4@Zgsw^T+{G?dpjlfIe;!9u2jH{e%4h;4*9dkZ*Gg4bP$nw%loV4__yT;gK6!=4^f!ycLqd#?M1=G2p{~+ng&?R2%?^4NrD^;9gt5kmv!`MII>7t
WiTN5G)$EbYt{@dLF;1s+{EEN@Z4TIN;|KBU|96)zXXN(0g|~Hl>Llmt*NDOR*;{@q@+t!@+8Db+>*TBxyHEqsH*Qf$C^o3SU(I^}i$bY>Z0^vChl_9|qxs?{fL(s$0WO~WEV*Ppj$r30u0RC3FR~b_?*&&fs-THkVKfX!a}iNqk?yiZcZP1-o!ofJoe@Lt-6Y9!#@fLlMv_>CvtT?Eu&{-bM*RDBF$)L(^5KMce6{=vh^qE*vOFSeXe6^1@luEQqD^f!B8kKL%zD_b+OY%b^JvF1evI@cod2z#Gtn@g*}LuGphG+4v5{?i+TZCway00x6p*8GW+wzK*JaWT9qBwuHm2H(bx|o~vz*-c9Tl}MM;K?qw2hV*a8B$a985!AdT3%=U&t^}YLz&zc3VyHoVQaY!AGkc5R+3@hs(7L=*U>{`k87aW>@zDa=FV#H7S85*7=GhA!a`t*uD8lPJJA;;auN?7lU60mA7JOVc07Cz;2QLGCPeF?)Q_g5_}UWXBaFC{_!F`e(p}8dv>piS17aQ0hx-UYlf-I5VQX_2K-h2=MB9~+G-+5O_)GPL7eZcN&?Fv#qwRy@LhWv>0zraAXm{oRe2yjk%E2pP`ew2dXvMAg~sQGndiGPpUE8pe7k_jj=CCTJ`y_oy7flkoSR#cS-Z%jPQHjSVF{xH=F=G!bO;S{%sA{F`LZ~s@+&g`F&M`>#9i}NKS;^-T~tbujy%jIkUqvk?Sp|8V?tRv1Y-gvFwQ@CQ#NTrT-c#`*>b__9qK1d?hIMMid-diH5vDx!3Og;GvJHUBbR(IKA_X^tv)3A1V+JoY-{*MoNGhr%2*NhnYskfK}s3gNxXF8im)g`^fj@dG6rmktPCZ>~W{)?4~tAJ5;e+*VaK1bO|d~?I{rq9sn!!+UOqgr>jGST7|&lKd{m%IeBX2a?!L}{xG|_#Y?O#K?-<}nz2-Scshyiw%b%^pr(g-3lqhd`J^ZTgl#UrN8)yds8E0idg*?iQlY#trRwXev*gkT}3RW0Mn|F!c|1VvpcLY4(ENNpG$7k2fhi);e1f9QwlDRVhn%rfjT*4^siFQpm^acbb;Q6yPG-XX{$Hjr$YL)|RQGW~_@)a1IfSrPcu!?x5?OYNMvL*n0+>_gkOa&|KS$}I^Ljq+DK!uCB!}46|P(!1?V6D#}f-?)dqPmjHOu=5^4HzSaHkNZ=Zt?-hUbsKXfoXz&N>gNqfTdl9kuKI*osYZc2O$LFyXE$%2U`EdWX7I#Tf1^lhZPS;5Cm@kC}m;XW-yX{+`db|FZ>byMr`?qFX#kBAzsj5;a2krgNlc<)m#49`nc)7#?&^IlKjBowO(O()I|BE;ZDXO-Ll`)W;LERNSjLg$vMNPaD9{3g_E%S`lq3^S|g%BK`f+w*7p_J6~$AseZ8C>aG20j601^jk#N`XZEdAoPKH$-JruSPpj9Vf%a_*Syw8*51Z@z7`H!Lq>5PbE7&$hSHqN|(Kl>+3WLNZIXC3-Ivbjgys`*4D}!6DLVnyG{}s)K19Ib!zD!+%Oe@NlOf);qUB*I^@Np;HDc+9FA+fg;PbEe$Uev7}ex92p>XP`r`3s|1ux>nVxKjT4s14fy4=ui@{7^LDmBb|nvu-(m7J})H-oSvhYdjh9Ykl?!92*%4-|1eYL?d+EDjv{OD!QkVtp!PU#|yd9X_aRs$GAW%h9*QIt^rp~aRIjMiiEpT0-N#tji2(my1|W#E!XvAnN{vJ|?g8Ote@#-{}PYkDYSCtWBgYuEjhKeVF?PFK~15zuC!GocB#KD)4UjsPnVn?m2yMZWd*#D8ZKfSLuw*S8Ir9}QV0vggw&XdlQTs_A?LHFPzZ2fbpu{65dw8}O*z6X?H4vRK}VG#J}4sSUBh5?TYpX8Qpx*zKMpqxV%6SGWz-7#0U)gWw~1BQ{IhIaxLp8;k`X9$8ExZ46+A_J@lMr15o+qH)5VW|~41M9wg6sffT*e}ZZxjRapG(SlN!=g)0;y0Q++K2vqW*dNgUuk9yhS2q`rQ`T)pP(
EtvFv?X_kYOT<;+sD4pU5f_Uf+Mp>cI-C{hF9oIgg5n=N&mtFys|7+=(heYgD!_9mwn)ay_GXP#Z2jWGw$CDRD5p*Xf+ZA3WL+2Z!TUUq0kqw0)RJg`aE;>x!ajC@o6DN^`!$5c<)6%a>9^G#}kZRJGCrp(2elZ5PPo{KvU^<&)Fjp?g?K6DCoB9);4#&5`wxB<}u`9W1rJkp(+{0#)pT1V>v@+YgRu_~4w^=)iIVt0^Q^W6ZF-+Ggf<98dd+NGl{e%(YHe(DY4$uz?4SQN9(Jo)5YJVs7C5H^xtZia8hVBAf6KWfOvML?m+lZp$zbkomELJpu%Ou#dnqUsK*v1^4GGx@YZy5x8qu**1Cnz)rc|6hKDGuxcrmp~0VG^(RJqB~l{EO;9krh-vLlplnmGuBEY%rl&N#-CgiblyQo*Cc&_$hlM1WI><$e&qY=Pj(o{fziR6W{xOUJ&Q2i#mC*)>&^Ctts)Qhrs*q`2QOBwtSj|fu&tWS{&4VZigZjA|K!F_A45k1SOfB~;fki6L+Z&P^Ex=)?nGw$>y5@>Y!8^0YKoTvm9NF{8}_69ZX=;z>S3t~}Nz1{K2IQZGJj+lI~66$XP&6QVjN!L6-?#$8K!-lx_fthNR>#-tcVmw6B^qs1n=0gxwo+OECb1)}<3`wl0C+}3QXBMlT8ra|MRjGVr0j|dcXW_2f1|l<%>;DX3#$XmpzEl>mM9$|y)7!jh>9GUEuPxZUlt!7ODq3k5JW1$y!KE=p2J{)nzhrZosd@^C!~*6SCgF{8ISaYWc+$gm%Dw1O^QKZpP&!pdT-1)ED-E^iilmUuaJ;F@OSbgYQ|j)=aXd~C`@Q`+KQdh%c6KcrA@?x~z>a32ThdwLyU)nNnqOctNL`6Q16@^(_^-b?CLuKV#mgP6QHQ+bYe*H{V;{ffd9i#WAp(gC`Y%SXi#JmS!~QtTP#`{({sWh?7>SxTF??K6Zcrs*QQV3UgX3*fE>m)-|D=NH9|zkDQFB%UjPjwC?JMI{vAeAcyb3k{zWKg)qPvF(;|2n{0O^IkTx5%}6k!sg*Z&9Ej<9ZgAPIh_(8-yvHsEEoak6ajY#%o2*+5}$&C+Tc3=tancJlN4EF-yQ4p+*8>tBnI0s!jDdGpVz^uIO?fY;9J*r;CbpSd9pS|Md5y}mGSXE|3B4oNPJJKN9k4z<#?<<)g~v(T2TaDF5onzw;wAx5QbxKA%=$MapWU_i0U5{oM|#j7c!B|Q@?IK>USJ5LS)`9!N83pOsIrg6`DrN_$o9y;@UOY|$PB1g{QtVacueYmj5$g5(wYNG7^9>rYew3wtAxuVLuA?r%)=AEi1zvd$Skbaj_qs*jdGBK0U@#0?Ub7)KNyv#mSPBUq19BDX1*QBw5la^WdGmbt!hKLy_YLk@w%WV3mLv6Ap_(jpeWxT2FBce=>YN|DHs*|1ad%&X3X;em#n(W(O2R!bB(p|-p@$MGT@^F?hnJ!{G04ia0(hLQ3Z?(<6aVP62v>2aa+w-Z#C-#GN78{NBoub-%M0{R=2WW{MouN;^VZECFd!E4=?UC8bpc;hRH#QFS~sNz?nK^jNYuSX4+Qk-Ggo1f(AZxe;0l+cp-65QuaT9rMk$+p6ym+3;YiGE}H-Fl`C^-P(Vpcu_FvQTkBS!R>9K|7VNAN;#{621#?UiVNCbu)uHJs{{ssn>5t2;-MdH?-R0W4c`aYypd(thqj($+t7{>={UO_4MJkzx6etUY3`HJMpevgI)9zIe6bXZ|G+QZiz0_M^v=~@+w|g1D90Gh5S0+9l7`&F-zSaXtn@UCH?)YL8E!jVU9u4oLf2vL&V-LX6_%mUM1+J4D6;#GVpDCju!x=*fH|&_cexdu{5kJudUiI5q-6oJKtS{J+hsM+o2n4K17}-o|JbXUkV*(v+X673i|JX@boR3*+WC6bt>yE}m72*#c(xuC5sNhs#2~lo<+%DWEBG^#K(qd_Sme?r6rM&%5%pDnx(w{eF5%S=}S#1mA9OjFdWXLJYgR4GXG>Lc~&8x
kQQfR?9yBfWQ{1{{@KPi~aU$ME#q(-IPuM4&t#lW#wpq`bVZ{AVL{iL&ev`g)VoQg24Zqrfm{tn{9=>Mk6HRfOmWEh=pcekF;DR`nVYAg~L74Mv8z1*Ww?wn(7m}l4J-lFv9gSZYY74=Km9no_@E(|dGHoFnQQZXKA#xBY|?;|qAGWiCvGGbn}QLf^|>a$ss=y(mRN^17k)Z5DpP(kv+NjFt68z!Rb3r4iqkwF9gUpZW*!PA+h@Cq8~_4&6jXQ@JpdAgT}Jp*RXFCd^M&t$iWC3$RJV1ihmm2iLFDsMDAj7`c;DyA@0Mt*gNT*aH`ia?P@lz$%ZsXvHV!^7f^SCCw$)7;QZ)mc$ME0SF+0y<+UE6OS+GsfQctWm4(CEDZ`2&!yg<9I?RRbademjP_cT#kgmmZ~CDQSJu6gCwAAAtmqAH6u(%u`$$`mHk)M+0{pD3(vfN2KEKvw}Kt;tOwffB$++72d5Ley!=|U;@OAiN^<_BF;GHVyQNlHloANx6Yo$i3V@vL6PXtOJSHJI``PB4!gVPjnW(l(zbHdWHSM93dl+XYnDE|v!04|sRnL;CRrr=&6D^Sk-5Xhw!GDc72H=n`}g32{cqy}(5@mJvr~?)T;6~~#X+0E{XJxv=P$~vz1W*HdmT(>x|Jemf8FmF9dvqNX`1LQ4aDEguK{z9=MmfWRgQs8G=7^|VPCGyA6ko2iqlIXG(Nr8>XqKEl865}@u9F%mrPaXF2u6q%HQsw95kW4wThlEr@ab?kYnJ}jz7q39WWcRN@ohL~7Zz;Mb<@9%tpp)0h)~7}R78I~|v@?YGPBhJtOnWhFh|}2f?9nga#fa`C<8CxHQOuOG)}%tj%del=R0FicOy~urv-JQ_)e1OhEnA0u-}Ih>8)Tti4PB=@i9(yok;1`?;k{Sg&lc-tUL1j1Nc#_7&f9Mag{DzAE2AM^p;ZHryc_uT!b8(D+3lcHLJvY!uT+|q%|&h94VmYjsyc1zzAgG|m6S;z-8CR(s{7KgugO88g(WhtIpN{pJ{mKWx9mF-e}DwzHkAcht;PP7>had@I&?LI`Jh84UXj^CupHkMx|24I6`gT5Xja5^mt%j&aQ63=^D?1+T!KCaqQOG+|itF@pJq}pqKTBQm9pkW!PfkVZB@cqP{*2nvVzusZai*BZen|L$>cv2u`7C{5t#;S@Eu-635N4_4j9OGk0?F#!t`R?E-(%4KH>L9e}IBR0XgC^|NUn5QKnCCV&Ez>9UgeL*G6<6RGHf-I=827UHFcRVLNx5`_GvyJf`t?6Uk#dSr!F2r!z2|qIj=(B(XuYlAXB`CMsA=~Nc-{*f$?gG8EMMIZBf61*rus3_AgcAYkn9J)4M^g?0aqjF*7tDVa1icV!dDyd=Xgh)tMV;jye{HA~g4!v1P&mSSiIU6%6$kk8#Ltw!DE^BbGg|lK@#N7~SI+-{tRQDD~Sc>TV(0gQq!2&B26H{lk(R6k2=&t|9dl=)OE}SWb%zO?}NWw^(er{0}GaMXG@eBL!?e-WmMlidgL~#)nRit!Vg8vKHEP^Z2QE{-0c3rywp<4d>`umYgoPi_roj;J!&~GTMoa)oK&>_J=INX#FmqSO}N|-bFGciCv@qdtTqnTTJpfuP8__~U6qd$3s{pw31HhdHguQcV_{c~50ItWLVwz~FdY7RV`HI;RJ-bAK+;oVhPMX||#VZzc-*e9uWR(WJ751FH?u4cC;zhABc?(Y%8hJP!ZlM)U)m++weidgm&Of~p$+CIjqv(5eK(m&nhjlev1bnU2i(&;4=7!N4FndO-52V*@S0E67yO_BFi^Yo_w5HtTy-nAd(AEgtZf&^vyS5gT@bB8!j1tsxHUJs*zS1k=(OG;>(K};{~U1SO~}+^?SvC~SL_+e{@qgd__XPWbf`}pcj`a5>?akUxm85%yC+U1kByUXOyg9FX-4`)y@RUNmkCu-kHs6(Y`qD{iq0r>slwF2=`D)eFZ+whN)`n<2>aKyhzDWYu6X8()$DN2^V1ck3|XH&XDy#M
mI(gO>{OMJ(CmNAO?jN5KJ?TVjI|Eo^>={ftfZFidcSm$tN(5hQhpNsBZ>M71ml8$ioc3c@oNb9z@&+fm<=>{)kta?BF9_b>@PFMXC%9ktS@$#&T%e_;F`J4+pe%E(&VKqL!#mIRpA6R|+&EsdCM$mrXsN$>O306&01vRv8go{xLn;Xd8rwf>=v&+<*@K&~vFuw}fD))at30bA(s%q?>W&+|}vXpmWN9cUPC6u@Y)_9v&!m>Raxp%9a66ZNHip$Sj!C{fC2jJwtbQ=MBUUndM{Ba!T#U9(Zopbo+_hR{hB_>1BPyx9b35>3Br@q9dPbnYC7mCxDbgFh*g_~HF5FGJw*76;@gX6DJswj3DQziu8si1|fRda@>1kLXZ{6}V??f+XpgmyUZSw6|@$g+oK=w);Y?Ew^l;;}?YH*!=tn_rfLk^t_UPdl>WL5%7ue{k4h@Bfv#?>Qc-sLa+bI*(vyAV7hFm@NMfSzDlUXQW64recPf>lt7p<5g2P(ewrO-HfZJq3`^6&iSubi!3dIN+ssbp}4*($G$zP$C4;DR{H^hX^;p1g?NbZ%v7+L65RYXhB^Mj6rBHqU73lWptvjh@XNyE3=r$U@$tt{pLb2+*tQYDIM4`dCylrxaVq$ILQ8Z57n^AlO)x(}*NbBcT*%6i+BHr6+1v>8T!Dovcz4j{fNj@CHy}35|QYTUzcncDu$u9WoR$j>%fOmib{|~ZoF1vsy+saN;wJU#l;iJzgQE4H`{Te8;Q0j$DtMZD6yFy5L}!IoNxtf)N8JJYV~|AJWtwOyoro4t3BHPh!*u$Tgo4#6}+5BiNpX10oFv!2*x??PP>#OU2Nw2`omX&Jr=&d11YsI<^L2Rewix8l>5=(Zih9)M;CB92wLC;hvhlg1g_BVCZO2XrZhC11#8odmVqkDsz2iRA|T2Y6JK(WP_kYS86r*r<#?LaRU%XyQ0H)w$hUF!ZPg$EqD<(Puh_^^x>7RNdROX(+3a!aZDa>(Qx+@?D0qltppGFm+k$=j^f32(co`b|I)~ZK8gKp`@ZdoKecbKbJTNmUe_06js>+vqfHP_*R8x+kr-(O8RyRW^Kl0DdG(F%`Orf%AHyv~b9STcQDw#^pL~~~u2Ymjg+dRp^K)N-U6oAbi{`E{NAl(T!^B5bf^EB&9w0_1i|)57uPRb9ZJl5m9)rpGrR@FTc%x*#J%mq2_=FvAKglbyyS9l^DG&a^>%x^3Wpa0UVm{mK7oR8iX+>x)SUlX#WWVp2(74ohp_kf(2ak-!Gp68EM?qY1DiMGEC-*bL None: + global _logger_hparams + _logger_hparams = h + + +def log(msg, console: bool = True) -> None: + if _logger_hparams is None: + print(msg) + return + if _logger_hparams.is_main_process: + if console: + print(msg) + if _logger_hparams.logfile is not None: + with open(_logger_hparams.logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + +# ---------------------------------------- +# Data Loading +# ---------------------------------------- + +class ValidationData: + def __init__(self, h: Hyperparameters, device: torch.device): + self.sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + if int(self.sp.vocab_size()) != h.vocab_size: + raise ValueError( + f"VOCAB_SIZE={h.vocab_size} does not 
match tokenizer vocab_size={int(self.sp.vocab_size())}" + ) + self.val_tokens = load_validation_tokens(h.val_files, h.eval_seq_len) + self.base_bytes_lut, self.has_leading_space_lut, self.is_boundary_token_lut = ( + build_sentencepiece_luts(self.sp, h.vocab_size, device)) + + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + # The BPB calculation assumes "▁" is its own token so that leading-space bytes + # are counted correctly. See https://github.com/openai/parameter-golf/issues/897 + assert sp.piece_to_id("\u2581") != sp.unk_id(), \ + "Tokenizer must have '▁' (space) as its own token for correct BPB byte counting" + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*. 
+ tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] + + +def load_data_shard(file: Path) -> Tensor: + header_bytes = 256 * np.dtype(" int: + key = str(file) + cached = _SHARD_NTOKENS_CACHE.get(key) + if cached is not None: + return cached + header = np.fromfile(file, dtype=" np.memmap: + key = str(file) + mm = _MMAP_CACHE.get(key) + if mm is not None: + return mm + n = _read_num_tokens(file) + mm = np.memmap(file, mode="r", dtype=" None: + max_phase = min(self.seq_len - 1, max(0, self.num_tokens[si] - self.seq_len - 1)) + phase = int(self.rng.integers(max_phase + 1)) if max_phase > 0 else 0 + num_sequences = (self.num_tokens[si] - 1 - phase) // self.seq_len + sequence_order = self.rng.permutation(num_sequences) + self.start_inds[si] = (phase + sequence_order * self.seq_len).tolist() + + def next_batch(self, global_tokens: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + device_tokens = global_tokens // (self.world_size * grad_accum_steps) + device_batch_size = device_tokens // self.seq_len + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + x = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + y = torch.empty((device_batch_size, self.seq_len), dtype=torch.int64) + for bi in range(device_batch_size): + total = remaining.sum() + if total <= 0: + for si in range(len(self.files)): + self._reset_shard(si) + remaining = np.array([len(s) for s in self.start_inds], dtype=np.float64) + total = remaining.sum() + probs = remaining / total + si = int(self.rng.choice(len(self.files), p=probs)) + start_ind = self.start_inds[si].pop() + remaining[si] -= 1 + mm = _get_shard_memmap(self.files[si]) + window = torch.as_tensor( + np.array(mm[start_ind:start_ind + self.seq_len + 1], dtype=np.int64)) + x[bi] = window[:-1] + y[bi] = 
window[1:] + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + +# ---------------------------------------- +# Model Architecture +# ---------------------------------------- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): + super().__init__() + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if ( + self._cos_cached is None + or self._sin_cached is None + or self._seq_len_cached != seq_len + or self._cos_cached.device != device + ): + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange( + 0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] + self._seq_len_cached = seq_len + return 
self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) + + +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + + +class CausalSelfAttention(nn.Module): + def __init__(self, dim: int, num_heads: int, num_kv_heads: int, + rope_base: float, qk_gain_init: float, train_seq_len: int): + super().__init__() + if dim % num_heads != 0: + raise ValueError("model_dim must be divisible by num_heads") + if num_heads % num_kv_heads != 0: + raise ValueError("num_heads must be divisible by num_kv_heads") + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + if self.head_dim % 2 != 0: + raise ValueError("head_dim must be even for RoPE") + kv_dim = self.num_kv_heads * self.head_dim + self.c_q = CastedLinear(dim, dim, bias=False) + self.c_k = CastedLinear(dim, kv_dim, bias=False) + self.c_v = CastedLinear(dim, kv_dim, bias=False) + self.proj = CastedLinear(dim, dim, bias=False) + self.proj._zero_init = True + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rope_dims = 0 + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=train_seq_len) + self.use_xsa = False + + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) + vn = F.normalize(v, dim=-1).unsqueeze(-2) + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + + def forward(self, x: Tensor) -> Tensor: 
+        bsz, seqlen, dim = x.shape
+        q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim)
+        k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+        v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim)
+        q = F.rms_norm(q, (q.size(-1),))
+        k = F.rms_norm(k, (k.size(-1),))
+        cos, sin = self.rotary(seqlen, x.device, q.dtype)
+        q = apply_rotary_emb(q, cos, sin, self.rope_dims)
+        k = apply_rotary_emb(k, cos, sin, self.rope_dims)
+        q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None]
+        y = flash_attn_3_func(q, k, v, causal=True)
+        if self.use_xsa:
+            y = self._xsa_efficient(y, v)
+        y = y.reshape(bsz, seqlen, dim)
+        return self.proj(y)
+
+
+class MLP(nn.Module):
+    def __init__(self, dim: int, mlp_mult: int):
+        super().__init__()
+        hidden = int(mlp_mult * dim)
+        self.fc = CastedLinear(dim, hidden, bias=False)
+        self.proj = CastedLinear(hidden, dim, bias=False)
+        self.proj._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        return self.proj(F.leaky_relu(self.fc(x), negative_slope=0.5).square())
+
+
+class Block(nn.Module):
+    def __init__(self, dim: int, num_heads: int, num_kv_heads: int, mlp_mult: int,
+                 rope_base: float, qk_gain_init: float, train_seq_len: int,
+                 layer_idx: int = 0, ln_scale: bool = False):
+        super().__init__()
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(
+            dim, num_heads, num_kv_heads, rope_base, qk_gain_init, train_seq_len)
+        self.mlp = MLP(dim, mlp_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+        self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x_in) * self.ln_scale_factor)
+        x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out
+        x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(
+            self.mlp_norm(x_out) * self.ln_scale_factor)
+        return x_out
+
+
+class GPT(nn.Module):
+    def __init__(self, h: Hyperparameters):
+        super().__init__()
+        if h.logit_softcap <= 0.0:
+            raise ValueError(f"logit_softcap must be positive, got {h.logit_softcap}")
+        self.tie_embeddings = h.tie_embeddings
+        self.tied_embed_init_std = h.tied_embed_init_std
+        self.logit_softcap = h.logit_softcap
+        self.tok_emb = nn.Embedding(h.vocab_size, h.embedding_dim)
+        if h.embedding_dim != h.model_dim:
+            self.embed_proj = CastedLinear(h.embedding_dim, h.model_dim, bias=False)
+            self.head_proj = CastedLinear(h.model_dim, h.embedding_dim, bias=False)
+        else:
+            self.embed_proj = None
+            self.head_proj = None
+        self.num_encoder_layers = h.num_layers // 2
+        self.num_decoder_layers = h.num_layers - self.num_encoder_layers
+        self.blocks = nn.ModuleList([
+            Block(h.model_dim, h.num_heads, h.num_kv_heads, h.mlp_mult, h.rope_base,
+                  h.qk_gain_init, h.train_seq_len, layer_idx=i, ln_scale=h.ln_scale)
+            for i in range(h.num_layers)
+        ])
+        if h.rope_dims > 0:
+            head_dim = h.model_dim // h.num_heads
+            for block in self.blocks:
+                block.attn.rope_dims = h.rope_dims
+                block.attn.rotary = Rotary(head_dim, base=h.rope_base, train_seq_len=h.train_seq_len, rope_dims=h.rope_dims)
+        self.final_norm = RMSNorm()
+        self.lm_head = None if h.tie_embeddings else CastedLinear(h.embedding_dim, h.vocab_size, bias=False)
+        if self.lm_head is not None:
+            self.lm_head._zero_init = True
+        if h.xsa_last_n > 0:
+            for i in range(max(0, h.num_layers - h.xsa_last_n), h.num_layers):
+                self.blocks[i].attn.use_xsa = True
+
+        # Layer looping
+        self.looping_active: bool = False
+        if h.num_loops > 0:
+            loop_seg = list(range(h.loop_start, h.loop_end + 1))
+            all_indices = list(range(h.loop_start))
+            for _ in range(h.num_loops + 1):
+                all_indices.extend(loop_seg)
+            all_indices.extend(range(h.loop_end + 1, h.num_layers))
+            num_enc = len(all_indices) // 2
+            self.encoder_indices: list[int] = all_indices[:num_enc]
+            self.decoder_indices: list[int] = all_indices[num_enc:]
+        else:
+            self.encoder_indices = list(range(self.num_encoder_layers))
+            self.decoder_indices = list(range(self.num_encoder_layers, h.num_layers))
+        self.num_skip_weights = min(len(self.encoder_indices), len(self.decoder_indices))
+        self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, h.model_dim, dtype=torch.float32))
+        self.skip_gates = nn.Parameter(torch.zeros(self.num_skip_weights, h.model_dim, dtype=torch.float32)) if h.skip_gates_enabled else None
+
+        self._init_weights()
+
+    def _init_weights(self) -> None:
+        if self.tie_embeddings:
+            nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std)
+        for name, module in self.named_modules():
+            if isinstance(module, nn.Linear):
+                if getattr(module, "_zero_init", False):
+                    nn.init.zeros_(module.weight)
+                elif (module.weight.ndim == 2 and module.weight.shape[0] >= 64 and
+                      module.weight.shape[1] >= 64):
+                    nn.init.orthogonal_(module.weight, gain=1.0)
+
+    def forward_logits(self, input_ids: Tensor) -> Tensor:
+        x = self.tok_emb(input_ids)
+        x = F.rms_norm(x, (x.size(-1),))
+        if self.embed_proj is not None:
+            x = self.embed_proj(x)
+        x0 = x
+        skips: list[Tensor] = []
+        enc_iter = self.encoder_indices if self.looping_active else range(self.num_encoder_layers)
+        dec_iter = self.decoder_indices if self.looping_active else range(self.num_encoder_layers, self.num_encoder_layers + self.num_decoder_layers)
+        for i in enc_iter:
+            x = self.blocks[i](x, x0)
+            skips.append(x)
+        for skip_idx, i in enumerate(dec_iter):
+            if skip_idx < self.num_skip_weights and skips:
+                scaled_skip = self.skip_weights[skip_idx].to(dtype=x.dtype)[None, None, :] * skips.pop()
+                if self.skip_gates is not None:
+                    g = torch.sigmoid(self.skip_gates[skip_idx].to(dtype=x.dtype))[None, None, :]
+                    x = torch.lerp(scaled_skip, x, g)
+                else:
+                    x = x + scaled_skip
+            x = self.blocks[i](x, x0)
+        x = self.final_norm(x)
+        if self.head_proj is not None:
+            x = self.head_proj(x)
+        if self.tie_embeddings:
+            logits_proj = F.linear(x, self.tok_emb.weight)
+        else:
+            logits_proj = self.lm_head(x)
+        return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap)
+
+    def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor:
+        logits = self.forward_logits(input_ids)
+        return F.cross_entropy(
+            logits.reshape(-1, logits.size(-1)).float(), target_ids.reshape(-1), reduction="mean")
+
+
+def classify_param(name: str) -> str:
+    if "tok_emb" in name or "lm_head" in name:
+        return "embed"
+    if ".mlp." in name:
+        return "mlp"
+    if ".attn." in name or (".proj." in name and ".mlp." not in name):
+        return "attn"
+    return "other"
+
+# ----------------------------------------
+# Optimization
+# ----------------------------------------
+
+@torch.compile
+def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor:
+    a, b, c = (3.4445, -4.7750, 2.0315)
+    X = G.bfloat16()
+    X /= X.norm() + eps
+    transposed = G.size(0) > G.size(1)
+    if transposed:
+        X = X.T
+    for _ in range(steps):
+        A = X @ X.T
+        B = b * A + c * A @ A
+        X = a * X + B @ X
+    return X.T if transposed else X
+
+
+class Muon(torch.optim.Optimizer):
+    def __init__(self, params, lr: float, momentum: float, backend_steps: int,
+                 nesterov: bool = True, weight_decay: float = 0.0,
+                 row_normalize: bool = False):
+        super().__init__(
+            params,
+            dict(lr=lr, momentum=momentum, backend_steps=backend_steps,
+                 nesterov=nesterov, weight_decay=weight_decay,
+                 row_normalize=row_normalize),
+        )
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+        distributed = dist.is_available() and dist.is_initialized()
+        world_size = dist.get_world_size() if distributed else 1
+        rank = dist.get_rank() if distributed else 0
+        for group in self.param_groups:
+            params = group["params"]
+            if not params:
+                continue
+            lr = group["lr"]
+            momentum = group["momentum"]
+            backend_steps = group["backend_steps"]
+            nesterov = group["nesterov"]
+            total_params = sum(int(p.numel()) for p in params)
+            updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16)
+            curr = 0
+            for i, p in enumerate(params):
+                if i % world_size == rank and p.grad is not None:
+                    g = p.grad
+                    state = self.state[p]
+                    if "momentum_buffer" not in state:
+                        state["momentum_buffer"] = torch.zeros_like(g)
+                    buf = state["momentum_buffer"]
+                    buf.mul_(momentum).add_(g)
+                    if nesterov:
+                        g = g.add(buf, alpha=momentum)
+                    if group.get("row_normalize", False):
+                        row_norms = g.float().norm(dim=-1, keepdim=True).clamp_min(1e-07)
+                        g = g / row_norms.to(g.dtype)
+                    g = zeropower_via_newtonschulz5(g, steps=backend_steps)
+                    g *= max(1, g.size(0) / g.size(1)) ** 0.5
+                    updates_flat[curr : curr + p.numel()] = g.reshape(-1)
+                curr += p.numel()
+            if distributed:
+                dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM)
+            wd = group.get("weight_decay", 0.0)
+            curr = 0
+            for p in params:
+                if wd > 0.0:
+                    p.data.mul_(1.0 - lr * wd)
+                g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype)
+                p.add_(g, alpha=-lr)
+                curr += p.numel()
+        return loss
+
+
+CONTROL_TENSOR_NAME_PATTERNS = tuple(
+    pattern
+    for pattern in os.environ.get(
+        "CONTROL_TENSOR_NAME_PATTERNS",
+        "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,skip_gates",
+    ).split(",")
+    if pattern
+)
+
+
+class Optimizers():
+    def __init__(self, h: Hyperparameters, base_model: GPT):
+        block_named_params = list(base_model.blocks.named_parameters())
+        matrix_params = [
+            p
+            for name, p in block_named_params
+            if p.ndim == 2 and not any(pattern in name for pattern in
+                                       CONTROL_TENSOR_NAME_PATTERNS)
+        ]
+        scalar_params = [
+            p
+            for name, p in block_named_params
+            if p.ndim < 2 or any(pattern in name for pattern in
+                                 CONTROL_TENSOR_NAME_PATTERNS)
+        ]
+        if base_model.skip_weights.numel() > 0:
+            scalar_params.append(base_model.skip_weights)
+        if base_model.skip_gates is not None and base_model.skip_gates.numel() > 0:
+            scalar_params.append(base_model.skip_gates)
+
+        token_lr = h.tied_embed_lr if h.tie_embeddings else h.embed_lr
+        tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}]
+        self.optimizer_tok = torch.optim.AdamW(
+            tok_params,
+            betas=(h.beta1, h.beta2),
+            eps=h.adam_eps,
+            weight_decay=h.embed_wd,
+            fused=True,
+        )
+        self.optimizer_muon = Muon(
+            matrix_params,
+            lr=h.matrix_lr,
+            momentum=h.muon_momentum,
+            backend_steps=h.muon_backend_steps,
+            weight_decay=h.muon_wd,
+            row_normalize=h.muon_row_normalize,
+        )
+        for group in self.optimizer_muon.param_groups:
+            group["base_lr"] = h.matrix_lr
+        self.optimizer_scalar = torch.optim.AdamW(
+            [{"params": scalar_params, "lr": h.scalar_lr, "base_lr": h.scalar_lr}],
+            betas=(h.beta1, h.beta2),
+            eps=h.adam_eps,
+            weight_decay=h.adam_wd,
+            fused=True,
+        )
+        self.optimizers = [self.optimizer_tok, self.optimizer_muon, self.optimizer_scalar]
+        if base_model.lm_head is not None:
+            self.optimizer_head = torch.optim.Adam(
+                [{"params": [base_model.lm_head.weight], "lr": h.head_lr, "base_lr": h.head_lr}],
+                betas=(h.beta1, h.beta2),
+                eps=h.adam_eps,
+                fused=True,
+            )
+            self.optimizers.insert(1, self.optimizer_head)
+        else:
+            self.optimizer_head = None
+
+    def __iter__(self):
+        return iter(self.optimizers)
+
+    def zero_grad_all(self) -> None:
+        for opt in self.optimizers:
+            opt.zero_grad(set_to_none=True)
+
+    def step(self):
+        for opt in self.optimizers:
+            opt.step()
+        self.zero_grad_all()
+
+# ----------------------------------------
+# Quantization
+# ----------------------------------------
+
+def restore_fp32_params(model: nn.Module) -> None:
+    for module in model.modules():
+        if isinstance(module, CastedLinear):
+            module.float()
+    for name, param in model.named_parameters():
+        if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32:
+            param.data = param.data.float()
+
+
+def collect_hessians(
+    model: nn.Module,
+    train_loader: ShuffledSequenceLoader,
+    h: Hyperparameters,
+    device: torch.device,
+    n_calibration_batches: int = 64,
+) -> dict[str, Tensor]:
+    hessians: dict[str, Tensor] = {}
+    hooks = []
+
+    def make_hook(name: str):
+        def hook_fn(module, inp, out):
+            x = inp[0].detach().float()
+            if x.ndim == 3:
+                x = x.reshape(-1, x.shape[-1])
+            if name not in hessians:
+                hessians[name] = torch.zeros(
+                    x.shape[1], x.shape[1], dtype=torch.float32, device=device
+                )
+            hessians[name].addmm_(x.T, x)
+        return hook_fn
+
+    for name, module in model.named_modules():
+        if isinstance(module, CastedLinear) and module.weight.numel() > 65536:
+            cat = classify_param(name + ".weight")
+            if cat in ("mlp", "attn"):
+                hooks.append(module.register_forward_hook(make_hook(name + ".weight")))
+
+    if model.tie_embeddings:
+        hook_module = model.head_proj if model.head_proj is not None else model.final_norm
+        def make_output_hook(name: str):
+            def hook_fn(module, inp, out):
+                x = out.detach().float()
+                if x.ndim == 3:
+                    x = x.reshape(-1, x.shape[-1])
+                if name not in hessians:
+                    hessians[name] = torch.zeros(
+                        x.shape[1], x.shape[1], dtype=torch.float32, device=device
+                    )
+                hessians[name].addmm_(x.T, x)
+            return hook_fn
+        hooks.append(hook_module.register_forward_hook(make_output_hook("tok_emb.weight")))
+
+    model.eval()
+    with torch.no_grad():
+        for _ in range(n_calibration_batches):
+            x, _ = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps)
+            model.forward_logits(x)
+
+    for hook in hooks:
+        hook.remove()
+
+    for name in hessians:
+        hessians[name] = hessians[name].cpu() / n_calibration_batches
+
+    return hessians
+
+
+def gptq_quantize_weight(
+    w: Tensor,
+    H: Tensor,
+    clip_sigmas: float = 3.0,
+    clip_range: int = 63,
+    block_size: int = 128,
+) -> tuple[Tensor, Tensor]:
+    W_orig = w.float().clone()
+    rows, cols = W_orig.shape
+    H = H.float().clone()
+
+    dead = torch.diag(H) == 0
+    H[dead, dead] = 1
+    damp = 0.01 * H.diag().mean()
+    H.diagonal().add_(damp)
+
+    perm = torch.argsort(H.diag(), descending=True)
+    invperm = torch.argsort(perm)
+    W_perm = W_orig[:, perm].clone()
+    W_perm[:, dead[perm]] = 0
+    H = H[perm][:, perm]
+
+    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
+    Hinv = torch.linalg.cholesky(Hinv, upper=True)
+
+    row_std = W_orig.std(dim=1)
+    s = (clip_sigmas * row_std / clip_range).clamp_min(1e-10).to(torch.float16)
+    sf = s.float()
+
+    Q = torch.zeros(rows, cols, dtype=torch.int8)
+    W_work = W_perm.clone()
+    for i1 in range(0, cols, block_size):
+        i2 = min(i1 + block_size, cols)
+        W_block = W_work[:, i1:i2].clone()
+        Hinv_block = Hinv[i1:i2, i1:i2]
+        Err = torch.zeros(rows, i2 - i1)
+        for j in range(i2 - i1):
+            w_col = W_block[:, j]
+            d = Hinv_block[j, j]
+            q_col = torch.clamp(torch.round(w_col / sf), -clip_range, clip_range)
+            Q[:, i1 + j] = q_col.to(torch.int8)
+            err = (w_col - q_col.float() * sf) / d
+            Err[:, j] = err
+            W_block[:, j:] -= err.unsqueeze(1) * Hinv_block[j, j:].unsqueeze(0)
+        if i2 < cols:
+            W_work[:, i2:] -= Err @ Hinv[i1:i2, i2:]
+
+    return Q[:, invperm], s
+
+
+def gptq_mixed_quantize(
+    state_dict: dict[str, Tensor],
+    hessians: dict[str, Tensor],
+    h: Hyperparameters,
+) -> tuple[dict[str, Tensor], dict[str, object]]:
+    result: dict[str, Tensor] = {}
+    meta: dict[str, object] = {}
+
+    for name, tensor in state_dict.items():
+        t = tensor.detach().cpu().contiguous()
+        if not t.is_floating_point() or t.numel() <= 65536:
+            result[name] = t.to(torch.float16) if t.is_floating_point() else t
+            meta[name] = "passthrough (float16)"
+            continue
+        cs = h.embed_clip_sigmas if "tok_emb" in name else h.matrix_clip_sigmas
+        bits = h.embed_bits if "tok_emb" in name else h.matrix_bits
+        q, s = gptq_quantize_weight(
+            t, hessians[name], clip_sigmas=cs, clip_range=2**(bits - 1) - 1)
+        result[name + ".q"] = q
+        result[name + ".scale"] = s
+        meta[name] = f"gptq (int{bits})"
+
+    categories = collections.defaultdict(set)
+    for name, cat in meta.items():
+        short = re.sub(r'\.\d+$', '', re.sub(r'blocks\.\d+', 'blocks', name))
+        categories[cat].add(short)
+    log("Quantized weights:")
+    for cat in sorted(categories):
+        log(f"  {cat}: {', '.join(sorted(categories[cat]))}")
+
+    return result, meta
+
+
+def dequantize_mixed(result: dict[str, Tensor], meta: dict[str, object],
+                     template_sd: dict[str, Tensor]) -> dict[str, Tensor]:
+    out: dict[str, Tensor] = {}
+    for name, orig in template_sd.items():
+        info = meta.get(name)
+        if info is None:
+            continue
+        orig_dtype = orig.dtype
+        if "passthrough" in info:
+            t = result[name]
+            if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16):
+                t = t.to(orig_dtype)
+            out[name] = t
+            continue
+        q, s = result[name + ".q"], result[name + ".scale"]
+        if s.ndim > 0:
+            out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype)
+        else:
+            out[name] = (q.float() * float(s.item())).to(orig_dtype)
+    return out
+
+
+_BSHF_MAGIC = b"BSHF"
+
+
+def _byte_shuffle(data: bytes, stride: int = 2) -> bytes:
+    if stride <= 1 or len(data) < stride:
+        return data
+    src = np.frombuffer(data, dtype=np.uint8)
+    n = len(src)
+    out = np.empty(n, dtype=np.uint8)
+    dest_off = 0
+    for pos in range(stride):
+        chunk = src[pos::stride]
+        out[dest_off:dest_off + len(chunk)] = chunk
+        dest_off += len(chunk)
+    return _BSHF_MAGIC + bytes([stride]) + out.tobytes()
+
+
+def _byte_unshuffle(data: bytes) -> bytes:
+    if len(data) < 5 or data[:4] != _BSHF_MAGIC:
+        return data
+    stride = data[4]
+    if stride < 2:
+        return data[5:]
+    payload = np.frombuffer(data, dtype=np.uint8, offset=5)
+    n = len(payload)
+    out = np.empty(n, dtype=np.uint8)
+    src_off = 0
+    for pos in range(stride):
+        chunk_len = n // stride + (1 if pos < n % stride else 0)
+        out[pos::stride][:chunk_len] = payload[src_off:src_off + chunk_len]
+        src_off += chunk_len
+    return out.tobytes()
+
+
+def _compress(data: bytes, compressor: str) -> bytes:
+    data = _byte_shuffle(data)
+    if compressor == "lzma":
+        return lzma.compress(data, preset=6)
+    elif compressor == "brotli":
+        import brotli
+        return brotli.compress(data, quality=11)
+    raise ValueError(f"Unknown compressor: {compressor!r}")
+
+
+def _decompress(data: bytes, compressor: str) -> bytes:
+    if compressor == "lzma":
+        raw = lzma.decompress(data)
+    elif compressor == "brotli":
+        import brotli
+        raw = brotli.decompress(data)
+    else:
+        raise ValueError(f"Unknown compressor: {compressor!r}")
+    raw = _byte_unshuffle(raw)
+    return raw
+
+
+def serialize(h: Hyperparameters, base_model: torch.nn.Module, code: str) -> tuple[int, int]:
+    code_bytes = len(code.encode("utf-8"))
+    if h.is_main_process:
+        torch.save(base_model.state_dict(), h.model_path)
+        model_bytes = os.path.getsize(h.model_path)
+        log(f"Serialized model: {model_bytes} bytes")
+        log(f"Code size: {code_bytes} bytes")
+
+    sd_cpu = {k: v.detach().cpu() for k, v in base_model.state_dict().items()}
+    device = torch.device("cuda", h.local_rank)
+    log("GPTQ:collecting Hessians from calibration data...")
+    t0 = time.perf_counter()
+    calib_loader = ShuffledSequenceLoader(h, device)
+    hessians = collect_hessians(
+        base_model, calib_loader, h, device,
+        n_calibration_batches=h.gptq_calibration_batches,
+    )
+    log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter() - t0:.1f}s")
+    quant_result, quant_meta = gptq_mixed_quantize(sd_cpu, hessians, h)
+
+    quant_buf = io.BytesIO()
+    torch.save({"w": quant_result, "m": quant_meta}, quant_buf)
+    quant_raw = quant_buf.getvalue()
+    quant_blob = _compress(quant_raw, h.compressor)
+    quant_file_bytes = len(quant_blob)
+    bytes_total = quant_file_bytes + code_bytes
+    if h.is_main_process:
+        with open(h.quantized_model_path, "wb") as f:
+            f.write(quant_blob)
+    log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes")
+    log(f"Total submission size quantized+{h.compressor}: {bytes_total} bytes")
+    return bytes_total, quant_file_bytes
+
+
+def deserialize(h: Hyperparameters, device: torch.device) -> GPT:
+    eval_model = GPT(h).to(device).bfloat16()
+    restore_fp32_params(eval_model)
+    sd_cpu = {k: v.detach().cpu() for k, v in eval_model.state_dict().items()}
+
+    with open(h.quantized_model_path, "rb") as f:
+        quant_blob_disk = f.read()
+    quant_state = torch.load(
+        io.BytesIO(_decompress(quant_blob_disk, h.compressor)),
+        map_location="cpu",
+    )
+    deq_state = dequantize_mixed(quant_state["w"], quant_state["m"], sd_cpu)
+    eval_model.load_state_dict(deq_state, strict=True)
+
+    return eval_model
+
+# ----------------------------------------
+# Evaluation
+# ----------------------------------------
+
+def _loss_bpb(loss_sum, token_count, byte_count) -> tuple[float, float]:
+    val_loss = (loss_sum / token_count).item()
+    val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item())
+    return val_loss, val_bpb
+
+
+def eval_val(
+    h: Hyperparameters,
+    device: torch.device,
+    val_data: ValidationData,
+    model: nn.Module
+) -> tuple[float, float]:
+    seq_len = h.eval_seq_len
+    local_batch_tokens = h.val_batch_tokens // (h.world_size * h.grad_accum_steps)
+    if local_batch_tokens < seq_len:
+        raise ValueError(
+            "VAL_BATCH_SIZE must provide at least one sequence per rank; "
+            f"got VAL_BATCH_SIZE={h.val_batch_tokens}, WORLD_SIZE={h.world_size}, "
+            f"GRAD_ACCUM_STEPS={h.grad_accum_steps}, seq_len={seq_len}"
+        )
+    local_batch_seqs = local_batch_tokens // seq_len
+    total_seqs = (val_data.val_tokens.numel() - 1) // seq_len
+    seq_start = (total_seqs * h.rank) // h.world_size
+    seq_end = (total_seqs * (h.rank + 1)) // h.world_size
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    model.eval()
+    with torch.inference_mode():
+        for batch_seq_start in range(seq_start, seq_end, local_batch_seqs):
+            batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end)
+            raw_start = batch_seq_start * seq_len
+            raw_end = batch_seq_end * seq_len + 1
+            local = val_data.val_tokens[raw_start:raw_end].to(
+                device=device, dtype=torch.int64, non_blocking=True)
+            x = local[:-1].reshape(-1, seq_len)
+            y = local[1:].reshape(-1, seq_len)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                batch_loss = model(x, y).detach()
+            batch_token_count = float(y.numel())
+            val_loss_sum += batch_loss.to(torch.float64) * batch_token_count
+            val_token_count += batch_token_count
+            prev_ids = x.reshape(-1)
+            tgt_ids = y.reshape(-1)
+            token_bytes = val_data.base_bytes_lut[tgt_ids].to(dtype=torch.int16)
+            token_bytes += (val_data.has_leading_space_lut[tgt_ids] &
+                            ~val_data.is_boundary_token_lut[prev_ids]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    model.train()
+    return _loss_bpb(val_loss_sum, val_token_count, val_byte_count)
+
+
+def eval_val_sliding(
+    h: Hyperparameters,
+    device: torch.device,
+    val_data: ValidationData,
+    base_model: nn.Module,
+    batch_seqs: int = 32
+) -> tuple[float, float]:
+    base_model.eval()
+    logits_fn = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True)
+
+    seq_len = h.eval_seq_len
+    context_size = seq_len - h.eval_stride
+    total_tokens = val_data.val_tokens.numel() - 1
+
+    window_starts = [ws for ws in range(0, total_tokens, h.eval_stride)
+                     if ws + context_size < total_tokens]
+
+    total_windows = len(window_starts)
+    my_s = (total_windows * h.rank) // h.world_size
+    my_e = (total_windows * (h.rank + 1)) // h.world_size
+    my_windows = window_starts[my_s:my_e]
+
+    loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    token_count = torch.zeros((), device=device, dtype=torch.float64)
+    byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    with torch.inference_mode():
+        for bi in range(0, len(my_windows), batch_seqs):
+            batch_ws = my_windows[bi:bi + batch_seqs]
+            bsz = len(batch_ws)
+
+            x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device)
+            wlens: list[int] = []
+
+            for i, ws in enumerate(batch_ws):
+                we = min(ws + seq_len, total_tokens)
+                wlen = we - ws
+                wlens.append(wlen)
+                chunk = val_data.val_tokens[ws:we + 1].to(dtype=torch.int64, device=device)
+                x_batch[i, :wlen] = chunk[:-1]
+                y_batch[i, :wlen] = chunk[1:]
+
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                logits = logits_fn(x_batch)
+
+            nll = F.cross_entropy(
+                logits.reshape(-1, logits.size(-1)).float(),
+                y_batch.reshape(-1),
+                reduction="none",
+            ).reshape(bsz, seq_len)
+
+            for i, ws in enumerate(batch_ws):
+                wlen = wlens[i]
+                s = 0 if ws == 0 else context_size
+                scored_nll = nll[i, s:wlen].to(torch.float64)
+                loss_sum += scored_nll.sum()
+                token_count += float(wlen - s)
+                tgt = y_batch[i, s:wlen]
+                prev = x_batch[i, s:wlen]
+                tb = val_data.base_bytes_lut[tgt].to(torch.float64)
+                tb += (val_data.has_leading_space_lut[tgt] &
+                       ~val_data.is_boundary_token_lut[prev]).to(torch.float64)
+                byte_count += tb.sum()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(byte_count, op=dist.ReduceOp.SUM)
+
+    base_model.train()
+    return _loss_bpb(loss_sum, token_count, byte_count)
+
+
+def timed_eval(label: str, fn, *args, **kwargs) -> tuple[float, float]:
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+    val_loss, val_bpb = fn(*args, **kwargs)
+    torch.cuda.synchronize()
+    elapsed_ms = 1000.0 * (time.perf_counter() - t0)
+    log(f"{label} val_loss:{val_loss:.8f} val_bpb:{val_bpb:.8f} eval_time:{elapsed_ms:.0f}ms")
+    return val_loss, val_bpb
+
+
+# ----------------------------------------
+# Training
+# ----------------------------------------
+
+def train_model(h: Hyperparameters, device: torch.device, val_data: ValidationData):
+    # Set up model
+    base_model = GPT(h).to(device).bfloat16()
+    restore_fp32_params(base_model)
+    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
+    if h.distributed:
+        model = DDP(compiled_model, device_ids=[h.local_rank], broadcast_buffers=False)
+    else:
+        model = compiled_model
+    log(f"model_params:{sum(p.numel() for p in base_model.parameters())}")
+
+    # Set up optimizer and load train data
+    optimizers = Optimizers(h, base_model)
+    train_loader = ShuffledSequenceLoader(h, device)
+
+    # Helper functions for training
+    max_wallclock_ms = 1000.0 * h.max_wallclock_seconds if h.max_wallclock_seconds > 0 else None
+    if max_wallclock_ms is not None:
+        max_wallclock_ms -= h.gptq_reserve_seconds * 1000.0
+        log(f"gptq:reserving {h.gptq_reserve_seconds:.0f}s, effective={max_wallclock_ms:.0f}ms")
+
+    def training_frac(step: int, elapsed_ms: float) -> float:
+        if max_wallclock_ms is None:
+            return step / max(h.iterations, 1)
+        return elapsed_ms / max(max_wallclock_ms, 1e-9)
+
+    def lr_mul(frac: float) -> float:
+        if h.warmdown_frac <= 0:
+            return 1.0
+        if frac >= 1.0 - h.warmdown_frac:
+            return max((1.0 - frac) / h.warmdown_frac, h.min_lr)
+        return 1.0
+
+    def step_fn(step, lr_scale):
+        optimizers.zero_grad_all()
+        train_loss = torch.zeros((), device=device)
+        for micro_step in range(h.grad_accum_steps):
+            if h.distributed:
+                model.require_backward_grad_sync = micro_step == h.grad_accum_steps - 1
+            x, y = train_loader.next_batch(h.train_batch_tokens, h.grad_accum_steps)
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
+                loss = model(x, y)
+            train_loss += loss.detach()
+            (loss / h.grad_accum_steps).backward()
+        train_loss /= h.grad_accum_steps
+
+        frac = min(step / h.muon_momentum_warmup_steps, 1.0) if h.muon_momentum_warmup_steps > 0 else 1.0
+        muon_momentum = (1 - frac) * h.muon_momentum_warmup_start + frac * h.muon_momentum
+        for group in optimizers.optimizer_muon.param_groups:
+            group["momentum"] = muon_momentum
+
+        for opt in optimizers:
+            for group in opt.param_groups:
+                group["lr"] = group["base_lr"] * lr_scale
+
+        if h.grad_clip_norm > 0:
+            torch.nn.utils.clip_grad_norm_(base_model.parameters(), h.grad_clip_norm)
+
+        optimizers.step()
+        return train_loss
+
+    # Model warmup
+    if h.warmup_steps > 0:
+        initial_model_state = {name: tensor.detach().cpu().clone()
+                               for name, tensor in base_model.state_dict().items()}
+        initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers]
+        model.train()
+        for warmup_step in range(h.warmup_steps):
+            step_fn(warmup_step, 1.0)
+            if warmup_step <= 5 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == h.warmup_steps:
+                log(f"warmup_step: {warmup_step + 1}/{h.warmup_steps}")
+        if h.num_loops > 0:
+            base_model.looping_active = True
+            log(f"loop_warmup:enabled encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}")
+            for warmup_step in range(h.warmup_steps):
+                step_fn(warmup_step, 1.0)
+                if warmup_step <= 5 or (warmup_step + 1) % 10 == 0 or warmup_step + 1 == h.warmup_steps:
+                    log(f"loop_warmup_step: {warmup_step + 1}/{h.warmup_steps}")
+            base_model.looping_active = False
+        base_model.load_state_dict(initial_model_state, strict=True)
+        for opt, state in zip(optimizers, initial_optimizer_states, strict=True):
+            opt.load_state_dict(state)
+        optimizers.zero_grad_all()
+        if h.distributed:
+            model.require_backward_grad_sync = True
+        train_loader = ShuffledSequenceLoader(h, device)
+
+    # Training loop
+    ema_state = {name: t.detach().float().clone() for name, t in base_model.state_dict().items()}
+    ema_decay = h.ema_decay
+
+    training_time_ms = 0.0
+    stop_after_step: int | None = None
+    torch.cuda.synchronize()
+    t0 = time.perf_counter()
+
+    step = 0
+    while True:
+        last_step = step == h.iterations or (stop_after_step is not None and step >= stop_after_step)
+
+        should_validate = last_step or (h.val_loss_every > 0 and step % h.val_loss_every == 0)
+        if should_validate:
+            torch.cuda.synchronize()
+            training_time_ms += 1000.0 * (time.perf_counter() - t0)
+            val_loss, val_bpb = eval_val(h, device, val_data, model)
+            log(f"{step}/{h.iterations} val_loss: {val_loss:.4f} val_bpb: {val_bpb:.4f}")
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+
+        if last_step:
+            if stop_after_step is not None and step < h.iterations:
+                log(
+                    f"stopping_early: wallclock_cap train_time: {training_time_ms:.0f}ms "
+                    f"step: {step}/{h.iterations}"
+                )
+            break
+
+        elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+        frac = training_frac(step, elapsed_ms)
+        scale = lr_mul(frac)
+        if h.num_loops > 0 and not base_model.looping_active and frac >= h.enable_looping_at:
+            base_model.looping_active = True
+            log(f"layer_loop:enabled step:{step} frac:{frac:.3f} encoder:{base_model.encoder_indices} decoder:{base_model.decoder_indices}")
+        train_loss = step_fn(step, scale)
+
+        with torch.no_grad():
+            for name, t in base_model.state_dict().items():
+                ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay)
+
+        step += 1
+        approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0)
+
+        should_log_train = (
+            h.train_log_every > 0
+            and (step <= 5 or step % h.train_log_every == 0 or stop_after_step is not None)
+        )
+        if should_log_train:
+            tok_per_sec = step * h.train_batch_tokens / (approx_training_time_ms / 1000.0)
+            log(
+                f"{step}/{h.iterations} train_loss: {train_loss.item():.4f} "
+                f"train_time: {approx_training_time_ms / 60000:.1f}m tok/s: {tok_per_sec:.0f}"
+            )
+
+        reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms
+        if h.distributed and max_wallclock_ms is not None:
+            reached_cap_tensor = torch.tensor(int(reached_cap), device=device)
+            dist.all_reduce(reached_cap_tensor, op=dist.ReduceOp.MAX)
+            reached_cap = bool(reached_cap_tensor.item())
+        if stop_after_step is None and reached_cap:
+            stop_after_step = step
+
+    log(
+        f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB "
+        f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB"
+    )
+
+    # Weight averaging
+    log("ema:applying EMA weights")
+    current_state = base_model.state_dict()
+    avg_state = {name: t.to(dtype=current_state[name].dtype) for name, t in ema_state.items()}
+    base_model.load_state_dict(avg_state, strict=True)
+
+    return base_model, compiled_model
+
+
+def train_and_eval(h: Hyperparameters, device: torch.device) -> None:
+    random.seed(h.seed)
+    np.random.seed(h.seed)
+    torch.manual_seed(h.seed)
+    torch.cuda.manual_seed_all(h.seed)
+
+    val_data = ValidationData(h, device)
+    log(f"train_shards: {len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')))}")
+    log(f"val_tokens: {val_data.val_tokens.numel() - 1}")
+
+    base_model, compiled_model = train_model(h, device, val_data)
+    torch._dynamo.reset()
+    timed_eval("pre-quantization post-ema", eval_val, h, device, val_data, compiled_model)
+
+    serialize(h, base_model, Path(__file__).read_text(encoding="utf-8"))
+    if h.distributed:
+        dist.barrier()
+    eval_model = deserialize(h, device)
+    if h.num_loops > 0:
+        eval_model.looping_active = True
+
+    compiled_model = torch.compile(eval_model, dynamic=False, fullgraph=True)
+    timed_eval("quantized", eval_val, h, device, val_data, compiled_model)
+    if h.sliding_window_enabled:
+        timed_eval("quantized_sliding_window", eval_val_sliding, h, device, val_data, eval_model)
+
+
+def main():
+    world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
+    distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ
+
+    if not torch.cuda.is_available():
+        raise RuntimeError("CUDA is required")
+    if world_size <= 0:
+        raise ValueError(f"WORLD_SIZE must be positive, got {world_size}")
+    if 8 % world_size != 0:
+        raise ValueError(f"WORLD_SIZE={world_size} must divide 8 so grad_accum_steps stays integral")
+
+    device = torch.device("cuda", local_rank)
+    torch.cuda.set_device(device)
+    if distributed:
+        dist.init_process_group(backend="nccl", device_id=device)
+        dist.barrier()
+
+    torch.backends.cuda.matmul.allow_tf32 = True
+    torch.backends.cudnn.allow_tf32 = True
+    torch.set_float32_matmul_precision("high")
+    from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
+
+    enable_cudnn_sdp(False)
+    enable_flash_sdp(True)
+    enable_mem_efficient_sdp(False)
+    enable_math_sdp(False)
+    torch._dynamo.config.optimize_ddp = False
+
+    h = Hyperparameters()
+    set_logging_hparams(h)
+    if h.is_main_process:
+        os.makedirs("logs", exist_ok=True)
+    log(100 * "=", console=False)
+    log("Hyperparameters:", console=True)
+    for k, v in sorted(vars(type(h)).items()):
+        if not k.startswith("_"):
+            log(f"  {k}: {v}", console=True)
+    log("=" * 100, console=False)
+    log(f"Running Python {sys.version}", console=False)
+    log(f"Running PyTorch {torch.__version__}", console=False)
+    log(
+        subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE,
+                       text=True, check=False).stdout,
+        console=False,
+    )
+    log("=" * 100, console=False)
+
+    train_and_eval(h, device)
+
+    if distributed:
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1.log b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1.log
new file mode 100644
index 0000000000..410d73df1a
--- /dev/null
+++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1.log
@@ -0,0 +1,193 @@
+====================================================================================================
+Hyperparameters:
+  adam_eps: 1e-08
+  adam_wd: 0.02
+  beta1: 0.9
+  beta2: 0.95
+  compressor: brotli
+  data_dir: ./data/
+  datasets_dir: ./data/datasets/fineweb10B_sp8192
+  distributed: True
+  ema_decay: 0.997
+  embed_bits: 8
+  embed_clip_sigmas: 20.0
+  embed_lr: 0.6
+  embed_wd: 0.085
+  embedding_dim: 512
+  enable_looping_at: 0.5
+  eval_seq_len: 2048
+  eval_stride: 64
+  gptq_calibration_batches: 64
+  gptq_reserve_seconds: 12.0
+  grad_accum_steps: 1
+  grad_clip_norm: 0.3
+  head_lr: 0.008
+  is_main_process: True
+  iterations: 20000
+  ln_scale: True
+  local_rank: 0
+  logfile: logs/1.txt
+  logit_softcap: 30.0
+  loop_end: 5
+  loop_start: 4
+  matrix_bits: 6
+  matrix_clip_sigmas: 12.85
+  matrix_lr: 0.02
+  max_wallclock_seconds: 600.0
+  min_lr: 0.0
+  mlp_mult: 4.0
+  model_dim: 512
+  model_path: final_model.pt
+  muon_backend_steps: 5
+  muon_beta2: 0.95
+  muon_momentum: 0.99
+  muon_momentum_warmup_start: 0.92
+  muon_momentum_warmup_steps: 1500
+  muon_row_normalize: True
+  muon_wd: 0.085
+  num_heads: 8
+  num_kv_heads: 4
+  num_layers: 11
+  num_loops: 2
+  qk_gain_init: 4.0
+  quantized_model_path: final_model.int6.ptz
+  rank: 0
+  rope_base: 10000.0
+  rope_dims: 16
+  rope_train_seq_len: 2048
+  run_id: 1
+  scalar_lr: 0.02
+  seed: 1
+  skip_gates_enabled: True
+  sliding_window_enabled: True
+  tie_embeddings: True
+  tied_embed_init_std: 0.005
+  tied_embed_lr: 0.03
+  tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
+  train_batch_tokens: 786432
+  train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
+  train_log_every: 500
+  train_seq_len: 2048
+  val_batch_tokens: 524288
+  val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
+  val_loss_every: 4000
+  vocab_size: 8192
+  warmdown_frac: 0.667
+  warmup_steps: 20
+  world_size: 8
+  xsa_last_n: 11
+====================================================================================================
+Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0]
+Running PyTorch 2.11.0+cu130
+Sun Apr 5 20:00:43 2026
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
++-----------------------------------------+------------------------+----------------------+
+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+| | | MIG M. |
+|=========================================+========================+======================|
+| 0 NVIDIA H100 80GB HBM3 Off | 00000000:0A:00.0 Off | 0 |
+| N/A 45C P0 130W / 700W | 1505MiB / 81559MiB | 4% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 1 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 |
+| N/A 37C P0 117W / 700W | 1505MiB / 81559MiB | 6% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 2 NVIDIA H100 80GB HBM3 Off | 00000000:23:00.0 Off | 0 |
+| N/A 36C P0 120W / 700W | 1505MiB / 81559MiB | 6% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 3 NVIDIA H100 80GB HBM3 Off | 00000000:2C:00.0 Off | 0 |
+| N/A 45C P0 121W / 700W | 1505MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 4 NVIDIA H100 80GB HBM3 Off | 00000000:87:00.0 Off | 0 |
+| N/A 46C P0 127W / 700W | 1505MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 5 NVIDIA H100 80GB HBM3 Off | 00000000:90:00.0 Off | 0 |
+| N/A 36C P0 124W / 700W | 1505MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 6 NVIDIA H100 80GB HBM3 Off | 00000000:B8:00.0 Off | 0 |
+| N/A 35C P0 118W / 700W | 1505MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+| 7 NVIDIA H100 80GB HBM3 Off | 00000000:C1:00.0 Off | 0 |
+| N/A 44C P0 126W / 700W | 1505MiB / 81559MiB | 0% Default |
+| | | Disabled |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| 0 N/A N/A 3688196 C /usr/bin/python3 1496MiB |
+| 1 N/A N/A 3688197 C /usr/bin/python3 1496MiB |
+| 2 N/A N/A 3688198 C /usr/bin/python3 1496MiB |
+| 3 N/A N/A 3688199 C /usr/bin/python3 1496MiB |
+| 4 N/A N/A 3688200 C /usr/bin/python3 1496MiB |
+| 5 N/A N/A 3688201 C /usr/bin/python3 1496MiB |
+| 6 N/A N/A 3688202 C /usr/bin/python3 1496MiB |
+| 7 N/A N/A 3688203 C /usr/bin/python3 1496MiB |
++-----------------------------------------------------------------------------------------+
+
+====================================================================================================
+train_shards: 128
+val_tokens: 40540160
+model_params:35943512
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0092 val_bpb: 
3.4878 +1/20000 train_loss: 9.0111 train_time: 0.0m tok/s: 8223613 +2/20000 train_loss: 12.3647 train_time: 0.0m tok/s: 8133865 +3/20000 train_loss: 11.1589 train_time: 0.0m tok/s: 8039124 +4/20000 train_loss: 9.4650 train_time: 0.0m tok/s: 7985284 +5/20000 train_loss: 8.3469 train_time: 0.0m tok/s: 7962424 +500/20000 train_loss: 3.3363 train_time: 0.8m tok/s: 7727074 +1000/20000 train_loss: 3.1873 train_time: 1.7m tok/s: 7728408 +1500/20000 train_loss: 3.0986 train_time: 2.5m tok/s: 7728144 +2000/20000 train_loss: 3.0701 train_time: 3.4m tok/s: 7726369 +2500/20000 train_loss: 3.1005 train_time: 4.2m tok/s: 7725307 +layer_loop:enabled step:2888 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 3.0019 train_time: 5.2m tok/s: 7617356 +3500/20000 train_loss: 3.0018 train_time: 6.3m tok/s: 7248271 +4000/20000 train_loss: 2.9590 train_time: 7.5m tok/s: 6993814 +4000/20000 val_loss: 2.9184 val_bpb: 1.1298 +4500/20000 train_loss: 2.8130 train_time: 8.7m tok/s: 6808105 +4988/20000 val_loss: 2.8177 val_bpb: 1.0908 +stopping_early: wallclock_cap train_time: 588138ms step: 4988/20000 +peak memory allocated: 35372 MiB reserved: 35418 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81547748 val_bpb:1.08995907 eval_time:6003ms +Serialized model: 135426937 bytes +Code size: 15516 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 11.3s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15972031 bytes +Total submission size quantized+brotli: 15987547 bytes +quantized val_loss:2.84757723 val_bpb:1.10238588 eval_time:7531ms +quantized_sliding_window val_loss:2.80405024 val_bpb:1.08553523 eval_time:83990ms diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1234.log b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1234.log new file mode 100644 index 0000000000..98ce5759ac --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1234.log @@ -0,0 +1,193 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/1234.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + 
muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 1234 + scalar_lr: 0.02 + seed: 1234 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] +Running PyTorch 2.11.0+cu130 +Sun Apr 5 20:15:37 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 Off | 00000000:0A:00.0 Off | 0 | +| N/A 44C P0 130W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 | +| N/A 36C P0 117W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 Off | 00000000:23:00.0 Off | 0 | +| N/A 35C P0 120W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 Off | 00000000:2C:00.0 Off | 0 | +| N/A 45C P0 121W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 Off | 00000000:87:00.0 Off | 0 | +| N/A 45C P0 127W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 Off | 00000000:90:00.0 Off | 0 | +| N/A 36C P0 125W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 Off | 00000000:B8:00.0 Off | 0 | +| N/A 35C P0 117W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 Off | 00000000:C1:00.0 Off | 0 | +| N/A 45C P0 126W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + 
++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3691699 C /usr/bin/python3 1496MiB | +| 1 N/A N/A 3691700 C /usr/bin/python3 1496MiB | +| 2 N/A N/A 3691701 C /usr/bin/python3 1496MiB | +| 3 N/A N/A 3691702 C /usr/bin/python3 1496MiB | +| 4 N/A N/A 3691703 C /usr/bin/python3 1496MiB | +| 5 N/A N/A 3691704 C /usr/bin/python3 1496MiB | +| 6 N/A N/A 3691705 C /usr/bin/python3 1496MiB | +| 7 N/A N/A 3691706 C /usr/bin/python3 1496MiB | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35943512 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0072 val_bpb: 3.4870 +1/20000 train_loss: 9.0096 train_time: 0.0m tok/s: 8283109 +2/20000 train_loss: 12.3217 train_time: 0.0m tok/s: 8101365 +3/20000 train_loss: 11.1062 train_time: 0.0m tok/s: 8021873 +4/20000 train_loss: 9.4457 train_time: 0.0m tok/s: 7974463 +5/20000 train_loss: 8.3499 train_time: 0.0m tok/s: 7947403 +500/20000 train_loss: 3.3371 train_time: 0.8m tok/s: 7728045 +1000/20000 train_loss: 3.1847 train_time: 1.7m tok/s: 7727323 +1500/20000 train_loss: 3.0952 train_time: 2.5m tok/s: 7729961 +2000/20000 train_loss: 3.0714 train_time: 3.4m tok/s: 7728881 
+2500/20000 train_loss: 3.0973 train_time: 4.2m tok/s: 7728385 +layer_loop:enabled step:2889 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 2.9992 train_time: 5.2m tok/s: 7621452 +3500/20000 train_loss: 3.0083 train_time: 6.3m tok/s: 7251870 +4000/20000 train_loss: 2.9590 train_time: 7.5m tok/s: 6997356 +4000/20000 val_loss: 2.9171 val_bpb: 1.1293 +4500/20000 train_loss: 2.8114 train_time: 8.7m tok/s: 6811422 +4989/20000 val_loss: 2.8165 val_bpb: 1.0903 +stopping_early: wallclock_cap train_time: 588024ms step: 4989/20000 +peak memory allocated: 35372 MiB reserved: 35418 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81409759 val_bpb:1.08942487 eval_time:5976ms +Serialized model: 135426937 bytes +Code size: 15516 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 11.3s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15967802 bytes +Total submission size quantized+brotli: 15983318 bytes +quantized val_loss:2.84476656 val_bpb:1.10129779 eval_time:7428ms +quantized_sliding_window val_loss:2.80170263 val_bpb:1.08462640 eval_time:83811ms diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1337.log b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1337.log new file mode 100644 index 0000000000..a3567c7690 --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed1337.log @@ -0,0 +1,193 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 
1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/1337.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 1337 + scalar_lr: 0.02 + seed: 1337 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] +Running PyTorch 2.11.0+cu130 +Sun Apr 5 19:13:28 2026 
++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 Off | 00000000:0A:00.0 Off | 0 | +| N/A 41C P0 127W / 700W | 1505MiB / 81559MiB | 2% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 | +| N/A 35C P0 116W / 700W | 1505MiB / 81559MiB | 3% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 Off | 00000000:23:00.0 Off | 0 | +| N/A 34C P0 119W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 Off | 00000000:2C:00.0 Off | 0 | +| N/A 41C P0 119W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 Off | 00000000:87:00.0 Off | 0 | +| N/A 42C P0 124W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 Off | 00000000:90:00.0 Off | 0 | +| N/A 35C P0 124W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 Off | 00000000:B8:00.0 Off | 0 | +| N/A 34C P0 117W / 700W | 1505MiB 
/ 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 Off | 00000000:C1:00.0 Off | 0 | +| N/A 41C P0 123W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3676907 C /usr/bin/python3 1496MiB | +| 1 N/A N/A 3676908 C /usr/bin/python3 1496MiB | +| 2 N/A N/A 3676909 C /usr/bin/python3 1496MiB | +| 3 N/A N/A 3676910 C /usr/bin/python3 1496MiB | +| 4 N/A N/A 3676911 C /usr/bin/python3 1496MiB | +| 5 N/A N/A 3676912 C /usr/bin/python3 1496MiB | +| 6 N/A N/A 3676913 C /usr/bin/python3 1496MiB | +| 7 N/A N/A 3676914 C /usr/bin/python3 1496MiB | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35943512 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0047 val_bpb: 3.4860 +1/20000 train_loss: 9.0067 train_time: 0.0m tok/s: 8265920 +2/20000 train_loss: 12.2900 train_time: 0.0m tok/s: 8133773 +3/20000 train_loss: 11.0757 train_time: 0.0m tok/s: 8050026 
+4/20000 train_loss: 9.3856 train_time: 0.0m tok/s: 7998464 +5/20000 train_loss: 8.3022 train_time: 0.0m tok/s: 7965893 +500/20000 train_loss: 3.3393 train_time: 0.8m tok/s: 7725942 +1000/20000 train_loss: 3.1912 train_time: 1.7m tok/s: 7732443 +1500/20000 train_loss: 3.0979 train_time: 2.5m tok/s: 7737632 +2000/20000 train_loss: 3.0689 train_time: 3.4m tok/s: 7737108 +2500/20000 train_loss: 3.1046 train_time: 4.2m tok/s: 7736292 +layer_loop:enabled step:2892 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 3.0037 train_time: 5.2m tok/s: 7631560 +3500/20000 train_loss: 3.0051 train_time: 6.3m tok/s: 7259704 +4000/20000 train_loss: 2.9604 train_time: 7.5m tok/s: 7003746 +4000/20000 val_loss: 2.9212 val_bpb: 1.1309 +4500/20000 train_loss: 2.8168 train_time: 8.7m tok/s: 6816745 +4992/20000 val_loss: 2.8200 val_bpb: 1.0917 +stopping_early: wallclock_cap train_time: 588041ms step: 4992/20000 +peak memory allocated: 35372 MiB reserved: 35418 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81763619 val_bpb:1.09079477 eval_time:6030ms +Serialized model: 135426937 bytes +Code size: 15516 bytes +GPTQ:collecting Hessians from calibration data... 
+GPTQ:collected 67 Hessians in 11.2s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15969408 bytes +Total submission size quantized+brotli: 15984924 bytes +quantized val_loss:2.84715497 val_bpb:1.10222242 eval_time:7483ms +quantized_sliding_window val_loss:2.80406725 val_bpb:1.08554182 eval_time:84099ms diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed2025.log b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed2025.log new file mode 100644 index 0000000000..2668391bff --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed2025.log @@ -0,0 +1,193 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/2025.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + 
muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 2025 + scalar_lr: 0.02 + seed: 2025 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] +Running PyTorch 2.11.0+cu130 +Sun Apr 5 19:45:51 2026 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. 
| +|=========================================+========================+======================| +| 0 NVIDIA H100 80GB HBM3 Off | 00000000:0A:00.0 Off | 0 | +| N/A 45C P0 131W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 1 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 | +| N/A 38C P0 118W / 700W | 1505MiB / 81559MiB | 4% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 2 NVIDIA H100 80GB HBM3 Off | 00000000:23:00.0 Off | 0 | +| N/A 36C P0 121W / 700W | 1505MiB / 81559MiB | 3% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 3 NVIDIA H100 80GB HBM3 Off | 00000000:2C:00.0 Off | 0 | +| N/A 45C P0 121W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 4 NVIDIA H100 80GB HBM3 Off | 00000000:87:00.0 Off | 0 | +| N/A 46C P0 128W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 5 NVIDIA H100 80GB HBM3 Off | 00000000:90:00.0 Off | 0 | +| N/A 37C P0 125W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 6 NVIDIA H100 80GB HBM3 Off | 00000000:B8:00.0 Off | 0 | +| N/A 36C P0 118W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ +| 7 NVIDIA H100 80GB HBM3 Off | 00000000:C1:00.0 Off | 0 | +| N/A 46C P0 126W / 700W | 1505MiB / 81559MiB | 0% Default | +| | | Disabled | ++-----------------------------------------+------------------------+----------------------+ + 
++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3684815 C /usr/bin/python3 1496MiB | +| 1 N/A N/A 3684816 C /usr/bin/python3 1496MiB | +| 2 N/A N/A 3684817 C /usr/bin/python3 1496MiB | +| 3 N/A N/A 3684818 C /usr/bin/python3 1496MiB | +| 4 N/A N/A 3684819 C /usr/bin/python3 1496MiB | +| 5 N/A N/A 3684820 C /usr/bin/python3 1496MiB | +| 6 N/A N/A 3684821 C /usr/bin/python3 1496MiB | +| 7 N/A N/A 3684822 C /usr/bin/python3 1496MiB | ++-----------------------------------------------------------------------------------------+ + +==================================================================================================== +train_shards: 128 +val_tokens: 40540160 +model_params:35943512 +gptq:reserving 12s, effective=588000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +loop_warmup_step: 1/20 +loop_warmup_step: 2/20 +loop_warmup_step: 3/20 +loop_warmup_step: 4/20 +loop_warmup_step: 5/20 +loop_warmup_step: 6/20 +loop_warmup_step: 10/20 +loop_warmup_step: 20/20 +0/20000 val_loss: 9.0067 val_bpb: 3.4868 +1/20000 train_loss: 9.0086 train_time: 0.0m tok/s: 8276903 +2/20000 train_loss: 12.3468 train_time: 0.0m tok/s: 8167585 +3/20000 train_loss: 11.1483 train_time: 0.0m tok/s: 8060505 +4/20000 train_loss: 9.4410 train_time: 0.0m tok/s: 7999981 +5/20000 train_loss: 8.3515 train_time: 0.0m tok/s: 7969423 +500/20000 train_loss: 3.3350 train_time: 0.8m tok/s: 7727390 +1000/20000 train_loss: 3.1870 train_time: 1.7m tok/s: 7729167 +1500/20000 train_loss: 3.0981 train_time: 2.5m tok/s: 7729723 +2000/20000 train_loss: 3.0708 train_time: 3.4m tok/s: 7727999 
+2500/20000 train_loss: 3.1031 train_time: 4.2m tok/s: 7726167 +layer_loop:enabled step:2889 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10] +3000/20000 train_loss: 2.9998 train_time: 5.2m tok/s: 7619848 +3500/20000 train_loss: 3.0103 train_time: 6.3m tok/s: 7250583 +4000/20000 train_loss: 2.9624 train_time: 7.5m tok/s: 6996318 +4000/20000 val_loss: 2.9209 val_bpb: 1.1308 +4500/20000 train_loss: 2.8183 train_time: 8.7m tok/s: 6810460 +4989/20000 val_loss: 2.8202 val_bpb: 1.0918 +stopping_early: wallclock_cap train_time: 588097ms step: 4989/20000 +peak memory allocated: 35372 MiB reserved: 35418 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:2.81795928 val_bpb:1.09091985 eval_time:5995ms +Serialized model: 135426937 bytes +Code size: 15516 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 67 Hessians in 11.3s +Quantized weights: + gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight + gptq (int8): tok_emb.weight + passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights +Serialized model quantized+brotli: 15968101 bytes +Total submission size quantized+brotli: 15983617 bytes +quantized val_loss:2.84759969 val_bpb:1.10239458 eval_time:7466ms +quantized_sliding_window val_loss:2.80467299 val_bpb:1.08577632 eval_time:84042ms diff --git a/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed42.log b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed42.log new file mode 100644 index 0000000000..5884d422b5 --- /dev/null +++ b/records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_seed42.log @@ -0,0 +1,193 @@ +==================================================================================================== +Hyperparameters: + adam_eps: 1e-08 + 
adam_wd: 0.02 + beta1: 0.9 + beta2: 0.95 + compressor: brotli + data_dir: ./data/ + datasets_dir: ./data/datasets/fineweb10B_sp8192 + distributed: True + ema_decay: 0.997 + embed_bits: 8 + embed_clip_sigmas: 20.0 + embed_lr: 0.6 + embed_wd: 0.085 + embedding_dim: 512 + enable_looping_at: 0.5 + eval_seq_len: 2048 + eval_stride: 64 + gptq_calibration_batches: 64 + gptq_reserve_seconds: 12.0 + grad_accum_steps: 1 + grad_clip_norm: 0.3 + head_lr: 0.008 + is_main_process: True + iterations: 20000 + ln_scale: True + local_rank: 0 + logfile: logs/42.txt + logit_softcap: 30.0 + loop_end: 5 + loop_start: 4 + matrix_bits: 6 + matrix_clip_sigmas: 12.85 + matrix_lr: 0.02 + max_wallclock_seconds: 600.0 + min_lr: 0.0 + mlp_mult: 4.0 + model_dim: 512 + model_path: final_model.pt + muon_backend_steps: 5 + muon_beta2: 0.95 + muon_momentum: 0.99 + muon_momentum_warmup_start: 0.92 + muon_momentum_warmup_steps: 1500 + muon_row_normalize: True + muon_wd: 0.085 + num_heads: 8 + num_kv_heads: 4 + num_layers: 11 + num_loops: 2 + qk_gain_init: 4.0 + quantized_model_path: final_model.int6.ptz + rank: 0 + rope_base: 10000.0 + rope_dims: 16 + rope_train_seq_len: 2048 + run_id: 42 + scalar_lr: 0.02 + seed: 42 + skip_gates_enabled: True + sliding_window_enabled: True + tie_embeddings: True + tied_embed_init_std: 0.005 + tied_embed_lr: 0.03 + tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model + train_batch_tokens: 786432 + train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin + train_log_every: 500 + train_seq_len: 2048 + val_batch_tokens: 524288 + val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin + val_loss_every: 4000 + vocab_size: 8192 + warmdown_frac: 0.667 + warmup_steps: 20 + world_size: 8 + xsa_last_n: 11 +==================================================================================================== +Running Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] +Running PyTorch 2.11.0+cu130 +Sun Apr 5 19:31:18 2026 
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
+|                                         |                        |               MIG M. |
+|=========================================+========================+======================|
+|   0  NVIDIA H100 80GB HBM3          Off |   00000000:0A:00.0 Off |                    0 |
+| N/A   40C    P0            126W /  700W |    1505MiB /  81559MiB |      3%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   1  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
+| N/A   35C    P0            117W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   2  NVIDIA H100 80GB HBM3          Off |   00000000:23:00.0 Off |                    0 |
+| N/A   34C    P0            119W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   3  NVIDIA H100 80GB HBM3          Off |   00000000:2C:00.0 Off |                    0 |
+| N/A   40C    P0            118W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   4  NVIDIA H100 80GB HBM3          Off |   00000000:87:00.0 Off |                    0 |
+| N/A   41C    P0            123W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   5  NVIDIA H100 80GB HBM3          Off |   00000000:90:00.0 Off |                    0 |
+| N/A   35C    P0            126W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   6  NVIDIA H100 80GB HBM3          Off |   00000000:B8:00.0 Off |                    0 |
+| N/A   33C    P0            116W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+|   7  NVIDIA H100 80GB HBM3          Off |   00000000:C1:00.0 Off |                    0 |
+| N/A   40C    P0            122W /  700W |    1505MiB /  81559MiB |      0%      Default |
+|                                         |                        |             Disabled |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes:                                                                              |
+|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
+|        ID   ID                                                               Usage      |
+|=========================================================================================|
+|    0   N/A  N/A         3681385      C   /usr/bin/python3                       1496MiB |
+|    1   N/A  N/A         3681386      C   /usr/bin/python3                       1496MiB |
+|    2   N/A  N/A         3681387      C   /usr/bin/python3                       1496MiB |
+|    3   N/A  N/A         3681388      C   /usr/bin/python3                       1496MiB |
+|    4   N/A  N/A         3681389      C   /usr/bin/python3                       1496MiB |
+|    5   N/A  N/A         3681390      C   /usr/bin/python3                       1496MiB |
+|    6   N/A  N/A         3681391      C   /usr/bin/python3                       1496MiB |
+|    7   N/A  N/A         3681392      C   /usr/bin/python3                       1496MiB |
++-----------------------------------------------------------------------------------------+
+
+====================================================================================================
+train_shards: 128
+val_tokens: 40540160
+model_params:35943512
+gptq:reserving 12s, effective=588000ms
+warmup_step: 1/20
+warmup_step: 2/20
+warmup_step: 3/20
+warmup_step: 4/20
+warmup_step: 5/20
+warmup_step: 6/20
+warmup_step: 10/20
+warmup_step: 20/20
+loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+loop_warmup_step: 1/20
+loop_warmup_step: 2/20
+loop_warmup_step: 3/20
+loop_warmup_step: 4/20
+loop_warmup_step: 5/20
+loop_warmup_step: 6/20
+loop_warmup_step: 10/20
+loop_warmup_step: 20/20
+0/20000 val_loss: 9.0090 val_bpb: 3.4877
+1/20000 train_loss: 9.0111 train_time: 0.0m tok/s: 8242436
+2/20000 train_loss: 12.3696 train_time: 0.0m tok/s: 8150256
+3/20000 train_loss: 11.1541 train_time: 0.0m tok/s: 8040561
+4/20000 train_loss: 9.4484 train_time: 0.0m tok/s: 7995440
+5/20000 train_loss: 8.3663 train_time: 0.0m tok/s: 7971618
+500/20000 train_loss: 3.3339 train_time: 0.8m tok/s: 7726831
+1000/20000 train_loss: 3.1906 train_time: 1.7m tok/s: 7728478
+1500/20000 train_loss: 3.0973 train_time: 2.5m tok/s: 7728165
+2000/20000 train_loss: 3.0734 train_time: 3.4m tok/s: 7725857
+2500/20000 train_loss: 3.0977 train_time: 4.2m tok/s: 7724265
+layer_loop:enabled step:2888 frac:0.500 encoder:[0, 1, 2, 3, 4, 5, 4] decoder:[5, 4, 5, 6, 7, 8, 9, 10]
+3000/20000 train_loss: 2.9966 train_time: 5.2m tok/s: 7616509
+3500/20000 train_loss: 3.0053 train_time: 6.3m tok/s: 7246256
+4000/20000 train_loss: 2.9557 train_time: 7.5m tok/s: 6991397
+4000/20000 val_loss: 2.9188 val_bpb: 1.1300
+4500/20000 train_loss: 2.8134 train_time: 8.7m tok/s: 6805522
+4986/20000 val_loss: 2.8178 val_bpb: 1.0909
+stopping_early: wallclock_cap train_time: 588136ms step: 4986/20000
+peak memory allocated: 35372 MiB reserved: 35418 MiB
+ema:applying EMA weights
+pre-quantization post-ema val_loss:2.81543002 val_bpb:1.08994070 eval_time:6006ms
+Serialized model: 135426937 bytes
+Code size: 15516 bytes
+GPTQ:collecting Hessians from calibration data...
+GPTQ:collected 67 Hessians in 11.3s
+Quantized weights:
+ gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
+ gptq (int8): tok_emb.weight
+ passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
+Serialized model quantized+brotli: 15973467 bytes
+Total submission size quantized+brotli: 15988983 bytes
+quantized val_loss:2.85032574 val_bpb:1.10344992 eval_time:7467ms
+quantized_sliding_window val_loss:2.80691494 val_bpb:1.08664424 eval_time:84115ms