Commit ce487d5

Octavian and claude committed
Log session results: 1.0763 SwiGLU (size problem), 1.1215 v7 (submitted)
SwiGLU fork (PR #462 base) + GPTQ + OptRot + AdamW TTT = 1.0763 BPB but artifact is 19.6MB (over 16MB limit). OptRot Hadamard rotation hurts zstd compression. Next step: solve the size problem.

v7 GPTQ stack submitted as PR #508: 3-seed mean 1.1215 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5aec5ae commit ce487d5
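
The commit message reports that the Hadamard rotation hurts zstd compression. The toy sketch below illustrates one plausible mechanism and is not a diagnosis of the actual 19.6MB artifact: an energy-spreading rotation turns a sparse, repetitive integer array into a dense one that a byte-level compressor handles worse, even though the transform is exactly invertible. Everything here is an assumption made for illustration: `zlib` stands in for zstd, the "weights" are synthetic, and `fwht` and `packed_size` are hypothetical helper names.

```python
import random
import struct
import zlib


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; len(values) must be a power of 2."""
    v = list(values)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


def packed_size(values):
    """Compressed size in bytes of the values stored as little-endian int16."""
    raw = struct.pack(f"<{len(values)}h", *values)
    return len(zlib.compress(raw, 9))


# Synthetic stand-in for quantized weights: mostly zeros, a few small spikes.
rng = random.Random(0)
weights = [0] * 1024
for idx in rng.sample(range(1024), 32):
    weights[idx] = rng.choice([-3, -2, -1, 1, 2, 3])

rotated = fwht(weights)  # same information, spread across every coordinate

print("original:", packed_size(weights), "bytes")
print("rotated: ", packed_size(rotated), "bytes")
```

Applying `fwht` a second time returns the original values scaled by the vector length, so nothing is lost by the rotation; only the byte-level redundancy the compressor relies on changes.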

4 files changed

Lines changed: 608 additions & 0 deletions

autoresearch_frug2.log

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
+======================================================================
+FRUGENDORFF V2 — Closing the Gap to SOTA
+Model: qwen3-coder:30b | Started: 2026-03-23T00:05:35.674382
+======================================================================
+
+>>> SEED PHASE: 8 configs
+
+[seed 1] H100 winner: 6x2 mlp4
+>>> val_bpb=2.254615
+
+[seed 2] 6x2 mlp3 (how much does mlp4 help?)
+>>> val_bpb=2.235808
+
+[seed 3] 4x3 mlp4 (more loops, fewer layers)
+>>> val_bpb=2.205482
+
+[seed 4] 8x2 mlp4 (more unique, fast)
+>>> val_bpb=2.32917
+
+[seed 5] 6x2 always fractal
+>>> val_bpb=2.35167
+
+[seed 6] 6x2 cadence 2 (F/N)
+>>> val_bpb=2.707633
+
+[seed 7] 5x2 mlp4 (faster, more steps)
+>>> val_bpb=2.255619
+
+[seed 8] 6x1 no loops (flat mlp4 control)
+>>> val_bpb=2.196184
+
+======================================================================
+RUN 9 | 00:39:57 | best=2.1962
+Qwen: Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the current best (L=6x2 cad=3 mlp=
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.24638
+
+======================================================================
+RUN 10 | 00:43:38 | best=2.1962
+Qwen: Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance (2.2464 BPB), while L=6x2 with cadence 3
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.24475
+
+LEADERBOARD (10 runs)
+1. 2.1962 | L=6x1 cad=1 mlp=4
+2. 2.2055 | L=4x3 cad=3 mlp=4
+3. 2.2358 | L=6x2 cad=3 mlp=3
+4. 2.2447 | L=5x2 cad=3 mlp=4
+5. 2.2464 | L=5x2 cad=3 mlp=4
+6. 2.2546 | L=6x2 cad=3 mlp=4
+7. 2.2556 | L=5x2 cad=3 mlp=4
+8. 2.3292 | L=8x2 cad=3 mlp=4
+9. 2.3517 | L=6x2 cad=1 mlp=4
+10. 2.7076 | L=6x2 cad=2 mlp=4
+
+======================================================================
+RUN 11 | 00:47:16 | best=2.1962
+Qwen: Based on the results, L=5x2 with cadence=3 and MLP=4 shows strong performance at 2.2447 BPB. The previous best was L=6x2
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.228696
+
+======================================================================
+RUN 12 | 00:50:53 | best=2.1962
+Qwen: Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance while being more efficient than higher lo
+L=4x3 cad=3 mlp=4
+>>> val_bpb=2.213174
+
+======================================================================
+RUN 13 | 00:54:54 | best=2.1962
+Qwen: Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, but we need to explore if increasing loo
+L=4x3 cad=3 mlp=4
+>>> val_bpb=2.210759
+
+======================================================================
+RUN 14 | 00:58:55 | best=2.1962
+Qwen: Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the optimal configuration appears
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.242233
+
+======================================================================
+RUN 15 | 01:02:32 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.24 BPB, and cadence 3 seems optimal. The 6x2 con
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.238606
+
+LEADERBOARD (15 runs)
+1. 2.1962 | L=6x1 cad=1 mlp=4
+2. 2.2055 | L=4x3 cad=3 mlp=4
+3. 2.2108 | L=4x3 cad=3 mlp=4
+4. 2.2132 | L=4x3 cad=3 mlp=4
+5. 2.2287 | L=5x2 cad=3 mlp=4
+6. 2.2358 | L=6x2 cad=3 mlp=3
+7. 2.2386 | L=5x2 cad=3 mlp=4
+8. 2.2422 | L=5x2 cad=3 mlp=4
+9. 2.2447 | L=5x2 cad=3 mlp=4
+10. 2.2464 | L=5x2 cad=3 mlp=4
+
+======================================================================
+RUN 16 | 01:06:10 | best=2.1962
+Qwen: Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance around 2.238 BPB. The best configurations
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.244032
+
+======================================================================
+RUN 17 | 01:09:47 | best=2.1962
+Qwen: Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance close to the best. The cadence pattern F/
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.23282
+
+======================================================================
+RUN 18 | 01:13:24 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, and cadence 3 seems optimal. The 8x2 con
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.231204
+
+======================================================================
+RUN 19 | 01:17:01 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with MLP 4x is slightly w
+L=4x2 cad=3 mlp=4
+>>> val_bpb=2.197056
+
+======================================================================
+RUN 20 | 01:20:14 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, while 6x2 with MLP 4x is slightly worse
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.258728
+
+LEADERBOARD (20 runs)
+1. 2.1962 | L=6x1 cad=1 mlp=4
+2. 2.1971 | L=4x2 cad=3 mlp=4
+3. 2.2055 | L=4x3 cad=3 mlp=4
+4. 2.2108 | L=4x3 cad=3 mlp=4
+5. 2.2132 | L=4x3 cad=3 mlp=4
+6. 2.2287 | L=5x2 cad=3 mlp=4
+7. 2.2312 | L=5x2 cad=3 mlp=4
+8. 2.2328 | L=5x2 cad=3 mlp=4
+9. 2.2358 | L=6x2 cad=3 mlp=3
+10. 2.2386 | L=5x2 cad=3 mlp=4
+
+======================================================================
+RUN 21 | 01:23:52 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if increasing loops to 6
+L=5x4 cad=3 mlp=4
+>>> val_bpb=2.273206
+
+======================================================================
+RUN 22 | 01:29:16 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if increasing loop count
+L=5x3 cad=3 mlp=4
+>>> val_bpb=2.267194
+
+======================================================================
+RUN 23 | 01:33:51 | best=2.1962
+Qwen: Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if increasing loop coun
+L=5x3 cad=3 mlp=4
+>>> val_bpb=2.268142
+
+======================================================================
+RUN 24 | 01:38:22 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, and cadence 3 appears optimal. The
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.252595
+
+======================================================================
+RUN 25 | 01:42:00 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance while maintaining good training stability. The cadence pa
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.237074
+
+LEADERBOARD (25 runs)
+1. 2.1962 | L=6x1 cad=1 mlp=4
+2. 2.1971 | L=4x2 cad=3 mlp=4
+3. 2.2055 | L=4x3 cad=3 mlp=4
+4. 2.2108 | L=4x3 cad=3 mlp=4
+5. 2.2132 | L=4x3 cad=3 mlp=4
+6. 2.2287 | L=5x2 cad=3 mlp=4
+7. 2.2312 | L=5x2 cad=3 mlp=4
+8. 2.2328 | L=5x2 cad=3 mlp=4
+9. 2.2358 | L=6x2 cad=3 mlp=3
+10. 2.2371 | L=5x2 cad=3 mlp=4
+
+======================================================================
+RUN 26 | 01:45:38 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with MLP 4x is slightly w
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.258998
+
+======================================================================
+RUN 27 | 01:49:17 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.26 BPB, with cadence 3 being superior to 1
+L=5x2 cad=3 mlp=4
+>>> val_bpb=2.24007
+
+======================================================================
+RUN 28 | 01:52:54 | best=2.1962
+Qwen: Based on the results, 5x2 with MLP 4x shows strong performance close to the best, while 6x2 with MLP 4x is slightly wors
+L=5x2 cad=3 mlp=3
+>>> val_bpb=2.22595
+
+======================================================================
+RUN 29 | 01:56:41 | best=2.1962
+Qwen: Based on the results, L=4x3 with cadence 3 shows strong performance while maintaining lower complexity than L=5x3 or L=5
+L=4x3 cad=3 mlp=4
+>>> val_bpb=2.214263
+
+======================================================================
+RUN 30 | 02:00:43 | best=2.1962
+Qwen: Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if increasing loop coun
+L=5x2 cad=1 mlp=4
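
Several configurations in the log above were run more than once, which gives a rough sense of run-to-run noise. The sketch below groups a few of those logged `val_bpb` values by configuration and reports the mean and the max-minus-min spread. The values are copied from the log; the `summarize` helper and the dictionary layout are illustrative assumptions, not part of the autoresearch script.

```python
from collections import defaultdict
from statistics import mean

# (config, val_bpb) pairs copied from the log above.
runs = [
    ("6x1 cad=1 mlp=4", 2.196184),  # seed 8, flat control
    ("4x2 cad=3 mlp=4", 2.197056),  # run 19
    ("4x3 cad=3 mlp=4", 2.205482),  # seed 3
    ("4x3 cad=3 mlp=4", 2.213174),  # run 12
    ("4x3 cad=3 mlp=4", 2.210759),  # run 13
    ("4x3 cad=3 mlp=4", 2.214263),  # run 29
]


def summarize(pairs):
    """Group val_bpb by config and return count, mean, and spread per config."""
    groups = defaultdict(list)
    for config, bpb in pairs:
        groups[config].append(bpb)
    return {
        config: {"n": len(v), "mean": mean(v), "spread": max(v) - min(v)}
        for config, v in groups.items()
    }


summary = summarize(runs)
for config, stats in summary.items():
    print(f"{config}: n={stats['n']} mean={stats['mean']:.4f} spread={stats['spread']:.4f}")
```

On these values the four 4x3 runs span about 0.0088 BPB, while the gap between the 6x1 control and the single 4x2 run is about 0.0009 BPB, so the ordering of the top two leaderboard entries may be within noise.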

autoresearch_frug2_results.csv

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+timestamp,run_id,val_bpb,cadence,cadence_offset,num_unique_layers,num_loops,lr,grad_clip,mlp_mult,model_dim,steps,f_steps,n_steps,avg_ms,time_s,params,reasoning,notes
+2026-03-23T00:09:42.654400,frug2_001,2.254615,3,0,6,2,0.002,5.0,4,0,,,,,,,seed,H100 winner: 6x2 mlp4
+2026-03-23T00:13:34.069000,frug2_002,2.235808,3,0,6,2,0.002,5.0,3,0,,,,,,,seed,6x2 mlp3 (how much does mlp4 help?)
+2026-03-23T00:17:38.304284,frug2_003,2.205482,3,0,4,3,0.002,5.0,4,0,,,,,,,seed,"4x3 mlp4 (more loops, fewer layers)"
+2026-03-23T00:22:33.028628,frug2_004,2.32917,3,0,8,2,0.002,5.0,4,0,,,,,,,seed,"8x2 mlp4 (more unique, fast)"
+2026-03-23T00:28:34.288819,frug2_005,2.35167,1,0,6,2,0.002,5.0,4,0,,,,,,,seed,6x2 always fractal
+2026-03-23T00:33:10.586660,frug2_006,2.707633,2,0,6,2,0.002,5.0,4,0,,,,,,,seed,6x2 cadence 2 (F/N)
+2026-03-23T00:36:49.993983,frug2_007,2.255619,3,0,5,2,0.002,5.0,4,0,,,,,,,seed,"5x2 mlp4 (faster, more steps)"
+2026-03-23T00:39:57.175446,frug2_008,2.196184,1,0,6,1,0.002,5.0,4,0,,,,,,,seed,6x1 no loops (flat mlp4 control)
+2026-03-23T00:43:38.816871,frug2_009,2.24638,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the current best (L=6x2 cad=3 mlp=4) is close behind. The optimal configuration likely balances depth and loop cou","Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the current be"
+2026-03-23T00:47:16.360352,frug2_010,2.24475,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance (2.2464 BPB), while L=6x2 with cadence 3 is slightly worse (2.2546 BPB). The cadence pattern F/N/N (cadence 3) seems bene","Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance (2.2464 BPB), while L"
+2026-03-23T00:50:53.541726,frug2_011,2.228696,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with cadence=3 and MLP=4 shows strong performance at 2.2447 BPB. The previous best was L=6x2 cad=3 mlp=4 at 2.2546 BPB, so we should explore variations around this winning ","Based on the results, L=5x2 with cadence=3 and MLP=4 shows strong performance at 2.2447 BPB. The pre"
+2026-03-23T00:54:54.538187,frug2_012,2.213174,3,0,4,3,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance while being more efficient than higher loop counts. The cadence pattern F/N/N with MLP 4x appears optimal. We should test","Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance while being more effi"
+2026-03-23T00:58:55.568576,frug2_013,2.210759,3,0,4,3,0.002,5.0,4,0,,,,,,,"Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, but we need to explore if increasing loop count with reduced layers per loop can improve efficiency. The cadence pattern","Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, but we need to explo"
+2026-03-23T01:02:32.929687,frug2_014,2.242233,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the optimal configuration appears to be around 5x2 or 6x2 with MLP 4x. The cadence pattern F/N/N (cadence=3) seems","Based on the results, L=4x3 with cadence 3 and MLP 4x shows strong performance, while the optimal co"
+2026-03-23T01:06:10.278314,frug2_015,2.238606,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.24 BPB, and cadence 3 seems optimal. The 6x2 configuration with MLP 4x is close to SOTA at 1.1233 BPB, so we should test variati","Based on the results, 5x2 with MLP 4x shows strong performance around 2.24 BPB, and cadence 3 seems "
+2026-03-23T01:09:47.108121,frug2_016,2.244032,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance around 2.238 BPB. The best configurations seem to favor fewer unique layers with more loops for optimal efficiency. Testi","Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance around 2.238 BPB. The"
+2026-03-23T01:13:24.252495,frug2_017,2.23282,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance close to the best. The cadence pattern F/N/N (cadence=3) seems effective, but we should test if changing cadence or tryin","Based on the results, L=5x2 with cadence 3 and MLP 4x shows strong performance close to the best. Th"
+2026-03-23T01:17:01.282282,frug2_018,2.231204,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, and cadence 3 seems optimal. The 8x2 configuration performs poorly, suggesting diminishing returns from increasing loops","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, and cadence 3 seems "
+2026-03-23T01:20:14.825450,frug2_019,2.197056,3,0,4,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with MLP 4x is slightly worse at ~2.25 BPB. The cadence 3 performs better than 1 or 2. We should test if ","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with "
+2026-03-23T01:23:52.959072,frug2_020,2.258728,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, while 6x2 with MLP 4x is slightly worse at ~2.25 BPB. The cadence pattern F/N/N (cadence=3) seems optimal based on previ","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23 BPB, while 6x2 with MLP 4"
+2026-03-23T01:29:16.647297,frug2_021,2.273206,3,0,5,4,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if increasing loops to 6 while keeping 5 layers could improve further. The cadence pattern F/N/N (3) see","Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if i"
+2026-03-23T01:33:51.777113,frug2_022,2.267194,3,0,5,3,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if increasing loop count with 5 layers could yield better results. The cadence pattern F/N/N (3) seems o","Based on the results, 5x2 with MLP 4x shows the best performance so far, but we need to explore if i"
+2026-03-23T01:38:22.362045,frug2_023,2.268142,3,0,5,3,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if increasing loop count while keeping layers optimal can improve further. The cadence pattern F/N/N (c","Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if "
+2026-03-23T01:42:00.656021,frug2_024,2.252595,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, and cadence 3 appears optimal. The 6x2 configuration slightly lags behind, suggesting we should explore if 5x2 can ","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, and cadence 3 a"
+2026-03-23T01:45:38.936954,frug2_025,2.237074,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance while maintaining good training stability. The cadence pattern F/N/N (cadence=3) appears optimal from previous findings. We should test i","Based on the results, 5x2 with MLP 4x shows strong performance while maintaining good training stabi"
+2026-03-23T01:49:17.361837,frug2_026,2.258998,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with MLP 4x is slightly worse at ~2.25 BPB. The cadence pattern F/N/N (cadence=3) seems optimal based on ","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.24 BPB, while 6x2 with "
+2026-03-23T01:52:54.696448,frug2_027,2.24007,3,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.26 BPB, with cadence 3 being superior to 1 or 2. The optimal configuration appears to be in the 5-6 layers with 2 loops, an","Based on the results, 5x2 with MLP 4x shows strong performance around 2.23-2.26 BPB, with cadence 3 "
+2026-03-23T01:56:41.869660,frug2_028,2.22595,3,0,5,2,0.002,5.0,3,0,,,,,,,"Based on the results, 5x2 with MLP 4x shows strong performance close to the best, while 6x2 with MLP 4x is slightly worse but still competitive. The cadence pattern F/N/N (cadence=3) appears optimal f","Based on the results, 5x2 with MLP 4x shows strong performance close to the best, while 6x2 with MLP"
+2026-03-23T02:00:43.043706,frug2_029,2.214263,3,0,4,3,0.002,5.0,4,0,,,,,,,"Based on the results, L=4x3 with cadence 3 shows strong performance while maintaining lower complexity than L=5x3 or L=5x4. The MLP 4x configuration consistently outperforms MLP 3x, and cadence 3 appe","Based on the results, L=4x3 with cadence 3 shows strong performance while maintaining lower complexi"
+2026-03-23T02:06:00.113189,frug2_030,2.398726,1,0,5,2,0.002,5.0,4,0,,,,,,,"Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if increasing loop count with fixed layers or adjusting cadence patterns could yield better results. Th","Based on the results, L=5x2 with MLP=4 shows the best performance so far, but we need to explore if "
+2026-03-23T02:09:41.992206,frug2_031,2.233198,3,0,5,2,0.002,5.0,4,0,,,,,,,"The results show that 5x2 with MLP 4x performs best among the tested configs, with 6x2 showing slight degradation. The cadence pattern significantly impacts performance, with cadence 3 being superior ","The results show that 5x2 with MLP 4x performs best among the tested configs, with 6x2 showing sligh"
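
The LEADERBOARD blocks in the log appear to be derivable from this CSV: sort by `val_bpb` ascending and print `L=<num_unique_layers>x<num_loops> cad=<cadence> mlp=<mlp_mult>`. The sketch below reconstructs that view from four seed rows copied from the file. The `leaderboard` function is a hypothetical helper written for illustration and is not taken from the autoresearch script; the column-to-label mapping is inferred from matching log and CSV values.

```python
import csv
import io

# Header and four seed rows copied verbatim from autoresearch_frug2_results.csv.
SAMPLE = """\
timestamp,run_id,val_bpb,cadence,cadence_offset,num_unique_layers,num_loops,lr,grad_clip,mlp_mult,model_dim,steps,f_steps,n_steps,avg_ms,time_s,params,reasoning,notes
2026-03-23T00:09:42.654400,frug2_001,2.254615,3,0,6,2,0.002,5.0,4,0,,,,,,,seed,H100 winner: 6x2 mlp4
2026-03-23T00:17:38.304284,frug2_003,2.205482,3,0,4,3,0.002,5.0,4,0,,,,,,,seed,"4x3 mlp4 (more loops, fewer layers)"
2026-03-23T00:33:10.586660,frug2_006,2.707633,2,0,6,2,0.002,5.0,4,0,,,,,,,seed,6x2 cadence 2 (F/N)
2026-03-23T00:39:57.175446,frug2_008,2.196184,1,0,6,1,0.002,5.0,4,0,,,,,,,seed,6x1 no loops (flat mlp4 control)
"""


def leaderboard(csv_text, top=10):
    """Return leaderboard lines, lowest val_bpb first (lower BPB is better)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["val_bpb"]))
    return [
        f"{rank}. {float(r['val_bpb']):.4f} | "
        f"L={r['num_unique_layers']}x{r['num_loops']} "
        f"cad={r['cadence']} mlp={r['mlp_mult']}"
        for rank, r in enumerate(rows[:top], start=1)
    ]


for line in leaderboard(SAMPLE):
    print(line)
```

To rank the full file instead of the sample, the same function can be given the file's text, for example `leaderboard(open("autoresearch_frug2_results.csv").read())`.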
