Subagent gap analysis of the top 3 open PRs (openai#1437, openai#1423,
openai#1445) found that QK_GAIN_INIT=5.0 is the simplest training-time
technique we're missing with 2-PR evidence (the #1 and #2 top open PRs
both use 5.0 vs the upstream default of 1.5).

CRITICAL: QK_GAIN_INIT is already an upstream env var (line 60 of
train_gpt.py). NO code patch needed — just add experiments that override
the env var. Zero patcher risk, zero anchor risk.

Application: q_gain is multiplied element-wise with the query tensor before
F.scaled_dot_product_attention, scaling the Q-K product by the gain factor.

4 QK experiments queued:
QK0_qkgain5_alone, QK1_qkgain5_seed42, QK2_qkgain5_L4weights,
QK3_qkgain5_with_engram

Hypertuning rule check: this is a SINGLE-value port from the 2 top open
records, NOT a weight sweep. Satisfies the "port from top records" rule.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RESEARCH_LOG.md — 68 additions, 0 deletions

- **First architectural patch** in many fires that fits our train_loss metric

This is the third optimizer/architectural patch in 5 fires that I've shipped (Mousse #9, MuonEq-R #10, Depth Recurrence #13). The pattern of "find a port from a top record, ship the conservative variant, validate on the loop within 1-2 hours" is now the operating mode for the remaining session.

---

## Research Fire #14 — 2026-04-08 (cron min :16, Track A) — QK_GAIN_INIT=5.0 port (NO code patch needed)
**Subject**: Subagent gap analysis of top 3 open PRs (#1437, #1423, #1445) for training-time techniques we don't have. Looking for the simplest port that fits our train_loss metric.
### Subagent finding (top 3 PR cross-reference)
**Training-time techniques NOT in our 24-patch stack**:

1. **QK_GAIN_INIT=5.0** (vs default 1.5) — used in PR #1437 (1.078) AND PR #1423 (1.079). Highest port-with-evidence ratio.
2. **WD=0.095** (vs default 0.090) — PR #1445 only
3. **WARMDOWN_FRAC=0.72** (vs default 0.667) — PR #1445 only

The first (QK_GAIN_INIT) appears in TWO of the top 3 PRs; the others appear in only one. It's the highest-confidence pick.
### Critical finding: QK_GAIN_INIT is ALREADY an upstream env var
Second focused subagent confirmed: `qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))` exists at line 60 of upstream train_gpt.py. Default is **1.5** (NOT 4.0 as the first subagent guessed).
**Application** (lines 592-593 in CausalSelfAttention.forward):
The `q_gain` parameter is initialized from the `QK_GAIN_INIT` env var and multiplied element-wise with the query tensor before attention, effectively scaling the Q-K product by the gain factor inside attention.
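The scaling effect can be checked with a minimal NumPy sketch (the function and shapes here are illustrative, not upstream code): because a scalar gain multiplies Q element-wise before the dot product, the pre-softmax logits scale linearly with it.

```python
import numpy as np

def attention_logits(q, k, q_gain):
    # q, k: (seq_len, head_dim); q_gain scales Q element-wise before Q @ K^T
    return (q * q_gain) @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))

base = attention_logits(q, k, q_gain=1.5)    # upstream default
ported = attention_logits(q, k, q_gain=5.0)  # PR #1437 / #1423 value

# a scalar gain scales the logits linearly: 5.0 / 1.5 ≈ 3.33x sharper scores
assert np.allclose(ported, base * (5.0 / 1.5))
```

So the port sharpens attention distributions without adding parameters or changing shapes, which is why it rides entirely on an init value.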
### Why no code patch is needed
`QK_GAIN_INIT` is already a first-class env var in upstream train_gpt.py. To port the PR #1437/#1423 finding, I just need to add experiments that pass `QK_GAIN_INIT=5.0` as an environment variable. The runner already supports passing env vars to train_gpt.py. **Zero code changes**, just JSON additions to experiments.json.
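A sketch of the override path (the env read mirrors the upstream one-liner the subagent found; everything else is illustrative):

```python
import os

# Simulate what a queued experiment entry would do: set the env var,
# then read it the way upstream train_gpt.py reportedly does.
os.environ["QK_GAIN_INIT"] = "5.0"
qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))
assert qk_gain_init == 5.0  # the experiment's override wins

# Without the override, the upstream default applies.
del os.environ["QK_GAIN_INIT"]
assert float(os.environ.get("QK_GAIN_INIT", 1.5)) == 1.5
```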
This is the cleanest possible ship: no patcher anchor risk, no graceful-fallback code, no marker checks. The patch surface area is exactly 4 new lines in experiments.json.
### Hypertuning rule check
User wishes (CLAUDE.md):

- "NO HYPERTUNING — don't push experiments that just twiddle weights of validated configs"
- "PORTING-WITH-EVIDENCE — every patch must either be: (a) novel, or (b) ported from a comp PR that's in the top 10 records"

QK_GAIN_INIT=5.0:
- ✓ Port from PR #1437 (top open) AND PR #1423 (top open #2)
- ✓ Single value, NOT a sweep
- ✓ Targeted single change
- ✓ Empirical evidence at competitor scale

Satisfies the spirit of "port from top records" without violating "no hypertuning sweeps". Multi-seed validation experiments are explicitly OK per the rule.
### Experiments queued (4 added → queue is now 36)
- **QK0_qkgain5_alone** — QK_GAIN_INIT=5.0 + L5 weights + leaky + ngram (champion config with a single change)
- **QK1_qkgain5_seed42** — multi-seed validation
- **QK2_qkgain5_L4weights** — QK_GAIN_INIT=5.0 with L4 weights (the more reliable champion family)
- **QK3_qkgain5_with_engram** — stack with EngramLite to test additivity
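A hypothetical sketch of what the four entries could look like — the experiments.json schema (field names `name`/`env`/`seed`) is an assumption for illustration; only the experiment names and the QK_GAIN_INIT=5.0 value come from this log:

```python
import json

# Assumed schema: each entry names an experiment and the env overrides
# the runner passes through to train_gpt.py.
entries = [
    {"name": "QK0_qkgain5_alone",       "env": {"QK_GAIN_INIT": "5.0"}},
    {"name": "QK1_qkgain5_seed42",      "env": {"QK_GAIN_INIT": "5.0"}, "seed": 42},
    {"name": "QK2_qkgain5_L4weights",   "env": {"QK_GAIN_INIT": "5.0"}},
    {"name": "QK3_qkgain5_with_engram", "env": {"QK_GAIN_INIT": "5.0"}},
]

# Every entry overrides the env var; none touches model code.
assert all(e["env"]["QK_GAIN_INIT"] == "5.0" for e in entries)
serialized = json.dumps(entries, indent=2)
```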
### Expected outcome
PR #1437 reports its full stack at val_bpb 1.078 with QK_GAIN_INIT=5.0; PR #1423 reports 1.079. The implied marginal benefit of QK_GAIN_INIT=5.0 alone, versus the rest of the stack, is small (~-0.0004 BPB per the subagent estimate). At our scale this might translate to -0.005 to -0.015 train_loss — possibly within the noise band, but worth measuring.
If QK0 lands at 3.27-3.29 (within champion noise), it's a free addition. If it lands at 3.31+, it's neutral-to-negative. Either way it's a clean empirical answer in 2 cycles (~10 min per experiment).
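The decision rule above as a sketch (thresholds from this log; the function name and the treatment of the uncovered 3.29-3.31 gap are my assumptions):

```python
def qk0_verdict(train_loss: float) -> str:
    # 3.27-3.29 is the champion noise band; anything at or below it
    # counts as a free addition. The log doesn't classify 3.29-3.31,
    # so it's labeled "ambiguous" here.
    if train_loss <= 3.29:
        return "free addition"
    if train_loss >= 3.31:
        return "neutral-to-negative"
    return "ambiguous"

assert qk0_verdict(3.28) == "free addition"
assert qk0_verdict(3.32) == "neutral-to-negative"
```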
### What this fire produced
- **4 QK_GAIN experiments queued** for loop validation
- **No code patches** — purely a JSON queue addition
- **Highest signal-to-effort ratio** of any research fire so far (4 experiments, 0 LOC, no anchor risk)

This is the cleanest possible "port from top record" we've shipped all night. If it works, great. If it doesn't, we lost zero compute on patcher risk.