
Commit b31e9b6

Takoda Mundy and claude committed
research fire openai#14: queue QK_GAIN_INIT=5.0 experiments (port from PR openai#1437/openai#1423)
Subagent gap analysis of the top 3 open PRs (openai#1437, openai#1423, openai#1445) found QK_GAIN_INIT=5.0 is the simplest training-time technique we're missing that has 2-PR evidence (top open openai#1 and openai#2 both use 5.0 vs the upstream default of 1.5).

CRITICAL: QK_GAIN_INIT is already an upstream env var (line 60 of train_gpt.py). NO code patch needed — just add experiments that override the env var. Zero patcher risk, zero anchor risk.

Application: q_gain is multiplied element-wise with the query tensor before F.scaled_dot_product_attention, scaling the Q-K product by the gain factor.

4 QK experiments queued: QK0_qkgain5_alone, QK1_qkgain5_seed42, QK2_qkgain5_L4weights, QK3_qkgain5_with_engram.

Hypertuning rule check: this is a SINGLE-value port from 2 top open records, NOT a weight sweep. Satisfies the "port from top records" rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f21cd0c commit b31e9b6

2 files changed

Lines changed: 74 additions & 1 deletion


RESEARCH_LOG.md

Lines changed: 68 additions & 0 deletions
@@ -1244,3 +1244,71 @@ If DR0 fits and gives a clean train_loss, that's a real validation of depth recu
- **First architectural patch** in many fires that fits our train_loss metric

This is the third optimizer/architectural patch in 5 fires that I've shipped (Mousse #9, MuonEq-R #10, Depth Recurrence #13). The pattern of "find a port from a top record, ship the conservative variant, validate on the loop within 1-2 hours" is now the operating mode for the remaining session.

---

## Research Fire #14 — 2026-04-08 (cron min :16, Track A) — QK_GAIN_INIT=5.0 port (NO code patch needed)

**Subject**: Subagent gap analysis of the top 3 open PRs (#1437, #1423, #1445) for training-time techniques we don't have. Looking for the simplest port that fits our train_loss metric.

### Subagent finding (top 3 PR cross-reference)

**Training-time techniques NOT in our 24-patch stack**:

1. **QK_GAIN_INIT=5.0** (vs default 1.5) — used in PR #1437 (1.078) AND PR #1423 (1.079). Highest port-with-evidence ratio.
2. **WD=0.095** (vs default 0.090) — PR #1445 only
3. **WARMDOWN_FRAC=0.72** (vs default 0.667) — PR #1445 only

The first one (QK_GAIN_INIT) appears in TWO of the top 3 PRs; the others appear in only one each. It's the top-confidence pick.

### Critical finding: QK_GAIN_INIT is ALREADY an upstream env var

A second, focused subagent confirmed that `qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))` exists at line 60 of upstream train_gpt.py. The default is **1.5** (NOT 4.0, as the first subagent guessed).

**Application** (lines 592-593 in CausalSelfAttention.forward):

```python
q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
y = F.scaled_dot_product_attention(...)
```

The `q_gain` parameter is initialized from the `qk_gain_init` env var and multiplied element-wise with the query tensor before attention, which effectively scales the Q-K product by the gain factor inside attention.
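
For intuition, here is a minimal, self-contained sketch of that mechanism. It assumes one learnable gain per attention head (which matches the `[None, :, None, None]` broadcast above); the module name, tensor layout, and the `is_causal` flag are illustrative assumptions, not the upstream code:

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F

# Read the gain init the way upstream does (default 1.5);
# QK_GAIN_INIT=5.0 in the experiment env overrides it with no code change.
qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5))


class GainedAttention(nn.Module):
    """Hypothetical minimal module showing the q_gain mechanism."""

    def __init__(self, n_head: int):
        super().__init__()
        # one learnable scalar gain per head, all starting at qk_gain_init
        self.q_gain = nn.Parameter(torch.full((n_head,), qk_gain_init))

    def forward(self, q, k, v):
        # q, k, v: (batch, n_head, seq_len, head_dim)
        # broadcast the per-head gain over batch, seq_len, and head_dim
        q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None]
        # scaling Q by g scales every Q·Kᵀ logit by g before the softmax
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```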

### Why no code patch is needed

`QK_GAIN_INIT` is already a first-class env var in upstream train_gpt.py. To port the PR #1437/#1423 finding, I just need to add experiments that pass `QK_GAIN_INIT=5.0` as an environment variable. The runner already supports passing env vars to train_gpt.py. **Zero code changes**, just JSON additions to experiments.json.
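
To make the mechanics concrete, here is a hypothetical sketch of how a runner can turn experiment JSON fields into env vars for train_gpt.py. The experiments.json path matches this repo, but the loop itself is illustrative, not our actual runner code:

```python
import json
import os
import subprocess

# Illustrative runner loop: every non-"name" field in an experiment entry
# becomes an environment variable for the training process.
with open("runpod_tests/loop/experiments.json") as f:
    experiments = json.load(f)

for exp in experiments:
    env = dict(os.environ)
    env.update({k: str(v) for k, v in exp.items() if k != "name"})
    # QK_GAIN_INIT=5.0 reaches train_gpt.py's os.environ.get() with no patch
    subprocess.run(["python", "train_gpt.py"], env=env, check=True)
```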

This is the cleanest possible ship: no patcher anchor risk, no graceful-fallback code, no marker checks. The patch surface area is exactly 4 new experiment entries in experiments.json.

### Hypertuning rule check

User wishes (CLAUDE.md):

- "NO HYPERTUNING — don't push experiments that just twiddle weights of validated configs"
- "PORTING-WITH-EVIDENCE — every patch must either be: (a) novel, or (b) ported from a comp PR that's in the top 10 records"

QK_GAIN_INIT=5.0:

- ✓ Port from PR #1437 (top open) AND PR #1423 (top open #2)
- ✓ Single value, NOT a sweep
- ✓ Targeted single change
- ✓ Empirical evidence at competitor scale

Satisfies the spirit of "port from top records" without violating "no hypertuning sweeps". Multi-seed validation experiments are explicitly OK per the rule.

### Experiments queued (4 added → queue is now 36)

- **QK0_qkgain5_alone** — QK_GAIN_INIT=5.0 + L5 weights + leaky + ngram (the champion config with a single change)
- **QK1_qkgain5_seed42** — multi-seed validation
- **QK2_qkgain5_L4weights** — QK_GAIN_INIT=5.0 with L4 weights (the more reliable champion family)
- **QK3_qkgain5_with_engram** — stacked with EngramLite to test additivity

### Expected outcome

PR #1437 reports its full stack at val_bpb 1.078 with QK_GAIN_INIT=5.0; PR #1423 reports 1.079. The implied marginal benefit of QK_GAIN_INIT=5.0 alone, versus the rest of the stack, is small (~-0.0004 BPB, based on the subagent's estimate). At our scale this might translate to -0.005 to -0.015 train_loss — possibly within the noise band, but worth measuring.

If QK0 lands at 3.27-3.29 (within champion noise), it's a free addition. If it lands at 3.31+, it's neutral-to-negative. Either way we get a clean empirical answer in 2 cycles (~10 min per experiment).

### What this fire produced

- **4 QK_GAIN experiments queued** for loop validation
- **No code patches** — purely a JSON queue addition
- **Highest signal-to-effort ratio** of any research fire so far (4 experiments, 0 LOC, no anchor risk)

This is the cleanest possible "port from top record" we've shipped all night. If it works, great. If it doesn't, we lost zero compute on patcher risk.

runpod_tests/loop/experiments.json

Lines changed: 6 additions & 1 deletion
@@ -38,5 +38,10 @@
 {"name": "DR0_recur_block3_min", "USE_DEPTH_RECURRENCE": "1", "DEPTH_RECUR_START": "3", "DEPTH_RECUR_END": "3", "DEPTH_RECUR_CYCLES": "2", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
 {"name": "DR1_recur_blocks3_4", "USE_DEPTH_RECURRENCE": "1", "DEPTH_RECUR_START": "3", "DEPTH_RECUR_END": "4", "DEPTH_RECUR_CYCLES": "2", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
 {"name": "DR2_recur_block3_3x", "USE_DEPTH_RECURRENCE": "1", "DEPTH_RECUR_START": "3", "DEPTH_RECUR_END": "3", "DEPTH_RECUR_CYCLES": "3", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
-{"name": "DR3_recur_seed42", "USE_DEPTH_RECURRENCE": "1", "DEPTH_RECUR_START": "3", "DEPTH_RECUR_END": "3", "DEPTH_RECUR_CYCLES": "2", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"}
+{"name": "DR3_recur_seed42", "USE_DEPTH_RECURRENCE": "1", "DEPTH_RECUR_START": "3", "DEPTH_RECUR_END": "3", "DEPTH_RECUR_CYCLES": "2", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
+
+{"name": "QK0_qkgain5_alone", "QK_GAIN_INIT": "5.0", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
+{"name": "QK1_qkgain5_seed42", "QK_GAIN_INIT": "5.0", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.15", "NGRAM_W_TRIGRAM": "0.20", "NGRAM_W_FOURGRAM": "0.15", "MAX_WALLCLOCK_SECONDS": "300"},
+{"name": "QK2_qkgain5_L4weights", "QK_GAIN_INIT": "5.0", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.25", "NGRAM_W_TRIGRAM": "0.25", "NGRAM_W_FOURGRAM": "0.20", "MAX_WALLCLOCK_SECONDS": "300"},
+{"name": "QK3_qkgain5_with_engram", "QK_GAIN_INIT": "5.0", "USE_ENGRAM_LITE": "1", "USE_LEAKY_RELU": "1", "USE_NGRAM_BIAS": "1", "SEED": "42", "NGRAM_W_BIGRAM": "0.25", "NGRAM_W_TRIGRAM": "0.25", "NGRAM_W_FOURGRAM": "0.20", "MAX_WALLCLOCK_SECONDS": "300"}
 ]
