# Research Log — Session 5 (2026-04-04, continued)

## Environment
- GPU: 1× NVIDIA H100 80GB HBM3
- PyTorch 2.11.0+cu128, FA3 3.0.0, Triton OK, CUDA 12.8
- Baseline verified: 353ms/step, 2.597 BPB @ 50 steps

## State at Start of Session

### From Session 4 (same day)
EXP-1 (SLOT compatibility test on FiLM) completed:
- **FiLM baseline (int6)**: 1.3003 BPB
- **Standard SLOT (int6+SLOT24)**: 0.9028 BPB (-0.3975) — works but **ILLEGAL**
- **Causal SLOT v1 (int6+SLOT24)**: 1.3095 BPB (+0.009) — **HURTS performance**
- Multiple runs crashed with a torch.compile dtype mismatch (fixed iteratively)

### Root Cause Analysis: Why Causal SLOT v1 Failed

The broadcast delta `[bsz, 1, hdim]` is the core problem:
1. **Standard SLOT**: opt_mask == score_mask (same positions). Delta optimized directly for scored positions → massive improvement.
2. **Causal SLOT**: opt_mask (context positions) and score_mask (new positions) are completely disjoint. A broadcast delta optimized for context can actively hurt new positions.
3. The +0.009 BPB result means the delta HURTS more than it helps on new positions.
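A minimal toy sketch of this failure mode (hypothetical shapes and a stand-in loss, following the mask names above): the delta is optimized only on context positions, but the broadcast applies it to the scored positions too.

```python
import torch

torch.manual_seed(0)

# Toy illustration of the v1 failure mode (hypothetical shapes; the real
# objective is BPB, here a simple proxy loss). A single broadcast delta
# [bsz, 1, hdim] is added to every position, but optimized only on the
# context positions (opt_mask), which are disjoint from score_mask.
bsz, seq, hdim = 2, 8, 4
hidden = torch.randn(bsz, seq, hdim)
delta = torch.zeros(bsz, 1, hdim, requires_grad=True)

opt_mask = torch.zeros(seq, dtype=torch.bool)
opt_mask[:6] = True          # context positions
score_mask = ~opt_mask       # new (scored) positions — disjoint from opt_mask

opt = torch.optim.AdamW([delta], lr=1e-1)
for _ in range(50):
    loss = (hidden + delta)[:, opt_mask].pow(2).mean()  # loss on context only
    opt.zero_grad()
    loss.backward()
    opt.step()

# Because of the broadcast, the scored positions receive the same (nonzero)
# delta even though it was never optimized for them — the uncontrolled side
# effect that produced the +0.009 BPB regression.
side_effect = delta.detach().abs().mean()
```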

### Competition Intelligence (from web research)

PR #1350 achieves 1.0046 BPB with causal SLOT (-0.087 BPB). Key implementation details:
- **L-BFGS optimizer** (max_iter=25, history=20) — much faster convergence than AdamW
- **Logit space** — optimize logit biases, not hidden deltas
- **Focal loss on last 128 context tokens** — nearby context is more predictive of new positions
- **Warm start between windows** — carry the bias across consecutive windows
- **Delta clamped to ±5** — prevents overfitting
- Eval time: ~556s
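A hedged sketch of how those ingredients could fit together. The function name and signature are assumptions, and plain cross-entropy stands in for PR #1350's focal loss; this is a reconstruction of the recipe, not the PR's actual code.

```python
import torch
import torch.nn.functional as F

def causal_slot_lbfgs(logits, targets, ctx_mask, prev_bias=None,
                      focal_ctx=128, clamp=5.0, max_iter=25, history=20):
    """Sketch of the PR #1350 recipe: optimize a per-vocab logit bias with
    L-BFGS on the last `focal_ctx` context tokens, clamp it to
    [-clamp, clamp], and warm-start from the previous window's bias.
    Plain cross-entropy is used here in place of the focal loss."""
    vocab = logits.size(-1)
    bias = (prev_bias.clone() if prev_bias is not None
            else torch.zeros(vocab)).requires_grad_(True)

    # restrict the loss to the last focal_ctx context positions
    ctx_idx = ctx_mask.nonzero(as_tuple=True)[0][-focal_ctx:]

    opt = torch.optim.LBFGS([bias], max_iter=max_iter, history_size=history)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits[ctx_idx] + bias, targets[ctx_idx])
        loss.backward()
        return loss

    opt.step(closure)
    bias = bias.detach().clamp_(-clamp, clamp)
    # biased logits for scoring the new positions; bias for the next warm start
    return logits + bias, bias
```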

Per-Sample SLOT (PR #1329) reaches 0.636 BPB but is standard SLOT (illegal).

## Experiments Run

### Quick Smoke Test: L-BFGS logit-only (4 steps)
**Config**: lbfgs_logit mode, 4 steps, no focal/warmstart/clamp (old code)
**Result**: 1.2658 BPB (-0.035 from the 1.3003 baseline)
**Significance**: Confirms the L-BFGS + logit-only approach works for causal SLOT.
Even with just 4 steps and no focal loss or warm start, it already achieves -0.035, versus +0.009 for v1.

### L-BFGS logit-only (24 steps) [RUNNING]
**Config**: lbfgs_logit mode, 24 steps, no focal/warmstart/clamp (old code)
**Expected**: ~1.20-1.25 BPB
**Status**: Running (~45 min estimated)

## Implementation Changes

### SLOT Mode System
Added `SLOT_MODE` env var with four modes:
- `v1`: Original AdamW delta+bias (default for backward compat)
- `logit_only`: AdamW logit bias only (no hidden delta)
- `lbfgs`: L-BFGS delta+bias
- `lbfgs_logit`: L-BFGS logit bias only (recommended for causal)

### Causal SLOT v2 Features (matching PR #1350)
- `SLOT_FOCAL_CTX=128`: Focal loss on last 128 context tokens
- `SLOT_WARMSTART=1`: Carry mean logit bias between batches
- `SLOT_CLAMP=5.0`: Clamp logit bias to [-5, 5]
- `SLOT_LBFGS_HISTORY=20`: L-BFGS curvature history
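One way these knobs could be read in a single place (a sketch; aside from `SLOT_MODE=v1`, the fallback defaults shown here are assumptions, not confirmed values):

```python
import os

def load_slot_config():
    """Sketch: collect the SLOT env-var knobs into one dict.
    Fallback defaults other than SLOT_MODE=v1 are assumptions."""
    return {
        # v1 | logit_only | lbfgs | lbfgs_logit
        "mode": os.environ.get("SLOT_MODE", "v1"),
        # 0 assumed to disable the focal-context restriction
        "focal_ctx": int(os.environ.get("SLOT_FOCAL_CTX", "0")),
        "warmstart": os.environ.get("SLOT_WARMSTART", "0") == "1",
        # 0 assumed to disable clamping
        "clamp": float(os.environ.get("SLOT_CLAMP", "0")),
        "lbfgs_history": int(os.environ.get("SLOT_LBFGS_HISTORY", "20")),
    }
```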

## Untested Novel Ideas

### FiLM-Modulation SLOT (genuinely new)
Instead of optimizing logit biases, optimize FiLM modulation params (attn_scales, mlp_scales, resid_mixes) at test time.
- 14,336 parameters (compact, semantically meaningful)
- Changes HOW the model processes data, not WHAT it outputs
- Requires re-running model forward pass per SLOT step (expensive)
- Could be implemented as "delta to FiLM scales" with clamping
- **Shelved for now** — focus on getting logit-bias approach working first
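For the record, a minimal sketch of what the optimization loop could look like. Everything here — the function, its arguments, and the clamp value — is hypothetical; the idea is untested and the real FiLM parameter names come from the model, not this sketch.

```python
import torch

def film_slot_step(forward_fn, film_params, ctx_loss_fn,
                   steps=4, lr=1e-2, clamp=0.1):
    """Hypothetical sketch of FiLM-modulation SLOT: optimize a small,
    clamped delta to the model's FiLM scale vectors at test time.
    `forward_fn` must re-run the model with the modulated scales (the
    expensive part noted above); `ctx_loss_fn` scores context positions."""
    deltas = [torch.zeros_like(p, requires_grad=True) for p in film_params]
    opt = torch.optim.AdamW(deltas, lr=lr)
    for _ in range(steps):
        # each step re-runs the forward pass with modulated FiLM scales
        out = forward_fn([p + d for p, d in zip(film_params, deltas)])
        loss = ctx_loss_fn(out)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            for d in deltas:
                d.clamp_(-clamp, clamp)  # "delta to FiLM scales" with clamping
    return [d.detach() for d in deltas]
```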

## Next Steps (prioritized)
1. [RUNNING] Get 24-step L-BFGS baseline result
2. [QUEUED] Run v2 variants (focal+warmstart+clamp)
3. [QUEUED] SP4096 200-step screen
4. [QUEUED] QK-Gain 5.0 + WD 0.085 screen
5. [IDEA] Extended depth recurrence (7 → 9-10 virtual layers)
6. [IDEA] FiLM-modulation SLOT