# Research Log — Session 5 (2026-04-04, continued)

## Environment
- GPU: 1× NVIDIA H100 80GB HBM3
- PyTorch 2.11.0+cu128, FA3 3.0.0, Triton OK, CUDA 12.8
- Baseline verified: 353ms/step, 2.597 BPB @ 50 steps

## State at Start of Session

### From Session 4 (same day)
EXP-1 (SLOT compatibility test on FiLM) completed:
- **FiLM baseline (int6)**: 1.3003 BPB
- **Standard SLOT (int6+SLOT24)**: 0.9028 BPB (-0.3975) — works but **ILLEGAL**
- **Causal SLOT v1 (int6+SLOT24)**: 1.3095 BPB (+0.009) — **HURTS performance**
- Multiple runs crashed with a torch.compile dtype mismatch (fixed iteratively)

### Root Cause Analysis: Why Causal SLOT v1 Failed

The broadcast delta `[bsz, 1, hdim]` is the core problem:
1. **Standard SLOT**: opt_mask == score_mask (same positions). Delta optimized directly for scored positions → massive improvement.
2. **Causal SLOT**: opt_mask (context positions) and score_mask (new positions) are completely disjoint. A broadcast delta optimized for context can actively hurt new positions.
3. The +0.009 BPB result means the delta HURTS more than it helps on new positions.
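A minimal toy sketch of this failure mode (hypothetical shapes and a stand-in loss, following the mask names above): the delta is optimized only on context positions, but the broadcast applies it to the scored positions too.

```python
import torch

torch.manual_seed(0)

# Toy illustration of the v1 failure mode (hypothetical shapes; the real
# objective is BPB, here a simple proxy loss). A single broadcast delta
# [bsz, 1, hdim] is added to every position, but optimized only on the
# context positions (opt_mask), which are disjoint from score_mask.
bsz, seq, hdim = 2, 8, 4
hidden = torch.randn(bsz, seq, hdim)
delta = torch.zeros(bsz, 1, hdim, requires_grad=True)

opt_mask = torch.zeros(seq, dtype=torch.bool)
opt_mask[:6] = True          # context positions
score_mask = ~opt_mask       # new (scored) positions — disjoint from opt_mask

opt = torch.optim.AdamW([delta], lr=1e-1)
for _ in range(50):
    loss = (hidden + delta)[:, opt_mask].pow(2).mean()  # loss on context only
    opt.zero_grad()
    loss.backward()
    opt.step()

# Because of the broadcast, the scored positions receive the same (nonzero)
# delta even though it was never optimized for them — the uncontrolled side
# effect that produced the +0.009 BPB regression.
side_effect = delta.detach().abs().mean()
```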

### Competition Intelligence (from web research)

PR #1350 achieves 1.0046 BPB with causal SLOT (-0.087 BPB). Key implementation details:
- **L-BFGS optimizer** (max_iter=25, history=20) — much faster convergence than AdamW
- **Logit space** — optimize logit biases, not hidden deltas
- **Focal loss on last 128 context tokens** — nearby context is more predictive of new positions
- **Warm start between windows** — carry the bias across consecutive windows
- **Delta clamped to ±5** — prevents overfitting
- Eval time: ~556s
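A hedged sketch of how those ingredients could fit together. The function name and signature are assumptions, and plain cross-entropy stands in for PR #1350's focal loss; this is a reconstruction of the recipe, not the PR's actual code.

```python
import torch
import torch.nn.functional as F

def causal_slot_lbfgs(logits, targets, ctx_mask, prev_bias=None,
                      focal_ctx=128, clamp=5.0, max_iter=25, history=20):
    """Sketch of the PR #1350 recipe: optimize a per-vocab logit bias with
    L-BFGS on the last `focal_ctx` context tokens, clamp it to
    [-clamp, clamp], and warm-start from the previous window's bias.
    Plain cross-entropy is used here in place of the focal loss."""
    vocab = logits.size(-1)
    bias = (prev_bias.clone() if prev_bias is not None
            else torch.zeros(vocab)).requires_grad_(True)

    # restrict the loss to the last focal_ctx context positions
    ctx_idx = ctx_mask.nonzero(as_tuple=True)[0][-focal_ctx:]

    opt = torch.optim.LBFGS([bias], max_iter=max_iter, history_size=history)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits[ctx_idx] + bias, targets[ctx_idx])
        loss.backward()
        return loss

    opt.step(closure)
    bias = bias.detach().clamp_(-clamp, clamp)
    # biased logits for scoring the new positions; bias for the next warm start
    return logits + bias, bias
```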

Per-Sample SLOT (PR #1329) reaches 0.636 BPB but is standard SLOT (illegal).

## Experiments Run

### Quick Smoke Test: L-BFGS logit-only (4 steps)
**Config**: lbfgs_logit mode, 4 steps, no focal/warmstart/clamp (old code)
**Result**: 1.2658 BPB (-0.035 from the 1.3003 baseline)
**Significance**: Confirms the L-BFGS + logit-only approach works for causal SLOT.
Even with just 4 steps and no focal loss or warm start, it already achieves -0.035, versus +0.009 for v1.

### L-BFGS logit-only (24 steps) [RUNNING]
**Config**: lbfgs_logit mode, 24 steps, no focal/warmstart/clamp (old code)
**Expected**: ~1.20-1.25 BPB
**Status**: Running (~45 min estimated)

## Implementation Changes

### SLOT Mode System
Added `SLOT_MODE` env var with four modes:
- `v1`: Original AdamW delta+bias (default for backward compat)
- `logit_only`: AdamW logit bias only (no hidden delta)
- `lbfgs`: L-BFGS delta+bias
- `lbfgs_logit`: L-BFGS logit bias only (recommended for causal)

### Causal SLOT v2 Features (matching PR #1350)
- `SLOT_FOCAL_CTX=128`: Focal loss on last 128 context tokens
- `SLOT_WARMSTART=1`: Carry mean logit bias between batches
- `SLOT_CLAMP=5.0`: Clamp logit bias to [-5, 5]
- `SLOT_LBFGS_HISTORY=20`: L-BFGS curvature history
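One way these knobs could be read in a single place (a sketch; aside from `SLOT_MODE=v1`, the fallback defaults shown here are assumptions, not confirmed values):

```python
import os

def load_slot_config():
    """Sketch: collect the SLOT env-var knobs into one dict.
    Fallback defaults other than SLOT_MODE=v1 are assumptions."""
    return {
        # v1 | logit_only | lbfgs | lbfgs_logit
        "mode": os.environ.get("SLOT_MODE", "v1"),
        # 0 assumed to disable the focal-context restriction
        "focal_ctx": int(os.environ.get("SLOT_FOCAL_CTX", "0")),
        "warmstart": os.environ.get("SLOT_WARMSTART", "0") == "1",
        # 0 assumed to disable clamping
        "clamp": float(os.environ.get("SLOT_CLAMP", "0")),
        "lbfgs_history": int(os.environ.get("SLOT_LBFGS_HISTORY", "20")),
    }
```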

## Untested Novel Ideas

### FiLM-Modulation SLOT (genuinely new)
Instead of optimizing logit biases, optimize FiLM modulation params (attn_scales, mlp_scales, resid_mixes) at test time.
- 14,336 parameters (compact, semantically meaningful)
- Changes HOW the model processes data, not WHAT it outputs
- Requires re-running model forward pass per SLOT step (expensive)
- Could be implemented as "delta to FiLM scales" with clamping
- **Shelved for now** — focus on getting logit-bias approach working first
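For the record, a minimal sketch of what the optimization loop could look like. Everything here — the function, its arguments, and the clamp value — is hypothetical; the idea is untested and the real FiLM parameter names come from the model, not this sketch.

```python
import torch

def film_slot_step(forward_fn, film_params, ctx_loss_fn,
                   steps=4, lr=1e-2, clamp=0.1):
    """Hypothetical sketch of FiLM-modulation SLOT: optimize a small,
    clamped delta to the model's FiLM scale vectors at test time.
    `forward_fn` must re-run the model with the modulated scales (the
    expensive part noted above); `ctx_loss_fn` scores context positions."""
    deltas = [torch.zeros_like(p, requires_grad=True) for p in film_params]
    opt = torch.optim.AdamW(deltas, lr=lr)
    for _ in range(steps):
        # each step re-runs the forward pass with modulated FiLM scales
        out = forward_fn([p + d for p, d in zip(film_params, deltas)])
        loss = ctx_loss_fn(out)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            for d in deltas:
                d.clamp_(-clamp, clamp)  # "delta to FiLM scales" with clamping
    return [d.detach() for d in deltas]
```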

## Next Steps (prioritized)
1. [RUNNING] Get 24-step L-BFGS baseline result
2. [QUEUED] Run v2 variants (focal+warmstart+clamp)
3. [QUEUED] SP4096 200-step screen
4. [QUEUED] QK-Gain 5.0 + WD 0.085 screen
5. [IDEA] Extended depth recurrence (7 → 9-10 virtual layers)
6. [IDEA] FiLM-modulation SLOT