Commit 19f4c8f

yuyeon and claude committed
Session 5 research log + causal SLOT v2 test script

Key finding: L-BFGS logit-only causal SLOT gives -0.035 BPB (4 steps) vs v1's +0.009 (24 steps). Confirms root cause diagnosis.

Causal SLOT v2 test script compares:
- v2_full: focal=128, warmstart, clamp=5, 25 steps (PR openai#1350 approach)
- v2_50steps: same but 50 steps (check if more steps help)
- v2_nofocal: all context (ablation)
- v2_adamw: AdamW instead of L-BFGS (optimizer ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 85dee53 commit 19f4c8f

2 files changed

Lines changed: 180 additions & 0 deletions

docs/research_log_session5.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Research Log: Session 5 (2026-04-04, continued)

## Environment
- GPU: 1× NVIDIA H100 80GB HBM3
- PyTorch 2.11.0+cu128, FA3 3.0.0, Triton OK, CUDA 12.8
- Baseline verified: 353ms/step, 2.597 BPB @ 50 steps

## State at Start of Session

### From Session 4 (same day)
EXP-1 (SLOT compatibility test on FiLM) completed:
- **FiLM baseline (int6)**: 1.3003 BPB
- **Standard SLOT (int6+SLOT24)**: 0.9028 BPB (-0.3975), works but **ILLEGAL**
- **Causal SLOT v1 (int6+SLOT24)**: 1.3095 BPB (+0.009), **HURTS performance**
- Multiple runs crashed with torch.compile dtype mismatch (fixed iteratively)

### Root Cause Analysis: Why Causal SLOT v1 Failed

The broadcast delta `[bsz, 1, hdim]` (one hidden-state offset per sequence, shared by every position) is the core problem:
1. **Standard SLOT**: opt_mask == score_mask (same positions). The delta is optimized directly on the scored positions → massive improvement.
2. **Causal SLOT**: opt_mask (context positions) and score_mask (new positions) are completely disjoint. A broadcast delta optimized for context can actively hurt new positions.
3. The +0.009 BPB result means the delta hurts more than it helps on new positions.
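
The disjointness in point 2 can be made concrete with a small sketch. It assumes a sliding-window evaluation in which each window scores only its last `stride` tokens and treats everything before them as context. The function names and the window sizes are illustrative and are not taken from `train_gpt.py`.

```python
def slot_masks(window_len, stride, causal):
    """Return (opt_mask, score_mask) as lists of booleans for one window.

    Assumption: the last `stride` positions are the newly scored tokens,
    and every earlier position is context.
    """
    score_mask = [i >= window_len - stride for i in range(window_len)]
    if causal:
        # Causal SLOT: optimize only on context, never on scored tokens.
        opt_mask = [not s for s in score_mask]
    else:
        # Standard SLOT: optimize on exactly the positions being scored.
        opt_mask = list(score_mask)
    return opt_mask, score_mask


def overlap(opt_mask, score_mask):
    """Number of positions that are both optimized on and scored."""
    return sum(o and s for o, s in zip(opt_mask, score_mask))


std_opt, std_score = slot_masks(window_len=8, stride=2, causal=False)
cau_opt, cau_score = slot_masks(window_len=8, stride=2, causal=True)
print(overlap(std_opt, std_score))  # 2: the delta is tuned on the scored tokens
print(overlap(cau_opt, cau_score))  # 0: the delta never sees the scored tokens
```

With zero overlap, anything learned on the context has to transfer to the scored positions, and a single delta shared across all positions has no way to specialize for them.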

### Competition Intelligence (from web research)

PR #1350 achieves 1.0046 BPB with causal SLOT (-0.087 BPB). Key implementation details:
- **L-BFGS optimizer** (max_iter=25, history=20): much faster convergence than AdamW
- **Logit space**: optimize logit biases, not hidden deltas
- **Focal loss on last 128 context tokens**: nearby context is more predictive of new positions
- **Warm-start between windows**: carry bias across consecutive windows
- **Delta clamped to +/-5**: prevent overfitting
- Eval time: ~556s

Per-Sample SLOT (PR #1329) reaches 0.636 BPB but is standard SLOT (illegal).
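
The PR #1350 ingredients listed above can be sketched on toy data. This is a minimal illustration, not the PR's code. Plain gradient descent stands in for L-BFGS so the example needs only the standard library. "Focal" is read here as restricting the loss to the last `focal_ctx` context tokens, with `0` meaning all context. All function names and the toy logits are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits, targets, bias):
    """Mean cross-entropy (in nats) of targets under softmax(logits + bias)."""
    total = 0.0
    for row, t in zip(logits, targets):
        probs = softmax([x + b for x, b in zip(row, bias)])
        total -= math.log(probs[t])
    return total / len(targets)


def fit_logit_bias(ctx_logits, ctx_targets, focal_ctx, clamp, steps, lr, init=None):
    """Fit one vocab-sized logit bias on context tokens only.

    Gradient descent is a stand-in for L-BFGS (an assumption of this sketch).
    """
    vocab = len(ctx_logits[0])
    # Warm start: begin from the previous window's bias when one is given.
    bias = list(init) if init is not None else [0.0] * vocab
    if focal_ctx > 0:
        # Keep only the most recent context tokens.
        ctx_logits, ctx_targets = ctx_logits[-focal_ctx:], ctx_targets[-focal_ctx:]
    n = len(ctx_targets)
    for _ in range(steps):
        grad = [0.0] * vocab
        for row, t in zip(ctx_logits, ctx_targets):
            probs = softmax([x + b for x, b in zip(row, bias)])
            for v in range(vocab):
                # d(NLL)/d(bias_v) = p_v - 1[v == target]
                grad[v] += (probs[v] - (1.0 if v == t else 0.0)) / n
        # Gradient step, then clamp every entry to [-clamp, clamp].
        bias = [max(-clamp, min(clamp, b - lr * g)) for b, g in zip(bias, grad)]
    return bias


# Toy window: a 3-token vocabulary and a model that outputs flat logits.
ctx_logits = [[0.0, 0.0, 0.0] for _ in range(6)]
ctx_targets = [2, 2, 1, 2, 2, 2]
bias = fit_logit_bias(ctx_logits, ctx_targets, focal_ctx=4, clamp=5.0, steps=50, lr=0.5)

before = mean_nll(ctx_logits[-4:], ctx_targets[-4:], [0.0] * 3)
after = mean_nll(ctx_logits[-4:], ctx_targets[-4:], bias)
print(before > after)

# The fitted bias is then applied, unchanged, to positions it was never fit on.
new_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
new_targets = [2, 2]
print(mean_nll(new_logits, new_targets, bias) < mean_nll(new_logits, new_targets, [0.0] * 3))
```

In this toy case the new tokens follow the same distribution as the recent context, which is the situation the focal window is meant to exploit. Passing the returned `bias` as `init` for the next window gives the warm start.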

## Experiments Run

### Quick Smoke Test: L-BFGS logit-only (4 steps)
**Config**: lbfgs_logit mode, 4 steps, no focal/warmstart/clamp (old code)
**Result**: 1.2658 BPB (-0.035 from 1.3003 baseline)
**Significance**: Confirms L-BFGS + logit-only approach works for causal SLOT.
Even with just 4 steps and no focal/warmstart, it already reaches -0.035 vs +0.009 for v1.
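
As a quick check of the arithmetic, the signed deltas quoted in this log can be recomputed from the reported BPB values. The numbers are copied from the log. The dictionary keys are just labels chosen for this sketch.

```python
BASELINE_BPB = 1.3003  # FiLM baseline (int6), from Session 4

results = {
    "standard SLOT (illegal)": 0.9028,
    "causal SLOT v1, 24 steps": 1.3095,
    "lbfgs_logit, 4 steps": 1.2658,
}

# Negative delta = fewer bits per byte = better than baseline.
deltas = {name: round(bpb - BASELINE_BPB, 4) for name, bpb in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f} BPB")
```

The recomputed values agree with the figures in the log up to rounding.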

### L-BFGS logit-only (24 steps) [RUNNING]
**Config**: lbfgs_logit mode, 24 steps, no focal/warmstart/clamp (old code)
**Expected**: ~1.20-1.25 BPB
**Status**: Running (~45 min estimated)

## Implementation Changes

### SLOT Mode System
Added `SLOT_MODE` env var with four modes:
- `v1`: Original AdamW delta+bias (default for backward compat)
- `logit_only`: AdamW logit bias only (no hidden delta)
- `lbfgs`: L-BFGS delta+bias
- `lbfgs_logit`: L-BFGS logit bias only (recommended for causal)

### Causal SLOT v2 Features (matching PR #1350)
- `SLOT_FOCAL_CTX=128`: Focal loss on last 128 context tokens
- `SLOT_WARMSTART=1`: Carry mean logit bias between batches
- `SLOT_CLAMP=5.0`: Clamp logit bias to [-5, 5]
- `SLOT_LBFGS_HISTORY=20`: L-BFGS curvature history
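
One way these environment variables could be read and validated is sketched below. The variable names and the `v1` default come from this log. The `SlotConfig` class, the `read_slot_config` helper, and every other default are assumptions made for illustration, not the actual parsing in `train_gpt.py`.

```python
from dataclasses import dataclass

SLOT_MODES = ("v1", "logit_only", "lbfgs", "lbfgs_logit")


@dataclass
class SlotConfig:
    mode: str
    focal_ctx: int
    warmstart: bool
    clamp: float
    lbfgs_history: int


def read_slot_config(env):
    """Build a SlotConfig from a mapping such as os.environ."""
    mode = env.get("SLOT_MODE", "v1")  # v1 is the documented default
    if mode not in SLOT_MODES:
        raise ValueError(f"unknown SLOT_MODE {mode!r}; expected one of {SLOT_MODES}")
    return SlotConfig(
        mode=mode,
        # Assumed defaults: 0 / off is taken to mean "feature disabled".
        focal_ctx=int(env.get("SLOT_FOCAL_CTX", "0")),
        warmstart=env.get("SLOT_WARMSTART", "0") == "1",
        clamp=float(env.get("SLOT_CLAMP", "0")),
        lbfgs_history=int(env.get("SLOT_LBFGS_HISTORY", "20")),
    )


cfg = read_slot_config({
    "SLOT_MODE": "lbfgs_logit",
    "SLOT_FOCAL_CTX": "128",
    "SLOT_WARMSTART": "1",
    "SLOT_CLAMP": "5.0",
})
print(cfg)
```

Rejecting unknown modes early avoids silently falling back to `v1`, which this session showed to hurt BPB in the causal setting.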

## Untested Novel Ideas

### FiLM-Modulation SLOT (genuinely new)
Instead of optimizing logit biases, optimize FiLM modulation params (attn_scales, mlp_scales, resid_mixes) at test time.
- 14,336 parameters (compact, semantically meaningful)
- Changes HOW the model processes data, not WHAT it outputs
- Requires re-running model forward pass per SLOT step (expensive)
- Could be implemented as "delta to FiLM scales" with clamping
- **Shelved for now**: focus on getting logit-bias approach working first
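
The "delta to FiLM scales" idea might look like the sketch below. This is purely illustrative: the idea is untested, the scale values are made up, and `apply_scale_delta` is a hypothetical helper rather than anything in the codebase.

```python
def apply_scale_delta(base_scales, delta, clamp):
    """Shift each trained scale by its delta, limiting the delta to [-clamp, clamp]."""
    return [s + max(-clamp, min(clamp, d)) for s, d in zip(base_scales, delta)]


attn_scales = [1.0, 0.8, 1.2]  # made-up values standing in for trained FiLM scales

# Zero-initialized delta: the first SLOT step starts from the trained model.
delta = [0.0, 0.0, 0.0]
assert apply_scale_delta(attn_scales, delta, clamp=0.1) == attn_scales

# Pretend an optimizer proposed this update; the -0.3 entry is clamped to -0.1.
delta = [0.05, -0.3, 0.02]
print(apply_scale_delta(attn_scales, delta, clamp=0.1))
```

Keeping the trained scales fixed and optimizing only a clamped delta bounds how far test-time adaptation can move the model from its trained behavior.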

## Next Steps (prioritized)
1. [RUNNING] Get 24-step L-BFGS baseline result
2. [QUEUED] Run v2 variants (focal+warmstart+clamp)
3. [QUEUED] SP4096 200-step screen
4. [QUEUED] QK-Gain 5.0 + WD 0.085 screen
5. [IDEA] Extended depth recurrence (7 → 9-10 virtual layers)
6. [IDEA] FiLM-modulation SLOT

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
#!/bin/bash
# Test improved causal SLOT v2 with focal context, warm-start, clamping
# Uses L-BFGS logit-only mode with PR #1350 hyperparameters
set -euo pipefail

export PATH="$HOME/.local/bin:$PATH"
export C_INCLUDE_PATH="$HOME/.local/include:$HOME/.local/include/python3.10"
export CPATH="$HOME/.local/include:$HOME/.local/include/python3.10"

REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
SCRIPT="$REPO_ROOT/experiments/film_slot/train_gpt.py"
CKPT="$REPO_ROOT/experiments/film_slot_test/run_20260404_213815/final_model.pt"
WORKDIR="$REPO_ROOT/experiments/film_slot_test/causal_v2_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORKDIR"

if [ ! -f "$CKPT" ]; then
    echo "ERROR: checkpoint not found at $CKPT" >&2
    exit 1
fi

# Common env vars
export DATA_PATH="$REPO_ROOT/data/datasets/fineweb10B_sp1024"
export TOKENIZER_PATH="$REPO_ROOT/data/tokenizers/fineweb_1024_bpe.model"
export VOCAB_SIZE=1024
export NUM_SHARED_BLOCKS=5
export NUM_LAYERS=7
export MLP_MULT=8
export SEED=42
export MAX_WALLCLOCK_SECONDS=0
export ITERATIONS=200
export TRAIN_SEQ_LEN=1024
export TRAIN_BATCH_TOKENS=524288
export VAL_LOSS_EVERY=0
export TRAIN_LOG_EVERY=200
export SLOT_ENABLED=1
export CAUSAL_SLOT=1
export SLOT_LR=0.012
export SLOT_LR_MIN=0.001
export SLOT_BATCH_SEQS=32
export EVAL_STRIDE=96
export LOAD_CHECKPOINT="$CKPT"
export USE_INT6=1

run_variant() {
    local name="$1"
    shift
    echo ""
    echo "============================================"
    echo " $name"
    echo "============================================"
    local dir="$WORKDIR/$name"
    mkdir -p "$dir"
    cd "$dir"
    export RUN_ID="${name}_$$"
    # Apply all extra env vars (KEY=VALUE arguments)
    for var in "$@"; do
        export "$var"
    done
    python3 "$SCRIPT" 2>&1 | tee "${name}.log"
    # grep exits 1 when nothing matches; under `set -e` with pipefail that
    # would abort the remaining variants, so tolerate an empty match here.
    grep -E "final_slot|ERROR" "${name}.log" | tail -3 || true
    echo ""
}

# V2 variants: all use lbfgs_logit as base
# A: Full PR #1350 approach (focal=128, warmstart, clamp=5, steps=25)
run_variant "v2_full" \
    SLOT_MODE=lbfgs_logit SLOT_STEPS=25 SLOT_FOCAL_CTX=128 \
    SLOT_WARMSTART=1 SLOT_CLAMP=5.0 SLOT_LBFGS_HISTORY=20

# B: More steps (50) with same settings
run_variant "v2_50steps" \
    SLOT_MODE=lbfgs_logit SLOT_STEPS=50 SLOT_FOCAL_CTX=128 \
    SLOT_WARMSTART=1 SLOT_CLAMP=5.0 SLOT_LBFGS_HISTORY=20

# C: No focal (all context) for comparison
run_variant "v2_nofocal" \
    SLOT_MODE=lbfgs_logit SLOT_STEPS=25 SLOT_FOCAL_CTX=0 \
    SLOT_WARMSTART=1 SLOT_CLAMP=5.0 SLOT_LBFGS_HISTORY=20

# D: AdamW logit-only with focal+warmstart+clamp (compare optimizer)
run_variant "v2_adamw" \
    SLOT_MODE=logit_only SLOT_STEPS=25 SLOT_FOCAL_CTX=128 \
    SLOT_WARMSTART=1 SLOT_CLAMP=5.0

echo ""
echo "============================================"
echo " RESULTS SUMMARY"
echo "============================================"
echo ""
echo "Baseline (no SLOT, int6): 1.3003 BPB"
echo "v1 causal (AdamW delta+bias, 24 steps): 1.3095 BPB (+0.009, HURTS)"
echo "lbfgs_logit (4 steps, no focal/warmstart): 1.2658 BPB (-0.035)"
echo ""
for name in v2_full v2_50steps v2_nofocal v2_adamw; do
    log="$WORKDIR/$name/${name}.log"
    if [ -f "$log" ]; then
        echo "--- $name ---"
        grep "final_slot.*_exact" "$log" 2>/dev/null || grep "final_slot" "$log" 2>/dev/null | tail -1 || echo "FAILED"
    fi
done
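
To make the ablation design explicit, the four variants can be written as a table and compared against `v2_full`. The settings are copied from the `run_variant` calls above. The `VARIANTS` dictionary and the `diff_from` helper are illustrative and not part of the repository.

```python
FULL = {
    "SLOT_MODE": "lbfgs_logit", "SLOT_STEPS": "25", "SLOT_FOCAL_CTX": "128",
    "SLOT_WARMSTART": "1", "SLOT_CLAMP": "5.0", "SLOT_LBFGS_HISTORY": "20",
}

VARIANTS = {
    "v2_full": dict(FULL),
    "v2_50steps": {**FULL, "SLOT_STEPS": "50"},
    "v2_nofocal": {**FULL, "SLOT_FOCAL_CTX": "0"},
    # v2_adamw is invoked without SLOT_LBFGS_HISTORY.
    "v2_adamw": {k: v for k, v in {**FULL, "SLOT_MODE": "logit_only"}.items()
                 if k != "SLOT_LBFGS_HISTORY"},
}


def diff_from(reference, variant):
    """Map each differing key to (reference value, variant value); None means not passed."""
    keys = sorted(set(reference) | set(variant))
    return {k: (reference.get(k), variant.get(k))
            for k in keys if reference.get(k) != variant.get(k)}


for name, settings in VARIANTS.items():
    print(name, diff_from(VARIANTS["v2_full"], settings))
```

Each ablation changes one knob relative to `v2_full`. Because `run_variant` exports its arguments into the script's own shell, `SLOT_LBFGS_HISTORY=20` from the earlier variants is still exported when `v2_adamw` runs, even though that call does not pass it.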
