
Commit 01724f3

anthony-maio and claude committed
Fix TTT: use eval_model (int6 artifact) not base_model, honor EVAL_STRIDE
P1: TTT was running on the pre-quantization base_model instead of the int6 round-tripped eval_model. This overstated TTT gains, since the artifact model carries quantization noise. Now matches PR openai#473's approach.

P2: TTT hardcoded stride=64 instead of using args.eval_stride. Now honors the configured stride so TTT results stay consistent with the sliding-window eval path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
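The stride handling in P2 amounts to a fall-back rule: use the configured eval stride when it is set to a positive value, otherwise keep the old hardcoded default of 64. A minimal sketch of that rule as a standalone helper (the function name `resolve_stride` is hypothetical, not part of the repo):

```python
def resolve_stride(eval_stride: int, default: int = 64) -> int:
    """Return the configured sliding-window stride, falling back to the
    old hardcoded default (64) when eval_stride is unset or non-positive.

    Mirrors the inline expression in the patch:
        stride=args.eval_stride if args.eval_stride > 0 else 64
    """
    return eval_stride if eval_stride > 0 else default
```

With this rule, a run configured with `--eval_stride 128` evaluates TTT at stride 128, matching the sliding-window eval path, while a run that never sets the flag behaves exactly as before.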
1 parent c3442df commit 01724f3

1 file changed

Lines changed: 3 additions & 2 deletions

records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py
```diff
@@ -1564,9 +1564,10 @@ def lr_mul(step: int, elapsed_ms: float) -> float:
     torch.cuda.synchronize()
     t_ttt = time.perf_counter()
     ttt_val_loss, ttt_val_bpb = eval_val_sliding_ttt(
-        args, base_model, rank, world_size, device,
+        args, eval_model, rank, world_size, device,
         val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut,
-        stride=64, batch_seqs=args.ttt_batch_seqs, log0=log0,
+        stride=args.eval_stride if args.eval_stride > 0 else 64,
+        batch_seqs=args.ttt_batch_seqs, log0=log0,
     )
     torch.cuda.synchronize()
     log0(f"final_ttt val_loss:{ttt_val_loss:.4f} val_bpb:{ttt_val_bpb:.4f} "
```
