full_baseline.log
START full run: 4-pass baseline (no LoRA) TTT SWA, 80min (Thu Mar 26 23:01:24 UTC 2026)
logs/f98390b0-4cf3-4710-b1f1-a48bf3145a00.txt
logs/5733dda4-95d3-4001-9992-a963924a436b.txt
logs/6326f57e-9e2a-4921-b0d9-138db532ff5b.txt
logs/03b40738-df0a-452f-9ffb-6cfcc1cfc8e8.txt
logs/a7ef90b3-b0b5-4e27-87c5-52fc39836bed.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:10
val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
train_loader:dataset:fineweb10B_sp1024 train_shards:10
val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:10
val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:10
val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=../../../data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:10
val_loader:shards pattern=../../../data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
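Editorial note (not part of the original log): the val_bpb values reported below track val_loss by a constant factor of ~0.5923 (e.g. 4.1046/6.9304 at step 0, 1.3699/2.3130 at step 500), consistent with the usual bits-per-byte definition. A minimal sketch, assuming that definition:

import math

def loss_to_bpb(loss_nats_per_token: float, tokens_per_byte: float) -> float:
    # bpb = total_nats / (total_bytes * ln 2)
    return loss_nats_per_token * tokens_per_byte / math.log(2)

# Ratio implied by the eval lines below: ~0.4105 tokens per byte (~2.44 bytes/token).
print(loss_to_bpb(6.9304, 0.4105))  # ~4.104, matching the step-0 eval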
feedback: mode=diagonal rank=2 per_pass=False params=2560
recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3
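Editorial sketch of the layout the two lines above describe: 3 stem layers, 5 core layers (indices 3-7) re-applied for 4 passes, 3 tail layers, with a small learned feedback transform (2560 params here) mixing the previous pass's output back in. The exact mixing rule is an assumption, not shown in this log:

import torch

def recurrent_forward(stem, core, tail, feedback, x, num_passes=4):
    h = stem(x)                    # stem: first 3 layers
    prev = torch.zeros_like(h)     # no feedback signal on the first pass (assumption)
    for _ in range(num_passes):    # core: layers 3-7, reused each pass
        prev = core(h + feedback(prev))
    return tail(prev)              # tail: last 3 layers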
model_params:26927200
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:1337
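Editorial consistency check of the batch configuration logged above (pure arithmetic, not part of the original log):

train_batch_tokens = 786432
train_seq_len = 2048
grad_accum_steps = 8
seqs_per_step = train_batch_tokens // train_seq_len   # 384 sequences per optimizer step
seqs_per_micro = seqs_per_step // grad_accum_steps    # 48, matching the (48, 2048, 512)
print(seqs_per_step, seqs_per_micro)                  # activation buffers in the OOM traces below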
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
wandb: setting up run qltwebo4
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230156-qltwebo4
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run full_4pass_baseline_80min
wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf
wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/qltwebo4
feedback: mode=diagonal rank=2 per_pass=False params=2560
recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3
wandb:initialized
feedback: mode=diagonal rank=2 per_pass=False params=2560
recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3
model_params:26927200
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:1337
model_params:26927200
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:1337
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
feedback: mode=diagonal rank=2 per_pass=False params=2560
recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3
feedback: mode=diagonal rank=2 per_pass=False params=2560
recurrence: core_start=3 core_end=8 num_passes=4 stem=3 core=5 tail=3
model_params:26927200
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:1337
wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
model_params:26927200
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:1337
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: nesta-midavaine (propensity) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
wandb: WARNING Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
wandb: setting up run fsi4c82a
wandb: setting up run zcabiozu
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-zcabiozu
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run full_4pass_baseline_80min
wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf
wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/zcabiozu
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-fsi4c82a
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run full_4pass_baseline_80min
wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf
wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a
wandb:initialized
wandb:initialized
wandb: setting up run 43bipylb
wandb: setting up run jkh80zal
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-jkh80zal
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run full_4pass_baseline_80min
wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf
wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/jkh80zal
wandb: Tracking run with wandb version 0.25.1
wandb: Run data is saved locally in /home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/wandb/run-20260326_230158-43bipylb
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run full_4pass_baseline_80min
wandb: ⭐️ View project at https://wandb.ai/propensity/parameter-golf
wandb: 🚀 View run at https://wandb.ai/propensity/parameter-golf/runs/43bipylb
wandb:initialized
wandb:initialized
Traceback (most recent call last):
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
main()
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main
warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward
def forward(self, input_ids: Tensor, target_ids: Tensor,
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g
return f(*args)
^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward
fw_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
out = model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 9077, in call
buf776 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 24.69 MiB is free. Process 448538 has 754.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Including non-PyTorch memory, this process has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 15.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Traceback (most recent call last):
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
main()
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main
warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward
def forward(self, input_ids: Tensor, target_ids: Tensor,
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g
return f(*args)
^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward
fw_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
out = model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call
buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 20.69 MiB is free. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448539 has 756.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Traceback (most recent call last):
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
main()
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1813, in main
warmup_loss = model(x, y, feedback_fn=feedback_fn, stabilizer=stabilizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 472, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1056, in forward
def forward(self, input_ids: Tensor, target_ids: Tensor,
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1200, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 566, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 105, in g
return f(*args)
^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 596, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2503, in forward
fw_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 783, in wrapper
return compiled_fn(runtime_args)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1011, in inner_fn
outs = compiled_fn(args)
^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
out = model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_nesta/3d/c3dzegwqai5cgqo2lurmxd3pfex6kuqjwm3ooueib5swt7sgm5kg.py", line 7110, in call
buf5 = empty_strided_cuda((98304, 512), (512, 1), torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Including non-PyTorch memory, this process has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Process 448537 has 48.42 GiB memory in use. Of the allocated memory 127.50 MiB is allocated by PyTorch, and 18.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
Traceback (most recent call last):
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 2148, in <module>
main()
File "/home/nesta/parameter-golf/records/track_10min_16mb/2026-03-26_RecurrentSOTA_Feedback/train_gpt_recurrent.py", line 1814, in main
(warmup_loss * grad_scale).backward()
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 631, in backward
torch.autograd.backward(
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 381, in backward
_engine_run_backward(
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 869, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/autograd/function.py", line 317, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2748, in backward
return impl_fn()
^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2734, in impl_fn
out = CompiledFunction._backward_impl(ctx, all_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2918, in _backward_impl
out = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1263, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 656, in __call__
return self.current_callable(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nesta/parameter-golf/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3401, in run
out = model(new_inputs)
^^^^^^^^^^^^^^^^^
File "/tmp/torchinductor_nesta/wm/cwmtq2g54vzhxzvsc4odvxx7srlroi2tosqcftjhmyw2c637ogrq.py", line 12255, in call
buf54 = empty_strided_cuda((48, 2048, 512), (1048576, 512, 1), torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 139.80 GiB of which 18.69 MiB is free. Process 448538 has 758.00 MiB memory in use. Process 448539 has 758.00 MiB memory in use. Process 448534 has 51.16 GiB memory in use. Process 448547 has 38.69 GiB memory in use. Including non-PyTorch memory, this process has 48.42 GiB memory in use. Of the allocated memory 47.76 GiB is allocated by PyTorch, and 3.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
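Editorial note: all four tracebacks above end in the same allocator hint. The device is mostly consumed by several concurrent processes (51.16 + 48.42 + 38.69 GiB on a 139.80 GiB GPU), so the hint would not fix the contention itself, but the suggested setting is applied like this (it must be set before CUDA is initialized):

import os
# Allocator setting quoted verbatim in the OOM messages above.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"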
wandb:
wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/jkh80zal
wandb: Find logs at: wandb/run-20260326_230158-jkh80zal/logs
wandb:
wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/fsi4c82a
wandb: Find logs at: wandb/run-20260326_230158-fsi4c82a/logs
wandb:
wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/43bipylb
wandb: Find logs at: wandb/run-20260326_230158-43bipylb/logs
wandb:
wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/zcabiozu
wandb: Find logs at: wandb/run-20260326_230158-zcabiozu/logs
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9304 val_bpb:4.1046 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9310 grad_norm:0.3717 train_time:1291ms step_avg:1291.09ms
step:2/20000 train_loss:8.3536 grad_norm:3.5393 train_time:2598ms step_avg:1298.81ms
step:3/20000 train_loss:7.5089 grad_norm:1.8069 train_time:3954ms step_avg:1318.13ms
step:4/20000 train_loss:7.5822 grad_norm:1.8725 train_time:5317ms step_avg:1329.29ms
step:5/20000 train_loss:7.3524 grad_norm:1.8843 train_time:6673ms step_avg:1334.64ms
step:6/20000 train_loss:7.0868 grad_norm:1.7131 train_time:8028ms step_avg:1338.00ms
step:7/20000 train_loss:6.9401 grad_norm:2.0897 train_time:9384ms step_avg:1340.63ms
step:8/20000 train_loss:6.8952 grad_norm:1.4534 train_time:10745ms step_avg:1343.15ms
step:9/20000 train_loss:6.5431 grad_norm:1.0222 train_time:12102ms step_avg:1344.70ms
step:10/20000 train_loss:6.1427 grad_norm:0.9715 train_time:13466ms step_avg:1346.55ms
step:50/20000 train_loss:3.6903 grad_norm:0.9422 train_time:68054ms step_avg:1361.07ms
step:100/20000 train_loss:3.1184 grad_norm:0.5410 train_time:136293ms step_avg:1362.93ms
step:150/20000 train_loss:2.7752 grad_norm:0.3613 train_time:205070ms step_avg:1367.13ms
step:200/20000 train_loss:2.5614 grad_norm:0.2693 train_time:273305ms step_avg:1366.53ms
step:250/20000 train_loss:2.5709 grad_norm:0.2522 train_time:341556ms step_avg:1366.22ms
step:300/20000 train_loss:2.4364 grad_norm:0.2295 train_time:409825ms step_avg:1366.08ms
step:350/20000 train_loss:2.4859 grad_norm:0.2104 train_time:478072ms step_avg:1365.92ms
step:400/20000 train_loss:2.3988 grad_norm:0.1555 train_time:546341ms step_avg:1365.85ms
step:450/20000 train_loss:2.2317 grad_norm:0.1958 train_time:614614ms step_avg:1365.81ms
step:500/20000 train_loss:2.2898 grad_norm:0.1775 train_time:682900ms step_avg:1365.80ms
step:500/20000 val_loss:2.3130 val_bpb:1.3699 train_time:682945ms step_avg:1365.89ms
step:550/20000 train_loss:2.3492 grad_norm:0.1559 train_time:751209ms step_avg:1365.83ms
step:600/20000 train_loss:2.2513 grad_norm:0.1438 train_time:819544ms step_avg:1365.91ms
step:650/20000 train_loss:2.2323 grad_norm:0.1536 train_time:888368ms step_avg:1366.72ms
step:700/20000 train_loss:2.3026 grad_norm:0.1020 train_time:956783ms step_avg:1366.83ms
step:750/20000 train_loss:2.2750 grad_norm:0.1105 train_time:1025183ms step_avg:1366.91ms
step:800/20000 train_loss:2.2546 grad_norm:0.1031 train_time:1093599ms step_avg:1367.00ms
step:850/20000 train_loss:2.1799 grad_norm:0.0737 train_time:1162084ms step_avg:1367.16ms
step:900/20000 train_loss:2.0960 grad_norm:0.0817 train_time:1230597ms step_avg:1367.33ms
step:950/20000 train_loss:2.2968 grad_norm:0.0953 train_time:1299094ms step_avg:1367.47ms
step:1000/20000 train_loss:2.2247 grad_norm:0.0713 train_time:1367589ms step_avg:1367.59ms
step:1000/20000 val_loss:2.1722 val_bpb:1.2865 train_time:1367633ms step_avg:1367.63ms
step:1050/20000 train_loss:2.1500 grad_norm:0.1469 train_time:1436112ms step_avg:1367.73ms
step:1100/20000 train_loss:2.1744 grad_norm:0.0794 train_time:1504991ms step_avg:1368.17ms
step:1150/20000 train_loss:2.1290 grad_norm:0.0672 train_time:1573762ms step_avg:1368.49ms
step:1200/20000 train_loss:2.1756 grad_norm:0.0636 train_time:1642514ms step_avg:1368.76ms
step:1250/20000 train_loss:2.1991 grad_norm:0.0599 train_time:1711283ms step_avg:1369.03ms
step:1300/20000 train_loss:2.1695 grad_norm:0.1132 train_time:1780070ms step_avg:1369.28ms
step:1350/20000 train_loss:2.1436 grad_norm:0.1200 train_time:1848866ms step_avg:1369.53ms
step:1400/20000 train_loss:2.1553 grad_norm:0.0700 train_time:1917654ms step_avg:1369.75ms
step:1450/20000 train_loss:2.1501 grad_norm:0.0631 train_time:1986442ms step_avg:1369.96ms
step:1500/20000 train_loss:2.1193 grad_norm:0.0733 train_time:2055220ms step_avg:1370.15ms
step:1500/20000 val_loss:2.1071 val_bpb:1.2479 train_time:2055264ms step_avg:1370.18ms
step:1550/20000 train_loss:2.0928 grad_norm:0.0758 train_time:2124013ms step_avg:1370.33ms
step:1600/20000 train_loss:2.1722 grad_norm:0.0814 train_time:2193129ms step_avg:1370.71ms
step:1650/20000 train_loss:1.9557 grad_norm:0.0655 train_time:2261915ms step_avg:1370.86ms
step:1700/20000 train_loss:2.0848 grad_norm:0.0634 train_time:2330710ms step_avg:1371.01ms
step:1750/20000 train_loss:2.0562 grad_norm:0.0759 train_time:2399493ms step_avg:1371.14ms
step:1800/20000 train_loss:2.0964 grad_norm:0.0645 train_time:2468259ms step_avg:1371.26ms
step:1850/20000 train_loss:2.1107 grad_norm:0.0831 train_time:2537046ms step_avg:1371.38ms
step:1900/20000 train_loss:2.0580 grad_norm:0.0648 train_time:2605824ms step_avg:1371.49ms
step:1950/20000 train_loss:2.0431 grad_norm:0.0981 train_time:2674651ms step_avg:1371.62ms
step:2000/20000 train_loss:2.2944 grad_norm:0.0838 train_time:2743419ms step_avg:1371.71ms
step:2000/20000 val_loss:2.0763 val_bpb:1.2297 train_time:2743463ms step_avg:1371.73ms
step:2050/20000 train_loss:2.0607 grad_norm:0.1013 train_time:2812501ms step_avg:1371.95ms
step:2100/20000 train_loss:2.0358 grad_norm:0.0558 train_time:2881257ms step_avg:1372.03ms
step:2150/20000 train_loss:2.0142 grad_norm:0.0526 train_time:2950035ms step_avg:1372.11ms
step:2200/20000 train_loss:2.1668 grad_norm:0.0614 train_time:3018808ms step_avg:1372.19ms
step:2250/20000 train_loss:2.0604 grad_norm:0.0644 train_time:3087562ms step_avg:1372.25ms
step:2300/20000 train_loss:2.0377 grad_norm:0.1123 train_time:3156291ms step_avg:1372.30ms
step:2350/20000 train_loss:1.9923 grad_norm:0.0511 train_time:3225042ms step_avg:1372.36ms
step:2400/20000 train_loss:2.1062 grad_norm:0.0682 train_time:3293804ms step_avg:1372.42ms
step:2450/20000 train_loss:2.0650 grad_norm:0.0639 train_time:3362565ms step_avg:1372.48ms
step:2500/20000 train_loss:2.0208 grad_norm:0.0580 train_time:3431320ms step_avg:1372.53ms
step:2500/20000 val_loss:2.0279 val_bpb:1.2010 train_time:3431364ms step_avg:1372.55ms
step:2550/20000 train_loss:2.0211 grad_norm:0.0558 train_time:3500393ms step_avg:1372.70ms
step:2600/20000 train_loss:2.0001 grad_norm:0.0479 train_time:3569165ms step_avg:1372.76ms
step:2650/20000 train_loss:2.0040 grad_norm:0.0582 train_time:3637929ms step_avg:1372.80ms
step:2700/20000 train_loss:2.0265 grad_norm:0.0542 train_time:3706703ms step_avg:1372.85ms
step:2750/20000 train_loss:2.0077 grad_norm:0.0457 train_time:3775459ms step_avg:1372.89ms
step:2800/20000 train_loss:2.0415 grad_norm:0.0569 train_time:3844241ms step_avg:1372.94ms
step:2850/20000 train_loss:1.9900 grad_norm:0.0487 train_time:3913011ms step_avg:1372.99ms
step:2900/20000 train_loss:2.0045 grad_norm:0.0438 train_time:3981769ms step_avg:1373.02ms
step:2950/20000 train_loss:2.0440 grad_norm:0.0447 train_time:4050513ms step_avg:1373.06ms
step:3000/20000 train_loss:1.9316 grad_norm:0.0567 train_time:4119545ms step_avg:1373.18ms
step:3000/20000 val_loss:1.9838 val_bpb:1.1749 train_time:4119590ms step_avg:1373.20ms
step:3050/20000 train_loss:1.9372 grad_norm:0.0506 train_time:4188300ms step_avg:1373.21ms
step:3100/20000 train_loss:1.9990 grad_norm:0.0465 train_time:4257075ms step_avg:1373.25ms
step:3150/20000 train_loss:2.0077 grad_norm:0.0401 train_time:4325837ms step_avg:1373.28ms
swa:start step:3200
step:3200/20000 train_loss:1.9812 grad_norm:0.0445 train_time:4394566ms step_avg:1373.30ms
late_qat:enabled step:3241 scale:0.1495 core_quant:on
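Editorial note: "late_qat" suggests quantization-aware training switched on mid-run, ahead of the int6 serialization at the end of this log. A minimal sketch of the usual straight-through fake-quantization (whether this run uses exactly this form, and the exact int6 range, are assumptions):

import torch

def fake_quant_ste(w: torch.Tensor, qmax: float = 31.0) -> torch.Tensor:
    scale = w.abs().max().clamp_min(1e-8) / qmax        # per-tensor scale (assumption)
    q = (w / scale).round().clamp(-qmax, qmax) * scale  # quantize-dequantize in the forward
    return w + (q - w).detach()                         # gradient passes straight through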
step:3250/20000 train_loss:1.9531 grad_norm:0.0567 train_time:4519079ms step_avg:1390.49ms
step:3300/20000 train_loss:1.9296 grad_norm:0.0386 train_time:4587540ms step_avg:1390.16ms
step:3350/20000 train_loss:1.9653 grad_norm:0.0394 train_time:4655858ms step_avg:1389.81ms
step:3400/20000 train_loss:2.0099 grad_norm:0.0483 train_time:4724204ms step_avg:1389.47ms
step:3450/20000 train_loss:1.9637 grad_norm:0.0369 train_time:4792535ms step_avg:1389.14ms
step:3456/20000 val_loss:1.9505 val_bpb:1.1552 train_time:4800814ms step_avg:1389.12ms
stopping_early: wallclock_cap train_time:4800814ms step:3456/20000
peak memory allocated: 50545 MiB reserved: 50594 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9472 val_bpb:1.1532 eval_time:32839ms
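Editorial note: "ema:applying EMA weights" above implies an exponential moving average of the parameters is swapped in for this diagnostic eval. A minimal sketch of that pattern (the decay value is an assumption, not from this log):

import torch

class ParamEMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):  # decay: assumption
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].lerp_(v, 1.0 - self.decay)  # shadow += (1-decay)*(v - shadow)

    def apply_to(self, model: torch.nn.Module) -> None:
        model.load_state_dict(self.shadow)  # evaluate with the averaged weights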
Serialized model: 106023671 bytes
Code size: 102633 bytes
Serialized model int6+lzma: 16373548 bytes
Total submission size int6+lzma: 16476181 bytes
final_int6_roundtrip val_loss:1.9574 val_bpb:1.1593 eval_time:39862ms
final_int6_roundtrip_exact val_loss:1.95735252 val_bpb:1.15925441
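Editorial note: the "int6+lzma" sizes above imply the checkpoint is quantized to 6-bit codes and entropy-coded with LZMA, then round-tripped for the evals on the adjacent lines. A minimal sketch, assuming symmetric per-tensor quantization with codes stored one per byte (the real packing and scheme may differ):

import lzma
import numpy as np

def int6_lzma_roundtrip(w: np.ndarray) -> np.ndarray:
    scale = float(np.abs(w).max()) / 31.0 + 1e-12               # symmetric range (assumption)
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)    # 6-bit codes, held in int8 here
    blob = lzma.compress(q.tobytes())                            # LZMA shrinks the low-entropy codes
    q2 = np.frombuffer(lzma.decompress(blob), dtype=np.int8).reshape(w.shape)
    return q2.astype(w.dtype) * scale                            # dequantized weights for the eval

print(int6_lzma_roundtrip(np.random.randn(2, 3).astype(np.float32)))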
final_int6_sliding_window val_loss:1.9164 val_bpb:1.1350 stride:64 eval_time:1105486ms
final_int6_sliding_window_exact val_loss:1.91642779 val_bpb:1.13501949
final_int8_zlib_roundtrip_exact val_loss:1.91642779 val_bpb:1.13501949
ttt_sliding:start chunks=1893 chunk_tokens=32768 total_windows=969088 stride=64 ttt_lr=0.002 ttt_epochs=3 freeze_blocks=2
ttt_sliding:params unfrozen=26923088 frozen=4112
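Editorial sketch of the chunked test-time-training loop implied by the two lines above. The hyperparameters come from the log; the loop structure, optimizer choice, and score-then-adapt ordering are assumptions, and the stride-64 window bookkeeping is omitted:

import torch

def ttt_sliding(model, chunks, ttt_lr=0.002, ttt_epochs=3):
    # chunks: iterable of (x, y) token tensors, 32768 tokens each in this run (1893 chunks)
    params = [p for p in model.parameters() if p.requires_grad]  # 26,923,088 unfrozen here
    opt = torch.optim.SGD(params, lr=ttt_lr)
    total_loss, total_tokens = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():
            total_loss += model(x, y).item() * y.numel()         # score the chunk before adapting
        total_tokens += y.numel()
        for _ in range(ttt_epochs):                              # then update on the same chunk
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            opt.step()
    return total_loss / total_tokens                             # reported as val_loss at the end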
ttt_chunk [1/1893] bpb=1.226275 time=1.9s
ttt_chunk [11/1893] bpb=1.128206 time=20.5s
ttt_chunk [21/1893] bpb=1.137378 time=39.0s
ttt_chunk [31/1893] bpb=1.142175 time=57.6s
ttt_chunk [41/1893] bpb=1.138228 time=76.1s
ttt_chunk [51/1893] bpb=1.139877 time=94.6s
ttt_chunk [61/1893] bpb=1.143695 time=113.2s
ttt_chunk [71/1893] bpb=1.141806 time=131.7s
ttt_chunk [81/1893] bpb=1.138175 time=150.2s
ttt_chunk [91/1893] bpb=1.137107 time=168.8s
ttt_chunk [101/1893] bpb=1.138115 time=187.3s
ttt_chunk [111/1893] bpb=1.138295 time=205.9s
ttt_chunk [121/1893] bpb=1.134671 time=224.4s
ttt_chunk [131/1893] bpb=1.133939 time=242.9s
ttt_chunk [141/1893] bpb=1.132766 time=261.5s
ttt_chunk [151/1893] bpb=1.132980 time=280.0s
ttt_chunk [161/1893] bpb=1.133800 time=298.6s
ttt_chunk [171/1893] bpb=1.135874 time=317.1s
ttt_chunk [181/1893] bpb=1.135884 time=335.6s
ttt_chunk [191/1893] bpb=1.138340 time=354.2s
ttt_chunk [201/1893] bpb=1.137866 time=372.7s
ttt_chunk [211/1893] bpb=1.136957 time=391.2s
ttt_chunk [221/1893] bpb=1.137842 time=409.8s
ttt_chunk [231/1893] bpb=1.137565 time=428.3s
ttt_chunk [241/1893] bpb=1.137849 time=446.8s
ttt_chunk [251/1893] bpb=1.137360 time=465.4s
ttt_chunk [261/1893] bpb=1.136692 time=483.9s
ttt_chunk [271/1893] bpb=1.135780 time=502.5s
ttt_chunk [281/1893] bpb=1.137389 time=521.0s
ttt_chunk [291/1893] bpb=1.137018 time=539.5s
ttt_chunk [301/1893] bpb=1.137918 time=558.1s
ttt_chunk [311/1893] bpb=1.138001 time=576.6s
ttt_chunk [321/1893] bpb=1.138708 time=595.1s
ttt_chunk [331/1893] bpb=1.138179 time=613.7s
ttt_chunk [341/1893] bpb=1.137832 time=632.2s
ttt_chunk [351/1893] bpb=1.138543 time=650.8s
ttt_chunk [361/1893] bpb=1.139301 time=669.3s
ttt_chunk [371/1893] bpb=1.139185 time=687.8s
ttt_chunk [381/1893] bpb=1.138924 time=706.4s
ttt_chunk [391/1893] bpb=1.139607 time=724.9s
ttt_chunk [401/1893] bpb=1.139172 time=743.4s
ttt_chunk [411/1893] bpb=1.138218 time=762.0s
ttt_chunk [421/1893] bpb=1.138334 time=780.5s
ttt_chunk [431/1893] bpb=1.138777 time=799.1s
ttt_chunk [441/1893] bpb=1.138161 time=817.6s
ttt_chunk [451/1893] bpb=1.138301 time=836.1s
ttt_chunk [461/1893] bpb=1.138190 time=854.7s
ttt_chunk [471/1893] bpb=1.137746 time=873.2s
ttt_chunk [481/1893] bpb=1.137597 time=891.8s
ttt_chunk [491/1893] bpb=1.137722 time=910.3s
ttt_chunk [501/1893] bpb=1.137492 time=928.8s
ttt_chunk [511/1893] bpb=1.137017 time=947.4s
ttt_chunk [521/1893] bpb=1.136714 time=965.9s
ttt_chunk [531/1893] bpb=1.137443 time=984.5s
ttt_chunk [541/1893] bpb=1.137557 time=1003.0s
ttt_chunk [551/1893] bpb=1.137019 time=1021.5s
ttt_chunk [561/1893] bpb=1.136885 time=1040.1s
ttt_chunk [571/1893] bpb=1.136621 time=1058.6s
ttt_chunk [581/1893] bpb=1.136257 time=1077.2s
ttt_chunk [591/1893] bpb=1.135719 time=1095.7s
ttt_chunk [601/1893] bpb=1.135711 time=1114.2s
ttt_chunk [611/1893] bpb=1.135386 time=1132.8s
ttt_chunk [621/1893] bpb=1.135235 time=1151.3s
ttt_chunk [631/1893] bpb=1.134973 time=1169.9s
ttt_chunk [641/1893] bpb=1.134519 time=1188.4s
ttt_chunk [651/1893] bpb=1.134057 time=1206.9s
ttt_chunk [661/1893] bpb=1.133947 time=1225.5s
ttt_chunk [671/1893] bpb=1.133482 time=1244.0s
ttt_chunk [681/1893] bpb=1.132918 time=1262.6s
ttt_chunk [691/1893] bpb=1.132994 time=1281.1s
ttt_chunk [701/1893] bpb=1.132163 time=1299.6s
ttt_chunk [711/1893] bpb=1.132176 time=1318.2s
ttt_chunk [721/1893] bpb=1.132090 time=1336.7s
ttt_chunk [731/1893] bpb=1.132331 time=1355.2s
ttt_chunk [741/1893] bpb=1.132205 time=1373.8s
ttt_chunk [751/1893] bpb=1.131884 time=1392.3s
ttt_chunk [761/1893] bpb=1.132028 time=1410.8s
ttt_chunk [771/1893] bpb=1.131860 time=1429.4s
ttt_chunk [781/1893] bpb=1.132024 time=1447.9s
ttt_chunk [791/1893] bpb=1.131869 time=1466.4s
ttt_chunk [801/1893] bpb=1.131804 time=1485.0s
ttt_chunk [811/1893] bpb=1.131817 time=1503.5s
ttt_chunk [821/1893] bpb=1.131702 time=1522.1s
ttt_chunk [831/1893] bpb=1.131418 time=1540.6s
ttt_chunk [841/1893] bpb=1.131180 time=1559.1s
ttt_chunk [851/1893] bpb=1.131241 time=1577.7s
ttt_chunk [861/1893] bpb=1.131312 time=1596.2s
ttt_chunk [871/1893] bpb=1.131521 time=1614.7s
ttt_chunk [881/1893] bpb=1.131519 time=1633.3s
ttt_chunk [891/1893] bpb=1.130978 time=1651.8s
ttt_chunk [901/1893] bpb=1.130995 time=1670.3s
ttt_chunk [911/1893] bpb=1.130849 time=1688.9s
ttt_chunk [921/1893] bpb=1.130984 time=1707.4s
ttt_chunk [931/1893] bpb=1.130928 time=1726.0s
ttt_chunk [941/1893] bpb=1.131129 time=1744.5s
ttt_chunk [951/1893] bpb=1.131431 time=1763.0s
ttt_chunk [961/1893] bpb=1.131741 time=1781.6s
ttt_chunk [971/1893] bpb=1.132107 time=1800.1s
ttt_chunk [981/1893] bpb=1.132319 time=1818.6s
ttt_chunk [991/1893] bpb=1.132236 time=1837.2s
ttt_chunk [1001/1893] bpb=1.132567 time=1855.7s
ttt_chunk [1011/1893] bpb=1.132723 time=1874.3s
ttt_chunk [1021/1893] bpb=1.133011 time=1892.8s
ttt_chunk [1031/1893] bpb=1.133400 time=1911.3s
ttt_chunk [1041/1893] bpb=1.133897 time=1929.9s
ttt_chunk [1051/1893] bpb=1.133756 time=1948.4s
ttt_chunk [1061/1893] bpb=1.133865 time=1967.0s
ttt_chunk [1071/1893] bpb=1.134029 time=1985.5s
ttt_chunk [1081/1893] bpb=1.134076 time=2004.1s
ttt_chunk [1091/1893] bpb=1.134326 time=2022.7s
ttt_chunk [1101/1893] bpb=1.134469 time=2041.2s
ttt_chunk [1111/1893] bpb=1.134274 time=2059.8s
ttt_chunk [1121/1893] bpb=1.134049 time=2078.3s
ttt_chunk [1131/1893] bpb=1.133943 time=2096.9s
ttt_chunk [1141/1893] bpb=1.133705 time=2115.4s
ttt_chunk [1151/1893] bpb=1.133733 time=2134.0s
ttt_chunk [1161/1893] bpb=1.133569 time=2152.5s
ttt_chunk [1171/1893] bpb=1.133389 time=2171.1s
ttt_chunk [1181/1893] bpb=1.133164 time=2189.6s
ttt_chunk [1191/1893] bpb=1.133317 time=2208.2s
ttt_chunk [1201/1893] bpb=1.133519 time=2226.8s
ttt_chunk [1211/1893] bpb=1.133117 time=2245.3s
ttt_chunk [1221/1893] bpb=1.133455 time=2263.9s
ttt_chunk [1231/1893] bpb=1.133394 time=2282.4s
ttt_chunk [1241/1893] bpb=1.133104 time=2300.9s
ttt_chunk [1251/1893] bpb=1.132567 time=2319.5s
ttt_chunk [1261/1893] bpb=1.132300 time=2338.0s
ttt_chunk [1271/1893] bpb=1.132047 time=2356.6s
ttt_chunk [1281/1893] bpb=1.131738 time=2375.1s
ttt_chunk [1291/1893] bpb=1.131494 time=2393.7s
ttt_chunk [1301/1893] bpb=1.131443 time=2412.2s
ttt_chunk [1311/1893] bpb=1.131173 time=2430.7s
ttt_chunk [1321/1893] bpb=1.130872 time=2449.3s
ttt_chunk [1331/1893] bpb=1.130632 time=2467.8s
ttt_chunk [1341/1893] bpb=1.130505 time=2486.4s
ttt_chunk [1351/1893] bpb=1.130352 time=2504.9s
ttt_chunk [1361/1893] bpb=1.130484 time=2523.5s
ttt_chunk [1371/1893] bpb=1.130705 time=2542.0s
ttt_chunk [1381/1893] bpb=1.130910 time=2560.5s
ttt_chunk [1391/1893] bpb=1.130695 time=2579.1s
ttt_chunk [1401/1893] bpb=1.130724 time=2597.6s
ttt_chunk [1411/1893] bpb=1.130831 time=2616.2s
ttt_chunk [1421/1893] bpb=1.130815 time=2634.7s
ttt_chunk [1431/1893] bpb=1.130791 time=2653.3s
ttt_chunk [1441/1893] bpb=1.131256 time=2671.8s
ttt_chunk [1451/1893] bpb=1.131119 time=2691.1s
ttt_chunk [1461/1893] bpb=1.131048 time=2709.6s
ttt_chunk [1471/1893] bpb=1.131643 time=2728.2s
ttt_chunk [1481/1893] bpb=1.131517 time=2746.7s
ttt_chunk [1491/1893] bpb=1.131890 time=2765.3s
ttt_chunk [1501/1893] bpb=1.131872 time=2783.8s
ttt_chunk [1511/1893] bpb=1.131833 time=2802.3s
ttt_chunk [1521/1893] bpb=1.131945 time=2820.9s
ttt_chunk [1531/1893] bpb=1.132160 time=2839.4s
ttt_chunk [1541/1893] bpb=1.132230 time=2858.0s
ttt_chunk [1551/1893] bpb=1.132470 time=2876.5s
ttt_chunk [1561/1893] bpb=1.132554 time=2895.1s
ttt_chunk [1571/1893] bpb=1.132686 time=2913.6s
ttt_chunk [1581/1893] bpb=1.132836 time=2932.1s
ttt_chunk [1591/1893] bpb=1.132902 time=2950.7s
ttt_chunk [1601/1893] bpb=1.133020 time=2969.2s
ttt_chunk [1611/1893] bpb=1.133281 time=2987.8s
ttt_chunk [1621/1893] bpb=1.133141 time=3006.3s
ttt_chunk [1631/1893] bpb=1.133187 time=3024.8s
ttt_chunk [1641/1893] bpb=1.133212 time=3043.4s
ttt_chunk [1651/1893] bpb=1.133269 time=3061.9s
ttt_chunk [1661/1893] bpb=1.133410 time=3080.5s
ttt_chunk [1671/1893] bpb=1.133595 time=3099.0s
ttt_chunk [1681/1893] bpb=1.133686 time=3117.5s
ttt_chunk [1691/1893] bpb=1.133787 time=3136.1s
ttt_chunk [1701/1893] bpb=1.133884 time=3154.6s
ttt_chunk [1711/1893] bpb=1.133862 time=3173.2s
ttt_chunk [1721/1893] bpb=1.133701 time=3191.7s
ttt_chunk [1731/1893] bpb=1.133797 time=3210.2s
ttt_chunk [1741/1893] bpb=1.133534 time=3228.8s
ttt_chunk [1751/1893] bpb=1.133407 time=3247.3s
ttt_chunk [1761/1893] bpb=1.133444 time=3265.9s
ttt_chunk [1771/1893] bpb=1.133395 time=3284.4s
ttt_chunk [1781/1893] bpb=1.133298 time=3303.0s
ttt_chunk [1791/1893] bpb=1.132959 time=3321.5s
ttt_chunk [1801/1893] bpb=1.132941 time=3340.0s
ttt_chunk [1811/1893] bpb=1.132795 time=3358.6s
ttt_chunk [1821/1893] bpb=1.132853 time=3377.1s
ttt_chunk [1831/1893] bpb=1.132699 time=3395.7s
ttt_chunk [1841/1893] bpb=1.132738 time=3414.2s
ttt_chunk [1851/1893] bpb=1.132559 time=3432.7s
ttt_chunk [1861/1893] bpb=1.132478 time=3451.3s
ttt_chunk [1871/1893] bpb=1.132413 time=3469.8s
ttt_chunk [1881/1893] bpb=1.132170 time=3488.4s
ttt_chunk [1891/1893] bpb=1.132153 time=3506.9s
ttt_chunk [1893/1893] bpb=1.132184 time=3509.9s
ttt_sliding:done val_loss=1.911640 val_bpb=1.132184 elapsed=3510.0s
legal_ttt val_loss:1.9116 val_bpb:1.1322 eval_time:3510399ms
legal_ttt_exact val_loss:1.91163996 val_bpb:1.13218386
wandb: updating run metadata
wandb: uploading output.log; uploading wandb-summary.json; uploading config.yaml
wandb: uploading data
wandb:
wandb: Run history:
wandb: grad_norm ▂█▅▅▄▃▃▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: lr_scale ██████████████████████▇▇▇▆▆▅▅▄▄▄▃▃▃▂▂▂▂▁
wandb: step_avg_ms ▁▂▃▄▄▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█
wandb: train_loss ▆█▇▇▇▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: val_bpb █▂▁▁▁▁▁▁
wandb: val_loss █▂▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb: grad_norm 0.03694
wandb: lr_scale 0.00374
wandb: step_avg_ms 1389.14072
wandb: train_loss 1.96371
wandb: val_bpb 1.1552
wandb: val_loss 1.95051
wandb:
wandb: 🚀 View run full_4pass_baseline_80min at: https://wandb.ai/propensity/parameter-golf/runs/qltwebo4
wandb: ⭐️ View project at: https://wandb.ai/propensity/parameter-golf
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20260326_230156-qltwebo4/logs