
Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372#1579

Open
Tonyy1977 wants to merge 4 commits into openai:main from Tonyy1977:crawler-v5-sp8192-submission

Conversation


@Tonyy1977 Tonyy1977 commented Apr 13, 2026

Non-Record: Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372

val_bpb: 1.1372 (3-seed mean, std 0.0004) | ~15.03 MB | 1x RTX 6000 Ada 48GB, 18000s

3-Seed Results

| Seed | Steps | Pre-quant BPB | GPTQ Roundtrip | TTT BPB | Artifact |
|------|-------|---------------|----------------|---------|----------|
| 1337 | 6042 | 1.1232 | 1.1738 | 1.1372 | 15,025,540 B |
| 42 | 6012 | 1.1235 | 1.1732 | 1.1376 | 15,021,049 B |
| 2024 | 5977 | 1.1222 | 1.1746 | 1.1368 | 15,075,112 B |
| Mean | | 1.1230 | 1.1739 | 1.1372 | |

Trained on 1x RTX 6000 Ada for 5hr/seed (~6000 steps), equivalent to 8xH100 600s (verified: 6374 steps, SWA 1.1200).

Changes from PR #927

| Change | PR #927 | This |
|--------|---------|------|
| Architecture | Recursive 4B/7L (d=1024) | Crawler 3f+2cx2 (d=736) |
| Tokenizer | SP1024 | SP8192 |
| Quantization | Percentile search + LZMA | SDClip + GPTQ + Brotli |
| TTT freeze | freeze=5 (surgical) | freeze=1 |
| Warmdown | 2000 steps fixed | 60% fraction |
| Weight decay | 0.04 | 0.085 |
| val_bpb | 1.1696 | 1.1372 |
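The "60% fraction" warmdown can be read as: hold the learning rate flat, then decay linearly to zero over the final 60% of steps. A minimal sketch of that assumed schedule (the PR does not spell out the exact shape, so `lr_multiplier` and its linear decay form are illustrative):

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.60):
    """Constant LR, then linear decay to zero over the last warmdown_frac
    of training. Assumed reading of the PR's "60% fraction" warmdown."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

With roughly 6000 steps per seed, decay would begin around step 2400 and reach zero at the step cap.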

Architecture: Crawler Transformer

Novel Crawler Transformer (inspired by @newjordan): 3 flat blocks + 2 crawler blocks x 2 loops = 7 effective depth. dim=736, 16 heads (8 KV), MLP 4x, GQA, BigramHash, SmearGate, VE, XSA all layers. 38.3M params.
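A back-of-envelope sketch of the "3f+2cx2" arithmetic: 3 flat blocks run once and 2 shared crawler blocks loop twice, so 7 blocks of compute are applied while only 5 blocks of weights are stored. The per-block parameter estimate below (3d² attention given 8 of 16 KV heads, 8d² for the 4x MLP) is my rough reconstruction, not the PR's exact accounting, and it deliberately ignores the BigramHash/SmearGate/VE/XSA extras:

```python
def effective_depth(flat, crawler, loops):
    # compute-depth: flat blocks run once, crawler blocks repeat `loops` times
    return flat + crawler * loops

def approx_params(d=736, flat=3, crawler=2, vocab=8192):
    # GQA with half the KV heads: Q and O are d*d each, K and V are d*d/2 each
    attn = 3 * d * d
    mlp = 8 * d * d            # 4x expansion: up and down projections
    blocks = (flat + crawler) * (attn + mlp)
    embed = vocab * d          # SP8192 embedding table
    return blocks + embed

assert effective_depth(3, 2, 2) == 7   # "3f + 2c x 2" -> 7 effective depth
```

This lands around 35.8M, the same ballpark as the stated 38.3M; the gap would be the auxiliary modules omitted here.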

Quantization + TTT

  • SDClip GPTQ int6 + int8 embed + Brotli (zero pruning, ~15MB)
  • Post-quant TTT on deserialized GPTQ artifact (honest eval)
  • TTT recovers 0.037 BPB; total pre-quant penalty only +0.014
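For intuition on the roundtrip penalty, here is a naive per-tensor symmetric int6 quantizer as a stand-in; the PR's actual pipeline (SDClip + GPTQ) additionally clips outliers and applies Hessian-based error compensation, so the function names and scheme here are illustrative only:

```python
def quantize_sym(w, bits=6):
    # naive symmetric quantization: one scale per tensor, no clipping search
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = max((abs(x) for x in w), default=0.0) / qmax or 1.0
    return [max(-qmax, min(qmax, round(x / scale))) for x in w], scale

def dequantize_sym(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.07, 0.002, -0.29]
q, scale = quantize_sym(w)
w_hat = dequantize_sym(q, scale)
# worst-case roundtrip error is half a quantization step (scale / 2)
```

Small weights near zero (like the 0.002 above) collapse to the nearest level, which is the kind of damage that outlier clipping and GPTQ's error feedback are meant to reduce.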

Credits

Tonyy1977 and others added 3 commits March 26, 2026 23:29
… mean)

True Universal Transformer: 4 shared blocks x 7 loops (7x weight reuse),
dim=1024, int6 QAT from step 0, score-first TTT+sliding window eval.
3-seed mean: 1.1696 BPB, 15.85MB artifact, 600s training on 8xH100.
Required for zstd-22 compression of the int8 quantized model artifact.
Without it, the script falls back to zlib which produces 17.5MB (over 16MB budget).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…bpb 1.1372

3-seed mean 1.1372 (std 0.0004) on 1x RTX 6000 Ada 48GB.
Novel crawler architecture with SDClip quantization and honest
post-quant TTT on GPTQ artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

newjordan commented Apr 13, 2026

Loop-aware GPTQ stabilizes the quant damage you're seeing. Nice run! What made you think of using 3f? Hitting 1.12 pre-quant is really impressive... I'm kinda stunned...

Not sure if you saw this - #1535 - but there is a ton of testing and examples there. I thought I was done with this for the comp to focus on neural, but you just busted open a whole other can... Check out the helix concept, considering you found something I didn't here as well!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Tonyy1977
Author

@newjordan Thanks! Honestly I'm just stubborn about keeping it to 4-5 blocks max, tried a bunch of flat/crawler ratios and 3f+2cx2 kept winning at that scale. The 1.12 pre-quant mostly comes from SP8192 and aggressive 60% warmdown (bpb barely drops at 50% of training). Will definitely check out #1535 and the helix concept, the quant penalty is still +0.05 so loop-aware GPTQ sounds like exactly what I need. Really cool to see 1.074 on the 4-hour run.

@newjordan

The majority of my research was on stabilizing the quant gap. I'm running your configuration now.

@Tonyy1977
Author

I'm really glad you're testing the config! Curious how the training is going.

Quick heads up if you run it on the cluster (8xH100s): I hit a nasty DDP + torch.compile crash/slowdown in the early steps due to the unused_residual_scales params.

Fix: set `torch._dynamo.config.optimize_ddp = False`, plus a zero-contribution trick in the forward pass so the scales stay in the autograd graph:
`unused = sum(p.sum() * 0.0 for p in self.crawler_residual_scales); x = x + unused`

Also, I'm confident that the 5 hours locally on the RTX 6000 Ada ≈ 10 minutes on 8xH100 SXMs. I just can't afford to burn that much cluster time anymore :D

@newjordan

I ported your fxcxr onto my best base, and it's performing weird. Not bad, I just don't have a handle on it yet. First run -
step:8723/20000 val_loss:2.9915 val_bpb:1.1581 train_time:600028ms step_avg:68.79ms
stopping_early: wallclock_cap train_time:600028ms step:8723/20000
gptq_loop_aware:phase1 collecting all-layer Hessians...
peak memory allocated: 17009 MiB reserved: 18516 MiB
gptq:loop-aware 2-phase calibration samples=256 seq_len=2048...
gptq_loop_aware:patched 18 flat layers with GPTQ weights
gptq_loop_aware:phase2 collecting crawler Hessians with quantized-flat activations...
gptq_loop_aware:phase2 collected 15 crawler Hessians
gptq_loop_aware:restored 18 flat layer weights
gptq_loop_aware:merged 34 Hessians (15 crawler from phase2)
gptq:loop-aware calibrated 34 layers in 5.6s
ema:SKIPPED (SKIP_EMA=1) — using live model weights
DIAGNOSTIC post_ema val_loss:2.9915 val_bpb:1.1581 eval_time:1339ms
Serialized model: 62947963 bytes
Code size: 130682 bytes
gptq_quantize: 30 GPTQ layers, 0 naive layers
Serialized model int6+brotli: 11530330 bytes
Total submission size int6+brotli: 11661012 bytes
Total submission size int8+zlib: 11661012 bytes
final_int6_roundtrip val_loss:3.0271 val_bpb:1.1719 eval_time:8934ms
final_int6_roundtrip_exact val_loss:3.02711128 val_bpb:1.17188911
final_int6_sliding_window val_loss:2.9846 val_bpb:1.1554 stride:64 eval_time:57395ms
final_int6_sliding_window_exact val_loss:2.98455995 val_bpb:1.15542103
final_int8_zlib_roundtrip_exact val_loss:2.98455995 val_bpb:1.15542103

============================================
RESULT — TON-E rhythm run seed=4
model_params: 18508844
raw_bpb: 1.15808866
int6_sw_bpb: 1.15542103
val_loss: 2.98455995
step_avg_ms: 68.79
steps: 8723
train_time_s: 600
bytes_total: 11661012 (limit 16000000)
bytes_code: 130682
artifact_legal:yes
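A side note on reading these logs: `val_loss` is mean cross-entropy in nats per token, while `val_bpb` is bits per byte, so the conversion needs the tokenizer's tokens-per-byte ratio. A small sketch (the 0.268 ratio below is just what this log's loss/bpb pair implies, roughly 3.7 bytes per token; it is not a published constant):

```python
import math

def nats_per_token_to_bpb(val_loss, tokens_per_byte):
    # nats/token -> bits/token (divide by ln 2) -> bits/byte (scale by tokens/byte)
    return val_loss / math.log(2) * tokens_per_byte

# sanity check: ln(2) nats per token at exactly 1 token per byte is 1 bit per byte
assert nats_per_token_to_bpb(math.log(2), 1.0) == 1.0
```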

And then I am working to see if I can get the bpb back and keep a fast runner, but it's definitely experimental. I wanted to see if a straight int8 quant would fix the loop compression damage; it doesn't, so the recursion loops are going to need a different kind of treatment than the looped GPTQ that stabilizes the crawler. This is fine research as it will carry into any SOTA line.

This is pure int8 to see where recovery can happen, and it's not here, so the issue is the recursion loops not contributing enough to the final output. This quant gap shows up any time we use the recursive loops, and is seen on the SOTA leaderboards as well. Stabilizing it is a good thing for all of us.
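The two-phase flow in the loop-aware GPTQ logs can be caricatured in a few lines: phase 1 accumulates the GPTQ Hessian statistic H += x xᵀ from raw calibration activations for every layer; phase 2 re-collects the crawler layers' Hessians after the flat layers have been fake-quantized, so the crawler calibrates against the activations it will actually see at inference. Everything below (names, the scalar fake-quant) is a toy illustration, not the actual implementation:

```python
def accumulate_hessian(H, x):
    # GPTQ calibration statistic: H += x x^T for one activation vector
    n = len(x)
    for i in range(n):
        for j in range(n):
            H[i][j] += x[i] * x[j]
    return H

def fake_quant(x, scale=0.13):
    # crude symmetric fake-quantization stand-in for the patched flat layers
    return [scale * round(v / scale) for v in x]

samples = [[1.0, 2.0], [3.0, -1.0]]

# phase 1: flat-layer Hessians from raw activations
H_flat = [[0.0, 0.0], [0.0, 0.0]]
for x in samples:
    accumulate_hessian(H_flat, x)

# phase 2: crawler Hessians from quantized-flat activations
H_crawler = [[0.0, 0.0], [0.0, 0.0]]
for x in samples:
    accumulate_hessian(H_crawler, fake_quant(x))
```

The point of the second pass is visible even in this toy: `H_crawler` differs from `H_flat` because the quantization error of the earlier layers has shifted the activation statistics.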

step:5000/20000 train_loss:2.8775 train_time:560627ms step_avg:112.13ms crawler_grad_mul:0.837
step:5348/20000 val_loss:2.9133 val_bpb:1.1278 train_time:600001ms step_avg:112.19ms
stopping_early: wallclock_cap train_time:600001ms step:5348/20000
peak memory allocated: 27799 MiB reserved: 29334 MiB
gptq:SKIPPED (EXPORT_QUANT=int8_flat) — not used by flat int8 export
ema:SKIPPED (SKIP_EMA=1) — using live model weights
DIAGNOSTIC post_ema val_loss:2.9133 val_bpb:1.1278 eval_time:1710ms
Serialized model: 95490137 bytes
Code size: 135269 bytes
int8_stats:param_count=26641988 float_tensors=51 baseline_tensor_bytes=95459600 int8_payload_bytes=26941710
Serialized model int8_flat: 21066472 bytes
Total submission size int8_flat: 21201741 bytes
Total submission size int8+zlib: 21201741 bytes
final_int8_flat_roundtrip val_loss:2.9163 val_bpb:1.1290 eval_time:13614ms
final_int8_flat_roundtrip_exact val_loss:2.91626094 val_bpb:1.12897549
final_int8_flat_sliding_window val_loss:2.8740 val_bpb:1.1126 stride:64 eval_time:80566ms
final_int8_flat_sliding_window_exact val_loss:2.87395762 val_bpb:1.11260324
final_int8_zlib_roundtrip_exact val_loss:2.87395762 val_bpb:1.11260324

============================================
RESULT — TON-E rhythm run seed=4
model_params: 26641988
raw_bpb: 1.12784530
quant_mode: int8_flat
quant_sw_bpb: 1.11260324
val_loss: 2.87395762
step_avg_ms: 112.19
steps: 5348
train_time_s: 600
bytes_total: 21201741 (limit 16000000)
bytes_code: 135269
artifact_legal:no

@newjordan

I have experimented a lot with the layer configuration, but I never tried such a small flat stack (I didn't really rely on the crawler to provide bpb ever, it's a compression loss system, kinda), and the fact you can lean into it with 3 flat layers was very interesting. The recursion loops are also interesting... In general I have a lot more fun messing with whatever the hell this is than pushing SOTA, and I'm glad to see this test because it shows me I should continue developing the crawler to get the bpb up more. I don't know if it's a competitive architecture because it's built in a silo with no reference to arXiv or popular concepts, on purpose. It's a totally original, weird implementation that I enjoy tinkering with.

Based on what I have seen from your work: 1. the big vocab is fine, it doesn't hurt, but the crawler isn't using all of it, and this could be partially where the quant spinout is happening; and 2. the crawler itself can be optimized more as a standalone transformer and not just a compression battery. The work is nowhere near ready on the standalone transformer, but the fact it can push bpb with only 3 flat layers tells me I was not leaning into what the crawler is capable of if actually tuned and configured right. I think it's still missing a couple critical data input/output connections to move the needle like a flat layer.

Thanks for even giving it a shot. Nice to have a human interaction on this line and not another AI test =p

@Tonyy1977
Author

This is wonderful! I'm really aiming to be the most unique model (4-5 blocks max) in this challenge while still meeting the constraints of 10 minutes training and 16MB. I spent most of the time finding a good way to compress so as to afford more dim; the big quant damage is unavoidable (more loops, more compounding), so I can accept it as long as TTT can recover most of the penalty (this PR has only a ~0.014 gap between pre-quant and TTT). As long as I had headroom in the budget, I immediately added more dim!

Your Run 1 at 11.66MB has 4.3MB of headroom. Curious if bumping dim helps or if the quant damage eats the gain; that's been my constant battle at d=736.
