
Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372#1579

Open
Tonyy1977 wants to merge 4 commits into openai:main from Tonyy1977:crawler-v5-sp8192-submission

Conversation


@Tonyy1977 Tonyy1977 commented Apr 13, 2026

Non-Record: Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372

val_bpb: 1.1372 (3-seed mean, std 0.0004) | ~15.03 MB | 1x RTX 6000 Ada 48GB, 18000s

3-Seed Results

| Seed | Steps | Pre-quant BPB | GPTQ Roundtrip | TTT BPB | Artifact |
|------|-------|---------------|----------------|---------|----------|
| 1337 | 6042 | 1.1232 | 1.1738 | 1.1372 | 15,025,540 B |
| 42 | 6012 | 1.1235 | 1.1732 | 1.1376 | 15,021,049 B |
| 2024 | 5977 | 1.1222 | 1.1746 | 1.1368 | 15,075,112 B |
| Mean | | 1.1230 | 1.1739 | 1.1372 | |

Trained on 1x RTX 6000 Ada for 5hr/seed (~6000 steps), equivalent to 8xH100 600s (verified: 6374 steps, SWA 1.1200).

Changes from PR #927

| Change | PR #927 | This |
|--------|---------|------|
| Architecture | Recursive 4B/7L (d=1024) | Crawler 3f+2cx2 (d=736) |
| Tokenizer | SP1024 | SP8192 |
| Quantization | Percentile search + LZMA | SDClip + GPTQ + Brotli |
| TTT freeze | freeze=5 (surgical) | freeze=1 |
| Warmdown | 2000 steps fixed | 60% fraction |
| Weight decay | 0.04 | 0.085 |
| val_bpb | 1.1696 | 1.1372 |
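The "60% fraction" warmdown can be read as: hold the learning rate flat, then decay linearly to zero over the final 60% of steps. A minimal sketch of that assumed schedule (the PR does not spell out the exact shape, so `lr_multiplier` and its linear decay form are illustrative):

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.60):
    """Constant LR, then linear decay to zero over the last warmdown_frac
    of training. Assumed reading of the PR's "60% fraction" warmdown."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```

With roughly 6000 steps per seed, decay would begin around step 2400 and reach zero at the step cap.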

Architecture: Crawler Transformer

Novel Crawler Transformer (inspired by @newjordan): 3 flat blocks + 2 crawler blocks x 2 loops = 7 effective depth. dim=736, 16 heads (8 KV), MLP 4x, GQA, BigramHash, SmearGate, VE, XSA all layers. 38.3M params.
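A back-of-envelope sketch of the "3f+2cx2" arithmetic: 3 flat blocks run once and 2 shared crawler blocks loop twice, so 7 blocks of compute are applied while only 5 blocks of weights are stored. The per-block parameter estimate below (3d² attention given 8 of 16 KV heads, 8d² for the 4x MLP) is my rough reconstruction, not the PR's exact accounting, and it deliberately ignores the BigramHash/SmearGate/VE/XSA extras:

```python
def effective_depth(flat, crawler, loops):
    # compute-depth: flat blocks run once, crawler blocks repeat `loops` times
    return flat + crawler * loops

def approx_params(d=736, flat=3, crawler=2, vocab=8192):
    # GQA with half the KV heads: Q and O are d*d each, K and V are d*d/2 each
    attn = 3 * d * d
    mlp = 8 * d * d            # 4x expansion: up and down projections
    blocks = (flat + crawler) * (attn + mlp)
    embed = vocab * d          # SP8192 embedding table
    return blocks + embed

assert effective_depth(3, 2, 2) == 7   # "3f + 2c x 2" -> 7 effective depth
```

This lands around 35.8M, the same ballpark as the stated 38.3M; the gap would be the auxiliary modules omitted here.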

Quantization + TTT

  • SDClip GPTQ int6 + int8 embed + Brotli (zero pruning, ~15MB)
  • Post-quant TTT on deserialized GPTQ artifact (honest eval)
  • TTT recovers 0.037 BPB; total pre-quant penalty only +0.014
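For intuition on the roundtrip penalty, here is a naive per-tensor symmetric int6 quantizer as a stand-in; the PR's actual pipeline (SDClip + GPTQ) additionally clips outliers and applies Hessian-based error compensation, so the function names and scheme here are illustrative only:

```python
def quantize_sym(w, bits=6):
    # naive symmetric quantization: one scale per tensor, no clipping search
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = max((abs(x) for x in w), default=0.0) / qmax or 1.0
    return [max(-qmax, min(qmax, round(x / scale))) for x in w], scale

def dequantize_sym(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.07, 0.002, -0.29]
q, scale = quantize_sym(w)
w_hat = dequantize_sym(q, scale)
# worst-case roundtrip error is half a quantization step (scale / 2)
```

Small weights near zero (like the 0.002 above) collapse to the nearest level, which is the kind of damage that outlier clipping and GPTQ's error feedback are meant to reduce.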

Credits

Tonyy1977 and others added 3 commits March 26, 2026 23:29
… mean)

True Universal Transformer: 4 shared blocks x 7 loops (7x weight reuse),
dim=1024, int6 QAT from step 0, score-first TTT+sliding window eval.
3-seed mean: 1.1696 BPB, 15.85MB artifact, 600s training on 8xH100.
Required for zstd-22 compression of the int8 quantized model artifact.
Without it, the script falls back to zlib which produces 17.5MB (over 16MB budget).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…bpb 1.1372

3-seed mean 1.1372 (std 0.0004) on 1x RTX 6000 Ada 48GB.
Novel crawler architecture with SDClip quantization and honest
post-quant TTT on GPTQ artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

newjordan commented Apr 13, 2026

Loop-aware GPTQ stabilizes the quant damage you're seeing. Nice run! What made you think of using 3f? Hitting 1.12 pre-quant is really impressive... I'm kinda stunned...

Not sure if you saw this - #1535 - but there is a ton of testing and examples there. I thought I was done with this for the comp to focus on neural, but you just busted open a whole other can... Check out the helix concept, considering you found something I didn't here as well!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Tonyy1977
Author

@newjordan Thanks! Honestly I'm just stubborn about keeping it to 4-5 blocks max, tried a bunch of flat/crawler ratios and 3f+2cx2 kept winning at that scale. The 1.12 pre-quant mostly comes from SP8192 and aggressive 60% warmdown (bpb barely drops at 50% of training). Will definitely check out #1535 and the helix concept, the quant penalty is still +0.05 so loop-aware GPTQ sounds like exactly what I need. Really cool to see 1.074 on the 4-hour run.

@newjordan

The majority of my research was on stabilizing the quant gap. I'm running your configuration now.

@Tonyy1977
Author

I'm really glad you're testing the config! Curious how the training is going.

Quick heads up if you run it on the cluster (8xH100s): I hit a nasty DDP + torch.compile crash/slowdown in the early steps due to the unused_residual_scales params.

Fix: set `torch._dynamo.config.optimize_ddp = False`, plus a zero-contribution trick in the forward pass so the scales stay in the autograd graph:
`unused = sum(p.sum() * 0.0 for p in self.crawler_residual_scales); x = x + unused`

Also, I'm confident that the 5 hours locally on the RTX 6000 Ada ≈ 10 minutes on 8xH100 SXMs. I just can't afford to burn that much cluster time anymore :D

@newjordan

I ported your fxcxr onto my best base, and it's performing weird. Not bad, I just don't have a handle on it yet. First run -
step:8723/20000 val_loss:2.9915 val_bpb:1.1581 train_time:600028ms step_avg:68.79ms
stopping_early: wallclock_cap train_time:600028ms step:8723/20000
gptq_loop_aware:phase1 collecting all-layer Hessians...
peak memory allocated: 17009 MiB reserved: 18516 MiB
gptq:loop-aware 2-phase calibration samples=256 seq_len=2048...
gptq_loop_aware:patched 18 flat layers with GPTQ weights
gptq_loop_aware:phase2 collecting crawler Hessians with quantized-flat activations...
gptq_loop_aware:phase2 collected 15 crawler Hessians
gptq_loop_aware:restored 18 flat layer weights
gptq_loop_aware:merged 34 Hessians (15 crawler from phase2)
gptq:loop-aware calibrated 34 layers in 5.6s
ema:SKIPPED (SKIP_EMA=1) — using live model weights
DIAGNOSTIC post_ema val_loss:2.9915 val_bpb:1.1581 eval_time:1339ms
Serialized model: 62947963 bytes
Code size: 130682 bytes
gptq_quantize: 30 GPTQ layers, 0 naive layers
Serialized model int6+brotli: 11530330 bytes
Total submission size int6+brotli: 11661012 bytes
Total submission size int8+zlib: 11661012 bytes
final_int6_roundtrip val_loss:3.0271 val_bpb:1.1719 eval_time:8934ms
final_int6_roundtrip_exact val_loss:3.02711128 val_bpb:1.17188911
final_int6_sliding_window val_loss:2.9846 val_bpb:1.1554 stride:64 eval_time:57395ms
final_int6_sliding_window_exact val_loss:2.98455995 val_bpb:1.15542103
final_int8_zlib_roundtrip_exact val_loss:2.98455995 val_bpb:1.15542103

============================================
RESULT — TON-E rhythm run seed=4
model_params: 18508844
raw_bpb: 1.15808866
int6_sw_bpb: 1.15542103
val_loss: 2.98455995
step_avg_ms: 68.79
steps: 8723
train_time_s: 600
bytes_total: 11661012 (limit 16000000)
bytes_code: 130682
artifact_legal:yes
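A side note on reading these logs: `val_loss` is mean cross-entropy in nats per token, while `val_bpb` is bits per byte, so the conversion needs the tokenizer's tokens-per-byte ratio. A small sketch (the 0.268 ratio below is just what this log's loss/bpb pair implies, roughly 3.7 bytes per token; it is not a published constant):

```python
import math

def nats_per_token_to_bpb(val_loss, tokens_per_byte):
    # nats/token -> bits/token (divide by ln 2) -> bits/byte (scale by tokens/byte)
    return val_loss / math.log(2) * tokens_per_byte

# sanity check: ln(2) nats per token at exactly 1 token per byte is 1 bit per byte
assert nats_per_token_to_bpb(math.log(2), 1.0) == 1.0
```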

And then I am working to see if I can get the bpb back and keep a fast runner, but it's definitely experimental. I wanted to see if a straight int8 quant would fix the loop compression damage; it doesn't, so the recursion loops are going to need a different kind of treatment than the looped GPTQ that stabilizes the crawler. This is fine research as it will carry into any SOTA line.

This is pure int8 to see where recovery can happen, and it's not here, so the issue is the recursion loops not contributing enough to the final output. This quant gap shows up any time we use the recursive loops, and is seen on the SOTA leaderboards as well. Stabilizing it is a good thing for all of us.
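The two-phase flow in the loop-aware GPTQ logs can be caricatured in a few lines: phase 1 accumulates the GPTQ Hessian statistic H += x xᵀ from raw calibration activations for every layer; phase 2 re-collects the crawler layers' Hessians after the flat layers have been fake-quantized, so the crawler calibrates against the activations it will actually see at inference. Everything below (names, the scalar fake-quant) is a toy illustration, not the actual implementation:

```python
def accumulate_hessian(H, x):
    # GPTQ calibration statistic: H += x x^T for one activation vector
    n = len(x)
    for i in range(n):
        for j in range(n):
            H[i][j] += x[i] * x[j]
    return H

def fake_quant(x, scale=0.13):
    # crude symmetric fake-quantization stand-in for the patched flat layers
    return [scale * round(v / scale) for v in x]

samples = [[1.0, 2.0], [3.0, -1.0]]

# phase 1: flat-layer Hessians from raw activations
H_flat = [[0.0, 0.0], [0.0, 0.0]]
for x in samples:
    accumulate_hessian(H_flat, x)

# phase 2: crawler Hessians from quantized-flat activations
H_crawler = [[0.0, 0.0], [0.0, 0.0]]
for x in samples:
    accumulate_hessian(H_crawler, fake_quant(x))
```

The point of the second pass is visible even in this toy: `H_crawler` differs from `H_flat` because the quantization error of the earlier layers has shifted the activation statistics.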

step:5000/20000 train_loss:2.8775 train_time:560627ms step_avg:112.13ms crawler_grad_mul:0.837
step:5348/20000 val_loss:2.9133 val_bpb:1.1278 train_time:600001ms step_avg:112.19ms
stopping_early: wallclock_cap train_time:600001ms step:5348/20000
peak memory allocated: 27799 MiB reserved: 29334 MiB
gptq:SKIPPED (EXPORT_QUANT=int8_flat) — not used by flat int8 export
ema:SKIPPED (SKIP_EMA=1) — using live model weights
DIAGNOSTIC post_ema val_loss:2.9133 val_bpb:1.1278 eval_time:1710ms
Serialized model: 95490137 bytes
Code size: 135269 bytes
int8_stats:param_count=26641988 float_tensors=51 baseline_tensor_bytes=95459600 int8_payload_bytes=26941710
Serialized model int8_flat: 21066472 bytes
Total submission size int8_flat: 21201741 bytes
Total submission size int8+zlib: 21201741 bytes
final_int8_flat_roundtrip val_loss:2.9163 val_bpb:1.1290 eval_time:13614ms
final_int8_flat_roundtrip_exact val_loss:2.91626094 val_bpb:1.12897549
final_int8_flat_sliding_window val_loss:2.8740 val_bpb:1.1126 stride:64 eval_time:80566ms
final_int8_flat_sliding_window_exact val_loss:2.87395762 val_bpb:1.11260324
final_int8_zlib_roundtrip_exact val_loss:2.87395762 val_bpb:1.11260324

============================================
RESULT — TON-E rhythm run seed=4
model_params: 26641988
raw_bpb: 1.12784530
quant_mode: int8_flat
quant_sw_bpb: 1.11260324
val_loss: 2.87395762
step_avg_ms: 112.19
steps: 5348
train_time_s: 600
bytes_total: 21201741 (limit 16000000)
bytes_code: 135269
artifact_legal:no

@newjordan

I have experimented a lot with the layer configuration, but I never tried such a small flat stack (I didn't really rely on the crawler to provide bpb ever, it's a compression loss system, kinda), and the fact you can lean into it with 3 flat layers was very interesting. The recursion loops are also interesting... In general I have a lot more fun messing with whatever the hell this is than pushing SOTA, and I'm glad to see this test because it shows me I should continue developing the crawler to get the bpb up more. I don't know if it's a competitive architecture because it's built in a silo with no reference to arXiv or popular concepts, on purpose. It's a totally original, weird implementation that I enjoy tinkering with.

Based on what I have seen from your work: 1. the big vocab is fine, it doesn't hurt, but the crawler isn't using all of it, and this could be partially where the quant spinout is happening; and 2. the crawler itself can be optimized more as a standalone transformer and not just a compression battery. The work is nowhere near ready on the standalone transformer, but the fact it can push bpb with only 3 flat layers tells me I was not leaning into what the crawler is capable of if actually tuned and configured right. I think it's still missing a couple critical data input/output connections to move the needle like a flat layer.

Thanks for even giving it a shot. Nice to have a human interaction on this line and not another AI test =p

@Tonyy1977
Author

This is wonderful! I'm really aiming to be the most unique model (4-5 blocks max) in this challenge while still meeting the constraints of 10 minutes training and 16MB. I spent most of the time finding a good way to compress so as to afford more dim; the big quant damage is unavoidable (more loops, more compounding), so I can accept it as long as TTT can recover most of the penalty (this PR has only a ~0.014 gap between pre-quant and TTT). As long as I had headroom in the budget, I immediately added more dim!

Your Run 1 at 11.66MB has 4.3MB of headroom. Curious if bumping dim helps or if the quant damage eats the gain; that's been my constant battle at d=736.
