Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372 #1579
Tonyy1977 wants to merge 4 commits into openai:main
Conversation
… mean) True Universal Transformer: 4 shared blocks x 7 loops (7x weight reuse), dim=1024, int6 QAT from step 0, score-first TTT+sliding window eval. 3-seed mean: 1.1696 BPB, 15.85MB artifact, 600s training on 8xH100.
Required for zstd-22 compression of the int8 quantized model artifact. Without it, the script falls back to zlib which produces 17.5MB (over 16MB budget). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
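The commit above describes falling back from zstd-22 to zlib when the dependency is missing. A minimal sketch of that fallback logic, assuming the third-party `zstandard` package; the function name `compress_artifact` is hypothetical, not from the PR:

```python
import zlib


def compress_artifact(data: bytes) -> tuple[bytes, str]:
    """Compress the model artifact with zstd level 22 if available,
    else fall back to zlib (worse ratio, per the commit message)."""
    try:
        import zstandard  # third-party; assumed installed on the cluster
        return zstandard.ZstdCompressor(level=22).compress(data), "zstd-22"
    except ImportError:
        # zlib fallback: the commit reports ~17.5MB vs the 16MB budget
        return zlib.compress(data, level=9), "zlib"
```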
…bpb 1.1372 3-seed mean 1.1372 (std 0.0004) on 1x RTX 6000 Ada 48GB. Novel crawler architecture with SDClip quantization and honest post-quant TTT on GPTQ artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loop-aware GPTQ stabilizes the quant damage you're seeing. Nice run! What made you think of using 3f? To hit 1.12 pre-quant is really impressive... I'm kinda stunned... Not sure if you saw this - #1535 - but there is a ton of testing and examples there. I thought I was done with this comp to focus on neural, but you just busted open a whole other can... Check out the helix concept, considering you found something I didn't here also!
@newjordan Thanks! Honestly I'm just stubborn about keeping it to 4-5 blocks max; I tried a bunch of flat/crawler ratios and 3f+2cx2 kept winning at that scale. The 1.12 pre-quant mostly comes from SP8192 and an aggressive 60% warmdown (bpb barely drops at 50% of training). Will definitely check out #1535 and the helix concept; the quant penalty is still +0.05, so loop-aware GPTQ sounds like exactly what I need. Really cool to see 1.074 on the 4-hour run.
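For illustration, the "60% warmdown" mentioned above could look like the following learning-rate multiplier. Only the 0.6 fraction comes from the comment; the linear shape and the function name `lr_scale` are assumptions:

```python
def lr_scale(step: int, total: int, warmdown_frac: float = 0.6) -> float:
    """Constant LR, then linear decay to zero over the final
    `warmdown_frac` of training (0.6 = the 60% warmdown above)."""
    warmdown_start = int(total * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total - step) / (total - warmdown_start))
```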
The majority of my research was on stabilizing the quant gap. I'm running your configuration now.
I'm really glad you're testing the config! Curious how the training is going. Quick heads-up if you run it on the cluster (8xH100s): I hit a nasty DDP + torch.compile crash/slowdown in the early steps due to the unused residual_scales params. Fix: torch._dynamo.config.optimize_ddp = False plus a zero-contribution trick in the forward pass. Also, I'm confident that 5 hours locally on the RTX 6000 Ada ≈ 10 minutes on 8xH100 SXMs. I just can't afford to burn that much cluster time anymore :D
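A hedged sketch of the zero-contribution trick mentioned above. The module layout and the name `residual_scales` are assumptions; only the dynamo flag and the zero-contribution idea come from the comment:

```python
import torch
import torch._dynamo

# DDP + torch.compile choke on params that receive no gradient in some
# steps; disabling Dynamo's DDP bucket optimization avoids the crash.
torch._dynamo.config.optimize_ddp = False


class LoopBlock(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)
        self.residual_scales = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor, use_scales: bool = True) -> torch.Tensor:
        y = self.proj(x)
        if use_scales:
            y = y * self.residual_scales
        else:
            # Zero-contribution trick: multiply by 0 so the param stays in
            # the autograd graph (DDP marks it "used") without changing the
            # output value.
            y = y + 0.0 * self.residual_scales.sum()
        return x + y
```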
I ported your fxcxr onto my best base, and it's performing weird. Not bad, I just don't have a handle on it yet. First run: gptq_loop_aware:phase1 collecting all-layer Hessians... ============================================
I have experimented a lot with the layer configuration, but I never tried such a small flat layer (I never really relied on the crawler to provide bpb; it's a compression-loss system, kinda), and the fact you can lean into it with 3 flat layers was very interesting. The recursion loops are also interesting... In general I have a lot more fun messing with whatever the hell this is than pushing SOTA, and I'm glad to see this test because it shows me I should continue to develop the crawler to get the bpb up more. I don't know if it's a competitive architecture, because it's built in a silo with no reference to arXiv or popular concepts, on purpose. It's a totally original, weird implementation that I enjoy tinkering with. Based on what I have seen from your work: 1. big vocab is fine, doesn't hurt, but the crawler isn't using all of it, and this could be partially where the quant spinout is happening; and 2. the crawler itself can be optimized further as a standalone transformer and not just a compression battery. That standalone-transformer work is nowhere near ready, but the fact it can push bpb with only 3 flat layers tells me I was not leaning into what the crawler is capable of, if actually tuned and configured right. I think it's still missing a couple of critical data input/output connections to move the needle like a flat layer. Thanks for even giving it a shot. Nice to have a human interaction on this line and not another AI test =p
This is wonderful! I'm really aiming to be the most unique model (4-5 blocks max) in this challenge while still meeting the constraints of 10 minutes training and 16MB. I spent most of my time finding a good way to compress so as to afford more dim; the big quant damage is unavoidable (more loops, more compounding), so I can accept it as long as the TTT recovers most of the penalty (this PR has only a 0.01 difference between pre-quant and TTT). Whenever I have headroom in the budget, I immediately add more dim! Your Run 1 at 11.66MB has 4.3MB of headroom. Curious if bumping dim helps or if the quant damage eats the gain; that's been my constant battle at d=736.
Non-Record: Crawler Transformer 3f+2cx2 + SP8192 + SDClip + Post-Quant TTT — val_bpb 1.1372
val_bpb: 1.1372 (3-seed mean, std 0.0004) | ~15.03 MB | 1x RTX 6000 Ada 48GB, 18000s
3-Seed Results
Trained on 1x RTX 6000 Ada for 5hr/seed (~6000 steps), equivalent to 8xH100 600s (verified: 6374 steps, SWA 1.1200).
Changes from PR #927
Architecture: Crawler Transformer
Novel Crawler Transformer (inspired by @newjordan): 3 flat blocks + 2 crawler blocks x 2 loops = 7 effective depth. dim=736, 16 heads (8 KV), MLP 4x, GQA, BigramHash, SmearGate, VE, XSA all layers. 38.3M params.
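The 3f+2cx2 layout above can be sketched as follows. This is a minimal illustration of the depth arithmetic only (3 unique flat blocks plus 2 crawler blocks looped twice = 7 effective layers); the `Block` internals are placeholders, not the PR's actual blocks (which add GQA, BigramHash, SmearGate, VE, and XSA):

```python
import torch


class Block(torch.nn.Module):
    """Placeholder residual block; stands in for the real crawler/flat blocks."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),  # MLP 4x, as in the PR
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(x)


class Crawler3f2cx2(torch.nn.Module):
    def __init__(self, dim: int = 736, loops: int = 2):
        super().__init__()
        self.flat = torch.nn.ModuleList(Block(dim) for _ in range(3))
        self.crawler = torch.nn.ModuleList(Block(dim) for _ in range(2))
        self.loops = loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.flat:            # 3 unique flat blocks
            x = blk(x)
        for _ in range(self.loops):      # 2 crawler blocks, weights reused x2
            for blk in self.crawler:
                x = blk(x)
        return x                         # effective depth: 3 + 2*2 = 7
```

Only 5 blocks hold parameters, which is what keeps the artifact small while the loop buys effective depth.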
Quantization + TTT
Credits