[Research Non-Record] Pure raw-byte JEPA negative result #906
andrew-medrano wants to merge 3 commits into openai:main
Conversation
Hey there, great work on putting this together, love to see more unique approaches to this competition! I saw you tagged my PR and just wanted to mention a few things:
This statement is incorrect as #903 does work in detached mode.
Hey @CiprianFlorin-Ifrim, great points on the above, but I wanted to ask about point 4: since we're predicting a single autoregressive token stream with no natural second view, wouldn't the JEPA signal collapse to a less informative version of CE? I think the reason the LLM-JEPA paper succeeds here is that code diffs have a natural two-view structure where JEPA captures cross-view relationships that single-stream CE misses, and we don't have that asymmetry in next-token prediction. #832 did it by chunk, and I think that's why they see the benefit. I'm not disputing that JEPA can be beneficial to CE as an auxiliary objective, but I think only if the token sequences aren't flat.
Good catch, thanks. You're right that #903 does include a detached diagnostic probe, and I've updated my wording. The distinction I was trying to make is narrower: my PR is about a backbone trained in a pure detached-probe regime, with no exact-loss gradients into the backbone at all, whereas your main reported model is still a CE-trained JEPA-augmented system.

I would agree with the narrower point that BPB requires an explicit exact decoder at evaluation time, since you need logits / normalized probabilities over the next symbol. Where I'd draw the line differently is that this does not make CE loss mandatory in the backbone training path itself. In my setup, the backbone is trained purely with a JEPA objective, then frozen, and only afterward do I train a separate exact decoder on top of the frozen predicted latents. That gives the required exact decoder for BPB without letting exact-loss gradients shape the backbone.

One more note: the detached decoder in my setup is stronger than a quick linear probe. I'm not just fitting a fresh linear head for a few steps; I train a small Transformer decoder on the frozen features. So the negative result is that even with a reasonably strong detached exact decoder, the pure JEPA representation still only reached about 2.38 bpb.

I do realize this is a more awkward and probably less practical setup than the hybrid approaches, and I think papers like LeWorldModel and LLM-JEPA are probably right that for actual usage you usually want JEPA as part of a broader training recipe rather than in this strict isolated form. My goal here was narrower: to test the cleanest "pure JEPA" version I could, and separate that question from the much more practical question of whether a JEPA-like auxiliary objective helps a strong CE model.

In any case, thanks for interacting. It makes this whole process a lot more enjoyable.
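To make the BPB point concrete: an exact decoder has to emit normalized probabilities over the next byte, because bits-per-byte is just the average negative log2-probability assigned to the observed bytes. A minimal sketch (function and argument names are illustrative, not code from any of these PRs):

```python
import numpy as np

def bits_per_byte(logits, targets):
    """Exact BPB from decoder logits over the next-byte distribution.

    logits:  (N, V) unnormalized scores from the detached exact decoder
    targets: (N,)   the observed next bytes
    """
    # Numerically stable log-softmax gives normalized log-probabilities.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the observed bytes, converted from nats to bits.
    nll_nats = -log_probs[np.arange(len(targets)), targets].mean()
    return nll_nats / np.log(2.0)
```

A sanity check: uniform logits over 256 symbols must give exactly 8 bits per byte.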
Good points; however, a few comments based on my work and what I found through ablations:
On decoder complexity: my ablations found that a more complex decoder on worse representations loses to a simple decoder on better representations, and that there is a cap on how much decoder complexity helps. Your work showcases that exactly. Diagnostics showed:
Top-k accuracy measures how often the correct next token appears in the model's k highest-ranked predictions: top-1 = 70% means the model's single best guess is correct 70% of the time; top-5 = 91% means the correct token is in the top 5 guesses 91% of the time. The final model uses a simple tied linear projection (no hidden layers, no nonlinearity) and reaches 1.2 BPB. The information gap is in the backbone, not the decoder.
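Top-k figures like the ones quoted above are computed the standard way; a small sketch (illustrative names, not the PR's actual diagnostic code):

```python
import numpy as np

def top_k_accuracy(logits, targets, k):
    """Fraction of positions where the target token is among the k
    highest-scored tokens. logits: (N, V), targets: (N,)."""
    # Indices of the k largest logits per row; order within the k is irrelevant.
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (top_k == targets[:, None]).any(axis=1)
    return hits.mean()
```

By construction top-1 accuracy can never exceed top-5 accuracy, which matches the 70% / 91% ordering above.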
Ahhh gotcha, that's a fair distinction. Enforcing smooth latent trajectories is different from the multi-view setup I thought you were referencing. Agreed about size: CE might already be extracting everything the representation can hold at this scale.
@MVPandey Indeed, my approach uses a Mamba SSM with an MLP (like the original paper), and I found that skipping the MLP in every second block has only a limited performance effect once the model has 10 or more layers. At 8 layers there is a significant drop. Clearly this 16MB scale constrains what can be done and how well these choices work.
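The skip-the-MLP-in-every-second-block layout can be sketched as a layer spec. This is purely illustrative (the real Mamba mixer is not reproduced, and the names are hypothetical):

```python
def build_stack(n_layers, mlp_every=2):
    """Layer spec for an SSM stack that keeps the MLP only in every
    `mlp_every`-th block, halving MLP parameters when mlp_every=2."""
    return [
        {"mixer": "mamba_ssm", "mlp": (i % mlp_every == 0)}
        for i in range(n_layers)
    ]
```

For example, `build_stack(10)` keeps the MLP in 5 of 10 blocks, which is the regime described above where the performance effect was limited.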
Community Review — [Research Non-Record] Pure raw-byte JEPA negative result

Compliance: NEEDS AUTHOR ACTION

What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with AttributeError: 'NoneType' object has no attribute 'dict'. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run … ? Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — AttributeError: 'NoneType' object has no attribute 'dict'. Classification via …
Retraction — this IMPORT_FAIL was a Python 3.10
Summary
This PR documents the cleanest pure raw-byte JEPA attempt I ran for Parameter Golf. The best result was 2.3839 bpb with transformer_rope_gqa_localglobal + slot_ema_teacher, which is a real improvement over my earlier pure-JEPA runs but still far from the simple baseline's 1.2244.

What Makes This Pure JEPA
So the clean question here is narrow: can a pure raw-byte JEPA backbone, trained without exact-loss gradients, carry enough information that a later detached exact decoder can recover good bpb?

Main Result

Best: 2.3839 bpb with transformer_rope_gqa_localglobal + slot_ema_teacher; earlier pure-JEPA runs came in at 2.8583 and 3.0774.

Controlled Comparisons
These were three fixed-budget comparisons:
Headline winners:
| winner | bpb |
| --- | --- |
| transformer_rope_gqa_localglobal | 2.3889800525604903 |
| slot_ema_teacher | 2.3839 |
| conv_patch | 2.746384624395377 |

Comparison to Other JEPA PRs

Reference numbers: this PR's best is 2.3839 bpb vs 2.1252; one hybrid PR reports a 0.005 BPB gain and is 40% faster; the best long-BPE baselines are 1.2064 (sliding) / 1.2235 (standard); the 10-minute byte baseline is 1.3348 (standard). PRs #708 and #896 are hybrid or auxiliary-loss approaches. PR #903 is closer to this line of work because it also includes a detached diagnostic probe, but its main reported model is still a CE-trained JEPA-augmented system rather than a pure backbone-only JEPA path. So none of them are apples-to-apples comparisons with this PR.
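For readers unfamiliar with the slot_ema_teacher setup, the two moving parts of a pure-JEPA objective (an EMA target encoder plus a latent-space regression loss) can be sketched generically; this is an illustration of the standard technique, not code from this PR or the ones it compares against:

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """In-place EMA of student weights into the teacher: the target
    encoder trails the online encoder instead of receiving gradients."""
    for name in teacher:
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student[name]
    return teacher

def jepa_latent_loss(pred_latents, target_latents):
    """Regression in latent space (MSE). The target branch gets no
    gradient; with plain arrays here, the stop-gradient is implicit."""
    return float(((pred_latents - target_latents) ** 2).mean())
```

The key property for the "pure" setting in this PR is that neither function involves logits or a cross-entropy term, which is exactly why a separate exact decoder is needed later for BPB.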
Main Takeaways

Pure raw-byte JEPA, with no exact-loss gradients into the backbone, topped out at 2.3839 bpb even with a strong detached exact decoder, roughly double the simple CE baseline's 1.2244 bpb.

Why This Still Matters
This PR isolates the “pure JEPA” question more cleanly than the hybrid JEPA-related PRs in the repo. That makes it a useful lower bound and negative control for future JEPA claims: the best-performing JEPA-adjacent results still rely on a strong main CE path, which strengthens rather than weakens the negative result from the pure setup.