Add non-record JEPA byte-level encoder-decoder submission #696
gravelBridge wants to merge 28 commits into openai:main
Conversation
Replace the isolated per-patch decoder (8-byte window with no cross-patch information flow) with a full-sequence causal decoder over all bytes. Each byte can now attend to all preceding bytes across patch boundaries, with patch-level context upsampled and added as conditioning. This removes the critical information bottleneck where byte predictions at patch boundaries had no access to preceding bytes from other patches.
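The conditioning scheme above can be sketched as follows, assuming single-head attention and a toy `causal_decode` helper (names are illustrative, not from the PR): the patch-level context is upsampled to byte resolution with a simple repeat and added to the byte embeddings, and a causal mask lets every byte attend across patch boundaries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_decode(byte_emb, patch_ctx, patch_size):
    """Full-sequence causal self-attention over bytes, conditioned on
    patch-level context upsampled (repeated) to byte resolution and added
    to the byte embeddings."""
    T, d = byte_emb.shape
    # Upsample: each patch vector is repeated for its patch_size bytes.
    cond = np.repeat(patch_ctx, patch_size, axis=0)[:T]
    x = byte_emb + cond
    # Causal mask: byte t attends to bytes <= t, including bytes in
    # earlier patches (no isolated per-patch window).
    scores = (x @ x.T) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ x
```

In a real model the attention would use learned Q/K/V projections and multiple heads; this sketch only shows the masking and the upsample-and-add conditioning path.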
Shift parameter budget from the encoder to the decoder, where val_bpb is determined. Encoder goes from 10 unique blocks to 5 unique blocks cycled 2x (same 10 effective layers, half the unique params). Decoder grows from 2 to 6 layers, tripling capacity for byte-level prediction. Total unique params drops ~1.5M but decoder gets ~4M more.
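The "5 unique blocks cycled 2x" scheme is weight tying across depth; a minimal sketch (the cycling helper is hypothetical, not the PR's actual code) shows how 5 blocks yield 10 effective layers:

```python
def cycled_encoder(x, blocks, cycles=2):
    """Apply the same list of unique blocks `cycles` times in sequence.
    With 5 unique blocks and 2 cycles, the input passes through 10
    effective layers while only 5 blocks' worth of parameters exist."""
    for _ in range(cycles):
        for block in blocks:
            x = block(x)
    return x
```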
…atents Replace the decoder's conditioning signal: instead of pred_latent (the predictor's noisy estimate, routed through a latent_dim bottleneck), the decoder now receives the encoder's context output directly (model_dim, shifted by 1 patch for causality). The JEPA predictor + MSE loss remain as an auxiliary training objective, but the decoder sees the exact encoder representations rather than a noisy compressed proxy. Removes the decoder_cond projection layer since the context is already model_dim.
…dicted latents" This reverts commit 7cdf651.
The JEPA prediction loss had 2x the gradient weight of the actual compression objective (CE). Flip the ratio: CE gets 3x weight, pred gets 0.5x. This directs more gradient signal toward byte-level prediction quality, with JEPA serving as a lighter regularizer.
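The reweighting amounts to a one-line change in the combined loss; a sketch with hypothetical weight constants (the 3.0/0.5 values are from the commit, the names are not):

```python
CE_WEIGHT, PRED_WEIGHT = 3.0, 0.5  # previously the pred loss had 2x the CE weight

def total_loss(ce_loss, pred_loss):
    # CE (the actual compression objective) now dominates the gradient;
    # the JEPA prediction loss serves as a lighter regularizer.
    return CE_WEIGHT * ce_loss + PRED_WEIGHT * pred_loss
```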
Current compressed model is only 9MB of the 16MB limit. Increase model_dim from 384 to 480 and decoder_layers from 6 to 8, bringing total params from ~14.7M to ~26.4M (compressed ~15.8MB). Nearly all the extra capacity goes to the decoder where val_bpb is determined.
Sliding window eval scores each byte with near-maximum context. Windows of seq_len advance by stride (default 512 bytes = 64 patches). Only the tail stride bytes per window are scored (first window scores all). Adds forward_logits() method that returns per-position logits without computing loss. Only the final int8+zlib roundtrip eval uses sliding window; periodic training eval stays fast (non-overlapping).
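The window/stride bookkeeping can be sketched as a pure-Python helper (illustrative, not the PR's code): the first window scores all of its bytes, every later window scores only its last `stride` bytes, and together the scored ranges partition the sequence so each byte is scored exactly once with near-maximum left context.

```python
def sliding_windows(n_bytes, seq_len, stride):
    """Return (window_start, score_from, score_to) triples.
    The first window scores all its bytes; each subsequent window scores
    only the tail `stride` bytes, using up to seq_len of left context."""
    out = [(0, 0, min(seq_len, n_bytes))]
    pos = min(seq_len, n_bytes)  # next unscored byte
    while pos < n_bytes:
        end = min(pos + stride, n_bytes)
        start = max(0, end - seq_len)  # slide the window to keep full context
        out.append((start, pos, end))
        pos = end
    return out
```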
- MLP activation: relu² → LeakyReLU(0.5)² (matching SOTA)
- EMA weight averaging (decay=0.997) applied before serialization
- SWA snapshots collected every 50 steps when lr scale < 0.2
- Test-time training: score-first legal TTT with SGD (lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR decay)
- Eval stride reduced to 64 (matching SOTA)
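For the EMA item above, a minimal sketch of shadow-weight averaging (the class name and dict-of-scalars representation are illustrative; the PR's decay of 0.997 is the only value taken from the source):

```python
class EMAWeights:
    """Exponential moving average of model weights; the averaged copy
    (not the raw training weights) is what gets serialized."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)  # shadow copy initialized from the model

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # Standard EMA update: shadow <- d * shadow + (1 - d) * current
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v
```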
SWA snapshots collected during warmdown are now averaged and applied instead of being discarded. TTT adaptation uses forward_logits + CE loss directly, avoiding unnecessary prediction/SIGReg gradient signal.
Replace INT8+zlib with mixed INT6/INT8+LZMA to reduce serialized model size. MLP/attn/other weights use INT6 ([-31,31]) with per-row MSE-optimal clip search; embeddings stay INT8. Add STE quantization-aware training that activates during warmdown (LR scale < 0.15). Switch compression from zlib to LZMA for better entropy exploitation on low-range values. Also bump eval batch defaults (val_sliding_batch, ttt_batch_seqs) from 8 to 32 to match SOTA, and add infer/adapt timing breakdown to TTT logs.
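A sketch of the per-row INT6 quantization with an MSE-optimal clip search, under stated assumptions: the clip candidates here are a small grid of standard-deviation multiples, which is an illustrative choice, not the PR's exact search procedure.

```python
import numpy as np

def quantize_int6_row(row, clip_grid=(1.0, 2.0, 3.0, 4.0)):
    """Symmetric INT6 quantization to [-31, 31] for one weight row.
    Tries several clip values and keeps the one minimizing dequantization
    MSE against the original row."""
    best = None
    for m in clip_grid:
        clip = m * row.std() + 1e-12
        scale = clip / 31.0
        q = np.clip(np.round(row / scale), -31, 31).astype(np.int8)
        mse = float(((q * scale - row) ** 2).mean())
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    _, q, scale = best
    return q, scale
```

The low dynamic range of INT6 values (63 distinct levels) is also what makes the switch from zlib to LZMA pay off: the entropy coder sees a much smaller symbol alphabet.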
Halve context length, reducing attention cost ~4x per forward pass and making sliding window eval and TTT feasible within the time budget.
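The ~4x figure follows from attention's quadratic cost in sequence length; a back-of-the-envelope helper (the constant factor is a rough FLOP model, not a measurement):

```python
def attn_flops(seq_len, dim):
    # Score matrix (seq_len x seq_len dot products of size dim) plus
    # value mixing: roughly 2 * T^2 * d multiply-accumulates.
    return 2 * seq_len ** 2 * dim
```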
524032 = 2047 * 32 * 8, ensuring divisibility across 8 GPUs.
Bump LZMA compression from preset 6 to 9 and quantize embeddings to INT6 (previously INT8). Previous run was 363KB over budget.
TTT was catastrophically diverging (bpb 1.24 -> 2.53) because sequential adaptation was destroying the JEPA encoder. Now only decoder parameters adapt during TTT. Also disable QAT during TTT to avoid injecting quantization noise on already-dequantized weights.
Revert encoder freeze (divergence was from QAT during TTT, not encoder adaptation). Increase sliding window stride from 64 to 256 and reduce TTT epochs from 3 to 1 for faster eval.
Each decoder block is ~1.84M params (~1.1MB compressed). Previous run was 572KB over the 16MB limit. Dropping one decoder layer should provide enough headroom.
Drop eval_val_sliding (was burning 88s for a diagnostic log). TTT already does sliding window evaluation with adaptation. Bump TTT epochs from 1 to 2 with the freed time budget.
Match the openai#1 record's optimizer settings:
- MATRIX_LR/SCALAR_LR: 0.015 -> 0.025
- MUON_MOMENTUM: 0.95 -> 0.99
- MUON_MOMENTUM_WARMUP_START: 0.85 -> 0.92
- MUON_MOMENTUM_WARMUP_STEPS: 500 -> 1500
- WARMDOWN_ITERS: 1200 -> 3500
Community Review — Add non-record JEPA byte-level encoder-decoder submission
BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code: the TTT path at line 338 implements the score-first-per-chunk pattern, scoring each chunk under the current weights before updating on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=480, layers=5, vocab=260, code=66358 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
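The score-first-per-chunk pattern the review describes can be sketched abstractly (all names here are illustrative placeholders, not the submission's actual functions):

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score-first-per-chunk TTT: each chunk is scored with the CURRENT
    adapter state before the adapter updates on it, so no byte is ever
    scored by weights that have already trained on that byte."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(state, chunk))  # score first (the legal order)
        state = update_fn(state, chunk)        # then adapt on the same chunk
    return losses, state
```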
Non-record submission for the 16MB track using a JEPA (Joint Embedding Predictive Architecture) encoder-decoder as an alternative to the standard causal GPT used by all current leaderboard entries.