Add non-record JEPA byte-level encoder-decoder submission #696
gravelBridge wants to merge 28 commits into openai:main
Conversation
Replace the isolated per-patch decoder (8-byte window with no cross-patch information flow) with a full-sequence causal decoder over all bytes. Each byte can now attend to all preceding bytes across patch boundaries, with patch-level context upsampled and added as conditioning. This removes the critical information bottleneck where byte predictions at patch boundaries had no access to preceding bytes from other patches.
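The conditioning scheme above can be sketched as follows, assuming single-head attention and a toy `causal_decode` helper (names are illustrative, not from the PR): the patch-level context is upsampled to byte resolution with a simple repeat and added to the byte embeddings, and a causal mask lets every byte attend across patch boundaries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_decode(byte_emb, patch_ctx, patch_size):
    """Full-sequence causal self-attention over bytes, conditioned on
    patch-level context upsampled (repeated) to byte resolution and added
    to the byte embeddings."""
    T, d = byte_emb.shape
    # Upsample: each patch vector is repeated for its patch_size bytes.
    cond = np.repeat(patch_ctx, patch_size, axis=0)[:T]
    x = byte_emb + cond
    # Causal mask: byte t attends to bytes <= t, including bytes in
    # earlier patches (no isolated per-patch window).
    scores = (x @ x.T) / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ x
```

In a real model the attention would use learned Q/K/V projections and multiple heads; this sketch only shows the masking and the upsample-and-add conditioning path.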
Shift parameter budget from the encoder to the decoder, where val_bpb is determined. Encoder goes from 10 unique blocks to 5 unique blocks cycled 2x (same 10 effective layers, half the unique params). Decoder grows from 2 to 6 layers, tripling capacity for byte-level prediction. Total unique params drops ~1.5M but decoder gets ~4M more.
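The "5 unique blocks cycled 2x" scheme is weight tying across depth; a minimal sketch (the cycling helper is hypothetical, not the PR's actual code) shows how 5 blocks yield 10 effective layers:

```python
def cycled_encoder(x, blocks, cycles=2):
    """Apply the same list of unique blocks `cycles` times in sequence.
    With 5 unique blocks and 2 cycles, the input passes through 10
    effective layers while only 5 blocks' worth of parameters exist."""
    for _ in range(cycles):
        for block in blocks:
            x = block(x)
    return x
```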
…atents Replace the decoder's conditioning signal: instead of pred_latent (the predictor's noisy estimate, routed through a latent_dim bottleneck), the decoder now receives the encoder's context output directly (model_dim, shifted by 1 patch for causality). The JEPA predictor + MSE loss remain as an auxiliary training objective, but the decoder sees the exact encoder representations rather than a noisy compressed proxy. Removes the decoder_cond projection layer since the context is already model_dim.
…dicted latents" This reverts commit 7cdf651.
The JEPA prediction loss had 2x the gradient weight of the actual compression objective (CE). Flip the ratio: CE gets 3x weight, pred gets 0.5x. This directs more gradient signal toward byte-level prediction quality, with JEPA serving as a lighter regularizer.
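The reweighting amounts to a one-line change in the combined loss; a sketch with hypothetical weight constants (the 3.0/0.5 values are from the commit, the names are not):

```python
CE_WEIGHT, PRED_WEIGHT = 3.0, 0.5  # previously the pred loss had 2x the CE weight

def total_loss(ce_loss, pred_loss):
    # CE (the actual compression objective) now dominates the gradient;
    # the JEPA prediction loss serves as a lighter regularizer.
    return CE_WEIGHT * ce_loss + PRED_WEIGHT * pred_loss
```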
Current compressed model is only 9MB of the 16MB limit. Increase model_dim from 384 to 480 and decoder_layers from 6 to 8, bringing total params from ~14.7M to ~26.4M (compressed ~15.8MB). Nearly all the extra capacity goes to the decoder where val_bpb is determined.
Sliding window eval scores each byte with near-maximum context. Windows of seq_len advance by stride (default 512 bytes = 64 patches). Only the tail stride bytes per window are scored (first window scores all). Adds forward_logits() method that returns per-position logits without computing loss. Only the final int8+zlib roundtrip eval uses sliding window; periodic training eval stays fast (non-overlapping).
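The window/stride bookkeeping can be sketched as a pure-Python helper (illustrative, not the PR's code): the first window scores all of its bytes, every later window scores only its last `stride` bytes, and together the scored ranges partition the sequence so each byte is scored exactly once with near-maximum left context.

```python
def sliding_windows(n_bytes, seq_len, stride):
    """Return (window_start, score_from, score_to) triples.
    The first window scores all its bytes; each subsequent window scores
    only the tail `stride` bytes, using up to seq_len of left context."""
    out = [(0, 0, min(seq_len, n_bytes))]
    pos = min(seq_len, n_bytes)  # next unscored byte
    while pos < n_bytes:
        end = min(pos + stride, n_bytes)
        start = max(0, end - seq_len)  # slide the window to keep full context
        out.append((start, pos, end))
        pos = end
    return out
```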
- MLP activation: relu² → LeakyReLU(0.5)² (matching SOTA)
- EMA weight averaging (decay=0.997) applied before serialization
- SWA snapshots collected every 50 steps when lr scale < 0.2
- Test-time training: score-first legal TTT with SGD (lr=0.002, momentum=0.9, 3 epochs, 32K chunks, cosine LR decay)
- Eval stride reduced to 64 (matching SOTA)
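For the EMA item above, a minimal sketch of shadow-weight averaging (the class name and dict-of-scalars representation are illustrative; the PR's decay of 0.997 is the only value taken from the source):

```python
class EMAWeights:
    """Exponential moving average of model weights; the averaged copy
    (not the raw training weights) is what gets serialized."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)  # shadow copy initialized from the model

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # Standard EMA update: shadow <- d * shadow + (1 - d) * current
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * v
```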
SWA snapshots collected during warmdown are now averaged and applied instead of being discarded. TTT adaptation uses forward_logits + CE loss directly, avoiding unnecessary prediction/SIGReg gradient signal.
Replace INT8+zlib with mixed INT6/INT8+LZMA to reduce serialized model size. MLP/attn/other weights use INT6 ([-31,31]) with per-row MSE-optimal clip search; embeddings stay INT8. Add STE quantization-aware training that activates during warmdown (LR scale < 0.15). Switch compression from zlib to LZMA for better entropy exploitation on low-range values. Also bump eval batch defaults (val_sliding_batch, ttt_batch_seqs) from 8 to 32 to match SOTA, and add infer/adapt timing breakdown to TTT logs.
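A sketch of the per-row INT6 quantization with an MSE-optimal clip search, under stated assumptions: the clip candidates here are a small grid of standard-deviation multiples, which is an illustrative choice, not the PR's exact search procedure.

```python
import numpy as np

def quantize_int6_row(row, clip_grid=(1.0, 2.0, 3.0, 4.0)):
    """Symmetric INT6 quantization to [-31, 31] for one weight row.
    Tries several clip values and keeps the one minimizing dequantization
    MSE against the original row."""
    best = None
    for m in clip_grid:
        clip = m * row.std() + 1e-12
        scale = clip / 31.0
        q = np.clip(np.round(row / scale), -31, 31).astype(np.int8)
        mse = float(((q * scale - row) ** 2).mean())
        if best is None or mse < best[0]:
            best = (mse, q, scale)
    _, q, scale = best
    return q, scale
```

The low dynamic range of INT6 values (63 distinct levels) is also what makes the switch from zlib to LZMA pay off: the entropy coder sees a much smaller symbol alphabet.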
Halve context length, reducing attention cost ~4x per forward pass and making sliding window eval and TTT feasible within the time budget.
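The ~4x figure follows from attention's quadratic cost in sequence length; a back-of-the-envelope helper (the constant factor is a rough FLOP model, not a measurement):

```python
def attn_flops(seq_len, dim):
    # Score matrix (seq_len x seq_len dot products of size dim) plus
    # value mixing: roughly 2 * T^2 * d multiply-accumulates.
    return 2 * seq_len ** 2 * dim
```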
524032 = 2047 * 32 * 8, ensuring divisibility across 8 GPUs.
Bump LZMA compression from preset 6 to 9 and quantize embeddings to INT6 (previously INT8). Previous run was 363KB over budget.
TTT was catastrophically diverging (bpb 1.24 -> 2.53) because sequential adaptation was destroying the JEPA encoder. Now only decoder parameters adapt during TTT. Also disable QAT during TTT to avoid injecting quantization noise on already-dequantized weights.
Revert encoder freeze (divergence was from QAT during TTT, not encoder adaptation). Increase sliding window stride from 64 to 256 and reduce TTT epochs from 3 to 1 for faster eval.
Each decoder block is ~1.84M params (~1.1MB compressed). Previous run was 572KB over the 16MB limit. Dropping one decoder layer should provide enough headroom.
Drop eval_val_sliding (was burning 88s for a diagnostic log). TTT already does sliding window evaluation with adaptation. Bump TTT epochs from 1 to 2 with the freed time budget.
Match the openai#1 record's optimizer settings:
- MATRIX_LR/SCALAR_LR: 0.015 -> 0.025
- MUON_MOMENTUM: 0.95 -> 0.99
- MUON_MOMENTUM_WARMUP_START: 0.85 -> 0.92
- MUON_MOMENTUM_WARMUP_STEPS: 500 -> 1500
- WARMDOWN_ITERS: 1200 -> 3500
Community Review — Add non-record JEPA byte-level encoder-decoder submission
BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code: the TTT path at line 338 implements the score-first-per-chunk pattern, scoring each chunk under the current weights before updating on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=480, layers=5, vocab=260, code=66358 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
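The score-first-per-chunk pattern the review describes can be sketched abstractly (all names here are illustrative placeholders, not the submission's actual functions):

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score-first-per-chunk TTT: each chunk is scored with the CURRENT
    adapter state before the adapter updates on it, so no byte is ever
    scored by weights that have already trained on that byte."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(state, chunk))  # score first (the legal order)
        state = update_fn(state, chunk)        # then adapt on the same chunk
    return losses, state
```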
Non-record submission for the 16MB track using a JEPA (Joint Embedding Predictive Architecture) encoder-decoder as an alternative to the standard causal GPT used by all current leaderboard entries.