# JEPA Attempt Summary

This is a short historical note on the pure-JEPA experiments, kept after cleaning out the old branches. The primary writeup for this folder is in [README.md](README.md).

Repo simple baseline for reference: `1.22436570 val_bpb`.
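All `bpb` numbers in this note are next-byte negative log-likelihood rebased to log 2. As a minimal sketch of the conversion (the helper name is mine, not from this repo):

```python
import math

def nats_to_bpb(mean_nll_nats):
    """Convert mean next-byte NLL in nats to bits per byte."""
    return mean_nll_nats / math.log(2)

# Uniform guessing over the byte260 vocabulary sets the entropy ceiling:
uniform_bpb = nats_to_bpb(math.log(260))  # about 8.02 bpb
```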

## Top-Line Result

We did not get pure JEPA close to the repo baseline. The best clean detached-probe result we saw was:

- `2.3839 bpb` with `transformer_rope_gqa_localglobal + slot_ema_teacher`

That was a large improvement over the earlier pure-JEPA runs, but it was still about `+1.16 bpb` above the simple baseline.

## What Counted As "Pure JEPA" Here

- raw `byte260` inputs only
- no tokenizer
- no exact byte-NLL gradients flowing into the backbone
- backbone trained only with JEPA-style latent prediction plus anti-collapse regularization
- exact byte probabilities produced later by a detached Transformer decoder probe on frozen features

So this was a strict test of whether JEPA latents alone could carry enough information for good byte compression.
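The detached-probe protocol can be sketched structurally as follows, with toy numpy stand-ins for the frozen features: the backbone output is treated as a constant, and only the probe's own parameters ever receive gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: features from a frozen backbone, next-byte targets.
frozen_feats = rng.normal(size=(512, 32))          # no gradient ever flows here
targets = rng.integers(0, 256, size=512)

# Detached probe: a linear softmax head trained on the frozen features only.
W = np.zeros((32, 256))
for _ in range(200):
    logits = frozen_feats @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(len(targets)), targets] -= 1.0    # dCE/dlogits
    W -= 0.5 * frozen_feats.T @ grad / len(targets)  # only the probe updates

logits = frozen_feats @ W
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
bpb = -np.log(probs[np.arange(len(targets)), targets]).mean() / np.log(2)
```

In the real pipeline the probe is a full Transformer decoder rather than a linear head, but the gradient boundary is the same: whatever the probe achieves is a readout of the frozen features, not extra training signal for the backbone.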

## Historical Progression

### 2026-03-24 `BytePatchJEPA_PurityFirst`

- Raw-byte JEPA backbone with a coupled exact decoder term
- Best full run reached about `2.8583 bpb`
- Negative: more compute helped, but the coupled byte-loss path was not pure enough and still far from baseline

### 2026-03-25 `BytePatchJEPA_TiedTransformer`

- Early tied-Transformer JEPA retry
- Effectively stalled near uniform-entropy behavior
- Negative: bad Transformer recipe, not a meaningful positive signal

### 2026-03-25 `BytePatchJEPA_DeepGRU`

- Larger recurrent control
- Trained, but stayed weak
- Negative: more GRU was not the answer

### 2026-03-25 `BytePatchJEPA_UncappedValChase`

- Uncapped validation-only chase
- Improved over the earliest pure runs but still did not suggest an easy path to baseline

### 2026-03-26 `BytePatchJEPA_PureProbeScaling`

- First clean frozen-probe pipeline
- Best result was GRU-based at about `3.0774 bpb`
- Data scaling helped, but the first multi-horizon and multi-scale variants hurt
- Negative: detached probing was the right protocol, but the target and early Transformer recipe were still wrong

## Transformer-Only Campaign

This folder kept only the parts that still looked worth pushing:

- Transformer backbones only
- slot-based targets instead of pooled patch regression
- detached Transformer strong probe only
- stronger repo-style Transformer ingredients: RoPE, GQA, RMSNorm, SwiGLU, residual branch scaling, Muon/AdamW split
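Of those ingredients, RMSNorm is the simplest to state; a minimal generic sketch (not this repo's implementation):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only, no mean subtraction."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```

Unlike LayerNorm it has no bias and no centering, which is part of why it is cheap.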

### Backbone Screen

At the anchor size, with `slot_l2` fixed:

- `transformer_rope_gqa_localglobal`: `2.3889800525604903 bpb`
- `transformer_rope_gqa_base`: `2.389990501438125 bpb`
- `transformer_rope_gqa_convstem`: `2.5803010001832605 bpb`

Takeaway:

- `localglobal` narrowly beat `base`
- `convstem` was a real regression
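The exact window sizes and layer alternation of the `localglobal` variant are not recorded in this note, but the masking pattern it names is standard; a generic sketch:

```python
import numpy as np

def causal_mask(T, window=None):
    """True where query i may attend to key j. window=None gives a full
    causal ("global") layer; an integer gives a sliding local window."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = j <= i
    if window is not None:
        mask = mask & ((i - j) < window)
    return mask

# Hypothetical alternation for a localglobal stack: three local layers
# (window 4 here), then one global layer, repeated.
layer_masks = [causal_mask(8, window=4)] * 3 + [causal_mask(8)]
```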

### Objective Screen

With `transformer_rope_gqa_localglobal` fixed, objective ranking was:

- `slot_ema_teacher`: `2.3839 bpb`
- `slot_cosine`: `2.3885 bpb`
- `slot_l2`: `2.3888 bpb`
- `slot_vicreg`: `2.3918 bpb`
- `masked_slot_jepa`: `2.5098 bpb`

These numbers were recovered from the copied-back live logs because the final `objective_screen/summary.json` was not synced back.

Takeaway:

- `slot_ema_teacher` was the best objective in this family
- objective changes only moved the number by a few thousandths to a few hundredths, except for `masked_slot_jepa`, which was clearly worse
- the main bottleneck did not look like "pick a better JEPA loss" anymore
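The winning `slot_ema_teacher` objective presumably follows the standard EMA-teacher pattern from the self-supervised literature; a minimal sketch (the momentum value is illustrative, not from these runs):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Teacher params drift slowly toward the student; targets come from the
    teacher with a stop-gradient, so the teacher is never trained directly."""
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]

student = {"w": np.ones(4)}
teacher = {"w": np.zeros(4)}
ema_update(teacher, student, momentum=0.9)   # teacher["w"] moves 10% toward student
```

The slow-moving teacher gives more stable regression targets than `slot_l2` against the student's own latents, which is consistent with its small edge here.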

### Encoder Screen

With `transformer_rope_gqa_localglobal + slot_ema_teacher` fixed and a short equal-budget rerun:

- `conv_patch`: `2.746384624395377 bpb`
- `mlp_baseline`: `2.7525905146099565 bpb`
- `patch_transformer`: `2.8835849452702482 bpb`
- `latent_queries`: `2.899715507869489 bpb`

Takeaway:

- `conv_patch` was the only encoder that slightly beat the baseline MLP, and only by about `0.0062 bpb`
- `patch_transformer` and `latent_queries` were clearly worse and slower
- richer within-patch encoders did not solve the core problem
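For reference, a `conv_patch`-style within-patch encoder can be sketched as a small convolution over the bytes of one patch followed by pooling; the shapes and kernel size below are my assumptions, not the recorded configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
B, P, d_in, d_out, k = 4, 8, 16, 32, 3      # patches, patch len, dims, kernel

x = rng.normal(size=(B, P, d_in))           # embedded bytes within each patch
W = rng.normal(size=(k, d_in, d_out)) * 0.1

x_pad = np.pad(x, ((0, 0), (1, 1), (0, 0)))           # "same" padding for k=3
conv = sum(x_pad[:, t:t + P] @ W[t] for t in range(k))
patch_latent = np.maximum(conv, 0.0).mean(axis=1)     # (B, d_out), one per patch
```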

## Main Negatives

- Pure JEPA remained far above the simple baseline even after moving to the stronger Transformer-only setup.
- Lower JEPA loss did not reliably translate into lower exact byte `bpb`.
- Richer patch encoders were mostly negative.
- The detached exact decoder probe learned fine, but the frozen JEPA features still looked too lossy for byte compression.
- The biggest remaining weakness is probably not raw backbone capacity; it is the latent/interface design, especially how much exact local detail survives into the temporal state.

## Current Best Hypothesis

If pure JEPA is going to work better here, the next gains probably come from changing the latent family and the way the backbone consumes it, not from adding more GRU or just making the patch encoder fancier.

The most plausible next directions are:

- let the backbone consume slot tokens directly instead of mostly reasoning over patch summaries
- redesign the latent target family to preserve more local detail
- keep using a detached exact decoder probe so the experiment stays honest
# Pure Raw-Byte JEPA: Negative Result

This folder is a research non-record writeup of the cleanest pure-JEPA path we tried for Parameter Golf. The setup is deliberately strict: raw `byte260`, no tokenizer, no exact byte-loss gradients into the backbone, and exact byte prediction only through a later detached Transformer decoder trained on frozen features. The best result from this path was **`2.3839 bpb`** with `transformer_rope_gqa_localglobal + slot_ema_teacher`, which is a real improvement over our earlier pure-JEPA runs but still about **`+1.16 bpb`** above the simple baseline `1.22436570`.

## What This Tests

The clean question here is narrow:

> Can a pure raw-byte JEPA backbone, trained without exact-loss gradients, carry enough information that a later detached exact decoder can recover good `bpb`?

The protocol was:

- train the backbone only with JEPA-style future-latent prediction plus collapse regularization
- encode each `8`-byte patch into one summary latent and four ordered `2`-byte slot latents
- predict the next summary and slot bank with a Transformer backbone
- freeze the backbone
- train a detached Transformer decoder on frozen features consisting of the causal context state, predicted next summary, and predicted next slot bank

This is intentionally different from hybrid JEPA setups where the exact next-token or next-byte objective helps train the backbone.
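A shape-level sketch of that latent layout, with mean pooling standing in for the real learned encoders and a toy stand-in for the byte260 embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, SLOT, d = 8, 2, 16                       # 8-byte patch, 2-byte slots

embed = rng.normal(size=(260, d)) * 0.02        # toy byte260 embedding table
raw_patch = rng.integers(0, 260, size=PATCH)    # one raw patch
e = embed[raw_patch]                            # (8, d)

summary = e.mean(axis=0)                                  # one summary latent
slots = e.reshape(PATCH // SLOT, SLOT, d).mean(axis=1)    # four ordered slots
```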

## Main Result

| Result | `bpb` | Notes |
|------|------:|------|
| Best pure detached-probe result | `2.3839` | `transformer_rope_gqa_localglobal + slot_ema_teacher` |
| Earlier purity-first milestone | `2.8583` | earlier raw-byte JEPA with a coupled exact decoder term |
| First clean frozen-probe milestone | `3.0774` | earlier pure-probe campaign |

No clean scaling-law claim is made here. The dedicated scale run was interrupted, and the early scale points were not strong enough to support a meaningful extrapolation.

## Three Controlled Comparisons

Internally these are named `backbone_screen`, `objective_screen`, and `encoder_screen`. They are just three controlled comparisons run at fixed budgets.

### 1. Backbone Comparison

Same objective, same patch latent design, different Transformer backbones.

| Backbone | `bpb` |
|------|------:|
| `transformer_rope_gqa_localglobal` | `2.3889800525604903` |
| `transformer_rope_gqa_base` | `2.389990501438125` |
| `transformer_rope_gqa_convstem` | `2.5803010001832605` |

### 2. Objective Comparison

Same winning backbone, same patch latent design, different JEPA objectives.

These values were recovered from copied-back final strong-probe logs because `results/objective_screen/summary.json` never synced back.

| Objective | `bpb` |
|------|------:|
| `slot_ema_teacher` | `2.3839` |
| `slot_cosine` | `2.3885` |
| `slot_l2` | `2.3888` |
| `slot_vicreg` | `2.3918` |
| `masked_slot_jepa` | `2.5098` |
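Of these, `slot_vicreg` presumably uses VICReg-style variance and covariance anti-collapse terms; a generic numpy sketch of those two terms (not this repo's exact loss):

```python
import numpy as np

def vicreg_reg(z, var_target=1.0, eps=1e-4):
    """Variance + covariance anti-collapse terms over a batch of latents z."""
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.maximum(0.0, var_target - std).mean()   # keep per-dim spread
    cov = (z.T @ z) / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]         # decorrelate dims
    return var_loss + cov_loss
```

Collapsed latents (all identical) are heavily penalized, while well-spread, decorrelated latents pay almost nothing.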

### 3. Patch-Encoder Comparison

Same winning backbone and objective, different within-patch latent encoders, under the same short equal-budget rerun.

| Patch encoder | `bpb` |
|------|------:|
| `conv_patch` | `2.746384624395377` |
| `mlp_baseline` | `2.7525905146099565` |
| `patch_transformer` | `2.8835849452702482` |
| `latent_queries` | `2.899715507869489` |

## Comparison to Other JEPA PRs

These are useful comparison points, but they are not the same experiment.

| PR | Training path | Tokenization | Reported result | Why it differs |
|------|------|------|------:|------|
| This folder | pure detached-probe JEPA | raw bytes | `2.3839` | no exact-loss gradients into backbone |
| [PR #708](https://github.com/openai/parameter-golf/pull/708) | hybrid JEPA + exact next-byte scorer | raw bytes | about `2.1252` | exact next-byte compression objective is in the main training path and predicted chunk latents are fused back into the scorer |
| [PR #896](https://github.com/openai/parameter-golf/pull/896) | JEPA self-distillation auxiliary loss on top of autoregressive LM | tokenized | PR author reports vanilla CE beats JEPA by `0.005 BPB` and is `40%` faster | CE remains the main path and the comparison is token-level, not raw-byte pure JEPA |
| [PR #903](https://github.com/openai/parameter-golf/pull/903) | LeWorldModel-style JEPA + SIGReg + CE head, plus a detached diagnostic probe | BPE and byte | reported `1.2064` sliding / `1.2235` standard for best long BPE, `1.2566` 10-minute BPE, `1.3348` standard 10-minute byte | includes a detached probe diagnostic, but the main reported model is still CE-trained, CE is described as dominant by mid-training, and the JEPA-only contribution remains open |

PRs #708 and #896 are hybrid or auxiliary-loss approaches. PR #903 is closer to this line of work because it also includes a detached diagnostic probe, but its main reported model is still a CE-trained JEPA-augmented system rather than a backbone trained in a pure detached-probe regime. So none of them are apples-to-apples comparisons with this setup.

## Main Takeaways

- Stronger Transformer backbone plus slot-based targets improved pure JEPA substantially over earlier attempts.
- Once that latent family was in place, objective changes only moved the result a little, except `masked_slot_jepa`, which was clearly worse.
- Richer within-patch encoders mostly did not help; `conv_patch` only barely beat the baseline MLP encoder.
- Lower JEPA loss did not reliably translate into lower exact-byte `bpb`.
- The current bottleneck looks like latent/interface design, not just encoder capacity or loss choice.

## What Still Looks Wrong

- The temporal path still appears too summary-dominant: the backbone mostly reasons over patch summaries, not the full slot history.
- The future-latent predictor is still effectively too deterministic for byte compression, so it likely averages over plausible futures.
- The detached exact decoder can learn, but the frozen JEPA features still appear too lossy for exact byte prediction.
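The second point is the classic mode-averaging failure of deterministic regression, which can be illustrated in two lines:

```python
import numpy as np

# Two equally likely futures; the L2-optimal deterministic prediction is
# their mean, which matches neither actual future.
futures = np.array([-1.0, 1.0])
best_point = futures.mean()                       # 0.0
residual = ((futures - best_point) ** 2).mean()   # irreducible error of 1.0
```

If the byte stream genuinely branches, an L2- or cosine-trained predictor of the next latent is pushed toward the average of the branch latents, which may decode to no valid continuation at all.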

## Evidence Kept in This Folder

- [Historical notes](JEPA_SUMMARY.md)
- [Objective comparison recovered from logs](results/objective_screen_from_logs.md)
- [Backbone comparison summary](results/backbone_screen/summary.json)
- [Patch-encoder comparison: `mlp_baseline`](results/encoder_screen_mlp_baseline/summary.json)
- [Patch-encoder comparison: `conv_patch`](results/encoder_screen_conv_patch/summary.json)
- [Patch-encoder comparison: `patch_transformer`](results/encoder_screen_patch_transformer/summary.json)
- [Patch-encoder comparison: `latent_queries`](results/encoder_screen_latent_queries/summary.json)

## Reproduction

Smoke:

```bash
cd records/track_non_record_16mb/2026-03-26_BytePatchJEPA_TransformerOnly
env SELF_TEST=1 python3 train_gpt.py
python3 summarize_sweep.py --self-test
python3 launch_runpod_probe.py --phase smoke --gpu-count 1
```

Backbone comparison:

```bash
cd records/track_non_record_16mb/2026-03-26_BytePatchJEPA_TransformerOnly
python3 launch_runpod_probe.py --phase backbone_screen --gpu-count 4
```

Objective comparison:

```bash
cd records/track_non_record_16mb/2026-03-26_BytePatchJEPA_TransformerOnly
python3 launch_runpod_probe.py --phase objective_screen --gpu-count 4
```

This folder is a research non-record writeup. It does **not** claim a validated 16MB artifact submission.