
Non-record Submission: Text Diffusion + Retrodiction + TTT + Depth Recurrence#1255

Open
akaiHuang wants to merge 8 commits into openai:main from akaiHuang:codex/nonrecord-textdiffusion-retrodiction-ttt

Conversation

@akaiHuang

Summary

This PR adds a non-record 16MB submission under records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence.

Techniques included:

  • Text Diffusion (CDM) with Sequential Unmasking eval
  • AR Retrodiction
  • Test-Time Training (full-model AdamW)
  • Depth recurrence experiments
  • Custom v4096 tokenizer

Files

  • README.md
  • submission.json
  • train_gpt.py
  • train_cdm.py
  • eval_sequential_unmasking.py
  • eval_ttt.py
  • bpe_v4096.model
  • train.log

This is intended for the non-record track.

16L/512d/39M params, trained on an M1 Max (not 8xH100).
Retrodiction: a reversed-sequence auxiliary loss inspired by quantum information theory.
Int6 quantization + lzma compression yields 14.8 MB (within the 16 MB limit).
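As an illustration of the compression pipeline named above, here is a minimal sketch; the symmetric per-tensor quantizer, the 6-bit packing scheme, and the function names are assumptions for illustration, not the submission's actual code:

```python
import lzma
import numpy as np

def quantize_int6(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Hypothetical symmetric per-tensor quantization to 6-bit ints in [-31, 31].
    scale = float(np.abs(weights).max()) / 31.0
    q = np.clip(np.round(weights / scale), -31, 31).astype(np.int8)
    return q, scale

def compressed_size_bytes(weights: np.ndarray) -> int:
    q, _scale = quantize_int6(weights)
    # Pack each value into 6 bits (drop the top 2 bits of the offset-encoded
    # uint8) so the on-disk footprint actually reflects the 6-bit width,
    # then lzma-compress the packed stream.
    bits = np.unpackbits((q + 32).astype(np.uint8)[:, None], axis=1)[:, 2:]
    packed = np.packbits(bits.ravel())
    return len(lzma.compress(packed.tobytes(), preset=9))

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
print(compressed_size_bytes(w) < w.nbytes)  # True: int6+lzma beats raw fp32
```

The real submission would apply this per parameter tensor and also store the scales; the sketch only shows why the 6-bit + lzma combination lands well under the raw float32 size.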
@akaiHuang
Author

Update

This PR is for the non-record track.

Current results:

  • AR + Retrodiction (v4096): 1.497
  • AR + TTT (full-model AdamW): 1.492
  • Shared AR+CDM (single model): 1.503 (~2.3MB)
  • CDM + Sequential Unmasking: 2.570

Some exploratory H100 coarse-to-fine entries in the README are marked “no log saved” and are research observations, not record claims.

Next: run 3–5 seed reproducibility on 8xH100 and optimize for strict 10-minute train/eval constraints.

…rting scripts

This commit promotes the v3.5 writeup (already present as README_v3_5_DRAFT.md
since commit 5790ba7) to the canonical README.md so that PR openai#1255 reviewers
see the verified version directly, and syncs the three supporting scripts
from the meadow-golf research diary to the v3.5 versions.

README.md (v3.3 -> v3.5):
- Headline now reports the 5-seed mean delta (-0.0205 BPB) as the primary
  effect size, with the single best seed (-0.0290 BPB) as a post-hoc
  deployable-artifact reference, not as the headline number
- §3.1 is now the 11L multi-seed verification at the true final checkpoint
  (5 fresh shared seeds + 1 fresh causal-only control seed). Original
  6-run scaling sweep retained as §3.2 cross-scale evidence
- Adds §6.0 (5L multi-seed verification) as the gating follow-up
- Adds Appendix A consolidating legacy intermediate-checkpoint 11L numbers
  for traceability with v3.3
- §10 Compliance clarifies that competition submission unit is int6.lzma
  (under 16 MB cap); .npz files are working format only
- §3.1 statistical caveat: no significance test is computed because the
  control side has only 1 fresh seed in this round; the second control
  seed in §6.0 closes this gap

train_cdm.py: writes step_final.pt at end of training (addresses
intermediate-checkpoint bias from v3.3).

eval_cf_ablation.py: detects .npz files and loads them with the correct
parameter dtype, so CF eval can run directly on training final-state
.npz output.
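The .npz path described above can be sketched as follows; the helper name and dtype handling are hypothetical, since the actual eval_cf_ablation.py logic is not shown in this thread:

```python
import os
import tempfile
import numpy as np

def load_npz_state(path: str, target_dtype=np.float32) -> dict:
    # Load an .npz working-format checkpoint and cast every parameter
    # array to the dtype the eval graph expects (training may have
    # written reduced-precision working copies).
    with np.load(path) as data:
        return {k: data[k].astype(target_dtype) for k in data.files}

# Usage sketch: a half-precision training save comes back as float32.
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "step_final.npz")
    np.savez(p, w=np.ones(4, dtype=np.float16))
    state = load_npz_state(p)
    print(state["w"].dtype)  # float32
```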

train_ablation_runner.py: adds --seed flag that patches the module-level
SEED constant in train_cdm.py and writes a per-seed patched script with
_s<seed> in the filename, so seeds_run/run_p5.sh and run_phase_b.sh are
self-contained against the unified runner.
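The per-seed patching step can be sketched as below; a minimal illustration of the described --seed behavior, with a hypothetical helper name (the real train_ablation_runner.py may differ):

```python
import re
from pathlib import Path

def write_seeded_copy(script: Path, seed: int) -> Path:
    # Patch the module-level SEED constant and write a _s<seed> copy of
    # the training script, so each seed run is a self-contained file.
    patched, n = re.subn(r"(?m)^SEED\s*=\s*\d+", f"SEED = {seed}",
                         script.read_text())
    if n != 1:
        raise ValueError(f"expected exactly one SEED assignment, found {n}")
    out = script.with_name(f"{script.stem}_s{seed}{script.suffix}")
    out.write_text(patched)
    return out
```

A shell orchestrator like run_p5.sh could then simply invoke `python train_cdm_s1337.py` and so on, with no shared mutable state between seed runs.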

The seeds_run/ reviewer spot-check bundle (logs, orchestration scripts,
wrapper stdout) is already committed in 5790ba7. The .npz / step_final.pt
state files (~1.3 GB) are intentionally not committed; their location and
on-request availability are documented in seeds_run/README.md and §10.
@akaiHuang
Author

v3.5 update pushed to the PR head branch.

Main changes in this revision:

  • Canonical README.md is now the v3.5 writeup; reviewers no longer need to open README_v3_5_DRAFT.md.
  • The 11L headline now uses the true final training step with 5 fresh shared-model seeds and 1 fresh matched causal-only control seed.
  • Main method-level result at 11L: shared CF 1.3009 ± 0.005 vs matched causal-only final-checkpoint control 1.3214, for a 5-seed mean delta of -0.0205 BPB.
  • The single best shared seed (SEED=1337) is retained only as a post-hoc deployable-artifact reference: 1.2924, i.e. -0.0290 BPB versus the matched control.
  • The original 6-run 5L+11L sweep is retained as §3.2 cross-scale evidence, while legacy intermediate-checkpoint 11L numbers are moved to Appendix A for traceability with v3.3.
  • Supporting scripts are synchronized with the v3.5 methodology:
    • train_cdm.py now writes step_final.pt
    • train_ablation_runner.py supports --seed for the per-seed reruns in seeds_run/
    • eval_cf_ablation.py can load final-state .npz saves directly
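The headline deltas quoted above follow directly from the reported means; a quick arithmetic check using only numbers stated in this thread:

```python
# Numbers as reported in the v3.5 update (BPB).
shared_mean = 1.3009     # 5-seed shared CF mean
control_final = 1.3214   # matched causal-only final-checkpoint control
best_seed = 1.2924       # best shared seed (SEED=1337)

print(round(shared_mean - control_final, 4))  # -0.0205
print(round(best_seed - control_final, 4))    # -0.029
```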

This remains a non-record submission. The 11L rows are not claimed as record candidates; they are filed under the non-record track explicitly.

Standalone research diary mirror: https://github.com/akaiHuang/meadow-golf

@MatoTeziTanka

MatoTeziTanka commented Apr 11, 2026

Community Review — Non-record Submission: Text Diffusion + Retrodiction + TTT + Depth Recurrence

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

ModuleNotFoundError: No module named 'mlx'

This matches a few common patterns I've seen for this class of error in the 2026-04-11 sweep.

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'mlx'. Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

…s cleanly on Linux CPU

Addresses the @MatoTeziTanka community-review compliance check on PR openai#1255
that flagged ModuleNotFoundError("No module named 'mlx'") at the import step
on a Linux CPU smoke-test environment (CT2038 proteus-engine, Python 3.10,
torch 2.10.0+cpu).

The four affected files are Apple-Silicon-only pre-flight artifacts referenced
in README §3.3 / §3.4 / §3.6 (the M1 Max 5L sweep and the leakage integrity
test). They are not part of the H100 production training or evaluation path
(train_cdm.py / eval_cf_ablation.py), but they live in the same PR folder and
the reviewer's smoke test imports the whole folder.

Fix in each file:

- Wrap `import mlx.core / mlx.nn / mlx.optimizers / mlx.utils` in a
  try/except ImportError block.
- On ImportError, install minimal stubs so module-level definitions like
  `class Foo(nn.Module):` and `COMPUTE_DTYPE = mx.bfloat16` parse without
  raising. The stub class is permissive: any attribute access or call
  returns another stub instance, sufficient for class subclassing and
  module-level attribute assignment.
- Set `_HAS_MLX = True/False` so any future runtime check can gate behavior.

leakage_test.py specifically had ~95 lines of module-level executable code
(including a sys.exit(1) on missing checkpoint, and direct mx.array / mx.eval
calls). All of that is now wrapped in a `_main()` function with an
`if __name__ == "__main__": sys.exit(_main())` guard at the bottom, so
importing the file is a no-op and only running it as a script triggers the
test logic. The script also exits cleanly with a clear message when MLX is
not installed.

Verification (this commit):
- All 4 files: py_compile passes on Python 3.10 syntax (verified with the
  default python3 in this environment).
- All 4 files: import succeeds on a machine with MLX installed
  (_HAS_MLX = True; real MLX path).
- All 4 files: import succeeds on a simulated mlx-blocked environment
  via a custom __import__ override (_HAS_MLX = False; stub path).
- Functional behavior on Apple Silicon is unchanged: the real `mlx.core`,
  `mlx.nn`, `mlx.optimizers`, and `mlx.utils` are imported when available.

This commit only touches the four pre-flight scripts. No README, training
code, eval code, or numbers change.
@akaiHuang
Author

Thanks for the smoke-test details and for catching this before the official audit.

The four files that triggered the import error (eval_cf_dualbrain.py, eval_ttt.py, eval_sequential_unmasking.py, leakage_test.py) are Apple-Silicon-only pre-flight artifacts referenced in §3.3 / §3.4 / §3.6 of the README. They are not part of the H100 production training or evaluation path (which is train_cdm.py + eval_cf_ablation.py), but they live in the same PR folder so your folder-level smoke test correctly hit them.

Fix pushed in commit a4286a4: each file now wraps its `import mlx` in a try/except ImportError block, installs a minimal stub class on failure (so module-level definitions like `class Foo(nn.Module):` and `COMPUTE_DTYPE = mx.bfloat16` still parse), and sets `_HAS_MLX = False`. leakage_test.py additionally moves its module-level test logic into a `_main()` function gated by `if __name__ == "__main__":` so that importing the file is a no-op.

Locally verified:

  • py_compile passes on all four files under Python 3.10 syntax.
  • Import succeeds on a real Apple Silicon machine with MLX installed (_HAS_MLX=True, original behaviour preserved).
  • Import succeeds on a simulated mlx-blocked environment via a custom `__import__` override (_HAS_MLX=False, stub path).

No README, no scoring path, no numbers change. Please re-run the audit at your convenience — happy to help debug if anything else trips.

@MatoTeziTanka

Re-audited at head SHA a4286a4.

Fix confirmed. The four MLX-dependent files (eval_cf_dualbrain.py, eval_ttt.py, eval_sequential_unmasking.py, leakage_test.py) all now wrap import mlx in try/except ImportError with _HAS_MLX = False fallback and minimal stub classes. leakage_test.py moves its module-level logic into a _main() gated by if __name__ == "__main__". All files compile clean under Python 3.10.

As you noted, these are Apple Silicon pre-flight artifacts — not part of the H100 scored eval path (train_cdm.py + eval_cf_ablation.py). The fix is the right approach: guarded import so folder-level smoke tests don't false-positive while preserving full functionality on actual Apple Silicon machines.

train_gpt.py (the H100 path) compiles cleanly. I'll queue the full compliance audit on the active eval path for the next sweep.


Re-audit by @MatoTeziTanka. Verified MLX import guards in all 4 files, py_compile OK under Python 3.10.

