Add RIOM v1 shared-depth recurrence non-record submission#523

Draft
hesong0222-dev wants to merge 2 commits into openai:main from hesong0222-dev:riom-v1-shared-depth
Conversation

@hesong0222-dev

Summary

This draft PR adds a non-record RIOM-style submission under records/track_non_record_16mb/2026-03-23_RIOM_v1_recur.

The core idea is shared-depth recurrence: replace 9 distinct transformer blocks with 3 shared blocks repeated 3 times, with lightweight learned recurrence gates. The tokenizer and dataset remain unchanged from the official SP1024 FineWeb setup, and the change is isolated to the record-local train_gpt.py.
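The shared-depth idea can be sketched in a few lines. This is an illustrative PyTorch sketch, not the PR's actual MLX `train_gpt.py`: the class name, gate shape, and loop order (repeats inside blocks) are all assumptions about how "3 shared blocks repeated 3 times with lightweight learned recurrence gates" might be wired.

```python
import torch
import torch.nn as nn

class SharedDepthStack(nn.Module):
    """Sketch of shared-depth recurrence: n_shared blocks, each applied
    n_repeats times, with one learned scalar gate per (block, repeat) pair
    scaling the residual update. Effective depth is n_shared * n_repeats
    (here 3 * 3 = 9) while parameters are stored for only n_shared blocks."""

    def __init__(self, block_factory, n_shared=3, n_repeats=3):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(n_shared))
        # lightweight recurrence gates: one scalar per (shared block, repeat)
        self.gates = nn.Parameter(torch.ones(n_shared, n_repeats))
        self.n_repeats = n_repeats

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            for r in range(self.n_repeats):
                x = x + self.gates[i, r] * block(x)
        return x
```

With a real transformer block as `block_factory`, this stores roughly one third of the block parameters of a 9-block model plus 9 gate scalars, which is the "more effective depth per byte" trade the PR describes.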

What actually ran

This package is based on a real Apple Silicon MLX development run.

  • training steps: 50
  • validation prefix: 1,048,576 official validation tokens
  • pre-quant: val_loss=5.4143, val_bpb=3.2439
  • post-quant: val_loss=5.42207763, val_bpb=3.24862008
  • code bytes: 51,404
  • compressed model bytes: 2,273,437
  • total counted bytes: 2,324,841
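As a sanity check on the reported metrics (assuming the conventional definition val_bpb = val_loss / ln(2) × tokens-per-byte, which the PR does not state explicitly), the pre-quant and post-quant pairs should imply the same tokens-per-byte ratio for the validation prefix:

```python
import math

def implied_tokens_per_byte(val_loss, val_bpb):
    # assumes val_bpb = (val_loss / ln 2) * tokens_per_byte
    return val_bpb / (val_loss / math.log(2))

pre = implied_tokens_per_byte(5.4143, 3.2439)            # pre-quant pair
post = implied_tokens_per_byte(5.42207763, 3.24862008)   # post-quant pair
print(round(pre, 4), round(post, 4))  # both come out near 0.4153
```

The two pairs agree to about four decimal places, so the numbers are internally consistent under that assumed definition.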

Why this is interesting even if non-record

On the same local development prefix, this recurrent package improved val_bpb over the local baseline package while sharply reducing counted artifact size. That is the main RIOM hypothesis in compact form: more effective depth per byte through parameter sharing.

Limitations

  • This is not a record claim.
  • This is a local MLX development run, not an 8xH100 submission.
  • Validation was capped with VAL_MAX_TOKENS; it should be rerun with VAL_MAX_TOKENS=0 before any serious upstream submission.
  • The next concrete step is a CUDA/PyTorch port into the official train_gpt.py path.

@hesong0222-dev hesong0222-dev mentioned this pull request Mar 23, 2026
@MatoTeziTanka

MatoTeziTanka commented Apr 11, 2026

Community Review — Add RIOM v1 shared-depth recurrence non-record submission

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

ModuleNotFoundError: No module named 'mlx'

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka, The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'mlx'. Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@hesong0222-dev
Author

Thanks for the smoke test.

I pushed a fix for the import blocker on the RIOM v1 non-record script. The update guards the MLX and sentencepiece imports so the module can be imported on a machine without those packages installed, while still failing clearly at runtime if someone actually tries to run the MLX path without the required dependencies.
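The guard pattern described above looks roughly like the following sketch (names and structure are illustrative, not the exact code pushed to the PR): optional imports are wrapped in try/except at module scope, and the runtime entry point checks for the missing dependencies before using them.

```python
# Guarded optional imports: the module stays importable without mlx or
# sentencepiece installed; the failure is deferred to runtime.
try:
    import mlx.core as mx  # only needed for the MLX training path
except ImportError:
    mx = None

try:
    import sentencepiece as spm  # only needed for tokenization
except ImportError:
    spm = None

def main():
    missing = [name for name, mod in (("mlx", mx), ("sentencepiece", spm))
               if mod is None]
    if missing:
        # fail clearly at runtime, naming the missing dependencies,
        # instead of failing at import time
        raise ModuleNotFoundError(
            "the MLX training path requires: " + ", ".join(missing)
        )
    # ... training logic using mx and spm would follow here ...
```

This keeps `py_compile` and plain `import` working on a dependency-free eval image, which is what the compliance pipeline's import step checks.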

What I verified locally after the change:

  • python3 -m py_compile records/track_non_record_16mb/2026-03-23_RIOM_v1_recur/train_gpt.py
  • direct module import in an environment without mlx and without sentencepiece
  • calling main() in that same environment now raises a clear ModuleNotFoundError explaining which runtime dependencies are missing, instead of failing during import

If you re-run the compliance pipeline, the import step should now pass cleanly.

@hesong0222-dev
Author

Follow-up: I re-ran the checks under Python 3.10 specifically.

  • python3.10 -m py_compile records/track_non_record_16mb/2026-03-23_RIOM_v1_recur/train_gpt.py now passes
  • direct module import also passes on a machine without mlx and without sentencepiece
  • in that same environment, calling main() now fails cleanly with a runtime dependency error instead of failing during import
