
Record: Scylla + GPTQ + BH3072 — val_bpb 1.0856 (3-seed mean)#1405

Open
anthony-maio wants to merge 1 commit into openai:main from anthony-maio:submission/scylla-gptq-clean

Conversation

@anthony-maio

Summary

  • val_bpb: 1.0856 (3-seed mean)
  • Artifact: 15.3-15.8 MB (all seeds < 16MB)
  • Training: 600s on 8xH100 SXM | No SLOT, No TTT

3-Seed Results

| Seed | Sliding BPB | Artifact (bytes) |
|------|-------------|------------------|
| 1337 | 1.1009 | 15,267,156 |
| 42 | 1.0782 | 15,813,568 |
| 2024 | 1.0777 | 15,807,116 |
| **Mean** | **1.0856** | |

Beats merged SOTA (1.1147, PR #1019) by 0.029 BPB (14x significance threshold).

Key Techniques

Compliance

  • No SLOT — no eval-time delta optimization
  • No TTT — no eval-time weight updates
  • No n-gram cache, no network calls
  • Tokenizer byte accounting via validated metadata (candidate.meta.npz)
  • All artifacts under 16MB, all training under 600s

Note on seed variance

Seed 1337 trained at ~153 ms/step due to pod throttling, completing only 3,933 steps vs 6,500+ for seeds 42/2024. Seeds 42 and 2024 (~1.078 BPB) are the representative results; a clean 3-seed run on a consistently fast pod would yield a mean of ~1.079.
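As an arithmetic sanity check (a sketch, not output from the PR tooling), the reported means follow directly from the per-seed numbers:

```python
# Per-seed sliding BPB values as reported in the results table.
seeds = {1337: 1.1009, 42: 1.0782, 2024: 1.0777}

mean3 = sum(seeds.values()) / 3                 # reported 3-seed mean
mean2 = (seeds[42] + seeds[2024]) / 2           # excluding the throttled seed

assert abs(mean3 - 1.0856) < 5e-5
# A clean rerun of seed 1337 landing near the other two would put the
# 3-seed mean close to this 2-seed value, consistent with the ~1.079 estimate.
assert abs(mean2 - 1.078) < 5e-4
```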

Reproduction

```bash
VOCAB_SIZE=998 BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
DATA_PATH=./data/datasets/fineweb10B_scylla \
TOKENIZER_PATH=./candidate.vocab TOKENIZER_META_PATH=./candidate.meta.npz \
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Scylla shards: anthonym21/fineweb10B-scylla on HuggingFace.

Credits

Scylla tokenizer (998 vocab TokenMonster) + AR self-gen GPTQ int6 + BH3072x112.
No SLOT, no TTT, no causality violations. Legally clean.
3-seed: 1337=1.1009, 42=1.0782, 2024=1.0777. All under 16MB.
Beats merged SOTA (1.1147) by 0.029 BPB.
Copilot AI review requested due to automatic review settings April 6, 2026 03:51
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10min/16mb record submission directory for the “Scylla + GPTQ + BH3072” run, including the training/eval script, tokenizer assets/metadata, run logs, and a submission metadata JSON.

Changes:

  • Adds train_gpt.py implementing the training + AR self-gen GPTQ int6 export + sliding-window eval pipeline.
  • Adds per-seed training logs and submission metadata (submission.json) for the reported 3-seed mean.
  • Adds Scylla tokenizer artifacts (candidate.vocab, candidate.meta.npz) and a README describing the run and reproduction.

Reviewed changes

Copilot reviewed 3 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/train_gpt.py | Main training/eval/export script for the record run. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/train_seed42.log | Captured training + quantization + eval log for seed 42. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/train_seed2024.log | Captured training + quantization + eval log for seed 2024. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/train_seed1337.log | Captured training + quantization + eval log for seed 1337. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/submission.json | Declares reported metrics, artifact sizes, and run metadata. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/README.md | Human-readable description of results, compliance, and reproduction command. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/candidate.vocab | Scylla TokenMonster vocab binary. |
| records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/candidate.meta.npz | Tokenizer byte-accounting metadata used for BPB computation. |


Comment on lines +96 to +103
```python
vrl_enabled = bool(int(os.environ.get("VRL_ENABLED", "1")))
ogd_enabled = bool(int(os.environ.get("OGD_ENABLED", "1")))
ogd_lr = float(os.environ.get("OGD_LR", 0.1))
cache_lambda = float(os.environ.get("CACHE_LAMBDA", 0.02))
cache_decay = float(os.environ.get("CACHE_DECAY", 0.995))
ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1")))
ttt_lr = float(os.environ.get("TTT_LR", 0.001))
ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3))
```

Copilot AI Apr 6, 2026


The PR/README claims “No TTT” and “No n-gram cache”, but the script defaults both OGD_ENABLED and TTT_ENABLED to enabled ("1"). With defaults, the end-of-run eval path will run OGD (which maintains cache_counts) and legal score-first TTT (which performs eval-time weight updates), contradicting the stated compliance unless callers remember to override env vars.
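To make the flagged default concrete, here is a minimal sketch of the flag-parsing pattern in the quoted snippet (`env_flag` is a hypothetical helper; the script inlines the same `bool(int(os.environ.get(...)))` expression):

```python
import os

def env_flag(name: str, default: str = "1") -> bool:
    # Mirrors bool(int(os.environ.get(name, default))) from train_gpt.py.
    return bool(int(os.environ.get(name, default)))

# With nothing set, both gates fall back to the "1" default, i.e. ON:
os.environ.pop("TTT_ENABLED", None)
os.environ.pop("OGD_ENABLED", None)
assert env_flag("TTT_ENABLED") and env_flag("OGD_ENABLED")

# A run matching the README's "No TTT / no n-gram cache" claims must opt out explicitly:
os.environ["TTT_ENABLED"] = "0"
os.environ["OGD_ENABLED"] = "0"
assert not env_flag("TTT_ENABLED") and not env_flag("OGD_ENABLED")
```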

Comment on lines +23 to +39
## Compliance

- No SLOT (no eval-time delta optimization)
- No TTT (no eval-time weight updates)
- No n-gram cache
- No network calls
- Tokenizer byte accounting via validated metadata (candidate.meta.npz)
- All artifacts under 16MB, all training under 600s

## Reproduction

```bash
VOCAB_SIZE=998 BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
DATA_PATH=./data/datasets/fineweb10B_scylla \
TOKENIZER_PATH=./candidate.vocab TOKENIZER_META_PATH=./candidate.meta.npz \
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Copilot AI Apr 6, 2026


The README states “No SLOT, No TTT” and “No n-gram cache”, but the provided reproduction command does not set TTT_ENABLED=0 / OGD_ENABLED=0 (and the script defaults both to enabled). Update the command (or defaults) so that a copy/paste run matches the claimed compliance.

Comment on lines +1423 to +1428
```python
quant_result, quant_meta = mixed_quantize_int6(sd_cpu, {"mlp", "attn"}, hessians=hessians)
quant_buf = io.BytesIO()
torch.save({"w": quant_result, "m": quant_meta}, quant_buf)
quant_raw = quant_buf.getvalue()
quant_blob = lzma.compress(quant_raw, preset=6)
if master_process:
```

Copilot AI Apr 6, 2026


Docs/metadata claim “LZMA-9”, but the code compresses the quantized artifact with lzma.compress(..., preset=6). Either bump the preset to 9 (if that’s what was actually used) or update README/submission blurb to reflect preset=6 so the record is reproducible.
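For what it's worth, the preset only changes the size/speed trade-off, not the decoded bytes, so the mismatch affects reproducibility of the artifact size rather than model correctness. A quick sketch (the payload is a stand-in; the real artifact is the `torch.save`'d int6 state dict):

```python
import lzma

# ~1 MiB stand-in payload for the serialized quantized weights.
payload = bytes(range(256)) * 4096

blob6 = lzma.compress(payload, preset=6)
blob9 = lzma.compress(payload, preset=9)

# Decompression is lossless at either preset; only the compressed
# size and compression time can differ between 6 and 9.
assert lzma.decompress(blob6) == payload
assert lzma.decompress(blob9) == payload
```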

Comment on lines +15 to +17
"mean_val_bpb": 1.0856,
"std_val_bpb": 0.013,
"blurb": "Scylla tokenizer (998 vocab TokenMonster) + AR self-gen GPTQ int6 + BigramHash 3072x112 + VRL + XSA-11 + QK-Gain 4.0 + EMA/SWA + LZMA-9. No SLOT, no TTT. Legally clean."

Copilot AI Apr 6, 2026


submission.json blurb says “LZMA-9” and “No TTT”, but the checked-in script uses lzma preset=6 and defaults TTT_ENABLED/OGD_ENABLED to enabled. Please reconcile the submission metadata with the actual code path/settings used to produce the attached logs/artifacts.

Comment on lines +1476 to +1478
log0(f"final_int6_sliding_window val_loss:{sw_val_loss:.4f} val_bpb:{sw_val_bpb:.4f} stride:{args.eval_stride} eval_time:{1000.0 * (time.perf_counter() - t_slide):.0f}ms")
log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}")

Copilot AI Apr 6, 2026


The log label final_int6_roundtrip_exact is printed again immediately after the sliding-window metrics, but it’s populated with the sliding-window loss/bpb (sw_val_loss, sw_val_bpb). This makes the logs ambiguous/misleading when parsing results; it should either be removed or renamed to match the sliding-window metric.

```python
torch.cuda.synchronize()
log0(f"final_int6_sliding_window_s64 val_loss:{sw64_val_loss:.4f} val_bpb:{sw64_val_bpb:.4f} stride:64 eval_time:{1000.0 * (time.perf_counter() - t_slide64):.0f}ms")
log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
log0(f"final_int6_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")
```

Copilot AI Apr 6, 2026


Same issue for the stride=64 block: final_int6_roundtrip_exact is logged with the stride-64 sliding-window values (sw64_val_loss, sw64_val_bpb). This should be renamed/removed so the logs don’t incorrectly imply a second “roundtrip” measurement.

Suggested change
log0(f"final_int6_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}")

Comment on lines +7 to +12
```markdown
| Seed | Sliding BPB | Artifact |
|------|------------|----------|
| 1337 | 1.1009 | 15,267,156 |
| 42 | **1.0782** | 15,813,568 |
| 2024 | **1.0777** | 15,807,116 |
| **Mean** | **1.0856** | |
```

Copilot AI Apr 6, 2026


The Markdown table in the “3-Seed Results” section uses double leading pipes (||) on each row, which doesn’t render as a standard GitHub table. Replace with single | delimiters so the table formats correctly.

@dexhunter

Hi @anthony-maio, congrats on the clean no-TTT no-SLOT submission — the AR self-gen GPTQ + BH3072 + XSA-all-11 stack is tight, and the compliance section is refreshingly direct.

Same as on PR #1289, I want to flag a concern about the Scylla tokenizer byte accounting, because it was the reason PR #1143 was closed and I don't see a fix for it in this PR. cc @NoesisGenesis who originally flagged it on PR #1143.

The concern

Your README says:

"Tokenizer byte accounting via validated metadata (candidate.meta.npz)"

If the candidate.meta.npz is the same artifact that PR #1143 used (or built with the same procedure — base_bytes[i] = len(piece.encode("utf-8")), has_leading_space all-zero, is_boundary_token all-zero), then it has a known ~4.13% byte overcount that deflates the reported BPB. I verified this on the full FineWeb val set in PR #1143:

```text
Ground truth (decode SP1024 val → count UTF-8):   151,080,633 bytes
With the all-zeros meta.npz pattern (PR #1143's): 157,319,779 bytes  (+4.13%)
With 38 capcode + 27 byte-token fixes:            151,040,811 bytes  (−0.026%)
```

Why: two classes of TokenMonster tokens need special byte handling that len(piece.encode("utf-8")) alone can't produce —

  1. Capcode modifier tokens (D/DC/DW, ~38 total) delete the leading space of the next token during decode. These need is_boundary_token = True so the (has_leading_space[tgt] & ~is_boundary[prev]) formula suppresses the extra space byte.
  2. UTF-8 byte fallback tokens (~27 at IDs 75–101) represent a single raw byte but decode to U+FFFD (3 UTF-8 bytes). These need base_bytes = 1, not 3.
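To make the suppression formula concrete, here is a minimal sketch under an assumed metadata layout (toy arrays, hypothetical token IDs; in particular it assumes `base_bytes` excludes the optional leading-space byte, which the real `candidate.meta.npz` may or may not do):

```python
import numpy as np

# Toy stand-in metadata for a 4-token vocab.
base_bytes        = np.array([4, 5, 1, 3], dtype=np.int64)  # byte-fallback token id 2 -> 1 byte, not 3
has_leading_space = np.array([False, True, False, True])
is_boundary_token = np.array([False, False, False, True])   # id 3: capcode token that deletes the next space

def token_byte_count(prev_id: int, tgt_id: int) -> int:
    # Charge tgt's leading-space byte only when the previous token
    # does not delete it during decode: has_leading_space[tgt] & ~is_boundary[prev].
    extra = int(has_leading_space[tgt_id] and not is_boundary_token[prev_id])
    return int(base_bytes[tgt_id]) + extra

def sequence_bytes(ids):
    total = int(base_bytes[ids[0]])  # first token: no previous context
    for prev, tgt in zip(ids, ids[1:]):
        total += token_byte_count(prev, tgt)
    return total

# Token 1 after boundary token 3 contributes 5 bytes, not 6 — the
# space byte that all-zero flags would overcount.
assert sequence_bytes([0, 1, 3, 1]) == 4 + 6 + 4 + 5
```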

If your metadata came from the same source as PR #1143 (simon-marcus's original candidate.meta.npz), it likely has the same issue regardless of what else you built on top.

Two side notes on the 3-seed mean

  1. You already documented seed 1337 being throttled at ~153 ms/step and getting only 3,933 steps vs 6,500+ for seeds 42/2024. For the record rule, I think the throttled seed should either be rerun on a non-throttled pod or excluded from the mean (with the mean explicitly reported as 2-seed), since including it as if it were a representative sample biases the reported score toward the worst-throttled run you had.

  2. At an uncorrected mean of 1.0856, you don't yet clear the 0.005-nat record bar vs PR #1394 ("Record: SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip — val_bpb 1.08563 (5 seed mean)"): (1.08563 − 1.0856) × ln(2) ≈ 0.00002 nats. If seed 1337 reruns clean at ~1.079, your effective 3-seed mean on a clean pod would be ~1.079, which would nominally clear #1394, but that's still conditional on the Scylla byte-accounting being correct. If the 4% correction lands, the corrected mean ends up ~1.13, not a record.

I'd really like to see a clean Scylla submission land — the tokenizer direction is valuable. I have the byte-accounting detection script from PR #1143 if you'd like to verify before rerunning. Happy to share.

@dexhunter

Quick follow-up @anthony-maio — same note as on #1289: PR #1314 by @simon-marcus is the corrected official Scylla reference, and simon-marcus' README explicitly states:

"in this folder, Scylla means the corrected official revision. The original 998-token path from PR #1143 is superseded by the artifact set here."

The corrected bundle uses a byte-native regime (capcode=0, charset=none, latin-1 decode, synthetic BOS, 1254-token vocab) and passes a strict full-val audit with source_bytes == meta_bytes == decoded_bytes == 151,080,891 and zero drift. Since your Scylla shards (anthonym21/fineweb10B-scylla on HF) use the 998-vocab that #1314 marks as superseded, I'd encourage a re-audit against simon-marcus' corrected artifacts before this PR can be treated as a record candidate.

Also restating the point I raised above about seed 1337 being throttled at 153 ms/step → 3,933 steps: even setting aside the tokenizer-accounting question, the 3-seed mean is being bid down by a pod-throttled seed. A clean rerun would be worth doing before the record claim.

@dexhunter

Hi @anthony-maio, congrats on the clean no-TTT no-SLOT submission — the AR self-gen GPTQ + BH3072 + XSA-all-11 stack is tight, and the compliance section is refreshingly direct.

Same as on PR #1289, I want to flag a concern about the Scylla tokenizer byte accounting because the shipped candidate.meta.npz looks like it's from the pre-correction tree that PR #1314 was opened to supersede. cc @NoesisGenesis @simon-marcus.

What the shipped bundle contains

Your records/track_10min_16mb/2026-04-06_Scylla_GPTQ_BH3072/candidate.meta.npz has:

  • vocab_size = 998
  • tokenizer_kind = "tokenmonster"
  • source_model_name = "/Users/simon/Code/parameter-golf/autoresearch/tokenmonster_discovery/experiments/0054/candidate.vocab"
  • has_leading_space.sum() == 0 (all False)
  • is_boundary_token.sum() == 0 (all False)
  • 5 zero-byte tokens (IDs 36, 37, 38, 151, 152)

The source_model_name path points to @simon-marcus's pre-correction tokenmonster_discovery/experiments/0054/ directory — this is the 998-vocab bundle PR #1143 used, from before the corrected path was built.

What the corrected path looks like

PR #1314 (by @simon-marcus) introduces scylla_v2_cap0_fullbyte.yaml, a 1254-vocab rebuild that addresses three structural properties of the underlying TokenMonster vocab:

  1. Capcode modifier tokens that delete the leading space of the next token during decode need is_boundary_token = True so the (has_leading_space[tgt] & ~is_boundary[prev]) formula suppresses the extra space byte. The corrected bundle has 1 is_boundary_token == True.
  2. Byte-native zero-byte tokens: the corrected bundle has 76 zero-byte tokens (vs 5 in the 998-vocab bundle), reflecting a wider byte-fallback class.
  3. Charset: TokenMonster charset:none bundles decode their raw byte strings as latin-1, not utf-8. The corrected bundle's audit is run with charset_encoding: latin-1.

I ran PR #1314's audit_tokenmonster_bundle.py --strict against the 1254-vocab bundle locally and got source_bytes == meta_bytes == decoded_bytes == 151,080,891 with meta_overcount_frac: 0.0 (FULL_VAL_AUDIT.json). That's the "byte-exact" audit PR #1314 was built to pass.
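For readers who haven't seen the audit script, a minimal sketch of the invariant the strict mode enforces (the real checks live in `audit_tokenmonster_bundle.py`; `strict_byte_audit` is a hypothetical stand-in):

```python
def strict_byte_audit(source_bytes: int, meta_bytes: int, decoded_bytes: int) -> float:
    """Return the meta overcount fraction; raise if the bundle is not byte-exact."""
    overcount = meta_bytes / source_bytes - 1.0
    if not (source_bytes == meta_bytes == decoded_bytes):
        raise AssertionError(f"byte drift: meta_overcount_frac={overcount:+.6f}")
    return overcount

# The corrected 1254-vocab bundle reportedly passes with zero drift:
assert strict_byte_audit(151_080_891, 151_080_891, 151_080_891) == 0.0
```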

The 998-vocab bundle your PR ships (with 5 zero-byte tokens and 0 boundary-token flags) has a structurally different shape and cannot be audited as equivalent to the corrected bundle by swapping fields — the vocab sizes and token semantics differ.

What I'm asking

Per the README rule on tokenizer changes ("prove with certainty that the val_bpb is correctly calculated"), could you either:
