@@ -0,0 +1,96 @@
# 11L s2048 4h on 1xA100 — 1.1104 BPB (non-record)

**Author:** Huanyi Xie (`xiehuanyi`)
**Date:** 2026-04-10
**Track:** `non_record_16mb`
**Result:** **val_bpb = 1.11044406** (int6 GPTQ + LZMA + sliding window eval stride=64)

## TL;DR

Drop-in longer-context variant of the existing `1.1147 ValCalib_GPTQ_XSA_BigramHash3072`-style stack: **11 layers, 3x MLP, LeakyReLU(0.5)^2, XSA-all, BigramHash(2048), Partial RoPE(16/64), LN Scale, SmearGate, U-Net skip, Muon+AdamW(WD=0.04), EMA(0.997), SWA, Late QAT@0.15, Int6 GPTQ with self-gen autoregressive calibration, LZMA preset=9, sliding window eval s64.**

The only changes vs. a classic ~1.13 BPB s1024 stack are:
1. `TRAIN_SEQ_LEN=2048` (longer training and eval context)
2. 4 hours of training on 1x A100 (~240 A100-min ≈ 76-80 equivalent H100-min, close to the official 80 H100-min compute budget)

## Why non-record

This submission does **not** qualify for `track_10min_16mb` because it was trained on **1x A100 for 4 hours**, not on **8x H100 for 10 min**. An A100 delivers roughly 1/3.17 of an H100's raw BF16 throughput (excluding FA3 gains), so the total compute is roughly comparable (~76–80 H100-min-equivalent vs. 80 allowed). However, the submission was never verified on actual 8xH100 hardware, so it belongs in the non-record track.
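
The equivalence above is simple arithmetic. A minimal sketch, taking the 3.17x BF16 ratio stated in this section as given (the function name is illustrative and is not part of `train_gpt.py`):

```python
# Convert single-GPU A100 wallclock into H100-minute equivalents.
# The 3.17x ratio is the raw BF16 throughput figure quoted in this README;
# it ignores FA3 and any other Hopper-specific kernel gains.
A100_TO_H100_BF16_RATIO = 3.17


def h100_minute_equivalent(a100_minutes: float,
                           ratio: float = A100_TO_H100_BF16_RATIO) -> float:
    """Return the H100-minutes matching `a100_minutes` of A100 compute."""
    return a100_minutes / ratio


run_minutes = 4 * 60       # 4 h on 1x A100
official_budget = 8 * 10   # 8x H100 for 10 min
equiv = h100_minute_equivalent(run_minutes)
print(f"{equiv:.1f} H100-min equivalent vs. {official_budget} allowed")
```

With these inputs the estimate is about 75.7 H100-min, which is the lower end of the ~76–80 range quoted above.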

FlashAttention-3 (FA3) is unavailable on Ampere, so the attention forward pass uses PyTorch SDP (`scaled_dot_product_attention`, flash backend) as a drop-in replacement via a small wrapper.

## Numbers (seed 1337)

| Metric | Value |
|---|---|
| **Int6 Sliding Window BPB** | **1.11044406** |
| Int6 Roundtrip BPB | 1.13437381 |
| Pre-quant val_bpb (post-EMA) | 1.1323 |
| Training steps | 14065 / 20000 |
| Step avg | 1023.86 ms |
| Peak memory | 16.3 GiB |
| Model params | 26,993,756 |
| Artifact bytes (int6+lzma) | 15,941,100 |
| **Total (code + artifact)** | **16,040,603** (under 16 MiB = 16,777,216) |
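
The headline row uses sliding window evaluation with stride 64. A minimal sketch of how such windows are commonly laid out, assuming each window spans up to `EVAL_SEQ_LEN` tokens and only the tokens not already covered by the previous window are scored. `sliding_windows` is a hypothetical helper; the actual logic in `train_gpt.py` may differ.

```python
def sliding_windows(total_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Return (begin, end, n_scored) tuples covering `total_tokens`.

    Each window feeds tokens[begin:end] to the model, but only the last
    `n_scored` positions contribute to the loss, so no token is counted twice.
    """
    windows = []
    prev_end = 0
    for begin in range(0, total_tokens, stride):
        end = min(begin + seq_len, total_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == total_tokens:
            break
    return windows


windows = sliding_windows(total_tokens=5000)
print(windows[0], windows[1], windows[-1])
```

Under this layout every token is scored exactly once, and all but the earliest tokens see close to a full window of left context. That is one plausible reason the sliding-window row is lower than the roundtrip row.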

Copilot AI Apr 10, 2026

The README’s artifact/total byte counts appear inconsistent with the included training log. train_seed1337.log reports Serialized model int6+lzma: 15920436 bytes and Code size: 120167 bytes (total 16040603), but this README lists Artifact bytes ... 15,941,100. Please update the README numbers to match the actual generated files (or regenerate the log/README from the same run) so readers can verify the 16 MiB constraint.
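
The cross-check requested here could be scripted roughly as follows. This is a sketch only: the byte counts are the ones quoted from `train_seed1337.log` in this comment, and `check_submission_size` is a hypothetical helper, not something in the repository.

```python
LIMIT_BYTES = 16 * 1024 * 1024  # 16 MiB = 16,777,216


def check_submission_size(artifact_bytes: int, code_bytes: int,
                          limit: int = LIMIT_BYTES) -> dict:
    """Sum artifact and code sizes and compare against the size limit."""
    total = artifact_bytes + code_bytes
    return {
        "total": total,
        "headroom": limit - total,
        "within_limit": total <= limit,
    }


# Values as quoted from the training log in the comment above.
report = check_submission_size(artifact_bytes=15_920_436, code_bytes=120_167)
print(report)
```

In practice the two inputs would come from `os.path.getsize` on the generated artifact and the training script, assuming the code size counts only the script, so that the README, the log and `submission.json` are all derived from the same files.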

## Ablation context

This result was the top performer in a 24-experiment ablation run on 1x A100 with identical infrastructure. Summary of the biggest levers:

| Change | BPB (lower = better) | Notes |
|---|---|---|
| seq_len=512, full stack, 2h (exp07 Round 2) | 1.1484 | old baseline |
| seq_len=1024, full stack, 2h (exp13 Round 2) | 1.1317 | **+context alone = -0.017** |
| seq_len=1024, full stack, 4h (exp30 Round 3) | 1.1177 | **+time alone = -0.014** |
| **seq_len=2048, full stack, 4h (exp34, this)** | **1.1104** | **+context+time = -0.021 vs exp13** |
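
The deltas in the Notes column can be re-derived from the BPB column. A small sketch using only the four values in the table (the dictionary keys are the experiment labels from the first column):

```python
# BPB per experiment, copied from the table above.
results = {
    "exp07": 1.1484,  # seq_len=512,  2h
    "exp13": 1.1317,  # seq_len=1024, 2h
    "exp30": 1.1177,  # seq_len=1024, 4h
    "exp34": 1.1104,  # seq_len=2048, 4h (this submission)
}


def delta(new: str, old: str) -> float:
    """BPB change going from `old` to `new`; negative means improvement."""
    return round(results[new] - results[old], 4)


context_2h = delta("exp13", "exp07")   # 512 -> 1024 at 2h
time_1024 = delta("exp30", "exp13")    # 2h -> 4h at seq_len=1024
both = delta("exp34", "exp13")         # 1024 -> 2048 and 2h -> 4h
context_4h = delta("exp34", "exp30")   # 1024 -> 2048 at 4h
print(context_2h, time_1024, both, context_4h)
```

The last value is not listed in the table; it isolates the 1024 -> 2048 context step at a fixed 4 h budget.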

Trick ablations that did NOT help noticeably once training time was sufficient:
- Gated Attention (neutral)
- Value Residual (neutral)
- V-Norm on V projection (neutral)
- BigramHash 3072×112 vs 2048×128 (noise)
- warmdown=4000 vs 3500 (marginal)
- 13 layers (step overhead > depth gain at this budget)
- Attention Residuals (Kimi): strongly negative (BPB worse by 0.036)
- Differential Attention: negative (BPB worse by 0.014)
- Layer Tying (22L/11B, 11L/6B): strongly negative
- MTP 2 heads: slightly negative

The key finding is that **at this model size and compute budget, simply extending context length and training longer dominates all micro-architectural tricks.**

## Reproduction

```bash
# On 1x A100 80GB:
RUN_ID=v3_exp34_s2048 \
SEED=1337 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=14400 \
TRAIN_BATCH_TOKENS=524288 \
TRAIN_SEQ_LEN=2048 \
EVAL_SEQ_LEN=2048 \
WARMDOWN_ITERS=3500 \
BIGRAM_VOCAB_SIZE=2048 \
XSA_LAST_N=11 \
ROPE_DIMS=16 \
LN_SCALE=1 \
VE_ENABLED=1 \
LATE_QAT_THRESHOLD=0.15 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
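
As a sanity check on the settings above, the step count and step time from the Numbers table can be related to the wallclock cap. A rough sketch, assuming `TRAIN_BATCH_TOKENS` is the number of tokens consumed per optimizer step:

```python
train_batch_tokens = 524_288   # TRAIN_BATCH_TOKENS
train_seq_len = 2_048          # TRAIN_SEQ_LEN
max_wallclock_seconds = 14_400 # MAX_WALLCLOCK_SECONDS
steps = 14_065                 # "Training steps" in the Numbers table
step_avg_ms = 1_023.86         # "Step avg" in the Numbers table

seqs_per_step = train_batch_tokens // train_seq_len
tokens_seen = steps * train_batch_tokens
train_seconds = steps * step_avg_ms / 1_000
tokens_per_second = train_batch_tokens / (step_avg_ms / 1_000)

print(f"{seqs_per_step} sequences per step")
print(f"{tokens_seen:,} tokens seen")
print(f"{train_seconds:.1f} s of training vs. cap of {max_wallclock_seconds} s")
print(f"{tokens_per_second:,.0f} tokens/s")
```

The product of steps and step time lands at roughly 14,400 s, which is consistent with the run stopping at the wallclock cap (14065 of 20000 steps) rather than at the step limit.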

## Files

- `train_gpt.py` — modified training script (based on `2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072` with A100/FA2/SDP fallback, deferred EMA start for short runs, optional layer tying / V-norm / diff attn / AttnRes / MTP toggles, none of which were active for this run)
- `final_model.int6.ptz` — 15.94 MB int6-quantized model (LZMA preset=9)
- `train_seed1337.log` — full training log
- `submission.json` — structured metadata
- `requirements.txt` — pip deps

## Caveats

- **Single seed (1337) only.** A proper 3-seed mean (42, 314, 999) has NOT been run yet. This makes the reported BPB noisier than the main-leaderboard records. Seeds 42 and 999 are planned.
- EMA is deferred to start at 20% of wallclock to avoid random-init contamination on shorter runs (discovered during Round 2 experiments).
- The attention backend falls back to PyTorch SDP (flash backend) because FA3 is Hopper-only; FA2 is not installed in the current venv.
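
The deferred EMA start mentioned above can be illustrated with a small sketch. It uses plain floats in place of tensors, and `DeferredEMA` is a hypothetical class written for illustration; the implementation in `train_gpt.py` may be structured differently.

```python
class DeferredEMA:
    """EMA of parameters that stays inactive until a fraction of wallclock."""

    def __init__(self, decay: float = 0.997, start_frac: float = 0.20):
        self.decay = decay
        self.start_frac = start_frac
        self.shadow = None  # populated on the first update after the start

    def update(self, params: dict, elapsed_frac: float) -> None:
        # Before the start point, skip entirely so that early,
        # near-random weights never enter the average.
        if elapsed_frac < self.start_frac:
            return
        # First active update: initialise the average from current weights.
        if self.shadow is None:
            self.shadow = dict(params)
            return
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value


ema = DeferredEMA()
ema.update({"w": 100.0}, elapsed_frac=0.10)  # ignored: before 20% of wallclock
ema.update({"w": 1.0}, elapsed_frac=0.20)    # initialises the average
ema.update({"w": 2.0}, elapsed_frac=0.50)    # 0.997 * 1.0 + 0.003 * 2.0
print(ema.shadow)
```

Initialising the average from the weights at the start point, rather than from step 0, is what keeps random-init values out of the final averaged model on short runs.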
Binary file not shown.
@@ -0,0 +1,5 @@
# A100 (Ampere) — FA3 is Hopper-only, falls back to PyTorch SDP (flash backend)
# FA2 can be installed for modest speedup: pip install flash-attn --no-build-isolation
sentencepiece
zstandard
# LZMA is in stdlib, no install needed
@@ -0,0 +1,57 @@
{
"author": "Huanyi Xie",
"github_id": "xiehuanyi",
"name": "11L s2048 4h on 1xA100 — 1.1104 BPB (non-record, 1 seed)",
"blurb": "11-layer 3xMLP LeakyReLU(0.5)^2 full stack (SmearGate + BigramHash(2048) + XSA-all + Partial RoPE(16/64) + LN Scale + Muon+AdamW(WD=0.04) + EMA(0.997, deferred start) + SWA + Late QAT@0.15 + Int6 GPTQ (self-gen AR calibration) + LZMA preset=9 + Sliding Window Eval (stride=64)) trained with TRAIN_SEQ_LEN=2048 and MAX_WALLCLOCK_SECONDS=14400 (4h) on 1x A100 80GB. The only changes from the Round-2 best were TRAIN_SEQ_LEN 1024 -> 2048 and a doubling of training time (2h -> 4h), which together moved BPB from 1.1317 to 1.1104. NOT a main-leaderboard submission: used 240 A100-minutes (~76-80 H100-minute equivalent) instead of the required 10 min on 8xH100. Single seed (1337) only.",
"date": "2026-04-10",
"track": "non_record_16mb",
"val_loss": 1.87493332,
"val_bpb": 1.11044406,
"seeds": [1337],
"seed_results": {
"1337": {
"val_loss": 1.87493332,
"val_bpb": 1.11044406,
"val_loss_pre_quant": 1.9119,
"val_bpb_pre_quant": 1.1323,
"val_loss_int6_roundtrip": 1.91534267,
"val_bpb_int6_roundtrip": 1.13437381,
"artifact_bytes": 15941100,
"total_submission_bytes": 16040603,
"steps": 14065,
"step_avg_ms": 1023.86
}
},
"artifact_bytes_max": 15941100,
"bytes_total": 16040603,
Copilot AI Apr 10, 2026

artifact_bytes / artifact_bytes_max don’t match the sizes in the included train_seed1337.log. The log reports Serialized model int6+lzma: 15920436 bytes and Total submission size ...: 16040603 bytes, implying artifact_bytes should be 15920436 (and code bytes ~120167), not 15941100. Please recompute these fields from the actual final_model.int6.ptz and script size so metadata stays self-consistent.

"train_steps_mean": 14065,
"step_avg_ms_mean": 1023.86,
"hardware": "1x NVIDIA A100 80GB SXM4 (IBEX cluster, KAUST)",
"pytorch_version": "2.8.0+cu128",
"cuda_version": "12.8",
"python_version": "3.9.18",
"attn_backend": "PyTorch SDP (flash backend; FA3 unavailable on A100)",
"max_wallclock_seconds": 14400,
"train_seq_len": 2048,
"eval_seq_len": 2048,
"train_batch_tokens": 524288,
"num_layers": 11,
"model_dim": 512,
"mlp_mult": 3.0,
"vocab_size": 1024,
"num_heads": 8,
"num_kv_heads": 4,
"xsa_last_n": 11,
"rope_dims": 16,
"ln_scale": true,
"bigram_vocab_size": 2048,
"bigram_dim": 128,
"warmdown_iters": 3500,
"late_qat_threshold": 0.15,
"muon_wd": 0.04,
"adam_wd": 0.04,
"matrix_lr": 0.025,
"scalar_lr": 0.025,
"tied_embed_lr": 0.035,
"model_params": 26993756
}