Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# Non-Record Submission: 1.20289664 BPB — Mixed-Int6 LZMA9 B3072 Warm5000

**EMA + XSA(last-4) + BigramHash3072 + warmdown5000 + LeakyReLU^2 + mixed-int6 export**

**val_bpb: 1.20289664** (sliding, seed=42) | **15,991,188 bytes** artifact | single-GPU unlimited-compute run (~16.1h)

> **This is a non-record unlimited-compute submission.** Training ran for about 16.1 hours on a single GPU, so this is not a 10-minute leaderboard attempt. The main result is a stronger legal artifact than the listed 4-hour non-record baseline, using a known flat-transformer recipe plus longer single-GPU training and a broad mixed-int6/LZMA9 export path.

## Result

| Metric | Value |
|--------|-------|
| Sliding BPB | **1.20289664** |
| Sliding val_loss | **1.99963255** |
| Pre-quant sliding BPB | **1.16618894** |
| Pre-quant sliding val_loss | **1.93861159** |
| Steps | **16,000** |
| Training time | **57,979.039s** (~16.1h) |
| Artifact | **15,991,188 bytes** |
| Code bytes | **110,016** |
| Compressed model bytes | **15,881,172** |
| Parameters | **27,124,828** |

## Positioning

This is **not** claiming a new SOTA technique. EMA, XSA, BigramHash, LeakyReLU^2, int6-style quantization, LZMA compression, and sliding evaluation are all established in prior submissions.

The useful contribution is a non-record data point showing that the flat EMA/XSA/BigramHash family still improves under longer single-GPU training, and that a broad `mlp;attn;embed` mixed-int6 export with LZMA9 can keep the resulting 27.1M-parameter checkpoint legal under the 16MB artifact cap.

| Reference | BPB | Notes |
|-----------|----:|-------|
| This submission | **1.20289664** | 16.1h single-GPU, non-record, 15,991,188 bytes |
| 4-hour non-record baseline | 1.20737944 | Listed unlimited-compute baseline |
| 1-bit non-record | 1.1239 | Stronger non-record result; this submission does not beat it |
| Current 10-minute record | 1.11473509 | Stronger 10-minute SOTA; this submission is not a record attempt |

## What Changed

- **Longer training on the flat stack:** `BIGRAM_VOCAB_SIZE=3072`, XSA on the last 4 layers, EMA, LeakyReLU^2, and `WARMDOWN_ITERS=5000` produced a raw sliding score of **1.16618894 BPB** before legal export.
- **Broad mixed-int6 export:** `QUANT_INT6_CATS=mlp;attn;embed`, `INT8_KEEP_FLOAT_MAX_NUMEL=32768`, and LZMA9 extreme compression produced a legal artifact at **15,991,188 bytes**.
- **Export separation:** the raw checkpoint was preserved, then re-exported and evaluated independently. The export found **3,894,003** candidate `+-1` int6 entries, but the compressed artifact already fit the target byte cap, so no entries had to be pruned.

## Training Configuration

- 8 FineWeb SP1024 training shards
- EMA enabled, decay `0.997`
- XSA active on the last 4 layers
- BigramHash `3072 x 128`
- `leaky_relu2` activation with slope `0.5`
- `TRAIN_BATCH_TOKENS=262144`
- `TRAIN_SEQ_LEN=2048`
- `ITERATIONS=16000`
- `WARMDOWN_ITERS=5000`
- `MAX_WALLCLOCK_SECONDS=64800`

The exact training log is included in [train.log](./train.log).

## Export Configuration

- Eval-only export from the preserved raw checkpoint
- `QUANT_SCHEME=mixed_int6`
- `QUANT_INT6_CATS=mlp;attn;embed`
- `INT8_KEEP_FLOAT_MAX_NUMEL=32768`
- `QUANT_SELECTIVE_PRUNE=1`
- `QUANT_TARGET_TOTAL_BYTES=16000000`
- `QUANT_LZMA_PRESET=9`
- `QUANT_LZMA_EXTREME=1`

The exact export/eval log is included in [export_eval.log](./export_eval.log).

## Files

- `train_gpt.py`: exact script for this submission branch
- `requirements.txt`: environment reference
- `train.log`: training log
- `export_eval.log`: export and final eval log

## Compliance

- [x] Artifact <= 16,000,000 bytes
- [x] No test-time training on validation data
- [x] No network calls during evaluation
- [x] Self-contained script included
- [x] Non-record unlimited-compute track
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
Starting run at 2026-04-07 07:12:18 UTC
Run: exp8855_mi6_mae_k32_pr_lz9e_9336
NPROC: 1
MLP activation: leaky_relu2
MLP leaky slope: 0.5
MLP prelu init: 0.25
MLP softplus beta: 1.0
TTT optimizer: sgd
TTT epoch schedule: fixed
TTT chunk schedule: fixed
TTT batch seqs: 32
logs/exp8855_mi6_mae_k32_pr_lz9e_9336.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:8
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:131072 limited=1
attention_backend:requested=sdpa flash_attn_available=0 compile_enabled=0 cuda_dtype=float16
eval_only_checkpoint:<raw_checkpoint>/long16h_ema_b3072_warm5000_8855.final_model_raw.pt
averaging_config ema_enabled=0 ema_decay=0.997 swa_enabled=0 swa_every=50 lawa_enabled=0 save_raw_checkpoint=0
activation_config mlp_activation=leaky_relu2 mlp_leaky_slope=0.5 mlp_prelu_init=0.25 mlp_softplus_beta=1.0
quant_config scheme=mixed_int6 int6_cats=mlp;attn;embed keep_float_max_numel=32768 keep_float16=none force_float16=none lzma_preset=9 extreme=1 selective_prune=1 target_total_bytes=16000000
model_params:27124828
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[7, 8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:262144 train_seq_len:2048 iterations:300 warmup_steps:0 max_wallclock_seconds:1800.000
seed:42
step:300/300 val_loss:2.0278 val_bpb:1.2163 train_time:0ms step_avg:0.00ms
peak memory allocated: 170 MiB reserved: 192 MiB
DIAGNOSTIC pre_avg val_loss:2.0278 val_bpb:1.2163 eval_time:1912ms
averaging:using raw training weights
DIAGNOSTIC post_ema val_loss:2.0278 val_bpb:1.2163 eval_time:1914ms
Serialized model: 106420662 bytes
Code size: 110016 bytes
selective_prune: 3894003 ±1 candidates unpruned=15991188 target=16000000
selective_prune: already fits target
Serialized model int6+lzma: 15881172 bytes
Total submission size int6+lzma: 15991188 bytes
final_int6_roundtrip val_loss:2.0361 val_bpb:1.2213 eval_time:1923ms
final_int6_roundtrip_exact val_loss:2.03610863 val_bpb:1.22128751
final_int6_sliding_window val_loss:1.9996 val_bpb:1.2029 stride:64 eval_time:52209ms
final_int6_sliding_window_exact val_loss:1.99963255 val_bpb:1.20289664
final_int8_zlib_roundtrip_exact val_loss:1.99963255 val_bpb:1.20289664
Finished run at 2026-04-07 07:14:15 UTC
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
numpy
tqdm
torch
huggingface-hub
kernels
setuptools
typing-extensions==4.15.0
datasets
tiktoken
sentencepiece
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
{
"author": "Ayman Abdul-Majid",
"github_id": "sabdulmajid",
"name": "Mixed-Int6 LZMA9 B3072 Warm5000",
"blurb": "Unlimited-compute non-record submission: a 16k-step EMA + XSA(last-4) + BigramHash3072 SP1024 run with longer warmdown, exported legally with mixed-int6 over mlp/attn/embed plus LZMA9 extreme compression. Pre-quant sliding BPB 1.1662, final legal artifact 1.20289664 at 15,991,188 bytes.",
"date": "2026-04-07T07:14:15Z",
"track": "non-record-unlimited-compute-16mb",
"val_loss": 1.99963255,
"val_bpb": 1.20289664,
"pre_quant_val_loss": 1.93861159,
"pre_quant_val_bpb": 1.16618894,
"step_stop": 16000,
"wallclock_seconds": 57979.039,
"bytes_total": 15991188,
"bytes_model_int6_lzma": 15881172,
"bytes_code": 110016,
"hardware": "single-GPU training and single-GPU export/eval",
"quant_scheme": "mixed_int6",
"quant_int6_cats": "mlp;attn;embed",
"quant_lzma_preset": 9,
"quant_lzma_extreme": true
}
Loading