@@ -0,0 +1,136 @@
# AR Self-Gen GPTQ + XSA-all + Per-Layer Hadamard ROTQ

**Non-record submission.**

This run uses the same evaluation metric and artifact accounting as the main leaderboard, but it is **not** a main-track record submission because it was trained on **1xH100 for 4800 seconds** rather than reproduced under the official **8xH100 SXM / 600-second** budget.

## Summary

This submission starts from the current public `11L AR Self-Gen GPTQ + XSA` stack and adds a small, modular quantization-basis change:

- keep the runtime model as standard dense linear layers after load
- choose a compact right-rotation for the large MLP matrices before GPTQ
- quantize in the rotated basis
- invert the rotation after dequantization during roundtrip evaluation
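
The steps above can be sketched as a rotate-quantize-unrotate roundtrip. This is a minimal illustration, not the submission's GPTQ path: it substitutes a naive symmetric per-tensor rounder for GPTQ, and assumes the rotation `R` is orthogonal so its inverse is just the transpose.

```python
import numpy as np

def quantize_symmetric(W, bits=6):
    """Naive symmetric per-tensor quantizer (a stand-in for GPTQ in this sketch)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6
    scale = np.abs(W).max() / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q, scale

def rotated_roundtrip(W, R):
    """Quantize in the rotated basis, then invert the rotation after dequant.

    R is assumed orthogonal (R @ R.T == I), so the inverse is R.T.
    """
    q, scale = quantize_symmetric(W @ R)  # rotate right, then quantize
    return (q * scale) @ R.T              # dequantize, then undo the rotation
```

Because the rotation is orthogonal, the quantization error introduced in the rotated basis has the same Frobenius norm after the inverse rotation; the gain comes from the rotated basis spreading outliers so the quantization grid is used more efficiently.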

The best variant here is simple:

- Hadamard right-rotation
- applied to `mlp_up` and `mlp_down`
- block size chosen per layer from `{128, 256, 512}`
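
A blockwise Walsh-Hadamard right-rotation of the kind described can be sketched as follows (a simple Sylvester-construction reference, assuming power-of-two block sizes; the run's actual implementation is in `train_gpt.py`):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, scaled to be orthogonal."""
    assert n > 0 and (n & (n - 1)) == 0, "block size must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def hadamard_right_rotate(W, block):
    """Right-rotate the columns of W in independent blocks of size `block`."""
    rows, cols = W.shape
    assert cols % block == 0, "columns must divide evenly into blocks"
    H = hadamard(block)
    return (W.reshape(rows, cols // block, block) @ H).reshape(rows, cols)
```

The normalized Sylvester matrix is symmetric and orthogonal, so the same call inverts the rotation, and the only metadata needed per tensor is the chosen block size.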

On this longer single-GPU run, that export path slightly improves the exact final sliding score over the unrotated export on the same checkpoint while staying under the 16,000,000-byte cap.

## Result

Reference values from the current upstream README at submission time:

- Main leaderboard target: `1.1147`
- Naive baseline: `1.2244`

Best result in this folder:

- Exact sliding `val_bpb`: `1.11290586`
- Exact sliding `val_loss`: `1.87908996`
- Total artifact bytes: `15,826,148`
- Hardware / duration: `1xH100`, `4800s`, seed `314`
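
As a sanity check, the reported `val_loss` (nats per token) and `val_bpb` are consistent under the usual conversion `bpb = (loss / ln 2) * tokens_per_byte`, which lets the implied bytes-per-token ratio of the validation set be recovered (assuming that standard definition; the exact accounting lives in the eval code):

```python
import math

val_loss = 1.87908996                        # mean cross-entropy, nats per token
val_bpb = 1.11290586                         # bits per byte

bits_per_token = val_loss / math.log(2)      # ~2.711 bits per token
bytes_per_token = bits_per_token / val_bpb   # implied dataset ratio
print(f"{bytes_per_token:.3f}")              # ~2.436 bytes per token
```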

Because this is a non-record run, the correct interpretation is:

- same metric as the leaderboard
- numerically better than the current top README score
- not eligible for the main track until reproduced within the official budget

## Ablation Table

All rows below use the same trained checkpoint and differ only in the export path.

| Export path | Exact sliding val_bpb | Bytes total |
|---|---:|---:|
| Base AR self-gen GPTQ export | `1.11296252` | `15,869,601` |
| `mlp_down` Hadamard `256` | `1.11291713` | `15,825,000` |
| `mlp_down` per-layer Hadamard | `1.11290938` | `15,826,480` |
| `mlp_up + mlp_down` per-layer Hadamard | `1.11290586` | `15,826,148` |

Net gain from the best rotation-aware export over the base export on the same checkpoint:

- `-0.00005666` BPB

## What Changed

Relative to the public AR self-generated GPTQ + XSA-all baseline stack, this version adds:

1. Rotation-aware GPTQ hooks for selected 2D tensors
2. Blockwise Walsh-Hadamard right-rotations before GPTQ
3. Shared or per-layer rotation selection over semantic tensor groups
4. Roundtrip reconstruction by inverting the right-rotation after dequantization

The implementation remains modular and opt-in via environment flags. The best result here came from forcing a per-layer Hadamard search on `mlp_up` and `mlp_down`.
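
The per-layer selection in step 3 can be sketched as scoring each candidate block size by a quantization-error proxy and keeping the best. This is an illustrative guess at the scoring, not the exact metric behind the `score:` values in the logs (assumptions: blockwise Hadamard rotation, a naive symmetric int6 rounder as the proxy):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, scaled to be orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_score(W, block, bits=6):
    """Proxy score: relative round-to-grid error after a blockwise right-rotation."""
    rows, cols = W.shape
    H = hadamard(block)
    rotated = (W.reshape(rows, cols // block, block) @ H).reshape(rows, cols)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(rotated).max() / qmax
    err = rotated - np.round(rotated / scale) * scale
    return np.linalg.norm(err) / np.linalg.norm(W)

def pick_block(W, candidates=(128, 256, 512)):
    """Choose the per-layer block size with the lowest proxy score."""
    return min(candidates, key=lambda b: quant_score(W, b))
```

With `ROTQ_GROUP_SHARE=0` this choice is made independently per tensor, which is what the per-layer `block:128/256/512` lines in the ablation log reflect.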

## Run Commands

Training run used for the checkpoint:

```bash
RUN_ID=ar_xsa_base_long4800_seed314 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
BIGRAM_VOCAB_SIZE=3072 \
BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 \
SEED=314 \
MAX_WALLCLOCK_SECONDS=4800 \
ROTQ_ENABLED=0 \
EVAL_FORCE_S64=1 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Best export-only ablation on the resulting checkpoint:

```bash
RUN_ID=ar_export_rotq_mlpupdown_perlayer_had_long4800_fullcal_slide64 \
INIT_MODEL_PATH=/workspace/parameter-golf/checkpoints/ar_xsa_base_long4800_seed314_final_model.pt \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
BIGRAM_VOCAB_SIZE=3072 \
BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 \
SEED=314 \
MAX_WALLCLOCK_SECONDS=0 \
EVAL_FORCE_S64=1 \
ROTQ_ENABLED=1 \
ROTQ_FORCE=1 \
ROTQ_MODE=hadamard \
ROTQ_TARGETS=mlp_down,mlp_up \
ROTQ_BLOCK_SIZES=128,256,512 \
ROTQ_GROUP_SHARE=0 \
ROTQ_SCORE_TENSORS=6 \
ROTQ_SCORE_ROWS=256 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Compliance Notes

- No validation leakage during training
- GPTQ calibration data is autoregressively self-generated by the trained model
- No network calls are required during evaluation
- Artifact remains under the decimal `16,000,000` byte cap
- This is explicitly a **non-record** submission because compute exceeds the main-track budget

## Included Files

- `train_gpt.py`: exact script used for the best export result in this folder
- `train.log`: recovered copy of the auto-generated log for the long training run
- `rotq_ablation.log`: recovered copy of the export-only ablation log
- `submission.json`: metadata for this non-record submission

## Limitations

- Single-seed evidence only
- Not reproduced on `8xH100 SXM` within the `600s` training budget
- The pod stopped before the raw log sync completed, so the attached logs are recovered copies of the exact auto-generated log content captured from the pod session
- The ROTQ gain is real but small; the main value of this PR is the new modular export idea and the evidence that it composes with the current strong AR-GPTQ/XSA stack
@@ -0,0 +1,3 @@
sentencepiece
zstandard
flash_attn_3
@@ -0,0 +1,93 @@
starting ar_export_rotq_mlpdown_had256_long4800_fullcal_slide64
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_kernel:flashattn3
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:0 warmup_steps:0 max_wallclock_seconds:0.000
seed:314
step:0/0 val_loss:1.9122 val_bpb:1.1325 train_time:0ms step_avg:0.02ms
peak memory allocated: 1002 MiB reserved: 1242 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9122 val_bpb:1.1325 eval_time:16429ms
Serialized model: 106289590 bytes
Code size: 119252 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 199.6s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
rotq:enabled groups:1 active:11 meta_bytes:264
rotq:mlp_down:1536 group:mlp_down mode:hadamard block:256 score:1.564798e-01
selective_prune: 4217740 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15705748 bytes
Rotation metadata bytes (logical): 264
Total submission size int6+lzma: 15825000 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16283ms
final_int6_roundtrip_exact val_loss:1.91901076 val_bpb:1.13654627
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:595377ms
final_int6_sliding_window_exact val_loss:1.87910898 val_bpb:1.11291713
final_int8_zlib_roundtrip_exact val_loss:1.87910898 val_bpb:1.11291713
starting ar_export_rotq_mlpdown_perlayer_had_long4800_fullcal_slide64
DIAGNOSTIC post_ema val_loss:1.9122 val_bpb:1.1325 eval_time:16381ms
Serialized model: 106289590 bytes
Code size: 119252 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 198.9s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
rotq:enabled groups:11 active:11 meta_bytes:264
rotq:blocks.0.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.569647e-01
rotq:blocks.1.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.561793e-01
rotq:blocks.2.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.568002e-01
rotq:blocks.3.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.558599e-01
rotq:blocks.4.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.558491e-01
rotq:blocks.5.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.564744e-01
rotq:blocks.6.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.560896e-01
rotq:blocks.7.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.559215e-01
rotq:blocks.8.mlp.proj.weight group:mlp_down mode:hadamard block:256 score:1.554852e-01
rotq:blocks.9.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.563086e-01
rotq:blocks.10.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.571935e-01
selective_prune: 4216026 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15707228 bytes
Rotation metadata bytes (logical): 264
Total submission size int6+lzma: 15826480 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16266ms
final_int6_roundtrip_exact val_loss:1.91899953 val_bpb:1.13653961
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:592103ms
final_int6_sliding_window_exact val_loss:1.87909590 val_bpb:1.11290938
final_int8_zlib_roundtrip_exact val_loss:1.87909590 val_bpb:1.11290938
starting ar_export_rotq_mlpupdown_perlayer_had_long4800_fullcal_slide64
rotq:blocks.2.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.543067e-01
rotq:blocks.3.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.542510e-01
rotq:blocks.4.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.528345e-01
rotq:blocks.5.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.565601e-01
rotq:blocks.6.mlp.fc.weight group:mlp_up mode:hadamard block:256 score:1.545500e-01
rotq:blocks.7.mlp.fc.weight group:mlp_up mode:hadamard block:256 score:1.552076e-01
rotq:blocks.8.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.547106e-01
rotq:blocks.9.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.537057e-01
rotq:blocks.10.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.545073e-01
rotq:blocks.0.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.569647e-01
rotq:blocks.1.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.561793e-01
rotq:blocks.2.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.568002e-01
rotq:blocks.3.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.558599e-01
rotq:blocks.4.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.558491e-01
rotq:blocks.5.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.564744e-01
rotq:blocks.6.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.560896e-01
rotq:blocks.7.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.559215e-01
rotq:blocks.8.mlp.proj.weight group:mlp_down mode:hadamard block:256 score:1.554852e-01
rotq:blocks.9.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.563086e-01
rotq:blocks.10.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.571935e-01
selective_prune: 4208001 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15706896 bytes
Rotation metadata bytes (logical): 506
Total submission size int6+lzma: 15826148 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16322ms
final_int6_roundtrip_exact val_loss:1.91899332 val_bpb:1.13653594
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:596342ms
final_int6_sliding_window_exact val_loss:1.87908996 val_bpb:1.11290586
final_int8_zlib_roundtrip_exact val_loss:1.87908996 val_bpb:1.11290586
@@ -0,0 +1,12 @@
{
"track": "non_record_16mb",
"date": "2026-04-01",
"name": "AR Self-Gen GPTQ + XSA-all + Per-Layer Hadamard ROTQ",
"author": "Arash",
"github_id": "vermissa0ss",
"val_bpb": 1.11290586,
"val_loss": 1.87908996,
"bytes_total": 15826148,
"hardware": "1xH100 80GB",
"blurb": "Non-record submission. Starts from the public AR self-generated GPTQ + XSA-all + BigramHash stack and adds rotation-aware GPTQ with per-layer Hadamard right-rotations on MLP matrices. Best export on a 1xH100/4800s run reached 1.11290586 sliding bpb at 15,826,148 bytes."
}
@@ -0,0 +1,80 @@
model_params:27067484
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_kernel:flashattn3
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:314
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9271 val_bpb:4.1026 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9279 train_time:652ms step_avg:652.22ms
step:2/20000 train_loss:8.6225 train_time:1243ms step_avg:621.68ms
step:3/20000 train_loss:7.6539 train_time:1904ms step_avg:634.55ms
step:4/20000 train_loss:7.3319 train_time:2548ms step_avg:636.98ms
step:5/20000 train_loss:7.0682 train_time:3200ms step_avg:639.99ms
step:6/20000 train_loss:7.0074 train_time:3847ms step_avg:641.21ms
step:7/20000 train_loss:6.9721 train_time:4493ms step_avg:641.83ms
step:8/20000 train_loss:6.7644 train_time:5146ms step_avg:643.20ms
step:9/20000 train_loss:6.4777 train_time:5793ms step_avg:643.68ms
step:10/20000 train_loss:6.1204 train_time:6441ms step_avg:644.09ms
step:500/20000 train_loss:2.3192 train_time:325737ms step_avg:651.47ms
step:1000/20000 train_loss:2.2598 train_time:651864ms step_avg:651.86ms
step:1500/20000 train_loss:2.1287 train_time:978338ms step_avg:652.23ms
step:2000/20000 train_loss:2.0470 train_time:1305273ms step_avg:652.64ms
step:2500/20000 train_loss:2.0920 train_time:1632135ms step_avg:652.85ms
step:3000/20000 train_loss:2.0749 train_time:1959198ms step_avg:653.07ms
step:3500/20000 train_loss:2.0674 train_time:2286291ms step_avg:653.23ms
step:4000/20000 train_loss:2.1350 train_time:2613343ms step_avg:653.34ms
step:4000/20000 val_loss:2.0439 val_bpb:1.2105 train_time:2613392ms step_avg:653.35ms
step:4500/20000 train_loss:2.1269 train_time:2940377ms step_avg:653.42ms
step:5000/20000 train_loss:2.0362 train_time:3267311ms step_avg:653.46ms
step:5500/20000 train_loss:2.0406 train_time:3594311ms step_avg:653.51ms
step:6000/20000 train_loss:1.9402 train_time:3921190ms step_avg:653.53ms
step:6500/20000 train_loss:2.0445 train_time:4247900ms step_avg:653.52ms
swa:start step:6550
late_qat:enabled step:6744 scale:0.1498
step:7000/20000 train_loss:1.8408 train_time:4576477ms step_avg:653.78ms
step:7341/20000 val_loss:1.9137 val_bpb:1.1334 train_time:4800390ms step_avg:653.91ms
stopping_early: wallclock_cap train_time:4800390ms step:7341/20000
peak memory allocated: 23002 MiB reserved: 23184 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9121 val_bpb:1.1325 eval_time:16505ms
Serialized model: 106289590 bytes
Code size: 118057 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 204.7s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
selective_prune: 4217763 ±1 candidates, unpruned=15.13MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15751544 bytes
Rotation metadata bytes (logical): 0
Total submission size int6+lzma: 15869601 bytes
final_int6_roundtrip val_loss:1.9191 val_bpb:1.1366 eval_time:20641ms
final_int6_roundtrip_exact val_loss:1.91911329 val_bpb:1.13660699
final_int6_sliding_window val_loss:1.8792 val_bpb:1.1130 stride:64 eval_time:591478ms
final_int6_sliding_window_exact val_loss:1.87918563 val_bpb:1.11296252
final_int8_zlib_roundtrip_exact val_loss:1.87918563 val_bpb:1.11296252