
Proposal: Validate ASQU on the March 22 10min/16MB control line #1247

Open
fahmitech wants to merge 2 commits into openai:main from fahmitech:pr-asqu-transfer

Conversation

@fahmitech

Summary

This PR adds a differentiated candidate record folder that transfers ASQU onto the strongest clean March 22 track_10min_16mb control line in this fork.

This is not a record claim. It is a public proposal for a stage-gated validation campaign under the real 8xH100, 600s, <=16 MB constraint.

The base stack is the March 22 11-layer FA3 family:

  • 11L, 512d, GQA
  • XSA on the last 4 layers
  • Partial RoPE + LN scale + VE128
  • EMA + tight SWA
  • late QAT
  • GPTQ-lite int6 export

The architectural delta in this PR is intentionally small:

  • control: relu^2
  • candidate: ASQU(x) = x^2 if x > 0 else beta_i * x^2

where beta_i is learned per hidden channel with a dedicated low learning rate.
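As a concrete reference, the two activations can be sketched scalar-wise as follows (the function names here are illustrative; the PR's actual implementation is vectorized, with beta learned per hidden channel):

```python
def relu2(x: float) -> float:
    """Control activation: relu(x)^2. Negative inputs are flattened to zero."""
    return max(x, 0.0) ** 2

def asqu(x: float, beta: float) -> float:
    """Candidate activation: x^2 on the positive side, beta * x^2 on the
    negative side, so negative inputs still carry a (scaled) signal."""
    return x * x if x > 0.0 else beta * x * x
```

The two agree exactly for x > 0; the only delta is the learned negative branch.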

Relevant Prior Art And Credit

  • Architecture ancestor: PR #374
  • March 22 control-line family / record baseline: PR #429
  • ASQU prior art with controlled non-record evidence: PR #1035
  • Earlier ASQU-related exploration: PR #679

ASQU is therefore not new to the repo. What this PR contributes is a controlled transfer of ASQU onto a stronger, cleaner March 22 control line that already has real track_10min_16mb evidence.

What This PR Adds

  • a new candidate folder:
    • records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite
  • train_gpt.py support for:
    • MLP_ACTIVATION=asqu
    • learned per-channel beta parameters
    • ASQU_BETA_INIT
    • ASQU_LR
  • a dedicated optimizer group for ASQU beta parameters
  • export handling that keeps the ASQU beta tensors out of the quantization path so they are not degraded by int6 rounding
  • a seed-1337 debug log:
    • train_seed1337.log
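The dedicated optimizer group amounts to a simple partition over named parameters; a minimal sketch is below (the `asqu_beta` name filter and the learning-rate values are assumptions for illustration, not necessarily what train_gpt.py uses):

```python
def build_param_groups(named_params, base_lr: float, asqu_lr: float):
    """Split parameters so ASQU beta tensors get their own low learning rate."""
    beta_params, other_params = [], []
    for name, param in named_params:
        if "asqu_beta" in name:
            beta_params.append(param)
        else:
            other_params.append(param)
    # These dicts have the shape accepted by a torch optimizer constructor.
    return [
        {"params": other_params, "lr": base_lr},
        {"params": beta_params, "lr": asqu_lr},
    ]
```

Keeping the beta group separate means ASQU_LR can be tuned without touching the base schedule.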

This keeps the test surface narrow:

  • one architectural delta
  • no change to dataset or tokenizer
  • no change to legality profile
  • no change to artifact format

Why This Transfer Matters

The core question is not whether ASQU can work in isolation. Prior non-record evidence already suggests it can.

The open question is narrower and more useful:

Does ASQU transfer onto the March 22 control line without breaking runtime, artifact size, or quantized quality in the real record regime?

That question is worth testing because:

  • the March 22 base already has a hard external anchor in the target regime
  • the change surface is small enough to interpret cleanly
  • this is a lower-entropy bet than broad architecture rewrites

Preliminary Evidence

I ran a matched local debug regime on 1 GPU, 2000 iterations, same seed, same tokenizer, same dataset, same base stack.

For clarity, the 2000-step regime here means:

  • ITERATIONS=2000
  • MAX_WALLCLOCK_SECONDS=0
  • TRAIN_LOG_EVERY=100
  • VAL_LOSS_EVERY=0

The current code already supports this directly via environment variables and writes logs to:

  • logs/<RUN_ID>.txt
  • train_seed<SEED>.log
  • train.log when SEED=1337
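The log-path convention above can be summarized as a small helper (a sketch of the convention, not the script's actual code; the function name is hypothetical):

```python
def expected_log_paths(run_id: str, seed: int):
    """Log files the training run produces, per the convention described above."""
    paths = [f"logs/{run_id}.txt", f"train_seed{seed}.log"]
    if seed == 1337:
        # The canonical debug seed additionally writes train.log.
        paths.append("train.log")
    return paths
```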

Matched Debug Comparison

March 22-style control (relu^2):

  • step_avg: 647.44 ms
  • step:2000 val_bpb: 1.2260
  • post_ema val_bpb: 1.2312
  • final_int6_roundtrip_exact val_bpb: 1.30048843
  • total submission size: 11,764,467 bytes

ASQU candidate:

  • step_avg: 680.15 ms
  • step:2000 val_bpb: 1.2208
  • post_ema val_bpb: 1.2254
  • final_int6_roundtrip_exact val_bpb: 1.30236542
  • total submission size: 11,762,459 bytes

Observed delta:

  • pre-quantized improvement:
    • about -0.0052 BPB at step:2000
    • about -0.0058 BPB at post_ema
  • runtime cost:
    • about +5.1%
  • size:
    • effectively unchanged
  • quantized endpoint:
    • about +0.0019 BPB worse on final_int6_roundtrip_exact
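The deltas above follow directly from the raw numbers in the two runs; recomputing them:

```python
# Raw debug metrics copied from the matched comparison above.
control = {"step_ms": 647.44, "val": 1.2260, "ema": 1.2312, "int6": 1.30048843}
asqu    = {"step_ms": 680.15, "val": 1.2208, "ema": 1.2254, "int6": 1.30236542}

val_delta   = round(asqu["val"] - control["val"], 4)    # -0.0052 BPB at step:2000
ema_delta   = round(asqu["ema"] - control["ema"], 4)    # -0.0058 BPB post_ema
runtime_pct = round(100 * (asqu["step_ms"] / control["step_ms"] - 1), 1)  # +5.1%
int6_delta  = round(asqu["int6"] - control["int6"], 4)  # +0.0019 BPB worse quantized
```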

Interpretation

This is a mixed result, not a submission-ready result.

What the local debug run suggests:

  • ASQU helps the float / pre-quantized model in this matched regime
  • ASQU does not yet help the decisive quantized endpoint
  • the runtime penalty is small enough that the candidate is still plausible in the true regime

That is exactly why this is a good compute-grant question:

  • the code path is stable
  • the architectural delta is measurable
  • the key uncertainty is now sharply defined:
    • is the quantization regression a wrong-regime artifact of 1 GPU / 2000 steps,
    • or does ASQU genuinely fail to survive the March 22 export path?

Only real 8xH100, 600s runs can resolve that cleanly.

Requested Compute

Request:

  • up to $1k in compute credits as a ceiling, not a spend target
  • stage-gated release against explicit milestones
  • stop immediately when a kill condition is hit

Why a $1k ceiling is rational:

  • one run is not enough to separate transfer signal from noise
  • a small number of real record-regime runs is enough to answer the actual question
  • this is a bounded validation campaign, not an open-ended search

Proposed staged use of credits:

  1. Stage 1: control reproduction
    Re-run the March 22 control in the fork-local code path on target hardware to verify runtime, artifact size, and metric alignment.
  2. Stage 2: ASQU A/B
    Run the ASQU candidate under the exact same 8xH100, 600s conditions with seed 1337.
  3. Stage 3: confirmation only if positive
    If ASQU is still directionally better on the decisive quantized metric, spend the remaining budget on:
    • one extra seed
    • optionally one narrow sweep around ASQU_BETA_INIT or ASQU_LR
  4. Stage 4: final record-quality attempt only if still positive
    Use any remaining budget for one clean record-track attempt.

Success / Kill Criteria

GO:

  • the run stays within the 600s training budget
  • artifact remains <16,000,000 bytes
  • runtime remains close enough to control to stay viable
  • ASQU improves a decisive quantized metric, ideally final_int6_roundtrip_exact

KILL:

  • runtime penalty is too large for the 10-minute regime
  • artifact breaks the size cap
  • quantized quality does not improve
  • the pre-quant gain disappears under the real regime

If any kill condition is hit, no further credits are spent.

Why This Is A Good Use Of Compute

This request is capital-efficient for five reasons:

  • the base is already proven
  • the architectural delta is tiny
  • the local signal is real but mixed, which is exactly the kind of uncertainty that merits targeted validation
  • the experiment is easy to interpret
  • the stop conditions are explicit

If ASQU transfers, it becomes a differentiated record-track candidate on top of a strong clean base.

If ASQU fails, the result still closes an underexplored branch cleanly and cheaply, which is better than spending the same credits on broader unbounded exploration.

Reproducibility

ASQU candidate folder:

  • records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite

Target 8xH100 command:

cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=asqu_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=asqu \
ASQU_BETA_INIT=0.25 \
ASQU_LR=0.001 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Matched local control debug command:

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=ab_relu2_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=relu2 \
ITERATIONS=2000 \
TRAIN_LOG_EVERY=100 \
VAL_LOSS_EVERY=0 \
MAX_WALLCLOCK_SECONDS=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Matched local ASQU debug command:

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=ab_asqu_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=asqu \
ASQU_BETA_INIT=0.25 \
ASQU_LR=0.001 \
ITERATIONS=2000 \
TRAIN_LOG_EVERY=100 \
VAL_LOSS_EVERY=0 \
MAX_WALLCLOCK_SECONDS=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

@andrewmouldon

Glad you are planning to test this with the established meta practices!

One thing that stands out to me, though, is the throughput gap. My runs were done on a single 5090, and I observed a 0.5% slowdown without learned beta and a 2.5% slowdown with learned beta (I verified this recently, though I can't remember for sure whether the runs I reported cast beta or whether I only added that afterwards).

I wonder if the relative slowdown is due to differences in the base code, the GPU, or maybe even the PyTorch version? Either way, 5% overhead is quite harsh for a timed leaderboard, given that, at least in my results, ASQU did not give a massive improvement over leaky relu squared.

One thing that might also be worth trying is computing the beta grad only every k microsteps. I wish throughput weren't such a large factor here, but it is what it is.

It is also interesting that it falls off harder post-quant. All of my work has been on the base script with int8 quant, so I can't speak much to that.

Hope your request gets approved; curious to see how it turns out.
