Proposal: Validate ASQU on the March 22 10min/16MB control line #1247

fahmitech wants to merge 2 commits into openai:main
Conversation
Glad you are planning to test this with the established meta practices! One thing that stands out to me is the throughput gap, though. My runs were done on 1x 5090, and I observed a 0.5% slowdown when I didn't optimize beta, and a 2.5% slowdown with learned beta (I just recently verified this; I forget whether I remembered to cast beta in the runs I reported, I might have done that afterwards but I can't remember for sure). I wonder if the relative slowdown is due to differences in the base code, GPU, or maybe even PyTorch version? But 5% overhead is quite harsh for a timed leaderboard, given that, at least in my results, ASQU did not give a massive improvement over leaky relu squared. One thing that might be worth trying is computing the beta grad only every k microsteps. I wish throughput wasn't such a large factor in this, but it is what it is. It is also interesting that it falls off harder post-quant. All of my work has been on the base script with int8 quant, so I can't speak much on that. Hope your request gets approved, curious how it turns out.
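The "compute beta grad only every k microsteps" idea from the comment above could be prototyped by toggling `requires_grad` on the beta parameter. This is a minimal sketch under assumptions: the helper name, and beta being a plain `nn.Parameter`, are hypothetical and not part of this PR.

```python
import torch


def set_beta_grad_schedule(beta: torch.nn.Parameter, microstep: int, k: int) -> None:
    """Let gradients flow into beta only on every k-th microstep.

    On the other microsteps beta is treated as a constant, so autograd
    skips its backward work and beta.grad only accumulates on the
    enabled steps.
    """
    beta.requires_grad_(microstep % k == 0)
```

A gradient-accumulation loop would call this once per microstep before the forward pass; beta's optimizer step itself would be unchanged (its accumulated grad is simply smaller and cheaper to produce).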
Summary
This PR adds a differentiated candidate record folder that transfers ASQU onto the strongest clean March 22 `track_10min_16mb` control line in this fork.

This is not a record claim. It is a public proposal for a stage-gated validation campaign under the real `8xH100, 600s, <=16 MB` constraint.

The base stack is the March 22 11-layer FA3 family:
The architectural delta in this PR is intentionally small: the MLP activation changes from `relu^2` to

`ASQU(x) = x^2 if x > 0 else beta_i * x^2`

where `beta_i` is learned per hidden channel with a dedicated low learning rate.

Relevant Prior Art And Credit
ASQU is therefore not new to the repo. What this PR contributes is a controlled transfer of ASQU onto a stronger, cleaner March 22 control line that already has real `track_10min_16mb` evidence.

What This PR Adds

- `records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite`
- `train_gpt.py` support for:
  - `MLP_ACTIVATION=asqu`
  - learned `beta` parameters
  - `ASQU_BETA_INIT`
  - `ASQU_LR`
- `train_seed1337.log`

This keeps the test surface narrow:
Why This Transfer Matters
The core question is not whether ASQU can work in isolation. Prior non-record evidence already suggests it can.
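For concreteness, the ASQU activation described above can be sketched in PyTorch. This is a minimal sketch, not the fork's actual implementation; the class name, the `d_hidden` argument, and laying beta out along the last dimension are assumptions.

```python
import torch
import torch.nn as nn


class ASQU(nn.Module):
    """Asymmetric squared unit: x^2 for positive inputs, beta_i * x^2 otherwise.

    beta is learned per hidden channel (the last dimension of the input).
    """

    def __init__(self, d_hidden: int, beta_init: float = 0.25):
        super().__init__()
        self.beta = nn.Parameter(torch.full((d_hidden,), beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sq = x * x
        # ASQU(x) = x^2 if x > 0 else beta_i * x^2
        return torch.where(x > 0, sq, self.beta * sq)
```

To match the "dedicated low learning rate" in the proposal, `beta` would go in its own optimizer parameter group driven by `ASQU_LR`, separate from the main MLP weights.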
The open question is narrower and more useful:
That question is worth testing because:
Preliminary Evidence
I ran a matched local debug regime on `1 GPU`, `2000` iterations, same seed, same tokenizer, same dataset, same base stack.

For clarity, the `2000-step regime` here means:

- `ITERATIONS=2000`
- `MAX_WALLCLOCK_SECONDS=0`
- `TRAIN_LOG_EVERY=100`
- `VAL_LOSS_EVERY=0`

The current code already supports this directly via environment variables and writes logs to:

- `logs/<RUN_ID>.txt`
- `train_seed<SEED>.log`
- `train.log` when `SEED=1337`

Matched Debug Comparison
March 22-style control (`relu^2`):

- `step_avg: 647.44 ms`
- `step:2000 val_bpb: 1.2260`
- `post_ema val_bpb: 1.2312`
- `final_int6_roundtrip_exact val_bpb: 1.30048843`
- total submission size: `11,764,467` bytes

ASQU candidate:

- `step_avg: 680.15 ms`
- `step:2000 val_bpb: 1.2208`
- `post_ema val_bpb: 1.2254`
- `final_int6_roundtrip_exact val_bpb: 1.30236542`
- total submission size: `11,762,459` bytes

Observed delta:

- `-0.0052 BPB` at `step:2000`
- `-0.0058 BPB` at `post_ema`
- `+5.1%` step time
- `+0.0019 BPB` worse on `final_int6_roundtrip_exact`

Interpretation
This is a mixed result, not a submission-ready result.
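As a sanity check, the observed deltas can be re-derived from the raw numbers reported above (a throwaway check script, not part of the PR):

```python
# Raw metrics from the matched debug comparison above.
control = {"step_avg_ms": 647.44, "step2000_bpb": 1.2260,
           "post_ema_bpb": 1.2312, "int6_bpb": 1.30048843}
asqu = {"step_avg_ms": 680.15, "step2000_bpb": 1.2208,
        "post_ema_bpb": 1.2254, "int6_bpb": 1.30236542}

# BPB deltas (negative = ASQU better), step-time overhead in percent.
delta_step2000 = round(asqu["step2000_bpb"] - control["step2000_bpb"], 4)      # -0.0052
delta_post_ema = round(asqu["post_ema_bpb"] - control["post_ema_bpb"], 4)      # -0.0058
overhead_pct = round(100 * (asqu["step_avg_ms"] / control["step_avg_ms"] - 1), 1)  # 5.1
delta_int6 = round(asqu["int6_bpb"] - control["int6_bpb"], 4)                  # +0.0019
```

The numbers line up with the summary: ASQU is ahead on the float metrics, behind on step time and on the decisive quantized metric.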
What the local debug run suggests:
That is exactly why this is a good compute-grant question:
The debug evidence above comes from `1 GPU / 2000 steps`. Only real `8xH100, 600s` runs can resolve that cleanly.

Requested Compute
Request: `$1k` in compute credits as a ceiling, not a spend target.

Why a `$1k` ceiling is rational:

Proposed staged use of credits:
1. Re-run the March 22 control in the fork-local code path on target hardware to verify runtime, artifact size, and metric alignment.
2. Run the ASQU candidate under the exact same `8xH100, 600s` conditions with seed `1337`.
3. If ASQU is still directionally better on the decisive quantized metric, spend the remaining budget on variations of `ASQU_BETA_INIT` or `ASQU_LR`.
4. Use any remaining budget for one clean record-track attempt.
Success / Kill Criteria

GO: the ASQU run stays within the `600s` training budget and the `<16,000,000` bytes limit, and beats the control on `final_int6_roundtrip_exact`.

KILL:

Why This Is A Good Use Of Compute
This request is capital-efficient for five reasons:
If ASQU transfers, it becomes a differentiated record-track candidate on top of a strong clean base.
If ASQU fails, the result still closes an underexplored branch cleanly and cheaply, which is better than spending the same credits on broader unbounded exploration.
Reproducibility
ASQU candidate folder:
`records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite`

Target `8xH100` command:

```
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite
OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=asqu_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=asqu \
ASQU_BETA_INIT=0.25 \
ASQU_LR=0.001 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Matched local control debug command:
Matched local ASQU debug command: