
Proposal: Validate ASQU on the March 22 10min/16MB control line #1247

Open
fahmitech wants to merge 2 commits into openai:main from fahmitech:pr-asqu-transfer

Conversation

@fahmitech

Summary

This PR adds a differentiated candidate record folder that transfers ASQU onto the strongest clean March 22 track_10min_16mb control line in this fork.

This is not a record claim. It is a public proposal for a stage-gated validation campaign under the real 8xH100, 600s, <=16 MB constraint.

The base stack is the March 22 11-layer FA3 family:

  • 11L, 512d, GQA
  • XSA on the last 4 layers
  • Partial RoPE + LN scale + VE128
  • EMA + tight SWA
  • late QAT
  • GPTQ-lite int6 export

The architectural delta in this PR is intentionally small:

  • control: relu^2
  • candidate: ASQU(x) = x^2 if x > 0 else beta_i * x^2

where beta_i is learned per hidden channel with a dedicated low learning rate.
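As a concrete reference, the two activations can be sketched scalar-wise as follows (the function names here are illustrative; the PR's actual implementation is vectorized, with beta learned per hidden channel):

```python
def relu2(x: float) -> float:
    """Control activation: relu(x)^2. Negative inputs are flattened to zero."""
    return max(x, 0.0) ** 2

def asqu(x: float, beta: float) -> float:
    """Candidate activation: x^2 on the positive side, beta * x^2 on the
    negative side, so negative inputs still carry a (scaled) signal."""
    return x * x if x > 0.0 else beta * x * x
```

The two agree exactly for x > 0; the only delta is the learned negative branch.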

Relevant Prior Art And Credit

  • Architecture ancestor: PR #374
  • March 22 control-line family / record baseline: PR #429
  • ASQU prior art with controlled non-record evidence: PR #1035
  • Earlier ASQU-related exploration: PR #679

ASQU is therefore not new to the repo. What this PR contributes is a controlled transfer of ASQU onto a stronger, cleaner March 22 control line that already has real track_10min_16mb evidence.

What This PR Adds

  • a new candidate folder:
    • records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite
  • train_gpt.py support for:
    • MLP_ACTIVATION=asqu
    • learned per-channel beta parameters
    • ASQU_BETA_INIT
    • ASQU_LR
  • a dedicated optimizer group for ASQU beta parameters
  • export handling that keeps the ASQU beta tensors out of the quantization path so they are not degraded by int6 rounding
  • a seed-1337 debug log:
    • train_seed1337.log
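The dedicated optimizer group amounts to a simple partition over named parameters; a minimal sketch is below (the `asqu_beta` name filter and the learning-rate values are assumptions for illustration, not necessarily what train_gpt.py uses):

```python
def build_param_groups(named_params, base_lr: float, asqu_lr: float):
    """Split parameters so ASQU beta tensors get their own low learning rate."""
    beta_params, other_params = [], []
    for name, param in named_params:
        if "asqu_beta" in name:
            beta_params.append(param)
        else:
            other_params.append(param)
    # These dicts have the shape accepted by a torch optimizer constructor.
    return [
        {"params": other_params, "lr": base_lr},
        {"params": beta_params, "lr": asqu_lr},
    ]
```

Keeping the beta group separate means ASQU_LR can be tuned without touching the base schedule.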

This keeps the test surface narrow:

  • one architectural delta
  • no change to dataset or tokenizer
  • no change to legality profile
  • no change to artifact format

Why This Transfer Matters

The core question is not whether ASQU can work in isolation. Prior non-record evidence already suggests it can.

The open question is narrower and more useful:

Does ASQU transfer onto the March 22 control line without breaking runtime, artifact size, or quantized quality in the real record regime?

That question is worth testing because:

  • the March 22 base already has a hard external anchor in the target regime
  • the change surface is small enough to interpret cleanly
  • this is a lower-entropy bet than broad architecture rewrites

Preliminary Evidence

I ran a matched local debug regime on 1 GPU, 2000 iterations, same seed, same tokenizer, same dataset, same base stack.

For clarity, the 2000-step regime here means:

  • ITERATIONS=2000
  • MAX_WALLCLOCK_SECONDS=0
  • TRAIN_LOG_EVERY=100
  • VAL_LOSS_EVERY=0

The current code already supports this directly via environment variables and writes logs to:

  • logs/<RUN_ID>.txt
  • train_seed<SEED>.log
  • train.log when SEED=1337
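The log-path convention above can be summarized as a small helper (a sketch of the convention, not the script's actual code; the function name is hypothetical):

```python
def expected_log_paths(run_id: str, seed: int):
    """Log files the training run produces, per the convention described above."""
    paths = [f"logs/{run_id}.txt", f"train_seed{seed}.log"]
    if seed == 1337:
        # The canonical debug seed additionally writes train.log.
        paths.append("train.log")
    return paths
```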

Matched Debug Comparison

March 22-style control (relu^2):

  • step_avg: 647.44 ms
  • step:2000 val_bpb: 1.2260
  • post_ema val_bpb: 1.2312
  • final_int6_roundtrip_exact val_bpb: 1.30048843
  • total submission size: 11,764,467 bytes

ASQU candidate:

  • step_avg: 680.15 ms
  • step:2000 val_bpb: 1.2208
  • post_ema val_bpb: 1.2254
  • final_int6_roundtrip_exact val_bpb: 1.30236542
  • total submission size: 11,762,459 bytes

Observed delta:

  • pre-quantized improvement:
    • about -0.0052 BPB at step:2000
    • about -0.0058 BPB at post_ema
  • runtime cost:
    • about +5.1%
  • size:
    • effectively unchanged
  • quantized endpoint:
    • about +0.0019 BPB worse on final_int6_roundtrip_exact
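The deltas above follow directly from the raw numbers in the two runs; recomputing them:

```python
# Raw debug metrics copied from the matched comparison above.
control = {"step_ms": 647.44, "val": 1.2260, "ema": 1.2312, "int6": 1.30048843}
asqu    = {"step_ms": 680.15, "val": 1.2208, "ema": 1.2254, "int6": 1.30236542}

val_delta   = round(asqu["val"] - control["val"], 4)    # -0.0052 BPB at step:2000
ema_delta   = round(asqu["ema"] - control["ema"], 4)    # -0.0058 BPB post_ema
runtime_pct = round(100 * (asqu["step_ms"] / control["step_ms"] - 1), 1)  # +5.1%
int6_delta  = round(asqu["int6"] - control["int6"], 4)  # +0.0019 BPB worse quantized
```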

Interpretation

This is a mixed result, not a submission-ready result.

What the local debug run suggests:

  • ASQU helps the float / pre-quantized model in this matched regime
  • ASQU does not yet help the decisive quantized endpoint
  • the runtime penalty is small enough that the candidate is still plausible in the true regime

That is exactly why this is a good compute-grant question:

  • the code path is stable
  • the architectural delta is measurable
  • the key uncertainty is now sharply defined:
    • is the quantization regression a wrong-regime artifact of 1 GPU / 2000 steps,
    • or does ASQU genuinely fail to survive the March 22 export path?

Only real 8xH100, 600s runs can resolve that cleanly.

Requested Compute

Request:

  • up to $1k in compute credits as a ceiling, not a spend target
  • stage-gated release against explicit milestones
  • stop immediately when a kill condition is hit

Why a $1k ceiling is rational:

  • one run is not enough to separate transfer signal from noise
  • a small number of real record-regime runs is enough to answer the actual question
  • this is a bounded validation campaign, not an open-ended search

Proposed staged use of credits:

  1. Stage 1: control reproduction
    Re-run the March 22 control in the fork-local code path on target hardware to verify runtime, artifact size, and metric alignment.
  2. Stage 2: ASQU A/B
    Run the ASQU candidate under the exact same 8xH100, 600s conditions with seed 1337.
  3. Stage 3: confirmation only if positive
    If ASQU is still directionally better on the decisive quantized metric, spend the remaining budget on:
    • one extra seed
    • optionally one narrow sweep around ASQU_BETA_INIT or ASQU_LR
  4. Stage 4: final record-quality attempt only if still positive
    Use any remaining budget for one clean record-track attempt.

Success / Kill Criteria

GO:

  • the run stays within the 600s training budget
  • artifact remains <16,000,000 bytes
  • runtime remains close enough to control to stay viable
  • ASQU improves a decisive quantized metric, ideally final_int6_roundtrip_exact

KILL:

  • runtime penalty is too large for the 10-minute regime
  • artifact breaks the size cap
  • quantized quality does not improve
  • the pre-quant gain disappears under the real regime

If any kill condition is hit, no further credits are spent.

Why This Is A Good Use Of Compute

This request is capital-efficient for five reasons:

  • the base is already proven
  • the architectural delta is tiny
  • the local signal is real but mixed, which is exactly the kind of uncertainty that merits targeted validation
  • the experiment is easy to interpret
  • the stop conditions are explicit

If ASQU transfers, it becomes a differentiated record-track candidate on top of a strong clean base.

If ASQU fails, the result still closes an underexplored branch cleanly and cheaply, which is better than spending the same credits on broader unbounded exploration.

Reproducibility

ASQU candidate folder:

  • records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite

Target 8xH100 command:

cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-02_11L_XSA4_PartialRoPE_LNScale_VE128_ASQU_EMA_GPTQlite

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=asqu_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=asqu \
ASQU_BETA_INIT=0.25 \
ASQU_LR=0.001 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Matched local control debug command:

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=ab_relu2_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=relu2 \
ITERATIONS=2000 \
TRAIN_LOG_EVERY=100 \
VAL_LOSS_EVERY=0 \
MAX_WALLCLOCK_SECONDS=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

Matched local ASQU debug command:

OMP_NUM_THREADS=1 \
PYTHONUNBUFFERED=1 \
RUN_ID=ab_asqu_seed1337 \
SEED=1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MLP_ACTIVATION=asqu \
ASQU_BETA_INIT=0.25 \
ASQU_LR=0.001 \
ITERATIONS=2000 \
TRAIN_LOG_EVERY=100 \
VAL_LOSS_EVERY=0 \
MAX_WALLCLOCK_SECONDS=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

@andrewmouldon

Glad you are planning to test this with the established meta practices!

One thing that stands out to me, though, is the throughput gap. My runs were done on a single 5090, and I observed a 0.5% slowdown without learned beta and a 2.5% slowdown with learned beta (I verified this recently, though I can't remember for sure whether the runs I reported cast beta or whether I only added that afterwards).

I wonder if the relative slowdown is due to differences in the base code, the GPU, or maybe even the PyTorch version? Either way, 5% overhead is quite harsh for a timed leaderboard, given that, at least in my results, ASQU did not give a massive improvement over leaky relu squared.

One thing that might also be worth trying is computing the beta grad only every k microsteps. I wish throughput weren't such a large factor here, but it is what it is.

It is also interesting that it falls off harder post-quant. All of my work has been on the base script with int8 quant, so I can't speak much to that.

Hope your request gets approved; curious to see how it turns out.
