Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)#1412
val bpb: 1.08354 (3-seed mean, std=0.00050)
Not a record. This is a small 3-seed experiment over PR #1394 on my runs. The seed count is too small, and the BPB reduction too modest, to support any statistical claim. Posting because the changes are zero-cost, reproducible, and may be useful to others trying out different techniques.
Changes
Three zero-cost modifications on top of PR #1394; none adds extra parameters or bytes:
1. Parallel Residuals (Layers 7+)
GPT-J style parallel attention+MLP (Wang & Komatsuzaki, 2021) for the last 4 layers. Both attention and MLP read from the same input and their outputs are added in parallel:
I expected parallel residuals to reduce interference between attention and MLP during GPTQ calibration. Pre-quant BPB barely moved, but the quantization gap tightened across all 3 seeds, which made this the most useful change in practice.
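The parallel layout above can be sketched in a few lines. This is a minimal numpy illustration of the residual wiring only, with toy stand-ins for attention and MLP; the weight matrices, the omitted layer norms, and the function names are all illustrative, not the PR's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width

# Toy stand-ins for the attention and MLP sublayers (illustrative weights).
W_attn = rng.standard_normal((d, d)) * 0.02
W_mlp = rng.standard_normal((d, d)) * 0.02

def attn(x):
    return x @ W_attn

def mlp(x):
    return np.maximum(x @ W_mlp, 0.0)

def sequential_block(x):
    # Standard residual order: the MLP sees the attention output.
    h = x + attn(x)
    return h + mlp(h)

def parallel_block(x):
    # GPT-J style: attention and MLP both read the same input x,
    # and their outputs are summed into a single residual update.
    return x + attn(x) + mlp(x)

x = rng.standard_normal((2, d))
y_seq = sequential_block(x)
y_par = parallel_block(x)
```

The two layouts compute different functions in general; the point of the parallel form here is that both sublayers are calibrated against the same input distribution during GPTQ.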
2. Hessian-Aware SDClip
I used GPTQ's existing Hessian diagonal as a cheap importance signal to slightly modulate SDClip thresholds by row:
where $\sigma_i$ is the standard deviation of row $i$ and $r_i$ is the row importance derived from Hessian-weighted magnitude. The effect is small but directionally useful at $\lambda = 0.175$; higher $\lambda$ hurt compression. I initially used $\lambda = 0.30$ but found $\lambda = 0.175$ consistently better across seeds — both lower BPB and a smaller artifact. Higher $\lambda$ reduces rounding error but increases entropy, which makes Brotli compression less effective.
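Since the threshold formula itself isn't reproduced here, the sketch below shows one plausible form consistent with the description: baseline threshold $\sigma_i$, nudged by a normalized importance $r_i$ scaled by $\lambda$. The function name and the exact modulation $\sigma_i(1 + \lambda(r_i - 1))$ are assumptions, not the PR's code.

```python
import numpy as np

def hessian_aware_thresholds(W, H_diag, lam=0.175):
    """Per-row clip thresholds modulated by Hessian-based row importance.

    W: (rows, cols) weight matrix; H_diag: (cols,) GPTQ Hessian diagonal.
    sigma_i is the row std; r_i is Hessian-weighted magnitude normalized
    to mean 1, so lam=0 recovers the plain SDClip threshold. The exact
    modulation used in the PR is not shown, so this form is an assumption.
    """
    sigma = W.std(axis=1)                        # sigma_i: per-row std
    imp = (np.abs(W) * H_diag[None, :]).sum(1)   # Hessian-weighted magnitude
    r = imp / imp.mean()                         # r_i: normalized importance
    return sigma * (1.0 + lam * (r - 1.0))       # hypothetical modulation

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))
H = rng.random(32) + 0.1   # stand-in for GPTQ's Hessian diagonal
tau = hessian_aware_thresholds(W, H)
```

Rows whose weights sit on high-curvature input dimensions get slightly looser thresholds; unimportant rows get clipped slightly harder, which is where the entropy saving comes from.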
3. Progressive Recurrence
Depth recurrence split into two phases: first loop enabled at 50% of training, second at 65%. The split points were not optimized — 50% matches the original and 65% was a single manual choice. Enabling both loops at once causes a sharper loss spike; splitting gives the model time to adapt to each additional pass before adding the next.
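The schedule amounts to a step-fraction gate. A minimal sketch, assuming the 50%/65% split points from above; the function name and signature are illustrative, not the training loop's actual interface.

```python
def loops_enabled(step, total_steps, splits=(0.50, 0.65)):
    """Number of extra recurrence passes active at a given training step.

    The first loop turns on at 50% of training and the second at 65%
    (the PR's manual split points). Passed as `splits` so other
    schedules can be tried.
    """
    frac = step / total_steps
    return sum(frac >= s for s in splits)
```

Staggering the two enable points means each loss spike is absorbed separately instead of compounding at a single step.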
Hessian Analysis (Cross-Seed)
Hessian diagnostics from 3 seeds, 67 matrices each:
Importance hierarchy: early blocks (≈30× the trace of late blocks) >> loop >> mid >> late. Per-row importance is too noisy to be a reliable signal, but group-level traces are very stable across seeds, which suggests per-group clip allocation could be a useful direction.
Future Directions
Several ideas I'd like to explore with more compute time:
Run Command
Requirements
Flash Attention 3 (Hopper) required. SP8192 BPE tokenizer trained on FineWeb 10B (sentencepiece BPE, 8192 vocab).
```
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
pip install -r requirements.txt
```
Compliance (Track A — Fixed Predictor)
Credits
Learned from and inspired by PR #1394 (@clarkkev) — SDClip, depth recurrence, and GPTQ embedding quantization ideas. Parallel residuals from GPT-J (Wang & Komatsuzaki, 2021). Additional credits: PR #1204 (@msisovic, depth recurrence), PR #1217 (@bigbag, MuonEq-R), PR #1019 (@abaybektursun, previous SOTA).