Stable Growing Recurrence: Progressive Depth + Error Feedback (non-record)#1
nestamidavaine wants to merge 7 commits into main from
added 5 commits
April 1, 2026 19:58
…n-record) 3-seed mean val_bpb: 1.1163 (std 0.0013), -0.0031 vs PR openai#549 LeakyReLU baseline. Progressive depth recurrence (1->2->3 passes) with error feedback + jacobian proxy stabilization. Late growth preserves fast step times, avoiding the step/capacity trade-off. Code minified with python-minifier to fit all seeds under 16MB.
…minifier Document the torch.compile graph precompilation trick (cycling through pass/QAT variants during warmup to avoid compilation stalls under the 600s wallclock cap) and the python-minifier approach for fitting under the 16MB submission limit.
Include seeds array, seed_results with per-seed val_loss/val_bpb/bytes, plus pre_quant_val_bpb, step_stop, wallclock_seconds, eval_time_seconds, and bytes_code fields matching the standard submission format.
…and submission.json - Rename folder from RecurrentSOTA_Feedback to Stable_Growing_Recurrance - Add per-seed post-TTT results to submission.json (legal_ttt_exact values) - Add Tricks section to README: graph precompilation warmup, python-minifier - Note in README that logs report pre-minification code size - Fix author to nestamidavaine - Update .gitignore exception for new folder name
Non-Record: Recurrent Depth with Progressive Pass Growth + Error Feedback
val_bpb: 1.1163 (3-seed mean, std 0.0013) | ~15.96 MB | 8×H100 SXM
A non-record submission targeting a significant improvement over PR #549 (LeakyReLU² baseline, 1.1194 mean bpb). Achieves -0.0031 bpb vs that baseline. For an in-depth analysis of depth recurrence in this competition, see PR #363. I targeted #549 when I started building this solution; by the time I finished evaluation, a newer improved model had been published to the leaderboard. However, I believe the techniques here can be applied to any model to improve performance, with the largest benefit for submissions using TTT, since the recurrence makes very effective use of the 10 available minutes of evaluation time.
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
We significantly beat the PR #549 LeakyReLU² baseline (1.1194 mean bpb / 1.8901 nats) by -0.0031 bpb / -0.0053 nats across all three seeds (1.1163 mean bpb / 1.8848 nats), achieving the goal we set out with.
Progressive Recurrence Architecture
The Problem: Depth Recurrence Fails Under Competition Constraints
PR #363 demonstrated that depth recurrence — reusing a shared block of transformer layers multiple times — saves parameters but hurts bpb under the 10-minute / 16MB competition constraints. Their controlled experiments showed a +0.025 bpb gap (looped worse) due to two compounding taxes: slower training steps (fewer gradient updates fit within the wallclock budget) and unstable multi-pass dynamics (hidden states and gradients grow across passes).
Our Solution: Late Growth + Contractive Stabilization
We address both taxes by growing recurrence depth progressively during training and stabilizing the recurrent dynamics.
Progressive Pass Schedule (Late Growth)
The key insight: start training with 1 pass and gradually add passes late in training. This preserves fast step times for the majority of training (83.5ms/step at 1-pass vs ~95ms at 3-pass), maximizing the total number of gradient updates within the 600s wallclock budget. The schedule: 1 pass until step 4500, 2 passes from step 4500, and 3 passes from step 5500 through the end of training.
This reduces the step/capacity trade-off that normally makes recurrence impractical under competition constraints. We get ~6,330 training steps (vs ~7,180 for the flat LeakyReLU baseline), but the final model has 17 effective layers at eval vs the baseline's 11.
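The schedule above can be sketched as a tiny step-to-passes lookup. This is an illustration, not the submission's code: the function name is made up, and the step thresholds (4500, 5500) are taken from the compilation-stall steps mentioned in the Tricks section.

```python
# Illustrative sketch of the progressive pass schedule (1 -> 2 -> 3 passes).
# Thresholds match the steps named in the Tricks section; the helper name
# and exact boundary handling are assumptions.
def num_passes_at(step: int) -> int:
    """Return how many recurrence passes to run at a given training step."""
    if step < 4500:
        return 1   # fast 1-pass steps for most of training (~83.5 ms/step)
    elif step < 5500:
        return 2   # first growth phase
    else:
        return 3   # final depth: 3 passes through the shared block
```

Late growth means roughly 4,500 of the ~6,330 total steps run at the cheap 1-pass cost.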
We also tested training with 4 recurrence passes. While 4-pass shows better per-step loss, the additional step time cost (~105ms/step) means fewer total steps within the wallclock budget. Under the competition's 600s constraint, 3-pass wins the step/capacity trade-off: the extra training steps from the faster 3-pass schedule outweigh the marginal per-step quality gain from 4 passes.
Learnable Residual Scaling
Per-pass learnable scalars contract the residual update, preventing hidden state magnitude growth across passes:

$$h_{k+1} = h_k + \alpha_k \, f_\theta(h_k)$$

where $f_\theta$ is the shared block and $\alpha_k$ is initialized to 0.5 and learned during training. This ensures the recurrent dynamics are contractive — later passes refine rather than amplify.
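A toy numeric sketch of why a sub-unit residual scale tames magnitude growth (plain Python floats stand in for tensors; the block `f` and all values here are made up for illustration):

```python
# Toy sketch of per-pass residual scaling, h_{k+1} = h_k + alpha_k * f(h_k).
# Smaller alpha keeps the per-pass update contractive relative to alpha=1.
def recurrent_forward(h, block, alphas):
    """Apply the shared block once per pass, scaling each residual update."""
    for alpha in alphas:                     # e.g. alpha_k initialized to 0.5
        h = [hi + alpha * ui for hi, ui in zip(h, block(h))]
    return h

def magnitude(h):
    return sum(x * x for x in h) ** 0.5
```

With an identity "block" `f(h) = h`, three passes at alpha = 0.5 grow the state by 1.5³ ≈ 3.4×, while alpha = 1.0 grows it by 2³ = 8×: the learned scalars directly control how fast repeated passes can amplify the hidden state.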
Error Feedback Module
A low-rank correction compensates for accumulated error before each recurrence pass:

$$h_k \leftarrow h_k + \left(U V^\top + \mathrm{diag}(d)\right) e_k$$

where $U, V \in \mathbb{R}^{d_{\text{model}} \times r}$ with rank $r=2$, $d \in \mathbb{R}^{d_{\text{model}}}$ is a learnable diagonal, and $e_k$ is the residual error from the previous pass. The correction is zero on pass 0 (no prior error to correct) and active on subsequent passes. Total parameter overhead: 2,560 params (negligible vs 26.7M model params).
The feedback module is important but not strictly required — we confirmed that stable training is possible without it, and even running eval-only without feedback works, at a cost of ~0.001 bpb higher. The feedback module's main contribution is providing the recurrent passes with an error signal about the previous iteration's residual.
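A dependency-free sketch of a low-rank-plus-diagonal feedback correction of this shape. The combination `U V^T e + d * e` and the choice `d_model = 512` are assumptions, not the submission's code; notably, that width reproduces the stated 2,560-parameter overhead (2·512·2 for U and V, plus 512 for the diagonal).

```python
# Hedged sketch of a rank-2 + diagonal error-feedback correction:
#   correction(e) = U @ (V^T @ e) + d * e
# where e is the previous pass's residual error. d_model=512 is an assumption
# that happens to match the README's 2,560-parameter overhead.
import random

D_MODEL, RANK = 512, 2

def make_feedback_params(d=D_MODEL, r=RANK, seed=0):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]
    V = [[rng.gauss(0, 0.02) for _ in range(r)] for _ in range(d)]
    diag = [0.0] * d                       # diagonal starts as a no-op
    return U, V, diag

def feedback(e, U, V, diag):
    """Apply (U V^T + diag(d)) to the error vector e."""
    r = len(U[0])
    vte = [sum(V[i][j] * e[i] for i in range(len(e))) for j in range(r)]
    return [sum(U[i][j] * vte[j] for j in range(r)) + diag[i] * e[i]
            for i in range(len(e))]

def param_count(U, V, diag):
    return sum(len(row) for row in U) + sum(len(row) for row in V) + len(diag)
```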
Jacobian Proxy Loss (Stabilizer)
A regularization term penalizes hidden state growth ratios above 1.0, enforcing contractive dynamics without computing the full Jacobian:

$$\mathcal{L}_{\text{jac}} = \lambda \, \max\!\left(0,\; \frac{\lVert h_{k+1} - h_k \rVert}{\lVert h_k - h_{k-1} \rVert} - 1\right)$$

with $\lambda = 0.01$. This is a cheap finite-difference proxy for the spectral norm of the Jacobian $\partial h_{k+1}/\partial h_k$, encouraging it to stay below 1 (a contractive map). The model learns to adhere to this quickly, and it does not seem to affect early training dynamics. However, we saw better results with $\lambda = 0.01$ than with 0.1, possibly because 0.1 is too restrictive: with only 3× recurrence we don't need strictly contractive layers everywhere, we just need the dynamics not to explode.
This loss term is critical for training stability. Without it, gradient norms and hidden state magnitudes explode during the multi-pass phases, destabilizing training. The proxy loss keeps the recurrent dynamics well-behaved without the computational cost of full Jacobian computation.
Note: the Jacobian proxy loss is only added to the training loss — it does not affect evaluation scoring, which uses pure cross-entropy.
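A minimal sketch of such a finite-difference contraction penalty, assuming a hinge on the pass-to-pass growth ratio (the exact loss form in the submission may differ):

```python
# Sketch of a finite-difference contraction penalty: penalize the growth
# ratio ||h_{k+1} - h_k|| / ||h_k - h_{k-1}|| whenever it exceeds 1.
# The hinge form and epsilon handling are assumptions.
def norm(v):
    return sum(x * x for x in v) ** 0.5

def jacobian_proxy_loss(states, lam=0.01, eps=1e-8):
    """states: list of hidden states [h_0, h_1, ..., h_K], one per pass."""
    loss = 0.0
    for k in range(1, len(states) - 1):
        prev = [b - a for a, b in zip(states[k - 1], states[k])]
        nxt = [b - a for a, b in zip(states[k], states[k + 1])]
        ratio = norm(nxt) / (norm(prev) + eps)
        loss += lam * max(0.0, ratio - 1.0)   # only penalize expansion
    return loss
```

Shrinking updates (a contractive trajectory) incur zero penalty, while updates that grow across passes are charged in proportion to how far the growth ratio exceeds 1.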
Legal TTT Protocol
Score-first legal TTT following PR #461:
`torch.inference_mode()` — no gradients, no weight mutation

Timing Budget
Architecture
Built on the PR #414 stack with PR #399 Parallel Muon:
Run Command
Key flags:
```
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  --feedback-mode diagonal --feedback-rank 2 \
  --residual-scale-init 0.5 \
  --jacobian-proxy-weight 0.01 \
  --no-interpass-rmsnorm
```

Tricks
Graph Precompilation Warmup
`torch.compile` is lazy — it only compiles a new graph variant the first time it's encountered. With progressive recurrence (1→2→3 passes) and late QAT, the training loop would hit compilation stalls at step 4500 (2-pass), step 5500 (3-pass), and again when QAT enables. Under a 600s wallclock cap, these stalls are expensive.

The fix: precompile all graph variants during warmup before training starts. During the 20 warmup steps:
- cycle through each `num_passes` variant (2-pass, 3-pass), each with QAT toggled on
- this forces `torch.compile` to eagerly compile every forward/backward graph that will appear during training

This ensures the training loop runs at full speed from step 0, with no compilation jitter when passes change or QAT kicks in.
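The warmup idea can be shown schematically with a dict standing in for `torch.compile`'s per-graph cache (no actual torch here; the variant list and names are illustrative):

```python
# Schematic of precompilation warmup. Each (num_passes, qat) combination
# corresponds to a distinct compiled graph; visiting every variant once
# during warmup means no variant compiles mid-training.
compiled_graphs = {}

def run_step(num_passes, qat):
    key = (num_passes, qat)
    if key not in compiled_graphs:         # "compilation stall" on first sight
        compiled_graphs[key] = f"graph<{num_passes}p,qat={qat}>"
    return compiled_graphs[key]

def warmup(warmup_steps=20):
    # Cycle through every variant that will appear later in training.
    variants = [(1, False), (2, False), (3, False),
                (1, True), (2, True), (3, True)]
    for step in range(warmup_steps):
        num_passes, qat = variants[step % len(variants)]
        run_step(num_passes, qat)
```

After `warmup()` runs, any later `run_step(...)` call hits the cache, which is the schematic analogue of the training loop running without compilation jitter.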
Code Minification with python-minifier
The original training script was 88,253 bytes, which caused seed 2025 to exceed the 16MB submission limit (16,025,625 bytes). After removing dead code paths (eval-only mode, int8 quantization, unused feedback variants, verbose logging), the file was still too large.
python-minifier with `--no-rename-locals` shrinks the code aggressively (whitespace, docstrings, constant folding) while preserving local variable names — critical because the training script uses string-based lookups for `state_dict` keys and `named_parameters`. This brought the file from 68,435 bytes down to 58,186 bytes, comfortably fitting all seeds under the 16MB decimal limit.

Note: The code was minified after all three seed runs completed, so the log files report `Code size: 88253 bytes` and correspondingly larger `Total submission size` values. The actual submission uses the minified 58,186-byte script — the correct per-seed totals are listed in `submission.json` and the results table above.

Credits