
RQZ-Golf v1: Depth recurrence for parameter efficiency#54

Closed
TheCause wants to merge 1 commit into openai:main from TheCause:rqz-golf-v1

Conversation

@TheCause

Non-record experimental submission

Approach: Replace some unique layers with a single shared recurrent layer applied K times, saving parameters while increasing effective depth.

Architecture

  • 7 unique layers (encoder/decoder with U-Net skip connections)
  • 1 recurrent layer applied K=3 times with learned iteration embeddings
  • Effective depth: 10 layers (7 unique + 3 recurrent) vs baseline 9
  • Residual scaling by 1/sqrt(K) for stability

Key ideas

  1. Depth recurrence: sharing one layer's weights across K passes saves roughly 20% of parameters
  2. Iteration embeddings: a learned per-pass vector (psi_k) lets the shared layer condition on which pass it is running
  3. Test-time compute: raise K at inference (K' > K) to trade extra compute for lower BPB without changing model size
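The three ideas above can be sketched together in a few lines. This is a minimal numpy illustration, not the PR's actual code: the shared layer is reduced to a single tanh-linear map, psi_k embeddings are random stand-ins for learned vectors, and cycling psi_k when K' exceeds K at test time is an assumption about how extra passes would be handled.

```python
# Minimal sketch of depth recurrence with iteration embeddings and
# 1/sqrt(K) residual scaling. shared_layer, psi, d_model, and the
# cyclic psi reuse are all illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, K = 16, 3

# One shared weight matrix reused on every pass (vs. K unique layers).
W_shared = rng.normal(scale=0.02, size=(d_model, d_model))
# Per-pass iteration embeddings psi_k (random stand-ins for learned ones).
psi = rng.normal(scale=0.02, size=(K, d_model))

def shared_layer(x, k):
    """One pass of the shared layer, conditioned on pass index k."""
    return np.tanh((x + psi[k]) @ W_shared)

def recur(x, n_passes):
    """Apply the shared layer n_passes times with scaled residuals.

    Scaling each residual by 1/sqrt(n_passes) keeps the variance of the
    summed updates roughly constant as the number of passes grows.
    """
    scale = 1.0 / np.sqrt(n_passes)
    for k in range(n_passes):
        # Reuse psi cyclically if n_passes exceeds K at test time.
        x = x + scale * shared_layer(x, k % K)
    return x

x = rng.normal(size=(d_model,))
y_train = recur(x, K)   # training configuration: K = 3 passes
y_test = recur(x, 5)    # test-time compute: K' = 5 > K, same weights
```

Note that `y_test` uses more effective depth than `y_train` with an identical parameter count, which is the whole point of idea 3.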

Status

  • Preliminary baseline: 1.5283 BPB (1 shard, 1xA100)
  • RQZ-Golf architecture implemented, not yet benchmarked on full dataset
  • Requesting compute credits for full evaluation

Theoretical basis

Inspired by Universal Transformers (Dehghani et al., 2019) and Deep Equilibrium Models (Bai et al., 2019).

Non-record experimental submission.
Architecture: 7 unique layers + 1 shared recurrent layer (K=3 passes)
with iteration embeddings and 1/sqrt(K) scaling.
Test-time compute: increase K at inference without changing model size.

TheCause commented Apr 7, 2026

Closing this PR. Our depth recurrence findings are independently confirmed and superseded by PR #363 (merged), which documented the same core result (+0.025 BPB degradation) with 35 runs across 8xH100/2xH100/consumer GPUs.

Our additional experiments (E2: 15 runs positional ablation, E3: layer diagnostics on 1x RTX 3090) corroborate that recurrence degrades performance when used as a standalone technique. Meanwhile, PRs #1204 (1.1063 BPB) and #1392 (1.1020 BPB) demonstrate that recurrence does work within a complete stack (11L + MLP3x + XSA + EMA + GPTQ + parallel residuals), confirming the stack-dependency hypothesis.

Full analysis documented internally. Thanks to @evangelinehelsinki for the thorough work on #363.

TheCause closed this on Apr 7, 2026