Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B#1389

Open
Rome-1 wants to merge 1 commit into openai:main from Rome-1:submission/stackb-gptqlite-depth-recurrence

Conversation

@Rome-1 Rome-1 commented Apr 5, 2026

Summary

Non-record submission with three quantization contributions on the Stack B foundation (PR #1218 / #1260 lineage):

1. GPTQ-lite Scale Clamp Bug Fix

The original GPTQ-lite computes scale = (row_clip / clip_range).clamp_min(1/clip_range). For int6 (clip_range=31), this floors the scale at 1/31 ≈ 0.032 — but typical weight row maxima are O(0.01–0.05), so the clamp fires on most rows and wastes ~90% of quantization dynamic range. Fix: clamp_min(1e-7) (same as the int8 path). One-line fix, affects every int6-quantized tensor.
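The effect is easy to see numerically. Below is a minimal pure-Python sketch of the scale computation (function and variable names are illustrative, not the PR's actual code); `fixed=False` reproduces the buggy clamp:

```python
INT6_CLIP = 31  # symmetric int6: quantized integers in [-31, 31]

def row_scale(row_max: float, clip_range: int = INT6_CLIP,
              fixed: bool = True) -> float:
    """Per-row quantization scale; `fixed=False` reproduces the bug."""
    floor = 1e-7 if fixed else 1.0 / clip_range  # buggy floor: 1/31 ~= 0.032
    return max(row_max / clip_range, floor)

# A typical row maximum of 0.02 sits below the buggy floor, so the clamp
# fires and the row uses only ~1 of the 31 available quantization levels:
row_max = 0.02
print(round(row_max / row_scale(row_max, fixed=False)))  # 1  (range wasted)
print(round(row_max / row_scale(row_max, fixed=True)))   # 31 (full range)
```

With the fix, the scale tracks the actual row maximum and the full integer range is used.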

2. 6-bit Packing

Pack 4 int6 values into 3 bytes instead of storing in int8. 25% payload reduction for all int6 tensors, directly helps fit under 16MB. ~10 lines each for pack/unpack, negligible eval overhead.
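A sketch of the bit layout, assuming the int6 codes are stored offset-binary in [0, 63] before packing (the PR's exact layout is not shown; names are illustrative):

```python
def pack_int6(codes: list[int]) -> bytes:
    """Pack 4 six-bit codes (each in [0, 63]) into 3 bytes."""
    assert len(codes) % 4 == 0  # pad to a multiple of 4 beforehand
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append((a << 2) | (b >> 4))           # aaaaaabb
        out.append(((b & 0x0F) << 4) | (c >> 2))  # bbbbcccc
        out.append(((c & 0x03) << 6) | d)         # ccdddddd
    return bytes(out)

def unpack_int6(data: bytes) -> list[int]:
    """Inverse of pack_int6."""
    codes = []
    for i in range(0, len(data), 3):
        x, y, z = data[i:i + 3]
        codes += [x >> 2,
                  ((x & 0x03) << 4) | (y >> 4),
                  ((y & 0x0F) << 2) | (z >> 6),
                  z & 0x3F]
    return codes

vals = [0, 1, 62, 63]
assert unpack_int6(pack_int6(vals)) == vals
assert len(pack_int6(vals)) == 3  # 4 bytes -> 3 bytes: 25% smaller
```

Since unpacking is a handful of shifts and masks per group, the eval-time overhead is negligible, as the PR claims.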

3. Forced Int8 for Depth-Recurrence Shared Layers

When layers share weights (depth recurrence), quantization error compounds through each reuse. Solution: force int8 (clip_range=127) for shared layers while keeping int6 for non-shared layers. Small byte cost, meaningful quality gain.
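A hypothetical per-layer bit-width selector, using the shared MLP layers (4, 5) from the architecture section below; the helper name and layer-index convention are assumptions:

```python
SHARED_MLP_LAYERS = frozenset({4, 5})  # depth-recurrent (weight-shared) layers

def bits_for_layer(layer_idx: int) -> int:
    # Quantization error is re-applied on every reuse of a shared layer,
    # so shared layers get int8; all other layers stay int6.
    return 8 if layer_idx in SHARED_MLP_LAYERS else 6

print([bits_for_layer(i) for i in range(11)])
# [6, 6, 6, 6, 8, 8, 6, 6, 6, 6, 6]
```

Only 2 of 11 layers pay the extra 2 bits per weight, which is why the byte cost stays small.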

Architecture

  • 11L / 512d / 4096 vocab / 4x MLP / GQA (8h/4kv)
  • LeakyReLU(0.5)², partial RoPE (16/64), LN scale, XSA-all, EMA, MuonEq-R
  • Depth recurrence (shared MLP layers 4,5)
  • Mixed int6/int8 GPTQ-lite with scale fix + 6-bit packing + zstd-22

Status

Non-record — validated on 1xH100 NVL (717 steps, undertrained). No 8xH100 run yet. Submitting for the quantization findings, which may help others.

Credits

Stack B: PR #1218, #1260. MuonEq-R/depth recurrence: @signalrush. XSA: @abaybektursun. LeakyReLU²: @parinzee, @sofiabod. GPTQ-lite baseline: PR #374, #1019.

…th recurrence

Three quantization contributions on the Stack B foundation:

1. GPTQ-lite scale clamp bug fix: original clamp_min(1/clip_range) wastes
   ~90% of int6 dynamic range when weight magnitudes are small. Fix: clamp_min(1e-7).

2. 6-bit packing: pack 4 int6 values into 3 bytes (25% payload reduction).

3. Forced int8 for depth-recurrence shared layers: quantization error
   amplifies through weight reuse, so shared MLP layers get int8 while
   non-shared layers keep int6.

Non-record submission — validated on 1xH100 NVL (717 steps, undertrained).
No 8xH100 run yet.