Non-record: GPTQ-lite Scale Clamp Fix + 6-bit Packing + Depth Recurrence on Stack B#1389
Open
Rome-1 wants to merge 1 commit into openai:main from
Conversation
…th recurrence

Three quantization contributions on the Stack B foundation:
1. GPTQ-lite scale clamp bug fix: the original clamp_min(1/clip_range) wastes ~90% of int6 dynamic range when weight magnitudes are small. Fix: clamp_min(1e-7).
2. 6-bit packing: pack 4 int6 values into 3 bytes (25% payload reduction).
3. Forced int8 for depth-recurrence shared layers: quantization error amplifies through weight reuse, so shared MLP layers get int8 while non-shared layers keep int6.

Non-record submission: validated on 1xH100 NVL (717 steps, undertrained). No 8xH100 run yet.
Summary
Non-record submission with three quantization contributions on the Stack B foundation (PR #1218 / #1260 lineage):
1. GPTQ-lite Scale Clamp Bug Fix
The original GPTQ-lite computes scale = (row_clip / clip_range).clamp_min(1/clip_range). For int6 (clip_range=31), this floors the scale at 1/31 ≈ 0.032, but typical weight row maxima are O(0.01–0.05), so the clamp fires on most rows and wastes ~90% of the quantization dynamic range. Fix: clamp_min(1e-7), the same floor the int8 path uses. One-line fix; affects every int6-quantized tensor.
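The effect of the fix can be sketched in plain Python (the PR's code is torch; the function names below are illustrative stand-ins for the one-line change to clamp_min):

```python
CLIP_RANGE = 31  # int6: symmetric integer levels in [-31, 31]

def int6_scale_buggy(row):
    # Original: floor at 1/clip_range ~= 0.032 fires whenever the row's
    # max |w| is below 1/31, which is typical (row maxima ~ 0.01-0.05).
    row_clip = max(abs(w) for w in row)
    return max(row_clip / CLIP_RANGE, 1.0 / CLIP_RANGE)

def int6_scale_fixed(row):
    # Fix: tiny epsilon floor (1e-7), same as the int8 path.
    row_clip = max(abs(w) for w in row)
    return max(row_clip / CLIP_RANGE, 1e-7)

row = [0.03, -0.01, 0.005]  # max |w| = 0.03, below 1/31
q_buggy = [round(w / int6_scale_buggy(row)) for w in row]
q_fixed = [round(w / int6_scale_fixed(row)) for w in row]
# q_buggy collapses to a couple of levels near zero;
# q_fixed spans the full [-31, 31] range for this row.
```

With the buggy clamp the row quantizes to [1, 0, 0]; with the fix it quantizes to [31, -10, 5], using the dynamic range the format actually provides.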
2. 6-bit Packing
Pack 4 int6 values into 3 bytes instead of storing each in a full int8 byte: a 25% payload reduction for all int6 tensors, which directly helps fit under the 16 MB budget. ~10 lines each for pack/unpack, with negligible eval overhead.
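A minimal sketch of such a pack/unpack pair (not the PR's exact code; signed int6 values in [-31, 31] are assumed to be offset to unsigned [0, 62] before packing):

```python
def pack_int6(vals):
    """Pack unsigned 6-bit ints (0..63), 4 values -> 3 bytes.

    Length must be a multiple of 4; pad beforehand if needed.
    Bit layout per quad (a, b, c, d): aaaaaabb bbbbcccc ccdddddd
    """
    assert len(vals) % 4 == 0
    out = bytearray()
    for i in range(0, len(vals), 4):
        a, b, c, d = vals[i:i + 4]
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(data):
    """Inverse of pack_int6: 3 bytes -> 4 unsigned 6-bit ints."""
    vals = []
    for i in range(0, len(data), 3):
        x, y, z = data[i], data[i + 1], data[i + 2]
        vals.append(x >> 2)
        vals.append(((x & 0x3) << 4) | (y >> 4))
        vals.append(((y & 0xF) << 2) | (z >> 6))
        vals.append(z & 0x3F)
    return vals
```

8 values pack to 6 bytes instead of 8, the advertised 25% reduction.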
3. Forced Int8 for Depth-Recurrence Shared Layers
When layers share weights (depth recurrence), quantization error compounds through each reuse. Solution: force int8 (127 levels) for shared layers while keeping int6 for non-shared layers. Small byte cost, meaningful quality gain.
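The per-layer policy can be sketched as follows (layer names and the helper are hypothetical, not the PR's API):

```python
def choose_bits(name, shared_names, default_bits=6):
    # A shared layer's quantization error re-enters the residual stream on
    # every recurrence step, so it gets the finer int8 grid
    # (clip range 127) instead of int6 (clip range 31).
    return 8 if name in shared_names else default_bits

shared = {"mlp_shared.w_in", "mlp_shared.w_out"}  # assumed shared-layer names
layers = ["embed", "attn.qkv", "mlp_shared.w_in", "mlp_shared.w_out", "lm_head"]
plan = {name: choose_bits(name, shared) for name in layers}
# Shared MLP layers map to 8 bits; everything else stays at 6.
```

The byte cost is bounded: only the shared weights pay the int8 premium, while the bulk of the checkpoint keeps the packed int6 format.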
Architecture
Status
Non-record: validated on 1xH100 NVL (717 steps, undertrained). No 8xH100 run yet. Submitted for the quantization findings, which may help others.
Credits
Stack B: PR #1218, #1260. MuonEq-R/depth recurrence: @signalrush. XSA: @abaybektursun. LeakyReLU²: @parinzee, @sofiabod. GPTQ-lite baseline: PR #374, #1019.