@@ -0,0 +1,38 @@
# ConsensusWindow Bypass (FAT-Golf)

Adds a depthwise causal convolution bypass path to the SOTA baseline, derived from the ORC FAT-AR architecture (Factorized Attention Transformer for Autoregressive generation).

## Changes from baseline (abaybektursun, 1.1194 BPB)

Two additions (~90 lines of new code, ~47K params, ~0.2% of ~22M model):

1. **ConsensusWindowEmbed**: replaces SmearGate (1-token lookback, 512 params) with a depthwise causal conv1d (16-token receptive field, ~9K params). Learns per-channel weighted sum over local context at the embedding level.

2. **ConsensusBlockBypass** on deepest 4 layers: gated parallel path alongside attention. Each block gets a depthwise causal conv that processes the same normed input as attention, with a per-dimension sigmoid gate (initialized 80% attention / 20% bypass) blending the outputs.
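The two components above can be sketched in PyTorch. This is a hypothetical reconstruction from the description, not the PR's actual code; class names match the README, but the window size, init details, and tensor layout are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConsensusWindowEmbed(nn.Module):
    """Depthwise causal conv over embeddings (sketch; defaults assumed)."""

    def __init__(self, dim: int, window: int = 16):
        super().__init__()
        self.window = window
        # groups=dim makes the conv depthwise: one filter per channel,
        # i.e. a learned per-channel weighted sum over the local window.
        self.conv = nn.Conv1d(dim, dim, kernel_size=window, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        h = x.transpose(1, 2)               # (batch, dim, seq)
        h = F.pad(h, (self.window - 1, 0))  # left-pad only => causal
        return self.conv(h).transpose(1, 2)


class ConsensusBlockBypass(nn.Module):
    """Gated parallel path: blends attention output with a depthwise
    causal conv of the same normed input (sketch)."""

    def __init__(self, dim: int, window: int = 16, attn_frac: float = 0.8):
        super().__init__()
        self.conv = ConsensusWindowEmbed(dim, window)
        # sigmoid(gate) starts near attn_frac (80% attention / 20% bypass)
        init = torch.logit(torch.tensor(attn_frac)).item()
        self.gate = nn.Parameter(torch.full((dim,), init))

    def forward(self, x_normed: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)  # per-dimension blend weight
        return g * attn_out + (1.0 - g) * self.conv(x_normed)
```

The left-only padding is what makes the conv causal: position `t` only ever sees positions `t - window + 1` through `t`.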

Everything else is identical: Muon, parameter banking, int6 QAT, EMA/SWA, BigramHash, XSA, Partial RoPE, LN Scale, VE, LeakyReLU(0.5)^2, TTT.

## Status

**Small-scale results only** — awaiting H100 compute for full-scale validation.

Tested at 256d, 6 layers, 500 steps on a single 4060 Ti 8GB.

### Results (3-seed means)

| Metric               | Baseline | Ours   | Delta  |
|----------------------|----------|--------|--------|
| Pre-EMA val_bpb      | 2.3477   | 2.3208 | -0.027 |
| Post-EMA+int6 val_bpb| 2.4185   | 2.3438 | -0.075 |

Key finding: the combined architecture produces weights far more robust to EMA averaging and int6 quantization (EMA+quant penalty +0.023 vs baseline's +0.071). Neither component alone beats baseline post-quantization — they must be combined for the synergistic effect.
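The penalty figures follow directly from the reported 3-seed means:

```python
# Recompute deltas and EMA+int6 penalties from the reported means.
base_pre, ours_pre = 2.3477, 2.3208
base_post, ours_post = 2.4185, 2.3438

delta_pre = ours_pre - base_pre      # -0.0269 (reported as -0.027)
delta_post = ours_post - base_post   # -0.0747 (reported as -0.075)

penalty_base = base_post - base_pre  # +0.0708 (reported as +0.071)
penalty_ours = ours_post - ours_pre  # +0.0230 (reported as +0.023)
```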

## Environment variables

```
CONSENSUS_WINDOW_SIZE=32 # Conv1d receptive field (0 = use SmearGate)
CONSENSUS_BYPASS_LAST_N=4 # Number of deepest layers with bypass
CONSENSUS_EMA_EXCLUDE=0 # Exclude consensus from EMA (not recommended)
```
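A minimal sketch of how these flags might be parsed (the PR's actual config code is not shown here; variable names simply mirror the documented environment variables and their defaults):

```python
import os

# Hypothetical flag parsing mirroring the documented defaults.
window = int(os.environ.get("CONSENSUS_WINDOW_SIZE", "32"))
use_smear_gate = window == 0  # 0 falls back to the baseline SmearGate
bypass_last_n = int(os.environ.get("CONSENSUS_BYPASS_LAST_N", "4"))
ema_exclude = os.environ.get("CONSENSUS_EMA_EXCLUDE", "0") == "1"
```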

## Source

Full repository with tests and ablation scripts: https://github.com/TheDryhtscipe/golf-model-1
@@ -0,0 +1,3 @@
torch>=2.5.0
numpy
sentencepiece
@@ -0,0 +1,12 @@
{
"author": "TheDryhtscipe",
"github_id": "TheDryhtscipe",
"name": "ConsensusWindow Bypass (FAT-Golf)",
"blurb": "Depthwise causal conv bypass from ORC FAT-AR. Replaces SmearGate + adds gated deep-layer bypass. Small-scale: -0.075 BPB post-quantization, 3x more robust to EMA+int6.",
"date": "2026-03-27T00:00:00Z",
"val_loss": 0.0,
"val_bpb": 0.0,
"bytes_total": 0,
"bytes_code": 0,
"note": "Small-scale results only (256d/6L/500steps on 4060 Ti). Awaiting H100 compute for full-scale validation."
}