[DRAFT] Speedrun Submission nanogpt_features_v0 #2184
base: main
Conversation
dlwh left a comment:
looks good to me! Seems like a good cluster of improvements
> * boosted attn scale. Using 1.35/sqrt(head_dim)
> ### Some larger modeling differences with NanoGPT
> * Uses the GPT2 tokenizer with ~50,000 tokens, whereas the marin tokenizer defaults to a 128,256 vocab size. This means that for small models a substantial amount of compute is locked in the lm_head projection. In terms of total param count, the 150m model has 80% of its params in the embedding and lm_head. I don't know enough about this repo yet to test other tokenizers.
yeah this is a bit of a weird thing. We have evidence that the llama3 tokenizer is better than gpt2 and llama2 even at 1b scale, but it does seem pretty lopsided in terms of parameter allocation at small scales
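For a rough sense of that lopsidedness, here is a back-of-the-envelope parameter split; the hidden size and layer count below are illustrative assumptions, not values taken from this PR:

```python
# Rough parameter-allocation check for a small model with a 128,256-token vocab.
# d_model / n_layers are illustrative guesses, not the actual marin 150m config.
vocab_size = 128_256
d_model, n_layers = 768, 12

embed_and_head = 2 * vocab_size * d_model      # untied input embedding + lm_head
body = n_layers * 12 * d_model**2              # ~12 * d_model^2 params per block
frac = embed_and_head / (embed_and_head + body)
print(f"embedding + lm_head: {frac:.0%} of all params")  # ~70% for these dims
```

With a ~50k vocab the same calculation gives roughly half, which is the asymmetry being discussed.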
> * Uses fp8 on the lm_head.
this seems risky to me in terms of long-term stability and accuracy? softmaxes are pretty sensitive, no?
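For reference, the usual mitigation is to quantize only the matmul inputs and keep the accumulation, logits, and softmax in fp32. A minimal JAX sketch; the per-tensor scaling scheme here is an assumption, not necessarily what modded-nanogpt or this PR does:

```python
import jax
import jax.numpy as jnp

def fp8_lm_head_logits(hidden, w_head):
    """fp8 matmul inputs, fp32 accumulation; the softmax stays in fp32 downstream."""
    # per-tensor scales so activations/weights fit e4m3's ~448 max magnitude
    s_h = jnp.max(jnp.abs(hidden)) / 448.0 + 1e-12
    s_w = jnp.max(jnp.abs(w_head)) / 448.0 + 1e-12
    h8 = (hidden / s_h).astype(jnp.float8_e4m3fn)
    w8 = (w_head / s_w).astype(jnp.float8_e4m3fn)
    # accumulate in fp32; on H100-class hardware XLA can lower this to an fp8 GEMM
    logits = jax.lax.dot(h8, w8, preferred_element_type=jnp.float32)
    return logits * (s_h * s_w)  # undo the scales before the fp32 softmax
```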
> * Schedule-based updates. Updates the momentum terms, attention window sizes, batch size, and rotary params throughout training.
> * Parameter-group-specific lr. In particular, the embed is set to 75x the lr of the lm_head.
> * Attention masking. Short/Short/Short/Short/Long attention window configuration.
> * Data sampling. I am not yet aware of how this run does data sampling, but I expect differences here.
concat and split and then random samples with cross-doc masking
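That is, documents are concatenated, split into fixed-length windows, and attention is blocked across document boundaries. A minimal sketch of such a mask (names are illustrative, not the repo's actual implementation):

```python
import jax.numpy as jnp

def cross_doc_causal_mask(doc_ids):
    """Causal mask that also blocks attention across document boundaries.

    doc_ids: [seq_len] int array; tokens from the same document share an id,
    e.g. doc_ids = jnp.cumsum(tokens == EOS_ID) after concat-and-split.
    """
    pos = jnp.arange(doc_ids.shape[0])
    causal = pos[:, None] >= pos[None, :]
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc  # [seq_len, seq_len] boolean mask
```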
This pull request has been inactive for 23 days and is marked as stale.

bump to keep open

NanoGPT_Features_v0
This PR adds 3 speedruns for 150m, 270m, and 460m param models with a subset of the modded-nanogpt features. The 270m mirrors the scale of NanoGPT. I am limiting the scope of this draft to an exploratory phase and have not cleaned up the syntax of the hackable transformer file.
One objective here is to baseline the two repos to identify speedup opportunities. As a result, I am not ablating individual changes and instead want to add enough ML features such that the remaining speed gap can be isolated to non-ML components. flops_per_token is an estimate, as lambdas are treated as rounding errors.
Features included
Some larger modeling differences with NanoGPT
There are ~20 other minor differences that could be interesting to explore in a more scientific manner at some point.
FLOP Gap
For forward-pass FLOPs per token (lm_head, mlp, attn), NanoGPT is (77M, 104M, 79M) = 260M, whereas this 270M-parameter run is (197M, 104M, 122M) = 423M. This run was at ~22 MFU whereas NanoGPT is roughly around 45 MFU; combined with the ~1.6x FLOPs-per-token disadvantage, that is roughly a 3x speed gap.
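The lm_head figures above follow directly from the vocab sizes; assuming d_model = 768 (not stated explicitly here, but consistent with both reported numbers):

```python
# FLOPs per token for the lm_head: one [d_model] x [d_model, vocab] matmul.
d_model = 768  # assumption, consistent with the 77M / 197M figures above
for name, vocab in [("gpt2 (nanogpt)", 50_257), ("marin/llama3", 128_256)]:
    print(f"{name}: {2 * d_model * vocab / 1e6:.0f}M FLOPs/token")
# gpt2 (nanogpt): 77M, marin/llama3: 197M
# -> the vocabulary accounts for most of the 260M vs 423M gap.
```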
Notes
When I tested https://wandb.ai/marin-community/marin/runs/hacktx_300m_stdattn_4096-77b709?nw=nwuserabhinavg4 on a single H100, I got 13 MFU instead of 21 MFU, which leads me to believe either the GPU I was allocated was poor, or a substantial part of the gap comes from finding architectures that are well tuned to the GPU/TPU specifics of the hardware. I got more reasonable MFU on the H100 when I decreased seq_len and replaced gated SiLU with ReLU^2.
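For reference, the two MLP variants mentioned above look roughly like this (shapes and weight handling are illustrative, not the exact code from either repo):

```python
from jax.nn import relu, silu

def gated_silu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU-style block: three matmuls per layer
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def relu2_mlp(x, w_in, w_out):
    # modded-nanogpt-style ReLU^2 block: two matmuls per layer
    return (relu(x @ w_in) ** 2) @ w_out
```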
A large number of parameters such as learning rates, seq_len, and batch size are left unmodified across scales, so I am not inferring much from performance outside the 270m run; checking different values was left out of scope. The throughput of the 130M run dropped by 10% for the last 50% of the run; I am unsure why.