@Calvin-Xu
Member

Description

Addresses #2109

Implements Gated Attention per https://github.com/qiuzh20/gated_attention and runs sweeps to find the optimal LR scaling factor.
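
For readers unfamiliar with the technique, here is a minimal NumPy sketch of one gating variant: an elementwise sigmoid gate, computed from the layer input, applied to the attention output before the output projection. This is an illustration under assumptions, not the code in this PR; the function name `gated_attention`, the weight names, the bias-free projections, and the causal mask are all hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(x, w_q, w_k, w_v, w_gate, w_o, num_heads):
    """Causal multi-head attention with a sigmoid output gate (sketch).

    x: (seq, d_model); every weight is (d_model, d_model).
    All names here are hypothetical, chosen for illustration only.
    """
    seq, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (heads, seq, head_dim)
        return t.reshape(seq, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Standard scaled dot-product attention with a causal mask.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    attn = softmax(scores) @ v  # (heads, seq, head_dim)
    attn = attn.transpose(1, 0, 2).reshape(seq, d_model)

    # The gate depends on the layer input x, not on the attention output.
    gate = sigmoid(x @ w_gate)  # (seq, d_model), values in (0, 1)
    return (attn * gate) @ w_o
```

With `w_gate` set to zeros the gate is a constant 0.5, so the layer reduces to ordinary attention scaled by one half, which makes a convenient sanity check.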

This has been basically ready for a while, but I am rerunning the 1.2B track on v5p-32 to get good hardware FLOPs data points, and we are having trouble launching v5p-32 specifically on our clusters. I'm putting up a draft PR so people know this is being worked on and effort isn't duplicated (Will almost did).

Checklist

  • You ran uv run python infra/pre-commit.py --all-files to lint/format your code
  • You ran 'pytest' to test your code
  • Delete this checklist

@Calvin-Xu changed the title from "Calvin/gated attention" to "Gated Attention & Scaling Speedruns" on Jan 5, 2026