optimize inner persistent scheduler for static fusion #5281
Add flashinfer to the benchmarks.
Two optimizations for the inner persistent scheduler:
(1) For a given LLM, the hidden dimension is fixed, so we can use a static shape in that dimension. Change bdimx to static.
(2) For inputs with a small batch size, not all SMs are used; pick a larger bdimx to allow more warps per block.
After these two optimizations, nvFuser outperforms flashinfer.
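The bdimx selection idea in the two points above could be sketched roughly as follows. This is a hypothetical illustration, not nvFuser's actual heuristic; the SM count, the 256-thread default, and the function name `pick_bdimx` are all assumptions for the sketch.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024  # hardware limit on blockDim.x

def pick_bdimx(hidden_size: int, batch_size: int, num_sms: int = 132) -> int:
    """Illustrative static bdimx choice for an inner persistent kernel.

    hidden_size is fixed for a given LLM, so this value can be baked
    into the compiled kernel (optimization 1). When batch_size is
    smaller than the SM count, some SMs sit idle, so a larger bdimx
    puts more warps on each active SM (optimization 2).
    """
    # Smallest multiple of the warp size covering the hidden dimension.
    warp_multiple = ((hidden_size + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE
    if batch_size < num_sms:
        # Small batch: fewer blocks than SMs, so maximize warps per block.
        return min(MAX_THREADS_PER_BLOCK, warp_multiple)
    # Enough blocks to occupy every SM: a moderate bdimx suffices.
    return min(256, warp_multiple)
```

For example, with a hidden size of 4096, a batch of 8 rows on a 132-SM GPU would get the full 1024 threads per block, while a batch of 512 rows would get 256.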
raw data: https://docs.google.com/spreadsheets/d/1JMea0s_Z2mYDgbdNSI7-pS4MtdUBrIONl5EqiezFrDo/edit?usp=sharing