upgrade flashinfer to v0.4.0rc1 #25315
Conversation
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
Code Review
This pull request upgrades FlashInfer to version v0.4.0rc1. The version updates in the Dockerfiles and `setup.py` are consistent with this goal. However, there is a critical issue in `vllm/v1/attention/backends/flashinfer.py` where an API call for the new FlashInfer version appears to be only partially updated. This is likely to cause a runtime error. Please see the specific comment for details.
```diff
 try:
-    # Make sure we pass exactly 15 arguments for tensor core version
+    # Make sure we pass exactly 18 arguments for tensor core version
```
While you've correctly updated the internal `_cached_module.plan` call for the new FlashInfer version, the corresponding public `self.plan` method call (at lines 1025-1043) seems to have been missed. This public method is used for the initial warm-up when CUDA graphs are enabled.
If the public `plan` API also changed (which is highly likely given the internal API change), this will cause a `TypeError` at runtime during the warm-up. Please update the call at lines 1025-1043 to include any new arguments. Based on the changes to the internal call, it's likely that arguments such as `window_right` and `allow_fp16_qk_reduction` need to be added.
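The failure mode described here can be sketched with a toy example. The names `plan`, `window_right`, and `allow_fp16_qk_reduction` are taken from this thread, but the signatures below are hypothetical stand-ins, not the real FlashInfer API:

```python
# Hypothetical sketch: when an upgraded API grows new required
# parameters, a call site that was not updated fails with a
# TypeError at runtime (here, during a simulated warm-up).

def plan_v4(batch_size, window_left, window_right,
            allow_fp16_qk_reduction):
    # Stand-in for the new plan signature with two extra arguments.
    return (batch_size, window_left, window_right,
            allow_fp16_qk_reduction)

def warmup(plan_fn, **kwargs):
    # A warm-up path that simply forwards its arguments to plan_fn.
    # If the caller still passes the old argument set, this raises.
    return plan_fn(**kwargs)

# Old-style call against the new signature: missing arguments.
try:
    warmup(plan_v4, batch_size=8, window_left=-1)
except TypeError as exc:
    print("warm-up failed:", exc)

# Updated call forwards the new arguments and succeeds.
result = warmup(plan_v4, batch_size=8, window_left=-1,
                window_right=-1, allow_fp16_qk_reduction=False)
print(result)
```

This is why updating only the internal call path is insufficient: any public entry point that forwards to the changed function must pass the new arguments as well.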
Resolved by #26326
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.