Implement runai model streamer for MODEL_IMPL_TYPE=flax_nnx #955
Conversation
Thanks for wanting to enable this feature! We need tests (I assume across different TP values for different model sizes) that show this is both 1) correct/accurate and 2) performant. @py4 @vipannalla @manojkris @jcyang43 to comment if there's any guidance we can share.
We don't need anything for different TP values yet; that will come once distributed streaming is added in a follow-up PR. I have confirmed it works to load the model (by pushing the image and deploying the inference server on GKE), and I'm still trying to run the tests I added.
Description
We recently made changes to support GCS for the RunAI model streamer in vLLM. The last step of that work was to install the RunAI model streamer within the vLLM image. This was done for GPU in PR 26464, but we forgot to add the installation of the runai-model-streamer module to the TPU Dockerfile. Without the RunAI model streamer installed in the TPU image, customers have to make this change locally and build a custom TPU vLLM image in order to use the RunAI model streamer.
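A minimal sketch of the kind of Dockerfile change this describes, assuming the TPU image installs Python packages with pip; the exact package set (for example, whether a separate GCS plugin or extra is also required) is an assumption, not the actual diff in this PR:

```dockerfile
# Hypothetical sketch only -- not the actual change in this PR.
# Install the RunAI model streamer into the TPU image so customers do not
# have to build a custom image to use it as vLLM's weight loader.
# A GCS-specific plugin or extra may also be needed; check the streamer's
# and vLLM's documentation for the exact package set.
RUN pip install runai-model-streamer
```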
No bug or GitHub issue has been created for this, as RunAI model streamer support for GCS is still pending release for GPU.
Tests
Tested that building the image still succeeds and that the RunAI model streamer can be used to load the model for a vLLM inference server. I then used this image as the vLLM image to serve a Qwen3 model from a GCS bucket with the RunAI model streamer, deployed it on a GKE TPU cluster, and confirmed the inference server starts up successfully.
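For reference, a hedged sketch of the kind of command this implies; the bucket path and model name are placeholders, and only the `--load-format runai_streamer` flag is taken from vLLM's existing CLI:

```bash
# Hypothetical example -- the GCS path and model are placeholders.
# --load-format runai_streamer makes vLLM load weights via the RunAI model streamer.
vllm serve gs://my-bucket/qwen3-8b --load-format runai_streamer
```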
I also ensured my correctness test passed. Full test setup here.
Checklist
Before submitting this PR, please make sure:
[x] I have performed a self-review of my code.
[x] I have necessary comments in my code, particularly in hard-to-understand areas.
[x] I have made or will make corresponding changes to any relevant documentation.