[V1] [7/N] API Server: Multiprocessing Detokenizer [DO NOT MERGE] #11636
Closing as the implementation did not lead to performance gains.
NOTES:
- `LLM` performance is already good, so we will abandon the 3-process architecture and go forward with the 2-process architecture.
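
For context, a minimal sketch of the pattern this PR experimented with: detokenization running in its own worker process, fed token ids over a queue. All names here (`detokenize_loop`, the queues, the model id) are illustrative assumptions, not vLLM's actual classes.

```python
# Illustrative only: a detokenizer worker in a separate process.
import multiprocessing as mp

from transformers import AutoTokenizer


def detokenize_loop(model, in_q, out_q):
    """Runs in the worker process: token ids in, decoded text out."""
    tokenizer = AutoTokenizer.from_pretrained(model)
    while True:
        item = in_q.get()
        if item is None:  # shutdown sentinel
            break
        request_id, token_ids = item
        out_q.put((request_id, tokenizer.decode(token_ids)))


if __name__ == "__main__":
    in_q = mp.Queue()
    out_q = mp.Queue()
    proc = mp.Process(
        target=detokenize_loop,
        args=("facebook/opt-125m", in_q, out_q),
        daemon=True,
    )
    proc.start()

    # The engine/AsyncLLM side would push generated token ids here.
    in_q.put(("req-0", [2, 100, 19, 10]))
    print(out_q.get())  # ("req-0", <decoded text>)

    in_q.put(None)  # ask the worker to exit
    proc.join()
```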
SUMMARY:
- Builds on `AsyncLLM` and API Server #11237.
- Uses a `weakref._finalizer` for cleanup.
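
The finalizer point is about tearing down the background process. Below is a hedged sketch of that idea using the public `weakref.finalize` API; the class and helper names (`DetokenizerClient`, `_shutdown`, `_worker_loop`) are hypothetical, not the PR's actual code.

```python
# Illustrative only: tie background-process shutdown to the owner's lifetime.
import multiprocessing as mp
import weakref


def _worker_loop(queue):
    # Placeholder for the detokenizer loop: exits on the None sentinel.
    while queue.get() is not None:
        pass


def _shutdown(proc, queue):
    # Module-level so the finalizer holds no reference back to the owner.
    queue.put(None)          # ask the worker loop to exit
    proc.join(timeout=5)
    if proc.is_alive():
        proc.terminate()


class DetokenizerClient:
    """Owns a background process and guarantees it is torn down."""

    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=_worker_loop, args=(self.queue,), daemon=True)
        self.proc.start()
        # Runs _shutdown when this object is garbage collected or at exit.
        self._finalizer = weakref.finalize(self, _shutdown, self.proc, self.queue)

    def close(self):
        # Explicit shutdown; calling the finalizer is idempotent.
        self._finalizer()


if __name__ == "__main__":
    client = DetokenizerClient()
    client.close()
```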
PERFORMANCE:

Server:
```
vllm serve $MODEL --no-enable-prefix-caching --gpu-memory-utilization 0.98 --max-num-batched-tokens 8192 --disable-log-requests
```

Client:
```
python3 benchmark_serving.py --model $MODEL --dataset-name sonnet --dataset-path sonnet.txt --sonnet-input-len 250 --sonnet-output-len 200
```

Results:
- `main`: 55.99 requests/sec
- `pr`: 54.05 requests/sec
`AsyncLLM`:
- `main`: