[Experimental] Add a gRPC server for completion request #2478
base: main
Conversation
Hi, @MrAta is this ready to be reviewed?
I am seeing some test failures.
Hey! No, it's not! It's still WIP and drafted. Will make it ready for review once done.
@rkooo567 It's ready for review now. Admittedly, the protos are minimal and need some iteration.
@MrAta From the current design, the HTTP server and gRPC server cannot be turned on simultaneously. Could you find a way to have both? One way is to use the original way to launch the HTTP server, and then launch the gRPC server in the tokenizer manager when the first request comes (because only then will the event loop be created). The grpc server can use
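A minimal sketch of the lazy-startup idea described above, not the PR's actual implementation. The module path, class, and hook names are assumptions for illustration; the `add_CompletionServiceServicer_to_server` helper follows the naming convention grpcio-tools generates for the `CompletionService` defined in this PR's proto.

```python
import grpc

# Assumed import path for the modules generated from completion.proto.
from sglang.srt.grpc import completion_pb2_grpc


class TokenizerManager:
    """Only the pieces relevant to lazily starting the gRPC server are sketched."""

    def __init__(self, grpc_port: int, completion_servicer):
        self.grpc_port = grpc_port
        self.completion_servicer = completion_servicer
        self._grpc_server = None

    async def _ensure_grpc_server(self):
        # Called from the first request handler, so the asyncio event loop
        # is guaranteed to exist by the time the grpc.aio server starts.
        if self._grpc_server is None:
            self._grpc_server = grpc.aio.server()
            completion_pb2_grpc.add_CompletionServiceServicer_to_server(
                self.completion_servicer, self._grpc_server
            )
            self._grpc_server.add_insecure_port(f"0.0.0.0:{self.grpc_port}")
            await self._grpc_server.start()
```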
Signed-off-by: Ata Fatahi <[email protected]>
@Ying1123 done.
I tried your implementation last night; it turns out adapting to input_embeds is surprisingly painless. It involves modifying your completion.proto slightly, as well as altering the grpc server to take either prompt or input_embeds. The issue I had was that it was awfully slow, likely because of the type conversion involved.
Now when launching the server, it launches both a FastAPI and a gRPC server (when --grpc-port is provided).
Is this an intentional design? Is there a way to only launch the gRPC server?
Also, a couple of comments:
- I think we need to be a little more careful about the default values
- Also, can you add tests? (A minimal client sketch is shown after this list.)
- We should handle the case where the client closes the connection (we probably want to abort the request in this case). This doesn't need to be done in this PR
- For missing / unsupported fields, can you give us a list and maybe create an issue to track them?
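To illustrate the kind of test that could exercise the new endpoint, here is a sketch, not taken from this PR: the stub and message names are assumed to come from the modules grpcio-tools generates for completion.proto, the port is a placeholder, and pytest-asyncio is assumed for the async test.

```python
import grpc
import pytest

# Assumed names of the modules generated from completion.proto.
from sglang.srt.grpc import completion_pb2, completion_pb2_grpc


@pytest.mark.asyncio  # requires pytest-asyncio
async def test_grpc_complete_streams_until_finished():
    # Port is a placeholder; use whatever --grpc-port the test server was launched with.
    async with grpc.aio.insecure_channel("localhost:30001") as channel:
        stub = completion_pb2_grpc.CompletionServiceStub(channel)
        request = completion_pb2.CompletionRequest(
            prompt="The capital of France is",
            max_tokens=8,
            temperature=0.0,
            stream=True,
        )
        last = None
        async for response in stub.Complete(request):
            last = response
        # The final streamed message should carry the finished flag.
        assert last is not None and last.finished
```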
@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
"psutil", "pydantic", "python-multipart",
"pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
"xgrammar>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6", "grpcio==1.68.1", "grpcio-tools==1.68.1"]
Is the version restriction necessary?
try:
    # Create gRPC request with same parameters as FastAPI
    request = completion_pb2.CompletionRequest(
sglang/python/sglang/bench_serving.py, line 315 (commit 2125898):
payload = {
This doesn't seem to match (e.g., return logprobs or **request_func_input.extra_request_body).
parser.add_argument(
    "--grpc-port",
    type=int,
    help="If not set, the default port is configured according to its default value for different LLM Inference Engines.",
Let's comment that this is only for the sglang backend and is used only when the gRPC backend is selected with --backend="sglang-grpc"?
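For example, the help text could be reworded along these lines (the wording below is only a suggestion, not from the PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--grpc-port",
    type=int,
    help=(
        "Port of the sglang gRPC server. Only used by the sglang backend, "
        "and only when the benchmark is run with --backend sglang-grpc."
    ),
)
```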
):
    self.generate_request = generate_request

async def Complete(
Suggested change:
- async def Complete(
+ async def complete(
float presence_penalty = 9;
bool stream = 10;
repeated string stop = 11;
bool ignore_eos = 12;
What are the missing fields here? Maybe add comments? (e.g., add all fields and comment out the unsupported ones)
@@ -0,0 +1,38 @@
# -*- coding: utf-8 -*-
How is it rebuilt whenever we merge a PR and change the schema?
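One way to keep the generated modules in sync is to regenerate them from the proto in CI or a pre-commit hook. Here is a sketch using grpcio-tools (which this PR already adds as a dependency); the file layout is assumed:

```python
# regenerate_protos.py - run from the directory containing completion.proto.
from grpc_tools import protoc

ret = protoc.main(
    [
        "grpc_tools.protoc",
        "-I.",                  # proto search path (assumed layout)
        "--python_out=.",       # emits completion_pb2.py (messages)
        "--grpc_python_out=.",  # emits completion_pb2_grpc.py (service stubs)
        "completion.proto",
    ]
)
raise SystemExit(ret)
```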
The SRT server consists of an HTTP server and the SRT engine.

1. HTTP server: A FastAPI server that routes requests to the engine.
1. HTTP server: A FastAPI server that routes requests to the engine. Alternatively, it can be a gRPC server.
Is "alternatively" the right word? Thought we are going to start both servers
sampling_params={
    "max_new_tokens": request.max_tokens,
    "temperature": request.temperature,
    "top_p": request.top_p,
This part is dangerous: gRPC protobuf uses 0 as the default value for unset fields, which means top_p and top_k could end up different from their intended defaults (1 and -1, respectively).
I think we need some sort of mechanism to set the default values. One potential solution is to use optional fields in protobuf and set the correct defaults when they are not provided.
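A sketch of the optional-field approach suggested above: if the scalar fields are declared `optional` in completion.proto, their presence can be checked on the Python side and the real defaults applied when they are unset. The field names mirror the snippet above; the top_p and top_k defaults (1 and -1) come from this comment, while the temperature default is only illustrative.

```python
def build_sampling_params(request) -> dict:
    # With proto3 `optional` scalars, HasField() distinguishes "unset" from 0.
    return {
        "max_new_tokens": request.max_tokens,
        "temperature": request.temperature if request.HasField("temperature") else 1.0,
        "top_p": request.top_p if request.HasField("top_p") else 1.0,
        "top_k": request.top_k if request.HasField("top_k") else -1,
    }
```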
# Send final response with finished flag
final_response = completion_pb2.CompletionResponse(
    text=content["text"],  # Final complete text
    finished=True,
QQ: how does a regular HTTP completion request figure out if it is finished?
)

# Process request through tokenizer manager
async for content in self.generate_request(adapted_request):
We need to handle the client closing the connection and abort the request (I think we can do it using the context object). We can also do it as a follow-up; in that case, can you create a separate issue?
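A sketch of what that could look like with the grpc.aio servicer context; the abort hook and helper names are hypothetical, but add_done_callback/cancelled are part of the grpc.aio API.

```python
class CompletionServicer:
    def __init__(self, generate_request, tokenizer_manager):
        self.generate_request = generate_request
        self.tokenizer_manager = tokenizer_manager

    async def Complete(self, request, context):
        adapted_request, rid = self._adapt(request)  # hypothetical helper

        def _on_rpc_done(ctx):
            # Fired on RPC termination; abort generation if the client went away.
            if ctx.cancelled():
                self.tokenizer_manager.abort_request(rid)  # hypothetical hook

        context.add_done_callback(_on_rpc_done)

        async for content in self.generate_request(adapted_request):
            yield self._to_response(content)  # hypothetical helper
```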
Just want to chime in that the gRPC Python server isn't very performant (LesnyRumcajs/grpc_bench#441); triton-inference-server removed gRPC for performance reasons.
Motivation
Being able to launch sglang as a gRPC service in production.
Modifications
Now when launching the server, it launches both a FastAPI and a gRPC server (when --grpc-port is provided).

Notes:
This is an experimental feature! As of now:
- The proto definitions of CompletionRequest and CompletionResponse DO NOT match 100% with their OAI counterparts in protocol.py.
- The gRPC CompletionService implementation DOES NOT match 100% with v1_completion yet, as the latter contains several branches. The current implementation is the simplest case for a streaming request/response workload.

After we get enough feedback and reviews, I can work on addressing the above.
Tests
Please see bench_serving.py, where I added a gRPC example benchmark.

Here are the results comparing FastAPI and gRPC for the same benchmark and server settings:
gRPC
============ Serving Benchmark Result ============
Backend: sglang-grpc
Traffic request rate: inf
Max reqeuest concurrency: not set
Successful requests: 300
Benchmark duration (s): 23.94
Total input tokens: 70805
Total generated tokens: 58269
Total generated tokens (retokenized): 58254
Request throughput (req/s): 12.53
Input token throughput (tok/s): 2957.02
Output token throughput (tok/s): 2433.48
Total token throughput (tok/s): 5390.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4363.72
Median E2E Latency (ms): 3313.38
---------------Time to First Token----------------
Mean TTFT (ms): 841.61
Median TTFT (ms): 897.15
P99 TTFT (ms): 1439.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.70
Median TPOT (ms): 19.81
P99 TPOT (ms): 223.82
---------------Inter-token Latency----------------
Mean ITL (ms): 18.13
Median ITL (ms): 16.22
P99 ITL (ms): 32.96
==================================================
FastAPI
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: not set
Successful requests: 300
Benchmark duration (s): 24.00
Total input tokens: 70805
Total generated tokens: 58269
Total generated tokens (retokenized): 58267
Request throughput (req/s): 12.50
Input token throughput (tok/s): 2949.97
Output token throughput (tok/s): 2427.67
Total token throughput (tok/s): 5377.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4504.54
Median E2E Latency (ms): 3411.37
---------------Time to First Token----------------
Mean TTFT (ms): 1023.32
Median TTFT (ms): 990.22
P99 TTFT (ms): 1666.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.10
Median TPOT (ms): 19.82
P99 TPOT (ms): 237.27
---------------Inter-token Latency----------------
Mean ITL (ms): 18.02
Median ITL (ms): 16.90
P99 ITL (ms): 28.88
==================================================
Checklist