
[Experimental] Add a gRPC server for completion request #2478

Open · wants to merge 30 commits into main

Conversation

@MrAta (Contributor) commented Dec 13, 2024

Motivation

Enable launching sglang as a gRPC service in production.

Modifications

  • Adds a gRPC server.
  • Supports the completion request for now.

When launching the server, it now starts both a FastAPI server and a gRPC server (when --grpc-port is provided).
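
Roughly, the gRPC servicer wraps the tokenizer manager's generate_request. Here is a simplified sketch, not the exact code in the diff; the stub module names, import paths, and field names are approximations:

from sglang.srt.grpc import completion_pb2, completion_pb2_grpc  # assumed stub location
from sglang.srt.managers.io_struct import GenerateReqInput


class CompletionServicer(completion_pb2_grpc.CompletionServiceServicer):
    def __init__(self, generate_request):
        # generate_request is the tokenizer manager's async generator entry point.
        self.generate_request = generate_request

    async def Complete(self, request, context):
        # Map the proto request onto the engine's internal generate request.
        adapted_request = GenerateReqInput(
            text=request.prompt,
            sampling_params={
                "max_new_tokens": request.max_tokens,
                "temperature": request.temperature,
                "top_p": request.top_p,
            },
            stream=True,
        )
        # Stream partial results back to the client as they are produced;
        # the real implementation also sends a final message with finished=True.
        async for content in self.generate_request(adapted_request):
            yield completion_pb2.CompletionResponse(
                text=content["text"],
                finished=False,
            )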

Notes:

This is an experimental feature! As of now:

  • The proto definitions of CompletionRequest and CompletionResponse DO NOT match 100% with their OAI counterparts in protocol.py.

  • The gRPC CompletionService implementation DOES NOT fully match v1_completion yet, as the latter contains several branches. The current implementation covers the simplest case: a streaming request/response workload.

After we get enough feedback and reviews, I can work on addressing the above.

Tests

Please see bench_serving.py, where I added an example gRPC benchmark.
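
Roughly, the client side of the benchmark issues a streaming Complete call like the simplified sketch below; this is not the exact bench_serving.py code, and the stub module path, field names, and port are approximations:

import asyncio

import grpc

from sglang.srt.grpc import completion_pb2, completion_pb2_grpc  # assumed stub location


async def run_one_request(prompt: str, host: str = "localhost", port: int = 30001) -> str:
    async with grpc.aio.insecure_channel(f"{host}:{port}") as channel:
        stub = completion_pb2_grpc.CompletionServiceStub(channel)
        request = completion_pb2.CompletionRequest(
            prompt=prompt,
            max_tokens=128,
            temperature=0.0,
            stream=True,
        )
        # The server streams partial completions; the last message has finished=True.
        async for response in stub.Complete(request):
            if response.finished:
                return response.text
    return ""


print(asyncio.run(run_one_request("The capital of France is")))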

Here are the results comparing FastAPI and gRPC for the same benchmark and server settings:

gRPC

============ Serving Benchmark Result ============
Backend:                                 sglang-grpc
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     300       
Benchmark duration (s):                  23.94     
Total input tokens:                      70805     
Total generated tokens:                  58269     
Total generated tokens (retokenized):    58254     
Request throughput (req/s):              12.53     
Input token throughput (tok/s):          2957.02   
Output token throughput (tok/s):         2433.48   
Total token throughput (tok/s):          5390.50   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4363.72   
Median E2E Latency (ms):                 3313.38   
---------------Time to First Token----------------
Mean TTFT (ms):                          841.61    
Median TTFT (ms):                        897.15    
P99 TTFT (ms):                           1439.15   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.70     
Median TPOT (ms):                        19.81     
P99 TPOT (ms):                           223.82    
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.13     
Median ITL (ms):                         16.22     
P99 ITL (ms):                            32.96     
==================================================

FastAPI

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max reqeuest concurrency:                not set   
Successful requests:                     300       
Benchmark duration (s):                  24.00     
Total input tokens:                      70805     
Total generated tokens:                  58269     
Total generated tokens (retokenized):    58267     
Request throughput (req/s):              12.50     
Input token throughput (tok/s):          2949.97   
Output token throughput (tok/s):         2427.67   
Total token throughput (tok/s):          5377.64   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4504.54   
Median E2E Latency (ms):                 3411.37   
---------------Time to First Token----------------
Mean TTFT (ms):                          1023.32   
Median TTFT (ms):                        990.22    
P99 TTFT (ms):                           1666.15   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.10     
Median TPOT (ms):                        19.82     
P99 TPOT (ms):                           237.27    
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.02     
Median ITL (ms):                         16.90     
P99 ITL (ms):                            28.88     
==================================================

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@MrAta MrAta changed the title Ata/grpc Add grpc server for completion request Dec 13, 2024
@rkooo567 (Contributor)

Hi, @MrAta is this ready to be reviewed?

@rkooo567 (Contributor)

I am seeing some test failures.

@MrAta (Contributor, Author) commented Dec 15, 2024

Hi, @MrAta is this ready to be reviewed?

Hey! No, it's not! It's still a WIP draft. I'll make it ready for review once done.

@MrAta MrAta changed the title Add grpc server for completion request [Experimental] Add grpc server for completion request Dec 15, 2024
@MrAta MrAta marked this pull request as ready for review December 15, 2024 09:21
@MrAta MrAta changed the title [Experimental] Add grpc server for completion request [Experimental] Add a gRPC server for completion request Dec 15, 2024
@MrAta (Contributor, Author) commented Dec 15, 2024

@rkooo567 It's ready for review now. Admittedly, the protos are minimal and need some iteration.

@Ying1123 (Member)

@MrAta With the current design, the HTTP server and gRPC server cannot be turned on simultaneously. Could you find a way to have both? One option is to launch the HTTP server the original way, and then launch the gRPC server in the tokenizer manager when the first request comes in (because only then will the event loop be created). The gRPC server can use get_event_loop so that the HTTP and gRPC servers share the same event loop.
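
Concretely, the lazy start could look something like this sketch; the stub and servicer names are placeholders, and the grpc_server attribute is hypothetical:

import grpc

from sglang.srt.grpc import completion_pb2_grpc  # assumed stub location
# CompletionServicer as sketched in the PR description above.


async def maybe_start_grpc_server(tokenizer_manager, grpc_port: int):
    # Called lazily on the first request, when the FastAPI event loop
    # already exists, so the gRPC server can share that loop.
    if getattr(tokenizer_manager, "grpc_server", None) is not None:
        return
    server = grpc.aio.server()
    completion_pb2_grpc.add_CompletionServiceServicer_to_server(
        CompletionServicer(tokenizer_manager.generate_request), server
    )
    server.add_insecure_port(f"[::]:{grpc_port}")
    await server.start()  # serves in the background on the currently running (shared) loop
    tokenizer_manager.grpc_server = server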

MrAta added 12 commits December 18, 2024 01:04 (each signed off by Ata Fatahi)
MrAta added 18 commits December 18, 2024 01:04 (each signed off by Ata Fatahi)
@MrAta (Contributor, Author) commented Dec 18, 2024

@MrAta With the current design, the HTTP server and gRPC server cannot be turned on simultaneously. Could you find a way to have both? One option is to launch the HTTP server the original way, and then launch the gRPC server in the tokenizer manager when the first request comes in (because only then will the event loop be created). The gRPC server can use get_event_loop so that the HTTP and gRPC servers share the same event loop.

@Ying1123 done.

@RinRin-32 (Contributor)

@MrAta Hello, any chance this could work with /generate using input_embeds? #2052

I think with gRPC, input_embeds based usage of Sglang would benefit hugely!

@MrAta (Contributor, Author) commented Dec 23, 2024

@MrAta Hello, any chance this could work with /generate using input_embeds? #2052

I think with gRPC, input_embeds based usage of Sglang would benefit hugely!

Yes, that's part of the plan.

@RinRin-32 (Contributor)

I tried your implementation last night; it turns out adapting it to input_embeds is surprisingly painless. It involves modifying your completion.proto slightly, as well as altering the gRPC server to take either prompt or input_embeds.

The issue I had was that it was awfully slow, likely because of the type conversion involved.
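
For what it's worth, one way to cut the conversion cost would be to ship the embeddings as raw bytes plus a shape instead of repeated floats; the fields below (input_embeds as bytes, embeds_shape) are hypothetical and not what I actually changed:

import numpy as np


def decode_input_embeds(request):
    # Reinterpreting the byte buffer avoids per-element Python conversion,
    # which is usually the slow part when embeddings travel as repeated floats.
    flat = np.frombuffer(request.input_embeds, dtype=np.float32)
    return flat.reshape(tuple(request.embeds_shape))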

@rkooo567 (Contributor) left a comment


When launching the server, it now starts both a FastAPI server and a gRPC server (when --grpc-port is provided).

Is this an intentional design? Is there a way to launch only the gRPC server?

Also, a couple of comments:

  • I think we need to be a little more careful about the default values.
  • Also, can you add tests?
  • We should handle the case where the client closes the connection (we probably want to abort the request). This doesn't need to be done in this PR.
  • For missing/unsupported fields, can you give us a list and maybe create an issue to track them?

@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
"psutil", "pydantic", "python-multipart",
"pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
"xgrammar>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6", "grpcio==1.68.1", "grpcio-tools==1.68.1"]

Is the version restriction necessary?


try:
# Create gRPC request with same parameters as FastAPI
request = completion_pb2.CompletionRequest(

This doesn't seem to match (e.g., return logprobs or **request_func_input.extra_request_body,)

parser.add_argument(
"--grpc-port",
type=int,
help="If not set, the default port is configured according to its default value for different LLM Inference Engines.",

Let's add a comment that this is only for the sglang backend and is used only when the gRPC backend is selected (--backend="sglang-grpc")?

):
self.generate_request = generate_request

async def Complete(

Suggested change
async def Complete(
async def complete(

float presence_penalty = 9;
bool stream = 10;
repeated string stop = 11;
bool ignore_eos = 12;

What fields are missing here? Maybe add comments? (e.g., add all fields and comment out the unsupported ones)

@@ -0,0 +1,38 @@
# -*- coding: utf-8 -*-

How is this re-built whenever we merge a PR and change the schema?
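
For example, a CI step or make target could regenerate the stubs with grpcio-tools whenever completion.proto changes; the paths below are just an illustration:

from grpc_tools import protoc

# Regenerate the Python stubs from completion.proto (paths are assumptions).
protoc.main([
    "grpc_tools.protoc",
    "-Ipython/sglang/srt/proto",
    "--python_out=python/sglang/srt/proto",
    "--grpc_python_out=python/sglang/srt/proto",
    "python/sglang/srt/proto/completion.proto",
])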


The SRT server consists of an HTTP server and the SRT engine.

1. HTTP server: A FastAPI server that routes requests to the engine.
1. HTTP server: A FastAPI server that routes requests to the engine. Alternatively, it can be a gRPC server.

Is "alternatively" the right word? Thought we are going to start both servers

sampling_params={
"max_new_tokens": request.max_tokens,
"temperature": request.temperature,
"top_p": request.top_p,

This part is dangerous. Protobuf uses a default value of 0 for unset fields, which means top_p and top_k could end up different from the engine defaults (1 and -1 respectively).

I think we need some sort of mechanism to set the default values. One potential solution is to use optional fields in protobuf and set the correct defaults when they are not provided.
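
Something along these lines; the field names follow the quoted snippet, and the fallback values shown are illustrative rather than the engine's actual defaults:

def build_sampling_params(request):
    # HasField works on proto3 scalars only when they are declared `optional`.
    return {
        "max_new_tokens": request.max_tokens if request.HasField("max_tokens") else 128,
        "temperature": request.temperature if request.HasField("temperature") else 1.0,
        "top_p": request.top_p if request.HasField("top_p") else 1.0,
        "top_k": request.top_k if request.HasField("top_k") else -1,
    }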

# Send final response with finished flag
final_response = completion_pb2.CompletionResponse(
text=content["text"], # Final complete text
finished=True,

QQ: how does a regular HTTP completion request figure out whether it is finished?

)

# Process request through tokenizer manager
async for content in self.generate_request(adapted_request):

We need to handle the client closing the connection and abort the request (I think we can do it using the context object). We can also do it as a follow-up; in that case, can you create a separate issue?
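
For the follow-up, something like this sketch inside the servicer's Complete method, using the grpc.aio context; abort_request and the _adapt/_to_response helpers are hypothetical placeholders:

    async def Complete(self, request, context):
        adapted_request = self._adapt(request)  # hypothetical helper

        def on_rpc_done(ctx):
            # Fired when the RPC terminates; if the client cancelled or went
            # away, stop generating for this request on the engine side.
            if ctx.cancelled():
                self.tokenizer_manager.abort_request(adapted_request.rid)

        context.add_done_callback(on_rpc_done)
        async for content in self.generate_request(adapted_request):
            yield self._to_response(content)  # hypothetical helper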

@zifeitong (Contributor)

Just want to chime in that the gRPC Python server isn't very performant: LesnyRumcajs/grpc_bench#441

triton-inference-server removed gRPC for performance reasons:
triton-inference-server/python_backend#42
