[Experimental] Add a gRPC server for completion request #2478
base: main
Conversation
Hi, @MrAta is this ready to be reviewed?
I am seeing some test failures.
Hey! No, it's not! It's still WIP and drafted. Will make it ready for review once done.
@rkooo567 It's ready for review now. Admittedly, the protos are minimal and need some iteration.
@MrAta From the current design, the HTTP server and gRPC server cannot be turned on simultaneously. Could you find a way to have both? One way is to use the original way to launch the HTTP server, and then launch the gRPC server in the tokenizer manager when the first request comes (because only then will the event loop be created). The grpc server can use
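A minimal sketch of the lazy-startup idea described above, not the PR's actual implementation. The module path, class, and hook names are assumptions for illustration; the `add_CompletionServiceServicer_to_server` helper follows the naming convention grpcio-tools generates for the `CompletionService` defined in this PR's proto.

```python
import grpc

# Assumed import path for the modules generated from completion.proto.
from sglang.srt.grpc import completion_pb2_grpc


class TokenizerManager:
    """Only the pieces relevant to lazily starting the gRPC server are sketched."""

    def __init__(self, grpc_port: int, completion_servicer):
        self.grpc_port = grpc_port
        self.completion_servicer = completion_servicer
        self._grpc_server = None

    async def _ensure_grpc_server(self):
        # Called from the first request handler, so the asyncio event loop
        # is guaranteed to exist by the time the grpc.aio server starts.
        if self._grpc_server is None:
            self._grpc_server = grpc.aio.server()
            completion_pb2_grpc.add_CompletionServiceServicer_to_server(
                self.completion_servicer, self._grpc_server
            )
            self._grpc_server.add_insecure_port(f"0.0.0.0:{self.grpc_port}")
            await self._grpc_server.start()
```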
Signed-off-by: Ata Fatahi <[email protected]>
@Ying1123 done.
I tried your implementation last night; it turns out adapting to input_embeds is surprisingly painless. It involves modifying your completion.proto slightly, as well as altering the grpc server to take either prompt or input_embeds. The issue I had was that it was awfully slow, likely because of the type conversion involved.
Now when launching the server, it launches both a FastAPI and a gRPC server (when --grpc-port is provided).
Is this an intentional design? Is there a way to only launch the gRPC server?
Also, a couple of comments:
- I think we need to be a little more careful about the default values
- Also, can you add tests? (A minimal client sketch is shown after this list.)
- We should handle the case where the client closes the connection (we probably want to abort the request in this case). This doesn't need to be done in this PR
- For missing / unsupported fields, can you give us a list and maybe create an issue to track them?
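To illustrate the kind of test that could exercise the new endpoint, here is a sketch, not taken from this PR: the stub and message names are assumed to come from the modules grpcio-tools generates for completion.proto, the port is a placeholder, and pytest-asyncio is assumed for the async test.

```python
import grpc
import pytest

# Assumed names of the modules generated from completion.proto.
from sglang.srt.grpc import completion_pb2, completion_pb2_grpc


@pytest.mark.asyncio  # requires pytest-asyncio
async def test_grpc_complete_streams_until_finished():
    # Port is a placeholder; use whatever --grpc-port the test server was launched with.
    async with grpc.aio.insecure_channel("localhost:30001") as channel:
        stub = completion_pb2_grpc.CompletionServiceStub(channel)
        request = completion_pb2.CompletionRequest(
            prompt="The capital of France is",
            max_tokens=8,
            temperature=0.0,
            stream=True,
        )
        last = None
        async for response in stub.Complete(request):
            last = response
        # The final streamed message should carry the finished flag.
        assert last is not None and last.finished
```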
@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
"psutil", "pydantic", "python-multipart",
"pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
"xgrammar>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6"]
srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer>=0.1.6", "grpcio==1.68.1", "grpcio-tools==1.68.1"]
Is the version restriction necessary?
try:
    # Create gRPC request with same parameters as FastAPI
    request = completion_pb2.CompletionRequest(
sglang/python/sglang/bench_serving.py, line 315 (commit 2125898):
payload = {
This doesn't seem to match (e.g., return logprobs or **request_func_input.extra_request_body).
parser.add_argument(
    "--grpc-port",
    type=int,
    help="If not set, the default port is configured according to its default value for different LLM Inference Engines.",
Let's comment that this is only for the sglang backend and is used only when the gRPC backend is selected with --backend="sglang-grpc"?
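For example, the help text could be reworded along these lines (the wording below is only a suggestion, not from the PR):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--grpc-port",
    type=int,
    help=(
        "Port of the sglang gRPC server. Only used by the sglang backend, "
        "and only when the benchmark is run with --backend sglang-grpc."
    ),
)
```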
):
    self.generate_request = generate_request

async def Complete(
Suggested change:
- async def Complete(
+ async def complete(
float presence_penalty = 9;
bool stream = 10;
repeated string stop = 11;
bool ignore_eos = 12;
What are the missing fields here? Maybe add comments? (e.g., add all fields and comment out the unsupported ones)
@@ -0,0 +1,38 @@
# -*- coding: utf-8 -*-
How is it rebuilt whenever we merge a PR and change the schema?
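One way to keep the generated modules in sync is to regenerate them from the proto in CI or a pre-commit hook. Here is a sketch using grpcio-tools (which this PR already adds as a dependency); the file layout is assumed:

```python
# regenerate_protos.py - run from the directory containing completion.proto.
from grpc_tools import protoc

ret = protoc.main(
    [
        "grpc_tools.protoc",
        "-I.",                  # proto search path (assumed layout)
        "--python_out=.",       # emits completion_pb2.py (messages)
        "--grpc_python_out=.",  # emits completion_pb2_grpc.py (service stubs)
        "completion.proto",
    ]
)
raise SystemExit(ret)
```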
The SRT server consists of an HTTP server and the SRT engine.

1. HTTP server: A FastAPI server that routes requests to the engine.
1. HTTP server: A FastAPI server that routes requests to the engine. Alternatively, it can be a gRPC server.
Is "alternatively" the right word? Thought we are going to start both servers
sampling_params={
    "max_new_tokens": request.max_tokens,
    "temperature": request.temperature,
    "top_p": request.top_p,
This part is dangerous: gRPC protobuf uses 0 as the default value for unset fields, which means top_p and top_k could end up different from their intended defaults (1 and -1, respectively).
I think we need some sort of mechanism to set the default values. One potential solution is to use optional fields in protobuf and set the correct defaults when they are not provided.
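A sketch of the optional-field approach suggested above: if the scalar fields are declared `optional` in completion.proto, their presence can be checked on the Python side and the real defaults applied when they are unset. The field names mirror the snippet above; the top_p and top_k defaults (1 and -1) come from this comment, while the temperature default is only illustrative.

```python
def build_sampling_params(request) -> dict:
    # With proto3 `optional` scalars, HasField() distinguishes "unset" from 0.
    return {
        "max_new_tokens": request.max_tokens,
        "temperature": request.temperature if request.HasField("temperature") else 1.0,
        "top_p": request.top_p if request.HasField("top_p") else 1.0,
        "top_k": request.top_k if request.HasField("top_k") else -1,
    }
```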
# Send final response with finished flag
final_response = completion_pb2.CompletionResponse(
    text=content["text"],  # Final complete text
    finished=True,
QQ: how does a regular HTTP completion request figure out if it is finished?
)

# Process request through tokenizer manager
async for content in self.generate_request(adapted_request):
We need to handle the client closing the connection and abort the request (I think we can do it using the context object). We can also do it as a follow-up; in that case, can you create a separate issue?
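A sketch of what that could look like with the grpc.aio servicer context; the abort hook and helper names are hypothetical, but add_done_callback/cancelled are part of the grpc.aio API.

```python
class CompletionServicer:
    def __init__(self, generate_request, tokenizer_manager):
        self.generate_request = generate_request
        self.tokenizer_manager = tokenizer_manager

    async def Complete(self, request, context):
        adapted_request, rid = self._adapt(request)  # hypothetical helper

        def _on_rpc_done(ctx):
            # Fired on RPC termination; abort generation if the client went away.
            if ctx.cancelled():
                self.tokenizer_manager.abort_request(rid)  # hypothetical hook

        context.add_done_callback(_on_rpc_done)

        async for content in self.generate_request(adapted_request):
            yield self._to_response(content)  # hypothetical helper
```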
Just want to chime in that the gRPC Python server isn't very performant (LesnyRumcajs/grpc_bench#441); triton-inference-server removed gRPC for performance reasons.
Motivation
Being able to launch sglang as a gRPC service in production.
Modifications
Now when launching the server, it launches both a FastAPI and a gRPC server (when --grpc-port is provided).

Notes:
This is an experimental feature! As of now:
- The proto definitions of CompletionRequest and CompletionResponse DO NOT match 100% with their OAI counterparts in protocol.py.
- The gRPC CompletionService implementation DOES NOT match 100% with v1_completion yet, as the latter contains several branches. The current implementation is the simplest case for a streaming request/response workload.

After we get enough feedback and reviews, I can work on addressing the above.
Tests
Please see bench_serving.py, where I added a gRPC example benchmark.

Here are the results comparing FastAPI and gRPC for the same benchmark and server settings:
gRPC
============ Serving Benchmark Result ============
Backend: sglang-grpc
Traffic request rate: inf
Max reqeuest concurrency: not set
Successful requests: 300
Benchmark duration (s): 23.94
Total input tokens: 70805
Total generated tokens: 58269
Total generated tokens (retokenized): 58254
Request throughput (req/s): 12.53
Input token throughput (tok/s): 2957.02
Output token throughput (tok/s): 2433.48
Total token throughput (tok/s): 5390.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4363.72
Median E2E Latency (ms): 3313.38
---------------Time to First Token----------------
Mean TTFT (ms): 841.61
Median TTFT (ms): 897.15
P99 TTFT (ms): 1439.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.70
Median TPOT (ms): 19.81
P99 TPOT (ms): 223.82
---------------Inter-token Latency----------------
Mean ITL (ms): 18.13
Median ITL (ms): 16.22
P99 ITL (ms): 32.96
==================================================
FastAPI
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: not set
Successful requests: 300
Benchmark duration (s): 24.00
Total input tokens: 70805
Total generated tokens: 58269
Total generated tokens (retokenized): 58267
Request throughput (req/s): 12.50
Input token throughput (tok/s): 2949.97
Output token throughput (tok/s): 2427.67
Total token throughput (tok/s): 5377.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4504.54
Median E2E Latency (ms): 3411.37
---------------Time to First Token----------------
Mean TTFT (ms): 1023.32
Median TTFT (ms): 990.22
P99 TTFT (ms): 1666.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.10
Median TPOT (ms): 19.82
P99 TPOT (ms): 237.27
---------------Inter-token Latency----------------
Mean ITL (ms): 18.02
Median ITL (ms): 16.90
P99 ITL (ms): 28.88
==================================================
Checklist