Why vLLM was not used for srt #1770
Closed
Venkat2811 started this conversation in General
Replies: 1 comment · 3 replies
- The vLLM dependency will be removed.
Hello @merrymercy @zhyncs,

Thanks for this amazing project; it was very easy to get running locally. I've been scouting the vLLM and sglang inference-engine implementation details and a few recent PRs, and catching up on your meetings via the YouTube channel and the learning-materials repo. I also see that you use vLLM as a dependency and reuse parts of its implementation. I am no expert by any means, just trying to understand why RadixAttention was not implemented as one of the attention backends in vLLM instead.

I would like to understand your vision and roadmap for this project. I understand that the current vLLM architecture makes it difficult to land large changes like the ones in sglang. The vLLM team is working on architecture 2.0 to address several pain points, and is also making various improvements: decreasing CPU overhead, supporting different types of KV caches, a multi-process API server and engine, TP, PP, EP, LMCache, structured output generation, etc.
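For context on why RadixAttention keeps coming up here: its core idea is sharing KV-cache entries across requests that have a common token prefix, organized in a radix/prefix tree. Below is a toy sketch of that idea in plain Python — stand-in values instead of real KV tensors, a plain trie instead of a compressed radix tree, and none of the eviction logic; it is illustrative only, not sglang's actual implementation.

```python
# Toy sketch of RadixAttention-style prefix caching (illustrative only,
# NOT sglang's implementation): requests that share a token prefix reuse
# the cached "KV" entries for that prefix instead of re-running prefill.

class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.kv_slot = None  # stand-in for a cached KV-cache block

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or child.kv_slot is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record (fake) KV slots for every token of a finished request."""
        node = self.root
        for i, t in enumerate(tokens):
            node = node.children.setdefault(t, TrieNode())
            if node.kv_slot is None:
                node.kv_slot = ("kv", tuple(tokens[: i + 1]))

cache = PrefixCache()
cache.insert([1, 2, 3, 4])               # first request fills the cache
hit = cache.match_prefix([1, 2, 3, 9])   # second request shares a 3-token prefix
print(hit)  # -> 3: only token 9 needs fresh prefill
```

The question above is essentially whether this tree-structured cache could have lived behind vLLM's attention/KV-cache interfaces rather than in a separate runtime.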
Since you've also worked on the vLLM project, sharing the details of your motivation would be very helpful. Thanks in advance!
Thanks,
Venkat