Releases: sgl-project/sglang

Release v0.3.4.post1

22 Oct 04:30
1f26e8b

Highlights

  • Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
    • Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
  • Added an Engine API for offline inference with reduced overhead (see the sketch after this list). Usage. #1614 #1567
  • Added an overlap scheduler for reducing CPU overhead #1738
  • New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM-4 (#1736).
  • Added support for reward models #1525.
  • Added support for Intel XPU #1480.
  • Improved stability for greedy decoding #1589.
  • Accelerated multi-LoRA serving #1587.

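The Engine item above refers to running SGLang in-process, without launching an HTTP server. A minimal sketch of that offline workflow, assuming the interface described in the linked usage docs (sgl.Engine, a generate call that takes a prompt list and a sampling-params dict); the model path is only an example:

```python
# Hypothetical offline-inference sketch for the new Engine API (#1614, #1567).
# Assumes sgl.Engine and its generate()/shutdown() methods behave as in the SGLang docs.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")  # example model

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)  # runs in-process, no HTTP server
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()  # release GPU resources when done
```
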
What's Changed

Release v0.3.2

02 Oct 17:19
37c5899

Highlights

  • Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA (see the sketch after this list) #1442 #1422
  • Initial support for multi-LoRA serving #1307
  • Integrate torchao for quantization #1341
  • Optimize CPU scheduler overhead
  • Multiple critical bug fixes for Llama and LLaVA (tokenizer, modality handling)
  • Support AMD backend #1420
  • New models: MiniCPM3, OLMoE

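To try the Triton attention backend with torch.compile and CUDA graph, the server is launched with extra flags and then queried through the native generate endpoint. A minimal sketch; the flag names in the comment are assumptions based on later SGLang documentation and may differ in this release:

```python
# Assumed launch command (flag names are an assumption, not confirmed for v0.3.2):
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
#       --attention-backend triton --enable-torch-compile --port 30000
import requests

# Query the native /generate endpoint of the running server.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(response.json()["text"])
```
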
What's Changed

Release v0.3.0

19 Sep 10:09
5ab9418

Highlights

Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.

  • Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
  • Up to 1.5x lower latency with torch.compile on small batch sizes
  • Support for interleaved text and multi-image/video in LLaVA-OneVision (see the example after this list)
  • Support for interleaved window attention and 2x longer context length in Gemma-2
  • Chunked prefill is enabled by default (you can choose to run prefill and decode separately or mixed).
  • Multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.

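For the LLaVA-OneVision item above, interleaved text and images can be sent through the OpenAI-compatible chat API. A sketch, assuming the server was launched with a LLaVA-OneVision checkpoint and that the standard OpenAI image_url content parts are accepted; see the blog post above for the exact setup:

```python
# Hypothetical multi-image request against SGLang's OpenAI-compatible endpoint.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # the model loaded at server launch; "default" is typically accepted
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is different between these two images?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},  # placeholder URLs
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
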
What's Changed

Release v0.2.13

19 Sep 10:08
5bd9537

Highlights

  • New features: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked prefill by default (#1040 #984), support all sampling penalties (#973)
  • New models: Support the embedding model e5-mistral (#983 #987 #988 #997 #1014) with a comprehensive OpenAI-compatible API (see the embedding example after this list).
  • Performance: Accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
  • More CI tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), and MoE tests
  • Refactoring and fixes: more modular code, better stability, and more kernels from FlashInfer (#907)

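The e5-mistral support above is exposed through the OpenAI-compatible embeddings endpoint. A minimal sketch, assuming a server launched with an e5-mistral checkpoint in embedding mode (launch flags omitted here):

```python
# Hypothetical embedding request against SGLang's OpenAI-compatible /v1/embeddings endpoint.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.embeddings.create(
    model="default",
    input=["query: how much protein should a female eat"],
)
print(len(response.data[0].embedding))  # dimension of the returned embedding vector
```
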
What's Changed

Release v0.2.9

02 Aug 08:55
30a9b2e

Highlights

  • New feature: Chunked prefill (#800, #811)
  • New models: DeepSeek-V2
  • Performance improvement: vectorized logprob computation
  • Accuracy fixes: fixed the double-BOS problem in the chat template; moved logits to float32; updated FlashInfer sampling kernels
  • Feature fixes: filled in many missing logprob-related features in the OpenAI API server (see the example after this list)
  • CI/CD infrastructure is now fully in place, covering frontend, backend, accuracy, and performance tests.

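The logprob-related fixes above can be exercised through the standard logprobs parameter of the OpenAI completions API. A sketch; whether every field below is populated by this release is an assumption:

```python
# Hypothetical logprob request against SGLang's OpenAI-compatible completions endpoint.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=8,
    temperature=0,
    logprobs=3,  # ask for logprobs of the top 3 tokens at each position
)
choice = response.choices[0]
print(choice.text)
print(choice.logprobs.token_logprobs)  # per-token logprobs of the sampled tokens
```
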
What's Changed

New Contributors

Full Changelog: v0.2.5...v0.2.9

Release v0.2.5

26 Jul 19:56
5bd06b4

Highlights

  • We recently published a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B on A100 and H100 GPUs, using FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.

  • We have now automated the release processes for PyPI, Docker, and GitHub Releases using GitHub workflows. Previously, because GitHub Releases were not automated, tags were not updated in time, which led to a jump from v0.2.0 directly to v0.2.5.

  • Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!

Release v0.2.0

25 Jul 15:58
1a491d0

Highlights

  • We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
  • New models: Llama 3 405B, DeepSeek MoE, InternLM, GPTBigCode, Mistral-Nemo

What's Changed

New Contributors

Release v0.1.20

14 Jul 00:33
5d264a9

Highlights

  • Enable CUDA graph by default, bringing a 1.5x-2x speedup for small-batch-size decoding (#612)
  • Model support: Gemma 2, MiniCPM, Qwen2 MoE
  • Docker support (#217)
  • Various latency optimizations

What's Changed

New Contributors

Full Changelog: v0.1.18...v0.1.20

Release v0.1.18

04 Jul 06:35

Highlights

  • 2x improvement for large-batch prefill with the new FlashInfer kernels #579
  • Multi-node tensor parallelism #550
  • New model support: ChatGLM #516

What's Changed

New Contributors

Full Changelog: v0.1.17...v0.1.18

Release v0.1.17

08 Jun 02:58
e8a2327

Highlights

  • Add data parallelism #480
  • Add speculative execution for OpenAI API #250
  • Update vLLM to v0.4.3 for new quantization features #511
  • Better error handling (#457, #449, #514)

What's Changed

New Contributors

Full Changelog: v0.1.16...v0.1.17