
Releases: sgl-project/sglang

Release v0.4.1

25 Dec 23:27
efc52f8

Highlights

  • We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3 - currently the strongest open-source LLM, even surpassing GPT-4o.

    The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang has also supported MLA optimization and DP attention since earlier releases, making it one of the best open-source LLM engines for running DeepSeek models (see the example after this list).

    Special thanks to Meituan's Search & Recommend Platform Team (@ispobock, @HandH1998) and Baseten's Model Performance Team for implementing the model, and to DataCrunch for providing GPU resources.

  • Various improvements to the cache-aware sglang router, torchao integration, and server termination.

  • Added a standalone package, sgl-kernel, to support more custom kernels in the codebase.
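
To make the DeepSeek V3 support concrete, here is a minimal sketch of querying a server running the model through the OpenAI-compatible API. The launch command in the comment, the port, and the tensor-parallel size are assumptions based on a typical multi-GPU setup, so adjust them for your hardware and SGLang version.

```python
from openai import OpenAI

# Assumed launch command (8-GPU node; flags may vary by version):
#   python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
#       --tp 8 --trust-remote-code --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Summarize what MLA optimization does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```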

What's Changed


Release v0.4.0

04 Dec 02:14
f8b0326

Highlights

Blog post: https://lmsys.org/blog/2024-12-04-sglang-v0-4/

We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:

  • Zero-overhead batch scheduler: 1.1x increase in throughput.
  • Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
  • Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
  • Fast structured outputs with xgrammar: up to 10x faster (example below).
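
As a quick illustration of the xgrammar-based structured outputs, the sketch below sends a JSON-schema-constrained request to the native /generate endpoint. The --grammar-backend flag, the json_schema field, and the model path are assumptions drawn from the structured-output docs and may differ by version.

```python
import json
import requests

# Illustrative only. Assumes a server launched roughly like:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --grammar-backend xgrammar --port 30000
schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Give me information about the capital of France in JSON.",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0,
            "json_schema": schema,  # grammar-constrained decoding
        },
    },
)
print(resp.json()["text"])
```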

What's Changed


Release v0.3.6

22 Nov 11:36
9a00e6f

Highlights

  • Reduce CPU overhead by enabling the overlap scheduler by default, for 1.1x higher throughput (#2105, #2067, #2095)
  • Support data parallelism for attention and MLA, yielding 1.5x higher decoding throughput (#1970, #2061)
  • Cache-aware load balancer with 4x higher cache hit rate (#1934)
  • Support xgrammar backend for grammar-guided decoding (#2056)
  • Support Prometheus metrics (#1853, #1981); see the metrics sketch after this list
  • Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
  • Support graceful termination (#1838) and watchdog (#1816)
  • Support notebook-style documentation (https://sgl-project.github.io/)
  • Add an offline benchmark script (#1968)
  • Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
  • New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
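
For the Prometheus metrics item, here is a minimal sketch of scraping the server's /metrics endpoint. The --enable-metrics flag and the sglang: metric prefix are assumptions, so check the metrics docs for your version.

```python
import requests

# Illustrative only. Assumes a server launched with metrics enabled, e.g.
#   python -m sglang.launch_server --model-path <model> --port 30000 --enable-metrics
metrics_text = requests.get("http://localhost:30000/metrics").text
for line in metrics_text.splitlines():
    if line.startswith("sglang:"):  # keep only SGLang-prefixed Prometheus series
        print(line)
```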

What's Changed


Release v0.3.4.post1

22 Oct 04:30
1f26e8b

Highlights

  • Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
    • Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
  • Added an Engine API for offline inference with reduced overhead (sketch after this list). Usage. #1614 #1567
  • Added an overlap scheduler for reducing CPU overhead #1738
  • New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM 4 (#1736).
  • Added support for reward models #1525.
  • Added support for Intel XPU #1480.
  • Improved stability for greedy decoding #1589.
  • Accelerated multi-LoRA serving #1587.
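
Here is a minimal sketch of the new offline Engine API, assuming the interface shown in the linked usage example; the model path is a placeholder and argument names may vary across versions.

```python
import sglang as sgl

# Minimal sketch of the offline Engine API (no HTTP server involved).
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```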

What's Changed


Release v0.3.2

02 Oct 17:19
37c5899

Highlights

  • Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA #1442 #1422
  • Initial support for multi-LoRA serving (sketch below) #1307
  • Integrate torchao for quantization #1341
  • Optimize the CPU scheduler overhead
  • Multiple critical bug fixes for Llama and LLaVA (tokenizer, modality)
  • Support AMD backend #1420
  • New models: MiniCPM3, OLMoE
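
To illustrate the initial multi-LoRA support, the sketch below selects an adapter per request. The --lora-paths launch flag and the lora_path request field are assumptions based on the LoRA serving docs, so treat them as illustrative for this early version.

```python
import requests

# Illustrative multi-LoRA request. Assumes a server launched roughly like:
#   python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-hf \
#       --lora-paths lora0=/path/to/adapter_a lora1=/path/to/adapter_b
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a haiku about GPUs.",
        "sampling_params": {"max_new_tokens": 48},
        "lora_path": "lora0",  # which adapter should serve this request
    },
)
print(resp.json()["text"])
```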

What's Changed


Release v0.3.0

19 Sep 10:09
5ab9418

Highlights

Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.

  • Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
  • Up to 1.5x lower latency with torch.compile on small batch sizes
  • Support for interleaved text and multi-image/video in LLaVA-OneVision (sketch below)
  • Support for interleaved window attention and 2x longer context length in Gemma-2
  • Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mixed).
  • Added multi-GPU accuracy and performance tests, plus a nightly accuracy test for more models.
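
As an illustration of the interleaved multi-image support, here is a sketch of a two-image chat request through the OpenAI-compatible endpoint; the model name and image URLs are placeholders.

```python
from openai import OpenAI

# Sketch of an interleaved multi-image chat request (placeholders throughout).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lmms-lab/llava-onevision-qwen2-7b-ov",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two images."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```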

What's Changed


Release v0.2.13

19 Sep 10:08
5bd9537

Highlights

  • New features: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked prefill by default (#1040 #984), and support all sampling penalties (#973)
  • New models: Support the embedding model e5-mistral (#983 #987 #988 #997 #1014) with a comprehensive OpenAI-compatible API (embedding sketch below).
  • Performance: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
  • More CI tests: Accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), and MoE tests
  • Refactoring and fixes: More modular code, better stability, and more kernels from FlashInfer (#907)
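
A minimal sketch of the embedding support through the OpenAI-compatible API, assuming the server is launched with e5-mistral in embedding mode; the exact launch flags are not shown and should be taken from the docs.

```python
from openai import OpenAI

# Sketch of the embedding endpoint with e5-mistral (served in embedding mode).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["query: how does radix cache reuse work?"],
)
print(len(result.data[0].embedding))  # embedding dimension
```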

What's Changed


Release v0.2.9

02 Aug 08:55
30a9b2e

Highlights

  • New feature: Chunked prefill (#800, #811)
  • New models: DeepSeek V2
  • Performance improvement: vectorized logprob computation
  • Accuracy fixes: fixed the double-BOS problem in the chat template, moved logits to float32, and updated the FlashInfer sampling kernels
  • Feature fixes: implemented many missing logprob-related features in the OpenAI API server (sketch below)
  • CI/CD infrastructure is now fully ready; the tests cover frontend, backend, accuracy, and performance.
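
To show one of the logprob-related features, here is a sketch of requesting per-token logprobs through the OpenAI-compatible completions endpoint; the parameters follow the standard OpenAI spec, and "default" as the model name is a placeholder.

```python
from openai import OpenAI

# Sketch of a per-token logprob request via the completions API.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=8,
    temperature=0,
    logprobs=3,  # request top-3 logprobs per generated token
)
print(completion.choices[0].logprobs.token_logprobs)
```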

What's Changed

New Contributors

Full Changelog: v0.2.5...v0.2.9

Release v0.2.5

26 Jul 19:56
5bd06b4

Highlights

  • We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B on A100 and H100 GPUs with FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.

  • We have now automated the release processes for PyPI, Docker, and GitHub Releases using GitHub workflows. Previously, because releases were not automated, GitHub tags were not updated in time, which led to the version jumping directly from v0.2.0 to v0.2.5.

  • Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!

Release v0.2.0

25 Jul 15:58
1a491d0

Highlights

  • We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B on A100 and H100 GPUs with FP8 and FP16. See the latest blog post.
  • New models: Llama 3 405B, DeepSeek MoE, InternLM, GPTBigCode, Mistral-Nemo

What's Changed

New Contributors
