Release v0.2.0
Highlights
- We performed extensive engineering work to improve base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, across models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. See the latest blog post for details. A minimal client sketch against the OpenAI-compatible server follows these highlights.
- New models: Llama 3.1 405B, DeepSeek MoE, InternLM2, GPTBigCode, Mistral-Nemo
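As an illustration of the online serving scenario mentioned above, here is a minimal sketch that queries a running SGLang server through its OpenAI-compatible completions endpoint (related changes: OpenAI API parallel sampling in #640 and the API update in #667). It assumes a server was started separately, e.g. `python -m sglang.launch_server --model-path <model> --port 30000`; the model name, prompt, and sampling parameters below are illustrative placeholders, not part of this release.

```python
# Sketch only: query an already-running SGLang server via the
# OpenAI-compatible /v1/completions endpoint on the default port 30000.
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "default",   # model id as configured on your server; adjust to your deployment
        "prompt": "List three uses of FP8 inference.",
        "max_tokens": 64,
        "temperature": 0.7,
        "n": 2,               # parallel sampling, added in this release (#640 / #667)
    },
    timeout=60,
)
response.raise_for_status()
for choice in response.json()["choices"]:
    print(choice["text"])
```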
What's Changed
- Optimize mem indices management by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte…` by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in #640
- @max99x made their first contribution in #684
Full Changelog: v0.1.20...v0.2.0