Release v0.2.0
Highlights
- We performed extensive engineering work to improve base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, across models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. See the latest blog post for details. A minimal client sketch against the OpenAI-compatible server follows these highlights.
- New models: Llama 3.1 405B, DeepSeek MoE, InternLM2, GPTBigCode, Mistral-Nemo
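As an illustration of the online serving scenario mentioned above, here is a minimal sketch that queries a running SGLang server through its OpenAI-compatible completions endpoint (related changes: OpenAI API parallel sampling in #640 and the API update in #667). It assumes a server was started separately, e.g. `python -m sglang.launch_server --model-path <model> --port 30000`; the model name, prompt, and sampling parameters below are illustrative placeholders, not part of this release.

```python
# Sketch only: query an already-running SGLang server via the
# OpenAI-compatible /v1/completions endpoint on the default port 30000.
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "default",   # model id as configured on your server; adjust to your deployment
        "prompt": "List three uses of FP8 inference.",
        "max_tokens": 64,
        "temperature": 0.7,
        "n": 2,               # parallel sampling, added in this release (#640 / #667)
    },
    timeout=60,
)
response.raise_for_status()
for choice in response.json()["choices"]:
    print(choice["text"])
```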
What's Changed
- Optimize mem indices management by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte…` by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in #640
- @max99x made their first contribution in #684
Full Changelog: v0.1.20...v0.2.0