Release v0.2.9
## Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fixes: fixed the double BOS problem in the chat template; moved sampling logits to float32; updated the flashinfer sampling kernels
- Feature fixes: implemented several missing logprob-related features in the OpenAI API server
- CI/CD infrastructure is now fully in place, with coverage for the frontend, the backend, accuracy, and performance.
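As a minimal sketch of the new chat-API logprob support (#852), assuming an SGLang server is already running with its OpenAI-compatible endpoint, a client request body would look like this (the model name and token counts are placeholders; the field names follow the OpenAI chat-completions schema):

```python
import json

# Hypothetical request body for the OpenAI-compatible
# /v1/chat/completions endpoint; logprob support in the chat API
# landed in #852.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 8,
    "logprobs": True,     # request per-token log probabilities
    "top_logprobs": 5,    # and the 5 most likely alternatives per token
}

# Serialize exactly as an HTTP client would before POSTing it.
body = json.dumps(payload)
print(body)
```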
## What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + logprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` run as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
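The server-side flags touched in this release can be combined in a launch command like the following sketch (the model path and sizes are placeholders; check `python -m sglang.launch_server --help` in your installed version for the exact flag names):

```shell
# Hypothetical launch command exercising flags from this release:
#   --chunked-prefill-size   chunked prefill (#800, #811)
#   --max-total-tokens       total token budget (#840)
#   --served-model-name      custom model id (#749)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-chat-hf \
  --served-model-name my-model \
  --chunked-prefill-size 8192 \
  --max-total-tokens 16384
```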
## New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9