Release v0.2.13
Highlights
- New Features: window attention support for Gemma-2 (#1056, #1090, #1112), chunked prefill enabled by default (#1040, #984), and support for all sampling penalties (#973); a hedged usage sketch for the penalties follows this list.
- New Models: support for the e5-mistral embedding model (#983, #987, #988, #997, #1014), served through a comprehensive OpenAI-compatible API; see the embedding sketch after the change list below.
- Performance: accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI Tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), and MoE tests.
- Refactors and fixes: more modular code, better stability, and more kernels from FlashInfer (#907).
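To make the new sampling penalties concrete, here is a minimal sketch that sends them through the server's native generate endpoint. The address, endpoint path, and exact parameter names are assumptions based on a default local setup; verify them against your installed version.

```python
# Hedged sketch: exercise the sampling penalties added in #973 against a
# locally running server (assumed default address http://localhost:30000).
import requests

response = requests.post(
    "http://localhost:30000/generate",  # assumed native generate endpoint
    json={
        "text": "List three uses of window attention:",
        "sampling_params": {
            "max_new_tokens": 128,
            "min_new_tokens": 16,       # controls added in #973
            "frequency_penalty": 0.5,   # added in #973
            "presence_penalty": 0.3,    # added in #973
            "repetition_penalty": 1.05, # added in #973
        },
    },
)
print(response.json()["text"])
```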
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && Python 3.8 compatibility by @hnyls2002 in #920
- Support more OpenAI API tests by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys being unable to profile CUDA kernels by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default, track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options for usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Promote some warnings to crashes in CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: fix some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix: Fix returned prefill logits and add output str test by @Ying1123 in #1046
- feat: update Dockerfile by @zhyncs in #1033
- docs: update setup github runner by @zhyncs in #1050
- Add longer accuracy test on CI by @merrymercy in #1049
- Fix accuracy test by @merrymercy in #1051
- Re-organize CI tests by @merrymercy in #1052
- chore: bump v0.2.12 by @zhyncs in #1048
- feat: replace all rmsnorm and silu by @zhyncs in #1057
- fix: not use the default port by @zhyncs in #1068
- Fix layernorm input shape by @ispobock in #1066
- fix: temporary solution for DeepSeek V2 H100 layout conversion issue by @zhyncs in #1060
- ci: add cancel pr workflow by @zhyncs in #1070
- ci: add moe test by @zhyncs in #1053
- fix: use devel for Triton's compiler requirements by @zhyncs in #1074
- ci: add accuracy timeout by @zhyncs in #1078
- Fix create_abort_task, GenerateReqInput does not have rids. by @gryffindor-rr in #1079
- Example file for docker compose and k8s by @LucienShui in #1006
- Update the mixtral to use the better FusedMoE layer by @merrymercy in #1081
- [Feat] Add window attention for gemma-2 by @Ying1123 in #1056
- Fix jump forward final state circular path bug. by @hnyls2002 in #1084
- ci: update timeout and retry by @zhyncs in #1086
- [Feature] modify Runtime to support skip_tokenizer_init by @gryffindor-rr in #1088
- Fix a bug in cuda graph runner by @merrymercy in #1094
- ci: remove workflow path trigger by @zhyncs in #1096
- docs: update README by @zhyncs in #1098
- Update grok 1 model by @merrymercy in #1095
- docs: update pr template by @zhyncs in #1099
- Use `dtype` to control generation by @hnyls2002 in #1082
- [Fix] Compatibility of window attention and cuda graph by @Ying1123 in #1090
- docs: update nsys usage by @zhyncs in #1103
- Support `stop_token_ids` in sglang API by @hnyls2002 in #1092
- Support jinja as chat template file by @Ying1123 in #1104
- Use a single workspace for flashinfer by @merrymercy in #1077
- [Fix] fix the typo bug for window attention by @Ying1123 in #1106
- Enable chunked prefill by default by @merrymercy in #1040
- [Fix] fix flashinfer usage for window attention by @Ying1123 in #1107
- misc: rm unused model_loader by @zhyncs in #1110
- [Fix] Window attention compatible with RadixAttention and chunked prefill by @Ying1123 in #1112
- set CUDA_DEVICE_MAX_CONNECTIONS=1 by @merrymercy in #1113
- chore: bump v0.2.13 by @zhyncs in #1111
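As referenced in the highlights, here is a hedged sketch of the new OpenAI-compatible embedding API (#997) with list input (#1014). The launch command, model path, and base URL are assumptions for illustration; check the server's --help output for the exact options in your version.

```python
# Hedged sketch: query the OpenAI-compatible embedding endpoint of a local
# server, assumed to be launched with an embedding model, e.g.:
#   python -m sglang.launch_server \
#       --model-path intfloat/e5-mistral-7b-instruct --port 30000
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["window attention", "chunked prefill"],  # list input, per #1014
)
print(len(resp.data), len(resp.data[0].embedding))
```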
New Contributors
- @min-xu-et made their first contribution in #896
- @mpjlu made their first contribution in #957
- @xiezhq-hermann made their first contribution in #969
- @foszto made their first contribution in #971
- @vhain made their first contribution in #973
- @liuyhwangyh made their first contribution in #994
- @ywang96 made their first contribution in #1005
- @gryffindor-rr made their first contribution in #959
- @LucienShui made their first contribution in #1006
Full Changelog: v0.2.9...v0.2.13