Release v0.2.13
Highlights
- New Features: window attention support for Gemma-2 (#1056, #1090, #1112), chunked prefill enabled by default (#1040, #984), and support for all sampling penalties (#973); a hedged usage sketch for the penalties follows this list.
- New Models: support for the e5-mistral embedding model (#983, #987, #988, #997, #1014), served through a comprehensive OpenAI-compatible API; see the embedding sketch after the change list below.
- Performance: accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI Tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance tests), and MoE tests.
- Refactors and fixes: more modular code, better stability, and more kernels from FlashInfer (#907).
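To make the new sampling penalties concrete, here is a minimal sketch that sends them through the server's native generate endpoint. The address, endpoint path, and exact parameter names are assumptions based on a default local setup; verify them against your installed version.

```python
# Hedged sketch: exercise the sampling penalties added in #973 against a
# locally running server (assumed default address http://localhost:30000).
import requests

response = requests.post(
    "http://localhost:30000/generate",  # assumed native generate endpoint
    json={
        "text": "List three uses of window attention:",
        "sampling_params": {
            "max_new_tokens": 128,
            "min_new_tokens": 16,       # controls added in #973
            "frequency_penalty": 0.5,   # added in #973
            "presence_penalty": 0.3,    # added in #973
            "repetition_penalty": 1.05, # added in #973
        },
    },
)
print(response.json()["text"])
```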
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && Python 3.8 compatibility by @hnyls2002 in #920
- Support more OpenAI API tests by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys being unable to profile CUDA kernels by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default, track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options for usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Promote some warnings to crashes in CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: fix some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix: Fix returned prefill logits and add output str test by @Ying1123 in #1046
- feat: update Dockerfile by @zhyncs in #1033
- docs: update setup github runner by @zhyncs in #1050
- Add longer accuracy test on CI by @merrymercy in #1049
- Fix accuracy test by @merrymercy in #1051
- Re-organize CI tests by @merrymercy in #1052
- chore: bump v0.2.12 by @zhyncs in #1048
- feat: replace all rmsnorm and silu by @zhyncs in #1057
- fix: not use the default port by @zhyncs in #1068
- Fix layernorm input shape by @ispobock in #1066
- fix: temporary solution for DeepSeek V2 H100 layout conversion issue by @zhyncs in #1060
- ci: add cancel pr workflow by @zhyncs in #1070
- ci: add moe test by @zhyncs in #1053
- fix: use devel for Triton's compiler requirements by @zhyncs in #1074
- ci: add accuracy timeout by @zhyncs in #1078
- Fix create_abort_task, GenerateReqInput does not have rids. by @gryffindor-rr in #1079
- Example file for docker compose and k8s by @LucienShui in #1006
- Update the mixtral to use the better FusedMoE layer by @merrymercy in #1081
- [Feat] Add window attention for gemma-2 by @Ying1123 in #1056
- Fix jump forward final state circular path bug. by @hnyls2002 in #1084
- ci: update timeout and retry by @zhyncs in #1086
- [Feature] modify Runtime to support skip_tokenizer_init by @gryffindor-rr in #1088
- Fix a bug in cuda graph runner by @merrymercy in #1094
- ci: remove workflow path trigger by @zhyncs in #1096
- docs: update README by @zhyncs in #1098
- Update grok 1 model by @merrymercy in #1095
- docs: update pr template by @zhyncs in #1099
- Use `dtype` to control generation by @hnyls2002 in #1082
- [Fix] Compatibility of window attention and cuda graph by @Ying1123 in #1090
- docs: update nsys usage by @zhyncs in #1103
- Support `stop_token_ids` in sglang API by @hnyls2002 in #1092
- Support jinja as chat template file by @Ying1123 in #1104
- Use a single workspace for flashinfer by @merrymercy in #1077
- [Fix] fix the typo bug for window attention by @Ying1123 in #1106
- Enable chunked prefill by default by @merrymercy in #1040
- [Fix] fix flashinfer usage for window attention by @Ying1123 in #1107
- misc: rm unused model_loader by @zhyncs in #1110
- [Fix] Window attention compatible with RadixAttention and chunked prefill by @Ying1123 in #1112
- set CUDA_DEVICE_MAX_CONNECTIONS=1 by @merrymercy in #1113
- chore: bump v0.2.13 by @zhyncs in #1111
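As referenced in the highlights, here is a hedged sketch of the new OpenAI-compatible embedding API (#997) with list input (#1014). The launch command, model path, and base URL are assumptions for illustration; check the server's --help output for the exact options in your version.

```python
# Hedged sketch: query the OpenAI-compatible embedding endpoint of a local
# server, assumed to be launched with an embedding model, e.g.:
#   python -m sglang.launch_server \
#       --model-path intfloat/e5-mistral-7b-instruct --port 30000
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["window attention", "chunked prefill"],  # list input, per #1014
)
print(len(resp.data), len(resp.data[0].embedding))
```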
New Contributors
- @min-xu-et made their first contribution in #896
- @mpjlu made their first contribution in #957
- @xiezhq-hermann made their first contribution in #969
- @foszto made their first contribution in #971
- @vhain made their first contribution in #973
- @liuyhwangyh made their first contribution in #994
- @ywang96 made their first contribution in #1005
- @gryffindor-rr made their first contribution in #959
- @LucienShui made their first contribution in #1006
Full Changelog: v0.2.9...v0.2.13