# Release v0.3.0
## Highlights
Check out the release blog post at https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them; see the launch sketch after this list).
- Added multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
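To make the torch.compile and chunked-prefill highlights concrete, here is a minimal sketch, not part of the release itself: the model path and port are placeholders, and the flag names follow the release blog post and the server's `--help` output.

```python
# Launch the server with torch.compile enabled (biggest latency win at small batch sizes):
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
#     --enable-torch-compile --port 30000
#
# Chunked prefill is now on by default; the chunk size can be tuned with
# --chunked-prefill-size, and --enable-mixed-chunk mixes prefill and decode batches.

import requests

# Query the running server through its native /generate endpoint.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(resp.json()["text"])
```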
## What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && timeout limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat] Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health by generating 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support updating weights without restarting the server by @shanyu-sys in #1157 (client sketch after this list)
- Improve multi-node stability by @merrymercy in #1171
- fix: fall back to native forward for custom ops below sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167 (example after this list)
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder, (3) openai api compatible server by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support the Alibaba-NLP/gte-Qwen2-7B-instruct embedding model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125 (example after this list)
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support cancelling batches by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix the bench_latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data race due to mutable references by using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize updating the flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy in #1289
- Support Triton fp8 e5m2 kv cache by @ispobock in #1286
- [triton] Remove the zero initialization of qk_acc by directly writing the result by @ByronHsu in #1288
- [Chore] Rename model_overide_args to model_override_args by @kevin85421 in #1284
- Allow new lines during JSON generation by @qeternity in #1277
- fix: resolve fp8 for mixtral by @zhyncs in #1290
- ci: add nightly eval by @zhyncs in #1291
- Fix the flaky tests in test_moe_eval_accuracy_large.py by @merrymercy in #1293
- [doc] Fix more broken links by @ByronHsu in #1294
- Fix regex mask by @hnyls2002 in #1296
- Fix hang when doing s += None. by @max99x in #1297
- Release v0.2.15 by @merrymercy in #1295
- feat: update nightly gsm8k eval by @zhyncs in #1304
- Fix bugs in sampler with CUDA graph / torch.compile by @hnyls2002 in #1306
- [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping by @merrymercy in #1308
- Support Phi3 mini and medium by @janimo in #1299
- Update README.md for llava-onevision instructions by @merrymercy in #1313
- Fix llama2 weight loader by @merrymercy in #1317
- Fix select by ensuring each request has at least one token by @merrymercy in #1318
- misc: speedup load safetensors by @zhyncs in #1319
- chore: bump v0.3.0 by @zhyncs in #1320
- Fix the flaky test test_moe_eval_accuracy_large.py by @merrymercy in #1326
- docs: update news by @zhyncs in #1327
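A few of the changes above are easiest to understand from the client side. First, weight updates without a server restart (#1157): a hedged sketch, assuming a server already running on localhost:30000 and a placeholder checkpoint path; the endpoint name and payload follow the PR description.

```python
import requests

# Ask the running server to reload weights in place (no restart). The new
# checkpoint must match the architecture of the currently loaded model.
resp = requests.post(
    "http://localhost:30000/update_weights",
    json={"model_path": "/path/to/new/checkpoint"},  # placeholder path
)
print(resp.json())
```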
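Next, min-p sampling (#1167), which filters out tokens whose probability falls below a fraction of the top token's probability. It is passed through `sampling_params` on the native `/generate` endpoint; the server address is again a placeholder.

```python
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a haiku about autumn:",
        "sampling_params": {
            "temperature": 1.0,
            "min_p": 0.1,  # drop tokens below 10% of the top token's probability
            "max_new_tokens": 48,
        },
    },
)
print(resp.json()["text"])
```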
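Finally, JSON-constrained decoding (#1125) can be driven from the frontend language through the `json_schema` argument to `sgl.gen` (#1254). The schema and endpoint below are illustrative assumptions.

```python
import json
import sglang as sgl

# Constrain generation to objects matching this (illustrative) JSON schema.
schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
})

@sgl.function
def gen_character(s):
    s += "Describe a fictional character as JSON: "
    s += sgl.gen("character", max_tokens=128, json_schema=schema)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = gen_character.run()
print(state["character"])  # should parse against the schema above
```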
## New Contributors
- @Michaelvll made their first contribution in #1144
- @Xu-Chen made their first contribution in #1148
- @shanyu-sys made their first contribution in #1157
- @intervitens made their first contribution in #1167
- @zhaochenyang20 made their first contribution in #1186
- @havetc made their first contribution in #1125
- @caiyueliang made their first contribution in #1222
- @ArtificialZeng made their first contribution in #1236
- @lxww302 made their first contribution in #1260
- @Deepfocused made their first contribution in #1258
- @ByronHsu made their first contribution in #1282
- @xiaobochen123 made their first contribution in #1262
- @kevin85421 made their first contribution in #1284
**Full Changelog**: v0.2.13...v0.3.0