# Release v0.3.0
## Highlights
Check out the release blog post at https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them; see the launch sketch after this list).
- Added multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
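To make the torch.compile and chunked-prefill highlights concrete, here is a minimal sketch, not part of the release itself: the model path and port are placeholders, and the flag names follow the release blog post and the server's `--help` output.

```python
# Launch the server with torch.compile enabled (biggest latency win at small batch sizes):
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
#     --enable-torch-compile --port 30000
#
# Chunked prefill is now on by default; the chunk size can be tuned with
# --chunked-prefill-size, and --enable-mixed-chunk mixes prefill and decode batches.

import requests

# Query the running server through its native /generate endpoint.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(resp.json()["text"])
```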
## What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && timeout limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat] Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health by generating 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support updating weights without restarting the server by @shanyu-sys in #1157 (client sketch after this list)
- Improve multi-node stability by @merrymercy in #1171
- fix: fall back to native forward for custom ops below sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167 (example after this list)
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder, (3) openai api compatible server by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support the Alibaba-NLP/gte-Qwen2-7B-instruct embedding model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125 (example after this list)
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support cancelling batches by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix the bench_latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data race due to mutable references by using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize updating the flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy in #1289
- Support Triton fp8 e5m2 kv cache by @ispobock in #1286
- [triton] Remove the zero initialization of qk_acc by directly writing the result by @ByronHsu in #1288
- [Chore] Rename model_overide_args to model_override_args by @kevin85421 in #1284
- Allow new lines during JSON generation by @qeternity in #1277
- fix: resolve fp8 for mixtral by @zhyncs in #1290
- ci: add nightly eval by @zhyncs in #1291
- Fix the flaky tests in test_moe_eval_accuracy_large.py by @merrymercy in #1293
- [doc] Fix more broken links by @ByronHsu in #1294
- Fix regex mask by @hnyls2002 in #1296
- Fix hang when doing s += None. by @max99x in #1297
- Release v0.2.15 by @merrymercy in #1295
- feat: update nightly gsm8k eval by @zhyncs in #1304
- Fix bugs in sampler with CUDA graph / torch.compile by @hnyls2002 in #1306
- [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping by @merrymercy in #1308
- Support Phi3 mini and medium by @janimo in #1299
- Update README.md for llava-onevision instructions by @merrymercy in #1313
- Fix llama2 weight loader by @merrymercy in #1317
- Fix select by ensuring each request has at least one token by @merrymercy in #1318
- misc: speedup load safetensors by @zhyncs in #1319
- chore: bump v0.3.0 by @zhyncs in #1320
- Fix the flaky test test_moe_eval_accuracy_large.py by @merrymercy in #1326
- docs: update news by @zhyncs in #1327
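A few of the changes above are easiest to understand from the client side. First, weight updates without a server restart (#1157): a hedged sketch, assuming a server already running on localhost:30000 and a placeholder checkpoint path; the endpoint name and payload follow the PR description.

```python
import requests

# Ask the running server to reload weights in place (no restart). The new
# checkpoint must match the architecture of the currently loaded model.
resp = requests.post(
    "http://localhost:30000/update_weights",
    json={"model_path": "/path/to/new/checkpoint"},  # placeholder path
)
print(resp.json())
```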
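Next, min-p sampling (#1167), which filters out tokens whose probability falls below a fraction of the top token's probability. It is passed through `sampling_params` on the native `/generate` endpoint; the server address is again a placeholder.

```python
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a haiku about autumn:",
        "sampling_params": {
            "temperature": 1.0,
            "min_p": 0.1,  # drop tokens below 10% of the top token's probability
            "max_new_tokens": 48,
        },
    },
)
print(resp.json()["text"])
```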
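Finally, JSON-constrained decoding (#1125) can be driven from the frontend language through the `json_schema` argument to `sgl.gen` (#1254). The schema and endpoint below are illustrative assumptions.

```python
import json
import sglang as sgl

# Constrain generation to objects matching this (illustrative) JSON schema.
schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
})

@sgl.function
def gen_character(s):
    s += "Describe a fictional character as JSON: "
    s += sgl.gen("character", max_tokens=128, json_schema=schema)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = gen_character.run()
print(state["character"])  # should parse against the schema above
```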
## New Contributors
- @Michaelvll made their first contribution in #1144
- @Xu-Chen made their first contribution in #1148
- @shanyu-sys made their first contribution in #1157
- @intervitens made their first contribution in #1167
- @zhaochenyang20 made their first contribution in #1186
- @havetc made their first contribution in #1125
- @caiyueliang made their first contribution in #1222
- @ArtificialZeng made their first contribution in #1236
- @lxww302 made their first contribution in #1260
- @Deepfocused made their first contribution in #1258
- @ByronHsu made their first contribution in #1282
- @xiaobochen123 made their first contribution in #1262
- @kevin85421 made their first contribution in #1284
**Full Changelog**: v0.2.13...v0.3.0