Releases: sgl-project/sglang
Release v0.3.4.post1
Highlights
- Hosted the first LMSYS online meetup, Efficient LLM Deployment and Serving, covering CPU overhead hiding, faster constrained decoding, and DeepSeek MLA (slides available).
- Added an Engine API for offline inference with reduced overhead (see the sketch after this list). #1614 #1567
- Added an overlap scheduler for reducing CPU overhead #1738
- New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM-4 (#1736).
- Added support for reward models #1525.
- Added support for Intel XPU #1480.
- Improved stability for greedy decoding #1589.
- Accelerated multi-LoRA serving #1587.
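For those trying the new Engine API, below is a minimal sketch of offline inference without an HTTP server, assuming the `sgl.Engine` entry point added in #1567/#1614; the model path and sampling parameters are illustrative and may need adjusting for your setup.

```python
# Minimal sketch of offline inference with the Engine API (no HTTP server).
# The model path and sampling parameters below are illustrative.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```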
What's Changed
- [Fix] Ignore model import error by @merrymercy in #1513
- minor: fix config by @hnyls2002 in #1524
- [Event] Update meeting link by @Ying1123 in #1529
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
- Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
- [FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
- [bugfix]Add modelscope package to avoid docker image without modelscope by @KylinMountain in #1520
- Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
- Multiple minor fixes by @merrymercy in #1530
- Make detokenizer_manager.py not asyncio by @merrymercy in #1532
- Organize image inputs by @hnyls2002 in #1531
- Improve process creation by @merrymercy in #1534
- fix ipv6 url when warm up model by @cauyxy in #1537
- Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
- Process image in parallel by @hnyls2002 in #1539
- Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
- Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
- Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
- [Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
- Organize Attention Backends by @hnyls2002 in #1547
- Fix bugs of `logprobs_nums` by @hnyls2002 in #1548
- Dispatch flashinfer wrappers by @hnyls2002 in #1550
- Simplify flashinfer dispatch by @hnyls2002 in #1552
- [Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
- [Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
- [Fix] Fix all the Huggingface paths by @tbarton16 in #1553
- [Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
- [Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
- Move status check in the memory pool to CPU by @merrymercy in #1557
- [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
- [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
- Organize sampling batch info better by @merrymercy in #1562
- Use ipc instead of tcp in zmq by @merrymercy in #1566
- Make input_ids a torch.Tensor by @merrymercy in #1568
- [Minifix] Remove extra space in cot example by @FredericOdermatt in #1569
- [Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
- Refine the add request reasons to avoid corner cases. by @hnyls2002 in #1574
- chore: update README.md by @eltociear in #1580
- [Easy] use .text() instead of .text by @ByronHsu in #1577
- [Event] Update README.md by @Ying1123 in #1572
- Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
- Backend method not found when SRT Runtime is used by @ByronHsu in #1576
- default sampling param should be deepcopied by @ByronHsu in #1581
- Fix styling by @ByronHsu in #1583
- Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
- Support min_tokens in sgl.gen by @ByronHsu in #1573
- [Minor] Improve the style and fix flaky tests by @merrymercy in #1584
- [Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
- Clean up event loop by @merrymercy in #1586
- [LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
- [Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
- Test consistency for single and batch separately by @ByronHsu in #1590
- Update README.md by @merrymercy in #1591
- Fix modality for image inputs by @merrymercy in #1592
- Provide an offline engine API by @ByronHsu in #1567
- [Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
- Use `atexit` hook to implicitly shutdown `Runtime` by @ByronHsu in #1595
- Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
- Fix chunked prefill condition by @ispobock in #1594
- Fix the port_args in bench_latency by @merrymercy in #1597
- Remove references to squeezellm by @janimo in #1603
- [Profile] Add pytorch profiler by @Ying1123 in #1604
- [Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
- Release v0.3.3 by @merrymercy in #1605
- [Minor] Fix logging typo by @amosyou in #1615
- Fix test_vision_openai_server on CI by @ByronHsu in #1620
- [Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
- Update README.md by @kushal34712 in #1625
- Update README.md by @merrymercy in #1629
- Add device support by @liangan1 in #1607
- Nit about the decorator of `PortArgs.init_new` by @glen-amd in #1611
- [Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
- Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
- Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py by @merrymercy in #1631
- Add image_token in conversation.py by @merrymercy in #1632
- Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
- Fix constrained decoding by @merrymercy in #1634
- Add back data parallelism by @merrymercy in #1635
- Release v0.3.3.post1 by @merrymercy in #1636
- [engine] support async and streaming by @ByronHsu in #1614
- [Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
- fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
- Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
- Fix...
Release v0.3.2
Highlights
- Support torch.compile and CUDA graph for the triton attention backend and DeepSeek MLA (see the launch sketch after this list) #1442 #1422
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for llama and llava (tokenizer, modality)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
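As a rough illustration of the first highlight, the sketch below launches the server with the triton attention backend and torch.compile enabled, using the `--attention-backend` and `--enable-torch-compile` flags introduced around this release; the model path and port are illustrative.

```python
# Hedged sketch: launching the HTTP server with the triton attention backend
# and torch.compile enabled. Model path and port are illustrative.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--attention-backend", "triton",
    "--enable-torch-compile",
    "--port", "30000",
])
```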
What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight ttft itl and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing `model_override_args` to `launch_server` via the CLI. by @kevin85421 in #1298
- [Minor] Many cleanup by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- fix bug of undefined `is_single` in meth `create_abort_task` by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add an unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no commit to main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_p in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: create new dict every time for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Pr fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy i...
Release v0.3.0
Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to separate or mix prefill and decode; see the launch sketch after this list).
- Added multi-GPU accuracy and performance tests, and a nightly accuracy test for more models.
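Since chunked prefill is now on by default, the launch sketch below shows how the chunk size can be tuned; the `--chunked-prefill-size` flag name follows this release, while the value and model path are illustrative.

```python
# Hedged sketch: adjusting the chunked prefill chunk size at launch.
# The chunk size and model path are illustrative.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "google/gemma-2-9b-it",
    "--chunked-prefill-size", "4096",
])
```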
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat]Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New features: window attention for Gemma-2 (#1056 #1090 #1112), chunked prefill enabled by default (#1040 #984), and support for all sampling penalties (#973).
- New models: support for the embedding model e5-mistral (#983 #987 #988 #997 #1014) and a comprehensive OpenAI-compatible API (see the embeddings sketch after this list).
- Performance: accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance), and MoE tests.
- Refactors and fixes: more modular code, better stability, and more kernels from flashinfer (#907).
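To illustrate the embedding support, here is a hedged sketch that queries the OpenAI-compatible embeddings endpoint of a locally running server assumed to be serving e5-mistral on port 30000; the URL, API key, and model name are illustrative.

```python
# Hedged sketch: calling the OpenAI-compatible /v1/embeddings endpoint of a
# local server assumed to serve e5-mistral. URL, key, and model are illustrative.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["SGLang release notes", "embedding models are supported"],
)
print(len(response.data[0].embedding))  # embedding dimension
```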
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default, track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fixes: fixed the double BOS problem in the chat template, moved logits to float32, and updated the flashinfer sampling kernels.
- Feature fixes: added many missing logprob-related features to the OpenAI API server (see the sketch after this list).
- CI/CD infrastructure is now fully ready; the tests cover the frontend, backend, accuracy, and performance.
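As an illustration of the logprob fixes, the hedged sketch below requests per-token logprobs (with prompt echo) from the OpenAI-compatible completions endpoint of a local server; the URL and model name are illustrative.

```python
# Hedged sketch: requesting token logprobs with prompt echo from the
# OpenAI-compatible /v1/completions endpoint. URL and model name are illustrative.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=3,   # top-3 logprobs per output token
    echo=True,    # also return logprobs for the prompt tokens
)
print(response.choices[0].logprobs.token_logprobs)
```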
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + logprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and GitHub Releases using GitHub workflows. Previously, because GitHub Releases were not automated, tags were not updated in time, which led to a jump from v0.2.0 directly to v0.2.5.
- Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
What's Changed
- Optimize mem indices management by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte…` by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217 )
- Various latency optimizations
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add `LogitsMetadata` by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to `sgl.gen` API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill, move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20
Release v0.1.18
Highlights
- 2x large batch prefill improvement with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new arguments log_level_http to control the HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the number of thread limitation for tp worker managers. by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18
Release v0.1.17
Highlights
- Add data parallelism #480
- Add speculative execution for OpenAI API #250
- Update vllm to v0.4.3 for new quantization features #511
- Better error handling (#457, #449, #514)
What's Changed
- [Feat] Add llava qwen, llava mistral by @kcz358 in #419
- Format code by @hnyls2002 in #441
- Add finish_reason to OpenAI API by @mgerstgrasser in #446
- Simplify port allocation by @merrymercy in #447
- Add PUT for generate api by @Ying1123 in #448
- Improve error handling & abort disconnected requests by @merrymercy in #449
- Fix the broken `--disable-radix-cache` by @hnyls2002 in #451
- openai chat speculative execution by @ChuyueSun in #250
- Fix openai speculative execution by @Ying1123 in #456
- Abort disconnected requests by @merrymercy in #457
- Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
- Use model loader from vllm by @merrymercy in #459
- port fp8 mixtral by @merrymercy in #460
- fix test bug in srt_llava_next_test.py by @bingwork in #470
- Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
- Improve logging & add logit cap by @merrymercy in #471
- Optimize retract by @hnyls2002 in #440
- Add benchmark scripts by @Ying1123 in #476
- [Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
- Improve benchmark scripts & rename some scripts by @merrymercy in #477
- Improve benchmark scripts & add more models by @merrymercy in #484
- Support data parallelism (static) by @Ying1123 in #480
- Make the server random by default by @merrymercy in #488
- Revert "Make the server random by default" by @Ying1123 in #492
- update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
- Make the server random by default by @merrymercy in #493
- Update vllm to v0.4.3 by @merrymercy in #511
- remove redundant pad_input_ids function by @amosyou in #500
- Litellm Backend by @huyiwen in #502
- Fix rid state map leak + Refactor .finished by @Qubitium in #505
- Crash the server when error or OOM happens by @merrymercy in #514
- Update version to 0.1.17 by @merrymercy in #515
New Contributors
- @kcz358 made their first contribution in #419
- @mgerstgrasser made their first contribution in #446
- @bingwork made their first contribution in #470
- @amosyou made their first contribution in #500
- @huyiwen made their first contribution in #502
Full Changelog: v0.1.16...v0.1.17