Releases: sgl-project/sglang
Release v0.3.4.post1
Highlights
- Hosted the first LMSYS online meetup, Efficient LLM Deployment and Serving, covering CPU overhead hiding, faster constrained decoding, and DeepSeek MLA (slides available).
- Added an Engine API for offline inference with reduced overhead (see the sketch after this list). #1614 #1567
- Added an overlap scheduler for reducing CPU overhead #1738
- New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM-4 (#1736).
- Added support for reward models #1525.
- Added support for Intel XPU #1480.
- Improved stability for greedy decoding #1589.
- Accelerated multi-LoRA serving #1587.
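For those trying the new Engine API, below is a minimal sketch of offline inference without an HTTP server, assuming the `sgl.Engine` entry point added in #1567/#1614; the model path and sampling parameters are illustrative and may need adjusting for your setup.

```python
# Minimal sketch of offline inference with the Engine API (no HTTP server).
# The model path and sampling parameters below are illustrative.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 32}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```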
What's Changed
- [Fix] Ignore model import error by @merrymercy in #1513
- minor: fix config by @hnyls2002 in #1524
- [Event] Update meeting link by @Ying1123 in #1529
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
- Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
- [FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
- [bugfix]Add modelscope package to avoid docker image without modelscope by @KylinMountain in #1520
- Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
- Multiple minor fixes by @merrymercy in #1530
- Make detokenizer_manager.py not asyncio by @merrymercy in #1532
- Organize image inputs by @hnyls2002 in #1531
- Improve process creation by @merrymercy in #1534
- fix ipv6 url when warm up model by @cauyxy in #1537
- Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
- Process image in parallel by @hnyls2002 in #1539
- Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
- Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
- Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
- [Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
- Organize Attention Backends by @hnyls2002 in #1547
- Fix bugs of `logprobs_nums` by @hnyls2002 in #1548
- Dispatch flashinfer wrappers by @hnyls2002 in #1550
- Simplify flashinfer dispatch by @hnyls2002 in #1552
- [Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
- [Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
- [Fix] Fix all the Huggingface paths by @tbarton16 in #1553
- [Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
- [Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
- Move status check in the memory pool to CPU by @merrymercy in #1557
- [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
- [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
- Organize sampling batch info better by @merrymercy in #1562
- Use ipc instead of tcp in zmq by @merrymercy in #1566
- Make input_ids a torch.Tensor by @merrymercy in #1568
- [Minifix] Remove extra space in cot example by @FredericOdermatt in #1569
- [Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
- Refine the add request reasons to avoid corner cases. by @hnyls2002 in #1574
- chore: update README.md by @eltociear in #1580
- [Easy] use .text() instead of .text by @ByronHsu in #1577
- [Event] Update README.md by @Ying1123 in #1572
- Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
- Backend method not found when SRT Runtime is used by @ByronHsu in #1576
- default sampling param should be deepcopied by @ByronHsu in #1581
- Fix styling by @ByronHsu in #1583
- Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
- Support min_tokens in sgl.gen by @ByronHsu in #1573
- [Minor] Improve the style and fix flaky tests by @merrymercy in #1584
- [Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
- Clean up event loop by @merrymercy in #1586
- [LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
- [Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
- Test consistency for single and batch separately by @ByronHsu in #1590
- Update README.md by @merrymercy in #1591
- Fix modality for image inputs by @merrymercy in #1592
- Provide an offline engine API by @ByronHsu in #1567
- [Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
- Use `atexit` hook to implicitly shutdown `Runtime` by @ByronHsu in #1595
- Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
- Fix chunked prefill condition by @ispobock in #1594
- Fix the port_args in bench_latency by @merrymercy in #1597
- Remove references to squeezellm by @janimo in #1603
- [Profile] Add pytorch profiler by @Ying1123 in #1604
- [Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
- Release v0.3.3 by @merrymercy in #1605
- [Minor] Fix logging typo by @amosyou in #1615
- Fix test_vision_openai_server on CI by @ByronHsu in #1620
- [Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
- Update README.md by @kushal34712 in #1625
- Update README.md by @merrymercy in #1629
- Add device support by @liangan1 in #1607
- Nit about the decorator of `PortArgs.init_new` by @glen-amd in #1611
- [Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
- Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
- Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py by @merrymercy in #1631
- Add image_token in conversation.py by @merrymercy in #1632
- Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
- Fix constrained decoding by @merrymercy in #1634
- Add back data parallelism by @merrymercy in #1635
- Release v0.3.3.post1 by @merrymercy in #1636
- [engine] support async and streaming by @ByronHsu in #1614
- [Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
- fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
- Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
- Fix...
Release v0.3.2
Highlights
- Support torch.compile and CUDA graph for the triton attention backend and DeepSeek MLA (see the launch sketch after this list) #1442 #1422
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for llama and llava (tokenizer, modality)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
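As a rough illustration of the first highlight, the sketch below launches the server with the triton attention backend and torch.compile enabled, using the `--attention-backend` and `--enable-torch-compile` flags introduced around this release; the model path and port are illustrative.

```python
# Hedged sketch: launching the HTTP server with the triton attention backend
# and torch.compile enabled. Model path and port are illustrative.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--attention-backend", "triton",
    "--enable-torch-compile",
    "--port", "30000",
])
```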
What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight ttft itl and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing `model_override_args` to `launch_server` via the CLI. by @kevin85421 in #1298
- [Minor] Many cleanup by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- fix bug of undefined `is_single` in meth `create_abort_task` by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add an unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no commit to main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_p in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: create new dict every time for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Pr fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy i...
Release v0.3.0
Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to separate or mix prefill and decode; see the launch sketch after this list).
- Added multi-GPU accuracy and performance tests, and a nightly accuracy test for more models.
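Since chunked prefill is now on by default, the launch sketch below shows how the chunk size can be tuned; the `--chunked-prefill-size` flag name follows this release, while the value and model path are illustrative.

```python
# Hedged sketch: adjusting the chunked prefill chunk size at launch.
# The chunk size and model path are illustrative.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "google/gemma-2-9b-it",
    "--chunked-prefill-size", "4096",
])
```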
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat]Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New features: window attention for Gemma-2 (#1056 #1090 #1112), chunked prefill enabled by default (#1040 #984), and support for all sampling penalties (#973).
- New models: support for the embedding model e5-mistral (#983 #987 #988 #997 #1014) and a comprehensive OpenAI-compatible API (see the embeddings sketch after this list).
- Performance: accelerated Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek V2 (#905).
- More CI tests: accuracy tests (multiple benchmarks), unit tests (APIs, model implementations), E2E tests (high-pressure and performance), and MoE tests.
- Refactors and fixes: more modular code, better stability, and more kernels from flashinfer (#907).
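To illustrate the embedding support, here is a hedged sketch that queries the OpenAI-compatible embeddings endpoint of a locally running server assumed to be serving e5-mistral on port 30000; the URL, API key, and model name are illustrative.

```python
# Hedged sketch: calling the OpenAI-compatible /v1/embeddings endpoint of a
# local server assumed to serve e5-mistral. URL, key, and model are illustrative.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input=["SGLang release notes", "embedding models are supported"],
)
print(len(response.data[0].embedding))  # embedding dimension
```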
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default, track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more options about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fixes: fixed the double BOS problem in the chat template, moved logits to float32, and updated the flashinfer sampling kernels.
- Feature fixes: added many missing logprob-related features to the OpenAI API server (see the sketch after this list).
- CI/CD infrastructure is now fully ready; the tests cover the frontend, backend, accuracy, and performance.
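As an illustration of the logprob fixes, the hedged sketch below requests per-token logprobs (with prompt echo) from the OpenAI-compatible completions endpoint of a local server; the URL and model name are illustrative.

```python
# Hedged sketch: requesting token logprobs with prompt echo from the
# OpenAI-compatible /v1/completions endpoint. URL and model name are illustrative.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=3,   # top-3 logprobs per output token
    echo=True,    # also return logprobs for the prompt tokens
)
print(response.choices[0].logprobs.token_logprobs)
```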
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + logprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, with FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B, and it often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and GitHub Releases using GitHub workflows. Previously, because GitHub Releases were not automated, tags were not updated in time, which led to a jump from v0.2.0 directly to v0.2.5.
- Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
What's Changed
- Optimize mem indices management by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte…` by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update build instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...
Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217 )
- Various latency optimizations
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- add `LogitsMetadata` by @hnyls2002 in #604
- add minicpm support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for qwen and yi-1.5. by @for-just-we in #530
- [Feat] Expose logprob options to `sgl.gen` API by @huyiwen in #503
- Fix bench latency by @merrymercy in #607
- Code clean up: Remove deprecated prefill, move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memorypool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20
Release v0.1.18
Highlights
- 2x large batch prefill improvement with the new flashinfer kernels #579
- Multi-node tensor parallelism #550
- New model support: ChatGLM #516
What's Changed
- Fix missing numpy dependency in pyproject.toml by @fpreiss in #524
- Fix RAG nb, parea setup (parea -> parea-ai) by @fpreiss in #525
- [Minor] Correct Optional type hints in api by @fpreiss in #526
- Add ChatGLM Model Support by @Qubitium in #516
- Fix Regression: Disable p2p for 4090 by @ZX-ModelCloud in #531
- Decode Incrementally by @hnyls2002 in #517
- Fix dependency by @merrymercy in #538
- Fix dependency & crash issues by @Ying1123 in #539
- Higher priority for user input of max_prefill_tokens & format by @Ying1123 in #540
- Add disk cache for loading ShareGPT dataset. by @hnyls2002 in #542
- Fix tp worker only checking req[0] for stream by @Qubitium in #546
- Fix the Jump-Forward with Chinese by @hnyls2002 in #551
- Update fused_moe by @merrymercy in #553
- Multi-node Tensor Parallelism by @Ying1123 in #550
- Update flashinfer to 0.0.5 by @merrymercy in #554
- Follow-up fixes for flashinfer 0.0.5 by @merrymercy in #556
- Fix latency benchmark by @hnyls2002 in #557
- Clean up logits processor by @merrymercy in #558
- Update test_flashinfer by @hnyls2002 in #560
- Allow running with vllm==0.4.3 by @merrymercy in #561
- Add a new arguments log_level_http to control the HTTP logging by @merrymercy in #563
- Add sglang.bench_latency for offline benchmark by @merrymercy in #564
- Warmup cublas by @merrymercy in #566
- Increase the number of thread limitation for tp worker managers. by @merrymercy in #567
- Update readme by @merrymercy in #568
- Expose dtype argument by @merrymercy in #569
- Update benchmark script by @Ying1123 in #571
- Minor fix in compiler & format by @ZackZeng999 in #545
- Update run_batch interface and max_prefill_tokens by @Ying1123 in #574
- Fix flashinfer version by @PanJason in #576
- [BugFix] gemma loading weights "lm_head.weight" key error by @dhgarcia in #577
- Turn on flashinfer by default by @Ying1123 in #578
- fix the broken server args by @hnyls2002 in #585
- 2x performance improvement for large prefill & Fix workspace conflicts by @Ying1123 in #579
New Contributors
- @fpreiss made their first contribution in #524
- @ZackZeng999 made their first contribution in #545
- @PanJason made their first contribution in #576
- @dhgarcia made their first contribution in #577
Full Changelog: v0.1.17...v0.1.18
Release v0.1.17
Highlights
- Add data parallelism #480
- Add speculative execution for OpenAI API #250
- Update vllm to v0.4.3 for new quantization features #511
- Better error handling (#457, #449, #514)
What's Changed
- [Feat] Add llava qwen, llava mistral by @kcz358 in #419
- Format code by @hnyls2002 in #441
- Add finish_reason to OpenAI API by @mgerstgrasser in #446
- Simplify port allocation by @merrymercy in #447
- Add PUT for generate api by @Ying1123 in #448
- Improve error handling & abort disconnected requests by @merrymercy in #449
- Fix the broken `--disable-radix-cache` by @hnyls2002 in #451
- openai chat speculative execution by @ChuyueSun in #250
- Fix openai speculative execution by @Ying1123 in #456
- Abort disconnected requests by @merrymercy in #457
- Rename api_num_spec_tokens -> num_api_spec_tokens by @merrymercy in #458
- Use model loader from vllm by @merrymercy in #459
- port fp8 mixtral by @merrymercy in #460
- fix test bug in srt_llava_next_test.py by @bingwork in #470
- Add the instruction link to the LLaVA-NeXT-Video at README by @ZhangYuanhan-AI in #463
- Improve logging & add logit cap by @merrymercy in #471
- Optimize retract by @hnyls2002 in #440
- Add benchmark scripts by @Ying1123 in #476
- [Feat/Fix] Refactoring Llava models into single file by @Luodian in #475
- Improve benchmark scripts & rename some scripts by @merrymercy in #477
- Improve benchmark scripts & add more models by @merrymercy in #484
- Support data parallelism (static) by @Ying1123 in #480
- Make the server random by default by @merrymercy in #488
- Revert "Make the server random by default" by @Ying1123 in #492
- update the script: examples/usage/llava_video/srt_example_llava_v.sh by @ZhangYuanhan-AI in #491
- Make the server random by default by @merrymercy in #493
- Update vllm to v0.4.3 by @merrymercy in #511
- remove redundant pad_input_ids function by @amosyou in #500
- Litellm Backend by @huyiwen in #502
- Fix rid state map leak + Refactor .finished by @Qubitium in #505
- Crash the server when error or OOM happens by @merrymercy in #514
- Update version to 0.1.17 by @merrymercy in #515
New Contributors
- @kcz358 made their first contribution in #419
- @mgerstgrasser made their first contribution in #446
- @bingwork made their first contribution in #470
- @amosyou made their first contribution in #500
- @huyiwen made their first contribution in #502
Full Changelog: v0.1.16...v0.1.17