Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings a 1.5x-2x speedup for small-batch-size decoding (#612); see the launch sketch after this list
- Model support: Gemma2, MiniCPM, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
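Since CUDA graph is now on by default, nothing extra is needed to get the decoding speedup. Below is a minimal sketch of launching a runtime from Python; it assumes `sgl.Runtime` forwards server arguments as keyword arguments, and `disable_cuda_graph` is shown only as the presumed opt-out mirroring the server flag.

```python
import sglang as sgl

# Minimal sketch: start a local runtime. CUDA graph capture is enabled
# by default in v0.1.20, so no flag is needed for the small-batch speedup.
runtime = sgl.Runtime(
    model_path="meta-llama/Llama-2-7b-chat-hf",  # any supported model
    # disable_cuda_graph=True,  # assumed opt-out, mirroring --disable-cuda-graph
)
sgl.set_default_backend(runtime)

# ... run sglang programs against this backend ...

runtime.shutdown()
```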
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- Add `LogitsMetadata` by @hnyls2002 in #604
- Add MiniCPM support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for Qwen and Yi-1.5 by @for-just-we in #530
- [Feat] Expose logprob options to the `sgl.gen` API by @huyiwen in #503 (see the sketch after this list)
- Fix bench latency by @merrymercy in #607
- Code cleanup: remove deprecated prefill; move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memory pool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
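The logprob options from #503 can be used roughly as follows. This is a minimal sketch, assuming `return_logprob` and `top_logprobs_num` are the exposed keyword names and that per-token metadata is read back via `get_meta_info`; treat these names as assumptions based on the PR title, not a verbatim API reference.

```python
import sglang as sgl

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\nA:"
    s += sgl.gen(
        "ans",
        max_tokens=32,
        return_logprob=True,   # assumed kwarg: request per-token log probs
        top_logprobs_num=5,    # assumed kwarg: top-5 alternatives per step
    )

state = answer.run(question="What does CUDA graph capture speed up?")
print(state["ans"])                # generated text
print(state.get_meta_info("ans"))  # assumed accessor for the logprob metadata
```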
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20