Release v0.1.20
Highlights
- Enable CUDA graph by default. It brings a 1.5x-2x speedup for small-batch-size decoding (#612); see the launch sketch after this list
- Model support: Gemma2, MiniCPM, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
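Since CUDA graph is now on by default, nothing extra is needed to get the decoding speedup. Below is a minimal sketch of launching a runtime from Python; it assumes `sgl.Runtime` forwards server arguments as keyword arguments, and `disable_cuda_graph` is shown only as the presumed opt-out mirroring the server flag.

```python
import sglang as sgl

# Minimal sketch: start a local runtime. CUDA graph capture is enabled
# by default in v0.1.20, so no flag is needed for the small-batch speedup.
runtime = sgl.Runtime(
    model_path="meta-llama/Llama-2-7b-chat-hf",  # any supported model
    # disable_cuda_graph=True,  # assumed opt-out, mirroring --disable-cuda-graph
)
sgl.set_default_backend(runtime)

# ... run sglang programs against this backend ...

runtime.shutdown()
```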
What's Changed
- Add docker file by @Ying1123 in #588
- Add Gemma2 by @Ying1123 in #592
- Format by @Ying1123 in #593
- Fix Llava model by @wisclmy0611 in #594
- Add `--enable-p2p-check` option by @hnyls2002 in #599
- Fix streaming by @hnyls2002 in #600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in #601
- Add `LogitsMetadata` by @hnyls2002 in #604
- Add MiniCPM support by @Titan-p in #602
- Make sglang compat with vllm 0.5.1 by @M0gician in #598
- Add Qwen2 MoE support by @M0gician in #603
- Update chat template for Qwen and Yi-1.5 by @for-just-we in #530
- [Feat] Expose logprob options to the `sgl.gen` API by @huyiwen in #503 (see the sketch after this list)
- Fix bench latency by @merrymercy in #607
- Code cleanup: remove deprecated prefill; move InputMetadata to infer_batch.py by @merrymercy in #609
- Clean up the usage of flashinfer by @merrymercy in #610
- Cleanup attention backend: flashinfer and triton by @merrymercy in #611
- Enable cuda graph by default by @merrymercy in #612
- Improve benchmark scripts & fix llava by @merrymercy in #613
- Memory pool chunked prefetch by @hnyls2002 in #614
- Improve benchmark scripts by @merrymercy in #615
- Fix memory pool index error by @Ying1123 in #616
- Bump version to 0.1.20 by @merrymercy in #618
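The logprob options from #503 can be used roughly as follows. This is a minimal sketch, assuming `return_logprob` and `top_logprobs_num` are the exposed keyword names and that per-token metadata is read back via `get_meta_info`; treat these names as assumptions based on the PR title, not a verbatim API reference.

```python
import sglang as sgl

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\nA:"
    s += sgl.gen(
        "ans",
        max_tokens=32,
        return_logprob=True,   # assumed kwarg: request per-token log probs
        top_logprobs_num=5,    # assumed kwarg: top-5 alternatives per step
    )

state = answer.run(question="What does CUDA graph capture speed up?")
print(state["ans"])                # generated text
print(state.get_meta_info("ans"))  # assumed accessor for the logprob metadata
```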
New Contributors
- @wisclmy0611 made their first contribution in #594
- @Titan-p made their first contribution in #586
- @M0gician made their first contribution in #598
- @for-just-we made their first contribution in #530
Full Changelog: v0.1.18...v0.1.20