## Highlights
- Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
- Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA (slides available).
- Added an Engine API for offline inference with reduced overhead (#1614, #1567); see the usage sketch after this list.
- Added an overlap scheduler to reduce CPU overhead (#1738).
- New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM-4 (#1736).
- Added support for reward models (#1525).
- Added support for Intel XPU (#1480).
- Improved stability of greedy decoding (#1589).
- Accelerated multi-LoRA serving (#1587).
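
The new offline Engine API runs inference in-process, with no separate HTTP server. Below is a minimal sketch assuming the `sgl.Engine` interface introduced in #1567; the model path, prompts, and sampling parameters are placeholders:

```python
import sglang as sgl

# Create an in-process engine; no HTTP server is launched.
# The model path is a placeholder.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Batched generation; each output dict carries the generated text.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

# Per #1595, the underlying runtime is shut down implicitly
# via an atexit hook.
```

Per #1614, the engine also supports async and streaming variants of the same call.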
## What's Changed
- [Fix] Ignore model import error by @merrymercy in #1513
- minor: fix config by @hnyls2002 in #1524
- [Event] Update meeting link by @Ying1123 in #1529
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
- Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
- [FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
- [bugfix] Add modelscope package to avoid Docker images without modelscope by @KylinMountain in #1520
- Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
- Multiple minor fixes by @merrymercy in #1530
- Make detokenizer_manager.py not asyncio by @merrymercy in #1532
- Organize image inputs by @hnyls2002 in #1531
- Improve process creation by @merrymercy in #1534
- Fix IPv6 URL when warming up the model by @cauyxy in #1537
- Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
- Process image in parallel by @hnyls2002 in #1539
- Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
- Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
- Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
- [Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
- Organize Attention Backends by @hnyls2002 in #1547
- Fix bugs of `logprobs_nums` by @hnyls2002 in #1548
- Dispatch flashinfer wrappers by @hnyls2002 in #1550
- Simplify flashinfer dispatch by @hnyls2002 in #1552
- [Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
- [Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
- [Fix] Fix all the Huggingface paths by @tbarton16 in #1553
- [Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
- [Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
- Move status check in the memory pool to CPU by @merrymercy in #1557
- [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
- [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
- Organize sampling batch info better by @merrymercy in #1562
- Use ipc instead of tcp in zmq by @merrymercy in #1566
- Make input_ids a torch.Tensor by @merrymercy in #1568
- [Minifix] Remove extra space in CoT example by @FredericOdermatt in #1569
- [Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
- Refine the reasons for adding requests to avoid corner cases by @hnyls2002 in #1574
- chore: update README.md by @eltociear in #1580
- [Easy] use .text() instead of .text by @ByronHsu in #1577
- [Event] Update README.md by @Ying1123 in #1572
- Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
- Fix backend method not found when SRT Runtime is used by @ByronHsu in #1576
- Deep-copy the default sampling params by @ByronHsu in #1581
- Fix styling by @ByronHsu in #1583
- Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
- Support min_tokens in sgl.gen by @ByronHsu in #1573 (see the sketch after this list)
- [Minor] Improve the style and fix flaky tests by @merrymercy in #1584
- [Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
- Clean up event loop by @merrymercy in #1586
- [LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
- [Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
- Test consistency for single and batch separately by @ByronHsu in #1590
- Update README.md by @merrymercy in #1591
- Fix modality for image inputs by @merrymercy in #1592
- Provide an offline engine API by @ByronHsu in #1567
- [Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
- Use `atexit` hook to implicitly shut down `Runtime` by @ByronHsu in #1595
- Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
- Fix chunked prefill condition by @ispobock in #1594
- Fix the port_args in bench_latency by @merrymercy in #1597
- Remove references to squeezellm by @janimo in #1603
- [Profile] Add pytorch profiler by @Ying1123 in #1604
- [Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
- Release v0.3.3 by @merrymercy in #1605
- [Minor] Fix logging typo by @amosyou in #1615
- Fix test_vision_openai_server on CI by @ByronHsu in #1620
- [Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
- Update README.md by @kushal34712 in #1625
- Update README.md by @merrymercy in #1629
- Add device support by @liangan1 in #1607
- Nit about the decorator of `PortArgs.init_new` by @glen-amd in #1611
- [Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
- Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
- Fix the correctness test in bench_latency.py (tp > 1) and test_generation_models.py by @merrymercy in #1631
- Add image_token in conversation.py by @merrymercy in #1632
- Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
- Fix constrained decoding by @merrymercy in #1634
- Add back data parallelism by @merrymercy in #1635
- Release v0.3.3.post1 by @merrymercy in #1636
- [engine] support async and streaming by @ByronHsu in #1614
- [Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
- Fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
- Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
- Fix unit tests and type annotations by @merrymercy in #1648
- Add an option to disable penalizer by @merrymercy in #1651
- Add get_tokenizer function for Engine class by @pjyi2147 in #1653
- Fix the batch_is_full check for jump-forward decoding by @merrymercy in #1654
- Simplify the event loop and expose `--num-continuous-decode-steps` as an argument by @merrymercy in #1652
- [doc] Add engine section in backend.md by @ByronHsu in #1656
- [Fix] fix eos trim inconsistency by @Ying1123 in #1650
- Add output_ids into ScheduleBatch by @merrymercy in #1659
- [Minor] Rename no_eos_trim to no_stop_trim by @Ying1123 in #1661
- Add a test case to test retract by @merrymercy in #1662
- Move filter_batch out of stream_output by @merrymercy in #1663
- Support double sparsity by @andy-yang-1 in #1459
- Fix unit test order to balance the tasks in CI by @merrymercy in #1665
- [Minor] Improve style by @merrymercy in #1666
- Simplify chunked prefill by @merrymercy in #1667
- [1/N] Remove `CacheConfig` import in all model files by @ByronHsu in #1658
- [doc] Improve engine doc and add it to the README by @ByronHsu in #1670
- [Minor] Add some utility functions by @merrymercy in #1671
- Improve benchmark scripts by @merrymercy in #1672
- Fix memory leak during abort by @merrymercy in #1674
- Fix filter_batch function call by @hnyls2002 in #1681
- Add OLMo model by @janimo in #1676
- Add a new event loop by @merrymercy in #1677
- Fix srt dependency by @ispobock in #1685
- [Event] Add online meetup meeting link by @Ying1123 in #1686
- Launch a thread to overlap CPU and GPU by @merrymercy in #1687
- Return a per-request metric for the number of cached_tokens read by @havetc in #1599
- Add orjson for JSON responses by @michaelfeil in #1688
- Update README.md by @merrymercy in #1689
- Add date to logging messages (#1623) by @zeng-zc in #1679
- Update the transformers version in CI by @merrymercy in #1690
- Use SGLang imports for linear layer by @janimo in #1696
- feat: optimize radix tree code by @wxsms in #1697
- orjson: faster JSON serialization by @michaelfeil in #1694
- Fix the failed unit tests by @merrymercy in #1699
- Fix failed ci tests on long prompts; Better error messages for embedding models by @merrymercy in #1700
- Fix engine unit test by @merrymercy in #1701
- Fix mixed batch for multi modal models by @merrymercy in #1702
- Add matched_stop (token or str) to distinguish between EOS and stop-string finish reasons by @g-drozdov in #1684
- Fix regex and logprob conflicts when chunked prefilling by @hnyls2002 in #1703
- Simplify flashinfer utilities by @merrymercy in #1704
- Add dtype for more operations by @merrymercy in #1705
- Add grouped free operations by @merrymercy in #1706
- Skip unnecessary penalizer by @merrymercy in #1707
- Simplify the nan detection and greedy check in sampler by @merrymercy in #1709
- Fix `is_all_ready` for overlap copy by @merrymercy in #1710
- Fix the race condition in overlap mode by @merrymercy in #1712
- Update README.md by @merrymercy in #1713
- Release v0.3.4 by @merrymercy in #1714
- Simplify the interface of tp_worker by @merrymercy in #1718
- Update vllm to 0.6.3 (#1711) by @zhyncs in #1720
- Support qwen2 vl model by @zhyncs in #1721
- Update README.md by @Ying1123 in #1722
- Unify the memory pool api and tp worker API by @merrymercy in #1724
- Temporarily skip the test_mixed_batch for QWen2VL by @merrymercy in #1725
- Split the overlapped version of TpModelWorkerClient into a separate file by @merrymercy in #1726
- [Bugfix] qwen2vl forward_extend by @yizhang2077 in #1727
- Simplify the usage of device by @merrymercy in #1734
- Simplify batch result resolution by @merrymercy in #1735
- Add GLM-4 TextGeneration Model support for SGLang by @sixsixcoder in #1736
- Make token mapping non-blocking in the overlapped mode by @merrymercy in #1740
- Maintain seq_lens_sum to make more FlashInfer operations non-blocking by @merrymercy in #1741
- Fix prefill oom by @hnyls2002 in #1743
- Faster overlap mode scheduler by @merrymercy in #1738
- misc: add CODEOWNERS by @zhyncs in #1737
- Fix sliding window attention and gemma-2 unit tests in CI by @merrymercy in #1746
- Llama3.2 vision model support by @hnyls2002 in #1551
- Update `max_req_len` and `max_req_input_len` by @hnyls2002 in #1748
- Release v0.3.4.post1 by @merrymercy in #1749
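
Per #1573, `sgl.gen` now accepts a `min_tokens` argument in the frontend DSL. A minimal sketch, assuming a server is already running; the endpoint URL and the prompt are placeholders:

```python
import sglang as sgl

# Point the frontend at a running SGLang server
# (placeholder endpoint; launch the server separately).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    # min_tokens (#1573) forces at least 16 tokens to be
    # generated before any stop condition applies.
    s += "A: " + sgl.gen("answer", min_tokens=16, max_tokens=64)

state = answer.run(question="What does an overlap scheduler do?")
print(state["answer"])
```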
## New Contributors
- @du00cs made their first contribution in #1521
- @KylinMountain made their first contribution in #1520
- @jeffrey-fong made their first contribution in #1495
- @cauyxy made their first contribution in #1537
- @kkHuang-amd made their first contribution in #1554
- @tbarton16 made their first contribution in #1553
- @mssongit made their first contribution in #1536
- @FredericOdermatt made their first contribution in #1569
- @kushal34712 made their first contribution in #1625
- @liangan1 made their first contribution in #1607
- @glen-amd made their first contribution in #1611
- @OBJECT907 made their first contribution in #1579
- @Abatom made their first contribution in #1626
- @JanumalaAkhilendra made their first contribution in #1633
- @learninmou made their first contribution in #1642
- @pjyi2147 made their first contribution in #1653
- @andy-yang-1 made their first contribution in #1459
- @michaelfeil made their first contribution in #1688
- @zeng-zc made their first contribution in #1679
- @wxsms made their first contribution in #1697
- @g-drozdov made their first contribution in #1684
- @sixsixcoder made their first contribution in #1736
**Full Changelog**: v0.3.2...v0.3.4.post1