
Release v0.3.4.post1

@hnyls2002 released this 22 Oct 04:30

Highlights

  • Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
    • Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
  • Added an Engine API for offline inference with reduced overhead (#1614, #1567). Usage. A minimal sketch follows this list.
  • Added an overlap scheduler to reduce CPU overhead (#1738).
  • New models: Llama 3.2 (#1551), Qwen2-VL (#1721), OLMo (#1676), GLM 4 (#1736).
  • Added support for reward models (#1525).
  • Added support for Intel XPU (#1480).
  • Improved the stability of greedy decoding (#1589).
  • Accelerated multi-LoRA serving (#1587).
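
The sketch below illustrates the offline Engine API mentioned above: it runs the model in the caller's process rather than behind an HTTP server, which is where the reduced overhead comes from. This is a minimal, hedged example, not an official snippet from this release; the model path, prompts, and sampling parameters are placeholder assumptions.

```python
# Minimal offline-inference sketch with the sglang Engine API.
# Assumes sglang is installed and the model weights are accessible;
# the model path and sampling parameters are placeholders.
import sglang as sgl


def main():
    # Create an in-process engine; no HTTP server is launched.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")

    prompts = [
        "The capital of France is",
        "Write a haiku about GPUs:",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # Generate completions for all prompts in one batched call.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])

    # Release GPU resources held by the engine.
    llm.shutdown()


if __name__ == "__main__":
    main()
```

See the Usage link above and PRs #1614 and #1567 for the authoritative interface.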

What's Changed

New Contributors

Full Changelog: v0.3.2...v0.3.4.post1