Release v0.2.9
## Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fixes: fixed the double BOS problem in the chat template; moved sampling logits to float32; updated the flashinfer sampling kernels
- Feature fixes: implemented several missing logprob-related features in the OpenAI API server
- CI/CD infrastructure is now fully in place, with coverage for the frontend, the backend, accuracy, and performance.
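As a minimal sketch of the new chat-API logprob support (#852), assuming an SGLang server is already running with its OpenAI-compatible endpoint, a client request body would look like this (the model name and token counts are placeholders; the field names follow the OpenAI chat-completions schema):

```python
import json

# Hypothetical request body for the OpenAI-compatible
# /v1/chat/completions endpoint; logprob support in the chat API
# landed in #852.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 8,
    "logprobs": True,     # request per-token log probabilities
    "top_logprobs": 5,    # and the 5 most likely alternatives per token
}

# Serialize exactly as an HTTP client would before POSTing it.
body = json.dumps(payload)
print(body)
```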
## What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + logprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` run as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
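The server-side flags touched in this release can be combined in a launch command like the following sketch (the model path and sizes are placeholders; check `python -m sglang.launch_server --help` in your installed version for the exact flag names):

```shell
# Hypothetical launch command exercising flags from this release:
#   --chunked-prefill-size   chunked prefill (#800, #811)
#   --max-total-tokens       total token budget (#840)
#   --served-model-name      custom model id (#749)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-chat-hf \
  --served-model-name my-model \
  --chunked-prefill-size 8192 \
  --max-total-tokens 16384
```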
## New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9