Releases: turboderp-org/exllamav2
0.0.20
- Adds Phi-3 support (loading sketch below)
- Wheels compiled for PyTorch 2.3.0
- ROCm 6.0 wheels
Full Changelog: v0.0.19...v0.0.20
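For reference, a Phi-3 EXL2 quant loads through the same API as any other supported architecture. A minimal sketch, assuming a local EXL2 quantization of the model; the directory path is hypothetical:

```python
# Minimal sketch: loading a Phi-3 EXL2 quant with the standard exllamav2 API.
# The model directory is a placeholder; any local Phi-3 EXL2 quant works.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/phi-3-mini-exl2"   # hypothetical path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)       # allocate as weights load
model.load_autosplit(cache)                    # split across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Hello, my name is", settings, 64))
```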
0.0.19
- More accurate Q4 cache using groupwise rotations
- Better prompt ingestion speed when using flash-attn
- Minor fixes related to issues quantizing Llama 3
- New, more robust optimizer
- Fix for a bug in long-sequence inference with GPTQ models
Full Changelog: v0.0.18...v0.0.19
0.0.18
- Support for Command-R-plus
- Fix for pre-AVX2 CPUs
- VRAM optimizations for quantization
- Very preliminary multimodal support
- Various other small fixes and optimizations
Full Changelog: v0.0.17...v0.0.18
0.0.17
Mostly just minor fixes and support for DBRX models.
Full Changelog: v0.0.16...v0.0.17
0.0.16
- Adds support for Cohere models
- N-gram decoding (sketch below)
- A few bugfixes
- Lots of optimizations
Full Changelog: v0.0.15...v0.0.16
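N-gram decoding drafts tokens from n-grams already seen in the context, so it speeds up repetitive output without needing a separate draft model. A minimal sketch, assuming `model`, `cache` and `tokenizer` are loaded as in the Phi-3 example above; the `speculative_ngram` flag name is an assumption to verify against the streaming generator:

```python
# Sketch: n-gram speculative decoding on the streaming generator.
# Assumes model, cache and tokenizer are already loaded (see the Phi-3
# sketch above). speculative_ngram is assumed to be the relevant switch.
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.speculative_ngram = True      # draft tokens from n-grams in context

settings = ExLlamaV2Sampler.Settings()
input_ids = tokenizer.encode("def fibonacci(n):")
generator.begin_stream(input_ids, settings)

for _ in range(200):                    # cap generation at 200 tokens
    chunk, eos, _ = generator.stream()
    print(chunk, end="")
    if eos:
        break
```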
0.0.15
- Adds Q4 cache mode (usage sketch below)
- Support for StarCoder2
- Minor optimizations and a couple of bugfixes
Full Changelog: v0.0.14...v0.0.15
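The Q4 cache is a drop-in replacement for the default FP16 cache, storing keys and values at roughly 4 bits per element and cutting cache VRAM to about a quarter at some accuracy cost. A minimal usage sketch; the model path is hypothetical:

```python
# Sketch: using the Q4 cache in place of the default FP16 cache.
# ExLlamaV2Cache_Q4 is a drop-in replacement for ExLlamaV2Cache.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config()
config.model_dir = "/models/llama2-70b-exl2"   # hypothetical path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)    # keys/values stored at ~4 bits
model.load_autosplit(cache)
```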
0.0.14
Adds support for Qwen1.5 and Gemma architectures.
Various fixes and optimizations.
Full Changelog: v0.0.13...v0.0.14
0.0.13.post2
Full Changelog: 0.0.13.post1...0.0.13.post2
0.0.13.post1
Fixes inference on models with vocab sizes that are not multiples of 32
0.0.13
This release mainly updates the prebuilt wheels to PyTorch 2.2, since Torch 2.2 won't load extensions built against earlier versions.
Also adds dynamic temperature and quadratic sampling (sketch below), fixes a performance regression that the recent batching optimizations caused on some GPUs, and includes various other small fixes.
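Both new samplers are plain fields on the sampler settings object. A minimal sketch; the field names here (min_temp, max_temp and temp_exponent for dynamic temperature, smoothing_factor for quadratic sampling) are assumptions to verify against the sampler source:

```python
# Sketch: the new sampling controls on ExLlamaV2Sampler.Settings.
# Field names are assumptions; check exllamav2/generator/sampler.py
# for the authoritative names and defaults.
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()

# Dynamic temperature: scale temperature between a min and max based on
# the entropy of the token distribution.
settings.min_temp = 0.3
settings.max_temp = 1.5
settings.temp_exponent = 1.0

# Quadratic sampling: a smoothing factor > 0 reshapes the logits to favor
# high-probability tokens without a hard top-k/top-p cutoff.
settings.smoothing_factor = 0.3
```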