[REQUEST] High throughput with large batch size #686

Open
3 tasks done
fzyzcjy opened this issue Nov 26, 2024 · 5 comments

Comments

fzyzcjy commented Nov 26, 2024

Problem

Hi, thanks for the library! I want to run a 7B model on a 24GB 4090 with as high throughput as possible (latency is not a problem since it is a batch task). vLLM works well, but its 8-bit KV cache seems to degrade the results a lot (or maybe I am not using it correctly). exllamav2 appears to have a very good low-bit KV cache, so I would appreciate it if it could deliver high throughput with a large batch size (e.g. batch size = 256).

Solution

(see above)

Alternatives

No response

Explanation

(see above)

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
turboderp (Owner) commented

You can run with a large batch size if you have the VRAM to store that number of sequences in the cache at once. But throughput and latency are ultimately still connected. If you want to run at bsz 1000 and a context length of 32k or whatever, that means a 32M-token cache. However you manage that, it's going to far outweigh the storage requirement and bandwidth usage for the weights, and at that point why would you even be considering quantization?
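For concreteness, the back-of-envelope arithmetic looks something like the sketch below. This is only a sketch: the layer/head counts assume a Llama-2-style 7B without GQA, and a quantized cache also stores scales and zero points, so real numbers will differ somewhat.

```python
# Back-of-envelope KV cache sizing (sketch; assumes a Llama-2-style 7B:
# 32 layers, 32 KV heads, head dim 128 -- check your model's config.json).
def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2.0):
    # 2x for keys and values, one entry per cached token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * batch_size * seq_len

gib = 1024 ** 3
# bsz 1000 x 32k context at FP16 (the extreme case above)
print(kv_cache_bytes(1000, 32 * 1024) / gib)                 # ~16,000 GiB
# bsz 256 x 2k context, FP16 vs ~4-bit cache
print(kv_cache_bytes(256, 2048) / gib)                       # ~256 GiB
print(kv_cache_bytes(256, 2048, bytes_per_elem=0.5) / gib)   # ~64 GiB
```

Either way, the cache dwarfs the few GB of quantized weights, which is the point above: at that scale the cache, not the weight quantization, dominates both memory and bandwidth.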


fzyzcjy commented Nov 27, 2024

Thanks for the reply! I am mainly trying bs=256 with contexts around 1-2k tokens, and I find vLLM/LMDeploy quite fast at that scale.

turboderp (Owner) commented

You can try the bulk_inference.py example, which could work at that scale.
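Roughly, the in-process path that example demonstrates looks like the sketch below. This is a sketch only, not the example itself: the model path, cache budget and keyword arguments are placeholders, and examples/bulk_inference.py in the repo is the canonical version.

```python
# Sketch of in-process batched generation with a quantized cache
# (placeholder path/sizes; see examples/bulk_inference.py for the real thing).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/7b-exl2-model")
model = ExLlamaV2(config)

# The dynamic generator shares one paged cache across all active jobs;
# max_seq_len here is the total token budget, sized to fit the remaining VRAM.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=64 * 1024, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Submit the whole batch; the generator schedules as many jobs as fit in the cache.
prompts = ["Question %d: ..." % i for i in range(256)]
outputs = generator.generate(prompt=prompts, max_new_tokens=256)
```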


fzyzcjy commented Dec 1, 2024

Thank you!


fzyzcjy commented Dec 2, 2024

@turboderp By the way, is it OK to use https://github.com/theroyallab/tabbyAPI to test the speed (i.e. will it have nearly the same performance as a direct batched call like bulk_inference.py)? Currently my code benchmarks vLLM / LMDeploy through their OpenAI-compatible servers, sending HTTP requests to them.
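For reference, the benchmark against the vLLM / LMDeploy servers is roughly the following, and I would point the same script at tabbyAPI. It is only a sketch: the base URL, model name, prompts and token counts are placeholders, and it assumes the server reports a usage field in the response.

```python
# Rough throughput benchmark against an OpenAI-compatible /v1/completions endpoint
# (base URL, model name and prompts are placeholders).
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000/v1/completions"   # placeholder port
MODEL = "my-7b-model"                                # placeholder model name
prompts = ["Question %d: ..." % i for i in range(256)]

def complete(prompt):
    r = requests.post(BASE_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 256,
    }, timeout=600)
    r.raise_for_status()
    # Assumes the server returns OpenAI-style usage accounting.
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=256) as pool:    # fire the whole batch at once
    generated = sum(pool.map(complete, prompts))
elapsed = time.time() - start
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```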
