KV cache re-use impact on average sequence latency? #2456
Labels: performance issue, question, triaged
Hello,
I am benchmarking KV cache re-use with the Mistral 7B model using tensor parallelism across 8 A100 GPUs (A100-SXM4-40GB).
My instruction prompt is fixed at 1214 tokens, and the maximum sequence length is 1357 tokens (input + output).
From the graph, throughput at a given latency threshold increases significantly, which makes sense, but I am surprised by the much smaller latency gain at low request rates. For example, at request_rate = 1, average sequence latency only drops from 115.35 ms to 94.56 ms with KV cache re-use. Isn't this gain small, considering that a very large chunk of the input prompt is cached?
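A quick back-of-envelope check on my own numbers (the prefill/decode split is an assumption on my part, since I only measured end-to-end latency): if re-use eliminates essentially all prefill compute for the shared prompt, the observed saving puts an upper bound on how much of the latency was prefill in the first place.

```python
# Back-of-envelope: what fraction of end-to-end latency could
# KV cache re-use remove at best? Latencies are the measured
# values above; the prefill/decode split is inferred, not measured.

baseline_ms = 115.35   # avg sequence latency, no KV cache re-use
reuse_ms = 94.56       # avg sequence latency, with KV cache re-use

saved_ms = baseline_ms - reuse_ms
saved_frac = saved_ms / baseline_ms

prompt_tokens = 1214
max_seq_tokens = 1357
decode_tokens = max_seq_tokens - prompt_tokens  # at most 143 generated tokens

print(f"latency saved: {saved_ms:.2f} ms ({saved_frac:.1%} of baseline)")
print(f"generated tokens per request: up to {decode_tokens}")

# If re-use removes (nearly) all prefill work, the ~21 ms saved implies
# prefill was only ~18% of end-to-end latency; the remaining ~95 ms is
# the token-by-token decode phase, which caching the prompt's KV does
# not accelerate.
```

So the small absolute gain at low request rates may just reflect that decode, not prefill, dominates per-request latency here; at higher request rates the freed prefill compute instead shows up as higher throughput at a given latency threshold.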
Results:
For reference, I build the model with:
and benchmark it using: