[BUG] OpenAI client takes a long time to receive the last token every few generations #274
Comments
To rule out deployment issues, I tried the same configs with
I'm experiencing the exact same lags.
Do you have a similar bug to mine? turboderp-org/exllamav2#630
@Ph0rk0z I think it's different. In my case the logged T/s don't drop, but I see a long delay on the client side. You also wrote that you sometimes have to restart the server to get good T/s back, but in my case the next prompt is fine, until the delay comes back after more requests.
I can also have it reprocess the context and the delay goes away, but with cached context it can't continue.
I'm having the same issue with tabbyAPI. With open-webui running the striker Llama 3.3 70B model, T/s never drops, and I can see tabby has actually stopped processing, but there is always a delay at the very end. I never unload the model after tabbyAPI starts up, and I don't have T/s slowdowns requiring a restart either. I was actually looking at open-webui to see if it was a bug on their end; they do have a similar last-token stall with ollama on the backend, but it is inconsistent. I've checked stop tokens and a couple of other models, all with the same final-token delay. I will set up a test environment and see if I can get a reproducible example.
Looks like I have the same issue too.
output.mp4
Since many users are having this problem, the issue does exist somewhere in tabby's or exl2's code. Unfortunately, I am unable to reproduce it on my end with the test script provided by the OP. If someone does reproduce this issue, please send the test script. PRs to help narrow down the cause are also accepted and would be appreciated.
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Describe the bug
I'm trying to serve bartowski/Llama-3.3-70B-Instruct-exl2 with tabbyAPI. Right now I'm using the 8.0-bit quant, but I have also tried the smaller versions. The server has 6x Nvidia RTX A4000. Originally, without tweaking sampling options, the OpenAI client waited a long while for the last token on every generation. I tried many settings, and right now it happens on every 27th generation of my test query, with an almost exactly 60-second lag. The server-side logs don't report this lag. I discovered that if I halve cache_size and max_model_length, the lag also decreases proportionally, to about 30 seconds, and the same pattern holds at 1/4 of the cache size.
My current configs
config.yml
sampler_overrides/fast_streaming.yml
My test case script
The test script uses the streaming API, but the issue also manifests with non-streaming requests.
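Since the test case script itself is collapsed above, here is a minimal sketch of the kind of timing harness described, not the actual script from this report. The base URL, API key, model id, prompt, and request count are placeholders to adjust to the local tabbyAPI setup; only the openai-python v1 streaming interface is assumed.

```python
# Minimal sketch of a streaming timing harness (not the OP's script).
# Assumptions: tabbyAPI's OpenAI-compatible endpoint is reachable at
# http://localhost:5000/v1, and the model id below matches the loaded model.
import time

from openai import OpenAI  # openai-python v1.x

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

N_REQUESTS = 30  # enough iterations to hit a stall that recurs every ~27th query

for i in range(N_REQUESTS):
    start = time.monotonic()
    last_chunk_at = start
    max_gap = 0.0

    stream = client.chat.completions.create(
        model="Llama-3.3-70B-Instruct-exl2",  # placeholder model id
        messages=[{"role": "user", "content": "Write two sentences about llamas."}],
        max_tokens=128,
        stream=True,
    )
    for _chunk in stream:
        now = time.monotonic()
        max_gap = max(max_gap, now - last_chunk_at)
        last_chunk_at = now

    # Time from the last received chunk until the stream actually closed;
    # the delay described in this issue would show up either here or as the
    # largest inter-chunk gap right before the final token.
    tail = time.monotonic() - last_chunk_at
    total = time.monotonic() - start
    print(f"request {i + 1}: total {total:.1f}s, "
          f"largest inter-chunk gap {max_gap:.1f}s, tail after last chunk {tail:.1f}s")
```

Logging which iteration stalls makes it easy to confirm the "every 27th query" periodicity and to check whether the delay scales when cache_size is changed.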
Reproduction steps
The bug may be specific to my hardware/setup, I'm not sure. For me it's enough to just launch the server, run the test case script, and wait until it lags. Right now it happens on every 27th query.
Expected behavior
Generations are expected not to lag, as they don't for the majority of queries.
Logs
Client log when it lags
Server log when it lags
Additional context
No response
Acknowledgements
I hope this can be fixed, since tabbyAPI performs great in all other aspects. Thank you in advance!