server: init functional tests #5566
Conversation
I'm quite out of my depth here, but if you can figure out a way to add
@RafaAguilar @ngxson As you were part of the tests-related discussion, do you feel OK with the proposed approach here? If so, I will continue with asynchronous requests and a multi-user scenario.
Great idea, thanks for starting this PR. Some suggestions:
Excellent! I'll be reading this PR today and see if I can add a test or help out in some way :)
Would it be possible to train a nonsensical 1M-parameter model, potentially? It should be really cheap and fast even on commodity hardware. These tests aren't meant to interact with the output in any meaningful way anyway. I'm mentioning this because I see the trial run took 15 minutes for just two features with 3 scenarios each! Imagine the time needed to run 20-30 tests!
@Azeirah Yes, it's possible, but the problem is that these models never want to output the EOS token (to terminate the output). It's also possible to rely on the
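The usual way around a model that never emits EOS is to cap the generation length in the request itself. The snippet below is a minimal sketch and not part of this PR: it assumes a llama.cpp server already running locally and uses its /completion route with the n_predict field; the prompt, port, and helper name are made up.

```python
# Illustrative sketch only: bound generation length on the request side so a
# tiny test model that never emits EOS still terminates quickly.
# Assumes a llama.cpp server is already listening on localhost:8080.
import requests


def bounded_completion(prompt: str, base_url: str = "http://localhost:8080") -> str:
    response = requests.post(
        f"{base_url}/completion",
        # n_predict caps the number of generated tokens regardless of EOS.
        json={"prompt": prompt, "n_predict": 16},
        timeout=30,
    )
    response.raise_for_status()
    # The generated text is expected in the "content" field of the JSON body.
    return response.json().get("content", "")


if __name__ == "__main__":
    print(bounded_completion("One day, Lily met"))
```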
Fair enough, I think it should be doable to make a model that behaves well enough. Potentially it could be trained explicitly to bias towards EOS, haha. I agree it would be a fun exercise; unfortunately I have a 7900 XTX and I believe it cannot be used for training :(
In addition to that, we of course have no clue what kind of hardware these tests will be run on, but if it's a virtual core on a Xeon or something similar, maybe we can try compiling OpenBLAS? I'm not sure it'd even be worth investigating, depending on the speedup and the variety of weird hardware you could get on GitHub Actions. No clue what kind of control over the underlying (virtualised) hardware you'd get there.
Other than that, I think it's fine that the tests are in separate files. It's kinda just how behave is meant to be used: each feature is one file, and different related scenarios belong to one feature. I'm somewhat familiar with BDD myself since I use a loosely inspired variant at work. Do you think BDD is unclear to some people? I could write a short readme explaining it.
Also, one case that I have never tested before is invalid Unicode. In my personal project (which uses llama.h), on receiving responses via
It would be nice if someone could test whether it's the case for server.cpp (which
@Azeirah I believe the hosted runners on GitHub are Xeons with shared CPU cores. The performance is not meant to be consistent, though, and I believe they cannot use anything better than AVX2.
For training, I'm using a GTX 1660 Ti. I initially purchased it for gaming 2 years ago, but who knew that I would now need more VRAM than that :'( Back then, the dealer offered me a 3080 Ti at a fairly good price, but I refused. Nowadays, for anything bigger than 1B, I need to rent a VPS on Google Cloud; it's more or less the same price as Colab notebooks, but more flexible and with persistent storage.
Done 👍
@ggerganov @ngxson Any idea how to improve the prompt eval time on the GitHub runners? Should we give OpenBLAS a try?
@phymbert Can you try this model instead? (pay attention to set
https://huggingface.co/ngxson/dummy-llama/blob/main/llama_xs_q4.bin
I have no idea if OpenBLAS will help. You can try if you want.
@Azeirah I tried to overfit an 86M model, but unfortunately it does not seem to output any of the examples. On the bright side, it outputs mostly text (not invalid bytes as I said earlier), so it is still usable for the test. The Q4_K_M size is only 57 MB.
Nice, thanks. Took 0m0.481s. Note I have also reduced the KV size.
Force-pushed from a9bf1ce to 0765d9c
Best way to improve the speed is to use as small a model as possible. You can try @karpathy's tinyllamas: https://huggingface.co/karpathy/tinyllamas
Here are instructions for converting to GGUF and using in https://github.com/ggerganov/llama.cpp/tree/master/examples/convert-llama2c-to-ggml
For convenience, I've uploaded the smallest 260K model (~1 MB) in GGUF format here: https://huggingface.co/ggml-org/models/blob/main/tinyllamas/stories260K.gguf
Example:
# get the model
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K.gguf
# run sample inference
./main -m ./stories260K.gguf -p "One day, Lily met" -n 128 -c 256
One day, Lily met a boy named Timmy. Tim was very happy to help her mommy. He wanted to play with the ball all day. Suddenly, something unexpected happened. A little girl came over and saw a big tree. She was very sad.
Timmy wanted to play with the ball. He thought it would be fun! When he reached up, he found it st
llama_print_timings: load time = 80.26 ms
llama_print_timings: sample time = 1.70 ms / 128 runs ( 0.01 ms per token, 75427.22 tokens per second)
llama_print_timings: prompt eval time = 3.06 ms / 7 tokens ( 0.44 ms per token, 2288.33 tokens per second)
llama_print_timings: eval time = 134.04 ms / 127 runs ( 1.06 ms per token, 947.46 tokens per second)
llama_print_timings: total time = 142.59 ms / 134 tokens
This should be ideal for CI
Btw, one thing that would greatly improve the state of
Probably needs a separate PR to avoid this change becoming too big, but I thought I would mention it in case you are interested in further helping out with maintenance.
add wrong_usage.feature to demonstrate user issue which will not be fixed.
I will review this fully tomorrow; I'm a bit sick, but I'll have energy when I plan it out.
No fix was applied actually; IMHO it's a wrong usage of the server when neither
I have also added a
@ggerganov Regarding #5655, I have reproduced it in
It can be investigated/fixed in another PR.
Thanks for the review. I will give concurrent streaming requests with aiohttp one last chance, then merge this first version.
@Azeirah No worries, take care, it can wait for tomorrow 👍
If the server is started with undesirable parameters, we should either abort or at the very least emit a clear warning with a suggested solution. Is that the case now? I try to focus a lot on usability for end users.
Force-pushed from 6a80812 to 7183149
I think it's excellent! I ran it 5 times to check for flakes as well, especially given the warning in the readme and that I have a capable PC: 0 flakes.
I took a quick look at all the code; I didn't read deeply into the implementation, but I did look at the tests. I think it's a fantastic set-up for server tests! :)
If there's a reason to add any tests later, this set-up will be very easy to extend, and it runs very stably, so it's ready for merging if you ask me.
I only had a tiny comment about the readme again, but it's very minor.
…d @wrong_usage.
On it, especially in
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault ggerganov#5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow

Co-authored-by: Georgi Gerganov <[email protected]>
Motivation
Tests have been listed in #4216 as an improvement request.
The idea is to ensure all server routes are working properly, using the Gherkin language to define test cases following a BDD approach. Gherkin is designed to be human-readable and describes use cases relating to a software system.
Example
Proposed changes
A CI workflow is triggered which builds and starts the server in the background; test scenarios are then launched with Python.
A very small model is used to quickly generate responses, and a fixed seed is set to ensure reproducibility. The Gherkin glue is written in Python using behave.
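For readers less familiar with behave: each step of a Gherkin scenario is bound to a Python function through string-matching decorators. The sketch below is illustrative only and is not the PR's actual test code; the scenario wording, step phrases, and host/port are made up, and only the /health route itself is taken from the scenario list above.

```python
# features/steps/steps.py — illustrative behave glue, not the PR's test files.
# It assumes a feature file containing a scenario along these lines:
#
#   Scenario: server health
#     Given a server listening on localhost:8080
#     When the /health endpoint is requested
#     Then the server answers with status code 200
#
import requests
from behave import given, when, then


@given("a server listening on {host_port}")
def step_server(context, host_port):
    # behave passes a shared context object to every step.
    context.base_url = f"http://{host_port}"


@when("the /health endpoint is requested")
def step_request_health(context):
    context.response = requests.get(f"{context.base_url}/health", timeout=10)


@then("the server answers with status code {status:d}")
def step_check_status(context, status):
    assert context.response.status_code == status
```

Running the behave command from the directory containing the features/ folder then executes every scenario against the live server.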
Restriction
This is not designed to assess performance of the server.
Expected scenario:
(from @ngxson's comment)
An example of a passing GitHub workflow can be found here.
TODO:
- /slots and /health endpoints to properly access slots data over the queue_tasks: server: health: fix race condition on slots data using tasks queue #5634