server : (refactor) no more json in server_task input #10691
@@ -118,6 +96,7 @@ struct slot_params {

    std::vector<std::string> antiprompt;
    bool timings_per_token = false;
    bool ignore_eos = false;
With new models, the `ignore_eos` functionality is losing relevance. There are now many different "end-of-generation" tokens; it's not just a single EOS token anymore. We should remove this logic and only support logit biases, which are more general. Just a note, no need to do it in this PR.
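To illustrate the note above, here is a minimal sketch of how a generic logit bias could cover what `ignore_eos` does, assuming several end-of-generation tokens. The names `token_id`, `make_eog_ban` and `apply_logit_bias` are hypothetical and are not part of the server code; the token ids are made up.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <vector>

// Hypothetical alias for illustration; not taken from the server code.
using token_id = int;

// Build a bias map that bans every end-of-generation token by
// assigning it a bias of negative infinity.
std::map<token_id, float> make_eog_ban(const std::vector<token_id> & eog_tokens) {
    std::map<token_id, float> bias;
    for (token_id tok : eog_tokens) {
        bias[tok] = -std::numeric_limits<float>::infinity();
    }
    return bias;
}

// Add each bias to the matching logit; ids outside the vocabulary are skipped.
void apply_logit_bias(std::vector<float> & logits, const std::map<token_id, float> & bias) {
    for (const auto & entry : bias) {
        if (entry.first >= 0 && static_cast<std::size_t>(entry.first) < logits.size()) {
            logits[entry.first] += entry.second;
        }
    }
}
```

A single boolean can only cover one token, whereas the same bias map works for any number of end-of-generation tokens and can also express softer penalties.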
This change breaks the infill endpoint - it produces mostly garbage.
Hmm, OK, this could be because the infill "template" is not being applied correctly. I'll add a test with a Qwen model (run locally, not on CI).
I'm on it, will make a PR.
* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme
Continue #10643
`server_task_result` is already broken into multiple derived classes (polymorphism). This helps reduce code complexity because each result type is different from the others.

However, `server_task` can't benefit from the same approach, because most requests share the same parameters with each other.

The solution introduced by this PR is to just put everything into `server_task`. Also, JSON parsing is now done on the HTTP thread. Upon receiving a request, the HTTP thread parses the JSON into one or more `server_task` and pushes them to `server_queue`.
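The flow described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `request_fields` stands in for a parsed JSON body, and `params_sketch`, `task_sketch`, `queue_sketch`, `parse_params` and `handle_request` are hypothetical names. The point is that the request is parsed once on the HTTP thread, so only typed fields reach the queue.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a parsed JSON body (assumption: the real server uses a JSON library).
using request_fields = std::map<std::string, std::string>;

// Typed parameters shared by most request kinds.
struct params_sketch {
    int  n_predict  = -1;
    bool ignore_eos = false;
};

// One flat task type that carries everything; no JSON travels with it.
struct task_sketch {
    int           id = -1;
    std::string   prompt;
    params_sketch params;
};

// Minimal thread-safe queue between the HTTP thread and the worker.
class queue_sketch {
public:
    void post(task_sketch task) {
        std::lock_guard<std::mutex> lock(mtx);
        tasks.push_back(std::move(task));
    }
    bool try_pop(task_sketch & out) {
        std::lock_guard<std::mutex> lock(mtx);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
    std::size_t size() {
        std::lock_guard<std::mutex> lock(mtx);
        return tasks.size();
    }
private:
    std::mutex mtx;
    std::deque<task_sketch> tasks;
};

// Runs on the HTTP thread: convert the request fields into typed parameters.
params_sketch parse_params(const request_fields & body) {
    params_sketch p;
    auto it = body.find("n_predict");
    if (it != body.end()) p.n_predict = std::stoi(it->second); // throws on malformed input
    it = body.find("ignore_eos");
    if (it != body.end()) p.ignore_eos = (it->second == "true");
    return p;
}

// Runs on the HTTP thread: one request may expand into several tasks.
void handle_request(const request_fields & body, const std::vector<std::string> & prompts,
                    queue_sketch & queue, int & next_id) {
    const params_sketch params = parse_params(body); // parsed once, before queuing
    for (const auto & prompt : prompts) {
        task_sketch task;
        task.id     = next_id++;
        task.prompt = prompt;
        task.params = params;
        queue.post(std::move(task));
    }
}
```

With this arrangement, a malformed request fails during parsing on the HTTP thread, before any task is queued.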
Example of `/slots` response:

Example of `/props` response: