server : (refactor) no more json in server_task input #10691

ngxson · 2024-12-06T14:12:59Z

Continue #10643

server_task_result is already broken into multiple derived classes (polymorphism). This reduces code complexity because each result type differs from the others.

However, server_task can't benefit from the same approach, because most request types share the same parameters.

The solution introduced by this PR is to simply put everything into server_task. JSON parsing is also now done on the HTTP thread: upon receiving a request, the HTTP thread parses the JSON into one or more server_task objects and pushes them to server_queue.
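To illustrate the flow described above, here is a minimal sketch, not the actual server code. All type and function names ending in `_sketch`, as well as `decoded_request` and `tasks_from_request`, are hypothetical stand-ins. The decoded JSON body is modelled as a plain map so that the example needs no JSON library. The point is that the HTTP thread produces fully typed tasks, and the worker only ever sees typed fields.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real llama.cpp definitions.
struct slot_params_sketch {
    int   n_predict   = -1;
    float temperature = 0.8f;
    bool  stream      = true;
};

struct server_task_sketch {
    int                id = -1;
    std::string        prompt;
    slot_params_sketch params; // typed fields instead of a JSON blob
};

// Assumption: stand-in for an already-decoded JSON request body.
struct decoded_request {
    std::map<std::string, double> options;
    std::vector<std::string>      prompts;
};

// Minimal thread-safe queue shared by the HTTP thread and the worker.
class task_queue_sketch {
public:
    void post(server_task_sketch task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Blocks until a task is available.
    server_task_sketch pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        server_task_sketch task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return tasks_.size();
    }

private:
    std::mutex                     mutex_;
    std::condition_variable        cv_;
    std::deque<server_task_sketch> tasks_;
};

// Runs on the HTTP thread: one request becomes one task per prompt.
// Missing options keep their defaults from slot_params_sketch.
std::vector<server_task_sketch> tasks_from_request(const decoded_request & req, int & next_id) {
    slot_params_sketch params;

    auto it = req.options.find("n_predict");
    if (it != req.options.end()) {
        params.n_predict = static_cast<int>(it->second);
    }
    it = req.options.find("temperature");
    if (it != req.options.end()) {
        params.temperature = static_cast<float>(it->second);
    }

    std::vector<server_task_sketch> tasks;
    for (const auto & prompt : req.prompts) {
        server_task_sketch task;
        task.id     = next_id++;
        task.prompt = prompt;
        task.params = params;
        tasks.push_back(std::move(task));
    }
    return tasks;
}
```

With this split, a malformed request can be rejected on the HTTP thread before anything reaches the queue, and the worker side needs no parsing code at all.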

Example of /slots response:

[
  {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  }
]

Example of /props response:

{
  "default_generation_settings": {
    "id": 0,
    "id_task": -1,
    "n_ctx": 1024,
    "speculative": false,
    "is_processing": false,
    "params": {
      "n_predict": -1,
      "seed": 4294967295,
      "temperature": 0.800000011920929,
      "dynatemp_range": 0.0,
      "dynatemp_exponent": 1.0,
      "top_k": 40,
      "top_p": 0.949999988079071,
      "min_p": 0.05000000074505806,
      "xtc_probability": 0.0,
      "xtc_threshold": 0.10000000149011612,
      "typical_p": 1.0,
      "repeat_last_n": 64,
      "repeat_penalty": 1.0,
      "presence_penalty": 0.0,
      "frequency_penalty": 0.0,
      "dry_multiplier": 0.0,
      "dry_base": 1.75,
      "dry_allowed_length": 2,
      "dry_penalty_last_n": -1,
      "dry_sequence_breakers": [
        "\n",
        ":",
        "\"",
        "*"
      ],
      "mirostat": 0,
      "mirostat_tau": 5.0,
      "mirostat_eta": 0.10000000149011612,
      "penalize_nl": false,
      "stop": [],
      "max_tokens": -1,
      "n_keep": 0,
      "n_discard": 0,
      "ignore_eos": false,
      "stream": true,
      "n_probs": 0,
      "min_keep": 0,
      "grammar": "",
      "samplers": [
        "dry",
        "top_k",
        "typ_p",
        "top_p",
        "min_p",
        "xtc",
        "temperature"
      ],
      "speculative.n_max": 16,
      "speculative.n_min": 5,
      "speculative.p_min": 0.8999999761581421,
      "timings_per_token": false
    },
    "prompt": "",
    "next_token": {
      "has_next_token": true,
      "has_new_line": false,
      "n_remain": -1,
      "n_decoded": 0,
      "stopping_word": ""
    }
  },
  "total_slots": 1,
  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
  "chat_template": "..."
}

examples/server/server.cpp

ggerganov · 2024-12-07T08:09:57Z

@@ -118,6 +96,7 @@ struct slot_params {

    std::vector<std::string> antiprompt;
    bool timings_per_token = false;
+    bool ignore_eos = false;


With new models the ignore_eos functionality is losing relevance. There are now many different "end-of-generation" tokens and it's not just a single EOS token anymore. We should remove this logic and only support logit biases, which is more general. Just a note, no need to do it in this PR.

examples/server/server.cpp

ggerganov · 2024-12-08T19:54:52Z

This change breaks the infill endpoint - it produces mostly garbage.

ngxson · 2024-12-08T20:01:15Z

Hmm, OK, this could be because the infill "template" is not being applied correctly. I'll add a test with a Qwen model (run locally, not on CI).

ngxson · 2024-12-08T20:05:45Z

I'm on it, will make a PR

* server : (refactor) no more json in server_task input
* add test for slots endpoint
* add tests for /props and /slots
* remove task inf_type
* fix CI by adding safe_json_to_str
* add "model_path" to /props
* update readme

server : (refactor) no more json in server_task input

db97c8b

ngxson requested a review from ggerganov December 6, 2024 14:12

github-actions bot added the examples, python, and server labels Dec 6, 2024

ggerganov approved these changes Dec 7, 2024

ngxson added 5 commits December 7, 2024 13:56

add test for slots endpoint

9bb1ae6

Merge branch 'master' into xsn/refactor_server_struct_input

6bf6e30

add tests for /props and /slots

e721f4c

remove task inf_type

090a113

fix CI by adding safe_json_to_str

65d2e6d

ggerganov approved these changes Dec 7, 2024

add "model_path" to /props

1949f68

ngxson mentioned this pull request Dec 7, 2024

Misc. bug: server - GET /props model value no longer works after commit 6c5bc06 #10705

Closed

update readme

89c2af9

ngxson mentioned this pull request Dec 7, 2024

changelog : llama-server REST API #9291

Open

ngxson merged commit 3573fa8 into ggerganov:master Dec 7, 2024
45 checks passed

ngxson mentioned this pull request Dec 8, 2024

server : fix format_infill #10724

Merged

ggerganov mentioned this pull request Dec 8, 2024

server : fix infill prompt format #10725

Closed
