
server : fix format_infill #10724

Merged
merged 8 commits into ggerganov:master on Dec 8, 2024

Conversation

@ngxson (Collaborator) commented Dec 8, 2024

Should fix #10691 (comment). I removed format_chat/_infill/_rerank from handle_completions_generic but forgot to put it back in handle_infill.

@ngxson ngxson requested a review from ggerganov December 8, 2024 20:10
    @@ -3509,6 +3516,21 @@ int main(int argc, char ** argv) {
        }
        data["input_extra"] = input_extra; // default to empty array if it's not exist

        std::string prompt = json_value(data, "prompt", std::string());
        std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, true, true);
@ggerganov (Owner):

We should probably return an error if there is more than 1 resulting prompt?

@ngxson (Collaborator, Author):

Because we already checked above that data["prompt"] is a string, here we can be sure that we only have a single prompt to deal with. Probably a GGML_ASSERT here makes more sense?

(The expected behavior of tokenize_input_prompts is that if prompt is a string, then the output vector size is == 1.)
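
For illustration, here is roughly what the suggested assertion would look like if added after the tokenization call in the hunk quoted above (a sketch only, assuming the surrounding server.cpp context shown in the diff; not code from this PR):

    std::string prompt = json_value(data, "prompt", std::string());
    std::vector<llama_tokens> tokenized_prompts = tokenize_input_prompts(ctx_server.ctx, prompt, true, true);
    // data["prompt"] was already validated to be a string above, so exactly
    // one tokenized prompt is expected here
    GGML_ASSERT(tokenized_prompts.size() == 1);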

@ggerganov (Owner):

Got it. It's fine as it is.

@ngxson (Collaborator, Author) commented Dec 8, 2024

@ggerganov I added a test using Qwen2.5-Coder-1.5B-Instruct-GGUF; you can run it locally with:

    SLOW_TESTS=1 ./tests.sh ./unit/test_infill.py -v -x

Feel free to add more complicated test cases if you need to!

@github-actions github-actions bot added the python python script changes label Dec 8, 2024
@ngxson (Collaborator, Author) commented Dec 8, 2024

Btw, please note that adding "prompt": "Complete this" to the request makes the Qwen model hallucinate the answer. Looking at the technical report, I don't think the model is trained to follow instructions when doing infill:

[image: excerpt from the model's technical report]

@ggerganov (Owner):

Ah, the infill endpoint should be used only with the Coder models, not the Coder-Instruct ones. The authors confirmed that: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/discussions/2#6731a45e0e39be0605a0df20

So the tests should be updated to use the Coder variant.

@ngxson (Collaborator, Author) commented Dec 8, 2024

OK, so I've tried the non-instruct model, but I think the problem is related to the placement of the prompt in the formatted version. What I'm getting is:

    <|repo_name|>myproject\n<|file_sep|>llama.h\nLLAMA_API int32_t llama_n_threads();\n<|file_sep|>filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_Complete this<|fim_suffix|>}\n<|fim_middle|>

If the prompt is placed at the beginning, it should work:

    Complete this<|repo_name|>myproject\n<|file_sep|>llama.h\nLLAMA_API int32_t llama_n_threads();\n<|file_sep|>filename\n<|fim_prefix|>#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n int n_threads = llama_<|fim_suffix|>}\n<|fim_middle|>

We can fix this in another PR; for now I'm gonna comment out the "prompt": "Complete this" and change the model to the non-instruct one.

@ggerganov (Owner) commented Dec 8, 2024

FIM should not add instructions such as "Complete this", as can be seen in the technical report. The "prompt" field in the /infill requests is designed to contain the prefix of the current line on which the cursor is located. This is appended to the "input_prefix" (which contains the previous lines) to obtain the final fim_prefix. So it is working as intended.
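
As an illustration of that assembly (a sketch with made-up variable names, not the actual format_infill implementation), the effective FIM prefix is the request's "input_prefix" followed by its "prompt":

    #include <string>

    int main() {
        // values taken from the test below; "prompt" is the current line up to the cursor
        std::string input_prefix = "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n";
        std::string prompt       = "    int n_threads = llama_";
        std::string input_suffix = "}\n";

        // conceptually: <|fim_prefix|> fim_prefix <|fim_suffix|> input_suffix <|fim_middle|>
        std::string fim_prefix = input_prefix + prompt;
        return 0;
    }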

@ngxson (Collaborator, Author) commented Dec 8, 2024

The "prompt" field in the /infill requests is designed to contain the prefix of the current line on which the cursor is located.

Ok thanks for the clarification. I updated the test to reflect this. The "prompt" now contains the current line where the cursor is located:

    res = server.make_request("POST", "/infill", data={
        "input_extra": [{
            "filename": "llama.h",
            "text": "LLAMA_API int32_t llama_n_threads();\n"
        }],
        "input_prefix": "#include <cstdio>\n#include \"llama.h\"\n\nint main() {\n",
        "prompt": "    int n_threads = llama_",
        "input_suffix": "}\n",
    })

@ggerganov (Owner):

Yes, perfect. The test_invalid_input_extra_req also needs to be updated like this.

@ngxson ngxson merged commit ce8784b into ggerganov:master Dec 8, 2024
48 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* server : fix format_infill

* fix

* rename

* update test

* use another model

* update test

* update test

* test_invalid_input_extra_req