format for inference in code completion #10

Open
ARIELDENG opened this issue Mar 4, 2024 · 6 comments

@ARIELDENG

StarCoder's format for code-completion inference is PSM: <fim_prefix> + prefix + <fim_suffix> + suffix + <fim_middle>

What is the format for StarCoder2?

From the paper, we could only see this:
[image]

@xcxhy

xcxhy commented Mar 4, 2024

I have the same question. I want the model to complete code from a well-specified prompt, but generation does not stop. Could you share the recommended prompt format?

@loubnabnl
Contributor

It's the same as StarCoder. We apply FIM inside each file regardless of the repo structure (the file path at the beginning of a file is optional), so you can use:

<fim_prefix>prefix<fim_suffix>suffix<fim_middle>
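
As a minimal sketch of how that prompt might be assembled in Python: the three special tokens are the ones quoted above, while the helper name `build_fim_prompt` and the sample prefix and suffix are hypothetical and only for illustration.

```python
# Special tokens as quoted in this thread (PSM order: prefix, suffix, middle).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap the code before and after the cursor.

    The model is expected to generate the missing middle after <fim_middle>.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Illustrative input: the cursor sits right after "return ".
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
```

The resulting string is what would be passed to the tokenizer or completion endpoint; no extra whitespace is added around the tokens.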

@ARIELDENG
Author

Thanks for your attention, but the output won't stop when I apply this formatting, which is the same problem @xcxhy reported.
[image]
However, the output seems to follow a pattern, as shown in the picture above, so @xcxhy, you can fix it like this:
[image]

@loubnabnl
Contributor

Yes, <file_sep> is the token we use to separate files, so you can use it as a stop token. The <|endoftext|> token was used to separate repositories, since we now concatenate files from the same repo into one sample.
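
If the inference backend has no stop-sequence option, one possible workaround is to cut the generated text client-side at the first separator. The sketch below assumes the raw completion is available as a string with special tokens preserved; `truncate_at_stop` is a hypothetical helper name, and the sample string is made up.

```python
# Separator tokens named in this thread.
STOP_TOKENS = ("<file_sep>", "<|endoftext|>")


def truncate_at_stop(generated: str, stop_tokens=STOP_TOKENS) -> str:
    """Hypothetical helper: keep only the text before the earliest stop token."""
    cut = len(generated)
    for token in stop_tokens:
        index = generated.find(token)
        if index != -1:
            cut = min(cut, index)
    return generated[:cut]


# Illustrative output in which the model continued into another file.
completion = truncate_at_stop("a + b<file_sep>other_file.py\nimport os")
```

Everything after the separator is discarded, not just the token itself, because the model may continue with content from a different file.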

@ARIELDENG
Author

Thank you so much, and the StarCoder series is really amazing! Recently I've been using the models for SFT to better adapt to our users' habits, and I have seen great improvement.

daanturo added a commit to daanturo/starhugger.el that referenced this issue Jun 30, 2024
Not just removing them, since StarCoder2 may generate cross-file
tokens after "<file_sep>".

Ref: bigcode-project/starcoder2#10 (comment)

@robertpiosik

robertpiosik commented Jul 8, 2024

When using Ollama, all you need to do is set <file_sep> as a stop sequence. When using https://github.com/huggingface/llm-vscode, the setting would be:

  "llm.requestBody": {
    "stream": true,
    "options": {
      "stop": [
        "<file_sep>"
      ],
      "temperature": 0
    }
  },
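
For calling Ollama directly rather than through the editor extension, the same stop sequence could be passed in the request body. This is a rough sketch under several assumptions: a local Ollama server on its default port, a model pulled under the tag `starcoder2:3b`, and `raw` set so the FIM prompt is sent without a chat template. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request_body(prompt: str, model: str = "starcoder2:3b") -> dict:
    """Hypothetical helper: request body with <file_sep> as the stop sequence."""
    return {
        "model": model,   # assumption: the tag the model was pulled under
        "prompt": prompt,
        "raw": True,      # assumption: send the FIM prompt without templating
        "stream": False,
        "options": {"stop": ["<file_sep>"], "temperature": 0},
    }


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    data = json.dumps(build_request_body(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


body = build_request_body(
    "<fim_prefix>def add(a, b):\n    return <fim_suffix>\n<fim_middle>"
)
```

Only `build_request_body` runs at import time; `complete` is left uncalled here because it depends on a server being available.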
