Support FIM for models using ChatML format #142
Comments
Thanks @ChrisDeadman. This is very interesting; I will look into it for a future release.
Hey @ChrisDeadman, I tried this but didn't get much luck. How can we deal with prefix and suffix? Can we write a custom template for it?
Yes, that should be possible, and it would make it easier for me to try out - if I find something usable I could make a PR of a ChatML template then.
On second thought, the templates need to support some kind of flag which tells your response parser that the last part of the template (e.g. the 3 backticks) should be prepended to the actual model response before parsing it.
Under Python, using Hugging Face transformers chat templates, you can get the "generation prompt" like this:
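A minimal sketch of one way to do it, assuming you render the template twice with `apply_chat_template` and take the difference (the model name is only an example of a ChatML model):

```python
from transformers import AutoTokenizer

# Any ChatML model works here; this one is only an example.
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

messages = [{"role": "user", "content": "Complete the following code."}]

# Render the chat template once without and once with the generation prompt.
without_gen = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
with_gen = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The difference between the two renderings is the generation prompt itself,
# e.g. "<|im_start|>assistant\n" for ChatML models.
generation_prompt = with_gen[len(without_gen):]
print(repr(generation_prompt))
```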
Because `add_generation_prompt=True` only appends the assistant turn header to the rendered conversation, that part can be isolated - so only the generation prompt is returned. Maybe something similar could be done with the hbs templates.
Hey, sorry, but I'm still unsure about it. If you could adapt it to an hbs template it might be clearer? I did try to adapt the FIM completions to use templates but didn't know what format it should be.
Ollama automatically wraps whatever you pass to the /generate endpoint with a template (unless you turn it off with the `raw` option). The default Mistral template is pretty boring - https://ollama.com/library/mistral:latest/blobs/e6836092461f - but a lot of them follow that same format - https://ollama.com/library/dolphincoder:latest/blobs/62fbfd9ed093. The Ollama generate endpoint does allow overriding both the default model template and system message. Not sure if other systems (like vLLM) do, though.
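For illustration, a minimal sketch of a raw request against Ollama's /api/generate in Python; the model name and the CodeLlama infill tokens are just examples, not what this extension actually sends:

```python
import requests

# "raw": True tells Ollama to skip the model's built-in prompt template,
# so the FIM prompt reaches the model exactly as written.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codellama:7b-code",  # example model with infill support
        "prompt": "<PRE> def add(a, b):\n <SUF>\n\nprint(add(1, 2)) <MID>",
        "raw": True,
        "stream": False,
        "options": {"num_predict": 64},
    },
)
print(resp.json()["response"])
```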
If I understand the hbs syntax correctly, this should work for your existing FIM stuff:
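For example (the variable names `context`, `prefix` and `suffix` are assumptions about what the extension passes in, and the CodeLlama infill tokens are only one possible stop-token format):

```hbs
{{!-- Stop-token style FIM prompt in the CodeLlama infill format. --}}
{{!-- Triple braces keep the code from being HTML-escaped. --}}
<PRE> {{{context}}}{{{prefix}}} <SUF>{{{suffix}}} <MID>
```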
Just supply the 3 variables as args or pass only
For ChatML it could be something like this (not tested):
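Something in this direction, again assuming hypothetical `prefix`/`suffix` variables; the system prompt wording is only illustrative:

```hbs
{{!-- ChatML-style FIM prompt; the assistant turn is left open so the model
     continues from there. --}}
<|im_start|>system
You are a code completion assistant. Write the code that belongs between the prefix and the suffix. Respond with code only.<|im_end|>
<|im_start|>user
Prefix:
{{{prefix}}}

Suffix:
{{{suffix}}}<|im_end|>
<|im_start|>assistant
```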
I believe I understand. I'll add another template into my PR (#174) on the next update. @ChrisDeadman, are you using Ollama as the backend? (Or if not, what are you using?)
I wrote a custom server - I added an Ollama-compatible API to run this extension over it.
Ah. Ollama does normally wrap the prompt passed to /generate with a template unless `raw` is set. I think probably all autocomplete requests should use `raw` mode. Long term, it'd be cool to be able to edit both the chat and FIM templates as HBS in VS Code the same way command templates currently can be. For now I'll just add an extra template to the code.
Hey, FYI, this should now work with any ChatML endpoint as the provider, as I added the ability to edit and choose a custom FIM template. The first test would be using the OpenAI API through LiteLLM with GPT-3.5 or GPT-4. I am still unsure if Ollama supports ChatML? Also should I add

Here is a template I have been using with GPT-4 with pretty good success:
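Not the exact template from this comment, but a sketch of the same idea: an instruction-style prompt with a fill marker that chat models such as GPT-4 tend to handle well (the variable names and the `<FILL_HERE>` marker are assumptions):

```hbs
{{!-- Instruction-style FIM prompt for chat models. The model is asked to
     return only the code that replaces the marker. --}}
You are a code completion engine. The code below is missing a section,
marked with <FILL_HERE>. Reply with only the code that replaces the marker.
Do not add explanations or markdown fences.

{{{prefix}}}<FILL_HERE>{{{suffix}}}
```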
imo this looks like a great approach 👍🏻
Sorry if I just missed it, but I don't really understand how to make this work. I'm currently running a fairly large model through Ollama (https://ollama.com/wojtek/beyonder), and it'd be great if I could use it for FIM as well. Some additional info:
- Editor: VSCodium
- Model's HuggingFace page: https://huggingface.co/mlabonne/Beyonder-4x7B-v3-GGUF
Thanks 😅 🙏 💜
So I tested this briefly with Llama-3 by selecting "custom template" in the FIM Provider settings and modifying the templates like so:
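Roughly this shape, using the Llama-3 turn format; this is a sketch and not necessarily the exact templates used in that test:

```hbs
{{!-- Llama-3 style FIM prompt; note the turn tokens differ from ChatML. --}}
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a code completion assistant. Write the code that belongs between the prefix and the suffix. Respond with code only.<|eot_id|><|start_header_id|>user<|end_header_id|>

Prefix:
{{{prefix}}}

Suffix:
{{{suffix}}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```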
What I found is:
But other than that it seems to work 😃 It would be nice to be able to repeat the last line of the prefix as the model response.
Thanks @ChrisDeadman, I think this can be arranged. Is there anything else data-wise you'd like passed to the template? Many thanks,
Thanks @rjmacarthy! I cannot think of anything else that is missing at the moment; that should be enough to support ChatML and Llama-3 templates imo.
I've added the language to the template now, but I think there are still some inconsistencies in how it works which I need to iron out. I still had mixed results with llama3:8b.
Did you also add an option to get the last line of the prefix?
No I didn't actually, I can add it.
That would be awesome, I will do some tests when it's ready 😃
First of all: your extension is awesome, thanks for all your effort in constantly making it better! 👍🏻
FIM doesn't work for Mistral-7B-Instruct-v0.2-code-ft.
I know that the ChatML format is mostly suited for turn-based conversations.
However, except for the suggestions, you've already refactored your code to use the turn-based Ollama endpoint...
I get the reason why you have to use the generate endpoint of Ollama for FIM, which sucks a bit because you then have to support all the different turn templates.
This is still a big issue for all client applications that want to make models answer in a specific way.
If you were to try out ChatML support though, you could try appending the start of the expected model response after the template, e.g.:
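For example, ending the rendered prompt with an opened assistant turn (the surrounding messages here are only placeholders):

````text
<|im_start|>user
...the FIM instructions, prefix and suffix go here...<|im_end|>
<|im_start|>assistant
```python
````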
I have tried this out manually and it works. Basically, everything you write after `<|im_start|>assistant` will make the model think it started its answer like that (this works for basically all models, not just ChatML-based ones).

For reference, this is the correct huggingface tokenizer template for ChatML:
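In its commonly used Jinja form it looks like this:

```jinja
{# Standard ChatML chat template; exact whitespace handling can vary between models. #}
{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```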