Description
I'm automating spelling correction. Currently I'm testing this on a Mistral 7B Instruct Q4_K_M model running in llama.cpp via the Python bindings, and it works fine. I submit text fragments of up to ~2000 characters.
I bumped into the "draft_model" option that does "prompt lookup decoding", a form of speculative decoding where the draft tokens are looked up in the prompt itself. Because my output is very similar to my prompt text, I'd expect that activating this would bring a significant performance gain.
But in my tests I see no performance gain at all when comparing inference with and without the 'draft_model' option activated. I've tried different values for 'num_pred_tokens', but that doesn't make any difference either.
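
My mental model of what this option does, as a rough sketch (simplified, not the actual llama-cpp-python implementation):

def draft_from_prompt(prompt_tokens, generated_tokens, ngram_size=2, num_pred_tokens=3):
    # Take the last few generated tokens and look for that n-gram in the prompt.
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    for i in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[i:i + ngram_size] == tail:
            # Propose the tokens that followed the match as draft tokens;
            # the model then only has to verify them in a single batch.
            start = i + ngram_size
            return prompt_tokens[start:start + num_pred_tokens]
    return []

If that is roughly what happens under the hood, then for spelling correction, where most draft tokens should be accepted, I would expect a clear speed-up.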
Using llama-cpp-python 0.3.16. I'm running on CPU. My prompts consist of an instruction in English followed by the text to be corrected in Dutch.
Maybe I'm doing something wrong? Or maybe I don't understand the purpose of this option?
Initialization:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

n_ctx = 4096
llm = Llama(
    model_path=model_path,
    n_ctx=n_ctx,
    n_threads=8,
    temperature=0,
    chat_format=None,
    # Prompt lookup decoding: draft tokens are taken from the prompt itself.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=3),
    logits_all=True,
)
Inference:

output = llm(
    prompt=prompt_text,
    max_tokens=2048,
    temperature=0,
    echo=True,
)
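
This is roughly how I compare the two configurations (the build_llm and timed_run helper names are just for illustration, not part of my actual script):

import time
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

def build_llm(model_path, draft_model=None):
    # Same settings as above; the draft model is the only difference.
    return Llama(
        model_path=model_path,
        n_ctx=4096,
        n_threads=8,
        chat_format=None,
        draft_model=draft_model,
        logits_all=True,
    )

def timed_run(llm, prompt_text):
    start = time.perf_counter()
    output = llm(prompt=prompt_text, max_tokens=2048, temperature=0, echo=True)
    elapsed = time.perf_counter() - start
    return elapsed, output["usage"]["completion_tokens"]

# baseline = build_llm(model_path)
# with_draft = build_llm(model_path, LlamaPromptLookupDecoding(num_pred_tokens=3))
# print(timed_run(baseline, prompt_text))
# print(timed_run(with_draft, prompt_text))

The elapsed time per completion token comes out essentially the same with and without the draft model.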