LlamaPromptLookupDecoding gives no performance improvement #2110

@vicmortelmans

Description

I'm automating spelling correction. Currently I'm testing this on a Mistral 7B Instruct Q4_K_M model running in llama.cpp via the Python bindings (llama-cpp-python), and it works fine. I submit text fragments of up to ~2000 characters.

I bumped into the "draft_model" option that does "prompt lookup decoding", a form of speculative decoding that drafts candidate tokens by matching n-grams from the prompt. Because my output is very similar to my prompt text, I'd expect that activating this would bring a significant performance gain.
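
As I understand it, the lookup step is roughly this (a simplified sketch of the idea, not the library's actual implementation):

    # Simplified sketch of prompt lookup decoding: find the most recent
    # occurrence of the last n-gram of generated tokens inside the prompt,
    # and propose the tokens that followed it there as the draft continuation.
    def lookup_draft(prompt_tokens, generated_tokens, ngram_size=2, num_pred_tokens=3):
        if len(generated_tokens) < ngram_size:
            return []
        ngram = generated_tokens[-ngram_size:]
        # search the prompt from the end for the most recent match
        for i in range(len(prompt_tokens) - ngram_size, -1, -1):
            if prompt_tokens[i:i + ngram_size] == ngram:
                start = i + ngram_size
                return prompt_tokens[start:start + num_pred_tokens]
        return []  # no match: fall back to normal decoding

Since a spelling-corrected output repeats most of the prompt verbatim, the draft should usually be accepted, which is why I expected a speed-up.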

But I've been testing this out and I see no performance gain at all when comparing inference with and without the 'draft_model' option. I've also tried different values for 'num_pred_tokens', but that doesn't make any difference either.

Using llama-cpp-python 0.3.16. I'm running on CPU. My prompts consist of an instruction in English followed by the text to be corrected in Dutch.

Maybe I'm doing something wrong? Or maybe I don't understand the purpose of this option?

Initialization:

    from llama_cpp import Llama
    from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

    n_ctx = 4096

    llm = Llama(
        model_path=model_path,
        n_ctx=n_ctx,
        n_threads=8,
        temperature=0,
        chat_format=None,
        # draft model that proposes tokens by looking them up in the prompt
        draft_model=LlamaPromptLookupDecoding(num_pred_tokens=3),
        logits_all=True,
    )

Inference:

    output = llm(
        prompt=prompt_text,
        max_tokens=2048,
        temperature=0,
        echo=True
    )
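
The comparison itself is just timing two otherwise identical runs, one with and one without draft_model (illustrative sketch; prompt_text stands in for one of my text fragments):

    import time

    start = time.perf_counter()
    output = llm(
        prompt=prompt_text,
        max_tokens=2048,
        temperature=0,
        echo=True
    )
    elapsed = time.perf_counter() - start
    # the returned dict follows the OpenAI-style completion format
    completion_tokens = output["usage"]["completion_tokens"]
    print(f"{completion_tokens} tokens in {elapsed:.1f} s "
          f"({completion_tokens / elapsed:.2f} tok/s)")

The tokens-per-second numbers come out essentially the same in both configurations.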
