Description
I'm automating spelling correction. Currently I'm testing this on a Mistral 7B Instruct Q4_K_M model running in llama.cpp via the Python bindings, and it works fine. I submit text fragments of up to ~2000 characters.
I bumped into the "draft_model" option that does "prompt lookup decoding", a form of speculative decoding where the draft tokens are looked up in the prompt itself. Because my output is very similar to my prompt text, I'd expect that activating this would bring a significant performance gain.
But in my tests I see no performance gain at all when comparing inference with and without the 'draft_model' option activated. I've tried different values for 'num_pred_tokens', but that doesn't make any difference either.
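
My mental model of what this option does, as a rough sketch (simplified, not the actual llama-cpp-python implementation):

def draft_from_prompt(prompt_tokens, generated_tokens, ngram_size=2, num_pred_tokens=3):
    # Take the last few generated tokens and look for that n-gram in the prompt.
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    for i in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[i:i + ngram_size] == tail:
            # Propose the tokens that followed the match as draft tokens;
            # the model then only has to verify them in a single batch.
            start = i + ngram_size
            return prompt_tokens[start:start + num_pred_tokens]
    return []

If that is roughly what happens under the hood, then for spelling correction, where most draft tokens should be accepted, I would expect a clear speed-up.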
Using llama-cpp-python 0.3.16. I'm running on CPU. My prompts consist of an instruction in English followed by the text to be corrected in Dutch.
Maybe I'm doing something wrong? Or maybe I don't understand the purpose of this option?
Initialization:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

n_ctx = 4096
llm = Llama(
    model_path=model_path,
    n_ctx=n_ctx,
    n_threads=8,
    temperature=0,
    chat_format=None,
    # Prompt lookup decoding: draft tokens are taken from the prompt itself.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=3),
    logits_all=True,
)
Inference:

output = llm(
    prompt=prompt_text,
    max_tokens=2048,
    temperature=0,
    echo=True,
)
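
This is roughly how I compare the two configurations (the build_llm and timed_run helper names are just for illustration, not part of my actual script):

import time
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

def build_llm(model_path, draft_model=None):
    # Same settings as above; the draft model is the only difference.
    return Llama(
        model_path=model_path,
        n_ctx=4096,
        n_threads=8,
        chat_format=None,
        draft_model=draft_model,
        logits_all=True,
    )

def timed_run(llm, prompt_text):
    start = time.perf_counter()
    output = llm(prompt=prompt_text, max_tokens=2048, temperature=0, echo=True)
    elapsed = time.perf_counter() - start
    return elapsed, output["usage"]["completion_tokens"]

# baseline = build_llm(model_path)
# with_draft = build_llm(model_path, LlamaPromptLookupDecoding(num_pred_tokens=3))
# print(timed_run(baseline, prompt_text))
# print(timed_run(with_draft, prompt_text))

The elapsed time per completion token comes out essentially the same with and without the draft model.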