Hey there, @arbitropy! I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer. Let's tackle this issue together!

The issue with generation not stopping when you call the model directly most likely comes down to how the stopping criteria are set up. In the `HuggingFaceLLM` class, here is the relevant part of the code that sets up the stopping criteria:

```python
class StopOnTokens(StoppingCriteria):
    def __call__(
        self,
        input_ids: torch.LongTensor,
        scores: torch.FloatTensor,
        **kwargs: Any,
    ) -> bool:
        # Stop as soon as the last generated token is one of the stop IDs.
        for stop_id in stopping_ids_list:
            if input_ids[0][-1] == stop_id:
                return True
        return False

self._stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```

In contrast, the `chat` method formats the messages into a prompt with `messages_to_prompt` and then delegates to `complete`:

```python
@llm_chat_callback()
def chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponse:
    prompt = self.messages_to_prompt(messages)
    completion_response = self.complete(prompt, formatted=True, **kwargs)
    return completion_response_to_chat_response(completion_response)
```

To resolve this issue, ensure that the stopping criteria are correctly set for your model: `stopping_ids_list` is built from the `stopping_ids` argument passed to `HuggingFaceLLM`, so it needs to contain the token IDs your model actually emits at the end of a response. For more details, you can refer to the implementation of `HuggingFaceLLM` in the llama-index repository.
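Purely as an illustration (not from the original thread), here is a minimal sketch of passing `stopping_ids` when constructing `HuggingFaceLLM`; the model name and the `<|eot_id|>` token are assumptions, so substitute whatever your model actually uses to end a turn:

```python
from transformers import AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

# Placeholder model; replace with the model you are actually loading.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# End-of-turn token IDs; which extra tokens exist (e.g. "<|eot_id|>" for
# Llama-3-style chat templates) depends on the model you load.
stop_ids = [tokenizer.eos_token_id]
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
if eot_id is not None and eot_id != tokenizer.unk_token_id:
    stop_ids.append(eot_id)

llm = HuggingFaceLLM(
    model_name=model_name,
    tokenizer_name=model_name,
    max_new_tokens=256,
    stopping_ids=stop_ids,  # feeds stopping_ids_list / StopOnTokens above
)

# A plain completion should now stop instead of running to max_new_tokens.
print(llm.complete("What is LlamaIndex?"))
```

Which end-of-turn tokens a model emits varies between models, which would also explain why other models used with `HuggingFaceLLM` don't show this problem.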
-
I have been using this code to create an llm instance:
Then when I use llm.generate(), the generation never stops, but if I use llm.chat() with a list of ChatMessage objects, it stops appropriately.
Because of this, when I use metadata extraction classes like SummaryExtractor or QuestionsAnsweredExtractor, the generated metadata is repetitive text until the maximum token limit is reached.
Other models with HuggingFaceLLM don't have this problem.
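For reference, a minimal sketch of the two call paths being compared, assuming `complete()` is the non-chat path referred to as `generate()` above (the LLM construction code is omitted, as in the original post):

```python
from llama_index.core.llms import ChatMessage

# `llm` is the HuggingFaceLLM instance from the construction code omitted above.
prompt = "Summarize the following text: ..."

# Non-chat path: reportedly keeps generating until the max token limit.
completion = llm.complete(prompt)

# Chat path: reportedly stops where expected.
chat_response = llm.chat([ChatMessage(role="user", content=prompt)])
```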