Fix batched generation for prompts of different lengths #2216
Previous code
_fast_prepare_inputs_for_generation is called to get the forward-method arguments for generating the next token. Note that it crops attention_mask and keeps only its last column. An attention mask of length 1 then trips a flag in forward that sets the attention mask to None.
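A minimal repro of the problem (the function and variable names here are invented for illustration; the real cropping happens inside unsloth's `_fast_prepare_inputs_for_generation`):

```python
import torch

# Hypothetical sketch of the old fast path's cropping behaviour.
def buggy_fast_prepare_inputs(input_ids, attention_mask):
    input_ids = input_ids[:, -1:]            # keep only the new token
    attention_mask = attention_mask[:, -1:]  # BUG: mask cropped to length 1
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Left-padded batch: two prompts of different lengths.
input_ids      = torch.tensor([[0, 0, 5, 6], [1, 2, 3, 4]])
attention_mask = torch.tensor([[0, 0, 1, 1], [1, 1, 1, 1]])
out = buggy_fast_prepare_inputs(input_ids, attention_mask)
print(out["attention_mask"])  # tensor([[1], [1]]) -- the padding info is gone,
# and a length-1 mask later trips the flag that sets the mask to None.
```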
I fixed it by copying a piece of code from the traditional prepare_inputs_for_generation, which calls base_model._prepare_4d_causal_attention_mask_with_cache_position when it is present. This lets models that define this function (most popular ones do) build the attention mask as they see fit.
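A sketch of the dispatch pattern, with a simplified stand-in for the transformers static method (its exact signature varies across transformers releases, so treat the parameter list below as an assumption, not the real API):

```python
import torch

class StubBaseModel:
    # Simplified stand-in for the real transformers static method;
    # the actual signature differs between transformers versions.
    @staticmethod
    def _prepare_4d_causal_attention_mask_with_cache_position(
        attention_mask, sequence_length, target_length, dtype,
        device, cache_position, batch_size,
    ):
        min_dtype = torch.finfo(dtype).min
        # Mask everything, then unmask keys at or before the query's
        # cache position (single-token decode: no intra-step triangle).
        mask = torch.full((sequence_length, target_length),
                          min_dtype, dtype=dtype, device=device)
        mask = mask * (torch.arange(target_length, device=device)
                       > cache_position.reshape(-1, 1))
        mask = mask[None, None, :, :].expand(batch_size, 1, -1, -1)
        # Re-mask padded key positions using the 2D padding mask
        # (assumes the padding mask already covers target_length keys).
        padding = attention_mask[:, None, None, :].eq(0)
        return mask.masked_fill(padding, min_dtype)

def fixed_prepare_inputs(base_model, input_ids, attention_mask, cache_position):
    input_ids = input_ids[:, -1:]  # keep only the newly generated token
    fn = getattr(base_model,
                 "_prepare_4d_causal_attention_mask_with_cache_position", None)
    if fn is not None:
        # Let the model expand the full 2D padding mask into a 4D causal
        # mask instead of cropping it down to a single column.
        attention_mask = fn(
            attention_mask,
            sequence_length=input_ids.shape[1],
            target_length=attention_mask.shape[-1],
            dtype=torch.float32,
            device=input_ids.device,
            cache_position=cache_position,
            batch_size=input_ids.shape[0],
        )
    return {"input_ids": input_ids, "attention_mask": attention_mask}

input_ids      = torch.tensor([[0, 0, 5, 6], [1, 2, 3, 4]])
attention_mask = torch.tensor([[0, 0, 1, 1], [1, 1, 1, 1]])
cache_position = torch.tensor([3])  # index of the token being decoded
out = fixed_prepare_inputs(StubBaseModel(), input_ids, attention_mask, cache_position)
print(out["attention_mask"])  # (2, 1, 1, 4) mask with the pad positions
                              # of the shorter prompt masked out
```

Falling back to the old behaviour when the method is absent keeps the fast path working for models that predate it, while models that define it get a proper 4D mask that preserves the left-padding of shorter prompts.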