On the issue of models requiring truncation #7

@cnshing

Certain models that require truncation will fail to generate text.

The dataset uses examples in the following format:

"""
is_original_content: None
over_18: None
post: comment
subreddit: Genshin_Impact
prompt: What do you think about Genshin Impact?
response: I think its great. It's a fun and addicting game that can be played anywhere. I personally like how...
"""

When the text surpasses the model's maximum token length, the example must be either removed or truncated for training to run without error. Removing long examples pollutes our data, as the generated text will no longer be representative of a person's pattern of text, only of a shorter version of it. Truncation can potentially remove the tags ("response", "prompt", etc.), encouraging the generated text to not follow the example format and rendering it unparsable.

For now, I will attempt to truncate the examples.
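
One possible way to truncate without losing the tags is to shorten only the body of the response and leave every tag line intact. The sketch below is a rough illustration, not the project's implementation: `truncate_example` is a hypothetical helper, and it assumes whitespace-separated words as a stand-in for the model's real tokens, so the budget would need to be recomputed with the actual tokenizer.

```python
def truncate_example(example: str, max_tokens: int) -> str:
    """Shorten only the text after the 'response:' tag so all tags survive.

    Assumption: one whitespace-separated word counts as one token. A real
    tokenizer will usually produce more tokens than words.
    """
    # Split at the response tag; everything before it is kept untouched.
    head, sep, response = example.partition("\nresponse:")
    if not sep:
        raise ValueError("example has no 'response:' tag")

    # Tokens already used by the metadata, the prompt, and the tag itself.
    used = len((head + sep).split())
    budget = max_tokens - used
    if budget <= 0:
        raise ValueError("tags and prompt alone exceed max_tokens")

    # Keep only as many response words as fit in the remaining budget.
    # Note: this collapses runs of whitespace inside the response.
    words = response.split()
    return head + sep + " " + " ".join(words[:budget])


example = (
    "post: comment\n"
    "prompt: What do you think?\n"
    "response: I think it is great and fun to play"
)
shortened = truncate_example(example, max_tokens=11)
print(shortened)
```

Because the cut happens only inside the response, the "prompt" and "response" tags always remain, so the result can still be parsed. The trade-off is that long prompts are never shortened, which is why the helper raises an error when the tags and prompt alone do not fit.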
