Description
Certain models that require truncation fail to generate text.
The dataset uses examples in the following format:
"""
is_original_content: None
over_18: None
post: comment
subreddit: Genshin_Impact
prompt: What do you think about Genshin Impact?
response: I think its great. It's a fun and addicting game that can be played anywhere. I personally like how...
"""
When the text surpasses the model's maximum token length, the example must be either removed or truncated so that training runs without error. Removing the example pollutes our data, as the generated text will no longer be representative of a person's pattern of text, only of a shorter version of themselves. Truncation can remove the tags ("response", "prompt", etc.), which encourages the model to generate text that does not follow the example format and is therefore unparsable.
For now, I will attempt to truncate the examples.
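One way to truncate without losing the tags is to trim only the response body and leave every field before it intact. The sketch below is a rough illustration, not the project's actual code: `truncate_example` and its parameters are hypothetical names, it assumes the `response:` field always comes last, and it uses whitespace splitting as a stand-in for the model's real tokenizer.

```python
def truncate_example(text, max_tokens, tokenize=str.split, detokenize=" ".join):
    """Trim only the response body so the field tags survive truncation.

    Assumptions (not from the original issue): the "response:" field is the
    last field, and whitespace splitting approximates the real tokenizer.
    """
    head, tag, response = text.partition("response:")
    if not tag:
        # No response tag found: fall back to plain truncation.
        return detokenize(tokenize(text)[:max_tokens])

    prefix = head + tag
    # Tokens left for the response after the metadata and prompt are counted.
    budget = max_tokens - len(tokenize(prefix))
    if budget <= 0:
        raise ValueError("fields before the response already exceed max_tokens")

    kept = tokenize(response)[:budget]
    return prefix + " " + detokenize(kept)
```

With a real subword tokenizer, the same idea applies, but the budget should be computed from that tokenizer's token counts, since whitespace words and model tokens do not line up one to one.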