
Tokenizer discard data that exceed max_length #31627

Open
fengyunflya opened this issue Jun 26, 2024 · 4 comments

Labels: Core: Tokenization (Internals of the library; Tokenization.), Feature request (Request for a new feature)

Comments

@fengyunflya

Feature request

When using the tokenizer, it can truncate data to max_length, but it cannot simply discard the data instead.

Motivation

Sometimes we want the sentence to remain complete.

Your contribution

No

@fengyunflya fengyunflya added the Feature request label Jun 26, 2024
@seanswyi (Contributor) commented Jun 26, 2024

To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?

@fengyunflya (Author)

> To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?

For example, I have a sentence that may exceed the max length, but I have to encode it to know that. If I check this during pre-processing, with a lot of data that wastes time, because I have to batch-tokenize the data again later. If there were a parameter in the tokenizer method that simply discards a sentence whose encoding exceeds max_length, each sentence would only need to be encoded once.

@amyeroberts amyeroberts added the Core: Tokenization label Jun 26, 2024
@amyeroberts (Collaborator)

cc @ArthurZucker

@ArthurZucker (Collaborator)

Hey! This has not been requested much, so I would recommend doing this manually, for example in your data collator: first encode everything, discard what is too long, then pad!
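
A minimal sketch of that suggestion, assuming a Hugging Face tokenizer; the checkpoint name, the example texts, and the max_length of 32 are placeholder choices for illustration, not part of the thread:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and length limit for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_length = 32

texts = [
    "A short sentence that fits.",
    "A much longer sentence that we would rather drop entirely than truncate. " * 10,
]

# Encode once, without truncation or padding, so the real token counts are visible.
encoded = tokenizer(texts, truncation=False, padding=False)

# Keep only the examples whose token count fits within max_length.
kept = [
    {"input_ids": ids, "attention_mask": mask}
    for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"])
    if len(ids) <= max_length
]

# Pad the surviving examples to a common length, e.g. inside a data collator.
batch = tokenizer.pad(kept, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # only the sequences that were kept
```

Filtering on the already-encoded output is what keeps this a single pass: nothing is tokenized twice, and `tokenizer.pad` only ever sees sequences that fit.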
