
Tokenizer discard data that exceed max_length #31627

Open
fengyunflya opened this issue Jun 26, 2024 · 4 comments

Labels: Core: Tokenization (Internals of the library; Tokenization.), Feature request (Request for a new feature)

Comments

@fengyunflya

Feature request

When using the tokenizer, it can truncate data to max_length, but it cannot simply discard the data instead.

Motivation

Sometimes we want the sentence to remain complete.

Your contribution

No

@fengyunflya fengyunflya added the Feature request label Jun 26, 2024
@seanswyi (Contributor) commented Jun 26, 2024

To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?

@fengyunflya (Author)

> To clarify, are you saying that you want there to be an option so that if the tokenizer must truncate the input it would just discard it entirely? Wouldn't it be better to handle this at the pre-processing stage before you tokenize the data?

For example, I have a sentence that may exceed the max length, but I have to encode it to know that. If I check this during pre-processing, with a lot of data that wastes time, because I have to batch-tokenize the data again later. If there were a parameter in the tokenizer method that simply discards a sentence whose encoding exceeds max_length, each sentence would only need to be encoded once.

@amyeroberts amyeroberts added the Core: Tokenization label Jun 26, 2024
@amyeroberts (Collaborator)

cc @ArthurZucker

@ArthurZucker (Collaborator)

Hey! This has not been requested much, so I would recommend doing this manually, for example in your data collator: first encode everything, discard what is too long, then pad!
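
A minimal sketch of that suggestion, assuming a Hugging Face tokenizer; the checkpoint name, the example texts, and the max_length of 32 are placeholder choices for illustration, not part of the thread:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and length limit for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_length = 32

texts = [
    "A short sentence that fits.",
    "A much longer sentence that we would rather drop entirely than truncate. " * 10,
]

# Encode once, without truncation or padding, so the real token counts are visible.
encoded = tokenizer(texts, truncation=False, padding=False)

# Keep only the examples whose token count fits within max_length.
kept = [
    {"input_ids": ids, "attention_mask": mask}
    for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"])
    if len(ids) <= max_length
]

# Pad the surviving examples to a common length, e.g. inside a data collator.
batch = tokenizer.pad(kept, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # only the sequences that were kept
```

Filtering on the already-encoded output is what keeps this a single pass: nothing is tokenized twice, and `tokenizer.pad` only ever sees sequences that fit.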
