
Fix issue for alignment of pre-tokenized input with BPE, see #150 #154

Merged 7 commits into master on Jan 26, 2023

Conversation

@kermitt2 kermitt2 commented Jan 17, 2023

Fix an issue with extra tokens added when ensuring the alignment of pre-tokenized input with BPE, see #150.
We cover here the added tokens introduced by the BPE fallback mechanism for out-of-vocabulary input characters, e.g. "É" resulting in 'Ã', 'ī', which then have confusing offsets when the input is pre-tokenized.

We distinguish transformer tokenizers using BPE/SentencePiece from the others, based on the tokenizer class name. When BPE is used, we examine offsets and tokens in order to filter out the tokens added by the BPE OOV fallback mechanism.
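The filtering idea can be illustrated as follows (a minimal sketch, not the actual code of this PR; the function name is hypothetical): when byte-level BPE falls back to bytes for an OOV character, the resulting sub-tokens share the same character span in the offset mapping, so keeping only the first token per span restores the alignment.

```python
# Illustrative sketch: drop extra sub-tokens produced by the BPE byte
# fallback, identified by a character span already covered by a previous token.

def filter_bpe_fallback_tokens(tokens, offsets):
    """Keep one sub-token per character span of the input."""
    kept_tokens, kept_offsets = [], []
    previous_span = None
    for token, span in zip(tokens, offsets):
        if span == previous_span:
            # extra byte-fallback token for the same input character: skip it
            continue
        kept_tokens.append(token)
        kept_offsets.append(span)
        previous_span = span
    return kept_tokens, kept_offsets

# "Était": the OOV character "É" is split into two byte tokens 'Ã' and 'ī',
# both reported with the character span (0, 1).
tokens = ['Ã', 'ī', 'tait']
offsets = [(0, 1), (0, 1), (1, 5)]
print(filter_bpe_fallback_tokens(tokens, offsets))
# → (['Ã', 'tait'], [(0, 1), (1, 5)])
```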

Tested with BERT-based models and Roberta-based models, more tests to come...

@kermitt2 kermitt2 marked this pull request as draft January 17, 2023 12:41
@kermitt2 kermitt2 self-assigned this Jan 17, 2023
@kermitt2 kermitt2 added the bug Something isn't working label Jan 17, 2023
kermitt2 commented Jan 19, 2023

Update: tested and working with some BERT models, some Roberta models, CamemBERT (although Roberta-based, CamemBERT uses a SentencePiece tokenizer with the U+2581 metasymbol rather than the Ġ of the Roberta/GPT-2 BPE tokenizer), bart-base, albert-base-v2, and XLM.
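The two metasymbols mentioned above mark word boundaries differently, which matters when mapping sub-tokens back to pre-tokenized input. A minimal sketch of stripping either marker (the helper name is hypothetical, not part of this PR):

```python
# SentencePiece-style tokenizers (e.g. CamemBERT) prefix word-initial tokens
# with U+2581 '▁', while the Roberta/GPT-2 byte-level BPE uses U+0120 'Ġ'.
def strip_subword_marker(token):
    for marker in ('\u2581', '\u0120'):
        if token.startswith(marker):
            return token[len(marker):]
    return token

print(strip_subword_marker('\u2581chat'))  # → 'chat' (SentencePiece marker)
print(strip_subword_marker('\u0120cat'))   # → 'cat' (byte-level BPE marker)
print(strip_subword_marker('tait'))        # → 'tait' (continuation, unchanged)
```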

Note: fnet-base and luke-base are apparently not supported by AutoModel.

@kermitt2 kermitt2 marked this pull request as ready for review January 26, 2023 08:18
@kermitt2 kermitt2 merged commit 3ee3775 into master Jan 26, 2023
@lfoppiano lfoppiano linked an issue Jan 26, 2023 that may be closed by this pull request
Development

Successfully merging this pull request may close this issue:

Sub-tokenization with certain transformers