
Fix issue for alignment of pre-tokenized input with BPE, see #150 #154

Merged 7 commits into master on Jan 26, 2023

Conversation

@kermitt2 kermitt2 commented Jan 17, 2023

Fix an issue with extra tokens added when ensuring the alignment of pre-tokenized input with BPE, see #150.
We cover here the added tokens introduced by the BPE fallback mechanism for out-of-vocabulary input characters, e.g. "É" resulting in 'Ã', 'ī', which then have confusing offsets when the input is pre-tokenized.

We distinguish transformer tokenizers using BPE/SentencePiece from the others, based on the tokenizer class name. When BPE is used, we examine offsets and tokens in order to filter out the tokens added by the BPE OOV fallback mechanism.
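The filtering idea can be illustrated as follows (a minimal sketch, not the actual code of this PR; the function name is hypothetical): when byte-level BPE falls back to bytes for an OOV character, the resulting sub-tokens share the same character span in the offset mapping, so keeping only the first token per span restores the alignment.

```python
# Illustrative sketch: drop extra sub-tokens produced by the BPE byte
# fallback, identified by a character span already covered by a previous token.

def filter_bpe_fallback_tokens(tokens, offsets):
    """Keep one sub-token per character span of the input."""
    kept_tokens, kept_offsets = [], []
    previous_span = None
    for token, span in zip(tokens, offsets):
        if span == previous_span:
            # extra byte-fallback token for the same input character: skip it
            continue
        kept_tokens.append(token)
        kept_offsets.append(span)
        previous_span = span
    return kept_tokens, kept_offsets

# "Était": the OOV character "É" is split into two byte tokens 'Ã' and 'ī',
# both reported with the character span (0, 1).
tokens = ['Ã', 'ī', 'tait']
offsets = [(0, 1), (0, 1), (1, 5)]
print(filter_bpe_fallback_tokens(tokens, offsets))
# → (['Ã', 'tait'], [(0, 1), (1, 5)])
```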

Tested with BERT-based models and Roberta-based models, more tests to come...

@kermitt2 kermitt2 marked this pull request as draft January 17, 2023 12:41
@kermitt2 kermitt2 self-assigned this Jan 17, 2023
@kermitt2 kermitt2 added the bug Something isn't working label Jan 17, 2023
kermitt2 commented Jan 19, 2023

Update: tested and working with some BERT models, some Roberta models, CamemBERT (although Roberta-based, CamemBERT uses a SentencePiece tokenizer with the U+2581 metasymbol rather than the Ġ of the Roberta/GPT-2 BPE tokenizer), bart-base, albert-base-v2, and XLM.
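The two metasymbols mentioned above mark word boundaries differently, which matters when mapping sub-tokens back to pre-tokenized input. A minimal sketch of stripping either marker (the helper name is hypothetical, not part of this PR):

```python
# SentencePiece-style tokenizers (e.g. CamemBERT) prefix word-initial tokens
# with U+2581 '▁', while the Roberta/GPT-2 byte-level BPE uses U+0120 'Ġ'.
def strip_subword_marker(token):
    for marker in ('\u2581', '\u0120'):
        if token.startswith(marker):
            return token[len(marker):]
    return token

print(strip_subword_marker('\u2581chat'))  # → 'chat' (SentencePiece marker)
print(strip_subword_marker('\u0120cat'))   # → 'cat' (byte-level BPE marker)
print(strip_subword_marker('tait'))        # → 'tait' (continuation, unchanged)
```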

Note: fnet-base and luke-base are apparently not supported by AutoModel.

@kermitt2 kermitt2 marked this pull request as ready for review January 26, 2023 08:18
@kermitt2 kermitt2 merged commit 3ee3775 into master Jan 26, 2023
@lfoppiano lfoppiano linked an issue Jan 26, 2023 that may be closed by this pull request
Development

Successfully merging this pull request may close this issue:

Sub-tokenization with certain transformers