We observed that trying to tokenize/detokenize strings containing the sequence <space>'m would not give back the initial string, but would "eat" the leading whitespace.
For example, the string "for 'manual'" will be transformed into "for'manual'"
Investigating further, we also observed the issue with strings containing <space>'s, which makes us think it may be related to the handling of contractions such as "I'm".
I can confirm that if I add clean_up_tokenization_spaces=False to the call to batch_decode, the issue disappears.
I'm not sure if this means this behavior is "normal" (or let's say expected at least), and what will be the consequences for setting this option to False.
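To illustrate why the option matters, here is a minimal standalone sketch. It assumes the cleanup step behind clean_up_tokenization_spaces is a chain of plain string replacements aimed at punctuation and English contractions. The function name clean_up_sketch is hypothetical and the replacement list is an assumption; this is not the library's code, and the real list may differ between versions.

```python
# Standalone sketch: no transformers import needed.
# ASSUMPTION: the cleanup is a sequence of plain str.replace calls like these.
ASSUMED_REPLACEMENTS = [
    (" .", "."),
    (" ?", "?"),
    (" !", "!"),
    (" ,", ","),
    (" ' ", "'"),
    (" n't", "n't"),
    (" 'm", "'m"),    # meant for "I 'm" -> "I'm"
    (" 's", "'s"),    # meant for "it 's" -> "it's"
    (" 've", "'ve"),
    (" 're", "'re"),
]


def clean_up_sketch(text: str) -> str:
    """Apply each assumed replacement in order (hypothetical helper)."""
    for old, new in ASSUMED_REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Intended case: a contraction that was split into two pieces.
print(clean_up_sketch("I 'm here"))      # I'm here

# Unintended case: a quoted word that happens to start with "m" or "s".
print(clean_up_sketch("for 'manual'"))   # for'manual'
print(clean_up_sketch("the 'string'"))   # the'string'
```

Under that assumption, the replacements cannot tell a split contraction from an opening quote followed by a word starting with m or s, which would explain both observations. Skipping the cleanup leaves the decoded string untouched.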
I tested updating to v4.48.1 (I forgot to report that I had already tested with v4.48.0 before, sorry), and the behavior is still the same. From the docs, it seems True is indeed still the default in the latest version.
Is there something to improve here, or shall it be left at that?
System Info
transformers==4.46.2
Who can help?
I guess it's for @ArthurZucker and @itazap
Reproduction
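Running a tokenize/detokenize round trip (encode, then batch_decode) on the string "for 'manual'" prints "for'manual'" (missing whitespace before the leading ').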
Expected behavior
It should output the initial string unchanged, with the whitespace before the leading ' preserved: "for 'manual'"