AutoTokenizer: Phi-3 drops spaces when decodes a token at a time #31643

Andrei-Aksionov · 2024-06-26T15:11:06Z

System Info

transformers version: 4.41.2
Platform: macOS-14.5-x86_64-i386-64bit
Python version: 3.11.6
Huggingface_hub version: 0.23.4
Safetensors version: 0.4.3
Accelerate version: 0.31.0
Accelerate config: not found
PyTorch version (GPU?): 2.2.2 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

phi_2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

for name, tokenizer in (("phi-2", phi_2_tokenizer), ("phi-3", phi_3_tokenizer)):
    print(f"Tokenizer: {name}")
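    tokens = tokenizer.encode("This is a test string")
    print(f"{tokens=}")
    # Decode the whole sequence in one call.
    print(tokenizer.decode(tokens))
    # Decode each token separately and concatenate the pieces.
    print("".join([tokenizer.decode(token) for token in tokens]))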
    print("-" * 50)

Tokenizer: phi-2
tokens=[1212, 318, 257, 1332, 4731]
This is a test string
This is a test string
--------------------------------------------------
Tokenizer: phi-3
tokens=[1, 910, 338, 263, 1243, 1347]
<s> This is a test string
<s>Thisisateststring
--------------------------------------------------

Expected behavior

I expect that decoding one token at a time and joining the pieces yields the same string, with spaces between words, as decoding the whole sequence at once.
As the output above shows, Phi-2 behaves this way, but Phi-3 drops the spaces and produces a single concatenated string.
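
A possible workaround, sketched here under assumptions rather than confirmed in this thread: Phi-3 appears to use a SentencePiece-style tokenizer in which a word boundary is stored as a prefix marker on the token, and that marker is stripped when a token is decoded on its own. If so, decoding a growing prefix and emitting only the newly added suffix should preserve the spaces. The helper `stream_decode` and the toy vocabulary below are hypothetical and only illustrate the idea; they are not part of `transformers`.

```python
from typing import Callable, Iterator, Sequence


def stream_decode(
    decode: Callable[[Sequence[int]], str], token_ids: Sequence[int]
) -> Iterator[str]:
    """Yield text piece by piece by decoding a growing prefix of token_ids."""
    emitted = ""
    for end in range(1, len(token_ids) + 1):
        text = decode(token_ids[:end])
        # Emit only what the newest token added to the already decoded prefix.
        piece = text[len(emitted):]
        emitted = text
        yield piece


# Toy stand-in for a SentencePiece-style tokenizer (hypothetical vocabulary,
# NOT Phi-3's): "▁" marks a word boundary, and the leading boundary of any
# decoded sequence is stripped.
TOY_VOCAB = {0: "▁This", 1: "▁is", 2: "▁a", 3: "▁test"}


def toy_decode(ids: Sequence[int]) -> str:
    return "".join(TOY_VOCAB[i] for i in ids).replace("▁", " ").lstrip(" ")


ids = [0, 1, 2, 3]
# Per-token decoding loses every space, as in the Phi-3 output above.
naive = "".join(toy_decode([i]) for i in ids)
# Prefix-based decoding keeps them.
streamed = "".join(stream_decode(toy_decode, ids))
print(repr(naive))
print(repr(streamed))
```

With a real tokenizer the call would look like `"".join(stream_decode(tokenizer.decode, tokens))`. Note that this re-decodes the prefix at every step, so it is quadratic in sequence length, and it assumes the decoded text of a prefix never changes once more tokens are appended, which may not hold for tokens that encode partial multi-byte characters.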

ArthurZucker · 2024-06-28T16:07:19Z

cc @itazap

amyeroberts added the "Core: Tokenization" label (Internals of the library; Tokenization.) on Jun 26, 2024
