
The behavior of the tokenizer loaded from a GGUF file is incorrect. #31630

Open
2 of 4 tasks
Lin-xs opened this issue Jun 26, 2024 · 3 comments
Labels
Core: Tokenization, GGUF

Comments


Lin-xs commented Jun 26, 2024

System Info

  • transformers version: 4.42.0.dev0
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script: No
  • Using GPU in script: No
  • GPU type: NVIDIA RTX A6000

Who can help?

@ArthurZucker @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I installed transformers from #30391 (comment):

pip install git+https://github.com/younesbelkada/transformers.git@fix-llama-3-gguf-2

because the latest released version, v4.41.2, cannot load the tokenizer from a GGUF file correctly.

Here is my code:

from transformers import AutoTokenizer

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

# the text is a slice from load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"

encodings_1 = tokenizer.encode(text)
print(encodings_1)
print(tokenizer.decode(encodings_1))

The output is:

[128000, 14860, 223, 85782, 10634, 223, 46023, 10634, 223, 69191, 661, 10634, 223, 38096, 42914, 10634, 223, 336, 51480, 1534, 10634, 223, 1820, 10634, 223, 14789, 10634, 223, 1073, 10634, 223, 1820, 10634, 223, 3170, 10634, 223, 9493, 10634, 223, 75814, 1303, 10634, 223, 64, 10634, 223, 1816]
<|begin_of_text|> ▁Traditional▁Chinese▁literary▁criticism▁emphasized▁the▁life▁of▁the▁author▁when▁interpreting▁a▁work

The output of decode() should be identical to the input text, shouldn't it? I also tried encoding the same text using llama-cpp-python 0.2.79 with the same model:

from llama_cpp import Llama
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
llm_lcp_from_hf = Llama.from_pretrained(repo_id=model_id, filename=filename)
encodings_lcp = llm_lcp_from_hf.tokenize(text.encode('utf-8'))

print(encodings_lcp)
print(llm_lcp_from_hf.detokenize(encodings_lcp).decode('utf-8'))

The output is correct:

[85782, 8620, 32465, 19347, 46728, 279, 2324, 315, 279, 3229, 994, 66744, 264, 990]
Traditional Chinese literary criticism emphasized the life of the author when interpreting a work

Expected behavior

The result of decode() should be identical to the raw text.
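
For concreteness, here is a minimal round-trip check that captures this expectation (a sketch using the same model and text as above; skip_special_tokens drops the <|begin_of_text|> token so only the text itself is compared):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "QuantFactory/Meta-Llama-3-8B-GGUF", gguf_file="Meta-Llama-3-8B.Q4_K_M.gguf"
)
text = "Traditional Chinese literary criticism emphasized the life of the author when interpreting a work"

ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids, skip_special_tokens=True)
# This should hold for a correctly loaded tokenizer; with the GGUF-loaded
# tokenizer above it fails because of the stray "▁" markers left in the
# decoded string.
assert decoded == text, f"round-trip mismatch: {decoded!r}"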

@amyeroberts added the Core: Tokenization and GGUF labels on Jun 26, 2024
ArthurZucker (Collaborator) commented

Hey! I think this was recently fixed, so installing 4.42.x should work. I just tested it locally (screenshot omitted).
Make sure to run pip install -U transformers
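
A quick way to confirm the environment actually picked up a release containing the fix, assuming, per the comment above, that it first shipped with 4.42.0:

import transformers
# Expect a 4.42.x (or newer) version string here after upgrading
print(transformers.__version__)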


Lin-xs commented Jun 29, 2024


Thank you @ArthurZucker, the tokenizer now works well. However, when I try to save and then reload it, another error occurs: RuntimeError: Internal: could not parse ModelProto from ...

Code:

from transformers import AutoTokenizer
model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

save_dir = '../../deq_models/test'
tokenizer.save_pretrained(save_dir)

tokenizer2 = AutoTokenizer.from_pretrained(save_dir)

Package versions:
sentencepiece 0.2.0
transformers 4.42.3
Traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 tokenizer2 = AutoTokenizer.from_pretrained(save_dir)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py:889, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    885     if tokenizer_class is None:
    886         raise ValueError(
    887             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    888         )
--> 889     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    891 # Otherwise we have to be creative.
    892 # if model is an encoder decoder, the encoder tokenizer class is used by default
    893 if isinstance(config, EncoderDecoderConfig):

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2163, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2160     else:
   2161         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2163 return cls._from_pretrained(
   2164     resolved_vocab_files,
   2165     pretrained_model_name_or_path,
   2166     init_configuration,
   2167     *init_inputs,
   2168     token=token,
   2169     cache_dir=cache_dir,
   2170     local_files_only=local_files_only,
   2171     _commit_hash=commit_hash,
   2172     _is_local=is_local,
   2173     trust_remote_code=trust_remote_code,
   2174     **kwargs,
   2175 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2397, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2395 # Instantiate the tokenizer.
   2396 try:
-> 2397     tokenizer = cls(*init_inputs, **init_kwargs)
   2398 except OSError:
   2399     raise OSError(
   2400         "Unable to load vocabulary from file. "
   2401         "Please check that the provided vocabulary is accessible and not corrupted."
   2402     )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama_fast.py:157, in LlamaTokenizerFast.__init__(self, vocab_file, tokenizer_file, clean_up_tokenization_spaces, unk_token, bos_token, eos_token, add_bos_token, add_eos_token, use_default_system_prompt, legacy, add_prefix_space, **kwargs)
    154 if add_prefix_space is not None:
    155     kwargs["from_slow"] = True
--> 157 super().__init__(
    158     vocab_file=vocab_file,
    159     tokenizer_file=tokenizer_file,
    160     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
    161     unk_token=unk_token,
    162     bos_token=bos_token,
    163     eos_token=eos_token,
    164     add_bos_token=add_bos_token,
    165     add_eos_token=add_eos_token,
    166     use_default_system_prompt=use_default_system_prompt,
    167     add_prefix_space=add_prefix_space,
    168     legacy=legacy,
    169     **kwargs,
    170 )
    171 self._add_bos_token = add_bos_token
    172 self._add_eos_token = add_eos_token

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py:131, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    127         kwargs.update(additional_kwargs)
    129 elif self.slow_tokenizer_class is not None:
    130     # We need to create and convert a slow tokenizer to build the backend
--> 131     slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
    132     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    133 else:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:171, in LlamaTokenizer.__init__(self, vocab_file, unk_token, bos_token, eos_token, pad_token, sp_model_kwargs, add_bos_token, add_eos_token, clean_up_tokenization_spaces, use_default_system_prompt, spaces_between_special_tokens, legacy, add_prefix_space, **kwargs)
    169 self.add_eos_token = add_eos_token
    170 self.use_default_system_prompt = use_default_system_prompt
--> 171 self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
    172 self.add_prefix_space = add_prefix_space
    174 super().__init__(
    175     bos_token=bos_token,
    176     eos_token=eos_token,
   (...)
    187     **kwargs,
    188 )

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py:198, in LlamaTokenizer.get_spm_processor(self, from_slow)
    196 tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    197 if self.legacy or from_slow:  # no dependency on protobuf
--> 198     tokenizer.Load(self.vocab_file)
    199     return tokenizer
    201 with open(self.vocab_file, "rb") as f:

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:961, in SentencePieceProcessor.Load(self, model_file, model_proto)
    959 if model_proto:
    960   return self.LoadFromSerializedProto(model_proto)
--> 961 return self.LoadFromFile(model_file)

File ~/miniconda3/envs/swq/lib/python3.11/site-packages/sentencepiece/__init__.py:316, in SentencePieceProcessor.LoadFromFile(self, arg)
    315 def LoadFromFile(self, arg):
--> 316     return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

RuntimeError: Internal: could not parse ModelProto from ../../deq_models/test/tokenizer.model
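
For anyone digging into this, a small diagnostic sketch (using the hypothetical save_dir from the snippet above) that separates the two steps visible in the traceback: list what save_pretrained wrote, then hand the saved tokenizer.model to sentencepiece directly, which raises the same "could not parse ModelProto" error.

import os
import sentencepiece as spm

save_dir = '../../deq_models/test'
# See which files save_pretrained produced next to tokenizer.model
print(sorted(os.listdir(save_dir)))

# Reproduce the failure in isolation: per the traceback, the fast tokenizer
# falls back to the slow LlamaTokenizer, which loads tokenizer.model with
# sentencepiece; that parse is what fails here.
sp = spm.SentencePieceProcessor()
sp.Load(os.path.join(save_dir, 'tokenizer.model'))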

@Lin-xs closed and reopened this issue several times on Jun 29, 2024

Lin-xs commented Jun 29, 2024

Sorry, I closed and reopened the issue by accident.
