transformers.pipeline does not load tokenizer passed as string for custom models #31669

Open
3 of 4 tasks
chedatomasz opened this issue Jun 27, 2024 · 1 comment
chedatomasz commented Jun 27, 2024

System Info

  • transformers version: 4.42.1
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using distributed or parallel set-up in script?: no

Who can help?

@Narsil

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Locate a model which:
  • isn't present in TOKENIZER_MAPPING
  • doesn't specify model_config.tokenizer_class
  • nevertheless has a tokenizer on the Hub, loadable with AutoTokenizer
    These requirements mean this happens only for custom models (not integrated into the library), AFAIK. Running those requires trust_remote_code=True, so it might be wise to create your own example meeting these requirements. I will be using "tcheda/mot_test". A quick way to check these conditions is sketched below.
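A quick sketch (my own helper, not from any official script) to verify that a checkpoint meets the three conditions above; the TOKENIZER_MAPPING import path follows what transformers/pipelines/__init__.py itself uses:

from transformers import AutoConfig, AutoTokenizer
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

repo = "tcheda/mot_test"
model_config = AutoConfig.from_pretrained(repo, trust_remote_code=True)

print(type(model_config) in TOKENIZER_MAPPING)  # False: the custom config class is not in the mapping
print(model_config.tokenizer_class)             # None: config.json names no tokenizer class
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)  # loads fine regardless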
  2. Verify that the code works properly when the tokenizer and model are passed as pre-instantiated objects:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  3. Attempt to create a pipeline, specifying both the model and the tokenizer as strings:

from transformers import pipeline

pipe = pipeline("text-generation", model="tcheda/mot_test", tokenizer="tcheda/mot_test", trust_remote_code=True)
  4. The code crashes immediately with:

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
  1106         kwargs["device"] = device
  1107 
-> 1108     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py in __init__(self, *args, **kwargs)
    94 
    95     def __init__(self, *args, **kwargs):
---> 96         super().__init__(*args, **kwargs)
    97         self.check_model_type(
    98             TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py in __init__(self, model, tokenizer, feature_extractor, image_processor, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
   895             self.tokenizer is not None
   896             and self.model.can_generate()
--> 897             and self.tokenizer.pad_token_id is not None
   898             and self.model.generation_config.pad_token_id is None
   899         ):

AttributeError: 'str' object has no attribute 'pad_token_id'
  5. Explanation
    The bug is probably at

    load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None

    in transformers/pipelines/__init__.py. For a custom model neither condition holds (the config class isn't in TOKENIZER_MAPPING, and tokenizer_class is None), so load_tokenizer stays False even when a tokenizer is explicitly passed as a string. The tokenizer initialization block (which contains proper handling of the string case) therefore isn't entered at all, and the raw string is forwarded to the pipeline class, producing the AttributeError above.
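A possible direction for a fix (a sketch only, untested; variable names as they appear in transformers/pipelines/__init__.py) would be to also enter the tokenizer-loading branch when the caller explicitly passed a tokenizer string:

# Hypothetical patch sketch, not a tested change:
load_tokenizer = (
    type(model_config) in TOKENIZER_MAPPING
    or model_config.tokenizer_class is not None
    or isinstance(tokenizer, str)  # caller explicitly asked for a tokenizer by name
)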

Expected behavior

The pipeline should be created correctly, loading the tokenizer as if with AutoTokenizer.from_pretrained(tokenizer). This is the behaviour described in the docs.
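In the meantime, a workaround sketch (assuming the pre-instantiated form from step 2 keeps working when only the model is given as a string):

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model="tcheda/mot_test",
    tokenizer=tokenizer,  # pass the object, not the string, to bypass the broken branch
    trust_remote_code=True,
)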

@amyeroberts (Collaborator) commented:

cc @Rocketknight1
