transformers.pipeline does not load tokenizer passed as string for custom models #31669

Open
3 of 4 tasks
chedatomasz opened this issue Jun 27, 2024 · 1 comment
chedatomasz commented Jun 27, 2024

System Info

  • transformers version: 4.42.1
  • Platform: Linux-6.1.85+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.8.4 (cpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using distributed or parallel set-up in script?: no

Who can help?

@Narsil

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Locate a model which:
  • isn't present in TOKENIZER_MAPPING
  • doesn't specify model_config.tokenizer_class
  • nevertheless has a tokenizer on the Hub, loadable with AutoTokenizer
    These requirements mean this happens only for custom models (not integrated into the library), AFAIK. Running those requires trust_remote_code=True, so it might be wise to create your own example meeting these requirements. I will be using "tcheda/mot_test". A quick way to check these conditions is sketched below.
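A quick sketch (my own helper, not from any official script) to verify that a checkpoint meets the three conditions above; the TOKENIZER_MAPPING import path follows what transformers/pipelines/__init__.py itself uses:

from transformers import AutoConfig, AutoTokenizer
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

repo = "tcheda/mot_test"
model_config = AutoConfig.from_pretrained(repo, trust_remote_code=True)

print(type(model_config) in TOKENIZER_MAPPING)  # False: the custom config class is not in the mapping
print(model_config.tokenizer_class)             # None: config.json names no tokenizer class
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)  # loads fine regardless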
  2. Verify that the code works properly when the tokenizer and model are passed as pre-instantiated objects:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("tcheda/mot_test", trust_remote_code=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  3. Attempt to create a pipeline, specifying both the model and the tokenizer as strings:

from transformers import pipeline

pipe = pipeline("text-generation", model="tcheda/mot_test", tokenizer="tcheda/mot_test", trust_remote_code=True)
  4. The code crashes immediately with:

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
  1106         kwargs["device"] = device
  1107 
-> 1108     return pipeline_class(model=model, framework=framework, task=task, **kwargs)

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/text_generation.py in __init__(self, *args, **kwargs)
    94 
    95     def __init__(self, *args, **kwargs):
---> 96         super().__init__(*args, **kwargs)
    97         self.check_model_type(
    98             TF_MODEL_FOR_CAUSAL_LM_MAPPING_NAMES if self.framework == "tf" else MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py in __init__(self, model, tokenizer, feature_extractor, image_processor, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
   895             self.tokenizer is not None
   896             and self.model.can_generate()
--> 897             and self.tokenizer.pad_token_id is not None
   898             and self.model.generation_config.pad_token_id is None
   899         ):

AttributeError: 'str' object has no attribute 'pad_token_id'
  5. Explanation
    The bug is probably at

    load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None

    in transformers/pipelines/__init__.py. For a custom model neither condition holds (the config class isn't in TOKENIZER_MAPPING, and tokenizer_class is None), so load_tokenizer stays False even when a tokenizer is explicitly passed as a string. The tokenizer initialization block (which contains proper handling of the string case) therefore isn't entered at all, and the raw string is forwarded to the pipeline class, producing the AttributeError above.
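A possible direction for a fix (a sketch only, untested; variable names as they appear in transformers/pipelines/__init__.py) would be to also enter the tokenizer-loading branch when the caller explicitly passed a tokenizer string:

# Hypothetical patch sketch, not a tested change:
load_tokenizer = (
    type(model_config) in TOKENIZER_MAPPING
    or model_config.tokenizer_class is not None
    or isinstance(tokenizer, str)  # caller explicitly asked for a tokenizer by name
)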

Expected behavior

The pipeline should be created correctly, loading the tokenizer as if with AutoTokenizer.from_pretrained(tokenizer). This is the behaviour described in the docs.
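In the meantime, a workaround sketch (assuming the pre-instantiated form from step 2 keeps working when only the model is given as a string):

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("tcheda/mot_test", trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model="tcheda/mot_test",
    tokenizer=tokenizer,  # pass the object, not the string, to bypass the broken branch
    trust_remote_code=True,
)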

@amyeroberts (Collaborator) commented:

cc @Rocketknight1
