How to Load Any HuggingFace Model in spaCy #10768
polm started this conversation in Help: Best practices
spaCy's wrapper for the HuggingFace Transformers library allows you to specify any model on the HuggingFace Hub. The model will automatically be downloaded the first time it's used and wired up in the pipeline as a feature source. This post will show you how to do that in the config file or in code. Before we get started, though, it's important to keep in mind the limitations of HuggingFace models in spaCy - in particular, the sample configs default to roberta-base (even if you're not working in English), so you'll usually want to change the model.

Using the Config
In the config, specifying an arbitrary model is easy. Your config should have a section like [components.transformer.model]. The name parameter can be the name of any model on the HuggingFace Hub, or a local path, so you can use a different model just by changing name. Note that if you have any other components that rely on your Transformer, you will need to re-train your pipeline after doing this - you can't just change the name and run inference again.
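For reference, with a recent spacy-transformers release the default [components.transformer.model] block looks roughly like this (the exact keys and defaults vary by version, so check the config your template generates):

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.tokenizer_config]
use_fast = true
```

To use, say, a German BERT instead, you would only change the name value - for example, name = "bert-base-german-cased".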
Using Code

Using code to load an arbitrary Transformer isn't complicated, but it does require a little more care than modifying the config. The basic steps are the same as for other components in spaCy that use the initialize step to load external data: add the transformer component configured with your model name, then call nlp.initialize() to download and load it.
After the model is loaded during the initialize step, the transformer name and transformer/tokenizer settings provided by the config are not used again. The full transformer weights and configs are saved in an internal Thinc format when you call nlp.to_disk. Strictly following the spaCy config design, these settings would belong in the [initialize] block instead of the [components] block. This is a design we would like to change but are keeping for backwards compatibility reasons; for more details see #10613 and #10579.