
NER (Named Entity Recognition) implementation using a BERT/DistilBERT-based ONNX model for Token Classification in ML.NET

Leftyx/NamedEntityRecognizer

Named Entity Recognition (NER)

This sample shows how to use a BERT/DistilBERT-based ONNX model for Token Classification / NER in ML.NET.

Export a model to ONNX

To export a Hugging Face model to ONNX, you can follow the instructions provided by Hugging Face, or follow these steps:

  • Install Python

  • Install these packages:

pip install optimum[exporters]

pip install accelerate
  • Finally, use the installed Optimum CLI tool optimum-cli to export the model you want to use, for example:

optimum-cli export onnx --model dslim/bert-base-NER bert-base-NER/

or

optimum-cli export onnx --model dmargutierrez/distilbert-base-multilingual-cased-mapa_coarse-ner distilbert-base-multilingual-cased-mapa_coarse-ner

One model that seems to perform much better than others, especially on multiple languages, can be found here. It can be exported with optimum-cli like this:

optimum-cli export onnx --model Babelscape/wikineural-multilingual-ner wikineural-multilingual-ner

This model is licensed for non-commercial research purposes only.

The ONNX model, the configuration files and the vocabulary will be saved in a subfolder of the directory from which you run the CLI. The subfolder is the last argument of the command; in the examples above it is named after the model.

For my tests I have used a multilingual cased model found here.

This model does not have token type ids, hence the configuration sets HasTokenTypeIds to false:

var configuration = new Configuration(modelPath, numberOfTokens: 5)
{
    HasTokenTypeIds = false
};

You can use Netron to check the shape of the inputs and outputs of your ONNX model.
Once you have opened your model in Netron, select the input_ids node in the graph and check the model properties.

Netron

If your model has token_type_ids defined, simply set the configuration property to true.

The folder where the ONNX model is exported contains several files. To run this example we need the configuration file config.json and the vocabulary vocab.txt.
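Before running the sample, it can help to confirm that the export folder actually contains the two files mentioned above. The following is a minimal sketch in Python (already installed for the export step), using only the standard library. The function name missing_files and the folder name in the usage part are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# Files this README says the example needs from the export folder.
REQUIRED_FILES = ("config.json", "vocab.txt")


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical folder name: use the output folder you passed to optimum-cli.
    missing = missing_files("wikineural-multilingual-ner")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("config.json and vocab.txt are present.")
```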

The console should show the result of the NER process:

 Wolfgang=B-PERSON
 Müller=I-PERSON
 Berlin=B-ADDRESS
 ,=I-ADDRESS
 Germany=I-ADDRESS

Each word identified is associated with one of the labels supported by the model.

B- indicates the beginning of an entity.
I- indicates that a token is inside (continues) the same entity.

If you want to find out more about the meaning of these labels, Hugging Face is a good source of information.
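To illustrate how the B- and I- prefixes combine, here is a minimal Python sketch that merges consecutive tagged words into whole entities. It is not part of this sample; the function name group_entities is hypothetical, and joining words with a single space is a simplifying assumption (punctuation keeps a leading space).

```python
def group_entities(tagged_tokens):
    """Merge (word, label) pairs tagged with B-/I- prefixes into (type, text) entities."""
    entities = []
    current_type, current_words = None, []
    for word, label in tagged_tokens:
        # "B-PERSON" -> ("B", "-", "PERSON"); "O" -> ("O", "", "")
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        if current_type is not None:
            # Any other label closes the open entity.
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # Start a new entity (a stray I- label is treated like B-).
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


if __name__ == "__main__":
    sample = [
        ("Wolfgang", "B-PERSON"),
        ("Müller", "I-PERSON"),
        ("Berlin", "B-ADDRESS"),
        (",", "I-ADDRESS"),
        ("Germany", "I-ADDRESS"),
    ]
    for entity_type, text in group_entities(sample):
        print(f"{entity_type}: {text}")
```

With the console output shown earlier as input, this yields one PERSON entity and one ADDRESS entity.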

Different models might have different labels. The configuration file config.json, found in the folder where the model is exported, lists the supported labels and their names:

"id2label": {
    "0": "O",
    "1": "B-ORGANISATION",
    "2": "I-ORGANISATION",
    "3": "B-ADDRESS",
    "4": "I-ADDRESS",
    "5": "B-DATE",
    "6": "I-DATE",
    "7": "B-PERSON",
    "8": "I-PERSON",
    "9": "B-AMOUNT",
    "10": "I-AMOUNT",
    "11": "B-TIME",
    "12": "I-TIME"
  },
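If you want to list the labels of an exported model without opening the file by hand, a short Python script can read the id2label section. This is a sketch that assumes config.json has the id2label layout shown above; the function names and the folder name in the usage part are illustrative, not part of this repository.

```python
import json
from pathlib import Path


def load_id2label(config_path):
    """Return the class id -> label mapping from a config.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # JSON object keys are strings, so convert them to integer class ids.
    return {int(key): label for key, label in config["id2label"].items()}


def entity_types(id2label):
    """Return the sorted entity types, without the B-/I- prefixes and without "O"."""
    return sorted(
        {label.split("-", 1)[1] for label in id2label.values() if "-" in label}
    )


if __name__ == "__main__":
    # Hypothetical path: point this at the folder created by optimum-cli.
    labels = load_id2label("wikineural-multilingual-ner/config.json")
    for class_id in sorted(labels):
        print(class_id, labels[class_id])
    print("Entity types:", ", ".join(entity_types(labels)))
```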

I have used a BERT tokenizer found in this repo.
