This sample shows how to use BERT/DistilBERT-based ONNX models for token classification (NER) in ML.NET.
To export a Hugging Face model to ONNX you can follow the instructions provided by Hugging Face or:
- Install Python
- Install these packages:
pip install optimum[exporters]
pip install accelerate
- Finally, use the installed Optimum CLI tool optimum-cli to export the model, picking the command that matches the model you want to use:
optimum-cli export onnx --model dslim/bert-base-NER bert-base-NER/
or
optimum-cli export onnx --model dmargutierrez/distilbert-base-multilingual-cased-mapa_coarse-ner distilbert-base-multilingual-cased-mapa_coarse-ner
One model that seems to perform much better than the others, especially across multiple languages, is Babelscape/wikineural-multilingual-ner. It can be exported with optimum-cli in the same way:
optimum-cli export onnx --model Babelscape/wikineural-multilingual-ner wikineural-multilingual-ner
This model is licensed for non-commercial research purposes only.
The ONNX model, the configuration files and the vocabulary are saved in a subfolder named after the model, under the directory from which you run the CLI.
For my tests I have used a multilingual cased model found here.
This model does not have token type ids, so the configuration sets HasTokenTypeIds to false:
var configuration = new Configuration(modelPath, numberOfTokens: 5)
{
    HasTokenTypeIds = false
};
You can use Netron to check the shape of the inputs and outputs of your ONNX model.
Once you have opened your model, select the input_ids node in the graph pane and check the model properties.
If your model has a token_type_ids input defined, simply set the HasTokenTypeIds configuration property to true.
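The same check can be done in code instead of by eye. The following is a minimal sketch, not part of the sample: the ModelInputs helper is hypothetical, and the input names are hard-coded so the snippet runs without extra packages. It assumes that, in a real project referencing Microsoft.ML.OnnxRuntime, the names would come from the InputMetadata keys of an InferenceSession.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ModelInputs
{
    // Returns true when the model declares a token_type_ids input.
    public static bool HasTokenTypeIds(IEnumerable<string> inputNames) =>
        inputNames.Contains("token_type_ids", StringComparer.Ordinal);
}

public static class Program
{
    public static void Main()
    {
        // Assumption: with Microsoft.ML.OnnxRuntime referenced, the names could be read with
        //   new InferenceSession(modelPath).InputMetadata.Keys
        // They are hard-coded here to keep the sketch dependency-free.
        var withoutTokenTypes = new[] { "input_ids", "attention_mask" };
        var withTokenTypes = new[] { "input_ids", "attention_mask", "token_type_ids" };

        Console.WriteLine(ModelInputs.HasTokenTypeIds(withoutTokenTypes)); // False
        Console.WriteLine(ModelInputs.HasTokenTypeIds(withTokenTypes));    // True
    }
}
```

The result could then be assigned directly to HasTokenTypeIds when building the configuration.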
The folder where the ONNX model is exported should contain several files.
To run this example we need the configuration file config.json and the vocabulary vocab.txt.
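A quick way to catch an incomplete export before running the sample is to check for these two files up front. This is only a sketch: the ModelFolder helper and the temporary folder used in the demo are hypothetical and not part of the sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ModelFolder
{
    // The two files this example needs next to the exported ONNX model.
    private static readonly string[] RequiredFiles = { "config.json", "vocab.txt" };

    // Returns the names of the required files that are missing from the folder.
    public static List<string> MissingFiles(string folder) =>
        RequiredFiles.Where(f => !File.Exists(Path.Combine(folder, f))).ToList();
}

public static class Program
{
    public static void Main()
    {
        // Demo on a throwaway folder that only contains config.json.
        var folder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(folder);
        File.WriteAllText(Path.Combine(folder, "config.json"), "{}");

        var missing = ModelFolder.MissingFiles(folder);
        Console.WriteLine("Missing: " + string.Join(", ", missing)); // Missing: vocab.txt

        Directory.Delete(folder, recursive: true);
    }
}
```

In practice the folder argument would be the export folder passed to optimum-cli.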
The console should show the result of the NER process:
Wolfgang=B-PERSON
Müller=I-PERSON
Berlin=B-ADDRESS
,=I-ADDRESS
Germany=I-ADDRESS
Each identified word is associated with one of the labels supported by the model.
B- indicates the beginning of an entity.
I- indicates a token that continues the same entity.
If you want to find out more about the meaning of these labels, Hugging Face is a good source of information.
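To turn the per-word labels into whole entities, consecutive B-/I- tags of the same type can be merged. The sketch below is one possible way to do that and is not taken from the sample: the BioTags helper is hypothetical, and joining words with a single space is a simplification (punctuation ends up surrounded by spaces).

```csharp
using System;
using System.Collections.Generic;

public static class BioTags
{
    // Merges consecutive (word, label) pairs into (entity type, text) spans.
    public static List<(string Type, string Text)> Group(IEnumerable<(string Word, string Label)> tagged)
    {
        var entities = new List<(string Type, string Text)>();
        var currentType = ""; // empty means no entity is currently open
        var words = new List<string>();

        void Flush()
        {
            if (currentType.Length > 0) entities.Add((currentType, string.Join(" ", words)));
            currentType = "";
            words.Clear();
        }

        foreach (var (word, label) in tagged)
        {
            if (label.StartsWith("B-", StringComparison.Ordinal))
            {
                Flush(); // a B- tag always starts a new entity
                currentType = label.Substring(2);
                words.Add(word);
            }
            else if (label.StartsWith("I-", StringComparison.Ordinal) && currentType == label.Substring(2))
            {
                words.Add(word); // continues the open entity
            }
            else
            {
                Flush(); // "O", or an I- tag that does not continue the open entity
            }
        }
        Flush();
        return entities;
    }
}

public static class Program
{
    public static void Main()
    {
        var tagged = new[]
        {
            ("Wolfgang", "B-PERSON"), ("Müller", "I-PERSON"),
            ("Berlin", "B-ADDRESS"), (",", "I-ADDRESS"), ("Germany", "I-ADDRESS")
        };

        foreach (var (type, text) in BioTags.Group(tagged))
        {
            Console.WriteLine($"{type}: {text}");
        }
        // PERSON: Wolfgang Müller
        // ADDRESS: Berlin , Germany
    }
}
```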
Different models might have different labels.
The configuration file config.json, found in the folder where the model is downloaded, will give you some information on the supported labels and their names:
"id2label": {
  "0": "O",
  "1": "B-ORGANISATION",
  "2": "I-ORGANISATION",
  "3": "B-ADDRESS",
  "4": "I-ADDRESS",
  "5": "B-DATE",
  "6": "I-DATE",
  "7": "B-PERSON",
  "8": "I-PERSON",
  "9": "B-AMOUNT",
  "10": "I-AMOUNT",
  "11": "B-TIME",
  "12": "I-TIME"
},
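If you need this mapping at runtime, it can be read from config.json with System.Text.Json. The sketch below is an illustration rather than the sample's own code: the LabelMap helper is hypothetical, and the JSON string is a shortened excerpt standing in for the real file, which could be loaded with File.ReadAllText.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class LabelMap
{
    // Reads the "id2label" object of a Hugging Face config.json into an id -> label dictionary.
    public static Dictionary<int, string> Parse(string configJson)
    {
        using var doc = JsonDocument.Parse(configJson);
        var labels = new Dictionary<int, string>();
        foreach (var entry in doc.RootElement.GetProperty("id2label").EnumerateObject())
        {
            // Keys are numeric strings ("0", "1", ...); values are the label names.
            labels[int.Parse(entry.Name)] = entry.Value.GetString() ?? "";
        }
        return labels;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened excerpt; a real config.json contains many more properties.
        var json = @"{ ""id2label"": { ""0"": ""O"", ""7"": ""B-PERSON"", ""8"": ""I-PERSON"" } }";

        var labels = LabelMap.Parse(json);
        Console.WriteLine(labels[7]);    // B-PERSON
        Console.WriteLine(labels.Count); // 3
    }
}
```

The index of the highest-scoring output class for a token can then be looked up in this dictionary to get its label name.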
I have used a BERT tokenizer found in this repo.