If you want to perform {nlp} tasks in your cluster, you must deploy an appropriate trained model. There is tooling support in Eland and {kib} to help you prepare and manage models.
Per the [ml-nlp-overview], there are multiple ways that you can use NLP features within the {stack}. After you determine which type of NLP task you want to perform, you must choose an appropriate trained model.
The simplest method is to use a model that has already been fine-tuned for the type of analysis that you want to perform. For example, there are models and data sets available for specific NLP tasks on Hugging Face. These instructions assume you’re using one of those models and do not describe how to create new models. For the current list of supported model architectures, refer to [ml-nlp-model-ref].
If you choose to perform {lang-ident} by using the lang_ident_model_1 that is provided in the cluster, no further steps are required to import or deploy the model. You can skip to using the model in ingestion pipelines.
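For example, a minimal ingest pipeline that runs {lang-ident} with an {infer} processor might look like the following sketch. The pipeline name, the contents source field, and the target field are placeholders chosen for illustration; the field_map entry maps the contents field to the text field that the model expects.

PUT _ingest/pipeline/lang-detect-example
{
  "description": "Example only: detect the language of the text in the contents field",
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": {
          "classification": {
            "num_top_classes": 3
          }
        },
        "field_map": {
          "contents": "text"
        },
        "target_field": "_ml.lang_ident"
      }
    }
  ]
}

Documents indexed through this pipeline receive the predicted language under _ml.lang_ident.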
Important: If you want to install a trained model in a restricted or closed network, refer to these instructions.
After you choose a model, you must import it and its tokenizer vocabulary to your cluster. Due to its size, the model must be split into chunks, which are imported one at a time and stored in parts.
Note: Trained models must be in a TorchScript representation for use with {stack-ml-features}.
Eland is an {es} Python client that provides a simple script to perform the conversion of Hugging Face transformer models to their TorchScript representations, the chunking process, and upload to {es}; it is therefore the recommended import method. You can either install the Python Eland client on your machine or use a Docker image to build Eland and run the model import script.
- Install the {eland-docs}/installation.html[Eland Python client] with PyTorch extra dependencies.

  python -m pip install 'eland[pytorch]'

- Run the eland_import_hub_model script to download the model from Hugging Face, convert it to TorchScript format, and upload it to the {es} cluster. For example:

  eland_import_hub_model \
    --cloud-id <cloud-id> \        (1)
    -u <username> -p <password> \  (2)
    --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \  (3)
    --task-type ner                (4)
(1) Specify the Elastic Cloud identifier. Alternatively, use --url.
(2) Provide authentication details to access your cluster. Refer to Authentication methods to learn more.
(3) Specify the identifier for the model in the Hugging Face model hub.
(4) Specify the type of NLP task. Supported values are fill_mask, ner, question_answering, text_classification, text_embedding, text_expansion, text_similarity, and zero_shot_classification.
For more details, refer to https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch.
If you want to use Eland without installing it, run the following command:
$ docker run -it --rm --network host docker.elastic.co/eland/eland
The eland_import_hub_model script can be run directly in the docker command:
docker run -it --rm docker.elastic.co/eland/eland \
eland_import_hub_model \
--url $ELASTICSEARCH_URL \
--hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
--start
Replace the $ELASTICSEARCH_URL with the URL for your {es} cluster. Refer to Authentication methods to learn more.
The following authentication options are available when using the import script:
- username/password authentication (specified with the -u and -p options):

  eland_import_hub_model --url https://<hostname>:<port> -u <username> -p <password> ...

- username/password authentication (embedded in the URL):

  eland_import_hub_model --url https://<user>:<password>@<hostname>:<port> ...

- API key authentication:

  eland_import_hub_model --url https://<hostname>:<port> --es-api-key <api-key> ...
After you import the model and vocabulary, you can use {kib} to view and manage their deployment across your cluster under {ml-app} > Model Management. Alternatively, you can use the {ref}/start-trained-model-deployment.html[start trained model deployment API].
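For example, the following request starts a deployment of the NER model imported earlier and waits until the deployment has started. The model ID follows the naming that Eland derives from the Hugging Face identifier; adjust it to the model you imported.

POST _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m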
You can deploy a model multiple times by assigning a unique deployment ID when starting the deployment. This enables you to have dedicated deployments for different purposes, such as search and ingest. By doing so, you ensure that the search speed remains unaffected by ingest workloads, and vice versa. Having separate deployments for search and ingest mitigates performance issues resulting from interactions between the two, which can be hard to diagnose.
It is recommended to fine-tune each deployment based on its specific purpose. To improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.
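For example, the following requests sketch two deployments of the same model with illustrative deployment IDs: one tuned for ingest with more allocations, and one tuned for search with more threads per allocation. The specific numbers are placeholders; size them to your hardware and workload.

POST _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_start?deployment_id=ner_for_ingest&number_of_allocations=2&threads_per_allocation=1

POST _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_start?deployment_id=ner_for_search&number_of_allocations=1&threads_per_allocation=2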
Note: Since Eland uses APIs to deploy the models, you cannot see the models in {kib} until the saved objects are synchronized. You can follow the prompts in {kib}, wait for automatic synchronization, or use the {kibana-ref}/machine-learning-api-sync.html[sync {ml} saved objects API].
When you deploy the model, its allocations are distributed across available {ml} nodes. Model allocations are independent units of work for NLP tasks. To influence model performance, you can configure the number of allocations and the number of threads used by each allocation of your deployment.
Important: If your deployed trained model has only one allocation, it's likely that you will experience downtime in the service your trained model performs. You can reduce or eliminate downtime by adding more allocations to your trained models.
Throughput can be scaled by adding more allocations to the deployment; this increases the number of {infer} requests that can be performed in parallel. All allocations assigned to a node share the same copy of the model in memory. The model is loaded into memory in a native process that encapsulates libtorch, which is the underlying {ml} library of PyTorch. The number of allocations setting affects the number of model allocations across all the {ml} nodes. Model allocations are distributed in such a way that the total number of used threads does not exceed the allocated processors of a node.
The threads per allocation setting affects the number of threads used by each model allocation during {infer}. Increasing the number of threads generally increases the speed of {infer} requests. The value of this setting must not exceed the number of available allocated processors per node.
You can view the allocation status in {kib} or by using the {ref}/get-trained-models-stats.html[get trained model stats API]. If you want to change the number of allocations, you can use the {ref}/update-trained-model-deployment.html[update trained model deployment API] after the allocation status is started.
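For example, the following requests check the allocation status and then, once it is started, raise the number of allocations; the target value of 2 is illustrative.

GET _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/_stats

POST _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_update
{
  "number_of_allocations": 2
}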
Each allocation of a model deployment has a dedicated queue to buffer {infer} requests. The size of this queue is determined by the queue_capacity parameter in the {ref}/start-trained-model-deployment.html[start trained model deployment API].
When the queue reaches its maximum capacity, new requests are declined until
some of the queued requests are processed, creating available capacity once
again. When multiple ingest pipelines reference the same deployment, the queue
can fill up, resulting in rejected requests. Consider using dedicated
deployments to prevent this situation.
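For example, if a deployment that serves several ingest pipelines keeps rejecting requests, you could start a dedicated deployment with a larger queue, as in this sketch; the deployment ID and queue size are illustrative.

POST _ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_start?deployment_id=ner_for_pipelines&queue_capacity=4096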
{infer-cap} requests originating from search, such as the {ref}/query-dsl-text-expansion-query.html[text_expansion query], have a higher
priority compared to non-search requests. The {infer} ingest processor generates
normal priority requests. If both a search query and an ingest processor use the
same deployment, the search requests with higher priority skip ahead in the
queue for processing before the lower priority ingest requests. This
prioritization accelerates search responses while potentially slowing down
ingest where response time is less critical.
When the model is deployed on at least one node in the cluster, you can begin to perform inference. {infer-cap} is a {ml} feature that enables you to use your trained models to perform NLP tasks (such as text extraction, classification, or embeddings) on incoming data.
The simplest method to test your model against new data is to use the Test model action in {kib}. You can either provide some input text or use a field of an existing index in your cluster to test the model.
Alternatively, you can use the {ref}/infer-trained-model.html[infer trained model API]. For example, to try a named entity recognition task, provide some sample text:
POST /_ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/_infer
{
"docs":[{"text_field": "Sasha bought 300 shares of Acme Corp in 2022."}]
}
In this example, the response contains the annotated text output and the recognized entities:
{
"inference_results" : [
{
"predicted_value" : "[Sasha](PER&Sasha) bought 300 shares of [Acme Corp](ORG&Acme+Corp) in 2022.",
"entities" : [
{
"entity" : "Sasha",
"class_name" : "PER",
"class_probability" : 0.9953193407987492,
"start_pos" : 0,
"end_pos" : 5
},
{
"entity" : "Acme Corp",
"class_name" : "ORG",
"class_probability" : 0.9996392198381716,
"start_pos" : 27,
"end_pos" : 36
}
]
}
]
}
If you are satisfied with the results, you can add these NLP tasks to your ingestion pipelines.
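For example, a minimal pipeline that applies the NER model to incoming documents might look like the following sketch. The pipeline name and the message source field are placeholders, and the field_map entry maps that field to the model's default text_field input.

PUT _ingest/pipeline/ner-annotations-example
{
  "description": "Example only: annotate incoming text with named entities",
  "processors": [
    {
      "inference": {
        "model_id": "elastic__distilbert-base-cased-finetuned-conll03-english",
        "target_field": "ml.ner",
        "field_map": {
          "message": "text_field"
        }
      }
    }
  ]
}

Documents indexed through this pipeline carry the recognized entities under ml.ner.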