Removed a trailing "`" in the xpack.ml.model_repository parameter #2779

Closed
18 commits
b7bf981
Adds important callout about the possibility of multiple ELSER deploy…
mergify[bot] Jul 5, 2024
9b1668b
[DOCS] Adds a note about the intel and linux optimized versions of EL…
mergify[bot] Jul 23, 2024
51f3486
[ML] Add explanation of typical location (#2753) (#2754)
mergify[bot] Jul 25, 2024
53ff4d6
Fix Eland Docker image name (#2748) (#2756)
mergify[bot] Aug 1, 2024
8d40df4
Adds adaptive allocations feature description to conceptual docs (#27…
mergify[bot] Aug 1, 2024
4c1e35e
Revert "Adds adaptive allocations feature description to conceptual d…
szabosteve Aug 2, 2024
563c2ec
Makes inference endpoint the primary way to download and deploy ELSER…
mergify[bot] Aug 2, 2024
d21a1d3
Update ml-nlp-e5.asciidoc (#2769)
petericebear Aug 12, 2024
74ea6cc
Adds categorization job how to guide (#2772) (#2773)
mergify[bot] Aug 27, 2024
ee5120c
Updates the list of supported NLP task types. (#2775) (#2776)
mergify[bot] Aug 28, 2024
a000b02
[DOCS] Update e5 warranty verbiage (#2777) (#2778)
mergify[bot] Sep 2, 2024
4e215ff
Removed a trailing "`" in the xpack.ml.model_repository parameter
ivssh Sep 3, 2024
fa296e1
Removed a misstyped "le" from a previous commit
ivssh Sep 3, 2024
2b7760a
Update ml-nlp-e5.asciidoc: Removed trailing quotes
ivssh Sep 3, 2024
1dfd4bc
Fixes screenshot in categorization job how-to. (#2781) (#2782)
mergify[bot] Sep 5, 2024
5e74e14
Merge branch '8.15' into patch-1
ivssh Sep 6, 2024
b671e67
Updates link text on Metrics configuration page. (#2785) (#2786)
mergify[bot] Sep 9, 2024
6974edd
Merge branch '8.15' into patch-1
ivssh Sep 10, 2024
@@ -13,6 +13,7 @@ The guides in this section describe some best practices for generating useful
* <<ml-configuring-aggregation, Aggregating data for faster performance>>
* <<ml-configuring-transform, Using runtime fields in {dfeeds}>>
* <<ml-configuring-detector-custom-rules>>
* <<ml-configuring-categories>>
* <<ml-reverting-model-snapshot>>
* <<geographic-anomalies>>
* <<mapping-anomalies>>
@@ -211,6 +211,11 @@ behavior:
image::images/ecommerce-anomaly-explorer-geopoint.jpg[A screenshot of an anomalous event in the eCommerce data in Anomaly Explorer]
// NOTE: This is an autogenerated screenshot. Do not edit it directly.

A "typical" value indicates the centroid of a cluster of previously observed
locations that is closest to the "actual" location at that time. For example,
there may be one centroid near the user's home and another near the user's
workplace, since many records are associated with each of these distinct locations.

Likewise, there are time periods in the web logs sample data where there are
both unusually high sums of data transferred and unusual geographical
coordinates:
2 changes: 2 additions & 0 deletions docs/en/stack/ml/anomaly-detection/index.asciidoc
@@ -36,6 +36,8 @@ include::{es-repo-dir}/ml/anomaly-detection/ml-configuring-transform.asciidoc[le

include::{es-repo-dir}/ml/anomaly-detection/ml-configuring-detector-custom-rules.asciidoc[leveloffset=+2]

include::ml-detect-categories.asciidoc[leveloffset=+2]

include::ml-revert-model-snapshot.asciidoc[leveloffset=+2]

include::geographic-anomalies.asciidoc[leveloffset=+2]
253 changes: 253 additions & 0 deletions docs/en/stack/ml/anomaly-detection/ml-detect-categories.asciidoc
@@ -0,0 +1,253 @@
[[ml-configuring-categories]]
= Detecting anomalous categories of data

Categorization is a {ml} process that tokenizes a text field, clusters similar data together, and classifies it into categories.
It works best on machine-written messages and application output that typically consist of repeated elements.
<<categorization-jobs, Categorization jobs>> enable you to find anomalous behavior in your categorized data.
Categorization is not natural language processing (NLP).
When you create a categorization {anomaly-job}, the {ml} model learns what volume and pattern is normal for each category over time.
You can then detect anomalies and surface rare events or unusual types of messages by using <<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.
Categorization works well on a finite set of possible messages, for example:

[source,js]
----------------------------------
{"@timestamp":1549596476000,
"message":"org.jdbi.v2.exceptions.UnableToExecuteStatementException: com.mysql.jdbc.exceptions.MySQLTimeoutException: Statement cancelled due to timeout or client request [statement:\"SELECT id, customer_id, name, force_disabled, enabled FROM customers\"]",
"type":"logs"}
----------------------------------
//NOTCONSOLE


[discrete]
[[categ-recommendations]]
== Recommendations

* Categorization is tuned to work best on data like log messages by taking token order into account, including stop words, and not considering synonyms in its analysis.
Use machine-written messages for categorization analysis.
* Complete sentences in human communication or literary text (for example email, wiki pages, prose, or other human-generated content) can be extremely diverse in structure.
Since categorization is tuned for machine data, it gives poor results for human-generated data.
It would create so many categories that they couldn’t be handled effectively.
Avoid using human-generated data for categorization analysis.

[discrete]
[[creating-categorization-jobs]]
== Creating categorization jobs

. In {kib}, navigate to **{ml-app} > Anomaly Detection > Jobs**.
. Click **Create {anomaly-jobs}**, then select the {data-view} you want to analyze.
. Select the **Categorization** wizard from the list.
. Choose a categorization detector (the `count` function in this example) and the field you want to categorize (the `message` field in this example).
+
--
[role="screenshot"]
image::images/categorization-wizard.png[Creating a categorization job in Kibana]
--
. Click **Next**.
. Provide a job ID and click **Next**.
. If the validation is successful, click **Next** to review the summary of the job creation.
. Click **Create job**.

This example job generates categories from the contents of the `message` field and uses the `count` function to determine when certain categories are occurring at anomalous rates.

[%collapsible]
.API example
====
[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs
{
"description" : "IT ops application logs",
"analysis_config" : {
"categorization_field_name": "message",<1>
"bucket_span":"30m",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory"<2>
}]
},
"data_description" : {
"time_field":"@timestamp"
}
}
----------------------------------
// TEST[skip:needs-licence]
<1> This field is used to derive categories.
<2> The categories are used in a detector by setting `by_field_name`, `over_field_name`, or `partition_field_name` to the keyword `mlcategory`.
If you do not specify this keyword in one of those properties, the API request fails.
====


[discrete]
[[categorization-job-results]]
=== Viewing the job results

Use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.png["Categorization results in the Anomaly Explorer"]

For this type of job, the results contain extra information for each anomaly: the name of the category (for example, `mlcategory 2`) and examples of the messages in that category.
You can use these details to investigate occurrences of unusually high message counts.
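
The same category details can also be retrieved outside {kib}. As a minimal sketch, assuming the `it_ops_app_logs` job from the API example exists and has processed data, the get categories API returns each category along with its example messages:

[source,console]
----------------------------------
GET _ml/anomaly_detectors/it_ops_app_logs/results/categories
{
  "page": {
    "size": 1 <1>
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Limits the response to a single category. The page size used here is only an illustrative value.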


[discrete]
[[advanced-categorization-options]]
=== Advanced configuration options

If you use the advanced {anomaly-job} wizard in {kib} or the {ref}/ml-put-job.html[create {anomaly-jobs} API], there are additional configuration options.
For example, the optional `categorization_examples_limit` property specifies the maximum number of examples that are stored in memory and in the results data store for each category.
The default value is `4`.
Note that this setting does not affect the categorization; it just affects the list of visible examples.
If you increase this value, more examples are available, but you must have more storage available.
If you set this value to `0`, no examples are stored.

Another advanced option is the `categorization_filters` property, which can contain an array of regular expressions.
If a categorization field value matches the regular expression, the portion of the field that is matched is not taken into consideration when defining categories.
The categorization filters are applied in the order they are listed in the job configuration, which enables you to disregard multiple sections of the categorization field value.
In this example, you might create a filter like `[ "\\[statement:.*\\]"]` to remove the SQL statement from the categorization algorithm.
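
As a rough sketch of how these two options fit into a job definition, the following request reuses the earlier example job and adds a filter for the SQL statement.
The job ID `it_ops_app_logs_filtered` and the limit of `5` are illustrative assumptions, not recommended values.

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_filtered
{
  "description" : "IT ops application logs, ignoring SQL statements",
  "analysis_config" : {
    "categorization_field_name": "message",
    "categorization_filters": [ "\\[statement:.*\\]" ], <1>
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "analysis_limits":{
    "categorization_examples_limit": 5 <2>
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> The matched portion of each `message` value is ignored when categories are defined.
<2> Stores up to five examples per category instead of the default four.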


[discrete]
[[ml-per-partition-categorization]]
== Per-partition categorization

If you enable per-partition categorization, categories are determined independently for each partition.
For example, if your data includes messages from multiple types of logs from different applications, you can use a field like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the `partition_field_name` and categorize the messages for each type of log separately.

If your job has multiple detectors, every detector that uses the `mlcategory` keyword must also define a `partition_field_name`.
You must use the same `partition_field_name` value in all of these detectors.
Otherwise, the request to create or update a job with per-partition categorization enabled fails.

When per-partition categorization is enabled, you can also take advantage of a `stop_on_warn` configuration option.
If the categorization status for a partition changes to `warn`, the data in that partition doesn't categorize well and can cause unnecessary resource usage.
When you set `stop_on_warn` to `true`, the job stops analyzing these problematic partitions.
You can thus avoid an ongoing performance cost for partitions that are unsuitable for categorization.
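
A minimal sketch of a job that enables per-partition categorization might look like the following.
The job ID is hypothetical, and the example assumes that the data contains an `event.dataset` field.

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_app_logs_per_dataset
{
  "analysis_config" : {
    "categorization_field_name": "message",
    "per_partition_categorization": {
      "enabled": true, <1>
      "stop_on_warn": true <2>
    },
    "bucket_span":"30m",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset" <3>
    }]
  },
  "data_description" : {
    "time_field":"@timestamp"
  }
}
----------------------------------
// TEST[skip:needs-licence]
<1> Categories are determined independently for each partition.
<2> The job stops analyzing a partition when its categorization status changes to `warn`.
<3> Every detector that uses `mlcategory` must define the same `partition_field_name`.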


[discrete]
[[ml-configuring-analyzer]]
== Customizing the categorization analyzer

Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules.
For this reason, if you use the default categorization analyzer, only English language log messages are supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization analyzer it uses and highlighted examples of the tokens that it identifies.
You can also change the tokenization rules by customizing the way the categorization field values are interpreted:

[role="screenshot"]
image::images/ml-category-analyzer.png["Editing the categorization analyzer in Kibana"]

The categorization analyzer can refer to a built-in {es} analyzer or a combination of zero or more character filters, a tokenizer, and zero or more token filters.
In this example, adding a {ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter] achieves the same behavior as the `categorization_filters` job configuration option described earlier.
For more details about these properties, refer to the {ref}/ml-put-job.html#ml-put-job-request-body[`categorization_analyzer` API object].

If you use the default categorization analyzer in {kib} or omit the `categorization_analyzer` property from the API, the following default values are used:

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
"analysis_config" : {
"categorization_analyzer" : {
"char_filter" : [
"first_line_with_letters"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
]
},
"categorization_field_name": "message",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory"
}]
},
"data_description" : {
}
}
--------------------------------------------------

If you specify any part of the `categorization_analyzer`, however, any omitted sub-properties are _not_ set to default values.

The `ml_standard` tokenizer and the day and month stopword filter are almost equivalent to the following analyzer, which is defined using only built-in {es} {ref}/analysis-tokenizers.html[tokenizers] and {ref}/analysis-tokenfilters.html[token filters]:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/it_ops_new_logs
{
"description" : "IT Ops Application Logs",
"analysis_config" : {
"categorization_field_name": "message",
"bucket_span":"30m",
"detectors" :[{
"function":"count",
"by_field_name": "mlcategory",
"detector_description": "Unusual message counts"
}],
"categorization_analyzer":{
"char_filter" : [
"first_line_with_letters" <1>
],
"tokenizer": {
"type" : "simple_pattern_split",
"pattern" : "[^-0-9A-Za-z_./]+" <2>
},
"filter": [
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
{ "type" : "stop", "stopwords": [
"",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"
] }
]
}
},
"analysis_limits":{
"categorization_examples_limit": 5
},
"data_description" : {
"time_field":"time",
"time_format": "epoch_ms"
}
}
----------------------------------
// TEST[skip:needs-licence]

<1> Only consider the first line of the message with letters for categorization purposes.
<2> Tokens consist of hyphens, digits, letters, underscores, dots and slashes.
<3> By default, categorization ignores tokens that begin with a digit.
<4> By default, categorization ignores tokens that are hexadecimal numbers.
<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
<6> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this example analyzer is that using the `ml_standard` tokenizer is several times faster.
The `ml_standard` tokenizer also tries to preserve URLs, Windows paths and email addresses as single tokens.
Another difference in behavior is that the custom analyzer does not include accented letters in tokens whereas the `ml_standard` tokenizer does.
This could be fixed by using more complex regular expressions.

If you are categorizing non-English messages in a language where words are separated by spaces, you might get better results if you change the day or month words in the stop token filter to the appropriate words in your language.
If you are categorizing messages in a language where words are not separated by spaces, you must use a different tokenizer as well in order to get sensible categorization results.
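
As an illustration only, the following request validates an analyzer whose stop filter lists German day and month names instead of the English defaults.
The word list is an assumption made for this sketch: it omits abbreviations and should be adjusted to match how dates actually appear in your log messages.
Because omitted sub-properties are not set to default values, the character filter and tokenizer are stated explicitly.

[source,console]
--------------------------------------------------
POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
      "char_filter" : [
        "first_line_with_letters"
      ],
      "tokenizer" : "ml_standard",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag",
          "Januar", "Februar", "März", "April", "Mai", "Juni", "Juli", "August", "September", "Oktober", "November", "Dezember",
          "GMT", "UTC"
        ] }
      ]
    },
    "categorization_field_name": "message",
    "detectors" :[{
      "function":"count",
      "by_field_name": "mlcategory"
    }]
  },
  "data_description" : {
  }
}
--------------------------------------------------
// TEST[skip:needs-licence]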

It is important to be aware that analyzing for categorization of machine-generated log messages is a little different from tokenizing for search.
Features that work well for search, such as stemming, synonym substitution, and lowercasing, are likely to make the results of categorization worse.
However, for drill-down from {ml} results to work correctly, the tokens that the categorization analyzer produces must be similar to those produced by the search analyzer.
If they are sufficiently similar, searching for the tokens that the categorization analyzer produces finds the original document that the categorization field value came from.
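
One way to compare the two sets of tokens is to run the same text through the analyze API twice, as sketched below.
This assumes that the {ml} tokenizer and character filter are available to the analyze API in your cluster. The index name `my-logs-index` is hypothetical.

[source,console]
----------------------------------
POST _analyze <1>
{
  "char_filter" : [ "first_line_with_letters" ],
  "tokenizer" : "ml_standard",
  "text" : "Statement cancelled due to timeout or client request"
}

GET my-logs-index/_analyze <2>
{
  "field" : "message",
  "text" : "Statement cancelled due to timeout or client request"
}
----------------------------------
// TEST[skip:needs-licence]
<1> Tokens as produced by the tokenizer and character filter of the default categorization analyzer. The stop token filter is omitted for brevity.
<2> Tokens as produced by the analyzer that the index mapping defines for the `message` field.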





@@ -1,10 +1,8 @@
["appendix",role="exclude",id="ootb-ml-jobs-metrics-ui"]
= Metrics {anomaly-detect} configurations

These {anomaly-jobs} can be created in the
{observability-guide}/analyze-metrics.html[{metrics-app}] in {kib}. For more
information about their usage, refer to
{observability-guide}/inspect-metric-anomalies.html[Inspect metric anomalies].
These {anomaly-jobs} can be created in the {observability-guide}/analyze-metrics.html[{infrastructure-app}] in {kib}.
For more information about their usage, refer to {observability-guide}/inspect-metric-anomalies.html[Inspect metric anomalies].

// tag::metrics-jobs[]
[discrete]
5 changes: 3 additions & 2 deletions docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc
@@ -94,7 +94,8 @@ eland_import_hub_model \
<<ml-nlp-authentication>> to learn more.
<3> Specify the identifier for the model in the Hugging Face model hub.
<4> Specify the type of NLP task. Supported values are `fill_mask`, `ner`,
`text_classification`, `text_embedding`, and `zero_shot_classification`.
`question_answering`, `text_classification`, `text_embedding`, `text_expansion`,
`text_similarity`, and `zero_shot_classification`.

For more details, refer to
https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch.
@@ -112,7 +113,7 @@ $ docker run -it --rm --network host docker.elastic.co/eland/eland
The `eland_import_hub_model` script can be run directly in the docker command:

```bash
docker run -it --rm elastic/eland \
docker run -it --rm docker.elastic.co/eland/eland \
eland_import_hub_model \
--url $ELASTICSEARCH_URL \
--hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \