Merge branch 'main' into Character-filters-1483
vagimeli authored Oct 8, 2024
2 parents 8c41636 + 3f77141 commit 55bcf61
Showing 25 changed files with 166 additions and 51 deletions.
1 change: 1 addition & 0 deletions _analyzers/index-analyzers.md
@@ -2,6 +2,7 @@
layout: default
title: Index analyzers
nav_order: 20
parent: Analyzers
---

# Index analyzers
16 changes: 3 additions & 13 deletions _analyzers/index.md
@@ -45,20 +45,9 @@ An analyzer must contain exactly one tokenizer and may contain zero or more char

There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.
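As an illustrative sketch (the index name, normalizer name, and filter choices here are assumptions, not prescriptions), a normalizer is defined in index settings much like an analyzer but contains only character filters and token filters:

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tag": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```
{% include copy-curl.html %}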

## Built-in analyzers
## Supported analyzers

The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.

Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
**Standard** (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).

## Custom analyzers

@@ -195,3 +184,4 @@ Normalization ensures that searches are not limited to exact term matches, allow
## Next steps

- Learn more about specifying [index analyzers]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) and [search analyzers]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/).
- See the list of [supported analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).
9 changes: 5 additions & 4 deletions _analyzers/language-analyzers.md
@@ -1,14 +1,15 @@
---
layout: default
title: Language analyzers
nav_order: 10
nav_order: 100
parent: Analyzers
redirect_from:
- /query-dsl/analyzers/language-analyzers/
---

# Language analyzer
# Language analyzers

OpenSearch supports the following language values with the `analyzer` option:
OpenSearch supports the following language analyzers:
`arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `english`, `estonian`, `finnish`, `french`, `galician`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `latvian`, `lithuanian`, `norwegian`, `persian`, `portuguese`, `romanian`, `russian`, `sorani`, `spanish`, `swedish`, `turkish`, and `thai`.

To use a language analyzer, specify it when mapping an index. For example, to map your index with the French language analyzer, specify the `french` value in the `analyzer` field:
@@ -40,4 +41,4 @@ PUT my-index
}
```
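Only the tail of that request is shown above. As a sketch of the full mapping (the `text` field name is an assumption for illustration), it might look like the following:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "french"
      }
    }
  }
}
```
{% include copy-curl.html %}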

<!-- TO do: each of the options needs its own section with an example. Convert table to individual sections, and then give a streamlined list with valid values. -->
<!-- TO do: each of the options needs its own section with an example. Convert table to individual sections, and then give a streamlined list with valid values. -->
3 changes: 2 additions & 1 deletion _analyzers/search-analyzers.md
@@ -2,6 +2,7 @@
layout: default
title: Search analyzers
nav_order: 30
parent: Analyzers
---

# Search analyzers
@@ -42,7 +43,7 @@ GET shakespeare/_search
```
{% include copy-curl.html %}
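As a sketch of the collapsed request above, a complete search that overrides the analyzer at query time (the `shakespeare` sample index and its `text_entry` field are assumptions here) might look like the following:

```json
GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": {
        "query": "speak the speech",
        "analyzer": "english"
      }
    }
  }
}
```
{% include copy-curl.html %}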

Valid values for [built-in analyzers]({{site.url}}{{site.baseurl}}/analyzers/index#built-in-analyzers) are `standard`, `simple`, `whitespace`, `stop`, `keyword`, `pattern`, `fingerprint`, or any supported [language analyzer]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).
For more information about supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).

## Specifying a search analyzer for a field

32 changes: 32 additions & 0 deletions _analyzers/supported-analyzers/index.md
@@ -0,0 +1,32 @@
---
layout: default
title: Analyzers
nav_order: 40
has_children: true
has_toc: false
redirect_from:
- /analyzers/supported-analyzers/index/
---

# Analyzers

The following sections list all analyzers that OpenSearch supports.

## Built-in analyzers

The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.

Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
**Standard** (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `to`, `opensearch`]
**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`, `brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
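You can verify any row of this table with the `_analyze` API. For example, the following request applies the `standard` analyzer to the sample string:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```
{% include copy-curl.html %}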

## Language analyzers

OpenSearch supports analyzers for various languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).
2 changes: 2 additions & 0 deletions _analyzers/token-filters/index.md
@@ -4,6 +4,8 @@ title: Token filters
nav_order: 70
has_children: true
has_toc: false
redirect_from:
- /analyzers/token-filters/index/
---

# Token filters
2 changes: 2 additions & 0 deletions _analyzers/tokenizers/index.md
@@ -4,6 +4,8 @@ title: Tokenizers
nav_order: 60
has_children: false
has_toc: false
redirect_from:
- /analyzers/tokenizers/index/
---

# Tokenizers
2 changes: 1 addition & 1 deletion _benchmark/glossary.md
@@ -1,7 +1,7 @@
---
layout: default
title: Glossary
nav_order: 10
nav_order: 100
---

# OpenSearch Benchmark glossary
@@ -1,10 +1,12 @@
---
layout: default
title: Configuring OpenSearch Benchmark
title: Configuring
nav_order: 7
parent: User guide
grand_parent: User guide
parent: Install and configure
redirect_from:
- /benchmark/configuring-benchmark/
- /benchmark/user-guide/configuring-benchmark/
---

# Configuring OpenSearch Benchmark
12 changes: 12 additions & 0 deletions _benchmark/user-guide/install-and-configure/index.md
@@ -0,0 +1,12 @@
---
layout: default
title: Install and configure
nav_order: 5
parent: User guide
has_children: true
---

# Installing and configuring OpenSearch Benchmark

This section details how to install and configure OpenSearch Benchmark.

@@ -1,10 +1,12 @@
---
layout: default
title: Installing OpenSearch Benchmark
title: Installing
nav_order: 5
parent: User guide
grand_parent: User guide
parent: Install and configure
redirect_from:
- /benchmark/installing-benchmark/
- /benchmark/user-guide/installing-benchmark/
---

# Installing OpenSearch Benchmark
@@ -2,12 +2,12 @@
layout: default
title: Running distributed loads
nav_order: 15
parent: User guide
parent: Optimizing benchmarks
grand_parent: User guide
---

# Running distributed loads


OpenSearch Benchmark loads always run on the same machine on which the benchmark was started. However, you can use multiple load drivers to generate additional benchmark testing loads, particularly for large clusters on multiple machines. This tutorial describes how to distribute benchmark loads across multiple machines in a single cluster.

## System architecture
11 changes: 11 additions & 0 deletions _benchmark/user-guide/optimizing-benchmarks/index.md
@@ -0,0 +1,11 @@
---
layout: default
title: Optimizing benchmarks
nav_order: 25
parent: User guide
has_children: true
---

# Optimizing benchmarks

This section details different ways you can optimize the benchmark tools for your cluster.
@@ -2,7 +2,10 @@
layout: default
title: Target throughput
nav_order: 150
parent: User guide
parent: Optimizing benchmarks
grand_parent: User guide
redirect_from:
- /benchmark/user-guide/target-throughput/
---

# Target throughput
8 changes: 0 additions & 8 deletions _benchmark/user-guide/telemetry.md

This file was deleted.

12 changes: 12 additions & 0 deletions _benchmark/user-guide/understanding-results/index.md
@@ -0,0 +1,12 @@
---
layout: default
title: Understanding results
nav_order: 20
parent: User guide
has_children: true
---

After [running a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/working-with-workloads/running-workloads/), OpenSearch Benchmark produces a series of metrics. The following pages detail:

- [How metrics are reported]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-results/summary-reports/)
- [How to visualize metrics]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-results/telemetry/)
@@ -1,10 +1,14 @@
---
layout: default
title: Understanding benchmark results
title: Summary reports
nav_order: 22
parent: User guide
grand_parent: User guide
parent: Understanding results
redirect_from:
- /benchmark/user-guide/understanding-results/
---

# Understanding the summary report

At the end of each test run, OpenSearch Benchmark creates a summary of test result metrics like service time, throughput, latency, and more. These metrics provide insights into how the selected workload performed on a benchmarked OpenSearch cluster.

21 changes: 21 additions & 0 deletions _benchmark/user-guide/understanding-results/telemetry.md
@@ -0,0 +1,21 @@
---
layout: default
title: Enabling telemetry devices
nav_order: 30
grand_parent: User guide
parent: Understanding results
redirect_from:
- /benchmark/user-guide/telemetry
---

# Enabling telemetry devices

Telemetry results will not appear in the summary report. To visualize telemetry results, ingest the data into OpenSearch and visualize the data in OpenSearch Dashboards.

To view a list of the available telemetry devices, use the command `opensearch-benchmark list telemetry`. After you've selected a [supported telemetry device]({{site.url}}{{site.baseurl}}/benchmark/reference/telemetry/), you can activate the device when running a test with the `--telemetry` flag. For example, if you want to use the `jfr` device with the `geonames` workload, enter the following command:

```shell
opensearch-benchmark workload --workload=geonames --telemetry=jfr
```
{% include copy-curl.html %}

2 changes: 1 addition & 1 deletion _benchmark/user-guide/understanding-workloads/index.md
@@ -1,7 +1,7 @@
---
layout: default
title: Understanding workloads
nav_order: 7
nav_order: 10
parent: User guide
has_children: true
---
@@ -2,7 +2,10 @@
layout: default
title: Sharing custom workloads
nav_order: 11
parent: User guide
grand_parent: User guide
parent: Working with workloads
redirect_from:
- /benchmark/user-guide/contributing-workloads/
---

# Sharing custom workloads
@@ -2,7 +2,8 @@
layout: default
title: Creating custom workloads
nav_order: 10
parent: User guide
grand_parent: User guide
parent: Working with workloads
redirect_from:
- /benchmark/user-guide/creating-custom-workloads/
- /benchmark/creating-custom-workloads/
@@ -2,7 +2,10 @@
layout: default
title: Fine-tuning custom workloads
nav_order: 12
parent: User guide
grand_parent: User guide
parent: Working with workloads
redirect_from:
- /benchmark/user-guide/finetine-workloads/
---

# Fine-tuning custom workloads
16 changes: 16 additions & 0 deletions _benchmark/user-guide/working-with-workloads/index.md
@@ -0,0 +1,16 @@
---
layout: default
title: Working with workloads
nav_order: 15
parent: User guide
has_children: true
---

# Working with workloads

Once you [understand workloads]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/index/) and have [chosen a workload]({{site.url}}{{site.baseurl}}/benchmark/user-guide/understanding-workloads/choosing-a-workload/) to run your benchmarks with, you can begin working with workloads.

- [Running workloads]({{site.url}}{{site.baseurl}}/benchmark/user-guide/working-with-workloads/running-workloads/): Learn how to run an OpenSearch Benchmark workload.
- [Creating custom workloads]({{site.url}}{{site.baseurl}}/benchmark/user-guide/working-with-workloads/creating-custom-workloads/): Create a custom workload with your own datasets.
- [Fine-tuning workloads]({{site.url}}{{site.baseurl}}/benchmark/user-guide/working-with-workloads/finetune-workloads/): Fine-tune your custom workload according to the needs of your cluster.
- [Contributing workloads]({{site.url}}{{site.baseurl}}/benchmark/user-guide/working-with-workloads/contributing-workloads/): Contribute your custom workload for the OpenSearch community to use.
@@ -2,7 +2,10 @@
layout: default
title: Running a workload
nav_order: 9
parent: User guide
grand_parent: User guide
parent: Working with workloads
redirect_from:
- /benchmark/user-guide/running-workloads/
---

# Running a workload
21 changes: 11 additions & 10 deletions _ingest-pipelines/processors/text-chunking.md
@@ -31,16 +31,17 @@ The following is the syntax for the `text_chunking` processor:

The following table lists the required and optional parameters for the `text_chunking` processor.

| Parameter | Data type | Required/Optional | Description |
|:---|:---|:---|:---|
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating chunked passages. |
| `field_map.<output_field>` | String | Required | The name of the field in which to store the chunked results. |
| `algorithm` | Object | Required | Contains at most one key-value pair that specifies the chunking algorithm and parameters. |
| `algorithm.<name>` | String | Optional | The name of the chunking algorithm. Valid values are [`fixed_token_length`](#fixed-token-length-algorithm) or [`delimiter`](#delimiter-algorithm). Default is `fixed_token_length`. |
| `algorithm.<parameters>` | Object | Optional | The parameters for the chunking algorithm. By default, contains the default parameters of the `fixed_token_length` algorithm. |
| `description` | String | Optional | A brief description of the processor. |
| `tag` | String | Optional | An identifier tag for the processor. Useful when debugging in order to distinguish between processors of the same type. |
| Parameter | Data type | Required/Optional | Description |
|:----------------------------|:----------|:---|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to the output field. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating chunked passages. |
| `field_map.<output_field>` | String | Required | The name of the field in which to store the chunked results. |
| `algorithm` | Object | Required | Contains at most one key-value pair that specifies the chunking algorithm and parameters. |
| `algorithm.<name>` | String | Optional | The name of the chunking algorithm. Valid values are [`fixed_token_length`](#fixed-token-length-algorithm) or [`delimiter`](#delimiter-algorithm). Default is `fixed_token_length`. |
| `algorithm.<parameters>` | Object | Optional | The parameters for the chunking algorithm. By default, contains the default parameters of the `fixed_token_length` algorithm. |
| `ignore_missing` | Boolean | Optional | If `true`, empty fields are excluded from the output. If `false`, the output will contain an empty list for every empty field. Default is `false`. |
| `description` | String | Optional | A brief description of the processor. |
| `tag` | String | Optional | An identifier tag for the processor. Useful when debugging in order to distinguish between processors of the same type. |

To perform chunking on nested fields, specify `input_field` and `output_field` values as JSON objects. Dot paths of nested fields are not supported. For example, use `"field_map": { "foo": { "bar": "bar_chunk"} }` instead of `"field_map": { "foo.bar": "foo.bar_chunk"}`.
{: .note}
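As a hedged sketch of how these parameters fit together (the pipeline name, `token_limit` value, and field names here are illustrative assumptions), a `text_chunking` ingest pipeline might be defined as follows:

```json
PUT _ingest/pipeline/text-chunking-pipeline
{
  "description": "A pipeline that chunks text into passages",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10
          }
        },
        "field_map": {
          "passage_text": "passage_chunks"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}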
