Skip to content

Commit

Permalink
Refactor analyzers section (#8477) (#8478)
Browse files Browse the repository at this point in the history
  • Loading branch information
opensearch-trigger-bot[bot] authored Oct 8, 2024
1 parent 32b7731 commit 87932d2
Show file tree
Hide file tree
Showing 7 changed files with 44 additions and 15 deletions.
1 change: 1 addition & 0 deletions _analyzers/index-analyzers.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
layout: default
title: Index analyzers
nav_order: 20
parent: Analyzers
---

# Index analyzers
Expand Down
16 changes: 3 additions & 13 deletions _analyzers/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,20 +45,9 @@ An analyzer must contain exactly one tokenizer and may contain zero or more char

There is also a special type of analyzer called a ***normalizer***. A normalizer is similar to an analyzer except that it does not contain a tokenizer and can only include specific types of character filters and token filters. These filters can perform only character-level operations, such as character or pattern replacement, and cannot perform operations on the token as a whole. This means that replacing a token with a synonym or stemming is not supported. See [Normalizers]({{site.url}}{{site.baseurl}}/analyzers/normalizers/) for further details.

## Built-in analyzers
## Supported analyzers

The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.

Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
**Standard** (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.
For a list of supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).

## Custom analyzers

Expand Down Expand Up @@ -195,3 +184,4 @@ Normalization ensures that searches are not limited to exact term matches, allow
## Next steps

- Learn more about specifying [index analyzers]({{site.url}}{{site.baseurl}}/analyzers/index-analyzers/) and [search analyzers]({{site.url}}{{site.baseurl}}/analyzers/search-analyzers/).
- See the list of [supported analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).
3 changes: 2 additions & 1 deletion _analyzers/language-analyzers.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
layout: default
title: Language analyzers
nav_order: 10
nav_order: 100
parent: Analyzers
redirect_from:
- /query-dsl/analyzers/language-analyzers/
---
Expand Down
3 changes: 2 additions & 1 deletion _analyzers/search-analyzers.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
layout: default
title: Search analyzers
nav_order: 30
parent: Analyzers
---

# Search analyzers
Expand Down Expand Up @@ -42,7 +43,7 @@ GET shakespeare/_search
```
{% include copy-curl.html %}

Valid values for [built-in analyzers]({{site.url}}{{site.baseurl}}/analyzers/index#built-in-analyzers) are `standard`, `simple`, `whitespace`, `stop`, `keyword`, `pattern`, `fingerprint`, or any supported [language analyzer]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).
For more information about supported analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/supported-analyzers/index/).

## Specifying a search analyzer for a field

Expand Down
32 changes: 32 additions & 0 deletions _analyzers/supported-analyzers/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
---
layout: default
title: Analyzers
nav_order: 40
has_children: true
has_toc: false
redirect_from:
- /analyzers/supported-analyzers/index/
---

# Analyzers

The following sections list all analyzers that OpenSearch supports.

## Built-in analyzers

The following table lists the built-in analyzers that OpenSearch provides. The last column of the table contains the result of applying the analyzer to the string `It’s fun to contribute a brand-new PR or 2 to OpenSearch!`.

Analyzer | Analysis performed | Analyzer output
:--- | :--- | :---
**Standard** (default) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation <br> - Converts tokens to lowercase | [`it’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
**Simple** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts tokens to lowercase | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
**Whitespace** | - Parses strings into tokens on white space | [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
**Stop** | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Removes stop words <br> - Converts tokens to lowercase | [`s`, `fun`, `contribute`, `brand`, `new`, `pr`, `opensearch`]
**Keyword** (no-op) | - Outputs the entire string unchanged | [`It’s fun to contribute a brand-new PR or 2 to OpenSearch!`]
**Pattern** | - Parses strings into tokens using regular expressions <br> - Supports converting strings to lowercase <br> - Supports removing stop words | [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `2`, `to`, `opensearch`]
[**Language**]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/) | Performs analysis specific to a certain language (for example, `english`). | [`fun`, `contribut`, `brand`, `new`, `pr`, `2`, `opensearch`]
**Fingerprint** | - Parses strings on any non-letter character <br> - Normalizes characters by converting them to ASCII <br> - Converts tokens to lowercase <br> - Sorts, deduplicates, and concatenates tokens into a single token <br> - Supports removing stop words | [`2 a brand contribute fun it's new opensearch or pr to`] <br> Note that the apostrophe was converted to its ASCII counterpart.

## Language analyzers

OpenSearch supports analyzers for various languages. For more information, see [Language analyzers]({{site.url}}{{site.baseurl}}/analyzers/language-analyzers/).
2 changes: 2 additions & 0 deletions _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ title: Token filters
nav_order: 70
has_children: true
has_toc: false
redirect_from:
- /analyzers/token-filters/index/
---

# Token filters
Expand Down
2 changes: 2 additions & 0 deletions _analyzers/tokenizers/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ title: Tokenizers
nav_order: 60
has_children: false
has_toc: false
redirect_from:
- /analyzers/tokenizers/index/
---

# Tokenizers
Expand Down

0 comments on commit 87932d2

Please sign in to comment.