
Add Ascii folding token filter #7912

Merged
2 changes: 1 addition & 1 deletion _analyzers/token-filters/apostrophe.md
@@ -2,7 +2,7 @@
layout: default
title: Apostrophe
parent: Token filters
nav_order: 110
nav_order: 10
---

# Apostrophe token filter
135 changes: 135 additions & 0 deletions _analyzers/token-filters/asciifolding.md
@@ -0,0 +1,135 @@
---
layout: default
title: ASCII folding
parent: Token filters
nav_order: 20
---

# ASCII folding token filter


The `asciifolding` token filter converts non-ASCII characters into their closest ASCII equivalents. For example, *é* becomes *e*, *ü* becomes *u*, and *ñ* becomes *n*. This process is known as *transliteration*.


The `asciifolding` token filter offers a number of benefits:

- **Enhanced search flexibility**: Users often omit accents or special characters when typing queries. The `asciifolding` token filter ensures that such queries still return relevant results.
- **Normalization**: Standardizes the indexing process by ensuring that accented characters are consistently converted to their ASCII equivalents.
- **Internationalization**: Particularly useful for applications dealing with multiple languages and character sets.

While the `asciifolding` token filter can simplify searches, it might also lead to loss of specific information, particularly if the distinction between accented and non-accented characters is significant in the dataset.
{: .warning}
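
You can observe the folding behavior without creating an index by specifying the tokenizer and token filter inline in the `_analyze` API. The following request is an illustrative sketch and is not part of the configured example later on this page:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "résumé"
}
```
{% include copy-curl.html %}

With the filter's default settings, only the folded form is emitted, so this request should return the single token `resume`.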

## Parameters

You can configure the `asciifolding` token filter using the `preserve_original` parameter. Setting this parameter to `true` keeps both the original token and its ASCII-folded version in the token stream. This is particularly useful when you want search queries to match both the original (with accents) and the normalized (without accents) versions of a term. Default is `false`.

## Example

The following example request creates a new index named `example_index` and defines an analyzer with the `asciifolding` filter and `preserve_original` parameter set to `true`:

```json
PUT /example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "custom_ascii_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_ascii_folding"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
POST /example_index/_analyze
{
  "analyzer": "custom_ascii_analyzer",
  "text": "Résumé café naïve coördinate"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "resume",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "résumé",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cafe",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "café",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "naive",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "naïve",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "coordinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "coördinate",
      "start_offset": 18,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```
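
Because both the folded and the original tokens share the same position, a match query for either form finds the same document. As an illustrative sketch, assuming a hypothetical `title` field mapped with `custom_ascii_analyzer` (this mapping is not shown above), a query for `resume` would match a document containing *Résumé*, and so would a query for `résumé`:

```json
GET /example_index/_search
{
  "query": {
    "match": {
      "title": "resume"
    }
  }
}
```
{% include copy-curl.html %}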


2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -14,7 +14,7 @@ The following table lists all token filters that OpenSearch supports.

Token filter | Underlying Lucene token filter | Description
[`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it.
`asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.