From 3f026bc5491b6cb7a0251aec5dc9e776e3321ee2 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 2 Aug 2024 15:28:28 +0100 Subject: [PATCH 01/10] adding token filter page for cjk width #7875 Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 _analyzers/token-filters/cjk-width.md diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md new file mode 100644 index 0000000000..b6e30a08df --- /dev/null +++ b/_analyzers/token-filters/cjk-width.md @@ -0,0 +1,10 @@ +--- +layout: default +title: CJK width +parent: Token filters +nav_order: 140 +--- + +# CJK width + +The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width and half-width character variants to their standard forms. Let's elaborate on the key rules mentioned: From caed0229543212c86e8f6f49409073b9a7542f42 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 2 Aug 2024 16:00:01 +0100 Subject: [PATCH 02/10] adding details to the page Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 86 ++++++++++++++++++++++++++- _analyzers/token-filters/index.md | 2 +- 2 files changed, 85 insertions(+), 3 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index b6e30a08df..d9403cd3b3 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -5,6 +5,88 @@ parent: Token filters nav_order: 140 --- -# CJK width +# CJK width token filter + +The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII Character to their standard (half-width) ASCII equivalents and half-width Katakana characters to their full-width. 
+ + - __Converting full-width ASCII Character__: In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. + + See following example: + + Full-Width: ABCDE 12345 + Normalized: (Half-Width): ABCDE 12345 + + - __Converting half-width Katakana characters__: The CJK Width token filter converts half-width Katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. + + See following example: + + Half-Width "Katakana": カタカナ + Normalized (Full-Width "Katakana"): カタカナ + + + +## Example + +Following is an example of how you can define an analyzer with the `cjk_bigram_filter` filter with `ignore_scripts` set to `deva`: + +```json +PUT /cjk_width_example_index +{ + "settings": { + "analysis": { + "analyzer": { + "cjk_width_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": ["cjk_width"] + } + } + } + } +} +``` +{% include copy-curl.html %} + +## Generated tokens + +You can use the following command to examine the tokens being generated using the created analyzer: + +```json +POST /cjk_width_example_index/_analyze +{ + "analyzer": "cjk_width_analyzer", + "text": "Tokyo 2024 カタカナ" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "Tokyo", + "start_offset": 0, + "end_offset": 5, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "2024", + "start_offset": 6, + "end_offset": 10, + "type": "<NUM>", + "position": 1 + }, + { + "token": "カタカナ", + "start_offset": 11, + "end_offset": 15, + "type": "<KATAKANA>", + "position": 2 + } + 
] +} +``` -The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width and half-width character variants to their standard forms. Let's elaborate on the key rules mentioned: diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index e6d9875736..57246e164f 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -16,7 +16,7 @@ Token filter | Underlying Lucene token filter| Description `apostrophe` | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token that contains an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following the apostrophe. `asciifolding` | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. -`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. +[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. From 6d6025371f1675042ac3ce04fb57909c9656a086 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 2 Aug 2024 16:05:49 +0100 Subject: [PATCH 03/10] adding details to the page Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index d9403cd3b3..70dc9c5df6 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -27,7 +27,7 @@ The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korea ## Example -Following is an example of how you can define an analyzer with the `cjk_bigram_filter` filter with `ignore_scripts` set to `deva`: +Following is an example of how you can define an analyzer with the `cjk_width` filter: ```json PUT /cjk_width_example_index From 5b5cf38bdf76f47bdb92fc29e25509d9098f646c Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Tue, 6 Aug 2024 17:33:58 +0100 Subject: [PATCH 04/10] Updating details as per comments Signed-off-by: AntonEliatra --- 
_analyzers/token-filters/cjk-width.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index 70dc9c5df6..bffaf1c340 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -2,7 +2,7 @@ layout: default title: CJK width parent: Token filters -nav_order: 140 +nav_order: 40 --- # CJK width token filter @@ -24,10 +24,9 @@ The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korea Normalized (Full-Width "Katakana"): カタカナ - ## Example -Following is an example of how you can define an analyzer with the `cjk_width` filter: +The following example request creates a new index named `cjk_width_example_index` and defines an analyzer with the `cjk_width` filter: ```json PUT /cjk_width_example_index @@ -49,7 +48,7 @@ PUT /cjk_width_example_index ## Generated tokens -You can use the following command to examine the tokens being generated using the created analyzer: +Use the following request to examine the tokens generated using the created analyzer: ```json POST /cjk_width_example_index/_analyze From 0285a97bbfce9ed600d621419da997f5552927fa Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Tue, 3 Sep 2024 16:06:14 +0100 Subject: [PATCH 05/10] Update cjk-width.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index bffaf1c340..6472a66143 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -9,14 +9,14 @@ nav_order: 40 The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII Character to their standard (half-width) ASCII equivalents and half-width Katakana characters to their full-width. 
- - __Converting full-width ASCII Character__: In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. + - **Converting full-width ASCII Character**: In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. See following example: Full-Width: ABCDE 12345 Normalized: (Half-Width): ABCDE 12345 - - __Converting half-width Katakana characters__: The CJK Width token filter converts half-width Katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. + - **Converting half-width Katakana characters**: The CJK Width token filter converts half-width Katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. 
See following example: From f4ecdb07fd2b9daeb3bcf837a1c016618009be33 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 10:12:51 +0100 Subject: [PATCH 06/10] Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index 6472a66143..3cb6eaacab 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -7,21 +7,21 @@ nav_order: 40 # CJK width token filter -The CJK Width token filter in OpenSearch normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII Character to their standard (half-width) ASCII equivalents and half-width Katakana characters to their full-width. +The CJK Width token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents. - **Converting full-width ASCII Character**: In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. 
- See following example: +The following example illustrates ASCII character normalization: Full-Width: ABCDE 12345 - Normalized: (Half-Width): ABCDE 12345 + Normalized (half-width): ABCDE 12345 - - **Converting half-width Katakana characters**: The CJK Width token filter converts half-width Katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. + - **Converting half-width katakana characters**: The CJK width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. - See following example: +The following example illustrates ASCII converting half-width katakana characters: - Half-Width "Katakana": カタカナ - Normalized (Full-Width "Katakana"): カタカナ + Half-Width katakana: カタカナ + Normalized (full-width) katakana: カタカナ ## Example From f9992007765061a88f0cdead6b5240d9c2cff1ee Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 10:36:45 +0100 Subject: [PATCH 07/10] Update cjk-width.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index 3cb6eaacab..0af7043af8 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -9,20 +9,24 @@ nav_order: 40 The CJK Width token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents. 
- - **Converting full-width ASCII Character**: In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. +### Converting full-width ASCII Character -The following example illustrates ASCII character normalization: +In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. - Full-Width: ABCDE 12345 +The following example illustrates ASCII character normalization: +``` + Full-Width: ABCDE 12345 Normalized (half-width): ABCDE 12345 +``` +### Converting half-width katakana characters - - **Converting half-width katakana characters**: The CJK width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. +The CJK width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. 
The following example illustrates ASCII converting half-width katakana characters: - - Half-Width katakana: カタカナ - Normalized (full-width) katakana: カタカナ - +``` + Half-Width katakana: カタカナ + Normalized (full-width) katakana: カタカナ +``` ## Example @@ -88,4 +92,3 @@ The response contains the generated tokens: ] } ``` - From 823086fb0534ff3f829914d7d251adf5a46d22e1 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 11:09:23 +0100 Subject: [PATCH 08/10] Update cjk-width.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index 0af7043af8..d5f5fc0925 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -52,7 +52,7 @@ PUT /cjk_width_example_index ## Generated tokens -Use the following request to examine the tokens generated using the created analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json POST /cjk_width_example_index/_analyze From 0505d9b2fbebc7b6c749632299caaa683c6cc86f Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Thu, 12 Sep 2024 14:50:30 -0400 Subject: [PATCH 09/10] Apply suggestions from code review Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/token-filters/cjk-width.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index d5f5fc0925..c8863b3e50 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -14,15 +14,18 @@ The CJK Width token filter normalizes Chinese, Japanese, and Korean (CJK) tokens In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. 
Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. The following example illustrates ASCII character normalization: + ``` Full-Width: ABCDE 12345 Normalized (half-width): ABCDE 12345 ``` + ### Converting half-width katakana characters The CJK width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. The following example illustrates ASCII converting half-width katakana characters: + ``` Half-Width katakana: カタカナ Normalized (full-width) katakana: カタカナ From e373b7b999d48c4a2320aa861a245c8002c1f973 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 13 Sep 2024 12:01:31 +0100 Subject: [PATCH 10/10] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: AntonEliatra --- _analyzers/token-filters/cjk-width.md | 9 ++++----- _analyzers/token-filters/index.md | 2 +- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md index c8863b3e50..4960729cd1 100644 --- a/_analyzers/token-filters/cjk-width.md +++ b/_analyzers/token-filters/cjk-width.md @@ -7,11 +7,11 @@ nav_order: 40 # CJK width token filter -The CJK Width token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents. +The `cjk_width` token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents. 
-### Converting full-width ASCII Character +### Converting full-width ASCII characters -In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, which occupies the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography to align with the width of CJK characters. However, for the purpose of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. +In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, for the purposes of indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents. The following example illustrates ASCII character normalization: @@ -22,9 +22,8 @@ The following example illustrates ASCII character normalization: ### Converting half-width katakana characters -The CJK width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization is important for consistency in text processing and searching. +The `cjk_width` token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. 
This normalization, illustrated in the following example, is important for consistency in text processing and searching: -The following example illustrates ASCII converting half-width katakana characters: ``` Half-Width katakana: カタカナ diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index e1704212d0..86925123b8 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -16,7 +16,7 @@ Token filter | Underlying Lucene token filter| Description [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it. [`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters. `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. -[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. +[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into their equivalent basic Latin characters.
- Folds half-width katakana character variants into their equivalent kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
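The folding behavior documented in the patches above can be sketched outside OpenSearch. Note this is an illustrative approximation, not the filter's actual implementation: Unicode NFKC normalization folds a superset of what Lucene's `CJKWidthFilter` does (it also normalizes ligatures, circled numbers, and other compatibility characters), but for full-width ASCII and half-width katakana the two coincide.

```python
import unicodedata

def fold_width(text: str) -> str:
    # Approximation of the cjk_width token filter using Unicode NFKC
    # normalization. NFKC is broader than Lucene's CJKWidthFilter, but
    # the full-width-ASCII and half-width-katakana foldings match.
    return unicodedata.normalize("NFKC", text)

# Full-width ASCII letters, digits, and the ideographic space fold to
# their standard (half-width) ASCII equivalents.
print(fold_width("ＡＢＣＤＥ　１２３４５"))  # -> ABCDE 12345

# Half-width katakana folds to full-width katakana; a trailing
# half-width voiced sound mark is composed into the preceding character.
print(fold_width("ｶﾀｶﾅ"))  # -> カタカナ
print(fold_width("ｶﾞ"))    # -> ガ
```

Running the documented `_analyze` example through this sketch shows why the filter matters for search: `Ｔｏｋｙｏ` and `Tokyo` index to the same token after folding.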