
Added support for 33 language analyzers #779

Open · wants to merge 6 commits into main
Conversation

@brentam (Contributor) commented Dec 21, 2023

Description

When trying to build IndexSettings, there is no way to build the default analyzers for the languages supported in OpenSearch; the only supported language at the moment is Dutch. This is also a problem when deserializing IndexSettings from JSON: it will fail for any language other than Dutch.
See issue #684.

In this PR, support for creating language Analyzers was added for these languages:
arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, czech, danish, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai

Issues Resolved

Closes #684.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

2. Adding methods for other languages in Analyzer.java

Signed-off-by: brentam <[email protected]>
@dblock (Member) commented Dec 22, 2023

LGTM! Fix spotless & friends?

Would love to have a sample/doc for this, https://github.com/opensearch-project/opensearch-java/tree/main/samples and/or https://github.com/opensearch-project/opensearch-java/tree/main/guides.

@brentam (Contributor, Author) commented Dec 23, 2023

> LGTM! Fix spotless & friends?
>
> Would love to have a sample/doc for this, https://github.com/opensearch-project/opensearch-java/tree/main/samples and/or https://github.com/opensearch-project/opensearch-java/tree/main/guides.

Yes, I ran the spotlessApply command and fixed the formatting.
I will see if I can add to the samples documents, though probably not this week.

@reta (Collaborator) left a comment

@brentam my apologies, but this is not how it is already implemented in opensearch-java: the language-specific analyzers are the combination of the org.opensearch.client.opensearch._types.analysis.LanguageAnalyzer class and the org.opensearch.client.opensearch._types.analysis.Language enumeration.

@brentam (Contributor, Author) commented Dec 30, 2023

> @brentam my apologies but this is not how it is already implemented in opensearch-java: the language specific analyzers are the combination of org.opensearch.client.opensearch._types.analysis.LanguageAnalyzer and org.opensearch.client.opensearch._types.analysis.Language enumeration

@reta
I believe the current changes are valid.

LanguageAnalyzer generates a completely different analyzer than DutchAnalyzer (remember, DutchAnalyzer already existed; I just added analyzers for the other languages: GermanAnalyzer, ItalianAnalyzer, etc.).

A more detailed explanation follows:

In the test below we deserialize IndexSettings from a valid index JSON that includes an analyzer with "type": "dutch". Debugging, one can see that it uses the existing DutchAnalyzer class to deserialize the analyzer; it does not touch the Language enumeration or the LanguageAnalyzer at any point.

    @Test
    public void testParsingAnalyzer() {
        JsonpMapper mapper = new JsonbJsonpMapper();
        String json = "{ \"index\": { \"analysis\": { \"analyzer\": { \"some_analyzer\": { \"type\": \"dutch\", \"char_filter\": [ \"html_strip\" ], \"tokenizer\": \"standard\" } } } } }";

        JsonParser parser = mapper.jsonProvider().createParser(new StringReader(json));
        IndexSettings indexSettings = IndexSettings._DESERIALIZER.deserialize(parser, mapper);
        assertNotNull(indexSettings);
    }

If we serialize it back to JSON we get:
{"index":{"analysis":{"analyzer":{"some_analyzer":{"type":"dutch"}}}}}

We are missing the char_filter, but that is a discussion for another time/fix.

The point is: this is a valid analyzer definition in OpenSearch.

We won't use the LanguageAnalyzer to build that analyzer. Here is the code we use to build it:


    Map<String, Analyzer> map = new HashMap<>();
    map.put("my_dutch_analyzer", DutchAnalyzer.of(it -> it)._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

Therefore, I think the current changes are valid.

The code using the LanguageAnalyzer generates a completely different analyzer. When creating an IndexSettings using the LanguageAnalyzer builder plus Language, it generates invalid JSON. This code:

    Map<String, Analyzer> map = new HashMap<>();
    map.put("some_analyzer",
        LanguageAnalyzer.of(l -> l.language(Language.German).stemExclusion(new ArrayList<>()))
            ._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

generates this JSON:
{"index":{"analysis":{"analyzer":{"some_analyzer":{"type":"language","language":"German","stem_exclusion":[]}}}}}

It is worth noting that this configuration will fail in OpenSearch, since "type": "language" is not valid.

You can test it in the Dashboards console using this command:

PUT test_idx1
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {
        "doc_values": false,
        "type": "text",
        "analyzer": "some_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "some_analyzer": {
            "type": "language",
            "language": "This_can_be_anything_actually",
            "stem_exclusion": []
          }
        }
      }
    }
  }
}

@brentam (Contributor, Author) commented Dec 30, 2023

@reta
Some more points about the LanguageAnalyzer.
I am not sure what the JSON generated from that builder is supposed to be.

I already pointed out that the "type": "language" generated by the LanguageAnalyzer builder is not valid:

          "some_analyzer": {
            "type": "language",
            "language": "German",
            "stem_exclusion": []
          }

Regardless, the design decision to restrict the language to the enum is suspect, since OpenSearch/Elasticsearch allows anything as the value of the language parameter.
Check this: the configuration below will be accepted by OpenSearch and the index will be created:

          "some_analyzer": {
            "type": "custom",
            "language": "AlienLanguage",
            "stem_exclusion": []
          }

Note that "language": "AlienLanguage" is accepted, so this type of analyzer is not actually creating an analyzer that uses the supported language analyzers.

My guess is that the "language" field in these analyzers serves as metadata and has no bearing on the text processing; it is used when you are creating your own custom analyzer for any language from scratch.

@reta (Collaborator) commented Dec 30, 2023

@brentam

My understanding is that LanguageAnalyzer is used for languages that are supported by OpenSearch (hence the enumeration), whereas the custom analyzer is used in case you need any kind of customization.

> LanguageAnalyzer is generating a completely different analyzer than DutchAnalyzer ( Remember, DutchAnalyzer already existed, I just added other analyzers for other languages: GermanAnalyzer, ItalianAnalyzer, etc...).

I sadly don't have any insights, but I believe it appeared for exactly the same reason as yours (no way to specify a language analyzer). I am not claiming that LanguageAnalyzer + Language actually works, but I suspect it should be serving this purpose. Referring to your example:

    Map<String, Analyzer> map = new HashMap<>();
    map.put("some_analyzer",
        LanguageAnalyzer.of(l -> l.language(Language.German).stemExclusion(new ArrayList<>()))
            ._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

should roughly be serialized as (limiting to only the analyzer part):

 "analyzer": {
        "some_analyzer": {
            "type": "german",
            "stem_exclusion": []
        }
 }

> Therefore, I think the current changes are valid.

I am not saying that your change is not valid. I am only pointing out that we already have a mechanism to use language-specific analyzers which apparently does not work; that, to me, is the issue that has to be fixed (vs. introducing yet another way to achieve the desired goal).

@brentam (Contributor, Author) commented Dec 30, 2023

@reta
Thank you so much for your response.

So the existing DutchAnalyzer was obsolete and should have been removed with the introduction of the LanguageAnalyzer...
In any case, I could remove my changes and modify LanguageAnalyzer + Language to construct the "correct" JSON.

But I have some reservations:

The Analyzer.Kind enum contains these values, among others:

        Snowball("snowball"),
        Dutch("dutch"),
        Language("language"),

The string is supposed to be the value of the JSON field "type".
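To make that mapping concrete, here is a minimal, self-contained sketch of the enum-carries-its-JSON-"type" pattern. The constant names mirror the snippet above, but this is an illustration, not the actual opensearch-java Analyzer.Kind class:

```java
// Sketch of the Kind-enum pattern described above: each constant carries the
// string written to / read from the JSON "type" field. Illustration only,
// not the real opensearch-java Analyzer.Kind.
public class KindSketch {
    enum Kind {
        Snowball("snowball"),
        Dutch("dutch"),
        // The problematic value: "type": "language" is not a valid analyzer
        // type in OpenSearch, as discussed earlier in this thread.
        Language("language");

        private final String jsonValue;

        Kind(String jsonValue) {
            this.jsonValue = jsonValue;
        }

        String jsonValue() {
            return jsonValue;
        }
    }

    public static void main(String[] args) {
        // A serializer following this pattern writes kind.jsonValue()
        // as the "type" field of the analyzer definition.
        System.out.println(Kind.Dutch.jsonValue());
        System.out.println(Kind.Language.jsonValue());
    }
}
```

Under this pattern, adding a per-language analyzer naturally means adding one enum constant per language, which is why the two approaches below arise.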

If we use the LanguageAnalyzer, we have two approaches:

  1. Keep the Language value. We would have to modify the serialization to not expect/generate the "language" value in the JSON.
  2. Add all 33 new language values to the enum and then map them to the LanguageAnalyzer during deserialization/serialization.

Regardless of the approach, this new change will not follow the design/code used for the other enum values.
This is another point reinforcing that having DutchAnalyzer, EnglishAnalyzer, etc. seems a better approach than having LanguageAnalyzer + Language.

@reta it seems the LanguageAnalyzer + Language approach was not conceived with the current design in mind: introducing the Language("language") enum value makes no sense and, as we saw, it generates/expects "type": "language" because of that.
I am keen to refactor one way or another, but I would like a green light, and I need some clarity regarding the points above before starting to work on the changes.

I don't want to spend time on it just to have the changes rejected later on.

As I said, the per-language analyzer design seems the correct way, since it follows the current design; LanguageAnalyzer + Language will result in code that deviates from it.

@reta (Collaborator) commented Dec 31, 2023

Thanks @brentam

> As I said, it seems to me that design using the Per Language Analyzer seems the correct way since it follows the current design,
> The LanguageAnalyzer+Language will result in some code that deviates from the current design...

So the current client model reflects the built-in analyzer types as they are described in the documentation [1] (sadly I have to refer to Elasticsearch 7.10 here, as the OpenSearch documentation is not yet complete). In there, all language-specific analyzers fall under the "Language analyzers" category. I think if LanguageAnalyzer + Language is not supposed to configure language-specific analyzers, then I honestly don't know what its purpose really is.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-analyzers.html

> If using the LanguageAnalyzer we can have two approaches.
>
>   1. Keep the Language value. We will have to modify the serialization to not expect/generate the "language" value in the json
>   2. Have all the 33 new Languages values added to the enum and then mapping them to the LanguageAnalyzer later during deserialization/Serialization?

I was thinking about this approach: "language" is gone (your 1st point) and "type" becomes the value of the language, which represents all languages supported by OpenSearch since it is an enumeration (covering your 2nd point). I think it would fix two issues at once: a) allow specifying a language analyzer, and b) fix the incorrect (de)serialization. It is also quite a small change from the scope perspective.

What do you think about this?

@brentam (Contributor, Author) commented Dec 31, 2023

@reta
If we use the LanguageAnalyzer, the CjkAnalyzer will have to be deleted too. That analyzer was added in August 2023, and it follows the pattern of one analyzer per language, the same way I implemented the other 33 languages.
I think that with the current design the analyzer-per-language approach seems better, since Dutch and Cjk already follow it, and beyond adding the new classes we don't need to change anything.

Regardless, I am working on some quick changes to enable the LanguageAnalyzer to generate the correct JSON.
I should have a branch soon and will post the link here so you can get an idea of the changes. Some types will need to be determined at runtime, and there are some discrepancies between the builder and deserialization (the builder will keep requiring a language field, but when parsing from JSON we should use the "type" field, etc.).

I will actually create another PR, so we can discuss the changes there.

@brentam (Contributor, Author) commented Dec 31, 2023

@reta
Here is the PR with the quick changes to the Analyzer/LanguageAnalyzer: #788

@brentam (Contributor, Author) commented Dec 31, 2023

> I was thinking about this approach: "language" is gone (your 1st point) and "type" becomes the value of language (which represents all languages supported by OpenSearch since it is an enumeration, covers your 2nd point). I think it would fix two issues at once: a) allow to specify language analyzer b) fix incorrect (de)serialization. It is also quite a small change from the scope perspective.
>
> What do you think about this?

@reta

We cannot keep the Language("language") value; "language" is what would then be expected as the type in the JSON, i.e. {"type":"language"}, which is wrong. So basically I think we will have to keep one value per language: Dutch("dutch"), English("english"), etc.
During deserialization we transform all those values to a LanguageAnalyzer.
Please check the new PR with the draft changes and let me know what you think.
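The transformation described in this thread (one "type" value per language, all funneled into a single language-analyzer representation during deserialization) can be sketched in a self-contained way. Everything here is hypothetical: LangAnalyzer, fromType, and the language subset are illustrative names, not the opensearch-java API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Sketch of the proposed deserialization dispatch: the JSON "type" field
// carries the language name directly ("dutch", "german", ...), and every
// per-language type value maps to one shared language-analyzer object.
public class LanguageDispatchSketch {

    // Tiny stand-in for a LanguageAnalyzer-like value object (hypothetical).
    record LangAnalyzer(String language) {}

    // Subset of the per-language "type" values a deserializer would accept
    // (the PR adds 33 such languages; five shown here for brevity).
    private static final Set<String> LANGUAGE_TYPES =
        Set.of("dutch", "english", "german", "italian", "french");

    // Map a JSON "type" value to a language analyzer, if it names a language.
    // Note that "language" itself is deliberately NOT in the set: as discussed
    // above, "type": "language" is not a valid analyzer type in OpenSearch.
    static Optional<LangAnalyzer> fromType(String type) {
        String t = type.toLowerCase(Locale.ROOT);
        return LANGUAGE_TYPES.contains(t)
            ? Optional.of(new LangAnalyzer(t))
            : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromType("german"));   // present: a language analyzer
        System.out.println(fromType("language")); // empty: invalid type value
    }
}
```

The design point this illustrates is the one brentam raises: with per-language type strings there is no separate "language" field to round-trip, so serialization and deserialization can both key off "type" alone.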
