
Added support for 33 language analyzers #779

Open · wants to merge 6 commits into main
Conversation

@brentam (Contributor) commented Dec 21, 2023

Description

When trying to build IndexSettings, there is no way to build the default analyzers for the languages supported in OpenSearch; the only supported language at the moment is Dutch. This is also a problem when deserializing IndexSettings from JSON: it will fail for any language other than Dutch.
See issue #684.

In this PR, support for creating language Analyzers was added for these languages:
arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, czech, danish, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai

Issues Resolved

Closes #684.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

2. Adding methods for other languages in Analyzer.java

Signed-off-by: brentam <[email protected]>
@dblock (Member) commented Dec 22, 2023

LGTM! Fix spotless & friends?

Would love to have a sample/doc for this, https://github.com/opensearch-project/opensearch-java/tree/main/samples and/or https://github.com/opensearch-project/opensearch-java/tree/main/guides.

@brentam (Contributor, Author) commented Dec 23, 2023

> LGTM! Fix spotless & friends?
>
> Would love to have a sample/doc for this, https://github.com/opensearch-project/opensearch-java/tree/main/samples and/or https://github.com/opensearch-project/opensearch-java/tree/main/guides.

Yes, I ran the spotlessApply command and fixed the formatting.
I will see if I can add to the samples documents, though probably not this week.

@reta (Collaborator) left a comment

@brentam my apologies, but this is not how it is already implemented in opensearch-java: the language-specific analyzers are the combination of the org.opensearch.client.opensearch._types.analysis.LanguageAnalyzer class and the org.opensearch.client.opensearch._types.analysis.Language enumeration.

@brentam (Contributor, Author) commented Dec 30, 2023

> @brentam my apologies but this is not how it is already implemented in opensearch-java: the language specific analyzers are the combination of org.opensearch.client.opensearch._types.analysis.LanguageAnalyzer and org.opensearch.client.opensearch._types.analysis.Language enumeration

@reta
I believe the current changes are valid.

LanguageAnalyzer generates a completely different analyzer than DutchAnalyzer (remember, DutchAnalyzer already existed; I just added analyzers for the other languages: GermanAnalyzer, ItalianAnalyzer, etc.).

A more detailed explanation follows:

In the test below we deserialize IndexSettings from a valid index JSON that includes an analyzer with "type": "dutch". Debugging, one can see that it uses the existing DutchAnalyzer class to deserialize the analyzer; it does not touch the Language enumeration or the LanguageAnalyzer at any point.

    @Test
    public void testParsingAnalyzer() {
        JsonpMapper mapper = new JsonbJsonpMapper();
        String json = "{ \"index\": { \"analysis\": { \"analyzer\": { \"some_analyzer\": { \"type\": \"dutch\", \"char_filter\": [ \"html_strip\" ], \"tokenizer\": \"standard\" } } } } }";

        JsonParser parser = mapper.jsonProvider().createParser(new StringReader(json));
        IndexSettings indexSettings = IndexSettings._DESERIALIZER.deserialize(parser, mapper);
        assertNotNull(indexSettings);
    }

If we serialize it back to JSON we get:
{"index":{"analysis":{"analyzer":{"some_analyzer":{"type":"dutch"}}}}}

We are missing the char_filter, but that is a discussion for another time/fix.

The point is: this is a valid analyzer definition in OpenSearch.

We won't use the LanguageAnalyzer to build that analyzer. Here is the code we use to build it:


    Map<String, Analyzer> map = new HashMap<>();
    map.put("my_dutch_analyzer", DutchAnalyzer.of(it -> it)._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

Therefore, I think the current changes are valid.

The code using the LanguageAnalyzer generates a completely different analyzer. When creating an IndexSettings using the LanguageAnalyzer builder plus Language, it generates invalid JSON. This code:

    Map<String, Analyzer> map = new HashMap<>();
    map.put("some_analyzer",
        LanguageAnalyzer.of(l -> l.language(Language.German).stemExclusion(new ArrayList<>()))
            ._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

generates this JSON:
{"index":{"analysis":{"analyzer":{"some_analyzer":{"type":"language","language":"German","stem_exclusion":[]}}}}}

It is worth noting that this configuration will fail in OpenSearch, since "type": "language" is not valid.

You can test it in the Dashboards console using this command:

PUT test_idx1
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {
        "doc_values": false,
        "type": "text",
        "analyzer": "some_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "some_analyzer": {
            "type": "language",
            "language": "This_can_be_anything_actually",
            "stem_exclusion": []
          }
        }
      }
    }
  }
}

@brentam (Contributor, Author) commented Dec 30, 2023

@reta
Some more points about the LanguageAnalyzer.
I am not sure what the JSON generated from that builder is supposed to be.

I already pointed out that the "type": "language" generated by the LanguageAnalyzer builder is not valid:

          "some_analyzer": {
            "type": "language",
            "language": "German",
            "stem_exclusion": []
          }

Regardless, the design decision to restrict the language to the enum is suspect, since OpenSearch/Elasticsearch allows anything as the value of the language parameter.
Check this: the configuration below will be accepted by OpenSearch and the index will be created:

          "some_analyzer": {
            "type": "custom",
            "language": "AlienLanguage",
            "stem_exclusion": []
          }

Note that "language": "AlienLanguage" is accepted, so this type of analyzer is not actually creating an analyzer that uses the supported language analyzers.

My guess is that the "language" field in these analyzers serves as metadata and has no bearing on the text processing; it is used when you are creating your own custom analyzer for any language from scratch.

@reta (Collaborator) commented Dec 30, 2023

@brentam

My understanding is that LanguageAnalyzer is used for languages that are supported by OpenSearch (hence the enumeration), whereas the custom analyzer is used in case you need any kind of customization.

> LanguageAnalyzer is generating a completely different analyzer than DutchAnalyzer ( Remember, DutchAnalyzer already existed, I just added other analyzers for other languages: GermanAnalyzer, ItalianAnalyzer, etc...).

I sadly don't have any insights, but I believe it appeared for exactly the same reason as yours (no way to specify a language analyzer). I am not claiming that LanguageAnalyzer + Language actually works, but I suspect it should be serving this purpose. Referring to your example:

    Map<String, Analyzer> map = new HashMap<>();
    map.put("some_analyzer",
        LanguageAnalyzer.of(l -> l.language(Language.German).stemExclusion(new ArrayList<>()))
            ._toAnalyzer());
    IndexSettings settings = IndexSettings.of(
        it -> it.analysis(a -> a.analyzer(map))
    );

should roughly be serialized as (limiting to only the analyzer part):

 "analyzer": {
        "some_analyzer": {
            "type": "german",
            "stem_exclusion": []
        }
 }

> Therefore, I think the current changes are valid.

I am not saying that your change is not valid. I am only pointing out that we already have a mechanism to use language-specific analyzers which apparently does not work; that, to me, is the issue that has to be fixed (vs. introducing yet another way to achieve the desired goal).

@brentam (Contributor, Author) commented Dec 30, 2023

@reta
Thank you so much for your response.

So the existing DutchAnalyzer was obsolete and should have been removed with the introduction of the LanguageAnalyzer...
In any case, I could remove my changes and modify LanguageAnalyzer + Language to construct the "correct" JSON.

But I have some reservations:

The Analyzer.Kind enum contains these values, among others:

        Snowball("snowball"),
        Dutch("dutch"),
        Language("language"),

The string is supposed to be the value of the JSON field "type".
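To make that mapping concrete, here is a minimal, self-contained sketch of the enum-carries-its-JSON-"type" pattern. The constant names mirror the snippet above, but this is an illustration, not the actual opensearch-java Analyzer.Kind class:

```java
// Sketch of the Kind-enum pattern described above: each constant carries the
// string written to / read from the JSON "type" field. Illustration only,
// not the real opensearch-java Analyzer.Kind.
public class KindSketch {
    enum Kind {
        Snowball("snowball"),
        Dutch("dutch"),
        // The problematic value: "type": "language" is not a valid analyzer
        // type in OpenSearch, as discussed earlier in this thread.
        Language("language");

        private final String jsonValue;

        Kind(String jsonValue) {
            this.jsonValue = jsonValue;
        }

        String jsonValue() {
            return jsonValue;
        }
    }

    public static void main(String[] args) {
        // A serializer following this pattern writes kind.jsonValue()
        // as the "type" field of the analyzer definition.
        System.out.println(Kind.Dutch.jsonValue());
        System.out.println(Kind.Language.jsonValue());
    }
}
```

Under this pattern, adding a per-language analyzer naturally means adding one enum constant per language, which is why the two approaches below arise.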

If we use the LanguageAnalyzer, we have two approaches:

  1. Keep the Language value. We would have to modify the serialization to not expect/generate the "language" value in the JSON.
  2. Add all 33 new language values to the enum and then map them to the LanguageAnalyzer during deserialization/serialization.

Regardless of the approach, this new change will not follow the design/code used for the other enum values.
This is another point reinforcing that having DutchAnalyzer, EnglishAnalyzer, etc. seems a better approach than having LanguageAnalyzer + Language.

@reta it seems the LanguageAnalyzer + Language approach was not conceived with the current design in mind: introducing the Language("language") enum value makes no sense and, as we saw, it generates/expects "type": "language" because of that.
I am keen to refactor one way or another, but I would like a green light, and I need some clarity regarding the points above before starting to work on the changes.

I don't want to spend time on it just to have the changes rejected later on.

As I said, the per-language analyzer design seems the correct way, since it follows the current design; LanguageAnalyzer + Language will result in code that deviates from it.

@reta (Collaborator) commented Dec 31, 2023

Thanks @brentam

> As I said, it seems to me that design using the Per Language Analyzer seems the correct way since it follows the current design,
> The LanguageAnalyzer+Language will result in some code that deviates from the current design...

So the current client model reflects the built-in analyzer types as they are described in the documentation [1] (sadly I have to refer to Elasticsearch 7.10 here, as the OpenSearch documentation is not yet complete). In there, all language-specific analyzers fall under the "Language analyzers" category. I think if LanguageAnalyzer + Language is not supposed to configure language-specific analyzers, then I honestly don't know what its purpose really is.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-analyzers.html

> If using the LanguageAnalyzer we can have two approaches.
>
>   1. Keep the Language value. We will have to modify the serialization to not expect/generate the "language" value in the json
>   2. Have all the 33 new Languages values added to the enum and then mapping them to the LanguageAnalyzer later during deserialization/Serialization?

I was thinking about this approach: "language" is gone (your 1st point) and "type" becomes the value of the language, which represents all languages supported by OpenSearch since it is an enumeration (covering your 2nd point). I think it would fix two issues at once: a) allow specifying a language analyzer, and b) fix the incorrect (de)serialization. It is also quite a small change from the scope perspective.

What do you think about this?

@brentam (Contributor, Author) commented Dec 31, 2023

@reta
If we use the LanguageAnalyzer, the CjkAnalyzer will have to be deleted too. That analyzer was added in August 2023, and it follows the pattern of one analyzer per language, the same way I implemented the other 33 languages.
I think that with the current design the analyzer-per-language approach seems better, since Dutch and Cjk already follow it, and beyond adding the new classes we don't need to change anything.

Regardless, I am working on some quick changes to enable the LanguageAnalyzer to generate the correct JSON.
I should have a branch soon and will post the link here so you can get an idea of the changes. Some types will need to be determined at runtime, and there are some discrepancies between the builder and deserialization (the builder will keep requiring a language field, but when parsing from JSON we should use the "type" field, etc.).

I will actually create another PR, so we can discuss the changes there.

@brentam (Contributor, Author) commented Dec 31, 2023

@reta
Here is the PR with the quick changes to the Analyzer/LanguageAnalyzer: #788

@brentam (Contributor, Author) commented Dec 31, 2023

> I was thinking about this approach: "language" is gone (your 1st point) and "type" becomes the value of language (which represents all languages supported by OpenSearch since it is an enumeration, covers your 2nd point). I think it would fix two issues at once: a) allow to specify language analyzer b) fix incorrect (de)serialization. It is also quite a small change from the scope perspective.
>
> What do you think about this?

@reta

We cannot keep the Language("language") value; "language" is what would then be expected as the type in the JSON, i.e. {"type":"language"}, which is wrong. So basically I think we will have to keep one value per language: Dutch("dutch"), English("english"), etc.
During deserialization we transform all those values to a LanguageAnalyzer.
Please check the new PR with the draft changes and let me know what you think.
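The transformation described in this thread (one "type" value per language, all funneled into a single language-analyzer representation during deserialization) can be sketched in a self-contained way. Everything here is hypothetical: LangAnalyzer, fromType, and the language subset are illustrative names, not the opensearch-java API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Sketch of the proposed deserialization dispatch: the JSON "type" field
// carries the language name directly ("dutch", "german", ...), and every
// per-language type value maps to one shared language-analyzer object.
public class LanguageDispatchSketch {

    // Tiny stand-in for a LanguageAnalyzer-like value object (hypothetical).
    record LangAnalyzer(String language) {}

    // Subset of the per-language "type" values a deserializer would accept
    // (the PR adds 33 such languages; five shown here for brevity).
    private static final Set<String> LANGUAGE_TYPES =
        Set.of("dutch", "english", "german", "italian", "french");

    // Map a JSON "type" value to a language analyzer, if it names a language.
    // Note that "language" itself is deliberately NOT in the set: as discussed
    // above, "type": "language" is not a valid analyzer type in OpenSearch.
    static Optional<LangAnalyzer> fromType(String type) {
        String t = type.toLowerCase(Locale.ROOT);
        return LANGUAGE_TYPES.contains(t)
            ? Optional.of(new LangAnalyzer(t))
            : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromType("german"));   // present: a language analyzer
        System.out.println(fromType("language")); // empty: invalid type value
    }
}
```

The design point this illustrates is the one brentam raises: with per-language type strings there is no separate "language" field to round-trip, so serialization and deserialization can both key off "type" alone.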
