
Failure to decompose "Taschenhersteller" #23

Open
blured75 opened this issue Mar 9, 2016 · 5 comments

blured75 commented Mar 9, 2016

Hi,

First of all, thanks for your plugin, which saves having to use the obscure compound word token filter with hyphenation_decompounder (https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-compound-word-tokenfilter.html).

That said, I cannot decompose "Taschenhersteller", a German word which should be split into two words: Taschen and Hersteller.
Having installed your plugin, I created the following (possibly erroneous) mapping:

curl -XPOST localhost:9200/my_index
{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "tokenizer": {
        "decomp": {
          "type": "standard",
          "filter": [
            "decomp"
          ]
        }
      },
      "analyzer": {
        "my_anal": {
          "type": "custom",
          "tokenizer": "decomp"
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}

When trying to analyze the text "Taschenhersteller":

curl -XPOST localhost:9200/my_index/_analyze
{
    "analyzer": "my_anal",
    "text": "Taschenhersteller"
}

It gives me:

{
    "tokens": [
        {
            "token": "Taschenhersteller",
            "start_offset": 0,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 0
        }
    ]
}

I don't understand what I'm doing wrong.

Could you help me please? :)


mablae commented Mar 9, 2016

Does it work for other terms like Ölpumpe? Motoröl?

Also, you need to add the default token filters like lowercase to your custom analyzer before the decompound token filter.

Those were my test cases back in the day. I also noticed that some terms are not split by the plugin at all.

Maybe we could improve the plugin together, @jprante?

My analyzer looks like this:

"svb_decompoundAnalyzer": {
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "svb_decompound",
    "unique"
  ]
}

And filter:

"svb_decompound": {
  "type": "decompound"
}
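Putting the analyzer and filter together, a complete index settings body might look like this (a sketch: the svb_ names are from the snippets above, and the exact index-creation syntax may vary by Elasticsearch version):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "svb_decompound": {
          "type": "decompound"
        }
      },
      "analyzer": {
        "svb_decompoundAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "svb_decompound",
            "unique"
          ]
        }
      }
    }
  }
}
```

The ordering matters: lowercase runs before the decompound filter so the dictionary lookup sees normalized input, and unique runs last to drop duplicate tokens that decompounding can produce.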

jprante (Owner) commented Mar 9, 2016

The current implementation can be extended with custom compound words; for example code, see https://github.com/jprante/elasticsearch-analysis-decompound/blob/master/src/test/java/org/xbib/decompound/TrainerTests.java

A possible input for German is the morphy lexicon morphy-mapping-20110717.latin1.gz

blured75 (Author) commented

Great!! It works really well :)

My mapping was indeed erroneous. Here is the corrected version, which works:

{
  "index": {
    "analysis": {
      "filter": {
        "decomp": {
          "type": "decompound"
        }
      },
      "analyzer": {
        "my_anal": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decomp",
            "unique"
          ]
        }
      }
    },
    "mappings": {
      "type1": {
        "properties": {
          "field1": {
            "type": "string",
            "analyzer": "my_anal"
          }
        }
      }
    }
  }
}
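With the corrected analyzer in place, re-running the analyze request should show the compound being split (the endpoint and output below are illustrative; the exact token list depends on the plugin version):

```json
curl -XPOST localhost:9200/my_index/_analyze
{
    "analyzer": "my_anal",
    "text": "Taschenhersteller"
}
```

The decompound filter is expected to emit the subwords alongside the original token, so after lowercasing the token list should now contain "taschenhersteller", "taschen", and "hersteller".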

blured75 (Author) commented

Is it possible for you to backport this to Elasticsearch 2.0? It would be wunderbar :)

Best regards,
Blured.


mablae commented Mar 12, 2016

@jprante Thanks for the pointer, I will read up on it.
