Lemma of masculine nouns inconsistent in Norwegian #6549
Replies: 5 comments 2 replies
-
After some research I think I understand why this issue comes up. The affected nouns get their lemma from the lemma_rules. I found the lemma_rules for Norwegian, and the rule for nouns is based only on the suffix, not on the gender. This is the issue: very often both feminine and masculine inflected nouns end in -en, so the rule replaces it with -e. The gender of the word is therefore important when deriving the lemma by rules. Would you consider rules that take into account both the suffix and other information (gender/number/definiteness)? That would make the rules a lot more precise.
Norwegian also forms compounds by joining nouns into a single word, and the inflection follows the last noun. I therefore have another suggestion: scan the word from the end and check whether it splits into two existing nouns. In my example, skatten -> skatt, but formueskatten -> formueskatte. Scanning from the end would match the word "skatt" inside "formueskatten", and the lemma could then follow that of the more common base "skatt".
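A minimal sketch of the compound idea, assuming a small hypothetical table of known noun forms (KNOWN_FORMS is invented for illustration and is not spaCy's lookup table). It only checks that the final part of the word is a known noun; it does not verify that the first part is a noun too, since linking letters between the parts would complicate that check.

```python
# Hypothetical mini-lexicon: inflected form -> lemma (illustrative only).
KNOWN_FORMS = {
    "skatt": "skatt",
    "skatten": "skatt",
    "bilen": "bil",
    "biler": "bil",
}


def lemmatize_compound(word, known_forms=KNOWN_FORMS, min_modifier=2, min_head=3):
    """Lemmatize a compound by reusing the lemma of the longest known final noun."""
    word = word.lower()
    if word in known_forms:
        return known_forms[word]
    # Move the split point left to right, so the longest candidate head is tried first.
    for split in range(min_modifier, len(word) - min_head + 1):
        modifier, head = word[:split], word[split:]
        if head in known_forms:
            return modifier + known_forms[head]
    return None  # no known head found; a caller could fall back to suffix rules


print(lemmatize_compound("formueskatten"))
```

With this table, "formueskatten" is split into "formue" + "skatten" and inherits the lemma of "skatten".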
-
Yeah, the current built-in rule-based lemmatizer doesn't work very well for languages where gender or other features play a role like this. You would really need a custom lemmatizer that has rules based on gender and probably a lot of additional lexical information to handle irregular forms and exceptions. I'm more familiar with German, which is probably somewhat similar, and even relying on the morphological analysis is tricky because gender is not marked in the plural, so without good lexical resources, it's very easy to get the analysis wrong.
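As a rough illustration of gender-based rules plus an exception list, here is a sketch in plain Python. The rule table, the exception table and the function name are all hypothetical, the gender labels merely follow the Universal Dependencies style ("Masc", "Fem"), and the rules are far too simple for real Norwegian (many feminine nouns do not end in -e, for instance).

```python
# Hypothetical rules keyed by gender; each rule is (suffix, replacement).
# Longer suffixes are listed first so "-ene" wins over "-en".
NOUN_RULES = {
    "Masc": [("ene", ""), ("en", ""), ("er", "")],
    "Fem": [("ene", "e"), ("en", "e"), ("er", "e")],
}

# Hypothetical exception list: masculine nouns whose lemma itself ends in -e.
EXCEPTIONS = {
    ("bakken", "Masc"): "bakke",
    ("hagen", "Masc"): "hage",
}


def lemmatize_noun(form, gender):
    """Return a lemma for a noun form, given its (already predicted) gender."""
    form = form.lower()
    if (form, gender) in EXCEPTIONS:
        return EXCEPTIONS[(form, gender)]
    for suffix, replacement in NOUN_RULES.get(gender, []):
        # Require a stem of at least two letters before stripping a suffix.
        if form.endswith(suffix) and len(form) > len(suffix) + 1:
            return form[: -len(suffix)] + replacement
    return form  # no rule applied


print(lemmatize_noun("biler", "Masc"), lemmatize_noun("kvinner", "Fem"))
```

The sketch assumes the gender is already known and correct, which is exactly the part that is hard without good lexical resources.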
-
Understandable. A lookup-based lemmatizer would probably be better/more precise, but I have not yet found a way to replace the existing solution. Could you point me to where/how I can replace the existing lemmatizer with my own solution?
-
There is a lookup lemmatizer table for Norwegian, but the quality may not be great. The easiest way to try it out in a pipeline is to remove the The configuration is much easier in v3 because it's a separate pipeline component. Again, look for languages with a
-
Great, thank you! I will look into it. Just so I understand: in both v2 and v3, is lookup preferred, with rules as a fallback when the word does not exist in the table?
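To make the question concrete, this is the strategy I have in mind, sketched in plain Python. Whether spaCy actually combines lookup and rules this way is not something this sketch claims; the table, the rules and the function name are hypothetical.

```python
# Hypothetical lookup table for forms the suffix rules would get wrong.
LOOKUP = {"kvinner": "kvinne", "kvinnen": "kvinne"}

# Hypothetical suffix-stripping rules, longest suffix first.
SUFFIX_RULES = [("ene", ""), ("en", ""), ("er", "")]


def lemmatize(form):
    """Return (lemma, source): try the lookup table first, then the rules."""
    form = form.lower()
    if form in LOOKUP:
        return LOOKUP[form], "lookup"
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least a two-letter stem when stripping a suffix.
        if form.endswith(suffix) and len(form) > len(suffix) + 1:
            return form[: -len(suffix)] + replacement, "rule"
    return form, "unchanged"


for word in ("kvinner", "biler", "hus"):
    print(word, lemmatize(word))
```

Returning the source alongside the lemma makes it easy to see which words fell through to the rules.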
-
Lemmatization of nouns sometimes produces lemmas that do not exist in the Norwegian language. The typical way to get the lemma of a masculine noun is to remove the suffix "-en", "-er" or "-ene". In the example below, the returned lemma instead ends in "-e". That might be correct for feminine nouns ("kvinner" -> "kvinne"), but not for masculine ones ("biler" -> "bil"). Some masculine nouns do end in -e in the singular, but often they do not.
As far as I have seen, the issue is with masculine nouns. As an example I have included the nouns
"bilen" (def, sing, masc) and "biler" (ind, plur, masc), which both get the lemma "bile". It should be "bil", following the ordinary inflection of masculine nouns.
Also, "skatten" gives "skatt", which is correct, but "formueskatten" gives "formueskatte". Again, the problem is the extra "e".
How to reproduce the behaviour
Output
Your Environment