Add further tests for number expansion: ordinals #6

Open
aStereoID opened this issue Mar 16, 2021 · 12 comments
Labels
enhancement New feature or request


@aStereoID
Collaborator

@JMK-CB @aStereoID Thanks very much for your help!

I think we need to follow up on this to get more flexibility, especially regarding real numbers (in addition to natural numbers), ordinals (in addition to cardinals), and special numbers such as years.
We should do that in a new issue.

Originally posted by @psibre in https://github.com/psibre/marytts-lang-hsb/issues/2#issuecomment-792598040

@aStereoID
Collaborator Author

Since ordinals are already included in formatRules.txt, this might be the easiest place to start:

@JMK-CB : could you please check them again?
@psibre : I guess additions have to be made in the Preprocess module? And another test set?

What should that look like?
The default would be the (nominative) masculine form, like

  1. = prěni
  2. = druhi

For feminine and neuter forms, you would have to consider the last letter of the following noun?

  1. šula = prěnja šula
  2. słowo = druhe słowo

@aStereoID
Collaborator Author

What about the other cases? So far we have only nominative..

@aStereoID aStereoID added the enhancement New feature or request label Mar 16, 2021
@psibre
Member

psibre commented Mar 17, 2021

Thanks for opening this and the info!

I'm writing integration tests that should make it easier to specify the desired behavior. Unfortunately, the hard-wired input/output module chain in MaryTTS is this (extracted from DEBUG logs):

JTokeniser converts RAWMARYXML into TOKENS
Preprocess converts TOKENS into WORDS
OpenNLPPosTagger converts WORDS into PARTSOFSPEECH
JPhonemiser converts PARTSOFSPEECH into PHONEMES

There doesn't seem to be an intuitive way of handling number expansion at the TOKENS stage if it requires morphosyntactic analysis, unless we overload the Preprocess module with all kinds of magic... which would also require all manner of NLP resources which I doubt exist for Sorbian. And I fear it would lead to feature creep.

However, we can definitely move forward with simple things, and then reassess.

@psibre
Member

psibre commented Mar 17, 2021

Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?

@JMK-CB
Collaborator

JMK-CB commented Mar 17, 2021

I have compiled a list of test sentences for ordinal numbers combined with the different cases. It is probably not realistic to solve all of those specific cases, but we could try to tackle some of them (some don't occur very often anyway).

I am also compiling a similar list for special cases with cardinal numbers.
Testsätze für Ordinalia.zip

@JMK-CB
Collaborator

JMK-CB commented Mar 17, 2021

> Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?

"Comma" is "koma" in both Sorbian languages.

As far as I can assess this, I'd say real numbers should not be a problem because Sorbian simply counts the numbers one by one without any modification by cases or similar. So your example should always result in "jedyn koma jedyn štyri jedyn pjeć dźewjeć".

However, Astrid has directed my attention to fractions. Those won't be quite as easy to handle, but I will have a look at them and compile a list of testable cases.
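
To make the digit-by-digit reading concrete, here is a minimal Java sketch. The class and method names (`DecimalSpellout`, `spellDecimal`) are hypothetical and not part of MaryTTS. The digit table deliberately contains only the words attested in this thread, so any other digit raises an error instead of guessing a form, and a multi-digit integer part would presumably go through the cardinal rule rather than this loop.

```java
import java.util.Map;
import java.util.StringJoiner;

class DecimalSpellout {
    // Only the digit words that appear in this thread; the remaining
    // digits are intentionally left out rather than guessed.
    static final Map<Character, String> DIGITS = Map.of(
            '1', "jedyn",
            '4', "štyri",
            '5', "pjeć",
            '9', "dźewjeć");

    // Reads a decimal token such as "1,14159" one character at a time,
    // replacing the decimal comma with "koma".
    static String spellDecimal(String token) {
        StringJoiner out = new StringJoiner(" ");
        for (char c : token.toCharArray()) {
            if (c == ',') {
                out.add("koma");
                continue;
            }
            String word = DIGITS.get(c);
            if (word == null) {
                throw new IllegalArgumentException("no word known for '" + c + "'");
            }
            out.add(word);
        }
        return out.toString();
    }
}
```

With this table, `spellDecimal("1,14159")` yields the reading given above.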

@JMK-CB
Collaborator

JMK-CB commented Mar 17, 2021

> @JMK-CB : could you please check them again?

Those are correct.

@aStereoID
Collaborator Author

> Incidentally, @JMK-CB how should real numbers like 1,14159 be spelled out? What's the word for "comma", or is it a period instead?
>
> "Comma" is "koma" in both Sorbian languages.
>
> As far as I can assess this, I'd say real numbers should not be a problem because Sorbian simply counts the numbers one by one without any modification by cases or similar. So your example should always result in "jedyn koma jedyn štyri jedyn pjeć dźewjeć".

@psibre: So decimals could be one of the "simple things"?

@aStereoID
Collaborator Author

Now I'm unsure about the cases, Jan's list scares me ;-) And it doesn't only concern ordinals..
Maybe it's better to set a default variant (@JMK-CB: usually nominative masculine?) and otherwise recommend spelling the number out? After all, "Prěnja žona w swětnišću.." ("The first woman in space..") is clearer than "1. žona w swětnišću.." ("The 1st woman in space..").

Shall we discuss this tomorrow on Zoom?

@psibre
Member

psibre commented Mar 19, 2021

Thanks for the details and feedback!

Regarding the real numbers, that's something I expect to easily solve later today.

The list of sentences with ordinal numbers is a great resource, and we can use it to investigate how to support those linguistic cases (pun intended).

@JMK-CB
Collaborator

JMK-CB commented Mar 19, 2021

I agree with Astrid: the default should be nominative masculine, because that's probably what people would expect as a technical default. Other forms could be perceived as errors.

However, I think it would be great to at least be able to recognize the grammatical gender the numbers refer to. I think a wrong gender could be more confusing (because of other nouns in the context) than expecting a case ending but getting nominative.

@aStereoID
Collaborator Author

I hope we can get some momentum back into this topic :-)

At the moment, number expansion is only done for cardinals (nominative masculine).
To include the pronunciation of years (#5) and at least the nominative default version of the ordinals, perhaps we could proceed similarly to Lb? They also use the Rule Based Number Format with:

String formatRules
final String cardinalRule
final String ordinalRule
final String ordinalFemaleRule
final String ordinalNeutrumRule
final String yearRule

where

cardinalRule = "%spellout-numbering"
ordinalRule = "%spellout-ordinal-maskulinum"
ordinalFemaleRule = "%spellout-ordinal-femininum"
ordinalNeutrumRule = "%spellout-ordinal-neutrum"
yearRule = "%spellout-numbering-year"

Possible regexes could be (I tried Java notation):

// year
pattern = "(?<=\D)(1[1-9]\d\d)(?=\D)"
// ordinal female (nominative): in most cases the following word ends with -a
pattern = "(\d+\.)(?=\s\b\w+?a\b)"
// ordinal neutrum (nominative): in most cases the following word ends with -o or -e
pattern = "(\d+\.)(?=\s\b\w+?[oe]\b)"

All other cases of "\d+\." would be the else variant and should be expanded according to the ordinalRule (= %spellout-ordinal-maskulinum).
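
As a hedged illustration of the year pattern, the sketch below wraps it in a small Java helper. The names `YearFinder` and `findYears` are made up for this example. One deliberate change from the pattern above: `(?<=\D)` and `(?=\D)` each require an actual non-digit character, so a year at the very start or end of the input would not match. The negative lookarounds `(?<!\d)` and `(?!\d)` used here only require that no digit is adjacent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearFinder {
    // Four-digit numbers from 1100 to 1999 that are not part of a longer
    // digit sequence. Backslashes are doubled for the Java string literal.
    static final Pattern YEAR = Pattern.compile("(?<!\\d)(1[1-9]\\d\\d)(?!\\d)");

    // Returns every year-like number found in the text, in order.
    static List<String> findYears(String text) {
        List<String> years = new ArrayList<>();
        Matcher m = YEAR.matcher(text);
        while (m.find()) {
            years.add(m.group(1));
        }
        return years;
    }
}
```

Note that the range in the pattern stops at 1999, so more recent years would still fall through to the cardinal rule.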

@JMK-CB : Nominative plural seems to behave like the neuter? Maybe only add another ordinalPluralRule for when the following word ends in -i?

I'm not sure how and where exactly to implement these ifs and elses in the Preprocess file, so @psibre maybe you can help?
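
One possible shape for that if/else chain, as a rough sketch only: the class `OrdinalRuleChooser` and its method `chooseRule` are hypothetical and do not exist in MaryTTS. The method only picks one of the rule-set names listed above from the last letter of the following word and performs no expansion itself. It compiles the patterns with `Pattern.UNICODE_CHARACTER_CLASS`, because by default Java's `\w` matches only ASCII letters, digits and underscore, and would therefore not match letters such as š or ł.

```java
import java.util.regex.Pattern;

class OrdinalRuleChooser {
    // Make \w and \b Unicode-aware so that words like "šula" are matched.
    static final int FLAGS = Pattern.UNICODE_CHARACTER_CLASS;

    // "<digits>." followed by a word ending in -a (heuristic for feminine).
    static final Pattern FEMININE = Pattern.compile("^\\d+\\.\\s+\\w+a\\b", FLAGS);
    // "<digits>." followed by a word ending in -o or -e (heuristic for neuter).
    static final Pattern NEUTER = Pattern.compile("^\\d+\\.\\s+\\w+[oe]\\b", FLAGS);
    // Any other "<digits>." falls back to the masculine default.
    static final Pattern ORDINAL = Pattern.compile("^\\d+\\.", FLAGS);

    // Returns the rule-set name to use for the number at the start of text.
    static String chooseRule(String text) {
        if (FEMININE.matcher(text).find()) {
            return "%spellout-ordinal-femininum";
        }
        if (NEUTER.matcher(text).find()) {
            return "%spellout-ordinal-neutrum";
        }
        if (ORDINAL.matcher(text).find()) {
            return "%spellout-ordinal-maskulinum";
        }
        return "%spellout-numbering";
    }
}
```

Under these assumptions, "1. šula" selects the feminine rule set, "2. słowo" the neuter one, and a bare "1." the masculine default. The last-letter heuristic will misfire on nouns whose ending does not reflect their gender, and it ignores case entirely, so it only covers the nominative default discussed above.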

@aStereoID aStereoID mentioned this issue Aug 29, 2024
Merged