Attempt at establishing distinction between -dict-gt-norm- and -gt-norm fails #4

Open
rueter opened this issue Feb 22, 2024 · 11 comments

@rueter
Member

rueter commented Feb 22, 2024

Four example words have been selected to provide the *e vs *ä distinction found in the manuscript of the monolingual Erzya dictionary by Kuzʹma Abramov.
In the lexc file we have:

пей+N:пӓй
сэдь+N:сӓдь
седей+N:сьӓдей
эрзя+N:ӓрзя

‹ӓ› has been declared in twolc

The filter ‹remove-diaereses-enhancement.regex› looks like this:

[[ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ]   
.o.
ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _ 
.o.
ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц ] _	
.o.
ӓ -> э || [ .#. | %- ] _ ;

So, there are a number of things going on in one place.
Line 1 removes an underlying soft sign preceding underlying ӓ and simultaneously replaces the underlying ӓ with е. (failure)
Line 2 replaces underlying ‹ӓ› with ‹е› following the consonants listed in the rule. (partial success)
Line 3 replaces underlying ‹ӓ› with ‹э› following specific consonants. (partial success)
Line 4 replaces underlying ‹ӓ› with ‹э› word-initially. (partial success)
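
As a rough illustration of what the four composed rules are meant to do, here is a minimal Python sketch that applies them in order to plain strings. It is only a model: it is not how HFST compiles or applies the filter, it ignores flag diacritics and the uppercase ‹Ӓ› context, and the function name `remove_diaeresis` is made up for this example. The consonant classes are copied from the regex above.

```python
import re
import unicodedata

A_DIAER = "\u04d3"  # Cyrillic a with diaeresis as one precomposed code point

# Left-context classes copied from lines 2 and 3 of the filter.
LINE2_CONSONANTS = "вВбБгГжЖкКмМпПфФхХчЧшШщЩ"
LINE3_CONSONANTS = "дДзЗлЛнНрРсСтТцЦ"


def remove_diaeresis(form: str) -> str:
    """Plain-string approximation of the four rules, applied top to bottom."""
    # Normalise so that a + combining diaeresis is treated like the single letter.
    form = unicodedata.normalize("NFC", form)
    # Line 1: soft sign is dropped and the following vowel becomes е in one step.
    form = re.sub(rf"[ьЬ]{A_DIAER}", "е", form)
    # Line 2: the vowel becomes е after the first consonant class.
    form = re.sub(rf"(?<=[{LINE2_CONSONANTS}]){A_DIAER}", "е", form)
    # Line 3: the vowel becomes э after the second consonant class.
    form = re.sub(rf"(?<=[{LINE3_CONSONANTS}]){A_DIAER}", "э", form)
    # Line 4: the vowel becomes э word-initially or right after a hyphen.
    form = re.sub(rf"(?:^|(?<=-)){A_DIAER}", "э", form)
    return form


# The four underlying forms from the lexc entries above.
for underlying in (f"п{A_DIAER}й", f"с{A_DIAER}дь", f"сь{A_DIAER}дей", f"{A_DIAER}рзя"):
    print(underlying, "->", remove_diaeresis(underlying))
```

On plain strings this model yields the four desired surface forms, which suggests the intent of the rules is sound and that the failures come from how the filter meets the real transducer (for example, symbols that are invisible in the surface string).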

The script remove-diaereses-enhancement.hfst is called in
lang-myv/src/fst/Makefile.am and lang-myv/src/fst/filters/Makefile.am

The desired result for the four words given above would be:
Analysis:

lang-myv jackrueter$ hfst-lookup src/fst/analyser-gt-norm.hfstol 
> пей
пей	пей+N+Sg+Nom+Indef	0,000000

> сэдь
сэдь	сэдь+N+Sg+Nom+Indef	0,000000

> седей
седей	седей+N+Sg+Nom+Indef	0,000000

> эрзя
эрзя	эрзя+N+Sg+Nom+Indef	0,000000

Dict-Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-dict-gt-norm.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пӓй	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сӓдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	сьӓдей	0,000000

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000

Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-gt-norm.hfstol
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пей	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сэдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей	0,000000

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	эрзя	0,000000

Instead, I get:
Analysis:

lang-myv jackrueter$ hfst-lookup src/fst/analyser-gt-norm.hfstol 
> пей
пей	пей+?	inf

> сэдь
сэдь	сэдь+?	inf

> седей
седей	седей+?	inf

> эрзя
эрзя	эрзя+?	inf

Dict-Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-dict-gt-norm.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пӓй	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сӓдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000

Generation:

lang-myv jackrueter$ hfst-lookup src/fst/generator-gt-norm.hfstol 
> пей+N+Sg+Nom+Indef
пей+N+Sg+Nom+Indef	пей	0,000000

> сэдь+N+Sg+Nom+Indef
сэдь+N+Sg+Nom+Indef	сэдь	0,000000

> седей+N+Sg+Nom+Indef
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

> эрзя+N+Sg+Nom+Indef
эрзя+N+Sg+Nom+Indef	ӓрзя	0,000000
@flammie
Contributor

flammie commented Feb 22, 2024

I tried to debug the xerox script like this:

$  hfst-xfst
hfst[0]: read regex @"src/fst/filters/remove-diaereses-enhancement.hfst"
hfst[1]: apply down
apply down> эрзя
эрзя
apply down> ӓрзя
эрзя
apply down> 

It seems it should work. However, there is a flag diacritic in the lexicon between .#. and э, which may be the issue; or I may simply not be very good with the xfst scripting debugger.
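
To make the flag-diacritic suspicion concrete, here is a small Python sketch that treats a word as a list of symbols, the way a transducer path would see it. Everything in it is an assumption for illustration: the flag name `@P.EXAMPLE.ON@` is invented, the function `word_initial_rule` is not an HFST API, and only the word-boundary part of the context is modelled (the hyphen alternative is left out).

```python
A_DIAER = "\u04d3"        # Cyrillic a with diaeresis as one precomposed code point
FLAG = "@P.EXAMPLE.ON@"   # hypothetical flag diacritic, invented for this example


def word_initial_rule(symbols, allow_one_symbol=False):
    """Rewrite the diaeresis vowel as э in word-initial position only.

    With allow_one_symbol=True, one arbitrary symbol may stand between
    the word boundary and the vowel.
    """
    out = list(symbols)
    for i, sym in enumerate(symbols):
        if sym != A_DIAER:
            continue
        directly_initial = i == 0
        after_one_symbol = allow_one_symbol and i == 1
        if directly_initial or after_one_symbol:
            out[i] = "э"
    return out


plain = [A_DIAER, "р", "з", "я"]   # what gets typed at the apply-down prompt
flagged = [FLAG] + plain           # the same word with a flag symbol in front

print(word_initial_rule(plain))                            # vowel is rewritten
print(word_initial_rule(flagged))                          # vowel is left alone
print(word_initial_rule(flagged, allow_one_symbol=True))   # vowel is rewritten
```

In this model the strict boundary context succeeds on the typed string but not once a flag symbol sits in front of the vowel, which would be consistent with the rule passing the `apply down` test and still failing in the generator. Allowing one optional symbol after the boundary is one possible way to tolerate the flag, at the cost of also matching any other single symbol in that position.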

@rueter
Member Author

rueter commented Feb 22, 2024

Replace rule: ӓ -> э || [ .#. | %- ] _ ;
with

ӓ -> э || \[ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ | д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц ] _

which implicitly allows for flags.
Not a good idea, but it works.
The complex one remains

седей+N+Sg+Nom+Indef

Something like kaNpat >> kammat
сьӓдей >> седей
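
The soft-sign case is awkward because each of the two changes is the context for the other. The Python sketch below is a plain-string model with made-up function names, not HFST code. It contrasts the parallel formulation, where both contexts are checked on the input, with the two possible step-by-step orderings.

```python
import re

A_DIAER = "\u04d3"  # Cyrillic a with diaeresis as one precomposed code point


def parallel(form: str) -> str:
    """Both changes at once: soft sign + vowel is rewritten as е in one step."""
    return re.sub(rf"[ьЬ]{A_DIAER}", "е", form)


def delete_then_replace(form: str) -> str:
    """Delete the soft sign first, then try to rewrite the vowel after a soft sign."""
    form = re.sub(rf"[ьЬ](?={A_DIAER})", "", form)   # soft sign removed before the vowel
    form = re.sub(rf"(?<=[ьЬ]){A_DIAER}", "е", form)  # context is gone, nothing matches
    return form


def replace_then_delete(form: str) -> str:
    """Rewrite the vowel first, then try to delete the soft sign before the vowel."""
    form = re.sub(rf"(?<=[ьЬ]){A_DIAER}", "е", form)  # vowel rewritten after the soft sign
    form = re.sub(rf"[ьЬ](?={A_DIAER})", "", form)    # context is gone, nothing matches
    return form


underlying = f"сь{A_DIAER}дей"
print(parallel(underlying))
print(delete_then_replace(underlying))
print(replace_then_delete(underlying))
```

Only the parallel version gives седей. Deleting first leaves the vowel unchanged after с, where a later alveolar rule would presumably turn it into э; replacing first leaves the soft sign in place. That is the reason the two changes are written as one parallel rule with `,,` rather than as two composed steps.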

@snomos
Copy link
Member

snomos commented Apr 24, 2024

The present code does not work because there is a contradiction in it. What you have is basically this:

ӓ -> э || [ д | Д | … ] _  
.o.
ӓ -> э || \[ … | д | Д | … ] _

I.e., you can't tell it to make one and the same change both in the context of д | Д and NOT in the context of д | Д. What do you really want?

@rueter
Member Author

rueter commented Apr 24, 2024

What I would like is:

ӓ -> э || [ д | Д | … ] _  
.o.
ӓ -> е || \[ … | д | Д | … ] _

When ‹ӓ› is word-initial or follows an alveolar, it should become ‹э›. Following a non-alveolar or a soft sign ‹ь›, it should turn into ‹е›, and in the soft-sign case the ‹ь› should also be deleted (ь -> 0).
Does this mean that there should be a separate file for removing the soft sign?

@snomos
Member

snomos commented Apr 24, 2024

Don't know, but at least you should fix the regex to say what you want: ӓ -> е in the second case (now it says ӓ -> э) 🙂

@rueter
Member Author

rueter commented Apr 24, 2024

It now says:

ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _
.o.
ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц | .#. | %- ] _  
.o.
[[ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ] ;

But it still does not work.

@snomos
Member

snomos commented Apr 25, 2024

I have reordered the steps as follows:

[ [ Ь | ь ] -> 0 ||  _ [ ӓ | Ӓ ] ,, ӓ -> е || [ ь | Ь ]  _ ]
.o.
[ ӓ -> е || [ в | В | б | Б | г | Г | ж | Ж | к | К | м | М | п | П | ф | Ф | х | Х | ч | Ч | ш | Ш | щ | Щ ] _ ]
.o.
[ ӓ -> э || [ д | Д | з | З | л | Л | н | Н | р | Р | с | С | т | Т | ц | Ц | .#. ( ? ) | %- ( ? ) ] _ ] ;

and after this change it works in three out of four cases:

echo пей+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
пей+N+Sg+Nom+Indef	пӓй	0.000000

echo сэдь+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
сэдь+N+Sg+Nom+Indef	сӓдь	0.000000

echo седей+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol
седей+N+Sg+Nom+Indef	седей+N+Sg+Nom+Indef+?	inf

echo эрзя+N+Sg+Nom+Indef | hfst-lookup -q src/fst/generator-dict-gt-norm.hfstol 
эрзя+N+Sg+Nom+Indef	ӓрзя	0.000000

Only the седей case is not working.

@rueter
Member Author

rueter commented Apr 25, 2024

I wanted to try a different ordering as well, but got this:

Making all in .
make[3]: Entering directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
mkdir -p `dirname .generated/.stamp`
make[3]: *** No rule to make target 'filters/remove-diaereses-enhancement.%', needed by '.generated/analyser-pmatchdisamb-gt-desc.hfst'.  Stop.
make[3]: *** Waiting for unfinished jobs....
touch .generated/.stamp
make[3]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[2]: *** [Makefile:1257: all-recursive] Error 1
make[2]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[1]: *** [Makefile:450: all-recursive] Error 1
make[1]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src'
make: *** [Makefile:554: all-recursive] Error 1

@snomos
Member

snomos commented Apr 25, 2024

Sorry about that, fixed now.

@rueter
Member Author

rueter commented Apr 25, 2024

Now it dies in a different place:

Reading and minimizing rule ????v...
Reading and minimizing rule 74...
Reading lexicon... minimize(determinize(reverse(lexc(lexicon.lexc)))) read
Computing intersecting composition...
Storing result in <stdout>...
Minimizing reverse(compose(minimize(determinize(reverse(lexc(lexicon.lexc)))), intersect(morphology/.generated/phonology.rev.hfst)))...
make[3]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[2]: *** [Makefile:1257: all-recursive] Error 1
make[2]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src/fst'
make[1]: *** [Makefile:450: all-recursive] Error 1
make[1]: Leaving directory '/Users/jackrueter/Dropbox/Github/giellalt/lang-myv/src'
make: *** [Makefile:554: all-recursive] Error 1

But if I run make distclean, it dies for lack of a rule to generate the .hfst, as shown above.

@snomos
Member

snomos commented Apr 25, 2024

That seems completely unrelated, I have no idea. Wipe and reclone?
