MSWord moves word to upper line when correcting space error #50

Open
1 of 2 tasks
duomdaamaendra opened this issue Feb 8, 2022 · 9 comments
Labels
bug Something isn't working gramcheck Issues restricted to the grammar checker

Comments

@duomdaamaendra
Contributor

duomdaamaendra commented Feb 8, 2022

B. Moske (s.25) «Mun in jáme/mu luondu dušše rievdá»
Paltto (s.37) «mánát sturrot/mun ieš boarásmuvan» ??
B.Moske (s.39) «Nu jođánit moai rávásmuvaime» … (s.47) /Rumaš goldná dađistaga»

This happens when correcting "B.Moske" to "B. Moske":

[Image: screenshot, 2022-02-08 at 19:04:49]

The problem occurs because CR(LF) is not escaped in the various tools:

@duomdaamaendra
Contributor Author

This does not happen in Google Docs.

@snomos snomos changed the title MSWord moves word to upper line when correction space error MSWord moves word to upper line when correcting space error Feb 8, 2022
@snomos snomos added the gramcheck Issues restricted to the grammar checker label Mar 16, 2022
@lynnda-hill
Contributor

When fixing ?? to ? ?, a new suggestion appears: ?B. can be fixed to ? B. However, there is a newline after the ?, which the program seems to ignore.

@snomos
Member

snomos commented Nov 16, 2023

It seems the problem is that we haven't considered CARRIAGE RETURN, U+000D (\r), in our processing. I assume it should be added to our whitespace analyser.
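
To illustrate why a missing CR matters, here is a minimal Python sketch. The two patterns are hypothetical and are not the ones used in our tools; they only show how a blank pattern without \r leaves the CR attached to the preceding token.

```python
import re

# Hypothetical blank patterns for illustration only; the real
# whitespace analyser may define blanks differently.
BLANK_NO_CR = re.compile(r"[ \t\n]+")
BLANK_WITH_CR = re.compile(r"[ \t\n\r]+")


def split_blanks(text, blank):
    """Split text into (is_blank, chunk) pairs using the given blank pattern."""
    out = []
    pos = 0
    for m in blank.finditer(text):
        if m.start() > pos:
            out.append((False, text[pos:m.start()]))
        out.append((True, m.group()))
        pos = m.end()
    if pos < len(text):
        out.append((False, text[pos:]))
    return out


text = "??\r\nB.Moske"
print(split_blanks(text, BLANK_NO_CR))    # CR stays glued to "??"
print(split_blanks(text, BLANK_WITH_CR))  # CR and LF form one blank
```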

@snomos snomos added the bug Something isn't working label Nov 16, 2023
@snomos
Member

snomos commented Nov 16, 2023

Something very strange happens that looks like a bug. With the following minimal test text:

boarásmuvan» ??
B.Moske

(if the CR is lost, paste the text into a new MS Word document and copy it back from the Word file) I get the following in UnicodeChecker:

[Image: UnicodeChecker view of the test text]

CR (U+000D) is clearly located directly after the two question marks, and before the newline.
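
For those without UnicodeChecker, a similar listing can be produced with a few lines of Python. This is only a sketch; the sample string is an assumed reconstruction of the relevant part of the test text, with CR followed by LF.

```python
import unicodedata


def dump_codepoints(text):
    """Return one 'U+XXXX NAME' string per character, control chars included."""
    lines = []
    for ch in text:
        # Control characters such as CR and LF have no Unicode name,
        # so fall back to a generic label.
        name = unicodedata.name(ch, "<control>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines


# Assumed sample: two question marks, CR, LF, then the next line.
sample = "??\r\nB.Moske"
for line in dump_codepoints(sample):
    print(line)
```

To inspect the real file, read it in binary mode and decode it explicitly, e.g. `open("test.txt", "rb").read().decode("utf-8")`, so that Python's newline translation does not drop the CR.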

Now store the test text (with the CR char) in a test file, and run it through the grammar checker:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease.mod

The result is this:

"<boarásmuvan>"
	"boarásmuvvat" Err/Orth-a-á <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
punct-aistton-right
	"boarásmuvvat" v1 <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
punct-aistton-right
"<»>"
	"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> &punct-aistton-right &space-before-punct-mark &LINK ID:2 R:LEFT:1
punct-aistton-right
space-before-punct-mark
	"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> "boarásmuvan”"S &punct-aistton-right &SUGGESTWF ID:2 R:LEFT:1
punct-aistton-right
	"”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide> &LINK &space-before-punct-mark ID:2 R:LEFT:1
space-before-punct-mark
:
\n
: 
"<?>"
	"?" CLB <W:0.0> <SpaceBeforePunctMark>

"<?>"
	"?" CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark ID:5 R:RIGHT:7
no-space-after-punct-mark
	"?" CLB <W:0.0> <NoSpaceAfterPunctMark> "? B."S &no-space-after-punct-mark &SUGGESTWF ID:5 R:RIGHT:7
no-space-after-punct-mark

"<B.>"
	"B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark &LINK ID:7
no-space-after-punct-mark
	"Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark &LINK ID:7
no-space-after-punct-mark
"<Moske>"
	"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort> @HNOUN

Suddenly the CR char (and the newline) is placed before the two question marks.

That is, the character stream has been changed somewhere in the processing. That should not happen.
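
One way to check this at each stage is to rebuild the running text from the output and compare it with the input. The following Python sketch assumes that only wordform lines (`"<...>"`) and lines starting with a colon carry surface text, and that a newline in a blank is written as the two characters `\n`; the real stream format may escape more than that.

```python
def surface_from_cg(stream):
    """Rebuild the running text from wordform lines and ':' blank lines."""
    parts = []
    # Split on LF only, so a raw CR stays inside its blank line.
    for line in stream.split("\n"):
        if line.startswith('"<') and line.endswith('>"'):
            parts.append(line[2:-2])
        elif line.startswith(":"):
            # Assumption: an escaped newline is written as backslash + n.
            parts.append(line[1:].replace("\\n", "\n"))
        # Readings (tab-indented) and anything else carry no surface text.
    return "".join(parts)


original = "??\r\nB."

# Blank after the two question marks, as in the tokeniser output.
good = '"<?>"\n\t"?" CLB\n"<?>"\n\t"?" CLB\n:\r\\n\n"<B.>"\n'
# Blank moved in front of the question marks, as in the final output.
moved = ':\r\\n\n"<?>"\n\t"?" CLB\n"<?>"\n\t"?" CLB\n"<B.>"\n'

print(surface_from_cg(good) == original)   # stream preserved
print(surface_from_cg(moved) == original)  # stream changed
```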

@snomos
Member

snomos commented Nov 16, 2023

The tokeniser/analyser is fine:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease0-morph.mode
"<boarásmuvan>"
	"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0>
	"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0>
"<»>"
	"»" PUNCT RIGHT <W:0.0>
	"" PUNCT RIGHT Err/Orth <W:0.0>
: 
"<?>"
	"?" CLB <W:0.0>
"<?>"
	"?" CLB <W:0.0>
:
\n
"<B.>"
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
	"." CLB <W:0.0> "<.>"
		"b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
	"." CLB <W:0.0> "<.>"
		"b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
"<Moske>"
	"Moske" N Prop Sem/Plc Attr <W:0.0>
	"Moske" N Prop Sem/Plc Sg Nom <W:0.0>

@snomos
Member

snomos commented Nov 16, 2023

The first whitespace analyser moves the chars one place:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease1-blanktag.mode
"<boarásmuvan>"
	"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
	"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"<»>"
	"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
	"" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
: 
"<?>"
	"?" CLB <W:0.0>
:
\n
"<?>"
	"?" CLB <W:0.0>
"<B.>"
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
	"." CLB <W:0.0> "<.>"
		"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
	"." CLB <W:0.0> "<.>"
		"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
	"." CLB <W:0.0> "<.>"
		"b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
	"." CLB <W:0.0> "<.>"
		"b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
"<Moske>"
	"Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
	"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort>

@snomos
Member

snomos commented Nov 16, 2023

And then they are moved another time by the second whitespace analyser:

"<boarásmuvan>"
	"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
	"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"<»>"
	"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
	"" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
:
\n
: 
"<?>"
	"?" CLB <W:0.0> <NoSpaceAfterPunctMark> <SpaceBeforePunctMark>
"<?>"
	"?" CLB <W:0.0> <NoSpaceAfterPunctMark>
"<B.>"
	"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
	"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
	"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>

"<Moske>"
	"Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
	"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort>

So something is clearly wrong in the whitespace analysers.

@snomos
Member

snomos commented Nov 16, 2023

I tried fixing the regex to allow CR in d1bae3e, but that did not help. Could you have a look, @unhammer?

@unhammer
Contributor

unhammer commented Nov 18, 2023

:
\n

This is not fine. That should probably be

:\n

which would mean a newline occurred. There should be an initial colon before any lines with unanalysed data. Anything without an initial colon/tab/quote is ignored by divvun-suggest.
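
As a rough sketch of that rule (not the actual divvun-suggest code, and the function name is made up), lines can be classified by their first character. The last part assumes a consumer that treats a raw CR as a line break, which Python's `str.splitlines()` happens to do:

```python
def classify(line):
    """Classify one stream line by its first character (illustrative only)."""
    if line.startswith(":"):
        return "blank"
    if line.startswith("\t"):
        return "reading"
    if line.startswith('"'):
        return "wordform"
    return "ignored"


# The output as shown: a lone colon, then the escaped newline on its own line.
print([classify(l) for l in [":", "\\n"]])

# The expected form: colon and escaped newline on the same line.
print(classify(":\\n"))

# A raw CR after the colon splits the blank in two for any reader
# that treats CR as a line boundary.
print(":\r\\n".splitlines())
```

The orphaned `\n` line is classified as ignored, so the newline is lost to any later stage that applies this rule.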

This needs to be fixed in hfst-tokenise (hfst/hfst#575) and divvun-suggest (divvun/libdivvun#65).
