Tokenisation issues in LibreOffice #3

Open
snomos opened this issue Jan 5, 2023 · 7 comments

@snomos
Member

snomos commented Jan 5, 2023

There seem to be tokenisation issues for languages with Cyrillic letters; cf. the following bug report:

snomos added the bug label on Jan 5, 2023
@rueter

rueter commented Jan 6, 2023

There is an update to giellalt/lang-myv#3

I am using an M2 Mac running Ventura 13.0.1 with LibreOffice. The language is Erzya (myv):
Version: 7.3.4.2 / LibreOffice Community
Build ID: 728fec16bd5f605073805c3c9e7c4212a0120dc5
CPU threads: 8; OS: Mac OS X 13.0.1; UI render: default; VCL: osx
Locale: myv-RU (myv_FI.UTF-8); UI: en-US
Calc: threaded

There are problems when a full stop ‹.› or three full stops ‹...› directly follow a word.
The comma, question mark, exclamation mark, quotation marks, parentheses, semicolons and colons do NOT cause a problem.

Screenshot 2023-01-06 at 5 09 06
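
The symptom described above suggests that a trailing full stop is handed to the speller as part of the word. As a minimal sketch of the expected behaviour (not the divvunspell or LibreOffice implementation), the snippet below strips trailing punctuation from each whitespace-delimited token before the lookup. The function names, the `is_known` callback and the toy lexicon are all hypothetical, and the sample entries have not been checked against a real Erzya lexicon.

```python
import unicodedata


def split_trailing_punct(token):
    """Split a token into (word, trailing punctuation).

    Only trailing characters in a Unicode punctuation category (P*) are
    separated; leading punctuation is left attached in this sketch.
    """
    end = len(token)
    while end > 0 and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[:end], token[end:]


def check_text(text, is_known):
    """Return the words rejected by the (hypothetical) speller callback."""
    rejected = []
    for token in text.split():
        word, _ = split_trailing_punct(token)
        if word and not is_known(word):
            rejected.append(word)
    return rejected


# Toy lexicon standing in for a real speller (assumption for illustration).
toy_lexicon = {"кудо", "паро"}
print(check_text("кудо. паро... xyz.", lambda w: w in toy_lexicon))
```

With this kind of splitting, a word followed by ‹.› or ‹...› is looked up without the punctuation, so only genuinely unknown words are flagged.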

@rueter

rueter commented Jan 10, 2023

Meadow Mari (mhr) also has a problem with a full stop touching words.
They are recognized.
Screenshot 2023-01-10 at 16 24 51

@bbqsrc
Member

bbqsrc commented Jan 10, 2023

Yup, thanks for confirming further. Working on a fix. 😄

@rueter

rueter commented Jan 27, 2023

Skolt Sami (sms) has the same problem.
Screenshot 2023-01-27 at 18 39 55

@rueter

rueter commented Jan 27, 2023

THIS issue does not seem to affect lut (Lushootseed). I have drawn hairlines next to accepted words to which I have added full stops; the speller accepts them. (Lushootseed has other problems.)

Screenshot 2023-01-27 at 18 55 59

@Trondtr

Trondtr commented Feb 2, 2023

Just a reminder: this is actually a nasty bug (since almost all sentences end in a period), and it seems to happen for all languages. Here is my sme (North Sami) example. Note the three individual periods after "Juo" compared to the horizontal ellipsis following "Na" (which works):

image

I should have dropped the Easter egg... There is a red line under "buorre" followed by a dot there.
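
To make the distinction in this comment concrete: three typed periods and the horizontal ellipsis look alike on screen but are different characters, which may explain why a tokeniser treats them differently. The sketch below only inspects the strings with Python's standard `unicodedata` module; `describe_tail` is a hypothetical helper, and the sample strings are taken from the words quoted in the comment.

```python
import unicodedata


def describe_tail(token):
    """List (code point, Unicode name) for each trailing non-letter character."""
    tail = []
    for ch in reversed(token):
        if ch.isalpha():
            break
        tail.append((f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return list(reversed(tail))


# Three separate FULL STOP characters after the word.
print(describe_tail("Juo..."))
# A single HORIZONTAL ELLIPSIS character after the word.
print(describe_tail("Na\u2026"))
```

The first call reports three U+002E FULL STOP characters, the second a single U+2026 HORIZONTAL ELLIPSIS.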

bbqsrc changed the title from "Tokenisation issues" to "Tokenisation issues in LibreOffice" on Feb 2, 2023
bbqsrc transferred this issue from divvun/divvunspell on Feb 2, 2023
@snomos
Member Author

snomos commented Feb 10, 2023

@bbqsrc has looked briefly into this issue, and it seems to be buried deep in the LO code. There was a similar issue with the MS Office speller, and that was fixed. The assumption for now is thus that divvunspell is clean in this regard and that the issue is elsewhere, i.e. within the LO integration code or within LO itself. LO is a huge mess of code, mixing Python, Java and C++, so one should not be surprised that there are bugs when it comes to not-so-standard Unicode text handling 🙂


4 participants