Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

scop
Copy link

@scop scop commented Apr 4, 2020

Did not include these in auto detection, I have a hunch they'd cause false positives / unwanted results.

The test samples are "hello" translations from Google Translate.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.4%) to 75.837% when pulling 5b33c06 on scop:more-auto-encodings into a530f60 on farhadi:master.

@@ -26,7 +26,7 @@
"mocha": "2.x"
},
"dependencies": {
"iconv-lite": "0.x",
"iconv-lite": ">= 0.4 < 1",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit upper bound?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it was limited before this change too (to 0.x, effectively < 1). Probably good to keep it like that for compatibility reasons. Added lower bound only because it's required for ksc5601.

@juliangut
Copy link
Collaborator

Hi @scop thanks for the PR

I've reviewed encoding/decoding for Cyrillic and Hebrew and they look good (I cannot find a ksc5601 charmap to review but as long as it's using iconv...)

Would you mind review my code comment and discuss why you think matching will cause false positives / unwanted results? maybe a solution with a regex such as GSM?

@scop
Copy link
Author

scop commented Apr 5, 2020

Regarding the auto detection, as said, it's mostly a hunch, but let me elaborate a bit.

First, I guess e.g. cyrillic and hebrew have some overlaps which can make detection produce unwanted (if not false) results between them and also ISO-8859-1, and I have no idea whatsoever about KS C 5601.

Second, I think detection ending up using one of these encodings should in the end not matter much at all. What counts much more in my experience is whether the message can be eventually -- when sending it to its final destination -- encoded in GSM, or if fallback to UCS-2 is needed. These intermediate encodings aren't that interesting.

Third, introducing an uncommon encoding like these could trigger issues elsewhere -- messages that were previously detected as (the widely supported) UCS-2 may now start to be encoded as one of these for little gain, but possibility in non-support in other systems where the messages are relayed to.

Because of the above, amplified by the fact that I'm not personally interested in the detection mode at all currently and thus don't want to spend time around that can of worms ;), I decided to play it safe and exclude these from the detection. If someone wants to change that, it can be done later, but I suggest not delaying getting this in for that.

@scop
Copy link
Author

scop commented Apr 5, 2020

Re KS C 5601, based on info at https://en.wikipedia.org/wiki/KS_X_1001 I ended up generating the test sample's expected bytes encoding with

echo -n '여보세요' | iconv -f utf-8 -t euc-kr | hd

I'm not sure if euc-kr is always a good substitute for it, but for this sample it works and I guess that's as far as code here should be concerned, the rest can be left as an assumption that iconv-lite's ksc5601 does the desired thing.

If you want to play with it more, maybe this would be useful?
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT

@scop
Copy link
Author

scop commented Oct 4, 2020

Anything more I can do here?

@juliangut
Copy link
Collaborator

I'm going to hold this till we figure out encoding detection

@juliangut juliangut changed the title Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 [WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants