[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

scop · 2020-04-04T09:44:05Z

Did not include these in auto detection, I have a hunch they'd cause false positives / unwanted results.

The test samples are "hello" translations from Google Translate.

coveralls · 2020-04-04T09:46:55Z

Coverage increased (+0.4%) to 75.837% when pulling 5b33c06 on scop:more-auto-encodings into a530f60 on farhadi:master.

juliangut · 2020-04-04T23:37:50Z

package.json

@@ -26,7 +26,7 @@
    "mocha": "2.x"
  },
  "dependencies": {
-    "iconv-lite": "0.x",
+    "iconv-lite": ">= 0.4 < 1",


Why limit upper bound?

Because it was limited before this change too (to 0.x, effectively < 1). Probably good to keep it like that for compatibility reasons. Added lower bound only because it's required for ksc5601.

juliangut · 2020-04-04T23:43:58Z

Hi @scop thanks for the PR

I've reviewed encoding/decoding for Cyrillic and Hebrew and they look good (I cannot find a ksc5601 charmap to review but as long as it's using iconv...)

Would you mind review my code comment and discuss why you think matching will cause false positives / unwanted results? maybe a solution with a regex such as GSM?

scop · 2020-04-05T12:16:39Z

Regarding the auto detection, as said, it's mostly a hunch, but let me elaborate a bit.

First, I guess e.g. cyrillic and hebrew have some overlaps which can make detection produce unwanted (if not false) results between them and also ISO-8859-1, and I have no idea whatsoever about KS C 5601.

Second, I think detection ending up using one of these encodings should in the end not matter much at all. What counts much more in my experience is whether the message can be eventually -- when sending it to its final destination -- encoded in GSM, or if fallback to UCS-2 is needed. These intermediate encodings aren't that interesting.

Third, introducing an uncommon encoding like these could trigger issues elsewhere -- messages that were previously detected as (the widely supported) UCS-2 may now start to be encoded as one of these for little gain, but possibility in non-support in other systems where the messages are relayed to.

Because of the above, amplified by the fact that I'm not personally interested in the detection mode at all currently and thus don't want to spend time around that can of worms ;), I decided to play it safe and exclude these from the detection. If someone wants to change that, it can be done later, but I suggest not delaying getting this in for that.

scop · 2020-04-05T12:29:24Z

Re KS C 5601, based on info at https://en.wikipedia.org/wiki/KS_X_1001 I ended up generating the test sample's expected bytes encoding with

echo -n '여보세요' | iconv -f utf-8 -t euc-kr | hd

I'm not sure if euc-kr is always a good substitute for it, but for this sample it works and I guess that's as far as code here should be concerned, the rest can be left as an assumption that iconv-lite's ksc5601 does the desired thing.

If you want to play with it more, maybe this would be useful?
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT

scop · 2020-10-04T06:46:47Z

Anything more I can do here?

juliangut · 2020-10-25T10:42:56Z

I'm going to hold this till we figure out encoding detection

Automatically encode and decode Cyrillic, Hebrew, and KS C 5601

5b33c06

juliangut reviewed Apr 4, 2020

View reviewed changes

juliangut added the help wanted label Dec 23, 2020

juliangut changed the title ~~Automatically encode and decode Cyrillic, Hebrew, and KS C 5601~~ [WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 Sep 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

scop commented Apr 4, 2020

coveralls commented Apr 4, 2020

juliangut Apr 4, 2020

scop Apr 5, 2020

juliangut commented Apr 4, 2020

scop commented Apr 5, 2020

scop commented Apr 5, 2020 •

edited

Loading

scop commented Oct 4, 2020

juliangut commented Oct 25, 2020

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

Are you sure you want to change the base?

[WIP] Automatically encode and decode Cyrillic, Hebrew, and KS C 5601 #127

Conversation

scop commented Apr 4, 2020

coveralls commented Apr 4, 2020

juliangut Apr 4, 2020

Choose a reason for hiding this comment

scop Apr 5, 2020

Choose a reason for hiding this comment

juliangut commented Apr 4, 2020

scop commented Apr 5, 2020

scop commented Apr 5, 2020 • edited Loading

scop commented Oct 4, 2020

juliangut commented Oct 25, 2020

scop commented Apr 5, 2020 •

edited

Loading