Add transliteration option. #219

kshetline · 2019-06-29T21:09:13Z

This pull request would add the ability to provide transliteration in iconv-lite. To avoid bloating the code, and adding a dependency that many users wouldn't require, the transliteration capability only becomes available if the user optionally adds the "unidecode" package to their project separately.

The updated README contains the details about how this transliteration option would work. The main improvement over using unidecode directly is that unidecode transliterates everything that isn't ASCII into ASCII, whereas this implementation selectively taps into unidecode only for characters that don't exist in a particular target encoding.
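To make the "selective" idea concrete, here is a minimal sketch. It does not use iconv-lite or unidecode: the `TRANSLIT` table is a hypothetical stand-in for unidecode, and the target encoding is assumed to be ISO-8859-1 (anything up to U+00FF is treated as encodable). The names `isEncodableLatin1` and `selectiveTransliterate` are made up for illustration.

```javascript
// Hypothetical stand-in for unidecode: a tiny lookup table, for illustration only.
// Note that "é" has an entry, but it is never consulted for a Latin-1 target.
const TRANSLIT = { "é": "e", "ł": "l", "→": "->", "€": "EUR" };

// Assumed target encoding: ISO-8859-1, which covers U+0000 through U+00FF.
function isEncodableLatin1(ch) {
    return ch.codePointAt(0) <= 0xFF;
}

// Transliterate only the characters the target encoding cannot represent.
function selectiveTransliterate(str) {
    let out = "";
    for (const ch of str) { // iterates by code point, so surrogate pairs stay together
        if (isEncodableLatin1(ch)) {
            out += ch; // representable in the target: leave untouched
        } else if (Object.prototype.hasOwnProperty.call(TRANSLIT, ch)) {
            out += TRANSLIT[ch];
        } else {
            out += "?"; // no transliteration known
        }
    }
    return out;
}

const result = selectiveTransliterate("café → 5€ złoty");
console.log(result);
```

Here "é" survives because Latin-1 can represent it, while "→", "€" and "ł" are replaced. Running unidecode over the whole string would have turned "é" into "e" as well.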

kshetline · 2019-06-30T12:28:15Z

By the way, even if you approve of what you see here, please hold off a little while before publishing anything with these changes. I've started working on an improved version of unidecode, since the original hasn't been updated for four years now.

What I've got so far runs about twice as fast as the original, and I'm expanding the transliterations to cover a lot of popular emojis not handled by the original. Because unidecode has gone so long without an update (making me unsure whether anyone is actively maintaining it), I'll publish my own version as a separate npm package rather than wait for my changes to be merged into the original. I'll then modify the code in this pull request so that my new version of unidecode can be used as an alternative to the original.

kshetline · 2019-07-03T01:57:52Z

The work I was doing on an improved version of unidecode is now done and has been incorporated into this PR.

ashtuchkin · 2019-07-04T02:52:44Z

First of all, I'm impressed by the quality of the pull requests you produce: the improved README, the tests, and the code itself are all top-notch. Great job, Kerry!

In the past, I struggled with adding transliteration functionality to iconv-lite because I didn't have a good grasp of the intended use cases. Could you describe what use cases you have in mind here, so that I can understand them better?

One vague concern I have is that this PR introduces some coupling with the unidecode[-plus] library, e.g. the "german" option and the smart spacing functionality. If the unidecode library interface is improved further, those changes will have to be mirrored in iconv-lite, which could become a pain if the versions drift.

An idea I had initially for the transliteration interface was to create a callback like this:

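// Proposed interface only: called with each character the target encoding can't represent.
let callback = (unicode_char) => {
    return "<ascii transliteration>";
};

// "fallback" is the option name suggested here, not an existing iconv-lite option.
let buf = iconv.encode(str, "encoding", {fallback: callback});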

This would trivially support all three major ways of handling un-encodable characters: 1) replace them with "?", 2) error out (just throw an exception from the callback), or 3) transliterate them to something smarter.
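As a rough illustration of those three strategies, the sketch below implements the callback idea without iconv-lite. The helper `encodeAsciiWithFallback` is hypothetical and only targets ASCII; it also assumes the callback returns ASCII text. It is not the proposed iconv-lite API itself.

```javascript
// Hypothetical helper: encodes to ASCII, handing un-encodable characters to `fallback`.
// Assumes `fallback` returns ASCII-only text.
function encodeAsciiWithFallback(str, fallback) {
    let out = "";
    for (const ch of str) { // one code point at a time
        out += ch.codePointAt(0) < 0x80 ? ch : fallback(ch);
    }
    return Buffer.from(out, "ascii");
}

// 1) Replace with "?".
const replace = () => "?";

// 2) Error out by throwing from the callback.
const strict = (ch) => {
    throw new Error("Cannot encode U+" + ch.codePointAt(0).toString(16).toUpperCase());
};

// 3) Transliterate, using a tiny illustrative table.
const table = { "ü": "u", "ß": "ss" };
const transliterate = (ch) => table[ch] || "?";

const replaced = encodeAsciiWithFallback("Straße", replace).toString("ascii");
const smart = encodeAsciiWithFallback("Straße", transliterate).toString("ascii");

let strictError = null;
try {
    encodeAsciiWithFallback("Straße", strict);
} catch (e) {
    strictError = e.message;
}

console.log(replaced, smart, strictError);
```

The per-character callback keeps the encoder itself ignorant of any particular transliteration library, at the cost of one function call per un-encodable character.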

Now there's a question for you: would one Unicode code point be enough context to do a reasonable transliteration job? Combining characters (e.g. diacritical marks or ZWJ/modified emojis) might be problematic. Does unidecode handle them? I think we could adjust the callback signature to provide more context. What do you think?

ashtuchkin · 2019-07-04T02:54:52Z

lib/index.js

+var MAX_PENDING = 16384;
+
+TransliterationWrapper.prototype.write = function(str) {
+    str = iconv.transliterate(str, this.encoder.encodingName, this.options);


One question I had: would this handle surrogate characters, i.e. the case where a high surrogate is at the end of the previous block and a low surrogate is at the beginning of the current one? It's a valid situation when encoding.

Yes, that's taken care of. If the original unidecode is used, it only outputs ASCII, so there are no surrogates to worry about. If unidecode-plus is used, it processes a codepoint at a time, rather than a character at a time, so it keeps the surrogates together.

It is possible for surrogates to be split by TransliterationWrapper, in this code which is currently at line 187 of index.js:

// Split the text being written out at a safe place where smart spacing won't
// need to make any changes.
var $ = /^(.*)([^\s\x80\x81][\s\x80\x81].*)$/.exec(str);

...but the logic works in such a way that they'll always get put back together again properly.
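For readers unfamiliar with the split-surrogate problem, the sketch below shows one common way to deal with it in a streaming writer: hold back a trailing high surrogate until the next chunk arrives. This is a different, simpler technique than the regex-based splitting in this PR, and `SurrogateSafeWriter` is a hypothetical class, not part of iconv-lite.

```javascript
// Hypothetical chunk joiner: holds back a trailing high surrogate until the next write.
class SurrogateSafeWriter {
    constructor(processChunk) {
        // Called only with chunks that do not end in an unpaired high surrogate.
        this.processChunk = processChunk;
        this.pending = "";
    }

    write(str) {
        str = this.pending + str;
        this.pending = "";
        if (str.length === 0) return "";
        const last = str.charCodeAt(str.length - 1);
        if (last >= 0xD800 && last <= 0xDBFF) { // high surrogate, partner not seen yet
            this.pending = str.slice(-1);
            str = str.slice(0, -1);
        }
        return str.length ? this.processChunk(str) : "";
    }

    end() {
        // Flush whatever is left, even if it is an unpaired surrogate.
        const rest = this.pending;
        this.pending = "";
        return rest ? this.processChunk(rest) : "";
    }
}

const seen = [];
const writer = new SurrogateSafeWriter((chunk) => { seen.push(chunk); return chunk; });
const emoji = "\u{1F600}"; // stored as the surrogate pair D83D DE00

// The pair is deliberately split across two writes.
const out = writer.write("a" + emoji[0]) + writer.write(emoji[1] + "b") + writer.end();
console.log(seen.length, out === "a" + emoji + "b");
```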

kshetline · 2019-07-04T04:48:43Z

In the past, I struggled with adding transliteration functionality to iconv-lite because I didn't have a good grasp of the intended use cases. Could you describe what use cases you have in mind here, so that I can understand them better?

That's a good question. Mainly it's because I saw that transliteration was a feature of node-iconv, that you'd explicitly mentioned that iconv-lite didn't handle that, and I thought it would be an interesting challenge to make it possible. (Yes, I need to get a life. I know! 🤓)

My own use case (which I'd handled previously with a much simpler transliteration function) has been for a geographic database I use for my astronomy web site, skyviewcafe.com. I transliterate place names to create plain ASCII search keys. Then it doesn't matter whether users type accented characters or not; I can find places by name either way.
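A minimal sketch of the search-key idea, using only built-in string methods rather than unidecode: decompose with NFD, drop combining marks, and lowercase. The function name `toSearchKey` is hypothetical, and this approach is narrower than real transliteration, since letters such as "ø" or "ł" have no decomposition and pass through unchanged.

```javascript
// Build an accent-insensitive search key (assumed approach, not the site's actual code).
function toSearchKey(name) {
    return name
        .normalize("NFD")                 // split accented letters into base + combining mark
        .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritical marks
        .toLowerCase();
}

const keys = ["Zürich", "Zurich", "ZURICH"].map(toSearchKey);
const saoPaulo = toSearchKey("São Paulo");
console.log(keys, saoPaulo);
```

Storing the key next to the original name lets a lookup normalize the user's input the same way and compare keys directly.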

One vague concern I have... ...which could become a pain in case the versions drift.

I'm not sure how to address that, other than to guess that interface changes probably wouldn't happen too often.

An idea I had initially for the transliteration interface was to create a callback like this:

I tried something like that on the way to the current version of my code. It worked, but it was a bit on the slow side calling unidecode one character at a time for transliterations, and it also didn't allow for the smart spacing option. Your suggestion does have the benefit of being very flexible, however.

Now there's a question for you - would one unicode point be enough context to do a reasonable transliteration job? Combining characters (e.g. diacritical marks or ZWJ/modified emojis...

I didn't change anything about how combining diacriticals are handled compared with the original unidecode -- it simply keeps the unaccented character from a combining pair and turns the diacritical into an empty string, which produces the same end result as converting a single-codepoint accented character into its unaccented equivalent. (Which does remind me, however, that my "german" mode wouldn't currently handle combining umlauts.)
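The combining-umlaut gap can be illustrated with a small sketch. The `GERMAN` table and `germanTransliterate` function are hypothetical and are not how unidecode-plus implements its "german" option; the point is only that composing with NFC first lets a per-code-point lookup see "u" plus a combining diaeresis as "ü".

```javascript
// Hypothetical German-style mapping, for illustration only.
const GERMAN = { "ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss" };

function germanTransliterate(str) {
    // NFC composes "u" + U+0308 into the single code point "ü" before the lookup.
    return Array.from(str.normalize("NFC"), (ch) => GERMAN[ch] || ch).join("");
}

const precomposed = germanTransliterate("M\u00fcller"); // ü as one code point
const combining = germanTransliterate("Mu\u0308ller");  // u followed by combining diaeresis
console.log(precomposed, combining);
```

Without the normalization step, the second input would be looked up as "u" and U+0308 separately, so a German-style table would never match it.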

So, for that level of functionality, yes, one unicode point is enough. For ZWJ/modified emojis... nope! That's a whole other can of worms that would definitely require more context than one codepoint, and I certainly haven't tried to deal with that issue for transliteration myself.
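To show why a single code point is not enough for ZWJ emojis, this short sketch splits one such emoji into its code points. It uses only built-in string methods.

```javascript
// "Woman technologist" is one visible glyph built from three code points.
const technologist = "\u{1F469}\u200D\u{1F4BB}"; // woman + zero-width joiner + laptop

const codePoints = Array.from(
    technologist,
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints);
```

A callback that sees these code points one at a time would transliterate "woman" and "laptop" separately and has no way to know they form a single symbol.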

kshetline and others added 18 commits June 24, 2019 12:23

Preliminary work to support UTF-32.

9acd8cd

Finish off general UTF-32 (auto LE or BE), and add UCS-4 aliases.

ace3618

Fix typo in unit test.

f0f9624

Fix uses of Buffer.from() that caused compatibility problems with older versions of Node.

8d104ad

Updated README.md to include UTF-32 options.

25e0413

Get rid of package-lock.json.

9b84cb3

Merging Utf32-LE and-BE codec into a single set of classes with an isLE flag. Other small code tweaks.

325c0fe

Added all-codepoint unit tests for UTF-32.

56a4754

Disable some unit tests on older versions of Node.

cfe04d6

Fixes for working correctly with older versions of Node.

1f56e00

Add comparison to node-iconv, and possible speed improvement.

af9cff7

Merge branch 'master' of https://github.com/ashtuchkin/iconv-lite

8e038b5

Preliminary work for transliteration support.

7d6f955

Changing computers check-in.

a6f91b4

First version of transliteration support.

a550b78

Odd... my Node.js environment was fine with a "catch" without a variable, but the automated testing on travis-ci.com didn't like it.

f574282

Update documentation for transliteration.

7b13689

Much to my surprise, a regex global replace turns out to be much faster than build a string in a for loop.

198d7e2

kshetline and others added 5 commits July 1, 2019 00:53

Add typings for transliteration. Add ability to deal with smart spacing and German options. Remove conflicts in unit testing where unidecode and unidecode-plus transliterate differently.

5797fb5

Make extra unit testing for smart spacing and German dependent on whether unidecode-plus is available or not.

bb0a2ea

Update with unidecode-plus 0.0.0-alpha.1.

84f01e4

Update with new unidecode plus and related documentation.

cac5bc8

Update with unidecode-1.0.1.

c69bdb5

ashtuchkin reviewed Jul 4, 2019

kshetline added 2 commits July 13, 2019 16:16

Transliteration in German mode now works with combining diaeresis. Added unit testing for streaming transliteration.

6d2d25a

Transliteration in German mode now works with combining diaeresis. Added unit testing for streaming transliteration.

4e4f4da

ashtuchkin force-pushed the master branch from 978c58b to 5148f43 Compare June 8, 2020 08:19

ashtuchkin force-pushed the master branch 4 times, most recently from 84ee650 to 9aa082f Compare July 16, 2020 08:07

ashtuchkin force-pushed the master branch from 5d99a92 to ed88711 Compare May 23, 2021 22:34

emredalka approved these changes May 17, 2022
