Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transliteration option. #219

Open
wants to merge 25 commits into
base: master
Choose a base branch
from
Open

Conversation

kshetline
Copy link
Contributor

This pull request would add the ability to provide transliteration in iconv-lite. To avoid bloating the code, and adding a dependency that many users wouldn't require, the transliteration capability only becomes available if the user optionally adds the "unidecode" package to their project separately.

The updated README contains the details about how this transliteration option would work. The main improvement over using unidecode directly is that unidecode transliterates everything that isn't ASCII into ASCII, whereas this implementation selectively taps into unidecode only for characters that don't exist in a particular target encoding.

@kshetline
Copy link
Contributor Author

By the way, even if you approve of what you see here, please hold off a little while before publishing anything with these changes. I've started working on an improved version of unidecode — the original unidecode hasn't been updated for four years now.

What I've got so far runs about twice as fast as the original, and I'm expanding the transliterations to cover a lot of popular emojis not handled by the original. Since it's been so long since unidecode has been updated (making me unsure if anyone is actively maintaining it) I'll publish my own version as a separate npm package rather than wait for a response to merge my changes into the original. I'll then modify the code in this pull request to use my new version of unidecode as an option to using the original.

@kshetline
Copy link
Contributor Author

The work I was doing on an improved version of unidecode is done now, and is incorporated in this PR now.

@ashtuchkin
Copy link
Owner

First of all, I'm impressed by the quality of the pull requests you produce: improved README, tests, the code itself are all top notch. Great job Kerry!

In the past, I struggled with adding transliteration functionality to iconv-lite because I didn't have a good grasp of intended use cases. Could you describe what use cases do you have in mind here, so that I could understand it better?

One vague concern I have is that this PR introduces some coupling with the unidecode[-plus] library, e.g. "german" option and smart spacing functionality. If there would be additional improvements to unidecode library interface, they will have to be mirrored in iconv-lite, which could become a pain in case the versions drift.

An idea I had initially, with regard to the transliteration interface would be to create a callback like this:

let callback = (unicode_char) => {
    return "<ascii transliteration>";
};

let buf = iconv.encode(str, "encoding", {fallback: callback});

This would trivially allow all three major options of what to do with un-encodable characters: 1) replace them with "?", 2) error out (just throw exception from the callback) 3) transliterate to something smarter.

Now there's a question for you - would one unicode point be enough context to do a reasonable transliteration job? Combining characters (e.g. diacritical marks or ZWJ/modified emojis) might be problematic, does unidecode handle them? I think we can adjust the callback signature to provide more context. Wdyt about it?

var MAX_PENDING = 16384;

TransliterationWrapper.prototype.write = function(str) {
str = iconv.transliterate(str, this.encoder.encodingName, this.options);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question I had is - would this handle surrogate chars? I.e. when a high surrogate is at the end of the previous block and a low surrogate is at the beginning of the current one? It's a valid situation when encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's taken care of. If the original unidecode is used, it only outputs ASCII, so there are no surrogates to worry about. If unidecode-plus is used, it processes a codepoint at a time, rather than a character at a time, so it keeps the surrogates together.

It is possible for surrogates to be split by TransliterationWrapper, in this code which is currently at line 187 of index.js:

        // Split the text being written out at a safe place where smart spacing won't
        // need to make any changes.
        var $ = /^(.*)([^\s\x80\x81][\s\x80\x81].*)$/.exec(str);

...but the logic works in such a way that they'll always get put back together again properly.

@kshetline
Copy link
Contributor Author

In the past, I struggled with adding transliteration functionality to iconv-lite because I didn't have a good grasp of intended use cases. Could you describe what use cases do you have in mind here, so that I could understand it better?

That's a good question. Mainly it's because I saw that transliteration was a feature of node-iconv, that you'd explicitly mentioned that iconv-lite didn't handle that, and I thought it would be an interesting challenge to make it possible. (Yes, I need to get a life. I know! 🤓)

My own use case (which I'd handled previously with a much simpler transliteration function) has been for a geographic database I use for my astronomy web site, skyviewcafe.com. I transliterate place names name to create plain ASCII search keys. Then it doesn't matter if users type accented characters or not, I can find places by name either way.

One vague concern I have... ...which could become a pain in case the versions drift.

I'm not sure how to address that, other than to guess that interface changes probably wouldn't happen too often.

An idea I had initially, with regard to the transliteration interface would be to create a callback like this:

I tried something like that on the way to the current version of my code. It worked, but it was a bit on the slow side calling unidecode one character at a time for transliterations, and it also didn't allow for the smart spacing option. Your suggestion does have the benefit of being very flexible, however.

Now there's a question for you - would one unicode point be enough context to do a reasonable transliteration job? Combining characters (e.g. diacritical marks or ZWJ/modified emojis...

I didn't change anything about how combining diacriticals are handled from the way the way the original unidecode handles that -- it simply keeps the unaccented characters from a combining pair and turns the diacritical into an empty string, which produces the same end result as converting a single-codepoint accented character into its unaccented equivalent. (Which does remind me, however, that my "german" mode wouldn't currently handle combining umlauts.)

So, for that level of functionality, yes, one unicode point is enough. For ZWJ/modified emojis... nope! That's a whole other can of worms that would definitely require more context than one codepoint, and I certainly haven't tried to deal with that issue for transliteration myself.

…d unit testing for streaming transliteration.
…d unit testing for streaming transliteration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants