Inconsistencies in the Korean/Japanese use of mecab #5708
Replies: 7 comments
-
Are there any other good options for Korean? How popular is mecab for Korean? My main problem with mecab is that it is difficult to install and configure, and as it gets more out-of-date this isn't going to improve. For Japanese, at least, getting the correct versions compiled and installed used to be pretty tricky; this improved with fugashi. I also found it hard to configure things so that both Korean and Japanese run at the same time. Since the github CI skips the tests involving additional language-specific libraries, I try to run as many of those tests locally as I can when I run the whole test suite, and the only way I found to get both to run at the same time was to hack system-specific environment variable settings into the language initialization to select the right dictionary for each.
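For reference, a minimal sketch of the kind of hack described above, using natto-py (which the Korean code wraps). The environment variable names and dictionary paths here are hypothetical and depend entirely on the local installs:

```python
import os

from natto import MeCab

# Hypothetical env vars and default paths -- adjust to your local installs.
DICDIRS = {
    "ko": os.environ.get("SPACY_KO_DICDIR", "/usr/local/lib/mecab/dic/mecab-ko-dic"),
    "ja": os.environ.get("SPACY_JA_DICDIR", "/usr/local/lib/mecab/dic/ipadic"),
}

def mecab_for(lang: str) -> MeCab:
    """Point natto-py at the right MeCab dictionary for each language."""
    return MeCab(f"-d {DICDIRS[lang]}")
```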
Which versions of spacy and sudachipy are you using? I think sudachipy has improved a lot in speed since those benchmarks were run. In my local tests, the sudachipy Japanese tokenizer is about half as fast as the rule-based English tokenizer (90K vs. 150K words/s). Since it's also tagging and lemmatizing, I think the overall pipeline speed for the provided core models (tagger, parser, ner) is similar for both. The fugashi tokenizer in v2.2.4 is about twice as fast as sudachipy, so sudachipy is slower, but definitely nothing like a factor of 100x. (The actual slowest tokenizer is currently pkuseg for Chinese, but it was the best solution I could find given our priorities.)
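For anyone who wants to reproduce rough words-per-second numbers like these, a simple measurement looks something like the sketch below. The corpus and pipelines are placeholders, and absolute numbers will of course vary with hardware:

```python
import time

def words_per_second(nlp, texts, n_runs=3):
    """Rough tokenizer throughput, best of n_runs to reduce warm-up noise."""
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_words = sum(len(doc) for doc in nlp.tokenizer.pipe(texts))
        best = max(best, n_words / (time.perf_counter() - start))
    return best
```

Comparing, e.g., `spacy.blank("en")` against `spacy.blank("ja")` on corpora of similar size gives ratios along the lines of the ones above.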
We're not at all opposed to providing options within the tokenizer that make sense for typical users. Chinese is an example where there are three options currently: character segmentation, jieba, and pkuseg (a rough sketch of selecting between them is at the end of this comment). We've also been thinking about whether it would make sense to move some of the custom tokenizer code into separate packages. Let me ping @polm, who's worked on the Japanese support a lot in the past.
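For concreteness, here is what selecting between the three Chinese options looks like, assuming the v3-style config (in v2.x the equivalent settings lived in the tokenizer meta instead):

```python
from spacy.lang.zh import Chinese

# 1. Character segmentation (the default)
nlp_char = Chinese()

# 2. Jieba
nlp_jieba = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

# 3. PKUSeg (requires a pretrained pkuseg model, downloaded separately)
nlp_pkuseg = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
nlp_pkuseg.tokenizer.initialize(pkuseg_model="mixed")
```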
-
Thanks for pinging me on this. My thoughts are basically that MeCab is still much faster than SudachiPy, but Sudachi is probably going to be the standard tokenizer for Japanese Universal Dependencies, so it makes sense to use it as a default and work on making it faster. If it's possible to add MeCab as an optional tokenizer I'd be in favor of that. The benchmark linked to is really old. In the latest version of my benchmark (which I admittedly just updated) SudachiPy is roughly 40x slower than MeCab, which is not great but I think is fine for short or medium-length corpora. (It's only 10x slower than natto-py, which is what the Korean code is using.) I would say I am against using natto-py for the main Japanese tokenizer in spaCy.
I think I missed the PR that added this. I'm a little confused as to what the point of it is, since I assume swapping tokenizer settings would make the vectors/models stop working meaningfully.
There are other Korean tokenizers, but I have no idea what's popular (I do not speak Korean). Maybe we can contact the KoNLPy maintainers, or see who's in charge of Korean Universal Dependencies?
-
Thanks for the reply! I guess the spacy overhead in my timing tests was higher than I thought. The idea behind the sudachipy mode options is that you may just want to use the tokenizer / tagger on its own. It seems like for Japanese we might consider adding back an option to use fugashi.
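For readers unfamiliar with the mode options: SudachiPy exposes three split modes, from the shortest units (A) up to the longest match (C). Used directly (assuming `sudachidict_core` is installed), that looks like:

```python
from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
text = "国家公務員"  # "national civil servant"
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print([m.surface() for m in tok.tokenize(text, mode)])
# A: ['国家', '公務', '員']
# B: ['国家', '公務員']
# C: ['国家公務員']
```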
-
OK, that makes sense.
My understanding of mecab-ko (the Korean MeCab fork) is that it modifies some of the score calculation in the Viterbi matrix when spaces are present. What this means for a MeCab wrapper is that the API is consistent (the function calls are all the same) but the results may differ. If you install fugashi or mecab-python3 from source, they should just work, since the API is unchanged. However, wheels include the binary, so a wheel has to be built against either mainline MeCab or the Korean fork; it can't support both. If I wanted to support it with wheels I'd have to create some kind of fugashi-ko.

I looked into adding Korean support to fugashi before, but had trouble because there's no English documentation for the Korean fork, and I didn't know enough about the details of Korean to make sense of the machine-translated descriptions of the dictionary fields, which had a lot of nesting. When I saw there was an actively maintained wrapper in KoNLPy I gave up on Korean support in fugashi, and now the README just recommends people use that.

MeCab is not really Japanese-specific - the code is at times frustratingly generic - but I think Japanese is the only modern language the un-forked version is widely used for. (MeCab and fugashi have been used for Classical Chinese a bit, see here.)
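To make the "consistent API" point concrete: fugashi usage looks the same regardless of which dictionary the binary was built against; only the contents of the feature fields differ. A minimal example:

```python
from fugashi import Tagger

tagger = Tagger()  # picks up whatever MeCab dictionary is installed
for word in tagger("麩菓子は、麩を主材料とした日本の菓子。"):
    # word.surface is the token text; word.feature holds the
    # dictionary-specific fields (POS, lemma, reading, ...).
    print(word.surface, word.feature)
```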
-
I think this is a very interesting conversation. One other issue that I've found with mecab-ko is that its segmentation is completely different from that of effectively every publicly available labeled Korean dataset, which makes it very difficult to train a statistical model due to misalignment. I mostly mention that in case anyone in this thread has experience working around that limitation, but I'm also commenting because this idea has definitely crossed my mind and makes good sense to me.
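A quick way to see the misalignment concretely is spaCy v3's `Alignment` helper; the tokens below are illustrative (eojeol-style gold segmentation vs. morpheme-style output like mecab-ko's):

```python
from spacy.training import Alignment

gold = ["나는", "밥을", "먹었다"]                 # eojeol-level gold tokens
pred = ["나", "는", "밥", "을", "먹", "었", "다"]  # morpheme-level output

align = Alignment.from_strings(gold, pred)
# Every gold token maps to several predicted tokens, so one-to-one
# supervision (e.g. for a tagger) can't be applied directly.
print(align.x2y.lengths)  # [2, 2, 3]
```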
-
@erip I'm not an expert in Korean NLP, but I'm finding myself having to do increasingly more of it. I'd be really interested in seeing some examples of how mecab-ko segments text differently from other libraries. Do you happen to know of a link showing these differences?
-
@mikeizbicki I believe you'll find that the gold-standard segmentation in some of the available UD corpora differs from what mecab-ko produces. It's been some time since I looked into it, but I discovered this while trying to train a Korean spaCy model. The only recourse I had was to provide the gold-standard tokens in training, but unfortunately that doesn't solve the issue. Some discussion here.
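For reference, the workaround mentioned above looks roughly like this in spaCy v3: the gold tokens go into the `Example` dict, and spaCy aligns them to the tokenizer's own output. The sentence is illustrative, and the Korean defaults may themselves require mecab-ko to initialize:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("ko")
doc = nlp.make_doc("나는 밥을 먹었다")

# Gold-standard tokens from the corpus, which need not match the
# tokenizer's own segmentation; spaCy aligns them internally.
example = Example.from_dict(doc, {"words": ["나", "는", "밥", "을", "먹", "었", "다"]})
```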
-
I'm considering doing a partial rewrite of the Korean/Japanese code bases to make them more consistent with each other, and in particular to allow Japanese to also use mecab for parsing. I'd like to get your feedback before I proceed. This issue stems from #5701.
I understand based on #5544 that the previous mecab parser was replaced by SudachiPy in order to be consistent with the training data that the models are built on. It's not clear to me, however, that this was a good design choice, for three reasons:
1. Korean requires a fork of mecab called mecab-ko. It's a bit inconsistent for the Korean language support to require this heavy-weight dependency, originally intended for Japanese, when the Japanese code no longer uses it.
2. The SudachiPy library is significantly slower than the mecab library. According to these tests, it's slower by a factor of about 100x. In my tests, the Japanese tokenizer is by far the slowest tokenizer of any language in spacy, and it is the current bottleneck in my own work.
3. SudachiPy has multiple parsing modes / dictionaries, and spacy allows users to select between them. It makes sense that using the same settings the model's training data was generated with would give the best performance, but if non-standard settings are allowed, why not also allow the original mecab-based parser, which would be much faster?
The Korean language code is already general enough to support Japanese as well with some minor modifications, and I think I could modify both code bases in a way that preserves backwards compatibility of the API. Is this a change you would support making to spacy?
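As a rough illustration of what "general enough" means here, a shared MeCab-backed tokenizer could be parameterized by the dictionary alone. This is a sketch, not spaCy's actual implementation; it ignores whitespace and morphological features:

```python
from natto import MeCab
from spacy.tokens import Doc

class MeCabTokenizer:
    """One tokenizer class for Japanese or Korean, differing only in
    which MeCab dictionary (e.g. ipadic vs. mecab-ko-dic) it loads."""

    def __init__(self, vocab, dicdir=None):
        self.vocab = vocab
        self.mecab = MeCab(f"-d {dicdir}" if dicdir else "")

    def __call__(self, text):
        words = [node.surface
                 for node in self.mecab.parse(text, as_nodes=True)
                 if not node.is_eos()]
        return Doc(self.vocab, words=words)
```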