Inconsistencies in the Korean/Japanese use of mecab #5708
Replies: 7 comments
-
Are there any other good options for Korean? How popular is mecab for Korean? My main problem with mecab is that it is difficult to install and configure, and as it gets more out-of-date this isn't going to improve. For Japanese, at least, getting the correct versions compiled and installed used to be pretty tricky; this improved with fugashi. I also found it hard to configure things so that both Korean and Japanese run at the same time. Since the github CI skips the tests involving additional language-specific libraries, I try to run as many of those tests locally as I can when I run the whole test suite, and the only way I found to get both to run at the same time was to hack system-specific environment variable settings into the language initialization to select the right dictionary for each.
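For reference, a minimal sketch of the kind of hack described above, using natto-py (which the Korean code wraps). The environment variable names and dictionary paths here are hypothetical and depend entirely on the local installs:

```python
import os

from natto import MeCab

# Hypothetical env vars and default paths -- adjust to your local installs.
DICDIRS = {
    "ko": os.environ.get("SPACY_KO_DICDIR", "/usr/local/lib/mecab/dic/mecab-ko-dic"),
    "ja": os.environ.get("SPACY_JA_DICDIR", "/usr/local/lib/mecab/dic/ipadic"),
}

def mecab_for(lang: str) -> MeCab:
    """Point natto-py at the right MeCab dictionary for each language."""
    return MeCab(f"-d {DICDIRS[lang]}")
```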
Which versions of spacy and sudachipy are you using? I think sudachipy has improved a lot in speed since those benchmarks were run. In my local tests, the sudachipy Japanese tokenizer is about half as fast as the rule-based English tokenizer (90K vs. 150K words/s). Since it's also tagging and lemmatizing, I think the overall pipeline speed for the provided core models (tagger, parser, ner) is similar for both. The fugashi tokenizer in v2.2.4 is about twice as fast as sudachipy, so sudachipy is slower, but definitely nothing like a factor of 100x. (The actual slowest tokenizer is currently pkuseg for Chinese, but it was the best solution I could find given our priorities.)
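For anyone who wants to reproduce rough words-per-second numbers like these, a simple measurement looks something like the sketch below. The corpus and pipelines are placeholders, and absolute numbers will of course vary with hardware:

```python
import time

def words_per_second(nlp, texts, n_runs=3):
    """Rough tokenizer throughput, best of n_runs to reduce warm-up noise."""
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_words = sum(len(doc) for doc in nlp.tokenizer.pipe(texts))
        best = max(best, n_words / (time.perf_counter() - start))
    return best
```

Comparing, e.g., `spacy.blank("en")` against `spacy.blank("ja")` on corpora of similar size gives ratios along the lines of the ones above.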
We're not at all opposed to providing options within the tokenizer that make sense for typical users. Chinese is an example where there are three options currently: character segmentation, jieba, and pkuseg (a rough sketch of selecting between them is at the end of this comment). We've also been thinking about whether it would make sense to move some of the custom tokenizer code into separate packages. Let me ping @polm, who's worked on the Japanese support a lot in the past.
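For concreteness, here is what selecting between the three Chinese options looks like, assuming the v3-style config (in v2.x the equivalent settings lived in the tokenizer meta instead):

```python
from spacy.lang.zh import Chinese

# 1. Character segmentation (the default)
nlp_char = Chinese()

# 2. Jieba
nlp_jieba = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

# 3. PKUSeg (requires a pretrained pkuseg model, downloaded separately)
nlp_pkuseg = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
nlp_pkuseg.tokenizer.initialize(pkuseg_model="mixed")
```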
-
Thanks for pinging me on this. My thoughts are basically that MeCab is still much faster than SudachiPy, but Sudachi is probably going to be the standard tokenizer for Japanese Universal Dependencies, so it makes sense to use it as a default and work on making it faster. If it's possible to add MeCab as an optional tokenizer I'd be in favor of that. The benchmark linked to is really old. In the latest version of my benchmark (which I admittedly just updated) SudachiPy is roughly 40x slower than MeCab, which is not great but I think is fine for short or medium-length corpora. (It's only 10x slower than natto-py, which is what the Korean code is using.) I would say I am against using natto-py for the main Japanese tokenizer in spaCy.
I think I missed the PR that added this. I'm a little confused as to what the point of it is, since I assume swapping tokenizer settings would make the vectors/models stop working meaningfully.
There are other Korean tokenizers, but I have no idea what's popular (I do not speak Korean). Maybe we can contact the KoNLPy maintainers, or see who's in charge of Korean Universal Dependencies?
-
Thanks for the reply! I guess the spacy overhead in my timing tests was higher than I thought. The idea behind the sudachipy mode options is that you may just want to use the tokenizer / tagger on its own. It seems like for Japanese we might consider adding back an option to use fugashi.
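For readers unfamiliar with the mode options: SudachiPy exposes three split modes, from the shortest units (A) up to the longest match (C). Used directly (assuming `sudachidict_core` is installed), that looks like:

```python
from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
text = "国家公務員"  # "national civil servant"
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print([m.surface() for m in tok.tokenize(text, mode)])
# A: ['国家', '公務', '員']
# B: ['国家', '公務員']
# C: ['国家公務員']
```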
-
OK, that makes sense.
My understanding of mecab-ko (the Korean MeCab fork) is that it modifies some of the score calculation in the Viterbi matrix when spaces are present. What this means for a MeCab wrapper is that the API is consistent (the function calls are all the same) but the results may differ. If you install fugashi or mecab-python3 from source, they should just work, since the API is unchanged. However, wheels include the binary, so a wheel has to be built against either mainline MeCab or the Korean fork; it can't support both. If I wanted to support it with wheels I'd have to create some kind of fugashi-ko.

I looked into adding Korean support to fugashi before, but had trouble because there's no English documentation for the Korean fork, and I didn't know enough about the details of Korean to make sense of the machine-translated descriptions of the dictionary fields, which had a lot of nesting. When I saw there was an actively maintained wrapper in KoNLPy I gave up on Korean support in fugashi, and now the README just recommends people use that.

MeCab is not really Japanese-specific - the code is at times frustratingly generic - but I think Japanese is the only modern language the un-forked version is widely used for. (MeCab and fugashi have been used for Classical Chinese a bit, see here.)
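To make the "consistent API" point concrete: fugashi usage looks the same regardless of which dictionary the binary was built against; only the contents of the feature fields differ. A minimal example:

```python
from fugashi import Tagger

tagger = Tagger()  # picks up whatever MeCab dictionary is installed
for word in tagger("麩菓子は、麩を主材料とした日本の菓子。"):
    # word.surface is the token text; word.feature holds the
    # dictionary-specific fields (POS, lemma, reading, ...).
    print(word.surface, word.feature)
```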
-
I think this is a very interesting conversation. One other issue that I've found with mecab-ko is that its segmentation is completely different from that of effectively every publicly available labeled Korean dataset, which makes it very difficult to train a statistical model due to misalignment. I mostly mention that in case anyone in this thread has experience working around that limitation, but I'm also commenting because this idea has definitely crossed my mind and makes good sense to me.
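A quick way to see the misalignment concretely is spaCy v3's `Alignment` helper; the tokens below are illustrative (eojeol-style gold segmentation vs. morpheme-style output like mecab-ko's):

```python
from spacy.training import Alignment

gold = ["나는", "밥을", "먹었다"]                 # eojeol-level gold tokens
pred = ["나", "는", "밥", "을", "먹", "었", "다"]  # morpheme-level output

align = Alignment.from_strings(gold, pred)
# Every gold token maps to several predicted tokens, so one-to-one
# supervision (e.g. for a tagger) can't be applied directly.
print(align.x2y.lengths)  # [2, 2, 3]
```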
-
@erip I'm not an expert in Korean NLP, but I'm finding myself having to do increasingly more of it. I'd be really interested in seeing some examples of how mecab-ko segments text differently from other libraries. Do you happen to know of a link showing these differences?
-
@mikeizbicki I believe you'll find that the gold-standard segmentation in some of the available UD corpora differs from what mecab-ko produces. It's been some time since I looked into it, but I discovered this while trying to train a Korean spaCy model. The only recourse I had was to provide the gold-standard tokens in training, but unfortunately that doesn't solve the issue. Some discussion here.
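For reference, the workaround mentioned above looks roughly like this in spaCy v3: the gold tokens go into the `Example` dict, and spaCy aligns them to the tokenizer's own output. The sentence is illustrative, and the Korean defaults may themselves require mecab-ko to initialize:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("ko")
doc = nlp.make_doc("나는 밥을 먹었다")

# Gold-standard tokens from the corpus, which need not match the
# tokenizer's own segmentation; spaCy aligns them internally.
example = Example.from_dict(doc, {"words": ["나", "는", "밥", "을", "먹", "었", "다"]})
```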
-
I'm considering doing a partial rewrite of the Korean/Japanese code bases to make them more consistent with each other, and in particular to allow Japanese to also use mecab for parsing. I'd like to get your feedback before I proceed. This issue stems from #5701.
I understand based on #5544 that the previous mecab parser was replaced by SudachiPy in order to be consistent with the training data that the models are built on. It's not clear to me, however, that this was a good design choice, for three reasons:
1. Korean requires a fork of mecab called mecab-ko. It's a bit inconsistent for the Korean language support to require this heavy-weight dependency, originally intended for Japanese, when the Japanese code no longer uses it.
2. The SudachiPy library is significantly slower than the mecab library. According to these tests, it's slower by a factor of about 100x. In my tests, the Japanese tokenizer is by far the slowest tokenizer of any language in spacy, and it is the current bottleneck in my own work.
3. SudachiPy has multiple parsing modes / dictionaries, and spacy allows users to select between them. It makes sense that using the same settings the model's training data was generated with would give the best performance, but if non-standard settings are allowed, why not also allow the original mecab-based parser, which would be much faster?
The Korean language code is already general enough to support Japanese as well with some minor modifications, and I think I could modify both code bases in a way that preserves backwards compatibility of the API. Is this a change you would support making to spacy?
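As a rough illustration of what "general enough" means here, a shared MeCab-backed tokenizer could be parameterized by the dictionary alone. This is a sketch, not spaCy's actual implementation; it ignores whitespace and morphological features:

```python
from natto import MeCab
from spacy.tokens import Doc

class MeCabTokenizer:
    """One tokenizer class for Japanese or Korean, differing only in
    which MeCab dictionary (e.g. ipadic vs. mecab-ko-dic) it loads."""

    def __init__(self, vocab, dicdir=None):
        self.vocab = vocab
        self.mecab = MeCab(f"-d {dicdir}" if dicdir else "")

    def __call__(self, text):
        words = [node.surface
                 for node in self.mecab.parse(text, as_nodes=True)
                 if not node.is_eos()]
        return Doc(self.vocab, words=words)
```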