Chinese tokenization is bad #9860
-
Note that this cannot be reproduced with displaCy because of #9857.
-
By default the Chinese tokenizer does character tokenization, so this is the expected behavior; see https://spacy.io/usage/models#chinese. The multi-language tokenizer is a rule-based tokenizer for languages that use whitespace between tokens, and it is not intended for use with Chinese or Japanese.
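For reference, the segmenter options described on that docs page look roughly like this (a minimal sketch based on the linked documentation; it assumes spaCy v3 with `jieba` and `spacy-pkuseg` installed):

```python
from spacy.lang.zh import Chinese

# Default: character segmentation, no extra dependencies
nlp = Chinese()

# Word segmentation with Jieba
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Word segmentation with PKUSeg (requires a pkuseg model, e.g. "mixed")
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
```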
-
I love having the option of different tokenizers, but when I choose Jieba or Pkuseg in the way recommended at https://spacy.io/usage/models#chinese, I lose almost all word-level data, such as POS tags.
Is there a way to tokenize with Jieba or Pkuseg while still getting all that word data?
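Not an authoritative answer, but one likely explanation is that a pipeline built from the config snippet above is a blank pipeline: it contains only the tokenizer, so there is no tagger or parser to assign POS tags. A minimal sketch of the difference, assuming spaCy v3, `jieba` installed, and the trained `zh_core_web_sm` package downloaded (the example sentence is my own, not from this thread):

```python
import spacy
from spacy.lang.zh import Chinese

# Blank pipeline with the Jieba segmenter: tokens only, no trained
# components, so token.pos_ stays empty.
cfg = {"segmenter": "jieba"}
nlp_blank = Chinese.from_config({"nlp": {"tokenizer": cfg}})
doc = nlp_blank("我喜欢自然语言处理")
print([(t.text, t.pos_) for t in doc])   # POS fields are empty strings

# Trained pipeline: ships its own word segmentation plus tagger/parser/NER,
# so word-level POS tags are populated.
nlp_trained = spacy.load("zh_core_web_sm")
doc = nlp_trained("我喜欢自然语言处理")
print([(t.text, t.pos_) for t in doc])
```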
-
Both the Chinese-specific and the multi-language tokenizers are so bad as to be unusable.
Noticed by @phasmik
How to reproduce the behaviour
Chinese
Actual output:
i.e. it splits on every character boundary.
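The original code sample did not survive here; a minimal reconstruction of the kind of snippet that shows this behaviour (the sentence is an assumed example, not the one from the report):

```python
import spacy

nlp = spacy.blank("zh")  # default "char" segmenter
doc = nlp("我喜欢自然语言处理")
print([t.text for t in doc])
# one token per character, e.g. ['我', '喜', '欢', '自', '然', '语', '言', '处', '理']
```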
Multilanguage
Actual output:
i.e. no tokenization happened.
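Likewise, a reconstruction for the multi-language pipeline (again with an assumed example sentence):

```python
import spacy

nlp = spacy.blank("xx")  # multi-language, rule-based tokenizer
doc = nlp("我喜欢自然语言处理")
print([t.text for t in doc])
# the whole sentence comes back as a single token, since there is no
# whitespace or punctuation to split on: ['我喜欢自然语言处理']
```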
Expected behaviour
Using https://pypi.org/project/jieba/
Input:
Output:
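For comparison, calling Jieba directly on the same assumed sentence gives word-level segmentation, roughly:

```python
import jieba

print(list(jieba.cut("我喜欢自然语言处理")))
# word-level tokens, e.g. ['我', '喜欢', '自然语言', '处理']
```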
Your Environment