English words are split into individual characters #146

Open
Shanboy5566 opened this issue Aug 21, 2020 · 2 comments

Comments

@Shanboy5566

Shanboy5566 commented Aug 21, 2020

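I am using the original default dictionaries, except that USER_DICT_PATH points to my own dictionary.

With hmm set to false, out-of-vocabulary (OOV) English words are split into individual characters:
martin => m/a/r/t/i/n
With hmm set to true this does not happen, but then the HMM may also segment out new words.

Is there a way to keep English words intact with hmm=false?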

@PierreZhangcw

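cppjieba does not currently support this; to get it you would have to modify the source code yourself. In the Python version of jieba, however, English is still segmented correctly with HMM set to False. The difference is in how English and digits are handled: pyjieba processes English and digits first and then hands the remainder to the HMM model, whereas cppjieba handles English and digits inside the HMM model.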
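
The "English and digits first" ordering can be illustrated with a minimal Python sketch. It is not code from jieba or cppjieba: `cut_with_english_first`, `ASCII_RUN`, and the `segment_rest` callback are hypothetical names, and `per_character` is only a stand-in for a real dictionary segmenter running without HMM.

```python
import re
from typing import Callable, List

# Assumption for this sketch: a run of ASCII letters/digits is one token.
# Real tokenizers use richer rules (decimals, hyphens, and so on).
ASCII_RUN = re.compile(r"[A-Za-z0-9]+")


def cut_with_english_first(
    text: str, segment_rest: Callable[[str], List[str]]
) -> List[str]:
    """Emit ASCII alphanumeric runs whole; pass everything else to segment_rest."""
    tokens: List[str] = []
    pos = 0
    for match in ASCII_RUN.finditer(text):
        if match.start() > pos:
            # Non-ASCII text before this run goes to the fallback segmenter.
            tokens.extend(segment_rest(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    if pos < len(text):
        # Trailing text after the last ASCII run.
        tokens.extend(segment_rest(text[pos:]))
    return tokens


def per_character(chunk: str) -> List[str]:
    # Hypothetical stand-in for a dictionary-based segmenter without HMM.
    return list(chunk)


print(cut_with_english_first("我叫martin今年18岁", per_character))
# → ['我', '叫', 'martin', '今', '年', '18', '岁']
```

Because the ASCII runs are carved out before the fallback segmenter sees the text, whether that segmenter uses an HMM no longer affects how English words are split.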

@catqaq

catqaq commented Mar 20, 2021

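cppjieba handles English, digits, and similar input with rules, but those rules are currently coupled with the HMM. Decoupling that part of the rules from the HMM is therefore all that is needed, though some boundary cases need careful handling.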
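
One way to picture a rule that is decoupled from the HMM is a post-processing pass over the hmm=false output that merges adjacent single-character ASCII tokens. The Python sketch below is an illustration under that assumption, not cppjieba's actual rule set; `merge_ascii_singletons` and `is_ascii_alnum_char` are hypothetical names, and `str.isascii` requires Python 3.7 or later.

```python
from typing import Iterable, List


def is_ascii_alnum_char(token: str) -> bool:
    """True for a one-character token that is an ASCII letter or digit."""
    return len(token) == 1 and token.isascii() and token.isalnum()


def merge_ascii_singletons(tokens: Iterable[str]) -> List[str]:
    """Join consecutive single-character ASCII alphanumeric tokens."""
    merged: List[str] = []
    buffer: List[str] = []
    for token in tokens:
        if is_ascii_alnum_char(token):
            buffer.append(token)
            continue
        if buffer:
            # Boundary: an ASCII run ends where any other token begins.
            merged.append("".join(buffer))
            buffer = []
        merged.append(token)
    if buffer:
        # Boundary: the input ended inside an ASCII run.
        merged.append("".join(buffer))
    return merged


print(merge_ascii_singletons(["m", "a", "r", "t", "i", "n"]))
# → ['martin']
```

Two boundary choices in this sketch are worth noting: letters and digits are merged into the same token (`a`, `1` becomes `a1`), and multi-character tokens that came from the dictionary are never merged with their neighbours. A real implementation might decide either case differently.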
