Commit 5bb49ec

Improve pinyin fuzzy segment algorithm
Previously, we always chose the segmentation with the longer next match. This is wrong for "sangeren", which should produce "san ge ren", "sang er en", and "sang e ren".

Instead, the check is now: if (current + next match) is valid and a complete pinyin, accept it as an option, unless (current, next match) is actually an inner fuzzy, which is handled separately below.

For example:
1. For "sangeren", this produces both "sang" and "san", since the next match after "san", which is "ge", is a complete pinyin.
2. For "hua", this produces only "hua", since "hu a" is an inner fuzzy.

This can produce an "extra" segment: for example, "sanger" produces "san" "ge" "r", where "r" is a partial pinyin. We still consider this reasonable, since a partial pinyin match is treated as fuzzy and receives a penalty score. Users may even benefit from such a segmentation, since "san ge r" seems to be the most likely option.

Fix #87
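To make the three expected results for "sangeren" concrete, here is a minimal brute-force sketch that enumerates every split of an input into complete syllables. It is not libime's graph-based algorithm: `kSyllables`, `segment`, and `allSegmentations` are hypothetical names, and the syllable set is a hand-picked subset chosen for this example rather than the real pinyin map.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hand-picked subset of pinyin syllables (an assumption for this sketch;
// the real encoder consults its full pinyin map instead).
static const std::set<std::string> kSyllables = {
    "san", "sang", "ge", "e", "er", "en", "ren"};

// Recursively collect every way to split `input`, starting at `pos`, into
// syllables from kSyllables. Shorter leading syllables are tried first.
void segment(const std::string &input, std::size_t pos,
             std::vector<std::string> &current,
             std::vector<std::string> &results) {
    if (pos == input.size()) {
        // Reached the end: join the collected syllables with spaces.
        std::string joined;
        for (const auto &s : current) {
            if (!joined.empty()) {
                joined += ' ';
            }
            joined += s;
        }
        results.push_back(joined);
        return;
    }
    for (std::size_t len = 1; pos + len <= input.size(); ++len) {
        const std::string piece = input.substr(pos, len);
        if (kSyllables.count(piece) != 0) {
            current.push_back(piece);
            segment(input, pos + len, current, results);
            current.pop_back();
        }
    }
}

std::vector<std::string> allSegmentations(const std::string &input) {
    std::vector<std::string> current;
    std::vector<std::string> results;
    segment(input, 0, current, results);
    return results;
}
```

Under this reduced syllable set, `allSegmentations("sangeren")` yields the three splits named in the commit message. A rule that keeps only the "sang" prefix can never reach "san ge ren", which is the case this commit addresses.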
1 parent 118dc5f commit 5bb49ec

2 files changed: +9 -7 lines changed
Diff for: src/libime/pinyin/pinyinencoder.cpp

+7 -7

@@ -233,20 +233,20 @@ PinyinEncoder::parseUserPinyin(std::string userPinyin,
                                               fuzzyFlags, pinyinMap);
                 auto nextMatchAlt = longestMatch(iter + str.size() - 1, end,
                                                  fuzzyFlags, pinyinMap);
-                auto matchSize = str.size() + nextMatch.match.size();
                 auto matchSizeAlt =
                     str.size() - 1 + nextMatchAlt.match.size();

-                // comparator is (validPinyin, wholeMatchSize,
+                // comparator is (validPinyin, whole size>= lhs pinyin,
                 // isCompletePinyin) validPinyin means it's at least some
                 // pinyin, instead of things startsWith i,u,v. Since
                 // longestMatch will now treat string startsWith iuv a whole
                 // segment, we need to compare validity before the length.
-                // Always prefer longer match and complete pinyin match.
-                std::tuple<bool, size_t, bool> compare(
-                    nextMatch.valid, matchSize, nextMatch.isCompletePinyin);
-                std::tuple<bool, size_t, bool> compareAlt(
-                    nextMatchAlt.valid, matchSizeAlt,
+                // If whole size is equal to lhs pinyin, then it should be
+                // handled by inner segement flag.
+                std::tuple<bool, bool, bool> compare(
+                    nextMatch.valid, true, nextMatch.isCompletePinyin);
+                std::tuple<bool, bool, bool> compareAlt(
+                    nextMatchAlt.valid, matchSizeAlt > str.size(),
                     nextMatchAlt.isCompletePinyin);

                 if (compare >= compareAlt) {
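The effect of replacing the `size_t` tuple element with a `bool` can be seen by evaluating both comparators on hand-written match values. The sketch below is a simplified model, not the actual encoder: `Match`, `Choice`, `oldRule`, `newRule`, and `demo` are hypothetical names, the match values are filled in by hand rather than computed by `longestMatch`, and since the diff does not show the branch bodies, the symmetric `compareAlt >= compare` check for keeping the shorter candidate is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Hypothetical stand-in for a longestMatch() result; only the fields that
// the comparator in the diff reads are modelled.
struct Match {
    std::string match;
    bool valid;
    bool isCompletePinyin;
};

// Which candidates survive: the longer current segment (e.g. "sang") and/or
// the one-character-shorter alternative (e.g. "san").
struct Choice {
    bool keepLonger;
    bool keepShorter;  // assumed to be decided by compareAlt >= compare
};

// Comparator before the commit: total matched length is the second key.
Choice oldRule(const std::string &str, const Match &next, const Match &alt) {
    auto matchSize = str.size() + next.match.size();
    auto matchSizeAlt = str.size() - 1 + alt.match.size();
    std::tuple<bool, std::size_t, bool> compare(next.valid, matchSize,
                                                next.isCompletePinyin);
    std::tuple<bool, std::size_t, bool> compareAlt(alt.valid, matchSizeAlt,
                                                   alt.isCompletePinyin);
    return {compare >= compareAlt, compareAlt >= compare};
}

// Comparator after the commit: the second key only records whether the
// alternative extends past the current segment.
Choice newRule(const std::string &str, const Match &next, const Match &alt) {
    auto matchSizeAlt = str.size() - 1 + alt.match.size();
    std::tuple<bool, bool, bool> compare(next.valid, true,
                                         next.isCompletePinyin);
    std::tuple<bool, bool, bool> compareAlt(
        alt.valid, matchSizeAlt > str.size(), alt.isCompletePinyin);
    return {compare >= compareAlt, compareAlt >= compare};
}

void demo() {
    // "sangeren": current "sang", next "er"; alternative "san", next "ge".
    Match er{"er", true, true};
    Match ge{"ge", true, true};
    assert(!oldRule("sang", er, ge).keepShorter);  // 6 > 5, "san" is dropped
    assert(newRule("sang", er, ge).keepLonger);
    assert(newRule("sang", er, ge).keepShorter);   // tie, both are kept

    // "hua" followed by a hypothetical "ren": alternative "hu" + "a" does
    // not extend past "hua", so it is left to the inner fuzzy handling.
    Match ren{"ren", true, true};
    Match a{"a", true, true};
    assert(!newRule("hua", ren, a).keepShorter);
}
```

Because `std::tuple` compares lexicographically, collapsing the length to a boolean turns the "sang" versus "san" case into a tie, while the "hu a" case still loses on the second element.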

Diff for: test/testpinyinencoder.cpp

+2

@@ -233,6 +233,8 @@ int main() {
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhu", "na"});
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhun", "a"});

+    check("sangeren", PinyinFuzzyFlag::Inner, {"san", "ge", "ren"});
+
     {
         PinyinCorrectionProfile profile(BuiltinPinyinCorrectionProfile::Qwerty);
         auto graph = PinyinEncoder::parseUserPinyin(
