Improve pinyin fuzzy segment algorithm

wengxt · wengxt · commit 5bb49ec00021 · 2024-12-07T12:33:42.000-08:00
Previously, segmentation always preferred the candidate with the longer next match. This proved wrong in the case of "sangeren", which should produce "san ge ren", "sang er en", and "sang e ren".

Instead, the check is now: if (current + next match) is valid and the next match is a complete pinyin, make it an acceptable option, unless (current, next match) is actually an inner fuzzy, which is handled separately below. For example:

1. For "sangeren", both "sang" and "san" are produced, since the next match after "san", which is "ge", is a complete pinyin.
2. For "hua", only "hua" is produced, since "hu a" is an inner fuzzy.

This can produce an "extra" segment: for example, "sanger" yields "san" "ge" "r", which ends in a partial pinyin. That still makes sense, since a partial pinyin match is considered fuzzy and carries a penalty score. Users may even benefit from such a segment, since "san ge r" seems to be the most likely option.

Fix #87
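The change in ordering can be sketched in isolation. The snippet below is a minimal illustration, not the libime code. `MatchInfo`, `oldOrder`, `newOrder`, and `altExtendsPastCurrent` are hypothetical names, and the match sizes used for "sangeren" (next match "er" after "sang", next match "ge" after "san") are assumptions read off the examples in the commit message. It relies only on the lexicographic comparison of `std::tuple`.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for what longestMatch() returns; only the fields
// read by the comparator are modelled.
struct MatchInfo {
    bool valid;            // at least some pinyin, not a bare i/u/v segment
    std::size_t size;      // length of the matched text
    bool isCompletePinyin; // a full syllable rather than a partial one
};

// True when (current minus its last letter) + alternative next match covers
// more input than the current segment alone. When the sizes are equal, the
// split is an inner fuzzy and is left to the inner-fuzzy handling.
inline bool altExtendsPastCurrent(std::size_t strSize, std::size_t altNextSize) {
    return strSize - 1 + altNextSize > strSize;
}

// Old rule: (valid, total matched size, complete). Returns 1, 0, or -1 when
// the main candidate orders above, equal to, or below the alternative.
inline int oldOrder(std::size_t strSize, const MatchInfo &next,
                    const MatchInfo &nextAlt) {
    std::tuple<bool, std::size_t, bool> compare(
        next.valid, strSize + next.size, next.isCompletePinyin);
    std::tuple<bool, std::size_t, bool> compareAlt(
        nextAlt.valid, strSize - 1 + nextAlt.size, nextAlt.isCompletePinyin);
    return (compare > compareAlt) - (compare < compareAlt);
}

// New rule: the size slot becomes a flag, so total length no longer decides
// between two valid, complete candidates.
inline int newOrder(std::size_t strSize, const MatchInfo &next,
                    const MatchInfo &nextAlt) {
    std::tuple<bool, bool, bool> compare(next.valid, true,
                                         next.isCompletePinyin);
    std::tuple<bool, bool, bool> compareAlt(
        nextAlt.valid, altExtendsPastCurrent(strSize, nextAlt.size),
        nextAlt.isCompletePinyin);
    return (compare > compareAlt) - (compare < compareAlt);
}
```

Under these assumptions, for the segment "sang" (size 4) the old rule ranks the main candidate strictly higher (total size 6 against 5), while the new rule makes the two candidates compare equal. For "hua" (size 3), the alternative "hu" + "a" covers exactly 3 characters, so its flag is false. How the surrounding code acts on an equal result lies outside the lines shown in the diff.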
diff --git a/src/libime/pinyin/pinyinencoder.cpp b/src/libime/pinyin/pinyinencoder.cpp
@@ -233,20 +233,20 @@ PinyinEncoder::parseUserPinyin(std::string userPinyin,
                                                   fuzzyFlags, pinyinMap);
                     auto nextMatchAlt = longestMatch(iter + str.size() - 1, end,
                                                      fuzzyFlags, pinyinMap);
-                    auto matchSize = str.size() + nextMatch.match.size();
                     auto matchSizeAlt =
                         str.size() - 1 + nextMatchAlt.match.size();
 
-                    // comparator is (validPinyin, wholeMatchSize,
+                    // comparator is (validPinyin, whole size>= lhs pinyin,
                     // isCompletePinyin) validPinyin means it's at least some
                     // pinyin, instead of things startsWith i,u,v. Since
                     // longestMatch will now treat string startsWith iuv a whole
                     // segment, we need to compare validity before the length.
-                    // Always prefer longer match and complete pinyin match.
-                    std::tuple<bool, size_t, bool> compare(
-                        nextMatch.valid, matchSize, nextMatch.isCompletePinyin);
-                    std::tuple<bool, size_t, bool> compareAlt(
-                        nextMatchAlt.valid, matchSizeAlt,
+                    // If whole size is equal to lhs pinyin, then it should be
+                    // handled by inner segement flag.
+                    std::tuple<bool, bool, bool> compare(
+                        nextMatch.valid, true, nextMatch.isCompletePinyin);
+                    std::tuple<bool, bool, bool> compareAlt(
+                        nextMatchAlt.valid, matchSizeAlt > str.size(),
                         nextMatchAlt.isCompletePinyin);
 
                     if (compare >= compareAlt) {
diff --git a/test/testpinyinencoder.cpp b/test/testpinyinencoder.cpp
@@ -233,6 +233,8 @@ int main() {
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhu", "na"});
     check("zhuna", PinyinFuzzyFlag::Inner, {"zhun", "a"});
 
+    check("sangeren", PinyinFuzzyFlag::Inner, {"san", "ge", "ren"});
+
     {
         PinyinCorrectionProfile profile(BuiltinPinyinCorrectionProfile::Qwerty);
         auto graph = PinyinEncoder::parseUserPinyin(
