Refactor input handling, validation, and implement Vietnamese engine#3
Refactor input handling, validation, and implement Vietnamese engine#3hthienloc wants to merge 7 commits intoLotusInputMethod:masterfrom
Conversation
… update module path, and include new tests.
…y-to-c++ Implement Vietnamese input engine, charset encoder, rules parser, and smoke tests
…o-to-c++ Engine: migrate internal transformation logic, add tone/backspace semantics and differential harness
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a foundational C++ implementation for the Bamboo input method engine, designed to replicate the functionality of its legacy Go counterpart with a refined API. It establishes a clear migration path through new documentation and incorporates robust character set handling and encoding capabilities. The changes lay the groundwork for a modernized input system while ensuring behavioral consistency. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant new feature: a C++ implementation of the Bamboo input method engine, intended to replace the legacy Go engine. The changes are extensive, including a new public C++ API, the core engine logic, character and input method definitions, parsers, and a differential testing harness to ensure parity. My review focuses on correctness, maintainability, and potential performance improvements in this new codebase. I've identified a critical bug in character set lookups due to unsorted data, a high-severity issue in UTF-8 decoding that limits Unicode support, and several medium-severity issues related to code duplication, an unused variable, an unnecessary abstraction, and a brittle build process. The provided suggestions aim to address these points to improve the robustness and maintainability of the new C++ engine.
| [[nodiscard]] std::u32string decodeUtf8(std::string_view input) { | ||
| std::u32string output; | ||
| output.reserve(input.size()); | ||
| for (std::size_t index = 0; index < input.size();) { | ||
| const unsigned char byte0 = static_cast<unsigned char>(input[index]); | ||
| if (byte0 < 0x80) { | ||
| output.push_back(static_cast<char32_t>(byte0)); | ||
| ++index; | ||
| } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) { | ||
| const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]); | ||
| output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU))); | ||
| index += 2; | ||
| } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) { | ||
| const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]); | ||
| const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]); | ||
| output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU))); | ||
| index += 3; | ||
| } else { | ||
| output.push_back(static_cast<char32_t>(byte0)); | ||
| ++index; | ||
| } | ||
| } | ||
| return output; | ||
| } |
There was a problem hiding this comment.
This decodeUtf8 function does not handle 4-byte UTF-8 sequences, which means it cannot correctly decode Unicode code points above U+FFFF (e.g., many emojis). This is a bug that limits Unicode support.
Additionally, this function is duplicated in src/engine/rules_parser.cpp, and a similar decoder exists in tools/run_differential.py. This code duplication makes the code harder to maintain and more error-prone.
I recommend creating a single, correct implementation of this function in a shared utility file and using it in all required places. The implementation should be extended to handle 4-byte sequences.
[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
std::u32string output;
output.reserve(input.size());
for (std::size_t index = 0; index < input.size();) {
const unsigned char byte0 = static_cast<unsigned char>(input[index]);
if (byte0 < 0x80) {
output.push_back(static_cast<char32_t>(byte0));
++index;
} else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
index += 2;
} else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
index += 3;
} else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) {
const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]);
output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) | ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU)));
index += 4;
} else {
output.push_back(static_cast<char32_t>(byte0));
++index;
}
}
return output;
}| [[nodiscard]] int findVowelPosition(char32_t chr) noexcept { | ||
| static constexpr std::u32string_view kVowels = | ||
| U"aàáảãạăằắẳẵặâầấẩẫậeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ"; | ||
| const auto pos = kVowels.find(chr); | ||
| return pos == std::u32string_view::npos ? -1 : static_cast<int>(pos); | ||
| } |
There was a problem hiding this comment.
This function performs a linear search over a string of 138 characters. As it's called within stripTone, which is part of the validation logic (isValid), it may be executed frequently and could become a performance bottleneck.
For better performance, consider using a more efficient lookup structure, such as a std::map or a sorted std::array with binary search. This function is also duplicated in other files, and consolidating it would be beneficial.
| void handleBackspace(); | ||
|
|
||
| api::Mode mode_{api::Mode::Vietnamese}; | ||
| std::string dataDirPath_; |
| class Encoder final { | ||
| public: | ||
| [[nodiscard]] static std::string encode(std::string_view charsetName, std::u32string_view input); | ||
| [[nodiscard]] static const std::array<std::string_view, 17>& charsetNames() noexcept; | ||
| }; |
There was a problem hiding this comment.
| [[nodiscard]] std::u32string decodeUtf8(std::string_view input) { | ||
| std::u32string output; | ||
| output.reserve(input.size()); | ||
| for (std::size_t index = 0; index < input.size();) { | ||
| const unsigned char byte0 = static_cast<unsigned char>(input[index]); | ||
| if (byte0 < 0x80) { | ||
| output.push_back(static_cast<char32_t>(byte0)); | ||
| ++index; | ||
| } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) { | ||
| const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]); | ||
| output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU))); | ||
| index += 2; | ||
| } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) { | ||
| const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]); | ||
| const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]); | ||
| output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU))); | ||
| index += 3; | ||
| } else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) { | ||
| const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]); | ||
| const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]); | ||
| const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]); | ||
| output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) | | ||
| ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU))); | ||
| index += 4; | ||
| } else { | ||
| output.push_back(static_cast<char32_t>(byte0)); | ||
| ++index; | ||
| } | ||
| } | ||
| return output; | ||
| } |
There was a problem hiding this comment.
| char32_t toLowerCodePoint(char32_t codePoint) noexcept { | ||
| if (codePoint >= U'A' && codePoint <= U'Z') { | ||
| return codePoint + 32; | ||
| } | ||
| switch (codePoint) { | ||
| case U'Đ': return U'đ'; | ||
| case U'Â': return U'â'; | ||
| case U'Ă': return U'ă'; | ||
| case U'Ê': return U'ê'; | ||
| case U'Ô': return U'ô'; | ||
| case U'Ơ': return U'ơ'; | ||
| case U'Ư': return U'ư'; | ||
| default: return codePoint; | ||
| } | ||
| } |
There was a problem hiding this comment.
The function toLowerCodePoint is duplicated in src/engine/engine.cpp and src/engine/rules_parser.cpp. Several other utility functions like toUpperCodePoint, isVowel, findVowelPosition, and addToneToChar are also duplicated across multiple files. To improve maintainability and reduce redundancy, these common functions should be extracted into a shared utility header and source file (e.g., char_utils.h/.cpp).
| cpp_cmd = [ | ||
| "g++", | ||
| "-std=c++17", | ||
| "-Iinclude", | ||
| str(cpp_src), | ||
| "src/engine/engine.cpp", | ||
| "src/engine/spelling.cpp", | ||
| "src/engine/encoder.cpp", | ||
| "src/engine/charset_definition.cpp", | ||
| "src/engine/input_method_definition.cpp", | ||
| "src/engine/rules_parser.cpp", | ||
| "src/engine/transformation_utils.cpp", | ||
| "-o", | ||
| str(cpp_bin), | ||
| ] |
There was a problem hiding this comment.
The C++ build command is hardcoded in this Python script, with a manually maintained list of all source files. This approach is brittle and will become difficult to manage as the project grows. It would be more robust and maintainable to use a build system like CMake to handle the compilation process. This would automate file discovery and make it easier to manage dependencies and build configurations.
This pull request introduces the initial C++ engine implementation for the Bamboo input method, aiming to match the legacy Go engine's behavior while providing a modernized, narrowed API. The changes include new documentation on migration semantics, a public C++ API for the engine, and the first implementation of core engine logic and supporting classes.
Key changes:
Engine API and Implementation
IEngineininclude/bamboo/IEngine.h, defining the engine's stateful API with methods for mode switching, key processing, string processing, deletion, and restoration, intentionally narrowing the public surface compared to the Go engine.Engineinsrc/engine/engine.handsrc/engine/engine.cpp, which matches Go engine semantics for Vietnamese input processing, key handling, deletion, and restoration, and provides the public API via a factory function. [1] [2]Charset and Encoding Support
CharsetDefinitioninsrc/engine/charset_definition.hto manage character set mappings and encoding, and provided anEncoderclass (src/engine/encoder.h,src/engine/encoder.cpp) to encode Unicode input according to the specified charset. [1] [2] [3]Migration Documentation
docs/migration-semantics.md, documenting the intended contract between the Go and C++ engines, required parity areas, and migration rules to ensure behavioral consistency and safe evolution.Project Metadata
go.modto reflect the new repository location.