Refactor input handling, validation, and implement Vietnamese engine #3

Draft
hthienloc wants to merge 7 commits into LotusInputMethod:master from hthienloc:cpp-migration

Conversation

@hthienloc

This pull request introduces the initial C++ engine implementation for the Bamboo input method, aiming to match the legacy Go engine's behavior while providing a modernized, narrowed API. The changes include new documentation on migration semantics, a public C++ API for the engine, and the first implementation of core engine logic and supporting classes.

Key changes:

Engine API and Implementation

  • Introduced a new public C++ interface IEngine in include/bamboo/IEngine.h, defining the engine's stateful API with methods for mode switching, key processing, string processing, deletion, and restoration, intentionally narrowing the public surface compared to the Go engine.
  • Added a concrete implementation Engine in src/engine/engine.h and src/engine/engine.cpp, which matches Go engine semantics for Vietnamese input processing, key handling, deletion, and restoration, and provides the public API via a factory function.
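
To make the shape of a narrowed, stateful engine API concrete, here is a minimal sketch. It is not the PR's IEngine: the namespace sketch, every method name, the Mode values, and the ToyEngine class are hypothetical, and the only transformation implemented is a single Telex-style rule ("dd" becomes đ) so that the difference between composed output and restored raw keystrokes is visible.

```cpp
#include <memory>
#include <string>
#include <string_view>

namespace sketch {

// Hypothetical mode enum; the PR itself only shows api::Mode::Vietnamese.
enum class Mode { Vietnamese, English };

// Hypothetical narrowed, stateful interface. Method names are illustrative.
class IEngine {
public:
    virtual ~IEngine() = default;
    virtual void setMode(Mode mode) = 0;
    virtual void processKey(char32_t key) = 0;
    virtual void processString(std::u32string_view text) = 0;
    virtual void removeLastChar() = 0;
    virtual void restoreLastWord() = 0;
    [[nodiscard]] virtual std::u32string processedString() const = 0;
};

// Toy implementation with one Telex-style rule: "dd" becomes U+0111 (đ).
class ToyEngine final : public IEngine {
public:
    void setMode(Mode mode) override { mode_ = mode; }

    void processKey(char32_t key) override {
        raw_.push_back(key);
        const bool mergeD = mode_ == Mode::Vietnamese && key == U'd' &&
                            !composed_.empty() && composed_.back() == U'd';
        if (mergeD) {
            composed_.back() = U'\u0111';
        } else {
            composed_.push_back(key);
        }
    }

    void processString(std::u32string_view text) override {
        for (const char32_t key : text) processKey(key);
    }

    void removeLastChar() override {
        if (composed_.empty()) return;
        composed_.pop_back();
        raw_ = composed_;  // Simplification: keystroke history is discarded.
    }

    // Restoration replaces the composed text with the raw keystrokes.
    void restoreLastWord() override { composed_ = raw_; }

    [[nodiscard]] std::u32string processedString() const override { return composed_; }

private:
    Mode mode_{Mode::Vietnamese};
    std::u32string raw_;
    std::u32string composed_;
};

// Factory function, so callers depend only on the interface.
[[nodiscard]] inline std::unique_ptr<IEngine> createEngine() {
    return std::make_unique<ToyEngine>();
}

}  // namespace sketch
```

With this toy engine, feeding the keys d, d, i yields "đi", and restoreLastWord() brings back the raw "ddi". Hiding the concrete class behind a factory is one way to keep the public surface small.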

Charset and Encoding Support

  • Added CharsetDefinition in src/engine/charset_definition.h to manage character set mappings and encoding, and provided an Encoder class (src/engine/encoder.h, src/engine/encoder.cpp) to encode Unicode input according to the specified charset.

Migration Documentation

  • Added docs/migration-semantics.md, documenting the intended contract between the Go and C++ engines, required parity areas, and migration rules to ensure behavioral consistency and safe evolution.

Project Metadata

  • Updated the Go module path in go.mod to reflect the new repository location.

@hthienloc hthienloc marked this pull request as draft March 22, 2026 05:38
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational C++ implementation for the Bamboo input method engine, designed to replicate the functionality of its legacy Go counterpart with a refined API. It establishes a clear migration path through new documentation and incorporates robust character set handling and encoding capabilities. The changes lay the groundwork for a modernized input system while ensuring behavioral consistency.

Highlights

  • C++ Engine API: Introduced a new public C++ interface (IEngine) and a concrete implementation (Engine) to handle Vietnamese input processing, key handling, deletion, and restoration, aiming for parity with the legacy Go engine but with a narrower API.
  • Charset and Encoding Support: Added CharsetDefinition for managing character set mappings and an Encoder class to handle Unicode input encoding based on specified charsets.
  • Migration Documentation: Included docs/migration-semantics.md to outline the semantic contract, required parity, and intentional deviations between the Go and C++ engines.
  • Input Handling and Validation: Implemented core logic for processing keys and strings, including handling backspace, validating word spelling, and restoring words, with support for various Vietnamese input methods (Telex, VNI, VIQR, etc.).
  • Differential Testing Framework: Added new test cases and a Python script (run_differential.py) to compare the behavior of the new C++ engine against the existing Go engine, ensuring functional parity.
  • Project Metadata Update: Modified go.mod to reflect the new repository location.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: a C++ implementation of the Bamboo input method engine, intended to replace the legacy Go engine. The changes are extensive, including a new public C++ API, the core engine logic, character and input method definitions, parsers, and a differential testing harness to ensure parity. My review focuses on correctness, maintainability, and potential performance improvements in this new codebase. I've identified a critical bug in character set lookups due to unsorted data, a high-severity issue in UTF-8 decoding that limits Unicode support, and several medium-severity issues related to code duplication, an unused variable, an unnecessary abstraction, and a brittle build process. The provided suggestions aim to address these points to improve the robustness and maintainability of the new C++ engine.
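
The affected charset lookup code is not shown in this conversation, so the following is only a sketch of the general failure mode: a binary search such as std::lower_bound gives meaningful results only on data sorted by the search key. The CharsetEntry layout, the table contents (placeholder byte values, not real charset data), and both function names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <optional>

// Hypothetical entry: a Unicode code point and its single-byte encoding.
struct CharsetEntry {
    char32_t codePoint;
    unsigned char encoded;
};

// Placeholder table, sorted by code point. The byte values are made up.
inline constexpr std::array<CharsetEntry, 4> kSortedTable{{
    {U'\u00E0', 0x01},
    {U'\u00E1', 0x02},
    {U'\u1EA1', 0x03},
    {U'\u1EA3', 0x04},
}};

// Returns true when the table satisfies the precondition of binary search.
template <std::size_t N>
[[nodiscard]] bool isSortedByCodePoint(const std::array<CharsetEntry, N>& table) {
    return std::is_sorted(table.begin(), table.end(),
                          [](const CharsetEntry& a, const CharsetEntry& b) {
                              return a.codePoint < b.codePoint;
                          });
}

// Binary search by code point; only correct if the table is sorted.
template <std::size_t N>
[[nodiscard]] std::optional<unsigned char> lookup(const std::array<CharsetEntry, N>& table,
                                                  char32_t codePoint) {
    const auto it = std::lower_bound(
        table.begin(), table.end(), codePoint,
        [](const CharsetEntry& entry, char32_t value) { return entry.codePoint < value; });
    if (it == table.end() || it->codePoint != codePoint) {
        return std::nullopt;
    }
    return it->encoded;
}
```

A check like isSortedByCodePoint can run once in a unit test or at startup, so an unsorted table is caught before lookups silently miss entries.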

Comment on lines +14 to +37
[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
    std::u32string output;
    output.reserve(input.size());
    for (std::size_t index = 0; index < input.size();) {
        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
        if (byte0 < 0x80) {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
            index += 2;
        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
            index += 3;
        } else {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        }
    }
    return output;
}

high

This decodeUtf8 function does not handle 4-byte UTF-8 sequences, which means it cannot correctly decode Unicode code points above U+FFFF (e.g., many emojis). This is a bug that limits Unicode support.

Additionally, this function is duplicated in src/engine/rules_parser.cpp, and a similar decoder exists in tools/run_differential.py. This code duplication makes the code harder to maintain and more error-prone.

I recommend creating a single, correct implementation of this function in a shared utility file and using it in all required places. The implementation should be extended to handle 4-byte sequences.

[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
    std::u32string output;
    output.reserve(input.size());
    for (std::size_t index = 0; index < input.size();) {
        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
        if (byte0 < 0x80) {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
            index += 2;
        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
            index += 3;
        } else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) | ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU)));
            index += 4;
        } else {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        }
    }
    return output;
}
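
The suggested decoder accepts any bytes after a lead byte without checking that they are continuation bytes, and it does not reject overlong encodings or surrogate code points. A stricter variant could look like the sketch below. The name decodeUtf8Strict is hypothetical, and emitting U+FFFD for each invalid byte is an assumption that differs from the PR's behavior of passing the raw byte through as a code point.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stricter decoder: emits U+FFFD for every byte that does not
// start a well-formed sequence, then resynchronizes on the next byte.
[[nodiscard]] std::u32string decodeUtf8Strict(std::string_view input) {
    constexpr char32_t kReplacement = U'\uFFFD';
    std::u32string output;
    output.reserve(input.size());
    std::size_t index = 0;
    while (index < input.size()) {
        const auto byte0 = static_cast<unsigned char>(input[index]);
        std::size_t length = 0;  // 0 means "not a valid lead byte"
        char32_t codePoint = 0;
        char32_t minimum = 0;    // smallest value allowed for this length
        if (byte0 < 0x80U) {
            length = 1; codePoint = byte0;
        } else if ((byte0 & 0xE0U) == 0xC0U) {
            length = 2; codePoint = byte0 & 0x1FU; minimum = 0x80;
        } else if ((byte0 & 0xF0U) == 0xE0U) {
            length = 3; codePoint = byte0 & 0x0FU; minimum = 0x800;
        } else if ((byte0 & 0xF8U) == 0xF0U) {
            length = 4; codePoint = byte0 & 0x07U; minimum = 0x10000;
        }
        bool valid = length != 0 && index + length <= input.size();
        for (std::size_t k = 1; valid && k < length; ++k) {
            const auto next = static_cast<unsigned char>(input[index + k]);
            if ((next & 0xC0U) != 0x80U) {
                valid = false;  // not a 10xxxxxx continuation byte
            } else {
                codePoint = (codePoint << 6) | (next & 0x3FU);
            }
        }
        // Reject overlong forms, values past U+10FFFF, and surrogates.
        if (valid && (codePoint < minimum || codePoint > 0x10FFFF ||
                      (codePoint >= 0xD800 && codePoint <= 0xDFFF))) {
            valid = false;
        }
        if (valid) {
            output.push_back(codePoint);
            index += length;
        } else {
            output.push_back(kReplacement);
            ++index;
        }
    }
    return output;
}
```

Whether strictness is wanted here depends on the parity contract with the Go engine. If the Go side passes malformed bytes through, a strict decoder would be an intentional deviation and would need to be documented as such.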

Comment on lines +39 to +44
[[nodiscard]] int findVowelPosition(char32_t chr) noexcept {
    static constexpr std::u32string_view kVowels =
        U"aàáảãạăằắẳẵặâầấẩẫậeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ";
    const auto pos = kVowels.find(chr);
    return pos == std::u32string_view::npos ? -1 : static_cast<int>(pos);
}

medium

This function performs a linear search over a string of 72 code points (12 base vowels with 6 tone forms each). As it's called within stripTone, which is part of the validation logic (isValid), it may be executed frequently and could become a performance bottleneck.

For better performance, consider using a more efficient lookup structure, such as a std::map or a sorted std::array with binary search. This function is also duplicated in other files, and consolidating it would be beneficial.
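
One way to apply the sorted-array suggestion while still returning the position in the original vowel ordering is sketched below. The names sortedVowels and findVowelPositionFast are hypothetical. The sketch assumes callers rely on the returned index, so each entry keeps its original position alongside the code point.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string_view>
#include <utility>

inline constexpr std::u32string_view kVowels =
    U"aàáảãạăằắẳẵặâầấẩẫậeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ";

// (code point, index in kVowels), sorted by code point.
using VowelEntry = std::pair<char32_t, int>;

// Builds the sorted table once, on first use.
inline const std::array<VowelEntry, kVowels.size()>& sortedVowels() {
    static const auto table = [] {
        std::array<VowelEntry, kVowels.size()> entries{};
        for (std::size_t i = 0; i < kVowels.size(); ++i) {
            entries[i] = {kVowels[i], static_cast<int>(i)};
        }
        std::sort(entries.begin(), entries.end());
        return entries;
    }();
    return table;
}

// Binary search; returns the original index in kVowels, or -1 if absent.
[[nodiscard]] inline int findVowelPositionFast(char32_t chr) {
    const auto& table = sortedVowels();
    const auto it = std::lower_bound(
        table.begin(), table.end(), chr,
        [](const VowelEntry& entry, char32_t value) { return entry.first < value; });
    return (it != table.end() && it->first == chr) ? it->second : -1;
}
```

For a table this small, a linear scan over contiguous memory may perform about as well as a binary search, so it is worth measuring before switching.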

void handleBackspace();

api::Mode mode_{api::Mode::Vietnamese};
std::string dataDirPath_;

medium

The member variable dataDirPath_ is initialized in the constructor but is never used within the Engine class. This appears to be dead code and should be removed to simplify the class.

Comment on lines +9 to +13
class Encoder final {
public:
    [[nodiscard]] static std::string encode(std::string_view charsetName, std::u32string_view input);
    [[nodiscard]] static const std::array<std::string_view, 17>& charsetNames() noexcept;
};

medium

The Encoder class appears to be an unnecessary abstraction layer. Its static methods simply forward calls to the CharsetDefinition class. To simplify the design and reduce boilerplate, consider removing the Encoder class and using CharsetDefinition directly where its functionality is needed.

Comment on lines +26 to +56
[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
    std::u32string output;
    output.reserve(input.size());
    for (std::size_t index = 0; index < input.size();) {
        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
        if (byte0 < 0x80) {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
            index += 2;
        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
            index += 3;
        } else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) |
                                                   ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU)));
            index += 4;
        } else {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        }
    }
    return output;
}

medium

This decodeUtf8 function is a duplicate of the one in src/engine/engine.cpp and also contains the same bug: it does not handle 4-byte UTF-8 sequences. To improve maintainability and fix the bug, this function should be moved to a shared utility file and corrected to support the full Unicode range.

Comment on lines +223 to +237
char32_t toLowerCodePoint(char32_t codePoint) noexcept {
    if (codePoint >= U'A' && codePoint <= U'Z') {
        return codePoint + 32;
    }
    switch (codePoint) {
        case U'Đ': return U'đ';
        case U'Â': return U'â';
        case U'Ă': return U'ă';
        case U'Ê': return U'ê';
        case U'Ô': return U'ô';
        case U'Ơ': return U'ơ';
        case U'Ư': return U'ư';
        default: return codePoint;
    }
}

medium

The function toLowerCodePoint is duplicated in src/engine/engine.cpp and src/engine/rules_parser.cpp. Several other utility functions like toUpperCodePoint, isVowel, findVowelPosition, and addToneToChar are also duplicated across multiple files. To improve maintainability and reduce redundancy, these common functions should be extracted into a shared utility header and source file (e.g., char_utils.h/.cpp).
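
A shared header along the lines suggested above might look like the following sketch. The include guard, the namespace bamboo::char_utils, and the body of toUpperCodePoint are assumptions. Only toLowerCodePoint mirrors code shown in this review; the PR's real toUpperCodePoint is not shown and may differ.

```cpp
// char_utils.h (file name taken from the review's example; contents hypothetical)
#ifndef BAMBOO_CHAR_UTILS_H
#define BAMBOO_CHAR_UTILS_H

namespace bamboo::char_utils {

// Lowercases ASCII letters and the seven Vietnamese base letters handled by
// the PR's toLowerCodePoint. Other code points are returned unchanged.
[[nodiscard]] constexpr char32_t toLowerCodePoint(char32_t codePoint) noexcept {
    if (codePoint >= U'A' && codePoint <= U'Z') {
        return static_cast<char32_t>(codePoint + 32);
    }
    switch (codePoint) {
        case U'\u0110': return U'\u0111';  // Đ -> đ
        case U'\u00C2': return U'\u00E2';  // Â -> â
        case U'\u0102': return U'\u0103';  // Ă -> ă
        case U'\u00CA': return U'\u00EA';  // Ê -> ê
        case U'\u00D4': return U'\u00F4';  // Ô -> ô
        case U'\u01A0': return U'\u01A1';  // Ơ -> ơ
        case U'\u01AF': return U'\u01B0';  // Ư -> ư
        default: return codePoint;
    }
}

// Assumed inverse of toLowerCodePoint over the same set of letters.
[[nodiscard]] constexpr char32_t toUpperCodePoint(char32_t codePoint) noexcept {
    if (codePoint >= U'a' && codePoint <= U'z') {
        return static_cast<char32_t>(codePoint - 32);
    }
    switch (codePoint) {
        case U'\u0111': return U'\u0110';  // đ -> Đ
        case U'\u00E2': return U'\u00C2';  // â -> Â
        case U'\u0103': return U'\u0102';  // ă -> Ă
        case U'\u00EA': return U'\u00CA';  // ê -> Ê
        case U'\u00F4': return U'\u00D4';  // ô -> Ô
        case U'\u01A1': return U'\u01A0';  // ơ -> Ơ
        case U'\u01B0': return U'\u01AF';  // ư -> Ư
        default: return codePoint;
    }
}

}  // namespace bamboo::char_utils

#endif  // BAMBOO_CHAR_UTILS_H
```

Defining the helpers as constexpr functions in a header gives every translation unit the same definition without a separate source file. Functions that need larger tables could still live in a matching .cpp file.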

Comment on lines +329 to +343
cpp_cmd = [
    "g++",
    "-std=c++17",
    "-Iinclude",
    str(cpp_src),
    "src/engine/engine.cpp",
    "src/engine/spelling.cpp",
    "src/engine/encoder.cpp",
    "src/engine/charset_definition.cpp",
    "src/engine/input_method_definition.cpp",
    "src/engine/rules_parser.cpp",
    "src/engine/transformation_utils.cpp",
    "-o",
    str(cpp_bin),
]

medium

The C++ build command is hardcoded in this Python script, with a manually maintained list of all source files. This approach is brittle and will become difficult to manage as the project grows. It would be more robust and maintainable to use a build system like CMake to handle the compilation process. This would automate file discovery and make it easier to manage dependencies and build configurations.
