Refactor input handling, validation, and implement Vietnamese engine by hthienloc · Pull Request #3 · LotusInputMethod/bamboo-core

hthienloc · 2026-03-22T05:38:21Z

This pull request introduces the initial C++ engine implementation for the Bamboo input method, aiming to match the legacy Go engine's behavior while providing a modernized, narrowed API. The changes include new documentation on migration semantics, a public C++ API for the engine, and the first implementation of core engine logic and supporting classes.

Key changes:

Engine API and Implementation

Introduced a new public C++ interface IEngine in include/bamboo/IEngine.h, defining the engine's stateful API with methods for mode switching, key processing, string processing, deletion, and restoration, intentionally narrowing the public surface compared to the Go engine.
Added a concrete implementation Engine in src/engine/engine.h and src/engine/engine.cpp, which matches Go engine semantics for Vietnamese input processing, key handling, deletion, and restoration, and provides the public API via a factory function.

Charset and Encoding Support

Added CharsetDefinition in src/engine/charset_definition.h to manage character set mappings and encoding, and provided an Encoder class (src/engine/encoder.h, src/engine/encoder.cpp) to encode Unicode input according to the specified charset.

Migration Documentation

Added docs/migration-semantics.md, documenting the intended contract between the Go and C++ engines, required parity areas, and migration rules to ensure behavioral consistency and safe evolution.

Project Metadata

Updated the Go module path in go.mod to reflect the new repository location.

gemini-code-assist · 2026-03-22T05:39:03Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational C++ implementation for the Bamboo input method engine, designed to replicate the functionality of its legacy Go counterpart with a refined API. It establishes a clear migration path through new documentation and incorporates robust character set handling and encoding capabilities. The changes lay the groundwork for a modernized input system while ensuring behavioral consistency.

Highlights

C++ Engine API: Introduced a new public C++ interface (IEngine) and a concrete implementation (Engine) to handle Vietnamese input processing, key handling, deletion, and restoration, aiming for parity with the legacy Go engine but with a narrower API.
Charset and Encoding Support: Added CharsetDefinition for managing character set mappings and an Encoder class to handle Unicode input encoding based on specified charsets.
Migration Documentation: Included docs/migration-semantics.md to outline the semantic contract, required parity, and intentional deviations between the Go and C++ engines.
Input Handling and Validation: Implemented core logic for processing keys and strings, including handling backspace, validating word spelling, and restoring words, with support for various Vietnamese input methods (Telex, VNI, VIQR, etc.).
Differential Testing Framework: Added new test cases and a Python script (run_differential.py) to compare the behavior of the new C++ engine against the existing Go engine, ensuring functional parity.
Project Metadata Update: Modified go.mod to reflect the new repository location.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a significant new feature: a C++ implementation of the Bamboo input method engine, intended to replace the legacy Go engine. The changes are extensive, including a new public C++ API, the core engine logic, character and input method definitions, parsers, and a differential testing harness to ensure parity. My review focuses on correctness, maintainability, and potential performance improvements in this new codebase. I've identified a critical bug in character set lookups due to unsorted data, a high-severity issue in UTF-8 decoding that limits Unicode support, and several medium-severity issues related to code duplication, an unused variable, an unnecessary abstraction, and a brittle build process. The provided suggestions aim to address these points to improve the robustness and maintainability of the new C++ engine.

src/engine/charset_definition.cpp

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/engine.cpp

+[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
+    std::u32string output;
+    output.reserve(input.size());
+    for (std::size_t index = 0; index < input.size();) {
+        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
+        if (byte0 < 0x80) {
+            output.push_back(static_cast<char32_t>(byte0));
+            ++index;
+        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
+            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
+            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
+            index += 2;
+        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
+            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
+            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
+            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
+            index += 3;
+        } else {
+            output.push_back(static_cast<char32_t>(byte0));
+            ++index;
+        }
+    }
+    return output;
+}


This decodeUtf8 function does not handle 4-byte UTF-8 sequences, which means it cannot correctly decode Unicode code points above U+FFFF (e.g., many emojis). This is a bug that limits Unicode support.

Additionally, this function is duplicated in src/engine/rules_parser.cpp, and a similar decoder exists in tools/run_differential.py. This code duplication makes the code harder to maintain and more error-prone.

I recommend creating a single, correct implementation of this function in a shared utility file and using it in all required places. The implementation should be extended to handle 4-byte sequences.

[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
    std::u32string output;
    output.reserve(input.size());
    for (std::size_t index = 0; index < input.size();) {
        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
        if (byte0 < 0x80) {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
            index += 2;
        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
            index += 3;
        } else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) {
            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
            const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]);
            output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) |
                                                   ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU)));
            index += 4;
        } else {
            output.push_back(static_cast<char32_t>(byte0));
            ++index;
        }
    }
    return output;
}
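The suggested replacement above adds the 4-byte branch but, like the original, never checks that the trailing bytes are continuation bytes (10xxxxxx), and it accepts overlong encodings. If the shared utility is meant to be strict, something along the following lines could serve as a starting point. This is only a sketch: the name decodeUtf8Strict and the namespace text_sketch are hypothetical, and substituting U+FFFD for malformed input is an assumed policy, not something the PR specifies. The existing code instead emits the raw byte, and parity with the Go engine may call for a different choice.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace text_sketch {  // hypothetical namespace, for illustration only

// Decodes UTF-8 into code points. Malformed bytes become U+FFFD
// (assumed policy; the PR's decoder emits the raw byte instead).
[[nodiscard]] inline std::u32string decodeUtf8Strict(std::string_view input) {
    constexpr char32_t kReplacement = char32_t{0xFFFD};
    std::u32string output;
    output.reserve(input.size());
    std::size_t index = 0;
    while (index < input.size()) {
        const auto byte0 = static_cast<unsigned char>(input[index]);
        std::size_t length = 0;  // total bytes in this sequence, 0 = bad lead byte
        char32_t codePoint = 0;
        char32_t minimum = 0;    // smallest value that legitimately needs `length` bytes
        if (byte0 < 0x80U) {
            length = 1; codePoint = byte0; minimum = 0;
        } else if ((byte0 & 0xE0U) == 0xC0U) {
            length = 2; codePoint = byte0 & 0x1FU; minimum = 0x80;
        } else if ((byte0 & 0xF0U) == 0xE0U) {
            length = 3; codePoint = byte0 & 0x0FU; minimum = 0x800;
        } else if ((byte0 & 0xF8U) == 0xF0U) {
            length = 4; codePoint = byte0 & 0x07U; minimum = 0x10000;
        }
        // The whole sequence must fit in the remaining input.
        bool valid = length != 0 && index + length <= input.size();
        for (std::size_t offset = 1; valid && offset < length; ++offset) {
            const auto next = static_cast<unsigned char>(input[index + offset]);
            valid = (next & 0xC0U) == 0x80U;  // must look like 10xxxxxx
            codePoint = (codePoint << 6) | (next & 0x3FU);
        }
        // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        valid = valid && codePoint >= minimum && codePoint <= 0x10FFFF &&
                !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
        if (valid) {
            output.push_back(codePoint);
            index += length;
        } else {
            output.push_back(kReplacement);
            ++index;  // resynchronise one byte at a time
        }
    }
    return output;
}

}  // namespace text_sketch
```

With this sketch, the three bytes E1 BB 87 decode to U+1EC7 (ệ), while the invalid pair C3 28 decodes to U+FFFD followed by '('.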

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/engine.cpp

+[[nodiscard]] int findVowelPosition(char32_t chr) noexcept {
+    static constexpr std::u32string_view kVowels =
+        U"aàáảãạăằắẳẵặâầấẩẫậeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ";
+    const auto pos = kVowels.find(chr);
+    return pos == std::u32string_view::npos ? -1 : static_cast<int>(pos);
+}


This function performs a linear search over a string of 72 code points (12 base vowels with 6 tone forms each). As it's called within stripTone, which is part of the validation logic (isValid), it may be executed frequently and could become a performance bottleneck.

For better performance, consider using a more efficient lookup structure, such as a std::map or a sorted std::array with binary search. This function is also duplicated in other files, and consolidating it would be beneficial.
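One way to act on this while preserving the positional index that findVowelPosition returns is to sort (code point, original index) pairs once and binary-search them. The sketch below assumes this approach; findVowelPositionSorted and the vowel_sketch namespace are hypothetical names, and unlike the original it is not noexcept because the table is allocated on first use. With a table this small the gain may be modest, so it is worth measuring before adopting.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

namespace vowel_sketch {  // hypothetical namespace, for illustration only

// Same vowel table as in the diff hunk above.
inline constexpr std::u32string_view kVowels =
    U"aàáảãạăằắẳẵặâầấẩẫậeèéẻẽẹêềếểễệiìíỉĩịoòóỏõọôồốổỗộơờớởỡợuùúủũụưừứửữựyỳýỷỹỵ";

// Returns the index of chr in kVowels, or -1 if it is not a vowel.
[[nodiscard]] inline int findVowelPositionSorted(char32_t chr) {
    // Built once on first call: (code point, index in kVowels), sorted by code point.
    static const std::vector<std::pair<char32_t, int>> table = [] {
        std::vector<std::pair<char32_t, int>> entries;
        entries.reserve(kVowels.size());
        for (std::size_t i = 0; i < kVowels.size(); ++i) {
            entries.emplace_back(kVowels[i], static_cast<int>(i));
        }
        std::sort(entries.begin(), entries.end());
        return entries;
    }();
    // Binary search for the first entry whose code point is not less than chr.
    const auto it = std::lower_bound(
        table.begin(), table.end(), chr,
        [](const std::pair<char32_t, int>& entry, char32_t value) {
            return entry.first < value;
        });
    return (it != table.end() && it->first == chr) ? it->second : -1;
}

}  // namespace vowel_sketch
```

Storing the original index alongside each code point matters because callers such as stripTone presumably rely on the position within the tone-ordered string, which a plain sorted array would lose.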

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/engine.h

+    void handleBackspace();
+
+    api::Mode mode_{api::Mode::Vietnamese};
+    std::string dataDirPath_;


The member variable dataDirPath_ is initialized in the constructor but is never used within the Engine class. This appears to be dead code and should be removed to simplify the class.

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/encoder.h

+class Encoder final {
+public:
+    [[nodiscard]] static std::string encode(std::string_view charsetName, std::u32string_view input);
+    [[nodiscard]] static const std::array<std::string_view, 17>& charsetNames() noexcept;
+};


The Encoder class appears to be an unnecessary abstraction layer. Its static methods simply forward calls to the CharsetDefinition class. To simplify the design and reduce boilerplate, consider removing the Encoder class and using CharsetDefinition directly where its functionality is needed.

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/rules_parser.cpp

+[[nodiscard]] std::u32string decodeUtf8(std::string_view input) {
+    std::u32string output;
+    output.reserve(input.size());
+    for (std::size_t index = 0; index < input.size();) {
+        const unsigned char byte0 = static_cast<unsigned char>(input[index]);
+        if (byte0 < 0x80) {
+            output.push_back(static_cast<char32_t>(byte0));
+            ++index;
+        } else if ((byte0 & 0xE0U) == 0xC0U && index + 1 < input.size()) {
+            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
+            output.push_back(static_cast<char32_t>(((byte0 & 0x1FU) << 6) | (byte1 & 0x3FU)));
+            index += 2;
+        } else if ((byte0 & 0xF0U) == 0xE0U && index + 2 < input.size()) {
+            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
+            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
+            output.push_back(static_cast<char32_t>(((byte0 & 0x0FU) << 12) | ((byte1 & 0x3FU) << 6) | (byte2 & 0x3FU)));
+            index += 3;
+        } else if ((byte0 & 0xF8U) == 0xF0U && index + 3 < input.size()) {
+            const unsigned char byte1 = static_cast<unsigned char>(input[index + 1]);
+            const unsigned char byte2 = static_cast<unsigned char>(input[index + 2]);
+            const unsigned char byte3 = static_cast<unsigned char>(input[index + 3]);
+            output.push_back(static_cast<char32_t>(((byte0 & 0x07U) << 18) | ((byte1 & 0x3FU) << 12) |
+                                                   ((byte2 & 0x3FU) << 6) | (byte3 & 0x3FU)));
+            index += 4;
+        } else {
+            output.push_back(static_cast<char32_t>(byte0));
+            ++index;
+        }
+    }
+    return output;
+}


This decodeUtf8 function duplicates the one in src/engine/engine.cpp. Note that the copy shown here already includes a 4-byte branch, whereas the engine.cpp copy does not, so the two have diverged. To improve maintainability and keep the behavior consistent, this function should be moved to a shared utility file, with a single implementation that supports the full Unicode range.

gemini-code-assist · 2026-03-22T05:41:08Z

src/engine/transformation_utils.cpp

+char32_t toLowerCodePoint(char32_t codePoint) noexcept {
+    if (codePoint >= U'A' && codePoint <= U'Z') {
+        return codePoint + 32;
+    }
+    switch (codePoint) {
+    case U'Đ': return U'đ';
+    case U'Â': return U'â';
+    case U'Ă': return U'ă';
+    case U'Ê': return U'ê';
+    case U'Ô': return U'ô';
+    case U'Ơ': return U'ơ';
+    case U'Ư': return U'ư';
+    default: return codePoint;
+    }
+}


The function toLowerCodePoint is duplicated in src/engine/engine.cpp and src/engine/rules_parser.cpp. Several other utility functions like toUpperCodePoint, isVowel, findVowelPosition, and addToneToChar are also duplicated across multiple files. To improve maintainability and reduce redundancy, these common functions should be extracted into a shared utility header and source file (e.g., char_utils.h/.cpp).
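As a rough illustration of the extraction, a shared header could derive both case-mapping directions from one table so the copies cannot drift apart. This is a sketch only: the char_utils_sketch namespace, the CasePair struct, and kCasePairs are hypothetical, and toUpperCodePoint is assumed to mirror the toLowerCodePoint shown above since its body does not appear in this review. Like the original, it covers only the untoned base letters.

```cpp
#include <array>

namespace char_utils_sketch {  // hypothetical namespace, for illustration only

struct CasePair {
    char32_t upper;
    char32_t lower;
};

// Non-ASCII base letters handled by the switch in the diff hunk above.
inline constexpr std::array<CasePair, 7> kCasePairs{{
    {U'\u0110', U'\u0111'},  // Đ / đ
    {U'\u00C2', U'\u00E2'},  // Â / â
    {U'\u0102', U'\u0103'},  // Ă / ă
    {U'\u00CA', U'\u00EA'},  // Ê / ê
    {U'\u00D4', U'\u00F4'},  // Ô / ô
    {U'\u01A0', U'\u01A1'},  // Ơ / ơ
    {U'\u01AF', U'\u01B0'},  // Ư / ư
}};

[[nodiscard]] constexpr char32_t toLowerCodePoint(char32_t codePoint) noexcept {
    if (codePoint >= U'A' && codePoint <= U'Z') {
        return codePoint + 32;
    }
    for (const CasePair& pair : kCasePairs) {
        if (pair.upper == codePoint) {
            return pair.lower;
        }
    }
    return codePoint;  // anything else is returned unchanged
}

// Assumed to be the exact inverse of toLowerCodePoint.
[[nodiscard]] constexpr char32_t toUpperCodePoint(char32_t codePoint) noexcept {
    if (codePoint >= U'a' && codePoint <= U'z') {
        return codePoint - 32;
    }
    for (const CasePair& pair : kCasePairs) {
        if (pair.lower == codePoint) {
            return pair.upper;
        }
    }
    return codePoint;
}

}  // namespace char_utils_sketch
```

Because constexpr functions are implicitly inline, both helpers can live entirely in the header and be included from engine.cpp, rules_parser.cpp, and transformation_utils.cpp without a separate source file.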

gemini-code-assist · 2026-03-22T05:41:08Z

tools/run_differential.py

+        cpp_cmd = [
+            "g++",
+            "-std=c++17",
+            "-Iinclude",
+            str(cpp_src),
+            "src/engine/engine.cpp",
+            "src/engine/spelling.cpp",
+            "src/engine/encoder.cpp",
+            "src/engine/charset_definition.cpp",
+            "src/engine/input_method_definition.cpp",
+            "src/engine/rules_parser.cpp",
+            "src/engine/transformation_utils.cpp",
+            "-o",
+            str(cpp_bin),
+        ]


The C++ build command is hardcoded in this Python script, with a manually maintained list of all source files. This approach is brittle and will become difficult to manage as the project grows. It would be more robust and maintainable to use a build system like CMake to handle the compilation process. This would automate file discovery and make it easier to manage dependencies and build configurations.

hthienloc added 7 commits March 21, 2026 12:26

feat: Refactor backspace and UI shortcut logic, add input validation,…

9ee0f53

… update module path, and include new tests.

refactor: simplify backspace handling logic and update related test e…

b4872ea

…xpectations.

Fix Telex uong shortcut transformation

bd87e7d

Merge pull request #1 from hthienloc/codex/migrate-bamboo-core-librar…

72e561c

…y-to-c++ Implement Vietnamese input engine, charset encoder, rules parser, and smoke tests

Refactor composition view to use deque slices

873e468

Merge pull request #2 from hthienloc/codex/migrate-bamboo-core-from-g…

8c9247f

…o-to-c++ Engine: migrate internal transformation logic, add tone/backspace semantics and differential harness

Revert go files to master state

cf7a1bc

hthienloc marked this pull request as draft March 22, 2026 05:38

gemini-code-assist bot reviewed Mar 22, 2026

hthienloc wants to merge 7 commits into LotusInputMethod:master from hthienloc:cpp-migration
