
Add new tokenizer interface in monlp. Added gojieba for dictionary based Chinese word breaking. #24272

Closed
fengttt wants to merge 8 commits into matrixorigin:main from fengttt:main

Conversation

@fengttt
Contributor

@fengttt fengttt commented May 5, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24271

What this PR does / why we need it:

Add dictionary based Chinese word breaking using gojieba.

fengttt and others added 7 commits May 4, 2026 10:47
…ize takes input

Add a Tokenizer interface with a single Tokenize([]byte) method.
NewSimpleTokenizer() now takes no parameters; the input is passed to
Tokenize, which resets internal state on each call. Update all callers
in pkg/fulltext and pkg/sql/colexec/table_function accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
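The refactor above can be sketched as follows. This is a hypothetical illustration of the interface shape the commit describes (construction takes no parameters; each Tokenize call receives the input and resets state); the real types in pkg/monlp/tokenizer may differ in detail, and toyTokenizer stands in for SimpleTokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is an illustrative token shape: the text plus its byte offset.
type Token struct {
	Text    string
	BytePos int
}

// Tokenizer is the single-method interface the commit describes.
type Tokenizer interface {
	Tokenize(input []byte) []Token
}

// toyTokenizer splits on whitespace, standing in for SimpleTokenizer.
type toyTokenizer struct{ toks []Token }

func (t *toyTokenizer) Tokenize(input []byte) []Token {
	t.toks = t.toks[:0] // reset internal state on each call
	s := string(input)
	pos := 0
	for _, w := range strings.Fields(s) {
		pos = strings.Index(s[pos:], w) + pos
		t.toks = append(t.toks, Token{Text: strings.ToLower(w), BytePos: pos})
		pos += len(w)
	}
	return t.toks
}

func main() {
	var tk Tokenizer = &toyTokenizer{}
	fmt.Println(tk.Tokenize([]byte("Hello World"))) // [{hello 0} {world 6}]
}
```

Callers in pkg/fulltext and the table functions would then hold a Tokenizer value and construct the concrete implementation once, rather than per input.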
Adds JiebaTokenizer implementing the Tokenizer interface using
github.com/yanyiwu/gojieba for Chinese word segmentation. Factory
NewJiebaTokenizer(useHmm bool) toggles HMM new-word discovery.
Output drops pure breaker tokens, lowers Latin runs, and truncates
to MAX_TOKEN_SIZE on a UTF-8 boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
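The "truncates to MAX_TOKEN_SIZE on a UTF-8 boundary" step can be sketched like this. A minimal illustration, assuming a stand-in constant for MAX_TOKEN_SIZE; the key property is that a cut never lands inside a multi-byte sequence, because continuation bytes are identifiable (10xxxxxx).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxTokenSize is a stand-in for the package's MAX_TOKEN_SIZE constant.
const maxTokenSize = 8

// truncateUTF8 shortens b to at most max bytes without splitting a
// multi-byte UTF-8 character: if the candidate boundary lands on a
// continuation byte, back up until it lands on a rune start.
func truncateUTF8(b []byte, max int) []byte {
	if len(b) <= max {
		return b
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(b[cut]) {
		cut--
	}
	return b[:cut]
}

func main() {
	s := []byte("数据库引擎") // each character is 3 bytes
	// An 8-byte cut would split 库, so the boundary backs up to 6 bytes.
	fmt.Println(string(truncateUTF8(s, maxTokenSize))) // 数据
}
```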
Allows CREATE FULLTEXT INDEX ... WITH PARSER gojieba to use
gojieba word segmentation for both index build and query
tokenization, replacing ngram bigrams for Chinese text.

- DDL accepts WITH PARSER gojieba; rejects unknown parsers as before.
- Index build (fulltext_index_tokenize) uses HMM=false for stable,
  reproducible segmentation across deployments.
- Query path (ParsePatternInNLMode + boolean-mode keyword reseg in
  GenTextSql) uses HMM=true for broader recall on unseen terms.
- ParsePattern / ParsePatternInNLMode take parser; NewSearchAccum
  extracts it from the JSON params already threaded through
  fulltext_index_scan, so existing callers keep their signatures.
- SharedJiebaTokenizer(useHmm) memoizes two singletons; the ~1s
  dictionary load is paid once per HMM setting per process.

BVT: test/distributed/cases/fulltext/gojieba.sql mirrors the default
simple-tokenizer coverage in fulltext.sql against gojieba.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
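The SharedJiebaTokenizer memoization described above can be sketched with two sync.Once guards, one per HMM setting. This is an illustrative skeleton only: the real constructor wraps gojieba and pays the ~1s dictionary load; here a plain struct stands in for it.

```go
package main

import (
	"fmt"
	"sync"
)

// jiebaTokenizer stands in for the real gojieba-backed tokenizer,
// whose construction loads large dictionaries.
type jiebaTokenizer struct{ useHmm bool }

var (
	onceHmm, onceNoHmm sync.Once
	withHmm, noHmm     *jiebaTokenizer
)

// SharedJiebaTokenizer returns one of two process-wide singletons, so
// the expensive constructor runs at most once per HMM setting.
func SharedJiebaTokenizer(useHmm bool) *jiebaTokenizer {
	if useHmm {
		onceHmm.Do(func() { withHmm = &jiebaTokenizer{useHmm: true} })
		return withHmm
	}
	onceNoHmm.Do(func() { noHmm = &jiebaTokenizer{useHmm: false} })
	return noHmm
}

func main() {
	a := SharedJiebaTokenizer(true)
	b := SharedJiebaTokenizer(true)
	fmt.Println(a == b, a != SharedJiebaTokenizer(false)) // true true
}
```

sync.Once also gives concurrent first callers a consistent view: all of them block until the single initialization finishes.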
The catalog path that serializes fulltext index params into JSON
(fullTextIndexParamsToMap) had its own parser allowlist that wasn't
updated alongside build_ddl's. WITH PARSER gojieba reached DDL
validation but failed during catalog write with
"internal error: invalid parser gojieba".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gojieba is a Chinese-first segmenter: with HMM=false an English word
like "color" has no dictionary entry and falls through to per-character
tokens (c, o, l, o, r), which broke fulltext search on English columns
indexed WITH PARSER gojieba — "select ... where match against('red')"
returned no rows.

Split the input at the ASCII / non-ASCII byte boundary (always a safe
UTF-8 char edge since every multi-byte UTF-8 byte is >= 0x80). ASCII
spans go through SimpleTokenizer's Latin path (proper word splits,
lowercasing, MAX_TOKEN_SIZE truncation); the rest goes to gojieba with
the configured useHmm. This keeps useHmm=false stable for Chinese at
index build time without losing English/digit tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
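The ASCII/non-ASCII pre-split relies on a UTF-8 invariant: every byte of a multi-byte sequence is >= 0x80, so a boundary between an ASCII byte and a non-ASCII byte is always a valid character edge. A minimal sketch of that partitioning (the real code then routes ASCII spans to SimpleTokenizer and the rest to gojieba):

```go
package main

import "fmt"

// splitASCII partitions input into maximal runs of ASCII and
// non-ASCII bytes, reporting which class each run belongs to.
func splitASCII(input []byte) (spans []string, ascii []bool) {
	start := 0
	for i := 0; i <= len(input); i++ {
		atEnd := i == len(input)
		// Stay inside the current run while the byte class is unchanged.
		if !atEnd && i > start && (input[i] < 0x80) == (input[start] < 0x80) {
			continue
		}
		if i > start {
			spans = append(spans, string(input[start:i]))
			ascii = append(ascii, input[start] < 0x80)
			start = i
		}
	}
	return
}

func main() {
	spans, ascii := splitASCII([]byte("red色red"))
	fmt.Println(spans, ascii) // [red 色 red] [true false true]
}
```

With this split, "color" never reaches gojieba's per-character fallback, which is exactly the failure mode the commit fixes.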

Upstream gojieba.NewJieba() resolves its dict directory via runtime.Caller,
baking the GOMODCACHE path into the binary at compile time. Multi-stage
Docker builds and CI runners with stripped module caches panic at runtime
with "Dictionary file does not exist".

Vendor cppjieba's dict files under pkg/monlp/tokenizer/dict/ and pass them
explicitly to NewJieba. Resolution checks MO_JIEBA_DICT_DIR first, then
falls back to a runtime.Caller-relative path that is stable for source
builds and go test. Production Dockerfiles copy the vendored dir to
/usr/local/share/jieba and set MO_JIEBA_DICT_DIR.
@matrix-meow matrix-meow added size/XXL Denotes a PR that changes 2000+ lines and removed size/XL Denotes a PR that changes [1000, 1999] lines labels May 5, 2026
@fengttt fengttt requested a review from cpegeric May 5, 2026 14:45

Copilot AI left a comment


Pull request overview

This PR adds a new fulltext tokenizer/parser option (gojieba) to support dictionary-based Chinese word segmentation, threading the parser choice from DDL/index params through planning, execution (tokenize + scan), and fulltext pattern parsing. It also introduces a tokenizer interface in pkg/monlp/tokenizer and adds BVT coverage plus runtime dictionary packaging for Docker images.

Changes:

  • Add gojieba as an allowed FULLTEXT parser and propagate the parser selection through fulltext params → pattern parsing → SQL generation/execution.
  • Introduce a Tokenizer interface, refactor SimpleTokenizer to a reusable tokenizer object, and add a new JiebaTokenizer implementation (with shared singletons).
  • Add distributed tests for WITH PARSER gojieba and package jieba dictionaries into Docker images via MO_JIEBA_DICT_DIR.

Reviewed changes

Copilot reviewed 24 out of 29 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/distributed/cases/fulltext/gojieba.sql New distributed SQL test coverage for fulltext indexing/search using WITH PARSER gojieba.
test/distributed/cases/fulltext/gojieba.result Expected output for the new gojieba distributed test case.
pkg/sql/plan/fulltext.go Pass parser selection into fulltext pattern parsing when generating fulltext scan SQL.
pkg/sql/plan/build_ddl.go Allow gojieba as a supported FULLTEXT parser in DDL validation.
pkg/sql/colexec/table_function/fulltext.go Thread table-function params into NewSearchAccum so runtime pattern parsing can respect parser choice.
pkg/sql/colexec/table_function/fulltext_tokenize.go Add gojieba tokenization path for index build-time tokenization.
pkg/monlp/tokenizer/simple2_test.go Update tests to new SimpleTokenizer construction + Tokenize(input) API.
pkg/monlp/tokenizer/simple.go Add Tokenizer interface; refactor SimpleTokenizer to accept input per Tokenize() call.
pkg/monlp/tokenizer/simple_test.go Update tests for the refactored SimpleTokenizer API.
pkg/monlp/tokenizer/jieba.go New gojieba-backed tokenizer implementation, including shared singleton instances and ASCII/CJK handling.
pkg/monlp/tokenizer/jieba_test.go Unit tests for gojieba tokenization semantics (HMM vs dictionary-only, byte offsets, truncation, interface conformance).
pkg/monlp/tokenizer/jieba_dict.go Dictionary path resolution logic via MO_JIEBA_DICT_DIR or vendored dict directory.
pkg/monlp/tokenizer/dict/user.dict.utf8 Add custom user dictionary entries for jieba.
pkg/monlp/tokenizer/dict/stop_words.utf8 Add stop-words list for jieba.
pkg/monlp/tokenizer/dict/README.md Document dictionary file purposes (CppJieba dictionary set).
pkg/monlp/tokenizer/dict/LICENSE Add dictionary license file (MIT).
pkg/fulltext/types.go Update helpers to call the new ParsePattern(pattern, mode, parser) signature.
pkg/fulltext/sql.go Route NL-mode pattern splitting through parser-aware ParsePatternInNLMode(..., parser).
pkg/fulltext/jieba_test.go New unit tests verifying parser routing, params parsing, and boolean-mode jieba behavior.
pkg/fulltext/fulltext.go Add parser extraction from params, jieba NL parsing path, and parser-aware fulltext parsing API surface.
pkg/catalog/secondary_index_utils.go Allow gojieba parser in index option parsing/validation.
optools/images/gpu/Dockerfile Copy jieba dicts into image and set MO_JIEBA_DICT_DIR for runtime resolution.
optools/images/Dockerfile Same: package jieba dicts and set MO_JIEBA_DICT_DIR.
go.sum Add gojieba dependency checksums.
go.mod Add github.com/yanyiwu/gojieba v1.4.7 dependency.
.gitignore Ignore additional local files (e.g. CLAUDE.md, logs, scratch SQL).

Comment thread pkg/monlp/tokenizer/jieba_dict.go
Comment thread pkg/fulltext/fulltext.go
list = append(list, &Pattern{Text: word, Operator: TEXT, Position: t.BytePos})
}
if len(list) == 0 {
return nil, moerr.NewInternalErrorNoCtx("Invalid input search string. search string onverted to empty pattern")

Contributor

@aunjgr aunjgr left a comment


Revised Review: gojieba Chinese Word Breaking (deeper analysis)

I'm updating my previous review after deeper analysis. Some concerns are resolved, one remains blocking, and I have new findings.

Resolved from prior review

Thread safety: CONFIRMED SAFE. gojieba's C++ Cut methods are const — all mutable state is stack-local. The author confirmed thread safety in cppjieba issues #76/#82. SharedJiebaTokenizer concurrent access is safe as long as AddWord/RemoveWord are never called (they aren't in this PR).

Blocking Issues

1. resolveJiebaDictDir returns empty string on failure → panic at runtime

If neither MO_JIEBA_DICT_DIR is set nor the source tree exists on disk (standard case for compiled binaries on bare-metal/K8s without the env var), the function returns "". This feeds paths like "/jieba.dict.utf8" to gojieba.NewJieba(), which panics (not errors) if files are missing. The panic will happen inside sync.Once, making it unrecoverable.

Fix: return an explicit error from resolveJiebaDictDir and propagate it up through SharedJiebaTokenizer / NewJiebaTokenizer. Or at minimum, check the returned dir is non-empty and return a clear error from Tokenize.

The Dockerfile sets MO_JIEBA_DICT_DIR correctly, but not all deployment paths go through Docker (dev machines, bare-metal prod, systemd units).

2. ~610K lines of dictionary data in git (product decision, flagging for awareness)

I'll downgrade this from "blocking" to "strong advisory" — it's a legitimate tradeoff (no network dependency at build time vs. permanent clone bloat). But the team should explicitly decide. These files will exist in git history forever even if later removed. Consider git-lfs as a middle ground.

Advisory Issues

3. .gitignore adds unrelated entries

CLAUDE.md, mo-service.log, bug.sql are unrelated to gojieba. Should be a separate commit.

4. SimpleTokenizer struct reuse is not concurrency-safe (non-issue in practice)

The new Tokenize(input) mutates receiver state, making a shared *SimpleTokenizer unsafe for concurrent use. In practice every call site creates a fresh instance via NewSimpleTokenizer(), so this is fine. Just documenting for awareness — don't share instances.

5. parsePatternInNLModeJieba typo in error message

Line: "Invalid input search string. search string onverted to empty pattern" — "onverted" should be "converted", and there is a stray double space.

Design Observations (positive)

  • Index/query tokenization consistency is correct: both sides use SharedJiebaTokenizer(false) (HMM disabled), ensuring the same segmentation at index and query time.
  • The ASCII/CJK pre-split in JiebaTokenizer.Tokenize is sound: delegates Latin to SimpleTokenizer's existing logic, avoids gojieba's per-character fallback for ASCII.
  • Boolean mode threading is correct: ParsePatternInBooleanMode parses structure, then GenTextSql re-tokenizes individual terms with jieba via ParsePatternInNLMode(text, "gojieba").
  • fulltextIndexMatch now correctly passes tableFunction.Params instead of "", connecting the search side to the parser config.
  • Token BytePos tracking with chunkOff + w.Start correctly computes global byte offsets across the ASCII/CJK split.
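The offset bookkeeping in the last bullet reduces to a simple sum, illustrated below with assumed names (chunkOff for the chunk's offset in the original input, wStart for gojieba's within-chunk token start):

```go
package main

import "fmt"

func main() {
	input := "red色" // ASCII chunk "red" (3 bytes) then a non-ASCII chunk
	chunkOff := 3   // the non-ASCII chunk begins after "red"
	wStart := 0     // 色 starts at offset 0 within its own chunk
	// Global byte offset of the token in the original input.
	global := chunkOff + wStart
	fmt.Println(input[global : global+3]) // 色
}
```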

Summary

The architecture is well thought out. The one real blocker is the panic-on-missing-dict path — a production service shouldn't panic inside a sync.Once with no recovery path. Everything else is minor.

@fengttt fengttt closed this May 7, 2026
@fengttt
Contributor Author

fengttt commented May 7, 2026

Reopen, so confused.


Labels

kind/feature, size/XXL (denotes a PR that changes 2000+ lines)
