Added MLOps pipeline and new WordSegmenterCNN class #43
Open

opnuub wants to merge 17 commits into unicode-org:main from opnuub:main
- c1d5e5b (opnuub): MLOps pipeline: initial working version
- fe40364 (opnuub): MLOps pipeline: initial working version
- 8210808 (opnuub): New model architecture implemented, CUDA support for cloud training
- 9a56c0c (opnuub): prepare to merge
- be2c20e (opnuub): Merging from cnn branch
- 2c8e07d (opnuub): polished CNN code
- d5e39b8 (opnuub): faster training
- 0ca28b6 (opnuub): added burmese support
- 43488a7 (opnuub): added mlops support for lstm
- f1389b1 (opnuub): modified train.py to be more general
- 91b77a3 (opnuub): python parser for zh-hant/yue adaboost model
- da9d895 (opnuub): optimised training data handling speed
- 21b4525 (opnuub): added readme files for cnn and adaboost
- 980a08d (opnuub): removed cloudbuild.yaml because it's no longer used
- 0c653a5 (opnuub): fixed pull requests issues
- 5e528d3 (opnuub): fixed bug on train.py for lstm
- 62b6615 (opnuub): Update README.md for Cantonese
@@ -0,0 +1,97 @@
## Convolutional Neural Networks

Author: Shenghong Liu ([email protected])

The existing LSTM-based segmenter processes characters strictly sequentially, so its inference cost grows linearly with input length and cannot be parallelised, which is not ideal for content-heavy platforms. Hence, I introduce a new model architecture for faster word segmentation of Southeast Asian languages such as Thai and Burmese.

<img src="Figures/cnn.jpg" width="30%"/>

The convolutional neural network (CNN) architecture developed in this project achieves faster inference speeds with comparable accuracy for Thai. Not only does the CNN remove the sequential bottleneck, its dilated convolutions also maintain a high level of accuracy by capturing a wider context of the surrounding text.
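
Below is a minimal sketch of this idea in Keras/TensorFlow (the same stack as the training image used in this PR). The layer count, kernel size, dilation rates, vocabulary size, and the 4-class output are illustrative assumptions, not the exact `WordSegmenterCNN` configuration:

```python
# Illustrative dilated-CNN tagger (not the exact WordSegmenterCNN in this PR).
# Each character embedding passes through a stack of Conv1D layers whose
# dilation rates double, so the receptive field grows without extra parameters.
import tensorflow as tf

def build_cnn_segmenter(vocab_size: int, edim: int = 40, filters: int = 32,
                        num_classes: int = 4) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(None,), dtype="int32")   # code points or grapheme clusters
    x = tf.keras.layers.Embedding(vocab_size, edim)(inputs)
    for rate in (1, 2, 4):                                   # growing receptive field
        x = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="same",
                                   dilation_rate=rate, activation="relu")(x)
    # One segmentation label (e.g. a BIES-style tag) per character position.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cnn_segmenter(vocab_size=350)   # vocab size is a placeholder
model.summary()
```

With three dilated layers of kernel size 3 and rates 1, 2, and 4, each output position sees roughly 15 surrounding characters, which is where the wider context comes from.
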

| Model | F1-Score | Model Size | CPU Inference Speed |
|----------|:--------:|:---------:|---------:|
| LSTM Medium | 90.1 | 36 KB | 9.29 ms |
| LSTM Small | 86.7 | 12 KB | 6.68 ms |
| CNN Medium | 90.4 | 28 KB | 3.76 ms |
| ICU | 86.4 | 126 KB | ~0.2 ms |

### Examples

**Test Case 1**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | พระราชประสงค์ของพระบาทสมเสด็จพระเจ้าอยู่หัวในรัชกาลปัจจุบันคือ |
| Manually Segmented | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| CNN | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| ICU | พระ_ราช_ประสงค์_ของ_พระบาท_สม_เสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| LSTM | พระราชประสงค์_ของ_พระบาทสมเสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |

**Test Case 2**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | ในขณะเดียวกันผู้ที่ต้องการเงินเพื่อนำไปลงทุนหรือประกอบกิจการอื่นใด |
| Manually Segmented | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| CNN | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| ICU | ใน_ขณะ_เดียวกัน_ผู้_ที่_ต้องการ_เงิน_เพื่อน_ำ_ไป_ลงทุน_หรือ_ประกอบ_กิจการ_อื่น_ใด |
| LSTM | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้อง_การ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |

**Test Case 3**

| Algorithm | Output |
|----------|:---------|
| Unsegmented | เพราะเพียงกรดนิวคลีอิคของไวรัสอย่างเดียวก็สร้างไวรัสสมบูรณ์ |
| Manually Segmented | เพราะ_เพียง_กรด_นิวคลีอิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| CNN | เพราะ_เพียง_กรด_นิว_คลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| ICU | เพราะ_เพียง_กรด_นิ_วค_ลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| LSTM | เพราะ_เพียง_กรดนิว_คลีอิค_ของ_ไวรัสอย่าง_เดียว_ก็_สร้าง_ไวรัสสมบูรณ์ |

### Hyperparameters

In Vertex AI Custom Training, you need to specify the following hyperparameters:

```
--path=gs://bucket_name/Data/
--language=Thai
--input-type=BEST
--model-type=cnn
--epochs=200
--filters=32
--name=Thai_codepoints_32
--edim=40
--embedding=codepoints
```

* **name:** This is the model name.
* **path:** This is the Google Cloud Storage bucket link.
* **language:** This is the language you'd like to train, such as ```Thai``` or ```Burmese```.
* **input-type:** This is the dataset type, such as ```BEST``` for Thai and ```my``` for Burmese. Refer to [Cloud Usage](Cloud%20Usage.md) for more details.
* **model-type:** This is the model architecture type, such as ```lstm``` or ```cnn```.
* **epochs:** This is the number of training epochs. A value of at least 200 is recommended because the model only trains on 10% of the dataset in each epoch. The model will output the epoch that gives the best validation loss.
* **filters:** This is the number of filters in each Conv1D layer; it plays a significant role in model size, accuracy, and inference speed (see the sketch after this list).
* **edim:** This is ```embedding_dim```, the length of each embedding vector; it also plays a significant role in model size, accuracy, and inference speed.
* **embedding:** This determines the type of embedding used to train the model and can be one of the following:
    * `"grapheme_clusters_tf"`: use this option when grapheme clusters are the embedding unit.
    * `"codepoints"`: use this option when the embedding is based on code points.
* **learning-rate:** This determines the model's learning rate. The default is 0.001.
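
To make the size trade-off concrete, here is a rough, back-of-the-envelope parameter count. The vocabulary size, number of convolutional layers, and kernel size below are assumptions for illustration, not values taken from this PR:

```python
# Rough parameter-count estimate for a dilated-CNN segmenter (illustrative
# assumptions: ~350-symbol code-point vocabulary, three Conv1D layers, kernel 3).
def approx_params(vocab_size: int, edim: int, filters: int,
                  kernel: int = 3, conv_layers: int = 3, classes: int = 4) -> int:
    embedding = vocab_size * edim                           # one vector per symbol
    first_conv = filters * (kernel * edim + 1)              # weights + bias
    later_convs = (conv_layers - 1) * filters * (kernel * filters + 1)
    head = classes * (filters + 1)                          # per-position classifier
    return embedding + first_conv + later_convs + head

# Doubling `filters` roughly doubles the convolutional parameters, which is why
# the model-size column in the tables below grows with the filter count.
for f in (8, 16, 32, 64, 128):
    print(f, approx_params(vocab_size=350, edim=40, filters=f))
```
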
### Model Performance Comparison
**Codepoints**
| Filter Size | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 93.1 | 84.8 | 13 KB |
| 16 | 94.5 | 87.7 | 16 KB |
| 32 | 95.7 | 90.4 | 28 KB |
| 64 | 96.6 | 92.5 | 52 KB |
| 128 | 97.3 | 94.0 | 95 KB |

**Grapheme Clusters**
| Filter Size | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 94.1 | 89.3 | 13 KB |
| 16 | 95.2 | 91.2 | 14 KB |
| 32 | 95.9 | 92.6 | 24 KB |
| 64 | 96.6 | 93.7 | 34 KB |
| 128 | 97.1 | 94.7 | 55 KB |

[Embeddings Discussion](Embeddings%20Discussion.md) gives a detailed comparison between the embedding types.
@@ -0,0 +1,49 @@
To bind the repository to Google Cloud Platform, follow the steps below:

1. In Google Cloud APIs & Services, enable:
   - Secret Manager API
   - Artifact Registry API
   - Vertex AI API
   - Cloud Pub/Sub API
   - Cloud Build API

2. Create a Google Cloud Storage bucket and upload the dataset into gs://bucket_name/Data/ with the following directory structure (a Python upload sketch is given after these steps):

       Data/
       ├── Best/
       │   ├── article/
       │   ├── encyclopedia/
       │   ├── news/
       │   └── novel/
       ├── my_test_segmented.txt
       ├── my_train.txt
       └── my_valid.txt

3. In Artifact Registry, create a repository in the same region as the storage bucket.

4. In Cloud Build, create a trigger in the same region as the Artifact Registry repository.
   - Choose a suitable event (e.g. Push to a branch)
   - Select the 2nd gen repository generation
   - Link the GitHub repository
   - Select Dockerfile (for Configuration) and Repository (for Location)
   - Dockerfile name: ```Dockerfile```, image name: ```us-central1-docker.pkg.dev/project-name/registry-name/image:latest```
   - Enable "Require approval before build executes"
   - For a manual image build, press Enable/Run in the created trigger

5. After the image is created and stored in Artifact Registry, select "Train new model" under the Training tab in Vertex AI.
   - Training method: keep the default (Custom training) and continue
   - Model details: fill in the name and continue
   - Training container: select custom container, browse for the latest built image, link to the storage bucket, and under arguments modify and paste the following:

     ```
     --path=gs://bucket_name/Data/
     --language=Thai
     --input-type=BEST
     --model-type=cnn
     --epochs=200
     --filters=32
     --name=Thai_codepoints_32
     --edim=40
     --embedding=codepoints
     ```

   - Hyperparameters: leave unselected and continue
   - Compute and pricing: choose existing resources or deploy to a new worker pool
   - Prediction container: no prediction container, then start training
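
As referenced in step 2, here is a minimal, hypothetical upload helper (not part of this PR) that produces the bucket layout above, assuming the `google-cloud-storage` client library and Application Default Credentials:

```python
# Hypothetical helper (not part of this PR): push a local Data/ folder to the
# bucket, preserving the Data/ prefix that the training arguments expect.
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def upload_data_dir(bucket_name: str, local_root: str = "Data") -> None:
    client = storage.Client()             # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    for path in Path(local_root).rglob("*"):
        if path.is_file():
            blob = bucket.blob(path.as_posix())   # e.g. "Data/my_train.txt"
            blob.upload_from_filename(str(path))


if __name__ == "__main__":
    upload_data_dir("bucket_name")        # replace with your bucket name
```

Running `gsutil -m cp -r Data gs://bucket_name` from the repository root uploads the same layout.
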
@@ -0,0 +1,16 @@
# Base image: Vertex AI TensorFlow 2.16 GPU training container (Python 3.10).
FROM us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-16.py310:latest

# pkg-config and the ICU development headers are needed by ICU-dependent Python packages.
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
        pkg-config libicu-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /

# Copy the repository (training code and requirements) into the image.
COPY . /

RUN python3 -m pip install --upgrade pip && \
    pip install -r requirements.txt

# Vertex AI passes the training arguments (--path, --language, ...) to this entrypoint.
ENTRYPOINT ["python3", "train.py"]
@@ -0,0 +1,51 @@
## AdaBoost for Cantonese

Relative to BudouX’s n-gram model, the new [radical](https://en.wikipedia.org/wiki/Chinese_character_radicals)-based AdaBoost model reaches comparable accuracy with under half the model size. The radical of a Chinese character is typically the character's semantic component. Moreover, there are only 214 of them in [kRSUnicode](https://en.wikipedia.org/wiki/Kangxi_radicals), which keeps the feature space small enough for lightweight models. The other benefit of using radicals is that, even though the model is trained only on zh-hant data, it generalises better, which makes it more suitable to deploy for zh-hant variants such as zh-tw and zh-hk (Cantonese).
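
A quick way to see why radicals generalise, using the same cihai/Unihan lookup that the parser in this PR relies on (the characters below are illustrative):

```python
# Characters that never co-occur in training can still share a radical feature.
# This uses the same cihai/Unihan lookup as the AdaBoost parser in this PR.
from cihai.core import Cihai

c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()

for ch in "湖海江":  # all carry the water radical 氵 (Kangxi radical 85)
    entry = c.unihan.lookup_char(ch).first()
    print(ch, entry.kRSUnicode)  # kRSUnicode is "radical.residual_strokes", e.g. "85.9"
```

The parser's `RAD`, `LSRID`, and `RSRID` feature groups are keyed on exactly these radical numbers.
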
**CITYU Test Dataset (zh-hant)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 86.27 | 64 KB |
| Radical-based | 85.82 | 31 KB |
| ICU | 89.46 | 2 MB |

**UDCantonese Dataset (zh-hk)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 73.51 | 64 KB |
| Radical-based | 89.76 | 31 KB |
| [PyCantonese](https://github.com/jacksonllee/pycantonese) | 94.98 | 1.3 MB |
| ICU | 79.14 | 2 MB |

### Examples

**Test Case 1 (zh-hant)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 一名浙江新昌的茶商說正宗龍井產量有限需求量大價格高而貴州茶品質不差混雜在中間根本分不出來 |
| Manually Segmented | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州茶 . 品質 . 不 . 差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| Radical-based | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍 . 井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| BudouX | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| ICU | 一名 . 浙江 . 新 . 昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在中 . 間 . 根本 . 分 . 不出來 |

**Test Case 2 (zh-hk)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 點解你唔將呢句說話-點解你同我講,唔同你隔籬嗰啲人講呀? |
| Manually Segmented | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| Radical-based | 點解 . 你 . 唔 . 將 . 呢句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| BudouX | 點解你 . 唔 . 將 . 呢句 . 說話 . - . 點解你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲人 . 講呀 . ? |
| ICU | 點 . 解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點 . 解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔 . 籬 . 嗰 . 啲 . 人 . 講 . 呀 . ? |
| PyCantonese | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |

### Usage

Set up the environment using ```pip3 install -r requirements.txt```

```python
import json

# AdaBoostSegmenter is defined in the parser script added in this PR.
with open('model.json', encoding="utf-8") as f:
    model = json.load(f)
parser = AdaBoostSegmenter(model)
output = parser.predict("一名浙江新昌的茶商說")  # ['一', '名', '浙江', '新昌', '的', '茶商', '說']
```
@@ -0,0 +1,58 @@
import json

from cihai.core import Cihai

c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()


def get_radical(ch1: str):
    """Return the Kangxi radical number of ch1 (as a string) from Unihan's kRSUnicode, or 0 if unknown."""
    char1 = c.unihan.lookup_char(ch1).first()
    if char1 is None:
        return 0
    else:
        r1 = char1.kRSUnicode.split(" ")[0]
        if '\'' in r1:
            # A trailing apostrophe marks a simplified-form radical, e.g. "120'.4".
            return r1.split('\'')[0]
        else:
            return r1.split('.')[0]


class AdaBoostSegmenter:
    def __init__(self, model):
        self.model = model

    def predict(self, sentence):
        if sentence == '':
            return []
        chunks = [sentence[0]]
        # Decision threshold: half of the total feature weight, negated.
        base_score = -sum(sum(g.values()) for g in self.model.values()) * 0.5

        for i in range(1, len(sentence)):
            score = base_score
            L = len(chunks[-1])
            # Chunk-length term: grows exponentially with the current chunk length,
            # pushing the score towards a break after long chunks.
            score += 32**L

            # Radical features: right character's radical paired with the left
            # character (RSRID), left character's radical paired with the right
            # character (LSRID), and the radical bigram (RAD).
            rad4 = get_radical(sentence[i])
            if rad4:
                score += self.model.get('RSRID', {}).get(f'{sentence[i-1]}:{rad4}', 0)
            rad3 = get_radical(sentence[i-1])
            if rad3:
                score += self.model.get('LSRID', {}).get(f'{rad3}:{sentence[i]}', 0)
            if rad3 and rad4:
                score += self.model.get('RAD', {}).get(f'{rad3}:{rad4}', 0)

            # Character features: the bigram around the boundary (BW2) and the
            # unigrams at offsets -2..+1 (UW2-UW5).
            score += self.model.get('BW2', {}).get(sentence[i - 1:i + 1], 0)
            if i > 1:
                score += self.model.get('UW2', {}).get(sentence[i - 2], 0)
            score += self.model.get('UW3', {}).get(sentence[i - 1], 0)
            score += self.model.get('UW4', {}).get(sentence[i], 0)
            if i + 1 < len(sentence):
                score += self.model.get('UW5', {}).get(sentence[i + 1], 0)

            # Positive score: start a new word at position i; otherwise extend the chunk.
            if score > 0:
                chunks.append(sentence[i])
            else:
                chunks[-1] += sentence[i]
        return chunks


if __name__ == "__main__":
    with open('model.json', encoding="utf-8") as f:
        model = json.load(f)
    parser = AdaBoostSegmenter(model)
    print("_".join(parser.predict("在香港實施「愛國者治港」的過程中,反對派人士被拘捕,獨立媒體停止運作,監察與匿名舉報現象日益增多。")))
@@ -0,0 +1,13 @@
appdirs==1.4.4
cihai==0.35.0
greenlet==3.2.4
mypy==1.17.1
mypy_extensions==1.1.0
pathspec==0.12.1
PyYAML==6.0.2
SQLAlchemy==2.0.43
tomli==2.2.1
typing_extensions==4.15.0
unicodecsv==0.14.1
unihan-etl==0.37.0
zhon==2.1.1