Added MLOps pipeline and new WordSegmenterCNN class #43
Open

opnuub wants to merge 17 commits into unicode-org:main from opnuub:main
- c1d5e5b (opnuub): MLOps pipeline: initial working version
- fe40364 (opnuub): MLOps pipeline: initial working version
- 8210808 (opnuub): New model architecture implemented, CUDA support for cloud training
- 9a56c0c (opnuub): prepare to merge
- be2c20e (opnuub): Merging from cnn branch
- 2c8e07d (opnuub): polished CNN code
- d5e39b8 (opnuub): faster training
- 0ca28b6 (opnuub): added burmese support
- 43488a7 (opnuub): added mlops support for lstm
- f1389b1 (opnuub): modified train.py to be more general
- 91b77a3 (opnuub): python parser for zh-hant/yue adaboost model
- da9d895 (opnuub): optimised training data handling speed
- 21b4525 (opnuub): added readme files for cnn and adaboost
- 980a08d (opnuub): removed cloudbuild.yaml because it's no longer used
- 0c653a5 (opnuub): fixed pull requests issues
- 5e528d3 (opnuub): fixed bug on train.py for lstm
- 62b6615 (opnuub): Update README.md for Cantonese
@@ -0,0 +1,97 @@
## Convolutional Neural Networks

Author: Shenghong Liu ([email protected])

The existing LSTM-based segmenter processes characters strictly sequentially, so its inference cost grows linearly with input length and cannot be parallelised, which is not ideal for content-heavy platforms. Hence, I introduce a new model architecture for faster word segmentation of Southeast Asian languages such as Thai and Burmese.

<img src="Figures/cnn.jpg" width="30%"/>

The convolutional neural network (CNN) architecture developed in this project achieves faster inference speeds with comparable accuracy for Thai. Not only does the CNN remove the sequential bottleneck, its dilated convolutions also maintain a high level of accuracy by capturing a wider context of the surrounding text.
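
Below is a minimal sketch of this idea in Keras/TensorFlow (the same stack as the training image used in this PR). The layer count, kernel size, dilation rates, vocabulary size, and the 4-class output are illustrative assumptions, not the exact `WordSegmenterCNN` configuration:

```python
# Illustrative dilated-CNN tagger (not the exact WordSegmenterCNN in this PR).
# Each character embedding passes through a stack of Conv1D layers whose
# dilation rates double, so the receptive field grows without extra parameters.
import tensorflow as tf

def build_cnn_segmenter(vocab_size: int, edim: int = 40, filters: int = 32,
                        num_classes: int = 4) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(None,), dtype="int32")   # code points or grapheme clusters
    x = tf.keras.layers.Embedding(vocab_size, edim)(inputs)
    for rate in (1, 2, 4):                                   # growing receptive field
        x = tf.keras.layers.Conv1D(filters, kernel_size=3, padding="same",
                                   dilation_rate=rate, activation="relu")(x)
    # One segmentation label (e.g. a BIES-style tag) per character position.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cnn_segmenter(vocab_size=350)   # vocab size is a placeholder
model.summary()
```

With three dilated layers of kernel size 3 and rates 1, 2, and 4, each output position sees roughly 15 surrounding characters, which is where the wider context comes from.
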

| Model | F1-Score | Model Size | CPU Inference Speed |
|----------|:--------:|:---------:|---------:|
| LSTM Medium | 90.1 | 36 KB | 9.29 ms |
| LSTM Small | 86.7 | 12 KB | 6.68 ms |
| CNN Medium | 90.4 | 28 KB | 3.76 ms |
| ICU | 86.4 | 126 KB | ~0.2 ms |

### Examples

**Test Case 1**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | พระราชประสงค์ของพระบาทสมเสด็จพระเจ้าอยู่หัวในรัชกาลปัจจุบันคือ |
| Manually Segmented | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| CNN | พระราชประสงค์_ของ_พระบาทสมเสด็จพระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| ICU | พระ_ราช_ประสงค์_ของ_พระบาท_สม_เสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |
| LSTM | พระราชประสงค์_ของ_พระบาทสมเสด็จ_พระเจ้าอยู่หัว_ใน_รัชกาล_ปัจจุบัน_คือ |

**Test Case 2**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | ในขณะเดียวกันผู้ที่ต้องการเงินเพื่อนำไปลงทุนหรือประกอบกิจการอื่นใด |
| Manually Segmented | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| CNN | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้องการ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |
| ICU | ใน_ขณะ_เดียวกัน_ผู้_ที่_ต้องการ_เงิน_เพื่อน_ำ_ไป_ลงทุน_หรือ_ประกอบ_กิจการ_อื่น_ใด |
| LSTM | ใน_ขณะ_เดียว_กัน_ผู้_ที่_ต้อง_การ_เงิน_เพื่อ_นำ_ไป_ลง_ทุน _หรือ_ประกอบ_กิจการ_อื่น_ใด |

**Test Case 3**

| Algorithm | Output |
|----------|:---------|
| Unsegmented | เพราะเพียงกรดนิวคลีอิคของไวรัสอย่างเดียวก็สร้างไวรัสสมบูรณ์ |
| Manually Segmented | เพราะ_เพียง_กรด_นิวคลีอิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| CNN | เพราะ_เพียง_กรด_นิว_คลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| ICU | เพราะ_เพียง_กรด_นิ_วค_ลี_อิค_ของ_ไวรัส_อย่าง_เดียว_ก็_สร้าง_ไวรัส_สมบูรณ์ |
| LSTM | เพราะ_เพียง_กรดนิว_คลีอิค_ของ_ไวรัสอย่าง_เดียว_ก็_สร้าง_ไวรัสสมบูรณ์ |

### Hyperparameters

In Vertex AI Custom Training, you need to specify the following hyperparameters:

```
--path=gs://bucket_name/Data/
--language=Thai
--input-type=BEST
--model-type=cnn
--epochs=200
--filters=32
--name=Thai_codepoints_32
--edim=40
--embedding=codepoints
```

* **name:** This is the model name.
* **path:** This is the Google Cloud Storage bucket link.
* **language:** This is the language you'd like to train, such as ```Thai``` or ```Burmese```.
* **input-type:** This is the dataset type, such as ```BEST``` for Thai and ```my``` for Burmese. Refer to [Cloud Usage](Cloud%20Usage.md) for more details.
* **model-type:** This is the model architecture type, such as ```lstm``` or ```cnn```.
* **epochs:** This is the number of training epochs. A value of at least 200 is recommended because the model only trains on 10% of the dataset in each epoch. The model will output the epoch that gives the best validation loss.
* **filters:** This is the number of filters in each Conv1D layer; it plays a significant role in model size, accuracy, and inference speed (see the sketch after this list).
* **edim:** This is ```embedding_dim```, the length of each embedding vector; it also plays a significant role in model size, accuracy, and inference speed.
* **embedding:** This determines the type of embedding used to train the model and can be one of the following:
    * `"grapheme_clusters_tf"`: use this option when grapheme clusters are the embedding unit.
    * `"codepoints"`: use this option when the embedding is based on code points.
* **learning-rate:** This determines the model's learning rate. The default is 0.001.
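
To make the size trade-off concrete, here is a rough, back-of-the-envelope parameter count. The vocabulary size, number of convolutional layers, and kernel size below are assumptions for illustration, not values taken from this PR:

```python
# Rough parameter-count estimate for a dilated-CNN segmenter (illustrative
# assumptions: ~350-symbol code-point vocabulary, three Conv1D layers, kernel 3).
def approx_params(vocab_size: int, edim: int, filters: int,
                  kernel: int = 3, conv_layers: int = 3, classes: int = 4) -> int:
    embedding = vocab_size * edim                           # one vector per symbol
    first_conv = filters * (kernel * edim + 1)              # weights + bias
    later_convs = (conv_layers - 1) * filters * (kernel * filters + 1)
    head = classes * (filters + 1)                          # per-position classifier
    return embedding + first_conv + later_convs + head

# Doubling `filters` roughly doubles the convolutional parameters, which is why
# the model-size column in the tables below grows with the filter count.
for f in (8, 16, 32, 64, 128):
    print(f, approx_params(vocab_size=350, edim=40, filters=f))
```
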
### Model Performance Comparison
**Codepoints**
| Filter Size | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 93.1 | 84.8 | 13 KB |
| 16 | 94.5 | 87.7 | 16 KB |
| 32 | 95.7 | 90.4 | 28 KB |
| 64 | 96.6 | 92.5 | 52 KB |
| 128 | 97.3 | 94.0 | 95 KB |

**Grapheme Clusters**
| Filter Size | Accuracy | F1 Score | Model Size |
|----------|:---------:|:---------:|----------:|
| 8 | 94.1 | 89.3 | 13 KB |
| 16 | 95.2 | 91.2 | 14 KB |
| 32 | 95.9 | 92.6 | 24 KB |
| 64 | 96.6 | 93.7 | 34 KB |
| 128 | 97.1 | 94.7 | 55 KB |

[Embeddings Discussion](Embeddings%20Discussion.md) gives a detailed comparison between the embedding types.
@@ -0,0 +1,49 @@
To bind the repository to Google Cloud Platform, follow the steps below:

1. In Google Cloud APIs & Services, enable:
   - Secret Manager API
   - Artifact Registry API
   - Vertex AI API
   - Cloud Pub/Sub API
   - Cloud Build API

2. Create a Google Cloud Storage bucket and upload the dataset into gs://bucket_name/Data/ with the following directory structure (a Python upload sketch is given after these steps):

       Data/
       ├── Best/
       │   ├── article/
       │   ├── encyclopedia/
       │   ├── news/
       │   └── novel/
       ├── my_test_segmented.txt
       ├── my_train.txt
       └── my_valid.txt

3. In Artifact Registry, create a repository in the same region as the storage bucket.

4. In Cloud Build, create a trigger in the same region as the Artifact Registry repository.
   - Choose a suitable event (e.g. Push to a branch)
   - Select the 2nd gen repository generation
   - Link the GitHub repository
   - Select Dockerfile (for Configuration) and Repository (for Location)
   - Dockerfile name: ```Dockerfile```, image name: ```us-central1-docker.pkg.dev/project-name/registry-name/image:latest```
   - Enable "Require approval before build executes"
   - For a manual image build, press Enable/Run in the created trigger

5. After the image is created and stored in Artifact Registry, select "Train new model" under the Training tab in Vertex AI.
   - Training method: keep the default (Custom training) and continue
   - Model details: fill in the name and continue
   - Training container: select custom container, browse for the latest built image, link to the storage bucket, and under arguments modify and paste the following:

     ```
     --path=gs://bucket_name/Data/
     --language=Thai
     --input-type=BEST
     --model-type=cnn
     --epochs=200
     --filters=32
     --name=Thai_codepoints_32
     --edim=40
     --embedding=codepoints
     ```

   - Hyperparameters: leave unselected and continue
   - Compute and pricing: choose existing resources or deploy to a new worker pool
   - Prediction container: no prediction container, then start training
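
As referenced in step 2, here is a minimal, hypothetical upload helper (not part of this PR) that produces the bucket layout above, assuming the `google-cloud-storage` client library and Application Default Credentials:

```python
# Hypothetical helper (not part of this PR): push a local Data/ folder to the
# bucket, preserving the Data/ prefix that the training arguments expect.
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def upload_data_dir(bucket_name: str, local_root: str = "Data") -> None:
    client = storage.Client()             # uses Application Default Credentials
    bucket = client.bucket(bucket_name)
    for path in Path(local_root).rglob("*"):
        if path.is_file():
            blob = bucket.blob(path.as_posix())   # e.g. "Data/my_train.txt"
            blob.upload_from_filename(str(path))


if __name__ == "__main__":
    upload_data_dir("bucket_name")        # replace with your bucket name
```

Running `gsutil -m cp -r Data gs://bucket_name` from the repository root uploads the same layout.
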
@@ -0,0 +1,16 @@
# Base image: Vertex AI TensorFlow 2.16 GPU training container (Python 3.10).
FROM us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-16.py310:latest

# pkg-config and the ICU development headers are needed by ICU-dependent Python packages.
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
        pkg-config libicu-dev && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /

# Copy the repository (training code and requirements) into the image.
COPY . /

RUN python3 -m pip install --upgrade pip && \
    pip install -r requirements.txt

# Vertex AI passes the training arguments (--path, --language, ...) to this entrypoint.
ENTRYPOINT ["python3", "train.py"]
@@ -0,0 +1,51 @@
## AdaBoost for Cantonese

Relative to BudouX’s n-gram model, the new [radical](https://en.wikipedia.org/wiki/Chinese_character_radicals)-based AdaBoost model reaches comparable accuracy with under half the model size. The radical of a Chinese character is typically the character's semantic component. Moreover, there are only 214 of them in [kRSUnicode](https://en.wikipedia.org/wiki/Kangxi_radicals), which keeps the feature space small enough for lightweight models. The other benefit of using radicals is that, even though the model is trained only on zh-hant data, it generalises better, which makes it more suitable to deploy for zh-hant variants such as zh-tw and zh-hk (Cantonese).
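
A quick way to see why radicals generalise, using the same cihai/Unihan lookup that the parser in this PR relies on (the characters below are illustrative):

```python
# Characters that never co-occur in training can still share a radical feature.
# This uses the same cihai/Unihan lookup as the AdaBoost parser in this PR.
from cihai.core import Cihai

c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()

for ch in "湖海江":  # all carry the water radical 氵 (Kangxi radical 85)
    entry = c.unihan.lookup_char(ch).first()
    print(ch, entry.kRSUnicode)  # kRSUnicode is "radical.residual_strokes", e.g. "85.9"
```

The parser's `RAD`, `LSRID`, and `RSRID` feature groups are keyed on exactly these radical numbers.
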
**CITYU Test Dataset (zh-hant)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 86.27 | 64 KB |
| Radical-based | 85.82 | 31 KB |
| ICU | 89.46 | 2 MB |

**UDCantonese Dataset (zh-hk)**
| Model | F1-Score | Model Size |
|----------|:--------:|:---------:|
| BudouX | 73.51 | 64 KB |
| Radical-based | 89.76 | 31 KB |
| [PyCantonese](https://github.com/jacksonllee/pycantonese) | 94.98 | 1.3 MB |
| ICU | 79.14 | 2 MB |

### Examples

**Test Case 1 (zh-hant)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 一名浙江新昌的茶商說正宗龍井產量有限需求量大價格高而貴州茶品質不差混雜在中間根本分不出來 |
| Manually Segmented | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州茶 . 品質 . 不 . 差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| Radical-based | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍 . 井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| BudouX | 一 . 名 . 浙江 . 新昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求 . 量 . 大 . 價格 . 高 . 而 . 貴州 . 茶品質 . 不差 . 混雜 . 在 . 中間 . 根本 . 分 . 不 . 出來 |
| ICU | 一名 . 浙江 . 新 . 昌 . 的 . 茶商 . 說 . 正宗 . 龍井 . 產量 . 有限 . 需求量 . 大 . 價格 . 高 . 而 . 貴州 . 茶 . 品質 . 不差 . 混雜 . 在中 . 間 . 根本 . 分 . 不出來 |

**Test Case 2 (zh-hk)**
| Algorithm | Output |
|----------|:---------|
| Unsegmented | 點解你唔將呢句說話-點解你同我講,唔同你隔籬嗰啲人講呀? |
| Manually Segmented | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| Radical-based | 點解 . 你 . 唔 . 將 . 呢句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |
| BudouX | 點解你 . 唔 . 將 . 呢句 . 說話 . - . 點解你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲人 . 講呀 . ? |
| ICU | 點 . 解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點 . 解 . 你 . 同 . 我 . 講 . , . 唔 . 同 . 你 . 隔 . 籬 . 嗰 . 啲 . 人 . 講 . 呀 . ? |
| PyCantonese | 點解 . 你 . 唔 . 將 . 呢 . 句 . 說話 . - . 點解 . 你 . 同 . 我 . 講 . , . 唔同 . 你 . 隔籬 . 嗰啲 . 人 . 講 . 呀 . ? |

### Usage

Set up the environment using ```pip3 install -r requirements.txt```

```python
import json

# AdaBoostSegmenter is defined in the parser script added in this PR.
with open('model.json', encoding="utf-8") as f:
    model = json.load(f)
parser = AdaBoostSegmenter(model)
output = parser.predict("一名浙江新昌的茶商說")  # ['一', '名', '浙江', '新昌', '的', '茶商', '說']
```
@@ -0,0 +1,58 @@
import json

from cihai.core import Cihai

c = Cihai()
if not c.unihan.is_bootstrapped:
    c.unihan.bootstrap()


def get_radical(ch1: str):
    """Return the Kangxi radical number of ch1 (as a string) from Unihan's kRSUnicode, or 0 if unknown."""
    char1 = c.unihan.lookup_char(ch1).first()
    if char1 is None:
        return 0
    else:
        r1 = char1.kRSUnicode.split(" ")[0]
        if '\'' in r1:
            # A trailing apostrophe marks a simplified-form radical, e.g. "120'.4".
            return r1.split('\'')[0]
        else:
            return r1.split('.')[0]


class AdaBoostSegmenter:
    def __init__(self, model):
        self.model = model

    def predict(self, sentence):
        if sentence == '':
            return []
        chunks = [sentence[0]]
        # Decision threshold: half of the total feature weight, negated.
        base_score = -sum(sum(g.values()) for g in self.model.values()) * 0.5

        for i in range(1, len(sentence)):
            score = base_score
            L = len(chunks[-1])
            # Chunk-length term: grows exponentially with the current chunk length,
            # pushing the score towards a break after long chunks.
            score += 32**L

            # Radical features: right character's radical paired with the left
            # character (RSRID), left character's radical paired with the right
            # character (LSRID), and the radical bigram (RAD).
            rad4 = get_radical(sentence[i])
            if rad4:
                score += self.model.get('RSRID', {}).get(f'{sentence[i-1]}:{rad4}', 0)
            rad3 = get_radical(sentence[i-1])
            if rad3:
                score += self.model.get('LSRID', {}).get(f'{rad3}:{sentence[i]}', 0)
            if rad3 and rad4:
                score += self.model.get('RAD', {}).get(f'{rad3}:{rad4}', 0)

            # Character features: the bigram around the boundary (BW2) and the
            # unigrams at offsets -2..+1 (UW2-UW5).
            score += self.model.get('BW2', {}).get(sentence[i - 1:i + 1], 0)
            if i > 1:
                score += self.model.get('UW2', {}).get(sentence[i - 2], 0)
            score += self.model.get('UW3', {}).get(sentence[i - 1], 0)
            score += self.model.get('UW4', {}).get(sentence[i], 0)
            if i + 1 < len(sentence):
                score += self.model.get('UW5', {}).get(sentence[i + 1], 0)

            # Positive score: start a new word at position i; otherwise extend the chunk.
            if score > 0:
                chunks.append(sentence[i])
            else:
                chunks[-1] += sentence[i]
        return chunks


if __name__ == "__main__":
    with open('model.json', encoding="utf-8") as f:
        model = json.load(f)
    parser = AdaBoostSegmenter(model)
    print("_".join(parser.predict("在香港實施「愛國者治港」的過程中,反對派人士被拘捕,獨立媒體停止運作,監察與匿名舉報現象日益增多。")))
@@ -0,0 +1,13 @@
appdirs==1.4.4
cihai==0.35.0
greenlet==3.2.4
mypy==1.17.1
mypy_extensions==1.1.0
pathspec==0.12.1
PyYAML==6.0.2
SQLAlchemy==2.0.43
tomli==2.2.1
typing_extensions==4.15.0
unicodecsv==0.14.1
unihan-etl==0.37.0
zhon==2.1.1