Zh itn (#74)
* Add ZH ITN

Signed-off-by: Anand Joseph <[email protected]>

* Fix copyrights and code cleanup

Signed-off-by: Anand Joseph <[email protected]>

* Remove invalid tests

Signed-off-by: Anand Joseph <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve CodeQL issues

Signed-off-by: Anand Joseph <[email protected]>

* Cleanup

Signed-off-by: Anand Joseph <[email protected]>

* Fix missing 'zh' option for ITN and correct comment

Signed-off-by: Anand Joseph <[email protected]>

* Update __init__.py

Change to zh instead of en for the imports.

Signed-off-by: Buyuan(Alex) Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update for decimal test data

Signed-off-by: BuyuanCui <[email protected]>

* update for language import

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update for Chinese punctuations

Signed-off-by: BuyuanCui <[email protected]>

* a new class for whitelist

Signed-off-by: BuyuanCui <[email protected]>

* PYNINI_AVAILABLE = False

Signed-off-by: BuyuanCui <[email protected]>

* recreated due to file import format issue

Signed-off-by: BuyuanCui <[email protected]>

* recreated due to format issue

Signed-off-by: BuyuanCui <[email protected]>

* caught duplicates, removed

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicates, arranged for Chinese Yuan updates

Signed-off-by: BuyuanCui <[email protected]>

* updates according to the comments from last PR. Recreated some of the files due to format issues

Signed-off-by: BuyuanCui <[email protected]>

* removed the hours_to and minute_to files used for back counting. Also removed am and pm suffix files according to the last PR. Recreated some of them due to format issues

Signed-off-by: BuyuanCui <[email protected]>

* re-added this file to avoid data file import error

Signed-off-by: BuyuanCui <[email protected]>

* updated grammar according to last PR. Removed the acceptance of 千

Signed-off-by: BuyuanCui <[email protected]>

* updates

Signed-off-by: BuyuanCui <[email protected]>

* updated according to last PR. Removed comma after decimal points

Signed-off-by: BuyuanCui <[email protected]>

* grammar for Fraction

Signed-off-by: BuyuanCui <[email protected]>

* grammar for money, updated according to last PR. Plus processing of 元

Signed-off-by: BuyuanCui <[email protected]>

* ordinal grammar. updates due to the updates in cardinal grammar

Signed-off-by: BuyuanCui <[email protected]>

* updated according to last PR comments. removing am and pm and allowing simple Mandarin expression

Signed-off-by: BuyuanCui <[email protected]>

* arrangements

Signed-off-by: BuyuanCui <[email protected]>

* added whitelist grammar

Signed-off-by: BuyuanCui <[email protected]>

* word grammar for non-classified items

Signed-off-by: BuyuanCui <[email protected]>

* updated cardinal, decimal, time, itn data

Signed-off-by: BuyuanCui <[email protected]>

* updates according to last PR

Signed-off-by: BuyuanCui <[email protected]>

* updates according to the updates for cardinal grammar

Signed-off-by: BuyuanCui <[email protected]>

* updates for more Mandarin punctuations

Signed-off-by: BuyuanCui <[email protected]>

* updated according to last PR. removing am pm

Signed-off-by: BuyuanCui <[email protected]>

* adjustment on the weight

Signed-off-by: BuyuanCui <[email protected]>

* updated according to the tagger updates

Signed-off-by: BuyuanCui <[email protected]>

* updated according to the time tagger

Signed-off-by: BuyuanCui <[email protected]>

* updates according to changes in tagger on am and pm

Signed-off-by: BuyuanCui <[email protected]>

* verbalizer for fraction

Signed-off-by: BuyuanCui <[email protected]>

* added for mandarin grammar

Signed-off-by: BuyuanCui <[email protected]>

* kept this file because using English utils results in data naming error

Signed-off-by: BuyuanCui <[email protected]>

* merge conflict

Signed-off-by: BuyuanCui <[email protected]>

* removed unused imports

Signed-off-by: BuyuanCui <[email protected]>

* deleted unused import os

Signed-off-by: BuyuanCui <[email protected]>

* deleted unused variables

Signed-off-by: BuyuanCui <[email protected]>

* removed unused imports

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates and edits based on pr checks

Signed-off-by: BuyuanCui <[email protected]>

* updates and edits based on pr checks

Signed-off-by: BuyuanCui <[email protected]>

* format issue, recreated

Signed-off-by: BuyuanCui <[email protected]>

* format issue recreated

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed coding style/format

Signed-off-by: BuyuanCui <[email protected]>

* fixed coding style and format

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicated graph for 毛

Signed-off-by: BuyuanCui <[email protected]>

* removed the comment

Signed-off-by: BuyuanCui <[email protected]>

* removed the comment

Signed-off-by: BuyuanCui <[email protected]>

* removing unnecessary comments

Signed-off-by: BuyuanCui <[email protected]>

* unnecessary comment removed

Signed-off-by: BuyuanCui <[email protected]>

* test file updated for more cases

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated with a comment explaining why this file is kept

Signed-off-by: BuyuanCui <[email protected]>

* updated the file explaining why this file is kept

Signed-off-by: BuyuanCui <[email protected]>

* added Mandarin as zh

Signed-off-by: BuyuanCui <[email protected]>

* removing for duplication

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed unused NEMO objects

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicates

Signed-off-by: BuyuanCui <[email protected]>

* removing unused imports

Signed-off-by: BuyuanCui <[email protected]>

* updates to fix test file failures

Signed-off-by: BuyuanCui <[email protected]>

* updates to fix file failures

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to adapt to cardinal grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* updates to adapt to grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* updates to adapt to cardinal grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fixing pr checks

Signed-off-by: BuyuanCui <[email protected]>

* removed // for zhtn/itn cache

Signed-off-by: BuyuanCui <[email protected]>

* Update inverse_normalize.py

Added zh as a selection to pass Jenkins checks.

Signed-off-by: Buyuan(Alex) Cui <[email protected]>

---------

Signed-off-by: Anand Joseph <[email protected]>
Signed-off-by: Buyuan(Alex) Cui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Co-authored-by: Alex Cui <[email protected]>
Co-authored-by: Anand Joseph <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
4 people committed Jun 30, 2023
1 parent 9f3d372 commit cf62bb8
Showing 66 changed files with 3,505 additions and 8 deletions.
12 changes: 6 additions & 6 deletions Jenkinsfile
@@ -22,7 +22,7 @@ pipeline {
RU_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
VI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
SV_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
-ZH_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
+ZH_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-29-23-0'
DEFAULT_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'

}
@@ -319,11 +319,11 @@ pipeline {
sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/text_normalization/normalize.py --lang=zh --text="你" --cache_dir ${ZH_TN_CACHE}'
}
}
-// stage('L0: ZH ITN grammars') {
-// steps {
-// sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=zh --text="二零零二年一月二十八日 " --cache_dir ${ZH_TN_CACHE}'
-// }
-// }
+stage('L0: ZH ITN grammars') {
+steps {
+sh 'CUDA_VISIBLE_DEVICES="" python nemo_text_processing/inverse_text_normalization/inverse_normalize.py --lang=zh --text="二零零二年一月二十八日 " --cache_dir ${ZH_TN_CACHE}'
+}
+}
}
}

@@ -106,6 +106,11 @@ def __init__(
from nemo_text_processing.inverse_text_normalization.es_en.verbalizers.verbalize_final import (
VerbalizeFinalFst,
)
+elif lang == 'zh': # Mandarin
+from nemo_text_processing.inverse_text_normalization.zh.taggers.tokenize_and_classify import ClassifyFst
+from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize_final import (
+VerbalizeFinalFst,
+)

self.tagger = ClassifyFst(
cache_dir=cache_dir, whitelist=whitelist, overwrite_cache=overwrite_cache, input_case=input_case
@@ -150,7 +155,7 @@ def parse_args():
parser.add_argument(
"--language",
help="language",
-choices=['en', 'de', 'es', 'pt', 'ru', 'fr', 'sv', 'vi', 'ar', 'es_en'],
+choices=['en', 'de', 'es', 'pt', 'ru', 'fr', 'vi', 'ar', 'es_en', 'zh'],
default="en",
type=str,
)
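
The Jenkins stage in this commit passes `--lang=zh`, while the parser declares the option as `--language`. A minimal sketch of why that can still resolve, assuming the real parser defines no other option beginning with `--lang`: by default `argparse` accepts unambiguous prefixes of long options. The `build_parser` helper below is hypothetical and reproduces only the `--language` option shown in the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for parse_args() in inverse_normalize.py;
    # only the --language option from the diff is reproduced here.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--language",
        help="language",
        choices=['en', 'de', 'es', 'pt', 'ru', 'fr', 'vi', 'ar', 'es_en', 'zh'],
        default="en",
        type=str,
    )
    return parser


parser = build_parser()
# "--lang" is an unambiguous prefix of "--language", so argparse resolves it.
args = parser.parse_args(["--lang=zh"])
print(args.language)
```

Any value that is not listed in `choices` is rejected with a usage error, which is why `'zh'` had to be added to the list before the Jenkins stage could pass.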
17 changes: 17 additions & 0 deletions nemo_text_processing/inverse_text_normalization/zh/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_text_processing.inverse_text_normalization.zh.taggers.tokenize_and_classify import ClassifyFst
from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize import VerbalizeFst
from nemo_text_processing.inverse_text_normalization.zh.verbalizers.verbalize_final import VerbalizeFinalFst
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,74 @@
1
2
3
4
5
6
7
8
9
10
十一 11
十二 12
十三 13
十四 14
十五 15
十六 16
十七 17
十八 18
十九 19
二十 20
二十一 21
二十二 22
二十三 23
二十四 24
二十五 25
二十六 26
二十七 27
二十八 28
二十九 29
三十 30
三十一 31
1
2
3
4
5
6
7
8
9
1
2
2
10
拾壹 11
拾貳 12
拾叁 13
拾肆 14
拾伍 15
拾陸 16
拾柒 17
拾捌 18
拾玖 19
貳拾 20
貳拾壹 21
貳拾貳 22
貳拾叁 23
貳拾肆 24
貳拾伍 25
貳拾陸 26
貳拾柒 27
貳拾捌 28
貳拾玖 29
叁拾 30
叁拾壹 31
1
拾壹 11
贰拾壹 21
2
6
拾贰 12
拾陆 16
贰拾贰 22
贰拾陆 26
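
The data file above pairs a spoken form with its written form, one pair per row. As a rough illustration of how such a file can be read, here is a minimal sketch in plain Python. It assumes the columns are tab-separated, which the flattened view cannot confirm, and the `load_labels` helper is hypothetical. The grammars in this commit presumably compile these files into FSTs instead; that code is not shown here.

```python
from typing import Dict

# Three rows copied from the data above; the tab separator is an assumption.
SAMPLE = "十一\t11\n二十八\t28\n貳拾陸\t26\n"


def load_labels(text: str) -> Dict[str, str]:
    """Parse 'spoken<TAB>written' rows into a dict, skipping blank lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        spoken, written = line.split("\t")
        mapping[spoken] = written
    return mapping


day = load_labels(SAMPLE)
print(day["二十八"])
```

Listing simplified, traditional, and financial numerals as separate rows lets each variant map to the same written form without any character-level conversion step.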
@@ -0,0 +1,49 @@
1
2
3
4
5
6
7
8
9
10
十一 11
十二 12
一十 10
零一 1
零二 2
零三 3
零四 4
零五 5
零六 6
零七 7
零八 8
零九 9
1
2
3
4
5
6
7
8
9
10
拾壹 11
拾贰 12
壹拾 10
零壹 1
零贰 2
零叁 3
零肆 4
零伍 5
零陆 6
零柒 7
零捌 8
零玖 9
2
零貳 2
6
零陸 6
拾貳 12
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,75 @@
美元 US$
欧元
歐元
英镑 £
英鎊 £
加拿大元 CAD$
加拿大币 CAD$
加拿大幣 CAD$
加元 CAD$
加币 CAD$
加幣 CAD$
瑞士法郎 Fr
法郎
¥
¥
瑞典克朗 Kr
墨西哥比索 NXN$
新西兰元 NZD$
新西蘭元 NZD$
新加坡币 SGD$
新加坡幣 SGD$
新加坡元 SGD$
港元 HKD$
港币 HKD$
港幣 HKD$
挪威克朗 NOKkr
韩元
韓元
韩币
韓幣
土耳其里拉 TRY₺
印度卢布
印度盧布
印度卢比
印度盧比
俄罗斯卢布
俄羅斯盧布
俄罗斯卢比
俄羅斯盧比
巴西雷亚尔 BRLR$
巴西雷亞爾 BRLR$
南非兰特 R
南非蘭特 R
丹麦克朗 DKKkr
丹麥克朗 DKKkr
波兰兹罗提
波蘭兹儸提
新台币 TWDNT$
新臺幣 TWDNT$
泰铢 ฿
泰銖 ฿
马来西亚林吉特 RM
馬來西亞林吉特 RM
印尼盾 Rp
匈牙利福林 Ft
捷克克朗
以色列新谢克尔
以色列新謝克爾
智利披索 CLP$
菲律宾披索
菲律賓披索
阿联酋迪拉姆 د.إ
阿聯酋迪拉姆 د.إ
哥伦比亚披索 COL$
哥倫比亞披索 COL$
马来西亚令吉 RM
馬來西亞令吉 RM
罗马尼亚列伊 L
羅馬尼亞列伊 L
日元 JPY¥
日圆 JPY¥
日圓 JPY¥
人民币 ¥
人民幣 ¥
¥
@@ -0,0 +1,9 @@
美分 US$
欧分
便士 £
加拿大分 CAD$
生丁
瑞典欧尔 KrOre
NXN$
新西兰仙 NZD$
挪威欧尔 NOKOre
@@ -0,0 +1 @@
@@ -0,0 +1,2 @@
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,22 @@
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
2
6
2
2
@@ -0,0 +1,18 @@
二十 2
三十 3
四十 4
五十 5
六十 6
七十 7
八十 8
九十 9
贰拾 2
叁拾 3
肆拾 4
伍拾 5
陆拾 6
柒拾 7
捌拾 8
玖拾 9
貳拾 2
陸拾 6
@@ -0,0 +1 @@
0
@@ -0,0 +1,13 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.