Zh tn #50
These are the updates for the zh_CN TN grammars.
Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
* save
Signed-off-by: Yang Zhang <[email protected]>
* save
Signed-off-by: Yang Zhang <[email protected]>
* extend alignment for itn
Signed-off-by: Yang Zhang <[email protected]>
---------
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
@fayejf could you pls review this PR?
@yzhang123 Sure!
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
…m to MAndarin Signed-off-by: BuyuanCui <[email protected]>
… to last reviews and changing weights Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
@@ -1,4 +1,4 @@
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Having a file called __init__ copy.py seems like it might cause problems later on. Also, it's unneeded.
@@ -11,14 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# graph_utils is kept because importing from the en folder causes import errors unless the data file names match those in the en folder
This is only true for get_abs_path
Yes, when importing the data, I used get_abs_path every time.
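To illustrate the point in this thread: a path helper that resolves data files relative to its own package directory will look in the en folder when it is imported from en, which would explain why the zh data file names had to match. The following is a minimal sketch that assumes get_abs_path behaves this way; make_get_abs_path, the directory layout, and numbers.tsv are hypothetical names used only for illustration.

```python
import os


def make_get_abs_path(package_dir: str):
    """Build a resolver bound to package_dir.

    Hypothetical stand-in for a per-language helper that anchors
    relative paths at the location of the module that defines it.
    """

    def get_abs_path(rel_path: str) -> str:
        return os.path.join(package_dir, rel_path)

    return get_abs_path


# Hypothetical package layout, for illustration only.
en_get_abs_path = make_get_abs_path(os.path.join("text_normalization", "en"))
zh_get_abs_path = make_get_abs_path(os.path.join("text_normalization", "zh"))

rel = os.path.join("data", "numbers.tsv")  # made-up data file name
en_path = en_get_abs_path(rel)  # resolves under the en folder
zh_path = zh_get_abs_path(rel)  # resolves under the zh folder

print(en_path)
print(zh_path)
```

Under that assumption, only the path helper needs a zh-local copy; classes and constants that do not touch the file system can be imported from en unchanged.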
""" | ||
    return fst @ pynini.cdrewrite(pynini.cross(NEMO_SPACE, NEMO_NON_BREAKING_SPACE), "", "", NEMO_SIGMA)


class GraphFst:
You don't need this, just import from en
@@ -61,6 +90,44 @@ def generator_main(file_name: str, graphs: Dict[str, 'pynini.FstLike']):
    logging.info(f'Created {file_name}')


def get_plurals(fst):
This is very specific to English
Would remove it.

import pynini
from nemo_text_processing.text_normalization.zh.graph_utils import GraphFst
from nemo_text_processing.text_normalization.zh.taggers.cardinal import CardinalFst
Check notice (Code scanning / CodeQL): Unused import

    def __init__(self, deterministic: bool = True, lm: bool = False):
        super().__init__(name="date", kind="classify", deterministic=deterministic)
        cardinal = cardinal.just_cardinals
Check notice (Code scanning / CodeQL): Unused local variable
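Beyond being unused, the flagged line reads `cardinal` in the same statement that assigns it. If `cardinal` is not a parameter and is not bound earlier in `__init__` (the excerpt does not show either), Python treats it as a local name and the read would fail at call time. The sketch below reproduces that behavior with hypothetical stand-ins (Cardinal, broken, fixed); it is not the grammar code from this PR.

```python
class Cardinal:
    """Hypothetical stand-in for a cardinal grammar object."""

    just_cardinals = "cardinal-graph"


cardinal = Cardinal()  # module-level name, analogous to an imported object


def broken():
    # Assigning to `cardinal` makes it local for the whole function body,
    # so the right-hand side reads the local before it has a value.
    cardinal = cardinal.just_cardinals
    return cardinal


def fixed(cardinal_fst):
    # Pass the dependency in and bind the result to a different name.
    graph = cardinal_fst.just_cardinals
    return graph


try:
    broken()
    outcome = "no error"
except UnboundLocalError:
    outcome = "UnboundLocalError"

print(outcome)
print(fixed(Cardinal()))
```

One possible fix, as in `fixed`, is to accept the cardinal grammar as a constructor argument and give the derived graph its own name; dropping the line entirely also works if the value is never used.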
# class TestMath:
#     normalizer_zh = Normalizer(lang='zh', cache_dir=CACHE_DIR, overwrite_cache=False, input_case='cased')
#
#     @parameterized.expand(parse_test_case_file('zh/data_text_normalization/test_cases_math.txt'))
#     @pytest.mark.run_only_on('CPU')
#     @pytest.mark.unit
#     def test_norm_math(self, test_input, expected):
#         preds = self.normalizer_zh.normalize(test_input)
#         assert expected == preds
Check notice (Code scanning / CodeQL): Commented-out code
8cea91c to 99acfcd
Signed-off-by: Buyuan(Alex) Cui <[email protected]>
for more information, see https://pre-commit.ci
Re-ran the PR at #89.
@BuyuanCui could this be closed?
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
What does this PR do?
Changes to ZH_TN from an outside contributor. Updates to the existing grammars and alignments are applied to keep them consistent with the ZH_ITN grammar.
Add a one line overview of what this PR aims to accomplish.
Update the cardinal, ordinal, decimal, fraction, time, date, and money grammars, and remove the math and measure grammars and the preprocessors.
Before your PR is "Ready for review"
Pre checks:
git commit -s to sign.
pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest and Sparrowhawk here.
__init__.py for every folder and subfolder, including data folder which has .TSV files?
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
Copyright 2015 and onwards Google, Inc. See an example here.
(try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.