Zh tn #50

Closed
wants to merge 22 commits into from

Conversation

BuyuanCui
Collaborator

@BuyuanCui BuyuanCui commented Feb 27, 2023

What does this PR do?

This PR updates the ZH_TN grammars provided by an outside contributor. The existing grammars are updated and aligned to stay consistent with the ZH_ITN grammar.

  1. The cardinal grammar is split into two grammars, cardinal and decimal, with decimal now an independent class.
  2. Cardinal grammar coverage is extended up to the hundred-billion range.
  3. Added an ordinal grammar built on the cardinal grammar; it processes the morpheme "第", which marks order.
  4. Updated the date grammar so that it no longer processes inputs containing only two of the three components year, month, and day. For example, 2002/02 and 02/11 are not accepted, because these input formats are not ideal according to the national guideline (http://www.zgzlyx.com/uploadfile/news_images/zlyx/2022-05-26/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%9B%BD%E5%AE%B6%E6%A0%87%E5%87%86%E2%80%94%E2%80%94%E5%87%BA%E7%89%88%E7%89%A9%E4%B8%8A%E6%95%B0%E5%AD%97%E7%94%A8%E6%B3%95%EF%BC%88GB%EF%BC%8FT%2015835-2011%EF%BC%89.pdf).
  5. Updated the time grammar to cover expressions in the 'hour: minute: second' format as well as inputs like "5点6分". It also processes expressions that denote a span of time, such as "五个小时", "5秒钟", and "五个钟头".
  6. Updated the fraction grammar to cover percentages, for example "50%" or "百分之五十".
  7. Updated the money grammar to process expressions with units like "块", "毛", and "分", and large amounts written in decimal format, for example "1.5万美元". It also covers expressions where the currency is written in Mandarin rather than as a symbol, for example "¥15" vs. "15人民币".
  8. Measure, math, and preprocessor are not included. After discussion with the team, the plan is to align the classes with the ITN grammar, so only cardinal, ordinal, fraction, decimal, time, date, and money are included.
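
As an illustration of the date rule in item 4, here is a minimal plain-Python sketch. It is not the pynini grammar from this PR: the helper name `is_full_date`, the slash-only separator, and the month/day range checks are assumptions made for the example.

```python
import re

# Hypothetical pattern: year/month/day with all three components present.
_FULL_DATE = re.compile(r"(\d{4})/(\d{1,2})/(\d{1,2})", re.ASCII)


def is_full_date(text: str) -> bool:
    """Return True only when year, month, and day are all present."""
    match = _FULL_DATE.fullmatch(text)
    if match is None:
        # Two-component inputs such as "2002/02" or "02/11" end up here.
        return False
    _, month, day = (int(group) for group in match.groups())
    # Coarse range check only; it does not validate days per month.
    return 1 <= month <= 12 and 1 <= day <= 31


for sample in ("2002/02/11", "2002/02", "02/11"):
    print(sample, is_full_date(sample))
```

Under these assumptions only the first sample is accepted; the two-component inputs are left unprocessed, which mirrors the behavior described above.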

Add a one line overview of what this PR aims to accomplish.
Update the cardinal, ordinal, decimal, fraction, time, date, and money grammar, and remove math, measure, and preprocessors.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • [*] Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • [*] Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • [*] Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • [*] Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • [*] If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • [*] Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo). (Couldn't access)
  • [*] Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • [*] New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

@yzhang123 yzhang123 removed their assignment Feb 27, 2023
@yzhang123
Contributor

@fayejf could you pls review this PR?

@fayejf

fayejf commented Feb 27, 2023

@yzhang123 Sure!

@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 14, 2023
@github-actions

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Mar 21, 2023
@BuyuanCui BuyuanCui reopened this Jun 23, 2023
@BuyuanCui BuyuanCui marked this pull request as ready for review June 23, 2023 23:29
@github-actions github-actions bot removed the Stale label Jun 24, 2023
@fayejf fayejf self-requested a review June 27, 2023 22:31
@@ -1,4 +1,4 @@
# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Contributor

@jimregan jimregan Jun 28, 2023

Having a file called __init__ copy.py seems like it might cause problems later on. Also, it's unneeded

@@ -11,14 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# graph_utils is kept because importing from the en folder causes import errors: the data file names would have to match those in the en folder
Contributor

This is only true for get_abs_path

Collaborator Author

Yes, when importing the data, I used get_abs_path every time.
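
To illustrate why this affects only get_abs_path, here is a small sketch. It assumes the helper resolves a relative path against the directory of the module that defines it. The factory `make_get_abs_path` and the file paths below are hypothetical and only simulate two language folders.

```python
import os


def make_get_abs_path(module_file: str):
    """Build a path helper bound to the directory of `module_file` (hypothetical)."""
    base_dir = os.path.dirname(os.path.abspath(module_file))

    def get_abs_path(rel_path: str) -> str:
        # The relative path is resolved against the defining module's
        # directory, not against the caller's directory.
        return os.path.join(base_dir, rel_path)

    return get_abs_path


# Simulated helpers, as if defined in the en and zh folders respectively.
en_get_abs_path = make_get_abs_path(os.path.join("text_normalization", "en", "utils.py"))
zh_get_abs_path = make_get_abs_path(os.path.join("text_normalization", "zh", "utils.py"))

print(en_get_abs_path(os.path.join("data", "example.tsv")))
print(zh_get_abs_path(os.path.join("data", "example.tsv")))
```

Under this assumption, a helper imported from the en folder looks for data files under en, while classes and constants that do not touch the file system can be imported from en without that side effect.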

"""
return fst @ pynini.cdrewrite(pynini.cross(NEMO_SPACE, NEMO_NON_BREAKING_SPACE), "", "", NEMO_SIGMA)


class GraphFst:
Contributor

You don't need this, just import from en

@@ -61,6 +90,44 @@ def generator_main(file_name: str, graphs: Dict[str, 'pynini.FstLike']):
logging.info(f'Created {file_name}')


def get_plurals(fst):
Contributor

This is very specific to English

Collaborator Author

Would remove it.


import pynini
from nemo_text_processing.text_normalization.zh.graph_utils import GraphFst
from nemo_text_processing.text_normalization.zh.taggers.cardinal import CardinalFst

Check notice

Code scanning / CodeQL

Unused import

Import of 'CardinalFst' is not used.

def __init__(self, deterministic: bool = True, lm: bool = False):
    super().__init__(name="date", kind="classify", deterministic=deterministic)
    cardinal = cardinal.just_cardinals

Check notice

Code scanning / CodeQL

Unused local variable

Variable cardinal is not used.
Comment on lines +21 to +30
# class TestMath:
#     normalizer_zh = Normalizer(lang='zh', cache_dir=CACHE_DIR, overwrite_cache=False, input_case='cased')
#
#     @parameterized.expand(parse_test_case_file('zh/data_text_normalization/test_cases_math.txt'))
#     @pytest.mark.run_only_on('CPU')
#     @pytest.mark.unit
#     def test_norm_math(self, test_input, expected):
#         preds = self.normalizer_zh.normalize(test_input)
#         assert expected == preds

Check notice

Code scanning / CodeQL

Commented-out code

This comment appears to contain commented-out code.
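
One possible way to address this notice without deleting the test is to mark it as skipped instead of commenting it out. The sketch below uses the standard-library `unittest.skip` so that it is self-contained; in the pytest-based suite the analogous marker would be `@pytest.mark.skip(reason=...)`. The class body and the skip reason are placeholders, not the real test.

```python
import io
import unittest


@unittest.skip("math grammar is not included in this PR")
class TestMath(unittest.TestCase):
    def test_norm_math(self):
        # Placeholder body: never executed while the class is skipped.
        self.fail("should not run")


def run_suite() -> unittest.TestResult:
    """Run TestMath quietly and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMath)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite)


result = run_suite()
print(len(result.skipped), result.wasSuccessful())
```

A skipped test stays visible in the test report together with its reason, so it is harder to forget than a commented-out block.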
@BuyuanCui BuyuanCui force-pushed the ZH_TN branch 3 times, most recently from 8cea91c to 99acfcd Compare July 6, 2023 20:01
@BuyuanCui
Collaborator Author

Resubmitted this PR as #89 because of rebase and conflict issues.

@ekmb
Collaborator

ekmb commented Jul 20, 2023

@BuyuanCui could this be closed?

@github-actions

github-actions bot commented Aug 4, 2023

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 4, 2023
@github-actions

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Aug 12, 2023
5 participants