Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cardinals up to a hundred trillions, timeFST and transliteration #209

Merged
merged 7 commits into from
Sep 17, 2024

Conversation

kurt0cougar
Copy link
Contributor

@kurt0cougar kurt0cougar commented Aug 19, 2024

What does this PR do ?

Adds Kinyarwanda TN support for the following:

  • CARDINALS semiotic class (up to hundred trillion)
  • TIME semiotic class
  • Trnasliteration

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@kurt0cougar
Copy link
Contributor Author

@tbartley94 The pull request is ready for review.

def __init__(self):
super().__init__(name="time", kind="classify")

hours = pynini.string_map([
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind moving this and the minutes into a data file? (It's easier to debug and maintain if we just keep it as text and read from there.)

('12', 'saa sita'),
])

minutes = pynini.string_map([
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

class VerbalizeTimeFst(GraphFst):
def __init__(self):
super().__init__(name="time",kind="verbalize")
hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just do base closure since the hours: property will limit potential tokens. I also believe it is more efficient since counting in FST graphs enlarges the graph (iirc).

def __init__(self):
super().__init__(name="time",kind="verbalize")
hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \
+pynutil.delete("minutes:")+delete_space+pynutil.delete("\"") + pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\""))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

+ delete_space
+ pynutil.delete("}")
)
graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So heads up, this has a behavior that you can't separate normalization candidates with punctuation, there needs to be space or its a no-opt. This intended for y'alls downstream needs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me an example of this? I don't fully get what you mean.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1,5 -> one,5
1, 5 -> one, five

It won't parse the comma as a class separator since delete_extra_space requires at least one space.

from nemo_text_processing.text_normalization.en.graph_utils import GraphFst,NEMO_CHAR,insert_space
from nemo_text_processing.text_normalization.rw.utils import get_abs_path

def apply_fst(text, fst):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is unnecessary. the normalize class takes care of this with additional preprocessing. Please remove.

def __init__(self):
super().__init__(name="cardinal", kind="classify")
alphabet = string.ascii_letters
rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For line legitibility, mind just defining two string classes called vowels and consonants and inherit from your graph_fst class?

class CardinalFst(GraphFst):
def __init__(self):
super().__init__(name="cardinal", kind="classify")
alphabet = string.ascii_letters
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use NEMO_ALPHA since its already defined.

rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure())
rewrite_n_fst = pynini.cdrewrite(pynini.cross(" "," n'"),pynini.union(*"aeiouAEIOU "),pynini.union(*"aeiouAEIOU"),NEMO_CHAR.closure())
remove_underscore_fst = pynini.cdrewrite(pynini.cross("_"," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure())
remove_extra_space_fst = pynini.cdrewrite(pynini.cross(" "," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use delete_extra_space from graph utils.

("tiriyoni_mirongo_inani","8"),
("tiriyoni_mirongo_icyenda","9")
])
hundreds_of_trillions = pynini.string_map([
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move all these string maps to a data folder so easier to maintain.

NINE_ZEROS = "000000000"

zero = pynini.string_map([("zeru","0")])
rewrite_remove_comma_fst = pynini.cdrewrite(pynini.cross(",",""),pynini.union(*"0123456789"),pynini.union(*"0123456789"),NEMO_CHAR.closure())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use NEMO_DIGIT here

Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

@kurt0cougar
Copy link
Contributor Author

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

@tbartley94
Copy link
Collaborator

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

@kurt0cougar
Copy link
Contributor Author

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

Yes, we should do this. I fixed most of the issues you pointed out, and I can work on the rest on a future PR.

return fst @ pynini.cdrewrite(pynini.cross(NEMO_SPACE, NEMO_NON_BREAKING_SPACE), "", "", NEMO_SIGMA)


def string_map_cased(input_file: str, input_case: str = INPUT_LOWER_CASED):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add documentation, just make the flag a boolean to avoid the string matching

if input_case == INPUT_CASED:
additional_labels = []
for written, spoken, *weight in labels:
written_capitalized = written[0].upper() + written[1:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

written_capitalized = written[0].upper() + written[1:]
additional_labels.extend(
[
[written_capitalized, spoken.capitalize(),], # first letter capitalized
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

spoken_no_space = spoken.replace(" ", "")
# add abbreviations without spaces (both lower and upper case), i.e. "BMW" not "B M W"
if len(spoken) == (2 * len(spoken_no_space) - 1):
logger.debug(f"This is weight {weight}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need documentation for this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the function since it is not used in the Kinyarwanda text normalization; I believe the section you flagged was checking for abbreviations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for the Jenkins issue, I rebased our branch.

tbartley94
tbartley94 previously approved these changes Sep 3, 2024
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tbartley94
Copy link
Collaborator

@zoobereq can you assist in figuring out the jenkins issue?

@zoobereq
Copy link
Collaborator

zoobereq commented Sep 3, 2024

JenkinsFile needs to be up-do-date with the main in order to utilize most-recent grammars in the CI pipeline. The CI checks won't pass otherwise. Either copy the content of JenkinsFile directly from the main and push the updated file to your PR, or rebase Digital-Umuganda:main over the upstream main branch and resolve any conflicts.

@zoobereq
Copy link
Collaborator

Please make sure to accept and implement the formatting changes pushed by pre-commit.ci.

In order to pass the CI pipeline (continuous-integration/jenkins), you will have to update your Jenkinsfile. You can either rebase this branch over the current main or copy-paste the contents of Jenkinsfile from main into the current PR. Rebasing is of course more comprehensive and less risky.

We use Jenkinsfile to define a pipeline for automating the building, testing, and deployment of text normalization (TN) and inverse text normalization (ITN) grammars for multiple languages, using Docker and various tools like PyTorch, NeMo, and pytest. One of the things that Jenkinsfile contains is environment variables for various language-specific text normalization (TN) cache directories (e.g., EN_TN_CACHE for English, DE_TN_CACHE for German, etc.). These directories store precompiled grammars for efficient re-use during the pipeline execution. Keeping them up-to-date is crucial for successful execution of the CI pipeline (and passing all the checks).

Please let me know if you need further assistance with this PR. We will be happy review and merge it as soon as all checks are passed.

Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: kurt0cougar <[email protected]>
@zoobereq
Copy link
Collaborator

Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

@kurt0cougar
Copy link
Contributor Author

Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

Done.

@zoobereq
Copy link
Collaborator

The pre-commit hooks are still failing, due to Black reformatting your code upon saving. Make sure that the formatting adheres to the NeMo guidelines by either pulling the push issued by pre-commit or running setup.py style --fix and disabling Black prior to saving and pushing (if Black is triggered automatically upon saving the code).

CodeQL won't run until the above passes.

@tbartley94
Copy link
Collaborator

@zoobereq I'm feeling if CL tests pass we can just approve and then catch formatting things on our end. Make things easier.

Mind giving a second set of eyes before we approve?

Copy link
Collaborator

@zoobereq zoobereq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my end.

@tbartley94 tbartley94 merged commit 1bc4930 into NVIDIA:main Sep 17, 2024
5 checks passed
BuyuanCui pushed a commit that referenced this pull request Sep 26, 2024
* Cardinals up to a hundred trillions, timeFST and transliteration

Signed-off-by: kurt0cougar <[email protected]>

* Cardinals up to a hundred trillions, timeFST and transliteration

Signed-off-by: kurt0cougar <[email protected]>

* Cardinals up to a hundred trillions, timeFST and transliteration (moving constants to data files).

Signed-off-by: kurt0cougar <[email protected]>

* Update test_cases_word.txt

Signed-off-by: kurt0cougar <[email protected]>

* Update graph_utils.py

Signed-off-by: kurt0cougar <[email protected]>

* Cardinals up to a hundred trillions, timeFST and transliteration - reformatteda

Signed-off-by: kurt0cougar <[email protected]>

* Disabled Black during formatting.

Signed-off-by: kurt0cougar <[email protected]>

---------

Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: Alex Cui <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants