Cardinals up to a hundred trillions, timeFST and transliteration #209

kurt0cougar · 2024-08-19T05:43:30Z

What does this PR do ?

Adds Kinyarwanda TN support for the following:

CARDINALS semiotic class (up to hundred trillion)
TIME semiotic class
Transliteration

Before your PR is "Ready for review"

Pre checks:

PR Type:

New Feature
Bugfix
Documentation
Test

If you haven't finished some of the above items you can still open "Draft" PR.

kurt0cougar · 2024-08-19T05:46:01Z

@tbartley94 The pull request is ready for review.

tbartley94 · 2024-08-26T18:12:45Z

nemo_text_processing/text_normalization/rw/taggers/time.py

+    def __init__(self):
+        super().__init__(name="time", kind="classify")
+
+        hours = pynini.string_map([


Mind moving this and the minutes into a data file? (It's easier to debug and maintain if we just keep it as text and read from there.)
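As a rough illustration of the suggestion above, the hour and minute pairs could live in a tab-separated text file and be read when the grammar is built. This is a sketch under assumptions, not the PR's code: the file name `hours.tsv` and the helper `load_labels` are hypothetical, and the only row used is the `('12', 'saa sita')` pair visible in the diff. The resulting pairs could then be handed to `pynini.string_map`, or the file could be compiled directly with `pynini.string_file`.

```python
import csv
import os
import tempfile
from typing import List


def load_labels(path: str) -> List[List[str]]:
    """Read a tab-separated file into [written, spoken] pairs (hypothetical helper)."""
    with open(path, encoding="utf-8", newline="") as handle:
        # Skip blank lines so a trailing newline in the data file is harmless.
        return [row for row in csv.reader(handle, delimiter="\t") if row]


# Demo with a throwaway file; a real grammar would ship the file in its data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "hours.tsv")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("12\tsaa sita\n")
    hour_labels = load_labels(demo_path)

print(hour_labels)
```

Keeping the pairs as text means a mapping can be corrected without touching the grammar code.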

tbartley94 · 2024-08-26T18:12:58Z

nemo_text_processing/text_normalization/rw/taggers/time.py

+            ('12', 'saa sita'),            
+        ])
+
+        minutes = pynini.string_map([


tbartley94 · 2024-08-26T18:20:46Z

nemo_text_processing/text_normalization/rw/verbalizers/time.py

+class VerbalizeTimeFst(GraphFst):
+    def __init__(self):
+        super().__init__(name="time",kind="verbalize")
+        hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \


You can just do base closure since the hours: property will limit potential tokens. I also believe it is more efficient since counting in FST graphs enlarges the graph (iirc).
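To make the size argument concrete, here is a toy model in plain Python rather than pynini (a deliberate substitution so the example runs without the library). It describes the minimal deterministic automaton for "any character repeated between `lower` and `upper` times" and for "at least `lower` times", and counts states. The function names are hypothetical, and the size of an actual pynini graph depends on how it is built and optimized.

```python
from typing import Optional, Set, Tuple


def repetition_automaton(lower: int, upper: Optional[int]) -> Tuple[int, Set[int], bool]:
    """Model 'any character' repeated lower..upper times (upper=None means unbounded).

    Returns (number_of_states, accepting_states, last_state_loops).
    State i means 'i characters consumed so far'.
    """
    if upper is None:
        # States 0..lower; the last state loops on itself to absorb extra characters.
        return lower + 1, {lower}, True
    # States 0..upper; every count from lower to upper is accepting, with no loop.
    return upper + 1, set(range(lower, upper + 1)), False


def accepts(text: str, lower: int, upper: Optional[int]) -> bool:
    """Run the toy automaton over text."""
    n_states, accepting, loops = repetition_automaton(lower, upper)
    state = 0
    for _ in text:
        if state < n_states - 1:
            state += 1
        elif not loops:
            return False  # ran past the last state of a bounded automaton
    return state in accepting


bounded_states = repetition_automaton(1, 60)[0]    # analogue of closure(x, 1, 60)
unbounded_states = repetition_automaton(1, None)[0]  # analogue of closure(x, 1)
print(bounded_states, unbounded_states)
```

In this model the counted version needs one state per possible length, while the unbounded version needs only enough states to check the lower bound.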

tbartley94 · 2024-08-26T18:21:18Z

nemo_text_processing/text_normalization/rw/verbalizers/time.py

+    def __init__(self):
+        super().__init__(name="time",kind="verbalize")
+        hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \
+                +pynutil.delete("minutes:")+delete_space+pynutil.delete("\"") + pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\""))


tbartley94 · 2024-08-26T18:24:48Z

nemo_text_processing/text_normalization/rw/verbalizers/verbalize_final.py

+		        + delete_space
+		        + pynutil.delete("}")
+		    )
+            graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space


So heads up, this has a behavior where you can't separate normalization candidates with punctuation: there needs to be a space or it's a no-op. Is this intended for y'all's downstream needs?

kurt0cougar · 2024-08-28

Can you give me an example of this? I don't fully get what you mean.

tbartley94 · 2024-08-29

1,5 -> one,5
1, 5 -> one, five

It won't parse the comma as a class separator since delete_extra_space requires at least one space.
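The two cases above can be mimicked with a toy model in plain Python. This is not the grammar from the PR: it only assumes that a new token can start solely after a space, and that whatever follows the first classifiable token in a chunk is passed through unchanged. The names `toy_normalize` and `DIGIT_WORDS` are hypothetical.

```python
import re

# Toy verbalizations for the reviewer's example; a real grammar covers far more.
DIGIT_WORDS = {"1": "one", "5": "five"}

# One leading digit, optionally followed by a non-digit remainder such as ",5".
_LEADING_DIGIT = re.compile(r"^(\d)(\D.*)?$")


def toy_normalize(text: str) -> str:
    """Normalize only the token at the start of each space-separated chunk."""
    out = []
    for chunk in text.split(" "):
        match = _LEADING_DIGIT.match(chunk)
        if match and match.group(1) in DIGIT_WORDS:
            # The remainder (for example ",5") is passed through untouched.
            out.append(DIGIT_WORDS[match.group(1)] + (match.group(2) or ""))
        else:
            out.append(chunk)
    return " ".join(out)


print(toy_normalize("1,5"))
print(toy_normalize("1, 5"))
```

Under these assumptions the second digit is only verbalized when a space gives it a chunk of its own.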

tbartley94 · 2024-08-26T18:26:23Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+from nemo_text_processing.text_normalization.en.graph_utils import GraphFst,NEMO_CHAR,insert_space
+from nemo_text_processing.text_normalization.rw.utils import get_abs_path
+
+def apply_fst(text, fst):


This method is unnecessary. The normalize class takes care of this with additional preprocessing. Please remove.

tbartley94 · 2024-08-26T18:28:17Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+    def __init__(self):
+        super().__init__(name="cardinal", kind="classify")
+        alphabet = string.ascii_letters
+        rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure())


For line legibility, mind just defining two string classes called vowels and consonants and inherit from your graph_fst class?
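A sketch of what named character classes buy in readability, written as a regex analogue in plain Python rather than as the PR's FST code. The constants `VOWELS` and `CONSONANTS` and the function `link_words` are hypothetical names; in pynini the same strings could feed `pynini.union(*VOWELS)` in place of the inline literals. The rule mirrors the quoted rewrites: a space preceded by a vowel or space becomes " na " before a consonant and " n'" before a vowel. The sample phrases only exercise the rule and are not claimed to be valid Kinyarwanda numerals.

```python
import re
import string

VOWELS = "aeiouAEIOU"
CONSONANTS = "".join(c for c in string.ascii_letters if c not in VOWELS)

# A space whose left neighbour is a vowel or space and whose right neighbour is a letter.
_LINK = re.compile(rf"(?<=[{VOWELS} ]) (?=([{string.ascii_letters}]))")


def link_words(text: str) -> str:
    """Insert the connective in one pass so the two rules cannot feed each other."""

    def replace(match: "re.Match[str]") -> str:
        next_char = match.group(1)
        return " na " if next_char in CONSONANTS else " n'"

    return _LINK.sub(replace, text)


print(link_words("inani mirongo"))
print(link_words("mirongo inani"))
```

Doing both substitutions in a single pass avoids an ordering problem: the "n" introduced by " n'" is itself a consonant and would otherwise match the " na " rule.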

tbartley94 · 2024-08-26T18:29:14Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+class CardinalFst(GraphFst):
+    def __init__(self):
+        super().__init__(name="cardinal", kind="classify")
+        alphabet = string.ascii_letters


Just use NEMO_ALPHA since it's already defined.

tbartley94 · 2024-08-26T18:43:48Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+        rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure())
+        rewrite_n_fst = pynini.cdrewrite(pynini.cross(" "," n'"),pynini.union(*"aeiouAEIOU "),pynini.union(*"aeiouAEIOU"),NEMO_CHAR.closure())
+        remove_underscore_fst = pynini.cdrewrite(pynini.cross("_"," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure())
+        remove_extra_space_fst = pynini.cdrewrite(pynini.cross("  "," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure())


Use delete_extra_space from graph utils.

tbartley94 · 2024-08-26T18:44:34Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+            ("tiriyoni_mirongo_inani","8"),
+            ("tiriyoni_mirongo_icyenda","9")
+        ])
+        hundreds_of_trillions = pynini.string_map([


Move all these string maps to a data folder so they're easier to maintain.

tbartley94 · 2024-08-26T18:48:54Z

nemo_text_processing/text_normalization/rw/taggers/cardinal.py

+        NINE_ZEROS = "000000000"
+
+        zero = pynini.string_map([("zeru","0")])
+        rewrite_remove_comma_fst = pynini.cdrewrite(pynini.cross(",",""),pynini.union(*"0123456789"),pynini.union(*"0123456789"),NEMO_CHAR.closure())


You can use NEMO_DIGIT here
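For readers unfamiliar with context-dependent rewrites, the quoted rule deletes a comma only when it sits between two digits. A plain-Python regex analogue (not the pynini code itself; `remove_digit_commas` is a hypothetical name) looks like this, with the digit class playing the role the reviewer suggests giving to `NEMO_DIGIT`:

```python
import re

# Lookbehind and lookahead stand in for the left and right contexts of cdrewrite.
_COMMA_BETWEEN_DIGITS = re.compile(r"(?<=[0-9]),(?=[0-9])")


def remove_digit_commas(text: str) -> str:
    """Delete commas that have a digit on both sides; leave all others alone."""
    return _COMMA_BETWEEN_DIGITS.sub("", text)


print(remove_digit_commas("1,000,000"))
print(remove_digit_commas("1, 5"))
```

Using one shared digit definition keeps the left and right contexts from drifting apart when the rule is edited later.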

tbartley94

Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

kurt0cougar · 2024-08-27T23:10:51Z

> Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

Great, thank you for the feedback, I will add the changes and get back.

tbartley94 · 2024-08-30T17:53:01Z

> > Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

> Great, thank you for the feedback, I will add the changes and get back.

Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

kurt0cougar · 2024-09-02T04:34:27Z

> > > Major requests are just to move more of the string maps into the data folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.

> > Great, thank you for the feedback, I will add the changes and get back.

> Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs.

Yes, we should do this. I fixed most of the issues you pointed out, and I can work on the rest on a future PR.

tbartley94 · 2024-09-03T17:28:25Z

nemo_text_processing/text_normalization/rw/graph_utils.py

+    return fst @ pynini.cdrewrite(pynini.cross(NEMO_SPACE, NEMO_NON_BREAKING_SPACE), "", "", NEMO_SIGMA)
+
+
+def string_map_cased(input_file: str, input_case: str = INPUT_LOWER_CASED):


Add documentation, and just make the flag a boolean to avoid the string matching.
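A minimal sketch of the requested shape, assuming the function only needs to know whether cased variants are wanted. It is not the function from the PR (which was later removed): `add_cased_variants` and its `cased` parameter are hypothetical, and it works on an in-memory list instead of reading a file.

```python
from typing import List, Sequence


def add_cased_variants(labels: Sequence[Sequence[str]], cased: bool = False) -> List[List[str]]:
    """Return the [written, spoken] pairs, optionally with capitalized variants.

    Args:
        labels: pairs of written form and spoken form.
        cased: when True, also emit each pair with the first letter of the
            written form upper-cased. A boolean replaces comparing against
            string constants such as INPUT_CASED.
    """
    result = [list(pair) for pair in labels]
    if cased:
        for written, spoken in labels:
            capitalized = written[:1].upper() + written[1:]
            if capitalized != written:  # skip entries that have no cased form
                result.append([capitalized, spoken])
    return result


print(add_cased_variants([["zeru", "0"]], cased=True))
```

With a boolean, a misspelled mode string can no longer silently select the default branch.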

tbartley94 · 2024-09-03T17:30:47Z

nemo_text_processing/text_normalization/rw/graph_utils.py

+    if input_case == INPUT_CASED:
+        additional_labels = []
+        for written, spoken, *weight in labels:
+            written_capitalized = written[0].upper() + written[1:]


Just use capwords https://docs.python.org/3/library/string.html#helper-functions.
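One thing worth checking before swapping in `string.capwords`: it is not a drop-in replacement for the slicing in the quoted diff. It capitalizes every whitespace-separated word and lower-cases the rest of each word. The short comparison below uses a hypothetical helper `capitalize_first` that reproduces the diff's expression.

```python
import string


def capitalize_first(text: str) -> str:
    """Upper-case only the first character, as the quoted diff does."""
    return text[:1].upper() + text[1:]


samples = ["saa sita", "BMW"]
comparison = {s: (capitalize_first(s), string.capwords(s)) for s in samples}

for source, (first_only, capwords) in comparison.items():
    print(f"{source!r}: first-letter={first_only!r} capwords={capwords!r}")
```

Whether the difference matters depends on the data: for single lower-case words the two agree, but multi-word entries and all-caps abbreviations come out differently.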

tbartley94 · 2024-09-03T17:31:18Z

nemo_text_processing/text_normalization/rw/graph_utils.py

+            written_capitalized = written[0].upper() + written[1:]
+            additional_labels.extend(
+                [
+                    [written_capitalized, spoken.capitalize(),],  # first letter capitalized


tbartley94 · 2024-09-03T17:34:27Z

nemo_text_processing/text_normalization/rw/graph_utils.py

+            spoken_no_space = spoken.replace(" ", "")
+            # add abbreviations without spaces (both lower and upper case), i.e. "BMW" not "B M W"
+            if len(spoken) == (2 * len(spoken_no_space) - 1):
+                logger.debug(f"This is weight {weight}")


Need documentation for this case

kurt0cougar · 2024-09-12

I removed the function since it is not used in the Kinyarwanda text normalization; I believe the section you flagged was checking for abbreviations.

And for the Jenkins issue, I rebased our branch.
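For the record, the length test in the flagged snippet can be read as follows: a string of n single characters separated by single spaces has length 2n - 1, so the comparison is true for spelled-out forms such as "B M W". The sketch below isolates that expression in a hypothetical function, `looks_spelled_out`; it is an illustration, not code from the PR.

```python
def looks_spelled_out(spoken: str) -> bool:
    """Length heuristic from the quoted diff: n characters plus n - 1 spaces."""
    spoken_no_space = spoken.replace(" ", "")
    return len(spoken) == 2 * len(spoken_no_space) - 1


print(looks_spelled_out("B M W"))
print(looks_spelled_out("saa sita"))
```

Because only lengths are compared, the test is a heuristic: a string such as "abc" followed by two spaces also satisfies it, which is the kind of case the requested documentation would need to spell out.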

tbartley94

LGTM

tbartley94 · 2024-09-03T17:39:26Z

@zoobereq can you assist in figuring out the jenkins issue?

zoobereq · 2024-09-03T18:45:14Z

Jenkinsfile needs to be up-to-date with main in order to use the most recent grammars in the CI pipeline. The CI checks won't pass otherwise. Either copy the content of Jenkinsfile directly from main and push the updated file to your PR, or rebase Digital-Umuganda:main over the upstream main branch and resolve any conflicts.

zoobereq · 2024-09-10T18:59:54Z

Please make sure to accept and implement the formatting changes pushed by pre-commit.ci.

In order to pass the CI pipeline (continuous-integration/jenkins), you will have to update your Jenkinsfile. You can either rebase this branch over the current main or copy-paste the contents of Jenkinsfile from main into the current PR. Rebasing is of course more comprehensive and less risky.

We use Jenkinsfile to define a pipeline for automating the building, testing, and deployment of text normalization (TN) and inverse text normalization (ITN) grammars for multiple languages, using Docker and various tools like PyTorch, NeMo, and pytest. One of the things that Jenkinsfile contains is environment variables for various language-specific text normalization (TN) cache directories (e.g., EN_TN_CACHE for English, DE_TN_CACHE for German, etc.). These directories store precompiled grammars for efficient re-use during the pipeline execution. Keeping them up-to-date is crucial for successful execution of the CI pipeline (and passing all the checks).

Please let me know if you need further assistance with this PR. We will be happy to review and merge it as soon as all checks are passed.

zoobereq · 2024-09-12T04:48:16Z

Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

kurt0cougar · 2024-09-12T23:04:35Z

> Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the pre-commit hooks or run setup.py style --fix from the root directory, whichever is easier.

Done.

zoobereq · 2024-09-16T15:04:03Z

The pre-commit hooks are still failing, due to Black reformatting your code upon saving. Make sure that the formatting adheres to the NeMo guidelines by either pulling the push issued by pre-commit or running setup.py style --fix and disabling Black prior to saving and pushing (if Black is triggered automatically upon saving the code).

CodeQL won't run until the above passes.

tbartley94 · 2024-09-17T00:57:08Z

@zoobereq I'm feeling that if the CI tests pass we can just approve and then catch formatting things on our end. Makes things easier.

Mind giving a second set of eyes before we approve?

zoobereq

Looks good on my end.

* Cardinals up to a hundred trillions, timeFST and transliteration
  Signed-off-by: kurt0cougar <[email protected]>
* Cardinals up to a hundred trillions, timeFST and transliteration
  Signed-off-by: kurt0cougar <[email protected]>
* Cardinals up to a hundred trillions, timeFST and transliteration (moving constants to data files).
  Signed-off-by: kurt0cougar <[email protected]>
* Update test_cases_word.txt
  Signed-off-by: kurt0cougar <[email protected]>
* Update graph_utils.py
  Signed-off-by: kurt0cougar <[email protected]>
* Cardinals up to a hundred trillions, timeFST and transliteration - reformatteda
  Signed-off-by: kurt0cougar <[email protected]>
* Disabled Black during formatting.
  Signed-off-by: kurt0cougar <[email protected]>
---------
Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: Alex Cui <[email protected]>

tbartley94 reviewed Aug 26, 2024

tbartley94 reviewed Aug 26, 2024

tbartley94 requested changes Aug 26, 2024

tbartley94 reviewed Sep 3, 2024

tbartley94 previously approved these changes Sep 3, 2024

kurt0cougar added 3 commits September 12, 2024 00:34

Cardinals up to a hundred trillions, timeFST and transliteration

d5f0e07

Signed-off-by: kurt0cougar <[email protected]>

Cardinals up to a hundred trillions, timeFST and transliteration

44d726c

Signed-off-by: kurt0cougar <[email protected]>

Cardinals up to a hundred trillions, timeFST and transliteration (moving constants to data files).

81d0e83

Signed-off-by: kurt0cougar <[email protected]>

kurt0cougar force-pushed the main branch from 949c76e to 81d0e83 Compare September 11, 2024 22:40

Update test_cases_word.txt

32b9ddf

Signed-off-by: kurt0cougar <[email protected]>

kurt0cougar dismissed tbartley94’s stale review via 32b9ddf September 12, 2024 02:52

Update graph_utils.py

2b8c220

Signed-off-by: kurt0cougar <[email protected]>

Cardinals up to a hundred trillions, timeFST and transliteration - reformatteda

3d46312

Signed-off-by: kurt0cougar <[email protected]>

github-advanced-security bot found potential problems Sep 16, 2024

nemo_text_processing/text_normalization/rw/graph_utils.py: Fixed

nemo_text_processing/text_normalization/rw/taggers/cardinal.py: Fixed

nemo_text_processing/text_normalization/rw/verbalizers/verbalize_final.py: Fixed

Disabled Black during formatting.

26a7131

Signed-off-by: kurt0cougar <[email protected]>

zoobereq approved these changes Sep 17, 2024

tbartley94 merged commit 1bc4930 into NVIDIA:main Sep 17, 2024
5 checks passed
