Hebrew ITN - Adding Hebrew ITN #238

dankeinan1 · 2024-10-15T10:04:35Z

What does this PR do ?

Add Hebrew Inverse Text Normalization

Before your PR is "Ready for review"

Pre checks:

PR Type:

New Feature
Bugfix
Documentation
Test

If you haven't finished some of the above items you can still open "Draft" PR.

Signed-off-by: danken12 <[email protected]>

Signed-off-by: nir.raviv <[email protected]>

nirraviv89 · 2024-10-15T10:39:49Z

@tbartley94 this PR is now ready for review

Add Hebrew ITN

tbartley94

Did a first pass and left initial comments for style and technical improvements. Very nice work. General suggestions:

Reuse preexisting graphs from en.graph_utils. It's better to be conservative with these public variables in case there are waterfall effects from upstream pynini.
Our general style for graph composition is to aim for readability rather than compactness. If parentheses need to nest deeper than one level, it's better to break the composition over several lines. This also matters because Pynini is not always consistent with order of operations when it comes to concatenation and union operations, so it's easier to break the expression up than to check set theory notes.
For compatibility with upstream Sparrowhawk, you need to constrain the attributes you provide to graphs. Properties such as prefix can't be freely created when tagging; changing that requires creating a new fork of Sparrowhawk. Consult the following to see which properties are usable per semiotic class:

https://github.com/google/sparrowhawk/blob/master/src/lib/semiotic_classes.pb.cc#L203-L216

I'll be pinging someone else to check the Hebrew outputs (not a reader). For clarity, should this be understood as a normalization system for colloquial use, or would it be more restricted to liturgical texts?
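The third suggestion above can be checked mechanically. Below is a minimal sketch, assuming tagged output of the form `name: "value"`; the allowed set is a hypothetical, illustrative subset built only from names mentioned in this conversation, and the authoritative list is the Sparrowhawk file linked above. The function name `unsupported_properties` and the sample prefix value are also assumptions.

```python
import re

# Hypothetical, illustrative subset of property names for the "date" class.
# The authoritative list lives in the Sparrowhawk source linked in the review.
ALLOWED_DATE_PROPERTIES = {"day", "month", "year", "style", "morphosyntactic_feature"}

# Matches `name: "value"` pairs as they appear in a tagged string.
PROPERTY_PATTERN = re.compile(r'(\w+):\s*"[^"]*"')


def unsupported_properties(tagged: str, allowed: set[str]) -> list[str]:
    """Return the property names in `tagged` that are not in `allowed`."""
    names = PROPERTY_PATTERN.findall(tagged)
    return sorted({name for name in names if name not in allowed})


# Sample tagger output using a custom property, as in the date tagger diff.
tagged_date = 'date { month_prefix: "ב" month: "5" day: "3" }'
print(unsupported_properties(tagged_date, ALLOWED_DATE_PROPERTIES))
```

A check like this could run in a unit test over tagger outputs, so custom properties are caught before a Sparrowhawk deployment rather than after.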

nemo_text_processing/inverse_text_normalization/he/data/numbers/decimal_fractions.tsv

tbartley94 · 2024-10-16T21:31:40Z

nemo_text_processing/inverse_text_normalization/he/data/numbers/digit.tsv

+שמונה	8
+תשעה	9
+אחת	1
+שתיים	2


could you explain the multiple digit forms?

Yes. The digits don't change; there are different ways to say them. In Hebrew, numbers usually have gender, so we wanted to catch both masculine and feminine numbers. In addition, some numbers, like 2, are said differently in the construct state (proximity). For example, "two hands" would be "שתי ידיים" and not "שתיים ידיים".
We wanted to include all the forms of the numbers to better cover test cases.

Ah gotcha, and the infixation would make this a pain to just include the gendering in the graph. Could I bother you to split the data list into something like digit_masc digit_fem? It'll be easier for a maintainer to return to and understand without knowledge of the language.

There is no easy way to include the gendering in the graph, because there is no basic rule to follow.
In addition, I think it would actually be more confusing and harder to maintain if we start splitting the digits into different lists, because they are all used in the same places in the graphs. Maintaining different lists won't be easier in my opinion. I can do that, but I think it would not help maintainers in the future.

It'd be preferred from a non-speaker end. Because of the amount of language knowledge that goes into these things, we want to make sure there's as much surface level knowledge available. It's just easier on my end to pass off to another maintainer this way.
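A minimal sketch of the split discussed in this thread, assuming three hypothetical files (digit_masc.tsv, digit_fem.tsv and a construct-state list) whose contents are inlined here as strings. Only a few entries are shown, and `parse_tsv` and `merge` are illustrative helpers, not part of the PR. It shows that separate, labeled lists can still feed the single mapping the graphs use.

```python
# Hypothetical contents of digit_masc.tsv, digit_fem.tsv and digit_construct.tsv.
# File names and the selection of entries are assumptions for illustration.
DIGIT_MASC = "אחד\t1\nשניים\t2\nשמונה\t8\nתשעה\t9\n"
DIGIT_FEM = "אחת\t1\nשתיים\t2\nשמונה\t8\nתשע\t9\n"
DIGIT_CONSTRUCT = "שני\t2\nשתי\t2\n"


def parse_tsv(text: str) -> dict[str, str]:
    """Parse two-column TSV text into a spoken-form -> digit mapping."""
    rows = (line.split("\t") for line in text.splitlines() if line.strip())
    return {spoken: written for spoken, written in rows}


def merge(*tables: dict[str, str]) -> dict[str, str]:
    """Union several tables, refusing conflicting digits for the same word."""
    merged: dict[str, str] = {}
    for table in tables:
        for spoken, written in table.items():
            if merged.get(spoken, written) != written:
                raise ValueError(f"conflicting digits for {spoken!r}")
            merged[spoken] = written
    return merged


# One combined table, built from lists a non-speaker can tell apart by name.
digits = merge(parse_tsv(DIGIT_MASC), parse_tsv(DIGIT_FEM), parse_tsv(DIGIT_CONSTRUCT))
print(digits["שתי"], digits["שניים"], len(digits))
```

In the grammar itself the same effect would presumably come from a union of `pynini.string_file(...)` graphs, one per list, so the taggers keep using a single digit graph.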

nemo_text_processing/inverse_text_normalization/he/data/numbers/thousands.tsv

nemo_text_processing/inverse_text_normalization/he/data/time/minute_to.tsv

nemo_text_processing/inverse_text_normalization/he/graph_utils.py

tbartley94 · 2024-10-16T22:33:15Z

nemo_text_processing/inverse_text_normalization/he/taggers/date.py

+        month_names_graph = pynutil.insert("month: \"") + month_names + pynutil.insert("\"")
+
+        month_name2number = pynini.string_file(get_abs_path("data/months_name2number.tsv"))
+        month_name2number_graph = pynutil.insert("month: \"") + pynini.invert(month_name2number) + pynutil.insert("\"")


Why isn't the whitelist written in the correct order in the first place? Will be difficult to track which graphs need to be inverted and which don't.

What is the suggestion here? Just to make sure I understand:
should we invert the whitelist order so that we do not need to use pynini.invert?

You're calling invert on the data file. Why not just have the original graph in the desired order.
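One way to act on this, assuming the data file's two columns are currently in the reverse order of what the tagger needs: rewrite the file once so it can be loaded directly, without a `pynini.invert` call at graph-build time. The helper `swap_columns` and the sample rows are hypothetical; the real layout of months_name2number.tsv is not shown in this thread.

```python
def swap_columns(tsv_text: str) -> str:
    """Swap the two columns of every non-empty row in TSV text."""
    swapped = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        left, right = line.split("\t")
        swapped.append(f"{right}\t{left}")
    return "\n".join(swapped) + "\n"


# Hypothetical rows; the actual file contents are an assumption.
original = "1\tינואר\n2\tפברואר\n"
print(swap_columns(original))
```

After a one-time rewrite like this, every data file reads in the direction its graph uses, so a maintainer does not have to track which graphs are inverted.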

tbartley94 · 2024-10-16T22:38:40Z

nemo_text_processing/inverse_text_normalization/he/taggers/date.py

+
+        all_month_graph = month_name2number_graph | month_number2number_graph
+
+        month_prefix_graph = pynutil.insert("month_prefix: \"") + prefix_graph + pynutil.insert("\"") + insert_space


same as with cardinals, this will break sparrowhawk.

https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L203-L216

nemo_text_processing/inverse_text_normalization/he/taggers/decimal.py

nemo_text_processing/inverse_text_normalization/he/taggers/tokenize_and_classify.py

tbartley94 · 2024-10-16T23:26:21Z

nemo_text_processing/inverse_text_normalization/he/verbalizers/date.py

+        super().__init__(name="date", kind="verbalize")
+
+        day_prefix = (
+            pynutil.delete("day_prefix:")


will need to change due to sparrowhawk compatibility

what is the required change? I am not sure I understand, can you please explain?

day_prefix isn't a valid property you can pass with the date class tag. It will work for the Python codebase, but Sparrowhawk deployment will be incompatible. This will be a massive throughput limitation.

sorry ^ is from me from my other git.

We have to use prefixes in order to capture dates correctly.
What is your suggestion to fix this? Is the problem the variable name?
Why can't we add this prefix to the date graph?
I am not familiar enough with sparrowhawk to understand this comment. Please advise

Sorry, let me explain more clearly:

For Sparrowhawk, class labels and property tags (in this case day_prefix) are a limited set. You can only use class labels and property tags defined in Sparrowhawk.

See here for permitted class tags: https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L24-L59

See here for permitted property tags for date:

https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L205-L216

While the set of labels is closed, the value carried by a given label is not, so you can put any arbitrary string there. You can therefore replace day_prefix with style or morphosyntactic_feature anywhere you use day_prefix. It just needs to be regular and handled by the verbalizer. My personal choice would be morphosyntactic_feature, since that's the general purpose of that tag. If there are multiple prefixes you're trying to manage, you can use a trivial delimiter (e.g. morphosyntactic_feature: DAY_PREFIX/OTHER_PREFIX).

Because Sparrowhawk is archived, this cannot be changed upstream without a fork. So it's a hard restriction (unless you want to create a PR for an archived repo).

This restriction is only for Sparrowhawk. The NeMo Text Processing library proper allows trivial tags. So if you believe the change is outside the scope of the PR, you can let me know and I'll create an issue for Sparrowhawk integration with my team.
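The delimiter approach described above can be sketched as follows. This is a minimal illustration, assuming "/" never occurs inside a prefix value; `pack_prefixes` and `unpack_prefixes` are hypothetical helpers, and the prefix values are placeholders. In the actual grammars the packing would happen in the tagger graph and the unpacking in the verbalizer graph.

```python
DELIMITER = "/"  # assumption: never occurs inside a prefix value


def pack_prefixes(day_prefix: str = "", month_prefix: str = "") -> str:
    """Join both prefixes into one morphosyntactic_feature value."""
    for value in (day_prefix, month_prefix):
        if DELIMITER in value:
            raise ValueError(f"prefix {value!r} contains the delimiter")
    return f"{day_prefix}{DELIMITER}{month_prefix}"


def unpack_prefixes(feature: str) -> tuple[str, str]:
    """Recover (day_prefix, month_prefix) from a packed value."""
    day_prefix, month_prefix = feature.split(DELIMITER)
    return day_prefix, month_prefix


# Tagger side: one permitted property carries both prefixes.
feature = pack_prefixes(day_prefix="ב", month_prefix="ל")
tagged = f'date {{ day: "3" month: "5" morphosyntactic_feature: "{feature}" }}'
print(tagged)

# Verbalizer side: split the value back into its parts.
print(unpack_prefixes(feature))
```

Keeping the slot order fixed (day first, month second) means an absent prefix is just an empty string on its side of the delimiter, which keeps the value regular for the verbalizer.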


dankeinan1 and others added 2 commits October 15, 2024 11:54

Hebrew ITN - Adding Hebrew ITN

8dfcf9c

Signed-off-by: danken12 <[email protected]>

fix pre commit ci

9f85e68

Signed-off-by: nir.raviv <[email protected]>

Merge pull request #1 from verbit-ai/Add_Hebrew_ITN

a3c3c79

Add Hebrew ITN

tbartley94 self-requested a review October 16, 2024 21:29

tbartley94 requested changes Oct 16, 2024


dankeinan1 and others added 6 commits October 20, 2024 12:30

Hebrew ITN - code review changes i imports, more reuse of code

e1982ee

Signed-off-by: danken12 <[email protected]>

Merge remote-tracking branch 'origin/Add_Hebrew_ITN' into Add_Hebrew_ITN

4e9e476

# Conflicts: # nemo_text_processing/inverse_text_normalization/he/taggers/cardinal.py

Merge branch 'main' into Add_Hebrew_ITN

81b334b

Hebrew ITN - removed unrelated new lines

931c379

Signed-off-by: danken12 <[email protected]>

Hebrew ITN - CR changes

3537c49

Signed-off-by: danken12 <[email protected]>

Hebrew ITN - CR changes

67bbd20

Signed-off-by: danken12 <[email protected]>

github-advanced-security bot found potential problems Oct 28, 2024

nemo_text_processing/inverse_text_normalization/he/graph_utils.py

Hebrew ITN - CR changes

cab7488

Signed-off-by: danken12 <[email protected]>


tbartley94 Oct 16, 2024

dankeinan1 Oct 17, 2024

tbartley94 Oct 21, 2024

dankeinan1 Oct 21, 2024

tbartley94 Oct 21, 2024

tbartley94 Oct 16, 2024

dankeinan1 Oct 30, 2024

tbartley94 Oct 31, 2024

tbartley94 Oct 16, 2024

tbartley94 Oct 16, 2024

dankeinan1 Oct 30, 2024

bonham79 Oct 31, 2024

tbartley94 Oct 31, 2024

dankeinan1 Nov 4, 2024

tbartley94 Nov 4, 2024

