Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hebrew ITN - Adding Hebrew ITN #238

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

dankeinan1
Copy link

@dankeinan1 dankeinan1 commented Oct 15, 2024

What does this PR do ?

Add Hebrew Inverse Text Normalization

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

dankeinan1 and others added 2 commits October 15, 2024 11:54
@nirraviv89
Copy link

@tbartley94 this PR is now ready for review

@tbartley94 tbartley94 self-requested a review October 16, 2024 21:29
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass and left initial comments for style and technical improvements. Very nice work. General suggestions:

  1. Reuse prexisting graphs from the en.graph_utils. It's better to be conservative on these public variable in case there's waterfall effects from upstream pynini.

  2. Our general style for graph composition is to aim for readability rather than compatness. If there's need for parentheses of depth > 1 , it's better to break the composition over several lines for readability. This is also important since Pynini is not always constant with order of operations when it comes to concatenation and union operations. So it's easier to break off rather than check set theory notes.

  3. For compatibility with upstream sparrowhawk, you need to constrain the attributes you provide to graphs. Defined properties such as prefix can't be freely created when tagging. To change that requires creating a new fork of Sparrowhawk. Consult here for knowing which properties are usable per semiotic class:

https://github.com/google/sparrowhawk/blob/master/src/lib/semiotic_classes.pb.cc#L203-L216

I'll be pinging someone else to check the Hebrew outputs (not a reader). For clarity, should this be understood as a normalization system for colloquial use, or would it be more restricted to liturgical texts?

שמונה 8
תשעה 9
אחת 1
שתיים 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain the multiple digit forms?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The digits doesn't change there are different ways to say it. In Hebrew, numbers usually have gender, so we wanted to catch both masc. numbers and fem. numbers. In addition, some numbers are said differently like 2, when coming in construct state (proximity). For example "two hands" would be "שתי ידיים" and not " שתיים ידיים".
We wanted to include all the forms of the numbers, for better covering test cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, and the infixation would make this a pain to just include the gendering in the graph. Could I bother you to split the data list into something like digit_masc digit_fem? It'll be easier for a maintainer to return to and understand without knowledge of the language.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no easy way to include the gendering in the graph, because there is no basic rule to follow.
In addition, I think it would actually be more confusing and harder to maintain if we start splitting the digits to different lists, because they all used in the same places in the graphs. Maintaining different lists won't be easier in my opinion. I can do that, but I think that would not help maintainers in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be preferred from a non-speaker end. Because of the amount of language knowledge that goes into these things, we want to make sure there's as much surface level knowledge available. It's just easier on my end to pass off to another maintainer this way.

month_names_graph = pynutil.insert("month: \"") + month_names + pynutil.insert("\"")

month_name2number = pynini.string_file(get_abs_path("data/months_name2number.tsv"))
month_name2number_graph = pynutil.insert("month: \"") + pynini.invert(month_name2number) + pynutil.insert("\"")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't the whitelist written in the correct order in the first place? Will be difficult to track which graphs need to be inverted and which don't.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the suggestion here? just to make sure I understand?
Invert the whitelist order so we do not need to use pynini.invert?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're calling invert on the data file. Why not just have the original graph in the desired order.


all_month_graph = month_name2number_graph | month_number2number_graph

month_prefix_graph = pynutil.insert("month_prefix: \"") + prefix_graph + pynutil.insert("\"") + insert_space
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super().__init__(name="date", kind="verbalize")

day_prefix = (
pynutil.delete("day_prefix:")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to change due to sparrowhawk compatibility

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the required change? I am not sure I understand, can you please explain?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

day_prefix isn't a valid property you can pass with the date class tag. It will work for the Python codebase, but sparrowhawk deployment will be incompatible. this will be a massive thoroughput limitation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry ^ is from me from my other git.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to use prefixes in order to capture dates correctly.
What is your suggestion to fix this? Is the problem the variable name?
Why can't we add this prefix to the date graph?
I am not familiar enough with sparrowhawk to understand this comment. Please advise

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me explain more clearly:

  1. For Sparrowhawk, class table and property tags, in this case day_prefix, are a limited set. You can only use class labels and property tags defined in Sparrowhawk.

See here for permitted class tags: https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L24-L59

See here for permitted property tags for date:

https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L205-L216

  1. While the labels are closed, what is allowed within a given label is not closed. So you can use any arbitrary property you want. So you can replace day_prefix with style or morphosyntactic_feature in any place you use day_prefix. It just needs to be regular and managed by the verbalizer. My personal choice would just be to use morphosyntactic_feature since that's the general purpose for that tag. If there's multiple prefixes you're trying to manage, you can just use a trivial delimiter. (e.g. morphosyntactic_feature: DAY_PREFIX/OTHER_PREFIX)

  2. Because Sparrowhawk is archived, this cannot be changed upstream without a fork. So it's a hard restriction (barring you want to create a PR for an archived repo....)

  3. This restriction is only for Sparrowhawk. The NeMo Text Processing library proper allows trivial tags. So if you believe the change is outside the scope of the PR, you can let me know and I'll create an issue for Sparrowhawk integration with my team.

Signed-off-by: danken12 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants