-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hebrew ITN - Adding Hebrew ITN #238
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: danken12 <[email protected]>
Signed-off-by: nir.raviv <[email protected]>
@tbartley94 this PR is now ready for review |
Add Hebrew ITN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first pass and left initial comments for style and technical improvements. Very nice work. General suggestions:
-
Reuse prexisting graphs from the
en.graph_utils
. It's better to be conservative on these public variable in case there's waterfall effects from upstream pynini. -
Our general style for graph composition is to aim for readability rather than compatness. If there's need for parentheses of depth > 1 , it's better to break the composition over several lines for readability. This is also important since Pynini is not always constant with order of operations when it comes to concatenation and union operations. So it's easier to break off rather than check set theory notes.
-
For compatibility with upstream sparrowhawk, you need to constrain the attributes you provide to graphs. Defined properties such as
prefix
can't be freely created when tagging. To change that requires creating a new fork of Sparrowhawk. Consult here for knowing which properties are usable per semiotic class:
https://github.com/google/sparrowhawk/blob/master/src/lib/semiotic_classes.pb.cc#L203-L216
I'll be pinging someone else to check the Hebrew outputs (not a reader). For clarity, should this be understood as a normalization system for colloquial use, or would it be more restricted to liturgical texts?
nemo_text_processing/inverse_text_normalization/he/data/numbers/decimal_fractions.tsv
Outdated
Show resolved
Hide resolved
שמונה 8 | ||
תשעה 9 | ||
אחת 1 | ||
שתיים 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain the multiple digit forms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The digits doesn't change there are different ways to say it. In Hebrew, numbers usually have gender, so we wanted to catch both masc. numbers and fem. numbers. In addition, some numbers are said differently like 2, when coming in construct state (proximity). For example "two hands" would be "שתי ידיים" and not " שתיים ידיים".
We wanted to include all the forms of the numbers, for better covering test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha, and the infixation would make this a pain to just include the gendering in the graph. Could I bother you to split the data list into something like digit_masc
digit_fem
? It'll be easier for a maintainer to return to and understand without knowledge of the language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no easy way to include the gendering in the graph, because there is no basic rule to follow.
In addition, I think it would actually be more confusing and harder to maintain if we start splitting the digits to different lists, because they all used in the same places in the graphs. Maintaining different lists won't be easier in my opinion. I can do that, but I think that would not help maintainers in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be preferred from a non-speaker end. Because of the amount of language knowledge that goes into these things, we want to make sure there's as much surface level knowledge available. It's just easier on my end to pass off to another maintainer this way.
nemo_text_processing/inverse_text_normalization/he/data/numbers/thousands.tsv
Show resolved
Hide resolved
nemo_text_processing/inverse_text_normalization/he/data/time/minute_to.tsv
Show resolved
Hide resolved
nemo_text_processing/inverse_text_normalization/he/graph_utils.py
Outdated
Show resolved
Hide resolved
month_names_graph = pynutil.insert("month: \"") + month_names + pynutil.insert("\"") | ||
|
||
month_name2number = pynini.string_file(get_abs_path("data/months_name2number.tsv")) | ||
month_name2number_graph = pynutil.insert("month: \"") + pynini.invert(month_name2number) + pynutil.insert("\"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't the whitelist written in the correct order in the first place? Will be difficult to track which graphs need to be inverted and which don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the suggestion here? just to make sure I understand?
Invert the whitelist order so we do not need to use pynini.invert
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're calling invert on the data file. Why not just have the original graph in the desired order.
|
||
all_month_graph = month_name2number_graph | month_number2number_graph | ||
|
||
month_prefix_graph = pynutil.insert("month_prefix: \"") + prefix_graph + pynutil.insert("\"") + insert_space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as with cardinals, this will break sparrowhawk.
nemo_text_processing/inverse_text_normalization/he/taggers/decimal.py
Outdated
Show resolved
Hide resolved
nemo_text_processing/inverse_text_normalization/he/taggers/tokenize_and_classify.py
Outdated
Show resolved
Hide resolved
super().__init__(name="date", kind="verbalize") | ||
|
||
day_prefix = ( | ||
pynutil.delete("day_prefix:") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will need to change due to sparrowhawk compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the required change? I am not sure I understand, can you please explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
day_prefix
isn't a valid property you can pass with the date
class tag. It will work for the Python codebase, but sparrowhawk deployment will be incompatible. this will be a massive thoroughput limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry ^ is from me from my other git.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to use prefixes in order to capture dates correctly.
What is your suggestion to fix this? Is the problem the variable name?
Why can't we add this prefix to the date graph?
I am not familiar enough with sparrowhawk to understand this comment. Please advise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, let me explain more clearly:
- For Sparrowhawk, class table and property tags, in this case
day_prefix
, are a limited set. You can only use class labels and property tags defined in Sparrowhawk.
See here for permitted class tags: https://github.com/google/sparrowhawk/blob/a0503e26a433fbd3a9ff81ba7a08819e4a3bb668/src/lib/semiotic_classes.pb.cc#L24-L59
See here for permitted property tags for date:
-
While the labels are closed, what is allowed within a given label is not closed. So you can use any arbitrary property you want. So you can replace
day_prefix
withstyle
ormorphosyntactic_feature
in any place you useday_prefix
. It just needs to be regular and managed by the verbalizer. My personal choice would just be to usemorphosyntactic_feature
since that's the general purpose for that tag. If there's multiple prefixes you're trying to manage, you can just use a trivial delimiter. (e.g.morphosyntactic_feature: DAY_PREFIX/OTHER_PREFIX
) -
Because Sparrowhawk is archived, this cannot be changed upstream without a fork. So it's a hard restriction (barring you want to create a PR for an archived repo....)
-
This restriction is only for Sparrowhawk. The NeMo Text Processing library proper allows trivial tags. So if you believe the change is outside the scope of the PR, you can let me know and I'll create an issue for Sparrowhawk integration with my team.
Signed-off-by: danken12 <[email protected]>
# Conflicts: # nemo_text_processing/inverse_text_normalization/he/taggers/cardinal.py
Signed-off-by: danken12 <[email protected]>
Signed-off-by: danken12 <[email protected]>
Signed-off-by: danken12 <[email protected]>
nemo_text_processing/inverse_text_normalization/he/graph_utils.py
Outdated
Show resolved
Hide resolved
Signed-off-by: danken12 <[email protected]>
What does this PR do ?
Add Hebrew Inverse Text Normalization
Before your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.