Profanity filtering for ITN - EN #86
nemo_text_processing/inverse_text_normalization/en/data/swear_sequences.tsv
* Add ZH ITN Signed-off-by: Anand Joseph <[email protected]>
* Fix copyrights and code cleanup Signed-off-by: Anand Joseph <[email protected]>
* Remove invalid tests Signed-off-by: Anand Joseph <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Resolve CodeQL issues Signed-off-by: Anand Joseph <[email protected]>
* Cleanup Signed-off-by: Anand Joseph <[email protected]>
* Fix missing 'zh' option for ITN and correct comment Signed-off-by: Anand Joseph <[email protected]>
* Update __init__.py Change to zh instead of en for the imports. Signed-off-by: Buyuan(Alex) Cui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* update for decimal test data Signed-off-by: BuyuanCui <[email protected]>
* update for language import Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* update for Chinese punctuations Signed-off-by: BuyuanCui <[email protected]>
* a new class for whitelist Signed-off-by: BuyuanCui <[email protected]>
* PYNINI_AVAILABLE = False Signed-off-by: BuyuanCui <[email protected]>
* recreated due to file import format issue Signed-off-by: BuyuanCui <[email protected]>
* recreated due to format issue Signed-off-by: BuyuanCui <[email protected]>
* caught duplicates, removed Signed-off-by: BuyuanCui <[email protected]>
* removed duplicates, arranges for Chinese Yuan updates Signed-off-by: BuyuanCui <[email protected]>
* updates accordingly to the comments from last PR. Recreated some of the files due to format issues Signed-off-by: BuyuanCui <[email protected]>
* removed the hours_to and minute_to files used for back counting. Also removed am and pm suffix files according to the last PR. Recreated some of them for format issue Signed-off-by: BuyuanCui <[email protected]>
* re-added this file to avoid data file import error Signed-off-by: BuyuanCui <[email protected]>
* updated grammar according to last PR. Removed the acceptance of 千 Signed-off-by: BuyuanCui <[email protected]>
* updates Signed-off-by: BuyuanCui <[email protected]>
* updated according to last PR. Removed comma after decimal points Signed-off-by: BuyuanCui <[email protected]>
* grammar for Fraction Signed-off-by: BuyuanCui <[email protected]>
* grammar for money and updated according to last PR. Plus process of 元 Signed-off-by: BuyuanCui <[email protected]>
* ordinal grammar. updates due to the updates in cardinal grammar Signed-off-by: BuyuanCui <[email protected]>
* updated accordingly to last PR comments. removing am and pm and allowing simple mandarin expression Signed-off-by: BuyuanCui <[email protected]>
* arrangements Signed-off-by: BuyuanCui <[email protected]>
* added whitelist grammar Signed-off-by: BuyuanCui <[email protected]>
* word grammar for non-classified items Signed-off-by: BuyuanCui <[email protected]>
* updated cardinal, decimal, time, itn data Signed-off-by: BuyuanCui <[email protected]>
* updates according to last PR Signed-off-by: BuyuanCui <[email protected]>
* updates according to the updates for cardinal grammar Signed-off-by: BuyuanCui <[email protected]>
* updates for more Mandarin punctuations Signed-off-by: BuyuanCui <[email protected]>
* updated accordingly to last PR. removing am pm Signed-off-by: BuyuanCui <[email protected]>
* adjustment on the weight Signed-off-by: BuyuanCui <[email protected]>
* updated accordingly to the tagger updates Signed-off-by: BuyuanCui <[email protected]>
* updated accordingly to the time tagger Signed-off-by: BuyuanCui <[email protected]>
* updates according to changes in tagger on am and pm Signed-off-by: BuyuanCui <[email protected]>
* verbalizer for fraction Signed-off-by: BuyuanCui <[email protected]>
* added for mandarin grammar Signed-off-by: BuyuanCui <[email protected]>
* kept this file because using English utils results in data naming error Signed-off-by: BuyuanCui <[email protected]>
* merge conflict Signed-off-by: BuyuanCui <[email protected]>
* removed unused imports Signed-off-by: BuyuanCui <[email protected]>
* deleted unused import os Signed-off-by: BuyuanCui <[email protected]>
* deleted unused variables Signed-off-by: BuyuanCui <[email protected]>
* removed unused imports Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* updates and edits based on pr checks Signed-off-by: BuyuanCui <[email protected]>
* updates and edits based on pr checks Signed-off-by: BuyuanCui <[email protected]>
* format issue, recreated Signed-off-by: BuyuanCui <[email protected]>
* format issue recreated Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fixed coding style/format Signed-off-by: BuyuanCui <[email protected]>
* fixed coding style and format Signed-off-by: BuyuanCui <[email protected]>
* removed duplicated graph for 毛 Signed-off-by: BuyuanCui <[email protected]>
* removed the comment Signed-off-by: BuyuanCui <[email protected]>
* removed the comment Signed-off-by: BuyuanCui <[email protected]>
* removing unnecessary comments Signed-off-by: BuyuanCui <[email protected]>
* unnecessary comment removed Signed-off-by: BuyuanCui <[email protected]>
* test file updated for more cases Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* updated with a comment explaining why this file is kept Signed-off-by: BuyuanCui <[email protected]>
* updated the file explaining why this file is kept Signed-off-by: BuyuanCui <[email protected]>
* added Mandarin as zh Signed-off-by: BuyuanCui <[email protected]>
* removing for duplication Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* removed unused NEMO objects Signed-off-by: BuyuanCui <[email protected]>
* removed duplicates Signed-off-by: BuyuanCui <[email protected]>
* removing unused imports Signed-off-by: BuyuanCui <[email protected]>
* updates to fix test file failures Signed-off-by: BuyuanCui <[email protected]>
* updates to fix file failures Signed-off-by: BuyuanCui <[email protected]>
* updates to resolve test case failure Signed-off-by: BuyuanCui <[email protected]>
* updates to resolve test case failure Signed-off-by: BuyuanCui <[email protected]>
* updates to resolve test case failure Signed-off-by: BuyuanCui <[email protected]>
* updates to resolve test case failure Signed-off-by: BuyuanCui <[email protected]>
* updates to adapt to cardinal grammar changes Signed-off-by: BuyuanCui <[email protected]>
* updates to adapt to grammar changes Signed-off-by: BuyuanCui <[email protected]>
* updates to adapt to cardinal grammar changes Signed-off-by: BuyuanCui <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix style Signed-off-by: BuyuanCui <[email protected]>
* fix style Signed-off-by: BuyuanCui <[email protected]>
* fix style Signed-off-by: BuyuanCui <[email protected]>
* fix style Signed-off-by: BuyuanCui <[email protected]>
* fixing pr checks Signed-off-by: BuyuanCui <[email protected]>
* removed // for zhtn/itn cache Signed-off-by: BuyuanCui <[email protected]>
* Update inverse_normalize.py Added zh as a selection to pass Jenkins checks. Signed-off-by: Buyuan(Alex) Cui <[email protected]>
---------
Signed-off-by: Anand Joseph <[email protected]>
Signed-off-by: Buyuan(Alex) Cui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Co-authored-by: Alex Cui <[email protected]>
Co-authored-by: Anand Joseph <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: gayu-thri <[email protected]>
Signed-off-by: Gayathri Ethiraj <[email protected]> Signed-off-by: gayu-thri <[email protected]>
Following up on this: the suggested changes were made a few weeks ago, but the PR has not been merged yet. If any further changes are needed before merging, please let me know.
nemo_text_processing/inverse_text_normalization/en/taggers/profane.py
nemo_text_processing/inverse_text_normalization/en/taggers/tokenize_and_classify.py
After reviewing this PR, we have decided not to merge it for the following reasons:
Thank you for your effort; we look forward to future contributions.
Thanks. Sure.
I'd like to clarify this. Isn't profanity filtering a different kind of transformation, one that does not apply to all whitelisted words? Of course, we could add a predefined list of pairs, each with a spoken form and a written (filtered) form, to the whitelist. But if it has to be handled at the grammar level, wouldn't maintaining a separate classifier be better?
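To make the two options in the comment above concrete, here is a minimal plain-Python sketch, not the PR's grammar code. The word list, the helper names (`apply_whitelist`, `classify_profane`, `mask`), and the masking policy of keeping the first letter are all assumptions for illustration; mild placeholder words stand in for real profanity.

```python
# Hypothetical stand-in vocabulary; mild placeholders replace real profanity.
PROFANE_WORDS = {"darn", "heck"}

# Option 1: whitelist-style table with one explicit (spoken, written) pair per entry.
WHITELIST_PAIRS = {
    "darn": "d***",
    "heck": "h***",
}


def apply_whitelist(token: str) -> str:
    """Return the stored written form if the token is listed, else the token unchanged."""
    return WHITELIST_PAIRS.get(token, token)


# Option 2: a separate classifier. Only the spoken forms are listed;
# the written form is computed by a rule instead of being stored.
def mask(word: str) -> str:
    """Keep the first character and replace the rest with '*' (an assumed policy)."""
    return word[0] + "*" * (len(word) - 1)


def classify_profane(token: str) -> str:
    """Mask the token if it is in the profane word list, else return it unchanged."""
    return mask(token) if token in PROFANE_WORDS else token
```

Both options give the same output for the listed words. The difference is maintenance: the table needs a hand-written filtered form for every entry, while the rule-based version only needs the word itself.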
What does this PR do ?
This PR adds a new feature to ITN - EN for filtering profane words. With it, profane words in the input text are redacted with the * symbol.
Before your PR is "Ready for review"
Pre checks:
* Have you signed your commits? Use git commit -s to sign.
* Run pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
* Run bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
* Have you added test cases for both pytest and Sparrowhawk?
* Have you added __init__.py for every folder and subfolder, including the data folder which has .TSV files?
* Have you added Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
* [...] Copyright 2015 and onwards Google, Inc. See an example here.
* Remove import guards (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
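As a rough illustration of the behavior the description states (profane words redacted with `*`), the sketch below works on whole sentences with a regular expression. It is not the PR's implementation, which lives in the ITN grammars. The word list, the `redact` helper, and the choice to mask every character are assumptions.

```python
import re

# Hypothetical placeholder vocabulary; the PR's actual word list is not shown here.
PROFANE_WORDS = ["darn", "heck"]

# Match any listed word as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PROFANE_WORDS)) + r")\b",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace each listed word with a run of '*' of the same length."""
    return _PATTERN.sub(lambda m: "*" * len(m.group(0)), text)
```

The word-boundary anchors keep longer words that merely contain a listed word (for example "darning") from being masked.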