Profanity filtering for ITN - EN #86

Closed
wants to merge 20 commits into from

Conversation

gayu-thri

@gayu-thri gayu-thri commented Jun 30, 2023

What does this PR do ?

This PR adds profanity filtering to English (EN) inverse text normalization (ITN). With this feature, each profane word in the input text is redacted with the * symbol.
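To make the intended behavior concrete, here is a minimal sketch in plain Python. It is not the grammar from this PR (which is built as a WFST classifier); the word list, the `redact` function name, and the choice to mask every character are assumptions made for illustration only.

```python
import re

# Hypothetical word list for illustration; the list used by the PR is not shown here.
PROFANE_WORDS = {"darn", "heck"}

# Match whole words only, case-insensitively, so "heckle" is left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(PROFANE_WORDS)) + r")\b",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace every character of each listed word with '*'."""
    return _PATTERN.sub(lambda m: "*" * len(m.group(0)), text)


if __name__ == "__main__":
    print(redact("what the heck is this"))
```

A regex keeps the sketch short; a grammar-based version would express the same mapping as a transducer so that it composes with the other ITN classes.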

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@gayu-thri gayu-thri changed the title Add profanity filtering for ITN - EN Profanity filtering for ITN - EN Jun 30, 2023
@gayu-thri gayu-thri marked this pull request as ready for review June 30, 2023 07:34
BuyuanCui and others added 12 commits July 3, 2023 21:52
* Add ZH ITN

Signed-off-by: Anand Joseph <[email protected]>

* Fix copyrights and code cleanup

Signed-off-by: Anand Joseph <[email protected]>

* Remove invalid tests

Signed-off-by: Anand Joseph <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve CodeQL issues

Signed-off-by: Anand Joseph <[email protected]>

* Cleanup

Signed-off-by: Anand Joseph <[email protected]>

* Fix missing 'zh' option for ITN and correct comment

Signed-off-by: Anand Joseph <[email protected]>

* Update __init__.py

Change to zh instead of en for the imports.

Signed-off-by: Buyuan(Alex) Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update for decimal test data

Signed-off-by: BuyuanCui <[email protected]>

* update for langauge import

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update for Chinese punctuations

Signed-off-by: BuyuanCui <[email protected]>

* a new class for whitelist

Signed-off-by: BuyuanCui <[email protected]>

* PYNINI_AVAILABLE = False

Signed-off-by: BuyuanCui <[email protected]>

* recreated due to file import format issue

Signed-off-by: BuyuanCui <[email protected]>

* recreated due to format issue

Signed-off-by: BuyuanCui <[email protected]>

* caught duplicates, removed

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicates, arranges for CHInese Yuan updates

Signed-off-by: BuyuanCui <[email protected]>

* updates accordingly to the comments from last PR. Recreated some of the files due to format issues

Signed-off-by: BuyuanCui <[email protected]>

* removed the hours_to and minute_to files used for back counting. ALso removed am and pm suffix files according to the last PR. Recreated some of them for format issue

Signed-off-by: BuyuanCui <[email protected]>

* re-added this file to avoid data file import error

Signed-off-by: BuyuanCui <[email protected]>

* updated gramamr according to last PR. Removed the acceptance of 千

Signed-off-by: BuyuanCui <[email protected]>

* updates

Signed-off-by: BuyuanCui <[email protected]>

* updated according to last PR. Removed comma after decimal points

Signed-off-by: BuyuanCui <[email protected]>

* gramamr for Fraction

Signed-off-by: BuyuanCui <[email protected]>

* gramamr for money and updated according to last PR. Plus process of 元

Signed-off-by: BuyuanCui <[email protected]>

* ordinal grammar. updates due to the updates in cardinal grammar

Signed-off-by: BuyuanCui <[email protected]>

* updated accordingly to last PR comments. removing am and pm and allowing simple mandarin expression

Signed-off-by: BuyuanCui <[email protected]>

* arrangements

Signed-off-by: BuyuanCui <[email protected]>

* added whitelist grammar

Signed-off-by: BuyuanCui <[email protected]>

* word grammar for non-classified items

Signed-off-by: BuyuanCui <[email protected]>

* updated cardinal, decimal, time, itn data

Signed-off-by: BuyuanCui <[email protected]>

* updates according to last PR

Signed-off-by: BuyuanCui <[email protected]>

* updates according to the updates for cardinal grammar

Signed-off-by: BuyuanCui <[email protected]>

* updates for more Mandarin punctuations

Signed-off-by: BuyuanCui <[email protected]>

* updated accordingly to last PR. removing am pm

Signed-off-by: BuyuanCui <[email protected]>

* adjustment on the weight

Signed-off-by: BuyuanCui <[email protected]>

* updated accordingly to the targger updates

Signed-off-by: BuyuanCui <[email protected]>

* updated accordingly to the time tagger

Signed-off-by: BuyuanCui <[email protected]>

* updates according to changes in tagger on am and pm

Signed-off-by: BuyuanCui <[email protected]>

* verbalizer for fraction

Signed-off-by: BuyuanCui <[email protected]>

* added for mandarin grammar

Signed-off-by: BuyuanCui <[email protected]>

* kept this file because using English utils results in data namin error

Signed-off-by: BuyuanCui <[email protected]>

* merge conflict

Signed-off-by: BuyuanCui <[email protected]>

* removed unsed imports

Signed-off-by: BuyuanCui <[email protected]>

* deleted unsed import os

Signed-off-by: BuyuanCui <[email protected]>

* deleted unsed variables

Signed-off-by: BuyuanCui <[email protected]>

* removed unsed imports

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates and edits based on pr checks

Signed-off-by: BuyuanCui <[email protected]>

* updates and edits based on pr checks

Signed-off-by: BuyuanCui <[email protected]>

* format issue, reccreated

Signed-off-by: BuyuanCui <[email protected]>

* format issue recreated

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed codeing style/format

Signed-off-by: BuyuanCui <[email protected]>

* fixed coding style and format

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicated graph for 毛

Signed-off-by: BuyuanCui <[email protected]>

* removed the comment

Signed-off-by: BuyuanCui <[email protected]>

* removed the comment

Signed-off-by: BuyuanCui <[email protected]>

* removing unnecessary comments

Signed-off-by: BuyuanCui <[email protected]>

* unnecessary comment removed

Signed-off-by: BuyuanCui <[email protected]>

* test file updated for more cases

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated with a comment explaining why this file is kept

Signed-off-by: BuyuanCui <[email protected]>

* updated the file explaining why this file is kept

Signed-off-by: BuyuanCui <[email protected]>

* added Mandarin as zh

Signed-off-by: BuyuanCui <[email protected]>

* removing for dplication

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed unused NEMO objects

Signed-off-by: BuyuanCui <[email protected]>

* removed duplicates

Signed-off-by: BuyuanCui <[email protected]>

* removing unsed imports

Signed-off-by: BuyuanCui <[email protected]>

* updates to fix test file failures

Signed-off-by: BuyuanCui <[email protected]>

* updates to fix file failtures

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failture

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to resolve test case failure

Signed-off-by: BuyuanCui <[email protected]>

* updates to adap to cardinal grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* updates to adapt to grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* updates to adopt to cardinal grammar changes

Signed-off-by: BuyuanCui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fix style

Signed-off-by: BuyuanCui <[email protected]>

* fixing pr checks

Signed-off-by: BuyuanCui <[email protected]>

* removed // for zhtn/itn cache

Signed-off-by: BuyuanCui <[email protected]>

* Update inverse_normalize.py

Added zh as a selection to pass Jenkins checks.

Signed-off-by: Buyuan(Alex) Cui <[email protected]>

---------

Signed-off-by: Anand Joseph <[email protected]>
Signed-off-by: Buyuan(Alex) Cui <[email protected]>
Signed-off-by: BuyuanCui <[email protected]>
Co-authored-by: Alex Cui <[email protected]>
Co-authored-by: Anand Joseph <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: gayu-thri <[email protected]>
Signed-off-by: Gayathri Ethiraj <[email protected]>
Signed-off-by: gayu-thri <[email protected]>
Signed-off-by: Gayathri Ethiraj <[email protected]>
Signed-off-by: gayu-thri <[email protected]>
@gayu-thri
Author

Following up on this: the suggested changes were made a few weeks ago, but the PR has not been merged yet.

If any further changes are needed before merging, please let me know.

VerbalizeFinalFst,
)

self.tagger = ClassifyFst(

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'ClassifyFst' may be used before it is initialized.
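For context, this CodeQL alert typically fires when a name is bound only inside some branches of a conditional (for example, per-language imports) and then used unconditionally. The sketch below shows the pattern and one common way to satisfy the check; the class and function names are hypothetical stand-ins, not the code in this PR.

```python
class EnglishClassifyFst:
    """Hypothetical stand-in for the English tagger class."""


class MandarinClassifyFst:
    """Hypothetical stand-in for the Mandarin tagger class."""


def load_tagger_class(lang: str):
    # Every path either binds tagger_cls or raises, so the variable
    # can never be read before it is initialized.
    if lang == "en":
        tagger_cls = EnglishClassifyFst
    elif lang == "zh":
        tagger_cls = MandarinClassifyFst
    else:
        raise NotImplementedError(f"Language not supported: {lang}")
    return tagger_cls


if __name__ == "__main__":
    print(load_tagger_class("en").__name__)
```

Without the final `else` branch, an unsupported language code would fall through and the later use of the name would fail with an unbound-variable error at runtime, which is what the static check warns about.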
@gayu-thri gayu-thri requested a review from jimregan July 28, 2023 02:57
@gayu-thri gayu-thri requested a review from mgrafu August 7, 2023 09:49
@mgrafu
Collaborator

mgrafu commented Aug 10, 2023

After reviewing this PR, we have decided not to merge it for the following reasons:

  1. The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.
  2. Conceptually, this type of filtering is not a TN/ITN task. If a user wanted to filter profanity, chances are that it would already have been filtered in the audio; thus, it would not appear in the text before ITN in the first place. Otherwise, the filtering would most likely be addressed further downstream.
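As a rough illustration of point 1, the whitelist approach amounts to a lookup table of (keyword, transformation) pairs read from a tab-separated data file. The sketch below uses plain Python rather than the WFST whitelist class; the entries, the column order (spoken form first), and the helper names are assumptions, not the library's actual file contents or API.

```python
import csv
import io

# Hypothetical whitelist contents: one "keyword<TAB>transformation" pair per line.
WHITELIST_TSV = "heck\t****\ndarn\t****\n"


def load_pairs(tsv_text: str) -> dict:
    """Parse tab-separated (keyword, transformation) pairs into a dict."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) == 2}


def apply_whitelist(text: str, pairs: dict) -> str:
    """Replace each whitespace-separated token that has a whitelist entry."""
    return " ".join(pairs.get(token, token) for token in text.split())


if __name__ == "__main__":
    print(apply_whitelist("what the heck", load_pairs(WHITELIST_TSV)))
```

In the real library the pairs would be compiled into a grammar rather than applied token by token, so this only shows the shape of the data, not how it is executed.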

Thank you for your effort — we look forward to future contributions.

@mgrafu mgrafu closed this Aug 10, 2023
@gayu-thri
Author

Thank you for your effort — we look forward to future contributions.

Thanks. Sure.

  1. The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.

I'd like to clarify this. Isn't profanity filtering a different kind of transformation, one that does not apply to all whitelisted words?

Of course, we could add a predefined list of pairs, each with a spoken form and a written (filtered) form, to the whitelist.

But if this has to be handled at the grammar level, wouldn't maintaining a separate classifier be better?
