Profanity filtering for ITN - EN #86

gayu-thri · 2023-06-30T07:03:36Z

What does this PR do ?

This PR adds profanity filtering to ITN - EN. When it is enabled, profane words in the input text are redacted with the * symbol.
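As a rough illustration of the behavior described above, here is a plain-Python sketch. It is not the PR's implementation, which lives in a tagger grammar (`taggers/profane.py`). The function name `redact_profanity`, the placeholder word list, and the choice to replace every character of a match with `*` are assumptions made for illustration only.

```python
import re
from typing import Iterable


def redact_profanity(text: str, profane_words: Iterable[str]) -> str:
    """Replace each listed word in `text` with asterisks of the same length."""
    # Longer entries first so that multi-word entries win over their prefixes.
    words = sorted(set(profane_words), key=len, reverse=True)
    if not words:
        return text
    # Match whole words only, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)


# Placeholder entries standing in for a real list of profane words.
EXAMPLE_WORDS = ["darn", "heck"]
print(redact_profanity("well darn it what the heck", EXAMPLE_WORDS))
# prints: well **** it what the ****
```

A regex pass like this runs outside the ITN grammar, so it is only useful for checking the expected input/output pairs.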

Before your PR is "Ready for review"

Pre checks:

PR Type:

New Feature
Bugfix
Documentation
Test

If you haven't finished some of the above items you can still open "Draft" PR.

nemo_text_processing/inverse_text_normalization/en/data/swear_sequences.tsv

Add ZH ITN (squashed commit; its message lists the individual changes by Anand Joseph and Buyuan (Alex) Cui, including Chinese punctuation handling, the whitelist, cardinal, decimal, fraction, money, ordinal, time, and word grammars, test data updates, style fixes, and pre-commit.ci auto fixes)
Signed-off-by: Anand Joseph, Buyuan(Alex) Cui, BuyuanCui, gayu-thri
Co-authored-by: Alex Cui, Anand Joseph, pre-commit-ci[bot]


gayu-thri · 2023-07-17T05:37:35Z

Following up on this, as the suggested changes were made a few weeks back and the PR has not been merged yet.

If any more changes need to be made before merging, please let me know.


nemo_text_processing/inverse_text_normalization/en/taggers/profane.py

nemo_text_processing/inverse_text_normalization/en/taggers/tokenize_and_classify.py


mgrafu · 2023-08-10T15:42:36Z

After reviewing this PR, we have decided not to merge it for the following reasons:

The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.
Conceptually, this type of filtering is not a TN/ITN task. If a user wanted to filter profanity, chances are that it would already have been filtered in the audio; thus, it would not appear in the text before ITN in the first place. Otherwise, the filtering would most likely be addressed further downstream.

Thank you for your effort — we look forward to future contributions.
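The first point above can be sketched as follows. Assuming the whitelist data file is tab-separated with one (spoken form, written form) pair per line, redacted entries could be generated from a word list. The helper name `whitelist_rows`, the placeholder words, and the all-asterisk written form are hypothetical, and the actual whitelist file format should be checked in the repository before relying on this.

```python
from typing import Iterable, List


def whitelist_rows(words: Iterable[str]) -> List[str]:
    """Build 'spoken<TAB>written' rows whose written form is fully masked."""
    rows = []
    for word in words:
        word = word.strip().lower()
        if word:  # skip blank entries
            rows.append(f"{word}\t{'*' * len(word)}")
    return rows


# Placeholder entries; a real list would be maintained separately.
rows = whitelist_rows(["darn", "heck"])
print("\n".join(rows))
```

The generated rows would then be appended to whichever whitelist file the normalizer is configured to load.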

gayu-thri · 2023-08-11T12:28:14Z

Thank you for your effort — we look forward to future contributions.

Thanks. Sure.

The grammar provided offers functionality that can already be obtained through the whitelist class by adding (keyword, transformation) pairs to the whitelist data file.

I'd like to clarify this. Isn't profanity filtering a different kind of transformation which is not applicable to all whitelisted words?

Of course, we could add a predefined list of pairs to the whitelist, each with a spoken form and a written (filtered) form.

But if this has to be handled at the grammar level, wouldn't maintaining a separate classifier be better?

gayu-thri changed the title ~~Add profanity filtering for ITN - EN~~ Profanity filtering for ITN - EN Jun 30, 2023

gayu-thri force-pushed the add-profanity-filtering branch from d5362cc to c25f293 on June 30, 2023 07:09

gayu-thri marked this pull request as ready for review June 30, 2023 07:34

jimregan reviewed Jun 30, 2023


nemo_text_processing/inverse_text_normalization/en/data/swear_sequences.tsv (outdated, resolved)

BuyuanCui and others added 12 commits July 3, 2023 21:52

Add profanity filtering for english ITN

29a6272

Signed-off-by: Gayathri Ethiraj <[email protected]> Signed-off-by: gayu-thri <[email protected]>

Add copyrights

d65ff7d

Signed-off-by: Gayathri Ethiraj <[email protected]> Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d70a4ec

for more information, see https://pre-commit.ci Signed-off-by: gayu-thri <[email protected]>

Add filter_profanity attr to InverseNormalizer

252bb6d

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2e71ebb

for more information, see https://pre-commit.ci Signed-off-by: gayu-thri <[email protected]>

Different fst names with/without pf

a8a7826

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

0cfb3a8

for more information, see https://pre-commit.ci Signed-off-by: gayu-thri <[email protected]>

Rm written form in TSV and use fst operations to get it

62efdd6

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1d5a362

for more information, see https://pre-commit.ci Signed-off-by: gayu-thri <[email protected]>

user configurable input file for profane words

f9e5bde

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0a6a98

for more information, see https://pre-commit.ci Signed-off-by: gayu-thri <[email protected]>

gayu-thri force-pushed the add-profanity-filtering branch from 38efa25 to b0a6a98 on July 3, 2023 16:22

Merge branch 'main' into add-profanity-filtering

b3375c2

Signed-off-by: Gayathri Ethiraj <[email protected]>

github-advanced-security bot found potential problems Jul 18, 2023


nemo_text_processing/inverse_text_normalization/inverse_normalize.py

VerbalizeFinalFst,

)

self.tagger = ClassifyFst(

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'ClassifyFst' may be used before it is initialized.
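A finding like this typically arises when a name is bound only inside some branches (for example, conditional imports per language) and then used unconditionally. The sketch below reproduces the pattern and shows one way to avoid it. The classes and function names are hypothetical stand-ins, not the NeMo imports, and the actual fix in this PR may differ.

```python
class DefaultClassifier:
    """Stand-in for a tagger that is always available."""


class ProfanityClassifier:
    """Stand-in for a tagger used only when profanity filtering is on."""


def pick_classifier_risky(lang: str, filter_profanity: bool):
    # Pattern flagged by the analyzer: no branch binds `classifier`
    # when lang is not "en", so the return raises UnboundLocalError.
    if lang == "en":
        if filter_profanity:
            classifier = ProfanityClassifier
        else:
            classifier = DefaultClassifier
    return classifier


def pick_classifier_safe(lang: str, filter_profanity: bool):
    # Every path either binds `classifier` or raises a clear error.
    if lang != "en":
        raise ValueError(f"unsupported language: {lang}")
    classifier = ProfanityClassifier if filter_profanity else DefaultClassifier
    return classifier
```

Failing early with an explicit error, or binding a default before the branches, gives the analyzer a definite assignment on every path.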

gayu-thri and others added 2 commits July 25, 2023 12:11

Fix error in CodeQL

e6548dd

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a46ea2d

for more information, see https://pre-commit.ci

gayu-thri requested a review from jimregan July 28, 2023 02:57

mgrafu reviewed Aug 1, 2023


gayu-thri and others added 5 commits August 7, 2023 13:30

Resolve PR comments

bf4e9a1

Signed-off-by: gayu-thri <[email protected]>

disable filtering profanity by default

3c79f42

Signed-off-by: gayu-thri <[email protected]>

Remove raising explicit ValueError when custom list is not passed

1ce954c

Signed-off-by: gayu-thri <[email protected]>

Set filer_profanity to True in profane test

9c710c4

Signed-off-by: gayu-thri <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

9880c05

for more information, see https://pre-commit.ci

gayu-thri requested a review from mgrafu August 7, 2023 09:49

mgrafu closed this Aug 10, 2023
