Profanity filtering for ITN - EN #86
Closed
20 commits
a8078de Zh itn (#74) (BuyuanCui)
29a6272 Add profanity filtering for english ITN (gayu-thri)
d65ff7d Add copyrights (gayu-thri)
d70a4ec [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
252bb6d Add filter_profanity attr to InverseNormalizer (gayu-thri)
2e71ebb [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
a8a7826 Different fst names with/without pf (gayu-thri)
0cfb3a8 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
62efdd6 Rm written form in TSV and use fst operations to get it (gayu-thri)
1d5a362 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
f9e5bde user configurable input file for profane words (gayu-thri)
b0a6a98 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
b3375c2 Merge branch 'main' into add-profanity-filtering (gayu-thri)
e6548dd Fix error in CodeQL (gayu-thri)
a46ea2d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
bf4e9a1 Resolve PR comments (gayu-thri)
3c79f42 disable filtering profanity by default (gayu-thri)
1ce954c Remove raising explicit ValueError when custom list is not passed (gayu-thri)
9c710c4 Set filer_profanity to True in profane test (gayu-thri)
9880c05 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
436 changes: 436 additions & 0 deletions
nemo_text_processing/inverse_text_normalization/en/data/swear_sequences.tsv
Large diffs are not rendered by default.
65 changes: 65 additions & 0 deletions
nemo_text_processing/inverse_text_normalization/en/taggers/profane.py
@@ -0,0 +1,65 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright 2015 and onwards Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pynini
from nemo_text_processing.inverse_text_normalization.en.utils import get_abs_path
from nemo_text_processing.text_normalization.en.graph_utils import (
    INPUT_CASED,
    INPUT_LOWER_CASED,
    NEMO_ALPHA,
    NEMO_DIGIT,
    NEMO_SPACE,
    GraphFst,
    capitalized_input_graph,
)
from pynini.lib import pynutil


class ProfaneFst(GraphFst):
    """
    Finite state transducer for classifying profane words
        e.g. bitch -> profane { filtered: "b****" }

    This class has highest priority among all classifier grammars

    Args:
        input_case: accepting either "lower_cased" or "cased" input.
        input_file: path to a file with profane words to be redacted with "*" symbol. (each line of the file: spoken_form\n)
            e.g. nemo_text_processing/inverse_text_normalization/en/data/swear_sequences.tsv
    """

    def __init__(self, input_case: str = INPUT_LOWER_CASED, input_file: str = None):
        super().__init__(name="profane", kind="classify")
        # Profane Grammar
        if input_file is None:
            input_file = "data/swear_sequences.tsv"

        profane_graph = pynini.string_file(get_abs_path(input_file))

        bowdlerize = (
            (NEMO_ALPHA | NEMO_DIGIT) + pynini.closure(pynini.cross(NEMO_SPACE | NEMO_ALPHA | NEMO_DIGIT, "*"), 1)
        ).optimize()

        profane_graph = (profane_graph @ bowdlerize).optimize()

        if input_case == INPUT_CASED:
            profane_graph = capitalized_input_graph(profane_graph)

        # Token insertion
        final_profane_graph = pynutil.insert('filtered: "') + profane_graph + pynutil.insert('"')

        # Inserts the profane tag
        final_profane_graph = self.add_tokens(final_profane_graph)
        self.fst = final_profane_graph.optimize()
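
The `bowdlerize` rule in the tagger keeps the first letter or digit of an entry and rewrites every following letter, digit, or space as `*`, requiring at least one masked character. The snippet below is a plain-Python model of that rule for illustration only; it is not the pynini grammar. The name `bowdlerize_phrase` is hypothetical, and the character sets are an assumption: ASCII letters, ASCII digits, and a plain space stand in for `NEMO_ALPHA`, `NEMO_DIGIT`, and `NEMO_SPACE`, which may accept a different range.

```python
import string

# Assumption: approximate NEMO_ALPHA / NEMO_DIGIT with ASCII letters and digits,
# and NEMO_SPACE with a plain space character.
_KEEP = set(string.ascii_letters + string.digits)
_MASK = _KEEP | {" "}


def bowdlerize_phrase(phrase: str) -> str:
    """Keep the first character and mask every later character with '*'.

    Raises ValueError for inputs the modelled rule would not accept:
    fewer than two characters, or characters outside the allowed sets.
    """
    if len(phrase) < 2:
        raise ValueError("need a first character plus at least one masked character")
    if phrase[0] not in _KEEP or any(ch not in _MASK for ch in phrase[1:]):
        raise ValueError(f"unsupported character in {phrase!r}")
    return phrase[0] + "*" * (len(phrase) - 1)


print(bowdlerize_phrase("shit"))  # s***
print(bowdlerize_phrase("2g1c"))  # 2***
```

Because spaces are masked as well, a multi-word entry collapses into a single run of asterisks, which matches the multi-word rows in the test cases of this PR.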
39 changes: 39 additions & 0 deletions
nemo_text_processing/inverse_text_normalization/en/verbalizers/profane.py
@@ -0,0 +1,39 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
# Copyright 2015 and onwards Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pynini
from nemo_text_processing.text_normalization.en.graph_utils import NEMO_NOT_QUOTE, GraphFst, delete_space
from pynini.lib import pynutil


class ProfaneFst(GraphFst):
    """
    Finite state transducer for verbalizing profane words
        e.g. bitch -> profane { filtered: "b****" } -> b****
    """

    def __init__(self):
        super().__init__(name="profane", kind="verbalize")
        graph = (
            pynutil.delete("filtered:")
            + delete_space
            + pynutil.delete('"')
            + pynini.closure(NEMO_NOT_QUOTE, 1)
            + pynutil.delete('"')
            + delete_space
        )

        delete_tokens = self.delete_tokens(graph)
        self.fst = delete_tokens.optimize()
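
The verbalizer strips the `profane { filtered: "..." }` wrapper and emits only the quoted, already-masked text. A regex-based sketch of that step is shown below for illustration. It is an approximation and not the FST: `verbalize_profane` is a hypothetical name, and the token layout is assumed from the docstring example rather than from the full pipeline output.

```python
import re

# Assumption: a tagged token looks like  profane { filtered: "b****" }
# [^"]+ models closure(NEMO_NOT_QUOTE, 1): one or more non-quote characters.
_TOKEN = re.compile(r'profane\s*\{\s*filtered:\s*"([^"]+)"\s*\}')


def verbalize_profane(tagged: str) -> str:
    """Replace each tagged profane token with its quoted payload."""
    return _TOKEN.sub(lambda m: m.group(1), tagged)


print(verbalize_profane('profane { filtered: "b****" }'))  # b****
```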
21 changes: 21 additions & 0 deletions
tests/nemo_text_processing/en/data_inverse_text_normalization/test_cases_profane.txt
@@ -0,0 +1,21 @@
fuck off~f*** off
dont give a fuck about your problems~dont give a f*** about your problems
holy shit~holy s***
he was pissing us off~he was p****** us off
the movie was bad ass~the movie was bad a**
son of a bitch~son of a b****
two girls one cup~t****************
zoophilia~z********
get out of the way fuckin bastard~get out of the way f***** b******
belief that the white power is inherently superior is bad~belief that the w********** is inherently superior is bad
two g one c~t**********
not give a fuck~not give a f***
oh shit~oh s***
rapist~r*****
sexting~s******
assholes~a*******
bullshit~b*******
fucker~f*****
piece of shit~p************
2g1c~2***
2 g 1 c~2******
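
Each test case is a `spoken~written` pair: the input on the left of `~` and the expected filtered output on the right. The sketch below shows one way such rows could be parsed and checked against a simple whole-word redactor. It is a plain-Python stand-in and does not call the NeMo normalizer. The helper names and the `PHRASES` list are hypothetical, since `swear_sequences.tsv` is not rendered in this diff, and preferring the longest phrase is an assumption about how overlapping entries resolve.

```python
import re
from typing import Iterable, List, Tuple


def parse_cases(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split 'spoken~written' lines into (spoken, written) pairs, skipping blanks."""
    pairs = []
    for line in lines:
        line = line.strip()
        if line:
            spoken, written = line.split("~", 1)
            pairs.append((spoken, written))
    return pairs


def redact(text: str, phrases: Iterable[str]) -> str:
    """Mask whole-word occurrences of each phrase, trying longer phrases first."""
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b")
    # Keep the first character of the match and mask the rest, spaces included.
    return pattern.sub(lambda m: m.group(0)[0] + "*" * (len(m.group(0)) - 1), text)


# Hypothetical word list; the real swear_sequences.tsv is not shown in this diff.
PHRASES = ["fuck", "shit", "piece of shit", "ass"]
CASES = [
    "fuck off~f*** off",
    "holy shit~holy s***",
    "the movie was bad ass~the movie was bad a**",
]

for spoken, written in parse_cases(CASES):
    assert redact(spoken, PHRASES) == written
```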
Check failure
Code scanning / CodeQL
Potentially uninitialized local variable