Hate Detector for Portuguese Language

This Hate Detector project is designed to identify hate speech in Portuguese language text, particularly from sources like Twitter and news website comments. The goal is to provide a tool that can help identify and mitigate harmful language online.

The trained model achieved an F1-score of 81% in detecting hate speech.

The code is available in Python notebooks for easier visualization.

This was my final project for the Data Science and Big Data postgraduate program at PUC-Minas.

To download the full work as a PDF file, click here (available in Portuguese only).

To watch the video that summarizes the project, click here (available in Portuguese only).

Overview

The Hate Detector applies several pre-processing and Machine Learning techniques to analyze text and classify it as hateful or non-hateful. The project compares these techniques to find the combination that detects hate speech most effectively, measured by F1-score.

Dataset

The project utilizes two publicly available datasets:

De Pelle, Rogers Prates, and Viviane P. Moreira. "Offensive comments in the Brazilian web: a dataset and baseline results." Anais do VI Brazilian Workshop on Social Network Analysis and Mining. SBC, 2017.
Fortuna, Paula, et al. "A hierarchically-labeled Portuguese hate speech dataset." Proceedings of the Third Workshop on Abusive Language Online. 2019.

In the resulting combined dataset, 78.7% of the examples contain no hate speech, so the classes are unbalanced.

Modules

Getting dataset
Data analysis
Project: data sampling, pre-processing, feature extraction, classification

Results

The following strategies were compared:

Pre-processing functions:
1. Preprocessing 1: cleaning data, removing stop words;
2. Preprocessing 2: cleaning data, removing stop words, stemming;
3. Preprocessing 3: cleaning data, removing stop words, tagging;
4. Preprocessing 4: cleaning data, removing stop words, stemming, tagging.
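As a rough illustration of the first strategy (cleaning plus stop-word removal), here is a minimal sketch using only the Python standard library. The cleaning rules, the tiny `STOP_WORDS` set, and the function names are assumptions made for this example and are not taken from the project notebooks; stemming and tagging are left out because they need an external NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller Portuguese list from an NLP library.
STOP_WORDS = {"a", "o", "as", "os", "de", "da", "do", "e", "que",
              "um", "uma", "para", "com", "em"}

# URLs and @mentions carry little signal for this task, so they are dropped.
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
# Anything that is not a (possibly accented) lowercase letter or whitespace.
NON_LETTERS = re.compile(r"[^a-záàâãéêíóôõúüç\s]")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs and @mentions, keep only letters and spaces."""
    text = text.lower()
    text = URL_OR_MENTION.sub(" ", text)
    text = NON_LETTERS.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def remove_stop_words(text: str) -> str:
    """Drop tokens that appear in STOP_WORDS."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


def preprocessing_1(text: str) -> str:
    """Cleaning followed by stop-word removal."""
    return remove_stop_words(clean_text(text))
```

Calling `preprocessing_1` on a raw tweet returns a lowercase string with mentions, links, punctuation, digits, and the listed stop words removed. The other strategies would add a stemming or tagging step after `remove_stop_words`.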
Pre-processing for data distribution:
1. Unbalanced: no pre-processing, unbalanced data set;
2. Undersampling: random undersampling;
3. Repeating: random oversampling;
4. Translate: oversampling with added variation, where each repeated text is translated into other languages and then back to Portuguese (back-translation).
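The random undersampling and random oversampling strategies can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the function names and the `seed` parameter are assumptions, and the translation-based strategy is not shown because it depends on an external translation service.

```python
import random
from collections import defaultdict


def _group_by_label(texts, labels):
    """Map each label to the list of texts that carry it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return groups


def random_undersample(texts, labels, seed=0):
    """Randomly drop examples until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = min(len(items) for items in groups.values())
    pairs = [(text, label)
             for label, items in groups.items()
             for text in rng.sample(items, target)]
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [label for _, label in pairs]


def random_oversample(texts, labels, seed=0):
    """Randomly repeat examples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_label(texts, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for label, items in groups.items():
        # Sample with replacement to fill the gap to the majority class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [label for _, label in pairs]
```

Resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.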
Text vectorization techniques:
1. Count Vectorizer
2. TF-IDF
Machine Learning Algorithms:
1. LinearSVC (LSVC)
2. Multinomial Naive Bayes (MNB)
3. Random Forest (RF)
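To show how the vectorizers and classifiers listed above can be combined and scored, here is a minimal sketch that assumes scikit-learn is installed. The `compare_models` function, the split size, and the hyperparameters are illustrative assumptions rather than the settings used in the project; labels are assumed to be 0 (no hate) and 1 (hate).

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Factories, so that every combination gets a fresh, unfitted estimator.
VECTORIZERS = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
CLASSIFIERS = {
    "LSVC": lambda: LinearSVC(),
    "MNB": lambda: MultinomialNB(),
    "RF": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}


def compare_models(texts, labels, test_size=0.25, seed=0):
    """Return {(vectorizer, classifier): F1 on a held-out split}."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    scores = {}
    for (vec_name, make_vec), (clf_name, make_clf) in product(
        VECTORIZERS.items(), CLASSIFIERS.items()
    ):
        # Vectorizer and classifier are fitted on the training split only.
        model = make_pipeline(make_vec(), make_clf())
        model.fit(x_train, y_train)
        predictions = model.predict(x_test)
        scores[(vec_name, clf_name)] = f1_score(y_test, predictions)
    return scores
```

Passing a list of pre-processed texts and their 0/1 labels returns one F1-score per vectorizer and classifier pair, which can then be sorted to pick the best combination.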

F1-score results are below:
