About This Project

This project is a step-by-step guide to building a Racism-Xenophobia Classifier with PyTorch. It aims to explain each stage of developing the model, along with its applications.

Step 1: Accurate and concise definition of the problem

The Racism-Xenophobia-Classifier repository is a machine learning project focused on developing a classifier to detect instances of racism and xenophobia in English sentences. The goal is a robust and accurate tool for identifying and categorizing text based on the presence of racist and xenophobic content.

The Racism-Xenophobia-Classifier project has diverse real-world applications. It can be employed for content moderation on social media platforms, aiding sentiment analysis by identifying racism and xenophobia, monitoring public opinion on these issues, supporting research and studies on societal attitudes, informing policy development, and serving as an educational tool for fostering inclusivity. Overall, the classifier contributes to creating safer online spaces and promoting understanding and respect in society.

Step 2: Data Collection for Racism-Xenophobia-Classifier

Data Collection Overview

In the data collection phase of the Racism-Xenophobia-Classifier project, the goal is to gather a diverse and representative dataset of English sentences labeled with instances of racism and xenophobia. This dataset will serve as the foundation for training and evaluating the classifier.

Sampling Methods

Sampling methods can be utilized during data collection to ensure the dataset captures a wide range of examples and maintains a balanced representation. Here are a few scenarios where sampling methods can be beneficial:

1. Random Sampling

Random sampling involves selecting data points from a larger pool without any specific pattern or bias. It ensures a diverse representation of text by capturing a wide range of examples. For the Racism-Xenophobia-Classifier project, random sampling can be used to collect sentences from various sources to avoid favoring specific contexts or demographics.

Advantages

Easy to implement.
Each member of the population has an equal chance of being chosen.
Free from systematic selection bias.

Disadvantages

If the sampling frame is large, random sampling may be impractical.
A complete list of the population may not be available.
Minority subgroups within the population may not be present in the sample.
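
Random sampling as described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual collection code: the `sentences` pool and the helper name `simple_random_sample` are hypothetical placeholders.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k items without replacement; each item is equally likely to be chosen."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    return rng.sample(population, k)


# Hypothetical pool of collected sentences (placeholders, not real data).
sentences = [f"sentence {i}" for i in range(1000)]

sample = simple_random_sample(sentences, k=100, seed=0)
print(len(sample))  # 100
```

Fixing the seed makes the sample reproducible, which helps when the same subset has to be re-created for labeling or evaluation.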

2. Stratified Sampling

The population is divided into subgroups (strata) based on specific characteristics, such as age, gender or race. Within each stratum, random sampling is used to choose the sample. In the context of the Racism-Xenophobia-Classifier project, stratified sampling can be used to ensure proportional representation of different types of racist and xenophobic content, such as racial slurs, discriminatory remarks, or xenophobic comments.

Advantages

Strata can be proportionally represented in the final sample.
It is easy to compare subgroups.

Disadvantages

Information must be gathered before being able to divide the population into subgroups.

Worked Example

The 1000 students of a school are classified by hair colour as follows:

57 % Brunette,
29 % Redhead,
14 % Blonde.

Find a stratified sample of 200 students for this population.

Solution:
Suppose we are interested in how each of these groups will react to the statement "everyone in this school has an equal chance of success". A simple random sample may under-represent the smallest group in the school (students with blonde hair). By grouping the population by hair colour, we can choose a sample in which each group is represented according to its proportion of the population: 57 % of the sample should be brunette, 29 % redhead and 14 % blonde. Within each group (stratum), the sample is selected randomly. As the sample consists of 200 people, 114 should be brunette (0.57 × 200), 58 should be redhead (0.29 × 200) and 28 should be blonde (0.14 × 200).
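
The allocation in this worked example can be checked with a short Python sketch. The helper names (`proportional_allocation`, `stratified_sample`) and the use of integer student IDs are assumptions made for illustration only.

```python
import random


def proportional_allocation(shares, sample_size):
    """Units to draw from each stratum, given percentage shares of the population."""
    return {name: round(sample_size * pct / 100) for name, pct in shares.items()}


def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size inside each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, allocation[name]) for name, members in strata.items()}


shares = {"brunette": 57, "redhead": 29, "blonde": 14}
allocation = proportional_allocation(shares, sample_size=200)
print(allocation)  # {'brunette': 114, 'redhead': 58, 'blonde': 28}

# Hypothetical student IDs 0..999, split according to the stated percentages.
students = list(range(1000))
strata = {
    "brunette": students[:570],
    "redhead": students[570:860],
    "blonde": students[860:],
}
sample = stratified_sample(strata, allocation, seed=0)
sample_sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sample_sizes)
```

In general, rounded allocations may not add up exactly to the requested sample size; here the percentages divide evenly, so the three strata sum to 200.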

3. Clustered Sampling

Clustered sampling involves dividing the population into clusters or groups, and then randomly selecting clusters for data collection. In the Racism-Xenophobia-Classifier project, clustered sampling can be used to select specific online communities, forums, or news articles that are more likely to contain instances of racism and xenophobia, ensuring a more focused collection of relevant data.

Advantages

Cuts down the cost and time by collecting data from only a limited number of groups.
Can show grouped variations.

Disadvantages

It is not a genuine random sample.
The sample is drawn from fewer groups, so it is likely to be less representative of the population.

Example
The children in a classroom are grouped by the table they sit at. A sample can be obtained by choosing n tables at random to represent the class.
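
The classroom example can be sketched as follows. The table names, the children's names and the helper `cluster_sample` are invented for illustration; the point is that whole clusters are selected at random and every member of a selected cluster is kept.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick n_clusters whole clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}


# Hypothetical classroom: each table is one cluster.
tables = {
    "table_1": ["Ana", "Ben", "Caro"],
    "table_2": ["Dev", "Elif", "Finn"],
    "table_3": ["Gus", "Hana", "Ivo"],
    "table_4": ["Jin", "Kai", "Lea"],
}

picked = cluster_sample(tables, n_clusters=2, seed=0)
print(sorted(picked))  # names of the two selected tables
```

For text data, the same idea would apply with forums or news sources as the clusters instead of tables.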

4. Convenience Sampling

Convenience sampling involves collecting data from readily available sources or individuals that are easily accessible. In the context of the Racism-Xenophobia-Classifier project, convenience sampling may involve collecting data from social media platforms, online forums, or public discussions where instances of racism and xenophobia are frequently observed.

By applying these sampling methods to the Racism-Xenophobia-Classifier project, we can gather a dataset that covers various types of racism and xenophobia, captures informative examples, and limits the bias that comes from relying on a single source or perspective.

The following table summarizes general data collection methods:

Method	When to Use	How to Collect Data
Surveys	To gather information from a large sample	Administer structured questionnaires to individuals or groups
Interviews	To obtain in-depth insights or personal experiences	Conduct direct interactions with individuals or groups, using structured or unstructured questioning
Observations	To study behaviors or events in natural settings	Systematically watch and record behaviors, events, or phenomena
Experiments	To establish cause-and-effect relationships	Manipulate variables under controlled conditions and collect data accordingly
Existing Data Analysis	When relevant data already exists for analysis	Analyze pre-existing data from public sources, research institutions, or archives
Case Studies	To deeply examine specific individuals or situations	Conduct extensive analysis and investigation of individuals, groups, or phenomena
Document Analysis	To analyze written, visual, or audio materials	Examine reports, articles, or social media content for relevant information
Ethnography	To understand behavior and beliefs in a cultural group	Immerse oneself in the cultural or social group, observe, and interact with participants

Useful Datasets for Racism and Xenophobia Detection

In this section, I list datasets that have been used for hate speech detection and related tasks such as cyberbullying, abusive language, and online harassment detection, to make them easier for researchers to find. Although data can be drawn from several social media platforms, constructing a balanced, labeled dataset is costly in time and effort and remains an open problem for researchers in the area. Most of the datasets listed below are not publicly available for download, but some can be obtained from the authors on request.
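
Since class balance is a recurring concern with these datasets, a quick check of the label distribution is a useful first step after obtaining one. The sketch below uses made-up labels; the label names and the helper `class_distribution` are assumptions and do not correspond to any specific dataset listed here.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels for 100 sentences (not taken from a real dataset).
labels = ["neither"] * 80 + ["racist"] * 15 + ["xenophobic"] * 5

distribution = class_distribution(labels)
for label, fraction in distribution.items():
    print(f"{label}: {fraction:.2f}")
```

A strongly skewed distribution like this one suggests that stratified sampling or class weighting may be needed before training.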

English

No	Datasets (Link to paper)	Objects	Size	Available	Labels	Comment
1	Dinakar et al., 2011	YouTube Comments	6000	-	Sexuality, Race, Culture, Intelligence
2	Dadvar and Jong, 2012	Myspace Posts	2200	-	Bullying, Non Bullying
3	Huang et al., 2014	Tweets	4865	-	Bullying, Non Bullying
4	Hosseinmardi et al., 2015	Instagram Media Sessions	998	-	Bullying, Non Bullying
5	Waseem and Hovy, 2016	Tweets	16914	Download	Racist, Sexist, Either
6	Waseem, 2016	Tweets	6909	Download	Racist, Sexist, Either, Both
7	Nobata et al., 2016	Yahoo Comments	2000	-	Abusive, Clean
8	Chatzakou et al., 2017	Twitter Users	9484	-	Aggressor, Bully, Spammer
9	Davidson et al., 2017	Tweets	24802	Download	hate_speech, offensive, neither
10	Golbeck et al., 2017	Tweets	35000	-	Harassing, Non Harassing
11	Wulczyn et al., 2017	Wikipedia Comments	100000	Download	Personal Attacks
12	Tahmasbi and Rastegari, 2018	Tweets	12837	-	Bullying, Non Bullying
13	Anzovino et al., 2018	Tweets	4454	-	Discredit, Stereotype, Objectification, Sexual Harassment, Threats of Violence, Dominance, Derailing
14	Founta et al., 2018	Tweets	80000	Download	Hate Speech, Offensive, None
15	Gibert et al., 2018	Sentences from Stormfront	10568	Download	Hate Speech, Non Hate Speech
16	SemEval19, 2019	Tweets	9000	Request Link	Hate speech, Non Hate Speech
17	OLID 2019	Tweets	14100	Download	Offensive, Non Offensive
18	TREC2 2020	Messages (Twitter, Facebook, YouTube)	4263	Request Form	Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG)	Data geolocated in India
19	meTooMA 2020	Tweets	9973	Download	Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither)	Data geolocated in India, Australia, Kenya, Iran, UK

Multilingual (Parallel Data)

No	Datasets (Link to paper)	Objects	Size	Available	Language	Labels
1	XHate 999	Tweets from previously published English datasets, translated into 5 languages	600 (x 6 languages)	Download	English, German, Russian, Croatian, Albanian, Turkish	sexism, racism, toxicity, hatefulness, aggression, attack, cyberbullying, misogyny, obscenity, threats, insults

The following table lists further sources of related datasets:

Dataset Name	Description	Language	Classes	Source	Download
HateEval	Annotated tweets for hate speech and offensive language.	English	hateful or not hateful (targets: women or immigrants)	Twitter	https://competitions.codalab.org/competitions/19935
Wikipedia Talk Labels	User comments from Wikipedia talk pages annotated for toxicity.	English	toxic or healthy	Wikipedia	https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973/2
Online Harassment Dataset (Wikimedia)	User comments from Wikimedia platforms annotated for harassment.	English	bullying or not	Wikimedia	https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset
Cyberbullying Dataset	Text labeled as bullying or not.	English	bullying or not	Kaggle, Twitter, Wikipedia Talk	https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset
Hate Speech and Offensive Language Dataset	Text classified as hate speech, offensive language, or neither.	English	0 - hate speech, 1 - offensive language, 2 - neither	Twitter	https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/data
