Developing a classifier to detect instances of racism and xenophobia in English sentences

Ebimsv/Racism-Xenophobia-Classifier

About This Project

This project is a step-by-step guide on building a Racism-Xenophobia Classifier using PyTorch. It aims to provide a comprehensive understanding of the process involved in developing a model and its applications.

Step 1: Accurate and concise definition of the problem

The Racism-Xenophobia-Classifier repository is a machine learning project focused on developing a classifier to detect instances of racism and xenophobia in English sentences. The project aims to provide a robust and accurate tool for identifying and categorizing text based on the presence of racist and xenophobic content.

The project has diverse real-world applications. It can be used to moderate content on social media platforms, to complement sentiment analysis by flagging racism and xenophobia, to monitor public opinion on these issues, to support research on societal attitudes, to inform policy development, and to serve as an educational tool for fostering inclusivity. Overall, the classifier contributes to creating safer online spaces and promoting understanding and respect in society.

Step 2: Data Collection for Racism-Xenophobia-Classifier

Data Collection Overview

In the data collection phase of the Racism-Xenophobia-Classifier project, the goal is to gather a diverse and representative dataset of English sentences labeled with instances of racism and xenophobia. This dataset will serve as the foundation for training and evaluating the classifier.

Sampling Methods

Sampling methods can be utilized during data collection to ensure the dataset captures a wide range of examples and maintains a balanced representation. Here are a few scenarios where sampling methods can be beneficial:

1. Random Sampling
Random sampling involves selecting data points from a larger pool without any specific pattern or bias. It ensures a diverse representation of text by capturing a wide range of examples. For the Racism-Xenophobia-Classifier project, random sampling can be used to collect sentences from various sources to avoid favoring specific contexts or demographics.

Advantages

  • Easy to implement.
  • Each member of the population has an equal chance of being chosen.
  • Free from systematic selection bias.

Disadvantages

  • If the sampling frame is large, random sampling may be impractical.
  • A complete list of the population may not be available.
  • Minority subgroups within the population may not be present in the sample.
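
As a minimal sketch of simple random sampling, the snippet below draws sentences uniformly at random and without replacement from a pool. The pool contents and the helper name `simple_random_sample` are placeholders chosen for illustration; they are not part of this repository.

```python
import random


def simple_random_sample(sentences, k, seed=0):
    """Draw k sentences uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(sentences, k)


# Hypothetical pool of collected sentences (placeholders, not real data).
pool = [f"sentence {i}" for i in range(1000)]
sample = simple_random_sample(pool, k=50)
print(len(sample))
```

Every sentence in the pool has the same chance of being selected, which is also why rare categories can end up missing from a small sample.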
2. Stratified Sampling
The population is divided into subgroups (strata) based on specific characteristics, such as age, gender or race. Within each stratum, random sampling is used to choose the sample. In the context of the Racism-Xenophobia-Classifier project, stratified sampling can be used to ensure proportional representation of different types of racism and xenophobia, such as racial slurs, discriminatory remarks, or xenophobic comments.

Advantages

  • Strata can be proportionally represented in the final sample.
  • It is easy to compare subgroups.

Disadvantages

  • Information must be gathered before being able to divide the population into subgroups.

Worked Example

The 1000 students of a school are classified by hair colour as follows:

57 % Brunette,
29 % Redhead,
14 % Blonde.

Find a stratified sample of 200 students for this population.

Solution:
Suppose we are interested in how each of these groups will react to this statement: everyone in this school has an equal chance of success. Relying on a simple random sample may under-represent the minority groups in the school (here, students with blonde hair). By grouping the population by hair colour, we can choose a sample in which each group is represented according to its proportion of the population: 57 % of the sample should be brunette, 29 % redhead and 14 % blonde. Within each group (stratum), the sample is selected randomly. As the sample consists of 200 students, 114 should be brunette (0.57 × 200), 58 redhead (0.29 × 200) and 28 blonde (0.14 × 200).
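
The worked example can be reproduced with a short sketch. The student identifiers and the helper name `stratified_sample` are made up for illustration. The quota of each stratum is its share of the population multiplied by the sample size, rounded to the nearest integer (with other proportions, rounded quotas may not add up exactly to the requested sample size).

```python
import random


def stratified_sample(strata, sample_size, seed=0):
    """Draw a proportionally allocated random sample from each stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Quota = stratum share of the population * sample size.
        quota = round(len(members) * sample_size / total)
        sample[name] = rng.sample(members, quota)
    return sample


# Hypothetical school of 1000 students: 57 % brunette, 29 % redhead, 14 % blonde.
school = {
    "brunette": [f"brunette_{i}" for i in range(570)],
    "redhead": [f"redhead_{i}" for i in range(290)],
    "blonde": [f"blonde_{i}" for i in range(140)],
}
sample = stratified_sample(school, sample_size=200)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'brunette': 114, 'redhead': 58, 'blonde': 28}
```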

3. Clustered Sampling
Clustered sampling involves dividing the population into clusters or groups, and then randomly selecting clusters for data collection. In the Racism-Xenophobia-Classifier project, clustered sampling can be used to select specific online communities, forums, or news articles that are more likely to contain instances of racism and xenophobia, ensuring a more focused collection of relevant data.

Advantages

  • Cuts down the cost and time by collecting data from only a limited number of groups.
  • Can show grouped variations.

Disadvantages

  • It is not a genuine random sample.
  • The sample size is smaller, and thus the sample is likely to be less representative of the population.

Example
The children in a classroom are grouped by the table they sit at. A sample can be obtained from this classroom by randomly choosing n tables to represent the class.
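
The classroom example could be sketched as follows. The table names, the children's names, and the helper `cluster_sample` are hypothetical. Whole clusters are selected at random, and every member of a selected cluster enters the sample.

```python
import random


def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick n_clusters whole clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}


# Hypothetical classroom: each table is one cluster.
classroom = {
    "table_1": ["Ava", "Ben", "Cleo"],
    "table_2": ["Dan", "Eve", "Finn"],
    "table_3": ["Gus", "Hana", "Ivy"],
    "table_4": ["Jon", "Kim", "Leo"],
}
selected = cluster_sample(classroom, n_clusters=2)
print(sorted(selected))
```

For this project, a cluster could be a forum, a community, or a news outlet instead of a table.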

4. Convenience Sampling
Convenience sampling involves collecting data from readily available sources or individuals that are easily accessible. In the context of the Racism-Xenophobia-Classifier project, convenience sampling may involve collecting data from social media platforms, online forums, or public discussions where instances of racism and xenophobia are frequently observed.

By combining these sampling methods in the Racism-Xenophobia-Classifier project, we can gather a diverse and representative dataset that covers various types of racism and xenophobia, captures informative examples, and reduces biases or limited perspectives.

The following table summarizes common data collection methods:

| Method | When to Use | How to Collect Data |
| --- | --- | --- |
| Surveys | To gather information from a large sample | Administer structured questionnaires to individuals or groups |
| Interviews | To obtain in-depth insights or personal experiences | Conduct direct interactions with individuals or groups, using structured or unstructured questioning |
| Observations | To study behaviors or events in natural settings | Systematically watch and record behaviors, events, or phenomena |
| Experiments | To establish cause-and-effect relationships | Manipulate variables under controlled conditions and collect data accordingly |
| Existing Data Analysis | When relevant data already exists for analysis | Analyze pre-existing data from public sources, research institutions, or archives |
| Case Studies | To deeply examine specific individuals or situations | Conduct extensive analysis and investigation of individuals, groups, or phenomena |
| Document Analysis | To analyze written, visual, or audio materials | Examine reports, articles, or social media content for relevant information |
| Ethnography | To understand behavior and beliefs in a cultural group | Immerse oneself in the cultural or social group, observe, and interact with participants |

Useful Datasets for Racism and Xenophobia Detection

In this section, I present datasets that have been used for hate speech detection or related tasks such as cyberbullying, abusive language, and online harassment detection, to make it easier for researchers to obtain them. Although several social media platforms can serve as data sources, constructing a balanced labeled dataset is costly in time and effort, and it remains a problem for researchers in the area. Most of the datasets listed below are not publicly available, but some of them can be obtained from the authors on request.

English

| No | Datasets (Link to paper) | Objects | Size | Available | Labels | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Dinakar et al., 2011 | YouTube Comments | 6000 | - | Sexuality, Race, Culture, Intelligence | |
| 2 | Dadvar and Jong, 2012 | Myspace Posts | 2200 | - | Bullying, Non Bullying | |
| 3 | Huang et al., 2014 | Tweets | 4865 | - | Bullying, Non Bullying | |
| 4 | Hosseinmardi et al., 2015 | Instagram Media Sessions | 998 | - | Bullying, Non Bullying | |
| 5 | Waseem and Hovy, 2016 | Tweets | 16914 | Download | Racist, Sexist, Either | |
| 6 | Waseem, 2016 | Tweets | 6909 | Download | Racist, Sexist, Either, Both | |
| 7 | Nobata et al., 2016 | Yahoo Comments | 2000 | - | Abusive, Clean | |
| 8 | Chatzakou et al., 2017 | Twitter Users | 9484 | - | Aggressor, Bully, Spammer | |
| 9 | Davidson et al., 2017 | Tweets | 24802 | Download | hate_speech, offensive, neither | |
| 10 | Golbeck et al., 2017 | Tweets | 35000 | - | Harassing, Non Harassing | |
| 11 | Wulczyn et al., 2017 | Wikipedia Comments | 100000 | Download | Personal Attacks | |
| 12 | Tahmasbi and Rastegari, 2018 | Tweets | 12837 | - | Bullying, Non Bullying | |
| 13 | Anzovino et al., 2018 | Tweets | 4454 | - | Discredit, Stereotype, Objectification, Sexual_Harassment, Threats of Violence, Dominance, Derailing | |
| 14 | Founta et al., 2018 | Tweets | 80000 | Download | Hate Speech, Offensive, None | |
| 15 | Gibert et al., 2018 | Sentences from Stormfront | 10568 | Download | Hate Speech, Non Hate Speech | |
| 16 | SemEval19, 2019 | Tweets | 9000 | Request Link | Hate Speech, Non Hate Speech | |
| 17 | OLID 2019 | Tweets | 14100 | Download | Offensive, Non Offensive | |
| 18 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 4,263 | Request Form | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) | Data geolocated: India |
| 19 | meTooMA 2020 | Tweets | 9,973 | Download | Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither) | Data geolocated: India, Australia, Kenya, Iran, UK |

Multilingual (Parallel Data)

| No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | XHate 999 | Tweets from previous published English datasets and translated to 5 languages | 600 (x 6 languages) | Download | English, German, Russian, Croatian, Albanian, Turkish | sexism, racism, toxicity, hatefulness, aggression, attack, cyberbullying, misogyny, obscenity, threats, and insults |

The following links point to further related datasets:

| Dataset Name | Description | Language | Classes | Source | Download |
| --- | --- | --- | --- | --- | --- |
| HateEval | Annotated tweets for hate speech and offensive language. | English | (women or immigrants) is hateful or not hateful | Twitter | https://competitions.codalab.org/competitions/19935 |
| Wikipedia Talk Labels | User comments from Wikipedia talk pages annotated for toxicity. | English | toxic or healthy | Wikipedia | https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973/2 |
| Online Harassment Dataset (Wikimedia) | User comments from Wikimedia platforms annotated for harassment. | English | bullying or not | | https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset |
| Cyberbullying Dataset | The data contain text and labeled as bullying or not. | English | | Kaggle, Twitter, Wikipedia Talk | https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset |
| Hate Speech and Offensive Language Dataset | The text is classified as: hate-speech, offensive, and neither | English | 0 - hate speech, 1 - offensive language, 2 - neither | Twitter | https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/data |
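
Because building a balanced labeled dataset is difficult, it can help to inspect the class distribution of a dataset before training. The sketch below uses the class encoding listed for the Hate Speech and Offensive Language Dataset (0, 1, 2). The label values in `labels` are invented placeholders rather than real data, and `class_distribution` is a hypothetical helper.

```python
from collections import Counter

# Class encoding as listed in the table above.
CLASS_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}


def class_distribution(labels):
    """Return the fraction of examples that falls into each class."""
    counts = Counter(CLASS_NAMES[label] for label in labels)
    total = sum(counts.values())
    return {name: counts[name] / total for name in CLASS_NAMES.values()}


# Placeholder labels standing in for a loaded label column.
labels = [2, 1, 1, 0, 2, 1, 1, 2]
distribution = class_distribution(labels)
print(distribution)
```

A strongly skewed distribution is a signal that stratified sampling or re-weighting may be worth considering.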

Step 3: Advancements and types of Language Models:

Different types of language models:

Step 4: Implementation of the selected method

Dataset

Prepare and preprocess data

References

  1. https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/sampling/types-of-sampling.html
  2. https://github.com/aymeam/Datasets-for-Hate-Speech-Detection/blob/master/README.md
