This project is a step-by-step guide to building a Racism-Xenophobia Classifier with PyTorch. It aims to provide a comprehensive understanding of the process involved in developing the model and its applications.
The Racism-Xenophobia-Classifier repository is a machine learning project focused on developing a classifier that detects instances of racism and xenophobia in English sentences. The goal is to provide a robust and accurate tool for identifying and categorizing text based on the presence of racist and xenophobic content.
The Racism-Xenophobia-Classifier project has diverse real-world applications. It can be employed for content moderation on social media platforms, aiding sentiment analysis by identifying racism and xenophobia, monitoring public opinion on these issues, supporting research and studies on societal attitudes, informing policy development, and serving as an educational tool for fostering inclusivity. Overall, the classifier contributes to creating safer online spaces and promoting understanding and respect in society.
In the data collection phase of the Racism-Xenophobia-Classifier project, the goal is to gather a diverse and representative dataset of English sentences labeled with instances of racism and xenophobia. This dataset will serve as the foundation for training and evaluating the classifier.
Sampling methods can be utilized during data collection to ensure the dataset captures a wide range of examples and maintains a balanced representation. Here are a few scenarios where sampling methods can be beneficial:
1. Random Sampling
Random sampling involves selecting data points from a larger pool without any specific pattern or bias. It ensures a diverse representation of text by capturing a wide range of examples. For the Racism-Xenophobia-Classifier project, random sampling can be used to collect sentences from various sources to avoid favoring specific contexts or demographics.
Advantages
- Easy to implement.
- Each member of the population has an equal chance of being chosen.
- Free from bias.
Disadvantages
- If the sampling frame is large, random sampling may be impractical.
- A complete list of the population may not be available.
- Minority subgroups within the population may not be present in the sample.
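As a minimal sketch, random sampling can be implemented with Python's standard library. The `corpus` list below is hypothetical stand-in data for sentences gathered from several sources:

```python
import random

# Hypothetical pool of candidate sentences collected from several sources.
corpus = [
    "sentence from a news comment section",
    "sentence from a public forum thread",
    "sentence from a social media post",
    "sentence from a blog comment",
]

random.seed(42)  # fix the seed so the draw is reproducible

# Draw uniformly at random without replacement: every sentence in the
# pool has an equal chance of being selected.
sample = random.sample(corpus, k=2)
print(sample)
```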
2. Stratified Sampling
The population is divided into subgroups (strata) based on specific characteristics, such as age, gender, or race, and random sampling is then used within each stratum to choose the sample. In the context of the Racism-Xenophobia-Classifier project, stratified sampling can be used to ensure proportional representation of different types of racism and xenophobia, such as racial slurs, discriminatory remarks, or xenophobic comments.
Advantages
- Strata can be proportionally represented in the final sample.
- It is easy to compare subgroups.
Disadvantages
- Information must be gathered before being able to divide the population into subgroups.
Example
A school of 1000 students is classified as follows: 57% brunette, 29% redhead, and 14% blonde. Find a stratified sample of 200 students for this population.
Solution:
Suppose we are interested in how each of these groups will react to this statement: everyone in this school has an equal chance of success. Relying on a simple random sample may under-represent the minority populations of the school (students with blonde hair). By grouping our population by hair colour, we can choose a sample that ensures each group is represented according to its proportion of the population: 57% of the sample should be brunette, 29% redhead, and 14% blonde. Within each group (stratum), the sample is selected randomly. Since our sample consists of 200 people, 114 should be brunette, 58 redhead, and 28 blonde.
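The same allocation can be expressed in code. Below is a sketch of the school example, where the member IDs are hypothetical and each stratum's share of the sample is computed proportionally:

```python
import random

random.seed(0)

# Hypothetical member IDs for each stratum (hair colour).
population = {
    "brunette": [f"brunette_{i}" for i in range(570)],
    "redhead":  [f"redhead_{i}"  for i in range(290)],
    "blonde":   [f"blonde_{i}"   for i in range(140)],
}

total = sum(len(members) for members in population.values())  # 1000
sample_size = 200

stratified_sample = []
for stratum, members in population.items():
    # Allocate this stratum's share proportionally (the proportions here
    # divide evenly; in general, rounding may need a small adjustment),
    # then sample randomly within the stratum.
    n = round(sample_size * len(members) / total)
    stratified_sample.extend(random.sample(members, n))

print(len(stratified_sample))  # 114 + 58 + 28 = 200
```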
3. Clustered Sampling
Clustered sampling involves dividing the population into clusters or groups, and then randomly selecting clusters for data collection. In the Racism-Xenophobia-Classifier project, clustered sampling can be used to select specific online communities, forums, or news articles that are more likely to contain instances of racism and xenophobia, ensuring a more focused collection of relevant data.
Advantages
- Cuts down the cost and time by collecting data from only a limited number of groups.
- Can show grouped variations.
Disadvantages
- It is not a genuine random sample.
- Because the sample is drawn from only a few clusters, it is likely to be less representative of the population.
Example
The children in a classroom are divided into groups depending on which table they sit at. A sample can be obtained from this classroom by choosing n tables to represent the class, as sketched below.
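Here is a sketch of that classroom example in Python; the table rosters are hypothetical:

```python
import random

random.seed(1)

# Each table is a cluster; the rosters below are hypothetical.
tables = {
    "table_1": ["Ana", "Ben", "Cara"],
    "table_2": ["Dan", "Eve", "Finn"],
    "table_3": ["Gus", "Hana", "Ivan"],
    "table_4": ["Jo", "Kim", "Lee"],
}

# Randomly select n whole clusters...
n_tables = 2
chosen = random.sample(list(tables), n_tables)

# ...and keep every child at the chosen tables as the sample.
cluster_sample = [child for table in chosen for child in tables[table]]
print(chosen, cluster_sample)
```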
4. Convenience Sampling
Convenience sampling involves collecting data from readily available sources or individuals that are easily accessible. In the context of the Racism-Xenophobia-Classifier project, convenience sampling may involve collecting data from social media platforms, online forums, or public discussions where instances of racism and xenophobia are frequently observed.
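A convenience sample involves no randomization at all; the sketch below illustrates the idea, with `fetch_recent_posts` as a hypothetical helper standing in for any platform API client:

```python
def fetch_recent_posts(limit):
    # Hypothetical placeholder: in practice this would call a platform
    # API or read an already-downloaded dump.
    return [f"post_{i}" for i in range(limit)]

# No randomization: the sample is simply whatever is easiest to reach,
# which is cheap but may not represent the wider population.
convenience_sample = fetch_recent_posts(limit=100)
print(len(convenience_sample))
```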
By applying these data collection methods to the Racism-Xenophobia-Classifier project, we can gather a diverse and representative dataset that covers various types of racism and xenophobia, captures informative examples, and avoids biases or limited perspectives.
The following table summarizes common data collection methods:
Method | When to Use | How to Collect Data |
---|---|---|
Surveys | To gather information from a large sample | Administer structured questionnaires to individuals or groups |
Interviews | To obtain in-depth insights or personal experiences | Conduct direct interactions with individuals or groups, using structured or unstructured questioning |
Observations | To study behaviors or events in natural settings | Systematically watch and record behaviors, events, or phenomena |
Experiments | To establish cause-and-effect relationships | Manipulate variables under controlled conditions and collect data accordingly |
Existing Data Analysis | When relevant data already exists for analysis | Analyze pre-existing data from public sources, research institutions, or archives |
Case Studies | To deeply examine specific individuals or situations | Conduct extensive analysis and investigation of individuals, groups, or phenomena |
Document Analysis | To analyze written, visual, or audio materials | Examine reports, articles, or social media content for relevant information |
Ethnography | To understand behavior and beliefs in a cultural group | Immerse oneself in the cultural or social group, observe, and interact with participants |
In this section, I present information on datasets that have been used for hate speech detection or related concepts such as cyberbullying, abusive language, and online harassment, to make it easier for researchers to obtain datasets. Even though there are several social media platforms to collect data from, constructing a balanced labeled dataset is a costly task in time and effort, and it remains a problem for researchers in the area. Although most of the datasets listed below are not directly available, some of them can be obtained from the authors on request.
No | Datasets (Link to paper) | Objects | Size | Available | Labels | Comment |
---|---|---|---|---|---|---|
1 | Dinakar et al., 2011 | YouTube Comments | 6000 | - | Sexuality, Race, Culture, Intelligence | |
2 | Dadvar and Jong, 2012 | Myspace Posts | 2200 | - | Bullying, Non Bullying | |
3 | Huang et al., 2014 | Tweets | 4865 | - | Bullying, Non Bullying | |
4 | Hosseinmardi et al., 2015 | Instagram Media Sessions | 998 | - | Bullying, Non Bullying | |
5 | Waseem and Hovy, 2016 | Tweets | 16914 | Download | Racist, Sexist, Neither | |
6 | Waseem, 2016 | Tweets | 6909 | Download | Racist, Sexist, Neither, Both | |
7 | Nobata et al., 2016 | Yahoo Comments | 2000 | - | Abusive, Clean | |
8 | Chatzakou et al., 2017 | Twitter Users | 9484 | - | Aggressor, Bully, Spammer | |
9 | Davidson et al., 2017 | Tweets | 24802 | Download | hate_speech, offensive, neither | |
10 | Golbeck et al., 2017 | Tweets | 35000 | - | Harassing, Non Harassing | |
11 | Wulczyn et al., 2017 | Wikipedia Comments | 100000 | Download | Personal Attacks | |
12 | Tahmasbi and Rastegari, 2018 | Tweets | 12837 | - | Bullying, Non Bullying | |
13 | Anzovino et al., 2018 | Tweets | 4454 | - | Discredit, Stereotype, Objectification, Sexual Harassment, Threats of Violence, Dominance, Derailing | |
14 | Founta et al., 2018 | Tweets | 80000 | Download | Hate Speech, Offensive, None | |
15 | Gibert et al., 2018 | Sentences from Stormfront | 10568 | Download | Hate Speech, Non Hate Speech | |
16 | SemEval19, 2019 | Tweets | 9000 | Request Link | Hate speech, Non Hate Speech | |
17 | OLID 2019 | Tweets | 14100 | Download | Offensive, Non Offensive | |
18 | TRAC-2 2020 | Messages (Twitter, Facebook, YouTube) | 4263 | Request Form | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) | Data geolocated in India |
19 | meTooMA 2020 | Tweets | 9973 | Download | Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither) | Data geolocated in India, Australia, Kenya, Iran, UK |
No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
---|---|---|---|---|---|---|
1 | XHate-999 | Tweets from previously published English datasets, translated into 5 other languages | 600 (x 6 languages) | Download | English, German, Russian, Croatian, Albanian, Turkish | Sexism, racism, toxicity, hatefulness, aggression, attack, cyberbullying, misogyny, obscenity, threats, and insults |
Here are additional links for finding related datasets:
Dataset Name | Description | Language | Classes | Source | Download |
---|---|---|---|---|---|
HateEval | Annotated tweets for hate speech against women or immigrants. | English | hateful, not hateful | Twitter | https://competitions.codalab.org/competitions/19935 |
Wikipedia Talk Labels | User comments from Wikipedia talk pages annotated for toxicity. | English | toxic, healthy | Wikipedia | https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973/2 |
Online Harassment Dataset (Wikimedia) | User comments from Wikimedia platforms annotated for harassment. | English | bullying, not bullying | Wikimedia | https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset |
Cyberbullying Dataset | Text labeled as bullying or not. | English | bullying, not bullying | Kaggle, Twitter, Wikipedia Talk | https://www.kaggle.com/datasets/saurabhshahane/cyberbullying-dataset |
Hate Speech and Offensive Language Dataset | Text classified as hate speech, offensive language, or neither. | English | 0 = hate speech, 1 = offensive language, 2 = neither | Twitter | https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset/data |
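To connect these datasets back to the PyTorch pipeline, the sketch below wraps a downloaded labeled CSV in a `torch.utils.data.Dataset`. The file name `labeled_data.csv` and the column names `tweet` and `class` are assumptions modeled on the Davidson et al. dataset above; adjust them to match the file you actually download:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class HateSpeechDataset(Dataset):
    """Wraps a labeled CSV so a PyTorch DataLoader can consume it."""

    def __init__(self, csv_path, text_col="tweet", label_col="class"):
        # Column names are assumptions based on the Davidson et al.
        # Kaggle dataset; change them for the file you download.
        df = pd.read_csv(csv_path)
        self.texts = df[text_col].astype(str).tolist()
        self.labels = torch.tensor(df[label_col].values, dtype=torch.long)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Returns the raw sentence and its integer label; tokenization
        # can happen here or later in a collate_fn.
        return self.texts[idx], self.labels[idx]

# Usage, assuming the CSV has been downloaded locally:
# dataset = HateSpeechDataset("labeled_data.csv")
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```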