Jigsaw Unintended Bias in Toxicity Classification

Kaggle Competition

Getting started (these steps are also included in simple_lstm_baseline.py)

# Download the dataset
kaggle competitions download -c jigsaw-unintended-bias-in-toxicity-classification

# unzip data
mkdir data
unzip test.csv.zip -d data
unzip train.csv.zip -d data

# the extracted files sometimes lack read permission, so grant it
chmod +r data/*

# clean up
rm *.zip

Dataset

Submission

For evaluation, test set examples with target >= 0.5 will be considered to be in the positive class (toxic).

Models do not need to predict the additional attributes for the competition. The expected submission format is:

id,prediction
7000000,0.0
7000001,0.0
etc.
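A minimal sketch of writing a file in this format with pandas; `test_ids` and `preds` are placeholder names for the test ids and predicted probabilities, not values from the actual pipeline:

import pandas as pd

test_ids = [7000000, 7000001]  # placeholder ids
preds = [0.0, 0.0]             # placeholder model probabilities

# write the two expected columns, without the DataFrame index
submission = pd.DataFrame({"id": test_ids, "prediction": preds})
submission.to_csv("submission.csv", index=False)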

Evaluation

Submetrics

  • Overall AUC: the ROC-AUC for the full evaluation set
  • Bias AUCs, computed per identity subgroup (sketched in code below):
    • Subgroup AUC: ROC-AUC restricted to comments that mention the identity subgroup
    • BPSN (Background Positive, Subgroup Negative) AUC: ROC-AUC on the non-toxic comments that mention the subgroup together with the toxic comments that do not
    • BNSP (Background Negative, Subgroup Positive) AUC: ROC-AUC on the toxic comments that mention the subgroup together with the non-toxic comments that do not
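A rough sketch of the three bias submetrics, assuming a pandas DataFrame `df` with a boolean label column `y` (target >= 0.5), model scores `pred`, and a boolean `in_subgroup` column marking comments that mention the identity; the column names are assumptions for illustration, not the competition's API:

from sklearn.metrics import roc_auc_score

def subgroup_auc(df):
    # AUC over comments that mention the identity subgroup
    sub = df[df["in_subgroup"]]
    return roc_auc_score(sub["y"], sub["pred"])

def bpsn_auc(df):
    # non-toxic subgroup comments + toxic background comments
    mask = (df["in_subgroup"] & ~df["y"]) | (~df["in_subgroup"] & df["y"])
    return roc_auc_score(df[mask]["y"], df[mask]["pred"])

def bnsp_auc(df):
    # toxic subgroup comments + non-toxic background comments
    mask = (df["in_subgroup"] & df["y"]) | (~df["in_subgroup"] & ~df["y"])
    return roc_auc_score(df[mask]["y"], df[mask]["pred"])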

Generalized Mean of Bias AUCs

$$ M_p(m_s) = \left(\frac{1}{N} \sum_{s=1}^{N} m_s^p\right)^\frac{1}{p} $$
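A direct transcription of the power mean above, where `values` holds the per-subgroup submetric values m_s; the competition uses p = -5, which pulls the mean toward the worst-performing subgroup:

import numpy as np

def power_mean(values, p=-5.0):
    # generalized mean: (mean of m_s^p)^(1/p)
    values = np.asarray(values, dtype=np.float64)
    return np.mean(values ** p) ** (1.0 / p)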

Final Metric

$$ score = w_0 AUC_{overall} + \sum_{a=1}^{A} w_a M_p(m_{s,a}) $$
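Here A = 3 (one term per bias AUC type), and the competition weights all four terms equally at w = 0.25. A sketch of the combination, assuming `overall_auc` and three lists of per-subgroup AUCs computed as above:

import numpy as np

def final_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, w=0.25, p=-5.0):
    # score = w * overall AUC + sum over bias AUC types of w * power mean
    score = w * overall_auc
    for aucs in (subgroup_aucs, bpsn_aucs, bnsp_aucs):
        score += w * np.mean(np.asarray(aucs, dtype=np.float64) ** p) ** (1.0 / p)
    return score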

Usage

# simple LSTM baseline
python3 simple_lstm_baseline.py

Resources

Popular Kernel

Preprocessing

Model