Add MCC #97

Closed
Expertium opened this issue Jun 24, 2024 · 0 comments

Expertium commented Jun 24, 2024

I'll copy what I said in Discord:

Another request: calculate the Matthews correlation coefficient (MCC): https://en.wikipedia.org/wiki/Phi_coefficient
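
For reference, this is the standard MCC definition in terms of the confusion-matrix counts. TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives, i.e. the `tp`, `tn`, `fp` and `fn` values used in the code in this issue. The special cases where the denominator is zero are not covered by the formula itself.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```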

  1. Select 0.5 as the threshold and convert all predicted probabilities to binary values:
# y_pred holds predicted probabilities; values above 0.5 become 1, the rest become 0
y_pred = [1 if p > 0.5 else 0 for p in y_pred]
  2. Calculate the confusion matrix and obtain the four values from it:
from sklearn.metrics import confusion_matrix
# with labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
  3. Calculate MCC:
import math

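def mcc(tn, fp, fn, tp):
    # convert to plain Python ints, since NumPy integers (e.g. from confusion_matrix) can overflow in the product below
    tn, fp, fn, tp = int(tn), int(fp), int(fn), int(tp)
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    # count how many of the four denominator terms are zero
    n_zero = sum(s == 0 for s in sums)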

    if n_zero == 0:
        x = sums[0] * sums[1] * sums[2] * sums[3]  # I tried using np.prod, but it outputs negative values sometimes, probably due to an overflow
        # I also have to use math.sqrt, because np.sqrt doesn't work for very large numbers
        return ((tp * tn) - (fp * fn)) / math.sqrt(x)
    # if one of the four sums in the denominator is zero, return 0
    elif n_zero == 1:
        return 0
    # if two of the four sums are zero, return 1 or -1, depending on TP, TN, FP and FN
    # https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7
    elif n_zero == 2:
        if tp != 0 or tn != 0:
            return 1
        elif fp != 0 or fn != 0:
            return -1
    # if more than two sums are zero, return None
    elif n_zero > 2:
        return None
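
The three steps can also be sketched end to end without scikit-learn. This is only an illustration: `binarize` and `confusion_counts` are hypothetical helper names, the data is made up, and the zero-denominator handling simply mirrors the conventions described above.

```python
import math


def binarize(probs, threshold=0.5):
    """Step 1: convert predicted probabilities to 0/1 labels."""
    return [1 if p > threshold else 0 for p in probs]


def confusion_counts(y_true, y_pred):
    """Step 2: count tn, fp, fn, tp for 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


def mcc(tn, fp, fn, tp):
    """Step 3: MCC with the zero-denominator conventions described above."""
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    n_zero = sum(s == 0 for s in sums)
    if n_zero == 0:
        return (tp * tn - fp * fn) / math.sqrt(sums[0] * sums[1] * sums[2] * sums[3])
    if n_zero == 1:
        return 0
    if n_zero == 2:
        return 1 if (tp != 0 or tn != 0) else -1
    return None  # all four counts are zero


# Made-up example data, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

y_pred = binarize(probs)
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)
print(mcc(tn, fp, fn, tp))
```

Because the counts here are plain Python integers, the product in the denominator cannot overflow, which sidesteps the `np.prod` problem mentioned in the comments.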

I linked 2 articles about MCC (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9938573 and https://ieeexplore.ieee.org/abstract/document/9440903). MCC has an advantage over AUC: it takes into account all four numbers (true positives, true negatives, false positives, false negatives), whereas AUC only takes into account two.
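
To illustrate the "all four numbers" point, here is a small sketch with hypothetical counts: increasing any one of TP, TN, FP or FN changes the MCC. It only demonstrates the MCC side of the comparison and says nothing about AUC. The function assumes none of the four denominator sums is zero.

```python
import math


def mcc(tn, fp, fn, tp):
    """Plain MCC formula; assumes none of the four denominator sums is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# Hypothetical baseline counts, chosen only for illustration
base = {"tn": 50, "fp": 10, "fn": 5, "tp": 35}
base_mcc = mcc(**base)
print(f"baseline: {base_mcc:.4f}")

# Add 10 to each count in turn and see how the MCC moves
shifted = {}
for name in base:
    counts = dict(base)
    counts[name] += 10
    shifted[name] = mcc(**counts)
    print(f"{name} + 10: {shifted[name]:.4f}")
```

With these counts, adding to TP or TN raises the MCC, while adding to FP or FN lowers it.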

So with AUC and MCC added, we will have 2 calibration metrics and 2 classification metrics, which is more than enough.

L-M-Sherlock closed this as not planned on Jul 4, 2024