Inconsistency in number of classes in EC/GO downstream datasets #50

klemens-floege · 2024-07-31T10:59:24Z

Dear all,

I believe I have found some major flaws in the EC/GO downstream datasets you linked on your Google Drive (https://drive.google.com/drive/folders/11dNGqPYfLE3M-Mbh4U7IQpuHxJpuRr4g).

In the SaProt codebase, the SaProtAnnotationModel class specifies the number of classes in these datasets as label2num = {"EC": 585, "GO_BP": 1943, "GO_MF": 489, "GO_CC": 320}. However, when inspecting the EC dataset, for example, I find only 366 distinct classes in the training set, 263 in the test set, and 287 in the validation set. Similar issues arise in all three GO datasets. This seems like an ill-posed classification problem to me, and I would appreciate some clarification.

Thank you very much for taking the time to look into this.

PS: Here is the simple pandas code I used for the analysis:
`
import pandas as pd

# ec_train_path, ec_valid_path and ec_test_path point to the downloaded EC CSV splits
df_test = pd.read_csv(ec_test_path)
df_train = pd.read_csv(ec_train_path)
df_valid = pd.read_csv(ec_valid_path)

# Number of distinct values in the 'class' column of each split
print(df_train['class'].nunique())  # 366
print(df_test['class'].nunique())   # 263
print(df_valid['class'].nunique())  # 287

# Convert 'class' columns to sets
train_classes = set(df_train['class'])
valid_classes = set(df_valid['class'])
test_classes = set(df_test['class'])

# Find the pairwise intersections of the sets
intersection_train_val = train_classes.intersection(valid_classes)
intersection_train_test = train_classes.intersection(test_classes)
intersection_val_test = valid_classes.intersection(test_classes)

print(len(intersection_train_val))   # 287
print(len(intersection_train_test))  # 262
print(len(intersection_val_test))    # 207

`

LTEnjoy · 2024-07-31T14:22:03Z

Hi, thank you for your interest in our work!

Could you explain how you define a "distinct class"? The EC and GO tasks are multiple binary classification tasks: each protein is mapped to multiple labels, one per function, and each label is 0 or 1 to indicate whether the protein has that function. For instance, the number 585 for the EC task means that each protein has 585 binary labels, such as 0 1 0 ... 1 0 0. A 1 at a given position indicates that the protein has the corresponding function.
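To make the multi-label setup concrete, here is a minimal sketch using NumPy. It is only an illustration: the protein names, the label indices, and the helper `to_multi_hot` are hypothetical and are not taken from the SaProt codebase or from the dataset files. Only the vector length 585 comes from the `label2num` mapping quoted above. The sketch shows that the number of labels set to 1 somewhere in a split and the number of distinct label combinations are different quantities.

```python
import numpy as np

NUM_EC_LABELS = 585  # vector length, from label2num["EC"] quoted in this thread


def to_multi_hot(label_indices, num_labels=NUM_EC_LABELS):
    """Turn a list of active label indices into a 0/1 vector of length num_labels."""
    vec = np.zeros(num_labels, dtype=np.int64)
    vec[np.asarray(label_indices, dtype=np.int64)] = 1
    return vec


# Hypothetical toy split: each protein is annotated with one or more label indices.
proteins = {
    "protein_a": [1, 584],
    "protein_b": [1],
    "protein_c": [3, 7, 42],
}

# One row per protein, one column per binary label.
label_matrix = np.stack([to_multi_hot(ix) for ix in proteins.values()])

# Labels that are 1 for at least one protein in this toy split.
labels_in_use = int(label_matrix.any(axis=0).sum())

# Distinct label combinations (distinct rows), which is a different quantity.
distinct_rows = len({tuple(row) for row in label_matrix})

print(label_matrix.shape)  # (3, 585)
print(labels_in_use)       # 5
print(distinct_rows)       # 3
```

Under these assumptions, counting unique values of a single column (as `nunique()` does) would measure something closer to `distinct_rows` than to the fixed label-vector length, which may explain why the counts differ from 585.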
