Could you explain more about how you define "distinct class"? The EC and GO tasks are multi-label tasks, i.e. a set of binary classification tasks: each protein is mapped to multiple labels, one per function, and each label is 0 or 1 to indicate whether the protein has that function. For instance, the number "585" for the EC task means that each protein has 585 binary labels, such as 0 1 0 ... 1 0 0. A 1 at a given position indicates that the protein has the corresponding function.
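To make the distinction concrete, here is a minimal sketch of the multi-hot encoding described above. The protein names and label indices are made up for illustration, and the helper `to_multi_hot` is hypothetical (it is not part of the SaProt codebase). Only the label count 585 comes from `label2num`. How the `class` column in the CSV files relates to these vectors is not stated in this thread, so the sketch makes no assumption about it.

```python
import numpy as np

NUM_EC_LABELS = 585  # label2num["EC"], as quoted in the issue


def to_multi_hot(active_indices, num_labels=NUM_EC_LABELS):
    """Hypothetical helper: 0/1 vector with ones at the given label indices."""
    vec = np.zeros(num_labels, dtype=np.int64)
    vec[list(active_indices)] = 1
    return vec


# Made-up annotations: each protein has a set of active label indices.
proteins = {
    "protein_a": [3, 17],
    "protein_b": [17],
    "protein_c": [3, 17, 584],
    "protein_d": [584],
}

# One row per protein, one column per binary label.
labels = np.stack([to_multi_hot(idx) for idx in proteins.values()])

# Number of label positions that are active for at least one protein.
num_labels_used = int((labels.sum(axis=0) > 0).sum())

# Number of distinct label combinations (distinct rows).
num_distinct_rows = len({tuple(row) for row in labels})
```

In this toy example the label dimension is 585, only three label positions are ever active, and there are four distinct label combinations. These are three different quantities, so a count of distinct values in a single column need not match the label dimension.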
Dear all,
I believe I have found some major flaws in the EC/GO downstream datasets you linked on your Google Drive (https://drive.google.com/drive/folders/11dNGqPYfLE3M-Mbh4U7IQpuHxJpuRr4g).
In the SaProt codebase, the SaProtAnnotationModel class specifies the number of classes in these datasets as: label2num = {"EC": 585, "GO_BP": 1943, "GO_MF": 489, "GO_CC": 320}. However, when investigating the EC dataset, for example, I only find 366 distinct classes in the training set, 263 in the test set, and 287 in the validation set. Similar issues arise in all three GO datasets. This seems like an ill-posed classification problem to me, and I would appreciate some clarification.
Thank you very much for taking the time to look into this.
PS: Here is the simple Pandas code I used for the analysis.
```python
import pandas as pd

# ec_test_path, ec_train_path, ec_valid_path: paths to the EC CSV splits
df_test = pd.read_csv(ec_test_path)
df_train = pd.read_csv(ec_train_path)
df_valid = pd.read_csv(ec_valid_path)

# Number of distinct values in the 'class' column of each split
df_train['class'].nunique()  # 366
df_test['class'].nunique()   # 263
df_valid['class'].nunique()  # 287

# Convert 'class' columns to sets
train_classes = set(df_train['class'])
valid_classes = set(df_valid['class'])
test_classes = set(df_test['class'])

# Find the pairwise intersections of the sets
intersection_train_val = train_classes.intersection(valid_classes)
intersection_train_test = train_classes.intersection(test_classes)
intersection_val_test = valid_classes.intersection(test_classes)

len(intersection_train_val)   # 287
len(intersection_train_test)  # 262
len(intersection_val_test)    # 207
```