Description
Describe the bug
I noticed that for BindingDB the raw dataset contains 884846 unique SMILES-Protein sequence pairs and after applying harmonize_affinities, it contains only 766817 unique pairs.
To Reproduce
Steps to reproduce the behavior:
from tdc.multi_pred import DTI

# Harmonized data: duplicate pairs aggregated with mode='max_affinity'
loader_BindingDB_IC50_max = DTI(name='BindingDB_IC50')
data_IC50_max = loader_BindingDB_IC50_max.harmonize_affinities(mode='max_affinity')
# Raw data, without harmonization
loader_BindingDB_IC50_raw = DTI(name='BindingDB_IC50')
data_IC50_raw = loader_BindingDB_IC50_raw.get_data()
# Number of unique (Drug, Target) pairs in each frame (first element of shape)
print(data_IC50_max.groupby(['Drug', 'Target']).count().shape)
print(data_IC50_raw.groupby(['Drug', 'Target']).count().shape)
Expected behavior
I would expect the number of unique protein-molecule pairs to stay the same after applying the aggregation function.
Environment:
- OS: Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-59-generic x86_64)
- Python version: Python 3.8.12
- TDC version: 0.3.7
- Any other relevant information:
Additional context
After checking the code, I believe that the problem comes from the pandas groupby call here. Pandas groupby drops NaN keys by default, and it seems that there are NaNs in the Target_ID column of the BindingDB dataset. Since the groupby operation is performed on ["Drug_ID", "Drug", "Target_ID", "Target"], all records with Target_ID=NaN are simply dropped, even if they have a protein sequence. I think that simply removing "Drug_ID" and "Target_ID" from the groupby keys would solve the issue. However, the NaN ids still need to be handled, because I expect users to generally assume non-NaN ids.
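To illustrate the suspected cause, here is a minimal sketch on a made-up toy frame (not the real BindingDB data; the column names follow the issue, the values and the `Y` aggregation with `max` are assumptions for illustration only). It shows how pandas drops groups whose key contains NaN, and two possible ways around it: passing `dropna=False` (available in pandas 1.1 and later) or grouping only on the SMILES and sequence columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three records have no Target_ID,
# but all of them have a protein sequence in "Target".
df = pd.DataFrame({
    "Drug_ID": [1.0, 2.0, 3.0],
    "Drug": ["CCO", "CCN", "CCC"],
    "Target_ID": ["P1", np.nan, np.nan],
    "Target": ["MKV", "MKV", "MAA"],
    "Y": [5.0, 7.0, 9.0],
})

keys_with_ids = ["Drug_ID", "Drug", "Target_ID", "Target"]

# Default behaviour: any group whose key contains NaN is silently dropped.
default_groups = df.groupby(keys_with_ids)["Y"].max().reset_index()

# Option 1: keep NaN keys as their own groups.
kept_groups = df.groupby(keys_with_ids, dropna=False)["Y"].max().reset_index()

# Option 2: group only on SMILES and sequence, ignoring the id columns.
seq_only = df.groupby(["Drug", "Target"])["Y"].max().reset_index()

print(len(default_groups), len(kept_groups), len(seq_only))
```

On this toy frame the default groupby keeps only the one pair with a non-NaN Target_ID, while both alternatives keep all three pairs. Whether either option is the right fix inside harmonize_affinities is a separate question; option 2 discards the id columns, so they would have to be re-attached afterwards.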