harmonize_affinities discards data from BindingDB #384

@MatejHl

Describe the bug
For BindingDB (BindingDB_IC50), the raw dataset contains 884846 unique SMILES-protein sequence pairs, but after applying harmonize_affinities only 766817 unique pairs remain.

To Reproduce
Steps to reproduce the behavior:

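from tdc.multi_pred import DTI

# Harmonized data: affinities aggregated per pair with mode='max_affinity'
loader_BindingDB_IC50_max = DTI(name='BindingDB_IC50')
data_IC50_max = loader_BindingDB_IC50_max.harmonize_affinities(mode='max_affinity')

# Raw data from a fresh loader, for comparison
loader_BindingDB_IC50_raw = DTI(name='BindingDB_IC50')
data_IC50_raw = loader_BindingDB_IC50_raw.get_data()

# The first entry of each shape is the number of unique (Drug, Target) pairs
print(data_IC50_max.groupby(['Drug', 'Target']).count().shape)
print(data_IC50_raw.groupby(['Drug', 'Target']).count().shape)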

Expected behavior
I would expect the number of unique protein-molecule pairs to stay the same after applying the aggregation function.
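
To check which pairs go missing, an anti-join between the two frames is one option. The sketch below is only an illustration: `lost_pairs` is a hypothetical helper (not part of TDC), and the two toy frames stand in for `data_IC50_raw` and `data_IC50_max`, assuming both have `Drug` and `Target` columns.

```python
import pandas as pd


def lost_pairs(raw: pd.DataFrame, harmonized: pd.DataFrame) -> pd.DataFrame:
    """Return unique (Drug, Target) pairs found in `raw` but not in `harmonized`."""
    keys = ["Drug", "Target"]
    raw_pairs = raw[keys].drop_duplicates()
    kept_pairs = harmonized[keys].drop_duplicates()
    # A left merge with indicator=True marks pairs that exist only in the raw frame
    merged = raw_pairs.merge(kept_pairs, on=keys, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", keys].reset_index(drop=True)


# Toy stand-ins for the raw and harmonized BindingDB frames (made-up values)
raw = pd.DataFrame({
    "Drug": ["CCO", "CCO", "CCN", "CCC"],
    "Target": ["MKV", "MKV", "MKV", "MAA"],
    "Y": [1.0, 2.0, 3.0, 4.0],
})
harmonized = pd.DataFrame({
    "Drug": ["CCO", "CCC"],
    "Target": ["MKV", "MAA"],
    "Y": [2.0, 4.0],
})

missing = lost_pairs(raw, harmonized)
print(missing)
```

With the real data, calling `lost_pairs(data_IC50_raw, data_IC50_max)` and then looking up those pairs in the raw frame should show whether the lost rows share a property such as a missing ID.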

Environment:

  • OS: Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-59-generic x86_64)
  • Python version: 3.8.12
  • TDC version: 0.3.7

Additional context
After checking the code, I believe the problem comes from the pandas groupby here. Pandas groupby drops rows with NaN keys by default, and it seems that there are NaNs in the Target_ID column of the BindingDB dataset. Since the groupby is performed on ["Drug_ID", "Drug", "Target_ID", "Target"], all records with Target_ID=NaN are dropped, even if they have a protein sequence. I think that simply removing "Drug_ID" and "Target_ID" from the groupby keys would solve the issue. However, the NaN IDs still need to be handled, because users will generally assume that IDs are non-NaN.
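
The suspected behavior can be reproduced on a toy frame without downloading BindingDB. This is a minimal sketch, not TDC's actual code: the data is made up, `max` is only a stand-in for the aggregation, and the `dropna` argument assumes pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

# Toy data: two of the three unique (Drug, Target) pairs have no Target_ID
df = pd.DataFrame({
    "Drug_ID": [1, 1, 2, 3],
    "Drug": ["CCO", "CCO", "CCN", "CCC"],
    "Target_ID": ["P1", "P1", np.nan, np.nan],
    "Target": ["MKV", "MKV", "MKV", "MAA"],
    "Y": [1.0, 5.0, 3.0, 4.0],
})

all_keys = ["Drug_ID", "Drug", "Target_ID", "Target"]

# Default groupby: rows with NaN in any key column are silently dropped
default = df.groupby(all_keys)["Y"].max().reset_index()

# dropna=False keeps NaN as its own key value, so no pair is lost
keep_nan = df.groupby(all_keys, dropna=False)["Y"].max().reset_index()

# Grouping on the structure columns only sidesteps the NaN IDs entirely
pair_only = df.groupby(["Drug", "Target"])["Y"].max().reset_index()

print(len(default), len(keep_nan), len(pair_only))
```

Both alternatives keep every pair in this toy case. Grouping on `Drug` and `Target` alone discards the ID columns from the result, while `dropna=False` keeps them but leaves NaN IDs in the output, so the NaN IDs would still need separate treatment either way.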
