harmonize_affinities discards data from BindingDB

**Describe the bug**
For BindingDB, the raw dataset contains 884846 unique SMILES-protein sequence pairs, but after applying `harmonize_affinities` only 766817 unique pairs remain.

**To Reproduce**
Steps to reproduce the behavior:
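```python
from tdc.multi_pred import DTI

# Harmonized data: duplicate measurements collapsed, keeping the max affinity
loader_BindingDB_IC50_max = DTI(name = 'BindingDB_IC50')
data_IC50_max = loader_BindingDB_IC50_max.harmonize_affinities(mode = 'max_affinity')

# Raw data from a fresh loader, for comparison
loader_BindingDB_IC50_raw = DTI(name = 'BindingDB_IC50')
data_IC50_raw = loader_BindingDB_IC50_raw.get_data()

# The first element of each shape is the number of unique (Drug, Target) pairs
print(data_IC50_max.groupby(['Drug', 'Target']).count().shape)
print(data_IC50_raw.groupby(['Drug', 'Target']).count().shape)
```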

**Expected behavior**
I would expect the same number of unique protein-molecule pairs after applying the aggregation function.

**Environment:**
- OS: Ubuntu 24.04.2 LTS (GNU/Linux 6.8.0-59-generic x86_64)
- Python version: Python 3.8.12
- TDC version: 0.3.7
- Any other relevant information:

**Additional context**
After checking the code, I believe the problem comes from the pandas `groupby` call [here](https://github.com/mims-harvard/TDC/blob/c310c35f27e3f506411018ac43d97b8ba23ca652/tdc/multi_pred/dti.py#L50). By default, pandas `groupby` drops rows whose group keys contain NaN (`dropna=True`), and it seems that there are NaNs in the `Target_ID` column of the BindingDB dataset. Since the groupby is performed over `["Drug_ID", "Drug", "Target_ID", "Target"]`, all records with `Target_ID=NaN` are silently dropped, even if they have a protein sequence. I think that simply removing `"Drug_ID"` and `"Target_ID"` from the groupby keys would solve the issue. However, the NaN IDs still need to be handled, because users will generally assume that IDs are non-NaN.
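
To illustrate the suspected mechanism without downloading BindingDB, here is a minimal sketch on a toy frame. The column names mirror the ones in the groupby above, but the rows are made up and the aggregation is only an approximation of what `harmonize_affinities` does, not the actual TDC code. It compares the default groupby with two possible workarounds: passing `dropna=False` (available in pandas 1.1 and later) and grouping on `Drug` and `Target` only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the BindingDB frame; all values are invented for illustration.
df = pd.DataFrame({
    "Drug_ID": [1.0, 1.0, 2.0, 3.0],
    "Drug": ["CCO", "CCO", "CCN", "CCC"],
    "Target_ID": ["P1", "P1", np.nan, np.nan],  # two records lack a target ID
    "Target": ["MKV", "MKV", "MKV", "GAT"],
    "Y": [10.0, 30.0, 5.0, 7.0],
})

keys_with_ids = ["Drug_ID", "Drug", "Target_ID", "Target"]

# Default behaviour: rows with NaN in any group key are dropped.
default = df.groupby(keys_with_ids)["Y"].max().reset_index()

# Workaround 1: keep NaN keys as their own groups (pandas >= 1.1).
keep_nan = df.groupby(keys_with_ids, dropna=False)["Y"].max().reset_index()

# Workaround 2: group only on the SMILES string and the protein sequence.
seq_only = df.groupby(["Drug", "Target"])["Y"].max().reset_index()

n_raw_pairs = len(df[["Drug", "Target"]].drop_duplicates())
print(n_raw_pairs, len(default), len(keep_nan), len(seq_only))
```

In this toy example the raw frame has three unique pairs, the default groupby keeps only the one pair with a non-NaN `Target_ID`, and both workarounds keep all three. Note that the second workaround drops the ID columns from the result, so they would have to be merged back in if downstream code relies on them.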

