ProteinMPNN_dataset #9810

Open: jdhenaos wants to merge 2 commits into base: master

jdhenaos

This PR adds a new dataset for the ProteinMPNN model.

The dataset was evaluated using the following code:

from torch_geometric.loader import DataLoader

# PMPNNDataset is the dataset class added in this PR; import it from
# wherever the PR defines it (the import path is not shown here).

path = "./sample_data"
params = {
    "LIST": f"{path}/pdb_2021aug02_sample/list.csv",
    "VAL": f"{path}/pdb_2021aug02_sample/valid_clusters.txt",
    "DIR": f"{path}/pdb_2021aug02_sample",
    "DATCUT": "2030-Jan-01",
    "RESCUT": 3.5,  # resolution cutoff for PDBs
    "HOMO": 0.70,  # min seq.id. to detect homo chains
}

train_dataset = PMPNNDataset(root=path, params=params, set_type='train')
validation_dataset = PMPNNDataset(root=path, params=params, set_type='val')

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=2)
validation_loader = DataLoader(validation_dataset, batch_size=32, shuffle=True,
                               num_workers=2)

@jdhenaos jdhenaos requested a review from wsad1 as a code owner November 27, 2024 17:06
Contributor

@xnuohz xnuohz left a comment


Left some comments. Please add a dataset test and make CI pass :)

Comment on lines +24 to +33
Args:
root (str): Root directory where the dataset should be saved.
params (dict): Dictionary of parameters for dataset creation:
LIST: Path to the table with metadata.
VAL: Path to list of cluster IDs for model validation.
DIR: Path to dataset.
DATCUT: Date (YYYY-MM-DD) threshold of sequence deposition.
RESCUT: Resolution cutoff for PDBs.
HOMO: Minimal sequence identity to detect homodimeric chains.
set_type (str): Type of expected data, train ("train") or validation ("val")

Args are not aligned with __init__ parameters
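
Reading this comment as "the documented arguments do not match the signature", a quick way to spot the mismatch is to compare the names under `Args:` with the function's parameters. The snippet below is only a sketch: `init_sketch` is a hypothetical stand-in that copies the parameter list and docstring entries quoted in this review, and the helpers `documented_args` and `signature_args` are illustrative names, not part of PyG.

```python
import inspect
import re


def documented_args(func):
    """Return the names listed under 'Args:' in a Google-style docstring."""
    doc = inspect.getdoc(func) or ""
    _, _, args_section = doc.partition("Args:")
    # Match entries such as "root (str): ..." and capture the name.
    return re.findall(r"^[ \t]*(\w+) \([^)]*\):", args_section, flags=re.MULTILINE)


def signature_args(func):
    """Return the parameter names of func, excluding 'self'."""
    return [name for name in inspect.signature(func).parameters if name != "self"]


# Hypothetical stand-in for the dataset's __init__ under review.
def init_sketch(self, root, set_type, params, transform=None,
                pre_transform=None, pre_filter=None, log=True,
                force_reload=False):
    """Stand-in docstring mirroring the one quoted above.

    Args:
        root (str): Root directory where the dataset should be saved.
        params (dict): Dictionary of parameters for dataset creation.
        set_type (str): Type of expected data, "train" or "val".
    """


# Parameters that appear in the signature but not in the docstring.
missing = [name for name in signature_args(init_sketch)
           if name not in documented_args(init_sketch)]
print("documented:", documented_args(init_sketch))
print("signature: ", signature_args(init_sketch))
print("missing:   ", missing)
```

In this stand-in, the docstring lists `params` before `set_type` while the signature has them the other way round, and the remaining keyword arguments are undocumented.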

force_reload=False
#name='sample',
) -> None:
assert set_type in {'train', 'val'}

Is the 'test' split missing?

Suggested change
assert set_type in {'train', 'val'}
assert set_type in {'train', 'val', 'test'}

Comment on lines +40 to +49
self,
root,
set_type, # 'train', 'val', or 'test'
params,
transform=None,
pre_transform=None,
pre_filter=None,
log=True,
force_reload=False
#name='sample',

Add type hints
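
A possible annotated version of the signature quoted above, as a sketch only. `PMPNNDatasetSketch` is a hypothetical stand-in rather than the PR's class, and the chosen types (`Dict[str, Any]` for `params`, `Optional[Callable]` for the transform hooks) are assumptions based on how the arguments are used in this thread.

```python
from typing import Any, Callable, Dict, Optional


class PMPNNDatasetSketch:
    """Stand-in that shows only the annotated signature, not the real dataset."""

    def __init__(
        self,
        root: str,
        set_type: str,  # 'train', 'val', or 'test'
        params: Dict[str, Any],
        transform: Optional[Callable] = None,
        pre_transform: Optional[Callable] = None,
        pre_filter: Optional[Callable] = None,
        log: bool = True,
        force_reload: bool = False,
    ) -> None:
        assert set_type in {'train', 'val', 'test'}
        # Store the arguments; the real class would pass most of them
        # on to its base dataset class instead.
        self.root = root
        self.set_type = set_type
        self.params = params
        self.transform = transform
        self.pre_transform = pre_transform
        self.pre_filter = pre_filter
        self.log = log
        self.force_reload = force_reload
```

For example, `PMPNNDatasetSketch(root="./sample_data", set_type="val", params={"RESCUT": 3.5})` constructs the stand-in, and a static checker such as mypy would flag a call that passes a non-string `root`.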

osp.join(self.root)
fs.rm(self.raw_dir)

def build_training_clusters(self, params, debug):

Debug mode should be removed.

Comment on lines +222 to +228
item_graph = Data(
seq=seq,
xyz=torch.cat(xyz, dim=0),
idx=torch.cat(idx, dim=0),
masked=torch.Tensor(masked).int(),
#label = self.item[0]
)

I didn't see x and edge_index in the dataset. ProteinMPNN's input only includes sequence data, right?
