-
Hi Alex, this is a great idea & would be super helpful for the community. I've actually discussed this a little bit with a friend & colleague, Jan Ludwiczak. What sort of API do you have in mind for this? I just threw this together, which parses PDB metadata into a pandas DataFrame. Additional selection methods could be added.

import wget
import os
from loguru import logger as log
import gzip
import pandas as pd
from graphein.protein.utils import read_fasta
import shutil
from pathlib import Path
class PDBManager:
    def __init__(self, root_dir: str = "."):
        self.root_dir = Path(root_dir)
        self.download_pdb_sequences()
        self.df = self.parse()

    def download_pdb_sequences(self):
        # Download the PDB sequence dump if it is not already present
        if not os.path.exists(self.root_dir / "pdb_seqres.txt.gz"):
            log.info("Downloading PDB sequences")
            wget.download(
                "https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz",
                out=str(self.root_dir / "pdb_seqres.txt.gz"),
            )
            log.info("Downloaded sequences")
        # Unzip
        if not os.path.exists(self.root_dir / "pdb_seqres.txt"):
            log.info("Unzipping PDB sequences")
            with gzip.open(self.root_dir / "pdb_seqres.txt.gz", "rb") as f_in:
                with open(self.root_dir / "pdb_seqres.txt", "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)
            log.info("Unzipped sequences")

    def parse(self) -> pd.DataFrame:
        fasta = read_fasta(str(self.root_dir / "pdb_seqres.txt"))
        # Iterate over the FASTA headers and parse the metadata,
        # e.g. "101m_A mol:protein length:154  MYOGLOBIN"
        records = []
        for k, seq in fasta.items():
            params = k.split()
            chain_id = params[0]
            pdb, chain = params[0].split("_")
            molecule_type = params[1].split(":")[1]
            length = int(params[2].split(":")[1])
            name = " ".join(params[3:])
            record = {
                "id": chain_id,
                "pdb": pdb,
                "chain": chain,
                "length": length,
                "molecule_type": molecule_type,
                "name": name,
                "sequence": seq,
            }
            records.append(record)
        return pd.DataFrame.from_records(records)
    def molecule_type(self, type: str = "protein", update: bool = False):
        df = self.df.loc[self.df.molecule_type == type]
        if update:
            self.df = df
        return df

    def longer_than(self, length: int = 100, update: bool = False):
        df = self.df.loc[self.df.length > length]
        if update:
            self.df = df
        return df

    def shorter_than(self, length: int = 100, update: bool = False):
        df = self.df.loc[self.df.length < length]
        if update:
            self.df = df
        return df
    def oligomeric(self, oligomer: int = 1, update: bool = False):
        # Select PDB entries with the given number of chains
        chain_counts = self.df.pdb.value_counts()
        pdbs = chain_counts[chain_counts == oligomer].index
        df = self.df.loc[self.df.pdb.isin(pdbs)]
        if update:
            self.df = df
        return df

Which you can use like:

manager = PDBManager()
manager.molecule_type(type="protein")

to get a DataFrame of all protein chains with their PDB ID, chain, length, molecule type, name and sequence.
It would be very helpful if we could think of some additional criteria (e.g., resolution and presence of non-standard residues come to mind) and functionalities.
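For resolution, one option might be to merge an external resolution index into the sequence frame first, since pdb_seqres.txt itself doesn't carry it. Here is a minimal sketch of two such selectors that could be added to the PDBManager above - the method names, the assumed "resolution" column, and the treatment of non-standard residues are hypothetical, not an existing Graphein API:

    def resolution_better_than(self, resolution: float = 3.0, update: bool = False):
        # Assumes a "resolution" column (in Angstroms) has already been merged
        # into self.df from an external source, e.g. the wwPDB resolution index.
        df = self.df.loc[self.df.resolution.notna() & (self.df.resolution <= resolution)]
        if update:
            self.df = df
        return df

    def standard_residues_only(self, update: bool = False):
        # pdb_seqres sequences typically encode non-standard or unknown residues
        # as "X", so keep only chains built from the 20 standard one-letter codes.
        standard = set("ACDEFGHIKLMNPQRSTVWY")
        df = self.df.loc[self.df.sequence.apply(lambda s: set(s) <= standard)]
        if update:
            self.df = df
        return df

These would compose with the existing selectors in the same way, e.g. manager.molecule_type(update=True) followed by manager.resolution_better_than(2.5).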
-
Hi, Arian. I think the PDBManager idea looks great. To get started developing this idea out a bit more, would you be able to give me some guidance as to why the conda environment fails to build for me?

Best,
-
Hmm, I'll look into that. I've not maintained the conda envs for a while as they were fairly troublesome; PyTorch3D and CUDA-related libraries were the usual culprits. PyTorch3D is only conda-installable on Linux, IIRC. Everything should be installable via pip, though installing from PyPI can behave a little differently depending on your version. I think the easiest way to proceed is to create a conda env, clone the repo and install it in editable mode. The different requirements (listed in the repo) can then be installed as needed.
-
Looks like we're making great progress on the PDB front! Thanks for the feature request and help @amorehead !! I wanted to widen this discussion to look at other sequence/structure collections. For example, foldcomp makes AF2 predictions for SwissProt and ESMFold predictions for a subset of UniRef available pretty compactly. I think these could be great additions in a similar vein to what we're doing with the PDB. This would be a significant scale-up though (especially the UniRef seqs) - we'd probably need some more intelligent infra choices like Vaex for managing the indices and metadata. I actually had a little (very quick and dirty) go at doing this for SwissProt:

import gzip
import os
import shutil
from pathlib import Path
import foldcomp
import wget
from loguru import logger as log
from Bio import SwissProt
import pandas as pd
from tqdm import tqdm
class UniProtManager:
    def __init__(self, root_dir):
        self.root_dir = Path(root_dir)
        self.fold_comp_db = "afdb_swissprot_v4"
        self.metadata_url = "https://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz"
        if not os.path.exists(self.root_dir):
            os.makedirs(self.root_dir)
        self.get_metadata()
        self.download()

    def download(self):
        # Download the foldcomp database of AF2 SwissProt structures
        if not os.path.exists(self.root_dir / self.fold_comp_db):
            log.info(f"Downloading {self.fold_comp_db} to {self.root_dir}")
            os.chdir(self.root_dir)
            foldcomp.setup(self.fold_comp_db)
            log.info("Download completed.")
        else:
            log.info(f"Found {self.fold_comp_db} in {self.root_dir}")
        # Unzip the metadata file
        if not os.path.exists(self.root_dir / "uniprot_sprot.dat"):
            log.info("Unzipping UniProt Metadata...")
            with gzip.open(self.root_dir / "uniprot_sprot.dat.gz", "rb") as f_in:
                with open(self.root_dir / "uniprot_sprot.dat", "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)

    def get_metadata(self):
        if not os.path.exists(self.root_dir / "uniprot_sprot.dat.gz"):
            log.info(f"Downloading UniProt Metadata to {self.root_dir}")
            wget.download(self.metadata_url, out=str(self.root_dir))
            log.info("Download completed.")
        else:
            log.info(f"Found UniProt Metadata in {self.root_dir}")

    def parse_metadata(self) -> pd.DataFrame:
        records = []
        with open(self.root_dir / "uniprot_sprot.dat") as handle:
            for record in tqdm(SwissProt.parse(handle)):
                data = {
                    "accessions": record.accessions,
                    "annotation_update": record.annotation_update,
                    "comment": record.comments,
                    "created": record.created,
                    "cross_references": record.cross_references,
                    "data_class": record.data_class,
                    "description": record.description,
                    "entry_name": record.entry_name,
                    "gene_name": record.gene_name,
                    "host_organism": record.host_organism,
                    "host_taxonomy_id": record.host_taxonomy_id,
                    "keywords": record.keywords,
                    "molecule_type": record.molecule_type,
                    "organelle": record.organelle,
                    "organism_classification": record.organism_classification,
                    "protein_existence": record.protein_existence,
                    "seqinfo": record.seqinfo,
                    "sequence": record.sequence,
                    "sequence_length": record.sequence_length,
                    "sequence_update": record.sequence_update,
                    "taxonomy_id": record.taxonomy_id,
                }
                records.append(data)
        return pd.DataFrame(records)

Which gives a DataFrame with one row of metadata per SwissProt record.
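To sketch how the pieces might fit together downstream (the iteration pattern follows foldcomp's documented Python usage; the Vaex conversion and column selection are just illustrative assumptions):

import foldcomp
import vaex

# Decompress a handful of AF2 SwissProt structures to PDB-format strings
# (the path points at wherever foldcomp.setup placed the database).
with foldcomp.open("uniprot/afdb_swissprot_v4") as db:
    for i, (name, pdb_str) in enumerate(db):
        print(name, len(pdb_str.splitlines()), "lines")
        if i >= 2:
            break

# Hand the (potentially very large) metadata over to Vaex for
# out-of-core filtering on scalar columns, e.g. sequence length.
manager = UniProtManager(root_dir="uniprot")
meta = manager.parse_metadata()[["entry_name", "sequence_length", "sequence"]]
meta = vaex.from_pandas(meta)
long_entries = meta[meta.sequence_length > 200]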
-
Hello.
I was looking through Graphein's current support for downloading structures from the PDB and creating datasets using them, and I had the following thought. Is it possible to use Graphein to, e.g., download all experimentally-validated structures in the PDB; filter them based on specific criteria such as non-redundancy (e.g., 25% sequence identity), PDB chain length, number of PDB chains, or PDB resolution; and create a standardized cross-validation partitioning of all the PDB structures collected (e.g., 30% sequence identity and 50% structural similarity cutoffs between training, validation, and testing splits)?
Or is anyone here aware of any resources that currently offer the ability to construct ML-ready PDB datasets like this? If not, I think this could be a wonderful opportunity for Graphein to capitalize on this in the form of new functionality.
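To make the partitioning idea concrete, here is a rough sketch of what a leakage-aware split could look like, assuming chain-to-cluster assignments have already been produced by an external tool such as MMseqs2 - the "cluster" column, the ratios, and the function itself are illustrative, not existing Graphein functionality:

import numpy as np
import pandas as pd

def cluster_split(df: pd.DataFrame, frac=(0.8, 0.1, 0.1), seed: int = 42):
    # Assign whole sequence-identity clusters to train/val/test so that
    # similar chains never land in different partitions.
    rng = np.random.default_rng(seed)
    clusters = df["cluster"].unique()
    rng.shuffle(clusters)
    n_train = int(frac[0] * len(clusters))
    n_val = int(frac[1] * len(clusters))
    splits = {
        "train": clusters[:n_train],
        "val": clusters[n_train:n_train + n_val],
        "test": clusters[n_train + n_val:],
    }
    return {name: df[df["cluster"].isin(ids)] for name, ids in splits.items()}

# "cluster" would come from, e.g., MMseqs2 clustering of the chain
# sequences at a 30% identity threshold:
# partitions = cluster_split(chain_df)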
Best,
Alex