-
Hi Alex, this is a great idea & would be super helpful for the community. I've actually discussed this a little bit with a friend & colleague, Jan Ludwiczak. What sort of API do you have in mind for this? I just threw this together, which parses PDB metadata into a pandas DataFrame. Additional selection methods could be added.

import wget
import os
from loguru import logger as log
import gzip
import pandas as pd
from graphein.protein.utils import read_fasta
import shutil
from pathlib import Path
class PDBManager:
    def __init__(self, root_dir: str = "."):
        self.root_dir = Path(root_dir)
        self.download_pdb_sequences()
        self.df = self.parse()

    def download_pdb_sequences(self):
        # Download the PDB sequence dump if it is not already present
        if not os.path.exists(self.root_dir / "pdb_seqres.txt.gz"):
            log.info("Downloading PDB sequences")
            wget.download(
                "https://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz",
                out=str(self.root_dir / "pdb_seqres.txt.gz"),
            )
            log.info("Downloaded sequences")
        # Unzip
        if not os.path.exists(self.root_dir / "pdb_seqres.txt"):
            log.info("Unzipping PDB sequences")
            with gzip.open(self.root_dir / "pdb_seqres.txt.gz", "rb") as f_in:
                with open(self.root_dir / "pdb_seqres.txt", "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)
            log.info("Unzipped sequences")

    def parse(self) -> pd.DataFrame:
        fasta = read_fasta(str(self.root_dir / "pdb_seqres.txt"))
        # Iterate over the FASTA headers and parse the metadata,
        # e.g. "101m_A mol:protein length:154  MYOGLOBIN"
        records = []
        for k, seq in fasta.items():
            params = k.split()
            chain_id = params[0]
            pdb, chain = params[0].split("_")
            molecule_type = params[1].split(":")[1]
            length = int(params[2].split(":")[1])
            name = " ".join(params[3:])
            record = {
                "id": chain_id,
                "pdb": pdb,
                "chain": chain,
                "length": length,
                "molecule_type": molecule_type,
                "name": name,
                "sequence": seq,
            }
            records.append(record)
        return pd.DataFrame.from_records(records)
    def molecule_type(self, type: str = "protein", update: bool = False):
        df = self.df.loc[self.df.molecule_type == type]
        if update:
            self.df = df
        return df

    def longer_than(self, length: int = 100, update: bool = False):
        df = self.df.loc[self.df.length > length]
        if update:
            self.df = df
        return df

    def shorter_than(self, length: int = 100, update: bool = False):
        df = self.df.loc[self.df.length < length]
        if update:
            self.df = df
        return df
    def oligomeric(self, oligomer: int = 1, update: bool = False):
        # Select PDB entries with the given number of chains
        chain_counts = self.df.pdb.value_counts()
        pdbs = chain_counts[chain_counts == oligomer].index
        df = self.df.loc[self.df.pdb.isin(pdbs)]
        if update:
            self.df = df
        return df

Which you can use like:

manager = PDBManager()
manager.molecule_type(type="protein")

to get a DataFrame of all protein chains with their PDB ID, chain, length, molecule type, name and sequence.
It would be very helpful if we could think of some additional criteria (e.g., resolution and presence of non-standard residues come to mind) and functionalities.
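For resolution, one option might be to merge an external resolution index into the sequence frame first, since pdb_seqres.txt itself doesn't carry it. Here is a minimal sketch of two such selectors that could be added to the PDBManager above - the method names, the assumed "resolution" column, and the treatment of non-standard residues are hypothetical, not an existing Graphein API:

    def resolution_better_than(self, resolution: float = 3.0, update: bool = False):
        # Assumes a "resolution" column (in Angstroms) has already been merged
        # into self.df from an external source, e.g. the wwPDB resolution index.
        df = self.df.loc[self.df.resolution.notna() & (self.df.resolution <= resolution)]
        if update:
            self.df = df
        return df

    def standard_residues_only(self, update: bool = False):
        # pdb_seqres sequences typically encode non-standard or unknown residues
        # as "X", so keep only chains built from the 20 standard one-letter codes.
        standard = set("ACDEFGHIKLMNPQRSTVWY")
        df = self.df.loc[self.df.sequence.apply(lambda s: set(s) <= standard)]
        if update:
            self.df = df
        return df

These would compose with the existing selectors in the same way, e.g. manager.molecule_type(update=True) followed by manager.resolution_better_than(2.5).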
-
Hi, Arian. I think the PDBManager idea looks great. To get started developing this idea out a bit more, would you be able to give me some guidance as to why the conda environment fails to build for me?

Best,
-
Hmm, I'll look into that. I've not maintained the conda envs for a while as they were fairly troublesome; PyTorch3D and CUDA-related libraries were the usual culprits. PyTorch3D is only conda-installable on Linux, IIRC. Everything should be installable via pip, though installing from PyPI can behave a little differently depending on your version. I think the easiest way to proceed is to create a conda env, clone the repo and install it in editable mode. The different requirements (listed in the repo) can then be installed as needed.
-
Looks like we're making great progress on the PDB front! Thanks for the feature request and help @amorehead !! I wanted to widen this discussion to look at other sequence/structure collections. For example, foldcomp makes AF2 predictions for SwissProt and ESMFold predictions for a subset of UniRef available pretty compactly. I think these could be great additions in a similar vein to what we're doing with the PDB. This would be a significant scale-up though (especially the UniRef seqs) - we'd probably need some more intelligent infra choices like Vaex for managing the indices and metadata. I actually had a little (very quick and dirty) go at doing this for SwissProt:

import gzip
import os
import shutil
from pathlib import Path
import foldcomp
import wget
from loguru import logger as log
from Bio import SwissProt
import pandas as pd
from tqdm import tqdm
class UniProtManager:
    def __init__(self, root_dir):
        self.root_dir = Path(root_dir)
        self.fold_comp_db = "afdb_swissprot_v4"
        self.metadata_url = "https://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz"
        if not os.path.exists(self.root_dir):
            os.makedirs(self.root_dir)
        self.get_metadata()
        self.download()

    def download(self):
        # Download the foldcomp database of AF2 SwissProt structures
        if not os.path.exists(self.root_dir / self.fold_comp_db):
            log.info(f"Downloading {self.fold_comp_db} to {self.root_dir}")
            os.chdir(self.root_dir)
            foldcomp.setup(self.fold_comp_db)
            log.info("Download completed.")
        else:
            log.info(f"Found {self.fold_comp_db} in {self.root_dir}")
        # Unzip the metadata file
        if not os.path.exists(self.root_dir / "uniprot_sprot.dat"):
            log.info("Unzipping UniProt Metadata...")
            with gzip.open(self.root_dir / "uniprot_sprot.dat.gz", "rb") as f_in:
                with open(self.root_dir / "uniprot_sprot.dat", "wb") as f_out:
                    shutil.copyfileobj(f_in, f_out)

    def get_metadata(self):
        if not os.path.exists(self.root_dir / "uniprot_sprot.dat.gz"):
            log.info(f"Downloading UniProt Metadata to {self.root_dir}")
            wget.download(self.metadata_url, out=str(self.root_dir))
            log.info("Download completed.")
        else:
            log.info(f"Found UniProt Metadata in {self.root_dir}")

    def parse_metadata(self) -> pd.DataFrame:
        records = []
        with open(self.root_dir / "uniprot_sprot.dat") as handle:
            for record in tqdm(SwissProt.parse(handle)):
                data = {
                    "accessions": record.accessions,
                    "annotation_update": record.annotation_update,
                    "comment": record.comments,
                    "created": record.created,
                    "cross_references": record.cross_references,
                    "data_class": record.data_class,
                    "description": record.description,
                    "entry_name": record.entry_name,
                    "gene_name": record.gene_name,
                    "host_organism": record.host_organism,
                    "host_taxonomy_id": record.host_taxonomy_id,
                    "keywords": record.keywords,
                    "molecule_type": record.molecule_type,
                    "organelle": record.organelle,
                    "organism_classification": record.organism_classification,
                    "protein_existence": record.protein_existence,
                    "seqinfo": record.seqinfo,
                    "sequence": record.sequence,
                    "sequence_length": record.sequence_length,
                    "sequence_update": record.sequence_update,
                    "taxonomy_id": record.taxonomy_id,
                }
                records.append(data)
        return pd.DataFrame(records)

Which gives a DataFrame with one row of metadata per SwissProt record.
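To sketch how the pieces might fit together downstream (the iteration pattern follows foldcomp's documented Python usage; the Vaex conversion and column selection are just illustrative assumptions):

import foldcomp
import vaex

# Decompress a handful of AF2 SwissProt structures to PDB-format strings
# (the path points at wherever foldcomp.setup placed the database).
with foldcomp.open("uniprot/afdb_swissprot_v4") as db:
    for i, (name, pdb_str) in enumerate(db):
        print(name, len(pdb_str.splitlines()), "lines")
        if i >= 2:
            break

# Hand the (potentially very large) metadata over to Vaex for
# out-of-core filtering on scalar columns, e.g. sequence length.
manager = UniProtManager(root_dir="uniprot")
meta = manager.parse_metadata()[["entry_name", "sequence_length", "sequence"]]
meta = vaex.from_pandas(meta)
long_entries = meta[meta.sequence_length > 200]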
-
Hello.
I was looking through Graphein's current support for downloading structures from the PDB and creating datasets using them, and I had the following thought. Is it possible to use Graphein to, e.g., download all experimentally-validated structures in the PDB; filter them based on specific criteria such as non-redundancy (e.g., 25% sequence identity), PDB chain length, number of PDB chains, or PDB resolution; and create a standardized cross-validation partitioning of all the PDB structures collected (e.g., 30% sequence identity and 50% structural similarity cutoffs between training, validation, and testing splits)?
Or is anyone here aware of any resources that currently offer the ability to construct ML-ready PDB datasets like this? If not, I think this could be a wonderful opportunity for Graphein to capitalize on this in the form of new functionality.
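To make the partitioning idea concrete, here is a rough sketch of what a leakage-aware split could look like, assuming chain-to-cluster assignments have already been produced by an external tool such as MMseqs2 - the "cluster" column, the ratios, and the function itself are illustrative, not existing Graphein functionality:

import numpy as np
import pandas as pd

def cluster_split(df: pd.DataFrame, frac=(0.8, 0.1, 0.1), seed: int = 42):
    # Assign whole sequence-identity clusters to train/val/test so that
    # similar chains never land in different partitions.
    rng = np.random.default_rng(seed)
    clusters = df["cluster"].unique()
    rng.shuffle(clusters)
    n_train = int(frac[0] * len(clusters))
    n_val = int(frac[1] * len(clusters))
    splits = {
        "train": clusters[:n_train],
        "val": clusters[n_train:n_train + n_val],
        "test": clusters[n_train + n_val:],
    }
    return {name: df[df["cluster"].isin(ids)] for name, ids in splits.items()}

# "cluster" would come from, e.g., MMseqs2 clustering of the chain
# sequences at a 30% identity threshold:
# partitions = cluster_split(chain_df)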
Best,
Alex