DEBUG: step1_pdb_process.py #7
I have been playing with CAMP this weekend. The first task I would like you to work on is running the simple script we reviewed on Friday across the large PDB dataset. The other downstream functions we will modify and test depend on the output of the step1 script. The PDB file contains more than 1 million lines of data and will likely need to be run on the supercomputer. Please create a SLURM script to run "step1_pdb_parse1.py" that consumes pdb_seqres.txt (the data file). I have also included the small dataset (pdb_seqres_small.txt). The expected output will be:
I have emailed you a link to the data. Please let me know if you have trouble gaining access to UFRC; I was able to log in this morning. I am not familiar with running Python code in general and have not run Python code on UFRC. Please reach out to the folks at UFRC if you need help. There are many training opportunities and plenty of online documentation.
I was able to run the SLURM script that is located in:
Please confirm this works across the group.
From Dr. Lemas: A couple of notes to help the group reproduce my work. I have split the code into smaller chunks that are easier to handle. The chunks are functions that are clearly described.
Dr. Lemas,
Please work to debug the step1_pdb_process.py script.
Script: https://github.com/lemaslab/CAMP/blob/master/data_prepare/step1_pdb_process.py
Input Data (RCSB PDB): Download the FASTA file from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt.gz and the PDB files
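For anyone debugging against the input file, here is a minimal sketch of reading pdb_seqres.txt in a streaming fashion, so that the full file does not have to be held in memory. It is not the code in step1_pdb_process.py. It assumes each record is a header line of the form `>101m_A mol:protein length:154  MYOGLOBIN` followed by a single sequence line. The names `SeqresRecord`, `parse_seqres`, `open_seqres`, `is_peptide`, and the cutoff `MAX_PEPTIDE_LEN = 50` are hypothetical and chosen only for illustration.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class SeqresRecord(NamedTuple):
    pdb_id: str
    chain: str
    mol_type: str
    length: int
    sequence: str


# Hypothetical cutoff for calling a chain a "peptide"; not taken from CAMP.
MAX_PEPTIDE_LEN = 50


def open_seqres(path: str) -> TextIO:
    """Open pdb_seqres.txt or pdb_seqres.txt.gz as a text stream."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def _make_record(header: str, sequence: str) -> SeqresRecord:
    # Assumed header layout: >{pdb_id}_{chain} mol:{type} length:{n}  {name}
    fields = header[1:].split()
    pdb_id, _, chain = fields[0].partition("_")
    mol_type = fields[1].split(":", 1)[1]
    length = int(fields[2].split(":", 1)[1])
    return SeqresRecord(pdb_id, chain, mol_type, length, sequence)


def parse_seqres(handle: TextIO) -> Iterator[SeqresRecord]:
    """Yield one record per header/sequence pair, one line at a time."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line
        elif header is not None and line:
            yield _make_record(header, line)
            header = None


def is_peptide(record: SeqresRecord) -> bool:
    """Protein chains at or below the (hypothetical) length cutoff."""
    return record.mol_type == "protein" and record.length <= MAX_PEPTIDE_LEN


# Tiny in-memory sample in the assumed format (sequences are made up).
SAMPLE = (
    ">101m_A mol:protein length:154  MYOGLOBIN\n"
    "MVLSEGEWQL\n"
    ">1abc_B mol:protein length:5  PEPTIDE\n"
    "ACDEF\n"
    ">1xyz_C mol:na length:4  DNA\n"
    "ACGT\n"
)

records = list(parse_seqres(io.StringIO(SAMPLE)))
peptides = [r for r in records if is_peptide(r)]
```

To run it on the real data, replace the in-memory sample with `open_seqres("pdb_seqres_small.txt")` first, then switch to the full file once the small one parses cleanly.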
Programs: PLIP
For each peptide-protein pair, the peptide sequence was directly obtained from the RCSB PDB with binding residues marked by PepBDB, and the protein sequence was obtained by mapping to UniProt [12]. We first downloaded all complexes containing peptides as ligands from the RCSB PDB released by September 2019. Then we used the Protein Ligand Interaction Predictor (PLIP) program [10] (http://github.com/ssalentin/plip) to extract the interacting chains of peptide and protein sequences from the complex structures. Given a complex structure, PLIP recognizes seven types of non-covalent interactions: hydrogen bonds, hydrophobic interactions, pi-stacking, pi-cation interactions, salt bridges, water bridges and halogen bonds. A residue from the peptide and a residue from the protein with at least one non-covalent interaction between them were considered an interacting pair. We then retrieved the corresponding interaction labels from PepBDB [11], a structure database of peptide-protein complexes derived from the RCSB Protein Data Bank (PDB) [3–5], which contains the peptide residues involved in hydrogen bonds and hydrophobic contacts with the partner proteins. The peptide binding residues detected by PepBDB were then mapped to the peptide sequences (which were annotated from the RCSB PDB) using an alignment tool based on the Smith-Waterman algorithm [21] (https://github.com/mengyao/Complete-Striped-SmithWaterman-Library). To ensure high data quality, we only kept those peptide sequences with at least 80% matched residues. In total, we collected 7,233 peptide-protein pairs with 3,318 distinct protein sequences and 5,283 distinct peptide sequences, and 90.99% of the pairs had labels of peptide binding residues.
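The 80% matched-residue filter described above can be sketched as follows. This is only an illustration: it uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for the Smith-Waterman aligner the paper cites, so the matches it finds may differ from a true local alignment. The function names, the example sequences, and the choice of the RCSB peptide length as the denominator are all assumptions, since the text does not say how the percentage is computed.

```python
from difflib import SequenceMatcher

# Threshold taken from the description above ("at least 80% matched residues").
MIN_MATCHED_FRACTION = 0.8


def matched_fraction(pdb_peptide: str, pepbdb_peptide: str) -> float:
    """Fraction of residues in pdb_peptide that are matched in pepbdb_peptide.

    Stand-in for a Smith-Waterman alignment: SequenceMatcher finds
    in-order matching blocks between the two sequences. Dividing by the
    length of pdb_peptide is an assumption.
    """
    if not pdb_peptide:
        return 0.0
    matcher = SequenceMatcher(None, pdb_peptide, pepbdb_peptide, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(pdb_peptide)


def keep_peptide(pdb_peptide: str, pepbdb_peptide: str) -> bool:
    """Apply the 80% matched-residue filter to one peptide."""
    return matched_fraction(pdb_peptide, pepbdb_peptide) >= MIN_MATCHED_FRACTION


# Made-up sequences: 8 of 10 residues match, so this peptide is kept.
kept = keep_peptide("ACDEFGHIKL", "ACDEFGHI")
# Only 5 of 10 residues match, so this peptide is dropped.
dropped = not keep_peptide("ACDEFGHIKL", "ACDEF")
```

When reproducing the published counts, the stand-in should be swapped for the actual Smith-Waterman library, because the number of matched residues depends on the aligner and its scoring parameters.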