Add PDB fixer option in ProteinChain #26

Open
wants to merge 3 commits into
base: main

Goosang-Yu

Fetching PDB files using ProteinChain.from_rcsb is a very convenient feature. However, many PDB entries, especially large and complex structures, have missing (unresolved) residues.

For instance, I fetched one of the Cas9 structures, "8G1I", using ProteinChain.from_rcsb.

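from esm.utils.structure.protein_chain import ProteinChain

pdb_id = "8G1I" # PDB ID of a Cas9 structure (the variable name below is left over from the Renal Dipeptidase tutorial)
chain_id = "A" # Chain ID of the protein chain in the PDB structure
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id)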

The sequence returned for 8G1I is shorter than the actual amino acid sequence, because the missing residues are omitted.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: KYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEIASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLNAKLIRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDGKATAKYFFYSNIMNFFKTEIKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQ
Protein length: 1298

The atom array likewise contains no atom data for the missing residues.

renal_dipep_chain.atom_array.get_atom(1)
# Residues with res_id 1-3 are missing, so the first residue present is res_id 4
Atom(np.array([127.615, 117.513, 179.528], dtype=float32), chain_id="A", res_id=4, ins_code="", res_name="LYS", hetero=False, atom_name="CA", element="C")
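
A gap like this can also be spotted without inspecting atoms one by one. The following is a minimal sketch, not part of this PR: `find_residue_gaps` is a hypothetical helper, and it assumes the residue numbers are available as a plain sequence of integers (for example from the `res_id` annotation of the atom array) and that numbering starts at 1. It cannot detect residues missing after the last observed residue.

```python
def find_residue_gaps(res_ids, first_expected=1):
    """Return inclusive (start, end) ranges of residue IDs that are absent.

    res_ids: iterable of integer residue IDs, one per atom or per residue
             (duplicates are fine).
    first_expected: the residue ID the chain is assumed to start at.
    """
    present = sorted({int(r) for r in res_ids})
    gaps = []
    expected = first_expected
    for rid in present:
        if rid > expected:
            # Everything from `expected` up to the residue before `rid` is missing.
            gaps.append((expected, rid - 1))
        expected = max(expected, rid + 1)
    return gaps


# Toy input: residues 1-3 and 7-9 are absent.
toy_res_ids = [4, 4, 4, 5, 5, 6, 10, 11]
print(find_residue_gaps(toy_res_ids))  # [(1, 3), (7, 9)]
```

With the chain above, the input would presumably be `renal_dipep_chain.atom_array.res_id`.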

This can be resolved by using pdbfixer, which can identify the missing residues.

import pdbfixer

fixer = pdbfixer.PDBFixer(pdbid="8G1I")

# PDBFixer operations
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.findMissingResidues()
fixer.findMissingAtoms()

print("Missing Residues:", fixer.missingResidues)
Missing Residues: {(1, 0): ['MET', 'ASP', 'LYS'], (1, 577): ['SER', 'GLY', 'VAL', 'GLU', 'ASP', 'ARG', 'PHE', 'ASN'], (1, 701): ['VAL', 'SER', 'GLY', 'GLN', 'GLY', 'ASP'], (1, 748): ['GLU', 'ASN', 'GLN', 'THR', 'THR', 'GLN', 'LYS', 'GLY', 'GLN', 'LYS'], (1, 859): ['LEU'], (1, 864): ['THR', 'GLN'], (1, 982): ['TYR', 'LYS', 'VAL', 'TYR', 'ASP', 'VAL', 'ARG', 'LYS', 'MET', 'ILE', 'ALA', 'LYS', 'SER', 'GLU', 'GLN', 'GLU', 'ILE'], (1, 1003): ['THR', 'LEU', 'ALA', 'ASN', 'GLY', 'GLU', 'ILE', 'ARG'], (1, 1186): ['TYR', 'GLU', 'LYS', 'LEU', 'LYS', 'GLY', 'SER', 'PRO', 'GLU', 'ASP', 'ASN'], (1, 1298): ['LEU', 'GLY', 'GLY', 'ASP', 'PRO', 'LYS', 'LYS', 'LYS', 'ARG', 'LYS', 'VAL', 'MET', 'ASP', 'LYS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS'], (3, 0): ['DC', 'DC', 'DA', 'DG', 'DT', 'DG', 'DC', 'DG', 'DT', 'DA', 'DT', 'DA', 'DC', 'DC', 'DA', 'DG', 'DC', 'DA', 'DA', 'DA', 'DA', 'DC', 'DA', 'DC', 'DT', 'DC', 'DC']}
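
Each key in this dictionary is a (chain index, residue index within that chain) pair marking where the listed residues should be inserted. As a quick consistency check, the sketch below tallies the entries for chain index 1, which is assumed here to be chain "A" because its first gap (MET, ASP, LYS) matches the start of the protein. The dictionary is copied from the output above, and `count_missing` is a hypothetical helper, not part of pdbfixer.

```python
# Entries copied from the fixer.missingResidues output above.
missing_residues = {
    (1, 0): "MET ASP LYS".split(),
    (1, 577): "SER GLY VAL GLU ASP ARG PHE ASN".split(),
    (1, 701): "VAL SER GLY GLN GLY ASP".split(),
    (1, 748): "GLU ASN GLN THR THR GLN LYS GLY GLN LYS".split(),
    (1, 859): "LEU".split(),
    (1, 864): "THR GLN".split(),
    (1, 982): "TYR LYS VAL TYR ASP VAL ARG LYS MET ILE ALA LYS SER GLU GLN GLU ILE".split(),
    (1, 1003): "THR LEU ALA ASN GLY GLU ILE ARG".split(),
    (1, 1186): "TYR GLU LYS LEU LYS GLY SER PRO GLU ASP ASN".split(),
    (1, 1298): "LEU GLY GLY ASP PRO LYS LYS LYS ARG LYS VAL MET ASP LYS HIS HIS HIS HIS HIS HIS".split(),
    (3, 0): "DC DC DA DG DT DG DC DG DT DA DT DA DC DC DA DG DC DA DA DA DA DC DA DC DT DC DC".split(),
}


def count_missing(missing, chain_index):
    """Total number of missing residues reported for one chain index."""
    return sum(
        len(names)
        for (chain, _position), names in missing.items()
        if chain == chain_index
    )


observed_length = 1298  # length of the sequence fetched without the fixer
n_missing = count_missing(missing_residues, chain_index=1)
print(n_missing)                    # 86
print(observed_length + n_missing)  # 1384
```

The 86 missing residues plus the 1298 observed ones give 1384.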

I have added this functionality to ProteinChain.from_rcsb as an option, fix_pdb. It can be used as follows:

pdb_id = "8G1I" # PDB ID of a Cas9 structure
chain_id = "A" # Chain ID of the protein chain in the PDB structure
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id, fix_pdb=True)

The resulting protein chain includes the previously missing residues in its sequence and has the full length.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGDPKKKRKVMDKHHHHHH
Protein length: 1384

Atom information can also be retrieved with the missing residues filled in by the fixer.

renal_dipep_chain.atom_array.get_atom(1)
Atom(np.array([125.775, 112.234, 187.093], dtype=float32), chain_id="A", res_id=1, ins_code="", res_name="MET", hetero=False, atom_name="CA", element="C")
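
For reference, finding missing residues and atoms with pdbfixer does not by itself add coordinates; the atoms are built by a separate `addMissingAtoms()` call. The sketch below shows one way the full repair could be run standalone and written to a PDB file. It is an illustration only and not necessarily how the `fix_pdb` option in this PR is implemented; `fix_structure` and the output path are hypothetical names, and the function needs pdbfixer, OpenMM, and network access when called.

```python
def fix_structure(pdb_id, out_path):
    """Download a PDB entry, model missing residues/atoms, and save the result.

    Hypothetical helper for illustration; imports are deferred so that the
    function can be defined even where pdbfixer is not installed.
    """
    from pdbfixer import PDBFixer
    from openmm.app import PDBFile

    fixer = PDBFixer(pdbid=pdb_id)

    # Swap nonstandard residues for their standard equivalents.
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()

    # Identify residues and atoms absent from the deposited coordinates.
    fixer.findMissingResidues()
    fixer.findMissingAtoms()

    # Build coordinates for everything identified above.
    fixer.addMissingAtoms()

    with open(out_path, "w") as handle:
        PDBFile.writeFile(fixer.topology, fixer.positions, handle)
    return out_path


# Example call (commented out because it downloads from RCSB):
# fix_structure("8G1I", "8G1I_fixed.pdb")
```

Coordinates for residues added this way are modeled rather than experimentally observed, so they should be treated with care in downstream analysis.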
