Python package to search/retrieve/filter proteins and protein structures.
It uses
- UniProt SPARQL endpoint to search for proteins and their measured or predicted 3D structures.
- UniProt taxonomy to search for taxonomy.
- QuickGO to search for Gene Ontology terms.
- gemmi to work with macromolecular models.
- dask-distributed to compute in parallel.
An example workflow:
graph TB;
taxonomy[/Search taxon/] -. taxon_ids .-> searchuniprot[/Search UniprotKB/]
goterm[/Search GO term/] -. go_ids .-> searchuniprot[/Search UniprotKB/]
searchuniprot --> |uniprot_accessions|searchpdbe[/Search PDBe/]
searchuniprot --> |uniprot_accessions|searchaf[/Search Alphafold/]
searchuniprot -. uniprot_accessions .-> searchemdb[/Search EMDB/]
searchpdbe -->|pdb_ids|fetchpdbe[Retrieve PDBe]
searchaf --> |uniprot_accessions|fetchad(Retrieve AlphaFold)
searchemdb -. emdb_ids .->fetchemdb[Retrieve EMDB]
fetchpdbe -->|mmcif_files_with_uniprot_acc| chainfilter{Filter on chain of uniprot}
chainfilter --> |mmcif_files| residuefilter{Filter on chain length}
fetchad -->|pdb_files| confidencefilter{Filter out low confidence}
confidencefilter --> |mmcif_files| ssfilter{Filter on secondary structure}
residuefilter --> |mmcif_files| ssfilter
classDef dashedBorder stroke-dasharray: 5 5;
goterm:::dashedBorder
taxonomy:::dashedBorder
searchemdb:::dashedBorder
fetchemdb:::dashedBorder
(Dotted nodes and edges are side-quests.)
pip install protein-quest
Or to use the latest development version:
pip install git+https://github.com/haddocking/protein-quest.git
The main entry point is the protein-quest command line tool, which has multiple subcommands to perform actions.
To use it programmatically, see the Jupyter notebooks and API documentation.
protein-quest search uniprot \
--taxon-id 9606 \
--reviewed \
--subcellular-location-uniprot nucleus \
--subcellular-location-go GO:0005634 \
--molecular-function-go GO:0003677 \
--limit 100 \
uniprot_accs.txt
(GO:0005634 is "Nucleus" and GO:0003677 is "DNA binding")
protein-quest search pdbe uniprot_accs.txt pdbe.csv
A pdbe.csv file is written containing the PDB id and chain of each UniProt accession.
protein-quest search alphafold uniprot_accs.txt alphafold.csv
protein-quest search emdb uniprot_accs.txt emdbs.csv
protein-quest retrieve pdbe pdbe.csv downloads-pdbe/
protein-quest retrieve alphafold alphafold.csv downloads-af/
For each entry, the summary.json and cif file are downloaded.
protein-quest retrieve emdb emdbs.csv downloads-emdb/
Filter AlphaFoldDB structures based on confidence (pLDDT). Keeps entries whose number of residues with a confidence score above the threshold lies between the requested minimum and maximum. Also writes pdb files containing only those residues.
protein-quest filter confidence \
--confidence-threshold 50 \
--min-residues 100 \
--max-residues 1000 \
./downloads-af ./filtered
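The selection rule described above can be sketched in plain Python. This is an illustration only, not the package's implementation: the names ConfidenceRule, confident_residues and keep_entry are hypothetical, the per-residue pLDDT values are assumed to be already available as a dictionary, and whether the real comparison against the threshold is strict or inclusive is an assumption (strict is used here).

```python
from dataclasses import dataclass


@dataclass
class ConfidenceRule:
    """Hypothetical container mirroring the three CLI options above."""

    confidence_threshold: float = 50.0
    min_residues: int = 100
    max_residues: int = 1000


def confident_residues(plddt_by_residue: dict[int, float], threshold: float) -> list[int]:
    """Return the residue numbers whose pLDDT is strictly above the threshold."""
    return [number for number, plddt in plddt_by_residue.items() if plddt > threshold]


def keep_entry(plddt_by_residue: dict[int, float], rule: ConfidenceRule) -> bool:
    """Keep an entry when its count of confident residues lies within the requested range."""
    kept = confident_residues(plddt_by_residue, rule.confidence_threshold)
    return rule.min_residues <= len(kept) <= rule.max_residues
```

With the defaults above, an entry with 150 residues above the threshold and 50 below it would be kept, and only the 150 confident residues would be candidates for the written output file.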
Make PDBe files smaller by keeping only the first chain of the found UniProt entry and renaming it to chain A.
protein-quest filter chain \
pdbe.csv \
./downloads-pdbe ./filtered-chains
protein-quest filter residue \
--min-residues 100 \
--max-residues 1000 \
./filtered-chains ./filtered
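The PDBe branch of the workflow (search, retrieve, filter chain, filter residue) can also be scripted. The sketch below only assembles the command lines already shown in this README and, unless dry_run is turned off, prints them instead of running them. The helper names pdbe_branch_commands and run_all are hypothetical and not part of protein-quest.

```python
import shlex
import subprocess


def pdbe_branch_commands(accessions: str = "uniprot_accs.txt", table: str = "pdbe.csv") -> list[list[str]]:
    """Build the four PDBe commands from this README as argument lists."""
    exe = "protein-quest"
    return [
        [exe, "search", "pdbe", accessions, table],
        [exe, "retrieve", "pdbe", table, "downloads-pdbe/"],
        [exe, "filter", "chain", table, "./downloads-pdbe", "./filtered-chains"],
        [
            exe, "filter", "residue",
            "--min-residues", "100",
            "--max-residues", "1000",
            "./filtered-chains", "./filtered",
        ],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command, and execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
```

Calling run_all(pdbe_branch_commands()) prints the four commands in order; passing dry_run=False executes them and requires protein-quest to be installed and on the PATH.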
To keep only structures that are mostly alpha helices and have no beta sheets:
protein-quest filter secondary-structure \
--ratio-min-helix-residues 0.5 \
--ratio-max-sheet-residues 0.0 \
--write-stats filtered-ss/stats.csv \
./filtered-chains ./filtered-ss
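How the two ratio options combine can be illustrated with a small sketch. It assumes, hypothetically, that the secondary structure of a chain is available as a string with one character per residue ("H" for helix, "E" for sheet, anything else for other) and that both bounds are inclusive; the function names are illustrative and the actual tool may compute and compare these values differently.

```python
def secondary_structure_ratios(labels: str) -> tuple[float, float]:
    """Return (helix ratio, sheet ratio) for a per-residue label string."""
    total = len(labels)
    if total == 0:
        return 0.0, 0.0
    return labels.count("H") / total, labels.count("E") / total


def passes_filter(
    labels: str,
    ratio_min_helix_residues: float = 0.5,
    ratio_max_sheet_residues: float = 0.0,
) -> bool:
    """Pass when enough residues are helical and no more than the allowed fraction is sheet."""
    helix_ratio, sheet_ratio = secondary_structure_ratios(labels)
    return helix_ratio >= ratio_min_helix_residues and sheet_ratio <= ratio_max_sheet_residues
```

With the option values from the command above, a chain that is 60% helix and has no sheet residues passes, while a chain with even a single sheet residue is rejected because the maximum sheet ratio is 0.0.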
protein-quest search taxonomy "Homo sapiens" -
You might not know the identifier of a Gene Ontology (GO) term to pass to protein-quest search uniprot.
You can use the following command to search for a GO term.
protein-quest search go --limit 5 --aspect cellular_component apoptosome -
protein-quest can also help LLMs such as Claude Sonnet 4 by providing a set of tools for working with protein structures.
To run the MCP server you have to install the mcp extra with:
pip install protein-quest[mcp]
The server can be started with:
protein-quest mcp
The MCP server contains a prompt template to search/retrieve/filter candidate structures.
For development information and contribution guidelines, please see CONTRIBUTING.md.