Home
Exploring the power of Pandas to work with protein structures, sequences and genetic variants.
The main idea of this library is to provide a set of common methods and tools for working with protein structures, sequences and genetic variants in a neat and clean Pythonic way. Pandas provides high-performing data structures and data analysis tools that are well suited to the data handling and analysis ProteoFAV requires.
Although several Python libraries (notably Biopython) partially provide functionality similar to what we want to achieve with this project, we consider that the advent of Pandas and the modularity achieved by working with its data structures make this project relevant and useful to our peers. Also, since the codebase is relatively short and self-contained, one can opt to use only certain features without the burden of installing big packages just to use a small set of tools.
To aggregate data coming from multiple sources (e.g. PDB, UniProt, Ensembl) and in different file formats (as varied as mmCIF, GFF, JSON, etc.), ProteoFAV implements a number of simple parsers and fetchers that return the data in accessible Pandas DataFrames. These can already be very useful on their own for those who only want to parse/fetch a particular dataset and work from there. On top of the core parsers/fetchers, ProteoFAV provides the main glue, merge_tables, which allows for merging the different components together.
Fetchers in ProteoFAV rely on Requests and have two main uses: 1) to fetch raw files (e.g. a protein structure in mmCIF format); 2) to query data provider APIs (e.g. query genetic variants for a particular protein in JSON format).
1) Fetching files (stored locally according to config values):
- PDB cif files from PDBe
- SIFTS xml files from PDBe
- DSSP dssp files from CMBI
- UniProt gff files from UniProt
2) Fetching content:
- UniProt to Ensembl id mappings from Ensembl API
- Ensembl ID to Sequence from Ensembl API
- Ensembl ID to Genetic variants from Ensembl
- Ensembl ID to COSMIC mutations from Ensembl
- UniProt ID to Best structure from PDBe API
- and others...
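The two fetcher uses listed above can be sketched roughly as follows. This is a minimal illustration, not ProteoFAV's actual code: the function names (`cached_path`, `fetch_file`, `fetch_json`), the caching rule and the `session` parameter are all assumptions made for the example. Only the `requests.get`, `raise_for_status`, `content` and `json` calls come from Requests itself.

```python
from pathlib import Path


def cached_path(identifier, extension, cache_dir):
    """Return the local path where a fetched file would be stored (hypothetical layout)."""
    return Path(cache_dir) / f"{identifier}.{extension}"


def fetch_file(url, destination, session=None, overwrite=False):
    """Download `url` to `destination` unless a local copy already exists."""
    destination = Path(destination)
    if destination.exists() and not overwrite:
        return destination  # reuse the copy already stored locally
    if session is None:
        import requests  # imported lazily so the helper can be exercised offline
        session = requests
    response = session.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(response.content)
    return destination


def fetch_json(url, session=None, params=None):
    """Query an API endpoint and return the decoded JSON payload."""
    if session is None:
        import requests  # imported lazily so the helper can be exercised offline
        session = requests
    response = session.get(url, params=params,
                           headers={"Accept": "application/json"}, timeout=30)
    response.raise_for_status()
    return response.json()
```

Passing a `requests.Session()` as `session` would reuse connections across many calls. The URLs themselves are deliberately left out here, since in ProteoFAV they come from the configuration file.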
Raw data is loaded into Pandas DataFrames via specific parsers implemented in ProteoFAV. We try to be comprehensive in the way we parse, but not all the contents of the parsed files will necessarily be cached.
Generic parsers:
- cif ATOM and DESCRIPTION lines
- DSSP RESIDUE lines
- SIFTS XML RESIDUE and REGION lines
- GFF ALL lines
- and others...
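As an illustration of what such a parser does, the sketch below loads GFF feature lines into a DataFrame. It is not ProteoFAV's own parser: the function name `parse_gff` and the sample lines (including the accession and attribute values) are made up for the example. The nine column names follow the GFF3 column order.

```python
import csv
import io

import pandas as pd

# Column order defined by the GFF3 format.
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff(handle):
    """Load the feature lines of a GFF file-like object into a DataFrame."""
    # Drop blank lines and comment/directive lines (those starting with '#').
    lines = [line for line in handle if line.strip() and not line.startswith("#")]
    return pd.read_csv(
        io.StringIO("".join(lines)),
        sep="\t",
        header=None,
        names=GFF_COLUMNS,
        na_values=".",           # GFF uses '.' for undefined fields
        quoting=csv.QUOTE_NONE,  # keep quotes inside attributes untouched
    )


# Made-up sample lines, for illustration only.
SAMPLE_GFF = (
    "##gff-version 3\n"
    "P12345\tUniProtKB\tChain\t1\t430\t.\t.\t.\tID=EXAMPLE_1\n"
    "P12345\tUniProtKB\tDomain\t36\t114\t.\t.\t.\tNote=Example domain\n"
)

features = parse_gff(io.StringIO(SAMPLE_GFF))
domains = features[features["type"] == "Domain"]
```

Once the lines are in a DataFrame, selecting a feature type or a residue range is an ordinary boolean filter, as in the last line.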
Merging is possible in the context of Pandas DataFrames provided that these have a set of indexes or entities in common. We exploit this feature to merge tables and work in a modular fashion. This enables aggregating sparse data into the same DataFrame via joins/merges, which allows one to perform insightful analyses of the data that would otherwise take much more effort to achieve with traditional Python scripting.
One main example usage that illustrates the utility of this approach is to start with a protein structure of interest and map genetic variants onto the structure (via the structure-to-sequence mapping). Starting with variants and going back to the structure, or starting anywhere in the middle, is in principle just as easily achievable.
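The structure-to-variants example can be sketched with plain Pandas merges. This does not use ProteoFAV's merge_tables, whose signature is not described here. The three small tables, their column names and their values are invented for illustration and only stand in for what the cif, SIFTS and variant fetchers/parsers might return.

```python
import pandas as pd

# Structure residues, as a cif parser might return them (made-up values).
structure = pd.DataFrame({
    "pdb_chain": ["A", "A", "A", "A"],
    "pdb_resnum": [10, 11, 12, 13],
    "pdb_resname": ["MET", "LYS", "GLY", "SER"],
})

# Structure-to-sequence mapping, as a SIFTS parser might return it.
mapping = pd.DataFrame({
    "pdb_chain": ["A", "A", "A", "A"],
    "pdb_resnum": [10, 11, 12, 13],
    "uniprot_pos": [1, 2, 3, 4],
})

# Variants keyed by sequence position, as a variant fetcher might return them.
variants = pd.DataFrame({
    "uniprot_pos": [2, 4, 7],
    "variant_id": ["var_1", "var_2", "var_3"],
    "consequence": ["missense", "synonymous", "missense"],
})

# Structure -> sequence: the shared keys are chain and residue number.
residues = structure.merge(mapping, on=["pdb_chain", "pdb_resnum"], how="left")

# Sequence -> variants: a left join keeps residues without any variant (NaN).
annotated = residues.merge(variants, on="uniprot_pos", how="left")

# Residues in the structure that carry at least one variant.
mapped = annotated.dropna(subset=["variant_id"])

# The reverse direction: start from variants and look up structure residues.
from_variants = variants.merge(mapping, on="uniprot_pos", how="left")
```

With left joins the sparse side simply shows up as missing values, so variants that fall outside the solved structure (here the one at position 7) remain visible rather than being silently dropped.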
ProteoFAV relies on a generic configuration file (config.txt), which defines the paths to local folders as well as the website/API addresses.
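The exact layout of config.txt is not shown here, so the following is only a sketch of how a file of this kind could be read with the standard library's configparser. The section names, keys and values are hypothetical and are not taken from ProteoFAV.

```python
import configparser

# Hypothetical contents; the real config.txt may use other sections and keys.
EXAMPLE_CONFIG = """
[Paths]
db_pdb = ./data/pdb
db_sifts = ./data/sifts

[Addresses]
api_example = https://example.org/api/
"""


def load_config(text):
    """Parse INI-style configuration text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(EXAMPLE_CONFIG)
pdb_dir = config.get("Paths", "db_pdb")            # folder for downloaded files
api_root = config["Addresses"]["api_example"]      # base address for API queries
```

Keeping folders and addresses in one file means a changed endpoint or a different storage location needs no change to the fetchers themselves.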