Home

ProteoFAV - protein feature aggregation and variants

Scope

Exploring the power of Pandas to work with protein structures, sequences and genetic variants.

Main goal

The main idea of this library is to provide a set of common methods and tools to allow working with protein structures, sequences and genetic variants in a neat and clean pythonic way. Pandas provides brilliant high-performing data structures and data analysis tools perfect for powering the data handling and analysis that ProteoFAV requires.

Why another python module?

Although several Python libraries (notably Biopython) partially provide functionality similar to what we want to achieve with this project, we believe that the advent of Pandas, and the modularity gained by working with its data structures, makes this library relevant and useful to our peers. Also, since the codebase is relatively short and self-contained, one can opt to use only certain features without the burden of installing big packages just to use a small set of tools.

Implementation design overview

To aggregate data coming from multiple sources (e.g. PDB, UniProt, Ensembl) and in different file formats (as varied as mmCIF, GFF, JSON, etc.), ProteoFAV implements a number of simple parsers and fetchers that return the data as accessible Pandas DataFrames. These can already be very useful on their own for those who only want to parse or fetch a particular dataset and work from there. On top of the core parsers and fetchers, ProteoFAV provides a main glue routine, merge_tables, which merges the different components together.

Fetchers

Fetchers in ProteoFAV rely on Requests and have two main uses: 1) to fetch raw files (e.g. a protein structure in mmCIF format); 2) to query data provider APIs (e.g. query the genetic variants for a particular protein in JSON format).

1) Fetching files (stored locally according to config values):

PDB cif files from PDBe
SIFTS xml files from PDBe
DSSP dssp files from CMBI
UniProt gff files from UniProt

2) Fetching content:

UniProt to Ensembl id mappings from Ensembl API
Ensembl ID to Sequence from Ensembl API
Ensembl ID to Genetic variants from Ensembl
Ensembl ID to COSMIC mutations from Ensembl
UniProt ID to Best structure from PDBe API
and others...
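
The second kind of fetcher, querying a provider API for JSON content, can be sketched as follows. This is a minimal illustration and not ProteoFAV's own code: the helper names build_sequence_url and fetch_json are hypothetical, and the Ensembl REST address is assumed here, whereas ProteoFAV reads such addresses from its configuration file.

```python
from urllib.parse import quote

# Assumed address of the public Ensembl REST service (an assumption of this
# sketch; ProteoFAV takes API addresses from its config file).
ENSEMBL_REST = "https://rest.ensembl.org"


def build_sequence_url(ensembl_id, base=ENSEMBL_REST):
    """Return the URL of the sequence endpoint for an Ensembl identifier."""
    return "{}/sequence/id/{}".format(base.rstrip("/"), quote(ensembl_id, safe=""))


def fetch_json(url, timeout=30):
    """Query a JSON endpoint with Requests and return the decoded payload."""
    import requests  # imported lazily so the URL helper works without it

    response = requests.get(
        url, headers={"Content-Type": "application/json"}, timeout=timeout
    )
    response.raise_for_status()  # raise on HTTP errors instead of parsing them
    return response.json()


# Building the URL needs no network access; fetch_json(url) would perform
# the actual request.
url = build_sequence_url("ENSP00000288602")
```

Keeping URL construction separate from the network call makes the address logic easy to check offline, while the request itself stays a thin wrapper around Requests.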

Parsers

Raw data is loaded into Pandas DataFrames via specific parsers implemented in ProteoFAV. We try to be comprehensive in the way we parse, but not all the contents of the parsed files will necessarily be cached.

Generic parsers:

cif ATOM and DESCRIPTION lines
DSSP RESIDUE lines
SIFTS XML RESIDUE and REGION lines
GFF ALL lines
and others...
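
As an illustration of what such a parser does, the sketch below loads GFF lines into a DataFrame using plain Pandas. It is not ProteoFAV's parser: the function name parse_gff is hypothetical, the nine column names follow the general GFF layout, and the two feature lines are invented example data.

```python
import io

import pandas as pd

# The nine tab-separated columns of a GFF record.
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff(handle):
    """Load GFF feature lines from a path or file-like object into a DataFrame.

    Text after '#' is treated as a comment, which skips the header
    directives; a '#' inside an attribute value would also be cut off.
    """
    return pd.read_csv(
        handle, sep="\t", comment="#", header=None, names=GFF_COLUMNS
    )


# Invented example records, for illustration only.
EXAMPLE_GFF = (
    "##gff-version 3\n"
    "P00000\tUniProtKB\tChain\t1\t120\t.\t.\t.\tID=example_chain\n"
    "P00000\tUniProtKB\tDomain\t15\t80\t.\t.\t.\tNote=example domain\n"
)

table = parse_gff(io.StringIO(EXAMPLE_GFF))

# Once in a DataFrame, features can be filtered like any other table.
domains = table[table["type"] == "Domain"]
```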

Table merging

Merging is possible in the context of Pandas DataFrames provided that these share a set of indexes or entities. We exploit this feature to merge tables and work in a modular fashion. This enables aggregating sparse data into the same DataFrame via joins/merges, which makes possible insightful analyses that would otherwise take much more effort with traditional Python scripting.

One main example that illustrates the utility of this approach is to start with a protein structure of interest and map genetic variants onto it (via structure-to-sequence mapping). Starting from the variants and going back to the structure, or starting anywhere in between, is in principle just as easy.
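
The structure-to-variant mapping described above can be sketched with plain Pandas merges. The tables, column names, and values below are invented for illustration, and the sketch does not use or reproduce merge_tables; it only shows the kind of join that shared keys make possible.

```python
import pandas as pd

# Residues observed in a structure, keyed by PDB residue number.
structure = pd.DataFrame(
    {"pdb_resnum": [10, 11, 12], "resname": ["ALA", "GLY", "SER"]}
)

# Structure-to-sequence mapping (the role SIFTS plays in this workflow).
mapping = pd.DataFrame(
    {"pdb_resnum": [10, 11, 12], "uniprot_pos": [35, 36, 37]}
)

# Variants keyed by sequence position; one position carries two variants.
variants = pd.DataFrame(
    {"uniprot_pos": [36, 37, 37], "variant": ["G36D", "S37F", "S37P"]}
)

# Left joins keep every structure residue, even those without variants.
merged = (
    structure
    .merge(mapping, on="pdb_resnum", how="left")
    .merge(variants, on="uniprot_pos", how="left")
)
```

Residues without a known variant keep a missing value in the variant column, and a residue with several variants appears once per variant, so the merged table stays sparse but complete.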

Config

ProteoFAV relies on a generic configuration file (config.txt) which defines the path to folders as well as the website/api addresses.
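
The layout of config.txt is not described here, so the following is only a sketch under the assumption of an INI-style file. The section names and keys are hypothetical and may differ from the real configuration file.

```python
import configparser

# Hypothetical INI-style content: one section for local folders and one
# for remote addresses. Names and values are assumptions of this sketch.
EXAMPLE_CONFIG = """
[Folders]
db_pdb = /data/pdb
db_sifts = /data/sifts

[Addresses]
api_ensembl = https://rest.ensembl.org/
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

# Fetchers would look up where to store files and which address to query.
pdb_dir = config.get("Folders", "db_pdb")
ensembl_api = config.get("Addresses", "api_ensembl")
```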
