maialab/protean

protean

This data package provides protein sequence profiles for OncoKB cancer genes. These profiles can be used to infer evolutionary variation at protein positions, which in turn may serve as a proxy for the impact of mutations. The data provided here is mainly intended as input to the AGVGD method: https://cran.r-project.org/package=agvgd.

Installation

Because {protean} is a data package that bundles more than a thousand sequence profiles, its size exceeds CRAN’s limits. It is therefore distributed through Pattern Institute’s R-Universe repository:

install.packages("protean", repos = "https://patterninstitute.r-universe.dev")

Usage

To list the genes for which protein sequence profiles are provided, use exported_genes:

library(protean)

# Number of protein sequence profiles available
length(exported_genes)
#> [1] 1118

# Here are the first 10
exported_genes[1:10]
#>  [1] "ABI1"     "ABL1"     "ABL2"     "ABRAXAS1" "ACKR3"    "ACSL3"   
#>  [7] "ACSL6"    "ACTB"     "ACTG1"    "ACVR1"
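
Since exported_genes is a character vector, checking whether a gene of interest has a profile can be done with %in%. The sketch below is a minimal illustration that uses a small stand-in vector (the hypothetical name genes, holding a few of the symbols shown above) so that it runs without the package installed; with {protean} loaded, exported_genes would take its place.

```r
# Stand-in for protean's exported_genes (assumption: a few symbols copied
# from the output above, not the full list of 1118 genes).
genes <- c("ABI1", "ABL1", "ABL2", "ABRAXAS1", "ACKR3")

# Gene symbols to look up; "NOTAGENE" is a made-up symbol for illustration.
query <- c("ABL1", "NOTAGENE")

# TRUE where a profile would be available, FALSE otherwise.
available <- query %in% genes
names(available) <- query
available
```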

The protein sequence profiles are bundled with {protean} and their location can be found with profile_path():

profile_path("TP53")
#> [1] "/home/rmagno/R/x86_64-pc-linux-gnu-library/4.3/protean/profiles/TP53.csv.gz"

To import one sequence profile into R use read_profile():

tp53_prof <- read_profile(profile_path("TP53"))
tp53_prof
#> # A tibble: 251 × 11
#>    timestamp           human_prot_id ortho_prot_id ortho_species human_align_seq
#>    <chr>               <chr>         <chr>         <chr>         <chr>          
#>  1 2023-11-26 21:57:1… ENSP00000269… ENSPPAP00000… pan_paniscus  MEEPQSDPSVEPPL…
#>  2 2023-11-26 21:57:1… ENSP00000269… ENSPTRP00000… pan_troglody… MEEPQSDPSVEPPL…
#>  3 2023-11-26 21:57:1… ENSP00000269… ENSPPYP00000… pongo_abelii  --------------…
#>  4 2023-11-26 21:57:1… ENSP00000269… ENSRBIP00000… rhinopithecu… MEEPQSDPSVEPPL…
#>  5 2023-11-26 21:57:1… ENSP00000269… ENSRROP00000… rhinopithecu… MEEPQSDPSVEPPL…
#>  6 2023-11-26 21:57:1… ENSP00000269… ENSCSAP00000… chlorocebus_… MEEPQSDPSVEPPL…
#>  7 2023-11-26 21:57:1… ENSP00000269… ENSMFAP00000… macaca_fasci… MEEPQSDPSVEPPL…
#>  8 2023-11-26 21:57:1… ENSP00000269… ENSMMUP00000… macaca_mulat… --------------…
#>  9 2023-11-26 21:57:1… ENSP00000269… ENSPANP00000… papio_anubis  MEEPQSDPSVEPPL…
#> 10 2023-11-26 21:57:1… ENSP00000269… ENSCATP00000… cercocebus_a… MEEPQSDPSVEPPL…
#> # ℹ 241 more rows
#> # ℹ 6 more variables: ortho_align_seq <chr>, human_ortho_perc_id <dbl>,
#> #   ortho_human_perc_id <dbl>, cigar <chr>, human_profile_seq <chr>,
#> #   ortho_profile_seq <chr>

The column ortho_profile_seq contains the ortholog sequences of the profile, one per row. The human sequence is identical across rows and is stored in the column human_profile_seq:

tp53_prof[c("ortho_species", "human_profile_seq", "ortho_profile_seq")]
#> # A tibble: 251 × 3
#>    ortho_species           human_profile_seq                   ortho_profile_seq
#>    <chr>                   <chr>                               <chr>            
#>  1 pan_paniscus            MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSVEPPLSQ…
#>  2 pan_troglodytes         MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSVEPPLSQ…
#>  3 pongo_abelii            MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSVEPPLSQ…
#>  4 rhinopithecus_bieti     MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#>  5 rhinopithecus_roxellana MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#>  6 chlorocebus_sabaeus     MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#>  7 macaca_fascicularis     MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#>  8 macaca_mulatta          MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#>  9 papio_anubis            MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLSQ…
#> 10 cercocebus_atys         MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP… MEEPQSDPSIEPPLRQ…
#> # ℹ 241 more rows
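
As a rough illustration of how such a profile can expose variation at protein positions, the sketch below counts the distinct residues observed at each position of a small set of aligned sequences. It is not the AGVGD method, only a simple stand-in. The sequences are made-up toy strings, not real profile data, and the names human_seq, ortho_seqs, aln and n_distinct are hypothetical. It also assumes that all sequences have the same length and treats every character, including a possible gap symbol, as a residue state. With a real profile, the human_profile_seq and ortho_profile_seq columns would presumably replace the toy vectors.

```r
# Toy stand-ins for one human_profile_seq value and a few ortho_profile_seq
# values (assumption: equal-length, already aligned; invented for illustration).
human_seq <- "MEEPQSDPSV"
ortho_seqs <- c("MEEPQSDPSV", "MEEPQSDPSI", "MEEPQSDPSI", "MEEPRSDPSI")

# Split each sequence into single residues and stack them into a character
# matrix: one row per sequence, one column per alignment position.
aln <- do.call(rbind, strsplit(c(human_seq, ortho_seqs), split = ""))

# Number of distinct residues observed at each position.
n_distinct <- apply(aln, 2, function(column) length(unique(column)))

# Positions (1-based) where more than one residue is observed.
variable_positions <- which(n_distinct > 1)
variable_positions
#> [1]  5 10
```

Positions with a single observed residue are fully conserved in this toy set, while positions 5 and 10 show substitutions.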
