2_correlation_analysis


Network analysis

This directory generates gene-gene (G-G) networks by clustering a correlation matrix computed between pairs of gene expression profiles. In other words, these networks represent the co-expression of P. aeruginosa genes in the PAO1 and PA14 compendia.

1_correlation_analysis.ipynb performs correlation analysis to compare the similarity between genes. A previous study found that KEGG (a database that contains genes or proteins annotated with specific biological processes as reported in the literature) is biased toward some biological processes. Figure 1C demonstrates that a large fraction of gene pairs are ribosomal relationships: in the top 0.1% most co-expressed gene pairs, 99% belong to the ribosome pathway. Furthermore, the performance of protein function prediction based on co-expression drops dramatically after removing the ribosome pathway (Figure 1A, B). This finding is consistent with what we observe when we calculate the correlation of the raw gene expression data: we found one large, highly correlated module that is likely driven by genes related to a single biological process.
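As a rough illustration of the raw correlation step, the sketch below computes a gene-by-gene Pearson correlation matrix with NumPy. It assumes the expression data is a samples x genes array; the function name `gene_correlation` and the toy data are hypothetical and are not taken from the notebook.

```python
import numpy as np


def gene_correlation(expression):
    """Pearson correlation between genes.

    `expression` is assumed to be a samples x genes array (an assumption
    for this sketch, not necessarily the notebook's layout).
    """
    # rowvar=False treats each column (gene) as a variable
    return np.corrcoef(expression, rowvar=False)


# Toy data: 50 samples, 5 genes, with gene 1 built to track gene 0
rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 5))
expression[:, 1] = expression[:, 0] + 0.1 * rng.normal(size=50)

corr = gene_correlation(expression)  # genes x genes, symmetric, ones on the diagonal
```

A heatmap of `corr` is one way to spot a single dominant block of highly correlated genes like the module described above.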

Challenge: This very dominant global signal can mask more specific signals in the data.

To correct for this dominant signal, we applied multiple corrections. The first method uses SPELL, which calculates the correlation on the gene coefficient matrix (i.e. how much each gene contributes to a latent variable) generated by applying SVD. This matrix represents how genes contribute to independent latent variables that capture the signal in the data, where the variance of each variable is 1. The idea is that correlations between gene contributions are more balanced, so less prominent patterns are amplified and more dominant patterns are dampened by this compression. Figure 3 shows how well SPELL recapitulates biology (i.e. the relationship between genes within a GO term) compared to Pearson correlation.
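The idea of correlating gene coefficients rather than raw expression can be sketched as follows. This is a minimal SPELL-style illustration, not the notebook's implementation: it assumes a samples x genes input, and the function name `spell_like_correlation` and the `n_components` cutoff are hypothetical choices made for the example.

```python
import numpy as np


def spell_like_correlation(expression, n_components):
    """Correlate genes by their coefficients on the top latent variables.

    `expression` is assumed to be samples x genes; `n_components` (>= 2)
    is an illustrative cutoff, not a value taken from the notebook.
    """
    # Center each gene so the SVD captures variation around the mean
    centered = expression - expression.mean(axis=0)
    # Thin SVD: centered = U @ diag(s) @ Vt. The singular values in s carry
    # the scale of each latent variable; the rows of Vt are unit-norm, so
    # every latent variable enters the correlation on the same footing.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    gene_coefficients = vt[:n_components].T  # genes x components
    # Each row is one gene's coefficient vector; correlate genes pairwise
    return np.corrcoef(gene_coefficients)


rng = np.random.default_rng(1)
toy_expression = rng.normal(size=(20, 8))  # 20 samples, 8 genes
spell_corr = spell_like_correlation(toy_expression, n_components=4)
```

Because the singular values are dropped, a latent variable that explains most of the variance contributes no more to the correlation than a weaker one, which is the balancing effect described above.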

Overall, this notebook calculates and visualizes the correlation matrix that will be used to cluster the data into modules. These visualizations will be used to inform the clustering approach used in the next notebook.

Note: Exploration of different correction methods can be found in archive notebooks. Each of these notebooks can be run independently.

2_clustering.ipynb applies clustering methods to identify gene modules based on the correlation matrices.

3_module_validation.ipynb validates the generated modules.