archive
spell_vs_counts_experiment
0_examine_data.ipynb
0_examine_data.py
0_process_array_data.ipynb
0_process_array_data.py
1_correlation_analysis.ipynb
1_correlation_analysis.py
2_clustering.ipynb
2_clustering.py
3_module_validation.ipynb
3_module_validation.py
3a_validate_module_composition.ipynb
3a_validate_module_composition.py
README.md

README.md

Network analysis

This directory generates gene-gene networks by clustering a correlation matrix, which is computed from the correlation between pairs of gene expression profiles. In other words, these networks represent the co-expression of P. aeruginosa genes using the PAO1 and PA14 compendia.

1_correlation_analysis.ipynb performs correlation analysis to compare the similarity between genes. A previous study found that KEGG (a database that contains genes or proteins annotated with specific biological processes, as reported in the literature) is biased in which biological processes are represented. Figure 1C demonstrates that a large fraction of gene pairs are ribosomal relationships: in the top 0.1% most co-expressed genes, 99% belong to the ribosome pathway. Furthermore, protein function prediction based on co-expression drops dramatically after removing the ribosome pathway (Figure 1A, B). This finding is consistent with our observation when we calculate the correlation of the raw gene expression data. We found one large, highly correlated module that is likely driven by genes related to a single biological process.
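As a rough illustration of the raw correlation step, the sketch below computes a gene-by-gene Pearson correlation matrix from a samples-by-genes expression array. The function name `gene_correlation`, the toy data, and the array orientation are assumptions made for this example and are not taken from the notebook.

```python
import numpy as np


def gene_correlation(expression):
    """Return the genes x genes Pearson correlation matrix.

    `expression` is assumed to be a samples x genes array (hypothetical layout).
    """
    # rowvar=False tells NumPy that each column (gene) is a variable.
    return np.corrcoef(expression, rowvar=False)


# Toy data: 50 samples, 6 genes. Genes 0-2 share one strong signal,
# mimicking a dominant co-expressed module.
rng = np.random.default_rng(0)
n_samples, n_genes = 50, 6
shared_signal = rng.normal(size=(n_samples, 1))
expression = rng.normal(size=(n_samples, n_genes))
expression[:, :3] += 3 * shared_signal

corr = gene_correlation(expression)
print(corr.shape)
```

In this toy setup the genes that share the strong signal are highly correlated with one another, which is the kind of dominant block the paragraph above describes.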

Challenge: This very dominant global signal can mask more specific signals in the data.

To correct for this dominant signal, we applied multiple corrections. The first method uses SPELL, which calculates the correlation on the gene coefficient matrix (i.e. how much each gene contributes to each latent variable) that is generated after applying SVD. This matrix represents how genes contribute to independent latent variables that capture the signal in the data, where the variance of each variable is 1. The idea is that correlations between gene contributions are more balanced, so that less prominent patterns are amplified and more dominant patterns are dampened by this compression. Figure 3 shows how well SPELL recapitulates biology (i.e. the relationship between genes within a GO term) compared to Pearson correlation.
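The SPELL-style correction described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `spell_like_correlation`, the centering step, the `n_components` truncation, and the toy data are all assumptions. The key idea it shows is that the right singular vectors are used without their singular values, so every latent variable is weighted equally.

```python
import numpy as np


def spell_like_correlation(expression, n_components):
    """Correlate genes using their SVD coefficients instead of raw expression.

    `expression` is assumed to be samples x genes. `n_components` (hypothetical
    parameter) is the number of latent variables kept; it must be at least 2.
    """
    if n_components < 2:
        raise ValueError("n_components must be at least 2")
    # Center each gene (an assumption of this sketch).
    centered = expression - expression.mean(axis=0)
    # With full_matrices=False, vt has shape (rank, genes).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Gene coefficient matrix: genes x latent variables.
    # Singular values are dropped, so no latent variable dominates by scale.
    gene_coefficients = vt[:n_components].T
    # Rows are genes, so the default rowvar=True gives genes x genes.
    return np.corrcoef(gene_coefficients)


# Toy data: 20 samples, 30 genes.
rng = np.random.default_rng(1)
toy_expression = rng.normal(size=(20, 30))
spell_corr = spell_like_correlation(toy_expression, n_components=5)
print(spell_corr.shape)
```

The number of components to keep is a tuning choice; too few components collapse most genes onto the dominant pattern again, while too many reintroduce noise.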

Overall, this notebook calculates and visualizes the correlation matrix that will be used to cluster the data into modules. These visualizations will be used to inform the clustering approach used in the next notebook.

Note: Exploration of different correction methods can be found in archive notebooks. Each of these notebooks can be run independently.

2_clustering.ipynb applies clustering methods to identify gene modules based on the correlation matrices.
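As one possible illustration of clustering a correlation matrix into modules, the sketch below uses average-linkage hierarchical clustering from SciPy on the distance `1 - correlation`. The README does not state which clustering methods the notebook uses, so the method, the distance definition, the function name `cluster_modules`, and the toy data are assumptions for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_modules(corr, n_modules):
    """Assign each gene to a module from a genes x genes correlation matrix.

    Hypothetical helper: hierarchical clustering on 1 - correlation.
    """
    distance = 1.0 - corr
    np.fill_diagonal(distance, 0.0)  # guard against floating point error
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    # Cut the tree so that at most n_modules clusters remain.
    return fcluster(tree, t=n_modules, criterion="maxclust")


# Toy data: two groups of three genes, each group driven by its own signal.
rng = np.random.default_rng(2)
signal_a = rng.normal(size=(60, 1))
signal_b = rng.normal(size=(60, 1))
toy = np.hstack([np.repeat(signal_a, 3, axis=1), np.repeat(signal_b, 3, axis=1)])
toy = toy + 0.3 * rng.normal(size=toy.shape)

toy_corr = np.corrcoef(toy, rowvar=False)
module_labels = cluster_modules(toy_corr, n_modules=2)
print(module_labels)
```

The same function could be applied to either the raw or the corrected correlation matrix to compare how the resulting modules differ.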

3_module_validation.ipynb validates the generated modules.
