Workflows: panoply_nmf_internal_workflow
PANOPLY workflow to perform unsupervised non-negative matrix factorization (NMF)-based clustering on single- or multi-omic data. The workflow expects data matrices in GCT format, derived from the same samples (columns) with consistent sample identifiers, and imported into a Terra workspace by the PANOPLY startup notebook.
The workflow consists of six modules:
module | description |
---|---|
panoply_nmf_balance_omes | optional filtering of GCT files, to balance feature-input across multiple omes |
panoply_nmf | data pre-processing and NMF analysis |
panoply_nmf_postprocess | functional characterization of derived NMF clusters |
panoply_nmf_report | RMarkdown report for panoply_nmf and panoply_nmf_postprocess |
panoply_ssgsea | single sample Gene Set Enrichment Analysis applied to derived clusters |
panoply_ssgsea_report | RMarkdown report for panoply_ssgsea |
Module(s): panoply_nmf_balance_omes
This module can optionally be run before the panoply_nmf module by setting `balance_omes=true`, to balance the number of features from the input omes. The goal of running this module is to mitigate a potential bias towards a particular data type in the multi-omics clustering (e.g. vastly different numbers of genomic and proteomic features).
Module(s): panoply_nmf
In order to merge the input array of data matrices into a single non-negative, fully quantified input matrix, the following transformations are applied (in order):
- Combine the array of data matrices into a single matrix, retaining ome-type as an `rdesc` annotation
- Remove all features which are not fully quantified
- (optional) Apply standard-deviation filter, to remove features with low variance. For multi-omics data, the following filtering methods are available:
  - `global`: apply filter globally to the multi-omics data matrix
  - `separate`: apply filter to each data type separately
  - `equal`: filter the multi-omics data matrix such that each data type will be represented by the same number of features
- (optional) Apply z-scoring by `row` (z-score rows), `col` (z-score columns), or `rowcol` (z-score rows and then columns).
- Transform signed matrix into a non-negative matrix
  - Create a data matrix with all negative numbers zeroed.
  - Create another data matrix with all positive numbers zeroed and the signs of all negative numbers removed.
  - Concatenate both matrices, resulting in a data matrix twice as large as the original but containing only positive values and zeros, and hence appropriate for NMF.
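The final transformation step can be illustrated with a minimal Python sketch. It is not the module's own code: the function name `to_nonnegative` and the list-of-rows layout (features as rows, samples as columns) are assumptions made for illustration, and the two halves are stacked row-wise so that each feature appears twice.

```python
def to_nonnegative(matrix):
    """Split a signed matrix (features x samples) into its positive and
    negative parts and stack them, giving a non-negative matrix with
    twice as many rows. Hypothetical helper, for illustration only."""
    # Copy with all negative numbers zeroed.
    positive_part = [[v if v > 0 else 0.0 for v in row] for row in matrix]
    # Copy with all positive numbers zeroed and the sign of negatives removed.
    negative_part = [[-v if v < 0 else 0.0 for v in row] for row in matrix]
    # Concatenate: rows 0..p-1 hold positive parts, rows p..2p-1 negative parts.
    return positive_part + negative_part


signed = [[1.5, -0.5, 0.0],
          [-2.0, 0.25, 1.0]]
nonneg = to_nonnegative(signed)
```

Each original value survives in exactly one of the two halves, so no information is lost by the transformation.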
After NMF, the matrix W of feature weights contains two separate weights for each feature, one for its positive and one for its negative values (e.g. z-scores). To reverse the non-negative transformation and derive a single signed weight for each feature, each row of W is first normalized by dividing by the sum of the feature weights in that row. Weights per feature and cluster are then aggregated by keeping the maximal normalized weight and multiplying it by the sign of the z-score from the initial data matrix. The resulting matrix Wsigned thus contains signed cluster weights for each feature present in the input matrix.
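One possible reading of this aggregation is sketched below in Python. Several points are assumptions rather than documented behavior: the function name `signed_weights` is hypothetical, the first p rows of W are taken to hold the positive-part weights and the last p rows the negative-part weights, and the sign is taken from whichever half supplies the larger normalized weight.

```python
def signed_weights(w, n_features):
    """Collapse a (2p x k) weight matrix into a signed (p x k) matrix.
    Assumes rows 0..p-1 are positive-part weights and rows p..2p-1 are
    negative-part weights (an assumption, not confirmed by the source)."""
    def normalize(row):
        total = sum(row)
        # Guard against all-zero rows to avoid division by zero.
        return [v / total if total > 0 else 0.0 for v in row]

    norm = [normalize(row) for row in w]
    signed = []
    for i in range(n_features):
        pos_row, neg_row = norm[i], norm[n_features + i]
        # Keep the larger normalized weight; negate it if it came from
        # the negative half of the matrix.
        signed.append([p if p >= n else -n for p, n in zip(pos_row, neg_row)])
    return signed


w = [[3.0, 1.0],   # feature 1, positive part
     [0.0, 0.0],   # feature 2, positive part
     [1.0, 1.0],   # feature 1, negative part
     [1.0, 3.0]]   # feature 2, negative part
w_signed = signed_weights(w, n_features=2)
```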
Module(s): panoply_nmf, panoply_nmf_postprocess, panoply_nmf_report
Given a factorization rank k (where k is the number of clusters), NMF decomposes a non-negative `p x n` data matrix V into two matrices W and H such that the product of W and H approximates V. Matrix H is a `k x n` matrix whose entries represent the weight with which each sample (1 to n) contributes to each cluster (1 to k). Matrix W is a `p x k` matrix whose entries represent the weight with which each feature (1 to p) contributes to each cluster (1 to k). Matrix H is used to assign samples to clusters by choosing, for each column of H, the cluster with the maximum score. Matrix W, which contains the weight of each feature in each cluster, is used to derive a list of representative features separating the clusters using the method proposed in (Kim and Park, 2007). Cluster-specific features are further subjected to a 2-sample moderated t-test (Ritchie et al., 2015) comparing the feature abundance between the respective cluster and all other clusters. Derived p-values are adjusted for multiple hypothesis testing using the method proposed in (Benjamini and Hochberg, 1995).
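The sample-to-cluster assignment from H can be sketched in a few lines of Python. The function name `assign_clusters` and the toy matrix are illustrative assumptions; ties are resolved here in favor of the lower cluster index, which the source does not specify.

```python
def assign_clusters(h):
    """Assign each sample (column of the k x n matrix H) to the cluster
    (row) with the maximum weight. Returns 1-based cluster labels."""
    n_clusters = len(h)
    n_samples = len(h[0])
    return [
        max(range(n_clusters), key=lambda c: h[c][s]) + 1
        for s in range(n_samples)
    ]


# Toy H with k = 3 clusters (rows) and n = 3 samples (columns).
h = [[0.7, 0.1, 0.2],
     [0.2, 0.8, 0.3],
     [0.1, 0.1, 0.5]]
labels = assign_clusters(h)
```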
Module(s): panoply_nmf, panoply_nmf_report
To determine the optimal factorization rank k (number of clusters) for the input data matrix, a range of cluster numbers between k=`$kmin` and `$kmax` is tested. For each k, the matrix V is factorized using `$nrun` iterations with random initialization of W and H. To determine the optimal factorization rank, the pipeline calculates two metrics for each value of k: 1) the cophenetic correlation coefficient (`coph`), measuring how well the intrinsic structure of the data is recapitulated after clustering, and 2) the dispersion coefficient of the consensus matrix (`disp`) as defined in (Kim and Park, 2007), measuring the reproducibility of the clustering across `$nrun` iterations. The optimal k is defined as the maximum of `disp^(1-coph)` for cluster numbers between k=`$kmin` and `$kmax`.
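The rank selection rule can be sketched as follows. This is an illustration, not the pipeline's code: the function names, the dictionary layout, and the metric values are made up, and the `exclude_2` argument mirrors the workflow parameter of the same name. The dispersion coefficient follows the definition in Kim and Park (2007), i.e. the mean of `4 * (C_ij - 0.5)^2` over all entries of the consensus matrix C.

```python
def dispersion(consensus):
    """Dispersion coefficient of an n x n consensus matrix:
    1 for a perfectly reproducible clustering (entries 0 or 1),
    0 when every entry equals 0.5."""
    n = len(consensus)
    return sum(4.0 * (c - 0.5) ** 2 for row in consensus for c in row) / (n * n)


def optimal_rank(metrics, exclude_2=False):
    """Pick the k that maximizes disp^(1-coph).
    `metrics` maps k -> {"coph": ..., "disp": ...} (hypothetical layout)."""
    candidates = {
        k: m for k, m in metrics.items() if not (exclude_2 and k == 2)
    }
    return max(
        candidates,
        key=lambda k: candidates[k]["disp"] ** (1.0 - candidates[k]["coph"]),
    )


# Made-up metric values for kmin = 2 .. kmax = 4.
metrics = {
    2: {"coph": 0.99, "disp": 0.95},
    3: {"coph": 0.97, "disp": 0.90},
    4: {"coph": 0.90, "disp": 0.70},
}
best_k = optimal_rank(metrics)
best_k_without_2 = optimal_rank(metrics, exclude_2=True)
```

Because k=2 often scores highest on both metrics, excluding it can change the selected rank, as the two calls above show.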
Module(s): panoply_nmf_postprocess, panoply_nmf_report
For each sample, a cluster membership score is calculated for each cluster, indicating how representative the sample is of that cluster. This score is used to define a set of "core samples" that are most representative of a given cluster, as follows:
For each sample, the difference between its highest cluster membership score and each of its other cluster membership scores is calculated. If the minimum of these differences exceeds 1/K, where K is the total number of clusters, the sample is considered a core member.
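The core-membership rule can be sketched in Python as below. The function name `is_core_member` is hypothetical, and the membership scores are taken as given, since how they are computed is not described here. The minimum difference between the highest score and all others equals the gap between the highest and second-highest scores.

```python
def is_core_member(scores):
    """Return True if the gap between the highest and second-highest
    cluster membership score exceeds 1/K (K = number of clusters).
    Expects at least two scores."""
    k = len(scores)
    ranked = sorted(scores, reverse=True)
    min_difference = ranked[0] - ranked[1]
    return min_difference > 1.0 / k


clear_sample = [0.8, 0.1, 0.1]        # gap 0.7 vs threshold 1/3
ambiguous_sample = [0.45, 0.40, 0.15]  # gap 0.05 vs threshold 1/3
core_flags = [is_core_member(clear_sample), is_core_member(ambiguous_sample)]
```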
Module(s): panoply_ssgsea, panoply_ssgsea_report, panoply_nmf_postprocess, panoply_nmf_report
Functional characterization of the resulting NMF clusters is performed by projecting the matrix of signed multi-omic feature weights (Wsigned) onto gene sets in `$gene_set_database` via the `panoply_ssgsea` module. To derive a single weight for each gene measured across multiple omics data types (e.g. protein, RNA, phosphorylation site, acetylation site), the weight with maximal absolute amplitude is retained.
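The gene-level collapsing step can be sketched for a single cluster as follows. The function name `collapse_to_genes`, the gene symbols, and the weights are illustrative assumptions; the sketch only shows the "keep the weight with maximal absolute amplitude" rule.

```python
def collapse_to_genes(feature_weights, feature_genes):
    """Collapse signed feature weights (one cluster) to one weight per
    gene by keeping the weight with the largest absolute value."""
    best = {}
    for weight, gene in zip(feature_weights, feature_genes):
        if gene not in best or abs(weight) > abs(best[gene]):
            best[gene] = weight
    return best


# Two features per gene, e.g. a protein-level and an RNA-level weight.
weights = [0.2, -0.6, 0.1, 0.3]
genes = ["TP53", "TP53", "EGFR", "EGFR"]
gene_weights = collapse_to_genes(weights, genes)
```

The sign of the retained weight is preserved, so a strongly negative feature can represent its gene.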
To test for overrepresentation of categorical variables defined under `group.cols` in `$yaml_file` in the resulting clusters, a Fisher's exact test (R function `fisher.test`) is used in the set of samples defining the "cluster core" as described above. For continuous variables defined under `groups.col.continuous` in `$yaml_file`, a Wilcoxon rank-sum test (`ggpubr` R-package) is used to assess whether the continuous values are differentially distributed between any pair of clusters.
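For intuition, a two-sided Fisher's exact test on a 2x2 table can be written with the Python standard library alone. This is a sketch under assumptions: the pipeline itself calls R's `fisher.test`, and the table layout used here (core samples inside versus outside one cluster, with versus without one annotation level) is an illustrative choice, not taken from the source. The function name `fisher_exact_2x2` is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Small relative tolerance so that ties survive floating-point noise.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Rows: in cluster / not in cluster; columns: annotated / not annotated.
p_value = fisher_exact_2x2(3, 1, 1, 3)
```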

- `ome_gcts`: (Array[File]+) array of normalized data matrices (e.g. proteome, phosphoproteome, RNA, CNA, etc.) in `.gct` format.
- `ome_labels`: (Array[String]+) array of labels associated with each GCT file (e.g. "prot", "pSTY", "rna", "cna", etc.). Must match the length and order of `ome_gcts` exactly.
- `yaml_file`: (`.yaml` file) master-parameters.yaml
- `gene_set_database`: (`.gmt` file) gene set database
- `label`: (String) label

- `balance_omes`: (Boolean, default = TRUE) If TRUE, the contributions of the different data types will be balanced.
- `tol`: (Float, default = 0.01) Tolerance specifying the maximal accepted difference (as a fraction of total variance) between contributions from different data types. Used as stopping criterion to end optimization.
- `var`: (Float, default = 0.9) Explained variance by PCA (between 0 and 1). Used to extract the number of PCs explaining the specified fraction of variance in the multi-omics data matrix.
- `zscore_mode`: (String, default = "rowcol") z-score mode: `row` (z-score rows), `col` (z-score columns), `rowcol` (z-score rows and then columns). Note that z-scoring can also be performed directly in the panoply_nmf module.

- `kmin`: (Int, default = 2) Minimal factorization rank.
- `kmax`: (Int, default = 8) Maximal factorization rank.
- `exclude_2`: (Boolean, default = TRUE) If TRUE, 'k=2' will be excluded from calculation of the optimal rank.
- `nmf_method`: (String, default = "lee") NMF method supported by the NMF R-package. Controlled by `method` in the `.yaml` file.
- `nrun`: (Int, default = 50) Number of NMF runs with different starting seeds.
- `seed`: (Int, default = "random") Seed for NMF factorization. To set the seed explicitly, provide a numeric value. Providing the string "random" will result in a random seed.
- `sd_filt_min`: (Float, default = 0.05) Lowest percentile of standard deviation (SD) across rows to remove from the data. 0 means all data will be used; 0.1 means the 10 percent of the data with the lowest SD will be removed. Applied before z-scoring. Controlled by `sd_filt` in the `.yaml` file.
- `sd_filt_mode`: (String, default = "global") Determines how the SD filter will be applied to the multi-omics data matrix. Controlled by `filt_mode` in the `.yaml` file.
  - `global`: apply filter globally to the multi-omics data matrix
  - `separate`: apply filter to each data type separately
  - `equal`: filter the multi-omics data matrix such that each data type will be represented by the same number of features. The number of features used to represent each data type, Nfeat, is determined by the data type with the smallest number of features. Other data types will be filtered to retain the Nfeat most variable features.
- `z_score`: (Boolean, default = TRUE) If TRUE, the data matrix will be z-scored according to the yaml-exclusive parameter `z_score_mode`.
- `z_score_mode`: (String, default = "rowcol") z-score mode: `row` (z-score rows), `col` (z-score columns), `rowcol` (z-score rows and then columns).
- `gene_column`: (String, default = "geneSymbol") (optional) Column name in the rdesc of the GCT file that contains gene symbols, used for adding additional feature annotations. Controlled by the global parameter `gene_id_col` in the `.yaml` file.
- `organism_id`: (String, default = "human") (optional) Organism type, used for gene-mapping if `gene_column` is provided. Supports 'human' (Hs), 'mouse' (Mm), or 'rat' (Rn). Controlled by the global parameter `organism` in the `.yaml` file.
- `output_prefix`: (String, default = "results_nmf") name of the output `.tar` file.
- `yaml_file`: (`.yaml` file) master-parameters.yaml
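The `equal` mode of `sd_filt_mode` described in the list above can be sketched in Python as follows. The function name `equal_sd_filter`, the nested-dictionary layout (ome label, then feature ID, then values across samples), and the toy data are assumptions for illustration; the sketch uses the sample standard deviation to rank features.

```python
from statistics import stdev


def equal_sd_filter(omes):
    """Keep the Nfeat most variable features in every ome, where Nfeat
    is the feature count of the smallest ome.
    `omes` maps ome label -> {feature id: list of values across samples}."""
    n_feat = min(len(features) for features in omes.values())
    filtered = {}
    for ome, features in omes.items():
        # Rank features from most to least variable.
        ranked = sorted(features, key=lambda f: stdev(features[f]), reverse=True)
        keep = set(ranked[:n_feat])
        # Preserve the original feature order among the retained features.
        filtered[ome] = {f: v for f, v in features.items() if f in keep}
    return filtered


omes = {
    "prot": {"P1": [1.0, 2.0, 3.0], "P2": [1.0, 1.0, 1.1]},
    "rna": {"R1": [0.0, 5.0, 10.0], "R2": [1.0, 1.0, 1.0],
            "R3": [2.0, 4.0, 6.0], "R4": [3.0, 3.5, 3.0]},
}
balanced = equal_sd_filter(omes)
```

Here the proteome has the fewest features (two), so the RNA data are reduced to their two most variable features.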

- `groups_file`: (`.csv` file, default = NULL) subset of sample annotations, to be used for calculating enrichment of clusters and in the generation of figures. If no groups file is provided, all annotations from the original `cdesc` will be used. Please note that discrepancies between the `cdesc` of different input data matrices may result in unexpected behavior.
- `feature_fdr`: (Float, default = 0.01) Maximal FDR for feature selection (2-sample t-test).
- `pval_signif`: (Float, default = 0.01) Maximal p-value for overrepresentation analysis (Fisher's exact test). Controlled by `ora_pval` in the `.yaml` file.
- `max_annot_levels`: (Int, default = 10) Maximal number of levels in an annotation category. Categories with more levels will be excluded from figures and overrepresentation analysis. Controlled by `ora_max_categories` in the `.yaml` file.
- `top_n_features`: (Int, default = 25) Maximal number of driver features, per cluster, to create boxplots / heatmaps for visualizing expression.

- `${label}_NMF_results.tar.gz`: Tar file containing combined pre-processed expression GCTs (signed and non-negative), `.Rdata` file with res.rank output of `nmf()`, and `.Rdata` file with `opt` object containing parameters.
- `nclust`: Best factorization rank.
- `${label}_NMF_postprocess.tar.gz`: Tar file containing figures and analyses from post-processing.
- `${label}_K${nclust}_clusterMembership.tsv`: TSV file with sample membership scores, consensus mapping, and core-membership.
- `${label}_nmf_report.html`: Summary report of NMF clustering pipeline.
- `results_ssgsea.tar`: Results of `panoply_ssgsea` applied to the NMF results.
- `ssGSEA report - ${label}.html`: Summary report of ssGSEA applied to the NMF results.

- Kim, H. & Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495-1502 (2007).
- Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
- Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289-300 (1995).
- Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2, 433-459 (2010). https://doi.org/10.1002/wics.101