Data Analysis Modules: panoply_nmf_postprocess
This module visualizes and characterizes the clustering results from the panoply_nmf module for the selected factorization rank nclust.
Given a factorization rank k (where k is the number of clusters), NMF decomposes a non-negative p x n data matrix V into two matrices W and H such that the product of W and H approximates V. Matrix H is a k x n matrix whose entries represent the weight with which each sample (1 to n) contributes to each cluster (1 to k). Matrix W is a p x k matrix whose entries represent the weight with which each feature (1 to p) contributes to each cluster (1 to k). Samples are assigned to clusters using matrix H: each sample is assigned to the cluster with the maximum score in its column of H. Matrix W, which contains the weights of each feature in each cluster, is used to derive a list of representative features separating the clusters, using the method proposed by Kim and Park (2007). Cluster-specific features are then subjected to a two-sample moderated t-test (Ritchie et al., 2015) comparing feature abundance between the respective cluster and all other clusters. The resulting p-values are adjusted for multiple hypothesis testing using the method proposed by Benjamini and Hochberg (1995).
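The sample-to-cluster assignment described above can be sketched in a few lines. This is an illustrative Python sketch, not the module's own code: the function name `assign_clusters`, the list-of-rows layout for H, and the toy values are all assumptions made for the example.

```python
def assign_clusters(H):
    """Assign each sample (column of H) to the cluster (row of H) with the largest weight.

    H is a k x n matrix given as a list of k rows, each holding n non-negative weights.
    Returns a list of n cluster labels in the range 1..k.
    """
    k = len(H)
    n = len(H[0])
    labels = []
    for j in range(n):
        column = [H[i][j] for i in range(k)]
        # On ties, max() keeps the first (lowest-index) cluster; this tie rule is an assumption.
        best = max(range(k), key=lambda i: column[i])
        labels.append(best + 1)  # report 1-based cluster labels
    return labels


# Toy H with k = 2 clusters and n = 3 samples (made-up values).
H = [
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.6],
]
labels = assign_clusters(H)
```

Here the first sample has its largest weight in row 1 and the other two in row 2, so they are assigned to clusters 1, 2 and 2 respectively.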
After NMF, the matrix W of feature weights contains two separate weights for each feature, one for its positive and one for its negative values (e.g. z-scores). To reverse the non-negative transformation and derive a single signed weight for each feature, each row of matrix W is normalized by dividing by the sum of the weights in that row. Weights per feature and cluster are then aggregated by keeping the maximal normalized weight and multiplying it by the sign of the z-score from the initial data matrix. The resulting transformed matrix, Wsigned, thus contains signed cluster weights for each feature present in the input matrix.
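One possible reading of this sign-recovery step is sketched below in Python. It is not the module's implementation. It assumes that the positive-part and negative-part weights are available as two separate p x k matrices (`W_pos` and `W_neg`, hypothetical names), and that "the sign of the z-score" is +1 when the larger normalized weight comes from the positive part and -1 when it comes from the negative part.

```python
def row_normalize(W):
    """Divide every row by its sum; rows that sum to zero are left as zeros."""
    normalized = []
    for row in W:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def signed_weights(W_pos, W_neg):
    """Combine positive-part and negative-part weights into one signed p x k matrix.

    Assumption: the retained weight is signed by the part it came from
    (+ for the positive part, - for the negative part).
    """
    norm_pos = row_normalize(W_pos)
    norm_neg = row_normalize(W_neg)
    signed = []
    for pos_row, neg_row in zip(norm_pos, norm_neg):
        signed.append([
            pos if pos >= neg else -neg  # keep the larger normalized weight
            for pos, neg in zip(pos_row, neg_row)
        ])
    return signed


# Toy example with p = 2 features and k = 2 clusters (made-up values).
W_pos = [[2, 2], [0, 1]]
W_neg = [[1, 3], [3, 1]]
W_signed = signed_weights(W_pos, W_neg)
```

For the first feature, the normalized positive weights are 0.5 and 0.5 and the normalized negative weights are 0.25 and 0.75, so the signed weights are 0.5 for cluster 1 and -0.75 for cluster 2.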
For each sample, a cluster membership score is calculated indicating how representative the sample is of each cluster. This score is used to define a set of "core samples" that are most representative of a given cluster, as follows:
For each sample, the differences between its highest cluster membership score and each of its other cluster membership scores are calculated. If the minimum of these differences exceeds 1/K, where K is the total number of clusters, the sample is considered a core member.
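The core-membership rule can be written down directly. The Python sketch below takes the membership scores of one sample as given (how they are computed is not covered here); the function name `is_core_member` and the example scores are illustrative assumptions.

```python
def is_core_member(scores):
    """Apply the core-membership rule to one sample.

    scores: list of K cluster membership scores for the sample.
    Returns True if the smallest gap between the top score and every
    other score exceeds 1/K.
    """
    K = len(scores)
    if K < 2:
        raise ValueError("the rule needs at least two clusters")
    top = max(range(K), key=lambda i: scores[i])
    # Differences between the highest score and all other scores.
    diffs = [scores[top] - s for i, s in enumerate(scores) if i != top]
    return min(diffs) > 1.0 / K


# Made-up membership scores for two samples with K = 3 clusters.
clear_sample = is_core_member([0.7, 0.2, 0.1])       # gaps 0.5 and 0.6, threshold 1/3
ambiguous_sample = is_core_member([0.45, 0.40, 0.15])  # smallest gap 0.05
```

The first sample clears the 1/3 threshold and is a core member; the second is too close to its runner-up cluster and is not. Note that the minimum difference is simply the gap between the best and second-best score.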
To test for overrepresentation of categorical variables (defined under group.cols in $yaml_file) in the resulting clusters, a Fisher's exact test (R function fisher.test) is applied to the set of samples defining the "cluster core", as described above. For continuous variables (defined under groups.col.continuous in $yaml_file), a Wilcoxon rank-sum test (ggpubr R package) is used to assess whether the continuous values are differentially distributed between any pair of clusters.
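To make the overrepresentation test concrete, the Python sketch below computes a one-sided Fisher's exact test p-value for a single 2x2 table (one cluster versus all others, one annotation level versus all others). It is an illustration only: the module calls R's fisher.test, whose default alternative is two-sided, and the exact settings and table construction used by the module are not stated here. The function name and the counts are made up.

```python
from math import comb


def fisher_overrep_pvalue(a, b, c, d):
    """One-sided (overrepresentation) Fisher's exact test for the 2x2 table

        [[a, b],
         [c, d]]

    a: core samples in the cluster with the annotation level
    b: core samples in the cluster without it
    c: core samples outside the cluster with the annotation level
    d: core samples outside the cluster without it

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    n_cluster = a + b          # samples in the cluster
    n_level = a + c            # samples carrying the annotation level
    total = a + b + c + d
    upper = min(n_cluster, n_level)
    tail = sum(
        comb(n_level, x) * comb(total - n_level, n_cluster - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(total, n_cluster)


# Made-up counts: 3 of 4 cluster samples carry the level, 1 of 4 outside samples do.
p_value = fisher_overrep_pvalue(3, 1, 1, 3)
```

For these counts the tail sums to 17 of the 70 possible tables, so the p-value is 17/70 (about 0.243), which would not pass a threshold such as pval_signif = 0.01.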
Inputs:
- nmf_results: (.tar file) Results from the panoply_nmf module.
- nclust: (Int) Factorization rank / number of clusters to generate analyses for. Defaults to the 'optimal' nclust if run as part of panoply_nmf_workflow, but can be manually overridden to visualize any nclust between kmin and kmax.
- yaml_file: (.yaml file) master-parameters.yaml
- output_prefix: (String) Label used as the prefix for output file names.
Optional inputs:
- groups_file: (.csv file, default = NULL) Subset of sample annotations to be used for calculating enrichment of clusters and for generating figures. If no groups file is provided, all annotations from the original cdesc will be used. Please note that discrepancies between the cdesc of different input data matrices may result in unexpected behavior.
- feature_fdr: (Float, default = 0.01) Maximal FDR for feature selection (2-sample t-test).
- pval_signif: (Float, default = 0.01) Maximal p-value for overrepresentation analysis (Fisher's exact test). Controlled by ora_pval in the .yaml file.
- max_annot_levels: (Int, default = 10) Maximal number of levels in an annotation category. Categories with more levels will be excluded from figures and overrepresentation analysis. Controlled by ora_max_categories in the .yaml file.
- top_n_features: (Int, default = 25) Maximal number of driver features, per cluster, for which boxplots / heatmaps visualizing expression are created.
- gene_column: (String, optional, default = "geneSymbol") Column name in the rdesc of the GCT file that contains gene symbols, used for adding feature annotations. Controlled by the global parameter gene_id_col in the .yaml file.
- feature_method: (String, optional) Explicit override for the driver-feature selection method. If NULL, the algorithm will choose a method automatically.
Outputs:
- results: (${output_prefix}_NMF_postprocess.tar.gz) Tar file containing figures and analyses from post-processing.
- membership: (${output_prefix}_K${nclust}_clusterMembership.tsv) TSV file with sample membership scores, consensus mapping, and core membership.
- feature_matrix_w: (${output_prefix}_K${nclust}_W_rowNorm_combined_signed_n*.gct) GCT file containing the signed W matrix for GSEA analysis.
- ssgsea_viable: (Boolean) Indicates whether the feature space is ssGSEA compatible (i.e. mappable to gene symbols).
References:
- Kim, H. & Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495-1502 (2007).
- Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
- Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289-300 (1995).