Data Analysis Modules: panoply_nmf
This module performs unsupervised non-negative matrix factorization (NMF)-based clustering on single- or multi-omic data. Before running the analysis, the data are preprocessed: multi-omic datasets are combined, data are filtered and normalized, and the dataset is transformed into a non-negative matrix. After running, an appropriate factorization rank is chosen based on the cophenetic correlation and dispersion coefficients.
In order to merge the input-array of data matrices into a single non-negative fully-quantified input-matrix, the following transformations are applied (in order):
- Combine array data-matrices into a single matrix, retaining ome-type as an `rdesc` annotation
- Remove all features which are not fully quantified
- (optional) Apply standard-deviation filter, to remove features with low variance. For multi-omics data, the following filtering methods are available:
  - `global`: apply filter globally to the multi-omics data matrix
  - `separate`: apply filter to each data type separately
  - `equal`: filter the multi-omics data matrix such that each data type will be represented by the same number of features
- (optional) Apply z-scoring by `row` (z-score rows), `col` (z-score columns), or `rowcol` (z-score rows and then columns).
- Transform signed matrix into a non-negative matrix
  - Create a data matrix with all negative numbers zeroed.
  - Create another data matrix with all positive numbers zeroed and the signs of all negative numbers removed.
  - Concatenate both matrices, resulting in a data matrix twice as large as the original but containing only positive values and zeros, and hence appropriate for NMF.
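The signed-to-non-negative transformation in the last step can be sketched in a few lines of Python. This is an illustration only, not the module's R code: the helper name `to_nonnegative` and the choice to stack the two parts along the feature (row) axis are assumptions.

```python
def to_nonnegative(matrix):
    """Split a signed features-x-samples matrix into a positive part and a
    negative part (signs removed), then stack them row-wise."""
    # Part 1: negative numbers zeroed.
    positive = [[v if v > 0 else 0.0 for v in row] for row in matrix]
    # Part 2: positive numbers zeroed, signs of negative numbers removed.
    negative = [[-v if v < 0 else 0.0 for v in row] for row in matrix]
    # Stacking doubles the number of rows (assumption: concatenation by feature).
    return positive + negative


# Hypothetical 2-feature x 3-sample signed matrix (e.g. after z-scoring).
signed = [[1.5, -0.5, 0.0],
          [-2.0, 0.25, 1.0]]
nonneg = to_nonnegative(signed)
for row in nonneg:
    print(row)
```

Subtracting the second half from the first half recovers the original signed matrix, so no information is lost by the transformation.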
Given a factorization rank k (where k is the number of clusters), NMF decomposes a non-negative `p x n` data matrix V into two matrices W and H such that the product of W and H approximates V. Matrix H is a `k x n` matrix whose entries represent the weight with which each sample (1 to n) contributes to each cluster (1 to k). Matrix W is a `p x k` matrix representing the weight with which each feature (1 to p) contributes to each cluster (1 to k). Matrix H is used to assign samples to clusters by choosing, for each column of H, the cluster with the maximum score. Matrix W, containing the weights of each feature in a given cluster, is used to derive a list of representative features separating the clusters using the method proposed in (Kim and Park, 2007). Cluster-specific features are further subjected to a 2-sample moderated t-test (Ritchie et al., 2015) comparing the feature abundance between the respective cluster and all other clusters. Derived p-values are adjusted for multiple hypothesis testing using the method proposed in (Benjamini and Hochberg, 1995).
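To make the factorization and the cluster assignment concrete, here is a minimal pure-Python sketch. It is not the module's implementation (the module relies on the NMF R-package). The functions `nmf` and `assign_clusters`, the multiplicative-update scheme for the Euclidean objective, and the toy matrix are all assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate the p x n matrix v by W (p x k) times H (k x n) using
    multiplicative updates that keep every entry non-negative."""
    rng = random.Random(seed)
    p, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(p)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(p)]
    return w, h


def assign_clusters(h):
    """For each sample (column of H), return the 1-based row index with the largest weight."""
    return [max(range(len(col)), key=lambda a: col[a]) + 1 for col in zip(*h)]


# Toy 4-feature x 4-sample matrix with two obvious sample groups.
v = [[5.0, 5.0, 0.0, 0.0],
     [4.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 6.0, 6.0]]
w, h = nmf(v, k=2)
clusters = assign_clusters(h)
print(clusters)
```

On this toy input the first two samples and the last two samples are expected to fall into different clusters. Which group receives label 1 depends on the random initialization, which is one reason the pipeline repeats the factorization over many runs.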
To determine the optimal factorization rank k (number of clusters) for the input data matrix, a range of cluster numbers between k=`$kmin` and `$kmax` is tested. For each k, the matrix V is factorized using `$nrun` iterations with random initialization of W and H. To determine the optimal factorization rank, the pipeline calculates two metrics for each value of k: 1) the cophenetic correlation coefficient, measuring how well the intrinsic structure of the data is recapitulated after clustering (`coph`), and 2) the dispersion coefficient of the consensus matrix as defined in (Kim and Park, 2007), measuring the reproducibility of the clustering across `$nrun` iterations (`disp`). The optimal k is defined as the maximum of `disp^(1-coph)` for cluster numbers between k=`$kmin` and `$kmax`.
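The rank-selection rule can be sketched as follows. This is a hedged illustration rather than the pipeline's code: `dispersion` and `select_rank` are hypothetical helper names, the `coph` and `disp` values are made-up numbers, and the dispersion formula is the one given by Kim and Park (2007), i.e. the mean of `4 * (C_ij - 0.5)^2` over all entries of the consensus matrix C.

```python
def dispersion(consensus):
    """Dispersion coefficient of an n x n consensus matrix:
    1.0 for a perfectly reproducible clustering (all entries 0 or 1),
    0.0 when every entry equals 0.5."""
    n = len(consensus)
    return sum(4.0 * (c - 0.5) ** 2 for row in consensus for c in row) / (n * n)


def select_rank(coph, disp, exclude_2=True):
    """Return the k that maximizes disp^(1 - coph), optionally skipping k=2."""
    candidates = [k for k in coph if not (exclude_2 and k == 2)]
    return max(candidates, key=lambda k: disp[k] ** (1.0 - coph[k]))


# Made-up metrics for k = 2..5 (illustrative values only).
coph = {2: 0.99, 3: 0.95, 4: 0.90, 5: 0.80}
disp = {2: 0.95, 3: 0.90, 4: 0.70, 5: 0.60}

print(select_rank(coph, disp, exclude_2=True))   # k=2 is skipped
print(select_rank(coph, disp, exclude_2=False))  # k=2 is considered
print(dispersion([[1.0, 0.0], [0.0, 1.0]]))
```

Because a two-cluster solution often scores highly on both metrics, skipping k=2 (the behavior of the `exclude_2` parameter) can change the selected rank, as it does for these illustrative numbers.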
- `ome_gcts`: (Array[File]+) Array of GCT files upon which NMF analysis is run. GCT files should share common sample IDs; samples which do not appear in all GCTs will be excluded from analysis.
- `ome_labels`: (Array[String]+) Array of data-labels corresponding to each GCT file (e.g. `ome_gcts: ["proteome-subset.gct", "phosphoproteome-subset.gct"]`, `ome_labels: ["proteome", "pSTY"]`). Must match the length and order of `ome_gcts` exactly.
- `kmin`: (Int, default = 2) Minimal factorization rank.
- `kmax`: (Int, default = 8) Maximal factorization rank.
- `exclude_2`: (Boolean, default = TRUE) If TRUE, 'k=2' will be excluded from calculation of the optimal rank.
- `nmf_method`: (String, default = "lee") NMF method supported by the NMF R-package. Controlled by `method` in the `.yaml` file.
- `nrun`: (Int, default = 50) Number of NMF runs with different starting seeds.
- `seed`: (Int, default = "random") Seed for NMF factorization. To set the seed explicitly, provide a numeric value. Providing the string "random" will result in a random seed.
- `sd_filt_min`: (Float, default = 0.05) Lowest percentile of standard deviation (SD) across rows to remove from the data. 0 means all data will be used; 0.1 means the 10 percent of the data with the lowest SD will be removed. Will be applied before z-scoring. Controlled by `sd_filt` in the `.yaml` file.
- `sd_filt_mode`: (String, default = "global") Determines how the SD filter will be applied to the multi-omics data matrix. Controlled by `filt_mode` in the `.yaml` file.
  - `global`: apply filter globally to the multi-omics data matrix
  - `separate`: apply filter to each data type separately
  - `equal`: filter the multi-omics data matrix such that each data type will be represented by the same number of features. The number of features used to represent each data type, Nfeat, is determined by the data type with the smallest number of features. Other data types will be filtered to retain the Nfeat most variable features.
- `z_score`: (Boolean, default = TRUE) If TRUE, the data matrix will be z-scored according to the yaml-exclusive parameter `z_score_mode`.
- `z_score_mode`: (String, default = "rowcol") z-score mode: `row` (z-score rows), `col` (z-score columns), `rowcol` (z-score rows and then columns).
- `gene_column`: (String, default = "geneSymbol") (optional) Column name in rdesc in the GCT file that contains gene symbols, used for adding additional feature-annotations. Controlled by global-parameter `gene_id_col` in the `.yaml` file.
- `organism_id`: (String, default = "human") (optional) Organism type, used for gene-mapping if `gene_column` is provided. Support for 'human' (Hs), 'mouse' (Mm), or 'rat' (Rn). Controlled by global-parameter `organism` in the `.yaml` file.
- `output_prefix`: (String, default = "results_nmf") Name of the output `.tar` file.
- `yaml_file`: (`.yaml` file) master-parameters.yaml
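The difference between the three SD-filter modes (`global`, `separate`, `equal`) can be illustrated with a small Python sketch. It is not the module's R implementation: the helpers `sd_filter` and `drop_lowest_sd`, the `(ome_label, feature_id, values)` tuple layout, and the rule of dropping `int(n * sd_filt_min)` rows are assumptions. The `equal` branch follows only the description of that mode and ignores `sd_filt_min`.

```python
import statistics
from collections import defaultdict


def drop_lowest_sd(rows, fraction):
    """Drop the given fraction of rows with the lowest standard deviation."""
    ranked = sorted(rows, key=lambda r: statistics.stdev(r[2]))
    return ranked[int(len(ranked) * fraction):]


def sd_filter(rows, sd_filt_min=0.05, mode="global"):
    """rows: list of (ome_label, feature_id, values) tuples."""
    by_ome = defaultdict(list)
    for row in rows:
        by_ome[row[0]].append(row)
    if mode == "global":
        # One threshold across the combined multi-omics matrix.
        return drop_lowest_sd(rows, sd_filt_min)
    if mode == "separate":
        # The same fraction is removed within each data type.
        return [r for group in by_ome.values() for r in drop_lowest_sd(group, sd_filt_min)]
    if mode == "equal":
        # Keep the Nfeat most variable features per data type, where Nfeat
        # is the size of the smallest data type.
        n_feat = min(len(group) for group in by_ome.values())
        kept = []
        for group in by_ome.values():
            ranked = sorted(group, key=lambda r: statistics.stdev(r[2]), reverse=True)
            kept.extend(ranked[:n_feat])
        return kept
    raise ValueError(f"unknown sd_filt_mode: {mode}")


# Hypothetical toy data: four proteome features, two low-variance pSTY features.
rows = [
    ("proteome", "P1", [1.0, 1.1, 0.9]),
    ("proteome", "P2", [0.0, 2.0, 4.0]),
    ("proteome", "P3", [1.0, 3.0, 2.0]),
    ("proteome", "P4", [5.0, 5.5, 4.5]),
    ("pSTY", "S1", [0.0, 0.02, 0.04]),
    ("pSTY", "S2", [0.0, 0.05, 0.10]),
]
for mode in ("global", "separate", "equal"):
    kept = sd_filter(rows, sd_filt_min=0.5, mode=mode)
    print(mode, sorted(r[1] for r in kept))
```

In this toy example the `global` mode removes every pSTY feature because that data type has uniformly lower variance, whereas `separate` and `equal` keep each data type represented.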
- `results`: (`${label}_NMF_results.tar.gz`) Tar file containing combined pre-processed expression GCTs (signed and non-negative), `.Rdata` file with res.rank output of `nmf()`, and `.Rdata` file with `opt` object containing parameters.
- `nclust`: Best factorization rank.
- `preprocess_figs`: Tar file containing preprocessing figures with the results of sd-filtering and z-scoring (if applicable).
- Kim, H. & Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23, 1495-1502 (2007).
- Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47-e47 (2015).
- Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289-300 (1995).