Data Preparation Modules: panoply_filter
wcorinne edited this page Aug 28, 2025 · 3 revisions
This module preprocesses and filters proteomics (protein/PTM site) data. It is typically run after the panoply_normalize_ms_data module.
Preprocessing (always applied):
- If `geneIdCol` with Hugo Gene Symbols is missing from the sample annotation table, it will be created from the `proteinIdCol`.
- If the sample annotation table contains a `QC.status` column, samples marked `QC.pass` will be retained in the output files. If not, all samples are assumed to be `QC.pass`, and a `QC.status` column is created accordingly.
- If `separateQCTypes` is set to 'true', additional output files (e.g. `*-QC.fail.gct`) will be created with non-`QC.pass` samples.
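The QC handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the module's actual implementation: the sample annotation table is modeled as a list of dicts, `split_by_qc_status` is a hypothetical helper name, and `Sample.ID` is an illustrative column name.

```python
def split_by_qc_status(samples, separate_qc_types=False):
    """Split sample annotation rows (a list of dicts) by their QC.status value."""
    # Assumption: the table is rectangular, so checking the first row is enough
    # to tell whether a QC.status column exists.
    if samples and "QC.status" not in samples[0]:
        # No QC.status column: treat every sample as QC.pass and add the column.
        samples = [{**row, "QC.status": "QC.pass"} for row in samples]

    passed = [row for row in samples if row["QC.status"] == "QC.pass"]

    # Non-QC.pass samples are grouped per status value only when requested.
    other = {}
    if separate_qc_types:
        for row in samples:
            if row["QC.status"] != "QC.pass":
                other.setdefault(row["QC.status"], []).append(row)
    return passed, other


annotations = [
    {"Sample.ID": "S1", "QC.status": "QC.pass"},
    {"Sample.ID": "S2", "QC.status": "QC.fail"},
]
passed, other = split_by_qc_status(annotations, separate_qc_types=True)
print([row["Sample.ID"] for row in passed])  # samples kept in the main outputs
print(sorted(other))                         # one extra output per non-pass status
```

Grouping the non-`QC.pass` samples by status value mirrors the one-file-per-status naming of the additional outputs (e.g. `*-QC.fail.gct`).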
Filters available are:
- If `sdFilterThreshold` is specified, exclude rows with standard deviation less than `sdFilterThreshold`.
- If `combineReplicates` is specified and replicates are present in the data (identified by identical values in the `Participant`, `Type` (optional), and `Timepoint` (optional) columns in the sample annotation table), combine values across replicates for each row, using the method specified by `combineReplicates`.
- If `naMax` is specified, exclude rows with more than `naMax` missing values.
- If `noNA` is 'true', create an additional table with no missing values.
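As a rough illustration of the row filters listed above, the following Python sketch applies an SD filter and a missing-value filter to a small table. It rests on several assumptions that this page does not state: the table is a dict of row id to value list, missing values are `None`, the SD is the sample standard deviation over non-missing values, rows whose SD cannot be computed are dropped, and `na_max` is already an absolute count. The function names are hypothetical.

```python
import math
import statistics


def row_sd(values):
    """Sample SD of the non-missing values; NaN if fewer than two are present."""
    present = [v for v in values if v is not None]
    return statistics.stdev(present) if len(present) >= 2 else math.nan


def filter_rows(table, sd_threshold=None, na_max=None):
    """Keep rows that pass the SD and missing-value filters (None disables a filter)."""
    kept = {}
    for row_id, values in table.items():
        n_missing = sum(v is None for v in values)
        if na_max is not None and n_missing > na_max:
            continue  # more missing values than allowed
        if sd_threshold is not None:
            sd = row_sd(values)
            # Assumption: rows with an undefined SD are excluded as well.
            if math.isnan(sd) or sd < sd_threshold:
                continue
        kept[row_id] = values
    return kept


def drop_rows_with_na(table):
    """Build the additional 'no missing values' table."""
    return {r: v for r, v in table.items() if all(x is not None for x in v)}


table = {
    "P1": [0.1, 0.2, 0.1, 0.2],     # low variability
    "P2": [-1.0, 1.0, -1.0, 1.0],   # variable, complete
    "P3": [2.0, None, None, None],  # mostly missing
    "P4": [-1.0, None, 1.0, 2.0],   # variable, one missing value
}
filtered = filter_rows(table, sd_threshold=0.5, na_max=2)
print(list(filtered))                     # ['P2', 'P4']
print(list(drop_rows_with_na(filtered)))  # ['P2']
```

The order in which the real module applies its filters is not specified here; in this sketch each row is simply tested against both criteria.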
Required inputs:
- `inputData`: (`.tar` file) tarball from `panoply_normalize_ms_data`, or normalized input data in `gct` format (when `standalone` is `true`)
- `type`: (String) proteomics data type
- `standalone`: (String) set to `true` to run as a self-contained module; if `true`, the `analysisDir` input is required
- `yaml`: (`.yaml` file) parameters in `yaml` format
- `analysisDir`: (String) name of analysis directory
Optional inputs:
- `filterProteomics`: (String, default chosen in startup notebook) when 'true', filtering is applied; when 'false', filtering is skipped. Preprocessing is always applied, regardless of the toggle value.
- `separateQCTypes`: (String, default = 'false') toggle for generating additional output files, subset to non-`QC.pass` samples (e.g. `*-QC.fail.gct`). Filtering is not applied to these outputs.
- `geneIdCol`: (String, default = 'geneSymbol') name of (row) annotation column containing gene IDs.
- `proteinIdCol`: (String, default = 'id') name of (row) annotation column containing protein IDs.
- `proteinIdType`: (String, default chosen in startup notebook) keytype of protein IDs in `proteinIdCol`
- `combineReplicates`: (String, default = 'mean') method used to combine replicate samples, which are identified by identical values in the `Participant`, `Type` (optional), and `Timepoint` (optional) columns of the sample annotation table. If `null`, replicates will not be combined.
- `naMax`: (Float, default = 0.7) maximum allowed NA values per row (protein/PTM site); can be a fraction between 0 and 1 or an integer specifying the actual number of samples. If `null`, NA values will not be removed.
- `noNA`: (String, default = 'false') toggle for generating a GCT in which rows (protein/PTM sites) containing any NA values are excluded
- `sdFilterThreshold`: (Float, default = 0.5) standard deviation (SD) threshold for SD filtering; rows (proteins/PTM sites) with SD less than `sdFilterThreshold` are excluded from the filtered output table. If `null`, SD filtering will not be applied.
- `ndigits`: (Int, default = 5) number of decimal digits to use in output tables
- `outTar`: (String, default = "panoply_filter-output.tar") output `.tar` file name
- `outTable`: (String, default = "filtered_table-output.gct") output `.gct` filtered file name
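Two of the parameters above have semantics that are easier to see in code: `naMax` accepts either a fraction or a count, and `combineReplicates` groups samples by up to three annotation columns. The Python sketch below is a hedged illustration, not the module's code. The helper names are hypothetical, rounding a fractional `naMax` up to the next whole sample is an assumption, values below 1 are assumed to be fractions, and only the 'mean' method named on this page is implemented.

```python
import math
import statistics
from collections import defaultdict


def resolve_na_max(na_max, n_samples):
    """Convert naMax into an absolute number of allowed missing values per row."""
    if na_max is None:
        return None  # null: no NA-based filtering
    if 0 <= na_max < 1:
        # Assumption: a fraction is scaled by the sample count and rounded up.
        return math.ceil(na_max * n_samples)
    return int(na_max)


def replicate_groups(annotations):
    """Group sample indices by (Participant, Type, Timepoint).

    Type and Timepoint are optional columns; a missing column contributes None.
    """
    groups = defaultdict(list)
    for index, row in enumerate(annotations):
        key = (row["Participant"], row.get("Type"), row.get("Timepoint"))
        groups[key].append(index)
    return groups


def combine_replicates(values, annotations, method="mean"):
    """Combine one row of values across replicate samples, ignoring missing values."""
    if method != "mean":
        raise ValueError("only 'mean' is sketched here")
    combined = []
    for indices in replicate_groups(annotations).values():
        present = [values[i] for i in indices if values[i] is not None]
        combined.append(statistics.mean(present) if present else None)
    return combined


annotations = [
    {"Participant": "A", "Type": "Tumor"},
    {"Participant": "A", "Type": "Tumor"},   # replicate of the first sample
    {"Participant": "B", "Type": "Tumor"},
    {"Participant": "B", "Type": "Tumor"},   # replicate with a missing value
]
print(combine_replicates([1.0, 3.0, 5.0, None], annotations))  # [2.0, 5.0]
print(resolve_na_max(0.5, n_samples=4))                        # 2
print(resolve_na_max(3, n_samples=10))                         # 3
```

Whether the sample count used for a fractional `naMax` is taken before or after replicates are combined is not stated on this page, so the sketch leaves `n_samples` to the caller.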
Outputs:
- `output_tar`: Tarball including the following files in the `filtered-data` subdirectory:
  - Filtered data files:
    - data table containing only QC-pass samples (`*-ratio-norm.gct`), with no other filters applied
    - filtered data table (`*-ratio-norm-filt.gct`)
  - Optional data files:
    - data table containing non-`QC.pass` samples of some {qc.type} (`*-ratio-norm-{qc.type}.gct`), with no other filters applied
    - filtered data table, with rows (protein/PTM sites) containing any NA values excluded (`*-ratio-norm-filt-noNA.gct`)
- `outputs`: (`.gct` file) filtered data table (equivalent to `*-ratio-norm-filt.gct`)
- `output_yaml`: finalized parameter file
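To make the file naming above concrete, this short Python sketch lists the paths one might expect inside the output tarball. It is illustrative only: `expected_outputs` is a hypothetical helper, and the `prefix` standing in for the `*` wildcard (shown here as "proteome") is an assumption, since this page does not say what the wildcard expands to.

```python
def expected_outputs(prefix, qc_types=(), no_na=False):
    """List expected file paths in the filtered-data subdirectory."""
    names = [
        f"{prefix}-ratio-norm.gct",       # QC-pass samples, no other filters
        f"{prefix}-ratio-norm-filt.gct",  # filtered table
    ]
    # One optional file per non-QC.pass status (separateQCTypes = 'true').
    names += [f"{prefix}-ratio-norm-{qc}.gct" for qc in qc_types]
    # Optional table without any missing values (noNA = 'true').
    if no_na:
        names.append(f"{prefix}-ratio-norm-filt-noNA.gct")
    return [f"filtered-data/{name}" for name in names]


for path in expected_outputs("proteome", qc_types=["QC.fail"], no_na=True):
    print(path)
```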