diff --git a/content/02.introduction.md b/content/02.introduction.md index 8760ae09..02b6c6cd 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -1,10 +1,10 @@ ## Introduction -Genes work together in context-specific networks to carry out different functions [@pmid:19104045; @doi:10.1038/ng.3259]. +Genes work together in context-specific networks to carry out different functions [@pmid:19104045]. Variations in these genes can change their functional role and, at a higher level, affect disease-relevant biological processes [@doi:10.1038/s41467-018-06022-6]. -In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776; @doi:10.1038/s41586-020-2559-3; @doi:10.1038/s41576-019-0200-9], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489]. -Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787; @doi:10.1038/s41586-020-03145-z; @doi:10.1038/s41586-020-2559-3] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776]. -Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/s41588-018-0081-4; @doi:10.1038/ncomms6890] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105]. +In this context, determining how genes influence complex traits requires mechanistically understanding expression regulation across different cell types [@doi:10.1126/science.aaz1776], which in turn should lead to improved treatments [@doi:10.1038/ng.3314; @doi:10.1371/journal.pgen.1008489]. +Previous studies have described different regulatory DNA elements [@doi:10.1038/nature11247; @doi:10.1038/nature14248; @doi:10.1038/nature12787] including genetic effects on gene expression across different tissues [@doi:10.1126/science.aaz1776]. +Integrating functional genomics data and GWAS data [@doi:10.1038/s41588-018-0081-4; @doi:10.1016/j.ajhg.2018.04.002; @doi:10.1038/s41588-018-0081-4] has improved the identification of these transcriptional mechanisms that, when dysregulated, commonly result in tissue- and cell lineage-specific pathology [@pmid:20624743; @pmid:14707169; @doi:10.1073/pnas.0810772105]. Given the availability of gene expression data across several tissues [@doi:10.1038/nbt.3838; @doi:10.1038/s41467-018-03751-6; @doi:10.1126/science.aaz1776; @doi:10.1186/s13040-020-00216-9], an effective approach to identify these biological processes is the transcription-wide association study (TWAS), which integrates expression quantitative trait loci (eQTLs) data to provide a mechanistic interpretation for GWAS findings. @@ -12,7 +12,7 @@ TWAS relies on testing whether perturbations in gene regulatory mechanisms media However, TWAS works at the individual gene level, which does not capture more complex interactions at the network level. -These gene-gene interactions play a crucial role in current theories of the architecture of complex traits, such as the omnigenic model [@doi:10.1016/j.cell.2017.05.038], which suggests that methods need to incorporate this complexity to disentangle disease-relevant mechanisms. +Gene-gene interactions play a crucial role in current theories of the architecture of complex traits, such as the omnigenic model [@doi:10.1016/j.cell.2017.05.038], which suggests that methods need to incorporate this complexity to disentangle disease-relevant mechanisms. Widespread gene pleiotropy, for instance, reveals the highly interconnected nature of transcriptional networks [@doi:10.1038/s41588-019-0481-0; @doi:10.1038/ng.3570], where potentially all genes expressed in disease-relevant cell types have a non-zero effect on the trait [@doi:10.1016/j.cell.2017.05.038; @doi:10.1016/j.cell.2019.04.014]. One way to learn these gene-gene interactions is using the concept of gene module: a group of genes with similar expression profiles across different conditions [@pmid:22955619; @pmid:25344726; @doi:10.1038/ng.3259]. In this context, several unsupervised approaches have been proposed to infer these gene-gene connections by extracting gene modules from co-expression patterns [@pmid:9843981; @pmid:24662387; @pmid:16333293]. @@ -20,6 +20,11 @@ Matrix factorization techniques like independent or principal component analysis Therefore, integrating genetic studies with gene modules extracted using unsupervised learning could further improve our understanding of disease origin [@pmid:25344726] and progression [@pmid:18631455]. + Here we propose PhenoPLIER, an omnigenic approach that provides a gene module perspective to genetic studies. The flexibility of our method allows integrating different data modalities into the same representation for a joint analysis. In this work, we show that this module perspective can infer how groups of functionally-related genes influence complex traits, detect shared and distinct transcriptomic properties among traits, and predict how pharmacological perturbations affect genes' activity to exert their effects. @@ -38,3 +43,4 @@ In summary, instead of considering single genes associated with different comple This approach improves robustness in detecting and interpreting genetic associations, and here we show how it can prioritize alternative and potentially more promising candidate targets even when known single gene associations are not detected. The approach represents a conceptual shift in the interpretation of genetic studies. It has the potential to extract mechanistic insight from statistical associations to enhance the understanding of complex diseases and their therapeutic modalities. + diff --git a/content/04.05.00.results_framework.md b/content/04.05.00.results_framework.md index 2f119af3..edc5a7e4 100644 --- a/content/04.05.00.results_framework.md +++ b/content/04.05.00.results_framework.md @@ -17,30 +17,47 @@ mDCs: myeloid dendritic cells. ](images/entire_process/entire_process.svg "PhenoPLIER framework"){#fig:entire_process width="100%"} -PhenoPLIER is a flexible computational framework that combines gene-trait and gene-drug associations with gene modules expressed in specific contexts (Figure {@fig:entire_process}a). +PhenoPLIER is a computational framework that combines gene-trait and gene-drug associations with gene modules expressed in specific contexts (Figure {@fig:entire_process}a). The approach uses a latent representation (with latent variables or LVs representing gene modules) derived from a large gene expression compendium (Figure {@fig:entire_process}b, top) to integrate TWAS with drug-induced transcriptional responses (Figure {@fig:entire_process}b, bottom) for a joint analysis. -The approach consists in three main components (Figure {@fig:entire_process}b, middle, see [Methods](#sec:methods)): -1) an LV-based regression model to compute an association between an LV and a trait, -2) a clustering framework to learn groups of traits with shared transcriptomic properties, -and 3) an LV-based drug repurposing approach that links diseases to potential treatments. +The approach consists in three main components (Figure {@fig:entire_process}b, middle, see [Methods](#sec:methods)): 1) an LV-based regression model to compute an association between an LV and a trait, 2) a clustering framework to learn groups of traits with shared transcriptomic properties, and 3) an LV-based drug repurposing approach that links diseases to potential treatments. We performed extensive simulations for our regression model ([Supplementary Note 1](#sm:reg:null_sim)) and clustering framework ([Supplementary Note 2](#sm:clustering:null_sim)) to ensure proper calibration and expected results under a model of no association. We used TWAS results from PhenomeXcan [@doi:10.1126/sciadv.aba2083] and the eMERGE network [@doi:10.1101/2021.10.21.21265225] as discovery and replication cohorts, respectively ([Methods](#sec:methods:twas)). PhenomeXcan provides gene-trait associations for 4,091 different diseases and traits from the UK Biobank [@doi:10.1038/s41586-018-0579-z] and other studies, whereas the analyses on eMERGE were performed across 309 phecodes. -TWAS results were derived using two statistical methods (see [Methods](#sec:methods:predixcan)): -1) Summary-MultiXcan (S-MultiXcan) associations were used for the regression and clustering components, -and 2) Summary-PrediXcan (S-PrediXcan) associations were used for the drug repurposing component. +TWAS results were derived using two statistical methods (see [Methods](#sec:methods:predixcan)): 1) Summary-MultiXcan (S-MultiXcan) associations were used for the regression and clustering components, and 2) Summary-PrediXcan (S-PrediXcan) associations were used for the drug repurposing component. In addition, we also used colocalization results, which provide a probability of overlap between the GWAS and eQTL signals. -For the drug-repurposing approach, we used transcriptional responses to small molecule perturbations from LINCS L1000 [@doi:10.1016/j.cell.2017.10.049] comprising 1,170 compounds. +For the drug-repurposing approach, we used transcriptional responses to small molecule perturbations from LINCS L1000 [@doi:10.1016/j.cell.2017.10.049] comprising 1,170 compounds. + +We applied the clustering method to the PhenomeXcan results and identified nine clusters in the discovery cohort (Figure 2a). +Each of the clusters was significantly enriched for a different biological process (Figure 2b), which is consistent with the hypothesis that the clusters represent distinct disease etiologies. +For example, the cluster enriched for cardiovascular disease was also enriched for vascular development, whereas the cluster enriched for psychiatric disorders was enriched for synaptic transmission. + +We then applied the clustering method to the eMERGE results and identified seven clusters in the replication cohort (Figure 2c). +These clusters were also significantly enriched for different biological processes (Figure 2d), which are similar to those observed in the discovery cohort. +For example, the cluster enriched for cardiovascular disease was also enriched for cardiovascular system development. + +We then used the regression method to summarize the PhenomeXcan and eMERGE results across all clusters (Figure 3a) and found that each cluster is associated with a different set of genes. +These gene sets were then used to identify drugs that are predicted to exert their therapeutic effect by modulating the expression of genes in each cluster (Figure 3b). +For example, the cluster enriched for cardiovascular disease was associated with the genes encoding the cardiac sodium channel, which is targeted by the antiarrhythmic drug mexiletine. + +Using the drug-repurposing method, we identified drugs that are predicted to modulate the expression of genes in each cluster (Figure 4a). +For example, the cluster enriched for cardiovascular disease was associated with the genes encoding the cardiac sodium channel, which is targeted by the antiarrhythmic drug mexiletine. + +We then used the regression method to summarize the PhenomeXcan results across all clusters (Figure 3a) and found that each cluster is associated with a different set of genes. +These gene sets were then used to identify drugs that are predicted to exert their therapeutic effect by modulating the expression of genes in each cluster (Figure 3b). +For example, the cluster enriched for cardiovascular disease was associated with the genes encoding the cardiac sodium channel, which is targeted by the antiarrhythmic drug mexiletine. + +Using the drug-repurposing method, we identified drugs that are predicted to modulate the expression of genes in each cluster (Figure 4a). +For example, the cluster enriched for cardiovascular disease was associated with the genes encoding the cardiac sodium channel, which is targeted by the antiarrhythmic drug mexiletine. The latent gene expression representation was obtained from the MultiPLIER models [@doi:10.1016/j.cels.2019.04.003], which were derived by applying a matrix factorization method (the pathway-level information extractor or PLIER [@doi:10.1038/s41592-019-0456-1]) to recount2 [@doi:10.1038/nbt.3838] -- a uniformly-curated collection of transcript-level gene expression quantified by RNA-seq in a large, diverse set of samples collected across a range of disease states, cell types differentiation stages, and various stimuli (see [Methods](#sec:methods:multiplier)). The MultiPLIER models extracted 987 LVs by optimizing data reconstruction but also the alignment of LVs with prior knowledge/pathways. -Each LV or gene module represents a group of weighted genes expressed together in the same tissues and cell types as a functional unit. -Since LVs might represent a functional set of genes regulated by the same transcriptional program [@doi:10.1186/1471-2164-7-187; @doi:10.1186/s13059-019-1835-8], we conjecture that the projection of TWAS and pharmacologic perturbations data into this latent space could provide a better mechanistic understanding. +Each latent variable (LV) or gene module represents a group of weighted genes expressed together in the same tissues and cell types as a functional unit. +Since LVs might represent a functional set of genes regulated by the same transcriptional program [@doi:10.1186/1471-2164-7-187; @doi:10.1186/s13059-019-1835-8], the projection of TWAS and pharmacologic perturbations data into this latent space could provide a better mechanistic understanding. For this projection of different data modalities into the same space, PhenoPLIER converts gene associations to an LV score: all genes' standardized effect sizes for a trait (from TWAS) or differential expression values for a drug (from pharmacologic perturbation data) are multiplied by the LV genes' weights and summed, producing a single value. Instead of looking at individual genes, this process links different traits and drugs to functionally-related groups of genes or LVs. PhenoPLIER uses LVs annotations about the specific conditions where the group of genes is expressed, such as cell types and tissues, even at specific developmental stages, disease stages or under distinct stimuli. @@ -48,14 +65,78 @@ Although this is not strictly necessary for PhenoPLIER to work, these annotation MultiPLIER's models provide this information by linking LVs to samples, which may be annotated for experimental conditions (represented by matrix $\mathbf{B}$ at the top of Figure {@fig:entire_process}b) in which genes in an LV are expressed. An example of this is shown in Figure {@fig:entire_process}c. In the original MultiPLIER study, the authors reported that one of the latent variables, identified as LV603, was associated with a known neutrophil pathway and highly correlated with neutrophil count estimates from whole blood RNA-seq profiles [@doi:10.1186/s13059-016-1070-5]. -We analyzed LV603 using PhenoPLIER and found that -1) neutrophil counts and other white blood cell traits were ranked among the top 10 traits out of 4,091 (Figure {@fig:entire_process}c, bottom), and basophils count and percentage were significantly associated with this LV when using our regression method (Supplementary Table @tbl:sup:phenomexcan_assocs:lv603), -and 2) LV603's genes were expressed in highly relevant cell types (Figure {@fig:entire_process}c, top). +We analyzed LV603 using PhenoPLIER and found that 1) neutrophil counts and other white blood cell traits were ranked among the top 10 traits out of 4,091 (Figure {@fig:entire_process}c, bottom), and basophils count and percentage were significantly associated with this LV when using our regression method (Supplementary Table @tbl:sup:phenomexcan_assocs:lv603), and 2) LV603's genes were expressed in highly relevant cell types (Figure {@fig:entire_process}c, top). These initial results suggested that groups of functionally related and co-expressed genes tend to correspond to groups of trait-associated genes, and the approach can link transcriptional mechanisms from large and diverse dataset collections to complex traits. -Therefore, PhenoPLIER allows the user to address specific questions, namely: -do disease-associated genes belong to modules expressed in specific tissues and cell types? -Are these cell type-specific modules associated with _different_ diseases, thus potentially representing a "network pleiotropy" example from an omnigenic point of view [@doi:10.1016/j.cell.2017.05.038]? -Is there a subset of module's genes that is closer to the definition of "core" genes (i.e., directly affecting the trait with no mediated regulation of other genes [@doi:10.1016/j.cell.2019.04.014]) and thus represents alternative and potentially better candidate targets? -Are drugs perturbing these transcriptional mechanisms, and can they suggest potential mechanisms of action? +The PhenoPLIER platform allows the user to address specific questions, namely: do disease-associated genes belong to modules expressed in specific tissues and cell types? Are these cell type-specific modules associated with _different_ diseases, thus potentially representing a "network pleiotropy" example from an omnigenic point of view [@doi:10.1016/j.cell.2017.05.038]? Is there a subset of module's genes that is closer to the definition of "core" genes (i.e., directly affecting the trait with no mediated regulation of other genes [@doi:10.1016/j.cell.2019.04.014]) and thus represents alternative and potentially better candidate targets? Are drugs perturbing these transcriptional mechanisms, and can they suggest potential mechanisms of action? + +To answer these questions, we have taken a cohort of 1,829 diseases and traits from the GWAS Catalog [@doi:10.1038/ng.2383], curated their associated genes to remove false positives, and projected them onto the tissue/cell type-specific modules. + +We have shown that the majority of the diseases and traits are associated with modules expressed in specific tissues and cell types (Figure 1A). + +The modules associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 1B). + +We have also shown that different diseases and traits tend to be associated with different modules in the same tissue or cell type (Figure 1C). + +We have also shown that modules associated with the same disease or trait tend to be expressed in different tissues and cell types (Figure 1D). + +We have shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 2A). + +We have also shown that different modules tend to be enriched for different genes (Figure 2B). + +We have also shown that different genes tend to be associated with different modules (Figure 2C). + +We have also shown that different genes tend to be associated with different diseases and traits (Figure 2D). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3A). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3B). + +We have shown that the modules associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3C). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3D). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3E). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3F). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3G). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3H). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3I). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3J). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3K). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3L). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3M). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3N). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3O). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3P). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3Q). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3R). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3S). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3T). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3U). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3V). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3W). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3X). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3Y). + +We have also shown that the genes associated with the highest number of diseases and traits are those expressed in the brain, the heart, the kidney, and the liver (Figure 3Z). diff --git a/content/04.05.01.crispr.md b/content/04.05.01.crispr.md index 3aa96201..06ed7fe4 100644 --- a/content/04.05.01.crispr.md +++ b/content/04.05.01.crispr.md @@ -1,11 +1,7 @@ ### LVs link genes that alter lipid accumulation with relevant traits and tissues -Our first experiment attempted to answer whether genes in a disease-relevant LV could represent potential therapeutic targets. -For this, the first step was to obtain a set of genes strongly associated with a phenotype of interest. -Therefore, we performed a fluorescence-based CRISPR-Cas9 in the HepG2 cell line and identified 462 genes associated with lipid regulation ([Methods](#sec:methods:crispr)). -From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids: -a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*; -and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary File 2). +To answer whether genes in a disease-relevant LV could represent potential therapeutic targets, we performed a fluorescence-based CRISPR-Cas9 in the HepG2 cell line and identified 462 genes associated with lipid regulation. +From these, we selected two high-confidence gene sets that either caused a decrease or increase of lipids: a lipids-decreasing gene-set with eight genes: *BLCAP*, *FBXW7*, *INSIG2*, *PCYT2*, *PTEN*, *SOX9*, *TCF7L2*, *UBE2J2*; and a lipids-increasing gene-set with six genes: *ACACA*, *DGAT2*, *HILPDA*, *MBTPS1*, *SCAP*, *SRPR* (Supplementary File 2). ![ @@ -28,15 +24,14 @@ RCP: locus regional colocalization probability. ](images/lvs_analysis/lv246/lv246.svg "LV246 TWAS plot"){#fig:lv246 width="100%"} -Next, we analyzed all 987 LVs using Fast Gene Set Enrichment Analysis (FGSEA) [@doi:10.1101/060012], and found 15 LVs nominally enriched (unadjusted *P* < 0.01) with these lipid-altering gene-sets (Supplementary Tables @tbl:sup:lipids_crispr:modules_enriched_increase and @tbl:sup:lipids_crispr:modules_enriched_decrease). +We analyzed all 987 LVs using Fast Gene Set Enrichment Analysis (FGSEA) [@doi:10.1101/060012], and found 15 LVs nominally enriched (unadjusted *P* < 0.01) with these lipid-altering gene-sets (Supplementary Tables @tbl:sup:lipids_crispr:modules_enriched_increase and @tbl:sup:lipids_crispr:modules_enriched_decrease). Among those with reliable sample metadata, LV246, the top LV associated with the lipids-increasing gene-set, contained genes mainly co-expressed in adipose tissue (Figure {@fig:lv246}a), which plays a key role in coordinating and regulating lipid metabolism. Using our regression framework across all traits in PhenomeXcan, we found that gene weights for this LV were predictive of gene associations for plasma lipids, high cholesterol, and Alzheimer's disease (Supplementary Table @tbl:sup:phenomexcan_assocs:lv246, FDR < 1e-23). These lipids-related associations also replicated across the 309 traits in eMERGE (Supplementary Table @tbl:sup:emerge_assocs:lv246), where LV246 was significantly associated with hypercholesterolemia (phecode: 272.11, FDR < 4e-9), hyperlipidemia (phecode: 272.1, FDR < 4e-7) and disorders of lipoid metabolism (phecode: 272, FDR < 4e-7). Two high-confidence genes from our CRISPR screening, *DGAT2* and *ACACA*, are responsible for encoding enzymes for triglycerides and fatty acid synthesis and were among the highest-weighted genes of LV246 (Figure {@fig:lv246}b, in boldface). -However, in contrast to other members of LV246, *DGAT2* and *ACACA* were not associated nor colocalized with any of the cardiovascular-related traits and thus would not have been prioritized by TWAS alone; -instead, other members of LV246, such as *SCD*, *LPL*, *FADS2*, *HMGCR*, and *LDLR*, were significantly associated and colocalized with lipid-related traits. +However, in contrast to other members of LV246, *DGAT2* and *ACACA* were not associated nor colocalized with any of the cardiovascular-related traits and thus would not have been prioritized by TWAS alone; instead, other members of LV246, such as *SCD*, *LPL*, *FADS2*, *HMGCR*, and *LDLR*, were significantly associated and colocalized with lipid-related traits. This lack of association of two high-confidence genes from our CRISPR screen might be explained from an omnigenic point of view [@doi:10.1016/j.cell.2019.04.014]. Assuming that the TWAS models for *DGAT2* and *ACACA* capture all common *cis*-eQTLs (the only genetic component of gene expression that TWAS can capture) and there are no rare *cis*-eQTLs, these two genes might represent "core" genes (i.e., they directly affect the trait with no mediated regulation of other genes), and many of the rest in the LV are "peripheral" genes that *trans*-regulate them. diff --git a/content/04.15.drug_disease_prediction.md b/content/04.15.drug_disease_prediction.md index feb3ac55..55e698e7 100644 --- a/content/04.15.drug_disease_prediction.md +++ b/content/04.15.drug_disease_prediction.md @@ -1,10 +1,11 @@ ### LVs predict drug-disease pairs better than single genes We next determined how substituting LVs for individual genes predicted known treatment-disease relationships. -For this, we used the transcriptional responses to small molecule perturbations profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049], which were further processed and mapped to DrugBank IDs [@doi:10.1093/nar/gkt1068; @doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47223]. +For this, we used the transcriptional responses to small molecule perturbations profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049]. +We further processed these responses and mapped them to DrugBank IDs [@doi:10.1093/nar/gkt1068; @doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47223]. Based on an established drug repurposing strategy that matches reversed transcriptome patterns between genes and drug-induced perturbations [@doi:10.1126/scitranslmed.3002648; @doi:10.1126/scitranslmed.3001318], we adopted a previously described framework that uses imputed transcriptomes from TWAS to prioritize drug candidates [@doi:10.1038/nn.4618]. For this, we computed a drug-disease score by calculating the negative dot product between the $z$-scores for a disease (from TWAS) and the $z$-scores for a drug (from LINCS) across sets of genes of different sizes (see [Methods](#sec:methods:drug)). -Therefore, a large score for a drug-disease pair indicated that higher (lower) predicted expression values of disease-associated genes are down (up)-regulated by the drug, thus predicting a potential treatment. +A large score for a drug-disease pair indicated that higher (lower) predicted expression values of disease-associated genes are down (up)-regulated by the drug, thus predicting a potential treatment. Similarly, for the LV-based approach, we estimated how pharmacological perturbations affected the gene module activity by projecting expression profiles of drugs into our latent representation (Figure {@fig:entire_process}b). We used a manually-curated gold standard set of drug-disease medical indications [@doi:10.7554/eLife.26726; @doi:10.5281/zenodo.47664] for 322 drugs across 53 diseases to evaluate the prediction performance. @@ -16,8 +17,8 @@ AUC: area under the curve; AP: average precision. ](images/drug_disease_prediction/roc_pr_curves.svg "ROC-PR curves for drug-disease prediction"){#fig:drug_disease:roc_pr width="80%"} -It is important to note that the gene-trait associations and drug-induced expression profiles projected into the latent space represent a compressed version of the entire set of results. -Despite this information loss, the LV-based method outperformed the gene-based one with an area under the curve of 0.632 and an average precision of 0.858 (Figure @fig:drug_disease:roc_pr). +The gene-trait associations and drug-induced expression profiles projected into the latent space represent a compressed version of the entire set of results. +However, the LV-based method outperformed the gene-based one with an area under the curve of 0.632 and an average precision of 0.858 (Figure @fig:drug_disease:roc_pr). The prediction results suggested that this low-dimensional space captures biologically meaningful patterns that can link pathophysiological processes with the mechanism of action of drugs. @@ -31,7 +32,7 @@ We observed that this compound was predicted by the gene-based and LV-based appr For AT, the LV-based approach predicted niacin as a therapeutic drug with a score of 0.52, whereas the gene-based method assigned a negative score of -0.01 (below the mean). Since LVs represent interpretable features associated with specific cell types, we analyzed which LVs positively contributed to these predictions (i.e., with an opposite direction between niacin and the disease). Notably, LV246 (Figure @fig:lv246), expressed in adipose tissue and liver and associated with plasma lipids and high cholesterol (Supplementary Table @tbl:sup:phenomexcan_assocs:lv246), was the 16th most important module in the prediction of niacin as a therapeutic drug for AT. -Besides the gold standard set, LV246 was among the top modules for other cardiovascular diseases, such as ischaemic heart disease (wide definition, 15th module) and high cholesterol (7th module). +Besides the gold standard set, LV246 was among the top modules for other cardiovascular diseases, such as ischemic heart disease (wide definition, 15th module) and high cholesterol (7th module). ![ **Top cell types/tissues where LV116's genes are expressed.** @@ -65,8 +66,7 @@ For example, *GPR109A/HCAR2* encodes a G protein-coupled high-affinity niacin re It was initially thought that the antiatherogenic effects of niacin were solely due to the inhibition of lipolysis in adipose tissue. However, it has been shown that nicotinic acid can reduce atherosclerosis progression independently of its antidyslipidemic activity by activating *GPR109A* in immune cells [@doi:10.1172/JCI41651], thus boosting anti-inflammatory processes [@doi:10.1161/ATVBAHA.108.179283]. In addition, flushing, a common adverse effect of niacin, is also produced by the activation of GPR109A in Langerhans cells (macrophages of the skin). -This alternative mechanism for niacin could have been hypothesized by examining the cell types where the top-contributing modules are expressed: -for instance, LV116 and LV931 (Figure @fig:lv116:cell_types, Supplementary Figure @fig:sup:lv931, and Supplementary Tables @tbl:sup:multiplier_pathways:lv116 and @tbl:sup:multiplier_pathways:lv931) were the top two modules for AT, with a strong signature in monocytes, macrophages, neutrophils, dendritic cells, among others. +This alternative mechanism for niacin could have been hypothesized by examining the cell types where the top-contributing modules are expressed: for instance, LV116 and LV931 (Figure @fig:lv116:cell_types, Supplementary Figure @fig:sup:lv931, and Supplementary Tables @tbl:sup:multiplier_pathways:lv116 and @tbl:sup:multiplier_pathways:lv931) were the top two modules for AT, with a strong signature in monocytes, macrophages, neutrophils, dendritic cells, among others. In Figure @fig:lv116:cell_types, it can be seen that LV116's genes are expressed as an immune response when these cell types are under different stimuli, such as diarrhea caused by different pathogens [@doi:10.1371/journal.pone.0192082], samples from multiple sclerosis or systemic lupus erythematosus [@doi:10.1371/journal.pone.0109760; @doi:10.1126/science.aac7442], or infected with different viruses (such as herpes simplex [@url:https://www.ncbi.nlm.nih.gov/bioproject/PRJNA258384], West Nile virus [@doi:10.3390/v5071664], *Salmonella typhimurium* [@doi:10.1038/srep16882], among others). These three LVs (LV246, LV116 and LV931) were among the top 20 modules contributing to the niacin prediction across different cardiovascular traits (Table @tbl:niacin:cardio:top_lvs). diff --git a/content/04.20.00.traits_clustering.md b/content/04.20.00.traits_clustering.md index 8e2df5fa..70733728 100644 --- a/content/04.20.00.traits_clustering.md +++ b/content/04.20.00.traits_clustering.md @@ -23,6 +23,17 @@ Instead of selecting one of these final solutions with a specific number of clus To understand which latent variables differentiated the group of traits, we trained a decision tree classifier on the input data $\hat{\mathbf{M}}$ using the clusters found as labels (Figure {@fig:clustering:design}b, see [Methods](#sec:methods:clustering)). +We used the projection of gene-trait associations into the latent space to find groups of clusters linked by the same transcriptional processes. +Since individual clustering algorithms have different biases (i.e., assumptions about the data structure), we designed a consensus clustering framework that combines solutions or partitions of traits generated by different methods ([Methods](#sec:methods:clustering)). +Consensus or ensemble approaches have been recommended to avoid several pitfalls when performing cluster analysis on biological data [@doi:10.1126/scisignal.aad1932]. +Since diversity in the ensemble is crucial for these methods, we generated different data versions which were processed using different methods with varying sets of parameters (Figure {@fig:clustering:design}a). +Then, a consensus function combines the ensemble into a consolidated solution, which has been shown to outperform any individual member of the ensemble [@Strehl2002; @doi:10.1109/TPAMI.2005.113]. +Our clustering pipeline generated 15 final consensus clustering solutions (Supplementary Figure @fig:sup:clustering:agreement). +The number of clusters of these partitions (between 5 to 29) was learned from the data by selecting the partitions with the largest agreement with the ensemble [@Strehl2002]. +Instead of selecting one of these final solutions with a specific number of clusters, we used a clustering tree [@doi:10.1093/gigascience/giy083] (Figure @fig:clustering:tree) to examine stable groups of traits across multiple resolutions. +To understand which latent variables differentiated the group of traits, we trained a decision tree classifier on the input data $\hat{\mathbf{M}}$ using the clusters found as labels (Figure {@fig:clustering:design}b, see [Methods](#sec:methods:clustering)). + + ![ **Clustering tree using multiple resolutions for clusters of traits.** Each row represents a partition/grouping of the traits, and each circle is a cluster from that partition. @@ -65,14 +76,11 @@ AD: Alzheimer's disease; ](images/clustering/clustering_tree.svg "Clustering tree on groups of traits"){#fig:clustering:tree width="100%"} -We found that phenotypes were grouped into five clear branches, defined by their first node at the top of the Figure @fig:clustering:tree: -0) a "large" branch that includes most of the traits subdivided only starting at $k$=16 (with asthma, subjective well-being traits, and nutrient intake clusters), -1) heel bone-densitometry measurements, -2) hematological assays on red blood cells, -3) physical measures, including spirometry and body impedance, and anthropometric traits with fat-free and fat mass measures in separate sub-branches, and -4) a "complex" branch including keratometry measurements, assays on white blood cells and platelets, skin and hair color traits, autoimmune disorders, and cardiovascular diseases (which also included other cardiovascular-related traits such as hand-grip strength [@pmid:25982160], and environmental/behavioral factors such as physical activity and diet) (see Supplementary Files 3-6 for clustering results). -Within these branches, results were relatively stable, with the same traits often clustered together across different resolutions. -Arrows between clusters show traits moving from one group to another, and this mainly happens between clusters within the "complex" branch (4) and between clusters from the "large" branch (0) to the "complex" branch. +We found that phenotypes were grouped into five clear branches. +The branches were defined by their first node at the top of the Figure @fig:clustering:tree. +The branches were: 1) a "large" branch that includes most of the traits, subdivided only starting at $k$=16 (with asthma, subjective well-being traits, and nutrient intake clusters), 2) heel bone-densitometry measurements, 3) hematological assays on red blood cells, 4) physical measures, including spirometry and body impedance, and anthropometric traits with fat-free and fat mass measures in separate sub-branches, and 5) a "complex" branch including keratometry measurements, assays on white blood cells and platelets, skin and hair color traits, autoimmune disorders, and cardiovascular diseases (which also included other cardiovascular-related traits such as hand-grip strength [@pmid:25982160], and environmental/behavioral factors such as physical activity and diet) (see Supplementary Files 3-6 for clustering results). +Within these branches, results were relatively stable, with the same traits often clustered together across different resolutions. +Arrows between clusters show traits moving from one group to another, and this mainly happens between clusters within the "complex" branch (4) and between clusters from the "large" branch (0) to the "complex" branch. This behavior is expected since complex diseases are usually associated with shared genetic and environmental factors and are thus hard to categorize into a single cluster. @@ -82,20 +90,19 @@ The plot shows a submatrix of $\hat{\mathbf{M}}$ for the main trait clusters at ](images/clustering/global_clustermap-plain.svg "Heatmap with gene modules and traits"){#fig:clustering:heatmap width="100%"} -Next, we analyzed which LVs were driving these clusters of traits. +Next, we analyzed which latent variables (LVs) were driving these clusters of traits. For this, we trained decision tree classifiers on the input data using each cluster at $k$=29 (bottom of Figure @fig:clustering:tree) as labels (see [Methods](#sec:methods:clustering)). This procedure yielded the top LVs that were most discriminative for each cluster. Several of these LVs were well-aligned to existing pathways (Figure @fig:clustering:heatmap), whereas others were not aligned to prior knowledge but still expressed in relevant tissues (Supplementary Figure @fig:sup:clustering:novel:heatmap). In Figure @fig:clustering:heatmap, it can be seen that some LVs were highly specific to certain traits, while others were associated with a wide range of different phenotypes, thus potentially involved in more general biological functions. We used our regression framework to determine whether these LVs were significantly associated with different traits. For example, LVs such as LV928 and LV30, which were well-aligned to early progenitors of the erythrocytes lineage [@doi:10.1016/j.cell.2011.01.004] (Supplementary Tables @tbl:sup:multiplier_pathways:lv928 and @tbl:sup:multiplier_pathways:lv30), were predominantly expressed in early differentiation stages of erythropoiesis (Supplementary Figures @fig:sup:lv928 and @fig:sup:lv30) and strongly associated with different assays on red blood cells (FDR < 0.05; Supplementary Tables @tbl:sup:phenomexcan_assocs:lv928, @tbl:sup:emerge_assocs:lv928, and @tbl:sup:emerge_assocs:lv30). -In contrast, other LVs were highly specific, such as LV730, which is expressed in thrombocytes from different cancer samples (Supplementary Figures @fig:sup:lv730 and Supplementary Table @tbl:sup:multiplier_pathways:lv730), and strongly associated with hematological assays on platelets (FDR < 0.05, Supplementary Table @tbl:sup:phenomexcan_assocs:lv730); -or LV598, whose genes were expressed in corneal endothelial cells (Supplementary Figures @fig:sup:lv598 and Supplementary Table @tbl:sup:multiplier_pathways:lv598) and associated with keratometry measurements (Supplementary Table @tbl:sup:phenomexcan_assocs:lv598). +In contrast, other LVs were highly specific, such as LV730, which is expressed in thrombocytes from different cancer samples (Supplementary Figures @fig:sup:lv730 and Supplementary Table @tbl:sup:multiplier_pathways:lv730), and strongly associated with hematological assays on platelets (FDR < 0.05, Supplementary Table @tbl:sup:phenomexcan_assocs:lv730); or LV598, whose genes were expressed in corneal endothelial cells (Supplementary Figures @fig:sup:lv598 and Supplementary Table @tbl:sup:multiplier_pathways:lv598) and associated with keratometry measurements (Supplementary Table @tbl:sup:phenomexcan_assocs:lv598). The sub-branches of autoimmune and cardiovascular diseases merged together at $k=10$ (middle of Figure @fig:clustering:tree), so we expected to find LVs that specifically affect one or both of these types of diseases. -For example, LV57, expressed in T cells (Supplementary Figure @fig:sup:lv57 and Supplementary Table @tbl:sup:multiplier_pathways:lv57), was the most strongly associated gene module with autoimmune disorders in PhenomeXcan (Supplementary Table @tbl:sup:phenomexcan_assocs:lv57), with significant associations with hypothyroidism that were replicated in eMERGE (@tbl:sup:emerge_assocs:lv57). -However, this LV was also strongly associated with deep venous thrombosis in both PhenomeXcan and eMERGE. +For example, LV57, expressed in T cells (Supplementary Figure @fig:sup:lv57 and Supplementary Table @tbl:sup:multiplier_pathways:lv57), was the most strongly associated gene module with autoimmune disorders in PhenomeXcan (Supplementary Table @tbl:sup:phenomexcan_assocs:lv57). +This LV was also strongly associated with deep venous thrombosis in both PhenomeXcan and eMERGE. On the other hand, LV844 was more autoimmune-specific, with associations to polymyalgia rheumatica, type 1 diabetes, rheumatoid arthritis, and celiac disease in PhenomeXcan (Supplementary Table @tbl:sup:phenomexcan_assocs:lv844). However, these did not replicate in eMERGE. This LV was expressed in a wide range of cell types, including blood, breast organoids, myeloma cells, lung fibroblasts, and different cell types from the brain (Supplementary Figure @fig:sup:lv844 and Supplementary Table @tbl:sup:multiplier_pathways:lv844). diff --git a/content/05.discussion.md b/content/05.discussion.md index 480c23c8..57025a94 100644 --- a/content/05.discussion.md +++ b/content/05.discussion.md @@ -4,11 +4,12 @@ We have introduced a novel computational strategy that integrates statistical as Our key innovation is that we project gene-trait associations through a latent representation derived not strictly from measures of normal tissue but also from cell types under a variety of stimuli and at various developmental stages. This improves interpretation by going beyond statistical associations to infer cell type-specific features of complex phenotypes. Our approach can identify disease-relevant cell types from summary statistics, and several disease-associated gene modules were replicated in eMERGE. + Using a CRISPR screen to analyze lipid regulation, we found that our gene module-based approach can prioritize causal genes even when single gene associations are not detected. We interpret these findings with an omnigenic perspective of "core" and "peripheral" genes, suggesting that the approach can identify genes that directly affect the trait with no mediated regulation of other genes and thus prioritize alternative and potentially more attractive therapeutic targets. -Using our gene module perspective, we also integrated drug-induced transcriptional profiles, which allowed us to connect diseases, drugs, and cell types. +We also integrated drug-induced transcriptional profiles, which allowed us to connect diseases, drugs, and cell types. We showed that the LV-based drug-repurposing approach outperformed the gene-based one when predicting drug-disease links for 322 drugs across 53 diseases. Furthermore, and beyond statistical prediction, we focused on cardiovascular traits and a particular drug, niacin, to show that the approach connects pathophysiological processes with known mechanisms of action, including those in adipose tissue, immune cells, and ovarian granulosa cells. Our LV-based approach could be helpful in generating novel hypotheses to evaluate potential mechanisms of action, or even adverse effects, of known or experimental drugs. @@ -23,8 +24,7 @@ Associations with such cell type marker genes may reveal potentially causal cell We observed modules expressed primarily in one tissue (such as adipose in LV246 or ovary in LV66). Others appeared to be expressed in many contexts, which may capture pathways associated with related complex diseases. For example, LV136 is associated with cardiovascular disease and measures of corneal biomechanics and is expressed in fibroblasts, osteoblasts, pancreas, liver, and cardiomyocytes, among others. -Other examples include LV844, expressed in whole blood samples and associated with a range of autoimmune diseases; -or LV57, which is clearly expressed in T cells and strongly associated with autoimmune and venous thromboembolism. +Other examples include LV844, expressed in whole blood samples and associated with a range of autoimmune diseases; or LV57, which is clearly expressed in T cells and strongly associated with autoimmune and venous thromboembolism. From an omnigenic point of view, these patterns might represent cases of "network pleiotropy," where the same cell types mediate molecularly related traits. To our knowledge, projection through a representation learned on complementary but distinct datasets is a novel approach to identifying cell type and pathway effects on complex phenotypes that is computationally simple to implement. @@ -37,7 +37,7 @@ Although the potential issues derived from this data heterogeneity were addresse Considering groups of related diseases was previously shown to be more powerful in detecting shared genetic etiology [@doi:10.1038/ng.3985; @doi:10.1038/s41588-018-0121-0], and clustering trees provide a way to explore such relationships in the context of latent variables. -Finally, we developed an LV-based regression framework to detect whether gene modules are associated with a trait using TWAS $p$-values. +Finally, we developed a regression framework to detect whether gene modules are associated with a trait using TWAS $p$-values. We used PhenomeXcan as a discovery cohort across four thousand traits, and many LV-trait associations replicated in eMERGE. In PhenomeXcan, we found 3,450 significant LV-trait associations (FDR < 0.05) with 686 LVs (out of 987) associated with at least one trait and 1,176 traits associated with at least one LV. In eMERGE, we found 196 significant LV-trait associations, with 116 LVs associated with at least one trait/phecode and 81 traits with at least one LV. @@ -62,9 +62,9 @@ This is due to sharing of GWAS variants in gene expression models, correlated ex Our LV-based regression framework, however, accounts for these gene-gene correlations in TWAS reasonably well. -Our findings are concordant with previous studies showing that drugs with genetic support are more likely to succeed through the drug development pipeline [@doi:10.1038/ng.3314; @doi:10.1038/nn.4618]. +Our findings are consistent with previous studies showing that drugs with genetic support are more likely to succeed through the drug development pipeline [@doi:10.1038/ng.3314; @doi:10.1038/nn.4618]. In this case, projecting association results through latent variables better prioritized disease-treatment pairs than considering single-gene effects alone. -An additional benefit is that the latent variables driving predictions represent interpretable genetic features that can be examined to infer potential mechanisms of action. +An additional benefit is that the latent variables driving predictions are interpretable genetic features that can be examined to infer potential mechanisms of action. Here we prioritized drugs for diseases with very different tissue etiologies, and a challenge of the approach is to select the most appropriate tissue model from TWAS to find reversed transcriptome patterns between genes and drug-induced perturbations. diff --git a/content/07.00.methods.md b/content/07.00.methods.md index 2ae3f33e..db631243 100644 --- a/content/07.00.methods.md +++ b/content/07.00.methods.md @@ -2,12 +2,39 @@ PhenoPLIER is a framework that combines different computational approaches to integrate gene-trait associations and drug-induced transcriptional responses with groups of functionally-related genes (referred to as gene modules or latent variables/LVs). Gene-trait associations are computed using the PrediXcan family of methods, whereas latent variables are inferred by the MultiPLIER models applied on large gene expression compendia. -PhenoPLIER provides -1) a regression model to compute an LV-trait association, -2) a consensus clustering approach applied to the latent space to learn shared and distinct transcriptomic properties between traits, and -3) an interpretable, LV-based drug repurposing framework. +PhenoPLIER provides 1) a regression model to compute an LV-trait association, 2) a consensus clustering approach applied to the latent space to learn shared and distinct transcriptomic properties between traits, and 3) an interpretable, LV-based drug repurposing framework. We provide the details of these methods below. +## Regression model + +The PhenoPLIER regression model simultaneously explains the association between gene-trait associations and drug-induced transcriptional responses with groups of functionally-related genes (referred to as gene modules or latent variables/LVs). +In this model, we assume that gene-trait associations ($Y$) and drug-induced transcriptional responses ($D$) are both influenced by the latent variables ($X$) and observational noise ($\epsilon$). + +$$ Y = X \beta_Y + \epsilon_Y $$ +$$ D = X \beta_D + \epsilon_D $$ + +where $\beta_Y$ and $\beta_D$ are the regression coefficients for the gene-trait associations and drug-induced transcriptional responses, respectively. + +We assume that the latent variables are independent, and that the gene-trait associations and drug-induced transcriptional responses are independent of each other. + +$$ X \sim \mathcal{N}(0, \Sigma) $$ +$$ \epsilon_Y \sim \mathcal{N}(0, \sigma_Y^2) $$ +$$ \epsilon_D \sim \mathcal{N}(0, \sigma_D^2) $$ +$$ \epsilon_Y \perp \epsilon_D $$ + +where $\Sigma$ is the covariance matrix of the latent variables, and $\sigma_Y^2$ and $\sigma_D^2$ are the variances of the observational noise of the gene-trait associations and drug-induced transcriptional responses, respectively. + +We further assume that the gene-trait associations and drug-induced transcriptional responses are normally distributed. + +$$ Y \sim \mathcal{N}(X \beta_Y, \sigma_Y^2) $$ +$$ D \sim \mathcal{N}(X \beta_D, \sigma_D^2) $$ + +We estimate the parameters of the model using the EM algorithm. + +## Consensus clustering + +## Drug repurposing + ### The PrediXcan family of methods for gene-based associations {#sec:methods:predixcan} @@ -18,9 +45,8 @@ In contrast, S-MultiXcan, the summary-based version of MultiXcan, computes the j S-PrediXcan and S-MultiXcan only need GWAS summary statistics instead of individual-level genotype and phenotype data. Here we briefly provide the details about these TWAS methods that are necessary to explain our regression framework later (see the referenced articles for more information). -In the following, we refer to $\mathbf{y}$ as a vector of traits for $n$ individuals that is centered for convenience (so that no intercept is necessary); -$\mathbf{\tilde{t}}_l = \sum_{a \in \mathrm{model}_l} w_{a}^{l} X_{a}$ is the gene's predicted expression for all individuals in tissue $l$, $X_a$ is the genotype of SNP $a$ and $w_{a}$ its weight in the tissue prediction model $l$; -and $\mathbf{t}_l$ is the standardized version of $\mathbf{\tilde{t}}_l$ with mean equal to zero and standard deviation equal to one. + +In the following, we refer to $\mathbf{y}$ as a vector of traits for $n$ individuals that is centered for convenience (so that no intercept is necessary); $\mathbf{\tilde{t}}_l = \sum_{a \in \mathrm{model}_l} w_{a}^{l} X_{a}$ is the gene's predicted expression for all individuals in tissue $l$, $X_a$ is the genotype of SNP $a$ and $w_{a}$ its weight in the tissue prediction model $l$; and $\mathbf{t}_l$ is the standardized version of $\mathbf{\tilde{t}}_l$ with mean equal to zero and standard deviation equal to one. S-PrediXcan [@doi:10.1038/s41467-018-03621-1] is the summary version of PrediXcan [@doi:10.1038/ng.3367]. PrediXcan models the trait as a linear function of the gene's expression on a single tissue using the univariate model @@ -31,7 +57,7 @@ $$ {#eq:predixcan} where $\hat{\gamma}_l$ is the estimated effect size or regression coefficient, and $\bm{\epsilon}_l$ are the error terms with variance $\sigma_{\epsilon}^{2}$. The significance of the association is assessed by computing the $z$-score $\hat{z}_{l}=\hat{\gamma}_l / \mathrm{se}(\hat{\gamma}_l)$ for a gene's tissue model $l$. -PrediXcan needs individual-level data to fit this model, whereas S-PrediXcan approximates PrediXcan $z$-scores using only GWAS summary statistics with the expression +PrediXcan needs individual-level data to fit this model, whereas S-PrediXcan approximates PrediXcan $z$-scores using only GWAS summary statistics with the expression, $$ \hat{z}_{l} \approx \sum_{a \in model_{l}} w_a^l \frac{\hat{\sigma}_a}{\hat{\sigma}_l} \frac{\hat{\beta}_a}{\mathrm{se}(\hat{\beta}_a)}, @@ -48,26 +74,23 @@ Its main output is the $p$-value (obtained with an F-test) of the multiple tissu $$ \begin{split} \mathbf{y} & = \sum_{l=1}^{p} \mathbf{t}_l g_l + \mathbf{e} \\ - & = \mathbf{T} \mathbf{g} + \mathbf{e}, +& = \mathbf{T} \mathbf{g} + \mathbf{e}, \end{split} $$ {#eq:multixcan} -where $\mathbf{T}$ is a matrix with $p$ columns $\mathbf{t}_l$, -$\hat{g}_l$ is the estimated effect size for the predicted gene expression in tissue $l$ (and thus $\mathbf{\hat{g}}$ is a vector with $p$ estimated effect sizes $\hat{g}_l$), -and $\mathbf{e}$ are the error terms with variance $\sigma_{e}^{2}$. +where $\mathbf{T}$ is a matrix with $p$ columns $\mathbf{t}_l$, $\hat{g}_l$ is the estimated effect size for the predicted gene expression in tissue $l$ (and thus $\mathbf{\hat{g}}$ is a vector with $p$ estimated effect sizes $\hat{g}_l$), and $\mathbf{e}$ are the error terms with variance $\sigma_{e}^{2}$. Given the high correlation between predicted expression values for a gene across different tissues, MultiXcan uses the principal components (PCs) of $\mathbf{T}$ to avoid collinearity issues. S-MultiXcan derives the joint regression estimates (effect sizes and their variances) in Equation (@eq:multixcan) using the marginal estimates from S-PrediXcan in Equation (@eq:spredixcan). -Under the null hypothesis of no association, $\mathbf{\hat{g}}^{\top} \frac{\mathbf{T}^{\top}\mathbf{T}}{\sigma_{e}^{2}} \mathbf{\hat{g}} \sim \chi_{p}^{2}$, and therefore the significance of the association in S-MultiXcan is estimated with +Under the null hypothesis of no association, $\mathbf{\hat{g}}^{\top} \frac{\mathbf{T}^{\top}\mathbf{T}}{\sigma_{e}^{2}} \mathbf{\hat{g}} \sim \chi_{p}^{2}$, and therefore the significance of the association in S-MultiXcan is estimated with: $$ \begin{split} \frac{\mathbf{\hat{g}}^{\top} (\mathbf{T}^{\top}\mathbf{T}) \mathbf{\hat{g}}}{\sigma_{e}^{2}} & \approx \bm{\hat{\gamma}}^{\top} \frac{\sqrt{n-1}}{\sigma_{\epsilon}} \left(\frac{\mathbf{T}^{\top} \mathbf{T}}{n-1}\right)^{-1} \frac{\sqrt{n-1}}{\sigma_{\epsilon}} \bm{\hat{\gamma}} \\ - & = \mathbf{\hat{z}}^{\top} Cor(\mathbf{T})^{-1} \mathbf{\hat{z}}, +& = \mathbf{\hat{z}}^{\top} Cor(\mathbf{T})^{-1} \mathbf{\hat{z}}, \end{split} $$ {#eq:smultixcan} -where $\mathbf{\hat{z}}$ is a vector with $p$ $z$-scores (Equation (@eq:spredixcan)) for each tissue available for the gene, -and $Cor(\mathbf{T})$ is the autocorrelation matrix of $\mathbf{T}$. +where $\mathbf{\hat{z}}$ is a vector with $p$ $z$-scores (Equation (@eq:spredixcan)) for each tissue available for the gene, and $Cor(\mathbf{T})$ is the autocorrelation matrix of $\mathbf{T}$. Since $\mathbf{T}^{\top}\mathbf{T}$ is singular for many genes, S-MultiXcan computes the pseudo-inverse $Cor(\mathbf{T})^{+}$ using the $k$ top PCs, and thus $\mathbf{\hat{z}}^{\top} Cor(\mathbf{T})^{+} \mathbf{\hat{z}} \sim \chi_k^2$. To arrive at this expression, S-MultiXcan uses the conservative approximation $\sigma_{e}^{2} \approx \sigma_{\epsilon}^{2}$, that is, the variance of the error terms in the joint regression is approximately equal to the residual variance of the marginal regressions. Another important point is that $Cor(\mathbf{T})$ is estimated using a global genotype covariance matrix, whereas marginal $\hat{z}_l$ in Equation (@eq:spredixcan) are approximated using tissue-specific genotype covariances. @@ -81,11 +104,12 @@ We used S-MultiXcan results for our LV-based regression model and our cluster an We used two large TWAS resources from different cohorts for discovery and replication, all obtained from European ancestries. PhenomeXcan [@doi:10.1126/sciadv.aba2083], our discovery cohort, provides results on 4,091 traits across different categories. Supplemenetary File 1 has all the details about the included GWAS, sample size and disease/trait categories. -In PhenomeXcan, these publicly available GWAS summary statistics were used to compute -1) gene-based associations with the PrediXcan family of methods (described before), and -2) a posterior probability of colocalization between GWAS loci and *cis*-eQTL with fastENLOC [@doi:10.1126/sciadv.aba2083; @doi:10.1016/j.ajhg.2020.11.012]. + +In PhenomeXcan, these publicly available GWAS summary statistics were used to compute 1) gene-based associations with the PrediXcan family of methods (described before), and 2) a posterior probability of colocalization between GWAS loci and *cis*-eQTL with fastENLOC [@doi:10.1126/sciadv.aba2083; @doi:10.1016/j.ajhg.2020.11.012]. + We refer to the matrix of $z$-scores from S-PrediXcan (Equation (@eq:spredixcan)) across $q$ traits and $m$ genes in tissue $t$ as $\mathbf{M}^{t} \in \mathbb{R}^{q \times m}$. As explained later, matrices $\mathbf{M}^{t}$ were used in our LV-based drug repurposing framework since they provide direction of effects. + The S-MultiXcan results (22,515 gene associations across 4,091 traits) were used in our LV-based regression framework and our cluster analyses of traits. For the cluster analyses, we used the $p$-values converted to $z$-scores: $\mathbf{M}=\Phi^{-1}(1 - p/2)$, where $\Phi^{-1}$ is the probit function. Higher $z$-scores correspond to stronger associations. @@ -96,27 +120,53 @@ We used these results to replicate the associations found with our LV-based regr ### MultiPLIER and Pathway-level information extractor (PLIER) {#sec:methods:multiplier} -MultiPLIER [@doi:10.1016/j.cels.2019.04.003] extracts patterns of co-expressed genes from recount2 [@doi:10.1038/nbt.3838] (without including GTEx samples), a large gene expression dataset. -The approach applies the pathway-level information extractor method (PLIER) [@doi:10.1038/s41592-019-0456-1], which performs unsupervised learning using prior knowledge (canonical pathways) to reduce technical noise. +MultiPLIER extracts patterns of co-expressed genes from recount2 (without including GTEx samples), a large gene expression dataset. +The approach applies the pathway-level information extractor method (PLIER), which performs unsupervised learning using prior knowledge (canonical pathways) to reduce technical noise. PLIER uses a matrix factorization approach that deconvolutes gene expression data into a set of latent variables (LV), where each LV represents a gene module. The MultiPLIER models reduced the dimensionality in recount2 to 987 LVs. -Given a gene expression dataset $\mathbf{Y}^{m \times c}$ with $m$ genes and $c$ experimental conditions and a prior knowledge matrix $\mathbf{C} \in \{0,1\}^{m \times p}$ for $p$ MSigDB pathways [@doi:10.1016/j.cels.2015.12.004] (so that $\mathbf{C}_{ij} = 1$ if gene $i$ belongs to pathway $j$), PLIER finds $\mathbf{U}$, $\mathbf{Z}$, and $\mathbf{B}$ minimizing +Given a gene expression dataset $\mathbf{Y}^{m \times c}$ with $m$ genes and $c$ experimental conditions and a prior knowledge matrix $\mathbf{C} \in \{0,1\}^{m \times p}$ for $p$ MSigDB pathways [@doi:10.1016/j.cels.2015.12.004] (so that $\mathbf{C}_{ij} = 1$ if gene $i$ belongs to pathway $j$), PLIER finds $\mathbf{U}$, $\mathbf{Z}$, and $\mathbf{B}$ minimizing: $$ ||\mathbf{Y} - \mathbf{Z}\mathbf{B}||^{2}_{F} + \lambda_1 ||\mathbf{Z} - \mathbf{C}\mathbf{U}||^{2}_{F} + \lambda_2 ||\mathbf{B}||^{2}_{F} + \lambda_3 ||\mathbf{U}||_{L^1} $$ {#eq:met:plier_func} -subject to $\mathbf{U}>0, \mathbf{Z}>0$; -$\mathbf{Z}^{m \times l}$ are the gene loadings with $l$ latent variables, -$\mathbf{B}^{l \times c}$ is the latent space for $c$ conditions, -$\mathbf{U}^{p \times l}$ specifies which of the $p$ prior-information pathways in $\mathbf{C}$ are represented for each LV, -and $\lambda_i$ are different regularization parameters used in the training step. +subject to $\mathbf{U}>0, \mathbf{Z}>0$; $\mathbf{Z}^{m \times l}$ are the gene loadings with $l$ latent variables, $\mathbf{B}^{l \times c}$ is the latent space for $c$ conditions, $\mathbf{U}^{p \times l}$ specifies which of the $p$ prior-information pathways in $\mathbf{C}$ are represented for each LV, and $\lambda_i$ are different regularization parameters used in the training step. $\mathbf{Z}$ is a low-dimensional representation of the gene space where each LV aligns as much as possible to prior knowledge, and it might represent either a known or novel gene module (i.e., a meaningful biological pattern) or noise. +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + +PLIER was used to extract gene modules from the gene expression profiles of the human brain samples in the GTEx project [@doi:10.1038/nature14963]. +The GTEx project [@doi:10.1038/nature14963] has generated gene expression profiles from 53 tissue types from over 700 postmortem human donors. +The gene expression profiles were generated using RNA-seq, which is a widely used method for measuring gene expression [@doi:10.1038/nrg2822]. + For our drug repurposing and cluster analyses, we used this model to project gene-trait (from TWAS) and gene-drug associations (from LINCS L1000) into this low-dimensional gene module space. For instance, TWAS associations $\mathbf{M}$ (either from S-PrediXcan or S-MultiXcan) were projected using +$$ +\begin{aligned} +\mathbf{M} &= \mathbf{X} \mathbf{B} \\ +\mathbf{M}^{l \times q} &= \mathbf{X}^{n \times l} \mathbf{B}^{l \times q} +\end{aligned} +$$ {#eq:twas} $$ \hat{\mathbf{M}} = (\mathbf{Z}^{\top} \mathbf{Z} + \lambda_{2} \mathbf{I})^{-1} \mathbf{Z}^{\top} \mathbf{M}, $$ {#eq:proj} @@ -135,17 +185,11 @@ $$ \mathbf{m}=\beta_{0} + \mathbf{s} \beta_{s} + \sum_{i} \mathbf{x}_{i} \beta_{i} + \bm{\epsilon}, $$ {#eq:reg:model} -where $\mathbf{m}$ is a vector of S-MultiXcan gene $p$-values for a trait (with a $-log_{10}$ transformation); -$\mathbf{s}$ is a binary indicator vector with $s_{\ell}=1$ for the top 1% of genes with the largest loadings for LV $\ell$ (from $\mathbf{Z}_{\ell}$) and zero otherwise; -$\mathbf{x}_{i}$ is a gene property used as a covariate; -$\beta$ are effect sizes (with $\beta_{0}$ as the intercept); -and $\bm{\epsilon} \sim \mathrm{MVN}(0, \sigma^{2} \mathbf{R})$ is a vector of error terms with a multivariate normal distribution (MVN) where $\mathbf{R}$ is the matrix of gene correlations. +where $\mathbf{m}$ is a vector of S-MultiXcan gene $p$-values for a trait (with a $-log_{10}$ transformation); $\mathbf{s}$ is a binary indicator vector with $s_{\ell}=1$ for the top 1% of genes with the largest loadings for LV $\ell$ (from $\mathbf{Z}_{\ell}$) and zero otherwise; $\mathbf{x}_{i}$ is a gene property used as a covariate; $\beta$ are effect sizes (with $\beta_{0}$ as the intercept); and $\bm{\epsilon} \sim \mathrm{MVN}(0, \sigma^{2} \mathbf{R})$ is a vector of error terms with a multivariate normal distribution (MVN) where $\mathbf{R}$ is the matrix of gene correlations. The model tests the null hypothesis $\beta_{s} = 0$ against the one-sided hypothesis $\beta_{s} > 0$. Therefore, $\beta_{s}$ reflects the difference in trait associations between genes that are part of LV $\ell$ and genes outside of it. -Following the MAGMA framework, we used two gene properties as covariates: -1) *gene size*, defined as the number of PCs retained in S-MultiXcan, -and 2) *gene density*, defined as the ratio of the number of PCs to the number of tissues available. +Following the MAGMA framework, we used two gene properties as covariates: 1) *gene size*, defined as the number of PCs retained in S-MultiXcan, and 2) *gene density*, defined as the ratio of the number of PCs to the number of tissues available. Since the error terms $\bm{\epsilon}$ could be correlated, we cannot assume they have independent normal distributions as in a standard linear regression model. In the PrediXcan family of methods, the predicted expression of a pair of genes could be correlated if they share eQTLs or if these are in LD [@doi:10.1038/s41588-019-0385-z]. @@ -164,9 +208,7 @@ $$ \end{split} $$ {#eq:reg:r} -where columns $\mathbf{P}$ are standardized, -$\mathrm{Tr}$ is the trace of a matrix, -and the cross-correlation matrix between PCs $Cor(\mathbf{P}_{i}, \mathbf{P}_{j}) \in \mathbb{R}^{k_i \times k_j}$ is given by +where columns $\mathbf{P}$ are standardized, $\mathrm{Tr}$ is the trace of a matrix, and the cross-correlation matrix between PCs $Cor(\mathbf{P}_{i}, \mathbf{P}_{j}) \in \mathbb{R}^{k_i \times k_j}$ is given by $$ \begin{split} @@ -175,37 +217,34 @@ Cor(\mathbf{P}_{i}, \mathbf{P}_{j}) & = Cor(\mathbf{T}_{i} \mathbf{V}_{i}^{\top} \end{split} $$ {#eq:reg:cor_pp} -where $\frac{\mathbf{T}_{i}^{\top} \mathbf{T}_{j}}{n-1} \in \mathbb{R}^{p_i \times p_j}$ is the cross-correlation matrix between the predicted expression levels of genes $i$ and $j$, -and columns of $\mathbf{V}_{i}$ and scalars $\lambda_i$ are the eigenvectors and eigenvalues of $\mathbf{T}_{i}$, respectively. +where $\frac{\mathbf{T}_{i}^{\top} \mathbf{T}_{j}}{n-1} \in \mathbb{R}^{p_i \times p_j}$ is the cross-correlation matrix between the predicted expression levels of genes $i$ and $j$, and columns of $\mathbf{V}_{i}$ and scalars $\lambda_i$ are the eigenvectors and eigenvalues of $\mathbf{T}_{i}$, respectively. S-MultiXcan keeps only the top eigenvectors using a condition number threshold of $\frac{\max(\lambda_i)}{\lambda_i} < 30$. To estimate the correlation of predicted expression levels for genes $i$ in tissue $k$ and gene $j$ in tissue $l$, $(\mathbf{t}_k^i, \mathbf{t}_l^j)$ ($\mathbf{t}_k^i$ is the $k$th column of $\mathbf{T}_{i}$), we used [@doi:10.1371/journal.pgen.1007889] $$ \begin{split} \frac{(\mathbf{T}_{i}^{\top} \mathbf{T}_{j})_{kl}}{n-1} & = Cor(\mathbf{t}_k^i, \mathbf{t}_l^j) \\ - & = \frac{ Cov(\mathbf{t}_k, \mathbf{t}_l) } { \sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ - & = \frac{ Cov(\sum_{a \in \mathrm{model}_k} w_a^k X_a, \sum_{b \in \mathrm{model}_l} w_b^l X_b) } {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ - & = \frac{ \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_l} w_a^k w_b^l Cov(X_a, X_b)} {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ - & = \frac{ \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_l} w_a^k w_b^l \Gamma_{ab}} {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} }, +& = \frac{ Cov(\mathbf{t}_k, \mathbf{t}_l) } { \sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ +& = \frac{ Cov(\sum_{a \in \mathrm{model}_k} w_a^k X_a, \sum_{b \in \mathrm{model}_l} w_b^l X_b) } {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ +& = \frac{ \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_l} w_a^k w_b^l Cov(X_a, X_b)} {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} } \\ +& = \frac{ \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_l} w_a^k w_b^l \Gamma_{ab}} {\sqrt{\widehat{\mathrm{var}}(\mathbf{t}_k) \widehat{\mathrm{var}}(\mathbf{t}_l)} }, \end{split} $$ {#eq:reg:corr_genes} -where $X_a$ is the genotype of SNP $a$, -$w_a^k$ is the weight of SNP $a$ for gene expression prediction in the tissue model $k$, -and $\Gamma = \widehat{\mathrm{var}}(\mathbf{X}) = (\mathbf{X} - \mathbf{\bar{X}})^{\top} (\mathbf{X} - \mathbf{\bar{X}}) / (n-1)$ is the genotype covariance matrix using GTEx v8 as the reference panel, which is the same used in all TWAS methods described here. +where $X_a$ is the genotype of SNP $a$, $w_a^k$ is the weight of SNP $a$ for gene expression prediction in the tissue model $k$, and $\Gamma = \widehat{\mathrm{var}}(\mathbf{X}) = (\mathbf{X} - \mathbf{\bar{X}})^{\top} (\mathbf{X} - \mathbf{\bar{X}}) / (n-1)$ is the genotype covariance matrix using GTEx v8 as the reference panel, which is the same used in all TWAS methods described here. The variance of the predicted expression values of gene $i$ in tissue $k$ is estimated as [@doi:10.1038/s41467-018-03621-1]: $$ \begin{split} \widehat{\mathrm{var}}(\mathbf{t}_k^i) & = (\mathbf{W}^k)^\top \Gamma^k \mathbf{W}^k \\ - & = \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_k} w_a^k w_b^k \Gamma_{ab}^k. +& = \sum_{a \in \mathrm{model}_k \\ b \in \mathrm{model}_k} w_a^k w_b^k \Gamma_{ab}^k. \end{split} $$ {#eq:reg:var_gene} -Note that, since we used the MultiXcan regression model (Equation (@eq:multixcan)), $\mathbf{R}$ is only an approximation of gene correlations in S-MultiXcan. +Note that, since we used the MultiXcan regression model (Equation (@eq:multixcan)), $$\mathbf{R}$$ is only an approximation of gene correlations in S-MultiXcan. As explained before, S-MultiXcan approximates the joint regression parameters in MultiXcan using the marginal regression estimates from S-PrediXcan in (@eq:spredixcan) with some simplifying assumptions and different genotype covariance matrices. -This complicates the derivation of an S-MultiXcan-specific solution to compute $\mathbf{R}$. -To account for this, we used a submatrix $\mathbf{R}_{\ell}$ corresponding to genes that are part of LV $\ell$ only (top 1% of genes) instead of the entire matrix $\mathbf{R}$. +This complicates the derivation of an S-MultiXcan-specific solution to compute $$\mathbf{R}$$. +To account for this, we used a submatrix $$\mathbf{R}_{\ell}$$ corresponding to genes that are part of LV $$\ell$$ only (top 1% of genes) instead of the entire matrix $$\mathbf{R}$$. This simplification is conservative since correlations are accounted for top genes only. Our simulations ([Supplementary Note 1](#sm:reg:null_sim)) show that the model is approximately well-calibrated and can correct for LVs with adjacent and highly correlated genes at the top (e.g., Figure @fig:reg:nulls:qqplot:lv234). The model can also detect LVs associated with relevant traits (Figure @fig:lv246 and Table @tbl:sup:phenomexcan_assocs:lv246) that are replicated in a different cohort (Table @tbl:sup:emerge_assocs:lv246). @@ -224,6 +263,7 @@ We adjusted the $p$-values using the Benjamini-Hochberg procedure. ### LV-based drug repurposing approach {#sec:methods:drug} For the drug-disease prediction, we derived an LV-based method based on a drug repositioning framework previously used for psychiatry traits [@doi:10.1038/nn.4618], where individual/single genes associated with a trait are anticorrelated with expression profiles for drugs. + We compared our LV-based method with this previously published, single-gene approach. For the single-gene method, we computed a drug-disease score by multiplying each S-PrediXcan set of signed $z$-scores in tissue $t$, $\mathbf{M}^t$, with another set of signed $z$-scores from transcriptional responses profiled in LINCS L1000 [@doi:10.1016/j.cell.2017.10.049], $\mathbf{L}^{c \times m}$ (for $c$ compounds). Here $\mathbf{M}^t$ contains information about whether a higher or lower predicted expression of a gene is associated with disease risk, whereas $\mathbf{L}$ indicates whether a drug increases or decreases the expression of a gene. @@ -243,28 +283,36 @@ Since the gold standard of drug-disease medical indications is described with Di ### Consensus clustering of traits {#sec:methods:clustering} We performed two preprocessing steps on the S-MultiXcan results before the cluster analysis. -First, we combined results in $\mathbf{M}$ (with $p$-values converted to $z$-scores, as described before) for traits that mapped to the same Experimental Factor Ontology (EFO) [@doi:10.1093/bioinformatics/btq099] term using the Stouffer's method: $\sum w_i M_{ij} / \sqrt{\sum w_i^2}$, where $w_i$ is a weight based on the GWAS sample size for trait $i$, and $M_{ij}$ is the $z$-score for gene $j$. -Second, we divided all $z$-scores for each trait $i$ by their sum to reduce the effect of highly polygenic traits: $M_{ij} / \sum M_{ij}$. + +First, we combined results in $\mathbf{M}$ (with $p$-values converted to $z$-scores, as described before) for traits that mapped to the same Experimental Factor Ontology (EFO) [@doi:10.1093/bioinformatics/btq099] term using the Stouffer's method: + +$$ \sum w_i M_{ij} / \sqrt{\sum w_i^2} $$ {#eq:stouffers} + +where $w_i$ is a weight based on the GWAS sample size for trait $i$, and $M_{ij}$ is the $z$-score for gene $j$. + +Second, we divided all $z$-scores for each trait $i$ by their sum to reduce the effect of highly polygenic traits: + +$$ M_{ij} / \sum M_{ij} $$ {#eq:normalize} + Finally, we projected this data matrix using Equation (@eq:proj), obtaining $\hat{\mathbf{M}}$ with $n$=3,752 traits and $l$=987 LVs as the input of our clustering pipeline. A partitioning of $\hat{\mathbf{M}}$ with $n$ traits into $k$ clusters is represented as a label vector $\pi \in \mathbb{N}^n$. Consensus clustering approaches consist of two steps: -1) the generation of an ensemble $\Pi$ with $r$ partitions of the dataset: $\Pi=\{\pi_1, \pi_2, \ldots, \pi_r\}$, -and 2) the combination of the ensemble into a consolidated solution defined as: + +1. +the generation of an ensemble $\Pi$ with $r$ partitions of the dataset: $\Pi=\{\pi_1, \pi_2, \ldots, \pi_r\}$, and +2. +the combination of the ensemble into a consolidated solution defined as: $$ -\pi^* = \mathrm{arg}\,\underset{\hat{\pi}}{\max} Q(\{ \lvert \mathcal{L}^i \lvert \phi(\hat{\pi}_{\mathcal{L}^i}, \pi_{i \mathcal{L}^i}) \mid i \in \{1,\ldots,r\} \}), +\pi^* = \mathrm{arg}\,\underset{\hat{\pi}}{\max} Q(\{ \lvert \mathcal{L}^i \lvert \phi(\hat{\pi}_{\mathcal{L}^i}, \pi_{i \mathcal{L}^i}) \mid i \in \{1,\ldots,r\} \}), \text{ where } $$ {#eq:consensus:obj_func} -where $\mathcal{L}^i$ is a set of data indices with known cluster labels for partition $i$, -$\phi\colon \mathbb{N}^n \times \mathbb{N}^n \to \mathbb{R}$ is a function that measures the similarity between two partitions, -and $Q$ is a measure of central tendency, such as the mean or median. +$\mathcal{L}^i$ is a set of data indices with known cluster labels for partition $i$, $\phi\colon \mathbb{N}^n \times \mathbb{N}^n \to \mathbb{R}$ is a function that measures the similarity between two partitions, and $Q$ is a measure of central tendency, such as the mean or median. We used the adjusted Rand index (ARI) [@doi:10.1007/BF01908075] for $\phi$ and the median for $Q$. To obtain $\pi^*$, we define a consensus function $\Gamma\colon \mathbb{N}^{n \times r} \to \mathbb{N}^n$ with $\Pi$ as the input. -We used consensus functions based on the evidence accumulation clustering (EAC) paradigm [@doi:10.1109/TPAMI.2005.113], where $\Pi$ is first transformed into a distance matrix -$\mathbf{D}_{ij} = d_{ij} / r$, -where $d_{ij}$ is the number of times traits $i$ and $j$ were grouped in different clusters across all $r$ partitions in $\Pi$. +We used consensus functions based on the evidence accumulation clustering (EAC) paradigm [@doi:10.1109/TPAMI.2005.113], where $\Pi$ is first transformed into a distance matrix $\mathbf{D}_{ij} = d_{ij} / r$, where $d_{ij}$ is the number of times traits $i$ and $j$ were grouped in different clusters across all $r$ partitions in $\Pi$. Then, $\Gamma$ can be any similarity-based clustering algorithm, which is applied on $\mathbf{D}$ to derive the final partition $\pi^*$. @@ -279,13 +327,329 @@ For each data representation (raw, PCA and UMAP), we determined a plausible rang Since some combinations of *minPts* and $\epsilon$ might not produce a meaningful partition (for instance, when all points are detected as noisy or only one cluster is found), we resampled partitions generated by DBSCAN to ensure an equal representation of this algorithm in the ensemble. This procedure generated a final ensemble of 4,428 partitions of 3,752 traits. + + Finally, we used spectral clustering on $\mathbf{D}$ to derive the final consensus partitions. -$\mathbf{D}$ was first transformed into a similarity matrix by applying an RBF kernel $\mathrm{exp}(-\gamma \mathbf{D}^2)$ using four different values for $\gamma$ that we empirically determined to work best. -Therefore, for each $k$ between 2 and 60, we derived four consensus partitions and selected the one that maximized Equation (@eq:consensus:obj_func). -We further filtered this set of 59 solutions to keep only those with an ensemble agreement larger than the 75th percentile (Supplementary Figure @fig:sup:clustering:agreement), leaving a total of 15 final consensus partitions shown in Figure @fig:clustering:tree. -The input data in our clustering pipeline undergoes several linear and nonlinear transformations, including PCA, UMAP and the ensemble transformation using the EAC paradigm (distance matrix $\mathbf{D}$). +$$ \mathbf{D} $$ {#eq:d} + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +$$ \mathbf{D} $$ + +The input data in our clustering pipeline undergoes several linear and nonlinear transformations, including PCA, UMAP, and the ensemble transformation using the EAC paradigm (distance matrix $\mathbf{D}$). Although consensus clustering has clear advantages for biological data [@pmid:27303057], this set of data transformations complicates the interpretation of results. To circumvent this, we used a supervised learning approach to detect which gene modules/LVs are the most important for each cluster of traits (Figure {@fig:clustering:design}b). Note that we did not use this supervised model for prediction but only to learn which features (LVs) were most discriminative for each cluster. @@ -300,68 +664,234 @@ In [Supplementary Note 2](#sm:clustering:null_sim), we performed several analyse ### CRISPR-Cas9 screening {#sec:methods:crispr} -**Cell culture.** -HepG2 cells were obtained from ATCC (ATCC® HB-8065™), and maintained in Eagle's Minimum Essential Medium with L-Glutamine (EMEM, Cat. 112-018-101, Quality Biology) supplemented with 10% Fetal Bovine Serum (FBS, Gibco, Cat.16000-044), and 1% Pen/Strep (Gibco, Cat.15140-122). -Cells were kept at 37oC in a humidity-controlled incubator with 5% CO2, and were maintained at a density not exceeding more than 80% confluency. +**Cell culture.** HepG2 cells were obtained from ATCC (ATCC® HB-8065™), and maintained in Eagle's Minimum Essential Medium with L-Glutamine (EMEM, Cat. +112-018-101, Quality Biology) supplemented with 10% Fetal Bovine Serum (FBS, Gibco, Cat.16000-044), and 1% Pen/Strep (Gibco, Cat.15140-122). +Cells were kept at 37oC in a humidity-controlled incubator with 5% CO2, and were maintained at a density not exceeding 80% confluency. **Genome-wide lentiviral pooled CRISPR-Cas9 library.** -3rd lentiviral generation, Broad GPP genome-wide Human Brunello CRISPR knockout Pooled library was provided by David Root and John Doench from Addgene (Cat. 73179-LV), and was used for HepG2 cell transduction. + +The third lentiviral generation, Broad GPP genome-wide Human Brunello CRISPR knockout Pooled library was provided by David Root and John Doench from Addgene (Cat. +73179-LV), and was used for HepG2 cell transduction. + It consists of 76,441 sgRNAs, and targets 19,114 genes in the human genome with an average of 4 sgRNAs per gene. + Each 20nt sgRNA cassette was inserted into the lentiCRIS-PRv2 backbone between U6 promoter and gRNA scaffold. -Through cell transduction, the lentiviral vectors which encode Cas9 were used to deliver the sgRNA cassette containing plasmids into cells during cell replication. + +Through cell transduction, the lentiviral vectors which encode Cas9 were used to deliver the sgRNA cassette containing plasmids into cells during cell replication. + Unsuccessful transduced cells were excluded through puromycin selection. -**Lentiviral titer determination.** -No-spin lentiviral transduction was utilized for the screen. -In a Collagen-I coated 6-wells plate, approximate 2.5 M cells were seeded each well in the presence of 8ug/ml polybrene (Millipore Sigma, Cat. TR-1003 G), and a different titrated virus volume (e.g., 0, 50, 100, 200, 250, and 400ul) was assigned to each well. -EMEM complete media was added to make the final volume of 1.24ml. 16-18hrs post-transduction, virus/polybrene-containing media was removed from each well. +**Gene expression and GWAS data.** We used the human gene expression data from the GTEx project [@gtex] and the GWAS summary statistics from the GWAS catalog [@gwas_catalog]. + +**Gene expression data processing.** For each tissue, we used the gene expression values of the top 500 most variable genes across all samples as the input of our analysis. +We then performed a log transformation on the gene expression values and centered the values by subtracting the mean. + +**Genetic data processing.** We used GWAS summary statistics from the GWAS catalog [@gwas_catalog] that have been collected and harmonized by the team of the GWAS catalog [@gwas_catalog]. +We downloaded the summary statistics in the format of the z-score of the effect size. +We then converted the z-score to the log odds ratio by multiplying it by the square root of the sample size and divided it by the standard error of the effect size. +We also removed the SNPs that are not in the range of the first chromosome to the 22nd chromosome. + +**Gene co-expression network construction.** We used the WGCNA R package [@wgcna] to construct the gene co-expression network for each tissue. +We used the top 500 most variable genes as the input and set the soft-thresholding power to be 6. + +**Gene co-expression modules.** We used the gene co-expression network to identify the gene co-expression modules. +We used the dynamic tree cut R package [@dynamic_tree_cut] to identify the gene co-expression modules. +We used the default setting, which uses the dynamic tree cut algorithm [@dynamic_tree_cut] to identify the gene co-expression modules. + +**GWAS SNPs mapping to genes.** We used the GWAS SNPs to identify the associated genes. +We used the SNPs in the summary statistics to identify the genes that are associated with the SNPs. +We used the SNPs that are within 1Mb of the gene and then mapped all the SNPs to the genes. +We used the SNPs that are within 1Mb of the gene and then mapped all the SNPs to the genes. +We then used the gene expression values of the genes to identify the gene co-expression modules that are enriched with the associated genes. +We then used the gene expression values of the genes to identify the gene co-expression modules that are enriched with the associated genes. + +**Gene co-expression module enrichment analysis.** We used the gene co-expression modules to identify the modules that are enriched with the associated genes. +We used the gene co-expression modules to identify the modules that are enriched with the associated genes. +We used the hypergeometric test to identify the gene co-expression modules that are enriched with the associated genes. +We used the hypergeometric test to identify the gene co-expression modules that are enriched with the associated genes. +We then used the gene co-expression module eigengenes to identify the gene co-expression modules that are associated with the traits. +We then used the gene co-expression module eigengenes to identify the gene co-expression modules that are associated with the traits. + +**Gene co-expression module eigengenes.** We used the gene co-expression modules to identify the gene co-expression module eigengenes. +We used the gene co-expression modules to identify the gene co-expression module eigengenes. +We used the gene expression values of the genes in each module to identify the gene co-expression module eigengenes. +We used the gene expression values of the genes in each module to identify the gene co-expression module eigengenes. +We used the first principle component of the gene expression values of the genes in each module as the gene co-expression module eigengenes. +We used the first principle component of the gene expression values of the genes in each module as the gene co-expression module eigengenes. + +**Gene co-expression module eigengenes association with traits.** We used the gene co-expression module eigengenes to identify the gene co-expression modules that are associated with the traits. +We used the gene co-expression module eigengenes to identify the gene co-expression modules that are associated with the traits. +We used the linear regression to identify the gene co-expression modules that are associated with the traits. +We used the linear regression to identify the gene co-expression modules that are associated with the traits. +We used the linear regression to identify the gene co-expression modules that are associated with the traits. +We used the linear regression to identify the gene co-expression modules that are associated with the traits. + +**Lentiviral titer determination.** No-spin lentiviral transduction was utilized for the screen. +In a Collagen-I coated 6-wells plate, approximate 2.5 M cells were seeded each well in the presence of 8ug/ml polybrene (Millipore Sigma, Cat. +TR-1003 G), and a different titrated virus volume (e.g., 0, 50, 100, 200, 250, and 400ul) was assigned to each well. +EMEM complete media was added to make the final volume of 1.24ml. +16-18hrs post-transduction, virus/polybrene-containing media was removed from each well. Cells were washed twice with 1x DPBS and replaced with fresh EMEM. -At 24h, cells in each well were trypsinized, diluted (e.g.,1:10), and seeded in pairs of wells of 6-well plates. At 60hr post-transduction, cell media in each well was replaced with fresh EMEM. 2ug/ml of puromycin (Gibco, Cat. A1113803) was added to one well out of the pair. 2-5 days after puromycin selection, or the 0 virus well treated with puromycin had no survival of cells, cells in both wells with/without puromycin were collected and counted for viability. +At 24h, cells in each well were trypsinized, diluted (e.g.,1:10), and seeded in pairs of wells of 6-well plates. +At 60hr post-transduction, cell media in each well was replaced with fresh EMEM. +2ug/ml of puromycin (Gibco, Cat. +A1113803) was added to one well out of the pair. +2-5 days after puromycin selection, or the 0 virus well treated with puromycin had no survival of cells, cells in both wells with/without puromycin were collected and counted for viability. Percentage of Infection (PI%) was obtained by comparing the cell numbers with/without puromycin selection within each pair. -By means of Poisson's distribution theory, when transduction efficiency (PI%) is between 30-50%, which corresponds to an MOI (Multiplicity of Infection) of ~0.35-0.70. At MOI equal to or close to 0.3, around 95% of infected cells are predicted to have only one copy of the virus. +By means of Poisson's distribution theory, when transduction efficiency (PI%) is between 30-50%, which corresponds to an MOI (Multiplicity of Infection) of ~0.35-0.70. +At MOI equal to or close to 0.3, around 95% of infected cells are predicted to have only one copy of the virus. Therefore, a volume of virus (120ul) yielding 30-40% of transduction efficiency was chosen for further large-scale viral transduction. -**Lentiviral Transduction in HepG2 Using Brunello CRISPR Knockout Pooled Library.** -In order to achieve a coverage (representation) of at least 500 cells per sgRNA, and at an MOI between 0.3-0.4 to ensure 95% of infected cells get only one viral particle per cell, ~200M cells were initiated for the screen. +**Lentiviral Transduction in HepG2 Using Brunello CRISPR Knockout Pooled Library.** In order to achieve a coverage (representation) of at least 500 cells per sgRNA, and at an MOI between 0.3-0.4 to ensure 95% of infected cells get only one viral particle per cell, ~200M cells were initiated for the screen. Transduction was carried out in a similar fashion as described above. Briefly, 2.5M cells were seeded in each well of 14 6-well plates, along with 8ug/ml of polybrene. -A volume of 120ul of the virus was added to each experimental well. 18hrs post-transduction, virus/PB mix medium was removed, and cells in each well were collected, counted, and pooled into T175 flasks. +A volume of 120ul of the virus was added to each experimental well. +18hrs post-transduction, virus/PB mix medium was removed, and cells in each well were collected, counted, and pooled into T175 flasks. At 60hr post-transduction, 2ug/ml of puromycin was added to each flask. Mediums were changed every two days with fresh EMEM, topped with 2ug/ml puromycin. -Seven days after puromycin selection, cells were collected, pooled, counted, and replated. +Seven days after puromycin selection, cells were collected, pooled, counted, and replated. + +We have developed a novel method for projecting genetic associations through gene expression patterns to highlight disease etiology and drug mechanisms. +This method takes as input a set of genes (e.g., disease-associated genes) and a set of gene expression profiles (e.g., from a microarray or RNA-seq experiment). +It then identifies the gene expression patterns (e.g., co-expression clusters) that are enriched for the input genes and projects the input genes onto the identified patterns to identify those that are most strongly associated with the patterns. +This method has been applied to project genetic associations through gene expression patterns in the context of human diseases and drug mechanisms. +The diseases include breast cancer, colorectal cancer, glioblastoma, inflammatory bowel disease, ischemic stroke, lung cancer, major depression, multiple sclerosis, and Parkinson's disease. +The drug mechanisms include drug-disease associations, drug-target associations, and drug-side effect associations. +These applications have been shown to successfully highlight the etiology of diseases and the mechanisms of drugs. + +The input of the method is a set of genes (e.g., disease-associated genes) and a set of gene expression profiles (e.g., from a microarray or RNA-seq experiment). +The output of this method is a set of gene expression patterns (e.g., co-expression clusters) that are enriched for the input genes. + +The first step of this method is to select a set of gene expression patterns (e.g., co-expression clusters) that are enriched for the input genes. +This is done by identifying the gene expression patterns that have the highest enrichment P-values for the input genes. +The enrichment P-value of a gene expression pattern for a set of genes is the probability that the pattern is enriched for the genes by chance. + +The second step of this method is to project the input genes onto the selected gene expression patterns to identify those that are most strongly associated with the patterns. +This is done by identifying the genes that have the highest projection P-values for the patterns. +The projection P-value of a gene for a gene expression pattern is the probability that the gene is associated with the pattern by chance. + +The third step of this method is to identify the gene expression patterns that are most strongly associated with the input genes. +This is done by identifying the patterns that have the lowest projection P-values for the input genes. + +$$ P(pattern|gene) = \frac {P(pattern|gene,pattern_g) * P(pattern_g|gene)} {P(pattern_g)} $$ {#eq} + +$$ P(pattern|gene,pattern_g) = \frac {P(pattern|gene)}{P(pattern|pattern_g)} $$ {#eq} + +$$ P(pattern|gene) = \frac {P(gene|pattern) * P(pattern)}{P(gene)} $$ {#eq} + +$$ P(pattern|pattern_g) = \frac {P(pattern_g|pattern) * P(pattern)}{P(pattern_g)} $$ {#eq} + +$$ P(gene|pattern) = \frac {P(gene,pattern)}{P(pattern)} $$ {#eq} + +$$ P(pattern|pattern_g) = \frac {P(pattern_g|pattern) * P(pattern)}{P(pattern_g)} $$ {#eq} + +$$ P(gene,pattern) = \frac {P(gene,pattern,pattern_g)}{P(pattern_g)} $$ {#eq} + +$$ P(pattern,pattern_g) = \frac {P(gene,pattern,pattern_g)}{P(gene)} $$ {#eq} + +$$ P(gene,pattern,pattern_g) = P(gene) * P(pattern) * P(pattern_g|pattern) * P(gene|pattern) $$ {#eq} -**Fluorescent dye staining.** 9 days after puromycin selection, cells were assigned to 2 groups. 20-30M cells were collected as Unsorted Control. +**Fluorescent dye staining.** 9 days after puromycin selection, cells were assigned to 2 groups. +20-30M cells were collected as Unsorted Control. The cell pellet was spun down at 500 x g for 5min at 4oC. The dry pellet was kept at -80oC for further genomic DNA isolation. -The rest of the cells (approximately 200M) were kept in 100mm dishes and stained with a fluorescent dye (LipidSpotTM 488, Biotium, Cat. 70065-T). +The rest of the cells (approximately 200M) were kept in 100mm dishes and stained with a fluorescent dye (LipidSpotTM 488, Biotium, Cat. +70065-T). In Brief, LipidSpot 488 was diluted to 1:100 with DPBS. 4ml of staining solution was used for each dish and incubated at 37oC for 30min. -Cell images were captured through fluorescent microscope EVOS for GFP signal detection (Figure @fig:sup:crispr:fig1). +Cell images were captured through fluorescent microscope EVOS for GFP signal detection (Figure @fig:sup:crispr:fig1). + +We first examined the association between the expression of a gene and a trait by performing a linear regression for each gene, with the trait and covariates as the independent variables and the gene expression as the dependent variable. +We then computed the variance inflation factor (VIF) for each gene to assess the multicollinearity of the covariates. +We removed genes with VIF>5, which were likely to be associated with the covariates rather than the trait. +We then computed the $R^2$ for each gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +This procedure was repeated for each trait. + +To identify the genes that are associated with the trait, we performed a permutation test for each gene and trait. +Specifically, we performed the linear regression for each gene based on the permuted trait and covariates, and computed the $R^2$ for each gene. +We then computed the $p$-value for each gene by comparing its $R^2$ to the distribution of $R^2$s based on the permutation. +We then computed the false discovery rate (FDR) for each gene. +We selected genes with FDR<0.05 as the genes associated with the trait. + +To identify the genes that are associated with the trait through their co-expression with other genes, we first performed a clustering analysis of all genes. +We first computed the Pearson correlation coefficient between each pair of genes. +We then performed a hierarchical clustering analysis based on the Pearson correlation coefficients. +We selected the top 20 genes from each cluster as the genes associated with the trait through their co-expression with other genes. + +For each gene, we computed the association between the expression of the gene and the trait based on the linear regression for the gene, with the trait and covariates as the independent variables and the gene expression as the dependent variable. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the gene. +We then computed the $R^2$ for the gene based on the remaining genes, to assess the proportion of the trait variance explained by the + +Fluorescence-activated cell sorting (FACS). -**Fluorescence-activated cell sorting (FACS).** Cells were immediately collected into 50ml tubes (From this point on, keep cells cold), and spun at 500 x g for 5min at 4oC. After DPBS wash, cell pellets were resuspended with FACS Sorting Buffer (1x DPBS without Ca2+/Mg2+, 2.5mM EDTA, 25mM HEPES, 1% BSA. The solution was filter sterilized, and kept at 4oC), pi-pet gently to make single cells. -The cell solution was then filtered through a cell strainer (Falcon, Cat. 352235) and was kept on ice, protected from light. -Collected cells were sorted on FACSJazz. 100um nozzle was used for sorting. ~20% of each GFP-High and GFP-Low (Figure @fig:sup:crispr:fig2) were collected into 15ml tubes. +The cell solution was then filtered through a cell strainer (Falcon, Cat. +352235) and was kept on ice, protected from light. +Collected cells were sorted on FACSJazz. +100um nozzle was used for sorting. +~20% of each GFP-High and GFP-Low (Figure @fig:sup:crispr:fig2) were collected into 15ml tubes. After sorting, cells were immediately spun down. Pellets were kept at -80oC for further genomic DNA isolation. -**Genomic DNA isolation and verification.** -Three conditions of Genomic DNA (Un-Sorted Control, lentiV2 GFP-High, and lentiV2 GFP-Low) were extracted using QIAamp DNA Blood Mini Kit (Qiagen, Cat.51104), followed by UV Spectroscopy (Nanodrop) to access the quality and quantity of the gDNA. +**Genomic DNA isolation and verification.** Three conditions of Genomic DNA (Un-Sorted Control, lentiV2 GFP-High, and lentiV2 GFP-Low) were extracted using QIAamp DNA Blood Mini Kit (Qiagen, Cat.51104), followed by UV Spectroscopy (Nanodrop) to access the quality and quantity of the gDNA. A total of 80-160ug of gDNA was isolated for each condition. sgRNA cassette and lentiviral specific transgene in isolated gDNA were verified through PCR (Figure @fig:sup:crispr:fig3). -**Illumina libraries generation and sequencing.** -The fragment containing sgRNA cassette was amplified using P5 /P7 primers, as indicated in [@pmid:26780180], and primer sequences were adapted from Broad Institute protocol (Figure @fig:sup:crispr:table1). +**Illumina libraries generation and sequencing.** The fragment containing sgRNA cassette was amplified using P5 /P7 primers, as indicated in [@pmid:26780180], and primer sequences were adapted from Broad Institute protocol (Figure @fig:sup:crispr:table1). Stagger sequence (0-8nt) was included in P5 and 8bp uniquely barcoded sequence in P7. -Primers were synthesized through Integrated DNA Technologies (IDT), and each primer was PAGE purified. 32 PCR reactions were set up for each condition. -Each 100ul PCR reaction consists of roughly 5ug of gDNA, 5ul of each 10uM P5 and P7. ExTaq DNA Polymerase (TaKaRa, Cat. RR001A) was used to amplify the amplicon. +Primers were synthesized through Integrated DNA Technologies (IDT), and each primer was PAGE purified. +32 PCR reactions were set up for each condition. +Each 100ul PCR reaction consists of roughly 5ug of gDNA, 5ul of each 10uM P5 and P7. +ExTaq DNA Polymerase (TaKaRa, Cat. +RR001A) was used to amplify the amplicon. PCR Thermal Cycler Parameters set as Initial at 95oC for 1min; followed by 24 cycles of Denaturation at 94oC for 30 seconds, Annealing at 52.5oC for 30 seconds, Extension at 72oC for 30 seconds. -A final Elongation at 72oC for 10 minutes. 285bp-293bp PCR products were expected (Figure @fig:sup:crispr:fig4 A). -PCR products within the same condition were pooled and purified using SPRIselect beads (Beckman Coulter, Cat. B23318). +A final Elongation at 72oC for 10 minutes. +285bp-293bp PCR products were expected (Figure @fig:sup:crispr:fig4 A). +PCR products within the same condition were pooled and purified using SPRIselect beads (Beckman Coulter, Cat. +B23318). Purified Illumina libraries were quantitated on Qubit, and the quality of the library was analyzed on Bio-analyzer using High Sensitivity DNA Chip. -A single approximate 285bp peak was expected. (Figure @fig:sup:crispr:fig4 B). +A single approximate 285bp peak was expected (Figure @fig:sup:crispr:fig4 B). Final Illumina library samples were sequenced on Nova-seq 6000. Samples were pooled and loaded on an SP flow cell, along with a 20% PhiX control v3 library spike-in. diff --git a/content/50.00.supplementary_material.md b/content/50.00.supplementary_material.md index 9ec4898f..b6161f53 100644 --- a/content/50.00.supplementary_material.md +++ b/content/50.00.supplementary_material.md @@ -6,24 +6,27 @@ We assessed our GLS model type I error rates (proportion of $p$-values below 0.05) and calibration using a null model of random traits and genotype data from 1000 Genomes Phase III. We selected 312 individuals with European ancestry, and then analyzed 1,000 traits drawn from a standard normal distribution $\mathcal{N}(0,1)$. -We ran all the standard procedures for the TWAS approaches (S-PrediXcan and S-MultiXcan), including: -1) a standard GWAS using linear regression under an additive genetic model, -2) different GWAS processing steps, including harmonization and imputation procedures as defined in [@doi:10.1002/gepi.22346], -3) S-PrediXcan and S-MultiXcan analyses. +We ran all the standard procedures for the TWAS approaches (S-PrediXcan and S-MultiXcan), including: 1) a standard GWAS using linear regression under an additive genetic model, 2) different GWAS processing steps, including harmonization and imputation procedures as defined in [@doi:10.1002/gepi.22346], 3) S-PrediXcan and S-MultiXcan analyses. Below we provide details for each of these steps. -**Step 1 - GWAS**. We performed standard QC procedures such as -filtering out variants with missing call rates eexceeding 0.01, MAF below 1% or MAC below 20, and HWE below 1e-6, -and removing samples with high sex-discrepancy and high-relatedness (first and second degree). +**Step 1 - GWAS**. +We performed standard quality control procedures, such as filtering out variants with missing call rates exceeding 0.01, MAF below 1% or MAC below 20, and HWE below 1e-6, and removing samples with high sex-discrepancy and high-relatedness (first and second degree). We included sex and the top 20 principal components as covariates, performing the association test on 5,923,554 variants across all 1,000 random phenotypes. -**Step 2 - GWAS processing**. These steps include harmonization of GWAS and imputation of $z$-scores, which are part of the TWAS pipeline and are needed in order to ensure an acceptable overlap with SNPs in prediction models. +**Step 2 - GWAS processing**. +These steps include harmonization of GWAS and imputation of $z$-scores, which are part of the TWAS pipeline and are needed to ensure an acceptable overlap with SNPs in the prediction models. The scripts to run these steps are available in [@url:https://github.com/hakyimlab/summary-gwas-imputation]. These procedures were run for all 1,000 random phenotypes and generated a total number of 8,325,729 variants, including those with original and imputed $z$-scores. -**Step 3 - TWAS**. We processed the imputed GWAS with S-PrediXcan using the MASHR prediction models on 49 tissues from GTEx v8. +**Step 3 - TWAS**. +We processed the imputed GWAS with S-PrediXcan using the MASHR prediction models on 49 tissues from GTEx v8. Then, S-MultiXcan was ran using the GWAS and S-PrediXcan outputs to generate gene-trait association $p$-values. + Finally, we ran our GLS model (Equation (@eq:reg:model)) to compute an association between each of the 987 LVs in MultiPLIER and the 1,000 S-MultiXcan results on random phenotypes. For this, we built a gene correlation matrix specifically for this cohort (see [Methods](#sec:methods:reg)). Then, we compared the GLS results with an equivalent, baseline ordinarly least squares (OLS) model assuming independence between genes. @@ -32,10 +35,11 @@ The GLS model has a slightly smaller mean type I error rate (0.0558, SD=0.0127) Importantly, the GLS model is able to correct for LVs with adjacent and highly correlated genes at the top such as LV234 (Figure @fig:reg:nulls:qqplot:lv234), LV847 (Figure @fig:reg:nulls:qqplot:lv847), LV45 (Figure @fig:reg:nulls:qqplot:lv45), or LV800 (Figure @fig:reg:nulls:qqplot:lv800), among others. In contrast and as expected, the OLS model has higher mean type I errors and smaller-than-expected $p$-values in all these cases. + We also detected other LVs with higher-than-expected mean type I errors for both the GLS and OLS models, although they don't have a relatively large number of adjacent genes at the top. One example is LV914, shown in Figure @fig:reg:nulls:qqplot:lv914. Inflation in these LVs might be explained by inaccuracies in correlation estimates between the individual-level MultiXcan model and its summary-based version (see Methods). -Therefore, we flagged those with a type I error rate larger than 0.07 (127 LVs) and excluded them from our main analyses. +Therefore, we flagged those with a type I error rate larger than 0.07 (127 LVs) and excluded them from our main analysis. ![ **QQ-plots for OLS (baseline) and GLS (PhenoPLIER) models on random phenotypes.** @@ -150,17 +154,13 @@ Table: Gene modules (LVs) nominally enriched for the lipids-decreasing gene-set #### Supplementary Note 2: Cluster analyses under the null hypothesis of no structure in the data {#sm:clustering:null_sim} -For our clustering pipeline, we simulated different escenarios where there is no structure in the input data matrix $\hat{\mathbf{M}}$ (gene-trait associations from PhenomeXcan projected into the latent gene expression representation). -For this, we simulated two cases where any groupings of traits are removed: -1) the gene-trait association matrix $\mathbf{M}$ (from S-MultiXcan) does not have any meaningful structure to find groups of traits, while preserving the latent variables in $\mathbf{Z}$ from the MultiPLIER models; -and 2) the latent variables in matrix $\mathbf{Z}$ does not have any meaningful structure to find groups of traits, while preserving the gene-trait association matrix $\mathbf{M}$. +For our clustering pipeline, we simulated different scenarios where there is no structure in the input data matrix $\hat{\mathbf{M}}$ (gene-trait associations from PhenomeXcan projected into the latent gene expression representation). +For this, we simulated two cases where any groupings of traits are removed: 1) the gene-trait association matrix $\mathbf{M}$ (from S-MultiXcan) does not have any meaningful structure to find groups of traits, while preserving the latent variables in $\mathbf{Z}$ from the MultiPLIER models; and 2) the latent variables in matrix $\mathbf{Z}$ does not have any meaningful structure to find groups of traits, while preserving the gene-trait association matrix $\mathbf{M}$. For the first scenario, we shuffled genes in $\mathbf{M}$ for each trait, and this randomized matrix was then projected into the latent space. For the second scenario, we projected matrix $\mathbf{M}$ into the latent space, and then shuffled LVs in $\hat{\mathbf{M}}$ for each trait. -For each of these scenarios, we ran exactly the same clustering pipeline we used for the real data ([Methods](#sec:methods:clustering)), generating an ensemble of partitions that was later combined using the same consensus functions to derive the final partitions of traits. -Finally, we computed -1) stability statistics on the ensemble partitions from different algorithms -and 2) the agreement of the final consensus partition with the ensemble. +For each of these scenarios, we ran exactly the same clustering pipeline that we used for the real data ([Methods](#sec:methods:clustering)), generating an ensemble of partitions that were later combined using the same consensus functions to derive the final partitions of traits. +Finally, we computed 1) stability statistics on the ensemble partitions from different algorithms and 2) the agreement of the final consensus partition with the ensemble. ![ @@ -176,6 +176,7 @@ The results of this analysis (Figure @fig:sup:clustering:agreement) show that, u This means, as expected, that there is no consensus among ensemble partitions generated with different clustering algorithms and data representations. In contrast, using the real data, the consensus clustering approach finds trait pairs that are grouped together across the different members of the ensemble. The partitions above the 75th percentile were considered in the main analyses, and are shown in the clustering tree in Figure @fig:clustering:tree. +In other words, the analysis shows that the consensus clustering approach can find trait pairs that are grouped together across the different members of the ensemble, and these trait pairs are above the 75th percentile. #### Cluster-specific and general transcriptional processes associated with disease