A comprehensive R pipeline for differential gene expression analysis using DESeq2 and functional pathway enrichment analysis using multiple databases (GO, KEGG, Reactome).
This pipeline performs:
- Differential Expression Analysis using DESeq2
- Gene ID Conversion from ENSEMBL to ENTREZ IDs
- Pathway Enrichment Analysis using multiple databases
- Comprehensive Visualization of results
- Multi-comparison Analysis: Analyze multiple treatment comparisons simultaneously
- Functional Programming Approach: Efficient processing using
lapplyfor scalability - Multiple Pathway Databases: GO (Biological Process & Molecular Function), KEGG, and Reactome
- Comprehensive Visualization: Dot plots, bar plots, network plots, and GSEA visualizations
- Gene Coverage Analysis: Track gene ID mapping success rates
- Cross-treatment Comparison: Compare pathway enrichment across different treatments
├── initial_deseq_script.r # Main analysis script
├── functions.R # Custom function definitions
├── rse_gene.Rdata # Input data (RangedSummarizedExperiment)
└── README.md # This file
- R ≥ 4.0.0
install.packages(c(
"tidyverse", # Data manipulation and plotting
"conflicted" # Handle package conflicts
))if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"DESeq2", # Differential expression analysis
"recount", # RNA-seq data source
"clusterProfiler", # Pathway analysis framework
"org.Hs.eg.db", # Human gene annotations
"enrichplot", # Enrichment visualization
"DOSE", # Disease ontology analysis
"pathview", # Pathway visualization
"ReactomePA", # Reactome pathway analysis
"msigdbr", # MSigDB gene sets
"biomaRt", # Gene annotation from Ensembl
"fgsea" # Fast gene set enrichment analysis
))Ensure your rse_gene.Rdata file (RangedSummarizedExperiment object) is in the working directory.
# Source and run the main script
source("initial_deseq_script.r")The script generates multiple result objects:
results_list: List containing DESeq2 results for all comparisonscount_results: Up/downregulated gene counts for each comparison
go_bp_results: GO Biological Process enrichment resultsgo_mf_results: GO Molecular Function enrichment resultskegg_results: KEGG pathway enrichment resultsreactome_results: Reactome pathway enrichment results
plots_go_bp_up/down: GO enrichment plotsplots_kegg_up: KEGG enrichment plotsgsea_plots_go/kegg: GSEA visualization plots
- Load RangedSummarizedExperiment data
- Extract count matrix and metadata
- Filter low-count genes
- Create DESeq2 dataset with concentration design
- Perform three pairwise comparisons:
- 5μM vs 0μM (control)
- 10μM vs 0μM (control)
- 10μM vs 5μM
- Generate MA plots for visualization
- Convert ENSEMBL IDs to ENTREZ IDs using
org.Hs.eg.db - Analyze mapping success rates and unmapped genes
- Handle duplicate mappings appropriately
- Over-representation Analysis (ORA): For significantly up/downregulated genes
- Gene Set Enrichment Analysis (GSEA): For ranked gene lists
- Multiple databases: GO (BP/MF), KEGG, Reactome
- Generate comprehensive plots for each analysis
- Compare pathway enrichment across treatments
- Summarize results across different databases
count_up_down(): Count significantly DE genesconvert_gene_ids(): ENSEMBL to ENTREZ ID conversionprepare_gene_lists(): Prepare gene lists for pathway analysis
perform_go_analysis(): GO enrichment analysisperform_kegg_analysis(): KEGG pathway analysisperform_reactome_analysis(): Reactome pathway analysis
create_pathway_plots(): Generate enrichment plotscreate_gsea_plots(): Generate GSEA visualizationscompare_pathways_across_treatments(): Cross-treatment comparison
summarize_pathway_results(): Summarize enrichment resultscompare_pathway_methods(): Compare different analysis methods
The pipeline reports gene ID mapping success rates:
- Typical ENSEMBL→ENTREZ mapping success: ~65-70%
- Unmapped genes are analyzed by biotype for quality assessment
- Adjusted p-value < 0.05: Significant enrichment
- NES (Normalized Enrichment Score): GSEA effect size
- Gene Ratio: Proportion of genes in pathway vs. background
- Dot plots: Pathway significance and gene ratio
- Bar plots: Pathway enrichment counts
- Ridge plots: GSEA enrichment distribution
- Network plots: Gene-pathway relationships
# In main script, modify these parameters:
padj_threshold = 0.01 # Adjusted p-value cutoff
lfc_threshold = 1 # Log2 fold change threshold (2-fold)-
Low gene mapping rates
- Check ENSEMBL ID format (version numbers are handled automatically)
- Verify organism database (
org.Hs.eg.dbfor human)
-
No significant pathways
- Adjust p-value thresholds
- Check if sufficient genes pass filtering
- Verify gene list preparation
-
Memory issues with large datasets
- Consider filtering more stringently
- Process comparisons individually rather than using
lapply
"Duplicate values in names(stats) not allowed": Handled automatically by deduplication inprepare_gene_lists()- Bioconductor connection issues: Ensure stable internet connection for gene annotation queries
If you use this pipeline, please cite the relevant packages:
- DESeq2: Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
- clusterProfiler: Yu, G., et al. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5), 284-287.
- fgsea: Korotkevich, G., et al. (2019). Fast gene set enrichment analysis. bioRxiv, 060012.
This project is licensed under the MIT License - see the LICENSE file for details.
This pipeline was developed jointly by Wendy Phillips and GitHub Copilot through collaborative AI-assisted programming.
Note: This pipeline is designed for human RNA-seq data. For other organisms, modify the org.Hs.eg.db annotation package accordingly (e.g., org.Mm.eg.db for mouse).