wendysphillips/transcriptomics_analysis

Differential Expression and Pathway Analysis Pipeline

A comprehensive R pipeline for differential gene expression analysis using DESeq2 and functional pathway enrichment analysis using multiple databases (GO, KEGG, Reactome).

Overview

This pipeline performs:

  1. Differential Expression Analysis using DESeq2
  2. Gene ID Conversion from ENSEMBL to ENTREZ IDs
  3. Pathway Enrichment Analysis using multiple databases
  4. Comprehensive Visualization of results

Features

  • Multi-comparison Analysis: Analyze multiple treatment comparisons simultaneously
  • Functional Programming Approach: Comparisons are processed with lapply, so adding a comparison does not require duplicating code
  • Multiple Pathway Databases: GO (Biological Process & Molecular Function), KEGG, and Reactome
  • Comprehensive Visualization: Dot plots, bar plots, network plots, and GSEA visualizations
  • Gene Coverage Analysis: Track gene ID mapping success rates
  • Cross-treatment Comparison: Compare pathway enrichment across different treatments

Repository Structure

├── initial_deseq_script.r    # Main analysis script
├── functions.R               # Custom function definitions
├── rse_gene.Rdata            # Input data (RangedSummarizedExperiment)
└── README.md                 # This file

Requirements

R Version

  • R ≥ 4.0.0

CRAN Packages

install.packages(c(
  "tidyverse",    # Data manipulation and plotting
  "conflicted"    # Handle package conflicts
))

Bioconductor Packages

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c(
  "DESeq2",           # Differential expression analysis
  "recount",          # RNA-seq data source
  "clusterProfiler",  # Pathway analysis framework
  "org.Hs.eg.db",     # Human gene annotations
  "enrichplot",       # Enrichment visualization
  "DOSE",             # Disease ontology analysis
  "pathview",         # Pathway visualization
  "ReactomePA",       # Reactome pathway analysis
  "msigdbr",          # MSigDB gene sets
  "biomaRt",          # Gene annotation from Ensembl
  "fgsea"             # Fast gene set enrichment analysis
))

Usage

1. Prepare Data

Ensure your rse_gene.Rdata file (RangedSummarizedExperiment object) is in the working directory.
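Because load() restores objects under the names they were saved with, it can help to check what an .Rdata file actually contains before running the pipeline. The sketch below is a minimal base R illustration; load_rdata() is a hypothetical helper, not part of functions.R, and the temporary file only stands in for rse_gene.Rdata.

```r
# Hypothetical helper: load an .Rdata file into its own environment and
# return the restored objects as a named list, leaving the global
# environment untouched.
load_rdata <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  mget(object_names, envir = env)
}

# Self-contained demonstration: a temporary file stands in for
# rse_gene.Rdata, and a small data frame stands in for its contents.
tmp <- tempfile(fileext = ".Rdata")
example_object <- data.frame(sample = c("s1", "s2"), concentration = c(0, 5))
save(example_object, file = tmp)

loaded <- load_rdata(tmp)
print(names(loaded))
print(class(loaded$example_object))
```

For the real input, load_rdata("rse_gene.Rdata") would list the object names stored in the file; that the object is called rse_gene is an assumption based on the file name.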

2. Run Analysis

# Source and run the main script
source("initial_deseq_script.r")

3. Access Results

The script generates multiple result objects:

Differential Expression Results

  • results_list: List containing DESeq2 results for all comparisons
  • count_results: Up/downregulated gene counts for each comparison

Pathway Analysis Results

  • go_bp_results: GO Biological Process enrichment results
  • go_mf_results: GO Molecular Function enrichment results
  • kegg_results: KEGG pathway enrichment results
  • reactome_results: Reactome pathway enrichment results

Visualization Objects

  • plots_go_bp_up/down: GO enrichment plots
  • plots_kegg_up: KEGG enrichment plots
  • gsea_plots_go/kegg: GSEA visualization plots

Analysis Workflow

1. Data Import and Preprocessing

  • Load RangedSummarizedExperiment data
  • Extract count matrix and metadata
  • Filter low-count genes
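The low-count filter can be sketched in base R as follows. The cutoffs (at least 10 reads in at least 3 samples) and the toy matrix are assumptions for illustration and may differ from what initial_deseq_script.r uses.

```r
# Toy count matrix: 4 genes x 6 samples (made-up values for illustration).
counts <- matrix(
  c(0,   1,   0,   2,   0,   1,
    15,  22,  9,   30,  18,  25,
    100, 80,  120, 95,  110, 70,
    3,   12,  0,   11,  2,   14),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6))
)

min_count   <- 10  # assumed minimum reads per sample
min_samples <- 3   # assumed minimum number of samples reaching min_count

# Keep genes that reach min_count in at least min_samples samples.
keep <- rowSums(counts >= min_count) >= min_samples
filtered_counts <- counts[keep, , drop = FALSE]

print(rownames(filtered_counts))
```

Setting min_samples to the size of the smallest treatment group is a common choice, since it retains genes expressed in only one condition.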

2. Differential Expression Analysis

  • Create a DESeq2 dataset with treatment concentration as the design variable
  • Perform three pairwise comparisons:
    • 5μM vs 0μM (control)
    • 10μM vs 0μM (control)
    • 10μM vs 5μM
  • Generate MA plots for visualization
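The three comparisons above map naturally onto DESeq2 contrasts processed with lapply. The following is a sketch under stated assumptions: the column name "concentration", the level labels "0", "5" and "10", and the helper run_comparisons() are hypothetical and may not match the main script. DESeq2 is only needed when the helper is called on a fitted dataset.

```r
# Each comparison is a DESeq2-style contrast:
# c(factor name, numerator level, denominator level).
# The factor name and level labels are assumptions for illustration.
comparisons <- list(
  "5uM_vs_0uM"  = c("concentration", "5", "0"),
  "10uM_vs_0uM" = c("concentration", "10", "0"),
  "10uM_vs_5uM" = c("concentration", "10", "5")
)

# Hypothetical helper: extract one results table per contrast from a
# DESeqDataSet that has already been fitted with DESeq2::DESeq().
run_comparisons <- function(dds, comparisons) {
  lapply(comparisons, function(contrast) {
    DESeq2::results(dds, contrast = contrast)
  })
}

print(names(comparisons))
```

With a fitted dataset, results_list <- run_comparisons(dds, comparisons) would return a named list with one results table per comparison, and adding a fourth comparison only requires another list entry.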

3. Gene ID Conversion

  • Convert ENSEMBL IDs to ENTREZ IDs using org.Hs.eg.db
  • Analyze mapping success rates and unmapped genes
  • Handle duplicate mappings appropriately
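The conversion steps can be illustrated with base R alone. In this sketch the IDs are placeholders rather than real annotations, the small lookup table stands in for an org.Hs.eg.db query, and keeping the first ENTREZ ID per ENSEMBL ID is one possible duplicate-handling rule, not necessarily the one convert_gene_ids() applies.

```r
# Placeholder versioned ENSEMBL IDs (not real annotations).
ensembl_versioned <- c("ENSG00000000001.5", "ENSG00000000002.12",
                       "ENSG00000000004.3", "ENSG00000000009.1")

# Strip the trailing version suffix (".5", ".12", ...).
ensembl_ids <- sub("\\.[0-9]+$", "", ensembl_versioned)

# Toy lookup table standing in for an annotation database query.
# ENSG00000000002 maps to two ENTREZ IDs; ENSG00000000009 has no mapping.
lookup <- data.frame(
  ENSEMBL  = c("ENSG00000000001", "ENSG00000000002",
               "ENSG00000000002", "ENSG00000000004"),
  ENTREZID = c("101", "102", "103", "104"),
  stringsAsFactors = FALSE
)

# Assumed duplicate rule: keep the first ENTREZ ID listed per ENSEMBL ID.
lookup_unique <- lookup[!duplicated(lookup$ENSEMBL), ]

entrez <- lookup_unique$ENTREZID[match(ensembl_ids, lookup_unique$ENSEMBL)]
mapping_rate <- mean(!is.na(entrez))
unmapped <- ensembl_ids[is.na(entrez)]

print(mapping_rate)
print(unmapped)
```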

4. Pathway Enrichment Analysis

  • Over-representation Analysis (ORA): For significantly up/downregulated genes
  • Gene Set Enrichment Analysis (GSEA): For ranked gene lists
  • Multiple databases: GO (BP/MF), KEGG, Reactome

5. Visualization and Comparison

  • Generate comprehensive plots for each analysis
  • Compare pathway enrichment across treatments
  • Summarize results across different databases

Key Functions

Core Analysis Functions

  • count_up_down(): Count significantly DE genes
  • convert_gene_ids(): ENSEMBL to ENTREZ ID conversion
  • prepare_gene_lists(): Prepare gene lists for pathway analysis

Pathway Analysis Functions

  • perform_go_analysis(): GO enrichment analysis
  • perform_kegg_analysis(): KEGG pathway analysis
  • perform_reactome_analysis(): Reactome pathway analysis

Visualization Functions

  • create_pathway_plots(): Generate enrichment plots
  • create_gsea_plots(): Generate GSEA visualizations
  • compare_pathways_across_treatments(): Cross-treatment comparison

Summary Functions

  • summarize_pathway_results(): Summarize enrichment results
  • compare_pathway_methods(): Compare different analysis methods

Output Interpretation

Gene Coverage Statistics

The pipeline reports gene ID mapping success rates:

  • Typical ENSEMBL→ENTREZ mapping success: ~65-70%
  • Unmapped genes are analyzed by biotype for quality assessment

Pathway Enrichment Results

  • Adjusted p-value < 0.05: Significant enrichment
  • NES (Normalized Enrichment Score): GSEA effect size
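  • Gene Ratio: Fraction of the input genes that belong to the pathway; compare it with the background ratio (BgRatio), the fraction of all background genes in that pathway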
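As a worked example of how these quantities relate, over-representation analysis is commonly based on a hypergeometric test. The numbers below are made up for illustration, and the calculation is a sketch of the general technique rather than a reproduction of the pipeline's output.

```r
# Made-up numbers for one pathway.
N <- 20000  # annotated genes in the background
M <- 200    # background genes that belong to the pathway
n <- 500    # annotated genes in the input (e.g. upregulated) list
k <- 15     # input genes that belong to the pathway

gene_ratio <- k / n  # fraction of input genes in the pathway
bg_ratio   <- M / N  # fraction of background genes in the pathway

# Probability of observing k or more pathway genes when drawing n genes
# at random from the background (upper tail of the hypergeometric).
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

print(c(gene_ratio = gene_ratio, bg_ratio = bg_ratio, p_value = p_value))
```

Here the gene ratio is three times the background ratio. The raw p-value would still need multiple-testing adjustment across all tested pathways before applying the 0.05 cutoff.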

Visualization Guide

  • Dot plots: Pathway significance and gene ratio
  • Bar plots: Pathway enrichment counts
  • Ridge plots: GSEA enrichment distribution
  • Network plots: Gene-pathway relationships

Customization

Adjust Significance Thresholds

# In main script, modify these parameters:
padj_threshold <- 0.01   # Adjusted p-value cutoff
lfc_threshold <- 1       # Log2 fold change threshold (2-fold)
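To show how such thresholds typically translate into up- and downregulated gene counts, here is a minimal base R sketch. The results table is made up, and the logic is an assumption about what a function like count_up_down() does, not its actual implementation.

```r
padj_threshold <- 0.01  # adjusted p-value cutoff
lfc_threshold  <- 1     # absolute log2 fold change cutoff (2-fold)

# Made-up results table with the column names DESeq2 uses.
res <- data.frame(
  gene           = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -1.8, 0.4, 1.2, -3.0, NA),
  padj           = c(0.001, 0.005, 0.0001, 0.2, NA, 0.003)
)

# Treat rows with missing values as not significant; DESeq2 sets padj to NA
# for genes removed by independent filtering.
sig <- !is.na(res$padj) & !is.na(res$log2FoldChange) &
  res$padj < padj_threshold

n_up   <- sum(sig & res$log2FoldChange >  lfc_threshold)
n_down <- sum(sig & res$log2FoldChange < -lfc_threshold)

print(c(up = n_up, down = n_down))
```

In this toy table, gene3 is significant but changes less than 2-fold, so it is counted in neither direction.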

Troubleshooting

Common Issues

  1. Low gene mapping rates

    • Check ENSEMBL ID format (version numbers are handled automatically)
    • Verify organism database (org.Hs.eg.db for human)
  2. No significant pathways

    • Adjust p-value thresholds
    • Check if sufficient genes pass filtering
    • Verify gene list preparation
  3. Memory issues with large datasets

    • Consider filtering more stringently
    • Process comparisons individually rather than using lapply

Error Messages

  • "Duplicate values in names(stats) not allowed": Handled automatically by deduplication in prepare_gene_lists()
  • Bioconductor connection issues: Ensure a stable internet connection for gene annotation queries
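The duplicate-names error arises when several input IDs collapse onto the same gene identifier in a ranked list. One way to deduplicate is sketched below in base R; keeping the entry with the largest absolute statistic is an assumed rule, and prepare_gene_lists() may resolve duplicates differently.

```r
# Made-up ranking statistics; GENE_A and GENE_B each appear twice.
stats <- c(GENE_A = 2.1, GENE_B = -0.5, GENE_A = 3.4,
           GENE_C = 1.0, GENE_B = -1.7)

# Order by absolute value so the strongest entry per gene comes first.
stats_sorted <- stats[order(abs(stats), decreasing = TRUE)]

# Keep only the first (strongest) entry for each gene name.
stats_unique <- stats_sorted[!duplicated(names(stats_sorted))]

# GSEA expects the list ranked from most positive to most negative.
ranked <- sort(stats_unique, decreasing = TRUE)

print(ranked)
```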

Citation

If you use this pipeline, please cite the relevant packages:

  • DESeq2: Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
  • clusterProfiler: Yu, G., et al. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5), 284-287.
  • fgsea: Korotkevich, G., et al. (2019). Fast gene set enrichment analysis. bioRxiv, 060012.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

This pipeline was developed jointly by Wendy Phillips and GitHub Copilot through collaborative AI-assisted programming.


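Note: This pipeline is designed for human RNA-seq data. For other organisms, replace org.Hs.eg.db with the corresponding annotation package (e.g., org.Mm.eg.db for mouse); the organism settings of the KEGG and Reactome analyses will likely need to change as well.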
