Differential Expression and Pathway Analysis Pipeline

A comprehensive R pipeline for differential gene expression analysis using DESeq2 and functional pathway enrichment analysis using multiple databases (GO, KEGG, Reactome).

Overview

This pipeline performs:

Differential Expression Analysis using DESeq2
Gene ID Conversion from ENSEMBL to ENTREZ IDs
Pathway Enrichment Analysis using multiple databases
Comprehensive Visualization of results

Features

Multi-comparison Analysis: Analyze multiple treatment comparisons simultaneously
Functional Programming Approach: Comparisons are processed with lapply, so the same code scales to any number of comparisons
Multiple Pathway Databases: GO (Biological Process & Molecular Function), KEGG, and Reactome
Comprehensive Visualization: Dot plots, bar plots, network plots, and GSEA visualizations
Gene Coverage Analysis: Track gene ID mapping success rates
Cross-treatment Comparison: Compare pathway enrichment across different treatments

Repository Structure

├── initial_deseq_script.r    # Main analysis script
├── functions.R               # Custom function definitions
├── rse_gene.Rdata            # Input data (RangedSummarizedExperiment)
└── README.md                 # This file

Requirements

R Version

R ≥ 4.0.0

CRAN Packages

install.packages(c(
  "tidyverse",    # Data manipulation and plotting
  "conflicted"    # Handle package conflicts
))

Bioconductor Packages

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c(
  "DESeq2",           # Differential expression analysis
  "recount",          # RNA-seq data source
  "clusterProfiler",  # Pathway analysis framework
  "org.Hs.eg.db",     # Human gene annotations
  "enrichplot",       # Enrichment visualization
  "DOSE",             # Disease ontology analysis
  "pathview",         # Pathway visualization
  "ReactomePA",       # Reactome pathway analysis
  "msigdbr",          # MSigDB gene sets
  "biomaRt",          # Gene annotation from Ensembl
  "fgsea"             # Fast gene set enrichment analysis
))

Usage

1. Prepare Data

Place rse_gene.Rdata (a saved RangedSummarizedExperiment object) in the R working directory before running the script.

2. Run Analysis

# Source and run the main script
source("initial_deseq_script.r")

3. Access Results

The script generates multiple result objects:

Differential Expression Results

results_list: List containing DESeq2 results for all comparisons
count_results: Up/downregulated gene counts for each comparison

Pathway Analysis Results

go_bp_results: GO Biological Process enrichment results
go_mf_results: GO Molecular Function enrichment results
kegg_results: KEGG pathway enrichment results
reactome_results: Reactome pathway enrichment results

Visualization Objects

plots_go_bp_up, plots_go_bp_down: GO enrichment plots for up- and downregulated genes
plots_kegg_up: KEGG enrichment plots
gsea_plots_go, gsea_plots_kegg: GSEA visualization plots

Analysis Workflow

1. Data Import and Preprocessing

Load RangedSummarizedExperiment data
Extract count matrix and metadata
Filter low-count genes

2. Differential Expression Analysis

Create a DESeq2 dataset with concentration as the design variable
Perform three pairwise comparisons:
- 5μM vs 0μM (control)
- 10μM vs 0μM (control)
- 10μM vs 5μM
Generate MA plots for visualization
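
One way to express the three comparisons above is as a named list that lapply turns into DESeq2-style contrast vectors. This is a sketch only: the column name "concentration" and the level labels "0", "5" and "10" are assumptions and must match the sample metadata actually used.

```r
# Hypothetical design column and level names; adjust to match your colData
design_factor <- "concentration"
comparisons <- list(
  "5uM_vs_0uM"  = c("5", "0"),
  "10uM_vs_0uM" = c("10", "0"),
  "10uM_vs_5uM" = c("10", "5")
)

# Build one contrast vector per comparison: factor name, numerator, denominator
contrasts <- lapply(comparisons, function(lv) c(design_factor, lv[1], lv[2]))

# With a fitted DESeqDataSet `dds`, each contrast could then be passed to DESeq2:
# results_list <- lapply(contrasts, function(ct) DESeq2::results(dds, contrast = ct))
```

Because lapply keeps the list names, every downstream result stays labelled by its comparison.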

3. Gene ID Conversion

Convert ENSEMBL IDs to ENTREZ IDs using org.Hs.eg.db
Analyze mapping success rates and unmapped genes
Handle duplicate mappings appropriately
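
The conversion steps above can be illustrated with base R. The small lookup table below stands in for an org.Hs.eg.db query (in a real run the mapping might come from AnnotationDbi::mapIds(), for example), and the third ID is made up so that one gene fails to map. How convert_gene_ids() resolves duplicates is not shown in this README, so keeping the first row per ENTREZ ID is an assumption.

```r
# ENSEMBL IDs with version suffixes; the third ID is fictional and will not map
ensembl_ids <- c("ENSG00000141510.16", "ENSG00000012048.20",
                 "ENSG00000999999.1", "ENSG00000141510.16")

# Strip the ".<version>" suffix so IDs match annotation keys
ensembl_clean <- sub("\\.[0-9]+$", "", ensembl_ids)

# Toy lookup table standing in for an annotation database query
lookup <- c(ENSG00000141510 = "7157",  # TP53
            ENSG00000012048 = "672")   # BRCA1

# Names missing from the lookup yield NA
entrez <- unname(lookup[ensembl_clean])

mapping <- data.frame(ENSEMBL = ensembl_clean, ENTREZID = entrez,
                      stringsAsFactors = FALSE)

# Drop unmapped rows, then keep the first row per ENTREZ ID
mapped <- mapping[!is.na(mapping$ENTREZID), ]
mapped <- mapped[!duplicated(mapped$ENTREZID), ]

# Share of unique input IDs that received an ENTREZ ID
mapping_rate <- length(unique(mapped$ENSEMBL)) / length(unique(ensembl_clean))
```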

4. Pathway Enrichment Analysis

Over-representation Analysis (ORA): For significantly up/downregulated genes
Gene Set Enrichment Analysis (GSEA): For ranked gene lists
Multiple databases: GO (BP/MF), KEGG, Reactome
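
Over-representation analysis is commonly computed as a one-sided hypergeometric test. The sketch below shows that calculation for a single pathway in base R; all four counts are made-up numbers for illustration.

```r
# Toy numbers (illustrative only)
N <- 20000  # genes in the background universe
M <- 200    # background genes annotated to the pathway
n <- 500    # genes in the input list (e.g. significantly upregulated)
k <- 15     # input genes annotated to the pathway

# P(X >= k) under the hypergeometric null: drawing n genes from N,
# of which M belong to the pathway
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Overlap expected by chance, for comparison with the observed k
expected_overlap <- n * M / N
```

Here the observed overlap is three times the chance expectation, which is what yields a small p-value. GSEA differs in that it uses the full ranked gene list rather than a thresholded subset.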

5. Visualization and Comparison

Generate comprehensive plots for each analysis
Compare pathway enrichment across treatments
Summarize results across different databases

Key Functions

Core Analysis Functions

count_up_down(): Count significantly DE genes
convert_gene_ids(): ENSEMBL to ENTREZ ID conversion
prepare_gene_lists(): Prepare gene lists for pathway analysis

Pathway Analysis Functions

perform_go_analysis(): GO enrichment analysis
perform_kegg_analysis(): KEGG pathway analysis
perform_reactome_analysis(): Reactome pathway analysis

Visualization Functions

create_pathway_plots(): Generate enrichment plots
create_gsea_plots(): Generate GSEA visualizations
compare_pathways_across_treatments(): Cross-treatment comparison

Summary Functions

summarize_pathway_results(): Summarize enrichment results
compare_pathway_methods(): Compare different analysis methods

Output Interpretation

Gene Coverage Statistics

The pipeline reports gene ID mapping success rates:

Typical ENSEMBL→ENTREZ mapping success: ~65-70%
Unmapped genes are analyzed by biotype for quality assessment

Pathway Enrichment Results

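Adjusted p-value < 0.05: Significant enrichment after multiple-testing correction
NES (Normalized Enrichment Score): GSEA effect size; a positive NES means the gene set is concentrated at the top of the ranked list, a negative NES at the bottom
Gene Ratio: Fraction of the annotated input genes that belong to the pathway (GeneRatio in clusterProfiler output), to be read against BgRatio, the same fraction in the background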
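
To make the adjusted p-value cutoff concrete, the following base R sketch applies the Benjamini-Hochberg correction, which is clusterProfiler's default adjustment method. The raw p-values are invented, and the pipeline may be configured to use a different method.

```r
# Toy raw p-values for five pathways (illustrative only)
p_raw <- c(pathwayA = 0.001, pathwayB = 0.012, pathwayC = 0.040,
           pathwayD = 0.200, pathwayE = 0.650)

# Benjamini-Hochberg correction controls the false discovery rate
p_adj <- p.adjust(p_raw, method = "BH")

# Pathways passing the 0.05 cutoff on the adjusted scale
significant <- names(p_adj)[p_adj < 0.05]
```

Note that pathwayC has a raw p-value below 0.05 but no longer passes after correction.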

Visualization Guide

Dot plots: Pathway significance and gene ratio
Bar plots: Pathway enrichment counts
Ridge plots: GSEA enrichment distribution
Network plots: Gene-pathway relationships

Customization

Adjust Significance Thresholds

# In the main script, modify these parameters:
padj_threshold <- 0.01   # Adjusted p-value cutoff
lfc_threshold <- 1       # Log2 fold change threshold (2-fold)
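
As a rough illustration of how these two thresholds interact, the sketch below counts up- and downregulated genes in a toy table that uses DESeq2's column names. It is not the body of count_up_down(), whose implementation is not shown here, and the values are invented.

```r
padj_threshold <- 0.01
lfc_threshold <- 1

# Toy results table with DESeq2-style column names (values are illustrative)
res <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.3, -1.8, 0.4, 1.2, -0.2, -3.1),
  padj = c(0.0005, 0.004, 0.0001, 0.03, NA, 0.009)
)

# DESeq2 can report NA for padj (e.g. genes removed by independent filtering);
# treat those genes as not significant
sig <- !is.na(res$padj) & res$padj < padj_threshold

# A gene must pass both the significance and the effect-size cutoff
up <- sum(sig & res$log2FoldChange >= lfc_threshold)
down <- sum(sig & res$log2FoldChange <= -lfc_threshold)
```

In this toy table gene3 is highly significant but changes too little to count, and gene4 changes enough but misses the p-value cutoff.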

Troubleshooting

Common Issues

Low gene mapping rates
- Check ENSEMBL ID format (version numbers are handled automatically)
- Verify organism database (org.Hs.eg.db for human)
No significant pathways
- Adjust p-value thresholds
- Check if sufficient genes pass filtering
- Verify gene list preparation
Memory issues with large datasets
- Consider filtering more stringently
- Process comparisons individually rather than using lapply

Error Messages

"Duplicate values in names(stats) not allowed": Handled automatically by deduplication in prepare_gene_lists()
Bioconductor connection issues: Ensure a stable internet connection for gene annotation queries
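
The duplicate-names error arises when several ENSEMBL IDs map to the same ENTREZ ID, so the ranked statistics vector ends up with repeated names. The sketch below shows one possible deduplication rule (keep the entry with the largest absolute statistic); prepare_gene_lists() may apply a different rule, and the values are invented.

```r
# Toy ranking statistic keyed by ENTREZ ID; "7157" appears twice on purpose
stats <- c("7157" = 2.1, "672" = -1.4, "7157" = 0.3, "1956" = 3.0)

# Order by absolute value so the strongest entry per ID comes first
ord <- order(abs(stats), decreasing = TRUE)
stats_ordered <- stats[ord]

# Keep only the first (strongest) entry for each ID
stats_unique <- stats_ordered[!duplicated(names(stats_ordered))]

# GSEA expects the list sorted in decreasing order of the statistic
ranked <- sort(stats_unique, decreasing = TRUE)
```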

Citation

If you use this pipeline, please cite the relevant packages:

DESeq2: Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
clusterProfiler: Yu, G., et al. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS, 16(5), 284-287.
fgsea: Korotkevich, G., et al. (2019). Fast gene set enrichment analysis. bioRxiv, 060012.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

This pipeline was developed jointly by Wendy Phillips and GitHub Copilot through collaborative AI-assisted programming.

Note: This pipeline is designed for human RNA-seq data. For other organisms, replace org.Hs.eg.db with the matching annotation package (e.g., org.Mm.eg.db for mouse).
