Commit
Text edits to tutorial and notebook (#9)
* final edits to tutorial

* run the annotate cell notebook
jkanche authored Jul 23, 2024
1 parent 134348c commit fe1b7f5
Showing 5 changed files with 102 additions and 69 deletions.
8 changes: 4 additions & 4 deletions notebook/annotate_cell_types.ipynb

Large diffs are not rendered by default.

104 changes: 69 additions & 35 deletions notebook/genomic_ranges.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 1: `GenomicRanges` and range-based analyses\n",
"# Tutorial 1: Perform range-based analyses using `GenomicRanges`\n",
"\n",
"Genomic range operations are fundamental to many bioinformatics analyses. They allow us to work with intervals of genomic coordinates, which is crucial for understanding the relationships between different genomic features such as genes, regulatory elements, and experimental data like ChIP-seq peaks. In this tutorial, we'll explore how to work with genomic interval data using BiocPy's [GenomicRanges](https://github.com/BiocPy/GenomicRanges/) package, which provides a Python implementation of the R/Bioconductor [GenomicRanges package](https://github.com/Bioconductor/GenomicRanges).\n",
"\n",
@@ -100,7 +100,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 22,
"metadata": {},
"outputs": [
{
@@ -125,12 +125,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This object (`hg38_robject`) can then be coerced into a Python `GenomicRangesList` class."
"This dictionary object (`hg38_robject`) contains 4 keys:\n",
"\n",
"1) **class_name**: class name of the object\n",
"2) **package_name**: name of the package containing the class definition\n",
"3) **data**: contains the value if the object is a scalar\n",
"4) **attributes**: if the object is an S4 class, contains various attributes and their values\n",
"\n",
"This dictionary can then be coerced into a Python `GenomicRangesList` class."
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 23,
"metadata": {},
"outputs": [
{
@@ -229,10 +236,6 @@
}
],
"source": [
"from rds2py import read_rds\n",
"hg38_robject = read_rds(\"./hg38_exons_by_tx.rds\")\n",
"\n",
"# TODO: split this into two\n",
"from rds2py.granges import as_granges_list\n",
"by_tx = as_granges_list(hg38_robject)\n",
"\n",
@@ -245,7 +248,7 @@
"metadata": {},
"source": [
"```{note}\n",
"Currently this is a two step process, we are working on simplifying this to a one step process for supported Bioconductor classes.\n",
"Currently this is a two-step process; we are working on simplifying it to a single step for supported Bioconductor classes.\n",
"```\n",
"\n",
"## 3. Define promoters and TSS\n",
@@ -261,7 +264,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 24,
"metadata": {},
"outputs": [
{
@@ -334,12 +337,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the range gives us exactly one range per transcript, so we can simplify our list to a `GenomicRanges` object. This is similar to `unlist` in R."
"Since `range()` gives us exactly one range per transcript, we can simplify our list to a `GenomicRanges` object. This is similar to `unlist` in R."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 25,
"metadata": {},
"outputs": [
{
@@ -378,7 +381,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 26,
"metadata": {},
"outputs": [
{
@@ -419,7 +422,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 27,
"metadata": {},
"outputs": [
{
@@ -466,7 +469,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
@@ -486,7 +489,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 29,
"metadata": {},
"outputs": [
{
@@ -528,7 +531,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 30,
"metadata": {},
"outputs": [
{
@@ -562,7 +565,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 31,
"metadata": {},
"outputs": [
{
@@ -596,12 +599,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can use `subset_by_overlaps` method to more conveniently overlap the peaks that overlap with any TSS:"
"Alternatively, we can use the `subset_by_overlaps` method to more conveniently subset the peaks that overlap with any TSS:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 32,
"metadata": {},
"outputs": [
{
@@ -639,7 +642,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 33,
"metadata": {},
"outputs": [
{
@@ -675,12 +678,12 @@
"source": [
"### 4.4 Find overlaps with exons\n",
"\n",
"Lets find overlaps with any exon. We `unlist` our `GenomicRangesList` object to get all exon positions."
"Let's find overlaps with any exon. We `unlist` our `GenomicRangesList` object to get all exon positions."
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 34,
"metadata": {},
"outputs": [
{
@@ -720,7 +723,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 35,
"metadata": {},
"outputs": [
{
@@ -779,7 +782,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
@@ -796,7 +799,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 37,
"metadata": {},
"outputs": [
{
@@ -835,7 +838,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 38,
"metadata": {},
"outputs": [
{
@@ -896,17 +899,27 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 39,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([44677057], dtype=int32), width=array([184], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00001838743'], 'tx_name': ['ENST00000006251'], 'tx_id': ['ENST00000006251'], 'gene_name': ['PRR5'], 'gene_id': ['ENSG00000186654'], 'exon_rank': [1]}, number_of_rows=1, row_names=['0'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([50603133], dtype=int32), width=array([366], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00003608148'], 'tx_name': ['ENST00000008876'], 'tx_id': ['ENST00000008876'], 'gene_name': ['MAPK8IP2'], 'gene_id': ['ENSG00000008735'], 'exon_rank': [1]}, number_of_rows=1, row_names=['9'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([20268071], dtype=int32), width=array([248], dtype=int32)), strand=[-1], mcols=BiocFrame(data={'exon_id': ['ENSE00001358408'], 'tx_name': ['ENST00000043402'], 'tx_id': ['ENST00000043402'], 'gene_name': ['RTN4R'], 'gene_id': ['ENSG00000040608'], 'exon_rank': [1]}, number_of_rows=1, row_names=['19'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38']))]\n"
]
}
],
"source": [
"all_first = []\n",
"for txid, grl in by_tx:\n",
"    strand = grl.get_strand(as_type = \"list\")[0]\n",
"    if strand == \"-\":\n",
"        all_first.append(grl.sort()[-1])\n",
"    else:\n",
"        all_first.append(grl.sort()[0])"
"        all_first.append(grl.sort()[0])\n",
"\n",
"print(all_first[:3])"
]
},
{
@@ -918,12 +931,33 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 40,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GenomicRanges with 5387 ranges and 6 metadata columns\n",
" seqnames ranges strand exon_id tx_name tx_id gene_name gene_id exon_rank\n",
" <str> <IRanges> <ndarray[int8]> <list> <list> <list> <list> <list> <list>\n",
" [0] chr22 44677057 - 44677241 + | ENSE00001838743 ENST00000006251 ENST00000006251 PRR5 ENSG00000186654 1\n",
" [1] chr22 50603133 - 50603499 + | ENSE00003608148 ENST00000008876 ENST00000008876 MAPK8IP2 ENSG00000008735 1\n",
" [2] chr22 20268071 - 20268319 - | ENSE00001358408 ENST00000043402 ENST00000043402 RTN4R ENSG00000040608 1\n",
" ... ... ... | ... ... ... ... ... ...\n",
"[5384] chr22 33919995 - 33920477 - | LRG_856t1e1 LRG_856t2 LRG_856t2 LARGE1 LRG_856 1\n",
"[5385] chr22 37244114 - 37244266 - | LRG_97t1e1 LRG_97t1 LRG_97t1 RAC2 LRG_97 1\n",
"[5386] chr22 20982297 - 20982572 + | LRG_989t1e1 LRG_989t1 LRG_989t1 LZTR1 LRG_989 1\n",
"------\n",
"seqinfo(1 sequences): chr22\n"
]
}
],
"source": [
"from biocutils import combine_sequences\n",
"first_exons = combine_sequences(*all_first)"
"first_exons = combine_sequences(*all_first)\n",
"\n",
"print(first_exons)"
]
},
{
@@ -935,7 +969,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 41,
"metadata": {},
"outputs": [
{
@@ -976,7 +1010,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 42,
"metadata": {},
"outputs": [
{
4 changes: 2 additions & 2 deletions tutorials/annotate_cell_types.qmd
@@ -1,4 +1,4 @@
# Tutorial 2: Access single-cell datasets from `scRNAseq` collection and annotate cell types
# Tutorial 2: Annotate cell types in single-cell RNA-seq data

Welcome to this tutorial on annotating single-cell datasets with reference collections. The **scRNAseq** ([R/Bioc](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), [Python](https://github.com/BiocPy/scrnaseq)) package provides access to public single-cell RNA-seq datasets for use by other Bioconductor/BiocPy packages and workflows. These datasets are stored in language-agnostic representations described in [ArtifactDB](https://github.com/artifactdb), enabling easy access to datasets and analysis results across multiple programming languages such as R and Python. We will showcase how to integrate and process single-cell datasets across languages, such as R and Python, and how to annotate cell types using reference datasets.

@@ -39,7 +39,7 @@ BiocManager::install(c("scRNAseq", "celldex", "SingleR"),
```
:::

## 1. Accessing and exploring single-cell datasets
## 1. Access and explore single-cell datasets

Let's explore the `scrnaseq` package and learn how to access public single-cell RNA-seq datasets. Datasets published to the `scrnaseq` package are decorated with metadata such as the study title, species, number of cells, etc., to facilitate discovery. Let's see how we can list and search for datasets.
