Text edits to tutorial and notebook (#9)

* final edits to tutorial
* run the annotate cell notebook
BiocPy · Jul 23, 2024 · fe1b7f5
1 parent 134348c
commit fe1b7f5
Showing 5 changed files with 102 additions and 69 deletions.
diff --git a/notebook/annotate_cell_types.ipynb b/notebook/annotate_cell_types.ipynb
diff --git a/notebook/genomic_ranges.ipynb b/notebook/genomic_ranges.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Tutorial 1: `GenomicRanges` and range-based analyses\n",
+    "# Tutorial 1: Perform range-based analyses using `GenomicRanges`\n",
     "\n",
     "Genomic range operations are fundamental to many bioinformatics analyses. They allow us to work with intervals of genomic coordinates, which is crucial for understanding the relationships between different genomic features such as genes, regulatory elements, and experimental data like ChIP-seq peaks. In this tutorial, we'll explore how to work with genomic interval data using BiocPy's [GenomicRanges](https://github.com/BiocPy/GenomicRanges/) package, which provides a Python implementation of the R/Bioconductor [GenomicRanges package](https://github.com/Bioconductor/GenomicRanges).\n",
     "\n",
@@ -100,7 +100,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 22,
    "metadata": {},
    "outputs": [
     {
@@ -125,12 +125,19 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This object (`hg38_robject`) can then be coerced into a Python `GenomicRangesList` class."
+    "This dictionary object (`hg38_robject`) contains 4 keys:\n",
+    "\n",
+    "1) **class_name**: class name of the object\n",
+    "2) **package_name**: name of the package containing the class definition\n",
+    "3) **data**: contains the value if the object is a scalar\n",
+    "4) **attributes**: if the object is an S4 class, contains various attributes and their values\n",
+    "\n",
+    "This dictionary can then be coerced into a Python `GenomicRangesList` class."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 23,
    "metadata": {},
    "outputs": [
     {
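Reviewer note: as a rough illustration of the four-key layout described in the hunk above, the sketch below builds a hand-written stand-in dictionary and checks it for the expected keys. The values, the attribute names, and the helper `summarize_robject` are hypothetical and only show the shape. The real output of `read_rds` may nest things differently, and treating non-empty `attributes` as a sign of an S4 object is an assumption made for this example.

```python
# The four keys named in the tutorial text.
EXPECTED_KEYS = ("class_name", "package_name", "data", "attributes")

# Hypothetical stand-in for a parsed RDS object; values are illustrative only.
mock_robject = {
    "class_name": "CompressedGRangesList",
    "package_name": "GenomicRanges",
    "data": [],
    "attributes": {"unlistData": {}, "partitioning": {}},
}


def summarize_robject(robj):
    """Return a one-line summary, raising if any of the four keys is absent."""
    missing = [key for key in EXPECTED_KEYS if key not in robj]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    # Assumption: non-empty attributes suggests an S4 object.
    kind = "S4 object" if robj["attributes"] else "scalar/vector"
    return f"{robj['package_name']}::{robj['class_name']} ({kind})"


summary = summarize_robject(mock_robject)
print(summary)
```

A check like this can be handy before coercion, since a dictionary missing one of the keys would probably fail later with a less readable error.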
@@ -229,10 +236,6 @@
     }
    ],
    "source": [
-    "from rds2py import read_rds\n",
-    "hg38_robject = read_rds(\"./hg38_exons_by_tx.rds\")\n",
-    "\n",
-    "# TODO: split this into two\n",
     "from rds2py.granges import as_granges_list\n",
     "by_tx = as_granges_list(hg38_robject)\n",
     "\n",
@@ -245,7 +248,7 @@
    "metadata": {},
    "source": [
     "```{note}\n",
-    "Currently this is a two step process, we are working on simplifying this to a one step process for supported Bioconductor classes.\n",
+    "Currently this is a two-step process; we are working on simplifying it to a single step for supported Bioconductor classes.\n",
     "```\n",
     "\n",
     "## 3. Define promoters and TSS\n",
@@ -261,7 +264,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 24,
    "metadata": {},
    "outputs": [
     {
@@ -334,12 +337,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Since the range gives us exactly one range per transcript, so we can simplify our list to a `GenomicRanges` object. This is similar to `unlist` in R."
+    "Since `range()` gives us exactly one range per transcript, we can simplify our list to a `GenomicRanges` object. This is similar to `unlist` in R."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 25,
    "metadata": {},
    "outputs": [
     {
@@ -378,7 +381,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 26,
    "metadata": {},
    "outputs": [
     {
@@ -419,7 +422,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 27,
    "metadata": {},
    "outputs": [
     {
@@ -466,7 +469,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 28,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -486,7 +489,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 29,
    "metadata": {},
    "outputs": [
     {
@@ -528,7 +531,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 30,
    "metadata": {},
    "outputs": [
     {
@@ -562,7 +565,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 31,
    "metadata": {},
    "outputs": [
     {
@@ -596,12 +599,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Alternatively, we can use `subset_by_overlaps` method to more conveniently overlap the peaks that overlap with any TSS:"
+    "Alternatively, we can use the `subset_by_overlaps` method to more conveniently subset the peaks that overlap with any TSS:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 32,
    "metadata": {},
    "outputs": [
     {
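Reviewer note: the idea behind subsetting peaks by overlap with any TSS, as worded in the hunk above, can be sketched in plain Python. This is not how `GenomicRanges` implements `subset_by_overlaps`; it is a minimal quadratic-time illustration that assumes closed intervals on a single chromosome and ignores strand. The function names and coordinates are invented for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed intervals [start, end] share at least one position."""
    return a_start <= b_end and b_start <= a_end


def subset_by_any_overlap(query, subject):
    """Keep each query interval that overlaps at least one subject interval."""
    return [
        q for q in query
        if any(overlaps(q[0], q[1], s[0], s[1]) for s in subject)
    ]


# Toy coordinates, not taken from the tutorial's data.
peaks = [(100, 200), (500, 650), (900, 950)]
tss = [(150, 150), (700, 700), (940, 940)]  # width-1 intervals

peaks_at_tss = subset_by_any_overlap(peaks, tss)
print(peaks_at_tss)
```

The result keeps the query intervals themselves rather than (query, subject) index pairs, which is the practical difference between subsetting by overlap and listing every overlap hit.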
@@ -639,7 +642,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 33,
    "metadata": {},
    "outputs": [
     {
@@ -675,12 +678,12 @@
    "source": [
     "### 4.4 Find overlaps with exons\n",
     "\n",
-    "Lets find overlaps with any exon. We `unlist` our `GenomicRangesList` object to get all exon positions."
+    "Let's find overlaps with any exon. We `unlist` our `GenomicRangesList` object to get all exon positions."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 34,
    "metadata": {},
    "outputs": [
     {
@@ -720,7 +723,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 35,
    "metadata": {},
    "outputs": [
     {
@@ -779,7 +782,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 36,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -796,7 +799,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 37,
    "metadata": {},
    "outputs": [
     {
@@ -835,7 +838,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 38,
    "metadata": {},
    "outputs": [
     {
@@ -896,17 +899,27 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 39,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([44677057], dtype=int32), width=array([184], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00001838743'], 'tx_name': ['ENST00000006251'], 'tx_id': ['ENST00000006251'], 'gene_name': ['PRR5'], 'gene_id': ['ENSG00000186654'], 'exon_rank': [1]}, number_of_rows=1, row_names=['0'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([50603133], dtype=int32), width=array([366], dtype=int32)), strand=[1], mcols=BiocFrame(data={'exon_id': ['ENSE00003608148'], 'tx_name': ['ENST00000008876'], 'tx_id': ['ENST00000008876'], 'gene_name': ['MAPK8IP2'], 'gene_id': ['ENSG00000008735'], 'exon_rank': [1]}, number_of_rows=1, row_names=['9'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38'])), GenomicRanges(number_of_ranges=1, seqnames=[0], ranges=IRanges(start=array([20268071], dtype=int32), width=array([248], dtype=int32)), strand=[-1], mcols=BiocFrame(data={'exon_id': ['ENSE00001358408'], 'tx_name': ['ENST00000043402'], 'tx_id': ['ENST00000043402'], 'gene_name': ['RTN4R'], 'gene_id': ['ENSG00000040608'], 'exon_rank': [1]}, number_of_rows=1, row_names=['19'], column_names=['exon_id', 'tx_name', 'tx_id', 'gene_name', 'gene_id', 'exon_rank']), seqinfoSeqInfo(number_of_seqnames=1, seqnames=['chr22'], seqlengths=[50818468], is_circular=[False], genome=['GRCh38']))]\n"
+     ]
+    }
+   ],
    "source": [
     "all_first = []\n",
     "for txid, grl in by_tx:\n",
     "    strand = grl.get_strand(as_type = \"list\")[0]\n",
     "    if strand == \"-\":\n",
     "        all_first.append(grl.sort()[-1])\n",
     "    else:\n",
-    "        all_first.append(grl.sort()[0])"
+    "        all_first.append(grl.sort()[0])\n",
+    "\n",
+    "print(all_first[:3])"
    ]
   },
   {
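Reviewer note: the strand-aware selection in the hunk above (sort the exons, then take the last one on the minus strand and the first one otherwise) can be checked on toy data without any BiocPy objects. The sketch below assumes exons are plain `(start, end)` tuples and strands are `"+"` or `"-"` strings. The helper `first_exon` and all coordinates are hypothetical.

```python
def first_exon(exons, strand):
    """Pick the 5'-most exon: lowest start on '+', highest start on '-'."""
    ordered = sorted(exons, key=lambda exon: exon[0])
    return ordered[-1] if strand == "-" else ordered[0]


# Toy transcripts: (strand, list of (start, end) exons); coordinates are made up.
toy_by_tx = {
    "tx_plus": ("+", [(300, 400), (100, 150), (200, 250)]),
    "tx_minus": ("-", [(300, 400), (100, 150), (200, 250)]),
}

all_first = {
    tx: first_exon(exons, strand)
    for tx, (strand, exons) in toy_by_tx.items()
}
print(all_first)
```

On the minus strand transcription runs from high to low coordinates, so the exon with the highest start is the first one transcribed. That is why the notebook cell indexes with `[-1]` after sorting.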
@@ -918,12 +931,33 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 40,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GenomicRanges with 5387 ranges and 6 metadata columns\n",
+      "       seqnames              ranges          strand           exon_id         tx_name           tx_id gene_name         gene_id exon_rank\n",
+      "          <str>           <IRanges> <ndarray[int8]>            <list>          <list>          <list>    <list>          <list>    <list>\n",
+      "   [0]    chr22 44677057 - 44677241               + | ENSE00001838743 ENST00000006251 ENST00000006251      PRR5 ENSG00000186654         1\n",
+      "   [1]    chr22 50603133 - 50603499               + | ENSE00003608148 ENST00000008876 ENST00000008876  MAPK8IP2 ENSG00000008735         1\n",
+      "   [2]    chr22 20268071 - 20268319               - | ENSE00001358408 ENST00000043402 ENST00000043402     RTN4R ENSG00000040608         1\n",
+      "            ...                 ...             ... |             ...             ...             ...       ...             ...       ...\n",
+      "[5384]    chr22 33919995 - 33920477               - |     LRG_856t1e1       LRG_856t2       LRG_856t2    LARGE1         LRG_856         1\n",
+      "[5385]    chr22 37244114 - 37244266               - |      LRG_97t1e1        LRG_97t1        LRG_97t1      RAC2          LRG_97         1\n",
+      "[5386]    chr22 20982297 - 20982572               + |     LRG_989t1e1       LRG_989t1       LRG_989t1     LZTR1         LRG_989         1\n",
+      "------\n",
+      "seqinfo(1 sequences): chr22\n"
+     ]
+    }
+   ],
    "source": [
     "from biocutils import combine_sequences\n",
-    "first_exons = combine_sequences(*all_first)"
+    "first_exons = combine_sequences(*all_first)\n",
+    "\n",
+    "print(first_exons)"
    ]
   },
   {
@@ -935,7 +969,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 41,
    "metadata": {},
    "outputs": [
     {
@@ -976,7 +1010,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 42,
    "metadata": {},
    "outputs": [
     {

diff --git a/tutorials/annotate_cell_types.qmd b/tutorials/annotate_cell_types.qmd
@@ -1,4 +1,4 @@
-# Tutorial 2: Access single-cell datasets from `scRNAseq` collection and annotate cell types
+# Tutorial 2: Annotate cell types in single-cell RNA-seq data
 
 Welcome to this tutorial on annotating single-cell datasets with reference collections. The **scRNAseq** ([R/Bioc](https://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html), [Python](https://github.com/BiocPy/scrnaseq)) package provides access to public single-cell RNA-seq datasets for use by other Bioconductor/BiocPy packages and workflows. These datasets are stored in language-agnostic representations described in [ArtifactDB](https://github.com/artifactdb), enabling easy access to datasets and analysis results across multiple programming languages such as R and Python. We will showcase how to integrate and process single-cell datasets across languages, such as R and Python, and how to annotate cell types using reference datasets.
 
@@ -39,7 +39,7 @@ BiocManager::install(c("scRNAseq", "celldex", "SingleR"),
 ```
 :::
 
-## 1. Accessing and exploring single-cell datasets
+## 1. Access and explore single-cell datasets
 
 Let's explore the `scrnaseq` package and learn how to access public single-cell RNA-seq datasets. Datasets published to the `scrnaseq` package are decorated with metadata such as the study title, species, number of cells, etc., to facilitate discovery. Let's see how we can list and search for datasets.