Add section on language agnostic representatios (#7)
jkanche authored Feb 24, 2024
1 parent 5a8b887 commit 5ab8303
Showing 8 changed files with 146 additions and 6 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/publish.yml
@@ -15,7 +15,7 @@ jobs:
contents: write
steps:
- name: Check out repository
-uses: actions/checkout@v3
+uses: actions/checkout@v4

- name: Set up Quarto
uses: quarto-dev/quarto-actions/setup@v2
@@ -27,7 +27,9 @@ jobs:
with:
python-version: '3.9'
cache: 'pip'
-- run: pip install jupyter
+# - run: pip install uv
+# - run: uv venv
+# - run: source .venv/bin/activate
- run: pip install -r requirements.txt

- name: Render
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
/_site/
docs
_freeze
-.jupyter_cache/
+.jupyter_cache/
+
+chapters/zilinoislung_with_celltypist/
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -43,6 +43,8 @@ book:
- chapters/experiments/extending_se.qmd
- chapters/experiments/multiassay_expt.qmd
- chapters/interop.qmd
+- chapters/language_agnostic.qmd
+- chapters/workflow.qmd
- part: chapters/extras/index.qmd
chapters:
- chapters/extras/iranges.qmd
Binary file added assets/data/zilinois-lung-subset.rds
Binary file not shown.
2 changes: 1 addition & 1 deletion chapters/interop.qmd
@@ -1,4 +1,4 @@
-# Interop with R
+# Interop with RDS files

The [rds2py](https://github.com/BiocPy/rds2py) package serves as a Python interface to the [rds2cpp](https://github.com/LTLA/rds2cpp) library, enabling direct reading of RDS files within Python. This eliminates the need for additional data conversion tools or intermediate formats, streamlining the transition between Python and R for seamless analysis.

36 changes: 36 additions & 0 deletions chapters/language_agnostic.qmd
@@ -0,0 +1,36 @@
# Language-agnostic genomic data store

In this section, we will illustrate a workflow that utilizes language-agnostic representations for storing genomic data, facilitating seamless access to datasets and analysis results across multiple programming frameworks such as R and Python. The [ArtifactDB](https://github.com/artifactdb) framework provides this functionality.

To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package. Subsequently, we will store this dataset in a language-agnostic format using the [alabaster suite](https://github.com/ArtifactDB/alabaster.base) of R packages.

```r
library(scRNAseq)
library(alabaster)

sce <- ZilionisLungData()
saveObject(sce, path=paste(getwd(), "zilinoislung", sep="/"))
```

:::{.callout-note}
Additionally, you can save this dataset as an RDS object for access in Python. Refer to the [interop with R](./interop.qmd) section for more details.
:::

We can now load this dataset in Python using the [dolomite suite](https://github.com/ArtifactDB/dolomite-base) of Python packages. Both dolomite and alabaster are integral parts of the ArtifactDB ecosystem designed to read artifacts stored in language-agnostic formats.

```python
from dolomite_base import read_object

data = read_object("./zilinoislung")
print(data)
```
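
Because the saved artifact is an ordinary directory of files, it can be inspected without any ArtifactDB tooling. The following is a minimal sketch using only the Python standard library. `list_artifact_files` is a hypothetical helper name, and no assumption is made here about which files `saveObject` actually writes.

```python
from pathlib import Path


def list_artifact_files(root):
    """Return the relative paths of all files under `root`, sorted."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Hypothetical usage: the path matches the directory written by saveObject above.
# Nothing is printed if the directory does not exist yet.
if Path("./zilinoislung").is_dir():
    for name in list_artifact_files("./zilinoislung"):
        print(name)
```

Listing the files this way is only meant as a quick sanity check that the save step produced output; reading the contents should still go through `read_object`.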

You can now convert this to an `AnnData` representation for downstream analysis.

```python
# to_anndata() returns the main AnnData plus any alternative experiments
adata, alt_adatas = data.to_anndata()
```

:::{.callout-note}
Check out the [ArtifactDB](https://github.com/artifactdb) framework for more information.
:::
95 changes: 95 additions & 0 deletions chapters/workflow.qmd
@@ -0,0 +1,95 @@
# Seamless analysis workflow

In this section, we will illustrate a workflow that utilizes either language-agnostic representations for storing genomic data or reading RDS files directly in Python, to facilitate seamless access to datasets and analysis results.

:::{.callout-note}
Check out

- the [interop with R](./interop.qmd) section for reading RDS files directly in Python, or
- the [language-agnostic](./language_agnostic.qmd) section for storing genomic data in language-agnostic formats.
:::

To begin, we will download the "zilionis lung" dataset from the [scRNAseq](https://bioconductor.org/packages/release/data/experiment/html/scRNAseq.html) package and save a subset of the first 2,000 cells as an RDS file, which we can then read directly in Python.

```r
library(scRNAseq)

sce <- ZilionisLungData()
# Keep the first 2,000 cells to produce a small example file
sub <- sce[,1:2000]
saveRDS(sub, "../assets/data/zilinois-lung-subset.rds")
```

To demonstrate this workflow, we will employ the [CellTypist](https://github.com/Teichlab/celltypist) model to annotate cell types for this dataset. CellTypist operates on an AnnData representation.

```{python}
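import numpy as np
from rds2py import read_rds, as_summarized_experiment

# Parse the RDS file and convert the parsed R object into an experiment object
r_object = read_rds("../assets/data/zilinois-lung-subset.rds")
sce = as_summarized_experiment(r_object)

# to_anndata() returns the AnnData object plus any alternative experiments
adata, _ = sce.to_anndata()

# Log-transform the counts and use gene names as the var index
adata.X = np.log1p(adata.layers["counts"])
adata.var.index = adata.var["genes"].tolist()
print(adata)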
```
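
The `np.log1p` call above applies the transform log(1 + x) to every count, which keeps zeros at zero and compresses large counts. Below is a small standard-library sketch of the same transform; the counts are made up for illustration and are not taken from the dataset.

```python
import math

# Made-up raw counts for a single gene across five cells (illustrative only).
counts = [0, 1, 9, 99, 999]

# log1p(x) computes log(1 + x); zero counts stay exactly zero.
log_counts = [math.log1p(c) for c in counts]

for raw, transformed in zip(counts, log_counts):
    print(f"{raw:>4} -> {transformed:.3f}")
```

The transform is monotonic, so the ordering of cells by expression is preserved while the dynamic range shrinks.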

Before annotation, let's download the "human lung atlas" model from CellTypist.

```{python}
import celltypist
from celltypist import models

model_name = "Human_Lung_Atlas.pkl"
# Download only the model used here rather than every available model
models.download_models(model = model_name)
model = models.Model.load(model = model_name)
print(model)
```

Now, let's annotate our dataset.

```{python}
predictions = celltypist.annotate(adata, model = model_name, majority_voting = True)
print(predictions.predicted_labels)
```
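
With `majority_voting = True`, per-cell predictions are refined by giving each group of similar cells its most frequent label. The sketch below illustrates that idea only; it is not CellTypist's implementation, and the `majority_vote` helper, the cluster ids and the labels are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(clusters, labels):
    """Replace each cell's label with the most common label in its cluster."""
    by_cluster = defaultdict(list)
    for cluster, label in zip(clusters, labels):
        by_cluster[cluster].append(label)

    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    winner = {c: Counter(ls).most_common(1)[0][0] for c, ls in by_cluster.items()}
    return [winner[c] for c in clusters]


# Hypothetical per-cell cluster assignments and raw predicted labels.
clusters = [0, 0, 0, 1, 1]
labels = ["T cell", "T cell", "B cell", "Macrophage", "Macrophage"]
print(majority_vote(clusters, labels))
```

In this toy input the single "B cell" call in cluster 0 is outvoted by its two "T cell" neighbours, which is the kind of isolated misclassification that voting is meant to smooth out.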

:::{.callout-note}
The CellTypist workflow is based on the tutorial described [here](https://colab.research.google.com/github/Teichlab/celltypist/blob/main/docs/notebook/celltypist_tutorial.ipynb#scrollTo=postal-chicken).
:::

Next, let's retrieve the `AnnData` object with the predicted labels embedded into the `obs` dataframe.

```{python}
adata = predictions.to_adata()
adata
```

We can now reverse the workflow and save this object in an ArtifactDB format from Python. However, the object must first be converted to a `SingleCellExperiment`. Read more about our experiment representations [here](./experiments/singlecell_expt.qmd).

```{python}
from singlecellexperiment import SingleCellExperiment
sce = SingleCellExperiment.from_anndata(adata)
print(sce)
```

We use the dolomite package to save it into a language-agnostic format.

```{python}
import dolomite_base
import dolomite_sce  # imported so that SingleCellExperiment objects can be saved and read
dolomite_base.save_object(sce, "./zilinoislung_with_celltypist")
```

Finally, read the object back in R.

```r
library(alabaster)

sce_with_celltypist <- readObject(path=paste(getwd(), "zilinoislung_with_celltypist", sep="/"))
sce_with_celltypist
```

And that concludes the workflow. Leveraging the generic **read** functions `readObject` (R) and `read_object` (Python), along with the **save** functions `saveObject` (R) and `save_object` (Python), you can seamlessly store most Bioconductor objects in language-agnostic formats.
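
One way to picture how generic read and save functions can work is a type tag stored alongside the data, with a registry that maps each tag to a reader. The sketch below is a toy illustration under that assumption and is not how alabaster or dolomite are implemented. The `toy_save_object` and `toy_read_object` functions, the `OBJECT` and `data.json` file names, and the `count_table` type are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Registry mapping a type name stored on disk to a reader function.
READERS = {}


def register_reader(type_name):
    """Decorator that registers a reader for one stored type name."""
    def decorator(func):
        READERS[type_name] = func
        return func
    return decorator


def toy_save_object(obj, path):
    """Write a dict of gene -> counts as a typed, JSON-backed directory."""
    path = Path(path)
    path.mkdir(parents=True)
    (path / "OBJECT").write_text(json.dumps({"type": "count_table"}))
    (path / "data.json").write_text(json.dumps(obj))


@register_reader("count_table")
def read_count_table(path):
    return json.loads((path / "data.json").read_text())


def toy_read_object(path):
    """Look up the stored type tag and dispatch to the matching reader."""
    path = Path(path)
    type_name = json.loads((path / "OBJECT").read_text())["type"]
    return READERS[type_name](path)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy"
    toy_save_object({"GeneA": [1, 0, 3]}, target)
    restored = toy_read_object(target)
    print(restored)
```

Because the files on disk are plain JSON, a reader written in any other language could apply the same lookup, which is the property the workflow above relies on.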

----

## Further reading

- ArtifactDB GitHub organization - [https://github.com/ArtifactDB](https://github.com/ArtifactDB).
7 changes: 5 additions & 2 deletions requirements.txt
@@ -10,6 +10,7 @@ singler
numpy
scipy
pandas
+jupyter
jupyter-cache
rich
jupyterlab
@@ -20,5 +21,7 @@ anndata
mudata
delayedarray[dask]
joblib
-dolomite
-hdf5array
+dolomite_mae
+dolomite_sce
+hdf5array
+celltypist
