The Tabula muris data was generated by the Chan Zuckerberg Biohub and made available for anyone to use via Amazon S3.
This data collection is the underlying dataset for the recent publication "Transcriptomic characterization of 20 organs and tissues from mouse at single cell resolution creates a Tabula Muris". The Tabula muris project is a compendium of single cell transcriptomic data from the mouse containing nearly 100,000 cells from 20 organs and tissues. The data allow for direct and controlled comparison of gene expression in cell types shared between tissues, such as immune cells from distinct anatomical locations. The resource also enables contrasting two distinct technical approaches:
- microfluidic droplet-based 3'-end counting, which provides a survey of thousands of cells per organ at relatively low coverage.
- FACS-based full length transcript analysis, which provides higher sensitivity and coverage.
This rich collection of annotated cells will be a useful resource for:
- Defining gene expression in previously poorly-characterized cell populations.
- Validating findings in future targeted single-cell studies.
- Developing methods for integrating datasets (e.g., between the FACS and droplet experiments), characterizing batch effects, and quantifying the variation of gene expression in many cell types between organs and animals.
Since late 2017, Tabula muris data have been made available to all users free of charge. AWS has made the data freely available on Amazon S3 so that anyone can download the resource to perform analysis and advance medical discovery without needing to worry about the cost of storing Tabula muris data or the time required to download it.
Learn more about how Tabula muris data is used in the project vignettes repo.
The data are organized using the directory structure below.
czbiohub-tabula-muris
├── 10x_bam_files: BAM files for 10x droplet data
│ ├── *.bam
│ └── *.bam.bai
├── facs_bam_files: BAM files for FACS Smart-seq2 data
│ ├── *.bam
│ └── *.bam.bai
├── TM_droplet_mat.csv.gz
├── TM_droplet_mat.h5ad
├── TM_droplet_mat.rds
├── TM_droplet_metadata.csv
├── TM_facs_mat.csv.gz
├── TM_facs_mat.h5ad
├── TM_facs_mat.rds
└── TM_facs_metadata.csv
The unprocessed data files are stored in two different folders, 10x_bam_files and facs_bam_files, according to the method used when preparing the samples, 10x or FACS.
The processed data is provided in three different formats for each of the two methods:
- .h5ad files to load in Python using anndata
- .rds files to load in R
- .csv.gz files for general use
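Because the .csv.gz matrices can be large, it may help to look at the first few rows and columns before loading a whole file. The following is a minimal sketch using only the Python standard library. It assumes the file has a header row; the layout of the matrix (which axis holds genes and which holds cells) is not stated here, so check it against what the function returns. The function name `peek_csv_gz` is illustrative and not part of any project tooling.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n_rows=3, n_cols=5):
    """Return (header, rows) for the top-left corner of a gzipped CSV.

    Only the first n_rows data rows are read, so the full matrix is never
    held in memory. Assumes the first line of the file is a header row.
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row[:n_cols] for row in islice(reader, n_rows)]
    return header[:n_cols], rows


if __name__ == "__main__":
    # Assumes the file was downloaded into a local "data" folder.
    header, rows = peek_csv_gz("data/TM_facs_mat.csv.gz")
    print(header)
    for row in rows:
        print(row)
```

Reading lazily with `islice` keeps the check cheap even for the full count matrices.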
A CSV file describing all data is available for each method at s3://czbiohub-tabula-muris/TM_facs_metadata.csv or s3://czbiohub-tabula-muris/TM_droplet_metadata.csv
If you use the AWS Command Line Interface, you can access the bucket with the command: aws s3 ls s3://czbiohub-tabula-muris
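To fetch the processed files for one method, the object keys from the directory listing can be turned into `aws s3 cp` commands. The sketch below only builds the command strings; it does not run them. The helper names and the local `data` destination folder are assumptions made for illustration. The optional `--no-sign-request` flag is a standard AWS CLI option for anonymous access; whether you need it depends on your local credentials setup.

```python
import shlex

BUCKET = "czbiohub-tabula-muris"
METHODS = ("droplet", "facs")
MATRIX_EXTENSIONS = ("csv.gz", "h5ad", "rds")


def processed_keys(method):
    """Return the processed-file keys shown in the directory listing for one method."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    keys = [f"TM_{method}_mat.{ext}" for ext in MATRIX_EXTENSIONS]
    keys.append(f"TM_{method}_metadata.csv")
    return keys


def copy_command(key, dest_dir="data", anonymous=False):
    """Build a shell-quoted `aws s3 cp` command for a single key."""
    args = ["aws", "s3", "cp", f"s3://{BUCKET}/{key}", f"{dest_dir}/{key}"]
    if anonymous:
        # Skip credential lookup; assumes the bucket allows anonymous reads.
        args.append("--no-sign-request")
    return shlex.join(args)


if __name__ == "__main__":
    for key in processed_keys("facs"):
        print(copy_command(key))
```

Printing the commands first makes it easy to review them, or to pipe only the ones you need into a shell.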
You can download complete count files as sparse matrices in .rds format for easy loading into R. Download TM_facs_mat.rds and TM_droplet_mat.rds into the data folder. They can be loaded as follows:
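library(here)   # here() builds paths relative to the project root
library(readr)  # provides read_csv()
tm.droplet.matrix = readRDS(here("data", "TM_droplet_mat.rds"))
tm.droplet.metadata = read_csv(here("data", "TM_droplet_metadata.csv"))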
You can download complete count files as sparse matrices in anndata's h5ad file format for use in Python (TM_facs_mat.h5ad and TM_droplet_mat.h5ad). You can process the resulting AnnData object using, for instance, Scanpy.
import pandas as pd
import scanpy as sc
# Per-cell annotations for the FACS experiment
tm_facs_metadata = pd.read_csv('data/TM_facs_metadata.csv')
# Count matrix as an AnnData object
tm_facs_data = sc.read_h5ad('data/TM_facs_mat.h5ad')
To recover FASTQ files, you will need to download the BAM files and then:
- if working with the 10X dataset, download 10X's bam2fastq tool, or
- if working with the FACS/Smart-seq2 dataset, use bamtofastq from bedtools
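For the FACS BAM files, the bedtools step can be scripted. The sketch below builds a `bedtools bamtofastq` argument list for one BAM file using the tool's `-i`, `-fq`, and `-fq2` options. The function name, the `fastq` output folder, the `_R1`/`_R2` suffixes, and the example file name `cell1.bam` are all hypothetical. Paired output with `-fq2` expects the BAM to be grouped by read name, so the input may need name-sorting first.

```python
from pathlib import PurePosixPath


def bedtools_bamtofastq_command(bam_path, out_dir="fastq", paired=False):
    """Build the argument list for `bedtools bamtofastq` on one BAM file.

    Returns a list suitable for subprocess.run(); nothing is executed here.
    """
    bam = PurePosixPath(bam_path)
    out = PurePosixPath(out_dir)
    args = [
        "bedtools", "bamtofastq",
        "-i", str(bam),
        "-fq", str(out / f"{bam.stem}_R1.fastq"),
    ]
    if paired:
        # Second mate goes to its own file; requires name-grouped input.
        args += ["-fq2", str(out / f"{bam.stem}_R2.fastq")]
    return args


if __name__ == "__main__":
    # "cell1.bam" is a placeholder, not a real object name in the bucket.
    print(" ".join(bedtools_bamtofastq_command("facs_bam_files/cell1.bam", paired=True)))
```

Returning a list rather than a string lets the command be passed to `subprocess.run` without shell quoting concerns.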
If you find the Tabula muris data useful for your research, please cite our publication.
If you have questions about the data, you can create an Issue at the project repo on GitHub.
There are no restrictions on the use of data received from the Chan Zuckerberg Biohub, unless expressly identified prior to or at the time of receipt.