Update README.md #1

Open
wants to merge 1 commit into
base: master
12 changes: 6 additions & 6 deletions README.md
@@ -2,12 +2,12 @@

* **MASCOT** is the first one-stop applicable pipeline based on topic model to analyze single-cell CRISPR screening data (independently termed **Perturb-Seq**, **CRISP-seq**, or **CROP-seq**), which could help to prioritize the gene perturbation effect in a cellular heterogeneity level.
* **MASCOT** is an integrated pipeline for model-based analysis of single cell CRISPR knockout screening data. **MASCOT** consists of three steps: **data preprocessing**, **model building** and **perturbation effect prioritizing**:
-* **Data preprocessing**: Besides the conventional quality control and data normalization applied in single-cell RNA-seq analysis, **MASCOT** addresses two specific considerations that should be taken into account for such a novel data type: **(1)** Filtering perturbed cells with invalid edit; and **(2)** Filtering perturbation according to a minimal number of cells per perturbation.
+* **Data preprocessing**: Besides the conventional quality control and data normalization applied in single-cell RNA-seq analysis, **MASCOT** addresses two specific considerations that should be taken into account for such a novel data type: **(1)** Filtering perturbed cells with invalid edits; and **(2)** Filtering perturbations according to a minimal number of cells per perturbation.
* **Model building**: **MASCOT** builds an analytical model based on Topic Models to handle single-cell CRISPR screening data. The concept of topic models was initially presented in machine learning community and has been successfully applied to gene expression data analysis. A key feature of topic model is that it allows each perturbed sample to process a proportion of membership in each functional topic rather than to categorize the sample into a discrete cluster. Such a topic profile, which is derived from large-scale cell-to-cell different perturbed samples, allows for a quantitative description of the biologic function of cells under specific gene perturbation conditions. **MASCOT** addresses two specific issues when applying the topic model to this specific data type: **(1)** The distribution of topics between cases and controls is affected by the ratio of their sample numbers, and such a sample imbalance issue is addressed by the bootstrapping strategy when prioritizing the perturbation effect. **(2)** The optimal topic number is automatically selected by MASCOT in a data-driven manner.
* **Perturbation effect prioritizing**: Based on the model-based perturbation analysis, **MASCOT** can quantitatively estimate and prioritize the individual gene perturbation effect on cell phenotypes from three different perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, or in a functional topic-specific way and quantifying the relationships between different perturbations.
-* **Input File Format**. For running **MASCOT**, the input data needed to follow the standard format we defined. For convenience, **MASCOT** accepts two kinds of input data formats: **(1)** The first data format can be referred in the **data_format_example/crop_unstimulated.RData** we provided. It is an example dataset containing "expression_profile", "perturb_information" and "sgRNA_information". You can apply function "Input_preprocess()" to handle this data format; **(2)** The second data format can be referred in the **data_format_example/perturb_GSM2396857/** generated by 10X genomics. The directory **data_format_example/perturb_GSM2396857** contains "barcodes.tsv", "genes.tsv", "matrix.mtx", "cbc_gbc_dict.tsv" and "cbc_gbc_dict_grna.tsv". You can apply function "Input_preprocess_10X()" to handle this data format.
+* **Input File Format**. For running **MASCOT**, the input data needed to follow the standard format we defined. For convenience, **MASCOT** accepts two kinds of input data formats: **(1)** The first data format can be referred in the **data_format_example/crop_unstimulated.RData** as we provided. It is an example dataset containing "expression_profile", "perturb_information" and "sgRNA_information". You can apply function "Input_preprocess()" to handle this data format; **(2)** The second data format can be referred in the **data_format_example/perturb_GSM2396857/** generated by 10X genomics. The directory **data_format_example/perturb_GSM2396857** contains "barcodes.tsv", "genes.tsv", "matrix.mtx", "cbc_gbc_dict.tsv" and "cbc_gbc_dict_grna.tsv". You can apply function "Input_preprocess_10X()" to handle this data format.
* **Attention:** The label of the control sample needs to be "CTRL".
-* For illustration purpose, we took the least dataset **data_format_example/crop_unstimulated.RData** as an example.
+* For illustration purpose, we took the dataset **data_format_example/crop_unstimulated.RData** as an example.
* Install: You can install the **MASCOT** package from Github using **devtools** packages with R>=3.4.1. For convenience, you can also install the **MASCOT** package from Docker Hub with the link [mascot](https://hub.docker.com/r/bm2lab/mascot/)
```r
library(Biostrings)
@@ -122,9 +122,9 @@

```r

-# calculate the overall perturbation effect ranking list without "offTarget_Info" calculated.
+# calculate the overall perturbation effect ranking list without "offTarget_Info".
rank_overall_result<-Rank_overall(distri_Diff)
-#rank_overall_result<-Rank_overall(distri_Diff,offTarget_hash=offTarget_Info) (if "offTarget_Info" was calculated. For "offTarget_info", you can see the introduction in the end).
+#rank_overall_result<-Rank_overall(distri_Diff,offTarget_hash=offTarget_Info) (when "offTarget_Info" was calculated. For detailed information "offTarget_info", please refer to the introduction part in the end).

# calculate the topic-specific ranking list.
rank_topic_specific_result<-Rank_specific(distri_Diff)
@@ -134,7 +134,7 @@
```
![](figure/perturbation_network.png)

-* If sgRNA sequence of each knockouts were known and you want to consider if they have off-targets, you can perform this step. This step won't affect the final ranking result, but present the off-target information. In most cases, the sgRNA in such experiment has no off-targets. **If you do not want to consider this factor, then just skip this step**.
+* If sgRNA sequence of each knockouts were known and you want to investigate if they have off-targets, you can perform this step. This step won't affect the final ranking result, but just report the off-target information. In most cases, the sgRNA in such experiment has no off-targets. **If you do not want to consider this factor, then just skip this step**.
```r
#library(CRISPRseek)
#library("BSgenome.Hsapiens.UCSC.hg38")