Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peaks2genes: modifications #1030

Merged
merged 13 commits into from
Sep 30, 2018
Binary file added topics/introduction/images/input_tagged_file.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added topics/introduction/images/intro_get_flanks.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
50 changes: 26 additions & 24 deletions topics/introduction/tutorials/galaxy-intro-peaks2genes/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,9 @@ contributors:
# Introduction
{:.no_toc}

We stumbled upon a paper [Li et al., Cell Stem Cell 2012](https://www.ncbi.nlm.nih.gov/pubmed/22862943) that contains the analysis of possible target genes of an interesting protein in mice. The targets were obtained by ChIP-seq and the raw data is available through [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37268).
The list of genes however is neither in the supplement of the paper nor part of the GEO submission.
The closest thing we could find is a list of the regions where the signal is significantly enriched (so called *peaks*):
We stumbled upon a paper [Li et al., Cell Stem Cell 2012](https://www.ncbi.nlm.nih.gov/pubmed/22862943) called *"The histone acetyltransferase MOF is a key regulator of the embryonic stem cell core transcriptional network"*. The paper contains the analysis of possible target genes of an interesting protein called Mof. The targets were obtained by ChIP-seq in mice and the raw data is available through [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37268).
However, the list of genes is neither in the supplement of the paper, nor part of the GEO submission.
The closest thing we could find is a file in GEO containing a list of the regions where the signal is significantly enriched (so called *peaks*):

1 | 3660676 | 3661050 | 375 | 210 | 62.0876250438913 | -2.00329386666667
1 | 3661326 | 3661500 | 175 | 102 | 28.2950833625942 | -0.695557142857143
Expand Down Expand Up @@ -151,6 +151,12 @@ Galaxy will automatically unpack the file.
>
> 4. Search for `mm9` in **Database/Build** attribute and select `Mouse July 2007 (NCBI37/mm9)`
> 5. Click on **Save** on the top
> 6. Add a tag called `peaks` to the dataset to make it easier to track in the history
> {% include snippets/add_tag.md %}
>
> The dataset should now look like below in the history
>
> ![Peaks file](../../images/input_tagged_file.png){: width="250px" height="300px"}
>
{: .hands_on}

Expand Down Expand Up @@ -190,6 +196,7 @@ we also need a list of genes in mice, which we can obtain from UCSC.
> 6. Click on the **Send Query to Galaxy** button
> 7. Wait for the upload to finish
> 8. Rename our dataset ({% icon galaxy-pencil %} (pencil) icon) to something more recognizable (`Genes`)
> 9. Add a tag called `genes` to the dataset to make it easier to track in the history
>
{: .hands_on}

Expand Down Expand Up @@ -316,10 +323,12 @@ In order to convert the chromosome names we have therefore two things to do:

## Analysis

Our goal is still to compare the 2 region files (the genes file and the peak file from the publication)
to know which peaks are related to which genes. If you really only want to know which peaks are located **inside** genes you
can skip the next step. Otherwise, it might be reasonable to include the promoter region into the comparison, e.g. because
you want to include Transcriptions factors in ChIP-seq experiments.
Our goal is to compare the 2 region files (the genes file and the peak file from the publication)
to know which peaks are related to which genes. If you only want to know which peaks are located **inside** genes (within the gene body) you
can skip the next step. Otherwise, it might be reasonable to include the **promoter** region of the genes into the comparison, e.g. because
you want to include transcriptions factors in ChIP-seq experiments. There is no strict definition for promoter region but 2kb upstream of the TSS (start of region) is commonly used. We'll use the **Get Flanks** tool to get regions 2kb bases upstream of the start of the gene to 10kb bases downstream of the start (12kb in length). To do this we tell the Get Flanks tool we want regions upstream of the start, with an offset of 10kb, that are 12kb in length, as shown in the diagram below.

![Get Flanks](../../images/intro_get_flanks.png)

> ### {% icon hands_on %} Hands-on: Add promoter region to gene records
>
Expand Down Expand Up @@ -349,35 +358,27 @@ you want to include Transcriptions factors in ChIP-seq experiments.
> 3. Rename your dataset to reflect your findings
{: .hands_on}

You might have noticed that the UCSC file is in `BED` format and has a database associated to it. That's what we want for our peak file as well
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this to demonstrate the conversion feature - in respect to the "change filetype". As a bonus, the presenter can also demonstrate the implicit-conversion if wished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I've added it back in. But imho if it's not necessary for users to do that conversion here then it's just adding confusion and I'd demonstrate the conversion somewhere else where it is necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here. The Get Flanks tool is also not needed as a similar result can be archived by UCSC directly. It's important that people know the interface and this is the purpose of this tutorial, it's an introduction to Galaxy.
Also please note that they are different intersect tools in Galaxy and not all can cope with interval files and hence convert them implicitly. It's therefore good to know that there are implicit and explicit conversions, maybe this should be made more clear in the text - I stress this a lot during my session.

Thanks for adding it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Get Flanks tool is also not needed as a similar result can be archived by UCSC directly.

Well while we're at it UCSC is not needed here either 😄 But yes, the text could be made clearer I think, what do you think about what I've added here now

You might have noticed that the UCSC file is in `BED` format and has a database associated to it. That's what we want for our peak file as well. The **Intersect** tool we will use can automatically convert interval files to BED format but we'll convert our interval file explicitly here to show how this can be achieved with Galaxy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the text. Thanks.


> ### {% icon hands_on %} Hands-on: Change format and database
>
> 1. Click on the {% icon galaxy-pencil %} (pencil) icon in the history entry of our peak region file
> 2. Switch to the **Convert** tab
> 3. Select `Convert Genomic Intervals to BED`
> 4. Press **Convert datatype**
> 5. Check that the "Database/Build" is `mm9` (the database build for mice used in the paper)
{: .hands_on}

It's time to find the overlapping intervals (finally!). To do that, we want to extract the genes which overlap/intersect with our peaks.

> ### {% icon hands_on %} Hands-on: Find Overlaps
>
> 1. **Intersect** {% icon tool %}: Run **Intersect the intervals of two datasets** with the following settings:
> - *"Return"*: `Overlapping Intervals`
> - *"of"*: the UCSC file with promoter regions
> - *"that intersect"*: our converted peak region file
> - *"that intersect"*: our peak region file
> - *"for at least"*: `1`
>
> > ### {% icon comment %} Comments
> > The order of the inputs is important! We want to end up with a list of genes, so the corresponding dataset needs to be the first input.
> {: .comment}
{: .hands_on}

We now have the list of genes (column 4) overlapping with the peak regions.
We now have the list of genes (column 4) overlapping with the peak regions, similar to shown below.

![Genes overlapping peaks](../../images/intro_overlapping_genes.png)

To get a better overview of the genes we obtained, we want to look at their distribution across the different chromosomes.
We will regroup the table by chromosome and count the number of genes with peaks on each chromosome
We will group the table by chromosome and count the number of genes with peaks on each chromosome

> ### {% icon hands_on %} Hands-on: Count genes on different chromosomes
>
Expand All @@ -394,7 +395,7 @@ We will regroup the table by chromosome and count the number of genes with peaks
> > Which chromosome contained the highest number of target genes?
> >
> > > ### {% icon solution %} Solution
> > > The result varies with different settings. If you followed step by step, it should be chromosome 7 with 1675 genes.
> > > The result varies with different settings. If you followed step by step, it should be chromosome 11 with 1990 genes.
> > {: .solution }
> {: .question}
>
Expand Down Expand Up @@ -438,7 +439,6 @@ Galaxy makes this very simple with the `Extract workflow` option. This means tha
> Since we did some steps which where specific to our custom peak file, we might want to exclude:
> - **Select last** {% icon tool %}
> - all **Replace Text** {% icon tool %} steps
> - **Convert Genomic Intervals to strict BED** {% icon tool %}
> - **Get flanks** {% icon tool %}
>
> 5. Rename the workflow to something descriptive, e.g. `From peaks to genes`
Expand Down Expand Up @@ -498,7 +498,7 @@ The history is now empty, but we need our peak file again. Before we upload it t
>
> You should see both of your histories side-by-side now
>
> 2. Use drag-and-drop with your mouse to copy the edited peak file (after the replace steps) but still in interval format, which contains the summit information, to your new history.
> 2. Use drag-and-drop with your mouse to copy the edited peak file (after the replace steps), which contains the summit information, to your new history.
> 3. Click on **Analyze Data** in the top panel to go back to your analysis window
>
{: .hands_on}
Expand Down Expand Up @@ -555,6 +555,8 @@ The RefSeq genes we downloaded from UCSC did only contain the RefSeq identifiers
> > * Open the Galaxy Upload Manager
> > * Select **Paste/Fetch Data**
> > * Paste the link into the text field
> > * Select "Type": `bed`
> > * Select "Genome": `mm9`
> > * Press **Start**
> {: .tip}
>
Expand Down