galaxyproject · bgruening · Sep 30, 2018 · Sep 28, 2018 · Sep 28, 2018 · Sep 28, 2018
diff --git a/topics/introduction/images/input_tagged_file.png b/topics/introduction/images/input_tagged_file.png
diff --git a/topics/introduction/images/intro_get_flanks.png b/topics/introduction/images/intro_get_flanks.png
diff --git a/topics/introduction/images/intro_overlapping_genes.png b/topics/introduction/images/intro_overlapping_genes.png
diff --git a/topics/introduction/tutorials/galaxy-intro-peaks2genes/tutorial.md b/topics/introduction/tutorials/galaxy-intro-peaks2genes/tutorial.md
@@ -32,9 +32,9 @@ contributors:
 # Introduction
 {:.no_toc}
 
-We stumbled upon a paper [Li et al., Cell Stem Cell 2012](https://www.ncbi.nlm.nih.gov/pubmed/22862943) that contains the analysis of possible target genes of an interesting protein in mice. The targets were obtained by ChIP-seq and the raw data is available through [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37268).
-The list of genes however is neither in the supplement of the paper nor part of the GEO submission.
-The closest thing we could find is a list of the regions where the signal is significantly enriched (so called *peaks*):
+We stumbled upon a paper [Li et al., Cell Stem Cell 2012](https://www.ncbi.nlm.nih.gov/pubmed/22862943) called *"The histone acetyltransferase MOF is a key regulator of the embryonic stem cell core transcriptional network"*. The paper contains the analysis of possible target genes of an interesting protein called Mof. The targets were obtained by ChIP-seq in mice and the raw data is available through [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37268).
+However, the list of genes is neither in the supplement of the paper, nor part of the GEO submission.
+The closest thing we could find is a file in GEO containing a list of the regions where the signal is significantly enriched (so called *peaks*):
 
 1 | 3660676 | 3661050 | 375 | 210 | 62.0876250438913 | -2.00329386666667
 1 | 3661326 | 3661500 | 175 | 102 | 28.2950833625942 | -0.695557142857143
@@ -151,6 +151,12 @@ Galaxy will automatically unpack the file.
 >
 > 4. Search for `mm9` in **Database/Build** attribute and select `Mouse July 2007 (NCBI37/mm9)`
 > 5. Click on **Save** on the top
+> 6. Add a tag called `peaks` to the dataset to make it easier to track in the history
+>    {% include snippets/add_tag.md %}
+>
+>    The dataset should now look like below in the history
+>
+>    ![Peaks file](../../images/input_tagged_file.png){: width="250px" height="300px"}
 >
 {: .hands_on}
 
@@ -190,6 +196,7 @@ we also need a list of genes in mice, which we can obtain from UCSC.
 > 6. Click on the **Send Query to Galaxy** button
 > 7. Wait for the upload to finish
 > 8. Rename our dataset ({% icon galaxy-pencil %} (pencil) icon) to something more recognizable (`Genes`)
+> 9. Add a tag called `genes` to the dataset to make it easier to track in the history
 >
 {: .hands_on}
 
@@ -316,10 +323,12 @@ In order to convert the chromosome names we have therefore two things to do:
 
 ## Analysis
 
-Our goal is still to compare the 2 region files (the genes file and the peak file from the publication)
-to know which peaks are related to which genes. If you really only want to know which peaks are located **inside** genes you
-can skip the next step. Otherwise, it might be reasonable to include the promoter region into the comparison, e.g. because
-you want to include Transcriptions factors in ChIP-seq experiments.
+Our goal is to compare the 2 region files (the genes file and the peak file from the publication)
+to know which peaks are related to which genes. If you only want to know which peaks are located **inside** genes (within the gene body) you
+can skip the next step. Otherwise, it might be reasonable to include the **promoter** region of the genes into the comparison, e.g. because
+you want to include transcriptions factors in ChIP-seq experiments. There is no strict definition for promoter region but 2kb upstream of the TSS (start of region) is commonly used. We'll use the **Get Flanks** tool to get regions 2kb bases upstream of the start of the gene to 10kb bases downstream of the start (12kb in length). To do this we tell the Get Flanks tool we want regions upstream of the start, with an offset of 10kb, that are 12kb in length, as shown in the diagram below.
+
+![Get Flanks](../../images/intro_get_flanks.png)
 
 > ### {% icon hands_on %} Hands-on: Add promoter region to gene records
 >
@@ -349,35 +358,27 @@ you want to include Transcriptions factors in ChIP-seq experiments.
 > 3. Rename your dataset to reflect your findings
 {: .hands_on}
 
-You might have noticed that the UCSC file is in `BED` format and has a database associated to it. That's what we want for our peak file as well
-
-> ### {% icon hands_on %} Hands-on: Change format and database
->
-> 1. Click on the {% icon galaxy-pencil %} (pencil) icon in the history entry of our peak region file
-> 2. Switch to the **Convert** tab
-> 3. Select `Convert Genomic Intervals to BED`
-> 4. Press **Convert datatype**
-> 5. Check that the "Database/Build" is `mm9` (the database build for mice used in the paper)
-{: .hands_on}
-
 It's time to find the overlapping intervals (finally!). To do that, we want to extract the genes which overlap/intersect with our peaks.
 
 > ### {% icon hands_on %} Hands-on: Find Overlaps
 >
 > 1. **Intersect** {% icon tool %}: Run **Intersect the intervals of two datasets** with the following settings:
 >     - *"Return"*: `Overlapping Intervals`
 >     - *"of"*: the UCSC file with promoter regions
->     - *"that intersect"*: our converted peak region file
+>     - *"that intersect"*: our peak region file
 >     - *"for at least"*: `1`
 >
 >    > ### {% icon comment %} Comments
 >    > The order of the inputs is important! We want to end up with a list of genes, so the corresponding dataset needs to be the first input.
 >    {: .comment}
 {: .hands_on}
 
-We now have the list of genes (column 4) overlapping with the peak regions.
+We now have the list of genes (column 4) overlapping with the peak regions, similar to shown below.
+
+![Genes overlapping peaks](../../images/intro_overlapping_genes.png)
+
 To get a better overview of the genes we obtained, we want to look at their distribution across the different chromosomes.
-We will regroup the table by chromosome and count the number of genes with peaks on each chromosome
+We will group the table by chromosome and count the number of genes with peaks on each chromosome
 
 > ### {% icon hands_on %} Hands-on: Count genes on different chromosomes
 >
@@ -394,7 +395,7 @@ We will regroup the table by chromosome and count the number of genes with peaks
 >    > Which chromosome contained the highest number of target genes?
 >    >
 >    > > ### {% icon solution %} Solution
->    > > The result varies with different settings. If you followed step by step, it should be chromosome 7 with 1675 genes.
+>    > > The result varies with different settings. If you followed step by step, it should be chromosome 11 with 1990 genes.
 >    > {: .solution }
 >    {: .question}
 >
@@ -438,7 +439,6 @@ Galaxy makes this very simple with the `Extract workflow` option. This means tha
 >    Since we did some steps which where specific to our custom peak file, we might want to exclude:
 >    - **Select last** {% icon tool %}
 >    - all **Replace Text** {% icon tool %} steps
->    - **Convert Genomic Intervals to strict BED** {% icon tool %}
 >    - **Get flanks** {% icon tool %}
 >
 > 5. Rename the workflow to something descriptive, e.g. `From peaks to genes`
@@ -498,7 +498,7 @@ The history is now empty, but we need our peak file again. Before we upload it t
 >
 >       You should see both of your histories side-by-side now
 >
-> 2. Use drag-and-drop with your mouse to copy the edited peak file (after the replace steps) but still in interval format, which contains the summit information, to your new history.
+> 2. Use drag-and-drop with your mouse to copy the edited peak file (after the replace steps), which contains the summit information, to your new history.
 > 3. Click on **Analyze Data** in the top panel to go back to your analysis window
 >
 {: .hands_on}
@@ -555,6 +555,8 @@ The RefSeq genes we downloaded from UCSC did only contain the RefSeq identifiers
 >    > * Open the Galaxy Upload Manager
 >    > * Select **Paste/Fetch Data**
 >    > * Paste the link into the text field
+>    > * Select "Type": `bed`
+>    > * Select "Genome": `mm9`
 >    > * Press **Start**
 >    {: .tip}
 >