Bio-protocol/unmethylated-regions_UMR-extractor-WGBS

License: GPL v3

Identifying unmethylated regions from methylation data

This workflow identifies stably unmethylated regions in plant genomes using methylation data.

Installation

Input Data

The example data used here are paired-end FASTQ files generated on the Illumina platform.

  • R1 FASTQ file: input/B73_chr1_subset_reads_1.fastq
  • R2 FASTQ file: input/B73_chr1_subset_reads_2.fastq

Each entry in a FASTQ file consists of 4 lines:

  1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
  2. The sequence (the base calls; A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

The first entry of the input data:

@SRR8738272.153232
TGATTTGAAATTAAACGAATATGGAAATCGGTTTGAAGGTTTTGGAATCGAGTATAATTGGATTTACAAATGTGGTTTATGGGAATTTTTTTATGTGAAAGTTTTGATTCTGATGTATAATATTGA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@
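
The four-line record layout described above can be sanity-checked before trimming. The following is a minimal sketch, not part of this repository: `check_fastq` is a hypothetical helper name, and it assumes unwrapped records (exactly four lines per read).

```shell
# check_fastq FILE (hypothetical helper, not part of this repository)
# Verifies the 4-line record layout and prints the number of reads.
check_fastq() {
    awk '
        # Line 1 of each record: identifier, must start with "@"
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf("line %d: identifier does not start with @\n", NR) > "/dev/stderr"
            bad = 1; exit
        }
        # Line 2: the base calls; remember their length
        NR % 4 == 2 { seq_len = length($0) }
        # Line 3: separator, must start with "+"
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf("line %d: separator does not start with +\n", NR) > "/dev/stderr"
            bad = 1; exit
        }
        # Line 4: one quality character per base
        NR % 4 == 0 && length($0) != seq_len {
            printf("line %d: quality length differs from sequence length\n", NR) > "/dev/stderr"
            bad = 1; exit
        }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) {
                print "file ends in the middle of a record" > "/dev/stderr"
                exit 1
            }
            print NR / 4
        }
    ' "$1"
}
```

For example, `check_fastq input/B73_chr1_subset_reads_1.fastq` would print the number of reads in the R1 file, or exit with a non-zero status at the first malformed record.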

Other input files are also required, such as:

  1. A reference genome.
  2. A file containing chromosome sizes. Each entry consists of two columns: the chromosome and the size of the chromosome.

Here is the example file:

maize_chr1_reference	20000
  3. A reference genome cytosine tile file.

The file contains 6 columns:

  1. The chromosome name
  2. Start of the 100 bp tile
  3. End of the 100 bp tile
  4. Number of CG sites in the 100 bp tile
  5. Number of CHG sites in the 100 bp tile
  6. Number of CHH sites in the 100 bp tile

Here are the first 5 lines of the example tile file:

chr	start	end	cg_sites	chg_sites	chh_sites
maize_chr1_reference	1	100	4	6	29
maize_chr1_reference	101	200	6	7	25
maize_chr1_reference	201	300	6	4	36
maize_chr1_reference	301	400	2	10	28

More example tile files can be found in the example_genomes folder in the input folder. They will be provided by UQeSpace, with the DOI being available when the article is published.
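
Because the tile file and the chromosome sizes file must describe the same reference, it can help to cross-check them before running the workflow. The sketch below is an illustration only: `check_tiles` is a hypothetical helper, and it assumes both files are tab-separated and that the tile file may start with the header row shown above.

```shell
# check_tiles SIZES_FILE TILE_FILE (hypothetical helper, not part of this repository)
# Checks that every tile has 6 columns, names a known chromosome and lies
# within that chromosome. Prints the number of tiles that passed.
check_tiles() {
    awk -F '\t' '
        # First file: chromosome sizes (name, size)
        NR == FNR { size[$1] = $2 + 0; next }
        # Second file: skip the header row if present
        FNR == 1 && $1 == "chr" { next }
        NF != 6 {
            printf("tile line %d: expected 6 columns, found %d\n", FNR, NF) > "/dev/stderr"
            bad = 1; next
        }
        !($1 in size) {
            printf("tile line %d: unknown chromosome %s\n", FNR, $1) > "/dev/stderr"
            bad = 1; next
        }
        $2 + 0 < 1 || $3 + 0 < $2 + 0 || $3 + 0 > size[$1] {
            printf("tile line %d: coordinates outside chromosome\n", FNR) > "/dev/stderr"
            bad = 1; next
        }
        { n++ }
        END {
            if (bad) exit 1
            print n + 0
        }
    ' "$1" "$2"
}
```

The function exits with a non-zero status if any tile fails a check, so it can be used as a guard at the top of a wrapper script.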

Major steps

Step 1: Trimming the reads and running FastQC for quality checking

  • Note that you have to adjust the paths in the shell script to match your own environment before running it.
sh workflow/1_trim_reads.sh
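
One way to reduce the amount of manual path editing is to derive absolute paths from the script's own location. This is a sketch under assumptions, not what the workflow scripts currently do: `abs_dir` is a hypothetical helper, and the variable names in the usage comments are illustrative.

```shell
# abs_dir DIR (hypothetical helper, not part of this repository)
# Prints the absolute, symlink-free form of DIR; fails if DIR does not exist.
abs_dir() {
    (cd "$1" 2>/dev/null && pwd -P)
}

# Possible use near the top of a script stored in workflow/:
#   script_dir=$(abs_dir "$(dirname "$0")")
#   repo_root=$(abs_dir "$script_dir/..")
#   r1="$repo_root/input/B73_chr1_subset_reads_1.fastq"
#   r2="$repo_root/input/B73_chr1_subset_reads_2.fastq"
```

With paths built this way, the script can be started from any working directory without editing the input locations by hand.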

Step 2: Mapping reads using BSMAP

sh workflow/2_map_reads.sh <samtools 0.1.18 path>

Step 3: View the results

  • Results can be converted into a bigWig format, which can be visualized using IGV.
sh 3_visualize_results.sh <bedgraph2BigWig path>
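
For orientation, the conversion to bigWig can be sketched as follows. This is not the content of the script above. It assumes the UCSC `bedGraphToBigWig` tool, which takes a bedGraph file sorted by chromosome and start position, a chromosome sizes file, and an output name, in that order. The function names `sort_bedgraph` and `to_bigwig` are hypothetical.

```shell
# sort_bedgraph IN OUT (hypothetical helper)
# Sorts a bedGraph by chromosome, then numerically by start position.
sort_bedgraph() {
    LC_ALL=C sort -k1,1 -k2,2n "$1" > "$2"
}

# to_bigwig TOOL BEDGRAPH CHROM_SIZES OUT_BW (hypothetical helper)
# TOOL is the path to the bedGraphToBigWig executable.
to_bigwig() {
    sorted=$(mktemp) || return 1
    sort_bedgraph "$2" "$sorted" && "$1" "$sorted" "$3" "$4"
    status=$?
    rm -f "$sorted"
    return $status
}
```

The resulting bigWig file can then be loaded into IGV alongside the reference genome.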

Step 4: Identify unmethylated regions

4_find_UMRs.sh
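
The internals of this script are not described here. Purely as an illustration of how tile-level calls could be turned into regions, the sketch below merges runs of adjacent tiles into single intervals. Everything about the input is an assumption: a sorted, tab-separated file with columns chromosome, start, end and a label, where the label `unmethylated` is a hypothetical marker for tiles that passed whatever criteria are applied upstream.

```shell
# merge_unmethylated_tiles TILE_CALLS (hypothetical helper, not the script above)
# Input (assumed): chr <TAB> start <TAB> end <TAB> label, sorted by chr and start.
# Output: one line per run of directly adjacent "unmethylated" tiles.
merge_unmethylated_tiles() {
    awk -F '\t' -v OFS='\t' '
        # Print the region collected so far, if any
        function emit() {
            if (in_region) print chr, start, stop
            in_region = 0
        }
        # Any other label ends the current region
        $4 != "unmethylated" { emit(); next }
        # Same chromosome and directly adjacent: extend the region
        in_region && $1 == chr && $2 + 0 == stop + 1 { stop = $3 + 0; next }
        # Otherwise start a new region
        { emit(); chr = $1; start = $2 + 0; stop = $3 + 0; in_region = 1 }
        END { emit() }
    ' "$1"
}
```

With 100 bp tiles, two consecutive unmethylated tiles at 1-100 and 101-200 would be reported as one region from 1 to 200.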

Expected results

License

This is free and open-source software, licensed under GPLv3.
