Skip to content

Latest commit

 

History

History
808 lines (672 loc) · 27 KB

figures.org

File metadata and controls

808 lines (672 loc) · 27 KB

Figures for potato paper

1 Introduction

This document contains all of the code needed to generate the figures for the potato paper from the raw data. It can be executed with org-mode to generate the figures automatically. The following org-mode settings are required:

;;; use bash to evaluate shell code
(setq org-babel-sh-command "bash")

;;; export with CSS classes instead of explicit colours
;;; this uses the included CSS for syntax highlighting
;;; instead of your own emacs colours
(setq org-html-htmlize-output-type 'css)
(setq org-html-htmlize-font-prefix "org-")
;;; the same but with bootstrap export, requires `ox-twbs'
(setq org-twbs-htmlize-output-type 'css)
(setq org-twbs-htmlize-font-prefix "org-")

The code in this document will read input data from the data directory, make intermediate files in an output directory, and put the final figures in a figures directory.

#!/bin/bash
mkdir -p output figures

2 Cumulative content

These plots show the cumulative content of the assemblies as successively smaller contigs are added in. The table gives the names of the assemblies and the filenames we use for each. Much of the code in this section uses this table. It is read into the code blocks line by line into the name and filename variables.

namefilename
Supernova + BioNano10x-bn
Supernova10x
Discovardiscovar-contig
Discovar + MP + DT + BioNanodiscovar-mp-dt-bn
Discovar + MP + Dovetaildiscovar-mp-dt
Discovar + MP + DT + PBJellydiscovar-mp-dt-jelly
Discovar + MP + BioNanodiscovar-mp-bn
Discovar + MPdiscovar-mp
Falcon + BioNanofalcon-bn
Falcon + DT + BioNanofalcon-dt-bn
Falcon + Dovetailfalcon-dt
Falconfalcon

2.1 Code

First we calculate the lengths of the contigs/scaffolds in each assembly and sort in descending order of size.

#!/bin/bash
IFS=','
while read name filename; do
    if [[ ! -s output/${filename}.length ]]; then
        echo "Calculating lengths of ${filename}..."
        gzip -cd data/${filename}.fasta.gz \
            | awk '
                  /^>/ {
                    if (len) {
                      print len, name
                    }
                    split($0,s," ")
                    name=substr(s[1],2)
                    len=0
                    next
                  }
                  {
                    len += length($0)
                  } 
                  END {
                    if (len) {
                      print len, name
                    }
                  }
                  ' \
                      | sort -nr -k1,1 \
                             > output/${filename}.length
    fi
done <<< "$table"

Now we calculate the data for the plot. The minimum length of contigs considered and the total length of all contigs of that size or bigger.

#!/bin/bash
IFS=','
while read name filename; do
    if [[ ! -s output/${filename}.minlen-cumulative ]]; then
        echo "Calculating cumulative lengths of ${filename}..."
        awk '
            BEGIN {
              OFS="\t"
              last = 0
              sum = 0
            } 
            {
              if($1 != last && last != 0) {
                print last, sum
              }
              last = $1
              sum += $1
            }
            ' \
                output/${filename}.length \
                > output/${filename}.minlen-cumulative
    fi
done <<< "$table"

Now we prepare the table for ggplot by concatenating the minlen-cumulative tables and putting the name of the assembly in the third column.

#!/bin/bash
IFS=','
while read name filename; do
    awk -v OFS=',' -v fname="$filename" -v name="$name" \
        '
        {
          print $1,$2,fname,name
        }
        ' \
            output/${filename}.minlen-cumulative
done <<< "$table" \
     > output/all-assemblies.minlen-cumulative

Now we can plot these lines using ggplot:

Make PDF versions:

source(file="cumulative-content.R")

pdf(file="figures/bigcompplot.pdf", width=5, height=5)
print(bigcompplot)
dev.off()

pdf(file="figures/dvcompplot.pdf", width=5, height=5)
print(dvcompplot)
dev.off()

pdf(file="figures/pbcompplot.pdf", width=5, height=5)
print(pbcompplot)
dev.off()

pdf(file="figures/all.pdf", width=15, height=5)
grid.draw(cbind(ggplotGrob(bigcompplot),
                ggplotGrob(dvcompplot),
                ggplotGrob(pbcompplot),
                size="last"))
dev.off()

2.2 Figures

#+RESULTS[f009dcf707ee2ca01361b8d13cb6cab67ede2e1a]: figures/bigcompplot.png

#+RESULTS[aa1a6cf66be8279fe82bbabf45d2683693d7832b]: figures/dvcompplot.png

#+RESULTS[7596061d4673f3c8a66b958e046b4692fe0a439d]: figures/pbcompplot.png

3 KAT plots

3.1 Code

For the KAT plots we have to count k-mers in the Discovar library reads and the assemblies. The read files LIB12786_R1.fastq and LIB12786_R2.fastq need to be in the data directory to run KAT as below. This generates three matrix files with filenames ending in -main.mx. KAT takes a long time and a lot of memory to run so we provide these matrix files.

for asm in discovar-contig falcon-pilon 10x; do
    kat comp -t 8 -o data/kat-comp-${asm} \
        'data/LIB12786_R?.fastq' \
        data/${asm}.fasta
done

KAT comes with its own plotting tools using Python and matplotlib, but we use the ggplot version to match the other figures in the paper.

Make PDF versions:

commandArgs <- function() c("data/kat-comp-discovar-contig-main.mx",
                 "200", "0", "5", "PCR free", "Discovar")
source(file="plot-comp.R")

pdf(file="figures/kat-comp-discovar.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p1 <- p

commandArgs <- function() c("data/kat-comp-falcon-pilon-main.mx",
                 "200", "0", "5", "PCR free", "Falcon")
source(file="plot-comp.R")

pdf(file="figures/kat-comp-falcon.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p2 <- p

commandArgs <- function() c("data/kat-comp-10x-main.mx",
                 "200", "0", "5", "PCR free", "Supernova")
source(file="plot-comp.R")

pdf(file="figures/kat-comp-10x.pdf", width=5, height=4, onefile=TRUE)
print(p)
dev.off()
p3 <- p

p2 <- p2 + scale_y_continuous(labels=NULL) + ylab(NULL)
p3 <- p3 + scale_y_continuous(labels=NULL) + ylab(NULL)

pdf(file="figures/kat-comp-all.pdf", width=15, height=5)
grid.draw(cbind(ggplotGrob(p1),
                ggplotGrob(p2),
                ggplotGrob(p3),
                size="last"))
dev.off()

3.2 Figures

The PNG output from R does not look very good for some reason.

#+RESULTS[32137840f717255b2f052518ecf515d519ff97e8]: figures/kat-comp-discovar.png

#+RESULTS[c9d7b7e74d0350603766dedd8e8e153404a12e7e]: figures/kat-comp-falcon.png

#+RESULTS[6fdd7bd42b7f97cbd98fc31141e539780593c990]: figures/kat-comp-10x.png

4 BAC with difficult region

For the BAC plot we use a PacBio assembly of a BAC (labelled number 22). We have several tracks of data: three contig alignments, three read alignments, and a GC content and homopolymer track. The PacBio assembly of BAC 22 is contained in bac22.fasta.

4.1 Contig alignment

For the contig alignments we use bwa-mem.

namefilename
Supernova10x
Discovardiscovar-contig
Falconfalcon
#!/bin/bash
bwa index data/bac22.fasta
IFS=','
while read name filename; do
    bwa mem data/bac22.fasta data/${filename}.fasta \
        | samtools view -b - > data/pb-${filename}.bam
    sambamba sort data/pb-${filename}.bam
done <<< "$table"

The sam-to-R.py script takes multiple SAM input files and outputs a table for plotting with ggplot. Optional parameters are -m for minimum length of the alignment and -i for minimum percent identity, both of which are calculated from the CIGAR string for each alignment in the SAM file. The first positional parameter provides a short name for each of the following alignments, which is embedded in the table, and then the alignment files follow. We use process substitution so we can use BAM files, ensuring that the SAM header is present by using the -h flag for samtools view.

To generate the table for our three contig tracks we run:

#!/bin/bash
python sam-to-R.py \
       -m 1000 -i 99.9 dv,fal,10x \
       <(samtools view -h data/pb-discovar-contig.sorted.bam) \
       <(samtools view -h data/pb-falcon.sorted.bam) \
       <(samtools view -h data/pb-10x.sorted.bam) \
       > output/pb-r-table

4.2 Read alignment

For the read alignment we use bowtie2. The ord variable tells the aligner if the reads are in forward-reverse or reverse-forward order.

namefilenameord
Discovardiscovarfr
Mate pairmprf
Dovetaildovetailfr

We align each set of reads and suppress the unaligned reads to keep the file size smaller. Then they are filtered for reads with a high mapping quality and an edit distance with respect to the reference of no more than 1.

#!/bin/bash
bowtie2-build data/bac22.fasta data/bac22
IFS=','
while read name filename ord; do
    bowtie2 -p 8 --no-unal --${ord} -x data/bac22 \
            -1 data/${filename}-R1.fastq -2 data/${filename}-R2.fastq \
        | samtools view -b - > data/${filename}-aln.bam
    sambamba view -f bam -F "mapping_quality >= 30 and [NM] <= 1" \
             data/${filename}-aln.bam \
             > data/${filename}-aln-acc.bam
done <<< "$table"

For the paired-end (Discovar) we simply generate the depth at each reference position.

#!/bin/bash

sambamba sort data/discovar-aln-acc.bam
samtools depth -a data/discovar-aln-acc.sorted.bam \
         > output/discovar-aln-acc.cov

For the mate pair and Dovetail we calculate the coverage the aligned fragments. First we need to sort the alignments by name to bring the pairs together:

#!/bin/bash
for filename in mp dovetail; do
    samtools sort -n -O bam -T data/${filename}-aln-acc.sorted.name.bam \
             <(samtools view -bh -F 1804 -q 1 data/${filename}-aln-acc.bam) \
             > data/${filename}-aln.sorted.name.bam
done

Now we use bedtools to make a bed file with the physical fragments:

#!/bin/bash
for filename in mp dovetail; do
    bamToBed -i data/${filename}-aln.sorted.name.bam -bedpe \
        | cut -f 1,2,6 | sort -k1,1 > output/${filename}.phys.bed
done

And finally calculate the coverage at each reference position:

#!/bin/bash
for filename in mp dovetail; do
    bedtools genomecov -d -i output/${filename}.phys.bed \
             -g data/bac22.fasta.fai > output/${filename}.phys.cov
done

4.3 GC content and homopolymers

For the GC content we use bedtools. We make 100bp windows and calculate the GC content in the windows.

#!/bin/bash

bedtools makewindows -g data/bac22.fasta.fai -w 100 \
         > output/bac22.100bps.bed
bedtools nuc -fi data/bac22.fasta -bed output/bac22.100bps.bed \
         > output/bac22.gc.txt
awk -v w=100 -vOFS='\t' -vFS='\t' 'NR > 1 {print $1,($2+$3)/2,$5}' \
    output/bac22.gc.txt \
    > output/bac22.gc.100bps

The homopolymers are calculated using the following python script:

#!/bin/bash
python ~/code/tools/homopolymers.py data/bac22.fasta 4 \
       > output/bac22.homopolymers.4

4.4 Figure

Finally, after generating the data we can plot it all using ggplot.

#!/bin/bash
./bac-graphs2.R \
    bac22 \
    figures/bac22-pacbio \
    output/discovar-aln-acc.cov \
    output/mp.phys.cov \
    output/dovetail.phys.cov \
    output/bac22.gc.100bps \
    output/bac22.homopolymers.4 \
    output/pb-r-table \
    10x,fal,dv

figures/bac22-pacbio.png

5 Gene content

namefilename
Supernova + BioNano10x-bn
Discovar + MP + DT + BioNanodiscovar-mp-dt-bn
Falcon + DT + BioNanofalcon-dt-bn
Tuberosum referencetuberosum

We align the transcripts of the S. tuberosum genome to our assemblies using BLAST version 2.2.31. The transcripts were retrieved from here: http://solanaceae.plantbiology.msu.edu/data/PGSC_DM_v3.4_transcript-update.fasta.zip

for asm in 10x-bn discovar-mp-dt-bn falcon-dt-bn; do
    blastn -outfmt '6 qseqid qstart qend qlen sseqid sstart send slen pident length evalue' \
           -evalue 1e-3 \
           -subject ${asm}.fasta \
           -query PGSC_DM_v3.4_transcript-update.fasta \
           > ${asm}-transcripts.blastn
done

We can adjust the minimum percentage identity of BLAST alignments that we consider and calculate the percent coverage of each transcript in the alignment. We use the transcript-coverage.py script to calculate the percent coverage for a transcript:

This is run after filtering the alignments with awk and sorting by alignment start position:

IFS=','
while read name filename; do
    for i in {85..100}; do
        ./make-coverages data/${filename}-transcripts.blastn ${i} output/
    done
done <<EOF
${table}
EOF

Now we can count the number of transcripts which appear with various thresholds:

for filename in 10x-bn discovar-mp-dt-bn falcon-dt-bn tuberosum; do
    for cov in $(seq 0.85 0.01 1.0); do
        for i in {85..100}; do
            num=$(awk -v c=${cov} '$4 >= c' output/${filename}-transcripts-i${i}.coverage | wc -l)
            echo -ne "${filename}\t${cov}\t${i}\t${num}\n"
        done
    done
done > output/transcript-asm-cov-id-count-2
#!/usr/bin/env Rscript

library("ggplot2")
library("reshape2")
library("grid")

t <- read.table("../genes-table", header=TRUE)
tm <- melt(t, id.vars="asm", variable.name="type", value.name="genes")

p <- ggplot(tm, aes(x=asm, y=genes, fill=type)) +
    scale_fill_discrete(name="Type",
                      labels=c("cegcom"="Cegma complete",
                          "cegpar"="Cegma partial",
                          "buscom"="Busco complete",
                          "buspar"="Busco partial",
                          "busmis"="Busco missing")) +
    scale_x_discrete(name="Assembly") +
    scale_y_continuous(name="Number of genes") +
    geom_bar(stat="identity", position="dodge") +
    geom_bar(stat="identity", position="dodge", colour="black", show_guide=FALSE) +
    theme(legend.position="bottom", legend.key.size=unit(3, "mm"),
          legend.text=element_text(size=8),legend.title=element_text(size=8),
          plot.margin=unit(c(2,2,0,0), "mm")) +
    guides(fill=guide_legend(nrow=2,byrow=TRUE))

pdf(file="genes-bar.pdf", width=4, height=4)
print(p)
dev.off()

6 Synteny to S. tuberosum

6.1 Code

We generate alignments between the S. tuberosum reference and our assemblies using nucmer. We are interesting in our three “large” hybrid assemblies.

namefilename
Supernova + BioNano10x-bn
Discovar + MP + DT + BioNanodiscovar-mp-dt-bn
Falcon + DT + BioNanofalcon-dt-bn-2

The file falcon-dt-bn-2 is generated by replacing the pipe symbols in the fasta file with underscores because some or all of mummer’s scripts do not support the pipe symbol.

#!/bin/bash

gzip -cd data/falcon-dt-bn.fasta.gz \
    | sed 's/|/_/g' > data/falcon-dt-bn-2.fasta

The following code requires mummer and the S. tuberosum pseudomolecule assembly version 4.03 in the data directory. It is available here: http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml

#!/bin/bash
IFS=','
while read name filename; do
    nucmer -c 65 -l 65 \
           -p output/${filename} \
           data/PGSC_DM_v4.03_pseudomolecules.fasta \
           data/${filename}.fasta
done <<< "$table"

First we filter the delta to only include alignments of at least 90% identity and 10kb in length.

#!/bin/bash
IFS=','
while read name filename; do
    if [[ ! -s output/${filename}-i${i}-l${l}.delta ]]; then
        echo "Filtering delta..."
        delta-filter -q -r -i ${i} -l ${l} output/${filename}.delta \
                     > output/${filename}-i${i}-l${l}.delta
    fi
done <<< "$table"

Now we generate the coords files which we’ll need later to sort and orientate the scaffolds for each chromosome. By default mummer will plot every scaffold in the assembly even if we select only one chromosome of the reference, therefore we need to find which scaffolds actually have alignments with each reference sequence.

#!/bin/bash
IFS=','
while read name filename; do
    if [[ ! -s output/${filename}-i${i}-l${l}.coords ]]; then
        echo "Making coords..."
        show-coords -THrcl output/${filename}-i${i}-l${l}.delta \
                    > output/${filename}-i${i}-l${l}.coords
    fi
done <<< "$table"

Mummer will also require the length of each assembly scaffold that we wish to plot.

#!/bin/bash
IFS=','
while read name filename; do
    if [[ ! -s output/${filename}.length ]]; then
        echo "Calculating lengths of ${filename}..."
        cat data/${filename}.fasta \
            | awk '
                  /^>/ {
                    if (len) {
                      print len, name
                    }
                    split($0,s," ")
                    name=substr(s[1],2)
                    len=0
                    next
                  }
                  {
                    len += length($0)
                  }
                  END {
                    if (len) {
                      print len, name
                    }
                  }
                  ' \
                      | sort -nr -k1,1 \
                             > output/${filename}.length
    fi
done <<< "$table"

Finally we make a mummerplot for each assembly and each chromosome. We are going to manually select which assembly scaffolds to display and also manually order and orientate them. mummerplot attempts to do the ordering and orientation itself but doesn’t do a very good job. We want to order the scaffolds such that most of the bases in the alignment appear on the diagonal. We do this by sorting by the average base position in the alignment of each scaffold. We want to orientate the scaffolds such that as much as possible is “forwards” (more red than blue in the plots). We do this by counting how much of the alignment length is forward and reverse and setting the orientation accordingly.

#!/bin/bash
IFS=','
while read name filename; do
    for chr in {01..12}; do
        echo "Doing chr ${chr}..."

        # find scaffold order and orientation
        awk '$12 == "ST4.03ch'${chr}'" {print $1,$2,$3,$4,$13;}' \
            output/${filename}-i${i}-l${l}.coords \
            | sort -n -k1,1 \
            | sort -s -k5 \
            | awk '
BEGIN {
   lastn=""; lastm=0; lastb=0; lastl=0;
}
{
   n=$5; s=$1; e=$2; p=(s+e)/2; b=e-s; l=$4-$3;
   if (lastn=="") {
      lastn=n; lastm=p; lastb=b; lastl=l;
   } else if (lastn!=n) {
      print lastn,int(lastm),lastl
      lastn=n; lastm=p; lastb=b; lastl=l;
   } else {
      lastm=(lastm*lastb+p*b)/(lastb+b); lastb+=b; lastl+=l
   }
}
END {
   print lastn,int(lastm),lastl
}
' \
            | sort -k1,1 \
                   > \
                   output/${filename}-chr${chr}-contigs-i${i}-l${l}.scaf-pos-orn

        join -1 1 -2 2 \
             output/${filename}-chr${chr}-contigs-i${i}-l${l}.scaf-pos-orn \
             <(sort -k2,2 output/${filename}.length) \
            | sort -n -k2,2 \
            | awk '
{if ($3>0) {
    sign="+"
 } else {
    sign="-"
 }
 printf "%s\t%d\t%s\n", $1, $4, sign;}
' \
                  > output/${filename}-chr${chr}-contigs-i${i}-l${l}.mumlist

        mummerplot -t png -r ST4.03ch${chr} \
                   -p output/mummerplot-${filename}-i${i}-l${l}-${chr} \
                   output/${filename}-i${i}-l${l}.delta \
                   -Q output/${filename}-chr${chr}-contigs-i${i}-l${l}.mumlist

        mv output/mummerplot-${filename}-i${i}-l${l}-${chr}.png figures/
    done
done <<< "$table"

For the rest of the paper we ggplot for the plots so we now convert the mummerplots to ggplot and plot them. The data is prepared with the following script:

Where mummerplot.R is:

6.2 Figures

Convert all the figures to ggplot:

IFS=','
while read name filename; do
    for chr in {01..12}; do
        ./mummer2ggplot \
            output/mummerplot-${filename}-i${i}-l${l}-${chr} \
            figures
    done
done <<EOF
$table
EOF

Here is a quick preview of the figures.

IFS=','
while read name filename; do
    montage figures/mummerplot-${filename}-i${i}-l${l}-??-gg.png \
            -geometry 240x240+10+5 -tile 3x4 \
            figures/mummerplot-${filename}-i${i}-l${l}-all-gg.png
    echo "*** ${name}"

    echo "[[file:figures/mummerplot-${filename}-i${i}-l${l}-all.png]]"
done <<< "$table"

6.2.1 Supernova + BioNano

figures/mummerplot-10x-bn-i90-l10000-all-gg.png

6.2.2 Discovar + MP + DT + BioNano

figures/mummerplot-discovar-mp-dt-bn-i90-l10000-all-gg.png

6.2.3 Falcon + DT + BioNano

figures/mummerplot-falcon-dt-bn-2-i90-l10000-all-gg.png