diff --git a/.info b/.info index 7cd15ea..009352b 100644 --- a/.info +++ b/.info @@ -1,2 +1,2 @@ -CURRENTVERSION=1.2 -NEWVERSION=https://github.com/V-Z/sondovac/releases/download/v1.2/sondovac-1.2.zip +CURRENTVERSION=1.3 +NEWVERSION=https://github.com/V-Z/sondovac/releases/download/v1.3/sondovac-1.3.zip diff --git a/CHANGELOG b/CHANGELOG index b476edc..0ae3000 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -3,7 +3,7 @@ Sondovač changelog Sondovač is a script to create orthologous low-copy nuclear probes from transcriptome and genome skim data for target enrichment. -Version 1.3 regular release released YYYY-MM-DD +Version 1.3 regular release released 2017-12-18 ================================================================================ * bam2fastq is dropped in favour of samtools fastq. No plans to use Picard @@ -14,11 +14,14 @@ Version 1.3 regular release released YYYY-MM-DD B), list of all probes (with putative plastid sequences) and list of putative plastid sequences are available. * Updated software distributed with Sondovač, updated respective sections of - manual. + PDF manual. * Removed FASTX Toolkit, conversion from FASTQ to FASTA is done by simple shell function. -* Improved handling of input/output files when stored in more different +* Improved handling of input/output files when stored in several different directories. +* Tested with Geneious 10, improved description of Geneious usage in the PDF + manual. +* Improved PDF manual. 
Version 1.2 regular release released 2016-06-28 diff --git a/manual/geneious5.png b/manual/geneious5.png new file mode 100644 index 0000000..2a953e3 Binary files /dev/null and b/manual/geneious5.png differ diff --git a/manual/geneious6.png b/manual/geneious6.png new file mode 100644 index 0000000..8f73f65 Binary files /dev/null and b/manual/geneious6.png differ diff --git a/manual/sondovac_manual.pdf b/manual/sondovac_manual.pdf index 3243d34..4bc8db1 100644 Binary files a/manual/sondovac_manual.pdf and b/manual/sondovac_manual.pdf differ diff --git a/manual/sondovac_manual.tex b/manual/sondovac_manual.tex index 367bae2..5a8d849 100644 --- a/manual/sondovac_manual.tex +++ b/manual/sondovac_manual.tex @@ -51,7 +51,7 @@ } } -% Add background for \texttt{} (using pacakge soul) +% Add background for \texttt{} (using package soul) \sethlcolor{Beige} \renewcommand{\texttt}[1]{\hl{\ttfamily #1}} @@ -119,23 +119,23 @@ \section{Introduction} In this study \citep{Schmickl2016} we developed a~novel probe design pipeline for targeting orthologous LCN loci for phylogenetic reconstruction by using genome skim and transcriptome data. In particular, genome skim data of one accession of the studied plant group were combined with a~congeneric transcriptome from the 1000 Plants (1KP) initiative (\url{http://onekp.com/}). We implemented our software workflow in the user-friendly, automated and interactive BASH script Sondovač, which allows a~straightforward design of LCN probes also catering for users with limited bioinformatics skills. -Sondovač workflow is divided into three parts (see details on page~\pageref{pipeline-overview} and Figure~\ref{pipeline-workflow}): +Sondovač workflow is divided into three parts (see details on page~\pageref{pipeline-overview} and in Figure~\ref{pipeline-workflow}): \begin{enumerate} \item Raw input data are analyzed by \texttt{sondovac$\_$part$\_$a.sh}. \item Sequences obtained in part a~are assembled by Geneious in a~separate step by the user. 
-\item Final probes are produced by \texttt{sondovac$\_$part$\_$b.sh}. Eventually, not all plastid sequences will have been removed from the genome skim data by running \texttt{sondovac$\_$part$\_$a.sh}. In such case you will have to remove remaining plastid sequences manually (see detailed documentation of \texttt{sondovac$\_$part$\_$b.sh} on page~\pageref{partb}). +\item Final probes are produced by \texttt{sondovac$\_$part$\_$b.sh}. \end{enumerate} \subsection{Pipeline -- how the data are processed} -A~transcriptome assembly and paired-end genome skim raw data are combined to get hundreds of orthologous LCN loci \citep{Schmickl2016}. Enrichment of multi-copy loci is minimized by using unique transcripts only, which are obtained by comparing all transcripts and removing those sharing $\geq$90\% sequence similarity using BLAT. Before matching the genome skim data against those unique transcripts, reads of plastid (and mitochondrial) origin are removed with Bowtie~2 and SAMtools, utilizing reference sequences. Paired-end reads are subsequently combined with FLASh. These processed reads are matched against the unique transcripts sharing $\geq$85\% sequence similarity with BLAT. Transcripts with $>$1000 BLAT hits (indicating repetitive elements) and BLAT hits containing masked nucleotides are removed before de novo assembly of the BLAT hits to larger contigs with Geneious, using the medium sensitivity / fast setting. After assembly, only those contigs that comprise exons of a~minimum bait length (usually $\geq$120 bp in case of probe design for phylogenies) and have a~certain minimum total locus length (multiple of the bait length, should not be too short in order to obtain sufficient phylogenetically informative signal; we recommend at least $\geq$600 bp) are retained. 
To ensure that probes do not target multiple similar loci, any probe sequences sharing $\geq$90\% sequence similarity are removed using cd-hit-est, followed by a~second filtering step for contigs containing exons of a~minimum bait length and totaling minimum loci length (see comments above). To ensure that plastid sequences are absent from the probes, the probe sequences are matched against the plastome reference sharing $\geq$90\% sequence similarity with BLAT and the hits removed from the probe set. The workflow of Sondovač is summarized in Figure~\ref{pipeline-workflow}. The direction of the workflow is indicated by arrows. An optional removal of reads of mitochondrial origin from the genome skim data is indicated by greyed text. The required input files of Sondovač are highlighted in bold. +A~transcriptome assembly and paired-end genome skim raw data are combined to get hundreds of orthologous LCN loci \citep{Schmickl2016}. Enrichment of multi-copy loci is minimized by using unique transcripts only, which are obtained by comparing all transcripts and removing those sharing $\geq$90\% sequence similarity using BLAT. Before matching the genome skim data against those unique transcripts, reads of plastid (and mitochondrial) origin are removed with Bowtie~2 and SAMtools, utilizing reference sequences. Paired-end reads are subsequently combined with FLASH. These processed reads are matched against the unique transcripts sharing $\geq$85\% sequence similarity with BLAT. Transcripts with $>$1000 BLAT hits (indicating repetitive elements) and BLAT hits containing masked nucleotides are removed before de novo assembly of the BLAT hits to larger contigs with Geneious, using the medium sensitivity / fast setting. 
After assembly, only those contigs that comprise exons of a~minimum bait length (usually $\geq$120 bp in case of probe design for phylogenies) and have a~certain minimum total locus length (multiple of the bait length, should not be too short in order to obtain sufficient phylogenetically informative signal; we recommend at least $\geq$600 bp) are retained. To ensure that probes do not target multiple similar loci, any probe sequences sharing $\geq$90\% sequence similarity are removed using cd-hit-est, followed by a~second filtering step for contigs containing exons of a~minimum bait length and totaling minimum loci length (see comments above). To ensure that plastid sequences are absent from the probes, the probe sequences are matched against the plastome reference sharing $\geq$90\% sequence similarity with BLAT and the hits removed from the probe set. The workflow of Sondovač is summarized in Figure~\ref{pipeline-workflow}. The direction of the workflow is indicated by arrows. An optional removal of reads of mitochondrial origin from the genome skim data is indicated by greyed text. The required input files of Sondovač are highlighted in bold. \begin{figure}[p] \begin{center} \includegraphics[width=14.5cm]{pipeline_workflow.png} \end{center} -\caption[Workflow of the probe design script Sondovač]{Workflow of the probe design script Sondovač. An overview of the main steps of Hyb-Seq are given in the top part of the figure; probe design is the first one. Each step of Sondovač is numbered and illustrated by three boxes: Software is highlighted in yellow, a~summary of each step is given in light blue, and input/output of each step is depicted in light green. An optional removal of reads of mitochondrial origin from the genome skim data is marked by greyed text. The required input files of Sondovač are highlighted in bold. The direction of the workflow is indicated by arrows. 
Eventually, not all plastid sequences will have been removed from the genome skim data in step 2 of the pipeline. In such case, you will have to remove remaining plastid sequences manually (see detailed documentation of \texttt{sondovac$\_$part$\_$b.sh} on page~\pageref{partb}).} +\caption[Workflow of the probe design script Sondovač]{Workflow of the probe design script Sondovač. An overview of the main steps of Hyb-Seq is given in the top part of the figure; probe design is the first one. Each step of Sondovač is numbered and illustrated by three boxes: Software is highlighted in yellow, a~summary of each step is given in light blue, and input/output of each step is depicted in light green. An optional removal of reads of mitochondrial origin from the genome skim data is marked by greyed text. The required input files of Sondovač are highlighted in bold. The direction of the workflow is indicated by arrows.} \label{pipeline-workflow} \end{figure} @@ -169,13 +169,13 @@ \subsection{Pipeline -- how the data are processed} \item \texttt{sondovac$\_$part$\_$b.sh}: Covers steps 8~to 11. \begin{enumerate}[label=\textbf{\arabic*.}, resume] - \item Retention of those contigs that comprise exons $\geq$ bait length and have a~certain total locus length. + \item Retention of those contigs that comprise exons $\geq$ bait length and have a~certain minimum total locus length. \item Removal of probe sequences sharing $\geq$90\% sequence similarity. - \item Retention of those contigs that comprise exons $\geq$ bait length and have a~certain total locus length. - \item Detection of probe sequences sharing $\geq$90\% sequence similarity with the plastome reference. Eventually, not all plastid sequences will have been removed from the genome skim data by running \texttt{sondovac$\_$part$\_$a.sh}. In such case, you will have to remove remaining plastid sequences manually (see detailed documentation of \texttt{sondovac$\_$part$\_$\-b.sh} on page~\pageref{partb}). 
+ \item Retention of those contigs that comprise exons $\geq$ bait length and have a~certain minimum total locus length. + \item Detection of probe sequences sharing $\geq$90\% sequence similarity with the plastome reference. \end{enumerate} - The output file of \texttt{sondovac$\_$part$\_$b.sh} is the final list of probes. In case of detection of remaining plastid sequences you will have to remove those plastid sequences manually from the output file of sondovac$\_$part$\_$b.sh (see detailed documentation of sondovac$\_$part$\_$b.sh on page~\pageref{partb}). + The output file of \texttt{sondovac$\_$part$\_$b.sh} is the final list of probes. \end{enumerate} @@ -186,16 +186,16 @@ \subsection{General considerations before you start} The success of the probe design in terms of a high number of LCN genes of a sufficient minimum total length with Sondovač depends on various aspects of your transcriptome and genome skim input data: \begin{itemize} \item number of transcripts, - \item read length of genome skim reads; longer reads and paired-end reads are preferable due to a higher quality de novo assembly of the reads to contigs (exons), + \item read length of genome skim reads; longer reads and paired-end reads are preferable due to a higher quality de novo assembly of the reads to contigs (exons), \item number of nuclear genome skim reads, \item quality of nuclear genome skim reads, \item sequence divergence between transcriptome and genome skim data. \end{itemize} -These aspects influence the number of probe sequences and the proportion of paralogous loci among the probe sequences. The usage of a transcriptome and genome skim data of \textbf{diploid} accessions is strongly recommended in order to account for orthology of the probe sequences. An example of how one aspect, the number of nuclear genome skim reads, can affect the probe design, is shown in Table~\ref{summary-lcn-examples} and Figure~\ref{seq-div-examples}. 
+These aspects influence the number of probe sequences and the proportion of paralogous loci among the probe sequences. Usage of transcriptome and genome skim data of \textbf{diploid} accessions is strongly recommended in order to account for orthology of the probe sequences. An example of how one aspect, the number of nuclear genome skim reads, can affect the probe design, is shown in Table~\ref{summary-lcn-examples} and Figure~\ref{seq-div-examples}. \begin{longtable}{ | >{\centering\arraybackslash}m{1.8cm} >{\centering\arraybackslash}m{6.5cm} >{\centering\arraybackslash}m{2.5cm} >{\centering\arraybackslash}m{3.4cm} |} -\caption[Summary of two examples of a LCN probe design with Sondovač.]{Summary of two examples of a LCN probe design with Sondovač. The \textit{Oxalis} example is from \citet{Schmickl2016}, the \textit{Curcuma} example is unpublished data from Tomáš Fér and Roswitha Schmickl. The respective Sondovač steps are listed; see Figure~\ref{pipeline-workflow} for details regarding these steps. For both probe designs 250~bp paired-end reads were utilized. Input files are given in \texttt{typewriter} font. Quality control of the genome skim data, which is not part of Sondovač, is colored in \textgr{grey}.}\\ +\caption[Summary of two examples of an LCN probe design with Sondovač.]{Summary of two examples of an LCN probe design with Sondovač. The \textit{Oxalis} example is from \citet{Schmickl2016}, the \textit{Curcuma} example is unpublished data from Tomáš Fér and Roswitha Schmickl. The respective Sondovač steps are listed; see Figure~\ref{pipeline-workflow} for details regarding these steps. For both probe designs 250~bp paired-end reads were utilized. Input files are given in \texttt{typewriter} font. 
Quality control of the genome skim data, which is not part of Sondovač, is colored in \textgr{grey}.}\\ \hline \textbf{Step of Sondovač} & \textbf{Substep of Sondovač} & \textbf{\textit{Oxalis} species} & \textbf{\textit{Curcuma} species}\\ \endfirsthead % All the lines above this will be only on first page @@ -221,13 +221,13 @@ \subsection{General considerations before you start} 4 & Number of combined nuclear genome skim raw reads & 2,619,197 & 3,834,278\\ 4 & Combined nuclear genome skim raw reads as proportion of the total number of nuclear genome skim raw reads & 64\% & 66\%\\ 4 & Total length of combined nuclear genome skim raw reads & 856,720,402~bp & 1,218,798,300~bp\\ -5 & Mean sequence divergence between the unique transcripts and the combined nuclear genome raw skim reads & 7\% & 6\%\\ -5 & Mean sequence length of the match between the unique transcripts and the combined nuclear genome raw skim reads (genome skim data) & 216~bp & 204~bp\\ -5 & Mean sequence length of the match between the unique transcripts and the combined nuclear genome raw skim reads (transcripts) & 194~bp & 195~bp\\ -7 & Mean sequence depth of the contigs (exons) after the de novo assembly of the matching sequences & 4 & 3\\ -7 & Mean sequence length of the contigs (exons) after the de novo assembly of the matching sequences & 114~bp & 169~bp\\ -7 & Mean pairwise identity between the assembled reads of the contigs (exons) after the de novo assembly of the matching sequences & 99\% & 100\%\\ -7 & Minimum pairwise identity between the assembled reads of the contigs (exons) after the de novo assembly of the matching sequences & 84\% & 94\%\\ +5 & Mean sequence divergence between the unique transcripts and the combined nuclear genome skim raw reads & 7\% & 6\%\\ +5 & Mean sequence length of the match between the unique transcripts and the combined nuclear genome skim raw reads (genome skim data) & 216~bp & 204~bp\\ +5 & Mean sequence length of the match between the unique transcripts and 
the combined nuclear genome skim raw reads (transcripts) & 194~bp & 195~bp\\ +7 & Mean sequence depth of the contigs (exons) after de novo assembly of the matching sequences & 4 & 3\\ +7 & Mean sequence length of the contigs (exons) after de novo assembly of the matching sequences & 114~bp & 169~bp\\ +7 & Mean pairwise identity between the assembled reads of the contigs (exons) after de novo assembly of the matching sequences & 99\% & 100\%\\ +7 & Minimum pairwise identity between the assembled reads of the contigs (exons) after de novo assembly of the matching sequences & 84\% & 94\%\\ 11 & Number of exons $\geq$120~bp & 4,926 & 4,618\\ 11 & Number of genes & 1,164 ($\geq$600~bp) & 1,180 ($\geq$960~bp)\\ 11 & Total length of probe sequences & 1,127,2049~bp & 1,571,800~bp @@ -253,7 +253,7 @@ \subsection{Requirements to run Sondovač} In order to run Sondovač you need a~UNIX-based operating system (preferably Linux, alternatively Mac OS~X) equipped with BASH or a~compatible shell interpreter (this should by default be available for any Linux distribution, Mac OS~X and any other UNIX-based operating system like Solaris, BSD and its variants etc.). You should use the current operating system version supported by upstream, otherwise we will not be able to help you in case of problems. Older operating systems can have different versions of shell and system libraries, which can cause various problems and incompatibilities. -Sondovač uses several scientific software packages (namely BLAT, Bowtie2, CD-HIT, FLASh, Geneious, htsjdk, libgtextutils, and SAMtools -- see required versions and links, Table~\ref{software-links}), and basic UNIX tools (see below). Sondovač will check if those programs are installed -- available in the PATH (i.e. if the shell application can locate and launch respective binaries, see also vocabulary at page~\pageref{vocabulary}). 
If you have those packages installed (in current versions, see Table~\ref{software-links}), ensure that their binaries are in PATH. This should not be a~problem for basic tools available in any UNIX-based operating system, as basic installation usually contains all needed tools. If you lack some of the required tools, the script will notify you, and you will have to install them manually. If this is needed, check the documentation for your operating system. +Sondovač uses several scientific software packages (namely BLAT, Bowtie~2, CD-HIT set, FLASH, Geneious, htsjdk, libgtextutils, and SAMtools -- see required versions and links, Table~\ref{software-links}), and basic UNIX tools (see below). Sondovač will check if those programs are installed -- available in the PATH (i.e. if the shell application can locate and launch respective binaries, see also vocabulary on page~\pageref{vocabulary}). If you have those packages installed (in current versions, see Table~\ref{software-links}), ensure that their binaries are in PATH. This should not be a~problem for basic tools available in any UNIX-based operating system, as basic installation usually contains all needed tools. If you lack some of the required tools, the script will notify you, and you will have to install them manually. If this is needed, check the documentation for your operating system. If required programs are not installed, Sondovač will offer you installation. You can use precompiled binaries available together with the script (this is the recommended option) or (sometimes) from the web. In case you would like to compile required software yourself, the script will guide you through this process. This is recommended only for advanced users, as compilation might sometimes be very tricky. Users of Mac OS~X can install those applications also using Homebrew (see \url{https://brew.sh/}). For compilation you need, GNU G++, GNU GCC, GIT, libpng developmental files, and zlib developmental files. 
Ensure you have those tools available -- they should be readily available for any UNIX-based operating system. Chapters~\ref{required-linux} and~\ref{required-mac} give details about requirements and their manual fulfilling. This is mainly a~reference for more advanced users or users with special needs. For most users it should be fully sufficient to run the script and let it do this job (see chapter~\ref{script-start} on page~\pageref{script-usage}). @@ -264,9 +264,9 @@ \subsection{Requirements to run Sondovač} \subsection{Installation of required software in Linux} \label{required-linux} -Linux distributions have precise package management tools (similar, but with more functions, to various app stores known from Android, iOS or recent Mac OS~X, MS~Windows, etc.), but unfortunately Linux repositories\footnote{On-line directories containing various software.} commonly do not contain all needed scientific packages (or not enough recent versions). We recommend to check if repositories of Linux distribution in use contain required scientific software and if not, use pre-compiled binaries of scientific applications available together with the script. If the user wishes to compile the software, for whatever reason, the script will guide through that process. Please note, that compilation may be complicated and require certain level of experience. +Linux distributions have precise package management tools (similar, but with more functions, to various app stores known from Android, iOS or recent Mac OS~X, MS~Windows, etc.), but unfortunately Linux repositories\footnote{On-line directories containing various software.} commonly do not contain all needed scientific packages (or not enough recent versions). We recommend to check if repositories of Linux distribution in use contain required scientific software and if not, to use pre-compiled binaries of scientific applications available together with the script. 
If the user wishes to compile the software, for whatever reason, the script will guide them through that process. Please note that compilation may be complicated and require a certain level of experience. -Following sections describe adding of extra repositories (if necessary) and installation of required scientific software on major Linux distributions. It is not part of the script itself, it must be done manually, and might require adjusting for particular Linux installation. For Linux users, the script offers usage of precompiled binaries or compilation of required software. Unfortunately, it is hard to cover variability of Linux distributions. +The following sections describe the addition of extra repositories (if necessary) and installation of required scientific software on major Linux distributions. It is not part of the script itself; it must be done manually and might require adjusting for a particular Linux installation. For Linux users, the script offers usage of precompiled binaries or compilation of required software. Unfortunately, it is hard to cover the variability of Linux distributions. \subsubsection{openSUSE and SUSE Linux Enterprise (SLE)} @@ -293,23 +293,23 @@ \subsubsection{openSUSE and SUSE Linux Enterprise (SLE)} Originally, those distributions used only \texttt{rpm*} commands (see \texttt{rpm --help} and \texttt{man rpm} for basic usage). -For openSUSE, there is \href{https://en.opensuse.org/openSUSE:Science_Repositories}{Science Repository}. User can add and use it like this: +For openSUSE, there is \href{https://en.opensuse.org/openSUSE:Science_Repositories}{Science Repository}. 
The user can add and use it like this: \begin{bashcode} # Help for adding new repository zypper ar -h - # Add scientific repository containing Bowtie2 and SAMtools + # Add scientific repository containing Bowtie 2 and SAMtools sudo zypper ar -r http://download.opensuse.org/repositories/science/` \ lsb_release -d | cut -f 2 | sed 's/ /_/'`/science.repo -n science -e -f \ -p 120 - # Install Bowtie2 and SAMtools (Blat, CD-HIT and FLASh are missing) + # Install Bowtie 2 and SAMtools (BLAT, CD-HIT and FLASH are missing) sudo zypper in bowtie2 samtools # Note backslash ("\") means that the code continues on the next line \end{bashcode} \subsubsection{Debian, Ubuntu, Linux Mint and derivatives} -The biggest ``family'' of Linux distributions. Debian (\url{https://www.debian.org/}) (one of the odlest and biggest distributions), Linux Mint (\url{https://linuxmint.com/}), Ubuntu (\url{https://www.ubuntu.com/}) and all derived distributions\footnote{For complete lists see \url{https://distrowatch.com/search.php?basedon=Debian} and \url{https://distrowatch.com/search.php?basedon=Ubuntu}.} like Kubuntu (\url{https://kubuntu.com/}) use for package management commands \texttt{apt-get} (basic) and \texttt{aptitude} (text-based front-end for \texttt{apt-get}, recommended, not available by default in every DEB based distribution). There are more tools available\footnote{See \url{https://wiki.debian.org/PackageManagement} for list of tools and \url{https://www.debian.org/doc/manuals/debian-reference/ch02.en.html} for exhaustive documentation. A~shorter introduction is available at \url{https://help.ubuntu.com/community/AptGet/Howto} and \url{http://ubuntuguide.org/wiki/Ubuntu_Trusty_Packages_and_Repositories}. Ubuntu-specific information at \url{https://help.ubuntu.com/stable/ubuntu-help/addremove.html}.}, we will describe only the basic usage needed for our purpose. The script will check if all required software packages are installed, and if not, will install them. 
You can also install manually: +The biggest ``family'' of Linux distributions. Debian (\url{https://www.debian.org/}) (one of the oldest and biggest distributions), Linux Mint (\url{https://linuxmint.com/}), Ubuntu (\url{https://www.ubuntu.com/}) and all derived distributions\footnote{For complete lists see \url{https://distrowatch.com/search.php?basedon=Debian} and \url{https://distrowatch.com/search.php?basedon=Ubuntu}.} like Kubuntu (\url{https://kubuntu.com/}) use for package management commands \texttt{apt-get} (basic) and \texttt{aptitude} (text-based front-end for \texttt{apt-get}, recommended, not available by default in every DEB based distribution). There are more tools available\footnote{See \url{https://wiki.debian.org/PackageManagement} for list of tools and \url{https://www.debian.org/doc/manuals/debian-reference/ch02.en.html} for exhaustive documentation. A~shorter introduction is available at \url{https://help.ubuntu.com/community/AptGet/Howto} and \url{http://ubuntuguide.org/wiki/Ubuntu_Trusty_Packages_and_Repositories}. Ubuntu-specific information at \url{https://help.ubuntu.com/stable/ubuntu-help/addremove.html}.}. We will describe only the basic usage needed for our purpose. The script will check if all required software packages are installed, and if not, will install them. You can also install manually: \begin{bashcode} # Verify installation of basic tools (they are installed in 99.9%): @@ -341,7 +341,7 @@ \subsubsection{Debian, Ubuntu, Linux Mint and derivatives} Note you can use \texttt{aptitude} in a~similar way as \texttt{apt-*} commands (e.g. \texttt{aptitude instal PACKAGE} etc.). For special package operations, there are plenty of \texttt{dpkg} commands for advanced management. -Debian-based distributions have Bowtie2, CD-HIT and SAMtools (Blat and FLASh are missing) in their repositories. 
For Debian, it is readily installable, for Ubuntu it is necessary to enable \texttt{universe} repository by command \texttt{sudo add-apt-repository universe}. For graphical way and more details see \url{https://help.ubuntu.com/community/Repositories/Ubuntu}. Not not all Linux distributions derived from Debian and Ubuntu contain the packages. It is possible to add repositories from Debian or Ubuntu, but description is beyond this guide. +Debian-based distributions have Bowtie~2, CD-HIT and SAMtools (BLAT and FLASH are missing) in their repositories. For Debian, it is readily installable, for Ubuntu it is necessary to enable \texttt{universe} repository by command \texttt{sudo add-apt-repository universe}. For graphical way and more details see \url{https://help.ubuntu.com/community/Repositories/Ubuntu}. Not all Linux distributions derived from Debian and Ubuntu contain the packages. It is possible to add repositories from Debian or Ubuntu, but description is beyond this guide. \begin{bashcode} # On Ubuntu and derivatives, allow universe repository @@ -372,21 +372,21 @@ \subsubsection{RedHat, Fedora, Centos, Scientific Linux and derivatives} # Note backslash ("\") means that the code continues on the next line \end{bashcode} -Since version 22, Fedora uses the command \texttt{dnf} for package management. It replaces older \texttt{yum}, and \texttt{yum} commands are redirected to \texttt{dnf}. The basic usage is the same, so that one can just replace \texttt{yum} with \texttt{dnf} in the above examples, see \url{https://dnf.readthedocs.io/en/latest/command_ref.html} for more usage of DNF on recent Fedora. Originally, those distributions used only \texttt{rpm*} commands (see \texttt{rpm --help} and \texttt{man rpm} for basic usage). Unfortunately, these disctributions do not contain much of required software (at least not in official repositories). +Since version 22, Fedora uses the command \texttt{dnf} for package management. 
It replaces older \texttt{yum}, and \texttt{yum} commands are redirected to \texttt{dnf}. The basic usage is the same, so that one can just replace \texttt{yum} with \texttt{dnf} in the above examples, see \url{https://dnf.readthedocs.io/en/latest/command_ref.html} for more info about usage of DNF on recent Fedora. Originally, those distributions used only \texttt{rpm*} commands (see \texttt{rpm --help} and \texttt{man rpm} for basic usage). Unfortunately, these distributions do not contain much of the required software (at least not in official repositories). \subsection{Installation of required software in Mac OS~X} \label{required-mac} -For Mac OS~X users, Homebrew (see \url{https://brew.sh/} and \url{https://github.com/Homebrew/}) will be installed by the script, and it will install (new software or newer versions) BASH (the shell interpreter), GNU AWK, GNU coreutils, GNU GCC, GNU grep, GNU make, GNU sed, and wget. Mac OS~X is missing some tools and contains outdated BSD versions for others (typically sed, grep or awk). The script will guide the user through the process, and the user can safely and easily remove these tools afterwards if necessary. Unfortunately, Mac OS~X does not have usable build-in package management, and it has outdated versions of some required tools. Homebrew fills this gap. It is a~simple command-line installer (similar to package managers known from Linux, BSD or Solaris) of various applications. +For Mac OS~X users, Homebrew (see \url{https://brew.sh/} and \url{https://github.com/Homebrew/}) will be installed by the script, and it will install (new software or newer versions) BASH (the shell interpreter), GNU AWK, GNU coreutils, GNU GCC, GNU grep, GNU make, GNU sed, and wget. Mac OS~X is lacking some tools and contains outdated BSD versions for others (typically sed, grep or awk). The script will guide the user through the process, and the user can safely and easily remove these tools afterwards if necessary. 
Unfortunately, Mac OS~X does not have usable built-in package management, and it has outdated versions of some required tools. Homebrew fills this gap. It is a~simple command-line installer (similar to package managers known from Linux, BSD or Solaris) of various applications. Homebrew requires Xcode\footnote{\url{https://developer.apple.com/xcode/}} (set of tools required to compile software) to be installed. Unfortunately, it is not possible to easily and universally check if Xcode is installed, so that the script will ask if the user wishes to install it. If the user is unsure if Xcode is installed, it is safe to answer \texttt{Yes} and install it. The manual command to install Xcode is the following: \begin{bashcode} xcode-select --install # Following error means Xcode has already been installed: - xcode-select: note: no developer tools were found at '/Applications/Xcode.app', - requesting install. Choose an option in the dialog to download the command - line developer tools. + xcode-select: note: no developer tools were found at '/Applications/Xcode. + app', requesting install. Choose an option in the dialog to download the + command line developer tools. 
# Verify Xcode installation by xcode-select --print-path # Prints installation location of Xcode xcode-select --version # Prints version of Xcode @@ -482,7 +482,7 @@ \subsubsection{Examples} ./sondovac_part_a.sh -i -f input.fa -t reads1.fastq -q reads2.fastq \end{bashcode} -Running in non-interactive, automated mode (parameter "\texttt{-n}", see chapter~\ref{script-usage} at page~\pageref{script-usage}) with +Running in non-interactive, automated mode (parameter "\texttt{-n}", see chapter~\ref{script-usage} on page~\pageref{script-usage}) with example data downloaded from \url{https://github.com/V-Z/sondovac/wiki/Sample-data}: \begin{bashcode} @@ -513,7 +513,7 @@ \subsubsection{Examples} ./sondovac_part_a.sh -s 950 \end{bashcode} -We recommend launching Sondovač in interactive mode,at least for the first time, so that the script can verify all requirements and install missing tools where needed. We recommend using non-interactive mode for routine usage. +We recommend launching Sondovač in interactive mode, at least for the first time, so that the script can verify all requirements and install missing tools where needed. We recommend using the non-interactive mode for routine usage. \subsection{Help for usage of terminal} @@ -537,7 +537,7 @@ \subsection{Help for usage of terminal} \subsection{Geneious} \label{geneious} -For part \textbf{B} of the script the user must have Geneious \citep{Kearse2012}. Geneious is a~DNA alignment, assembly, and analysis software and one of the most common software platforms used in genomics. It is utilized for de novo assembly in Sondovač. We plan to replace it with a free open-source command line tool in a future release of Sondovač. Visit \url{https://www.geneious.com/} for download, purchase, installation and usage of Geneious. After the input data are processed (interactively or not) by \texttt{sondovac$\_$part$\_$a.sh}, the user must process its output manually with Geneious according to the instructions given below. 
The output of Geneious is then processed by \texttt{sondovac$\_$part$\_$b.sh}, which produces the final probe set. Geneious versions 6, 7~and 8 have been tested and are compatible with this script. +For part \textbf{B} of the script the user must have Geneious \citep{Kearse2012}. Geneious is a~DNA alignment, assembly, and analysis software and one of the most common software platforms used in genomics. It is utilized for de novo assembly in Sondovač. We plan to replace it with a free open-source command line tool in a future release of Sondovač. Visit \url{https://www.geneious.com/} for download, purchase, installation and usage of Geneious. After the input data are processed (interactively or not) by \texttt{sondovac$\_$part$\_$a.sh}, the user must process its output manually with Geneious according to the instructions given below (see page~\pageref{geneious-usage}). The output of Geneious is then processed by \texttt{sondovac$\_$part$\_$b.sh}, which produces the final probe set. Geneious versions 6--10 have been tested and are compatible with this script. \subsection{Software used by Sondovač} @@ -555,9 +555,9 @@ \subsection{Software used by Sondovač} \endlastfoot BASH & v. > 4 & \url{https://gnu.org/software/bash/bash.html}\\ BLAT & v.36 & \url{https://genome.ucsc.edu/FAQ/FAQblat.html}\\ -Bowtie2 & 2.2.6 & \url{http://bowtie-bio.sourceforge.net/bowtie2/index.shtml}\\ +Bowtie~2 & 2.2.6 & \url{http://bowtie-bio.sourceforge.net/bowtie2/index.shtml}\\ CD-HIT & 4.6 & \url{http://weizhongli-lab.org/cd-hit/}\\ -FLASh & 1.2.11 & \url{https://sourceforge.net/projects/flashpage/}\\ +FLASH & 1.2.11 & \url{https://sourceforge.net/projects/flashpage/}\\ G++, GCC & v. > 4.2 & \url{https://gcc.gnu.org/}\\ Geneious & v.
> 6.1 & \url{https://www.geneious.com/}\\ GNU core utils & 8.X & \url{https://gnu.org/software/coreutils/coreutils.html}\\ @@ -573,9 +573,9 @@ \subsection{Software used by Sondovač} \begin{itemize} \item BLAT - \item Bowtie2 + \item Bowtie~2 \item SAMtools - \item FLASh + \item FLASH \end{itemize} \texttt{sondovac$\_$part$\_$b.sh} requires (and will install) the following software packages: @@ -590,7 +590,7 @@ \subsection{Software used by Sondovač} \begin{description} \item[BLAT] \citet{Kent2002}: BLAT -- the BLAST-like alignment tool. - \item[Bowtie2] \citet{Langmead2012}: Fast gapped-read alignment with Bowtie 2. + \item[Bowtie~2] \citet{Langmead2012}: Fast gapped-read alignment with Bowtie 2. \item[CD-HIT] There are several papers describing CD-HIT: \begin{itemize} \item \citet{Li2001}: Clustering of highly homologous sequences to reduce the size of large protein databases. @@ -601,7 +601,7 @@ \subsection{Software used by Sondovač} \item \citet{Niu2010}: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. \item \citet{Li2012b}: Ultrafast clustering algorithms for metagenomic sequence analysis. \end{itemize} - \item[FLASh] \citet{Magoc2011}: FLASh: fast length adjustment of short reads to improve genome assemblies. + \item[FLASH] \citet{Magoc2011}: FLASH: fast length adjustment of short reads to improve genome assemblies. \item[Geneious] \citet{Kearse2012}: Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. \item[grab$\_$syngleton$\_$clusters.py] \citet{Weitemier2014}: Hyb-Seq: Combining target enrichment and genome skimming for plant phylogenomics. \item[SAMtools] There are several papers describing SAMtools: @@ -615,7 +615,7 @@ \subsection{Software used by Sondovač} \subsection{The PATH variable} -PATH (\$PATH) is a~system variable used in every UNIX system. 
It lists directories (separated by colon ":") where the current shell (see also Chapter~\ref{vocabulary} Vocabulary at page~\pageref{vocabulary}) searches for binaries (commands), so that the user does not have to specify the full path to the software (e.g. just \texttt{sed} instead of \texttt{/usr/bin/sed}). If some software is installed outside standard locations, the user must specify the full path, even if the user is located in the same directory as the software (e.g. \texttt{./sondovac$\_$part$\_$a.sh} -- this is for security reasons). In the case of two commands with the same name (e.g. \texttt{/bin/somecommand} and \texttt{/usr/bin/somecommand}), the order of directories in \$PATH matters -- the first occurrence is used, and any later commands are ignored (but this is usually a~rare case). PATH can be managed using the following commands: +PATH (\$PATH) is a~system variable used in every UNIX system. It lists directories (separated by colon ":") where the current shell (see also Chapter~\ref{vocabulary} Vocabulary on page~\pageref{vocabulary}) searches for binaries (commands), so that the user does not have to specify the full path to the software (e.g. just \texttt{sed} instead of \texttt{/usr/bin/sed}). If some software is installed outside standard locations, the user must specify the full path, even if the user is located in the same directory as the software (e.g. \texttt{./sondovac$\_$part$\_$a.sh} -- this is for security reasons). In the case of two commands with the same name (e.g. \texttt{/bin/somecommand} and \texttt{/usr/bin/somecommand}), the order of directories in \$PATH matters -- the first occurrence is used, and any later commands are ignored (but this is usually a~rare case). 
PATH can be managed using the following commands: \begin{bashcode} # See the $PATH variable @@ -648,7 +648,7 @@ \subsection{Vocabulary} \item[Compilation] "Translation" of software application from the source code (text readable by human programmer) into binary form launchable by the computer. It requires special tools (compilers), and it usually must be done for every operating system and hardware platform. \item[Console] See "Shell". \item[Debian] One of the oldest and most popular Linux distributions. See \url{https://www.debian.org/}. - \item[Fedora] Popular Linux distribution developed together with RedHat Linux as its free community testing platform. See \url{https://getfedora.org/}. + \item[Fedora] Popular Linux distribution developed together with RedHat Linux as its a~free community testing platform. See \url{https://getfedora.org/}. \item[GNU] Major project providing free software widely used in many operating systems, see \url{https://gnu.org/}. \item[Homebrew] Tool primarily for Mac OS~X (although there is also a~Linux version available) replacing the virtually missing package manager for this system. Can be used to install plenty of various applications as well as updating tools already available in Mac OS~X. See \url{https://brew.sh/}. \item[Library] Pack of software tools and functions used by other applications. @@ -707,8 +707,8 @@ \subsubsection{General parameters} \item[\texttt{-p}] Display INSTALL for detailed installation instructions. Exit viewing by pressing the \texttt{Q} key. See also page~\pageref{install}. \item[\texttt{-e}] Display detailed citation information and exit. \item[\texttt{-o}] Set name of output files. Output files will start with that name. Do not use spaces or special characters - some software can't handle them correctly. The default value (if the user does not provide another name) is "output". See below for the list of produced output files. 
- \item[\texttt{-i}] Running in interactive mode -- the script will on-demand ask for the required input files, installation of missing software etc.. This is the recommended default value (the script runs interactively without explicitly using option \texttt{-n}). - \item[\texttt{-n}] Running in non-interactive mode. The user must provide at least the required input files (see below). You can use only one of the parameters \texttt{-i} or \texttt{-n} (not both of them). If the script fails to find some of the required software packages, it will exit. This is recommended for batch or repeated analysis, on remote servers and for more advanced users. The user must be sure that all required software is installed (see page~\pageref{install}). + \item[\texttt{-i}] Running in interactive mode -- the script will on-demand ask for the required input files, installation of missing software etc. This is the recommended default value (the script runs interactively without explicitly using option \texttt{-n}). + \item[\texttt{-n}] Running in non-interactive mode. The user must provide at least the required input files (see below). You can use only one of the parameters \texttt{-i} or \texttt{-n} (not both of them). If the script fails to find some of the required software packages, it will exit. This is recommended for batch or repeated analyses, on remote servers, and for more advanced users. The user must be sure that all required software is installed (see page~\pageref{install}). \end{description} \subsubsection{Input files} @@ -813,7 +813,7 @@ \subsubsection{Optional parameters} \subsection{Input and output files} -All names of input files and paths to them must be without spaces and without special characters (some software has difficulties handling them). \textbf{Important note:} HTS data are big. The Sondovač pipeline is relatively long, and part \texttt{A} contains several format conversions and can (for some time) use dozens of GB of disk space. 
Temporary files not potentially useful to the user are deleted at the end of the pipeline -- these files may be useful for debugging if something goes wrong. For example, input data of \citet{Schmickl2016} are approximately 4.5~GB, and the overall output of part \texttt{A} of the script is about 28~GB, of which less then half is kept by the pipeline. This analysis took less than an hour on an i7 3.4~GHz CPU. Part \texttt{B} is very quick and does not consume a~significant amount of disk space. All input files \textit{must} have UNIX end of lines. The script checks for it and converts the files, if needed (using \texttt{dos2unix}; typically when user runs Geneious on Windows). +All names of input files and paths to them must be without spaces and without special characters (some software has difficulties handling them). \textbf{Important note:} HTS data are big. The Sondovač pipeline is relatively long, and part \texttt{A} contains several format conversions and can (for some time) use dozens of GB of disk space. Temporary files not potentially useful to the user are deleted at the end of the pipeline -- these files may be useful for debugging if something goes wrong. For example, input data of \citet{Schmickl2016} are approximately 4.5~GB, and the overall output of part \texttt{A} of the script is about 28~GB, of which less than half is kept by the pipeline. This analysis took less than an hour on an i7 3.4~GHz CPU. Part \texttt{B} is very quick and does not consume a~significant amount of disk space. All input files \textit{must} have UNIX line endings. The script checks for this and converts the files, if needed (using \texttt{dos2unix}; typically when the user runs Geneious on Windows).
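The line-ending check mentioned above can be sketched in BASH as follows. This is only a minimal illustration, not the code used by Sondovač: the function name \texttt{ensure$\_$unix$\_$eol} is hypothetical, and the fallback to \texttt{tr} (used when \texttt{dos2unix} is not installed) is an assumption added for this example.

```shell
#!/bin/bash
# Hypothetical helper (NOT part of Sondovač): detect DOS/Windows (CRLF)
# line endings in a file and convert them to UNIX (LF) line endings.
ensure_unix_eol() {
  local file="$1"
  # grep -q returns 0 if at least one line ends with a carriage return
  if grep -q $'\r$' "$file"; then
    echo "Converting $file to UNIX line endings"
    if command -v dos2unix >/dev/null 2>&1; then
      # Preferred tool, converts the file in place
      dos2unix "$file"
    else
      # Assumed fallback: delete all carriage returns with tr
      tr -d '\r' < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
  else
    echo "$file already has UNIX line endings"
  fi
}
```

It could be called, for example, as \texttt{ensure$\_$unix$\_$eol contigs.tsv} on a file exported from Geneious on Windows before passing that file to part~B (the file name is just an illustration).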
\vspace{10pt} \textbf{Script \texttt{sondovac$\_$part$\_$a.sh} requires as input files:} @@ -864,10 +864,11 @@ \subsection{Input and output files} \item \texttt{*$\_$prelim$\_$probe$\_$seq$\_$cluster$\_$100.fasta} -- Unclustered exons and clustered exons with 100\% sequence identity. \item \texttt{*$\_$prelim$\_$probe$\_$seq$\_$cluster$\_$90.clstr} -- Unclustered exons and clustered exons with more than a certain sequence similarity (CLSTR file). \item \texttt{*$\_$unique$\_$prelim$\_$probe$\_$seq.fasta} -- Unclustered exons / exons with less than a certain sequence similarity. - \item \texttt{*$\_$similarity$\_$test.fasta} -- Contigs that comprise exons $\geq$ bait length and have a~certain total locus length. + \item \texttt{*$\_$similarity$\_$test.fasta} -- Contigs that comprise exons $\geq$ bait length and have a~certain minimum total locus length. \item \texttt{*$\_$target$\_$enrichment$\_$probe$\_$sequences$\_$with$\_$pt.fasta} -- All probes in FASTA, with putative plastid sequences (if there were any BLAT hits, putative plastid sequences are listed in next file). - \item \texttt{*$\_$possible$\_$cp$\_$dna$\_$gene$\_$in$\_$probe$\_$set.pslx} -- In case of any BLAT hits, putative remaining plastid probe sequences from \texttt{*$\_$target$\_$enrichm\-ent$\_$probe$\_$sequences$\_$with$\_$pt\-.fasta} are listed here. \textbf{Not removing plastid genes will take lots of space on the Illumina lane for enriched plastid reads that should actually be available for enriched nuclear reads.} + \item \texttt{*$\_$possible$\_$cp$\_$dna$\_$gene$\_$in$\_$probe$\_$set.pslx} -- In case of any BLAT hits, putative remaining plastid probe sequences from \texttt{*$\_$target$\_$enrichm\-ent$\_$probe$\_$sequences$\_$with$\_$pt\-.fasta} are listed here. 
\textbf{Not removing plastid genes will occupy lots of space on the Illumina lane for enriching those plastid loci; this space should be available for enriching the nuclear loci!} \item \underline{\texttt{*$\_$target$\_$enrichment$\_$probe$\_$sequences.fasta}} -- \textbf{Final probes in FASTA.} + \end{enumerate} An asterisk (\texttt{*}) denotes the beginning of the output files' names specified by the user with parameter \texttt{-o}. If the user does not select a~custom name, the default value (\texttt{output}) will be used. By default, output files are created in the same directory from which Sondovač was launched. Output files can be saved in a~custom directory by specifying an output directory with parameter \texttt{-o}: @@ -896,31 +897,47 @@ \subsection{Geneious usage} \label{geneious-import} \end{figure} -Select the file and go to menu \textbf{Tools | Align / Assemble | De Novo Assemble}. In \textbf{Data} frame select \textbf{Assemble by 1st (\ldots) Underscore}. In \textbf{Method} frame select \textbf{Geneious Assembler} (if you don't have other assemblers, this option might be missing) and \textbf{Medium Sensitivity / Fast} sensitivity (see Figure~\ref{geneious-assembly}). +Select the file and go to menu \textbf{Tools | Align / Assemble | De Novo Assemble\ldots}. In \textbf{Data} frame select \textbf{Assemble by 1st (\ldots) Underscore}. In \textbf{Method} frame select \textbf{Geneious Assembler} (if you don't have other assemblers, this option might be missing) and \textbf{Medium Sensitivity / Fast} sensitivity (see Figures~\ref{geneious-assembly} and \ref{geneious-assembly9}). -In \textbf{Results} frame check \textbf{Save assembly report}, \textbf{Save list of unused reads}, \textbf{Save in sub-folder}, \textbf{Save contigs} (do not check \textbf{Maximum}) and \textbf{Save consensus sequences} (Click to \textit{Options} -- \textbf{Save consensus used by assembler} must be selected.). \textbf{Do not trim}. Otherwise keep defaults. Run it. 
Geneious may warn about possible hanging because of big file size. Do not use Geneious for other tasks during the assembly. Running Geneious may take a~long time (see Figure~\ref{geneious-assembly}). +In the \textbf{Results} frame, the field \textbf{Assembly Name} must, in Geneious~7 and newer, contain the string \texttt{\{Reads Name\} Assembly}. Check \textbf{Save assembly report}, \textbf{Save list of unused reads}, \textbf{Save in sub-folder}, \textbf{Save contigs} (do not check \textbf{Maximum}) and \textbf{Save consensus sequences} (click \textit{Options} -- \textbf{Save consensus used by assembler} must be selected). \textbf{Do not trim}. Otherwise keep defaults (see Figures~\ref{geneious-assembly} and \ref{geneious-assembly9}). Run it. Geneious may warn about possible hanging because of big file size. Do not use Geneious for other tasks during the assembly. Running Geneious may take a~long time. \begin{figure}[htb] \begin{center} \includegraphics[width=12cm]{geneious2.png} \end{center} - \caption[Settings of Geneious assembly]{Settings of Geneious assembly as described in the main text. It can take a~longer time to run it.} + \caption[Settings of Geneious~6 assembly]{Settings of Geneious assembly as described in the main text. It can take a~longer time to run it. This screenshot is from Geneious~6. Compare with newer versions such as Geneious~9 (Figure~\ref{geneious-assembly9}).} \label{geneious-assembly} \end{figure} -Select all resulting contigs (typically named \textbf{* Contig \#}) and export them (go to menu \textbf{File | Export | Selected Documents\ldots}) as \textbf{Tab-separated table values (*.tsv)}. Save the following columns (Hold \texttt{Ctrl} key to mark more fields): \textbf{\# Sequences}, \textbf{\% Pairwise Identity}, \textbf{Description}, \textbf{Mean Coverage}, \textbf{Name} and \textbf{Sequence Length}. If this option is inaccessible to you, export all columns (see Figure~\ref{geneious-export1}). Warning!
Do not select and export \textbf{* Consensus Sequences}, \textbf{* Unused Reads} or \textbf{* Report} -- only the individual \textbf{* contig \#} files (see Figure~\ref{geneious-export1}). +\begin{figure}[htb] + \begin{center} + \includegraphics[width=\textwidth]{geneious5.png} + \end{center} + \caption[Settings of Geneious~9 assembly]{Settings of Geneious assembly as described in the main text; this screenshot shows a~newer version of Geneious (9 in this case). Compare with Figure~\ref{geneious-assembly}. Note the string in the \textbf{Assembly Name} field. This is important for correct naming of output sequences.} + \label{geneious-assembly9} +\end{figure} -\begin{figure}[p] +After sequences are assembled, select all resulting contigs (typically named \textbf{* Contig \#} or \textbf{* Assembly \#}) and export them (go to menu \textbf{File | Export | Selected Documents\ldots}) as \textbf{Tab-separated table values (*.tsv)}. Save the following columns (Hold \texttt{Ctrl} key to mark more fields): \textbf{\# Sequences}, \textbf{\% Pairwise Identity}, \textbf{Description}, \textbf{Mean Coverage}, \textbf{Name} and \textbf{Sequence Length}. If this option is inaccessible to you, export all columns (see Figure~\ref{geneious-export1}). Warning! Do not select and export \textbf{* Consensus Sequences}, \textbf{* Unused Reads} or \textbf{* Report} -- only the individual \textbf{* Contig/Assembly \#} files (see Figure~\ref{geneious-export1}). + +\begin{figure}[hbtp] \includegraphics[width=\textwidth]{geneious3.png} - \caption[Export of contigs as TSV from Geneious]{Select all (and only) \textbf{* Contig \#} files \textbf{(1)}. Go to menu File \textbf{(2)} | Export \textbf{(3)} | Selected Documents\ldots \textbf{(4)} and export them as Tab-separated table values (TSV) \textbf{(5)}.
Export only marked columns \textbf{(6)} (hold \texttt{Ctrl} to mark more fields).} + \caption[Export of contigs as TSV from Geneious]{Select all (and only) \textbf{* Contig \#} or \textbf{* Assembly \#} files \textbf{(1)} (compare with the appearance of output sequences in newer versions of Geneious in Figure~\ref{geneious-assembly-contigs}). Go to menu File \textbf{(2)} | Export \textbf{(3)} | Selected Documents\ldots \textbf{(4)} and export them as Tab-separated table values (TSV) \textbf{(5)}. Export only marked columns \textbf{(6)} (hold \texttt{Ctrl} to mark more fields). This screenshot is from Geneious~6.} \label{geneious-export1} \end{figure} +\begin{figure}[htb] + \begin{center} + \includegraphics[width=12cm]{geneious6.png} + \end{center} + \caption[Contigs in newer versions of Geneious]{In newer versions of Geneious, the word ``Assembly'' is used instead of ``Contig''. \texttt{sondovac$\_$part$\_$\-b.sh} requires one of these words and the same naming scheme of sequences (\textbf{* Contig \#} or \textbf{* Assembly \#}). This screenshot is from Geneious~9 (compare with Figure~\ref{geneious-export1}).} + \label{geneious-assembly-contigs} +\end{figure} + Select items \textbf{Consensus Sequences} and \textbf{Unused Reads} and export them as one \textbf{FASTA}. Go to menu \textbf{File | Export | Selected Documents\ldots} and choose \textbf{FASTA file type} (see Figure~\ref{geneious-export2}).
-\begin{figure}[htb] +\begin{figure}[hbt] \includegraphics[width=\textwidth]{geneious4.png} - \caption[Export of FASTA from Geneious]{ Select only documents \textbf{Consensus Sequences} and \textbf{Unused Reads} and export them as FASTA format (see also Figure~\ref{geneious-export1}).} + \caption[Export of FASTA from Geneious]{Select only documents \textbf{Consensus Sequences} and \textbf{Unused Reads} and export them as FASTA format (see also Figure~\ref{geneious-export1}).} \label{geneious-export2} \end{figure} @@ -933,12 +950,12 @@ \subsection{Record output of Sondovač} \begin{bashcode} ./sondovac_part_a.sh | tee records.log man tee # See more options how tee can record the script's output - # "|" is a~pipe passing output of the 1st command as input for the 2nd command + # "|" is a pipe passing output of the 1st command as input for the 2nd command less records.log # See the record. Quit viewing by "Q" rm records.log # Delete the log file \end{bashcode} -You can use any command line arguments; the script will behave as usual. The plain text file \texttt{records.log} will then contain all its output. Unfortunately, \texttt{tee} usually wrongly records "invisible" characters -- tabs and coloration used to highlight user messages in the script. If you see weird characters in texttt{records.log} that disturb reading, use the following commands: +You can use any command line arguments; the script will behave as usual. The plain text file \texttt{records.log} will then contain all its output. Unfortunately, \texttt{tee} usually wrongly records "invisible" characters -- tabs and coloration used to highlight user messages in the script. 
If you see weird characters in \texttt{records.log} that disturb reading, use the following commands: \begin{bashcode} # Assume output of Sondovač is named "records.log" @@ -948,7 +965,7 @@ \subsection{Record output of Sondovač} sed -i 's/.(B.\[m//g' records.log # Explanation of regular expression (find pattern and replace by nothing): # any character, (, B, [, m (sequence defining text formatting) - # Escaping \[ \] is required to search specifically for brackets [] + # Escaping \[ \] is required to search specifically for square brackets [] # (NOT searching for any character within [...] - there is no escaping) # but \{...\} define number of occurrences of previous character(s) \end{bashcode} @@ -969,7 +986,7 @@ \section{Sample data} \item \texttt{input5$\_$Ricinus$\_$communis$\_$reference$\_$mitochondrial$\_$genome.fasta} -- mtDNA reference (parameter \texttt{-m}), GenBank reference number \href{https://www.ncbi.nlm.nih.gov/nuccore/323649872/}{NC$\_$015141}. \end{enumerate} -The transcriptome input file is unpublished data from G. K.-S. Wong et al. Data can be also found under +The transcriptome input file is unpublished data from G.~K.-S. Wong et~al. Data can also be found under \begin{itemize} \item \url{http://www.onekp.com/} @@ -989,16 +1006,18 @@ \section{Changelog} % TODO Update changelog List of changes in released versions of Sondovač. -\subsection{Version 1.3 regular release released 2017-MM-DD} +\subsection{Version 1.3 regular release released 2017-12-18} \begin{itemize} \item \texttt{bam2fastq} is dropped in favour of \texttt{samtools fastq}. No plans to use \texttt{Picard} anymore (part~A). \item Simplified \texttt{INSTALL} and \texttt{README} not to only copy PDF manual. \item Corrected output of part~A -- ensure to have always valid FASTA. \item Automatically remove putative plastid sequences from final probe set (part~B), list of all probes (with putative plastid sequences) and list of putative plastid sequences are available. 
- \item Updated software distributed with Sondovač, updated respective sections of manual. + \item Updated software distributed with Sondovač, updated respective sections of PDF manual. \item Removed FASTX Toolkit, conversion from FASTQ to FASTA is done by simple shell function. - \item Improved handling of input/output files when stored in more different directories. + \item Improved handling of input/output files when stored in several different directories. + \item Tested with Geneious 10, improved description of Geneious usage in the PDF manual. + \item Improved PDF manual. \end{itemize} \subsection{Version 1.2 regular release released 2016-06-28} @@ -1195,7 +1214,7 @@ \subsubsection{4. Conveying Verbatim Copies} \subsubsection{5. Conveying Modified Source Versions} -You may convey a~work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: +You may convey a~work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet~all of these conditions: \begin{enumerate}[label=\Alph*)] \item The work must carry prominent notices stating that you modified it, and giving a~relevant date. @@ -1358,7 +1377,7 @@ \subsubsection{Terms and Conditions for Copying, Distribution and Modification} You may charge a~fee for the physical act of transferring a~copy, and you may at your option offer warranty protection in exchange for a~fee. -2. You may modify your copy or copies of the Program or any portion of it, thus forming a~work based on the Program, and copy and distribute such modifications or work under the terms of Section 1~above, provided that you also meet all of these conditions: +2. 
You may modify your copy or copies of the Program or any portion of it, thus forming a~work based on the Program, and copy and distribute such modifications or work under the terms of Section 1~above, provided that you also meet~all of these conditions: \begin{enumerate}[label=\Alph*)] \item You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. diff --git a/sondovac_functions b/sondovac_functions index a823a46..e0ac836 100644 --- a/sondovac_functions +++ b/sondovac_functions @@ -2,7 +2,7 @@ # Version of the script SCRIPTVERSION=1.3 -RELEASEDATE=2017-MM-DD +RELEASEDATE=2017-12-18 # Web page of the script WEB="https://github.com/V-Z/sondovac/"