Commit

updates docs
Joe Brown committed Jun 30, 2017
1 parent 820d14b commit 5973900
Showing 13 changed files with 288 additions and 127 deletions.
Empty file added docs/annotation/output.rst
Empty file.
145 changes: 145 additions & 0 deletions docs/annotation/samples.rst
@@ -0,0 +1,145 @@
Defining Samples
================

Annotation
----------

Samples are defined under "samples" and have a unique name that does not
contain spaces or underscores (dashes are accepted). To only annotate, a
sample needs ``fasta`` defined::

    samples:
        sample-1:
            fasta: /project/bins/sample-1_contigs.fasta
        sample-2:
            fasta: /project/bins/sample-2_contigs.fasta

All other annotation parameters can be defined following samples. See
:ref:`annotation`.
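
The naming rule above (unique, no spaces, no underscores) can be checked before a run. The following is a minimal sketch; ``invalid_sample_names`` is a hypothetical helper written for illustration and is not part of ``atlas``.

```python
def invalid_sample_names(names):
    """Return names that contain a space or underscore, or that repeat.

    Hypothetical helper for illustration only; not an atlas function.
    """
    seen = set()
    bad = []
    for name in names:
        # A name is rejected if it breaks the character rule or was seen before.
        if " " in name or "_" in name or name in seen:
            bad.append(name)
        seen.add(name)
    return bad


if __name__ == "__main__":
    print(invalid_sample_names(["sample-1", "sample_2", "sample 3", "sample-1"]))
```

Only the second occurrence of a repeated name is reported, so the first ``sample-1`` passes and the duplicate is flagged.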


Annotation and Quantification
-----------------------------

To get counts, we need to define the FASTQ files whose reads will be mapped
back to the FASTA regions.

In addition to FASTQ paths for each sample, the configuration must also
contain::

    quantification: true


Interleaved
```````````

Reads are assumed to be paired-end by default, so for an interleaved file we
only need to specify ``fastq``::

    samples:
        sample-1:
            fasta: /project/bins/sample-1_contigs.fasta
            fastq: /project/data/sample-1_pe.fastq
        sample-2:
            fasta: /project/bins/sample-2_contigs.fasta
            fastq: /project/data/sample-2_pe.fastq

Paired-end
``````````

In this case, we use YAML_ list syntax to give the R1 and R2 files::

    samples:
        sample-1:
            fasta: /project/bins/sample-1_contigs.fasta
            fastq:
                - /project/data/sample-1_R1.fastq
                - /project/data/sample-1_R2.fastq
        sample-2:
            fasta: /project/bins/sample-2_contigs.fasta
            fastq:
                - /project/data/sample-2_R1.fastq
                - /project/data/sample-2_R2.fastq


Single-end
``````````

As data are assumed to be paired-end, we need to add ``paired: false``::

    samples:
        sample-1:
            fasta: /project/bins/sample-1_contigs.fasta
            fastq: /project/data/sample-1_se.fastq
            paired: false
        sample-2:
            fasta: /project/bins/sample-2_contigs.fasta
            fastq: /project/data/sample-2_se.fastq
            paired: false
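
The three read layouts described in this section can be told apart from the ``fastq`` and ``paired`` keys alone. The sketch below assumes a sample entry has already been loaded into a Python dict; ``read_layout`` is a hypothetical name, and how ``atlas`` itself treats malformed entries is not stated here, so the error case is an assumption.

```python
def read_layout(sample):
    """Classify a sample entry as 'interleaved', 'paired', or 'single'.

    Hypothetical helper for illustration only; not an atlas function.
    """
    fastq = sample["fastq"]
    # Paired-end is the documented default when 'paired' is absent.
    paired = sample.get("paired", True)
    if isinstance(fastq, str):
        # One path: interleaved paired-end unless paired is false.
        return "interleaved" if paired else "single"
    if len(fastq) == 2 and paired:
        # Two paths: separate R1 and R2 files.
        return "paired"
    raise ValueError("expected one path, or a list of R1 and R2 paths")


if __name__ == "__main__":
    print(read_layout({"fastq": "/project/data/sample-1_pe.fastq"}))
    print(read_layout({"fastq": "/project/data/sample-1_se.fastq", "paired": False}))
```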


Example
-------

A complete example for annotation and quantification for samples with
paired-end reads in separate and in interleaved FASTQs::


    samples:
        sample-1:
            fasta: /project/bins/sample-1_contigs.fasta
            fastq: /project/data/sample-1_pe.fastq
        sample-2:
            fasta: /project/bins/sample-2_contigs.fasta
            fastq:
                - /project/data/sample-2_R1.fastq
                - /project/data/sample-2_R2.fastq

    quantification: true

    tmpdir: /scratch
    threads: 24
    refseq_namemap: /pic/projects/mint/atlas_databases/refseq.db
    refseq_tree: /pic/projects/mint/atlas_databases/refseq.tree
    diamond_db: /pic/projects/mint/atlas_databases/refseq.dmnd
    # 'fast' or 'sensitive'
    diamond_run_mode: fast
    # setting top_seqs to 5 will report all alignments whose score is
    # at most 5% lower than the top alignment score for a query
    diamond_top_seqs: 2
    # maximum e-value to report alignments
    diamond_e_value: "0.000001"
    # minimum identity % to report an alignment
    diamond_min_identity: 50
    # minimum query cover % to report an alignment
    diamond_query_coverage: 60
    # gap open penalty
    diamond_gap_open: 11
    # gap extension penalty
    diamond_gap_extend: 1
    # Block size in billions of sequence letters to be processed at a time.
    # This is the main parameter for controlling DIAMOND's memory usage.
    # Bigger numbers will increase the use of memory and temporary disk space,
    # but also improve performance. The program can be expected to roughly use
    # six times this number of memory (in GB).
    diamond_block_size: 6
    # The number of chunks for processing the seed index (default=4). This
    # option can be additionally used to tune the performance. It is
    # recommended to set this to 1 on a high memory server, which will
    # increase performance and memory usage, but not the usage of temporary
    # disk space.
    diamond_index_chunks: 1
    # 'lca', 'majority', or 'best'; summary method for annotating ORFs; when
    # using LCA, it's recommended that one limits the number of hits using a
    # low top_fraction
    summary_method: lca
    # 'lca', 'lca-majority', or 'majority'; summary method for aggregating ORF
    # taxonomic assignments to contig level assignment; 'lca' will result in
    # most stringent, least specific assignments
    aggregation_method: lca-majority
    # constitutes a majority fraction at tree node for 'lca-majority' ORF
    # aggregation method
    majority_threshold: 0.51
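
The comment on ``diamond_block_size`` gives a rule of thumb: memory use is roughly six times the block size, in GB. As a rough worked example, the sketch below applies that rule; ``diamond_memory_gb`` is a hypothetical helper, and the result is an estimate, not a measured figure.

```python
def diamond_memory_gb(block_size, factor=6):
    """Estimate DIAMOND memory use in GB from the block size.

    Applies the rule of thumb quoted in the config comments (about six
    times the block size). Hypothetical helper for illustration only.
    """
    return block_size * factor


if __name__ == "__main__":
    # With diamond_block_size: 6, as in the example configuration.
    print(diamond_memory_gb(6))
```

With the example value of 6, the estimate comes to about 36 GB, which is worth comparing against the memory available on the node before starting a run.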


.. _YAML: http://www.yaml.org/
33 changes: 21 additions & 12 deletions docs/assembly/annotation.rst
@@ -1,3 +1,5 @@
.. _annotation:

Annotation
==========

@@ -60,12 +62,12 @@ options::
Functional Annotation of ORFs
-----------------------------

Functional annotation is performed using Prokka. Contigs will be renamed to
Functional annotation is performed using Prokka_. Contigs will be renamed to
sample name + a digit, incrementally, such that contig 1 for sample 'example-id'
is 'example-id_1'. ORFs among a sample are named by Prokka similarly though
they are padded by zeroes (example-id_00001). Contig IDs and ORFs IDs are
mapped back to one another using the final output table where each row
represents an ORF and its assignments.
is 'example-id_1'. Open reading frames (ORFs) within a sample are named by
Prokka similarly, though they are zero-padded (example-id_00001). Contig
IDs and ORF IDs are mapped back to one another using the final output table,
in which each row represents an ORF and its assignments.
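
The two naming patterns can be reproduced with plain string formatting. This is a sketch only: ``contig_id`` and ``orf_id`` are hypothetical helpers, and the padding width of 5 is inferred from the single example 'example-id_00001' rather than stated by the documentation.

```python
def contig_id(sample, number):
    """Contig name: sample name, underscore, running number."""
    return f"{sample}_{number}"


def orf_id(sample, number, width=5):
    """ORF name: like a contig name, but zero-padded.

    The width of 5 is an assumption taken from 'example-id_00001'.
    """
    return f"{sample}_{number:0{width}d}"


if __name__ == "__main__":
    print(contig_id("example-id", 1))
    print(orf_id("example-id", 1))
```

Because the underscore separates the sample name from the number, this scheme also shows why sample names themselves may not contain underscores.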


Taxonomy Annotation of ORFs and Contigs
@@ -81,13 +83,15 @@ files::
refseq_tree: /database_dir/refseq.tree
diamond_db: /database_dir/refseq.dmnd

These files are tracked and downloaded from Zenodo_ along with other
reference data.


Local Alignment Options for ``blastp`` Search
---------------------------------------------
Local Alignment Options
-----------------------

Within each reference database, the user has the flexibility to optimize
performance across their compute environment and control the number of
alignment hits in various ways.
The user has the flexibility to optimize performance across their compute
environment and control the number of alignment hits in various ways.


Run Mode
@@ -133,8 +137,6 @@ Query Coverage
Require this much of the query sequence to be matched above
``diamond_min_identity``::

::

diamond_query_coverage: 60


@@ -224,6 +226,8 @@ assignment.

aggregation_method: lca-majority

For more information on the lca-majority method, please see the `LCA* paper`_.


Majority Threshold
``````````````````
@@ -234,3 +238,8 @@ aggregation method.
::

majority_threshold: 0.51
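
To make the role of ``majority_threshold`` concrete, the sketch below shows one plausible reading of a majority-based descent through the taxonomy: starting at the root, keep moving to the child taxon that holds at least the threshold fraction of a contig's ORFs, and stop when no child does. This is an illustrative assumption, not the ``atlas`` implementation, and ``lca_majority`` is a hypothetical name; the LCA* paper describes the actual method.

```python
from collections import Counter


def lca_majority(lineages, threshold=0.51):
    """Return the deepest lineage supported by a majority of ORFs.

    `lineages` is a list of root-to-leaf taxon lists, one per ORF.
    Illustrative sketch only; not the atlas implementation.
    """
    total = len(lineages)
    assigned = []
    remaining = list(lineages)
    depth = 0
    while remaining:
        # Count the taxa present at this depth among the lineages still in play.
        counts = Counter(l[depth] for l in remaining if len(l) > depth)
        if not counts:
            break
        taxon, n = counts.most_common(1)[0]
        # The fraction is taken over all ORFs, not just those still in play.
        if n / total < threshold:
            break
        assigned.append(taxon)
        remaining = [l for l in remaining if len(l) > depth and l[depth] == taxon]
        depth += 1
    return assigned


if __name__ == "__main__":
    orfs = [
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
        ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
        ["Bacteria", "Firmicutes", "Bacilli"],
    ]
    print(lca_majority(orfs, threshold=0.51))
```

In this toy input, two of three ORFs agree down to Proteobacteria, but no class reaches the threshold, so the contig is assigned at the phylum level.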


.. _Prokka: https://github.com/tseemann/prokka
.. _Zenodo: https://zenodo.org/record/804435
.. _LCA* paper: https://doi.org/10.1093/bioinformatics/btw400
Empty file added docs/assembly/output.rst
Empty file.
8 changes: 4 additions & 4 deletions docs/assembly/preprocessing.rst
@@ -115,7 +115,7 @@ provided and filtered from the reads using the following parameters::


Contaminant References
``````````````````````
----------------------

As shown in the above example, if provided, reads will be removed from the
FASTQ prior to assembly if they align to these references. If 'rRNA' is
@@ -130,10 +130,10 @@ Additional references can be added arbitrarily, such that::


Normalization Parameters
````````````````````````
------------------------

To improve assemblies, coverage is normalized across kmers to a target depth
and can be set using::
To improve assembly time and often assemblies themselves, coverage is
normalized across kmers to a target depth and can be set using::

# kmer length over which we calculated coverage
normalization_kmer_length: 21
32 changes: 26 additions & 6 deletions docs/assembly/samples.rst
@@ -5,36 +5,56 @@ Samples are defined with a name, file path(s) and the type of data. A single
file path is interpreted as interleaved paired-end reads or single-end, while
two paths must include full paths to R1 and R2.

Sample names must be unique and not contain spaces or underscores (dashes are accepted).
Sample names must be unique and not contain spaces or underscores (dashes are
accepted).

For 'type', the value can be either 'metagenome' or 'metatranscriptome'. The
default if not specified is 'metagenome'.
For ``type``, the value can be either 'metagenome' or 'metatranscriptome'. If
neither is specified, the default is 'metagenome'.


Interleaved PE input::
Interleaved Input
-----------------

A single file path is specified for ``fastq``. Here ``type`` and ``paired``
are also set, but both are optional because the values shown are the
defaults.

::

    samples:
        sample-1:
            fastq: /data/sample-1_pe.fastq.gz
            type: metagenome
            paired: true


Paired-end as separate files::
Paired-end Input
----------------

In this case, we use YAML_ list syntax to give the R1 and R2 files::

    samples:
        sample-1:
            fastq:
                - /data/sample-1_R1.fastq.gz
                - /data/sample-1_R2.fastq.gz
            type: metagenome
            paired: true

Each path must be prefixed with ``-`` when multiple FASTQ file paths are
specified.


Single-end Input
----------------

Data is assumed to be paired-end unless stated otherwise. If your data is
single-end sequence data, specify 'paired' as ``false``::
single-end sequence data, specify ``paired`` as ``false``::

    samples:
        sample-1:
            fastq: /data/sample-1_se.fastq.gz
            type: metagenome
            paired: false


.. _YAML: http://www.yaml.org/
8 changes: 4 additions & 4 deletions docs/assembly/threads.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
Threads
=======
Jobs and Threads
================

Most steps of the workflow are utilizing applications that can thread or
otherwise use multiple cores. Leaving this one below the max, in cases where
@@ -18,7 +18,7 @@ be processed more efficiently.

When starting your ``atlas`` command, e.g. ``atlas assemble --jobs 48 config.yaml``,
be sure to set the total thread pool to capture all available possible jobs to
be executed simultaneously. If we are utilizing 3 nodes, each with 24 cores,
we would set ``threads: 24`` and execute ``atlas`` with::
be executed simultaneously. For example, if we are utilizing 3 nodes, each
with 24 cores, we would set ``threads: 24`` and execute ``atlas`` with::

atlas assemble --jobs 72 config.yaml
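
The ``--jobs`` value in this example is simply the number of nodes multiplied by the cores per node. A small sketch of that arithmetic follows; ``total_jobs`` is a hypothetical helper, and only the ``atlas assemble --jobs N config.yaml`` form shown above is taken from the documentation.

```python
def total_jobs(nodes, cores_per_node):
    """Size of the thread pool across all nodes (hypothetical helper)."""
    return nodes * cores_per_node


if __name__ == "__main__":
    jobs = total_jobs(nodes=3, cores_per_node=24)
    # Build the command line in the form used by the documentation.
    print(f"atlas assemble --jobs {jobs} config.yaml")
```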
70 changes: 0 additions & 70 deletions docs/atlas.rst

This file was deleted.