updates docs

metagenome-atlas · Jun 30, 2017 · 5973900
1 parent 820d14b
Showing 13 changed files with 288 additions and 127 deletions.
diff --git a/docs/annotation/output.rst b/docs/annotation/output.rst
diff --git a/docs/annotation/samples.rst b/docs/annotation/samples.rst
@@ -0,0 +1,145 @@
+Defining Samples
+================
+
+Annotation
+----------
+
+Samples are defined under "samples" and have a unique name that does not
+contain spaces or underscores (dashes are accepted). To only annotate, a
+sample needs ``fasta`` defined::
+
+    samples:
+        sample-1:
+            fasta: /project/bins/sample-1_contigs.fasta
+        sample-2:
+            fasta: /project/bins/sample-2_contigs.fasta
+
+All other annotation parameters can be defined following samples. See
+:ref:`annotation`.
+
+
+Annotation and Quantification
+-----------------------------
+
+To get counts, we need to define paths to the FASTQ files whose reads will be
+mapped back to the FASTA sequences.
+
+In addition to specifying FASTQ paths for each sample, the configuration will
+also need to contain::
+
+    quantification: true
+
+
+Interleaved
+```````````
+
+Reads are assumed to be paired-end by default, so for an interleaved file we
+only need to specify ``fastq``::
+
+    samples:
+        sample-1:
+            fasta: /project/bins/sample-1_contigs.fasta
+            fastq: /project/data/sample-1_pe.fastq
+        sample-2:
+            fasta: /project/bins/sample-2_contigs.fasta
+            fastq: /project/data/sample-2_pe.fastq
+
+Paired-end
+``````````
+
+In this case, we use a YAML_ list to specify the R1 and R2 file paths::
+
+    samples:
+        sample-1:
+            fasta: /project/bins/sample-1_contigs.fasta
+            fastq:
+                - /project/data/sample-1_R1.fastq
+                - /project/data/sample-1_R2.fastq
+        sample-2:
+            fasta: /project/bins/sample-2_contigs.fasta
+            fastq:
+                - /project/data/sample-2_R1.fastq
+                - /project/data/sample-2_R2.fastq
+
+
+Single-end
+``````````
+
+As data are assumed to be paired-end, we need to add ``paired: false``::
+
+    samples:
+        sample-1:
+            fasta: /project/bins/sample-1_contigs.fasta
+            fastq: /project/data/sample-1_se.fastq
+            paired: false
+        sample-2:
+            fasta: /project/bins/sample-2_contigs.fasta
+            fastq: /project/data/sample-2_se.fastq
+            paired: false
+
+
+Example
+-------
+
+A complete example of annotation and quantification, with one sample using an
+interleaved FASTQ and the other using separate R1 and R2 FASTQs::
+
+
+    samples:
+        sample-1:
+            fasta: /project/bins/sample-1_contigs.fasta
+            fastq: /project/data/sample-1_pe.fastq
+        sample-2:
+            fasta: /project/bins/sample-2_contigs.fasta
+            fastq:
+                - /project/data/sample-2_R1.fastq
+                - /project/data/sample-2_R2.fastq
+
+    quantification: true
+
+    tmpdir: /scratch
+    threads: 24
+    refseq_namemap: /pic/projects/mint/atlas_databases/refseq.db
+    refseq_tree: /pic/projects/mint/atlas_databases/refseq.tree
+    diamond_db: /pic/projects/mint/atlas_databases/refseq.dmnd
+    # 'fast' or 'sensitive'
+    diamond_run_mode: fast
+    # setting top_seqs to 5 will report all alignments whose score is
+    # at most 5% lower than the top alignment score for a query
+    diamond_top_seqs: 2
+    # maximum e-value to report alignments
+    diamond_e_value: "0.000001"
+    # minimum identity % to report an alignment
+    diamond_min_identity: 50
+    # minimum query cover % to report an alignment
+    diamond_query_coverage: 60
+    # gap open penalty
+    diamond_gap_open: 11
+    # gap extension penalty
+    diamond_gap_extend: 1
+    # Block size in billions of sequence letters to be processed at a time.
+    # This is the main parameter for controlling DIAMOND's memory usage.
+    # Bigger numbers will increase the use of memory and temporary disk space,
+    # but also improve performance. The program can be expected to use roughly
+    # six times this number in memory (GB).
+    diamond_block_size: 6
+    # The number of chunks for processing the seed index (default=4). This
+    # option can be additionally used to tune the performance. It is
+    # recommended to set this to 1 on a high memory server, which will
+    # increase performance and memory usage, but not the usage of temporary
+    # disk space.
+    diamond_index_chunks: 1
+    # 'lca', 'majority', or 'best'; summary method for annotating ORFs; when
+    # using LCA, it's recommended that one limits the number of hits using a
+    # low top_fraction
+    summary_method: lca
+    # 'lca', 'lca-majority', or 'majority'; summary method for aggregating ORF
+    # taxonomic assignments to contig level assignment; 'lca' will result in
+    # most stringent, least specific assignments
+    aggregation_method: lca-majority
+    # constitutes a majority fraction at tree node for 'lca-majority' ORF
+    # aggregation method
+    majority_threshold: 0.51
+
+
+.. _YAML: http://www.yaml.org/
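
The sample rules described in ``samples.rst`` above (unique names without spaces or underscores, ``fasta`` required for annotation, ``fastq`` given as one path or an R1/R2 list, ``paired`` defaulting to true) can be checked before a run. The following is a minimal sketch, not part of ``atlas``; the function name ``validate_samples`` and its error messages are hypothetical, and it takes an already-parsed configuration as a plain dictionary.

```python
def validate_samples(config):
    """Return a list of problems found in a parsed annotation config.

    Hypothetical helper for illustration; atlas itself may validate
    its configuration differently.
    """
    errors = []
    samples = config.get("samples") or {}
    if not samples:
        errors.append("no samples defined")
    quantify = bool(config.get("quantification", False))

    for name, sample in samples.items():
        # Names must not contain spaces or underscores (dashes are accepted).
        if " " in name or "_" in name:
            errors.append(f"{name}: name contains a space or underscore")
        # Annotation needs a FASTA for every sample.
        if not sample.get("fasta"):
            errors.append(f"{name}: missing fasta")

        # fastq is either a single path or a list of R1 and R2 paths.
        fastq = sample.get("fastq")
        paths = [fastq] if isinstance(fastq, str) else list(fastq or [])
        if quantify and not paths:
            errors.append(f"{name}: quantification requested but no fastq")
        if len(paths) > 2:
            errors.append(f"{name}: more than two fastq paths")
        # Data are assumed to be paired-end unless paired is false.
        if len(paths) == 2 and not sample.get("paired", True):
            errors.append(f"{name}: two fastq paths given but paired is false")
    return errors


if __name__ == "__main__":
    config = {
        "samples": {
            "sample-1": {
                "fasta": "/project/bins/sample-1_contigs.fasta",
                "fastq": "/project/data/sample-1_pe.fastq",
            },
        },
        "quantification": True,
    }
    print(validate_samples(config))
```

An empty list means the sample definitions satisfy the rules stated in the documentation; the check does not verify that the files exist.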
diff --git a/docs/assembly/annotation.rst b/docs/assembly/annotation.rst
@@ -1,3 +1,5 @@
+.. _annotation:
+
 Annotation
 ==========
 
@@ -60,12 +62,12 @@ options::
 Functional Annotation of ORFs
 -----------------------------
 
-Functional annotation is performed using Prokka. Contigs will be renamed to
+Functional annotation is performed using Prokka_. Contigs will be renamed to
 sample name + a digit, incrementally, such that contig 1 for sample 'example-id'
-is 'example-id_1'. ORFs among a sample are named by Prokka similarly though
-they are padded by zeroes (example-id_00001). Contig IDs and ORFs IDs are
-mapped back to one another using the final output table where each row
-represents an ORF and its assignments.
+is 'example-id_1'. Open reading frames (ORFs) within a sample are named by
+Prokka similarly, though they are zero-padded (example-id_00001). Contig IDs
+and ORF IDs are mapped back to one another using the final output table, where
+each row represents an ORF and its assignments.
 
 
 Taxonomy Annotation of ORFs and Contigs
@@ -81,13 +83,15 @@ files::
     refseq_tree: /database_dir/refseq.tree
     diamond_db: /database_dir/refseq.dmnd
 
+These files are tracked and downloaded from Zenodo_ along with other
+reference data.
+
 
-Local Alignment Options for ``blastp`` Search
----------------------------------------------
+Local Alignment Options
+-----------------------
 
-Within each reference database, the user has the flexibility to optimize
-performance across their compute environment and control the number of
-alignment hits in various ways.
+The user has the flexibility to optimize performance across their compute
+environment and control the number of alignment hits in various ways.
 
 
 Run Mode
@@ -133,8 +137,6 @@ Query Coverage
 Require this much of the query sequence to be matched above
 ``diamond_min_identity``::
 
-::
-
     diamond_query_coverage: 60
 
 
@@ -224,6 +226,8 @@ assignment.
 
     aggregation_method: lca-majority
 
+For more information on the lca-majority method, please see the `LCA* paper`_.
+
 
 Majority Threshold
 ``````````````````
@@ -234,3 +238,8 @@ aggregation method.
 ::
 
     majority_threshold: 0.51
+
+
+.. _Prokka: https://github.com/tseemann/prokka
+.. _Zenodo: https://zenodo.org/record/804435
+.. _LCA* paper: https://doi.org/10.1093/bioinformatics/btw400
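
To make the role of ``majority_threshold`` in ``annotation.rst`` concrete, here is a simplified sketch of a majority vote taken at each tree node. It is an illustration only and is not the ``atlas`` or LCA* implementation; the function name ``lca_majority`` and the lineage representation (a root-to-leaf list of taxon names per ORF) are assumptions made for this example.

```python
from collections import Counter


def lca_majority(lineages, majority_threshold=0.51):
    """Return the deepest lineage prefix backed by a majority of ORFs.

    Simplified illustration: at each depth, the most common taxon among
    ORFs that still agree with the consensus is kept if it is supported
    by at least `majority_threshold` of all ORFs on the contig.
    """
    candidates = [tuple(lineage) for lineage in lineages]
    total = len(candidates)
    consensus = []
    depth = 0
    while total:
        # Count taxa at this depth among ORFs that follow the consensus so far.
        counts = Counter(l[depth] for l in candidates if len(l) > depth)
        if not counts:
            break
        taxon, supporting = counts.most_common(1)[0]
        if supporting / total < majority_threshold:
            break
        consensus.append(taxon)
        candidates = [
            l for l in candidates if len(l) > depth and l[depth] == taxon
        ]
        depth += 1
    return consensus


if __name__ == "__main__":
    orf_lineages = [
        ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
        ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
        ["Bacteria", "Firmicutes", "Bacilli"],
    ]
    print(lca_majority(orf_lineages, majority_threshold=0.51))
```

In this sketch, setting the threshold to 1.0 reduces the vote to a strict lowest common ancestor, which matches the documentation's description of 'lca' as the most stringent, least specific choice.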
diff --git a/docs/assembly/output.rst b/docs/assembly/output.rst
diff --git a/docs/assembly/preprocessing.rst b/docs/assembly/preprocessing.rst
@@ -115,7 +115,7 @@ provided and filtered from the reads using the following parameters::
 
 
 Contaminant References
-``````````````````````
+----------------------
 
 As shown in the above example, if provided, reads will be removed from the
 FASTQ prior to assembly if they align to these references. If 'rRNA' is
@@ -130,10 +130,10 @@ Additional references can be added arbitrarily, such that::
 
 
 Normalization Parameters
-````````````````````````
+------------------------
 
-To improve assemblies, coverage is normalized across kmers to a target depth
-and can be set using::
+To improve assembly time and often assemblies themselves, coverage is
+normalized across kmers to a target depth and can be set using::
 
     # kmer length over which we calculated coverage
     normalization_kmer_length: 21

diff --git a/docs/assembly/samples.rst b/docs/assembly/samples.rst
@@ -5,36 +5,56 @@ Samples are defined with a name, file path(s) and the type of data. A single
 file path is interpreted as interleaved paired-end reads or single-end, while
 two paths must include full paths to R1 and R2.
 
-Sample names must be unique and not contain spaces or underscores (dashes are accepted).
+Sample names must be unique and not contain spaces or underscores (dashes are
+accepted).
 
-For 'type', the value can be either 'metagenome' or 'metatranscriptome'. The
-default if not specified is 'metagenome'.
+For ``type``, the value can be either 'metagenome' or 'metatranscriptome'. If
+neither is specified, the default is 'metagenome'.
 
 
-Interleaved PE input::
+Interleaved Input
+-----------------
+
+A single file path is specified for ``fastq``. Here ``type`` and ``paired`` are
+also set, but both are optional because the values shown equal the defaults.
+
+::
 
     samples:
         sample-1:
             fastq: /data/sample-1_pe.fastq.gz
             type: metagenome
+            paired: true
 
 
-Paired-end as separate files::
+Paired-end Input
+----------------
+
+In this case, we use a YAML_ list to specify the R1 and R2 file paths::
 
     samples:
         sample-1:
             fastq:
                 - /data/sample-1_R1.fastq.gz
                 - /data/sample-1_R2.fastq.gz
             type: metagenome
+            paired: true
 
 The '-' is required if multiple fastq file paths need to be specified.
 
+
+Single-end Input
+----------------
+
 Data is assumed to be paired-end unless stated otherwise. If your data is
-single-end sequence data, specify 'paired' as ``false``::
+single-end sequence data, specify ``paired`` as ``false``::
 
     samples:
         sample-1:
             fastq: /data/sample-1_pe.fastq.gz
             type: metagenome
             paired: false
+
+
+.. _YAML: http://www.yaml.org/
diff --git a/docs/assembly/threads.rst b/docs/assembly/threads.rst
@@ -1,5 +1,5 @@
-Threads
-=======
+Jobs and Threads
+================
 
 Most steps of the workflow are utilizing applications that can thread or
 otherwise use multiple cores. Leaving this one below the max, in cases where
@@ -18,7 +18,7 @@ be processed more efficiently.
 
 When starting your ``atlas`` command, e.g. ``atlas assemble --jobs 48 config.yaml``,
 be sure to set the total thread pool to capture all available possible jobs to
-be executed simultaneously. If we are utilizing 3 nodes, each with 24 cores,
-we would set ``threads: 24`` and execute ``atlas`` with::
+be executed simultaneously. For example, if we are utilizing 3 nodes, each
+with 24 cores, we would set ``threads: 24`` and execute ``atlas`` with::
 
     atlas assemble --jobs 72 config.yaml
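
The arithmetic behind the ``--jobs`` value in ``threads.rst`` is the number of nodes multiplied by the cores per node. A small sketch, assuming every node has the same core count; the helper name ``total_jobs`` is hypothetical and not part of ``atlas``.

```python
def total_jobs(nodes, cores_per_node):
    """Size of the thread pool to pass to --jobs (hypothetical helper)."""
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive")
    return nodes * cores_per_node


if __name__ == "__main__":
    # 3 nodes with 24 cores each, with 'threads: 24' set in config.yaml.
    jobs = total_jobs(nodes=3, cores_per_node=24)
    print(f"atlas assemble --jobs {jobs} config.yaml")
```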
diff --git a/docs/atlas.rst b/docs/atlas.rst