guessprotocol: updating the docs for the new model and output format

smithlabcode · Oct 6, 2023 · d9c346d · d9c346d
1 parent 8681af7
commit d9c346d
Showing 1 changed file with 79 additions and 27 deletions.
diff --git a/docs/content/guessprotocol.md b/docs/content/guessprotocol.md
@@ -9,46 +9,98 @@ $ dnmtools guessprotocol [OPTIONS] <file-1.fastq> [<file-2.fastq>]
 
 Mapping a WGBS dataset requires knowledge of the sequencing protocol
 generated to process the data. This may not be properly documented
-where the data was obtained, so we created a tool that guesses it
-based on the nucleotide content on one or two FASTQ files.
-
-The `guessprotocol` tool counts the number of As, Cs, Gs and Ts in
-each end of the dataset and reports the protocol that is closest to
-the nucleotide frequency expectations. outputs a single line
-determining if the dataset is
- * (1) T-rich, where, in end 1, half of the bases are Ts and there are
-very few Cs. In end 2, half of the bases are As and there are very few
-Gs
- * (2) A-rich, where, in end 1, half of the bases are As and there
-    are very few Gs and, in end 2, half of the bases are Ts and there
-are very few Cs
- * (3) Random PBAT, where complementary reads have complementary
-    bisulfite bases (e.g. if end 1 is T-rich, end 2 is A-rich), the
-    bisulfite base in each end is random.
- * (4) Unknwon, if, based on the nucleotide frequencies, the protocol
-   cannot be determined (or the read is not WGBS).
+where the data was obtained, so we created this command to guess the
+protocol based on the nucleotide content in the input FASTQ file (or
+files, for paired-end).
+
+The `guessprotocol` tool uses two models of nucleotide content
+following bisulfite conversion and applies this model to each
+read. One model is for WGBS, and the other is for PBAT. For each read,
+both models are applied, and the result is a probability for whether
+the read (or read pair) was generated using WGBS or PBAT. Once the
+requested number of reads is processed, the aggregate results for all
+reads are used to guess whether the protocol used to generate the data
+was WGBS, PBAT or rPBAT. The criteria are roughly as follows: if most
+of the reads look like they are from WGBS, then we conclude WGBS.  If
+most of the reads look like they are from PBAT, then we conclude
+PBAT. If the result is more towards the middle, then we conclude
+rPBAT.
+
+More details: the number of As, Cs, Gs and Ts differs depending on
+WGBS (traditional WGBS or MethylC-seq), PBAT -- post bisulfite adaptor
+tagging, or rPBAT (random PBAT).
+
+* For WGBS, a single-end sequenced read should be T-rich, and if the
+  data is paired-end, read1 is T-rich and read2 is A-rich.
+* For PBAT, a single-end sequenced read should be A-rich, and if the
+  data is paired-end, read1 is A-rich and read2 is T-rich.
+* For rPBAT, we have a random mix of the above situations. However, in
+  practice it seems almost never to be 50% each.
+
+In most cases, when the data is WGBS or PBAT, it is very obvious which
+is the protocol used.
+
+As of dnmtools v1.4.1, `guessprotocol` will always make a conclusion,
+but includes a confidence level.
 
 The output of `guessprotocol` is useful prior to mapping. For example,
 it can be used to decide whether or not to map with the `-R` flag (for
 "random PBAT") when using
 [abismal](https://github.com/smithlabcode/abismal).
 
-For paired-end data, `guessprotocol` finds read mates by finding
-identical read names. Some datasets finish the read name with
-identifiers like .1 on end 1 and .2 on end 2, thus making the read
+For paired-end data, `guessprotocol` finds ensures reads are mates by
+finding identical read names. Some datasets finish the read name with
+identifiers like ".1" on end 1 and ".2" on end 2, thus making the read
 names technically different at the last two characters. You can tell
 the program to ignore a certain suffix size (like size 2 in this
 example) when matching read names using the `-i` flag.
 
+The output includes the following values in a YAML format:
+* `protocol`: this is the guessed protocol (wgbs, pbat or rpbat) based
+  on the content of the reads.
+* `confidence`: indicates the level of confidence in the guess for the
+  protocol (values: low or high).
+* `layout`: indicates whether the supplied reads were paired or
+  single-ended.
+* `n_reads_wgbs`: the average number of reads (for single-ended reads)
+  or read pairs (for paired reads) where read1 is determined by the
+  model to be T-rich.
+* `n_reads`: the number of evaluated reads or read pairs.
+* `wgbs_fraction`: the probability that a read (for single-ended
+  reads) or the read1 of a read pair (for paired reads) is T-rich.
+
 ## Options
-```txt
- -n -nreads
 ```
-number of reads in initial check. The program stops after collecting
-statistics for the first `n` reads (default: 1,000,000)
+-n, -nreads
+```
+Number of reads to check. The program stops after collecting
+statistics for the first `n` reads (default: 1,000,000). Fewer than
+the default are usually sufficient, but increase this value if you
+suspect reads at the start of the file might be problematic.
 
 ```txt
  -i -ignore
 ```
-length of the read name suffix to ignore when matching
-## Options
+Length of the read name suffix to ignore when matching read names to
+ensure mates are correctly synchronized when the data is paired-end.
+
+```
+-b, -bisulfite
+```
+Assumed bisulfite conversion rate for the models (default: 0.98).
+
+```
+-H, -human
+```
+Use human genome nucleotide frequencies. A good assumption for samples
+from a mammal.
+
+```
+-o, -output
+```
+The output file name.
+
+```
+-v, -verbose
+```
+Report available information during the run.