@@ -10,40 +10,41 @@ and their corresponding quality scores
# DESCRIPTION

- (WIP)
-
- In fastq files, each entry is made of _sequence header_ starting with
- a symbol '@', a nucleotidic _sequence_ (same rules as for fasta
- sequences), a _quality header_ starting with a symbol '+', and a
- _quality string_ of ASCII characters (offset 33 or 64), each one
- encoding the quality value of the corresponding position in the
- nucleotidic sequence.
-
- #(./fragments/sequences.md)
-
- In fastq files, each entry is made of a _sequence header_, a
- _sequence_, a _quality header_, ... The header is defined as the
- string comprised between the initial '>' symbol and the first space,
- tabulation, or new line symbol, unless the `--notrunclabels` option is
- in effect, in which case the entire line is included.
-
- The _header_ should contain printable ascii characters (33-126). The
- program will terminate with a fatal error if there are unprintable
- ascii characters (see `ascii(7)`). A warning will be issued if
- non-ascii characters (128-255) are encountered.
-
- If the header matches the pattern '>[;]size=integer;label', the
- pattern '>label;size=integer;label', or the pattern
- '>label;size=integer[;]', vsearch will interpret integer as the number
- of occurrences (or abundance) of the sequence in the study. That
- abundance information is used or created during chimera detection,
- clustering, dereplication, sorting and searching.
-
-
-
- # EXAMPLES
-
- (give examples of valid and invalid fastq files)
+ In fastq files, each entry is made of a *sequence header* starting with
+ a symbol '@', a nucleotidic *sequence*, a *quality header* starting
+ with a symbol '+', and a *quality string* of ASCII characters (offset
+ 33 or 64), each one encoding the quality value of the corresponding
+ position in the nucleotidic sequence.
+
+ The *sequence header* is defined as the string comprised between the
+ initial '@' symbol and the first space, tabulation, or new line
+ symbol, unless the `--notrunclabels` option is in effect, in which
+ case the entire line is included.
+
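The header rule above can be sketched in a few lines of Python. This is only an illustration, not vsearch's implementation: `fastq_label` and its `notrunclabels` parameter are hypothetical names, and the sketch assumes that the leading '@' is never part of the returned label, even when the whole line is kept.

```python
def fastq_label(header_line: str, notrunclabels: bool = False) -> str:
    """Return the label of a fastq sequence header line (hypothetical helper)."""
    if not header_line.startswith("@"):
        raise ValueError("a fastq sequence header must start with '@'")
    # Drop the initial '@' and the end-of-line characters.
    body = header_line[1:].rstrip("\r\n")
    if notrunclabels:
        # Assumption: the entire line is kept, minus the leading '@'.
        return body
    # Otherwise stop at the first space or tabulation.
    for index, char in enumerate(body):
        if char in " \t":
            return body[:index]
    return body
```

For instance, `fastq_label("@seq1 extra info\n")` gives `seq1`, while passing `notrunclabels=True` keeps `seq1 extra info`.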
+ The sequence header should contain printable ASCII characters
+ (33-126). The program will terminate with a fatal error if there are
+ unprintable ASCII characters (see `ascii(7)`). A warning will be
+ issued if non-ASCII characters (128-255) are encountered.
+
+ If the sequence header contains patterns such as `[@;]size=integer[;]`
+ or `[@;]ee=float[;]`, vsearch can interpret these annotations and use
+ them for chimera detection, clustering, dereplication, filtering and
+ sorting.
+
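To make the annotation patterns concrete, here is a minimal Python sketch that extracts `size=` and `ee=` values from a label with the leading '@' already removed. The regular expressions and the `parse_annotations` name are assumptions made for illustration; vsearch's own parser may accept or reject edge cases differently (this sketch requires each annotation to be followed by ';' or by the end of the label).

```python
import re

# An annotation starts at the beginning of the label or after ';',
# and ends with ';' or at the end of the label (assumed reading of the pattern).
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")
EE_RE = re.compile(r"(?:^|;)ee=(\d+(?:\.\d+)?)(?:;|$)")


def parse_annotations(label: str):
    """Return (size, ee) found in a label, using None for a missing annotation."""
    size_match = SIZE_RE.search(label)
    ee_match = EE_RE.search(label)
    size = int(size_match.group(1)) if size_match else None
    ee = float(ee_match.group(1)) if ee_match else None
    return size, ee
```

With this sketch, `parse_annotations("seq1;size=12;ee=0.5")` yields a size of 12 and an expected error of 0.5, and a label without annotations yields `None` for both.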
+ #(./fragments/format_sequence.md)
+
+ The *quality header* is defined as the string comprised between the
+ initial '+' symbol and the first space, tabulation, or new line
+ symbol.
+
+ The *quality string* is a string of ASCII characters, starting after
+ the end of the quality header line and ending before the next header
+ line, or the file's end. The range of valid ASCII characters can
+ extend from '!' to '~' when the offset is 33, and from '@' to '~' when
+ the offset is 64. vsearch silently ignores ASCII characters 9 to 13,
+ and exits with an error message if ASCII characters 0 to 8, 14 to 31,
+ '.' or '-' are present. All other ASCII or non-ASCII characters are
+ stripped and complained about in a warning message.
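The offset arithmetic behind the quality string can be illustrated as follows. This is a simplified sketch: the `decode_quality` helper is hypothetical, and it is deliberately stricter than the behaviour described above, since it rejects any character outside the valid range instead of ignoring or stripping some of them.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a fastq quality string into a list of quality values."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    lowest = chr(offset)  # '!' for offset 33, '@' for offset 64
    scores = []
    for char in quality:
        # Valid symbols run from the offset character up to '~' (126).
        if not (lowest <= char <= "~"):
            raise ValueError(f"invalid quality symbol: {char!r}")
        scores.append(ord(char) - offset)
    return scores
```

For example, with offset 33 the string `II5!` decodes to the values 40, 40, 20 and 0, and with offset 64 the string `h@` decodes to 40 and 0.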
# SEE ALSO