analyze_assembly

Biopiece: analyze_assembly

Description

analyze_assembly analyzes a sequence assembly, given as contig sequences in the stream, and outputs some basic stats:

N50: 112258      # N50 measure
MAX: 316140      # Max contig length
MIN: 111         # Min contig length
MEAN: 19278      # Mean contig length
TOTAL: 2872554   # Total contig length
COUNT: 149       # Number of contigs
---

N50 is defined as the contig length such that contigs of that length or longer together contain at least half of the bases in the assembly. It is computed by sorting all contigs from largest to smallest and finding the minimum set of the largest contigs whose lengths total at least 50% of the total contig length; the N50 is the length of the shortest contig in that set.
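The calculation described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the code analyze_assembly itself runs: the function name assembly_stats is hypothetical, the "at least half" threshold is an assumption about how ties are handled, and the mean uses integer division because the sample output above reports 2872554 / 149 as 19278.

```python
def assembly_stats(lengths):
    """Return basic assembly stats for a list of contig lengths.

    Hypothetical helper illustrating the N50 definition; it is not
    part of Biopieces.
    """
    if not lengths:
        raise ValueError("no contigs given")

    # Sort contigs from largest to smallest.
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # Walk down the sorted list until the running sum reaches half of
    # the total length; the contig that gets us there is the N50.
    # Assumption: "reaches" means >= 50%, not strictly greater.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "N50": n50,
        "MAX": ordered[0],
        "MIN": ordered[-1],
        "MEAN": total // len(ordered),  # assumed integer mean
        "TOTAL": total,
        "COUNT": len(ordered),
    }


if __name__ == "__main__":
    # Five contigs of 10, 20, 30, 40 and 50 bases: total is 150,
    # half is 75, and 50 + 40 = 90 is the first running sum >= 75.
    for key, value in assembly_stats([10, 20, 30, 40, 50]).items():
        print(f"{key}: {value}")
```

For contigs of 10, 20, 30, 40 and 50 bases the two largest contigs are enough to cover half of the 150 bases, so the sketch reports an N50 of 40.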

Usage

... | analyze_assembly [options]

Options

[-?          | --help]                #  Print full usage description.
[-g          | --gene_cov]            #  Calculate predicted gene coverage.
[-p <string> | --procedure=<string>]  #  Procedure: single|meta       -  Default=single
[-x          | --no_stream]           #  Do not emit records.
[-o <file>   | --data_out=<file>]     #  Write result to file.
[-I <file!>  | --stream_in=<file!>]   #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]   #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]             #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test1
ATGCACATTG
>test2
ATGCACATTGATGCACATTG
>test3
ATGCACATTGATGCACATTGATGCACATTG
>test4
ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
>test5
ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG

To find the N50, read in the sequences with read_fasta and pipe them to analyze_assembly:

read_fasta -i test.fna | analyze_assembly

SEQ_NAME: test1
SEQ: ATGCACATTG
SEQ_LEN: 10
---
SEQ_NAME: test2
SEQ: ATGCACATTGATGCACATTG
SEQ_LEN: 20
---
SEQ_NAME: test3
SEQ: ATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 30
---
SEQ_NAME: test4
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 40
---
SEQ_NAME: test5
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 50
---
N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---

To obtain only the stats, use the -x switch:

read_fasta -i test.fna | analyze_assembly -x

N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---

To write the result to a file, use the -o switch:

read_fasta -i test.fna | analyze_assembly -o stats.txt -x

Author

[email protected]

January 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

analyze_assembly is part of the Biopieces framework.

http://www.biopieces.org
