Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

Biopiece: analyze_assembly

Description

analyze_assembly can be used to analyze a sequence assembly given as contig sequences in the stream and output some basic stats:

N50: 112258      # N50 measure
MAX: 316140      # Max contig length
MIN: 111         # Min contig length
MEAN: 19278      # Mean contig length
TOTAL: 2872554   # Total contig length
COUNT: 149       # Number of contigs
---

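N50 is defined as the contig length such that contigs of that length or longer contain at least half of all bases in the assembly. It is computed by sorting all contigs from largest to smallest and summing their lengths until the running total reaches 50% of the total contig length. The length of the last contig added is the N50. For example, with contigs of 10, 20, 30, 40 and 50 bases (150 in total), the two longest contigs sum to 90, which exceeds 75, so the N50 is 40.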

Read more here:

http://en.wikipedia.org/wiki/N50_statistic
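
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by analyze_assembly: the function names `n50` and `assembly_stats` are made up for this example, the tie-breaking rule (stop as soon as the running total reaches half of the total) is an assumption, and the integer mean is inferred from the sample output above, where 2872554 / 149 is reported as 19278.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk the contigs from longest to shortest and stop at the first
    # contig that brings the running total to at least half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def assembly_stats(lengths):
    """Collect the six stats shown above into a dict (hypothetical helper)."""
    return {
        "N50": n50(lengths),
        "MAX": max(lengths),
        "MIN": min(lengths),
        # Assumption: the mean is truncated to an integer.
        "MEAN": sum(lengths) // len(lengths),
        "TOTAL": sum(lengths),
        "COUNT": len(lengths),
    }


if __name__ == "__main__":
    for key, value in assembly_stats([10, 20, 30, 40, 50]).items():
        print(f"{key}: {value}")
```

Doubling the running total instead of halving the total keeps the comparison in integers, which avoids rounding surprises when the total length is odd.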

As an experimental feature, predicted gene coverage can be included in the analysis with the -g switch. All full-length genes are predicted with Prodigal, and the sum of the predicted gene lengths from both strands is reported under the GENE_COV key. Gene coverage can be used to compare different assemblies of the same data. Use -p meta for metagenomes.

Read more here:

http://prodigal.ornl.gov/
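
As a rough illustration of the summation described above, the following Python sketch adds up the lengths of complete genes from both strands. Everything in it is an assumption made for this example: the `Gene` record, its field names and the `gene_cov` function are hypothetical, coordinates are taken to be 1-based and inclusive, and it does not show how analyze_assembly runs Prodigal or parses its output.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical record for one predicted gene."""
    begin: int      # 1-based, inclusive start coordinate (assumed)
    end: int        # 1-based, inclusive end coordinate (assumed)
    strand: str     # "+" or "-"
    complete: bool  # True when the prediction is a full-length gene


def gene_cov(genes):
    """Sum the lengths of all full-length genes, regardless of strand."""
    return sum(g.end - g.begin + 1 for g in genes if g.complete)


if __name__ == "__main__":
    genes = [
        Gene(begin=1, end=90, strand="+", complete=True),
        Gene(begin=50, end=109, strand="-", complete=True),
        Gene(begin=200, end=260, strand="+", complete=False),  # partial, skipped
    ]
    print("GENE_COV:", gene_cov(genes))
```

Note that a plain sum counts bases twice where genes on opposite strands overlap, which is consistent with the description of a sum over both strands.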

Usage

... | analyze_assembly [options]

Options

[-?          | --help]                #  Print full usage description.
[-g          | --gene_cov]            #  Calculate predicted gene coverage.
[-p <string> | --procedure=<string>]  #  Procedure: single|meta       -  Default=single
[-x          | --no_stream]           #  Do not emit records.
[-o <file>   | --data_out=<file>]     #  Write result to file.
[-I <file!>  | --stream_in=<file!>]   #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]   #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]             #  Verbose output.

Examples

Consider the following FASTA entries in the file test.fna:

>test1
ATGCACATTG
>test2
ATGCACATTGATGCACATTG
>test3
ATGCACATTGATGCACATTGATGCACATTG
>test4
ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
>test5
ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG

To find the N50, read in the sequences with read_fasta:

read_fasta -i test.fna | analyze_assembly

SEQ_NAME: test1
SEQ: ATGCACATTG
SEQ_LEN: 10
---
SEQ_NAME: test2
SEQ: ATGCACATTGATGCACATTG
SEQ_LEN: 20
---
SEQ_NAME: test3
SEQ: ATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 30
---
SEQ_NAME: test4
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 40
---
SEQ_NAME: test5
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 50
---
N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---

To obtain only the stats use the -x switch:

read_fasta -i test.fna | analyze_assembly -x

N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---

To write the stats to a file, use the -o switch:

read_fasta -i test.fna | analyze_assembly -o stats.txt -x
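
If the stats are needed in a script, a file such as stats.txt could be read back with a small parser. The sketch below assumes the file uses the same layout as the stream output shown above, with one `KEY: value` pair per line and `---` ending each record. The function name `parse_records` is hypothetical and not part of Biopieces.

```python
def parse_records(text):
    """Split 'KEY: value' lines into a list of dicts, one per record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "---":
            # A '---' line closes the current record.
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(": ")
        current[key] = value
    if current:
        # Keep a trailing record that was not closed with '---'.
        records.append(current)
    return records


if __name__ == "__main__":
    sample = "N50: 40\nMAX: 50\nMIN: 10\nMEAN: 30\nTOTAL: 150\nCOUNT: 5\n---\n"
    stats = {key: int(value) for key, value in parse_records(sample)[0].items()}
    print(stats["N50"], stats["TOTAL"])
```

Values are kept as strings by the parser because sequence records in the stream carry text fields such as SEQ_NAME. Conversion to integers is left to the caller, as in the usage lines at the bottom.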

See also

mean_vals

median_vals

analyze_vals

calc_N50

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

January 2011

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

analyze_assembly is part of the Biopieces framework.

http://www.biopieces.org
