analyze_assembly
analyze_assembly can be used to analyze a sequence assembly given as contig sequences in the stream and output some basic stats:
N50: 112258 # N50 measure
MAX: 316140 # Max contig length
MIN: 111 # Min contig length
MEAN: 19278 # Mean contig length
TOTAL: 2872554 # Total contig length
COUNT: 149 # Number of contigs
---
N50 is defined as the contig length such that contigs of that length or longer contain at least half of the bases in the assembly. It is computed by sorting all contigs from largest to smallest and finding the smallest set of the largest contigs whose combined length reaches 50% of the total assembly length; the N50 is the length of the shortest contig in that set.
Read more here:
http://en.wikipedia.org/wiki/N50_statistic
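The definition above can be sketched in a few lines of Python. This is only an illustration of the sorting-and-accumulating procedure, not the code analyze_assembly itself runs; the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Illustrative sketch only; `n50` is a hypothetical helper and
    not part of Biopieces.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk the contigs from largest to smallest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to at least
        # half of the total length is the N50 contig.
        if running * 2 >= total:
            return length


print(n50([10, 20, 30, 40, 50]))  # 40
```

For lengths 10, 20, 30, 40 and 50 the total is 150; the two largest contigs sum to 90, which is the first running sum to reach 75, so the N50 is 40.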
As an experimental feature, it is now possible to include predicted gene coverage in the analysis. This is done by locating all full-length genes predicted with Prodigal. The sum of all predicted gene lengths from both strands is reported with the GENE_COV key.
The gene coverage can be used to evaluate different assemblies of the same data. Use -p meta for metagenomes.
Read more here:
... | analyze_assembly [options]
[-? | --help] # Print full usage description.
[-g | --gene_cov] # Calculate predicted gene coverage.
[-p <string> | --procedure=<string>] # Procedure: single|meta - Default=single
[-x | --no_stream] # Do not emit records.
[-o <file> | --data_out=<file>] # Write result to file.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entries in the file test.fna:
>test1
ATGCACATTG
>test2
ATGCACATTGATGCACATTG
>test3
ATGCACATTGATGCACATTGATGCACATTG
>test4
ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
>test5
ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
To find the N50, read in the sequences with read_fasta:
read_fasta -i test.fna | analyze_assembly
SEQ_NAME: test1
SEQ: ATGCACATTG
SEQ_LEN: 10
---
SEQ_NAME: test2
SEQ: ATGCACATTGATGCACATTG
SEQ_LEN: 20
---
SEQ_NAME: test3
SEQ: ATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 30
---
SEQ_NAME: test4
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 40
---
SEQ_NAME: test5
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 50
---
N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---
To obtain only the stats, use the -x switch:
read_fasta -i test.fna | analyze_assembly -x
N50: 40
MAX: 50
MIN: 10
MEAN: 30
TOTAL: 150
COUNT: 5
---
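As a cross-check, the stats record above can be reproduced independently of Biopieces. The sketch below rebuilds the contents of test.fna in memory, measures each entry, and prints the six keys in the same order. The names `fasta_lengths` and `assembly_stats` are hypothetical, and the integer mean is an assumption, chosen because it agrees with the MEAN of 19278 shown in the first sample record (2872554 bases over 149 contigs).

```python
def fasta_lengths(text):
    """Map each FASTA header (without '>') to its sequence length."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may be wrapped, so add up every line.
            lengths[name] += len(line)
    return lengths


def assembly_stats(lengths):
    """Compute the six basic stats from a non-empty list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = None
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "N50": n50,
        "MAX": ordered[0],
        "MIN": ordered[-1],
        "MEAN": total // len(ordered),  # assumption: integer mean
        "TOTAL": total,
        "COUNT": len(ordered),
    }


# Rebuild the five entries of test.fna: test1..test5, 10 to 50 bases.
UNIT = "ATGCACATTG"
TEST_FNA = "".join(f">test{i}\n{UNIT * i}\n" for i in range(1, 6))

stats = assembly_stats(list(fasta_lengths(TEST_FNA).values()))
for key, value in stats.items():
    print(f"{key}: {value}")
print("---")
```

Running it prints the same values as the record above: N50 40, MAX 50, MIN 10, MEAN 30, TOTAL 150 and COUNT 5.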
To write the result to a file, use the -o switch:
read_fasta -i test.fna | analyze_assembly -o stats.txt -x
[calc_N50]
Martin Asser Hansen - Copyright (C) - All rights reserved.
January 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
analyze_assembly is part of the Biopieces framework.