analyze_seq
analyze_seq analyzes the sequence specified by the SEQ key in each record and outputs the frequency of all residues and indels for that record. Furthermore, GC%, SOFT_MASK% (soft-masked sequence is indicated by lowercase letters), and HARD_MASK% (hard-masked sequence consists of Ns) are reported, even for protein sequences, where these values are meaningless.
... | analyze_seq [options]
[-? | --help] # Print full usage description.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the file test.fna
containing the single entry:
>test
ATCGNatcgn-._~
To analyze this sequence, read the file using read_fasta:
read_fasta -i test.fna | analyze_seq
SEQ_NAME: test
SEQ: ATCGNatcgn-._~
SEQ_LEN: 14
RES[A]: 2
RES[T]: 2
RES[C]: 2
RES[G]: 2
RES[N]: 2
RES[-]: 1
RES[.]: 1
RES[_]: 1
RES[~]: 1
SOFT_MASK%: 50.0
HARD_MASK%: 20.0
GC%: 40.0
---
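The numbers in the record above can be reproduced with a short Python sketch. It is not the Biopieces implementation. The function name, the set of indel characters, and the choice of denominator are assumptions inferred from the example output: the reported 50.0, 20.0, and 40.0 only work out if the percentages are taken over the 10 non-indel residues rather than the full SEQ_LEN of 14, and if both N and n count as hard-masked.

```python
from collections import Counter

# Assumption: these are the indel characters, taken from the example sequence.
INDELS = set("-._~")


def analyze_seq(seq):
    """Return residue counts and mask/GC percentages for one sequence."""
    # Residues are counted case-insensitively, so 'A' and 'a' share one key.
    counts = Counter(seq.upper())

    # Assumption: percentages use the number of non-indel residues as denominator.
    residues = [c for c in seq if c not in INDELS]
    n = len(residues)

    def pct(k):
        return round(100.0 * k / n, 1) if n else 0.0

    return {
        "SEQ_LEN": len(seq),
        "RES": dict(counts),
        "SOFT_MASK%": pct(sum(c.islower() for c in residues)),
        "HARD_MASK%": pct(sum(c in "Nn" for c in residues)),
        "GC%": pct(sum(c in "GCgc" for c in residues)),
    }


stats = analyze_seq("ATCGNatcgn-._~")
for key, value in stats.items():
    print(key, value)
```

For the example sequence, this gives a SEQ_LEN of 14, two of each of A, T, C, G, and N, one of each indel character, and the same three percentages as the record above.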
If you have a stack of sequences in one file and you want to determine the mean GC content of all the sequences, you can do it using the mean_vals biopiece:
read_fasta -i test.fna | analyze_seq | mean_vals -k GC% -x
GC%_MEAN: 40.00
---
Similarly, if you want the total count of Ns across all sequences, use the sum_vals biopiece:
read_fasta -i test.fna | analyze_seq | sum_vals -k RES[N]
RES[N]_SUM: 2
---
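To show what the two aggregations above do with a stream of several records, here is a minimal Python sketch. It assumes the record layout seen in the example output: one "KEY: value" pair per line, with each record closed by a line containing only "---". The helpers parse_records, mean_val, and sum_val are hypothetical names for illustration and are not part of Biopieces. The two-record stream is made-up sample data.

```python
def parse_records(text):
    """Yield one dict per record; a line of '---' ends a record."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if record:
                yield record
            record = {}
        elif ": " in line:
            # Split on the first ': ' only, so values may contain colons.
            key, value = line.split(": ", 1)
            record[key] = value


def mean_val(records, key):
    """Mean of a numeric key over the records that contain it."""
    vals = [float(r[key]) for r in records if key in r]
    return sum(vals) / len(vals) if vals else None


def sum_val(records, key):
    """Sum of a numeric key over the records that contain it."""
    return sum(float(r[key]) for r in records if key in r)


# Made-up two-record stream in the layout shown above.
stream = """SEQ_NAME: a
GC%: 40.0
RES[N]: 2
---
SEQ_NAME: b
GC%: 60.0
RES[N]: 3
---
"""

records = list(parse_records(stream))
print(f"GC%_MEAN: {mean_val(records, 'GC%'):.2f}")
print(f"RES[N]_SUM: {sum_val(records, 'RES[N]'):g}")
```

For this sample stream, the sketch prints a mean GC% of 50.00 and an N total of 5. Records that lack the requested key are skipped rather than counted as zero, which is a design choice of this sketch, not a documented property of mean_vals or sum_vals.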
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
analyze_seq is part of the Biopieces framework.