substitute_vals
Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions
substitute_vals searches the values of the specified keys in the stream and replaces matches using Perl regular expressions (see Examples). Flags are available for case-insensitive and global search.
... | substitute_vals --search=<regex> --replace=<regex> [options]
[-? | --help] # Print full usage description.
[-s <string> | --search=<string>] # Regex search.
[-r <string> | --replace=<string>] # Regex replace.
[-i | --ignore_case] # Case insensitive search.
[-g | --global] # Global replacement.
[-k <list> | --keys=<list>] # List of keys whose values to substitute.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
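As a rough illustration of how the options interact, the following Python sketch emulates the documented behavior on a single record. It is not the Biopieces implementation: the function name `substitute_vals` and its parameters are hypothetical, records are assumed to be plain dicts, and Python's `re` module stands in for Perl regex (the two agree for simple patterns such as the ones used here). Treating an omitted key list as "all keys" is an assumption based on the output of the first example on this page.

```python
import re

def substitute_vals(record, search, replace, ignore_case=False,
                    global_=False, keys=None):
    """Return a copy of record with regex substitutions applied to its values."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(search, flags)
    # count=0 replaces every match (like -g); count=1 replaces only the first.
    count = 0 if global_ else 1
    # Assumption: with no keys given, every value in the record is processed.
    targets = record.keys() if keys is None else keys
    result = dict(record)
    for key in targets:
        if key in result:
            result[key] = pattern.sub(replace, str(result[key]), count=count)
    return result

record = {"SEQ_NAME": "test1", "SEQ": "AGNCTTNNN", "SEQ_LEN": "9"}

# Case-insensitive search for "n", first match only, restricted to SEQ.
first_only = substitute_vals(record, "n", "", ignore_case=True, keys=["SEQ"])
# Same search, but replacing every match.
every_match = substitute_vals(record, "n", "", ignore_case=True,
                              global_=True, keys=["SEQ"])
print(first_only["SEQ"])
print(every_match["SEQ"])
```

Without the global flag only the first N is removed; with it, every N in SEQ is removed, while SEQ_NAME and SEQ_LEN are untouched because they are not in the key list.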
Consider the following sequences in FASTA format in the file test.fna:
>test1
AGNCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAATNNN
>test2
TGGGCGTTNNNNNGCAGGTAAATAGGCTTCTGTNNGACGTACTATAACGT
>test3
NNNNATAGTACTACAGTAACGAAAGTCNNGGATTTTTCTGAAGAGCTTTA
To remove all digits from the values of every key (no keys are specified, so SEQ_LEN and SEQ_NAME are affected as well), use substitute_vals like this:
read_fasta -i test.fna | substitute_vals -s '\d' -r '' -g
SEQ: AGNCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAATNNN
SEQ_LEN:
SEQ_NAME: test
---
SEQ: TGGGCGTTNNNNNGCAGGTAAATAGGCTTCTGTNNGACGTACTATAACGT
SEQ_LEN:
SEQ_NAME: test
---
SEQ: NNNNATAGTACTACAGTAACGAAAGTCNNGGATTTTTCTGAAGAGCTTTA
SEQ_LEN:
SEQ_NAME: test
---
To remove all N's from the sequences only, restrict the substitution to the SEQ key with -k:
read_fasta -i test.fna | substitute_vals -k SEQ -s 'N' -r '' -g
SEQ: AGCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAAT
SEQ_LEN: 50
SEQ_NAME: test1
---
SEQ: TGGGCGTTGCAGGTAAATAGGCTTCTGTGACGTACTATAACGT
SEQ_LEN: 50
SEQ_NAME: test2
---
SEQ: ATAGTACTACAGTAACGAAAGTCGGATTTTTCTGAAGAGCTTTA
SEQ_LEN: 50
SEQ_NAME: test3
---
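In the output above SEQ_LEN still reads 50 although the sequences are now shorter. If the length needs to match the edited sequence, it has to be recomputed in a later step. A minimal Python sketch of that idea, assuming records are plain dicts (the helper name `strip_n_and_update_len` is hypothetical and not part of Biopieces):

```python
import re

def strip_n_and_update_len(record):
    """Remove every N from SEQ and recompute SEQ_LEN from the result."""
    updated = dict(record)
    updated["SEQ"] = re.sub("N", "", record["SEQ"])
    updated["SEQ_LEN"] = len(updated["SEQ"])
    return updated

test1 = {
    "SEQ_NAME": "test1",
    "SEQ": "AGNCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAATNNN",
    "SEQ_LEN": 50,
}
cleaned = strip_n_and_update_len(test1)
print(cleaned["SEQ"], cleaned["SEQ_LEN"])
```

The input record is left unmodified; only the returned copy carries the shortened sequence and its new length.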
We can further restrict the removal to blocks of three or more N's:
read_fasta -i test.fna | substitute_vals -k SEQ -s 'N{3,}' -r '' -g
SEQ: AGNCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAAT
SEQ_LEN: 50
SEQ_NAME: test1
---
SEQ: TGGGCGTTGCAGGTAAATAGGCTTCTGTNNGACGTACTATAACGT
SEQ_LEN: 50
SEQ_NAME: test2
---
SEQ: ATAGTACTACAGTAACGAAAGTCNNGGATTTTTCTGAAGAGCTTTA
SEQ_LEN: 50
SEQ_NAME: test3
---
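The quantifier `{3,}` means "three or more", so a run of exactly three N's is removed while runs of one or two are kept. The Python sketch below applies the same pattern to the test1 sequence and contrasts it with `{4,}`, which would be the pattern to use if runs of three should be kept. Python's `re` is used here only as a stand-in for Perl regex, which treats these quantifiers the same way.

```python
import re

seq = "AGNCTTTTCATTCTGACTGCAACGGGCAATACCTGCCGTGAGTAAATNNN"  # test1

three_or_more = re.sub(r"N{3,}", "", seq)  # removes the trailing NNN
four_or_more = re.sub(r"N{4,}", "", seq)   # no run of four N's, nothing removed

print(three_or_more)
print(four_or_more)
```

In both cases the single N near the start of the sequence is kept, since it never forms a run long enough to match.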
Martin Asser Hansen - Copyright (C) - All rights reserved.
January 2013
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
substitute_vals is part of the Biopieces framework.