-
Notifications
You must be signed in to change notification settings - Fork 23
get_genome_seq
[get_genome_seq] can be used to get subsequences from a specified genome that have been indexed with [format_genome]. The subsequence can be obtained explicitly or from BED/PSL/BLAST entries in the stream.
Use [list_genomes] to see available genome sequences.
get_genome_seq [options] <-g genome> <-c chr> <-b beg> <-e end|-l len>
or
... | get_genome_seq [options] <-g genome>
[-? | --help] # Print full usage description.
[-g <genome> | --genome=<genome>] # Genome to get subsequence from.
[-c <string> | --chr=<string>] # Chromosome with requested subsequence.
[-b <uint> | --beg=<uint>] # Begin position of subsequence (first residue=1).
[-e <uint> | --end=<uint>] # End position of subsequence.
[-l <uint> | --len=<uint>] # Length of subsequence.
[-f <uint> | --flank=<uint>] # Include flanking sequence.
[-m | --mask] # Softmask non-exonic sequence.
[-s | --splice] # Splice sequence BLOCKS in BED/PSL type records.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
To get an explicit subsequence from the human genome (currently hg18) do:
get_genome_seq -g hg18 -c chrX -b 12000 -l 20
CHR_END: 12018
SEQ: GGTGCAGTAACACCTGCCGT
CHR_BEG: 11999
SEQ_LEN: 20
CHR: chrX
---
If you have a stream with BED, PSL or BLAST records, you obtain the subsequence simple by piping the stream through [get_genome_seq] and the sequence will be added to the record. Below is an example for a BED entry:
read_bed -i <BED file> -n 1 | get_genome_seq -g hg18
STRAND: +
CHR_END: 95127728
Q_ID: gi|108087685|gb|DQ594132.1|_Homo_sapiens_piRNA_piR-60244,_complete_sequence
CHR_BEG: 95127700
SCORE: 1
REC_TYPE: BED
BED_LEN: 29
CHR: chr15
SEQ: TTCACTTCTCCCATGTAGTTCCTGAGTGC
BED_COLS: 6
---
Now consider the following 12 column BED entry in the file test.bed
:
chr4 70946 71196 AA699063 0 - 70946 71196 0 2 142,55, 0,195,
To obtain the dm3 sequence for this EST, do:
read_bed -i test.bed | get_genome_seq -g dm3 | write_fasta -xw 80
>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAAACGTTACACCGACCCGA
GCTCTTTGAGAAAACATGTAAAAACAGTCCATGGAGCTGAATTTTATGCAAATAAAAAACATAAGGGGTTGCCTCTAAAT
GACGCAAACT
Now use the -m
switch to soft mask the intron so that exons are in uppercase sequence and the rest are
in lowercase:
read_bed -i test.bed | get_genome_seq -mg dm3 | write_fasta -xw 80
>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAaacgttacaccgacccga
gctctttgagaaaacatgtaaaaacagtccatggaGCTGAATTTTATGCAAATAAAAAACATAAGGGGTTGCCTCTAAAT
GACGCAAACT
To splice the exon blocks use the -s
switch:
read_bed -i test.bed | get_genome_seq -sg dm3 | write_fasta -xw 80
>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAGCTGAATTTTATGCAAAT
AAAAAACATAAGGGGTTGCCTCTAAATGACGCAAACT
[format_genome]
[list_genomes]
[read_bed]
[extract_seq]
[write_fasta]
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
[get_genome_seq] is part of the Biopieces framework.