order_pairs
order_pairs orders records with paired-end sequence data where the sequence names follow
either the Illumina 1.5 scheme, where names end in /1 or /2, or the Illumina 1.8 scheme,
where names contain a space followed by 1 or 2 and then a :. The records are
output in interleaved order, which is required by paired-end aware assembly programs.
order_pairs uses a hashing scheme for this and does not sort according to sequence name.
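As an illustration of the two naming schemes described above, the following Python sketch splits a sequence name into a shared base name and a read number. It is not taken from the order_pairs source; the regular expressions and the function name split_pair_name are assumptions made for this example.

```python
import re

# Illumina 1.5 style: the name ends in /1 or /2.
NAME_15 = re.compile(r"^(.+)/([12])$")
# Illumina 1.8 style: a space, then 1 or 2, then a colon.
NAME_18 = re.compile(r"^(\S+) ([12]):")


def split_pair_name(seq_name):
    """Return (base_name, read_number), or None if neither scheme matches.

    Hypothetical helper for illustration only.
    """
    for pattern in (NAME_15, NAME_18):
        match = pattern.match(seq_name)
        if match:
            return match.group(1), int(match.group(2))
    return None
```

Two mates of a pair share the same base name and differ only in the read number, which is what makes a lookup by base name possible.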
Running order_pairs is important after filtering steps where one record of a pair may have been
discarded. For each record, the value of the ORDER key indicates whether the record is paired
or an orphan, and you can use grab to filter the records accordingly.
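The idea of pairing records through a hash and tagging each one with an ORDER value can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual order_pairs implementation: records are modelled as plain dicts, each name is assumed to occur only once, orphans are assumed to be emitted after all pairs, and the number in "orphan 1" / "orphan 2" is assumed to be the read number. The names pair_key and order_pairs_sketch are hypothetical.

```python
import re

NAME_15 = re.compile(r"^(.+)/([12])$")   # Illumina 1.5: name ends in /1 or /2
NAME_18 = re.compile(r"^(\S+) ([12]):")  # Illumina 1.8: space, 1 or 2, colon


def pair_key(seq_name):
    """Return (base_name, read_number) for either naming scheme."""
    for pattern in (NAME_15, NAME_18):
        match = pattern.match(seq_name)
        if match:
            return match.group(1), int(match.group(2))
    raise ValueError("unrecognised pair name: " + seq_name)


def order_pairs_sketch(records):
    """Yield copies of records with mates adjacent and an ORDER key added."""
    waiting = {}  # base name -> (read number, record) waiting for its mate
    for record in records:
        base, read = pair_key(record["SEQ_NAME"])
        if base not in waiting:
            waiting[base] = (read, record)
            continue
        # Mate already seen: emit read 1 then read 2, both marked as paired.
        mate = waiting.pop(base)
        for _, rec in sorted([mate, (read, record)], key=lambda item: item[0]):
            yield dict(rec, ORDER="paired")
    # Assumption: records that never found a mate are emitted last as orphans.
    for read, rec in waiting.values():
        yield dict(rec, ORDER="orphan %d" % read)
```

Because mates are matched by dictionary lookup, the input does not need to be sorted by name; the cost is that unmatched records stay in memory until the end of the stream.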
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 1:N:0:TAGCTG
SEQ: GCTTTGACATAGTCGCTCCAGAATTGCCAGCTAGGGTTAGCTTGGCAACTGCAGCGACGTAATGTGCTGTGGCAGATCAATTTATCTGTTTTGAATCA
SEQ_LEN: 98
SCORES: ^P^PJ\Y`eea`e[daYdecggadgdXJIYVbdc`efg_cdedI^aXIO^abeb\eL_daQU^_V]``]UGTZ\^BBBBBBBBBBBBBBBBBBBBBBB
ORDER: paired
---
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 2:N:0:TAGCTG
SEQ: GGTTATCGATCTGGAAAAAGCAACTAAACCTAAAGCTAAACCACGTAGCGCCGGGTAAATGATTCAAAACAGATAAATTGATCTGCCACAGCACATTA
SEQ_LEN: 98
SCORES: ^VYPJQ`c^JJ[b[efg^dHJ`aa`adXd_ZXXbIIIY[af_H^aWHWPZ[`gggFFZ^bd_Z]Zb_]ba\^ZGY_`TZ``cc[[bbR]]]^aaXQ[bbb
ORDER: paired
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CCNAGGAGGAGNCAATAAGAGACCATTCGTATATGATCTCTCAGGAGAGC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1
ORDER: orphan 1
---
SCORES: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SEQ: NNNNNNNNGGNNCNANNANNNNGTNNNTNGNANNNNCNNANTTGNNNNNN
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/2
ORDER: orphan 2
---
... | order_pairs [options]
[-? | --help] # Print full usage description.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
If you have two paired-end sequence files that use the Illumina 1.5 or 1.8 naming scheme, you can order them with order_pairs simply by doing:
read_fastq -i test1.fq,test2.fq | order_pairs | write_fastq -o combi.fq -x
If you filter your sequences and discard one member of a pair, you can run the data through order_pairs and then use grab to discard any unmatched records:
read_fastq -i combi.fq | # Read in Illumina data
trim_seq | # Trim ends according to quality scores
grab -e "SEQ_LEN>30" | # Remove entries with sequence length of 30 or shorter
order_pairs | # Make sure the pairs are in order
grab -p 'pair' -k ORDER | # Grab paired records
write_fastq -o combi_trimmed.fq -x # Write to new file
Martin Asser Hansen - Copyright (C) - All rights reserved.
May 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
order_pairs is part of the Biopieces framework.