find_orphans
find_orphans detects orphans in paired-end records in the stream, i.e. reads whose mate is missing. Detection is based on the sequence names, which can use either the Illumina 1.5 scheme, where names end in /1 or /2, or the Illumina 1.8 scheme, where names contain a space followed by 1 or 2 and then a colon (:). Each record is given a TYPE key whose value is orphan for orphan reads and paired for paired reads.
NB! The reads in the stream must be interleaved and sorted according to SEQ_NAME. This is normally not a problem since the sequences are already sorted when output from the sequencer.
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 1:N:0:TAGCTG
SEQ: GCTTTGACATAGTCGCTCCAGAATTGCCAGCTAGGGTTAGCTTGGCAACTGCAGCGACGTAATGTGCTGTGGCAGATCAATTTATCTGTTTTGAATCA
SEQ_LEN: 98
SCORES: ^P^PJ\Y`eea`e[daYdecggadgdXJIYVbdc`efg_cdedI^aXIO^abeb\eL_daQU^_V]``]UGTZ\^BBBBBBBBBBBBBBBBBBBBBBB
TYPE: paired
---
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 2:N:0:TAGCTG
SEQ: GGTTATCGATCTGGAAAAAGCAACTAAACCTAAAGCTAAACCACGTAGCGCCGGGTAAATGATTCAAAACAGATAAATTGATCTGCCACAGCACATTA
SEQ_LEN: 98
SCORES: ^VYPJQ`c^JJ[b[efg^dHJ`aa`adXd_ZXXbIIIY[af_H^aWHWPZ[`gggFFZ^bd_Z]Zb_]ba\^ZGY_`TZ``cc[[bbR]]]^aaXQ[bbb
TYPE: paired
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CCNAGGAGGAGNCAATAAGAGACCATTCGTATATGATCTCTCAGGAGAGC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1
TYPE: orphan
---
SCORES: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SEQ: NNNNNNNNGGNNCNANNANNNNGTNNNTNGNANNNNCNNANTTGNNNNNN
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/2
TYPE: orphan
---
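The name-matching rule described above can be sketched in a few lines of Python. This is an illustrative approximation, not the actual find_orphans implementation: the function names pair_key and tag_orphans and the two regular expressions are assumptions made for this example, and records are modelled as plain dictionaries. It relies on the stream being interleaved and sorted, so that mates are adjacent.

```python
import re

# Illumina 1.8 style: "<name> 1:N:0:TAGCTG" (space, then 1 or 2, then ':').
NEW_STYLE = re.compile(r"^(.+) ([12]):")
# Illumina 1.5 style: "<name>/1" or "<name>/2".
OLD_STYLE = re.compile(r"^(.+)/([12])$")


def pair_key(seq_name):
    """Return (base_name, mate_number), or None if neither scheme matches."""
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(seq_name)
        if match:
            return match.group(1), int(match.group(2))
    return None


def tag_orphans(records):
    """Yield copies of the records with a TYPE key of 'paired' or 'orphan'.

    Assumes the records are interleaved and sorted by SEQ_NAME, so the
    two mates of a pair sit next to each other (mate 1 first).
    """
    records = list(records)
    i = 0
    while i < len(records):
        first = pair_key(records[i]["SEQ_NAME"])
        second = None
        if i + 1 < len(records):
            second = pair_key(records[i + 1]["SEQ_NAME"])
        is_pair = (
            first is not None
            and second is not None
            and first[0] == second[0]
            and (first[1], second[1]) == (1, 2)
        )
        if is_pair:
            yield dict(records[i], TYPE="paired")
            yield dict(records[i + 1], TYPE="paired")
            i += 2
        else:
            yield dict(records[i], TYPE="orphan")
            i += 1
```

Applied to the four sequence names in the example records above, this sketch tags the two HWI-ST575 reads as paired and the two ILLUMINA-52179E reads as orphan, because the latter differ in their base names.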
... | find_orphans [options]
[-? | --help] # Print full usage description.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
If you filter your sequences and discard one member of a pair, you can run the data through find_orphans to locate the orphans:
read_fastq -i pair1.fq -j pair2 | # Read in interleaved Illumina data from two files
trim_seq | # Trim ends according to quality scores
grab -e "SEQ_LEN>30" | # Keep only entries with sequences longer than 30
find_orphans | # Find orphans
write_fastq_files -k TYPE -x # Sort reads into two files: paired.fastq and orphan.fastq
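The last step of the pipeline splits the stream by the value of the TYPE key. As a rough illustration of that idea only, the Python sketch below groups tagged records by a key and formats each group as FASTQ text. The helper names to_fastq and group_by_key are hypothetical, records are assumed to be plain dictionaries, and no score-encoding conversion is attempted; it does not reproduce the behaviour of write_fastq_files.

```python
from collections import defaultdict


def to_fastq(record):
    """Format one record as a four-line FASTQ entry."""
    return "@{SEQ_NAME}\n{SEQ}\n+\n{SCORES}\n".format(**record)


def group_by_key(records, key="TYPE"):
    """Map each value of `key` to the FASTQ text of the records carrying it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(to_fastq(record))
    return {value: "".join(entries) for value, entries in groups.items()}
```

Each entry of the returned dictionary could then be written to a file named after its key value, for example paired.fastq and orphan.fastq.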
Martin Asser Hansen - Copyright (C) - All rights reserved.
September 2013
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
find_orphans is part of the Biopieces framework.