flashpy is a Python module for merging paired-end reads generated by high-throughput DNA sequencing systems such as Illumina MiSeq, HiSeq, and NovaSeq. It reimplements the algorithm of FLASH (https://github.com/ebiggers/flash) in Cython, so it runs fast: merging 10,000 sequence pairs takes about 1 second.
- Build the Cython extension in place from the root of the cloned repository: python setup.py build_ext --inplace
- Set PYTHONPATH to the directory where you cloned the repository.
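As an alternative to exporting PYTHONPATH in the shell, a script can add the clone directory to sys.path at runtime. The sketch below is only one way to do this; the helper name add_repo_to_path and the example location ~/src/flashpy are hypothetical and do not come from the flashpy repository.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Put repo_dir at the front of sys.path unless it is already listed."""
    repo_path = str(Path(repo_dir).expanduser().resolve())
    if repo_path not in sys.path:
        sys.path.insert(0, repo_path)
    return repo_path


# Example call with a hypothetical clone location; replace it with your own:
# add_repo_to_path("~/src/flashpy")
```

Once the directory is on the path, the module should be importable in the same way as when PYTHONPATH is set.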
flashpy provides only two functions: merge and flash. The merge function merges a single pair of reads. The flash function simply applies merge to every read pair in the given paired FASTQ files.
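The relationship between the two functions can be sketched in pure Python: read both FASTQ files in lockstep, decode each quality string, and hand every pair to a merge callable. This is an illustration only, not flashpy's implementation. The helper names (parse_fastq, decode_quality, merge_pairs) are hypothetical, the Phred+33 offset is an assumption about the input files, and the merge callable is passed in as an argument so the sketch does not depend on how flashpy is imported.

```python
import itertools


def parse_fastq(lines):
    """Yield (key, sequence, quality_string) from an iterable of FASTQ lines.

    Assumes the simple four-lines-per-record layout.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        record = list(itertools.islice(stripped, 4))
        if len(record) < 4:
            break
        header, seq, _separator, qual = record
        # Assumption: the key is the first whitespace-delimited token after "@".
        yield header[1:].split()[0], seq, qual


def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def merge_pairs(merge_func, lines1, lines2, **options):
    """Call merge_func once per read pair; both inputs must be in the same order."""
    results = {}
    for (key1, seq1, qual1), (_key2, seq2, qual2) in zip(
        parse_fastq(lines1), parse_fastq(lines2)
    ):
        results[key1] = merge_func(
            seq1=seq1,
            seq2=seq2,
            score1=decode_quality(qual1),
            score2=decode_quality(qual2),
            **options,
        )
    return results
```

With real data, lines1 and lines2 would be open file handles for the two FASTQ files, and merge_func would be flashpy's merge.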
- merge(seq1=None, seq2=None, score1=None, score2=None, min_overlap=50, max_overlap=300, allow_outies=True, min_identity=0.5, max_identity=1.0)
  Merge a single sequence pair, seq1 and seq2.
  - seq1: str
    The DNA sequence.
  - seq2: str
    The DNA sequence paired with seq1.
  - score1: list of int
    The quality values for seq1. The values must already be decoded from their ASCII codes, and the list must contain exactly one value per letter of seq1.
  - score2: list of int
    The quality values for seq2. The values must already be decoded from their ASCII codes, and the list must contain exactly one value per letter of seq2.
  - min_overlap: int
    The minimum overlap length between seq1 and seq2.
  - max_overlap: int
    The maximum overlap length between seq1 and seq2.
  - allow_outies: bool
    If True, also try to combine seq1 and seq2 in the "outie" orientation.
  - min_identity: float
    The minimum allowed sequence identity between the overlapping regions of seq1 and seq2.
  - max_identity: float
    If the identity of an overlapping region is larger than max_identity, the function stops searching and returns the result based on that region, even if better overlaps remain at other positions.
  return
    merged_sequence (str), merged_score (list of int), identity (float), overlap_length (int), overlap_direction ("innie" or "outie")
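To make the overlap parameters concrete, here is a naive pure-Python search for an "innie"-style overlap between the end of seq1 and the start of seq2. It is not flashpy's Cython code and does not claim to reproduce FLASH's scoring. The function name, the longest-first scan order, the assumption that seq2 is already oriented on the same strand as seq1, and the use of >= for the max_identity test are all choices made for this illustration. Quality scores are ignored.

```python
def naive_innie_overlap(seq1, seq2, min_overlap=50, max_overlap=300,
                        min_identity=0.5, max_identity=1.0):
    """Return (overlap_length, identity) for the best overlap found, or None.

    Illustration only: compares the last `length` letters of seq1 with the
    first `length` letters of seq2 for every allowed length.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    best = None
    longest = min(max_overlap, len(seq1), len(seq2))
    for length in range(longest, min_overlap - 1, -1):
        tail = seq1[-length:]
        head = seq2[:length]
        matches = sum(a == b for a, b in zip(tail, head))
        identity = matches / length
        if identity < min_identity:
            continue  # overlap too dissimilar to be accepted
        if best is None or identity > best[1]:
            best = (length, identity)
        if identity >= max_identity:
            break  # good enough: stop without testing the remaining lengths
    return best


# The last 7 letters of the first read equal the first 7 letters of the second.
result = naive_innie_overlap("AAAACGTACGT", "CGTACGTTTTT",
                             min_overlap=4, max_overlap=10)
print(result)  # (7, 1.0)
```

Lowering max_identity makes the search stop at the first overlap that is "good enough", which trades some accuracy for speed, as described for the max_identity parameter.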
- flash(read1=None, read2=None, min_overlap=50, max_overlap=300, allow_outies=True, min_identity=0.5, max_identity=1.0, show_progress=True, key_check=True)
  Merge a pair of FASTQ files.
  - read1: str
    FASTQ file path.
  - read2: str
    Path to the FASTQ file paired with read1.
  - min_overlap: int
    Same as the min_overlap parameter of merge. The value is applied to all sequence pairs.
  - max_overlap: int
    Same as the max_overlap parameter of merge. The value is applied to all sequence pairs.
  - allow_outies: bool
    Same as the allow_outies parameter of merge. The value is applied to all sequence pairs.
  - min_identity: float
    Same as the min_identity parameter of merge. The value is applied to all sequence pairs.
  - max_identity: float
    Same as the max_identity parameter of merge. The value is applied to all sequence pairs.
  - show_progress: bool
    If True, display a progress bar for the operation.
  - key_check: bool
    If True, for each sequence key in read2, the function checks whether the same sequence key exists in read1.
  return
    merged_reads, overlap_distributions
    - merged_reads: dict
      {*key1* (common sequence key of *r1_key* and *r2_key*): {"r1_key": original sequence key in read1, "r2_key": original sequence key in read2 paired with *r1_key*, "seq": merged sequence, "quality": merged score, "identity": sequence identity of the overlapping region}, *key2*: ..., ...}
    - overlap_distributions: dict
      {*key1* (("innie" or "outie", *overlap_length*)): number of paired sequences that share an overlapping region of length *overlap_length*, *key2*: ..., ...}
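The two return values of flash can be post-processed with plain Python. The sketch below assumes the dictionary layouts described above and that "quality" holds a list of integer scores, as merged_score does for merge. The helper names and the Phred+33 offset used to re-encode qualities are assumptions, not part of flashpy.

```python
def encode_quality(scores, offset=33):
    """Convert integer scores back to an ASCII quality string (Phred+33 assumed)."""
    return "".join(chr(score + offset) for score in scores)


def format_fastq(merged_reads):
    """Return FASTQ text for a merged_reads-style dict."""
    lines = []
    for key, record in merged_reads.items():
        lines.extend(
            ["@" + key, record["seq"], "+", encode_quality(record["quality"])]
        )
    return "\n".join(lines) + "\n" if lines else ""


def summarize_overlaps(overlap_distributions):
    """Return {direction: (pair_count, mean_overlap_length)}."""
    totals = {}
    for (direction, length), count in overlap_distributions.items():
        pairs, weighted = totals.get(direction, (0, 0))
        totals[direction] = (pairs + count, weighted + length * count)
    return {
        direction: (pairs, weighted / pairs)
        for direction, (pairs, weighted) in totals.items()
        if pairs
    }


# Hand-written stand-ins for the two return values, for illustration only.
example_reads = {
    "read1": {"r1_key": "read1 1:N:0", "r2_key": "read1 2:N:0",
              "seq": "ACGT", "quality": [40, 40, 20, 0], "identity": 1.0},
}
example_distribution = {("innie", 100): 3, ("innie", 120): 1, ("outie", 60): 2}

fastq_text = format_fastq(example_reads)
summary = summarize_overlaps(example_distribution)
```

In practice the two example dictionaries would be replaced by the values returned from a flash call, and fastq_text would be written to an output file.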