Evaluation of repeat identification #6
Comments
Thanks for reporting that, I will take a look at it.
I can verify that it doesn't report that repeat segment when given the entire read. There are a couple of stages of internal filtering intended to discard alignments and subalignments that aren't "good enough", so my next step will be to check whether this segment got thrown out with the bathwater in one of those. I also asked it to search just the 11412..11704 hole indicated by your spreadsheet (by hardmasking the rest of the sequence). In that case it does report the missing segment you mention, plus two others (see below). So my best guess at the moment is that the initial alignment identified a much longer segment than is reported, and one of the filtering steps judged some subsegments to be suboptimal and whittled them out. And I might have made the assumption that any whittled-out subsegment can't contain any subsubsegments that are "good enough" to be reported.
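For anyone who wants to repeat that experiment, the hardmasking step can be sketched roughly as follows. This is only an illustration, not code from NcRF: the function name `hardmask_outside` is made up, and it assumes 0-based, half-open coordinates, which may not match the convention used in the spreadsheet or in NcRF's output.

```python
def hardmask_outside(seq: str, start: int, end: int) -> str:
    """Replace every base outside [start, end) with 'N'.

    Coordinates are assumed to be 0-based and half-open (an assumption of
    this sketch). The sequence length is unchanged, so positions reported
    on the masked read still refer to the original read.
    """
    if not (0 <= start <= end <= len(seq)):
        raise ValueError("interval lies outside the sequence")
    return "N" * start + seq[start:end] + "N" * (len(seq) - end)


if __name__ == "__main__":
    read = "ACGTACGTAC"
    print(hardmask_outside(read, 3, 7))
```

Because the masked read keeps its original length, anything found in the unmasked window can be compared directly against the positions in the full-read run.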
What I said was my best guess is what happened. The first pass identified 3244..14078 as a single alignment. A 'debridging' step then identified 11412..11704, with mRatio 76.5%, as a probable random bridge (as discussed in supplementary note S1B, linked below). Removing the bridge partitions that long alignment into three segments with mRatio 98.1%, 76.5%, and 96.9%. That's reasonable. What isn't reasonable is my assumption that the bridge is all garbage. Instead of throwing the bridge out entirely, what I need to do is search it for repeats. I think this will be a pretty simple fix, but I'm not sure yet. If you are in a hurry, a workaround would be to (a) N-mask all the intervals NcRF reported, and (b) run a second pass of NcRF on the masked read. But if all goes well I should have a fix here in the repo by tomorrow. The manuscript supplement can be found here. Note S1B explains the rationale for removing alignment artifacts like random bridges.
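Step (a) of that workaround could look something like the sketch below. It is an illustration under assumptions, not part of NcRF: `mask_intervals` is a hypothetical helper, the intervals are assumed to be 0-based and half-open, and extracting them from NcRF's output is left out. Step (b), rerunning NcRF on the masked read, is not shown.

```python
def mask_intervals(seq: str, intervals) -> str:
    """Return seq with every [start, end) interval replaced by 'N'.

    `intervals` is an iterable of (start, end) pairs, assumed 0-based and
    half-open (an assumption of this sketch). Overlapping intervals are
    fine; the read length is preserved.
    """
    chars = list(seq)
    for start, end in intervals:
        if not (0 <= start <= end <= len(seq)):
            raise ValueError(f"interval {start}..{end} lies outside the sequence")
        chars[start:end] = "N" * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    read = "ACGTACGTAC"
    reported = [(0, 2), (5, 8)]  # hypothetical intervals from a first pass
    print(mask_intervals(read, reported))
```

Since the masked read has the same length as the original, positions reported by the second pass can be merged with the first-pass results without any offset arithmetic.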
I've updated the main branch to include what I think is a fix. I've given it a new temporary version number, 1.01.04. After I subject it to more thorough testing I'll bump the version number again and make it a tagged release, and then see if I can figure out how to update the bioconda version of it. @biadarola if you could try running this on some of your data and verify that it reports a superset of what the earlier version reported, that would be helpful. And, of course, that the newly reported alignments meet your criterion. I expect this might be a little slower. Here's what I now get for the read you sent, searching for CCTG, and sorted by position on the read.
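One possible way to do the superset check requested above is sketched here. It is only an illustration: the helper names are made up, intervals are plain (start, end) pairs pulled from each version's output by whatever means you prefer, and "superset" is interpreted as "every old interval lies inside some single new interval", which is an assumption rather than a definition from NcRF.

```python
def is_covered(interval, others) -> bool:
    """True if `interval` lies entirely inside at least one interval in `others`."""
    start, end = interval
    return any(o_start <= start and end <= o_end for o_start, o_end in others)


def missing_from_new(old_intervals, new_intervals):
    """Return the old intervals that no single new interval fully contains."""
    return [iv for iv in old_intervals if not is_covered(iv, new_intervals)]


if __name__ == "__main__":
    # Hypothetical intervals, not real NcRF output.
    old = [(10, 20), (30, 45)]
    new = [(5, 25), (30, 40), (50, 60)]
    print(missing_from_new(old, new))
```

An empty result means the newer version reports at least everything the older one did under this interpretation; anything listed is worth a closer look.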
Dear developers,
I am using NcRF to identify 4 different motifs in our nanopore reads: TG, CCCG, CCTG, TCTG. I've set the minimum repeat length to 4 times the motif length, using the following command lines:
I've put all the results together in a file, filling the gaps with an "N" motif in regions where nothing was identified. However, I found a read with a CCTG repeat in a region that is supposedly free of any of these repeats (from position 11683 to 11705, with 100% identity). I've attached a file with this example; the read sequence is in the last column of the file.
Example_repeat.xlsx
Do you have any idea what might be causing this?
Please, let me know if you need further information.
Thank you in advance for your help.
Barbara