Removing short overlapping repeats #7
Comments
Could you point me at a small example that I can try out? If you aren't comfortable posting it here, mail me a link at ... rsharris at bx dot psu dot edu. For clarification, when you wrote
By "keep", what I think you mean is as follows. If there's an alignment of motif1 that overlaps one of motif2, and the motif1 alignment is the longer of the two, then that motif1 alignment should be output in the same file with all the non-overlapping motif1 alignments, and the motif2 alignment (plus any others in this overlap group) should be output to the overlaps file. Right?
As for sorting, which output isn't sorted? Without looking at the code, I'm not sure I made any special effort in ncrf_resolve_overlaps to create an output order. For example, the input motif summaries might have been sorted by decreasing score. I could output those in the same order they were input (this essentially requires that I save the whole file inside the script). But I don't see how I could do that with the overlapped items (because I don't know what order is wanted). I'm surprised I don't currently output those as positionally sorted, since internally the script has to be doing that.
Maybe the best solution for sorting is that I create ncrf_sort_summary to do the same sort (no pun intended) of things ncrf_sort does. |
Hi, I just shared a small example dataset along with the commands I used.
Yes, I mean the alignment summary.
Yes x2. Ideally, if the alignment of motif2 is not completely included in the alignment of motif1, I would also like to keep that small portion of motif2 in the non-overlapping alignments. I would also like to have 'N's in the summary file for regions where no repeats were found, but I don't know if this is easily doable.
Sorry, I think I thought it was not sorted because some repeats are always in the "overlapping alignments" section of the file, which is at the bottom.
It looks like I may use that option either to filter out alignments overlapping one particular motif or to save overlapping groups to a separate file from the alignments, but I had no luck with that. |
Thanks for the example, I will take a look at that.
From the rest of your message, it sounds like that isn't what you would want.
Neither output form is in the order you want. I think I didn't worry about output order there because I wasn't looking for the same thing you are. And a whitespace-delimited table can be sorted using awk, like this:
Admittedly, that's not an ideal solution to expect users to know how to do (and it's klunky to have to tell awk which columns to sort on). But in my own use that was adequate.
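For readers who would rather not use awk, the same kind of sort can be sketched in Python. This is only an illustration: the column names `seq` and `start` are assumptions about the summary header, and `sort_summary` is a hypothetical helper, not part of NCRF.

```python
def sort_summary(lines, name_col="seq", start_col="start"):
    """Sort a whitespace-delimited table by (sequence name, numeric start).

    The first line is treated as the header; a leading '#' on it is ignored
    when looking up column names. The default column names are assumptions.
    """
    header = lines[0]
    rows = [ln for ln in lines[1:] if ln.strip()]
    names = header.lstrip("#").split()
    name_ix = names.index(name_col)
    start_ix = names.index(start_col)

    def key(row):
        fields = row.split()
        # Compare starts as integers so that 1200 sorts after 40.
        return (fields[name_ix], int(fields[start_ix]))

    return [header] + sorted(rows, key=key)
```

Passing the column names as arguments avoids hard-coding column positions, which is the clunky part of the awk approach.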
I think considerations like those are why I decided to just separate out the overlaps from the non-overlaps, and pass the buck for figuring out what to do with the overlaps.
This is interesting, because just two days ago I became aware that another user is apparently doing the same thing (see issue #6). Is there some downstream tool you're passing this output to? |
Sorry, I have to admit it was not difficult to do, but I was just expecting the resolved summary file to somehow keep the ordering of the summary file obtained using ncrf_sort.py --sortby=name.
I understand the difficulties here. I think I may just go on manually deleting lines with shorter overlaps.
Probably this is not something I need at the moment, but specifying, for example, CCTG_repeat (which is one of the motifs in the summary file) results in one single file with all the motifs and all the overlaps. I see no difference from redirecting the output of resolve_overlaps.py to one output file. Am I missing something?
No, there isn't. I am just inspecting the summary files with Excel, and I am interested in inserting Ns to better visualize interruptions in the repetition. However, I understand this feature may not be appealing to many, and I think I can write some script on my own to do that. |
Probably non-obvious -- the filename should contain "{motif}" as a substring. Not one of the motifs, but the string "motif" inside two curly brackets. Then when non-overlapped motifs are written to a file, "{motif}" is replaced with the actual repeat. So in your case you'd have four files for non-overlaps, and one for the groups of repeats.
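The placeholder substitution described here can be illustrated with a few lines of Python. This is a sketch of the idea only, not the script's actual code; the function name, the file name template, and all motifs other than CCTG are made up for the example.

```python
def output_name(template, motif):
    """Return the output file name for one motif (hypothetical helper)."""
    return template.replace("{motif}", motif)

# Hypothetical motif list; only CCTG is mentioned in this thread.
motifs = ["GGAAT", "CCTG", "CAG", "CTG"]

# With the placeholder, each motif gets its own file name.
with_placeholder = {output_name("reads.{motif}.summary", m) for m in motifs}

# Without it, every motif maps to the same name, hence a single file.
without_placeholder = {output_name("CCTG_repeat", m) for m in motifs}
```

This would explain why a name such as CCTG_repeat, which lacks the placeholder, ends up as one combined file.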
A reasonable expectation. To really accomplish that ncrf_sort would have to put the sort option somewhere. Best place would be as some sort of comment in the sorted file. And this would have to get propagated into the summary file. Doable, but it would mess with the simplicity of the whitespace-with-headers-table format. |
Looking at the example you sent me (without posting any intimate details here), I see there are several overlap groups that include dozens of alignments, as well as some simpler cases. I'm thinking about how to script something equivalent to what I think you are doing by deleting the shorter overlaps.
It's greedy and may not give you the optimal covering of the interval. But this might be what you are doing manually anyway. So I think the pipeline would be to run ncrf_resolve_overlaps with the option to output to several different files. Then run the above filtering script on the overlaps file, concatenate that filtered result with the individual motif files, and sort. Then pass it through another script to insert the Ns. Thinking about the N insertion ...
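A greedy longest-first filter of the kind discussed here could be sketched in Python as follows. It is an illustration under assumptions, not the script from the original comment: alignments are represented as hypothetical `(name, start, end, motif)` tuples with half-open intervals, and `keep_longest` is an invented name.

```python
def keep_longest(alignments):
    """Greedy filter: keep the longest remaining alignment, drop what overlaps it.

    Each alignment is a (name, start, end, motif) tuple covering the half-open
    interval [start, end) on the named read. This layout is an assumption.
    """
    kept = []
    # Visit longest first; ties are broken by start position for determinism.
    for aln in sorted(alignments, key=lambda a: (-(a[2] - a[1]), a[1])):
        name, start, end, _ = aln
        overlaps = any(
            k[0] == name and start < k[2] and k[1] < end for k in kept
        )
        if not overlaps:
            kept.append(aln)
    # Return the survivors ordered by read name and position.
    return sorted(kept, key=lambda a: (a[0], a[1]))
```

Because the choice is greedy, one long alignment can displace two shorter ones that together would have covered more of the read, which matches the caveat about not getting the optimal covering.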
Actually, I'm going to go ahead and try the N stuff in awk, to see if I've missed anything. |
Probably too long to be practical in awk, but here's what it boiled down to:
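As a rough Python sketch of the N-insertion step (not the code from the original comment), the function below adds a filler row for every stretch of a read that no alignment covers. The tuple layout, the `read_lengths` mapping, and the name `fill_gaps` are assumptions made for illustration.

```python
def fill_gaps(alignments, read_lengths, filler="N"):
    """Insert filler rows for the parts of each read not covered by an alignment.

    alignments: (name, start, end, motif) tuples, half-open and non-overlapping.
    read_lengths: dict mapping each read name to its length.
    """
    by_read = {}
    for aln in alignments:
        by_read.setdefault(aln[0], []).append(aln)

    out = []
    for name in sorted(by_read):
        pos = 0  # end of the region covered so far
        for aln in sorted(by_read[name], key=lambda a: a[1]):
            if aln[1] > pos:
                out.append((name, pos, aln[1], filler))
            out.append(aln)
            pos = max(pos, aln[2])
        # Trailing gap between the last alignment and the end of the read.
        if pos < read_lengths[name]:
            out.append((name, pos, read_lengths[name], filler))
    return out
```

The input is expected to have overlaps already resolved, so each gap is simply the space between consecutive alignments on the same read.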
|
Thanks for all the kind answers and explanations! I am following your suggestions for obtaining one single summary file sorted by read name and coordinate, that includes only the longest of overlapped alignments, and filled with Ns where there are no repeats detected. |
I've updated the description of ncrf_resolve_overlaps to (hopefully) make it clearer what the outputs are. |
Perfect, I think it is much clearer now. Thank you again, you can close the issue for me. |
@MaestSi What is the length of your reads and did you have to convert the reads from fastq to fasta to run this program? |
@aishsk87 You do have to convert to fasta. I suppose that's an oversight on my part when I wrote it. But it's trivial to convert fastq to fasta on the fly using any of many different *nix commands. For example, |
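One way to do this conversion in Python, offered as a minimal sketch rather than the command from the original comment, is shown below. It assumes the common four-line-per-record fastq layout; `fastq_to_fasta` is a hypothetical helper name.

```python
def fastq_to_fasta(lines):
    """Yield fasta lines from an iterable of fastq lines.

    Assumes four lines per record (header, sequence, '+', qualities);
    fastq files with sequences wrapped over several lines are not handled.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # '@name ...' becomes '>name ...'
        elif i % 4 == 1:
            yield line            # the sequence itself
        # The '+' separator and the quality line are skipped.
```

Wrapped in a small script that reads standard input and prints each yielded line, this could sit in a pipe in front of a program that expects fasta.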
Amazing, thanks! |
Hi,
I am running NCRF to find 4 different repeats in my Nanopore reads, and I have a summary file where all the repeats for the different motifs are listed. I noticed that some of the repeats overlap, so I would like to end up with a summary file that does not have any overlapping repeats. For this purpose I tried out the ncrf_resolve_overlaps.py script, and I see that all un-overlapped alignments and overlapping groups are written to a file, as expected. However, this is not exactly what I would like to do, as I would also like to keep one alignment (the longest) from each overlapping group. Moreover, repeats in this file are no longer sorted by read name and position.
Do you think there is a way for me to obtain the desired output with NCRF, without having to manually delete unwanted repeats?
Thanks in advance,
Simone