Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of soft-clipping and CIGAR computation #25

Open
wants to merge 23 commits into
base: develop
Choose a base branch
from

Conversation

haghshenas
Copy link

@haghshenas haghshenas commented Apr 7, 2021

Here is a short summary of the changes:

  • Enabling soft-clipping
    • Proper use of KSW2 alignment results: Track the best scoring prefix alignment (during KSW2 extension) in addition to full alignment of query or target
    • Reducing end_bonus value from 10 to 5
    • Setting the alignment start coordinate accordingly
  • Computing CIGAR strings:
    • Setting KSW_EZ_EXTZ_ONLY flag for KSW2 extension alignments (otherwise it always backtrack from the end of both sequences)
    • Modifying CIGARGenerator's variable type to enable computation of CIGAR for cases with overlapping MEMs (uint32_t to int32_t)
    • Updating get_cigar function and removing unnecessary codes
  • Logging more organized debug information for alignment validation
  • Improving the command-line interface
    • replacing --allowSoftclip and --allowOverhangSoftclip with --computeCIGAR
    • adding a --debug; if not used, no debug information will be printed

With these changes I compared the output of PuffAligner with Minimap2 for single-end reads. This comparison showed that generally, the alignment is computed correctly in most of the cases.

The next essential step: In some cases the alignment is wrong. The reason is that the extension is stopped (ez->stopped==1) and therefore, the score for reaching the end of query is not calculated properly. As a result, it might be the case that score for reaching end of query is incorrectly higher than the maximum scoring prefix alignment. Here is an example:

pufferfish alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCN
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||x|x|x|||x|||x||||||||x
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

minimap alignment
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAG----------------------
||||||||||||||||||||||||||||||||||||x|||||||||||||||||||||||||||||||||||||||||
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTCTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGGTAGCATCCCCAGGAATGGGCA

It seems that extension stopping was not written with soft-clipping and computation of CIGAR strings in mind, so this requires a fix.
I will check if the original KSW2 extension fixes this issue and if it does, we can discuss about whether it's worth to have the stopping feature or not.

So next items:

  • Checking if using original KSW2 extension fixes edge case wrong alignments
  • Testing and verifying alignment coordinate, soft-clipping and CIGAR string for overhanging reads
  • Testing for PE datasets
  • Adding an option for passing end_bonus value
  • Code cleaning (currently there is a lot of commented code in this branch)

@mohsenzakeri
Copy link
Collaborator

mohsenzakeri commented Jun 11, 2021

I looked at the four commits and all the changes look great!

I tested the soft-clipping with a couple of small examples and I found a case with the end2end mode where we don't expect to see any soft-clipped alignment, but we do. Here is the example:

query:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTCCCCCCC
target:
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGT
or
TCAAAAACATGATTACAGGGACATCTCAGGCTGACTGTGCTGTCCTGATTGTTGCTGCTGGTGTTGGTGAATTTGAAGCTGGTATCTCCAAGAATGGGCTGTGTGTGTGTGTT

For these targets, when running the command pufferfish align -i new.index/ --read read.fa -o out.sam --end2end results in the aligning the read with soft-clipped bases: 113M6S and 112M7S
Maybe, in this case, we wanna report insertions (I) instead of (S).

Also in the case of running without the end2end flag, the alignment scores seem wrong.

I have some fixes which I can commit, but we can also have a discussion before going forward.

@haghshenas haghshenas marked this pull request as ready for review August 3, 2021 17:12
@@ -153,6 +154,8 @@ void processReadsPair(paired_parser *parser,
// aconf.allowOverhangSoftclip = mopts->allowOverhangSoftclip;
// aconf.allowSoftclip = mopts->allowSoftclip;
aconf.computeCIGAR = (mopts->computeCIGAR and !mopts->noOutput);
aconf.endBonus = mopts->endBonus;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to have endBonus in the aconf struct anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed it.

@@ -634,6 +638,8 @@ void processReadsSingle(single_parser *parser,
// aconf.allowOverhangSoftclip = mopts->allowOverhangSoftclip;
// aconf.allowSoftclip = mopts->allowSoftclip;
aconf.computeCIGAR = (mopts->computeCIGAR and !mopts->noOutput);
aconf.endBonus = mopts->endBonus;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we might need to have endBonus in the aconf struct anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed it from here as well.

@mohsenzakeri
Copy link
Collaborator

Based on my review and the conversations we had about the changes before, everything looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants