Fix indexing error for reverse strand in calc_orf_gc
#101
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hi again!
Overview
I noticed while addressing #100 that the GC% computed for genes in the reverse strand was often wrong by a small margin, despite the correct gene sequence being extracted, pointing at an indexing error. After checking for out-of-bound reads, I noticed that in
calc_orf_gc
the loop would read past sequence end in the following part:Indeed, on the reverse strand,
last[fr]
is set to the index of the STOP codon; because for reverse-strand codon this is always the index of the last nucleotide, not the first, the iteration should start 1 nucleotide later, not 3.Fix
Start the iteration at the right coordinates 😄
Example
Taking the same contig
CAKWEX010000332.1
as in #100, I ran Prodigal on both the contig and its reverse complement; the genes predicted in both cases matched in sequences, but the GC% didn't match; namely, the GC content was wrong when the genes were on the reverse strand (i changed thegc_cont
precision so that the difference is easier to see):After applying the fix, the GC-content is consistent independently of whether the gene is on the direct or reverse strand: