Fix buffer initialization in SD motif detection code #100
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hi again !
A very detailed report by @pchaumeil in althonos/pyrodigal#28 helped me discover a new bug in the Prodigal source. This is affecting the detection of Shine-Dalgarno motifs for start codons on the reverse strand, where the searched region may go out-of-bound.
Overview
In
shine_dalgarno_mm
andshine_dalgarno_exact
, thematch
buffer stores a score for each position of the motif being searched. In the event the region overlaps with the sequence edge, only the sequence characters should be matched. However, an issue with indexing causes the initialization not to happen for ahead-of-start nucleotides, and thematch
buffer may contain random bytes. This can lead to the wrong RBS motif being identified, and occasionally some genes having wrong coordinates.Fix
Fix initialization of the
match
buffer so that all positions are set to a default mismatch value (-10). In the code that compares the 6-base region toAGGAGG
, positions before0
are skipped, and somatch[i]
may no be initialized for all values ofi
:Initializing the array to -10 makes the undefined behaviour disappear. In fact, since
match
is stack-allocated, if may contain remnants of a previous call, like2.0
or3.0
, which made the error hard to debug in the first place.Example
In the assembly GCA_934838455.1, there is a contig (
CAKWEX010000332.1
) which ends with the following nucleotide sequence:It contains a start codon (1) and a RBS motif (2) on the reverse strand. However, after running Prodigal on this contig, the predicted gene is:
As you can see, the RBS motif is
AGGA
; however, there is noAGGA
in the sequence upstream of the start codon, onlyGGA
. The missingA
is actually an artifact of thematch
buffer not being initialized. After applying the fix, the gene is predicted with aGGA
motif, as expected: