Fix indexing error for reverse strand in `calc_orf_gc` #101

althonos · 2023-01-28T21:40:41Z

Hi again!

Overview

I noticed while addressing #100 that the GC% computed for genes in the reverse strand was often wrong by a small margin, despite the correct gene sequence being extracted, pointing at an indexing error. After checking for out-of-bound reads, I noticed that in calc_orf_gc the loop would read past sequence end in the following part:

      for(j = last[fr]+3; j <= nod[i].ndx; j+=3)
        gc[fr] += is_gc(seq, j) + is_gc(seq, j+1) + is_gc(seq, j+2);

Indeed, on the reverse strand, last[fr] is set to the index of the STOP codon; because for reverse-strand codon this is always the index of the last nucleotide, not the first, the iteration should start 1 nucleotide later, not 3.

Fix

Start the iteration at the right coordinates 😄

Example

Taking the same contig CAKWEX010000332.1 as in #100, I ran Prodigal on both the contig and its reverse complement; the genes predicted in both cases matched in sequences, but the GC% didn't match; namely, the GC content was wrong when the genes were on the reverse strand (i changed the gc_cont precision so that the difference is easier to see):

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.582090  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.601954  0.600733

After applying the fix, the GC-content is consistent independently of whether the gene is on the direct or reverse strand:

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.580431  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.600733  0.600733

althonos · 2023-09-07T17:03:40Z

@hyattpd : Do you think you could merge #100 and this PR as well? Thanks 😃

Fix indexing error for reverse strand in calc_orf_gc

67643e9

althonos mentioned this pull request Feb 3, 2023

Fix buffer initialization in SD motif detection code #100

Merged

FlorianTrigodet mentioned this pull request Jul 10, 2024

[FEATURE REQUEST] Different gene caller option(s) for anvi-gen-contigs-database merenlab/anvio#2298

Open

althonos mentioned this pull request Jan 8, 2025

support detection of RBS across linear chromosome breakpoint ? althonos/pyrodigal#65

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix indexing error for reverse strand in `calc_orf_gc` #101

Fix indexing error for reverse strand in `calc_orf_gc` #101

althonos commented Jan 28, 2023

althonos commented Sep 7, 2023

Fix indexing error for reverse strand in calc_orf_gc #101

Are you sure you want to change the base?

Fix indexing error for reverse strand in calc_orf_gc #101

Conversation

althonos commented Jan 28, 2023

Overview

Fix

Example

althonos commented Sep 7, 2023

Fix indexing error for reverse strand in `calc_orf_gc` #101

Fix indexing error for reverse strand in `calc_orf_gc` #101