Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

corresponding ORF not called in different isolates despite identical sequence #115

Open
0xaf1f opened this issue Dec 28, 2024 · 6 comments

Comments

@0xaf1f
Copy link

0xaf1f commented Dec 28, 2024

In doing an analysis of gene presence/absence, I had some false positives that I tracked down to Prodigal not consistently calling the ORF, even though the sequence was present and identical in the other strains. I'll show an example of two genomes here: Mycobacterium tuberculosis isolates N0157 and N1216.

Prodigal Output

I ran prodigal using a source-compiled from the latest git snapshot as

prodigal -i $genome.fasta -c -m -g 11 -p single -f gff

to match the way that Prokka invokes it, although I suppose defaults should be fine since I'm using complete assemblies with no gaps.

N1216

Prodigal calls three consecutive ORFs: The first is Rv0348, the second is a new prediction, and the third is Rv0349.

1       Prodigal_v2.6.3 CDS     421332  421985  70.5    +       0       ID=1_366;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.610;conf=100.00;score=70.47;cscore=48.90;sscore=21.57;rscore=14.24;uscore=1.95;tscore=4.12;
1       Prodigal_v2.6.3 CDS     421940  422143  6.9     -       0       ID=1_367;partial=00;start_type=GTG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.627;conf=83.03;score=6.91;cscore=-0.31;sscore=7.22;rscore=9.20;uscore=0.73;tscore=-2.21;
1       Prodigal_v2.6.3 CDS     422231  422647  11.2    +       0       ID=1_368;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.624;conf=92.92;score=11.20;cscore=8.74;sscore=2.46;rscore=-2.24;uscore=0.57;tscore=4.12;

N0157

Prodigal does not call any ORF between Rv0348 and Rv0349.

1       Prodigal_v2.6.3 CDS     421451  422104  70.8    +       0       ID=1_370;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.609;conf=100.00;score=70.84;cscore=49.41;sscore=21.43;rscore=14.22;uscore=1.85;tscore=4.11;
1       Prodigal_v2.6.3 CDS     422107  422766  21.5    +       0       ID=1_371;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.629;conf=99.30;score=21.55;cscore=23.33;sscore=-1.78;rscore=-3.59;uscore=2.32;tscore=-1.76;

Sequence Check

The sequence of the 204bp middle ORF that was called in N1216 is identical to a corresponding sequence in N0157 that is also between Rv0348-9:

$ blastn -query N1216.fasta -query_loc 421940-422143 -subject N0157.fasta -outfmt 7
# BLASTN 2.15.0+
# Query: 1 4393757
# Database: User specified sequence set (Input: /grp/valafar/data/genomes/N0157.fasta)
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
1       1       100.000 204     0       0       421940  422143  422059  422262  1.47e-105       377
# BLAST processed 1 queries

Epilogue and Data

This is one example but I have many false positives that I suspect are similar scenarios. In this particular example, the ORF is called in 49 genomes and not called in 72 genomes. I'm including the genome sequences for the two strains I examined above.

data.tar.gz

@althonos, this issue also exists in pyrodigal 3.6.3 from bioconda.

@hyattpd
Copy link
Owner

hyattpd commented Dec 28, 2024

So, none of this is a bug. The strains aren't identical, so the results aren't going to be either. Prodigal (and any genefinder) will always make mistakes (both false positives and false negatives), so this doesn't really rise to the level of an issue (more like perhaps something to work on in the algorithm itself).

Differences in a presence/absence matrix are also expected in a pangenome; it's kind of the whole point. You wouldn't expect identical annotations anyway. Some genes will be dispensable and some will be core.

The case you've highlighted is really an interesting one.

  1. The second gene is only separated from the first gene by 3bp in N0157.
  2. There's no RBS for Rv0349 in N0157, and it's a GTG start (so it gets a bad start score).
  3. The "new" gene overlaps Rv0348 by 45bp (this is a lot but permitted since opposite strand). It has a mediocre coding score (barely positive), but a very good RBS motif.
  4. The sum of the two gene scores in N1216 (the second and third gene) (6.9 + 11.2) is still less than the Rv0349 gene (21.5) in N0157.

So all this implies that Prodigal could not produce the longer version of Rv0349 in N1216 (if it did, this should have had the 20something score), implying there may be a genuine mutation that inserts a stop codon or deletes the GTG start codon.

What I would do:
Blastn the full Rv0349 sequence from N0157 against the N1216 genome and post the alignment here and we can figure it out.

If the alignment doesn't show any major differences, one way to standardize training (that a lot of people don't really use) is to produce a training file using a canonical strain (-t option to write a training file). Then call the gene prediction (-t option again, if file exists, it reads it in) in each strain using the same training file. This guarantees the same parameters are being used to call genes. No clue if Prokka even supports this option.

@hyattpd
Copy link
Owner

hyattpd commented Dec 28, 2024

Another weird thing to check is if there's any evidence from other species that Rv0348 and Rv0349 are the same protein (just blastp both against nr and see if they're ever the two halves of a single protein in another species). In this case, the stop codon separating the two genes may be a mutation, and either or both of the genes may be pseudogenes. (Or selenocysteine, but that one would be easy to verify). Prodigal doesn't call/understand selenocysteine translation (it's something I was going to add via db search but never did).

@0xaf1f
Copy link
Author

0xaf1f commented Dec 28, 2024

Blastn the full Rv0349 sequence from N0157 against the N1216 genome and post the alignment here and we can figure it out.

$ blastn -query N0157.fasta -query_loc 422107-422766 -subject N1216.fasta
BLASTN 2.15.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: User specified sequence set (Input: N1216.fasta).
           1 sequences; 4,393,757 total letters



Query= 1 4410932

Length=4410932
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

1 4393757                                                             1214    0.0  


> 1 4393757
Length=4393757

 Score = 1214 bits (657),  Expect = 0.0
 Identities = 659/660 (99%), Gaps = 0/660 (0%)
 Strand=Plus/Plus

Query  422107  GTGCCAGAGCTGGAGACGCCCGACGACCCAGAGTCGATATACCTTGCCCGCCTCGAGGAT  422166
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421988  GTGCCAGAGCTGGAGACGCCCGACGACCCAGAGTCGATATACCTTGCCCGCCTCGAGGAT  422047

Query  422167  GTCGGAGAACACAGACCGACGTTCACGGGCGACATCTACCGACTCGGCGATGGTCGCATG  422226
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422048  GTCGGAGAACACAGACCGACGTTCACGGGCGACATCTACCGACTCGGCGATGGTCGCATG  422107

Query  422227  GTGATGATCCTCCAGCACCCATGCGCGCTGCGGCACGGCGTTGACCTCCATCCGCGACTG  422286
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422108  GTGATGATCCTCCAGCACCCATGCGCGCTGCGGCACGGCGTTGACCTCCATCCGCGACTG  422167

Query  422287  CTGGTCGCTCCCGTAAGACCCGACTCGCTTCGTTCCAACTGGGCTAGAGCCCCGTTCGGC  422346
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422168  CTGGTCGCTCCCGTAAGACCCGACTCGCTTCGTTCCAACTGGGCTAGAGCCCCGTTCGGC  422227

Query  422347  ACGATGCCGCTTCCGAAGCTCATCGACGGTCAGGATCACTCGGCGGACTTCATCAATCTT  422406
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422228  ACGATGCCGCTTCCGAAGCTCATCGACGGTCAGGATCACTCGGCGGACTTCATCAATCTT  422287

Query  422407  GAACTCATCGATTCACCAACGCTTCCGACCTGTGAGCGGATCGCGGTGCTCAGCCAGTCA  422466
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422288  GAACTCATCGATTCACCAACGCTTCCGACCTGTGAGCGGATCGCGGTGCTCAGCCAGTCA  422347

Query  422467  GGCGTCAACTTGGTCATGCAACGGTGGGTGTACCACAGCACCCGGCTCGCCGTGCCCACG  422526
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422348  GGCGTCAACTTGGTCATGCAACGGTGGGTGTACCACAGCACCCGGCTCGCCGTGCCCACG  422407

Query  422527  CACACCTACTCCGACAGCACCGTTGGCCCGTTCGATGAGGCAGACCTGATCGAGGAGTGG  422586
               |||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
Sbjct  422408  CACACCTACTCCGACAGCGCCGTTGGCCCGTTCGATGAGGCAGACCTGATCGAGGAGTGG  422467

Query  422587  GTGACGGATCGCGTCGACGATGGGGCCGACCCGCAGGCGGCCGAACACGAATGCGCCTCC  422646
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422468  GTGACGGATCGCGTCGACGATGGGGCCGACCCGCAGGCGGCCGAACACGAATGCGCCTCC  422527

Query  422647  TGGCTCGATGAAAGAATCAGCGGCCGCACTCGGCGAGCGCTGCTCAGCGACCGTCAGCAC  422706
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422528  TGGCTCGATGAAAGAATCAGCGGCCGCACTCGGCGAGCGCTGCTCAGCGACCGTCAGCAC  422587

Query  422707  GCCAGTTCAATACGGCGAGAAGCGCGTTCTCATCGAAAGTCGGTCAAGCTGGCGGACTGA  422766
               ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  422588  GCCAGTTCAATACGGCGAGAAGCGCGTTCTCATCGAAAGTCGGTCAAGCTGGCGGACTGA  422647



Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 2803202930


  Database: User specified sequence set (Input: N1216.fasta).
  Number of letters in database: 4,393,757
  Number of sequences in database:  1



Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5

Translation of these coordinates doesn't show any internal stop:

$ samtools faidx  N1216.fasta 1:421988-422647 | transeq -filter -table 11
>421988-422647_1
VPELETPDDPESIYLARLEDVGEHRPTFTGDIYRLGDGRMVMILQHPCALRHGVDLHPRL
LVAPVRPDSLRSNWARAPFGTMPLPKLIDGQDHSADFINLELIDSPTLPTCERIAVLSQS
GVNLVMQRWVYHSTRLAVPTHTYSDSAVGPFDEADLIEEWVTDRVDDGADPQAAEHECAS
WLDERISGRTRRALLSDRQHASSIRREARSHRKSVKLAD*

If the alignment doesn't show any major differences, one way to standardize training (that a lot of people don't really use) is to produce a training file using a canonical strain (-t option to write a training file). Then call the gene prediction (-t option again, if file exists, it reads it in) in each strain using the same training file. This guarantees the same parameters are being used to call genes.
No clue if Prokka even supports this option.

It does-- I can give that a try.

@hyattpd
Copy link
Owner

hyattpd commented Dec 28, 2024

Looks like only one codon difference... must be an unlikely codon that's lowering the score in one of the strains. Third position A will get a higher score than third position G in a codon, despite translating to the same amino acid in this case.

Another thing to check would be the two Rv0348 genes to see if one has that GGAGG RBS on the reverse strand and the other does not.

If you run N1216 with the -s option, you can output all the possible gene candidates and look at the 421988 (grep for it) 422647 ORF, which would show a complete score breakdown for that candidate and see why it didn't get called.

@hyattpd
Copy link
Owner

hyattpd commented Dec 28, 2024

You often see this case where there are two possible ways to call genes in a particular stretch, and the scores are very close together between the two possibilities. So just very slight differences can make it call one way vs another. In this case, the N1216 annotations are probably wrong. This is where a database search could correct the mistakes (there have also been papers using pangenome approaches like you're doing to detect gene calling errors).

Unfortunately, it's hard to ever be totally consistent. If I were to, say, add a penalty to choosing the overlapping reverse strand gene (like in N1216), this would just lead to errors where now it incorrectly doesn't choose overlapping opposite strand genes in other cases, etc etc. Ab initio gene finding algorithms are a game of whack-a-mole, where adding rules to fix one situation inevitably leads to mistakes in other situations, so you end up doing the best you can.

@0xaf1f
Copy link
Author

0xaf1f commented Dec 28, 2024

If you run N1216 with the -s option, you can output all the possible gene candidates and look at the 421988 (grep for it) 422647 ORF, which would show a complete score breakdown for that candidate and see why it didn't get called.

$ { head -n4 N1216.prodigal.scores; grep -w 421988  N1216.prodigal.scores;} | column -t -s$'\t'
# Sequence Data: seqnum=1;seqlen=4393757;seqhdr="1 4393757"                                                                                                                                        
# Run Data: version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=65.61;transl_table=11;uses_sd=1                                                                                      
Beg    End     Std  Total  CodPot  StrtSc  Codon  RBSMot  Spacer  RBSScr  UpsScr  TypeScr  GCCont
421988 422647  +    18.78  21.92   -3.14   GTG    None    None    -3.56   2.20    -1.78    0.630

I'm not at the moment concerned with ensuring that this ORF always called, but just rather that things are consistent--either always called if the sequence is there or never called if the sequence is there. And for this, using the same training file for both runs seems to have solved that problem, at least for this example. I will have to rerun everything to be sure.

These are the results now after I've trained prodigal on the reference genome H37Rv and used that training file for both of these strains:

N1216

1       Prodigal_v2.6.3 CDS     421332  421985  70.3    +       0       ID=1_368;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.610;conf=100.00;score=70.27;cscore=48.67;sscore=21.60;rscore=14.22;uscore=2.00;tscore=4.13;
1       Prodigal_v2.6.3 CDS     421940  422143  6.7     -       0       ID=1_369;partial=00;start_type=GTG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.627;conf=82.48;score=6.74;cscore=-0.45;sscore=7.19;rscore=9.18;uscore=0.75;tscore=-2.24;
1       Prodigal_v2.6.3 CDS     422231  422647  11.3    +       0       ID=1_370;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.624;conf=93.00;score=11.25;cscore=8.68;sscore=2.57;rscore=-2.16;uscore=0.61;tscore=4.13;

N0157

1       Prodigal_v2.6.3 CDS     421451  422104  70.9    +       0       ID=1_371;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.609;conf=100.00;score=70.92;cscore=49.31;sscore=21.60;rscore=14.22;uscore=2.00;tscore=4.13;
1       Prodigal_v2.6.3 CDS     422059  422262  6.7     -       0       ID=1_372;partial=00;start_type=GTG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.627;conf=82.48;score=6.74;cscore=-0.45;sscore=7.19;rscore=9.18;uscore=0.75;tscore=-2.24;
1       Prodigal_v2.6.3 CDS     422350  422766  11.7    +       0       ID=1_373;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.621;conf=93.66;score=11.71;cscore=9.14;sscore=2.57;rscore=-2.16;uscore=0.61;tscore=4.13;

The shorter form of Rv0349 is being called in both places too now, but that's ok for my purposes.

Many thanks for all your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants