corresponding ORF not called in different isolates despite identical sequence #115
Comments
So, none of this is a bug. The strains aren't identical, so the results aren't going to be either. Prodigal (and any genefinder) will always make mistakes (both false positives and false negatives), so this doesn't really rise to the level of an issue (more like perhaps something to work on in the algorithm itself). Differences in a presence/absence matrix are also expected in a pangenome; it's kind of the whole point. You wouldn't expect identical annotations anyway. Some genes will be dispensable and some will be core. The case you've highlighted is really an interesting one.
So all this implies that Prodigal could not produce the longer version of Rv0349 in N1216 (if it could, that version should have gotten the 20-something score), implying there may be a genuine mutation that inserts a stop codon or deletes the GTG start codon.
What I would do: if the alignment doesn't show any major differences, one way to standardize training (that a lot of people don't really use) is to produce a training file from a canonical strain (the -t option writes a training file). Then run the gene prediction in each strain with that same training file (-t again; if the file exists, Prodigal reads it in). This guarantees the same parameters are being used to call genes. No clue if Prokka even supports this option.
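A minimal sketch of that shared-training workflow, driving the Prodigal command line from Python. The -i, -t, -f and -o flags are Prodigal's own; the helper names and file names (`reference.trn`, the `.gff` outputs) are hypothetical and only there for illustration.

```python
import subprocess
from pathlib import Path


def training_command(reference_fasta, training_file):
    """Build the Prodigal call that trains on the reference genome.

    With -t, Prodigal writes the training file when it does not exist yet.
    """
    return ["prodigal", "-i", str(reference_fasta), "-t", str(training_file)]


def prediction_command(genome_fasta, training_file, output_gff):
    """Build the Prodigal call that predicts genes using an existing training file."""
    return [
        "prodigal",
        "-i", str(genome_fasta),
        "-t", str(training_file),  # read, because the file already exists
        "-f", "gff",
        "-o", str(output_gff),
    ]


def run_with_shared_training(reference_fasta, genome_fastas, training_file="reference.trn"):
    """Train once on the reference, then call genes in every genome with that file."""
    training_file = Path(training_file)
    if not training_file.exists():
        subprocess.run(training_command(reference_fasta, training_file), check=True)
    for fasta in genome_fastas:
        gff = Path(fasta).with_suffix(".gff")  # hypothetical output naming
        subprocess.run(prediction_command(fasta, training_file, gff), check=True)
```

Calling `run_with_shared_training("H37Rv.fasta", ["N0157.fasta", "N1216.fasta"])` would then use one set of trained parameters for both isolates, assuming `prodigal` is on the PATH.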
Another weird thing to check is if there's any evidence from other species that Rv0348 and Rv0349 are the same protein (just blastp both against nr and see if they're ever the two halves of a single protein in another species). In this case, the stop codon separating the two genes may be a mutation, and either or both of the genes may be pseudogenes. (Or selenocysteine, but that one would be easy to verify.) Prodigal doesn't call/understand selenocysteine translation (it's something I was going to add via db search but never did).
Translation of these coordinates doesn't show any internal stop:
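A check of this kind can be sketched in a few lines of plain Python. This is an illustration only, not the command used in this thread: the function names are made up, coordinates are assumed to be 1-based and inclusive as in Prodigal's output, and the standard codon table is used (translation table 11 shares its stop codons).

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G at each position).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def internal_stops(genome, begin, end, strand="+"):
    """Return 1-based codon positions of stop codons before the final codon."""
    orf = genome[begin - 1:end]
    if strand == "-":
        orf = reverse_complement(orf)
    protein = translate(orf)
    return [i + 1 for i, aa in enumerate(protein[:-1]) if aa == "*"]
```

An empty list means the stretch translates cleanly up to its last codon.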
It does. I can give that a try.
Looks like only one codon difference... must be an unlikely codon that's lowering the score in one of the strains. Third position A will get a higher score than third position G in a codon, despite translating to the same amino acid in this case. Another thing to check would be the two Rv0348 genes, to see if one has that GGAGG RBS on the reverse strand and the other does not. If you run N1216 with the -s option, you can output all the possible gene candidates and look at the 421988-422647 ORF (grep for 421988), which would show a complete score breakdown for that candidate so you can see why it didn't get called.
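As a rough sketch of the grep step, the filter below pulls one candidate out of the file written by -s. It assumes one candidate per line with the begin and end coordinates as whitespace-separated fields, and it skips lines starting with `#`; the file name `N1216.scores` and the function name are hypothetical.

```python
def find_candidates(lines, begin, end):
    """Return lines whose whitespace-separated fields include both coordinates."""
    wanted = {str(begin), str(end)}
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        if wanted <= set(line.split()):
            hits.append(line.rstrip("\n"))
    return hits


# Usage, assuming the candidates were written with: prodigal ... -s N1216.scores
# with open("N1216.scores") as handle:
#     for hit in find_candidates(handle, 421988, 422647):
#         print(hit)
```

Matching on whole fields rather than substrings avoids picking up unrelated candidates whose coordinates merely contain the same digits.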
You often see this case where there are two possible ways to call genes in a particular stretch, and the scores are very close together between the two possibilities. So just very slight differences can make it call one way vs another. In this case, the N1216 annotations are probably wrong. This is where a database search could correct the mistakes (there have also been papers using pangenome approaches like you're doing to detect gene calling errors). Unfortunately, it's hard to ever be totally consistent. If I were to, say, add a penalty to choosing the overlapping reverse strand gene (like in N1216), this would just lead to errors where now it incorrectly doesn't choose overlapping opposite strand genes in other cases, etc etc. Ab initio gene finding algorithms are a game of whack-a-mole, where adding rules to fix one situation inevitably leads to mistakes in other situations, so you end up doing the best you can.
I'm not at the moment concerned with ensuring that this ORF is always called, but rather that things are consistent: either always called if the sequence is there or never called if the sequence is there. And for this, using the same training file for both runs seems to have solved that problem, at least for this example. I will have to rerun everything to be sure. These are the results now after I've trained Prodigal on the reference genome H37Rv and used that training file for both of these strains:
N1216
N0157
The shorter form of Rv0349 is being called in both places too now, but that's ok for my purposes. Many thanks for all your help.
In doing an analysis of gene presence/absence, I had some false positives that I tracked down to Prodigal not consistently calling the ORF, even though the sequence was present and identical in the other strains. I'll show an example of two genomes here: Mycobacterium tuberculosis isolates N0157 and N1216.
Prodigal Output
I ran Prodigal, compiled from source from the latest git snapshot, as
to match the way that Prokka invokes it, although I suppose defaults should be fine since I'm using complete assemblies with no gaps.
N1216
Prodigal calls three consecutive ORFs: The first is Rv0348, the second is a new prediction, and the third is Rv0349.
N0157
Prodigal does not call any ORF between Rv0348 and Rv0349.
Sequence Check
The sequence of the 204 bp middle ORF that was called in N1216 is identical to a corresponding sequence in N0157 that also lies between Rv0348 and Rv0349:
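For reference, an identity check like this can be sketched in plain Python by searching the other genome for exact matches on both strands. The function names are hypothetical and the sequences are assumed to be already loaded as strings; this is not the exact procedure used for the comparison above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(haystack, needle):
    """Return 1-based start positions of every (possibly overlapping) exact match."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start + 1)
        start = haystack.find(needle, start + 1)
    return positions


def locate_orf(genome, orf):
    """Find exact copies of an ORF on either strand of a genome.

    Returns (position, strand) pairs; positions are forward-strand coordinates
    of the leftmost matching base.
    """
    genome, orf = genome.upper(), orf.upper()
    hits = [(pos, "+") for pos in find_all(genome, orf)]
    rc = reverse_complement(orf)
    if rc != orf:  # skip the second search for palindromic sequences
        hits += [(pos, "-") for pos in find_all(genome, rc)]
    return hits
```

A single hit at the expected position in each genome would confirm that the sequence is present and identical, so any difference in calling comes from scoring rather than from the sequence itself.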
Epilogue and Data
This is one example, but I have many false positives that I suspect are similar scenarios. In this particular example, the ORF is called in 49 genomes and not called in 72 genomes. I'm including the genome sequences for the two strains I examined above.
data.tar.gz
@althonos, this issue also exists in pyrodigal 3.6.3 from bioconda.