Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translation Issue arising from isambiguous(DNA_Gap) == false #277

Open
chelate opened this issue Jun 22, 2023 · 2 comments · Fixed by #278
Open

Translation Issue arising from isambiguous(DNA_Gap) == false #277

chelate opened this issue Jun 22, 2023 · 2 comments · Fixed by #278

Comments

@chelate
Copy link

chelate commented Jun 22, 2023

So I would expect

isambiguous(DNA_Gap) == true

however, in fact "DNA_Gap" is NOT ambiguous. This has some surprising implications, such as that gaps code for alanine.

julia> translate(LongDNA{4}("---"))
1aa Amino Acid Sequence:
A

I'd strongly suggest either making gaps ambiguous, or possibly mapping to an amino acid gap (is there such a thing?). Maybe I'm just bad at bioinformatics, but I think some data does not distinguish between "gaps" (as in deletions) and missing coverage?

@jakobnissen
Copy link
Member

Dear @chelate

Thank you for the bug report. DNA_Gap is not ambiguous, since it does not correspond to one of multiple possible nucleotides. In fact, it corresponds to zero nucleotides. It's questionable whether we should even have added gap symbols.

However, the translation of gap nucleotides was definitely a bug. I've fixed this in #278 , after which trying to translate a sequence with gaps will throw an exception. There are two alternative behaviours:

  1. Translate to AA_Gap. However, this only works if the gap corresponds to whole codons, and this is not something we can assume. We could theoretically translate those gap codons to AA_Gap, then error on partial gapped codons, but I think this might be too confusing.
  2. Silently skip gaps when translating. However, this might be unexpected (for example, one would no longer be able to assume that the translated sequence is 1/3rd the size of the input sequence)

Implementing it as throwing an error is more conservative in the sense that it'll always be legal to change it to a non-throwing behaviour later.

@chelate
Copy link
Author

chelate commented Jul 12, 2023

Sorry if I'm annoying, but I have some input on this issue. I agree now that "-" is not ambiguous.

It would make much more sense to translate "---" to AA_Gap. I'd argue that "---" is how you'd express AA_Gap in standard code. If you are working with nt and align them in frame, these gaps will show up as "---" and there's really no other way to express them. These are the ubiquitous result of aligning to reference sequences for variable length proteins.

The only ambiguous thing is when your alignment assigns some nucleotides to dangling out of frame like "--G". In my case this has happened and that's how I found this bug. Erroring then is reasonable (, but conservative and annoying in practice). I'd prefer AA_X with a warning like "translating out-of-frame alignment as an ambiguous amino acid AA_X" it just means that somebody allowed out of frame alignments upstream in the pipeline from me, or that there was an insertion at the PCR level, depends what the pipeline looks like.

Sorry I don't know how to use git or I'd pr this myself. shame

@jakobnissen jakobnissen reopened this Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants