Start and end on miRNA paralogs #19

Open
xbdr86 opened this issue May 22, 2018 · 8 comments

Comments

@xbdr86
Collaborator

xbdr86 commented May 22, 2018

Hi! I have a question for the miRTop community.

How would you define the "precursor start/end" for reads that can be assigned to several paralogs? Roughly 15% of described miRNAs have multiple copies with an identical mature sequence.

column4/5: start/end: precursor start/end as indicated by alignment tool

@lpantano
Contributor

lpantano commented May 22, 2018 via email

@xbdr86
Collaborator Author

xbdr86 commented May 23, 2018

Hi @lpantano!

Thanks for your fast response!

For instance, I was thinking of mir-9, which I have been working on recently. This mature miRNA can arise from 3 different paralogs:
mir-9-1 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000466)
mir-9-2 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000467)
mir-9-3 (http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000468)

miR-9 is extremely abundant in brain and therefore generates hundreds of 3' isomiRs. Interestingly, when the paralogs are studied separately, only one of them (paper coming soon, hopefully!) generates a 5' isomiR of functional importance (Tan et al., NAR 2014). I think an annotation system that assigns this 5' isomiR to every paralog could be misleading for future interpretation of the data. So far, in our custom program QuagmiR (https://github.com/Gu-Lab-RBL-NCI/QuagmiR/), we have been annotating all miR-9 reads under the following naming structure:

hsa-miR-9-5p-1-2-3
hsa-miR-9-3p-1-2-3

On the practical side, annotating each read under multiple gene locations would generate a significant amount of duplicated data in the GFF file, although I don't see an easy way to deal with columns 1, 3, 4.

Have a nice day!

@lpantano
Contributor

lpantano commented May 23, 2018 via email

@xbdr86
Collaborator Author

xbdr86 commented Jun 25, 2018

Hi!

Yes, you are right: the question of which parent pri-miRNA to assign is quite important for us. Do you think it might work to arbitrarily assign reads that can belong to multiple parents to paralog 1, and to indicate in the attributes that the sequence has, let's say, 3 paralogs? Any read that can be uniquely mapped to one of the paralogs would be assigned to the corresponding parent.

For example:
Given the following parents for miR-7-5p

>hsa-mir-7-1 MI0000263
UUGGAUGUUGGCCUAGUUCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**UUU**AGAUAACUAAAUCGACAACAAAUCACAGUCUGCCAUAUGGCACAGGCCAUGCCUCUACAG 

>hsa-mir-7-2 MI0000264
CUGGAUACAGAGUGGACCGGCUGGCCCCAUC_UGGAAGACUAGUGAUUUUGUUGUU_**GUC**UUACUGCGCUCAACAACAAAUCCCAGUCUACCUAAUGGUGCCAGCCAUCGCA

>hsa-mir-7-3 MI0000265
AGAUUAGAGUGGCUGUGGUCUAGUGCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**CUG**AUGUACUACGACAACAAGUCACAGCCGGCCUCAUAGCGCAGACUCCCUUCGAC

The following reads would then be presented in the GFF like this:

_UGGAAGACUAGUGAUUUUGUUGUU_ hsa-miR-7-1 READ_COUNT=1000 NUMBER_OF_PARALOGS=3
_UGGAAGACUAGUGAUUUUGUUGUU_**UUU** hsa-miR-7-1 READ_COUNT=1000 NUMBER_OF_PARALOGS=1
_UGGAAGACUAGUGAUUUUGUUGUU_**GUC** hsa-miR-7-2 READ_COUNT=1000 NUMBER_OF_PARALOGS=1
_UGGAAGACUAGUGAUUUUGUUGUU_**CUG** hsa-miR-7-3 READ_COUNT=1000 NUMBER_OF_PARALOGS=1
Legend: _canonical-sequence_, **templated-tail**
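The assignment rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of miRTop or QuagmiR: the `locate` function and the 1-based inclusive coordinates are hypothetical choices, the search is an exact substring match, and `str.find` reports only the first occurrence of the read on each hairpin. The hairpin strings are copied from the example above, with the `_` and `**` markers stripped in code.

```python
from typing import Dict, Tuple

# Hairpins as written in the example above; "_" and "**" are visual markers only.
RAW_PRECURSORS = {
    "hsa-mir-7-1": "UUGGAUGUUGGCCUAGUUCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**UUU**AGAUAACUAAAUCGACAACAAAUCACAGUCUGCCAUAUGGCACAGGCCAUGCCUCUACAG",
    "hsa-mir-7-2": "CUGGAUACAGAGUGGACCGGCUGGCCCCAUC_UGGAAGACUAGUGAUUUUGUUGUU_**GUC**UUACUGCGCUCAACAACAAAUCCCAGUCUACCUAAUGGUGCCAGCCAUCGCA",
    "hsa-mir-7-3": "AGAUUAGAGUGGCUGUGGUCUAGUGCUGUG_UGGAAGACUAGUGAUUUUGUUGUU_**CUG**AUGUACUACGACAACAAGUCACAGCCGGCCUCAUAGCGCAGACUCCCUUCGAC",
}


def clean(seq: str) -> str:
    """Remove the emphasis markers and surrounding whitespace from a hairpin."""
    return seq.replace("_", "").replace("*", "").strip()


PRECURSORS = {name: clean(seq) for name, seq in RAW_PRECURSORS.items()}
MATURE = "UGGAAGACUAGUGAUUUUGUUGUU"  # canonical miR-7-5p sequence from the example


def locate(read: str, precursors: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Map each precursor containing the read to its (start, end) on that hairpin.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    Only exact matches are found, and only the first one per hairpin.
    """
    hits = {}
    for name, hairpin in precursors.items():
        pos = hairpin.find(read)
        if pos != -1:
            hits[name] = (pos + 1, pos + len(read))
    return hits


if __name__ == "__main__":
    for read in (MATURE, MATURE + "UUU", MATURE + "GUC", MATURE + "CUG"):
        hits = locate(read, PRECURSORS)
        print(read, len(hits), hits)
```

The canonical read matches all three hairpins at a different start/end on each, which is exactly why a single pair of precursor coordinates is ambiguous, while each read with a templated tail matches one hairpin only.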

PS: Sorry for the long delay in my response; I missed the notification e-mail from GitHub.

@lpantano
Contributor

Hi,

No worries. I think it is better to name the other paralogs, in case some tool wants to do something with that information. I am happy to add another attribute, other_parents, with the names separated by commas. I am happy to have the number of paralogs as well. Let me know and I'll add that to the definition file on GitHub. You also have the Hits attribute, which you can use for this information: https://github.com/miRTop/incubator/blob/master/format/definition.md

Let me know if that helps. Thanks for working on this!
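A minimal sketch of how such an attribute string could be assembled, assuming the attribute names used so far in this thread: `READ_COUNT` and `NUMBER_OF_PARALOGS` come from the example above and `other_parents` from this comment, while the `Parent` key and the `;` / `key=value` layout follow general GFF3 conventions. None of these names is confirmed here as part of the final miRTop definition, and `paralog_attributes` is a hypothetical helper.

```python
from typing import List


def paralog_attributes(parents: List[str], read_count: int) -> str:
    """Build a GFF3-style attribute string that keeps every candidate parent.

    The alphabetically first parent is listed under Parent purely for
    determinism; the remaining ones go into other_parents, comma-separated.
    All attribute names are assumptions taken from this discussion.
    """
    if not parents:
        raise ValueError("a read needs at least one parent precursor")
    ordered = sorted(parents)
    attrs = {
        "Parent": ordered[0],
        "READ_COUNT": str(read_count),
        "NUMBER_OF_PARALOGS": str(len(ordered)),
    }
    if len(ordered) > 1:
        attrs["other_parents"] = ",".join(ordered[1:])
    return ";".join(f"{key}={value}" for key, value in attrs.items())


if __name__ == "__main__":
    print(paralog_attributes(["hsa-mir-7-1", "hsa-mir-7-2", "hsa-mir-7-3"], 1000))
    print(paralog_attributes(["hsa-mir-7-2"], 1000))
```

Keeping the full list in one attribute avoids duplicating the read on several GFF lines while still letting downstream tools recover every possible locus.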

@ThomasDesvignes
Member

Hi,
I agree with Lorena. In my case I am really interested in knowing from which gene/locus a mature sequence can originate, because the regulatory elements of each locus may be different and each locus may therefore be involved differently in various situations. For example, I know that in fish one locus is more expressed than the other in some tissues, while in another tissue the other locus is the most expressed, and that matters to me. Therefore, I don't like the idea of arbitrarily attributing a sequence to a paralog, and I prefer conserving the complete information. This matters especially because some isomiRs are slightly longer, and then we can know with confidence that they come from only one of the paralogs.

@phillipeloher
Collaborator

Internally we favor reporting everything so that nothing is missed. Below are some illustrative examples - I picked some random sequences to illustrate the point.

Example 1 shows that on 3 hairpins the 3p end of the isomiR differs by 1 nt from the annotated mature. But on one of the hairpins, for the same sequence, the 3p end differs by 2 nt (in the opposite direction) from the annotated mature.

Example 2 shows a sequence that could come from 5 different precursors.

Example 1:
isomiR Sequence TGGGGCGGAGCTTCCGGAGGC with possible locations:
MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1
MIMAT0018178_1&hsa-miR-3180&offsets|0|+2
MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1
MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1

Example 2:
isomiR Sequence CTCTAGAGGGAAGCACTTTCT with possible locations:
MIMAT0002845_1&hsa-miR-526a-5p&offsets|0|-1
MIMAT0002845&hsa-miR-526a-5p&offsets|0|-1
MIMAT0002841&hsa-miR-518f-5p&offsets|0|-1
MIMAT0005456&hsa-miR-518d-5p&offsets|0|-1
MIMAT0005455&hsa-miR-520c-5p&offsets|0|-1
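For readers who want to work with location strings like the ones above, here is a small parsing sketch. It assumes the layout seen in these two examples, `accession&name&offsets|<first>|<second>`, and it assumes that the first number is the 5p offset and the second the 3p offset, which is inferred from the description of example 1 rather than from any documented format. `Location` and `parse_location` are hypothetical names.

```python
from typing import NamedTuple


class Location(NamedTuple):
    accession: str   # e.g. MIMAT0015058_2
    name: str        # e.g. hsa-miR-3180-3p
    offset_5p: int   # assumed: first number after "offsets"
    offset_3p: int   # assumed: second number after "offsets"


def parse_location(entry: str) -> Location:
    """Split one 'accession&name&offsets|a|b' string into its fields."""
    accession, name, offsets = entry.strip().split("&")
    label, first, second = offsets.split("|")
    if label != "offsets":
        raise ValueError(f"unexpected field label: {label!r}")
    # int() accepts an explicit sign, so "+2" and "-1" both parse.
    return Location(accession, name, int(first), int(second))


EXAMPLE_1 = [
    "MIMAT0015058_2&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0018178_1&hsa-miR-3180&offsets|0|+2",
    "MIMAT0015058_1&hsa-miR-3180-3p&offsets|0|-1",
    "MIMAT0015058&hsa-miR-3180-3p&offsets|0|-1",
]

if __name__ == "__main__":
    for entry in EXAMPLE_1:
        print(parse_location(entry))
```

Parsed this way, the same read in example 1 carries two different 3p offsets depending on the hairpin, which is the point the example makes about reporting every location.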

Some other things to consider:

  • We like to report the sequence (and/or license plate) of an isomiR to help avoid confusion when annotation entries (e.g. miRBase) or assemblies change.
  • Not all isomiRs generated from a precursor overlap with an annotated mature. In that case, reporting them by coordinates and/or precursor offsets is often helpful.
  • Reporting offsets as in the above examples is OK (people are used to it), but it indirectly implies that the annotated mature is the most abundant or the correct one. This is often not the case. Also, annotations in resources like miRBase and miRCarta don't always match in this regard.

@xbdr86
Collaborator Author

xbdr86 commented Jun 27, 2018

Thanks @lpantano @ThomasDesvignes @phillipeloher ! I will take into account your suggestions! ;-)

4 participants