GFF3::seqID #12

Open
lpantano opened this issue Jun 8, 2017 · 16 comments

Comments

@lpantano
Contributor

lpantano commented Jun 8, 2017

Hi all again!

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

I will open one issue per column type. Let's see if that makes it easier to get at least a few people commenting.

The first column is for the chromosome ID. That raises the question of whether we should use the genomic position or the precursor position, or allow both if the header gives the exact database and version for the precursor.

I am increasingly inclined to use the genomic position because that should be the same across databases, on the condition that the hairpin is also in the file as a parent feature. It would look like:

chr1 mirbase hairpin start end .... id=hairpin1
chr1 mirbase mirna start end ... parent=hairpin1

The only thing I am not clear about is what we do with miRNAs that have multiple precursors on the genome. I can only think of adding an attribute like other_parents=hairpin2,hairpin3... and those parents should be in the GFF3 file as well.
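
A minimal sketch of how that layout could be written out, assuming the standard nine-column GFF3 layout with `ID`/`Parent` attributes. The `other_parents` attribute is only the idea floated above, and the helper names, coordinates, and feature IDs are made up for illustration.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 line; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)


def mirna_with_parents(seqid, start, end, strand, mirna_id, parents):
    """Use the first hairpin as Parent and list the rest in other_parents.

    'other_parents' is the hypothetical attribute proposed in this issue,
    not an established GFF3 attribute.
    """
    attrs = {"ID": mirna_id, "Parent": parents[0]}
    if len(parents) > 1:
        attrs["other_parents"] = ",".join(parents[1:])
    return gff3_line(seqid, "mirbase", "mirna", start, end, strand, attrs)


# Illustrative coordinates only.
lines = [
    gff3_line("chr1", "mirbase", "hairpin", 1000, 1080, "+", {"ID": "hairpin1"}),
    mirna_with_parents("chr1", 1010, 1031, "+", "mirna1",
                       ["hairpin1", "hairpin2", "hairpin3"]),
]
print("\n".join(lines))
```

In a real file, hairpin2 and hairpin3 would also need their own hairpin lines so that every listed parent can be resolved.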

Please comment with new ideas, whether you agree or disagree, or any scenario I am missing...

Thanks!

lpantano added this to the draft for GFF3 format milestone Jun 8, 2017
@mhalushka
Collaborator

I believe the genomic position is the best option there. miRNAs that map to multiple locations can be designated with the additional numbering as you suggested.

@lpantano
Contributor Author

Thanks, I'll keep this open, but I'll move on to the next question.

@ThomasDesvignes
Member

In our own miRNA-Seq analysis tool that we are finalizing (yes, one more tool in an already quite large toolbox...), we work around that by creating "genomic_location" groups of sequences that share the same unique or multiple genomic locations of origin. We allow a wiggle of [user-defined] nucleotides to group isomiRs together, and sequences can be in the same genomic location group only if they share the same location set. For example, an isomiR that maps to two genomic locations won't be in the same genomic location group as an isomiR that maps to only one of the two locations. I'm not sure that helps much for this specific task, but I think that choosing one genomic location over another when a sequence is equally likely to come from either creates a bias issue... In our software, the genomic location ID refers to a series of genomic locations, for example "11:27256137-27256115;5:29390312-29390334", which shows two putative locations for an isomiR group with strand information embedded too (here the first location is on the reverse strand, while the second is on the forward strand).
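
To make the grouping idea concrete, here is a rough sketch (not the tool's actual code). It assumes the ID format shown above, with strand inferred from coordinate order, and it leaves out the user-defined wiggle for brevity. The function names and read names are hypothetical.

```python
from collections import defaultdict


def parse_location_id(location_id):
    """Split 'chrom:start-end;...' into a frozenset of (chrom, start, end, strand).

    Strand is inferred from coordinate order: first > second means reverse strand.
    """
    locations = set()
    for part in location_id.split(";"):
        chrom, span = part.split(":")
        first, second = (int(value) for value in span.split("-"))
        strand = "-" if first > second else "+"
        locations.add((chrom, min(first, second), max(first, second), strand))
    return frozenset(locations)


def group_by_location_set(reads):
    """Group read names that share exactly the same set of genomic locations."""
    groups = defaultdict(list)
    for name, location_id in reads.items():
        groups[parse_location_id(location_id)].append(name)
    return dict(groups)


# Hypothetical read names; the location IDs reuse the example above.
reads = {
    "isomiR_a": "11:27256137-27256115;5:29390312-29390334",
    "isomiR_b": "5:29390312-29390334;11:27256137-27256115",
    "isomiR_c": "5:29390312-29390334",
}
groups = group_by_location_set(reads)
for locations, names in groups.items():
    print(sorted(locations), names)
```

Here isomiR_a and isomiR_b land in one group because their location sets are identical, while isomiR_c, which maps to only one of the two locations, gets its own group.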

@Bastami
Collaborator

Bastami commented Jun 24, 2017

In my opinion, as Thomas pointed out, embedding information about the strand is critical, as there are many examples of miRNAs located on opposite strands at the same genomic position (e.g. hsa-mir-499a & hsa-mir-499b).
Regarding miRNAs that belong to multiple precursors, I think no bias occurs as long as all parents are recorded in the GFF3 file.

@FlorianThibord

Dear members of the mirtop project,

I've been adding support for miRGFF3 output to a miRNA-seq pipeline, and I have a doubt about this seqID value. I've seen in the examples mentioned here and in the preprint that the precursor ID should be given in this column. What about the mature ID? Could it be used instead? Or would it create compatibility issues when using mirtop?
I'm currently aligning to mature sequences, and I thought it would be more coherent in that case to use the mature ID as the seqID (with start/end where the read aligned on this mature sequence). Has this been discussed before? (I've been browsing the issues but did not find a relevant topic.)

And on a side note, thank you for developing this. I've been struggling with isomiR definition myself, and this will be a very useful project for the miRNA community!

@lpantano
Contributor Author

lpantano commented Feb 4, 2019

Hi @FlorianThibord,

Thanks so much for the question. I don't think we considered this, but it is a valid point.

We can try to adapt our tool to be compatible with that. It shouldn't be a lot of work, but I would need a test file to work with. We normally use the same set of sequences to test the tool and all the functions we code.

Just out of curiosity, do you detect isomiRs that are -2 nt at the 5p end of the reference sequence? In that case, do you have information on whether these 2 nt map to the precursor, or do you just not look at that?

Let me know if this plan works for you and I will send you the sequences I need in the GFF3 format you are producing, where the seqID is the mature one.

Thanks! :)

PS: You are welcome to join if you want to be more involved, let me know!

@FlorianThibord

Thanks @lpantano for your reply,

Of course, I'll gladly produce some test files in that format if it will help.
I can still detect 5' or 3' additions and determine whether they are templated by comparing with the nucleotides surrounding the mature sequence in the hairpin sequence(s). So I'm able to detect isomiRs with iso_5p:-2 variants.
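
As a rough illustration of that comparison (a sketch with made-up toy sequences, not the pipeline's actual code), the extra 5' bases of a read can be checked against the hairpin bases just upstream of the mature sequence. The function name and its arguments are hypothetical, and the number of extra bases is assumed to come from the alignment.

```python
def classify_5p_extension(read, hairpin, mature_start, n_extra):
    """Classify extra 5' bases on a read as templated or non-templated.

    read         -- read sequence, beginning with the n_extra additional bases
    hairpin      -- precursor sequence
    mature_start -- 0-based index where the mature sequence begins in the hairpin
    n_extra      -- number of extra 5' bases (assumed known from the alignment)
    """
    if n_extra == 0:
        return "none"
    if n_extra > mature_start:
        # The hairpin has too few upstream bases to compare against.
        return "non-templated"
    flank = hairpin[mature_start - n_extra:mature_start]
    return "templated" if read[:n_extra] == flank else "non-templated"


# Toy sequences, made up for illustration.
mature = "TGAGGTAGTAGGTTGTATAGTT"
hairpin = "ACGT" + mature + "TTAGG"
mature_start = hairpin.index(mature)

print(classify_5p_extension("GT" + mature, hairpin, mature_start, 2))
print(classify_5p_extension("AA" + mature, hairpin, mature_start, 2))
```

With several candidate hairpins, the same check could be run against each one, and the addition reported as templated if any of them matches.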

And sure I'd be happy to bring my modest contribution to the project!

@lpantano
Contributor Author

lpantano commented Feb 6, 2019

Perfect.
Can you send me the output you get when you use this as input:
https://github.com/miRTop/incubator/blob/master/synthetic/synthetic/synthetic_100_full.fq

It has the standard Illumina adapter: TGGAATTCTCGGGTGCCAAGGAACTC

Can you tell me the affiliation you want to use to join the team?

Thanks

@FlorianThibord

Great I'll get working on it asap.
Also, I'll get back to you concerning my affiliation

@FlorianThibord

Hi,
You'll find the resulting gff file here: synthetic_100_full.gff3.tar.gz
You might notice the presence of an additional attribute (Expression_OptimiR), which corresponds to the final expression computed by my pipeline. I'm not sure how I should report it in there.

Concerning my affiliation: Florian Thibord, PhD student. INSERM UMR_S 1219, Bordeaux Population Health Research Center, University of Bordeaux, Bordeaux, France
Thanks!

@lpantano
Contributor Author

Hi @FlorianThibord

Thanks for doing this. I think it is almost perfect.

I have a couple of requests only:

The version in the file is correct, but the UID is from version 1.0. We moved to a more commonly used ID scheme, by Mintplate. Is there any way you could use the dev branch in mirtop to create the ID? I think you took it from master; I am sorry I forgot to mention this.

Other minor details:

  • Can you end each line with ;?
  • Can you use a lowercase first letter for Expression_OptimiR -> expression_OptimiR?
  • Can you use the , character to separate multiple Parents: Parent=hsa-let-7a-2/hsa-let-7a-3/hsa-let-7a-1 -> Parent=hsa-let-7a-2,hsa-let-7a-3,hsa-let-7a-1?
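
For what it's worth, the three fixes could be applied to an attribute column with something like the sketch below. It is only an illustration: the function name is hypothetical, the expression value is made up, and the code assumes fields are separated by `;` with either `=` or a space between key and value.

```python
def normalize_attributes(attr_text):
    """Apply the three requested fixes to one attribute column.

    - rename Expression_OptimiR to expression_OptimiR
    - separate multiple Parent values with ',' instead of '/'
    - terminate the column with ';'
    """
    fields = [field.strip() for field in attr_text.split(";") if field.strip()]
    fixed = []
    for field in fields:
        # Keep whichever key/value separator the field already uses.
        sep = "=" if "=" in field else " "
        key, _, value = field.partition(sep)
        if key == "Expression_OptimiR":
            key = "expression_OptimiR"
        if key == "Parent":
            value = value.replace("/", ",")
        fixed.append(f"{key}{sep}{value}")
    return ";".join(fixed) + ";"


# The expression value 12 is made up for illustration.
before = "Parent=hsa-let-7a-2/hsa-let-7a-3/hsa-let-7a-1;Expression_OptimiR=12"
print(normalize_attributes(before))
```

Note that this sketch joins fields with a bare `;`, so any spacing after the separators in the input is not preserved.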

After that, it would be pretty easy to integrate this into mirtop!

Thanks again!

@FlorianThibord

Hi, thanks for the feedback.
Yes, I did take the UID from the master branch and did not check that the version matched. I will look into the dev branch to make the necessary changes.
Otherwise, the minor details should be easy to fix!
I will get back to you when it's done.

@lpantano
Contributor Author

Hey @FlorianThibord

Did you have a chance to update the UID? If not, you can remove it and I will adapt mirtop to be compatible with that, as long as you add the sequence to the line.

Thanks!

@FlorianThibord

Hi @lpantano
Yes, sorry for the delay. I made the changes and I think the format is compatible now. Here is the new file processed with optimiR: synthetic_100_full.gff3.tar.gz
Let me know if there is something I can do to help.

@lpantano
Contributor Author

lpantano commented May 9, 2019

Hi @FlorianThibord ,

I think it is almost there. I noticed a couple of typos:

  • When there are no variants, you should add "Variant NA" to the line.
  • If you add the Changes attribute, rename Change to Changes, and it should list the same number of isomiR types as the Variant attribute. So if you have Variant iso_5p:+1,iso_3p:-4,iso_add3p:9; you should have three entries in Changes. You can remove it if you want, since mirtop will add this information if needed, so it is not mandatory.
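
A quick way to check these two points on each line could look like the sketch below. It is not mirtop's validator: the function names are hypothetical, the attribute column is assumed to hold space-separated `Key value;` pairs, Changes is assumed to be comma-separated like Variant, and the Changes values in the example are placeholders.

```python
def parse_attributes(attr_text):
    """Parse 'Key value;' pairs into a dict (assumes space-separated pairs)."""
    attrs = {}
    for field in attr_text.split(";"):
        field = field.strip()
        if field:
            key, _, value = field.partition(" ")
            attrs[key] = value.strip()
    return attrs


def check_variant_and_changes(attr_text):
    """Return a list of problems; an empty list means both checks pass."""
    attrs = parse_attributes(attr_text)
    problems = []
    variant = attrs.get("Variant")
    if not variant:
        problems.append("missing Variant (use 'Variant NA' when there are no variants)")
        return problems
    if "Change" in attrs:
        problems.append("attribute 'Change' should be named 'Changes'")
    changes = attrs.get("Changes")
    if changes is not None:
        n_variants = len(variant.split(","))
        n_changes = len(changes.split(","))
        if n_variants != n_changes:
            problems.append(
                f"Variant has {n_variants} entries but Changes has {n_changes}"
            )
    return problems


# 'a,b' are placeholder Changes values, not real ones.
print(check_variant_and_changes("Variant iso_5p:+1,iso_3p:-4,iso_add3p:9; Changes a,b;"))
```

Since Changes is optional, a line that carries only a Variant attribute passes both checks.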

Thanks a bunch! We are almost there.

@FlorianThibord

Thanks @lpantano ,
I discarded the Changes attribute since it's not mandatory, and mirtop can fill the field if necessary. I also added "NA" to the Variant attribute when there are no variants.
I think third time's the charm! Here is the file: synthetic_100_full.gff3.tar.gz
