GFF3::Attributes::Variants #15

Open
lpantano opened this issue Jun 28, 2017 · 12 comments

Comments

@lpantano
Contributor

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC @sinanugur @Bastami @haebhardt

Let's discuss the Variant attribute, which will give information about the type of isomiR. I think the main idea is to get a CIGAR/TAG-like string that can be parsed and gives the full information about the change.

Some previous discussion is here: https://github.com/miRTop/incubator/blob/master/isomirs/isomir_naming.md

Does anybody have more ideas for this? For instance, how would you name this isomiR:

ref --AAAAAAAAAAAAAAAAAAAA
iso AAAACAAAAAAAAAAAAAA--TT

This isomiR starts 2 nucleotides before the reference and ends 2 nucleotides before it as well. It has TT as a nucleotide addition and a nucleotide change at position 5, A->C.

I think there are two general ways to describe this, TAG-wise or CIGAR-like; please propose others if you work with different ones. We can use both and define an attribute for each of them as well, as @gurgese just mentioned in another issue.

TAG-wise or similar (it could be more general as well):

miRNA-5p.AAs.5AC.aa.TTe

CIGAR-like or similar, or just use the CIGAR exactly as in BAM files:

AAI2M5C19MAADTT

Either way we need to define it exactly. So please, propose one example of what you use or would like to have or you are missing, and I'll try to merge them all and propose the final definition that we can discuss further for minor details.

Cheers

@mhalushka
Collaborator

I don't have an opinion either way on TAG vs CIGAR. Is the plan to document every single isomiR with a label? How far does one want to go down the rabbit hole of defining every single isomiR that can exist for a miRNA? With very deep RNA-seq and tools such as chimera and miRge, one can get 100s of isomiRs for abundant miRNAs. Do we want to really have a nomenclature for all of them? Is that required for the .gff3 format to work? I am perhaps a bit confused by the intersection of the isomiR data and the .gff3 style.

@lpantano
Contributor Author

Thanks for the thoughts.

Well, the idea of the format is to be as unbiased as possible with respect to the tool. There are things that we cannot avoid, like how the tool maps to the miRNAs, but once mapped I think it is good to report every sequence that mapped to any miRNA. It is ok if some tools want to be conservative and don't trust mutations, for instance, or any other kind of variant.

But the idea is to have a file that people can reduce if they want, trusting whatever they decide, or to which they can apply any downstream method to create a final list they trust. Or, for instance, you can use the FILTER attribute to set the sequence to PASS or FAIL or any other value, to tell people what your tool labels as a good isomiR/miRNA/sequence.

For instance, I would like to have the output of 3 tools, and say, ok, I'll merge all 3' variants to one type, and so on ... but if we don't have the full information there, this is impossible.

I can think of a case where a tool directly gives an isomiR that represents multiple sequences, and trusts this feature rather than every single sequence. I think that is good, and we can adapt to it with a label that indicates this. So people can directly use whatever concept the tool defines as an isomiR, for instance:

We could have this:

AAAAAAAAAAAAA
AAAAAAAAAAA
AAAAAAAAAAAAAAAA

So you report each sequence individually, with the CIGAR/TAG, FILTER=PASS and DEFINED=SINGLE, and then add another line for the merged product without CIGAR, with a general TAG, DEFINED=MULTIPLE and FILTER=PASS.
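
A minimal sketch of that layout in Python, assuming the FILTER and DEFINED attribute names from the paragraph above. The `TAG` key, the `describe` function and the tag strings are hypothetical placeholders, not part of any agreed format:

```python
def describe(sequences, tags, general_tag):
    """Return one attribute dict per sequence plus one for the merged product."""
    if len(sequences) != len(tags):
        raise ValueError("need exactly one tag per sequence")
    # One record per observed sequence, each flagged as a single sequence.
    records = [
        {"sequence": seq, "TAG": tag, "FILTER": "PASS", "DEFINED": "SINGLE"}
        for seq, tag in zip(sequences, tags)
    ]
    # Extra record for the merged product: no single sequence, only a general tag.
    records.append(
        {"sequence": None, "TAG": general_tag, "FILTER": "PASS", "DEFINED": "MULTIPLE"}
    )
    return records


# Placeholder tags: the real CIGAR/TAG syntax is still under discussion.
for record in describe(
    ["AAAAAAAAAAAAA", "AAAAAAAAAAA", "AAAAAAAAAAAAAAAA"],
    ["tag_1", "tag_2", "tag_3"],
    "general_tag",
):
    print(record)
```

A user who trusts the tool's own isomiR concept would keep only the DEFINED=MULTIPLE record; one who wants every sequence would keep the DEFINED=SINGLE ones.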

In that case, every tool shows everything and it is easy for the user to trust what they want to trust.

I think it is ok to have a big file with everything; it is like BAM/VCF files, where downstream you decide what to do with the information and how to quantify miRNAs/isomiRs.

The tool that goes with this format should help with whatever action we want to apply to the file, like filtering, even creation of count matrix, merging, things like that. So the same way we have samtools for BAM, we can have mirtop for GFF3-miRNA-adapted format.

this is mainly my logic here.

@Bastami
Collaborator

Bastami commented Jul 1, 2017

@lpantano
Thank you for the clarification. I agree with the logic. I personally prefer to have all the information clearly recorded for downstream analysis, so that nothing is missed.
Defining separate attributes for CIGAR and TAG sounds interesting.

@sinanugur
Member

@lpantano

OK, maybe a stupid question, but what defines a canonical form? I mean, this nomenclature suggests that there is a canonical form (reference) and the other isoforms are named after it, right? So is miRBase the gold standard here?

If that is the case, then it is good to name all possible isomiRs as you suggested.

Cheers,

@lpantano
Contributor Author

lpantano commented Jul 3, 2017

Thanks for the comment @sinanugur,

we are not defining canonical as equal to reference. I know it seems they could be the same, but it is slightly different. The idea is to follow a method similar to a variant calling pipeline, where you map against a reference genome and report variants relative to it.

You can use any database as reference: there is mirGeneDB as well, or any custom one, as long as you put the name in the source column and the exact link in the header, mentioning the version as well.

That way all is traceable. So, reference, means the reference database used for the analysis.

Thanks for contributing!

@mhalushka
Collaborator

Lorena - to that definition, make sure version numbers are part of the database reference as the value can change (and has changed) with updated versions. I would add that it would be ideal if everyone could settle on a single database (or as few as possible) from which "canonical" sequences are obtained. Also, how are you proposing to incorporate SNPs that appear in some mature miRNAs into this nomenclature?

@lpantano
Contributor Author

lpantano commented Jul 5, 2017

Thanks Marc for the comment!

I agree that the version is important, so we can add it to the column following some formatting. The header information should also state specifically where the database was taken from.

I agree that we should use as few as possible, and I think that is what is going to happen. But there are cases where neither of these two databases is good enough for very specific species, and people generate their own custom database based on experiments. That's the main reason we allow flexibility here. And I am sure it will be a minority.

About the SNP question: I think that is what we are trying to address here. CIGAR or TAG system should work, so in principle this will appear in the file as an isomiR, with a PASS or REJECT attribute, and Variant attribute that should be enough to know where the SNP/mismatches are: 4AC, or whatever system we think we agree on. For sure, if the sequence has other variants they will appear as well here. But I think that is fine, right?

I know that the question could be going in another direction, so please give us an example of what exactly you meant; I am sure we can get some ideas.

@ThomasDesvignes
Member

Hi,
sorry for the late participation. I think the CIGAR may be more fitting for computational analysis (at least that's the one we use because that's what the aligner we use returns) but the TAG is easier to read and understand for a (non-coder :) ) human.
I think it's important to get a CIGAR or TAG description for every single isomiR that we obtain, especially because it means it's a system that can work with everything. And I totally agree that we can then go through the CIGARs or the TAGs to sort things into the main isomiR type categories. But at the same time I think that in parallel there should be isomiR type names to use for writing purposes, which would be much more simplified and cover the main isomiR types (3' templated/untemplated additions, seed shifted isomiR, edited miRNAs...) but without as much detail as is needed for coding/bioinformatic purposes (I think that is what Marc meant), and those are two really different things. It would be completely irrelevant to write in a paper something like "miRNA-5p.AAs.5AC.aa.TTe" or "AAI2M5C19MAADTT" to describe the variations of a given isomiR... This case could be explained/written as "minus 2 seed shifted isomiR with a seed edit at position 5 and extended 3' end" or something easy to talk about, with a focus on the (expected) functional difference.
One risk with getting a CIGAR/TAG for every isomiR is describing too many things, including things that are actually not real. Should we suggest a read count cutoff for additional confidence, especially for edited miRNAs? And as Marc mentioned, what about SNPs (I don't think I've ever seen any that wouldn't be clearly attributable to PCR errors or editing, but they may exist)? But all that is kind of included in some other attributes, so maybe it is not necessary...

@lpantano
Contributor Author

lpantano commented Jul 7, 2017

Thanks Thomas for the comments!

I agree in general with everything. I am now thinking that CIGAR plus a more general TAG, one that only mentions whether the sequence has 3', 5', SNP or addition modifications, would be enough. Maybe we can add an attribute that specifically compares the SEED nts with the reference SEED, just to have a quick view of how different this region is.

As for the cutoff for reads, etc., my opinion is that you can always keep the sequence there with a REJECT value to indicate that the tool does not trust it. Actually, this attribute can give the reason why the sequence is rejected, the same logic as in VCF files. So the FILTER attribute can be:

  • PASS
  • low counts
  • error sequencing
  • ... any other filters that the tool is applying.

I think the format we are defining should allow all the data to be included, but I have nothing against removing lines if the tool doesn't give everything; as long as there is enough information to know what is going on, I would be happy.

For instance, I can imagine that a tool wants to ignore all the SNPs, to be safe. And I can imagine that the tool won't output all sequences with SNPs because it doesn't want to add all that information. But then the question is what to do with the counts of these sequences. I imagine 3 scenarios:

  • simply not report them
  • not report them but add the counts to the sequences that are the same but without the SNP
  • report the sequences but not trusting the SNPs are real

In an ideal world, the format should be good enough to know what is going on. So, we can add these rules to these scenarios:

  • comment in the header that the tool is not reporting whatever kind of isomiRs
  • use some attribute that says that the isomiR is actually a collapse of several sequences. We can use FILTER like: PASS,collapsing
  • we need a tag to report when a tool decides a sequence is a technical error. Maybe we can use the FILTER attribute and add something like: PASS,technical-error. That way the user can easily spot these cases inside the GFF file.
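
As a rough illustration of how a user could spot such cases, here is a Python sketch that reads a comma-separated FILTER value out of a GFF3 attribute column (`key=value` pairs separated by `;`). The comma-separated layout follows the examples in the list above; the function names and the example attribute string are made up for illustration:

```python
def parse_attributes(column):
    """Split a GFF3 attribute column into a dict of key -> raw value."""
    attributes = {}
    for field in column.strip().split(";"):
        if not field.strip():
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


def filter_flags(attributes):
    """Return the FILTER values as a list; empty if FILTER is absent."""
    raw = attributes.get("FILTER", "")
    return [flag.strip() for flag in raw.split(",") if flag.strip()]


# Hypothetical attribute column, for illustration only.
example = "FILTER=PASS,technical-error;DEFINED=SINGLE"
flags = filter_flags(parse_attributes(example))
print(flags)
print("technical-error" in flags)
```

Keeping the flags as a plain list leaves the decision to the user: whether a record marked both PASS and technical-error is kept or dropped is a downstream choice, not something the parser decides.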

These are my thoughts, so if nobody has strong feelings against this, I think we can adopt these rules in the format: give developers the freedom to report untrusted isomiRs in different ways, but keep the flexibility to report everything if the tool is designed for that.

@gurgese, @mlhack , do you have any ideas for the CIGAR/TAG? it would be awesome to have your inputs.

Thanks!

@gurgese
Collaborator

gurgese commented Jul 19, 2017

Personally, I believe that a field in the output for hosting a high-level label (tag) can be useful for many reasons.
These labels can be designed to represent classes of reads mapped to the reference miRNA sequences. For instance, one class can group reads that diverge from the reference because of an insertion on the 5p side, and another class can include all the reads mapped with a deletion on the 5p side. Other classes can be designed to represent particular differences detected during the alignment that help the final user decide whether to include a read in the downstream analysis.
Tags can be generated for representing supplementary classes and not only particular isomiR combinations.
I use several labels (that I call interaction sites) to represent if a read conserves nucleotides from the reference in specific positions, thus enabling the possibility to investigate the presence or absence of interaction sites with functions in the miRNA-mRNA binding.

The CIGAR system is good for representing point variations, but supplementary analysis steps are required to adapt the data for filtering and group-by operations.
So I would prefer to have two different fields for CIGAR and label.

I agree to include the filtering system proposed by @lpantano and to include in the output all the detected sequences, even those belonging to untrusted classes.

@lpantano
Contributor Author

Hi all,

Thanks for all the comments. I'll finish the draft in the next week, and then you can comment to modify whatever I missed or misunderstood.

Cheers

@lpantano
Contributor Author

Hi all,

after working on the code and the files from different tools, I modified the format slightly for the Variant field to get a better resolution of information. I paste below the explanation. Feel free to chime in.

@gurgese, I hope you can integrate this into the GFF that you are implementing.
@mhalushka, I'll talk to your postdoc to implement the changes during this week.

Thanks

  * Variant: (categorical types - adapted from isomiR-SEA)
    * iso_5p:+/-N. `+` indicates extra nucleotides not in the reference miRNA. `-` indicates removed nucleotides not in the sequence. `N` is the number of nucleotides of difference. For instance, if the sequence starts 2 nts after the reference miRNA, the label will be `iso_5p:-2`, but if it starts 2 nts before, the label will be `iso_5p:+2`.
    * iso_3p:+/-N. Same explanation applied.
    * iso_add:+N. Same explanation applied.
    * iso_snp_seed: when affected nucleotides are between [2-8].
    * iso_central_offset: when the affected nucleotide is at position [8].
    * iso_snp_central: when affected nucleotides are between [9-12].
    * iso_central_supp: when affected nucleotides are between [13-17].
    * iso_snp: anything else.
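
To make the categories concrete, here is a small Python sketch that builds these labels. It is only an illustration: the function names are hypothetical, positions are assumed to be 1-based, and because position 8 falls in both the [2-8] and [8] ranges as written above, the sketch arbitrarily checks `iso_central_offset` first.

```python
def snp_label(position):
    """Map a 1-based mismatch position to one of the labels listed above."""
    if position < 1:
        raise ValueError("positions are assumed to be 1-based")
    # Assumption: position 8 is listed under two labels; central offset wins here.
    if position == 8:
        return "iso_central_offset"
    if 2 <= position <= 8:
        return "iso_snp_seed"
    if 9 <= position <= 12:
        return "iso_snp_central"
    if 13 <= position <= 17:
        return "iso_central_supp"
    return "iso_snp"


def end_label(kind, delta):
    """Format iso_5p / iso_3p / iso_add; delta > 0 means extra nucleotides."""
    if kind not in {"iso_5p", "iso_3p", "iso_add"}:
        raise ValueError("unknown label: " + kind)
    if delta == 0:
        raise ValueError("no difference to report")
    if kind == "iso_add" and delta < 0:
        raise ValueError("iso_add only takes +N")
    return f"{kind}:{delta:+d}"  # '+d' always prints the sign


print(end_label("iso_5p", 2))
print(end_label("iso_3p", -2))
print(snp_label(5))
```

How several labels for one sequence are joined into a single Variant value is not specified in the list above, so the sketch stops at producing individual labels.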


6 participants