Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem interpretating table #217

Open
MarioRinBarr opened this issue Jun 21, 2024 · 5 comments
Open

Problem interpretating table #217

MarioRinBarr opened this issue Jun 21, 2024 · 5 comments
Labels
question Further information is requested

Comments

@MarioRinBarr
Copy link

Hello,

I am writing to you because I am having trouble understanding the output of my results. I have performed a RNA sequencing with RNA004 and I want to know which sites are m6A. When I get the table, which I attach to the message, some sites are treated as A, even if the consensus sequence are not A. For example, site 9942 has 3 As of which one would be methylated. I understand that this, having such low numbers, is simply sequencing errors, a point mutation, errors in basecalling... in any case, something that should not be taken into account.
The problem is in sites like 9654, that in the consensus sequence and in the original sequence there is not an A. However the Nvalid_cov says that there are 856 sequences with A, about 10%. This could be due to a point mutation, since this is a virus. What is surprising is that over 16% of the As at this site are methylated. I don't see much sense that at a site where there is supposedly a point mutation there is so much methylation.
On the other hand, I also do not understand that in some bases there are such high Nnocall values, such as 9853, where there is a coverage of more than 10000 sequences and less than 0.03% of the bases are not included in this column. From what you explain in the page, this may be due to point mutations that prevent basecalling. However, in the original sequence and the consensus sequence they are the same, an A, and therefore this value does not make sense.
doubts.xlsx

@ArtRand
Copy link
Contributor

ArtRand commented Jun 25, 2024

Hello @MarioRinBarr,

Sorry for the delayed response. Let me see if I can take on your questions one at a time.

When I get the table, which I attach to the message, some sites are treated as A, even if the consensus sequence are not A. For example, site 9942 has 3 As of which one would be methylated. I understand that this, having such low numbers, is simply sequencing errors, a point mutation, errors in basecalling... in any case, something that should not be taken into account.

This is the correct interpretation. One note, if you want to avoid getting these records in the future you can use --motif A 0 or --motif DRACH 2 so the only reference sequence positions matching A or DRACH will be emitted - although I don't think that's really what you're getting at here.

The problem is in sites like 9654, that in the consensus sequence and in the original sequence there is not an A. However the Nvalid_cov says that there are 856 sequences with A, about 10%. This could be due to a point mutation, since this is a virus. What is surprising is that over 16% of the As at this site are methylated. I don't see much sense that at a site where there is supposedly a point mutation there is so much methylation.

This is somewhat difficult to answer. One interpretation is this is a systematic basecalling error. In some contexts basecall errors will have biased base modification calls, for example base calling errors are sometimes more likely to be called as modified. But I wouldn't consider 16% modification as a strong indicator of this error mode. If you think it's possible that there is a sub-population of molecules that do have an A at position 9654, you could try the following steps (which are somewhat experimental):

  1. Change the reference to an A at 9654.
  2. Extract the reads with a matching A aligned to 9654.
  3. Re-run modification calling using 'reference anchoring' in remora
  4. Look at the modification rate using modkit pileup.

If you still see high levels of m6A, then this is probably a decent estimate of the actual frequency in the sample.

I also do not understand that in some bases there are such high Nnocall values, such as 9853, where there is a coverage of more than 10000 sequences and less than 0.03% of the bases are not included in this column. From what you explain in the page, this may be due to point mutations that prevent basecalling. However, in the original sequence and the consensus sequence they are the same, an A, and therefore this value does not make sense.

Are you using the DRACH m6A model or the all-context m6A model? The simplest explanation I have is that you're using the DRACH-motif model and the reference sequence is not DRACH, however 0.03% of the sequences have an error or mutation that does change it to DRACH and then get a modification call. Could you confirm which model you're using?

Happy to continue to answer questions (and more quickly next time).

@ArtRand ArtRand added the question Further information is requested label Jun 25, 2024
@MarioRinBarr
Copy link
Author

Thank you for your answers.
The model I was using was the m6A DRACH. I did the basecalling again with m6A general model, and as you comment in your 3rd answer, the Nncall values change: now all values are 0. I will try the other solutions you commented.
On the other hand, I have another question about the Ndiff values. If I understand correctly, in places like 9976, where there are very high values, what should have happened is that in the original sequence there was a base that was not A and there was a substitution in many, many cases. So I would expect the Nvalid values to be very high and similar. But this does not happen: Nvalid is only 1, and looking at the assembly, there are indeed Gs in that position, not As. Am I misinterpreting the values in this column?

Thank you

@ArtRand
Copy link
Contributor

ArtRand commented Jun 26, 2024

Hello @MarioRinBarr,

So I would expect the Nvalid values to be very high and similar. But this does not happen: Nvalid is only 1, and looking at the assembly, there are indeed Gs in that position, not As. Am I misinterpreting the values in this column?

The $N_{\text{valid}}$ column only applies to reads containing base modification information. So in the case of position 9976, there are 10080 reads with different bases and 1 valid read with a base modification call. You can think of it this way, a read with a base modification probability can either be considered valid or fail, adding to $N_{\text{valid}}$ or $N_{\text{fail}}$, respectively. But a read without base modification information (a G in this case) can't be triaged into pass or fail, so it doesn't get counted in either of $N_{\text{valid}}$ or $N_{\text{fail}}$).

@lkwhite
Copy link

lkwhite commented Jun 27, 2024

Art, just to clarify, how does this column interact with thresholding in modkit pileup? If I pass a filter of 0.98, does every read aligning to a given location with a modification probability below that filter get added to Ncanonical?

@ArtRand
Copy link
Contributor

ArtRand commented Jun 27, 2024

@lkwhite

No, if you pass --filter-threshold 0.98 then all base modification predictions (modified or canonical) must have a probability >= 0.98. So if you have just a single modification, using this setting you're requiring that $P_{\text{modified}} \ge 0.98$ OR $P_{\text{modified}} \le 0.02$. modkit does not default to a canonical call (however, you can make this happen using specific settings see this thread).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants