Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pi cluster within-pool #50

Open
aspitaleri opened this issue May 13, 2021 · 11 comments
Open

pi cluster within-pool #50

aspitaleri opened this issue May 13, 2021 · 11 comments

Comments

@aspitaleri
Copy link

Hi Chase -
I have performed the within-pool using the vcf report as input. Now I have several output and I am trying to understand (after reading your papers and references reported in the README.md) how to use them for my purpose. I have been reading previous issues #48 and #49 since they are strongly correlated to my purpose.
My aim is to characterize each sample by its nucleotide diversity (i.e. pi, piS, piN, pi_coding, pi_noncoding) as function of variant frequency and then compare between samples. At given frequency, I built a pairwise distance matrix between samples using as metric Euclidean distance between pi, pi_coding and pi_noncoding. On that distance matrix I have carried out either a linkage clustering or a dimension reduction as UMAP, to perform a clustering analysis. Interestingly, pi_coding is giving few clusters.
Said so, now I'd like to link those cluster to the effect of the mutations, i.e. piN/piS which is identical to dN/dS as far as I understood in my case. I will make an average/variance of piN/piS per cluster.
Any comment on that?
The thread #48 could be useful in my case too, running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?

Then, reading further I read the McDonald-Kreitman test:
[pN/pS] / [dN/dS]
where
dS: the number of synonymous substitutions per gene
dN: the number of non-synonymous substitutions per gene
pS: the number of synonymous polymorphisms per gene
pN: the number of non-synonymous polymorphisms per gene

Now I got lost/confused and I ask your kind help.
pS and pN are not piS and piN, right? I am asking because the definition are very similar between them.

thank for any comment on that! very appreciate
Best

@singing-scientist
Copy link
Contributor

Greetings!

First, I am a little confused by "characterize each sample by its nucleotide diversity (i.e. pi, piS, piN, pi_coding, pi_noncoding) as function of variant frequency". But all measures of π are determined entirely by the variant numbers and frequencies — so what exactly are you comparing and why?

I do not well understand your distance method, but it seems you are characterizing differences between the π of samples, i.e., you are asking which are the most (or least) diverse? My first thought is that, depending on your system, pi_coding may simply be too low (e.g., too few variants) for any meaning to be extracted. This will depend on how polymorphic your system is.

What do you mean by "running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?" I think I need to know more what you're trying to compare between samples.

piN/piS and dN/dS are (often) calculated identically; their major difference is 'philosophical'. pi refers to variation within a population or species; d refers to divergence between different species. Because species are much more different than members of the same population, d (between-species) is sometimes further 'corrected' for the possibility of multiple mutations having occurred at the same site ('multiple hits'), e.g., using a Jukes-Cantor correction.

You're exactly right that this is totally different than the MK test. The MK test uses total counts of sites with that type of difference (the number of sites that differ), that is, it's count data, and does not factor in their frequencies. Thus, "dN" and "dS" for MK are not the same as the most common definition of dN and dS, and in my opinion should not be called dN and dS because it's so confusing.. Their original 1991 paper does not use those names. Perhaps calling them fN and fS (numbers of sites with fixed nonsynoynmous and synonymous differences between species) and pN and pS (numbers of sites with polymorphic nonsynoynmous and synonymous differences within species) would be less confusing.

For all of these questions, the answer about what approach to use totally depends on exactly what hypothesis you want to test. I suggest carefully defining your hypothesis/question first, then choosing the one test that is best suited to address it. Otherwise, there are infinite possibilities, and one is just aimlessly "fishing" for results.

Let me know if that helps!
Chase

@aspitaleri
Copy link
Author

aspitaleri commented May 14, 2021

Hi Chase,
thank for your reply. I will make me clearer (hopefully :)

But all measures of π are determined entirely by the variant numbers and frequencies — so what exactly are you comparing and why?

I run SNPGenie using different --minfreq, with the aim to see how the different calculated output differ as function of the min freq. I would expect that at determined value of freq, there is not difference between samples, because the samples are from a short period of time and hence the pathogen should not have time to move strongly ahead in the evolution. So basically, which is the minimum freq by which the samples starts to be different? Using only the major variants they are very similar. See pic, in which the colored dots are the variants. A1/A2 are identical between them at %5 but different with A3. After 5% they are all identical.

you are asking which are the most (or least) diverse? My first thought is that, depending on your system, pi_coding may simply be too low (e.g., too few variants) for any meaning to be extracted. This will depend on how polymorphic your system is.

The idea is that using π diversity (distance matrix) I would find clusters between samples, so basically some samples could have similar diversity. Then this similar diversity background could be related to the metadata (i.e. household, contact direct) or outcome?

What do you mean by "running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?" I think I need to know more what you're trying to compare between samples.

From #48 and https://github.com/chasewnelson/SNPGenie#sliding-windows-bootstrapping my idea is to get also the πN/πS per sample and then compare πN/πS per each sample. Does it make sense?

For all of these questions, the answer about what approach to use totally depends on exactly what hypothesis you want to test.

My hypotheses are:
Could the minor variants at a particular frequency differ samples which are identical considering only major variants? If so, the haplotype at this frequency could give us info about the evolution of the pathogens in different host?

One of the read inspiring is this:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0215574 (Figure 2).

Best

minor_major

@singing-scientist
Copy link
Contributor

Greetings, @aspitaleri ! Thanks for sharing this fascinating idea. Indeed, if there has not been enough time for new variants to fix, it makes perfect sense that the consensus (major allele) sequences of all the lineages would be identical.

Now I understand better. Yes, I think it makes sense to calculate πN and πS individually for each sample, and then see how samples compare. You will just want to be sure that the differences are not an artifact of your SNP calling method.

Finally, I think the within vs. between host diversity distinction you make is extremely valuable and important. I'd definitely think that low-frequency diversity within host offers valuable insight into evolution, even when the consensus sequences are identical. I think there is a literature on the concept of "Beyond the Consensus" you may find valuable.

Let me know if that helps!
Chase

@aspitaleri
Copy link
Author

Hi @cwnelson88 thanks for your feedback!
Indeed you are right - the important things it to avoid SNP artifact (I am removing from vcf coordinate on which I am not trust). I have now another doubt:
the pi_nocoding from population_summary.txt:
Mean number of pairwise differences per site in the pooled sample across all noncoding sites in the genome.

Now, if I want to compare samples I could fall in the cluster problem to get different genomes in the same cluster since they have similar pi_nocoding but the pi_nocoding arises from different contribution along the genome. For instance:
A1 could have pi values (in population_summary) from regions different that from A2, but the overall [A1]pi and [A2]pi are very similar. In that case A1 and A2 do not share same true pi, arising from identical region.
How to solve this now ...?
Best

@singing-scientist
Copy link
Contributor

Hello @aspitaleri ! These are interesting questions. They're technically on the user end (this is something the user could do with SNPGenie results), but one possibility is to "manually" calculate noncoding pi yourself, for custom sites/regions/sliding windows. Note that π = (N_diffs+S_diffs)/(N_sites+S_sites). For example it might be convenient or meaningful to split your genome into genes, or overlapping windows, or non-overlapping windows. You could even go so far as to cluster samples using each site as a different attribute (variable). Each site could be assigned its own pi as a value. But these are just ideas — all of this really depends on exactly what hypothesis you're trying to test. For example, do you already know other characteristics of these samples? Could you attempt to predict those characteristics based on your clustering-by-diversity method?

Just ideas!

@aspitaleri
Copy link
Author

Hi @cwnelson88 indeed I was thinking to use the data from product_results.txt, which contains information per product. I will then use π = (N_diffs+S_diffs)/(N_sites+S_sites) and test my idea over those products, pairwise.
Thanks for your help

@singing-scientist
Copy link
Contributor

That sounds very reasonable! Then, if certain products are showing similarities, you can get more specific and examine individual sites too. Good luck!

@pascalangst
Copy link

Hi @singing-scientist

As follow up questions regarding the MK-test:

  1. Did I understand correctly that there is no possibility to calculate "dn", "ds", "pn", "ps" from the population_summary file? (i.e. going back to counts from frequencies)

  2. With the ouputs of SNPGenie, the closest we can get to an MK-Test is using N_diffs (=pn), S_diffs (=ps), N_diffs_vs_ref (="dn"), S_diffs_vs_ref (="ds") from the product_results file, correct?

@singing-scientist
Copy link
Contributor

Hi @pascalangst !

Unfortunately, I do not think it is possible to make MK calculations directly from either SNPGenie file. But it may be possible to derive it from one of the outputs. Please allow me to 'think aloud' as we consider just fN: the number of sites with FIXED NONSYNONYMOUS differences compared to ANOTHER species.

For population_summary, I cannot think of a way to get this.

For product_results, if the reference sequence you used was that of the OTHER species, then N_diffs_vs_ref would be the mean number of nonsynonymous differences per nonsynonymous site between each member of the population (ingroup) and the other species (outgroup; reference) for that codon. At the same time, N_diffs is the mean number of nonsynonymous differences per nonsynonymous site within the population (ingroup) ONLY. Given this, it seems possible you could count fN by calculating the number of rows where N_diffs_vs_ref>0 AND N_diffs==0 (i.e., there are differences between species but not within the ingroup, which logically implies a fixed difference). However, this refers to the WHOLE CODON, and it's not straightforward to figure out whether one or more of the (up to 3) N sites within the codon is fixed or polymorphic.

Thus, I think the best bet is the site_results file. This file lists individual sites that either (1) differ from the reference or (2) contain polymorphism. Again, if your reference is the other species, this could yield the necessary counts. The class_vs_ref column gives the type of difference when compared to the reference sequence, while the class column gives the type of difference observed within sample/species. If class_vs_ref=='Nonsynonymous' AND class=='NONREF_NONPOLY', then this is a fixed difference between the sample (ingroup) and the reference sequence (NONREF*) but there is no polymorphism within the sample (*NONPOLY); thus, fN should be equal to the number of rows with this combination. Same for fS (synonymous). pN would be number of rows with class=='Nonsynonymous', regardless of the value of class_vs_ref. Same for pS (synonymous). Note that if the site is 'Ambiguous', this means there are both Nonsynonymous and Synonymous variants or mutational pathways observed at the same site and it cannot be unambiguously assigned (can be excluded from consideration). Thus, unless I've overlooked something, it does seem possible to count the necessary numbers of sites used in the MK test from the SNPGenie output in this way (i.e., number of rows of each type/combination).

Please keep in mind there are many variations of the MK test; I never use it and am not an expert in its applications; and SNPGenie is not designed for it. One limitation of the MK test is that it does not consider the frequencies of the different types of variants (it just counts sites), which (unlike dN/dS and piN/piS) eliminates a critical source of information. Another limitation is that the relaxation of purifying selection during a population bottleneck can mimic selection. Given those caveats, the 'hack' suggested above can probably yield the counts you want if this is the version of MK you're interested in using.

Hope that helps —
Chase

@pascalangst
Copy link

Hi Chase.

Many thanks for your thoughts on that. Mapping to a closely related species is what we did. The results of the MK calculations differ only slightly from the original idea of creating them from the product_results, compared to creating them from the site_results. However, I also think making them from the site_results is probably more correct and I'll move forward using calculations thereof.

@singing-scientist
Copy link
Contributor

Thanks @pascalangst ! That sounds like a good plan. It makes sense that using product_results would yield a close approximation for a closely related species. However, that approximation considers a codon to be at most 1 site rather than 3, and therefore would increasingly underestimate the numbers of fixed or polymorphic sites with (1) increased divergence or (2) increased polymorphism, because more and more codons would contain more than one fixed diff or polymorphism (yet still are only counted as 1). For these reasons, I affirm your choice to use the site_results — it's more accurate and ultimately more flexible, i.e., can be used for data/circumstances with different amounts of diversity.

Chase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants