
Checking for ratio values #303

Closed
pavlo888 opened this issue May 18, 2020 · 18 comments
Labels
question Further information is requested

Comments

@pavlo888

Hi,

First of all, great plug-in!!!! I think Qurro is a really powerful tool and I have used it successfully so far. However, I wanted to know if there is a way to know the actual value ratios i.e. 1 to 3 or 1 to 1 when I select two taxa?

In the attached file I selected two taxa and I obtained the ratios. Is there a way to know what the actual ratios are?

I hope you understand my inquiry. Thank you in advance for your help!

Cheers,
Pablo
[attached image: qurro_two_taxa_ratio]

@pavlo888 pavlo888 added the question Further information is requested label May 18, 2020
@gibsramen
Collaborator

Hi @pavlo888

Thanks for using Qurro! To clarify, you are interested in the numerator and denominator values of a given log-ratio, correct? In this case, extracting these values may depend on how you've selected the features. If you searched by taxonomy then you should be able to use the qarcoal command which returns the numerator and denominator sums.

If you are selecting features a different way (e.g. autoselection, manual, etc.), to my knowledge there is no way to extract these sums directly from Qurro. What you could do is download the selected features using the "Export Selected Features" option and then calculate the numerator and denominator yourself.

The formula Qurro uses to calculate log-ratios is:

[image: the log-ratio formula, i.e. ln(sum of numerator features) - ln(sum of denominator features)]

so you can just calculate the sum of numerator features as well as the sum of denominator features.
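As a rough illustration of the calculation described above, here is a minimal Python sketch for a single sample. The feature names and counts are made up for the example; only the general form (natural log of the numerator sum divided by the denominator sum) comes from this thread.

```python
import math

# Hypothetical counts for one sample (feature ID -> count); made up for illustration.
numerator_counts = {"feature_A": 30, "feature_B": 10}
denominator_counts = {"feature_C": 5, "feature_D": 15}

# Sum the counts on each side of the ratio.
num_sum = sum(numerator_counts.values())
denom_sum = sum(denominator_counts.values())

# The natural log-ratio is only defined when both sums are positive.
if num_sum > 0 and denom_sum > 0:
    log_ratio = math.log(num_sum / denom_sum)
else:
    log_ratio = float("nan")

print(num_sum, denom_sum, log_ratio)
```

Here the two sums are 40 and 20, so the log-ratio is ln(2).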

When @fedarko gets to this he may also have some advice 😄 .

@pavlo888
Author

Hi @gibsramen

That's exactly what I did! I searched based on taxonomy. I see the qarcoal command is available as a python script but is it also available in the qiime2 plug-in of Qurro?

Cheers,
Pablo

@gibsramen
Collaborator

Yes, there is a QIIME 2 implementation of the qarcoal command as well. An example usage can be seen in the qarcoal example notebook, and I've reproduced the example QIIME 2 command below.

qiime qurro qarcoal \
    --i-table output/qiita_10422_table.biom.qza \
    --i-taxonomy ../DEICODE_sleep_apnea/input/taxonomy.qza \
    --p-num-string g__Allobaculum \
    --p-denom-string g__Coprococcus \
    --o-qarcoal-log-ratios output/allobaculum_coprococcus_log_ratios.qza

@fedarko
Collaborator

fedarko commented May 18, 2020

Hi @pavlo888,

Thanks for the kind comments! @gibsramen is correct -- aside from using Qarcoal to replicate a taxonomy-based selection, there isn't an "easy" way currently to extract the selected summed numerator and denominator values for each sample.

We have an open issue to allow plotting the "raw" ratios instead of the log-ratios in #178, but it sounds like what you would like are the actual numerator and denominator summed values (analogous to what Qarcoal gives you).

Would a solution where we add this information to the "Export current sample plot data" output file (say, by making each sample have a Num_Sum and Denom_Sum column, to be consistent with Qarcoal) be good for you? I don't think implementing this change should take a large amount of time, but there will be a few inconveniences due to the way Qurro's JavaScript code stores log-ratio information (and things are a bit busy now) so I can't guarantee this would be ready any time soon.

@pavlo888
Author

Hi @fedarko,

Indeed I am looking for the raw ratios. I am interested in knowing the actual ratios of two specific genera.

Can that be done with Qarcoal? I have already run it but I am not sure how to interpret the output.

Could you please help me out with this?

Cheers,
Pablo

@fedarko
Collaborator

fedarko commented May 24, 2020

There are a couple of ways of doing this.

Option 1. All you want to know is the "raw" ratios of two genera (and you don't care about the individual numerator or denominator values)

You can just use Qurro to select the log-ratio normally, and then export the selected log-ratios using the Export current sample plot data button. This will give you a TSV file containing the selected log-ratio for each sample. You can load this in Python as a Pandas DataFrame or something (see here for an example of using pd.read_csv() on this), and then you can create a new column, raw_ratio, as follows:

import math
# You may need to filter out samples with a NaN or null log-ratio first
sample_plot_data["raw_ratio"] = math.e**(sample_plot_data["Current_Natural_Log_Ratio"])

This is possible because Qurro computes log-ratios, as @gibsramen mentioned above, by just taking ln(N) - ln(D) (or equivalently ln(N/D)). Taking e**(ln(N/D)) leaves you with just N/D.
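For anyone who wants a fuller version of Option 1, here is a minimal sketch that also handles the missing log-ratios mentioned in the code comment above. It makes a few assumptions: the "Sample ID" column name and the in-memory stand-in data are hypothetical (in practice you would pass the path of your exported TSV to pd.read_csv()), and np.exp() is swapped in for math.e**(...), which does the same thing.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for the exported TSV file. The "Sample ID" column name
# is an assumption; check the header of your own export.
exported_tsv = io.StringIO(
    "Sample ID\tCurrent_Natural_Log_Ratio\n"
    "S1\t0.0\n"
    "S2\t1.0986122886681098\n"
    "S3\tnull\n"
)

# pandas treats "null", "NaN" and empty fields as missing values by default.
sample_plot_data = pd.read_csv(exported_tsv, sep="\t", index_col=0)

# Drop samples whose log-ratio is missing before exponentiating.
sample_plot_data = sample_plot_data.dropna(
    subset=["Current_Natural_Log_Ratio"]
).copy()

# e**(ln(N/D)) = N/D, so this recovers the raw ratio for each sample.
sample_plot_data["raw_ratio"] = np.exp(
    sample_plot_data["Current_Natural_Log_Ratio"]
)

print(sample_plot_data)
```

In this toy data, S3 is dropped and the raw ratios for S1 and S2 come out as 1 and (approximately) 3.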

Option 2. You want to know the actual numerator and denominator values for each sample (i.e. each "half" of the ratio)

You can use Qarcoal for this. You can run Qarcoal with the --p-num-string and --p-denom-string parameters containing text unique to the genera you want to select -- e.g. something like --p-num-string "g__Bacteroides;" as shown here. This will give you a QZA which contains the summed numerator and denominator abundances for each sample -- it's basically a fancy version of the TSV file we worked with in "Option 1" above.

You can load this QZA into a pandas DataFrame in Python as shown here, and at this point you'll already have the numerator and denominator sum information in the Num_Sum and Denom_Sum columns respectively -- to make a column of the raw ratios for each sample you can use either of the following code snippets:

Option 2.1
qarcoal_log_ratios_df["raw_ratio"] = qarcoal_log_ratios_df["Num_Sum"] / qarcoal_log_ratios_df["Denom_Sum"]
Option 2.2

Alternatively, you can also do the following (this is the way we did this with the Qurro sample plot data TSV in "Option 1"):

import math
qarcoal_log_ratios_df["raw_ratio"] = math.e**(qarcoal_log_ratios_df["log_ratio"])

In closing

All three of these approaches should give you the same answer (the same "raw ratios"). You may want to try them out to verify this for yourself (there may be slight precision differences, but I doubt they'll be big enough to matter). Hope this helps.
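One way to run that check, sketched on a made-up DataFrame that only mimics the Num_Sum, Denom_Sum and log_ratio columns described above (the sample IDs and numbers are hypothetical, and np.exp() stands in for math.e**(...)):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a Qarcoal output table; real data would come from the QZA.
qarcoal_log_ratios_df = pd.DataFrame(
    {"Num_Sum": [40.0, 3.0, 7.0], "Denom_Sum": [20.0, 4.0, 7.0]},
    index=["S1", "S2", "S3"],
)
qarcoal_log_ratios_df["log_ratio"] = np.log(
    qarcoal_log_ratios_df["Num_Sum"] / qarcoal_log_ratios_df["Denom_Sum"]
)

# Option 2.1: divide the sums directly.
ratio_from_sums = (
    qarcoal_log_ratios_df["Num_Sum"] / qarcoal_log_ratios_df["Denom_Sum"]
)

# Option 2.2: exponentiate the natural log-ratio.
ratio_from_logs = np.exp(qarcoal_log_ratios_df["log_ratio"])

# np.allclose tolerates tiny floating-point differences between the two.
ratios_agree = bool(np.allclose(ratio_from_sums, ratio_from_logs))
print(ratios_agree)
```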

@pavlo888
Author

pavlo888 commented Jun 6, 2020

Hi @fedarko

Thank you for your reply. I think that the raw ratios obtained with option 2.1 are the output I am looking for. However, I am not very familiar with Python. I have tried running these commands (https://nbviewer.jupyter.org/github/biocore/qurro/blob/master/example_notebooks/qarcoal/qarcoal_example.ipynb#1.B.-Run-Qarcoal!) in Spyder but I get some errors.

Could you point me to a platform where I can easily run the suggested commands?

Cheers,
Pablo

@fedarko
Collaborator

fedarko commented Jun 8, 2020

I haven't used Spyder, but the necessary code to extract the raw ratios should be runnable through any Python interface (python, ipython, Jupyter Notebooks, etc.). For not-large tasks like this I normally just use ipython from the terminal, but I'm sure it's possible to run this code through an IDE like Spyder also (as long as it's hooked up to your QIIME 2 conda environment, so you can do stuff like from qiime2 import Artifact without errors).

First off, what sort of error(s) are you getting? If you wouldn't mind copying them here, this would help us figure out where things are going wrong (and it'll help people coming here from Google or whatever who might have the same problem).

Here's what I think the code to get "raw" ratios from Qarcoal output would look like, in some more detail. This should be run from within a QIIME 2 conda environment.

import pandas as pd
from qiime2 import Artifact

# Load the output QIIME 2 artifact from Qarcoal
qarcoal_log_ratios = Artifact.load("your_qarcoal_output.qza")

# Convert the artifact to a pandas DataFrame
qarcoal_log_ratios_df = qarcoal_log_ratios.view(pd.DataFrame)

# Make a new column in the DataFrame, "raw_ratio"
qarcoal_log_ratios_df["raw_ratio"] = qarcoal_log_ratios_df["Num_Sum"] / qarcoal_log_ratios_df["Denom_Sum"]

# Save the Qarcoal output (including the raw_ratio column we just added) to a TSV file
qarcoal_log_ratios_df.to_csv("raw_ratio_info.tsv", sep="\t")

This should accomplish what you want, I think. Let us know if this works!

@pavlo888
Author

Hi @fedarko,

It worked perfectly!!!! I opened ipython on the terminal while having the qiime2 Conda environment active and I followed your script and it worked great! Thanks a lot!

Now, for the interpretation I just wanna make sure I am doing it right.
The raw ratio column is telling me that there are 5 times more counts (based on reads?) in the Num_Sum than in the Denom_Sum?

Or what is the correct wording for this type of output?

Thank you in advance for your amazing support!

Cheers,
Pablo
[screenshot: Screenshot 2020-06-21 at 23 47 19]

@fedarko
Collaborator

fedarko commented Jun 22, 2020

Glad that worked!

Yes, the raw_ratio column (for sample 3-8B-rep1) is really just saying that the ratio of (the sum of the numerator features) to (the sum of the denominator features) is ~5.38 for that sample. Whatever these counts "are" depends entirely on how you produced your BIOM table initially (if your data is from 16S rRNA sequencing, this was probably by denoising or OTU clustering; if your data is from shotgun metagenomic sequencing, the BIOM table's relative abundances might have been estimated by something like MetaPhlAn2; etc.). I don't know what more can be said about the raw ratios besides that, and I would caution against over-interpreting them.

The reason many compositional data analysis techniques generally use log-ratios instead of just raw ratios is that logarithms symmetrize things between the numerator and denominator around 0:

3/4 = 0.75
4/3 = 1.33...

but

log10(3/4) = -0.12...
log10(4/3) = +0.12...

More generally, log(a/b) = -log(b/a) in any base (assuming a > 0 and b > 0).
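A quick way to see this for yourself, using only Python's standard library. The -0.12 / +0.12 values above correspond to base-10 logarithms; with the natural logarithm (which is what Qurro's log-ratios use) the same pair comes out at roughly -0.29 / +0.29, and the symmetry holds either way.

```python
import math

a, b = 3, 4

# Raw ratios are not symmetric around 1: 0.75 vs. 1.33...
print(a / b, b / a)

# Base-10 logs: about -0.125 and +0.125
print(math.log10(a / b), math.log10(b / a))

# Natural logs: about -0.288 and +0.288
print(math.log(a / b), math.log(b / a))

# The symmetry log(a/b) = -log(b/a) holds regardless of the base.
symmetric_base10 = math.isclose(math.log10(a / b), -math.log10(b / a))
symmetric_natural = math.isclose(math.log(a / b), -math.log(b / a))
print(symmetric_base10, symmetric_natural)
```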

So, using the log-ratio (rather than just the raw ratio) gives equal weight to the numerator and denominator, making it easier to compare samples (and enabling the use of ordinary statistical tools, e.g. t-tests). To quote this paper, emphasis mine:

The starting point for any compositional analyses is a ratio transformation of the data. Ratio transformations capture the relationships between the features in the dataset and these ratios are the same whether the data are counts or proportions. Taking the logarithm of these ratios, thus log-ratios, makes the data symmetric and linearly related, and places the data in a log-ratio coordinate space (Pawlowsky-Glahn et al., 2015). Thus, we can obtain information about the log-ratio abundances of features relative to other features in the dataset, and this information is directly relatable to the environment. We cannot get information about the absolute abundances since this information is lost during the sequencing process as explained in Figure 1. However, log-ratios have the nice mathematical property that their sample space is real numbers, and this represents a major advantage for the application of standard statistical methods that have been developed for real random variables.

@pavlo888
Author

Hi @fedarko

Thanks a lot for the great insight and explanation. From what you have mentioned above, I assume it would be more advisable to discuss these kinds of results based on the log-ratio, right?

Cheers,
Pablo

@fedarko
Collaborator

fedarko commented Jun 22, 2020

In general, yes, I would suggest discussing log-ratios rather than just raw ratios. It might seem less intuitive at first, but (in my opinion) the advantages outweigh the disadvantages.

If you'd like further background on log vs. non-log ratios, you might want to check out this issue thread and/or Modeling and Analysis of Compositional Data (link), page 14.

@fedarko
Collaborator

fedarko commented Jul 9, 2020

I'm going to close this for now, but please feel free to open a new issue if you have any other questions.

Best,
Marcus

@fedarko fedarko closed this as completed Jul 9, 2020
@pavlo888
Author

Hi @fedarko,

I was wondering if you have any experience with plotting the results from Qurro in a PCA plot? I have a dataframe looking like this:
[screenshot: Screenshot from 2020-09-26 22-08-21]

Any idea? I tried using ggbiplot in R but I cannot make it work.

Thank you in advance.

Cheers,
Pablo

@fedarko
Collaborator

fedarko commented Sep 27, 2020

Not super sure what you mean; what's your goal with this analysis? I am unclear on how you'd go from Qurro's results (log-ratios) to PCA. The more common use case would be going the opposite way, i.e. using the feature loadings in a PCA biplot (which I think is what was shown in the rank plot in your first post in this thread?) as input to Qurro to guide the selection of log-ratios that differ across sample types.

I guess you could do something like select two log-ratios and then use those as the axes in a scatterplot, which would probably look kinda like a PCA, but I'm not sure that would be more meaningful than just showing two box (or jitter/violin) plots of the different log-ratios.

@pavlo888
Author

Hi @fedarko,

Yes indeed. That is exactly what I got. Apologies for not explaining myself clearly. I took the axes from two log-ratios and put them in a scatterplot in order to obtain a PCA. My main goal is to show, more clearly and in a summarized way, the families that are coupled with the differential log-ratios. Do you think this would be a good approach?

Cheers,
Pablo

@fedarko
Collaborator

fedarko commented Sep 28, 2020

I guess that could be useful -- I remember similar scatterplots of two log-ratios were talked about by @mortonjt in the context of microbe-metabolite datasets (biocore/mmvec#76) a while back. Although I don't think that a scatterplot between two log-ratios can be called a PCA -- it would just be a normal scatterplot (although it would be interpretable, kind of, as a basic form of "dimensionality reduction").

It should be possible to plot the scatterplot in pretty much any plotting software (ggplot / matplotlib / etc.), I think?

Looking at the data you posted earlier, I am a bit confused: it seems like you have coordinates defined for features, not for samples. If you want to make a scatterplot in the way described above, I think the way to do that (assuming you want to use Qurro to select the log-ratios) is to select one log-ratio in Qurro, export it using the Export current sample plot data button, then select another log-ratio and export it again in the same way. (You'll probably need to rename the Current_Natural_Log_Ratio column within each file to distinguish the two log-ratios.) Once that's done, you should be able to merge the files and load them into R / Python / Excel / etc. for visualization. Does that make sense?
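In case it helps, here is a minimal pandas sketch of the rename-and-merge step described above. The "Sample ID" column name, the sample IDs, the values, and the log_ratio_1 / log_ratio_2 names are all assumptions made up for illustration; in practice you would pass the paths of your two exported TSV files to pd.read_csv() instead of the in-memory stand-ins.

```python
import io

import pandas as pd

# Made-up stand-ins for the two exported sample plot data files.
export_1 = io.StringIO(
    "Sample ID\tCurrent_Natural_Log_Ratio\nS1\t0.5\nS2\t-1.2\nS3\t2.0\n"
)
export_2 = io.StringIO(
    "Sample ID\tCurrent_Natural_Log_Ratio\nS1\t-0.3\nS2\t0.8\nS3\tnull\n"
)

first = pd.read_csv(export_1, sep="\t", index_col="Sample ID")
second = pd.read_csv(export_2, sep="\t", index_col="Sample ID")

# Rename the shared column so the two log-ratios can be told apart.
first = first.rename(columns={"Current_Natural_Log_Ratio": "log_ratio_1"})
second = second.rename(columns={"Current_Natural_Log_Ratio": "log_ratio_2"})

# Join on sample ID and keep only samples with both log-ratios defined.
merged = first[["log_ratio_1"]].join(second[["log_ratio_2"]], how="inner").dropna()
print(merged)

# With matplotlib installed, a scatterplot of one log-ratio against the other
# could then be drawn with something like:
# merged.plot.scatter(x="log_ratio_1", y="log_ratio_2")
```

Each row of the merged table is one sample, so every point in the resulting scatterplot is a sample rather than a feature.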

@pavlo888
Author

Hi @fedarko,

I am a bit confused by the last thing you suggested, but I have built the "PCA" and it looks like this:
[attached image: qurro_PCA]

Personally, for me this would work.

Thanks a lot for your help!

3 participants