FIX: transposing ranks for final output. (#99)

* FIX: transposing ranks for final output. Updating documentation in readme accordingly
* Adding in transformers for metadata
* fixing import
* fixing index name for ranks
* Add transformer
* fixed ordering of rank calculations in q2
* TST: made transposes consistent. Found that standalone cli wasn't consistent with q2. Biases weren't factored into standalone ranks. Now fixed.
* TST: minor refactor
* flake8
* Adding check for soils
* TST: adding additional check in cystic fibrosis study. Also adding in cool figure, because, why not
* Adding in changelog update
* Update CHANGELOG.md
biocore · Oct 17, 2019 · commit 7457c87
1 parent c84460e

Showing 16 changed files with 425 additions and 81 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,12 @@
 # mmvec changelog
 
+## Version 1.0.1 (2019-10-17)
+# Enhancements
+ - Ranks are transposed and viewable in qiime metadata tabulate [#99](https://github.com/biocore/mmvec/pull/99)
+
+# Bug fixes
+ - Ranks are now calculated consistently between q2 and standalone cli [#99](https://github.com/biocore/mmvec/pull/99)
+
 ## Version 1.0.0 (2019-09-30)
 # Enhancements
  - Paired heatmaps are available [#89](https://github.com/biocore/mmvec/pull/89)

diff --git a/README.md b/README.md
@@ -3,6 +3,8 @@
 # mmvec
 Neural networks for estimating microbe-metabolite interactions through their co-occurrence probabilities.
 
+![](https://github.com/biocore/mmvec/raw/master/img/mmvec.png "mmvec")
+
 # Installation
 
 MMvec can be installed via pypi as follows
@@ -45,7 +47,7 @@ More information can found under `mmvec --help`
 
 # Qiime2 plugin
 
-If you want to make this qiime2 compatible, install this in your
+If you want to run this in a qiime environment, install this in your
 qiime2 conda environment (see qiime2 installation instructions [here](https://qiime2.org/)) and run the following
 
 ```
@@ -76,26 +78,22 @@ qiime mmvec paired-omics \
 	--o-conditionals ranks.qza \
 	--o-conditional-biplot biplot.qza
 ```
+
 In the results, there are two files, namely `results/conditional_biplot.qza` and `results/conditionals.qza`. The conditional biplot is a biplot representation of the
 conditional probability matrix so that you can visualize these microbe-metabolite interactions in an exploratory manner.  This can be directly visualized in
 Emperor as shown below.  We also have the estimated conditional probability matrix given in `results/conditionals.qza`,
 which can be unzipped to yield a tab-delimited table via `unzip results/conditionals`. Each row can be ranked,
 so the most frequently co-occurring metabolites for a given microbe can be obtained by identifying the highest co-occurrence probabilities for each microbe.
 
-It is worth your time to investigate the logs (labeled under `logdir**`) that are deposited using Tensorboard.
-The actual logfiles within this directory are labeled `events.out.tfevents.*` : more discussion on this later.
-
+These log conditional probabilities can also be viewed directly with `qiime metadata tabulate`.  The visualization can be
+created as follows:
 
-Tensorboard can be run via
 ```
-tensorboard --logdir .
+qiime metadata tabulate \
+	--m-input-file results/conditionals.qza \
+	--o-visualization conditionals-viz.qzv
 ```
 
-You may need to tinker with the parameters to get readable tensorflow results, namely `--p-summary-interval`,
-`--epochs` and `--batch-size`.
-
-A description of these two graphs is outlined in the FAQs below.
-
 
 Then you can run the following to generate an Emperor biplot.
 
@@ -197,6 +195,14 @@ More information behind the actions and parameters can found under `qiime mmvec
 
 3. More model parameters : The standalone script will return the bias parameters learned for each dataset (i.e. microbe and metabolite abundances).  These are stored under the summary directory (specified by `--summary-dir`) under the name `embeddings.csv`. This file will hold the coordinates for the microbes and metabolites, along with biases.  There are 4 columns in this file, namely `feature_id`, `axis`, `embed_type` and `values`.  `feature_id` is the name of the feature, whether it be a microbe name or a metabolite feature id.  `axis` corresponds to the name of the axis, which is either a PC axis or a bias.  `embed_type` denotes whether the coordinate corresponds to a microbe or a metabolite.  `values` is the coordinate value for the given `axis`, `embed_type` and `feature_id`.  This can be useful for accessing the raw parameters and building custom biplots / ranks visualizations - this also has the advantage of requiring much less memory to manipulate.
 
+It is also important to note that you don't have to explicitly choose between the two versions - it is very doable to run the standalone version first, then import those output files into qiime2.  Importing can be done as follows
+
+```
+qiime tools import --input-path <your ranks file> --output-path conditionals.qza --type FeatureData[Conditional]
+
+qiime tools import --input-path <your ordination file> --output-path ordination.qza --type 'PCoAResults % ("biplot")'
+```
+
 **Q** : You mentioned that you can use GPUs.  How can you do that??
 
 **A** : This can be done by running `pip install tensorflow-gpu` in your environment.  See details [here](https://www.tensorflow.org/install/gpu).
@@ -209,7 +215,7 @@ At the moment, these capabilities are only available for the standalone CLI due
 
 **Q** : I'm confused, what is Tensorboard?
 
-**A** : Tensorboard is a diagnostic tool that runs in a web browser. To open tensorboard, make sure you’re in the mmvec environment and cd into the folder you are running the script above from. Then run:
+**A** : Tensorboard is a diagnostic tool that runs in a web browser - note that this is only explicitly supported in the standalone version of mmvec. To open tensorboard, make sure you’re in the mmvec environment and cd into the folder you are running the script above from. Then run:
 
 ```
 tensorboard --logdir .
@@ -237,7 +243,8 @@ The x-axis is the number of iterations (meaning times the model is training acro
 
 The y-axis is the average number of counts off for each feature. The model is predicting the sequence counts for each feature in the samples that were set aside for testing. So in the graph above it means that, on average, the model is off by ~0.75 intensity units, which is low. However, this is ABSOLUTE error not relative error (unfortunately we don't know how to compute relative errors because of the sparsity in these datasets).
 
-You can also compare multiple runs with different parameters to see which run performed the best.  If you are doing this, be sure to look at the `training-column` example make the testing samples consistent across runs.
+You can also compare multiple runs with different parameters to see which run performed the best. Useful parameters to note are `--epochs` and `--batch-size`.  If you are committed to fine-tuning parameters, be sure to look at the `training-column` example to make the testing samples consistent across runs.
+
 
 **Q** : What's up with the `--training-column` argument?
 

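The README text in this diff says the conditionals can be ranked to find the metabolites that co-occur most with a given microbe. Below is a minimal sketch of that lookup, not part of this commit. It assumes the transposed layout that the notebook in this commit relies on when it indexes `ranks.loc[metabolites, microbe]`: metabolites as rows, microbes as columns. The toy table, its labels, and the helper `top_metabolites` are hypothetical.

```python
import pandas as pd

# Toy log conditional probabilities in the assumed transposed layout:
# rows are metabolites, columns are microbes. All labels are made up.
ranks = pd.DataFrame(
    {
        "microbe_A": [1.2, -0.4, 0.3],
        "microbe_B": [-0.9, 2.1, 0.0],
    },
    index=["metabolite_1", "metabolite_2", "metabolite_3"],
)


def top_metabolites(ranks: pd.DataFrame, microbe: str, k: int = 2) -> pd.Series:
    """Return the k metabolites with the highest values for one microbe."""
    # Selecting a column gives one value per metabolite for that microbe;
    # nlargest sorts in descending order and keeps the first k entries.
    return ranks[microbe].nlargest(k)


top = top_metabolites(ranks, "microbe_A", k=2)
print(top)
```

With a real table, `ranks` would instead be read from the tab-delimited ranks file (for example with `pd.read_csv(..., sep='\t', index_col=0)`, as the notebook does).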
diff --git a/examples/cf/check_rhamnolipids.ipynb b/examples/cf/check_rhamnolipids.ipynb
@@ -0,0 +1,212 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[34mlatent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95\u001b[m\u001b[m\r\n",
+      "latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_embedding.txt\r\n",
+      "latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ordination.txt\r\n",
+      "latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ranks.txt\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!ls testing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Standalone check"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "fname = 'latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ranks.txt'\n",
+    "ranks = pd.read_csv(f'testing/{fname}', sep='\\t', index_col=0)\n",
+    "microbe_metadata = pd.read_csv('microbe-metadata.txt', sep='\\t', index_col=0)\n",
+    "metabolite_metadata = pd.read_csv('metabolite-metadata.txt', sep='\\t', index_col=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "microbe_metadata = microbe_metadata.loc[ranks.columns]\n",
+    "i = microbe_metadata.Taxon.apply(lambda x: 'Pseudomonas' in x)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pseudomonas = microbe_metadata.loc[i].index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metabolite_metadata = metabolite_metadata.dropna(subset=['expert_annotation'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "19"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.sum(ranks.loc[metabolite_metadata.index, pseudomonas[0]] > 0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# qiime2 check"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint8 = np.dtype([(\"qint8\", np.int8, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_quint8 = np.dtype([(\"quint8\", np.uint8, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint16 = np.dtype([(\"qint16\", np.int16, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_quint16 = np.dtype([(\"quint16\", np.uint16, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  _np_qint32 = np.dtype([(\"qint32\", np.int32, 1)])\n",
+      "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.\n",
+      "  np_resource = np.dtype([(\"resource\", np.ubyte, 1)])\n"
+     ]
+    }
+   ],
+   "source": [
+    "import qiime2\n",
+    "ranks = qiime2.Artifact.load('ranks.qza').view(pd.DataFrame)\n",
+    "microbe_metadata = pd.read_csv('microbe-metadata.txt', sep='\\t', index_col=0)\n",
+    "metabolite_metadata = pd.read_csv('metabolite-metadata.txt', sep='\\t', index_col=0)\n",
+    "microbe_metadata = microbe_metadata.loc[ranks.columns]\n",
+    "i = microbe_metadata.Taxon.apply(lambda x: 'Pseudomonas' in x)\n",
+    "pseudomonas = microbe_metadata.loc[i].index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metabolite_metadata = metabolite_metadata.dropna(subset=['expert_annotation'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "19"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "np.sum(ranks.loc[metabolite_metadata.index, pseudomonas[0]] > 0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
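The notebook above runs the same count twice, once on the standalone ranks and once on the qiime2 artifact, and gets 19 both times. Below is a rough sketch of how that check could be wrapped in one reusable function, so both tables go through identical steps. The function name `count_positive_ranks` and the toy data are hypothetical. The column names `Taxon` and `expert_annotation` are taken from the notebook.

```python
import pandas as pd


def count_positive_ranks(ranks, microbe_metadata, metabolite_metadata,
                         genus="Pseudomonas"):
    """Count annotated metabolites with a positive rank for one microbe."""
    # Align the microbe metadata to the microbes present as rank columns.
    microbe_metadata = microbe_metadata.loc[ranks.columns]
    is_genus = microbe_metadata["Taxon"].apply(lambda taxon: genus in taxon)
    microbes = microbe_metadata.loc[is_genus].index
    # Keep only metabolites that carry an expert annotation.
    annotated = metabolite_metadata.dropna(subset=["expert_annotation"]).index
    # As in the notebook, only the first matching microbe is inspected.
    return int((ranks.loc[annotated, microbes[0]] > 0).sum())


# Toy inputs (all labels are made up): metabolites as rows, microbes as columns.
ranks = pd.DataFrame(
    {"o1": [0.5, 1.0, -0.2], "o2": [-1.0, 0.3, 0.7]},
    index=["m1", "m2", "m3"],
)
microbe_metadata = pd.DataFrame(
    {"Taxon": ["g__Pseudomonas", "g__Staphylococcus"]}, index=["o1", "o2"]
)
metabolite_metadata = pd.DataFrame(
    {"expert_annotation": ["rhamnolipid", None, "quinolone"]},
    index=["m1", "m2", "m3"],
)

n_positive = count_positive_ranks(ranks, microbe_metadata, metabolite_metadata)
print(n_positive)
```

Calling the function on the standalone table and on the table viewed from `ranks.qza`, then comparing the two results, would reproduce the consistency check in the notebook.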
diff --git a/examples/cf/q2_run.sh b/examples/cf/q2_run.sh
@@ -8,6 +8,7 @@ qiime mmvec paired-omics \
       --p-learning-rate 1e-3 \
       --o-conditionals ranks.qza \
       --o-conditional-biplot biplot.qza \
+      --p-summary-interval 1 \
       --verbose
 
 qiime emperor biplot \
@@ -27,6 +28,5 @@ mmvec paired-omics \
       --metabolite-file lcms_nt.biom  \
       --epochs 100 \
       --learning-rate 1e-3 \
-      --summary-dir testing
-
-qiime tools import --input-path testing/latent_dim_3_input_prior_1.00_output_prior_1.00_beta1_0.90_beta2_0.95_ranks.txt --output-path ranks.qza --type FeatureData[Conditional]
+      --summary-interval 1 \
+      --summary-dir summary