✨ make DAE and VAE train without a validation dataset

- The fastai routine always needs a validation dataset, but it can be empty.
RasmussenLab · Feb 23, 2024 · a4f36da · a4f36da
1 parent 564e4ff
commit a4f36da

Showing 7 changed files with 544 additions and 13 deletions.
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -61,6 +61,12 @@ jobs:
         papermill 01_1_train_VAE.ipynb --help-notebook
         papermill 01_1_train_DAE.ipynb --help-notebook
         papermill 01_1_train_CF.ipynb --help-notebook
+    - name: Run tutorial notebooks
+      run: |
+        cd project
+        mkdir runs
+        papermill 04_1_train_DAE_VAE_wo_val_data.ipynb runs/04_1_train_DAE_VAE_wo_val_data.ipynb
+        papermill 04_1_train_pimms_models.ipynb runs/04_1_train_pimms_models.ipynb
     - name: Run demo workflow (integration test)
       run: | 
        cd project

diff --git a/project/04_1_train_DAE_VAE_wo_val_data.ipynb b/project/04_1_train_DAE_VAE_wo_val_data.ipynb
@@ -0,0 +1,348 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2b842a1a",
+   "metadata": {},
+   "source": [
+    "# Scikit-learn style transformers of the data\n",
+    "\n",
+    "1. Load data into pandas dataframe\n",
+    "2. Fit transformer on training data\n",
+    "3. Impute only missing values with predictions from model\n",
+    "\n",
+    "Autoencoders need wide training data, i.e. a sample with all its features' intensities, whereas\n",
+    "Collaborative Filtering needs long training data, i.e. a sample identifier, a feature identifier and the intensity.\n",
+    "Both data formats can be transformed into each other, but models using long data format do not need to\n",
+    "take care of missing values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b6c7aec0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "import vaep.plotting.data\n",
+    "from vaep.sklearn.ae_transformer import AETransformer\n",
+    "import vaep.sampling\n",
+    "\n",
+    "\n",
+    "IN_COLAB = 'COLAB_GPU' in os.environ\n",
+    "\n",
+    "fn_intensities = 'data/dev_datasets/HeLa_6070/protein_groups_wide_N50.csv'\n",
+    "if IN_COLAB:\n",
+    "    fn_intensities = 'https://raw.githubusercontent.com/RasmussenLab/pimms/main/project/data/dev_datasets/HeLa_6070/protein_groups_wide_N50.csv'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4f1ccbdd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "\n",
+    "vaep.plotting.make_large_descriptors(8)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ff56309",
+   "metadata": {},
+   "source": [
+    "## Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b921b86c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv(fn_intensities, index_col=0)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43798bb3",
+   "metadata": {},
+   "source": [
+    "We will need the data in long format for Collaborative Filtering.\n",
+    "Naming both the row and column index ensures\n",
+    "that the data can be transformed very easily into long format:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d29c02d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.index.name = 'Sample ID'  # already set\n",
+    "df.columns.name = 'protein group'  # not set due to csv disk file format\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cb166253",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd2148b8",
+   "metadata": {},
+   "source": [
+    "Transform the data using the logarithm, here using base 2:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b599efb8",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "df = np.log2(df + 1)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8b264559",
+   "metadata": {},
+   "source": [
+    "Two plots on data availability:\n",
+    "\n",
+    "1. proportion of missing values per feature median (N = protein groups)\n",
+    "2. CDF of available intensities per protein group"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da3c8dba",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n",
+    "    data=df, type='boxplot')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a912a04a",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "df.notna().sum().sort_values().plot()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29d451c0",
+   "metadata": {},
+   "source": [
+    "Define a minimum prevalence for features (across samples) and for samples (across features) to be included:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ed278540",
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
+   "outputs": [],
+   "source": [
+    "SELECT_FEAT = True\n",
+    "\n",
+    "\n",
+    "def select_features(df, feat_prevalence=.2, axis=0):\n",
+    "    # cf. vaep.filter.select_features; axis=0 filters columns (features), axis=1 filters rows (samples)\n",
+    "    N = df.shape[axis]\n",
+    "    minimum_freq = N * feat_prevalence\n",
+    "    freq = df.notna().sum(axis=axis)\n",
+    "    mask = freq >= minimum_freq\n",
+    "    print(f\"Drop {(~mask).sum()} along axis {axis}.\")\n",
+    "    freq = freq.loc[mask]\n",
+    "    if axis == 0:\n",
+    "        df = df.loc[:, mask]\n",
+    "    else:\n",
+    "        df = df.loc[mask]\n",
+    "    return df\n",
+    "\n",
+    "\n",
+    "if SELECT_FEAT:\n",
+    "    # potentially this can take a few iterations to stabilize.\n",
+    "    df = select_features(df, feat_prevalence=.2)\n",
+    "    df = select_features(df=df, feat_prevalence=.3, axis=1)\n",
+    "df.shape"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9f4bcf48",
+   "metadata": {},
+   "source": [
+    "## AutoEncoder architectures"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "487a4f7c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Reload data (for demonstration)\n",
+    "\n",
+    "df = pd.read_csv(fn_intensities, index_col=0)\n",
+    "df.index.name = 'Sample ID'  # already set\n",
+    "df.columns.name = 'protein group'  # not set due to csv disk file format\n",
+    "df = np.log2(df + 1)  # log transform\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ceee9ced",
+   "metadata": {},
+   "source": [
+    "Test `DAE` or `VAE` model without validation data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5c6b5ab9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_selected = 'VAE'  # 'DAE'\n",
+    "model = AETransformer(\n",
+    "    model=model_selected,\n",
+    "    hidden_layers=[512,],\n",
+    "    latent_dim=50,\n",
+    "    out_folder='runs/scikit_interface',\n",
+    "    batch_size=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f4ab5d10",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.fit(df,\n",
+    "          epochs_max=2,\n",
+    "          cuda=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9ae733fc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_imputed = model.transform(df)\n",
+    "df_imputed"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3526ad09",
+   "metadata": {},
+   "source": [
+    "Repeat the same steps with the `DAE` model:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e87d1eae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_selected = 'DAE'\n",
+    "model = AETransformer(\n",
+    "    model=model_selected,\n",
+    "    hidden_layers=[512,],\n",
+    "    latent_dim=50,\n",
+    "    out_folder='runs/scikit_interface',\n",
+    "    batch_size=10,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c6c60295",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.fit(df,\n",
+    "          epochs_max=2,\n",
+    "          cuda=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "27e80959",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_imputed = model.transform(df)\n",
+    "df_imputed"
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  },
+  "kernelspec": {
+   "display_name": "Python",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.17"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
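
Note on the notebook's introduction: it says that wide and long data formats can be transformed into each other, and that naming both the row and column index makes this easy. A minimal sketch of that round trip with plain pandas follows. The toy values and the `intensity` series name are made up for illustration and are not taken from the HeLa dataset or from `vaep`.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per sample, one column per protein group.
# Values are invented; NaN marks a missing intensity.
wide = pd.DataFrame(
    {"P1": [20.0, np.nan], "P2": [np.nan, 18.5], "P3": [22.1, 21.7]},
    index=pd.Index(["S1", "S2"], name="Sample ID"),
)
wide.columns.name = "protein group"

# Wide -> long: one entry per (sample, protein group) pair.
# The explicit dropna() removes missing intensities regardless of the
# pandas version's default stacking behaviour.
long_format = wide.stack().dropna().rename("intensity")

# Long -> wide: pivot the last index level back into columns;
# pairs absent from the long data reappear as NaN.
wide_again = long_format.unstack()

print(long_format)
print(wide_again)
```

Because both index levels are named, the long series carries `Sample ID` and `protein group` as its index level names, which is why the long format does not have to represent missing values at all.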