✨ make DAE and VAE train without a validation dataset
- The fastai routine always needs a validation dataset, but it can be empty.
Henry committed Feb 23, 2024
1 parent 564e4ff commit a4f36da
Showing 7 changed files with 544 additions and 13 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/ci.yaml
@@ -61,6 +61,12 @@ jobs:
papermill 01_1_train_VAE.ipynb --help-notebook
papermill 01_1_train_DAE.ipynb --help-notebook
papermill 01_1_train_CF.ipynb --help-notebook
- name: Run tutorial notebooks
run: |
cd project
mkdir runs
papermill 04_1_train_DAE_VAE_wo_val_data.ipynb runs/04_1_train_DAE_VAE_wo_val_data.ipynb
papermill 04_1_train_pimms_models.ipynb runs/04_1_train_pimms_models.ipynb
- name: Run demo workflow (integration test)
run: |
cd project
348 changes: 348 additions & 0 deletions project/04_1_train_DAE_VAE_wo_val_data.ipynb
@@ -0,0 +1,348 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2b842a1a",
"metadata": {},
"source": [
"# Scikit-learn style transformers of the data\n",
"\n",
"1. Load data into pandas dataframe\n",
"2. Fit transformer on training data\n",
"3. Impute only missing values with predictions from model\n",
"\n",
"Autoencoders need wide training data, i.e. a sample with all its features' intensities, whereas\n",
"Collaborative Filtering needs long training data, i.e. a sample identifier, a feature identifier and the intensity.\n",
"Both data formats can be transformed into each other, but models using long data format do not need to\n",
"take care of missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6c7aec0",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import vaep.plotting.data\n",
"from vaep.sklearn.ae_transformer import AETransformer\n",
"import vaep.sampling\n",
"\n",
"\n",
"IN_COLAB = 'COLAB_GPU' in os.environ\n",
"\n",
"fn_intensities = 'data/dev_datasets/HeLa_6070/protein_groups_wide_N50.csv'\n",
"if IN_COLAB:\n",
" fn_intensities = 'https://raw.githubusercontent.com/RasmussenLab/pimms/main/project/data/dev_datasets/HeLa_6070/protein_groups_wide_N50.csv'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f1ccbdd",
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"vaep.plotting.make_large_descriptors(8)"
]
},
{
"cell_type": "markdown",
"id": "9ff56309",
"metadata": {},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b921b86c",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(fn_intensities, index_col=0)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "43798bb3",
"metadata": {},
"source": [
"We will need the data in long format for Collaborative Filtering.\n",
"Naming both the row and column index ensures\n",
"that the data can be transformed very easily into long format:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d29c02d",
"metadata": {},
"outputs": [],
"source": [
"df.index.name = 'Sample ID' # already set\n",
"df.columns.name = 'protein group' # not set due to csv disk file format\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "cb166253",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"id": "dd2148b8",
"metadata": {},
"source": [
"Transform the data using the logarithm, here using base 2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b599efb8",
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"df = np.log2(df + 1)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "8b264559",
"metadata": {},
"source": [
"Two plots on data availability:\n",
"\n",
"1. Median feature intensity against the proportion of missing values (N = protein groups)\n",
"2. CDF of available intensities per protein group"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da3c8dba",
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"ax = vaep.plotting.data.plot_feat_median_over_prop_missing(\n",
" data=df, type='boxplot')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a912a04a",
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"df.notna().sum().sort_values().plot()"
]
},
{
"cell_type": "markdown",
"id": "29d451c0",
"metadata": {},
"source": [
"Define a minimum prevalence for a feature (across samples) and for a sample (across features) to be included:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed278540",
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"SELECT_FEAT = True\n",
"\n",
"\n",
"def select_features(df, feat_prevalence=.2, axis=0):\n",
" # # ! vaep.filter.select_features\n",
" N = df.shape[axis]\n",
" minimum_freq = N * feat_prevalence\n",
" freq = df.notna().sum(axis=axis)\n",
" mask = freq >= minimum_freq\n",
" print(f\"Drop {(~mask).sum()} along axis {axis}.\")\n",
" freq = freq.loc[mask]\n",
" if axis == 0:\n",
" df = df.loc[:, mask]\n",
" else:\n",
" df = df.loc[mask]\n",
" return df\n",
"\n",
"\n",
"if SELECT_FEAT:\n",
" # potentially this can take a few iterations to stabilize.\n",
" df = select_features(df, feat_prevalence=.2)\n",
" df = select_features(df=df, feat_prevalence=.3, axis=1)\n",
"df.shape"
]
},
{
"cell_type": "markdown",
"id": "9f4bcf48",
"metadata": {},
"source": [
"## AutoEncoder architectures"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "487a4f7c",
"metadata": {},
"outputs": [],
"source": [
"# Reload data (for demonstration)\n",
"\n",
"df = pd.read_csv(fn_intensities, index_col=0)\n",
"df.index.name = 'Sample ID' # already set\n",
"df.columns.name = 'protein group' # not set due to csv disk file format\n",
"df = np.log2(df + 1) # log transform\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "ceee9ced",
"metadata": {},
"source": [
"Test `DAE` or `VAE` model without validation data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c6b5ab9",
"metadata": {},
"outputs": [],
"source": [
"model_selected = 'VAE' # 'DAE'\n",
"model = AETransformer(\n",
" model=model_selected,\n",
" hidden_layers=[512,],\n",
" latent_dim=50,\n",
" out_folder='runs/scikit_interface',\n",
" batch_size=10,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4ab5d10",
"metadata": {},
"outputs": [],
"source": [
"model.fit(df,\n",
" epochs_max=2,\n",
" cuda=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ae733fc",
"metadata": {},
"outputs": [],
"source": [
"df_imputed = model.transform(df)\n",
"df_imputed"
]
},
{
"cell_type": "markdown",
"id": "3526ad09",
"metadata": {},
"source": [
"DAE"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e87d1eae",
"metadata": {},
"outputs": [],
"source": [
"model_selected = 'DAE'\n",
"model = AETransformer(\n",
" model=model_selected,\n",
" hidden_layers=[512,],\n",
" latent_dim=50,\n",
" out_folder='runs/scikit_interface',\n",
" batch_size=10,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6c60295",
"metadata": {},
"outputs": [],
"source": [
"model.fit(df,\n",
" epochs_max=2,\n",
" cuda=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27e80959",
"metadata": {},
"outputs": [],
"source": [
"df_imputed = model.transform(df)\n",
"df_imputed"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.17"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
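The notebook describes wide and long data formats and imputing only the missing values, but it does not show either step in isolation. The following is a minimal pandas-only sketch on toy data, not the `vaep` implementation: the values, the `intensity` name, and the column-mean `pred` table (a hypothetical stand-in for model output) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table (assumed values): one row per sample, one column per protein group.
wide = pd.DataFrame(
    {"PG1": [20.0, np.nan, 22.0], "PG2": [np.nan, 18.0, 19.0]},
    index=pd.Index(["S1", "S2", "S3"], name="Sample ID"),
)
wide.columns.name = "protein group"

# Wide -> long: one row per observed (sample, feature) pair.
# dropna() makes the result independent of how stack() treats NaN.
long = wide.stack().dropna().rename("intensity")

# Long -> wide: pairs that were not observed reappear as NaN.
wide_again = long.unstack()

# Hypothetical stand-in for model predictions: column means in every cell.
pred = pd.DataFrame(
    np.tile(wide.mean().to_numpy(), (len(wide), 1)),
    index=wide.index,
    columns=wide.columns,
)

# Keep observed intensities; use predictions only where data is missing.
imputed = wide.fillna(pred)
print(imputed)
```

Because both index levels are named, the long table carries `Sample ID` and `protein group` as its index names, which is why the notebook sets `df.index.name` and `df.columns.name` before any conversion.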