From 3a2a850ab4b1b597f7b44ca5897b504667652390 Mon Sep 17 00:00:00 2001
From: Emmanouil Theofanis Chourdakis
Date: Thu, 18 Apr 2019 13:16:30 +0100
Subject: [PATCH] added music/speech/sfx discriminator training notebook

---
 MusicSpeechSFxDiscrimination.ipynb | 455 +++++++++++++++++++++++++++++
 1 file changed, 455 insertions(+)
 create mode 100644 MusicSpeechSFxDiscrimination.ipynb

diff --git a/MusicSpeechSFxDiscrimination.ipynb b/MusicSpeechSFxDiscrimination.ipynb
new file mode 100644
index 0000000..8372eb8
--- /dev/null
+++ b/MusicSpeechSFxDiscrimination.ipynb
@@ -0,0 +1,455 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Music/Speech/SFx Discrimination based on VGGish and Transfer Learning\n",
+ "\n",
+ "*Note:* Before following this notebook, please see the simpler Music/Speech discrimination model first, since it is much easier to follow.\n",
+ "\n",
+ "This notebook adapts [VGGish](https://github.com/tensorflow/models/tree/master/research/audioset) to discriminate between Music, Speech, and Sound Effects. VGGish is a pre-trained model whose architecture was originally designed for image tasks, trained on the huge amount of data available in the [AudioSet](https://research.google.com/audioset/) dataset, which consists of soundtracks from 8 million YouTube videos classified into 632 classes. The architecture might not be the best choice (it assumes symmetric filters, which make little sense for spectral representations of audio), but the amount of training data is immense. The model is described in more detail in:\n",
+ "\n",
+ "```\n",
+ "E.T. Chourdakis, L. Ward, M. Paradis, and J. D. Reiss\n",
+ "\"Modelling experts' decisions on assigning narrative importances of objects in a radio drama mix\"\n",
+ "[submitted] In Proc. Int. Conf. Digital Audio Effects (DAFx)\n",
+ " Sept 2019, Birmingham, UK\n",
+ "```\n",
+ "\n",
+ "Roughly, the method classifies files as follows:\n",
+ "\n",
+ "1. Split audio files into segments of 960 ms\n",
+ "2. Compute mel spectrum magnitudes (64 bands, 25 ms frames, 10 ms hop size)\n",
+ "3. Keep the convolutional layers of VGGish with their weights and discard the rest\n",
+ "4. Add two new fully connected layers and train only those\n",
+ "\n",
+ "As before, we use the [GTZAN](http://marsyas.info/downloads/datasets.html) dataset, but this time we also add an equal duration of sound effects from the recently released [BBC SFx library](http://bbcsfx.acropolis.org.uk/). See the paper for more details on the construction of the dataset."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Import packages and load the dataset\n",
+ "\n",
+ "As in the simple Music/Speech discrimination case, we load the data, this time with the addition of sound effects."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using TensorFlow backend.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Music Speech Discrimination part\n",
+ "\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import os\n",
+ "import glob\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import classification_report\n",
+ "\n",
+ "import keras\n",
+ "import keras.layers\n",
+ "from keras.optimizers import Adam, Adadelta, RMSprop, SGD\n",
+ "from keras.models import Sequential\n",
+ "import librosa\n",
+ "\n",
+ "from tensorflow import set_random_seed\n",
+ "\n",
+ "set_random_seed(1) # For reproducibility in the model training\n",
+ "np.random.seed(1) # For reproducibility in the dataset collection\n",
+ "\n",
+ "music_data_dir = \"../../dataset/music_speech/music_wav/\"\n",
+ "speech_data_dir = \"../../dataset/music_speech/speech_wav/\"\n",
+ "sfx_data_dir = \"../../dataset/music_speech/sfx_wav/\"\n",
+ "\n",
+ "\n",
+ "full_data = []\n",
+ "\n",
+ "# 0 for music, 1 for speech, 2 for sfx\n",
+ "for fname in glob.glob(os.path.join(music_data_dir, \"*.wav\")):\n",
+ "    full_data.append((fname, 0))\n",
+ "for fname in glob.glob(os.path.join(speech_data_dir, \"*.wav\")):\n",
+ "    full_data.append((fname, 1))\n",
+ "for fname in glob.glob(os.path.join(sfx_data_dir, \"*.wav\")):\n",
+ "    full_data.append((fname, 2))\n",
+ "\n",
+ "# Use 30% as test data in this case\n",
+ "train_data, test_data = train_test_split(full_data, test_size=0.3, random_state=1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The following lines load the original VGGish weights and required parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/emmanouc/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Transfer learning part\n",
+ "import vggish_input\n",
+ "import vggish_params\n",
+ "import vggish_postprocess\n",
+ "import vggish_keras\n",
+ "\n",
+ "\n",
+ "# Paths to downloaded VGGish files.\n",
+ "checkpoint_path = 'vggish_audioset_weights.h5'\n",
+ "pca_params_path = 'vggish_pca_params.npz'\n",
+ "\n",
+ "# Relative tolerance of errors in mean and standard deviation of embeddings.\n",
+ "rel_error = 0.1 # Up to 10%\n",
+ "\n",
+ "# Load model for transfer learning\n",
+ "vggish_model = vggish_keras.get_vggish_keras()\n",
+ "vggish_model.load_weights(checkpoint_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below we process the inputs before feeding them to the neural network; it is not straightforward to use Kapre here.\n",
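+ "\n",
+ "As a rough sketch of what this preprocessing produces (not executed here; `example.wav` is a placeholder path), each file is turned into a batch of 960 ms log-mel patches by `vggish_input.waveform_to_examples`:\n",
+ "\n",
+ "```python\n",
+ "import librosa\n",
+ "import vggish_input\n",
+ "\n",
+ "src, sr = librosa.load('example.wav', sr=None, mono=True)\n",
+ "examples = vggish_input.waveform_to_examples(src, sr)\n",
+ "# One (96, 64) patch per 960 ms segment: 96 frames of 10 ms, 64 mel bands\n",
+ "print(examples.shape)  # (num_segments, 96, 64)\n",
+ "```"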
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Preparing inputs\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "8a4158e430034ec4a02b4691dd85488d",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(IntProgress(value=0, max=138), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tqdm\n",
+ "\n",
+ "def batch_preprocess_files(train_data):\n",
+ "    # Convert each file to VGGish input examples and one-hot labels\n",
+ "\n",
+ "    X_train = []\n",
+ "    y_train = []\n",
+ "\n",
+ "    print(\"Preparing inputs\")\n",
+ "    for train_tuple in tqdm.tqdm_notebook(train_data):\n",
+ "        src, sr = librosa.load(train_tuple[0], sr=None, mono=True)\n",
+ "        label = train_tuple[1]\n",
+ "\n",
+ "        input_batch = vggish_input.waveform_to_examples(src, sr)\n",
+ "\n",
+ "        X_train.append(input_batch)\n",
+ "\n",
+ "        for _ in range(input_batch.shape[0]):\n",
+ "            if label == 0:\n",
+ "                y_train.append(np.array([1, 0, 0]))\n",
+ "            elif label == 1:\n",
+ "                y_train.append(np.array([0, 1, 0]))\n",
+ "            else:\n",
+ "                y_train.append(np.array([0, 0, 1]))\n",
+ "\n",
+ "    X_train = np.vstack(X_train)\n",
+ "    y_train = np.vstack(y_train)\n",
+ "    return X_train, y_train\n",
+ "\n",
+ "# Do preprocessing for training\n",
+ "X_train, y_train = batch_preprocess_files(train_data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Preparing inputs\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "9471854189a445e4bac3ddc7f52e8e5c",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "HBox(children=(IntProgress(value=0, max=60), HTML(value='')))"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Do preprocessing for testing\n",
+ "X_test, y_test = batch_preprocess_files(test_data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/emmanouc/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
+ "_________________________________________________________________\n",
+ "Layer (type)                 Output Shape              Param #\n",
+ "=================================================================\n",
+ "conv1 (Conv2D)               (None, 96, 64, 64)        640\n",
+ "_________________________________________________________________\n",
+ "pool1 (MaxPooling2D)         (None, 48, 32, 64)        0\n",
+ "_________________________________________________________________\n",
+ "conv2 (Conv2D)               (None, 48, 32, 128)       73856\n",
+ "_________________________________________________________________\n",
+ "pool2 (MaxPooling2D)         (None, 24, 16, 128)       0\n",
+ "_________________________________________________________________\n",
+ "conv3_1 (Conv2D)             (None, 24, 16, 256)       295168\n",
+ "_________________________________________________________________\n",
+ "conv3_2 (Conv2D)             (None, 24, 16, 256)       590080\n",
+ "_________________________________________________________________\n",
+ "pool3 (MaxPooling2D)         (None, 12, 8, 256)        0\n",
+ "_________________________________________________________________\n",
+ "conv4_1 (Conv2D)             (None, 12, 8, 512)        1180160\n",
+ "_________________________________________________________________\n",
+ "conv4_2 (Conv2D)             (None, 12, 8, 512)        2359808\n",
+ "_________________________________________________________________\n",
+ "pool4 (MaxPooling2D)         (None, 6, 4, 512)         0\n",
+ "_________________________________________________________________\n",
+ "flatten (Flatten)            (None, 12288)             0\n",
+ "_________________________________________________________________\n",
+ "fc1_1 (Dense)                (None, 4096)              50335744\n",
+ "_________________________________________________________________\n",
+ "dense_1 (Dense)              (None, 256)               1048832\n",
+ "_________________________________________________________________\n",
+ "dropout_1 (Dropout)          (None, 256)               0\n",
+ "_________________________________________________________________\n",
+ "dense_2 (Dense)              (None, 3)                 771\n",
+ "=================================================================\n",
+ "Total params: 55,885,059\n",
+ "Trainable params: 1,049,603\n",
+ "Non-trainable params: 54,835,456\n",
+ "_________________________________________________________________\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Construct model\n",
+ "model = Sequential()\n",
+ "\n",
+ "# Add the convolutional layers of VGGish to the\n",
+ "# newly created model\n",
+ "for layer in vggish_model.layers[:-2]:\n",
+ "    model.add(layer)\n",
+ "\n",
+ "# Freeze their weights\n",
+ "for layer in model.layers:\n",
+ "    layer.trainable = False\n",
+ "\n",
+ "# Add a fully connected layer with dropout (for regularization)\n",
+ "# as well as a final classification layer with the softmax activation function\n",
+ "# so that it gives us a categorical probability mass function\n",
+ "model.add(keras.layers.Dense(256, activation='relu'))\n",
+ "model.add(keras.layers.Dropout(0.9))\n",
+ "model.add(keras.layers.Dense(3, activation='softmax'))\n",
+ "\n",
+ "model.summary()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below we do the actual training. On a GPU you can increase the batch size and the number of epochs; since this was run on a CPU, we keep those values low.\n",
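+ "\n",
+ "One detail worth noting: the convolutional layers expect a trailing channel axis, so the `(num_segments, 96, 64)` examples are fed in as `(num_segments, 96, 64, 1)`. A minimal illustration (`X_train_4d` is just a hypothetical name; the training cell below does this inline):\n",
+ "\n",
+ "```python\n",
+ "X_train_4d = X_train[:, :, :, None]  # (num_segments, 96, 64) -> (num_segments, 96, 64, 1)\n",
+ "```"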
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/emmanouc/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n",
+ "Train on 3741 samples, validate on 416 samples\n",
+ "Epoch 1/4\n",
+ "3741/3741 [==============================] - 55s 15ms/step - loss: 0.3128 - acc: 0.8781 - val_loss: 0.0751 - val_acc: 0.9856\n",
+ "Epoch 2/4\n",
+ "3741/3741 [==============================] - 55s 15ms/step - loss: 0.1017 - acc: 0.9725 - val_loss: 0.0669 - val_acc: 0.9808\n",
+ "Epoch 3/4\n",
+ "3741/3741 [==============================] - 56s 15ms/step - loss: 0.0634 - acc: 0.9805 - val_loss: 0.0506 - val_acc: 0.9832\n",
+ "Epoch 4/4\n",
+ "3741/3741 [==============================] - 56s 15ms/step - loss: 0.0555 - acc: 0.9848 - val_loss: 0.0645 - val_acc: 0.9784\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from keras.callbacks import ReduceLROnPlateau\n",
+ "reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,\n",
+ "                              patience=5, min_lr=0.0001)\n",
+ "model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['accuracy'])\n",
+ "model.fit(X_train[:,:,:,None], y_train, validation_split=0.1, epochs=4, batch_size=16, callbacks=[reduce_lr])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As in the simple Music/Speech classification, we print a classification report on the test set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "              precision    recall  f1-score   support\n",
+ "\n",
+ "           0       1.00      0.97      0.99       713\n",
+ "           1       0.98      0.98      0.98       527\n",
+ "           2       0.96      0.99      0.98       494\n",
+ "\n",
+ "   micro avg       0.98      0.98      0.98      1734\n",
+ "   macro avg       0.98      0.98      0.98      1734\n",
+ "weighted avg       0.98      0.98      0.98      1734\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "y_pred = model.predict(X_test[:,:,:,None])\n",
+ "print(classification_report(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As before, feeding the test data through the model ensures it is fully built, so that we can save a usable model."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here we can't save the model with the `save` function (I am unsure why; it seems related to Keras and the input shapes), so we save it as two files: a `.json` with the model architecture and a `.h5` with the model weights."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model_json = model.to_json()\n",
+ "with open('music_speech_sfx_discriminator.json', 'w') as f:\n",
+ "    f.write(model_json)\n",
+ "\n",
+ "model.save_weights('music_speech_sfx_discriminator.h5')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}