This repository was archived by the owner on Jan 3, 2023. It is now read-only.
Commits
24 commits
60bea41
adding docker for bigdl
sujee Dec 11, 2017
80c1f19
run bigdl in docker container
sujee Dec 11, 2017
eaa8fbc
script to launch pyspark with bigdl dependencies
sujee Dec 11, 2017
4130673
gitignore
sujee Dec 11, 2017
7f16467
adding Testing123 notebook
sujee Dec 11, 2017
67fbd52
run bigdl in docker container
sujee Dec 11, 2017
23bf539
script to launch pyspark with bigdl dependencies
sujee Dec 11, 2017
0e2f2d9
gitignore
sujee Dec 11, 2017
4724881
update .gitignore
sujee Dec 15, 2017
01d72f1
Merge branch 'pr/run-scripts'
sujee Dec 15, 2017
56e00e0
adding Testing123 notebook
sujee Dec 11, 2017
f39771d
Merge branch 'pr/notebooks-testing123'
sujee Dec 15, 2017
5573f58
removing redundant env variable
sujee Dec 15, 2017
64bdcfd
Added notebooks for videos
timfox456 Jan 19, 2018
9c0b021
move assets into elephantscale folder
sujee Jan 19, 2018
d98ed85
Merge branch 'pr/1-docker'
sujee Jan 19, 2018
7ce5a84
Merge branch 'master' into pr/video-examples
timfox456 Jan 23, 2018
46e0ce3
Moved everything to elephantscale directory
timfox456 Jan 23, 2018
851e95b
Updated some text.
timfox456 Feb 2, 2018
4dbbbaf
Incorporated Yiheng's feedback on feedforward credit card fraud
timfox456 Feb 2, 2018
dc48602
Added some more explanation on feedforward notebooks
timfox456 Feb 2, 2018
77b536d
Added warning about --executor memory 16gb
timfox456 Feb 2, 2018
56a0dac
Added README.md
timfox456 Feb 2, 2018
0eb257b
Merge github.com:intel-analytics/BigDL-trainings into pr/video-examples2
timfox456 Feb 3, 2018
126 changes: 34 additions & 92 deletions elephantscale/notebooks/feedforward-credit-card-fraud-pipeline.ipynb
@@ -4,27 +4,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feedforward Network with Credit Card Fraud\n",
"# Feedforward Network with Credit Card Fraud (Pipeline API version)\n",
"\n",
"Credit Card Transactions Fraud Detection Example:\n",
"\n",
"The notebook demonstrates how to develop a fraud detection application with the BigDL deep learning library on Apache Spark. We'll try to introduce some techniques that can be used for training a fraud detection model, but some advanced skills is not applicable since the dataset is highly simplified.\n",
"\n",
"Dataset: Credit Card Fraud Detection https://www.kaggle.com/dalpozz/creditcardfraud\n",
"\n",
"This dataset presents transactions that occurred in two days, where we got 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.\n",
"\n",
"It contains only numerical input variables which are the result of a PCA transformation. \n",
"\n",
"Unfortunately, due to confidentiality issues, we cannot find the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.\n",
"\n",
"Contact: [email protected]\n"
"Please view the credit-card-fraud example, which will show how to do this in with an RDD-based approach. Here, we will follow the same approach with a dataframe-based pipeline API based on Spark MLLib.\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
@@ -124,7 +114,9 @@
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cc_training = spark.read.csv(\"../data/creditcardfraud/creditcard.csv\", header=True, inferSchema=\"true\", mode=\"DROPMALFORMED\")"
@@ -161,7 +153,9 @@
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cc_training = cc_training.select([col(c).cast(\"double\") for c in cc_training.columns])\n",
@@ -206,7 +200,9 @@
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# get the time to split the data.\n",
@@ -273,7 +269,9 @@
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"assembler = VectorAssembler(inputCols=cols, outputCol=\"assembled\")\n",
@@ -289,23 +287,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### step 2. split the dataset into training and validation dataset.\n",
"\n",
"Unlike some other training dataset, where the data does not have a time of occurance. For this case, we can know the sequence of the transactions from the Time column. Thus randomly splitting the data into training and validation does not make much sense, since in real world applications, we can only use the history transactions for training and use the latest transactions for validation. Thus we'll split the dataset according the time of occurance."
]
},
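The cell that actually performs this split is collapsed in this diff view. Purely as a hedged illustration of the time-based split described in the cell above, a minimal PySpark sketch might look like the following; the 0.7 quantile cut-off and the `train_data` / `test_data` names are assumptions, not taken from the notebook, while `cc_training` and the `Time` column are:

```python
from pyspark.sql.functions import col

# Hedged sketch: split by transaction time rather than randomly.
# The 0.7 quantile cut-off is an assumption for illustration only.
split_time = cc_training.approxQuantile("Time", [0.7], 0.001)[0]

train_data = cc_training.filter(col("Time") <= split_time)  # earlier transactions for training
test_data = cc_training.filter(col("Time") > split_time)    # latest transactions for validation
```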
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### step 3. Build the model with BigDL\n",
"## step 2. Build the model with BigDL\n",
"From the research community and industry feedback, a simple neural network turns out be the perfect candidate for the fraud detection training. We will quickly build a multiple layer Perceptron with linear layers.\n",
"```\n",
" val bigDLModel = Sequential()\n",
@@ -380,12 +362,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate our Model\n",
"\n",
"Now we have finished the training of our first model (which is certainly not the best, keep reading!).\n",
"\n",
"We'll need to think about how do evaluate the trained model:\n",
"\n",
"Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification. Since even if the model predicts all the records as normal transactions, it will still get an accuracy above 99%."
"Now we are goin to do the Precision, Recall, and AUC. To evaluate our model."
]
},
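The evaluation cells themselves are collapsed in this diff. As a hedged sketch of computing precision, recall, and the area under the precision-recall curve once predictions are collected into NumPy arrays (as the `y_pred` cell further below does; `y_true` is assumed to be collected the same way from the label column), scikit-learn could be used along these lines:

```python
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, auc

# Hedged sketch: y_true and y_pred are NumPy arrays of labels and predictions.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# AUPRC; ideally prediction scores/probabilities would be used here rather than
# hard 0/1 predictions, which yield only a coarse curve.
p, r, _ = precision_recall_curve(y_true, y_pred)
auprc = auc(r, p)
print("precision=%.3f recall=%.3f AUPRC=%.3f" % (precision, recall, auprc))
```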
{
@@ -433,7 +412,9 @@
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = np.array(predictionDF.select('prediction').collect())\n",
Expand All @@ -443,7 +424,9 @@
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
@@ -474,74 +457,33 @@
"sn.heatmap(df_cm, annot=True,fmt='d');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To this point, we have finished the training and evaluation with a simple BigDL model. We can see that even though the recall and precision are high, the area under precision-recall curve is not optimistic. That's because we haven't really apply any technique to handle the imbalanced training data.\n",
"\n",
"Next we'll try to optimize the training process.\n",
"\n",
"### step 4. handle the data imbalance\n",
"There are several ways to approach this classification problem taking into consideration this unbalance.\n",
"\n",
"Collect more data? Nice strategy but not applicable in this case.\n",
"\n",
"Resampling the dataset Essentially this is a method that will process the data to have an approximate 50-50 ratio. One way to achieve this is by OVER-sampling, which is adding copies of the under-represented class (better when there're little data) Another is UNDER-sampling, which deletes instances from the over-represented class (better when there are lots of data)\n",
"Apart from under and over sampling, there is a very popular approach called SMOTE (Synthetic Minority Over-Sampling Technique), which is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.\n",
"\n",
"We'll start with Resampling.\n",
"\n",
"Since there're 492 frauds out of 284,807 transactions, to build a reasonable training dataset, we'll use UNDER-sampling for normal transactions and use OVER-sampling for fraud transactions. By using the sampling rate as fraud -> 10, normal -> 0.05, we can get a training dataset of (5K fraud + 14K normal) transactions. We can use the training data to fit a model.\n",
"\n",
"Yet we'll soon find that since there're only 5% of all the normal transactions are included in the training data, the model can only cover 5% of all the normal transactions, which is obviousely not optimistic. So how can we get a better converage for the normal transactions without breaking the ideal ratio in the training dataset?\n",
"\n",
"An immediate improvement would be to train multiple models. For each model, we will run the resampling from the original dataset and get a new training data set. After training, we can select best voting strategy for all the models to make the prediction.\n",
"\n",
"We'll use Ensembling of neural networks. That's where a Bagging classifier becomes handy. Bagging is an Estimator we developed for ensembling of multiple other Estimator.\n",
"\n",
"```\n",
"package org.apache.spark.ml.ensemble\n",
"\n",
"class Bagging[M <: Model[M]](override val uid: String)\n",
" extends Estimator[BaggingModel[M]]\n",
" with BaggingParams[M] {\n",
"For usage, user need to set the specific Estimator to use and the number of models to be trained:\n",
" val estimator = new Bagging()\n",
" .setPredictor(dlClassifier)\n",
" .setLabelCol(\"Class\")\n",
" .setIsClassifier(true)\n",
" .setNumModels(10)\n",
"```\n",
"\n",
"Internally, Bagging will train $(numModels) models. Each model is trained with the resampled data from the original dataset."
]
},
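As a hedged illustration of the resampling strategy described above (OVER-sample fraud, UNDER-sample normal, with the rates fraud -> 10 and normal -> 0.05 taken from the text), a minimal PySpark sketch might look like this; the variable names and seed are assumptions, while `cc_training` and the `Class` column come from the notebook:

```python
from pyspark.sql.functions import col

# Hedged sketch of the resampling described above.
fraud = cc_training.filter(col("Class") == 1)
normal = cc_training.filter(col("Class") == 0)

# OVER-sample fraud ~10x (with replacement) and UNDER-sample normal at ~5%.
fraud_over = fraud.sample(withReplacement=True, fraction=10.0, seed=42)
normal_under = normal.sample(withReplacement=False, fraction=0.05, seed=42)

resampled_train = fraud_over.union(normal_under)
```

Training several models on independently resampled datasets and voting over their predictions is what the Bagging estimator quoted above automates.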
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,