This repository was archived by the owner on Jan 3, 2023. It is now read-only.
Commits
24 commits
60bea41
adding docker for bigdl
sujee Dec 11, 2017
80c1f19
run bigdl in docker container
sujee Dec 11, 2017
eaa8fbc
script to launch pyspark with bigdl dependencies
sujee Dec 11, 2017
4130673
gitignore
sujee Dec 11, 2017
7f16467
adding Testing123 notebook
sujee Dec 11, 2017
67fbd52
run bigdl in docker container
sujee Dec 11, 2017
23bf539
script to launch pyspark with bigdl dependencies
sujee Dec 11, 2017
0e2f2d9
gitignore
sujee Dec 11, 2017
4724881
update .gitignore
sujee Dec 15, 2017
01d72f1
Merge branch 'pr/run-scripts'
sujee Dec 15, 2017
56e00e0
adding Testing123 notebook
sujee Dec 11, 2017
f39771d
Merge branch 'pr/notebooks-testing123'
sujee Dec 15, 2017
5573f58
removing redundant env variable
sujee Dec 15, 2017
64bdcfd
Added notebooks for videos
timfox456 Jan 19, 2018
9c0b021
move assets into elephantscale folder
sujee Jan 19, 2018
d98ed85
Merge branch 'pr/1-docker'
sujee Jan 19, 2018
7ce5a84
Merge branch 'master' into pr/video-examples
timfox456 Jan 23, 2018
46e0ce3
Moved everything to elephantscale directory
timfox456 Jan 23, 2018
851e95b
Updated some text.
timfox456 Feb 2, 2018
4dbbbaf
Incorporated Yiheng's feedback on feedforward credit card fraud
timfox456 Feb 2, 2018
dc48602
Added some more explanation on feedforward notebooks
timfox456 Feb 2, 2018
77b536d
Added warning about --executor memory 16gb
timfox456 Feb 2, 2018
56a0dac
Added README.md
timfox456 Feb 2, 2018
0eb257b
Merge github.com:intel-analytics/BigDL-trainings into pr/video-examples2
timfox456 Feb 3, 2018
126 changes: 34 additions & 92 deletions elephantscale/notebooks/feedforward-credit-card-fraud-pipeline.ipynb
@@ -4,27 +4,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feedforward Network with Credit Card Fraud\n",
"# Feedforward Network with Credit Card Fraud (Pipeline API version)\n",
"\n",
"Credit Card Transactions Fraud Detection Example:\n",
"\n",
"The notebook demonstrates how to develop a fraud detection application with the BigDL deep learning library on Apache Spark. We'll try to introduce some techniques that can be used for training a fraud detection model, but some advanced skills is not applicable since the dataset is highly simplified.\n",
"\n",
"Dataset: Credit Card Fraud Detection https://www.kaggle.com/dalpozz/creditcardfraud\n",
"\n",
"This dataset presents transactions that occurred in two days, where we got 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.\n",
"\n",
"It contains only numerical input variables which are the result of a PCA transformation. \n",
"\n",
"Unfortunately, due to confidentiality issues, we cannot find the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.\n",
"\n",
"Contact: [email protected]\n"
"Please view the credit-card-fraud example, which will show how to do this in with an RDD-based approach. Here, we will follow the same approach with a dataframe-based pipeline API based on Spark MLLib.\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
@@ -124,7 +114,9 @@
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cc_training = spark.read.csv(\"../data/creditcardfraud/creditcard.csv\", header=True, inferSchema=\"true\", mode=\"DROPMALFORMED\")"
@@ -161,7 +153,9 @@
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cc_training = cc_training.select([col(c).cast(\"double\") for c in cc_training.columns])\n",
@@ -206,7 +200,9 @@
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# get the time to split the data.\n",
@@ -273,7 +269,9 @@
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"assembler = VectorAssembler(inputCols=cols, outputCol=\"assembled\")\n",
@@ -289,23 +287,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### step 2. split the dataset into training and validation dataset.\n",
"\n",
"Unlike some other training dataset, where the data does not have a time of occurance. For this case, we can know the sequence of the transactions from the Time column. Thus randomly splitting the data into training and validation does not make much sense, since in real world applications, we can only use the history transactions for training and use the latest transactions for validation. Thus we'll split the dataset according the time of occurance."
]
},
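The cell that actually performs this split is collapsed in this diff view. Purely as a hedged illustration of the time-based split described in the cell above, a minimal PySpark sketch might look like the following; the 0.7 quantile cut-off and the `train_data` / `test_data` names are assumptions, not taken from the notebook, while `cc_training` and the `Time` column are:

```python
from pyspark.sql.functions import col

# Hedged sketch: split by transaction time rather than randomly.
# The 0.7 quantile cut-off is an assumption for illustration only.
split_time = cc_training.approxQuantile("Time", [0.7], 0.001)[0]

train_data = cc_training.filter(col("Time") <= split_time)  # earlier transactions for training
test_data = cc_training.filter(col("Time") > split_time)    # latest transactions for validation
```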
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### step 3. Build the model with BigDL\n",
"## step 2. Build the model with BigDL\n",
"From the research community and industry feedback, a simple neural network turns out be the perfect candidate for the fraud detection training. We will quickly build a multiple layer Perceptron with linear layers.\n",
"```\n",
" val bigDLModel = Sequential()\n",
@@ -380,12 +362,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate our Model\n",
"\n",
"Now we have finished the training of our first model (which is certainly not the best, keep reading!).\n",
"\n",
"We'll need to think about how do evaluate the trained model:\n",
"\n",
"Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification. Since even if the model predicts all the records as normal transactions, it will still get an accuracy above 99%."
"Now we are goin to do the Precision, Recall, and AUC. To evaluate our model."
]
},
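The evaluation cells themselves are collapsed in this diff. As a hedged sketch of computing precision, recall, and the area under the precision-recall curve once predictions are collected into NumPy arrays (as the `y_pred` cell further below does; `y_true` is assumed to be collected the same way from the label column), scikit-learn could be used along these lines:

```python
from sklearn.metrics import precision_score, recall_score, precision_recall_curve, auc

# Hedged sketch: y_true and y_pred are NumPy arrays of labels and predictions.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# AUPRC; ideally prediction scores/probabilities would be used here rather than
# hard 0/1 predictions, which yield only a coarse curve.
p, r, _ = precision_recall_curve(y_true, y_pred)
auprc = auc(r, p)
print("precision=%.3f recall=%.3f AUPRC=%.3f" % (precision, recall, auprc))
```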
{
@@ -433,7 +412,9 @@
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = np.array(predictionDF.select('prediction').collect())\n",
Expand All @@ -443,7 +424,9 @@
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
@@ -474,74 +457,33 @@
"sn.heatmap(df_cm, annot=True,fmt='d');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To this point, we have finished the training and evaluation with a simple BigDL model. We can see that even though the recall and precision are high, the area under precision-recall curve is not optimistic. That's because we haven't really apply any technique to handle the imbalanced training data.\n",
"\n",
"Next we'll try to optimize the training process.\n",
"\n",
"### step 4. handle the data imbalance\n",
"There are several ways to approach this classification problem taking into consideration this unbalance.\n",
"\n",
"Collect more data? Nice strategy but not applicable in this case.\n",
"\n",
"Resampling the dataset Essentially this is a method that will process the data to have an approximate 50-50 ratio. One way to achieve this is by OVER-sampling, which is adding copies of the under-represented class (better when there're little data) Another is UNDER-sampling, which deletes instances from the over-represented class (better when there are lots of data)\n",
"Apart from under and over sampling, there is a very popular approach called SMOTE (Synthetic Minority Over-Sampling Technique), which is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.\n",
"\n",
"We'll start with Resampling.\n",
"\n",
"Since there're 492 frauds out of 284,807 transactions, to build a reasonable training dataset, we'll use UNDER-sampling for normal transactions and use OVER-sampling for fraud transactions. By using the sampling rate as fraud -> 10, normal -> 0.05, we can get a training dataset of (5K fraud + 14K normal) transactions. We can use the training data to fit a model.\n",
"\n",
"Yet we'll soon find that since there're only 5% of all the normal transactions are included in the training data, the model can only cover 5% of all the normal transactions, which is obviousely not optimistic. So how can we get a better converage for the normal transactions without breaking the ideal ratio in the training dataset?\n",
"\n",
"An immediate improvement would be to train multiple models. For each model, we will run the resampling from the original dataset and get a new training data set. After training, we can select best voting strategy for all the models to make the prediction.\n",
"\n",
"We'll use Ensembling of neural networks. That's where a Bagging classifier becomes handy. Bagging is an Estimator we developed for ensembling of multiple other Estimator.\n",
"\n",
"```\n",
"package org.apache.spark.ml.ensemble\n",
"\n",
"class Bagging[M <: Model[M]](override val uid: String)\n",
" extends Estimator[BaggingModel[M]]\n",
" with BaggingParams[M] {\n",
"For usage, user need to set the specific Estimator to use and the number of models to be trained:\n",
" val estimator = new Bagging()\n",
" .setPredictor(dlClassifier)\n",
" .setLabelCol(\"Class\")\n",
" .setIsClassifier(true)\n",
" .setNumModels(10)\n",
"```\n",
"\n",
"Internally, Bagging will train $(numModels) models. Each model is trained with the resampled data from the original dataset."
]
},
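As a hedged illustration of the resampling strategy described above (OVER-sample fraud, UNDER-sample normal, with the rates fraud -> 10 and normal -> 0.05 taken from the text), a minimal PySpark sketch might look like this; the variable names and seed are assumptions, while `cc_training` and the `Class` column come from the notebook:

```python
from pyspark.sql.functions import col

# Hedged sketch of the resampling described above.
fraud = cc_training.filter(col("Class") == 1)
normal = cc_training.filter(col("Class") == 0)

# OVER-sample fraud ~10x (with replacement) and UNDER-sample normal at ~5%.
fraud_over = fraud.sample(withReplacement=True, fraction=10.0, seed=42)
normal_under = normal.sample(withReplacement=False, fraction=0.05, seed=42)

resampled_train = fraud_over.union(normal_under)
```

Training several models on independently resampled datasets and voting over their predictions is what the Bagging estimator quoted above automates.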
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,