From 21619a6a18c590dbea273cc9eaae7c09707fa2ec Mon Sep 17 00:00:00 2001
From: Sreedevi Velagala <120698597+sridevi1209@users.noreply.github.com>
Date: Wed, 2 Oct 2024 15:12:22 -0400
Subject: [PATCH] Update README.md

updated readme content
---
 README.md | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 1c68f55..97b3257 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,54 @@
-# Fraud Detection using Machine Learning
+# Guidance for Fraud Detection using Machine Learning on AWS
+
+## Table of Contents
+
+1. [Overview](#overview)
+    - [AWS services](#aws-services)
+    - [Cost](#cost)
+2. [Prerequisites](#prerequisites)
+3. [Architecture](#architecture)
+4. [Deployment Steps](#deployment-steps)
+5. [Deployment Validation](#deployment-validation)
+6. [Running the Guidance](#running-the-guidance)
+7. [Next Steps](#next-steps)
+8. [Cleanup](#cleanup)
+9. [Notices](#notices)
+
+## Overview
 
 With businesses moving online, fraud and abuse in online systems is constantly increasing as well. Traditionally, rule-based fraud detection systems are used to combat online fraud, but these rely on a static set of rules created by human experts. This project uses machine learning to create models for fraud detection that are dynamic, self-improving and maintainable. Importantly, they can scale with the online business.
 
 Specifically, we show how to use Amazon SageMaker to train supervised and unsupervised machine learning models on historical transactions, so that they can predict the likelihood of incoming transactions being fraudulent or not. We also show how to deploy the models, once trained, to a REST API that can be integrated into an existing business software infrastructure.
 This project includes a demonstration of this process using a public, anonymized credit card transactions [dataset provided by ULB](https://www.kaggle.com/mlg-ulb/creditcardfraud), but can be easily modified to work with custom labelled or unlaballed data provided as a relational table in csv format.
 
-## Getting Started
+### AWS services
+
+- [Amazon S3](https://docs.aws.amazon.com/s3/?icmpid=docs_homepage_featuredsvcs)
+- [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html?icmpid=docs_sagemaker_lp/index.html)
+- [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html)
+- [Amazon API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html)
+- [Amazon Kinesis Data Firehose](https://aws.amazon.com/firehose/)
+- [Amazon QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html)
+
+### Cost
+
+The following table provides a sample cost breakdown for deploying this
+Guidance with the default parameters in the US East (N. Virginia) Region
+for one month.
+
+| **AWS service** | Dimensions | Monthly cost \[USD\] |
+| ----------------- | ------------------------------------------------------ | ---------------------------------------------- |
+| Amazon S3 Standard | S3 Standard storage (10 GB per month), Data returned by S3 Select (10 GB per month), Data scanned by S3 Select (10 GB per month) | $0.26 |
+| Amazon S3 Data Transfer | DT Inbound: Not selected (0 TB per month), DT Outbound: Not selected (0 TB per month) | $0.00 |
+| Amazon SageMaker Studio Notebooks | Instance name (ml.c5.12xlarge), Number of data scientist(s) (5), Number of Studio Notebook instances per data scientist (2), Studio Notebook hour(s) per day (3), Studio Notebook day(s) per month (10) | $734.40 |
+| Amazon SageMaker Processing | Storage (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of processing jobs per month (20), Number of instances per job (2), Hour(s) per instance per job (2) | $39.24 |
+| Amazon SageMaker Training | Storage (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of training jobs per month (20), Number of instances per job (2), Hour(s) per instance per job (3) | $58.76 |
+| Amazon SageMaker Real-Time Inference | Storage (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of models deployed (5), Number of models per endpoint (5), Number of instances per endpoint (2), Endpoint hour(s) per day (3), Endpoint day(s) per month (20), Data Processed IN (10 GB), Data Processed OUT (8 GB) | $59.05 |
+| AWS Lambda | Architecture (x86), Invoke Mode (Buffered), Amount of ephemeral storage allocated (512 MB), Number of requests (1 million per month) | $0.00 |
+| Amazon API Gateway | Cache memory size (GB) (None), WebSocket message units (thousands), HTTP API requests units (millions), Average size of each request (34 KB), REST API request units (millions), Average message size (32 KB), Requests (1 per month) | $1.00 |
+| Amazon Kinesis Data Firehose | Dynamic Partitioning (Add On) (Disabled), Source Type (Direct PUT or Kinesis Data Stream), Average ratio of data processed to VPC vs data ingested (1.3), Data records units (millions), Record size (100 KB), Data format conversion (optional) (Disabled), Number of records for data ingestion (1 per month), Data format conversion (optional) (Disabled), Data records units (thousands), Record size (5 KB) | $2.77 |
+| Amazon S3 Standard | S3 Standard storage (100 GB per month) | $2.30 |
+| Total | | $897.78 |
+## Prerequisites
 
 You will need an AWS account to use this solution. Sign up for an account [here](https://aws.amazon.com/).
@@ -22,7 +66,7 @@ The project architecture deployed by the cloud formation template is shown here.
 
 ![](deployment/architecture.png)
 
-## Project Description
+## Deployment Steps
 
 The project uses Amazon SageMaker to train both a supervised and an unsupervised machine learning models, which are then deployed using Amazon Sagemaker-managed endpoints.
 If you have labels for your data, for example if some of the transactions have been annotated as fraudulent and some as legitimate, then you can train a supervised learning model to learn to discern the two classes. In this project, we provide a recipe to train a gradient boosted decision tree model using [XGBoost on Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). The supervised model training process also handles the common issue of working with highly imbalanced data in fraud detection problems. The project addresses this issue into two ways by 1) implementing data upsampling using the "imbalanced-learn" package, and 2) using scale position weight to control the balance of positive and negative weights.
@@ -62,7 +106,7 @@ We cite the following works:
 * Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca.
 Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing
 
-## Contents
+## Running the Guidance
 
 * `deployment/`
   * `fraud-detection-using-machine-learning.yaml`: Creates AWS CloudFormation Stack for solution
@@ -88,4 +132,11 @@ We cite the following works:
 
 This project is licensed under the Apache-2.0 License.
 
+## Notices
+
+_Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers._
+
+
+
+