Commit

Update README.md
updated readme content
sridevi1209 authored Oct 2, 2024
1 parent fa82134 commit 21619a6
Showing 1 changed file with 55 additions and 4 deletions.
59 changes: 55 additions & 4 deletions README.md
@@ -1,10 +1,54 @@
# Fraud Detection using Machine Learning
# Guidance for Fraud Detection using Machine Learning on AWS

## Table of Contents

1. [Overview](#overview)
    - [AWS services](#aws-services)
    - [Cost](#cost)
2. [Prerequisites](#prerequisites)
3. [Architecture](#architecture)
4. [Deployment Steps](#deployment-steps)
5. [Deployment Validation](#deployment-validation)
6. [Running the Guidance](#running-the-guidance)
7. [Next Steps](#next-steps)
8. [Cleanup](#cleanup)
9. [Notices](#notices)

## Overview

As businesses move online, fraud and abuse in online systems are increasing as well. Traditional rule-based fraud detection systems rely on a static set of rules created by human experts. This project uses machine learning to create fraud detection models that are dynamic, self-improving, and maintainable, and that can scale with the online business.

Specifically, we show how to use Amazon SageMaker to train supervised and unsupervised machine learning models on historical transactions so that they can predict the likelihood that an incoming transaction is fraudulent. We also show how to deploy the trained models to a REST API that can be integrated into existing business software infrastructure. This project includes a demonstration of this process using a public, anonymized credit card transactions [dataset provided by ULB](https://www.kaggle.com/mlg-ulb/creditcardfraud), but it can be easily modified to work with custom labelled or unlabelled data provided as a relational table in CSV format.
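
As a rough illustration of the integration step, the sketch below prepares an HTTP request that a client could send to the deployed REST API. The endpoint URL, the `data` field name, and the single-CSV-row payload layout are assumptions made for illustration, not the contract defined by this project; check the deployed Amazon API Gateway stage for the real values. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the invoke URL of your API Gateway stage.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/invocations"


def build_prediction_request(features, url=API_URL):
    """Build (but do not send) a POST request carrying one transaction.

    `features` is a sequence of numeric feature values for one transaction.
    The JSON body with one CSV row under "data" is an assumed format.
    """
    csv_row = ",".join(str(float(value)) for value in features)
    body = json.dumps({"data": csv_row}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_prediction_request([0.0, -1.36, 0.07, 149.62])
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually call the API: urllib.request.urlopen(request, timeout=10)
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any traffic reaches the endpoint.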

## Getting Started
### AWS services

- [Amazon S3](https://docs.aws.amazon.com/s3/?icmpid=docs_homepage_featuredsvcs)
- [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html?icmpid=docs_sagemaker_lp/index.html)
- [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html)
- [Amazon API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html)
- [Amazon Kinesis Data Firehose](https://aws.amazon.com/firehose/)
- [Amazon QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/welcome.html)

### Cost

The following table provides a sample cost breakdown for deploying this
Guidance with the default parameters in the US East (N. Virginia) Region
for one month.

| **AWS service** | Dimensions | Monthly cost \[USD\] |
| ----------------- | ------------------------------------------------------ | ---------------------------------------------- |
| Amazon S3 Standard | S3 Standard storage (10 GB per month), Data returned by S3 Select (10 GB per month), Data scanned by S3 Select (10 GB per month) | $0.26 |
| Amazon S3 Data Transfer | DT Inbound: Not selected (0 TB per month), DT Outbound: Not selected (0 TB per month) | $0 |
| Amazon SageMaker Studio Notebooks | Instance name (ml.c5.12xlarge), Number of data scientist(s) (5), Number of Studio Notebook instances per data scientist (2), Studio Notebook hour(s) per day (3), Studio Notebook day(s) per month (10) | $734.40 |
| Amazon SageMaker Processing Storage | (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of processing jobs per month (20), Number of instances per job (2), Hour(s) per instance per job (2) | $39.24 |
| Amazon SageMaker Training Storage | (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of training jobs per month (20), Number of instances per job (2), Hour(s) per instance per job (3) | $58.76 |
| Amazon SageMaker Real-Time Inference Storage | (General Purpose SSD (gp2)), Instance name (ml.c4.2xlarge), Number of models deployed (5), Number of models per endpoint (5), Number of instances per endpoint (2), Endpoint hour(s) per day (3), Endpoint day(s) per month (20), Data Processed IN (10 GB), Data Processed OUT (8 GB) | $59.05 |
| AWS Lambda | Architecture (x86), Invoke Mode (Buffered), Amount of ephemeral storage allocated (512 MB), Number of requests (1 million per month) | $0 |
| Amazon API Gateway | Cache memory size (GB) (None), WebSocket message units (thousands), HTTP API requests units (millions), Average size of each request (34 KB), REST API request units (millions), Average message size (32 KB), Requests (1 per month) | $1 |
| Amazon Kinesis Data Firehose | Dynamic Partitioning (Add On) (Disabled), Source Type (Direct PUT or Kinesis Data Stream), Average ratio of data processed to VPC vs data ingested (1.3), Data records units (millions), Record size (100 KB), Data format conversion (optional) (Disabled), Number of records for data ingestion (1 per month), Data format conversion (optional) (Disabled), Data records units (thousands), Record size (5 KB) | $2.77 |
| Amazon S3 Standard | S3 Standard storage (100 GB per month) | $2.30 |
| Total | | $897.78 |
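
The total can be checked by summing the per-service figures. The short sketch below copies the monthly costs from the table (the dictionary keys are shortened labels chosen here for readability) and adds them with `Decimal` to avoid floating-point rounding.

```python
from decimal import Decimal

# Monthly costs in USD, copied from the table above in row order.
MONTHLY_COSTS = {
    "S3 Standard (10 GB)": "0.26",
    "S3 Data Transfer": "0",
    "SageMaker Studio Notebooks": "734.4",
    "SageMaker Processing": "39.24",
    "SageMaker Training": "58.76",
    "SageMaker Real-Time Inference": "59.05",
    "Lambda": "0",
    "API Gateway": "1",
    "Kinesis Data Firehose": "2.77",
    "S3 Standard (100 GB)": "2.3",
}


def total_monthly_cost(costs):
    """Sum the per-service monthly costs exactly."""
    return sum((Decimal(value) for value in costs.values()), Decimal("0"))


if __name__ == "__main__":
    print(total_monthly_cost(MONTHLY_COSTS))  # prints 897.78
```

The SageMaker Studio notebook line dominates the estimate, so adjusting the instance type or notebook hours changes the total the most.
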
## Prerequisites

You will need an AWS account to use this solution. Sign up for an account [here](https://aws.amazon.com/).

@@ -22,7 +66,7 @@ The project architecture deployed by the cloud formation template is shown here.

![](deployment/architecture.png)

## Project Description
## Deployment Steps
The project uses Amazon SageMaker to train both a supervised and an unsupervised machine learning model, which are then deployed using Amazon SageMaker-managed endpoints.

If you have labels for your data, for example if some of the transactions have been annotated as fraudulent and some as legitimate, then you can train a supervised learning model to distinguish the two classes. In this project, we provide a recipe to train a gradient boosted decision tree model using [XGBoost on Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). The supervised model training process also handles the common issue of working with highly imbalanced data in fraud detection problems. The project addresses this issue in two ways: 1) upsampling the data using the "imbalanced-learn" package, and 2) using XGBoost's `scale_pos_weight` hyperparameter to control the balance of positive and negative weights.
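
The two imbalance-handling ideas can be sketched in a few lines. This is a minimal illustration using only the standard library, not the project's code: the project uses the "imbalanced-learn" package, whereas this sketch substitutes plain random oversampling, and the function names are hypothetical. The negative-to-positive ratio is a common starting value for XGBoost's `scale_pos_weight` hyperparameter.

```python
import random


def scale_pos_weight(labels):
    """Return the negative/positive ratio, a common starting value
    for XGBoost's scale_pos_weight hyperparameter."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("labels contain no positive (fraud) examples")
    return negatives / positives


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same count (a simple stand-in for imbalanced-learn)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label != 1]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("one class has no examples to resample")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    labels = [0] * 8 + [1] * 2          # toy data: 20% fraud
    rows = [[float(i)] for i in range(len(labels))]
    print(scale_pos_weight(labels))      # prints 4.0
    balanced_rows, balanced_labels = random_oversample(rows, labels)
    print(len(balanced_labels), sum(balanced_labels))  # prints 16 8
```

Oversampling should be applied to the training split only, so that duplicated fraud examples do not leak into validation or test data.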
@@ -62,7 +106,7 @@ We cite the following works:
* Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018. Springer International Publishing.


## Contents
## Running the Guidance

* `deployment/`
* `fraud-detection-using-machine-learning.yaml`: Creates AWS CloudFormation Stack for solution
@@ -88,4 +132,11 @@ We cite the following works:

This project is licensed under the Apache-2.0 License.

## Notices

_Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers._




