Fraud detection is a critical part of any business. Discover how data management and versioning with lakeFS enables repeatable, version-controlled data sets, using familiar workflows and processes, while reducing storage costs for generative and predictive AI applications.
The purpose of this AI quickstart is to highlight the benefits of data versioning, provided by lakeFS, in an AI/ML environment. lakeFS lets a data engineer manage the lifecycle of data with the same workflow a developer uses to manage source code in Git. This means that, like source code, data can be versioned, branched, merged, and pulled from a Git-like repository, while the data itself is stored in backend object storage.
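As a rough illustration of that Git-like workflow, the sketch below walks through branching, uploading, committing, and merging with the `lakectl` CLI. The repository name, branch name, file name, and the `data/` path are hypothetical and not taken from this quickstart. The `run` helper is also an assumption, added so that commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of a Git-like data workflow with the lakectl CLI.
# Repository, branch, and file names are hypothetical placeholders.

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

new_data_version() {
    repo=$1
    branch=$2
    file=$3
    # Branch from main; branching is a metadata operation, objects are not copied.
    run lakectl branch create "lakefs://$repo/$branch" --source "lakefs://$repo/main"
    # Upload the changed training data to the new branch.
    run lakectl fs upload "lakefs://$repo/$branch/data/$file" --source "$file"
    # Commit to record a version of the data on the branch.
    run lakectl commit "lakefs://$repo/$branch" -m "Update $file"
    # Merge the branch back into main once the new data has been validated.
    run lakectl merge "lakefs://$repo/$branch" "lakefs://$repo/main"
}
```

Calling `new_data_version fraud-demo retrain transactions.csv` prints the four `lakectl` commands in order. Setting `DRY_RUN=0` would run them against a configured lakeFS server.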
The quickstart will allow a demonstrator to quickly deploy both object storage, using MinIO, and lakeFS to serve as a git-like gateway that data engineers can interface with for data access. The following steps can be run very quickly:
- Deploy MinIO for on-premises object storage, running on the local OpenShift cluster
- Deploy an instance of lakeFS for git-like management of data and data versioning
- Deploy fraud detection notebooks in OpenShift AI
- Create and train a model using the notebooks and data
- Serve the trained model
- Perform fraud detection on sample transactions data
- Update the training data and retrain the model using the new data version
- Perform fraud detection on a new version of the sample transaction data
- Show how OpenShift AI pipelines can be used to retrain and/or perform detection on new versions of training and sample data
This quickstart was developed and tested on an OpenShift cluster with the following components and resources. These can be considered the minimum requirements.
| Node Type | Qty | vCPU | Memory (GB) |
|---|---|---|---|
| Control Plane | 3 | 8 | 16 |
| Worker | 3 | 8 | 16 |
Note: A GPU is not required for this quickstart.
This quickstart was tested with the following software versions:
| Software | Version |
|---|---|
| Red Hat OpenShift | 4.20.5 |
| Red Hat OpenShift Service Mesh | 2.5.11-0 |
| Red Hat OpenShift Serverless | 1.37.0 |
| Red Hat OpenShift AI | 2.25 |
| helm | 3.17.1 |
| lakeFS | 1.73.0 |
| MinIO | TBD |
The user performing this quickstart must be able to create a project in OpenShift and OpenShift AI. This requires the admin cluster role; cluster-admin is not required.
The process is very simple. Just follow the steps below.
The steps assume the following prerequisite products and components are deployed and functional, with the required permissions, on the cluster:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift Service Mesh
- Red Hat OpenShift Serverless
- Red Hat OpenShift AI
- User has `admin` permissions in the cluster
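Before starting, it can help to confirm that the command-line tools used in the steps are installed. The following is a minimal preflight sketch. The `check_tools` function is a hypothetical helper, not part of this repository, and the tool list (`oc`, `helm`, `git`) is an assumption based on the commands and software versions listed here.

```shell
#!/bin/sh
# Preflight sketch: report which of the given command-line tools are on PATH.
# check_tools is a hypothetical helper, not part of this repository.

check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found.
    return "$missing"
}
```

For example, `check_tools oc helm git` lists each tool and returns a non-zero status if any of them is missing.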
- Clone this repo
$ git clone https://github.com/rh-ai-quickstart/Fraud-Detection-data-versioning-with-lakeFS.git
- Change to the `deploy` directory
$ cd Fraud-Detection-data-versioning-with-lakeFS/deploy
- Log in to the OpenShift cluster:
$ oc login --token=<user_token> --server=https://api.<openshift_cluster_fqdn>:6443
- Make sure `deploy.sh` is executable and run it, passing it the name of the project in which to install. It can be an existing or new project. In this example, it will deploy to the `lakefs` project.
# Make script executable
$ chmod +x deploy.sh
# Run script passing it the project in which to install
$ ./deploy.sh lakefs
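Because the project name becomes a Kubernetes namespace, it must be a valid DNS-1123 label: lowercase letters, digits, and `-`, starting and ending with an alphanumeric character, and at most 63 characters long. The sketch below checks the name before calling the script. Both functions are hypothetical helpers, and whether `deploy.sh` performs its own validation is not assumed either way.

```shell
#!/bin/sh
# Sketch: validate a project name before handing it to deploy.sh.
# Both functions are hypothetical helpers, not part of this repository.

valid_project_name() {
    name=$1
    # Reject empty names and names longer than 63 characters.
    [ -n "$name" ] || return 1
    [ "${#name}" -le 63 ] || return 1
    # DNS-1123 label: lowercase alphanumerics and '-', alphanumeric at both ends.
    printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

deploy_to_project() {
    if valid_project_name "$1"; then
        ./deploy.sh "$1"
    else
        echo "invalid project name: '$1'" >&2
        return 1
    fi
}
```

With these helpers, `deploy_to_project lakefs` runs the script, while a name such as `LakeFS` or `-lakefs` is rejected before anything is deployed.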
Use the route to access the lakeFS browser-based UI.
- Leave the username set to `admin`
- Enter your email address (or a bogus email address)
- Download the `access_key_id` and `secret_access_key` displayed on the new page, as they will not be accessible later on
- Go back to the login page and log in using those credentials.
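The same credentials can also be used from the command line. As a minimal sketch, the helper below writes them to a `lakectl` configuration file (by default `lakectl` reads `$HOME/.lakectl.yaml`). The function name, the endpoint URL, and the key values in the usage note are placeholders, not values produced by this quickstart.

```shell
#!/bin/sh
# Sketch: store the downloaded lakeFS credentials in a lakectl config file.
# write_lakectl_config is a hypothetical helper; all argument values are placeholders.

write_lakectl_config() {
    config_file=$1
    endpoint=$2
    key_id=$3
    secret=$4
    # Create the file with owner-only permissions because it holds a secret.
    (
        umask 077
        cat > "$config_file" <<EOF
credentials:
  access_key_id: $key_id
  secret_access_key: $secret
server:
  endpoint_url: $endpoint
EOF
    ) && chmod 600 "$config_file"
}
```

A call such as `write_lakectl_config "$HOME/.lakectl.yaml" "https://<lakefs_route_host>" "<access_key_id>" "<secret_access_key>"` writes the file. Running `lakectl repo list` afterwards is one way to check that the credentials work.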
The project the apps were installed in can be deleted, which will delete all of the resources in it, including deployments, secrets, pods, configmaps, etc.
$ oc delete project lakefs
- Product: OpenShift AI
- Partner: lakeFS
- Partner product: lakeFS
- Business challenge: Fraud detection
