MLOps & Kubeflow

Table of Contents

  • Overview
  • ML Pipeline Tasks
  • Kubeflow
    • Overview & Components
    • Jupyter Notebooks Hub
    • Kubeflow Pipelines
    • Kale Jupyter Notebook to Pipeline Conversion
    • Parameter Tuning using Katib
    • Model Registry
    • Kubeflow Serving
    • Kubeflow Operators
  • Kubeflow GPUs


MLOps Overview

A picture speaks a thousand words

ML Pipeline Tasks

What are the steps/tasks in “Data Acquisition” and “Data Preparation & Processing” pipelines?

  • Data Acquisition
    • Crawler & Scraper
    • Data Ingestion Processing: Streaming, Batch
  • Data Validation
  • Data Wrangling & Cleaning
    • Text
      • Raw Text, Tokenization, Word Embeddings
  • Data Versioning

What are the steps in "feature engineering" pipelines?

  • Exploratory Data Analysis
  • Transform, Binning, Temporal Features
  • Clean, Normalize, Impute Missing, Extract Features, Encode Features
  • Feature Selection
  • Feature Store

How do you package model artifacts?

  • Model Packaging
    • .pkl
    • ONNX
    • TorchScript
    • Containerization

pickle (and joblib by extension) has some issues regarding maintainability and security (see the short example after this list). Because of this:

  • Never unpickle untrusted data as it could lead to malicious code being executed upon loading.
  • While models saved using one version of scikit-learn might load in other versions, this is entirely unsupported and inadvisable.
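
For completeness, a minimal joblib save/load round trip (illustrative only; the model and file name are placeholders, and you should only ever load files you trust):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the fitted model to disk (pickle-based format)
joblib.dump(model, 'model.pkl')

# Reload it later -- loading can run arbitrary code if the file is untrusted
loaded = joblib.load('model.pkl')
print(loaded.predict(X[:5]))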

ONNX is a binary serialization of the model

Predictive Model Markup Language (PMML) format might be a better approach than using pickle alone.

TorchServe provides a utility to package all the model artifacts into a single TorchServe Model Archive File (MAR). After the model artifacts are packaged into a MAR file, you then upload it to the model store under the model storage path.

ONNX (Open Neural Network Exchange), an open-source format for representing deep learning models, was developed by Microsoft and is now managed by the Linux Foundation. ONNX resolves this issue by providing a standard format that multiple deep learning frameworks, including TensorFlow, PyTorch, and Caffe2 can use. With ONNX, models can be trained in one framework and then easily exported to other frameworks for inference, making it convenient for developers to experiment with different deep learning frameworks and tools without having to rewrite their models every time they switch frameworks. It can execute models on various hardware platforms, including CPUs, GPUs, and FPGAs, making deploying models on various devices easy.
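
As a concrete example, here is a minimal sketch of exporting a PyTorch model to ONNX and running it with ONNX Runtime (the ResNet-18 model, file name, and input shape are illustrative):

import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export the model graph; the dummy input fixes the traced input shape
torch.onnx.export(
    model,
    dummy_input,
    'resnet18.onnx',
    input_names=['input'],
    output_names=['output'],
)

# The same .onnx file can now be served by ONNX Runtime (or other backends)
session = ort.InferenceSession('resnet18.onnx')
outputs = session.run(None, {'input': dummy_input.numpy()})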

Spin up shared persistent storage across the cluster to store models.

ONNX uses protobuf to serialize the graph

Explain training pipelines. What are experiment pipelines?

  • Model Training
  • Model validation
  • Hyperparameter Tuning
  • Best Model Selection
  • Cross Validation

Explain model deployment or inference pipelines (CD: Continuous Delivery). How is production serving done? How does instrumentation work (logging, monitoring, observability)? Examples: TorchServe, TensorRT.

  • Model Serving

    • KServe Vs. TorchServe
    • K8s, Docker
    • Service - gRPC Vs. REST Vs. Client Libraries
  • Performance, Benchmarking

  • Scoring & Monitoring

How is A/B testing carried out?

  • Experiment Tracking

Kubeflow

Overview

Kubeflow enables you to set up CI/CD pipelines for your machine learning workflows. This allows you to automate the testing, validation, and deployment of models, ensuring a smooth and scalable deployment process.

Where are models stored in Kubeflow? What is a model registry (or model repository)? Are your models stored in S3? What are popular tools for a model store/registry? How are models versioned?

https://github.com/kubeflow/model-registry is the answer. It looks like it is still in alpha.

Kubeflow provides a centralized model repository for storing, versioning, and managing ML models. This is called the Kubeflow Model Registry.

Where does Kubeflow store models?

  • Kubeflow Model Registry stores the model artifacts (weights, hyperparameters, etc.) in a storage service like S3 or GCS.
  • Metadata (model name, version, path, etc.) is stored in a database such as MySQL or PostgreSQL.
  • Using the model registry, models can be versioned, searched, deployed, and integrated into ML pipelines in a centralized way. It provides features like model lineage, security, and governance.
  • To version models, each model build or iteration is assigned a unique version number or ID. New model versions can be registered and looked up by version; older versions are retained for record keeping.

Claude/ChatGPT prompt: Show me a Kubeflow code snippet to store a model in the model registry. Also show code to fetch a model from the model repository using the Kubeflow Pipelines SDK.
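
A hedged sketch of what that could look like with the alpha model-registry Python client (the server address, model name, and storage URI are placeholders, and the constructor/method signatures may change between alpha releases; this uses the registry's own client rather than the Pipelines SDK):

from model_registry import ModelRegistry  # pip install model-registry (alpha)

registry = ModelRegistry(
    server_address='https://model-registry.kubeflow.svc.cluster.local',  # placeholder
    port=443,
    author='your-name',
)

# Register a model whose artifacts already live in object storage (e.g. S3)
registry.register_model(
    'mnist-classifier',
    's3://models/mnist/v1/model.onnx',
    version='1.0.0',
    model_format_name='onnx',
    model_format_version='1',
)

# Fetch the metadata back by name and version
model = registry.get_registered_model('mnist-classifier')
model_version = registry.get_model_version('mnist-classifier', '1.0.0')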

What are typical containers behind Kubeflow Model Registry?

  • Registry Server (kubeflow-model-registry-server) & Registry UI (kubeflow-model-registry-ui)
  • ML Metadata DB: ml-pipeline-ui-db, ml-pipeline-mysql
  • Minio Object Storage: minio
  • PostgreSQL for ML Metadata: kubeflow-ml-pipeline-db
  • Model Inference Servers: kfserving-tensorflow-inference-server, triton-inference-server
  • Push Model Utility (kubeflow-push-model), Fetch Model Utility (kubeflow-fetch-model) & Model Converter (kubeflow-model-converter)

The push-model container provides a Python SDK and CLI to upload model artifacts and metadata to the registry.

push_model_op = kfp.components.load_component_from_url('kubeflow-push-model')  # load_component_from_url expects the URL of the component's YAML spec; 'kubeflow-push-model' stands in for that URL here

How to choose between Kubeflow Model Registry and Hugging Face Model Hub?

  • Kubeflow registry for internal model management
  • HF Model Hub to leverage public/shared models

Hugging Face Model Hub

  • Public repository of a broad range of ML models
  • Emphasis on easy sharing and use of models
  • Integrates with Hugging Face libraries like Transformers
  • Includes preprocessed datasets, notebooks, demos

Kubeflow Pipelines

Pipelines are used to chain multiple steps into an ML lifecycle - data prep, train, evaluate, deploy.

What is a ContainerOp in Kubeflow Pipelines? Isn't it too heavy to create a container for every step?

A ContainerOp represents an execution of a Docker container as a step in the pipeline. Some key points:

  • Every step in the pipeline is wrapped as a ContainerOp. This encapsulates the environment and dependencies for that step.
  • Behind the scenes, the ContainerOp creates a Kubernetes Pod with the specified Docker image to run that step.
  • So yes, containerizing every step introduces computational overhead vs just running Python functions. The containers can be slow to spin up and add resource usage.
  • However, the benefit is that each step runs in an isolated, reproducible environment. This avoids dependency conflicts between steps.
  • ContainerOps also simplify deployment. The pipeline itself is a portable Kubernetes spec that can run on any cluster.

For Python code, the func_to_container_op decorator allows you to convert Python functions to ContainerOps easily.
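
A minimal sketch with the KFP v1 SDK (the function and pipeline names are illustrative):

from kfp import dsl
from kfp.components import func_to_container_op

@func_to_container_op
def add(a: float, b: float) -> float:
    # Runs as its own containerized step when called inside a pipeline
    return a + b

@dsl.pipeline(name='add-demo')
def add_pipeline(x: float = 1.0, y: float = 2.0):
    add(x, y)  # each call becomes one ContainerOp / Pod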

For performance critical sections, you can optimize by:

  • Building custom slim Docker images
  • Using the dsl.ResourceOp construct for non-container steps
  • Grouping multiple steps in one ContainerOp
  • Tuning Kubernetes Pod resources

So in summary, ContainerOps trade off some performance for better encapsulation, portability and reproducibility. For a production pipeline, optimizing performance is still important.

ResourceOps in Kubeflow Pipelines allow defining pipeline steps that don't execute as Kubernetes Pods or ContainerOps.
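
A minimal sketch of a ResourceOp that creates a PVC straight from a manifest (KFP v1 SDK; the PVC name and size are illustrative):

from kfp import dsl

pvc_manifest = {
    'apiVersion': 'v1',
    'kind': 'PersistentVolumeClaim',
    'metadata': {'name': 'shared-models'},
    'spec': {
        'accessModes': ['ReadWriteMany'],
        'resources': {'requests': {'storage': '10Gi'}},
    },
}

@dsl.pipeline(name='resourceop-demo')
def resourceop_pipeline():
    # Applies the Kubernetes manifest directly instead of running a container
    dsl.ResourceOp(name='create-pvc', k8s_resource=pvc_manifest, action='create')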

Kubeflow Pipelines Vs Jobs

  • Jobs are standalone, reusable ML workflow steps e.g. data preprocessing, model training.

  • Pipelines stitch together jobs into end-to-end ML workflows with dependencies.

  • Use jobs to encapsulate reusable ML functions - data processing, feature engineering etc.

  • Use jobs within a pipeline to break up the pipeline into reusable components.

  • Kubeflow jobs leverage composable kustomize packages making reuse easier. Kubernetes jobs use vanilla pod templates.

Model Serving

What is the difference between saving an ML model and packaging a model? What are the different formats for packaging a model?

Saving a model involves persisting just the core model artifacts like weights, hyperparameters, model architecture etc. This captures the bare minimum needed to load and use the model later.

Packaging a model is more comprehensive - it bundles the model with any additional assets needed to deploy and serve the model. This allows portability. Model packaging includes:

  • Saved model files
  • Code dependencies like Python packages
  • Inference code
  • Environment specs like Dockerfile
  • Documentation & Metadata like licenses, model cards etc

Containerizing can be considered a form of model packaging, as it encapsulates the model and dependencies in a standardized unit for deployment.

Model packaging formats:

  • Docker image - Bundle model as microservice
  • Model archival formats - ONNX, PMML
  • Python package - For use in Python apps

Automating Kubeflow Pipelines with GitOps, GitHub Actions?

Why not use containers as the delivery medium for models?

It is fortunate that Kubernetes supports the concept of init-containers inside a Pod. When we serve different machine learning models, we change only the model, not the serving program. This means that we can pack our models into simple model containers, and run them as the init-container of the serving program. We also include a simple copy script inside of our model containers that copies the model file from the init container into a volume so that it can be served by the model serving program.

https://github.com/chanwit/wks-firekube-mlops/blob/master/example/model-serve.yaml

Kubeflow operators

Here is a list of some of the key operators in Kubeflow:

Kubeflow Pipelines operators:

  • Argoproj Workflow Operator - Manages and executes pipelines
  • Persistence Agent - Records pipeline run state into the metadata store
  • ScheduledWorkflow Operator - Handles scheduled/cron workflows

Notebooks operators:

  • Jupyter Web App Operator - Manages Jupyter notebook web interfaces
  • Notebook Controller Operator - Handles the lifecycle of notebook servers

Serving operators:

  • KFServing Operator - Manages model inference services
  • Seldon Deployment Operator - Manages running Seldon inference graphs
  • Inference Services Operator - Deprecated operator for TF Serving

Training operators:

  • MPI Operator - Handles distributed MPI training jobs
  • PyTorch Operator - Manages PyTorch training jobs
  • TFJob Operator - Manages TensorFlow training jobs

Misc operators:

  • Kubeflow Namespace Operator - Installs components in the kubeflow namespace
  • Profile Controller - Manages user profiles and their namespaces for multi-tenancy
  • StudyJob Operator - Hyperparameter tuning using Katib
  • Volumes Web App - Manages PVCs and volumes

The core operators for pipelines, notebooks and model serving/inference provide end-to-end ML workflow functionality on Kubernetes. The training operators support orchestrated model training jobs.

There are also various utility operators for management, tuning, monitoring etc. Many operators are being consolidated under KServe umbrella.

Hyperparameter tuning

Hyperparameter tuning aims to find the best combination of hyperparameters to optimize a specific metric (e.g., accuracy, loss) for a given model.

Katib (AutoML) supports hyperparameter tuning, early stopping, and neural architecture search (NAS). Katib is agnostic to ML frameworks and can tune hyperparameters for applications written in any language.

Katib automates this process by exploring different hyperparameter configurations to find the optimal settings for your model.


Creating a Hyperparameter Tuning Experiment:

Define an Experiment in Katib:

  • Specify the objective metric (e.g., validation accuracy).
  • Define the search space for hyperparameters (min/max values or allowable values).
  • Choose a search algorithm (e.g., Bayesian optimization, random search).
  • Create the Katib Experiment using Kubernetes Custom Resource Definitions (CRDs).

Running Trials:

  • Katib runs several training jobs (Trials) within each Experiment.
  • Each Trial tests a different set of hyperparameter configurations.
  • The training code evaluates each Trial with varying hyperparameters.

Optimization Process:

  • Katib optimizes the objective metric by exploring different hyperparameter combinations.
  • It uses algorithms like Bayesian optimization, Tree of Parzen Estimators, and more.
  • At the end of the Experiment, Katib outputs the optimized hyperparameter values.

Example: Hyperparameter Tuning for a Neural Network

  • Suppose we’re training a neural network for image classification.
  • We define an Experiment in Katib:
    • Objective: Maximize validation accuracy.
    • Search space: Learning rate (0.001 to 0.1), batch size (16 to 128), and dropout rate (0.1 to 0.5).
    • Search algorithm: Bayesian optimization.
  • Katib runs multiple Trials, each with different hyperparameter values.
  • After several Trials, Katib identifies the best hyperparameters for maximizing validation accuracy.

Write code for hyperparameter tuning for XGBoost using Kubeflow Pipelines. Where is the multi-fold cross-validation information stored? How is the best model detected?

In the sketch below, the parameter sweep runs inside a single pipeline component using scikit-learn's GridSearchCV: the per-fold cross-validation scores are kept in the search object's cv_results_ (and printed to the step's logs), and the best model is the parameter combination with the best mean cross-validation score.

import kfp
from kfp import dsl
from kfp.components import func_to_container_op


def xgboost_cv_tune(learning_rates: str, n_estimators_list: str) -> str:
    """Grid-search XGBoost hyperparameters with 5-fold cross-validation."""
    import json

    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import GridSearchCV

    # Example dataset; replace with your own data-loading step
    X_train, y_train = load_diabetes(return_X_y=True)

    param_grid = {
        'learning_rate': json.loads(learning_rates),
        'n_estimators': json.loads(n_estimators_list),
    }
    search = GridSearchCV(
        xgb.XGBRegressor(),
        param_grid,
        cv=5,                              # 5-fold cross-validation
        scoring='neg_mean_squared_error',
    )
    search.fit(X_train, y_train)

    # Per-fold scores for every candidate live in cv_results_ (and in the step logs);
    # best_params_ is the candidate with the best mean CV score.
    print(search.cv_results_)
    return json.dumps(search.best_params_)


# Wrap the function as a containerized pipeline step
xgboost_cv_tune_op = func_to_container_op(
    xgboost_cv_tune,
    packages_to_install=['xgboost', 'scikit-learn'],
)


@dsl.pipeline(
    name='XGBoost-tuning',
    description='Tuning XGBoost hyperparameters with cross-validation'
)
def xgboost_tuning_pipeline(
    learning_rates: str = '[0.01, 0.1]',
    n_estimators_list: str = '[100, 200]',
):
    xgboost_cv_tune_op(learning_rates, n_estimators_list)


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(xgboost_tuning_pipeline, __file__ + '.yaml')

KServe

  • KServe is a Kubernetes operator for serving, managing and monitoring machine learning models on Kubernetes.
  • KServe supports multiple inference runtimes like TensorFlow, PyTorch, ONNX, XGBoost.
  • KServe can leverage GPU acceleration libraries like TensorRT for optimized performance.
  • For PyTorch, KServe integrates with TorchServe to serve PyTorch models.

KServe provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.

It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability.

Model Serving Runtimes

KServe provides a simple Kubernetes CRD to enable deploying single or multiple trained models onto model serving runtimes such as TFServing, TorchServe, and Triton Inference Server. In addition, ModelServer is the Python model serving runtime implemented in KServe itself with the prediction v1 protocol, while MLServer implements the prediction v2 protocol with both REST and gRPC. These model serving runtimes provide out-of-the-box model serving, but you can also build your own model server for more complex use cases.


Deploy a PyTorch Model with TorchServe InferenceService

KServe supports the implementation of Knative Pod Autoscaler (KPA) and Kubernetes’ Horizontal Pod Autoscaler (HPA).

KServe by default selects the TorchServe runtime when you specify the model format pytorch

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve-mnist-v2"
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      protocolVersion: v2  
      storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v2
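
The same InferenceService can also be created programmatically. A hedged sketch with the kserve Python SDK (class names follow the kserve package and may vary across SDK versions; the namespace is a placeholder):

from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
)

isvc = V1beta1InferenceService(
    api_version='serving.kserve.io/v1beta1',
    kind='InferenceService',
    metadata=client.V1ObjectMeta(name='torchserve-mnist-v2', namespace='kserve-test'),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name='pytorch'),
                protocol_version='v2',
                storage_uri='gs://kfserving-examples/models/torchserve/image_classifier/v2',
            )
        )
    ),
)

KServeClient().create(isvc)  # submits the CR to the cluster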

TorchScript

TorchScript is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can later be run outside of Python, in a high-performance environment like LibTorch (a native C++ library).

TorchScript is a mechanism for converting your PyTorch models (typically defined as nn.Module subclasses) into a serialized, optimized format known as a TorchScript model.

TorchScript models can be executed more efficiently on various platforms (CPUs, GPUs, mobile devices) due to optimizations performed by PyTorch's Just-In-Time (JIT) compiler.

It can automate optimizations such as:

  1. Layer fusion
  2. Quantization
  3. Sparsification

Refer to Layer Fusion for more information.

Tracing vs Scripting

PyTorch provides two primary approaches to create TorchScript models:

  • Tracing: When using torch.jit.trace you’ll provide your model and sample input as arguments. The input will be fed through the model as in regular inference and the executed operations will be traced and recorded into TorchScript. Logical structure will be frozen into the path taken during this sample execution.
  • Scripting: When using torch.jit.script you’ll simply provide your model as an argument. TorchScript will be generated from the static inspection of the nn.Module contents (recursively).

https://towardsdatascience.com/pytorch-jit-and-torchscript-c2a77bac0fff
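
A short illustration of both approaches (the ResNet-18 model and input shape are just examples):

import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# Tracing: records the operations executed for this particular input
traced_model = torch.jit.trace(model, example_input)

# Scripting: compiles the module from static inspection, preserving control flow
scripted_model = torch.jit.script(model)

print(traced_model.graph)  # inspect the captured IR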

How to save/load TorchScript modules?

TorchScript saves/loads modules into an archive format. This archive is a standalone representation of the model and can be loaded into an entirely separate process.

    Saving a module: torch.jit.save(traced_model, 'traced_bert.pt')
    Loading a module: loaded = torch.jit.load('traced_bert.pt')

How to view the PyTorch IR captured by TorchScript?

Example 1: Use traced_model.code to view PyTorch IR
Skipping this as it’s very verbose
Example 2: Use script_cell_gpu.code to view PyTorch IR
def forward(self,
    input: Tensor) -> Tensor:
  _0 = self.fc
  _1 = self.avgpool
  _2 = self.layer4
  _3 = self.layer3
  _4 = self.layer2
  _5 = self.layer1
  _6 = self.maxpool
  _7 = self.relu
  _8 = (self.bn1).forward((self.conv1).forward(input, ), )
  _9 = (_5).forward((_6).forward((_7).forward(_8, ), ), )
  _10 = (_2).forward((_3).forward((_4).forward(_9, ), ), )
  input0 = torch.flatten((_1).forward(_10, ), 1, -1)
  return (_0).forward(input0, )

TensorRT & Triton Inference Server

TensorRT optimizes and executes compatible subgraphs, letting deep learning frameworks execute the remaining graph

  • TensorRT is designed to optimize deep learning models for inference on NVIDIA GPUs, which results in faster inference times.
  • TensorRT performs various optimizations on the model graph, including layer fusion, precision calibration, and dynamic tensor memory management to enhance inference performance.

During the optimization process, TensorRT performs the following steps:

  • Parsing: TensorRT parses the trained model to create an internal representation.
  • Layer fusion: TensorRT fuses layers in the model to reduce memory bandwidth requirements and increase throughput.
  • Precision calibration: TensorRT selects the appropriate precision for each layer in the model to maximize performance while maintaining accuracy.
  • Memory optimization: TensorRT optimizes the memory layout of the model to reduce memory bandwidth requirements and increase throughput.
  • Kernel selection: TensorRT selects the best kernel implementation for each layer in the model to maximize performance.
  • Dynamic tensor memory management: TensorRT manages the memory required for intermediate tensors during inference to minimize memory usage.

On NVIDIA GPUs, TensorRT generally outperforms TorchScript for inference because of these GPU-specific optimizations.

TensorRT Optimization Guide

Accelerating Inference Up to 6x Faster in PyTorch with Torch-TensorRT: https://developer.nvidia.com/blog/accelerating-inference-up-to-6x-faster-in-pytorch-with-torch-tensorrt/
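
A minimal sketch of compiling a PyTorch model with Torch-TensorRT, following the pattern in the blog above (the model and input shape are illustrative; it needs a CUDA-capable GPU and the torch_tensorrt package):

import torch
import torch_tensorrt
import torchvision

model = torchvision.models.resnet18(weights=None).eval().cuda()

# Compile into a TensorRT-optimized TorchScript module
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # allow FP16 kernels
)

x = torch.randn(1, 3, 224, 224).cuda()
with torch.no_grad():
    out = trt_model(x)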

GPU Monitoring

  • nvidia-smi
  • gpustat
  • nvtop & nvitop
  • nvidia_gpu_exporter & Prometheus
  • DCGM Prometheus exporters to monitor GPU statistics in real time.

GPU Monitoring Tools

Explain critical GPU performance measures (e.g., via nvidia-smi -q -d PERFORMANCE).

  • Clock Speed: nvidia-smi -q -d CLOCK
  • Power Consumption: nvidia-smi -q -d POWER
  • Memory Usage: nvidia-smi -q -d MEMORY
  • Performance: nvidia-smi -q -d PERFORMANCE # Displays the “P” state of a GPU. P state refers to the current performance of a GPU.
  • Temperature: nvidia-smi -q -d TEMPERATURE

What are P States in GPU?

P state refers to the current performance of a GPU.

GPU’s Performance (P) States:

  • P0 and P1 are the power states when the GPU is operating at its highest performance level.
  • P2/P3 is the power state when the GPU is operating at a lower performance level.
  • P6-P12 is the power state when the GPU is operating at a low power level to an idle state. The higher the P number, the lower the performance.

How to set up experiments in Kubeflow?
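
With the KFP SDK, an experiment is simply a named grouping of pipeline runs. A minimal sketch (the host URL and file names are placeholders):

import kfp

client = kfp.Client(host='http://localhost:8080')  # point at your KFP API endpoint

experiment = client.create_experiment(name='xgboost-tuning-experiments')

# Launch a run of a compiled pipeline under that experiment
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='tuning-run-1',
    pipeline_package_path='xgboost_tuning_pipeline.yaml',
)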

Feature Store

A feature store is a comprehensive framework for feature creation, versioning, and serving in both real-time and batch modes.

It typically uses an offline store (e.g., S3 or Redshift) for batch/training access and a low-latency online store (e.g., Redis or DynamoDB) for serving.

Data Infrastructure Layer: Where It All Begins

The Data Infrastructure Layer is the backbone of your feature store. It’s responsible for the initial stages of your data pipeline, including data ingestion, processing, and storage.

Serving Layer: API-Driven Feature Access

The Serving Layer is your “customer service desk,” the interface where external applications and services request and receive feature data.

Application Layer: The Control Tower

The Application Layer serves as the orchestrator for your feature store.

Feature Store Architecture and How to Build One

A feature store must also comply with GDPR or similar regulations.

How to use AWS EFS to store training datasets and make them available to Kubeflow pods?

  • Use Amazon EFS as the storage layer to store training datasets.
  • Use Kubeflow on Amazon EKS to implement model parallelism, with Amazon EFS as persistent storage to share datasets across pods.

Machine Learning with Kubeflow on Amazon EKS with Amazon EFS

How to sync data between S3 and EFS?

Use AWS DataSync to transfer data between an Amazon EFS file system and an Amazon S3 bucket.

Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or across instances.

What are different pipelines?

  • Feature pipelines - Transform data into features and labels; the feature pipeline retrieves data from outside sources, transforms it, and loads it into the feature store.
  • Training pipelines - Train models with features and labels; the training pipeline fetches data from the feature store for model training, sends training metadata to an experiment tracker for later analysis, and places the resulting model in the model registry.
  • Inference pipelines - Make predictions with trained models and new features; the inference pipeline loads a model from the model registry, uses the feature store to retrieve properly transformed features, and feeds them to the model to make predictions that it exposes to downstream applications.

The training pipeline can easily load snapshots of training data from tables of features (feature groups). In particular, a feature store should provide point-in-time consistent snapshots of feature data so that your models do not suffer from future data leakage.
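
As an illustration, here is how a point-in-time consistent training snapshot is retrieved with Feast (Feast is used only as an example feature store; the feature view and entity names come from its quickstart):

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path='.')

# Entity dataframe: which entities we need features for, and as of which timestamps
entity_df = pd.DataFrame(
    {
        'driver_id': [1001, 1002],
        'event_timestamp': pd.to_datetime(['2024-01-01', '2024-01-02']),
    }
)

# Point-in-time correct join against the offline store (no future data leakage)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        'driver_hourly_stats:conv_rate',
        'driver_hourly_stats:avg_daily_trips',
    ],
).to_df()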

Nvidia GPU Operator

NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU nodes. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM based monitoring and others.

The GPU Operator allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster. Instead of provisioning a special OS image for GPU nodes, administrators can rely on a standard OS image for both CPU and GPU nodes and then rely on the GPU Operator to provision the required software components for GPUs.

Time-Slicing GPUs

The NVIDIA GPU Operator enables oversubscription of GPUs through a set of extended options for the NVIDIA Kubernetes Device Plugin. GPU time-slicing enables workloads that are scheduled on oversubscribed GPUs to interleave with one another.

This mechanism for enabling time-slicing of GPUs in Kubernetes enables a system administrator to define a set of replicas for a GPU, each of which can be handed out independently to a pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or fault-isolation between replicas, but for some workloads this is better than not being able to share at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU.

A typical resource request provides exclusive access to GPUs. A request for a time-sliced GPU provides shared access. A request for more than one time-sliced GPU does not guarantee that the pod receives access to a proportional amount of GPU compute power.

A request for more than one time-sliced GPU only specifies that the pod receives access to a GPU that is shared by other pods. Each pod can run as many processes on the underlying GPU without a limit. The GPU simply provides an equal share of time to all GPU processes, across all of the pods.

Multi-Instance GPU

Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into separate GPU Instances for CUDA applications.

If the cluster has multiple node pools with different GPU types, you can specify the GPU type by the following code.

import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
gpu_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p4')

The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      nvidia.com/gpu: "2"
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-p4