MLOps is a new buzzword in Software Development. It goes beyond DevOps by adding a brand new perspective over data and models. In fact, Any reasonable Machine Learning solution must deal with assets like ML Models, datasets, features, etc. Therefore, a simple ML application may turn into a complex workflow in order to ensure the best model runs in Production. Besides that, developing ML models demands a relevant time assessing and comparing parameters, algorithms and model performance. MLOps goal is to address these concerns by tying tools and ML frameworks in order to quickly deploy models to Production (and keep them there).
PyTorch is one of the main machine learning libraries used in applications such as computer vision and natural language processing. It is catching up with TensorFlow accelerating the path from research prototyping to production deployment. Recently, Pytorch community announced a number of technical contributions to enable end-to-end support for MLflow usage with PyTorch.
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. One may interact with its components by command line interface or well-known APIs (Python, R, Java and REST), building a workflow capable of handling development and production activities. Actually, Its simple design allows small teams (or a lone data scientist as well) to rapidly set up a working environment and execute a whole Experimentation and Assesment stage before going to formal software development.
In this lab, MLOps community wants to test and provide feasible ways to build an end-to-end model lifecycle with Pytorch assets and Mlflow.
- Alexey Naiden
- John Savage
- Michel Vasconcelos
- Varuna Jayasiri
The following list describes the most relevant folders in which this lab is organized. Some folders in this repo are self-explanatory (like imgs) and therefore aren't mentioned in the list.
- src: Main source code folder. It comprises all software produced during this lab. Src itself is organized into subfolders according to which components it belongs to;
- docs: Binary documentation, drawings and reports produced during this lab;
Our team intend to orchestrate common unix management toolkits (shell commands, ssh, etc), MLFlow CLI and MLFlow components to build a seamless end-to-end pipeline. It should comprehend Continuous Integration (CI), Continuous Delivery (CD) and Continuous Traning (CT). The following figure depicts the proposed architecture.
Nodes are bare-metal or virtual machines that host a part of the pipeline process. ML components are ML-based code that should be trained, tested and deployed by end-to-end pipeline.They must be packaged as MLFlow projects. Training nodes host the training process which runs in Training containers, running dockerized images built with MLFlow and necessary libs to train ML components. Each ML component defines its needs, that is, libs and external dependencies. To ensure traceability, provenance and model configuration management, one may track hyperparameters, training steps, test results, extra data and the model itself into MLFlow Tracking Server hosted in the Tracking Node. All these items, metadata and trained models, are stored in external resources: RDBMS and Cloud Storage Services. Trained models are embedded into dockerized Serving Instances, containers running Torchserve providing access to the models via HTTP.
In order to provide feedback loop, a Monitor Node checks the status and performance meters, issuing new training cycles or a full rebuild process. Control or Operator Machine is the VM used by Configuration Manager or Operation Engineer to follow up the process or issue commands to the nodes in the pipeline. This process may run through many different perimeters, however this solution comprises Development and Production stages only.
In this section, we record the (most relevant) design decisions, components selection and the rationale around them.
# | Subject | Description | Affected Items | Solution | Rationale |
---|---|---|---|---|---|
1 | Computing | ML needs processing power as well as GPU. Our architecure relies on distributed elements, thus needing a reasonable number of physical or virtual machines (some with GPU-power) | All Nodes and Machines | GCP Compute Service | It provides $ 300,00 of (enough) free services and we can also spin some gpu-powered machines |
2 | Durable Data Storage | Every serious solution needs to store durable data | Tracking Server | Google Cloud SQL | A Postgres Database hosted by Google Cloud SQL is reliable and fast enough for our needs |
3 | Large and Binary Data Storage | ML projects eventually need to store models and extra large files, This type of data doesn't fit in RDBMS | Tracking Server Artifact Store |
Google Cloud Storage | Same as #1 |
4 | Shipment and Deployment | There are many deployable assets as well as acessory tools that need specific environment and SO libraries to run. Just installing them into a machine isn't a viable option | Training Node Google Cloud Services |
Docker | It's the de facto pattern for shipping and deploying things |
5 | Lib Management | Python has a powerful but sometimes confusing and conflicting library ecossystem. Our components may depend on conflicting libs and that will lead to problems in Production | All Nodes and Machines | Pip | Pip is heavily used by Python community. It also work quite well with isolation provided by Docker images |
6 | Remote Access | Sometimes we need to issue remote commands or operate a remote machine | All Nodes and Machines | SSH | Hey... It's SSH. There's no need to rationale :-P |
7 | CI/CD & Workflow Management | We need to seamlessly integrate and deploy ML stuff. We also have to coordinate the process between nodes | All Items | Github Actions | We're using GitHub, it's heavily used by community and comes off-the-shelf |
8 | Project Metadata | We need to describe the ML Component, its structure and how it should run | Source Code Repository Training Node |
MLFlow Project Definition | Non-functional Requirement we must adhere |
9 | ML Lifecycle Management | We must track the ML Process | All Nodes and Machines | MLFlow | Non-functional Requirement we must adhere |
10 | ML Framework | We are building ML Components. Therefore, we must use a ML framework to train our models | Training Node Serving Instances |
Pytorch | Non-functional Requirement we must adhere |
A common ML development workflow should comprise Experimentation, Development and Production stages. While Development and Production are common stages in Software Development Industry, Experimentation brings trust to the model being developed. That's a core activity for any serious Machine Learning development. During Experimentation, Data Science teams may run various scenarios, compare key metrics and realize their progress with model performance before going to Production pipeline. Although MLFlow provides an easy path to that stage, our emphasis is on the release process. Experimentation may be addressed in a future work.
The following workflow depicts a common development scenario for our pipeline: a production web application uses a ML Model to provide business value.
The first stage is Development Workflow. It concerns the continuous integration of new ML code and Dataset changes into a Development Server. Here follows the steps:
-
Commited code or changes in the Dataset fires a training execution. These events trigger Github Actions (GH Actions);
-
ML pipeline starts. we are considering that previous data activities (Extraction, Validation and Preparation) were already executed. GH Actions issues a gcloud command to Training Node, so the ML Component code gets up to date. The ML Component itself describes its structure (MLFlow project file) and how it should be run (Dockerfile). The MLFlow CLI starts the training process by issuing:
mlflow run
. The training happens in a GPU-powered Docker container configured with the Tracking Server (MLFLOW_TRACKING_URI
) . At the end of this step, there's a new trained model with its respective metadata; -
The Training Node sends the trained model, its extra files, and metadata to the Model Registry and Tracking Server respectively. In fact, these assets are physically stored in a Cloud Storage Service acessible from Internet;
-
The development workflow ends by deploying the trained model in Google Cloud Run Service. GH Actions creates a new image that contains the trained model (
mlflow create deployment
) which is tagged with the MLflow model name and version and then is pushed to Google Container Registry. This image is deployed to a development Cloud Run deployment, where various acceptance tests can be performed to ensure the image is ready for production.
The decision to promote a model to Production happens after some experiments. Model version X is selected. This external event triggers the Production Workflow:
-
Operations and Product Management Team decides to deploy a model to Production environment. A release command is issued, specifying a model name and version to deploy, triggering GitHub Actions. The runner pulls and configures the corresponding Docker image and deploys this to the Production Cloud Run environment
-
The Monitor node tracks how model is performing and periodically starts a local training pipeline with Production data in order to avoid Model Erosion;
-
The model realizes the model doesn't perform anymore (Model Erosion) or there are significant changes in the data (changes in the data semantic, new relevant fields, etc). It start a feedback loop triggering a full model rebuild process.