After profiling, you know your model's runtime performance. With these results as a guideline, you can dispatch the model as an efficient cloud service with MLModelCI's dispatch API.
Before you try these features, please make sure you have installed MLModelCI correctly and started the MongoDB service. You can refer to the installation guide for more details.
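If you want to double-check that MongoDB is reachable before dispatching, a quick connectivity test with pymongo looks like the sketch below. It assumes MongoDB listens on the default localhost:27017; adjust the host and port to match your installation.

```python
# Minimal MongoDB connectivity check (assumes the default localhost:27017;
# change these values if your MLModelCI installation configures MongoDB differently).
from pymongo import MongoClient

client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
client.admin.command('ping')  # raises ServerSelectionTimeoutError if MongoDB is not running
print('MongoDB is reachable')
```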
Our dispatch supports loading models with the following serving systems:
- Triton Inference Server
- TensorFlow-Serving
- ONNX Runtime
- Self-defined TorchScript Container
Before serving your models, please make sure you have installed the Docker images of the above serving systems. By default, MLModelCI's Docker image contains them, so you don't need to install them again.
Self-defined TorchScript Container
docker pull mlmodelci/pytorch-serving
ONNX Runtime
docker pull mlmodelci/onnx-serving
Triton Inference Server
docker pull nvcr.io/nvidia/tensorrtserver:19.10-py3
TensorFlow-Serving
docker pull tensorflow/serving
The dispatch API launches a serving system that loads a model and runs it in a containerized manner.
You can get the model path using the retrieve API, which returns a saved_path (see Tricks with Model Saved Path) pointing to the model's local cache.
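For reference, a minimal sketch of fetching a model and its saved path might look like the following. The module path, function name, and arguments here are assumptions for illustration only; consult the retrieve API documentation for the exact signature.

```python
# Sketch only: retrieve_model and its arguments are assumed names, not a
# verbatim MLModelCI signature; see the retrieve API docs for the real one.
from modelci.hub.manager import retrieve_model

saved_path = retrieve_model(
    architecture_name='ResNet50',  # hypothetical model registered in MLModelCI
    framework='PyTorch',           # hypothetical framework selector
    engine='TORCHSCRIPT',          # hypothetical serving engine selector
)
```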
Now you can assign a device (e.g. 'cpu', 'cuda:0', 'cuda:0,1') and a batch size to serve the model, using the profiling results as a guideline. If you leave them out, MLModelCI will set them automatically.
from modelci.hub.deployer.dispatcher import serve

saved_path = ...  # saved_path of the model returned by the retrieve API
device = '1'  # GPU index to serve on
batch_size = 8
server_name = 'name of container'

serve(save_path=saved_path, device=f'cuda:{device}', name=server_name, batch_size=batch_size)
If you want to stop the running container, simply stop it from your terminal:
docker stop <name>
The serving container will be removed once it is stopped.
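If you prefer to manage the container from Python rather than the terminal, the Docker SDK for Python (the docker package) can do the same thing; note this is standard Docker tooling, not an MLModelCI API.

```python
# Stop the serving container programmatically with the Docker SDK for Python
# (pip install docker). Equivalent to `docker stop <name>` above.
import docker

client = docker.from_env()
container = client.containers.get('name of container')  # same name passed to serve()
container.stop()
```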