---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# lightMLFlow
A lightweight R wrapper for the MLFlow REST API
<!-- badges: start -->
[](https://github.com/collegevine/lightMLFlow/actions)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->
# Setup
`lightMLFlow` will soon be on CRAN!
For now, you can install it from GitHub as follows:
```r
# install.packages("devtools")
devtools::install_github("collegevine/lightMLFlow")
```
# The Package
This package differs from the CRAN `mlflow` package in a few important ways.
First, there are some things that the full CRAN `mlflow` package has that this package doesn't:
* `mlflow` supports more configurable artifact stores, while `lightMLFlow` only supports S3 as the artifact store. That said, we would love contributions from Azure or GCP users to extend this functionality!
* `mlflow` allows you to run `install_mlflow()` to install MLFlow on your machine. `lightMLFlow` assumes you're using an MLFlow instance that's either running locally or hosted on a cloud server.
* `mlflow` lets you run the MLFlow UI directly from the package, which `lightMLFlow` does not. Again, `lightMLFlow` is made to be used with a deployed MLFlow instance, which means that your instance needs to already exist (either locally or on a cloud server).
However, there are also significant advantages to using `lightMLFlow` over `mlflow`:
* `lightMLFlow` features a friendlier API, with significantly fewer functions, no `mlflow::mlflow_*` function prefixing (following Tidyverse conventions, `lightMLFlow` function names are verbs), and improved error handling.
* `lightMLFlow` fixes some bugs in `mlflow`'s API wrapping functions.
* `lightMLFlow` is significantly more lightweight than `mlflow`. It doesn't depend on `httpuv`, `reticulate`, or `swagger`, and has a more minimal footprint in general.
* `lightMLFlow` uses `aws.s3` to move objects to and from S3, which means you don't need `boto3` installed on the machine running your `MLFlow` code. This is an essential change: `lightMLFlow` requires no Python infrastructure at all, whereas `mlflow` does.
* `mlflow` (and `MLFlow Projects` in particular) doesn't play nicely with `renv`. An `MLProject` file pointed at a Git repo will try to clone and run the code from scratch, whereas a typical `renv` workflow restores a package cache in CI and bakes it into the Docker image the code lives in, so that the project's R packages don't have to be reinstalled on every run. `lightMLFlow` works around this by letting the user call `set_git_tracking_tags()`, which makes the MLFlow REST API treat the run as if it came from an MLFlow Project even when it didn't. This lets you keep your normal (e.g.) `renv` workflow in place and still get linked Git commits in the MLFlow UI, without any of the `MLProject` infrastructure or setup steps.
* For artifact and model logging, `lightMLFlow` logs R objects directly so that you don't need to worry about first saving a file to disk and then copying it to your artifact store.
* In addition, `lightMLFlow` allows artifacts to be loaded directly into the R session in one shot, instead of first being saved to disk and then loaded afterwards. This eliminates lines of code and the headache associated with going S3 --> disk --> R by abstracting away the disk reads and writes.
* In `lightMLFlow`, `create_experiment` returns the experiment when one with the specified name already exists, instead of erroring.
* `lightMLFlow` adds `get_param` and `get_metric` helpers to make it easier to get the most recent value of a metric or param for a run.
* `lightMLFlow` reworks `log_params` and `log_metrics` in an R-ish way: both take dot args of params or metrics to log and call `log_batch()` on the backend. This is a far friendlier API than specifying a key and value separately and calling `log_param` (e.g.) once per key-value pair. For example, `log_params(foo, bar = "baz")` logs two params: the value of `foo` under the name `foo` (the param name is generated automagically by deparsing the name of the R object), and `"baz"` under the name `bar`.
* When logging metrics, `lightMLFlow` abstracts away timestamps and steps from the user, automatically setting the timestamp to the current time (UTC) and auto-incrementing the step if the metric being logged already exists.
* If a run errors out, `lightMLFlow` logs the error as an artifact (a markdown document) to help with the debugging process.
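The pieces above can be tied together in a minimal usage sketch. The function names (`create_experiment`, `set_git_tracking_tags`, `log_params`, `log_metrics`, `get_metric`) come from this README, but the exact arguments and call sequence shown here are assumptions, not the confirmed API:

```r
library(lightMLFlow)

# Assumes MLFLOW_TRACKING_URI and S3 credentials are set in the environment.
# create_experiment returns the existing experiment if one with this
# name already exists, instead of erroring.
exp <- create_experiment("my-experiment")

# Hypothetical call: tag the run with Git metadata so the MLFlow UI links
# the commit, without needing an MLProject file.
set_git_tracking_tags()

# Log two params in one call: `lr` is deparsed to the param name "lr",
# and the named argument logs "xgboost" under the key "model_type".
lr <- 0.01
log_params(lr, model_type = "xgboost")

# Timestamps (UTC) and steps are handled automatically; logging "auc"
# again later would auto-increment its step.
log_metrics(auc = 0.87)

# Retrieve the most recent value of a metric for the run.
get_metric("auc")
```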
## Known Issues / Future Work
1. Clarify the parameter names for things like `path`, `model_path`, etc. since they don't make much sense right now.
2. Clarify the difference between `save_model` and `log_model`.
3. Add Azure and GCP artifact stores.