
ML Generic API #497

Closed
wants to merge 2 commits into from

Conversation

PondiB
Member

@PondiB PondiB commented Feb 21, 2024

I saw the need for this as an alternative to #441: if we keep following the ml_fit_* pattern as in #484, #483, #493 and so forth, we will end up with a long list of ML processes.

Taking inspiration from tools like caret in the R ecosystem, we can ideally achieve a minimalistic API.

Labels can be used to distinguish between classification and regression problems.

Parameters for the various ML models can be passed as dictionaries, e.g.:

  1. Random forest:
    ml_fit(predictor, label, ml_method="random_forest", params={"num_trees": 200, "max_variables": "sqrt", "seed": 42})
  2. Multi-layer perceptron classifier:
    ml_fit(predictor, label, ml_method="mlp", params={"hidden_layers": [512, 512, 512], "dropout_rates": [0.4, 0.3, 0.2], "epochs": 100, "activation_function": "relu", "optimizer": "adam"})
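To make the idea concrete, here is a minimal sketch of how a back-end might dispatch a single generic ml_fit process. Everything here (the trainer registry, the default dictionaries, the placeholder model objects) is a hypothetical illustration, not an existing openEO implementation:

```python
def ml_fit(predictor, label, ml_method, params=None):
    """Dispatch to a per-method trainer based on ml_method (sketch only)."""
    # Hypothetical registry: a real back-end would train actual models here.
    trainers = {
        "random_forest": lambda p: ("rf-model", p),
        "mlp": lambda p: ("mlp-model", p),
    }
    if ml_method not in trainers:
        raise ValueError(f"Unsupported ml_method: {ml_method}")
    # Assumed defaults, merged with the user-supplied params dictionary.
    defaults = {
        "random_forest": {"num_trees": 100, "max_variables": "sqrt", "seed": None},
        "mlp": {"epochs": 100, "optimizer": "adam"},
    }
    merged = {**defaults[ml_method], **(params or {})}
    return trainers[ml_method](merged)

model, used_params = ml_fit(
    None, None, ml_method="random_forest",
    params={"num_trees": 200, "seed": 42},
)
```

The single entry point keeps the process list short, but note that all per-method knowledge (defaults, valid keys) now lives inside this one function.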

There will also be a need for clear documentation dedicated to ML on the openEO website.

Additionally, two more methods would be needed on the client side: list_ml_models() for listing the available models, and model_param_blueprint(model_name, task_type="classification") for listing the default parameters of each model. This information should ideally be on the website too.
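A rough sketch of what those two client-side helpers might look like. The method names come from the proposal above, but the bodies and blueprint contents are purely illustrative assumptions (a real client would fetch this information from the back-end):

```python
def list_ml_models():
    """Return the model identifiers a back-end advertises (hard-coded sketch)."""
    # Assumed values; in practice, queried from the back-end's capabilities.
    return ["random_forest", "mlp", "svm"]

def model_param_blueprint(model_name, task_type="classification"):
    """Return the default parameter dictionary for a model (hard-coded sketch)."""
    blueprints = {
        ("random_forest", "classification"): {
            "num_trees": 100, "max_variables": "sqrt", "seed": None,
        },
        ("mlp", "classification"): {
            "hidden_layers": [512], "dropout_rates": [0.2],
            "epochs": 100, "activation_function": "relu", "optimizer": "adam",
        },
    }
    try:
        return dict(blueprints[(model_name, task_type)])
    except KeyError:
        raise ValueError(f"No blueprint for {model_name}/{task_type}")
```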

Opinions, suggestions and critiques are welcome as I am still exploring this.

@soxofaan
Member

We can ideally achieve a minimalistic API.

I'm not sure that that should be the goal.
You indeed get a small API at the level of the processes (just one process, ml_fit), but you pay the cost in other places: list_ml_models() and model_param_blueprint() have to be defined, you need to provide schema definition/validation for the params of each model, all clients have to start supporting that, all back-ends have to start implementing that, ... It feels like a lot of reinventing things we already have in our current ecosystem of processes.
I'm not sure there is a net win here.
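To illustrate that cost: under the generic approach, every model would need its own params schema plus validation logic, on top of the process definition itself. The schema format and validator below are assumptions for illustration only:

```python
# Hypothetical per-model schema: one of these would be needed for every model.
RF_PARAMS_SCHEMA = {
    "num_trees": int,
    "max_variables": str,
    "seed": int,
}

def validate_params(params, schema):
    """Return a list of human-readable validation errors for a params dict."""
    errors = []
    for key, value in params.items():
        if key not in schema:
            errors.append(f"unknown parameter: {key}")
        elif not isinstance(value, schema[key]):
            errors.append(f"{key}: expected {schema[key].__name__}")
    return errors
```

With per-method processes (e.g. ml_fit_random_forest), this kind of validation comes from the existing openEO process parameter schemas instead.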

We will end up with a long list of ml processes

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

@PondiB
Member Author

PondiB commented Feb 28, 2024

I'm not sure that that should be the goal.

Thanks for your solid comments, which cannot be overlooked; they remind me of the article "Simple is not easy". There will be an initial engineering overhead for sure, as you have pointed out. My goal for this was targeted more at the general public/users, i.e. greater accessibility and ease of use.

I don't think that is a problem on its own. The list of pre-defined processes is also long, but that is fine.

On my end, that still does not mean we should have 20+ processes for ML/DL.

@soxofaan
Member

My goal for this was targeted more at the general public/users, i.e. greater accessibility and ease of use.

Even for the user I think it will be worse.

e.g. with the python client, the current way of working would result in something like

ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

This is easy/simple and highly accessible, I would think: it just uses the basics of how functions are called, and everything is documented the normal way, not only the available arguments and their defaults, but also the basics of random forest, ...

The proposed approach would result in something like

ml_fit(predictors=cube1, labels=cube2, ml_method="random_forest", params={"num_trees": 200, "max_variables": "sqrt", "seed": 42})

This is less easy/simple for the user: params is now a generic, hard-to-document thing that heavily depends on ml_method; ml_fit has to explain all possible ML methods; there is no reasonable default for ml_method; ...

@PondiB
Member Author

PondiB commented Feb 29, 2024

This is less easy/simple for the user: params is now a generic, hard-to-document thing that heavily depends on ml_method; ml_fit has to explain all possible ML methods; there is no reasonable default for ml_method; ...

Fair enough, I will continue with the specific APIs. One last thought: for classification and regression tasks, if we limit the number of tunable params and they are similar, then there is no need to have two APIs for the two categories, i.e.

# classification
ml_fit_class_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)
# regression
ml_fit_regr_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

The two above can be replaced with

# classification or regression
ml_fit_random_forest(predictors=cube1, labels=cube2, num_trees=500, max_variables="sqrt", seed=42)

Internally, it could then be determined from the labels whether this is a classification or a regression task. Do you foresee any bottlenecks with this?
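A hedged sketch of what such label-based inference could look like, and where it gets ambiguous. The heuristic (and its cardinality threshold) is an assumption for illustration only:

```python
def infer_task_type(labels):
    """Guess 'classification' or 'regression' from raw label values (sketch)."""
    if all(isinstance(v, str) for v in labels):
        return "classification"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in labels):
        # Ambiguous case: integers may be class codes OR regression targets.
        # An arbitrary cardinality threshold is assumed here.
        return "classification" if len(set(labels)) <= 20 else "regression"
    return "regression"
```

The middle branch is the weak spot: integer-coded class labels and integer-valued regression targets look identical to such a heuristic.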

@soxofaan
Member

Therefore internally it can be determined from the labels if this is a classification or a regression task

I'm not sure this is a safe assumption.
I can imagine there are users that use e.g. integer values as classification labels.

@PondiB
Member Author

PondiB commented May 28, 2024

I'm not sure this is a safe assumption.
I can imagine there are users that use e.g. integer values as classification labels.

I was on vacation for over a month during this time, so I left this discussion on hold. I'll look into your point.

Anyway, regarding the parameters, I think we should not separate the predictors and targets. Instead, we could have a "training-data" parameter that includes both predictors and targets; a target parameter would then simply specify the column name of the targets/labels within the training data.
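A small sketch of this combined training-data idea, with the target identified by column name. The row-of-dicts representation and the function name are hypothetical:

```python
def split_training_data(training_data, target):
    """Split rows into predictor dicts and a target list, by column name."""
    predictors = [{k: v for k, v in row.items() if k != target}
                  for row in training_data]
    targets = [row[target] for row in training_data]
    return predictors, targets

# Hypothetical rows: spectral bands plus an index as predictors, one label column.
rows = [
    {"B02": 0.12, "B03": 0.10, "ndvi": 0.64, "label": "forest"},
    {"B02": 0.25, "B03": 0.22, "ndvi": 0.08, "label": "urban"},
]
X, y = split_training_data(rows, target="label")
```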

FYI, an example training data set: training-data

@PondiB
Member Author

PondiB commented Jul 17, 2024

Closing this; I will continue with the already existing approach, with a few changes as in #517.

@PondiB PondiB closed this Jul 17, 2024
@PondiB PondiB deleted the ml-generic-api branch September 9, 2024 10:22