
[ENH] online update capability for probabilistic regressors #462

Merged
merged 18 commits on Sep 27, 2024

Conversation

@fkiraly (Collaborator) commented on Sep 13, 2024

Adds framework support for online update capability for probabilistic regressors, and a simple composite strategy that refits on all data, for testing the framework. Closes #463

Contains:

  • extension of the regressor and survival base classes with an update / _update method for batch updates
  • addition of a capability:online tag for the respective estimators
  • addition of a composite OnlineRefit which adds the capability:online tag and refits the regressor on all data seen so far. This is kept as a separate estimator so that not every estimator has to remember the data (and clutter self with it); a usage sketch follows below
  • a similar composite OnlineDontRefit that turns off the online capability
  • a specific test case for online updates, in TestAllRegressors
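
For illustration, a rough usage sketch of the new interface, wrapping an existing skpro regressor in the composite; the OnlineRefit import path is assumed and details may differ from the merged code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

from skpro.regression.residual import ResidualDouble
from skpro.regression.online import OnlineRefit  # import path assumed

# two toy data batches arriving one after the other
X1 = pd.DataFrame(np.random.randn(50, 3), columns=["a", "b", "c"])
y1 = pd.DataFrame({"y": X1.sum(axis=1) + np.random.randn(50)})
X2 = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"], index=range(50, 70))
y2 = pd.DataFrame({"y": X2.sum(axis=1) + np.random.randn(20)})

# wrap a plain (non-online) probabilistic regressor in the composite;
# OnlineRefit remembers all data seen so far and refits on each update
reg = OnlineRefit(ResidualDouble(LinearRegression()))
reg.fit(X1, y1)

# a new batch arrives - update refits on the union of both batches
reg.update(X2, y2)

y_pred_proba = reg.predict_proba(X2)  # distribution-valued prediction
```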

@fkiraly added the labels enhancement, module:regression (probabilistic regression module), implementing algorithms (implementing algorithms, estimators, objects native to skpro), and implementing framework (implementing or improving framework for learning tasks, e.g. base class functionality) on Sep 13, 2024
@fkiraly (Collaborator, Author) commented on Sep 13, 2024

FYI @simon-hirsch, @BerriJ - this extends the framework to add online methods :-)

@simon-hirsch

Looks generally quite cool to me 👍

Generally, I think putting the "remember old data and fit on the union of new data and old data" strategy in a separate estimator is a good thing, as it is potentially dangerous with respect to the disk space that an estimator saved with pickle / joblib / ... takes up, and it will slow down the storing/loading of models.

For testing, you might want to use a TimeSeriesSplit instead of the random train_test_split. For exact online learning methods, one could even test whether the update indeed leads to the same result as a repeated batch fit; this is of course trickier for approximate methods like SGD.
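
A rough sketch of the equivalence check suggested above (not the actual test added in the PR); it emulates an exact online method via the OnlineRefit composite, whose import path is assumed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

from skpro.regression.residual import ResidualDouble
from skpro.regression.online import OnlineRefit  # import path assumed

X = pd.DataFrame(np.random.randn(60, 2), columns=["a", "b"])
y = pd.DataFrame({"y": X["a"] + np.random.randn(60)})

reg = OnlineRefit(ResidualDouble(LinearRegression()))

for old_idx, new_idx in TimeSeriesSplit(n_splits=3).split(X):
    X_old, y_old = X.iloc[old_idx], y.iloc[old_idx]
    X_new, y_new = X.iloc[new_idx], y.iloc[new_idx]

    # fit on the old batch, then update with the new batch
    reg_online = reg.clone().fit(X_old, y_old)
    reg_online.update(X_new, y_new)

    # plain batch fit on all data seen so far
    reg_batch = reg.clone().fit(
        pd.concat([X_old, X_new]), pd.concat([y_old, y_new])
    )

    # for exact online updates, point predictions should coincide
    np.testing.assert_allclose(
        reg_online.predict(X_new).values, reg_batch.predict(X_new).values
    )
```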

@fkiraly (Collaborator, Author) commented on Sep 19, 2024

For testing, you might want to use a TimeSeriesSplit instead of the random train_test_split.

This is just testing the interface, and imo it should not matter for the test.

Regarding the "conceptual model": unlike sklearn, we do not assume or test that regressors behave exchangeably with respect to the sample index. Once we get the first examples of regressors that assume an ordering or other types of non-exchangeability, we could simply distinguish them by tag.
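
For concreteness, a minimal sketch of how such a tag-based distinction could look, using the capability:online tag added in this PR; the order-dependence tag name is purely hypothetical and the OnlineRefit import path is assumed:

```python
from sklearn.linear_model import LinearRegression

from skpro.regression.residual import ResidualDouble
from skpro.regression.online import OnlineRefit  # import path assumed

reg = OnlineRefit(ResidualDouble(LinearRegression()))

# skbase-style tag inspection: the composite advertises online update capability
assert reg.get_tag("capability:online")

# a future non-exchangeability tag (name hypothetical) could be queried the same way:
# reg.get_tag("capability:ordered", tag_value_default=False)
```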

@fkiraly (Collaborator, Author) commented on Sep 19, 2024

Generally, I think putting the "remember old data and fit on the union of new data and old data" strategy in a separate estimator is a good thing, as it is potentially dangerous with respect to the disk space that an estimator saved with pickle / joblib / ... takes up, and it will slow down the storing/loading of models.

Agreed, I think it is already an issue with sktime forecasters; there, storing seems more important since there is no "y from X" pairing, though I'd also like to get rid of it as much as possible.

@fkiraly merged commit ba2aae5 into main on Sep 27, 2024
28 checks passed
Linked issue closed by this pull request: [ENH] online probabilistic regression