This repository was archived by the owner on May 23, 2022. It is now read-only.

Remove dependencies: StatsBase and Distributions #27

Closed
joshday opened this issue Jun 27, 2017 · 18 comments

Comments

@joshday
Member

joshday commented Jun 27, 2017

My PR to move params from Distributions to StatsBase now has 8 commits and 22 comments...

I think this is as good a time as any to visit the idea of going back to 0 dependencies, which was our original thought when we created LearnBase. We essentially only have StatsBase and Distributions in our require file for nobs and params/params!. Does anyone have a strong opinion on adding these ourselves and just not exporting them? Or other solutions?

@Evizero
Member

Evizero commented Jul 2, 2017

Well we need nobs defined here because MLDataPattern and MLDataUtils build on that fact heavily.

@joshday
Member Author

joshday commented Oct 6, 2017

Revisiting this again. We were able to drop Distributions 🎉 thanks to params now living in StatsBase. Would you be okay with JuliaML packages that need params or nobs importing StatsBase directly? We pick up 4 dependencies for 2 names (not functionality), and that ratio kinda bugs me.

  • StatsBase
    • DataStructures
    • SpecialFunctions
      • BinDeps

@Evizero
Member

Evizero commented Oct 6, 2017

for completeness, this is what we import and reexport from StatsBase

import StatsBase: nobs, fit, fit!, predict, params, params!

Well, the unfortunate thing about the state we are in now (disregarding potential future reasons) is that MLDataPattern advertises that you only need to extend functions from LearnBase (AFAIK nobs and getobs) in order to add support for a custom data type.

I guess it would be possible to say "extend LearnBase.getobs and StatsBase.nobs", but that's also not ideal.

I do see your point though.

@joshday
Member Author

joshday commented Oct 9, 2017

We could drop fit for learn (even though I'm the one that added fit...), nobs and params could be imported by packages that need them, but predict is pretty essential. I guess it wouldn't be worth the effort to hack around a StatsBase dependency.

@joshday joshday closed this as completed Oct 9, 2017
@oxinabox oxinabox reopened this Jun 28, 2019
@oxinabox
Member

oxinabox commented Jun 28, 2019

I am reopening this for the modern world.
Can we think about this again?

I just want to overload default_obsdim for my new container type.
And I don't want to end up with StatsBase at all really,
in my dependency tree.

@joshday
Member Author

joshday commented Jun 28, 2019

It's difficult because StatsBase has claimed some key functions like predict and nobs that LearnBase really can't go without. I think StatsBase needs a "StatsCore" package that just provides the namespace (empty function definitions, just like LearnBase), but I'm not sure how well received that would be.
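A minimal sketch of the "namespace only" idea described above, assuming a hypothetical module named `StatsCoreSketch` (no such package is confirmed by this thread). It declares generic functions with zero methods, and a second hypothetical module, `MyModels`, adds methods for its own type:

```julia
# Hypothetical stand-in for the proposed "StatsCore": it only claims names.
module StatsCoreSketch
    export nobs, predict
    function nobs end       # generic function with zero methods
    function predict end
end

# Hypothetical downstream package: adds methods, pulls in no implementation code.
module MyModels
    import ..StatsCoreSketch: nobs, predict

    struct MeanModel
        mean::Float64
        n::Int
    end

    nobs(m::MeanModel) = m.n
    predict(m::MeanModel, x::AbstractVector) = fill(m.mean, length(x))
end

using .StatsCoreSketch

m = MyModels.MeanModel(2.5, 4)
n = nobs(m)                  # dispatches to the method defined in MyModels
p = predict(m, [1, 2, 3])    # one prediction per input element
```

Because the namespace module contains no methods of its own, depending on it would cost a package only the names, which is the trade-off being discussed here.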

@oxinabox
Member

I agree,
but until we live in that world,
I don't see it as that bad for a package wanting to overload fit to have to import StatsBase even if it is already importing LearnBase.

In fact, the argument could be made that it is better practice to only overload things when importing them directly from their original namespace,
not from someone reexporting them
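As a rough illustration of that point, the sketch below uses three hypothetical modules (`OriginalNS`, `ReexportNS`, `UserPkg`) as stand-ins for the package that owns a name, a package that re-exports it, and a package that extends it. None of these names come from the thread:

```julia
module OriginalNS            # stand-in for the package that owns the name
    export fit
    function fit end
end

module ReexportNS            # stand-in for a package that re-exports it
    import ..OriginalNS: fit
    export fit
end

module UserPkg
    import ..OriginalNS: fit     # import from the owner, not the re-exporter

    struct Constant
        value::Float64
    end

    fit(::Type{Constant}, y::AbstractVector) = Constant(sum(y) / length(y))
end

# Both paths name the same generic function object,
# so the new method is reachable through either one.
same   = ReexportNS.fit === OriginalNS.fit
fitted = ReexportNS.fit(UserPkg.Constant, [1.0, 2.0, 3.0])
```

Importing from the owning module does not change which function gets extended; it only makes the real dependency explicit in the extending package.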

@joshday
Member Author

joshday commented Jun 29, 2019

That's fair. I'm on board with removing StatsBase here

@skanskan

What's the difference between StatsBase, Distributions, StatsFuns and StatsModels?

@oxinabox
Member

StatsBase is core functions for working with statistics, like calculating the weighted std-dev.
Distributions is for working with probability distributions, sampling and logpdf etc.
StatsFuns is like SpecialFunctions.jl but for statistics, so small functions like logit.
StatsModels is a DSL for describing the relationships one wants to model, mostly used for linear models and that kinda thing.

@juliohm
Member

juliohm commented Apr 2, 2020

Reviving this thread again... (third time I guess?)

I am planning to revamp the LearnBase.jl package and upgrade it into a more general interface for statistical learning. In this plan the names fit and predict seem too narrow, and I agree with @joshday that we could perhaps call them learn and perform or something along these lines to avoid clashes with StatsBase.jl and Statistics stdlib?

I would like to ask you where else the LearnBase.jl package is being used inside the JuliaML organization. To what extent do we need to be backwards compatible? I see that the project doesn't have a Project.toml yet, and that is a good sign that it is not being actively used elsewhere.

I will open a separate issue to share a proposal regarding an updated LearnBase.jl API.

@joshday
Member Author

joshday commented Apr 2, 2020

> I see that the project doesn't have a Project.toml yet, and that is a good sign that it is not being actively used elsewhere.

LearnBase hasn't required much updating since really all it does is claim names (although I'm pretty surprised there's no Project.toml!). I use it in several packages (OnlineStats, SparseRegression), but I started removing LearnBase from both within the last month.

@juliohm
Member

juliohm commented Apr 2, 2020

@joshday why is it that you are removing it from your packages? Do you feel that it is not suiting your needs?

@joshday
Member Author

joshday commented Apr 2, 2020

For OnlineStats, I'd like to move modeling bits into a different package. I don't think too many people use them and it allows me to drop 3 deps (LearnBase, LossFunctions, PenaltyFunctions).

I misspoke about SparseRegression. I did start toying around with a new implementation of PenaltyFunctions though. I think both LossFunctions and PenaltyFunctions could be built simpler.

@juliohm
Member

juliohm commented Apr 2, 2020

> For OnlineStats, I'd like to move modeling bits into a different package. I don't think too many people use them and it allows me to drop 3 deps (LearnBase, LossFunctions, PenaltyFunctions).

You mean an equivalent to LearnBase.jl? Aren't the packages LossFunctions.jl and PenaltyFunctions.jl in good shape?

> I think both LossFunctions and PenaltyFunctions could be built simpler.

In what sense? Are you planning to improve the existing repositories or create new ones? Should we align these efforts?

@joshday
Member Author

joshday commented Apr 2, 2020

> You mean an equivalent to LearnBase.jl?

No, I mean making something like OnlineStatsModels.jl that has LearnBase/LossFunctions/PenaltyFunctions as dependencies.

> Aren't the packages LossFunctions.jl and PenaltyFunctions.jl in good shape?

Yes, they're great. I'm just toying around with making things simpler. They were written in the days when you couldn't dispatch on function types, so I'm trying a slightly different interface now. If it shows promise, I'll make new branches in the respective repos.

@juliohm
Member

juliohm commented Apr 2, 2020

> No, I mean making something like OnlineStatsModels.jl that has LearnBase/LossFunctions/PenaltyFunctions as dependencies.

Just so that I understand, this *Models.jl suffix means you will be defining some fit/predict interface there or you will be reusing the interface from LearnBase.jl? Because I am considering revamping the interface in LearnBase.jl to better handle transfer learning for example. I've been playing with an interface in GeoStats.jl for example that looks like:

lmodel = learn(task, data, model)

predic = perform(task, data, lmodel)
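For readers unfamiliar with this call shape, here is a toy sketch of what a `learn`/`perform` pair could look like. Only the two call signatures come from the lines above; every type here (`RegressionTask`, `LineThroughOrigin`, `LearnedLine`) is a hypothetical placeholder and is not taken from GeoStats.jl or LearnBase.jl:

```julia
# Hypothetical task: which column is the feature and which is the target.
struct RegressionTask
    feature::Symbol
    target::Symbol
end

struct LineThroughOrigin end     # hypothetical unlearned model

struct LearnedLine               # hypothetical learned model
    slope::Float64
end

# learn(task, data, model): estimate a least-squares slope with no intercept.
function learn(task::RegressionTask, data, ::LineThroughOrigin)
    x = getproperty(data, task.feature)
    y = getproperty(data, task.target)
    return LearnedLine(sum(x .* y) / sum(x .^ 2))
end

# perform(task, data, lmodel): apply the learned slope to new feature values.
perform(task::RegressionTask, data, lmodel::LearnedLine) =
    lmodel.slope .* getproperty(data, task.feature)

task   = RegressionTask(:x, :y)
train  = (x = [1.0, 2.0, 3.0], y = [2.0, 4.0, 6.0])
lmodel = learn(task, train, LineThroughOrigin())
predic = perform(task, (x = [4.0],), lmodel)
```

Passing the task to both functions is what would let the same learned model be reused on a different data set, which seems to be the transfer-learning use case mentioned above.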

There is also this notion of full pipelines as models that I am finding interesting in AutoMLPipeline.jl. I am having a video call with the author tomorrow to brainstorm collaborations. But the main idea there is to have a fit/transform that works on any pipeline including feature extraction, transformation, scaling, learning. In other words, we learn the pipeline as a whole, not specific parts of it.

Please let me know how we should plan these improvements. Ideally we could build a common API in LearnBase.jl to be reused everywhere else.

@juliohm
Member

juliohm commented Apr 15, 2020

This issue has been addressed in #35.

@juliohm juliohm closed this as completed Apr 15, 2020
5 participants