Added AutoML #51
base: master
Conversation
AUC of 0.7284624 for train-0.1m.csv
Create h2o.R for newly released h2o AutoML
The leaderboard_frame is used to generate performance on a test set. If you don't provide a leaderboard_frame, it will chop off some training data to use for this purpose. The way your code is currently written, some valuable training data (15%) goes to waste to score the leaderboard. You can fix this by adding leaderboard_frame = dx_test to the h2o.automl() call.
Modified:

library(h2o)

# start a local H2O cluster
h2o.init(max_mem_size = "60g", nthreads = -1)

# load the training and test data
dx_train <- h2o.importFile(path = "train-0.1m.csv")
dx_test <- h2o.importFile(path = "test.csv")

# all columns except the target are predictors
Xnames <- names(dx_train)[which(names(dx_train) != "dep_delayed_15min")]

# run AutoML, scoring the leaderboard on the test set
system.time({
  md <- h2o.automl(x = Xnames, y = "dep_delayed_15min",
                   training_frame = dx_train,
                   leaderboard_frame = dx_test)
})

# test AUC of the leader model
system.time({
  print(h2o.auc(h2o.performance(md@leader, dx_test)))
})

# alternative way to get leader model AUC
system.time({
  print(md@leaderboard$auc[1])
})
Ensembles (the new Java implementation) + AutoML have been on my list to look at (I already did some of that). However, I think I should keep this repo to the basic algos only and create new repos for looking at things built on top of those (also, 99% of the training time in ensembles/AutoML is spent in the building blocks, so there is not much to benchmark on speed, while the increase in AUC will be very much dataset dependent). I already included ensembles in the course I'm teaching at UCLA, see here. I might create a repo for AutoML, though that's also trivial; the code above changed 2 lines vs the original. I would probably run it on 1M records though.

I actually already factored out GBMs from this benchmark in order to keep up with the newest best tools (added LightGBM) and forget about mediocre tools such as Spark. This new repo will have a more targeted focus (only 1M/10M records and only the best GBM tools), but I might be able to update it with new versions more regularly (+ add GPUs).
PS: I also started a deep learning repo a few months ago, but did not get too far (yet).
Following @ledell's advice, the code gives an AUC of 0.7286668, so some improvement, but not drastic on the 100k-row dataset. I'm running it on the 1M overnight.
@earino How long did you run it for? If it was the default, then it probably ran for 10 minutes. We changed the default to 1 hour very recently, so if you re-run on a newer version, you should make a note of the change. In your results above, it looks like StackedEnsemble_model_1496028880431_2818 had a test AUC of ~0.74, not ~0.72...?
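To keep runs comparable across versions, one can pin the time budget explicitly instead of relying on the default. A minimal sketch, assuming the max_runtime_secs argument of h2o.automl() (total training time in seconds) and reusing the frames from the code above:

# fix the AutoML budget so results don't shift when the default changes
md <- h2o.automl(x = Xnames, y = "dep_delayed_15min",
                 training_frame = dx_train,
                 leaderboard_frame = dx_test,
                 max_runtime_secs = 3600)  # 1 hour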
I'm running off the nightly build I believe? Or at least very recent. This is the exact run, it took 1 hour 1 minute and 16 seconds @ledell -> https://app.dominodatalab.com/u/earino/AutoML/runs/592cf961f5f40862c7badf99
It's the output of h2o.performance that I'm looking at.
@ledell Very explicitly, this is the exact line I'm using to get the performance number. Is it the wrong thing?
@earino That line will also work, but it requires re-computing all the performance metrics on the test set. They are already computed as part of the leaderboard, so you can read the leader's AUC directly from md@leaderboard instead.
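For reference, the two ways of getting the leader's test AUC from the code above, side by side (md and dx_test come from the earlier snippet; since leaderboard_frame = dx_test, the two numbers should agree):

# recomputes all metrics on the test frame (slower)
h2o.auc(h2o.performance(md@leader, dx_test))

# reads the AUC already stored on the leaderboard (no recomputation)
md@leaderboard$auc[1]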
As we've discussed in Slack, H2O has recently released some very interesting AutoML functionality. In this case, the leader is the StackedEnsemble generated from a GBM grid, a DL grid, a DRF, and an XRT model. On 100k records it trained for a while on some small cloud hardware, and generated a respectable AUC of 0.7284624.
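To see which models (the GBM grid, DL grid, DRF and XRT base models plus the stacked ensemble) AutoML actually trained, a minimal sketch of inspecting the leaderboard, assuming the md object from the run above:

# pull the leaderboard into R to list every model AutoML trained,
# ranked by the leaderboard metric (AUC here)
lb <- as.data.frame(md@leaderboard)
print(lb)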