soft404 doesn't work with scikit-learn 0.18+ #3
The current corpus is too big (about 1G compressed), unfortunately. I'll check if it can be made smaller, and I'll put it on S3 anyway.
Yeah, 1 GB is way too much. S3 may not be a good long-term solution because it costs money; maybe we can use http://academictorrents.com/ or something like that? Someone still needs to seed it, though.
Do you recall how long it takes to run a crawl and get a similar dataset?
```
$ ./soft404/train.py text_items_big
Most common languages in data: [('zh-cn', 143533), ('en', 117488), ('ko', 23013), ('ja', 11624), ('fr', 8772), ('de', 8533), ('it', 6847), ('pt', 5491), ('', 4918), ('vi', 3399)]
Using only data for "en" language
117484 pages, 26464 domains, 0.28 404 pages
Training vectorizer...
117484/117484 [10:18<00:00, 189.91it/s]
Building numeric features...
117484/117484 [02:45<00:00, 708.33it/s]
Training and evaluating...
105735 in train, 11749 in test
AUC 0.992 ± 0.007
AUC_text 0.992 ± 0.005
AUC_text_full 0.992 ± 0.005
F1 0.963 ± 0.013
F1_text 0.958 ± 0.012
F1_text_full 0.958 ± 0.014
selected_features 3000.000 ± 0.000
```
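For context, figures of the form "0.992 ± 0.007" above come from cross-validated evaluation. Below is a minimal sketch of how such numbers are typically produced with scikit-learn 0.18+; it uses synthetic data and a plain logistic regression rather than the project's actual pipeline, and grouping by domain is only a guess at why domain counts are reported in the log.

```python
# Illustrative only: per-fold AUC / F1 means and spreads, not soft404's train.py.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
groups = np.random.RandomState(0).randint(0, 50, size=len(y))  # stand-in for page domains

clf = LogisticRegression()
# Grouping by domain keeps pages from one site out of both train and test folds.
cv = GroupKFold(n_splits=10)
auc = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='roc_auc')
f1 = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='f1')
print('AUC {:.3f} ± {:.3f}'.format(auc.mean(), auc.std()))
print('F1  {:.3f} ± {:.3f}'.format(f1.mean(), f1.std()))
```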
Pushed the model in d066986
The dataset is 117484 pages, so at 500 rpm it should take just about 4 hours. But I have a note that crawling became much slower after some time due to scheduling issues which I never solved, so the actual time was more than a day, I think.
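For reference, the 4-hour figure follows directly from the page count and request rate, assuming 500 requests per minute corresponds to roughly 500 pages fetched per minute:

```python
pages = 117484
requests_per_minute = 500
minutes = pages / requests_per_minute   # ≈ 235 minutes
print(minutes / 60)                     # ≈ 3.9 hours, i.e. "just about 4 hours"
```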
@lopuhin if the problem with crawling speed is the usual "all requests returned by the scheduler are for the same domain, we hit downloader limits and do nothing", then something like https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py could help; to use it, set
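The specific setting referred to is not preserved above. As a rough sketch of the idea behind such a queue, a per-domain round robin so that one slow or over-represented domain cannot monopolise what the scheduler hands to the downloader, here is a hedged illustration; the class and method names are made up for the example and are not the API of linkdepth/queues.py:

```python
# Sketch of a per-domain round-robin request queue (illustrative, not queues.py).
from collections import deque, OrderedDict

class RoundRobinDomainQueue:
    def __init__(self):
        self._queues = OrderedDict()   # domain -> deque of pending requests

    def push(self, domain, request):
        self._queues.setdefault(domain, deque()).append(request)

    def pop(self):
        # Serve the least recently served domain, then rotate it to the back.
        if not self._queues:
            return None
        domain, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        del self._queues[domain]
        if queue:
            self._queues[domain] = queue
        return request
```

In Scrapy a queue like this is usually wired in through the scheduler (for example via the SCHEDULER_PRIORITY_QUEUE setting or a custom scheduler class), but the exact configuration the comment intended is not shown here.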
Another option is to use frontera; it uses a thing called OverusedBuffer to fight this issue.
Yes, I think that was the problem. Thanks for the pointers!
For me the model fails to load:
I think it makes sense to either upgrade the model to scikit-learn 0.18.1, or to put the training corpus in the repository, so that the model can be rebuilt on the client.
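For background, scikit-learn does not guarantee pickle compatibility across versions, so a model trained under 0.17 often fails to unpickle under 0.18+ (several internal modules were reorganised in 0.18, for example sklearn.cross_validation into sklearn.model_selection). "Upgrading the model" therefore means retraining and re-pickling it under the new version. Below is a hedged sketch of the persistence side of that, assuming joblib is used and using a made-up model path; it is not the project's confirmed layout:

```python
# Sketch only: store a freshly trained classifier together with the scikit-learn
# version it was trained with, so a mismatch can be reported clearly at load time
# instead of failing with an opaque unpickling error.
import sklearn
from sklearn.externals import joblib  # joblib ships inside sklearn.externals in 0.18

MODEL_PATH = 'soft404-model.joblib'   # made-up path, not the project's actual location

def save_model(clf):
    joblib.dump({'sklearn_version': sklearn.__version__, 'clf': clf},
                MODEL_PATH, compress=3)

def load_model():
    data = joblib.load(MODEL_PATH)
    if data['sklearn_version'] != sklearn.__version__:
        raise RuntimeError(
            'Model was trained with scikit-learn %s but %s is installed; '
            'retrain with ./soft404/train.py' %
            (data['sklearn_version'], sklearn.__version__))
    return data['clf']
```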