soft404 doesn't work with scikit-learn 0.18+ #3
The current corpus is too big (about 1G compressed), unfortunately. I'll check if it can be made smaller, and I'll put it on S3 anyway.
Yeah, 1 GB is way too much. S3 may not be a good long-term solution because it costs money; maybe we can use http://academictorrents.com/ or something like that? Someone still needs to seed it, though.
Do you recall how long it takes to run a crawl and get a similar dataset?
```
$ ./soft404/train.py text_items_big
Most common languages in data: [('zh-cn', 143533), ('en', 117488), ('ko', 23013), ('ja', 11624), ('fr', 8772), ('de', 8533), ('it', 6847), ('pt', 5491), ('', 4918), ('vi', 3399)]
Using only data for "en" language
117484 pages, 26464 domains, 0.28 404 pages
Training vectorizer...
117484/117484 [10:18<00:00, 189.91it/s]
Building numeric features...
117484/117484 [02:45<00:00, 708.33it/s]
Training and evaluating...
105735 in train, 11749 in test
AUC 0.992 ± 0.007
AUC_text 0.992 ± 0.005
AUC_text_full 0.992 ± 0.005
F1 0.963 ± 0.013
F1_text 0.958 ± 0.012
F1_text_full 0.958 ± 0.014
selected_features 3000.000 ± 0.000
```
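For context, figures of the form "0.992 ± 0.007" above come from cross-validated evaluation. Below is a minimal sketch of how such numbers are typically produced with scikit-learn 0.18+; it uses synthetic data and a plain logistic regression rather than the project's actual pipeline, and grouping by domain is only a guess at why domain counts are reported in the log.

```python
# Illustrative only: per-fold AUC / F1 means and spreads, not soft404's train.py.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
groups = np.random.RandomState(0).randint(0, 50, size=len(y))  # stand-in for page domains

clf = LogisticRegression()
# Grouping by domain keeps pages from one site out of both train and test folds.
cv = GroupKFold(n_splits=10)
auc = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='roc_auc')
f1 = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring='f1')
print('AUC {:.3f} ± {:.3f}'.format(auc.mean(), auc.std()))
print('F1  {:.3f} ± {:.3f}'.format(f1.mean(), f1.std()))
```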
Pushed the model in d066986
The dataset is 117484 pages, so at 500 rpm it should take just about 4 hours. But I have a note that crawling became much slower after some time due to scheduling issues which I never solved, so the actual time was more than a day, I think.
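For reference, the 4-hour figure follows directly from the page count and request rate, assuming 500 requests per minute corresponds to roughly 500 pages fetched per minute:

```python
pages = 117484
requests_per_minute = 500
minutes = pages / requests_per_minute   # ≈ 235 minutes
print(minutes / 60)                     # ≈ 3.9 hours, i.e. "just about 4 hours"
```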
@lopuhin if the problem with crawling speed is the usual "all requests returned by the scheduler are for the same domain, we hit downloader limits and do nothing", then something like https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py could help; to use it, set
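The specific setting referred to is not preserved above. As a rough sketch of the idea behind such a queue, a per-domain round robin so that one slow or over-represented domain cannot monopolise what the scheduler hands to the downloader, here is a hedged illustration; the class and method names are made up for the example and are not the API of linkdepth/queues.py:

```python
# Sketch of a per-domain round-robin request queue (illustrative, not queues.py).
from collections import deque, OrderedDict

class RoundRobinDomainQueue:
    def __init__(self):
        self._queues = OrderedDict()   # domain -> deque of pending requests

    def push(self, domain, request):
        self._queues.setdefault(domain, deque()).append(request)

    def pop(self):
        # Serve the least recently served domain, then rotate it to the back.
        if not self._queues:
            return None
        domain, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        del self._queues[domain]
        if queue:
            self._queues[domain] = queue
        return request
```

In Scrapy a queue like this is usually wired in through the scheduler (for example via the SCHEDULER_PRIORITY_QUEUE setting or a custom scheduler class), but the exact configuration the comment intended is not shown here.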
Another option is to use frontera; it uses a thing called OverusedBuffer to fight this issue.
Yes, I think that was the problem. Thanks for the pointers!
For me the model fails to load:
I think it makes sense to either upgrade the model to scikit-learn 0.18.1, or to put the training corpus in the repository, so that the model can be rebuilt on the client.
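For background, scikit-learn does not guarantee pickle compatibility across versions, so a model trained under 0.17 often fails to unpickle under 0.18+ (several internal modules were reorganised in 0.18, for example sklearn.cross_validation into sklearn.model_selection). "Upgrading the model" therefore means retraining and re-pickling it under the new version. Below is a hedged sketch of the persistence side of that, assuming joblib is used and using a made-up model path; it is not the project's confirmed layout:

```python
# Sketch only: store a freshly trained classifier together with the scikit-learn
# version it was trained with, so a mismatch can be reported clearly at load time
# instead of failing with an opaque unpickling error.
import sklearn
from sklearn.externals import joblib  # joblib ships inside sklearn.externals in 0.18

MODEL_PATH = 'soft404-model.joblib'   # made-up path, not the project's actual location

def save_model(clf):
    joblib.dump({'sklearn_version': sklearn.__version__, 'clf': clf},
                MODEL_PATH, compress=3)

def load_model():
    data = joblib.load(MODEL_PATH)
    if data['sklearn_version'] != sklearn.__version__:
        raise RuntimeError(
            'Model was trained with scikit-learn %s but %s is installed; '
            'retrain with ./soft404/train.py' %
            (data['sklearn_version'], sklearn.__version__))
    return data['clf']
```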