reduce 200x200000 into 200x1000 #23

avilella · 2017-08-23T08:29:18Z

Hi, I have a ChIP-seq style dataset of RPKM values that I want to reduce from 200x200000 into 200x1000, so that I only end up with 1000 variables at the end of the MDR process, for my 200 records.

What would be the recommended way to use scikit-mdr for this task?

rhiever · 2017-08-23T16:09:18Z

Hi @avilella,

MDR can perform feature construction to compress some number of features down to a single feature. Theoretically, MDR could do so with thousands of features; in practice, MDR works best when given no more than about 5 features at a time. As such, a common practice with MDR is to exhaustively evaluate all MDR models up to n-way and keep only the best k, where n and k are defined by the user. In your case, k=1000 and maybe n=2 (for example). With 200,000 features, MDR would have to evaluate 19,999,900,000 (roughly 20 billion) pairwise models, which is likely outside your computational budget.
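To make the two points above concrete, here is a minimal NumPy sketch of MDR-style feature construction for a binary outcome, followed by the count of 2-way models. It is not the scikit-mdr implementation: `mdr_construct` is a hypothetical helper, and it assumes the features are already discrete (continuous RPKM values would need to be binned first) and that `y` holds 0/1 labels.

```python
import math

import numpy as np


def mdr_construct(X_discrete, y):
    """Collapse several discrete features into one binary high/low-risk feature.

    Hypothetical helper for illustration only, not the scikit-mdr API.
    Assumes y contains 0 (control) and 1 (case).
    """
    X_discrete = np.asarray(X_discrete)
    y = np.asarray(y, dtype=float)

    # Fraction of cases in the whole dataset; used as the risk threshold.
    baseline = y.mean()

    # Give every distinct combination of feature values a cell id.
    cells, cell_ids = np.unique(X_discrete, axis=0, return_inverse=True)
    cell_ids = cell_ids.ravel()

    # Count cases and total samples per cell.
    cases = np.bincount(cell_ids, weights=y, minlength=len(cells))
    totals = np.bincount(cell_ids, minlength=len(cells))

    # A cell is "high risk" when its case fraction exceeds the baseline.
    high_risk = (cases / totals) > baseline

    # Each sample gets the risk label of the cell it falls into.
    return high_risk[cell_ids].astype(int)


# Number of 2-way models among 200,000 features: C(200000, 2).
n_pairwise_models = math.comb(200_000, 2)
print(f"{n_pairwise_models:,}")  # 19,999,900,000
```

Each of those roughly 20 billion pairs would need its own construction and scoring pass, which is why an exhaustive 2-way search over the full dataset is impractical.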

For that reason, we've developed some feature selection algorithms in the scikit-rebate package that may be better for your use case. The scikit-rebate algorithms can scan your dataset and assign feature importance scores to every feature (in terms of their ability to predict the outcome, potentially interacting with other features) and select a subset of features down to, say, 1000 features. From there, MDR can more reasonably be used in the way I describe above to explicitly construct new, condensed features from the remaining 1000 features.
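A rough sketch of that selection step might look like the following. It assumes `X` is a 200 x 200000 NumPy array of RPKM values and `y` is an outcome vector for the 200 records (the thread does not say what the outcome is). The `ReliefF` arguments shown (`n_features_to_select`, `n_neighbors`, `n_jobs`) and the `feature_importances_` attribute are as I recall them from the scikit-rebate documentation, so check them against the installed version; `top_k_features` and `reduce_with_relieff` are hypothetical helper names.

```python
import numpy as np


def top_k_features(importances, k):
    """Return the column indices of the k highest-scoring features, best first."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1]
    return order[:k]


def reduce_with_relieff(X, y, k=1000, n_neighbors=100, n_jobs=-1):
    """Score every feature with ReliefF and keep the top k columns.

    Hypothetical wrapper; parameter names are assumed from the
    scikit-rebate docs and should be verified before use.
    """
    # Imported lazily so top_k_features works without scikit-rebate installed.
    from skrebate import ReliefF

    selector = ReliefF(
        n_features_to_select=k,
        n_neighbors=n_neighbors,
        n_jobs=n_jobs,  # -1 is assumed to mean "use all available cores"
    )
    selector.fit(X, y)

    keep = top_k_features(selector.feature_importances_, k)
    return X[:, keep], keep
```

Calling `X_small, kept = reduce_with_relieff(X, y, k=1000)` would then return a 200 x 1000 matrix plus the indices of the retained columns, which can be discretized and passed on to MDR.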

Hope that helps.

avilella · 2017-08-23T16:10:35Z

Beautiful! I will try it!

rhiever · 2017-08-23T16:14:40Z

Great. I should note that scikit-rebate may take a while to run on a dataset with 200k features, but there is an n_jobs parameter that allows it to use multiple processors and speed the algorithm up.

rhiever added the question label Aug 23, 2017
