reduce 200x200000 into 200x1000 #23

Open
avilella opened this issue Aug 23, 2017 · 3 comments
@avilella

Hi, I have a ChIP-seq-style dataset of RPKM values that I want to reduce from 200x200000 to 200x1000, so that my 200 records end up with only 1000 variables after the MDR process.

What would be the recommended way to use scikit-mdr for this task?

@rhiever
Contributor

rhiever commented Aug 23, 2017

Hi @avilella,

MDR can perform feature construction to compress some number of features down to a single feature. Theoretically, MDR could do so with thousands of features; practically, MDR works best when it is passed no more than about 5 features at a time. As such, a common practice with MDR is to exhaustively evaluate all MDR models up to n-way and keep only the best k, where n and k are defined by the user. In your case, k=1000 and maybe n=2 (for example). With 200,000 features, MDR would have to evaluate 200000 × 199999 / 2 = 19,999,900,000 (about 2 × 10^10) 2-way models, which is likely outside your computational budget.
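To make the model count and the "evaluate every combination, keep the best k" idea concrete, here is a minimal pure-Python sketch. It is not scikit-mdr's code: `mdr_construct` and `best_k_models` are hypothetical helper names, the toy data assumes binary features and a binary outcome, and plain accuracy is used as the score only for simplicity.

```python
from collections import Counter
from itertools import combinations, product
from math import comb

# Size of the exhaustive 2-way search over 200,000 features.
n_models = comb(200_000, 2)
print(n_models)  # 19999900000


def mdr_construct(rows, labels, combo):
    """Collapse the features in `combo` into one binary high/low-risk feature."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        cell = tuple(row[i] for i in combo)
        (cases if y == 1 else controls)[cell] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    new_feature = []
    for row in rows:
        cell = tuple(row[i] for i in combo)
        # High risk when the cell's case:control ratio exceeds the overall
        # ratio (cross-multiplied to avoid dividing by zero).
        high = cases[cell] * n_controls > controls[cell] * n_cases
        new_feature.append(1 if high else 0)
    return new_feature


def best_k_models(rows, labels, n, k):
    """Score every n-way feature combination and keep the k best."""
    n_features = len(rows[0])
    scored = []
    for combo in combinations(range(n_features), n):
        constructed = mdr_construct(rows, labels, combo)
        accuracy = sum(c == y for c, y in zip(constructed, labels)) / len(labels)
        scored.append((accuracy, combo))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy data: 4 binary features, outcome is the XOR of features 0 and 1.
rows = list(product([0, 1], repeat=4))
labels = [row[0] ^ row[1] for row in rows]

top_models = best_k_models(rows, labels, n=2, k=1)
print(top_models)  # [(1.0, (0, 1))]
```

The toy search covers only 6 pairs; the same loop over 200,000 features would need the roughly 2 × 10^10 evaluations mentioned above, which is why a filtering step beforehand is attractive.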

For that reason, we've developed some feature selection algorithms in the scikit-rebate package that may be better for your use case. The scikit-rebate algorithms can scan your dataset and assign feature importance scores to every feature (in terms of their ability to predict the outcome, potentially interacting with other features) and select a subset of features down to, say, 1000 features. From there, MDR can more reasonably be used in the way I describe above to explicitly construct new, condensed features from the remaining 1000 features.
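As a rough illustration of what "assign feature importance scores and select a subset" means, here is a simplified version of the original Relief idea in pure Python. This is a toy sketch and not scikit-rebate's implementation (the library provides ReliefF and related variants); `relief_scores` and `select_top` are hypothetical names, and the data is assumed to be binary with at least one same-class and one other-class neighbor per record.

```python
from itertools import product


def relief_scores(rows, labels):
    """Simplified Relief: reward features that differ on the nearest miss,
    penalize features that differ on the nearest hit."""
    n_features = len(rows[0])
    m = len(rows)
    weights = [0.0] * n_features
    for i, (row, y) in enumerate(zip(rows, labels)):
        # Nearest neighbor so far, keyed by "same class as record i".
        nearest = {True: None, False: None}
        for j, (other, other_y) in enumerate(zip(rows, labels)):
            if i == j:
                continue
            dist = sum(a != b for a, b in zip(row, other))  # Hamming distance
            same = other_y == y
            if nearest[same] is None or dist < nearest[same][0]:
                nearest[same] = (dist, other)
        hit, miss = nearest[True][1], nearest[False][1]
        for f in range(n_features):
            weights[f] -= (row[f] != hit[f]) / m
            weights[f] += (row[f] != miss[f]) / m
    return weights


def select_top(weights, k):
    """Return the indices of the k highest-scoring features."""
    ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    return sorted(ranked[:k])


# Toy data: 4 binary features, outcome is the XOR of features 0 and 1,
# so neither feature is predictive on its own.
rows = list(product([0, 1], repeat=4))
labels = [row[0] ^ row[1] for row in rows]

weights = relief_scores(rows, labels)
selected = select_top(weights, k=2)
print(weights)   # [0.5, 0.5, -0.5, -0.5]
print(selected)  # [0, 1]
```

The two interacting features score highest even though each one alone carries no signal, which is the property that makes Relief-style scoring a reasonable filter before constructing features with MDR. For a real 200x200000 dataset you would use scikit-rebate itself rather than this quadratic-time toy.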

Hope that helps.

@avilella
Author

avilella commented Aug 23, 2017 via email

@rhiever
Contributor

rhiever commented Aug 23, 2017

Great. I should note that scikit-rebate may take a while to run on a dataset with 200k features, but there is an n_jobs parameter that lets it use multiple processors to speed the algorithm up.
