some improvements #3

Open
2 of 3 tasks
enriquea opened this issue May 28, 2019 · 4 comments

enriquea (Owner) commented May 28, 2019

  • Add function to remove features with high missingness rate.

  • Add some basic imputation method (e.g. k-means).

  • Make it easier to extract the top features contributing to the principal components that explain most of the variance in the dataset.
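
The first item in this list could look roughly like the sketch below. This is only an illustration in base R: `filter_missingness()` and its `max_missing` argument are hypothetical names, not part of feseR, and the toy matrix is made up.

```r
# Hypothetical helper (not part of feseR): drop features (columns) whose
# fraction of missing values exceeds `max_missing`.
filter_missingness <- function(features, max_missing = 0.5) {
  stopifnot(is.matrix(features) || is.data.frame(features))
  # colMeans(is.na(x)) is the per-feature missingness rate
  miss_rate <- colMeans(is.na(features))
  features[, miss_rate <= max_missing, drop = FALSE]
}

# Toy data: 5 samples x 3 features
x <- cbind(
  f1 = c(1, 2, 3, 4, 5),      # complete
  f2 = c(NA, NA, NA, 4, 5),   # 60% missing
  f3 = c(1, NA, 3, 4, 5)      # 20% missing
)

filtered <- filter_missingness(x, max_missing = 0.5)
colnames(filtered)
```

With a threshold of 0.5, only `f2` is above the cutoff, so it is the only column removed.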

enriquea self-assigned this on May 28, 2019
ravichas commented May 28, 2019

Enriquea and team:

(These are questions, not issues. I hope it is OK to submit them here. If not, please let me know and I will send an email instead. Thanks.)

Great software, thanks for sharing. Enriquea, thanks for answering my earlier email questions.
I am using the latest feseR and other related software (sessionInfo shown below)
I have a couple of queries.

  1. In the vignette, https://github.com/enriquea/feseR/blob/master/vignettes/feser.pdf, Table 2 reports the classification metrics for 20 class-balanced and randomized runs. Can you please comment on the creation of balanced (up/down/mixed/ROSE?) datasets?

  2. For parallel runs, I am not sure how to pass the "allowParallel = TRUE" or equivalent options through your

  3. A procedure for extracting the top-n features (I see this as the last item in your extra improvements list, thanks)

Cheers
Ravi

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/local/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] feseR_0.2.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 pillar_1.4.0 compiler_3.5.2 gower_0.2.0
[5] plyr_1.8.4 tools_3.5.2 iterators_1.0.10 class_7.3-15
[9] rpart_4.1-15 ipred_0.9-9 lubridate_1.7.4 tibble_2.1.1
[13] nlme_3.1-139 gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.2
[17] rlang_0.3.4 Matrix_1.2-17 foreach_1.4.4 prodlim_2018.04.18
[21] withr_2.1.2 stringr_1.4.0 dplyr_0.8.0.1 generics_0.0.2
[25] recipes_0.1.5 stats4_3.5.2 grid_3.5.2 caret_6.0-84
[29] nnet_7.3-12 tidyselect_0.2.5 data.table_1.12.2 glue_1.3.1
[33] R6_2.4.0 survival_2.44-1.1 lava_1.6.5 reshape2_1.4.3
[37] ggplot2_3.1.1 purrr_0.3.2 magrittr_1.5 ModelMetrics_1.2.2
[41] scales_1.0.0 codetools_0.2-16 MASS_7.3-51.4 splines_3.5.2
[45] assertthat_0.2.1 timeDate_3043.102 colorspace_1.4-1 stringi_1.4.3
[49] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

Can you explain how the class balancing was carried out (ROSE, up/down/mixed)? Do any of the feseR protocols include this option?

drychkov (Contributor)

@ravichas
I can answer the first two questions.

  1. As I understand it, "class-balanced" in the vignette means that the class ratios are preserved in each run. The training/testing split was made with createDataPartition() from the caret package.
    Artificially balancing the classes is usually not advisable for feature selection procedures, so it is better to benchmark with metrics suited to imbalanced classes, such as Cohen's kappa or AUC.

  2. The combineFS() function contains foreach() %dopar% {} with allowParallel = TRUE passed to caret::rfe(). So a regular parallel backend setup such as

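library(doSNOW)

# coreNums: number of worker processes to start (set this for your machine)
cl <- parallel::makeCluster(coreNums)
registerDoSNOW(cl)

combineFS(
...
)

# shut the workers down once the run has finished
parallel::stopCluster(cl)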

(or similar) will work here.

A BiocParallel-based implementation would probably be better here, but it has not been done yet.
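
The class-ratio-preserving split described in point 1 can be sketched as follows. This is a minimal illustration on made-up data, assuming caret is installed; the seed, the class sizes and `p = 0.75` are arbitrary choices and are not taken from the vignette.

```r
library(caret)

set.seed(42)  # arbitrary seed, for reproducibility only

# Toy imbalanced outcome: 80 controls, 20 cases (made-up data)
y <- factor(rep(c("control", "case"), times = c(80, 20)))

# createDataPartition() samples within each class, so the training set
# keeps roughly the same class ratio as the full data.
train_idx <- createDataPartition(y, p = 0.75, list = FALSE)[, 1]

full_ratio  <- prop.table(table(y))             # class ratio, full data
train_ratio <- prop.table(table(y[train_idx]))  # class ratio, training set

full_ratio
train_ratio
```

No up- or down-sampling happens here: the minority class stays a minority in both the training and the testing part.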

ravichas commented May 28, 2019

@drychkov

Thanks very much for the detailed explanations.

Ravi

ravichas commented Jun 13, 2019

@enriquea

"Make easier to extract the top features contributing to the principal components explaining most of the variance in the dataset."

I am sure you are busy; I am just curious about this enhancement. Can we expect it soon? :)

Cheers
Ravi
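
Until such a function exists in the package, something along these lines can be done with base R. This is only a sketch under stated assumptions: `top_pc_features()`, `n_top` and `var_explained` are hypothetical names (not feseR API), and ranking features by variance-weighted squared loadings is one possible scoring choice among several.

```r
# Hypothetical helper (not part of feseR): rank features by their
# contribution to the leading principal components.
top_pc_features <- function(features, n_top = 5, var_explained = 0.8) {
  pca <- prcomp(features, center = TRUE, scale. = TRUE)

  # proportion of variance explained by each component
  prop_var <- pca$sdev^2 / sum(pca$sdev^2)

  # smallest number of leading PCs whose cumulative variance reaches the target
  n_pc <- which(cumsum(prop_var) >= var_explained)[1]

  # score = squared loadings on those PCs, weighted by variance explained
  loadings <- pca$rotation[, seq_len(n_pc), drop = FALSE]
  score <- as.numeric(loadings^2 %*% prop_var[seq_len(n_pc)])
  names(score) <- rownames(loadings)

  head(sort(score, decreasing = TRUE), n_top)
}

# Usage on a built-in dataset (numeric columns of iris)
top <- top_pc_features(as.matrix(iris[, 1:4]), n_top = 2)
top
```

The returned vector is named by feature and sorted from the largest to the smallest score.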

enriquea changed the title from "some extra improvements" to "some improvements" on Jul 19, 2019