some improvements #3

enriquea · 2019-05-28T14:17:30Z

Add a function to remove features with a high missingness rate.
Add a basic imputation method (e.g. k-means).
Make it easier to extract the top features contributing to the principal components that explain most of the variance in the dataset.
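
For illustration, here is a minimal base R sketch of what the first and third items could look like. It is not part of feseR: the helper names `filter_missing()` and `top_pca_features()`, their arguments, and the scoring rule (squared loadings weighted by each component's share of variance) are all assumptions made for this example.

```r
# Hypothetical helper (not a feseR function): drop features (columns)
# whose fraction of missing values exceeds max_missing.
filter_missing <- function(x, max_missing = 0.25) {
  miss_rate <- colMeans(is.na(x))
  x[, miss_rate <= max_missing, drop = FALSE]
}

# Hypothetical helper (not a feseR function): rank features by their
# contribution to the leading principal components.
# Assumes x is numeric, has no NA values and no constant columns.
top_pca_features <- function(x, n_top = 5, var_explained = 0.8) {
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  var_frac <- pca$sdev^2 / sum(pca$sdev^2)
  # Smallest number of components reaching the requested variance share
  n_pc <- which(cumsum(var_frac) >= var_explained)[1]
  loadings_mat <- pca$rotation[, seq_len(n_pc), drop = FALSE]
  # Assumed scoring rule: squared loadings weighted by variance share
  score <- drop(loadings_mat^2 %*% var_frac[seq_len(n_pc)])
  head(sort(score, decreasing = TRUE), n_top)
}

# Usage on a built-in dataset
x <- as.matrix(mtcars)
x_na <- x
x_na[1:20, "hp"] <- NA                 # 20 of 32 values missing in "hp"
filtered <- filter_missing(x_na, max_missing = 0.25)
colnames(filtered)                     # "hp" is no longer among the columns
top_pca_features(x, n_top = 3)         # named vector of the 3 highest scores
```

Other scoring rules (for example, the largest absolute loading on the first component only) would be equally defensible; the weighting above is just one choice.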

ravichas · 2019-05-28T18:39:43Z

Enriquea and team:

(These are questions, not issues. I assume it is OK to submit my queries here; if not, please let me know and I will send an email instead. Thanks.)

Great software, thanks for sharing. Enriquea, thanks for answering my earlier email questions.
I am using the latest feseR and other related software (sessionInfo shown below)
I have a couple of queries.

In the vignette, https://github.com/enriquea/feseR/blob/master/vignettes/feser.pdf, Table 2 reports the classification metrics for 20 class-balanced and randomized runs. Can you please comment on the creation of balanced (up/down/mixed/ROSE?) datasets?
For parallel runs, I am not sure how to pass the "allowParallel = TRUE" or equivalent options through your
A procedure for extracting the top-n features (I see this as the last item in your extra improvements list, thanks)

Cheers
Ravi

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/local/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] feseR_0.2.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 pillar_1.4.0 compiler_3.5.2 gower_0.2.0
[5] plyr_1.8.4 tools_3.5.2 iterators_1.0.10 class_7.3-15
[9] rpart_4.1-15 ipred_0.9-9 lubridate_1.7.4 tibble_2.1.1
[13] nlme_3.1-139 gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.2
[17] rlang_0.3.4 Matrix_1.2-17 foreach_1.4.4 prodlim_2018.04.18
[21] withr_2.1.2 stringr_1.4.0 dplyr_0.8.0.1 generics_0.0.2
[25] recipes_0.1.5 stats4_3.5.2 grid_3.5.2 caret_6.0-84
[29] nnet_7.3-12 tidyselect_0.2.5 data.table_1.12.2 glue_1.3.1
[33] R6_2.4.0 survival_2.44-1.1 lava_1.6.5 reshape2_1.4.3
[37] ggplot2_3.1.1 purrr_0.3.2 magrittr_1.5 ModelMetrics_1.2.2
[41] scales_1.0.0 codetools_0.2-16 MASS_7.3-51.4 splines_3.5.2
[45] assertthat_0.2.1 timeDate_3043.102 colorspace_1.4-1 stringi_1.4.3
[49] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

Can you explain how the class balancing was carried out (ROSE, up/down/mixed sampling)? Do any of the feseR protocols include this option?

drychkov · 2019-05-28T19:18:51Z

@ravichas
I can answer the first two questions.

As I understand it, "class-balanced" in the vignette means that the class ratios are preserved in each run. The training/testing split was made with createDataPartition() from the caret package.
As for class balance, it is usually not advisable to create it artificially for feature selection procedures. It is better to benchmark with metrics that account for imbalance, such as Cohen's kappa or AUC.
The combineFS() function contains foreach() %dopar% {} with allowParallel = TRUE passed to the caret::rfe() function. So the usual

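library(doSNOW)

# Number of workers; leaving one core free is an arbitrary choice
coreNums <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
cl <- parallel::makeCluster(coreNums)
registerDoSNOW(cl)  # register the cluster as the foreach backend

combineFS(
...
)

parallel::stopCluster(cl)  # release the workers when done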

(or similar) will work here.

Surely, BiocParallel's implementation would be better here, but it's not done yet.
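
To make the first two points concrete, here is a minimal base R sketch of a split that preserves class ratios, plus Cohen's kappa computed by hand. The helpers `stratified_split()` and `cohens_kappa()` are hypothetical names written for this example; they are not feseR or caret functions, and caret::createDataPartition() is what you would normally call for the split.

```r
# Hypothetical helper: sample a fraction p of the indices within each class,
# so the training set keeps the original class ratios.
stratified_split <- function(y, p = 0.75) {
  y <- droplevels(as.factor(y))
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Hypothetical helper: Cohen's kappa from the confusion table,
# kappa = (p_obs - p_exp) / (1 - p_exp).
cohens_kappa <- function(truth, pred) {
  lev <- union(as.character(truth), as.character(pred))
  tab <- table(factor(truth, levels = lev), factor(pred, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Imbalanced toy labels: 80 controls, 20 cases
set.seed(1)
y <- rep(c("control", "case"), times = c(80, 20))
train_idx <- stratified_split(y, p = 0.75)
table(y[train_idx])   # 60 control, 15 case: same 4:1 ratio as the full data

# A classifier that always predicts the majority class
majority <- rep("control", length(y))
mean(majority == y)        # accuracy 0.8
cohens_kappa(y, majority)  # kappa 0
```

The last two lines show why kappa is the more honest metric on imbalanced data: always predicting the majority class reaches 80% accuracy, yet kappa is 0 because that agreement is no better than chance.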

ravichas · 2019-05-28T22:47:11Z

drychkov

Thanks very much for the detailed explanations.

Ravi

ravichas · 2019-06-13T16:32:41Z

Enrique Audain enriquea

Make easier to extract the top features contributing to the principal components explaining most of the variance in the dataset.

I am sure you are busy. I am just curious about this enhancement. Do we expect this enhancement soon? :)

Cheers
Ravi

enriquea added the enhancement label May 28, 2019

enriquea self-assigned this May 28, 2019

enriquea changed the title ~~some extra improvements~~ some improvements Jul 19, 2019
