Spark feature selection library for bigdata multiomics #2

Open
1 of 15 tasks
ypriverol opened this issue Jan 7, 2024 · 0 comments
Assignees: ypriverol
Labels: documentation, enhancement, good first issue
Milestone: version 1.1

ypriverol commented Jan 7, 2024

The Spark feature selection library for big data multiomics is an evolution of a previous R package developed by Enrique et al. The major steps to finalize the library are:

  • Create a README file in the repository where the dataset format and structure are described.
  • Add to the README a link to the single-cell example dataset we have been using to benchmark the algorithms.
  • Benchmark the single-cell dataset again with the previously developed feature selection R package.
  • Benchmark the single-cell dataset in the following infrastructures:
    • Single-machine benchmark (preferably on a user's laptop).
    • Single-node Spark cluster with multiple processors; benchmark with different processor counts (10, 20, 50, 100?).
    • Spark cluster with multiple nodes.
  • Contact the CPTAC team to get the list of phosphosites with spectral counts across the different cancer and tumor types (@ypriverol; tracked in "Generate a dataset for feature selection using CPTAC phospho data", #3).
    • Perform the same benchmarks previously done for the single-cell dataset.
  • Create a Read the Docs site for the project.
  • Implement the framework of algorithms:
    • Implement the independent feature selection algorithms: random forest (RF), correlation analysis, and principal component analysis (PCA).
    • Implement workflows that combine multiple feature selection (FS) algorithms, and give each workflow a name.
    • Provide a group of command-line tools that give access to some of the workflows for a given dataset file.
  • Discuss the results and write a publication.
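
To make the "workflows combining multiple FS algorithms" item more concrete, here is a minimal sketch of how independent selectors could be chained into a named workflow. It is plain Python rather than Spark, and everything in it is an assumption made for illustration: the `Features` layout, the function names (`variance_filter`, `top_k_correlation`, `run_workflow`), the example feature names, and the use of a variance filter as the first step (the issue only lists RF, correlation analysis, and PCA). It is not the library's API.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical data layout: feature name -> one value per sample.
Features = Dict[str, List[float]]
# A selector takes the features plus the labels and returns the names it keeps.
Selector = Callable[[Features, Sequence[float]], List[str]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def variance_filter(min_var: float) -> Selector:
    """Build a selector that keeps features with population variance > min_var."""
    def select(features: Features, labels: Sequence[float]) -> List[str]:
        kept = []
        for name, values in features.items():
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            if var > min_var:
                kept.append(name)
        return kept
    return select


def top_k_correlation(k: int) -> Selector:
    """Build a selector that keeps the k features most correlated with the labels."""
    def select(features: Features, labels: Sequence[float]) -> List[str]:
        ranked = sorted(
            features,
            key=lambda name: abs(pearson(features[name], labels)),
            reverse=True,
        )
        return ranked[:k]
    return select


def run_workflow(steps: List[Selector], features: Features,
                 labels: Sequence[float]) -> List[str]:
    """Apply each selector in order, passing only the surviving features on."""
    current = dict(features)
    for step in steps:
        kept = step(current, labels)
        current = {name: current[name] for name in kept}
    return list(current)


# Toy data: four samples, binary labels, made-up feature names.
labels = [0.0, 0.0, 1.0, 1.0]
features = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [5.0, 5.0, 5.0, 5.0],  # constant, dropped by the variance filter
    "gene_c": [1.0, 0.0, 1.0, 0.0],  # uncorrelated with the labels
    "gene_d": [0.0, 0.1, 0.9, 1.0],
}

# A "named workflow" here is just a name mapped to an ordered list of selectors.
workflows = {"variance_then_correlation": [variance_filter(0.0), top_k_correlation(2)]}
selected = run_workflow(workflows["variance_then_correlation"], features, labels)
print(selected)  # ['gene_d', 'gene_a']
```

Treating each algorithm as a function with the same signature keeps the workflows declarative, so a command-line tool could look a workflow up by name and run it on a dataset file. A Spark version would presumably replace the dictionaries with DataFrames, but that design is not specified in this issue.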
ypriverol added the good first issue label on Jan 7, 2024
ypriverol self-assigned this on Jan 7, 2024
ypriverol added the documentation and enhancement labels on Jan 7, 2024
ypriverol added this to the version 1.1 milestone on Jan 7, 2024