Orange Pipeline #84

Open: 76 commits to be merged into base: master

Conversation

Vghxv
Contributor

@Vghxv Vghxv commented Dec 2, 2024

Development

Description

This PR introduces the following functionalities:

  1. split_dataset, basic_dbeta, and setting dbeta_threshold tailored exclusively for lung cancer analysis.
  2. Updates to the lock file by adding dependencies for toml and plotly.
  3. Enhancements to the orange_pipeline_orchestrator to generalize the process for five cancer types. Currently, it supports analysis for two datasets, with lung cancer serving as a specific case.

Additionally, the orange_pipeline_orchestrator🪈 records all pivotal feature processes in a toml file to enhance clarity and prevent confusion. It is important to note that Section 0 should not be re-run as doing so could alter filtering results.
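One way to make the "do not re-run Section 0" rule harder to break by accident is to guard the step on the existence of its output file. The following is only a minimal sketch under that assumption; `run_once` and the `step` callable are hypothetical names and are not part of this PR.

```python
from pathlib import Path
from typing import Callable


def run_once(output_path: Path, step: Callable[[Path], None]) -> bool:
    """Run `step` only if `output_path` does not exist yet.

    Returns True if the step ran and False if it was skipped.
    (Hypothetical helper, not taken from the notebook.)
    """
    if output_path.exists():
        # A previous run already produced the file: leave it untouched so
        # downstream filtering keeps seeing the same data.
        return False
    output_path.parent.mkdir(parents=True, exist_ok=True)
    step(output_path)
    return True
```

With a guard like this, executing the cell a second time becomes a no-op instead of silently regenerating the filtered data.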

Implementation

  • Split Dataset & Threshold Features: Developed functions to split the dataset, calculate basic dbeta, and set dbeta_threshold. These functionalities are tailored for lung cancer analysis to provide targeted insights.
  • Dependency Updates: The lock file includes new dependencies (toml, plotly) to support new features and enhance visualization.
  • Pipeline Orchestrator: Updated the orange_pipeline_orchestrator to manage datasets for five cancer types, though only two datasets are currently supported. All critical feature processes are logged into a toml file to improve traceability and reduce ambiguity.
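As a rough illustration of the dbeta step, the sketch below assumes that dbeta for a CpG site is the mean beta value in tumor samples minus the mean in normal samples, and that the threshold is applied to its absolute value. That definition is a common convention and is not confirmed by this PR; `delta_beta` and `filter_by_threshold` are hypothetical names, not the functions added here.

```python
from statistics import mean


def delta_beta(tumor, normal):
    """Per-site delta beta: mean tumor beta minus mean normal beta.

    `tumor` and `normal` map a CpG site ID to a list of beta values in [0, 1].
    Only sites present in both groups are returned.
    """
    shared_sites = tumor.keys() & normal.keys()
    return {site: mean(tumor[site]) - mean(normal[site]) for site in shared_sites}


def filter_by_threshold(dbeta, threshold):
    """Keep sites whose absolute delta beta is at least `threshold`."""
    return {site: d for site, d in dbeta.items() if abs(d) >= threshold}


# Toy data (made up for illustration only).
tumor = {"cg01": [0.8, 0.9], "cg02": [0.5, 0.5]}
normal = {"cg01": [0.2, 0.3], "cg02": [0.45, 0.55]}

dbeta = delta_beta(tumor, normal)
selected = filter_by_threshold(dbeta, threshold=0.2)  # keeps only "cg01"
```

Using the absolute value keeps both hyper- and hypomethylated sites; a signed threshold would be the alternative if only one direction is of interest.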

Required Hidden Data

  • lung\result\GDC_lung_tissue\train80\all_beta_normalized_0.csv (generated in section 0)

Checklist

  • Detailed implementation description at the top of the notebook.
  • Data input and output paths are correct.
  • Free of spurious comments.
  • Free of unnecessary notebook updates (e.g., cell execution counts).

@Vghxv Vghxv self-assigned this Dec 2, 2024
@Vghxv Vghxv added enhancement New feature or request Discussion Team Discussion required feature New feature dependencies Pull requests that update a dependency file rich enormous labels Dec 2, 2024
@Vghxv
Contributor Author

Vghxv commented Dec 2, 2024

Support for the other cancer types will come before Friday night.

@pizza6inch
Contributor

Would it be easier if I wait until you finish the pipeline, or can I start first?

@Vghxv
Contributor Author

Vghxv commented Dec 4, 2024

Maybe wait.

@Vghxv
Contributor Author

Vghxv commented Dec 4, 2024

Development

Description

This PR introduces several updates and new features:

  1. Modularization:

    • Refactored code for Google Drive API and TOML configuration into modular components.
  2. Dependency Updates:

    • Added dependencies for Google authentication packages and gdown for simplified downloading.
  3. Improved File Management:

    • Updated .gitignore to exclude pickle files and credentials (credentials.json) for security and cleanliness.
  4. New Features:

    • Google Drive Support: Added functionality to upload split datasets directly to Google Drive (see #86, "Upload and download dataset to google drive").
    • Machine Learning:
      • Completed the Recursive Feature Elimination (RFE) process for machine learning, and lung cancer results are generated.
      • Performed dataset splitting for rectal and stomach cancer, with the processed datasets uploaded to Google Drive.
  5. Pending Work:

    • Assistance is required for processing rectal and stomach cancer datasets starting from Section 1 in the notebook.
  6. Credential Management:

    • The credentials.json file is required for Google Drive operations and will be provided privately. (It is not needed for downloading files, though, so ask me if you would like to do something that requires it.)

Implementation

  • Modularization:
    • Separated Google Drive API operations and TOML configurations into reusable modules.
  • File Uploads:
    • Enhanced functionality to upload processed datasets (e.g., split datasets) to Google Drive.
  • Machine Learning:
    • Finished RFE processing for lung cancer, producing detailed results.
    • Completed splitting for rectal and stomach cancer datasets, which are now stored on Google Drive.

Required Hidden Data

  • The credentials.json file for Google Drive operations must be securely provided to enable API access.

Checklist

  • Modularized Google Drive API and TOML configuration.
  • Updated .gitignore to exclude pickle files and credentials.
  • Dependencies (gdown, Google authentication packages) are added to the lock file.
  • Dataset splitting for rectal and stomach cancer is complete and uploaded.
  • Pending: Process rectal and stomach datasets starting from Section 1 in the notebook.
  • Finalized lung cancer RFE results.

@Vghxv Vghxv added the help wanted Extra attention is needed label Dec 4, 2024
@Vghxv
Contributor Author

Vghxv commented Dec 4, 2024

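| Cancer Type | Merge and Split | delta Beta Filtering | ML RFE | ML RFECV |
|-------------|-----------------|----------------------|--------|----------|
| Breast      |                 |                      |        |          |
| Lung        |                 |                      |        |          |
| Rectal      |                 |                      |        |          |
| Stomach     |                 |                      |        |          |
| Prostate    |                 |                      |        |          |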

@Vghxv
Contributor Author

Vghxv commented Feb 20, 2025

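  • Added do_validate to set_train_validate so that cases without a validate_df work correctly.
  • Added LogisticRegression and LogisticRegressionElastic to train_param.
  • Added LogisticRegression to the Voting Classifier, making the vote more democratic and free.
  • In Sec. 3 Feature Selection with ML (SFS), completed the plan A sfs case, even though it may be abandoned by this world.
  • In the config, renamed [feature_selection.hyper] to [feature_selection.rfe.hyper] to give sfs some breathing room.
  • In the config, renamed [feature_selection_2.hyper] to [feature_selection_2.rfe.hyper] to give sfs some breathing room.
  • In the config, changed the df used to compute dbeta from 60% to 80%.
  • In Sec. 5 Clustering Visualization, removed the cluster_number specified for hierarchical_clustering, so that it breaks free of the shackles of wisdom and computes the optimal number of clusters automatically, at the expense of the chart's appearance.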

Friendly reminder:
Watch out for null values in df, and make good use of inspect_nan in process_norm.py.

Labels
dependencies Pull requests that update a dependency file Discussion Team Discussion required enhancement New feature or request environment Environment related feature New feature help wanted Extra attention is needed major Major issues rich enormous
Projects
Status: In Progress
3 participants