Lung_Cancer_Prediction

Lung cancer remains the leading cause of cancer-related mortality globally. Early detection significantly improves survival rates, yet despite advances in medical technology it still relies primarily on traditional methods such as X-rays, CT scans, and invasive biopsies. These methods are effective but limited by cost, accessibility, and the need for specialized equipment and personnel.

This project is motivated by the need for diagnostic approaches that are less invasive and more accessible. Using the "Lung Cancer Prediction" dataset from Kaggle, this study aims to develop a predictive model that classifies a patient's lung cancer risk into one of three levels: Low, Medium, or High. The dataset covers individuals monitored over an average span of six years and grouped by their exposure to high or low levels of air pollution. Our objective is to identify key risk factors, such as air pollution, alcohol consumption, and smoking habits, and to explore their potential links to lung cancer. Through this analysis, we hope to contribute to the early detection and effective treatment of lung cancer, thereby increasing patient survival rates and reducing the burden on healthcare systems.

Our project comprises five main sections: Exploratory Data Analysis (EDA), Feature Selection, Proving Linear Separability, Simple Models, and Ensemble Models. EDA covers data quality checks such as class imbalance, the distribution and skewness of features, and correlations between features. Feature Selection explores three scikit-learn techniques: SelectKBest, Principal Component Analysis (PCA), and decision-tree feature importance. Each selected feature set was benchmarked against a linear SVM trained on the full dataset with all of its features. Proving Linear Separability investigates whether the classes are linearly separable, which would imply that linear models are suitable for prediction. Simple Models evaluates the suitability of Logistic Regression and k-nearest neighbors (KNN) for prediction. Ensemble Models evaluates Random Forest (RF) and Gradient Boosting Decision Tree (GBDT) as a route to more robust predictions.
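
As a rough illustration of the feature-selection benchmark described above, the sketch below compares a linear SVM on all features with the same model after SelectKBest, PCA, and decision-tree importance. It is not the project's actual notebook code: the data are synthetic (generated with `make_classification` as a stand-in for the Kaggle dataset), and the value of `K`, the `f_classif` scoring function, the 5-fold cross-validation, and the use of `SelectFromModel` to apply tree importances are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kaggle data (assumption): 3 classes play the
# role of the Low / Medium / High risk levels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

K = 8  # hypothetical number of features / components to keep


def linear_svm():
    """Return a fresh linear SVM so each pipeline gets its own estimator."""
    return LinearSVC(max_iter=10000, random_state=0)


candidates = {
    # Baseline: linear SVM on every feature.
    "all features (baseline)": make_pipeline(StandardScaler(), linear_svm()),
    # Univariate selection: keep the K features with the highest ANOVA F-score.
    "SelectKBest": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=K), linear_svm()
    ),
    # Projection onto the first K principal components.
    "PCA": make_pipeline(
        StandardScaler(), PCA(n_components=K, random_state=0), linear_svm()
    ),
    # Keep the K features ranked highest by a decision tree's importances.
    "tree importance": make_pipeline(
        StandardScaler(),
        SelectFromModel(
            DecisionTreeClassifier(random_state=0),
            max_features=K, threshold=-np.inf,
        ),
        linear_svm(),
    ),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(pipe, X, y, cv=5).mean()
    for name, pipe in candidates.items()
}
for name, score in scores.items():
    print(f"{name:<26} mean CV accuracy = {score:.3f}")
```

Putting the selector inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by features chosen with knowledge of the held-out fold.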

Overall, we demonstrated that machine learning techniques are suitable as an alternative, non-invasive diagnostic tool for predicting the risk of lung cancer. In our preliminary investigation, we illustrated how feature selection and engineering can strengthen our prediction models. We also showed that simple models are sufficient to predict the class labels in our dataset. However, because biological systems are complex and do not necessarily follow simple parametric relationships, ensemble methods could be leveraged to better handle the intricate relationships between features. Between RF and GBDT, the feature importances suggest that RF is the more suitable model, as its ranking better reflects the biological relationships between the features and the target label. Ensemble methods will become particularly important as we expand the dataset with more features and data points for more robust predictions in the future.
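
A comparison of RF and GBDT feature importances, like the one mentioned above, could be set up along the following lines. This is only a sketch under stated assumptions: the data are synthetic, the `feature_i` names are placeholders rather than the dataset's real columns, and the hyperparameters are arbitrary, so the printed numbers say nothing about the project's actual results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}

# Fit both ensembles and collect their impurity-based feature importances.
importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    importances[name] = model.feature_importances_
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")

# Side-by-side table, sorted by the Random Forest ranking.
print(f"{'feature':<12}{'RF':>8}{'GBDT':>8}")
for i in np.argsort(importances["RF"])[::-1]:
    print(
        f"{feature_names[i]:<12}"
        f"{importances['RF'][i]:>8.3f}{importances['GBDT'][i]:>8.3f}"
    )
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best read alongside domain knowledge or a check such as scikit-learn's `permutation_importance`.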